STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

SERIES EDITORS
Ralf Herbrich (Amazon Development Center, Berlin, Germany)
Thore Graepel (Microsoft Research Ltd., Cambridge, UK)

AIMS AND SCOPE
This series reflects the latest advances and applications in machine learning and pattern recognition through the publication of a broad range of reference works, textbooks, and handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence, robotics, computational/statistical learning theory, natural language processing, computer vision, game AI, game theory, neural networks, computational neuroscience, and other relevant topics, such as machine learning applied to bioinformatics or cognitive science, which might be proposed by potential contributors.
PUBLISHED TITLES

BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha

UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow

HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau

COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao

COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim

MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos

MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland

SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik

A FIRST COURSE IN MACHINE LEARNING
Simon Rogers and Mark Girolami

STATISTICAL REINFORCEMENT LEARNING: MODERN MACHINE LEARNING APPROACHES
Masashi Sugiyama

MULTI-LABEL DIMENSIONALITY REDUCTION
Liang Sun, Shuiwang Ji, and Jieping Ye

REGULARIZATION, OPTIMIZATION, KERNELS, AND SUPPORT VECTOR MACHINES
Johan A. K. Suykens, Marco Signoretto, and Andreas Argyriou

ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Masashi Sugiyama
University of Tokyo
Tokyo, Japan

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20150128
International Standard Book Number-13: 978-1-4398-5690-1 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
Contents

Foreword
Preface
Author

I  Introduction

1  Introduction to Reinforcement Learning
   1.1  Reinforcement Learning
   1.2  Mathematical Formulation
   1.3  Structure of the Book
        1.3.1  Model-Free Policy Iteration
        1.3.2  Model-Free Policy Search
        1.3.3  Model-Based Reinforcement Learning

II  Model-Free Policy Iteration

2  Policy Iteration with Value Function Approximation
   2.1  Value Functions
        2.1.1  State Value Functions
        2.1.2  State-Action Value Functions
   2.2  Least-Squares Policy Iteration
        2.2.1  Immediate-Reward Regression
        2.2.2  Algorithm
        2.2.3  Regularization
        2.2.4  Model Selection
   2.3  Remarks

3  Basis Design for Value Function Approximation
   3.1  Gaussian Kernels on Graphs
        3.1.1  MDP-Induced Graph
        3.1.2  Ordinary Gaussian Kernels
        3.1.3  Geodesic Gaussian Kernels
        3.1.4  Extension to Continuous State Spaces
   3.2  Illustration
        3.2.1  Setup
        3.2.2  Geodesic Gaussian Kernels
        3.2.3  Ordinary Gaussian Kernels
        3.2.4  Graph-Laplacian Eigenbases
        3.2.5  Diffusion Wavelets
   3.3  Numerical Examples
        3.3.1  Robot-Arm Control
        3.3.2  Robot-Agent Navigation
   3.4  Remarks

4  Sample Reuse in Policy Iteration
   4.1  Formulation
   4.2  Off-Policy Value Function Approximation
        4.2.1  Episodic Importance Weighting
        4.2.2  Per-Decision Importance Weighting
        4.2.3  Adaptive Per-Decision Importance Weighting
        4.2.4  Illustration
   4.3  Automatic Selection of Flattening Parameter
        4.3.1  Importance-Weighted Cross-Validation
        4.3.2  Illustration
   4.4  Sample-Reuse Policy Iteration
        4.4.1  Algorithm
        4.4.2  Illustration
   4.5  Numerical Examples
        4.5.1  Inverted Pendulum
        4.5.2  Mountain Car
   4.6  Remarks

5  Active Learning in Policy Iteration
   5.1  Efficient Exploration with Active Learning
        5.1.1  Problem Setup
        5.1.2  Decomposition of Generalization Error
        5.1.3  Estimation of Generalization Error
        5.1.4  Designing Sampling Policies
        5.1.5  Illustration
   5.2  Active Policy Iteration
        5.2.1  Sample-Reuse Policy Iteration with Active Learning
        5.2.2  Illustration
   5.3  Numerical Examples
   5.4  Remarks

6  Robust Policy Iteration
   6.1  Robustness and Reliability in Policy Iteration
        6.1.1  Robustness
        6.1.2  Reliability
   6.2  Least Absolute Policy Iteration
        6.2.1  Algorithm
        6.2.2  Illustration
        6.2.3  Properties
   6.3  Numerical Examples
   6.4  Possible Extensions
        6.4.1  Huber Loss
        6.4.2  Pinball Loss
        6.4.3  Deadzone-Linear Loss
        6.4.4  Chebyshev Approximation
        6.4.5  Conditional Value-at-Risk
   6.5  Remarks

III  Model-Free Policy Search

7  Direct Policy Search by Gradient Ascent
   7.1  Formulation
   7.2  Gradient Approach
        7.2.1  Gradient Ascent
        7.2.2  Baseline Subtraction for Variance Reduction
        7.2.3  Variance Analysis of Gradient Estimators
   7.3  Natural Gradient Approach
        7.3.1  Natural Gradient Ascent
        7.3.2  Illustration
   7.4  Application in Computer Graphics: Artist Agent
        7.4.1  Sumie Painting
        7.4.2  Design of States, Actions, and Immediate Rewards
        7.4.3  Experimental Results
   7.5  Remarks

8  Direct Policy Search by Expectation-Maximization
   8.1  Expectation-Maximization Approach
   8.2  Sample Reuse
        8.2.1  Episodic Importance Weighting
        8.2.2  Per-Decision Importance Weight
        8.2.3  Adaptive Per-Decision Importance Weighting
        8.2.4  Automatic Selection of Flattening Parameter
        8.2.5  Reward-Weighted Regression with Sample Reuse
   8.3  Numerical Examples
   8.4  Remarks

9  Policy-Prior Search
   9.1  Formulation
   9.2  Policy Gradients with Parameter-Based Exploration
        9.2.1  Policy-Prior Gradient Ascent
        9.2.2  Baseline Subtraction for Variance Reduction
        9.2.3  Variance Analysis of Gradient Estimators
        9.2.4  Numerical Examples
   9.3  Sample Reuse in Policy-Prior Search
        9.3.1  Importance Weighting
        9.3.2  Variance Reduction by Baseline Subtraction
        9.3.3  Numerical Examples
   9.4  Remarks

IV  Model-Based Reinforcement Learning

10  Transition Model Estimation
    10.1  Conditional Density Estimation
          10.1.1  Regression-Based Approach
          10.1.2  ε-Neighbor Kernel Density Estimation
          10.1.3  Least-Squares Conditional Density Estimation
    10.2  Model-Based Reinforcement Learning
    10.3  Numerical Examples
          10.3.1  Continuous Chain Walk
          10.3.2  Humanoid Robot Control
    10.4  Remarks

11  Dimensionality Reduction for Transition Model Estimation
    11.1  Sufficient Dimensionality Reduction
    11.2  Squared-Loss Conditional Entropy
          11.2.1  Conditional Independence
          11.2.2  Dimensionality Reduction with SCE
          11.2.3  Relation to Squared-Loss Mutual Information
    11.3  Numerical Examples
          11.3.1  Artificial and Benchmark Datasets
          11.3.2  Humanoid Robot
    11.4  Remarks

References
Index
Foreword

How can agents learn from experience without an omniscient teacher explicitly telling them what to do? Reinforcement learning is the area within machine learning that investigates how an agent can learn an optimal behavior by correlating generic reward signals with its past actions. The discipline draws upon and connects key ideas from behavioral psychology, economics, control theory, operations research, and other disparate fields to model the learning process. In reinforcement learning, the environment is typically modeled as a Markov decision process that provides immediate reward and state information to the agent. However, the agent does not have access to the transition structure of the environment and needs to learn how to choose appropriate actions to maximize its overall reward over time.

This book by Prof. Masashi Sugiyama covers the range of reinforcement learning algorithms from a fresh, modern perspective. With a focus on the statistical properties of estimating parameters for reinforcement learning, the book relates a number of different approaches across the gamut of learning scenarios. The algorithms are divided into model-free approaches that do not explicitly model the dynamics of the environment, and model-based approaches that construct descriptive process models for the environment. Within each of these categories, there are policy iteration algorithms which estimate value functions, and policy search algorithms which directly manipulate policy parameters.

For each of these different reinforcement learning scenarios, the book meticulously lays out the associated optimization problems. A careful analysis is given for each of these cases, with an emphasis on understanding the statistical properties of the resulting estimators and learned parameters. Each chapter contains illustrative examples of applications of these algorithms, with quantitative comparisons between the different techniques. These examples are drawn from a variety of practical problems, including robot motion control and Asian brush painting.

In summary, the book provides a thought-provoking statistical treatment of reinforcement learning algorithms, reflecting the author's work and sustained research in this area. It is a contemporary and welcome addition to the rapidly growing machine learning literature. Both beginner students and experienced researchers will find it to be an important source for understanding the latest reinforcement learning techniques.

Daniel D. Lee
GRASP Laboratory
School of Engineering and Applied Science
University of Pennsylvania, Philadelphia, PA, USA
Preface

In the coming big data era, statistics and machine learning are becoming indispensable tools for data mining. Depending on the type of data analysis, machine learning methods are categorized into three groups:

• Supervised learning: Given input-output paired data, the objective of supervised learning is to analyze the input-output relation behind the data. Typical tasks of supervised learning include regression (predicting the real value), classification (predicting the category), and ranking (predicting the order). Supervised learning is the most common data analysis and has been extensively studied in the statistics community for a long time. A recent trend of supervised learning research in the machine learning community is to utilize side information in addition to the input-output paired data to further improve the prediction accuracy. For example, semi-supervised learning utilizes additional input-only data, transfer learning borrows data from other similar learning tasks, and multi-task learning solves multiple related learning tasks simultaneously.

• Unsupervised learning: Given input-only data, the objective of unsupervised learning is to find something useful in the data. Due to this ambiguous definition, unsupervised learning research tends to be more ad hoc than supervised learning. Nevertheless, unsupervised learning is regarded as one of the most important tools in data mining because of its automatic and inexpensive nature. Typical tasks of unsupervised learning include clustering (grouping the data based on their similarity), density estimation (estimating the probability distribution behind the data), anomaly detection (removing outliers from the data), data visualization (reducing the dimensionality of the data to 1-3 dimensions), and blind source separation (extracting the original source signals from their mixtures). Also, unsupervised learning methods are sometimes used as data pre-processing tools in supervised learning.

• Reinforcement learning: Supervised learning is a sound approach, but collecting input-output paired data is often too expensive. Unsupervised learning is inexpensive to perform, but it tends to be ad hoc. Reinforcement learning is placed between supervised learning and unsupervised learning: no explicit supervision (output data) is provided, but we still want to learn the input-output relation behind the data. Instead of output data, reinforcement learning utilizes rewards, which evaluate the validity of predicted outputs. Giving implicit supervision such as rewards is usually much easier and less costly than giving explicit supervision, and therefore reinforcement learning can be a vital approach in modern data analysis. Various supervised and unsupervised learning techniques are also utilized in the framework of reinforcement learning.

This book is devoted to introducing fundamental concepts and practical algorithms of statistical reinforcement learning from the modern machine learning viewpoint. Various illustrative examples, mainly in robotics, are also provided to help understand the intuition and usefulness of reinforcement learning techniques. Target readers are graduate-level students in computer science and applied statistics as well as researchers and engineers in related fields. Basic knowledge of probability and statistics, linear algebra, and elementary calculus is assumed.

Machine learning is a rapidly developing area of science, and the author hopes that this book helps the reader grasp various exciting topics in reinforcement learning and stimulates readers' interest in machine learning. Please visit our website at: http://www.ms.k.u-tokyo.ac.jp.

Masashi Sugiyama
University of Tokyo, Japan
Author

Masashi Sugiyama was born in Osaka, Japan, in 1974. He received Bachelor, Master, and Doctor of Engineering degrees in Computer Science from the Tokyo Institute of Technology, Japan, in 1997, 1999, and 2001, respectively. In 2001, he was appointed Assistant Professor in the same institute, and he was promoted to Associate Professor in 2003. He moved to the University of Tokyo as Professor in 2014.

He received an Alexander von Humboldt Foundation Research Fellowship and researched at Fraunhofer Institute, Berlin, Germany, from 2003 to 2004. In 2006, he received a European Commission Program Erasmus Mundus Scholarship and researched at the University of Edinburgh, Scotland. He received the Faculty Award from IBM in 2007 for his contribution to machine learning under non-stationarity, the Nagao Special Researcher Award from the Information Processing Society of Japan in 2011, and the Young Scientists' Prize from the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology for his contribution to the density-ratio paradigm of machine learning.

His research interests include theories and algorithms of machine learning and data mining, and a wide range of applications such as signal processing, image processing, and robot control. He published Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012) and Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press, 2012).

The author thanks his collaborators, Hirotaka Hachiya, Sethu Vijayakumar, Jan Peters, Jun Morimoto, Zhao Tingting, Ning Xie, Voot Tangkaratt, Tetsuro Morimura, and Norikazu Sugimoto, for exciting and creative discussions. He acknowledges support from MEXT KAKENHI 17700142, 18300057, 20680007, 23120004, 23300069, 25700022, and 26280054, the Okawa Foundation, EU Erasmus Mundus Fellowship, AOARD, SCAT, the JST PRESTO program, and the FIRST program.
Part I

Introduction
Chapter 1
Introduction to Reinforcement Learning

Reinforcement learning is aimed at controlling a computer agent so that a target task is achieved in an unknown environment.

In this chapter, we first give an informal overview of reinforcement learning in Section 1.1. Then we provide a more formal formulation of reinforcement learning in Section 1.2. Finally, the book is summarized in Section 1.3.

1.1 Reinforcement Learning
A schematic of reinforcement learning is given in Figure 1.1. In an unknown environment (e.g., in a maze), a computer agent (e.g., a robot) takes an action (e.g., to walk) based on its own control policy. Then its state is updated (e.g., by moving forward) and evaluation of that action is given as a "reward" (e.g., praise, neutral, or scolding). Through such interaction with the environment, the agent is trained to achieve a certain task (e.g., getting out of the maze) without explicit guidance. A crucial advantage of reinforcement learning is its non-greedy nature. That is, the agent is trained not to improve performance in the short term (e.g., greedily approaching an exit of the maze), but to optimize the long-term achievement (e.g., successfully getting out of the maze).

FIGURE 1.1: Reinforcement learning.

A reinforcement learning problem contains various technical components such as states, actions, transitions, rewards, policies, and values. Before going into mathematical details (which will be provided in Section 1.2), we intuitively explain these concepts through illustrative reinforcement learning problems here.

Let us consider a maze problem (Figure 1.2), where a robot agent is located in a maze and we want to guide him to the goal without explicit supervision about which direction to go. States are positions in the maze which the robot agent can visit. In the example illustrated in Figure 1.3, there are 21 states in the maze. Actions are possible directions along which the robot agent can move. In the example illustrated in Figure 1.4, there are 4 actions which correspond to movement toward the north, south, east, and west directions.
States and actions are fundamental elements that define a reinforcement learning problem. Transitions specify how states are connected to each other through actions (Figure 1.5). Thus, knowing the transitions intuitively means knowing the map of the maze. Rewards specify the incomes/costs that the robot agent receives when making a transition from one state to another by a certain action. In the case of the maze example, the robot agent receives a positive reward when it reaches the goal. More specifically, a positive reward is provided when making a transition from state 12 to state 17 by action "east" or from state 18 to state 17 by action "north" (Figure 1.6). Thus, knowing the rewards intuitively means knowing the location of the goal state. To emphasize the fact that a reward is given to the robot agent right after taking an action and making a transition to the next state, it is also referred to as an immediate reward.

Under the above setup, the goal of reinforcement learning is to find the policy for controlling the robot agent that allows it to receive the maximum amount of rewards in the long run. Here, a policy specifies an action the robot agent takes at each state (Figure 1.7). Through a policy, a series of states and actions that the robot agent takes from a start state to an end state is specified. Such a series is called a trajectory (see Figure 1.7 again). The sum of immediate rewards along a trajectory is called the return. In practice, rewards that can be obtained in the distant future are often discounted because receiving rewards earlier is regarded as more preferable. In the maze task, such a discounting strategy urges the robot agent to reach the goal as quickly as possible.
To find the optimal policy efficiently, it is useful to view the return as a function of the initial state. This is called the (state-)value. The values can be efficiently obtained via dynamic programming, which is a general method for solving a complex optimization problem by breaking it down into simpler subproblems recursively. With the hope that many subproblems are actually the same, dynamic programming solves such overlapped subproblems only once and reuses the solutions to reduce the computation costs.

FIGURE 1.2: A maze problem. We want to guide the robot agent to the goal.

FIGURE 1.3: States are visitable positions in the maze.

FIGURE 1.4: Actions are possible movements of the robot agent.

FIGURE 1.5: Transitions specify connections between states via actions. Thus, knowing the transitions means knowing the map of the maze.

FIGURE 1.6: A positive reward is given when the robot agent reaches the goal. Thus, the reward specifies the goal location.

FIGURE 1.7: A policy specifies an action the robot agent takes at each state. Thus, a policy also specifies a trajectory, which is a series of states and actions that the robot agent takes from a start state to an end state.

FIGURE 1.8: Values of each state when reward +1 is given at the goal state and the reward is discounted at the rate of 0.9 according to the number of steps.

In the maze problem, the value of a state can be computed from the values of neighboring states. For example, let us compute the value of state 7 (see
Figure 1.5 again). From state 7, the robot agent can reach state 2, state 6, and state 8 by a single step. If the robot agent knows the values of these neighboring states, the best action the robot agent should take is to visit the neighboring state with the largest value, because this allows the robot agent to earn the largest amount of rewards in the long run. However, the values of neighboring states are unknown in practice and thus they should also be computed.

Now, we need to solve 3 subproblems of computing the values of state 2, state 6, and state 8. Then, in the same way, these subproblems are further decomposed as follows:

• The problem of computing the value of state 2 is decomposed into 3 subproblems of computing the values of state 1, state 3, and state 7.

• The problem of computing the value of state 6 is decomposed into 2 subproblems of computing the values of state 1 and state 7.

• The problem of computing the value of state 8 is decomposed into 3 subproblems of computing the values of state 3, state 7, and state 9.

Thus, by removing overlaps, the original problem of computing the value of state 7 has been decomposed into 6 unique subproblems: computing the values of state 1, state 2, state 3, state 6, state 8, and state 9.

If we further continue this problem decomposition, we encounter the problem of computing the value of state 17, where the robot agent can receive reward +1. Then the values of state 12 and state 18 can be explicitly computed. Indeed, if a discounting factor (a multiplicative penalty for delayed rewards) is 0.9, the values of state 12 and state 18 are (0.9)^1 = 0.9. Then we can further know that the values of state 13 and state 19 are (0.9)^2 = 0.81. By repeating this procedure, we can compute the values of all states (as illustrated in Figure 1.8). Based on these values, we can know the optimal action the robot agent should take, i.e., an action that leads the robot agent to the neighboring state with the largest value.
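The backward computation sketched above is easy to carry out programmatically. The following minimal Python sketch (not from the book; the small graph below is a hypothetical stand-in, not the 21-state maze of the figures) computes V(s) = 0.9^(number of steps to the goal) by breadth-first search, reproducing the kind of values shown in Figure 1.8.

```python
from collections import deque

# A small hypothetical maze graph: state -> neighboring states (deterministic moves).
# This is NOT the 21-state maze from the figures; it only illustrates the computation.
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D", "G"],
    "D": ["C", "E"], "E": ["D"], "G": ["C"],   # "G" is the goal state
}
goal, gamma = "G", 0.9

# Breadth-first search gives the number of steps from each state to the goal;
# the reward +1 at the goal is discounted once per step, as in Figure 1.8.
steps = {goal: 0}
queue = deque([goal])
while queue:
    s = queue.popleft()
    for t in adjacency[s]:
        if t not in steps:
            steps[t] = steps[s] + 1
            queue.append(t)

values = {s: gamma ** d for s, d in steps.items()}
print(values)  # a state adjacent to the goal gets 0.9, two steps away 0.81, ...
```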
Note that, in real-world reinforcement learning tasks, transitions are often not deterministic but stochastic because of some external disturbance; in the case of the above maze example, the floor may be slippery and thus the robot agent cannot move as perfectly as it desires. Also, stochastic policies, in which the mapping from a state to an action is not deterministic, are often employed in many reinforcement learning formulations. In these cases, the formulation becomes slightly more complicated, but essentially the same idea can still be used for solving the problem.
To further highlight the notable advantage of reinforcement learning that not the immediate rewards but the long-term accumulation of rewards is maximized, let us consider a mountain-car problem (Figure 1.9). There are two mountains and a car is located in a valley between the mountains. The goal is to guide the car to the top of the right-hand hill. However, the engine of the car is not powerful enough to directly run up the right-hand hill and reach the goal. The optimal policy in this problem is to first climb the left-hand hill and then go down the slope to the right with full acceleration to get to the goal (Figure 1.10).

Suppose we define the immediate reward such that moving the car to the right gives a positive reward +1 and moving the car to the left gives a negative reward −1. Then, a greedy solution that maximizes the immediate reward moves the car to the right, which does not allow the car to get to the goal due to lack of engine power. On the other hand, reinforcement learning seeks a solution that maximizes the return, i.e., the discounted sum of immediate rewards that the agent can collect over the entire trajectory. This means that the reinforcement learning solution will first move the car to the left even though negative rewards are given for a while, to receive more positive rewards in the future. Thus, the notion of "prior investment" can be naturally incorporated in the reinforcement learning framework.

FIGURE 1.9: A mountain-car problem. We want to guide the car to the goal. However, the engine of the car is not powerful enough to directly run up the right-hand hill.

FIGURE 1.10: The optimal policy to reach the goal is to first climb the left-hand hill and then head for the right-hand hill with full acceleration.
1.2 Mathematical Formulation

In this section, the reinforcement learning problem is mathematically formulated as the problem of controlling a computer agent under a Markov decision process.

We consider the problem of controlling a computer agent under a discrete-time Markov decision process (MDP). That is, at each discrete time step t, the agent observes a state s_t ∈ S, selects an action a_t ∈ A, makes a transition to s_{t+1} ∈ S, and receives an immediate reward r_t = r(s_t, a_t, s_{t+1}) ∈ R.
S and A are called the state space and the action space, respectively. r(s,a,s′) is called the immediate reward function.

The initial position of the agent, s_1, is drawn from the initial probability distribution. If the state space S is discrete, the initial probability distribution is specified by the probability mass function P(s) such that

0 ≤ P(s) ≤ 1, ∀s ∈ S,   Σ_{s∈S} P(s) = 1.

If the state space S is continuous, the initial probability distribution is specified by the probability density function p(s) such that

p(s) ≥ 0, ∀s ∈ S,   ∫_{s∈S} p(s) ds = 1.

Because the probability mass function P(s) can be expressed as a probability density function p(s) by using the Dirac delta function¹ δ(s) as

p(s) = Σ_{s′∈S} δ(s′ − s) P(s′),

we focus only on the continuous state space below.

The dynamics of the environment, which represent the transition probability from state s to state s′ when action a is taken, are characterized by the transition probability distribution with conditional probability density p(s′|s,a):

p(s′|s,a) ≥ 0, ∀s, s′ ∈ S, ∀a ∈ A,   ∫_{s′∈S} p(s′|s,a) ds′ = 1, ∀s ∈ S, ∀a ∈ A.
The agent's decision is determined by a policy π. When we consider a deterministic policy, where the action to take at each state is uniquely determined, we regard the policy as a function of states:

π(s) ∈ A, ∀s ∈ S.

Action a can be either discrete or continuous. On the other hand, when developing more sophisticated reinforcement learning algorithms, it is often more convenient to consider a stochastic policy, where an action to take at a state is probabilistically determined. Mathematically, a stochastic policy is a conditional probability density of taking action a at state s:

π(a|s) ≥ 0, ∀s ∈ S, ∀a ∈ A,   ∫_{a∈A} π(a|s) da = 1, ∀s ∈ S.

By introducing stochasticity in action selection, we can more actively explore the entire state space. Note that when action a is discrete, the stochastic policy is expressed using Dirac's delta function, as in the case of the state densities.

A sequence of states and actions obtained by the procedure described in Figure 1.11 is called a trajectory.

¹ The Dirac delta function δ(·) allows us to obtain the value of a function f at a point τ via the convolution with f:
∫_{−∞}^{∞} f(s) δ(s − τ) ds = f(τ).
Dirac's delta function δ(·) can be expressed as the Gaussian density with standard deviation σ → 0:
δ(a) = lim_{σ→0} (1/√(2πσ²)) exp(−a²/(2σ²)).
1. The initial state s_1 is chosen following the initial probability p(s).
2. For t = 1, …, T:
   (a) The action a_t is chosen following the policy π(a_t|s_t).
   (b) The next state s_{t+1} is determined according to the transition probability p(s_{t+1}|s_t, a_t).

FIGURE 1.11: Generation of a trajectory sample.

When the number of steps, T, is finite or infinite, the situation is called the finite horizon or infinite horizon, respectively. Below, we focus on the finite-horizon case because the trajectory length is always finite in practice. We denote a trajectory by h (which stands for a "history"):

h = [s_1, a_1, …, s_T, a_T, s_{T+1}].

The discounted sum of immediate rewards along the trajectory h is called the return:

R(h) = Σ_{t=1}^{T} γ^{t−1} r(s_t, a_t, s_{t+1}),

where γ ∈ [0, 1) is called the discount factor for future rewards. The goal of reinforcement learning is to learn the optimal policy π* that maximizes the expected return:

π* = argmax_π E_{p_π(h)}[R(h)],

where E_{p_π(h)} denotes the expectation over trajectory h drawn from p_π(h), and p_π(h) denotes the probability density of observing trajectory h under policy π:

p_π(h) = p(s_1) Π_{t=1}^{T} p(s_{t+1}|s_t, a_t) π(a_t|s_t).

"argmax" gives the maximizer of a function (Figure 1.12).
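The generative procedure of Figure 1.11 and the return R(h) are easy to express in code. The following is a hedged Python sketch for a tiny two-state MDP with made-up transition probabilities and a reward of +1 for arriving at state s1; it is only meant to make the notation concrete, not to reproduce any example from the book.

```python
import random

# A tiny hypothetical MDP used only to illustrate Figure 1.11 and the return R(h).
states, actions = ["s0", "s1"], ["left", "right"]
p_init = {"s0": 0.8, "s1": 0.2}                      # initial probability p(s)
p_trans = {                                          # p(s'|s,a), made-up numbers
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.1, "s1": 0.9},
}

def reward(s, a, s_next):                            # r(s, a, s'): +1 for reaching s1
    return 1.0 if s_next == "s1" else 0.0

def policy(s):                                       # a (uniform) stochastic policy pi(a|s)
    return random.choice(actions)

def sample(dist):                                    # draw one item from a discrete distribution
    r, acc = random.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r <= acc:
            return x
    return x

def rollout(T=10):
    """Generate a trajectory h = [s1, a1, ..., sT, aT, s_{T+1}] as in Figure 1.11."""
    h, s = [], sample(p_init)
    for _ in range(T):
        a = policy(s)
        s_next = sample(p_trans[(s, a)])
        h.append((s, a, s_next))
        s = s_next
    return h

def the_return(h, gamma=0.9):
    """Discounted sum of immediate rewards: R(h) = sum_t gamma^(t-1) r(s_t, a_t, s_{t+1})."""
    return sum(gamma ** t * reward(s, a, s_next) for t, (s, a, s_next) in enumerate(h))

# The expected return can be estimated by averaging R(h) over many sampled trajectories.
print(sum(the_return(rollout()) for _ in range(1000)) / 1000)
```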
For policy learning, various methods have been developed so far. These methods can be classified into model-based reinforcement learning and model-free reinforcement learning. The term "model" indicates a model of the transition probability p(s′|s,a). In the model-based reinforcement learning approach, the transition probability is learned in advance and the learned transition model is explicitly used for policy learning. On the other hand, in the model-free reinforcement learning approach, policies are learned without explicitly estimating the transition probability. If strong prior knowledge of the transition model is available, the model-based approach would be more favorable. On the other hand, learning the transition model without prior knowledge itself is a hard statistical estimation problem. Thus, if good prior knowledge of the transition model is not available, the model-free approach would be more promising.

FIGURE 1.12: "argmax" gives the maximizer of a function, while "max" gives the maximum value of a function.
1.3 Structure of the Book

In this section, we explain the structure of this book, which covers major reinforcement learning approaches.

1.3.1 Model-Free Policy Iteration
Policy iteration is a popular and well-studied approach to reinforcement learning. The key idea of policy iteration is to determine policies based on the value function.

Let us first introduce the state-action value function Q^π(s,a) ∈ R for policy π, which is defined as the expected return the agent will receive when taking action a at state s and following policy π thereafter:

Q^π(s,a) = E_{p_π(h)}[R(h) | s_1 = s, a_1 = a],

where "|s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of R(h) given s_1 = s and a_1 = a.

Let Q*(s,a) be the optimal state-action value at state s for action a, defined as

Q*(s,a) = max_π Q^π(s,a).

Based on the optimal state-action value function, the optimal action the agent should take at state s is deterministically given as the maximizer of Q*(s,a) with respect to a. Thus, the optimal policy π*(a|s) is given by

π*(a|s) = δ(a − argmax_{a′} Q*(s,a′)),

where δ(·) denotes Dirac's delta function.

Because the optimal state-action value Q* is unknown in practice, the policy iteration algorithm alternately evaluates the value Q^π for the current policy π and updates the policy π based on the current value Q^π (Figure 1.13).

1. Initialize the policy π(a|s).
2. Repeat the following two steps until the policy π(a|s) converges.
   (a) Policy evaluation: Compute the state-action value function Q^π(s,a) for the current policy π(a|s).
   (b) Policy improvement: Update the policy as
       π(a|s) ← δ(a − argmax_{a′} Q^π(s,a′)).

FIGURE 1.13: Algorithm of policy iteration.
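The alternating loop of Figure 1.13 can be sketched in a few lines. The example below is a hedged illustration for a tiny finite MDP whose transition probabilities and expected rewards are assumed known (randomly generated stand-ins, not from the book); in the rest of the book the policy evaluation step is instead carried out from data.

```python
import numpy as np

# A hedged sketch of the policy iteration loop of Figure 1.13 for a tiny finite MDP.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))     # P[s, a] = p(.|s, a), made-up model
R = rng.random((n_s, n_a))                           # expected immediate reward r(s, a)

policy = np.zeros(n_s, dtype=int)                    # deterministic policy: state -> action
for _ in range(100):
    # Policy evaluation: iterate the Bellman equation for Q^pi until convergence.
    Q = np.zeros((n_s, n_a))
    for _ in range(1000):
        V = Q[np.arange(n_s), policy]                # value of the policy's action at each s'
        Q_new = R + gamma * P @ V
        if np.max(np.abs(Q_new - Q)) < 1e-10:
            break
        Q = Q_new
    # Policy improvement: act greedily with respect to the current Q^pi.
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy:", policy)
```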
The performance of the above policy iteration algorithm depends on the quality of policy evaluation; i.e., how to learn the state-action value function from data is the key issue. Value function approximation corresponds to a regression problem in statistics and machine learning. Thus, various statistical machine learning techniques can be utilized for better value function approximation. Part II of this book addresses this issue, including least-squares estimation and model selection (Chapter 2), basis function design (Chapter 3), efficient sample reuse (Chapter 4), active learning (Chapter 5), and robust learning (Chapter 6).
1.3.2 Model-Free Policy Search

One of the potential weaknesses of policy iteration is that policies are learned via value functions. Thus, improving the quality of value function approximation does not necessarily contribute to improving the quality of the resulting policies. Furthermore, a small change in value functions can cause a big difference in policies, which is problematic in, e.g., robot control because such instability can damage the robot's physical system. Another weakness of policy iteration is that policy improvement, i.e., finding the maximizer of Q^π(s,a) with respect to a, is computationally expensive or difficult when the action space A is continuous.
Policy search, which directly learns policy functions without estimating value functions, can overcome the above limitations. The basic idea of policy search is to find the policy that maximizes the expected return:

π* = argmax_π E_{p_π(h)}[R(h)].

In policy search, how to find a good policy function in a vast function space is the key issue to be addressed. Part III of this book focuses on policy search and introduces gradient-based methods and the expectation-maximization method in Chapter 7 and Chapter 8, respectively. However, a potential weakness of these direct policy search methods is their instability due to the stochasticity of policies. To overcome the instability problem, an alternative approach called policy-prior search, which learns the policy-prior distribution for deterministic policies, is introduced in Chapter 9. Efficient sample reuse in policy-prior search is also discussed there.
1.3.3 Model-Based Reinforcement Learning

In the above model-free approaches, policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment, p(s′|s,a)). On the other hand, the model-based approach explicitly learns the environment in advance and uses the learned environment model for policy learning.

No additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is particularly useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging. Part IV of this book focuses on model-based reinforcement learning. In Chapter 10, a non-parametric transition model estimator that possesses the optimal convergence rate with high computational efficiency is introduced. However, even with the optimal convergence rate, estimating the transition model in high-dimensional state and action spaces is still challenging. In Chapter 11, a dimensionality reduction method that can be efficiently embedded into the transition model estimation procedure is introduced and its usefulness is demonstrated through experiments.
Part II

Model-Free Policy Iteration

In Part II, we introduce a reinforcement learning approach based on value functions called policy iteration.

The key issue in the policy iteration framework is how to accurately approximate the value function from a small number of data samples. In Chapter 2, a fundamental framework of value function approximation based on least squares is explained. In this least-squares formulation, how to design good basis functions is critical for better value function approximation. A practical basis design method based on manifold-based smoothing (Chapelle et al., 2006) is explained in Chapter 3.

In real-world reinforcement learning tasks, gathering data is often costly. In Chapter 4, we describe a method for efficiently reusing previously collected samples in the framework of covariate shift adaptation (Sugiyama & Kawanabe, 2012). In Chapter 5, we apply a statistical active learning technique (Sugiyama & Kawanabe, 2012) to optimizing data collection strategies for reducing the sampling cost.

Finally, in Chapter 6, an outlier-robust extension of the least-squares method based on robust regression (Huber, 1981) is introduced. Such a robust method is highly useful in handling noisy real-world data.
Chapter 2
Policy Iteration with Value Function Approximation

In this chapter, we introduce the framework of least-squares policy iteration. In Section 2.1, we first explain the framework of policy iteration, which iteratively executes the policy evaluation and policy improvement steps for finding better policies. Then, in Section 2.2, we show how value function approximation in the policy evaluation step can be formulated as a regression problem and introduce a least-squares algorithm called least-squares policy iteration (Lagoudakis & Parr, 2003). Finally, this chapter is concluded in Section 2.3.
2.1 Value Functions

A traditional way to learn the optimal policy is based on the value function. In this section, we introduce two types of value functions, the state value function and the state-action value function, and explain how they can be used for finding better policies.

2.1.1 State Value Functions
The state value function V^π(s) ∈ R for policy π measures the "value" of state s, which is defined as the expected return the agent will receive when following policy π from state s:

V^π(s) = E_{p_π(h)}[R(h) | s_1 = s],

where "|s_1 = s" means that the initial state s_1 is fixed at s_1 = s. That is, the right-hand side of the above equation denotes the conditional expectation of the return R(h) given s_1 = s.

By recursion, V^π(s) can be expressed as

V^π(s) = E_{p(s′|s,a)π(a|s)}[r(s,a,s′) + γV^π(s′)],

where E_{p(s′|s,a)π(a|s)} denotes the conditional expectation over a and s′ drawn from p(s′|s,a)π(a|s) given s. This recursive expression is called the Bellman equation for state values. V^π(s) may be obtained by repeating the following update from some initial estimate:

V^π(s) ← E_{p(s′|s,a)π(a|s)}[r(s,a,s′) + γV^π(s′)].

The optimal state value at state s, V*(s), is defined as the maximizer of the state value V^π(s) with respect to policy π:

V*(s) = max_π V^π(s).

Based on the optimal state value V*(s), the optimal policy π*, which is deterministic, can be obtained as

π*(a|s) = δ(a − a*(s)),

where δ(·) denotes Dirac's delta function and

a*(s) = argmax_{a∈A} E_{p(s′|s,a)}[r(s,a,s′) + γV*(s′)].

E_{p(s′|s,a)} denotes the conditional expectation over s′ drawn from p(s′|s,a) given s and a. This algorithm, which first computes the optimal value function and then obtains the optimal policy based on the optimal value function, is called value iteration.

A possible variation is to iteratively perform policy evaluation and improvement as

Policy evaluation:  V^π(s) ← E_{p(s′|s,a)π(a|s)}[r(s,a,s′) + γV^π(s′)].
Policy improvement:  π(a|s) ← δ(a − a^π(s)),

where

a^π(s) = argmax_{a∈A} E_{p(s′|s,a)}[r(s,a,s′) + γV^π(s′)].

These two steps may be iterated either for all states at once or in a state-by-state manner. This iterative algorithm is called policy iteration (based on state value functions).
2.1.2 State-Action Value Functions

In the above policy improvement step, the action to take is optimized based on the state value function V^π(s). A more direct way to handle this action optimization is to consider the state-action value function Q^π(s,a) for policy π:

Q^π(s,a) = E_{p_π(h)}[R(h) | s_1 = s, a_1 = a],

where "|s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of the return R(h) given s_1 = s and a_1 = a.

Let r(s,a) be the expected immediate reward when action a is taken at state s:

r(s,a) = E_{p(s′|s,a)}[r(s,a,s′)].

Then, in the same way as V^π(s), Q^π(s,a) can be expressed by recursion as

Q^π(s,a) = r(s,a) + γ E_{π(a′|s′)p(s′|s,a)}[Q^π(s′,a′)],   (2.1)

where E_{π(a′|s′)p(s′|s,a)} denotes the conditional expectation over s′ and a′ drawn from π(a′|s′)p(s′|s,a) given s and a. This recursive expression is called the Bellman equation for state-action values.

Based on the Bellman equation, the optimal policy may be obtained by iterating the following two steps:

Policy evaluation:  Q^π(s,a) ← r(s,a) + γ E_{π(a′|s′)p(s′|s,a)}[Q^π(s′,a′)].
Policy improvement:  π(a|s) ← δ(a − argmax_{a′∈A} Q^π(s,a′)).

In practice, it is sometimes preferable to use an explorative policy. For example, Gibbs policy improvement is given by

π(a|s) ← exp(Q^π(s,a)/τ) / ∫_A exp(Q^π(s,a′)/τ) da′,

where τ > 0 determines the degree of exploration. When the action space A is discrete, ε-greedy policy improvement is also used:

π(a|s) ← 1 − ε + ε/|A|  if a = argmax_{a′∈A} Q^π(s,a′),
π(a|s) ← ε/|A|  otherwise,

where ε ∈ (0, 1] determines the randomness of the new policy.

The above policy improvement step based on Q^π(s,a) is essentially the same as the one based on V^π(s) explained in Section 2.1.1. However, the policy improvement step based on Q^π(s,a) does not contain the expectation operator and thus policy improvement can be carried out more directly. For this reason, we focus on the above formulation, called policy iteration based on state-action value functions.
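As a concrete illustration of these two explorative rules, the following hedged sketch (not from the book) turns a vector of Q-values for one state into Gibbs and ε-greedy action probabilities over a discrete action set.

```python
import numpy as np

# Gibbs and epsilon-greedy policy improvement for a discrete action set,
# given one row of Q-values Q(s, .) as a NumPy array.
def gibbs_policy(q_values, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau); tau controls the degree of exploration."""
    z = np.exp((q_values - q_values.max()) / tau)    # subtract the max for numerical stability
    return z / z.sum()

def epsilon_greedy_policy(q_values, epsilon=0.1):
    """Probability 1 - eps + eps/|A| on the greedy action and eps/|A| on the others."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

q = np.array([0.2, 1.0, 0.5])
print(gibbs_policy(q, tau=0.5), epsilon_greedy_policy(q, epsilon=0.2))
```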
2.2 Least-Squares Policy Iteration

As explained in the previous section, the optimal policy function may be learned via the state-action value function Q^π(s,a). However, learning the state-action value function from data is a challenging task for continuous states s and actions a.

Learning the state-action value function from data can actually be regarded as a regression problem in statistics and machine learning. In this section, we explain how the least-squares regression technique can be employed in value function approximation, which is called least-squares policy iteration (Lagoudakis & Parr, 2003).
2.2.1 Immediate-Reward Regression

Let us approximate the state-action value function Q^π(s,a) by the following linear-in-parameter model:

Σ_{b=1}^{B} θ_b φ_b(s,a),

where {φ_b(s,a)}_{b=1}^{B} are basis functions, B denotes the number of basis functions, and {θ_b}_{b=1}^{B} are parameters. Specific designs of basis functions will be discussed in Chapter 3. Below, we use the following vector representation for compactly expressing the parameters and basis functions:

θ⊤φ(s,a),

where ⊤ denotes the transpose and

θ = (θ_1, …, θ_B)⊤ ∈ R^B,
φ(s,a) = (φ_1(s,a), …, φ_B(s,a))⊤ ∈ R^B.

From the Bellman equation for state-action values (2.1), we can express the expected immediate reward r(s,a) as

r(s,a) = Q^π(s,a) − γ E_{π(a′|s′)p(s′|s,a)}[Q^π(s′,a′)].

By substituting the value function model θ⊤φ(s,a) into the above equation, the expected immediate reward r(s,a) may be approximated as

r(s,a) ≈ θ⊤φ(s,a) − γ E_{π(a′|s′)p(s′|s,a)}[θ⊤φ(s′,a′)].

Now let us define a new basis function vector ψ(s,a):

ψ(s,a) = φ(s,a) − γ E_{π(a′|s′)p(s′|s,a)}[φ(s′,a′)].
FIGURE 2.1: Linear approximation of the state-action value function Q^π(s,a) as linear regression of the expected immediate reward r(s,a).
Then the expected immediate reward r(s,a) may be approximated as

r(s,a) ≈ θ⊤ψ(s,a).

As explained above, the linear approximation problem of the state-action value function Q^π(s,a) can be reformulated as the linear regression problem of the expected immediate reward r(s,a) (see Figure 2.1). The key trick was to push the recursive nature of the state-action value function Q^π(s,a) into the composite basis function ψ(s,a).
2.2.2 Algorithm

Now we explain how the parameters θ are learned in the least-squares framework. That is, the model θ⊤ψ(s,a) is fitted to the expected immediate reward r(s,a) under the squared loss:

min_θ E_{p_π(h)}[ (1/T) Σ_{t=1}^{T} ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ],

where h denotes the history sample following the current policy π:

h = [s_1, a_1, …, s_T, a_T, s_{T+1}].

For history samples H = {h_1, …, h_N}, where

h_n = [s_{1,n}, a_{1,n}, …, s_{T,n}, a_{T,n}, s_{T+1,n}],

an empirical version of the above least-squares problem is given as

min_θ (1/N) Σ_{n=1}^{N} (1/T) Σ_{t=1}^{T} ( θ⊤ψ̂(s_{t,n}, a_{t,n}; H) − r(s_{t,n}, a_{t,n}, s_{t+1,n}) )².
FIGURE 2.2: Gradient descent.
Here, ψ̂(s,a;H) is an empirical estimator of ψ(s,a) given by

ψ̂(s,a;H) = φ(s,a) − (1/|H_(s,a)|) Σ_{s′∈H_(s,a)} E_{π(a′|s′)}[γ φ(s′,a′)],

where H_(s,a) denotes a subset of H that consists of all transition samples from state s by action a, |H_(s,a)| denotes the number of elements in the set H_(s,a), and Σ_{s′∈H_(s,a)} denotes the summation over all destination states s′ in the set H_(s,a).
Let Ψ̂ be the NT × B matrix and r be the NT-dimensional vector defined as

Ψ̂_{N(t−1)+n, b} = ψ̂_b(s_{t,n}, a_{t,n}),
r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Ψ̂ is sometimes called the design matrix. Then the above least-squares problem can be compactly expressed as

min_θ (1/(NT)) ||Ψ̂θ − r||²,

where ||·|| denotes the ℓ2-norm. Because this is a quadratic function with respect to θ, its global minimizer θ̂ can be analytically obtained by setting its derivative to zero as

θ̂ = (Ψ̂⊤Ψ̂)^{−1} Ψ̂⊤r.   (2.2)
If B is too large and computing the inverse of Ψ̂⊤Ψ̂ is intractable, we may use a gradient descent method. That is, starting from some initial estimate θ, the solution is updated until convergence, as follows (see Figure 2.2):

θ ← θ − ε(Ψ̂⊤Ψ̂θ − Ψ̂⊤r),

where Ψ̂⊤Ψ̂θ − Ψ̂⊤r corresponds to the gradient of the objective function ||Ψ̂θ − r||² and ε is a small positive constant representing the step size of gradient descent.
A notable variation of the above least-squares method is to compute the solution by

θ̃ = (Φ⊤Ψ̂)^{−1} Φ⊤r,

where Φ is the NT × B matrix defined as

Φ_{N(t−1)+n, b} = φ_b(s_{t,n}, a_{t,n}).

This variation is called the least-squares fixed-point approximation (Lagoudakis & Parr, 2003) and is shown to handle the estimation error included in the basis function ψ̂ in a sound way (Bradtke & Barto, 1996). However, for simplicity, we focus on Eq. (2.2) below.
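As a concrete illustration, the following hedged sketch solves the least-squares problem of Eq. (2.2) and also runs the gradient-descent variant. The design matrix and reward vector are random stand-ins rather than quantities computed from actual trajectories; in practice they would be filled row by row, e.g., with the psi_hat() helper sketched above.

```python
import numpy as np

# Stand-ins for the NT x B design matrix Psi_hat and the NT rewards r (not real data).
rng = np.random.default_rng(0)
Psi_hat = rng.standard_normal((200, 10))
r = rng.standard_normal(200)

# Analytic solution of Eq. (2.2): theta = (Psi^T Psi)^{-1} Psi^T r, computed here with a
# least-squares solver, which is numerically preferable to forming the inverse explicitly.
theta_hat, *_ = np.linalg.lstsq(Psi_hat, r, rcond=None)

# Gradient-descent variant for large B: theta <- theta - eps * (Psi^T Psi theta - Psi^T r).
theta = np.zeros(Psi_hat.shape[1])
eps = 1.0 / np.linalg.norm(Psi_hat, ord=2) ** 2      # a conservative step size
for _ in range(5000):
    grad = Psi_hat.T @ (Psi_hat @ theta - r)
    theta -= eps * grad

print(np.allclose(theta, theta_hat, atol=1e-3))      # both routes agree
```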
2.2.3 Regularization

Regression techniques in machine learning are generally formulated as the minimization of a goodness-of-fit term and a regularization term. In the above least-squares framework, the goodness-of-fit of our model is measured by the squared loss. In the following chapters, we discuss how other loss functions can be utilized in the policy iteration framework, e.g., sample reuse in Chapter 4 and outlier-robust learning in Chapter 6. Here we focus on the regularization term and introduce practically useful regularization techniques.

The ℓ2-regularizer is the most standard regularizer in statistics and machine learning; it is also called ridge regression (Hoerl & Kennard, 1970):

min_θ (1/(NT)) ||Ψ̂θ − r||² + λ||θ||²,

where λ ≥ 0 is the regularization parameter. The role of the ℓ2-regularizer ||θ||² is to penalize the growth of the parameter vector θ to avoid overfitting to noisy samples. A practical advantage of the use of the ℓ2-regularizer is that the minimizer θ̂ can still be obtained analytically:

θ̂ = (Ψ̂⊤Ψ̂ + λI_B)^{−1} Ψ̂⊤r,

where I_B denotes the B × B identity matrix. Because of the addition of λI_B, the matrix to be inverted above has a better numerical condition, and thus the solution tends to be more stable than the solution obtained by plain least squares without regularization.
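A minimal sketch of the ridge solution above, under the same stand-in design-matrix assumption as before; adding λI_B is a one-line change to the normal equations.

```python
import numpy as np

# Ridge solution: theta = (Psi^T Psi + lambda I_B)^{-1} Psi^T r.
def ridge_solution(Psi_hat, r, lam):
    B = Psi_hat.shape[1]
    return np.linalg.solve(Psi_hat.T @ Psi_hat + lam * np.eye(B), Psi_hat.T @ r)

rng = np.random.default_rng(0)
Psi_hat, r = rng.standard_normal((200, 10)), rng.standard_normal(200)   # stand-in data
print(ridge_solution(Psi_hat, r, lam=0.1))
```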
Note that the same solution as the above ℓ2-penalized least-squares problem can be obtained by solving the following ℓ2-constrained least-squares problem:

min_θ (1/(NT)) ||Ψ̂θ − r||²   subject to ||θ||² ≤ C,

where C is determined from λ. Note that the larger the value of λ is (i.e., the stronger the effect of regularization is), the smaller the value of C is (i.e., the smaller the feasible region is). The feasible region (i.e., the region where the constraint ||θ||² ≤ C is satisfied) is illustrated in Figure 2.3(a).

FIGURE 2.3: Feasible regions (i.e., regions where the constraint is satisfied) for (a) the ℓ2-constraint and (b) the ℓ1-constraint. The least-squares (LS) solution is the bottom of the elliptical hyperboloid, whereas the solution of constrained least squares (CLS) is located at the point where the hyperboloid touches the feasible region.
Another popular choice of regularization in statistics and machine learning is the ℓ1-regularizer, which is also called the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):

min_θ (1/(NT)) ||Ψ̂θ − r||² + λ||θ||₁,

where ||·||₁ denotes the ℓ1-norm defined as the absolute sum of the elements:

||θ||₁ = Σ_{b=1}^{B} |θ_b|.

In the same way as in the ℓ2-regularization case, the same solution as the above ℓ1-penalized least-squares problem can be obtained by solving the following constrained least-squares problem:

min_θ (1/(NT)) ||Ψ̂θ − r||²   subject to ||θ||₁ ≤ C,
where C is determined from λ. The feasible region is illustrated in Figure 2.3(b).

A notable property of ℓ1-regularization is that the solution tends to be sparse, i.e., many of the elements {θ_b}_{b=1}^{B} become exactly zero. The reason why the solution becomes sparse can be intuitively understood from Figure 2.3(b): the solution tends to be on one of the corners of the feasible region, where the solution is sparse. On the other hand, in the ℓ2-constraint case (see Figure 2.3(a) again), the solution is similar to the ℓ1-constraint case, but it is not generally on an axis and thus the solution is not sparse. Such a sparse solution has various computational advantages. For example, the solution for large-scale problems can be computed efficiently because all parameters do not have to be explicitly handled; see, e.g., Tomioka et al., 2011. Furthermore, the solutions for all different regularization parameters can be computed efficiently (Efron et al., 2004), and the output of the learned model can be computed efficiently.
2.2.4 Model Selection

In regression, tuning parameters are often included in the algorithm, such as basis parameters and the regularization parameter. Such tuning parameters can be objectively and systematically optimized based on cross-validation (Wahba, 1990) as follows (see Figure 2.4).

FIGURE 2.4: Cross validation. The data set is split into the 1st through Kth subsets; in each round one subset is used for validation and the remaining subsets are used for estimation.

First, the training dataset H is divided into K disjoint subsets of approximately the same size, {H_k}_{k=1}^{K}. Then the regression solution θ̂_k is obtained using H\H_k (i.e., all samples without H_k), and its squared error for the hold-out samples H_k is computed. This procedure is repeated for k = 1, …, K, and the model (such as the basis parameter and the regularization parameter) that minimizes the average error is chosen as the most suitable one.
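The following hedged sketch illustrates this procedure for choosing the regularization parameter λ. For simplicity it splits the rows of the design matrix directly, whereas in the policy iteration setting one would split the episode set H itself; the ridge_solution helper from the previous sketch is restated so the example is self-contained.

```python
import numpy as np

def ridge_solution(Psi_hat, r, lam):
    B = Psi_hat.shape[1]
    return np.linalg.solve(Psi_hat.T @ Psi_hat + lam * np.eye(B), Psi_hat.T @ r)

def cross_validate(Psi_hat, r, candidates, K=5):
    """K-fold cross-validation: fit on K-1 folds, score squared error on the held-out fold."""
    folds = np.array_split(np.random.permutation(len(r)), K)
    scores = {}
    for lam in candidates:
        errors = []
        for k in range(K):
            hold_out = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            theta_k = ridge_solution(Psi_hat[train], r[train], lam)
            residual = Psi_hat[hold_out] @ theta_k - r[hold_out]
            errors.append(np.mean(residual ** 2))    # squared error on the hold-out samples
        scores[lam] = np.mean(errors)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
Psi_hat, r = rng.standard_normal((200, 10)), rng.standard_normal(200)   # stand-in data
best_lam, _ = cross_validate(Psi_hat, r, candidates=[1e-3, 1e-2, 1e-1, 1.0])
print("selected lambda:", best_lam)
```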
One may think that the ordinary squared error can be directly used for model selection, instead of its cross-validation estimator. However, the ordinary squared error is heavily biased (or, in other words, over-fitted) since the same training samples are used twice for learning parameters and estimating the generalization error (i.e., the out-of-sample prediction error). On the other hand, the cross-validation estimator of the squared error is almost unbiased, where "almost" comes from the fact that the number of training samples is reduced due to data splitting in the cross-validation procedure.
In general, cross-validation is computationally expensive because the squared error needs to be estimated many times. For example, when performing 5-fold cross-validation for 10 model candidates, the learning procedure has to be repeated 5 × 10 = 50 times. However, this is often acceptable in practice because sensible model selection gives an accurate solution even with a small number of samples. Thus, in total, the computation time may not grow that much. Furthermore, cross-validation is suitable for parallel computing since error estimation for different models and different folds is independent of each other. For instance, when performing 5-fold cross-validation for 10 model candidates, the use of 50 computing units allows us to compute everything at once.
2.3 Remarks

Reinforcement learning via regression of state-action value functions is a highly powerful and flexible approach, because we can utilize various regression techniques developed in statistics and machine learning such as least squares, regularization, and cross-validation.

In the following chapters, we introduce more sophisticated regression techniques such as manifold-based smoothing (Chapelle et al., 2006) in Chapter 3, covariate shift adaptation (Sugiyama & Kawanabe, 2012) in Chapter 4, active learning (Sugiyama & Kawanabe, 2012) in Chapter 5, and robust regression (Huber, 1981) in Chapter 6.
Chapter 3
Basis Design for Value Function Approximation

Least-squares policy iteration explained in Chapter 2 works well, given appropriate basis functions for value function approximation. Because of its smoothness, the Gaussian kernel is a popular and useful choice as a basis function. However, it does not allow for discontinuity, which is conceivable in many reinforcement learning tasks. In this chapter, we introduce an alternative basis function based on geodesic Gaussian kernels (GGKs), which exploit the non-linear manifold structure induced by the Markov decision processes (MDPs). The details of GGK are explained in Section 3.1, and its relation to other basis function designs is discussed in Section 3.2. Then, experimental performance is numerically evaluated in Section 3.3, and this chapter is concluded in Section 3.4.
3.1 Gaussian Kernels on Graphs

In least-squares policy iteration, the choice of basis functions {φ_b(s,a)}_{b=1}^{B} is an open design issue (see Chapter 2). Traditionally, Gaussian kernels have been a popular choice (Lagoudakis & Parr, 2003; Engel et al., 2005), but they cannot approximate discontinuous functions well. To cope with this problem, more sophisticated methods of constructing suitable basis functions have been proposed which effectively make use of the graph structure induced by MDPs (Mahadevan, 2005). In this section, we introduce an alternative way of constructing basis functions by incorporating the graph structure of the state space.
3.1.1 MDP-Induced Graph

Let G be a graph induced by an MDP, where states S are nodes of the graph and the transitions with non-zero transition probabilities from one node to another are edges. The edges may have weights determined, e.g., based on the transition probabilities or the distance between nodes. The graph structure corresponding to an example grid world shown in Figure 3.1(a) is illustrated in Figure 3.1(c). In practice, such a graph structure (including the connection weights) is estimated from samples of a finite length. We assume that the graph G is connected. Typically, the graph is sparse in reinforcement learning tasks, i.e.,

ℓ ≪ n(n−1)/2,

where ℓ is the number of edges and n is the number of nodes.

FIGURE 3.1: An illustrative example of a reinforcement learning task of guiding an agent to a goal in the grid world. (a) Black areas are walls over which the agent cannot move, while the goal is represented in gray. Arrows on the grids represent one of the optimal policies. (b) Optimal state value function (in log-scale). (c) Graph induced by the MDP and a random policy.
3.1.2 Ordinary Gaussian Kernels

Ordinary Gaussian kernels (OGKs) on the Euclidean space are defined as

K(s, s′) = exp( −ED(s, s′)² / (2σ²) ),

where ED(s, s′) is the Euclidean distance between states s and s′; for example,

ED(s, s′) = ||x − x′||,

when the Cartesian positions of s and s′ in the state space are given by x and x′, respectively. σ² is the variance parameter of the Gaussian kernel.

The above Gaussian function is defined on the state space S, where s′ is treated as a center of the kernel. In order to employ the Gaussian kernel in least-squares policy iteration, it needs to be extended over the state-action space S × A. This is usually carried out by simply "copying" the Gaussian function over the action space (Lagoudakis & Parr, 2003; Mahadevan, 2005). More precisely, let the total number k of basis functions be mp, where m is the number of possible actions and p is the number of Gaussian centers. For the i-th action a^(i) (∈ A) (i = 1, 2, …, m) and for the j-th Gaussian center c^(j) (∈ S) (j = 1, 2, …, p), the (i + (j−1)m)-th basis function is defined as

φ_{i+(j−1)m}(s,a) = I(a = a^(i)) K(s, c^(j)),   (3.1)

where I(·) is the indicator function:

I(a = a^(i)) = 1 if a = a^(i), and 0 otherwise.
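A hedged sketch of Eq. (3.1) is given below: a state-space Gaussian kernel evaluated at each center is "copied" over a discrete action set via the indicator function. The grid coordinates and action names are made up for illustration.

```python
import numpy as np

# Ordinary Gaussian kernel basis of Eq. (3.1) on a Euclidean state space,
# copied over a discrete action set. States and centers are 2-D positions here.
def ogk_basis(x, a, centers, actions, sigma2=1.0):
    """Return the (m * p)-dimensional feature vector phi(s, a) of Eq. (3.1)."""
    x = np.asarray(x, dtype=float)
    features = []
    for c in centers:                                 # j = 1, ..., p Gaussian centers
        k = np.exp(-np.sum((x - np.asarray(c, float)) ** 2) / (2.0 * sigma2))
        for action in actions:                        # i = 1, ..., m actions
            features.append(k if action == a else 0.0)   # I(a = a^(i)) * K(s, c^(j))
    return np.array(features)

centers = [(1.0, 1.0), (3.0, 2.0), (5.0, 5.0)]        # hypothetical kernel centers
actions = ["north", "south", "east", "west"]
print(ogk_basis((2.0, 1.5), "east", centers, actions))
```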
3.1.3 Geodesic Gaussian Kernels

On graphs, a natural definition of the distance would be the shortest path. The Gaussian kernel based on the shortest path is given by

K(s, s′) = exp( −SP(s, s′)² / (2σ²) ),   (3.2)

where SP(s, s′) denotes the shortest path from state s to state s′. The shortest path on a graph can be interpreted as a discrete approximation to the geodesic distance on a non-linear manifold (Chung, 1997). For this reason, we call Eq. (3.2) a geodesic Gaussian kernel (GGK) (Sugiyama et al., 2008).

Shortest paths on graphs can be efficiently computed using the Dijkstra algorithm (Dijkstra, 1959). With its naive implementation, the computational complexity of computing the shortest paths from a single node to all other nodes is O(n²), where n is the number of nodes. If the Fibonacci heap is employed, the computational complexity can be reduced to O(n log n + ℓ) (Fredman & Tarjan, 1987), where ℓ is the number of edges. Since the graph in value function approximation problems is typically sparse (i.e., ℓ ≪ n²), using the Fibonacci heap provides significant computational gains. Furthermore, there exist various approximation algorithms which are computationally very efficient (see Goldberg & Harrelson, 2005, and references therein).
Analogously to OGKs, we need to extend GGKs to the state-action space to use them in least-squares policy iteration. A naive way is to just employ Eq. (3.1), but this can cause a shift in the Gaussian centers since the state usually changes when some action is taken. To incorporate this transition, the basis functions are defined as the expectation of the Gaussian functions after transition:

φ_{i+(j−1)m}(s,a) = I(a = a^(i)) Σ_{s′∈S} P(s′|s,a) K(s′, c^(j)).   (3.3)

This shifting scheme is shown to work very well when the transition is predominantly deterministic (Sugiyama et al., 2008).
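The following hedged sketch combines Eq. (3.2) and Eq. (3.3): shortest-path distances are computed with Dijkstra's algorithm (here via scipy.sparse.csgraph), and the basis is shifted through an assumed deterministic transition function next_state(s, a). The small chain graph is a made-up example, not the grid world of Figure 3.1.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

# A tiny hypothetical state graph with unit edge weights (4 states on a chain).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
SP = dijkstra(W, directed=False)            # all-pairs shortest-path distances

def ggk(s, center, sigma2=1.0):
    """Geodesic Gaussian kernel of Eq. (3.2): exp(-SP(s, s')^2 / (2 sigma^2))."""
    return np.exp(-SP[s, center] ** 2 / (2.0 * sigma2))

def ggk_basis(s, a, centers, actions, next_state, sigma2=1.0):
    """State-action basis of Eq. (3.3); with a deterministic transition the expectation
    over p(s'|s, a) reduces to evaluating the kernel at the single successor state."""
    s_next = next_state(s, a)
    return np.array([ggk(s_next, c, sigma2) if action == a else 0.0
                     for c in centers for action in actions])

# Example with a made-up "move right/left along the chain" deterministic transition.
next_state = lambda s, a: min(s + 1, 3) if a == "right" else max(s - 1, 0)
print(ggk_basis(1, "right", centers=[0, 3], actions=["left", "right"], next_state=next_state))
```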
3.1.4 Extension to Continuous State Spaces

So far, we focused on discrete state spaces. However, the concept of GGKs can be naturally extended to continuous state spaces, which is explained here. First, the continuous state space is discretized, which gives a graph as a discrete approximation to the non-linear manifold structure of the continuous state space. Based on the graph, GGKs can be constructed in the same way as in the discrete case. Finally, the discrete GGKs are interpolated, e.g., using a linear method, to give continuous GGKs.

Although this procedure discretizes the continuous state space, it must be noted that the discretization is only for the purpose of obtaining the graph as a discrete approximation of the continuous non-linear manifold; the resulting basis functions themselves are continuously interpolated and hence the state space is still treated as continuous, as opposed to conventional discretization procedures.
3.2 Illustration

In this section, the characteristics of GGKs are discussed in comparison to existing basis functions.

3.2.1 Setup
Let us consider a toy reinforcement learning task of guiding an agent to a goal in a deterministic grid world (see Figure 3.1(a)). The agent can take 4 actions: up, down, left, and right. Note that actions which make the agent collide with the wall are disallowed. A positive immediate reward +1 is given if the agent reaches a goal state; otherwise it receives no immediate reward. The discount factor is set at γ = 0.9.

In this task, a state s corresponds to a two-dimensional Cartesian grid position x of the agent. For illustration purposes, let us display the state value function,

V^π(s): S → R,

which is the expected long-term discounted sum of rewards the agent receives when the agent takes actions following policy π from state s. From the definition, it can be confirmed that V^π(s) is expressed in terms of Q^π(s,a) as

V^π(s) = Q^π(s, π(s)).

The optimal state value function V*(s) (in log-scale) is illustrated in Figure 3.1(b). An MDP-induced graph structure estimated from 20 series of random walk samples¹ of length 500 is illustrated in Figure 3.1(c). Here, the edge weights in the graph are set at 1 (which is equivalent to the Euclidean distance between two nodes).
3.2.2 Geodesic Gaussian Kernels

An example of GGKs for this graph is depicted in Figure 3.2(a), where the variance of the kernel is set at a large value (σ² = 30) for illustration purposes. The graph shows that GGKs have a nice smooth surface along the maze, but not across the partition between the two rooms. Since GGKs have "centers," they are extremely useful for adaptively choosing a subset of bases, e.g., using a uniform allocation strategy, a sample-dependent allocation strategy, or a maze-dependent allocation strategy of the centers. This is a practical advantage over non-ordered basis functions. Moreover, since GGKs are local by nature, the ill effects of local noise are constrained locally, which is another useful property in practice.

The approximated value functions obtained by 40 GGKs² are depicted in Figure 3.3(a), where one GGK center is put at the goal state and the remaining centers are chosen randomly. For GGKs, the kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic (see Section 3.1.3).

¹ More precisely, in each random walk, an initial state is chosen randomly. Then, an action is chosen randomly and a transition is made; this is repeated 500 times. This entire procedure is independently repeated 20 times to generate the training set.

² Note that the total number k of basis functions is 160 since each GGK is copied over the action space as per Eq. (3.3).
FIGURE 3.2: Examples of basis functions: (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels, (c) graph-Laplacian eigenbases, (d) diffusion wavelets.
FIGURE 3.3: Approximated value functions in log-scale: (a) geodesic Gaussian kernels (MSE = 1.03×10⁻²), (b) ordinary Gaussian kernels (MSE = 1.19×10⁻²), (c) graph-Laplacian eigenbases (MSE = 4.73×10⁻⁴), (d) diffusion wavelets (MSE = 5.00×10⁻⁴). The errors are computed with respect to the optimal value function illustrated in Figure 3.1(b).
The GGK-based method produces a nice smooth function along the maze while the discontinuity around the partition between the two rooms is sharply maintained (cf. Figure 3.1(b)). As a result, for this particular case, GGKs give the optimal policy (see Figure 3.4(a)).

As discussed in Section 3.1.3, the sparsity of the state transition matrix allows efficient and fast computation of the shortest paths on the graph. Therefore, least-squares policy iteration with GGK-based bases is still computationally attractive.

3.2.3 Ordinary Gaussian Kernels

OGKs share some of the preferable properties of GGKs described above. However, as illustrated in Figure 3.2(b), the tails of OGKs extend beyond the partition between the two rooms. Therefore, OGKs tend to undesirably smooth out the discontinuity of the value function around the barrier wall (see Figure 3.3(b)). This causes an error in the policy around the partition (see x = 10, y = 2, 3, …, 9 of Figure 3.4(b)).
FIGURE 3.4: Obtained policies: (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels, (c) graph-Laplacian eigenbases, (d) diffusion wavelets.
3.2.4 Graph-Laplacian Eigenbases

Mahadevan (2005) proposed employing the smoothest vectors on graphs as bases in value function approximation. According to spectral graph theory (Chung, 1997), such smooth bases are given by the minor eigenvectors of the graph-Laplacian matrix, which are called graph-Laplacian eigenbases (GLEs). GLEs may be regarded as a natural extension of Fourier bases to graphs.

Examples of GLEs are illustrated in Figure 3.2(c), showing that they have a Fourier-like structure on the graph. It should be noted that GLEs are rather global in nature, implying that noise in a local region can potentially degrade the global quality of approximation. An advantage of GLEs is that they have a natural ordering of the basis functions according to smoothness. This is practically very helpful in choosing a subset of basis functions. Figure 3.3(c) depicts the approximated value function in log-scale, where the top 40 smoothest GLEs out of 326 GLEs are used (note that the actual number of bases is 160 because of the duplication over the action space). It shows that GLEs globally give a very good approximation, although the small local fluctuation is significantly emphasized since the graph is in log-scale. Indeed, the mean squared error (MSE) between the approximated and optimal value functions described in the captions of Figure 3.3 shows that GLEs give a much smaller MSE than GGKs and OGKs. However, the obtained value function contains systematic local fluctuation and this results in an inappropriate policy (see Figure 3.4(c)).

MDP-induced graphs are typically sparse. In such cases, the resulting graph-Laplacian matrix is also sparse and GLEs can be obtained just by solving a sparse eigenvalue problem, which is computationally efficient. However, finding minor eigenvectors could be numerically unstable.
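For completeness, the minor eigenvectors can be obtained with a sparse eigensolver. The following sketch is illustrative only (it assumes the unnormalized Laplacian L = D − W, which is one common choice, not necessarily the exact variant used in the cited work).

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

def graph_laplacian_eigenbases(adjacency, num_bases):
    # adjacency: (n, n) symmetric weight matrix W.
    # Returns the num_bases smoothest eigenvectors of L = D - W as an (n, num_bases) matrix.
    W = csr_matrix(adjacency)
    D = diags(np.asarray(W.sum(axis=1)).ravel())
    L = D - W
    # which="SM" requests the smallest eigenvalues; it can be slow on large graphs.
    vals, vecs = eigsh(L, k=num_bases, which="SM")
    return vecs[:, np.argsort(vals)]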
3.2.5 Diffusion Wavelets

Coifman and Maggioni (2006) proposed diffusion wavelets (DWs), which are a natural extension of wavelets to graphs. The construction is based on a symmetrized random walk on a graph, which is diffused on the graph up to a desired level, resulting in a multi-resolution structure. A detailed construction algorithm for DWs and their mathematical properties are described in Coifman and Maggioni (2006).

When constructing DWs, the maximum nest level of the wavelets and the tolerance used in the construction algorithm need to be specified by users. Here, the maximum nest level is set at 10 and the tolerance is set at 10⁻¹⁰, as suggested by the authors. Examples of DWs are illustrated in Figure 3.2(d), showing a nice multi-resolution structure on the graph. DWs are over-complete bases, so one has to appropriately choose a subset of bases for better approximation. Figure 3.3(d) depicts the approximated value function obtained by DWs, where the most global 40 DWs were chosen from 1626 over-complete DWs (note that the actual number of bases is 160 because of the duplication over the action space). The choice of the subset of bases could possibly be enhanced using multiple heuristics. However, the current choice is reasonable since Figure 3.3(d) shows that DWs give a much smaller MSE than Gaussian kernels. Nevertheless, similarly to GLEs, the obtained value function contains many small fluctuations (see Figure 3.3(d)) and this results in an erroneous policy (see Figure 3.4(d)).

Thanks to the multi-resolution structure, the computation of diffusion wavelets can be carried out recursively. However, due to the over-completeness, it is still rather demanding in computation time. Furthermore, appropriately determining the tuning parameters as well as choosing an appropriate basis subset is not straightforward in practice.
3.3 Numerical Examples

As discussed in the previous section, GGKs bring a number of preferable properties for making value function approximation effective. In this section, the behavior of GGKs is illustrated numerically.
3.3.1 Robot-Arm Control

Here, a simulator of a two-joint robot arm (moving in a plane), illustrated in Figure 3.5(a), is employed. The task is to lead the end-effector ("hand") of the arm to an object while avoiding the obstacles. Possible actions are to increase or decrease the angle of each joint ("shoulder" and "elbow") by 5 degrees in the plane, simulating coarse stepper-motor joints. Thus, the state space S is the 2-dimensional discrete space consisting of two joint angles, as illustrated in Figure 3.5(b). The black area in the middle corresponds to the obstacle in the joint-angle state space. The action space A involves 4 actions: increase or decrease one of the joint angles. A positive immediate reward +1 is given when the robot's end-effector touches the object; otherwise the robot receives no immediate reward. Note that actions which make the arm collide with obstacles are disallowed. The discount factor is set at γ = 0.9. In this environment, the robot can change each joint angle exactly by 5 degrees, and therefore the environment is deterministic. However, because of the obstacles, it is difficult to explicitly compute an inverse kinematic model. Furthermore, the obstacles introduce discontinuity in value functions. Therefore, this robot-arm control task is an interesting testbed for investigating the behavior of GGKs.

Training samples from 50 series of 1000 random arm movements are collected, where the start state is chosen randomly in each trial. The graph induced by the above MDP consists of 1605 nodes, and uniform weights are assigned to the edges. Since there are 16 goal states in this environment (see Figure 3.5(b)), the first 16 Gaussian centers are put at the goals and the remaining centers are chosen randomly in the state space. For GGKs, kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic in this experiment.

Figure 3.6 illustrates the value functions approximated using GGKs and OGKs. The graphs show that GGKs give a nice smooth surface with the obstacle-induced discontinuity sharply preserved, while OGKs tend to smooth out the discontinuity. This makes a significant difference in avoiding the obstacle. From "A" to "B" in Figure 3.5(b), the GGK-based value function results in a trajectory that avoids the obstacle (see Figure 3.6(a)). On the other hand, the OGK-based value function yields a trajectory that tries to move the arm through the obstacle by following the gradient upward (see Figure 3.6(b)), causing the arm to get stuck behind the obstacle.
FIGURE 3.5: A two-joint robot arm: (a) a schematic, (b) state space. In this experiment, GGKs are put at all the goal states and the remaining kernels are distributed uniformly over the maze; the shifting scheme is used in GGKs.
Figure 3.7 summarizes the performance of GGKs and OGKs measured by the percentage of successful trials (i.e., trials in which the end-effector reaches the object) over 30 independent runs. More precisely, in each run, 50,000 training samples are collected using a different random seed, a policy is then computed by GGK- or OGK-based least-squares policy iteration, and finally the obtained policy is tested. The graph shows that GGKs remarkably outperform OGKs since the arm can successfully avoid the obstacle. The performance of OGKs does not go beyond 0.6 even when the number of kernels is increased. This is caused by the tail effect of OGKs: the OGK-based policy cannot lead the end-effector to the object if it starts from the bottom-left half of the state space.

When the number of kernels is increased, the performance of both GGKs and OGKs gets worse at around k = 20.
FIGURE 3.6: Approximated value functions with 10 kernels (the actual number of bases is 40 because of the duplication over the action space): (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels.

FIGURE 3.7: Fraction of successful trials as a function of the number of kernels (legend: GGK(5), GGK(9), OGK(5), OGK(9)).
This is caused by the kernel allocation strategy: the first 16 kernels are put at the goal states and the remaining kernel centers are chosen randomly. When k is less than or equal to 16, the approximated value function tends to have a unimodal profile since all kernels are put at the goal states. However, when k is larger than 16, this unimodality is broken and the surface of the approximated value function has slight fluctuations, causing an error in the policies and degrading performance at around k = 20. This performance degradation tends to recover as the number of kernels is further increased.

Motion examples of the robot arm trained with GGK and OGK are illustrated in Figure 3.8 and Figure 3.9, respectively.

Overall, the above results show that when GGKs are combined with the above-mentioned kernel-center allocation strategy, almost perfect policies can be obtained with a small number of kernels. Therefore, the GGK method is computationally highly advantageous.
3.3.2 Robot-Agent Navigation

The above simple robot-arm control simulation shows that GGKs are promising. Here, GGKs are applied to a more challenging task of mobile-robot navigation, which involves a high-dimensional and very large state space.

A Khepera robot, illustrated in Figure 3.10(a), is employed for the navigation task. The Khepera robot is equipped with 8 infrared sensors ("s1" to "s8" in the figure), each of which gives a measure of the distance from the surrounding obstacles. Each sensor produces a scalar value between 0 and 1023: the sensor outputs the maximum value 1023 if an obstacle is just in front of the sensor, and the value decreases as the obstacle gets farther away until it reaches the minimum value 0. Therefore, the state space S is 8-dimensional. The Khepera robot has two wheels and takes the following predefined actions: forward, left rotation, right rotation, and backward (i.e., the action space A contains 4 actions). The speed of the left and right wheels for each action is described in Figure 3.10(a) in brackets (the unit is pulses per 10 milliseconds). Note that the sensor values and the wheel speeds are highly stochastic due to cross-talk, sensor noise, slip, etc. Furthermore, perceptual aliasing occurs due to the limited range and resolution of the sensors. Therefore, the state transition is also highly stochastic. The discount factor is set at γ = 0.9.

The goal of the navigation task is to make the Khepera robot explore the environment as much as possible. To this end, a positive reward +1 is given when the Khepera robot moves forward and a negative reward −2 is given when the Khepera robot collides with an obstacle. No reward is given to the left rotation, right rotation, and backward actions. This reward design encourages the Khepera robot to go forward without hitting obstacles, through which extensive exploration in the environment could be achieved.

Training samples are collected from 200 series of 100 random movements in a fixed environment with several obstacles (see Figure 3.11(a)). Then, a graph is constructed from the gathered samples by discretizing the continuous state space using a self-organizing map (SOM) (Kohonen, 1995). A SOM consists of neurons located on a regular grid. Each neuron corresponds to a cluster, and neurons are connected to adjacent ones by a neighborhood relation. The SOM is similar to the k-means clustering algorithm, but it differs in that the topological structure of the entire map is taken into account. Thanks to this, the entire space tends to be covered by the SOM.
FIGURE 3.8: A motion example of the robot arm trained with GGK (from left to right and top to bottom).

FIGURE 3.9: A motion example of the robot arm trained with OGK (from left to right and top to bottom).

FIGURE 3.10: Khepera robot: (a) a schematic, (b) state space projected onto a 2-dimensional subspace for visualization. In this experiment, GGKs are distributed uniformly over the maze without the shifting scheme.
The number of nodes (states) in the graph is set at 696 (equivalent to a SOM map size of 24 × 29). This value is computed by the standard rule-of-thumb formula 5√n (Vesanto et al., 2000), where n is the number of samples. The connectivity of the graph is determined by the state transitions occurring in the samples. More specifically, if there is a state transition from one node to another in the samples, an edge is established between these two nodes and the edge weight is set according to the Euclidean distance between them.

Figure 3.10(b) illustrates an example of the obtained graph structure. For visualization purposes, the 8-dimensional state space is projected onto a 2-dimensional subspace spanned by

  (−1 −1 0 0 1 1 0 0) and (0 0 1 1 0 0 −1 −1).

FIGURE 3.11: Simulation environment: (a) training, (b) test.
Note that this projection is performed only for the purpose of visualization. All the computations are carried out using the entire 8-dimensional data. The i-th element in the above bases corresponds to the output of the i-th sensor (see Figure 3.10(a)). The projection onto this subspace roughly means that the horizontal axis corresponds to the distance to the left and right obstacles, while the vertical axis corresponds to the distance to the front and back obstacles. For clear visibility, only the edges whose weight is less than 250 are plotted. Representative local poses of the Khepera robot with respect to the obstacles are illustrated in Figure 3.10(b). This graph has a notable feature: the nodes around the region "B" in the figure are directly connected to the nodes at "A," but are very sparsely connected to the nodes at "C," "D," and "E." This implies that the geodesic distance from "B" to "C," "B" to "D," or "B" to "E" is typically larger than the Euclidean distance.

Since the transition from one state to another is highly stochastic in the current experiment, the GGK function is simply duplicated over the action space (see Eq. (3.1)). For obtaining continuous GGKs, the GGK functions need to be interpolated (see Section 3.1.4). A simple linear interpolation method may be employed in general, but the current experiment has a unique characteristic: at least one of the sensor values is always zero since the Khepera robot is never completely surrounded by obstacles. Therefore, samples are always on the surface of the 8-dimensional hypercube-shaped state space. On the other hand, the node centers determined by the SOM are not generally on the surface. This means that a sample is not generally included in the convex hull of its nearest nodes and the function value needs to be extrapolated. Here, the Euclidean distance between the sample and its nearest node is simply added when computing kernel values. More precisely, for a state s that is not generally located on a node center, the GGK-based basis function is defined as

  φ_{i+(j−1)m}(s, a) = I(a = a^(i)) exp( − (ED(s, s̃) + SP(s̃, c^(j)))² / (2σ²) ),

where s̃ is the node closest to s in the Euclidean distance.
Figure 3.12 illustrates an example of the actions selected at each node by the GGK-based and OGK-based policies. One hundred kernels are used and the width is set at 1000. The symbols ↑, ↓, ⊂, and ⊃ in the figure indicate the forward, backward, left rotation, and right rotation actions. This shows that there is a clear difference in the obtained policies at the state "C." The backward action is most likely to be taken by the OGK-based policy, while the left rotation and right rotation are most likely to be taken by the GGK-based policy. This causes a significant difference in the performance. To explain this, suppose that the Khepera robot is at the state "C," i.e., it faces a wall. The GGK-based policy guides the Khepera robot from "C" to "A" via "D" or "E" by taking the left and right rotation actions, and it can avoid the obstacle successfully. On the other hand, the OGK-based policy tries to plan a path from "C" to "A" via "B" by activating the backward action. As a result, the forward action is taken at "B." For this reason, the Khepera robot returns to "C" again and ends up moving back and forth between "C" and "B."

For the performance evaluation, a more complicated environment than the one used for gathering training samples (see Figure 3.11) is used. This means that how well the obtained policies generalize to an unknown environment is evaluated here. In this test environment, the Khepera robot runs from a fixed starting position (see Figure 3.11(b)) and takes 150 steps following the obtained policy, with the sum of rewards (+1 for the forward action) computed. If the Khepera robot collides with an obstacle before 150 steps, the evaluation is stopped. The mean test performance over 30 independent runs is depicted in Figure 3.13 as a function of the number of kernels. More precisely, in each run, a graph is constructed based on the training samples taken from the training environment and the specified number of kernels is put randomly on the graph. Then, a policy is learned by GGK- or OGK-based least-squares policy iteration using the training samples. Note that the actual number of bases is four times larger because of the extension of the basis functions over the action space. The test performance is measured 5 times for each policy and the average is output. Figure 3.13 shows that GGKs significantly outperform OGKs, demonstrating that GGKs are promising even in this challenging setting with a high-dimensional, large state space.

Figure 3.14 depicts the computation time of each method as a function of the number of kernels. This shows that the computation time monotonically increases as the number of kernels increases, and that the GGK-based and OGK-based methods have comparable computation time. However, given that the GGK-based method works much better than the OGK-based method with a smaller number of kernels (see Figure 3.13), the GGK-based method can be regarded as a computationally efficient alternative to the standard OGK-based method.

Finally, the trained Khepera robot is applied to map building.
FIGURE 3.12: Examples of obtained policies: (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels. The symbols ↑, ↓, ⊂, and ⊃ indicate the forward, backward, left rotation, and right rotation actions.
Starting from an initial position (indicated by a square in Figure 3.15), the Khepera robot takes an action 2000 times following the learned policy. Eighty kernels with Gaussian width σ = 1000 are used for value function approximation. The results of GGKs and OGKs are depicted in Figure 3.15. The graphs show that the GGK result gives a broader profile of the environment, while the OGK result only reveals a local area around the initial position.

Motion examples of the Khepera robot trained with GGK and OGK are illustrated in Figure 3.16 and Figure 3.17, respectively.
FIGURE 3.13: Average amount of exploration (averaged total rewards) as a function of the number of kernels (legend: GGK(200), GGK(1000), OGK(200), OGK(1000)).

FIGURE 3.14: Computation time [sec] as a function of the number of kernels (legend: GGK(1000), OGK(1000)).

FIGURE 3.15: Results of map building (cf. Figure 3.11(b)): (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels.
FIGURE 3.16: A motion example of the Khepera robot trained with GGK (from left to right and top to bottom).

FIGURE 3.17: A motion example of the Khepera robot trained with OGK (from left to right and top to bottom).

3.4 Remarks

The performance of least-squares policy iteration depends heavily on the choice of basis functions for value function approximation. In this chapter, the geodesic Gaussian kernel (GGK) was introduced and shown to possess several preferable properties such as smoothness along the graph and easy computability. It was also demonstrated that the policies obtained by GGKs are not very sensitive to the choice of the Gaussian kernel width, which is a useful property in practice. Also, the heuristic of putting Gaussian centers on goal states was shown to work well.

However, when the transition is highly stochastic (i.e., the transition probability has a wide support), the graph constructed from the transition samples could be noisy. When an erroneous transition results in a short-cut over obstacles, the graph-based approach may not work well since the topology of the state space changes significantly.
Chapter 4

Sample Reuse in Policy Iteration

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques to compensate for the bias caused by the difference between the data-sampling policy and the target policy. In this chapter, we explain how importance sampling can be utilized to efficiently reuse previously collected data samples in policy iteration. After formulating the problem of off-policy value function approximation in Section 4.1, representative off-policy value function approximation techniques including adaptive importance sampling are reviewed in Section 4.2. Then, in Section 4.3, how the adaptivity of importance sampling can be optimally controlled is explained. In Section 4.4, off-policy value function approximation techniques are integrated into the framework of least-squares policy iteration for efficient sample reuse. Experimental results are shown in Section 4.5, and finally this chapter is concluded in Section 4.6.
4.1 Formulation

As explained in Section 2.2, least-squares policy iteration models the state-action value function Q^π(s, a) by a linear architecture,

  θ⊤φ(s, a),

and learns the parameter θ so that the generalization error G is minimized:

  G(θ) = E_{p^π(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ].    (4.1)

Here, E_{p^π(h)} denotes the expectation over the history

  h = [s_1, a_1, …, s_T, a_T, s_{T+1}]

following the target policy π, and

  ψ(s, a) = φ(s, a) − γ E_{π(a′|s′) p(s′|s,a)} [ φ(s′, a′) ].
When history samples following the target policy π are available, the situation is called on-policy reinforcement learning. In this case, simply replacing the expectation contained in the generalization error G by sample averages gives a statistically consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of samples goes to infinity).

Here, we consider the situation called off-policy reinforcement learning, where the sampling policy π̃ for collecting data samples is generally different from the target policy π. Let us denote the history samples following π̃ by

  H^π̃ = {h^π̃_1, …, h^π̃_N},

where each episodic sample h^π̃_n is given as

  h^π̃_n = [s^π̃_{1,n}, a^π̃_{1,n}, …, s^π̃_{T,n}, a^π̃_{T,n}, s^π̃_{T+1,n}].

Under the off-policy setup, naive learning by minimizing the sample-approximated generalization error Ĝ_NIW leads to an inconsistent estimator:

  Ĝ_NIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )²,

where

  ψ̂(s, a; H) = φ(s, a) − (1/|H_{(s,a)}|) Σ_{s′∈H_{(s,a)}} E_{π(a′|s′)} [ γ φ(s′, a′) ].

H_{(s,a)} denotes the subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and Σ_{s′∈H_{(s,a)}} denotes the summation over all destination states s′ in the set H_{(s,a)}. NIW stands for "No Importance Weight," which will be explained later.

This inconsistency problem can be avoided by gathering new samples following the target policy π, i.e., when the current policy is updated, new samples are gathered following the updated policy and these new samples are used for policy evaluation. However, when the data-sampling cost is high, this is too expensive. It would be more cost efficient if previously gathered samples could be reused effectively.
4.2 Off-Policy Value Function Approximation

Importance sampling is a general technique for dealing with the off-policy situation. Suppose we have i.i.d. (independent and identically distributed) samples {x_n}_{n=1}^N from a strictly positive probability density function p̃(x). Using these samples, we would like to compute the expectation of a function g(x) over another probability density function p(x). A consistent approximation of the expectation is given by the importance-weighted average as

  (1/N) Σ_{n=1}^N g(x_n) p(x_n)/p̃(x_n)  →(N→∞)  E_{p̃(x)}[ g(x) p(x)/p̃(x) ]
    = ∫ g(x) (p(x)/p̃(x)) p̃(x) dx = ∫ g(x) p(x) dx = E_{p(x)}[ g(x) ].

However, applying the importance sampling technique in off-policy reinforcement learning is not straightforward since our training samples of state s and action a are not i.i.d. due to the sequential nature of Markov decision processes (MDPs). In this section, representative importance-weighting techniques for MDPs are reviewed.
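The importance-weighted average itself is straightforward to compute. The following small Python sketch (illustrative only) estimates an expectation under one Gaussian density using samples drawn from another.

import numpy as np

def importance_weighted_mean(x, g, p, p_tilde):
    # x: samples drawn from p_tilde; g: function of interest; p, p_tilde: density functions.
    # Returns (1/N) sum g(x_n) p(x_n)/p_tilde(x_n), a consistent estimator of E_p[g(x)].
    x = np.asarray(x, dtype=float)
    return np.mean(g(x) * p(x) / p_tilde(x))

# Example: estimate E[x^2] under N(0,1) using samples from N(1,1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=1.0, size=100000)
normal = lambda x, mu: np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
est = importance_weighted_mean(samples, g=lambda x: x ** 2,
                               p=lambda x: normal(x, 0.0),
                               p_tilde=lambda x: normal(x, 1.0))
print(est)  # close to 1, the second moment of N(0,1)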
4.2.1 Episodic Importance Weighting

Based on the independence between episodes,

  p(h, h′) = p(h) p(h′) = p(s_1, a_1, …, s_T, a_T, s_{T+1}) p(s′_1, a′_1, …, s′_T, a′_T, s′_{T+1}),

the generalization error G can be rewritten as

  G(θ) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² w_T ],

where w_T is the episodic importance weight (EIW):

  w_T = p^π(h) / p^π̃(h).

p^π(h) and p^π̃(h) are the probability densities of observing the episodic data h under policies π and π̃:

  p^π(h) = p(s_1) ∏_{t=1}^T π(a_t | s_t) p(s_{t+1} | s_t, a_t),
  p^π̃(h) = p(s_1) ∏_{t=1}^T π̃(a_t | s_t) p(s_{t+1} | s_t, a_t).

Note that the importance weights can be computed without explicitly knowing p(s_1) and p(s_{t+1} | s_t, a_t), since they cancel out:

  w_T = ∏_{t=1}^T π(a_t | s_t) / ∏_{t=1}^T π̃(a_t | s_t).
Using the training data H^π̃, we can construct a consistent estimator of G as

  Ĝ_EIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² ŵ_{T,n},    (4.2)

where

  ŵ_{T,n} = ∏_{t=1}^T π(a^π̃_{t,n} | s^π̃_{t,n}) / ∏_{t=1}^T π̃(a^π̃_{t,n} | s^π̃_{t,n}).
4.2.2 Per-Decision Importance Weighting

A crucial observation in EIW is that the error at the t-th step does not depend on the samples after the t-th step (Precup et al., 2000). Thus, the generalization error G can be rewritten as

  G(θ) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² w_t ],

where w_t is the per-decision importance weight (PIW):

  w_t = [ p(s_1) ∏_{t′=1}^t π(a_{t′} | s_{t′}) p(s_{t′+1} | s_{t′}, a_{t′}) ] / [ p(s_1) ∏_{t′=1}^t π̃(a_{t′} | s_{t′}) p(s_{t′+1} | s_{t′}, a_{t′}) ]
      = ∏_{t′=1}^t π(a_{t′} | s_{t′}) / ∏_{t′=1}^t π̃(a_{t′} | s_{t′}).

Using the training data H^π̃, we can construct a consistent estimator as follows (cf. Eq. (4.2)):

  Ĝ_PIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² ŵ_{t,n},

where

  ŵ_{t,n} = ∏_{t′=1}^t π(a^π̃_{t′,n} | s^π̃_{t′,n}) / ∏_{t′=1}^t π̃(a^π̃_{t′,n} | s^π̃_{t′,n}).

ŵ_{t,n} only contains the relevant terms up to the t-th step, while ŵ_{T,n} includes all the terms until the end of the episode.
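Both kinds of weights are products of action-selection probability ratios along a trajectory; the episodic weight is simply the final entry of the per-decision weights. A minimal sketch (illustrative names; pi and pi_tilde are assumed to be callables returning π(a|s) and π̃(a|s)) follows.

import numpy as np

def per_decision_weights(states, actions, pi, pi_tilde):
    # states, actions: one episode of length T.
    # Returns the PIW sequence (w_1, ..., w_T); the EIW weight w_T is the last element.
    ratios = np.array([pi(a, s) / pi_tilde(a, s) for s, a in zip(states, actions)])
    return np.cumprod(ratios)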
4.2.3 Adaptive Per-Decision Importance Weighting

The PIW estimator is guaranteed to be consistent. However, neither the EIW nor the PIW estimator is efficient in the statistical sense (Shimodaira, 2000), i.e., they do not have the smallest admissible variance. For this reason, the PIW estimator can have large variance in finite-sample cases and therefore learning with PIW tends to be unstable in practice.

To improve the stability, it is important to control the trade-off between consistency and efficiency (or, similarly, bias and variance) based on the training data. Here, the flattening parameter ν (∈ [0, 1]) is introduced to control the trade-off by slightly "flattening" the importance weights (Shimodaira, 2000; Sugiyama et al., 2007):

  Ĝ_AIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² (ŵ_{t,n})^ν,

where AIW stands for the adaptive per-decision importance weight. When ν = 0, AIW is reduced to NIW, and therefore it has large bias but relatively small variance. On the other hand, when ν = 1, AIW is reduced to PIW; it then has small bias but relatively large variance. In practice, an intermediate value of ν will yield the best performance.
Let Ψ̂ be the NT × B matrix, Ŵ be the NT × NT diagonal matrix, and r be the NT-dimensional vector defined as

  Ψ̂_{N(t−1)+n, b} = ψ̂_b(s_{t,n}, a_{t,n}),
  Ŵ_{N(t−1)+n, N(t−1)+n} = ŵ_{t,n},
  r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Then Ĝ_AIW can be compactly expressed as

  Ĝ_AIW(θ) = (1/(NT)) (Ψ̂θ − r)⊤ Ŵ^ν (Ψ̂θ − r).

Because this is a convex quadratic function with respect to θ, its global minimizer θ̂_AIW can be obtained analytically by setting the derivative to zero:

  θ̂_AIW = (Ψ̂⊤ Ŵ^ν Ψ̂)⁻¹ Ψ̂⊤ Ŵ^ν r.

This means that the cost of computing θ̂_AIW is essentially the same as that of θ̂_NIW, which is given as follows (see Section 2.2.2):

  θ̂_NIW = (Ψ̂⊤ Ψ̂)⁻¹ Ψ̂⊤ r.
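In matrix form, the AIW solution is an ordinary weighted least-squares fit. The following sketch (illustrative; the small ridge term is an addition for numerical stability and is not part of the formula above) computes θ̂_AIW.

import numpy as np

def theta_aiw(Psi, r, w, nu, ridge=1e-8):
    # Psi: (NT, B) matrix of psi-hat features; r: (NT,) reward vector;
    # w: (NT,) per-decision importance weights; nu: flattening parameter in [0, 1].
    # Returns theta_AIW = (Psi^T W^nu Psi)^{-1} Psi^T W^nu r.
    w_nu = w ** nu                                        # diagonal of W^nu
    A = Psi.T @ (w_nu[:, None] * Psi) + ridge * np.eye(Psi.shape[1])
    b = Psi.T @ (w_nu * r)
    return np.linalg.solve(A, b)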
4.2.4 Illustration

Here, the influence of the flattening parameter ν on the estimator θ̂_AIW is illustrated using the chain-walk MDP shown in Figure 4.1.

FIGURE 4.1: Ten-state chain-walk MDP.

The MDP consists of 10 states,

  S = {s^(1), …, s^(10)},

and two actions,

  A = {a^(1), a^(2)} = {"L," "R"}.

The reward +1 is given when visiting s^(1) and s^(10). The transition probability p is indicated by the numbers attached to the arrows in the figure. For example, p(s^(2) | s^(1), a = "R") = 0.9 and p(s^(1) | s^(1), a = "R") = 0.1 mean that the agent can successfully move to the right node with probability 0.9 (indicated by solid arrows in the figure) and the action fails with probability 0.1 (indicated by dashed arrows in the figure). Six Gaussian kernels with standard deviation σ = 10 are used as basis functions, and the kernel centers are located at s^(1), s^(5), and s^(10). More specifically, the basis functions φ(s, a) = (φ_1(s, a), …, φ_6(s, a)) are defined as

  φ_{3(i−1)+j}(s, a) = I(a = a^(i)) exp( − (s − c_j)² / (2σ²) ),

for i = 1, 2 and j = 1, 2, 3, where

  c_1 = 1, c_2 = 5, c_3 = 10,

and

  I(x) = 1 if x is true, 0 if x is not true.

The experiments are repeated 50 times, where the sampling policy π̃(a|s) and the current policy π(a|s) are chosen randomly in each trial such that π̃ ≠ π. The discount factor is set at γ = 0.9. The model parameter θ̂_AIW is learned from the training samples H^π̃ and its generalization error is computed from the test samples H^π.
The left column of Figure 4.2 depicts the true generalization error G averaged over 50 trials as a function of the flattening parameter ν for N = 10, 30, and 50. Figure 4.2(a) shows that when the number of episodes is large (N = 50), the generalization error tends to decrease as the flattening parameter increases. This is a natural result due to the consistency of θ̂_AIW when ν = 1.
FIGURE 4.2: Left: true generalization error G averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem. Right: generalization error estimated by 5-fold importance-weighted cross-validation (IWCV), Ĝ_IWCV, averaged over 50 trials as a function of the flattening parameter ν. The number of steps is fixed at T = 10; the panels correspond to (a) N = 50, (b) N = 30, and (c) N = 10. The trend of G differs depending on the number N of episodic samples, and IWCV nicely captures the trend of the true generalization error G.
On the other hand, Figure 4.2(b) shows that when the number of episodes is not large (N = 30), ν = 1 performs rather poorly. This implies that the consistent estimator tends to be unstable when the number of episodes is not large enough; ν = 0.7 works best in this case. Figure 4.2(c) shows the results when the number of episodes is further reduced (N = 10). This illustrates that the consistent estimator with ν = 1 is even worse than the ordinary estimator (ν = 0) because the bias is dominated by the large variance. In this case, the best ν is even smaller and is achieved at ν = 0.4.

The above results show that AIW can outperform PIW, particularly when only a small number of training samples is available, provided that the flattening parameter ν is chosen appropriately.
4.3 Automatic Selection of the Flattening Parameter

In this section, the problem of selecting the flattening parameter in AIW is addressed.

4.3.1 Importance-Weighted Cross-Validation

Generally, the best ν tends to be large (small) when the number of training samples is large (small). However, this general trend is not sufficient to fine-tune the flattening parameter since the best value of ν depends on the training samples, the policies, the model of the value functions, etc. In this section, we discuss how model selection is performed to choose the best flattening parameter ν automatically from the training data and policies.

Ideally, the value of ν should be set so that the generalization error G is minimized, but the true G is not accessible in practice. To cope with this problem, we can use cross-validation (see Section 2.2.4) for estimating the generalization error G. However, in the off-policy scenario where the sampling policy π̃ and the target policy π are different, ordinary cross-validation gives a biased estimate of G. In the off-policy scenario, importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is more useful, where the cross-validation estimate of the generalization error is obtained with importance weighting.

More specifically, let us divide the training dataset H^π̃ containing N episodes into K subsets {H^π̃_k}_{k=1}^K of approximately the same size. For simplicity, we assume that N is divisible by K. Let θ̂^k_AIW be the parameter learned from H^π̃ \ H^π̃_k (i.e., all samples except H^π̃_k).
FIGURE 4.3: True generalization error G averaged over 50 trials obtained by NIW (ν = 0), PIW (ν = 1), and AIW+IWCV (ν is chosen by IWCV) in the 10-state chain-walk MDP, as a function of the number of episodes.
Then, the generalization error is estimated with importance weighting as

  Ĝ_IWCV = (1/K) Σ_{k=1}^K Ĝ^k_IWCV,

where

  Ĝ^k_IWCV = (K/(NT)) Σ_{h∈H^π̃_k} Σ_{t=1}^T ( θ̂^{k⊤}_AIW ψ̂(s_t, a_t; H^π̃_k) − r(s_t, a_t, s_{t+1}) )² ŵ_t.

The generalization error estimate Ĝ_IWCV is computed for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter ν) and the one that minimizes the estimated generalization error is chosen:

  ν̂_IWCV = argmin_ν Ĝ_IWCV.
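A compact sketch of this selection procedure is given below. It is illustrative only: the data layout (flat arrays indexed by episode) is an assumption made here, and theta_aiw refers to the sketch given in Section 4.2.3.

import numpy as np

def select_nu_by_iwcv(Psi, r, w, episode_ids, candidates, K=5):
    # Episode-wise K-fold IWCV for the flattening parameter.
    # Psi: (NT, B) features, r: (NT,) rewards, w: (NT,) per-decision weights,
    # episode_ids: (NT,) episode index of each sample, candidates: list of nu values.
    episodes = np.unique(episode_ids)
    folds = np.array_split(episodes, K)
    scores = []
    for nu in candidates:
        err = 0.0
        for fold in folds:
            test = np.isin(episode_ids, fold)
            train = ~test
            theta = theta_aiw(Psi[train], r[train], w[train], nu)
            residual = Psi[test] @ theta - r[test]
            err += np.mean(w[test] * residual ** 2)   # importance-weighted validation error
        scores.append(err / K)
    return candidates[int(np.argmin(scores))]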
4.3.2 Illustration

To illustrate how IWCV works, let us use the same numerical examples as in Section 4.2.4. The right column of Figure 4.2 depicts the generalization error estimated by 5-fold IWCV averaged over 50 trials as a function of the flattening parameter ν. The graphs show that IWCV nicely captures the trend of the true generalization error for all three cases.

Figure 4.3 describes, as a function of the number N of episodes, the average true generalization error obtained by NIW (AIW with ν = 0), PIW (AIW with ν = 1), and AIW+IWCV (ν ∈ {0.0, 0.1, …, 0.9, 1.0} is selected in each trial using 5-fold IWCV). This result shows that the performance improvement by NIW saturates when N ≥ 30, implying that the bias caused by NIW is not negligible. The performance of PIW is worse than NIW when N ≤ 20, which is caused by the large variance of PIW. On the other hand, AIW+IWCV consistently gives good performance for all N, illustrating the strong adaptation ability of AIW+IWCV.
4.4 Sample-Reuse Policy Iteration

In this section, AIW+IWCV is extended from single-step policy evaluation to full policy iteration. This method is called sample-reuse policy iteration (SRPI).

4.4.1 Algorithm

Let us denote the policy at the L-th iteration by π_L. In on-policy policy iteration, new data samples H^{π_L} are collected following the new policy π_L during the policy evaluation step. Thus, previously collected data samples H^{π_1}, …, H^{π_{L−1}} are not used:

  π_1 →[E: H^{π_1}] Q̂^{π_1} →[I] π_2 →[E: H^{π_2}] Q̂^{π_2} →[I] π_3 →[E: H^{π_3}] ⋯ →[I] π_L,

where "E: H" indicates the policy evaluation step using the data sample H and "I" indicates the policy improvement step. It would be more cost efficient if all previously collected data samples were reused in policy evaluation:

  π_1 →[E: H^{π_1}] Q̂^{π_1} →[I] π_2 →[E: H^{π_1}, H^{π_2}] Q̂^{π_2} →[I] π_3 →[E: H^{π_1}, H^{π_2}, H^{π_3}] ⋯ →[I] π_L.

Since the previous policies and the current policy are different in general, an off-policy scenario needs to be explicitly considered to reuse previously collected data samples. Here, we explain how AIW+IWCV can be used in this situation. For this purpose, the definition of Ĝ_AIW is extended so that multiple sampling policies π_1, …, π_L are taken into account:

  Ĝ^L_AIW = (1/(LNT)) Σ_{l=1}^L Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^{π_l}_{t,n}, a^{π_l}_{t,n}; {H^{π_l}}_{l=1}^L) − r(s^{π_l}_{t,n}, a^{π_l}_{t,n}, s^{π_l}_{t+1,n}) )²
      × ( ∏_{t′=1}^t π_L(a^{π_l}_{t′,n} | s^{π_l}_{t′,n}) / π_l(a^{π_l}_{t′,n} | s^{π_l}_{t′,n}) )^{ν_L},    (4.3)

where Ĝ^L_AIW is the generalization error estimated at the L-th policy evaluation using AIW. The flattening parameter ν_L is chosen based on IWCV before performing policy evaluation.
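The overall SRPI loop can be summarized by the following schematic sketch. It is not the book's code: collect_episodes, build_features, and gibbs_policy are task-dependent placeholders, while select_nu_by_iwcv and theta_aiw refer to the sketches in the previous sections.

def sample_reuse_policy_iteration(initial_policy, num_iterations, nu_candidates):
    # Schematic outer loop of SRPI (illustrative pseudocode-style sketch).
    policy, data = initial_policy, []
    for L in range(1, num_iterations + 1):
        data.append(collect_episodes(policy))                   # new samples H^{pi_L}
        Psi, r, w, episode_ids = build_features(data, policy)   # weights use pi_L / pi_l ratios, Eq. (4.3)
        nu = select_nu_by_iwcv(Psi, r, w, episode_ids, nu_candidates)
        theta = theta_aiw(Psi, r, w, nu)                         # policy evaluation with AIW
        policy = gibbs_policy(theta, tau=2 * L)                  # policy improvement, Eq. (4.4)
    return policy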
FIGURE 4.4: Performance of policies learned in three scenarios: ν = 0, ν = 1, and SRPI (ν is chosen by IWCV) in the 10-state chain-walk problem, for (a) N = 5 and (b) N = 10. The performance is measured by the average return computed from test samples over 30 trials. The agent collects the training sample H^{π_L} (N = 5 or 10 with T = 10) at every iteration and performs policy evaluation using all collected samples H^{π_1}, …, H^{π_L}. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration.
4.4.2 Illustration

Here, the behavior of SRPI is illustrated under the same experimental setup as in Section 4.3.2. Let us consider three scenarios: ν is fixed at 0, ν is fixed at 1, and ν is chosen by IWCV (i.e., SRPI). The agent collects samples H^{π_L} in each policy iteration following the current policy π_L and computes θ̂^L_AIW from all collected samples H^{π_1}, …, H^{π_L} using Eq. (4.3). Three Gaussian kernels are used as basis functions, where the kernel centers are randomly selected from the state space S in each trial. The initial policy π_1 is chosen randomly, and Gibbs policy improvement,

  π(a|s) ← exp(Q^π(s, a)/τ) / Σ_{a′∈A} exp(Q^π(s, a′)/τ),    (4.4)

is performed with τ = 2L.
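Gibbs policy improvement (4.4) is simply a softmax over the estimated action values; a minimal sketch (illustrative, with a max-shift added here for numerical stability) is:

import numpy as np

def gibbs_policy_probs(q_values, tau):
    # Eq. (4.4): pi(a|s) proportional to exp(Q(s,a)/tau).
    # q_values: vector of Q(s,a) over actions for a fixed state s.
    z = (np.asarray(q_values, dtype=float) - np.max(q_values)) / tau
    e = np.exp(z)
    return e / e.sum()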
Figure 4.4 depicts the average return over 30 trials when N = 5 and 10 with a fixed number of steps (T = 10). The graphs show that SRPI provides stable and fast learning of policies, while the performance improvement of policies learned with ν = 0 saturates in early iterations. The method with ν = 1 can improve policies well, but its progress tends to lag behind SRPI.

Figure 4.5 depicts the average value of the flattening parameter used in SRPI as a function of the total number of episodic samples. The graphs show that the value of the flattening parameter chosen by IWCV tends to rise in the beginning and go down later.
FIGURE 4.5: Flattening parameter values used by SRPI averaged over 30 trials as a function of the total number of episodic samples in the 10-state chain-walk problem, for (a) N = 5 and (b) N = 10.
At first sight, this does not agree with the general trend of preferring a low-variance estimator in early stages and a low-bias estimator later. However, this result is still consistent with the general trend: when the return increases rapidly (i.e., while the total number of episodic samples is up to 15 when N = 5 and up to 30 when N = 10 in Figure 4.5), the value of the flattening parameter increases (see Figure 4.4). After that, the return does not increase anymore (see Figure 4.4) since the policy iteration has already converged. Then, it is natural to prefer a small flattening parameter (Figure 4.5) since the sample selection bias becomes mild after convergence.

These results show that SRPI can effectively reuse previously collected samples by appropriately tuning the flattening parameter according to the condition of the data samples, the policies, etc.
4.5 Numerical Examples

In this section, the performance of SRPI is numerically investigated in more complex tasks.

4.5.1 Inverted Pendulum

First, we consider the task of the swing-up inverted pendulum illustrated in Figure 4.6, which consists of a pole hinged at the top of a cart. The goal of the task is to swing the pole up by moving the cart. There are three actions: applying positive force +50 (kg·m/s²) to the cart to move right, negative force −50 to move left, and zero force to just coast. That is, the action space A is discrete and described by

  A = {50, −50, 0} kg·m/s².

Note that the force itself is not strong enough to swing the pole up. Thus the cart needs to be moved back and forth several times to swing the pole up. The state space S is continuous and consists of the angle ϕ [rad] (∈ [0, 2π]) and the angular velocity ϕ̇ [rad/s] (∈ [−π, π]). Thus, a state s is described by the two-dimensional vector s = (ϕ, ϕ̇)⊤. The angle ϕ and angular velocity ϕ̇ are updated as follows:

  ϕ_{t+1} = ϕ_t + ϕ̇_{t+1} Δt,
  ϕ̇_{t+1} = ϕ̇_t + [ 9.8 sin(ϕ_t) − α w d (ϕ̇_t)² sin(2ϕ_t)/2 + α cos(ϕ_t) a_t ] / [ 4l/3 − α w d cos²(ϕ_t) ] · Δt,

where α = 1/(W + w) and a_t (∈ A) is the action chosen at time t. The reward function r(s, a, s′) is defined as

  r(s, a, s′) = cos(ϕ_{s′}),

where ϕ_{s′} denotes the angle ϕ of state s′. The problem parameters are set as follows: the mass of the cart W is 8 [kg], the mass of the pole w is 2 [kg], the length of the pole d is 0.5 [m], and the simulation time step Δt is 0.1 [s].

Forty-eight Gaussian kernels with standard deviation σ = π are used as basis functions, and the kernel centers are located over the following grid points:

  {0, 2π/3, 4π/3, 2π} × {−3π, −π, π, 3π}.

That is, the basis functions φ(s, a) = (φ_1(s, a), …, φ_48(s, a)) are set as

  φ_{16(i−1)+j}(s, a) = I(a = a^(i)) exp( − ‖s − c_j‖² / (2σ²) ),

for i = 1, 2, 3 and j = 1, …, 16, where

  c_1 = (0, −3π)⊤, c_2 = (0, −π)⊤, …, c_16 = (2π, 3π)⊤.
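The state-update equations above can be simulated with a simple Euler step. The following sketch is an illustration only: it assumes that l in the denominator refers to the pole length d given in the text, and the modulo-2π wrap of the angle is an assumption made here to keep ϕ in [0, 2π].

import numpy as np

def pendulum_step(phi, phi_dot, a, W=8.0, w=2.0, d=0.5, dt=0.1):
    # One Euler step of the swing-up dynamics; alpha = 1/(W + w), a in {+50, -50, 0}.
    alpha = 1.0 / (W + w)
    accel = (9.8 * np.sin(phi) - alpha * w * d * phi_dot ** 2 * np.sin(2 * phi) / 2
             + alpha * np.cos(phi) * a) / (4 * d / 3 - alpha * w * d * np.cos(phi) ** 2)
    phi_dot_next = phi_dot + accel * dt
    phi_next = (phi + phi_dot_next * dt) % (2 * np.pi)
    reward = np.cos(phi_next)             # r(s, a, s') = cos(phi of the next state)
    return phi_next, phi_dot_next, reward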
FIGURE 4.7: Results of SRPI in the inverted pendulum task. The agent collects the training sample H^{π_L} (N = 10 and T = 100) in each iteration and policy evaluation is performed using all collected samples H^{π_1}, …, H^{π_L}. (a) Performance of the policies learned with ν = 0, ν = 1, and SRPI, measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values chosen by IWCV in SRPI over 20 trials.
The initial policy π_1(a|s) is chosen randomly, and the initial-state probability density p(s) is set to be uniform. The agent collects the data samples H^{π_L} (N = 10 and T = 100) at each policy iteration following the current policy π_L. The discount factor is set at γ = 0.95, and the policy is updated by Gibbs policy improvement (4.4) with τ = L.

Figure 4.7(a) describes the performance of the learned policies. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration. On the other hand, the performance when the flattening parameter is fixed at ν = 0 or ν = 1 is not properly improved after the middle of the iterations. The average flattening parameter values depicted in Figure 4.7(b) show that the flattening parameter tends to increase quickly in the beginning and is then kept at medium values. Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV and with ν = 1 are illustrated in Figure 4.8 and Figure 4.9, respectively.

These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.
FIGURE 4.8: Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV (from left to right and top to bottom).

FIGURE 4.9: Motion examples of the inverted pendulum by SRPI with ν = 1 (from left to right and top to bottom).

4.5.2 Mountain Car

Next, we consider the mountain car task illustrated in Figure 4.10. The task consists of a car and two hills whose landscape is described by sin(3x).

FIGURE 4.10: Illustration of the mountain car task.
The top of the right hill is the goal to which we want to guide the car. There are three actions,

  {+0.2, −0.2, 0},

which are the values of the force applied to the car. Note that the force of the car is not strong enough to climb up the slope to reach the goal. The state space S is described by the horizontal position x [m] (∈ [−1.2, 0.5]) and the velocity ẋ [m/s] (∈ [−1.5, 1.5]): s = (x, ẋ)⊤. The position x and velocity ẋ are updated by

  x_{t+1} = x_t + ẋ_{t+1} Δt,
  ẋ_{t+1} = ẋ_t + ( −9.8 w cos(3x_t) + a_t/w − k ẋ_t ) Δt,

where a_t (∈ A) is the action chosen at time t. The reward function R(s, a, s′) is defined as

  R(s, a, s′) = 1 if x_{s′} ≥ 0.5, and −0.01 otherwise,

where x_{s′} denotes the horizontal position x of state s′. The problem parameters are set as follows: the mass of the car w is 0.2 [kg], the friction coefficient k is 0.3, and the simulation time step Δt is 0.1 [s].

The same experimental setup as in the swing-up inverted pendulum task of Section 4.5.1 is used, except that the number of Gaussian kernels is 36, the kernel standard deviation is set at σ = 1, and the kernel centers are allocated over the following grid points:

  {−1.2, 0.35, 0.5} × {−1.5, −0.5, 0.5, 1.5}.
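As with the pendulum, the mountain-car dynamics can be simulated with one Euler step per action; the sketch below is an illustration following the update equations and constants given in the text.

import numpy as np

def mountain_car_step(x, x_dot, a, w=0.2, k=0.3, dt=0.1):
    # One Euler step of the mountain-car dynamics; a is the applied force in {+0.2, -0.2, 0}.
    x_dot_next = x_dot + (-9.8 * w * np.cos(3 * x) + a / w - k * x_dot) * dt
    x_next = x + x_dot_next * dt
    reward = 1.0 if x_next >= 0.5 else -0.01   # R(s, a, s') from the text
    return x_next, x_dot_next, reward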
FIGURE 4.11: Results of sample-reuse policy iteration in the mountain-car task. The agent collects the training sample H^{π_L} (N = 10 and T = 100) at every iteration and policy evaluation is performed using all collected samples H^{π_1}, …, H^{π_L}. (a) Performance measured by the average return computed from test samples over 20 trials; the total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values used by SRPI over 20 trials.
Figure 4.11(a) shows the performance of the learned policies measured by the average return computed from the test samples. The graph shows similar tendencies to the swing-up inverted pendulum task for SRPI and ν = 1, while the method with ν = 0 performs relatively well this time. This implies that the bias in the previously collected samples does not affect the estimation of the value functions that strongly, because the function approximator is better suited to represent the value function for this problem. The average flattening parameter values (cf. Figure 4.11(b)) show that the flattening parameter decreases soon after the initial increase, and then smaller values tend to be chosen. This indicates that SRPI tends to use low-variance estimators in this task. Motion examples by SRPI with ν chosen by IWCV are illustrated in Figure 4.12.

These results show that SRPI can perform stable and fast learning by effectively reusing previously collected data.
FIGURE 4.12: Motion examples of the mountain car by SRPI with ν chosen by IWCV (from left to right and top to bottom).

4.6 Remarks

Instability has been one of the critical limitations of importance-sampling techniques, which often makes off-policy methods impractical. To overcome this weakness, an adaptive importance-sampling technique was introduced for controlling the trade-off between consistency and stability in value function approximation. Furthermore, importance-weighted cross-validation was introduced for automatically choosing the trade-off parameter.

The range of application of importance sampling is not limited to policy iteration. We will explain how importance sampling can be utilized for sample reuse in the policy search frameworks in Chapter 8 and Chapter 9.
Chapter 5

Active Learning in Policy Iteration

In Chapter 4, we considered the off-policy situation where the data-collecting policy and the target policy are different. In the framework of sample-reuse policy iteration, new samples are always chosen following the target policy. However, a clever choice of sampling policies can actually further improve the performance. The topic of choosing sampling policies is called active learning in statistics and machine learning. In this chapter, we address the problem of choosing sampling policies in sample-reuse policy iteration. In Section 5.1, we explain how a statistical active learning method can be employed for optimizing the sampling policy in value function approximation. In Section 5.2, we introduce active policy iteration, which incorporates the active learning idea into the framework of sample-reuse policy iteration. The effectiveness of active policy iteration is numerically investigated in Section 5.3, and finally this chapter is concluded in Section 5.4.
5.1 Efficient Exploration with Active Learning

The accuracy of the estimated value functions depends on the training samples collected following the sampling policy π̃(a|s). In this section, we explain how a statistical active learning method (Sugiyama, 2006) can be employed for value function approximation.

5.1.1 Problem Setup

Let us consider a situation where collecting state-action trajectory samples is easy and cheap, but gathering immediate reward samples is hard and expensive. For example, consider a robot-arm control task of hitting a ball with a bat and driving the ball as far away as possible (see Figure 5.6). Let us adopt the carry of the ball as the immediate reward. In this setting, obtaining state-action trajectory samples of the robot arm is easy and relatively cheap since we just need to control the robot arm and record its state-action trajectories over time. However, explicitly computing the carry of the ball from the state-action samples is hard due to friction and elasticity of the links, air resistance, air currents, and so on. For this reason, in practice, we may have to put the robot in open space, let the robot really hit the ball, and measure the carry of the ball manually. Thus, gathering immediate reward samples is much more expensive than gathering the state-action trajectory samples. In such a situation, immediate reward samples are too expensive to be used for designing the sampling policy; only state-action trajectory samples may be used for designing sampling policies.

The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. However, since the generalization error is not accessible in practice, it needs to be estimated from samples for performing active learning. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples without using immediate reward samples. This means that standard generalization error estimation techniques such as cross-validation cannot be employed. Below, we explain how the generalization error can be estimated without the reward samples.
5.1.2 Decomposition of the Generalization Error

The information we are allowed to use for estimating the generalization error is a set of roll-out samples without immediate rewards:

  H^π̃ = {h^π̃_1, …, h^π̃_N},

where each episodic sample h^π̃_n is given as

  h^π̃_n = [s^π̃_{1,n}, a^π̃_{1,n}, …, s^π̃_{T,n}, a^π̃_{T,n}, s^π̃_{T+1,n}].

Let us define the deviation of an observed immediate reward r^π̃_{t,n} from its expectation r(s^π̃_{t,n}, a^π̃_{t,n}) as

  ǫ^π̃_{t,n} = r^π̃_{t,n} − r(s^π̃_{t,n}, a^π̃_{t,n}).

Note that ǫ^π̃_{t,n} can be regarded as additive noise in the context of least-squares function fitting. By definition, ǫ^π̃_{t,n} has mean zero and its variance generally depends on s^π̃_{t,n} and a^π̃_{t,n}, i.e., the noise is heteroscedastic (Bishop, 2006). However, since estimating the variance of ǫ^π̃_{t,n} without using reward samples is not generally possible, we ignore the dependence of the variance on s^π̃_{t,n} and a^π̃_{t,n}. Let us denote the input-independent common variance by σ².

We would like to estimate the generalization error,

  G(θ̂) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ̂⊤ψ̂(s_t, a_t; H^π̃) − r(s_t, a_t) )² ],

from H^π̃. Its expectation over the "noise" can be decomposed as follows (Sugiyama, 2006):

  E_{ǫ^π̃} [ G(θ̂) ] = Bias + Variance + ModelError,

where E_{ǫ^π̃} denotes the expectation over the "noise" {ǫ^π̃_{t,n}}_{t=1,n=1}^{T,N}. "Bias," "Variance," and "ModelError" are the bias term, the variance term, and the model error term defined by

  Bias = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T { (E_{ǫ^π̃}[θ̂] − θ*)⊤ ψ̂(s_t, a_t; H^π̃) }² ],
  Variance = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T { (θ̂ − E_{ǫ^π̃}[θ̂])⊤ ψ̂(s_t, a_t; H^π̃) }² ],
  ModelError = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ*⊤ ψ̂(s_t, a_t; H^π̃) − r(s_t, a_t) )² ].

θ* denotes the optimal parameter in the model:

  θ* = argmin_θ E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ].

Note that, for a linear estimator θ̂ such that

  θ̂ = L̂ r,

where L̂ is some matrix and r is the NT-dimensional vector defined as

  r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}),

the variance term can be expressed in a compact form as

  Variance = σ² tr(U L̂ L̂⊤),

where the matrix U is defined as

  U = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ψ̂(s_t, a_t; H^π̃) ψ̂(s_t, a_t; H^π̃)⊤ ].    (5.1)
5.1.3
EstimationofGeneralizationError
Sinceweareinterestedinfindingaminimizerofthegeneralizationerror
withrespecttoe
π,themodelerror,whichisconstant,canbesafelyignoredin
generalizationerrorestimation.Ontheotherhand,thebiastermincludesthe
68
StatisticalReinforcementLearning
unknownoptimalparameterθ∗.Thus,itmaynotbepossibletoestimatethebiastermwithoutusingrewardsamples.Similarly,itmaynotbepossibleto
estimatethe“noise”varianceσ2includedinthevariancetermwithoutusing
rewardsamples.
Itisknownthatthebiastermissmallenoughtobeneglectedwhenthe
modelisapproximatelycorrect(Sugiyama,2006),i.e.,θ∗⊤b
ψ(s,a)approxi-
matelyagreeswiththetruefunctionr(s,a).Thenwehave
h
i
⊤EǫeπG(b
θ)−ModelError−Bias∝tr(UbLb
L),
(5.2)
whichdoesnotrequireimmediaterewardsamplesforitscomputation.Since
Epeπ(h)includedinUisnotaccessible(seeEq.(5.1)),Uisreplacedbyits
consistentestimatorb
U:
N
XT
X
b
1
U=
b
ψ(seπ
NT
t,n,ae
π
t,n;He
π)b
ψ(seπt,n,aeπt,n;Heπ)⊤b
wt,n.
n=1t=1
Consequently, the following generalization error estimator is obtained:

J = \mathrm{tr}( \hat{U} \hat{L} \hat{L}^\top ),

which can be computed only from H^{\tilde{\pi}} and thus can be employed in the active learning scenarios. If it is possible to gather H^{\tilde{\pi}} multiple times, the above J may be computed multiple times and their average used as a generalization error estimator.

Note that the values of the generalization error estimator J and the true generalization error G are not directly comparable since irrelevant additive and multiplicative constants are ignored (see Eq. (5.2)). However, this is no problem as long as the estimator J has a similar profile to the true error G as a function of sampling policy \tilde{\pi}, since the purpose of deriving a generalization error estimator in active learning is not to approximate the true generalization error itself, but to approximate the minimizer of the true generalization error with respect to sampling policy \tilde{\pi}.
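The quantities entering J are all computable from reward-free roll-outs. The following is a minimal sketch, not the book's reference implementation, of how J could be evaluated with NumPy; the function name and the ridge term 10^{-3} (taken from the experimental setup later in this section) are assumptions, and the rows of the basis matrix are assumed to be ordered consistently with the importance weights.

```python
import numpy as np

def reward_free_criterion(Psi, w):
    """Sketch of J = tr(U_hat L_hat L_hat^T) for one set of reward-free roll-outs.

    Psi : (NT, B) matrix whose rows are psi_hat(s, a; H) for all visited (s, a).
    w   : (NT,) importance weights toward the evaluation policy
          (all ones when the sampling policy equals the evaluation policy).
    """
    NT, B = Psi.shape
    # Weighted least-squares "hat" matrix L_hat = (Psi^T W Psi)^{-1} Psi^T W,
    # with a small ridge term for numerical stability (an assumption here).
    A = Psi.T @ (Psi * w[:, None]) + 1e-3 * np.eye(B)
    L = np.linalg.solve(A, Psi.T * w)
    # Consistent estimator U_hat: importance-weighted second-moment matrix.
    U = (Psi.T * w) @ Psi / NT
    return float(np.trace(U @ L @ L.T))
```

Since additive and multiplicative constants are irrelevant, only the relative ordering of such J values across candidate sampling policies matters.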
5.1.4    Designing Sampling Policies

Based on the generalization error estimator derived above, a sampling policy is designed as follows:

1. Prepare K candidates of the sampling policy: \{\tilde{\pi}_k\}_{k=1}^{K}.

2. Collect episodic samples without immediate rewards for each sampling-policy candidate: \{H^{\tilde{\pi}_k}\}_{k=1}^{K}.

3. Estimate U using all samples \{H^{\tilde{\pi}_k}\}_{k=1}^{K}:

\hat{U} = \frac{1}{KNT} \sum_{k=1}^{K} \sum_{n=1}^{N} \sum_{t=1}^{T} \hat{\psi}\big(s^{\tilde{\pi}_k}_{t,n}, a^{\tilde{\pi}_k}_{t,n}; \{H^{\tilde{\pi}_k}\}_{k=1}^{K}\big) \, \hat{\psi}\big(s^{\tilde{\pi}_k}_{t,n}, a^{\tilde{\pi}_k}_{t,n}; \{H^{\tilde{\pi}_k}\}_{k=1}^{K}\big)^\top \hat{w}^{\tilde{\pi}_k}_{t,n},
where \hat{w}^{\tilde{\pi}_k}_{t,n} denotes the importance weight for the k-th sampling policy \tilde{\pi}_k:

\hat{w}^{\tilde{\pi}_k}_{t,n} = \frac{ \prod_{t'=1}^{t} \pi(a^{\tilde{\pi}_k}_{t',n} \mid s^{\tilde{\pi}_k}_{t',n}) }{ \prod_{t'=1}^{t} \tilde{\pi}_k(a^{\tilde{\pi}_k}_{t',n} \mid s^{\tilde{\pi}_k}_{t',n}) }.
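Because the weight at step t is the product of per-step policy ratios up to t, it can be computed cumulatively along an episode. A minimal sketch, assuming the per-step action probabilities under both policies are already available:

```python
import numpy as np

def trajectory_importance_weights(pi_eval_probs, pi_sample_probs):
    """Per-step importance weights for one episode.

    pi_eval_probs   : (T,) probabilities pi(a_t | s_t) under the evaluation policy.
    pi_sample_probs : (T,) probabilities pi_k(a_t | s_t) under the sampling policy.
    Returns w with w[t] = prod_{t'<=t} pi_eval[t'] / prod_{t'<=t} pi_sample[t'].
    """
    ratios = pi_eval_probs / pi_sample_probs
    return np.cumprod(ratios)
```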
4. Estimate the generalization error for each k (a computational sketch of steps 3 and 4 is given after this list):

J_k = \mathrm{tr}\big( \hat{U} \hat{L}^{\tilde{\pi}_k} (\hat{L}^{\tilde{\pi}_k})^\top \big),

where \hat{L}^{\tilde{\pi}_k} is defined as

\hat{L}^{\tilde{\pi}_k} = \big( \hat{\Psi}^{\tilde{\pi}_k\top} \hat{W}^{\tilde{\pi}_k} \hat{\Psi}^{\tilde{\pi}_k} \big)^{-1} \hat{\Psi}^{\tilde{\pi}_k\top} \hat{W}^{\tilde{\pi}_k}.

\hat{\Psi}^{\tilde{\pi}_k} is the NT \times B matrix and \hat{W}^{\tilde{\pi}_k} is the NT \times NT diagonal matrix defined as

\hat{\Psi}^{\tilde{\pi}_k}_{N(t-1)+n, \, b} = \hat{\psi}_b(s^{\tilde{\pi}_k}_{t,n}, a^{\tilde{\pi}_k}_{t,n}),
\hat{W}^{\tilde{\pi}_k}_{N(t-1)+n, \, N(t-1)+n} = \hat{w}^{\tilde{\pi}_k}_{t,n}.

5. (If possible) repeat steps 2 to 4 several times and calculate the average for each k.

6. Determine the sampling policy as

\tilde{\pi}_{\mathrm{AL}} = \mathop{\mathrm{argmin}}_{k=1,\ldots,K} J_k.

7. Collect training samples with immediate rewards following \tilde{\pi}_{\mathrm{AL}}.

8. Learn the value function by least-squares policy iteration using the collected samples.
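The sketch below puts steps 3, 4, and 6 together. It is a simplified illustration, not the book's implementation: the container of per-candidate roll-out data and the ridge term are assumptions, and the basis matrices are assumed to be pre-computed from the pooled reward-free samples.

```python
import numpy as np

def select_sampling_policy(rollouts):
    """rollouts : list of (Psi_k, w_k) pairs, one per candidate sampling policy.
    Psi_k is the (NT, B) basis matrix of reward-free roll-outs collected under
    candidate k, and w_k the corresponding importance weights toward the
    evaluation policy.  Returns the index of the candidate minimizing J_k.
    """
    B = rollouts[0][0].shape[1]
    # Step 3: pooled estimator U_hat over all candidates' samples.
    U, total = np.zeros((B, B)), 0
    for Psi_k, w_k in rollouts:
        U += (Psi_k.T * w_k) @ Psi_k
        total += Psi_k.shape[0]
    U /= total
    # Step 4: J_k = tr(U_hat L_k L_k^T) for each candidate.
    scores = []
    for Psi_k, w_k in rollouts:
        A = Psi_k.T @ (Psi_k * w_k[:, None]) + 1e-3 * np.eye(B)  # ridge assumed
        L_k = np.linalg.solve(A, Psi_k.T * w_k)
        scores.append(np.trace(U @ L_k @ L_k.T))
    # Step 6: pick the candidate with the smallest estimated error.
    return int(np.argmin(scores))
```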
5.1.5    Illustration

Here, the behavior of the active learning method is illustrated on a toy 10-state chain-walk environment shown in Figure 5.1. The MDP consists of 10 states,

S = \{ s^{(i)} \}_{i=1}^{10} = \{ 1, 2, \ldots, 10 \},

and 2 actions,

A = \{ a^{(i)} \}_{i=1}^{2} = \{ \text{"L"}, \text{"R"} \}.
FIGURE 5.1: Ten-state chain walk. Filled and unfilled arrows indicate the transitions when taking action "R" and "L," and solid and dashed lines indicate the successful and failed transitions.
The immediate reward function is defined as

r(s, a, s') = f(s'),

where the profile of the function f(s') is illustrated in Figure 5.2.

The transition probability p(s'|s, a) is indicated by the numbers attached to the arrows in Figure 5.1. For example, p(s^{(2)} | s^{(1)}, a = \text{"R"}) = 0.8 and p(s^{(1)} | s^{(1)}, a = \text{"R"}) = 0.2. Thus, the agent can successfully move in the intended direction with probability 0.8 (indicated by solid-filled arrows in the figure) and the action fails with probability 0.2 (indicated by dashed-filled arrows in the figure). The discount factor \gamma is set at 0.9. The following 12 Gaussian basis functions \phi(s, a) are used:

\phi_{2(i-1)+j}(s, a) =
  I(a = a^{(j)}) \exp\left( -\frac{(s - c_i)^2}{2\tau^2} \right)    for i = 1, \ldots, 5 and j = 1, 2,
  I(a = a^{(j)})                                                     for i = 6 and j = 1, 2,

where c_1 = 1, c_2 = 3, c_3 = 5, c_4 = 7, c_5 = 9, and \tau = 1.5. I(a = a') denotes the indicator function:

I(a = a') = 1 if a = a', and 0 if a \neq a'.
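For concreteness, the 12 basis functions above can be written out as follows. This is a minimal sketch under the assumption that action "L" is indexed 0 and action "R" is indexed 1; the function name is hypothetical.

```python
import numpy as np

def chain_walk_basis(s, a):
    """The 12 basis functions used in the chain-walk illustration.
    s : state in {1, ..., 10};  a : action index in {0, 1} for ("L", "R").
    Returns a vector of length 12 ordered as phi_{2(i-1)+j}.
    """
    centers = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
    tau = 1.5
    phi = np.zeros(12)
    for i, c in enumerate(centers):            # i = 0, ..., 4
        # Indicator I(a = a^(j)) picks out the entry of the taken action.
        phi[2 * i + a] = np.exp(-(s - c) ** 2 / (2 * tau ** 2))
    phi[10 + a] = 1.0                          # constant basis for each action
    return phi
```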
FIGURE 5.2: Profile of the function f(s').

Sampling policies and evaluation policies are constructed as follows. First,
a deterministic "base" policy \pi is prepared, for example, "LLLLLRRRRR," where the i-th letter denotes the action taken at s^{(i)}. Let \pi^{\epsilon} be the "\epsilon-greedy" version of the base policy \pi, i.e., the intended action can be successfully chosen with probability 1 - \epsilon/2 and the other action is chosen with probability \epsilon/2. Experiments are performed for three different evaluation policies:

\pi_1: "RRRRRRRRRR,"
\pi_2: "RRLLLLLRRR,"
\pi_3: "LLLLLRRRRR,"

with \epsilon = 0.1. For each evaluation policy \pi^{0.1}_i (i = 1, 2, 3), 10 candidates \{\tilde{\pi}^{(k)}_i\}_{k=1}^{10} of the sampling policy are prepared, where \tilde{\pi}^{(k)}_i = \pi^{k/10}_i. Note that \tilde{\pi}^{(1)}_i is equivalent to the evaluation policy \pi^{0.1}_i.
For each sampling policy, the active learning criterion J is computed 5 times and their average is taken. The numbers of episodes and steps are set at N = 10 and T = 10, respectively. The initial-state probability p(s) is set to be uniform. When the matrix inverse is computed, 10^{-3} is added to the diagonal elements to avoid degeneracy. This experiment is repeated 100 times with different random seeds, and the mean and standard deviation of the true generalization error and its estimate are evaluated.

The results are depicted in Figure 5.3 as functions of the index k of the sampling policies. The graphs show that the generalization error estimator overall captures the trend of the true generalization error well for all three cases.

Next, the value of the obtained generalization error G is evaluated when k is chosen so that J is minimized (active learning, AL), when the evaluation policy (k = 1) is used for sampling (passive learning, PL), and when k is chosen optimally so that the true generalization error is minimized (optimal, OPT). Figure 5.4 shows that the active learning method compares favorably with passive learning and performs well in reducing the generalization error.
5.2    Active Policy Iteration

In Section 5.1, the unknown generalization error was shown to be accurately estimated without using immediate reward samples in one-step policy evaluation. In this section, this one-step active learning idea is extended to the framework of sample-reuse policy iteration introduced in Chapter 4, which is called active policy iteration. Let us denote the evaluation policy at the l-th iteration by \pi_l.
FIGURE 5.3: The mean and standard deviation of the true generalization error G (left) and the estimated generalization error J (right) over 100 trials, plotted as functions of the sampling policy index k for (a) \pi^{0.1}_1, (b) \pi^{0.1}_2, and (c) \pi^{0.1}_3.
5.2.1    Sample-Reuse Policy Iteration with Active Learning

In the original sample-reuse policy iteration, new data samples H^{\pi_l} are collected following the new target policy \pi_l for the next policy evaluation step:

\pi_1 \xrightarrow{E:\, H^{\pi_1}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\, \{H^{\pi_1}, H^{\pi_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\, \{H^{\pi_1}, H^{\pi_2}, H^{\pi_3}\}} \cdots \xrightarrow{I} \pi_{L+1},
FIGURE 5.4: The box-plots of the values of the obtained generalization error G over 100 trials when k is chosen so that J is minimized (active learning, AL), when the evaluation policy (k = 1) is used for sampling (passive learning, PL), and when k is chosen optimally so that the true generalization error is minimized (optimal, OPT), for (a) \pi^{0.1}_1, (b) \pi^{0.1}_2, and (c) \pi^{0.1}_3. The box-plot notation indicates the 5% quantile, 25% quantile, 50% quantile (i.e., median), 75% quantile, and 95% quantile from bottom to top.
where "E: H" indicates policy evaluation using the data sample H and "I" denotes policy improvement. On the other hand, in active policy iteration, the optimized sampling policy \tilde{\pi}_l is used at each iteration:

\pi_1 \xrightarrow{E:\, H^{\tilde{\pi}_1}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\, \{H^{\tilde{\pi}_1}, H^{\tilde{\pi}_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\, \{H^{\tilde{\pi}_1}, H^{\tilde{\pi}_2}, H^{\tilde{\pi}_3}\}} \cdots \xrightarrow{I} \pi_{L+1}.

Note that, in active policy iteration, the previously collected samples are used not only for value function approximation, but also for active learning. Thus, active policy iteration makes full use of the samples.
5.2.2    Illustration

Here, the behavior of active policy iteration is illustrated using the same 10-state chain-walk problem as Section 5.1.5 (see Figure 5.1).
The initial evaluation policy \pi_1 is set as

\pi_1(a|s) = 0.15 \, p_u(a) + 0.85 \, I\left( a = \mathop{\mathrm{argmax}}_{a'} \hat{Q}_0(s, a') \right),

where p_u(a) denotes the probability mass function of the uniform distribution and

\hat{Q}_0(s, a) = \sum_{b=1}^{12} \phi_b(s, a).

Policies are updated in the l-th iteration using the \epsilon-greedy rule with \epsilon = 0.15/l. In the sampling-policy selection step of the l-th iteration, the following four sampling-policy candidates are prepared:

\tilde{\pi}^{(1)}_l = \pi^{0.15/l}_l, \quad \tilde{\pi}^{(2)}_l = \pi^{0.15/l + 0.15}_l, \quad \tilde{\pi}^{(3)}_l = \pi^{0.15/l + 0.5}_l, \quad \tilde{\pi}^{(4)}_l = \pi^{0.15/l + 0.85}_l,

where \pi_l denotes the policy obtained by greedy update using \hat{Q}^{\pi_{l-1}}.
The number of iterations to learn the policy is set at 7, the number of steps is set at T = 10, and the number N of episodes is different in each iteration and defined as \{N_1, \ldots, N_7\}, where N_l (l = 1, \ldots, 7) denotes the number of episodes collected in the l-th iteration. In this experiment, two types of scheduling are compared: \{5, 5, 3, 3, 3, 1, 1\} and \{3, 3, 3, 3, 3, 3, 3\}, which are referred to as the "decreasing N" strategy and the "fixed N" strategy, respectively. The J-value calculation is repeated 5 times for active learning. The performance of the finally obtained policy \pi_8 is measured by the return for test samples \{r^{\pi_8}_{t,n}\}_{t,n=1}^{T,N} (50 episodes with 50 steps collected following \pi_8):

Performance = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r^{\pi_8}_{t,n},

where the discount factor \gamma is set at 0.9.

The performance of passive learning (PL; the current policy is used as the sampling policy in each iteration) and active learning (AL; the best sampling policy is chosen from the policy candidates prepared in each iteration) is compared. The experiments are repeated 1000 times with different random seeds and the average performance of PL and AL is evaluated. The results are depicted in Figure 5.5, showing that AL works better than PL in both types of episode scheduling, with statistical significance by the t-test at the significance level 1% (Henkel, 1976) for the error values obtained after the 7th iteration. Furthermore, the "decreasing N" strategy outperforms the "fixed N" strategy for both PL and AL, showing the usefulness of the "decreasing N" strategy.
FIGURE 5.5: The mean performance over 1000 trials in the 10-state chain-walk experiment, plotted against the iteration number for AL and PL under the "decreasing N" and "fixed N" strategies. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For both the "decreasing N" and "fixed N" strategies, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% applied to the error values at the 7th iteration.
5.3    Numerical Examples

In this section, the performance of active policy iteration is evaluated using a ball-batting robot illustrated in Figure 5.6, which consists of two links and two joints. The goal of the ball-batting task is to control the robot arm so that it drives the ball as far away as possible. The state space S is continuous and consists of angles \varphi_1 [rad] (\in [0, \pi/4]) and \varphi_2 [rad] (\in [-\pi/4, \pi/4]) and angular velocities \dot{\varphi}_1 [rad/s] and \dot{\varphi}_2 [rad/s]. Thus, a state s (\in S) is described by a 4-dimensional vector s = (\varphi_1, \dot{\varphi}_1, \varphi_2, \dot{\varphi}_2)^\top. The action space A is discrete and contains two elements:

A = \{ a^{(i)} \}_{i=1}^{2} = \{ (50, -35)^\top, (-50, 10)^\top \},

where the i-th element (i = 1, 2) of each vector corresponds to the torque [N·m] added to joint i.

FIGURE 5.6: A ball-batting robot.

The Open Dynamics Engine (http://ode.org/) is used for physical calculations, including the update of the angles and angular velocities, and collision detection between the robot arm, ball, and pin. The simulation time step is set at 7.5 [ms] and the next state is observed after 10 time steps. The action chosen in the current state is taken for 10 time steps. To make the experiments realistic, noise is added to actions: if action (f_1, f_2)^\top is taken, the actual torques applied to the joints are f_1 + \varepsilon_1 and f_2 + \varepsilon_2, where \varepsilon_1 and \varepsilon_2 are drawn independently from the Gaussian distribution with mean 0 and variance 3.

The immediate reward is defined as the carry of the ball. This reward is given only when the robot arm collides with the ball for the first time at state s' after taking action a at current state s. For value function approximation, the following 110 basis functions are used:

\phi_{2(i-1)+j}(s, a) =
  I(a = a^{(j)}) \exp\left( -\frac{\| s - c_i \|^2}{2\tau^2} \right)    for i = 1, \ldots, 54 and j = 1, 2,
  I(a = a^{(j)})                                                         for i = 55 and j = 1, 2,

where \tau is set at 3\pi/2 and the Gaussian centers c_i (i = 1, \ldots, 54) are located on the regular grid \{0, \pi/4\} \times \{-\pi, 0, \pi\} \times \{-\pi/4, 0, \pi/4\} \times \{-\pi, 0, \pi\}.

For L = 7 and T = 10, the "decreasing N" strategy and the "fixed N" strategy are compared. The "decreasing N" strategy is defined by \{10, 10, 7, 7, 7, 4, 4\} and the "fixed N" strategy is defined by \{7, 7, 7, 7, 7, 7, 7\}.
The initial state is always set at s = (\pi/4, 0, 0, 0)^\top, and J-calculations are repeated 5 times in the active learning method. The initial evaluation policy \pi_1 is set at the \epsilon-greedy policy defined as

\pi_1(a|s) = 0.15 \, p_u(a) + 0.85 \, I\left( a = \mathop{\mathrm{argmax}}_{a'} \hat{Q}_0(s, a') \right),
\hat{Q}_0(s, a) = \sum_{b=1}^{110} \phi_b(s, a).

Policies are updated in the l-th iteration using the \epsilon-greedy rule with \epsilon = 0.15/l. Sampling-policy candidates are prepared in the same way as in the chain-walk experiment in Section 5.2.2.
The discount factor \gamma is set at 1, and the performance of the learned policy \pi_8 is measured by the return for test samples \{r^{\pi_8}_{t,n}\}_{t,n=1}^{10,20} (20 episodes with 10 steps collected following \pi_8): \sum_{n=1}^{N} \sum_{t=1}^{T} r^{\pi_8}_{t,n}.

FIGURE 5.7: The mean performance over 500 trials in the ball-batting experiment, plotted against the iteration number for AL and PL under the "decreasing N" and "fixed N" strategies. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For the "decreasing N" strategy, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% for the error values at the 7th iteration.
The experiment is repeated 500 times with different random seeds and the average performance of each learning method is evaluated. The results, depicted in Figure 5.7, show that active learning outperforms passive learning. For the "decreasing N" strategy, the performance difference is statistically significant by the t-test at the significance level 1% for the error values after the 7th iteration.

Motion examples of the ball-batting robot trained with active learning and passive learning are illustrated in Figure 5.8 and Figure 5.9, respectively.

5.4    Remarks

When we cannot afford to collect many training samples due to high sampling costs, it is crucial to choose the most informative samples for efficiently learning the value function. In this chapter, an active learning method for optimizing data sampling strategies was introduced in the framework of sample-reuse policy iteration, and the resulting active policy iteration was demonstrated to be promising.
FIGURE 5.8: A motion example of the ball-batting robot trained with active learning (from left to right and top to bottom).

FIGURE 5.9: A motion example of the ball-batting robot trained with passive learning (from left to right and top to bottom).
Chapter 6

Robust Policy Iteration

The framework of least-squares policy iteration (LSPI) introduced in Chapter 2 is useful thanks to its computational efficiency and analytical tractability. However, due to the squared loss, it tends to be sensitive to outliers in observed rewards. In this chapter, we introduce an alternative policy iteration method that employs the absolute loss for enhancing robustness and reliability. In Section 6.1, the robustness and reliability brought by the use of the absolute loss are discussed. In Section 6.2, the policy iteration framework with the absolute loss, called least-absolute policy iteration (LAPI), is introduced. In Section 6.3, the usefulness of LAPI is illustrated through experiments. Variations of LAPI are considered in Section 6.4, and finally this chapter is concluded in Section 6.5.
6.1    Robustness and Reliability in Policy Iteration

The basic idea of LSPI is to fit a linear model to immediate rewards under the squared loss, while the absolute loss is used in this chapter (see Figure 6.1). This is just a replacement of loss functions, but this modification highly enhances robustness and reliability.
6.1.1    Robustness

In many robotics applications, immediate rewards are obtained through measurement such as distance sensors or computer vision. Due to intrinsic measurement noise or recognition error, the obtained rewards often deviate from the true values. In particular, the rewards occasionally contain outliers, which are significantly different from regular values.

Residual minimization under the squared loss amounts to obtaining the mean of samples \{x_i\}_{i=1}^{m}:

\mathop{\mathrm{argmin}}_{c} \sum_{i=1}^{m} (x_i - c)^2 = \mathrm{mean}(\{x_i\}_{i=1}^{m}) = \frac{1}{m} \sum_{i=1}^{m} x_i.

FIGURE 6.1: The absolute and squared loss functions for reducing the temporal-difference error.

If one of the values is an outlier having a very large or small value, the mean would be strongly affected by this outlier. This means that all the values \{x_i\}_{i=1}^{m} are responsible for the mean, and therefore even a single outlier observation can significantly damage the learned result.

On the other hand, residual minimization under the absolute loss amounts to obtaining the median:

\mathop{\mathrm{argmin}}_{c} \sum_{i=1}^{2n+1} |x_i - c| = \mathrm{median}(\{x_i\}_{i=1}^{2n+1}) = x_{n+1},

where x_1 \leq x_2 \leq \cdots \leq x_{2n+1}. The median is influenced not by the magnitude of the values \{x_i\}_{i=1}^{2n+1} but only by their order. Thus, as long as the order is kept unchanged, the median is not affected by outliers. In fact, the median is known to be the most robust estimator in light of breakdown-point analysis (Huber, 1981; Rousseeuw & Leroy, 1987).

Therefore, the use of the absolute loss would remedy the problem of robustness in policy iteration.
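The contrast between the two estimators is easy to see numerically. The following tiny example, a sketch with made-up numbers rather than data from the book, shows how a single outlier moves the mean far away while barely changing the median:

```python
import numpy as np

# A single outlier shifts the mean arbitrarily far,
# while the median only depends on the ordering of the values.
x = np.array([1.0, 1.1, 0.9, 1.2, 1.0])
x_with_outlier = np.append(x, 1000.0)

print(np.mean(x), np.mean(x_with_outlier))      # ~1.04  vs  ~167.5
print(np.median(x), np.median(x_with_outlier))  #  1.0   vs  ~1.05
```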
6.1.2    Reliability

In practical robot-control tasks, we often want to attain a stable performance, rather than to achieve a "dream" performance with little chance of success. For example, in the acquisition of a humanoid gait, we may want the robot to walk forward in a stable manner with a high probability of success, rather than to rush very fast at a chance level.

On the other hand, we do not want to be too conservative when training robots. If we are overly concerned with unrealistic failure, no practically useful control policy can be obtained. For example, any robot can be broken in principle if it is activated for a long time. However, if we fear this fact too much, we may end up praising a control policy that does not move the robot at all, which is obviously nonsense.

Since the squared-loss solution is not robust against outliers, it is sensitive to rare events with either positive or negative very large immediate rewards. Consequently, the squared loss prefers an extraordinarily successful motion even if the success probability is very low. Similarly, it dislikes an unrealistic trouble even if such a terrible event may not happen in reality. On the other hand, the absolute-loss solution is not easily affected by such rare events due to its robustness. Therefore, the use of the absolute loss would produce a reliable control policy even in the presence of such extreme events.
6.2    Least Absolute Policy Iteration

In this section, a policy iteration method with the absolute loss is introduced.

6.2.1    Algorithm

Instead of the squared loss, a linear model is fitted to immediate rewards under the absolute loss as

\min_{\theta} \sum_{t=1}^{T} \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right|.

This minimization problem looks cumbersome due to the absolute value operator, which is non-differentiable, but it can be reduced to the following linear program (Boyd & Vandenberghe, 2004):

\min_{\theta, \{b_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} b_t
subject to  -b_t \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b_t,  \quad t = 1, \ldots, T.

The number of constraints is T in the above linear program. When T is large, we may employ sophisticated optimization techniques such as column generation (Demiriz et al., 2002) for efficiently solving the linear programming problem. Alternatively, an approximate solution can be obtained by gradient descent or (quasi-)Newton methods if the absolute loss is approximated by a smooth loss (see, e.g., Section 6.4.1).

The policy iteration method based on the absolute loss is called least absolute policy iteration (LAPI).
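The linear program above can be handed directly to an off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog; it is an illustration of the formulation rather than the book's implementation, and the function name and the choice of the HiGHS solver backend are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def least_absolute_fit(Psi, r):
    """Fit theta minimizing sum_t |theta^T psi_t - r_t| via the LP above.

    Psi : (T, B) matrix whose t-th row is psi_hat(s_t, a_t).
    r   : (T,) vector of observed immediate rewards.
    """
    T, B = Psi.shape
    # Decision variables: [theta (B entries), b (T slack entries)].
    c = np.concatenate([np.zeros(B), np.ones(T)])       # minimize sum of b_t
    # theta^T psi_t - b_t <= r_t   and   -theta^T psi_t - b_t <= -r_t
    A_ub = np.block([[Psi, -np.eye(T)],
                     [-Psi, -np.eye(T)]])
    b_ub = np.concatenate([r, -r])
    bounds = [(None, None)] * B + [(0, None)] * T        # theta free, b_t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:B]
```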
6.2.2    Illustration

For illustration purposes, let us consider the 4-state MDP problem described in Figure 6.2. The agent is initially located at state s^{(0)} and the actions the agent is allowed to take are moving to the left or right state. If the left-movement action is chosen, the agent always receives a small positive reward +0.1 at s^{(L)}. On the other hand, if the right-movement action is chosen, the agent receives a negative reward -1 with probability 0.9999 at s^{(R1)}, or it receives a very large positive reward +20,000 with probability 0.0001 at s^{(R2)}. The mean and median rewards for left movement are both +0.1, while the mean and median rewards for right movement are +1.0001 and -1, respectively.

FIGURE 6.2: Illustrative MDP problem.

If Q(s^{(0)}, "Left") and Q(s^{(0)}, "Right") are approximated by the least-squares method, it returns the mean rewards, i.e., +0.1 and +1.0001, respectively. Thus, the least-squares method prefers right movement, which is a "gambling" policy: the negative reward -1 is almost always obtained at s^{(R1)}, but it is possible to obtain the very high reward +20,000 with a very small probability at s^{(R2)}. On the other hand, if Q(s^{(0)}, "Left") and Q(s^{(0)}, "Right") are approximated by the least absolute method, it returns the median rewards, i.e., +0.1 and -1, respectively. Thus, the least absolute method prefers left movement, which is a stable policy: the agent can always receive the small positive reward +0.1 at s^{(L)}.

If all the rewards in Figure 6.2 are negated, the value functions are also negated and a different interpretation can be obtained: the least-squares method is afraid of the risk of receiving the very large negative reward -20,000 at s^{(R2)} with a very low probability, and consequently it ends up with a very conservative policy in which the agent always receives the negative reward -0.1 at s^{(L)}. On the other hand, the least absolute method tries to receive the positive reward +1 at s^{(R1)} without being afraid of visiting s^{(R2)} too much.

As illustrated above, the least absolute method tends to provide qualitatively different solutions from the least-squares method.
6.2.3    Properties

Here, properties of the least absolute method are investigated when the model \hat{Q}(s, a) is correctly specified, i.e., there exists a parameter \theta^* such that \hat{Q}(s, a) = Q(s, a) for all s and a.

Under the correct model assumption, when the number of samples T tends to infinity, the least absolute solution \hat{\theta} satisfies the following equation (Koenker, 2005):

\hat{\theta}^\top \psi(s, a) = M_{p(s'|s,a)}[ r(s, a, s') ]  for all s and a,        (6.1)

where M_{p(s'|s,a)} denotes the conditional median of s' over p(s'|s, a) given s and a. \psi(s, a) is defined by

\psi(s, a) = \phi(s, a) - \gamma E_{p(s'|s,a)} E_{\pi(a'|s')} [ \phi(s', a') ],

where E_{p(s'|s,a)} denotes the conditional expectation of s' over p(s'|s, a) given s and a, and E_{\pi(a'|s')} denotes the conditional expectation of a' over \pi(a'|s') given s'.

From Eq. (6.1), we can obtain the following Bellman-like recursive expression:

\hat{Q}(s, a) = M_{p(s'|s,a)}[ r(s, a, s') ] + \gamma E_{p(s'|s,a)} E_{\pi(a'|s')} [ \hat{Q}(s', a') ].        (6.2)

Note that in the case of the least-squares method, where

\hat{\theta}^\top \psi(s, a) = E_{p(s'|s,a)}[ r(s, a, s') ]

is satisfied in the limit under the correct model assumption, we have

\hat{Q}(s, a) = E_{p(s'|s,a)}[ r(s, a, s') ] + \gamma E_{p(s'|s,a)} E_{\pi(a'|s')} [ \hat{Q}(s', a') ].        (6.3)

This is the ordinary Bellman equation, and thus Eq. (6.2) could be regarded as an extension of the Bellman equation to the absolute loss.

From the ordinary Bellman equation (6.3), we can recover the original definition of the state-action value function Q(s, a):

Q^{\pi}(s, a) = E_{p^{\pi}(h)} \left[ \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}) \,\middle|\, s_1 = s, a_1 = a \right],

where E_{p^{\pi}(h)} denotes the expectation over trajectory h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}] and "| s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. In contrast, from the absolute-loss Bellman equation (6.2), we have

Q'(s, a) = E_{p^{\pi}(h)} \left[ \sum_{t=1}^{T} \gamma^{t-1} M_{p(s_{t+1}|s_t,a_t)}[ r(s_t, a_t, s_{t+1}) ] \,\middle|\, s_1 = s, a_1 = a \right].
This is the value function that the least absolute method is trying to approximate, which is different from the ordinary value function. Since the discounted sum of median rewards, not the expected rewards, is maximized, the least absolute method is expected to be less sensitive to outliers than the least-squares method.

FIGURE 6.3: Illustration of the acrobot, consisting of a horizontal bar, a first link and first joint, a second link and second joint, and an end effector. The goal is to swing up the end effector by only controlling the second joint.
6.3    Numerical Examples

In this section, the behavior of LAPI is illustrated through experiments using the acrobot shown in Figure 6.3. The acrobot is an under-actuated system and consists of two links, two joints, and an end effector. The length of each link is 0.3 [m], and the diameter of each joint is 0.15 [m]. The diameter of the end effector is 0.10 [m], and the height of the horizontal bar is 1.2 [m]. The first joint connects the first link to the horizontal bar and is not controllable. The second joint connects the first link to the second link and is controllable. The end effector is attached to the tip of the second link. The control command (action) we can choose is to apply positive torque +50 [N·m], no torque 0 [N·m], or negative torque -50 [N·m] to the second joint. Note that the acrobot moves only within a plane orthogonal to the horizontal bar.

The goal is to acquire a control policy such that the end effector is swung up as high as possible. The state space consists of the angle \theta_i [rad] and angular velocity \dot{\theta}_i [rad/s] of the first and second joints (i = 1, 2). The immediate
reward is given according to the height y of the center of the end effector as

r(s, a, s') =
  10                                                        if y > 1.75,
  \exp\left( -\frac{(y - 1.85)^2}{2 (0.2)^2} \right)         if 1.5 < y \leq 1.75,
  0.001                                                      otherwise.

Note that 0.55 \leq y \leq 1.85 in the current setting.

Here, suppose that the length of the links is unknown. Thus, the height y cannot be directly computed from state information. The height of the end effector is supposed to be estimated from an image taken by a camera: the end effector is detected in the image and then its vertical coordinate is computed. Due to recognition error, the estimated height is highly noisy and could contain outliers.
In each policy iteration step, 20 episodic training samples of length 150 are gathered. The performance of the obtained policy is evaluated using 50 episodic test samples of length 300. Note that the test samples are not used for learning policies; they are used only for evaluating learned policies. The policies are updated in a soft-max manner:

\pi(a|s) \longleftarrow \frac{ \exp( Q(s, a)/\eta ) }{ \sum_{a' \in A} \exp( Q(s, a')/\eta ) },

where \eta = 10^{-l+1} with l being the iteration number. The discount factor is set at \gamma = 1, i.e., no discount. As basis functions for value function approximation, the Gaussian kernel with standard deviation \pi is used, where the Gaussian centers are located at

(\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2) \in \{-\pi, -\pi/2, 0, \pi/2, \pi\} \times \{-\pi, 0, \pi\} \times \{-\pi, 0, \pi\} \times \{-\pi, 0, \pi\}.

The above 135 (= 5 × 3 × 3 × 3) Gaussian kernels are defined for each of the three actions. Thus, 405 (= 135 × 3) kernels are used in total.

Let us consider two noise environments: one is the case where no noise is added to the rewards, and the other is the case where Laplacian noise with mean zero and standard deviation 2 is added to the rewards with probability 0.1. Note that the tail of the Laplacian density is heavier than that of the Gaussian density (see Figure 6.4), implying that a small number of outliers tend to be included in the Laplacian noise environment. An example of the noisy training samples is shown in Figure 6.5. For each noise environment, the experiment is repeated 50 times with different random seeds and the averages of the sum of rewards obtained by LAPI and LSPI are summarized in Figure 6.6. The best method in terms of the mean value and comparable methods according to the t-test (Henkel, 1976) at the significance level 5% are specified in the figure.

In the noiseless case (see Figure 6.6(a)), both LAPI and LSPI improve the performance over iterations in a comparable way. On the other hand, in the noisy case (see Figure 6.6(b)), the performance of LSPI is not improved much due to outliers, while LAPI still produces a good control policy.
FIGURE 6.4: Probability density functions of Gaussian and Laplacian distributions.

FIGURE 6.5: Example of training samples with Laplacian noise. The horizontal axis is the height of the end effector. The solid line denotes the noiseless immediate reward and the scattered points denote noisy training samples.
FIGURE 6.6: Average and standard deviation of the sum of rewards over 50 runs for the acrobot swing-up simulation, plotted against the iteration number for LAPI and LSPI: (a) no noise, (b) Laplacian noise. The best method in terms of the mean value and comparable methods according to the t-test at the significance level 5% are specified in the figure.
Figure 6.7 and Figure 6.8 depict motion examples of the acrobot learned by LSPI and LAPI, respectively, in the Laplacian-noise environment. When LSPI is used (Figure 6.7), the second joint is swung hard in order to lift the end effector. However, the end effector tends to stay below the horizontal bar, and therefore only a small amount of reward can be obtained by LSPI. This would be due to the existence of outliers. On the other hand, when LAPI is used (Figure 6.8), the end effector goes beyond the bar, and therefore a large amount of reward can be obtained even in the presence of outliers.
FIGURE 6.7: A motion example of the acrobot learned by LSPI in the Laplacian-noise environment (from left to right and top to bottom).

FIGURE 6.8: A motion example of the acrobot learned by LAPI in the Laplacian-noise environment (from left to right and top to bottom).
6.4    Possible Extensions

In this section, possible variations of LAPI are considered.

6.4.1    Huber Loss

Use of the Huber loss corresponds to making a compromise between the squared and absolute loss functions (Huber, 1981):

\mathop{\mathrm{argmin}}_{\theta} \sum_{t=1}^{T} \rho^{\mathrm{HB}}_{\kappa} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right),

where \kappa (\geq 0) is a threshold parameter and \rho^{\mathrm{HB}}_{\kappa} is the Huber loss defined as follows (see Figure 6.9):

\rho^{\mathrm{HB}}_{\kappa}(x) =
  \frac{1}{2} x^2                          if |x| \leq \kappa,
  \kappa |x| - \frac{1}{2} \kappa^2        if |x| > \kappa.

The Huber loss converges to the absolute loss as \kappa tends to zero, and it converges to the squared loss as \kappa tends to infinity.

The Huber loss function is rather intricate, but the solution can be obtained by solving the following convex quadratic program (Mangasarian & Musicant, 2000):

\min_{\theta, \{b_t\}_{t=1}^{T}, \{c_t\}_{t=1}^{T}} \; \frac{1}{2} \sum_{t=1}^{T} b_t^2 + \kappa \sum_{t=1}^{T} c_t
subject to  -c_t \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t - b_t \leq c_t,  \quad t = 1, \ldots, T.
Another way to obtain the solution is to use a gradient descent method, where the parameter \theta is updated as follows until convergence:

\theta \leftarrow \theta - \varepsilon \sum_{t=1}^{T} \Delta\rho^{\mathrm{HB}}_{\kappa} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right) \hat{\psi}(s_t, a_t).

Here, \varepsilon (> 0) is the learning rate and \Delta\rho^{\mathrm{HB}}_{\kappa} is the derivative of \rho^{\mathrm{HB}}_{\kappa}, given by

\Delta\rho^{\mathrm{HB}}_{\kappa}(x) =
  x          if |x| \leq \kappa,
  \kappa     if x > \kappa,
  -\kappa    if x < -\kappa.

In practice, the following stochastic gradient method (Amari, 1967) would be more convenient. For a randomly chosen index t \in \{1, \ldots, T\} in each iteration, repeat the following update until convergence:

\theta \leftarrow \theta - \varepsilon \, \Delta\rho^{\mathrm{HB}}_{\kappa} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right) \hat{\psi}(s_t, a_t).

The plain/stochastic gradient methods also come in handy when approximating the least absolute solution, since the Huber loss function with small \kappa can be regarded as a smooth approximation to the absolute loss.

FIGURE 6.9: The Huber loss function (with \kappa = 1), the pinball loss function (with \tau = 0.3), and the deadzone-linear loss function (with \epsilon = 1).
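The stochastic update above is a one-liner once one notices that the derivative of the Huber loss is simply the residual clipped to [-\kappa, \kappa]. A minimal sketch (the function name and default step size are assumptions):

```python
import numpy as np

def huber_sgd_step(theta, psi_t, r_t, kappa=1.0, eps=0.01):
    """One stochastic-gradient update for the Huber-loss fit sketched above.

    theta : (B,) current parameter vector;  psi_t : (B,) basis vector for the
    randomly chosen sample t;  r_t : observed immediate reward.
    """
    residual = theta @ psi_t - r_t
    # Derivative of the Huber loss: the residual clipped to [-kappa, kappa].
    grad = np.clip(residual, -kappa, kappa) * psi_t
    return theta - eps * grad
```

Using a small kappa makes the same routine behave as a smooth approximation of the least absolute fit, as noted above.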
6.4.2    Pinball Loss

The absolute loss induces the median, which corresponds to the 50-percentile point. A similar discussion is also possible for an arbitrary percentile 100\tau (0 \leq \tau \leq 1) based on the pinball loss (Koenker, 2005):

\min_{\theta} \sum_{t=1}^{T} \rho^{\mathrm{PB}}_{\tau} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right),

where \rho^{\mathrm{PB}}_{\tau}(x) is the pinball loss defined by

\rho^{\mathrm{PB}}_{\tau}(x) =
  2\tau x            if x \geq 0,
  2(\tau - 1) x      if x < 0.

The profile of the pinball loss is depicted in Figure 6.9. When \tau = 0.5, the pinball loss is reduced to the absolute loss.

The solution can be obtained by solving the following linear program:

\min_{\theta, \{b_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} b_t
subject to  \frac{b_t}{2(\tau - 1)} \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq \frac{b_t}{2\tau},  \quad t = 1, \ldots, T.
6.4.3    Deadzone-Linear Loss

Another variant of the absolute loss is the deadzone-linear loss (see Figure 6.9):

\min_{\theta} \sum_{t=1}^{T} \rho^{\mathrm{DL}}_{\epsilon} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right),

where \rho^{\mathrm{DL}}_{\epsilon}(x) is the deadzone-linear loss defined by

\rho^{\mathrm{DL}}_{\epsilon}(x) =
  0                   if |x| \leq \epsilon,
  |x| - \epsilon      if |x| > \epsilon.

That is, if the magnitude of the error is less than \epsilon, no error is assessed. This loss is also called the \epsilon-insensitive loss and is used in support vector regression (Vapnik, 1998).

When \epsilon = 0, the deadzone-linear loss is reduced to the absolute loss. Thus, the deadzone-linear loss and the absolute loss are related to each other. However, the effect of the deadzone-linear loss is completely opposite to the absolute loss when \epsilon > 0. The influence of "good" samples (with small error) is deemphasized in the deadzone-linear loss, while the absolute loss tends to suppress the influence of "bad" samples (with large error) compared with the squared loss.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

\min_{\theta, \{b_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} b_t
subject to  -b_t - \epsilon \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b_t + \epsilon,
            b_t \geq 0,  \quad t = 1, \ldots, T.
6.4.4    Chebyshev Approximation

The Chebyshev approximation minimizes the error for the "worst" sample:

\min_{\theta} \max_{t=1,\ldots,T} \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right|.

This is also called the minimax approximation.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

\min_{\theta, b} \; b
subject to  -b \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b,  \quad t = 1, \ldots, T.
FIGURE 6.10: The conditional value-at-risk (CVaR).
6.4.5    Conditional Value-At-Risk

In the area of finance, the conditional value-at-risk (CVaR) is a popular risk measure (Rockafellar & Uryasev, 2002). The CVaR corresponds to the mean of the error for a set of "bad" samples (see Figure 6.10).

More specifically, let us consider the distribution of the absolute error over all training samples \{(s_t, a_t, r_t)\}_{t=1}^{T}:

\Phi(\alpha | \theta) = P\left( (s_t, a_t, r_t) : \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| \leq \alpha \right).

For \beta \in [0, 1), let \alpha_{\beta}(\theta) be the 100\beta percentile of the absolute error distribution:

\alpha_{\beta}(\theta) = \min \{ \alpha \mid \Phi(\alpha | \theta) \geq \beta \}.

Thus, only the fraction (1 - \beta) of the absolute error \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| exceeds the threshold \alpha_{\beta}(\theta). \alpha_{\beta}(\theta) is also referred to as the value-at-risk (VaR).

Let us consider the \beta-tail distribution of the absolute error:

\Phi_{\beta}(\alpha | \theta) =
  0                                                      if \alpha < \alpha_{\beta}(\theta),
  \frac{\Phi(\alpha | \theta) - \beta}{1 - \beta}        if \alpha \geq \alpha_{\beta}(\theta).

Let \phi_{\beta}(\theta) be the mean of the \beta-tail distribution of the absolute temporal-difference (TD) error:

\phi_{\beta}(\theta) = E_{\Phi_{\beta}} \left[ \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| \right],

where E_{\Phi_{\beta}} denotes the expectation over the distribution \Phi_{\beta}. \phi_{\beta}(\theta) is called the CVaR. By definition, the CVaR of the absolute error is reduced to the mean absolute error if \beta = 0, and it converges to the worst absolute error as \beta tends to 1. Thus, the CVaR smoothly bridges the least absolute and Chebyshev approximation methods. CVaR is also referred to as the expected shortfall.
The CVaR minimization problem in the current context is formulated as

\min_{\theta} \; E_{\Phi_{\beta}} \left[ \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| \right].

This optimization problem looks complicated, but the solution \hat{\theta}_{\mathrm{CV}} can be obtained by solving the following linear program (Rockafellar & Uryasev, 2002):

\min_{\theta, \{b_t\}_{t=1}^{T}, \{c_t\}_{t=1}^{T}, \alpha} \; T(1 - \beta)\alpha + \sum_{t=1}^{T} c_t
subject to  -b_t \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b_t,
            c_t \geq b_t - \alpha,
            c_t \geq 0,  \quad t = 1, \ldots, T.

Note that if the definition of the absolute error is slightly changed, the CVaR minimization method amounts to minimizing the deadzone-linear loss (Takeda, 2007).
6.5    Remarks

LSPI can be regarded as regression of immediate rewards under the squared loss. In this chapter, the absolute loss was used for regression, which contributes to enhancing robustness and reliability. The least absolute method is formulated as a linear program and can be solved efficiently by standard optimization software.

LSPI maximizes the state-action value function Q(s, a), which is the expectation of returns. Another way to address robustness and reliability is to maximize other quantities such as the median or a quantile of returns. Although Bellman-like simple recursive expressions are not available for quantiles of rewards, a Bellman-like recursive equation holds for the distribution of the discounted sum of rewards (Morimura et al., 2010a; Morimura et al., 2010b). Developing robust reinforcement learning algorithms along this line of research would be a promising future direction.
Part III

Model-Free Policy Search

In the policy iteration approach explained in Part II, the value function is first estimated and then the policy is determined based on the learned value function. Policy iteration was demonstrated to work well in many real-world applications, especially in problems with discrete states and actions (Tesauro, 1994; Williams & Young, 2007; Abe et al., 2010). Although policy iteration can also handle continuous states by function approximation (Lagoudakis & Parr, 2003), continuous actions are hard to deal with due to the difficulty of finding a maximizer of the value function with respect to actions. Moreover, since policies are indirectly determined via value function approximation, misspecification of value function models can lead to an inappropriate policy even in very simple problems (Weaver & Baxter, 1999; Baxter et al., 2001). Another limitation of policy iteration, especially in physical control tasks, is that control policies can vary drastically in each iteration. This causes severe instability in the physical system and thus is not favorable in practice.

Policy search is an alternative approach to reinforcement learning that can overcome the limitations of policy iteration (Williams, 1992; Dayan & Hinton, 1997; Kakade, 2002). In the policy search approach, policies are directly learned so that the return (i.e., the discounted sum of future rewards),

\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),

is maximized.

In Part III, we focus on the framework of policy search. First, direct policy search methods are introduced, which try to find the policy that achieves the maximum return via gradient ascent (Chapter 7) or expectation-maximization (Chapter 8). A potential weakness of the direct policy search approach is its instability due to the randomness of stochastic policies. To overcome the instability problem, an alternative approach called policy-prior search is introduced in Chapter 9.
Chapter 7

Direct Policy Search by Gradient Ascent

The direct policy search approach tries to find the policy that maximizes the expected return. In this chapter, we introduce gradient-based algorithms for direct policy search. After the problem formulation in Section 7.1, the gradient ascent algorithm is introduced in Section 7.2. Then, in Section 7.3, its extension using natural gradients is described. In Section 7.4, an application to computer graphics is shown. Finally, this chapter is concluded in Section 7.5.
7.1    Formulation

In this section, the problem of direct policy search is mathematically formulated.

Let us consider a Markov decision process specified by

(S, A, p(s'|s, a), p(s), r, \gamma),

where S is a set of continuous states, A is a set of continuous actions, p(s'|s, a) is the transition probability density from current state s to next state s' when action a is taken, p(s) is the probability density of initial states, r(s, a, s') is an immediate reward for the transition from s to s' by taking action a, and 0 < \gamma \leq 1 is the discount factor for future rewards.

Let \pi(a|s, \theta) be a stochastic policy parameterized by \theta, which represents the conditional probability density of taking action a in state s. Let h be a trajectory of length T:

h = [ s_1, a_1, \ldots, s_T, a_T, s_{T+1} ].

The return (i.e., the discounted sum of future rewards) along h is defined as

R(h) = \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),

and the expected return for policy parameter \theta is defined as

J(\theta) = E_{p(h|\theta)}[ R(h) ] = \int p(h|\theta) R(h) \, dh,
FIGURE 7.1: Gradient ascent for direct policy search.
where E_{p(h|\theta)} is the expectation over trajectory h drawn from p(h|\theta), and p(h|\theta) denotes the probability density of observing trajectory h under policy parameter \theta:

p(h|\theta) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t) \, \pi(a_t|s_t, \theta).

The goal of direct policy search is to find the optimal policy parameter \theta^* that maximizes the expected return J(\theta):

\theta^* = \mathop{\mathrm{argmax}}_{\theta} J(\theta).

However, directly maximizing J(\theta) is hard since J(\theta) usually involves high non-linearity with respect to \theta. Below, a gradient-based algorithm is introduced to find a local maximizer of J(\theta). An alternative approach based on the expectation-maximization algorithm is provided in Chapter 8.
7.2    Gradient Approach

In this section, a gradient ascent method for direct policy search is introduced (Figure 7.1).

7.2.1    Gradient Ascent

The simplest approach to finding a local maximizer of the expected return is gradient ascent (Williams, 1992):

\theta \longleftarrow \theta + \varepsilon \nabla_{\theta} J(\theta),
where \varepsilon is a small positive constant and \nabla_{\theta} J(\theta) denotes the gradient of the expected return J(\theta) with respect to the policy parameter \theta. The gradient \nabla_{\theta} J(\theta) is given by

\nabla_{\theta} J(\theta) = \int \nabla_{\theta} p(h|\theta) R(h) \, dh
  = \int p(h|\theta) \nabla_{\theta} \log p(h|\theta) R(h) \, dh
  = \int p(h|\theta) \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) R(h) \, dh,

where the so-called "log trick" is used:

\nabla_{\theta} p(h|\theta) = p(h|\theta) \nabla_{\theta} \log p(h|\theta).

This expression means that the gradient \nabla_{\theta} J(\theta) is given as the expectation over p(h|\theta):

\nabla_{\theta} J(\theta) = E_{p(h|\theta)} \left[ \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) R(h) \right].

Since p(h|\theta) is unknown, the expectation is approximated by the empirical average as

\widehat{\nabla_{\theta} J}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n}, \theta) R(h_n),

where

h_n = [ s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}, s_{T+1,n} ]

is an independent sample from p(h|\theta). This algorithm is called REINFORCE (Williams, 1992), which is an acronym for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility."
A popular choice for the policy model \pi(a|s, \theta) is the Gaussian policy model, where the policy parameter \theta consists of the mean vector \mu and the standard deviation \sigma:

\pi(a|s, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right).        (7.1)

Here, \phi(s) denotes the basis function. For this Gaussian policy model, the policy gradients are explicitly computed as

\nabla_{\mu} \log \pi(a|s, \mu, \sigma) = \frac{a - \mu^\top \phi(s)}{\sigma^2} \phi(s),

\nabla_{\sigma} \log \pi(a|s, \mu, \sigma) = \frac{(a - \mu^\top \phi(s))^2 - \sigma^2}{\sigma^3}.
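Putting the empirical-average estimator and the Gaussian policy gradients together gives a compact REINFORCE gradient routine. The following is a minimal sketch, not the book's implementation; the episode container, the feature map `phi`, and the discount value are assumptions.

```python
import numpy as np

def reinforce_gradient(episodes, mu, sigma, phi, gamma=0.99):
    """REINFORCE estimator for the Gaussian policy model (7.1).

    episodes : list of episodes, each a list of (s, a, r) transitions.
    mu       : (B,) mean parameter vector;  sigma : scalar standard deviation.
    phi      : callable mapping a state to its (B,) feature vector.
    Returns the estimated gradients (grad_mu, grad_sigma) of the expected return.
    """
    grad_mu = np.zeros_like(mu)
    grad_sigma = 0.0
    for episode in episodes:
        # Return R(h): discounted sum of rewards along the trajectory.
        R = sum(gamma ** t * r for t, (_, _, r) in enumerate(episode))
        for s, a, _ in episode:
            f = phi(s)
            delta = a - mu @ f
            grad_mu += (delta / sigma ** 2) * f * R
            grad_sigma += ((delta ** 2 - sigma ** 2) / sigma ** 3) * R
    N = len(episodes)
    return grad_mu / N, grad_sigma / N
```

A gradient-ascent step would then simply add a small multiple of these estimates to (mu, sigma).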
As shown above, the gradient ascent algorithm for direct policy search is very simple to implement. Furthermore, the property that policy parameters are gradually updated in the gradient ascent algorithm is preferable when reinforcement learning is applied to the control of a vulnerable physical system such as a humanoid robot, because a sudden policy change can damage the system. However, the variance of policy gradients tends to be large in practice (Peters & Schaal, 2006; Sehnke et al., 2010), which can result in slow and unstable convergence.
7.2.2    Baseline Subtraction for Variance Reduction

Baseline subtraction is a useful technique to reduce the variance of gradient estimators. Technically, baseline subtraction can be viewed as the method of control variates (Fishman, 1996), which is an effective approach to reducing the variance of Monte Carlo integral estimators.

The basic idea of baseline subtraction is that an unbiased estimator \hat{\eta} is still unbiased if a zero-mean random variable m multiplied by a constant \xi is subtracted:

\hat{\eta}_{\xi} = \hat{\eta} - \xi m.

The constant \xi, which is called a baseline, may be chosen so that the variance of \hat{\eta}_{\xi} is minimized. By baseline subtraction, a more stable estimator than the original \hat{\eta} can be obtained.

A policy gradient estimator with baseline \xi subtracted is given by

\widehat{\nabla_{\theta} J_{\xi}}(\theta) = \widehat{\nabla_{\theta} J}(\theta) - \xi \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n}, \theta)
  = \frac{1}{N} \sum_{n=1}^{N} \left( R(h_n) - \xi \right) \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n}, \theta),

where the expectation of \nabla_{\theta} \log \pi(a|s, \theta) is zero:

E[ \nabla_{\theta} \log \pi(a|s, \theta) ] = \int \pi(a|s, \theta) \nabla_{\theta} \log \pi(a|s, \theta) \, da
  = \int \nabla_{\theta} \pi(a|s, \theta) \, da
  = \nabla_{\theta} \int \pi(a|s, \theta) \, da = \nabla_{\theta} 1 = 0.

The optimal baseline is defined as the minimizer of the variance of the gradient estimator with respect to the baseline (Greensmith et al., 2004; Weaver & Tao, 2001):

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\theta} J_{\xi}}(\theta) \right],
where \mathrm{Var}_{p(h|\theta)} denotes the trace of the covariance matrix:

\mathrm{Var}_{p(h|\theta)}[\zeta] = \mathrm{tr}\left( E_{p(h|\theta)}\left[ (\zeta - E_{p(h|\theta)}[\zeta]) (\zeta - E_{p(h|\theta)}[\zeta])^\top \right] \right)
  = E_{p(h|\theta)}\left[ \| \zeta - E_{p(h|\theta)}[\zeta] \|^2 \right].

It was shown in Peters and Schaal (2006) that the optimal baseline \xi^* is given as

\xi^* = \frac{ E_{p(h|\theta)}\left[ R(h) \left\| \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) \right\|^2 \right] }{ E_{p(h|\theta)}\left[ \left\| \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) \right\|^2 \right] }.

In practice, the expectations are approximated by sample averages.
7.2.3    Variance Analysis of Gradient Estimators

Here, the variance of gradient estimators is theoretically investigated for the Gaussian policy model (7.1) with \phi(s) = s. See Zhao et al. (2012) for technical details.

In the theoretical analysis, subsets of the following assumptions are considered:

Assumption (A): r(s, a, s') \in [-\beta, \beta] for \beta > 0.

Assumption (B): r(s, a, s') \in [\alpha, \beta] for 0 < \alpha < \beta.

Assumption (C): For \delta > 0, there exist two series \{c_t\}_{t=1}^{T} and \{d_t\}_{t=1}^{T} such that \|s_t\| \geq c_t and \|s_t\| \leq d_t hold with probability at least 1 - \frac{\delta}{2N}, respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A). Let

\zeta(T) = C_T \alpha^2 - D_T \beta^2 / (2\pi),

where

C_T = \sum_{t=1}^{T} c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^{T} d_t^2.

First, the variance of gradient estimators is analyzed.
First,thevarianceofgradientestimatorsisanalyzed.
Theorem7.1UnderAssumptions(A)and(C),thefollowingupperbound
holdswithprobabilityatleast1−δ/2:
h
i
D
Var
b
Tβ2(1−γT)2
p(h|θ)∇µJ(µ,σ)≤
.
Nσ2(1−γ)2
UnderAssumption(A),itholdsthat
h
i
2Tβ2(1−γT)2
Var
b
p(h|θ)∇σJ(µ,σ)≤
.
Nσ2(1−γ)2
The above upper bounds are monotone increasing with respect to the trajectory length T.

For the variance of \widehat{\nabla_{\mu} J}(\mu, \sigma), the following lower bound holds (its upper bound has not been derived yet):

Theorem 7.2  Under Assumptions (B) and (C), the following lower bound holds with probability at least 1 - \delta:

\mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\mu} J}(\mu, \sigma) \right] \geq \frac{ (1 - \gamma^T)^2 }{ N \sigma^2 (1 - \gamma)^2 } \, \zeta(T).

This lower bound is non-trivial if \zeta(T) > 0, which can be fulfilled, e.g., if \alpha and \beta satisfy

2\pi C_T \alpha^2 > D_T \beta^2.
Next, the contribution of the optimal baseline is investigated. It was shown (Greensmith et al., 2004; Weaver & Tao, 2001) that the excess variance for an arbitrary baseline \xi is given by

\mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\theta} J_{\xi}}(\theta) \right] - \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\theta} J_{\xi^*}}(\theta) \right]
  = \frac{ (\xi - \xi^*)^2 }{ N } E_{p(h|\theta)}\left[ \left\| \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) \right\|^2 \right].

Based on this expression, the following theorem can be obtained.

Theorem 7.3  Under Assumptions (B) and (C), the following bounds hold with probability at least 1 - \delta:

\frac{ C_T \alpha^2 (1 - \gamma^T)^2 }{ N \sigma^2 (1 - \gamma)^2 } \leq \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\mu} J}(\mu, \sigma) \right] - \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\mu} J_{\xi^*}}(\mu, \sigma) \right] \leq \frac{ \beta^2 (1 - \gamma^T)^2 D_T }{ N \sigma^2 (1 - \gamma)^2 }.

This theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by optimal baseline subtraction and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.
Thistheoremshowsthatthelowerboundoftheexcessvarianceispositive
andmonotoneincreasingwithrespecttothetrajectorylengthT.Thismeans
thatthevarianceisalwaysreducedbyoptimalbaselinesubtractionandthe
amountofvariancereductionismonotoneincreasingwithrespecttothetra-
jectorylengthT.Notethattheupperboundisalsomonotoneincreasingwith
respecttothetrajectorylengthT.
Finally,thevarianceofgradientestimatorswiththeoptimalbaselineis
investigated:
Theorem7.4UnderAssumptions(B)and(C),itholdsthat
(1−γT)2
Var
b
p(h|θ)[∇µJξ∗(µ,σ)]≤(β2D
Nσ2(1−γ)2
T−α2CT),
wheretheinequalityholdswithprobabilityatleast1−δ.
FIGURE 7.2: Ordinary gradients (a) and natural gradients (b). Ordinary gradients treat all dimensions equally, while natural gradients take the Riemannian structure into account.
This theorem shows that the upper bound of the variance of the gradient estimators with the optimal baseline is still monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, the variance of the gradient estimators can still be large even with the optimal baseline.

In Chapter 9, another gradient approach will be introduced for overcoming this large-variance problem.
7.3    Natural Gradient Approach

The gradient-based policy parameter update used in the REINFORCE algorithm is performed under the Euclidean metric. In this section, we show another useful choice of the metric for gradient-based policy search.

7.3.1    Natural Gradient Ascent

Use of the Euclidean metric implies that all dimensions of the policy parameter vector \theta are treated equally (Figure 7.2(a)). However, since a policy parameter \theta specifies a conditional probability density \pi(a|s, \theta), use of the Euclidean metric in the parameter space does not necessarily mean that all dimensions are treated equally in the space of conditional probability densities. Thus, a small change in the policy parameter \theta can cause a big change in the conditional probability density \pi(a|s, \theta) (Kakade, 2002).

Figure 7.3 describes the Gaussian densities with means \mu = -5, 0, 5 and standard deviations \sigma = 1, 2. This shows that if the standard deviation is doubled, the difference in means should also be doubled to maintain the same overlapping level. Thus, it is "natural" to compute the distance between two Gaussian densities parameterized with (\mu, \sigma) and (\mu + \Delta\mu, \sigma) not by \Delta\mu, but by \Delta\mu / \sigma.

FIGURE 7.3: Gaussian densities with different means and standard deviations. If the standard deviation is doubled (from the solid lines to the dashed lines), the difference in means should also be doubled to maintain the same overlapping level.

Gradients that treat all dimensions equally in the space of probability densities are called natural gradients (Amari, 1998; Amari & Nagaoka, 2000). The ordinary gradient is defined as the steepest ascent direction under the Euclidean metric (Figure 7.2(a)):

\nabla_{\theta} J(\theta) = \mathop{\mathrm{argmax}}_{\Delta\theta} J(\theta + \Delta\theta) \quad \text{subject to} \quad \Delta\theta^\top \Delta\theta \leq \epsilon,

where \epsilon is a small positive number. On the other hand, the natural gradient is defined as the steepest ascent direction under the Riemannian metric (Figure 7.2(b)):

\widetilde{\nabla}_{\theta} J(\theta) = \mathop{\mathrm{argmax}}_{\Delta\theta} J(\theta + \Delta\theta) \quad \text{subject to} \quad \Delta\theta^\top R_{\theta} \Delta\theta \leq \epsilon,

where R_{\theta} is the Riemannian metric, which is a positive definite matrix. The solution of the above optimization problem is given by

\widetilde{\nabla}_{\theta} J(\theta) = R_{\theta}^{-1} \nabla_{\theta} J(\theta).

Thus, the ordinary gradient \nabla_{\theta} J(\theta) is modified by the inverse Riemannian metric R_{\theta}^{-1} in the natural gradient.

A standard distance metric in the space of probability densities is the Kullback–Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence from density p to density q is defined as

\mathrm{KL}(p \| q) = \int p(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta.
\mathrm{KL}(p \| q) is always non-negative and zero if and only if p = q. Thus, smaller \mathrm{KL}(p \| q) means that p and q are "closer." Note, however, that the KL divergence is not symmetric, i.e., \mathrm{KL}(p \| q) \neq \mathrm{KL}(q \| p) in general.

For small \Delta\theta, the KL divergence from p(h|\theta) to p(h|\theta + \Delta\theta) can be approximated by

\Delta\theta^\top F_{\theta} \Delta\theta,

where F_{\theta} is the Fisher information matrix:

F_{\theta} = E_{p(h|\theta)}\left[ \nabla_{\theta} \log p(h|\theta) \nabla_{\theta} \log p(h|\theta)^\top \right].

Thus, F_{\theta} is the Riemannian metric induced by the KL divergence.

Then the update rule of the policy parameter \theta based on the natural gradient is given by

\theta \longleftarrow \theta + \varepsilon \hat{F}_{\theta}^{-1} \widehat{\nabla_{\theta} J}(\theta),

where \varepsilon is a small positive constant and \hat{F}_{\theta} is a sample approximation of F_{\theta}:

\hat{F}_{\theta} = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\theta} \log p(h_n|\theta) \nabla_{\theta} \log p(h_n|\theta)^\top.

Under mild regularity conditions, the Fisher information matrix F_{\theta} can be expressed as

F_{\theta} = -E_{p(h|\theta)}\left[ \nabla_{\theta}^2 \log p(h|\theta) \right],

where \nabla_{\theta}^2 \log p(h|\theta) denotes the Hessian matrix of \log p(h|\theta), i.e., its (b, b')-th element is given by \frac{\partial^2}{\partial\theta_b \partial\theta_{b'}} \log p(h|\theta). This means that the natural gradient takes the curvature into account, by which the convergence behavior at flat plateaus and steep ridges tends to be improved. On the other hand, a potential weakness of natural gradients is that the computation of the inverse Riemannian metric tends to be numerically unstable (Deisenroth et al., 2013).
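Note that \nabla_{\theta} \log p(h|\theta) equals the per-trajectory sum of policy score vectors, since the transition and initial-state terms do not depend on \theta. A natural-gradient step can therefore reuse the same score sums that the vanilla gradient estimator already computes. The following is a minimal sketch, with the small ridge term added for numerical stability as an assumption addressing the instability mentioned above:

```python
import numpy as np

def natural_gradient_step(theta, grad_J, score_sums, eps=0.1, ridge=1e-6):
    """One natural-gradient update using the sample Fisher matrix.

    theta      : (B,) current policy parameter.
    grad_J     : (B,) ordinary (vanilla) policy-gradient estimate.
    score_sums : (N, B) array of per-trajectory scores sum_t grad log pi(a_t|s_t).
    """
    N, B = score_sums.shape
    F = score_sums.T @ score_sums / N + ridge * np.eye(B)  # sample Fisher matrix
    # theta <- theta + eps * F^{-1} grad_J, solved without forming the inverse.
    return theta + eps * np.linalg.solve(F, grad_J)
```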
7.3.2    Illustration

Let us illustrate the difference between ordinary and natural gradients numerically.

Consider the one-dimensional real-valued state space S = R and the one-dimensional real-valued action space A = R. The transition dynamics is linear and deterministic as s' = s + a, and the reward function is quadratic as r = 0.5 s^2 - 0.05 a. The discount factor is set at \gamma = 0.95. The Gaussian policy model,

\pi(a|s, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(a - \mu s)^2}{2\sigma^2} \right),

is employed, which contains the mean parameter \mu and the standard deviation parameter \sigma. The optimal policy parameters in this setup are given by (\mu^*, \sigma^*) \approx (-0.912, 0).
FIGURE 7.4: Numerical illustrations of ordinary gradients (a) and natural gradients (b), plotted in the (\mu, \sigma) policy-parameter plane.
Figure 7.4 shows a numerical comparison of ordinary and natural gradients for the Gaussian policy. The contour lines and the arrows indicate the expected return surface and the gradient directions, respectively. The graphs show that the ordinary gradients tend to strongly reduce the standard deviation parameter \sigma without really updating the mean parameter \mu. This means that the stochasticity of the policy is lost quickly and thus the agent becomes less exploratory. Consequently, once \sigma gets closer to zero, the solution is at a flat plateau along the direction of \mu and thus policy updates in \mu are very slow. On the other hand, the natural gradients reduce both the mean parameter \mu and the standard deviation parameter \sigma in a balanced way. As a result, convergence becomes much faster than with the ordinary gradient method.
7.4    Application in Computer Graphics: Artist Agent

Oriental ink painting, which is also called sumie, is one of the most distinctive painting styles and has attracted artists around the world. Major challenges in sumie simulation are to abstract complex scene information and reproduce smooth and natural brush strokes. Reinforcement learning is useful for automatically generating such smooth and natural strokes (Xie et al., 2013). In this section, the REINFORCE algorithm explained in Section 7.2 is applied to sumie agent training.
7.4.1    Sumie Painting

Among various techniques of non-photorealistic rendering (Gooch & Gooch, 2001), stroke-based painterly rendering synthesizes an image from a source image in a desired painting style by placing discrete strokes (Hertzmann, 2003). Such an algorithm simulates the common practice of human painters who create paintings with brush strokes.

Western painting styles such as water-color, pastel, and oil painting overlay strokes onto multiple layers, while oriental ink painting uses a few expressive strokes produced by soft brush tufts to convey significant information about a target scene. The appearance of a stroke in oriental ink painting is therefore determined by the shape of the object to paint, the path and posture of the brush, and the distribution of pigments in the brush.

Drawing smooth and natural strokes in arbitrary shapes is challenging since the optimal brush trajectory and the posture of a brush footprint are different for each shape. Existing methods can efficiently map brush texture by deformation onto a user-given trajectory line or the shape of a target stroke (Hertzmann, 1998; Guo & Kunii, 2003). However, the geometrical process of morphing the entire texture of a brush stroke into the target shape leads to undesirable effects such as unnatural foldings and creased appearances at corners or curves.
Here, a soft-tuft brush is treated as a reinforcement learning agent, and the REINFORCE algorithm is used to automatically draw artistic strokes. More specifically, given any closed contour that represents the shape of a desired single stroke without overlap, the agent moves the brush on the canvas to fill the given shape from a start point to an end point with stable poses along a smooth continuous movement trajectory (see Figure 7.5).

In oriental ink painting, there are several different brush styles that characterize the paintings. Below, two representative styles called the upright brush style and the oblique brush style are considered (see Figure 7.6). In the upright brush style, the tip of the brush should be located on the medial axis of the expected stroke shape, and the bottom of the brush should be tangent to both sides of the boundary. On the other hand, in the oblique brush style, the tip of the brush should touch one side of the boundary and the bottom of the brush should be tangent to the other side of the boundary. The choice between the upright brush style and the oblique brush style is exclusive, and a user is asked to choose one of the styles in advance.
7.4.2    Design of States, Actions, and Immediate Rewards

Here, the specific design of states, actions, and immediate rewards tailored to the sumie agent is described.
FIGURE 7.5: Illustration of the brush agent and its path. (a) Brush model: a stroke is generated by moving the brush with the following 3 actions: Action 1 is regulating the direction of the brush movement, Action 2 is pushing down/lifting up the brush, and Action 3 is rotating the brush handle. Only Action 1 is determined by reinforcement learning; Action 2 and Action 3 are determined based on Action 1. (b) Footprints: the top symbol illustrates the brush agent, which consists of a tip Q and a circle with center C and radius r; the others illustrate footprints of a real brush with different ink quantities. (c) Basic stroke styles: there are 6 basic stroke styles: full ink, dry ink, first-half hollow, hollow, middle hollow, and both-end hollow. Small footprints on the top of each stroke show the interpolation order.
7.4.2.1    States

The global measurement (i.e., the pose configuration of a footprint under the global Cartesian coordinate) and the local measurement (i.e., the pose and the locomotion information of the brush agent relative to the surrounding environment) are used as states. Here, only the local measurement is used to calculate a reward and a policy, by which the agent can learn a drawing policy that is generalizable to new shapes. Below, the local measurement is regarded as the state and the global measurement is dealt with only implicitly.
FIGURE 7.6: Upright brush style (left) and oblique brush style (right).
The local state-space design consists of two components: a current surrounding shape and an upcoming shape. More specifically, the state vector s consists of the following six features:

s = (\omega, \phi, d, \kappa_1, \kappa_2, l)^\top.

Each feature is defined as follows (see Figure 7.7):

• \omega \in (-\pi, \pi]: The angle of the velocity vector of the brush agent relative to the medial axis.

• \phi \in (-\pi, \pi]: The heading direction of the brush agent relative to the medial axis.

• d \in [-2, 2]: The ratio of the offset distance \delta from the center C of the brush agent to the nearest point P on the medial axis M over the radius r of the brush agent (|d| = \delta / r). d takes a positive/negative value when the center of the brush agent is on the left-/right-hand side of the medial axis:

  – d takes the value 0 when the center of the brush agent is on the medial axis.

  – d takes a value in [-1, 1] when the brush agent is inside the boundaries.

  – The value of d is in [-2, -1) or in (1, 2] when the brush agent goes over the boundary on one side.
FIGURE 7.7: Illustration of the design of states. Left: The brush agent consists of a tip Q and a circle with center C and radius r. Right: The ratio d of the offset distance \delta over the radius r. Footprint f_{t-1} is inside the drawing area, and the circle with center C_{t-1} and the tip Q_{t-1} touch the boundary on each side; in this case, \delta_{t-1} \leq r_{t-1} and d_{t-1} \in [0, 1]. On the other hand, f_t goes over the boundary, and then \delta_t > r_t and d_t > 1. Note that d is restricted to be in [-2, 2], and P is the nearest point on the medial axis M to C.
Note that the center of the agent is restricted to be within the shape. Therefore, the extreme values of d are \pm 2, attained when the center of the agent is on the boundary.

• \kappa_1, \kappa_2 \in (-1, 1): \kappa_1 provides the current surrounding information at the point P_t, whereas \kappa_2 provides the upcoming shape information at the point P_{t+1}:

\kappa_i = \frac{2}{\pi} \arctan \sqrt{0.05 / r'_i},

where r'_i is the radius of the curve. More specifically, the value takes 0/negative/positive when the shape is straight/left-curved/right-curved, and the larger its absolute value is, the tighter the curve is.

• l \in \{0, 1\}: A binary label that indicates whether the agent moves to a region covered by the previous footprints or not. l = 0 means that the agent moves to a region covered by the previous footprint. Otherwise, l = 1 means that it moves to an uncovered region.
7.4.2.2    Actions

To generate elegant brush strokes, the brush agent should move inside the given boundaries properly. Here, the following actions are considered to control the brush (see Figure 7.5(a)):

• Action 1: Movement of the brush on the canvas paper.

• Action 2: Scaling up/down of the footprint.
• Action 3: Rotation of the heading direction of the brush.

Since properly covering the whole desired region is the most important factor in terms of visual quality, the movement of the brush (Action 1) is regarded as the primary action. More specifically, Action 1 takes a value in (-\pi, \pi] that indicates the offset turning angle of the motion direction relative to the medial axis of an expected stroke shape. In practical applications, the agent should be able to deal with arbitrary strokes in various scales. To achieve stable performance in different scales, the velocity is adaptively changed as r/3, where r is the radius of the current footprint.

Action 1 is determined by the Gaussian policy function trained by the REINFORCE algorithm, and Action 2 and Action 3 are determined as follows:

• Oblique brush stroke style: The tip of the agent is set to touch one side of the boundary, and the bottom of the agent is set to be tangent to the other side of the boundary.

• Upright brush stroke style: The tip of the agent is chosen to travel along the medial axis of the shape.

If it is not possible to satisfy the above constraints by adjusting Action 2 and Action 3, the new footprint simply takes the same posture as the previous one.
7.4.2.3    Immediate Rewards

The immediate reward function measures the quality of the brush agent's movement after taking an action at each time step. The reward is designed to reflect the following two aspects:

• The distance between the center of the brush agent and the nearest point on the medial axis of the shape at the current time step: This detects whether the agent moves out of the region or travels backward from the correct direction.

• The change of the local configuration of the brush agent after executing an action: This detects whether the agent moves smoothly.

These two aspects are formalized by defining the reward function as follows:

r(s_t, a_t, s_{t+1}) =
  0                                                                                                    if f_t = f_{t+1} or l_{t+1} = 0,
  \frac{ 2 + |\kappa_1(t)| + |\kappa_2(t)| }{ E^{(t)}_{\mathrm{location}} + E^{(t)}_{\mathrm{posture}} }    otherwise,

where f_t and f_{t+1} are the footprints at time steps t and t+1, respectively. This reward design implies that the immediate reward is zero when the brush is blocked by a boundary (f_t = f_{t+1}) or the brush is going backward to a region
that has already been covered by previous footprints. \kappa_1(t) and \kappa_2(t) are the values of \kappa_1 and \kappa_2 at time step t. |\kappa_1(t)| + |\kappa_2(t)| adaptively increases the immediate reward depending on the curvatures \kappa_1(t) and \kappa_2(t) of the medial axis.

E^{(t)}_{\mathrm{location}} measures the quality of the location of the brush agent with respect to the medial axis, defined by

E^{(t)}_{\mathrm{location}} =
  \tau_1 |\omega_t| + \tau_2 (|d_t| + 5)    if d_t \in [-2, -1) \cup (1, 2],
  \tau_1 |\omega_t| + \tau_2 |d_t|           if d_t \in [-1, 1],

where d_t is the value of d at time step t. \tau_1 and \tau_2 are weight parameters, which are chosen depending on the brush style: \tau_1 = \tau_2 = 0.5 for the upright brush style, and \tau_1 = 0.1 and \tau_2 = 0.9 for the oblique brush style. Since d_t contains information about whether the agent goes over the boundary or not, as illustrated in Figure 7.7, the penalty +5 is added to E_{\mathrm{location}} when the agent goes over the boundary of the shape.

E^{(t)}_{\mathrm{posture}} measures the quality of the posture of the brush agent based on neighboring footprints, defined by

E^{(t)}_{\mathrm{posture}} = \Delta\omega_t / 3 + \Delta\phi_t / 3 + \Delta d_t / 3,

where \Delta\omega_t, \Delta\phi_t, and \Delta d_t are the changes in the angle \omega of the velocity vector, the heading direction \phi, and the ratio d of the offset distance, respectively. The notation \Delta x_t denotes the normalized squared change between x_{t-1} and x_t defined by

\Delta x_t =
  1                                                            if x_t = x_{t-1} = 0,
  \frac{ (x_t - x_{t-1})^2 }{ (|x_t| + |x_{t-1}|)^2 }           otherwise.
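The reward definition above can be transcribed almost directly into code. The following is a minimal sketch assuming the geometric quantities (the footprint-blocked and covered-region flags, and the normalized squared changes) have already been computed elsewhere; the function name and argument layout are hypothetical.

```python
def immediate_reward(blocked, covered, omega, d, kappa1, kappa2,
                     d_omega, d_phi, d_d, tau1=0.5, tau2=0.5):
    """Sketch of the sumie reward (upright-style weights tau1 = tau2 = 0.5).

    blocked : True if the footprint did not move (f_t == f_{t+1}).
    covered : True if the agent moved back onto an already-covered region (l = 0).
    d_omega, d_phi, d_d : normalized squared changes Delta_omega, Delta_phi, Delta_d.
    """
    if blocked or covered:
        return 0.0
    # E_location: penalty +5 is added when the agent goes over the boundary (|d| > 1).
    e_location = tau1 * abs(omega) + tau2 * (abs(d) + 5 if abs(d) > 1 else abs(d))
    # E_posture: average of the three normalized squared changes.
    e_posture = (d_omega + d_phi + d_d) / 3.0
    return (2 + abs(kappa1) + abs(kappa2)) / (e_location + e_posture)
```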
7.4.2.4 Training and Test Sessions

A naive way to train an agent is to use an entire stroke shape as a training sample. However, this has several drawbacks, e.g., collecting many training samples is costly and generalization to new shapes is hard. To overcome these limitations, the agent is trained based on partial shapes, not the entire shapes (Figure 7.8(a)). This allows us to generate various partial shapes from a single entire shape, which significantly increases the number and variation of training samples. Another merit is that the generalization ability to new shapes can be enhanced, because even when the entire profile of a new shape is quite different from that of the training data, the new shape may contain similar partial shapes. Figure 7.8(c) illustrates 8 examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting. Boundaries are extracted as the shape information and are arranged in a queue for training (see Figure 7.8(b)).

In the training session, the initial position of the first episode is chosen to be the start point of the medial axis, and the direction to move is chosen to be toward the goal point, as illustrated in Figure 7.8(b). In the first episode, the initial footprint is set at the start point of the shape. Then, in the following episodes, the initial footprint is set at either the last footprint in the previous episode or the start point of the shape, depending on whether the agent moved well or was blocked by the boundary in the previous episode.

FIGURE 7.8: Policy training scheme. (a) Each entire shape is composed of one of the upper regions U_i, the common region Ω, and one of the lower regions L_j. (b) Boundaries are extracted as the shape information and are arranged in a queue for training. (c) Eight examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting are illustrated.

After learning a drawing policy, the brush agent applies the learned policy to covering given boundaries with smooth strokes.
FIGURE 7.9: Average and standard deviation of returns obtained by the reinforcement learning (RL) method over 10 trials and the upper limit of the return value. (a) Upright brush style. (b) Oblique brush style.

The location of the agent is initialized at the start point of a new shape. The agent then sequentially selects actions based on the learned policy and makes transitions until it reaches the goal point.
7.4.3 Experimental Results

First, the performance of the reinforcement learning (RL) method is investigated. Policies are separately trained by the REINFORCE algorithm for the upright brush style and the oblique brush style using 80 single strokes as training data (see Figure 7.8(c)). The parameters of the initial policy are set at

\theta = (\mu^\top, \sigma)^\top = (0, 0, 0, 0, 0, 0, 2)^\top,

where the first six elements correspond to the Gaussian mean and the last element is the Gaussian standard deviation. The agent collects N = 300 episodic samples with trajectory length T = 32. The discount factor is set at γ = 0.99.

The average and standard deviations of the return for 300 training episodic samples over 10 trials are plotted in Figure 7.9. The graphs show that the average returns sharply increase in an early stage and approach the optimal values (i.e., receiving the maximum immediate reward, +1, for all steps).

Next, the performance of the RL method is compared with that of the dynamic programming (DP) method (Xie et al., 2011), which involves discretization of the continuous state space. In Figure 7.10, the experimental results obtained by DP with different numbers of footprint candidates in each step of the DP search are plotted together with the result obtained by RL. This shows that the execution time of the DP method increases significantly as the number of footprint candidates increases.
FIGURE 7.10: Average return and computation time for reinforcement learning (RL) and dynamic programming (DP) as functions of the number of footprint candidates. (a) Average return. (b) Computation time.
In the DP method, the best return value 26.27 is achieved when the number of footprint candidates is set at 180. Although this maximum value is comparable to the return obtained by the RL method (26.44), RL is about 50 times faster than the DP method. Figure 7.11 shows some exemplary strokes generated by RL (the top two rows) and DP (the bottom two rows). This shows that the agent trained by RL is able to draw nice strokes with stable poses after the 30th policy update iteration (see also Figure 7.9). On the other hand, as illustrated in Figure 7.11, the DP results for 5, 60, and 100 footprint candidates are unacceptably poor. Given that the DP method requires manual tuning of the number of footprint candidates at each step for each input shape, the RL method is demonstrated to be promising.

The RL method is further applied to more realistic shapes, illustrated in Figure 7.12. Although the shapes are not included in the training samples, the RL method can produce smooth and natural brush strokes for various unlearned shapes. More results are illustrated in Figure 7.13, showing that the RL method is promising in photo conversion into the sumi-e style.
7.5 Remarks

In this chapter, gradient-based algorithms for direct policy search are introduced. These gradient-based methods are suitable for controlling vulnerable physical systems such as humanoid robots, thanks to the nature of gradient methods that parameters are updated gradually. Furthermore, direct policy search can handle continuous actions in a straightforward way, which is an advantage over policy iteration, explained in Part II.
FIGURE 7.11: Examples of strokes generated by RL and DP. The top two rows show the RL results over policy update iterations (1st, 10th, 20th, 30th, and 40th iterations), while the bottom two rows show the DP results for different numbers of footprint candidates (5, 60, 100, 140, and 180 candidates). The line segment connects the center and the tip of a footprint, and the circle denotes the bottom circle of the footprint.
The gradient-based method was successfully applied to automatic sumi-e painting generation. Considering local measurements in state design was shown to be useful, which allowed a brush agent to learn a general drawing policy that is independent of a specific entire shape. Another important factor was to train the brush agent on partial shapes, not the entire shapes. This contributed highly to enhancing the generalization ability to new shapes, because even when a new shape is quite different from the training data as a whole, it often contains similar partial shapes. In this kind of real-world application, manually designing immediate reward functions is often time consuming and difficult. The use of inverse reinforcement learning (Abbeel & Ng, 2004) would be a promising approach for this purpose.
FIGURE 7.12: Results on new shapes. (a) Real photo. (b) User input boundaries. (c) Trajectories estimated by RL. (d) Rendering results.
In particular, in the context of sumi-e drawing, such data-driven design of reward functions will allow automatic learning of the style of a particular artist from his/her drawings.

A practical weakness of the gradient-based approach is that the step size of gradient ascent is often difficult to choose. In Chapter 8, a step-size-free method of direct policy search based on the expectation-maximization algorithm will be introduced. Another critical problem of direct policy search is that policy update is rather unstable due to the stochasticity of policies. Although variance reduction by baseline subtraction can mitigate this problem to some extent, the instability problem is still critical in practice. The natural gradient method could be an alternative, but computing the inverse Riemannian metric tends to be unstable. In Chapter 9, another gradient approach that can address the instability problem will be introduced.
FIGURE 7.13: Photo conversion into the sumi-e style.
Chapter 8
Direct Policy Search by Expectation-Maximization

Gradient-based direct policy search methods introduced in Chapter 7 are useful particularly in controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method based on the expectation-maximization (EM) algorithm that does not contain the step size parameter. In Section 8.1, the main idea of the EM-based method is described, which is expected to converge faster because policies are more aggressively updated than in the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve the stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Then its experimental performance is evaluated in Section 8.3 and this chapter is concluded in Section 8.4.
8.1 Expectation-Maximization Approach

The gradient-based optimization algorithms introduced in Section 7.2 gradually update policy parameters over iterations. Although this is advantageous when controlling a physical system, it requires many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.

The basic idea of EM-based policy search is to iteratively update the policy parameter θ by maximizing a lower bound of the expected return J(θ):

J(\theta) = \int p(h|\theta) R(h) \, dh.

To derive a lower bound of J(θ), Jensen's inequality (Bishop, 2006) is utilized:

\int q(h) f(g(h)) \, dh \ge f\left( \int q(h) g(h) \, dh \right),

where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = −log t, Jensen's inequality yields

\int q(h) \log g(h) \, dh \le \log \int q(h) g(h) \, dh.   (8.1)

Assume that the return R(h) is non-negative. Let θ̃ be the current policy parameter during the optimization procedure, and let q and g in Eq. (8.1) be set as

q(h) = \frac{p(h|\tilde\theta) R(h)}{J(\tilde\theta)} \quad \text{and} \quad g(h) = \frac{p(h|\theta)}{p(h|\tilde\theta)}.

Then the following lower bound holds for all θ:

\log \frac{J(\theta)}{J(\tilde\theta)} = \log \int \frac{p(h|\theta) R(h)}{J(\tilde\theta)} \, dh = \log \int \frac{p(h|\tilde\theta) R(h)}{J(\tilde\theta)} \, \frac{p(h|\theta)}{p(h|\tilde\theta)} \, dh \ge \int \frac{p(h|\tilde\theta) R(h)}{J(\tilde\theta)} \log \frac{p(h|\theta)}{p(h|\tilde\theta)} \, dh.

This yields

\log J(\theta) \ge \log \tilde{J}(\theta),

where

\log \tilde{J}(\theta) = \int \frac{R(h) p(h|\tilde\theta)}{J(\tilde\theta)} \log \frac{p(h|\theta)}{p(h|\tilde\theta)} \, dh + \log J(\tilde\theta).

In the EM approach, the parameter θ is iteratively updated by maximizing the lower bound J̃(θ):

\hat\theta = \mathop{\mathrm{argmax}}_{\theta} \tilde{J}(\theta).

Since log J̃(θ̃) = log J(θ̃), the lower bound J̃ touches the target function J at the current solution θ̃:

\tilde{J}(\tilde\theta) = J(\tilde\theta).

Thus, monotone non-decrease of the expected return is guaranteed:

J(\hat\theta) \ge J(\tilde\theta).

This update is iterated until convergence (see Figure 8.1).
FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter θ is updated iteratively by maximizing the lower bound J̃(θ), which touches the true expected return J(θ) at the current solution θ̃.

Let us employ the Gaussian policy model defined as

\pi(a|s,\theta) = \pi(a|s,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(a - \mu^\top\phi(s))^2}{2\sigma^2} \right),

where θ = (μ⊤, σ)⊤ and φ(s) denotes the basis function.

The maximizer θ̂ = (μ̂⊤, σ̂)⊤ of the lower bound J̃(θ) can be analytically obtained as

\hat\mu = \left( \int p(h|\tilde\theta) R(h) \sum_{t=1}^T \phi(s_t)\phi(s_t)^\top \, dh \right)^{-1} \int p(h|\tilde\theta) R(h) \sum_{t=1}^T a_t \phi(s_t) \, dh
\approx \left( \sum_{n=1}^N R(h_n) \sum_{t=1}^T \phi(s_{t,n})\phi(s_{t,n})^\top \right)^{-1} \sum_{n=1}^N R(h_n) \sum_{t=1}^T a_{t,n} \phi(s_{t,n}),

\hat\sigma^2 = \left( \int p(h|\tilde\theta) R(h) \, dh \right)^{-1} \int p(h|\tilde\theta) R(h) \, \frac{1}{T} \sum_{t=1}^T (a_t - \hat\mu^\top\phi(s_t))^2 \, dh
\approx \left( \sum_{n=1}^N R(h_n) \right)^{-1} \sum_{n=1}^N R(h_n) \, \frac{1}{T} \sum_{t=1}^T (a_{t,n} - \hat\mu^\top\phi(s_{t,n}))^2,

where the expectation over h is approximated by the average over roll-out samples H = {h_n}_{n=1}^N from the current policy θ̃:

h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].

Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
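As an illustration of the analytic update above, the following Python sketch implements one RWR step for the Gaussian policy. The array shapes and argument names are assumptions made for this sketch (phi is an (N, T, B) array of basis values, actions is (N, T), returns is (N,)); they are not taken from an existing implementation.

import numpy as np

def rwr_update(phi, actions, returns):
    # One reward-weighted regression update: each trajectory is weighted
    # by its (non-negative) return R(h_n).
    N, T, B = phi.shape
    A = np.zeros((B, B))
    b = np.zeros(B)
    for n in range(N):
        A += returns[n] * np.einsum('tb,tc->bc', phi[n], phi[n])
        b += returns[n] * (actions[n][:, None] * phi[n]).sum(axis=0)
    mu = np.linalg.solve(A, b)            # return-weighted least squares
    residuals = actions - phi @ mu        # (N, T) residuals a_t - mu^T phi(s_t)
    sigma2 = (returns * (residuals ** 2).mean(axis=1)).sum() / returns.sum()
    return mu, np.sqrt(sigma2)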
8.2 Sample Reuse

In practice, a large number of samples is needed to obtain a stable policy update estimator in the EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with the instability problem.
8.2.1 Episodic Importance Weighting

The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy πℓ in the ℓ-th policy update iteration:

H^{\pi_\ell} = \{ h^{\pi_\ell}_1, \ldots, h^{\pi_\ell}_N \},

where each trajectory sample h^{πℓ}_n is given as

h^{\pi_\ell}_n = [ s^{\pi_\ell}_{1,n}, a^{\pi_\ell}_{1,n}, \ldots, s^{\pi_\ell}_{T,n}, a^{\pi_\ell}_{T,n}, s^{\pi_\ell}_{T+1,n} ].

We want to utilize all these samples to improve the current policy.

Suppose that we are currently at the L-th policy update iteration. If the policies {πℓ}_{ℓ=1}^L remained unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 would give a consistent estimator θ̂^NIW_{L+1} = (μ̂^{NIW⊤}_{L+1}, σ̂^NIW_{L+1})⊤, where

\hat\mu^{\mathrm{NIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \sum_{t=1}^T \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \sum_{t=1}^T a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),

(\hat\sigma^{\mathrm{NIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \, \frac{1}{T} \sum_{t=1}^T \left( a^{\pi_\ell}_{t,n} - \hat\mu^{\mathrm{NIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \right)^2 \right).

The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, the data samples {H^{πℓ}}_{ℓ=1}^L collected over iterations generally follow different probability distributions induced by different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.
In the same way as the discussion in Chapter 4, importance sampling can be used to cope with this problem. The basic idea of importance sampling is to weight the samples drawn from a different distribution to match the target distribution. More specifically, from i.i.d. (independent and identically distributed) samples {h^{πℓ}_n}_{n=1}^N following p(h|θℓ), the expectation of a function g(h) over another probability density function p(h|θL) can be estimated in a consistent manner by the importance-weighted average:

\frac{1}{N} \sum_{n=1}^N g(h^{\pi_\ell}_n) \frac{p(h^{\pi_\ell}_n|\theta_L)}{p(h^{\pi_\ell}_n|\theta_\ell)} \xrightarrow{N\to\infty} \mathbb{E}_{p(h|\theta_\ell)}\!\left[ g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} \right] = \int g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} p(h|\theta_\ell) \, dh = \int g(h) p(h|\theta_L) \, dh = \mathbb{E}_{p(h|\theta_L)}[g(h)].

The ratio of the two densities, p(h|θL)/p(h|θℓ), is called the importance weight for trajectory h.

This importance sampling technique can be employed in RWR to obtain a consistent estimator θ̂^EIW_{L+1} = (μ̂^{EIW⊤}_{L+1}, σ̂^EIW_{L+1})⊤, where

\hat\mu^{\mathrm{EIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^T \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^T a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),

(\hat\sigma^{\mathrm{EIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \, \frac{1}{T} \sum_{t=1}^T \left( a^{\pi_\ell}_{t,n} - \hat\mu^{\mathrm{EIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \right)^2 \right).

Here, w^{(L,ℓ)}(h) denotes the importance weight defined by

w^{(L,\ell)}(h) = \frac{p(h|\theta_L)}{p(h|\theta_\ell)}.

The superscript "EIW" stands for "episodic importance weight."

p(h|θL) and p(h|θℓ) denote the probability densities of observing trajectory

h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]

under policy parameters θL and θℓ, which can be explicitly written as

p(h|\theta_L) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t) \, \pi(a_t|s_t, \theta_L),
p(h|\theta_\ell) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t) \, \pi(a_t|s_t, \theta_\ell).

The two probability densities p(h|θL) and p(h|θℓ) both contain the unknown probability densities p(s_1) and {p(s_{t+1}|s_t, a_t)}_{t=1}^T. However, since these cancel out in the importance weight, it can be computed without the knowledge of p(s) and p(s'|s, a) as

w^{(L,\ell)}(h) = \frac{\prod_{t=1}^T \pi(a_t|s_t, \theta_L)}{\prod_{t=1}^T \pi(a_t|s_t, \theta_\ell)}.

Although the importance-weighted estimator θ̂^EIW_{L+1} is guaranteed to be consistent, it tends to have large variance (Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Therefore, the importance-weighted estimator tends to be unstable when the number of episodes N is rather small.
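The cancellation of the transition densities means that the episodic importance weight only requires evaluating the policy densities along a trajectory. The following Python sketch illustrates this for the Gaussian policy; the function signature and the representation of a trajectory as a list of (s, a) pairs are assumptions for illustration.

import numpy as np

def gaussian_policy_logpdf(a, s, mu, sigma, phi):
    # Log density of the Gaussian policy N(a; mu^T phi(s), sigma^2).
    m = mu @ phi(s)
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (a - m) ** 2 / (2 * sigma ** 2)

def episodic_importance_weight(traj, theta_L, theta_ell, phi):
    # Episodic importance weight w^(L,l)(h): the transition densities cancel,
    # so only the policy densities along the trajectory are needed.
    mu_L, sig_L = theta_L
    mu_l, sig_l = theta_ell
    log_w = 0.0
    for s, a in traj:
        log_w += gaussian_policy_logpdf(a, s, mu_L, sig_L, phi)
        log_w -= gaussian_policy_logpdf(a, s, mu_l, sig_l, phi)
    return np.exp(log_w)   # accumulated in log space for numerical stability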
8.2.2 Per-Decision Importance Weighting

Since the reward at the t-th step does not depend on future state-action transitions after the t-th step, an episodic importance weight can be decomposed into stepwise importance weights (Precup et al., 2000). For instance, the expected return J(θL) can be expressed as

J(\theta_L) = \int R(h) p(h|\theta_L) \, dh = \int \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, w^{(L,\ell)}(h) \, p(h|\theta_\ell) \, dh = \int \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, w^{(L,\ell)}_t(h) \, p(h|\theta_\ell) \, dh,

where w^{(L,ℓ)}_t(h) is the t-step importance weight, called the per-decision importance weight (PIW), defined as

w^{(L,\ell)}_t(h) = \frac{\prod_{t'=1}^t \pi(a_{t'}|s_{t'}, \theta_L)}{\prod_{t'=1}^t \pi(a_{t'}|s_{t'}, \theta_\ell)}.

Here, the PIW idea is applied to RWR and a more stable algorithm is developed. A slight complication is that the policy update formulas given in Section 8.2.1 contain double sums over T steps, e.g.,

R(h) \sum_{t'=1}^T \phi(s_{t'})\phi(s_{t'})^\top = \sum_{t,t'=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, \phi(s_{t'})\phi(s_{t'})^\top.

In this case, the summand

\gamma^{t-1} r(s_t, a_t, s_{t+1}) \, \phi(s_{t'})\phi(s_{t'})^\top

does not depend on future state-action pairs after the max(t, t')-th step. Thus, the episodic importance weight for this summand can be simplified to the per-decision importance weight w^{(L,ℓ)}_{max(t,t')}. Consequently, the PIW-based policy update rules are given as

\hat\mu^{\mathrm{PIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top \, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),

(\hat\sigma^{\mathrm{PIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t=1}^T \gamma^{t-1} r_{t,n} \, w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \frac{1}{T} \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \left( a^{\pi_\ell}_{t',n} - \hat\mu^{\mathrm{PIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \right)^2 w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),

where

r_{t,n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

This PIW estimator θ̂^PIW_{L+1} = (μ̂^{PIW⊤}_{L+1}, σ̂^PIW_{L+1})⊤ is consistent and potentially more stable than the plain EIW estimator θ̂^EIW_{L+1}.
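The per-decision weights are simply the cumulative products of stepwise policy-density ratios, so they can be computed in one pass over a trajectory. A minimal sketch follows, reusing gaussian_policy_logpdf from the previous sketch; as before, the trajectory representation is an illustrative assumption.

import numpy as np

def per_decision_weights(traj, theta_L, theta_ell, phi):
    # Per-decision importance weights w^(L,l)_t(h) for t = 1..T; w_t only
    # involves actions up to the t-th step, and weights[-1] equals the
    # episodic weight.
    mu_L, sig_L = theta_L
    mu_l, sig_l = theta_ell
    log_cum = 0.0
    weights = []
    for s, a in traj:
        log_cum += gaussian_policy_logpdf(a, s, mu_L, sig_L, phi)
        log_cum -= gaussian_policy_logpdf(a, s, mu_l, sig_l, phi)
        weights.append(np.exp(log_cum))
    return np.array(weights)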
8.2.3 Adaptive Per-Decision Importance Weighting

To more actively control the stability of the PIW estimator, the adaptive per-decision importance weight (AIW) is employed. More specifically, an importance weight w^{(L,ℓ)}_{max(t,t')}(h) is "flattened" by a flattening parameter ν ∈ [0, 1] as (w^{(L,ℓ)}_{max(t,t')}(h))^ν, i.e., the ν-th power of the per-decision importance weight. Then we have θ̂^AIW_{L+1} = (μ̂^{AIW⊤}_{L+1}, σ̂^AIW_{L+1})⊤, where

\hat\mu^{\mathrm{AIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right),

(\hat\sigma^{\mathrm{AIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t=1}^T \gamma^{t-1} r_{t,n} \left( w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \frac{1}{T} \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \left( a^{\pi_\ell}_{t',n} - \hat\mu^{\mathrm{AIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \right)^2 \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right).

When ν = 0, AIW is reduced to NIW. Therefore, it is relatively stable, but not consistent. On the other hand, when ν = 1, AIW is reduced to PIW. Therefore, it is consistent, but rather unstable. In practice, an intermediate ν often produces a better estimator. Note that the value of the flattening parameter can be different in each iteration, i.e., ν may be replaced by νℓ. However, for simplicity, a single common value ν is considered here.
8.2.4 Automatic Selection of Flattening Parameter

The flattening parameter allows us to control the trade-off between consistency and stability. Here, we show how the value of the flattening parameter can be optimally chosen using data samples.

The goal of policy search is to find the optimal policy that maximizes the expected return J(θ). Therefore, the optimal flattening parameter value ν*_L at the L-th iteration is given by

\nu^*_L = \mathop{\mathrm{argmax}}_{\nu} J(\hat\theta^{\mathrm{AIW}}_{L+1}(\nu)).

Directly obtaining ν*_L requires the computation of the expected return J(θ̂^AIW_{L+1}(ν)) for each candidate of ν. To this end, data samples following π(a|s; θ̂^AIW_{L+1}(ν)) are needed for each ν, which is prohibitively expensive. To reuse samples generated by previous policies, a variation of cross-validation called importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is employed.

The basic idea of IWCV is to split the training dataset H^{π_{1:L}} = {H^{πℓ}}_{ℓ=1}^L into an "estimation part" and a "validation part." Then the policy parameter θ̂^AIW_{L+1}(ν) is learned from the estimation part and its expected return J(θ̂^AIW_{L+1}(ν)) is approximated using the importance-weighted loss for the validation part. As pointed out in Section 8.2.1, importance weighting tends to be unstable when the number N of episodes is small. For this reason, per-decision importance weighting is used for cross-validation. Below, how IWCV is applied to the selection of the flattening parameter ν in the current context is explained in more detail.

Let us divide the training dataset H^{π_{1:L}} = {H^{πℓ}}_{ℓ=1}^L into K disjoint subsets {H^{π_{1:L}}_k}_{k=1}^K of the same size, where each H^{π_{1:L}}_k contains N/K episodic samples from every H^{πℓ}. For simplicity, we assume that N is divisible by K, i.e., N/K is an integer. K = 5 will be used in the experiments later.

Let θ̂^AIW_{L+1,k}(ν) be the policy parameter learned from {H^{π_{1:L}}_{k'}}_{k'≠k} (i.e., all data without H^{π_{1:L}}_k) by AIW estimation. The expected return of θ̂^AIW_{L+1,k}(ν) is estimated using the PIW estimator from H^{π_{1:L}}_k as

\hat{J}^{k}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1,k}(\nu)) = \frac{1}{\eta} \sum_{h \in H^{\pi_{1:L}}_k} \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, w^{(L,\ell)}_t(h),

where η is a normalization constant. An ordinary choice is η = LN/K, but a more stable variant given by

\eta = \sum_{h \in H^{\pi_{1:L}}_k} w^{(L,\ell)}_t(h)

is often preferred in practice (Precup et al., 2000).

The above procedure is repeated for all k = 1, ..., K, and the average score,

\hat{J}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1}(\nu)) = \frac{1}{K} \sum_{k=1}^K \hat{J}^{k}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1,k}(\nu)),

is computed. This is the K-fold IWCV estimator of J(θ̂^AIW_{L+1}(ν)), which was shown to be almost unbiased (Sugiyama et al., 2007).

This K-fold IWCV score is computed for each candidate value of the flattening parameter ν, and the one that maximizes the IWCV score is chosen:

\hat\nu_{\mathrm{IWCV}} = \mathop{\mathrm{argmax}}_{\nu} \hat{J}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1}(\nu)).

This IWCV scheme can also be used for choosing the basis functions φ(s) in the Gaussian policy model.

Note that when the importance weights w^{(L,ℓ)}_{max(t,t')} are all one (i.e., no importance weighting), the above IWCV procedure is reduced to the ordinary CV procedure. The use of IWCV is essential here since the target policy π(a|s, θ̂^AIW_{L+1}(ν)) is usually different from the previous policies used for collecting the data samples H^{π_{1:L}}. Therefore, the expected return estimated using ordinary CV, Ĵ_CV(θ̂^AIW_{L+1}(ν)), would be heavily biased.
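The overall selection procedure can be summarized in a short Python sketch. The helper names fit_aiw and piw_return are placeholders standing for "learn the AIW policy parameter on the training folds" and "evaluate the per-decision importance-weighted return on the held-out fold"; they are assumptions made for this sketch, not functions defined in the book.

import numpy as np

def kfold_iwcv_score(datasets, fit_aiw, piw_return, nu, K=5):
    # K-fold IWCV score for one candidate flattening parameter nu.
    folds = np.array_split(np.arange(len(datasets)), K)
    scores = []
    for k in range(K):
        val_idx = set(folds[k].tolist())
        train = [d for i, d in enumerate(datasets) if i not in val_idx]
        val = [d for i, d in enumerate(datasets) if i in val_idx]
        policy = fit_aiw(train, nu)              # learn on K-1 folds
        scores.append(piw_return(policy, val))   # evaluate on held-out fold
    return np.mean(scores)

# The flattening parameter is then the candidate maximizing the IWCV score:
# nu_hat = max(candidates, key=lambda nu: kfold_iwcv_score(data, fit_aiw, piw_return, nu))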
8.2.5 Reward-Weighted Regression with Sample Reuse

So far, we have introduced AIW to control the stability of the policy-parameter update and IWCV to automatically choose the flattening parameter based on the estimated expected return. The policy search algorithm that combines these two methods is called reward-weighted regression with sample reuse (RRR).

In each iteration (L = 1, 2, ...) of RRR, episodic data samples H^{π_L} are collected following the current policy π(a|s, θ^AIW_L), the flattening parameter ν is chosen so as to maximize the expected return Ĵ_IWCV(ν) estimated by IWCV using {H^{πℓ}}_{ℓ=1}^L, and then the policy parameter is updated to θ^AIW_{L+1} using {H^{πℓ}}_{ℓ=1}^L.
FIGURE 8.2: Ball balancing using a robot arm simulator. Two joints of the robot (the wrist and the elbow) are controlled to keep the ball in the middle of the tray.
8.3 Numerical Examples

The performance of RRR is experimentally evaluated on a ball-balancing task using a robot arm simulator (Schaal, 2009).

As illustrated in Figure 8.2, a 7-degree-of-freedom arm is mounted on the ceiling upside down, and is equipped with a circular tray of radius 0.24 [m] at the end effector. The goal is to control the joints of the robot so that the ball is brought to the middle of the tray. However, the difficulty is that the angle of the tray cannot be controlled directly, which is a typical restriction in real-world joint-motion planning based on feedback from the environment (e.g., the state of the ball).

To simplify the problem, only two joints are controlled here: the wrist angle α_roll and the elbow angle α_pitch. All the remaining joints are fixed. Control of the wrist and elbow angles would roughly correspond to changing the roll and pitch angles of the tray, but not directly.

Two separate control subsystems are designed here, each of which is in charge of controlling the roll or the pitch angle. Each subsystem has its own policy parameter θ, state space S, and action space A. The state space S is continuous and consists of (x, ẋ), where x [m] is the position of the ball on the tray along each axis and ẋ [m/s] is the velocity of the ball. The action space A is continuous and corresponds to the target angle a [rad] of the joint. The reward function is defined as

r(s, a, s') = \exp\left( -\frac{5(x')^2 + (\dot{x}')^2 + a^2}{2(0.24/2)^2} \right),

where the number 0.24 in the denominator comes from the radius of the tray. Below, how the control system is designed is explained in more detail.
FIGURE 8.3: The block diagram of the robot-arm control system for ball balancing. The control system has two feedback loops, i.e., joint-trajectory planning by RRR and trajectory tracking by a high-gain proportional-derivative (PD) controller.

As illustrated in Figure 8.3, the control system has two feedback loops for trajectory planning using an RRR controller and trajectory tracking using a high-gain proportional-derivative (PD) controller (Siciliano & Khatib, 2008).

The RRR controller outputs the target joint angle obtained by the current policy every 0.2 [s]. Nine Gaussian kernels are used as basis functions φ(s), with the kernel centers {c_b}_{b=1}^9 located over the state space at

(x, ẋ) ∈ {(−0.2, −0.4), (−0.2, 0), (−0.1, 0.4), (0, −0.4), (0, 0), (0, 0.4), (0.1, −0.4), (0.2, 0), (0.2, 0.4)}.

The Gaussian width is set at σ_basis = 0.1. Based on the discrete-time target angles obtained by RRR, the desired joint trajectory in the continuous time domain is linearly interpolated as

a_{t,u} = a_t + u \, \dot{a}_t,

where u is the time from the last output a_t of RRR at the t-th step. ȧ_t is the angular velocity computed by

\dot{a}_t = \frac{a_t - a_{t-1}}{0.2},

where a_0 is the initial angle of a joint. The angular velocity is assumed to be constant during the 0.2 [s] cycle of trajectory planning.

On the other hand, the PD controller converts desired joint trajectories into motor torques as

\tau_{t,u} = \mu_p * (a_{t,u} - \alpha_{t,u}) + \mu_d * (\dot{a}_t - \dot{\alpha}_{t,u}),

where τ is the 2-dimensional vector consisting of the torques applied to the wrist and elbow joints. a = (a_pitch, a_roll)⊤ and ȧ = (ȧ_pitch, ȧ_roll)⊤ are the 2-dimensional vectors consisting of the desired angles and velocities. α = (α_pitch, α_roll)⊤ and α̇ = (α̇_pitch, α̇_roll)⊤ are the 2-dimensional vectors consisting of the current joint angles and velocities. μ_p and μ_d are the 2-dimensional vectors consisting of the proportional and derivative gains. "*" denotes the element-wise product. Since the control cycle of the robot arm is 0.002 [s], the PD controller is applied 100 times (i.e., u = 0.002, 0.004, ..., 0.198, 0.2) in each RRR cycle.
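The two-rate structure (0.2 [s] planning by RRR, 0.002 [s] tracking by the PD controller) can be summarized in the following Python sketch. The function and argument names, including the read_joint_state callback and the unspecified gain vectors mu_p and mu_d, are illustrative assumptions rather than the actual implementation.

import numpy as np

def pd_torque(a_desired, a_dot_desired, alpha, alpha_dot, mu_p, mu_d):
    # tau = mu_p * (a - alpha) + mu_d * (a_dot - alpha_dot), element-wise
    # over the (pitch, roll) joints.
    return mu_p * (a_desired - alpha) + mu_d * (a_dot_desired - alpha_dot)

def track_one_rrr_cycle(a_prev, a_now, read_joint_state, mu_p, mu_d,
                        dt=0.002, cycle=0.2):
    # Linearly interpolate the RRR target angle over one 0.2 [s] cycle and
    # apply the PD controller every 0.002 [s] (100 PD steps per RRR step).
    a_dot = (a_now - a_prev) / cycle          # constant angular velocity
    torques = []
    steps = int(round(cycle / dt))
    for k in range(1, steps + 1):
        u = k * dt
        a_desired = a_now + u * a_dot         # a_{t,u} = a_t + u * a_dot_t
        alpha, alpha_dot = read_joint_state() # hypothetical sensor callback
        torques.append(pd_torque(a_desired, a_dot, alpha, alpha_dot, mu_p, mu_d))
    return torques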
Figure 8.4 depicts a desired trajectory of the wrist joint generated by a random policy and an actual trajectory obtained using the high-gain PD controller described above. The graphs show that the desired trajectory is followed by the robot arm reasonably well.

The policy parameter θ_L is learned through the RRR iterations. The initial policy parameters θ_1 = (μ_1⊤, σ_1)⊤ are set manually as

\mu_1 = (-0.5, -0.5, 0, -0.5, 0, 0, 0, 0, 0)^\top \quad \text{and} \quad \sigma_1 = 0.1,

so that a wide range of states and actions can be safely explored in the first iteration. The initial position of the ball is randomly selected as x ∈ [−0.05, 0.05]. The dataset collected in each iteration consists of 10 episodes with 20 steps. The duration of an episode is 4 [s] and the sampling cycle of RRR is 0.2 [s].

Three scenarios are considered here:

• NIW: Sample reuse with ν = 0.
• PIW: Sample reuse with ν = 1.
• RRR: Sample reuse with ν chosen by IWCV from {0, 0.25, 0.5, 0.75, 1} in each iteration.

The discount factor is set at γ = 0.99. Figure 8.5 depicts the averaged expected return over 10 trials as a function of the number of policy update iterations. The expected return in each trial is computed from 20 test episodic samples that have not been used for training. The graph shows that RRR nicely improves the performance over iterations. On the other hand, the performance for ν = 0 is saturated after the 3rd iteration, and the performance for ν = 1 is improved in the beginning but suddenly goes down at the 5th iteration. The result for ν = 1 indicates that a large change in policies causes severe instability in sample reuse.

Figure 8.6 and Figure 8.7 depict examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by NIW (ν = 0) and RRR (ν chosen by IWCV) after the 10th iteration. With the policy obtained by NIW, the ball goes through the middle of the tray, i.e., (x_roll, x_pitch) = (0, 0), and does not stop. On the other hand, the policy obtained by RRR successfully guides the ball to the middle of the tray along the roll axis, although the movement along the pitch axis looks similar to that by NIW. Motion examples by RRR with ν chosen by IWCV are illustrated in Figure 8.8.
FIGURE 8.4: An example of desired and actual trajectories of the wrist joint in the realistic ball-balancing task. (a) Trajectory in angles. (b) Trajectory in angular velocities. The target joint angle is determined by a random policy every 0.2 [s], and then a linearly interpolated angle and constant velocity are tracked using the proportional-derivative (PD) controller in the cycle of 0.002 [s].
FIGURE 8.5: The performance of learned policies when ν = 0 (NIW), ν = 1 (PIW), and ν is chosen by IWCV (RRR) in ball balancing using a simulated robot-arm system. The performance is measured by the return averaged over 10 trials. The symbol indicates that the method is the best or comparable to the best one in terms of the expected return by the t-test at the significance level 5%, performed at each iteration. The error bars indicate 1/10 of a standard deviation.
FIGURE 8.6: Typical examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by NIW (ν = 0) at the 10th iteration in the ball-balancing task.
FIGURE 8.7: Typical examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by RRR (ν is chosen by IWCV) at the 10th iteration in the ball-balancing task.
FIGURE 8.8: Motion examples of ball balancing by RRR (from left to right and top to bottom).
8.4 Remarks

A direct policy search algorithm based on expectation-maximization (EM) iteratively maximizes the lower bound of the expected return. The EM-based approach does not include the step size parameter, which is an advantage over the gradient-based approach introduced in Chapter 7. A sample-reuse variant of the EM-based method was also provided, which contributes to improving the stability of the algorithm in small-sample scenarios.

In practice, however, the EM-based approach is still rather unstable even if it is combined with the sample-reuse technique. In Chapter 9, another policy search approach will be introduced to further improve the stability of policy updates.
Chapter 9
Policy-Prior Search

The direct policy search methods explained in Chapter 7 and Chapter 8 are useful in solving problems with continuous actions such as robot control. However, they tend to suffer from instability of policy update. In this chapter, we introduce an alternative policy search method called policy-prior search, which is adopted in the PGPE (policy gradients with parameter-based exploration) method (Sehnke et al., 2010). The basic idea is to use deterministic policies to remove excessive randomness and to introduce useful stochasticity by considering a prior distribution over policy parameters.

After formulating the problem of policy-prior search in Section 9.1, a gradient-based algorithm is introduced in Section 9.2, including its improvement using baseline subtraction, theoretical analysis, and experimental evaluation. Then, in Section 9.3, a sample-reuse variant is described and its performance is theoretically analyzed and experimentally investigated using a humanoid robot. Finally, this chapter is concluded in Section 9.4.
9.1 Formulation

In this section, the policy search problem is formulated based on policy priors.

The basic idea is to use a deterministic policy and introduce stochasticity by drawing policy parameters from a prior distribution. More specifically, policy parameters are randomly determined following the prior distribution at the beginning of each trajectory, and thereafter action selection is deterministic (Figure 9.1). Note that transitions are generally stochastic, and thus trajectories are also stochastic even though the policy is deterministic. Thanks to this per-trajectory formulation, the variance of gradient estimators in policy-prior search does not increase with respect to the trajectory length, which allows us to overcome the critical drawback of direct policy search.

Policy-prior search uses a deterministic policy with typically a linear architecture:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),

where δ(·) is the Dirac delta function and φ(s) is the basis function. The policy parameter θ is drawn from a prior distribution p(θ|ρ) with hyper-parameter ρ.

FIGURE 9.1: Illustration of (a) the stochastic policy and (b) the deterministic policy with a prior under deterministic transition. The number of possible trajectories is exponential with respect to the trajectory length when stochastic policies are used, while it does not grow when deterministic policies drawn from a prior distribution are used.

The expected return in policy-prior search is defined in terms of the expectations over both trajectory h and policy parameter θ as a function of the hyper-parameter ρ:

J(\rho) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[R(h)] = \iint p(h|\theta) p(\theta|\rho) R(h) \, dh \, d\theta,

where E_{p(h|θ)p(θ|ρ)} denotes the expectation over trajectory h and policy parameter θ drawn from p(h|θ)p(θ|ρ). In policy-prior search, the hyper-parameter ρ is optimized so that the expected return J(ρ) is maximized. Thus, the optimal hyper-parameter ρ* is given by

\rho^* = \mathop{\mathrm{argmax}}_{\rho} J(\rho).
9.2 Policy Gradients with Parameter-Based Exploration

In this section, a gradient-based algorithm for policy-prior search is given.

9.2.1 Policy-Prior Gradient Ascent

Here, a gradient method is used to find a local maximizer of the expected return J with respect to the hyper-parameter ρ:

\rho \longleftarrow \rho + \varepsilon \nabla_\rho J(\rho),

where ε is a small positive constant and ∇_ρ J(ρ) is the derivative of J with respect to ρ:

\nabla_\rho J(\rho) = \iint p(h|\theta) \nabla_\rho p(\theta|\rho) R(h) \, dh \, d\theta = \iint p(h|\theta) p(\theta|\rho) \nabla_\rho \log p(\theta|\rho) R(h) \, dh \, d\theta = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\nabla_\rho \log p(\theta|\rho) R(h)],

where the logarithmic derivative,

\nabla_\rho \log p(\theta|\rho) = \frac{\nabla_\rho p(\theta|\rho)}{p(\theta|\rho)},

was used in the derivation. The expectations over h and θ are approximated by the empirical averages:

\widehat{\nabla}_\rho J(\rho) = \frac{1}{N} \sum_{n=1}^N \nabla_\rho \log p(\theta_n|\rho) R(h_n),   (9.1)

where each trajectory sample h_n is drawn independently from p(h|θ_n) and the parameter θ_n is drawn from p(θ|ρ). Thus, in policy-prior search, samples are pairs of θ and h:

H = \{ (\theta_1, h_1), \ldots, (\theta_N, h_N) \}.

As the prior distribution for the policy parameter θ = (θ_1, ..., θ_B)⊤, where B is the dimensionality of the basis vector φ(s), the independent Gaussian distribution is a standard choice. For this Gaussian prior, the hyper-parameter ρ consists of prior means η = (η_1, ..., η_B)⊤ and prior standard deviations τ = (τ_1, ..., τ_B)⊤:

p(\theta|\eta,\tau) = \prod_{b=1}^B \frac{1}{\tau_b \sqrt{2\pi}} \exp\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.2)

Then the derivatives of the log-prior log p(θ|η, τ) with respect to η_b and τ_b are given as

\nabla_{\eta_b} \log p(\theta|\eta,\tau) = \frac{\theta_b - \eta_b}{\tau_b^2},
\nabla_{\tau_b} \log p(\theta|\eta,\tau) = \frac{(\theta_b - \eta_b)^2 - \tau_b^2}{\tau_b^3}.

By substituting these derivatives into Eq. (9.1), the policy-prior gradients with respect to η and τ can be approximated.
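Putting Eq. (9.1) together with the Gaussian-prior derivatives gives a very compact estimator, as in the following Python sketch. The array-based interface (thetas is an (N, B) array of sampled policy parameters, returns is (N,)) and the fixed step size are assumptions made for illustration.

import numpy as np

def pgpe_gradient(thetas, returns, eta, tau):
    # Empirical PGPE gradient of J with respect to the prior means eta and
    # standard deviations tau, via Eq. (9.1) and the log-prior derivatives.
    diff = thetas - eta                                            # (N, B)
    grad_eta = (returns[:, None] * diff / tau ** 2).mean(axis=0)
    grad_tau = (returns[:, None] * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
    return grad_eta, grad_tau

def pgpe_update(eta, tau, thetas, returns, step=0.1):
    # One gradient-ascent step on the hyper-parameters (eta, tau).
    g_eta, g_tau = pgpe_gradient(thetas, returns, eta, tau)
    return eta + step * g_eta, tau + step * g_tau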
9.2.2 Baseline Subtraction for Variance Reduction

As explained in Section 7.2.2, subtraction of a baseline can reduce the variance of gradient estimators. Here, a baseline subtraction method for policy-prior search is described.

For a baseline ξ, a modified gradient estimator is given by

\widehat{\nabla}_\rho J_\xi(\rho) = \frac{1}{N} \sum_{n=1}^N (R(h_n) - \xi) \nabla_\rho \log p(\theta_n|\rho).

Let ξ* be the optimal baseline that minimizes the variance of the gradient:

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\rho J_\xi(\rho)],

where Var_{p(h|θ)p(θ|ρ)} denotes the trace of the covariance matrix:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\zeta] = \mathrm{tr}\left( \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta]) (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta])^\top \right] \right) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ \| \zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta] \|^2 \right].

It was shown in Zhao et al. (2012) that the optimal baseline for policy-prior search is given by

\xi^* = \frac{\mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ R(h) \|\nabla_\rho \log p(\theta|\rho)\|^2 \right]}{\mathbb{E}_{p(\theta|\rho)}\left[ \|\nabla_\rho \log p(\theta|\rho)\|^2 \right]},

where E_{p(θ|ρ)} denotes the expectation over the policy parameter θ drawn from p(θ|ρ). In practice, the expectations are approximated by the sample averages.
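A sample-based estimate of the optimal baseline follows directly from this formula; the sketch below, with the same illustrative array interface as before, weights the returns by the squared norm of the log-prior gradient over both the η- and τ-components.

import numpy as np

def pgpe_optimal_baseline(thetas, returns, eta, tau):
    # Sample approximation of xi*: returns weighted by ||grad log p(theta|rho)||^2.
    diff = thetas - eta
    grad_eta = diff / tau ** 2
    grad_tau = (diff ** 2 - tau ** 2) / tau ** 3
    sq_norm = (grad_eta ** 2).sum(axis=1) + (grad_tau ** 2).sum(axis=1)  # (N,)
    return (returns * sq_norm).mean() / sq_norm.mean()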
9.2.3 Variance Analysis of Gradient Estimators

Here the variance of gradient estimators is theoretically investigated for the independent Gaussian prior (9.2) with φ(s) = s. See Zhao et al. (2012) for technical details.

Below, subsets of the following assumptions are considered (which are the same as the ones used in Section 7.2.3):

Assumption (A): r(s, a, s') ∈ [−β, β] for β > 0.
Assumption (B): r(s, a, s') ∈ [α, β] for 0 < α < β.
Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^T and {d_t}_{t=1}^T such that ‖s_t‖ ≥ c_t and ‖s_t‖ ≤ d_t hold with probability at least 1 − δ/(2N), respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A).

Let

G = \sum_{b=1}^B \tau_b^{-2}.

First, the variance of gradient estimators in policy-prior search is analyzed:

Theorem 9.1 Under Assumption (A), the following upper bounds hold:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\left[ \widehat{\nabla}_\eta J(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{\beta^2 G}{N(1-\gamma)^2},

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\left[ \widehat{\nabla}_\tau J(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{2\beta^2 G}{N(1-\gamma)^2}.

The second upper bounds are independent of the trajectory length T, while the upper bounds for direct policy search (Theorem 7.1 in Section 7.2.3) are monotone increasing with respect to the trajectory length T. Thus, gradient estimation in policy-prior search is expected to be more reliable than that in direct policy search when the trajectory length T is large.

The following theorem more explicitly compares the variance of gradient estimators in direct policy search and policy-prior search:

Theorem 9.2 In addition to Assumptions (B) and (C), assume that

\zeta(T) = C_T \alpha^2 - D_T \beta^2/(2\pi)

is positive and monotone increasing with respect to T, where

C_T = \sum_{t=1}^T c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^T d_t^2.

If there exists T_0 such that ζ(T_0) ≥ β²Gσ², then it holds that

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\mu J(\theta)] > \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J(\eta,\tau)]

for all T > T_0, with probability at least 1 − δ.

The above theorem means that policy-prior search is more favorable than direct policy search in terms of the variance of gradient estimators of the mean, if the trajectory length T is large.

Next, the contribution of the optimal baseline to the variance of the gradient estimator with respect to the mean parameter η is investigated. It was shown in Zhao et al. (2012) that the excess variance for a baseline ξ is given by

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\rho J_\xi(\rho)] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\rho J_{\xi^*}(\rho)] = \frac{(\xi - \xi^*)^2}{N} \, \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ \|\nabla_\rho \log p(\theta|\rho)\|^2 \right].

Based on this expression, the following theorem holds.

Theorem 9.3 If r(s, a, s') ≥ α > 0, the following lower bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J(\eta,\tau)] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau)] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

Under Assumption (A), the following upper bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J(\eta,\tau)] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau)] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

The above theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by subtracting the optimal baseline and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.

Finally, the variance of the gradient estimator with the optimal baseline is investigated:

Theorem 9.4 Under Assumptions (B) and (C), the following upper bound holds with probability at least 1 − δ:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau)] \le \frac{(1-\gamma^T)^2}{N(1-\gamma)^2} (\beta^2 - \alpha^2) G \le \frac{(\beta^2 - \alpha^2) G}{N(1-\gamma)^2}.

The second upper bound is independent of the trajectory length T, while Theorem 7.4 in Section 7.2.3 showed that the upper bound of the variance of gradient estimators with the optimal baseline in direct policy search is monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, policy-prior search is more favorable than direct policy search in terms of the variance of the gradient estimator with respect to the mean even when optimal baseline subtraction is applied.
9.2.4 Numerical Examples

Here, the performance of the direct policy search and policy-prior search algorithms is experimentally compared.

9.2.4.1 Setup

Let the state space S be one-dimensional and continuous, and let the initial state be randomly chosen following the standard normal distribution. The action space A is also set to be one-dimensional and continuous. The transition dynamics of the environment is set at

s_{t+1} = s_t + a_t + \varepsilon,
TABLE 9.1: Variance and bias of estimated parameters.

(a) Trajectory length T = 10

Method          Variance (µ, η)   Variance (σ, τ)   Bias (µ, η)   Bias (σ, τ)
REINFORCE            13.257           26.917          -0.310        -1.510
REINFORCE-OB          0.091            0.120           0.067         0.129
PGPE                  0.971            1.686          -0.069         0.132
PGPE-OB               0.037            0.069          -0.016         0.051

(b) Trajectory length T = 50

Method          Variance (µ, η)   Variance (σ, τ)   Bias (µ, η)   Bias (σ, τ)
REINFORCE           188.386          278.310          -1.813        -5.175
REINFORCE-OB          0.545            0.900          -0.299        -0.201
PGPE                  1.657            3.372          -0.105        -0.329
PGPE-OB               0.085            0.182           0.048        -0.078
where ε ∼ N(0, 0.5²) is stochastic noise and N(μ, σ²) denotes the normal distribution with mean μ and variance σ². The immediate reward is defined as

r = \exp(-s^2/2 - a^2/2) + 1,

which is bounded as 1 < r ≤ 2. The length of the trajectory is set at T = 10 or 50, the discount factor is set at γ = 0.9, and the number of episodic samples is set at N = 100.

9.2.4.2 Variance and Bias

First, the variance and the bias of gradient estimators of the following methods are investigated:

• REINFORCE: REINFORCE (gradient-based direct policy search) without a baseline (Williams, 1992).
• REINFORCE-OB: REINFORCE with optimal baseline subtraction (Peters & Schaal, 2006).
• PGPE: PGPE (gradient-based policy-prior search) without a baseline (Sehnke et al., 2010).
• PGPE-OB: PGPE with optimal baseline subtraction (Zhao et al., 2012).

Table 9.1 summarizes the variance of gradient estimators over 100 runs, showing that the variance of REINFORCE is overall larger than that of PGPE. A notable difference between REINFORCE and PGPE is that the variance of REINFORCE significantly grows as the trajectory length T increases, whereas that of PGPE is not influenced that much by T. This agrees well with the theoretical analyses given in Section 7.2.3 and Section 9.2.3. Optimal baseline subtraction (REINFORCE-OB and PGPE-OB) is shown to contribute highly to reducing the variance, especially when the trajectory length T is large, which also agrees well with the theoretical analysis.

The bias of the gradient estimator of each method is also investigated. Here, gradients estimated with N = 1000 are regarded as true gradients, and the bias of gradient estimators is computed. The results are also included in Table 9.1, showing that introduction of baselines does not increase the bias; rather, it tends to reduce the bias.
9.2.4.3 Variance and Policy Hyper-Parameter Change through Entire Policy-Update Process

Next, the variance of gradient estimators is investigated when policy hyper-parameters are updated over iterations. If the deviation parameter σ takes a negative value during the policy-update process, it is set at 0.05. In this experiment, the variance is computed from 50 runs for T = 20 and N = 10, and policies are updated over 50 iterations. In order to evaluate the variance in a stable manner, the above experiments are repeated 20 times with random choice of the initial mean parameter μ from [−3.0, −0.1], and the average variance of gradient estimators with respect to the mean parameter μ over 20 trials is investigated. The results are plotted in Figure 9.2. Figure 9.2(a) compares the variance of REINFORCE with/without baselines, whereas Figure 9.2(b) compares the variance of PGPE with/without baselines. These graphs show that introduction of baselines contributes highly to the reduction of the variance over iterations.

Let us illustrate how parameters are updated by PGPE-OB over 50 iterations for N = 10 and T = 10. The initial mean parameter is set at η = −1.6, −0.8, or −0.1, and the initial deviation parameter is set at τ = 1. Figure 9.3 depicts the contour of the expected return and illustrates trajectories of parameter updates over iterations by PGPE-OB. In the graph, the maximum of the return surface is located at the middle bottom, and PGPE-OB leads the solutions to a maximum point rapidly.
9.2.4.4 Performance of Learned Policies

Finally, the return obtained by each method is evaluated. The trajectory length is fixed at T = 20, and the maximum number of policy-update iterations is set at 50. Average returns over 20 runs are investigated as functions of the number of episodic samples N. Figure 9.4(a) shows the results when the initial mean parameter μ is chosen randomly from [−1.6, −0.1], which tends to perform well. The graph shows that PGPE-OB performs the best, especially when N < 5; then REINFORCE-OB follows with a small margin.
FIGURE 9.2: Mean and standard error of the variance of gradient estimators with respect to the mean parameter through policy-update iterations. (a) REINFORCE and REINFORCE-OB. (b) PGPE and PGPE-OB.

FIGURE 9.3: Trajectories of policy-prior parameter updates by PGPE (contour of the expected return over the policy-prior mean η and the policy-prior standard deviation τ).
FIGURE 9.4: Average and standard error of returns over 20 runs as functions of the number of episodic samples N. (a) Good initial policy. (b) Poor initial policy.
The plain PGPE also works reasonably well, although it is slightly unstable due to larger variance. The plain REINFORCE is highly unstable, which is caused by the huge variance of gradient estimators (see Figure 9.2 again). Figure 9.4(b) describes the results when the initial mean parameter μ is chosen randomly from [−3.0, −0.1], which tends to result in poorer performance. In this setup, the difference among the compared methods is more significant than in the case with good initial policies, meaning that REINFORCE is sensitive to the choice of initial policies. Overall, the PGPE methods tend to outperform the REINFORCE methods, and among the PGPE methods, PGPE-OB works very well and converges quickly.
9.3 Sample Reuse in Policy-Prior Search

Although PGPE was shown to outperform REINFORCE, its behavior is still rather unstable if the number of data samples used for estimating the gradient is small. In this section, the sample-reuse idea is applied to PGPE. Technically, the original PGPE is categorized as an on-policy algorithm, where data drawn from the current target policy is used to estimate policy-prior gradients. On the other hand, off-policy algorithms are more flexible in the sense that a data-collecting policy and the current target policy can be different. Here, PGPE is extended to the off-policy scenario using the importance-weighting technique.

9.3.1 Importance Weighting

Let us consider an off-policy scenario where a data-collecting policy and the current target policy are different in general. In the context of PGPE, two hyper-parameters are considered: ρ as the target policy to learn and ρ' as a policy for data collection. Let us denote the data samples collected with hyper-parameter ρ' by H':

H' = \{ (\theta'_n, h'_n) \}_{n=1}^{N'} \overset{\mathrm{i.i.d.}}{\sim} p(h|\theta) p(\theta|\rho').

If the data H' is naively used to estimate policy-prior gradients by Eq. (9.1), we suffer an inconsistency problem:

\frac{1}{N'} \sum_{n=1}^{N'} \nabla_\rho \log p(\theta'_n|\rho) R(h'_n) \;\overset{N'\to\infty}{\nrightarrow}\; \nabla_\rho J(\rho),

where

\nabla_\rho J(\rho) = \iint p(h|\theta) p(\theta|\rho) \nabla_\rho \log p(\theta|\rho) R(h) \, dh \, d\theta

is the gradient of the expected return,

J(\rho) = \iint p(h|\theta) p(\theta|\rho) R(h) \, dh \, d\theta,

with respect to the policy hyper-parameter ρ. Below, this naive method is referred to as non-importance-weighted PGPE (NIW-PGPE).

This inconsistency problem can be systematically resolved by importance weighting:

\widehat{\nabla}_\rho J_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n) \nabla_\rho \log p(\theta'_n|\rho) R(h'_n) \;\xrightarrow{N'\to\infty}\; \nabla_\rho J(\rho),

where w(θ) = p(θ|ρ)/p(θ|ρ') is the importance weight. This extended method is called importance-weighted PGPE (IW-PGPE).
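Because the importance weight only involves the two Gaussian priors over θ (the trajectory densities cancel as before), the IW-PGPE estimator is simple to compute. The following Python sketch, with the same illustrative array interface as the earlier PGPE sketch, shows the reweighted gradient with respect to (η, τ).

import numpy as np

def gaussian_prior_logpdf(theta, eta, tau):
    # Log density of the independent Gaussian prior p(theta | eta, tau).
    return np.sum(-0.5 * np.log(2 * np.pi * tau ** 2)
                  - (theta - eta) ** 2 / (2 * tau ** 2), axis=-1)

def iw_pgpe_gradient(thetas, returns, eta, tau, eta_data, tau_data):
    # Samples drawn under the data-collecting prior (eta_data, tau_data) are
    # reweighted by w(theta) = p(theta|rho) / p(theta|rho') before averaging.
    w = np.exp(gaussian_prior_logpdf(thetas, eta, tau)
               - gaussian_prior_logpdf(thetas, eta_data, tau_data))   # (N',)
    diff = thetas - eta
    grad_eta = ((w * returns)[:, None] * diff / tau ** 2).mean(axis=0)
    grad_tau = ((w * returns)[:, None] * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
    return grad_eta, grad_tau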
Below, the variance of gradient estimators in IW-PGPE is theoretically analyzed. See Zhao et al. (2013) for technical details. As described in Section 9.2.1, the deterministic linear policy model is used here:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),   (9.3)

where δ(·) is the Dirac delta function and φ(s) is the B-dimensional basis function. The policy parameter θ = (θ_1, ..., θ_B)⊤ is drawn from the independent Gaussian prior, where the policy hyper-parameter ρ consists of prior means η = (η_1, ..., η_B)⊤ and prior standard deviations τ = (τ_1, ..., τ_B)⊤:

p(\theta|\eta,\tau) = \prod_{b=1}^B \frac{1}{\tau_b \sqrt{2\pi}} \exp\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.4)

Let

G = \sum_{b=1}^B \tau_b^{-2},

and let Var_{p(h'|θ')p(θ'|ρ')} denote the trace of the covariance matrix:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] = \mathrm{tr}\left( \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\left[ (\zeta - \mathbb{E}[\zeta])(\zeta - \mathbb{E}[\zeta])^\top \right] \right) = \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\left[ \|\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta]\|^2 \right],

where E_{p(h'|θ')p(θ'|ρ')} denotes the expectation over trajectory h' and policy parameter θ' drawn from p(h'|θ')p(θ'|ρ'). Then the following theorem holds:

Theorem 9.5 Assume that for all s, a, and s', there exists β > 0 such that r(s, a, s') ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\left[ \widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max},

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\left[ \widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max}.

It is interesting to note that the upper bounds are the same as the ones for the plain PGPE (Theorem 9.1 in Section 9.2.3) except for the factor w_max. When w_max = 1, the bounds are reduced to those of the plain PGPE method. However, if the sampling distribution is significantly different from the target distribution, w_max can take a large value and thus IW-PGPE can produce a gradient estimator with large variance. Therefore, IW-PGPE may not be a reliable approach as it is.

Below, a variance reduction technique for IW-PGPE is introduced which leads to a practically useful algorithm.
9.3.2 Variance Reduction by Baseline Subtraction

Here, a baseline is introduced for IW-PGPE to reduce the variance of gradient estimators, in the same way as for the plain PGPE explained in Section 9.2.2.

A policy-prior gradient estimator with a baseline ξ ∈ R is defined as

\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} (R(h'_n) - \xi) \, w(\theta'_n) \nabla_\rho \log p(\theta'_n|\rho).

Here, the baseline ξ is determined so that the variance is minimized. Let ξ* be the optimal baseline for IW-PGPE that minimizes the variance:

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho)].

Then the optimal baseline for IW-PGPE is given as follows (Zhao et al., 2013):

\xi^* = \frac{\mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\left[ R(h') \, w^2(\theta') \|\nabla_\rho \log p(\theta'|\rho)\|^2 \right]}{\mathbb{E}_{p(\theta'|\rho')}\left[ w^2(\theta') \|\nabla_\rho \log p(\theta'|\rho)\|^2 \right]},

where E_{p(θ'|ρ')} denotes the expectation over the policy parameter θ' drawn from p(θ'|ρ'). In practice, the expectations are approximated by the sample averages. The excess variance for a baseline ξ is given as

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho)] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\widehat{\nabla}_\rho J^{\xi^*}_{\mathrm{IW}}(\rho)] = \frac{(\xi - \xi^*)^2}{N'} \, \mathbb{E}_{p(\theta'|\rho')}\left[ w^2(\theta') \|\nabla_\rho \log p(\theta'|\rho)\|^2 \right].

Next, contributions of the optimal baseline to variance reduction in IW-PGPE are analyzed for the deterministic linear policy model (9.3) and the independent Gaussian prior (9.4). See Zhao et al. (2013) for technical details.

Theorem 9.6 Assume that for all s, a, and s', there exists α > 0 such that r(s, a, s') ≥ α, and, for all θ, there exists w_min > 0 such that w(θ) ≥ w_min. Then, the following lower bounds hold:

\mathrm{Var}[\widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\min},

\mathrm{Var}[\widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \ge \frac{2\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\min}.

Assume that for all s, a, and s', there exists β > 0 such that r(s, a, s') ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}[\widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max},

\mathrm{Var}[\widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max}.

(All variances above are taken with respect to p(h'|θ')p(θ'|ρ').) This theorem shows that the bounds of the variance reduction in IW-PGPE brought by the optimal baseline depend on the bounds of the importance weight, w_min and w_max: the larger the upper bound w_max is, the more optimal baseline subtraction can reduce the variance.

From Theorem 9.5 and Theorem 9.6, the following corollary can be immediately obtained:

Corollary 9.7 Assume that for all s, a, and s', there exists 0 < α < β such that r(s, a, s') ∈ [α, β], and, for all θ, there exist 0 < w_min < w_max < ∞ such that w_min ≤ w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}[\widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{(1-\gamma^T)^2 G}{N'(1-\gamma)^2} (\beta^2 w_{\max} - \alpha^2 w_{\min}),

\mathrm{Var}[\widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{2(1-\gamma^T)^2 G}{N'(1-\gamma)^2} (\beta^2 w_{\max} - \alpha^2 w_{\min}).

From Theorem 9.5 and this corollary, we can confirm that the upper bounds for the baseline-subtracted IW-PGPE are smaller than those for the plain IW-PGPE without baseline subtraction, because α² w_min > 0. In particular, if w_min is large, the upper bounds for the baseline-subtracted IW-PGPE can be much smaller than those for the plain IW-PGPE without baseline subtraction.
9.3.3 Numerical Examples

Here, we consider the control task of the humanoid robot CB-i (Cheng et al., 2007) shown in Figure 9.5(a). The goal is to lead the end effector of the right arm (the right hand) to a target object. First, its simulated upper-body model, illustrated in Figure 9.5(b), is used to investigate the performance of the IW-PGPE-OB method. Then the IW-PGPE-OB method is applied to the real robot.

FIGURE 9.5: Humanoid robot CB-i and its upper-body model. (a) CB-i. (b) Simulated upper-body model. The humanoid robot CB-i was developed by the JST-ICORP Computational Brain Project and ATR Computational Neuroscience Labs (Cheng et al., 2007).

9.3.3.1 Setup

The performance of the following 4 methods is compared:

• IW-REINFORCE-OB: Importance-weighted REINFORCE with the optimal baseline.
• NIW-PGPE-OB: Data-reuse PGPE-OB without importance weighting.
• PGPE-OB: Plain PGPE-OB without data reuse.
• IW-PGPE-OB: Importance-weighted PGPE with the optimal baseline.

The upper body of CB-i has 9 degrees of freedom: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch (Figure 9.5(b)). At each time step, the controller receives states from the system and sends out actions. The state space is 18-dimensional, which corresponds to the current angle and angular velocity of each joint. The action space is 9-dimensional, which corresponds to the target angle of each joint. Both states and actions are continuous.

Given the state and action in each time step, the physical control system calculates the torques at each joint by using a proportional-derivative (PD) controller as

\tau_i = K_{p_i}(a_i - s_i) - K_{d_i}\dot{s}_i,

where s_i, ṡ_i, and a_i denote the current angle, the current angular velocity, and the target angle of the i-th joint, respectively. K_{p_i} and K_{d_i} denote the position and velocity gains for the i-th joint, respectively. These parameters are set at

K_{p_i} = 200 \quad \text{and} \quad K_{d_i} = 10

for the elbow pitch joints, and

K_{p_i} = 2000 \quad \text{and} \quad K_{d_i} = 100

for the other joints.

The initial position of the robot is fixed at the standing-up-straight pose with the arms down. The immediate reward r_t at time step t is defined as

r_t = \exp(-10 d_t) - 0.0005 \min(c_t, 10000),

where d_t is the distance between the right hand of the robot and the target object, and c_t is the sum of control costs for each joint. The linear deterministic policy is used for the PGPE methods, and the Gaussian policy is used for IW-REINFORCE-OB. In both cases, the linear basis function φ(s) = s is used. For PGPE, the initial prior mean η is randomly chosen from the standard normal distribution, and the initial prior standard deviation τ is set at 1.

To evaluate the usefulness of data reuse methods with a small number of samples, the agent collects only N = 3 on-policy samples with trajectory length T = 100 at each iteration. All previous data samples are reused to estimate the gradients in the data reuse methods, while only on-policy samples are used to estimate the gradients in the plain PGPE-OB method. The discount factor is set at γ = 0.9.
9.3.3.2 Simulation with 2 Degrees of Freedom

First, the performance on the reaching task with only 2 degrees of freedom is investigated. The body of the robot is fixed and only the right shoulder pitch and right elbow pitch are used. Figure 9.6 depicts the averaged expected return over 10 trials as a function of the number of iterations. The expected return at each trial is computed from 50 newly drawn test episodic data that are not used for policy learning. The graph shows that IW-PGPE-OB nicely improves the performance over iterations with only a small number of on-policy samples. The plain PGPE-OB method can also improve the performance over iterations, but slowly. NIW-PGPE-OB is not as good as IW-PGPE-OB, especially at the later iterations, because of the inconsistency of the NIW estimator.

The distance from the right hand to the object and the control costs along the trajectory are also investigated for three policies: the initial policy, the policy obtained at the 20th iteration by IW-PGPE-OB, and the policy obtained at the 50th iteration by IW-PGPE-OB. Figure 9.7(a) plots the distance to the target object as a function of the time step.
obtainedatthe50thiterationdecreasesthedistancerapidlycomparedwith
Policy-PriorSearch
149
5
IW−PGPE−OB
NIW−PGPE−OB
PGPE−OB
4
IW−REINFORCE−OB
3
Return
2
1
0
10
20
30
40
50
Iteration
FIGURE9.6:Averageandstandarderrorofreturnsover10runsasfunctions
ofthenumberofiterationsforthereachingtaskwith2degreesoffreedom
(rightshoulderpitchandrightelbowpitch).
0.35
Initialpolicy
Policyatthe20thiteration
0.3
Policyatthe50thiteration
0.25
0.2
0.15
Distance
0.1
0.05
00
10
20
30
40
50
60
70
80
90
100
TImesteps
(a)Distance
120
Initialpolicy
110
Policyatthe20thiteration
Policyatthe50thiteration
100
90
80
70
Controlcosts60
50
40
300
10
20
30
40
50
60
70
80
90
100
Timesteps
(b)Controlcosts
FIGURE9.7:Distanceandcontrolcostsofarmreachingwith2degreesof
freedomusingthepolicylearnedbyIW-PGPE-OB.
150
StatisticalReinforcementLearning
FIGURE9.8:Typicalexampleofarmreachingwith2degreesoffreedom
usingthepolicyobtainedbyIW-PGPE-OBatthe50thiteration(fromleftto
rightandtoptobottom).
This shows that the policy obtained at the 50th iteration decreases the distance rapidly compared with the initial policy and the policy obtained at the 20th iteration, which means that the robot can reach the object quickly by using the learned policy.

Figure 9.7(b) plots the control cost as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the control cost steadily until the reaching task is completed. This is because the robot mainly adjusts the shoulder pitch in the beginning, which consumes a larger amount of energy than the energy required for controlling the elbow pitch. Then, once the right hand gets closer to the target object, the robot starts adjusting the elbow pitch to reach the target object. The policy obtained at the 20th iteration actually consumes less control cost, but it cannot lead the arm to the target object.

Figure 9.8 illustrates a typical solution of the reaching task with 2 degrees of freedom by the policy obtained by IW-PGPE-OB at the 50th iteration. The images show that the right hand is successfully led to the target object within only 10 time steps.
9.3.3.3 Simulation with All 9 Degrees of Freedom

Finally, the same experiment is carried out using all 9 degrees of freedom. The position of the target object is more distant from the robot so that it cannot be reached by only using the right arm.
FIGURE 9.9: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with all 9 degrees of freedom.
Because all 9 joints are used, the dimensionality of the state space is much increased and this grows the values of importance weights exponentially. In order to mitigate the large values of importance weights, we decided not to reuse all previously collected samples, but only samples collected in the last 5 iterations. This allows us to keep the difference between the sampling distribution and the target distribution reasonably small, and thus the values of importance weights can be suppressed to some extent. Furthermore, following Wawrzynski (2009), we consider a version of IW-PGPE-OB, denoted as "truncated IW-PGPE-OB" below, where the importance weight is truncated as w = min(w, 2).

The results plotted in Figure 9.9 show that the performance of the truncated IW-PGPE-OB is the best. This implies that the truncation of importance weights is helpful when applying IW-PGPE-OB to high-dimensional problems.

Figure 9.10 illustrates a typical solution of the reaching task with all 9 degrees of freedom by the policy obtained by the truncated IW-PGPE-OB at the 400th iteration. The images show that the policy learned by the proposed method successfully leads the right hand to the target object, and the irrelevant parts are kept at the initial position for reducing the control costs.
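As a rough illustration of these two stabilization heuristics (a limited reuse window and weight truncation), they could be implemented as in the following sketch; the container names are assumptions, and the cap value 2 follows the text above.

def truncate_weights(raw_weights, cap=2.0):
    # Truncate importance weights as w = min(w, cap) to suppress their
    # exponential growth in high-dimensional state spaces.
    return [min(w, cap) for w in raw_weights]

def reuse_window(samples_per_iteration, window=5):
    # Reuse only the samples collected in the last `window` iterations.
    return samples_per_iteration[-window:]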
9.3.3.4 Real Robot Control

Finally, the IW-PGPE-OB method is applied to the real CB-i robot shown in Figure 9.11 (Sugimoto et al., 2014).

The experimental setting is essentially the same as the above simulation studies with 9 joints, but policies are updated only every 5 trials and samples taken from the last 10 trials are reused for stabilization purposes.

FIGURE 9.10: Typical example of arm reaching with all 9 degrees of freedom using the policy obtained by the truncated IW-PGPE-OB at the 400th iteration (from left to right and top to bottom).

FIGURE 9.11: Reaching task by the real CB-i robot (Sugimoto et al., 2014).
Figure 9.12 plots the obtained rewards cumulated over policy update iterations, showing that the rewards are steadily increased over iterations. Figure 9.13 exhibits the acquired reaching motion based on the policy obtained at the 120th iteration, showing that the end effector of the robot can successfully reach the target object.

FIGURE 9.12: Obtained reward cumulated over policy update iterations.
9.4 Remarks
When the trajectory length is large, direct policy search tends to produce gradient estimators with large variance, due to the randomness of stochastic policies. Policy-prior search can avoid this problem by using deterministic policies and introducing stochasticity through a prior distribution over policy parameters. Both theoretically and experimentally, advantages of policy-prior search over direct policy search were shown.
A sample reuse framework for policy-prior search was also introduced, which is highly useful in real-world reinforcement learning problems with high sampling costs. Following the same line as the sample reuse methods for policy iteration described in Chapter 4 and direct policy search introduced in Chapter 8, importance weighting plays an essential role in sample-reuse policy-prior search. When the dimensionality of the state-action space is high, however, importance weights tend to take extremely large values, which causes instability of the importance weighting methods. To mitigate this problem, truncation of the importance weights is useful in practice.
FIGURE 9.13: Typical example of arm reaching using the policy obtained by the IW-PGPE-OB method (from left to right and top to bottom).
Part IV
Model-Based Reinforcement Learning

The reinforcement learning methods explained in Part II and Part III are categorized into the model-free approach, meaning that policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent). On the other hand, in Part IV, we introduce an alternative approach called the model-based approach, which explicitly models the environment in advance and uses the learned environment model for policy learning.
In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging.
In Chapter 10, we introduce a non-parametric model estimator that possesses the optimal convergence rate with high computational efficiency, and demonstrate its usefulness through experiments. Then, in Chapter 11, we combine dimensionality reduction with model estimation to cope with the high dimensionality of state and action spaces.
Chapter 10
Transition Model Estimation

In this chapter, we introduce transition probability estimation methods for model-based reinforcement learning (Wang & Dietterich, 2003; Deisenroth & Rasmussen, 2011). Among the methods described in Section 10.1, a non-parametric transition model estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) is shown to be the most promising approach (Tangkaratt et al., 2014a). Then, in Section 10.2, we describe how the transition model estimator can be utilized in model-based reinforcement learning. In Section 10.3, the experimental performance of a model-based policy-prior search method is evaluated. Finally, Section 10.4 concludes this chapter.
10.1 Conditional Density Estimation
In this section, the problem of approximating the transition probability p(s'|s,a) from independent transition samples {(s_m, a_m, s'_m)}_{m=1}^M is addressed.
10.1.1 Regression-Based Approach
In the regression-based approach, the problem of transition probability estimation is formulated as a function approximation problem of predicting the output s' given inputs s and a under Gaussian noise:
    s' = f(s, a) + ε,
where f is an unknown regression function to be learned, ε is an independent Gaussian noise vector with mean zero and covariance matrix σ²I, and I denotes the identity matrix.
Let us approximate f by the following linear-in-parameter model:
    f(s, a, Γ) = Γ^T φ(s, a),
where Γ is the B × dim(s) parameter matrix and φ(s, a) is the B-dimensional basis vector. A typical choice of the basis vector is the Gaussian kernel, which is defined for B = M as
    φ_b(s, a) = exp( −( ||s − s_b||² + (a − a_b)² ) / (2κ²) ),
where κ > 0 denotes the Gaussian kernel width. If B is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for s and a may be used if necessary.
The parameter matrix Γ is learned so that the regularized squared error is minimized:
    Γ̂ = argmin_Γ [ Σ_{m=1}^M || f(s_m, a_m, Γ) − s'_m ||² + tr( Γ^T R Γ ) ],
where R is the B × B positive semi-definite matrix called the regularization matrix. The solution Γ̂ is given analytically as
    Γ̂ = (Φ^T Φ + R)^{-1} Φ^T (s'_1, …, s'_M)^T,
where Φ is the M × B design matrix defined as
    Φ_{m,b} = φ_b(s_m, a_m).
We can confirm that the predicted output vector ŝ' = f(s, a, Γ̂) actually follows the Gaussian distribution with mean
    (s'_1, …, s'_M) Φ (Φ^T Φ + R)^{-1} φ(s, a)
and covariance matrix δ̂² I, where
    δ̂² = σ² tr( (Φ^T Φ + R)^{-2} Φ^T Φ ).
The tuning parameters such as the Gaussian kernel width κ and the regularization matrix R can be determined either by cross-validation or by evidence maximization if the above method is regarded as Gaussian process regression in the Bayesian framework (Rasmussen & Williams, 2006).
This is the regression-based estimator of the transition probability density p(s'|s,a) for an arbitrary test input s and a. Thus, by the use of kernel regression models, the regression function f (which is the conditional mean of the output s') is approximated in a non-parametric way. However, the conditional distribution of the output itself is restricted to be Gaussian, which is highly restrictive in real-world reinforcement learning.
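For concreteness, a minimal NumPy sketch of this regression-based estimator could look as follows. The function names are illustrative, all samples are used as Gaussian centers (B = M), and the fixed values of κ and the ridge matrix R = λI are placeholders that would be chosen by cross-validation or evidence maximization as described above.

import numpy as np

def gaussian_basis(X, centers, kappa):
    # phi_b(x) = exp(-||x - c_b||^2 / (2 kappa^2)) for each row x of X
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * kappa ** 2))

def fit_transition_regression(S, A, S_next, kappa=1.0, lam=0.1):
    # Ridge solution: Gamma = (Phi^T Phi + R)^{-1} Phi^T (s'_1, ..., s'_M)^T
    SA = np.hstack([S, A])                      # M x (dim(s) + dim(a)) inputs
    Phi = gaussian_basis(SA, SA, kappa)         # M x B design matrix (B = M)
    R = lam * np.eye(Phi.shape[1])              # regularization matrix R
    Gamma = np.linalg.solve(Phi.T @ Phi + R, Phi.T @ S_next)
    return Gamma, SA

def predict_mean(s, a, Gamma, centers, kappa=1.0):
    # Conditional mean of s' under the learned Gaussian model
    phi = gaussian_basis(np.hstack([s, a])[None, :], centers, kappa)
    return (phi @ Gamma)[0]

Given arrays S, A, and S_next with one transition sample per row, fit_transition_regression returns the parameter matrix and the Gaussian centers, and predict_mean evaluates the estimated conditional mean of s' at a test input.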
10.1.2 ǫ-Neighbor Kernel Density Estimation
When the conditioning variables (s, a) are discrete, the conditional density p(s'|s,a) can be easily estimated by standard density estimators such as kernel density estimation (KDE), by only using samples s'_i such that (s_i, a_i) agrees with the target values (s, a). ǫ-neighbor KDE (ǫKDE) extends this idea to the continuous case by using samples s'_i such that (s_i, a_i) are close to the target values (s, a).
More specifically, ǫKDE with the Gaussian kernel is given by
    p̂(s'|s,a) = (1 / |I_{(s,a),ǫ}|) Σ_{i ∈ I_{(s,a),ǫ}} N(s'; s'_i, σ²I),
where I_{(s,a),ǫ} is the set of sample indices such that ||(s,a) − (s_i,a_i)|| ≤ ǫ, and N(s'; s'_i, σ²I) denotes the Gaussian density with mean s'_i and covariance matrix σ²I. The Gaussian width σ and the distance threshold ǫ may be chosen by cross-validation.
ǫKDE is a useful non-parametric density estimator that is easy to implement. However, it is unreliable in high-dimensional problems due to its distance-based construction.
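A small Python sketch of ǫKDE along these lines is given below; the values of ǫ and σ are placeholders that would be chosen by cross-validation.

import numpy as np

def eps_kde(s, a, S, A, S_next, eps=0.5, sigma=0.3):
    # Collect samples whose (s_i, a_i) lies within distance eps of the query (s, a)
    query = np.hstack([s, a])
    dist = np.linalg.norm(np.hstack([S, A]) - query, axis=1)
    idx = np.where(dist <= eps)[0]
    if idx.size == 0:
        return None                              # no neighbors: estimate undefined here
    centers = S_next[idx]
    d = S_next.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)

    def density(s_prime):
        # Average of Gaussian densities N(s'; s'_i, sigma^2 I) over the neighbors
        d2 = ((s_prime - centers) ** 2).sum(axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).sum() / (idx.size * norm)

    return density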
10.1.3 Least-Squares Conditional Density Estimation
A non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) possesses various useful properties:
• It can directly handle multi-dimensional multi-modal inputs and outputs.
• It was proved to achieve the optimal convergence rate (Kanamori et al., 2012).
• It has high numerical stability (Kanamori et al., 2013).
• It is robust against outliers (Sugiyama et al., 2010).
• Its solution can be analytically and efficiently computed just by solving a system of linear equations (Kanamori et al., 2009).
• Generating samples from the learned transition model is straightforward.
Let us model the transition probability p(s'|s,a) by the following linear-in-parameter model:
    α^T φ(s, a, s'),                                                (10.1)
where α is the B-dimensional parameter vector and φ(s, a, s') is the B-dimensional basis function vector. A typical choice of the basis function is the Gaussian kernel, which is defined for B = M as
    φ_b(s, a, s') = exp( −( ||s − s_b||² + (a − a_b)² + ||s' − s'_b||² ) / (2κ²) ),
where κ > 0 denotes the Gaussian kernel width. If B is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for s, a, and s' may be used if necessary.
The parameter α is learned so that the following squared error is minimized:
    J_0(α) = (1/2) ∫∫∫ ( α^T φ(s,a,s') − p(s'|s,a) )² p(s,a) ds da ds'
           = (1/2) ∫∫∫ ( α^T φ(s,a,s') )² p(s,a) ds da ds'
             − ∫∫∫ α^T φ(s,a,s') p(s,a,s') ds da ds' + C,
where the identity p(s'|s,a) = p(s,a,s') / p(s,a) is used in the second term and
    C = (1/2) ∫∫∫ p(s'|s,a) p(s,a,s') ds da ds'.
Because C is a constant independent of α, only the first two terms will be considered from here on:
    J(α) = J_0(α) − C = (1/2) α^T U α − α^T v,
where U is the B × B matrix and v is the B-dimensional vector defined as
    U = ∫∫ Φ(s,a) p(s,a) ds da,
    v = ∫∫∫ φ(s,a,s') p(s,a,s') ds da ds',
    Φ(s,a) = ∫ φ(s,a,s') φ(s,a,s')^T ds'.
Note that, for the Gaussian model (10.1), the (b,b')-th element of matrix Φ(s,a) can be computed analytically as
    Φ_{b,b'}(s,a) = (√π κ)^{dim(s')} exp( −||s'_b − s'_{b'}||² / (4κ²) )
                    × exp( −( ||s − s_b||² + ||s − s_{b'}||² + (a − a_b)² + (a − a_{b'})² ) / (2κ²) ).
Because U and v included in J(α) contain expectations over the unknown densities p(s,a) and p(s,a,s'), they are approximated by sample averages. Then we have
    Ĵ(α) = (1/2) α^T Û α − v̂^T α,
where
    Û = (1/M) Σ_{m=1}^M Φ(s_m, a_m)   and   v̂ = (1/M) Σ_{m=1}^M φ(s_m, a_m, s'_m).
By adding an ℓ2-regularizer to Ĵ(α) to avoid overfitting, the LSCDE optimization criterion is given as
    α̃ = argmin_{α ∈ R^M} [ Ĵ(α) + (λ/2) ||α||² ],
where λ ≥ 0 is the regularization parameter. The solution α̃ is given analytically as
    α̃ = (Û + λI)^{-1} v̂,
where I denotes the identity matrix. Because conditional probability densities are non-negative by definition, the solution α̃ is modified as
    α̂_b = max(0, α̃_b).
Finally, the solution is normalized in the test phase. More specifically, given a test input point (s, a), the final LSCDE solution is given as
    p̂(s'|s,a) = α̂^T φ(s,a,s') / ∫ α̂^T φ(s,a,s'') ds'',
where, for the Gaussian model (10.1), the denominator can be analytically computed as
    ∫ α̂^T φ(s,a,s'') ds'' = (√(2π) κ)^{dim(s')} Σ_{b=1}^B α̂_b exp( −( ||s − s_b||² + (a − a_b)² ) / (2κ²) ).
Model selection of the Gaussian width κ and the regularization parameter λ is possible by cross-validation (Sugiyama et al., 2010).
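The computations above reduce to simple matrix operations. The following NumPy sketch, with all samples used as Gaussian centers and illustrative fixed values of κ and λ, mirrors the analytic solution and the test-phase normalization; actions are treated as vectors, so (a − a_b)² is read as ||a − a_b||².

import numpy as np

def _sq_dists(X, Y):
    # Pairwise squared Euclidean distances between rows of X and Y
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

def lscde_fit(S, A, S_next, kappa=1.0, lam=0.01):
    # Analytic LSCDE solution: alpha = max(0, (U_hat + lam I)^{-1} v_hat)
    M, d_out = S_next.shape
    SA = np.hstack([S, A])
    D_in = _sq_dists(SA, SA)                     # ||(s_m,a_m) - (s_b,a_b)||^2
    D_out = _sq_dists(S_next, S_next)            # ||s'_m - s'_b||^2
    K_in = np.exp(-D_in / (2 * kappa ** 2))
    v = (K_in * np.exp(-D_out / (2 * kappa ** 2))).mean(axis=0)
    U = ((np.sqrt(np.pi) * kappa) ** d_out
         * np.exp(-D_out / (4 * kappa ** 2)) * (K_in.T @ K_in) / M)
    alpha = np.linalg.solve(U + lam * np.eye(M), v)
    return np.maximum(alpha, 0.0)

def lscde_density(s, a, s_prime, alpha, S, A, S_next, kappa=1.0):
    # Normalized conditional density p_hat(s'|s,a) at a test point
    d_out = S_next.shape[1]
    k_in = np.exp(-_sq_dists(np.hstack([s, a])[None, :], np.hstack([S, A]))[0]
                  / (2 * kappa ** 2))
    k_out = np.exp(-_sq_dists(s_prime[None, :], S_next)[0] / (2 * kappa ** 2))
    numer = np.dot(alpha, k_in * k_out)
    denom = (np.sqrt(2 * np.pi) * kappa) ** d_out * np.dot(alpha, k_in)
    return numer / denom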
10.2 Model-Based Reinforcement Learning
Model-based reinforcement learning is simply carried out as follows.
1. Collect transition samples {(s_m, a_m, s'_m)}_{m=1}^M.
2. Obtain a transition model estimate p̂(s'|s,a) from {(s_m, a_m, s'_m)}_{m=1}^M.
3. Run a model-free reinforcement learning method using trajectory samples artificially generated from the estimated transition model p̂(s'|s,a) and the current policy π(a|s,θ).
Model-based reinforcement learning is particularly advantageous when the sampling cost is limited. More specifically, in model-free methods, we need to fix the sampling schedule in advance; for example, whether many samples are gathered in the beginning or only a small batch of samples is collected for a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, we need to just blindly design the sampling schedule in practice, which can cause significant performance degradation. On the other hand, model-based methods do not suffer from this problem, because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.
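Because the LSCDE model is a non-negative combination of Gaussian kernels, step 3 is straightforward to realize: p̂(s'|s,a) is a Gaussian mixture with means s'_b, covariance κ²I, and mixing weights proportional to α̂_b exp(−(||s − s_b||² + (a − a_b)²) / (2κ²)). A rough sketch of such a sampler, reusing the lscde_fit output from the sketch in Section 10.1.3, is shown below; the policy argument is assumed to be any function mapping a state to an action.

import numpy as np

def sample_next_state(s, a, alpha, S, A, S_next, kappa=1.0, rng=None):
    # Draw s' from the LSCDE Gaussian mixture conditioned on (s, a)
    rng = np.random.default_rng() if rng is None else rng
    d2 = ((np.hstack([S, A]) - np.hstack([s, a])) ** 2).sum(axis=1)
    w = alpha * np.exp(-d2 / (2 * kappa ** 2))
    w = np.ones_like(w) if w.sum() <= 0 else w   # fallback if all weights vanish
    b = rng.choice(len(w), p=w / w.sum())
    return S_next[b] + kappa * rng.standard_normal(S_next.shape[1])

def artificial_trajectory(s0, policy, alpha, S, A, S_next, T=10, kappa=1.0, rng=None):
    # Roll out one artificial trajectory from the learned model and the current policy
    rng = np.random.default_rng() if rng is None else rng
    s, traj = np.asarray(s0, dtype=float), []
    for _ in range(T):
        a = policy(s)
        s_next = sample_next_state(s, a, alpha, S, A, S_next, kappa, rng)
        traj.append((s, a, s_next))
        s = s_next
    return traj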
10.3 Numerical Examples
In this section, the experimental performance of the model-free and model-based versions of PGPE (policy gradients with parameter-based exploration) is evaluated:
M-PGPE(LSCDE): The model-based PGPE method with the transition model estimated by LSCDE.
M-PGPE(GP): The model-based PGPE method with the transition model estimated by Gaussian process (GP) regression.
IW-PGPE: The model-free PGPE method with sample reuse by importance weighting (the method introduced in Chapter 9).
10.3.1 Continuous Chain Walk
Let us first consider a simple continuous chain walk task, described in Figure 10.1.
10.3.1.1 Setup
Let s ∈ S = [0, 10], a ∈ A = [−5, 5], and
    r(s, a, s') = 1  (4 < s' < 6),   r(s, a, s') = 0  (otherwise).
That is, the agent receives positive reward +1 at the center of the state space. The trajectory length is set at T = 10 and the discount factor is set at γ = 0.99.

FIGURE 10.1: Illustration of continuous chain walk (states 0 to 10, with the reward region between 4 and 6).

The following linear-in-parameter policy model is used in both the M-PGPE and IW-PGPE methods:
    a = Σ_{i=1}^6 θ_i exp( −(s − c_i)² / 2 ),
where (c_1, …, c_6) = (0, 2, 4, 6, 8, 10). If an action determined by the above policy is out of the action space, it is pulled back to be confined in the domain.
As transition dynamics, the following two scenarios are considered:
Gaussian: The true transition dynamics is given by
    s_{t+1} = s_t + a_t + ε_t,
where ε_t is the Gaussian noise with mean 0 and standard deviation 0.3.
Bimodal: The true transition dynamics is given by
    s_{t+1} = s_t ± a_t + ε_t,
where ε_t is the Gaussian noise with mean 0 and standard deviation 0.3, and the sign of a_t is randomly chosen with probability 1/2.
If the next state is out of the state space, it is projected back to the domain. Below, the budget for data collection is assumed to be limited to N = 20 trajectory samples.
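For reference, the chain-walk environment and the policy model above can be simulated with a few lines of Python; the function names are illustrative, the clipping of actions and states to their domains follows the description above, and the bimodal flag switches between the two transition scenarios.

import numpy as np

CENTERS = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])   # (c_1, ..., c_6)

def chain_policy(s, theta):
    # a = sum_i theta_i exp(-(s - c_i)^2 / 2), pulled back into A = [-5, 5]
    a = float(np.dot(theta, np.exp(-(s - CENTERS) ** 2 / 2.0)))
    return min(max(a, -5.0), 5.0)

def chain_step(s, a, bimodal=False, rng=None):
    # One transition; returns (s', r) with reward +1 when 4 < s' < 6
    rng = np.random.default_rng() if rng is None else rng
    sign = rng.choice([-1.0, 1.0]) if bimodal else 1.0
    s_next = s + sign * a + rng.normal(0.0, 0.3)
    s_next = min(max(s_next, 0.0), 10.0)               # project back to S = [0, 10]
    r = 1.0 if 4.0 < s_next < 6.0 else 0.0
    return s_next, r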
10.3.1.2 Comparison of Model Estimators
When the transition model is learned in the M-PGPE methods, all N = 20 trajectory samples are gathered randomly in the beginning at once. More specifically, the initial state s_1 and the action a_1 are chosen from the uniform distributions over S and A, respectively. Then the next state s_2 and the immediate reward r_1 are obtained. After that, the action a_2 is chosen from the uniform distribution over A, and the next state s_3 and the immediate reward r_2 are obtained. This process is repeated until r_T is obtained, by which a trajectory sample is obtained. This data generation process is repeated N times to obtain N trajectory samples.
Figure 10.2 and Figure 10.3 illustrate the true transition dynamics and their estimates obtained by LSCDE and GP in the Gaussian and bimodal cases, respectively. Figure 10.2 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 10.3 shows that LSCDE can still successfully capture the entire profile of the true transition dynamics even in the bimodal case, but GP fails to capture the bimodal structure.
Based on the estimated transition models, policies are learned by the M-PGPE method. More specifically, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are used for baseline estimation. Then policies are updated based on these artificial trajectory samples. This policy update step is repeated 100 times. For evaluating the return of a learned policy, 100 additional test trajectory samples are used which are not employed for policy learning. Figure 10.4 and Figure 10.5 depict the averages and standard errors of returns over 100 runs for the Gaussian and bimodal cases, respectively. The results show that, in the Gaussian case, the GP-based method performs very well and LSCDE also exhibits reasonable performance. In the bimodal case, on the other hand, GP performs poorly and LSCDE gives much better results than GP. This illustrates the high flexibility of LSCDE.

FIGURE 10.2: Gaussian transition dynamics and its estimates by LSCDE and GP ((a) true transition; (b) transition estimated by LSCDE; (c) transition estimated by GP; each panel plots argmax_{s'} p(s'|s,a) over the state-action space).
FIGURE 10.3: Bimodal transition dynamics and its estimates by LSCDE and GP ((a) true transition; (b) transition estimated by LSCDE; (c) transition estimated by GP; each panel plots argmax_{s'} p(s'|s,a) over the state-action space).
FIGURE 10.4: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for the Gaussian transition.

FIGURE 10.5: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for the bimodal transition.
FIGURE 10.6: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for the Gaussian transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).

FIGURE 10.7: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for the bimodal transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).
10.3.1.3 Comparison of Model-Based and Model-Free Methods
Next, the performance of the model-based and model-free PGPE methods is compared.
Under the fixed-budget scenario, the schedule for collecting 20 trajectory samples needs to be determined for the IW-PGPE method. First, the influence of the choice of sampling schedules is illustrated. Figure 10.6 and Figure 10.7 show expected returns averaged over 100 runs under the sampling schedule in which a batch of k trajectory samples is gathered 20/k times, for different values of k. Here, the policy update is performed 100 times after observing each batch of k trajectory samples, because this performed better than the usual scheme of updating the policy only once. Figure 10.6 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering k = 20 trajectory samples at once is the best choice in the Gaussian case. Figure 10.7 shows that gathering k = 20 trajectory samples at once is also the best choice in the bimodal case.
Although the best sampling schedule is not accessible in practice, the optimal sampling schedule is used for evaluating the performance of IW-PGPE. Figure 10.4 and Figure 10.5 show the averages and standard errors of returns obtained by IW-PGPE over 100 runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all trajectory samples are gathered at once in the beginning. The performance of IW-PGPE could be further improved if more trajectory samples were gathered, but this is prohibited under the fixed-budget scenario. On the other hand, the returns of M-PGPE keep increasing over iterations, because artificial trajectory samples can be generated continually without additional sampling costs. This illustrates a potential advantage of model-based reinforcement learning (RL) methods.
10.3.2 Humanoid Robot Control
Finally, the performance of M-PGPE is evaluated on a practical control problem of a simulated upper-body model of the humanoid robot CB-i (Cheng et al., 2007), which was also used in Section 9.3.3; see Figure 9.5 for illustrations of CB-i and its simulator.
10.3.2.1 Setup
The simulator is based on the upper body of the CB-i humanoid robot, which has 9 joints: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch. The state vector is 18-dimensional and real-valued, corresponding to the current angle in degrees and the current angular velocity of each joint. The action vector is 9-dimensional and real-valued, corresponding to the target angle of each joint in degrees. The goal of the control problem is to lead the end effector of the right arm (the right hand) to the target object. A noisy control system is simulated by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each element of the action vector, Gaussian noise with mean 0 and standard deviation 3 is added with probability 0.6, and Gaussian noise with mean −5 and standard deviation 3 is added with probability 0.4.
The initial posture of the robot is fixed to standing up straight with arms down. The target object is located in front of and above the right hand, which is reachable by using the controllable joints. The reward function at each time step is defined as
    r_t = exp(−10 d_t) − 0.000005 min(c_t, 1,000,000),
where d_t is the distance between the right hand and the target object at time step t, and c_t is the sum of control costs for each joint. The deterministic policy model used in M-PGPE and IW-PGPE is defined as a = θ^T φ(s) with the basis function φ(s) = s. The trajectory length is set at T = 100 and the discount factor is set at γ = 0.9.
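In Python, the reward and the deterministic policy model above amount to the following short sketch; d_t, c_t, and the policy parameter matrix theta are supplied by the simulator and the learner, respectively.

import numpy as np

def reaching_reward(d_t, c_t):
    # r_t = exp(-10 d_t) - 0.000005 * min(c_t, 1,000,000)
    return np.exp(-10.0 * d_t) - 0.000005 * min(c_t, 1_000_000)

def linear_policy(s, theta):
    # Deterministic policy a = theta^T phi(s) with phi(s) = s
    return theta.T @ s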
10.3.2.2 Experiment with 2 Joints
First, we consider using only 2 joints among the 9 joints; i.e., only the right shoulder pitch and right elbow pitch are allowed to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionalities of the state vector s and the action vector a are 4 and 2, respectively.
We suppose that the budget for data collection is limited to N = 50 trajectory samples. For the M-PGPE methods, all trajectory samples are collected at first using uniformly random initial states and a uniformly random policy. More specifically, the initial state is chosen from the uniform distribution over S. At each time step, the action a_i of the i-th joint is drawn from the uniform distribution on [s_i − 5, s_i + 5], where s_i denotes the state of the i-th joint. In total, 5000 transition samples are collected for model estimation. Then, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are generated for baseline estimation in each iteration. The sampling schedule of the IW-PGPE method is chosen to collect k = 5 trajectory samples 50/k times, which performs well, as shown in Figure 10.8. The average and standard error of the return obtained by each method over 10 runs are plotted in Figure 10.9, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE.
Figure 10.10 illustrates an example of the reaching motion with 2 joints obtained by M-PGPE(LSCDE) at the 60th iteration. This shows that the learned policy successfully leads the right hand to the target object within only 13 steps in this noisy control system.
10.3.2.3 Experiment with 9 Joints
Finally, the performance of M-PGPE(LSCDE) and IW-PGPE is evaluated on the reaching task with all 9 joints.
The experimental setup is essentially the same as in the 2-joint case, but a budget of N = 1000 trajectory samples is given for this complex and high-dimensional task. The position of the target object is moved to the far left, which is not reachable by using only 2 joints. Thus, the robot is required to move other joints to reach the object with the right hand. Five thousand randomly chosen transition samples are used as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE is set at gathering 1000 trajectory samples at once, which is the best sampling schedule according to Figure 10.11. The averages and standard errors of returns obtained by M-PGPE(LSCDE) and IW-PGPE over 30 runs are plotted in Figure 10.12, showing that M-PGPE(LSCDE) tends to outperform IW-PGPE.
Figure 10.13 exhibits a typical reaching motion with 9 joints obtained by M-PGPE(LSCDE) at the 1000th iteration. This shows that the right hand is led to the distant object successfully within 14 steps.
FIGURE 10.8: Averages and standard errors of returns obtained by IW-PGPE over 10 runs for the 2-joint humanoid robot simulator for different sampling schedules (e.g., 5×10 means gathering k = 5 trajectory samples 10 times).
FIGURE 10.9: Averages and standard errors of obtained returns over 10 runs for the 2-joint humanoid robot simulator. All methods use 50 trajectory samples for policy learning. In M-PGPE(LSCDE) and M-PGPE(GP), all 50 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 5 trajectory samples is gathered for 10 iterations, which was shown to be the best sampling scheduling (see Figure 10.8). Note that the policy update is performed 100 times after observing each batch of trajectory samples, which we confirmed to perform well. The bottom horizontal axis is for the M-PGPE methods, while the top horizontal axis is for the IW-PGPE method.
FIGURE 10.10: Example of arm reaching with 2 joints using a policy obtained by M-PGPE(LSCDE) at the 60th iteration (from left to right and top to bottom).
FIGURE 10.11: Averages and standard errors of returns obtained by IW-PGPE over 30 runs for the 9-joint humanoid robot simulator for different sampling schedules (e.g., 100×10 means gathering k = 100 trajectory samples 10 times).
FIGURE 10.12: Averages and standard errors of obtained returns over 30 runs for the humanoid robot simulator with 9 joints. Both methods use 1000 trajectory samples for policy learning. In M-PGPE(LSCDE), all 1000 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 1000 trajectory samples is gathered at once, which was shown to be the best scheduling (see Figure 10.11). Note that the policy update is performed 100 times after observing each batch of trajectory samples. The bottom horizontal axis is for the M-PGPE method, while the top horizontal axis is for the IW-PGPE method.
FIGURE 10.13: Example of arm reaching with 9 joints using a policy obtained by M-PGPE(LSCDE) at the 1000th iteration (from left to right and top to bottom).
10.4 Remarks
Model-based reinforcement learning is a promising approach, provided that the transition model can be estimated accurately. However, estimating a high-dimensional conditional density is challenging. In this chapter, a non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) was introduced, and model-based PGPE with LSCDE was shown to work excellently in experiments.
Under a fixed sampling budget, the model-free approach requires us to design the sampling schedule appropriately in advance. However, this is practically very hard unless strong prior knowledge is available. On the other hand, model-based methods do not suffer from this problem, which is an excellent practical advantage over the model-free approach.
In robotics, the model-free approach seems to be preferred because accurately learning the transition dynamics of complex robots is challenging (Deisenroth et al., 2013). Furthermore, model-free methods can utilize prior knowledge in the form of policy demonstrations (Kober & Peters, 2011). On the other hand, the model-based approach is advantageous in that no interaction with the real robot is required once the transition model has been learned, and the learned transition model can be utilized for further simulation.
Actually, the choice between model-free and model-based methods is not only an ongoing research topic in machine learning, but also a much-debated issue in neuroscience. Therefore, further discussion is necessary to more deeply understand the pros and cons of the model-based and model-free approaches. Combining or switching between the model-free and model-based approaches would also be an interesting direction for further investigation.
Chapter 11
Dimensionality Reduction for Transition Model Estimation

Least-squares conditional density estimation (LSCDE), introduced in Chapter 10, is a practical transition model estimator. However, transition model estimation is still challenging when the dimensionality of the state and action spaces is high. In this chapter, a dimensionality reduction method is introduced for LSCDE which finds a low-dimensional expression of the original state and action vector that is relevant to predicting the next state. After mathematically formulating the problem of dimensionality reduction in Section 11.1, a detailed description of the dimensionality reduction algorithm based on the squared-loss conditional entropy is provided in Section 11.2. Then numerical examples are given in Section 11.3, and this chapter is concluded in Section 11.4.
11.1 Sufficient Dimensionality Reduction
Sufficient dimensionality reduction (Li, 1991; Cook & Ni, 2005) is a framework of dimensionality reduction in a supervised learning setting of analyzing an input-output relation; in our case, the input is the state-action pair (s, a) and the output is the next state s'. Sufficient dimensionality reduction is aimed at finding a low-dimensional expression z of the input (s, a) that contains "sufficient" information about the output s'.
Let z be a linear projection of the input (s, a). More specifically, using a matrix W such that W W^T = I, where I denotes the identity matrix, z is given by
    z = W (s; a),
i.e., W applied to the vector obtained by stacking s and a. The goal of sufficient dimensionality reduction is, from independent transition samples {(s_m, a_m, s'_m)}_{m=1}^M, to find W such that s' and (s, a) are conditionally independent given z. This conditional independence means that z contains all information about s' and is equivalently expressed as
    p(s'|s,a) = p(s'|z).                                            (11.1)
11.2 Squared-Loss Conditional Entropy
In this section, the dimensionality reduction method based on the squared-loss conditional entropy (SCE) is introduced.
11.2.1 Conditional Independence
SCE is defined and expressed as
    SCE(s'|z) = −(1/2) ∫∫ p(s'|z) p(s', z) dz ds'
              = −(1/2) ∫∫ ( p(s'|z) − 1 )² p(z) dz ds' − 1 + (1/2) ∫ ds'.
It was shown in Tangkaratt et al. (2015) that
    SCE(s'|z) ≥ SCE(s'|s,a),
and the equality holds if and only if Eq. (11.1) holds. Thus, sufficient dimensionality reduction can be performed by minimizing SCE(s'|z) with respect to W:
    W* = argmin_{W ∈ G} SCE(s'|z).
Here, G denotes the Grassmann manifold, which is the set of matrices W such that W W^T = I without redundancy in terms of the span.
Since SCE contains the unknown densities p(s'|z) and p(s', z), it cannot be directly computed. Here, let us employ the LSCDE method introduced in Chapter 10 to obtain an estimator p̂(s'|z) of the conditional density p(s'|z). Then, by replacing the expectation over p(s', z) with the sample average, SCE can be approximated as
    SCE_hat(s'|z) = −(1/(2M)) Σ_{m=1}^M p̂(s'_m | z_m) = −(1/2) α̃^T v̂,
where
    z_m = W (s_m; a_m)   and   v̂ = (1/M) Σ_{m=1}^M φ(z_m, s'_m).
φ(z, s') is the basis function vector used in LSCDE, given by
    φ_b(z, s') = exp( −( ||z − z_b||² + ||s' − s'_b||² ) / (2κ²) ),
where κ > 0 denotes the Gaussian kernel width. α̃ is the LSCDE solution given by
    α̃ = (Û + λI)^{-1} v̂,
where λ ≥ 0 is the regularization parameter and
    Û_{b,b'} = ( (√π κ)^{dim(s')} / M ) exp( −||s'_b − s'_{b'}||² / (4κ²) ) Σ_{m=1}^M exp( −( ||z_m − z_b||² + ||z_m − z_{b'}||² ) / (2κ²) ).
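Putting these pieces together, the SCE approximator can be evaluated for a given projection matrix W with a few matrix operations. The sketch below reuses the LSCDE computations of Chapter 10 with z = W(s; a) in place of (s, a); the fixed values of κ and λ are placeholders for values chosen by cross-validation.

import numpy as np

def sce_estimate(W, S, A, S_next, kappa=1.0, lam=0.01):
    # SCE_hat(s'|z) = -(1/2) alpha~^T v_hat with z_m = W (s_m; a_m)
    Z = np.hstack([S, A]) @ W.T                        # M x r projected inputs
    M, d_out = S_next.shape
    Dz = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    Dy = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(axis=2)
    Kz = np.exp(-Dz / (2 * kappa ** 2))
    v = (Kz * np.exp(-Dy / (2 * kappa ** 2))).mean(axis=0)
    U = ((np.sqrt(np.pi) * kappa) ** d_out
         * np.exp(-Dy / (4 * kappa ** 2)) * (Kz.T @ Kz) / M)
    alpha = np.linalg.solve(U + lam * np.eye(M), v)
    return -0.5 * float(np.dot(alpha, v))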
11.2.2 Dimensionality Reduction with SCE
With the above SCE estimator, a practical formulation for sufficient dimensionality reduction is given by
    Ŵ = argmax_{W ∈ G} S(W),   where   S(W) = α̃^T v̂.
The gradient of S(W) with respect to W_{ℓ,ℓ'} is given by
    ∂S/∂W_{ℓ,ℓ'} = −α̃^T (∂Û/∂W_{ℓ,ℓ'}) α̃ + 2 (∂v̂/∂W_{ℓ,ℓ'})^T α̃.
In the Euclidean space, the above gradient gives the steepest direction (see also Section 7.3.1). However, on the Grassmann manifold, the natural gradient (Amari, 1998) gives the steepest direction. The natural gradient at W is the projection of the ordinary gradient onto the tangent space of the Grassmann manifold. If the tangent space is equipped with the canonical metric ⟨W, W'⟩ = (1/2) tr(W^T W'), the natural gradient at W is given as follows (Edelman et al., 1998):
    (∂S/∂W) W_⊥^T W_⊥,
where W_⊥ is the matrix such that (W^T, W_⊥^T) is an orthogonal matrix.
The geodesic from W in the direction of the natural gradient over the Grassmann manifold can be expressed using t ∈ R as
    W_t = ( I  O ) exp( −t [ O, (∂S/∂W) W_⊥^T ; −W_⊥ (∂S/∂W)^T, O ] ) ( W ; W_⊥ ),
where "exp" for a matrix denotes the matrix exponential and O denotes the zero matrix. Then line search along the geodesic in the natural gradient direction is performed by finding the maximizer from {W_t | t ≥ 0} (Edelman et al., 1998).
Once W is updated by the natural gradient method, SCE is re-estimated for the new W and natural gradient ascent is performed again. This entire procedure is repeated until W converges, and the final solution is given by
    p̂(s'|z) = α̂^T φ(z, s') / ∫ α̂^T φ(z, s'') ds'',
where α̂_b = max(0, α̃_b), and the denominator can be analytically computed as
    ∫ α̂^T φ(z, s'') ds'' = (√(2π) κ)^{dim(s')} Σ_{b=1}^B α̂_b exp( −||z − z_b||² / (2κ²) ).
When SCE is re-estimated, performing cross-validation for LSCDE in every step is computationally expensive. In practice, cross-validation may be performed only once every several gradient updates. Furthermore, to find a better local optimal solution, this gradient ascent procedure may be executed multiple times with randomly chosen initial solutions, and the one achieving the largest objective value is chosen.
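As an illustration of the update, the geodesic step can be written with the matrix exponential as in the following sketch. This is only an assumption-laden sketch: W_⊥ is obtained here from the null space of W, the sign convention follows the expression above, and the step size t would be chosen by the line search over {W_t | t ≥ 0} maximizing S(W_t).

import numpy as np
from scipy.linalg import expm, null_space

def geodesic_step(W, grad_S, t):
    # One step along the Grassmann geodesic in the natural-gradient direction.
    # W: r x d with W W^T = I; grad_S: r x d Euclidean gradient dS/dW.
    r, d = W.shape
    W_perp = null_space(W).T                    # (d - r) x d, rows orthogonal to W
    G = grad_S @ W_perp.T                       # tangent-space coordinates of the gradient
    B = np.zeros((d, d))
    B[:r, r:] = G
    B[r:, :r] = -G.T
    stacked = np.vstack([W, W_perp])            # orthogonal d x d matrix (W^T, W_perp^T)^T
    return (expm(-t * B) @ stacked)[:r]         # W_t, the updated r x d projection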
11.2.3 Relation to Squared-Loss Mutual Information
The above dimensionality reduction method minimizes SCE:
    SCE(s'|z) = −(1/2) ∫∫ p(z, s')² / p(z) dz ds'.
On the other hand, the dimensionality reduction method proposed in Suzuki and Sugiyama (2013) maximizes the squared-loss mutual information (SMI):
    SMI(z, s') = (1/2) ∫∫ p(z, s')² / ( p(z) p(s') ) dz ds'.
Note that SMI can be approximated almost in the same way as SCE by the least-squares method (Suzuki & Sugiyama, 2013). The above equations show that the essential difference between SCE and SMI is whether p(s') is included in the denominator of the density ratio, and SCE is reduced to the negative SMI if p(s') is uniform. However, if p(s') is not uniform, the density-ratio function p(z, s') / ( p(z) p(s') ) included in SMI may fluctuate more than p(z, s') / p(z) included in SCE. Since a smoother function can generally be estimated more accurately from a small number of samples (Vapnik, 1998), SCE-based dimensionality reduction is expected to work better than SMI-based dimensionality reduction.
11.3 Numerical Examples
In this section, the experimental behavior of the SCE-based dimensionality reduction method is illustrated.
11.3.1 Artificial and Benchmark Datasets
The following dimensionality reduction schemes are compared:
• None: No dimensionality reduction is performed.
• SCE (Section 11.2): Dimensionality reduction is performed by minimizing the least-squares SCE approximator using natural gradients over the Grassmann manifold (Tangkaratt et al., 2015).
• SMI (Section 11.2.3): Dimensionality reduction is performed by maximizing the least-squares SMI approximator using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).
• True: The "true" subspace is used (only for the artificial datasets).
After dimensionality reduction, the following conditional density estimators are run:
• LSCDE (Section 10.1.3): Least-squares conditional density estimation (Sugiyama et al., 2010).
• ǫKDE (Section 10.1.2): ǫ-neighbor kernel density estimation, where ǫ is chosen by least-squares cross-validation.
First, the behavior of SCE-LSCDE is compared with the plain LSCDE with no dimensionality reduction. The datasets have 5-dimensional input x = (x^(1), …, x^(5))^T and 1-dimensional output y. Among the 5 dimensions of x, only the first dimension x^(1) is relevant to predicting the output y, and the other 4 dimensions x^(2), …, x^(5) are just standard Gaussian noise. Figure 11.1 plots the first dimension of the input and the output of the samples in the datasets together with the conditional density estimation results. The graphs show that the plain LSCDE does not perform well due to the irrelevant noise dimensions in the input, while SCE-LSCDE gives much better estimates.
Next, artificial datasets with 5-dimensional input x = (x^(1), …, x^(5))^T and 1-dimensional output y are used. Each element of x follows the standard Gaussian distribution and y is given by
(a) y = x^(1) + (x^(1))² + (x^(1))³ + ε,
(b) y = (x^(1))² + (x^(2))² + ε,
where ε is the Gaussian noise with mean zero and standard deviation 1/4.

FIGURE 11.1: Examples of conditional density estimation by plain LSCDE and SCE-LSCDE ((a) bone mineral density; (b) Old Faithful geyser).
The top row of Figure 11.2 shows the dimensionality reduction error between the true W* and its estimate Ŵ for different sample sizes n, measured by
    Error_DR = || Ŵ^T Ŵ − W*^T W* ||_Frobenius,
where ||·||_Frobenius denotes the Frobenius norm. The SMI-based and SCE-based dimensionality reduction methods both perform similarly for dataset (a), while the SCE-based method clearly outperforms the SMI-based method for dataset (b). The histograms of {y_i}_{i=1}^400 plotted in the 2nd row of Figure 11.2 show that the profile of the histogram (which is a sample approximation of p(y)) in dataset (b) is much sharper than that in dataset (a). As explained in Section 11.2.3, the density-ratio function used in SMI contains p(y) in the denominator. Therefore, it would be highly non-smooth and thus hard to approximate. On the other hand, the density-ratio function used in SCE does not contain p(y). Therefore, it would be smoother than the one used in SMI and thus easier to approximate.
The 3rd and 4th rows of Figure 11.2 plot the conditional density estimation error between the true p(y|x) and its estimate p̂(y|x), evaluated by the squared loss (without a constant):
    Error_CDE = (1/(2n')) Σ_{i=1}^{n'} ∫ p̂(y | x̃_i)² dy − (1/n') Σ_{i=1}^{n'} p̂(ỹ_i | x̃_i),
where {(x̃_i, ỹ_i)}_{i=1}^{n'} is a set of test samples that have not been used for conditional density estimation. We set n' = 1000. The graphs show that LSCDE overall outperforms ǫKDE for both datasets. For dataset (a), SMI-LSCDE and SCE-LSCDE perform equally well, and are much better than plain LSCDE with no dimensionality reduction (LSCDE) and comparable to LSCDE with the true subspace (LSCDE*). For dataset (b), SCE-LSCDE outperforms SMI-LSCDE and LSCDE and is comparable to LSCDE*.

FIGURE 11.2: Top row: the mean and standard error of the dimensionality reduction error over 20 runs on the artificial datasets. 2nd row: histograms of the output {y_i}_{i=1}^400. 3rd and 4th rows: the mean and standard error of the conditional density estimation error over 20 runs.
Next, the UCI benchmark datasets (Bache & Lichman, 2013) are used for performance evaluation. n samples are selected randomly from each dataset for conditional density estimation, and the rest of the samples are used to measure the conditional density estimation error. Since the dimensionality of z is unknown for the benchmark datasets, it was determined by cross-validation. The results are summarized in Table 11.1, showing that SCE-LSCDE works well overall. Table 11.2 describes the dimensionalities selected by cross-validation, showing that both the SCE-based and SMI-based methods reduce the dimensionality significantly.
11.3.2 Humanoid Robot
Finally, SCE-LSCDE is applied to transition estimation of a humanoid robot. We use a simulator of the upper-body part of the humanoid robot CB-i (Cheng et al., 2007) (see Figure 9.5).
The robot has 9 controllable joints: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; and the waist yaw, torso roll, and torso pitch joints. The posture of the robot is described by an 18-dimensional real-valued state vector, which corresponds to the angle and angular velocity of each joint in radians and radians per second, respectively. The robot is controlled by sending an action command a to the system. The action command a is a 9-dimensional real-valued vector, which corresponds to the target angle of each joint. When the robot is currently at state s and receives action a, the physical control system of the simulator calculates the amount of torque to be applied to each joint (see Section 9.3.3 for details).
In the experiment, the action vector a is chosen randomly and a noisy control system is simulated by adding a bimodal Gaussian noise vector. More specifically, the action a_i of the i-th joint is first drawn from the uniform distribution on [s_i − 0.087, s_i + 0.087], where s_i denotes the state of the i-th joint. The drawn action is then contaminated by Gaussian noise with mean 0 and standard deviation 0.034 with probability 0.6, and Gaussian noise with mean −0.087 and standard deviation 0.034 with probability 0.4. By repeatedly controlling the robot M times, transition samples {(s_m, a_m, s'_m)}_{m=1}^M are obtained. Our goal is to learn the system dynamics as a state transition probability p(s'|s,a) from these samples.
The following three scenarios are considered: using only 2 joints (right shoulder pitch and right elbow pitch), only 4 joints (in addition, right shoulder roll and waist yaw), and all 9 joints. These setups correspond to 6-dimensional input and 4-dimensional output in the 2-joint case, 12-dimensional input and 8-dimensional output in the 4-joint case, and 27-dimensional input and 18-dimensional output in the 9-joint case.
Five hundred, 1000, and 1500 transition samples are generated for the 2-joint, 4-joint, and 9-joint cases, respectively. Then randomly chosen n = 100, 200, and 500 samples are used for conditional density estimation, and the rest are used for evaluating the test error. The results are summarized in Table 11.1, showing that SCE-LSCDE performs well for all three cases. Table 11.2 describes the dimensionalities selected by cross-validation. This shows that the dimensionalities are much reduced, implying that the transition of the humanoid robot is highly redundant.

TABLE 11.1: Mean and standard error of the conditional density estimation error over 10 runs for the benchmark and robot transition datasets, comparing SCE-based and SMI-based dimensionality reduction and no dimensionality reduction, each followed by LSCDE or ǫKDE (smaller is better; the best method and comparable methods according to the t-test at the 5% significance level are indicated).

TABLE 11.2: Mean and standard error of the chosen subspace dimensionality over 10 runs for benchmark and robot transition datasets.

                                 SCE-based                   SMI-based
  Dataset        (dx, dy)     LSCDE         ǫKDE          LSCDE         ǫKDE
  Housing        (13, 1)      3.9 (0.74)    2.0 (0.79)    2.0 (0.39)    1.3 (0.15)
  Auto MPG       (7, 1)       3.2 (0.66)    1.3 (0.15)    2.1 (0.67)    1.1 (0.10)
  Servo          (4, 1)       1.9 (0.35)    2.4 (0.40)    2.2 (0.33)    1.6 (0.31)
  Yacht          (6, 1)       1.0 (0.00)    1.0 (0.00)    1.0 (0.00)    1.0 (0.00)
  Physicochem    (9, 1)       6.5 (0.58)    1.9 (0.28)    6.6 (0.58)    2.6 (0.86)
  White Wine     (11, 1)      1.2 (0.13)    1.0 (0.00)    1.4 (0.31)    1.0 (0.00)
  Red Wine       (11, 1)      1.0 (0.00)    1.3 (0.15)    1.2 (0.20)    1.0 (0.00)
  Forest Fires   (12, 1)      1.2 (0.20)    4.9 (0.99)    1.4 (0.22)    6.8 (1.23)
  Concrete       (8, 1)       1.0 (0.00)    1.0 (0.00)    1.2 (0.13)    1.0 (0.00)
  Energy         (8, 2)       5.9 (0.10)    3.9 (0.80)    2.1 (0.10)    2.0 (0.30)
  Stock          (7, 2)       3.2 (0.83)    2.1 (0.59)    2.1 (0.60)    2.7 (0.67)
  2 Joints       (6, 4)       2.9 (0.31)    2.7 (0.21)    2.5 (0.31)    2.0 (0.00)
  4 Joints       (12, 8)      5.2 (0.68)    6.2 (0.63)    5.4 (0.67)    4.6 (0.43)
  9 Joints       (27, 18)     13.8 (1.28)   15.3 (0.94)   11.4 (0.75)   13.2 (1.02)
11.4 Remarks
Coping with the high dimensionality of the state and action spaces is one of the most important challenges in model-based reinforcement learning. In this chapter, a dimensionality reduction method for conditional density estimation was introduced. The key idea was to use the squared-loss conditional entropy (SCE) for dimensionality reduction, which can be estimated by least-squares conditional density estimation. This allowed us to perform dimensionality reduction and conditional density estimation simultaneously in an integrated manner. In contrast, dimensionality reduction based on squared-loss mutual information (SMI) yields a two-step procedure of first reducing the dimensionality and then estimating the conditional density. SCE-based dimensionality reduction was shown to outperform the SMI-based method, particularly when the output follows a skewed distribution.
References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 1–8).
Abe, N., Melville, P., Pendus, C., Reddy, C. K., Jensen, D. L., Thomas, V. P., Bennett, J. J., Anderson, G. F., Cooley, B. R., Kowalczyk, M., Domick, M., & Gardinier, T. (2010). Optimizing debt collections using constrained reinforcement learning. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 75–84).
Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299–307.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI, USA: Oxford University Press.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml/
Baxter, J., Bartlett, P., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351–381.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge, UK: Cambridge University Press.
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA, USA: MIT Press.
Cheng, G., Hyon, S., Morimoto, J., Ude, A., Joshua, G. H., Colvin, G., Scroggin, W., & Stephen, C. J. (2007). CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21, 1097–1114.
Chung, F. R. K. (1997). Spectral graph theory. Providence, RI, USA: American Mathematical Society.
Coifman, R., & Maggioni, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21, 53–94.
Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100, 410–428.
Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9, 271–278.
Deisenroth, M. P., Neumann, G., & Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics, 2, 1–142.
Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of International Conference on Machine Learning (pp. 465–473).
Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46, 225–254.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Dijkstra, E. W. (1959). A note on two problems in connexion [sic] with graphs. Numerische Mathematik, 1, 269–271.
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20, 303–353.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.
Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement learning with Gaussian processes. Proceedings of International Conference on Machine Learning (pp. 201–208).
Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. Berlin, Germany: Springer-Verlag.
Fredman, M. L., & Tarjan, R. E. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34, 569–615.
Goldberg, A. V., & Harrelson, C. (2005). Computing the shortest path: A* search meets graph theory. Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 156–165).
Gooch, B., & Gooch, A. (2001). Non-photorealistic rendering. Natick, MA, USA: A. K. Peters Ltd.
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Guo, Q., & Kunii, T. L. (2003). "Nijimi" rendering algorithm for creating quality black ink paintings. Proceedings of Computer Graphics International (pp. 152–159).
Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publication.
Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. Proceedings of Annual Conference on Computer Graphics and Interactive Techniques (pp. 453–460).
Hertzmann, A. (2003). A survey of stroke based rendering. IEEE Computer Graphics and Applications, 23, 70–81.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Huber, P. J. (1981). Robust statistics. New York, NY, USA: Wiley.
Kakade, S. (2002). A natural policy gradient. Advances in Neural Information Processing Systems 14 (pp. 1531–1538).
Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86, 335–367.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 90, 431–460.
Kober, J., & Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84, 171–203.
Koenker, R. (2005). Quantile regression. Cambridge, MA, USA: Cambridge University Press.
Kohonen, T. (1995). Self-organizing maps. Berlin, Germany: Springer.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.
Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–342.
Mahadevan, S. (2005). Proto-value functions: Developmental reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 553–560).
Mangasarian, O. L., & Musicant, D. R. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 950–955.
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010a). Nonparametric return distribution approximation for reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 799–806).
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010b). Parametric return density estimation for reinforcement learning. Conference on Uncertainty in Artificial Intelligence (pp. 368–375).
Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2219–2225).
Peters, J., & Schaal, S. (2007). Reinforcement learning by reward-weighted regression for operational space control. Proceedings of International Conference on Machine Learning (pp. 745–750). Corvallis, Oregon, USA.
Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of International Conference on Machine Learning (pp. 759–766).
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA, USA: MIT Press.
Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26, 1443–1471.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York, NY, USA: Wiley.
Schaal, S. (2009). The SL simulation and real-time control software package (Technical Report). Computer Science and Neuroscience, University of Southern California.
Sehnke, F., Osendorfer, C., Rückstiess, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23, 551–559.
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.
Siciliano, B., & Khatib, O. (Eds.). (2008). Springer handbook of robotics. Berlin, Germany: Springer-Verlag.
Sugimoto, N., Tangkaratt, V., Wensveen, T., Zhao, T., Sugiyama, M., & Morimoto, J. (2014). Efficient reuse of previous experiences in humanoid motor learning. Proceedings of IEEE-RAS International Conference on Humanoid Robots (pp. 554–559).
Sugiyama, M. (2006). Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7, 141–166.
Sugiyama, M., Hachiya, H., Towell, C., & Vijayakumar, S. (2008). Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25, 287–304.
Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.
Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.
Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D, 583–594.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press.
Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725–758.
Takeda, A. (2007). Support vector machine based on conditional value-at-risk minimization (Technical Report B-439). Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.
Tangkaratt, V., Mori, S., Zhao, T., Morimoto, J., & Sugiyama, M. (2014). Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural Networks, 57, 128–140.
Tangkaratt, V., Xie, N., & Sugiyama, M. (2015). Conditional density estimation with dimensionality reduction via squared-loss conditional entropy minimization. Neural Computation, 27, 228–254.
Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Tibshirani, R. (1996). Regression shrinkage and subset selection with the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12, 1537–1586.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.
Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM toolbox for Matlab 5 (Technical Report A57). Helsinki University of Technology.
Wahba, G. (1990). Spline models for observational data. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
Wang, X., & Dietterich, T. G. (2003). Model-based policy gradient reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 776–783).
Wawrzynski, P. (2009). Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22, 1484–1497.
Weaver, L., & Baxter, J. (1999). Reinforcement learning from state and temporal differences (Technical Report). Department of Computer Science, Australian National University.
Weaver, L., & Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (pp. 538–545).
Williams, J. D., & Young, S. J. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Xie, N., Hachiya, H., & Sugiyama, M. (2013). Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. IEICE Transactions on Information and Systems, E95-D, 1134–1144.
Xie, N., Laga, H., Saito, S., & Nakajima, M. (2011). Contour-driven Sumi-e rendering of real photos. Computers & Graphics, 35, 122–134.
Zhao, T., Hachiya, H., Niu, G., & Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26, 118–129.
Zhao, T., Hachiya, H., Tangkaratt, V., Morimoto, J., & Sugiyama, M. (2013). Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation, 25, 1512–1547.