
Hessian matrix distribution for Bayesian policy gradient reinforcement learning

Ngo Anh Vien a,*, Hwanjo Yu b, TaeChoong Chung a

a Artificial Intelligence Laboratory, Department of Computer Engineering, School of Electronics and Information, Kyung Hee University, 1 Seocheon, Giheung, Yongin, Gyeonggi 446-701, South Korea
b Data Mining Laboratory, Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), South Korea

* Corresponding author. Present address: Department of Computer Science, School of Computing, National University of Singapore, Singapore.

Information Sciences 181 (2011) 1671–1685. doi:10.1016/j.ins.2011.01.001

Article history: Received 3 July 2008; Received in revised form 22 October 2010; Accepted 1 January 2011; Available online 9 January 2011.

Keywords: Markov decision process; Reinforcement learning; Bayesian policy gradient; Monte-Carlo policy gradient; Policy gradient; Hessian matrix distribution


Abstract

Bayesian policy gradient algorithms have recently been proposed for modeling the policy gradient of the performance measure in reinforcement learning as a Gaussian process. These methods are known to reduce the variance and the number of samples needed to obtain accurate gradient estimates in comparison to conventional Monte-Carlo policy gradient algorithms. In this paper, we propose an improvement over previous Bayesian frameworks for the policy gradient. We use the Hessian matrix distribution as a learning rate schedule to improve the performance of the Bayesian policy gradient algorithm in terms of the variance and the number of samples. As in computing the policy gradient distributions, the Bayesian quadrature method is used to estimate the Hessian matrix distributions. We prove that the posterior mean of the Hessian distribution estimate is symmetric, one of the important properties of the Hessian matrix. Moreover, we prove that, with an appropriate choice of kernel, the computational complexity of the Hessian distribution estimate is equal to that of the policy gradient distribution estimates. Using simulations, we show encouraging experimental results comparing the proposed algorithm to the Bayesian policy gradient and the Bayesian policy natural gradient algorithms described in Ghavamzadeh and Engel [10].


1. Introduction

Dynamic programming is the method of choice for studying Markov decision problems, in which decisions are made under uncertainty. The framework of dynamic programming (DP) and Markov decision problems (MDPs), originally proposed by Bellman [3], is quite extensive and rigorous. For example, value iteration, policy iteration, and linear programming algorithms can be used to find optimal policies of MDPs. However, the main drawback of these classical DP algorithms is that, for every decision, they require the computation of the corresponding one-step transition probability matrix and one-step transition reward matrix. These transitions use the distributions of the random variables that govern the stochastic processes underlying the system. Consequently, it becomes burdensome or impossible to apply DP to problems with large or infinite state spaces, or in situations where the system dynamics (or a model of the problem) are not available. In the absence of better approaches, problem-specific heuristic algorithms are often used to reach acceptable near-optimal solutions.

To overcome the need for an accurate model of the problem and the memory requirements for storing a model, reinforcement learning (RL) with function approximation was proposed in [4,25,8,22,27]. RL combines DP and stochastic approximation


(SA) to learn the optimal value function at run time from a family of parameterized functions that are compactly represented. This approach has achieved remarkable empirical successes in a number of different domains, such as fuzzy control [29], adaptive agents [19], stock trading [14], and evolutionary programming [28]. Recently, there has been growing interest in direct policy gradient methods for approximate planning in large MDPs [2,26]. These methods seek to find a good policy within a restricted class of policies by following the gradient of the expected reward. More specifically, let θ denote the vector of policy parameters and ρ denote the performance of the corresponding policy (e.g., the average reward per step). Then, the policy parameters are updated approximately proportionally to the gradient:

$$\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta} \qquad (1)$$

Another approach, which replaces the policy gradient estimate with an estimate of the so-called natural policy gradient, was recently proposed in [13,1,17,5]. This idea, aimed at speeding up policy gradient algorithms, uses the inverse Fisher information matrix of the policy in the policy update rule. The natural policy gradient method has been shown to significantly outperform the conventional policy gradient method.

Both the conventional and natural policy gradient methods rely on Monte-Carlo (MC) techniques to estimate the gradient of the performance measure. However, MC is a frequentist procedure which may lead to many inconsistencies, as outlined by O'Hagan in [15]; the same author proposed a Bayesian alternative to MC estimation in [16]. Inspired by this proposal, Bayesian frameworks for policy gradients, including the Bayesian policy gradient (BPG) and Bayesian policy natural gradient (BPNG) algorithms, were proposed in [10], in which the policy gradient is modeled as a Gaussian process (GP). The BPG algorithm is known to reduce the number of samples needed to obtain accurate gradient estimates. It uses GPs to define a prior distribution over the gradient of the expected return and then computes the posterior, conditioned on the observed data. The experimental results are encouraging, especially if the observed sample size is large enough. Another Bayesian RL formulation is Gaussian Process Temporal Difference (GPTD) [7], which applies GPs to the problem of online value-function estimation. A simple online procedure for improving the performance of GPTD by automatically selecting the prior covariance function is proposed in [21]. In addition to these model-free Bayesian RL algorithms, the Bayesian framework has also been applied to model-based RL in [18].

In this paper, we propose a Newton Bayesian policy gradient (NBPG) algorithm, an improvement over the BPG method proposed in [10]. We compute the posterior of the second derivatives (Hessian matrix) of the average reward (assuming they exist) in parallel with the posterior of the gradient. As a result, the Hessian matrix distribution provides a learning rate schedule that improves the convergence speed of BPG. To implement this idea, Bayesian quadrature [16] is used to compute a posterior estimate of the second derivatives given a single sample path of the underlying MDP. We prove that the posterior mean of the Hessian distribution estimate is symmetric. Moreover, we prove that, with a proper kernel choice, the computational complexity of the Hessian distribution estimate is equal to that of the policy gradient estimate. Experiments show that the NBPG algorithm, with its efficient learning rate schedule, requires the fewest parameter updates, because it uses the Hessian matrix distribution to choose a suitable learning rate schedule. This is due to the dual certainty in the parameter updating step of the NBPG algorithm: because NBPG computes both the gradient and Hessian distributions from sample paths, it achieves dual certainty over the sample paths, whereas BPG only computes the posterior distribution of the gradient and its learning rate is set independently of the sample paths, so it is less certain. Thus, the NBPG algorithm is helpful with small sample path sizes. Even though NBPG has larger CPU times for larger sample sizes, it still requires fewer updates; that is, it requires less interaction with the environment than the BPG and BPNG algorithms. This is useful in problem domains where the agents can only have limited interactions with the environment.

The rest of this paper is organized as follows. In the next section, we present some preliminaries on reinforcement learning and the Bayesian policy gradient. In Section 3, we define the matrix of second derivatives (Hessian) and derive its exact posterior moments. In Section 4, we describe an application of the online sparsification algorithm for our method. In Section 5, we present experiments on a continuous state-action linear quadratic regulation problem and a mountain car problem. Section 6 concludes the paper with a discussion of avenues for future work.

2. Reinforcement learning and Bayesian policy gradient algorithms

2.1. Reinforcement learning

Reinforcement learning (RL) [4] is a sub-area of machine learning concerned with the behavior of agents acting in unknown environments. The environment is typically modeled as a finite-state Markov decision process (MDP). An MDP consists of a state space S, an action space A, a reward distribution r(s, a), an initial state distribution P_0(·), and a transition distribution P(·|s, a). The agent's objective is to find an optimal policy μ(·|s) that maximizes the long-term performance, represented as the expected discounted reward:

$$\eta(\mu) = E_\mu\left\{ \sum_{t=0}^{T} \gamma^t\, r(s_t, a_t) \right\} \qquad (2)$$


where the policy μ(·|s) is the probability of choosing an action if the agent is at state s, the parameter γ ∈ (0,1) is the discount factor, and T is the length of the time horizon.

Let R(ξ) be the expected discounted reward, which depends on a random variable ξ. The random variable ξ is a sample path of a specified length, ξ = (s_0, a_0, ..., s_{T−1}, a_{T−1}, s_T) ∈ Ξ, and μ(·|s, θ) is the probability distribution over the possible actions to be taken at state s, parametrized by a parameter vector θ ∈ R^K. Then, the expected performance can be rewritten as

$$\eta(\theta) = E_\mu\{R(\xi)\} = \int R(\xi)\, P(\xi|\theta)\, d\xi \qquad (3)$$

where P(ξ|θ) is the probability of generating a path ξ given a policy μ:

$$P(\xi|\theta) = P_0(s_0) \prod_{t=0}^{T-1} \mu(a_t|s_t;\theta)\, P(s_{t+1}|s_t, a_t) \qquad (4)$$

Under mild conditions, the gradient of η(θ) can be computed as¹

$$\nabla\eta(\theta) = \int R(\xi)\, \frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\, P(\xi;\theta)\, d\xi = \int R(\xi)\, \nabla \log P(\xi;\theta)\, P(\xi;\theta)\, d\xi \qquad (5)$$

In conventional Monte-Carlo (MC) algorithms, a simulator is used to generate i.i.d. sample paths {ξ_i}_{i=1}^M from P(ξ|θ), which yield an unbiased estimate of the gradient in Eq. (5):

$$\widetilde{\nabla}\eta(\theta) = \frac{1}{M} \sum_{i=1}^{M} R(\xi_i)\, \nabla \log P(\xi_i;\theta) \qquad (6)$$

From the definition of P(ξ_i;θ) in Eq. (4), the MC gradient estimate can be rewritten as

$$\widetilde{\nabla}\eta(\theta) = \frac{1}{M} \sum_{i=1}^{M} R(\xi_i) \sum_{t=0}^{T_i-1} \nabla \log \mu(a_{t,i}|s_{t,i};\theta) \qquad (7)$$

The estimate ∇̃η(θ) converges to ∇η(θ) with probability one by the law of large numbers. For ease of presentation in the next sections, we define

$$u(\xi) = \nabla \log P(\xi;\theta) \qquad (8)$$

which is known as the likelihood ratio or score function in classical statistics.
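For concreteness, the following Python sketch implements the MC estimator of Eq. (7) using the score function of Eq. (8). The `sample_path` simulator and `grad_log_mu` callables are hypothetical stand-ins for a concrete environment and policy parameterization, not names from the paper.

```python
import numpy as np

def mc_policy_gradient(sample_path, grad_log_mu, theta, M):
    """Monte-Carlo estimate of the policy gradient, Eq. (7).

    sample_path(theta)         -> list of (s, a, r) tuples (hypothetical simulator)
    grad_log_mu(s, a, theta)   -> gradient of log mu(a|s; theta) w.r.t. theta
    """
    grad = np.zeros_like(theta)
    for _ in range(M):
        path = sample_path(theta)
        R = sum(r for (_, _, r) in path)                              # cumulative return R(xi)
        u = sum(grad_log_mu(s, a, theta) for (s, a, _) in path)       # score u(xi), Eq. (8)
        grad += R * u
    return grad / M
```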

2.2. Bayesian policy gradient algorithms

Although MC estimates are unbiased, they tend to produce a high variance or, alternatively, to require excessive sample sizes [15]. In [16,20], the authors argue that MC is a frequentist method which leads to many inconsistencies, such as violating the likelihood principle and exploiting the given data inefficiently. To overcome these limitations, they propose a Bayesian alternative to MC estimation. A specific example is Bayesian quadrature, as presented by Ghavamzadeh and Engel [10]. The idea is to model integrals of the form

$$t = \int f(x)\, p(x)\, dx \qquad (9)$$

as Gaussian processes (GPs). This Bayesian treatment allows the incorporation of prior knowledge into the estimation, such as the smoothness of the integrand. With this treatment, samples can be drawn from any distribution, which is one major advantage of the Bayesian approach over Monte Carlo; this is helpful in designing sample points to maximize the information gain. The unknown quantities t and f(x) are assumed to be random, an assumption consistent with the Bayesian view that all forms of uncertainty are represented by probabilities. Because t is a function of f(x), f is modeled as a GP with a Normal prior distribution over functions, and combining this prior with the observations yields the posterior over f, which in turn implies a distribution over the desired t. The prior distribution of f(x) is denoted by f(·) ~ N{0, k(·,·)}, which means E[f(x)] = 0 and Cov[f(x), f(x′)] = k(x, x′). Then, for any given sample set D_M = {(x_i, y_i)}_{i=1}^M, where y_i is a noisy measurement of f(x_i) (assuming that the noise is also Gaussian), the posterior distribution f | D_M is Gaussian, and because of the properties of a GP, the joint distribution over those samples can be written as

$$\mathbf{f}_M = (f(x_1), f(x_2), \ldots, f(x_M))^{\top} \sim \mathcal{N}(0, \mathbf{K}) \qquad (10)$$

Thus, the mean and covariance of the posterior f | D_M are:

$$E[f(x)\,|\,\mathcal{D}_M] = \mathbf{k}_M(x)^{\top} Q_M \mathbf{f}_M, \qquad \mathrm{Cov}[f(x), f(x')\,|\,\mathcal{D}_M] = k(x,x') - \mathbf{k}_M(x)^{\top} Q_M \mathbf{k}_M(x') \qquad (11)$$

¹ Gradients with respect to the parameters θ, i.e., ∇_θ, are denoted simply by ∇.


where k_M(x) = (k(x_1, x), ..., k(x_M, x))^⊤ and Q_M = (K + Σ_M)^{-1}. Here we assume i.i.d. measurement noise with variance σ²; thus the noise covariance matrix is Σ_M = σ²I, where I is the M × M identity matrix. The quantity t in Eq. (9) is just a sum of infinitely many Gaussian variables; thus, its posterior distribution is also Gaussian, with mean and variance:

$$E[t\,|\,\mathcal{D}_M] = \int E[f(x)\,|\,\mathcal{D}_M]\, p(x)\, dx, \qquad \mathrm{Var}[t\,|\,\mathcal{D}_M] = \iint \mathrm{Cov}[f(x), f(x')\,|\,\mathcal{D}_M]\, p(x)\, p(x')\, dx\, dx' \qquad (12)$$

Substituting Eq. (11) into Eq. (12), we have

$$E[t\,|\,\mathcal{D}_M] = \mathbf{z}_M^{\top} Q_M \mathbf{f}_M, \qquad \mathrm{Var}[t\,|\,\mathcal{D}_M] = z_0 - \mathbf{z}_M^{\top} Q_M \mathbf{z}_M \qquad (13)$$

where

$$\mathbf{z}_M = \int \mathbf{k}_M(x)\, p(x)\, dx, \qquad z_0 = \iint k(x,x')\, p(x)\, p(x')\, dx\, dx'$$
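The posterior moments in Eqs. (10)–(13) can be computed directly once z_M and z_0 are available. The following is a minimal numerical sketch in which z_M and z_0 are approximated by sampling from p(x); the kernel and the noise level are placeholders, not the specific choices made later in the paper.

```python
import numpy as np

def bayesian_quadrature(xs, ys, p_samples, kernel, sigma2=1e-2):
    """Posterior mean/variance of t = integral of f(x) p(x) dx, Eqs. (10)-(13).

    xs, ys    : observed inputs and noisy values of f
    p_samples : samples drawn from p(x), used to approximate z_M and z_0
    kernel    : k(x, x') -> float
    """
    M = len(xs)
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    Q = np.linalg.inv(K + sigma2 * np.eye(M))                        # Q_M = (K + sigma^2 I)^-1
    # z_M and z_0 approximated by Monte-Carlo over p(x)
    zM = np.array([np.mean([kernel(x, s) for s in p_samples]) for x in xs])
    z0 = np.mean([[kernel(a, b) for b in p_samples] for a in p_samples])
    mean = zM @ Q @ np.array(ys)                                     # E[t | D_M]
    var = z0 - zM @ Q @ zM                                           # Var[t | D_M]
    return mean, var
```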

In [10], the authors use Bayesian MC to estimate the gradient of the expected return, and propose Bayesian policy gradient (BPG) algorithms. In this Bayesian approach, the cumulative return η(θ) is considered to be a random variable because of the assumed Bayesian uncertainty, and from Eq. (5), the posterior distribution of the gradient ∇η(θ) is given by:

$$\nabla E[\eta(\theta)\,|\,\mathcal{D}_M] = E[\nabla\eta(\theta)\,|\,\mathcal{D}_M] = E\left[\int R(\xi)\, \frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\, P(\xi;\theta)\, d\xi \;\Big|\; \mathcal{D}_M\right] \qquad (14)$$

The authors in [10] then cast this problem of estimating the gradient into the form of Eq. (9), that is, they compute the posterior moments of the gradient ∇η(θ) conditioned on the observed data. In [10], two Bayesian models result from two different ways of partitioning the integrand in Eq. (14):

$$f(\xi;\theta) = R(\xi)\, \nabla \log P(\xi;\theta) \quad \text{(Model 1)}, \qquad f(\xi;\theta) = R(\xi) \quad \text{(Model 2)} \qquad (15)$$

This means that the p(ξ) component is P(ξ;θ) or ∇P(ξ;θ) for Model 1 or Model 2, respectively. Finally, they propose the BPG algorithm described in Algorithm 2.2 (see [9] for the kernel function chosen for each model). The following definitions are used in Algorithm 2.2:

• f_M = (R(ξ_1;θ), ..., R(ξ_M;θ)): the noisy measurements, as in Eq. (10), of f(·) for Model 2.
• F_M = (f(ξ_1;θ), ..., f(ξ_M;θ)) ~ N(0, K + σ²I): the noisy measurements of f(·) for Model 1.
• U_M = [u(ξ_1), ..., u(ξ_M)], where u(ξ) is defined in Eq. (8).

$$\mathbf{z}_M = \int \mathbf{k}_M(\xi)\, P(\xi;\theta)\, d\xi, \qquad z_0 = \iint k(\xi,\xi')\, P(\xi;\theta)\, P(\xi';\theta)\, d\xi\, d\xi'$$

z_M and z_0 are defined as in Eq. (13) when applying Bayesian quadrature for Model 1. In [9,10], the authors use the quadratic Fisher kernel [12]:

$$k(\xi_i, \xi_j) = \left(1 + u(\xi_i)^{\top} G^{-1} u(\xi_j)\right)^2 \qquad (16)$$

and show that (z_M)_i = 1 + u(ξ_i)^⊤G^{-1}u(ξ_i) and z_0 = 1 + n, where n is the number of policy parameters and G = E[u(ξ)u(ξ)^⊤] is the Fisher information matrix. Using the generated sample paths, the Fisher information matrix G is estimated via the Monte-Carlo technique.
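As a small illustration, the quadratic Fisher kernel of Eq. (16) and the MC estimate of the Fisher information matrix G can be sketched as follows; `scores` is assumed to be the list of score vectors u(ξ_i) computed from the sampled paths.

```python
import numpy as np

def fisher_information(scores):
    """MC estimate of G = E[u(xi) u(xi)^T] from the score vectors u(xi_i)."""
    U = np.stack(scores)                      # M x n matrix of score vectors
    return U.T @ U / len(scores)

def quadratic_fisher_kernel(u_i, u_j, G_inv):
    """Quadratic Fisher kernel of Eq. (16)."""
    return (1.0 + u_i @ G_inv @ u_j) ** 2
```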

$$Z_M = \int \nabla P(\xi;\theta)\, \mathbf{k}_M(\xi)^{\top}\, d\xi, \qquad Z_0 = \iint k(\xi,\xi')\, \nabla P(\xi;\theta)\, \nabla P(\xi';\theta)^{\top}\, d\xi\, d\xi'$$


Z_M and Z_0 are defined as in Eq. (13) when applying Bayesian quadrature for Model 2. In [9,10], the authors use the Fisher kernel [12]:

$$k(\xi_i, \xi_j) = u(\xi_i)^{\top} G^{-1} u(\xi_j) \qquad (17)$$

and show that Z_M = U_M and Z_0 = G − U_M Q_M U_M^⊤.

Algorithm 2.2. Bayesian policy gradient computation

GradientCompute(θ, M): θ ∈ R^n, sample size M > 0
  Set G = G(θ), D_0 = ∅
  for i = 1 to M
    Sample a path ξ_i using the policy μ(θ)
    D_i = D_{i-1} ∪ {ξ_i}
    Compute u(ξ_i) = Σ_{t=0}^{T_i-1} ∇ log μ(a_t | s_t; θ)
    R(ξ_i) = Σ_{t=0}^{T_i-1} r(s_t, a_t)
    Update K_i using K_{i-1} and ξ_i
    Model 1: f(ξ_i) = R(ξ_i) u(ξ_i),   z_M^(i) = 1 + u(ξ_i)^⊤ G^{-1} u(ξ_i)
    Model 2: f(ξ_i) = R(ξ_i),   Z(:, i) = u(ξ_i)
  end
  Q_M = (K + σ²I)^{-1}
  Compute the posterior mean and covariance:
    Model 1: E[∇η(θ) | D_M] = F_M Q_M z_M,   Cov[∇η(θ) | D_M] = (z_0 − z_M^⊤ Q_M z_M) I
    Model 2: E[∇η(θ) | D_M] = Z_M Q_M f_M,   Cov[∇η(θ) | D_M] = Z_0 − Z_M Q_M Z_M^⊤
  return: E[∇η(θ) | D_M], Cov[∇η(θ) | D_M]
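A compact sketch of the Model 1 branch of Algorithm 2.2 is given below. The `sample_path`, `score`, and `returns_of` callables are hypothetical, and the Fisher information matrix G is assumed to have been estimated beforehand (e.g., by MC as above).

```python
import numpy as np

def bpg_gradient_model1(sample_path, score, returns_of, theta, M, G, sigma2=1e-2):
    """Sketch of Algorithm 2.2 (Model 1): posterior mean/covariance of the
    policy gradient via Bayesian quadrature with the quadratic Fisher kernel.

    sample_path(theta) -> a path xi (hypothetical simulator)
    score(xi, theta)   -> u(xi) = grad log P(xi; theta), an n-vector
    returns_of(xi)     -> R(xi), the cumulative reward of the path
    G                  -> estimated Fisher information matrix, n x n
    """
    G_inv = np.linalg.inv(G)
    paths = [sample_path(theta) for _ in range(M)]
    U = np.stack([score(xi, theta) for xi in paths])                  # M x n score vectors
    F = np.stack([returns_of(xi) * u for xi, u in zip(paths, U)]).T   # n x M, f(xi) = R(xi) u(xi)
    K = (1.0 + U @ G_inv @ U.T) ** 2                                  # quadratic Fisher kernel, Eq. (16)
    Q = np.linalg.inv(K + sigma2 * np.eye(M))                         # Q_M
    zM = 1.0 + np.einsum('ij,jk,ik->i', U, G_inv, U)                  # (z_M)_i = 1 + u_i^T G^-1 u_i
    z0 = 1.0 + theta.size                                             # z_0 = 1 + n
    grad_mean = F @ Q @ zM                                            # E[grad eta | D_M]
    grad_cov = (z0 - zM @ Q @ zM) * np.eye(theta.size)                # Cov[grad eta | D_M]
    return grad_mean, grad_cov
```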

3. Hessian matrix distribution

In Newton's method, one uses the second-order approximation to find the minimum of a function. The iterative scheme of Newton's method can be generalized to several dimensions by replacing the derivative with the gradient, ∇η(θ), and the reciprocal of the second derivative with the inverse of the Hessian matrix, ∇²η(θ). Newton's method uses curvature information to take a more direct route than gradient descent algorithms. In this section, we use the principle behind this method to set the learning rate using second-order derivatives of the average reward function. We use the Bayesian quadrature method [20,10] to estimate the Hessian matrix distribution. Given a twice-differentiable density P(ξ;θ), we have

$$\nabla^2\eta(\theta) = \int R(\xi)\, \frac{\nabla^2 P(\xi;\theta)}{P(\xi;\theta)}\, P(\xi;\theta)\, d\xi \qquad (18)$$

We also cast the problem of computing the Hessian matrix in the form of Eq. (9), partitioning the integrand into two parts, f′(ξ;θ) and p(ξ;θ). We investigate only one such partitioning of Eq. (18) into a GP f′(ξ;θ) and a function p(ξ;θ) below. We cast p(ξ;θ) and f′(ξ;θ) as²

$$p(\xi) = P(\xi;\theta), \qquad f'(\xi;\theta) = R(\xi)\, \frac{\nabla^2 P(\xi;\theta)}{P(\xi;\theta)} \qquad (19)$$

On the other hand,

$$\frac{\nabla^2 P(\xi;\theta)}{P(\xi;\theta)} = \frac{\nabla\!\left(\frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\, P(\xi;\theta)\right)}{P(\xi;\theta)} = \frac{\nabla\!\left(\frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\right) P(\xi;\theta)}{P(\xi;\theta)} + \left(\frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\right)\left(\frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\right)^{\!\top} = \nabla^2 \log P(\xi;\theta) + \left[\nabla \log P(\xi;\theta)\right]^2 \qquad (20)$$

² We use f′(·) and k′(·) to denote, for the Hessian estimate, functions with the same meaning as the corresponding functions used in estimating the gradient in the previous section.


where the second term on the right-hand side is the outer product of ∇log P(ξ;θ) with itself, that is, the matrix with entries

$$\frac{\partial \log P(\xi;\theta)}{\partial \theta_i}\, \frac{\partial \log P(\xi;\theta)}{\partial \theta_j}$$

Thus, if n is the number of policy parameters, f′(ξ;θ) is an n × n matrix, and so is ∇²η(θ). For ease of GP regression, we reshape the matrices f′(ξ;θ) and ∇²η(θ) into n² × 1 vectors; this operation concatenates all the columns of a matrix into one column vector. In order to restore the true form of the Hessian matrix, we perform the inverse operation, from an n² × 1 vector back to an n × n matrix, after obtaining the posterior result. It is easy to see that this operation does not change the result and preserves the mean and covariance. Our goal is now to compute the posterior mean and covariance of the n² × 1 vector ∇²η(θ). We place a GP prior over {f′(ξ_i;θ)}_{i=1}^M,

$$F'_M = (f'(\xi_1;\theta), \ldots, f'(\xi_M;\theta)) \sim \mathcal{N}(0, C_M + \beta^2 I)$$

where C_M is the kernel matrix, β² is the noise variance, and I is the identity matrix. The (i, j)th element of the kernel matrix, [C_M]_{(i,j)}, is itself an n² × n² matrix that represents the covariance between the components of f′(ξ_i;θ) and f′(ξ_j;θ). We set [C_M]_{(i,j)} = k′(ξ_i, ξ_j) I, so that it can be treated as a scalar and used as input to the Bayesian quadrature method [16]. We therefore obtain the following expressions for the posterior mean and covariance of the distribution of f′(ξ) | D_M:

$$E[f'(\xi)\,|\,\mathcal{D}_M] = F'_M B_M \mathbf{k}'_M(\xi), \qquad \mathrm{Cov}[f'(\xi), f'(\xi')\,|\,\mathcal{D}_M] = \left(k'(\xi,\xi') - \mathbf{k}'_M(\xi)^{\top} B_M \mathbf{k}'_M(\xi')\right) I \qquad (21)$$

where k′_M(ξ) = (k′(ξ, ξ_1), ..., k′(ξ, ξ_M))^⊤ and B_M = (C_M + β²I)^{-1}. Then, the Normal posterior mean and covariance of the second-order derivatives ∇²η(θ) are:

$$E[\nabla^2\eta(\theta)\,|\,\mathcal{D}_M] = F'_M B_M \boldsymbol{\omega}_M, \qquad \mathrm{Cov}[\nabla^2\eta(\theta)\,|\,\mathcal{D}_M] = \left(\omega_0 - \boldsymbol{\omega}_M^{\top} B_M \boldsymbol{\omega}_M\right) I \qquad (22)$$

where

$$\boldsymbol{\omega}_M = \int \mathbf{k}'_M(\xi)\, P(\xi;\theta)\, d\xi, \qquad \omega_0 = \iint k'(\xi,\xi')\, P(\xi;\theta)\, P(\xi';\theta)\, d\xi\, d\xi'$$

To derive a closed-form expression for ω_M and ω_0, we use a quadratic Fisher kernel different from that of Refs. [12,9]:

$$k'(\xi_i, \xi_j) = \left(1 + v(\xi_i)^{\top} G_2^{-1} v(\xi_j)\right)^2 \qquad (23)$$

where v(ξ;θ) = ∇²P(ξ;θ)/P(ξ;θ) is the second-order score function of the path ξ (with v(ξ;θ) reshaped into an n² × 1 vector), and G_2 is the Fisher matrix defined as

$$G_2 = E\left[v(\xi;\theta)\, v(\xi;\theta)^{\top}\right] \qquad (24)$$

It is not difficult to see that k′(ξ_i, ξ_j) is also a valid kernel function. We use the generated sample paths to estimate ω_M and ω_0 by MC estimation:

$$\omega_M^{(i)} = \frac{1}{M} \sum_{j=1}^{M} k'(\xi_j, \xi_i), \qquad \omega_0 = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} k'(\xi_j, \xi_i) \qquad (25)$$

Given a sample path ξ, we have

$$P(\xi;\theta) = P_0(s_0) \prod_{t=0}^{T-1} \mu(a_t|s_t;\theta)\, P(s_{t+1}|s_t, a_t)$$

Because the initial state distribution P_0 and the transition distribution P are independent of the policy parameter θ, we derive the expression of v(ξ) according to Eq. (20) as follows:

$$v(\xi;\theta) = \frac{\nabla^2 P(\xi;\theta)}{P(\xi;\theta)} = \sum_{t=0}^{T-1} \frac{\nabla^2 \mu(a_t|s_t;\theta)}{\mu(a_t|s_t;\theta)} - \sum_{t=0}^{T-1} \left(\frac{\nabla \mu(a_t|s_t;\theta)}{\mu(a_t|s_t;\theta)}\right)^{2} + \left[\sum_{t=0}^{T-1} \frac{\nabla \mu(a_t|s_t;\theta)}{\mu(a_t|s_t;\theta)}\right]^{2} \qquad (26)$$
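The second-order score of Eq. (26) can be accumulated along a single trajectory. The sketch below assumes hypothetical callables returning ∇μ/μ and ∇²μ/μ for a state-action pair, and interprets the squares in Eq. (26) as outer products, consistent with the convention of Eq. (20).

```python
import numpy as np

def second_order_score(path, grad_mu_over_mu, hess_mu_over_mu, n):
    """Second-order score v(xi; theta) of a sample path, Eq. (26).

    path            : list of (s, a) pairs of the sampled trajectory
    grad_mu_over_mu : (s, a) -> n-vector  grad mu(a|s;theta) / mu(a|s;theta)
    hess_mu_over_mu : (s, a) -> n x n     grad^2 mu(a|s;theta) / mu(a|s;theta)
    Squares in Eq. (26) are outer products; the result is an n x n matrix.
    """
    v = np.zeros((n, n))
    g_sum = np.zeros(n)
    for s, a in path:
        g = grad_mu_over_mu(s, a)
        v += hess_mu_over_mu(s, a) - np.outer(g, g)    # first two sums of Eq. (26)
        g_sum += g
    v += np.outer(g_sum, g_sum)                         # last term: [sum of grad mu / mu]^2
    return v
```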


The pseudo-code for the Bayesian policy Newton gradient evaluation is shown in Algorithm 3. Using MC estimation, the Fisher information matrix G_2(θ) is estimated as follows:

$$\widehat{G}_2(\theta) = \frac{1}{M} \sum_{i=1}^{M} v(\xi_i;\theta)\, v(\xi_i;\theta)^{\top}$$

where the second-order score function v(ξ_i;θ) of the path ξ_i is calculated as in Eq. (26). The Newton Bayesian policy gradient (NBPG) algorithm is described in Algorithm 3.1. The update is repeated N times, or until ‖H^{-1}Δθ‖ is sufficiently close to zero.

Algorithm 3. Newton Bayesian policy gradient computation algorithm

GradientCompute(θ, M): θ ∈ R^n, sample size M > 0
  Set G = G(θ), G_2 = G_2(θ), D_0 = ∅
  for i = 1 to M
    Sample a path ξ_i using the policy μ(θ)
    D_i = D_{i-1} ∪ {ξ_i}
    Compute u(ξ_i) = Σ_{t=0}^{T_i-1} ∇ log μ(a_{t,i} | s_{t,i}; θ)
    Compute v(ξ_i) = Σ_{t=0}^{T_i-1} ∇²μ(a_{t,i} | s_{t,i}; θ) / μ(a_{t,i} | s_{t,i}; θ)
      v(ξ_i) = v(ξ_i) − Σ_{t=0}^{T_i-1} [∇ log μ(a_{t,i} | s_{t,i}; θ)]²
      v(ξ_i) = v(ξ_i) + [Σ_{t=0}^{T_i-1} ∇ log μ(a_{t,i} | s_{t,i}; θ)]²
    R(ξ_i) = Σ_{t=0}^{T_i-1} r(s_{t,i}, a_{t,i})
    Update K_i using K_{i-1} and ξ_i
    Model 1: f(ξ_i) = R(ξ_i) u(ξ_i),   z_M^(i) = 1 + u(ξ_i)^⊤ G^{-1} u(ξ_i)
    Model 2: f(ξ_i) = R(ξ_i),   Z(:, i) = u(ξ_i)
    Update for the Hessian distribution:
      • Update C_i using C_{i-1} and ξ_i
      • f′(ξ_i) = R(ξ_i) v(ξ_i)
  end
  Q_M = (K_M + σ²I)^{-1},   B_M = (C_M + β²I)^{-1}
  z_0 = 1 + n,   Z_0 = G − Z_M Q_M Z_M^⊤
  Compute ω_M and ω_0 according to Eq. (25).
  1. Posterior mean and covariance of ∇η(θ)
     • Model 1: E[∇η(θ) | D_M] = F_M Q_M z_M,   Cov[∇η(θ) | D_M] = (z_0 − z_M^⊤ Q_M z_M) I
     • Model 2: E[∇η(θ) | D_M] = Z_M Q_M f_M,   Cov[∇η(θ) | D_M] = Z_0 − Z_M Q_M Z_M^⊤
  2. Posterior mean and covariance of ∇²η(θ)
     • Change F′_M from n × n matrices to n² × 1 vectors
       E[∇²η(θ) | D_M] = F′_M B_M ω_M,   Cov[∇²η(θ) | D_M] = (ω_0 − ω_M^⊤ B_M ω_M) I
     • Change F′_M and E[∇²η(θ) | D_M] from n² × 1 vectors back to n × n matrices.
  return: E[∇η(θ) | D_M], Cov[∇η(θ) | D_M], E[∇²η(θ) | D_M], Cov[∇²η(θ) | D_M]
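The Hessian-specific part of Algorithm 3 can be sketched as follows. The kernel k′ of Eq. (23) and the MC estimates of Eqs. (24) and (25) are used; `paths`, `returns_of`, and `v_of` are hypothetical inputs, and a pseudo-inverse of G_2 is used purely as a numerical safeguard.

```python
import numpy as np

def hessian_posterior(paths, returns_of, v_of, n, beta2=1e-2):
    """Sketch of the Hessian part of Algorithm 3: posterior mean of grad^2 eta(theta)
    via Bayesian quadrature with the kernel k' of Eq. (23).

    paths      : M sampled trajectories
    returns_of : xi -> R(xi)
    v_of       : xi -> second-order score v(xi), an n x n matrix (Eq. (26))
    """
    M = len(paths)
    V = np.stack([v_of(xi).reshape(-1) for xi in paths])               # M x n^2 (vectorized v)
    G2 = V.T @ V / M                                                   # Eq. (24), MC estimate
    G2_inv = np.linalg.pinv(G2)                                        # pseudo-inverse as a safeguard
    C = (1.0 + V @ G2_inv @ V.T) ** 2                                  # kernel k', Eq. (23)
    B = np.linalg.inv(C + beta2 * np.eye(M))                           # B_M
    Fp = np.stack([returns_of(xi) * v_of(xi).reshape(-1) for xi in paths]).T   # n^2 x M
    omega_M = C.mean(axis=0)                                           # Eq. (25)
    omega_0 = C.mean()
    hess_mean = (Fp @ B @ omega_M).reshape(n, n)                       # E[grad^2 eta | D_M]
    hess_cov_scale = omega_0 - omega_M @ B @ omega_M                   # scalar factor in Eq. (22)
    return hess_mean, hess_cov_scale
```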


Algorithm 3.1. A Newton Bayesian policy gradient algorithm (NBPG)


Initialize θ_0, number of policy updates N > 0, evaluation sample size M > 0
For i = 1 to N − 1
  H = E[∇²η(θ_{i-1}) | D_M]
  Δθ_{i-1} = E[∇η(θ_{i-1}) | D_M]
  θ_i = θ_{i-1} + H^{-1} Δθ_{i-1}
End
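A minimal sketch of this outer loop is given below, assuming a `posterior_moments` routine that returns the posterior means of the gradient and Hessian (e.g., Algorithm 3); the stopping tolerance is a placeholder.

```python
import numpy as np

def nbpg(theta0, posterior_moments, N=100, M=10, tol=1e-4):
    """Outer loop of the NBPG algorithm (Algorithm 3.1): a Newton-style update
    that scales the posterior-mean gradient by the inverse posterior-mean Hessian.

    posterior_moments(theta, M) -> (grad_mean, hess_mean), e.g. from Algorithm 3.
    """
    theta = theta0.copy()
    for _ in range(N - 1):
        grad_mean, hess_mean = posterior_moments(theta, M)
        step = np.linalg.solve(hess_mean, grad_mean)    # H^{-1} Delta(theta), without forming H^{-1}
        theta = theta + step                             # update rule of Algorithm 3.1
        if np.linalg.norm(step) < tol:                   # stop when ||H^{-1} Delta(theta)|| is small
            break
    return theta
```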

3.1. Consistency in the Hessian matrix distribution

At first sight, it does not seem correct to use a GP as a prior on Hessian matrices: there are constraints on a Hessian (e.g., it must be symmetric), and it is not clear that a GP is the correct distribution. In this subsection, we prove that the mean of the Hessian matrix distribution derived in the previous subsection is symmetric. Recall that we reshape the second-order derivative ∇²η(θ) and f′(ξ) from n × n matrices into n² × 1 vectors for ease of computation. From Eqs. (21) and (22), we see that if E[f′(ξ) | D_M] is symmetric after being reshaped back into an n × n matrix, then E[∇²η(θ) | D_M] is symmetric after being reshaped back into an n × n matrix. Therefore, to prove the symmetry of the Hessian matrix distribution, we prove that E[f′(ξ) | D_M] is symmetric after being reshaped back into an n × n matrix.

Proposition 1. The posterior mean of the Hessian matrix distribution, obtained by reshaping the n² × 1 vector back into an n × n matrix as in Algorithm 3, is symmetric.

Proof. Let X = E[f′(ξ) | D_M] in Eq. (21), which is an n² × 1 vector, and let Y be the n × n matrix obtained by reshaping X back into an n × n matrix. Our goal is to prove that Y is symmetric, i.e., Y_{ij} = Y_{ji}; equivalently, we must prove that X_{n(i−1)+j} = X_{n(j−1)+i}.

From the definition of F′_M, we see that F′_M is an n² × M matrix, and each column of F′_M is an n² × 1 vector f′(ξ_m;θ) obtained by reshaping an n × n symmetric matrix. Hence f′(ξ_m;θ)_{n(i−1)+j} = f′(ξ_m;θ)_{n(j−1)+i} for every m, which implies that the matrix F′_M has the property that its (n(i−1)+j)th row equals its (n(j−1)+i)th row. Furthermore, because B_M is symmetric and k′_M(ξ) is only a vector, Eq. (21) gives X_{n(i−1)+j} = X_{n(j−1)+i}. This shows that E[∇²η(θ) | D_M] is symmetric after being reshaped back into an n × n matrix. □
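The argument can be checked numerically: if the columns of F′_M are vectorized symmetric matrices, the reshaped posterior mean of Eq. (21) is symmetric. The sketch below uses random stand-ins for B_M and k′_M(ξ), which is all the proof requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 3, 5

def vec(A):       # column-stacking of an n x n matrix into an n^2 vector
    return A.reshape(-1, order="F")

def unvec(x, n):  # inverse reshaping back into an n x n matrix
    return x.reshape(n, n, order="F")

# columns of Fp are vectorized symmetric matrices, playing the role of f'(xi_m)
sym = []
for _ in range(M):
    S = rng.standard_normal((n, n))
    sym.append((S + S.T) / 2)
Fp = np.stack([vec(S) for S in sym]).T          # n^2 x M

R = rng.standard_normal((M, M))
B = np.linalg.inv(R @ R.T + np.eye(M))          # stands in for B_M (symmetric, invertible)
k = rng.standard_normal(M)                      # stands in for k'_M(xi)

X = Fp @ B @ k                                  # posterior-mean formula of Eq. (21)
Y = unvec(X, n)
print(np.allclose(Y, Y.T))                      # True: the reshaped mean is symmetric
```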

3.2. Computational complexity

We note that in order to compute the gradient distribution and the Hessian matrix distribution we need to compute and store the M × M matrices Q_M and B_M, respectively, each incurring a cost of O(M³). This means that the computational complexity of the Newton Bayesian policy gradient computation in Algorithm 3 is equal to that of the Bayesian policy gradient computation in Algorithm 2.2. This is because we treat [C_M]_{(i,j)} = k′(ξ_i, ξ_j)I as a scalar, so that the n²M × n²M matrix C_M is handled as an M × M matrix. This equality could not be obtained with an ordinary quasi-Newton algorithm.

However, computing Q_M and B_M is computationally intractable in all but the smallest and simplest domains. These problems can be solved by using the sparsification solution from Refs. [7,6]. Using this sparsification method, we can selectively add a new observed sample path to a set of dictionary paths D_M, which are used as a basis for approximating the full regression solution.

4. NBPG with online sparsification

4.1. Online sparsification

In this section, we summarize the online sparsification algorithm of [6]. Due to the computational cost incurred, as mentioned above, we need to reduce the computational burden associated with the computation of the expressions E[∇η(θ) | D_M] and E[∇²η(θ) | D_M] in Algorithms 2.2 and 3. In [6], the authors proposed an online sparsification method for kernel algorithms, which can be explained briefly as follows. In view of Mercer's theorem, the kernel function k(·,·) is just an inner product in a high (possibly infinite) dimensional Hilbert space H (see [24]). This means that there exists a generally non-linear mapping φ: Ξ → H such that ⟨φ(ξ), φ(ξ′)⟩_H = k(ξ, ξ′). Although the dimension of H may be exceedingly high, the effective dimension of the manifold spanned by the set of vectors {φ(ξ_i)}_{i=1}^t is at most t, and may be significantly lower. Consequently, any expression that can be described as a linear combination of these vectors may be expressed, with arbitrary accuracy, by a smaller set of linearly independent feature vectors that approximately span this manifold.

As a consequence, a dictionary of samples is maintained that suffices to approximate any training sample (in feature space) up to some predefined accuracy threshold. Initially, the dictionary is empty, D_0 = {}. The algorithm observes the sequence of sample paths ξ_1, ξ_2, ... one path at a time, and admits ξ_t into the dictionary only if its feature-space image φ(ξ_t) cannot be approximated sufficiently well by combining the images of the paths already in D_t = {ξ̃_1, ..., ξ̃_{m_{t−1}}}. Assuming that {φ(ξ̃_j)}_{j=1}^{m_{t−1}} is the set of m_{t−1} linearly independent feature vectors corresponding to the dictionary paths at time t − 1, the least-squares approximation of φ(ξ_t) is calculated in terms of this set. This approximation can be found by solving the following problem:

$$\min_{a} \left\{ a^{\top} \widetilde{K}_{t-1} a - 2 a^{\top} \tilde{\mathbf{k}}_{t-1}(\xi_t) + k_{tt} \right\} \qquad (27)$$

where K̃_{t−1} is the kernel matrix of the dictionary paths at time t − 1 (i.e., [K̃_{t−1}]_{i,j} = k(ξ̃_i, ξ̃_j) with i, j = 1, ..., m_{t−1}), k̃_{t−1}(ξ) = (k(ξ̃_1, ξ), ..., k(ξ̃_{m_{t−1}}, ξ))^⊤, and k_{tt} = k(ξ_t, ξ_t). The solution of (27) is a_t = K̃_{t−1}^{-1} k̃_{t−1}(ξ_t), and substituting it back into (27) we obtain the minimal squared error incurred by the approximation,

$$\varepsilon_t = k_{tt} - \tilde{\mathbf{k}}_{t-1}(\xi_t)^{\top} a_t = k_{tt} - \tilde{\mathbf{k}}_{t-1}(\xi_t)^{\top} \widetilde{K}_{t-1}^{-1} \tilde{\mathbf{k}}_{t-1}(\xi_t) \qquad (28)$$

Let ν denote the accuracy threshold. If ε_t > ν, then ξ_t is added to the dictionary and we set a_t = (0, ..., 0, 1)^⊤, because ξ̃_t is exactly represented by itself; if ε_t ≤ ν, the dictionary remains unchanged. This approximation means that all of the feature vectors corresponding to the paths seen up to time t can be approximated by the dictionary at time t with a maximum squared error ν:

$$\varphi(\xi_t) = \sum_{j=1}^{m_t} a_{t,j}\, \varphi(\tilde{\xi}_j) + \varphi_t^{\mathrm{res}}, \quad \text{where } \left\| \varphi_t^{\mathrm{res}} \right\|^2 \leq \nu \qquad (29)$$

The solution a_t can be computed at each time step by updating K̃_t whenever the dictionary is appended with a new path. The partitioned matrix inversion formula [23] should be used here, after updating K̃_t and its inverse as:

$$\widetilde{K}_t = \begin{bmatrix} \widetilde{K}_{t-1} & \tilde{\mathbf{k}}_{t-1}(\xi_t) \\ \tilde{\mathbf{k}}_{t-1}(\xi_t)^{\top} & k_{tt} \end{bmatrix}, \qquad \widetilde{K}_t^{-1} = \frac{1}{\varepsilon_t} \begin{bmatrix} \varepsilon_t \widetilde{K}_{t-1}^{-1} + a_t a_t^{\top} & -a_t \\ -a_t^{\top} & 1 \end{bmatrix} \qquad (30)$$

where a_t = K̃_{t−1}^{-1} k̃_{t−1}(ξ_t). Because a_t is the same as the vector computed in solving (27), there is no need to recompute it. Defining the matrices³ [A_t]_{i,j} = a_{i,j}, Φ_t = [φ(ξ_1), ..., φ(ξ_t)], and Φ_t^res = [φ_1^res, ..., φ_t^res], Eq. (29) can be rewritten for all time steps up to t as

$$\Phi_t = \widetilde{\Phi}_t A_t^{\top} + \Phi_t^{\mathrm{res}} \qquad (31)$$

By pre-multiplying (31) with its transpose, the full t × t kernel matrix K_t = Φ_t^⊤Φ_t can be decomposed into two matrices:

$$K_t = A_t \widetilde{K}_t A_t^{\top} + K_t^{\mathrm{res}} \qquad (32)$$

where K̃_t = Φ̃_t^⊤Φ̃_t. The matrix A_t K̃_t A_t^⊤ is a rank-m_t approximation of K_t. It can be shown that the norm of the residual matrix K_t^res is bounded above by a factor linear in ν. Consequently, the following approximation is obtained:

$$K_t \approx A_t \widetilde{K}_t A_t^{\top}, \qquad \mathbf{k}_t(x) \approx A_t \tilde{\mathbf{k}}_t(x) \qquad (33)$$

The notation ≈ here implies that the difference between the two sides of the equation is O(ν). Note that the computational cost per time step of this sparsification algorithm is O(m_t²), which, assuming m_t does not depend asymptotically on t, is independent of time, allowing the algorithm to operate online. Our version of the NBPG algorithm obtains its computational leverage from the low-rank approximation provided by this sparsification algorithm, as presented in the next section.

4.2. NBPG with online sparsification

We are now ready to combine the ideas described in the previous section with the NBPG method. Substituting approximation (33) into the exact GP solutions (the gradient in Models 1 and 2, and the second derivatives of the average reward), we obtain

Model 1:

$$E[\nabla\eta(\theta)\,|\,\mathcal{D}_M] = F_M \widetilde{Q}_M \mathbf{z}_M, \qquad \mathrm{Cov}[\nabla\eta(\theta)\,|\,\mathcal{D}_M] = \left(z_0 - \mathbf{z}_M^{\top} \widetilde{Q}_M \mathbf{z}_M\right) I \qquad (34)$$

Model 2:

$$E[\nabla\eta(\theta)\,|\,\mathcal{D}_M] = Z_M \widetilde{Q}_M \mathbf{f}_M, \qquad \mathrm{Cov}[\nabla\eta(\theta)\,|\,\mathcal{D}_M] = Z_0 - Z_M \widetilde{Q}_M Z_M^{\top} \qquad (35)$$

³ Due to the sequential nature of the algorithm, [A_t]_{i,j} = 0 for j > m_i.


The second derivatives of the average reward ∇²η(θ) are as follows:

$$E[\nabla^2\eta(\theta)\,|\,\mathcal{D}_M] = F'_M \widetilde{B}_M \boldsymbol{\omega}_M, \qquad \mathrm{Cov}[\nabla^2\eta(\theta)\,|\,\mathcal{D}_M] = \left(\omega_0 - \boldsymbol{\omega}_M^{\top} \widetilde{B}_M \boldsymbol{\omega}_M\right) I \qquad (36)$$

where we define⁴

$$\widetilde{Q}_M = \left(A_M \widetilde{K}_M A_M^{\top} + \sigma^2 I\right)^{-1}, \qquad \widetilde{B}_M = \left(A'_M \widetilde{C}_M A_M'^{\top} + \beta^2 I\right)^{-1} \qquad (37)$$

To reduce the computational complexity of computing the inverse matrices Q̃_M and B̃_M, we now derive recursive update formulas for them. (We only derive the recursive update formula for Q̃_M; the formula for B̃_M can be derived similarly.) At each time step i, when we sample a path ξ_i, one of two cases occurs: either D_i = D_{i−1} or D_i = D_{i−1} ∪ {ξ_i}.

Case 1: D_i = D_{i−1}. Hence K̃_i = K̃_{i−1} and

$$A_i = \begin{bmatrix} A_{i-1} \\ a_i^{\top} \end{bmatrix}$$

Then, we have

$$\widetilde{Q}_i^{-1} = \begin{bmatrix} A_{i-1} \\ a_i^{\top} \end{bmatrix} \widetilde{K}_{i-1} \begin{bmatrix} A_{i-1}^{\top} & a_i \end{bmatrix} + \begin{bmatrix} \sigma^2 I & 0 \\ 0 & \sigma^2 \end{bmatrix} = \begin{bmatrix} \widetilde{Q}_{i-1}^{-1} & A_{i-1}\widetilde{K}_{i-1} a_i \\ a_i^{\top} \widetilde{K}_{i-1} A_{i-1}^{\top} & a_i^{\top} \widetilde{K}_{i-1} a_i + \sigma^2 \end{bmatrix} = \begin{bmatrix} \widetilde{Q}_{i-1}^{-1} & A_{i-1}\widetilde{K}_{i-1} a_i \\ (A_{i-1}\widetilde{K}_{i-1} a_i)^{\top} & a_i^{\top} \widetilde{K}_{i-1} a_i + \sigma^2 \end{bmatrix}$$

(because K̃_{i−1} is a symmetric kernel matrix). We define Δk̃_{i−1} = K̃_{i−1} a_i. Thus, we obtain

$$\widetilde{Q}_i^{-1} = \begin{bmatrix} \widetilde{Q}_{i-1}^{-1} & A_{i-1}\Delta\tilde{\mathbf{k}}_{i-1} \\ (A_{i-1}\Delta\tilde{\mathbf{k}}_{i-1})^{\top} & a_i^{\top} \Delta\tilde{\mathbf{k}}_{i-1} + \sigma^2 \end{bmatrix}$$

We obtain Q̃_i by using the partitioned matrix inversion formula:

$$\widetilde{Q}_i = \frac{1}{s_i} \begin{bmatrix} s_i \widetilde{Q}_{i-1} + \mathbf{g}_i \mathbf{g}_i^{\top} & -\mathbf{g}_i \\ -\mathbf{g}_i^{\top} & 1 \end{bmatrix},$$

where g_i = Q̃_{i−1} A_{i−1} Δk̃_{i−1} and s_i = σ² + a_i^⊤ Δk̃_{i−1} − Δk̃_{i−1}^⊤ A_{i−1}^⊤ Q̃_{i−1} A_{i−1} Δk̃_{i−1}. Let us also define

$$\tilde{\mathbf{c}}_i = A_{i-1}^{\top} \mathbf{g}_i - a_i$$

Then,

$$s_i = \sigma^2 - \tilde{\mathbf{c}}_i^{\top} \Delta\tilde{\mathbf{k}}_{i-1}$$

Case 2: D_i = D_{i−1} ∪ {ξ_i}. Hence K̃_i is given by (30), which we repeat here:

$$\widetilde{K}_i = \begin{bmatrix} \widetilde{K}_{i-1} & \tilde{\mathbf{k}}_{i-1}(\xi_i) \\ \tilde{\mathbf{k}}_{i-1}(\xi_i)^{\top} & k(\xi_i,\xi_i) \end{bmatrix}$$

Furthermore, a_i = (0, ..., 0, 1)^⊤ because φ(ξ_i) is exactly representable by itself. Therefore

$$A_i = \begin{bmatrix} A_{i-1} & 0 \\ 0^{\top} & 1 \end{bmatrix}$$

The recursion for Q̃_i^{-1} is:

$$\widetilde{Q}_i^{-1} = \begin{bmatrix} A_{i-1} & 0 \\ 0^{\top} & 1 \end{bmatrix} \begin{bmatrix} \widetilde{K}_{i-1} & \tilde{\mathbf{k}}_{i-1}(\xi_i) \\ \tilde{\mathbf{k}}_{i-1}(\xi_i)^{\top} & k(\xi_i,\xi_i) \end{bmatrix} \begin{bmatrix} A_{i-1}^{\top} & 0 \\ 0^{\top} & 1 \end{bmatrix} + \begin{bmatrix} \sigma^2 I & 0 \\ 0^{\top} & \sigma^2 \end{bmatrix} = \begin{bmatrix} \widetilde{Q}_{i-1}^{-1} & A_{i-1}\tilde{\mathbf{k}}_{i-1}(\xi_i) \\ (A_{i-1}\tilde{\mathbf{k}}_{i-1}(\xi_i))^{\top} & k(\xi_i,\xi_i) + \sigma^2 \end{bmatrix}$$

⁴ We define the matrices A_M and A′_M to differentiate the sparsification of the kernel matrix K_M from that of C_M.


Fig. 1. The mountain car problem.


Denoting s_i = σ² + k(ξ_i, ξ_i) − k̃_{i−1}(ξ_i)^⊤ A_{i−1}^⊤ Q̃_{i−1} A_{i−1} k̃_{i−1}(ξ_i) and g_i = Q̃_{i−1} A_{i−1} k̃_{i−1}(ξ_i), and again using the partitioned matrix inversion formula, we get

$$\widetilde{Q}_i = \frac{1}{s_i} \begin{bmatrix} s_i \widetilde{Q}_{i-1} + \mathbf{g}_i \mathbf{g}_i^{\top} & -\mathbf{g}_i \\ -\mathbf{g}_i^{\top} & 1 \end{bmatrix}$$

Note that the recursive update formulas evaluate Q̃_M and B̃_M in O(M m_M²) time.
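For illustration, the Case 2 update can be implemented as a single function that grows Q̃ by one row and column via the partitioned inversion formula instead of re-inverting from scratch; the arguments mirror the quantities defined above.

```python
import numpy as np

def grow_inverse(Q_prev, A_prev, k_vec, k_ii, sigma2):
    """Grow Q~ = (A K~ A^T + sigma^2 I)^{-1} by one path (Case 2), using the
    partitioned matrix inversion formula.

    Q_prev : previous Q~_{i-1};  A_prev : previous A_{i-1}
    k_vec  : k~_{i-1}(xi_i) over the dictionary paths;  k_ii : k(xi_i, xi_i)
    """
    g = Q_prev @ (A_prev @ k_vec)                  # g_i
    s = sigma2 + k_ii - (A_prev @ k_vec) @ g       # s_i
    t = Q_prev.shape[0]
    Q_new = np.empty((t + 1, t + 1))
    Q_new[:t, :t] = Q_prev + np.outer(g, g) / s
    Q_new[:t, t] = Q_new[t, :t] = -g / s
    Q_new[t, t] = 1.0 / s
    return Q_new
```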

5. Experimental results

In order to assess the learning performance of NBPG, we present two problems in this section: a mountain car problem and a linear quadratic regulation problem. We compare the NBPG algorithm with online sparsification against the MC policy gradient algorithm, the BPG algorithm, and the Bayesian policy natural gradient (BPNG) algorithm.

5.1. Mountain car problem

The mountain car problem depicted in Fig. 1 is defined in [25] as follows: starting from the bottom of a valley, a car has to gain enough momentum to reach the top of a mountain. The same dynamics as described in the mountain car software have been employed. The objective is to minimize the number of time steps needed to reach the goal. The reward in this problem is −1 for all time steps until the car moves past its goal position at the top of the mountain, at which point it receives a reward of 100 and the episode ends. There are three possible actions a: full throttle forward (+1), full throttle reverse (−1), and zero throttle (0). The continuous state space x = [s_t, ṡ_t]^⊤ is discretized (45 × 45 grid), where s_t and ṡ_t are respectively the position and velocity of the car. At the beginning of each episode, the car starts at the default initial state at the bottom of the valley. We define the policy function as a softmax function:

$$\mu(a_i|x) = p(a = a_i|x) = \frac{\exp(x^{\top} \theta_i)}{\sum_{j=1}^{3} \exp(x^{\top} \theta_j)} \qquad (38)$$

where θ_i is the 2 × 1 weight vector for each action, i ∈ {1, 2, 3}, and these weight vectors are collected into the problem's parameter vector θ = [θ_1^⊤, θ_2^⊤, θ_3^⊤]. We compare the performance of the NBPG algorithm with the MC policy gradient algorithm given in Eq. (7) and the BPG and BPNG algorithms given in [10]. The performance is measured in terms of the number of steps per episode. We plot the mean performance over 100 runs of 1000 episodes, with the sample size set to 10. BPG and NBPG are implemented with sparsification [7,6]. The learning rates for BPG and BPNG are defined as α_i = 0.05 × 20/(20 + i). The maximum number of steps per episode is set to 1000; if the car cannot reach the goal within 1000 steps, the episode is reset.
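For reference, the softmax policy of Eq. (38) and the gradient of its log-likelihood, which is what the score function u(ξ) accumulates, can be sketched as follows; θ is stored as a 3 × 2 array, one row of weights per action.

```python
import numpy as np

def softmax_policy(x, theta):
    """Softmax policy of Eq. (38); theta is a 3 x 2 array, one weight row per action."""
    logits = theta @ x                       # x = [position, velocity]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_policy(x, a, theta):
    """Gradient of log mu(a|x; theta) w.r.t. the flattened 6-dimensional parameter vector."""
    p = softmax_policy(x, theta)
    grad = -np.outer(p, x)                   # -p_j * x for every action j
    grad[a] += x                             # +x for the chosen action
    return grad.reshape(-1)
```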

Table 1
Convergence speed of the Monte-Carlo policy gradient (MC), Bayesian policy gradient (BPG), Bayesian policy natural gradient (BPNG), and Newton Bayesian policy gradient (NBPG) methods. The results are the mean and standard deviation of 100 independent runs. The unit is CPU time in seconds (s).

         Sample size M = 5    Sample size M = 10
MC       55.1 ± 7.9           143.3 ± 11.5
BPG      91.7 ± 12.6          166.8 ± 15.2
BPNG     140.4 ± 12.2         214.2 ± 22.1
NBPG     77.3 ± 9.8           287.4 ± 17.6


Fig. 2. Convergence speed in terms of the number of parameter updates of the NBPG, BPG, BPNG, and MC algorithms. The results are the mean and standard deviation of 100 independent runs, with sample size M = 10. (x-axis: episodes; y-axis: number of steps.)

Fig. 3. Convergence speed of the Bayesian policy gradient (BPG), Bayesian policy natural gradient (BPNG), and Newton Bayesian policy gradient (NBPG) methods. The results are the mean and standard deviation of 100 independent runs. The unit is CPU time in seconds (s). (x-axis: sample sizes; y-axis: CPU time in seconds.)


Table 1 shows the convergence speed of the MC, BPG, BPNG, and NBPG algorithms with sample sizes M = 5 and 10. The NBPG algorithm is robust with a small sample size compared to the BPG and BPNG algorithms. Note that, with a larger sample size, the NBPG algorithm requires more CPU time due to the additional computation of the Hessian matrix distribution. However, Fig. 2 shows that the NBPG algorithm converges faster in terms of the number of parameter updates. This means that it requires fewer interactions with the environment than the MC, BPG, and BPNG algorithms.


5.2. Linear quadratic Gaussian regulators

In control theory, the linear quadratic Gaussian (LQG) regulator problem is a fundamental optimal control problem. The system behaves linearly, perturbed by additive white Gaussian noise, and the controller must address quadratic costs and incomplete state information (the feedback does not include all the measured and observable state variables). Let the n-dimensional state space be X = R^n, the action space be A = R^d, and the observation space be Y = R^m. The linear quadratic system is given as follows:

$$x_{t+1} = A x_t + B a_t + C w_t, \qquad y_t = D x_t + E v_t \qquad (39)$$

where A ∈ R^{n×n}, B ∈ R^{n×d}, C ∈ R^{n×n_w}, D ∈ R^{m×n}, E ∈ R^{m×n_v}, and w_t and v_t are noise terms of dimensions n_w and n_v, respectively, distributed according to independent Gaussian distributions: w_t ~ N(0, Σ_{n_w}), v_t ~ N(0, Σ_{n_v}). The rewards are given by

$$r_t = x_t^{\top} Q_1 x_t + a_t^{\top} Q_2 a_t$$

where Q_1 and Q_2 are symmetric positive semi-definite matrices. If D ≠ I and E ≠ 0, the system is a partially observable MDP (POMDP) in which only y_t is observable.

In this experiment, we used the following settings: n = 3, m = 2, d = 3, n_w = n_v = 3, C = I(3 × 3), E = I(2 × 2), and

$$A = \begin{bmatrix} -0.6606 & -0.3328 & -0.2059 \\ -0.7333 & -0.6667 & -0.2693 \\ 0.3466 & 0.3286 & 0.0273 \end{bmatrix}, \quad B = \begin{bmatrix} -0.7188 & -0.1590 & 0.2423 \\ -0.1590 & -0.2599 & 0.2775 \\ 0.2423 & 0.2775 & -0.4286 \end{bmatrix}, \quad D = \begin{bmatrix} 0.5 & 0.6 & 0.7 \\ 0.1 & 0.2 & 0.3 \end{bmatrix},$$

$$Q_1 = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix}, \quad Q_2 = \begin{bmatrix} 8.6667 & 4.1333 & 3.4667 \\ -3.0833 & -0.0167 & -1.4333 \\ -5.8750 & -3.2750 & -1.6500 \end{bmatrix}, \quad \Sigma_{n_w} = 0.001\, I_3, \quad \Sigma_{n_v} = 0.001\, I_2$$
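A single transition of this system, Eq. (39) together with the reward r_t, can be sketched as follows; the function takes the matrices above as arguments rather than hard-coding them.

```python
import numpy as np

def lqg_step(x, a, A, B, C, D, E, cov_w, cov_v, Q1, Q2, rng):
    """One transition of the LQG system of Eq. (39) and its quadratic reward r_t."""
    w = rng.multivariate_normal(np.zeros(cov_w.shape[0]), cov_w)
    v = rng.multivariate_normal(np.zeros(cov_v.shape[0]), cov_v)
    x_next = A @ x + B @ a + C @ w            # state transition
    y = D @ x + E @ v                          # partial, noisy observation
    r = x @ Q1 @ x + a @ Q2 @ a                # quadratic reward r_t
    return x_next, y, r
```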

Fig. 4. LQR problem: a comparison of the average expected returns of the NBPG, BPG, and BPNG algorithms, for sample sizes 5, 10, 20, and 40. The results are the mean and standard deviation of 100 independent runs. (Four panels, one per sample size; x-axis: number of updates; y-axis: average expected return.)


We parameterize the policy as a_t ~ μ(·|y_t; θ) = N(Ky_t, Σ^{-1}), where θ = (K, Σ)^⊤, with K ∈ R^{3×2} and Σ ∈ R^{3×3}. The number M of sample paths is set to M = 5, 10, 20, 40, and 60. First, we compare the convergence times (in CPU time (s)) of the NBPG, BPG, and BPNG algorithms. We assume that an algorithm has converged when ‖η_i − η_{i−1}‖ < ε = 10^{-4} for five consecutive iterations; otherwise, the iteration is stopped after 10 h. We use sparsification for BPG, BPNG, and NBPG. The authors in [10] have discussed the efficiency of online sparsification, so we only mention the dimension of the feature space for the kernel used in the NBPG, BPG, and BPNG algorithms: for all three algorithms, we set the dimension of the feature space to 8 (for M > 5). The learning rates for BPG and BPNG are defined as α_i = 0.05 × 20/(20 + i).

The comparison results in Fig. 3 show that BPNG is clearly slower than NBPG. When the sample size is small, NBPG is also faster than BPG. This is due to the trade-off between the convergence speed in terms of the number of parameter updates and the computational time of NBPG in each iteration. When the sample size is small enough, the computation time of NBPG is not significantly larger than that of BPG; then, because NBPG requires fewer iterations to converge (as seen in Fig. 4), its total CPU time is less than that of BPG. As seen in Fig. 3, when the sample size is M = 5, 10, or 20, the CPU time of NBPG is less than that of BPG. However, when the sample size M is large (M = 40, 60), the large computational time of NBPG in each iteration makes it converge more slowly than BPG in terms of total CPU time.

Fig. 4 shows the number of parameter updates of the NBPG, BPG, and BPNG algorithms. The NBPG algorithm, with its efficient learning rate schedule, requires the fewest updates. This can be explained by the use of the Hessian matrix distribution to choose a suitable learning rate schedule, which provides dual certainty in the parameter updating step of the NBPG algorithm: because NBPG computes both the gradient and Hessian distributions from the sample paths, it achieves dual certainty over the sample paths. Meanwhile, because BPG only computes the gradient distribution and its learning rate is set independently of the sample paths, it has less certainty. Thus, the NBPG algorithm is helpful with small sample path sizes, which may provide little certainty. Even though NBPG has a large CPU time for large sample sizes, as shown in Fig. 3, it still requires fewer updates. This means that it needs less interaction with the environment than the BPG and BPNG algorithms, which is useful in problem domains where the agents can only have limited interaction with the environment.

6. Conclusion

In this paper, we have proposed an improvement over the Bayesian policy gradient of [10], which is likewise based on the Bayesian view. Among Bayesian approaches to policy gradients, there are two promising recent algorithms: the Bayesian policy gradient (BPG) and the Bayesian policy natural gradient (BPNG). Both use traditional gradient updates, so their convergence speed relies heavily on the learning rate schedule. Newton-type gradient algorithms obtain an efficient learning rate schedule from the Hessian matrix. This motivated us to apply the Bayesian view to obtain an efficient Bayesian learning rate schedule, and to propose the Newton Bayesian policy gradient (NBPG) algorithm.

Our contributions can be summarized as follows. We computed the posterior of the second derivatives of the average reward (assuming they exist), and used this posterior as a Hessian distribution to construct a suitable learning rate schedule. With a proper kernel choice, we proved that the computational complexity of the GP Hessian distribution estimate is equal to that of the GP gradient estimate. Moreover, we proved that the mean of the Hessian matrix distribution is symmetric; this result is important because of the constraints on the Hessian matrix. The experimental results showed that our proposed method performs much better than BPG and BPNG in terms of convergence speed, meaning that the NBPG algorithm requires less interaction with the environment than the MC, BPG, and BPNG algorithms.

The disadvantage of our approach is the price paid to estimate the Hessian. To reduce the computational complexity of estimating the Hessian distribution, we applied the online sparsification algorithm proposed in [6], but we still need to consider the trade-off between the additional Hessian estimation and the convergence speed when compared to approaches that only estimate the gradient. We believe that these improvements can also be extended to Bayesian actor-critic algorithms [11]. These questions will be the subject of our future research.

Acknowledgments

The authors thank the anonymous reviewers. Their critical and detailed comments helped to substantially improve the paper.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (Grant No. 2010-0012609).

References

[1] J.A. Bagnell, J.G. Schneider, Covariant policy search, in: IJCAI, 2003, pp. 1019–1024.
[2] J. Baxter, P.L. Bartlett, Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research (JAIR) 15 (2001) 319–350.
[3] R.E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.
[4] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[5] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, M. Lee, Natural actor-critic algorithms, Automatica 45 (11) (2009) 2471–2482.
[6] Y. Engel, Algorithms and Representations for Reinforcement Learning, PhD Thesis, The Hebrew University of Jerusalem, Israel, 2005.
[7] Y. Engel, S. Mannor, R. Meir, Reinforcement learning with Gaussian processes, in: ICML, 2005, pp. 201–208.
[8] D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement learning, Journal of Machine Learning Research 6 (2005) 503–556.
[9] M. Ghavamzadeh, Y. Engel, Bayesian policy gradient, in: Workshop on Kernel Machines and Reinforcement Learning (KRL), 23rd International Conference on Machine Learning (ICML-2006), 2006.
[10] M. Ghavamzadeh, Y. Engel, Bayesian policy gradient algorithms, in: NIPS, 2006, pp. 457–464.
[11] M. Ghavamzadeh, Y. Engel, Bayesian actor-critic algorithms, in: ICML, 2007, pp. 297–304.
[12] T. Jaakkola, D. Haussler, Exploiting generative models in discriminative classifiers, in: NIPS, 1998, pp. 487–493.
[13] S. Kakade, A natural policy gradient, in: NIPS, 2001, pp. 1531–1538.
[14] J. O, J. Lee, J.W. Lee, B.-T. Zhang, Adaptive stock trading with dynamic asset allocation using reinforcement learning, Information Sciences 176 (15) (2006) 2121–2147.
[15] A. O'Hagan, Monte Carlo is fundamentally unsound, The Statistician 36 (1987) 247–249.
[16] A. O'Hagan, Bayes–Hermite quadrature, Journal of Statistical Planning and Inference 29 (1991) 245–260.
[17] J. Peters, S. Vijayakumar, S. Schaal, Natural actor-critic, in: ECML, 2005, pp. 280–291.
[18] P. Poupart, N. Vlassis, Model-based Bayesian reinforcement learning in partially observable domains, in: International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2008.
[19] P. Preux, S. Delepoulle, J.-C. Darcheville, A generic architecture for adaptive agents based on reinforcement learning, Information Sciences 161 (1–2) (2004) 37–55.
[20] C.E. Rasmussen, Z. Ghahramani, Bayesian Monte Carlo, in: NIPS, 2002, pp. 489–496.
[21] J. Reisinger, P. Stone, R. Miikkulainen, Online kernel selection for Bayesian reinforcement learning, in: ICML, 2008, pp. 816–823.
[22] M. Riedmiller, Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method, in: ECML, 2005, pp. 317–328.
[23] L.L. Scharf, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis, Addison-Wesley, New York, 1991.
[24] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[25] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[26] R.S. Sutton, D.A. McAllester, S.P. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: NIPS, 1999, pp. 1057–1063.
[27] X. Wang, Y. Cheng, J.-Q. Yi, A fuzzy actor-critic reinforcement learning network, Information Sciences 177 (18) (2007) 3764–3781.
[28] H. Zhang, J. Lu, Adaptive evolutionary programming based on reinforcement learning, Information Sciences 178 (4) (2008) 971–984.
[29] C. Zhou, D. Ruan, Fuzzy control rules extraction from perception-based information using computing with words, Information Sciences 142 (1–4) (2002) 275–290.