
An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

Pan Xu
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095

Felicia Gao
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095

Quanquan Gu
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095

Abstract

We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by Papini et al. (2018) for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an ε-approximate stationary point of the performance function within O(1/ε^{5/3}) trajectories. This sample complexity improves upon the best known result O(1/ε^2) by a factor of O(1/ε^{1/3}). At the core of our analysis are (i) a tighter upper bound for the variance of the importance sampling weights, where we prove that the variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters, which allows us to significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically demonstrate the effectiveness of our theoretical claims about batch sizes on reinforcement learning benchmark tasks.

1 INTRODUCTION

Reinforcement learning (RL) is a sequential decision process that learns the best actions to solve a task by repeated, direct interaction with the environment (Sutton & Barto, 2018). In detail, an RL agent starts at one state and sequentially takes an action according to a certain policy, observes the resulting reward signal, and lastly, evaluates and improves its policy before it transits to the next state. A policy tells the agent which action to take at each state. Therefore, a good policy is critically important in an RL problem. Recently, policy gradient methods (Sutton et al., 2000), which directly optimize the performance function J(θ) (formally defined later) over a class of policies parameterized by some model parameter θ, have achieved impressive successes in many challenging deep reinforcement learning applications (Kakade, 2002; Schulman et al., 2015). In particular, policy gradient methods seek to find the best policy π_θ that maximizes the expected return of the agent. They are generally more effective in high-dimensional action spaces and enjoy the additional flexibility of stochasticity, compared with deterministic value-function based methods such as Q-learning and SARSA (Sutton et al., 2000).

In many RL applications, the performance function J(θ) is non-concave and the goal is to find a stationary point θ* such that ‖∇J(θ*)‖₂ = 0 using gradient based algorithms. Due to the special structure of reinforcement learning, the objective function J(θ) is defined through cumulative rewards that arrive sequentially, which makes it impossible to compute the full gradient directly. Therefore, most algorithms such as REINFORCE (Williams, 1992) and GPOMDP (Baxter & Bartlett, 2001) need to actively sample trajectories to approximate the gradient ∇J(θ). This resembles stochastic gradient (SG) based algorithms in stochastic optimization (Robbins & Monro, 1951), which require O(1/ε²) trajectories to obtain E[‖∇J(θ)‖₂²] ≤ ε. Due to the large variance of the stochastic gradient, the convergence of SG based methods can be rather sample inefficient when the required precision ε is very small.

To mitigate the negative effect of large variance on the convergence of SG methods, a large class of stochastic variance-reduced gradient (SVRG) algorithms were proposed for both convex (Johnson & Zhang, 2013; Xiao & Zhang, 2014; Harikandeh et al., 2015; Nguyen et al., 2017) and nonconvex (Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Lei et al., 2017; Li & Li, 2018; Fang et al., 2018; Zhou et al., 2018) objective functions. SVRG has been shown to achieve faster convergence in terms of the total number of stochastic gradient evaluations. These variance-reduced algorithms have since been applied to reinforcement learning in policy evaluation (Du et al., 2017), trust-region policy optimization (Xu et al., 2017) and policy gradient (Papini et al., 2018). In particular, Papini et al. (2018) recently proposed a stochastic variance-reduced policy gradient (SVRPG) algorithm that marries SVRG to policy gradient for reinforcement learning. The algorithm saves on sample computation and improves the performance of vanilla SG based policy gradient methods. However, from a theoretical perspective, the authors only showed that SVRPG converges to a stationary point satisfying E[‖∇J(θ)‖₂²] ≤ ε with O(1/ε²) stochastic gradient evaluations (trajectory samples), which in fact only matches the sample complexity of SG based policy gradient methods. This leaves open the important question:

Can SVRPG be provably better than SG based policy gradient methods?

We answer this question affirmatively and fill this gap between theory and practice in this paper. Specifically, we provide a sharp convergence analysis of SVRPG and show that it only requires O(1/ε^{5/3}) stochastic gradient evaluations in order to converge to a stationary point θ of the performance function, i.e., E[‖∇J(θ)‖₂²] ≤ ε. This sample complexity of SVRPG is strictly lower than that of SG based policy gradient methods by a factor of O(1/ε^{1/3}). By the same argument, our result is also better than the sample complexity provided in Papini et al. (2018) by a factor of O(1/ε^{1/3}). The key ideas in our theoretical analysis are twofold: (i) we prove a key lemma that controls the variance of the importance weights introduced in SVRPG to deal with the non-stationarity of the sample distribution in reinforcement learning, which helps offset the additional variance introduced by importance sampling; and (ii) we provide a refined proof of the convergence of SVRPG and carefully investigate the trade-off between the convergence rate and the computational efficiency. This enables us to choose a smaller batch size to reduce the sample complexity while maintaining the convergence rate. In addition, we demonstrate the advantage of SVRPG over GPOMDP and validate our theoretical results on the Cartpole and Mountain Car problems.

Notation In this paper, scalars, vectors and matrices are denoted by lower case, lower case bold face, and upper case bold face letters respectively. We use ‖v‖₂ and ‖A‖₂ to denote the 2-norm of a vector v ∈ ℝ^d and the spectral norm of a matrix A ∈ ℝ^{d×d} respectively. We write a_n = O(b_n) if a_n ≤ C b_n for some constant C > 0. For α > 0, the Rényi divergence (Renyi et al., 1961) between distributions P and Q is
\[
D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log_2 \int_x P(x) \left( \frac{P(x)}{Q(x)} \right)^{\alpha - 1} \mathrm{d}x,
\]
which is non-negative for all α > 0. The exponentiated Rényi divergence is defined as d_α(P‖Q) = 2^{D_α(P‖Q)}.

2 ADDITIONAL RELATED WORK

In this section, we review additional relevant work that is not discussed in the introduction.

Deep RL models (Mnih et al., 2015) have been popular in solving complex problems such as robot locomotion, playing grandmaster skill-level Go, and safe autonomous driving (Levine et al., 2015; Silver et al., 2016; Shalev-Shwartz et al., 2016). Policy gradient (Sutton et al., 2000) is one of the most effective algorithms, where the policy is usually approximated by linear functions or nonlinear functions such as neural networks, and can be either stochastic or deterministic (Silver et al., 2014). One major drawback of traditional policy gradient methods such as REINFORCE (Williams, 1992), GPOMDP (Baxter & Bartlett, 2001) and TRPO (Schulman et al., 2015) is the large variance in the estimation of the gradient (Sehnke et al., 2010), which leads to poor convergence performance in practice. One way of reducing the variance in gradient estimation is to introduce various baselines as control variates (Weaver & Tao, 2001; Greensmith et al., 2004; Peters & Schaal, 2008; Gu et al., 2017; Tucker et al., 2018). Pirotta et al. (2013) proposed to use an adaptive step size to offset the effect of the variance of the policy. Papini et al. (2017) further studied the adaptive batch size used to approximate the gradient and proposed to jointly optimize the adaptive step size and batch size. Reducing the variance of policy gradient by importance sampling has also been extensively studied (Liu, 2008; Cortes et al., 2010). Metelli et al. (2018) reduced the variance caused by importance sampling by deriving a surrogate objective with a Rényi penalty.

3 PRELIMINARIES

In this section, we introduce the preliminaries of reinforcement learning and policy gradient.

Markov Decision Process: We model the reinforcement learning task as a discrete-time Markov Decision Process (MDP): M = {S, A, P, R, γ, ρ}, where S is the state space and A is the action space. P(s'|s, a) is the probability that the agent transits to state s' when taking action a in state s. The reward function R(s, a): S × A → [0, R] gives the reward after the agent takes action a at state s, for some constant R > 0, and γ ∈ (0, 1) is the discount factor. ρ is the initial state distribution. The probability that the agent chooses action a at state s is modeled by its policy π(a|s). Following any stationary policy, the agent can observe and collect a trajectory τ = {s_0, a_0, s_1, a_1, ..., s_{H-1}, a_{H-1}, s_H}, which is a sequence of state-action pairs, where H is the trajectory horizon. Along with the state-action pairs, the agent also observes a cumulative discounted reward
\[
R(\tau) = \sum_{h=0}^{H-1} \gamma^h R(s_h, a_h). \tag{3.1}
\]

Policy Gradients: Suppose that the policy π is parameterized by an unknown parameter θ ∈ ℝ^d and denoted by π_θ. We denote the distribution induced by policy π_θ as p(τ|π_θ), also written as p(τ|θ) for simplicity. Then
\[
p(\tau|\theta) = \rho(s_0) \prod_{h=0}^{H-1} \pi_{\theta}(a_h|s_h) \, P(s_{h+1}|s_h, a_h). \tag{3.2}
\]

To measure the performance of a given policy π_θ, we define the expected total reward under this policy as J(θ) = E_{τ∼p(·|θ)}[R(τ) | M]. Taking the gradient of J(θ) with respect to θ gives
\[
\begin{aligned}
\nabla_{\theta} J(\theta) &= \int_{\tau} R(\tau) \nabla_{\theta} p(\tau|\theta) \, \mathrm{d}\tau \\
&= \int_{\tau} R(\tau) \frac{\nabla_{\theta} p(\tau|\theta)}{p(\tau|\theta)} \, p(\tau|\theta) \, \mathrm{d}\tau \\
&= \mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\nabla_{\theta} \log p(\tau|\theta) \, R(\tau) \,\big|\, M\big].
\end{aligned} \tag{3.3}
\]

We can update the policy by running gradient ascent based algorithms on θ. However, it is impossible to calculate the full gradient in reinforcement learning. In particular, policy gradient samples a batch of trajectories {τ_i}_{i=1}^N to approximate the full gradient in (3.3). At the k-th iteration, the policy is then updated by
\[
\theta_{k+1} = \theta_k + \eta \nabla_N J(\theta_k), \tag{3.4}
\]
where η > 0 is the step size and the estimated gradient ∇_N J(θ_k) is an approximation of (3.3) based on the trajectories {τ_i}_{i=1}^N, defined as follows:
\[
\nabla_N J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log p(\tau_i|\theta) \, R(\tau_i).
\]

According to (3.2), ∇_θ log p(τ_i|θ) does not depend on the transition probability P. Therefore, combining this with (3.1) yields
\[
\nabla_N J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \underbrace{\bigg[\sum_{h=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_h^i|s_h^i)\bigg] \bigg[\sum_{h=0}^{H-1} \gamma^h R(s_h^i, a_h^i)\bigg]}_{g(\tau_i|\theta)},
\]
where τ_i = {s_0^i, a_0^i, s_1^i, a_1^i, ..., s_{H-1}^i, a_{H-1}^i, s_H^i} for all i = 1, ..., N are sampled from policy π_θ, and g(τ_i|θ) is an unbiased gradient estimator based on the sample τ_i. We can then rewrite the gradient estimate in (3.4) as ∇_N J(θ) = 1/N ∑_{i=1}^N g(τ_i|θ). Based on this estimator, we can obtain the most well-known gradient estimators for policy gradient, such as REINFORCE (Williams, 1992) and GPOMDP (Baxter & Bartlett, 2001). In particular, the REINFORCE estimator introduces an additional term b as a constant baseline:
\[
g(\tau_i|\theta) = \bigg[\sum_{h=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_h^i|s_h^i)\bigg] \bigg[\sum_{h=0}^{H-1} \gamma^h R(s_h^i, a_h^i) - b\bigg]. \tag{3.5}
\]
GPOMDP refines the REINFORCE estimator based on the fact that the reward at step h does not depend on actions taken after step h:
\[
g(\tau_i|\theta) = \sum_{h=0}^{H-1} \bigg[\sum_{t=0}^{h} \nabla_{\theta} \log \pi_{\theta}(a_t^i|s_t^i)\bigg] \big(\gamma^h R(s_h^i, a_h^i) - b_h\big). \tag{3.6}
\]
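To make the two estimators concrete, the following is a minimal NumPy sketch of g(τ|θ) for both REINFORCE (3.5) and GPOMDP (3.6). It assumes the caller supplies a function grad_log_pi(theta, s, a) returning ∇_θ log π_θ(a|s) as an array; the trajectory format (states, actions, rewards) and all function names are our own illustration, not from the paper.

```python
import numpy as np

def reinforce_estimator(theta, traj, grad_log_pi, gamma, b=0.0):
    """REINFORCE estimator (3.5): (sum of score functions) * (discounted return - baseline)."""
    states, actions, rewards = traj  # each a list of length H
    H = len(rewards)
    score_sum = sum(grad_log_pi(theta, states[h], actions[h]) for h in range(H))
    discounted_return = sum(gamma ** h * rewards[h] for h in range(H))
    return score_sum * (discounted_return - b)

def gpomdp_estimator(theta, traj, grad_log_pi, gamma, baselines=None):
    """GPOMDP estimator (3.6): each reward is weighted only by scores of actions up to that step."""
    states, actions, rewards = traj
    H = len(rewards)
    if baselines is None:
        baselines = np.zeros(H)
    g = np.zeros_like(grad_log_pi(theta, states[0], actions[0]))
    partial_score = np.zeros_like(g)
    for h in range(H):
        partial_score = partial_score + grad_log_pi(theta, states[h], actions[h])
        g = g + partial_score * (gamma ** h * rewards[h] - baselines[h])
    return g
```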

4 ALGORITHM

In each iteration of the gradient ascent update (3.4), policy gradient methods need to sample a batch of trajectories to estimate the expected gradient. This subsampling introduces a high variance and undermines the convergence speed of the algorithm. Inspired by the success of stochastic variance-reduced gradient (SVRG) techniques in stochastic optimization (Johnson & Zhang, 2013; Reddi et al., 2016; Allen-Zhu & Hazan, 2016), Papini et al. (2018) proposed the stochastic variance-reduced policy gradient (SVRPG) method, which is displayed in Algorithm 1.

SVRPG consists of multiple epochs. At the beginning of the s-th epoch, it treats the current policy as a reference point denoted by θ^s = θ^{s+1}_0. It then computes a gradient estimator μ^s = 1/N ∑_{i=1}^N g(τ_i|θ^s) based on N trajectories {τ_i}_{i=1}^N sampled from the current policy, where g(τ_i|θ^s) is the REINFORCE or GPOMDP estimator. At the t-th iteration within the s-th epoch, SVRPG samples B trajectories {τ_j}_{j=1}^B based on the current policy θ^{s+1}_t. It then updates the policy based on the following semi-stochastic gradient
\[
v_t^{s+1} = \frac{1}{B} \sum_{j=1}^{B} g(\tau_j|\theta_t^{s+1}) + \mu^s - \frac{1}{B} \sum_{j=1}^{B} \omega(\tau_j|\theta^s, \theta_t^{s+1}) \, g(\tau_j|\theta^s), \tag{4.1}
\]
where the last two terms serve as a correction to the subsampled gradient estimator, which reduces the variance and improves the convergence rate of Algorithm 1. It is worth noting that the semi-stochastic gradient in (4.1) differs from the one commonly used in SVRG due to the additional term ω(τ|θ^s, θ^{s+1}_t) = p(τ|θ^s)/p(τ|θ^{s+1}_t), which is called the importance sampling weight from p(τ|θ^{s+1}_t) to p(τ|θ^s). This term is important in reinforcement learning due to the non-stationarity of the distribution of τ. Specifically, {τ_i}_{i=1}^N are sampled from θ^s while {τ_j}_{j=1}^B are sampled from θ^{s+1}_t. Nevertheless, we have
\[
\mathbb{E}_{\pi_{\theta_t^{s+1}}}\big[\omega(\cdot|\theta^s, \theta_t^{s+1}) \, g(\cdot|\theta^s)\big] = \mathbb{E}_{\pi_{\theta^s}}\big[g(\cdot|\theta^s)\big],
\]
which ensures that the correction term has zero mean and thus v^{s+1}_t is an unbiased gradient estimator.
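For clarity (this is immediate from (3.2), although not displayed above), note that the unknown transition probabilities and the initial state distribution cancel in this ratio, so the importance weight can be computed from the two policies alone:
\[
\omega(\tau|\theta^s, \theta_t^{s+1}) = \frac{p(\tau|\theta^s)}{p(\tau|\theta_t^{s+1})} = \prod_{h=0}^{H-1} \frac{\pi_{\theta^s}(a_h|s_h)}{\pi_{\theta_t^{s+1}}(a_h|s_h)}.
\]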

Algorithm 1 SVRPG
1: Input: number of epochs S, epoch size m, step size η, batch size N, mini-batch size B, gradient estimator g, initial parameter θ^0_m := θ^0 := θ_0
2: for s = 0, ..., S − 1 do
3:   θ^{s+1}_0 = θ^s = θ^s_m
4:   Sample N trajectories {τ_i} from p(·|θ^s)
5:   μ^s = ∇_N J(θ^s) := 1/N ∑_{i=1}^N g(τ_i|θ^s)
6:   for t = 0, ..., m − 1 do
7:     Sample B trajectories {τ_j} from p(·|θ^{s+1}_t)
8:     v^{s+1}_t = μ^s + 1/B ∑_{j=1}^B ( g(τ_j|θ^{s+1}_t) − ω(τ_j|θ^s, θ^{s+1}_t) g(τ_j|θ^s) )
9:     θ^{s+1}_{t+1} = θ^{s+1}_t + η v^{s+1}_t
10:   end for
11: end for
12: return θ_out, picked uniformly at random from {θ^s_t} for t = 0, ..., m; s = 0, ..., S
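For readers who prefer code, here is a compact Python sketch of the structure of Algorithm 1. It assumes hypothetical helpers sample_trajectories(theta, n), grad_estimator(theta, traj) (REINFORCE or GPOMDP as sketched above), and importance_weight(theta_ref, theta_cur, traj) implementing ω; these names and the flat-parameter representation are ours, not from the paper.

```python
import numpy as np

def svrpg(theta0, S, m, eta, N, B,
          sample_trajectories, grad_estimator, importance_weight, rng=None):
    """Sketch of SVRPG (Algorithm 1): outer snapshots plus importance-weighted inner corrections."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    iterates = []
    for s in range(S):
        theta_ref = theta.copy()                       # reference policy theta^s
        ref_trajs = sample_trajectories(theta_ref, N)  # N trajectories from p(.|theta^s)
        mu = np.mean([grad_estimator(theta_ref, tau) for tau in ref_trajs], axis=0)
        for t in range(m):
            iterates.append(theta.copy())
            trajs = sample_trajectories(theta, B)      # B trajectories from p(.|theta_t^{s+1})
            # semi-stochastic gradient (4.1)
            corr = np.mean(
                [grad_estimator(theta, tau)
                 - importance_weight(theta_ref, theta, tau) * grad_estimator(theta_ref, tau)
                 for tau in trajs], axis=0)
            v = mu + corr
            theta = theta + eta * v                    # gradient ascent step
    iterates.append(theta.copy())
    # output: an iterate picked uniformly at random
    return iterates[rng.integers(len(iterates))]
```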

5 THEORY

In this section, we provide a sharp analysis of Algorithm 1. We first lay down the following common assumption on the log-density of the policy.

Assumption 5.1. Let π_θ(a|s) be the policy of an agent at state s. There exist constants G, M > 0 such that the log-density of the policy satisfies
\[
\|\nabla_{\theta} \log \pi_{\theta}(a|s)\|_2 \le G, \qquad \|\nabla_{\theta}^2 \log \pi_{\theta}(a|s)\|_2 \le M,
\]
for all a ∈ A and s ∈ S.

In many real-world problems, we require the policy parameterization to change smoothly over time rather than drastically. Assumption 5.1 is an important condition in nonconvex optimization (Reddi et al., 2016; Allen-Zhu & Hazan, 2016), as it guarantees the smoothness of the objective function J(θ). Our assumption is slightly different from that in Papini et al. (2018), which assumes that ∂/∂θ_i log π_θ(a|s) and ∂²/(∂θ_i ∂θ_j) log π_θ(a|s) are upper bounded elementwise. It can easily be verified that our Assumption 5.1 is milder than theirs. It should also be noted that, although in reinforcement learning we make the assumptions on the parameterized policy, there is no essential difference from imposing the smoothness assumption on the performance function J(θ) directly. In fact, Assumption 5.1 implies the following proposition on J(θ).

Proposition 5.2. Under Assumption 5.1, J(θ) is L-smooth with L = HR(M + HG²)/(1 − γ). In addition, let g(τ|θ) be the REINFORCE or GPOMDP gradient estimator. Then for all θ₁, θ₂ ∈ ℝ^d, it holds that
\[
\|g(\tau|\theta_1) - g(\tau|\theta_2)\|_2 \le L_g \|\theta_1 - \theta_2\|_2,
\]
and ‖g(τ|θ)‖₂ ≤ C_g for all θ ∈ ℝ^d, where L_g = HM(R + |b|)/(1 − γ), C_g = HG(R + |b|)/(1 − γ), and b is the baseline reward.

The next assumption requires that the variance of the gradient estimator is bounded.

Assumption 5.3. There exists a constant σ such that Var(g(τ|θ)) ≤ σ² for all policies π_θ.

The above assumption is widely made in stochastic optimization. It can be easily verified for Gaussian policies with the REINFORCE estimator (Zhao et al., 2011; Pirotta et al., 2013; Papini et al., 2018).

The following assumption, which is also made in Papini et al. (2018), is needed due to the non-stationarity of the sample distribution.

Assumption 5.4. There is a constant W < ∞ such that, for each policy pair encountered in Algorithm 1,
\[
\mathrm{Var}\big(\omega(\tau|\theta_1, \theta_2)\big) \le W, \quad \forall \theta_1, \theta_2 \in \mathbb{R}^d, \ \tau \sim p(\cdot|\theta_2).
\]

We now present our convergence result for SVRPG.

Theorem 5.5. Suppose Assumptions 5.1, 5.3 and 5.4 hold. In Algorithm 1, suppose the step size satisfies η ≤ 1/(4L), and the epoch length m and mini-batch size B satisfy
\[
\frac{B}{m^2} \ge \frac{3(C_{\omega} C_g^2 + L_g^2)}{2 L^2},
\]
where C_ω = H(2HG² + M)(W + 1), and L_g, C_g and L are defined in Proposition 5.2. Then the output of Algorithm 1 satisfies
\[
\mathbb{E}\big[\|\nabla J(\theta_{\mathrm{out}})\|_2^2\big] \le \frac{8\big(J(\theta^*) - J(\theta_0)\big)}{\eta S m} + \frac{6\sigma^2}{N},
\]
where θ* is the maximizer of J(θ).

Remark 5.6. Let T = Sm be the total number of iterations Algorithm 1 needs to achieve E[‖∇J(θ_out)‖₂²] ≤ ε. The first term on the right-hand side in Theorem 5.5 gives an O(1/T) convergence rate, which matches that of Papini et al. (2018) and the results in nonconvex optimization (Allen-Zhu & Hazan, 2016; Reddi et al., 2016).


Table 1: Comparison of the sample complexity required to achieve ‖∇J(θ)‖₂² ≤ ε.

METHODS                        COMPLEXITY
SG                             O(1/ε²)
SVRPG (Papini et al., 2018)    O(1/ε²)
SVRPG (This paper)             O(1/ε^{5/3})

The second term O(1/N) comes from the full gradient approximation at the beginning of each epoch in Algorithm 1. Compared with the result in Papini et al. (2018), Theorem 5.5 does not have the additional O(1/B) term, which is eliminated by our careful analysis of the variance of the importance weights. This also enables us to choose a much smaller batch size B in the inner loop of Algorithm 1 and leads to a lower sample complexity.

Based on Theorem 5.5, we can calculate the total number of trajectory samples Algorithm 1 requires to achieve ε-precision.

Corollary 5.7. Under the same conditions as in Theorem 5.5, let ε > 0. If we set η = 1/(4L), N = O(1/ε), B = O(1/ε^{2/3}) and m = √B, then Algorithm 1 needs O(1/ε^{5/3}) trajectories in order to achieve E[‖∇J(θ_out)‖₂²] ≤ ε.

Remark 5.8. In Theorem 4.4 of Papini et al. (2018), the authors showed that the sample complexity of SVRPG is O((B + N/m)/ε). In order to make the gradient small enough, they essentially require B, N = O(1/ε), which leads to an O(1/ε²) sample complexity. In sharp contrast, our Corollary 5.7 shows that the SVRPG algorithm only needs O(1/ε^{5/3}) trajectories to achieve ‖∇J(θ)‖₂² ≤ ε, which is strictly lower than the sample complexity proved in Papini et al. (2018). Table 1 presents a direct comparison of the sample complexities of different methods, where SG represents vanilla stochastic gradient based methods such as REINFORCE and GPOMDP. It can be seen from Table 1 that our analysis yields the lowest sample complexity.

6 PROOF OF THEORETICAL RESULTS

In this section, we prove our main theoretical results.

6.1 PROOF OF THE MAIN THEORY

Before we provide the proof of Theorem 5.5, we first lay down the following key lemma, which controls the variance of the importance sampling weights ω(τ|θ^s, θ^{s+1}_t).

Lemma 6.1. Let ω(τ|θ^s, θ^{s+1}_t) = p(τ|θ^s)/p(τ|θ^{s+1}_t). Under Assumptions 5.1 and 5.4, it holds that
\[
\mathrm{Var}\big(\omega(\tau|\theta^s, \theta_t^{s+1})\big) \le C_{\omega} \big\|\theta^s - \theta_t^{s+1}\big\|_2^2,
\]
where C_ω = H(2HG² + M)(W + 1).

Lemma 6.1 shows that the variance of the importance weight is proportional to the squared distance between the behavioral and the target policy parameters. Note that when this distance is large, the bound can be weaker than the constant bound W in Assumption 5.4. However, Lemma 6.1 provides a fine-grained control of the variance when the behavioral and target policies are sufficiently close.

Now we are ready to present the proof of our main theorem, which is inspired by the analysis in Li & Li (2018).

Proof of Theorem 5.5. By Proposition 5.2, J(θ) is L-smooth, which leads to
\[
\begin{aligned}
J(\theta_{t+1}^{s+1}) &\ge J(\theta_t^{s+1}) + \big\langle \nabla J(\theta_t^{s+1}), \theta_{t+1}^{s+1} - \theta_t^{s+1} \big\rangle - \frac{L}{2} \big\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\big\|_2^2 \\
&= J(\theta_t^{s+1}) + \big\langle \nabla J(\theta_t^{s+1}) - v_t^{s+1}, \eta v_t^{s+1} \big\rangle + \eta \big\|v_t^{s+1}\big\|_2^2 - \frac{L}{2} \big\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\big\|_2^2 \\
&\ge J(\theta_t^{s+1}) - \frac{\eta}{2} \big\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\big\|_2^2 + \frac{\eta}{2} \big\|v_t^{s+1}\big\|_2^2 - \frac{L}{2} \big\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\big\|_2^2 \\
&\ge J(\theta_t^{s+1}) - \frac{3\eta}{4} \big\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\big\|_2^2 + \Big(\frac{1}{4\eta} - \frac{L}{2}\Big) \big\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\big\|_2^2 + \frac{\eta}{8} \big\|\nabla J(\theta_t^{s+1})\big\|_2^2,
\end{aligned} \tag{6.1}
\]
where the second inequality holds due to Young's inequality and the last inequality comes from the fact that ‖∇J(θ^{s+1}_t)‖₂² ≤ 2‖v^{s+1}_t‖₂² + 2‖∇J(θ^{s+1}_t) − v^{s+1}_t‖₂².

Let E_{N,B} denote the expectation taken only over the randomness of the sampled trajectories {τ_i}_{i=1}^N and {τ_j}_{j=1}^B. Then
\[
\begin{aligned}
&\mathbb{E}_{N,B} \big\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\big\|_2^2 \\
&= \mathbb{E}_{N,B} \bigg\|\nabla J(\theta_t^{s+1}) - \mu^s + \frac{1}{B} \sum_{j=1}^{B} \Big(\omega(\tau_j|\theta^s, \theta_t^{s+1}) g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\Big)\bigg\|_2^2 \\
&= \mathbb{E}_{N,B} \bigg\|\nabla J(\theta_t^{s+1}) - \nabla J(\theta^s) + \nabla J(\theta^s) - \mu^s + \frac{1}{B} \sum_{j=1}^{B} \Big(\omega(\tau_j|\theta^s, \theta_t^{s+1}) g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\Big)\bigg\|_2^2 \\
&= \mathbb{E}_{N,B} \bigg\|\nabla J(\theta_t^{s+1}) - \nabla J(\theta^s) + \frac{1}{B} \sum_{j=1}^{B} \Big(\omega(\tau_j|\theta^s, \theta_t^{s+1}) g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\Big)\bigg\|_2^2 + \mathbb{E}_{N,B} \bigg\|\nabla J(\theta^s) - \frac{1}{N} \sum_{i=1}^{N} g(\tau_i|\theta^s)\bigg\|_2^2 \quad (6.2) \\
&= \frac{1}{B^2} \sum_{j=1}^{B} \mathbb{E}_{N,B} \Big\|\nabla J(\theta_t^{s+1}) - \nabla J(\theta^s) + \omega(\tau_j|\theta^s, \theta_t^{s+1}) g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\Big\|_2^2 + \frac{1}{N^2} \sum_{i=1}^{N} \mathbb{E}_{N,B} \big\|\nabla J(\theta^s) - g(\tau_i|\theta^s)\big\|_2^2 \quad (6.3) \\
&\le \frac{1}{B^2} \sum_{j=1}^{B} \mathbb{E}_{N,B} \big\|\omega(\tau_j|\theta^s, \theta_t^{s+1}) g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2 + \frac{\sigma^2}{N}, \quad (6.4)
\end{aligned}
\]
where (6.2) holds due to the independence between the trajectories {τ_i}_{i=1}^N and {τ_j}_{j=1}^B, (6.3) is due to E‖x₁ + ... + x_n‖₂² = E‖x₁‖₂² + ... + E‖x_n‖₂² for independent, zero-mean random vectors x₁, ..., x_n, and (6.4) follows from Assumption 5.3 and the fact that E‖x − E x‖₂² ≤ E‖x‖₂². Note that we have

\[
\begin{aligned}
\mathbb{E}_{N,B} \big\|\omega(\tau_j|\theta^s, \theta_t^{s+1}) g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2
&\le \mathbb{E}_{N,B} \big\|\big(\omega(\tau_j|\theta^s, \theta_t^{s+1}) - 1\big) g(\tau_j|\theta^s)\big\|_2^2 + \mathbb{E}_{N,B} \big\|g(\tau_j|\theta^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2 \\
&\le C_g^2 \, \mathbb{E}_{N,B} \big\|\omega(\tau_j|\theta^s, \theta_t^{s+1}) - 1\big\|_2^2 + L_g^2 \big\|\theta^s - \theta_t^{s+1}\big\|_2^2,
\end{aligned} \tag{6.5}
\]
where the second inequality comes from Proposition 5.2. By Lemma 6.1, we have
\[
\mathbb{E}_{N,B} \big\|\omega(\tau_j|\theta^s, \theta_t^{s+1}) - 1\big\|_2^2 = \mathrm{Var}\big(\omega(\tau_j|\theta^s, \theta_t^{s+1})\big) \le C_{\omega} \big\|\theta_t^{s+1} - \theta^s\big\|_2^2, \tag{6.6}
\]
where C_ω = H(2HG² + M)(W + 1) is defined as in Lemma 6.1. Substituting the results in (6.4), (6.5) and (6.6) into (6.1) yields
\[
\begin{aligned}
\mathbb{E}_{N,B}\big[J(\theta_{t+1}^{s+1})\big] &\ge \mathbb{E}_{N,B}\big[J(\theta_t^{s+1})\big] + \frac{\eta}{8} \mathbb{E}_{N,B}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] + \Big[\frac{1}{4\eta} - \frac{L}{2}\Big] \mathbb{E}_{N,B}\big[\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2\big] \\
&\quad - \frac{3\eta\sigma^2}{4N} - \frac{3\eta(C_{\omega} C_g^2 + L_g^2)}{4B} \mathbb{E}_{N,B}\big[\|\theta_t^{s+1} - \theta^s\|_2^2\big].
\end{aligned} \tag{6.7}
\]

For the ease of notation, we denote
\[
\Psi = \frac{3(C_{\omega} C_g^2 + L_g^2)}{4B}. \tag{6.8}
\]
By Young's inequality (the Peter-Paul inequality), we have
\[
\big\|\theta_{t+1}^{s+1} - \theta^s\big\|_2^2 \le (1 + \alpha) \big\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\big\|_2^2 + \big(1 + 1/\alpha\big) \big\|\theta_t^{s+1} - \theta^s\big\|_2^2
\]
for any α > 0. For η ≤ 1/(2L), combining the above inequality with (6.7) and (6.8) yields
\[
\begin{aligned}
\mathbb{E}_{N,B}\big[J(\theta_{t+1}^{s+1})\big] &\ge \mathbb{E}_{N,B}\big[J(\theta_t^{s+1})\big] + \frac{\eta}{8} \mathbb{E}_{N,B}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3\eta\sigma^2}{4N} \\
&\quad + \frac{1}{1+\alpha} \Big[\frac{1}{4\eta} - \frac{L}{2}\Big] \mathbb{E}_{N,B}\big[\|\theta_{t+1}^{s+1} - \theta^s\|_2^2\big] - \Big[\eta\Psi + \frac{1}{\alpha} \Big(\frac{1}{4\eta} - \frac{L}{2}\Big)\Big] \mathbb{E}_{N,B}\big[\|\theta_t^{s+1} - \theta^s\|_2^2\big].
\end{aligned}
\]
Now we set α = 2t + 1 and sum the above inequality over t = 0, ..., m − 1. Note that θ^{s+1}_0 = θ^s and θ^{s+1}_m = θ^{s+1}. We are able to obtain

\[
\begin{aligned}
\mathbb{E}_{N}\big[J(\theta^{s+1})\big] &\ge \mathbb{E}_{N}\big[J(\theta^s)\big] + \frac{\eta}{8} \sum_{t=0}^{m-1} \mathbb{E}_{N}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3 m \eta \sigma^2}{4N} \\
&\quad + \sum_{t=0}^{m-1} \frac{1/(2\eta) - L}{4(t+1)} \mathbb{E}_{N}\big[\|\theta_{t+1}^{s+1} - \theta^s\|_2^2\big] - \sum_{t=0}^{m-1} \Big[\eta\Psi + \frac{1/(2\eta) - L}{2(2t+1)}\Big] \mathbb{E}_{N}\big[\|\theta_t^{s+1} - \theta^s\|_2^2\big] \\
&= \mathbb{E}_{N}\big[J(\theta^s)\big] + \frac{\eta}{8} \sum_{t=0}^{m-1} \mathbb{E}_{N}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3 m \eta \sigma^2}{4N} \\
&\quad + \sum_{t=0}^{m-2} \frac{1/(2\eta) - L}{4(t+1)} \mathbb{E}_{N}\big[\|\theta_{t+1}^{s+1} - \theta^s\|_2^2\big] - \sum_{t=1}^{m-1} \Big[\eta\Psi + \frac{1/(2\eta) - L}{2(2t+1)}\Big] \mathbb{E}_{N}\big[\|\theta_t^{s+1} - \theta^s\|_2^2\big] \\
&\quad + \frac{1/(2\eta) - L}{4m} \mathbb{E}_{N}\big[\|\theta_m^{s+1} - \theta^s\|_2^2\big] - \Big[\eta\Psi + \frac{1}{4\eta} - \frac{L}{2}\Big] \mathbb{E}_{N}\big[\|\theta_0^{s+1} - \theta^s\|_2^2\big] \\
&= \mathbb{E}_{N}\big[J(\theta^s)\big] + \frac{\eta}{8} \sum_{t=0}^{m-1} \mathbb{E}_{N}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3 m \eta \sigma^2}{4N} \\
&\quad + \sum_{t=1}^{m-1} \Big[\frac{1/(4\eta) - L/2}{2t(2t+1)} - \eta\Psi\Big] \mathbb{E}_{N}\big[\|\theta_t^{s+1} - \theta^s\|_2^2\big] + \frac{1/(2\eta) - L}{4m} \mathbb{E}_{N}\big[\|\theta_m^{s+1} - \theta^s\|_2^2\big].
\end{aligned} \tag{6.9}
\]

Recall the definition of Ψ in (6.8). If we choose the step size η, the epoch length m and the mini-batch size B such that
\[
\eta \le \frac{1}{4L}, \qquad \frac{B}{m^2} \ge \frac{3(C_{\omega} C_g^2 + L_g^2)}{2L^2}, \tag{6.10}
\]
then (6.9) leads to
\[
\mathbb{E}_{N}\big[J(\theta^{s+1})\big] \ge \mathbb{E}_{N}\big[J(\theta^s)\big] - \frac{3 m \eta \sigma^2}{4N} + \frac{\eta}{8} \sum_{t=0}^{m-1} \mathbb{E}_{N}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big].
\]

Telescoping the above inequality over s = 0, ..., S − 1 yields
\[
\frac{\eta}{8} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \mathbb{E}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] \le \mathbb{E}\big[J(\theta^S)\big] - \mathbb{E}\big[J(\theta^0)\big] + \frac{3 S m \eta \sigma^2}{4N},
\]
which immediately implies
\[
\mathbb{E}\big[\|\nabla J(\theta_{\mathrm{out}})\|_2^2\big] \le \frac{8\big(\mathbb{E}[J(\theta^S)] - \mathbb{E}[J(\theta^0)]\big)}{\eta S m} + \frac{6\sigma^2}{N} \le \frac{8\big(J(\theta^*) - J(\theta_0)\big)}{\eta S m} + \frac{6\sigma^2}{N}.
\]
This completes the proof.

Proof of Corollary 5.7. By Theorem 5.5, in order to ensure E[‖∇J(θ_out)‖₂²] ≤ ε, it suffices to ensure
\[
\frac{8\big(J(\theta^*) - J(\theta_0)\big)}{\eta S m} = \frac{\varepsilon}{2}, \qquad \frac{6\sigma^2}{N} = \frac{\varepsilon}{2},
\]
which implies Sm = O(1/ε) and N = O(1/ε). Note that we have set m = O(√B). The total number of stochastic gradient evaluations T_g we need is
\[
T_g = SN + SmB = O\bigg(\frac{N}{\sqrt{B}\,\varepsilon} + \frac{B}{\varepsilon}\bigg) = O\bigg(\frac{1}{\varepsilon^{5/3}}\bigg),
\]
where we set B = N^{2/3} = 1/ε^{2/3}.

6.2 PROOF OF TECHNICAL LEMMAS

In this subsection, we provide the proofs of the technical lemmas used in the proof of the main theory. We first prove the smoothness of J(θ).

Proof of Proposition 5.2. Recall from (3.3) that
\[
\nabla J(\theta) = \int_{\tau} R(\tau) \nabla_{\theta} p(\tau|\theta) \, \mathrm{d}\tau,
\]
which directly implies that the Hessian matrix is
\[
\nabla^2 J(\theta) = \int_{\tau} R(\tau) \nabla_{\theta}^2 p(\tau|\theta) \, \mathrm{d}\tau. \tag{6.11}
\]
Note that the Hessian of the log-density function is
\[
\nabla_{\theta}^2 \log p(\tau|\theta) = -p(\tau|\theta)^{-2} \nabla_{\theta} p(\tau|\theta) \nabla_{\theta} p(\tau|\theta)^{\top} + p(\tau|\theta)^{-1} \nabla_{\theta}^2 p(\tau|\theta). \tag{6.12}
\]
Substituting (6.12) into (6.11) yields
\[
\nabla^2 J(\theta) = \int_{\tau} p(\tau|\theta) R(\tau) \big[\nabla_{\theta}^2 \log p(\tau|\theta) + \nabla_{\theta} \log p(\tau|\theta) \nabla_{\theta} \log p(\tau|\theta)^{\top}\big] \mathrm{d}\tau.
\]
Therefore, we have
\[
\begin{aligned}
\|\nabla^2 J(\theta)\|_2 &\le \int_{\tau} p(\tau|\theta) R(\tau) \big[\|\nabla_{\theta}^2 \log p(\tau|\theta)\|_2 + \|\nabla_{\theta} \log p(\tau|\theta)\|_2^2\big] \mathrm{d}\tau \\
&\le \int_{\tau} p(\tau|\theta) R(\tau) H (M + H G^2) \, \mathrm{d}\tau \\
&\le \frac{R(1 - \gamma^H) H (M + H G^2)}{1 - \gamma} \le \frac{R H (M + H G^2)}{1 - \gamma},
\end{aligned} \tag{6.13}
\]
where the second inequality comes from Assumption 5.1 and the third inequality holds due to the definition of R(τ) in (3.1). Thus J(θ) is L-smooth with L = RH(M + HG²)/(1 − γ). Recall the REINFORCE estimator defined in (3.5). We immediately have
\[
\|\nabla g(\tau|\theta)\|_2 \le \bigg[\sum_{t=0}^{H-1} \big\|\nabla^2 \log \pi_{\theta}(a_t|s_t)\big\|_2\bigg] \frac{R + |b|}{1 - \gamma} \le \frac{H M (R + |b|)}{1 - \gamma}.
\]
Similarly, we have
\[
\|g(\tau|\theta)\|_2 \le H G \bigg[\frac{R(1 - \gamma^H)}{1 - \gamma} + |b|\bigg] \le \frac{H G (R + |b|)}{1 - \gamma}.
\]
The proof for the GPOMDP estimator is similar and we omit it for simplicity. This completes the proof.

The analysis of Lemma 6.1 relies on the following important properties of importance sampling weights.

Lemma 6.2 (Lemma 1 in Cortes et al. (2010)). Let ω(x) = P(x)/Q(x) be the importance weight for distributions P and Q, with expectations taken over x ∼ Q. Then the following identities hold:
\[
\mathbb{E}[\omega] = 1, \qquad \mathbb{E}[\omega^2] = d_2(P\|Q),
\]
where d_2(P‖Q) = 2^{D_2(P‖Q)} and D_2(P‖Q) is the Rényi divergence of order 2 between the distributions P and Q. Note that this immediately implies Var(ω) = d_2(P‖Q) − 1.
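As a quick numerical sanity check (our own illustration, not part of the paper), these identities can be verified by Monte Carlo for two one-dimensional Gaussians with a common variance σ², for which d₂(P‖Q) = ∫ P(x)²/Q(x) dx has the closed form exp((μ_P − μ_Q)²/σ²):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, mu_q, sigma = 0.3, 0.0, 1.0
n = 2_000_000

# Sample from Q and form the importance weights w = P(x)/Q(x);
# the normalization constants cancel because the variances are equal.
x = rng.normal(mu_q, sigma, size=n)
log_w = (-(x - mu_p) ** 2 + (x - mu_q) ** 2) / (2 * sigma ** 2)
w = np.exp(log_w)

d2 = np.exp((mu_p - mu_q) ** 2 / sigma ** 2)      # closed-form d_2(P||Q)
print("E[w]   ~", w.mean())                        # should be close to 1
print("E[w^2] ~", (w ** 2).mean(), " vs d_2 =", d2)
print("Var(w) ~", w.var(), " vs d_2 - 1 =", d2 - 1)
```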

Proof of Lemma 6.1. According to Lemma 6.2, we have
\[
\mathrm{Var}\big(\omega(\tau|\theta^s, \theta_t^{s+1})\big) = d_2\big(p(\tau|\theta^s) \,\|\, p(\tau|\theta_t^{s+1})\big) - 1.
\]
In the rest of this proof, we write θ₁ = θ^s and θ₂ = θ^{s+1}_t to simplify the notation. By definition, we have
\[
d_2\big(p(\tau|\theta_1) \,\|\, p(\tau|\theta_2)\big) = \int_{\tau} p(\tau|\theta_1) \frac{p(\tau|\theta_1)}{p(\tau|\theta_2)} \, \mathrm{d}\tau = \int_{\tau} p(\tau|\theta_1)^2 \, p(\tau|\theta_2)^{-1} \, \mathrm{d}\tau.
\]

For any fixed θ₂ ∈ ℝ^d, computing the gradient of d₂(p(τ|θ₁)‖p(τ|θ₂)) with respect to θ₁ yields
\[
\nabla_{\theta_1} d_2\big(p(\tau|\theta_1) \,\|\, p(\tau|\theta_2)\big) = 2 \int_{\tau} p(\tau|\theta_1) \nabla_{\theta_1} p(\tau|\theta_1) \, p(\tau|\theta_2)^{-1} \, \mathrm{d}\tau,
\]
which implies that, if we set θ₁ = θ₂, we obtain
\[
\nabla_{\theta_1} d_2\big(p(\tau|\theta_1) \,\|\, p(\tau|\theta_2)\big)\Big|_{\theta_1 = \theta_2} = 2 \int_{\tau} \nabla_{\theta_1} p(\tau|\theta_1) \, \mathrm{d}\tau \Big|_{\theta_1 = \theta_2} = 0.
\]
Hence, applying the mean value theorem, we have
\[
d_2\big(p(\tau|\theta_1) \,\|\, p(\tau|\theta_2)\big) = 1 + \frac{1}{2} (\theta_1 - \theta_2)^{\top} \nabla_{\theta}^2 d_2\big(p(\tau|\tilde{\theta}) \,\|\, p(\tau|\theta_2)\big) (\theta_1 - \theta_2), \tag{6.14}
\]
where θ̃ = tθ₁ + (1 − t)θ₂ for some t ∈ [0, 1]. Next, we compute the Hessian matrix. For any fixed θ₂, we have
\[
\begin{aligned}
\nabla_{\theta}^2 d_2\big(p(\tau|\theta) \,\|\, p(\tau|\theta_2)\big) &= 2 \int_{\tau} \nabla_{\theta} p(\tau|\theta) \nabla_{\theta} p(\tau|\theta)^{\top} p(\tau|\theta_2)^{-1} \, \mathrm{d}\tau + 2 \int_{\tau} \nabla_{\theta}^2 p(\tau|\theta) \, p(\tau|\theta) \, p(\tau|\theta_2)^{-1} \, \mathrm{d}\tau \\
&= 2 \int_{\tau} \nabla_{\theta} \log p(\tau|\theta) \nabla_{\theta} \log p(\tau|\theta)^{\top} \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)} \, \mathrm{d}\tau + 2 \int_{\tau} \nabla_{\theta}^2 p(\tau|\theta) \, p(\tau|\theta) \, p(\tau|\theta_2)^{-1} \, \mathrm{d}\tau.
\end{aligned} \tag{6.15}
\]

Recall the Hessian of the log-density function in (6.12). Substituting (6.12) into (6.15) yields
\[
\begin{aligned}
\big\|\nabla_{\theta}^2 d_2\big(p(\tau|\theta) \,\|\, p(\tau|\theta_2)\big)\big\|_2 &= \bigg\| 4 \int_{\tau} \nabla_{\theta} \log p(\tau|\theta) \nabla_{\theta} \log p(\tau|\theta)^{\top} \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)} \, \mathrm{d}\tau + 2 \int_{\tau} \nabla_{\theta}^2 \log p(\tau|\theta) \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)} \, \mathrm{d}\tau \bigg\|_2 \\
&\le \int_{\tau} \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)} \Big(4 \|\nabla_{\theta} \log p(\tau|\theta)\|_2^2 + 2 \|\nabla_{\theta}^2 \log p(\tau|\theta)\|_2\Big) \mathrm{d}\tau \\
&\le (4H^2 G^2 + 2HM) \, \mathbb{E}\big[\omega(\tau|\theta, \theta_2)^2\big] \\
&\le 2H(2HG^2 + M)(W + 1),
\end{aligned}
\]
where the second inequality comes from Assumption 5.1 and the last inequality is due to Assumption 5.4 and Lemma 6.2. Therefore, by (6.14) we have

\[
\mathrm{Var}\big(\omega(\tau|\theta^s, \theta_t^{s+1})\big) = d_2\big(p(\tau|\theta^s) \,\|\, p(\tau|\theta_t^{s+1})\big) - 1 \le C_{\omega} \big\|\theta^s - \theta_t^{s+1}\big\|_2^2,
\]
where C_ω = H(2HG² + M)(W + 1).

Figure 1: The average reward of different algorithms in the Cartpole and Mountain Car environments. (a) Cartpole; (b) Mountain Car. [Plots omitted: the x-axis is the number of trajectories, the y-axis is the average return; the curves compare REINFORCE, GPOMDP and SVRPG.]

7 EXPERIMENTS

In this section, we conduct experiments on reinforcement learning benchmark tasks, namely the Cartpole and (continuous) Mountain Car environments (Brockman et al., 2016), to evaluate the performance of Algorithm 1. We measure the performance of an algorithm in terms of the total number of sample trajectories it needs to achieve a certain reward. We compare SVRPG with vanilla stochastic gradient based algorithms: REINFORCE (Williams, 1992) and GPOMDP¹ (Baxter & Bartlett, 2001). Recall that at each iteration of Algorithm 1, we also need to choose a stochastic gradient estimator to approximate the full gradient based on sampled trajectories. Since the performance of GPOMDP is always comparable to or better than REINFORCE, we only report the results of SVRPG with the GPOMDP estimator.

We follow the practical suggestions provided in Papini et al. (2018) to improve performance, including (1) performing one initial gradient update immediately after sampling the N trajectories in the outer loop; (2) using adaptive step sizes; and (3) using an adaptive epoch length (terminating the inner loop early if the step size used in the inner loop is smaller than that used in the outer loop). Following Papini et al. (2018), we use the following Gaussian policy with a fixed variance σ²:
\[
\pi_{\theta}(a|s) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\bigg(-\frac{(\theta^{\top}\phi(s) - a)^2}{2\sigma^2}\bigg),
\]
where φ: S → ℝ^d is a bounded feature map. Under the Gaussian policy, it is easy to verify that Assumptions 5.1 and 5.3 are satisfied with parameters depending on φ, σ², and the upper bound of the actions a ∈ A.
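For this Gaussian policy, the score function has the simple closed form ∇_θ log π_θ(a|s) = (a − θᵀφ(s)) φ(s)/σ², which is consistent with the remark above that the assumption constants depend on φ, σ² and the action bound. The snippet below is a minimal sketch, using a toy feature map of our own choosing, of how such a policy and its score function could be implemented for use with the gradient estimators sketched earlier; it is an illustration, not the authors' implementation.

```python
import numpy as np

class GaussianPolicy:
    """Linear-mean Gaussian policy pi_theta(a|s) = N(theta^T phi(s), sigma^2)."""

    def __init__(self, theta, sigma, feature_map):
        self.theta = np.asarray(theta, dtype=float)
        self.sigma = sigma
        self.phi = feature_map  # phi: state -> R^d, assumed bounded

    def sample_action(self, s, rng):
        return rng.normal(self.theta @ self.phi(s), self.sigma)

    def grad_log_pi(self, s, a):
        # score function: (a - theta^T phi(s)) * phi(s) / sigma^2
        feat = self.phi(s)
        return (a - self.theta @ feat) * feat / self.sigma ** 2


# Example usage with a toy bounded feature map (illustrative only).
rng = np.random.default_rng(0)
policy = GaussianPolicy(theta=np.zeros(2), sigma=0.5,
                        feature_map=lambda s: np.tanh(np.asarray(s, dtype=float)))
s = np.array([0.1, -0.3])
a = policy.sample_action(s, rng)
print(policy.grad_log_pi(s, a))
```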

Cartpole Setup: The neural network for the Cartpole environment has one hidden layer of 64 nodes with the tanh activation function. In the comparison between REINFORCE, GPOMDP, and SVRPG, we use learning rates η = [0.005, 0.005, 0.0125], respectively. Based on our theoretical analysis, we chose N = 25 and B = 10 for SVRPG. We also set the batch size of the vanilla gradient methods to N = 25 for REINFORCE and N = 10 for GPOMDP.

¹We thank Papini et al. (2018) for their implementations of GPOMDP and SVRPG, as well as Duan et al. (2016) for their implementations from the rllab library.


Figure 2: (a)-(b): The average reward of SVRPG with different mini-batch sizes B. (c)-(d): The average reward of SVRPG with different epoch lengths m. (a) Cartpole (B = 5, 10, 20); (b) Mountain Car (B = 3, 8, 20); (c) Cartpole (m = 1, 3, 7); (d) Mountain Car (m = 2, 5, 7). [Plots omitted: the x-axis is the number of trajectories, the y-axis is the average return.]


Mountain Car Setup: The neural network for the Mountain Car environment contains one hidden layer with 16 nodes and the tanh activation. In the comparison among REINFORCE, GPOMDP and SVRPG, we set N = 25 and B = 8 for SVRPG and set the batch size N = 20 for the vanilla gradient methods. REINFORCE, GPOMDP, and SVRPG have respective learning rates of η = [0.005, 0.005, 0.0075].

Experimental Results: Figures 1(a) and 1(b) respectively show the performance of the different algorithms on the Cartpole and Mountain Car environments. All results are averaged over 10 repetitions, and the shaded area is a confidence interval corresponding to the standard deviation over different runs. It can be seen that all methods solve the Cartpole environment (with average reward close to 1000). The SVRPG algorithm outperforms the other two by gaining higher rewards with fewer sample trajectories. SVRPG also beats the other methods in solving the Mountain Car environment (with average reward close to 90).

Sensitivity Study: We now study the impact of the mini-batch size B within each epoch of SVRPG to validate our theoretical claims in Corollary 5.7. Specifically, in the Cartpole environment, we fix N = 25 and vary the mini-batch size B in the range {5, 10, 20}. In the Mountain Car environment, we set N = 25 and vary the mini-batch size B in the range {3, 8, 20}. The learning rates are tuned accordingly. Figures 2(a) and 2(b) show the effect of different mini-batch sizes B on SVRPG. Note that the outer loop batch size is N = 25 in both environments. It can be seen that SVRPG achieves the best performance with B = 10 for Cartpole and B = 8 for Mountain Car, which is well aligned with our theory. In particular, with a small mini-batch size, SVRPG behaves similarly to the vanilla stochastic gradient based algorithms, which need fewer trajectories in each iteration but converge slowly and require more trajectories in total. Conversely, using a large mini-batch pushes SVRPG to converge in fewer iterations, but requires more trajectories in total.

We also conduct experiments to show the impact of different epoch lengths m on the performance of SVRPG. Similar to the previous sensitivity study on B, here we fix N and B for all experiments and vary the epoch length m across experiments. The learning rates are tuned accordingly. Figures 2(c) and 2(d) show the sensitivity of SVRPG with respect to the epoch length m. It can be seen that smaller m tends to give better results, where the algorithm runs more outer iterations with a larger batch size N (compared with the mini-batch size B in the inner loop). This suggests that the choice of epoch length in Corollary 5.7 may not yet be optimal.

8 CONCLUSIONS

We revisited the SVRPG algorithm (Papini et al., 2018) and derived a sharp convergence analysis of SVRPG, which achieves the state-of-the-art sample complexity. Based on our improved analysis, we provided a detailed discussion of and guidance on the choice of batch sizes and epoch length so that the total number of samples can be significantly reduced. We also empirically validated the theoretical results on common reinforcement learning tasks. As a future direction, it would be interesting to see whether a better sample complexity can be obtained for policy gradient algorithms.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation awards IIS-1904183 and IIS-1906169. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.


References

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699–707, 2016.

Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442–450, 2010.

Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1049–1058. JMLR.org, 2017.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 686–696, 2018.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.

Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konecny, J., and Sallinen, S. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pp. 2251–2259, 2015.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2348–2358, 2017.

Levine, S., Wagener, N., and Abbeel, P. Learning contact-rich manipulation skills with guided policy search. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 156–163. IEEE, 2015.

Li, Z. and Li, J. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 5569–5579, 2018.

Liu, J. S. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.

Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pp. 5447–5459, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2613–2621. JMLR.org, 2017.

Papini, M., Pirotta, M., and Restelli, M. Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems, pp. 3591–3600, 2017.

Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, pp. 4023–4032, 2018.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

Pirotta, M., Restelli, M., and Bascetta, L. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323, 2016.

Renyi, A. et al. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, volume 37, pp. 1889–1897, 2015.

Sehnke, F., Osendorfer, C., Ruckstieß, T., Graves, A., Peters, J., and Schmidhuber, J. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016. URL http://arxiv.org/abs/1610.03295.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Tucker, G., Bhupatiraju, S., Gu, S., Turner, R., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In International Conference on Machine Learning, pp. 5022–5031, 2018.

Weaver, L. and Tao, N. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Xu, T., Liu, Q., and Peng, J. Stochastic variance reduction for policy gradient estimation. CoRR, abs/1710.06034, 2017. URL http://arxiv.org/abs/1710.06034.

Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pp. 262–270, 2011.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3922–3933, 2018.