exploration versus exploitation in reinforcement learning: a … · 2018. 12. 4. · on video games...

34
Exploration versus exploitation in reinforcement learning: A stochastic control approach Haoran Wang * Thaleia Zariphopoulou Xun Yu Zhou First draft: March 2018 This draft: December 2018 Abstract We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory for- mulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a resurrection of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear–quadratic (LQ) case and deduce that the op- timal control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaus- sian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are reflected respectively by the mean and variance of the Gaussian distribution. We also find that a more ran- dom environment contains more learning opportunities in the sense that less exploration is needed, other things being equal. As the weight of exploration decays to zero, we prove the convergence of the solution to the entropy-regularized LQ problem to that of the classical LQ problem. Finally, we characterize the cost of exploration, which is shown to be pro- portional to the entropy regularization weight and inversely proportional to the discount rate in the LQ case. Key words. Reinforcement learning, exploration, exploitation, en- tropy regularization, stochastic control, linear–quadratic, Gaussian. * Department of Mathematics, The University of Texas at Austin, Austin, TX 78712, USA. Email: [email protected]. Department of Mathematics and IROM, The University of Texas at Austin, Austin, USA and the Oxford-Man Institute, University of Oxford, Oxford, UK. Email: za- [email protected]. Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA. Email: [email protected]. 1

Upload: others

Post on 24-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Exploration versus exploitation in reinforcement

learning: A stochastic control approach

Haoran Wang∗ Thaleia Zariphopoulou† Xun Yu Zhou‡

First draft: March 2018This draft: December 2018

Abstract

We consider reinforcement learning (RL) in continuous time and studythe problem of achieving the best trade-off between exploration of a blackbox environment and exploitation of current knowledge. We propose anentropy-regularized reward function involving the differential entropy ofthe distributions of actions, and motivate and devise an exploratory for-mulation for the feature dynamics that captures repetitive learning underexploration. The resulting optimization problem is a resurrection of theclassical relaxed stochastic control. We carry out a complete analysis ofthe problem in the linear–quadratic (LQ) case and deduce that the op-timal control distribution for balancing exploitation and exploration isGaussian. This in turn interprets and justifies the widely adopted Gaus-sian exploration in RL, beyond its simplicity for sampling. Moreover,the exploitation and exploration are reflected respectively by the meanand variance of the Gaussian distribution. We also find that a more ran-dom environment contains more learning opportunities in the sense thatless exploration is needed, other things being equal. As the weight ofexploration decays to zero, we prove the convergence of the solution tothe entropy-regularized LQ problem to that of the classical LQ problem.Finally, we characterize the cost of exploration, which is shown to be pro-portional to the entropy regularization weight and inversely proportionalto the discount rate in the LQ case.

Key words. Reinforcement learning, exploration, exploitation, en-tropy regularization, stochastic control, linear–quadratic, Gaussian.

∗Department of Mathematics, The University of Texas at Austin, Austin, TX 78712, USA.Email: [email protected].†Department of Mathematics and IROM, The University of Texas at Austin, Austin,

USA and the Oxford-Man Institute, University of Oxford, Oxford, UK. Email: [email protected].‡Department of Industrial Engineering and Operations Research, Columbia University,

New York, NY 10027, USA. Email: [email protected].

1

Page 2: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

1 Introduction

Reinforcement learning (RL) is currently one of the most active and fast develop-ing subareas in machine learning. It has been successfully applied to solve largescale real world, complex decision-making problems in recent years, includingplaying perfect-information board games such as Go (AlphaGo/AlphaGo Zero,Silver et al. (2016), Silver et al. (2017)), achieving human-level performanceon video games (Mnih et al. (2015)), and autonomous driving (Levine et al.(2016), Mirowski et al. (2016)). An RL agent does not pre-specify a structuralmodel or a family of models, but instead gradually learns the best (or near-best) strategies based on trial and error, through interactions with the random(black box) environment and incorporation of the responses of these interac-tions in order to improve the overall performance. This is a “kill two birds withone stone” case: the agent’s actions (controls) serve both as a means to explore(learn) and a way to exploit (optimize). Since exploration is inherently costlyin terms of both resources and time, a natural and crucial question in RL isto address the dichotomy between exploration of uncharted territory and ex-ploitation of existing knowledge. Such question exists in both the stateless RLsettings represented by the multi-armed bandit problem, and the more generalmulti-state RL settings with delayed reinforcement (Sutton and Barto (2018),Kaelbling et al. (1996)). More specifically, the agent must balance betweengreedily exploiting what has been learned so far to choose actions that yieldnear-term higher rewards, and continually exploring the environment to acquiremore information to potentially achieve long-term benefits. Extensive studieshave been conducted to find optimal strategies for trading off exploitation andexploration.1

However, most of the contributions to balancing exploitation and explorationdo not include exploration formally as a part of the optimization objective;the attention has mainly focused on solving the classical optimization prob-lem maximizing the accumulated rewards, while exploration is typically treatedseparately as an ad-hoc chosen exogenous component, rather than being en-dogenously derived as a part of the solution to the overall RL problem. Therecently proposed discrete-time entropy-regularized (also termed as “entropy-augmented” or “softmax”) RL formulation, on the other hand, explicitly incor-porates exploration into the optimization objective, with a trade-off weight im-posed on the entropy of the exploration strategy (Ziebart et al. (2008), Nachumet al. (2017a), Fox et al. (2015)). An exploratory distribution with a greaterentropy signifies a higher level of exploration, and is hence more favorable onthe exploration front. The extreme case of a Dirac measure having minimalentropy implies no exploration, reducing to the case of classical optimization

1For the multi-armed bandit problem, well-known strategies include Gittins-index ap-proach (Gittins (1974)), Thompson sampling (Thompson (1933)), and upper confidence boundalgorithm (Auer et al. (2002)), with sound theoretical optimality established (Russo andVan Roy (2013, 2014)). For general RL problems, various efficient exploration methods havebeen proposed that have been proved to induce low sample complexity (see, e.g., Brafman andTennenholtz (2002), Strehl and Littman (2008), Strehl et al. (2009)) among other advantages.

2

Page 3: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

with complete knowledge about the underlying model. Recent works have de-voted to the designing of various algorithms to solve the entropy regulated RLproblem, where numerical experiments demonstrate remarkable robustness andmulti-modal policy learning (Haarnoja et al. (2017), Haarnoja et al. (2018)).

In this paper, we study the trade-off between exploration and exploitation forRL in a continuous time setting with both continuous control (action) and state(feature) spaces.2 Such continuous-time formulation is especially appealing ifthe agent can interact with the environment at ultra-high frequency aided bythe modern computing resources.3 Meanwhile, an in-depth and comprehensivestudy of the RL problem also becomes possible which leads to elegant andinsightful results, once cast in continuous time, thanks in no small measure tothe tools of stochastic calculus and differential equations.

Our first main contribution is to propose an entropy-regularized reward func-tion involving the differential entropy for probability distributions over the con-tinuous action space, and motivate and devise an “exploratory formulation” forthe state dynamics that captures repetitive learning under exploration in thecontinuous-time limit. Existing theoretical works on exploration mainly concen-trate on the analysis at the algorithmic level, e.g., proving convergence of theproposed exploration algorithms to the solutions of the classical optimizationproblems (see, e.g., Singh et al. (2000), Jaakkola et al. (1994)), yet they rarelylook into the learning algorithms’ impact on changing significantly the underly-ing dynamics. Indeed, exploration not only substantially enriches the space ofcontrol strategies (from that of Dirac measures to that of all possible probabilitydistributions) but also, as a result, enormously expands the reachable space ofstates. This, in turn, sets out to change the underlying state transitions andsystem dynamics.

We show that our exploratory formulation can account for the effects oflearning in both the rewards received and the state transitions observed fromthe interactions with the environment. It, thus, unearths the important charac-teristics of learning at a more refined and in-depth level, beyond the theoreticalanalysis of mere learning algorithms. Intriguingly, this formulation of the statedynamics coincides with the relaxed control framework in classical control the-ory (El Karoui et al. (1987), Kurtz and Stockbridge (1998, 2001)), which wasmotivated by entirely different reasons. Relaxed controls were introduced tomainly deal with the theoretical question of whether an optimal control exists.The approach is essentially a randomization device to convexify the universeof control strategies. To our best knowledge, our paper is the first to bringback the fundamental ideas and formulation of the relaxed control, guided by apractical motivation: exploration and learning.

We then carry out a complete analysis of the continuous-time entropy-regularized RL problem, assuming that the original dynamic system is linear

2The terms “feature” and “action” are typically used in the RL literature, whose coun-terparts in the control literature are “state” and “control” respectively. Since this paper usesthe control approach to study RL, we will interchangeably use these terms whenever there isno confusion.

3A notable example is high frequency stock trading.

3

Page 4: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

in both control and state and the original reward function is quadratic in thetwo. This type of linear–quadratic (LQ) problems has occupied the center stagefor research in classical control theory for its elegant solutions and its abilityto approximate more general nonlinear problems. One of the most important,conceptual contributions of this paper is to show that the optimal control distri-bution for balancing exploitation and exploration is Gaussian. As is well known,a pure exploitation optimal distribution is Dirac, and a pure exploration opti-mal distribution is uniform. Our results reveal that Gaussian is the right choiceif one seeks a balance between those two extremes. Moreover, we find that themean of this optimal distribution is a feedback of the current state independentof the intended exploration level, whereas the variance is a linear function of theentropy regulating weight (also called the “temperature parameter”) irrespectiveof the current state. This result highlights a separation between exploitation andexploration: the former is reflected in the mean and the latter in the varianceof the final optimal actions.

There is yet another intriguing result. All other things being equal, the morevolatile the original dynamic system is, the smaller the variance of the optimalGaussian distribution is. Conceptually, this implies that an environment react-ing more aggressively to an action in fact contains more learning opportunitiesand hence is less costly for learning.

Another contribution of the paper is to establish the connection between thesolvability of the exploratory LQ problem and that of the classical LQ problem.We prove that as the exploration weight in the former decays to zero, the optimalGaussian control distribution and its value function converge respectively to theoptimal Dirac measure and the value function of the latter, a desirable resultfor practical learning purposes.

Finally, we observe that, beyond the LQ problems, the Gaussian distributionremains, formally, optimal for a much larger class of control problems, namely,problems with drift and volatility linear in control, and reward functions linearor quadratic in control. Such family of problems can be seen as the local-linear-quadratic approximation to more general stochastic control problems whosestate dynamics are linearized in the control variables and the reward functionsare approximated by quadratic functions in controls locally (Todorov and Li(2005), Li and Todorov (2007)). Note that although such iterative LQ approx-imation generally has different parameters at different local state-action pairs,our result on the optimality of Gaussian distribution under the exploratory LQframework still holds at any local point, and therefore justifies, from a stochas-tic control perspective, why Gaussian distribution is commonly used in practicefor exploration (Haarnoja et al. (2017), Haarnoja et al. (2018), Nachum et al.(2017b)), beyond its simplicity for sampling.

The paper is organized as follows. In section 2, we motivate and proposethe relaxed stochastic control problem involving an exploratory state dynamicsand an entropy-regularized reward function. We then present the Hamilton-Jacobi-Bellman (HJB) equation and the optimal control distribution for gen-eral entropy-regularized stochastic control problems in section 3. In section4, we study the special LQ problem in both the state-independent and state-

4

Page 5: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

dependent reward cases, corresponding respectively to the multi-armed banditproblem and general RL problem in discrete time. We discuss the connectionsbetween the exploratory LQ problem and the classical LQ problem in section5, establish the equivalence of the solvability of the two and the convergenceresult for vanishing exploration, and finally characterize the cost of engagingexploration. Finally, section 6 concludes. Some technical proofs are relegatedto appendices.

2 An Entropy-Regularized Stochastic ControlProblem

We introduce an entropy-regularized stochastic control problem and provide itsmotivation in the context of RL.

Consider a filtered probability space (Ω,F ,P; Ftt≥0) in which we definean Ftt≥0-adapted Brownian motion W = Wt, t ≥ 0. An “action space”U is given, representing the constraint on an agent’s decisions (“controls” or“actions”). An admissible control u = ut, t ≥ 0 is an Ftt≥0-adapted mea-surable process taking value in U . Denote by U the set of all admissible controls.

The classical stochastic control problem is to control the state (or “feature”)dynamics

dxut = b(xut , ut)dt+ σ(xut , ut)dWt, t > 0; xu0 = x (1)

so as to achieve the maximum expected total discounted reward represented bythe value function

w (x) := supU

E[∫ ∞

0

e−ρtr (xut , ut) dt

∣∣∣∣xu0 = x

], (2)

where r is the reward function and ρ > 0 is the discount rate.In the classical setting, where the model is fully known (namely when the

functions b, σ and r are fully specified) and the dynamic programming is ap-plicable, the optimal control can be derived and represented as a deterministicmapping from the current state to the action space U , u∗t = u∗(x∗t ) where u∗ isthe optimal feedback policy (or “law”). This feedback policy is derived at t = 0and will be carried out through [0,∞).

In contrast, under the RL setting when the underlying model is not knownand therefore dynamic learning is needed, the agent employs exploration to in-teract with and learn the unknown environment through trial and error. Thisexploration can be modelled by a distribution of controls π = πt(u), t ≥ 0over the control space U from which each “trial” is sampled. We can thereforeextend the notion of controls to distributions4. The agent executes a control

4A classical control u = ut, t ≥ 0 can be regarded as a Dirac distribution (or “measure”)π = πt(u), t ≥ 0 where πt(·) = δut (·). In a similar fashion, a feedback policy ut = u(xut )can be embedded as a Dirac measure πt(·) = δu(xut )

(·), parameterized by the current statexut .

5

Page 6: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

for N rounds over the same time horizon, while at each round, a classical con-trol is sampled from the distribution π. The reward of such a policy becomesaccurate enough when N is large. This procedure, known as policy evaluation,is considered as a fundamental element of most RL algorithms in practice (see,e.g., Sutton and Barto (2018)). Hence, for evaluating such a policy distributionin our continuous-time setting it is necessary to consider the limiting situationas N ↑ ∞.

In order to quickly capture the essential idea, let us first examine the specialcase when the reward depends only on the control, namely, r(xut , ut) = r(ut).One then considers N identical independent copies of the control problem inthe following way: at round i, i = 1, 2, . . . , N, a control ui is sampled under the(possibly random) control distribution π, and executed for its correspondingcopy of the control problem (1)–(2). Then, at each fixed time t, from the law oflarge numbers and under certain mild technical conditions, it follows that theaverage reward over [t, t+ ∆t], with ∆t small enough, should satisfy

1

N

N∑i=1

e−ρtr(uit)∆ta.s.−−−→ E

[e−ρt

∫U

r(u)πt(u)du∆t

], as N →∞.

For a general reward r(xut , ut) which depends on the state, we first needto describe how exploration might alter the state dynamics (1), by definingappropriately its “exploratory” version. For this, we look at the effect of repet-itive learning under a given control distribution π for N rounds. Let W i

t ,i = 1, 2, . . . , N , be N independent sample paths of the Brownian motion Wt,and xit, i = 1, 2, . . . , N , be the copies of the state process respectively under thecontrols ui, i = 1, 2, . . . , N , each sampled from π. Then the increments of thesestate process copies are, for i = 1, 2, . . . , N ,

∆xit ≡ xit+∆t − xit ≈ b(xit, uit)∆t+ σ(xit, u

it)(W it+∆t −W i

t

), t ≥ 0. (3)

Then, each such process xi, i = 1, 2, . . . , N , can be viewed as an independentsample from the exploratory state dynamics Xπ. The superscript π of Xπ

indicates that each xi is generated according to the classical dynamics (3), withthe corresponding ui sampled independently under the policy π.

It then follows from (3) and the law of large numbers that, as N →∞,

1N

∑Ni=1 ∆xit ≈ 1

N

∑Ni=1 b(x

it, u

it)∆t+ 1

N

∑Ni=1 σ(xit, u

it)(W it+∆t −W i

t

)a.s.−−−→ E

[∫Ub(Xπ

t , u)πt(u)du∆t]

+ E[∫Uσ(Xπ

t , u)πt(u)du]E [Wt+∆t −Wt]

= E[∫Ub(Xπ

t , u)πt(u)du∆t].

(4)In the above, we have implicitly applied the (reasonable) assumption that bothπt and Xπ

t are independent of the increments of the Brownian motion samplepaths, which are identically distributed over [t, t+ ∆t].

6

Page 7: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Similarly, as N →∞,

1

N

N∑i=1

(∆xit

)2 ≈ 1

N

N∑i=1

σ2(xit, uit)∆t

a.s.−−−→ E[∫

U

σ2(Xπt , u)πt(u)du∆t

]. (5)

As we see, not only ∆xit but also (∆xit)2 are affected by repetitive learning

under the given policy π.Finally, as the individual state xit is an independent sample from Xπ

t , wehave that ∆xit and (∆xit)

2, i = 1, 2, . . . , N , are the independent samples from∆Xπ

t and (∆Xπt )2, respectively. As a result, the law of large numbers gives that

as N →∞,

1

N

N∑i=1

∆xita.s.−−−→ E [∆Xπ

t ] and1

N

N∑i=1

(∆xit)2 a.s.−−−→ E

[(∆Xπ

t )2].

This interpretation, together with (4) and (5), motivates us to propose theexploratory version of the state dynamics

dXπt = b(Xπ

t , πt)dt+ σ(Xπt , πt)dWt, t > 0; Xπ

0 = x, (6)

where the coefficients b(·, ·) and σ(·, ·) are defined as

b(Xπt , πt) :=

∫U

b (Xπt , u)πt(u)du, (7)

and

σ(Xπt , πt) :=

√∫U

σ2 (Xπt , u)πt(u)du. (8)

We will call (6) the exploratory formulation of the controlled state dynamics,and b(·, ·) and σ(·, ·) in (7) and (8), respectively, the exploratory drift and theexploratory volatility.5

5The exploratory formulation (6), inspired by repetitive learning, is consistent with thenotion of relaxed control in the control literature (see, e.g., El Karoui et al. (1987); Kurtz andStockbridge (1998, 2001); Fleming and Nisio (1984)). Indeed, let f : D 7→ R be a bounded andtwice continuously differentiable function, and define the infinitesimal generator associated tothe classical controlled process (1) as

L[f ](x, u) :=1

2σ2(x, u)f ′′(x) + b(x, u)f ′(x), x ∈ D, u ∈ U.

In the classical relaxed control framework, the controlled dynamics are replaced by the six-tuple (Ω,F ,F = Ftt≥0,P, Xπ , π), such that Xπ

0 = x and

f(Xπt )− f(x)−

∫ t

0

∫UL[f ](Xπ

t , u)πt(u)duds, t ≥ 0, is a P−martingale. (9)

It is easy to verify that our proposed exploratory formulation (6) agrees with the abovemartingale formulation. However, even though the mathematical formulations are equivalent,the motivations of the two are entirely different. Relaxed control was introduced to mainly dealwith the existence of optimal controls, whereas the exploratory formulation here is motivatedby learning and exploration in RL.

7

Page 8: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

In a similar fashion,

1

N

N∑i=1

e−ρtr(xit, uit)∆t

a.s.−−−→ E[e−ρt

∫U

r(Xπt , u)πt(u)du∆t

], as N →∞.

(10)Hence the reward function r in (2) needs to be modified to its exploratoryversion

r (Xπt , πt) :=

∫U

r (Xπt , u)πt(u)du. (11)

If, on the other hand, the model is fully known, exploration would not beneeded and the control distributions would all degenerate into the Dirac mea-sures, and we are in the realm of the classical stochastic control. Thus, in theRL context, we need to add a “regularization term” to account for model un-certainty and to encourage exploration. We use Shanon’s differential entropy tomeasure the degree of exploration:

H(πt) := −∫U

πt(u) lnπt(u)du,

where πt is a control distribution.We therefore introduce the following entropy-regularized value function

V (x) := supπ∈A(x)

E[∫ ∞

0

e−ρt(∫

U

r (Xut , πt)πt (u) du− λ

∫U

πt(u) lnπt(u)du

)dt

∣∣∣∣Xπ0 = x

](12)

where λ > 0 is an exogenous exploration weight parameter capturing the trade-off between exploitation (the original reward function) and exploration (theentropy), and A(x) is the set of the admissible control distributions (which mayin general depend on x).

It remains to specify A(x). Denote by B(U) the Borel algebra on U andby P (U) the set of probability measures on U that are absolutely continuouswith respect to the Lebesgue measure. The admissible set A(x) contains allmeasure-valued processes π = πt, t ≥ 0 satisfying:

i) for each t ≥ 0, πt ∈ P(U) a.s.;ii) for each A ∈ B(U), πt (A) , t ≥ 0 is Ft-progressively measurable;iii) the stochastic differential equation (SDE) (6) has a unique solution Xπ =

Xπt , t ≥ 0 if π is applied;iv) the expectation on the right hand side of (12) is finite.

3 HJB Equation and Optimal Control Distribu-tions

We present the general procedure for solving the optimization problem (12).To this end, applying the classical Bellman’s principle of optimality, we have

V (x) = supπ∈A(x)

E[∫ s

0

e−ρt (r (Xπt , πt) + λH (πt)) dt+ e−ρsV (Xπ

s )

∣∣∣∣Xπ0 = x

], s > 0.

8

Page 9: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Proceeding with standard arguments, we deduce that V satisfies the followingHamilton-Jacobi-Bellmam (HJB) equation

ρV (x) = maxπ∈P(U)

(r(x, π)− λ

∫U

π(u) lnπ(u)du+1

2σ2(x, π)V ′′(x)

+b(x, π)V ′(x)), x ∈ D, (13)

or

ρV (x) = maxπ∈P(U)

∫U

(r(x, u)− λ lnπ(u) +

1

2σ2(x, u)V ′′ (x) + b (x, u)V ′(x)

)π(u)du,

(14)where, with a slight abuse of notation, we have denoted its generic solution byV . Note that π ∈ P (U) if and only if∫

U

π(u)du = 1 and π(u) ≥ 0 a.e. on U. (15)

Solving the (constrained) maximization problem on the right hand side of (14)then yields the “feedback-type” optimizer

π∗(u;x) =exp

(1λ

(r(x, u) + 1

2σ2 (x, u)V ′′(x) + b (x, u)V ′(x)

))∫U

exp(

(r(x, u) + 1

2σ2 (x, u)V ′′(x) + b (x, u)V ′(x)

))du. (16)

This leads to an optimal measure-valued process

π∗t = π∗(u;X∗t ) =exp

(1λ

(r(X∗t , u) + 1

2σ2(X∗t , u)V ′′(X∗t ) + b(X∗t , u)V ′(X∗t )

))∫U

exp[

(r(X∗t , u) + 1

2σ2(X∗t , u)V ′′(X∗t ) + b(X∗t , u)V ′(X∗t )

)]du,

(17)where X∗t , t ≥ 0, solves (6) when the feedback control distribution π∗t (·;X∗t )is applied and assuming that π∗t ∈ A(x).

The formula (17) elicits qualitative understanding about an optimal explo-ration. We further investigate this in the next section.

4 The Linear–Quadratic Case

We focus on the family of entropy-regularized problems with linear state dy-namics and quadratic rewards, in which6

b(x, u) = Ax+Bu, and σ(x, u) = Cx+Du, x, u ∈ R (18)

where A,B,C,D are given constants with D 6= 0, and

r(x, u) = −(M

2x2 +Rxu+

N

2u2 + Px+Qu

), x, u ∈ R (19)

6We assume both the state and the control are scalar-valued for notational simplicity.There is no essential difficulty with these being vector-valued.

9

Page 10: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

where M ≥ 0, N > 0, R, P,Q ∈ R.In the classical control literature, this type of linear–quadratic (LQ) control

problems is one of the most important in that it admits elegant and simple solu-tions and, furthermore, more complex, nonlinear problems can be approximatedby LQ problems. As is standard with LQ control, we assume that the controlis unconstrained, namely, U = R.

Fix an initial state x ∈ R. For each measured-valued control π ∈ A(x), wedefine its mean and variance processes µt, σ

2t , t ≥ 0, namely,

µt :=

∫Ruπt(u)du and σ2

t :=

∫Ru2πt(u)du− µ2

t .

Then (6) can be rewritten as

dXπt = (AXπ

t +Bµt) dt+√C2(Xπ

t )2 + 2CDXπt µt +D2(µ2

t + σ2t ) dWt

= (AXπt +Bµt) dt+

√(CXπ

t +Dµt)2

+D2σ2t dWt, t > 0; Xπ

0 = x.

(20)Further, define

L(Xπt , πt) :=

∫Rr(Xπ

t , u)πt(u)du− λ∫Rπt(u) lnπt(u)du.

We will be working with the following assumption for the entropy-regularizedLQ problem.

Assumption 1 The discount rate satisfies ρ 2A+ C2.

Here, a bmeans a−b is sufficiently larger than 0. This assumption requiresa sufficiently large discount rate, or (implicitly) a sufficiently short planninghorizon. Such an assumption is standard in an infinite horizon problem withrunning rewards.

We are now ready to define the admissible control set A(x) for the entropy-regularized LQ problem as follows: π ∈ A(x), if

• for each t ≥ 0, πt ∈ P(R) a.s.;

• for each A ∈ B(R), πt (A) , t ≥ 0 is Ft-progressively measurable;

• for each t ≥ 0, E[∫ t

0

(µ2s + σ2

s

)ds]<∞;

• there exists δ ∈ (0, ρ− (2A+ C2)), such that

lim infT→∞

e−δT∫ T

0

E[µ2t + σ2

t

]dt = 0;

• with Xπ solving (20), E[∫∞

0e−ρt |L(Xπ

t , πt)| dt∣∣ Xπ

0 = x]<∞.

10

Page 11: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Under the above conditions, it is immediate that for any π ∈ A(x), both thedrift and volatility terms of (20) satisfy a global Lipschitz condition in the statevariable; hence the SDE (20) admits a unique strong solution Xπ.

Next, we solve the entropy-regularized stochastic LQ problem

V (x) = supπ∈A(x)

E[∫ ∞

0

e−ρt(∫

Rr(Xπ

t , u)πt(u)du− λ∫Rπt(u) lnπt(u)du

)dt

](21)

with r as in (19) and Xπ as in (20).In the following two subsections, we respectively derive explicit solutions for

the case of state-independent reward and of general state-dependent reward,respectively.

4.1 The case of state-independent reward

We start with the technically less challenging case when r(x, u) = −(N2 u

2 +Qu),

namely, the reward is state independent. In this case, the system dynamicsbecome irrelevant. However, the problem is still interesting in its own rightas it corresponds to the state-independent RL problem, which is known as thecontinuous-armed bandit problem in the continuous-time setting (see, e.g., Man-delbaum (1987); Kaspi and Mandelbaum (1998)).

Following the derivation in the previous section, the optimal control distri-bution in feedback form, (16), reduces to

π∗ (u;x) =exp

(1λ

((−N2 u

2 +Qu)

+ 12 (Cx+Du)2v′′(x) + (Ax+Bu)v′(x)

))∫R exp

(1λ

((−N2 u2 +Qu

)+ 1

2 (Cx+Du)2v′′(x) + (Ax+Bu)v′(x)))du

=

exp

(−(u− CDxv′′(x)+Bv′(x)−Q

N−D2v′′(x)

)2

/ 2λN−D2v′′(x)

)∫R exp

(−(u− CDxv′′(x)+Bv′(x)−Q

N−D2v′′(x)

)2

/ 2λN−D2v′′(x)

)du

. (22)

So the optimal control distribution appears to be Gaussian. More specif-ically, at any present state x, the agent should embark exploration accordingto the Gaussian distribution with mean and variance given, respectively, byCDxv′′(x)+Bv′(x)−Q

N−D2v′′(x) and λN−D2v′′(x) . We note that it is required in the expres-

sion (22) that N −D2v′′(x) > 0, x ∈ R, a condition that will be justified anddiscussed later on.

Remark 2 If we examine the derivation of (22) more closely, we easily see thatthe optimality of Gaussian distribution still holds so long as the state dynamics islinear in control and the reward is quadratic in control whereas their dependenceon state can be generally nonlinear.

11

Page 12: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Substituting (22) back to (13), the HJB equation becomes, after straightfor-ward calculations,

ρv(x) =(CDxv′′(x)+Bv′(x)−Q)

2

2(N−D2v′′(x)) + λ2

(ln(

2πeλN−D2v′′(x)

)− 1)

+ 12C

2x2v′′(x) +Axv′(x).

(23)

In general, this nonlinear equation has multiple smooth solutions, even amongquadratic polynomials that satisfy N − D2v′′(x) > 0. One such solution is aconstant, given by

v (x) ≡ v =Q2

2ρN+

λ

(ln

2πeλ

N− 1

), (24)

with the corresponding optimal control distribution (22) being

π∗ (u;x) =e−(u+Q

N )2/ 2λN∫

R e−(u+Q

N )2/ 2λN du

.

Note that the classical LQ problem with the state-independent reward func-tion r(x, u) = −

(N2 u

2 +Qu)

clearly has the optimal control u∗ = −QN , which is

the mean of the optimal Gaussian control distribution π∗. The following resultestablishes that this constant v is indeed the value function V .

Henceforth, we denote, for notational convenience, by N (·|µ, σ2) the dis-tribution function of a Gaussian random variable with mean µ and varianceσ2.

Theorem 3 If r(x, u) = −(N2 u

2 +Qu), then the value function in (21) is

given by

V (x) =Q2

2ρN+

λ

(ln

2πeλ

N− 1

), x ∈ R,

and the optimal control distribution is Gaussian: π∗t (u) = N(u∣∣− Q

N ,λN

),

t ≥ 0. The associated optimal state process solves the SDE

dX∗t =

(AX∗t −

BQ

N

)dt+

√(CX∗t −

DQ

N

)2

+λD2

NdWt, X

∗0 = x, (25)

which can be explicitly expressed as follows:i) If C 6= 0, then

X∗t = F (Wt, Yt), t ≥ 0, (26)

where the function

F (z, y) =

√λ

N

∣∣∣∣DC∣∣∣∣ sinh

(|C|z + sinh(−1)

(√N

λ

∣∣∣∣CD∣∣∣∣ (y − DQ

CN

)))+DQ

CN,

12

Page 13: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

and the process Yt, t ≥ 0, is the unique pathwise solution to the random ordinarydifferential equation

dYtdt

=

AF (Wt, Yt)− BQN −

12

((CF (Wt, Yt)− DQ

N

)2

+ λD2

N

)∂∂yF (z, y)

∣∣z=Wt,y=Yt

, Y0 = x.

ii) If C = 0, A 6= 0, then

X∗t = xeAt − BQ

AN(1− eAt) +

|D|N

√Q2 + λN

∫ t

0

eA(t−s)dWs, t ≥ 0. (27)

Proof. See Appendix A.

The above solution suggests that when the reward is independent of thestate, so is the optimal control distribution N (· | − Q

N ,λN ). This is intuitive as

objective (12) does not explicitly distinguish between states.7

A remarkable feature of the derived optimal distributionN (· |− QN ,

λN ) is that

the mean coincides with the optimal control of the original non-exploratory LQproblem, whereas the variance is determined by the parameter λ. In the contextof continuous-armed bandit problem, the mean is concentrated on the currentincumbent of the best arm and the variance is determined by the parameter. Themore weight put on the level of exploration, the more spread out the explorationaround the current best arm. This type of exploration/exploitation strategiesis clearly intuitive, and in turn, it gives a guidance on how to actually choosethe parameter in practice: it is nothing else than the variance of the explorationthe agent wishes to engage.

However, we shall see in the next section that when the reward depends onlocal state, the HJB equation does not admit a constant solution. Consequently,the optimal control distribution must be of a feedback form, depending on wherethe state is at any given time.

7Similar observation can be made for the (state-independent) pure entropy maximizationformulation, where the goal is to solve

supπ∈A(x)

E[−∫ ∞0

e−ρt(∫

Uπt(u) lnπt(u)du

)dt∣∣∣Xπ

0 = x

]. (28)

This problem becomes relevant when λ ↑ ∞ in the entropy-regularized objective (21), corre-sponding to the extreme case when the least informative (or the highest entropy) distributionis favored for pure exploration without considering exploitation (i.e., without maximizing anyrewards). To solve problem (28), we can pointwisely maximize the integrand there, leadingto the state-independent optimization problem

supπ∈P(U)

(−∫Uπ(u) lnπ(u)du

). (29)

If the action space U is a finite interval, say [a, b] on R, then it is straightforward that theoptimal control distribution π∗ is, for all t ≥ 0, the uniform distribution on the interval [a, b].This is in accordance with the traditional static setting where uniform distribution achievesmaximum entropy on any finite interval (see, e.g., Shannon (2001)).

13

Page 14: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

4.2 The case of state-dependent reward

We now consider the case of a general state-dependent reward, i.e.,

r(x, u) = −(M

2x2 +Rxu+

N

2u2 + Px+Qu

), x, u ∈ R.

We start with the following lemma that will be used for the verification argu-ments.

Lemma 4 For each π ∈ A(x), x ∈ R, the unique solution Xπt , t ≥ 0, to the

SDE (20) satisfies

lim infT→∞

e−ρTE[(Xπ

T )2]

= 0. (30)

Proof. See Appendix B.

Following an analogous argument as in (22), we deduce that

π∗(u;x) := N(u∣∣∣ CDxv′′(x) +Bv′(x)−Rx−Q

N −D2v′′(x),

λ

N −D2v′′(x)

).

Then, denoting by µ∗(x) and (σ∗(x))2 the mean and variance of π∗ givenabove, we have

ρv(x) =∫R−

(M2 x

2 +Rxu+ N2 u

2 + Px+Qu)N(u∣∣µ∗(x), (σ∗(x))2

)du

+λ ln(√

2πeσ∗(x))

+ v′(x)∫R(Ax+Bu)N

(u∣∣µ∗(x), (σ∗(x))2

)du

+ 12v′′(x)

∫R(Cx+Du)2N

(u∣∣µ∗(x), (σ∗(x))2

)du

= −M2 x2 − N

2

((CDxv′′(x)+Bv′(x)−Rx−Q

N−D2v′′(x)

)2

+ λN−D2v′′(x)

)−(Rx+Q)CDxv

′′(x)+Bv′(x)−Rx−QN−D2v′′(x) − Px+ λ ln

√2πeλ

N−D2v′′(x)

+Axv′(x) +Bv′(x)CDxv′′(x)+Bv′(x)−Rx−QN−D2v′′(x) + 1

2C2x2v′′(x)

+ 12D

2

((CDxv′′(x)+Bv′(x)−Rx−Q

N−D2v′′(x)

)2

+ λN−D2v′′(x)

)v′′(x)

+CDxv′′(x)CDxv′′(x)+Bv′(x)−Rx−QN−D2v′′(x) .

Reorganizing, we arrive at the HJB equation

ρv(x) =(CDxv′′(x)+Bv′(x)−Rx−Q)

2

2(N−D2v′′(x)) + λ2

(ln(

2πeλN−D2v′′(x)

)− 1)

+ 12 (C2v′′(x)−M)x2 + (Av′(x)− P )x.

(31)

14

Page 15: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Under Assumption 1 and the additional condition R2 < MN , one smooth solu-tion to the HJB equation (31) is given by v(x) = 1

2k2x2 + k1x+ k0 where8

k2 := 12

(ρ−(2A+C2))N+2(B+CD)R−D2M(B+CD)2+(ρ−(2A+C2))D2

− 12

√((ρ−(2A+C2))N+2(B+CD)R−D2M)2−4((B+CD)2+(ρ−(2A+C2))D2)(R2−MN)

(B+CD)2+(ρ−(2A+C2))D2 ,

(35)

k1 :=P (N − k2D

2)−QRk2B(B + CD) + (A− ρ)(N − k2D2)−BR

, (36)

and

k0 :=(k1B −Q)2

2ρ(N − k2D2)+

λ

(ln

(2πeλ

N − k2D2

)− 1

). (37)

For the particular solution given by v(x) = 12k2x

2 + k1x+ k0, we can verifythat k2 < 0, due to Assumption 1 and R2 < MN (which holds automatically ifR = 0, M > 0 and N > 0). Hence v is concave, a property that is essential inproving that it is actually the value function. On the other hand, N−D2v′′(x) =N − k2D

2 > 0, ensuring that k0 is well defined.

Theorem 5 Suppose D 6= 0, R2 < MN and the reward function is given by

r(x, u) = −(M

2x2 +Rxu+

N

2u2 + Px+Qu

).

Furthermore, suppose that Assumption 1 holds. Then, the value function isgiven by

V (x) =1

2k2x

2 + k1x+ k0, x ∈ R, (38)

where k2, k1 and k0 are as in (35), (36) and (37), respectively. Moreover, theoptimal control distribution is Gaussian, given in a feedback form by

π∗t (u) = N(u

∣∣∣∣ (k2(B + CD)−R)X∗t + k1B −QN − k2D2

N − k2D2

), t ≥ 0.

(39)

8The values of k2, k1 and k0 can be found by substituting v(x) = 12k2x2 + k1x + k0

into (31) and solving the resulting equation in k2, k1 and k0. On the other hand, thereare multiple solutions to (31). Indeed, applying an generic quadratic function ansatz v(x) =12a2x2 + a1x+ a0, x ∈ R in (31) yields a system of algebraic equations

ρa2 =(a2(B + CD)−R)2

N − a2D2+ a2(2A+ C2)−M, (32)

ρa1 =(a1B −Q)(a2(B + CD)−R)

N − a2D2+ a1A− P, (33)

ρa0 =(a1B −Q)2

2(N − a2D2)+λ

2

(ln

(2πeλ

N − a2D2

)− 1

). (34)

This system has in general multiple solutions, leading to multiple quadratic solutions to theHJB equation (31).

15

Page 16: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

Finally, the optimal state process solves the SDE

dX∗t =

[(A+

B(k2(B + CD)−R)

N − k2D2

)X∗t +

B(k1B −Q)

N − k2D2

]dt

+

√[(C +

D(k2(B + CD)−R)

N − k2D2

)X∗t +

D(k1B −Q)

N − k2D2

]2

+λD2

N − k2D2dWt,

(40)X∗0 = x, which can be expressed as follows:

i) if C + D(k2(B+CD)−R)N−k2D2 6= 0, then

X∗t = F (Wt, Yt), t ≥ 0, (41)

where the function

F (z, y) =

√D

|C1|sinh

(|C1|z + sinh(−1)

(|C1|√D

(y +

C2

C1

)))− C2

C1

,

and the process Yt, t ≥ 0, is the unique pathwise solution to the random ordinarydifferential equation

dYtdt

=

AF (Wt, Yt) + B − 12

((C1F (Wt, Yt) + C2

)2

+ D2

)∂∂yF (z, y)

∣∣z=Wt,y=Yt

, Y0 = x,

with A := A+ B(k2(B+CD)−R)N−k2D2 , B := B(k1B−Q)

N−k2D2 , C1 := C + D(k2(B+CD)−R)N−k2D2 ,

C2 := D(k1B−Q)N−k2D2 and D := λD2

N−k2D2 .

ii) If C + D(k2(B+CD)−R)N−k2D2 = 0, A 6= 0, then

X∗t = xeAt − B

A(1− eAt) +

√C1

2+ D

∫ t

0

eA(t−s)dWs, t ≥ 0. (42)

Proof. See Appendix C.

For the state-dependent reward case, we see that actions over U (U = R)also depend on the current state X∗t , which are selected according to a state-dependent Gaussian distribution (39). Compared with the state-independentcase studied earlier, the variance of this distribution is smaller: λ

N−k2D2 <λN

because N − k2D2 > N (recall that k2 < 0). Moreover, the greater D the

smaller the variance, indicating less exploration is required. Recall that D isthe coefficient of the control in the diffusion term of the state dynamics. So theneed for exploration is reduced if an action has a greater impact on volatility

16

Page 17: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

of the system dynamics. This hints that a more volatile environment rendersmore learning opportunities.

On the other hand, the mean of the Gaussian distribution does not explicitlydepend on λ. The implication is that the agent should concentrate on the mostpromising region in the action space while randomly selecting actions to interactwith the unknown environment. It is intriguing that the entropy-regularized RLformulation separates the exploitation from exploration, through respectivelythe mean and variance of the resulting optimal Gaussian distribution.

Finally, as noted earlier (see Remark 2), the optimality of the Gaussiandistribution is still valid for problems with dynamics

dxt = (A(xt) +B(xt)ut) dt+ (C(xt) +D(xt)ut) dWt,

and reward function in the form r(x, u) = r2(x)u2 + r1(x)u + r0(x), where thefunctions A,B,C,D, r2, r1 and r0 are possibly nonlinear.

5 The Cost and Effect of Exploration

Motivated by the necessity of exploration in face of the typically unknown en-vironment in a RL setting, we have formulated and analyzed a new class ofstochastic control problems that combine entropy-regularized criteria and re-laxed controls. We have also derived explicit solutions and presented verificationresults for the representative LQ problems. A natural question is to quantify thecost and effect of the exploration. This can be done by comparing our resultsto the ones on the classical stochastic LQ problems which have neither entropyregularization nor control relaxation.

We carry out this comparative analysis next.

5.1 The classical LQ problem

We first recall the classical stochastic LQ control problem in an infinite horizonwith discounted reward. Let Wt, t ≥ 0 be a standard Brownian motiondefined on the filtered probability space (Ω,F ,F = Ftt≥0,P) that satisfies theusual conditions. The controlled state process x = xt, t ≥ 0 solves

dxt = (Axt +But) dt+ (Cxt +Dut) dWt , t ≥ 0, x0 = x, (43)

with given constants A,B,C and D, and the process u = ut, t ≥ 0 being a(standard, non-relaxed) control.

The value function is defined as in (2),

w(x) := supU

E[∫ ∞

0

e−ρtr(xt, ut) dt

∣∣∣∣ x0 = x

],

for x ∈ R, where the reward function r(·, ·) is given by (19).

17

Page 18: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

The associated HJB equation is

ρw(x) = maxu∈R

(r(x, u) +

1

2(Cx+Du)2w′′(x) + (Ax+Bu)w

′(x)

)

= maxu∈R

(−1

2

(N −D2w′′(x)

)u2 +

(CDxw′′(x) +Bw′(x)−Rx−Q

)u

)+

1

2(C2w′′(x)−M)x2 + (Aw′(x)− P )x

=(CDxw′′(x) +Bw′(x)−Rx−Q)

2

2(N −D2w′′(x))+

1

2(C2w

′′(x)−M)x2 + (Aw′(x)− P )x,

(44)with the maximizer being

u∗(x) =CDxw′′(x) +Bw′(x)−Rx−Q

N −D2w′′(x), x ∈ R.

The optimality of u∗ follows easily if N −D2w′′(x) > 0 holds.To rigorously demonstrate that a quadratic function, say w(x) = 1

2 a2x2 +

a1x+ a0, x ∈ R, with N − a2D2 > 0, which solves the HJB equation (44), is in-

deed the value function and, more generally, to establish a verification theoremfor the above feedback optimal policy, extra conditions on the model param-eters and restrictions on the admissible strategies are needed, similar to theentropy-regularized LQ problem case. In the next section, we establish solvabil-ity equivalence between the standard LQ problem and the entropy-regularizedLQ problem.

5.2 Equivalence of solvability of classical and exploratorymodels

Given a reward function r(·, ·) and a classical controlled process (1), the re-laxed formulation (6) under the entropy-regularized objective typically makesthe problem more technically challenging to solve, compared to the correspond-ing classical stochastic optimization problem.

In this section, we demonstrate that there is equivalence between the solv-ability of the exploratory stochastic LQ problem and that of its classical coun-terpart, in the sense that the value function and optimal control of one problemlead immediately to those of the other. Such equivalence enables us to read-ily establish the convergence result as the exploration weight λ decays to zero.Furthermore, it makes it possible to quantify the exploration cost to be definedin the sequel.

Theorem 6 The following two statements (a) and (b) are equivalent.

18

Page 19: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

(a) The function V (x) = 12α2x

2 + α1x+ α0 + λ2ρ

(ln(

2πeλN−α2D2

)− 1)

, x ∈ R,

with α2 < 0, solves the HJB equation (31) and the corresponding optimalGaussian distribution is given by

π∗t = N(u∣∣∣ (α2(B + CD)−R)X∗t + α1B −Q

N − α2D2,

λ

N − α2D2

),

with X∗t , t ≥ 0 being the corresponding optimal controlled state processsolving (20).

(b) The function w(x) = 12α2x

2 + α1x + α0, x ∈ R, with α2 < 0, solves theHJB equation (44), and the corresponding optimal control is given by

u∗t =(α2(B + CD)−R)x∗t + α1B −Q

N − α2D2,

with x∗t , t ≥ 0 being the corresponding classical controlled state processsolving (43).

Proof. See Appendix D.

The above equivalence between statement (a) and (b), together with theverification arguments for the entropy-regularized and classical stochastic LQproblems, yield that if one problem is solvable, so is the other, and conversely,if one is not solvable, neither is the other.

5.3 Cost of exploration

We define the exploration cost for a general RL problem to be the differencebetween the discounted accumulated rewards following the respective optimalstrategies under the classical objective (2) and the exploratory objective (12).Note that the equivalence between solvability established in the previous sub-section is important for this definition since the cost can be well defined only ifboth the classical and the exploratory problems are solvable.

Specifically, let the classical maximization problem (2) with the state dy-namics (1) have the value function w(·) and an optimal strategy u∗t , t ≥ 0,and the corresponding exploratory problem have the value function V (·) and anoptimal control distribution π∗t , t ≥ 0. Then, we define the exploration costas

Cu∗,π∗(x) := w(x)−

(V (x) + λE

[∫ ∞0

e−ρt(∫

U

π∗t (u) lnπ∗t (u)du

)dt

∣∣∣∣X0 = x

]),

(45)for x ∈ D.

The term in the parenthesis represents the total discounted rewards incurredby π∗ after taking out the contribution of the entropy term to the value functionV (·) for the exploratory problem. The exploration cost hence measures the best

19

Page 20: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

outcome due to the explicit inclusion of exploratory strategies in the entropy-regularized objective, relative to the benchmark w(·) that is the best possibleobjective value should the model be fully known.

We next compute the exploration cost for the LQ case. This cost is surpris-ingly simple: it depends only on two “agent-specific” parameters: the explo-ration weight parameter λ and the discounting parameter ρ.

Indeed, if statement (a) (or (b), equivalently) holds, then both the aug-mented and classical stochastic LQ problems are solvable, leading to the explo-ration cost given by, for x ∈ R,

Cu∗,π∗(x) = w(x)−

(V (x)− λ

2ρln

(2πeλ

N − α2D2

))=

λ

2ρ,

where we have used the fact that the differential entropy of Gaussian distributionN (· |µ, σ2) is ln(σ

√2πe), as well as V (·), π∗ in (a), and w(·) in (b).

So, the exploration cost for stochastic LQ problems can be completely pre-determined by the learning agent through choosing her individual λ and ρ,since it does not rely on the specific (unknown) linear state dynamics, nor thequadratic reward structure.

Moreover, the exploration cost depends on λ and ρ in a rather intuitive way:it increases as λ increases, due to more emphasis placed on exploration, or as ρdecreases, indicating an effectively longer horizon for exploration.9

5.4 Vanishing exploration

Another desirable result is when the weight on the exploration weight λ goes tozero, the entropy-regularized LQ problem converges to its classical counterpart.The following result makes this precise.

Proposition 7 Assume that statement (a) (or equivalently, (b)) holds. Then,for each t ≥ 0,

limλ→0

π∗t = δu∗t weekly. (46)

Moreover, the associated value functions converge pointwisely, i.e., for each x ∈R, limλ→0 |V (x)− w(x)| = 0.

Proof. See Appendix E.

9The connection between a discounting parameter and an effective length of time horizonis well known in the discrete-time discounted reward formulation E[

∑t≥0 γ

tRt] for classical

Markov Decision Processes (see, e.g., Derman (1970)). This infinite horizon discounted prob-lem can be viewed as an undiscounted, finite horizon problem with a random termination timeT that is geometrically distributed with parameter 1− γ. Hence, an effectively longer horizonwith mean 1

1−γ is applied to the optimization problem as γ increases. Since a smaller ρ in the

continuous-time objective (2) or (12) corresponds to a larger γ in the discrete-time objective,we can see the similar effect of a decreasing ρ on the effective horizon of continuous-timeproblems.

20

Page 21: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

6 Conclusions

The main contributions of this paper are conceptual: casting the RL problem ina continuous-time setting and with the aid of stochastic control and stochasticcalculus, we interpret and explain why Gaussian distribution is best for ex-ploration in RL. This finding is independent of the specific parameters of theunderlying dynamics and reward function structure, so long as the dependenceon actions is linear in the former and quadratic in the latter. The same can besaid about other important results, such as the separation between explorationand exploitation in the mean and variance of the resulting Gaussian distribution,and the cost of exploration. The explicit form of the optimal Gaussian does in-deed depend on the model specifications which are unknown in the RL context.In implementing RL algorithms based on our results, we will need to turn theproblem into an MDP problem by discretizing the time, and then learn the pa-rameters of the optimal Gaussian distribution following standard RL procedure(e.g. the so-called Q-learning). For that our result is again useful: it suggeststhat we only need to learn among the class of simpler Gaussian policies, i.e.,π = N (· |θ1x + θ2, φ) (cf. (39)), rather than generic (nonlinear) parametrizedGaussian policy πθ,φ = N (· |θ(x), φ(x)). We expect that this simpler functionform can considerably increase the learning speed.

Appendix A: Proof of Theorem 3

Proof. Following the HJB equation (13), we deduce that for any π ∈ A(x),x ∈ R,

ρv ≥∫R−(N

2u2 +Qu

)πt(u)du− λ

∫Rπt(u) lnπt(u)du. (47)

Indeed, one can directly verify the above inequality by recalling the constrainedproblem (15) for the maximization of the right hand side of (47). In turn,examining the first and second variation of the associated unconstrained problem

L(πt) =

∫R−(N

2u2 +Qu

)πt(u)du−λ

∫Rπt(u) lnπt(u)du+β

(∫Rπt(u)du− 1

),

for a Lagrange multiplier β ∈ R, give the maximizer

π∗t ≡ π∗ = N(u∣∣− Q

N,λ

N

), t ≥ 0,

at which the equality in (47) is also attained. Note that, since (πt)t≥0 ∈ A(x),L(πt) is well defined. The first and second order differentiation with respect toε under the integral in L(πt + εh) are justified by the dominated convergencetheorem, if h ∈ P(R) is such that the first and second variations are integrable.In turn, for any π ∈ A(x) and T ≥ 0,

e−ρT v = v −∫ T

0

e−ρtρvdt

21

Page 22: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

≤ v + E

[∫ T

0

e−ρt(∫

R

(N

2u2 +Qu

)πt(u)du+ λ

∫Rπt(u) lnπt(u)du

)dt

].

Since π ∈ A(x), the dominated convergence theorem yields, as T →∞, that

v ≥ E[∫ ∞

0

e−ρt(∫

R−(N

2u2 +Qu

)πt(u)du− λ

∫Rπt(u) lnπt(u)du

)dt

]and, thus, v ≥ V (x), for x ∈ R. On the other hand,

ρv =

∫R−(N

2u2 +Qu

)π∗t (u)du− λ

∫Rπ∗t (u) lnπ∗t (u)du,

and it is also easily verified that π∗ ∈ A(x). Replacing the inequalities byequalities in the above argument and sending T to infinity, we conclude that

V (x) ≡ v =Q2

2ρN+

λ

(ln

2πeλ

N− 1

),

for x ∈ R.The augmented state dynamics (20) under π∗ take the form

dX∗t =

(AX∗t −

BQ

N

)dt+

√(CX∗t −

DQ

N

)2

+λD2

NdWt, (48)

after the substitution of µ∗t = −QN and (σ∗t )2 = λ

N , t ≥ 0, in (20).If C 6= 0, equation (48) can be solved using the Doss-Saussman transforma-

tion (see, for example, Karatzas and Shreve (1991), pp 295-297). Specifically,we deduce that (48) has a unique solution, given by

X∗t (ω) = F (Wt(ω), Yt(ω)), t ≥ 0,

for all ω ∈ Ω, with F : R2 7→ R solving the ODE

∂F

∂z=

√(CF − DQ

N

)2

+λD2

N, F (0, y) = y. (49)

The process Yt, t ≥ 0 is the unique pathwise solution to the random ODE

d

dtYt(ω) = G(Wt(ω), Yt(ω)), Y0(ω) = x, (50)

with the function

G(z, y) :=

AF (z, y)− BQN −

12

((CF (z, y)− DQ

N

)2

+ λD2

N

)∂∂yF (z, y)

.

22

Page 23: Exploration versus exploitation in reinforcement learning: A … · 2018. 12. 4. · on video games (Mnih et al. (2015)), and autonomous driving (Levine et al. (2016), Mirowski et

It is then easy to verify that both equations (49) and (50) have a unique solution.Solving (49), we obtain that

F (z, y) =

√λ

N

∣∣∣∣DC∣∣∣∣ sinh

(|C|z + sinh(−1)

(√N

λ

∣∣∣∣CD∣∣∣∣ (y − DQ

CN

)))+DQ

CN,

and, in turn, the form of the function G(z, y).If C = 0, (48) becomes

dX∗t =

(AX∗t −

BQ

N

)dt+

|D|N

√Q2 + λN dWt,

whose unique solution is given in (27) when A 6= 0.

Appendix B: Proof of Lemma 4

Proof. Since π ∈ A(x), the SDE (20) has a unique solution Xπt defined for all

t ≥ 0. Applying Ito’s formula, we obtain

d (Xπt )

2=(

(2A+ C2) (Xπt )

2+ 2(B + CD)µtX

πt +D2σ2

t

)dt

+2Xπt

√(CXπ

t +Dµt)2

+D2σ2t dWt.

Now, define the stopping times τπn , n ≥ 1,

τπn := inft ≥ 0 :

∫ t

0

(Xπs )

2(

(CXπs +Dµs)

2+D2σ2

s

)ds ≥ n

.

Then, it follows that for fixed T > 0,

E[(XπT∧τπn

)2]

= x2+E

[∫ T∧τπn

0

((2A+ C2) (Xπ

t )2

+ 2(B + CD)µtXπt +D2σ2

t

)dt

].

By the standard estimate E[sup0≤t≤T (Xπ

t )2]≤ K(1 + x2)eKT , for some con-

stant K > 0 depending only on the Lipschitz constant and the time T , we canapply the dominated convergence theorem and Fubini’s theorem, yielding, asn→∞,

E[(Xπ

T )2]

= x2 +

∫ T

0

(2A+C2)E[(Xπ

t )2]

+2(B+CD)E [µtXπt ]+D2E

[σ2t

]dt.

Using the inequality $2|\mu_t X^\pi_t| \le \varepsilon^2|X^\pi_t|^2 + \frac{1}{\varepsilon^2}|\mu_t|^2$, we deduce that
$$
\mathbb{E}\big[(X^\pi_T)^2\big] \le x^2 + \int_0^T\big(2A + C^2 + \varepsilon^2|B + CD|\big)\mathbb{E}\big[(X^\pi_t)^2\big]dt + \int_0^T\left(\frac{|B + CD|}{\varepsilon^2} + D^2\right)\mathbb{E}\big[\mu_t^2 + \sigma_t^2\big]dt.
$$

Under Assumption 1, we can choose $\varepsilon$ such that $\tilde{\rho} := 2A + C^2 + \varepsilon^2|B + CD| < \rho - \delta$. Gronwall's inequality then implies
$$
\mathbb{E}\big[(X^\pi_T)^2\big] \le \left(x^2 + C_1\int_0^T \mathbb{E}\big[\mu_t^2 + \sigma_t^2\big]dt\right)e^{\tilde{\rho}T},
$$

where $C_1 := \frac{|B + CD|}{\varepsilon^2} + D^2$ is a constant. It follows that
$$
\liminf_{T\to\infty} e^{-\rho T}\mathbb{E}\big[(X^\pi_T)^2\big] \le \liminf_{T\to\infty} e^{-(\rho - \tilde{\rho})T}\int_0^T \mathbb{E}\big[\mu_t^2 + \sigma_t^2\big]dt = 0,
$$
since $\pi \in \mathcal{A}(x)$, for $x \in \mathbb{R}$.
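The estimate above can be illustrated by a short simulation (a sketch only; the model coefficients $A, B, C, D$, the constant policy $(\mu_t, \sigma_t^2)$ standing in for an admissible one, and all numerical choices are arbitrary and not from the paper): simulate the exploratory dynamics (20) by Euler-Maruyama and compare the Monte Carlo second moment with the Gronwall-type bound, with parameters chosen so that $\tilde{\rho} \ge 0$.

```python
# Monte Carlo illustration of the second-moment bound in Lemma 4 (a sketch;
# all parameter values and the chosen constant policy are arbitrary).
import numpy as np

A, B, C, D = 0.1, 1.0, 0.2, 0.4          # model coefficients
mu_const, sig2_const = 0.7, 0.15          # a constant policy (stand-in for an admissible one)
x0, T, n_steps, n_paths = 1.0, 5.0, 2000, 20000
dt = T / n_steps

rng = np.random.default_rng(0)
X = np.full(n_paths, x0)
for _ in range(n_steps):
    drift = A * X + B * mu_const
    diff2 = (C * X + D * mu_const) ** 2 + D**2 * sig2_const   # exploratory variance
    X = X + drift * dt + np.sqrt(diff2 * dt) * rng.standard_normal(n_paths)

second_moment = np.mean(X**2)

# Gronwall-type bound: E[(X_T)^2] <= (x0^2 + C1 * int_0^T E[mu^2 + sig^2] dt) * e^{rho_tilde * T},
# with rho_tilde = 2A + C^2 + eps^2 |B + CD| chosen nonnegative here.
eps = 0.5
rho_tilde = 2 * A + C**2 + eps**2 * abs(B + C * D)
C1 = abs(B + C * D) / eps**2 + D**2
bound = (x0**2 + C1 * (mu_const**2 + sig2_const) * T) * np.exp(rho_tilde * T)
print("Monte Carlo E[(X_T)^2]:", second_moment, "  Gronwall bound:", bound)
```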

Appendix C: Proof of Theorem 5

Proof. Following the HJB equation (13), we deduce that for each $(\pi_t)_{t\ge 0} \in \mathcal{A}(x)$, $x \in \mathbb{R}$,
$$
\rho v(x) \ge r(x, \pi_t) - \lambda\int_{\mathbb{R}}\pi_t(u)\ln\pi_t(u)\,du + \frac{1}{2}v''(x)\sigma^2(x, \pi_t) + v'(x)b(x, \pi_t), \qquad (51)
$$

with $r$, $b$ and $\sigma$ given by (11), (7) and (8). Define the stopping times $\tau^\pi_n := \inf\{t \ge 0 : \int_0^t (e^{-\rho s}v'(X^\pi_s)\sigma(X^\pi_s, \pi_s))^2 ds \ge n\}$, for $n \ge 1$. Then, applying Itô's formula to $e^{-\rho t}v(X^\pi_{t\wedge\tau^\pi_n})$, $0 \le t \le T$, we obtain
$$
e^{-\rho(T\wedge\tau^\pi_n)}v(X^\pi_{T\wedge\tau^\pi_n}) = v(x) + \int_0^{T\wedge\tau^\pi_n} e^{-\rho t}\Big(-\rho v(X^\pi_t) + \frac{1}{2}v''(X^\pi_t)\sigma^2(X^\pi_t, \pi_t) + v'(X^\pi_t)b(X^\pi_t, \pi_t)\Big)dt + \int_0^{T\wedge\tau^\pi_n} e^{-\rho t}v'(X^\pi_t)\sigma(X^\pi_t, \pi_t)\,dW_t.
$$

Taking expectations and using that the stochastic integral is a true martingale yield
$$
\begin{aligned}
\mathbb{E}\Big[e^{-\rho(T\wedge\tau^\pi_n)}v(X^\pi_{T\wedge\tau^\pi_n})\Big]
&= v(x) + \mathbb{E}\left[\int_0^{T\wedge\tau^\pi_n} e^{-\rho t}\Big(-\rho v(X^\pi_t) + \frac{1}{2}v''(X^\pi_t)\sigma^2(X^\pi_t, \pi_t) + v'(X^\pi_t)b(X^\pi_t, \pi_t)\Big)dt\right] \\
&\le v(x) - \mathbb{E}\left[\int_0^{T\wedge\tau^\pi_n} e^{-\rho t}\Big(r(X^\pi_t, \pi_t) - \lambda\int_{\mathbb{R}}\pi_t(u)\ln\pi_t(u)\,du\Big)dt\right],
\end{aligned}
$$


where we have also used the inequality (51). From classical results, we have that $\mathbb{E}\big[\sup_{0\le t\le T}|X^\pi_t|^2\big] \le K(1 + x^2)e^{KT}$, for some constant $K > 0$ depending only on the Lipschitz constant for (20) and $T > 0$. Letting $n \to \infty$ then yields
$$
\mathbb{E}\big[e^{-\rho T}v(X^\pi_T)\big] \le v(x) - \mathbb{E}\left[\int_0^T e^{-\rho t}\Big(r(X^\pi_t, \pi_t) - \lambda\int_{\mathbb{R}}\pi_t(u)\ln\pi_t(u)\,du\Big)dt\right],
$$

where we have used the dominated convergence theorem and that $\pi \in \mathcal{A}(x)$. Furthermore, Lemma 4 gives that $\liminf_{T\to\infty} e^{-\rho T}\mathbb{E}\big[(X^\pi_T)^2\big] = 0$. Together with $k_2 < 0$, this implies that $\limsup_{T\to\infty}\mathbb{E}\big[e^{-\rho T}v(X^\pi_T)\big] \ge 0$. Applying the dominated convergence theorem once more yields
$$
v(x) \ge \mathbb{E}\left[\int_0^\infty e^{-\rho t}\Big(r(X^\pi_t, \pi_t) - \lambda\int_{\mathbb{R}}\pi_t(u)\ln\pi_t(u)\,du\Big)dt\right],
$$

for each $x \in \mathbb{R}$ and $\pi \in \mathcal{A}(x)$. Hence, $v(x) \ge V(x)$, for all $x \in \mathbb{R}$. On the other hand, we deduce that the right hand side of (51) is maximized at
$$
\pi^*(u; x) = \mathcal{N}\left(u \,\Big|\, \frac{CDx\,v''(x) + Bv'(x) - Rx - Q}{N - D^2v''(x)},\ \frac{\lambda}{N - D^2v''(x)}\right).
$$

Therefore,
$$
\mathbb{E}\big[e^{-\rho T}v(X^*_T)\big] = v(x) - \mathbb{E}\left[\int_0^T e^{-\rho t}\Big(r(X^*_t, \pi^*_t) - \lambda\int_{\mathbb{R}}\pi^*_t(u)\ln\pi^*_t(u)\,du\Big)dt\right].
$$

Using that $k_2 < 0$ implies $\liminf_{T\to\infty}\mathbb{E}\big[e^{-\rho T}v(X^*_T)\big] \le 0$, and applying the dominated convergence theorem yields
$$
v(x) \le \mathbb{E}\left[\int_0^\infty e^{-\rho t}\Big(r(X^*_t, \pi^*_t) - \lambda\int_{\mathbb{R}}\pi^*_t(u)\ln\pi^*_t(u)\,du\Big)dt\right],
$$
for any $x \in \mathbb{R}$, provided that $\pi^* \in \mathcal{A}(x)$. This proves that $v(\cdot)$ is indeed the value function.

It remains to show that $\pi^* \in \mathcal{A}(x)$. First, we verify that there exists $\delta \in (0, \rho - (2A + C^2))$, such that
$$
\liminf_{T\to\infty} e^{-\delta T}\int_0^T \mathbb{E}\big[(\mu^*_t)^2 + (\sigma^*_t)^2\big]dt = 0.
$$
By the form of $\pi^*$ as in (39), it suffices to show that
$$
\liminf_{T\to\infty} e^{-\delta T}\mathbb{E}\big[(X^*_T)^2\big] = 0, \qquad (52)
$$

with $X^*$ solving the SDE (40). Itô's formula yields, for any $T \ge 0$,
$$
(X^*_T)^2 = x^2 + \int_0^T\Big(2\big(\tilde{A}X^*_t + \tilde{B}\big)X^*_t + \big(\tilde{C}_1X^*_t + \tilde{C}_2\big)^2 + \tilde{D}^2\Big)dt + \int_0^T 2X^*_t\sqrt{\big(\tilde{C}_1X^*_t + \tilde{C}_2\big)^2 + \tilde{D}^2}\;dW_t, \qquad (53)
$$
where $\tilde{A}$, $\tilde{B}$, $\tilde{C}_1$, $\tilde{C}_2$ and $\tilde{D}$ denote the constant coefficients of the SDE (40).

Following similar arguments as in Lemma 8 in Appendix F, we can show that the leading terms of $\mathbb{E}[(X^*_T)^2]$ are, as $T \to \infty$, $e^{(2\tilde{A} + \tilde{C}_1^2)T}$ and $e^{\tilde{A}T}$. On the other hand, the case $2\tilde{A} + \tilde{C}_1^2 \le \tilde{A}$ implies $\tilde{A} \le 0$ and, hence, (52) easily follows for any $\delta > 0$. Therefore, it suffices to study the dominant term $e^{(2\tilde{A} + \tilde{C}_1^2)T}$. We have

$$
\begin{aligned}
2\tilde{A} + \tilde{C}_1^2 - \delta
&= 2A + \frac{2B\big(k_2(B+CD) - R\big)}{N - k_2D^2} + \left(C + \frac{D\big(k_2(B+CD) - R\big)}{N - k_2D^2}\right)^2 - \delta \\
&= 2A + C^2 - \delta + \frac{2(B+CD)\big(k_2(B+CD) - R\big)}{N - k_2D^2} + \frac{D^2\big(k_2(B+CD) - R\big)^2}{(N - k_2D^2)^2} \\
&= 2A + C^2 - \delta + \frac{k_2(2N - k_2D^2)(B+CD)^2}{(N - k_2D^2)^2} - \frac{2RN(B+CD) - D^2R^2}{(N - k_2D^2)^2}. \qquad (54)
\end{aligned}
$$

Notice that the first fraction is nonpositive due to $k_2 < 0$, while the second fraction is bounded for any $k_2 < 0$. Therefore, under Assumption 1 on the range of $\rho$, we can find $\delta \in (0, \rho - (2A + C^2))$ that is large enough so that $2\tilde{A} + \tilde{C}_1^2 - \delta < 0$. This proves the desired result (52).
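Since the second equality in (54) is easy to mis-transcribe, here is a symbolic check (a sketch; the $-\delta$ term appears on both sides and is omitted, and it assumes the last two fractions share the denominator $(N - k_2D^2)^2$, which is what the expansion gives).

```python
# Symbolic check of the algebraic identity behind (54) (a sketch; it assumes
# the last two fractions have the common denominator (N - k2*D**2)**2).
import sympy as sp

A, B, C, D, N, R, k2 = sp.symbols('A B C D N R k2', real=True)
m = N - k2 * D**2                      # common denominator
k = k2 * (B + C * D) - R               # numerator of the feedback gain

lhs = 2 * (A + B * k / m) + (C + D * k / m)**2          # 2*A_tilde + C_tilde_1**2
mid = 2 * A + C**2 + 2 * (B + C * D) * k / m + D**2 * k**2 / m**2
rhs = (2 * A + C**2
       + k2 * (2 * N - k2 * D**2) * (B + C * D)**2 / m**2
       - (2 * R * N * (B + C * D) - D**2 * R**2) / m**2)

print(sp.simplify(lhs - mid))   # expected: 0
print(sp.simplify(mid - rhs))   # expected: 0
```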

Next, we establish the admissibility constraint $\mathbb{E}\big[\int_0^\infty e^{-\rho t}\,|L(X^*_t, \pi^*_t)|\,dt\big] < \infty$. Using the definition of $L$ and the form of $r(x, u)$ yields
$$
\begin{aligned}
\mathbb{E}\left[\int_0^\infty e^{-\rho t}\,\big|L(X^*_t, \pi^*_t)\big|\,dt\right]
&= \mathbb{E}\left[\int_0^\infty e^{-\rho t}\left|\int_{\mathbb{R}} r(X^*_t, u)\pi^*_t(u)\,du - \lambda\int_{\mathbb{R}}\pi^*_t(u)\ln\pi^*_t(u)\,du\right|dt\right] \\
&= \mathbb{E}\left[\int_0^\infty e^{-\rho t}\left|\int_{\mathbb{R}} -\Big(\frac{M}{2}(X^*_t)^2 + RX^*_t u + \frac{N}{2}u^2 + PX^*_t + Qu\Big)\pi^*_t(u)\,du + \frac{\lambda}{2}\ln\big(2\pi e(\sigma^*_t)^2\big)\right|dt\right].
\end{aligned}
$$
Notice that, according to (39), the mean of $\pi^*$ is linear in $X^*$, while its variance is constant. It is hence sufficient to prove that $\mathbb{E}\big[\int_0^\infty e^{-\rho t}(X^*_t)^2dt\big] < \infty$, which follows easily since, as shown in (54), $\rho > 2\tilde{A} + \tilde{C}_1^2$ under Assumption 1. The remaining admissibility conditions for $\pi^*$ are easily checked. Finally, the unique strong solution (41) is derived from the Doss-Sussmann transformation as in the proof of Theorem 3.
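As an illustration of the first-order condition used in this proof (a sketch, not part of the paper; the coefficient values, including the quadratic value function $v(x) = \frac{k_2}{2}x^2 + k_1x + k_0$ with $k_2 < 0$, are arbitrary assumptions subject to $N - D^2v''(x) > 0$), one can numerically maximize the right-hand side of (51) over Gaussian candidates $\mathcal{N}(m, s^2)$ and compare the result with the closed-form mean and variance of $\pi^*(\cdot\,; x)$.

```python
# Sketch: check numerically that the Gaussian given in the proof maximizes the
# right-hand side of (51) over Gaussian candidates N(m, s2).  All values,
# including the quadratic value function v, are arbitrary assumptions.
import numpy as np
from scipy.optimize import minimize

A, B, C, D = 0.3, 1.0, 0.2, 0.5              # state dynamics coefficients
M_, R, N_, P, Q = 1.0, 0.3, 2.0, 0.1, 0.4    # reward coefficients
lam, x = 0.5, 1.2
k2, k1 = -0.8, 0.2                           # v(x) = k2/2 x^2 + k1 x + k0
v1, v2 = k2 * x + k1, k2                     # v'(x), v''(x)

def neg_rhs(params):
    m, log_s2 = params
    s2 = np.exp(log_s2)
    r_bar = -(M_ / 2 * x**2 + R * x * m + N_ / 2 * (m**2 + s2) + P * x + Q * m)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    b_bar = A * x + B * m                    # averaged drift
    sigma2_bar = (C * x + D * m) ** 2 + D**2 * s2   # averaged squared diffusion
    return -(r_bar + lam * entropy + 0.5 * v2 * sigma2_bar + v1 * b_bar)

res = minimize(neg_rhs, x0=[0.0, 0.0])
mean_closed = (C * D * x * v2 + B * v1 - R * x - Q) / (N_ - D**2 * v2)
var_closed = lam / (N_ - D**2 * v2)
print("numerical   mean, var:", res.x[0], np.exp(res.x[1]))
print("closed form mean, var:", mean_closed, var_closed)
```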

Appendix D: Proof of Theorem 6

Proof. We only show that (a) $\Rightarrow$ (b), as the other direction follows with similar arguments. First, a comparison between the two HJB equations (31) and (44) yields that if $V(\cdot)$ in (a) solves the former, then $w(\cdot)$ in (b) solves the latter, and vice versa.

The optimality of the classical control $u^*_t$ in (b) follows from the first-order condition, together with the fact that $\alpha_2 < 0$ by the assumption in (a). It remains to show that the candidate optimal control policy $\{u^*_t, t \ge 0\}$ is indeed admissible. We first compute $\mathbb{E}[(X^*_T)^2]$ and $\mathbb{E}[(x^*_T)^2]$.

To this end, recall the exploratory state dynamics of $X^*$ from (20) under the Gaussian policy $\pi^*$ in (a):

$$
\begin{aligned}
dX^*_t &= \left(AX^*_t + B\,\frac{(\alpha_2(B+CD) - R)X^*_t + \alpha_1 B - Q}{N - \alpha_2D^2}\right)dt \\
&\quad + \sqrt{\left(CX^*_t + D\,\frac{(\alpha_2(B+CD) - R)X^*_t + \alpha_1 B - Q}{N - \alpha_2D^2}\right)^2 + \frac{\lambda D^2}{N - \alpha_2D^2}}\;dW_t \\
&= (A_1X^*_t + A_2)\,dt + \sqrt{(B_1X^*_t + B_2)^2 + C_1}\;dW_t,
\end{aligned}
$$
where $A_1 := A + \frac{B(\alpha_2(B+CD)-R)}{N-\alpha_2D^2}$, $A_2 := \frac{B(\alpha_1 B - Q)}{N-\alpha_2D^2}$, $B_1 := C + \frac{D(\alpha_2(B+CD)-R)}{N-\alpha_2D^2}$, $B_2 := \frac{D(\alpha_1 B - Q)}{N-\alpha_2D^2}$ and $C_1 := \frac{\lambda D^2}{N-\alpha_2D^2}$.

Itô's formula gives
$$
d(X^*_t)^2 = \Big((2A_1 + B_1^2)(X^*_t)^2 + 2(A_2 + B_1B_2)X^*_t + B_2^2 + C_1\Big)dt + 2X^*_t\sqrt{(B_1X^*_t + B_2)^2 + C_1}\;dW_t.
$$

Following similar arguments as in the proof of Lemma 4, we have, for any $T \ge 0$,
$$
\mathbb{E}\big[(X^*_T)^2\big] = x^2 + \int_0^T\Big((2A_1 + B_1^2)\mathbb{E}\big[(X^*_t)^2\big] + 2(A_2 + B_1B_2)\mathbb{E}[X^*_t]\Big)dt + (B_2^2 + C_1)T. \qquad (55)
$$

On the other hand, a similar argument yields that, for the state process $x^*$ under the candidate optimal control $u^*$ in (b) and $T \ge 0$,
$$
\mathbb{E}\big[(x^*_T)^2\big] = x^2 + \int_0^T\Big((2A_1 + B_1^2)\mathbb{E}\big[(x^*_t)^2\big] + 2(A_2 + B_1B_2)\mathbb{E}[x^*_t]\Big)dt + B_2^2T. \qquad (56)
$$
By assumption of (a), the optimal control distribution $\pi^*$ in (a) is admissible,

and hence $\liminf_{T\to\infty} e^{-\delta T}\mathbb{E}[(X^*_T)^2] = 0$, for some $\delta \in (0, \rho - (2A + C^2))$. Then, from Lemma 8 in Appendix F, the comparison between $\mathbb{E}\big[(X^*_T)^2\big]$ and $\mathbb{E}\big[(x^*_T)^2\big]$, given in (55) and (56), yields that $\liminf_{T\to\infty} e^{-\delta T}\mathbb{E}[(x^*_T)^2] = 0$, as desired.

To verify the admissibility condition $\mathbb{E}\big[\int_0^\infty e^{-\rho t}\,|r(x^*_t, u^*_t)|\,dt\big] < \infty$, it suffices to show that $\mathbb{E}\big[\int_0^\infty e^{-\rho t}(x^*_t)^2dt\big] < \infty$, due to the quadratic structure of the reward (19) and the linearity of the control $u^*_t$ in the state $x^*_t$, as shown in (b).


The assumption that $\pi^*$ in statement (a) is admissible implies that $\mathbb{E}\big[\int_0^\infty e^{-\rho t}|L(X^*_t, \pi^*_t)|\,dt\big] < \infty$, which, based on the form of $\pi^*_t$ in (a), further implies that $\mathbb{E}\big[\int_0^\infty e^{-\rho t}(X^*_t)^2dt\big] < \infty$.

According to Lemma 8 in Appendix F, the comparison between (55) and (56) then gives $\mathbb{E}\big[\int_0^\infty e^{-\rho t}(x^*_t)^2dt\big] < \infty$. The other admissibility conditions are clearly satisfied by $u^*$ in (b). We hence conclude that $\{u^*_t, t \ge 0\}$ is the optimal control. The proof is complete.
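The role of the constant $C_1 = \frac{\lambda D^2}{N - \alpha_2D^2}$ in the comparison between (55) and (56) can be seen by integrating the two moment equations numerically (a sketch; the values of $A_1, A_2, B_1, B_2, C_1$ below are arbitrary numbers standing in for the expressions defined above).

```python
# Sketch: integrate the moment equations behind (55) and (56); the two second
# moments differ only through the constant forcing term C1 (arbitrary values).
import numpy as np
from scipy.integrate import solve_ivp

A1, A2, B1, B2, C1 = -0.4, 0.3, 0.2, 0.1, 0.25   # stand-ins for the constants above
x0, T = 1.0, 10.0

def odes(t, y):
    n, m_expl, m_class = y               # N(t) = E[X*_t] = E[x*_t], then the two second moments
    dn = A1 * n + A2
    dm_expl = (2 * A1 + B1**2) * m_expl + 2 * (A2 + B1 * B2) * n + B2**2 + C1
    dm_class = (2 * A1 + B1**2) * m_class + 2 * (A2 + B1 * B2) * n + B2**2
    return [dn, dm_expl, dm_class]

sol = solve_ivp(odes, (0.0, T), [x0, x0**2, x0**2], dense_output=True)
n_T, m_expl_T, m_class_T = sol.sol(T)
print("E[(X*_T)^2] =", m_expl_T, "  E[(x*_T)^2] =", m_class_T)
# Both second moments share the exponential rate 2*A1 + B1**2; only the constant
# forcing term differs, which is why the admissibility properties transfer.
```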

Appendix E: Proof of Proposition 7

Proof. We first show that, for each fixed $t \ge 0$, as $\lambda \to 0$ the optimal Gaussian control distribution $\pi^*_t = \pi^*(\cdot\,; X^*_t)$ in statement (a) converges weakly to the Dirac measure $\delta_{u^*_t}$, with $u^*_t$ given in statement (b).

For $t = 0$, we have $X^*_0 = x^*_0 = x \in \mathbb{R}$, and the convergence follows easily. Given a fixed $t > 0$, we show that, for any continuous and bounded function $f : \mathbb{R} \mapsto \mathbb{R}$, we have
$$
\lim_{\lambda\to 0}\mathbb{E}\left[\int_{\mathbb{R}} f(u)\pi^*(u; X^*_t)\,du\right] = \mathbb{E}\left[\int_{\mathbb{R}} f(u)\delta_{u^*_t}(du)\right]. \qquad (57)
$$

To this end, let $\mu^*_t$ and $(\sigma^*_t)^2$ be the mean and variance of $\pi^*_t = \pi^*(u; X^*_t)$, respectively. Then,
$$
\begin{aligned}
\mathbb{E}\left[\int_{\mathbb{R}} f(u)\pi^*(u; X^*_t)\,du\right]
&= \mathbb{E}\left[\int_{\mathbb{R}} f(u)\frac{1}{\sqrt{2\pi}\,\sigma^*_t}e^{-\frac{(u-\mu^*_t)^2}{2(\sigma^*_t)^2}}\,du\right] \\
&= \mathbb{E}\left[\int_{\mathbb{R}} f(\sigma^*_t u + \mu^*_t)\frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,du\right]
= \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\mathbb{E}\big[f(\sigma^*_t u + \mu^*_t)\big]e^{-\frac{u^2}{2}}\,du,
\end{aligned}
$$
where we have applied Fubini's theorem, considering that $f$ is bounded.

Based on the explicit expressions for $\mu^*_t$ and $(\sigma^*_t)^2$ from statement (a), we have that, for each $u \in \mathbb{R}$ and $t \ge 0$, the random variable $\mu^*_t + \sigma^*_t u$ satisfies

$$
\mu^*_t + \sigma^*_t u = \frac{(\alpha_2(B+CD) - R)X^*_t + \alpha_1 B - Q}{N - \alpha_2D^2} + \sqrt{\frac{\lambda}{N - \alpha_2D^2}}\,u
= \frac{\alpha_2(B+CD) - R}{N - \alpha_2D^2}\,X^*_t + \frac{\alpha_1 B - Q}{N - \alpha_2D^2} + \sqrt{\frac{\lambda}{N - \alpha_2D^2}}\,u.
$$

Thus, it is a Gaussian random variable, since for each fixed $t > 0$, $X^*_t$ is a Gaussian random variable under $\mathbb{P}$, with mean and variance given by $N(t)$ and $M(t) - N(t)^2$ from Lemma 8 in Appendix F. It then follows that the mean and variance of $\mu^*_t + \sigma^*_t u$ are given, respectively, by
$$
\alpha_t(u) := \frac{(\alpha_2(B+CD) - R)N(t) + \alpha_1 B - Q}{N - \alpha_2D^2} + \sqrt{\frac{\lambda}{N - \alpha_2D^2}}\,u
$$
and
$$
\beta_t(u) = \left(\frac{\alpha_2(B+CD) - R}{N - \alpha_2D^2}\right)^2\big(M(t) - N(t)^2\big).
$$

Therefore,
$$
\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\mathbb{E}\big[f(\sigma^*_t u + \mu^*_t)\big]e^{-\frac{u^2}{2}}\,du = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\left(\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi\beta_t(u)}}f(y)e^{-\frac{(y-\alpha_t(u))^2}{2\beta_t(u)}}\,dy\right)e^{-\frac{u^2}{2}}\,du. \qquad (58)
$$

On the other hand, since for each fixed $t > 0$, $x^*_t$ is a Gaussian random variable with mean $N(t)$ and variance $\bar{M}(t) - N(t)^2$ (see Lemma 8 in Appendix F), we conclude that $u^*_t$ is also Gaussian with mean and variance given, respectively, by
$$
\bar{\alpha}_t = \frac{(\alpha_2(B+CD) - R)N(t) + \alpha_1 B - Q}{N - \alpha_2D^2}
$$
and
$$
\bar{\beta}_t = \left(\frac{\alpha_2(B+CD) - R}{N - \alpha_2D^2}\right)^2\big(\bar{M}(t) - N(t)^2\big).
$$

In turn, the right hand side of (57) becomes
$$
\mathbb{E}\left[\int_{\mathbb{R}} f(u)\delta_{u^*_t}(du)\right] = \mathbb{E}[f(u^*_t)] = \int_{\mathbb{R}}\frac{1}{\sqrt{2\pi\bar{\beta}_t}}f(y)e^{-\frac{(y-\bar{\alpha}_t)^2}{2\bar{\beta}_t}}\,dy. \qquad (59)
$$
Notice that, as $\lambda \to 0$, $\alpha_t(u) \to \bar{\alpha}_t$ and $\beta_t(u) \to \bar{\beta}_t$, for all $u \in \mathbb{R}$. Indeed, the former is obvious, while the latter follows from the fact that $M(t) \to \bar{M}(t)$, as $\lambda \to 0$, as shown in the proof of Lemma 8 in Appendix F.

Sending $\lambda \to 0$ in (58) and using the dominated convergence theorem, we obtain
$$
\begin{aligned}
\lim_{\lambda\to 0}\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\left(\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi\beta_t(u)}}f(y)e^{-\frac{(y-\alpha_t(u))^2}{2\beta_t(u)}}\,dy\right)e^{-\frac{u^2}{2}}\,du
&= \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\left(\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi\bar{\beta}_t}}f(y)e^{-\frac{(y-\bar{\alpha}_t)^2}{2\bar{\beta}_t}}\,dy\right)e^{-\frac{u^2}{2}}\,du \\
&= \int_{\mathbb{R}}\frac{1}{\sqrt{2\pi\bar{\beta}_t}}f(y)e^{-\frac{(y-\bar{\alpha}_t)^2}{2\bar{\beta}_t}}\,dy,
\end{aligned}
$$
where the last equality holds because the inner integral no longer depends on $u$ and the outer Gaussian weight integrates to one. The weak convergence result (57) then follows.

The pointwise convergence follows easily using the form of $V(\cdot)$ and $w(\cdot)$ in statements (a) and (b), respectively.
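The weak-convergence statement can be visualized with a short simulation (a sketch; the fixed state value $x$, the test function $f$, and all coefficients below are arbitrary assumptions, with $u^*(x)$ taken to be the same linear feedback as the mean of $\pi^*(\cdot\,;x)$, consistent with statement (b) as used in the proof): as $\lambda$ decreases, the Gaussian policy concentrates at the classical control and $\mathbb{E}_{\pi^*}[f]$ approaches $f(u^*(x))$.

```python
# Sketch: as lambda -> 0 the Gaussian policy collapses onto the classical
# control (arbitrary coefficients; x is a fixed state, f a bounded test function).
import numpy as np

B, C, D, R, N_, Q = 1.0, 0.2, 0.5, 0.3, 2.0, 0.4
alpha1, alpha2 = 0.2, -0.8              # coefficients of the quadratic w, with alpha2 < 0
x = 1.5

mean = ((alpha2 * (B + C * D) - R) * x + alpha1 * B - Q) / (N_ - alpha2 * D**2)
u_star = mean                           # classical control: the same feedback, without noise
f = lambda u: np.tanh(u)                # a bounded continuous test function

rng = np.random.default_rng(1)
for lam in [1.0, 0.1, 0.01, 0.001]:
    var = lam / (N_ - alpha2 * D**2)
    samples = mean + np.sqrt(var) * rng.standard_normal(100_000)
    print(f"lambda={lam:6.3f}  E_pi*[f] = {f(samples).mean():.5f}")
print("f(u*(x)) =", f(u_star))
```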

Appendix F: Lemma 8

Lemma 8. For any $T > 0$, let
$$
\mathbb{E}\big[(X^*_T)^2\big] = x^2 + \int_0^T\Big((2A_1 + B_1^2)\mathbb{E}\big[(X^*_t)^2\big] + 2(A_2 + B_1B_2)\mathbb{E}[X^*_t]\Big)dt + (B_2^2 + C_1)T
$$
and
$$
\mathbb{E}\big[(x^*_T)^2\big] = x^2 + \int_0^T\Big((2A_1 + B_1^2)\mathbb{E}\big[(x^*_t)^2\big] + 2(A_2 + B_1B_2)\mathbb{E}[x^*_t]\Big)dt + B_2^2T,
$$
as in (55) and (56), respectively. Then, $\liminf_{T\to\infty} e^{-\delta T}\mathbb{E}\big[(X^*_T)^2\big] = 0$ if and only if $\liminf_{T\to\infty} e^{-\delta T}\mathbb{E}\big[(x^*_T)^2\big] = 0$, for any $\delta \in (0, \rho - (2A + C^2))$. Moreover, $\mathbb{E}\big[\int_0^\infty e^{-\rho t}(X^*_t)^2dt\big] < \infty$ if and only if $\mathbb{E}\big[\int_0^\infty e^{-\rho t}(x^*_t)^2dt\big] < \infty$.

Proof. We start by solving for $\mathbb{E}\big[(X^*_T)^2\big]$ and $\mathbb{E}\big[(x^*_T)^2\big]$ explicitly, for any $T > 0$. Recall that $X^*$ satisfies the dynamics
$$
X^*_T = x + \int_0^T (A_1X^*_t + A_2)\,dt + \int_0^T\sqrt{(B_1X^*_t + B_2)^2 + C_1}\;dW_t,
$$
for $T > 0$, with $X^*_0 = x$.

Define the stopping times $\tau_n$, $n \ge 1$, as $\tau_n := \inf\{t > 0 : \int_0^t \big((B_1X^*_s + B_2)^2 + C_1\big)ds \ge n\}$. Then, it follows that the stopped stochastic integral is a martingale and, hence,
$$
\mathbb{E}\big[X^*_{T\wedge\tau_n}\big] = x + \mathbb{E}\left[\int_0^{T\wedge\tau_n}(A_1X^*_t + A_2)\,dt\right].
$$

The standard estimate $\mathbb{E}\big[\sup_{0\le t\le T}(X^*_t)^2\big] \le K(1 + x^2)e^{KT}$, for some constant $K > 0$ depending only on the Lipschitz constant and the time $T$, allows us to apply Fubini's theorem and the dominated convergence theorem, yielding
$$
\mathbb{E}[X^*_T] = x + \int_0^T\big(A_1\mathbb{E}[X^*_t] + A_2\big)dt.
$$

Denote $N(t) := \mathbb{E}[X^*_t]$, for $t \ge 0$. Then, we obtain the ODE
$$
\frac{dN(t)}{dt} = A_1N(t) + A_2, \qquad N(0) = x,
$$
whose solution is $N(t) = \big(x + \frac{A_2}{A_1}\big)e^{A_1t} - \frac{A_2}{A_1}$, if $A_1 \neq 0$, and $N(t) = x + A_2t$, otherwise. In turn, the function $M(t) := \mathbb{E}\big[(X^*_t)^2\big]$, $t \ge 0$, solves the ODE
$$
\frac{dM(t)}{dt} = (2A_1 + B_1^2)M(t) + 2(A_2 + B_1B_2)N(t) + B_2^2 + C_1, \qquad M(0) = x^2.
$$
Similarly, $\mathbb{E}[x^*_t] = N(t)$, $t \ge 0$, and if $\bar{M}(t) := \mathbb{E}\big[(x^*_t)^2\big]$, $t \ge 0$, then
$$
\frac{d\bar{M}(t)}{dt} = (2A_1 + B_1^2)\bar{M}(t) + 2(A_2 + B_1B_2)N(t) + B_2^2, \qquad \bar{M}(0) = x^2.
$$

We next find explicit solutions to the above ODEs corresponding to various conditions on the parameters.

(a) If $A_1 = B_1^2 = 0$, then direct computation gives $N(t) = x + A_2t$, and
$$
M(t) = x^2 + A_2(2x + A_2t)t + (B_2^2 + C_1)t,
$$
$$
\bar{M}(t) = x^2 + A_2(2x + A_2t)t + B_2^2t.
$$

The conclusion of the Lemma follows.

(b) If $A_1 = 0$ and $B_1^2 \neq 0$, we have $N(t) = x + A_2t$, and
$$
M(t) = \left(x^2 + \frac{2(A_2 + B_1B_2)\big(A_2 + B_1^2(x + B_2^2 + C_1)\big)}{B_1^4}\right)e^{B_1^2t} - \frac{2(A_2 + B_1B_2)\big(A_2 + B_1^2(x + B_2^2 + C_1)\big)}{B_1^4},
$$
$$
\bar{M}(t) = \left(x^2 + \frac{2(A_2 + B_1B_2)\big(A_2 + B_1^2(x + B_2^2)\big)}{B_1^4}\right)e^{B_1^2t} - \frac{2(A_2 + B_1B_2)\big(A_2 + B_1^2(x + B_2^2)\big)}{B_1^4}.
$$

The conclusion of the Lemma also follows.

(c) If $A_1 \neq 0$ and $A_1 + B_1^2 = 0$, then $N(t) = \big(x + \frac{A_2}{A_1}\big)e^{A_1t} - \frac{A_2}{A_1}$. Further computation yields
$$
M(t) = \left(x^2 + \frac{A_1(B_2^2 + C_1) - 2A_2(A_2 + B_1B_2)}{A_1^2}\right)e^{A_1t} + \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1}\,te^{A_1t} - \frac{A_1(B_2^2 + C_1) - 2A_2(A_2 + B_1B_2)}{A_1^2},
$$
$$
\bar{M}(t) = \left(x^2 + \frac{A_1B_2^2 - 2A_2(A_2 + B_1B_2)}{A_1^2}\right)e^{A_1t} + \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1}\,te^{A_1t} - \frac{A_1B_2^2 - 2A_2(A_2 + B_1B_2)}{A_1^2}.
$$

It is also straightforward to see that the conclusion of the Lemma follows.

(d) If $A_1 \neq 0$ and $2A_1 + B_1^2 = 0$, we have $N(t) = \big(x + \frac{A_2}{A_1}\big)e^{A_1t} - \frac{A_2}{A_1}$, and
$$
M(t) = \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1^2}\,e^{A_1t} + \frac{A_1(B_2^2 + C_1) - 2A_2(A_2 + B_1B_2)}{A_1}\,t + x^2 - \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1^2},
$$
$$
\bar{M}(t) = \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1^2}\,e^{A_1t} + \frac{A_1B_2^2 - 2A_2(A_2 + B_1B_2)}{A_1}\,t + x^2 - \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1^2},
$$


and we easily conclude.

(e) If $A_1 \neq 0$, $A_1 + B_1^2 \neq 0$ and $2A_1 + B_1^2 \neq 0$, then we arrive at $N(t) = \big(x + \frac{A_2}{A_1}\big)e^{A_1t} - \frac{A_2}{A_1}$, and
$$
M(t) = \left(x^2 + \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1(A_1 + B_1^2)} + \frac{A_1(B_2^2 + C_1) - 2A_2(A_2 + B_1B_2)}{A_1(2A_1 + B_1^2)}\right)e^{(2A_1 + B_1^2)t} - \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1(A_1 + B_1^2)}\,e^{A_1t} - \frac{A_1(B_2^2 + C_1) - 2A_2(A_2 + B_1B_2)}{A_1(2A_1 + B_1^2)},
$$
$$
\bar{M}(t) = \left(x^2 + \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1(A_1 + B_1^2)} + \frac{A_1B_2^2 - 2A_2(A_2 + B_1B_2)}{A_1(2A_1 + B_1^2)}\right)e^{(2A_1 + B_1^2)t} - \frac{2(A_2 + B_1B_2)(A_1x + A_2)}{A_1(A_1 + B_1^2)}\,e^{A_1t} - \frac{A_1B_2^2 - 2A_2(A_2 + B_1B_2)}{A_1(2A_1 + B_1^2)}.
$$

The conclusion then easily follows and the proof of the Lemma is complete.
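A minimal numerical cross-check of the moment ODEs used in this proof (a sketch with arbitrary coefficient values, not from the paper): simulate the SDE for $X^*$ by Euler-Maruyama and compare the Monte Carlo estimates of $\mathbb{E}[X^*_T]$ and $\mathbb{E}[(X^*_T)^2]$ with the numerical solutions of the ODEs for $N(t)$ and $M(t)$.

```python
# Sketch: Monte Carlo cross-check of the moment ODEs for N(t) = E[X*_t] and
# M(t) = E[(X*_t)^2] (arbitrary coefficient values).
import numpy as np
from scipy.integrate import solve_ivp

A1, A2, B1, B2, C1 = -0.3, 0.2, 0.4, 0.1, 0.2
x0, T, n_steps, n_paths = 1.0, 3.0, 1500, 50000
dt = T / n_steps

rng = np.random.default_rng(2)
X = np.full(n_paths, x0)
for _ in range(n_steps):
    X = X + (A1 * X + A2) * dt \
          + np.sqrt(((B1 * X + B2) ** 2 + C1) * dt) * rng.standard_normal(n_paths)

def moment_odes(t, y):
    n, m = y
    return [A1 * n + A2,
            (2 * A1 + B1**2) * m + 2 * (A2 + B1 * B2) * n + B2**2 + C1]

sol = solve_ivp(moment_odes, (0.0, T), [x0, x0**2], rtol=1e-8, atol=1e-10)
print("Monte Carlo:", X.mean(), np.mean(X**2))
print("ODE values :", sol.y[0, -1], sol.y[1, -1])
```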

References

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

Cyrus Derman. Finite state Markovian decision processes. Technical report, 1970.

Nicole El Karoui, Nguyen Du Huu, and Monique Jeanblanc-Picque. Compactification methods in the control of degenerate diffusions: existence of an optimal control. Stochastics, 20(3):169–219, 1987.

Wendell H Fleming and Makiko Nisio. On stochastic relaxed control for partially observed diffusions. Nagoya Mathematical Journal, 93:71–108, 1984.

Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

John Gittins. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241–266, 1974.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.


Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pages 703–710, 1994.

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Ioannis Karatzas and Steven E Shreve. Brownian motion and stochastic calculus. Springer-Verlag, 2nd edition, 1991.

Haya Kaspi and Avishai Mandelbaum. Multi-armed bandits in discrete and continuous time. Annals of Applied Probability, pages 1270–1290, 1998.

Thomas Kurtz and Richard Stockbridge. Existence of Markov controls and characterization of optimal Markov controls. SIAM Journal on Control and Optimization, 36(2):609–653, 1998.

Thomas Kurtz and Richard Stockbridge. Stationary solutions and forward equations for controlled and singular martingale problems. Electronic Journal of Probability, 6, 2001.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Weiwei Li and Emanuel Todorov. Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system. International Journal of Control, 80(9):1439–1453, 2007.

Avi Mandelbaum. Continuous multi-armed bandits and multiparameter processes. The Annals of Probability, pages 1527–1556, 1987.

Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, and Koray Kavukcuoglu. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785, 2017a.


Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017b.

Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 2001.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvari. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.

Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005. Proceedings of the 2005, pages 300–306. IEEE, 2005.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
