

Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation

Junhong Shen, Lin F. Yang
University of California, Los Angeles

[email protected], [email protected]

Abstract

Recently, deep reinforcement learning (RL) has achieved remarkable empirical success by integrating deep neural networks into RL frameworks. However, these algorithms often require a large number of training samples and admit little theoretical understanding. To mitigate these issues, we propose a theoretically principled nearest neighbor (NN) function approximator that can improve the value networks in deep RL methods. Inspired by human similarity judgments, the NN approximator estimates the action values using rollouts on past observations and can provably obtain a small regret bound that depends only on the intrinsic complexity of the environment. We present (1) Nearest Neighbor Actor-Critic (NNAC), an online policy gradient algorithm that demonstrates the practicality of combining function approximation with deep RL, and (2) a plug-and-play NN update module that aids the training of existing deep RL methods. Experiments on classical control and MuJoCo locomotion tasks show that the NN-accelerated agents achieve higher sample efficiency and stability than the baseline agents. Based on its theoretical benefits, we believe that the NN approximator can be further applied to other complex domains to speed up learning.

Introduction

People learn a variety of relationships in life, e.g., we associate the force of pressing the gas pedal with the amount of acceleration gained while driving. In the context of reinforcement learning (RL), where an agent interacts with the environment to maximize the cumulative reward, the learning objective is the relationship between state-action pairs and future gains. Theories on associative learning suggest that people learn from similarity measures (Carroll 1963; Busemeyer et al. 1997): if x can predict y, they presume that observations similar to x have similar y values. It is thus natural to consider integrating similarity-based models into active learning. A suitable choice for the RL setting is the nearest neighbor function approximator (Emigh et al. 2016; Shah and Xie 2018; Yang, Ni, and Wang 2019).

We study online episodic RL with unknown reward and transition functions. Existing deep RL algorithms achieve impressive results in robot control (e.g., Lillicrap et al. 2016; Levine et al. 2016; Gu et al. 2017), Go (Silver et al. 2016), and Atari playing (Mnih et al. 2013). However, several challenges still exist. First, the theoretical foundation of deep RL has not been fully established (Arulkumaran et al. 2017; Lake et al. 2018). It is often mysterious why an algorithm works or fails in certain cases (Kansky et al. 2017). Second, empirical results suggest that model-free deep RL often requires considerable samples to learn (Deisenroth and Rasmussen 2011). Meanwhile, data can be expensive to acquire in practical domains like healthcare (Kober, Bagnell, and Peters 2013). High-dimensional input, such as pixel data, demands a larger sample size even if the problem itself is simple (Lillicrap et al. 2016). Third, online learning coupled with neural networks is generally regarded as unstable (van Hasselt, Guez, and Silver 2016). Hyperparameter tuning can also affect the learning outcome. In sum, improving deep RL with theory-based approaches is of critical importance.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

On the other hand, great theoretical progress has been made in tabular RL, where the state-action spaces are small and finite. The sample complexity of tabular methods is well studied (e.g., Jaksch, Ortner, and Auer 2008; Azar, Osband, and Munos 2017; Jin et al. 2018; Zanette and Brunskill 2019). Yet the best obtainable complexity depends linearly on the number of states, which can be huge in reality; e.g., the game of Go has about 3^{19×19} states. Thus, real-world application of tabular theories remains a challenge.

In this paper, we bridge the gap between RL theory and practice via a theoretically principled deep RL acceleration technique. Specifically, we exploit the structural properties of the state-action space by using the nearest neighbor (NN)¹ approximator for value estimation. Such a function approximator not only attains a theoretical guarantee on sample complexity but also possesses good generalization ability when plugged into deep RL frameworks. In fact, we show that the NN approximator with an upper-confidence construction obtains a near-optimal regret O(H · (DLK)^{d/(d+1)}) for both low- and high-dimensional inputs in deterministic systems, where K is the number of episodes, H is the episode length, L is a Lipschitz constant related to the distance metric measuring state similarities, and D and d are respectively the diameter and dimension of the intrinsic state-action space.

¹ Throughout the paper, the abbreviation “NN” is used to refer to “nearest neighbor,” not “neural network.”

arXiv:2110.04422v1 [cs.LG] 9 Oct 2021


We demonstrate the empirical efficacy of the NN approximator by fitting it into the actor-critic framework. We use the non-parametric NN critic to bootstrap state values and train the policy network with temporal-difference methods (Sutton 1988). Given a state-action pair, the NN critic finds within history the closest sample to this observation under the distance metric. The corresponding reward plus a Lipschitz confidence term and the next state are used to approximate the reward and transition functions, respectively. The algorithm displays impressive learning speed in the cart-pole balancing problem. Beyond this, we also encapsulate the NN approximator into a plug-and-play module that boosts the training of existing deep RL algorithms without changing their original structures. The plug-in NN critic encourages action exploration and stabilizes value learning. We evaluate the module with state-of-the-art deep RL agents on a set of 3D locomotion tasks. Results show that the NN-aided gradient update improves both training speed and stability.

Roadmap: The paper is organized as follows. We first discuss related work in theoretical and deep RL. Next, we introduce notations and concepts in RL and metric dimensions. We then give the formulation of the NN approximator and analyze its theoretical guarantee. In the later sections, we present two algorithms that combine the NN approximator with deep RL and evaluate their empirical performance.

Related Work

RL with neural networks: There is a long line of research that applies deep RL to games and control problems (e.g., Mnih et al. 2013; Schulman et al. 2015; Levine et al. 2016; van Hasselt, Guez, and Silver 2016; Lillicrap et al. 2016). These results employ several heuristics to accelerate exploration. For example, Mnih et al. (2013, 2015) and van Hasselt, Guez, and Silver (2016) randomly sample actions and store them in a replay buffer before policy learning. Ornstein-Uhlenbeck noise (Lillicrap et al. 2016) or Gaussian noise (Fujimoto et al. 2018) is added to the actions, or noise is added to the network parameters (Plappert et al. 2018), to encourage exploration. Several works also combine model-based value estimation and model-free policy learning to reduce sample complexity (Deisenroth and Rasmussen 2011; Buckman et al. 2018).

Although there is limited theoretical understanding of the aforementioned methods, we emphasize that our aim is not to improve them but rather to introduce a new theory-based function approximator to the literature of deep RL. By combining NN value estimation and existing frameworks, we believe that the efficiency of model-free RL algorithms can be improved in a provable manner (at least in some settings).

RL with theoretical guarantees: To facilitate theoretical analysis, many works study RL in the tabular setting, where the state and action spaces are discrete (e.g., Jaksch, Ortner, and Auer 2008; Azar, Osband, and Munos 2017; Jin et al. 2018; Zanette and Brunskill 2019). The sample complexity of these algorithms depends at least linearly on the number of states |S|. Since this number tends to be large in practice, it is difficult to extend these algorithms to real-world settings. Recently, several works have emphasized understanding in RL with general function approximation (e.g., Osband and Roy 2014; Jiang et al. 2017; Sun et al. 2019; Wang et al. 2020). However, the function classes are either simple, e.g., linear functions (e.g., Yang and Wang 2019; Jin et al. 2020), or have strong structural assumptions, which prevent them from being applied to more practical problems.

RL with nearest neighbor search: Combining nearest neighbor search with active learning has been studied in episodic RL. Model-free episodic control (Blundell et al. 2016) builds on a tabular memory and applies regression using the mean of k-nearest neighbors for Q-value estimation. Neural episodic control (Pritzel et al. 2017) and episodic memory deep Q-networks (Lin et al. 2018) improve the algorithm’s generalization ability by absorbing state features into networks. These algorithms take the NN search as a pure classification technique. They do not exploit the fact that the distance between state-action pairs can indicate their relative values. In addition, the value estimation for these methods exists at the trajectory level: the Q-values are updated at the end of an episode by the total reward of a trajectory. In contrast, the NN value estimation in this paper takes the form of on-policy Monte Carlo rollouts using samples from independent environment steps. Thus, we leverage not only intra-episode but also inter-episode information.

The prototype of our NN function approximator is presented in Yang, Ni, and Wang (2019), where an upper-confidence algorithm with general function approximation is proposed for tabular RL. The algorithm can apply to continuous cases but requires a discretization of the action space. We improve it to account for non-tabular cases without discretizing the action space. Meanwhile, though Yang, Ni, and Wang (2019) derive a regret based on the ambient dimension of the state space, they do not justify the regret of high-dimensional inputs with small intrinsic dimensions.

Preliminaries

In this section, we introduce the key definitions and notations in RL. For the clarity of the proofs, we assume that the Markov decision process (MDP) is finite-horizon and deterministic. This assumption is not restrictive, as many real-world control systems do not involve randomness.

Formally, we consider an MDP (S, A, f, r, H) with state space S, action space A, deterministic transition model f : S × A → S, and reward function r : S × A → R. An agent interacts with the environment episodically, where each episode lasts H steps. In an episode, the agent starts from an initial state s_1 independent of the history. At step h ∈ [H]², it observes state s_h and performs action a_h := π(s_h, h) according to the policy π : S × [H] → A. It then receives reward r_h = r(s_h, a_h) and next state s_{h+1} = f(s_h, a_h). We define the cumulative reward from step h as R_h = Σ_{t=0}^{H−h} r_{h+t}. The goal of learning is to find a policy that maximizes the total reward in one episode when f and r are unknown.

Given a policy π, the value function V^π : S × [H] → R is defined as the cumulative reward obtained by starting from state s at step h and following π therefrom. It satisfies the Bellman equation:

V^π_h(s) = r(s, π(s)) + V^π_{h+1}[f(s, π(s))],   (1)

with V^π_H(s) = r(s, π(s)). The optimal policy π^* is the one such that V^{π^*}_h := V^*_h = max_π V^π_h. The temporal-difference (TD) error at step h is δ_h = r_h + V^π_{h+1}(s_{h+1}) − V^π_h(s_h). We further denote the action value (or Q-function) as Q^π_h(s, a) := r(s, a) + V^π_{h+1}(f(s, a)). The optimal Q^*_h := max_π Q^π_h gives the maximum value for a (s, a) pair achievable by any policy. By the Bellman optimality equation, we have V^*_h(s) = max_{a∈A} Q^*_h(s, a).

² [H] denotes the set of integers {1, ..., H}.

We measure the sample complexity of an algorithm by its regret, which is the difference between the total reward of the unknown optimal policy and that gathered in learning:

Regret(K) = Σ_{k=1}^{K} [ V^*_1(s^k_1) − Σ_{h=1}^{H} r(s^k_h, a^k_h) ],

where K is the number of episodes played.

Theoretical Guarantee of Nearest NeighborFunction Approximation

In this section, we first introduce concepts relevant to thestructure of an MDP. We then formalize the nearest neighborfunction approximator and show its theoretical guarantee interms of sample complexity.

Metric Space and Intrinsic Dimension

In practice, the state-action space X = S × A is usually continuous. We assume that X is a metric space with a distance function d_X that satisfies the triangular inequality. This assumption is easily achievable; e.g., the Euclidean distance can be applied to a space of pixel data. We also assume that X is bounded and has diameter D := sup_{x,x′∈X} d_X(x, x′).

An intuitive way to measure the complexity of X is through the ambient dimension p, which can be roughly understood as the number of variables used to describe a point in X. In real-world MDPs, the states are often represented by real-valued vectors, which form a Euclidean ambient space. Thus, in our context, we simply take p as the Euclidean dimension of the natural embedding of X. For instance, a 20 × 20 image has p = 400 regardless of its content.

However, most meaningful, high-dimensional data do not uniformly fill in the space where they are represented. Rather, they concentrate on smooth manifolds with low intrinsic dimension. The intrinsic dimension d measures the inherent complexity of a metric space. Informally, it is the number of variables needed to describe X. Suppose the aforementioned image depicts a car with 5 physical properties; then d can be 5 rather than 400. Studies on intrinsic dimension estimation (e.g., Kegl 2002; Levina and Bickel 2004) employ more formal definitions, yet the technical details are out of our scope. We only require d ≤ p in general.

For the clarity of later proofs, we outline two concepts that can bound the intrinsic dimension of a metric space.

Definition 1 (Covering and packing). An ε-cover of a metric space X is a subset X̄ ⊆ X such that for each x ∈ X, there exists x′ ∈ X̄ with d_X(x, x′) ≤ ε. The ε-covering number is N(ε) = min{|X̄| : X̄ is an ε-cover of X}. An ε-packing is a subset X̄ ⊆ X such that ∀x ≠ x′ ∈ X̄, d_X(x, x′) > ε. The ε-packing number is M(ε) = max{|X̄| : X̄ is an ε-packing of X}.
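For intuition, the following sketch (a helper of our own, not part of any algorithm in the paper) greedily builds a subset of a finite sample that is simultaneously an ε-packing and an ε-cover of that sample, so its size sits between the covering and packing numbers of Definition 1:

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedy construction of an eps-net for a finite sample (Euclidean metric).

    Every accepted center is > eps away from the previously accepted ones
    (an eps-packing), and by maximality every point of the sample lies within
    eps of some center (an eps-cover).
    """
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return np.array(centers)
```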

Nearest Neighbor Function Approximator

In the RL setting, the learning algorithm collects a set of observations B := {x_i}_{i∈[N]} and corresponding value labels {Q(x_i)}_{i∈[N]} ⊆ R. The unknown Q : X → R measures the quality of a state-action pair. As in supervised machine learning, the task is to find a function approximator Q̂ : X → R that approximates the known labels with small errors and also generalizes to unseen data points. That is, Q̂(x_i) ≈ Q(x_i) for all i ∈ [N] and Q̂(x) ≈ Q(x) for x ∈ X \ B. With the distance metric d_X, we can now define the NN approximator, which satisfies the above property.

Definition 2 (Nearest neighbor function approximator). Given a sample buffer B = {(x_i, Q(x_i))}_{i∈[N]} ⊆ X × R, the nearest neighbor approximator is the function Q̂ : X → R such that ∀x ∈ X,

Q̂(x) := min_{i∈[N]} {Q(x_i) + L · d_X(x, x_i)},

where L > 0 is a parameter that adjusts the approximation error.

Note that Q̂(x) matches existing samples exactly, and the approximation error for a new x is characterized by an upper bound obtained from the closest known data. This contrasts with other function approximators that lack theoretical understanding, e.g., neural networks. Now, we proceed to show more practical guarantees of the NN function approximator.
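As a concrete illustration of Definition 2, a minimal NumPy sketch of the NN value upper bound (the function and array names are ours, and we assume the Euclidean metric and a small in-memory buffer):

```python
import numpy as np

def nn_value_upper_bound(x, buffer_x, buffer_q, L):
    """Nearest neighbor approximator of Definition 2 under the Euclidean metric.

    buffer_x : (N, dim) array of observed state-action pairs x_i
    buffer_q : (N,)     array of their value labels Q(x_i)
    L        : Lipschitz parameter controlling the confidence term
    """
    dists = np.linalg.norm(buffer_x - x, axis=1)   # d_X(x, x_i) for every stored sample
    return np.min(buffer_q + L * dists)            # min_i { Q(x_i) + L * d_X(x, x_i) }
```

At a stored point x = x_i, the i-th term reduces to Q(x_i), so the approximator matches the known labels when Q is L-Lipschitz.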

MDP with Lipschitz Continuity

To ensure the problem is tractable, we impose the following regularity condition on the optimal Q-function of the MDP.

Assumption 3 (MDP with Lipschitz continuity). In the metric space X, let the optimal Q-function be Q^*_h : X → R. Then ∃ L_1, L_2 > 0 such that ∀x, x′ ∈ X, ∀h ∈ [H],

|Q^*_h(x) − Q^*_h(x′)| ≤ L_1 · d_X(x, x′),   (2)

max_{a″} d_X[(f(x), a″), (f(x′), a″)] ≤ L_2 · d_X(x, x′).   (3)

Assumption 3 implies that there is a proper notion of distance in X such that two points close to each other have similar Q-values. It also ensures a stable system. Let L = L_1(L_2 + 1) be the parameter defining Q̂(x); then Q̂(x) is L-Lipschitz continuous (Yang, Ni, and Wang 2019, Lemma 2).

Yang, Ni, and Wang (2019) proposed an upper-confidence algorithm with NN approximation (UCRL-FA) for discrete deterministic MDPs. Basically, the approximate Q_h is updated recursively from h = H to h = 1 at the end of each episode. The agent acts according to the greedy policy arg max_{a∈A} Q_h. For completeness, we present UCRL-FA in Appendix A. It achieves the following regret bound.

Theorem 4 (Regret till ε-optimality; Yang, Ni, and Wang 2019). Suppose X admits an ε-cover with size N(ε) for any ε > 0. After K episodes, UCRL-FA with the nearest neighbor construction obtains a regret bound Regret(K) ≤ H · N(ε) + 2εLKH. If X is compact, then Regret(K) = O((DLK)^{p/(p+1)} · H).


For a metric space with ambient dimension p, Theorem 4 states that UCRL-FA learns the system with ε^{−p} samples, where ε is the learning accuracy. The algorithm is efficient for low-dimensional observations. Indeed, in real-world MDPs governed by laws of physics, the intrinsic dimensions are usually small. However, the internal states are not always available. If there are only pixel data, p can be hundreds if not thousands even for a simple system like a moving car. The natural question is: can the NN function approximator achieve a regret that depends only on the intrinsic dimension d? In the next section, we answer the question affirmatively with some mild assumptions.

Efficient Value Learning in Metric Spaces

To characterize the true complexity of the NN approximator, we additionally distinguish the state-action spaces. Suppose the learner observes a p-dimensional state space S̄ (e.g., an image space), which admits an embedding to a d-dimensional space S (e.g., the parameter space of a physical system), for some d ≪ p. We emphasize that the learner does not have any information about S. Let Y = S̄ × A and let d_Y be the distance metric satisfying the triangular inequality. Since the action space remains the same, we simply denote the ambient dimension of Y as p and the intrinsic dimension as d.

The relationship between any true state s and its external representation s̄ can be described by a function that maps S to S̄. In our proofs, we only assume the existence of the mapping, but do not require knowing its explicit form.

Assumption 5 (Bi-Lipschitz mapping between metric spaces). There exists a map g : X → Y from the intrinsic state-action space to the external metric space. We assume that g is bi-Lipschitz, i.e., ∃ C > 0 such that ∀x, x′ ∈ X,

C^{−1} · d_X(x, x′) ≤ d_Y(g(x), g(x′)) ≤ C · d_X(x, x′).   (4)

d_X and d_Y are the distance metrics for X and Y, respectively. Assumption 5 models most real-world RL systems: if two points are close in the observation space, they are close internally. Now, by Assumption 3, we obtain the following lemma on the continuity of the observation space.

Lemma 6 (High-dimensional MDP with Lipschitz continuity). If X satisfies Assumption 3 and g : X → Y satisfies Assumption 5, then the MDP with state-action space Y is L-Lipschitz continuous, where L = (L_2C² + 1) · L_1C.

The formal proof can be found in Appendix B. Now, we can treat Y as an independent MDP without knowing its intrinsic properties and directly apply Theorem 4. Lemma 7 characterizes the regret of the NN approximator in Y.

Lemma 7 (Regret in Y w.r.t. the local dimension). Suppose that Y admits an ε-cover of size N(Y, ε) for some ε > 0. After K episodes, the NN function approximator obtains a regret

Regret(K) ≤ H · N(Y, ε) + 2εLKH,   (5)

where L = (L_2C² + 1) · L_1C.

To bound the regret of the NN approximator in the observation space, it remains to bound the covering size of Y. Recalling the covering and packing of a metric space in Definition 1, we now present the main theorem.

Theorem 8 (Regret till ε-optimality in Y w.r.t. the intrinsic dimension). Suppose X admits an ε-cover of size N(X, ε) and the conditions of Lemma 6 hold. Then the following statements are true:

1. Y admits an ε̄-cover with ε̄ = 2Cε and N(Y, ε̄) ≤ N(X, ε).
2. In K episodes, UCRL-FA with the NN function approximator obtains a regret bound O((DL′K)^{d/(d+1)} · H) with L′ := CL.

Proof. We make use of the facts that

∀ε > 0, M(2ε) ≤ N(ε) ≤ M(ε),   (6)

and that for a metric space with diameter D and intrinsic dimension d, there exists an ε-cover of size

N(ε) = Θ((D/ε)^d).   (7)

We want to show that the covering number of Y cannot be greater than the covering number of X when ε̄ is set to 2Cε. Then, the regret bound in (5) can be replaced by an upper bound which depends on the inherent properties of Y.

Note that finding M(ε) for a dataset B_n = {x_1, ..., x_n} is equivalent to finding the cardinality of a maximum independent set MI(G_ε) in the graph G_ε(V, E) with vertex set V = B_n and edge set E = {(x_i, x_j) | d(x_i, x_j) < ε}.

Now, consider the graph G^X_{2ε} constructed by the above rule. Any two connected points x, x′ in the graph satisfy d_X(x, x′) < 2ε. Denote the image in Y of a maximum independent set MI_X(G^X_{2ε}) as MI_Y(G^X_{2ε}). MI_Y(G^X_{2ε}) is still a maximum independent set w.r.t. the graph with vertex set V = B_n and edge set E_X = {(g(x_i), g(x_j)) | d_X(x_i, x_j) < 2ε} in the new metric space.

To find the packing number of Y, we require the cutoff condition for E_Y to be d_Y[g(x_i), g(x_j)] < 2Cε. By (4), d_Y[g(x), g(x′)] < 2Cε for all elements in E_X. Therefore, E_X ⊆ E_Y. In other words, g(G^X_{2ε}) is a subgraph of G^Y_{2Cε} with the same vertices. Thus, MI_Y(G^Y_{2Cε}) ≤ MI_X(G^X_{2ε}). This is because adding edges can only reduce (or keep unchanged) the size of the maximum independent sets of a graph.

Thus, M(Y, 2Cε) ≤ M(X, 2ε). Using (6), we conclude that N(Y, 2Cε) ≤ M(Y, 2Cε) ≤ M(X, 2ε) ≤ N(X, ε), where the first inequality holds in space Y, the second inequality comes from the argument above, and the last inequality holds in space X. Preserving only the first and the last terms, we have N(Y, ε̄) ≤ N(X, ε).

Consequently, the regret upper bound in (5) becomes

Regret(K) ≤ H · N(Y, ε̄) + 2ε̄LKH ≤ H · N(X, ε) + 4CLεKH.

In X, equation (7) indicates that N(Y, ε̄) ≤ N(X, ε) = Θ((D/ε)^d), where d and D are the intrinsic properties of the state-action space. As a result, Regret(K) ≤ H · Θ((D/ε)^d) + 4CLεKH. Denoting L′ := CL as the smoothing constant for the high-dimensional metric space, and choosing ε = D^{d/(d+1)} · (CLK)^{−1/(d+1)}, the upper bound becomes O((DL′K)^{d/(d+1)} · H), as desired.

The main takeaway is that the regret in the complex space is also sub-linear in K and linear in H. Moreover, it depends on the intrinsic dimension rather than the ambient dimension. In practice, the NN approximator makes learning from images as efficient as from the actual state descriptors by emphasizing the internal differences between observations.


Algorithm 1 Nearest Neighbor Actor-Critic

Initialize experience buffer B_1 = ∅ and policy parameter θ^π_1
for k = 1, ..., K_max do
    Receive initial state s^k_1
    for h = 1, ..., H do
        Take action a^k_h according to policy π_k(a^k_h | s^k_h, θ^π_k)
        Receive r^k_h ← r(s^k_h, a^k_h) and s^k_{h+1} ← f(s^k_h, a^k_h)
        V(s^k_h) ← NNFUNCAPPROX(s^k_h, h, π_k, B_k, H)
        V(s^k_{h+1}) ← NNFUNCAPPROX(s^k_{h+1}, h+1, π_k, B_k, H)
        δ^k_h = r(s^k_h, a^k_h) + γ · V(s^k_{h+1}) − V(s^k_h)
        B_k ← B_k ∪ {(s^k_h, a^k_h, f(s^k_h, a^k_h), r(s^k_h, a^k_h), δ^k_h)}
        Sample a random mini-batch of N transitions from B_k
        Update the policy with the TD error policy gradient of Eq. (8)
    end for
    B_{k+1} ← B_k
end for

procedure NNFUNCAPPROX(s, h, π(·|θ^π), B, H)
    if h == H then
        return 0
    end if
    V ← ∅
    a ← π(s|θ^π)
    for i = 1, ..., M do
        (s_i, a_i) ← i-th nearest neighbor of (s, a) in B under metric d
        r_i ← r(s_i, a_i), s′_i ← f(s_i, a_i)        (stored in B)
        V′_i ← NNFUNCAPPROX(s′_i, h+1, π(·|θ^π), B, H)
        V_i ← r_i + γ · V′_i + L · d[(s, a), (s_i, a_i)]
        V ← V ∪ {V_i}
    end for
    return min V
end procedure

Nearest Neighbor Actor-Critic

In this section, we integrate the NN approximator into the actor-critic framework and evaluate its practicality. This Nearest Neighbor Actor-Critic (NNAC) combines a policy network and an NN critic to solve RL problems. Upon receiving a new observation, the NN critic finds a sequence of past (s, a) pairs as a simulated trajectory and sums up the upper-bounded rewards as the value estimate. The 1-step TD error is obtained from consecutive state values. Then, the policy is updated based on the log action probability scaled by the TD error. The pseudocode is given in Algorithm 1.

Neural-Network-Based Actor

Unlike the aforementioned tabular methods, which employ the greedy policy π(s) = arg max_a Q(s, a), we use a separate network parameterized by θ^π to guide the actor’s movement. Let the actor loss be J(θ^π). The network is updated through the standard TD error policy gradient (Sutton 1988) with mini-batch size N:

∇_{θ^π} J = N^{−1} Σ_i δ_i ∇_{θ^π} log π(a_i | s_i, θ^π).   (8)

At step h, the action distribution at s_h is pushed towards a_h if the TD error δ_h > 0. We use temporal difference learning instead of directly maximizing the values to reduce variance.
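For concreteness, a minimal PyTorch sketch of the update in Eq. (8) (module and variable names are ours; we assume a policy network with a discrete action head, as in the cart-pole experiments below):

```python
import torch

def td_policy_gradient_step(policy_net, optimizer, states, actions, td_errors):
    """One update of Eq. (8): log-probabilities of taken actions scaled by TD errors.

    states    : (N, state_dim) float tensor
    actions   : (N,) long tensor of the taken actions
    td_errors : (N,) float tensor of TD errors from the NN critic (treated as constants)
    """
    logits = policy_net(states)                               # (N, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on N^{-1} sum_i delta_i * log pi(a_i|s_i): minimize the negative.
    loss = -(td_errors.detach() * chosen).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```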

Nearest-Neighbor-Based Critic

The NN critic in Algorithm 1 maintains an experience buffer B to construct the nearest neighbor tree. As in Schulman et al. (2016), we use a parameter γ ∈ (0, 1] to down-weight future rewards due to delayed effects. This parameter corresponds to the discount factor in infinite-horizon discounted problems, but we take it as a variance reduction technique.

Following previous notation, let x_h denote a state-action pair at step h, and let M be the number of neighbors considered in value approximation. The exact estimation procedure exploits Monte Carlo tree search. Given x_h, find the M samples in history with the minimum distances d_X(x_h, x_i), i ∈ [M]. Consider their next states and the actions that would be taken under π_h. Expand these new state-action pairs until the tree depth reaches H − h. The total reward is back-propagated from the terminal states, and the minimum value is the estimate for V^π(x_h). When M = 1, the recursive formula is:

V̂^π(x) = r(x) + γ · V̂^π(x′) + L · d_X(x, x′),   (9)

where x′ = arg min_{x″∈B} d_X(x, x″) is the nearest neighbor and V̂^π(x′) = Q̂^π(x′, π(x′)). In long-horizon problems, we can replace the varying H − h with a fixed planning horizon H′.

Equation (9) prioritizes exploring (s, a) pairs that are farther away from the seen ones. By Assumption 3, the Lipschitz bonus L · d_X(x, x′) ensures that the real value V^π(x) is upper-bounded by the estimate V̂^π(x). As new observations accumulate, d_X(x, x′) becomes smaller and the upper bound is improved. Since exploration is based on the value upper bound, NNAC encourages exploring new actions. In training, d_X is problem-specific and L can be tuned as a hyperparameter.
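A minimal sketch of the M = 1 recursion, mirroring the NNFUNCAPPROX procedure of Algorithm 1 (the buffer layout and function names are ours; the buffer stores the pairs x_i together with their rewards r_i and next states s′_i, and `policy` maps a state to an action):

```python
import numpy as np

def nn_value(s, h, policy, buffer_x, buffer_r, buffer_next_s, H, L, gamma):
    """Recursive NN value estimate for state s at step h (the M = 1 case of Algorithm 1).

    buffer_x      : (N, dim)   array of stored state-action pairs x_i
    buffer_r      : (N,)       array of the stored rewards r(x_i)
    buffer_next_s : (N, s_dim) array of the stored next states f(x_i)
    """
    if h >= H:
        return 0.0
    x = np.concatenate([s, np.atleast_1d(policy(s))])      # query pair (s, pi(s))
    dists = np.linalg.norm(buffer_x - x, axis=1)            # d_X(x, x_i)
    i = int(np.argmin(dists))                                # nearest stored pair
    v_next = nn_value(buffer_next_s[i], h + 1, policy,
                      buffer_x, buffer_r, buffer_next_s, H, L, gamma)
    return buffer_r[i] + gamma * v_next + L * dists[i]       # r_i + gamma*V'_i + L*d
```

In practice, the linear scan over the buffer is replaced by the Kd-tree lookup discussed later in this section.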

The distance-based bonus is less sensitive to the dimension of X. Deep RL has poor sample efficiency for pixel data as it makes implicit use of feature encoding to find a low-dimensional embedding. In contrast, the NN approximator does not need any parametric model to capture the intrinsic states. With a proper metric defined in the high-dimensional space, the neighbors preserve their similarities and the distances can still be good indicators of their relative values.

Evaluation of NNAC

We test NNAC on the cart-pole balancing problem. Due to the discrete nature of the action space, we compare with dueling double deep Q-networks (DDDQN) (Wang et al. 2016), trust region policy optimization (TRPO) (Schulman et al. 2015), proximal policy optimization (PPO) (Schulman et al. 2017), soft actor-critic (SAC) (Haarnoja et al. 2018), and neural episodic control (NEC) (Pritzel et al. 2017). Our goal is to show that the NN critic enables efficient learning with high-dimensional data, which is generally not easy to achieve. Therefore, we select a task with relatively simple dynamics and do not compare with all state-of-the-art deep RL methods that typically use more samples.

Cart-pole environment: We use the OpenAI Gym implementation (Brockman et al. 2016). The state of the cart is described by a 4-tuple (θ, θ̇, x, v), where θ and θ̇ are the angle and angular velocity of the pole, x is the horizontal position of the cart, and v is the velocity. The horizon and the maximum achievable reward in one episode are both 500.

State space dimension: We prepare three types of inputs.

• dim(S) = 4. The 4-tuple descriptors are used directly.


Figure 1: Learning curves for CartPole-v1. (a) NNAC vs. baselines (DDDQN, PPO, TRPO, SAC, NEC) with internal state descriptors; (b) NNAC with different input dimensions (dim = 4, 10, 100, image with a learned metric, and image with L2 distance). Axes show average return against time steps (×10^5). The shaded areas denote one standard deviation of evaluations over 5 trials. Curves are smoothed by taking a 500-step moving average.

• dim(S) = 10 and 100. We use random projection matrices to map the 4-tuples into high-dimensional spaces. The matrix columns are orthonormal to preserve the distances and the neighbor relations in the new metric spaces.

• dim(S) = 4 × 20 × 20. We crop the cart-pole from the 160 × 240 gray-scale images and down-sample it to 20 × 20 pixels. Four consecutive frames are stacked together to derive the velocity and acceleration of the moving object.

In all cases, L2 distance is used for the nearest neighbor search. As the L2 distance may not be a good measurement for image similarity, we also learn a distance oracle with the Siamese network for comparison (details in Appendix C.2).

Network structure and hyperparameters: We use the Stable Baselines implementation of the deep RL agents (Hill et al. 2018). For low-dimensional NNAC, we use a one-layer policy network with 32 hidden units and a ReLU activation. When learning from pixels, we add a convolutional layer with 16 units before the policy network. The discount factor γ is 0.99. The Lipschitz L is determined by a grid search and set to 7. All agents are trained with 5 random seeds. Evaluation is done every 1000 steps without exploration.

Results and discussion: Figure 1a shows the learning curves for all agents when the internal states are given. NNAC learns better policies with fewer samples than the other baselines. Also, compared with network critics that need extensive hyperparameter tuning, the NN approximator only requires experimenting with L.

Figure 1b illustrates the performance of NNAC with different input dimensions. It converges in a similar number of steps for the intrinsic and projected states. This agrees with our proposition, since the distances are preserved by the matrix transformation.

Algorithm 2 Soft Nearest Neighbor Update

// Assume a small constant ε > 0
Initialize policy network θ^π, value network θ^Q, B = ∅, α ← α_0
for each episode do
    for each environment step do
        Take action a according to π(s) and the exploration strategy
        Receive reward r and next state s′
        if α > ε then
            V(s) ← NNFUNCAPPROX(s, h, π, B, H)
            V(s′) ← NNFUNCAPPROX(s′, h+1, π, B, H)
            δ = r + γ · V(s′) − V(s)
        end if
        B ← B ∪ {(s, a, s′, r, δ)}
    end for
    for each gradient step do
        Update the actor by Equation (11)
        Update the critic by Equation (10)
    end for
    α ← (1 − β) · α
end for

For the pixel data coupled with a learned metric, NNAC achieves the same performance with slightly more samples, which might be related to learning the convolutional filters. When L2 distance is used to measure image similarities, the algorithm is less stable and does not solve the environment within the limited time steps. However, the final average return is already comparable to that of the deep RL agents with internal state input. Indeed, deep RL typically uses a linear output layer after several nonlinear layers. It can be interpreted as a feature encoding process followed by linear value approximation. Rather than learning the encoding, NNAC treats the distance metric as known information, thereby reducing the amount of unnecessary work.

Empirically, we find the L2 distance to be a good choice for low-dimensional, physical state spaces, but other, more sophisticated metrics might be needed to capture the differences between images. For generic high-dimensional tasks, our algorithm is efficient as long as a distance oracle is provided.

The major concern of NNAC is the computational cost of finding the neighbors. We use a Kd-tree (Friedman et al. 1977) in our implementation. For more complicated MDPs, the training data size can be too large to build a Kd-tree efficiently. Thus, in the next section, we introduce a new method which preserves the original neural networks of deep RL agents to reduce the computational burden.
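As an illustration of the neighbor lookup only (a sketch assuming SciPy’s Kd-tree implementation and L2 distance; the paper does not prescribe a specific library, and the array names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

buffer_x = np.random.rand(10000, 5)   # stored (s, a) pairs, e.g., 4-dim state + 1-dim action
tree = cKDTree(buffer_x)              # rebuilt (or batched) as the buffer grows

query = np.random.rand(5)             # a new (s, a) pair
dist, idx = tree.query(query, k=1)    # nearest stored neighbor and its L2 distance
```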

Nearest Neighbor Update Module

In this section, we show that the NN approximator can boost the efficiency of existing deep RL algorithms. As illustrated in previous sections, our method explores the action space efficiently both in theory and in practice. However, as samples accumulate, a neural network with sufficient training data can outperform other function approximators due to its generalization ability. Hence, we propose a nearest neighbor plug-and-play module (Algorithm 2) that acts as a “starter” to accelerate deep RL agents and can be removed later.

We focus on actor-critic methods. Without changing the algorithm’s structure, a plug-in NN critic supplies value estimates to the rest of the framework. While the actor benefits from the TD error policy gradient, the value network in the original algorithm is penalized for large deviations from the NN estimates. We adopt an adaptive weighting scheme and decrease the weight of the NN module to avoid a computational bottleneck. Let α be the weight of the NN approximator. There are three major components in the update module.

Figure 2: Learning curves for OpenAI Gym MuJoCo continuous control tasks (Ant-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2) for NNTD3, TD3, NNDDPG, and DDPG. Axes show average return against time steps (×10^5). The shaded areas denote one standard deviation of evaluations over 5 trials. For visual clarity, the curves are smoothed by taking a moving average of 10^4 environment steps.

Modification 1: NN Value Estimation

An NN critic is incorporated to guide the training of both the actor and critic networks. In addition to the value network, NNFUNCAPPROX in Algorithm 1 is used to estimate the current state values. Assume that the original critic loss is L(θ^Q). We penalize large differences between the TD errors estimated by the value network and the NN critic. Using the mean squared error, the critic objective is reformulated as:

L′(θ^Q) = (1 − α) · L(θ^Q) + α · ‖δ_{θ^Q} − δ_NN‖²,   (10)

where δ_{θ^Q} = r + γ·V′_{θ^Q} − V_{θ^Q} is obtained from the value network, and δ_NN is the TD error supplied by the NN approximator.
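A minimal PyTorch-style sketch of the reweighted critic objective in Eq. (10) (names are ours; `critic_loss` stands for the original algorithm’s critic loss, e.g., the Bellman error in DDPG or TD3):

```python
import torch

def soft_nn_critic_loss(critic_loss, td_net, td_nn, alpha):
    """Eq. (10): blend the original critic loss with an NN-consistency penalty.

    critic_loss : scalar tensor, the unmodified critic loss L(theta_Q)
    td_net      : TD errors computed with the value network (keeps gradients)
    td_nn       : TD errors supplied by the NN critic (treated as constants)
    alpha       : current weight of the NN module
    """
    penalty = torch.mean((td_net - td_nn.detach()) ** 2)
    return (1.0 - alpha) * critic_loss + alpha * penalty
```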

Modification 2: TD-Regularized Policy Learning

Similar to NNAC, we also include a TD error policy gradient term in the actor loss. Let the original actor loss be J(θ^π). The modified gradient ∇_{θ^π} J′(θ^π) is:

(1 − α) · ∇_{θ^π} J(θ^π) + α · δ_NN ∇_{θ^π} log π(a|s; θ^π).   (11)

The auxiliary TD term increases the weights for rewarding actions and decreases the weights for less preferable actions.
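Continuing the sketch above, the blended actor objective of Eq. (11) can be written as a loss to minimize (again with our own names; `actor_loss` is the base algorithm’s unmodified actor objective):

```python
def soft_nn_actor_loss(actor_loss, log_prob_a, td_nn, alpha):
    """Eq. (11) written as a loss: minimizing it follows the modified gradient."""
    td_term = -(td_nn.detach() * log_prob_a).mean()   # TD error policy gradient term
    return (1.0 - alpha) * actor_loss + alpha * td_term
```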

Modification 3: Continual TD Supervision

When the MDP has a large intrinsic dimension, it is impractical to compute the nearest neighbors at each recursion step for every mini-batch sample. Therefore, we use an exponentially decaying weight parameter for the NN critic: α = α_0 · (1 − β)^k, where α_0 ∈ (0, 1] is the initial weight, β ∈ (0, 1) is the decrease rate, and k is the episode number. MC simulation terminates when α is close to 0. However, past TD estimates can still supervise the learning and reduce the chance of network forgetting. This is achieved by storing δ_NN along with the stepwise observation in the experience buffer. For mini-batches sampled at each gradient step, (10) and (11) are used to update the networks if δ_NN is available. Otherwise, the original gradients are used.
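The decay schedule and cutoff can be written compactly (a sketch; α_0, β, and ε follow the notation of Algorithm 2):

```python
def nn_weight(alpha0, beta, k, eps=1e-3):
    """Exponentially decaying NN weight; below eps the MC rollouts are skipped."""
    alpha = alpha0 * (1.0 - beta) ** k
    return alpha if alpha > eps else 0.0
```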

Environment     NNTD3      TD3        NNDDPG     DDPG
Ant             4727.44    3425.07    1115.00    1025.82
Hopper          3704.40    2708.25    3202.64    2029.65
Walker2d        5170.20    4260.23    2917.25    2633.86
HalfCheetah     8341.74    7045.53    7001.52    6385.43

Table 1: Average return over 5 trials. Top values are bolded.

Evaluation of Soft Nearest Neighbor Update

Setup: We implement the NN critic on DDPG (Lillicrap et al. 2016) and TD3 (Fujimoto et al. 2018) and test with four MuJoCo locomotion tasks (Todorov et al. 2012). For fairness, we use the same hyperparameters for each method before and after adding the NN module. The network structure is selected from the benchmark work (Duan et al. 2016) and is identical for all agents. L2 distance is used to find the neighbors. More experiment details are given in Appendix E.

Results and discussion: Figure 2 shows the experiment results. The auxiliary NN critic improves the sample efficiency of DDPG and TD3 in most settings, though we do not see a major performance gain for DDPG on Ant-v2. If DDPG cannot solve an environment, the soft update module is of less help, given that the original value networks still play a crucial role in learning. We summarize two principal benefits of the NN update framework.

1. NN-guided training encourages exploration and helps the agents to overcome local optima. Table 1 shows that the NN algorithms obtain larger maximum returns. The Lipschitz bonus highlights exploring unvisited states. This directional exploration is more efficient than random noise.

2. The upper-bounded value estimation stabilizes training by preventing overestimation of Q(s, a). This technique has a similar effect to the twin Q-networks in TD3.

Conclusion

In this paper, we provide a nearest neighbor function approximator for efficient value learning and justify its sample complexity for high-dimensional input in deterministic systems. The NN value estimator can be incorporated into model-free deep RL to encourage exploration and stabilize training. Our work suggests that there is great potential to improve deep RL with non-parametric methods. Future work can explore the benefits of nearest neighbor search in active learning or extend the theories to stochastic environments.


Acknowledgements

We thank all anonymous reviewers for their insightful comments. LY acknowledges the support from the Simons Institute at Berkeley (Theory of Reinforcement Learning).

References

Arulkumaran, K.; Deisenroth, M.; Brundage, M.; and Bharath, A. 2017. A Brief Survey of Deep Reinforcement Learning. ArXiv abs/1708.05866.

Azar, M. G.; Osband, I.; and Munos, R. 2017. Minimax Regret Bounds for Reinforcement Learning. In ICML.

Blundell, C.; Uria, B.; Pritzel, A.; Li, Y.; Ruderman, A.; Leibo, J. Z.; Rae, J. W.; Wierstra, D.; and Hassabis, D. 2016. Model-Free Episodic Control. ArXiv abs/1606.04460.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. ArXiv abs/1606.01540.

Buckman, J.; Hafner, D.; Tucker, G.; Brevdo, E.; and Lee, H. 2018. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. ArXiv abs/1807.01675.

Busemeyer, J. R.; Byun, E.; DeLosh, E. L.; and McDaniel, M. A. 1997. Learning functional relations based on experience with input-output pairs by humans and artificial neural networks. In Lamberts, K.; and Shanks, D., eds., Concepts and Categories, 405–437. Cambridge: MIT Press.

Carroll, J. D. 1963. Functional Learning: The Learning of Continuous Functional Mappings Relating Stimulus and Response Continua. ETS Research Bulletin Series 1963(2): i–144. doi:10.1002/j.2333-8504.1963.tb00958.x.

Deisenroth, M. P.; and Rasmussen, C. E. 2011. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In ICML.

Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking Deep Reinforcement Learning for Continuous Control. In ICML.

Emigh, M.; Kriminger, E.; Brockmeier, A.; Príncipe, J.; and Pardalos, P. 2016. Reinforcement Learning in Video Games Using Nearest Neighbor Interpolation and Metric Learning. IEEE Transactions on Computational Intelligence and AI in Games 8: 56–66.

Friedman, J.; et al. 1977. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. Math. Softw. 3: 209–226.

Fujimoto, S.; et al. 2018. Addressing Function Approximation Error in Actor-Critic Methods. ArXiv abs/1802.09477.

Gu, S.; Holly, E.; Lillicrap, T. P.; and Levine, S. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 3389–3396.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In ICML.

Hill, A.; Raffin, A.; Ernestus, M.; Gleave, A.; Kanervisto, A.; Traore, R.; Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; and Wu, Y. 2018. Stable Baselines 2.10.1 (2020-08-05). https://github.com/hill-a/stable-baselines.

Jaksch, T.; Ortner, R.; and Auer, P. 2008. Near-optimal Regret Bounds for Reinforcement Learning. J. Mach. Learn. Res.

Jiang, N.; Krishnamurthy, A.; Agarwal, A.; Langford, J.; and Schapire, R. 2017. Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. ArXiv abs/1610.09512.

Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-learning Provably Efficient? In NeurIPS.

Jin, C.; Yang, Z.; Wang, Z.; and Jordan, M. I. 2020. Provably Efficient Reinforcement Learning with Linear Function Approximation. ArXiv abs/1907.05388.

Kansky, K.; Silver, T.; Mely, D. A.; Eldawy, M.; Lazaro-Gredilla, M.; Lou, X.; Dorfman, N.; Sidor, S.; Phoenix, D.; and George, D. 2017. Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics. ArXiv abs/1706.04317.

Kegl, B. 2002. Intrinsic Dimension Estimation Using Packing Numbers. In NIPS.

Kober, J.; Bagnell, J.; and Peters, J. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32: 1238–1274.

Lake, B. M.; Ullman, T. D.; Tenenbaum, J.; and Gershman, S. 2018. Building Machines That Learn and Think Like People. The Behavioral and Brain Sciences 40: e253.

Levina, E.; and Bickel, P. 2004. Maximum Likelihood Estimation of Intrinsic Dimension. In NIPS.

Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research 17(39): 1–40.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N. M. O.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. CoRR abs/1509.02971.

Lin, Z.; Zhao, T.; Yang, G.; and Zhang, L. 2018. Episodic Memory Deep Q-Networks. ArXiv abs/1805.07603.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with Deep Reinforcement Learning. ArXiv abs/1312.5602.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518: 529–533.

Osband, I.; and Roy, B. V. 2014. Model-based Reinforcement Learning and the Eluder Dimension. In NIPS.

Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R. Y.; Chen, X.; Asfour, T.; Abbeel, P.; and Andrychowicz, M. 2018. Parameter Space Noise for Exploration. ArXiv abs/1706.01905.

Pritzel, A.; Uria, B.; Srinivasan, S.; Badia, A. P.; Vinyals, O.; Hassabis, D.; Wierstra, D.; and Blundell, C. 2017. Neural Episodic Control. In ICML.

Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 1889–1897.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M. I.; and Abbeel, P. 2016. High-Dimensional Continuous Control Using Generalized Advantage Estimation. CoRR abs/1506.02438.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. ArXiv abs/1707.06347.

Shah, D.; and Xie, Q. 2018. Q-learning with Nearest Neighbors. ArXiv abs/1802.03900.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T. P.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529: 484–489.

Sun, W.; Jiang, N.; Krishnamurthy, A.; Agarwal, A.; and Langford, J. 2019. Model-based RL in Contextual Decision Processes: PAC Bounds and Exponential Improvements over Model-free Approaches. In COLT.

Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3: 9–44.

Todorov, E.; et al. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033.

van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep Reinforcement Learning with Double Q-Learning. In AAAI.

Wang, R.; Du, S.; Yang, L. F.; and Salakhutdinov, R. 2020. On Reward-Free Reinforcement Learning with Linear Function Approximation. ArXiv abs/2006.11274.

Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H. V.; Lanctot, M.; and Freitas, N. D. 2016. Dueling Network Architectures for Deep Reinforcement Learning. In ICML.

Yang, L. F.; Ni, C.; and Wang, M. 2019. Learning to Control in Metric Space with Optimal Regret. ArXiv abs/1905.01576.

Yang, L. F.; and Wang, M. 2019. Sample-Optimal Parametric Q-Learning Using Linearly Additive Features. In ICML.

Zanette, A.; and Brunskill, E. 2019. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. In ICML.


Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation (Appendix)

Junhong Shen, Lin F. Yang
University of California, Los Angeles

[email protected], [email protected]

A UCRL-FA

Algorithm 3 UCRL-FA (Yang, Ni, and Wang 2019)

1: Input: a deterministic metric MDP
2: Initialize: B^{(0)} ← ∅, Q^{(0)}_h(s, a) ← H, r^{(0)}(s, a) ← 1 for all
3:   (s, a) ∈ S × A, h ∈ [H]
4: for episode k = 1, 2, ..., K do
5:   for step h = 1, 2, ..., H do
6:     Observe current state s^{(k)}_h
7:     Play action a^{(k)}_h = arg max_{a∈A} Q^{(k)}_h(s^{(k)}_h, a)
8:     Record s^{(k)}_{h+1} ← f(s^{(k)}_h, a^{(k)}_h), r^{(k)}_h ← r(s^{(k)}_h, a^{(k)}_h)
9:   end for
10:  B^{(k+1)} ← B^{(k)} ∪
11:    {(s^{(k)}_i, a^{(k)}_i, f(s^{(k)}_i, a^{(k)}_i), r(s^{(k)}_i, a^{(k)}_i)) : i ∈ [H]}
12:  r^{(k+1)} ← FUNCAPPROX({((s, a), r(s, a))}_{(s,a)∈B^{(k+1)}})
13:  Q^{(k+1)}_H ← r^{(k+1)}
14:  Update Q^{(k+1)}_h recursively:
15:  Q^{(k+1)}_h ← FUNCAPPROX({((s, a), r(s, a) +
16:    sup_{a′∈A} Q^{(k+1)}_{h+1}(f(s, a), a′))})
17: end for

We present the pseudocode for upper-confidence reinforcement learning with a general function approximator (UCRL-FA), proposed by Yang, Ni, and Wang (2019). In their problem setting, the MDPs have discrete state and action spaces, and the rewards are assumed to be either 0 or 1. When the function approximator in Algorithm 3 takes the form of the nearest neighbor construction, Lines 12 to 16 become:

r^{(k+1)}(s, a) = min_{(s′,a′)∈B^{(k+1)}} ( r(s′, a′) + L_1 · dist[(s, a), (s′, a′)] ),

Q^{(k+1)}_H(s, a) ← min[ r^{(k+1)}(s, a), 1 ],

Q^{(k+1)}_h(s, a) ← min_{(s′,a′)∈B^{(k+1)}} [ r(s′, a′) + sup_{a″∈A} Q^{(k+1)}_{h+1}(f(s′, a′), a″) + L_1 · dist[(s, a), (s′, a′)] ].

B Proof of Lemma 6

Proof. Though the state representation in Y is different from that in X, the two MDPs have the same transition model, reward function, and thus the same Q-function. Let s̄ = g(s) ∈ Y. Plugging the bi-Lipschitz condition (4) into Equations (2) and (3), we have ∀s ∈ S, a ∈ A:

|Q^*_h(s̄, a) − Q^*_h(s̄′, a′)| = |Q^*_h(s, a) − Q^*_h(s′, a′)|
    ≤ L_1 · d_X[(s, a), (s′, a′)]            (by Eq. (2))
    ≤ L_1C · d_Y[(s̄, a), (s̄′, a′)]           (by Eq. (4))

and

max_{a″} d_Y[(f(s̄, a), a″), (f(s̄′, a′), a″)]
    ≤ C · max_{a″} d_X[(f(s, a), a″), (f(s′, a′), a″)]    (by Eq. (4))
    ≤ L_2C · d_X[(s, a), (s′, a′)]                         (by Eq. (3))
    ≤ L_2C² · d_Y[(s̄, a), (s̄′, a′)]                        (by Eq. (4))

Lemma 6 follows.

C Details of the Cart-Pole Experiment

C.1 Environment Specification

An example image of the OpenAI Gym CartPole-v1 environment (Brockman et al. 2016) is presented in Figure 3. The state of the cart-pole is a 4-tuple (θ, θ̇, x, v), where θ is the vertical angle of the pole, θ̇ is the angular velocity, x is the horizontal position of the cart, and v is its velocity. The transition model can be described by the following system of equations, where m_p and m_c are the masses of the pole and the cart, F is the applied force, l is the half-length of the pole, g is the gravitational acceleration, and t is the integration time step:

θ̈ = [ g sin θ − cos θ · (F + m_p l θ̇² sin θ) / (m_p + m_c) ] / [ l (4/3 − m_p cos² θ / (m_p + m_c)) ]

ẍ = [ F + m_p l (θ̇² sin θ − θ̈ cos θ) ] / (m_p + m_c)

x ← x + t · v,    v ← v + t · ẍ,    θ ← θ + t · θ̇,    θ̇ ← θ̇ + t · θ̈

An episode ends when either the cart hits the track boundaries (x ∉ [−4.8, 4.8]) or the pole has fallen over (θ ∉ [−24°, 24°]). The agent receives a reward of 0 upon termination and +1 otherwise.
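For reference, a small Python sketch of one Euler integration step of the dynamics above (our own helper; the default mass, length, and time-step values are illustrative rather than guaranteed to match the exact Gym constants):

```python
import math

def cartpole_step(x, v, theta, omega, F, m_c=1.0, m_p=0.1, l=0.5, g=9.8, t=0.02):
    """One Euler step of the cart-pole dynamics described above."""
    total = m_c + m_p
    temp = (F + m_p * l * omega ** 2 * math.sin(theta)) / total
    theta_acc = (g * math.sin(theta) - math.cos(theta) * temp) / (
        l * (4.0 / 3.0 - m_p * math.cos(theta) ** 2 / total))
    x_acc = temp - m_p * l * theta_acc * math.cos(theta) / total
    return (x + t * v, v + t * x_acc, theta + t * omega, omega + t * theta_acc)
```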

C.2 Distance Metric Learning

We use the Siamese network to learn the internal distance between two input image stacks. Let the images be x and x′ with internal states s and s′, respectively. To obtain accurate predictions of the distances, we use larger pixel-size images with dimension 4 × 160 × 240. Therefore, in the experiments, we down-sample the rendered images twice with different scales: 160 × 240 for distance calculation and 20 × 20 for policy learning.

Figure 3: Image of CartPole-v1.

Both x and x′ are first passed individually to a convolutional neural network for feature encoding. The network outputs are concatenated together and fed into a fully connected network for distance prediction. Denote the metric network by θ_d. The objective is L(θ_d) = ‖θ_d(x, x′) − ‖s − s′‖_2‖². We randomly sample 100000 (x, s) pairs from the environment and train the network for 5 epochs with learning rate 1 × 10^{−3} and batch size 16. The network structure is:

Conv 1: 128 [3×3×1] filters, leaky ReLU
Pool:   2×2 max with stride 2
Conv 2: 64 [3×3×1] filters, leaky ReLU
Pool:   2×2 max with stride 2
Conv 3: 16 [3×3×4] filters, leaky ReLU
Flatten
FC:     64, leaky ReLU
Concatenate
FC:     8, leaky ReLU
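A rough PyTorch sketch of a Siamese distance predictor in the spirit of the structure above (layer sizes are simplified; the adaptive pooling and the final scalar head are our own additions to make the sketch self-contained):

```python
import torch
import torch.nn as nn

class SiameseDistance(nn.Module):
    """Predicts the internal distance between two image stacks (rough sketch)."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3), nn.LeakyReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 64, 3), nn.LeakyReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 16, 3), nn.LeakyReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),          # simplification: fixes the feature size
            nn.Linear(16 * 4 * 4, 64), nn.LeakyReLU(),
        )
        self.head = nn.Sequential(nn.Linear(128, 8), nn.LeakyReLU(), nn.Linear(8, 1))

    def forward(self, x, x_prime):
        z = torch.cat([self.encoder(x), self.encoder(x_prime)], dim=1)
        return self.head(z).squeeze(-1)                     # predicted internal distance

# Regression target: the Euclidean distance between the true internal states, e.g.
# loss = ((model(x, x_prime) - (s - s_prime).norm(dim=1)) ** 2).mean()
```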

C.3 Hyperparameters for NNAC

NN-related parameters

• Number of nearest neighbors M: 1
• Planning horizon H′: 12
• Lipschitz parameter L: 7, determined by a grid search over {0.1, 0.5, 1, 2, ..., 10}
• Action space dimension: 1
• Weight w for (s, a) pairs when calculating the distance, i.e., d(x, x′) = √(Σ_i w_i (x_i − x′_i)²):
  – Dimension 4: (0.25, 0.25, 0.25, 0.25, 1)
  – Dimension 10: (0.1, ..., 0.1, 1)
  – Dimension 100: (0.01, ..., 0.01, 1)
  – Image: (1, ..., 1, 1)
  The last dimension is the action weight.

Environment and network-related parameters

• γ: 0.99

• Policy network structure:
  – Dimension 4, 10, 100: (32, ReLU, tanh, softmax)
  – Image: (conv 16, 32, ReLU, tanh, softmax)

• Network weight initialization: [−3 × 10⁻³, 3 × 10⁻³]

• Optimizer: Adam

• Mini-batch size: 32

• Learning rate: 5 × 10⁻⁴

C.4 Evaluation with Pixel Data

[Figure 4 plot: average return (0–500) versus time steps (×10⁵) for NNAC (learned metric), NNAC (L2 distance), DDDQN, PPO, TRPO, and NEC.]

Figure 4: Learning curves for image CartPole-v1. The shaded areas denote one standard deviation of evaluations over 5 trials. For visual clarity, the curves are smoothed by taking a moving average of 500 time steps.

We also evaluate the baseline algorithms (DDDQN, PPO, TRPO, NEC) with pixel data of the CartPole-v1 environment. The image inputs are obtained in the same way as the NNAC experiment. In particular, dim(S) = 4 × 20 × 20. For the deep RL agents, we use the Stable Baselines (?) implementation with the CNN policy. For NEC, we use the author-provided implementation. The policy networks have identical structures to the NNAC actor. Figure 4 shows that, compared with learning from internal state descriptors, learning from pixels requires more samples in general. NNAC with a learned metric solves the environment with the highest sample efficiency. Regardless of the distance metric used, NNAC obtains better average rewards at the end of training compared with the other algorithms.

D Example Usage of Soft NN Update Module

In deterministic control systems, unlike neural network value function approximators, which may change drastically over iterations and do not have a monotonic improvement guarantee, the nearest neighbor function approximator ensures that the value upper bound becomes smaller as the data pool grows. Approximation accuracy is improved in the sense that once a regret is paid, the algorithm gains new information such that the same regret will not be paid again.


To better illustrate how the NN approximator can be combined with existing algorithms, we provide the pseudocode for two widely used deep RL agents, DDPG (?) and TD3 (?), when equipped with the NN critic (Algorithms 4 and 5, respectively).

E Hyperparameters of MuJoCo Experiments

E.1 Soft Nearest Neighbor DDPG

NN-related parameters

Environment     Hopper          Walker2d        HalfCheetah     Ant
α0              0.9             0.5             0.9             0.9
β               0 if k < 20,    0 if k < 20,    0 if k < 20,    0.995
                1 otherwise     1 otherwise     1 otherwise
ε               10⁻³            10⁻³            10⁻³            10⁻³
M               1               1               1               1
L               7               7               5               7
H′              12              12              12              12
τNN             0.2             0.2             0.2             0.2
Neg δ scale     0.3             0.3             0.3             0.3
Grad clip       10              10              10              10

• α0: initial NN weight.

• β: NN weight decreasing rate.

• ε: NN termination threshold, i.e., when the NN weight is smaller than ε, we no longer use MC rollouts to estimate TD errors.

• M: number of nearest neighbors.

• L: Lipschitz parameter.

• H ′: planning horizon.

• τNN : target network update rate when using the NN critic.

• Negative TD error scale: scale the negative TD error to improve network convergence.

• Gradient clip threshold: clip the TD error policy gradient to improve network stability.

Environment and network-related parameters

Environment     Hopper      Walker2d    HalfCheetah     Ant
γ               0.99        0.99        0.99            0.99
Net. struct.    (400, ReLU, 300, ReLU) for all environments
Weight init.    [−3 × 10⁻³, 3 × 10⁻³] for all environments
Optimizer       Adam for all environments
θπ lr           10⁻³        10⁻³        10⁻³            10⁻³
θQ lr           10⁻³        10⁻³        10⁻³            10⁻³
σN              0.3         0.2         0.2             0.1
τ               0.005       0.005       0.005           0.005
Batch size      256         256         256             256
r scale         0.1         0.1         1               1

• γ: discount factor (also acting as a variance reduction parameter).

• Network structure: used in policy and value networks.

• Weight initialization: range of the uniform distribution for network weight initialization.

• Optimizer: used for both policy and value learning.

• θπ lr: initial policy network learning rate.

• θQ lr: initial value network learning rate.

• σN : standard deviation of the normal action noise.

• τ : target networks update parameter.

• Batch size: size of mini-batches at each gradient update step.

• r scale: scale the reward to improve network stability.

E.2 Soft Nearest Neighbor TD3

NN-related parameters

Environment     Hopper          Walker2d        HalfCheetah     Ant
α0              0.9             0.9             0.9             0.9
β               0 if k < 20,    0 if k < 20,    0 if k < 20,    0 if k < 20,
                1 otherwise     1 otherwise     1 otherwise     1 otherwise
ε               10⁻³            10⁻³            10⁻³            10⁻³
M               1               1               1               1
L               4               4               5               4
H′              12              12              12              12
τNN             0.2             0.2             0.2             0.2
Neg δ scale     0.3             0.3             0.3             0.3
Grad clip       10              10              10              10

Environment and network-related parameters

Environment     Hopper      Walker2d    HalfCheetah     Ant
γ               0.99        0.99        0.99            0.99
Net. struct.    (400, ReLU, 300, ReLU) for all environments
Weight init.    [−3 × 10⁻³, 3 × 10⁻³] for all environments
Optimizer       Adam for all environments
θπ lr           10⁻³        10⁻³        10⁻³            10⁻³
θQ lr           10⁻³        10⁻³        10⁻³            10⁻³
σN              0.3         0.2         0.2             0.2
Noise clip      0.5         0.5         0.5             0.5
τ               0.005       0.005       0.005           0.005
Batch size      256         256         256             256
r scale         0.1         0.1         1               0.1
Policy freq.    2           2           2               2

• Noise clip: action noise clip threshold.

• Policy frequency: policy update frequency w.r.t. gradient update steps.


Algorithm 4 Soft Nearest Neighbor DDPG

Randomly initialize policy network θπ and value network θQ
Initialize target networks θ′π ← θπ, θ′Q ← θQ, replay buffer B = ∅, NN critic weight α ← α0
for episode = 1, ..., K do
    Receive initial random observation s1
    for h = 1, ..., H do
        Take action ah = π(sh | θπ) + N(0, σ), receive reward rh and next state sh+1
        B ← B ∪ {(sh, ah, sh+1, rh, h, NA)}
        Sample a mini-batch of N transitions (si, ai, si+1, ri, hi, δi)
        yi = ri + γ · Q′(si+1, π′(si+1 | θπ′) | θQ′)
        δQ_i = yi − Q(si, ai | θQ)
        if α > ε then
            V(si) ← NNFUNCAPPROX(si, hi, π′, B, H)
            V(si+1) ← NNFUNCAPPROX(si+1, hi + 1, π′, B, H)
            δNN_i = ri + γ · V(si+1) − V(si)    ▷ NN TD error estimates
            Update the critic: L′(θQ) = N⁻¹ Σ [ α‖δQ_i − δNN_i‖² + (1 − α)(yi − Q′(si, ai | θQ′))² ]
            Update the actor: ∇θπ J′(θπ) = N⁻¹ Σ [ α · δNN_i ∇θπ log π(ai | si; θπ) + (1 − α) ∇a Q(si, π(si | θπ) | θQ) ∇θπ π(si | θπ) ]
            δi ← δNN_i
        else
            Update the critic: L′(θQ) = N⁻¹ Σ [ (yi − Q′(si, ai | θQ′))² + ε‖δQ_i − δi‖² ]    ▷ Continual TD supervision
            Update the actor: ∇θπ J′(θπ) = N⁻¹ Σ ∇a Q(si, π(si | θπ) | θQ) ∇θπ π(si | θπ)
        end if
        θ′π ← τθπ + (1 − τ)θ′π,  θ′Q ← τθQ + (1 − τ)θ′Q
    end for
    α ← (1 − β) · α
end for
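To make the α-mixing in Algorithm 4 concrete, here is a minimal PyTorch-style sketch of one critic/actor update step. The names (nn_func_approx, actor, critic, the batch layout) and several simplifications are assumptions: the online critic is used in the regression term so that gradients flow, and the α-weighted log-likelihood term of the actor update is omitted for brevity.

import torch

def soft_nn_ddpg_update(batch, actor, critic, actor_target, critic_target,
                        nn_func_approx, alpha, eps, gamma,
                        actor_opt, critic_opt):
    """One α-mixed update step in the spirit of Algorithm 4 (illustrative sketch)."""
    s, a, s_next, r, h, delta_old = batch  # tensors drawn from the replay buffer

    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))  # standard DDPG target

    delta_q = y - critic(s, a)  # network TD error

    if alpha > eps:
        with torch.no_grad():
            v = nn_func_approx(s, h)                # NN rollout value of s
            v_next = nn_func_approx(s_next, h + 1)  # NN rollout value of s'
            delta_nn = r + gamma * v_next - v       # NN TD error estimate
        # α-weighted mix of NN TD supervision and the usual TD regression.
        critic_loss = (alpha * (delta_q - delta_nn) ** 2
                       + (1 - alpha) * (y - critic(s, a)) ** 2).mean()
    else:
        # Continual TD supervision with the stored NN TD errors δ_i.
        critic_loss = ((y - critic(s, a)) ** 2
                       + eps * (delta_q - delta_old) ** 2).mean()

    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient for the actor.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # New NN TD errors are written back to the buffer when they were computed.
    return delta_nn if alpha > eps else delta_old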

Algorithm 5 Soft Nearest Neighbor TD3

Randomly initialize policy network θπ and value networks θQ1, θQ2
Initialize target networks θ′π ← θπ, θ′Qn ← θQn (n = 1, 2), replay buffer B = ∅, NN critic weight α ← α0
for episode = 1, ..., K do
    Receive initial random observation s1
    for h = 1, ..., H do
        Take action ah = π(sh | θπ) + N(0, σ), receive reward rh and next state sh+1
        B ← B ∪ {(sh, ah, sh+1, rh, h, NA)}
        Sample a mini-batch of N transitions (si, ai, si+1, ri, hi, δi)
        a′i+1 ← π′(si+1 | θπ′) + εn,  εn ← clip(N(0, σ), −c, c)
        yi = ri + γ · min_{n=1,2} Q′(si+1, a′i+1 | θQ′n)
        δQ_i = yi − min_{n=1,2} Q′(si, ai | θQ′n)
        if α > ε then
            V(si) ← NNFUNCAPPROX(si, hi, π′, B, H)
            V(si+1) ← NNFUNCAPPROX(si+1, hi + 1, π′, B, H)
            δNN_i = ri + γ · V(si+1) − V(si)
            Update the critics (n = 1, 2): θQn ← argmin_{θQn} N⁻¹ Σ [ α‖δQ_i − δNN_i‖² + (1 − α)(yi − Qn(si, ai | θQn))² ]
            if h mod policyfreq = 0 then
                Update the actor: ∇θπ J′(θπ) = N⁻¹ Σ [ α · δNN_i ∇θπ log π(ai | si; θπ) + (1 − α) ∇a Q1(si, π(si | θπ) | θQ1) ∇θπ π(si | θπ) ]
            end if
            δi ← δNN_i
        else
            Update the critics (n = 1, 2): θQn ← argmin_{θQn} N⁻¹ Σ [ (yi − Qn(si, ai | θQn))² + ε‖δQ_i − δi‖² ]
            if h mod policyfreq = 0 then
                Update the actor: ∇θπ J′(θπ) = N⁻¹ Σ ∇a Q1(si, ai | θQ1) ∇θπ π(si | θπ)
            end if
        end if
        θ′π ← τθπ + (1 − τ)θ′π,  θ′Qn ← τθQn + (1 − τ)θ′Qn, n = 1, 2
    end for
    α ← (1 − β) · α
end for
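Similarly, the TD3-specific pieces of Algorithm 5, namely the clipped double-Q target with smoothed target actions and the α-mixed critic regression, can be sketched as follows; the function and variable names are illustrative assumptions rather than the released implementation.

import torch

def td3_target(batch, actor_target, critic1_target, critic2_target,
               gamma, sigma, noise_clip):
    """Clipped double-Q target with target-policy smoothing (illustrative sketch)."""
    s, a, s_next, r, h, delta_old = batch
    with torch.no_grad():
        # Smoothed target action: clipped Gaussian noise added to the target policy.
        noise = torch.clamp(sigma * torch.randn_like(a), -noise_clip, noise_clip)
        a_next = actor_target(s_next) + noise
        # Take the minimum of the two target critics.
        q_next = torch.min(critic1_target(s_next, a_next),
                           critic2_target(s_next, a_next))
        return r + gamma * q_next

def mixed_critic_loss(y, q_pred, delta_nn, alpha):
    """α-weighted mix of NN TD supervision and the standard TD regression."""
    delta_q = y - q_pred
    return (alpha * (delta_q - delta_nn) ** 2 + (1 - alpha) * (y - q_pred) ** 2).mean()

Each critic Q_n would be trained with mixed_critic_loss, and the actor updated only every policyfreq steps through ∇_a Q_1, matching the delayed policy updates in the pseudocode.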