Safe and Efficient Off-Policy Reinforcement Learning

Safe and Efficient Off-Policy Reinforcement Learning. Presenter: Ryo Iwaki, 2016/11/12, Kansai NIPS Reading Group


TRANSCRIPT

Page 1: safe and efficient off policy reinforcement learning

Safe and Efficient Off-Policy Reinforcement Learning

Presenter: Ryo Iwaki

2016/11/12 @ Kansai NIPS Reading Group

Page 2: safe and efficient off policy reinforcement learning

2

• Ryo Iwaki

– Asada Laboratory, Osaka University, D1 (first-year Ph.D. student)
– [email protected]
– Reinforcement learning :: policy search :: natural policy gradient methods

About Me

Page 3: safe and efficient off policy reinforcement learning

Safe and Efficient Off-Policy Reinforcement Learning

Safe and efficient off-policy reinforcement learning

Rémi Munos, Google DeepMind

Tom Stepleton, Google DeepMind

Anna Harutyunyan, Vrije Universiteit Brussel

Marc G. Bellemare, Google DeepMind

Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) low variance; (2) safety, as it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) efficiency, as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. To our knowledge, this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was still an open problem. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.

One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based methods (where return refers to the sum of discounted rewards $\sum_t \gamma^t r_t$) offer some advantages over value bootstrap methods: they are better behaved when combined with function approximation, and quickly propagate the fruits of exploration (Sutton, 1996). On the other hand, value bootstrap methods are more readily applied to off-policy data, a common use case. In this paper we show that learning from returns need not be at cross-purposes with off-policy learning.

We start from the recent work of Harutyunyan et al. (2016), who show that naive off-policy policy evaluation, without correcting for the "off-policyness" of a trajectory, still converges to the desired Q^π value function provided the behavior µ and target π policies are not too far apart (the maximum allowed distance depends on the λ parameter). Their Q^π(λ) algorithm learns from trajectories generated by µ simply by summing discounted off-policy corrected rewards at each time step. Unfortunately, the assumption that µ and π are close is restrictive, as well as difficult to uphold in the control case, where the target policy is always greedy with respect to the current Q-function. In that sense this algorithm is not safe: it does not handle the case of arbitrary "off-policyness".

Alternatively, the Tree-backup (TB(λ)) algorithm (Precup et al., 2000) tolerates arbitrary target/behavior discrepancies by scaling information (here called traces) from future temporal differences by the product of target policy probabilities. TB(λ) is not efficient in the "near on-policy" case (similar µ and π), though, as traces may be cut prematurely, blocking learning from full returns.

In this work, we express several off-policy, return-based algorithms in a common form. From this we derive an improved algorithm, Retrace(λ), which is both safe and efficient, enjoying convergence guarantees for off-policy policy evaluation and, more importantly, for the control setting.

Retrace(λ) can learn from full returns retrieved from past policy data, as in the context of experience replay (Lin, 1993), which has returned to favour with advances in deep reinforcement learning (Mnih et al., 2015).

arXiv:1606.02647v1 [cs.LG] 8 Jun 2016

• Off-policy reinforcement learning
– the policy that generates the training data differs from the policy we want to learn

• This paper proposes an algorithm that learns the value function by approximating the return sample-efficiently for arbitrary behaviour policies

Page 4: safe and efficient off policy reinforcement learning

4

• State
• Action
• State transition probability
• Reward function
• Policy
• Q-function

• Transition operator

• Action-value function

• Learning objective: obtain a policy that maximizes the value

Reinforcement Learning

Agent

Environment

State / Action / Reward

$(P^\pi Q)(x, a) \triangleq \sum_{x' \in \mathcal{X}} \sum_{a' \in \mathcal{A}} P(x'|x, a)\, \pi(a'|x')\, Q(x', a')$

$Q^\pi \triangleq \sum_{t \ge 0} \gamma^t (P^\pi)^t r, \qquad \gamma \in [0, 1)$

$x \in \mathcal{X}, \quad a \in \mathcal{A}, \quad P : \mathcal{X} \times \mathcal{A} \to \Pr(\mathcal{X}), \quad r : \mathcal{X} \times \mathcal{A} \to [-R_{\mathrm{MAX}}, R_{\mathrm{MAX}}], \quad \pi : \mathcal{X} \to \Pr(\mathcal{A}), \quad Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$
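To make these definitions concrete, here is a minimal tabular sketch (hypothetical NumPy code, not from the slides or the paper) of the transition operator P^π and of Q^π as the truncated discounted sum Σ_{t≥0} γ^t (P^π)^t r; the array names, sizes, and the random MDP are assumptions for illustration only.

```python
import numpy as np

# Hypothetical tabular MDP (array names and sizes are assumptions for illustration).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                  # P[x, a, x'] = P(x'|x, a)
r = rng.uniform(-1.0, 1.0, (n_states, n_actions))  # r(x, a) in [-R_MAX, R_MAX]
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)                # pi(a|x)

def P_pi(Q):
    """(P^pi Q)(x, a) = sum_{x'} sum_{a'} P(x'|x, a) pi(a'|x') Q(x', a')."""
    v = (pi * Q).sum(axis=1)   # v(x') = sum_{a'} pi(a'|x') Q(x', a')
    return P @ v               # shape (n_states, n_actions)

def Q_pi(n_terms=1000):
    """Q^pi = sum_{t >= 0} gamma^t (P^pi)^t r, truncated after n_terms terms."""
    Q, term = np.zeros_like(r), r.copy()
    for _ in range(n_terms):
        Q += term
        term = gamma * P_pi(term)
    return Q

print(Q_pi())   # the action-value function of pi
```

The truncation error of the sum decays as γ^n, so a few hundred terms already suffice for γ = 0.9.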

Page 5: safe and efficient off policy reinforcement learning

5

• Bellman operator

– its fixed point is the action-value function Q^π

• Bellman optimality operator

– its fixed point is the optimal value function Q*

– the greedy policy with respect to Q* is an optimal policy π*

Bellman Equations

$\mathcal{T}^\pi Q \triangleq r + \gamma P^\pi Q \qquad (\text{fixed point: } Q^\pi)$

$\mathcal{T} Q \triangleq r + \gamma \max_\pi P^\pi Q \qquad (\text{fixed point: } Q^*, \ \text{greedy policy w.r.t. } Q^*: \pi^*)$
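A companion sketch of the two operators in the same kind of hypothetical tabular setting (again an assumption for illustration): repeatedly applying T^π is policy evaluation, repeatedly applying T is value iteration, and the greedy policy with respect to the latter's fixed point is optimal.

```python
import numpy as np

# Hypothetical tabular MDP (names and sizes are assumptions for illustration).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                 # P[x, a, x'] = P(x'|x, a)
r = rng.uniform(-1.0, 1.0, (n_states, n_actions)) # r(x, a)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)               # pi(a|x)

def bellman_T_pi(Q):
    """(T^pi Q)(x,a) = r(x,a) + gamma * (P^pi Q)(x,a); fixed point: Q^pi."""
    return r + gamma * (P @ (pi * Q).sum(axis=1))

def bellman_T(Q):
    """(T Q)(x,a) = r(x,a) + gamma * E_{x'}[max_{a'} Q(x',a')]; fixed point: Q*."""
    return r + gamma * (P @ Q.max(axis=1))

# Policy evaluation: repeated application of T^pi converges to Q^pi.
Q_eval = np.zeros((n_states, n_actions))
for _ in range(500):
    Q_eval = bellman_T_pi(Q_eval)

# Value iteration: repeated application of T converges to Q*.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = bellman_T(Q)
greedy_policy = Q.argmax(axis=1)   # greedy w.r.t. Q* gives an optimal policy pi*
```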

Page 6: safe and efficient off policy reinforcement learning

6

• The target policy π(a|x) that we want to learn differs from the behaviour policy µ(a|x) that generated the data

Off-policy

[Embedded excerpt from Mnih et al. (2015): schematic of the DQN convolutional network (Figure 1) and training curves of average score per episode and average predicted action-value on Space Invaders and Seaquest (Figure 2)]

[Mnih+15]

[Embedded excerpt from Sugimoto et al. (2016), IEEE Robotics & Automation Magazine: recursive off-policy PGPE for reusing previous experiences on the humanoid robot CB-i (Figure 13)]

[Sugimoto+16]

[Embedded excerpt from Degris et al. (2012), Off-Policy Actor-Critic: comparison of Off-PAC with Q(λ), Greedy-GQ, and Softmax-GQ under a uniform behaviour policy on mountain car, pendulum, and continuous grid-world benchmarks (Figure 1)]


[Degris+12]

$\mathbb{E}_\mu[\cdot] \neq \mathbb{E}_\pi[\cdot]$

Page 7: safe and efficient off policy reinforcement learning

7

• Forward view
– we would like to learn with the true return as the target
– which cannot be done directly

• Backward view
– keep past state-action pairs as a trace and distribute the current reward to them

• backward view = forward view [Sutton & Barto 98]

Return-Based Reinforcement Learning

Eligibility trace:
$\sum_{s=0}^{t} (\gamma\lambda)^{t-s}\, \mathbb{I}\{x_s, a_s = x, a\}$

$Q(x_t, a_t) \leftarrow (1 - \alpha)\, Q(x_t, a_t) + \alpha \sum_{s \ge t} \gamma^{s-t} (P^\pi)^{s-t} r$

$\sum_{t \ge 0} \gamma^t (P^\pi)^t r$

[Figure: from the current step (x_t, a_t, r_t), future rewards r_{t+1}, r_{t+2}, ..., r_{t+n} form the return (forward view); past pairs (x_{t-n}, a_{t-n}), ..., (x_{t-1}, a_{t-1}) receive the current reward with geometrically decaying weights (backward view)]
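As an illustration of the backward view, here is a hypothetical tabular sketch of on-policy learning with accumulating eligibility traces (a Sarsa(λ)-style update, not the paper's algorithm); `env`, `policy`, and the (states x actions) array `Q` are assumed placeholders.

```python
import numpy as np

def td_lambda_episode(env, policy, Q, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of on-policy learning with accumulating eligibility traces.

    `env` (with reset/step) and `policy` are placeholders; Q is a (states x actions) array.
    """
    e = np.zeros_like(Q)                 # e(x, a): eligibility trace
    x = env.reset()
    a = policy(x)
    done = False
    while not done:
        x_next, reward, done = env.step(a)
        a_next = policy(x_next)
        # TD error: delta_t = r_t + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)
        delta = reward + (0.0 if done else gamma * Q[x_next, a_next]) - Q[x, a]
        e *= gamma * lam                 # decay all past traces by gamma * lambda
        e[x, a] += 1.0                   # e_t(x, a) = sum_s (gamma*lam)^{t-s} I{x_s, a_s = x, a}
        Q += alpha * delta * e           # distribute the current TD error to past pairs
        x, a = x_next, a_next
    return Q
```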

Page 8: safe and efficient off policy reinforcement learning

8

• λ-return: an exponentially weighted average of n-step returns

• λ ∈ [0, 1] bridges TD(0) and the Monte Carlo method

λ-Return and TD(λ)

$\mathcal{T}^\pi_\lambda Q \triangleq (1 - \lambda) \sum_{n \ge 0} \lambda^n (\mathcal{T}^\pi)^{n+1} Q = Q + (I - \lambda\gamma P^\pi)^{-1}(\mathcal{T}^\pi Q - Q), \qquad \lambda \in [0, 1]$

$\mathcal{T}^\pi_{\lambda=0} Q = \mathcal{T}^\pi Q, \qquad \mathcal{T}^\pi_{\lambda=1} Q = Q^\pi$

$\sum_{t \ge 0} \gamma^t (P^\pi)^t r$

[Figure: the same return diagram as on the previous slide, with the n-step returns weighted by powers of λ]
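To check the identity above numerically, a small hypothetical NumPy sketch compares the truncated series (1−λ) Σ_n λ^n (T^π)^{n+1} Q with the closed form Q + (I − λγP^π)^{−1}(T^π Q − Q) on a random MDP; all names, sizes, and the random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, lam = 3, 2, 0.9, 0.7
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(-1.0, 1.0, (S, A))
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)

# P^pi as an (S*A) x (S*A) matrix over state-action pairs, Q as a flat vector.
P_pi = np.einsum('xay,yb->xayb', P, pi).reshape(S * A, S * A)
r_vec = r.reshape(S * A)
T_pi = lambda q: r_vec + gamma * P_pi @ q        # Bellman operator T^pi

Q = rng.standard_normal(S * A)

# Truncated series (1 - lam) * sum_{n >= 0} lam^n (T^pi)^{n+1} Q.
series, TnQ = np.zeros(S * A), Q.copy()
for n in range(2000):
    TnQ = T_pi(TnQ)                              # now equals (T^pi)^{n+1} Q
    series += (1.0 - lam) * lam ** n * TnQ

# Closed form Q + (I - lam * gamma * P^pi)^{-1} (T^pi Q - Q).
closed = Q + np.linalg.solve(np.eye(S * A) - lam * gamma * P_pi, T_pi(Q) - Q)

assert np.allclose(series, closed)
```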

Page 9: safe and efficient off policy reinforcement learning

9

• The probability of generating a trajectory F_t = (x_0, a_0, r_0, x_1, ..., x_t, a_t, r_t) differs greatly from one policy to another
– so the stored traces become bogus

• How should the learning samples be weighted / corrected so that the return is approximated efficiently?

• Goals
– safe: converges for arbitrary π(a|x) and µ(a|x)
– efficient: approximates the return sample-efficiently

Off-Policy, Return-Based

Page 10: safe and efficient off policy reinforcement learning

10

Off-Policy Return Operator (Forward View)

$\mathbb{E}_\pi Q(x, \cdot) \triangleq \sum_a \pi(a|x)\, Q(x, a)$

Weighting by the trajectory generation probability

Correcting the reward value with the target policy

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

• Using the operator R, we
– organize the related work
– derive a new algorithm
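A hypothetical Monte Carlo sketch of the general operator: given one trajectory drawn from µ and a choice of per-step trace coefficients c_s, it returns a single-sample estimate of RQ(x_0, a_0); averaging over many trajectories approximates the expectation E_µ[·]. The data layout (`trajectory` as (x, a, r, x_next) tuples, tabular `Q` and `pi` arrays, a `trace_coef` callback) is an assumption for illustration.

```python
def off_policy_return_operator(Q, trajectory, pi, gamma, trace_coef):
    """One-sample estimate of RQ(x_0, a_0) for a trajectory drawn from mu.

    trajectory: list of (x_t, a_t, r_t, x_{t+1}) tuples.
    pi[x, a]  : target-policy probabilities.
    trace_coef(t, x_t, a_t) -> c_t, the per-step trace coefficient.
    """
    x0, a0 = trajectory[0][0], trajectory[0][1]
    estimate = Q[x0, a0]
    discount, trace = 1.0, 1.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t >= 1:
            trace *= trace_coef(t, x, a)       # prod_{s=1}^{t} c_s (empty product = 1 at t = 0)
        # delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)
        expected_next = (pi[x_next] * Q[x_next]).sum()
        delta = r + gamma * expected_next - Q[x, a]
        estimate += discount * trace * delta   # gamma^t * (prod c_s) * delta_t
        discount *= gamma
    return estimate
```

The related algorithms on the next slides correspond to different choices of `trace_coef`.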

Page 11: safe and efficient off policy reinforcement learning

11

• Importance Sampling (IS)
– plain importance weights
– the rewards themselves are not corrected

✘ The variance of $\sum_{t \ge 0} \gamma^t \prod_{s=1}^{t} \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$ is large

Related Work 1

[Precup+00;01;Geist&Scherrer 14]

$c_s = \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

$\mathcal{R}^{\mathrm{IS}} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \Big) r_t \Big]$

Page 12: safe and efficient off policy reinforcement learning

12

• Q(λ)
– classical eligibility traces
– the immediate reward is corrected with the target policy

✘ Because the trajectory is not reweighted, convergence is not guaranteed when π(a|x) and µ(a|x) differ substantially

Related Work 2

[Harutyunyan+16]

$c_s = \lambda$

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

$\mathcal{R}^\lambda Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} (\gamma\lambda)^t \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

Convergence condition: $\max_x \| \pi(\cdot|x) - \mu(\cdot|x) \|_1 \le \frac{1 - \gamma}{\lambda\gamma}$

unsafe

Page 13: safe and efficient off policy reinforcement learning

13

• Tree Backup (TB)
– trajectories are weighted by the target policy itself
✓ can learn even when the behaviour policy is non-Markovian

✘ When π(a|x) and µ(a|x) are close (near on-policy), the traces decay quickly and the return is not approximated efficiently

Related Work 3

$c_s = \lambda\, \pi(a_s|x_s)$

[Precup+00]

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

$\mathcal{R}^\lambda Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} (\gamma\lambda)^t \Big( \prod_{s=1}^{t} \pi(a_s|x_s) \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

inefficient

Page 14: safe and efficient off policy reinforcement learning

14

• Importance weights truncated at 1

✓ The variance of $\sum_t \gamma^t \prod_{s=1}^{t} c_s$ is finite

✓ Even near on-policy the trace decays slowly, since $\min\big(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\big) \ge \pi(a_s|x_s)$

Proposed Method :: Retrace(λ)

$c_s = \lambda \min\Big(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\Big)$

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$
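The four algorithms above differ only in the choice of c_s. A small sketch of the four coefficient choices as hypothetical helper functions (`pi_s` and `mu_s` stand for π(a_s|x_s) and µ(a_s|x_s)); any of them can be plugged into a generic trace-based update such as the operator sketch earlier.

```python
def c_importance_sampling(pi_s, mu_s, lam=1.0):
    """IS: c_s = pi(a_s|x_s) / mu(a_s|x_s); unbiased but high variance."""
    return pi_s / mu_s

def c_q_lambda(pi_s, mu_s, lam=1.0):
    """Q(lambda): c_s = lambda; safe only when pi and mu are close."""
    return lam

def c_tree_backup(pi_s, mu_s, lam=1.0):
    """TB(lambda): c_s = lambda * pi(a_s|x_s); cuts traces even near on-policy."""
    return lam * pi_s

def c_retrace(pi_s, mu_s, lam=1.0):
    """Retrace(lambda): c_s = lambda * min(1, pi(a_s|x_s) / mu(a_s|x_s))."""
    return lam * min(1.0, pi_s / mu_s)
```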

Page 15: safe and efficient off policy reinforcement learning

15

Comparison with Related Work

Definition of c_s | Estimation variance | Guaranteed convergence† | Use full returns (near on-policy)
Importance sampling: π(a_s|x_s)/µ(a_s|x_s) | High | for any π, µ | yes
Q(λ): λ | Low | only for π close to µ | yes
TB(λ): λ π(a_s|x_s) | Low | for any π, µ | no
Retrace(λ): λ min(1, π(a_s|x_s)/µ(a_s|x_s)) | Low | for any π, µ | yes

Table 1: Properties of several algorithms defined in terms of the general operator given in (4). †Guaranteed convergence of the expected operator R.

• The online Retrace(λ) algorithm converges a.s. to Q* in the control case. In the control case, convergence does not require the GLIE assumption.

• As a corollary, we prove the convergence of Watkins's Q(λ) to Q*.

3 Analysis of Retrace(λ)

We will in turn analyse both off-policy policy evaluation and control settings. We will show that R is a contraction mapping in both settings (under a mild additional assumption for the control case).

3.1 Policy Evaluation

Consider a fixed target policy π. For ease of exposition we consider a fixed behaviour policy µ, noting that our result extends to the setting of sequences of behaviour policies (µ_k : k ∈ N). Our first result states the γ-contraction of the operator (4) defined by any set of non-negative coefficients c_s = c_s(a_s, F_s) (in order to emphasize that c_s can be a function of the whole history F_s) under the assumption that c_s ≤ π(a_s|x_s)/µ(a_s|x_s).

Theorem 1. The operator R defined by (4) has a unique fixed point Q^π. Furthermore, if for each a_s ∈ A and each history F_s we have c_s = c_s(a_s, F_s) ∈ [0, π(a_s|x_s)/µ(a_s|x_s)], then for any Q-function Q,
$\| \mathcal{R} Q - Q^\pi \| \le \gamma \| Q - Q^\pi \|.$

The following lemma will be useful in proving Theorem 1 (proof in the appendix).

Lemma 1. The difference between RQ and its fixed point Q^π is
$\mathcal{R} Q(x, a) - Q^\pi(x, a) = \mathbb{E}_\mu \Big[ \sum_{t \ge 1} \gamma^t \Big( \prod_{i=1}^{t-1} c_i \Big) \big( \mathbb{E}_\pi [(Q - Q^\pi)(x_t, \cdot)] - c_t (Q - Q^\pi)(x_t, a_t) \big) \Big].$

Proof (Theorem 1). The fact that Q^π is the fixed point of the operator R is obvious from (4), since $\mathbb{E}_{x_{t+1} \sim P(\cdot|x_t, a_t)} \big[ r_t + \gamma \mathbb{E}_\pi Q^\pi(x_{t+1}, \cdot) - Q^\pi(x_t, a_t) \big] = (\mathcal{T}^\pi Q^\pi - Q^\pi)(x_t, a_t) = 0$, since Q^π is the fixed point of T^π. Now, from Lemma 1, and defining ΔQ := Q − Q^π, we have
$\mathcal{R} Q(x, a) - Q^\pi(x, a) = \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{x_{1:t}, a_{1:t}} \Big[ \Big( \prod_{i=1}^{t-1} c_i \Big) \big( \mathbb{E}_\pi \Delta Q(x_t, \cdot) - c_t \Delta Q(x_t, a_t) \big) \Big]$
$= \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{x_{1:t}, a_{1:t-1}} \Big[ \Big( \prod_{i=1}^{t-1} c_i \Big) \big( \mathbb{E}_\pi \Delta Q(x_t, \cdot) - \mathbb{E}_{a_t} [\, c_t(a_t, F_t)\, \Delta Q(x_t, a_t) \mid F_t \,] \big) \Big]$
$= \sum_{t \ge 1} \gamma^t\, \mathbb{E}_{x_{1:t}, a_{1:t-1}} \Big[ \Big( \prod_{i=1}^{t-1} c_i \Big) \sum_b \big( \pi(b|x_t) - \mu(b|x_t)\, c_t(b, F_t) \big) \Delta Q(x_t, b) \Big].$

Safe Efficient

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$

Page 16: safe and efficient off policy reinforcement learning

16

• For an arbitrary sequence of behaviour policies µ_k and an increasingly greedy sequence of target policies π_k, if the traces are Markovian,
$c_s = c(a_s, x_s) \in \Big[0, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\Big],$
and the initialization is pessimistic, $\mathcal{T}^{\pi_0} Q_0 \ge Q_0$, then
$\| \mathcal{R}_k Q_k - Q^* \| \le \gamma \| Q_k - Q^* \| + \varepsilon_k \| Q_k \|$

• Moreover, if the target policies approach the greedy policy, then
$Q_k \to Q^* \quad (\varepsilon_k \to 0)$

Convergence of Retrace(λ)

Page 17: safe and efficient off policy reinforcement learning

17

• Show that the fixed point Q* exists

• Show that R is a contraction mapping

Convergence Proof via the Fixed-Point Theorem

$\| \mathcal{R}_k Q_k - Q^* \| \le \gamma \| Q_k - Q^* \|$

[Diagram: repeatedly applying R maps any Q closer to the fixed point Q*, until Q = Q*]

under $c_s \in \Big[0, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\Big]$, i.e. $\pi - c\,\mu \ge 0$

Page 18: safe and efficient off policy reinforcement learning

18

• Off-Policy Return Operator (forward view)

• Algorithm (backward view)

Forward View = Backward View

Algorithm (backward view):
$Q_{k+1}(x, a) \leftarrow Q_k(x, a) + \alpha_k \sum_{t \ge s} \delta^{\pi_k}_t \sum_{j=s}^{t} \gamma^{t-j} \Big( \prod_{i=j+1}^{t} c_i \Big) \mathbb{I}\{x_j, a_j = x, a\}$

$\delta^{\pi_k}_t = r_t + \gamma \mathbb{E}_{\pi_k} Q_k(x_{t+1}, \cdot) - Q_k(x_t, a_t)$

Forward view:
$Q_{k+1}(x, a) = (1 - \alpha_k)\, Q_k(x, a) + \alpha_k\, \mathcal{R}_k Q_k(x, a)$

$\mathcal{R}_k Q_k(x, a) = Q_k(x, a) + \mathbb{E}_{\mu_k} \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \mathbb{E}_{\pi_k} Q_k(x_{t+1}, \cdot) - Q_k(x_t, a_t) \big) \Big]$
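A hypothetical tabular sketch of the backward-view update for a single trajectory drawn from the behaviour policy: every previously visited pair (x_j, a_j) accumulates the current corrected TD error δ_t^{π_k}, weighted by γ^{t−j} Π_{i=j+1}^{t} c_i, here with the Retrace coefficient c_i = λ min(1, π/µ). The trajectory layout and array types are assumptions for illustration.

```python
def retrace_backward_update(Q, trajectory, pi, alpha, gamma, lam):
    """One backward-view pass over a trajectory of (x_t, a_t, r_t, mu_t, x_{t+1}) tuples.

    pi[x, a] : target-policy probabilities; mu_t = mu(a_t | x_t) of the behaviour policy.
    """
    e = {}                                          # eligibility of past (x_j, a_j) pairs
    for x, a, r, mu_prob, x_next in trajectory:
        # decay all existing traces by gamma * c_t, with c_t = lam * min(1, pi/mu)
        c = lam * min(1.0, pi[x, a] / mu_prob)
        for key in e:
            e[key] *= gamma * c
        e[(x, a)] = e.get((x, a), 0.0) + 1.0        # the current pair gets weight 1
        # delta_t^{pi} = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)
        delta = r + gamma * (pi[x_next] * Q[x_next]).sum() - Q[x, a]
        for (xj, aj), weight in e.items():
            Q[xj, aj] += alpha * weight * delta     # weight = gamma^{t-j} * prod_{i=j+1}^{t} c_i
    return Q
```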

Page 19: safe and efficient off policy reinforcement learning

19

• Q is approximated by a CNN
• Learning with 16 parallel threads

1. θ' ← θ
2. Each thread computes a gradient starting from θ'
3. θ is updated using all of the gradients

• Experience Replay
– the number of samples per update is kept equal (64)
– for Retrace, TB, and Q*: 4 sequences of 16 consecutive transitions

Experiments :: Atari 2600 + DQN

[Embedded excerpt from Mnih et al. (2015): DQN network schematic and training curves (Figures 1 and 2), as on the Off-policy slide above]

[Mnih+15]

[Mnih+16]

DQN:
$\delta Q(x_t, a_t; \theta) = r(x_t, a_t) + \gamma \max_{a'} Q(x_{t+1}, a'; \theta^-) - Q(x_t, a_t; \theta)$

$\theta \leftarrow \theta + \alpha_t\, \delta Q(x_t, a_t; \theta)\, \frac{\partial Q(x_t, a_t; \theta)}{\partial \theta}$

Retrace, TB, Q*:
$\Delta Q(x_t, a_t; \theta) = \sum_{s=t}^{t+k-1} \gamma^{s-t} \Big( \prod_{i=t+1}^{s} c_i \Big) \big[ r(x_s, a_s) + \gamma \mathbb{E}_\pi Q(x_{s+1}, \cdot; \theta^-) - Q(x_s, a_s; \theta) \big]$
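A hypothetical sketch of the sequence target ΔQ(x_t, a_t; θ) above for one replayed sub-sequence; `q` holds online-network values Q(·; θ), `q_target` holds target-network values Q(·; θ^−), and these names and the flat array layout are assumptions rather than the authors' implementation.

```python
def retrace_sequence_target(q, q_target, pi, actions, rewards, mus, gamma, lam):
    """Delta Q(x_t, a_t; theta) for a replayed sub-sequence of length k.

    q[s, a]        : Q(x_{t+s}, a; theta)   for s = 0..k-1 (online network).
    q_target[s, a] : Q(x_{t+s}, a; theta^-) for s = 0..k   (target network).
    pi[s, a]       : pi(a | x_{t+s})        for s = 0..k.
    actions[s], rewards[s], mus[s] = a_{t+s}, r_{t+s}, mu(a_{t+s} | x_{t+s}) for s = 0..k-1.
    """
    k = len(rewards)
    delta_q, discount, trace = 0.0, 1.0, 1.0
    for s in range(k):
        if s >= 1:
            trace *= lam * min(1.0, pi[s, actions[s]] / mus[s])   # Retrace coefficient c_s
        expected_next = (pi[s + 1] * q_target[s + 1]).sum()       # E_pi Q(x_{s+1}, .; theta^-)
        td = rewards[s] + gamma * expected_next - q[s, actions[s]]
        delta_q += discount * trace * td
        discount *= gamma
    return delta_q
```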

Page 20: safe and efficient off policy reinforcement learning

20

• Performance compared on 60 games
• z_{g,a}: the score of algorithm a on game g

Experimental Results

[Figure: fraction of games vs. inter-algorithm score after 40M training frames (Q*, Retrace, Tree-backup, Q-Learning) and after 200M training frames (Retrace, Tree-backup, Q-Learning)]

Figure 1: Inter-algorithm score distribution for λ-return (λ = 1) variants and Q-Learning (λ = 0).

Increasingly greedy policies. The assumption that the sequence of target policies (π_k) is increasingly greedy w.r.t. the sequence of (Q_k) is more general than just considering greedy policies w.r.t. (Q_k) (which is Watkins's Q(λ)), and may be more efficient as well. Indeed, using non-greedy target policies π_k can speed up convergence as the traces will not be cut as frequently. Of course, in order to converge to Q*, we eventually need the target policies (and not the behaviour policies, as mentioned above) to become greedy in the limit (i.e. ε_k → 0 as defined in Theorem 2).

Comparison to Q^π(λ). Unlike Retrace(λ), Q^π(λ) does not need to know the behaviour policy µ. However, it fails to converge when µ is far from π. Retrace(λ) uses its knowledge of µ (for the chosen actions) to cut the traces and safely handle arbitrary policies π and µ.

Comparison to TB(λ). Similarly to Q^π(λ), TB(λ) does not need the knowledge of the behaviour policy µ. But as a consequence, TB(λ) is not able to benefit from possible near on-policy situations, cutting traces unnecessarily when π and µ are close.

Continuous action space. Let us mention that Theorems 1 and 2 extend to the case of (measurable) continuous or infinite action spaces. The trace coefficients will make use of the densities min(1, dπ/dµ) instead of the probabilities min(1, π/µ). This would not be possible with TB(λ).

Open questions include: (1) removing the technical assumption that P^{π_k} and P^{π_k ∧ µ_k} asymptotically commute, (2) relaxing the Markov assumption in the control case in order to allow trace coefficients c_t of the form (10).

5 Experimental Results

To validate our theoretical results, we employ Retrace(λ) in an experience replay (Lin, 1993) setting, where sample transitions are stored within a large but bounded replay memory and subsequently replayed as if they were new experience. Naturally, older data in the memory is usually drawn from a policy which differs from the current policy, offering an excellent point of comparison for the algorithms presented in Section 2.

Our agent adapts the DQN architecture of Mnih et al. (2015) to replay short sequences from the memory (details in Appendix F) instead of single transitions. The Q-function target for a sample sequence x_t, a_t, r_t, ..., x_{t+k} is
$\Delta Q(x_t, a_t) = \sum_{s=t}^{t+k-1} \gamma^{s-t} \Big( \prod_{i=t+1}^{s} c_i \Big) \big[ r(x_s, a_s) + \gamma \mathbb{E}_\pi Q(x_{s+1}, \cdot) - Q(x_s, a_s) \big].$

We compare our algorithms' performance on 60 different Atari 2600 games in the Arcade Learning Environment (Bellemare et al., 2013) using Bellemare et al.'s inter-algorithm score distribution. Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best score for a particular game, within the set of algorithms under comparison. If g ∈ {1, ..., 60} is a game and z_{g,a} the inter-algorithm score on g for algorithm a, then the score distribution function is f(x) := |{g : z_{g,a} ≥ x}|/60. Roughly, a strictly higher curve corresponds to a better algorithm.

Across values of λ, λ = 1 performs best, save for Q* where λ = 0.5 obtains slightly superior performance. However, Q* diverges for larger λ values (see Figure 1, left), and yields poor performance for smaller ones. Both Retrace and TB(λ) achieve dramatically higher performance than

$f(x) \triangleq |\{g : z_{g,a} \ge x\}| / 60, \qquad z_{g,a} \in [0, 1]$
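A small sketch (with hypothetical random raw scores) of how the inter-algorithm scores z_{g,a} and the distribution f(x) = |{g : z_{g,a} ≥ x}| / 60 can be computed from a game-by-algorithm score table.

```python
import numpy as np

# Hypothetical raw scores: rows = 60 games, columns = algorithms (made-up data).
rng = np.random.default_rng(0)
raw = rng.random((60, 4))
algorithms = ["Retrace", "Tree-backup", "Q-Learning", "Q*"]

# Inter-algorithm normalization: 0 = worst and 1 = best score on each game.
lo = raw.min(axis=1, keepdims=True)
hi = raw.max(axis=1, keepdims=True)
z = (raw - lo) / (hi - lo)                    # z[g, a] in [0, 1]

def score_distribution(z_a, xs):
    """f(x) = |{g : z_{g,a} >= x}| / number of games."""
    return np.array([(z_a >= x).mean() for x in xs])

xs = np.linspace(0.0, 1.0, 101)
for name, z_a in zip(algorithms, z.T):
    f = score_distribution(z_a, xs)           # a strictly higher curve is better
    print(name, "fraction of games with z >= 0.5:", f[50])
```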

Page 21: safe and efficient off policy reinforcement learning

21

• Gave a unified view of existing return-based off-policy methods.

• Proposed a sample-efficient return-based off-policy algorithm for arbitrary behaviour policies.
– the variance is kept small by truncating the trace coefficients at 1

• On Atari 2600, showed higher learning performance than existing return-based off-policy methods.

Summary

Page 22: safe and efficient off policy reinforcement learning

22

• Because c is always at most 1, the traces still decay quickly.
• To use the largest admissible trace while keeping the variance small, one could for example set
$c_s = \lambda \min\Big( \frac{1}{c_1 \cdots c_{s-1}}, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \Big)$
• However, c_s then becomes non-Markovian, so the convergence guarantee no longer applies.

Future Work
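A hypothetical sketch of the coefficient suggested above: it caps c_s by the inverse of the running product of past coefficients, so the accumulated trace stays as large as the variance constraint allows; because each c_s depends on the whole history, it is non-Markovian and falls outside the convergence analysis.

```python
def modified_traces(pi_probs, mu_probs, lam=1.0):
    """c_s = lam * min(1 / (c_1 ... c_{s-1}), pi(a_s|x_s) / mu(a_s|x_s)).

    pi_probs[s], mu_probs[s]: probabilities of the taken action a_s under pi and mu.
    """
    coeffs, running_product = [], 1.0
    for pi_s, mu_s in zip(pi_probs, mu_probs):
        cap = 1.0 / running_product if running_product > 0 else float("inf")
        c = lam * min(cap, pi_s / mu_s)   # depends on the whole history: non-Markovian
        coeffs.append(c)
        running_product *= c
    return coeffs
```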