
Count-Based Exploration with Neural Density Models

Georg Ostrovski¹, Marc G. Bellemare¹, Aaron van den Oord¹, Remi Munos¹

Abstract

Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from a density model, to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonus for a DQN agent and, combined with a mixed Monte Carlo update, was sufficient to achieve state of the art on the Atari 2600 game Montezuma's Revenge. We consider two questions left open by their work: First, how important is the quality of the density model for exploration? Second, what role does the Monte Carlo update play in exploration? We answer the first question by demonstrating the use of PixelCNN, an advanced neural density model for images, to supply a pseudo-count. In particular, we examine the intrinsic difficulties in adapting Bellemare et al.'s approach when assumptions about the model are violated. The result is a more practical and general algorithm requiring no special apparatus. We combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games. One surprising finding is that the mixed Monte Carlo update is a powerful facilitator of exploration in the sparsest of settings, including Montezuma's Revenge.

1. Introduction

Exploration is the process by which an agent learns about its environment. In the reinforcement learning framework, this involves reducing the agent's uncertainty about the environment's transition dynamics and attainable rewards. From a theoretical perspective, exploration is now well-understood (e.g. Strehl & Littman, 2008; Jaksch et al., 2010; Osband et al., 2016), and Bayesian methods have been successfully demonstrated in a number of settings (Deisenroth & Rasmussen, 2011; Guez et al., 2012). On the other hand, practical algorithms for the general case remain scarce; fully Bayesian approaches are usually intractable in large state spaces, and the count-based method typical of theoretical results is not applicable in the presence of value function approximation.

¹ DeepMind, London, UK. Correspondence to: Georg Ostrovski.

Recently, Bellemare et al. (2016) proposed the notion of pseudo-count as a reasonable generalization of the tabular setting considered in the theory literature. The pseudo-count is defined in terms of a density model $\rho$ trained on the sequence of states experienced by an agent:

$$\hat N(x) = \rho(x)\,\hat n(x),$$

where $\hat n(x)$ can be thought of as a total pseudo-count computed from the model's recoding probability $\rho'(x)$, the probability of $x$ computed immediately after training on $x$. As a practical application the authors used the pseudo-counts derived from the simple CTS density model (Bellemare et al., 2014) to incentivize exploration in Atari 2600 agents. One of the main outcomes of their work was substantial empirical progress on the infamously hard game MONTEZUMA'S REVENGE.

Their method critically hinged on several assumptions regarding the density model: 1) the model should be learning-positive, i.e. the probability assigned to a state $x$ should increase with training; 2) it should be trained online, using each sample exactly once; and 3) the effective model step-size should decay at a rate of $n^{-1}$. Part of their empirical success also relied on a mixed Monte Carlo/Q-Learning update rule, which permitted fast propagation of the exploration bonuses.

In this paper, we set out to answer several research questions related to these modelling choices and assumptions:

1. To what extent does a better density model give rise to better exploration?

2. Can the above modelling assumptions be relaxed without sacrificing exploration performance?

3. What role does the mixed Monte Carlo update play in successfully incentivizing exploration?

In particular, we explore the use of PixelCNN (van den Oord et al., 2016b;a), a state-of-the-art neural density model. We examine the challenges posed by this approach:

Model choice. Performing two evaluations and one model update at each agent step (to compute $\rho(x)$ and $\rho'(x)$) can be prohibitively expensive. This requires the design of a simplified yet sufficiently expressive and accurate PixelCNN architecture.

Model training. A CTS model can naturally be trained from sequentially presented, correlated data samples. Training a neural model in this online fashion requires more careful attention to the optimization procedure to prevent overfitting and catastrophic forgetting (French, 1999).

Model use. The theory of pseudo-counts requires the density model's rate of learning to decay over time. Optimization of a neural model, however, imposes constraints on the step-size regime which cannot be violated without deteriorating effectiveness and stability of training.

The concept of intrinsic motivation has made a recent resurgence in reinforcement learning research, in great part due to a dissatisfaction with $\epsilon$-greedy and Boltzmann policies. Of note, Tang et al. (2016) maintain an approximate count by means of hash tables over features, which in the pseudo-count framework corresponds to a hash-based density model. Houthooft et al. (2016) used a second-order Taylor approximation of the prediction gain to drive exploration in continuous control. As research moves towards ever more complex environments, we expect the trend towards more intrinsically motivated solutions to continue.

2. Background

2.1. Pseudo-Count and Prediction Gain

Here we briefly introduce notation and results, referring the reader to (Bellemare et al., 2016) for technical details.

Let $\rho$ be a density model on a finite space $\mathcal{X}$, and $\rho_n(x)$ the probability assigned by the model to $x$ after being trained on a sequence of states $x_1, \ldots, x_n$. Assume $\rho_n(x) > 0$ for all $x, n$. The recoding probability $\rho'_n(x)$ is then the probability the model would assign to $x$ if it were trained on that same $x$ one more time. We call $\rho$ learning-positive if $\rho'_n(x) \ge \rho_n(x)$ for all $x_1, \ldots, x_n, x \in \mathcal{X}$. The prediction gain (PG) of $\rho$ is

$$\mathrm{PG}_n(x) = \log \rho'_n(x) - \log \rho_n(x). \qquad (1)$$

A learning-positive $\rho$ implies $\mathrm{PG}_n(x) \ge 0$ for all $x \in \mathcal{X}$. For learning-positive $\rho$, we define the pseudo-count as

$$\hat N_n(x) = \frac{\rho_n(x)\,(1 - \rho'_n(x))}{\rho'_n(x) - \rho_n(x)},$$

derived from postulating that a single observation of $x \in \mathcal{X}$ should lead to a unit increase in pseudo-count:

$$\rho_n(x) = \frac{\hat N_n(x)}{\hat n}, \qquad \rho'_n(x) = \frac{\hat N_n(x) + 1}{\hat n + 1},$$

where $\hat n$ is the pseudo-count total. The pseudo-count generalizes the usual state visitation count function $N_n(x)$. Under certain assumptions on $\rho_n$, pseudo-counts grow approximately linearly with real counts. Crucially, the pseudo-count can be approximated using the prediction gain of the density model:

$$\hat N_n(x) \approx \left(e^{\mathrm{PG}_n(x)} - 1\right)^{-1}.$$
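The step from the exact pseudo-count to this approximation is short. The following derivation is our own sketch from the definitions above, not reproduced from Bellemare et al. (2016); it uses only $\rho'_n(x)/\rho_n(x) = e^{\mathrm{PG}_n(x)}$, which follows from Eq. (1), and the assumption that the recoding probability is small.

```latex
\hat N_n(x)
  = \frac{\rho_n(x)\,\bigl(1 - \rho'_n(x)\bigr)}{\rho'_n(x) - \rho_n(x)}
  = \frac{1 - \rho'_n(x)}{\rho'_n(x)/\rho_n(x) - 1}
  = \frac{1 - \rho'_n(x)}{e^{\mathrm{PG}_n(x)} - 1}
  \approx \bigl(e^{\mathrm{PG}_n(x)} - 1\bigr)^{-1}
  \quad \text{when } \rho'_n(x) \ll 1 .
```

Under this reading, the relative error of the approximation is $\rho'_n(x)$ itself, which is presumably negligible for image observations in large state spaces.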

The main use of the pseudo-count is to define an exploration bonus. We consider a reinforcement learning (RL) agent interacting with an environment that provides observations and extrinsic rewards (see Sutton & Barto, 1998, for a thorough exposition of the RL framework). To the reward at step $n$ we add the bonus

$$r^+(x) := (\hat N_n(x))^{-1/2},$$

which incentivizes the agent to try to re-experience surprising situations. Quantities related to prediction gain have been used for similar purposes in the intrinsic motivation literature (Lopes et al., 2012), where they measure an agent's learning progress (Oudeyer et al., 2007). Although the pseudo-count bonus is close to the prediction gain, it is asymptotically more conservative and supported by stronger theoretical guarantees.
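The quantities of this section can be checked on a toy case. The sketch below is illustrative only: `EmpiricalDensity` is a hypothetical tabular stand-in for the density model (it is neither CTS nor PixelCNN), chosen because its pseudo-count should coincide with the true visit count. It also compares the exact pseudo-count with the prediction-gain approximation.

```python
import math
from collections import Counter


class EmpiricalDensity:
    """Tabular density model rho_n(x) = N_n(x) / n (illustrative stand-in)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def prob(self, x):
        return self.counts[x] / self.total

    def recoding_prob(self, x):
        # Probability of x after one more (hypothetical) observation of x.
        return (self.counts[x] + 1) / (self.total + 1)

    def update(self, x):
        self.counts[x] += 1
        self.total += 1


def pseudo_count(rho, rho_prime):
    """Exact pseudo-count from rho_n(x) and rho'_n(x)."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


def pseudo_count_from_pg(pg):
    """Approximate pseudo-count (e^PG - 1)^-1 from the prediction gain."""
    return 1.0 / math.expm1(pg)


def exploration_bonus(n_hat):
    """Bonus r^+ = N^{-1/2}."""
    return n_hat ** -0.5


model = EmpiricalDensity()
# 997 distinct states seen once, plus state "a" seen three times (n = 1000).
for state in list(range(997)) + ["a", "a", "a"]:
    model.update(state)

rho, rho_prime = model.prob("a"), model.recoding_prob("a")
n_hat = pseudo_count(rho, rho_prime)           # matches the true count of "a"
pg = math.log(rho_prime) - math.log(rho)       # prediction gain, Eq. (1)
n_hat_approx = pseudo_count_from_pg(pg)        # close to n_hat since rho' is small
bonus = exploration_bonus(n_hat)
```

For the empirical model the exact pseudo-count equals the visit count, and the approximation differs from it only by the factor $1 - \rho'_n(x)$.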

2.2. Density Models for Images

The CTS density model (Bellemare et al., 2014) is based on the namesake algorithm, Context Tree Switching (Veness et al., 2012), a Bayesian variable-order Markov model. In its simplest form, the model takes as input a 2D image and assigns to it a probability according to the product of location-dependent L-shaped filters, where the prediction of each filter is given by a CTS algorithm trained on past images. In Bellemare et al. (2016), this model was applied to 3-bit greyscale, 42 × 42 downsampled Atari 2600 frames (Fig. 1). The CTS model presents advantages in terms of simplicity and performance but is limited in expressiveness, scalability, and data efficiency.
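The preprocessing described above (downsampling to 42 × 42 and quantizing to 3-bit greyscale) might look as follows. This is a minimal sketch under stated assumptions: the text does not specify the greyscale conversion or the resizing method, so the channel average and nearest-neighbour sampling used here, as well as the function name `preprocess_frame`, are our own illustrative choices.

```python
import numpy as np


def preprocess_frame(frame_rgb, size=42, bits=3):
    """Map a uint8 RGB frame (H, W, 3) to a size x size image with 2**bits grey levels."""
    # Greyscale by simple channel averaging (assumption, not from the paper).
    grey = frame_rgb.astype(np.float64).mean(axis=2)
    h, w = grey.shape
    # Nearest-neighbour downsampling: take the source pixel at each target cell centre.
    rows = ((np.arange(size) + 0.5) * h / size).astype(int)
    cols = ((np.arange(size) + 0.5) * w / size).astype(int)
    small = grey[np.ix_(rows, cols)]
    levels = 2 ** bits
    # Quantize [0, 255] into equal-width bins labelled 0 .. levels - 1.
    return np.minimum((small * levels / 256.0).astype(np.int64), levels - 1)


# A random array stands in for an Atari frame of height 210 and width 160.
frame = np.random.default_rng(0).integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
obs = preprocess_frame(frame)
```

Each pixel of `obs` then takes one of eight values, which is the discrete alphabet over which the density model assigns per-pixel probabilities.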

Figure 1. Atari frame preprocessing (Bellemare et al., 2016). Panels: original frame (160x210) and 3-bit greyscale (42x42).

In recent years, neural generative models for images have achieved impressive successes in their ability to generate diverse images in various domains (Kingma & Welling, 2013; Rezende et al., 2014; Gregor et al., 2015; Goodfellow et al., 2014). In particular, van den Oord et al. (2016b;a) introduced PixelCNN, a fully convolutional neural network composed of residual blocks with multiplicative gating units, which models pixel probabilities conditional on previous pixels (in the usual top-left to bottom-right raster-scan order) by using masked convolution filters. This model achieved state-of-the-art modelling performance on standard datasets, paired with the computational efficiency of a convolutional feed-forward network.
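To make the idea of a masked convolution filter concrete, the sketch below builds a raster-scan mask for a single-channel image and applies it in a plain NumPy convolution. It illustrates the general technique only and is not the gated architecture used in this paper; the mask-type names "A" and "B" follow common PixelCNN usage and are an assumption here, as are the function names.

```python
import numpy as np


def causal_mask(kernel_size, mask_type):
    """Binary mask for a square kernel under raster-scan ordering.

    Type "A" hides the centre pixel (typically the first layer);
    type "B" keeps it (typically later layers).
    """
    if kernel_size % 2 == 0 or mask_type not in ("A", "B"):
        raise ValueError("need an odd kernel size and mask type 'A' or 'B'")
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    centre = kernel_size // 2
    mask[:centre, :] = 1.0           # all rows above the current pixel
    mask[centre, :centre] = 1.0      # pixels to the left in the current row
    if mask_type == "B":
        mask[centre, centre] = 1.0   # the current position itself
    return mask


def masked_conv2d(image, kernel, mask_type):
    """Zero-padded, same-size 2D cross-correlation with a causally masked kernel."""
    k = kernel.shape[0]
    masked = kernel * causal_mask(k, mask_type)
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * masked)
    return out
```

With a type "A" mask, the output at a pixel depends only on pixels that precede it in raster-scan order, which is what allows the network to define a valid conditional distribution per pixel.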

2.3. Multi-Step RL Methods

A distinguishing feature of reinforcement learning is that the agent learns on the basis of interim estimates (Sutton, 1996). For example, the Q-Learning update rule is

$$Q(x, a) \leftarrow Q(x, a) + \alpha \underbrace{\left[ r(x, a) + \gamma \max_{a'} Q(x', a') - Q(x, a) \right]}_{\delta(x, a)},$$

linking the reward $r$ and next-state value function $Q(x', a')$ to the current state value function $Q(x, a)$. This particular form is the stochastic update rule with step-size $\alpha$ and involves the TD-error $\delta$. In the approximate reinforcement learning setting, such as when $Q(x, a)$ is represented by a neural network, this update is converted into a loss to be minimized, most commonly the squared loss $\delta^2(x, a)$.
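As a concrete illustration of the update rule, here is a minimal tabular sketch. The array layout (`q[state, action]`), the function names and the toy sizes are assumptions for the example; terminal-state handling is omitted for brevity.

```python
import numpy as np


def td_error(q, x, a, r, x_next, gamma):
    """One-step TD-error delta(x, a) for a tabular Q-function."""
    return r + gamma * np.max(q[x_next]) - q[x, a]


def q_learning_step(q, x, a, r, x_next, alpha=0.1, gamma=0.99):
    """Apply Q(x, a) <- Q(x, a) + alpha * delta(x, a) in place and return delta."""
    delta = td_error(q, x, a, r, x_next, gamma)
    q[x, a] += alpha * delta
    return delta


q = np.zeros((3, 2))                  # 3 states, 2 actions (toy sizes)
delta = q_learning_step(q, x=0, a=1, r=1.0, x_next=2)
squared_loss = delta ** 2             # the loss minimized in the approximate setting
```

In the function approximation case the table lookup is replaced by a network evaluation and the in-place update by a gradient step on `squared_loss`.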

It is well known that better performance, both in terms of learning efficiency and approximation error, is attained by multi-step methods (Sutton, 1996; Tsitsiklis & van Roy, 1997). These methods interpolate between one-step methods (Q-Learning) and the Monte-Carlo update

$$Q(x, a) \leftarrow Q(x, a) + \alpha \underbrace{\left[ \sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) - Q(x, a) \right]}_{\delta_{MC}(x, a)},$$

where $x_0, a_0, x_1, a_1, \ldots$ is a sample path through the environment beginning in $(x, a)$. To achieve their success on the hardest Atari 2600 games, Bellemare et al. (2016) used the mixed Monte-Carlo update (MMC)

$$Q(x, a) \leftarrow Q(x, a) + \alpha \left[ (1 - \beta)\,\delta(x, a) + \beta\,\delta_{MC}(x, a) \right],$$

with $\beta \in [0, 1]$. This choice was made for computational and implementational simplicity, and is a particularly coarse multi-step method. A better multi-step method is the recent Retrace($\lambda$) algorithm (Munos et al., 2016). Retrace($\lambda$) uses a product of truncated importance sampling ratios $c_1, c_2, \ldots$ to replace $\delta$ with the error term

$$\delta_{\text{RETRACE}}(x, a) := \sum_{t=0}^{\infty} \gamma^t \left( \prod_{s=1}^{t} c_s \right) \delta(x_t, a_t),$$

effectively mixing in TD-errors from all future time steps. Munos et al. showed that Retrace($\lambda$) is safe (does not diverge when trained on data from an arbitrary behaviour policy), and efficient (makes the most of multi-step returns).
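The multi-step error terms of this section can be sketched on a finite sample path as follows. This is an illustration under assumptions: the function names are ours, the path is truncated to finite length, and the trace coefficient $c = \lambda \min(1, \pi/\mu)$ is the form given by Munos et al. (2016) rather than something stated in the text above.

```python
def discounted_return(rewards, gamma):
    """Monte Carlo return sum_t gamma^t r_t of a finite sample path."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mmc_error(delta, delta_mc, beta):
    """Mixed Monte Carlo error (1 - beta) * delta + beta * delta_MC."""
    return (1.0 - beta) * delta + beta * delta_mc


def truncated_ratio(pi_prob, mu_prob, lam=1.0):
    """Trace coefficient c = lam * min(1, pi / mu) (form assumed from Munos et al.)."""
    return lam * min(1.0, pi_prob / mu_prob)


def retrace_error(deltas, traces, gamma):
    """sum_t gamma^t (c_1 ... c_t) delta_t, with an empty product at t = 0.

    deltas[t] is the TD-error at step t; traces[t - 1] holds c_t for t >= 1.
    """
    total, weight = 0.0, 1.0
    for t, delta in enumerate(deltas):
        if t > 0:
            weight *= gamma * traces[t - 1]
        total += weight * delta
    return total


mc_return = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
mixed = mmc_error(delta=1.0, delta_mc=3.0, beta=0.25)
retrace = retrace_error([1.0, 1.0, 1.0], traces=[1.0, 0.5], gamma=0.5)
```

Setting `beta=0` in `mmc_error` recovers the one-step TD-error, while trace coefficients of zero reduce `retrace_error` to the same one-step quantity; small coefficients cut the contribution of later steps.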

3. Using PixelCNN for Exploration

As mentioned in the Introduction, the theory of using density models for exploration makes several assumptions that translate into concrete requirements for an implementation:

(a) The density model should be trained completely online, i.e. exactly once on each state experienced by the agent, in the given sequential order.

(b) The prediction gain (PG) should decay at a rate $n^{-1}$ to ensure that pseudo-counts grow approximately linearly with real counts.

(c) The density model should be learning-positive.

Simultaneously, a partly competing set of requirements are posed by the practicalities of training a neural density model and using it as part of an RL agent:

(d) For stability, efficiency, and to avoid catastrophic forgetting in the context of a drifting data distribution, it is advantageous to train a neural model in mini-batches, drawn randomly from a diverse dataset.

(e) For effective training, a certain optimization regime (e.g. a fixed learning rate schedule) has to be followed.

(f) The density model must be computationally lightweight, to allow computing the PG (two model evaluations and one update) as part of every training step of an RL agent.

We investigate how to best resolve these tensions in the context of the Arcade Learning Environment (Bellemare et al., 2013), a suite of benchmark Atari 2600 games.

3.1. Designing a Suitable Density Model

Driven by (f) and aiming for an agent with computational performance comparable to DQN, we design a slim variant of the PixelCNN network. Its core is a stack of 2 gated residual blocks with 16 feature maps (compared to 15 residual blocks with 128 feature maps in vanilla PixelCNN). As was done with the CTS model, images are downsampled to 42 × 42 and quantized to 3-bit greyscale. See Appendix A for technical details.

Figure 2. Left: PixelCNN log loss on FREEWAY (x-axis: model updates), when trained online, on a random permutation (single use of each frame) or on randomly drawn samples (with replacement, potentially using the same frame multiple times) from the state sequence. To simulate the effect of non-stationarity, the agent's policy changes every 4K updates. All training methods show qualitatively similar learning progress and stability. Middle: PixelCNN log loss over the first 6K training frames on PONG. Vertical dashed lines indicate episode ends. The coinciding loss spikes are the density model's surprise upon observing the distinctive green frame that sometimes occurs at the episode start. Right: DQN-PixelCNN training performance on MONTEZUMA'S REVENGE (x-axis: million frames) as we vary learning rate (LR) and PG decay schedules. The four configurations shown are LR $0.001$ with PG scale $0.1\,n^{-1/2}$; LR $0.001$ with PG scale $0.01$; LR $0.1\,n^{-1/2}$ with PG scale $0.1\,n^{-1/2}$; and LR $0.1\,n^{-1/2}$ with PG scale $0.01$.

Figure 3. Model loss averaged over 10K frames, after 1M training frames, for constant, $n^{-1}$, and $n^{-1/2}$ learning rate schedules, as a function of the initial learning rate $c$ (panels: FREEWAY, MONTEZUMA'S REVENGE). The smallest loss is achieved by a constant learning rate of $10^{-3}$.

3.2. Training the Density Model

Instead of using randomized mini-batches, we train the density model completely online on the sequence of experienced states. Empirically we found that with minor tuning of optimization hyper-parameters we could train the model as robustly on a temporally correlated sequence of states as on a sequence with randomized order (Fig. 2(left)).

Besides satisfying the theoretical requirement (a), completely online training of the density model has the advantage that $\rho'_n = \rho_{n+1}$, so that the model update performed for computing the PG need not be reverted¹.
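The online procedure (one evaluation, one update, one more evaluation, with the update kept) can be sketched as follows. The model here is a deliberately simple stand-in: `OnlineSoftmaxDensity` is a hypothetical categorical model over a handful of symbols trained by plain gradient ascent, not PixelCNN and not RMSProp; it only illustrates how the PG falls out of a single online training step.

```python
import numpy as np


class OnlineSoftmaxDensity:
    """Toy categorical density over K symbols, trained by online gradient ascent."""

    def __init__(self, num_symbols, learning_rate=0.1):
        self.logits = np.zeros(num_symbols)
        self.learning_rate = learning_rate

    def probs(self):
        shifted = np.exp(self.logits - self.logits.max())
        return shifted / shifted.sum()

    def log_prob(self, x):
        return float(np.log(self.probs()[x]))

    def update_and_prediction_gain(self, x):
        """Evaluate, update once on x, evaluate again; returns PG_n(x)."""
        log_rho = self.log_prob(x)                   # log rho_n(x)
        grad = -self.probs()
        grad[x] += 1.0                               # gradient of log p(x) w.r.t. logits
        self.logits += self.learning_rate * grad     # update is kept, not reverted
        log_rho_prime = self.log_prob(x)             # log rho'_n(x) = log rho_{n+1}(x)
        return log_rho_prime - log_rho


model = OnlineSoftmaxDensity(num_symbols=8)
gains = [model.update_and_prediction_gain(x) for x in [3, 3, 5, 3]]
```

Because the parameters after the update are exactly those used at the next step, no copy or rollback of the model is needed, which is the advantage noted above.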

Another more subtle reason for avoiding mini-batch updates of the density model (despite (d)) is a practical optimization issue. The (necessarily online) computation of the PG involves a model update and hence the use of an optimizer. Advanced optimizers used with deep neural networks, like the RMSProp optimizer (Tieleman & Hinton, 2012) used in this work, are stateful, tracking running averages of e.g. mean and variance of the model parameters. If the model is additionally trained from mini-batches, the two streams of updates may show different statistical characteristics (e.g. different gradient magnitudes), invalidating the assumptions underlying the optimization algorithm and leading to slower or unstable training.

¹ The CTS model allows querying the PG cheaply, without incurring an actual update of model parameters.

To determine a suitable online learning rate schedule, we train the model on a sequence of 1M frames of experience of a random-policy agent. We compare the loss achieved by training procedures following constant or decaying learning rate schedules (Fig. 3). The lowest final training loss is achieved by a constant learning rate of $0.001$ or a decaying learning rate of $0.1\,n^{-1/2}$. We settled our choice on the constant learning rate schedule as it showed greater robustness with respect to the choice of initial learning rate.

PixelCNN rapidly learns a sensible distribution over state space. Fig. 2(left) shows the model's loss decaying as it learns to exploit image regularities. Spikes in its loss function quickly start to correspond to visually meaningful events, such as the starts of episodes (Fig. 2(middle)). A video of early density model training is provided at http://youtu.be/T6iaa8Z4eyE.

Figure 4. Samples after 25K steps (both panels labelled "Temp 1.0"). Left: CTS, right: PixelCNN.

3.3. Computing the Pseudo-Count

From the previous section we obtain a particular learning rate schedule that cannot be arbitrarily modified without deteriorating the model's training performance or stability. To achieve the required PG decay (b), we instead replace $\mathrm{PG}_n$ by $c_n \cdot \mathrm{PG}_n$ with a suitably decaying sequence $c_n$.


In experiments comparing actual agent performance we empirically determined that in fact the constant learning rate $0.001$, paired with a PG decay $c_n = c \cdot n^{-1/2}$, obtains the best exploration results on hard exploration games like MONTEZUMA'S REVENGE, see Fig. 2(right). We find the model to be robust across 1-2 orders of magnitude for the value of $c$, and informally determine $c = 0.1$ to be a sensible configuration for achieving good results on a broad range of Atari 2600 games (see also Section 7).

Regarding (c), it is hard to ensure learning-positiveness for a deep neural model, and a negative PG can occur whenever the optimizer overshoots a local loss minimum. As a workaround, we threshold the PG value at 0. To summarize, the computed pseudo-count is

$$\hat N_n(x) = \left( \exp\left( c \cdot n^{-1/2} \cdot (\mathrm{PG}_n(x))_+ \right) - 1 \right)^{-1}.$$
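The summarized pseudo-count translates directly into a few lines of code. In this sketch the function name and the convention of returning infinity when the thresholded PG is zero are our own choices for illustration; the paper does not say how that edge case is handled.

```python
import math


def pixelcnn_pseudo_count(prediction_gain, n, c=0.1):
    """Pseudo-count (exp(c * n^{-1/2} * max(PG, 0)) - 1)^{-1}.

    n is the number of density-model updates so far (n >= 1).
    Returns math.inf when the thresholded PG is zero (assumed convention).
    """
    scaled = c * max(prediction_gain, 0.0) / math.sqrt(n)
    if scaled == 0.0:
        return math.inf
    return 1.0 / math.expm1(scaled)


early = pixelcnn_pseudo_count(prediction_gain=1.0, n=100)
late = pixelcnn_pseudo_count(prediction_gain=1.0, n=10_000)
negative = pixelcnn_pseudo_count(prediction_gain=-0.3, n=100)
```

For a fixed raw PG, the pseudo-count grows with the number of updates, which is the intended effect of the $n^{-1/2}$ decay; a negative PG is treated as no surprise at all.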

4. Exploration in Atari 2600 Games

Having described our pseudo-count friendly adaptation of PixelCNN, we now study its performance on Atari games. To this end we augment the environment reward with a pseudo-count exploration bonus, yielding the combined reward $r(x, a) + (\hat N_n(x))^{-1/2}$. As usual for neural network-based agents, we ensure the total reward lies in $[-1, 1]$ by clipping larger values.
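A minimal sketch of the combined, clipped reward follows. It assumes that clipping is applied to the sum of extrinsic reward and bonus, which is our reading of the sentence above; the function name is hypothetical.

```python
def combined_reward(extrinsic_reward, pseudo_count, low=-1.0, high=1.0):
    """Clip r(x, a) + N^{-1/2} to the interval [low, high]."""
    # A non-positive pseudo-count is treated as maximal novelty (assumption).
    bonus = pseudo_count ** -0.5 if pseudo_count > 0 else float("inf")
    return max(low, min(high, extrinsic_reward + bonus))


novel_state = combined_reward(0.0, pseudo_count=0.01)     # bonus of 10 is clipped
familiar_state = combined_reward(0.0, pseudo_count=4.0)   # bonus of 0.5 passes through
```

Note that once the bonus alone exceeds the upper clipping bound, the extrinsic reward no longer changes the result, a point that becomes relevant when the PG scale is increased.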

4.1. DQN with PixelCNN Exploration Bonus

Our first set of experiments provides the PixelCNN exploration bonus to a DQN agent (Mnih et al., 2015)². At each agent step, the density model receives a single frame, with which it simultaneously updates its parameters and outputs the PG. We refer to this agent as DQN-PixelCNN.

The DQN-CTS agent we compare against is derived from the one in (Bellemare et al., 2016). For better comparability, it is trained in the same online fashion as DQN-PixelCNN, i.e. the PG is computed whenever we train the density model. By contrast, the original DQN-CTS queried the PG at the end of each episode.

Unless stated otherwise, we always use the mixed Monte Carlo update (MMC) for the intrinsically motivated agents³, but regular Q-Learning for the baseline DQN.

Fig. 5 shows training curves of DQN compared to DQN-CTS and DQN-PixelCNN. On the famous MONTEZUMA'S REVENGE, both intrinsically motivated agents vastly outperform the baseline DQN. On other hard exploration games (PRIVATE EYE; or VENTURE, appendix Fig. 15), DQN-PixelCNN achieves state of the art results, substantially outperforming DQN and DQN-CTS. The other two games shown (ASTEROIDS, BERZERK) pose easier exploration problems, where the reward bonus should not provide large improvements and may have a negative effect by skewing the reward landscape. Here, DQN-PixelCNN behaves more gracefully and still outperforms DQN-CTS. We hypothesize this is due to a qualitative difference between the models, see Section 5.

² Unlike Bellemare et al. we use regular Q-Learning instead of Double Q-Learning (van Hasselt et al., 2016), as our early experiments showed no significant advantage of DoubleDQN with the PixelCNN-based exploration reward.

³ The use of MMC in a replay-based agent poses a minor complication, as the MC return is not available for replay until the end of an episode. For simplicity, in our implementation we disregard this detail and set the MC return to 0 for transitions from the most recent episode.

Figure 5. DQN, DQN-CTS and DQN-PixelCNN on hard exploration games (top: MONTEZUMA'S REVENGE, PRIVATE EYE) and easier ones (bottom: ASTEROIDS, BERZERK); x-axis: million frames.

Figure 6. Improvements (in % of AUC) of DQN-PixelCNN and DQN-CTS over DQN in 57 Atari games. Annotations indicate the number of hard exploration games with positive (right) and negative (left) improvement, respectively.

Overall PixelCNN provides the DQN agent with a larger advantage than CTS, and often accelerates or stabilizes training even when not affecting peak performance. Out of 57 Atari games, DQN-PixelCNN outperforms DQN-CTS in 52 games by maximum achieved score, and 51 by AUC (methodology in Appendix B). See Fig. 6 for a high level comparison (appendix Fig. 15 for full training graphs). The greatest gains from using either exploration bonus are observed in games categorized as hard exploration games in the taxonomy of exploration in (Bellemare et al., 2016, reproduced in Appendix D), specifically in the most challenging sparse reward games (e.g. MONTEZUMA'S REVENGE, PRIVATE EYE, VENTURE).

4.2. A Multi-Step RL Agent with PixelCNN

Empirical practitioners know that techniques beneficial for one agent architecture often can be detrimental for a different algorithm. To demonstrate the wide applicability of the PixelCNN exploration bonus, we also evaluate it with the more recent Reactor agent⁴ (Gruslys et al., 2017). This replay-based actor-critic agent represents its policy and value function by a recurrent neural network and, crucially, uses the multi-step Retrace($\lambda$) algorithm for policy evaluation, replacing the MMC we use in DQN-PixelCNN.

To reduce impact on computational efficiency of this agent, we sub-sample intrinsic rewards: we perform updates of the PixelCNN model and compute the reward bonus on (randomly chosen) 25% of all steps, leaving the agent's reward unchanged on other steps. We use the same PG decay schedule of $0.1\,n^{-1/2}$, with $n$ the number of model updates.
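The sub-sampling scheme can be sketched as below. This is a hedged illustration: `maybe_intrinsic_reward` and its `compute_bonus` callback are hypothetical names, and the callback stands for "update the density model once and return the resulting bonus", which in the real agent would involve the PixelCNN.

```python
import random


def maybe_intrinsic_reward(reward, compute_bonus, rng, update_fraction=0.25):
    """With probability update_fraction, update the model and add its bonus.

    compute_bonus is a hypothetical callable that performs one density-model
    update and returns the exploration bonus for the current frame.
    Returns (reward, updated) so that the caller can track the update count n.
    """
    if rng.random() < update_fraction:
        return reward + compute_bonus(), True
    return reward, False


rng = random.Random(0)
num_updates = 0
for _ in range(10_000):
    _, updated = maybe_intrinsic_reward(0.0, lambda: 0.1, rng)
    num_updates += updated
```

Counting only the steps on which the model was actually updated keeps $n$ in the decay schedule equal to the number of model updates, as stated above, rather than the number of agent steps.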

Figure 7. Reactor/Reactor-PixelCNN and DQN/DQN-PixelCNN training performance (averaged over 3 seeds) on GRAVITAR, H.E.R.O., ROAD RUNNER and TIME PILOT; x-axis: million frames.

Training curves for the Reactor/Reactor-PixelCNN agent compared to DQN/DQN-PixelCNN are shown in Fig. 7. The baseline Reactor agent is superior to the DQN agent, obtaining higher scores and learning faster in about 50 out of 57 games. It is further improved on a large fraction of games by the PixelCNN exploration reward, see Fig. 8 (full training graphs in appendix Fig. 16).

The effect of the exploration bonus is rather uniform, yielding improvements on a broad range of games. In particular, Reactor-PixelCNN enjoys better sample efficiency (in terms of area under the curve, AUC) than vanilla Reactor. We hypothesize that, like other policy gradient algorithms, Reactor generally suffers from weaker exploration than its value-based counterpart DQN. This aspect is much helped by the exploration bonus, boosting the agent's sample efficiency in many environments.

⁴ The exact agent variant is referred to as $\beta$-LOO with $\beta = 1$.

Figure 8. Improvements (in % of AUC) of Reactor-PixelCNN over Reactor in 57 Atari games, grouped into easy exploration (40 games), hard exploration with dense reward (10) and hard exploration with sparse reward (7).

However, on hard exploration games with sparse rewards, Reactor seems unable to make full use of the exploration bonus. We believe this is because, in very sparse settings, the propagation of reward information across long horizons becomes crucial. The MMC takes one extreme of this view, directly learning from the observed returns. The Retrace($\lambda$) algorithm, on the other hand, has an effective horizon which depends on $\lambda$ and, critically, the truncated importance sampling ratio. This ratio results in the discarding of trajectories which are off-policy, i.e. unlikely under the current policy. We hypothesize that the very goal of the Retrace($\lambda$) algorithm to learn cautiously is what prevents it from taking full advantage of the exploration bonus!

5. Quality of the Density Model

PixelCNN can be expected to be more expressive and accurate than the less advanced CTS model, and indeed, samples generated after training are somewhat higher quality (Fig. 4). However, we are not using the generative function of the models when computing an exploration bonus, and a better generative model does not necessarily give rise to better probability estimates (Theis et al., 2016).

Figure 9. PG on MONTEZUMA'S REVENGE (log scale).

In Fig. 9 we compare the PG produced by the two models throughout 5K training steps. PixelCNN consistently produces PGs lower than CTS. More importantly, its PGs are smoother, exhibiting less variance between successive states, while showing more pronounced peaks at certain infrequent events. This yields a reward bonus that is less harmful in easy exploration games, while providing a strong signal in the case of novel or rare events.

Another distinguishing feature of PixelCNN is its non-decaying step-size. The per-step PG never completely vanishes, as the model tracks the most recent data. This provides an unexpected benefit: the agent remains mildly surprised by significant state changes, e.g. switching rooms in MONTEZUMA'S REVENGE. These persistent rewards act as milestones that the agent learns to return to. This is illustrated in Fig. 10, depicting the intrinsic reward over the course of an episode. The agent routinely revisits the right-hand side of the torch room, not because it leads to reward but just to take in the sights. A video of the episode is provided at http://youtu.be/232tOUPKPoQ.⁵

Figure 10. Intrinsic reward in MONTEZUMA'S REVENGE (x-axis: time steps).

Lastly, PixelCNN's convolutional nature is expected to be beneficial for its sample efficiency. In Appendix C we compare to a convolutional CTS and confirm that this explains part, but not all of PixelCNN's advantage over vanilla CTS.

6. Importance of the Monte Carlo Return

Like for DQN-CTS, the success of DQN-PixelCNN hinges on the use of the mixed Monte Carlo update. The transient and vanishing nature of the exploration rewards requires the learning algorithm to latch on to these rapidly. The MMC serves this end as a simple multi-step method, helping to propagate reward information faster. An additional benefit lies in the fact that the Monte Carlo return helps bridge long horizons in environments where rewards are far apart and encountered rarely. On the other hand, it is important to note that the Monte Carlo return's on-policy nature increases variance in the learning algorithm, and can prevent the algorithm's convergence to the optimal policy when training off-policy. It can therefore be expected to adversely affect training performance in some games.

⁵ Another agent video on the game PRIVATE EYE can be found at http://youtu.be/kNyFygeUa2E.

Figure 11. Top: games where MMC completely explains the improved/decreased performance of DQN-PixelCNN compared to DQN (BANK HEIST, MS. PAC-MAN). Bottom-left: MMC and PixelCNN show additive benefits (H.E.R.O.). Bottom-right: hard exploration, sparse reward game (PRIVATE EYE); only combining MMC and PixelCNN bonus achieves training progress. Curves: DQN and DQN-PixelCNN, each with and without MC; x-axis: million frames.

To distill the effect of the MMC on performance, we compare all four combinations of DQN with/without PixelCNN exploration bonus and with/without MMC. Fig. 11 shows the performance of these four agent variants (graphs for all games are shown in Fig. 17). These games were picked to illustrate several commonly occurring cases:

• MMC speeds up training and improves final performance significantly (examples: BANK HEIST, TIME PILOT). In these games, MMC alone explains most or all of the improvement of DQN-PixelCNN over DQN.

• MMC hurts performance (examples: MS. PAC-MAN, BREAKOUT). Here too, MMC alone explains most of the difference between DQN-PixelCNN and DQN.

• MMC and PixelCNN reward bonus have a compounding effect (example: H.E.R.O.).

Most importantly, the situation is rather different when we restrict our attention to the hardest exploration games with sparse rewards. Here the baseline DQN agent fails to make any training progress, and neither the Monte Carlo return nor the exploration bonus alone provides any significant benefit. Their combination, however, grants the agent rapid training progress and allows it to achieve high performance.

One effect of the exploration bonus in these games is to provide a denser reward landscape, enabling the agent to learn meaningful policies. Due to the transient nature of the exploration bonus, the agent needs to be able to learn from this reward signal faster than regular one-step methods allow, and MMC proves to be an effective solution.


Figure 12. DQN-PixelCNN, hard exploration games (MONTEZUMA'S REVENGE, PRIVATE EYE, VENTURE), different PG scales $c \cdot n^{-1/2} \cdot \mathrm{PG}_n$ ($c = 0.1, 1, 10$) (5 seeds each); x-axis: million frames.

7. Pushing the Limits of Intrinsic Motivation

In this section we explore the idea of a maximally curious agent, whose reward function is dominated by the exploration bonus. For that we increase the PG scale, previously chosen conservatively to avoid adverse effects on easy exploration games.

Fig. 12 shows DQN-PixelCNN performance on the hardest exploration games when the PG scale is increased by 1-2 orders of magnitude. The algorithm seems fairly robust across a wide range of scales: the main effect of increasing this parameter is to trade off exploration (seeking maximal reward) with exploitation (optimizing the current policy).

As expected, a higher PG scale translates to stronger exploration: several runs obtain record peak scores (900 in GRAVITAR, 6,600 in MONTEZUMA'S REVENGE, 39,000 in PRIVATE EYE, 1,500 in VENTURE) surpassing the state of the art by a substantial margin (for previously published results, see Appendix D). Aggressive scaling speeds up the agent's exploration and achieves peak performance rapidly, but can also deteriorate its stability and long-term performance. Note that in practice, because of the non-decaying step-size the PG does not vanish. After reward clipping, an overly inflated exploration bonus can therefore become essentially constant, no longer providing a useful intrinsic motivation signal to the agent.

Another way of creating an entirely curiosity-driven agent is to ignore the environment reward altogether and train based on the exploration reward only, see Fig. 13. Remarkably, the curiosity signal alone is sufficient to train a high-performing agent (measured by environment reward!).

It is worth noting that agents with exploration bonus seem to never stop exploring: for different seeds, the agents make learning progress at very different times during training, a qualitative difference to vanilla DQN.

8. Conclusion

We demonstrated the use of PixelCNN for exploration and showed that its greater accuracy and expressiveness translate into a more useful exploration bonus than that obtained from previous models. While the current theory of pseudo-counts puts stringent requirements on the density model, we have shown that PixelCNN can be used in a simpler and more general setup, and can be trained completely online. It also proves to be widely compatible with both value-function and policy-based RL algorithms.

Figure 13. DQN-PixelCNN trained from intrinsic reward only on MONTEZUMA'S REVENGE and PRIVATE EYE, for $c = 1, 5, 10, 50$ (3 seeds for each configuration); x-axis: million frames.

In addition to pushing the state of the art on the hardest exploration problems among the Atari 2600 games, PixelCNN improves speed of learning and stability of baseline RL agents across a wide range of games. The quality of its reward bonus is evidenced by the fact that on sparse reward games, this signal alone suffices to learn to achieve significant scores, creating a truly intrinsically motivated agent.

Our analysis also reveals the importance of the Monte Carlo return for effective exploration. The comparison with more sophisticated but fixed-horizon multi-step methods shows that its significance lies both in faster learning in the context of a useful but transient reward function, as well as bridging reward gaps in environments where extrinsic and intrinsic rewards are, or quickly become, extremely sparse.
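To make the role of the Monte Carlo return concrete, the following is a minimal sketch of a mixed target, assuming the update blends the one-step Q-learning target with the full-episode Monte Carlo return through a mixing coefficient beta. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return from each step to the end of the episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def mixed_target(reward, next_q_max, mc_return, gamma, beta):
    """Blend of the one-step Q-learning target and the Monte Carlo return."""
    one_step = reward + gamma * next_q_max
    return (1.0 - beta) * one_step + beta * mc_return

# A sparse episode: the only reward arrives at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = monte_carlo_returns(rewards, gamma=0.5)
# With an untrained value function (next_q_max = 0), the one-step target at
# step 0 is zero, but the mixed target already carries the distant reward.
first_target = mixed_target(rewards[0], 0.0, returns[0], gamma=0.5, beta=0.1)
print(returns, first_target)
```

In this toy episode the one-step target would need several sweeps to propagate the final reward back to the first state, whereas the Monte Carlo term delivers a (discounted) learning signal immediately.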

Acknowledgements

The authors thank Tom Schaul, Olivier Pietquin, Ian Osband, Sriram Srinivasan, Tejas Kulkarni, Alex Graves, Charles Blundell, and Shimon Whiteson for invaluable feedback on the ideas presented here, and Audrunas Gruslys especially for providing the Reactor agent.

References

Bellemare, Marc, Veness, Joel, and Talvitie, Erik. Skip context tree switching. In Proceedings of the International Conference on Machine Learning, 2014.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Bellemare, Marc G., Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.

Deisenroth, Marc P. and Rasmussen, Carl E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.

French, Robert M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, 2015.

Gruslys, Audrunas, Azar, Mohammad Gheshlaghi, Bellemare, Marc G., and Munos, Remi. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.

Guez, Arthur, Silver, David, and Dayan, Peter. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, 2012.

Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.

Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2013.

Lopes, Manuel, Lang, Tobias, Toussaint, Marc, and Oudeyer, Pierre-Yves. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, 2012.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Munos, Remi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc G. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.

Osband, Ian, van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. In Proceedings of the International Conference on Machine Learning, 2016.

Oudeyer, Pierre-Yves, Kaplan, Frederic, and Hafner, Verena V. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265-286, 2007.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, 2014.

Strehl, Alexander L. and Littman, Michael L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309-1331, 2008.

Sutton, Richard S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1996.

Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.

Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717, 2016.

Theis, Lucas, van den Oord, Aaron, and Bethge, Matthias. A note on the evaluation of generative models. In Proceedings of the International Conference on Learning Representations, 2016.

Tieleman, Tijmen and Hinton, Geoffrey. RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA. Lecture 6.5 of Neural Networks for Machine Learning, 2012.

Tsitsiklis, John N. and van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 2016a.

van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016b.

van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with Double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael H. Context tree switching. In Proceedings of the Data Compression Conference, 2012.

Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1995-2003, 2016.

A. PixelCNN Hyper-parameters

The PixelCNN model used in this paper is a lightweight variant of the Gated PixelCNN introduced in (van den Oord et al., 2016a). It consists of a 7 × 7 masked convolution, followed by two residual blocks with 1 × 1 masked convolutions with 16 feature planes, and another 1 × 1 masked convolution producing 64 feature planes, which are mapped by a final masked convolution to the output logits. Inputs are 42 × 42 greyscale images, with pixel values quantized to 8 bins.
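The input quantization can be sketched as follows. This is an illustration only: it assumes frames arrive as 8-bit greyscale values already downsampled to 42 × 42, and that the 8 bins are of equal width; the function name and defaults are hypothetical.

```python
def quantize_frame(frame, bins=8, max_value=255):
    """Map each greyscale pixel in [0, max_value] to an integer bin in [0, bins - 1]."""
    return [[pixel * bins // (max_value + 1) for pixel in row] for row in frame]

# A hypothetical 42 x 42 frame with a horizontal intensity ramp.
frame = [[(6 * col) % 256 for col in range(42)] for _ in range(42)]
quantized = quantize_frame(frame)
print(sorted({value for row in quantized for value in row}))
```

With equal-width bins, each bin covers 32 consecutive intensity levels, so the density model only has to predict one of 8 categories per pixel.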

The model is trained completely online, from the stream of Atari frames experienced by an agent. Optimization is performed with the (uncentered) RMSProp optimizer (Tieleman & Hinton, 2012) with momentum 0.9, decay 0.95 and epsilon 10^{-4}.
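For readers unfamiliar with the optimizer settings, here is a minimal scalar sketch of one common formulation of uncentered RMSProp with momentum. The exact variant used by the authors is not specified in this section, so the update rule below is an assumption; the learning rate in the example is hypothetical, while momentum, decay and epsilon take the values quoted above.

```python
import math

def rmsprop_step(param, grad, state, lr, decay=0.95, momentum=0.9, eps=1e-4):
    """One uncentered RMSProp step with momentum on a single scalar parameter."""
    # Running average of squared gradients (uncentered second moment).
    state["ms"] = decay * state["ms"] + (1.0 - decay) * grad * grad
    # Momentum accumulates the gradient scaled by its root mean square.
    state["mom"] = momentum * state["mom"] + lr * grad / math.sqrt(state["ms"] + eps)
    return param - state["mom"]

# One step on f(x) = x^2 starting from x = 1 (gradient 2x), hypothetical lr.
state = {"ms": 0.0, "mom": 0.0}
x = rmsprop_step(1.0, 2.0, state, lr=0.01)
print(x, state)
```

"Uncentered" here means the gradient is normalized by the running mean of its square, without subtracting a running mean of the gradient itself.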

B. Methodology

Unless otherwise stated, all agent performance graphs in this paper show the agent's training performance, measured as the undiscounted per-episode return, averaged over 1M environment frames per data point.
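The averaging described above can be sketched as follows. The sketch assumes each episode is attributed to the 1M-frame window in which it ends; that attribution rule, the function name and the sample data are assumptions made for illustration.

```python
def bucket_returns(episodes, frames_per_point=1_000_000):
    """Average undiscounted episode returns within consecutive frame windows.

    `episodes` is a list of (end_frame, episode_return) pairs. Returns a dict
    mapping window index to the mean return of the episodes ending in it.
    """
    sums, counts = {}, {}
    for end_frame, episode_return in episodes:
        window = end_frame // frames_per_point
        sums[window] = sums.get(window, 0.0) + episode_return
        counts[window] = counts.get(window, 0) + 1
    return {w: sums[w] / counts[w] for w in sorted(sums)}

# Hypothetical log: two episodes end in the first window, one in the second.
log = [(400_000, 100.0), (900_000, 300.0), (1_500_000, 50.0)]
print(bucket_returns(log))
```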

The algorithm-comparison graphs in Fig. 6 and Fig. 8 show the relative improvement of one algorithm over another in terms of area-under-the-curve (AUC). A comparison by maximum achieved score would yield similar overall results, but would underestimate the advantage in terms of learning speed (sample efficiency) and stability that the intrinsically motivated and MMC-based agents show over the baselines.
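A small sketch shows why AUC captures learning speed where the maximum score does not. It assumes a trapezoidal AUC over evenly spaced data points and a plain relative difference normalized by the baseline's AUC; the exact normalization used for the figures is not given in this section, and the two curves are made-up examples.

```python
def area_under_curve(scores, step=1.0):
    """Trapezoidal area under a training curve sampled at evenly spaced points."""
    return sum((a + b) / 2.0 * step for a, b in zip(scores, scores[1:]))

def relative_auc_improvement(scores_a, scores_b):
    """Relative AUC gain of curve A over curve B (assumed: normalized by B's AUC)."""
    auc_a, auc_b = area_under_curve(scores_a), area_under_curve(scores_b)
    return (auc_a - auc_b) / abs(auc_b)  # undefined when B's AUC is zero

# Two hypothetical agents that reach the same final score at different speeds.
fast = [0.0, 80.0, 100.0, 100.0]
slow = [0.0, 10.0, 50.0, 100.0]
print(max(fast) == max(slow), relative_auc_improvement(fast, slow))
```

Both curves peak at the same score, so a maximum-score comparison reports no difference, while the AUC comparison reflects the faster learner.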

C. Convolutional CTS

In Section 4 we have seen that DQN-PixelCNN outperforms DQN-CTS in most of the 57 Atari games, by providing a more impactful exploration bonus in hard exploration games, as well as a more graceful (less harmful) one in games where the learning algorithm does not benefit from the additional curiosity signal. One may wonder whether this improvement is due to the generally more expressive and accurate density model PixelCNN, or simply its convolutional nature, which gives it an advantage in generalization and sample efficiency over a model that represents pixel probabilities in a completely location-dependent way.

To answer this question, we developed a convolutional variant of the CTS model. This model has a single set of parameters, conditioning a pixel's value on its predecessors, shared across all pixel locations, instead of the location-dependent parameters in the regular CTS. In Fig. 14 we contrast the performance of DQN, DQN-MC, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN on 6 example games.
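The difference between shared and location-dependent parameters can be illustrated with a toy count-based pixel model. This is a deliberately simplified stand-in and not the CTS model: each pixel is conditioned only on its left neighbour, probabilities come from Laplace-smoothed counts, and the class name and example images are hypothetical.

```python
from collections import defaultdict

class CountModel:
    """Toy pixel model: P(pixel | left neighbour) from Laplace-smoothed counts.

    With shared=True one table is used at every location (convolutional);
    otherwise each location keeps its own table (location-dependent).
    """

    def __init__(self, num_values, shared):
        self.num_values = num_values
        self.shared = shared
        self.counts = defaultdict(int)  # (context key, value) -> count
        self.totals = defaultdict(int)  # context key -> count

    def _contexts(self, image):
        for r, line in enumerate(image):
            for c, value in enumerate(line):
                left = line[c - 1] if c > 0 else 0  # zero padding at the border
                yield ((left,) if self.shared else (r, c, left)), value

    def prob(self, image):
        """Probability of the image as a product of per-pixel conditionals."""
        p = 1.0
        for key, value in self._contexts(image):
            p *= (self.counts[(key, value)] + 1) / (self.totals[key] + self.num_values)
        return p

    def update(self, image):
        for key, value in self._contexts(image):
            self.counts[(key, value)] += 1
            self.totals[key] += 1

train = [[1, 2, 0, 0, 0]]    # a two-pixel "sprite" at the left edge
shifted = [[0, 0, 0, 1, 2]]  # the same sprite at the right edge
shared, local = CountModel(3, shared=True), CountModel(3, shared=False)
for _ in range(10):
    shared.update(train)
    local.update(train)
print(shared.prob(shifted), local.prob(shifted))
```

After seeing the sprite only at one location, the shared-parameter model assigns a much higher probability to the shifted image than the location-dependent one, which is the kind of generalization advantage the convolutional variant is meant to isolate.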

We first consider dense reward games like Q*BERT and ZAXXON, where most improvement comes from the use of the MMC, and the exploration bonus hurts performance. We find that in fact convolutional CTS behaves fairly similarly to PixelCNN, leaving agent performance unaffected, whereas regular CTS causes the agent to train more slowly or reach an earlier performance plateau. On the sparse reward games (GRAVITAR, PRIVATE EYE, VENTURE), however, convolutional CTS proves to be as inferior to PixelCNN as the vanilla CTS variant, failing to achieve the significant improvements over the baseline agents presented in this paper.

We conclude that while the convolutional aspect plays a role in the softer nature of the PixelCNN model compared to its CTS counterpart, it alone is insufficient to explain the massive exploration boost that the PixelCNN-derived reward provides to the DQN agent. The more advanced model's accuracy advantage translates into a more targeted and useful curiosity signal for the agent, which distinguishes novel from well-explored states more clearly and allows for more effective exploration.

D. The Hardest Exploration Games

Table 1 reproduces Bellemare et al. (2016)'s taxonomy of games available through the ALE according to their exploration difficulty. "Human-Optimal" refers to games where DQN-like agents achieve human-level or higher performance; "Score Exploit" refers to games where agents find ways to achieve superhuman scores, without necessarily playing the game as a human would. "Sparse" and "Dense" rewards are qualitative descriptors of the game's reward structure. See the original source for additional details.

Table 2 compares previously published results on the 7 hard exploration, sparse reward Atari 2600 games with results obtained by DQN-CTS and DQN-PixelCNN.

[Figure 14: six panels (Gravitar, Montezuma's Revenge, Private Eye, Q*Bert, Venture, Zaxxon); x-axis: Million Frames (0 to 140); curves: DQN, DQN-MC, DQN-CTS, DQN-ConvCTS, DQN-PixelCNN.]

Figure 14. Comparison of DQN, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN training performance.

| Easy Exploration: Human-Optimal | Easy Exploration: Human-Optimal (cont.) | Easy Exploration: Score Exploit | Hard Exploration: Dense Reward | Hard Exploration: Sparse Reward |
|---|---|---|---|---|
| ASSAULT | ASTERIX | BEAM RIDER | ALIEN | FREEWAY |
| ASTEROIDS | ATLANTIS | KANGAROO | AMIDAR | GRAVITAR |
| BATTLE ZONE | BERZERK | KRULL | BANK HEIST | MONTEZUMA'S REVENGE |
| BOWLING | BOXING | KUNG-FU MASTER | FROSTBITE | PITFALL! |
| BREAKOUT | CENTIPEDE | ROAD RUNNER | H.E.R.O. | PRIVATE EYE |
| CHOPPER CMD | CRAZY CLIMBER | SEAQUEST | MS. PAC-MAN | SOLARIS |
| DEFENDER | DEMON ATTACK | UP N DOWN | Q*BERT | VENTURE |
| DOUBLE DUNK | ENDURO | TUTANKHAM | SURROUND | |
| FISHING DERBY | GOPHER | | WIZARD OF WOR | |
| ICE HOCKEY | JAMES BOND | | ZAXXON | |
| NAME THIS GAME | PHOENIX | | | |
| PONG | RIVER RAID | | | |
| ROBOTANK | SKIING | | | |
| SPACE INVADERS | STARGUNNER | | | |

Table 1. A rough taxonomy of Atari 2600 games according to their exploration difficulty.

| Game | DQN | A3C-CTS | Prior. Duel | DQN-CTS | DQN-PixelCNN |
|---|---|---|---|---|---|
| FREEWAY | 30.8 | 30.48 | 33.0 | 31.7 | 31.7 |
| GRAVITAR | 473.0 | 238.68 | 238.0 | 498.3 | 859.1 |
| MONTEZUMA'S REVENGE | 0.0 | 273.70 | 0.0 | 3705.5 | 2514.3 |
| PITFALL! | -286.1 | -259.09 | 0.0 | 0.0 | 0.0 |
| PRIVATE EYE | 146.7 | 99.32 | 206.0 | 8358.7 | 15806.5 |
| SOLARIS | 3482.8 | 2270.15 | 133.4 | 2863.6 | 5501.5 |
| VENTURE | 163.0 | 0.00 | 48.0 | 82.2 | 1356.25 |

Table 2. Comparison with previously published results on hard exploration, sparse reward games. The compared agents are DQN (Mnih et al., 2015), A3C-CTS ("A3C+" in (Bellemare et al., 2016)), Prioritized Dueling DQN (Wang et al., 2016), and the basic versions of DQN-CTS and DQN-PixelCNN from Section 4. For our agents we report the maximum scores achieved over 150M frames of training, averaged over 3 seeds.

[Figure 15: one training-curve panel per game; x-axis ticks from 0 to 140; curves: DQN, DQN-CTS, DQN-PixelCNN.]

Figure 15. Training curves of DQN, DQN-CTS and DQN-PixelCNN across all 57 Atari games.

[Figure 16: one training-curve panel per game; x-axis ticks from 0 to 140; curves: DQN, DQN-PixelCNN, Reactor, Reactor-PixelCNN.]

Figure 16. Training curves of DQN, DQN-PixelCNN, Reactor and Reactor-PixelCNN across all 57 Atari games.

[Figure 17: one training-curve panel per game; x-axis ticks from 0 to 140; legend entries include DQN (w/o MC) and DQN (with MC).]

Figure 17. Training curves of DQN and DQN-PixelCNN, each with and without MMC, across all 57 Atari games.
