count-based exploration with neural density models · count-based exploration with neural density...
Post on 16-Feb-2019
246 Views
Preview:
TRANSCRIPT
Count-Based Exploration with Neural Density Models
Georg Ostrovski 1 Marc G. Bellemare 1 Aaron van den Oord 1 Remi Munos 1
Abstract
Bellemare et al. (2016) introduced the notion ofa pseudo-count, derived from a density model,to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonusfor a DQN agent and combined with a mixedMonte Carlo update was sufficient to achievestate of the art on the Atari 2600 game Mon-tezumas Revenge. We consider two questionsleft open by their work: First, how important isthe quality of the density model for exploration?Second, what role does the Monte Carlo updateplay in exploration? We answer the first questionby demonstrating the use of PixelCNN, an ad-vanced neural density model for images, to sup-ply a pseudo-count. In particular, we examine theintrinsic difficulties in adapting Bellemare et al.sapproach when assumptions about the model areviolated. The result is a more practical and gen-eral algorithm requiring no special apparatus. Wecombine PixelCNN pseudo-counts with differentagent architectures to dramatically improve thestate of the art on several hard Atari games. Onesurprising finding is that the mixed Monte Carloupdate is a powerful facilitator of exploration inthe sparsest of settings, including MontezumasRevenge.
1. IntroductionExploration is the process by which an agent learns aboutits environment. In the reinforcement learning framework,this involves reducing the agents uncertainty about theenvironments transition dynamics and attainable rewards.From a theoretical perspective, exploration is now well-understood (e.g. Strehl & Littman, 2008; Jaksch et al.,2010; Osband et al., 2016), and Bayesian methods have
1DeepMind, London, UK. Correspondence to: Georg Ostro-vski .
Proceedings of the 34 th International Conference on MachineLearning, Sydney, Australia, PMLR 70, 2017. Copyright 2017by the author(s).
been successfully demonstrated in a number of settings(Deisenroth & Rasmussen, 2011; Guez et al., 2012). On theother hand, practical algorithms for the general case remainscarce; fully Bayesian approaches are usually intractable inlarge state spaces, and the count-based method typical oftheoretical results is not applicable in the presence of valuefunction approximation.
Recently, Bellemare et al. (2016) proposed the notion ofpseudo-count as a reasonable generalization of the tabu-lar setting considered in the theory literature. The pseudo-count is defined in terms of a density model trained onthe sequence of states experienced by an agent:
N(x) = (x)n(x),
where n(x) can be thought of as a total pseudo-count com-puted from the models recoding probability (x), theprobability of x computed immediately after training onx. As a practical application the authors used the pseudo-counts derived from the simple CTS density model (Belle-mare et al., 2014) to incentivize exploration in Atari 2600agents. One of the main outcomes of their work was sub-stantial empirical progress on the infamously hard gameMONTEZUMAS REVENGE.
Their method critically hinged on several assumptionsregarding the density model: 1) the model should belearning-positive, i.e. the probability assigned to a state xshould increase with training; 2) it should be trained on-line, using each sample exactly once; and 3) the effectivemodel step-size should decay at a rate of n1. Part of theirempirical success also relied on a mixed Monte Carlo/Q-Learning update rule, which permitted fast propagation ofthe exploration bonuses.
In this paper, we set out to answer several research ques-tions related to these modelling choices and assumptions:
1. To what extent does a better density model give rise tobetter exploration?
2. Can the above modelling assumptions be relaxedwithout sacrificing exploration performance?
3. What role does the mixed Monte Carlo update play insuccessfully incentivizing exploration?
arX
iv:1
703.
0131
0v2
[cs
.AI]
14
Jun
2017
Count-Based Exploration with Neural Density Models
In particular, we explore the use of PixelCNN (van denOord et al., 2016b;a), a state-of-the-art neural densitymodel. We examine the challenges posed by this approach:
Model choice. Performing two evaluations and one modelupdate at each agent step (to compute (x) and (x)) canbe prohibitively expensive. This requires the design of asimplified yet sufficiently expressive and accurate Pix-elCNN architecture.
Model training. A CTS model can naturally be trainedfrom sequentially presented, correlated data samples.Training a neural model in this online fashion requiresmore careful attention to the optimization procedure to pre-vent overfitting and catastrophic forgetting (French, 1999).
Model use. The theory of pseudo-counts requires the den-sity models rate of learning to decay over time. Optimiza-tion of a neural model, however, imposes constraints on thestep-size regime which cannot be violated without deterio-rating effectiveness and stability of training.
The concept of intrinsic motivation has made a recent resur-gence in reinforcement learning research, in great partdue to a dissatisfaction with -greedy and Boltzmann poli-cies. Of note, Tang et al. (2016) maintain an approximatecount by means of hash tables over features, which in thepseudo-count framework corresponds to a hash-based den-sity model. Houthooft et al. (2016) used a second-orderTaylor approximation of the prediction gain to drive explo-ration in continuous control. As research moves towardsever more complex environments, we expect the trend to-wards more intrinsically motivated solutions to continue.
2. Background2.1. Pseudo-Count and Prediction Gain
Here we briefly introduce notation and results, referring thereader to (Bellemare et al., 2016) for technical details.
Let be a density model on a finite space X , and n(x) theprobability assigned by the model to x after being trainedon a sequence of states x1, . . . , xn. Assume n(x) > 0for all x, n. The recoding probability n(x) is then theprobability the model would assign to x if it were trained onthat same x one more time. We call learning-positive ifn(x) n(x) for all x1, . . . , xn, x X . The predictiongain (PG) of is
PGn(x) = log n(x) log n(x). (1)
A learning-positive implies PGn(x) 0 for all x X .For learning-positive , we define the pseudo-count as
Nn(x) =n(x)(1 n(x))n(x) n(x)
,
derived from postulating that a single observation of x X
should lead to a unit increase in pseudo-count:
n(x) =Nn(x)
n, n(x) =
Nn(x) + 1
n+ 1,
where n is the pseudo-count total. The pseudo-count gen-eralizes the usual state visitation count function Nn(x).Under certain assumptions on n, pseudo-counts growapproximately linearly with real counts. Crucially, thepseudo-count can be approximated using the predictiongain of the density model:
Nn(x) (ePGn(x) 1
)1.
Its main use is to define an exploration bonus. We considera reinforcement learning (RL) agent interacting with an en-vironment that provides observations and extrinsic rewards(see Sutton & Barto, 1998, for a thorough exposition of theRL framework). To the reward at step n we add the bonus
r+(x) := (Nn(x))1/2,
which incentivizes the agent to try to re-experience sur-prising situations. Quantities related to prediction gainhave been used for similar purposes in the intrinsic moti-vation literature (Lopes et al., 2012), where they measurean agents learning progress (Oudeyer et al., 2007). Al-though the pseudo-count bonus is close to the predictiongain, it is asymptotically more conservative and supportedby stronger theoretical guarantees.
2.2. Density Models for Images
The CTS density model (Bellemare et al., 2014) is basedon the namesake algorithm, Context Tree Switching (Ve-ness et al., 2012), a Bayesian variable-order Markov model.In its simplest form, the model takes as input a 2D imageand assigns to it a probability according to the product oflocation-dependent L-shaped filters, where the predictionof each filter is given by a CTS algorithm trained on pastimages. In Bellemare et al. (2016), this model was ap-plied to 3-bit greyscale, 42 42 downsampled Atari 2600frames (Fig. 1). The CTS model presents advantages interms of simplicity and performance but is limited in ex-pressiveness, scalability, and data efficiency.
In recent years, neural generative models for images haveachieved impressive successes in their ability to generatediverse images in various domains (Kingma & Welling,2013; Rezende et al., 2014; Gregor et al., 2015; Good-fellow et al., 2014). In particular, van den Oord et al.(2016b;a) introduced PixelCNN, a fully convolutional neu-ral network composed of residual blocks with multiplica-tive gating units, which models pixel probabilities condi-tional on previous pixels (in the usual top-left to bottom-right raster-scan order) by using masked convolution fil-
Count-Based Exploration with Neural Density Models
Original Frame (160x210) 3-bit Greyscale (42x42)
Figure 1. Atari frame preprocessing (Bellemare et al., 2016).
ters. This model achieved state-of-the-art modelling perfor-mance on standard datasets, paired with the computationalefficiency of a convolutional feed-forward network.
2.3. Multi-Step RL Methods
A distinguishing feature of reinforcement learning is thatthe agent learns on the basis of interim estimates (Sutton,1996). For example, the Q-Learning update rule is
Q(x, a) Q(x, a)+ [r(x, a) + maxa Q(x
, a)Q(x, a)] (x,a)
,
linking the reward r and next-state value functionQ(x, a)to the current state value function Q(x, a). This particularform is the stochastic update rule with step-size and in-volves the TD-error . In the approximate reinforcementlearning setting, such as when Q(x, a) is represented by aneural network, this update is converted into a loss to beminimized, most commonly the squared loss 2(x, a).
It is well known that better performance, both in terms oflearning efficiency and approximation error, is attained bymulti-step methods (Sutton, 1996; Tsitsiklis & van Roy,1997). These methods interpolate between one-step meth-ods (Q-Learning) and the Monte-Carlo update
Q(x, a) Q(x, a) +
[ t=0
tr(xt, at)Q(x, a)
]
MC(x,a)
,
where x0, a0, x1, a1, . . . is a sample path through the en-vironment beginning in (x, a). To achieve their success onthe hardest Atari 2600 games, Bellemare et al. (2016) usedthe mixed Monte-Carlo update (MMC)
Q(x, a) Q(x, a) + [(1 )(x, a) + MC(x, a)] ,
with [0, 1]. This choice was made for computa-tional and implementational simplicity, and is a particu-larly coarse multi-step method. A better multi-step methodis the recent Retrace() algorithm (Munos et al., 2016).Retrace() uses a product of truncated importance sam-
pling ratios c1, c2, . . . to replace with the error term
RETRACE(x, a) :=
t=0
t
(t
s=1
cs
)(xt, at),
effectively mixing in TD-errors from all future time steps.Munos et al. showed that Retrace() is safe (does not di-verge when trained on data from an arbitrary behaviour pol-icy), and efficient (makes the most of multi-step returns).
3. Using PixelCNN for ExplorationAs mentioned in the Introduction, the theory of using den-sity models for exploration makes several assumptions thattranslate into concrete requirements for an implementation:
(a) The density model should be trained completely on-line, i.e. exactly once on each state experienced by theagent, in the given sequential order.
(b) The prediction gain (PG) should decay at a rate n1
to ensure that pseudo-counts grow approximately lin-early with real counts.
(c) The density model should be learning-positive.
Simultaneously, a partly competing set of requirementsare posed by the practicalities of training a neural densitymodel and using it as part of an RL agent:
(d) For stability, efficiency, and to avoid catastrophic for-getting in the context of a drifting data distribution,it is advantageous to train a neural model in mini-batches, drawn randomly from a diverse dataset.
(e) For effective training, a certain optimization regime(e.g. a fixed learning rate schedule) has to be followed.
(f) The density model must be computationallylightweight, to allow computing the PG (twomodel evaluations and one update) as part of everytraining step of an RL agent.
We investigate how to best resolve these tensions in thecontext of the Arcade Learning Environment (Bellemareet al., 2013), a suite of benchmark Atari 2600 games.
3.1. Designing a Suitable Density Model
Driven by (f) and aiming for an agent with computationalperformance comparable to DQN, we design a slim vari-ant of the PixelCNN network. Its core is a stack of 2 gatedresidual blocks with 16 feature maps (compared to 15 resid-ual blocks with 128 feature maps in vanilla PixelCNN). Aswas done with the CTS model, images are downsampled to42 42 and quantized to 3-bit greyscale. See Appendix Afor technical details.
Count-Based Exploration with Neural Density Models
0 2000 4000 6000 8000 10000 12000
Model Updates
100
150
200
250
300
Pix
elC
NN
Log L
oss
Freewayonline
random permutation
random samples
0 1000 2000 3000 4000 5000 6000
Frames
101
102
103
104
Pix
elC
NN
Log L
oss
Pong
0 20 40 60 80
Million Frames
0
500
1000
1500
2000
2500
3000
3500
Avera
ge S
core
Montezuma's RevengeLR: 0. 001, PG: 0. 1 n1/2 PGnLR: 0. 001, PG: 0. 01 PGn
LR: 0. 1 n1/2, PG: 0. 1 n1/2 PGnLR: 0. 1 n1/2, PG: 0. 01 PGn
Figure 2. Left: PixelCNN log loss on FREEWAY, when trained online, on a random permutation (single use of each frame) or onrandomly drawn samples (with replacement, potentially using same frame multiple times) from the state sequence. To simulate theeffect of non-stationarity, the agents policy changes every 4K updates. All training methods show qualitatively similar learning progressand stability. Middle: PixelCNN log loss over first 6K training frames on PONG. Vertical dashed lines indicate episode ends. Thecoinciding loss spikes are the density models surprise upon observing the distinctive green frame that sometimes occurs at the episodestart. Right: DQN-PixelCNN training performance on MONTEZUMAS REVENGE as we vary learning rate and PG decay schedules.
10-5 10-4 10-3 10-2 10-1 100
Initial Learning Rate c
101
102
103
104
Avera
ge L
og L
oss
Freeway
10-5 10-4 10-3 10-2 10-1 100
Initial Learning Rate c
101
102
103
104
Montezuma's Revenge
c n1
c n1/2
c
Figure 3. Model loss averaged over 10K frames, after 1M trainingframes, for constant, n1, and n1/2 learning rate schedules. Thesmallest loss is achieved by a constant learning rate of 103.
3.2. Training the Density Model
Instead of using randomized mini-batches, we train thedensity model completely online on the sequence of experi-enced states. Empirically we found that with minor tuningof optimization hyper-parameters we could train the modelas robustly on a temporally correlated sequence of states ason a sequence with randomized order (Fig. 2(left)).
Besides satisfying the theoretical requirement (a), com-pletely online training of the density model has the advan-tage that n = n+1, so that the model update performedfor computing the PG need not be reverted1.
Another more subtle reason for avoiding mini-batch up-dates of the density model (despite (d)) is a practical op-timization issue. The (necessarily online) computation ofthe PG involves a model update and hence the use of anoptimizer. Advanced optimizers used with deep neural net-works, like the RMSProp optimizer (Tieleman & Hinton,2012) used in this work, are stateful, tracking running av-erages of e.g. mean and variance of the model parameters.If the model is additionally trained from mini-batches, thetwo streams of updates may show different statistical char-
1The CTS model allows querying the PG cheaply, without in-curring an actual update of model parameters.
acteristics (e.g. different gradient magnitudes), invalidatingthe assumptions underlying the optimization algorithm andleading to slower or unstable training.
To determine a suitable online learning rate schedule, wetrain the model on a sequence of 1M frames of experienceof a random-policy agent. We compare the loss achieved bytraining procedures following constant or decaying learn-ing rate schedules, see Fig. 3. The lowest final training lossis achieved by a constant learning rate of 0.001 or a de-caying learning rate of 0.1 n1/2. We settled our choiceon the constant learning rate schedule as it showed greaterrobustness with respect to the choice of initial learning rate.
PixelCNN rapidly learns a sensible distribution over statespace. Fig. 2(left) shows the models loss decaying asit learns to exploit image regularities. Spikes in its lossfunction quickly start to correspond to visually meaning-ful events, such as the starts of episodes (Fig. 2(middle)).A video of early density model training is provided inhttp://youtu.be/T6iaa8Z4eyE.
0 10 20 30 40
0
10
20
30
40
Temp 1.0
0 10 20 30 40
0
10
20
30
40
Temp 1.0
Figure 4. Samples after 25K steps. Left: CTS, right: PixelCNN.
3.3. Computing the Pseudo-Count
From the previous section we obtain a particular learningrate schedule that cannot be arbitrarily modified withoutdeteriorating the models training performance or stability.To achieve the required PG decay (b), we instead replacePGn by cn PGn with a suitably decaying sequence cn.
http://youtu.be/T6iaa8Z4eyE
Count-Based Exploration with Neural Density Models
In experiments comparing actual agent performance weempirically determined that in fact the constant learningrate 0.001, paired with a PG decay cn = c n1/2, obtainsthe best exploration results on hard exploration games likeMONTEZUMAS REVENGE, see Fig. 2(right). We find themodel to be robust across 1-2 orders of magnitude for thevalue of c, and informally determine c = 0.1 to be a sen-sible configuration for achieving good results on a broadrange of Atari 2600 games (see also Section 7).
Regarding (c), it is hard to ensure learning-positiveness fora deep neural model, and a negative PG can occur when-ever the optimizer overshoots a local loss minimum. Asa workaround, we threshold the PG value at 0. To summa-rize, the computed pseudo-count is
Nn(x) =(exp
(c n1/2 (PGn(x))+
) 1)1
.
4. Exploration in Atari 2600 GamesHaving described our pseudo-count friendly adaptation ofPixelCNN, we now study its performance on Atari games.To this end we augment the environment reward with apseudo-count exploration bonus, yielding the combined re-ward r(x, a) + (Nn(x))1/2. As usual for neural network-based agents, we ensure the total reward lies in [1, 1] byclipping larger values.
4.1. DQN with PixelCNN Exploration Bonus
Our first set of experiments provides the PixelCNN explo-ration bonus to a DQN agent (Mnih et al., 2015)2. At eachagent step, the density model receives a single frame, withwhich it simultaneously updates its parameters and outputsthe PG. We refer to this agent as DQN-PixelCNN.
The DQN-CTS agent we compare against is derived fromthe one in (Bellemare et al., 2016). For better compara-bility, it is trained in the same online fashion as DQN-PixelCNN, i.e. the PG is computed whenever we train thedensity model. By contrast, the original DQN-CTS queriedthe PG at the end of each episode.
Unless stated otherwise, we always use the mixed MonteCarlo update (MMC) for the intrinsically motivatedagents3, but regular Q-Learning for the baseline DQN.
Fig. 5 shows training curves of DQN compared to DQN-
2Unlike Bellemare et al. we use regular Q-Learning instead ofDouble Q-Learning (van Hasselt et al., 2016), as our early exper-iments showed no significant advantage of DoubleDQN with thePixelCNN-based exploration reward.
3The use of MMC in a replay-based agent poses a minor com-plication, as the MC return is not available for replay until the endof an episode. For simplicity, in our implementation we disregardthis detail and set the MC return to 0 for transitions from the mostrecent episode.
0 20 40 60 80 1000
500
1000
1500
2000
2500
3000
3500
Avera
ge S
core
Montezuma's RevengeDQN
DQN-CTS
DQN-PixelCNN
0 20 40 60 80 1002000
0
2000
4000
6000
8000
10000
12000
14000
16000Private Eye
DQN
DQN-CTS
DQN-PixelCNN
0 20 40 60 80 100
Million Frames
400
500
600
700
800
900
1000
1100
1200
1300
Avera
ge S
core
AsteroidsDQN
DQN-CTS
DQN-PixelCNN
0 20 40 60 80 100
Million Frames
100
150
200
250
300
350
400
450
500Berzerk
DQN
DQN-CTS
DQN-PixelCNN
Figure 5. DQN, DQN-CTS and DQN-PixelCNN on hard explo-ration games (top) and easier ones (bottom).
6
11
3
14
Figure 6. Improvements (in % of AUC) of DQN-PixelCNN andDQN-CTS over DQN in 57 Atari games. Annotations indicatethe number of hard exploration games with positive (right) andnegative (left) improvement, respectively.
CTS and DQN-PixelCNN. On the famous MONTEZUMASREVENGE, both intrinsically motivated agents vastly out-perform the baseline DQN. On other hard explorationgames (PRIVATE EYE; or VENTURE, appendix Fig. 15),DQN-PixelCNN achieves state of the art results, substan-tially outperforming DQN and DQN-CTS. The other twogames shown (ASTEROIDS, BERZERK) pose easier explo-ration problems, where the reward bonus should not pro-vide large improvements and may have a negative effect byskewing the reward landscape. Here, DQN-PixelCNN be-haves more gracefully and still outperforms DQN-CTS. Wehypothesize this is due to a qualitative difference betweenthe models, see Section 5.
Overall PixelCNN provides the DQN agent with a largeradvantage than CTS, and often accelerates or stabilizestraining even when not affecting peak performance. Out of
Count-Based Exploration with Neural Density Models
57 Atari games, DQN-PixelCNN outperforms DQN-CTSin 52 games by maximum achieved score, and 51 by AUC(methodology in Appendix B). See Fig. 6 for a high levelcomparison (appendix Fig. 15 for full training graphs). Thegreatest gains from using either exploration bonus are ob-served in games categorized as hard exploration games inthe taxonomy of exploration in (Bellemare et al., 2016,reproduced in Appendix D), specifically in the most chal-lenging sparse reward games (e.g. MONTEZUMAS RE-VENGE, PRIVATE EYE, VENTURE).
4.2. A Multi-Step RL Agent with PixelCNN
Empirical practitioners know that techniques beneficial forone agent architecture often can be detrimental for a dif-ferent algorithm. To demonstrate the wide applicabilityof the PixelCNN exploration bonus, we also evaluate itwith the more recent Reactor agent4 (Gruslys et al., 2017).This replay-based actor-critic agent represents its policyand value function by a recurrent neural network and, cru-cially, uses the multi-step Retrace() algorithm for policyevaluation, replacing the MMC we use in DQN-PixelCNN.
To reduce impact on computational efficiency of this agent,we sub-sample intrinsic rewards: we perform updates ofthe PixelCNN model and compute the reward bonus on(randomly chosen) 25% of all steps, leaving the agents re-ward unchanged on other steps. We use the same PG decayschedule of 0.1n1/2, with n the number of model updates.
0 20 40 60 80 100 120 1400
500
1000
1500
2000
Avera
ge S
core
GravitarDQN
DQN-PixelCNN
Reactor
Reactor-PixelCNN
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000
30000
35000H.E.R.O.
DQN
DQN-PixelCNN
Reactor
Reactor-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
0
10000
20000
30000
40000
50000
60000
Avera
ge S
core
Road RunnerDQN
DQN-PixelCNN
Reactor
Reactor-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
0
5000
10000
15000
20000
25000Time Pilot
DQN
DQN-PixelCNN
Reactor
Reactor-PixelCNN
Figure 7. Reactor/Reactor-PixelCNN and DQN/DQN-PixelCNNtraining performance (averaged over 3 seeds).
Training curves for the Reactor/Reactor-PixelCNN agentcompared to DQN/DQN-PixelCNN are shown in Fig. 7.The baseline Reactor agent is superior to the DQN agent,obtaining higher scores and learning faster in about 50 outof 57 games. It is further improved on a large fraction ofgames by the PixelCNN exploration reward, see Fig. 8 (fulltraining graphs in appendix Fig. 16).
The effect of the exploration bonus is rather uniform, yield-ing improvements on a broad range of games. In particu-
4The exact agent variant is referred to as -LOO with = 1.
lar, Reactor-PixelCNN enjoys better sample efficiency (interms of area under the curve, AUC) than vanilla Reactor.We hypothesize that, like other policy gradient algorithms,Reactor generally suffers from weaker exploration than itsvalue-based counterpart DQN. This aspect is much helpedby the exploration bonus, boosting the agents sample effi-ciency in many environments.
0%
50%
100%
150%
200%
250%
% Improvement Reactor-PixelCNN over Reactor (by AUC)
Easy exploration (40)
Hard exploration, dense reward (10)
Hard exploration, sparse reward (7)
Figure 8. Improvements (in % of AUC) of Reactor-PixelCNNover Reactor in 57 Atari games.
However, on hard exploration games with sparse rewards,Reactor seems unable to make full use of the explorationbonus. We believe this is because, in very sparse settings,the propagation of reward information across long hori-zons becomes crucial. The MMC takes one extreme ofthis view, directly learning from the observed returns. TheRetrace() algorithm, on the other hand, has an effectivehorizon which depends on and, critically, the truncatedimportance sampling ratio. This ratio results in the discard-ing of trajectories which are off-policy, i.e. unlikely underthe current policy. We hypothesize that the very goal of theRetrace() algorithm to learn cautiously is what prevents itfrom taking full advantage of the exploration bonus!
5. Quality of the Density ModelPixelCNN can be expected to be more expressive and accu-rate than the less advanced CTS model, and indeed, sam-ples generated after training are somewhat higher quality(Fig. 4). However, we are not using the generative functionof the models when computing an exploration bonus, anda better generative model does not necessarily give rise tobetter probability estimates (Theis et al., 2016).
CTS
PixelCNN
CTS
PixelCNN
Figure 9. PG on MONTEZUMAS REVENGE (log scale).
Count-Based Exploration with Neural Density Models
In Fig. 9 we compare the PG produced by the two mod-els throughout 5K training steps. PixelCNN consistentlyproduces PGs lower than CTS. More importantly, its PGsare smoother, exhibiting less variance between succes-sive states, while showing more pronounced peaks at cer-tain infrequent events. This yields a reward bonus that isless harmful in easy exploration games, while providing astrong signal in the case of novel or rare events.
Another distinguishing feature of PixelCNN is its non-decaying step-size. The per-step PG never completely van-ishes, as the model tracks the most recent data. This pro-vides an unexpected benefit: the agent remains mildly sur-prised by significant state changes, e.g. switching rooms inMONTEZUMAS REVENGE. These persistent rewards actas milestones that the agent learns to return to. This is il-lustrated in Fig. 10, depicting the intrinsic reward over thecourse of an episode. The agent routinely revisits the right-hand side of the torch room, not because it leads to rewardbut just to take in the sights. A video of the episode isprovided at http://youtu.be/232tOUPKPoQ.5
Time Steps
Intri
nsic
Rew
ard
Figure 10. Intrinsic reward in MONTEZUMAS REVENGE.
Lastly, PixelCNNs convolutional nature is expected to bebeneficial for its sample efficiency. In Appendix C we com-pare to a convolutional CTS and confirm that this explainspart, but not all of PixelCNNs advantage over vanilla CTS.
6. Importance of the Monte Carlo ReturnLike for DQN-CTS, the success of DQN-PixelCNN hingeson the use of the mixed Monte Carlo update. The transientand vanishing nature of the exploration rewards requiresthe learning algorithm to latch on to these rapidly. TheMMC serves this end as a simple multi-step method, help-ing to propagate reward information faster. An additionalbenefit lies in the fact that the Monte Carlo return helpsbridging long horizons in environments where rewards arefar apart and encountered rarely. On the other hand, it is
5Another agent video on the game PRIVATE EYE can be foundat http://youtu.be/kNyFygeUa2E.
0 20 40 60 80 1000
100
200
300
400
500
Avera
ge S
core
Bank Heist
DQN (w/o MC)
DQN (with MC)
DQN-PixelCNN (w/o MC)
DQN-PixelCNN (with MC)
0 20 40 60 80 1000
500
1000
1500
2000
2500Ms. Pacman
DQN (w/o MC)
DQN (with MC)
DQN-PixelCNN (w/o MC)
DQN-PixelCNN (with MC)
0 20 40 60 80 100
Million Frames
0
5000
10000
15000
20000
25000
Avera
ge S
core
H.E.R.O
DQN (w/o MC)
DQN (with MC)
DQN-PixelCNN (w/o MC)
DQN-PixelCNN (with MC)
0 20 40 60 80 100
Million Frames
2000
0
2000
4000
6000
8000
10000
12000
14000
16000Private Eye
DQN (w/o MC)
DQN (with MC)
DQN-PixelCNN (w/o MC)
DQN-PixelCNN (with MC)
Figure 11. Top: games where MMC completely explains the im-proved/decreased performance of DQN-PixelCNN compared toDQN. Bottom-left: MMC and PixelCNN show additive benefits.Bottom-right: hard exploration, sparse reward game only com-bining MMC and PixelCNN bonus achieves training progress.
important to note that the Monte Carlo returns on-policynature increases variance in the learning algorithm, and canprevent the algorithms convergence to the optimal policywhen training off-policy. It can therefore be expected toadversely affect training performance in some games.
To distill the effect of the MMC on performance, we com-pare all four combinations of DQN with/without PixelCNNexploration bonus and with/without MMC. Fig. 11 showsthe performance of these four agent variants (graphs for allgames are shown in Fig. 17). These games were picked toillustrate several commonly occurring cases:
MMC speeds up training and improves final perfor-mance significantly (examples: BANK HEIST, TIMEPILOT). In these games, MMC alone explains most orall of the improvement of DQN-PixelCNN over DQN.
MMC hurts performance (examples: MS. PAC-MAN,BREAKOUT). Here too, MMC alone explains most ofthe difference between DQN-PixelCNN and DQN.
MMC and PixelCNN reward bonus have a compound-ing effect (example: H.E.R.O.).
Most importantly, the situation is rather different when werestrict our attention to the hardest exploration games withsparse rewards. Here the baseline DQN agent fails to makeany training progress, and neither Monte Carlo return northe exploration bonus alone provide any significant benefit.Their combination however grants the agent rapid trainingprogress and allows it to achieve high performance.
One effect of the exploration bonus in these games is toprovide a denser reward landscape, enabling the agent tolearn meaningful policies. Due to the transient nature ofthe exploration bonus, the agent needs to be able to learnfrom this reward signal faster than regular one-step meth-ods allow, and MMC proves to be an effective solution.
http://youtu.be/232tOUPKPoQhttp://youtu.be/kNyFygeUa2E
Count-Based Exploration with Neural Density Models
0 50 100 150 200
Million Frames
0
1000
2000
3000
4000
5000
6000
7000
Avera
ge S
core
Montezuma's Revengec= 0. 1
c= 1. 0
c= 10. 0
0 50 100 150 200
Million Frames
0
5000
10000
15000
20000
25000
30000
35000
Avera
ge S
core
Private Eyec= 0. 1
c= 1. 0
c= 10. 0
0 50 100 150 200
Million Frames
0
200
400
600
800
1000
1200
1400
Avera
ge S
core
Venturec= 0. 1
c= 1. 0
c= 10. 0
Figure 12. DQN-PixelCNN, hard exploration games, different PG scales c n1/2 PGn (c = 0.1, 1, 10) (5 seeds each).
7. Pushing the Limits of Intrinsic MotivationIn this section we explore the idea of a maximally curiousagent, whose reward function is dominated by the explo-ration bonus. For that we increase the PG scale, previouslychosen conservatively to avoid adverse effects on easy ex-ploration games.
Fig. 12 shows DQN-PixelCNN performance on the hardestexploration games when the PG scale is increased by 1-2 orders of magnitude. The algorithm seems fairly robustacross a wide range of scales: the main effect of increasingthis parameter is to trade off exploration (seeking maximalreward) with exploitation (optimizing the current policy).
As expected, a higher PG scale translates to stronger ex-ploration: several runs obtain record peak scores (900 inGRAVITAR, 6,600 in MONTEZUMAS REVENGE, 39,000in PRIVATE EYE, 1,500 in VENTURE) surpassing the stateof the art by a substantial margin (for previously publishedresults, see Appendix D). Aggressive scaling speeds up theagents exploration and achieves peak performance rapidly,but can also deteriorate its stability and long-term perfor-mance. Note that in practice, because of the non-decayingstep-size the PG does not vanish. After reward clipping,an overly inflated exploration bonus can therefore becomeessentially constant, no longer providing a useful intrinsicmotivation signal to the agent.
Another way of creating an entirely curiosity-driven agentis to ignore the environment reward altogether and trainbased on the exploration reward only, see Fig. 13. Remark-ably, the curiosity signal alone is sufficient to train a high-performing agent (measured by environment reward!).
It is worth noting that agents with exploration bonus seemto never stop exploring: for different seeds, the agentsmake learning progress at very different times during train-ing, a qualitative difference to vanilla DQN.
8. ConclusionWe demonstrated the use of PixelCNN for exploration andshowed that its greater accuracy and expressiveness trans-late into a more useful exploration bonus than that obtainedfrom previous models. While the current theory of pseudo-
0 20 40 60 80 100 120 140 160
Million Frames
0
500
1000
1500
2000
2500
Avera
ge S
core
Montezuma's Revengec= 1. 0
c= 5. 0
c= 10. 0
c= 50. 0
0 20 40 60 80 100 120 140 160
Million Frames
0
5000
10000
15000
20000
Avera
ge S
core
Private Eyec= 1. 0
c= 5. 0
c= 10. 0
c= 50. 0
Figure 13. DQN-PixelCNN trained from intrinsic reward only(3 seeds for each configuration).
counts puts stringent requirements on the density model,we have shown that PixelCNN can be used in a simpler andmore general setup, and can be trained completely online.It also proves to be widely compatible with both value-function and policy-based RL algorithms.
In addition to pushing the state of the art on the hardestexploration problems among the Atari 2600 games, Pixel-CNN improves speed of learning and stability of baselineRL agents across a wide range of games. The quality of itsreward bonus is evidenced by the fact that on sparse rewardgames, this signal alone suffices to learn to achieve signifi-cant scores, creating a truly intrinsically motivated agent.
Our analysis also reveals the importance of the MonteCarlo return for effective exploration. The comparison withmore sophisticated but fixed-horizon multi-step methodsshows that its significance lies both in faster learning in thecontext of a useful but transient reward function, as well asbridging reward gaps in environments where extrinsic andintrinsic rewards are, or quickly become, extremely sparse.
Count-Based Exploration with Neural Density Models
AcknowledgementsThe authors thank Tom Schaul, Olivier Pietquin, Ian Os-band, Sriram Srinivasan, Tejas Kulkarni, Alex Graves,Charles Blundell, and Shimon Whiteson for invaluablefeedback on the ideas presented here, and AudrunasGruslys especially for providing the Reactor agent.
ReferencesBellemare, Marc, Veness, Joel, and Talvitie, Erik. Skip
context tree switching. Proceedings of the InternationalConference on Machine Learning, 2014.
Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, andBowling, Michael. The arcade learning environment:An evaluation platform for general agents. Journal ofArtificial Intelligence Research, 47:253279, 2013.
Bellemare, Marc G., Srinivasan, Sriram, Ostrovski, Georg,Schaul, Tom, Saxton, David, and Munos, Remi. Uni-fying count-based exploration and intrinsic motivation.Advances in Neural Information Processing Systems,2016.
Deisenroth, Marc P. and Rasmussen, Carl E. PILCO:A model-based and data-efficient approach to policysearch. In Proceedings of the International Conferenceon Machine Learning, 2011.
French, Robert M. Catastrophic forgetting in connectionistnetworks. Trends in cognitive sciences, 3(4):128135,1999.
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu,Bing, Warde-Farley, David, Ozair, Sherjil, Courville,Aaron, and Bengio, Yoshua. Generative adversarial nets.In Advances in Neural Information Processing Systems,2014.
Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende,Danilo, and Wierstra, Daan. Draw: A recurrent neuralnetwork for image generation. In Proceedings of the In-ternational Conference on Machine Learning, 2015.
Gruslys, Audrunas, Azar, Mohammad Gheshlaghi, Belle-mare, Marc G., and Munos, Remi. The Reactor: Asample-efficient actor-critic architecture. arXiv preprintarXiv:1704.04651, 2017.
Guez, Arthur, Silver, David, and Dayan, Peter. Efficientbayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Pro-cessing Systems, 2012.
Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John,De Turck, Filip, and Abbeel, Pieter. Variational infor-mation maximizing exploration. In Advances in NeuralInformation Processing Systems (NIPS), 2016.
Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Jour-nal of Machine Learning Research, 11:15631600,2010.
Kingma, Diederik P. and Welling, Max. Auto-encodingvariational bayes. In Proceedings of the InternationalConference on Learning Representations, 2013.
Lopes, Manuel, Lang, Tobias, Toussaint, Marc, andOudeyer, Pierre-Yves. Exploration in model-based re-inforcement learning by empirically estimating learningprogress. In Advances in Neural Information ProcessingSystems, 2012.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David,Rusu, Andrei A, Veness, Joel, Bellemare, Marc G.,Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K.,Ostrovski, Georg, et al. Human-level control throughdeep reinforcement learning. Nature, 518(7540):529533, 2015.
Munos, Remi, Stepleton, Tom, Harutyunyan, Anna, andBellemare, Marc G. Safe and efficient off-policy rein-forcement learning. In Advances in Neural InformationProcessing Systems, 2016.
Osband, Ian, van Roy, Benjamin, and Wen, Zheng. Gen-eralization and exploration via randomized value func-tions. In Proceedings of the International Conference onMachine Learning, 2016.
Oudeyer, Pierre-Yves, Kaplan, Frederic, and Hafner, Ver-ena V. Intrinsic motivation systems for autonomousmental development. IEEE Transactions on Evolution-ary Computation, 11(2):265286, 2007.
Rezende, Danilo Jimenez, Mohamed, Shakir, and Wier-stra, Daan. Stochastic backpropagation and approximateinference in deep generative models. In Proceedingsof The International Conference on Machine Learning,2014.
Strehl, Alexander L. and Littman, Michael L. An analysisof model-based interval estimation for Markov decisionprocesses. Journal of Computer and System Sciences, 74(8):1309 1331, 2008.
Sutton, Richard S. Generalization in reinforcement learn-ing: Successful examples using sparse coarse coding.In Advances in Neural Information Processing Systems,1996.
Sutton, Richard S. and Barto, Andrew G. Reinforcementlearning: An introduction. MIT Press, 1998.
Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke,Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck,
Count-Based Exploration with Neural Density Models
Filip, and Abbeel, Pieter. #Exploration: A study ofcount-based exploration for deep reinforcement learn-ing. arXiv preprint arXiv:1611.04717, 2016.
Theis, Lucas, van den Oord, Aaron, and Bethge, Matthias.A note on the evaluation of generative models. In Pro-ceedings of the International Conference on LearningRepresentations, 2016.
Tieleman, Tijmen and Hinton, Geoffrey. RMSProp: dividethe gradient by a running average of its recent magni-tude. COURSERA. Lecture 6.5 of Neural Networks forMachine Learning, 2012.
Tsitsiklis, John N. and van Roy, Benjamin. An analysis oftemporal-difference learning with function approxima-tion. IEEE Transactions on Automatic Control, 42(5):674690, 1997.
van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse,Vinyals, Oriol, Graves, Alex, et al. Conditional imagegeneration with PixelCNN decoders. In Advances inNeural Information Processing Systems, 2016a.
van den Oord, Aaron, Kalchbrenner, Nal, andKavukcuoglu, Koray. Pixel recurrent neural net-works. In Proceedings of the International Conferenceon Machine Learning, 2016b.
van Hasselt, Hado, Guez, Arthur, and Silver, David. Deepreinforcement learning with Double Q-learning. In Pro-ceedings of the AAAI Conference on Artificial Intelli-gence, 2016.
Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling,Michael H. Context tree switching. In Proceedings ofthe Data Compression Conference, 2012.
Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt,Hado, Lanctot, Marc, and de Freitas, Nando. Duelingnetwork architectures for deep reinforcement learning.In Proceedings of The 33rd International Conference onMachine Learning, pp. 19952003, 2016.
Count-Based Exploration with Neural Density Models
A. PixelCNN Hyper-parametersThe PixelCNN model used in this paper is a lightweightvariant of the Gated PixelCNN introduced in (van den Oordet al., 2016a). It consists of a 7 7 masked convolution,followed by two residual blocks with 11 masked convolu-tions with 16 feature planes, and another 11 masked con-volution producing 64 features planes, which are mappedby a final masked convolution to the output logits. Inputsare 42 42 greyscale images, with pixel values quantizedto 8 bins.
The model is trained completely online, from the stream ofAtari frames experienced by an agent. Optimization is per-formed with the (uncentered) RMSProp optimizer (Tiele-man & Hinton, 2012) with momentum 0.9, decay 0.95 andepsilon 104.
B. MethodologyUnless otherwise stated, all agent performance graphs inthis paper show the agents training performance, measuredas the undiscounted per-episode return, averaged over 1Menvironment frames per data point.
The algorithm-comparison graphs Fig. 6 and Fig. 8 showthe relative improvement of one algorithm over anotherin terms of area-under-the-curve (AUC). A comparison bymaximum achieved score would yield similar overall re-sults, but underestimate the advantage in terms of learningspeed (sample efficiency) and stability that the intrinsicallymotivated and MMC-based agents show over the baselines.
C. Convolutional CTSIn Section 4 we have seen that DQN-PixelCNN outper-forms DQN-CTS in most of the 57 Atari games, by provid-ing a more impactful exploration bonus in hard explorationgames, as well as a more graceful (less harmful) one ingames where the learning algorithm does not benefit fromthe additional curiosity signal. One may wonder whetherthis improvement is due to the generally more expressiveand accurate density model PixelCNN, or simply its con-volutional nature, which gives it an advantage in general-ization and sample efficiency over a model that representspixel probabilities in a completely location-dependent way.
To answer this question, we developed a convolutional vari-ant of the CTS model. This model has a single set of pa-rameters conditioning a pixels value on its predecessorsshared across all pixel locations, instead of the location-dependent parameters in the regular CTS. In Fig. 14 wecontrast the performance of DQN, DQN-MC, DQN-CTS,DQN-ConvCTS and DQN-PixelCNN on 6 example games.
We first consider dense reward games like Q*BERT and
ZAXXON, where most improvement comes from the useof the MMC, and the exploration bonus hurts performance.We find that in fact convolutional CTS behaves fairly sim-ilarly to PixelCNN, leaving agent performance unaffected,whereas regular CTS causes the agent to train more slowlyor reach an earlier performance plateau. On the sparse re-ward games (GRAVITAR, PRIVATE EYE, VENTURE) how-ever, convolutional CTS shows to be as inferior to Pixel-CNN as the vanilla CTS variant, failing to achieve the sig-nificant improvements over the baseline agents presentedin this paper.
We conclude that while the convolutional aspect plays arole in the softer nature of the PixelCNN model com-pared to its CTS counterpart, it alone is insufficient toexplain the massive exploration boost that the PixelCNN-derived reward provides to the DQN agent. The more ad-vanced models accuracy advantage translates into a moretargeted and useful curiosity signal for the agent, whichdistinguishes novel from well-explored states more clearlyand allows for more effective exploration.
D. The Hardest Exploration GamesTable 1 reproduces Bellemare et al. (2016)s taxonomy ofgames available through the ALE according to their explo-ration difficulty. Human-Optimal refers to games whereDQN-like agents achieve human-level or higher perfor-mance; Score Exploit refers to games where agents findways to achieve superhuman scores, without necessarilyplaying the game as a human would. Sparse and Denserewards are qualitative descriptors of the games rewardstructure. See the original source for additional details.
Table 2 compares previously published results on the 7 hardexploration, sparse reward Atari 2600 games with resultsobtained by DQN-CTS and DQN-PixelCNN.
Count-Based Exploration with Neural Density Models
0 20 40 60 80 100 120 140
Million Frames
0
100
200
300
400
500
600
Avera
ge S
core
GravitarDQN
DQN-MC
DQN-CTS
DQN-ConvCTS
DQN-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
0
500
1000
1500
2000
2500
3000
3500
4000
Avera
ge S
core
Montezuma's RevengeDQN
DQN-MC
DQN-CTS
DQN-ConvCTS
DQN-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
2000
0
2000
4000
6000
8000
10000
12000
14000
16000
Avera
ge S
core
Private Eye
DQN
DQN-MC
DQN-CTS
DQN-ConvCTS
DQN-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
0
1000
2000
3000
4000
5000
6000
7000
8000
Avera
ge S
core
Q*Bert
DQN
DQN-MC
DQN-CTS
DQN-ConvCTS
DQN-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
0
200
400
600
800
1000
1200
Avera
ge S
core
VentureDQN
DQN-MC
DQN-CTS
DQN-ConvCTS
DQN-PixelCNN
0 20 40 60 80 100 120 140
Million Frames
0
2000
4000
6000
8000
10000
Avera
ge S
core
ZaxxonDQN
DQN-MC
DQN-CTS
DQN-ConvCTS
DQN-PixelCNN
Figure 14. Comparison of DQN, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN training performance.
Easy Exploration Hard ExplorationHuman-Optimal Score Exploit Dense Reward Sparse Reward
ASSAULT ASTERIX BEAM RIDER ALIEN FREEWAYASTEROIDS ATLANTIS KANGAROO AMIDAR GRAVITAR
BATTLE ZONE BERZERK KRULL BANK HEIST MONTEZUMAS REVENGEBOWLING BOXING KUNG-FU MASTER FROSTBITE PITFALL!
BREAKOUT CENTIPEDE ROAD RUNNER H.E.R.O. PRIVATE EYECHOPPER CMD CRAZY CLIMBER SEAQUEST MS. PAC-MAN SOLARIS
DEFENDER DEMON ATTACK UP N DOWN Q*BERT VENTUREDOUBLE DUNK ENDURO TUTANKHAM SURROUNDFISHING DERBY GOPHER WIZARD OF WOR
ICE HOCKEY JAMES BOND ZAXXONNAME THIS GAME PHOENIX
PONG RIVER RAIDROBOTANK SKIING
SPACE INVADERS STARGUNNER
Table 1. A rough taxonomy of Atari 2600 games according to their exploration difficulty.
DQN A3C-CTS Prior. Duel DQN-CTS DQN-PixelCNNFREEWAY 30.8 30.48 33.0 31.7 31.7
GRAVITAR 473.0 238.68 238.0 498.3 859.1MONTEZUMAS REVENGE 0.0 273.70 0.0 3705.5 2514.3
PITFALL! -286.1 -259.09 0.0 0.0 0.0PRIVATE EYE 146.7 99.32 206.0 8358.7 15806.5
SOLARIS 3,482.8 2270.15 133.4 2863.6 5501.5VENTURE 163.0 0.00 48.0 82.2 1356.25
Table 2. Comparison with previously published results on hard exploration, sparse reward games. The compared agents are DQN (Mnihet al., 2015), A3C-CTS (A3C+ in (Bellemare et al., 2016)), Prioritized Dueling DQN (Wang et al., 2016), and the basic versions ofDQN-CTS and DQN-PixelCNN from Section 4. For our agents we report the maximum scores achieved over 150M frames of training,averaged over 3 seeds.
Count-Based Exploration with Neural Density Models
0 20 40 60 80 100 120 140200
400
600
800
1000
1200
1400
1600
1800Alien
0 20 40 60 80 100 120 1400
200
400
600
800
1000Amidar
0 20 40 60 80 100 120 140200
400
600
800
1000
1200
1400Assault
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500
4000Asterix
0 20 40 60 80 100 120 140400
500
600
700
800
900
1000
1100
1200
1300Asteroids
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000
70000Atlantis
0 20 40 60 80 100 120 1400
100
200
300
400
500Bank Heist
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000
14000
16000
18000Battle Zone
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Beam Rider
0 20 40 60 80 100 120 140100
150
200
250
300
350
400
450
500Berzerk
0 20 40 60 80 100 120 14015
20
25
30
35
40
45Bowling
0 20 40 60 80 100 120 14010
0
10
20
30
40
50
60
70
80Boxing
0 20 40 60 80 100 120 1400
50
100
150
200
250Breakout
0 20 40 60 80 100 120 140500
1000
1500
2000
2500
3000
3500Centipede
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000Chopper Command
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000
70000
80000
90000Crazy Climber
0 20 40 60 80 100 120 1402000
3000
4000
5000
6000
7000
8000
9000
10000Defender
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000Demon Attack
0 20 40 60 80 100 120 14022
20
18
16
14
12
10
8Double Dunk
0 20 40 60 80 100 120 1400
100
200
300
400
500Enduro
0 20 40 60 80 100 120 140100
90
80
70
60
50
40
30
20
10Fishing Derby
0 20 40 60 80 100 120 1400
5
10
15
20
25
30Freeway
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500Frostbite
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000Gopher
0 20 40 60 80 100 120 1400
100
200
300
400
500
600Gravitar
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000H.E.R.O.
0 20 40 60 80 100 120 14016
14
12
10
8
6
4Ice Hockey
0 20 40 60 80 100 120 1400
100
200
300
400
500
600Jamesbond
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000
9000Kangaroo
0 20 40 60 80 100 120 1402000
3000
4000
5000
6000
7000
8000Krull
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000
14000
16000Kung Fu Master
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500
4000Montezuma's Revenge
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500Ms. Pacman
0 20 40 60 80 100 120 1401500
2000
2500
3000
3500
4000
4500
5000
5500
6000Name this Game
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Phoenix
0 20 40 60 80 100 120 140900
800
700
600
500
400
300
200
100
0Pitfall!
0 20 40 60 80 100 120 14025
20
15
10
5
0
5
10
15
20Pong
0 20 40 60 80 100 120 1402000
0
2000
4000
6000
8000
10000
12000
14000
16000Private Eye
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000Q*Bert
0 20 40 60 80 100 120 1401000
2000
3000
4000
5000
6000
7000
8000
9000Riverraid
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000Road Runner
0 20 40 60 80 100 120 1400
10
20
30
40
50Robotank
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000Seaquest
0 20 40 60 80 100 120 14022000
21000
20000
19000
18000
17000
16000
15000
14000Skiing
0 20 40 60 80 100 120 140500
1000
1500
2000
2500
3000
3500
4000Solaris
0 20 40 60 80 100 120 140100
200
300
400
500
600
700
800
900
1000Space Invaders
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000Star Gunner
0 20 40 60 80 100 120 14010.0
9.8
9.6
9.4
9.2
9.0
8.8Surround
0 20 40 60 80 100 120 14025
20
15
10
5
0
5Tennis
0 20 40 60 80 100 120 1401000
2000
3000
4000
5000
6000
7000Time Pilot
0 20 40 60 80 100 120 1400
50
100
150
200Tutankham
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000Up'n'down
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200Venture
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000
70000
80000Video Pinball
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500Wizard of Wor
0 20 40 60 80 100 120 1402000
4000
6000
8000
10000
12000
14000
16000Yar's Revenge
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000
9000Zaxxon
DQN
DQN-CTS
DQN-PixelCNN
Figure 15. Training curves of DQN, DQN-CTS and DQN-PixelCNN across all 57 Atari games.
Count-Based Exploration with Neural Density Models
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500
4000
4500Alien
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200
1400Amidar
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500
4000Assault
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000
9000Asterix
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000Asteroids
0 20 40 60 80 100 120 1400.0
0.2
0.4
0.6
0.8
1.0 1e7Atlantis
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200
1400Bank Heist
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000
70000Battle Zone
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Beam Rider
0 20 40 60 80 100 120 140100
200
300
400
500
600
700Berzerk
0 20 40 60 80 100 120 14010
20
30
40
50
60
70
80Bowling
0 20 40 60 80 100 120 14020
0
20
40
60
80
100Boxing
0 20 40 60 80 100 120 1400
50
100
150
200
250
300Breakout
0 20 40 60 80 100 120 140500
1000
1500
2000
2500
3000
3500
4000
4500
5000Centipede
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000Chopper Command
0 20 40 60 80 100 120 1400
20000
40000
60000
80000
100000
120000
140000Crazy Climber
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000Defender
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000
9000Demon Attack
0 20 40 60 80 100 120 14025
20
15
10
5
0
5
10
15Double Dunk
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500Enduro
0 20 40 60 80 100 120 140100
80
60
40
20
0
20
40Fishing Derby
0 20 40 60 80 100 120 1400
5
10
15
20
25
30
35Freeway
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Frostbite
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000Gopher
0 20 40 60 80 100 120 1400
500
1000
1500
2000Gravitar
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000
30000
35000H.E.R.O.
0 20 40 60 80 100 120 14020
15
10
5
0
5
10Ice Hockey
0 20 40 60 80 100 120 1400
100
200
300
400
500
600
700
800Jamesbond
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000
14000Kangaroo
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000Krull
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000
30000
35000
40000Kung Fu Master
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000Montezuma's Revenge
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500MS. Pacman
0 20 40 60 80 100 120 1401000
2000
3000
4000
5000
6000
7000
8000
9000
10000Name this Game
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Phoenix
0 20 40 60 80 100 120 1401400
1200
1000
800
600
400
200
0Pitfall!
0 20 40 60 80 100 120 14030
20
10
0
10
20
30Pong
0 20 40 60 80 100 120 1405000
0
5000
10000
15000
20000Private Eye
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000
14000Q*Bert
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000Riverraid
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000Road Runner
0 20 40 60 80 100 120 1400
10
20
30
40
50
60
70Robotank
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000Seaquest
0 20 40 60 80 100 120 14035000
30000
25000
20000
15000
10000Skiing
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000Solaris
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200
1400
1600
1800Space Invaders
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000
30000Star Gunner
0 20 40 60 80 100 120 14010
8
6
4
2
0Surround
0 20 40 60 80 100 120 14030
20
10
0
10
20
30Tennis
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000Time Pilot
0 20 40 60 80 100 120 1400
50
100
150
200
250Tutankham
0 20 40 60 80 100 120 1400
50000
100000
150000
200000
250000Up'n'down
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200
1400
1600Venture
0 20 40 60 80 100 120 1400
50000
100000
150000
200000
250000
300000
350000
400000Video Pinball
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000Wizard of Wor
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000
70000
80000Yar's Revenge
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000
14000
16000
18000Zaxxon
DQN
DQN-PixelCNN
Reactor
Reactor-PixelCNN
Figure 16. Training curves of DQN, DQN-PixelCNN, Reactor and Reactor-PixelCNN across all 57 Atari games.
Count-Based Exploration with Neural Density Models
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500Alien
0 20 40 60 80 100 120 1400
200
400
600
800
1000Amidar
0 20 40 60 80 100 120 140200
400
600
800
1000
1200
1400Assault
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500
4000
4500Asterix
0 20 40 60 80 100 120 140400
500
600
700
800
900
1000
1100
1200
1300Asteroids
0 20 40 60 80 100 120 1400
10000
20000
30000
40000
50000
60000
70000Atlantis
0 20 40 60 80 100 120 1400
100
200
300
400
500Bank Heist
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000Battle Zone
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Beam Rider
0 20 40 60 80 100 120 140100
150
200
250
300
350
400
450
500Berzerk
0 20 40 60 80 100 120 14015
20
25
30
35
40
45Bowling
0 20 40 60 80 100 120 14020
0
20
40
60
80Boxing
0 20 40 60 80 100 120 1400
50
100
150
200
250Breakout
0 20 40 60 80 100 120 140500
1000
1500
2000
2500
3000
3500Centipede
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000Chopper Command
0 20 40 60 80 100 120 1400
20000
40000
60000
80000
100000Crazy Climber
0 20 40 60 80 100 120 1402000
3000
4000
5000
6000
7000
8000
9000
10000Defender
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000Demon Attack
0 20 40 60 80 100 120 14022
20
18
16
14
12
10
8
6
4Double Dunk
0 20 40 60 80 100 120 1400
100
200
300
400
500Enduro
0 20 40 60 80 100 120 140100
90
80
70
60
50
40
30
20
10Fishing Derby
0 20 40 60 80 100 120 1400
5
10
15
20
25
30Freeway
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500Frostbite
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000Gopher
0 20 40 60 80 100 120 1400
100
200
300
400
500
600Gravitar
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000H.E.R.O.
0 20 40 60 80 100 120 14016
14
12
10
8
6
4Ice Hockey
0 20 40 60 80 100 120 1400
100
200
300
400
500
600Jamesbond
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000
9000Kangaroo
0 20 40 60 80 100 120 1402000
3000
4000
5000
6000
7000
8000Krull
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000
12000
14000
16000Kung Fu Master
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200
1400
1600
1800Montezuma's Revenge
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500MS. Pacman
0 20 40 60 80 100 120 1401000
2000
3000
4000
5000
6000
7000Name this Game
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Phoenix
0 20 40 60 80 100 120 140450
400
350
300
250
200
150
100
50
0Pitfall!
0 20 40 60 80 100 120 14025
20
15
10
5
0
5
10
15
20Pong
0 20 40 60 80 100 120 1402000
0
2000
4000
6000
8000
10000
12000
14000
16000Private Eye
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000
8000Q*Bert
0 20 40 60 80 100 120 1401000
2000
3000
4000
5000
6000
7000
8000
9000Riverraid
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000
30000
35000Road Runner
0 20 40 60 80 100 120 1400
10
20
30
40
50Robotank
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000Seaquest
0 20 40 60 80 100 120 14028000
26000
24000
22000
20000
18000
16000
14000Skiing
0 20 40 60 80 100 120 1401000
1500
2000
2500
3000
3500
4000Solaris
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200Space Invaders
0 20 40 60 80 100 120 1400
5000
10000
15000
20000
25000Star Gunner
0 20 40 60 80 100 120 14010.0
9.8
9.6
9.4
9.2
9.0
8.8
8.6Surround
0 20 40 60 80 100 120 14025
20
15
10
5
0Tennis
0 20 40 60 80 100 120 1401000
2000
3000
4000
5000
6000
7000Time Pilot
0 20 40 60 80 100 120 1400
50
100
150
200Tutankham
0 20 40 60 80 100 120 1400
1000
2000
3000
4000
5000
6000
7000Up'n'down
0 20 40 60 80 100 120 1400
200
400
600
800
1000
1200Venture
0 20 40 60 80 100 120 1400
50000
100000
150000
200000Video Pinball
0 20 40 60 80 100 120 1400
500
1000
1500
2000
2500
3000
3500Wizard of Wor
0 20 40 60 80 100 120 1402000
4000
6000
8000
10000
12000
14000
16000Yar's Revenge
0 20 40 60 80 100 120 1400
2000
4000
6000
8000
10000Zaxxon
DQN (w/o MC)
DQN (with MC)
DQN-PixelCNN (w/o MC)
DQN-PixelCNN (with MC)
Figure 17. Training curves of DQN and DQN-PixelCNN, each with and without MMC, across all 57 Atari games.
top related