


Reinforcement learning for PACMAN

Jing An, Jordi Feliu and Abeynaya Gnanasekaran

{jingan, jfeliu, abeynaya}@stanford.edu

Abstract

We apply reinforcement learning to the classical game PACMAN; we study and compare Q-learning, Approximate Q-learning and Deep Q-learning based on the total rewards and win-rate. While Q-learning proved to be quite effective on smallGrid, it was too expensive for larger grids. In Approximate Q-learning, we handcraft ‘intelligent’ features which are fed into the game. To ensure that we don’t miss important features, we implement a Convolutional Neural Network (CNN) that can implicitly ‘extract’ the features.

Methods

We can try to minimize the following loss function:

$$L(s, a) = \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^2$$
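As a point of reference, a minimal tabular Q-learning update implementing this objective might look like the sketch below; this is generic illustrative Python, not the project's implementation, and the hyperparameter values are placeholders.

```python
from collections import defaultdict

# Tabular Q-learning update corresponding to the loss above.
# q_values maps (state, action) pairs to estimated returns.
q_values = defaultdict(float)
alpha, gamma = 0.2, 0.8  # illustrative learning rate and discount factor

def td_update(state, action, reward, next_state, legal_next_actions):
    """One step toward minimizing (r + gamma * max_a' Q(s', a') - Q(s, a))^2."""
    best_next = max((q_values[(next_state, a)] for a in legal_next_actions),
                    default=0.0)
    td_error = reward + gamma * best_next - q_values[(state, action)]
    q_values[(state, action)] += alpha * td_error

def greedy_action(state, legal_actions):
    """Pick the action with the highest current Q-value."""
    return max(legal_actions, key=lambda a: q_values[(state, a)])
```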

We start by implementing Q-learning for Pacman based on the Berkeley Pacman project [1].

1 Approximate Q-Learning

The Q function is approximated by a linear combination of features $f_i(s, a)$ extracted from the game state:

$$Q(s, a) = \sum_{i=1}^{n} f_i(s, a)\, w_i$$

where the weights are updated by

$$w_i \leftarrow w_i + \alpha \cdot \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \cdot f_i(s, a)$$
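To make the weight update concrete, here is a rough sketch over a feature dictionary; the feature names, values and hyperparameters are hypothetical placeholders, not the feature set evaluated below.

```python
# Minimal sketch of the approximate Q-learning update over a feature dictionary.
alpha, gamma = 0.2, 0.8  # placeholder hyperparameters

def q_value(weights, features):
    """Q(s, a) = sum_i f_i(s, a) * w_i."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update_weights(weights, features, reward, max_next_q):
    """w_i <- w_i + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) * f_i(s, a)."""
    correction = reward + gamma * max_next_q - q_value(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * correction * value

# Example: a single update with placeholder features.
weights = {}
features = {"bias": 1.0, "#-active-ghosts-1-step-away": 0.0, "eats-food": 1.0}
update_weights(weights, features, reward=10.0, max_next_q=0.0)
```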

2 Deep Q-Learning

The Q-values for different state and action pairs are computed with a Convolutional Neural Network (CNN) that can implicitly "extract" the features using the pixel data from a downsampled image of a state s. The neural network architecture comprises three convolutional layers followed by a fully connected hidden layer.
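The poster does not specify filter sizes or layer widths, so the following is only a minimal tf.keras sketch of a network with that overall shape; the input resolution, channel count and layer sizes are assumptions, not the project's exact architecture.

```python
import tensorflow as tf

NUM_ACTIONS = 5  # North, South, East, West, Stop

def build_q_network(input_shape=(20, 11, 6)):
    """Sketch of a Q-network with the shape described above: three convolutional
    layers, a flattening layer, a fully connected hidden layer, and a linear
    output giving one Q-value per action. All sizes here are illustrative."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS),  # linear Q-values
    ])
```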

Figure 1: Pacman with smallGrid and mediumGrid layout

Feature Selection in Approximate Q-Learning

We perform an ablative analysis to select the most important features in terms of total reward and win-rate, and conclude that the best features are the number of scared and active ghosts one step away, and eating food if there are no active ghosts nearby. The number of active ghosts one step away and the eat-food feature are essential for winning a game, and hence we do not include them in the analysis.

Table: Ablation Analysis

Component removed                  Average Score   Win Rate
Overall System                     1608            92%
inv. min. distance active ghost    1583            91.6%
min. distance active ghost         1594            91.8%
min. distance capsule              1591            90.2%
no. scared ghosts 2 steps away     1545            92.2%
no. scared ghosts 1 step away      1438            91.2%
distance to closest food           1397            90%
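For illustration only, a hypothetical feature extractor covering the features in the table might look as follows; the function signature, feature names and normalizations are placeholders rather than the project's actual code or the Berkeley state API.

```python
# Hypothetical feature extractor mirroring the features named above.
# Inputs are precomputed quantities; exact definitions are assumptions.
def extract_features(dist_closest_food, dist_closest_capsule, dist_active_ghost,
                     n_active_ghosts_1_step, n_scared_ghosts_1_step,
                     n_scared_ghosts_2_steps, eats_food):
    return {
        "bias": 1.0,
        "#-active-ghosts-1-step-away": float(n_active_ghosts_1_step),
        "#-scared-ghosts-1-step-away": float(n_scared_ghosts_1_step),
        "#-scared-ghosts-2-steps-away": float(n_scared_ghosts_2_steps),
        # Eat food only when no active ghost is adjacent.
        "eats-food": 1.0 if eats_food and n_active_ghosts_1_step == 0 else 0.0,
        "inv-min-distance-active-ghost": 1.0 / dist_active_ghost if dist_active_ghost else 0.0,
        "min-distance-capsule": float(dist_closest_capsule),
        "distance-to-closest-food": float(dist_closest_food),
    }
```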

Deep Q-learning with Experience Replay

We use TensorFlow to implement a Convolutional Neural Network (Q-network) to generate the Q-values corresponding to the five possible directions at each state: North, South, East, West, Stop.

1 Create equivalent images of frames, with each pixel representing objects in the Pacman grid (a sketch of one possible encoding follows Figure 2).

2 Use replay memory to store the last N experiences (s, a, r, s′, a′).

3 Sample a minibatch randomly from the replay memory.

4 Use an additional target network (Q̂-network) to generate the target values for the Q-network. The Q-network is trained by minimizing the loss (see the sketch after this list)

$$L = \frac{1}{T} \sum_{i=1}^{T} \left( r_i + \gamma \max_{a'_i} \hat{Q}(s'_i, a'_i) - Q(s_i, a_i) \right)^2$$

and once every C iterations the target network is updated with the learned parameters from the Q-network.

5 Use an ε-greedy strategy for exploration; we gradually reduce ε from 1 to 0.1 during training.
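The sketch below illustrates one way to implement the replay buffer, the target-network update and ε-greedy action selection for the loss above; the buffer size, batch size, optimizer and done-masking detail are placeholder assumptions, and `q_net` / `target_net` can be two copies of the hypothetical `build_q_network` sketched earlier.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

# Placeholder hyperparameters; the poster does not report its exact settings.
GAMMA = 0.95
BATCH_SIZE = 32
REPLAY_CAPACITY = 10_000    # "last N experiences"
TARGET_UPDATE_EVERY = 500   # "once every C iterations"

replay_memory = deque(maxlen=REPLAY_CAPACITY)   # (s, a, r, s', done) tuples
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(q_net, target_net):
    """One DQN update: sample a random minibatch and regress Q(s, a) toward
    r + gamma * max_a' Q_hat(s', a'), with Q_hat computed by the frozen target
    network. Terminal transitions are masked with `done` (a standard detail
    not spelled out on the poster)."""
    batch = random.sample(replay_memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = np.asarray(states, dtype=np.float32)
    next_states = np.asarray(next_states, dtype=np.float32)
    actions = np.asarray(actions, dtype=np.int32)
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)

    targets = rewards + GAMMA * (1.0 - dones) * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = q_net(states)
        q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, q_all.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_sa))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

def sync_target(q_net, target_net):
    """Every C iterations, copy the learned parameters into the target network."""
    target_net.set_weights(q_net.get_weights())

def epsilon_greedy(q_net, state, epsilon, num_actions=5):
    """ε-greedy exploration; ε is annealed from 1.0 to 0.1 during training."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    q = q_net(np.asarray([state], dtype=np.float32))
    return int(tf.argmax(q[0]).numpy())
```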

DQN Pipeline

Figure 2: DQN pipeline with an equivalent image, three convolutional layers, a flattening layer and a fully connected hidden layer
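Step 1 above and Figure 2 refer to an "equivalent image" of each game frame. A hypothetical encoding (the actual channel layout and object codes are not given on the poster) might assign one channel per object type:

```python
import numpy as np

# Hypothetical "equivalent image" encoding: one channel per object type.
OBJECT_CHANNELS = {"wall": 0, "food": 1, "capsule": 2, "pacman": 3,
                   "active_ghost": 4, "scared_ghost": 5}

def encode_state(width, height, objects):
    """`objects` is a list of (x, y, kind) tuples describing the current frame;
    returns a (height, width, channels) float32 image with one-hot object marks."""
    image = np.zeros((height, width, len(OBJECT_CHANNELS)), dtype=np.float32)
    for x, y, kind in objects:
        image[y, x, OBJECT_CHANNELS[kind]] = 1.0
    return image

# Example: a 3x3 frame with Pacman at (0, 0) and one food pellet at (2, 1).
frame = encode_state(3, 3, [(0, 0, "pacman"), (2, 1, "food")])
```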

References

[1] Berkeley Pacman Project. http://ai.berkeley.edu/project_overview.html

[2] Volodymyr Mnih et al. Playing Atari with deep reinforcement learning.

[3] Volodymyr Mnih et al. Human-level control through deep reinforcement learning.

Conclusion

• Use of intelligent features for Approximate Q-learning works well for small and medium grids

• The developed DQN architecture gives reasonably good performance for the small and medium grids on which it was tested. The computational cost was also reasonable without the use of GPUs.

Results of DQN

Figure 3: Average win-rate over 100 test games on smallGrid and mediumGrid versus the number of training games (0 to 3000)

Figure 4: Average reward over 100 test games on smallGrid and mediumGrid versus the number of training games (0 to 3000)

Future Work

• Increasing training for better performance in medium grids

• Adding prioritized replay by sampling from a distribution according to rewards, or choosing the samples from which the agent can learn the most

• Adding auxiliary rewards

• Adding asynchronous gradient descent for optimization of the deep neural network