Monte Carlo Tree Search for the Super Mario Bros
Chih-Sheng Lin Advisor: Dr. rer. nat. Chuan-Kang Ting
Department of Computer Science and Information Engineering, National Chung Cheng University
1
Outline
• Introduction
• Monte Carlo Tree Search (MCTS)
  o Basic Algorithm
  o Upper Confidence Bounds for Trees (UCT)
• Controller Design using MCTS
  o Problem Formulation
  o MCTS-based Controller
  o Improvements on UCT
• Experimental Results
• Conclusions
2
Introduction
The Mario AI Benchmark
• Designing a Mario-playing controller
[Diagram: the agent loop — the Controller perceives the Game Environment through Sensors (percepts) and acts on it through Actuators (actions); each decision must be made within 42 ms]
3
Introduction
The Mario AI Benchmark
• Scoring method in the Mario AI competition
o Maximizing the multi-objective weighted sum
  distance          1    hiddenBlocks   24    marioStatus  1024
  flowers          64    killsByStomp   12    timeLeft        8
  mushrooms        58    killsByFire     4    marioMode      32
  greenMushrooms   58    killsByShell   17
  coins            16    killsTotal     42
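The weighted sum above can be written directly as a small scoring function; a minimal sketch in Python (the stats dictionary and function name are illustrative, not the benchmark's actual API):

```python
# Weights from the Mario AI competition scoring table above.
WEIGHTS = {
    "distance": 1, "flowers": 64, "mushrooms": 58, "greenMushrooms": 58,
    "coins": 16, "hiddenBlocks": 24, "killsByStomp": 12, "killsByFire": 4,
    "killsByShell": 17, "killsTotal": 42, "marioStatus": 1024,
    "timeLeft": 8, "marioMode": 32,
}

def competition_score(stats):
    """Multi-objective weighted sum; missing statistics count as zero."""
    return sum(weight * stats.get(key, 0) for key, weight in WEIGHTS.items())
```

For example, finishing a level (marioStatus = 1) after 300 distance units with 5 coins scores 300 + 80 + 1024 = 1404.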
4
Introduction
Related Work: A∗-based Controller
• Robin Baumgarten proposed an A∗-based controller in 2009 [*].
o States : the full game information observed by Mario
o Successor function : Mario's possible actions
o Heuristic function : estimated time to reach the right edge of the level
[*] J. Togelius, S. Karakovskiy, and R. Baumgarten. The 2009 Mario AI competition. In Proceedings of the 2010 IEEE Congress on Evolutionary Computation, 2010.
[Diagram: A∗ search tree — the current node is expanded by action combinations such as (left, jump, speed), (right, speed), (jump), (right, jump, speed), (left, speed)]
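The formulation above can be sketched as a generic best-first search; a minimal, hypothetical sketch (the real controller searches the simulated game physics, abstracted here into a successor function and a time-to-right-edge heuristic):

```python
import heapq
from itertools import count

def a_star(start, successors, heuristic, is_goal):
    """Generic A*: successors(s) yields (action, next_state, cost);
    heuristic(s) estimates the remaining cost to the goal."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(start), next(tie), 0, start, [])]
    seen = set()
    while frontier:
        _, _, g, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan          # sequence of actions reaching the goal
        if state in seen:
            continue
        seen.add(state)
        for action, nxt, cost in successors(state):
            if nxt not in seen:
                heapq.heappush(frontier,
                               (g + cost + heuristic(nxt), next(tie),
                                g + cost, nxt, plan + [action]))
    return None
```

On a toy one-dimensional level where Mario only moves left or right at unit cost, the returned plan is the shortest action sequence to the right edge.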
5
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Motivation
• Problems of A∗ search in real-time games
o Lack of effective heuristic functions
o Worst-case complexity is exponential.
• Characteristics of MCTS
o Exploits the generality of random sampling
o Grows the tree asymmetrically
6
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Selection

function MCTS(s0)
    create root node i0 with state s0
    while within computational budget do
        iE ← Selection(i0)
        iL ← Expansion(iE)
        Rk ← Simulation(iL)
        Backpropagation(iL, Rk)
    return BestChild(i0, 0).action

[Diagram: Selection descends the tree from the root i0 to a node iE]
7
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Expansion — Expansion(iE) attaches a new child node iL to the selected node.
[Diagram: the new leaf iL added under iE]
8
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Simulation — Simulation(iL) plays a random sampling (a simulated game) from iL and returns its score Rk.
[Diagram: a random playout from the leaf iL yields the result Rk]
9
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Backpropagation — Backpropagation(iL, Rk) propagates the score Rk from iL up to the root, updating every node on the path.
[Diagram: Rk travels up the tree from the leaf to the root]
10
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Final action selection — BestChild(i0, 0).action (exploration constant 0) returns the action leading to the root's child with the best utility.
[Diagram: root state s0 with actions a1, a2 leading to children with utilities v1, v2]
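The four steps can be put together in one loop; a self-contained sketch assuming an abstract game given by actions(s), step(s, a), and score(s) — all hypothetical hooks, not the benchmark's API:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = None       # actions not yet expanded
        self.v, self.n = 0.0, 0   # approximate utility, visit count

def mcts(s0, actions, step, score, budget=400, depth=10, c_p=1.0):
    root = Node(s0)
    root.untried = list(actions(s0))
    for _ in range(budget):
        # Selection: descend by UCB1 while the node is fully expanded
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=lambda c:
                       c.v + c_p * math.sqrt(math.log(node.n) / c.n))
        # Expansion: attach one untried child
        if node.untried:
            a = node.untried.pop()
            child = Node(step(node.state, a), node, a)
            child.untried = list(actions(child.state))
            node.children.append(child)
            node = child
        # Simulation: random playout of fixed depth
        s = node.state
        for _ in range(depth):
            s = step(s, random.choice(actions(s)))
        result = score(s)
        # Backpropagation: update utilities on the path to the root
        while node is not None:
            node.n += 1
            node.v += (result - node.v) / node.n  # running average
            node = node.parent
    # Final action selection: most-visited child of the root
    return max(root.children, key=lambda c: c.n).action
```

On a toy game where the state is a position, the actions are ±1, and the score is the final position, the search reliably picks the action that moves right.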
11
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Selection
[Diagram: the example tree, rooted at i0, with branching factor 2; playout results score win = 1, draw = 0.5, lose = 0]
12
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Expansion, Simulation
[Diagram: a new leaf is expanded and its random playout returns 1 (a win)]
13
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Backpropagation
[Diagram: the result 1 is propagated from the leaf up to the root; each node on the path now has utility 1]
14
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Selection (2nd loop)
[Diagram: the search restarts from the root i0; the previously visited path holds utility 1]
15
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Expansion, Simulation (2nd loop)
[Diagram: a second leaf is expanded; its random playout returns 0 (a loss)]
16
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Backpropagation (2nd loop)
[Diagram: the loss is propagated; node utilities now include 1 and 0 at the leaves and 0.5 at the root]
17
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• 3rd loop
[Diagram: after the third iteration the node utilities include 0.75 at the root and 1, 1, 0 below]
18
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• 4th loop
[Diagram: after the fourth iteration the node utilities include 0.63 at the root and 1, 1, 0.5, 0.5, 0.25 below]
19
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• nth loop
[Diagram: the final tree rooted at state s0; actions a1 and a2 lead to children with utilities 0.56 and 0.32, with deeper utilities such as 0.49, 0.75, 0.87; BestChild(i0, 0).action is returned]
20
Monte Carlo Tree Search (MCTS)
Upper Confidence Bounds for Trees (UCT)
• UCT = MCTS + UCB [*]
• Selection picks a child node c such that

    c ∈ argmax_{i ∈ I} ( v_i + C_p √( ln(n_p) / n_i ) )

o p : c's parent node
o I : the set of p's children
o v_i : i's approximate utility
o n_i : i's visit count
o n_p : p's visit count
o C_p : a tunable constant
[*] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, 2006.
[Diagram: the 4th-loop example — v_i is the exploitation term and the C_p bonus is the exploration term; UCB trades off the child with utility 0.75 against less-visited children]
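The selection rule above can be written directly; a minimal sketch in which the variable names mirror the slide's symbols:

```python
import math

def ucb1(v_i, n_i, n_p, c_p):
    """UCT child value: exploitation term v_i plus exploration bonus."""
    if n_i == 0:
        return math.inf  # unvisited children are always tried first
    return v_i + c_p * math.sqrt(math.log(n_p) / n_i)

def select_child(children, n_p, c_p=1.0):
    """children: list of (v_i, n_i) pairs; returns the index maximising UCB1."""
    return max(range(len(children)),
               key=lambda i: ucb1(children[i][0], children[i][1], n_p, c_p))
```

With C_p = 1, a child with utility 0.75 and 3 visits beats a once-visited child with utility 0, while an unvisited child always wins the argmax.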
21
Controller Design using MCTS
Problem Formulation
• States
o The full game information in a 15 × 19 grid
  • Mario's position, speed, and so on
  • Enemies' positions, speeds, and so on
  • Objects' positions, …
• Successor function
o Mario's possible actions
• A search node i contains
o Game state s_i
o Approximate utility v_i (average score, winning rate, …)
o Visit count n_i (the number of updates)
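The node contents above can be sketched as a small data structure; a minimal illustration whose field names follow the slide's symbols (not the benchmark's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class SearchNode:
    state: object   # game state s_i: the 15 x 19 grid snapshot,
                    # Mario/enemy/object positions and speeds
    v: float = 0.0  # approximate utility v_i (e.g. average score)
    n: int = 0      # visit count n_i (the number of updates)
    children: list = field(default_factory=list)
```

A fresh node starts with zero utility and zero visits; both are filled in only through backpropagation.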
22
Controller Design using MCTS
MCTS-based Controller
[Diagram: the Controller, connected to the Game Environment through Sensors and Actuators, runs MCTS(s0) within the 42 ms budget]
23
Controller Design using MCTS
Improvements on UCT
• A modified calculation of a simulated game's result
o Modifications of the multi-objective weighted sum

  distance          0.1  hiddenBlocks   24    marioStatus  1024
  flowers          64    killsByStomp   12    timeLeft        2
  mushrooms        58    killsByFire     4    marioMode      32
  greenMushrooms    1    killsByShell   17
  coins            16    killsTotal     42
  hurts           −42    stomps          1    carries         1
25
Controller Design using MCTS
Improvements on UCT : Review UCT
• Simulation step
o A random sampling: performing random actions until the simulated game is terminated
[Diagram: a random playout from the leaf node L returns the result Rk]
26
Controller Design using MCTS
Improvements on UCT : UCT-best
• Best-of-N simulation strategy [*] (example: N = 3)
o N candidates selected from the possible actions
o Evaluating the candidates
o Performing the best one
[*] T. Kozelek. Methods of MCTS and the game Arimaa. Master's thesis, Charles University, 2009.
[Diagram: at each step of the playout from the leaf, three candidate actions are evaluated and the best one is performed]
27
Controller Design using MCTS
Improvements on UCT : UCT-best
• Drawback of the best-of-N simulation strategy
o It sacrifices the generality that random sampling provides.
[Diagram: the best-of-N playout from the leaf returns the result Rk]
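One step of the best-of-N playout can be sketched as follows; actions, step, and evaluate are hypothetical hooks into the simulated game, not the benchmark's API:

```python
import random

def best_of_n_step(state, actions, step, evaluate, n=3):
    """One playout step: sample up to N candidate actions, evaluate the
    state each one leads to, and perform the best candidate."""
    options = actions(state)
    candidates = random.sample(options, min(n, len(options)))
    return max((step(state, a) for a in candidates), key=evaluate)
```

Repeating this step until the simulated game terminates yields the greedy playout whose final score becomes Rk.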
31
Controller Design using MCTS
Improvements on UCT : UCT-multi
• Multi-simulation strategy (example: N = 3)
o Performing N random samplings
o Selecting the best result to propagate
• Advantage
o Improves accuracy while keeping the generality of random sampling
[Diagram: three random playouts from the leaf return results Rk1, Rk2, Rk3; the best one is backpropagated]
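The multi-simulation strategy itself is a one-liner over any playout routine; a minimal sketch in which rollout is a hypothetical function that runs one random playout and returns its result:

```python
def multi_simulation(state, rollout, n=3):
    """Run N independent random playouts from the same leaf and keep
    only the best result Rk to backpropagate."""
    return max(rollout(state) for _ in range(n))
```

For instance, if three playouts happen to score 0.2, 0.9, and 0.5, the value 0.9 is what gets backpropagated.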
32
Experimental Results
Performance Measurement
• Small problem : Parameter tuning
o 15 different levels
o Computational budget : 40 ms
o The results behave similarly to those of the big problem.
• Big problem : Performance comparison
o 512 different levels (used in the Mario AI competition)
o Computational budget : 5 ms to 40 ms
33
Experimental Results
Small Problem: Parameter Tuning
• The UCT-multi controller
o UCT uses the multi-simulation strategy.
34
Experimental Results
Small Problem: Parameter Tuning
• The UCT-best controller
o UCT uses the best-of-N simulation strategy.
35
Experimental Results
Big Problem: Performance Comparison
• A∗-based controllers :
o plainAstar : Baumgarten's A∗-based controller without improvements
o refinedAstar : Baumgarten's A∗-based controller
• MCTS-based controllers :
o UCT : UCT without modifying its simulation strategy
o UCT-best : UCT using the best-of-N simulation strategy
o UCT-multi : UCT using the multi-simulation strategy
o UCT+carr : UCT with the additional objective carries
o UCT-multi+carr : UCT-multi with the additional objective carries
• Parameter settings :
o UCT-best : increasing the number of candidate actions
o UCT-multi : increasing the number of random samplings
36
Experimental Results
Big Problem: Performance Comparison
37
Experimental Results
Big Problem: Performance Comparison
38
Conclusions
• Challenges of the Mario AI benchmark
o Large state space
o Lack of effective heuristic functions
• Contributions of this study
o Showing the applicability of MCTS: the results outperform the A∗-based controller
o Improving the MCTS-based controller by
  • improving the accuracy of random sampling (the multi-simulation strategy)
  • a better calculation of a simulated game's result (the additional objective carries)
39
Thank You for Your Attention
40