Monte Carlo Tree Search for the Super Mario Bros
Chih-Sheng Lin Advisor: Dr. rer. nat. Chuan-Kang Ting
Department of Computer Science and Information Engineering, National Chung Cheng University
1
Outline
• Introduction
• Monte Carlo Tree Search (MCTS)
  o Basic Algorithm
  o Upper Confidence Bounds for Trees (UCT)
• Controller Design using MCTS
  o Problem Formulation
  o MCTS-based Controller
  o Improvements on UCT
• Experimental Results
• Conclusions
2
Introduction
The Mario AI Benchmark
• Designing a Mario-playing controller
[Diagram: the agent loop — the Controller perceives the Game Environment through Sensors (percepts) and acts on it through Actuators (actions); each decision must be made within 42 ms]
3
Introduction
The Mario AI Benchmark
• Scoring method in the Mario AI competition
o Maximizing the multi-objective weighted sum
  distance          1    hiddenBlocks   24    marioStatus  1024
  flowers          64    killsByStomp   12    timeLeft        8
  mushrooms        58    killsByFire     4    marioMode      32
  greenMushrooms   58    killsByShell   17
  coins            16    killsTotal     42
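The weighted sum above can be written directly as a small scoring function; a minimal sketch in Python (the stats dictionary and function name are illustrative, not the benchmark's actual API):

```python
# Weights from the Mario AI competition scoring table above.
WEIGHTS = {
    "distance": 1, "flowers": 64, "mushrooms": 58, "greenMushrooms": 58,
    "coins": 16, "hiddenBlocks": 24, "killsByStomp": 12, "killsByFire": 4,
    "killsByShell": 17, "killsTotal": 42, "marioStatus": 1024,
    "timeLeft": 8, "marioMode": 32,
}

def competition_score(stats):
    """Multi-objective weighted sum; missing statistics count as zero."""
    return sum(weight * stats.get(key, 0) for key, weight in WEIGHTS.items())
```

For example, finishing a level (marioStatus = 1) after 300 distance units with 5 coins scores 300 + 80 + 1024 = 1404.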
4
Introduction
Related Work: A∗-based Controller
• Robin Baumgarten proposed an A∗-based controller in 2009 [*].
o States : the full game information observed by Mario
o Successor function : Mario's possible actions
o Heuristic function : estimated time to reach the right edge of the level
[*] J. Togelius, S. Karakovskiy, and R. Baumgarten. The 2009 Mario AI competition. In Proceedings of the 2010 IEEE Congress on Evolutionary Computation, 2010.
[Diagram: A∗ search tree — the current node is expanded by action combinations such as (left, jump, speed), (right, speed), (jump), (right, jump, speed), (left, speed)]
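The formulation above can be sketched as a generic best-first search; a minimal, hypothetical sketch (the real controller searches the simulated game physics, abstracted here into a successor function and a time-to-right-edge heuristic):

```python
import heapq
from itertools import count

def a_star(start, successors, heuristic, is_goal):
    """Generic A*: successors(s) yields (action, next_state, cost);
    heuristic(s) estimates the remaining cost to the goal."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(start), next(tie), 0, start, [])]
    seen = set()
    while frontier:
        _, _, g, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan          # sequence of actions reaching the goal
        if state in seen:
            continue
        seen.add(state)
        for action, nxt, cost in successors(state):
            if nxt not in seen:
                heapq.heappush(frontier,
                               (g + cost + heuristic(nxt), next(tie),
                                g + cost, nxt, plan + [action]))
    return None
```

On a toy one-dimensional level where Mario only moves left or right at unit cost, the returned plan is the shortest action sequence to the right edge.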
5
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Motivation
• Problems of A∗ search in real-time games
o Lack of effective heuristic functions
o Worst-case complexity is exponential.
• Characteristics of MCTS
o Exploits the generality of random sampling
o Grows the tree asymmetrically
6
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Selection

function MCTS(s0)
    create root node i0 with state s0
    while within computational budget do
        iE ← Selection(i0)
        iL ← Expansion(iE)
        Rk ← Simulation(iL)
        Backpropagation(iL, Rk)
    return BestChild(i0, 0).action

[Diagram: Selection descends the tree from the root i0 to a node iE]
7
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Expansion — Expansion(iE) attaches a new child node iL to the selected node.
[Diagram: the new leaf iL added under iE]
8
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Simulation — Simulation(iL) plays a random sampling (a simulated game) from iL and returns its score Rk.
[Diagram: a random playout from the leaf iL yields the result Rk]
9
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Backpropagation — Backpropagation(iL, Rk) propagates the score Rk from iL up to the root, updating every node on the path.
[Diagram: Rk travels up the tree from the leaf to the root]
10
Monte Carlo Tree Search (MCTS)
Basic Algorithm
• Final action selection — BestChild(i0, 0).action (exploration constant 0) returns the action leading to the root's child with the best utility.
[Diagram: root state s0 with actions a1, a2 leading to children with utilities v1, v2]
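The four steps can be put together in one loop; a self-contained sketch assuming an abstract game given by actions(s), step(s, a), and score(s) — all hypothetical hooks, not the benchmark's API:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = None       # actions not yet expanded
        self.v, self.n = 0.0, 0   # approximate utility, visit count

def mcts(s0, actions, step, score, budget=400, depth=10, c_p=1.0):
    root = Node(s0)
    root.untried = list(actions(s0))
    for _ in range(budget):
        # Selection: descend by UCB1 while the node is fully expanded
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=lambda c:
                       c.v + c_p * math.sqrt(math.log(node.n) / c.n))
        # Expansion: attach one untried child
        if node.untried:
            a = node.untried.pop()
            child = Node(step(node.state, a), node, a)
            child.untried = list(actions(child.state))
            node.children.append(child)
            node = child
        # Simulation: random playout of fixed depth
        s = node.state
        for _ in range(depth):
            s = step(s, random.choice(actions(s)))
        result = score(s)
        # Backpropagation: update utilities on the path to the root
        while node is not None:
            node.n += 1
            node.v += (result - node.v) / node.n  # running average
            node = node.parent
    # Final action selection: most-visited child of the root
    return max(root.children, key=lambda c: c.n).action
```

On a toy game where the state is a position, the actions are ±1, and the score is the final position, the search reliably picks the action that moves right.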
11
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Selection
[Diagram: the example tree, rooted at i0, with branching factor 2; playout results score win = 1, draw = 0.5, lose = 0]
12
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Expansion, Simulation
[Diagram: a new leaf is expanded and its random playout returns 1 (a win)]
13
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Backpropagation
[Diagram: the result 1 is propagated from the leaf up to the root; each node on the path now has utility 1]
14
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Selection (2nd loop)
[Diagram: the search restarts from the root i0; the previously visited path holds utility 1]
15
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Expansion, Simulation (2nd loop)
[Diagram: a second leaf is expanded; its random playout returns 0 (a loss)]
16
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• Backpropagation (2nd loop)
[Diagram: the loss is propagated; node utilities now include 1 and 0 at the leaves and 0.5 at the root]
17
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• 3rd loop
[Diagram: after the third iteration the node utilities include 0.75 at the root and 1, 1, 0 below]
18
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• 4th loop
[Diagram: after the fourth iteration the node utilities include 0.63 at the root and 1, 1, 0.5, 0.5, 0.25 below]
19
Monte Carlo Tree Search (MCTS)
Basic Algorithm : Example
• nth loop
[Diagram: the final tree rooted at state s0; actions a1 and a2 lead to children with utilities 0.56 and 0.32, with deeper utilities such as 0.49, 0.75, 0.87; BestChild(i0, 0).action is returned]
20
Monte Carlo Tree Search (MCTS)
Upper Confidence Bounds for Trees (UCT)
• UCT = MCTS + UCB [*]
• Selection picks a child node c such that

    c ∈ argmax_{i ∈ I} ( v_i + C_p √( ln(n_p) / n_i ) )

o p : c's parent node
o I : the set of p's children
o v_i : i's approximate utility
o n_i : i's visit count
o n_p : p's visit count
o C_p : a tunable constant
[*] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, 2006.
[Diagram: the 4th-loop example — v_i is the exploitation term and the C_p bonus is the exploration term; UCB trades off the child with utility 0.75 against less-visited children]
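The selection rule above can be written directly; a minimal sketch in which the variable names mirror the slide's symbols:

```python
import math

def ucb1(v_i, n_i, n_p, c_p):
    """UCT child value: exploitation term v_i plus exploration bonus."""
    if n_i == 0:
        return math.inf  # unvisited children are always tried first
    return v_i + c_p * math.sqrt(math.log(n_p) / n_i)

def select_child(children, n_p, c_p=1.0):
    """children: list of (v_i, n_i) pairs; returns the index maximising UCB1."""
    return max(range(len(children)),
               key=lambda i: ucb1(children[i][0], children[i][1], n_p, c_p))
```

With C_p = 1, a child with utility 0.75 and 3 visits beats a once-visited child with utility 0, while an unvisited child always wins the argmax.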
21
Controller Design using MCTS
Problem Formulation
• States
o The full game information in a 15 × 19 grid
  • Mario's position, speed, and so on
  • Enemies' positions, speeds, and so on
  • Objects' positions, …
• Successor function
o Mario's possible actions
• A search node i contains
o Game state s_i
o Approximate utility v_i (average score, winning rate, …)
o Visit count n_i (the number of updates)
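The node contents above can be sketched as a small data structure; a minimal illustration whose field names follow the slide's symbols (not the benchmark's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class SearchNode:
    state: object   # game state s_i: the 15 x 19 grid snapshot,
                    # Mario/enemy/object positions and speeds
    v: float = 0.0  # approximate utility v_i (e.g. average score)
    n: int = 0      # visit count n_i (the number of updates)
    children: list = field(default_factory=list)
```

A fresh node starts with zero utility and zero visits; both are filled in only through backpropagation.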
22
Controller Design using MCTS
MCTS-based Controller
[Diagram: the Controller, connected to the Game Environment through Sensors and Actuators, runs MCTS(s0) within the 42 ms budget]
23
Controller Design using MCTS
Improvements on UCT
• A modified calculation of a simulated game's result
o Modifications of the multi-objective weighted sum

  distance          0.1  hiddenBlocks   24    marioStatus  1024
  flowers          64    killsByStomp   12    timeLeft        2
  mushrooms        58    killsByFire     4    marioMode      32
  greenMushrooms    1    killsByShell   17
  coins            16    killsTotal     42
  hurts           −42    stomps          1    carries         1
25
Controller Design using MCTS
Improvements on UCT : Review UCT
• Simulation step
o A random sampling: performing random actions until the simulated game is terminated
[Diagram: a random playout from the leaf node L returns the result Rk]
26
Controller Design using MCTS
Improvements on UCT : UCT-best
• Best-of-N simulation strategy [*] (example: N = 3)
o N candidates selected from the possible actions
o Evaluating the candidates
o Performing the best one
[*] T. Kozelek. Methods of MCTS and the game Arimaa. Master's thesis, Charles University, 2009.
[Diagram: at each step of the playout from the leaf, three candidate actions are evaluated and the best one is performed]
27
Controller Design using MCTS
Improvements on UCT : UCT-best
• Drawback of the best-of-N simulation strategy
o It sacrifices the generality that random sampling provides.
[Diagram: the best-of-N playout from the leaf returns the result Rk]
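One step of the best-of-N playout can be sketched as follows; actions, step, and evaluate are hypothetical hooks into the simulated game, not the benchmark's API:

```python
import random

def best_of_n_step(state, actions, step, evaluate, n=3):
    """One playout step: sample up to N candidate actions, evaluate the
    state each one leads to, and perform the best candidate."""
    options = actions(state)
    candidates = random.sample(options, min(n, len(options)))
    return max((step(state, a) for a in candidates), key=evaluate)
```

Repeating this step until the simulated game terminates yields the greedy playout whose final score becomes Rk.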
31
Controller Design using MCTS
Improvements on UCT : UCT-multi
• Multi-simulation strategy (example: N = 3)
o Performing N random samplings
o Selecting the best result to propagate
• Advantage
o Improves accuracy while keeping the generality of random sampling
[Diagram: three random playouts from the leaf return results Rk1, Rk2, Rk3; the best one is backpropagated]
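The multi-simulation strategy itself is a one-liner over any playout routine; a minimal sketch in which rollout is a hypothetical function that runs one random playout and returns its result:

```python
def multi_simulation(state, rollout, n=3):
    """Run N independent random playouts from the same leaf and keep
    only the best result Rk to backpropagate."""
    return max(rollout(state) for _ in range(n))
```

For instance, if three playouts happen to score 0.2, 0.9, and 0.5, the value 0.9 is what gets backpropagated.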
32
Experimental Results
Performance Measurement
• Small problem : Parameter tuning
o 15 different levels
o Computational budget : 40 ms
o The results behave similarly to those of the big problem.
• Big problem : Performance comparison
o 512 different levels (used in the Mario AI competition)
o Computational budget : 5 ms to 40 ms
33
Experimental Results
Small Problem: Parameter Tuning
• The UCT-multi controller
o UCT uses the multi-simulation strategy.
34
Experimental Results
Small Problem: Parameter Tuning
• The UCT-best controller
o UCT uses the best-of-N simulation strategy.
35
Experimental Results
Big Problem: Performance Comparison
• A∗-based controllers :
o plainAstar : Baumgarten's A∗-based controller without improvements
o refinedAstar : Baumgarten's A∗-based controller
• MCTS-based controllers :
o UCT : UCT without modifying its simulation strategy
o UCT-best : UCT using the best-of-N simulation strategy
o UCT-multi : UCT using the multi-simulation strategy
o UCT+carr : UCT with the additional objective carries
o UCT-multi+carr : UCT-multi with the additional objective carries
• Parameter settings :
o UCT-best : increasing the number of candidate actions
o UCT-multi : increasing the number of random samplings
36
Experimental Results
Big Problem: Performance Comparison
37
Experimental Results
Big Problem: Performance Comparison
38
Conclusions
• Challenges of the Mario AI benchmark
o Large state space
o Lack of effective heuristic functions
• Contributions of this study
o Showing the applicability of MCTS: the results outperform the A∗-based controller
o Improving the MCTS-based controller by
  • improving the accuracy of random sampling (the multi-simulation strategy)
  • a better calculation of a simulated game's result (the additional objective carries)
39
Thank You for Your Attention
40