
IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017

High-Level Decision Making in Adversarial Environments using Behavior Trees and Monte Carlo Tree Search

BJÖRN DAGERMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

High-Level Decision Making in Adversarial Environments using Behavior Trees and Monte Carlo Tree Search

BJÖRN DAGERMAN

Master’s Thesis at CSC

Supervisor: Petter Ögren

Examiner: Patric Jensfelt

Master’s Thesis in Computer Science

Degree Programme in Computer Science and Engineering

Swedish Title: En Algoritm för Beslutsfattande i Spelsituationer som Kombinerar Beteendeträd och Monte Carlo Tree Search

Björn Dagerman, [email protected]

June 26, 2017


Abstract

Monte Carlo Tree Search (MCTS) has been shown to produce good results when used to model agent behaviors. However, one difficulty in MCTS is anticipating adversarial behaviors in multi-agent systems, based on incomplete information. In this study, we show how a heuristic prediction of adversarial behaviors can be added, using Behavior Trees. Furthermore, we perform experiments to test how a Behavior Tree Playout Policy compares to a Random Playout Policy. Doing so, we establish that a Behavior Tree Playout Policy can effectively be used to encode domain knowledge in MCTS, creating a competitive agent for the Adversarial Search problem in real-time environments.


Sammanfattning (Swedish Abstract)

Monte Carlo Tree Search (MCTS) has been shown to produce good results when used to model agent behaviors. One difficulty in MCTS is anticipating opponents' behaviors in multi-agent systems, based on incomplete information. In this study, we show how a heuristic prediction can be added to MCTS using Behavior Trees. In addition, we carry out experiments to evaluate how a Behavior Tree policy compares to a random policy. As a consequence, we establish that a Behavior Tree policy can effectively be used to encode domain knowledge into MCTS, resulting in a competitive agent for the Adversarial Search problem in real-time environments.

Contents

1 Introduction
  1.1 Problem
  1.2 Objective and Hypothesis
  1.3 Delimitation
  1.4 Sustainability, Ethical, and Social Aspects
  1.5 Summary of Contributions

2 Background
  2.1 Decision Theory
    2.1.1 Markov Decision Processes
    2.1.2 Partially Observable Markov Decision Processes
  2.2 Game Theory
  2.3 Bandit Based methods
    2.3.1 Regret
    2.3.2 UCB1
  2.4 Monte Carlo Tree Search
    2.4.1 Algorithm
    2.4.2 UCT
  2.5 Behavior Trees

3 Related Work
  3.1 Behavior Trees
    3.1.1 Summary
  3.2 Monte Carlo Tree Search
    3.2.1 Summary

4 Capture the Flag Pac-Man
  4.1 Environment Introduction
  4.2 Environment Classification
  4.3 Environment Motivation

5 Proposed Approach
  5.1 Policies and Domain Knowledge
    5.1.1 Selection and expansion
    5.1.2 Behavior Tree Playout Policy
  5.2 Evaluation Function
    5.2.1 Basic Heuristic
    5.2.2 Improved Heuristic
  5.3 Observing enemies
  5.4 Cooperation between agents

6 Evaluation
  6.1 Statistical Significance
  6.2 Experiments
    6.2.1 Establishing Fairness
    6.2.2 Establishing a baseline
    6.2.3 Behavior Tree Playout Policy

7 Results and Discussion
  7.1 Results
    7.1.1 Establishing Fairness
    7.1.2 Establishing a Baseline
    7.1.3 Behavior Tree Playout Policy
  7.2 Discussion
    7.2.1 Establishing Fairness
    7.2.2 Establishing a Baseline
    7.2.3 Behavior Tree Playout Policy
    7.2.4 Comparison of MCTSRP and MCTSBTP

8 Conclusions and Further Work
  8.1 Further Work

Bibliography

Chapter 1

Introduction

A difficult domain in Artificial Intelligence concerns multiple agents, with conflicting goals, competing against each other. This class of problems is called Adversarial Search (AS) problems, and is typically referred to as games. AS problems are difficult because finding optimal solutions often has exponential time complexity: games in this class often have a high branching factor, while also lasting many moves. For instance, the well-known game GO has a branching factor of 250, on average. Also, GO games typically last up to 400 moves. Hence, there are about 250^400 game states in total, of which roughly 2.08 × 10^170 are legal [10]. Searching through all these states is infeasible within reasonable time. Therefore, instead of trying to find optimal solutions, a common approach is to search for good approximate solutions. This typically involves not evaluating the entire search space, but instead introducing a heuristic evaluation function. Such a heuristic should give a good estimate of the utility of a given action, in much less time than computing an optimal answer.

Monte Carlo Tree Search (MCTS) has been shown to produce effective approximations of good solutions for GO [31]. GO can be classified as a two-player zero-sum game with perfect information. That is, at all times each player can observe the complete game state, and in a terminal state it is clear that exactly one of the two players has won. Other difficult games are real-time strategy games, such as StarCraft. In real-time environments, decisions must be made within a strict deadline, typically less than a second. Given a strict deadline, finding fast approximations becomes important. MCTS has been shown to quickly find good solutions in real-time environments. The reason MCTS performs well is that it simulates outcomes of games, choosing actions that are observed to perform well over many simulations. One component of MCTS is a playout policy. This policy, or strategy, dictates which actions agents select during simulations. These actions can either be selected at random, or according to some strategy that encodes domain knowledge. Also, given that the number of moves needed to reach a terminal state is vast, a heuristic evaluation function can be used to only perform playouts to some maximum depth, where the function is used to evaluate the utility at the given depth. However, a problem with MCTS is that during playouts of limited depth, it is difficult to anticipate the actions of adversarial agents.

Finite State Machines (FSMs) have been used to model agent logic in AI, by expressing the different states an agent can be in, and more specifically, the transitions between states. A problem with FSMs is that they do not scale well: for large systems, being able to reuse states to add new features, and to reuse transitions, is problematic. Behavior Trees (BTs) are an alternative to FSMs, which tries to solve this problem by encapsulating logic transparently, increasing the modularity of states. In BTs, states are self-contained and do not have any transitions to external states. Therefore, states are instead called behaviors. BTs are a new area of research, but initial results have shown that they make logic modular and easier to use [8].

In this study, we want to evaluate whether heuristic predictions of adversarial actions can be added to MCTS using BTs, improving the quality of approximations. That is, can BTs be used as a playout policy by MCTS to find better solutions? This question is of interest for current research, in part because it may introduce improvements to the speed at which MCTS can find good solutions. Also, since BTs are a new area of research, it is useful to evaluate unexplored uses of BTs.

We denote an algorithm using Monte Carlo Tree Search with a Behavior Tree Playout Policy as MCTSBTP. Ideally, such an algorithm would combine the attributes of both MCTS and BTs. This could enable easy modeling of agents, while providing good solutions to the AS problem. Thus, MCTSBTP could make it easier to express powerful agents, enabling modeling of more realistic agents with dynamic behaviors. Furthermore, MCTSBTP could be used in a wide range of areas where it is difficult to model the real world, requiring agent behaviors to be modular and easily updated or changed.

Being able to perform adversarial search, while maintaining modularity in agent behaviors, enables applications where it is difficult to predict all parameters of a mission, e.g., Search and Rescue operations using UAVs. Using MCTSBTP makes it possible to adapt agents' behaviors as unforeseen circumstances arise. As this could enable autonomous systems to be more easily adapted to new situations, it could aid human decision makers in exploring various strategies by changing behavior parameters.


1.1 Problem

Our problem is to design (and implement) an algorithm for the real-time adversarial search problem using Monte Carlo Tree Search and Behavior Trees.

1.2 Objective and Hypothesis

Our objective is to show that BTs can be used to extend MCTS with heuristic predictions of adversarial actions.

Hypothesis: The performance of MCTS can be improved by using a simple BT playout policy, instead of a standard random policy, for real-time adversarial games.

In order to achieve the objective, we will implement MCTS using both a random policy and a Behavior Tree policy. We will also implement a baseline team. The implementation will be tested in a well-known environment. To test the hypothesis, we will first play many games (enough that the results will be statistically significant) with MCTS, using a random policy, against the baseline team. Then, we will play many games of MCTS, using the BT policy, against the baseline. Given these observations, we can note and compare differences in performance. We can then evaluate how the number of iterations, and the search depth, are affected by the change in policies.

We will compare the policies to test the hypothesis. Computational time will be taken into consideration, to ensure the suggested solution is applicable in real-time domains.

1.3 Delimitation

Our work is based on research in MCTS and BTs (see Chapter 3). We only evaluate our method on one game (see Chapter 4, Capture the Flag Pac-Man), although previous research has shown this environment to be sound [3]. Although we cannot prove the environment to be perfectly fair, we will take precautions to ensure it is as fair as possible. Lastly, our game has a Partial Observability property. How to best deal with this property is an open question and answering it is outside the scope of this study. Therefore, we will mostly disregard it, although we will observe how it affects the maximum playout depth.


1.4 Sustainability, Ethical, and Social Aspects

The proposed approach enables applications in domains where it is difficult to predict all parameters ahead of deployment. Such domains include, e.g., Disaster Relief, and Search and Rescue operations. Suppose a decision making and planning system, based on a black box, was deployed. Then, in a domain where the state-of-the-art is constantly improving, such a black box may be difficult to modify, given the desire to integrate newly discovered domain knowledge. This could result in systems that are either outdated or replaced. In such cases, an alternative system with modular logic, based on our proposed approach, could instead enable system engineers to extend existing Behavior Trees with new features, at a presumably lower cost than replacing the entire system. Furthermore, it could aid human decision makers in exploring a range of new strategies, at a faster rate, by experimenting with different Behavior Trees. In addition to the potential economic gain of using such a system, because Behavior Trees can be understood by humans, it is conceivable that an engineer could design and add new features to an existing Behavior Tree faster than previously possible, in response to an unseen situation, thus allowing faster response times. Lastly, because the decision model is human readable, and the decisions found by MCTS can be traced, it could allow for introspection by human decision makers, e.g., in safety-critical domains.

To recapitulate, the modularity and ease of use of the logic could potentially give the system a high level of maintainability. This could further enable easier expansions of the system, while diminishing the need for full system replacements. Such interactive properties could have immense value in operations such as Disaster Relief or Search and Rescue, where ultimately human lives are at risk. The proposed approach could potentially improve human decision makers' ability to handle unforeseen circumstances in such situations, which is of clear value to society.

Ethics in Artificial Intelligence has been a subject of much discussion. As this study concerns decision making and planning, it is appropriate that we briefly discuss some ethical considerations. The main concern that relates to this study is the trustworthiness of the AI system. That is, when using an AI system, can we blindly trust that the system does what it is designed to do? Regardless of how good the intentions of a systems engineer may be, any complex system will likely contain bugs. In the case of decision making and planning, e.g., using autonomous vehicles, a misbehaving agent may cause harm to the environment, other agents, or even a human. Although such scenarios must be avoided at all cost, we argue that it is not the responsibility of the underlying algorithm, such as MCTSBTP, to provide such capabilities. Rather, these factors should be considered at a higher level, i.e., when designing safety features, such that they are able to override the underlying decision making.


1.5 Summary of Contributions

This thesis presents four contributions, listed below. For each contribution, we give a brief description, followed by a reference to the chapters and sections where the results can be found.

• We show how high-level decisions can be made in real-time games using BTs and MCTS in Chapter 5, extending the work described in Chapter 3.

• We conclude that the performance of MCTS can be improved by using a simple BT playout policy, instead of a standard random policy, for real-time adversarial games, in the discussion in Chapter 7, based on the experiments described in Section 6.2, and results presented in Section 7.1 and discussed in Section 7.2.4.

• We conclude that the proposed approach is able to extend MCTS with heuristic predictions of adversarial actions, in Section 7.2.4, based on the results presented in Section 7.1.

• We give a suggestion for a new algorithm, MCTSUT, in Section 8.1, potentially making modeling of the environment easier.

Chapter 2

Background

This chapter introduces the necessary background theory required for the main body of this study. In Sections 2.1 and 2.2, a brief introduction to decision and game theory is presented. Section 2.3 introduces the multi-armed bandit problem, on which Monte Carlo Tree Search is based. Sections 2.1-2.3 give the required background to discuss Monte Carlo Tree Search in Section 2.4. Lastly, Section 2.5 presents Behavior Trees. We will revisit Sections 2.4 and 2.5 when discussing algorithm design considerations and our proposed approach, in Chapter 5.

2.1 Decision Theory

Decision Theory is a framework for decision problems that involve some degree of uncertainty, combining probability theory and utility theory. Markov Decision Processes are used to tackle problems for which the utility of a state directly follows from the sequence of decisions yielding that state.

2.1.1 Markov Decision Processes

In fully observable environments, sequential decision problems can be modeled using Markov Decision Processes (MDPs). In order to model overall decisions, the following is required:

• a set of states, S = {s0, s1, ..., sn},

• a set of actions, A = {a0, a1, ..., ak},

• a transition function P, where P(s'|s, a) denotes the probability of reaching state s', given that action a is applied in state s,

• a reward function R.

The overall decisions made during a game can thus be expressed as a sequence of (state, action)-tuples, where the transition from a given (state, action) to a new state is decided by a probability distribution.

Definition 2.1 (Policy) A policy P : S → A is a mapping from each state in S to an action in A.

Typically, there exist many policies, and the goal is to find the policy that maximizes the reward function.
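
To make the components above concrete, the following is a minimal Python sketch of an MDP container and a policy; the type names and structure are illustrative only and are not taken from the thesis implementation.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    State = int
    Action = str

    @dataclass
    class MDP:
        states: List[State]
        actions: List[Action]
        # transition(s, a) returns the distribution P(s' | s, a) as {s': probability}
        transition: Callable[[State, Action], Dict[State, float]]
        # reward(s) returns R for state s
        reward: Callable[[State], float]

    # A policy (Definition 2.1) maps every state to an action.
    Policy = Dict[State, Action]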

2.1.2 Partially Observable Markov Decision Processes

In case every state is not fully observable, the above MDP formulation needs to be extended with an observation model O(s, o), which expresses the probability of making observation o in state s.

In short, the difference between Partially Observable Markov Decision Processes (POMDPs) and MDPs can typically be seen in agents performing actions to gather information. That is, actions that are not necessarily good right now, but that will enable better decisions to be made in the future.

Exact planning in POMDPs that operates on all possible information states, or beliefs, suffers from a problem known as the curse of dimensionality, meaning that the dimensionality of a planning problem is directly related to the number of states [13]. Additionally, POMDPs may also have a problem where the number of belief-contingent plans increases exponentially with the planning horizon [28]. This is known as the curse of history. Littman showed that the propositional planning problem is NP-complete, whereas exact POMDP planning is known to be PSPACE-complete [18]. Therefore, domains with even small numbers of states, actions, and observations are computationally intractable.

2.2 Game Theory

Define a game as an environment containing one or more agents, each interacting to produce specified outcomes [3]. These interactions are subject to a set of rules. Though similar to Decision Theory, in games, agents also need to consider other agents' decisions. In order to extend Decision Theory to games, we need the following components:

• a set of states, S = {s0, s1, ..., sn},

• a set of actions, A = {a0, a1, ..., ak},

• a transition function, T : S × A → S,

• a reward function R,

• a list of agents B, declaring which agent should act in which state,

• the set of terminal states, ST ⊆ S.


A typical game will be played as follows: the game starts in state s0. For any state st, the agent to act, as defined by B, selects a valid action a. Given (st, a), the transition function T is applied to yield state st+1. This process is repeated until a terminal state s ∈ ST is reached. How an agent selects an action for a given state is defined by the agent's policy. In general, an agent's policy will select an action which has a high probability of yielding a successor state with high reward, given by R.
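
The play of a game described above can be summarized in a short Python sketch; the function parameters (agent_to_act, policies, T, R) are assumed interfaces for illustration, not part of the thesis code.

    def play_game(s0, terminal_states, agent_to_act, policies, T, R):
        """Run one game: agents act in the order given by B (agent_to_act)
        until a terminal state in ST (terminal_states) is reached."""
        s, history = s0, []
        while s not in terminal_states:
            agent = agent_to_act(s)   # B: which agent acts in state s
            a = policies[agent](s)    # the agent's policy picks a valid action
            history.append((s, a))
            s = T(s, a)               # transition function T : S x A -> S
        return R(s), history          # reward of the terminal state, plus the play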

2.3 Bandit Based methods

In this section, we will give an introduction to Bandit Based methods, on which Monte Carlo Tree Search (Section 2.4) is based.

Given K multi-armed bandit slot machines, which machines should one play, and in which order, to maximize the aggregate reward? Or, which is the optimal sequence of actions? This class of sequential decision problems is denoted Bandit problems. However, because the underlying reward distributions are unknown, finding optimal actions is difficult. Therefore, one typically relies on previous observations to predict future results. That is, formalizing a policy that determines which bandit to play, given the previous rewards. The problem is that this requires a balance between exploiting the action currently believed to be optimal and exploring other actions. This balance is called the exploitation-exploration dilemma.

2.3.1 Regret

The loss in reward a player incurs by not playing the optimal action is called the player's regret. Lai and Robbins [16] introduced the technique of upper confidence bounds for asymptotic analysis of regret, obtaining an O(log n) lower bound on its growth. The regret after n plays can be expressed as:

    \bar{R}_n = n\mu^* - \sum_{t=1}^{n} E[\mu_{I_t}]    (2.1)

where µ* is the best possible expected return, and E[µ_{I_t}] the expected return for arm I_t ∈ {1, ..., K}. See [4] for more details and stronger results.

2.3.2 UCB1

Based on Agrawal's index-based policy [1], Auer showed that there exists a policy, UCB1, achieving logarithmic regret uniformly over n, without any preliminary knowledge of the reward distributions. UCB1 works by playing the machine j that maximizes x̄_j + √(2 ln n / n_j), where x̄_j is the average reward obtained from machine j, n_j is the number of times machine j has been played, and n is the overall number of plays [2].
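
A minimal Python sketch of the UCB1 rule, under the assumption that each arm's running average reward and play count are tracked externally:

    import math

    def ucb1_select(avg_reward, plays, total_plays):
        """Return the arm index j maximizing x̄_j + sqrt(2 ln n / n_j).
        Arms that have never been played are tried first."""
        best_j, best_value = 0, float("-inf")
        for j in range(len(avg_reward)):
            if plays[j] == 0:
                return j
            value = avg_reward[j] + math.sqrt(2.0 * math.log(total_plays) / plays[j])
            if value > best_value:
                best_j, best_value = j, value
        return best_j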


2.4 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a family of algorithms, based on the general idea that:

1. the true value of an action can be approximated using random simulations,

2. using these approximations, the policy can efficiently be adjusted to a best-first strategy [3].

MCTS maintains a partial game tree, starting with the empty tree and progressively expanding it following a tree policy. Each node in the tree represents a game state, and the edges from a parent to its children represent the available actions from the parent state. As the tree is expanded, observations of the reward of successor states are propagated along the parent path. Thus, as the tree grows, these estimated rewards become more accurate.

2.4.1 Algorithm

Given a computational budget (i.e., a maximum amount of time, memory, or iterations), MCTS iteratively builds a game tree. See Figure 2.1 for an illustration of one iteration of MCTS. In each iteration, the following steps are performed:

1. Selection: starting at the root, a child policy is recursively applied until a promising child is found. What constitutes a promising child is determined by the policy. However, it must have unvisited children, and be a non-terminal state.

2. Expansion: the available actions at the selected node are explored by an expansion policy, determining which node(s) are expanded. Expanding a child involves adding it to the tree, and marking it for simulation.

3. Simulation: the expanded node(s) are simulated following a default policy, yielding an observed reward for each node.

4. Back propagation: starting at each of the expanded nodes, the observed results are propagated up the tree, following the parent path until the root is reached.


Figure 2.1: One iteration of MCTS.

The back propagation step does not follow a policy. Rather, it updates the node statistics. These statistics are then used in future selections and expansions. Upon reaching the threshold imposed by the computational budget, a valid action from the root node is selected following some selection rule. Such selection rules, as introduced by Schadd [30], based on the work of Chaslot et al. [6], and discussed by Browne et al. in [3], include:

• Max child: select the child with the highest reward.

• Robust child: select the most visited child.

• Secure child: select the child that maximizes a lower confidence bound.

2.4.2 UCT

The Upper Confidence bound applied to Trees (UCT) is an adaptation of UCB1 (Section 2.3.2). It is a popular approach to handling the exploitation-exploration dilemma in MCTS, initially proposed by Kocsis and Szepesvari [14, 15]. The idea is that the problem of picking good nodes during the selection and expansion steps can be modeled such that every choice encodes an independent multi-armed bandit problem. Then, the node that should be chosen is one that maximizes equation 2.2:

    \text{UCT} = \frac{w_i}{n_i} + c\sqrt{\frac{\ln t}{n_i}}    (2.2)

where w_i is the number of wins after the i-th move, n_i the number of simulations after the i-th move, t the total number of simulations for the considered node, and c is the exploration parameter.
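
The four MCTS steps of Section 2.4.1, with UCT (equation 2.2) used during selection, can be sketched as follows in Python. The Node fields and the expand/simulate callbacks are assumptions made for this sketch, not the interfaces used in the thesis implementation.

    import math

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children = []              # expanded children
            self.wins, self.visits = 0.0, 0

    def uct_value(child, parent_visits, c=1.41):
        # Equation 2.2: w_i / n_i + c * sqrt(ln t / n_i)
        return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

    def mcts_iteration(root, expand, simulate):
        # Selection: descend while every child has been visited, maximizing UCT.
        node = root
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = max(node.children, key=lambda ch: uct_value(ch, node.visits))
        # Expansion: add an unexplored child if possible, otherwise keep the node.
        child = expand(node) or node
        # Simulation: play out from the expanded node with the default policy.
        reward = simulate(child.state)
        # Back propagation: update statistics along the parent path up to the root.
        while child is not None:
            child.visits += 1
            child.wins += reward
            child = child.parent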

2.5 Behavior Trees

Following the success of the high-level decisions and behaviors displayed by AI agents in Halo 2 [11], the gaming industry took an interest in adapting the demonstrated techniques [5, 12]. Primarily, this corresponds to Behavior Trees, having been established as a more modular alternative to finite state machines [9].

To describe Behavior Trees (BTs), we will adhere to the definition used by Millington and Funge in [23]. The main difference between a Hierarchical State Machine and a BT is that BTs are constructed of tasks instead of states. A task can either be an action or a conditional check, or it can be defined as a sub-tree of other tasks. By having a common interface for all tasks, it becomes possible to model higher-level behaviors using sub-trees. Furthermore, knowledge of the implementation details of each behavior (sub-tree) is not necessary when constructing the tree. Thus, it is possible to model complex behaviors as compositions of sub-trees.

In order to evaluate a BT, a signal (or tick) is pushed through the tree starting at the root. In order to explain how this tick is propagated through the tree, we will first explain the different types of nodes (tasks). Each node can report exactly one of three states: success, failure, or running. The four most essential nodes are:

• Selector: a non-leaf node whose children are evaluated in order (left to right). As soon as one child succeeds, so does the selector. If no child succeeds, the selector fails. A selector is considered running if any child is running.

• Sequence: a non-leaf node that succeeds iff all its children succeed. A sequence is considered running if any child is running.

• Action: a leaf node that succeeds upon completing the action it describes. An action task is considered running while the described action is still being evaluated. It fails if the action cannot be completed.

• Condition: a leaf node that succeeds or fails if the described condition is true or false, respectively.

An illustration of a BT, consisting of the above types of tasks, is shown in Figure 2.2. More details, and more types of tasks (i.e., parallel and decorator), can be found in [23].


[Figure 2.2 layout: a root selector ("?") with two sequences ("→"). Left sequence: Am I vulnerable to enemies? → Find nearest enemy → Am I too close? → Move away from enemy. Right sequence: Find nearest food → Move towards food.]

Figure 2.2: A simplified Behavior Tree of Pac-Man's high-level decisions in the game Pac-Man. "→" represents a left-to-right sequence. The "?" represents a selector. Leaves whose descriptions are questions represent conditions. Leaves whose descriptions are statements represent actions. That is, the left sub-tree describes the behavior of running away from close enemies. The right sub-tree describes eating the closest food. The overall strategy encoded in the BT is to try to eat the closest food, unless an enemy is nearby, in which case Pac-Man should run away. See Section 6.2.3 for the actual BT used during the experiments.

For every tick, start at the root and traverse the tree as described by the control nodes (selectors and sequences). Upon reaching a leaf (action or condition), said task is evaluated for one time step (one tick), and its status (success, failure, or running) is reported back to its parent, which in turn signals its parent, etc. Thus, the results are back propagated to the root. The typical evaluation of a BT will consist of ticking the root either until it signals success or failure, or until some computational time budget is exhausted.
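
A minimal Python sketch of the four node types and the tick semantics described above. The handling of running is simplified (the first running child short-circuits its parent), and the interfaces are illustrative rather than taken from the thesis implementation.

    SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

    class Selector:
        """Succeeds as soon as one child succeeds; fails if every child fails."""
        def __init__(self, children):
            self.children = children
        def tick(self, state):
            for child in self.children:
                status = child.tick(state)
                if status != FAILURE:      # success or running is reported upwards
                    return status
            return FAILURE

    class Sequence:
        """Succeeds only if all children succeed; stops at the first failure."""
        def __init__(self, children):
            self.children = children
        def tick(self, state):
            for child in self.children:
                status = child.tick(state)
                if status != SUCCESS:      # failure or running is reported upwards
                    return status
            return SUCCESS

    class Condition:
        """Leaf: succeeds or fails depending on a boolean predicate on the state."""
        def __init__(self, predicate):
            self.predicate = predicate
        def tick(self, state):
            return SUCCESS if self.predicate(state) else FAILURE

    class Action:
        """Leaf: delegates to a function returning success, failure, or running."""
        def __init__(self, act):
            self.act = act
        def tick(self, state):
            return self.act(state)

A tree like the one in Figure 2.2 would then be assembled as a Selector over two Sequences and ticked once per decision step.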

Chapter 3

Related Work

In this chapter, we give an overview of the related work that has been done in the domains of Behavior Trees (BTs) and Monte Carlo Tree Search (MCTS). We present selected articles for BTs and MCTS in Sections 3.1 and 3.2, respectively. Although many advancements have been made in machine learning using these techniques, our focus is on real-time planning and decision making. As such, we mainly present results related to this topic.

3.1 Behavior Trees

Weber et al. present idioms for reactive planning in multi-scale games in [34]. That is, games where agents are required to make high-level planning decisions, while also controlling individual agents during battle. They present idioms for handling this in the real-time game of StarCraft. These include reactively building an Active Behavior Tree, which is gradually populated by goals, inspired by Loyall's work in "Believable agents: building interactive personalities" [19]. These techniques relate to our study, as they concern the problem of Behavior Trees commonly only handling one agent, which is problematic when scaling to multi-agent systems. However, the authors make no effort to generalize their results beyond StarCraft. Furthermore, their evaluation is against the standard StarCraft bots, which to our knowledge do not provide an accurate representation of competitive AI agents. Marzinotto et al. further discuss the problem of cooperation in multi-agent systems, using separate Behavior Trees, in [21]. They suggest introducing a special Decorator node that handles synchronization between different agents' Behavior Trees.

In On-Line Case-Based Planning (OLCBP), planning at different levels suffers from problems in reactivity. For instance, not being able to interrupt the execution of tasks as the parameters of the world change, making the actions being evaluated no longer desirable. Palma et al. show how a Tactical Layer can be constructed in [26], using a database of Behavior Trees for low-level tasks. That is, they propose a system where high-level decisions are formed by combining low-level BT decisions. Although their high-level planning is based on learning and thus different from ours, their overall system design may be applicable to our system. However, they provide no empirical performance analysis.

In [17], Lim et al. use Behavior Trees to build a bot for the real-time game DEFCON. The authors implemented BTs for low-level actions, then combined these BTs randomly to build an agent that maintains an above 50% win rate against the game's default bot. They argue that this shows that it may be possible to automate the process of building agents. A problem with their approach is that they need to provide fitness functions for each behavior, in order to guide the combinations. How these functions should be defined is unclear. However, it is clear that these functions may sometimes be problematic to define in certain environments. We hypothesize that our approach of using MCTS simulations could instead be used to combine the behaviors.

3.1.1 Summary

There is a synchronization problem if Behavior Trees are separated for individual agents in multi-agent systems [34]. This problem can be addressed by extending BTs with new node types [21]. We will further discuss this problem in Chapter 5, Proposed Approach.

BTs are commonly used for low-level tasks. However, how to best use BTs for high-level decision making is still an open question. Previous work has explored different learning or stochastic processes. To our knowledge, there have been no attempts at using simulation-based methods for high-level decision making with BTs.

3.2 Monte Carlo Tree Search

Samothrakis et al. discuss how the search depth can be limited to deal with games that have no clear terminal winning state [29]. They extend previous research in MCTS to N-player games by extending it to Maxn trees. Maxn is a procedure to find the equilibrium point of the values of the evaluation function in N-player games [20]. Lastly, they evaluate their approach by playing against other bots. As our study involves multi-agent systems, this work is of clear interest.

In [27], Pepels and Winands present a bot for the real-time game Ms. Pac-Man, using MCTS to guide its decisions. The authors claim to use MCTS to find optimal moves at each turn. However, they provide no proof of optimality. Nonetheless, they provide a detailed description of their implementation, and the considerations that must be taken to adapt MCTS to Ms. Pac-Man. They show significant improvements while playing against competitive bots, concluding that the MCTS framework can be used to create strong agents for Ms. Pac-Man. This result relates to our study, as we want to use MCTS in real-time games. One potential problem with their approach is that they use highly formalized strategies to encode domain knowledge. This raises the question of whether these results for MCTS can be reproduced without finely tuned strategies, i.e., using BTs.


A new Monte Carlo algorithm, POMCP, is proposed in [32]. POMCP is able to efficiently plan in large POMDPs (Section 2.1.2), without needing an explicit probability distribution. Recall the curses of dimensionality and history. The authors claim to break both these curses using Monte Carlo sampling. This article is of interest as it discusses problems with POMDPs, while showing empirical data on performance limitations if the partial observability is not handled with care.

3.2.1 Summary

In games with no clear terminal winning state, heuristic evaluations can be made at a limited search depth [29]. MCTS can find good solutions in real-time games, if highly formalized strategies are applied during playouts [27]. Lastly, not handling partial observability may impose performance limitations [32].

Both [27] and [29] disallow Pac-Man from reversing direction when expanding the tree. This attempt at reducing the branching factor by limiting moves is questionable, especially in [27], where the authors claim to produce optimal paths. This seems unlikely, as they prune part of the search space without showing that it cannot possibly lead to better solutions.

Chapter 4

Capture the Flag Pac-Man

In this chapter, we give an introduction to the game Capture the Flag Pac-Man in Section 4.1, followed by a classification in Section 4.2. Lastly, we motivate why this game is of interest for our study, in Section 4.3.

4.1 Environment Introduction

In Capture the Flag Pac-Man (CTF Pac-Man), each player controls N/2 agents, constituting a team. The map is a maze with reflection symmetry. Each team has a unique home and away field of the map. Agents' starting locations are fixed in their home field. An agent on its home field is considered a defender. An agent on the away field is considered an attacker. The map is populated by pellets, or food. When an attacker touches food on the away field, the agent picks it up. If a defender touches an attacker, the attacker dies, dropping all collected food, before respawning at its start location. If an attacker returns to its home field carrying m > 0 food, the agent trades this food for m points. For a game with N/2 agents per team, the game ends when only N/2 food remains on either player's home field, or after a set time has elapsed. That is, it is not a winning strategy to stand still in some arbitrarily selected location. The winner of the game is the player with the most points. If both players have the same number of points, the game is considered a draw. See Figure 4.1 for an illustration of the game.



Figure 4.1: An illustration of the game CTF Pac-Man. The Red team is represented by the orange and red agents, while the blue and teal agents belong to the Blue team. As both of the Red agents are defending (their position is on their half of the map), they are ghosts. The blue agent, on the other hand, is a Pac-Man, as it is on the opponent's side of the map.

4.2 Environment Classification

• Partially Observable: Agents do not have complete knowledge of the world. That is, their sensors can only perfectly detect opponents that are within 5 units of distance. For an agent agt1 in state s ∈ S observing opponent agt2, with dist(agt1, agt2|s) > 5, an observation o = O(agt2, s) is given instead of a true (x, y)-location. This observation o contains a perceived distance 5 < dist(agt1, agt2|o) < ∞, following a given discrete probability distribution.

• Multi-agent: A game is played by N agents, with each team consisting of N/2 agents that work together to defend and attack the adversaries.

• Stochastic: Because agents' sensors cannot perfectly detect opponents at all times, the state transition function is dependent on probabilistic features of the environment and adversaries.

• Sequential: Every action changes the game state. Denote the action applied in state sk as ak, and the resulting state sk+1 = sk(ak). Then, any state sn can be expressed as a sequence of the n actions {a0, ..., an} applied to states {s0, ..., sn}, respectively.

• Semi-dynamic: There is a fixed time to make decisions. During this time, agents are given a guarantee that the environment will not change.

• Discrete: There is a finite number of states and state transitions.
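
A sketch of the observation model in the first bullet; the noise offsets and the maze_distance helper are illustrative assumptions, not the game's actual distribution.

    import random

    def observe_opponent(my_pos, true_pos, maze_distance, noise=(-2, -1, 0, 1, 2)):
        """Within 5 units the true (x, y)-location is observed; beyond that only
        a noisy perceived distance (> 5) drawn from a discrete distribution."""
        d = maze_distance(my_pos, true_pos)
        if d <= 5:
            return ("position", true_pos)
        return ("distance", max(6, d + random.choice(noise)))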

4.3 Environment Motivation

A majority of research on real-time games is split between the two games StarCraft and Pac-Man, making either of these environments interesting. The problem with planning and decision making in StarCraft is that of integrating reasoning at multiple levels of abstraction [25]. That is, in order to apply MCTS to StarCraft, the use of abstraction is required [24]. Thus, in order to use StarCraft, the scope of our subject would have to be expanded. Therefore, we select Pac-Man. Capture the Flag Pac-Man is selected because, unlike traditional Pac-Man, it has a terminal winning state, making evaluation easier. Also, relevant studies exist that have evaluated the use of UCT in Pac-Man, e.g., [27, 29], giving a foundation for continued work.

Chapter 5

Proposed Approach

In this chapter, we start by extending MCTS with a Behavior Tree Playout Policy (MCTSBTP) and adapting the algorithm to CTF Pac-Man, as described in Sections 5.1.2 and 5.1.1, respectively. A detailed discussion of the evaluation functions is given in Section 5.2. Section 5.3 discusses the design decisions related to the Partial Observability property of CTF Pac-Man. Lastly, Section 5.4 briefly discusses how MCTS can be extended to cooperative environments.

5.1 Policies and Domain Knowledge

5.1.1 Selection and expansion

During the selection step, a child node is selected at random if the child's visit count n_c is under a threshold T. Based on the suggestion by Pepels and Winands [27], we select T = 3 and C = 1.5. For n_c ≥ T, apply UCT (see Section 2.4.2), i.e., select the child that maximizes equation 5.1:

    n_s + C\sqrt{\frac{\ln n_p}{n_c}}    (5.1)

where n_s is the current score of the considered node, n_p the visit count of the current node, and n_c the visit count of child c. Both T and the exploration constant C are naturally domain dependent and may require experimentation to find.

The visit threshold T is used due to the Partially Observable and Stochastic properties of CTF Pac-Man (Section 4.2), to ensure the robustness of the selected path. That is, because an otherwise safe node can evaluate to a loss due to the above-mentioned factors, applying a threshold ensures that this risk becomes negligible.
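
A sketch of this selection rule in Python, assuming tree nodes that expose score and visits fields (illustrative names, not the thesis implementation):

    import math
    import random

    T, C = 3, 1.5   # visit threshold and exploration constant (Section 5.1.1)

    def select_child(parent):
        """Children visited fewer than T times are picked at random; otherwise the
        child maximizing equation 5.1 is picked."""
        rarely_visited = [ch for ch in parent.children if ch.visits < T]
        if rarely_visited:
            return random.choice(rarely_visited)
        return max(parent.children,
                   key=lambda ch: ch.score + C * math.sqrt(math.log(parent.visits) / ch.visits))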

5.1.2 Behavior Tree Playout Policy

Given the Behavior Tree B_i that perfectly expresses the i-th agent's decision model, tick B_i to get an action a. Apply a for agent i. Repeat for each agent at depth d, and increment d = d + 1. Repeat until the maximum depth is reached. Naturally, in some settings, one may not have a perfect BT for each agent, or the BT may simply be an approximation. In such a case, either:

1. select some BT B_j to be used as an approximation for agent i. Tick B_j to get an action a_bt. Let a_r be a randomly selected action, from the set of valid actions for agent i in the current game state. Set a = a_bt with probability p and a = a_r with probability 1 − p,

2. or, in addition to the above strategy, after each iteration of MCTS, note the actual move performed by the agent. Now, as BTs are modular and easily changed, given a set of observations, it is conceivable one could adjust the parameters of B_j to better match the actual observations, but dynamically improving BTs based on observations is an open question.
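
Option 1 above can be sketched as follows; bt.tick, state.legal_actions, and the mixing probability p = 0.8 are assumptions made for illustration.

    import random

    def bt_playout_action(agent, state, bt, p=0.8):
        """With probability p, play the action proposed by the approximate
        Behavior Tree B_j; otherwise play a random valid action."""
        a_bt = bt.tick(state, agent)                     # action suggested by the BT
        a_r = random.choice(state.legal_actions(agent))  # random valid action
        return a_bt if random.random() < p else a_r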

5.2 Evaluation Function

Ideally, simulations should run until a terminal game state is reached, provided that such a state has a clear winner or loser. Suppose each simulation is represented by the sequence of states {s_0, s_1, ..., s_{d−1}, s_d}, obtained by repeatedly applying the playout policy until depth d. Given a set of such sequences of states, representing these simulations, the search tree can be constructed (following UCT) such that it reveals which states tend to lead to victories, and which lead to losses. Then, decisions can directly be based on this tree, and we can properly apply, e.g., min-max decisions. However, there are two main problems with this approach: firstly, the search depth of CTF Pac-Man is more than 1000 moves. Secondly, it is a game of imperfect information, meaning that the error of a simulation will increase as the depth of the simulation increases (see Section 5.3).

As previously discussed, one approach to these problems is to instead stop simulations at some maximum depth. When stopping a simulation, an evaluation of the current game state is required, to determine how desirable it is. This is commonly achieved by defining a heuristic evaluation function, which given a game state outputs a numerical value, where the magnitude of that value corresponds to the utility of the given state. Clearly, when determining the utility of some arbitrary non-terminal game state, looking at the current score may be misleading, as a winning strategy in CTF Pac-Man may involve picking up enough pellets to win without having to return to the away field a second time. Thus, it is of limited consequence what the intermediate score is. Instead, we will present two alternative heuristic evaluation functions in Sections 5.2.1 (Basic Heuristic) and 5.2.2 (Improved Heuristic). These heuristics will be referred to as the basic and the improved heuristic in the following chapters.

5.2.1 Basic Heuristic

An observation of a game state consists of a number of food locations, and the locations of the various agents. Given this information, the basic heuristic expresses the utility of the given state as a linear combination of the amount of remaining food, and the distance to the closest food. Each feature is weighted such that it is always better to eat a food than to move close to it and stop.

5.2.2 Improved Heuristic

We hypothesize that using an overly simplified heuristic may cause poor performance. As an improvement to the basic heuristic, the improved heuristic adds an additional weighted feature representing the distance to the closest non-scared ghost. The features are weighted as in the basic heuristic, with the addition that it is never a good idea to die.
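
The two heuristics can be sketched as below. The state accessors and the weights are illustrative assumptions; the thesis only requires that eating a pellet outweighs stopping next to it, and that dying is always the worst outcome.

    def basic_heuristic(state, w_food=-10.0, w_dist=-1.0):
        """Linear combination of the amount of remaining food and the distance
        to the closest food (both should be minimized, hence negative weights)."""
        return w_food * state.food_left() + w_dist * state.dist_to_closest_food()

    def improved_heuristic(state, w_ghost=5.0):
        """Adds a weighted (capped) distance to the closest non-scared ghost,
        so that being caught dominates every other consideration."""
        ghost_term = min(state.dist_to_closest_non_scared_ghost(), 5)
        return basic_heuristic(state) + w_ghost * ghost_term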

5.3 Observing enemies

As discussed in Section 2.1.2, and by Silver and Veness in [32], dealing with imperfect information is problematic. It is outside the scope of this study to find good ways of dealing with the Partial Observability property. Instead, we will make the assumption that what we cannot observe (with high probability) does not exist. That is, when doing simulations, we will only simulate agents for which the probability of us having correct information, regarding their position at the start of the simulation, is high. At every depth of a simulation, the current game state is a direct result of the actions selected in previous states. Meaning that, if a simulation starts in game state s, we have an observation of s on which we will base our simulation. Denote this observation as s_o. While simulating at an increased depth, it is not possible to extract new observations that would reveal more information about s than s_o.

Naturally, this will have consequences as the simulation depth increases, since agents that have incorrectly been assumed not to exist may eventually move close enough to other agents that what is perceived to be a good decision may not match the actual reality. It should be noted that this problem is by no means specific to MCTSBTP. Therefore, we suggest that, if needed, a more sophisticated approach is found by exploring current research.

5.4 Cooperation between agents

As mentioned in Section 3.1.1, there is a synchronization problem if BTs are separated for individual agents in multi-agent systems [34]. However, MCTS can be extended to handle agent cooperation. Thus, cooperation aspects can be shifted from the BTs to the MCTS algorithm. One way to extend MCTS to cooperation problems is by using a two-stage selection strategy. Additionally, node values are extended to contain a team utility value, in addition to the agent utility value. Then, when making selections, for a node that is not fully explored, select the child with the highest team mean utility. Otherwise, select the node with the highest agent mean utility. It should be noted that the simulations should be tested for Pareto Optimality, to evaluate team performance. That is, the situation where an agent cannot improve its own utility without degrading the utility of the team. Cheng and Carver propose a pocket algorithm in [7] to test for Pareto Optimality in MCTS. Pareto Optimality in cooperative MCTS is further discussed by Wang and Sebag in [33].
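
A sketch of the two-stage selection strategy described above; the node fields (team_value, agent_value, visits, fully_explored) are illustrative assumptions.

    def cooperative_select(parent):
        """Prefer mean team utility while the node is not fully explored,
        otherwise fall back to the acting agent's own mean utility."""
        if not parent.fully_explored:
            return max(parent.children, key=lambda ch: ch.team_value / max(ch.visits, 1))
        return max(parent.children, key=lambda ch: ch.agent_value / max(ch.visits, 1))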

Chapter 6

Evaluation

We hypothesize that the performance of MCTS can be improved by using a simple BT playout policy, instead of a standard random policy, for real-time adversarial games. Using an experimental methodology, it is possible to investigate this hypothesis. In this chapter, we first disclose our testing environment in Section 6.1, followed by a presentation of the various experiments in Sections 6.2.1 - 6.2.3.

6.1 Statistical Significance

As the game is probabilistic, it may be difficult to distinguish between what is a feature of the algorithm and what is random noise. Therefore, a large enough number of games should be played for each test, such that conclusions can be drawn statistically using large sample sizes.

In order to play a significant number of games for each test in reasonable time, up to 48 games were played in parallel, distributed on two servers each having 2x6 cores with Hyper-Threading (24 threads per server). For each test, independent instances of the game were spawned using the same input. After evaluating an instance, the sample's output was saved. After gathering up to 200 samples for each test, a statistical analysis was performed on the collected data.

6.2 Experiments

6.2.1 Establishing Fairness

To simplify further experiments, it is preferable to establish some degree of fairness in the testing environment (the game), and in the baseline strategy. Given that the game and baseline can be considered fair, a team's start location can be fixed to either side of the map, thus decreasing the number of variables to be tested. CTF Pac-Man utilizes a completely mirrored environment for both teams. That is, the layouts of both sides of the map are mirrored, including agent spawn locations and pellet/food locations. However, although a real-time game, agents select actions in a sequence of game states. Meaning, agent agt0 will first select an action for game state S. Next, a successor state S' of S is generated by applying the action selected by agt0. Now, agent agt1 will select an action for state S'. Agents transform the game state in a fixed order. This order is determined at startup as shown in Algorithm 1.

Algorithm 1: Agent Order Selection. Random[0, 1] refers to a pseudorandom number generator yielding floating point values between 0 and 1 uniformly. popb refers to a function that removes the back element of a collection and returns it.

    redAgents ← [ragt0, ragt1]
    blueAgents ← [bagt0, bagt1]
    team1 ← redAgents if Random[0, 1] < 0.5 else blueAgents
    team2 ← blueAgents if team1 is redAgents else redAgents
    order ← [team1.popb(), team2.popb(), team1.popb(), team2.popb()]
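
For reference, a direct Python transcription of Algorithm 1 (a sketch; list.pop() plays the role of popb):

    import random

    def agent_order(red_agents=("ragt0", "ragt1"), blue_agents=("bagt0", "bagt1")):
        red, blue = list(red_agents), list(blue_agents)
        team1 = red if random.random() < 0.5 else blue
        team2 = blue if team1 is red else red
        # Interleave the two teams, taking agents from the back of each list.
        return [team1.pop(), team2.pop(), team1.pop(), team2.pop()]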

Definition 6.1 (Perfect strategy) Consider a finished game to be the complete sequence of actions, selected by each agent, at every time step. We say that a strategy ST is a perfect strategy iff, for every time step, actions are selected by ST such that in the terminal state, the maximum possible score is always achieved (given the opponents' moves).

Definition 6.2 (Perfectly fair game) For a game to be perfectly fair, it must hold that the chance of winning, using a perfect strategy, is exactly equal regardless of the order in which the agents move.

There is no known perfect strategy for CTF Pac-Man that is computationally tractable. Furthermore, this study concerns approximations. Therefore, we will not try to show that CTF Pac-Man is a perfectly fair game, but instead that it is reasonably fair.

Definition 6.3 (Non-trivial strategy) We say that a strategy is a non-trivial strategy if it can maintain a win rate of at least 99% against a strategy that selects actions at random.

Definition 6.4 (Reasonably fair game) We say that a game is a reasonably fair game if there exists a non-trivial strategy that, when played against itself, maintains a mean score s with 0 ≤ |s| < ε, for some small ε.

Define a team selecting actions at random as RandomTeam. Let BTTeam be a team that selects actions according to a simple Behavior Tree. In order to test that the game is reasonably fair, we first test BTTeam against RandomTeam. It is expected that BTTeam will maintain a win rate of at least 99%. Given that BTTeam is a non-trivial strategy, we will sample BTTeam against BTTeam. The range of the score for one sample is [−20, 20]. For a reasonably fair game, the scores should be normally distributed around zero. That is, the true mean is expected to be zero. Based on this, we can perform a one-sample t-test. The discussed data is insufficient to decisively accept the null hypothesis (µ = 0). However, we can give a confidence interval for the actual mean.
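
The fairness analysis can be reproduced with a one-sample t-test; the sketch below uses SciPy, and the file name holding one final score per game is hypothetical.

    import numpy as np
    from scipy import stats

    scores = np.loadtxt("bt_vs_bt_scores.txt")   # one score per game, in [-20, 20]
    t_stat, p_value = stats.ttest_1samp(scores, popmean=0.0)
    ci_low, ci_high = stats.t.interval(0.95, len(scores) - 1,
                                       loc=scores.mean(), scale=stats.sem(scores))
    # Fail to reject the null hypothesis (mean = 0) when p_value > 0.05.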

6.2.2 Establishing a baseline

Next, we will establish a baseline, using a random policy, which we will later use to evaluate our Behavior Tree policy. We start with the collected data and samples from the tests in Section 6.2.1 for BTTeam against BTTeam. Next, we play MCTSRP against BTTeam. Given these two groups of data, a two-sided t-test can be performed to test the null hypothesis (that the two groups are equal). Given that the null hypothesis gets rejected, as expected, we have a baseline for how MCTS, using a random playout policy, compares to BTTeam. As MCTSRP depends on many variables (tree depth, playout depth, UCT constant, number of iterations), it is conceivable that some combinations of variables may perform similarly to BTTeam. As such, a set of combinations may need to be tested.

6.2.3 Behavior Tree Playout Policy

To test our main hypothesis, we will test the Behavior Tree playout policy (MCTSBTP) against BTTeam. The collected data can then be compared to the baseline established in Section 6.2.2. Doing so, we can establish the performance difference of using the BT policy instead of a random policy for playouts. The expected result is a significant improvement in win rate.

Regarding playouts, it should be noted that they are traditionally performed until a terminal state is reached [25]. However, since our game is partially observable, consisting of probabilistic properties, an observed game state may contain an error compared to the actual game state. This error is propagated as the observed game state is expanded in the simulations. Also, we only have an approximate model of our adversaries' behaviors. Meaning, as simulations increase in depth, the error is likely to increase. This implies that, at some depth, the error will reach a threshold where increasing the depth will yield an observed score for the simulation that is significantly different from the actual score. As the selection policy can only take into account simulated scores, this may cause selections of nodes that are increasingly different from what they are perceived to be. In other words, the risk of picking a node that decreases the score the algorithm tries to maximize increases, thus decreasing the overall win rate. Furthermore, as the error is not directly observable, it is difficult to know when to stop a simulation. Therefore, we will limit the depth of playouts. That is, we want to find a combination of iterations and depth, such that the depth is deep enough that Pac-Man does not commit suicide by walking into dead-end alleys, while shallow enough that the error caused by partial observability is negligible. And, we want to perform enough iterations to find moves that are globally good (not just locally).

Furthermore, as an objective of this study is to make modeling of agents easier, it is preferable if the BTs do not have to be very complex. Therefore, a simple BT (Figure 6.1) will be used in the experiments for MCTSBTP.

[Figure 6.1 layout: a root selector ("?") with three sequences ("→"). First sequence: Am I vulnerable to enemies? → Find nearest enemy → Am I too close? → Move away from enemy. Second sequence: Can I eat enemies? → Find nearest enemy → Am I very close? → Move towards enemy. Third sequence: Find nearest food → Move towards food.]

Figure 6.1: A simple Behavior Tree of Pac-Man's high-level decisions. The overall strategy encoded in the BT is to try to eat the closest food, unless an enemy is nearby. If so, Pac-Man should either try to eat it, if he has recently found a powerup and the enemy is very close, or otherwise run away.

Chapter 7

Results and Discussion

In this chapter, we start by giving a presentation of the results of the three categories of experiments: Establishing Fairness, Establishing a Baseline, and the Behavior Tree Playout Policy, in Sections 7.1.1, 7.1.2, and 7.1.3, respectively.

The results are discussed in Section 7.2. Each category is discussed on its own in Sections 7.2.1, 7.2.2, and 7.2.3. Lastly, the combined results are discussed in Section 7.2.4.

7.1 Results

7.1.1 Establishing Fairness

This section presents the results of the experiments performed to establish fairness. That is, that the blue and red sides have an equal opportunity of winning. Observations gathered from gameplay are presented in Tables 7.1 - 7.6. Lastly, Table 7.7 shows a t-test that summarizes, and tests, the observations.

Table 7.1. BTTeam (red side) vs. RandomTeam, average score. The observed results after 50 games with BTTeam starting on the red side and RandomTeam on the blue side. The hypothesized mean score of 18 means that the team starting on the red side (BTTeam) is expected to win all games. The observed result was a mean of 18, as hypothesized.

Hypothesized mean   x̄   Standard deviation   Sample size
18   18   0   50


Table 7.2. BTTeam (red side) vs. RandomTeam, win/loss distribution. The percentage of wins for the teams starting on the red and blue sides of the map, respectively. BTTeam started on the red side. RandomTeam started on the blue side. BTTeam won all games.

Red wins (%)   Blue wins (%)   Ties (%)   Sample size
100   0   0   50

Table 7.3. RandomTeam (red side) vs. BTTeam, average score. The observed results after 50 games with RandomTeam starting on the red side and BTTeam on the blue side. The hypothesized mean score of -18 means that the team starting on the red side (RandomTeam) is expected to lose all games. The observed result was a mean of -18, as hypothesized.

Hypothesized mean   x̄   Standard deviation   Sample size
-18   -18   0   50

Table 7.4. RandomTeam (red side) vs. BTTeam, win/loss distribution. The percentage of wins for the teams starting on the red and blue sides of the map, respectively. RandomTeam started on the red side. BTTeam started on the blue side. BTTeam won all games.

Red wins (%)   Blue wins (%)   Ties (%)   Sample size
0   100   0   50

Table 7.5. BTTeam (red side) vs. BTTeam, scores. The observed results after 200 games with BTTeam playing against itself. The hypothesized mean score of 0 means that both sides are expected to score similarly and to win as many games. Although the observed mean is close to zero (5.5 · 10^-2), the minimum and maximum scores are -18 and 18, respectively, resulting in a standard deviation of 7.76.

Hypothesized mean   x̄   min   max   SD   Sample size
0   5.5 · 10^-2   -18   18   7.76   200

Table 7.6. BTTeam (red side) vs. BTTeam, win/loss distribution. The percentage of wins for the teams starting on the red and blue sides of the map, respectively. BTTeam played both the red and blue sides. The difference in win rates is within 1%.

Red wins (%)   Blue wins (%)   Ties (%)   Sample size
35   36   29   200


Table 7.7. BTTeam (red side) vs. BTTeam, t-test. A one-sample t-test of the observed results of BTTeam vs. BTTeam over 200 games. The two-tailed P-value equals 0.92. Thus, we fail to reject the null hypothesis at a 95% confidence level. The 95% confidence interval is (-1.02, 1.13).

µ0   x̄   SD   SEM   P-value   CI   n   CL (%)
0   5.5 · 10^-2   7.76   0.55   0.92   (-1.02, 1.13)   200   95
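For reference, the entries of Table 7.7 follow from the sample statistics by the usual one-sample t-test formulas. The check below (using the large-sample critical value 1.96, which is our assumption about how the interval was computed) reproduces the reported values up to rounding:

% One-sample t-test quantities from \bar{x} = 0.055, s = 7.76, n = 200:
\mathrm{SEM} = \frac{s}{\sqrt{n}} = \frac{7.76}{\sqrt{200}} \approx 0.549, \qquad
t = \frac{\bar{x} - \mu_0}{\mathrm{SEM}} \approx \frac{0.055}{0.549} \approx 0.10, \qquad p \approx 0.92,
% and the 95% confidence interval
\bar{x} \pm 1.96 \cdot \mathrm{SEM} \approx 0.055 \pm 1.96 \cdot 0.549 \approx (-1.02,\ 1.13).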

7.1.2 Establishing a Baseline

This section presents the results of the experiments performed to establish a baseline. All tests are performed for Monte Carlo Tree Search with a Random Playout Policy (MCTSRP), playing against a team of agents making decisions based on Behavior Trees (BTTeam). Figures 7.1 and 7.2 show the distribution of wins, losses, and ties as the playout depth and the number of iterations increase. We denote the configuration of MCTSRP that performs n iterations, each with a playout depth of d, as MCTSRP(n,d). The scores of the best-observed configuration, MCTSRP(640,4), are shown in Table 7.8.

Figure 7.1: MCTSRP(160,_) vs. BTTeam, win/loss distribution. Win, loss, and tie rates for MCTSRP when played against BTTeam over 200 games. The number of iterations used by MCTSRP was fixed at 160. The playout depth (1-32) is shown on the x-axis. The observed result is a maximum win rate at depth 4. As the depth increases past this point, the win rate decreases while the loss rate increases; see Section 7.2.2.


Figure 7.2: MCTSRP(_,4) vs. BTTeam, win/loss distribution. Win, loss, and tie rates for MCTSRP when played against BTTeam over 200 games. The number of iterations (160-640) used by MCTSRP is shown on the x-axis. The search depth was fixed at 4. The results show a growth in win rate up to 320 iterations. Although the win rate increases after this point, the rate at which it increases is significantly lower.

Table 7.8. MCTSRP(640,4) vs. BTTeam, observed results. The average score and the distribution of wins, losses, and ties observed over 200 games between MCTSRP(640,4) and BTTeam. The results show a significantly favored win rate (61%) for MCTSRP. Nevertheless, the average score is -0.21.

Average Score   Wins (%)   Losses (%)   Ties (%)   Sample size
-0.21   61   20   19   200

7.1.3 Behavior Tree Playout Policy

This section presents the results of the experiments performed to test Monte Carlo Tree Search using a Behavior Tree Playout Policy (MCTSBTP). The adversary is BTTeam for all tests. Figure 7.3 shows win rates using a basic heuristic. Figures 7.4 and 7.5 show how the distribution of wins, losses, and ties changes as the number of iterations and the playout depth increase for MCTSBTP. Figure 7.6 shows how the average score of MCTSBTP changes as the number of iterations increases. Table 7.9 presents the scores of the best-observed configuration, MCTSBTP(320,4). Lastly, Table 7.10 presents the observed changes in scores from using the Behavior Tree Policy instead of the Random Policy for MCTS.


Figure 7.3: MCTSBTP(_,1) vs. BTTeam (initial heuristic), win/loss distribution. Win, loss, and tie rates for MCTSBTP when played against BTTeam over 200 games. The number of iterations (1-40) used by MCTSBTP is shown on the x-axis. The search depth was fixed at 1. The observed result is a static distribution of wins, losses, and ties as the number of iterations increases beyond 10, when using the initial heuristic.

Figure 7.4: MCTSBTP(8,_) vs. BTTeam, win/loss distribution. Win, loss, and tie rates for MCTSBTP when played against BTTeam over 200 games. The number of iterations used by MCTSBTP was fixed at 8. The playout depth (1-16) is shown on the x-axis. The observed result is an increase in win rate until playout depth 4, after which point the win rate decreases while the loss rate increases.


Figure 7.5: MCTSBTP(_,4) vs. BTTeam, win/loss distribution. Win, loss, and tie rates for MCTSBTP when played against BTTeam over 200 games. The number of iterations (40-340) used by MCTSBTP is shown on the x-axis. The search depth was fixed at 4. The observed result is a significant increase in win rate for the first 40 iterations, after which point the win rate slowly increases until 320 iterations while the loss rate slowly decreases.

Figure 7.6: MCTSBTP(_,4) vs. BTTeam, average score. The average score for MCTSBTP when played against BTTeam over 200 games. The number of iterations (40-340) used by MCTSBTP is shown on the x-axis. The search depth was fixed at 4. The observed result is a steadily growing average score as the number of iterations increases.


Table 7.9. MCTSBTP(320,4) vs. BTTeam, win/loss distribution. The average score and the distribution of wins, losses, and ties observed over 200 games between MCTSBTP(320,4) and BTTeam. The results show a significantly favored win rate (81%) for MCTSBTP. Furthermore, the average score of 3.01 is significantly greater than 0.

Average Score   Wins (%)   Losses (%)   Ties (%)   Sample size
3.01   81   18   1   200

Table 7.10. MCTSBTP(320,4) and MCTSRP(640,4), result differences. The observed changes in scores from running MCTSBTP(320,4) instead of MCTSRP(640,4). The changes are presented as the signed difference, calculated by subtracting MCTSRP's observations from MCTSBTP's. The results show a significant favor in the average score (an increase of 3.22) and win rate (81% instead of 61%) for MCTSBTP. The number of tied games is significantly lower (1% instead of 19%). The loss rate is similar, at 18% and 20%, respectively.

Δscore   Δwins (%)   Δlosses (%)   Δties (%)
3.22   20   -2   -18

7.2 Discussion

7.2.1 Establishing Fairness

BTTeam vs. RandomTeam

Tables 7.1 and 7.2 show that BTTeam maintains a 100% win rate over 100 games when playing against RandomTeam, with each game scoring 18 or -18 when playing on the red and blue side, respectively. This is the lowest score that results in a terminal winning state for any amount of remaining game time on the given layout. Therefore, we conclude BTTeam to be a non-trivial strategy by Definition 6.3.

BTTeam vs. BTTeam

Table 7.5 shows that the sample mean is close to zero (5.5 · 10^-2) for BTTeam vs. BTTeam. However, the minimum and maximum scores are -18 and 18, respectively, contributing to a large sample variance; the standard deviation is 7.76. As a result, as can be seen in Table 7.7, we fail to reject the null hypothesis µ0 = 0 at the 95% confidence level. However, we can establish a 95% confidence interval of (-1.02, 1.13). The win rate after 200 samples is 35% for the red team and 36% for the blue team.

We conclude that the above observations do not show a significant bias towards one side of the map for BTTeam. Thus, in conjunction with the conclusion that BTTeam is a non-trivial strategy, we conclude that CTF Pac-Man is relatively fair by Definition 6.4. Therefore, we omit testing agents on both sides in future tests and simply refer to the agents by their names.

7.2.2 Establishing a Baseline

Figure 7.1 shows that the maximum win rate is achieved at playout depth 4. To analyze this, we need to consider the Partial Observability (PO) property of CTF Pac-Man. For any given game state observed by an agent, there is a difference between the state seen by that agent and the actual game state, e.g., as illustrated in Figure 7.7. Furthermore, when doing simulations, an agent cannot request new observations; all explored successor states in a simulation are expanded from the original observation. Because the initial observation contains an error, this error is propagated to all successor states. Thus, as the depth of playouts increases, so does the probability that the error will cause our agent to make an incorrect decision.

Figure 7.7: Suppose Pac-Man observes a ghost at a distance of 15 tiles, illustrated by the gray ghost, while the ghost is actually 6 tiles away, illustrated by the blue ghost. A simulation might then lead Pac-Man to think he can enter the alley, collect the food, and get out, giving the simulation a high score, whereas in actuality Pac-Man would not be able to get out.

Consequently, it is not possible to perfectly predict where the ghosts will be, so increasing the depth increases the risk that a ghost is in Pac-Man's path in the actual successor state while it is not in the simulation. The maximum distance at which Pac-Man can make accurate observations is 5. Figure 7.1 shows that increasing the playout depth past this decreases the win rate. As playouts are performed from leaves, not from the root, the maximum playout depth that keeps Pac-Man within a distance of at most 5 per iteration is 4. Therefore, we fix the playout depth at 4.

Figure 7.2 shows that the win rate increases rapidly with the number of iterations up until 320, after which the win rate increases slowly until 640, which is the largest number of iterations we may run within our time budget. The win rate is expected to increase further; however, there is a limit to how high it may go, considering the PO property. Also, since the desirability comparison between two states is performed by a heuristic, this further limits the maximum achievable win rate. Thus, the win rate is expected to increase, at a diminished rate, until reaching the upper bound imposed by these properties.

Table 7.8 shows a win rate of 61% with a loss rate of 20%. Although the win rate is significantly higher than the loss rate, the average score is -0.21, indicating that, even though wins are more frequent than losses, the magnitude of the score in MCTSRP's victories is significantly smaller than in its defeats. This is not entirely unexpected for a random policy. As the map is mirrored, there are exactly two cases in which a game can be lost: Pac-Man either died, or found a longer path than the adversary Pac-Man. We assume the given heuristic is balanced such that Pac-Man does not die unnecessarily; then the former case is negligible. Furthermore, a random agent may sometimes explore paths that are bad. Most of the time, exploring bad paths does not matter, as other configurations will be evaluated in later iterations and the algorithm can recover. However, it is conceivable that, given a limited number of iterations and search depth, it may not be possible to recover from all configurations.

7.2.3 Behavior Tree Playout Policy

Initial Heuristic

A reader may expect the results of MCTSBTP(1,1) vs. BTTeam to be similar to those of BTTeam vs. BTTeam, provided that they both use the same Behavior Tree. The observed results show a significant increase in win rate, 59%, compared to 36%. To explain this, we need to consider how one iteration of MCTSBTP differs from simply doing one evaluation of the Behavior Tree. In the expansion step of MCTS, all children are added to the tree. That is, when performing one iteration, up to five children may be evaluated (one for each possible move). Then, for each child, a playout of depth 1 consists of performing the given action, followed by running the BT once. That is, a sequence of exactly two actions is performed for each child, compared to exactly one action for BTTeam vs. BTTeam. Furthermore, it should be noted that while the increase in win rate is significant, the decrease in loss rate is not; the increase in win rate is thus the result of significantly fewer games resulting in draws.
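The following sketch illustrates this accounting for a single MCTSBTP(1,1)-style iteration; it is a simplified reading of the mechanism described above, and the legal_actions, step, bt_policy, and heuristic callables are hypothetical stand-ins rather than the thesis code:

# One iteration with playout depth 1: every legal child of the root is
# expanded, and each playout performs the child's action followed by one
# action chosen by the Behavior Tree, i.e., two actions per evaluated child.
def one_iteration_depth1(root_state, legal_actions, step, bt_policy, heuristic):
    scores = {}
    for action in legal_actions(root_state):      # up to five moves in Pac-Man
        child_state = step(root_state, action)    # action chosen by the search
        rollout = step(child_state, bt_policy(child_state))  # plus one BT action
        scores[action] = heuristic(rollout)       # score the two-action lookahead
    return max(scores, key=scores.get)            # best move from the root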

Figure 7.3 shows the win rate as a function of the number of iterations. The expected result is an increase in win rate as the number of iterations grows; the observed result is a static win rate. We hypothesize this to be the result of the heuristic used being too simple. That is, if the heuristic cannot be used to indicate which of two game states is better, then increasing the number of iterations will not yield a better result; the heuristic imposes an upper bound on the maximum achievable score. We test this hypothesis by repeating the tests using an improved heuristic (see Section 5.2).

Improved Heuristic

Following the use of the initial heuristic in the first set of MCTSBTP tests (Figure 7.3), the improved heuristic was used in all subsequent tests. Using the improved heuristic, the win rate and average score both increase as the number of iterations increases (Figures 7.5 and 7.6). Therefore, we conclude that the quality of the heuristic is important.

Figure 7.4 shows the distribution of wins, losses, and ties as the playout depth changes. As with MCTSRP, the maximum win rate is achieved at depth 4, and again the win rate decreases as the playout depth is increased beyond this point. This behavior is expected, as both MCTSBTP and MCTSRP use the same heuristic and, more importantly, the PO property still holds. Thus, we confirm our previous observation and again fix the playout depth at 4.

Figure 7.5 shows the distribution of wins, losses, and ties as the number of iterations changes. The maximum number of iterations allowed by our time limit is 320. A rapid increase in win rate is observed during the first 40 iterations, after which the growth is slow. Again, this behavior is comparable to what was observed for MCTSRP.

7.2.4 Comparison of MCTSRP and MCTSBTP

Table 7.10 shows the difference in results between MCTS using the proposed Behavior Tree Playout Policy and a traditional Random Policy. The number of iterations (and to some extent the search depth) that is explorable, given the computational time budget, is significantly larger for MCTSRP than for MCTSBTP. Nonetheless, the experimental results show a clear increase in both score and win rate for MCTSBTP. Since the selection, expansion, and back propagation steps are the same for both approaches, the difference must stem from the difference in Playout Policy. Thus, we confirm our hypothesis that Monte Carlo Tree Search can be improved using Behavior Trees for high-level decisions.

Next, consider an instance of CTF Pac-Man without any ghosts. In such an instance, finding an optimal path to collect all pellets can, for any map, be reduced to the Traveling Salesman Problem. However, because the actual maps in CTF Pac-Man contain choke points and are generally small, the practical difficulty in finding close-to-optimal paths is significantly reduced. Therefore, we conjecture that MCTSRP is permitted enough iterations to find a close-to-optimal path to the pellets if the adversary ghosts are disregarded. Consequently, the difference in score must mean that MCTSBTP is better at avoiding dropping pellets than MCTSRP. As the only way of dropping pellets is to die by being tagged by an opposing ghost, we conclude that MCTSBTP is better able to predict the movement of adversarial agents. Thus, we conclude that MCTSBTP extends MCTS with heuristic predictions of adversarial actions.

Chapter 8

Conclusions and Further Work

In this study, we have shown how Monte Carlo Tree Search can be combined with Behavior Trees, resulting in the proposed MCTSBTP algorithm described in Chapter 5. Our experimental results show that MCTSBTP can achieve both a higher win rate and a higher mean score than MCTS using a Random Policy, given the same computational time budget (Section 7.1.3). Furthermore, we have concluded that MCTSBTP can add heuristic predictions of adversarial actions to MCTS, following the discussion in Section 7.2.4. By combining MCTS and BTs, the proposed algorithm retains the tree search element of Monte Carlo Tree Search, finding globally good actions, while maintaining the ease of use, easy modeling, and maintainability of Behavior Trees.

8.1 Further Work

In the proposed algorithm, for any given environment, two inputs are required: a behavior tree that models the agents' behaviors in that environment, and a heuristic which, given a game state, returns a numeric value. The magnitude of this value should correlate directly with how desirable the game state is. How the heuristic is designed depends on the environment; in the general case, it is conceivable that it combines different features of the game state and assigns a weight to each considered feature. Merrill discusses how to build utility decisions into existing Behavior Trees in [22]. Denote such a Behavior Tree, with utility decisions, as a Utility Tree. Now, consider a type of Utility Tree, similar to a regular Behavior Tree, but with an additional numerical value associated with each node, indicating how desirable a descendant leaf of that node is. Here, leaves correspond to available actions. Given that a heuristic is designed by looking at features and associating weights, this corresponds directly to building utilities into such a UT. As such, the heuristic could instead be encoded into the BT, resulting in a UT. Given a UT, the proposed algorithm can be modified as shown in Algorithm 2.

Algorithm 2: Suggested modified algorithm, MCTSUT.

itr ← 0
while itr < N do
    1. Selection: as before.
    2. Expansion: do not expand the selected node. Instead, consider this node as the child to be evaluated in step 3. Denote this child by C.
    3. Simulation: start by walking from the root to the selected child C.
       depth ← 0
       while depth < M do
           a. Select the descendant D of C, where D is the leaf with the highest score. If there are no leaves, let D = C.
           b. Starting in D, evaluate UT.
           c. Add each child state of D.
           d. Back-propagate up to C.
           depth ← depth + 1
    4. Back propagation: as before.
    itr ← itr + 1
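As one possible reading of the simulation step of Algorithm 2 (a sketch under our own assumptions; the node and utility-tree interfaces below are hypothetical and not part of the thesis), the inner loop could look roughly as follows:

# Sketch of steps 3.a-3.d: repeatedly pick the best-scoring leaf below the
# selected child C, evaluate the Utility Tree there, expand its children, and
# back-propagate the utility up to C. All node/UT methods are assumed APIs.
def simulate_with_ut(c_node, utility_tree, max_depth):
    for _ in range(max_depth):
        leaves = c_node.descendant_leaves()                # a. candidate leaves under C
        d_node = max(leaves, key=lambda n: n.score) if leaves else c_node
        utility = utility_tree.evaluate(d_node.state)      # b. evaluate UT starting in D
        d_node.expand_children()                           # c. add each child state of D
        node = d_node                                      # d. back-propagate up to C
        while node is not c_node:
            node.update(utility)
            node = node.parent
        c_node.update(utility)
    return c_node.score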

As expansions are now handled in the simulation step, if the simulation depth is greater than 1, simulations will cause nodes to be expanded in a depth-first manner. Therefore, the exploration vs. exploitation constant, used by UCT (or similar) in the selection step, should be reduced by the inverse of the simulation depth, such that the breadth of the tree is also explored.
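One way to read this adjustment (our assumption, since the scaling is only stated informally above) is to divide the UCT exploration constant C by the simulation depth M in the selection rule:

% Assumed scaling of the exploration constant, with n the parent visit count
% and n_i the visit count of child i:
C' = \frac{C}{M}, \qquad
\mathrm{UCT}(i) = \bar{X}_i + C' \sqrt{\frac{\ln n}{n_i}}.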

Arguably, such an algorithm, which encodes the heuristic in the BT, greatly increases the ease with which an environment can be modeled. We suggest this as future work, which would entail comparing the modified algorithm to the one presented in this study. Lastly, how closely the design of such an algorithm should mirror Algorithm 2 needs further correctness and computational complexity arguments.

Bibliography

[1] Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.

[2] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[3] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

[4] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[5] Alex Champandard. Understanding behavior trees. AiGameDev.com. Last visited June 2017.

[6] Guillaume M. J-B. Chaslot, Mark H. M. Winands, H. J. van den Herik, Jos W. H. M. Uiterwijk, and Bruno Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(3):343–357, 2008.

[7] CheeChian Cheng and Norman Carver. Cooperative games with Monte Carlo tree search. In Proceedings of the International Conference on Artificial Intelligence (ICAI), page 99. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2015.

[8] M. Colledanchise and P. Ögren. How behavior trees modularize hybrid control systems and generalize sequential behavior compositions, the subsumption architecture, and decision trees. IEEE Transactions on Robotics, 33(2):372–389, April 2017.


[9] Michele Colledanchise and Petter Ögren. How behavior trees modularize robustness and safety in hybrid systems. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 1482–1488. IEEE, 2014.

[10] Jan Kristian Haugland. On-line Encyclopedia of Integer Sequences, A094777, 2004.

[11] Damian Isla. Handling complexity in the Halo 2 AI. In Game Developers Conference, volume 12, 2005.

[12] Damian Isla. Halo 3: Building a better battle. In Game Developers Conference, 2008.

[13] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.

[14] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

[15] Levente Kocsis, Csaba Szepesvári, and Jan Willemson. Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep., 1, 2006.

[16] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[17] Chong-U Lim, Robin Baumgarten, and Simon Colton. Evolving behaviour trees for the commercial game DEFCON. In European Conference on the Applications of Evolutionary Computation, pages 100–110. Springer, 2010.

[18] Michael Lederman Littman. Algorithms for sequential decision making. PhD thesis, Brown University, 1996.

[19] A. Bryan Loyall. Believable agents: Building interactive personalities. PhD thesis, Mitsubishi Electric Research Laboratories, 1997.

[20] Carol Luckhart and Keki B. Irani. An algorithmic solution of n-person games. In AAAI, volume 86, pages 158–162, 1986.

[21] Alejandro Marzinotto, Michele Colledanchise, Christian Smith, and Petter Ögren. Towards a unified behavior trees framework for robot control. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 5420–5427. IEEE, 2014.

[22] Bill Merrill. Building utility decisions into your existing behavior tree. Game AI Pro: Collected Wisdom of Game AI Professionals, page 127, 2013.


[23] Ian Millington and John Funge. Artificial Intelligence for Games. CRC Press, 2016.

[24] Santiago Ontañón. The combinatorial multi-armed bandit problem and its application to real-time strategy games. In Proceedings of the Ninth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, pages 58–64. AAAI Press, 2013.

[25] Santiago Ontañón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.

[26] Ricardo Palma, Pedro Antonio González-Calero, Marco Antonio Gómez-Martín, and Pedro Pablo Gómez-Martín. Extending case-based planning with behavior trees. In Twenty-Fourth International FLAIRS Conference, 2011.

[27] Tom Pepels and Mark H. M. Winands. Enhancements for Monte-Carlo tree search in Ms Pac-Man. In Computational Intelligence and Games (CIG), 2012 IEEE Conference on, pages 265–272. IEEE, 2012.

[28] Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.

[29] Spyridon Samothrakis, David Robles, and Simon Lucas. Fast approximate max-n Monte Carlo tree search for Ms Pac-Man. IEEE Transactions on Computational Intelligence and AI in Games, 3(2):142–154, 2011.

[30] Frederik Christiaan Schadd. Monte-Carlo search techniques in the modern board game Thurn and Taxis. PhD thesis, Maastricht University, 2009.

[31] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[32] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.

[33] Weijia Wang and Michele Sebag. Hypervolume indicator and dominance reward based multi-objective Monte-Carlo tree search. Machine Learning, 92(2-3):403–429, 2013.

[34] Ben G. Weber, Peter Mawhorter, Michael Mateas, and Arnav Jhala. Reactive planning idioms for multi-scale game AI. In Computational Intelligence and Games (CIG), 2010 IEEE Symposium on, pages 115–122. IEEE, 2010.
