
Learning Behaviors in Agents Systems with Interactive Dynamic Influence Diagrams

Ross Conroy, Teesside University, Middlesbrough, UK
[email protected]

Yifeng Zeng, Teesside University, Middlesbrough, UK
[email protected]

Marc Cavazza, Teesside University, Middlesbrough, UK
[email protected]

Yingke Chen, University of Georgia, Athens, GA, USA
[email protected]

Abstract

Interactive dynamic influence diagrams (I-DIDs) are a well recognized decision model that explicitly considers how multiagent interaction affects individual decision making. To predict the behavior of other agents, I-DIDs require models of the other agents to be known ahead of time and manually encoded. This becomes a barrier to I-DID applications in human-agent interaction settings, such as the development of intelligent non-player characters (NPCs) in real-time strategy (RTS) games, where models of other agents or human players are often inaccessible to domain experts. In this paper, we use automatic techniques for learning the behavior of other agents from replay data in RTS games. We propose a learning algorithm that improves over existing work by building a full profile of agent behavior. This is the first time that data-driven learning techniques are embedded into the I-DID decision making framework. We evaluate the performance of our approach on two test cases.

1 Introduction

Interactive dynamic influence diagrams (I-DIDs) [Doshi et al., 2009; Zeng and Doshi, 2012] are a well recognized framework for sequential multiagent decision making under uncertainty. They are graphical decision models for individual agents in the presence of other agents who are themselves acting and observing, and whose actions affect the subject agent. They explicitly model how the other agents behave over time, based on which the subject agent's decisions are optimized. Importantly, I-DIDs have the advantage of a graphical representation that makes the embedded domain structure explicit by decomposing domain variables, which yields computational benefits compared to enumerative representations such as the partially observable Markov decision process (POMDP) [Smallwood and Sondik, 1973] and the interactive POMDP [Gmytrasiewicz and Doshi, 2005] [Doshi et al., 2009].

I-DIDs integrate two components into the decision models: one basic decision model represents the subject agent's decision making process, while the other predicts the behavior of other agents over time. Traditionally, I-DIDs require domain experts to manually build models of other agents and solve the models in order to obtain the predicted behavior of the other agents. However, it is rather difficult to construct such models, particularly in a human-agent interaction setting where modeling humans is not easy.

In this paper, we use automatic techniques for learning the behavior of other agents in I-DIDs. We use a real-time strategy (RTS) environment to support experiments in learning human behavior. The main advantage of RTS games is that they provide a realistic environment in which users coexist with autonomous agents and have to make decisions in limited time in an uncertain and partially observable environment. We use simplified situations from StarCraft 1 as examples for learning behaviour. The choice of StarCraft is motivated by the availability of replay data [Synnaeve and Bessiere, 2012], the interest it has attracted in the AI community [Ontanon et al., 2013], and the applicability of learning models such as hidden Markov models and Bayesian models [Synnaeve and Bessiere, 2011], although POMDPs are not directly applicable because of the size of the state space [Ontanon et al., 2013].

The learning task faces the challenge of addressing data scarcity and coverage problems because the replay data may not cover the full profile of players' behavior, particularly for large planning horizons. A partial set of players' behavior needs to be expanded into a full set so that we can build a complete I-DID model. To avoid randomness in the behavior expansion, we fill in the missing actions by conducting behavioral compatibility tests on the observed behaviors. The proposed technique is inspired by the observation that some players tend to perform identically to other players and may execute similar actions in different game periods. Hence the missing actions are complemented using the observed behavior. We theoretically analyze the properties of the learning technique and its influence on the planning quality. In addition, we conduct experiments in the large UAV (unmanned aerial vehicle) benchmark [Zeng and Doshi, 2012] as well as the RTS game StarCraft. To the best of our knowledge, this is the first time that a data-driven behavior learning technique is embedded into the I-DID model, which avoids the difficulty of manually developing models of other agents.

1 http://eu.blizzard.com/en-gb/games/sc/


2 Related Works

Current I-DID research assumes that models of other agents can be manually built, or it needs to consider a large number of random candidate models when the actual models of the other agents are unknown [Zeng and Doshi, 2012]. Hence existing I-DID research mainly focuses on reducing the solution complexity by compressing the model space of other agents and scaling up planning horizons [Doshi et al., 2010; Zeng and Doshi, 2012]. The improved scalability benefits from clustering models of other agents that are behaviorally equivalent, where the behavior can be either complete or partial [Zeng et al., 2011; 2012]. Recently, identifying other agents' on-line behavior has been used to improve I-DID solutions [Chen et al., 2015], which still requires candidate models of the other agents to be constructed. Meanwhile, Panella and Gmytrasiewicz [Panella and Gmytrasiewicz, 2015] attempt to learn models of other agents using Bayesian nonparametric methods in a small problem. Chandrasekaran et al. [Chandrasekaran et al., 2014] use reinforcement learning techniques for inferring agent behavior, which turns out to make it extremely hard to reveal complete behavior.

Learning agent behavior is important in building autonomous systems, particularly with human-agent interaction [Suryadi and Gmytrasiewicz, 1999; Doshi et al., 2012; Loftin et al., 2014; Zilberstein, 2015], as well as in developing social interactive media platforms [Salah et al., 2013]. Carmel and Markovitch [Carmel and Markovitch, 1996] propose finite automata to model strategies of intelligent agents and found it difficult to learn the model. Suryadi and Gmytrasiewicz [Suryadi and Gmytrasiewicz, 1999] learn influence diagrams [Howard and Matheson, 1984] that model agents' decision making processes. Instead of learning a specific model of agents, Albrecht and Ramamoorthy [Albrecht and Ramamoorthy, 2014] identify a type of agent behavior from a predefined set using a game-theoretical concept. Similarly, Zhuo and Yang [Zhuo and Yang, 2014] exploit action models through transfer learning techniques to improve agent planning quality. In parallel, learning other agents is one of the core issues in ad-hoc team settings - a recognized challenge in agent research [Stone et al., 2010]. Barrett and Stone [Barrett and Stone, 2015] recently simplify MDP learning to develop cooperative strategies for teammates without prior coordination.

In RTS games, Wender and Watson [Wender and Watson, 2012] use reinforcement learning to learn the behavior of units in a small-scale StarCraft combat scenario. Cho et al. [Cho et al., 2013] take advantage of StarCraft replay data that has been labelled with the winning strategies, and improve NPC intelligence by predicting the opponent's strategies.

Most of the aforementioned research still focuses on learning procedural models for single-agent decision making and does not integrate behavior learning techniques with decision models in a multiagent setting. Using a simple example inspired by StarCraft, we investigate the proposed learning algorithm in the I-DID framework; the algorithm is also applicable to most general multiagent decision models that require modeling of other agents.

Figure 1: StarCraft scenario (a) and a unit's behavior represented by the policy tree (b).

Figure 2: A DID modeling agent j in StarCraft.

3 Background and Initial Results

We briefly review I-DIDs through an example of a combat scenario in StarCraft. More formal details are found in [Doshi et al., 2009; Zeng and Doshi, 2012]. After that, we provide initial results of applying I-DIDs in the game.

3.1 Interactive DIDs

Example 1 (StarCraft Scenario) Fig. 1a shows a screenshot of the 1 vs 1 combat scenario. Each unit is a Terran Goliath, a unit which attacks with a cannon with a limited range of approximately half its visibility radius. Each agent can only see other agents within its visibility radius. In order for a unit to execute an attack on an enemy unit, the enemy unit must be visible to the unit and be within range of the unit's weapons.

Influence diagrams (IDs) [Howard and Matheson, 1984] use three types of nodes, namely chance nodes (ovals), decision nodes (rectangles) and utility nodes (diamonds), to represent the decision making process of a single agent in uncertain settings. When influence diagrams are unrolled over multiple time steps, they become dynamic influence diagrams (DIDs) [Tatman and Shachter, 1990], representing how the agent plans its actions sequentially.

Figure 3: An I-DID modeling agent i that reasons with j's behavior.

Fig. 2 shows the DID model for controlling a unit/agent in StarCraft. The unit state (S^{t+1}) is whether or not the unit is under attack (s1 = UnderAttack, s2 = NotUnderAttack), which is influenced by the previous state (S^t) and action (D^t). At time t the unit receives observations (O^t: o1 = See and o2 = DontSee) influenced by the state (S^t), where a unit that is under attack has a higher probability of seeing the enemy unit than one that is not under attack. Both the state (S^{t+1}) and reward (R) are influenced by the decision (D^t: d1 = Attack and d2 = Escape). Attacking an enemy unit gives a higher probability of being under attack in the next time step, and escaping gives a lower probability. Finally, the reward (R) given to the unit represents the difference in change of health between the unit and the enemy unit, giving a positive reward if the enemy unit loses more health than the unit, and vice versa.
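To make the structure of this level-0 DID concrete, the sketch below encodes its chance, decision and utility nodes as flat tables in Python. All numeric probabilities and rewards are illustrative assumptions of ours; the paper does not publish the exact CPTs of Fig. 2.

# A minimal sketch of the level-0 DID in Fig. 2 as flat tables. All numbers are
# illustrative assumptions, not the CPTs used in the paper.
import numpy as np

states = ["UnderAttack", "NotUnderAttack"]        # S^t
actions = ["Attack", "Escape"]                    # D^t
observations = ["See", "DontSee"]                 # O^t

# T[a][s, s'] = Pr(S^{t+1}=s' | S^t=s, D^t=a): attacking makes being under
# attack at the next step more likely, escaping makes it less likely.
T = {"Attack": np.array([[0.8, 0.2],
                         [0.7, 0.3]]),
     "Escape": np.array([[0.4, 0.6],
                         [0.2, 0.8]])}

# O[s, o] = Pr(O^t=o | S^t=s): a unit under attack is more likely to see the enemy.
O = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# R[s, a]: stand-in for the health-difference reward described in the text.
R = np.array([[ 5.0, -1.0],
              [ 2.0, -2.0]])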

Extending DIDs to multiagent settings, I-DIDs introduce one new type of node, called the model node (hexagon), that contains candidate models of other agents. Fig. 3 shows the I-DID modeling the aforementioned combat scenario, in which agent i is the subject agent (NPC) at level l and agent j the other agent (player) at level l-1. The strategy level l allows for the nested modeling of i by the other agent j. The strategy level implies the sort of strategic reasoning between agents - what does agent i think that agent j thinks that i thinks. A level 0 agent, represented by a DID, does not further model the other agent. The dashed arc from the model node to the chance node is called the policy link; it solves j's models and enters a probability distribution over j's predicted actions into the chance node. The bold arc between two model nodes is called the model update link; it specifies how j's models are updated as j acts and receives observations over time.

In the model node M^t_{j,l-1} of Fig. 3, let us consider two models of agent j (m^{t,1}_{j,l-1} and m^{t,2}_{j,l-1}) differing only in their initial beliefs over the physical states. For example, the two models have the same DID structure as shown in Fig. 2 while they have different marginal probabilities in the chance node S^t. Solving each DID model (through conventional DID techniques [Tatman and Shachter, 1990]) results in one policy tree per model. The policy tree for each model is j's optimal plan, in which one action is specified for every possible observation at each time step. Fig. 1b shows the policy tree obtained by solving the model m^{t,1}_{j,l-1} for T=3 time steps.

According to the policy tree, we specify j's optimal actions in the chance node A^t_j of the I-DID at t=0. After agent j executes the action at t=0 and receives one observation at t=1, it updates its beliefs in the chance node S^1 (in Fig. 2). One update of m^{0,1}_j (T=3) given an observation (o_j) generates a new DID model m^{1,1}_j that plans for two time steps. Consequently, four new models are generated at t=1 and placed into the model node M^1_j. Solving each of them generates the optimal actions of agent j at t=1. The model update continues until only one horizon remains.
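The model update link described above can be read as a Bayesian belief update applied once per possible observation. The sketch below reuses the illustrative tables T, O and observations from the previous sketch and assumes a solve_did routine that returns the optimal first action of a DID for a given belief and horizon (solving the DID itself is beyond this sketch); it shows how two models and two observations yield the four t=1 models mentioned in the text.

def belief_update(b, a, o, T, O, observations):
    """Bayes update of j's belief over S after taking action a and observing o."""
    o_idx = observations.index(o)
    predicted = b @ T[a]                     # Pr(S^{t+1} | b, a)
    unnormalized = predicted * O[:, o_idx]   # weighted by Pr(o | S^{t+1})
    return unnormalized / unnormalized.sum()

def expand_models(models, solve_did, T, O, observations):
    """One application of the model update link on (belief, horizon) pairs."""
    new_models = []
    for b, horizon in models:
        a = solve_did(b, horizon)            # optimal first action of the solved DID
        for o in observations:
            b_next = belief_update(b, a, o, T, O, observations)
            new_models.append((b_next, horizon - 1))
    return new_models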

Hence solving a level l I-DID involves two phases. We first need to solve the models of other agents at level l-1 and expand the I-DID with the update of the other agents' models. Then, we can transform the I-DID into a conventional DID and solve the DID to obtain the subject agent's plan.

3.2 Initial Results

We build level 0 DID (and level 1 I-DID) models for the units in Fig. 1a. Initial tests show how modeling the behaviour of other agents gives a tactical advantage by predicting the actions of an opponent. Different time horizons are compared to determine if there is any significant reward difference with greater time horizons. DID agents follow the generated policies in each battle. The I-DID agent has complete knowledge of all of the potential models of the DID agent and generates a policy to follow for all battles. For both battle scenarios the difference in remaining health is recorded and used to calculate an average. Table 1 shows the average health for unit i when i competes with j given: 1) both i and j are modeled by DIDs; 2) unit i is modeled by a level 1 I-DID reasoning with level 0 DIDs of unit j. Results show a significant advantage to the I-DID agent as a result of modeling the DID agent as part of planning.

               T=3    T=4    T=5
DID vs DID     19.8   13.7   35.1
I-DID vs DID   71.7   83.6   95.4

Table 1: Average rewards achieved by a DID- or I-DID-controlled agent i vs a DID-controlled agent j.

Note that we do not make claims about the applicability of I-DIDs to complex, realistic StarCraft situations. However, DIDs are well-suited to represent the decision making process under partial observability for iterative moves, and their I-DID extension allows for the modeling of other agents. This suggests they should have the ability to learn behavior in simple configurations without suffering from the complexity of game states, and consequently that StarCraft can constitute an appropriate testbed for our behavior learning experiments.

4 Data-driven Behavior Learning

Given manually built models of other agents, we solve their models and integrate their solutions into the expansion of the subject agent's I-DID. However, domain knowledge is not always accessible for constructing precise models of other agents. Although modeling how other agents make decisions is important in the I-DID representation, the subject agent's decisions are only affected by the predicted behavior of the other agents, which is the solution of the other agents' models. Hence, it is equally effective if either the models or the behavior ascribed to the other agents is known for solving I-DIDs. With this inspiration, we learn the behavior of other agents automatically, which provides an alternative to manually crafted models for solving I-DIDs. Instead of constructing and solving models to predict other agents' behavior, we learn their behavior from available data.


4.1 Behavior Representation

Solutions of a T-horizon DID are represented by a depth-T policy tree that contains a set of policy paths from the root node to the leaves. Every T-length policy path is an action-observation sequence that prescribes agent j's behavior over the entire planning horizon. Formally, we define a T-length policy path below.

Definition 1 (Policy Path) A policy path, h^T_j, is an action-observation sequence over T planning horizons: h^T_j = {a^t_j, o^{t+1}_j}_{t=0}^{T-1}, where o^T_j is null, with no observation following the final action.

Since agent j can select any action and receive every possible observation at each time step, the set of all possible T-length policy paths is H^T_j = A_j × ∏_{t=1}^{T-1}(Ω_j × A_j). A depth-T policy tree is the optimal plan of agent j in which the best decisions are assigned to every observation at each time step. We may impose an ordering on a policy tree by assuming some order for the observations, which guard the arcs in the tree. The total number of policy paths is up to |Ω_j|^{T-1} in a depth-T policy tree. We formally define a policy tree below.

Definition 2 (Policy Tree) A depth-T policy tree is a set of policy paths, H^T_j = ⋃ h^T_j, that is structured as a tree T = (V, E), where V is the set of vertices (nodes) labelled with actions A and E is the set of ordered groups of edges labelled with observations Ω.

Fig. 1b shows the policy tree that is the solution to one DID model in the aforementioned StarCraft scenario. A policy tree specifies the predicted actions of the other agent j at each time step, which directly impact the subject agent i's decisions. Most previous I-DID research assumes the availability of domain knowledge to construct precise models of other agents and then solves the models to build the policy tree. However, this is somewhat unrealistic in RTS games if models are required to be built from the game. It is particularly difficult to model other agents, which could be either other NPCs or human players in RTS games, due to the complexity of their behavior descriptors.
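As a concrete illustration of Definitions 1 and 2, the sketch below (names are ours, not the paper's) stores a policy path as a list of action-observation pairs and a policy tree as nested nodes, with one action per node and one ordered child per observation; the count field records how many observed paths pass through a node and is used later by the compatibility test.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A T-length policy path: [(a^0, o^1), ..., (a^{T-1}, None)], with no
# observation after the final action (Definition 1).
PolicyPath = List[Tuple[str, Optional[str]]]

@dataclass
class PolicyNode:
    action: str
    children: Dict[str, "PolicyNode"] = field(default_factory=dict)  # keyed by observation
    count: int = 0        # number of observed paths passing through this node

def add_path(root: Optional[PolicyNode], path: PolicyPath) -> PolicyNode:
    """Insert one policy path into a (possibly empty) policy tree."""
    first_action, obs = path[0]
    if root is None:
        root = PolicyNode(first_action)
    root.count += 1
    node = root
    for action, next_obs in path[1:]:
        child = node.children.get(obs)
        if child is None:
            child = PolicyNode(action)
            node.children[obs] = child
        child.count += 1
        node, obs = child, next_obs
    return root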

4.2 Behavior Learning

Instead of crafting procedural models of other agents in StarCraft, we resort to learning the other agents' behavior, generally represented by a policy tree. The possibility of learning the behavior automatically derives from the availability of replay data uploaded by a large number of game players.

Data Representation

The replay data records real-time game states, time-stamps and the status of units controlled by players. Since the replay data records the information of the entire game-play, there is no clear reset state for the units to indicate the start or end of a planning horizon. In addition, the data contains units' behavior in many types of tasks, including combat and resource gathering. It is natural that we focus on learning units' behavior for each specific task. Hence we need to partition an entire set of replay data into multiple local sets, each of which supplies the corresponding data resources for learning a policy tree for a given task.

We use a landmark approach to partition the replay data, inspired by the definition of landmarks in robotics [Lazanas and Latombe, 1995] and planning [Hoffmann et al., 2004]. We define landmarks in StarCraft as a set of game features (states and values) that can be used as a point of reference to determine the player's position in the game world and the value of game-play related parameters. For example, when two enemy military units are within the visibility radius they are labelled as in battle, and this label remains until either a unit has died or the units are no longer within 1.5× the visibility radius. As the landmark description falls outside the scope of this article, we shall only illustrate their role.
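A minimal sketch of this battle landmark is given below, assuming replay frames expose unit positions and alive flags under the field names used here; the field names and the radius value are our assumptions about the data format, not the paper's.

VISIBILITY_RADIUS = 8.0     # illustrative value in map units

def distance(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def in_battle(frame):
    d = distance(frame["unit_pos"], frame["enemy_pos"])
    return frame["unit_alive"] and frame["enemy_alive"] and d <= VISIBILITY_RADIUS

def battle_over(frame):
    d = distance(frame["unit_pos"], frame["enemy_pos"])
    return (not frame["unit_alive"]) or (not frame["enemy_alive"]) \
        or d > 1.5 * VISIBILITY_RADIUS

def partition_by_battle(frames):
    """Split one replay (a list of frames) into battle segments for this landmark."""
    segments, current = [], None
    for frame in frames:
        if current is None:
            if in_battle(frame):
                current = [frame]
        elif battle_over(frame):
            segments.append(current)
            current = None
        else:
            current.append(frame)
    if current:
        segments.append(current)
    return segments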

Constructing Policy Trees

Let D = {D_1, ..., D_l} be the entire set of replay data, where D_l = {D^1_l, ..., D^m_l} aggregates the sets labelled by the same landmark l. Although the landmark determines the player's local position in the game-play, the player may have different beliefs over the game states, so the local sets D^m_l may exhibit different behavior of the player. We aim to learn a set of policy trees, T = {T_1, ..., T_k}, from each D_l.

Alg. 1 shows the development of policy trees from the data. As a policy tree is composed of a set of policy paths, we first construct the policy paths and then add them into the tree. Due to the limited interactions between NPCs and players, the replay data may not exhibit the full profile of players' behavior. Consequently, the learnt policy tree is incomplete since some branches may be missing.

Algorithm 1 Build Policy Trees

1: function LEARN POLICY TREES(D, T)
2:   H ← POLICY PATH CONSTRUCTION(D, T)
3:   T ← build trees from H
4:   return T
5: function POLICY PATH CONSTRUCTION(D, T)
6:   H ← ∅, the set of policy paths
7:   for all D_l ∈ D do
8:     new h^T_j
9:     for all D^m_l ∈ D_l do
10:      h^T_j ← h^T_j ∪ a^m_j (∈ D^m_l[A])
11:      if m ≠ 1 then
12:        h^T_j ← h^T_j ∪ o^m_j (∈ D^m_l[Ω])
13:    H ← H ∪ h^T_j
14:  return H
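A direct Python rendering of Alg. 1 under our earlier sketches is shown below. It assumes each battle segment has already been reduced to a time-ordered list of (action, observation) records (an assumption about preprocessing) and reuses PolicyNode/add_path from the behavior-representation sketch; conflicting actions after the same history are not resolved here, whereas a fuller implementation might keep per-action counts.

def construct_policy_paths(data, T):
    """data: dict mapping a landmark to its battle segments; returns T-length paths."""
    paths = []
    for segments in data.values():
        for segment in segments:
            if len(segment) < T:
                continue                     # too short to form a T-length policy path
            path = [(action, obs if t < T - 1 else None)
                    for t, (action, obs) in enumerate(segment[:T])]
            paths.append(path)
    return paths

def learn_policy_trees(data, T):
    """Group the learnt paths by their first action into a set of policy trees."""
    trees = {}
    for path in construct_policy_paths(data, T):
        root_action = path[0][0]
        trees[root_action] = add_path(trees.get(root_action), path)
    return list(trees.values())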

Filling in Partial Policy Trees

We could fill in the missing branches of the policy tree with random observation-action sequences, but this may lead to unknown errors when predicting the agent's behavior. We observe that players may have identical post-behavior even if they acted differently in the past. This is particularly true in the context of RTS games. For example, players may approach their opponents in different ways; however, they often attack them through similar mechanisms once the opponents are within attacking distance. This inspires a heuristic for filling in the partial policy tree.


Inspired by the learning of probabilistic finite automata [Higuera, 2010], we propose the branch fill-in algorithm in Alg. 2, which tests the behavioral compatibility of two nodes. V[A^t] retrieves the actions at time t from the policy tree, while V^Ω[A^{t+1}] are the actions at time t+1 that follow V[A^t] with the corresponding observations at t+1. The compatibility test (line 5) requires: (1) the nodes to have the same label (action); and (2) the difference between their successors to be bounded by a threshold value (line 20). The second condition demands that compatibility be recursively satisfied for every pair of successor nodes (lines 10-14). Once the compatibility of the two nodes is confirmed, we can share their successors and fill in the missing branches accordingly. If multiple sets of nodes satisfy the compatibility, we select the most compatible one, with the minimum difference (line 20). The time complexity of Algs. 1 and 2 is polynomial in the size of the data D.

Algorithm 2 Branch Fill-in

1: function FIND MISSING NODES(T: set of policy trees)
2:   Q ← set of policy trees in T with missing branches identified
3:   for all Q_n ∈ Q do
4:     for all T_n ∈ T do
5:       if BEHAVIORAL COMPATIBILITY(Q_n, T_n) then
6:         fill missing nodes in Q_n from T_n
7: function BEHAVIORAL COMPATIBILITY(V_1, V_2)
8:   if V_1[A^t] ≠ V_2[A^t] then return false
9:   else
10:    if TEST(V_1, V_2) then
11:      for all A^{t+1} ∈ V_1 do
12:        c ← TEST(V_1[A^{t+1}], V_2[A^{t+1}])
13:        if c = false then return false
14:    else return false
15:  return true
16: function TEST(V_1, V_2)
17:   n_1 ← V_1[A^t].count, n_2 ← V_2[A^t].count
18:   for all Ω ∈ V_1 do
19:     f_1 ← V^Ω_1[A^{t+1}].count, f_2 ← V^Ω_2[A^{t+1}].count
20:     if |f_1/n_1 − f_2/n_2| ≥ ε then
21:       return false
22:   return true
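The sketch below restates Alg. 2 on the PolicyNode structure introduced earlier: successor-action frequencies are estimated from the stored counts (the f/n ratios of line 20), two nodes are compatible when they share an action and their frequencies differ by less than ε on every branch the partial tree has seen, and compatible donors contribute their subtrees to the missing branches. The default ε and the helper names are our assumptions.

def frequencies(node):
    """Empirical Pr(successor action | observation), i.e. the f/n ratio of line 20."""
    if node.count == 0:
        return {}
    return {obs: child.count / node.count for obs, child in node.children.items()}

def compatible(partial, donor, epsilon=0.1):
    """Behavioral compatibility of a node in a partial tree with a donor node."""
    if partial.action != donor.action:
        return False
    f1, f2 = frequencies(partial), frequencies(donor)
    for obs, p in f1.items():                       # only branches the partial tree has seen
        if abs(p - f2.get(obs, 0.0)) >= epsilon:
            return False
    for obs, child in partial.children.items():     # recurse on matching successors
        if obs in donor.children and not compatible(child, donor.children[obs], epsilon):
            return False
    return True

def fill_missing(partial, donor, observations, epsilon=0.1):
    """Graft donor subtrees onto observation branches the partial tree never saw."""
    if not compatible(partial, donor, epsilon):
        return
    for obs in observations:
        if obs not in partial.children and obs in donor.children:
            partial.children[obs] = donor.children[obs]   # share the donor's successors
    for obs, child in partial.children.items():
        if obs in donor.children and child is not donor.children[obs]:
            fill_missing(child, donor.children[obs], observations, epsilon)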

4.3 Theoretical Analysis

The policy tree we learn from the data prescribes the player's actions at every time step. Intuitively, the learnt policy tree (H_j) becomes the true behavior (H*_j) of the player if the amount of replay data is infinite. Let a^{2,t}_j be a missing action that is filled in using action a^{1,t}_j at time t in H_j. The branch fill-in assumes that Pr(a^{1,t}_j | H_j) has approached the true Pr(a^t_j | H*_j) after N amount of data, based on which a^{2,t}_j is complemented.

Using the Hoeffding inequality [Hoeffding, 1963], Proposition 1 provides the sample complexity bounding the probable rate of convergence of the fill-in actions to the true actions.

Proposition 1

Pr[ Σ_{a^{2,t}_j} |Pr(a^{2,t}_j | H_j) − Pr(a^t_j | H*_j)| ≤ η ] ≥ 1 − |A_j| · e^{−2NT(η/|A_j|)^2}    (1)

where Pr(a^{2,t}_j | H_j) is the probability of actions at time t that we learn from the data (computed as f_2/n_2 in line 20 of Alg. 2) and Pr(a^t_j | H*_j) gives the true actions of the player.
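Inverting the bound of Proposition 1 gives a quick way to estimate how many T-length data sequences are needed for a desired accuracy η and confidence 1 − δ; a small helper, under the assumption that N in Eq. 1 counts such sequences, is sketched below.

import math

def samples_needed(eta, delta, num_actions, T):
    """Smallest N with 1 - |A_j| * exp(-2*N*T*(eta/|A_j|)**2) >= 1 - delta."""
    return math.ceil(math.log(num_actions / delta) * num_actions ** 2
                     / (2 * T * eta ** 2))

# Example: 2 actions, horizon T=4, eta=0.1, 95% confidence -> 185 sequences.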

Let Δ^T_i = |V(m_{i,l}) − V*(m_{i,l})|, where V(m_{i,l}) is agent i's expected reward obtained by solving the level l I-DID model with j's learnt behavior and V*(m_{i,l}) is i's expected reward given j's true behavior. Following the proof in [Chen et al., 2015], the reward difference is bounded as follows.

Δ^T_i ≤ ρ R^max_i ((T − 1)(1 + 3(T − 1)|Ω_i||Ω_j|) + 1)    (2)

where ρ = Σ_{a^{1,t}_j} |Pr(a^{1,t}_j | H_j) − Pr(a^t_j | H*_j)| is the worst error on predicting a^1_j in the learning.

Furthermore, let τ = Σ_{a^{2,t}_j} |Pr(a^{2,t}_j | H_j) − Pr(a^{1,t}_j | H_j)|. Since η, ρ and τ compose a triangle, we get ρ ≤ τ + η with an upper bound τ, and we obtain η with probability at least 1 − |A_j| · e^{−2NT(η/|A_j|)^2}.

Recalling the test in line 20 of Alg. 2, we have τ < Bε, where B is the number of tests, so that we control the learning quality through the ε value in the algorithm. Note that more data will reduce the branch fill-ins, therefore improving the predictions of agent j's behavior, which directly impacts agent i's plan quality in Eq. 2.
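For completeness, the planning-quality bound of Eq. 2 is easy to evaluate numerically; the sketch below takes the worst prediction error ρ (itself bounded by τ + η < Bε + η) and the domain parameters as inputs.

def reward_loss_bound(rho, r_max, T, num_obs_i, num_obs_j):
    """Upper bound on |V(m_{i,l}) - V*(m_{i,l})| from Eq. 2."""
    return rho * r_max * ((T - 1) * (1 + 3 * (T - 1) * num_obs_i * num_obs_j) + 1)

# A smaller epsilon in Alg. 2 tightens rho (via tau < B*epsilon) and hence the
# bound, at the price of fewer branch fill-ins.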

5 Experiment Results

We first verify the algorithm's performance in the UAV benchmark (|S|=25, |A|=5 and |Ω|=5) - the largest problem domain studied in I-POMDP/I-DID based multiagent planning research - and then demonstrate the application in StarCraft. We compare the policy tree learning techniques with either random fill-ins (Rand) or the behavioral compatibility test (BCT) in Alg. 2.

5.1 UAV Simulations

We simulate interactions between agents i and j, and collect different sizes of data for learning agent j's policy trees. Subsequently, we build i's I-DID given the learnt behavior of j, and use action equivalence (AE) [Zeng and Doshi, 2009] to solve the I-DID, since the exact technique does not scale to this complex domain. Using the generated policies, agent i plays against j, which selects one model from the 4 possible models of j used in the simulation.

Fig. 4a and Fig. 4b show agent i's average reward over 1000 simulations of the UAV domain for T=4 and T=6, respectively. The horizontal lines (Exact) are agent i's average rewards when i adopts the policies generated by the I-DID that is manually built by considering 4 possible models of agent j. The set of 4 models, including j's true model, is weighted uniformly in i's I-DID. This I-DID model is far more complex than the one we obtain by learning j's behavior from the data.

In Fig. 4, we observe that agent i achieves better performance when more data is used for learning j's behavior.


Figure 4: Performance of agent i by learning the behavior of agent j in the UAV domain. (a) Reward for I-DID agent i, T = 4. (b) Reward for I-DID agent i, T = 6.

Figure 5: StarCraft example tree learnt from replay data with increased observations (ds = don't see, sf = see far, s = see near).

The learning algorithm using BCT outperforms the technique with Rand since it generates more accurate policies of agent j. The learning algorithm even performs better than the Exact technique because agent i can focus on the true or most probable models of agent j from the learnt policy trees. Negative rewards are received in Fig. 4a since it is difficult for agent i to intercept j within a short planning time (T=4). Fig. 4b shows that more data is required to learn the behavior for larger planning horizons (T=6). Agent i consistently improves its rewards once more data of agent j becomes available.

5.2 Learning From StarCraft Replay Data

By extending the combat scenario (with more actions and observations than in Example 1), we experiment with our algorithms using replay data over a number of battles. We retrieve 3 observations and 3 actions from the data, and learn the policy trees for different planning horizons. We build the learning engine using the BWAPI library 2 to interface with the games. We then develop I-DID controlled NPCs (agent i) that reason by learning the behavior of other NPCs (agent j).

We learn agent j's policy trees with the planning horizons T=4 and 6, and expand agent i's I-DID accordingly given j's policy at the lower level. In Fig. 5a, we show the partial behavior of agent j learnt from the limited replay data. We use the observed actions to fill in the missing actions (denoted by triangles) under BCT. The complemented behavior in Fig. 5b reflects a common battle strategy well. Fig. 6a reports the average reward of agent i when it competes with agent j over 60 competitions. We observe that i receives a higher reward when it learns j's behavior from more battles, where each battle records a fight between 2 opposing units until one or both units die, and normally consists of 10 to 50 of j's action-observation sequences.

2 https://code.google.com/p/bwapi/

Figure 6: NPC (agent i) learns the behavior of another NPC (agent j) from the replay data in the StarCraft domain. (a) Reward for I-DID agent i. (b) Learning time.

In Fig. 6a, the results show that agent i performs better with the more accurate behavior of j generated by the branch fill-in technique with BCT. Large planning horizons (like T=6) also provide some benefit to agent i. Learning time increases with more data available to the agent. Due to the inexpensive compatibility tests, the learning algorithm shows no visible difference in duration between random filling of missing branches and executing the compatibility tests, as reflected in Fig. 6b. Learning policy trees for larger planning horizons (T=6) takes more time in our experiments.

6 Conclusions

In this paper we have presented a method for learning the behaviour of other agents in I-DIDs. The approach uses automatic techniques to learn the policy tree from historical data of the agents. We have presented a solution to the problem of incomplete policy trees by way of a compatibility test that fills in missing branches with those found to be compatible. The learning technique has then been applied to both the UAV domain and the RTS game StarCraft, allowing an NPC to predict the actions of another NPC or a human player to increase its reward. Our work is an example of integrating data-driven behavior learning techniques with I-DID solutions, which removes the barrier to I-DID applications posed by the lack of models of other agents. Future work could investigate incremental algorithms for learning agents' behavior from on-line data.

References

[Albrecht and Ramamoorthy, 2014] Stefano V. Albrecht and Subramanian Ramamoorthy. On convergence and optimality of best-response learning with policy types in multiagent systems. In UAI, pages 12–21, 2014.

[Barrett and Stone, 2015] Samuel Barrett and Peter Stone. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In AAAI, pages 2010–2016, 2015.

[Carmel and Markovitch, 1996] David Carmel and Shaul Markovitch. Learning models of intelligent agents. In AAAI, pages 62–67, 1996.

[Chandrasekaran et al., 2014] Muthukumaran Chandrasekaran, Prashant Doshi, Yifeng Zeng, and Yingke Chen. Team behavior in interactive dynamic influence diagrams with applications to ad hoc teams. In AAMAS, pages 1559–1560, 2014.

[Chen et al., 2015] Yingke Chen, Yifeng Zeng, and Prashant Doshi. Iterative online planning in multiagent settings with limited model spaces and PAC guarantees. In AAMAS, pages 1161–1169, 2015.

[Cho et al., 2013] Ho-Chul Cho, Kyung-Joong Kim, and Sung-Bae Cho. Replay-based strategy prediction and build order adaptation for StarCraft AI bots. In IEEE CIG, pages 1–7, 2013.

[Doshi et al., 2009] Prashant Doshi, Yifeng Zeng, and Qiongyu Chen. Graphical models for interactive POMDPs: Representations and solutions. JAAMAS, 18(3):376–416, 2009.

[Doshi et al., 2010] Prashant Doshi, Muthukumaran Chandrasekaran, and Yifeng Zeng. Epsilon-subject equivalence of models for interactive dynamic influence diagrams. In IAT, pages 165–172, 2010.

[Doshi et al., 2012] Prashant Doshi, Xia Qu, Adam Goodie, and Diana L. Young. Modeling human recursive reasoning using empirically informed interactive partially observable Markov decision processes. IEEE SMC A, 42(6):1529–1542, 2012.

[Gmytrasiewicz and Doshi, 2005] Piotr Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multiagent settings. JAIR, 24:49–79, 2005.

[Higuera, 2010] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010.

[Hoeffding, 1963] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[Hoffmann et al., 2004] Jorg Hoffmann, Julie Porteous, and Laura Sebastia. Ordered landmarks in planning. JAIR, 22:215–278, 2004.

[Howard and Matheson, 1984] R. A. Howard and J. E. Matheson. Influence diagrams. In Readings on the Principles and Applications of Decision Analysis, pages 721–762, 1984.

[Lazanas and Latombe, 1995] A. Lazanas and J. C. Latombe. Landmark-based robot navigation. Algorithmica, 13(5):472–501, 1995.

[Loftin et al., 2014] Robert Tyler Loftin, James MacGlashan, Bei Peng, Matthew E. Taylor, Michael L. Littman, Jeff Huang, and David L. Roberts. A strategy-aware technique for learning behaviors from discrete human feedback. In AAAI, pages 937–943, 2014.

[Ontanon et al., 2013] Santiago Ontanon, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. In IEEE CIG, volume 5, pages 293–311, 2013.

[Panella and Gmytrasiewicz, 2015] Alessandro Panella and Piotr Gmytrasiewicz. Nonparametric Bayesian learning of other agents' policies in multiagent POMDPs. In AAMAS, pages 1875–1876, 2015.

[Salah et al., 2013] Albert Ali Salah, Hayley Hung, Oya Aran, and Hatice Gunes. Creative applications of human behavior understanding. In HBU, pages 1–14, 2013.

[Smallwood and Sondik, 1973] Richard Smallwood and Edward Sondik. The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21:1071–1088, 1973.

[Stone et al., 2010] Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, pages 1504–1509, 2010.

[Suryadi and Gmytrasiewicz, 1999] Dicky Suryadi and Piotr J. Gmytrasiewicz. Learning models of other agents using influence diagrams. User Modeling, 407:223–232, 1999.

[Synnaeve and Bessiere, 2011] Gabriel Synnaeve and Pierre Bessiere. A Bayesian model for RTS units control applied to StarCraft. In IEEE CIG, pages 190–196, 2011.

[Synnaeve and Bessiere, 2012] Gabriel Synnaeve and Pierre Bessiere. A dataset for StarCraft AI and an example of armies clustering. In Artificial Intelligence in Adversarial Real-Time Games, pages 25–30, 2012.

[Tatman and Shachter, 1990] Joseph A. Tatman and Ross D. Shachter. Dynamic programming and influence diagrams. IEEE SMC, 20(2):365–379, 1990.

[Wender and Watson, 2012] Stefan Wender and Ian Watson. Applying reinforcement learning to small scale combat in the real-time strategy game StarCraft: Broodwar. In IEEE CIG, pages 402–408, 2012.

[Zeng and Doshi, 2009] Yifeng Zeng and Prashant Doshi. Speeding up exact solutions of interactive influence diagrams using action equivalence. In IJCAI, pages 1996–2001, 2009.

[Zeng and Doshi, 2012] Yifeng Zeng and Prashant Doshi. Exploiting model equivalences for solving interactive dynamic influence diagrams. JAIR, 43:211–255, 2012.

[Zeng et al., 2011] Yifeng Zeng, Prashant Doshi, Yinghui Pan, Hua Mao, Muthukumaran Chandrasekaran, and Jian Luo. Utilizing partial policies for identifying equivalence of behavioral models. In AAAI, pages 1083–1088, 2011.

[Zeng et al., 2012] Yifeng Zeng, Yinghui Pan, Hua Mao, and Jian Luo. Improved use of partial policies for identifying behavioral equivalences. In AAMAS, pages 1015–1022, 2012.

[Zhuo and Yang, 2014] Hankz Hankui Zhuo and Qiang Yang. Action-model acquisition for planning via transfer learning. Artificial Intelligence, 212:80–103, 2014.

[Zilberstein, 2015] Shlomo Zilberstein. Building strong semi-autonomous systems. In AAAI, pages 4088–4092, 2015.
