
Solving large stochastic games with reinforcement learning

Neal Hughes

Australian National University, Canberra, ACT, Australia, 2601 (neal.hughes.anu.edu.au)∗

September 11, 2015

Abstract

This paper presents a new algorithm for solving stochastic games with large numbers of players. In the spirit of multi-agent learning, the method combines reinforcement learning algorithms from computer science with game theory concepts from economics. The paper introduces a modified version of fitted Q iteration with tile coding suitable for use in economic problems. This approach is combined with two repeated game type learning dynamics. The resulting algorithm — multi-agent fitted Q-V iteration — is a fast simulation-based method for solving large noisy stochastic games where the existence of an equilibrium is difficult to establish. Our approach provides a middle ground between traditional dynamic programming approaches and the simulation methods often used in Agent-based Computational Economics (ACE). We demonstrate the approach for a multi-agent water storage problem.

∗This work was completed as part of a PhD thesis at the ANU under the supervision of Quentin Grafton, John Stachurski and Sara Beavis. This thesis was supported by a scholarship from the Sir Roland Wilson Foundation. This work benefited from computational resources provided by the Australian Government through the National Computing Infrastructure (NCI). All of the code used to generate these results is available on-line: github.com/nealbob/regrivermod.


1 Introduction

This paper presents a new algorithm for solving stochastic games with large numbers of players. Large stochastic games have many applications in natural resources, commodity storage and industrial organisation (see for example Rubio and Casino 2001, Martin-Herran and Rincon-Zapatero 2005). Unfortunately, large stochastic games are hard to solve and often the existence — let alone the uniqueness — of an equilibrium cannot be established.

Standard approaches to solving stochastic games — value iteration for Markov Perfect Equilibrium (MPE) — quickly become infeasible in problems with large numbers of agents. Abhishek et al. (2007) propose an alternative solution concept — Oblivious Equilibrium (OE) — where agents make decisions based on their own state and aggregate statistics over opponent states (not dissimilar to Krusell and Smith (1998) type methods).

A number of dynamic programming algorithms have been proposed around OE (see for example Abhishek et al. 2007). However, these methods are suited to restricted problem types (e.g., dynamic oligopoly models) where an equilibrium can be confirmed. In applied problems, rational expectations concepts (like MPE and OE) can quickly become impractical and unrealistic (see Fudenberg and Levine 2007). In these situations we naturally turn to learning methods.

Learning models — like the partial best response dynamic and fictitious play — have received much attention from economists in the context of repeated games (see Fudenberg and Levine 1998). Learning models are considered particularly relevant for games with large numbers of agents and aggregate statistics, where “players are only trying to learn their optimal strategy, and not to influence the future course of the overall system” (Fudenberg and Levine 2007, p. 3). Unfortunately, the economic literature on learning in stochastic games is rather scarce.

This leads us to the emerging interdisciplinary field of ‘Multi-Agent Learning’ (MAL). In MAL, computer science researchers are attempting to extend reinforcement learning algorithms designed for single agent Markov Decision Processes (MDPs). Recently, there has been a push to formalise the field by incorporating theory on stochastic games (Shoham et al. 2007).

In reinforcement learning, agents ‘learn’ by observing the outcomes of their actions (see Sutton and Barto 1998). Just like dynamic programming, these methods exploit the Bellman (1952) principle and often come with convergence guarantees. Unlike dynamic programming, they assume no prior knowledge of the payoff or transition functions.


Despite their intuitive appeal, these methods have seen little use in economics (Fudenberg and Levine 2007, Tesfatsion and Judd 2006). The few economic applications involve problems with either small numbers of agents, limited interactions between agents (i.e., no externalities), no state dynamics (i.e., repeated games) or no noise. For example, Tesauro and Kephart (2002) and Waltman and Kaymak (2008) both apply Q-learning to repeated Cournot oligopoly problems, while Kutschinski et al. (2003) apply Q-learning to a dynamic multi-agent pricing problem. More recently, Bekiros (2010) applied an ‘Actor-Critic’ algorithm to learn profit-making trading strategies from stock market data.

This paper extends the literature in two directions. First, we introduce a modified version of fitted Q iteration (Ernst et al. 2005, Timmer and Riedmiller 2007) — a batch version of Q-learning — suitable for noisy single agent economic problems.

In particular, we maintain approximations of both the action-value function Q and the state value function V, hence the name: fitted Q-V iteration. The Q function is fit to a large set of simulated points and the V function to an approximately equidistant sub-sample, similar to the approach of Maliar and Maliar (2015). For function approximation we employ a version of tile coding (Sutton and Barto 1998): a fast approximation method suitable for low (i.e., less than 10) dimensional problems.

Second, we combine fitted Q-V iteration with two repeated game style learning dynamics, similar to partial best response and fictitious play. The resulting multi-agent fitted Q-V iteration algorithm provides a flexible method for solving large noisy stochastic games where the existence of an equilibrium is difficult to establish. Our approach is distinct from many of the recent contributions to MAL, which in their “rush to equilibrium” (Fudenberg and Levine 2007, p. 5) have focused on narrow classes of problems (i.e., two-player and zero-sum games).

Within the economic literature, the approach might be described as an ‘optimisation-based’ learning method (Crawford 2013). At the same time, the approach is only a small departure from algorithms used to compute rational expectations equilibria.

Ultimately, our approach provides a middle ground between dynamic programming based methods and the simulation based methods often used in Agent-based Computational Economics (ACE) (Tesfatsion and Judd 2006) — both in terms of the size and complexity of the models it can be applied to and the degree of rationality or ‘intelligence’ assumed for the agents.

The paper proceeds as follows. First, we present our (single agent) fitted Q-V iteration algorithm in the context of the reinforcement learning literature. Second, we present our multi-agent fitted Q-V iteration algorithm in the context of the MAL literature. Third, we introduce the all-important function approximation methods. Finally, we apply the approach to a multi-agent water storage problem.

2 Single agent algorithm

2.1 Reinforcement learning

Reinforcement learning is concerned with algorithms for solving MDPs (Sutton and Barto 1998, Weiring and Otterlo 2012). Here an MDP is a tuple $(S, A, T, R, \beta)$, where $S \subset \mathbb{R}^{D_S}$ is the state space and $A \subset \mathbb{R}^{D_A}$ the action space, $T : S \times A \times S \to [0, 1]$ the transition probability density function, $R : S \times A \to \mathbb{R}$ the reward function and $\beta \in (0, 1)$ the discount rate. The goal of an MDP is to select a policy function $f : S \to A$ in order to maximise the objective

$$E\left\{\sum_{t=0}^{\infty} \beta^t R(a_t, s_t)\right\}$$

given $T$, $R$, $s_0$ and $a_t = f(s_t)$.

In the context of reinforcement learning, an MDP represents the problem of an agent repeatedly selecting an action in a stochastic environment in order to maximise a reward. Central to reinforcement learning are so-called ‘model free’ algorithms, where the transition and payoff functions are both assumed unknown. Here the agent must learn only by observing the outcomes — the reward and state transition — of its interactions with the environment.

2.1.1 Q-learning

The canonical model free algorithm is Q-learning (Watkins and Dayan 1992). Here we define the action-value function $Q : A \times S \to \mathbb{R}$:

$$Q(a, s) = R(s, a) + \beta \int_S T(s, a, s') V(s') \, ds'$$

$$V(s) = \max_a Q(a, s)$$

where $V : S \to \mathbb{R}$ is the familiar state value function. Once in possession of $Q$, we can compute an optimal (greedy) policy directly, without the payoff and transition functions. In standard or ‘online’ Q-learning, $Q$ is updated after each simulated sample $\{s_t, a_t, r_t, s_{t+1}\}$ according to the rule

$$Q(a_t, s_t) = (1 - \alpha_t)\, Q(a_t, s_t) + \alpha_t \left\{ r_t + \beta \max_a Q(a, s_{t+1}) \right\}, \qquad 0 < \alpha_t < 1$$

Under mild conditions, this process is proven to converge for discrete state and action spaces (Watkins and Dayan 1992). While central to the field, Q-learning is rarely used in practice: it is known to be inefficient and also unstable, especially in problems with continuous state or action spaces (Weiring and Otterlo 2012).
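
For concreteness, the sketch below implements the online update above for a fully discretised problem. The environment interface (`reset`, `step`), the grid sizes and the learning rate and exploration schedules are illustrative assumptions, not specifications taken from the paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, beta=0.95, seed=0):
    """Tabular online Q-learning: Q(a_t, s_t) is updated one sample at a time."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for k in range(episodes):
        s = env.reset()                                   # assumed interface
        alpha = 1.0 / (1.0 + k)                           # decaying learning rate
        eps = max(0.05, 1.0 - k / episodes)               # epsilon-greedy exploration
        done = False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)                 # assumed interface
            # Q <- (1 - alpha) Q + alpha (r + beta max_a' Q(a', s'))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * Q[s_next].max())
            s = s_next
    return Q
```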

2.1.2 Fitted Q iteration

Fitted Q iteration (Riedmiller 2005, Ernst et al. 2005) is a batch version of Q-learning. Here, a simulation is run for $T$ periods, with exploration (i.e., partially randomised actions). Then a series of Q function updates is applied to the set of state-action samples (see algorithm 1).

Algorithm 1: Fitted Q iteration

Initialise $s_0$, $Q$
Run a simulation for $T$ periods, with exploration (i.e., randomised actions)
Store all of the samples $\{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{T}$

1. Compute $Q_t = r_t + \beta \max_a Q(a, s_{t+1})$
2. Estimate $Q$ by regressing $Q_t$ against $(a_t, s_t)$

Iterate over steps 1-2 until a stopping rule is satisfied

This batch approach offers both efficiency and stability improvements over online Q-learning (Weiring and Otterlo 2012). Similar to dynamic programming, fitted Q iteration is guaranteed to converge — for problems with continuous state and action spaces — conditional on the function approximation methods used. We discuss function approximation / non-parametric regression methods further in section 4.
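
As an illustration, a minimal batch loop is sketched below with a generic regression model standing in for the approximation schemes discussed in section 4. The `regressor` interface (scikit-learn style `fit`/`predict`), the coarse `action_grid` used to approximate the maximisation and the fixed iteration count are all assumptions.

```python
import numpy as np

def fitted_q_iteration(samples, regressor, action_grid, beta=0.95, n_iter=50):
    """Fitted Q iteration sketch: samples is a list of (s, a, r, s_next) tuples."""
    n = len(samples)
    S = np.array([s for s, a, r, sn in samples]).reshape(n, -1)
    A = np.array([a for s, a, r, sn in samples]).reshape(n, -1)
    R = np.array([r for s, a, r, sn in samples])
    Sn = np.array([sn for s, a, r, sn in samples]).reshape(n, -1)
    X = np.hstack([A, S])
    regressor.fit(X, R)                       # initial fit: Q_t = r_t
    for _ in range(n_iter):
        # Q_t = r_t + beta max_a Q(a, s_{t+1}); the max is taken over a coarse action grid
        q_next = np.column_stack([
            regressor.predict(np.hstack([np.full((n, 1), a), Sn]))
            for a in action_grid])
        target = R + beta * q_next.max(axis=1)
        regressor.fit(X, target)              # regress Q_t on (a_t, s_t)
    return regressor
```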

2.2 Fitted Q-V iteration algorithm

In fitted Q iteration, the integration / expectation operator is embedded in the regression of $Q$ on $(a_t, s_t)$. In problems with noisy state transitions, large sample sizes $T$ may be required to achieve accurate results.


As we increase the sample size, our set of state points becomes increasingly dense. Maximising the Q function for each point in the sample then becomes an unnecessary burden. In this case, it is preferable to optimise $Q$ for a representative subset of points (i.e., a grid) and to approximate $\max_a Q(a, s_{t+1})$ elsewhere, by estimating a continuous state value function $V$ (algorithm 2).

Algorithm 2: Fitted Q-V iteration

Initialise $s_0$, $Q$
Run a simulation for $T$ periods, with exploration (i.e., randomised actions)
Store all of the samples $\{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{T}$
Select a set of $K < T$ state points $S_r = \{s_k\}_{k=1}^{K} \subset \{s_t\}_{t=0}^{T}$

1. Set $V_k = \max_{a_k} Q(a_k, s_k)$ for all $s_k \in S_r$
2. Estimate $V(s_t)$ by regressing $V_k$ on $s_k$
3. Set $Q_t = r_t + \beta V(s_{t+1})$ for all $s_t$
4. Estimate $Q$ by regressing $Q_t$ against $(a_t, s_t)$

Iterate over steps 1-4 until a stopping rule is satisfied

This approach is more suited to economic problems, which generally have noisier state transitions and smoother pay-off and value functions than common computer science / operations research problems. As we show later, the approach is also better suited to multi-agent problems, since with large numbers of players computing $\max_a Q_i(a, s_{t+1})$ for each player quickly becomes impractical.
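
A sketch of the modification follows, again with generic regressors standing in for the tile coding approximators introduced in section 4 (the `fit`/`predict` interfaces and coarse action grid are assumptions). The key change relative to fitted Q iteration is that the maximisation is performed only on the $K$ grid points, with the fitted $V$ function supplying $\max_a Q$ at every other sample.

```python
import numpy as np

def fitted_qv_iteration(samples, grid_idx, q_reg, v_reg, action_grid,
                        beta=0.95, n_iter=50):
    """Fitted Q-V iteration sketch (algorithm 2)."""
    n = len(samples)
    S = np.array([s for s, a, r, sn in samples]).reshape(n, -1)
    A = np.array([a for s, a, r, sn in samples]).reshape(n, -1)
    R = np.array([r for s, a, r, sn in samples])
    Sn = np.array([sn for s, a, r, sn in samples]).reshape(n, -1)
    X = np.hstack([A, S])
    q_reg.fit(X, R)                              # initial fit: Q_t = r_t
    Sk = S[grid_idx]                             # K approximately equidistant grid points
    for _ in range(n_iter):
        # 1. V_k = max_a Q(a, s_k) on the grid only (coarse action grid)
        q_k = np.column_stack([
            q_reg.predict(np.hstack([np.full((len(Sk), 1), a), Sk]))
            for a in action_grid])
        v_reg.fit(Sk, q_k.max(axis=1))           # 2. fit the state value function V
        target = R + beta * v_reg.predict(Sn)    # 3. Q_t = r_t + beta V(s_{t+1})
        q_reg.fit(X, target)                     # 4. regress Q_t on (a_t, s_t)
    return q_reg, v_reg
```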

2.2.1 Sample grids

A natural choice for our subset of state points $S_r = \{s_k\}_{k=1}^{K}$ is a sample of approximately equidistant points (a sample grid). Our starting point is a distance based method (algorithm 3), which provides a subset of points at least $r$ distance apart.

This approach is related to the ‘ε-distinguishable set’ (EDS) method of Maliar and Maliar (2015). However, this algorithm also counts the sample points within the radius $r$ of each point, in order to identify outliers. Maliar and Maliar (2015) remove outliers by separately estimating a density function. Maliar and Maliar (2015) also apply a pre-processing step (principal component transformation) to orthogonalise the data, while we simply scale to the range [0, 1].

Figure 1 provides a demonstration in two dimensions. Similar to Maliar and Maliar (2015), the complexity of the algorithm will depend on $T$ and $r$. The natural response for large $T$ (and / or low $r$) is to apply the method to a random sample of the data (as do Maliar and Maliar 2015). Another approach is to employ approximation methods (such as tile coding) rather than explicit distance calculations (see Hughes 2015).

Algorithm 3: Selecting an approximately equidistant grid

Initialise parameters $r$, $K$
Scale all $s_t \in S$ to $[0, 1]$ on each dimension
Set $S_r$ to the first data point $\{s_0\}$

1. Select $s_t \in S$
2. Compute $r_k = \|s_t - s_k\|$ for each point $s_k \in S_r$, recording the minimum distance $r^*$ and its index $k^*$
3. If $r^* > r$ then add $s_t$ to $S_r$
4. Else add 1 to the counter: $n_{k^*} = n_{k^*} + 1$

Iterate steps 1-4 for all $t$ in $T$.
Return the $K$ points of $S_r$ with the highest $n_k$

Figure 1: An approximately equidistant grid in two dimensions

(a) 10000 iid standard normal points; (b) 100 points at least 0.4 apart


3 Multi-agent algorithm

3.1 Multi-agent learning

The field of MAL is concerned with algorithms for solving multi-agent problems, specifically stochastic games. For a review see Shoham et al. (2007) or Weiring and Otterlo (2012).

3.1.1 Stochastic games

Stochastic games are essentially multi-agent MDPs. There are $N$ players $I = \{1, 2, ..., N\}$ who each take actions $a_{it} \in A_i \subset \mathbb{R}^{D_A}$. We define the action profile $a_t = (a_{it})_{i \in I}$ and the action space $A = A_1 \times A_2 \times ... \times A_N$. The state of the game $s_t$ can include both agent specific states $s_{it} \in S_i \subset \mathbb{R}^{D_S}$ and a global state $s_{gt} \in S_g \subset \mathbb{R}^{D_G}$; the state space is $S = S_1 \times S_2 \times ... \times S_N \times S_g$.

Each agent has a reward function $R_i : S \times A \to \mathbb{R}$ and, as before, we have a transition function $T : S \times A \times S \to [0, 1]$ and discount rate $\beta$. Each agent's problem is to choose $a_{it}$ to maximise their expected discounted reward

$$\max_{\{a_{it}\}_{t=0}^{\infty}} E\left\{\sum_{t=0}^{\infty} \beta^t R_i(a_t, s_t)\right\}$$

given $s_0$, $R_i$, $T$, $a_{-it}$ and $a_t \in A$.

We have agent policy functions $f_i : S \to A_i$ and a policy profile function $f : S \to A$. For any $f$, each player has a value function $V_i^f : S \to \mathbb{R}$

$$V_i^f(s) = E\left\{\sum_{t=0}^{\infty} \beta^t R_i(f(s_t), s_t) \,\Big|\, s_0 = s\right\}$$

3.1.2 Markov Perfect Equilibrium

An MPE (Maskin and Tirole 1988) is a set of policy functions $\{f_1, f_2, ..., f_N\}$ which simultaneously solve each agent's problem, forming a sub-game perfect Nash equilibrium.

MPE existence results have been established for specific classes of stochastic games as early as Shapley (1953). However, a general (pure strategy, infinite horizon and state space) existence result has remained elusive despite much recent attention (Duggan 2012, Escobar 2011, Balbus et al. 2011, Horst 2005, Amir 2005).


3.1.3 Oblivious Equilibrium

With a large number of players, the dimensionality of the state space ($D_S^n \times D_G$) can quickly become infeasible. Further, the assumption that the agents have information on all opponent state variables becomes unrealistic. Under Oblivious Equilibrium (OE) (Weintraub et al. 2008), opponent state variables are replaced with summary statistics, such that the state space for each agent's problem is $S_i \times S_g$.

Weintraub et al. (2008) show that, under certain conditions (a large number of similarly sized firms), OE approximates MPE for oligopoly type models. This result is generalised to a broader class of dynamic stochastic games by Abhishek et al. (2007). Weintraub et al. (2010) extend OE to models with aggregate shocks.

3.1.4 Multi-agent learning methods

The simplest and most frequently applied approach for multi-agent problems is for each agent to follow an independent ‘online’ algorithm like Q-learning. In the multi-agent context, this approach comes with no convergence guarantees, and its convergence properties have been subject to limited study (Busoniu et al. 2008).

This approach has been attempted frequently in computer science / operations research domains, with mixed results (Busoniu et al. 2008). It has also been attempted in a few multi-agent economic pricing problems (Waltman and Kaymak 2008, Kutschinski et al. 2003). Generally, the approach has performed well in zero-sum repeated games, but has proven less reliable in general sum stochastic games (Shoham et al. 2007).

More recently, there has been a proliferation of more complex algorithms that combine reinforcement learning methods with game theory concepts such as Nash equilibrium and fictitious play (see Busoniu et al. 2008). However, once again these methods have tended to focus on narrow classes of games (e.g., zero-sum and two player games) and can be intractable in larger problems (Busoniu et al. 2008, Shoham et al. 2007).

3.2 Multi-agent fitted Q-V iteration

To be effective in multi-agent problems, batch algorithms — like fitted Q-V iteration — need to be combined with ‘smoothing dynamics’ to slow the adjustment of the agents. As in repeated games, non-smoothed learning algorithms will lack stability in stochastic games.

Consider, for example, a simple (non-smoothed) best response dynamic (algorithm 4, with $\lambda = 1$, $I_u = I$ and $K \to \infty$). At each stage of the algorithm we simulate the current population of user policies. Within each simulation the environment is stationary: the user policy functions are fixed. As such, we can compute optimal best response policies for each agent by (single agent) fitted Q-V iteration and repeat. Unfortunately, such an approach is prone to repeating cycles.

Our first smoothing method is essentially the partial best response dynamic, where we update the policies only for a random sample of agents (algorithm 4, with $I_u \subset I$). Our next smoothing dynamic is related to the idea of fictitious play: after each stage we combine the new batch of samples with samples from previous stages (algorithm 4, with $\lambda < 1$). Finally, we also limit the number of Q-V iterations performed at each stage (algorithm 4, with low $H$).

Our general multi-agent algorithm permits much flexibility. With high $H$, $\lambda = 1$ and $I_u \subset I$ we have a partial best response dynamic. As $\lambda \to 0$, $H \to 1$ and $I_u = I$, the method approaches an ‘online’ reinforcement learning method. Our preferred approach (detailed in section 5) is a compromise between these extremes.

In the multi-agent algorithm, we maintain approximations to the agents' value functions $Q_i$ and $V_i$ and policy functions $f_i$. Each stage we perform a simulation, applying some exploration strategy over the current population of policies. As in the single agent case, the degree of exploration is intended to decline as the algorithm progresses. With a large number of agents, approximating the policy functions $f_i$ is a necessity, as computing $\arg\max_a Q_i(a, s_t)$ for each agent within the simulation stage quickly becomes infeasible.

Finally, note that Q-V updates (step 3) only need to be applied to unique agents: if we have identical agents, the $Q$, $V$, $f$ functions only need to be updated once. In practice, the method can allow for a degree of ‘experience sharing’, where the samples of similar agents are combined to inform a common Q-V update (as in the application in section 5).


Algorithm 4: Multiple agent fitted Q-V iteration

Initialise $f_i$, $V_i$, $s_0$, $Q_i$
Simulate an initial batch of $T$ samples $\{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{T}$

1. Simulate for $\lambda T$ periods (with exploration) and replace $\lambda T$ samples of $\{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{T}$ with new samples

2. Select a subset of state points $S_r = \{s_k\}_{k=1}^{K} \subset \{s_t\}_{t=0}^{T}$

3. For all (unique) agents $i \in I$:

   (a) Set $V_{ik} = \max_a Q_i(a, s_k)$ for all $s_k \in S_r$
   (b) Estimate $V_i(s_t)$ by regressing $V_{ik}$ on $s_k$
   (c) Set $Q_{it} = r_{it} + \beta V_i(s_{t+1})$ for all $s_t$
   (d) Estimate $Q_i$ by regressing $Q_{it}$ against $(a_{it}, s_t)$

   Iterate over steps (a)-(d) for $H$ iterations

4. For a subset of agents $I_u \subset I$:

   (a) Set $a_{ik} = \arg\max_a Q_i(a, s_k)$ for all $s_k$
   (b) Update $f_i$ by regressing $a_{ik}$ on $s_k$

Iterate over steps 1-4 for $J$ iterations
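
A compact sketch of the outer loop is given below. The simulator, the per-agent Q-V update, the policy refit and the grid construction are passed in as assumed black-box functions; their interfaces, along with the default meta-parameters, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def multi_agent_fitted_qv(agents, simulate, qv_update, fit_policy, make_grid,
                          T=100_000, J=25, H=1, lam=0.1, update_share=0.2, seed=0):
    """Multi-agent fitted Q-V iteration sketch (algorithm 4).

    agents: objects holding each agent's Q_i, V_i and f_i approximations.
    simulate(agents, n) -> list of samples; make_grid(samples) -> state grid;
    qv_update(agent, samples, grid): one fitted Q-V pass for one agent;
    fit_policy(agent, grid): refit f_i by maximising Q_i over the grid."""
    rng = np.random.default_rng(seed)
    samples = simulate(agents, T)                      # initial batch
    for _ in range(J):
        # 1. replace a share lam of the batch with freshly simulated samples
        fresh = simulate(agents, int(lam * T))
        samples = samples[len(fresh):] + fresh         # drop oldest, append newest
        # 2. select the approximately equidistant grid of state points
        grid = make_grid(samples)
        # 3. Q-V updates for every (unique) agent type, H inner iterations each
        for agent in agents:
            for _ in range(H):
                qv_update(agent, samples, grid)
        # 4. partial best response: refit policies for a random subset of agents only
        n_up = max(1, int(update_share * len(agents)))
        for agent in rng.choice(np.asarray(agents, dtype=object), n_up, replace=False):
            fit_policy(agent, grid)
    return agents
```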


4 Function approximation

4.1 Supervised learning

With batch reinforcement learning, function approximation becomes crucial. In essence, the fitted Q iteration method translates a dynamic programming problem into a series of difficult regression problems. To a large extent, the approximation or ‘supervised learning’ methods employed become the defining feature of these algorithms.

Varian (2014) provides an introduction to the field of supervised learning for economists. Here the goal is to ‘learn’ (estimate) a model for a ‘target’ (dependent) variable $Y$ conditional on a vector of ‘input’ (explanatory) variables $X$, given a set of ‘training data’ $\{Y_t, X_t\}_{t=0}^{T}$, but no prior knowledge of functional form.

For various reasons (including speed and stability), standard methods employed in economics — such as orthogonal polynomials — are not ideal in this context. With fitted Q iteration, a variety of approximation schemes have been proposed, including random forests (Ernst et al. 2005), neural networks (Riedmiller 2005) and tile coding (Timmer and Riedmiller 2007). Below we focus on tile coding (Sutton and Barto 1998), which has a history of success in reinforcement learning and is well suited to problems in low (i.e., less than 10) dimensional spaces.

4.2 Tile coding

With tile coding, the input space is partitioned into tiles. A whole partition is referred to as a tiling or a layer. A tile coding scheme then involves multiple overlapping layers, each offset from the others according to some displacement vector.

Tile coding is best understood visually. In the simplest approach the tilings are just regular grids (i.e., rectangular tiles) and each grid is offset uniformly (i.e., diagonally) as in figure 2. For any given point $X_t$, one tile in each layer is activated and our predicted value is the mean of the weights attached to the active tiles.

More formally, a tile coding scheme involves $i = 1$ to $N_L$ layers. Each layer contains $j = 1$ to $N_T$ binary basis functions

$$\phi_{ij}(X) = \begin{cases} 1 & \text{if } X \in X_{ij} \\ 0 & \text{otherwise} \end{cases}$$


Figure 2: Tile coding. Two offset tiling layers cover the input space; an input point $X_t$ activates one tile in each layer.

where for each layer $i$ the set $\{X_{ij} : j = 1, ..., N_T\}$ is an exhaustive partition of the input space. The predicted value $\hat{Y}_t$ is then

$$\hat{Y}_t = \frac{1}{N_L} \sum_{i=1}^{N_L} \sum_{j=1}^{N_T} w_{ij} \phi_{ij}(X_t)$$

where $w_{ij}$ are the weights.

Tile coding is a constant piecewise approximation scheme: the ‘resolution’ of the approximation depends on the number of layers and the number of tiles per layer. For example, figure 3a shows a tile coding scheme with $N_L = 1$ and $N_T = 6$. Just increasing the number of tiles gives a finer resolution but provides less generalisation, leading to overfitting (see figure 3b). Figure 3c shows a model with $N_L = 40$ and $N_T = 4$, which provides both high resolution and good generalisation.

Tile coding has computational advantages over schemes with basis functions of global support. Tile weights are stored in arrays and accessed directly (computing array indexes involves integer conversion of $X$). Each function call then involves only the $N_L$ active weights. As such, prediction time is low and grows linearly in the number of layers rather than exponentially in the number of dimensions.

This speed gain comes at the cost of higher memory usage. However, since many reinforcement learning applications are CPU bound, this can be an efficient use of resources. In higher dimensions, weight arrays can be compressed through ‘hashing’ (i.e., sparse arrays) (see Hughes 2015).
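
A minimal sketch of a tile coder with regular grids, uniform diagonal offsets and fitting by averaging is given below. It is not the exact scheme used in the paper (which employs the Brown and Harris (1994) displacement vectors and the combined spline extrapolation described in section 4.2.1), but it illustrates the prediction rule above.

```python
import numpy as np

class TileCoder:
    """Piecewise constant approximation over NL offset rectangular tilings."""

    def __init__(self, dims, n_tiles, n_layers):
        self.d, self.nt, self.nl = dims, n_tiles, n_layers
        # uniform diagonal displacement of each layer (simplest choice, an assumption)
        self.offsets = np.linspace(0.0, 1.0 / n_tiles, n_layers, endpoint=False)
        self.w = np.zeros((n_layers,) + (n_tiles,) * dims)   # tile weights
        self.n = np.zeros_like(self.w)                       # observation counts per tile

    def _active(self, x):
        """Indices of the active tile in each layer, for x scaled to [0, 1]^d."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        idx = []
        for i in range(self.nl):
            j = np.floor((x + self.offsets[i]) * self.nt).astype(int)
            idx.append((i,) + tuple(np.clip(j, 0, self.nt - 1)))
        return idx

    def predict(self, x):
        # prediction: mean of the NL active tile weights
        return float(np.mean([self.w[k] for k in self._active(x)]))

    def fit_average(self, X, Y):
        # fitting by averaging: each weight is the running mean of its observations
        for x, y in zip(X, Y):
            for k in self._active(x):
                self.n[k] += 1
                self.w[k] += (y - self.w[k]) / self.n[k]
```

For example, `TileCoder(dims=1, n_tiles=4, n_layers=40)` corresponds roughly to the resolution settings illustrated in figure 3c.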


Figure 3: An example of tile coding in one dimension. (a) Tile coding (TC-A), $N_L = 1$, $N_T = 60$; (b) tile coding (TC-A), $N_L = 1$, $N_T = 6$; (c) tile coding (TC-SGD), $N_L = 40$, $N_T = 4$.


4.2.1 Fitting

The simplest option is to define the weights as the average value of the observations attached to each tile. When this type of tile coding is used to approximate $Q$, fitted Q iteration is guaranteed to converge (Timmer and Riedmiller 2007). A convergence result is possible because averaging results in a non-expansive approximator, that is, a smoother. Non-expansive approximations of value functions maintain the contraction mapping property of the Bellman operator, ensuring convergence of typical dynamic programming and reinforcement learning methods (see Stachurski 2008).

While fitting by averaging provides a convergence guarantee, it is unlikely to provide ideal performance (Timmer and Riedmiller 2007). An alternative approach is ‘Averaged Stochastic Gradient Descent’ (ASGD) (Bottou 2010). Here we update the active tile weights one observation at a time according to the rule

$$w_{ij} = w_{ij} - \alpha_t (\hat{Y}_t - Y_t), \qquad 0 < \alpha_t < 1$$

With ASGD we define our weights as the average of each of these updates (as opposed to their final values). Bottou (2010) demonstrates the superiority of ASGD over standard SGD for problems with large samples. Hughes (2015) shows that ASGD outperforms averaging for a single agent reservoir operation problem.
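
The sketch below adds this fitting rule to the TileCoder sketched above: a plain SGD step on the active tile weights for each observation, with the final weights taken as the running average of the iterates. The constant learning rate is an illustrative assumption.

```python
import numpy as np

def fit_asgd(coder, X, Y, alpha=0.1):
    """Averaged SGD fit for the TileCoder sketched in section 4.2."""
    w_bar = np.zeros_like(coder.w)
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        err = coder.predict(x) - y                 # prediction error (Y_hat - Y)
        for k in coder._active(x):
            coder.w[k] -= alpha * err              # SGD step on the active tile weights
        w_bar += (coder.w - w_bar) / t             # running average of the weight iterates
    coder.w = w_bar                                # ASGD: report the averaged weights
```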

Unfortunately, tile coding provides no extrapolation. To address this, we combine our tile coding schemes for the value and policy functions $V$ and $f$ with a sparse linear spline model. The combined scheme replaces the tile code weight $w_{ij}$ with the linear spline predicted value if $n_{ij} = 0$ (Hughes 2015).


5 A multi-agent water storage problem

Our application involves a multi-agent reservoir operation problem (Hughes 2015). The problem is concerned with the allocation of water in a river system controlled by a large dam. The dam has fixed capacity $K$, storage volume $S_t$ and receives stochastic inflow $I_{t+1}$. Each period $t$, an amount $W_t$ is released downstream to supply a large number of heterogeneous water users $i \in \{1, 2, ..., n\}$ located at a single demand node.

Figure 4: An abstract regulated river system. Inflow enters a water storage (dam); releases flow downstream to a water demand node (e.g., an irrigation area) where river extraction occurs, with return flow re-entering the river.

5.1 The planner’s problem

The hypothetical social planner's optimisation problem is

$$\max_{\{q_{it}, W_t\}_{t=0}^{\infty}} E\left\{\sum_{t=0}^{\infty} \beta^t \sum_{i=1}^{n} \pi_{it}(q_{it}, I_t, e_{it})\right\}$$

subject to:

$$S_{t+1} = \min\{S_t - W_t - L(S_t) + I_{t+1},\; K\}$$

$$0 \le W_t \le S_t$$

$$\sum_{i=1}^{n} q_{it} \le \max\{(1 - \delta_b) W_t - \delta_a,\; 0\}$$

Here $L$ is an evaporation loss function (increasing and concave in $S_t$), $q_{it}$ is water use by user $i$ and $\pi_{it}$ are the user benefit functions. $\pi_{it}$ are increasing and concave in $q_{it}$ and $I_t$ and increasing in $e_{it}$. $e_{it}$ is a user specific productivity shock. $I_t$ and $e_{it}$ both follow AR(1) processes:

$$I_{t+1} = \rho_I I_t + \epsilon_{t+1}, \qquad \epsilon_{t+1} \sim \Gamma(k_I, \theta_I), \quad 0 < \rho_I < 1$$

$$e_{it} = 1 - \rho_e + \rho_e e_{i,t-1} + \eta_{it}, \qquad \eta_{it} \sim N(0, \sigma^2_\eta), \quad 0 < \rho_e < 1$$

Reservoir operation problems of this type involve a trade-off between ‘yield’ and ‘reliability’: higher storage reserves reduce the probability of shortages, but increase expected losses from evaporation and ‘spills’. Spills occur when the storage volume $S_t$ reaches capacity $K$, in which case excess water flows uncontrolled downstream and is essentially lost.
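
To make the storage dynamics concrete, the following sketch simulates the water balance constraint under an arbitrary release rule. The loss function, inflow parameters and release rule are illustrative assumptions, not the calibration used in the paper.

```python
import numpy as np

def simulate_storage(T=1000, K=1000.0, rho_I=0.3, k_I=2.0, theta_I=150.0,
                     release_frac=0.5, seed=0):
    """Simulate S_{t+1} = min{S_t - W_t - L(S_t) + I_{t+1}, K} and track spills."""
    rng = np.random.default_rng(seed)
    S = np.empty(T + 1)
    I = np.empty(T + 1)
    S[0], I[0] = 0.5 * K, k_I * theta_I
    spills = 0.0
    for t in range(T):
        W = release_frac * S[t]                             # illustrative release rule, 0 <= W <= S
        L = 0.01 * np.sqrt(S[t] * K)                        # illustrative concave evaporation loss
        I[t + 1] = rho_I * I[t] + rng.gamma(k_I, theta_I)   # AR(1) inflows with gamma shocks
        S_next = S[t] - W - L + I[t + 1]
        spills += max(S_next - K, 0.0)                      # water lost once capacity K is reached
        S[t + 1] = min(max(S_next, 0.0), K)                 # clip at zero as a numerical safeguard
    return S, I, spills
```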

Hughes (2015) derives numerical solutions to the planner's problem and shows that (single agent) fitted Q-V iteration can obtain policies of nearly identical quality to stochastic dynamic programming in significantly less time.

5.2 The decentralised model

The decentralised version contains the same water supply and demand constraints, only here water property rights are defined, facilitating both user storage decisions and a water spot market.

5.2.1 Water property rights

Under the general property rights framework, each user controls their own ‘water account’. Each period, user account balances $s_{it}$ are credited with a share $\lambda_i$ of inflow and debited for user withdrawals $w_{it}$.

To illustrate, consider first a ‘capacity sharing’ (CS) (Dudley and Musgrave 1988) approach to water property rights. Here we have


$$s_{it+1} = \min\left\{\max\left\{s_{it} - w_{it} - \frac{s_{it}}{S_t} L(S_t) + \lambda_i I_{t+1} + x_{it+1},\; 0\right\},\; \lambda_i K\right\}$$

$$w_{it} \le s_{it}, \qquad \sum_{i=1}^{n} \lambda_i = 1, \qquad \sum_{i=1}^{n} s_{it} = S_t$$

Under CS, users own a percentage share $\lambda_i$ of total storage capacity and are liable for a share $s_{it}/S_t$ of total evaporation losses.

Here $x_{it}$ are the ‘storage externalities’. Intuitively, $x_{it}$ are account reconciliations, which ensure the total account balance $\sum_{i=1}^{n} s_{it}$ matches the physical storage volume $S_t$. Under CS these externalities include ‘internal spills’ (Dudley and Musgrave 1988), which occur when individual accounts reach their capacity and water is forfeited to other account holders. For now, note that $x_{it}$ can be a complicated function of the storage balances and withdrawals of all users, $s_t = (s_{1t}, s_{2t}, ..., s_{nt})$ and $w_t = (w_{1t}, w_{2t}, ..., w_{nt})$, as well as the quantities $S_t$, $W_t$, $L_t$ and $I_t$ (for more detail see Hughes 2015).

5.2.2 The water spot market

Users receive water allocations $a_{it}$ adjusted for delivery losses

$$a_{it} = (1 - \delta_b) w_{it}$$

Water allocations can be used or traded on the spot market, subject to the market clearing condition

$$\sum_{i=1}^{n} q_{it} = \sum_{i=1}^{n} a_{it}$$

Trade is subject to a transfer cost $\tau > 0$. User payoffs $u_{it}$ are then defined

$$u_{it} = \begin{cases} \pi_h(q_{it}, I_t) + P_t (a_{it} - q_{it}) & \text{if } a_{it} - q_{it} \ge 0 \\ \pi_h(q_{it}, I_t) + (P_t + \tau)(a_{it} - q_{it}) & \text{if } a_{it} - q_{it} < 0 \end{cases}$$

where $P_t$ is the market price for water.

5.2.3 The users’ problem

The users’ problem is to determine wit and qit each period in order to maximisetheir expected discounted payoff

18

Page 19: Solving large stochastic games with reinforcement learning · 2015-09-18 · Solving large stochastic games with reinforcement learning Neal Hughes Australian National University,

max{qit,wit}t=∞

t=0

E

{∞

∑t=0

βtuit

}

subject to the above water accounting constraints, the current and expected marketprice for water and the physical constraints as detailed in the planner’s problem.Given user withdrawals wit, the spot market can be solved — for Pt and qit — as astand alone static optimisation problem. The solution to the users problem is thena policy function for wit.

In the spirit of OE (Weintraub et al. 2008), we assume user policies depend only on their own state $s_{it}$, $e_{it}$ (rather than $s_t$, $e_t$), as well as the aggregate shock $I_t$ and the storage level $S_t$, limiting our attention to policy functions of the form

$$w^*_{it} = f_h(s_{it}, S_t, e_{it}, I_t)$$

5.3 Policy scenarios

Hughes (2015) details a large number of policy scenarios, reflecting different approaches to water property rights observed in Australian and Western US rivers. Here we consider two main alternatives to CS: open access and no storage rights.

5.3.1 Open access storage (OA)

Under OA there are no account limits and no loss deductions. Rather, all spills and losses are allocated in proportion to inflow shares (i.e., ‘socialised’), such that user accounts follow

$$s_{it+1} = \min\{\max\{s_{it} - w_{it} + \lambda_i I_{t+1} + x_{it+1},\; 0\},\; K\}$$

$$x_{it+1} = \lambda_i (L_t + Z_t)$$

Under OA we have a potential ‘tragedy of the commons’ outcome: users are not accountable for spills or evaporation, so they will tend to ‘over-store’.

5.3.2 No storage rights (NS) ‘use it or lose it’

Here users have no storage rights. That is, any unused water is reallocated in proportion to inflow shares, so that user accounts follow


$$s_{it+1} = \lambda_i S_{t+1}$$

With a large number of users, the incentive will be to consume or trade all water allocations: to ‘under-store’.

5.4 Solving the decentralised model

For detail on the parameter assumptions and functional forms see Hughes (2015).

In our application there are 100 agents, divided into two groups: the ‘low reliability’ users (with elastic water demands) and the ‘high reliability’ users (with inelastic demands). The users within each group are ex ante identical (they differ only in their realised productivity levels $e_{it}$), so we effectively have two (rather than $n$) user problems to solve.

We begin by solving the planner's problem by SDP. The planner's solution is then used to derive initial user policy functions. We then solve both the high and low reliability user problems (holding opponent policies fixed) by single agent fitted Q-V iteration. This yields an initial batch of $T$ samples and estimates $f_i$, $V_i$ of the policy and value functions. We then proceed to the full multi-agent algorithm.

We use $J = 25$ major iterations, $H = 1$ value iterations and $\lambda = 0.10$. After each major iteration we update policies for a 20 per cent random sample of agents. From this sample of agents we select a subsample to become ‘explorers’. We adopt Gaussian exploration:

$$w_{it} = \min\{\max\{f_i(s_{it}, S_t, e_{it}, I_t) + N(0, \delta_t s_{it}),\; 0\},\; s_{it}\}, \qquad 0 < \delta_t < 1$$

The number of explorers and the range of exploration decline over the 25 iterations: starting with 10 explorers (5 per user group) and $\delta = 0.25$, down to 4 explorers and $\delta = 0.085$.
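
A small sketch of the exploration rule follows. The policy function call is an assumed interface, and $\delta_t s_{it}$ is treated here as the standard deviation of the noise; the paper's $N(0, \cdot)$ notation could equally denote the variance.

```python
import numpy as np

def explore_withdrawal(f_i, s_it, S_t, e_it, I_t, delta_t, rng):
    """Gaussian exploration around the current policy, clipped to [0, s_it]."""
    w = f_i(s_it, S_t, e_it, I_t) + rng.normal(0.0, delta_t * s_it)
    return min(max(w, 0.0), s_it)
```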

We begin with $T$ of 100,000. We pool the samples of our explorers, so that we have 500,000 samples for each of the user groups (given five explorers per group). We optimise the Q function for both groups over a sample grid of state points, with a radius of 0.045.

We use tile coding to approximate the $Q$, $V$ and $f$ functions, employing the ‘optimal’ displacement vectors of Brown and Harris (1994). We use ASGD to train the $Q$ functions, SGD to train the $f$ functions and averaging to train the $V$ functions.

The whole process runs on a standard 4-core desktop computer in around two minutes.

5.5 Results

To begin, we show how the sample means of key variables evolve during the learning algorithm. Figure 5 shows the mean storage level $\frac{1}{T}\sum_{t=1}^{T} S_t$ at each iteration — beginning with the planner's solution as iteration 0.

Figure 5: Mean storage by iteration

(Mean storage $S_t$ (GL) plotted against iteration, for the CS, NS and OA scenarios.)

Importantly, the key aggregate variables (prices, storage volumes, payoffs) show a high degree of stability. While subject to some noise, changes in the value and policy functions tend to diminish, rather than cycle (figures 7 and 8). In testing, cycles are found to emerge in these problems when the degree of smoothing is insufficient.

The results suggest the presence of some form of equilibrium. What is perhaps more important is that the model provides clear and consistent results on the relative performance of our policy scenarios.

Scenarios that involve small externalities (CS), and which are expected — from theoretical results (Truong and Drynan 2013) — to achieve close to optimal outcomes, closely match the planner's SDP solution. In fact, CS achieves slightly below optimal storage volumes, consistent with the presence of ‘internal spill’ externalities.


Figure 6: Mean social welfare by iteration

(Mean social welfare $\sum_{i=1}^{n} u_{it}$ (\$M) plotted against iteration, for the CS, NS and OA scenarios.)

Figure 7: Value function error (Mean absolute percentage deviation)

(a) Low reliability; (b) High reliability. (Value function error plotted against iteration, for the CS, NS and OA scenarios.)


Figure 8: Policy function error (Mean percentage deviation)

(a) Low reliability; (b) High reliability. (Policy function error plotted against iteration, for the CS, NS and OA scenarios.)

Scenarios with large externalities (OA and NS) result in expected changes in behaviour (over- and under-storage respectively) and welfare losses.

The algorithm is robust in the sense that it achieves consistent results when solved repeatedly (for the same parameter values). While individual policy functions display some noise, this tends to average out across large numbers of agents, such that aggregate results are highly consistent. This is demonstrated by figure 9, which shows the results of 20 model runs. The degree of precision achieved will of course be influenced by the choice of sample size.

Figure 9: Final mean storage volume (result of 20 model runs)

(Final mean storage volume (GL) for the CS, OA and NS scenarios.)

In Hughes (2015) a sensitivity analysis is performed, in which the above model is solved 1500 times for varying parameter values. For each run the multi-agent algorithm is applied with the exact same meta-parameters (without ‘re-tuning’) and consistent performance is achieved. The method has since been applied to a more complex version of the problem with seasonality and an environmental agent (with payoffs defined over aggregate river flows).

6 Conclusions

Reinforcement learning provides a mature set of algorithms for solving single agent MDPs. While some practical problems are faced in adapting these methods to economic domains, they are not insurmountable.

Economic problems tend to be noisy, relatively ‘smooth’ (the value and policy functions tend to be smooth after averaging out the noise) and involve relatively small state spaces. The batch method of fitted Q-V iteration is well suited to this context. While large samples may be required, the method can be faster than dynamic programming when combined with appropriate function approximation (Hughes 2015).

Here we find the method of tile coding ideal. Tile coding can process large data sets much faster than alternatives relying on global basis functions. Tile coding may also be useful in other applications in economics, including dynamic programming. In more than 10 dimensions the memory requirements of tile coding can become restrictive. In large state spaces we can either turn to alternative approximation methods, such as ‘random forests’, or consider some form of state ‘abstraction’ method to omit or aggregate the least relevant state variables.

Our multi-agent algorithm combines fitted Q-V iteration with two repeated game type learning dynamics. The resulting algorithm provides a fast and convenient method for solving large complex stochastic games. The approach represents something of a middle ground between Krusell and Smith (1998) style dynamic programming methods and agent based methods — both in terms of the size and complexity of the models it can be applied to and the degree of rationality or ‘intelligence’ assumed for the agents.

We applied our method to a set of multi-agent water storage problems. The agents within these problems face stochastic dynamic optimisation problems with complex market (i.e., price) and non-market (i.e., stock) interactions. Our method yielded stable results consistent with intuition and theory. The method was fast and robust enough to allow the model to be solved many times for different parameter values and to be extended to even more complex problems (Hughes 2015).


The field of multi-agent learning is still relatively young. Much debate remains within and between disciplines on the best notion of equilibrium, and even on whether we should be concerned with equilibrium at all (Fudenberg and Levine 2007). While techniques are continuously evolving, the method of multi-agent fitted Q-V iteration provides a practical starting point for economic problems.


References

Abhishek, V., Adlakha, S., Johari, R. and Weintraub, G. 2007, Oblivious equilibrium for general stochastic games with many players, in ‘Allerton Conference on Communication, Control and Computing’.

Amir, R. 2005, Discounted Supermodular Stochastic Games: Theory and Applications.

Balbus, L., Reffet, K. and Wozny, L. 2011, A Constructive Study of Markov Equilibria in Stochastic Games with Strategic Complementarities, in ‘Conference on Theoretical Economics, University of Kansas’.

Bekiros, S. D. 2010, ‘Heterogeneous trading strategies with adaptive fuzzy actor-critic reinforcement learning: A behavioral approach’, Journal of Economic Dynamics and Control, vol. 34, no. 6, pp. 1153–1170.

Bellman, R. 1952, ‘On the theory of dynamic programming’, Proceedings of the National Academy of Sciences of the United States of America, vol. 38, no. 8, p. 716.

Bottou, L. 2010, Large-scale machine learning with stochastic gradient descent, in ‘Proceedings of COMPSTAT’2010’.

Brown, M. and Harris, C. J. 1994, ‘Neurofuzzy adaptive modelling and control’.

Busoniu, L., Babuska, R. and De Schutter, B. 2008, ‘A comprehensive survey of multiagent reinforcement learning’, Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 38, no. 2, pp. 156–172.

Crawford, V. P. 2013, ‘Boundedly Rational versus Optimization-Based Models of Strategic Thinking and Learning in Games’, The Journal of Economic Literature, vol. 51, no. 2, pp. 512–527.

Dudley, N. J. and Musgrave, W. F. 1988, ‘Capacity sharing of water reservoirs’, Water Resources Research, vol. 24, no. 5, pp. 649–658.

Duggan, J. 2012, ‘Noisy stochastic games’, Econometrica, vol. 80, no. 5, pp. 2017–2045.

Ernst, D., Geurts, P. and Wehenkel, L. 2005, ‘Tree-based batch mode reinforcement learning’, Journal of Machine Learning Research, vol. 6, pp. 503–556.

Escobar, J. 2011, ‘Equilibrium Analysis of Dynamic Models of Imperfect Competition’.


Fudenberg, D. and Levine, D. K. 1998, The theory of learning in games, Vol. 2, MIT Press.

Fudenberg, D. and Levine, D. K. 2007, ‘An economist’s perspective on multi-agent learning’, Artificial Intelligence, vol. 171, no. 7, pp. 378–381.

Horst, U. 2005, ‘Stationary equilibria in discounted stochastic games with weakly interacting players’, Games and Economic Behavior, vol. 51, no. 1, pp. 83–108.

Hughes, N. 2015, Water property rights in rivers with large dams, PhD thesis, Australian National University.

Krusell, P. and Smith, J. A. A. 1998, ‘Income and wealth heterogeneity in the macroeconomy’, Journal of Political Economy, vol. 106, no. 5, pp. 867–896.

Kutschinski, E., Uthmann, T. and Polani, D. 2003, ‘Learning competitive pricing strategies by multi-agent reinforcement learning’, Journal of Economic Dynamics and Control, vol. 27, no. 11, pp. 2207–2218.

Maliar, L. and Maliar, S. 2015, ‘Merging simulation and projection approaches to solve high-dimensional problems with an application to a new Keynesian model’, Quantitative Economics, vol. 6, no. 1, pp. 1–47. <http://dx.doi.org/10.3982/QE364>

Martin-Herran, G. and Rincon-Zapatero, J. P. 2005, ‘Efficient Markov perfect Nash equilibria: theory and application to dynamic fishery games’, Journal of Economic Dynamics and Control, vol. 29, no. 6, pp. 1073–1096.

Maskin, E. and Tirole, J. 1988, ‘A theory of dynamic oligopoly, I: Overview and quantity competition with large fixed costs’, Econometrica: Journal of the Econometric Society, vol. 56, no. 3, pp. 549–569.

Riedmiller, M. 2005, Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method, in ‘Machine Learning: ECML 2005’, Springer, pp. 317–328.

Rubio, S. and Casino, B. 2001, ‘Competitive versus efficient extraction of a common property resource: The groundwater case’, Journal of Economic Dynamics and Control, vol. 25, no. 8, pp. 1117–1137.

Shapley, L. 1953, ‘Stochastic games’, Proceedings of the National Academy of Sciences of the United States of America, vol. 39, no. 10, pp. 1095–1100.

Shoham, Y., Powers, R. and Grenager, T. 2007, ‘If multi-agent learning is the answer, what is the question?’, Artificial Intelligence, vol. 171, no. 7, pp. 365–377.


Stachurski, J. 2008, ‘Continuous state dynamic programming via nonexpansive approximation’, Computational Economics, vol. 31, no. 2, pp. 141–160.

Sutton, R. and Barto, A. 1998, Reinforcement learning: An introduction, Vol. 1, Cambridge University Press.

Tesauro, G. and Kephart, J. O. 2002, ‘Pricing in agent economies using multi-agent Q-learning’, Autonomous Agents and Multi-Agent Systems, vol. 5, no. 3, pp. 289–304.

Tesfatsion, L. and Judd, K. L. 2006, Agent-based computational economics, Vol. 2.

Timmer, S. and Riedmiller, M. 2007, Fitted Q iteration with CMACs, in ‘Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on’, IEEE, pp. 1–8.

Truong, C. H. and Drynan, R. G. 2013, ‘Capacity sharing enhances efficiency in water markets involving storage’, Agricultural Water Management, vol. 122, pp. 46–52.

Varian, H. R. 2014, ‘Big data: New tricks for econometrics’, The Journal of Economic Perspectives, vol. 28, no. 2, pp. 3–27.

Waltman, L. and Kaymak, U. 2008, ‘Q-learning agents in a Cournot oligopoly model’, Journal of Economic Dynamics and Control, vol. 32, no. 10, pp. 3275–3293.

Watkins, C. and Dayan, P. 1992, ‘Q-learning’, Machine Learning, vol. 8, no. 3, pp. 279–292.

Weintraub, G., Benkard, C. and Van Roy, B. 2008, ‘Markov Perfect Industry Dynamics With Many Firms’, Econometrica, vol. 76, no. 6, pp. 1375–1411.

Weintraub, G., Benkard, C. and Van Roy, B. 2010, ‘Computational methods for oblivious equilibrium’, Operations Research, vol. 58, no. 4-Part-2, pp. 1247–1265.

Weiring, M. and Otterlo, M., eds 2012, Reinforcement learning: State-of-the-Art, Vol. 12 of Adaptation, Learning and Optimization, Springer.
