
Journal of Economic Theory 124 (2005) 171–201, www.elsevier.com/locate/jet

Learning aspiration in repeated games

In-Koo Cho a,∗,1, Akihiko Matsui b,2

a Department of Economics, University of Illinois, 1206 S. 6th Street, Champaign, IL 61820, USA
b Faculty of Economics, University of Tokyo, Japan

Received 25 November 2003; final version received 12 December 2004. Available online 5 February 2005

Abstract

We study infinitely repeated symmetric 2×2 games played by boundedly rational agents who follow a simple rule of thumb: each agent continues to play the same action if the current payoff exceeds the average of the past payoffs, and switches to the other action with a positive probability otherwise. By applying the stochastic approximation technique, we characterize the asymptotic outcomes for all 2×2 games. In the prisoners' dilemma game, for example, the players cooperate in the limit if and only if the gain from defecting against cooperation is "modest."
© 2005 Elsevier Inc. All rights reserved.

JEL classification: D83

Keywords: Aspiration; Bounded rationality; Cooperation; Mean dynamics; Recursive learning; Repeated games; Satisficing behavior; Stochastic approximation

1. Introduction

According to the theory of satisficing behavior [22], a decision maker satisfices rather than optimizes.3 Instead of solving an optimization problem, a decision maker has an

∗ Corresponding author. Fax: +1 217 244 6678. E-mail address: [email protected] (I.-K. Cho).

1 Financial support from the National Science Foundation (SES-0004315) is gratefully acknowledged. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

2 Financial support from the Ministry of Education and Science of Japan is gratefully acknowledged. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Ministry.

3 The satisficing behavior, or the "win-stay, lose-shift" principle, is observed in animal behavior as well as in human behavior. Thorndike [23] mentioned this type of behavior prior to Simon [22].

0022-0531/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.jet.2004.12.001


aspiration level and searches for a better alternative until he finds the one whose payoff exceeds the aspiration level. Once he finds such an alternative, he quits his search and sticks to it. The present paper examines the behavior of satisficing agents in 2×2 symmetric repeated games.

We assume that each player continues to play the same action if the action generates a one period payoff that exceeds the aspiration level, which summarizes a history according to the average of the past payoffs, and switches to the other action with a positive probability otherwise.4 Focusing on the dynamic process of the aspiration levels, we study the asymptotic properties of the average payoffs and the patterns of behavior of the decision makers.

However, the dynamic process of the aspiration levels might induce many different asymptotic outcomes, some of which appear "unstable." In order to refine the set of asymptotic outcomes, we perturb the dynamics by introducing a little imperfection: with a small probability, the decision maker might choose an action other than the one dictated by the "behavior" rule.

Because of the perturbation, the aspiration and the outcome of the game are rendered stochastic. As the action influences the aspiration levels, which in turn determine the future actions to be played, the evolution of the aspiration level exhibits a self-referential feature [14]. Formally, the dynamics of the aspiration levels can be described as a state-dependent Markov process where the transition matrix of action pairs is determined by the aspiration level. Consequently, the evolution of the aspiration level and the resulting outcome path become non-stationary and history dependent, which poses a considerable challenge.

This paper uses the stochastic approximation technique [12] to investigate the asymptotic properties of the aspiration levels. Instead of directly studying the non-stationary stochastic process, we approximate the sample paths of the process by a trajectory induced by the deterministic process, which is called the mean dynamics or the ordinary differential equation (ODE) associated with the original stochastic process. This approximation enables us to study the asymptotic properties, which would otherwise be obscured by the complex stochastic perturbations. In our model, the associated ODE has a particularly simple form, which helps us directly calculate the limit points for all 2×2 games and examine the stability of these points.

We characterize the asymptotic properties of the aspiration levels and the behavior of the players for symmetric 2×2 games. Our intuition tells us that in the coordination game, the repeated interaction should promote cooperation between the two players, which is what our model predicts. Although many other outcomes can be sustained as a limit of the deterministic dynamics, the Pareto efficient coordination outcome is the only stable outcome of the perturbed dynamics.

The technique of stochastic approximation can be equally applied to games that do not have a symmetric Pareto efficient individually rational outcome, such as the battle of the sexes.5 In the battle of the sexes, we demonstrate that the learning dynamics lead to either

4 Gilboa and Schmeidler [8, pp. 147–148] states, "Realism means that the aspiration level is set closer to the best average performance so far experienced."

5 The battle of the sexes has two such pure strategy outcomes. See (12).


one of the two efficient outcomes, but which of the two limit outcomes is realized depends upon the history.

In the prisoners' dilemma game, our model predicts that cooperation is achieved asymptotically if and only if the gain from defection is "modest."6 If the gain is sufficiently large, cooperation is not attained. We can still characterize the limit points of the aspiration levels, from which one can infer the asymptotic behavior of the players. In other words, the size of the gain from defection determines the stability of the cooperation outcome.

In the prisoners' dilemma, if one player defects while the other player cooperates, then the cooperating player (who is double-crossed) receives the worst possible payoff of the game. The double-crossed player immediately switches his action to punish the defector, as we expect from the tit-for-tat strategy. Once both parties move to the non-cooperative outcome, both players immediately realize how bad the non-cooperative outcome could be, and move back to cooperation. If the gain from defection is not too large, this cycle harms the original defector as well as the defected. Thus, cooperation is sustained. On the other hand, if the gain from defection is sufficiently large, the defector gains from the above cycle, and therefore, temptation for deviation pushes the outcome away from cooperation, and the cycle of cheating, punishment and cooperation repeats itself. As a result, the aspiration level converges (roughly) to the points obtained as a convex combination of three outcomes: cooperation, defection against cooperation and the one shot Nash equilibrium.7

Recently, the satisficing behavior has drawn considerable attention as an alternative behavioral assumption in games.8 Gilboa and Schmeidler [6] developed an axiomatic foundation for the satisficing behavior, and [7] examined a more elaborate behavior by assuming that the agent adjusts the aspiration level as he accumulates experiences.

Refs. [10, KMRV; 21,17,18] focus on component games which have a symmetric Pareto efficient individually rational outcome.9 In KMRV, for example, each player uses his own past payoffs to calculate the aspiration level, while in [17], each player uses the average of the two players' payoffs. Both papers show that in a repeated game with a unique Pareto efficient individually rational point in pure strategies, cooperation emerges in the limit. We freely borrow the basic elements of our model, such as the dynamics of the aspiration level and the behavior rule, from these papers.

Ref. [19] is essentially a deterministic version of ours, and considers all symmetric 2×2 games as we do. However, [19] admits perpetual coordination failure in the coordination games. Our analysis shows how a slight perturbation of the deterministic dynamics can eliminate unstable outcomes of the deterministic dynamics.

By using the stochastic approximation technique, we develop a unified method of analysis, and consequently avoid various issues arising from details of the component game. For example, KMRV needs the existence of a unique pure strategy individually rational Pareto efficient outcome in order to characterize the asymptotic outcomes of the learning dynamics. As a result, KMRV did not cover games like the battle of the sexes, which have two pure

6 See (14). To be precise, the gain from defection is "modest" if g < 1.

7 There are two such points, depending upon who defects first against the other player at the cooperation outcome.

8 Ref. [24] is a remarkable early example of the application of the satisficing behavior.

9 Ref. [16] applies the method of KMRV to the ultimatum game.


strategy Pareto efficient outcomes. In contrast, our approach can be applied to all repeated 2×2 games to identify the basic dynamic structure, independently of the details of the payoff structure.10 Indeed, once we identify the basic dynamic structure, the rest of the analysis follows from the standard results in the stochastic approximation literature.

The technique of stochastic approximation has been used in the recursive learning literature in macroeconomics [14,5] and recently in evolutionary models [1,20]. Ref. [1] applies the technique to study the relationship between stochastic dynamics and deterministic dynamics such as best response dynamics in evolutionary models. In a repeated game between two players, each player has significant influence on the outcome of the game, and the learning dynamics of the two players are intertwined. As a result, the evolution of the state space is considerably more complicated than in the existing models, where the diverse behavior of the agents is often aggregated into a one dimensional state variable.

The rest of the paper is organized as follows. In the next section, we describe the basic framework. We describe the model with a deterministic behavior rule. We informally analyze a coordination game as an example to illustrate what the learning dynamics can implement in the limit and also to motivate the need to refine the set of limit outcomes by introducing a small amount of perturbations to the basic framework. Section 3 formalizes the idea of perturbing the decision rule, and rigorously describes the perturbed model. Section 4 characterizes the asymptotic properties of the stochastic dynamics in 2×2 games by exploiting the stochastic approximation technique. Section 5 examines a number of important extensions of the basic model. In Section 5.1, we analyze a model in which the aspiration level is determined as a convex combination of the two players' payoffs in order to illustrate how our analysis applies to a larger class of models including [18,17]. Section 5.2 investigates a model in which the evolution of the aspiration is subject to a random shock [10]. Section 5.3 examines the model in which a player uses a weighted average of the payoffs, assigning larger weights to more recent realizations of payoffs to calculate the aspiration level as in [10]. Section 6 concludes the paper.

2. Unperturbed model and its caveats

Consider two players who play a 2×2 game

$$G = \langle \{C, D\}, \{C, D\};\ u_1, u_2 \rangle$$

infinitely many times, where C and D are the actions available for each player and

$$u_i : \{C, D\} \times \{C, D\} \to \mathbb{R}, \qquad i = 1, 2,$$

is the payoff function of the one shot game. In the sequel, i stands for either 1 or 2. Let s_t ∈ {C, D}² be an outcome in period t. A history is a sequence of outcomes, and a repeated game strategy is a mapping from a history to an action s_i ∈ {C, D}. Let us assume that each individual player is sufficiently patient so that the long-run cooperation at (C, C) can be sustained by some Nash equilibrium of the infinitely repeated game.

10 See [4] for the analysis of games with more than two actions.


Instead of a fully rational player, we model each player as a boundedly rational decision maker who summarizes a past history into a single state variable according to a simple rule of thumb, which we call the aspiration level [22]. Let a_{i,t} be the aspiration level of player i at the t-th round, which is calculated as the average of the past payoffs:

$$a_{i,t} = \frac{1}{t} \sum_{\tau=1}^{t} u_i(s_\tau).$$

Note that a_{i,t} can be written recursively as

$$a_{i,t} = a_{i,t-1} + \gamma_t \bigl( u_i(s_t) - a_{i,t-1} \bigr), \qquad (1)$$

where

$$\gamma_t = \frac{1}{t}. \qquad (2)$$
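For completeness, the recursion (1) with gain (2) is nothing more than the running average computed one step at a time; a one-line check:

$$a_{i,t} = \frac{1}{t}\sum_{\tau=1}^{t} u_i(s_\tau)
        = \frac{t-1}{t}\, a_{i,t-1} + \frac{1}{t}\, u_i(s_t)
        = a_{i,t-1} + \frac{1}{t}\bigl( u_i(s_t) - a_{i,t-1} \bigr).$$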

Note that as player i summarizes a history into the average payoff in the past, he suppresses much information that could have been used for making a choice, such as the precise sequence of outcomes.11

The aspiration level a_{i,t} is the level of payoff which player i expects from action s_{i,t} in period t, based on his own past experience. Thus, if player i has entertained a sequence of high payoffs, it seems natural that player i continues to expect to receive a high payoff from his action, and vice versa. If payoff u_i(s_t) from player i's action s_{i,t} ∈ S_i = {C, D} at period t exceeds aspiration level a_{i,t}, then player i chooses s_i in the (t+1)st period. If u_i(s_t) ≤ a_{i,t} holds, then player i takes s′_i ≠ s_i in the next period. To describe the behavior rule rigorously, let −s_i be the action other than s_i ∈ {C, D}:

$$\{-s_i\} = \{C, D\} \setminus \{s_i\}.$$

For all i ∈ {1, 2},

$$s_{i,t+1} = \begin{cases} s_{i,t} & \text{if } u_i(s_{1,t}, s_{2,t}) > a_{i,t}, \\ (1-p)[-s_{i,t}] + p[s_{i,t}] & \text{if } u_i(s_{1,t}, s_{2,t}) \leq a_{i,t}, \end{cases} \qquad (3)$$

where p ∈ (0, 1) represents inertia, and (1−p)[−s_{i,t}] + p[s_{i,t}] is a mixed strategy over s_{i,t} and −s_{i,t} with s_{i,t} being played with probability p. No conclusion of the paper is sensitive to the specific tie breaking rule in (3).

We introduce the inertia to capture the idea of a switching cost. That is, once a player chooses an action, it is more difficult to switch to another action than to play the same action in the next round (cf. KMRV). One can interpret p close to 0 as a minimal switching cost. Similarly, if p is close to 1, the switching cost is so large that the agent might not move to the alternative action.

Combining (1) and (3) for given initial aspiration levels a_0 = (a_{1,0}, a_{2,0}) and actions s_0 = (s_{1,0}, s_{2,0}), we can generate {a_t, s_t}_{t=1}^∞. We are most interested in characterizing the asymptotic properties of {a_t, s_t}_{t=1}^∞ in order to understand the asymptotic behavior of the players.

11 The following analysis will be valid with some modification even if γ_t is constant across time. In that case, γ_t = γ has to be sufficiently small, and the notion of convergence to be used will be convergence in distribution as opposed to convergence with probability 1.
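To make the unperturbed model concrete, the following minimal Python sketch simulates the rule of thumb defined by (1) and (3) for the coordination game analyzed below. All names here (simulate, the array layout of U, the seed) are ours rather than the paper's; the only modeling choice beyond the text is that the initial aspiration a_0 is treated as one prior observation, so that the running average stays strictly below a constant payoff stream.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(U, p=0.2, a0=(-0.1, -0.1), s0=(0, 0), T=10_000):
    """Unperturbed dynamics: aspiration update (1)-(2) plus the satisficing rule (3).

    U is a 2x2x2 array with U[s1, s2, i] = one-shot payoff of player i; actions 0 = C, 1 = D.
    """
    a = np.array(a0, dtype=float)
    s = list(s0)
    for t in range(1, T + 1):
        payoff = U[s[0], s[1]]
        a += (payoff - a) / (t + 1)          # running average; a0 counts as one prior draw
        for i in range(2):
            # a dissatisfied player switches with probability 1 - p (inertia keeps him with prob. p)
            if payoff[i] <= a[i] and rng.random() > p:
                s[i] = 1 - s[i]
    return tuple(s), a

# Pure coordination game (4): u(C,C) = (1,1), u(D,D) = (0,0), miscoordination pays (-1,-1).
U = np.array([[[1, 1], [-1, -1]], [[-1, -1], [0, 0]]], dtype=float)
print(simulate(U))   # aspirations typically settle near (1,1) or (0,0), depending on the path
```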

To better understand the role of inertia and the asymptotic properties of the learning dynamics, let us consider the following coordination game:

$$\left(\begin{array}{c|cc} & C & D \\ \hline C & 1,\ 1 & -1,\ -1 \\ D & -1,\ -1 & 0,\ 0 \end{array}\right) \qquad (4)$$

in order to illustrate the learning dynamics. Assume throughout this example that the initial aspiration level a_{i,0} for player i is slightly less than 0, say −0.1.

If each player chooses C in the initial round, then the realized payoff for player i is 1, which is strictly larger than his aspiration level in the initial round, −0.1. Thus, player i chooses C in the next round. Because the average payoff is less than the payoff from (C, C), each player keeps playing C so that the aspiration vector a_t = (a_{1,t}, a_{2,t}) converges to (1, 1), which is considered the focal point of this game [9,10].

However, the learning dynamics can generate other outcomes, such as the perpetual coordination failure where (C, D) and (D, C) alternate with each other, because players are dissatisfied by the outcome in every period. The inertia plays an important role in breaking a perpetual coordination failure. Since the coordination failure generates −1, which is lower than the aspiration level, each player switches to another action with probability 1 − p. However, with probability p, one party keeps playing the same action. Thus, with probability p(1 − p), the two parties achieve coordination either at (C, C) or (D, D). Since the coordination payoff exceeds the payoff from the coordination failure, the two parties keep playing the same action after they achieve coordination.

While we can sustain the Pareto efficient coordination outcome through the aspiration learning process and inertia, we can also sustain other outcomes which we find less intuitive than the Pareto efficient outcome, namely the inefficient coordination outcome (D, D). We need to refine the limit outcomes of the learning dynamics by eliminating outcomes which do not appear intuitive. To this end, we impose an additional condition that the limit outcome should be robust against small imperfections of the behavior rule, as in many evolutionary models [9].

Suppose that there is a small probability ε > 0 with which a player chooses an action other than the one dictated by the behavior rule (3). If (3) dictates that the player keep playing the same action but he ends up choosing another action with probability ε > 0, then we can view ε as the probability of a mistake. A small ε > 0 models the imperfect execution of the behavior rule by the player.

We can offer an "informal" explanation that for any small ε > 0, the inefficient coordination outcome is not a limit outcome of the "perturbed" learning dynamics. Suppose that the two players coordinate around (D, D) and their aspiration levels remain negative. Although each player is satisfied with the present payoff 0, as the payoff is above the aspiration level, with a small, but positive, probability ε² > 0, the two players switch to (C, C). Once (C, C) is realized, the two players maintain C since the realized payoff u_i(C, C) = 1 exceeds the aspiration level.

However, this explanation is "incomplete." Because of the imperfect behavior rule, outcomes other than (C, C) can arise with a positive probability. We have yet to show that


following any outcome other than (C, C), the learning dynamics will force the players to keep playing C with high probability.

3. Perturbed model

Our first task is to formulate the imperfection in the behavior rule. Note that the behavior rule is represented by a step function of

$$u_i(s_t) - a_{i,t}.$$

If u_i(s_t) − a_{i,t} > 0, then player i chooses the same action in period t + 1 as he did in period t. Otherwise, he switches to another action with a positive probability. The discontinuity occurs at the point where u_i(s_t) − a_{i,t} = 0.

We perturb this decision rule by introducing a small amount of imperfection in switching decisions. If |u_i(s_t) − a_{i,t}| is large, then it is very clear whether the present action exceeds or falls short of the aspiration level expected from the action. Thus, there is little chance of getting confused about switching the action. On the other hand, if |u_i(s_t) − a_{i,t}| is close to 0, then the agent might not be so sure about whether to switch the action or not. Each player is suspicious about the possibility of imperfect observation, or a small observational error. Thus, when |u_i(s_t) − a_{i,t}| is small, player i has a good reason to hesitate to accept the observation at its face value.

Let h_i : ℝ → (0, 1) be a continuously differentiable weakly increasing function satisfying the following properties:

$$\lim_{x \to \infty} h_i(x) = 1, \qquad \lim_{x \to -\infty} h_i(x) = p. \qquad (5)$$

We call such an h_i a sigmoid function. Let h_i(u_i(s_t) − a_{i,t}) be the probability that player i chooses the same action in t + 1 as in t. Because h_i is strictly increasing, player i will choose the same action in period t + 1 with a large probability if u_i(s_t) − a_{i,t} > 0 is very large. Similarly, if the present payoff u_i(s_t) falls "far" below aspiration level a_{i,t}, then player i switches to another action with a large probability, albeit subject to inertia.

It is not necessary that h_i(x) be greater than p, and strictly smaller than 1, over the entire domain of x. We impose this condition only to simplify the analysis. The same analysis applies as long as there exists ε > 0 such that 0 < h_i(x) < 1 for all x ∈ N_ε(0), where N_ε(0) is the ε-neighborhood of 0. On the other hand, it is essential that

$$0 < h_i(0) < 1,$$

which forces player i to experiment with a positive frequency as long as u_i(s_t) − a_{i,t} is close to 0, regardless of its sign.
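As an illustration (our example, not the paper's), one function satisfying (5) is a logistic curve rescaled to take values between p and 1; the scale parameter σ below controls how close it is to the limiting step function h̄_p introduced shortly.

```python
import numpy as np

def make_sigmoid(p, sigma):
    """A sigmoid h with lim_{x->+inf} h(x) = 1 and lim_{x->-inf} h(x) = p, as required by (5).

    As sigma -> 0, h approaches the step function h_p(x) = p + (1 - p) 1_{x >= 0} used in the text.
    """
    return lambda x: p + (1.0 - p) / (1.0 + np.exp(-x / sigma))

h = make_sigmoid(p=0.2, sigma=0.05)
print(h(-1.0), h(0.0), h(1.0))   # approx. 0.2, 0.6, 1.0
```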

Remark 3.1. KMRV used a different sigmoid function h̃_i satisfying h̃_i(0) = 1, so that h̃_i(x) = 1 for all x ≥ 0. That is, as long as u_i(s_t) − a_{i,t} ≥ 0, s_{i,t+1} = s_{i,t} with probability one. This seemingly minor difference leads to profoundly different asymptotic properties of the learning dynamics in the prisoner's dilemma game when the gain from deviation against cooperation is large, i.e., g > 1. We predict that the cooperation outcome in the prisoner's dilemma may not be stable if g > 1. On the other hand, KMRV predicts that cooperation is the only limit point of the learning dynamics for any value of g.

Let us consider the sigmoid function h_i defined as in (5). If the present payoff falls below the aspiration level, player i can stick to the present action with probability p. For a pair h and h′ of increasing real-valued functions over ℝ, let d(h, h′) be the Hausdorff metric of the graphs of h and h′ in ℝ². In particular, if h′ is a step function and h is a sigmoid function, d(h, h′) → 0 implies that h(x) − h′(x) → 0 pointwise except at discontinuity points. Define

$$\bar h_p(x) = p + (1 - p)\,\mathbf{1}_{x \geq 0},$$

where 1_W is the characteristic function of event W: 1_W = 1 if the state is in W, and 0 otherwise. We omit p and write h̄ whenever there is no risk of confusion. We are interested in the dynamics in the limit of sigmoid functions which converge to h̄.

We arrange the elements of S according to the following order:

$$(C, C),\ (C, D),\ (D, C),\ (D, D)$$

and let π ∈ Δ⁴ (which is a row vector in the unit simplex of ℝ⁴) be the probability distribution over S. Let P(a) = (p_{s′s}(a)) be the transition matrix induced by the h_i, where

$$p_{s's}(a) = \Bigl[ h_1\bigl(u_1(s') - a_1\bigr)\,\mathbf{1}_{s_1 = s'_1} + \bigl(1 - h_1(u_1(s') - a_1)\bigr)\bigl(1 - \mathbf{1}_{s_1 = s'_1}\bigr) \Bigr] \times \Bigl[ h_2\bigl(u_2(s') - a_2\bigr)\,\mathbf{1}_{s_2 = s'_2} + \bigl(1 - h_2(u_2(s') - a_2)\bigr)\bigl(1 - \mathbf{1}_{s_2 = s'_2}\bigr) \Bigr]$$

is the element in the s′th row and sth column. Note that every component of P(a) is a strictly positive differentiable function of a = (a_1, a_2). Similarly, define P̄(a) = (p̄_{s′s}(a)) induced by h̄, in which

$$\bar p_{s's}(a) = \Bigl[ \bar h\bigl(u_1(s') - a_1\bigr)\,\mathbf{1}_{s_1 = s'_1} + \bigl(1 - \bar h(u_1(s') - a_1)\bigr)\bigl(1 - \mathbf{1}_{s_1 = s'_1}\bigr) \Bigr] \times \Bigl[ \bar h\bigl(u_2(s') - a_2\bigr)\,\mathbf{1}_{s_2 = s'_2} + \bigl(1 - \bar h(u_2(s') - a_2)\bigr)\bigl(1 - \mathbf{1}_{s_2 = s'_2}\bigr) \Bigr].$$

Note that πP(a) represents the action realized in the present round, conditioned on the previous pair of actions, π, and the aspiration vector a. Let u be a 4×2 matrix obtained by arranging the component game payoffs:

$$u = \begin{pmatrix} u_1(C,C) & u_2(C,C) \\ u_1(C,D) & u_2(C,D) \\ u_1(D,C) & u_2(D,C) \\ u_1(D,D) & u_2(D,D) \end{pmatrix}.$$

Given an action distribution π, π · u represents the expected payoff vector. It would be useful to review some of the properties of P(a).

Lemma 3.2. (1) For any a, P(a) is irreducible and aperiodic, so that it has a unique invariant distribution π*_a satisfying π*_a = π*_a P(a) and, for any initial condition π_0, lim_{t→∞} π_0 P^t(a) = π*_a.

(2) π*_a is continuous with respect to a.

(3) If P̄(a) has a unique invariant distribution π̄_a, then π*_a converges to π̄_a as d(h_i, h̄) → 0.

The first property allows us to "represent" P(a) by its invariant distribution, which significantly simplifies the characterization of the learning dynamics. Because the invariant distribution changes continuously with respect to the transition matrix, we often "approximate" the transition matrix P(a) by P̄(a), which has a significantly simpler structure than P(a).
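The following short sketch (our own illustration; the function names and the particular sigmoid are assumptions, not the paper's code) builds P(a) exactly as defined above and computes its unique invariant distribution by power iteration, which is all that is needed to evaluate the mean dynamics below.

```python
import numpy as np

# Outcomes in the paper's order: (C,C), (C,D), (D,C), (D,D); action 0 = C, 1 = D.
OUTCOMES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def transition_matrix(a, U, h):
    """P(a): entry (s', s) = probability of outcome s next period, given outcome s' and aspirations a.

    U is the 4x2 payoff matrix (rows in the order above); h is a sigmoid as in (5).
    """
    P = np.zeros((4, 4))
    for r, sp in enumerate(OUTCOMES):
        stay = [h(U[r, i] - a[i]) for i in range(2)]          # prob. that player i repeats s'_i
        for c, s in enumerate(OUTCOMES):
            P[r, c] = np.prod([stay[i] if s[i] == sp[i] else 1.0 - stay[i] for i in range(2)])
    return P

def invariant_distribution(P, iters=10_000):
    """Stationary distribution of the strictly positive perturbed chain, by power iteration."""
    pi = np.full(4, 0.25)
    for _ in range(iters):
        pi = pi @ P
    return pi

# Example: coordination game (4), aspirations slightly below 0, inertia p = 0.2, sharp sigmoid.
U = np.array([[1, 1], [-1, -1], [-1, -1], [0, 0]], dtype=float)
p, sigma = 0.2, 0.05
h = lambda x: p + (1 - p) / (1 + np.exp(-x / sigma))
print(invariant_distribution(transition_matrix((-0.1, -0.1), U, h)))
```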

4. Analysis

Consider a deterministic process induced by the ODE

$$a(T + \Delta) - a(T) = \int_T^{T+\Delta} \bigl[ \pi^*_{a(\tau')}\, u - a(\tau') \bigr]\, d\tau'$$

or, equivalently,

$$\frac{da}{d\tau} = \pi^*_{a(\tau)}\, u - a(\tau) \qquad (6)$$

for a given condition a(T) at time T, where

$$\pi^*_a = \pi^*_a P(a).$$

To simplify notation, we often denote the right-hand side of the ODE by Φ:

$$\dot a \equiv \Phi(a) = \bigl[ \Phi_1(a),\ \Phi_2(a) \bigr]. \qquad (7)$$

We call (7) the associated ODE, or the mean dynamics of a_t. We shall approximate the stochastic process {a_t}_{t=1}^∞ by the ODE in the same sense as the sample mean of i.i.d. random variables approximates the population mean. Because the ODE is defined over continuous time, however, it is convenient to construct a continuous time stochastic process from the original discrete time stochastic process {a_t}_{t=1}^∞.
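As a rough numerical companion (again our own sketch, reusing the transition_matrix and invariant_distribution helpers introduced in the Section 3 sketch, which are not part of the paper), the trajectory of the associated ODE (7) can be traced with a simple Euler scheme:

```python
import numpy as np

def mean_dynamics_path(a0, U, h, dt=0.01, T=50.0):
    """Euler scheme for the associated ODE (7): da/dtau = pi*_a u - a.

    U is the 4x2 payoff matrix and h the sigmoid, in the format assumed by the earlier sketch;
    transition_matrix and invariant_distribution are the helpers defined there.
    """
    a = np.array(a0, dtype=float)
    path = [a.copy()]
    for _ in range(int(T / dt)):
        pi_star = invariant_distribution(transition_matrix(a, U, h))
        a = a + dt * (pi_star @ U - a)      # drift = expected one-shot payoff minus aspiration
        path.append(a.copy())
    return np.array(path)

# For the coordination game (4), trajectories started below u(D,D) = (0,0) should drift upward
# and settle near u(C,C) = (1,1), in line with Proposition 4.5.
```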

Define t_K = Σ_{t=1}^K γ_t as the total amount of (fictitious) time that is needed to play K rounds, for all K ≥ 0, with t_0 = 0. For all τ > 0, there is a unique K such that t_{K−1} ≤ τ < t_K. Define m(τ) = K as the first round that passes time τ. Let a(τ) be the continuous time process obtained from {a_t}_{t=1}^∞ through linear interpolation: for all τ ∈ [t_{K−1}, t_K),

$$a(\tau) = \frac{t_K - \tau}{\gamma_K}\, a_{K-1} + \frac{\tau - t_{K-1}}{\gamma_K}\, a_K.$$

Define a^K(τ) = a(t_K + τ). Note that {a^K(·)}_{K=1}^∞ is the collection of the tail portions of a(·) obtained by "left-shifting" the time by t_K.

By invoking [13,11], we can show that the sample path {a(τ)}_{τ>0} is approximated by the trajectory of (7) in the long run.


Lemma 4.1. If γ_t = 1/t as in (2), then

$$\sup_{\tau > 0}\ \lim_{K \to \infty} \Bigl[ a^K(\tau) - a^K(0) - \int_0^{\tau} \bigl( \pi^*_{a(\tau')}\, u - a(\tau') \bigr)\, d\tau' \Bigr] = 0$$

with probability 1, where a^K(0) = a(0) for all K ≥ 1.

Our next task is to analyze the deterministic process given by (7) instead of the original stochastic process.

Definition 4.2. a* ∈ ℝ² is a stationary state of the ODE Φ if Φ(a*) = 0.

Stability of stationary states is central to our analysis.

Definition 4.3. 1. A stationary state a* is locally stable (in the sense of Lyapunov) if there exists ε̄ > 0 such that if a(0) ∈ N_{ε̄}(a*), then for all ε ∈ (0, ε̄), there exists T > 0 such that for all τ ≥ T, a(τ) ∈ N_ε(a*), where N_ε(·) is the open ball with radius ε.

2. We say that the set K is a locally stable set with respect to (7) if K is a minimal compact set with respect to the following properties: there exists ε̄ > 0 such that for all ε ∈ (0, ε̄) and for all a(0) ∈ N_{ε̄}(K), there exists T > 0 such that for all τ ≥ T, a(τ) ∈ N_ε(K). If K = {a*}, then we write a* instead of {a*} to simplify notation.

3. If K is not locally stable (in the sense of Lyapunov), then we say K is not stable. If K is locally stable (in the sense of Lyapunov) and the entire space is the domain of attraction, then K is globally stable.

The next theorem, which is adopted from the standard convergence theorem in the stochastic approximation literature, is central for our investigation.

Theorem 4.4. Suppose that K is a locally stable set of (7) with a domain of attraction that contains K in its interior. Then for all a(0) ∈ ℝ² and all s_0, a(τ) → K with probability one as τ → ∞.

By invoking Theorem 2 of [25], we can also show that if a stationary state is not locally stable, then the asymptotic probability distribution of a(τ) assigns 0 probability to its neighborhood. Thus, the asymptotic probability distribution of a(τ) must be concentrated in the neighborhood of the stable states of (7).

In principle, the limit set K depends upon the pair of sigmoid functions h = (h_1, h_2). Given h, let K_h be the limit set as selected in Theorem 4.4. In the following section, we are interested in the limit of K_h as h converges to h̄.

By Theorem 4.4, it suffices to analyze the stable states of (7) to characterize the asymptotic distribution of the aspiration levels induced by the learning dynamics. Although the complexity of the analysis varies substantially for different games, these games have the same underlying properties.


Let V be the set of all feasible payoff vectors:

$$V = \Bigl\{ \sum_{s \in S} \lambda_s\, u(s) \;\Big|\; \lambda_s \geq 0,\ \sum_{s \in S} \lambda_s = 1 \Bigr\}. \qquad (8)$$

Since we consider the case of a small amount of perturbation, each component of the transition matrix P(a) is approximated by the corresponding component of P̄(a) induced by h̄, except near the thresholds. Recall that, for a fixed p, h̄ is completely determined by

$$u(s) - a = \bigl( u_1(s) - a_1,\ u_2(s) - a_2 \bigr).$$

Thus, it is convenient to partition V into

$$\mathcal{S} = \{ S_1, \ldots, S_M \},$$

where for all m = 1, …, M,

$$a, a' \in S_m \quad \text{if and only if} \quad \forall s \in S,\ \forall i = 1, 2,\quad a_i > u_i(s) \ \Rightarrow\ a'_i > u_i(s).$$

For any a and a′ in the same quadrant S_m, P̄(a) = P̄(a′) = P̄_m. Take an arbitrary S_m and calculate an invariant distribution π̄_m = π̄_m P̄_m. The unperturbed dynamics in this quadrant are approximated by

$$\frac{da}{dt} = \bar\pi_m\, u - a. \qquad (9)$$

Note that some components of P̄(a) might be zero, and therefore P̄(a) may have multiple invariant distributions. On the other hand, thanks to the sigmoid function h_i, P(a) is aperiodic and irreducible, and therefore P(a) has a unique invariant distribution.

If P̄(a) has a unique invariant distribution, then Lemma 3.2 [3] implies that the invariant distribution of P(a) converges to that of P̄(a). If, on the other hand, P̄(a) has multiple invariant distributions, the upper hemi-continuity property of the invariant distribution implies that when d(h_i, h̄) is small, the invariant distribution of P(a) must be close to some, if not all, of the invariant distributions of P̄(a).

Because the mean dynamics are dictated by the invariant distribution of P(a), we can replace P(a) by P̄(a) (with some care when P̄(a) has multiple invariant distributions). Because the calculation of the invariant distribution is easy, and does not require any specific details of the component game, it is straightforward to analyze any 2×2 game by using the same method. To illuminate the key idea, however, this paper examines only symmetric 2×2 games, while leaving other 2×2 games for interested readers.

4.1. Coordination games

The pure coordination game (4) is the best example for us to illustrate the key properties of the learning dynamics, and to explain rigorously how the sigmoid function and the inertia help the aspiration level converge to u(C,C). In (4), we have

$$V = \{ (a_1, a_2) \mid -1 \leq a_1 = a_2 \leq 1 \}.$$


We partition V into two pieces:

$$A = \{ (a_1, a_2) \mid -1 \leq a_1 = a_2 \leq 0 \}$$

and

$$B = \{ (a_1, a_2) \mid 0 < a_1 = a_2 \leq 1 \},$$

which are line segments in ℝ². If a ∈ A, then P̄(a) is given by

$$\bar P(a) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ p(1-p) & p^2 & (1-p)^2 & p(1-p) \\ p(1-p) & (1-p)^2 & p^2 & p(1-p) \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$

Note that P̄(a) has multiple invariant distributions, namely all convex combinations of

$$\bar\pi_a = (1, 0, 0, 0)$$

and

$$\bar\pi_a = (0, 0, 0, 1),$$

which implies that the mean dynamics are

$$\begin{pmatrix} \dfrac{da_1}{dt} \\[2mm] \dfrac{da_2}{dt} \end{pmatrix} = \begin{pmatrix} \sum_{s \in S} \pi^*_a(s)\, u_1(s) - a_1 \\[2mm] \sum_{s \in S} \pi^*_a(s)\, u_2(s) - a_2 \end{pmatrix}, \qquad (10)$$

where

$$\pi^*_a = \pi^*_a P(a).$$

Recall that the invariant distribution of P(a) must be close to some invariant distribution of P̄(a) as d(h_i, h̄) → 0 for all i. For any invariant distribution of P̄(a),

$$\frac{da_i}{dt} = \sum_{s \in S} \pi^*_a(s)\, u_i(s) - a_i > 0 \qquad \forall \epsilon > 0,\ \forall a_i \leq u_i(D,D) - \epsilon.$$

Hence, if the aspiration level of each player i is below u_i(D,D), then the learning dynamics must increase the aspiration level up to u_i(D,D) in finitely many periods.

It is instructive to see the role of inertia. If there were no inertia, so that p = 0, then P̄(a) would have an invariant distribution different from those already identified above:

$$\bar\pi_a = \Bigl( 0,\ \tfrac{1}{2},\ \tfrac{1}{2},\ 0 \Bigr).$$

This invariant distribution is attained from the outcome alternating between (C,D) and (D,C); in each of these outcomes, the two players synchronously switch to the other action since each player is dissatisfied with the performance of his present choice. By introducing inertia, i.e., assuming p > 0, either (C,C) or (D,D) is realized with a positive probability,


both of which are absorbing states. As a result, the two players escape almost surely from perpetual coordination failure.

Once the aspiration level approaches u(D,D), the perturbation introduced by the sigmoid function plays an important role. By the definition of the sigmoid function, there exists ε > 0 such that ε < h_i(0) < 1 − ε. For an aspiration vector a = (a_1, a_2) in the neighborhood of u(D,D) = (0, 0), the transition matrix P(a) can be approximated by

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ p(1-p) & p^2 & (1-p)^2 & p(1-p) \\ p(1-p) & (1-p)^2 & p^2 & p(1-p) \\ (1-q_1)(1-q_2) & (1-q_1)q_2 & q_1(1-q_2) & q_1 q_2 \end{pmatrix},$$

where q_i = h_i(u_i(s) − a_i) ∈ (0, 1) and u_i(s) − a_i ≈ 0. One can easily verify that this transition matrix has a unique invariant distribution (1, 0, 0, 0). As a result, the mean dynamics at an aspiration vector a = (a_1, a_2) in the neighborhood of u(D,D) are

$$\frac{da_i}{dt} = u_i(C,C) - a_i > 0 \qquad \forall i = 1, 2.$$

That is, as the aspiration level approaches u_i(D,D), player i experiments more often as (D,D) is realized. As a result, (C,C), which is an absorbing state, is realized with a positive probability.

Our analysis so far indicates that if the initial aspiration level of player i is below u_i(D,D) = 0, then the aspiration level should continue to increase so that it can go beyond u_i(D,D) in a finite period of time. Our remaining step is to show that if a ∈ B, then the mean dynamics must converge to the neighborhood of u(C,C) = (1, 1). We essentially repeat the same logic as before. For small ε > 0, a ∈ B and a_i > u_i(D,D) + ε, the transition matrix P̄(a) can be approximated by

$$\bar P(a) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ p(1-p) & p^2 & (1-p)^2 & p(1-p) \\ p(1-p) & (1-p)^2 & p^2 & p(1-p) \\ (1-p)^2 & p(1-p) & p(1-p) & p^2 \end{pmatrix},$$

which has a unique invariant distribution

$$\bar\pi_a = (1, 0, 0, 0).$$

As P(a) converges to P̄(a), so does its invariant distribution. Thus, the mean dynamics can be approximated by

$$\frac{da_i}{dt} = u_i(C,C) - a_i \geq 0,$$

and the equality holds only if a_i = u_i(C,C). This proves that for all a_0 and all ε > 0, there exists δ > 0 such that if d(h_i, h̄) < δ for all i, then there exists T > 0 such that for all t ≥ T, a_i(t) ≥ u_i(C,C) − ε, where a(t) is induced by (7). Then, by invoking Theorem 4.4, we conclude that a_t → N_ε(u(C,C)) with probability one. Since the only way to achieve an average payoff close to u_i(C,C) is to play (C,C) almost always, this convergence result also implies that s_t = (C,C) for almost all t ≥ 1.


Albeit tedious, one can apply precisely the same logic to a general coordination game:

$$\left(\begin{array}{c|cc} & C & D \\ \hline C & 1,\ 1 & \ell,\ g \\ D & g,\ \ell & 0,\ 0 \end{array}\right), \qquad (11)$$

where g, ℓ < 1. We summarize the result in the following statement.

Proposition 4.5. For all ε > 0, there exists δ > 0 such that if d(h_i, h̄) < δ for all i = 1, 2, then

$$a_t \to N_\epsilon\bigl( u(C,C) \bigr)$$

with probability one.

4.2. Battle of the sexes

While [10,17,18] also show that the efficient coordination outcome is achieved in the coordination game, their result is restricted to component games which have a symmetric Pareto efficient (pure strategy) outcome. In contrast, our method can be applied to 2×2 games that do not have a symmetric Pareto efficient outcome, such as the battle of the sexes:

$$\left(\begin{array}{c|cc} & C & D \\ \hline C & 0,\ 0 & 1,\ 1+g \\ D & 1+g,\ 1 & 0,\ 0 \end{array}\right), \qquad (12)$$

where g > 0. Because the analysis of the battle of the sexes follows precisely the same logic as the coordination game, we only state the result. Let V be the set of feasible payoff vectors and partition V into four regions:

$$A = \{ (a_1, a_2) \in V \mid a_1, a_2 < 1 \}, \qquad B = \{ (a_1, a_2) \in V \mid a_1 < 1,\ a_2 \geq 1 \},$$
$$B' = \{ (a_1, a_2) \in V \mid a_1 \geq 1,\ a_2 < 1 \}, \qquad C = \{ (a_1, a_2) \in V \mid a_1, a_2 \geq 1 \}.$$

Note that Region B′ is the mirror image of B. Region C is critical for the global stability of the dynamic system. For all a ∈ C, the transition matrix in this region is approximated by

$$\bar P(a) = \begin{pmatrix} p^2 & p(1-p) & p(1-p) & (1-p)^2 \\ 0 & p & 0 & 1-p \\ 0 & 0 & p & 1-p \\ (1-p)^2 & p(1-p) & p(1-p) & p^2 \end{pmatrix}.$$

Thus, the system moves toward u* provided that lim_{t→∞} γ_t = 0, where u* is calculated from the stationary distribution π* of the above transition matrix, i.e.,

$$\pi^* = \frac{1}{2 + 4p}\,\bigl( 1-p,\ 2p,\ 2p,\ 1+p \bigr).$$


Fig. 1. Battle of the Sexes: u*_1, u*_2 < 1.

Therefore, we have

$$u^* = \pi^* \cdot u = \frac{p}{1 + 2p}\,\bigl( 2+g,\ 2+g \bigr).$$
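As a quick sanity check (ours, with arbitrary parameter values), one can verify numerically that π* is indeed invariant for the region-C transition matrix and that it yields the u* above:

```python
import numpy as np

p, g = 0.3, 0.5   # arbitrary values with g > 0
P_C = np.array([
    [p**2,      p*(1-p), p*(1-p), (1-p)**2],
    [0,         p,       0,       1-p     ],
    [0,         0,       p,       1-p     ],
    [(1-p)**2,  p*(1-p), p*(1-p), p**2    ],
])
pi_star = np.array([1-p, 2*p, 2*p, 1+p]) / (2 + 4*p)
U = np.array([[0, 0], [1, 1+g], [1+g, 1], [0, 0]], dtype=float)   # battle of the sexes (12)

print(np.allclose(pi_star @ P_C, pi_star))       # True: pi* is a stationary distribution
print(pi_star @ U, p*(2+g)/(1+2*p))              # both coordinates equal p(2+g)/(1+2p)
```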

Proposition 4.6. For all ε > 0, there exists δ > 0 such that d(h_i, h̄) < δ implies

$$a_t \to \begin{cases} N_\epsilon(K^{**}) & \text{if } pg < 1, \\ N_\epsilon(u^*) & \text{if } pg > 1 \end{cases}$$

as t → ∞ with probability one. In particular, if p is sufficiently small, then a_t converges to N_ε(K**), where K** = {u(C,D), u(D,C)}.

Proof. See [3]. □

If u*_i < 1 (i = 1, 2), there is another stationary point of the associated ODE, located in the neighborhood of (u_1(C,D), u_2(D,C)) = (1, 1) as h_i tends to h̄ for i = 1, 2. However, this point is not locally stable, because if a fluctuates a little and falls into region, say, B, the dynamics lead a to u(C,D). Therefore, the stochastic process does not converge to the neighborhood of (1, 1) with any positive probability (Fig. 1).

Since we can analyze other 2×2 games by following the same idea, let us simply state the convergence result (Fig. 2).


Fig. 2. Battle of the Sexes: u*_1, u*_2 > 1.

4.3. Game of chicken

Consider the following game of chicken:

$$\left(\begin{array}{c|cc} & C & D \\ \hline C & 1,\ 1 & \ell,\ 1+g \\ D & 1+g,\ \ell & 0,\ 0 \end{array}\right), \qquad (13)$$

where g > 0, ℓ > 0 and g + ℓ < 1. We partition the set of feasible payoff vectors into the following regions:

$$A = \{ (a_1, a_2) \in V \mid \exists s \in S,\ (a_1, a_2) < u(s) \},$$
$$B = \{ (a_1, a_2) \in V \mid \ell \leq a_1 < 1,\ a_2 \geq 1 \}.$$

Like before, Region B′ is the mirror image of B. Note that A ∪ B ∪ B′ = V, and the intersection of any pair of these sets is empty (Fig. 3).

We can show that the aspiration vector converges to one of the efficient outcomes.

Proposition 4.7. For all ε > 0, there exists δ̄ > 0 such that for all δ ∈ (0, δ̄), if max_{i=1,2} d(h_i, h̄) < δ, then

$$a_t \to \begin{cases} N_\epsilon\bigl( u(C,C) \bigr) & \text{if } 1 + pg > 2\ell, \\ N_\epsilon(K^*) & \text{if } 1 + pg < 2\ell \end{cases}$$

as t → ∞ with probability one, where

$$K^* = \{ u(C,D),\ u(C,C),\ u(D,C) \}.$$


Fig. 3. Game of Chicken.

If p is sufficiently small, then ℓ = 1/2 becomes a threshold for the destination of the dynamics. If ℓ < 1/2 holds, then a_t converges to the neighborhood of u(C,C), while if ℓ > 1/2, then it converges to one of (the neighborhoods of) u(C,C), u(C,D), and u(D,C). This conforms to our intuition. If ℓ is small, then the off-diagonal payoffs are "unfair" in the sense that the payoff to one of the players is closer to u(D,D) than to u(C,C). Therefore, the player who is not "well treated" tends to be dissatisfied and moves back and forth between C and D.

4.4. Prisoners’ dilemma

Finally, we examine the prisoners' dilemma game in detail, as it reveals a critical difference between our paper and KMRV:

$$\left(\begin{array}{c|cc} & C & D \\ \hline C & 1,\ 1 & -\ell,\ 1+g \\ D & 1+g,\ -\ell & 0,\ 0 \end{array}\right), \qquad (14)$$

where ℓ, g > 0. We divide V into eight regions:

$$A = [0, 1) \times [0, 1),$$
$$B = \{ a = (a_1, a_2) \in V \mid a_1 < 0,\ 0 < a_2 < 1 \},$$
$$C = \{ a = (a_1, a_2) \in V \mid a_1 < 0,\ a_2 \geq 1 \},$$
$$D = \{ a = (a_1, a_2) \in V \mid 0 \leq a_1 < 1,\ a_2 \geq 1 \},$$
$$E = \{ a = (a_1, a_2) \in V \mid a_1 \geq 1,\ a_2 \geq 1 \},$$


Fig. 4. g < 1, ℓ < 1.

and the remaining three regions are the mirror images of B, C, D obtained by switching the identity of the players, and are called B′, C′, D′, respectively. Among these eight regions, E may be empty, as depicted in Fig. 4, if u(C,C) is located on the Pareto frontier of V.

If the aspiration level pair a is in region A, the invariant distribution π̄_A converges to (1, 0, 0, 0) as the sigmoid function h converges to the step function. Therefore, a moves toward u(C,C).

If a is in region B, the unperturbed transition matrix P̄_B is given by

$$\bar P_B = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & p & 0 & 1-p \\ 0 & 0 & p & 1-p \\ 0 & 1-p & 0 & p \end{pmatrix},$$

and any convex combination of (1, 0, 0, 0) and (0, .5, 0, .5) is an invariant distribution of P̄_B. Thus, π*_a · u is a convex combination of u(C,C) = (1, 1) and .5u(C,D) + .5u(D,D) = (−ℓ/2, (1+g)/2). For any invariant distribution, the aspiration level pair will eventually enter region A.12

12 Remember that P̄_B is an approximation of P(a) for all a ∈ B, and P(a) has a unique invariant distribution. Because of the multiplicity of invariant distributions of P̄_B, the invariant distribution of P(a) is sensitive to the choice of (h_1, h_2). The analysis shows, however, that the asymptotic property of a_t is not sensitive to the choice of (h_1, h_2), since the aspiration level pair enters region A no matter which direction it might move, as discussed in the main text.


In region C, the unperturbed transition matrix P̄_C is

$$\bar P_C = \begin{pmatrix} p & 1-p & 0 & 0 \\ 0 & p & 0 & 1-p \\ 0 & 0 & p & 1-p \\ 0 & 0 & 1-p & p \end{pmatrix},$$

whose unique invariant distribution is (0, 0, .5, .5). Once the aspiration vector enters C, it moves toward .5u(D,C) + .5u(D,D) = ((1+g)/2, −ℓ/2). Therefore, the system either reaches the boundary of C and D or enters A in a finite period of time.

Our main interest is whether or not we can sustain the cooperation outcome in the limit. As it turns out, the answer depends upon the size of g > 0, which measures the gain from defection against cooperation, or can be interpreted as the temptation for double crossing. It affects the dynamics in quadrant D (and D′ by symmetry). In D, the transition matrix P̄_D is given by

$$\bar P_D = \begin{pmatrix} p & 1-p & 0 & 0 \\ 0 & p & 0 & 1-p \\ 0 & 0 & p & 1-p \\ (1-p)^2 & p(1-p) & p(1-p) & p^2 \end{pmatrix}.$$

Therefore, the invariant distribution π_D is given by

$$\pi_D = \frac{1}{3}\,\bigl( 1-p,\ 1,\ p,\ 1 \bigr).$$

Similarly, we let π_{D′} be given by

$$\pi_{D'} = \frac{1}{3}\,\bigl( 1-p,\ p,\ 1,\ 1 \bigr).$$

Using this, we define

$$\bar u := \pi_D\, u = \Bigl( \tfrac{1}{3}\bigl( 1 - \ell + pg \bigr),\ \tfrac{1}{3}\bigl( 2 + g - p - p\ell \bigr) \Bigr) \qquad (15)$$

and, similarly,

$$\bar u' := \Bigl( \tfrac{1}{3}\bigl( 2 + g - p - p\ell \bigr),\ \tfrac{1}{3}\bigl( 1 - \ell + pg \bigr) \Bigr).$$

As g increases, both ū = (ū_1, ū_2) and ū′ = (ū′_1, ū′_2) move away from the main diagonal of V. It is useful to note that ū_2 > 1 = u_2(C,C) and ū′_1 > 1 = u_1(C,C) if and only if g > 1 + p(1 + ℓ).

For a large ℓ (the loss from being double-crossed), ū and ū′ might fail to be individually rational. In this case, we need to do additional work.

Suppose that ū_1 < 0. Define

$$\check u := \tfrac{1}{2}\bigl( u(D,D) + u(D,C) \bigr) = \Bigl( \tfrac{1+g}{2},\ -\tfrac{\ell}{2} \Bigr)$$


Fig. 5. Prisoners' dilemma: g > 1, ℓ < 1.

as the middle point of u(D,D) and u(D,C). Let u* = (0, u*_2) be the point at which the vertical axis

$$\{ (u_1, u_2) \mid u_1 = 0 \}$$

and the line segment connecting ū and ǔ intersect. Similarly, let u*′ = (u*′_1, 0) be the point at which the horizontal axis and the line segment connecting ū′ and

$$\check u' := \tfrac{1}{2}\bigl( u(D,D) + u(C,D) \bigr) = \Bigl( -\tfrac{\ell}{2},\ \tfrac{1+g}{2} \Bigr)$$

intersect. Note that u* is the mirror image of u*′ around the main diagonal of V (Fig. 5). Summarizing the above analysis, we obtain the following result:

Proposition 4.8. (1) Suppose g ∈ (0, 1) and ℓ ∈ (0, 1). Then for all ε > 0, there exists δ > 0 such that d(h_i, h̄_p) < δ (i = 1, 2) implies a_t → N_ε(u(C,C)) with probability one.

(2) Suppose g > 1 and ℓ ∈ (0, 1). Then for all ε > 0, there exists δ > 0 such that d(h_i, h̄_p) < δ (i = 1, 2) implies a_t → N_ε({ū, ū′}) with probability one.

(3) Suppose g > 1 and ℓ > 1.

(3.1) If g² + 1 > ℓ² + ℓ, then for all ε > 0, there exists δ > 0 such that d(h_i, 1_{x≥0}) < δ (i = 1, 2) implies a_t → N_ε({(0, u*_2), (u*′_1, 0)}) with probability one.

(3.2) If g² + 1 ≤ ℓ² + ℓ, then only (0, 1) and (1, 0) are locally stable points.

Proof. See Appendix. □

For a sufficiently small p, which is implied by d(h_i, 1_{x≥0}) < δ, the cooperation outcome can be sustained if the gain from double crossing is modest, i.e., g < 1 holds. If one player


deviates from the cooperation outcome, the other party, who is double crossed, receives the worst possible payoff in the one shot game. Consequently, the party who is double crossed immediately responds to such a deviation by playing D as well. The resulting outcome is the one shot Nash equilibrium, which is Pareto dominated by the cooperation outcome. Dissatisfied with the bad outcome, both players switch to cooperation simultaneously. But the cooperation cannot last too long. As a result, the limit outcome is a convex combination of (C,C), (D,D) and (D,C) (or (C,D)). Therefore, the initial defector's aspiration level is drawn toward (2+g)/3. Thus, if g < 1, such a deviation pushes the defector's aspiration level below the cooperation payoff of one. This force drives the aspiration level toward region A, and therefore, toward u(C,C).

On the other hand, if g > 1, then the average payoff ū_i from alternating between (C,C), (D,D) and (D,C) (or (C,D)) is larger than the cooperation payoff u_i(C,C). Consequently, each player has an "incentive" to move away from cooperation. Indeed, if g > 1, then u(C,C) is not a stable point of the ODE. Thus, our theory predicts that the learning dynamics converge to points other than u(C,C), which is in sharp contrast to KMRV, which predicts that the learning dynamics converge to u(C,C) regardless of the value of g. Two subtle differences in the model contribute to the drastic difference in prediction between KMRV and this paper. First, in KMRV, h_i(z) = 1 if z ≥ 0, while we admit h_i(z) < 1 for some small z > 0. In our model, the behavior rule of the agent is exposed to perturbation even if the present action is satisficing but the present payoff exceeds the aspiration level only slightly. On the other hand, in KMRV, the agent's behavior is not subject to the randomization if the present action is satisficing. In the prisoner's dilemma, if the initial pair of aspiration levels is Pareto dominated by u(C,C), then both agents are satisfied by the cooperation. In KMRV, the cooperation is realized with probability 1. Thus, there can be no deviation from u(C,C). However, in our model, the player's action is subject to perturbation in the small neighborhood of u(C,C), which eventually pushes the aspiration vector away from u(C,C) if g > 1.

Note that u*_2 < 1 and u*′_1 < 1 if and only if ℓ + ℓ² > g² + 1. In this case, the loss from being cheated is so large that the convex combination of, say, u(D,D), u(C,C) and u(D,C) might induce the average payoff for player 2 to go below his security level payoff u(D,D). The proposition says that each player can guarantee his security level payoff. Consequently, the long-run individual payoff of one player becomes precisely his security level payoff. In this case, which is covered by the last part of the Appendix, the analysis of [19] indicates the possible existence of a stable limit cycle. In all other cases, the set of all locally stable points is the global attractor: for any initial condition, the aspiration vector converges to its neighborhood with probability one in a finite period of time. Although (0, 1) and (1, 0) are locally stable points, we have not been able to check whether or not {(0, 1), (1, 0)} is a global attractor of the learning dynamics.

5. Extension

5.1. Aspiration as a weighted average of the two players

We have assumed that the aspiration level is the average of one's own payoffs, as in KMRV. We can apply the same analytic tool to study models in which the aspiration level is formed in a different manner, such as one that uses the payoff of the other player [17] or the average payoff of the players [18]. Instead of reproducing [17,18], we shall modify the aspiration formation rule to incorporate the payoff of the other player, in order to illustrate how our analysis can be applied to other models.

Suppose that the aspiration level of each player is calculated as

$$a_{i,t} = a_{i,t-1} + \gamma_t \bigl[ (1-\theta)\, u_i(s_t) + \theta\, u_j(s_t) - a_{i,t-1} \bigr], \qquad (16)$$

where j ≠ i, and θ ∈ [0, 1/2] is the weight player i puts on the other player's payoff in updating his own aspiration level. Let V_θ be the set of all aspiration vectors in the long run:

$$V_\theta = \bigl\{ \bigl( (1-\theta) a_1 + \theta a_2,\ \theta a_1 + (1-\theta) a_2 \bigr) \in \mathbb{R}^2 \;\big|\; (a_1, a_2) \in V \bigr\}.$$

Notice that V_θ shrinks to a segment of the 45° line as θ approaches 1/2. As V_θ changes with θ, the dynamics change accordingly. But any change in the stochastic process induced by θ > 0 is captured by the change in the associated ODE. Thus, we can examine the asymptotic properties of the evolution of aspirations through the associated ODE. First, given an aspiration pair a, use the same transition matrix P(a) as before to calculate the invariant distribution π*_a by solving π*_a = π*_a P(a). Next, calculate a^θ_i (i = 1, 2) as

$$a^\theta_i = (1-\theta)\, \pi^*_a u_i + \theta\, \pi^*_a u_j.$$

The mean dynamics at a move the aspiration pair in the direction of a^θ = (a^θ_1, a^θ_2).

We shall examine three underlying games: coordination, the battle of the sexes and the prisoners' dilemma. The analysis of the game of chicken is left for interested readers.

5.1.1. Coordination games

The analysis is virtually identical to that in Section 4.1. The aspiration levels converge to u(C,C), which is the only stable outcome of the associated ODE.

5.1.2. Battle of the sexes

See Fig. 6. Note that V_θ, which is the solid triangle in Fig. 6, is a subset of V, which is the dotted triangle. Also, note that V_θ converges to the 45° line through the origin as θ → 1/2. Given θ ∈ (0, 1/2), mark each area of V_θ as A, B, B′ and C as we did in Fig. 1.

The mean dynamics over V_θ are fairly easy to analyze. For example, in C, the gradient vector induced by the associated ODE points toward u*. Thus, if u* is in C, then u* is the only stable point of the associated ODE, and therefore the aspiration vector converges to u*. Otherwise, we follow precisely the same logic as in Section 4.2 in order to show that {u*, u**} is the set of stable points.

5.1.3. Prisoners' dilemma

Following the same analysis as in Section 4.4, we calculate the threshold level of g beyond which cooperation u(C,C) is not achieved. The key payoff profile ū is now modified so that the point of interest is

$$\bar u^\theta = (1-\theta)\, \bar u + \theta\, \bar u'.$$


Fig. 6. Battle of the Sexes: θ = 1/4.

If ū^θ is in region D in Fig. 4, then ū^θ is a stable point of the system, and so is its mirror image. If ū^θ is in region A, then u(C,C) is the unique stable point. In other words, for a sufficiently small p, u(C,C) is the unique stable point if and only if

$$g < 1 + \theta\,( 1 + g + \ell ).$$

Note that the second term on the right-hand side is increasing in θ. As θ increases, the threshold for g to sustain cooperation also increases. If θ = 1/2, the aspiration level is essentially the population average as in [17].

5.2. Perturbed aspiration processes

We have maintained the assumption that the evolution of the aspiration level is "deterministic" in the sense that a_{i,t+1} is completely determined by a_{i,t} and u_i(s_t) as prescribed by (1). We can easily incorporate a perturbed version of (1), as KMRV did. Suppose that there exist a small probability η > 0 and a sequence of i.i.d. white noises {w_t}_{t=1}^∞, i.e., E_t w_t = 0 for all t, where E_t is the expectation operator conditioned on the information available at the beginning of time t, such that

$$a_{i,t} = \begin{cases} a_{i,t-1} + \gamma_t \bigl( u_i(s_t) - a_{i,t-1} \bigr) & \text{with probability } 1 - \eta, \\ a_{i,t-1} + \gamma_t\, w_t & \text{with probability } \eta. \end{cases} \qquad (17)$$

Following the same logic as before, we calculate the associated ODE for (17):

$$\frac{da}{d\tau} = (1 - \eta)\bigl( \pi^*_{a(\tau)}\, u - a(\tau) \bigr), \qquad (18)$$

where a(τ) = (a_1(τ), a_2(τ)) is a pair of aspirations at time τ, and π*_a is the unique invariant distribution of P(a). Note that the w_t's disappear in (18) since they are white noises. Because (18) has the same stationary points, with the identical stability properties, as (9), the asymptotic properties of the aspiration vectors generated by (17) remain the same as in the case where the aspiration levels evolve according to (1).

5.3. Constant gain algorithms

If γ_t = 1/t, a player's aspiration level is a simple average of the past payoffs, in which all past observations are assigned an equal weight in calculating the aspiration level. Implicitly, γ_t = 1/t assumes that the player perceives the underlying states as stationary, including the behavior of the opponent. However, in the transition to the stable outcome, the stationarity assumption is suspect. A more natural assumption would be to let the agent discount the past observations, because the agent believes the underlying states are not stationary. A simple way to incorporate this idea would be to set γ_t = γ > 0 so that the aspiration is a geometrically discounted average of the past payoffs.

If γ_t = γ > 0, we say that (1) or (17) is a constant gain algorithm, in which a_t = (a_{1,t}, a_{2,t}) evolves according to

$$a_{i,t} = a_{i,t-1} + \gamma \bigl( u_i(s_t) - a_{i,t-1} \bigr). \qquad (19)$$
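Unrolling (19) makes the geometric discounting explicit: for any t,

$$a_{i,t} = (1-\gamma)^{t}\, a_{i,0} + \gamma \sum_{\tau=1}^{t} (1-\gamma)^{t-\tau}\, u_i(s_\tau),$$

so recent payoffs carry exponentially more weight than older ones.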

In this case, the convergence with probability one does not hold. Instead, we have to use a weaker notion of convergence. This version of the learning algorithm is useful for comparing our model to KMRV.

Even though convergence with probability 1 no longer holds, we can still characterize the asymptotic properties of the aspiration vectors induced by a constant gain algorithm by the associated ODE such as (9) or (18). The only difference is that instead of convergence with probability 1, we obtain weak convergence: for a given γ, the aspiration vector induced by the constant gain algorithm given by (1) or (17) has a unique invariant distribution, which is distributed around the stable solution of the associated ODE. As γ → 0, the invariant distribution converges to the mean in the weak topology [12]. The stability properties of the dynamics of the aspiration vector remain the same if we use weak convergence. We state the convergence result for the constant gain algorithm without proof, because it is virtually identical to the proof for the decreasing gain algorithm where γ_t = 1/t.

Theorem 5.1. Suppose that K is the set of stable solutions of (7) with domain of attraction D. If there exists a compact subset D* in the interior of D such that

$$K \subset D^*$$

and

$$a_t \in D^*$$

infinitely many times with probability one, then for all ε > 0 and all T > 0,

$$\lim_{\gamma \to 0}\ \Pr\Bigl( \sup_{0 \leq \tau \leq T} d\bigl( a(\tau), K \bigr) > \epsilon \Bigr) = 0,$$

where a(τ) is the continuous time process obtained by linearly interpolating a_t, and d(·,·) is the Hausdorff metric.

6. Conclusion

We studied the behavior of boundedly rational agents who play an infinitely repeatedsymmetric 2×2 game according to a simple rule of thumb: each agent continues to play thesame action if and only if the action generates one period payoff that exceeds the aspirationlevel that summarizes the past history according to the average of the past payoffs. Usingthe method of stochastic approximation, we set forth a general approach to this problemand characterized the set of limit points of the learning dynamics for all symmetric 2× 2games.

A couple of remarks are in order. One may wonder if a player who follows this rule istaken advantage of by a rational player. The answer depends on the specific game. In gameswith multiple individually rational Pareto efficient pure strategy payoffs such as the battleof the sexes, if a rational player knows that the other player is following the above behaviorrule, then the rational player can select the equilibrium that favors him. On the other hand,the rational player cannot take advantage of the other in a prisoner’s dilemma game. Thisis because any defection that leads to the lowest possible payoff for the deceived playerprompts him to switch to defection in the next period.

We can apply the same analytic method to more elaborate games in which the players have more than two actions in each period, as in [2]. By invoking the stochastic approximation technique, we can approximate the stochastic learning dynamics by a deterministic dynamics induced by an ordinary differential equation (ODE). As the number of feasible actions for each player increases, however, the characterization of the steady state of the ODE becomes considerably more complicated. Nevertheless, the preliminary results in [4] indicate that if each player has a continuous action space as in the Cournot duopoly game, the Pareto frontier must be a stable set of the ODE, and therefore the asymptotic outcomes of the stochastic learning dynamics must converge to the Pareto frontier in the Cournot duopoly game. We believe that the result can be generalized to a larger class of games to provide a behavioral and dynamic foundation for the Coase theorem.

In many works in the learning literature, agents make optimal choices given their expectations, and it is the expectations that are updated via mechanical rules. In our work, on the other hand, it is the behavior rule that does not follow expected utility maximization. However, our satisficing behavior is no less rational than the expected utility maximizing behavior, especially when the state space and/or probability measure are not defined on solid ground.13 Indeed, in single-person decision making problems, the expected utility theory

13 See [8] for further discussion.


and the case-based decision theory are equivalent in the sense that one can be embedded in the other, as proved by Matsui [15]. Yet, the present paper shows that the two theories produce very different results in repeated games. Similarity values, which correspond to probability values in the above-mentioned embedding, are assumed to be identical across different stage games in the repeated game. As a result, our behavior rule is far more restrictive than the rules induced in some of the learning literature. This restriction on similarity values, rather than the behavior rule per se, is what drives the present results.14

Acknowledgments

We are grateful for helpful comments from two anonymous referees and conversations with Joseph Hofbauer, Michihiro Kandori, Daisuke Oyama, Martin Posch, and especially Dilip Mookherjee and Debraj Ray, as well as seminar participants at University of Tokyo, University of Illinois, University of Wisconsin, New York University, Stanford University, University of Vienna, and Boston University. All remaining errors are ours.

Appendix

Proof of Proposition 4.8. To simplify exposition, we consider a symmetric perturbation, i.e., h1(·) = h2(·) = h(·). We first state several useful results. Their proofs are fairly straightforward but tedious, and can be found in [3].

Lemma A.1. For all ε > 0, d(h, 1{x≥0}) < ε implies that there exists a steady state in Nε(u(C,C)), and that every steady state in Nε(u(C,C)) is locally stable if g < 1, and is a saddle point if g > 1.

Lemma A.2. (i) Suppose g < 1. For all ε > 0, there exists δ > 0 such that no steady state exists in [ε, 1−ε] × [1−δ, 1+δ].

(ii) Suppose g > 1. For a sufficiently small δ > 0, no steady state exists in [δ, 1−δ] × [1−δ, 1+δ].

Lemma A.3. Suppose g < 1. For all ε > 0, there exists δ > 0 such that no steady state exists outside Nε((1,1)).

Lemma A.4. Suppose g < 1. For all ε > 0, there exists δ > 0 such that no closed orbit goes out of Nε((1,1)).

14 In relation to this point, whether the average payoff of the past play is appropriate as an aspiration level or not needs further study. At this moment, we can only say that it is better than some other conceivable behavior rules. For example, a behavior rule that aspires to the best payoff ever attained is too ambitious to work well. Indeed, in, say, a prisoner's dilemma, if both players aspire to the best past payoff, then they fall into a perpetual search process: with probability one, each of them eventually experiences the best payoff by defecting while the opponent cooperates, and therefore they keep searching afterward.


Lemma A.5. Suppose g > 1 and ℓ > 1.

(i) Suppose ℓ² − g² < −ℓ + 1. Then, in the limit as δ goes to zero, the only steady states are (0, u∗2) and (u∗1, 0).

(ii) Suppose ℓ² − g² > −ℓ + 1. Then, in the limit as δ goes to zero, the only steady states are (0, 1) and (1, 0).

All of the steady states in (i) and (ii) are locally stable.

Proof of Proposition 4.8 (Continued). If g < 1, then the ODE vanishes at a single point in the neighborhood of (1,1), which we demonstrated is locally stable. The phase diagram of the ODE reveals that starting from any initial condition, the trajectory of the ODE must converge to the stationary point. That is, the locally stable point of the ODE is the globally stable point. By applying Theorem 4.4, we conclude that if g < 1, then a_t converges to a small neighborhood of (1,1) with probability one. This proves the first part of the proposition.

If g > 1, then the stationary solution of the ODE in the neighborhood of (1,1) is a saddle point. We have to show that the aspiration vector converges in the limit to one of the locally stable points, which are {u, u′} if g > 1 and ℓ < 1, and K∗ if g > 1 and ℓ > 1. From the previous analysis, we know that the ODE vanishes at precisely three points: two locally stable points and one saddle point in the neighborhood of (1,1).

To simplify the notation, let us regard (1,1) as the saddle point, and let S∗ = {s∗1, s∗2, (1,1)} be the three points where the ODE vanishes. If g > 1 and ℓ > 1, then s∗1 = (0, max[u∗2, 1]) and s∗2 = (max[u∗′1, 1], 0), and if g > 1 and ℓ < 1, then s∗1 = u and s∗2 = u′.

Suppose that g > 1 and ℓ < 1. Then, s∗1 = u and s∗2 = u′. It is straightforward to verify that starting from any initial condition, the trajectory of the ODE must converge to the neighborhood of {s∗1, s∗2}. Then, Theorem 4.4 implies that a_t → {s∗1, s∗2} with probability one. If g > 1 and ℓ > 1, but s∗1 = (0, u∗2) and s∗2 = (u∗′1, 0), then we can invoke exactly the same logic to prove that a_t → {s∗1, s∗2} with probability one.

We now prove the proposition for the case where g > 1 and ℓ > 1 but s∗1 = (0,1) and s∗2 = (1,0), i.e.,

ℓ² − g² > −ℓ + 1.

We shall focus on the stability property of s∗1, because the other case follows from the symmetric argument. Note that s∗1 is the corner of A, B, C and D. We can write the transition matrix as

( 1−h2   h2   0      0 )
( 0      0    0      1 )
( 0      0    0      1 )
( h1     0    1−h1   0 ),

where hi is continuous (i = 1, 2); h1(x) = 0 if x < −δ, h1(x) = 1 if x > δ, and h1 is strictly increasing over [−δ, δ]; and h2(x) = 0 if x < 1−δ, h2(x) = 1 if x > 1+δ, and h2


is strictly increasing over [1−δ, 1+δ]. For example, in the area of D which is away from its boundary by more than δ, the transition matrix is

( 0   1   0   0 )
( 0   0   0   1 )
( 0   0   0   1 )
( 1   0   0   0 )

because h1 = h2 = 1. If h1 > 0 or h2 > 0, then the invariant distribution is

(1 / (h1 + 2h2)) (h1, h1h2, h2(1−h1), h2).
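As a sanity check, which is outside the formal argument, one can verify symbolically that this vector is indeed invariant for the transition matrix written above; the computer algebra below is only a scratch pad.

    import sympy as sp

    h1, h2 = sp.symbols('h1 h2', nonnegative=True)
    P = sp.Matrix([[1 - h2, h2, 0,      0],
                   [0,      0,  0,      1],
                   [0,      0,  0,      1],
                   [h1,     0,  1 - h1, 0]])
    pi = sp.Matrix([[h1, h1*h2, h2*(1 - h1), h2]]) / (h1 + 2*h2)
    print(sp.simplify(pi * P - pi))   # zero row vector, so pi * P = pi
    print(sp.simplify(sum(pi)))       # 1, so pi is a probability vector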

If h1 = 0 and h2 = 0 (which is the case in the "interior" of B), the invariant distributions are

{ (λ, 0, (1−λ)/2, (1−λ)/2) | λ ∈ [0, 1] }.

Note that this case does not arise if we assume that the value of the sigmoid function is contained in (0,1). But at the same time, the existence of multiple invariant distributions reveals that the sequence of the unique invariant distributions induced by the sigmoid functions might be very sensitive to the choice of the sigmoid function. Thus, we have to pay special attention to the case where h1 = h2 = 0, or equivalently, {(a1, a2) | a1 ≤ −δ, a2 ≤ 1−δ}.

If h1 + 2h2 > 0, then the ODE is

ȧ1 = [h1 − ℓ h1h2 + (1+g) h2(1−h1)] / (h1 + 2h2) − a1,   (20)

ȧ2 = [h1 + (1+g) h1h2 − ℓ h2(1−h1)] / (h1 + 2h2) − a2.   (21)
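The following numerical sketch of the right-hand side of (20)–(21) is only an illustration, not part of the proof; it assumes, as indicated by the discussion of A1 and A2 below, that near s∗1 the function h1 is an increasing sigmoid of a1 with threshold 0 and h2 an increasing sigmoid of a2 with threshold 1, and the piecewise-linear shape over the δ-band is an illustrative choice.

    def ramp(x, center, delta):
        # 0 below center - delta, 1 above center + delta, linear in between
        return min(1.0, max(0.0, (x - center + delta) / (2.0 * delta)))

    def rhs(a1, a2, g, l, delta=0.05):
        # gradient of the perturbed mean dynamics (20)-(21) near s*_1
        h1, h2 = ramp(a1, 0.0, delta), ramp(a2, 1.0, delta)
        denom = h1 + 2.0 * h2
        if denom == 0.0:
            return None    # the degenerate case h1 = h2 = 0 discussed above
        da1 = (h1 - l * h1 * h2 + (1.0 + g) * h2 * (1.0 - h1)) / denom - a1
        da2 = (h1 + (1.0 + g) * h1 * h2 - l * h2 * (1.0 - h1)) / denom - a2
        return da1, da2

    print(rhs(0.0, 1.02, g=1.5, l=1.5))   # one sample of the vector field

Scanning rhs over a grid of (a1, a2) values is one crude way to trace the nullclines A1 and A2 introduced next.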

Clearly, at the stationary point s∗1, ȧ1 = ȧ2 = 0. Our task is to extract information about

A1 = {(a1, a2) | ȧ1 = 0}

and

A2 = {(a1, a2) | ȧ2 = 0}.

Let us examine A1. Since h2 is an increasing function of a2,

∂ȧ1/∂a2 ≤ 0,   (22)

and if h2′ > 0, then the strict inequality holds. Thus, if ȧ1 = 0 at (a1, a2), then for any (a1, a2′) ∉ A1, ȧ1 < 0 if a2′ > a2, and vice versa.

Consider the area where h2 = 1, which is the case if a2 ≥ 1+δ. A simple calculation shows that

ȧ1 = [1 + g − (ℓ+g) h1] / (h1 + 2) − a1 = 0


at A1. Since h1 is an increasing function of a1, this equation has a unique solution that satisfies the equality, which implies that A1 is a vertical line if h2 = 1:

if (a1, a2), (a1′, a2′) ∈ A1 and a2, a2′ ≥ 1+δ, then a1 = a1′.   (23)

Similarly, let us consider the area where h1 = 1, which covers the area where a1 ≥ δ. If h1 = 1, then

ȧ1 = (1 − ℓ h2) / (1 + 2h2) − a1,

which is a strictly decreasing function of h2:

∂ȧ1/∂a2 < 0.   (24)

In particular, if a1 = δ, then h2 = (1−δ)/(ℓ+2δ), and if a1 = 1, then h2 = 0. Thus, A1 is located in a narrow band of [δ, 1] × [1−δ, 1+δ]. Combining (22)–(24), we conclude that A1 is contained in a narrow band of

[−δ, δ] × [1−δ, ∞) ∪ [−δ, ∞) × [1−δ, 1+δ].

Moreover, A1 is "decreasing": if A1 is not a vertical line, then it has a negative slope.

Next, we examine A2. It is easy to verify that

∂ȧ2/∂a1 > 0.   (25)

If h1 = 0 and h2 > 0, or equivalently, a1 < −δ and a2 > 1−δ, then

ȧ2 = −ℓ/2 − a2 < 0.   (26)

If h1 > 0 and h2 = 0, then a1 > −δ, a2 ≤ 1−δ, and

ȧ2 = 1 − a2 ≥ δ > 0.   (27)

If h1 = 1 and 1−δ < a2 ≤ 1+δ, then a1 ≥ δ, 0 < h2 ≤ 1, and

ȧ2 = [(1+g) h2 + 1] / (1 + 2h2) − a2.

It would be helpful to understand the structure of the sigmoid function h2 over [1−δ, 1+δ]. Because h2 has to change from 0 to 1 over a small interval, its derivative must be very large over almost all parts of [1−δ, 1+δ]; in particular, h2′ ↑ ∞ as δ ↓ 0. Thus, if ∂ȧ2/∂a2 < 0 at a2 = 1−δ, then this derivative must become positive before ȧ2 becomes equal to 0, and the minimum of ȧ2 over [1−δ, 1+δ] must be located at a2 close to 1−δ. Since ∂ȧ2/∂a2 ≥ −1, the minimum of ȧ2 cannot be smaller than

[1 + (1+g) h2(1−δ)] / [1 + 2 h2(1−δ)] − (1−δ) = δ.


Thus, for a given g > 1, we can choose δ > 0 sufficiently small so that

[1 + (1+g) h2(1+δ)] / [1 + 2 h2(1+δ)] − (1+δ) = (2+g)/3 − (1+δ) > 0.

Then, it follows that

ȧ2 > 0

if a1 ≥ δ and a2 ∈ [1−δ, 1+δ]. So far, we have shown that ȧ2 < 0 over

(−∞, −δ] × (1−δ, ∞)   (28)

and that ȧ2 > 0 over

([−δ, ∞) × (−∞, 1−δ]) ∪ ([δ, ∞) × [1−δ, 1+δ]).   (29)

Since the gradient vector is a continuous function of the aspiration vector, A2 must be located outside of (28) and (29). Combining this observation with (22)–(24), we conclude that both A1 and A2 can exist only in

{(a1, a2) | −δ < a1 < δ, a2 > 1−δ}.   (30)

Thus, if there is a locally stable point, then it must be located in (30). For convenience, let us regard Fig. 4 as a "map." Note that A2 moves to the southwest region from (30), while A1 moves to the southeast region from (30). Thus, to prove that there exists a locally stable point, it suffices to show that over (30), A2 is located to the east side of A1.

We know that in A1, if a2 ≥ 1+δ, then h2 = 1 and therefore,

[h1 − ℓ h1 + (1+g)(1 − h1)] / (h1 + 2) − a1 = 0.

Hence,

h1 = (1 + g − 2a1) / (ℓ + g + a1).

Substituting this expression for h1 and setting h2 = 1 in ȧ2 evaluated at (a1, a2) ∈ A1, we obtain

ȧ2 = [1/(h1 + 2)] [ (1 + g − 2a1)(2 + ℓ + g − a2) / (ℓ + g + a1) − ℓ − 2a2 ].
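The substitution above can be double-checked symbolically. The short script below is only a scratch-pad computation, not part of the proof; it takes (20)–(21) with h2 = 1 as written above and reproduces both the expression for h1 on A1 and the displayed form of ȧ2.

    import sympy as sp

    a1, a2, g, l, h1 = sp.symbols('a1 a2 g l h1', real=True)

    # A1 with h2 = 1: (1 + g - (l + g) h1)/(h1 + 2) - a1 = 0, solved for h1
    h1_sol = sp.solve(sp.Eq((1 + g - (l + g)*h1)/(h1 + 2) - a1, 0), h1)[0]
    print(sp.simplify(h1_sol - (1 + g - 2*a1)/(l + g + a1)))    # 0

    # a2-dot with h2 = 1, then h1 replaced by its value on A1
    a2dot = (h1 + (1 + g)*h1 - l*(1 - h1))/(h1 + 2) - a2
    target = ((1 + g - 2*a1)*(2 + l + g - a2)/(l + g + a1) - l - 2*a2)/(h1 + 2)
    print(sp.simplify((a2dot - target).subs(h1, h1_sol)))       # 0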

It is verified that for a sufficiently small δ > 0,

ȧ2 < 0

for (a1, a2) ∈ A1 ∩ {a2 ≥ 1+δ}. By (25), ȧ2 < 0 over the area located on the left-hand side of A2. Thus, A1 is located on the left-hand side of A2 if a2 ≥ 1+δ.


Notice that a part of A2 is located in the area where ȧ1 < 0, while the rest is located in the area where ȧ1 > 0. Since the gradient changes continuously with respect to the aspiration vector,

A1 ∩ A2 ≠ ∅,

which proves that a stationary point exists. □

References

[1] M. Benaïm, J. Weibull, Deterministic approximation of stochastic evolution in games, Econometrica 71 (3) (2003) 873–903.
[2] J. Bendor, D. Mookherjee, D. Ray, Reinforcement learning in repeated interaction games, Adv. Theoret. Econ. 1 (1) (2003).
[3] I.-K. Cho, A. Matsui, Learning aspiration in repeated games, University of Illinois and University of Tokyo, 2003.
[4] I.-K. Cho, A. Matsui, Learning, aspiration and efficiency, University of Illinois and University of Tokyo, 2004.
[5] G.W. Evans, S. Honkapohja, Learning and Expectations in Macroeconomics, Princeton University Press, Princeton, NJ, 2001.
[6] I. Gilboa, D. Schmeidler, Case-based decision theory, Quart. J. Econ. 15 (1995) 1–26.
[7] I. Gilboa, D. Schmeidler, Reaction to price changes and aspiration level adjustments, Tel Aviv University, 2000.
[8] I. Gilboa, D. Schmeidler, A Theory of Case-Based Decisions, Cambridge University Press, Cambridge, 2001.
[9] M. Kandori, G. Mailath, R. Rob, Learning, mutation and long run equilibria in games, Econometrica 61 (1993) 27–56.
[10] R. Karandikar, D. Mookherjee, D. Ray, F. Vega-Redondo, Evolving aspirations and cooperation, J. Econ. Theory 80 (2) (1998) 292–331.
[11] H.J. Kushner, D.S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer, Berlin, 1978.
[12] H.J. Kushner, G.G. Yin, Stochastic Approximation Algorithms and Applications, Springer, Berlin, 1997.
[13] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Automat. Control (1977) 551–575.
[14] A. Marcet, T.J. Sargent, Convergence of least squares learning mechanisms in self referential linear stochastic models, J. Econ. Theory 48 (1989) 337–368.
[15] A. Matsui, Expected utility and case-based reasoning, Math. Social Sci. 39 (2000) 1–12.
[16] S. Napel, Aspiration adaptation in the ultimatum minigame, Games Econ. Behav. 43 (1) (2003) 86–106.
[17] J. Oechssler, Cooperation as a result of learning with aspiration levels, J. Econ. Behav. Organ. 49 (2002) 405–409.
[18] M. Posch, Win Stay, Lose Shift or Imitation—Only the Choice of the Peers Counts, University of Vienna, 2001.
[19] M. Posch, A. Pichler, K. Sigmund, The efficiency of adapting aspiration levels, Proceedings of the Royal Society, vol. 266, London, 1999, pp. 1427–1435.
[20] W. Sandholm, J. Hofbauer, On the global convergence of stochastic fictitious plays, Econometrica 70 (2002) 2265–2294.
[21] R. Sarin, F. Vahid, Payoff assessment without probabilities: a simple dynamic model of choice, Games Econ. Behav. 28 (2) (1999) 294–309.
[22] H. Simon, Models of Man, Continuity in Administrative Science, Ancestral Books in the Management of Organizations, Garland Publisher, New York, 1987.
[23] E.L. Thorndike, Animal Intelligence: Experimental Studies, MacMillan, New York, 1911.
[24] S.G. Winter, Satisficing, selection and the innovating remnant, Quart. J. Econ. 85 (2) (1971) 237–261.
[25] M.D. Woodford, Learning to believe in sunspots, Econometrica 58 (1990) 277–307.