
Queueing Systems 32 (1999) 65–97

Value iteration and optimization of multiclass queueing networks

Rong-Rong Chen and Sean Meyn ∗

University of Illinois and the Coordinated Science Laboratory, 1308 W. Main St., Urbana, IL 61801, USA

E-mail: {[email protected]; [email protected]}

Submitted 1 February 1998; accepted 1 December 1998

This paper considers in parallel the scheduling problem for multiclass queueing networks, and optimization of Markov decision processes. It is shown that the value iteration algorithm may perform poorly when the algorithm is not initialized properly. The most typical case, where the initial value function is taken to be zero, may be a particularly bad choice. In contrast, if the value iteration algorithm is initialized with a stochastic Lyapunov function, then the following hold: (i) a stochastic Lyapunov function exists for each intermediate policy, and hence each policy is regular (a strong stability condition), (ii) intermediate costs converge to the optimal cost, and (iii) any limiting policy is average cost optimal. It is argued that a natural choice for the initial value function is the value function for the associated deterministic control problem based upon a fluid model, or the approximate solution to Poisson's equation obtained from the LP of Kumar and Meyn. Numerical studies show that either choice may lead to fast convergence to an optimal policy.

Keywords: multiclass queueing networks, Markov decision processes, optimal control, dynamic programming

1. Introduction

This paper presents a convergence proof for the value iteration algorithm for general Markov decision processes, and also develops methods for the application of this algorithm to the synthesis of optimal scheduling policies for multiclass queueing networks. The latter results are based upon the close connection between optimization of a network, and optimal control of an associated fluid network model.

Over the past ten years there have been several successful attempts to approximate a network model with a more tractable process to reduce the complexity of the control synthesis problem. The recent paper [21] treats the optimal control of a multiclass queueing network by relating this problem to the optimal control of an associated diffusion process in heavy traffic, following the work of [13]. Methods for translating an optimal policy for the Brownian system model back to an implementable policy for the discrete-stochastic model are introduced in [12]. In [22,23] it is shown that the value function for the network scheduling problem can be approximated by the value function for an associated fluid limit model. Some heuristics based upon this result are developed in [24] to translate a policy for the fluid model back to the original discrete network. The results reported here provide a more exact approach to translating an optimal policy for the fluid model back to the original problem of interest.

∗ Work supported in part by NSF Grant ECS 940372 and JSEP Grant N00014-90-J-1270. This work was completed with the assistance of equipment granted through the IBM Shared University Research program and managed by the Computing and Communications Services Office at the University of Illinois at Urbana-Champaign.

We begin with the analysis of a general Markov decision process model with one step cost c and state process Φ = {Φ(t): t ≥ 0} evolving on a countable state space X. Our goal is to construct a stationary policy w with minimal average cost

$$J(w,x) := \limsup_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} \mathsf{E}_x\bigl[c\bigl(\Phi(t), w(\Phi(t))\bigr)\bigr]. \tag{1.1}$$

Value iteration is perhaps the most common approach in practice to constructing an optimal policy. The idea is to consider the finite time problem with value function

$$V_n(x) = \min \mathsf{E}_x\Bigl[\sum_{t=0}^{n-1} c\bigl(\Phi(t), a(t)\bigr) + V_0\bigl(\Phi(n)\bigr)\Bigr], \tag{1.2}$$

where {a(t): t ≥ 0} is a sequence of actions determined by some policy, and the minimum in (1.2) is with respect to all policies. The function V0 : X → R+ is a penalty term – the standard value iteration procedure uses V0 ≡ 0. Letting vⁿ denote a policy which attains this minimum, it may be assumed without loss of generality that there is a sequence of state feedback functions w_k : X → A, k ≥ 0, such that for any n, the policy vⁿ is a Markov policy whose first n actions may be expressed as

$$v^n_{[0,n-1]} = \bigl(w_{n-1}(\Phi(0)), \ldots, w_0(\Phi(n-1))\bigr).$$

The value iteration algorithm is then the standard dynamic programming approach to recursively computing an optimal sequence (V_n, w_n: n ≥ 1).

Various convergence proofs and counterexamples have appeared since the early sixties, with most of the general positive results holding in the case of finite state space models only. A thorough survey is found on pages 429–433 of [27] (see also [14]). In early papers the analyses typically focus on the differential cost function g_n(x) = V_{n+1}(x) − V_n(x) and the normalized value function h_n(x) = V_n(x) − V_n(θ), where θ is some distinguished state. Under various conditions one may show that as n → ∞, possibly through a subsequence,

$$g_n \to \eta^*, \qquad h_n \to h, \qquad w_n \to w^*,$$

where η∗ is the average minimum cost, and the triple (η∗, w∗, h) is a solution of the average cost optimality equation (see (2.1), (2.2) below).


Recently, there has been a resurgence of interest in understanding the algorithm when the state space is unbounded. The paper [5] treats countable state space models where the state space is a single communication class under any stationary policy. Convergence holds under two natural assumptions: the stabilizability condition that the steady state cost is finite for some stationary policy; and the almost monotone condition on the cost function of [2]. The irreducibility assumption was relaxed in [6] by imposing a global Lyapunov function condition similar to that of [15]. The global Lyapunov function condition is expressed as

$$\mathsf{E}^w\bigl[V\bigl(\Phi(t+1)\bigr) \mid \Phi(t) = x\bigr] \le V(x) - c\bigl(x, w(x)\bigr) + b\,\mathbf{1}_S(x), \qquad t \in \mathbb{Z}_+, \tag{1.3}$$

where V is a positive function on the state space, b < ∞, and S is a finite set, or more generally a compact set. It is assumed in [6] that there exists a single function V such that (1.3) holds for every Markov policy w, where S = {θ} is a singleton. Under this assumption it may be shown that E^v_x[τ_θ] is uniformly bounded over all policies, for each initial condition x, where τ_θ is the first return time to the state θ ∈ X.

In the paper [30], conditions are determined under which the optimal cost η∗ is computable through the limit η∗ = lim_{n→∞} V_n(x)/n, x ∈ X. The analysis is based upon the discounted control problem, and the use of a truncated value function to avoid the difficulties associated with unbounded costs. The paper begins with some implicit bounds on the relative discounted value function for the truncated control problem. These assumptions are related to more readily verifiable conditions such as the near monotone condition of [2], and the Lyapunov condition of [15]. Hence, this paper captures some aspects of the results of [5,6].

None of these contributions are applicable in general for multiclass network models since both the Lyapunov condition and the irreducibility condition fail for many models. A contribution of the present paper is to establish conditions for convergence which are valid in the networks context. Both the assumptions imposed and the methods of analysis are based on the recent treatment of the policy iteration algorithm of [23].

One major contribution of this paper is to resolve a significant drawback to the value iteration approach – it can be extremely slow. In [27, p. 385] the author writes "In average reward models, value iteration may converge very slowly, and policy iteration may be inefficient in models with many states . . . " Indeed, we have applied value iteration to network models with approximately 50 000 states where policy iteration is not directly applicable, and we have found that convergence is slow even for very simple models. The explanation in the network case is easily seen. One is attempting to approximate the relative value function h(x) by the difference h_n(x) = V_n(x) − V_n(θ). When V0 is taken to be zero, each approximation h_n is bounded by a linear function of x, and can grow by at most one in each iteration. The actual relative value function h is equivalent to a quadratic on the state space [23,24], so there is a large mismatch between the two functions whenever the state is large. For this reason, each of the state feedback laws {w_n} generated by the value iteration algorithm can actually induce a transient state process Φ (see section 4).


We show in this paper that if the value iteration algorithm is initialized with V0 equal to a stochastic Lyapunov function satisfying (1.3) for just one policy w, then the value iteration algorithm recursively constructs solutions to a version of the drift inequality (1.3) for each n. Hence, we reach the same conclusion established for the policy iteration algorithm in the companion paper [23]: the strong stability condition (1.3) is superfluous when working with the value iteration algorithm because the algorithm automatically generates stabilizing policies. It is only necessary to find an initial stabilizing policy to initialize the algorithm. Based upon this observation we prove that the intermediate average costs J(w_n, x) are finite for each n, and independent of x; that the average costs J(w_n, x) converge to the optimal cost η∗ as n → ∞; and that any limiting policy is average cost optimal.

In the network optimization problem the relative value function for the optimal policy may be approximated by the value function for the associated fluid control problem. It is thus natural to use the latter value function to initialize the value iteration algorithm. A second approach we consider is based on computing an approximate solution to Poisson's equation through the stability LP of [18]. Results from numerical experiments show that either choice may lead to fast convergence to an optimal policy. We thus arrive at a new way of using the information gained from solving a deterministic optimization problem to solve the original discrete scheduling problem of interest.

The paper is organized as follows. In section 2 we present the main results concerning the convergence of the value iteration algorithm. The assumptions are satisfied for general multiclass queueing networks of the form described in section 3. Methods for constructing suitable initializations for the value iteration algorithm for the network scheduling problem are described in sections 4–6. The appendices contain proofs of the main results and some background theory.

2. Value iteration

Consider a general Markov decision process whose state space X and action space A are countable. Detailed treatments of Markov decision processes can be found in, for instance, [27]. We present here a bare-bones description of the general model.

Associated with each x ∈ X is a non-empty subset A(x) ⊆ A whose elements are the admissible actions when the state Φ_t takes the value x at time t. The transitions of the state process Φ are governed by the conditional probability distributions {P_a(x,y)} which describe the probability that the next state is y ∈ X given that the current state is x ∈ X, and the current action chosen is a ∈ A. A policy w is a sequence of actions {a(t): t ∈ Z+} which is adapted, that is, a(t) can only depend on the history {Φ(0), ..., Φ(t)}. We will consider primarily Markov policies of the form w = {w_0(Φ(0)), w_1(Φ(1)), w_2(Φ(2)), ...}, where for each i the function w_i maps X to A, with w_i(x) ∈ A(x) for each x. For a Markov policy w we denote the resulting Markov chain Φ^w := {Φ^w(t): t ≥ 0} – we simply write Φ if it is clear from the context which policy has been applied.


A stationary policy is a Markov policy for which w_i = w for all i, for some fixed state feedback law w. The action w(x) is applied whenever the state takes the value x, independent of the past and independent of the time-period. We shall write P_w(x,B) = P_{w(x)}(x,B) for the transition law corresponding to a stationary policy w. The n-step transition probabilities are denoted as

$$P^n_w(x,y) = \mathsf{P}\bigl(\Phi_w(n) = y \mid \Phi_w(0) = x\bigr), \qquad x, y \in X.$$

We also use the operator-theoretic notation

$$P^n_w h(x) := \mathsf{E}\bigl[h\bigl(\Phi_w(n)\bigr) \mid \Phi_w(0) = x\bigr],$$

where h is any real-valued function on X. The resolvent kernel is defined for a feedback law w as

$$K_w = \sum_{t=0}^{\infty} 2^{-(t+1)}P^t_w.$$

We will occasionally extend this definition to a Markov policy v = (v_0, v_1, ...) via

$$K_v(x,y) := \mathsf{E}^v_x\Bigl[\sum_{t=0}^{\infty} 2^{-(t+1)}\mathbf{1}\bigl(\Phi(t) = y\bigr)\Bigr], \qquad x, y \in X.$$

We assume that a cost function c : X × A → [1, ∞) is given. The average cost of a particular policy w is, for a given initial condition x, defined as

$$J(w,x) := \limsup_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} \mathsf{E}^w_x\bigl[c\bigl(\Phi_w(t), a(t)\bigr)\bigr].$$

A policy w∗ is then called optimal if J(w∗, x) ≤ J(w, x) for all policies w, and any initial state x.

A state feedback law w is called regular if the controlled chain is a c_w-regular Markov chain, where c_w(y) = c(y, w(y)). We say that a policy w is regular if w is a stationary Markov policy defined by a regular state feedback law. That is, for some distinguished state θ ∈ X,

$$\mathsf{E}^w_x\Bigl[\sum_{t=0}^{\tau_\theta-1} c_w\bigl(\Phi(t)\bigr)\Bigr] < \infty, \qquad x \in X,$$

where τ_θ is the first return time to the state θ. This is a highly desirable stability property for the controlled process. If the feedback law w is regular, then necessarily an invariant probability measure π_w exists such that π_w(c_w) < ∞. Moreover, for a regular w, the resulting cost is J(w,x) = Σ_y π_w(y)c_w(y), independent of x. The following result is a consequence of the f-norm ergodic theorem of [25, chapter 14].

Theorem 2.1. For any regular feedback law w, there exists a unique invariant probability π_w, and the controlled state process Φ satisfies η_w := Σ_x c_w(x)π_w(x) < ∞. For each initial condition the average cost is equal to η_w, independent of x, and the following limits hold:

$$J(w,x) = \eta_w = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n} \mathsf{E}_x\bigl[c_w\bigl(\Phi(t)\bigr)\bigr] = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n} c_w\bigl(\Phi(t)\bigr), \qquad \text{a.s. } [\mathsf{P}_x].$$

In view of theorem 2.1, we denote the average cost J(w) = J(w,x) when the policy w is regular.

The construction of an optimal policy typically involves the solution to the following equations:

$$\eta^* + h^*(x) = \min_{a\in A(x)}\bigl[c(x,a) + P_a h^*(x)\bigr], \tag{2.1}$$

$$w^*(x) = \arg\min_{a\in A(x)}\bigl[c(x,a) + P_a h^*(x)\bigr], \qquad x \in X. \tag{2.2}$$

The equality (2.1), a version of Poisson's equation, is known as the average cost optimality equation (ACOE). The second equation (2.2) defines a stationary policy w∗ (see, e.g., [1,2,14,27] for further discussion).

The value iteration algorithm, or VIA, is defined inductively as follows. If the value function V_n is given, the action w_n(x) is defined as

$$w_n(x) = \arg\min_{a\in A(x)}\bigl[P_a V_n(x) + c(x,a)\bigr], \qquad x \in X.$$

For each n the following dynamic programming equation is satisfied:

$$V_{n+1}(x) = c_{w_n}(x) + P_{w_n}V_n(x) = \min_{a\in A(x)}\bigl(P_a V_n(x) + c_a(x)\bigr),$$

which then makes it possible to compute the next function w_{n+1}.

To simplify notation we denote throughout the remainder of this paper

$$c_n = c_{w_n}, \qquad P_n = P_{w_n}, \qquad K_n = K_{w_n},$$

and we let E_n denote the expectation operator induced by the stationary policy

$$w^n := \bigl(w_n(\Phi(0)),\, w_n(\Phi(1)),\, w_n(\Phi(2)),\, \ldots\bigr).$$

Suppose that θ ∈ X is some distinguished state, and define

$$h_n(x) = V_n(x) - V_n(\theta); \qquad g_n(x) = V_{n+1}(x) - V_n(x), \qquad x \in X,\ n \in \mathbb{Z}_+. \tag{2.3}$$

Then for each n we have the identity P_n h_n = h_n − c_n + g_n, which at least superficially resembles the ACOE. We show below that the pair {h_n, g_n} does indeed converge to a solution {h∗, η∗} of (2.1), (2.2) under reasonable conditions on the model and on the initial value function V0.
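On a finite (or truncated) model the recursion above is a few lines of code. The following sketch is our own illustration under stated conventions, not the authors' implementation: `P[a]` is the transition matrix for action a, `c[a]` the corresponding cost vector, and actions outside A(x) are assumed to be encoded with cost +∞ so the minimum never selects them.

```python
import numpy as np

def via(P, c, V0, n_iter):
    """Value iteration with terminal penalty V0, following (1.2).

    P : list of |X| x |X| transition matrices, one per action
    c : list of |X| cost vectors, one per action (np.inf where inadmissible)
    Returns V_n and the sequence of feedback laws w_0, ..., w_{n-1}.
    """
    V, laws = V0.astype(float).copy(), []
    for _ in range(n_iter):
        # Q[a, x] = P_a V_n (x) + c(x, a)
        Q = np.stack([P[a] @ V + c[a] for a in range(len(P))])
        laws.append(Q.argmin(axis=0))   # feedback law w_n
        V = Q.min(axis=0)               # V_{n+1}
    return V, laws
```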

We assume that at least one regular policy w−1 exists, so that there also exists a function V0 : X → R+ and an η < ∞ such that

$$P_{w_{-1}} V_0 \le V_0 - c_{w_{-1}} + \eta. \tag{2.4}$$


This stabilizability assumption is a generalization of that used in [5] and many other papers. If the value iteration algorithm is initialized with this function V0, then the resulting penalty term in V_n, n ≥ 1, "regularizes" the intermediate policies to ensure that each is stabilizing, and in the examples considered it also appears to speed convergence.

We assume throughout the paper that there exists a regular, optimal policy w∗, a P_{w∗}-invariant probability π_{w∗}, and a relative value function h∗ satisfying the ACOE with η∗ = π_{w∗}(c_{w∗}). The quantities (w∗, π_{w∗}, h∗) may not be unique, but we fix one such triple throughout the paper in our assumptions and in the analysis to follow. These conditions will be met under the stabilizability part of assumption (A1), and assumptions (A2), (A3) below (the conditions of [29] may be verified, following the approach of [23]).

Assumptions (A2) and (A3) are related to the near-monotone assumption of [2]. We call a function c norm-like if the sublevel set {x: c(x) ≤ b} is a finite subset of X for any finite b. It is assumed in assumption (A2) below that for any "reasonable" policy w, the cost c_w(x) dominates a norm-like function c on X. Assumption (A3) then imposes an accessibility condition on the sublevel set S0 ⊂ X defined as

$$S_0 = \bigl\{x: c(x) \le 2\eta\bigr\}. \tag{2.5}$$

Although this assumption imposes an accessibility condition on all Markov policies, assumption (A3) is only used for the optimal policies vⁿ, and the stationary policies wⁿ, n ∈ Z+. Assumption (A3) is weaker than the Lyapunov condition of [6]. It is satisfied for the network scheduling problem described in section 3, and other storage and routing models found in the operations research area.

Formally, our assumptions are summarized as follows:

(A1) There exists a policy w−1, a function V0 : X → R+, and η < ∞ satisfying (2.4) and, for the optimal policy w∗,

$$\lim_{n\to\infty} \frac{1}{n}\bigl(P^n_{w^*} V_0\bigr)(x) = 0, \qquad x \in X.$$

(A2) For each fixed x, the function c(x, ·) is norm-like on A, and there exists a norm-like function c : X → R+ such that for any regular policy w satisfying J(w) ≤ η,

$$c_w(x) \ge c(x), \qquad x \in X.$$

(A3) There is a fixed state θ ∈ X and a δ > 0 such that for any Markov policy v,

$$K_v(x,\theta) \ge \delta \quad \text{for all } x \in S_0, \tag{2.6}$$

where S0 is defined in (2.5), and for any action a ∈ A(θ),

$$P_a(\theta,\theta) \ge \delta.$$

We will occasionally also assume:


(A4) For any regular optimal policy w the controlled chain is irreducible in the usual sense that

$$K_w(x,y) > 0, \qquad x, y \in X.$$

The main results of this section are summarized in the following two theorems.

Theorem 2.2. For the value iteration algorithm initialized with the function V0, suppose that assumptions (A1)–(A3) are satisfied. Then:

(i) For each x the sequences {g_n(x)} and {h_n(x)} are bounded, and

$$\lim_{n\to\infty} \frac{1}{n}V_n(\theta) = \lim_{n\to\infty} g_n(\theta) = \eta^*,$$

$$\limsup_{n\to\infty} \frac{1}{n}V_n(x) \le \limsup_{n\to\infty} g_n(x) \le \eta^*, \qquad x \in X.$$

(ii) Each intermediate state feedback law w_n is regular, and each V_n serves as a Lyapunov function for the nth policy:

$$P_n V_n \le V_n - c_n + \eta, \qquad n \ge 0.$$

(iii) The average cost J(w_n) converges to the optimal cost:

$$J(w_n) \to \eta^* \quad \text{as } n \to \infty.$$

(iv) Any point-wise limit point of the feedback laws {w_n: n ∈ Z+} is regular and optimal.

Theorem 2.3. Under assumptions (A1)–(A4) the conclusions of theorem 2.2 hold, and in addition, as n → ∞,

$$h_n(x) \to h^*(x) - h^*(\theta), \qquad g_n(x) \to \eta^*, \qquad \frac{1}{n}V_n(x) \to \eta^*, \qquad x \in X.$$

Proof of theorem 2.2. The two limits in (i) follow from lemma B.9(ii). The bounds on the two limit supremums follow from lemma B.9(i) and the formula

$$\frac{1}{n}\sum_{t=0}^{n-1} g_t(x) = \frac{1}{n}\bigl(V_n(x) - V_0(x)\bigr).$$

The bound in (ii), and, hence, also regularity, is established in proposition B.3.

Result (iii) requires the lower bound c_w ≥ c. From proposition B.3 and this lower bound we have π_n(c) ≤ η for all n, which shows that the probabilities {π_n: n ∈ Z+} are tight. From proposition B.3 and the Comparison Theorem A.1 we have π_n(c_n) ≤ π_n(g_n), where g_n ≤ η. Since the probabilities {π_n} are tight, for any preassigned ε > 0 there exists a finite set C ⊂ X with π_n(C) > 1 − ε for all n. Thus,

$$\eta - \limsup_{n\to\infty} J(w_n) \ge \liminf_{n\to\infty}\bigl(\eta - \pi_n(g_n)\bigr) \ge \liminf_{n\to\infty}\sum_{x\in C} \pi_n(x)\bigl(\eta - g_n(x)\bigr) \ge (1-\varepsilon)\bigl(\eta - \eta^*\bigr).$$


Since ε is arbitrary, this shows that lim sup_{n→∞} J(w_n) ≤ η∗, as desired.

Finally, result (iv) is immediate from (i) and Fatou's lemma applied to the identity P_n h_n = h_n − c_n + g_n obtained in proposition B.3. □

Proof of theorem 2.3. The convergence of h_n follows from lemma B.10. We may then use the identity g_n(x) = g_n(θ) + h_{n+1}(x) − h_n(x) and theorem 2.2(i) to prove that g_n(x) converges to η∗. The convergence of V_n(x)/n to η∗ then follows as in the proof of theorem 2.2(i). □

3. Discrete and fluid models for multiclass networks

Consider a network of the form illustrated in figure 1, composed of d single server stations, indexed by σ = 1, ..., d. The network is populated by K classes of customers, and an exogenous stream of customers of class 1 arrives to machine s(1). A customer of class k requires service at station s(k). If the service times and interarrival times are assumed to be exponentially distributed, then after a suitable time scaling and sampling of the process, the dynamics of the network can be described by the random linear system

$$\Phi(t+1) = \Phi(t) + \sum_{i=0}^{K} I_i(t+1)\bigl[e_{i+1} - e_i\bigr]w_i\bigl(\Phi(t)\bigr), \tag{3.1}$$

where the state process Φ evolves on X = Z^K_+. The random variable Φ_i(t) denotes the number of class i customers in the system at time t which await service in buffer i at station s(i). The function w : X → {0,1}^{K+1} is the policy to be designed. If w_i(Φ(t))I_i(t+1) = 1, this means that at the discrete time t, a customer of class i is just completing a service, and is moving on to buffer i+1 or, if i = K, the customer then leaves the system. The set of admissible control actions A(x) when the state is x ∈ X is defined as follows for a = (a_0, ..., a_K)′ ∈ A(x) ⊂ {0,1}^{K+1}:

(i) a_0 = 1, and for any 1 ≤ i ≤ K, a_i = 0 or 1.

(ii) For any 1 ≤ i ≤ K, x_i = 0 ⇒ a_i = 0.

(iii) For any station σ, 0 ≤ Σ_{i: s(i)=σ} a_i ≤ 1.

(iv) For any station σ, Σ_{i: s(i)=σ} a_i = 1 whenever Σ_{i: s(i)=σ} x_i > 0.

Figure 1. A multiclass network with d = 2 and K = 3.

Condition (iv) is the non-idling property that a server will always work if there is work to be done. With the one step cost c_w(x) = |x| := Σ_i x_i, the non-idling condition may be assumed without any loss of generality [24].

∑i Ii(t) =

1} = 1, and E[Ii(t)] = µi. For 1 6 i 6 K, µi denotes the service rate for class icustomers, and for i = 0 we let µ0 := λ denote the arrival rate of customers ofclass 1. For 1 6 i 6 K we let ei denote the ith basis vector in RK , and we sete0 = eK+1 := 0. It is evident that these specifications define a Markov decisionprocess whose state transition function has the simple form:

Pa(x,x+ ei+1 − ei

)=µiai, 0 6 i 6 K,

Pa(x,x) = 1−K∑0

µiai.
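To make this transition law concrete, here is a minimal simulation sketch of one sampled transition of (3.1). This is our own illustration; the function `step`, its argument conventions, and the normalization Σᵢµᵢ = 1 follow the description above.

```python
import numpy as np

def step(x, a, mu, rng):
    """One transition of the sampled network model (3.1).

    x  : buffer levels Phi(t), length K
    a  : admissible action (a_0, ..., a_K) with a_0 = 1
    mu : (mu_0, ..., mu_K), mu_0 = lambda; assumes sum(mu) = 1
    """
    K = len(x)
    p = np.asarray(mu, float) * np.asarray(a, float)
    p = np.append(p, 1.0 - p.sum())     # residual mass: P_a(x, x)
    i = rng.choice(K + 2, p=p)          # which event fires
    x = np.array(x, dtype=int).copy()
    if i <= K:                          # class-i event (i = 0 is an arrival)
        if i >= 1:
            x[i - 1] -= 1               # service completion in buffer i
        if i < K:
            x[i] += 1                   # customer joins buffer i + 1
    return x
```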

We assume throughout the paper that the usual load conditions are satisfied:

$$\rho_\sigma = \sum_{i:\, s(i)=\sigma} \frac{\lambda}{\mu_i} < 1, \qquad 1 \le \sigma \le d. \tag{3.2}$$

For concreteness we consider exclusively the one step cost c_w(x) = |x|. It is now known that for a network model with this cost criterion, regularity of a stationary policy w is closely connected with the stability of an associated fluid limit model [9,10]. For each initial condition Φ^w(0) = x ≠ θ of the controlled chain Φ^w we construct a continuous time process φ^x(t) as follows. If |x|t is an integer, set

$$\varphi^x(t) = \frac{1}{|x|}\Phi^w\bigl(|x|t\bigr).$$

For all other t ≥ 0, define φ^x(t) by linear interpolation, so that it is continuous and piecewise linear in t. Note that |φ^x(0)| = 1, and that φ^x is Lipschitz continuous. The collection of all "fluid limits" is defined by

$$\mathsf{L} := \bigcap_{n=1}^{\infty} \overline{\{\varphi^x: |x| \ge n\}},$$

where the overline denotes weak closure. The process φ evolves on the state space R^K_+. We shall also call L the fluid limit model, in contrast to the fluid model which is defined as the set of all continuous solutions to the differential equation

$$\frac{d}{dt}\varphi(t) = \sum_{i=0}^{K} \mu_i\bigl[e_{i+1} - e_i\bigr]u_i(t), \qquad \text{a.e. } t \in \mathbb{R}_+, \tag{3.3}$$

where the function u(t) is analogous to the discrete control, and satisfies similar constraints [7].
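The fluid scaling is easy to visualize by simulation. The sketch below, again our own illustration, reuses the hypothetical `step` function above to trace φ^x(t) = Φ(|x|t)/|x| from a large initial condition under a given feedback law.

```python
import numpy as np

def fluid_path(x0, policy, mu, T=2.0, seed=0):
    """Approximate phi^x(t) on [0, T] by running |x| * T sampled transitions."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=int)
    n = int(x.sum())                        # |x|
    path = [x / n]
    for _ in range(int(T * n)):
        x = step(x, policy(x), mu, rng)     # step() as sketched above
        path.append(x / n)
    return np.array(path)                   # path[k] approximates phi^x(k / n)
```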

The fluid limit model L is called L_p-stable if

$$\lim_{t\to\infty}\, \sup_{\varphi\in\mathsf{L}} \mathsf{E}\bigl[\bigl|\varphi(t)\bigr|^p\bigr] = 0.$$

It is shown in [19,24] that L_2-stability of the fluid limit model is equivalent to a form of c-regularity for the network:

Theorem 3.1. The following stability criteria are equivalent for the network under any non-idling, stationary Markov policy w:

(i) The drift condition holds:

$$P_w V(x) \le V(x) - |x| + \eta, \qquad x \in X, \tag{3.4}$$

where η ∈ R+, and the function V : X → R+ is equivalent to a quadratic in the sense that, for some γ > 0,

$$1 + \gamma|x|^2 \le V(x) \le 1 + \gamma^{-1}|x|^2, \qquad x \in X.$$

(ii) For some quadratic function V,

$$\mathsf{E}^w_x\Bigl[\sum_{n=0}^{\sigma_\theta} \bigl|\Phi(n)\bigr|\Bigr] \le V(x), \qquad x \in X.$$

(iii) For some quadratic function V and some γ < ∞,

$$\frac{1}{N}\sum_{n=1}^{N} \mathsf{E}^w_x\bigl[\bigl|\Phi(n)\bigr|\bigr] \le \frac{1}{N}V(x) + \gamma, \qquad \text{for all } x \text{ and } N \ge 1.$$

(iv) The fluid limit model L is L_2-stable.

(v) The total cost for the fluid limit L is uniformly bounded in the sense that

$$\sup_{\varphi\in\mathsf{L}} \mathsf{E}\Bigl[\int_0^\infty \bigl|\varphi(\tau)\bigr|\,d\tau\Bigr] < \infty.$$

Using theorem 3.1 it is shown in [23] that the optimal control of a network and the optimal control of the fluid model are also related. As an illustration, consider the three-buffer example illustrated in figure 1. We have taken the parameters

$$\lambda = 0.1429, \qquad \mu_1 = 0.3492, \qquad \mu_2 = 0.1587, \qquad \mu_3 = 0.3492, \tag{3.5}$$

so that ρ1 = λ/µ1 + λ/µ3 = 9/11, and ρ2 = λ/µ2 = 9/10. The optimal policy for the fluid model is illustrated in figure 2. It can be defined succinctly as

$$\text{Serve buffer one if and only if } x_3 = 0, \text{ or } x_1 > x_3 - 1 + \bigl(\mathbf{1}(x_2 = 0)\bigr)^{-1}.$$


The form of the policy is logical: If the second buffer is non-empty, then the last buffer receives exclusive service. When the second buffer x2 is empty and x1 > x3 then, because service at buffer two is slow, the first buffer releases fluid to avoid starvation at the second machine.

The optimal policy was computed for the stochastic model with the performance index

$$J_n(x) = \frac{1}{n}\sum_{t=1}^{n} \mathsf{E}_x\bigl[\bigl|\Phi(t)\bigr|\bigr].$$

Figure 2. Optimal policy for the fluid model with ρ2 = 9/10 and ρ1 = 9/11. In this illustration, the grey regions indicate those states for which buffer three is given exclusive service.

Figure 3. Optimal policy for the discrete network. There is an approximately affine switching curve which is similar to the linear switching curve found for the fluid model policy illustrated in figure 2.


To compute the policy numerically we used value iteration, terminated at n = 7 000. The buffer levels were truncated so that x_i < 45 for all i. This gives rise to a finite state space Markov decision process with 45³ = 91 125 states. In figure 3 we see the result of this computation. Again there is a roughly linear or affine switching curve – buffer one is served provided that buffer two is small, and the population at buffer one is reasonably large compared with that at buffer three. The policy illustrated in figure 3 is closely approximated by the formula

$$\text{Serve buffer one if and only if } x_3 = 0, \text{ or } x_1 > x_3 - 29 + 10\exp(x_2/2).$$

The fluid limit of this approximation is precisely the optimal fluid policy illustrated in figure 2.
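For reference, the approximating policy just displayed is trivial to mechanize; the function below is a direct transcription of the formula (our own rendering, with the constants 29 and 10 as reported for the parameters (3.5)).

```python
from math import exp

def serve_buffer_one(x1, x2, x3):
    """Approximate optimal policy for the discrete three-buffer network."""
    return x3 == 0 or x1 > x3 - 29 + 10 * exp(x2 / 2)
```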

4. Initialization of the VIA

For the optimal scheduling policy the relative value function h∗ is equivalent to a quadratic on X = Z^K_+ in the sense that for some γ ≥ 1,

$$-\gamma + \gamma^{-1}|x|^2 \le \bigl(h^*(x) - h^*(\theta)\bigr) \le \gamma + \gamma|x|^2, \qquad x \in X,$$

where θ is the vector of zeros in X [23]. However, in the standard VIA, for each n,

$$V_n(x) = \min \mathsf{E}^w_x\Bigl[\sum_{t=0}^{n-1} \bigl|\Phi(t)\bigr|\Bigr] \le \min \mathsf{E}^w_x\Bigl[\sum_{t=0}^{n-1} \bigl(\bigl|\Phi(0)\bigr| + t\bigr)\Bigr] \le n|x| + n^2,$$

where we are using the skip-free property of the network that |Φ(t+1)| ≤ |Φ(t)| + 1, t ∈ Z+. It follows that, for each n, the function h_n(x) = V_n(x) − V_n(θ) is bounded from above by a linear function of x. Hence, the approximation h_n(x) ≈ h∗(x) is grossly inaccurate for any finite n when the state x is large.

As one might then expect, for any n the actions w_n(x), x ∈ X, defined by the feedback law w_n may be far from optimal when the state x is large. We illustrate how the feedback laws {w_n} generated by the standard VIA can give poor performance with the following two examples.

Example 1: A simple queue with controlled service

Here we describe a model which satisfies all of the conditions of [5]. Hence, with V0 ≡ 0, the VIA does converge to give an optimal policy. However, for each iteration n the state feedback law w_n induces a transient Markov chain Φ, so that J(w_n) = ∞.

The state space is taken as X = Z+, and the action space is A(x) = {0,1}, x ≠ 0, with A(0) = {0}. There is an arrival rate λ and a variable service rate µ which takes on the small value µ1, or the larger value µ1 + µ2, depending upon the control. We assume that λ + µ1 + µ2 = 1, and define the transition law as follows:

$$P_a(x, x+1) = \lambda, \qquad P_a(x, x-1) = \mu_1 + \mu_2 a, \qquad P_a(x,x) = 1 - \mu_1 - \mu_2 a - \lambda.$$


That is, if w(x) = 1 then a customer receives maximal service during the current time slot when there are x customers in the queue. Assuming that ρ = λ/(µ1 + µ2) < 1, the feedback law defined as w^α(x) = 1, x ≥ α, is regular for any α ≥ 1, and the optimal policy is of this form for some α. We assume that λ/µ1 > 1, so that the lazy server defined by w(x) ≡ 0 gives rise to a transient chain Φ.

Suppose that c_w(x) = (1 + w(x))x, so that the one-step cost of serving a customer is proportional to the total number of customers in the system. Then for the standard VIA it may be shown inductively that there exists {α_n: n ∈ Z+} such that w_n(x) = 0 for all x ≥ α_n. It follows that the chain Φ^{w_n} obtained with the stationary policy w_n is transient for any n, although the policies {w_n: n ≥ 0} do converge pointwise to give an optimal policy.

To obtain a version of the VIA which generates regular policies, initialize the algorithm with the function V0(x) = x²/(µ1 + µ2 − λ). With the feedback law w−1(x) = 𝟙(x > 0) we do have P_{w−1}V0 ≤ V0 − c_{w−1} + η for some η < ∞. It follows from theorem 2.2 that each successive feedback law w_n is regular, and that the policies converge to an optimal policy as n → ∞. In fact, it may be shown directly by induction that w_n(x) = 1 for all n and all x sufficiently large, which immediately implies that w_n is regular. We also note that in this case the function h_n(x) = V_n(x) − V_n(0) is equivalent to a quadratic for all n ≥ 0. Hence, when properly initialized, the VIA returns the correct form of both the optimal policy and the relative value function for large x, and only modifies the intermediate policies for x in a finite subset of X.
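A numerical rendering of this example is sketched below. It is our own illustration, with assumed rates λ = 0.3, µ1 = 0.2, µ2 = 0.5 (so that λ + µ1 + µ2 = 1 and λ/µ1 > 1) and the chain truncated to {0, ..., N} by reflecting arrivals at N.

```python
import numpy as np

lam, mu1, mu2, N = 0.3, 0.2, 0.5, 200
x = np.arange(N + 1)
V = x**2 / (mu1 + mu2 - lam)          # the quadratic V0 from the text

def bellman(V):
    """One VIA step for the controlled service-rate queue."""
    Q = np.empty((2, N + 1))
    up, down = np.minimum(x + 1, N), np.maximum(x - 1, 0)
    for a in (0, 1):
        mu = mu1 + mu2 * a
        cost = (1 + a) * x            # c(x, a) = (1 + a) x
        Q[a] = cost + lam * V[up] + mu * V[down] + (1 - lam - mu) * V
    Q[1, 0] = np.inf                  # A(0) = {0}: cannot serve an empty queue
    return Q

for n in range(100):
    Q = bellman(V)
    w, V = Q.argmin(axis=0), Q.min(axis=0)
# consistent with the text, w(x) = 1 for all large x at every iteration
```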

Example 2: The Rybko–Stolyar model

The next example treats a model introduced concurrently in the papers [20,28]. Consider the network illustrated in figure 4, consisting of four buffers and two machines fed by separate arrival streams. It is shown in [28] that the last buffer-first served policy, where buffers 2 and 4 receive priority at their respective machines, will make the controlled process Φ transient, even when the load conditions (3.2) are satisfied, if the cross-machine condition λ/µ2 + λ/µ4 < 1 is violated.

If the VIA is applied to this model with V0 ≡ 0, then one obtains V1(x) = |x|, and

$$V_2(x) = |x| + \min_{w}\bigl(|x| + 2\lambda - \mu_2 w_2(x) - \mu_4 w_4(x)\bigr).$$

The minimizing feedback law w² is evidently given by

$$w^2_2(x) = \mathbf{1}(x_2 > 0), \qquad w^2_4(x) = \mathbf{1}(x_4 > 0).$$

We conclude that J(w²) ≡ ∞ when λ/µ2 + λ/µ4 > 1 since this is precisely the destabilizing policy introduced in [20,28].

Figure 4. A multiclass network with d = 2 and K = 4.

Figure 5. Convergence of the VIA with V0 taken to be zero, and with V0(x) = |x|². The vertical axis is an approximation to the steady state cost J(w_n), and the horizontal axis is n, the number of iterations of the algorithm.

These examples show that the standard VIA may return policies which are not stabilizing, and we have evidence to suspect that convergence may be slow, given the mismatch between the functions h_n and the limiting relative value function h∗. We have conducted numerical experiments to test the rate of convergence of the VIA for the three-buffer example illustrated in figure 1 with the parameters defined in (3.5). Figure 5 shows the results from two implementations of the VIA. In the first we consider the standard algorithm with V0 ≡ 0. For comparison purposes we also consider the case V0(x) = |x|². This might appear to be a natural choice since it will result in lower values for the terminal cost. The vertical axis is the approximate value of the steady state cost J(w_n), where w_n is the stationary policy obtained at the nth iteration of the VIA. Because the problem is large we cannot compute this cost exactly, but instead use

$$J(w_n) \approx \frac{1}{6\,600}\sum_{t=1}^{6\,600} \mathsf{E}^{w_n}_{\theta}\bigl[\bigl|\Phi(t)\bigr|\bigr].$$

In either case we found that it takes several thousand iterations to reach convergence for this model. The figure shows results from the first 300 steps of the algorithm. Data was saved for n equal to multiples of ten: n = 10, ..., 300.

In summary, the standard VIA suffers from two potential drawbacks: intermediate policies may perform poorly, and may even give rise to a transient Markov chain, and the convergence of the algorithm can be slow. In the following two sections we improve this situation by establishing general methods for constructing a more appropriate initial value function V0. Although in practice we will never optimize the full infinite dimensional model, the approach described here may be used even in the finite state space truncated model to speed convergence. We see in the next section that for a network with buffer levels truncated at 45, the standard VIA requires thousands of iterations for convergence, while the VIA implemented with an appropriate initial value function converges to the same level of performance in less than twenty steps, even though the chain possesses 45³ = 91 125 states.

5. Initialization through the fluid model

From theorem 3.1(i) and theorem 2.2 we conclude that the VIA will converge if the algorithm is initialized with a stationary policy w−1 whose fluid model is L2-stable, since the network model will then satisfy assumptions (A1)–(A3) when initialized with a solution to (3.4). Assumption (A1) will hold since the relative value function is equivalent to a quadratic for any such policy, and the accessibility assumption (A3) holds with θ equal to the empty state [24]. There are many stabilizing policies which may serve as the initial policy w−1 (see, e.g., [8]). This leads to several algorithmic approaches to constructing the initial value function V0 based on the value function for the fluid model. We begin with the following suggestive proposition. The result (ii) is proven in [23]. For the sake of brevity we omit the proof of (i).

Proposition 5.1. If the feedback law w gives rise to a network whose fluid limit model L is L2-stable, then:

(i) The function V below is a solution to (3.4) for any b > 1 and all T sufficiently large:

$$V(x) = b|x|^2\,\mathsf{E}^w_x\Bigl[\int_0^T \bigl|\varphi^x(\tau)\bigr|\,d\tau\Bigr]. \tag{5.1}$$

(ii) There exists a solution h to the Poisson equation P_w h = h − |x| + η_w, and the function h approximates the value function V as follows:

$$\bigl(1 - \varepsilon(|x|, T)\bigr)V(x) \le h(x) \le \bigl(1 + \varepsilon(|x|, T)\bigr)V(x),$$

where ε > 0 and satisfies lim sup_{T→∞} lim sup_{r→∞} ε(r, T) = 0.


Figure 6. Convergence of the VIA with V0 taken as the value function for the associated fluid control problem. Two value functions are found: one with the optimal fluid policy, and the other taken with the LBFS fluid policy. Both choices lead to remarkably fast convergence. Surprisingly, the "suboptimal" choice using LBFS leads to the fastest convergence of J(w_n) to the optimal cost η∗ ≈ 10.9.

While the function V given in (5.1) is not easily computable in general, if the fluid limit model L is purely deterministic then we may use the limiting function

$$V(x) = b|x|^2 \int_0^\infty \bigl|\varphi(\tau)\bigr|\,d\tau, \qquad \varphi(0) = \frac{x}{|x|}. \tag{5.2}$$
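As a quick illustration of (5.2) (our own worked example, not from the paper), consider the single queue of example 1 served at its full rate µ1 + µ2 under the non-idling law. The fluid path drains linearly, so

$$\varphi(t) = \max\bigl(0,\ 1 - (\mu_1+\mu_2-\lambda)t\bigr), \qquad \int_0^\infty \varphi(\tau)\,d\tau = \frac{1}{2(\mu_1+\mu_2-\lambda)},$$

and (5.2) gives V(x) = b x²/(2(µ1 + µ2 − λ)), a quadratic as required by theorem 3.1(i); with b = 2 this is exactly the initialization V0(x) = x²/(µ1 + µ2 − λ) used in example 1.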

We can also obtain an approximation to (5.1) based upon a finite dimensional linear program [16]. If a sufficiently tight approximation to (5.1) is found then one can expect that (3.4) will hold for this approximation. We illustrate this approach with the three-buffer example illustrated in figure 1. We have computed explicitly the function V in (5.2) for two policies: LBFS, and the optimal policy illustrated in figure 2.

Two experiments were performed to compare the performance of the VIA initialized with these two value functions. The results from the two experiments are shown in figure 6. For comparison, data from the standard VIA shown earlier is also given. We have again taken 300 steps of value iteration, saving data for n = 10, ..., 300. The parameter values (λ, µ1, µ2, µ3) are again defined in (3.5). The convergence is exceptionally fast in both experiments. Surprisingly, the "suboptimal" choice using LBFS leads to the fastest convergence.

6. Initialization through a linear program

The computation of (5.2) is currently possible only for very simple models. We present here an alternative initialization of the VIA based upon the stability LP of [18], which is computationally feasible even for networks with hundreds of buffers.

Since the relative value function for an optimal policy is known to be equivalent to a quadratic, it is natural to attempt to find a quadratic form which gives the required negative drift. Let I denote the set of ordered pairs I = {(i,j): 1 ≤ i, j ≤ K}. For I ∈ I and α a vector of dimension |I| = K² we define

$$h_I(x) = x_i x_j, \qquad h_\alpha(x) = \sum_{I\in\mathcal{I}} \alpha_I h_I(x).$$

Given a state feedback law w, let z = z(x) denote the vector in R^{|I|} whose Ith component is given by z_I = w_i(x)x_j, I ∈ I. For any non-idling policy w, the cost can be expressed in terms of the vector z as |x| = c^T z, where c ∈ R^{|I|} is the vector whose Ith component is defined as c_I = 𝟙(s(i) = s(j)).

For any I there exist vectors c^I ∈ R^{|I|}, γ^I ∈ R^K and a constant B^I such that

$$P_w h_I(x) = h_I(x) - c^{I\,\mathsf{T}}z + \gamma^{I\,\mathsf{T}}w(x) + B^I.$$

The vector c^I is defined as follows. Let e^I be the vector that is zero except for a one in the I = (i,j)th position, so that e^{I T}z = z_I. Then for j = i, j = i + 1 and j > i + 1, the vector c^I = c^{ij} is given by

$$2\mu_{i-1}e^{i-1,i} - 2\mu_i e^{ii},$$
$$\mu_{i-1}e^{i-1,i+1} - \mu_i e^{i,i+1} + \mu_i e^{ii} - \mu_{i+1}e^{i+1,i}, \quad\text{and}$$
$$\mu_{i-1}e^{i-1,j} - \mu_i e^{ij} + \mu_{j-1}e^{j-1,i} - \mu_j e^{ji},$$

respectively. In addition, the non-idling constraint may be expressed as a linear inequality constraint on the variables z through the expression

$$\sum_{i:\, s(i)=s(k)} z_{i,k} \ge \sum_{i:\, s(i)=\sigma} z_{i,k}, \qquad 1 \le \sigma \le d,\ 1 \le k \le K.$$

Letting d^{σ,k} denote the vector whose (i,j)th entry is defined as

$$d^{\sigma,k}_{i,j} = \begin{cases} -1 & \text{if } s(i) = s(j) \text{ and } j = k,\\ +1 & \text{if } s(i) = \sigma \text{ and } j = k,\\ 0 & \text{otherwise,} \end{cases}$$

this constraint may be expressed as d^{σ,k T}z ≤ 0, 1 ≤ σ ≤ d, 1 ≤ k ≤ K. For α ∈ R^{K²}_+ and β ∈ R^{dK}_+ define the two vectors

$$c^\alpha := \sum_{I\in\mathcal{I}} \alpha_I c^I, \qquad d^\beta := \sum_{\sigma=1}^{d}\sum_{k=1}^{K} \beta_{\sigma,k}\, d^{\sigma,k}.$$


The stability LP of [18] can be interpreted as follows: Find α ∈ R^{K²}_+, β ∈ R^{dK}_+ such that

$$c^\alpha + d^\beta \ge c, \tag{6.1}$$

where inequalities between vectors are interpreted component-wise. If this lower bound holds, then we obtain the desired Lyapunov drift inequality

$$P_w h_\alpha(x) \le h_\alpha(x) - \bigl(c^\alpha + d^\beta\bigr)^{\mathsf{T}}z + \gamma^{\alpha\,\mathsf{T}}w(x) + B^\alpha \le h_\alpha(x) - |x| + \gamma^{\alpha\,\mathsf{T}}w(x) + B^\alpha,$$

where γ^α = Σ_I α_I γ^I, B^α = Σ_I α_I B^I.

There can be many values of α, β satisfying the lower bound (6.1). To find a value which is best in an average sense, first note that by the Comparison Theorem A.1 we have the steady state bound

$$\mathsf{E}_{\pi_w}\bigl[|x|\bigr] \le \mathsf{E}_{\pi_w}\bigl[c^{\alpha\,\mathsf{T}}z\bigr] \le \mathsf{E}_{\pi_w}\bigl[\gamma^{\alpha\,\mathsf{T}}w\bigr] + B^\alpha. \tag{6.2}$$

The right hand side is computed in the construction of the performance LP of [17,26], giving the following formula:

$$\mathsf{E}_{\pi_w}\bigl[\gamma^{\mathsf{T}}_{ij}w\bigr] + B_{ij} = \begin{cases} -2\lambda & \text{if } i = j,\\ \lambda & \text{if } j = i+1,\\ 0 & \text{if } j > i+1. \end{cases}$$

A natural choice of (α, β) is a minimizer of the right hand side of (6.2), subject to the constraint that c^α + d^β ≥ c, which gives the best upper bound E_{πw}[|x|] ≤ γ^{αT}E_{πw}[w] + B^α over all such (α, β). This strong connection between the existence of a Lyapunov function, and the existence of a bound on steady state performance, is precisely the principle of duality established in [19].

If vectors (α, β) exist which satisfy these constraints, then the simultaneous Lyapunov condition of Hordijk is also satisfied [15]. Unfortunately, in many examples of interest it is not possible to find a single quadratic Lyapunov function suitable for all policies. One such example is the three-buffer example given in figure 1. For this example, the stability LP of [18] is not feasible for certain service rates, even though the load condition (3.2) is satisfied.

However, if the feedback law w is specified, then it is often possible to relax some of the constraints on α. For example, for the three-buffer model under the LBFS policy we have z_{13} = w_1(x)x_3 = 0, so that the constraint on c^α is relaxed to

$$\bigl(c^\alpha + d^\beta\bigr)_I \ge c_I, \qquad I \ne (1,3).$$

The stability LP was run for this specific example, maintaining the earlier system parameters (3.5), giving the following Lyapunov function for the model:

$$h_{\alpha_1}(x) = x^{\mathsf{T}}Q_1 x, \qquad Q_1 = \begin{pmatrix} 15.9231 & 9.0000 & 9.0000\\ 9.0000 & 9.0000 & 0\\ 9.0000 & 0 & 7.5000 \end{pmatrix}. \tag{6.3}$$


Figure 7. Convergence of the VIA with V0 taken as a quadratic. The quadratic h_{α1} was found using the stability LP, and the second quadratic h_{α2} was found through direct calculation. In both cases we see relatively fast convergence of J(w_n) to the optimal cost η∗ ≈ 10.9.

Another quadratic which solves (3.4) is

$$h_{\alpha_2}(x) = x^{\mathsf{T}}Q_2 x, \qquad Q_2 = \begin{pmatrix} 55.9895 & 31.6456 & 24.3439\\ 31.6456 & 31.6456 & 0\\ 24.3439 & 0 & 20.8858 \end{pmatrix}. \tag{6.4}$$

The two matrices are almost multiples of one another, Q2 ≈ 3.5 Q1, and in fact the latter choice gives a correspondingly worse upper bound on E[|x|] through the inequality (6.2).
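A simple way to gain confidence in such a quadratic is to test the drift inequality (3.4) numerically on a box of states. The sketch below is our own check for Q1 under the LBFS law; the encoding of the three-buffer routing and of LBFS is an assumption of ours, not taken from the paper.

```python
import numpy as np

Q1 = np.array([[15.9231, 9.0, 9.0],
               [9.0,     9.0, 0.0],
               [9.0,     0.0, 7.5]])                  # the matrix in (6.3)
lam, mu = 0.1429, np.array([0.3492, 0.1587, 0.3492])  # rates from (3.5)

def h(x):
    return x @ Q1 @ x

def lbfs(x):
    """LBFS: buffer 3 has priority over buffer 1 at station 1."""
    a3 = int(x[2] > 0)
    return np.array([int(x[0] > 0 and a3 == 0), int(x[1] > 0), a3])

def drift(x):
    """P_w h (x) - h(x) for the three-buffer network under LBFS."""
    e = np.eye(3, dtype=int)
    moves = [e[0], e[1] - e[0], e[2] - e[1], -e[2]]   # arrival, 1->2, 2->3, exit
    rates = np.append(lam, mu * lbfs(x))
    return sum(r * (h(x + m) - h(x)) for r, m in zip(rates, moves))

# estimate the constant eta in P_w h <= h - |x| + eta over a box of states
eta = max(drift(np.array(s)) + sum(s) for s in np.ndindex(25, 25, 25))
```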

Two experiments were performed to compare the performance of the VIA initialized with these two quadratic Lyapunov functions. The results are shown in figure 7. The speed of convergence is not as fast as what was found using the fluid model to construct V0, but we see in figure 7 that it is significantly faster than the standard algorithm.

7. Conclusions

We have seen that one can obtain excellent steady state performance, even when the time horizon is very short compared with the size of the state space, by adding an appropriate penalty term to the finite horizon cost criterion used in the VIA. For the network scheduling problem two approaches have been described for constructing a useful penalty term. That based upon a fluid model gives the best results in the examples considered, but the simpler approach based upon a quadratic approximation also performs well. Either approach can potentially be used in the online optimization of a large manufacturing system.

For other control problems it will be necessary to have some understanding of the right form for the optimal relative value function in order to initialize the algorithm. One general approach is to apply one step of policy iteration, since the resulting relative value function is a Lyapunov function satisfying (A1) under mild conditions on the process (see [23]).

It is likely that the modified policy iteration algorithm of [27] can be analyzed using a modification of the proof given in the appendix. We are also considering sample-path based on-line optimization methods using the approaches introduced here. Some initial progress is reported in [3].

Extensions to risk sensitive optimal control are developed in [4].

Appendix A. Some stability theory for Markov chains

Here we collect together some general results on Poisson's equation and regularity for a countable state space Markov chain with transition law P and resolvent

$$K = \sum_{i=0}^{\infty} 2^{-(i+1)}P^i.$$

Suppose that c is a function on X with c ≥ 1. The Markov chain is called c-regular if for some θ ∈ X and every x ∈ X,

$$\mathsf{E}_x\Bigl[\sum_{i=0}^{\tau_\theta-1} c\bigl(\Phi(i)\bigr)\Bigr] < \infty,$$

where the first entrance time and first return time to the point θ are defined respectively as σ_θ = min(t ≥ 0: Φ(t) = θ) and τ_θ = min(t ≥ 1: Φ(t) = θ).

A c-regular chain always possesses a unique invariant probability π such that

$$\pi(c) := \int c(x)\,\pi(dx) < \infty.$$

A set S ⊂ X is called petite if there is a δ > 0 and θ ∈ X such that

$$K(x,\theta) \ge \delta, \qquad x \in S. \tag{A.1}$$

If the chain is irreducible in the usual sense then every finite set is petite.

The connections between Poisson's equation, the Lyapunov drift (1.3), and regularity are largely based upon the following general result, which is a minor generalization of the Comparison Theorem of [25].

Theorem A.1 (Comparison Theorem). Let Φ be a Markov chain on X satisfying the drift inequality PV ≤ V − c + s. The functions s and c take values in R+, and the function V takes values in [0, ∞] with V(x0) < ∞ for at least one x0 ∈ X. Then for any stopping time τ,

$$\mathsf{E}_x\bigl[V\bigl(\Phi(\tau)\bigr)\bigr] + \mathsf{E}_x\Bigl[\sum_{t=0}^{\tau-1} c\bigl(\Phi(t)\bigr)\Bigr] \le V(x) + \mathsf{E}_x\Bigl[\sum_{t=0}^{\tau-1} s\bigl(\Phi(t)\bigr)\Bigr].$$

The following result is a consequence of the f -Norm Ergodic Theorem of [25].

Theorem A.2. Suppose that the conditions of the Comparison Theorem A.1 hold, c ≥ 1, s is a constant, and the set S = {x: c(x) ≤ 2s} is petite. Then Φ is a c-regular Markov chain with a unique invariant probability π. The function V satisfies V(x) < ∞ whenever π(x) > 0.

The proof of the following result is identical to that of [25, theorem 11.3.11]. For any probability distribution a on Z+ define the generalization of the resolvent kernel

$$K_a(x,y) = \sum_{n=0}^{\infty} a(n)P^n(x,y), \qquad x, y \in X.$$

We denote the mean of a by m(a) = Σ_n n a(n).

Lemma A.3. For the Markov chain Φ on X, suppose that there is a set S ⊆ X, a point θ ∈ X, and a constant δ > 0 such that K_a(x,θ) ≥ δ for x ∈ S. Then for all x,

$$\mathsf{E}_x\Bigl[\sum_{t=0}^{\tau_\theta-1} \mathbf{1}_S\bigl(\Phi(t)\bigr)\Bigr] \le \frac{m(a)}{\delta}.$$

Suppose that Φ is a c-regular Markov chain with invariant probability π, and denote η = π(c). The Poisson equation is defined as

$$Ph = h - c + \eta, \tag{A.2}$$

where h : X → R. The computation of h in the finite state space case involves a simple matrix inversion which can be generalized to the present setting provided that the chain is c-regular.

Given a function s : X → [0,1], and a probability ν on X, the kernel s ⊗ ν : X × X → [0,1] is defined as the product s ⊗ ν(x,y) = s(x)ν(y), x, y ∈ X. Letting ν denote the point-mass at θ, and s = δ𝟙_S, the minorization condition (A.1) may be expressed as K ≥ s ⊗ ν. Letting G denote the kernel

$$G = \sum_{t=0}^{\infty} (K - s\otimes\nu)^t,$$

a solution to Poisson's equation may be explicitly written as

$$h(x) = GK\bar{c}(x) = \sum_{i=0}^{\infty} (K - s\otimes\nu)^i K\bar{c}(x), \tag{A.3}$$

where c̄(x) = c(x) − π(c), provided the sum is absolutely convergent [11,23].
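In the finite state space case the kernels above are matrices and (A.3) can be evaluated directly. The following is our own finite-state sketch, assuming the chain is irreducible so that the truncated kernel K − s ⊗ ν has spectral radius below one and the geometric series for G converges.

```python
import numpy as np

def poisson_solution(P, c, theta, S, delta):
    """Evaluate h = G K cbar as in (A.3) for a finite chain."""
    n = P.shape[0]
    K = 0.5 * np.linalg.inv(np.eye(n) - 0.5 * P)   # resolvent kernel
    s = np.zeros(n); s[list(S)] = delta            # s = delta * 1_S
    nu = np.zeros(n); nu[theta] = 1.0              # point mass at theta
    evals, evecs = np.linalg.eig(P.T)              # stationary distribution pi
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    cbar = c - pi @ c                              # centered cost
    G = np.linalg.inv(np.eye(n) - (K - np.outer(s, nu)))
    return G @ K @ cbar
```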

The paper [11] uses these ideas to establish the following sufficient condition for the existence of suitably bounded solutions to Poisson's equation. Define the set S by

$$S = \bigl\{x: Kc(x) \le \pi(c)\bigr\}. \tag{A.4}$$

If the chain is positive recurrent we have π(S) > 0.

Theorem A.4. Suppose that the Markov chain Φ is positive recurrent. Assume further that η = ∫ c(x)π(dx) < ∞, and that the set S defined in (A.4) is petite. Then there exists a solution h to Poisson's equation (A.2) which is finite for every x ∈ X satisfying π(x) > 0, and is bounded from below everywhere:

$$\inf_{x\in X} h(x) > -\infty.$$

If Φ is also c-regular then h can be taken as (A.3), which satisfies the bound

$$h(x) \le d_1\, \mathsf{E}_x\Bigl[\sum_{t=0}^{\tau_\theta-1} c\bigl(\Phi(t)\bigr)\Bigr], \qquad x \in X,$$

where d1 is a finite constant.

Uniqueness of the solution to Poisson's equation is established in [23] using the previous lower bound.

Theorem A.5. Suppose that the Markov chain Φ is positive recurrent, that η = π(c) < ∞, and assume that S defined in (A.4) is petite. Let g be finite-valued, bounded from below, and satisfy

$$Pg \le g - c + \eta.$$

Then Φ is c-regular and for some constant b,

(i) g(x) = GKc̄(x) + b for almost every x ∈ X [π],

(ii) g(x) ≥ GKc̄(x) + b for every x ∈ X.

Closely related is the following:

Lemma A.6. Suppose that Φ is c-regular with invariant probability π, and suppose that z : X → R is bounded from below, and is superharmonic: Pz ≤ z. Then

(i) z(x) = π(z) for almost every x ∈ X [π],

(ii) z(x) ≥ π(z) for every x ∈ X.


Appendix B. Convergence of the value iteration algorithm

We present here proofs of our main results. Throughout this section we assume that (A1)–(A3) are satisfied, even when this is not mentioned explicitly.

Central to our analysis are the incremental cost g_n and function h_n defined in (2.3). In the standard version of the VIA where V0 = 0, the functions {g_n: n ∈ Z+} are positive-valued for each n, but may be unbounded. In the present case we find that the opposite situation arises. When V0 is a Lyapunov function for some policy, the functions {g_n} are strictly bounded from above, but may be unbounded from below. This is a desirable situation since an upper bound on the sequence {g_n: n ∈ Z+} permits us to conclude that each of the stationary policies {w_n: n ∈ Z+} is regular. These results are summarized in proposition B.3. We first require the following two lemmas.

Lemma B.1. Suppose that for the state feedback law w there exists a solution V : X → R+ to the inequality

$$P_w V(x) \le V(x) - |x| + \eta, \qquad x \in X. \tag{B.1}$$

Then the controlled chain has the following properties:

(i) The feedback law is regular, and hence the controlled chain has a unique invariant probability π_w.

(ii) There exists a constant B1 depending only on η and δ such that

$$\mathsf{E}^w_x\Bigl[\sum_{t=0}^{\tau_\theta-1} c\bigl(\Phi(t)\bigr)\Bigr] \le B_1\bigl(V(x) + 1\bigr), \qquad x \in X.$$

(iii) η_w = π_w(c_w) ≤ η.

(iv) π_w(θ) ≥ δπ_w(S0) > 0.

(v) V(x) − V(θ) ≥ −η/δ, x ∈ X.

Proof. From (B.1) and the definition of S_0 we obtain the inequality

P_w V(x) ≤ V(x) − ½ c(x) + η 1_{S_0}(x),

where c(x) = |x|. Applying the Comparison Theorem A.1 then gives

½ E^w_x[ ∑_{t=0}^{τ_θ−1} c(Φ(t)) ] ≤ V(x) − V(θ) + η E^w_x[ ∑_{t=0}^{τ_θ−1} 1_{S_0}(Φ(t)) ]. (B.2)

From (A3) the minorization condition (B.3) holds for K_w:

K_w(x, θ) ≥ δ 1_{S_0}(x),   x ∈ X. (B.3)


Applying lemma A.3 then gives

E^w_x[ ∑_{t=0}^{τ_θ−1} c(Φ(t)) ] ≤ 2V(x) + 2η/δ.

This proves (ii) with B_1 = 2 + 2η/δ. Results (i) and (iii) follow immediately from (ii) and the Comparison Theorem.

To prove (iv) observe that π_w(c) ≤ η_w ≤ η. Hence the sublevel set S_0 must have positive π_w-measure. From the inequality (B.3) we can invoke invariance to show that π_w(θ) ≥ δ π_w(S_0) > 0.

Finally, (v) also follows from (B.2) and lemma A.3:

0 ≤ V(x) − V(θ) + η E^w_x[ ∑_{t=0}^{τ_θ−1} 1_{S_0}(Φ(t)) ] ≤ V(x) − V(θ) + η/δ. □
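As a concrete check of the drift inequality (B.1), the sketch below verifies it numerically for a single queue served at a fixed rate, using a quadratic test function; the chain and all its parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch: verify the drift inequality (B.1), P_w V <= V - |x| + eta, for a
# single queue with arrival rate lam and fixed service rate mu (an
# illustrative chain, not taken from the paper).
N, lam, mu = 200, 0.3, 0.5
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    up = lam if i < N else 0.0
    down = mu if i > 0 else 0.0
    P[i, min(i + 1, N)] += up
    P[i, max(i - 1, 0)] += down
    P[i, i] += 1.0 - up - down

x = np.arange(N + 1, dtype=float)
V = x ** 2 / (2.0 * (mu - lam))          # quadratic Lyapunov candidate
eta = (lam + mu) / (2.0 * (mu - lam))    # a valid constant for this chain
drift = P @ V - V
# (B.1) holds with equality at interior states; check away from the
# truncation boundary.
assert np.all(drift[:-1] <= -x[:-1] + eta + 1e-9)
print("max of (P V - V + |x|):", np.max(drift[:-1] + x[:-1]), " eta =", eta)
```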

Let η̄_n = sup_{x∈X} g_n(x) and η̲_n = inf_{x∈X} g_n(x).

Lemma B.2. For each n ∈ Z_+,

(i) P_{n+1} g_n(x) ≤ g_{n+1}(x),

(ii) η̲_n ≤ η̲_{n+1},

(iii) g_{n+1}(x) ≤ P_n g_n(x),

(iv) η̄_{n+1} ≤ η̄_n ≤ η.

Proof. Result (i) follows from the bound V_{n+1} = P_n V_n + c_n ≤ P_{n+1} V_n + c_{n+1}, as shown here:

g_{n+1} = V_{n+2} − V_{n+1} ≥ V_{n+2} − (P_{n+1} V_n + c_{n+1}) = P_{n+1} V_{n+1} + c_{n+1} − P_{n+1} V_n − c_{n+1} = P_{n+1} g_n.

To prove (ii), we apply (i) and the definition of η̲_n:

η̲_{n+1} = inf_{x∈X} g_{n+1}(x) ≥ inf_{x∈X} P_{n+1} g_n(x) ≥ inf_{y∈X} g_n(y) = η̲_n.

We now prove (iii). First observe that

P_n V_{n+1} = P_n(V_n + g_n) = P_n V_n + P_n g_n = V_{n+1} − c_n + P_n g_n.

From the definition of {V_n} we then have

V_{n+2} = P_{n+1} V_{n+1} + c_{n+1} ≤ P_n V_{n+1} + c_n = V_{n+1} + P_n g_n,

from which the result follows. Result (iv) then follows immediately, as in (ii). □

We may now establish the desired stability properties of the VIA under (A1)–(A3).


Proposition B.3. The policy w_n satisfies, for each n:

(i) The following identity holds for all x:

P_n V_n(x) = V_n(x) − c_n(x) + g_n(x). (B.4)

(ii) The sequence {g_n : n ∈ Z_+} is uniformly bounded from above:

sup_{x∈X} g_n(x) ≤ η,   n ∈ Z_+. (B.5)

(iii) The chain Φ_n is c_n-regular, and there exists a constant B_1 depending only on δ and η such that for each n,

E^n_x[ ∑_{t=0}^{τ_θ−1} c_n(Φ(t)) ] ≤ B_1 (V_n(x) + 1),   x ∈ X.

(iv) The stationary policy w_n is regular with unique invariant probability π_n, and the invariant probability satisfies

J(w_n) = π_n(c_n) ≤ η̄_n.

Proof. Result (i) is essentially the definition of V_n, g_n: for each n,

P_n V_n = V_{n+1} − c_n = V_n − c_n + (V_{n+1} − V_n) = V_n − c_n + g_n.

Result (ii) follows from lemma B.2, and (iii) directly from lemma B.1. Result (iv) follows from (ii), (iii), and the Comparison Theorem applied to (B.4). □
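The conclusions of lemma B.2 and proposition B.3 are easy to observe numerically. The sketch below runs the VIA for a controlled single-server queue (two service rates, with a surcharge for the fast one; the model and every parameter are illustrative assumptions), initialized with a quadratic stochastic Lyapunov function for the policy that always serves fast. It checks the uniform upper bound (B.5) and the monotonicity of η̄_n and η̲_n.

```python
import numpy as np

# Sketch: value iteration for a controlled single-server queue, checking
# (B.5) and the monotone bounds of lemma B.2.  The model and all
# parameters are illustrative assumptions, not taken from the paper.
N, lam, mu1, mu2, kappa = 60, 0.3, 0.35, 0.6, 0.5

def kernel(a):
    """Uniformized birth-death kernel with service rate a."""
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        up = lam if i < N else 0.0
        down = a if i > 0 else 0.0
        P[i, min(i + 1, N)] += up
        P[i, max(i - 1, 0)] += down
        P[i, i] += 1.0 - up - down
    return P

x = np.arange(N + 1, dtype=float)
Ps = [kernel(mu1), kernel(mu2)]
costs = [x, x + kappa]      # c(x, a) = |x|, plus a surcharge for fast service

# V0: stochastic Lyapunov function for the "always fast" policy w0, so
# that P_{w0} V0 <= V0 - c_{w0} + eta.
V0 = x ** 2 / (2.0 * (mu2 - lam))
eta = (lam + mu2) / (2.0 * (mu2 - lam)) + kappa

V, sup_g, inf_g = V0.copy(), [], []
for n in range(400):
    Vnext = np.minimum(costs[0] + Ps[0] @ V, costs[1] + Ps[1] @ V)
    g = Vnext - V                       # g_n = V_{n+1} - V_n
    sup_g.append(g.max())
    inf_g.append(g.min())
    V = Vnext

assert max(sup_g) <= eta + 1e-8                                # (B.5)
assert all(a >= b - 1e-10 for a, b in zip(sup_g, sup_g[1:]))   # eta_bar_n decreasing
assert all(a <= b + 1e-10 for a, b in zip(inf_g, inf_g[1:]))   # eta_underbar_n increasing
print("g_n(0) after 400 iterations (approximately eta*):", g[0])
```

Initializing the same code with V_0 = 0 instead produces nonnegative g_n, but no uniform upper bound tied to a Lyapunov inequality; this is the contrast drawn at the beginning of this appendix.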

An application of this proposition and lemma B.1 gives a lower bound on the sequences {g_n}, {h_n}:

Lemma B.4. For all n ∈ Z_+,

g_n(θ) ≥ −η/δ,   h_n(x) ≥ −η/δ,   x ∈ X.

Proof. The lower bound on h_n follows immediately from lemma B.1 and proposition B.3. We then have

−η/δ ≤ P_n h_n(θ) = h_n(θ) − c_n(θ) + g_n(θ) ≤ g_n(θ). □

These bounds can now be used to establish a uniform upper bound on {hn}.

Lemma B.5. There is a finite constant B_2, independent of n, k or x, such that

h_n(x) ≤ B_2 (V_k(x) + 1),   0 ≤ k ≤ n, x ∈ X.


Proof. It is enough to prove the result for k = 0, since we may treat the kth step of the algorithm as a new starting point. We have from minimality of V_n, for any n ∈ Z_+,

V_n(x) ≤ E^0_x[ ( ∑_{t=0}^{n−1} c_0(Φ(t)) + V_0(Φ(n)) ) 1(τ_θ > n) ] + E^0_x[ ( ∑_{t=0}^{τ_θ−1} c_0(Φ(t)) + V_{n−τ_θ}(θ) ) 1(τ_θ ≤ n) ],

where E^0 is the expectation operator obtained with the policy w_0. Subtracting V_n(θ) from both sides then gives

h_n(x) ≤ E^0_x[ ( ∑_{t=0}^{n−1} c_0(Φ(t)) + V_0(Φ(n)) ) 1(τ_θ > n) ] + E^0_x[ ∑_{t=0}^{τ_θ−1} c_0(Φ(t)) ] + E^0_x[ ( V_{n−τ_θ}(θ) − V_n(θ) ) 1(τ_θ ≤ n) ]. (B.6)

We now proceed to bound each of these terms. First, letting L_n(x) denote the first term on the right hand side of (B.6), we have

L_n(x) := E^0_x[ ( ∑_{t=0}^{n−1} c_0(Φ(t)) + V_0(Φ(n)) ) 1(τ_θ > n) ]
        = E^0_x[ ( ∑_{t=0}^{n−1} c_0(Φ(t)) + E^0[ V_0(Φ(n)) | Φ(0), ..., Φ(n−1) ] ) 1(τ_θ > n) ]
        ≤ E^0_x[ ( ∑_{t=0}^{n−1} c_0(Φ(t)) + V_0(Φ(n−1)) − c_0(Φ(n−1)) + η ) 1(τ_θ > n) ]
        ≤ L_{n−1}(x) + η P^0_x(τ_θ > n).

Hence, by iteration we have for all n and x,

L_n(x) ≤ L_0(x) + η E^0_x[τ_θ] = V_0(x) + η E^0_x[τ_θ].

Proposition B.3(iii) combined with this inequality then gives L_n(x) ≤ V_0(x) + B_1(V_0(x) + 1).

The second term in (B.6) is also bounded using proposition B.3(iii):

E^0_x[ ∑_{t=0}^{τ_θ−1} c_0(Φ(t)) ] ≤ B_1(V_0(x) + 1).


To bound the final term, note that by lemma B.4, for any n ≥ τ_θ,

V_{n−τ_θ}(θ) − V_n(θ) = −∑_{k=n−τ_θ}^{n−1} g_k(θ) ≤ (η/δ) τ_θ.

The third term on the right hand side of (B.6) is thus again bounded by proposition B.3(iii):

E^0_x[ ( V_{n−τ_θ}(θ) − V_n(θ) ) 1(τ_θ ≤ n) ] ≤ (η/δ) E^0_x[τ_θ] ≤ (η/δ) B_1(V_0(x) + 1).

Thus each of the expectations on the right hand side of (B.6) is bounded as desired, with B_2 = 1 + (2 + η/δ)B_1. □
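Lemma B.5 can likewise be observed empirically: along the VIA, the relative value functions h_n = V_n − V_n(θ) stay dominated by a fixed multiple of V_0 + 1. A minimal sketch, on the same illustrative controlled queue as in the previous sketch:

```python
import numpy as np

# Sketch: empirical check of lemma B.5, h_n(x) <= B2 (V0(x) + 1), along
# the VIA for the illustrative controlled queue used earlier.
N, lam, mu1, mu2, kappa = 60, 0.3, 0.35, 0.6, 0.5
x = np.arange(N + 1, dtype=float)

def kernel(a):
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        up = lam if i < N else 0.0
        down = a if i > 0 else 0.0
        P[i, min(i + 1, N)] += up
        P[i, max(i - 1, 0)] += down
        P[i, i] += 1.0 - up - down
    return P

Ps = [kernel(mu1), kernel(mu2)]
costs = [x, x + kappa]
V0 = x ** 2 / (2.0 * (mu2 - lam))       # Lyapunov initialization

V, ratios = V0.copy(), []
for n in range(400):
    V = np.minimum(costs[0] + Ps[0] @ V, costs[1] + Ps[1] @ V)
    h = V - V[0]                # h_n(x) = V_n(x) - V_n(theta), theta = 0
    ratios.append(np.max(h / (V0 + 1.0)))

# The ratios remain bounded by a fixed constant, an empirical stand-in
# for the constant B2 of the lemma.
print("sup_n max_x h_n(x) / (V0(x) + 1) =", max(ratios))
```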

From the optimality equations we have for all n ∈ Z_+ and x ∈ X,

P_n h_n(x) = h_{n+1}(x) − c_n(x) + g_n(θ).

This identity, together with the bounds already obtained on the sequence {h_n}, is precisely what is needed to deduce a strong form of stability for the time-inhomogeneous chain Φ^{v_n}.

Lemma B.6. There is a constant B_3 depending only on η and δ such that for all x and n,

E^{v_{n+1}}_x[n ∧ τ_θ] ≤ B_3 (V_0(x) + 1).

Proof. For n fixed denote

M(t) = h_{n+1−t}(Φ(t)) − tη + ∑_{i=0}^{t−1} c(Φ(i)),   0 ≤ t ≤ n + 1.

We show here that (M(t), F_t) is a supermartingale, with F_t = σ(Φ_0, ..., Φ_t), t ≥ 0. For each 0 ≤ t ≤ n,

E^{v_{n+1}}[ M(t+1) | F_t ]
    = E^{v_{n+1}}[ h_{n−t}(Φ(t+1)) | F_t ] − (t+1)η + ∑_{i=0}^{t} c(Φ(i))
    = P_{n−t} h_{n−t}(Φ(t)) − (t+1)η + ∑_{i=0}^{t} c(Φ(i))
    = h_{n−t+1}(Φ(t)) − c_{n−t}(Φ(t)) + g_{n−t}(θ) − (t+1)η + ∑_{i=0}^{t} c(Φ(i))
    ≤ M(t),

where we have used the bounds g_t ≤ η, c_w ≥ c. This establishes the supermartingale property. Now let τ = τ_θ ∧ n, and apply the optional stopping theorem to obtain the bound

E^{v_{n+1}}_x[ h_{n+1−τ}(Φ(τ)) + ∑_{i=0}^{τ−1} ( c(Φ(i)) − η ) ] = E^{v_{n+1}}_x[M(τ)] ≤ M(0) ≤ h_{n+1}(x) ≤ B_2 (V_0(x) + 1).

Since we also have h_k(x) ≥ −η/δ for all x and k, it follows that

E^{v_{n+1}}_x[ ∑_{i=0}^{τ−1} ( c(Φ(i)) − η ) ] ≤ B_2 (V_0(x) + 1) + η/δ.

Using the definition of S_0 we then obtain

½ E^{v_{n+1}}_x[ ∑_{i=0}^{τ−1} c(Φ(i)) ] ≤ B_2 (V_0(x) + 1) + η/δ + η E^{v_{n+1}}_x[ ∑_{i=0}^{τ−1} 1_{S_0}(Φ(i)) ].

Exactly as in the proof of lemma A.3 given in [25, theorem 11.3.11], we may deduce via assumption (A3) that

E^{v_{n+1}}_x[ ∑_{i=0}^{τ−1} 1_{S_0}(Φ(i)) ] ≤ 1/δ.

The lemma then follows with B_3 = 2(B_2 + 2η/δ). □

For x ∈ X let ḡ(x) = lim sup_{n→∞} g_n(x) and g̲(x) = lim inf_{n→∞} g_n(x).

Lemma B.7.

ḡ(x) ≤ ḡ(θ),   x ∈ X.

Proof. Let m(t) = g_{n−t+1}(Φ^{v_{n+1}}(t)). The adapted process (m(t), F_t) is a submartingale since, by lemma B.2(iii),

E^{v_{n+1}}[ m(t+1) | F_t ] = P_{n−t} g_{n−t}(Φ(t)) ≥ g_{n−t+1}(Φ(t)) = m(t).

From the optional stopping theorem we have E^{v_{n+1}}[m(τ)] ≥ m(0), or

E^{v_{n+1}}[ g_{n−τ+1}(Φ(τ)) ] ≥ g_{n+1}(x).

For any k define ḡ^k(x) = sup_{i≥k} g_i(x), so that ḡ^k(x) → ḡ(x) as k → ∞. Letting s_n denote the integer part of n/2, we then have from the previous bound

g_{n+1}(x) ≤ E^{v_{n+1}}[ ḡ^{s_n}(Φ(τ)) 1(τ_θ < s_n) ] + η E^{v_{n+1}}[ 1(τ_θ ≥ s_n) ]
         = ḡ^{s_n}(θ) P^{v_{n+1}}_x(τ_θ < s_n) + η P^{v_{n+1}}_x(τ_θ ≥ s_n).


Since

P^{v_{n+1}}_x(τ_θ ≥ s_n) ≤ (2/n) E^{v_{n+1}}_x[τ] ≤ (2/n) B_3 (V_0(x) + 1),

we may take the limit supremum of both sides with respect to n to obtain ḡ(x) ≤ ḡ(θ). □

Lemma B.8. If g_{n_i}(θ) → ḡ(θ) as i → ∞, then for any integer t,

g_{n_i−t}(θ) → ḡ(θ),   i → ∞.

Proof. The proof is slightly different from that given in [5] since we do not know if the sequence of functions {g_n} is bounded from below.

It is enough to prove the result for t = 1. By taking a further subsequence if necessary, we may assume that there is a kernel P̂ and a function ĝ such that P_{n_i−1}(x, y) → P̂(x, y) and g_{n_i−1}(x) → ĝ(x) as i → ∞. The kernel P̂ is substochastic: P̂(x, X) ≤ 1, x ∈ X. Using the inequality P_{n_i−1} g_{n_i−1}(θ) ≥ g_{n_i}(θ) and Fatou's lemma then gives

ḡ(θ) ≤ lim sup_{i→∞} ∑_{y∈X} P_{n_i−1}(θ, y) g_{n_i−1}(y) ≤ ∑_{y∈X} lim sup_{i→∞} P_{n_i−1}(θ, y) g_{n_i−1}(y) ≤ P̂ĝ(θ) ≤ P̂(θ, X) ḡ(θ),

where in the last inequality we are using the fact that θ is maximal. Fatou's lemma is applicable because {g_n} is uniformly bounded from above. It follows that P̂(θ, X) = 1 and that ĝ(y) = ḡ(θ) for every y ∈ X for which P̂(θ, y) > 0. Since P̂(θ, θ) ≥ δ by assumption, we conclude that ĝ(θ) = ḡ(θ). Since ĝ(θ) is an arbitrary limit point of the sequence {g_{n_i−1}(θ) : i ≥ 0}, the conclusion of the lemma follows. □

Lemma B.9.

(i) ḡ(x) ≤ η∗ for every x ∈ X.

(ii) lim_{n→∞} g_n(θ) = η∗.

Proof. We first prove (i). By lemma B.7 it is enough to show that ḡ(θ) ≤ η∗. We show that there exists a sequence of functions {W_t : t ≥ 0} from X to R_+ such that, for some B_4 < ∞,

W_t(x) ≤ B_4 (V_0(x) + 1),   x ∈ X, t ∈ Z_+; (B.7)

P_{w∗} W_t(x) ≥ W_{t−1}(x) − c_{w∗}(x) + ḡ(θ),   x ∈ X, t ∈ Z_+. (B.8)

Given these bounds, we then have by iteration

B_4 + B_4 P^n_{w∗} V_0(x) ≥ W_0(x) − ∑_{t=0}^{n−1} P^t_{w∗} c_{w∗}(x) + n ḡ(θ).


Dividing by n and letting n → ∞ then shows that

ḡ(θ) ≤ lim_{n→∞} (1/n) ∑_{t=0}^{n−1} P^t_{w∗} c_{w∗}(x) = η∗,

as claimed.

as claimed.To prove that such a sequence exists, first consider the inequality Pw∗hni−t (x) >

hni−t+1 − cw∗(x) + gni−t(θ). Letting W (i)t (x) = η + hni−t(x), x ∈ X, we obtain, for

each i,

Pw∗W(i)t (x) >W (i)

t−1(x)− cw∗(x) + gni−t(θ), x ∈ X, t ∈ Z+.

Assume that {ni} is chosen so that gni(θ) → g(θ) as i → ∞. Then by choosing asubsequence if necessary we may find functions {Wt} with W (i)

t → Wt pointwise ast→∞. Since the functions {ht} are bounded as desired, the inequalities (B.7), (B.8)then follow from lemma B.8 and the Dominated Convergence Theorem.

To prove (ii), consider any limit point g(θ) of the sequence {g_n(θ)}. We can assume without loss of generality that there are functions g, h on X, a feedback law w, and that there is a subsequence {m_i} of Z_+ with g_{m_i}(x) → g(x), h_{m_i}(x) → h(x), w_{m_i}(x) → w(x), i → ∞, for all x ∈ X. From Fatou's lemma we then have P_w h ≤ h − c_w + g, and from the Comparison Theorem A.1 we then have π_w(c_w) ≤ π_w(g) ≤ η∗, where the last inequality follows from (i). Since π_w(c_w) ≥ η∗ by optimality, it then follows that g(x) = η∗ for a.e. x ∈ X [π_w]. Lemma B.1(iv) completes the proof. □

Lemma B.10. Under (A1)–(A4),

h_n(x) → h∗(x) − h∗(θ),   as n → ∞.

Proof. Let h be any pointwise limit of the {h_n}. The function h is finite-valued by lemma B.5. Then using Fatou's lemma we may find a limiting feedback law w such that P_w h ≤ h − c_w + η∗. By theorem A.5 and (A4) it follows that this is an equality:

P_w h = h − c_w + η∗. (B.9)

For any a ∈ R_+ define h_a(x) = h∗(x) − h∗(θ) − a, let h_min = min(h, h_a), and define the state feedback function

w_min(x) = w(x) if h(x) ≤ h_a(x), and w_min(x) = w∗(x) otherwise.

For any x ∈ X we have, whenever h(x) ≤ h_a(x),

P_{w_min} h_min(x) ≤ P_{w_min} h(x) ≤ h(x) − c_w(x) + η∗,

and if h(x) > h_a(x),

P_{w_min} h_min(x) ≤ P_{w_min} h_a(x) = h_a(x) − c_{w∗}(x) + η∗.


Thus, for all x, P_{w_min} h_min(x) ≤ h_min(x) − c_{w_min}(x) + η∗, and by theorem A.5 and (A4) again this must be an equality:

P_{w_min} h_min(x) = h_min(x) − c_{w_min}(x) + η∗. (B.10)

Let s = h − h_min. We have s(x) ≥ 0 for all x, with s(θ) = 0 ∨ a. Moreover, from (B.10) and the ACOE for h_a, P_w s ≤ s. Thus, by lemma A.6, the function s is constant: s ≡ s(θ) = 0 ∨ a. By considering large a > 0 we conclude that h = h∗ − h∗(θ). □

References

[1] A. Arapostathis, V.S. Borkar, E. Fernandez-Gaucherand, M.K. Ghosh and S.I. Marcus, Discrete-time controlled Markov processes with average cost criterion: A survey, SIAM J. Control Optim. 31 (1993) 282–344.

[2] V.S. Borkar, Topics in Controlled Markov Chains, Pitman Research Notes in Mathematics Series, Vol. 240 (Longman Sci. Tech., Harlow, UK, 1991).

[3] V.S. Borkar and S.P. Meyn, Stability and convergence of stochastic approximation using the ODE method, to appear in SIAM J. Control Optim., and IEEE CDC (December 1998).

[4] V.S. Borkar and S.P. Meyn, Risk sensitive optimal control: Existence and synthesis for models with unbounded cost, submitted to SIAM J. Control Optim. (1998).

[5] R. Cavazos-Cadena, Value iteration in a class of communicating Markov decision chains with the average cost criterion, Technical Report, Universidad Autonoma Agraria Antonio Narro (1996).

[6] R. Cavazos-Cadena and E. Fernandez-Gaucherand, Value iteration in a class of average controlled Markov chains with unbounded costs: Necessary and sufficient conditions for pointwise convergence, in: Proc. of the 34th IEEE Conf. on Decision and Control, New Orleans, LA (1995) pp. 2283–2288.

[7] H. Chen and A. Mandelbaum, Discrete flow networks: Bottleneck analysis and fluid approximations, Math. Oper. Res. 16 (1991) 408–446.

[8] H. Chen and H. Zhang, Stability of multiclass queueing networks under priority service disciplines, Technical Note (1996).

[9] J.G. Dai, On the positive Harris recurrence for multiclass queueing networks: A unified approach via fluid limit models, Ann. Appl. Probab. 5 (1995) 49–77.

[10] J.G. Dai and S.P. Meyn, Stability and convergence of moments for multiclass queueing networks via fluid limit models, IEEE Trans. Automat. Control 40 (November 1995) 1889–1904.

[11] P.W. Glynn and S.P. Meyn, A Lyapunov bound for solutions of Poisson's equation, Ann. Probab. 24 (April 1996).

[12] J.M. Harrison, The BIGSTEP approach to flow management in stochastic processing networks, in: Stochastic Networks: Theory and Applications, eds. F.P. Kelly, S. Zachary and I. Ziedins (Clarendon Press, Oxford, 1996) pp. 57–89.

[13] J.M. Harrison and L.M. Wein, Scheduling networks of queues: Heavy traffic analysis of a simple open network, Queueing Systems 5 (1989) 265–280.

[14] O. Hernandez-Lerma and J.B. Lasserre, Discrete Time Markov Control Processes I (Springer, New York, 1996).

[15] A. Hordijk, Dynamic Programming and Markov Potential Theory (1977).

[16] J. Humphrey, D. Eng and S.P. Meyn, Fluid network models: Linear programs for control and performance bounds, in: Proc. of the 13th IFAC World Congress, Vol. B, eds. J. Cruz, J. Gertler and M. Peshkin, San Francisco, CA (1996) pp. 19–24.

[17] S. Kumar and P.R. Kumar, Performance bounds for queueing networks and scheduling policies, IEEE Trans. Automat. Control 39 (August 1994) 1600–1611.

[18] P.R. Kumar and S.P. Meyn, Stability of queueing networks and scheduling policies, IEEE Trans. Automat. Control 40(2) (1995) 251–260.

[19] P.R. Kumar and S.P. Meyn, Duality and linear programs for stability and performance analysis of queueing networks and scheduling policies, IEEE Trans. Automat. Control 41(1) (1996) 4–17.

[20] P.R. Kumar and T.I. Seidman, Dynamic instabilities and stabilization methods in distributed real-time scheduling of manufacturing systems, IEEE Trans. Automat. Control 35(3) (1990) 289–298.

[21] L.F. Martins, S.E. Shreve and H.M. Soner, Heavy traffic convergence of a controlled, multiclass queueing system, SIAM J. Control Optim. 34(6) (1996) 2133–2171.

[22] S.P. Meyn, The policy improvement algorithm: General theory with applications to queueing networks and their fluid models, in: 35th IEEE Conf. on Decision and Control, Kobe, Japan (December 1996).

[23] S.P. Meyn, The policy improvement algorithm for Markov decision processes with general state space, IEEE Trans. Automat. Control 42 (1997) 191–196.

[24] S.P. Meyn, Stability and optimization of multiclass queueing networks and their fluid models, Lectures in Applied Mathematics, Vol. 33 (Amer. Math. Soc., Providence, RI, 1997) pp. 175–199.

[25] S.P. Meyn and R.L. Tweedie, Markov Chains and Stochastic Stability (Springer, London, 1993).

[26] I.C. Paschalidis, D. Bertsimas and J.N. Tsitsiklis, Scheduling of multiclass queueing networks: Bounds on achievable performance, in: Workshop on Hierarchical Control for Real-Time Scheduling of Manufacturing Systems, Lincoln, NH (October 16–18, 1992).

[27] M.L. Puterman, Markov Decision Processes (Wiley, New York, 1994).

[28] A.N. Rybko and A.L. Stolyar, On the ergodicity of stochastic processes describing the operation of open queueing networks, Problemy Peredachi Informatsii 28 (1992) 3–26.

[29] L.I. Sennott, A new condition for the existence of optimal stationary policies in average cost Markov decision processes, Oper. Res. Lett. 5 (1986) 17–23.

[30] L.I. Sennott, The convergence of value iteration in average cost Markov decision chains, Oper. Res. Lett. 19 (1996) 11–16.