
Draft. Do not distribute

Robust Solutions to Markov Decision Problems with Uncertain Transition Matrices

Arnab Nilim, [email protected] (Corresponding Author)
Laurent El Ghaoui, [email protected]

Department of Electrical Engineering and Computer Sciences
University of California

Berkeley, CA 94720, USA

Keywords: Dynamic programming: Markov, finite state, game theory, programming: convex, uncertainty, robustness, statistics: estimation.


Abstract

Optimal solutions to Markov decision problems may be very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of these probabilities is far from accurate. Hence, estimation errors are limiting factors in applying Markov decision processes to real-world problems.

We consider a robust control problem for a finite-state, finite-action Markov decision process, where uncertainty on the transition matrices is described in terms of possibly non-convex sets. We show that perfect duality holds for this problem, and that as a consequence, it can be solved with a variant of the classical dynamic programming algorithm, the “robust dynamic programming” algorithm. We show that a particular choice of the uncertainty sets, involving likelihood regions or entropy bounds, leads to both a statistically accurate representation of uncertainty, and a complexity of the robust recursion that is almost the same as that of the classical recursion. Hence, robustness can be added at practically no extra computing cost. We derive similar results for other uncertainty sets, including one with a finite number of possible values for the transition matrices.

We describe, in a practical path planning example, the benefits of using a robust strategy instead of the classical optimal strategy; even if the uncertainty level is only crudely guessed, the robust strategy yields a much better worst-case expected travel time.


Contents

1 Introduction

2 Problem Setup
  2.1 The nominal problem
  2.2 The robust control problem
  2.3 Main results and outline

3 Finite-Horizon Problem
  3.1 Robust dynamic programming
  3.2 Solving the inner problem
  3.3 Complexity analysis
  3.4 Algorithm

4 Infinite-Horizon Problem
  4.1 Robust Bellman recursion
  4.2 Complexity analysis
  4.3 Algorithm

5 Likelihood Models
  5.1 Model description
  5.2 The dual problem
  5.3 A bisection algorithm
  5.4 Complexity
  5.5 Maximum A Posteriori models

6 Entropy Models
  6.1 Model description
  6.2 Dual problem
  6.3 A bisection algorithm
  6.4 Complexity

7 Other Uncertainty Models
  7.1 Finite scenario model
  7.2 Interval matrix model
  7.3 Ellipsoidal models

8 Example: Robust Aircraft Routing
  8.1 The nominal problem
  8.2 The robust version
  8.3 Comparing robust and nominal strategies
  8.4 Inaccuracy of uncertainty level


9 Concluding remarks

A Stochastic game theoretic proof of the robust Bellman recursion

B Properties of function σ of section 5.3

C Properties of function σ of section 6.3

D Calculation of β for a Desired Confidence Level


Notation. P > 0 or P ≥ 0 refers to the strict or non-strict componentwise inequality for matrices or vectors. For a vector p > 0, log p refers to the componentwise operation. The notation 1 refers to the vector of ones, with size determined from context. The probability simplex in Rn is denoted ∆n = {p ∈ Rn+ : pT1 = 1}, while Θn is the set of n × n transition matrices (componentwise non-negative matrices with rows summing to one). We use σP to denote the support function of a set P ⊆ Rn, with, for v ∈ Rn, σP(v) := sup{pTv : p ∈ P}.

1 Introduction

Finite-state and finite-action Markov Decision Processes (MDPs) capture several attractive features that are important in decision-making under uncertainty: they handle risk in sequential decision-making via a state transition probability matrix, while taking into account the possibility of information gathering and using this information to apply recourse during the multi-stage decision process [Puterman, 1994, Bertsekas and Tsitsiklis, 1996, Mine and Osaki, 1970, Feinberg and Shwartz, 2002].

This paper addresses the issue of uncertainty at a higher level: we consider a Markov decision problem in which the transition probabilities themselves are uncertain, and seek a robust decision for it. Our work is motivated by the fact that in many practical problems, the transition matrices have to be estimated from data, and this may be a difficult task; see for example [Kalyanasundaram et al., 2001, Feinberg and Shwartz, 2002, Abbad and Filar, 1992, Abbad et al., 1992]. It turns out that estimation errors may have a huge impact on the solution, which is often quite sensitive to changes in the transition probabilities. We will provide an example of this phenomenon in section 8.

A number of authors have addressed the issue of uncertainty in the transition matrices of an MDP. A Bayesian approach such as described by [Shapiro and Kleywegt, 2002] requires a perfect knowledge of the whole prior distribution on the transition matrix, making it difficult to apply in practice. Other authors have considered the transition matrix to lie in a given set, most typically a polytope: see [Satia and Lave, 1973, White and Eldeib, 1994, Givan et al., 1997]. Although our approach allows us to describe the uncertainty on the transition matrix by a polytope, we will argue against choosing such a model for the uncertainty. First, a general polytope is often not a tractable way to address the robustness problem, as it incurs a significant additional computational effort to handle uncertainty. As we will show, an exception is when the uncertainty is described by an interval matrix, intersected with the constraint that probabilities sum to one, as in [Givan et al., 1997, Bagnell et al., 2001]; or, when the polytope is described by its vertices. Perhaps more importantly, polytopic models, especially interval matrices, may be very poor representations of statistical uncertainty and lead to very conservative robust policies. In [Bagnell et al., 2001], the authors consider a problem dual to ours, and give without proof the “robust value iteration”, which we derive here. Like us, they consider relative entropy as a way to measure uncertainties in the transition matrices; however, they do not propose any specific algorithm to solve the corresponding “inner problem”, which has to be solved at each step of the robust value iteration. They only provide a general statement according to which the cost of solving the inner problem is polynomial in problem size, provided the uncertainty on the transition matrices is described by convex sets.

2 Problem Setup

2.1 The nominal problem

We consider a finite horizon Markov decision process with finite decision horizon T = {0, 1, 2, . . . , N − 1}. At each stage, the system occupies a state i ∈ X, where n = |X| is finite, and a decision maker is allowed to choose an action a deterministically from a finite set of allowable actions A = {a1, . . . , am} (for notational simplicity we assume that A is not state-dependent). The system starts in a given initial state i0. The states make Markov transitions according to a collection of (possibly time-dependent) transition matrices τ := (P^a_t)a∈A,t∈T, where for every a ∈ A, t ∈ T, the n × n transition matrix P^a_t contains the probabilities of transition under action a at stage t. We denote by π = (a0, . . . , aN−1) a generic controller policy, where at(i) denotes the controller action when the system is in state i ∈ X at time t ∈ T. Let Π = A^{nN} be the corresponding strategy space. Define by ct(i, a) the cost corresponding to state i ∈ X and action a ∈ A at time t ∈ T, and by cN the cost function at the terminal stage. We assume that ct(i, a) is non-negative and finite for every i ∈ X and a ∈ A.

For a given set of transition matrices τ, we define the finite-horizon nominal problem by

φN(Π, τ) := minπ∈Π CN(π, τ),   (1)

where CN(π, τ) denotes the expected total cost under controller policy π and transitions τ:

CN(π, τ) := E ( ∑_{t=0}^{N−1} ct(it, at(it)) + cN(iN) ).   (2)

A special case of interest is when the expected total cost function bears the form (2), where the terminal cost is zero, and ct(i, a) = ν^t c(i, a), with c(i, a) now a constant cost function, which we assume non-negative and finite everywhere, and ν ∈ (0, 1) is a discount factor. We refer to this cost function as the discounted cost function, and denote by C∞(π, τ) the limit of the discounted cost (2) as N → ∞.

When the transition matrices are exactly known, the corresponding nominal problem can be solved via a dynamic programming algorithm, which has total complexity of mn²N flops in the finite-horizon case. In the infinite-horizon case with a discounted cost function, the cost of computing an ε-suboptimal policy via the Bellman recursion is O(mn² log(1/ε)); see [Puterman, 1994] for more details.
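For concreteness, here is a minimal Python/NumPy sketch of this classical finite-horizon recursion; the data layout (lists of per-stage cost and transition arrays) and the function name are illustrative choices, not part of the paper.

```python
import numpy as np

def nominal_finite_horizon_dp(P, c, c_N):
    """Classical (nominal) finite-horizon dynamic programming.

    P[t][a] : (n, n) transition matrix P^a_t
    c[t]    : (n, m) stage costs c_t(i, a)
    c_N     : (n,)  terminal cost
    Returns the value function v_0 and a greedy optimal policy.
    """
    N = len(P)
    n, m = c[0].shape
    v = c_N.copy()
    policy = [None] * N
    for t in reversed(range(N)):
        # Q[i, a] = c_t(i, a) + sum_j P^a_t(i, j) v_{t+1}(j)
        Q = c[t] + np.stack([P[t][a] @ v for a in range(m)], axis=1)
        policy[t] = Q.argmin(axis=1)   # a_t(i)
        v = Q.min(axis=1)              # v_t(i)
    return v, policy
```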

2.2 The robust control problem

At first we consider the finite-horizon case, and assume that for each action a and time t, the corresponding transition matrix P^a_t is only known to lie in some given subset Pa of Θn.


Loosely speaking, we can think of the sets Pa as sets of confidence for the transition matrices. Note that our uncertainty model does not allow for correlations between the uncertainties affecting the P^a's across different actions a.

Two models for transition matrix uncertainty are possible, leading to two possible forms of finite-horizon robust control problems. In a first model, referred to as the stationary uncertainty model, the transition matrices are chosen by nature depending on the controller policy once and for all, and remain fixed thereafter. In a second model, which we refer to as the time-varying uncertainty model, the transition matrices can vary arbitrarily with time, within their prescribed bounds. Each problem leads to a game between the controller and nature, where the controller seeks to minimize the maximum expected cost, with nature being the maximizing player.

Let us define our two problems more formally. A policy of nature refers to a specific collection of time-dependent transition matrices τ = (P^a_t)a∈A,t∈T chosen by nature, and the set of admissible policies of nature is T := (⊗a∈A Pa)^N. Denote by Ts the set of stationary admissible policies of nature:

Ts = {τ = (P^a_t)a∈A,t∈T ∈ T : P^a_t = P^a_{t′} for every t, t′ ∈ T, a ∈ A}.

The stationary uncertainty model leads to the problem

φN(Π, Ts) := minπ∈Π maxτ∈Ts CN(π, τ).   (3)

In contrast, the time-varying uncertainty model leads to a relaxed version of the above:

φN(Π, Ts) ≤ φN(Π, T ) := minπ∈Π maxτ∈T CN(π, τ).   (4)

The first model is attractive for statistical reasons, as it is much easier to develop statistically accurate sets of confidence when the underlying process is time-invariant. Unfortunately, the resulting game (3) seems hard to solve, because dynamic programming algorithms cannot be used to solve this problem. The second model is attractive as one can solve the corresponding game (4) using a variant of the dynamic programming algorithm seen later, but we are left with a difficult task, that of estimating a meaningful set of confidence for the time-varying matrices P^a.

In the finite-horizon case, we use the first model of uncertainty, where the set of allowable transition matrices for nature is time-invariant for each action of the controller. This allows us to describe uncertainty in a statistically accurate way using likelihood or entropy bounds. To solve the corresponding control problem (3), we use an approximation that is common in robust control, where nature is allowed to choose different transition matrices at different stages within this time-invariant uncertainty set. In a sense, the time-invariant uncertainty is replaced by a time-varying one. This means that we solve the second problem (4) as an approximation (upper bound) to the first, using uncertainty sets Pa derived from a time-invariance assumption about the transition matrices.

We also consider an infinite-horizon robust control problem, with the discounted cost function referred to above, and where we restrict control and nature policies to be stationary:

φ∞(Πs, Ts) := minπ∈Πs maxτ∈Ts C∞(π, τ),   (5)


where Πs denotes the space of stationary control policies. We define φ∞(Π, T ), φ∞(Π, Ts) and φ∞(Πs, T ) accordingly.

In the sequel, for a given control policy π ∈ Π and subset S ⊆ T, the notation φN(π, S) := maxτ∈S CN(π, τ) denotes the worst-case expected total cost for the finite-horizon problem, and φ∞(π, S) is defined likewise.

2.3 Main results and outline

Our main contributions are as follows. First we derive a recursion, the “robust dynamic programming” algorithm, which solves the finite-horizon robust control problem (4). We provide a simple proof of the optimality of the recursion, where the main ingredient is to show that perfect duality holds in the game (4). As a corollary of this result, we obtain that the repeated game (16) is equivalent to its non-repeated counterpart (4). Second, we derive similar results for the infinite-horizon problem with discounted cost function, (5). Moreover, we obtain that if we consider a finite-horizon problem with a discounted cost function, then the gap between the optimal value of the stationary uncertainty problem (3) and that of its time-varying counterpart (4) goes to zero as the horizon length goes to infinity, at a rate determined by the discount factor. Finally, we identify several classes of uncertainty models which result in an algorithm that is both statistically accurate and numerically tractable. We derive precise complexity results that imply that, with the proposed approach, robustness can be handled at practically no extra computing cost.

Our paper is organized as follows. Section 3 deals with the finite-horizon problem, including the “robust dynamic programming” theorem (theorem 1) and its proof, as well as a detailed complexity analysis. Section 4 provides similar results for the infinite-horizon case. Sections 5 and 6 are devoted to specific uncertainty models, involving likelihood regions or entropy bounds, while section 7 deals with finite scenario, ellipsoidal and interval matrix models. We describe numerical results in the context of aircraft routing in section 8.

3 Finite-Horizon Problem

We consider the finite-horizon robust control problem defined in section 2.2. For a given state i ∈ X, action a ∈ A, and P^a ∈ Pa, we denote by p^a_i the next-state distribution drawn from P^a corresponding to state i ∈ X; thus p^a_i is the i-th row of matrix P^a. We define Pa_i as the projection of the set Pa onto the set of p^a_i-variables. By assumption, these sets are included in the probability simplex of Rn, ∆n; no other property is assumed.

3.1 Robust dynamic programming

We provide below a self-contained proof of the following theorem. In [Bagnell et al., 2001], the authors have considered the problem of computing the dual quantity ψN(Π, T ) defined below, and stated without proof that it can be computed by the recursion given in the theorem.


Theorem 1 (robust dynamic programming) For the robust control problem (4), perfect duality holds:

φN(Π, T ) = minπ∈Π maxτ∈T CN(π, τ) = maxτ∈T minπ∈Π CN(π, τ) := ψN(Π, T ).

The problem can be solved via the recursion

vt(i) = mina∈A ( ct(i, a) + σPa_i(vt+1) ), i ∈ X, t ∈ T,   (6)

where σP(v) := sup{pTv : p ∈ P} denotes the support function of a set P, and vt(i) is the worst-case optimal value function in state i at stage t. A corresponding optimal control policy π∗ = (a∗0, . . . , a∗N−1) is obtained by setting

a∗t(i) ∈ arg mina∈A { ct(i, a) + σPa_i(vt+1) }, i ∈ X,   (7)

and a corresponding worst-case nature policy is obtained by choosing the i-th row of the transition matrix P^a_t as

p^a_i(t) ∈ arg maxp { pTvt+1 : p ∈ Pa_i }, i ∈ X, a ∈ A, t ∈ T.   (8)

The effect of uncertainty on a given strategy π = (a0, . . . , aN−1) can be evaluated by the following recursion

vπ_t(i) = ct(i, at(i)) + σP^{at(i)}_i(vπ_{t+1}), i ∈ X,   (9)

which provides the worst-case value function vπ for the strategy π.

Proof. We begin with a simple technical lemma.

Lemma 1 For given vN ∈ Rn, consider the problem

η := max_{v0,...,vN−1} qTv0 : vt ≤ gt(vt+1), t ∈ T,   (10)

where inequalities are understood componentwise, q ∈ Rn+ and the functions gt : Rn → Rn are given. If the functions gt are componentwise non-decreasing for every t ∈ T, meaning that gt(u) ≤ gt(v) for every u, v ∈ Rn with u ≤ v, then the optimal variables can be computed via the recursion

vt = gt(vt+1), t ∈ T,   (11)

and the optimal value is η = qT(g0 ◦ . . . ◦ gN−1)(vN).

To prove lemma 1, we note that recursion (11) yields v0 = v∗0 := (g0 ◦ . . . ◦ gN−1)(vN). In addition, this recursion provides a feasible point for the problem, hence η ≥ qTv∗0. Since q ≥ 0, and each gt is componentwise non-decreasing, we also have η ≤ qTv∗0, which shows that the recursion provides the optimal value of problem (10). This proves the lemma.


We proceed with a well-known linear programming representation of the nominal problem (1) [Puterman, 1994]:

φN(Π, τ) := max_{v0,...,vN−1} qTv0 : vt(i) ≤ ct(i, a) + ∑j P^a_t(i, j)vt+1(j), a ∈ A, i ∈ X, t ∈ T,   (12)

where q is a componentwise non-negative vector, precisely q(i) = 0 if i ≠ i0, q(i0) = 1, where i0 is the initial state. In the above we have denoted by τ := (P^a_t)a∈A,t∈T the (given) collection of time-varying transition matrices. Likewise, the expected cost for a given controller policy π = (at)t∈T is given by the linear program

φN(π, τ) := max_{v0,...,vN−1} qTv0 : vt(i) ≤ ct(i, at(i)) + ∑j P^{at(i)}_t(i, j)vt+1(j), i ∈ X, t ∈ T.   (13)

By weak duality, φN(Π, T ) ≥ ψN(Π, T ), where ψN(Π, T ) is defined in the theorem. Let us prove that perfect duality holds, that is, φN(Π, T ) = ψN(Π, T ). The lower bound ψN(Π, T ) can be expressed as the optimal value of the following non-linear problem (in variables v, τ):

ψN(Π, T ) := max_{τ∈T, v0,...,vN−1} qTv0 : vt(i) ≤ ct(i, a) + ∑j P^a_t(i, j)vt+1(j), a ∈ A, i ∈ X, t ∈ T.   (14)

The difference between the nominal problem (12) and (14) is simply that the matrices P^a_t are fixed in (12), while they are variables in problem (14). Denote by φN(π, T ) = maxτ∈T CN(π, τ) the worst-case expected total cost for a given policy π. This value is obtained by letting the matrices P^{at(i)}_t become variables in (13), which results in

φN(π, T ) := max_{τ∈T, v0,...,vN−1} qTv0 : vt(i) ≤ ct(i, at(i)) + ∑j P^{at(i)}_t(i, j)vt+1(j), i ∈ X, t ∈ T.   (15)

The problem of computing ψN(Π, T ), and that of computing φN(π, T ) for a given policy π, can both be represented as the problem (10) of lemma 1, where we define the functions gt, t ∈ T, by their components, as follows for problem (14):

(gt(v))i := mina∈A ( ct(i, a) + σPa_i(v) ), i ∈ X,

and as follows for problem (15):

(gt(v))i := ct(i, at(i)) + σP^{at(i)}_i(v), i ∈ X.

Since the sets Pa_i are all included in ∆n, the above functions are componentwise non-decreasing, and lemma 1 applies. This shows that problems (14) and (15) can be solved by the recursions (6) and (9) respectively, as given in theorem 1.

Recursion (6) provides a policy π∗ = (a∗0, . . . , a∗N−1), via the expression (7) as given in the theorem. We can express the recursion exactly as in (9), with at replaced with a∗t, t ∈ T. This shows that ψN(Π, T ) = φN(π∗, T ). Since π∗ is an admissible (that is, deterministic) policy, we necessarily have φN(π∗, T ) ≥ φN(Π, T ). This shows that perfect duality holds: φN(Π, T ) = φN(π∗, T ) = ψN(Π, T ), and that the policy π∗ provided by the expression (7) is optimal for the robust control problem (4).

Finally, the expression for the optimal worst-case policy of nature given in (8) is obtained by noting that it corresponds to the solution of problem (14) when π is set to the optimal control policy. This ends our proof. □

Note that our proof does not require convexity of the uncertainty sets Pa_i; we only used the fact that these sets are entirely included in the probability simplex of Rn.

We can also consider a variant of the finite-horizon time-varying problem (4), where controller and nature play alternately, leading to a repeated game

φrepN(Π, Q) := mina0 maxτ0∈Q mina1 maxτ1∈Q . . . minaN−1 maxτN−1∈Q CN(π, τ),   (16)

where the notation τt = (P^a_t)a∈A denotes the collection of transition matrices at a given time t ∈ T, and Q := ⊗a∈A Pa is the corresponding set of confidence. We are ready to examine the repeated game (16).

Corollary 1 The repeated game (16) is equivalent to the game (4):

φrepN(Π, Q) = φN(Π, T ),

and the optimal strategies for φN(Π, T ) given in theorem 1 are optimal for φrepN(Π, Q) as well.

Proof. A repeated application of weak duality shows the lower bound φrepN(Π, Q) ≤ φN(Π, T ) (this is simply a consequence of the fact that the repeated game gives less power to nature). Since the optimal worst-case nature strategy defined in theorem 1 is feasible for problem (16), the result follows. □

Furthermore, the robust Bellman recursion can also be proved under the stochastic game-theoretic framework (see Appendix A).

3.2 Solving the inner problem

Each step of the robust dynamic programming algorithm involves the solution of an optimization problem, referred to as the “inner problem”, of the form

σP(v) = maxp∈P vTp,   (17)

where the variable p corresponds to a particular row of a specific transition matrix, P = Pa_i is the set that describes the uncertainty on this row, and v contains the elements of the value function at some given stage. The complexity of the sets Pa_i for each i ∈ X and a ∈ A is a key component in the complexity of the robust dynamic programming algorithm. Note that we can safely replace P in (17) by its convex hull, so that convexity of the sets Pa_i is not required; the algorithm only requires the knowledge of their convex hulls.


Beyond numerical tractability, an additional criterion for the choice of a specific uncertainty model is of course that the sets Pa should represent accurate (non-conservative) descriptions of the statistical uncertainty on the transition matrices. Perhaps surprisingly, there are statistical models of uncertainty that are good on both counts; specific examples of such models are described in sections 5 and 6. Precisely, the uncertainty models considered in sections 5 and 6 all result in inner problems (17) that can be solved in worst-case time of O(n log(vmax/δ)) via a simple bisection algorithm, where n is the size of the state space, vmax is a global upper bound on the value function, and δ > 0 specifies the accuracy at which the optimal value of the inner problem (17) is computed. We defer the proof of this complexity result to the appropriate sections. The bisection algorithm can be interpreted as a function σ̂P such that for every v ∈ Rn, there exists δP(v) such that

σ̂P(v) = σP(v) + δP(v), 0 ≤ δP(v) ≤ δ.   (18)

3.3 Complexity analysis

In this section, we discuss the complexity of computing an ε-suboptimal policy π, which is a policy such that the worst-case expected total cost under policy π, namely φN(π, T ) = maxτ∈T CN(π, τ), satisfies φN(π, T ) − ε ≤ φN(Π, T ) ≤ φN(π, T ). Here, ε > 0 is given. We assume that we use the specific uncertainty models considered in sections 5 and 6, and that we solve the resulting inner problem with the bisection algorithm with an accuracy δ := ε/N.

When we apply the bisection algorithm within the robust dynamic programming algorithm given in section 3.1, we generate vectors v̄t by the recursion (6), with σPa_i replaced by σ̂Pa_i, as defined by (18). The corresponding equations (7) yield a policy π̄. We can express the recursion that provides v̄ as

v̄t(i) = mina∈A ( ct(i, a) + δt(i, a) + σPa_i(v̄t+1) ), i ∈ X, t ∈ T,

where δt(i, a) := δPa_i(v̄t+1). The policy π̄ is obtained by looking at a minimizing index in the above. Thus, π̄ is optimal for the robust control problem (4), but with a modified cost function: c̄t(i, a) = ct(i, a) + δt(i, a). The bounds 0 ≤ δt(i, a) ≤ ε/N then imply that the corresponding expected total cost function C̄N satisfies CN(π, τ) ∈ [C̄N(π, τ) − ε, C̄N(π, τ)] for every π ∈ Π and τ ∈ T. Maximizing over τ for π = π̄ yields φN(π̄, T ) ∈ [φ̄ − ε, φ̄], where φ̄ := minπ∈Π maxτ∈T C̄N(π, τ) is the optimal value of the modified control problem, and φN(π̄, T ) is the worst-case expected total cost under policy π̄ for the original problem. Likewise, minimizing over π the maximum over τ yields φN(Π, T ) ∈ [φ̄ − ε, φ̄]. Since φN(Π, T ) ≤ φN(π̄, T ) because π̄ is deterministic, we conclude that φN(Π, T ) ∈ [φN(π̄, T ) − ε, φN(π̄, T )], that is, π̄ is ε-suboptimal.

We obtain that, to compute a suboptimal policy π̄ that achieves the exact optimum with prescribed accuracy ε, the number of flops required by the algorithm is O(mn²N log(vmaxN/ε)). The bound vmax ≤ NCmax, with Cmax = maxi∈X,a∈A,t∈T ct(i, a), then leads to the complexity bound of O(mn²N log(N/ε)), which means that robustness is obtained at a relative increase of computational cost of only log(N/ε) with respect to the classical dynamic programming algorithm, which is small for moderate values of N. If N is very large, we can turn instead to the infinite-horizon problem examined in section 4, and similar complexity results hold.


3.4 Algorithm

Our analysis yields an algorithm to compute an ε-suboptimal policy πε for problem (4) using the uncertainty models described in sections 5 and 6. The algorithm has complexity O(mn²N log(N/ε)).

Robust Finite Horizon Dynamic Programming Algorithm

1. Set ε > 0. Initialize the value function to its terminal value vN = cN and set t = N.

2. Repeat until t = 0:

   (a) For every state i ∈ X and action a ∈ A, compute, using the bisection algorithm described in section 5 or 6, a value σa_i such that

       σa_i − ε/N ≤ σPa_i(vt) ≤ σa_i.

   (b) Update the value function by vt−1(i) = mina∈A ( ct−1(i, a) + σa_i ), i ∈ X.

   (c) Replace t by t − 1 and go to 2.

3. For every i ∈ X and t ∈ T, set πε = (aε0, . . . , aεN−1), where

   aεt(i) ∈ arg mina∈A { ct−1(i, a) + σa_i }, i ∈ X.
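The recursion is straightforward to implement once an oracle for the inner problem is available. Here is a minimal Python/NumPy sketch of the finite-horizon robust recursion (6)-(7); the inner-problem oracle sigma_hat (for instance, one of the bisection algorithms of sections 5 and 6) and the data layout are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def robust_finite_horizon_dp(c, c_N, row_sets, sigma_hat):
    """Robust finite-horizon dynamic programming (recursion (6)-(7)).

    c         : list of (n, m) arrays, c[t][i, a] = stage cost c_t(i, a)
    c_N       : (n,) terminal cost
    row_sets  : row_sets[a][i] describes the uncertainty set P^a_i
    sigma_hat : oracle, sigma_hat(P_set, v) ~ sup{p @ v : p in P_set}
    Returns the worst-case value function v_0 and a robust policy.
    """
    N = len(c)
    n, m = c[0].shape
    v = c_N.copy()
    policy = [None] * N
    for t in reversed(range(N)):
        Q = np.empty((n, m))
        for i in range(n):
            for a in range(m):
                # inner problem: worst-case expected next-stage value
                Q[i, a] = c[t][i, a] + sigma_hat(row_sets[a][i], v)
        policy[t] = Q.argmin(axis=1)   # a*_t(i), equation (7)
        v = Q.min(axis=1)              # v_t(i), equation (6)
    return v, policy
```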

4 Infinite-Horizon Problem

In this section, we address the infinite-horizon robust control problem, with a discounted cost function of the form (2), where the terminal cost is zero and ct(i, a) = ν^t c(i, a), where c(i, a) is now a constant cost function, which we assume non-negative and finite everywhere, and ν ∈ (0, 1) is a discount factor.

4.1 Robust Bellman recursion

We begin with the infinite-horizon problem involving stationary control and nature policies, (5).

Theorem 2 For the infinite-horizon robust control problem (5) with stationary uncertainty on the transition matrices, stationary control policies, and a discounted cost function with discount factor ν ∈ [0, 1), perfect duality holds:

φ∞(Πs, Ts) = maxτ∈Ts minπ∈Πs C∞(π, τ) := ψ∞(Πs, Ts).

The optimal value is given by φ∞(Πs, Ts) = v(i0), where i0 is the initial state, and where the value function v satisfies the optimality conditions

v(i) = mina∈A ( c(i, a) + νσPa_i(v) ), i ∈ X.   (19)

The value function is the unique limit value of the convergent vector sequence defined by

vk+1(i) = mina∈A ( c(i, a) + νσPa_i(vk) ), i ∈ X, k = 1, 2, . . .   (20)

A stationary, optimal control policy π = (a∗, a∗, . . .) is obtained as

a∗(i) ∈ arg mina∈A { c(i, a) + νσPa_i(v) }, i ∈ X,   (21)

and a stationary optimal nature policy is obtained by choosing the i-th row of the transition matrix P^a as

p^a_i ∈ arg maxp { pTv : p ∈ Pa_i }, i ∈ X, a ∈ A.   (22)

The effect of uncertainty on a given stationary strategy π = (a, a, . . .) can be evaluated by the following equation

vπ(i) = c(i, a(i)) + νσP^{a(i)}_i(vπ), i ∈ X,   (23)

which provides the worst-case value function for the strategy π.

Proof: The proof follows the same lines as that of theorem 1. As before, we begin with a simple technical lemma, which we state without proof.

Lemma 2 For a given vector q ∈ Rn+, and function g : Rn → Rn, consider the problem

maxv qTv : v ≤ g(v),   (24)

where inequalities are understood componentwise. If the above problem is feasible, and g is monotone non-decreasing and contractive, then there is a unique optimizer v∞, which is the unique solution to the fixed-point equation v = g(v).

We then express the nominal problem (without uncertainty on the transition matrices) with the linear program:

maxv qTv : v(i) ≤ c(i, a) + ν ∑j P^a(i, j)v(j), a ∈ A, i ∈ X,   (25)

where q is a componentwise non-negative vector, precisely q(i) = 0 if i ≠ i0, q(i0) = 1, where i0 is the initial state. Likewise, the expected cost for a given stationary controller policy π = (a, a, . . .) is given by the linear program

maxv qTv : v(i) ≤ c(i, a(i)) + ν ∑j P^{a(i)}(i, j)v(j), i ∈ X.   (26)

By weak duality, φ∞(Πs, Ts) ≥ ψ∞(Πs, Ts), where the latter is defined in the theorem. We now prove that equality holds. The lower bound ψ∞(Πs, Ts) can be expressed as the solution to the non-linear problem (in variables v, τ) obtained by letting the P^a's become variables in (25):

ψ∞(Πs, Ts) = maxτ∈Ts,v qTv : v(i) ≤ c(i, a) + ν ∑j P^a(i, j)v(j), a ∈ A, i ∈ X.   (27)

Likewise, if we denote by φ∞(π, Ts) := maxτ∈Ts C∞(π, τ) the worst-case expected total cost for a given policy π, then this value is obtained by letting the matrices P^{a(i)} become variables in (26), which results in

φ∞(π, Ts) := maxτ∈Ts,v qTv : v(i) ≤ c(i, a(i)) + ν ∑j P^{a(i)}(i, j)v(j), i ∈ X.   (28)

The problem of computing ψ∞(Πs, Ts), and that of computing φ∞(π, Ts) for a given policy π, can both be represented as the problem (24) of lemma 2, where we define the function g by its components, as follows for problem (27):

(g(v))i := mina∈A ( c(i, a) + νσPa_i(v) ), i ∈ X,   (29)

and as follows for problem (28):

(g(v))i := c(i, a(i)) + νσP^{a(i)}_i(v), i ∈ X.   (30)

Since the sets Pa_i are all included in ∆n, the above functions are componentwise non-decreasing; furthermore, these functions are ν-contractive, and lemma 2 applies. This shows that the optimal values of problems (27) and (28) are characterized by the equations given in theorem 2. The contractive property of g defined by (29) can be established by observing that for any pair u, v ∈ Rn, and for every i ∈ X, we have

gi(u) = mina∈A maxp∈Pa_i ( c(i, a) + νpTv + νpT(u − v) )
      ≤ mina∈A maxp∈Pa_i ( c(i, a) + νpTv ) + ν maxp∈Pa_i pT(u − v)
      ≤ gi(v) + ν max{pT(u − v) : pT1 = 1, p ≥ 0}
      ≤ gi(v) + ν‖u − v‖∞.

The proof of the contractive property for g defined by (30) is similar. The rest of the proof is similar to that of theorem 1. This ends our proof. □

Theorem 2 leads to the following corollary.

Corollary 2 In the infinite-horizon problem, we can without loss of generality assume that the control and nature policies are stationary, that is,

φ∞(Π, T ) = φ∞(Πs, Ts) = φ∞(Πs, T ) = φ∞(Π, Ts).   (31)

Furthermore, in the finite-horizon case, with a discounted cost function, the gap between the optimal values of the finite-horizon problems under stationary and time-varying uncertainty models, φN(Π, T ) − φN(Π, Ts), goes to zero as the horizon length N goes to infinity, at a geometric rate ν.


Proof. First we prove that φN(Π, T ) converges geometrically, at rate ν, to φ∞(Πs, Ts). Denote by (vk) the value function delivered by the infinite-horizon recursion (20), and by v∞ its limit, so that φ∞(Πs, Ts) = v∞(i0). Fix ε > 0; by convergence of the above recursion, there exists N such that

v∞ − ε1 ≤ vN ≤ v∞.   (32)

Now define the sequence vt = ν^t vN−t, for t = 0, . . . , N − 1; it satisfies the finite-horizon recursion (6) with ct(i, a) = ν^t c(i, a). Thus, (vt)t∈T is the optimal value function for the problem of computing φN(Π, T ), and in particular, v0(i0) = vN(i0) = φN(Π, T ). From (32) we obtain that

φ∞(Πs, Ts) − ε ≤ φN(Π, T ) ≤ φ∞(Πs, Ts),

which proves the convergence result; the rate is determined by that of vN to its limit, which is ν. This proves our intermediate result. By similar arguments, one can prove that for every stationary policy π ∈ Πs, φN(π, T ) converges geometrically, at rate ν, to φ∞(π, Ts).

Next we prove that φ∞(Π, T ) = φ∞(Πs, Ts). For every N, if we use a discounted cost function, then

CN(π, τ) ≤ C∞(π, τ) ≤ CN(π, τ) + εN,   (33)

where cmax := maxi∈X,a∈A c(i, a) < ∞ and εN := ν^N cmax/(1 − ν) converges geometrically to zero at rate ν. The above implies φN(Π, T ) ≤ φ∞(Π, T ) ≤ φN(Π, T ) + εN. Since φN(Π, T ) converges to φ∞(Πs, Ts), we obtain that the first equality in (31) holds.

To prove the second equality in (31), we observe that for every stationary policy π, and every N, we have

φN(π, T ) ≤ φ∞(π, T ) ≤ φN(π, T ) + εN,

where φN(π, T ) := maxτ∈T CN(π, τ) and φ∞(π, T ) is defined likewise. This shows that limN→∞ φN(π, T ) = φ∞(π, T ), which in turn ensures that the latter equals φ∞(π, Ts). Since this is valid for every π ∈ Πs, we obtain that φ∞(Πs, T ) = φ∞(Πs, Ts).

To prove the last equality, we note that standard results on the stationarity of optimal policies for nominal problems [Puterman, 1994] imply that for every stationary nature policy τ ∈ Ts, φN(Π, τ) converges to φ∞(Πs, τ) as N → ∞. The last equality then follows from the following bound, derived from (33): for every τ ∈ Ts, φN(Π, τ) ≤ φ∞(Π, τ) ≤ φN(Π, τ) + εN.

Finally, the rate of convergence on the gap is obtained as a consequence of the bound φ∞(Π, Ts) − εN ≤ φN(Π, Ts) ≤ φ∞(Π, Ts), which is derived from (33), and of the rate of convergence of φN(Π, T ) to its limit. □

4.2 Complexity analysis

We now turn to the complexity analysis of the infinite-horizon problem, assuming again that we use the specific uncertainty models described in sections 5 and 6. The robust Bellman recursion (20) provides a sequence (vk) which converges geometrically, at rate ν, to the optimal value function v∞ of the problem. This means that, to achieve a given accuracy (say ε/2) on that value, we need O(log(1/ε)) iterations, with exact computation of the inner problem at each step. Let us examine the complexity when inexact values are used.


We consider iterates (v̄k) of the recursion (20), with the same initial condition v̄1 = v1, but where we use the bisection algorithm with accuracy δ = (1 − ν)ε/2ν, in effect replacing the map σPa_i by its approximate counterpart σ̂Pa_i, as defined by (18). Let us prove that these approximate values also converge in O(log(1/ε)) time.

We now prove by induction that vk ≤ v̄k ≤ vk + θ1, where θ = νδ/(1 − ν) = ε/2. The initial condition is obtained trivially, as v̄1 = v1 satisfies v1 ≤ v̄1 ≤ v1 + θ1. Assume the bounds are true for a given k ≥ 1. Then for every i, a, we have

σPa_i(vk) ≤ σPa_i(v̄k) ≤ σPa_i(vk + θ1) ≤ σPa_i(vk) + σPa_i(θ1) = σPa_i(vk) + θ,

where we successively used the convexity, monotonicity and homogeneity of degree one of the function σPa_i. We then obtain

vk+1(i) ≤ v̄k+1(i) ≤ vk+1(i) + ν(δ + θ) = vk+1(i) + θ for all i ∈ X,

which proves our result. The above implies that

‖v̄k − v∞‖∞ ≤ ‖v̄k − vk‖∞ + ‖vk − v∞‖∞ ≤ θ + ε/2 = ε,

provided k = O(log(1/ε)) is large enough. This proves that we can achieve ε-convergence of v̄k in k = O(log(1/ε)) iterations.

We finish by examining the cost of computing an ε-suboptimal policy. The iterates v̄k obey

v̄k+1(i) = mina∈A ( c(i, a) + νδPa_i(v̄k) + νσPa_i(v̄k) ),

where 0 ≤ δPa_i(v̄k) ≤ δ. We can express the above as

v̄k+1(i) = mina∈A ( c(i, a) + ν( δPa_i(v̄k) + ∆a_i(k) ) + νσPa_i(v̄k+1) ),   (34)

where ∆a_i(k) := σPa_i(v̄k) − σPa_i(v̄k+1). The bound |∆a_i(k)| ≤ ‖v̄k+1 − v̄k‖∞ can be obtained by using the fact that, for any pair of n-vectors (u, v), and any subset P of the probability simplex ∆n, we have

σP(u) − σP(v) = maxp∈P minq∈P ( pTu − qTv ) ≤ maxp∈P pT(u − v) ≤ maxp∈∆n pT(u − v) ≤ ‖u − v‖∞.

Let δa_i(k) := δPa_i(v̄k) + ∆a_i(k). Choose k = N such that ‖v̄N+1 − v̄N‖∞ ≤ (1 − ν)ε/2ν, so that |δa_i(N)| ≤ δ + (1 − ν)ε/2ν = (1 − ν)ε/ν. (By the convergence properties proved above, we have N = O(log(1/ε)).)

The relation (34) implies that v̄k+1 and the corresponding policy π̄k are optimal for the infinite-horizon problem, but with a different cost function c̄, defined by c̄(i, a) = c(i, a) + νδa_i(N). (Note that N is a constant here, so we are really defining a time-invariant cost.) The bound on δa_i(N) then implies that the corresponding expected total discounted cost function C̄∞ satisfies |C̄∞(π, τ) − C∞(π, τ)| ≤ ε. The rest of the proof follows that of the finite-horizon case, with the only difference being that now we only have the two-sided inequality |C̄∞(π, τ) − C∞(π, τ)| ≤ ε, as opposed to a one-sided inequality. But the result remains the same.

We established that to compute an ε-suboptimal policy, we need to run O(log(1/ε)) steps of the robust Bellman recursion, using a bisection algorithm with accuracy δ = O(ε). Each step of the Bellman recursion requires O(mn² log(vmax/δ)) flops. The bound vmax ≤ cmax/(1 − ν), where cmax = maxi∈X,a∈A c(i, a), brings the total complexity to O(mn²(log(1/ε))²). Thus, the extra computational cost incurred by robustness in the infinite-horizon case is a factor of O(log(1/ε)).

4.3 Algorithm

Our analysis yields the following algorithm to compute an ε-suboptimal policy πε for problem (5) in O(mn²(log(1/ε))²) flops, using the uncertainty models described in sections 5 and 6.

Robust Infinite Horizon Dynamic Programming Algorithm

1. Set ε > 0, initialize the value function v1 > 0, and set k = 1.

2. (a) For all states i and controls a, compute, using the bisection algorithm described in section 5 or 6, a value σa_i such that

       σa_i − δ ≤ σPa_i(vk) ≤ σa_i,

   where δ = (1 − ν)ε/2ν.

   (b) For all states i, compute vk+1(i) by

       vk+1(i) = mina∈A ( c(i, a) + νσa_i ).

3. If

   ‖vk+1 − vk‖∞ < (1 − ν)ε/2ν,

   go to 4. Otherwise, replace k by k + 1 and go to 2.

4. For each i ∈ X, set πε = (aε, aε, . . .), where

   aε(i) ∈ arg mina∈A { c(i, a) + νσa_i }, i ∈ X.
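A minimal Python/NumPy sketch of this robust value iteration; the inner-problem oracle sigma_hat (returning an over-approximation of σPa_i(v) with accuracy δ, as in (18)) is a hypothetical placeholder supplied by the user, not the authors' code.

```python
import numpy as np

def robust_value_iteration(c, row_sets, sigma_hat, nu, eps):
    """Robust infinite-horizon value iteration (recursion (20)).

    c        : (n, m) array, c[i, a] = stage cost c(i, a)
    row_sets : row_sets[a][i] describes the uncertainty set P^a_i
    sigma_hat: oracle, sigma_hat(P_set, v, delta) ~ sup{p @ v : p in P_set}
    nu       : discount factor in (0, 1)
    eps      : target accuracy
    """
    n, m = c.shape
    delta = (1.0 - nu) * eps / (2.0 * nu)   # inner-problem accuracy
    v = np.ones(n)                          # v_1 > 0
    while True:
        Q = np.empty((n, m))
        for i in range(n):
            for a in range(m):
                Q[i, a] = c[i, a] + nu * sigma_hat(row_sets[a][i], v, delta)
        v_next = Q.min(axis=1)
        if np.max(np.abs(v_next - v)) < delta:   # stopping test of step 3
            return v_next, Q.argmin(axis=1)      # value and policy a^eps
        v = v_next
```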

5 Likelihood Models

Our first model is based on a likelihood constraint to describe uncertainty on each transition matrix. Our uncertainty model is derived from a controlled experiment starting from state i = 1, 2, . . . , n and the count of the number of transitions to different states. We denote by F^a the matrix of empirical frequencies of transition with control a in the experiment; denote by f^a_i its i-th row. We have F^a ≥ 0 and F^a 1 = 1, where 1 denotes the vector of ones. For simplicity, we assume that F^a > 0 for every a.

To simplify the notation, we will drop the superscript a in this section, and refer to a generic transition matrix as P and to its i-th row as pi. The same convention applies to the empirical frequency matrix F^a and its rows f^a_i, as well as to the sets Pa and Pa_i. When the meaning is clear from context, we will further drop the subscript i.

5.1 Model description

The “plug-in” estimate P = F is the solution to the maximum likelihood problem

maxP L(P ) := ∑i,j F(i, j) log P(i, j) : P ≥ 0, P1 = 1.   (35)

The optimal log-likelihood is βmax = ∑i,j F(i, j) log F(i, j).

A classical description of uncertainty in a maximum-likelihood setting is via the likelihood region [Lehmann and Casella, 1998, Poor, 1988]

{ P ∈ Rn×n : P ≥ 0, P1 = 1, ∑i,j F(i, j) log P(i, j) ≥ β },   (36)

where β < βmax is a given number, which represents the uncertainty level. In practice, the designer chooses an uncertainty level, and β can be estimated using re-sampling methods, or a large-sample Gaussian approximation, so as to ensure that the set above achieves the desired level of confidence (see Appendix D).

The above description is classical in the sense that log-likelihood regions are the starting point for developing ellipsoidal or interval models of confidence, hence are more statistically accurate [Lehmann and Casella, 1998]; see section 7.3 for further details. The above set is statistically meaningful as it describes how informative the data is. If this set is elongated along a direction, then the likelihood function does not vary much in that direction, and the data is not very informative in that direction. This set has some interesting features. First, it does not result from a (quadratic) approximation; it is a valid description of uncertainty, even for β values that are far below βmax. Second, this set might not be symmetric around the maximum-likelihood point, reflecting the fact that the statistical uncertainty depends on the direction. Finally, by construction, it excludes matrices that are not transition matrices; the same cannot be said of the more classical ellipsoidal approximations.

In the inner problem (17), we only need to work with the uncertainty on each row pi, that is, with projections of the set (36). Due to the separable nature of the log-likelihood function, the projection of the above set onto the pi-variables of matrix P can be given explicitly, as

Pi(βi) := { p ∈ ∆n : ∑j fi(j) log p(j) ≥ βi },


where

βi := β − ∑k≠i ∑j F(k, j) log F(k, j).

We are now ready to attack problem (17) under the premise that the transition matrix is only known to lie in some likelihood region as defined above. The inner problem is to compute

σ∗ := maxp pTv : p ∈ ∆n, ∑j f(j) log p(j) ≥ β,   (37)

where we have dropped the subscript i in the empirical frequencies vector fi and in the lower bound βi. In this section βmax denotes the maximal value of the likelihood function appearing in the above set, which is βmax = ∑j f(j) log f(j). We assume that β < βmax, which, together with f > 0, ensures that the set above has non-empty interior. Without loss of generality, we can assume that v ∈ Rn+.

5.2 The dual problem

The Lagrangian L : Rn × Rn × R × R → R associated with the inner problem can be written as

L(p, ζ, µ, λ) = pTv + ζTp + µ(1 − pT1) + λ(fT log p − β),

where ζ, µ, and λ are the Lagrange multipliers. The Lagrange dual function d : Rn × R × R → R is the maximum value of the Lagrangian over p, i.e., for ζ ∈ Rn, µ ∈ R, and λ ∈ R,

d(ζ, µ, λ) = supp L(p, ζ, µ, λ) = supp ( pTv + ζTp + µ(1 − pT1) + λ(fT log p − β) ).   (38)

The optimal p∗ = arg supp L(p, ζ, µ, λ) is readily obtained by solving ∂L/∂p = 0, which results in

p∗(i) = λf(i) / (µ − v(i) − ζ(i)).

Plugging the value of p∗ into the expression of d(ζ, µ, λ) yields, with some simplification, the following dual problem:

σ := minλ,µ,ζ  µ − (1 + β)λ + λ ∑j f(j) log( λf(j) / (µ − v(j) − ζ(j)) ) : λ ≥ 0, ζ ≥ 0, ζ + v ≤ µ1.

Since the above problem is convex, and has a feasible set with non-empty interior, there is no duality gap, that is, σ∗ = σ. Moreover, by a monotonicity argument, we obtain that the optimal dual variable ζ is zero, which reduces the number of variables to two:

σ∗ = minλ,µ h(λ, µ),


where

h(λ, µ) := µ − (1 + β)λ + λ ∑j f(j) log( λf(j) / (µ − v(j)) )  if λ > 0 and µ > vmax := maxj v(j),
h(λ, µ) := +∞  otherwise.   (39)

For further reference, we note that h is twice differentiable on its domain, and that its gradient is given by

∇h(λ, µ) = ( ∑j f(j) log( λf(j) / (µ − v(j)) ) − β ,  1 − λ ∑j f(j) / (µ − v(j)) ).   (40)

5.3 A bisection algorithm

From the expression of the gradient obtained above, we obtain that the optimal value of λ for a fixed µ, λ(µ), is given analytically by

λ(µ) = ( ∑j f(j) / (µ − v(j)) )⁻¹,   (41)

which further reduces the problem to a one-dimensional problem:

σ∗ = minµ≥vmax σ(µ),

where vmax = maxj v(j), and σ(µ) = h(λ(µ), µ). By construction, the function σ(µ) is convex in its (scalar) argument, since the function h defined in (39) is jointly convex in both its arguments (see [Boyd and Vandenberghe, 2004, p. 74]). Hence, we may use bisection to minimize σ.

To initialize the bisection algorithm, we need lower and upper bounds µ− and µ+ on a minimizer of σ. When µ → vmax, σ(µ) → vmax and σ′(µ) → −∞ (see Appendix B). Thus, we may set the lower bound to µ− = vmax.

The upper bound µ+ must be chosen such that σ′(µ+) > 0. We have

σ′(µ) = ∂h/∂µ (λ(µ), µ) + ∂h/∂λ (λ(µ), µ) · dλ(µ)/dµ.   (42)

The first term is zero by construction, and dλ(µ)/dµ > 0 for µ > vmax. Hence, we only need a value of µ for which

∂h/∂λ (λ(µ), µ) = ∑j f(j) log( λ(µ)f(j) / (µ − v(j)) ) − β > 0.   (43)


By convexity of the negative log function, and using the fact that fT1 = 1, f ≥ 0, we obtain that

∂h/∂λ (λ(µ), µ) = βmax − β + ∑j f(j) log( λ(µ) / (µ − v(j)) )
              ≥ βmax − β − log( ∑j f(j)(µ − v(j)) / λ(µ) )
              ≥ βmax − β + log( λ(µ) / (µ − v̄) ),

where v̄ = fTv denotes the average of v under f.

The above, combined with the bound on λ(µ), λ(µ) ≥ µ − vmax, yields a sufficient condition for (43) to hold:

µ > µ0+ := (vmax − e^{β−βmax} v̄) / (1 − e^{β−βmax}).   (44)

By construction, the interval [vmax, µ0+] is guaranteed to contain a global minimizer of σ over (vmax, +∞).

The bisection algorithm goes as follows:

1. Set µ− = vmax and µ+ = µ0+ as in (44). Let δ > 0 be a small convergence parameter.

2. While µ+ − µ− > δ(1 + µ− + µ+), repeat:

   (a) Set µ = (µ+ + µ−)/2.

   (b) Compute the derivative of σ at µ.

   (c) If σ′(µ) > 0, set µ+ = µ; otherwise, set µ− = µ.

   (d) Go to 2a.

In practice, the function to minimize may be very “flat” near the minimum. This means that the above bisection algorithm may take a long time to converge to the global minimizer. Since we are only interested in the value of the minimum (and not of the minimizer), we may modify the stopping criterion to

µ+ − µ− ≤ δ(1 + µ− + µ+) or σ′(µ+) − σ′(µ−) ≤ δ.

The second condition in the criterion implies that |σ′((µ+ + µ−)/2)| ≤ δ, which is an approximate condition for global optimality.
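For illustration, here is a minimal NumPy sketch of this bisection for the likelihood inner problem (37), assuming f > 0 with fT1 = 1 and β < βmax; the function name and the way accuracy is handled are illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigma_likelihood(v, f, beta, delta=1e-6):
    """Bisection for the likelihood inner problem (37).

    v    : (n,) value-function vector (assumed non-negative)
    f    : (n,) empirical frequencies, f > 0, f.sum() == 1
    beta : likelihood lower bound, beta < beta_max = sum f log f
    Returns an approximation of sigma* = max{p @ v : p in Delta_n,
    sum_j f_j log p_j >= beta}, computed to accuracy O(delta).
    """
    v_max = v.max()
    v_bar = f @ v
    beta_max = f @ np.log(f)

    def lam(mu):                 # lambda(mu), equation (41)
        return 1.0 / np.sum(f / (mu - v))

    def dh_dlam(mu):             # left-hand side of (43); same sign as sigma'(mu)
        return f @ np.log(lam(mu) * f / (mu - v)) - beta

    def sigma(mu):               # sigma(mu) = h(lambda(mu), mu), equation (39)
        l = lam(mu)
        return mu - (1.0 + beta) * l + l * (f @ np.log(l * f / (mu - v)))

    # initial bracket [mu_-, mu_+], equation (44)
    mu_lo = v_max
    mu_hi = (v_max - np.exp(beta - beta_max) * v_bar) / (1.0 - np.exp(beta - beta_max))
    # bisection on the sign of sigma'(mu)
    while mu_hi - mu_lo > delta * (1.0 + mu_lo + mu_hi):
        mu = 0.5 * (mu_lo + mu_hi)
        if dh_dlam(mu) > 0:
            mu_hi = mu
        else:
            mu_lo = mu
    return sigma(0.5 * (mu_lo + mu_hi))
```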

5.4 Complexity

Let us analyze the number of iterations needed to achieve a given accuracy on the optimal value σ∗. We denote by µ∗ a minimizer of the function and by µ+, µ− the final iterates of the bisection algorithm, run with convergence parameter δ. We then have µ+ − µ− ≤ δ(1 + 2µ0+), which implies

0 ≤ µ+ − µ∗ ≤ µ+ − µ− ≤ δ ( 1 + 2vmax / (1 − e^{β−βmax}) ) = O(vmaxδ).


The number of iterations needed to achieve the above bound on the minimizer µ∗ grows as log((µ0+ − vmax)/δ) = O(log(vmax/δ)). Thus, to achieve an accuracy δ on the minimizer, we need O(log(vmax/δ)) iterations.

Here, we are not interested in the value of a minimizer µ∗, but in the minimum value, σ∗. By construction, µ+ ≥ µ∗, and we have 0 ≤ σ′(µ+) ≤ limµ→+∞ σ′(µ) = βmax − β. Furthermore, we have 0 ≤ µ+ − µ∗ ≤ µ+ − µ− ≤ O(vmaxδ). By convexity:

σ(µ+) ≥ σ∗ ≥ σ(µ+) − (µ+ − µ∗)σ′(µ+) = σ(µ+) − O(vmaxδ).

We obtain that, to achieve a given accuracy δ on σ∗, we need O(log(vmax/δ)) iterations of the bisection algorithm. Since each iteration requires O(n) flops, the total complexity of the inner problem is O(n log(vmax/δ)).

5.5 Maximum A Posteriori models

We now consider a variation on the likelihood model, the Maximum A Posteriori (MAP) model. The MAP estimation framework provides a way of incorporating prior information in the estimation process. This is particularly useful for dealing with sparse training data, for which the maximum likelihood approach may provide inaccurate estimates. The MAP estimator, denoted by pMAP, maximizes the “MAP function” [Siouris, 1995]

LMAP(p) = L(p) + log gprior(p),

where L(p) is the log-likelihood function, and gprior refers to the a priori density function of the parameter vector p.

In our case, p is a row of the transition matrix, so a prior distribution has support included in the n-dimensional simplex {p : p ≥ 0, pT1 = 1}. It is customary to choose the prior to be a Dirichlet distribution [Ferguson, 1974, Wilks, 1962], the density of which is of the form

gprior(p) = K · ∏i p(i)^{α(i)−1},

where the vector α ≥ 1 is given, and K is a normalizing constant. Choosing α = 1, we recover the “non-informative prior”, which is the uniform distribution on the n-dimensional simplex; in that case, the MAP estimate coincides with the maximum-likelihood estimate. Hence, MAP estimation is a more general framework, and maximum-likelihood estimation is a specialization of MAP estimation to the case where no prior information is available.

The resulting MAP estimation problem takes the form

maxp (f + α − 1)T log p : pT1 = 1, p ≥ 0.

To this problem we can associate a “MAP” region which describes the uncertainty on the estimate, via a lower bound β on the function LMAP(p). The inner problem now takes the form

σ := maxp pTv : p ≥ 0, pT1 = 1, ∑j (f(j) + α(j) − 1) log p(j) ≥ κ,

where κ depends on the normalizing constant K appearing in the prior density function and on the chosen lower bound β on the MAP function. We observe that this problem has exactly the same form as in the likelihood case, provided we replace f by f + α − 1. Therefore, the same results apply to the MAP case.
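Since the MAP inner problem only differs from (37) by the substitution f → f + α − 1, any solver for the likelihood case can be reused. A minimal sketch, assuming the sigma_likelihood function from the sketch above; the rescaling by the total pseudo-count weight is a convenience that puts the constraint back into the normalized form used there, not something the paper prescribes.

```python
def sigma_map(v, f, alpha, kappa, delta=1e-6):
    """MAP inner problem of section 5.5: identical to the likelihood inner
    problem (37) with f replaced by the pseudo-counts f + alpha - 1.
    Dividing the constraint weights and bound by their total weight s restores
    sum-to-one weights, so the likelihood bisection sketch can be reused."""
    g = f + alpha - 1.0          # MAP weights, positive since f > 0, alpha >= 1
    s = g.sum()
    return sigma_likelihood(v, g / s, kappa / s, delta)
```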

6 Entropy Models

6.1 Model description

We consider problem (17), with the uncertainty on the i-th row of the transition matrix P^a described by a set of the form P = {p ∈ ∆n : D(p‖q) ≤ β}, where β > 0 is fixed, q > 0 is a given distribution, and D(p‖q) denotes the Kullback-Leibler divergence from q to p:

D(p‖q) := ∑j p(j) log( p(j)/q(j) ).

Together with q > 0, the condition β > 0 ensures that P has non-empty interior. (As before, we have dropped the control and row indices a and i.)

Note that both the likelihood and entropy models can be interpreted in terms of an upper bound on the Kullback-Leibler divergence between two distributions. In the likelihood setting, we impose an upper bound on the divergence D(f‖p), from the (unknown) distribution p to the observed distribution f; in the entropy case, we use an upper bound on the divergence from the reference distribution q to the unknown distribution p. This parallel suggests a heuristic to choose the uncertainty level β, by following the same guidelines used in the likelihood setting, as described in Appendix D.

We now address the inner problem (17), with P given above. We note that P actually equals the whole probability simplex if β is too large, specifically if β ≥ maxi(− log q(i)), since the latter quantity is the maximum of the relative entropy function over the simplex. Thus, if β ≥ maxi(− log q(i)), the worst-case value of pTv for p ∈ P is equal to vmax := maxj v(j).

6.2 Dual problem

By standard duality arguments (the set P being of non-empty interior), the inner problem is equivalent to its dual:

minλ>0,µ  µ + βλ + λ ∑j q(j) exp( (v(j) − µ)/λ − 1 ).

Setting the derivative with respect to µ to zero, we obtain the optimality condition

j

q(j) exp

(

v(j) − µ

λ− 1

)

= 1,

23

Page 25: Robust Solutions to Markov Decision Problems with ...elghaoui/Pubs/phd1.pdfOptimal solutions to Markov decision problems may be very sensitive with respect to the state transition

from which we derive

μ = λ log( Σ_j q(j) exp(v(j)/λ) ) − λ.

The optimal distribution is

p*(j) = q(j) exp(v(j)/λ) / Σ_i q(i) exp(v(i)/λ).

As before, we reduce the problem to a one-dimensional problem:

min_{λ>0} σ(λ),

where σ is the convex function

σ(λ) = λ log( Σ_j q(j) exp(v(j)/λ) ) + βλ.   (45)

Perhaps not surprisingly, the above function is closely linked to the moment generating function of a random variable v having the discrete distribution with mass q(j) at v(j).

6.3 A bisection algorithm

As proved in Appendix C, the convex function σ in (45) has the following properties:

∀ λ ≥ 0,  q^T v + βλ ≤ σ(λ) ≤ v_max + βλ,   (46)

and

σ(λ) = v_max + (β + log Q(v)) λ + o(λ),   (47)

where

Q(v) := Σ_{j : v(j)=v_max} q(j) = Prob{v = v_max}.

Hence, σ(0) = v_max and σ′(0) = β + log Q(v). In addition, at infinity the expansion of σ is

σ(λ) = q^T v + βλ + o(1).   (48)

The bisection algorithm can be started with the lower bound λ_− = 0. An upper bound can be computed by solving the equation σ(0) = q^T v + βλ, which yields the initial upper bound λ_+^0 = (v_max − q^T v)/β. By convexity, a minimizer exists in the interval [0, λ_+^0].

Note that if σ′(0) ≥ 0, then λ = 0 is optimal and the optimal value of σ is v_max. This means that if β is too high, that is, if β ≥ − log Q(v), enforcing robustness amounts to disregarding any prior information on the probability distribution p. We observed a similar phenomenon in section 6.1, brought about by too large values of β, which resulted in a set P equal to the probability simplex. Here, the limiting value − log Q(v) depends not only on q but also on v, since we are dealing with the optimization problem (17) and not only with its feasible set P.
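The following Python sketch (ours) implements this bisection for the entropy inner problem; it assumes q > 0 and β > 0, handles the degenerate case σ′(0) ≥ 0 up front, and drives the bisection with the analytic derivative of σ.

    import numpy as np

    def entropy_inner(v, q, beta, tol=1e-10):
        # Worst-case expectation sigma* = max{ p^T v : p in simplex, D(p||q) <= beta },
        # computed via the one-dimensional dual min_{lam > 0} sigma(lam) of (45),
        # with sigma(lam) = lam*log(sum_j q_j*exp(v_j/lam)) + beta*lam.
        # Sketch only; assumes q > 0 (summing to one) and beta > 0.
        v = np.asarray(v, dtype=float)
        q = np.asarray(q, dtype=float)
        vmax = v.max()
        Q = q[v == vmax].sum()
        if beta + np.log(Q) >= 0:             # sigma'(0) >= 0: lam = 0 is optimal
            return vmax
        def sigma_and_slope(lam):
            w = q * np.exp((v - vmax) / lam)  # shifted weights, numerically stable
            W = w.sum()
            sig = vmax + lam * np.log(W) + beta * lam
            slope = vmax / lam + np.log(W) - (w @ v) / (lam * W) + beta
            return sig, slope
        lo, hi = 0.0, (vmax - q @ v) / beta   # initial bracket [0, lam_+^0]
        while hi - lo > tol * max(hi, 1.0):
            mid = 0.5 * (lo + hi)
            if sigma_and_slope(mid)[1] < 0.0:
                lo = mid                      # minimizer lies to the right
            else:
                hi = mid
        return sigma_and_slope(0.5 * (lo + hi))[0]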


6.4 Complexity

The complexity analysis for the entropy model follows the same lines as that of the likelihood model, so we will be brief here. First, to obtain a given accuracy δ on the minimizer, we need O(log(v_max/δ)) iterations, since λ_+^0 = O(v_max). To obtain a given accuracy on the minimum value, the important feature is to ensure that the derivative of the function σ is bounded uniformly, independently of the problem size n, at least on one side of the optimum. In the entropy case, we have at each step of the bisection algorithm 0 ≤ σ′(λ_+) ≤ lim_{λ→+∞} σ′(λ) = β. We conclude that, to achieve a given accuracy δ on σ*, we need O(log(v_max/δ)) iterations of the bisection algorithm. Since each iteration requires O(n) flops, the total complexity of the inner problem in the entropy case is again O(n log(v_max/δ)).

7 Other Uncertainty Models

7.1 Finite scenario model

Perhaps the simplest uncertainty model involves a finite collection of transition matrices, where for every a ∈ A, P^a = {P^{a,1}, . . . , P^{a,L}}, with each P^{a,k} ∈ Θ_n representing a possible value (scenario) of the transition matrix. As noted earlier, the robust Bellman recursion applies to non-convex uncertainty sets; in fact, the scenario model gives rise to the same optimal robust policy as its convex hull, which corresponds to an uncertainty set given as a polytope described by its vertices.

Under the scenario (or polytopic) model, the inner problem (17) bears a particularly simple form:

σ_{P_i^a}(v) = max_{p ∈ {p_i^{a,1}, ..., p_i^{a,L}}} v^T p = max_{1≤k≤L} v^T p_i^{a,k},

where we denote by p_i^{a,k} the i-th row of matrix P^{a,k}. The worst-case complexity of each step of the robust Bellman recursion is then O(mnL), where L is the number of vertices. For moderately large values of L, the scenario model is attractive, due to its simplicity of implementation.
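In code this step is essentially a one-liner; the sketch below (ours) makes the O(nL) cost explicit.

    import numpy as np

    def scenario_inner(v, candidate_rows):
        # Worst case over the scenarios {p_i^{a,1}, ..., p_i^{a,L}}: a plain maximum
        # of L inner products, i.e. O(nL) work per (state, action) pair.
        return max(float(np.dot(v, p)) for p in candidate_rows)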

7.2 Interval matrix model

The interval matrix model describes the uncertainty on the rows of the transition matrices in the form

P = { p : p̲ ≤ p ≤ p̄, p^T 1 = 1 },

where p̲, p̄ are given componentwise non-negative n-vectors (whose elements do not necessarily sum to one), with p̄ ≥ p̲. Note that, for theorem 1 to hold, we must ensure that the set P is entirely included in the probability simplex Δ_n, which we did by assuming p̲ ≥ 0. This model is motivated by statistical estimates of confidence intervals on the components of the transition matrix. Those intervals can be obtained by resampling methods, or by projecting an ellipsoidal uncertainty model on each component axis (see section 7.3). Since p̄ ≥ p̲, P is not empty.

Since the inner problem

σ* := max_p { v^T p : p^T 1 = 1, p̲ ≤ p ≤ p̄ }

is a linear, feasible program, it is equivalent to its dual, which can be reduced to

σ* = min_μ  (p̄ − p̲)^T (μ1 − v)_+ + v^T p̄ + μ(1 − p̄^T 1),

where z_+ stands for the positive part of vector z. The function to be minimized is a convex piecewise-linear function with break points v(0) := 0 and v(1), . . . , v(n). Since the original problem is feasible, we have 1^T p̲ ≤ 1, which implies that the function above goes to infinity as μ → ∞. Thus, the minimum of the function is attained at one of the break points v(i), i = 0, . . . , n. The complexity of this enumerative approach is O(n²), since each evaluation costs O(n). In fact, one does not need to evaluate the function at all values v(i); a bisection scheme over the discrete set {v(0), . . . , v(n)} suffices, bringing the complexity down to O(n log n).
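As a sketch (ours), the inner problem can also be solved by an equivalent greedy allocation rather than the dual breakpoint search described above: start from the lower bounds and spend the remaining probability mass on the coordinates with the largest v(j) first. The cost is likewise dominated by a sort, O(n log n).

    import numpy as np

    def interval_inner(v, p_low, p_high):
        # max{ v^T p : p_low <= p <= p_high, 1^T p = 1 }.
        # Greedy allocation: start at the lower bounds, then spend the remaining
        # probability mass on the coordinates with the largest v(j) first.
        # Assumes sum(p_low) <= 1 <= sum(p_high), i.e. P is non-empty.
        v = np.asarray(v, dtype=float)
        p = np.asarray(p_low, dtype=float).copy()
        cap = np.asarray(p_high, dtype=float) - p
        budget = 1.0 - p.sum()
        for j in np.argsort(-v):              # most valuable coordinates first
            add = min(cap[j], budget)
            p[j] += add
            budget -= add
            if budget <= 0.0:
                break
        return float(v @ p)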

7.3 Ellipsoidal models

Ellipsoidal models arise when second-order approximations are made to the log-likelihood function arising in the likelihood model. Specifically, we work with the following set in lieu of (36):

P(β) = { P ∈ R^{n×n} : P ≥ 0, P1 = 1, Q(P) ≥ β },   (49)

where Q(P) is the second-order approximation to the log-likelihood function L around the maximum-likelihood estimate F:

Q(P) := β_max − (1/2) Σ_{i,j} (P(i, j) − F(i, j))² / F(i, j).

The above set is an ellipsoid intersected with the polytope of transition matrices. Again, the projection on the space of i-th row variables assumes a similar shape, that of an ellipsoid intersected with the probability simplex, specifically

P_i(β) = { p : p ≥ 0, p^T 1 = 1, Σ_j (p_i(j) − f_i(j))² / f_i(j) ≤ κ² },

where κ² := 2(β_max − β). We refer to the above model as the constrained ellipsoidal model. Under this model, the inner problem assumes the form

max_p { v^T p : p ≥ 0, p^T 1 = 1, Σ_j (p(j) − f(j))² / f(j) ≤ κ² }.


Using an interior-point method [Boyd and Vandenberghe, 2004], the above problem can be solved with absolute accuracy ε in worst-case time O(n^{1.5} log(v_max/ε)), and with a practical complexity of O(n log(v_max/ε)).

In statistics, it is a standard practice to further simplify the description above, by relaxing the inequality constraints P ≥ 0 in the definition of P(β). This would bring the worst-case complexity down to O(n log(v_max/ε)). However, if sign constraints are omitted, theorem 1 does not necessarily hold, and we would only compute an upper bound on the value of the problem.
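For completeness, here is a sketch (ours) of the constrained ellipsoidal inner problem written with the off-the-shelf modeling package cvxpy; this is merely a convenient way to check small instances, not the specialized interior-point scheme whose complexity is quoted above.

    import numpy as np
    import cvxpy as cp

    def ellipsoidal_inner(v, f, kappa):
        # max{ v^T p : p >= 0, 1^T p = 1, sum_j (p_j - f_j)^2 / f_j <= kappa^2 },
        # posed as a small convex (second-order cone representable) program.
        # Sketch only; f must be positive componentwise.
        v = np.asarray(v, dtype=float)
        f = np.asarray(f, dtype=float)
        p = cp.Variable(v.size)
        constraints = [p >= 0,
                       cp.sum(p) == 1,
                       cp.sum(cp.square(p - f) / f) <= kappa ** 2]
        problem = cp.Problem(cp.Maximize(v @ p), constraints)
        problem.solve()
        return problem.value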

8 Example: Robust Aircraft Routing

We consider the problem of routing an aircraft whose path is obstructed by stochastic obstacles, representing storms. In practice, the stochastic model must be estimated from past weather data. This makes this particular application a good illustration of our method.

8.1 The nominal problem

In [Nilim et al., 2001], we have introduced an MDP representation of the problem, in which the evolution of the storms is modelled as a perfectly known stationary Markov chain. The term nominal here refers to the fact that the transition matrix of the Markov process corresponding to the weather is not subject to uncertainty. The goal is to minimize the expected delay (flight time). The weather process is a fully observable Markov chain: at each decision stage (every 15 minutes in our example), we learn the actual state of the weather.

The airspace is represented as a rectangular grid. The state vector comprises the current position of the aircraft on the grid, as well as the current state of each storm. The action in the MDP corresponds to the choice of the node to fly towards, from any given node. There are k obstacles, represented by a Markov chain with a 2^k × 2^k transition matrix. The transition matrix for the routing problem is thus of order N·2^k, where N is the number of nodes in the grid.

We solved the MDP via the Bellman recursion [Nilim et al., 2001]. Our framework avoids the potential "curse of dimensionality" inherent in generic Bellman recursions, by considerable pruning of the state-space and action sets. This makes the method effective for up to a few storms, which corresponds to realistic situations. For more details on the nominal problem and its implementation, we refer the reader to [Nilim et al., 2001].

In the example below, the problem is two-dimensional in the sense that the aircraft flies at a fixed altitude. In a coordinate system where each unit equals 1 nautical mile, the aircraft is initially positioned at (0, 0) and the destination point is at (360, 0). The velocity of the aircraft is fixed at 480 n.mi/hour. The airspace is described by a rectangular grid with N = 210 nodes, with an edge length of 24 n.mi. There is a possibility that a storm might obstruct the flight path. The storm zone is a rectangular region with corner points at (160, 192), (160, −192), (168, 192) and (168, −192) (Figure 1).


Since there is only one potential storm in the area, the storm dynamics are described by a 2 × 2 transition matrix P_weather. Together with N = 210 nodes, this results in a state-space of total dimension 420. By limiting the angular changes in the heading of the aircraft, we can prune the action space and reduce its cardinality at each step to m = 4. This implies that the transition matrices are very sparse; in fact, they are sparse, affine functions of the transition matrix P_weather. Sparsity implies that the nominal Bellman recursion only involves 8 states at each step.

8.2 The robust version

In practice, the transition matrix P_weather is estimated from past weather data, and thus it is subject to estimation errors.

We assumed a likelihood model of uncertainty on this transition matrix. This results in a likelihood model of uncertainty on the state transition matrix, which is as sparse as the nominal transition matrix. Thus, the effective state pruning that takes place in the nominal model can also take place in the robust counterpart. In our example, we chose the numerical value

P_weather = [ 0.9  0.1 ]
            [ 0.1  0.9 ]

for the maximum-likelihood estimate of P_weather.

The likelihood model involves a lower bound β on the likelihood function, which is a measure of the uncertainty level. Its maximum value β_max corresponds to the case with no uncertainty, and decreasing values of β correspond to higher uncertainty levels. To β, we may associate a measure of uncertainty that is perhaps more readable: the uncertainty level, denoted by U_L, is defined as a percentage, and its complement 1 − U_L can be interpreted as a probabilistic confidence level in the context of large samples. The one-to-one correspondence between U_L and β is described precisely in Appendix D.

In figure 2, we plot U_L against decreasing values of the lower bound on the log-likelihood function, β. We see that U_L = 0, which corresponds to complete certainty about the data, is attained at β = β_max, the maximum value of the likelihood function. The value of U_L decreases with β and reaches its maximum value of 100% at β = −∞ (not drawn in this plot). Note that the rate of increase of U_L is largest at β = β_max and increases with β.

8.3 Comparing robust and nominal strategies

In figure 3, we compare various strategies: we plot the relative delay, which is the relative increase (in percentage) in flight time with respect to the flight time corresponding to the most direct route (straight line), against the negative of the lower bound on the likelihood function β.

We compare three strategies. The conservative strategy is to avoid the storm zone altogether. If we take β = β_max, the uncertainty set becomes a singleton (U_L = 0) and hence we obtain the solution computed via the classical Bellman recursion; this is referred to as the nominal strategy. The robust strategy corresponds to solving our robust MDP with the corresponding value of β.

The plot in figure 3 shows how the various strategies fare as we decrease the bound on the likelihood function β. For the nominal and the robust strategies, and a given bound β, we can compute the worst-case delay using the recursion (9), which provides the worst-case value function.

The conservative strategy incurs a 51.5% delay with respect to the flight time corresponding to the most direct route. This strategy is independent of the transition matrix, so it appears as a straight line in the plot. If we know the value of the transition matrix exactly, then the nominal strategy is extremely efficient and results in a delay of only 8.02%. As β deviates from β_max, the uncertainty set gets bigger. In the nominal strategy, the optimal value is very sensitive in the range of values of β close to β_max: the delay jumps from 8% to 25% when β changes by 7.71% with respect to β_max (the uncertainty level U_L changes from 0% to 5%). In comparison, the relative delay jumps by only 6% with the robust strategy. For both strategies, the slope of the optimal value with respect to the uncertainty is almost infinite at β = β_max, which shows the high sensitivity of the value function with respect to the uncertainty.

We observe that the robust solution performs better than the nominal solution as the estimation error increases. The plot shows an average 19% decrease in delay with respect to the nominal strategy when uncertainty is present. Further, as the uncertainty level increases, the nominal strategy very quickly reaches delay values comparable to those obtained with the conservative strategy. In fact, the conservative strategy even outperforms the nominal strategy at β = −1.84, which corresponds to U_L = 69.59%. In this sense, even for moderate uncertainty levels, the nominal strategy defeats its purpose. In contrast, the robust strategy outperforms the conservative strategy by 15% even if the data is very uncertain (U_L = 85%).

In summary, when there is no estimation error, both the nominal and robust algorithms provide a strategy with 43.3% less delay than the conservative strategy. However, in the presence of even a moderate estimation error, the robust strategy still performs much better than the conservative strategy, whereas the nominal MDP strategy cannot produce a much better result.

Nominal and robust strategies have similar computational requirements. In our example, with a simple Matlab implementation on a standard PC, the running time for the nominal algorithm was about 4 seconds, and the robust version took on average 4 more seconds to solve.

8.4 Inaccuracy of uncertainty level

The previous comparison assumes that, in the robust case, we are able to estimate the precise value of the uncertainty level U_L (or the bound on the likelihood function β). In practice, this parameter also has to be estimated. Hence the question: how sensitive is the robust approach with respect to inaccuracies in the uncertainty level U_L?

To answer this question in our particular example, we have assumed that a guess U_L^0 on the uncertainty level is available, and examined how the corresponding robust solution would behave if it were subject to uncertainty with a level above or below the guess.

In figure 4, we compare various strategies. For each strategy, we guess a desired level of accuracy U_L^0 on the data and calculate a corresponding likelihood bound β^0. We choose the optimal action using our robust MDP algorithm applied with this bound. Keeping the resulting policy fixed, we then compute the relative delay for various values of β. In figure 4, we plot the relative delays against −β for the strategies where the uncertainty levels were guessed as 15% and 55%.

Not surprisingly, the relative delay of a strategy attains its minimum value when β (U_L) is accurately predicted. For values of β above or below its guessed value, the delay increases. We note that it is only for very small uncertainty levels (within 0.995% of β_max) that the nominal strategy performs better than the robust strategy with imperfect prediction of β (U_L).

We define R_{U_L} as the range of the actual uncertainty level U_L, in percentage terms, over which the robust strategy (with imperfect prediction of U_L) performs worse than the nominal strategy. In figure 5, we show R_{U_L} against the guessed value U_L^0. The plot clearly shows that R_{U_L} remains less than 1% for all predicted values U_L^0.

Our example shows that if we predict the uncertainty level inaccurately in order to obtain a robust strategy, the nominal strategy will outperform the robust strategy only if the actual uncertainty level U_L is less than 1%. For any higher value of the uncertainty level, the robust strategies outperform the nominal strategy, by an average of 13%. Thus, even if the uncertainty level is not accurately predicted, the robust solution outperforms the nominal solution significantly.

9 Concluding remarks

We have considered a robust Markov decision problem with uncertainty models for the transition matrices that are statistically accurate, yet give rise to very moderate extra computational effort for computing a robust solution, with respect to a nominal solution where uncertainty is ignored. Specifically, the relative increase in computational cost is of order O(log(N/ε)) in the finite-horizon case, and O(log(1/ε)) in the infinite-horizon case, where ε is the desired accuracy on the optimal expected total cost. As a result, the robust algorithm has practically the same complexity as the nominal one. We have considered both stationary and time-varying assumptions about uncertainty, and showed that as the decision horizon goes to infinity, the gap between these two models vanishes. This justifies our use of bounds based on stationarity assumptions, even if we allow time-varying changes in the transition matrices. The statistical accuracy of our uncertainty models derives from the fact that they use the Kullback-Leibler divergence, which is a natural way to measure errors in the transition matrices. The other models we have considered, from the polytopic to the interval to the ellipsoidal model, do not enjoy such properties, and moreover give rise to larger worst-case complexity estimates.

We have shown, in a practical path planning example, the benefits of using a robust strategy instead of the classical optimal strategy; even if the uncertainty level is only crudely guessed, the robust strategy yields a much better worst-case expected flight delay.

Acknowledgments

The authors would like to thank Antar Bandyopadhyay, Bob Barmish, Giuseppe Calafiore, Vu Duong, Mikael Johansson, Rupak Majumdar, Andrew Ng, Stuart Russell, Shankar Sastry, Ben Van Roy and Pravin Varaiya for interesting discussions and comments. The authors are grateful to Dimitri Bertsimas for pointing out a mistake in an earlier version of the paper. The authors are especially thankful to the anonymous reviewers whose interesting comments prompted a significant portion of this work. This research was funded in part by Eurocontrol-014692, DARPA-F33615-01-C-3150, and NSF-ECS-9983874.

A Stochastic game theoretic proof of the robust Bellman recursion

In this section, we prove that the stochastic game (4) can be solved using the robust Bellman recursion (6). Our proof is based on transforming the original problem into a term-based zero-sum game, and applying a result by Nowak [Nowak, 1984] that applies to such games.

We begin by augmenting the state space X with states of the form (i, a), where i ∈ X and a ∈ A. The augmented state-space is thus X^aug := X ∪ (X × A). We now define a new two-player game on this augmented state-space, where decisions are taken not only at times t, t ∈ T = {0, 1, . . . , N}, but also at intermediate times t + 1/2, t ∈ T.

In the first step, from t to t + 1/2, if the controller action is a_t, states of the form i make a transition to states of the form (i, a_t) with probability one, and all other states stay the same with probability one. Here, the opponent is idle. The cost incurred by this step is the cost of the original problem, c_t(i, a_t), if we start from state i, and zero otherwise.

In the second step, from t + 1/2 to t + 1, the controller stands idle while the opponent acts as follows. The states of the form (i, a) make a transition to states of the form j with probabilities given by the vector p_i^a, where p_i^a is chosen freely by the opponent in the set P_i^a; all the other states stay the same with probability one. There is no cost incurred at this stage.

Clearly, starting at time t in state i, and with a controller action a_t, we end up in state j at time t + 1 with probability P^{a_t}(i, j). Since incurred costs are the same, our new game is equivalent to the original game. In addition, the new game is a term-based zero-sum game, since the controller and the opponent act alternately, in independent fashion, at each time step.

Nowak's result provides a Bellman-type recursion to solve the problem of minimizing the worst-case (maximum) expected cost of a term-based zero-sum game, when both players follow randomized policies that are restricted to given state-dependent compact subsets of the probability simplex. In our new game, the opponent's choice of a vector p_i^a within P_i^a at the second step can be interpreted as a choice of a randomized policy over the compact, state-dependent set B((i, a)) := P_i^a. (Here, the deterministic actions of the opponent correspond to the vertices of the probability simplex of R^n.) Hence, the results due to Nowak [Nowak, 1984] apply.

In the case when one player (say, the first) acts deterministically, with a state-independent, finite action set A, the recursion for the optimal value function in state s can be written as

v_t(s) = min_{a ∈ A} max_{b ∈ B(s)} E_b ( c_t(s, a, b) + Σ_{s'} P^{ab}(s, s') v_{t+1}(s') ),   (50)

where the notation a, b refers to actions of the minimizing and maximizing player, respectively, P^{ab} is the corresponding transition matrix, c_t is the cost function, b refers to a particular randomized action that is freely chosen by the opponent within the state-dependent compact set B(s), and E_b is the corresponding expectation operator.

Let us detail how applying the above recursion to our game yields our result.

Denote by v_t(s) the value function of the game at time t in state s ∈ X^aug. We first update this value function from time t + 1 to t + 1/2. The controller is idle, but the opponent is allowed to choose a randomized policy from a state-dependent compact set. If the state is (i, a), this set is B((i, a)) = P_i^a, and the value function is updated as

v_{t+1/2}((i, a)) = max_{p ∈ P_i^a} ( Σ_{j=1}^n p(j) v_{t+1}(j) ),   (51)

where we make use of the fact that incurred costs are zero in this step. To update the value function from t + 1/2 to t, we use the fact that the opponent is idle. For i = 1, . . . , n, the value function is updated as

v_t(i) = min_{a ∈ A} ( c_t(i, a) + v_{t+1/2}((i, a)) ).   (52)

Combining (52) and (51) ends our proof.
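To make the combined update concrete, here is a small Python sketch (ours, with hypothetical data layouts) of one backward step of the robust recursion, with the inner maximization (51) delegated to an oracle such as the entropy or interval routines sketched earlier.

    import numpy as np

    def robust_bellman_step(cost_t, unc, v_next, sigma):
        # One backward step of the robust recursion, combining (51) and (52):
        #   v_t(i) = min_a [ c_t(i, a) + max_{p in P_i^a} p^T v_{t+1} ].
        # cost_t[i][a] is c_t(i, a); unc[i][a] is whatever data describes P_i^a;
        # sigma(data, v) is an oracle returning max{ p^T v : p in P } for that data.
        v_t = np.empty(len(v_next))
        for i in range(len(v_next)):
            v_t[i] = min(cost_t[i][a] + sigma(unc[i][a], v_next)
                         for a in range(len(cost_t[i])))
        return v_t

    # Example wiring for the entropy model of section 6:
    #   sigma = lambda data, v: entropy_inner(v, data["q"], data["beta"])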

B Properties of function σ of section 5.3

Here, we prove two properties of the function σ involved in the bisection algorithm of section 5.3. For simplicity of notation, we assume that there is a unique index i* achieving the maximum in v_max, that is, v(i*) = v_max.

We first show that σ(µ) → v_max as µ → v_max. We have

λ(µ) = (µ − v(i*)) / f(i*) + o(µ − v(i*)).

We then express σ(µ) as

σ(µ) = µ − λ(µ) ( 1 + β − β_max + log λ(µ) − Σ_{j≠i*} f(j) log(µ − v(j)) ) − λ(µ) f(i*) log(µ − v(i*)).


The second term vanishes as µ → v_max, since λ(µ) → 0 (and λ(µ) log λ(µ) → 0). In view of the expression of λ(µ) above, the last term behaves as (µ − v(i*)) log(µ − v(i*)), which also vanishes.

Next, we prove that σ′(µ) → −∞ as µ → v_max. We easily obtain

dλ(µ)/dµ = ( Σ_j f(j)/(µ − v(j))² ) / ( Σ_j f(j)/(µ − v(j)) )²  →  1/f(i*)  as µ → v(i*).

We then have

∂h/∂λ (λ(µ), µ) = Σ_j log( λ(µ) f(j)/(µ − v(j)) ) − β
                = log( λ(µ) f(i*)/(µ − v(i*)) ) + Σ_{j≠i*} log( λ(µ) f(j)/(µ − v(j)) ) − β
                = log(1 + o(1)) + (n − 1) log λ(µ) + Σ_{j≠i*} log( f(j)/(µ − v(j)) ) − β
                → −∞  as µ → v(i*).

Also, by definition of λ(µ), we have ∂h/∂µ (λ(µ), µ) = 0. The proof is then completed by invoking identity (42).

C Properties of function σ of section 6.3

In this section, we prove that the function σ defined in (45) obeys properties (46), (47) and (48).

First, we prove (47). If v(j) = v_max for every j, the result holds, with Q(v) = Q(v_max·1) = 1. Assume now that there exists j such that v(j) < v_max. We have

σ(λ) = λ log( e^{v_max/λ} Σ_j q(j) exp((v(j) − v_max)/λ) ) + βλ
     = v_max + βλ + λ log( Σ_{j: v(j)=v_max} q(j) + Σ_{j: v(j)<v_max} q(j) exp((v(j) − v_max)/λ) )
     = v_max + βλ + λ log( Q + O(e^{−t/λ}) )
     = v_max + (β + log Q) λ + O(λ e^{−t/λ}),

where t = v_max − v_s > 0, with v_s the largest v(j) < v_max. This proves (47).

From the expression of σ given in the second line above, we immediately obtain the upper bound in (46).


The expansion of σ at infinity provides

σ(λ) = βλ + λ log( Σ_j q(j) (1 + v(j)/λ + o(1/λ)) ) = q^T v + βλ + o(1),

which proves (48). The lower bound in (46) is a direct consequence of the concavity of the log function.

D Calculation of β for a Desired Confidence Level

In this section, we describe a one-to-one correspondence between a lower bound β on the log-likelihood function, as used in section 5, and a desired level of confidence 1 − U_L on the transition matrix estimates, as used in section 8. This correspondence is valid for asymptotically large samples only, but can serve as a guideline to choose β. The following material is standard; see for instance [Lehmann, 1986].

First, we define a vector θ ∈ R^{n(n−1)} which contains the first n − 1 columns to be estimated in an n × n transition matrix P. We order θ so that P(i, j) = θ((n − 1)(i − 1) + j) for 1 ≤ i ≤ n, 1 ≤ j ≤ n − 1. Using the conditions P1 = 1, we can write P as an (affine) function of θ, and express the log-likelihood function L(P) of (35) as a function l(θ). Let θ̂ be the vector corresponding to the matrix of empirical frequencies F, which we assumed to be positive componentwise. Provided some regularity conditions hold, one can show that for asymptotically large samples, θ̂ is normally distributed with mean θ and inverse covariance matrix H = −E_θ((∇²l)(θ)). Furthermore, we can approximate H by the observed information matrix Ĥ := −(∇²l)(θ̂). In our case, the non-zero elements of this matrix are

Ĥ((n−1)(i−1)+j, (n−1)(i−1)+k) = 1/F(i, n) + 1/F(i, j)   if j = k,
                                 1/F(i, n)              otherwise.

If q denotes the quadratic approximation to l around θ̂, we have

q(θ) = β_max − (1/2) (θ − θ̂)^T Ĥ (θ − θ̂),

where β_max is the maximal log-likelihood defined in section 5.1. Then, the parameter β is chosen to be the smallest value such that, under the Gaussian probability distribution N(θ̂, Ĥ^{-1}), the set {θ : q(θ) ≥ β} has probability larger than a given threshold 1 − U_L, where (say) U_L = 15% in order to obtain the 85% confidence level. It turns out that we can solve for such a β explicitly:

1 − U_L = F_{χ²_{n(n−1)}}( 2(β_max − β) ),   (53)

where F_{χ²_d} is the cumulative distribution function of the χ²-distribution with d degrees of freedom. The latter can be approximated as follows [Pitman, 1993]:

F_{χ²_d}(ξ) ≈ Φ(z) − ( √2 / (3√d) ) (z² − 1) φ(z),   (54)


where z = (ξ − d)/√d, φ(z) = (1/√(2π)) e^{−z²/2}, and Φ(z) = ∫_{−∞}^{z} φ(u) du is the standard normal cumulative distribution function.
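In code, relation (53) can be inverted directly; the sketch below (ours) uses the exact chi-square quantile from scipy rather than the normal approximation (54). For the 2 × 2 weather matrix of section 8, n = 2 and hence d = 2.

    from scipy.stats import chi2

    def beta_from_uncertainty_level(beta_max, UL, n):
        # Invert relation (53): 1 - UL = F_{chi^2, n(n-1)}(2*(beta_max - beta)),
        # i.e. beta = beta_max - 0.5 * F^{-1}(1 - UL).  Here UL is a fraction
        # (0.15 for the 85% confidence level) and n is the number of states.
        d = n * (n - 1)
        return beta_max - 0.5 * chi2.ppf(1.0 - UL, df=d)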

References

[Abbad et al., 1992] Abbad, M., Filar, J., and Bielecki, T. (1992). Algorithms for singularly perturbed limiting average Markov control problems. IEEE Transactions on Automatic Control, 37:1421–1425.

[Abbad and Filar, 1992] Abbad, M. and Filar, J. A. (1992). Perturbation and stability theory for Markov control problems. IEEE Transactions on Automatic Control, 37:1415–1420.

[Bagnell et al., 2001] Bagnell, J., Ng, A., and Schneider, J. (2001). Solving uncertain Markov Decision Problems. Technical Report CMU-RI-TR-01-25, Robotics Institute, Carnegie Mellon University.

[Berstsekas and Tsitsiklis, 1996] Berstsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Massachusetts.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, Cambridge, U.K.

[Feinberg and Shwartz, 2002] Feinberg, E. and Shwartz, A. (2002). Handbook of Markov Decision Processes, Methods and Applications. Kluwer Academic Publishers, Boston.

[Ferguson, 1974] Ferguson, T. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics, 2(4):615–629.

[Givan et al., 1997] Givan, R., Leach, S., and Dean, T. (1997). Bounded parameter Markov Decision Processes. In Fourth European Conference on Planning, pages 234–246.

[Kalyanasundaram et al., 2001] Kalyanasundaram, S., Chong, E., and Shroff, N. (2001). Markov Decision Processes with uncertain transition rates: Sensitivity and robust control. Technical report, Department of ECE, Purdue University, West Lafayette, Indiana, USA.

[Lehmann, 1986] Lehmann, E. (1986). Testing Statistical Hypotheses. Wiley, New York, USA.

[Lehmann and Casella, 1998] Lehmann, E. and Casella, G. (1998). Theory of Point Estimation. Springer-Verlag, New York, USA.

[Mine and Osaki, 1970] Mine, H. and Osaki, S. (1970). Markov Decision Processes. American Elsevier Publishing Company Inc.


[Nilim et al., 2001] Nilim, A., El Ghaoui, L., Hansen, M., and Duong, V. (2001). Trajectory-based Air Traffic Management (TB-ATM) under weather uncertainty. In Proceedings of the 4th USA/Europe ATM R&D Seminar.

[Nowak, 1984] Nowak, A. S. (1984). On zero-sum stochastic games with general state space. I. Probability and Mathematical Statistics, 4(1):13–32.

[Pitman, 1993] Pitman, J. (1993). Probability. Springer-Verlag, New York, USA.

[Poor, 1988] Poor, H. (1988). An Introduction to Signal Detection and Estimation. Springer-Verlag, New York.

[Putterman, 1994] Putterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York.

[Satia and Lave, 1973] Satia, J. K. and Lave, R. L. (1973). Markov Decision Processes with Uncertain Transition Probabilities. Operations Research, 21(3):728–740.

[Shapiro and Kleywegt, 2002] Shapiro, A. and Kleywegt, A. J. (2002). Minimax analysis of stochastic problems. Optimization Methods and Software. To appear.

[Siouris, 1995] Siouris, G. (1995). Optimal Control and Estimation Theory. Wiley-Interscience, New York, USA.

[White and Eldeib, 1994] White, C. C. and Eldeib, H. K. (1994). Markov Decision Processes with imprecise transition probabilities. Operations Research, 42(4):739–749.

[Wilks, 1962] Wilks, S. (1962). Mathematical Statistics. Wiley-Interscience, New York, USA.


Figure 1: Aircraft Path Planning Scenario. (Both axes are in nautical miles; the plot shows the origin, the destination, and the stochastic obstacle.)

Figure 2: −β (negative lower bound on the log-likelihood function) vs. U_L (uncertainty level, in %, of the transition matrices).


Figure 3: Optimal value (relative delay, in %) vs. uncertainty level (negative lower bound on the log-likelihood function), for both the classical Bellman recursion and its robust counterpart; the conservative, nominal and robust strategies are shown, with markers at U_L = 0%, 50% and 80%.

Figure 4: Optimal value (relative delay, in %) vs. uncertainty level (negative lower bound on the log-likelihood function), for the classical Bellman recursion and its robust counterpart, with exact and inexact predictions of the uncertainty level U_L (guesses U_L^0 = 15% and U_L^0 = 55%).


Figure 5: Predicted uncertainty level U_L^0 vs. R_{U_L}, the range of the actual uncertainty level U_L (in %) over which the nominal strategy performs better than a robust strategy computed with the imperfect prediction U_L^0.
