
Information collection in a linear program

Ilya O. Ryzhov
Operations Research and Financial Engineering, Princeton University, Princeton NJ 08544

email: [email protected] http://www.princeton.edu/~iryzhov

Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton NJ 08544

email: [email protected] http://castlelab.princeton.edu

Consider a linear program where the cost coefficients are uncertain, for which we have a Bayesian prior. We can collect information to improve our understanding of these coefficients, but this may be expensive, giving us a separate problem of optimizing the collection of information to improve the quality of the solution relative to the true cost coefficients. We formulate this information collection problem for linear programs for the first time, and derive a knowledge gradient policy which maximizes the marginal value of each measurement. We prove that this policy is asymptotically optimal, and demonstrate the performance of our algorithm on a network flow problem.

Key words: Stochastic optimization; Optimal learning; Sequential learning; Value of information

MSC2000 Subject Classification: Primary: 62L10, 90C15; Secondary: 90C39

OR/MS subject classification: Primary: sequential decision analysis, theory of decision analysis; Secondary: stochastic programming, dynamic programming models

1. Introduction Consider the standard form of a linear program, given by

\[
V(c) = \max_x \; c^T x \quad \text{s.t.} \quad Ax = b, \; x \ge 0, \tag{1}
\]

and suppose that the vector c of objective coefficients is unknown. However, we have certain initial information about the problem that allows us to construct a Bayesian prior around c. Thus, we view c as a random variable whose distribution represents the uncertainty in our beliefs. We also have the ability to make noisy measurements of individual coefficients of c. Each measurement provides new information that can be used to update and improve our beliefs about the objective coefficients. Measurements are assumed to be time-consuming and expensive; we can choose any coefficient to measure at any time, but the total budget allocated for measurements is finite. However, our beliefs about different coefficients may be correlated, meaning that a single measurement could potentially provide useful information about more than one component of c. Our problem is to determine which coefficients to measure in order to come as close as possible to the optimal solution of the true LP in (1). Our decision at each time step is informed by what we have learned from prior measurements.

In optimal learning problems, we have an underlying optimization model with unknown parameters, and we must find a way to efficiently collect information that will allow us to improve our ability to optimize. Linear programming presents us with a general and powerful optimization framework. Examples of linear programming problems potentially involving optimal learning include the following:

(i) Logistics. We are using a network flow model to design shipment routes from suppliers to retail centers to satisfy customer demand (Ahuja et al. 1992 [2]). The purchasing cost from a vendor is known, but the vendor's service reliability (which impacts inventory cost) is unknown. The only way to learn about service is by using the vendor for a period of time. Trying one vendor may teach us about other vendors in the same region (delays may be due to shipping problems from that region) or about a type of product, which may have common suppliers of its own.

(ii) Marketing and production. A production problem can be written as a linear program maximizing profit subject to resource constraints. The profit margins are uncertain, and the exact optimal production strategy is unknown. However, we can perform market studies to improve our beliefs about the profitability of different products. Each study costs both time and money, and so only a small number of studies can be performed.

(iii) Agriculture. In agricultural planning, linear programming can be used to obtain optimal crop rotations (El-Nazer and McCarl 1986 [17], Haneveld and Stegemann 2005 [30]). The exact yield from planting certain fields is unknown, though prior estimates can be constructed via regression techniques on historical data. We can obtain new information through soil analysis on different fields. In particular, if analysis of one field suggests that crop yields there may be higher than expected, we might believe that crop yields from nearby fields may also be higher.

The basic idea of a linear program with stochastic parameters has been around for many years. This problem was originally formulated by Dantzig (1955) [15]. The proposed approach in this early study was to solve a deterministic LP where c was replaced with its mean. The problem of finding IE V(c), the expected optimal value of the linear program, was approached by Madansky (1960) [39] and Itami (1974) [31]. These studies considered various theoretical properties of the problem, but computing the desired expectation still posed a challenge. Approximations for the multi-stage version of this problem were considered by Birge (1982) [9]. Multi-stage stochastic problems are extensively studied in the area of stochastic programming; see Kolbin (1977) [38], Kall and Wallace (1994) [34], Birge and Louveaux (1997) [10] or Ruszczynski and Shapiro [44] for an introduction. However, stochastic programming models typically assume a fixed distribution for the unknown parameters, and do not consider the additional dimension of sequential learning, where our distribution of belief evolves over time.

Sequential optimal learning has been widely studied, but mostly for very simple underlying optimization models. In ranking and selection (Bechhofer et al. 1995 [4], Kim and Nelson 2006, 2007 [35, 36]) and multi-armed bandit problems (Berry and Fristedt (1985) [6], Gittins (1989) [25], Auer et al. [3]), there is a finite set of alternatives with unknown values, and the optimization consists of simply choosing the alternative with the highest value. A recent study by Ryzhov and Powell (2010) [46] considered a learning problem with a more complex optimization component, namely a shortest-path problem on a graph. This work was the first to separate the measurement decision (measuring a single edge to collect information about its length) from the implementation decision (the shortest-path solution). The problem considered in this paper has a similar distinction: we measure a single objective coefficient with the goal of learning about the optimal solution to the entire LP. However, the linear programming framework is more general than the shortest-path problem. We also use a more general learning model that allows us to incorporate correlations into the Bayesian belief structure, whereas Ryzhov and Powell (2010) [46] assumes that the edges in the graph are independent.

We approach the problem using the method of knowledge gradients (KG), originally developed by Gupta and Miescke (1994) [27] and Gupta and Miescke (1996) [28], and later studied by Frazier et al. (2008) [20], in the context of the simple ranking and selection problem. In this approach, each measurement is chosen to maximize the expected value of information that can be collected at a given time step. The initial work on KG considered ranking and selection with independent Gaussian priors. It was found, however, that the KG concept can be used as a general methodology for learning problems, potentially involving more complicated statistical and optimization components. Variants of ranking and selection where KG algorithms have been applied include problems with unknown measurement noise (Chick et al. 2010 [12]), or correlated beliefs (Frazier et al. 2009 [21]), or where the values of the alternatives are described using a linear regression model (Negoescu et al. 2010 [41]). Other learning problems where KG algorithms have been studied include multi-armed bandits (Ryzhov and Powell 2009 [45], Ryzhov et al. 2009 [47]), information collection on a graph (Ryzhov and Powell 2010 [46]), and Markov decision processes with unknown transition probabilities (Ryzhov et al. 2010 [48]).

In the problem of learning in an LP, the KG algorithm consists of measuring the objective coefficient that yields the greatest expected improvement in our beliefs about the optimal value of the LP. Thus, the primary computational challenge is to find the expected value of the future LP that we will obtain after the measurement is completed. Conditionally, this future problem can be written as a parametric LP with a single stochastic parameter. Parametric linear programming typically considers deterministic problems (Gass and Saaty 1955 [23], Murty 1980 [40], Adler and Monteiro 1992 [1]), although some studies have made connections to LPs with uncertain parameters (Carlsson and Korhonen [11], Gans et al. 2010 [22]). The problem of computing the expectation can be viewed from the perspective of LP sensitivity analysis. Much of the work in sensitivity analysis (e.g. Wendell 1985 [50], Jansen et al. 1997 [32]) concerns itself with finding a range in which a single optimal solution will remain optimal. More recently, Hadigheh and Terlaky (2006) [29] developed a technique for finding all possible optimal solutions across the entire range of values of the sensitivity parameter. We apply this work to solve the problem of finding the expected value of an LP.

This paper makes the following contributions: 1) We present a new class of optimal learning problems where information is sequentially collected to improve a Bayesian belief about the optimal value of an LP. We use a multivariate Gaussian learning model which allows a single measurement to provide information about multiple unknown parameters of the LP. 2) We derive a knowledge gradient algorithm for this problem class. The algorithm computes the expected value of information exactly, through an elegant synthesis of concepts from optimal learning, stochastic linear programming, and sensitivity analysis. 3) We prove the asymptotic optimality of the KG algorithm as the number of measurements becomes large. 4) We present numerical examples demonstrating the performance of the algorithm on a minimum-cost flow problem.

Section 2 presents our mathematical model for the problem of information collection in an LP. Section 3 explains the derivation and computation of the KG algorithm in this setting. Section 4 discusses the asymptotic optimality property of KG. Finally, Section 5 gives numerical examples of the algorithm at work.

2. Mathematical model We assume that the feasible region A defined by Ax = b and x ≥ 0 is bounded. The optimal value of the LP in (1), for a fixed M-vector c of objective coefficients, is denoted by V(c). Also let x(c) represent the optimal solution of the LP, that is, the value of x that achieves the value V(c). If there are multiple optimal solutions, we can let x(c) be an arbitrary point in this set without loss of generality. The dual of the LP is given by

\[
V(c) = \min_{y,s} \; b^T y \quad \text{s.t.} \quad A^T y - s = c, \; s \ge 0. \tag{2}
\]

By strong duality, the optimal value of the dual for fixed c remains V(c). We can let y(c) and s(c) represent the optimal dual solution as a function of the objective coefficients. We also assume that the polyhedron A has dimension M.
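As a point of reference for everything that follows, evaluating V(c) and x(c) for a fixed coefficient vector is a single LP solve. The sketch below is ours (not the authors' code) and assumes Python with numpy and scipy available; scipy's linprog minimizes, so the objective is negated.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp(c, A, b):
    """Evaluate V(c) = max_x c^T x s.t. Ax = b, x >= 0.

    linprog minimizes, so we pass -c and negate the optimal value.
    Returns the pair (V(c), x(c))."""
    res = linprog(-np.asarray(c, dtype=float), A_eq=A, b_eq=b,
                  bounds=[(0, None)] * len(c))
    if not res.success:
        raise ValueError("LP not solved: " + res.message)
    return -res.fun, res.x
```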

We begin with a multivariate Gaussian prior on c, given by c ∼ N(c^0, Σ^0). We can improve these estimates by making N sequential measurements of individual objective coefficients. When we choose to measure the jth coefficient, we make a noisy observation ĉ_j ∼ N(c_j, λ_j) of the true value. The noise variance λ_j is assumed to be known, but can depend on j. We also assume that the measurements are conditionally independent given the sequence of measurement decisions j^0, ..., j^{N−1}.

Each observation is used to update our beliefs about c. Denote by F^n the sigma-algebra generated by our choices of the first n coefficients, as well as the observations we made as a result. Because the measurements and our prior beliefs all follow Gaussian distributions, the conditional distribution of c given F^n is also multivariate Gaussian (DeGroot 1970 [16]). Let IE^n = IE(· | F^n) represent the conditional expectation operator given F^n. Then, we can define a vector c^n = IE^n(c) representing our beliefs about c "at time n," that is, immediately after exactly n measurements have been made. Similarly, let Σ^n denote the conditional covariance matrix of c at time n. It can be shown (Gelman et al. 2004 [24], Frazier et al. 2009 [21]) that

\[
c^{n+1} = c^n + \frac{\hat{c}^{n+1}_{j^n} - c^n_{j^n}}{\lambda_{j^n} + \Sigma^n_{j^n j^n}} \, \Sigma^n e_{j^n}, \tag{3}
\]
\[
\Sigma^{n+1} = \Sigma^n - \frac{\Sigma^n e_{j^n} e_{j^n}^T \Sigma^n}{\lambda_{j^n} + \Sigma^n_{j^n j^n}}, \tag{4}
\]

where j^n is the (n + 1)st coefficient chosen for measurement. Because we start with a multivariate prior, with correlations in our beliefs, a single measurement can potentially change our beliefs about the entire objective vector.

Together, the vector c^n and the matrix Σ^n completely characterize our distribution of belief at time n. We can refer to these parameters as the knowledge state
\[
s^n = \left(c^n, \Sigma^n\right).
\]
If we choose to measure coefficient c_j at time n, we can write
\[
s^{n+1} = K^M\left(s^n, j, \hat{c}^{n+1}_j\right),
\]
where the transition function (or "system model") K^M is described by (3) and (4). The knowledge state encodes all the information that is available to us when we make our measurement decision at time n.
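The updating equations (3) and (4) translate directly into code. The following is a minimal sketch of the transition function K^M under the model above; the function name and calling convention are ours.

```python
import numpy as np

def transition(c_n, Sigma_n, j, obs, lam):
    """K^M(s^n, j, obs): update the knowledge state after measuring
    coefficient j and observing obs, with known noise variance lam[j]."""
    col = Sigma_n[:, j]                  # Sigma^n e_j (j-th column)
    denom = lam[j] + Sigma_n[j, j]
    c_next = c_n + (obs - c_n[j]) / denom * col           # equation (3)
    Sigma_next = Sigma_n - np.outer(col, col) / denom     # equation (4)
    return c_next, Sigma_next
```

Because col carries the covariances of coefficient j with every other coefficient, a single call can shift the entire belief vector, exactly as noted above.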


As in Ryzhov and Powell (2010) [46], our problem can be divided into two phases. First, in the learning phase, we sequentially make measurement decisions j^0, ..., j^{N−1}. A learning policy π is a sequence of decision rules J^{π,0}, ..., J^{π,N−1}, where each J^{π,n} is a function mapping the knowledge state s^n to an index in the set {1, ..., M}. Thus, a policy tells us what to measure in any possible circumstance. Then, after all N measurements have been made, we must commit to a final decision (e.g. production strategy or agricultural plan) based on our final set of beliefs s^N. This is known as the implementation phase. This final implementation decision can be represented by a function χ mapping the final knowledge state s^N to a point in the feasible region A. Since χ is also a rule for making decisions rather than a fixed decision, we refer to it as an implementation policy.

Altogether, the objective function for the learning problem (not to be confused with the objective function of the underlying LP) can be written as
\[
\sup_\pi \sup_\chi \mathbb{E}^\pi c^T \chi\left(s^N\right). \tag{5}
\]
The notation IE^π denotes an expectation over the outcomes of all N measurements, given that these measurements are made according to the learning policy π. The implementation policy is chosen to maximize the expectation of the true objective value c^T χ(s^N) over all possible outcomes of s^N and ensuing implementation decisions χ(s^N).

The following theorem gives a way to simplify the problem. One intuitive choice for the implementation policy is to simply plug our final beliefs c^N into (1) and let χ(s^N) = x(c^N). The value of the deterministic LP with objective vector c^N is generally not the same as the expected value of the stochastic LP with the true objective vector c. This issue is studied in Madansky (1960) [39], where some conditions are given under which equality does hold. In our setting, these quantities are equal in expectation over the outcomes of the measurements under the same learning policy.

Theorem 2.1
\[
\sup_\pi \sup_\chi \mathbb{E}^\pi c^T \chi\left(s^N\right) = \sup_\pi \mathbb{E}^\pi V\left(c^N\right).
\]

Proof. By the tower property of conditional expectation,
\[
\sup_\pi \sup_\chi \mathbb{E}^\pi c^T \chi\left(s^N\right) = \sup_\pi \sup_\chi \mathbb{E}^\pi \mathbb{E}^{\pi,N} c^T \chi\left(s^N\right)
\]
because IE^π observes only the choice of learning policy, whereas IE^{π,N} observes that as well as the outcomes of the measurements made by following that policy. Observe that, because s^N is F^N-measurable, we have
\[
\mathbb{E}^{\pi,N} c^T \chi\left(s^N\right) = \chi\left(s^N\right)^T \mathbb{E}^{\pi,N}(c) = \left(c^N\right)^T \chi\left(s^N\right)
\]
by the linearity of expected values. Thus, we can write
\[
\begin{aligned}
\sup_\pi \sup_\chi \mathbb{E}^\pi c^T \chi\left(s^N\right) &= \sup_\pi \sup_\chi \mathbb{E}^\pi \left(c^N\right)^T \chi\left(s^N\right) \\
&\le \sup_\pi \sup_\chi \mathbb{E}^\pi \max_{x \in A} \left(c^N\right)^T x \\
&= \sup_\pi \mathbb{E}^\pi \max_{x \in A} \left(c^N\right)^T x \\
&= \sup_\pi \mathbb{E}^\pi V\left(c^N\right).
\end{aligned} \tag{6}
\]
The second line is due to the monotonicity of conditional expectations because, for any outcome ω of the measurements,
\[
\left(c^N(\omega)\right)^T \chi\left(s^N(\omega)\right) \le \max_{x \in A} \left(c^N(\omega)\right)^T x.
\]
Thus, we can remove the supremum over χ because the expression inside that supremum no longer depends on the implementation policy.

Now, fix π and let χ′ be defined by χ′(s^N) = x(c^N). Then,
\[
\mathbb{E}^\pi V\left(c^N\right) = \mathbb{E}^\pi \left(c^N\right)^T \chi'\left(s^N\right) \le \sup_\chi \mathbb{E}^\pi \left(c^N\right)^T \chi\left(s^N\right).
\]
It follows that
\[
\sup_\pi \mathbb{E}^\pi V\left(c^N\right) \le \sup_\pi \sup_\chi \mathbb{E}^\pi \left(c^N\right)^T \chi\left(s^N\right). \tag{7}
\]
Combining (6) with (7) yields the desired result. □

At the final time N, the solution x(c^N) is, in fact, the best possible implementation decision. By maximizing our final guess of the optimal value, we also maximize (in expectation) the true objective value. Thus, the problem reduces to choosing a policy π for making measurements during the learning phase, and our objective function in (5) can be rewritten as
\[
\sup_\pi \mathbb{E}^\pi V\left(c^N\right). \tag{8}
\]

Our next result allows us to place a global upper bound on the objective value in (8). This bound will be used later in Section 4 to show the asymptotic optimality of the KG policy proposed in Section 3.

Proposition 2.1 For all n and all knowledge states s^n = (c^n, Σ^n),
\[
V(c^n) \le \mathbb{E}^n V(c) \quad \text{almost surely.} \tag{9}
\]

Proof. The optimal value function V is convex in c (concave for minimization problems; see e.g. [33, Lemma 5]). Applying Jensen's inequality together with the definition of c^n, we obtain
\[
V(c^n) = V\left(\mathbb{E}^n c\right) \le \mathbb{E}^n V(c),
\]
as required. □

Corollary 2.1 Taking the expectation of both sides of (9) yields IE^π V(c^N) ≤ IE^π V(c) for any policy π. However, the true objective vector c does not depend on the policy π, whence IE^π V(c) = IE V(c) for all π, and
\[
\mathbb{E}^\pi V\left(c^N\right) \le \mathbb{E}\, V(c)
\]
for all π.
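The policy-independent bound IE V(c) of Corollary 2.1 is straightforward to estimate by simulation, which makes it a convenient benchmark when testing a learning policy. A sketch, reusing the solve_lp helper introduced above:

```python
import numpy as np

def upper_bound_mc(c0, Sigma0, A, b, n_samples=1000, seed=0):
    """Monte Carlo estimate of IE V(c) with c ~ N(c0, Sigma0), the
    global upper bound on the objective value of any learning policy."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(c0, Sigma0, size=n_samples)
    return float(np.mean([solve_lp(c, A, b)[0] for c in samples]))
```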

The optimal learning policy π* that maximizes (8) can be characterized using dynamic programming notation. The objective value achieved by following the optimal learning policy from time n onward can be described using a form of Bellman's equation (Bellman 1957 [5]) adapted for learning problems (first used in this context by DeGroot 1970 [16]), given by
\[
V^{*,n}(s^n) = \max_j \mathbb{E}^n_j V^{*,n+1}\left(K^M\left(s^n, j, \hat{c}^{n+1}_j\right)\right), \tag{10}
\]
\[
V^{*,N}\left(s^N\right) = V\left(c^N\right). \tag{11}
\]
The notation IE^n_j denotes an expectation given F^n and the decision to measure the jth coefficient at time n. If there are no measurements remaining, we make the optimal implementation decision x(c^N). Otherwise, we make the decision that puts us in the best position for following the optimal policy starting at time n + 1.

Although it is conceptually useful to characterize the optimal learning policy in this way, the dynamic program given by (10) and (11) is computationally intractable, because the knowledge state s^n is multi-dimensional and continuous. In the next section, we propose a heuristic learning policy that yields a computable algorithm.

3. The knowledge gradient algorithm The knowledge gradient (KG) concept was originally advanced by Gupta and Miescke (1996) [28] and later developed by Chick and Inoue (2001) [13, 14] and Frazier et al. (2008) [20] for the ranking and selection problem. The KG approach consists of a one-period look-ahead, viewing each time period as if it were the last, and making the measurement that would be optimal under those circumstances. In our problem, the optimal measurement decision at time N − 1 is given by
\[
J^{*,N-1}\left(s^{N-1}\right) = \arg\max_j \mathbb{E}^{N-1}_j V\left(c^N\right). \tag{12}
\]
Equivalently, (12) can be written in terms of the difference
\[
J^{*,N-1}\left(s^{N-1}\right) = \arg\max_j \mathbb{E}^{N-1}_j\left(V\left(c^N\right) - V\left(c^{N-1}\right)\right),
\]
because V(c^{N−1}) is F^{N−1}-measurable and has no effect on the arg max.


The KG decision rule is given by
\[
J^{KG,n}(s^n) = \arg\max_j \nu^{KG,n}_j, \tag{13}
\]
where
\[
\nu^{KG,n}_j = \mathbb{E}^n_j\left(V\left(c^{n+1}\right) - V\left(c^n\right)\right). \tag{14}
\]
Under the KG policy, we always measure the coefficient that yields the greatest expected improvement between our current estimate V(c^n) of the optimal LP value, and the future estimate V(c^{n+1}) that will result from the measurement. The term "knowledge gradient" arises because the expected improvement is written as a difference. Observe that, by definition, KG is the optimal learning policy if N = 1. Also note that J^{KG,n} depends on n only through s^n, and thus we can drop the time index n from the decision rule.

At time n, the quantity V(c^n) can be found by solving a deterministic LP with objective vector c^n. However, the quantity IE^n_j(V(c^{n+1})) is the expected value of a linear program with stochastic parameters, since c^{n+1} is random at time n. The remainder of this section explains how this quantity can be computed exactly.

3.1 Derivation From Frazier et al. (2009) [21], we know that the conditional distribution of c^{n+1}, given F^n as well as the decision to measure j at time n, can be expressed by the equation
\[
c^{n+1} = c^n + \frac{\Sigma^n e_j}{\sqrt{\lambda_j + \Sigma^n_{jj}}} \cdot Z,
\]
where Z ∼ N(0, 1). Thus, we can write
\[
\mathbb{E}^n_j V\left(c^{n+1}\right) = \mathbb{E}\, V\left(c^n + \frac{\Sigma^n e_j}{\sqrt{\lambda_j + \Sigma^n_{jj}}} \cdot Z\right).
\]

For a fixed value z of Z, the expression inside the expectation can be rewritten as
\[
V\left(c^n + z \Delta c^n_j\right) = \max_x \; \left(c^n + z \Delta c^n_j\right)^T x \quad \text{s.t.} \quad Ax = b, \; x \ge 0, \tag{15}
\]
where
\[
\Delta c^n_j = \frac{\Sigma^n e_j}{\sqrt{\lambda_j + \Sigma^n_{jj}}}.
\]

The LP in (15) is a perturbed version of (1). The perturbation is random (and follows a standard normal distribution), and the expected value of the perturbed LP is given by
\[
\mathbb{E}\, V\left(c^n + Z \Delta c^n_j\right) = \int_{-\infty}^{\infty} V\left(c^n + z \Delta c^n_j\right) \phi(z)\, dz
\]
with φ being the standard normal pdf. Now observe that
\[
\int_{-\infty}^{\infty} V\left(c^n + z \Delta c^n_j\right) \phi(z)\, dz = \int_{-\infty}^{\infty} \left(c^n + z \Delta c^n_j\right)^T x\left(c^n + z \Delta c^n_j\right) \phi(z)\, dz.
\]

The quantity (c^n + z∆c^n_j)^T x(c^n + z∆c^n_j) is a piecewise linear function of z. If z is inside a certain range, the optimal solution x(c^n + z∆c^n_j) will remain the same, and the value (c^n + z∆c^n_j)^T x(c^n + z∆c^n_j) will be linear in z within that range. Figure 1 gives an example in two dimensions. The red lines represent the level curves of the objective function. Varying z has the effect of rotating these curves. For small values of z, the tangent point does not change; it will only change once the perturbed level curves become parallel to one of the faces of the feasible region.

Figure 1: Example showing that the optimal solution to the perturbed LP remains constant for a range of values of z.

The range of z over which x(c^n + z∆c^n_j) does not change is known in the literature (e.g. Hadigheh and Terlaky 2006 [29]) as the invariant support set. Thus, the expectation IE V(c^n + Z∆c^n_j) can be rewritten as a sum of integrals over all possible invariant support sets. We can let −∞ = z_1 < z_2 < ... < z_I = ∞ be a finite set of points such that x(c^n + z∆c^n_j) is constant for z ∈ (z_i, z_{i+1}). Let x_i be the optimal solution of the LP for z ∈ (z_i, z_{i+1}). Then,

\[
\begin{aligned}
\mathbb{E}\, V\left(c^n + Z \Delta c^n_j\right) &= \sum_i \int_{z_i}^{z_{i+1}} \left(c^n + z \Delta c^n_j\right)^T x_i \, \phi(z)\, dz \\
&= \sum_i \left(c^n\right)^T x_i \left(\Phi(z_{i+1}) - \Phi(z_i)\right) + \sum_i \left(\Delta c^n_j\right)^T x_i \left(\phi(z_i) - \phi(z_{i+1})\right) \tag{16}
\end{aligned}
\]
with Φ being the standard normal cdf. The sum of integrals in the first line of (16) recalls a similar expression in Itami (1974) [31], though that study did not consider the issue of computation. We can write this more concisely as
\[
\mathbb{E}\, V\left(c^n + Z \Delta c^n_j\right) = \sum_i a_i \left(\Phi(z_{i+1}) - \Phi(z_i)\right) + b_i \left(\phi(z_i) - \phi(z_{i+1})\right)
\]
where a_i = (c^n)^T x_i and b_i = (∆c^n_j)^T x_i. Using the analysis of Frazier et al. (2009) [21], we can rewrite this as
\[
\mathbb{E}\, V\left(c^n + Z \Delta c^n_j\right) = \left(\max_i a_i\right) + \sum_i \left(b_{i+1} - b_i\right) f\left(-|z_i|\right)
\]
where f(z) = zΦ(z) + φ(z). Now observe that max_i a_i = max_i (c^n)^T x_i, the largest time-n objective value over all points that can ever possibly be optimal for any value of z, including z = 0. It follows that max_i a_i = V(c^n), whence
\[
\nu^{KG,n}_j = \sum_i \left(b_{i+1} - b_i\right) f\left(-|z_i|\right). \tag{17}
\]

The problem of computing the KG factor of the jth coefficient thus reduces to the problem of finding the breakpoints z_i of the piecewise linear function, as well as the corresponding optimal solutions x_i. We now present an algorithm for finding these quantities; the sketch below shows how the KG factor is assembled once they are available.
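Given the breakpoints and invariant solutions, evaluating (17) takes only a few lines. In the sketch below (names ours), `breakpoints` holds only the finite z_i; the terms at z_1 = −∞ and z_I = +∞ contribute nothing since f(−∞) = 0.

```python
import numpy as np
from scipy.stats import norm

def f(z):
    """f(z) = z * Phi(z) + phi(z)."""
    return z * norm.cdf(z) + norm.pdf(z)

def kg_factor(breakpoints, solutions, delta_c):
    """Evaluate (17): sum_i (b_{i+1} - b_i) f(-|z_i|), with
    b_i = (delta c^n_j)^T x_i. breakpoints[i] separates solutions[i]
    from solutions[i+1], so len(solutions) == len(breakpoints) + 1."""
    b = [float(np.dot(delta_c, x)) for x in solutions]
    return sum((b[i + 1] - b[i]) * f(-abs(z))
               for i, z in enumerate(breakpoints))
```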

3.2 Computation We start with the optimal solution x(c^n) to (15) when z = 0. Our algorithm finds the invariant solutions x_i by working forward and backward from z = 0. We adapt the procedure developed for LP sensitivity analysis by Hadigheh and Terlaky (2006) [29].

First, it is necessary to determine whether zero is itself a breakpoint of the piecewise linear function. This can be done by solving two LPs,
\[
\begin{aligned}
z^- = \min_{y,s,z} \;\; & z \\
\text{s.t.} \;\; & A^T y - s - z \Delta c^n_j = c^n \\
& x(c^n)^T s = 0 \\
& s \ge 0,
\end{aligned} \tag{18}
\]

Figure 2: Illustration of the relationship between x(c^n), V_l(c^n) and z_l(c^n).

and
\[
\begin{aligned}
z^+ = \max_{y,s,z} \;\; & z \\
\text{s.t.} \;\; & A^T y - s - z \Delta c^n_j = c^n \\
& x(c^n)^T s = 0 \\
& s \ge 0.
\end{aligned} \tag{19}
\]

It is shown in Roos et al. (1997) [43] that these two problems find the smallest and largest values of z, respectively, for which x(c^n) is still an optimal solution of (15). The feasible region in each problem merely ensures that there is a feasible solution of the dual of (15) that maintains complementary slackness with x(c^n), meaning that x(c^n) is optimal. If z^- < 0 < z^+, then zero is not a breakpoint. In this situation, c^n corresponds to the solid red line in Figure 1. However, the numbers z^-, z^+ are breakpoints.

Suppose, however, that z^- = 0. Then, zero is a breakpoint, and the level curve of c^n is tangent to a face of the feasible region. In this case, there are two optimal extreme-point solutions of (15) at z = 0, one of which is x(c^n). The other extreme point is the optimal solution to the LP
\[
\begin{aligned}
V_l(c^n) = \min_x \;\; & \left(\Delta c^n_j\right)^T x \\
\text{s.t.} \;\; & Ax = b \\
& \left(s(c^n)\right)^T x = 0 \\
& x \ge 0.
\end{aligned} \tag{20}
\]

Let x_l(c^n) denote this point. The quantity (∆c^n_j)^T x_l(c^n) is precisely the left derivative of the piecewise linear function at the breakpoint z = 0. (The right derivative is obtained analogously from x(c^n) itself.) But now, observe that x_l(c^n) is optimal at two breakpoints: at zero, and at some z′ < 0. This z′ is the optimal value of the LP
\[
\begin{aligned}
z_l(c^n) = \min_{y,s,z} \;\; & z \\
\text{s.t.} \;\; & A^T y - s - z \Delta c^n_j = c^n \\
& \left(x_l(c^n)\right)^T s = 0 \\
& s \ge 0.
\end{aligned} \tag{21}
\]

The LP in (21) is almost identical to that in (18), with the only difference being that x(c^n) is replaced by x_l(c^n). Observe that z_l(c^n) ≤ 0, since there will always be a feasible solution of (21) with z = 0. Figure 2 gives a visual illustration of the relationship between the solutions x(c^n) and x_l(c^n), and the breakpoint z_l(c^n).

In the other case where z^+ = 0, we follow the same procedure, but maximize instead of minimizing. Zero is still a breakpoint, and there are still two optimal extreme-point solutions of (15) at z = 0. This time, the other extreme point is the optimal solution to
\[
\begin{aligned}
V_u(c^n) = \max_x \;\; & \left(\Delta c^n_j\right)^T x \\
\text{s.t.} \;\; & Ax = b \\
& \left(s(c^n)\right)^T x = 0 \\
& x \ge 0.
\end{aligned} \tag{22}
\]


1: Solve (15) with z = 0 to obtain x(c^n), y(c^n) and s(c^n).
2: Solve the LPs given by (18) and (19), and let z^-, z^+ be their optimal values.
2a: If z^- < 0 < z^+, let z = (z^-, z^+) and x = (x(c^n)).
2b: Otherwise, let z = (0) and let x be the empty set.
3: While min(z) > −∞, do the following:
3a: Let c′ = c^n + min(z) · ∆c^n_j.
3b: Find V_l(c′) and z_l(c′) using (20) and (21).
3c: Update z ← (min(z) + z_l(c′), z) and x ← (x_l(c′), x).
4: While max(z) < ∞, do the following:
4a: Let c′ = c^n + max(z) · ∆c^n_j.
4b: Find V_u(c′) and z_u(c′) using (22) and (23).
4c: Update z ← (z, max(z) + z_u(c′)) and x ← (x, x_u(c′)).

Figure 3: Algorithm for finding the vector z of breakpoints and the set x of corresponding invariant solutions.

We let x_u(c^n) be the value of x that solves (22), and find the next breakpoint z′ > 0 by computing
\[
\begin{aligned}
z_u(c^n) = \max_{y,s,z} \;\; & z \\
\text{s.t.} \;\; & A^T y - s - z \Delta c^n_j = c^n \\
& \left(x_u(c^n)\right)^T s = 0 \\
& s \ge 0.
\end{aligned} \tag{23}
\]

If z^- = z^+ = 0, then zero is a breakpoint, but x(c^n) is not an extreme point. This can occur if we use an interior-point algorithm to find x(c^n). Such an algorithm may converge to a point on the face of the polyhedron that is tangent to the level curve of c^n. In this case, we simply follow both parts of the above procedure to find the neighbouring extreme points.

This procedure can be repeated to find sequences of extreme points for z < 0 and z > 0. Putting the elements of the procedure together yields the algorithm described in Figure 3. By running this algorithm, we will obtain a vector z of the breakpoints, together with a set x containing the corresponding invariant solutions. These quantities can then be used to compute the KG formula given in (17).

The algorithm is guaranteed to terminate in finite time, because there is only a finite number of extreme points (see Bertsimas and Tsitsiklis 1997 [7, Corollary 2.1]). Furthermore, the breakpoints are already returned sorted. However, the algorithm can be computationally expensive, requiring us to solve a sequence of linear programs for every coefficient of the objective function at every time step n. The advantage is that the KG factor ν^{KG,n}_j is computed exactly.
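To illustrate how the subproblems fit together, here is a sketch of the upward pass of Figure 3 (steps 4a-4c); the downward pass is symmetric. This is our own reading of the procedure, not the authors' implementation: it reuses solve_lp from Section 2, imposes the complementarity constraint in (18)-(19) by fixing dual slacks to zero wherever the primal solution is positive, and adopts the stabilized constraint (c′)^T x = V(c′) discussed in Section 5 in place of s(c′)^T x = 0.

```python
import numpy as np
from scipy.optimize import linprog

def z_range(x_opt, c_base, delta, A, b, tol=1e-9):
    """(18)/(19): smallest and largest z for which x_opt stays optimal
    for the objective c_base + z*delta. Decision vector is [y; s; z];
    x_opt^T s = 0 is imposed by forcing s_k = 0 where x_opt[k] > 0."""
    m, M = A.shape
    A_eq = np.hstack([A.T, -np.eye(M), -delta.reshape(-1, 1)])
    bounds = ([(None, None)] * m
              + [(0.0, 0.0) if x_opt[k] > tol else (0.0, None)
                 for k in range(M)]
              + [(None, None)])
    zs = []
    for sign in (1.0, -1.0):       # min z gives z^-; min -z gives z^+
        obj = np.zeros(m + M + 1)
        obj[-1] = sign
        res = linprog(obj, A_eq=A_eq, b_eq=c_base, bounds=bounds)
        zs.append(res.x[-1] if res.status == 0 else -sign * np.inf)
    return zs[0], zs[1]

def next_solution_up(c_pert, delta, A, b):
    """(22) at a breakpoint: among the optimal solutions for c_pert
    (enforced via c_pert^T x = V(c_pert), the substitution suggested
    in Section 5), find the extreme point maximizing delta^T x."""
    V_pert, _ = solve_lp(c_pert, A, b)
    res = linprog(-delta, A_eq=np.vstack([A, c_pert]),
                  b_eq=np.append(b, V_pert),
                  bounds=[(0, None)] * A.shape[1])
    return res.x

def breakpoints_up(c_n, delta, A, b, max_iter=50):
    """Steps 4a-4c of Figure 3: walk upward from z = 0, collecting
    breakpoints and invariant solutions until z^+ reaches +inf."""
    _, x0 = solve_lp(c_n, A, b)
    _, z_hi = z_range(x0, c_n, delta, A, b)
    zs, xs = [z_hi], [x0]
    while np.isfinite(zs[-1]) and len(zs) < max_iter:
        c_pert = c_n + zs[-1] * delta
        x_u = next_solution_up(c_pert, delta, A, b)
        _, z_step = z_range(x_u, c_pert, delta, A, b)  # z_u(c') as in (23)
        zs.append(zs[-1] + z_step)
        xs.append(x_u)
    return zs, xs
```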

The KG policy synthesizes a number of concepts from optimization and optimal learning. The one-period look-ahead heuristic allows us to reduce the difficult sequential learning problem to the problem of computing the expected value of a parametric, stochastic linear program. We then apply recent work in sensitivity analysis to compute the expected value of a single continuous measurement exactly. The same technique could potentially be applied to other types of prior distributions on c. However, we focus on the multivariate Gaussian distribution because it provides an elegant and powerful framework for modeling correlations in our beliefs.

4. Asymptotic optimality In this section, we prove that the KG algorithm laid out in Section 3 possesses the property of asymptotic optimality. The precise meaning of this property is given in Section 4.1, along with other definitions and preliminaries required for the proof. The proof itself is given in Section 4.2. It is based on the convergence analysis of Frazier and Powell (2009) [19]: essentially, we show that the KG policy avoids getting "stuck" at a knowledge state where some, but not all, components of c are known perfectly. Given infinitely many measurements, KG eventually reaches a state where all of c is known.


4.1 Preliminaries We define the space of knowledge states as
\[
S = \left\{ (c, \Sigma) \;\middle|\; c \in \mathbb{R}^M,\ \Sigma \in S^M_+,\ \Sigma_{jj} > 0 \ \forall j \right\},
\]
where S^M_+ is the space of M × M positive-semidefinite matrices. The space S includes all possible posterior distributions at all times n. Without loss of generality, we assume that Σ^0 is a full-rank covariance matrix with Σ^0_{jj} > 0 for all j, that is, we do not have perfect information about any objective coefficient at the beginning. The space S thus includes all possible mean vectors and covariance matrices that we will observe over the course of a finite number of measurements. The closure of this space is
\[
\bar{S} = \left\{ (c, \Sigma) \;\middle|\; c \in \mathbb{R}^M,\ \Sigma \in S^M_+ \right\},
\]
which includes all possible legitimate covariance matrices Σ, including those where Σ_{jj} = 0 for some j. Such a knowledge state can be reached by measuring j infinitely often.

We define convergence in the space of knowledge states as follows. Suppose that (s^n) is a sequence of knowledge states (c^n, Σ^n) ∈ S, and s = (c^(s), Σ^(s)) is a knowledge state in S̄. We say that s^n → s if c^n → c^(s) and Σ^n_{jk} → Σ^(s)_{jk} for all j, k. Our parametrization is different from the mean-value parametrization used in Frazier and Powell (2009) [19]. However, our definition of convergence in S̄ is equivalent to convergence in the mean-value parameters. This is because both our sampling density and our posterior density are multivariate Gaussian, and thus come from an exponential family. It is known (e.g. from Bickel and Doksum 2007 [8]) that, for an exponential family, there is a continuous bijection between the natural parameters of the distribution and the mean-value parameters. Thus, every pair (c^(s), Σ^(s)) for s ∈ S̄ corresponds to exactly one set of mean-value parameters. Furthermore, a sequence in S converges to a point in S̄ if and only if the corresponding sequence of mean-value parameters converges to the mean-value parametrization of the limit. For our purposes, however, it is much more convenient to work with c^n and Σ^n, because we use these parameters to define the KG policy.

We now give the meaning of asymptotic optimality. The risk function
\[
R(c, x) = V(c) - c^T x
\]
represents the loss incurred (for a particular choice of the objective coefficients) by choosing x ∈ A as the implementation decision instead of solving for V(c). The work by Frazier and Powell (2009) [19] defines asymptotic optimality as
\[
\lim_{N \to \infty} \mathbb{E}\left( \min_x \mathbb{E}^N R(c, x) \right) = \mathbb{E}\left( \min_x \mathbb{E}\left( R(c, x) \mid c \right) \right), \tag{24}
\]
that is, in the limit as N → ∞, the minimum-risk decision at time N will achieve the lowest risk possible if c were perfectly known. Every expectation in this section can be assumed to be under the KG policy.

The left-hand side of (24) can be rewritten as
\[
\begin{aligned}
\lim_{N \to \infty} \mathbb{E}\left( \min_x \mathbb{E}^N R(c, x) \right)
&= \lim_{N \to \infty} \mathbb{E}\left( \min_x \left( \mathbb{E}^N V(c) - \mathbb{E}^N c^T x \right) \right) \\
&= \lim_{N \to \infty} \mathbb{E}\left( \mathbb{E}^N V(c) + \min_x \left( -\mathbb{E}^N c^T x \right) \right) \\
&= \lim_{N \to \infty} \mathbb{E}\, V(c) - \mathbb{E} \max_{x \in A} \left(c^N\right)^T x \\
&= \mathbb{E}\, V(c) - \lim_{N \to \infty} \mathbb{E}\, V\left(c^N\right).
\end{aligned}
\]
Analogously, it can be shown that IE(min_x IE(R(c, x) | c)) = 0, whence (24) becomes
\[
\lim_{N \to \infty} \mathbb{E}\, V\left(c^N\right) = \mathbb{E}\, V(c). \tag{25}
\]

Thus, the meaning of asymptotic optimality is in line with our objective function in (8). The objective value achieved by an asymptotically optimal learning policy converges to the global upper bound in Corollary 2.1.

The last preliminaries bring concepts from Frazier and Powell (2009) [19] to our LP setting. We define three functions
\[
h(s^n, j) = \min_x \mathbb{E}^n R(c, x) - \mathbb{E}^n_j\left( \min_x \mathbb{E}^{n+1} R(c, x) \right),
\]
\[
g(s^n, j) = \min_x \mathbb{E}^n R(c, x) - \mathbb{E}^n\left( \min_x \mathbb{E}^n\left( R(c, x) \mid c_j \right) \right),
\]
\[
\bar{g}(s^n) = \min_x \mathbb{E}^n R(c, x) - \mathbb{E}^n\left( \min_x \mathbb{E}^n\left( R(c, x) \mid c \right) \right).
\]
(We write ḡ for the third function to distinguish it from g(·, j).)


Each function represents a decrease in risk under different conditions. The function h is the one-period decrease in risk obtained by measuring the jth objective coefficient at time n. The function g is the decrease in risk that would be achieved if we were given the exact value of the jth coefficient after reaching a given knowledge state. Finally, ḡ represents the decrease in risk achieved if we were given the exact value of c.

Our last definitions relate these concepts to the notion of convergence. For each coefficient j, let
\[
M_j = \left\{ s \in \bar{S} \;\middle|\; \exists\, (s^n) \subseteq S : s^n \to s,\ g(s^n, j) \to 0 \right\}
\]
be the set of all knowledge states for which we gain nothing (no decrease in risk) by knowing the exact value of c_j. Similarly, let
\[
B_s = \left\{ j \mid s \in M_j \right\}
\]
be the set of all such objective coefficients for the fixed knowledge state s. Finally, let
\[
M^* = \left\{ s \in \bar{S} \;\middle|\; \forall\, (s^n) \subseteq S \text{ with } s^n \to s,\ g(s^n, j) \to 0 \ \forall j \right\}
\]
be the set of all knowledge states for which nothing is gained by knowing all of c exactly.

4.2 Main result The work by Frazier and Powell (2009) [19] provides sufficient conditions under which the asymptotic optimality property from (25) holds. We first restate these conditions in terms of our LP problem.

Theorem 4.1 Suppose that the following conditions hold:

(i) [19, Assumption 1] Let s^n be a sequence of knowledge states with s^n → s and g(s^n, j) → 0 for all j. Then, ḡ(s^n) → 0 also.

(ii) [19, Theorem 1] Let s ∉ M*. Then, s has an open neighbourhood U ⊆ S̄ such that
\[
\sup_{s' \in U} P\left( J^{KG}(s') \in B_s \right) < 1. \tag{26}
\]

Then, the KG policy of Section 3 is asymptotically optimal in the sense of (25).

We now calculate
\[
\begin{aligned}
h(s^n, j) &= \min_x \mathbb{E}^n\left( V(c) - c^T x \right) - \mathbb{E}^n_j\left( \min_x \mathbb{E}^{n+1}\left( V(c) - c^T x \right) \right) \\
&= \mathbb{E}^n V(c) - \max_x \left(c^n\right)^T x - \mathbb{E}^n V(c) + \mathbb{E}^n_j \max_x \left(c^{n+1}\right)^T x \\
&= \mathbb{E}^n_j\left( V\left(c^{n+1}\right) - V\left(c^n\right) \right) \\
&= \nu^{KG,n}_j. \tag{27}
\end{aligned}
\]

Thus, the KG policy as defined in (13) and (14) maximizes the one-period decrease in risk. Repeating the same computations as in (27), we find that
\[
g(s^n, j) = \mathbb{E}^n\left( \max_x \mathbb{E}^n\left( c^T x \mid c_j \right) - V(c^n) \right). \tag{28}
\]

Because c_j is correlated with the values of the other coefficients c_{j′}, we find that
\[
\begin{aligned}
\mathbb{E}^n\left( c^T x \mid c_j \right) &= \sum_{j'=1}^M x_{j'}\, \mathbb{E}^n\left( c_{j'} \mid c_j \right) \\
&= c_j x_j + \sum_{j' \ne j} x_{j'} \left( c^n_{j'} + \frac{\Sigma^n_{j,j'}}{\Sigma^n_{j,j}} \left( c_j - c^n_j \right) \right).
\end{aligned}
\]

The time-n marginal distribution of c_j is N(c^n_j, Σ^n_{j,j}), whence
\[
\begin{aligned}
\mathbb{E}^n\left( \max_x \mathbb{E}^n\left( c^T x \mid c_j \right) \right)
&= \mathbb{E} \max_x \left[ x_j \left( c^n_j + \sqrt{\Sigma^n_{j,j}}\, Z \right) + \sum_{j' \ne j} x_{j'} \left( c^n_{j'} + \frac{\Sigma^n_{j,j'}}{\Sigma^n_{j,j}} \sqrt{\Sigma^n_{j,j}}\, Z \right) \right] \\
&= \mathbb{E} \max_x \sum_{j'=1}^M \left[ c^n_{j'} x_{j'} + x_{j'} \frac{\Sigma^n_{j,j'}}{\sqrt{\Sigma^n_{j,j}}}\, Z \right].
\end{aligned}
\]
It follows that
\[
g(s^n, j) = \mathbb{E}\, V\left( c^n + Z \Delta c^n_j \right) - V(c^n)
\]
with ∆c^n_j = Σ^n e_j / √Σ^n_{j,j}. Repeating the arguments of Section 3.1, we find that
\[
g(s^n, j) = \sum_i \left( b_{i+1} - b_i \right) f\left( -|z_i| \right). \tag{29}
\]
As before, −∞ = z_1 < z_2 < ... < z_{I′} = ∞ is a finite set of points for which x(c^n + z∆c^n_j) is constant for z ∈ (z_i, z_{i+1}), x_i is the optimal solution of the LP for z ∈ (z_i, z_{i+1}), and b_i = (∆c^n_j)^T x_i.

Proposition 4.1 Suppose that s^n is the current knowledge state. Then, g(s^n, j) = 0 if and only if Σ^n_{j,j} = 0. Also, h(s^n, j) = 0 if and only if Σ^n_{j,j} = 0.

Proof. Suppose that Σ^n_{j,j} = 0. Because the marginal distribution of c_j given s^n is N(c^n_j, Σ^n_{j,j}), it follows that c_j = c^n_j almost surely given F^n. Consequently, the right-hand side of (28) is zero.

Suppose now that Σ^n_{j,j} > 0. Then, g(s^n, j) is given by (29). We claim that the right-hand side of (29) must be strictly positive. To see this, first observe that the function f has no zeros on the real line (Ryzhov and Powell 2010 [46]). Furthermore, we have b_{i+1} ≥ b_i for all i, so the only way (29) can equal zero is if
\[
\sum_{j'=1}^M x_{i,j'} \Sigma^n_{j,j'} = \sum_{j'=1}^M x_{i',j'} \Sigma^n_{j,j'}
\]
for all i, i′ ∈ {1, ..., I′}. It follows that
\[
\begin{pmatrix}
x_{1,1} & \cdots & x_{1,M} & 1 \\
\vdots & & \vdots & \vdots \\
x_{I',1} & \cdots & x_{I',M} & 1
\end{pmatrix}
\begin{pmatrix}
\Sigma^n_{j1} \\ \vdots \\ \Sigma^n_{jM} \\ -\beta
\end{pmatrix}
= 0 \tag{30}
\]
for some constant β. However, the polyhedron A is assumed to have dimension M, which means that it has exactly M + 1 affinely independent extreme points (Padberg 1999 [42]). Thus, the matrix in (30) must contain an invertible (M + 1) × (M + 1) submatrix. Because Σ^n_{j,j} > 0 by assumption, this submatrix cannot yield zero when multiplied by the vector in (30). We conclude that g(s^n, j) = 0 if and only if Σ^n_{j,j} = 0. The same argument can be used to show that h(s^n, j) = 0 if and only if Σ^n_{j,j} = 0. □

We now present a technical lemma that will be used to verify the first condition of Theorem 4.1. The lemma establishes certain continuity properties of h, g and ḡ, which are crucial to the proof of asymptotic optimality.

Lemma 4.1 Let a^n and B^n be sequences in R^I and R^{I×K}, respectively, such that a^n → a and B^n_{i,k} → B_{i,k} for all i, k. Define
\[
y^n = \mathbb{E} \max_i \left( a^n_i + \sum_{k=1}^K B^n_{i,k} Z_k \right),
\qquad
y = \mathbb{E} \max_i \left( a_i + \sum_{k=1}^K B_{i,k} Z_k \right),
\]
where Z_1, ..., Z_K are i.i.d. N(0, 1) random variables. Then, y^n → y.

Proof. We define random variables
\[
Y^n = \max_i \left( a^n_i + \sum_{k=1}^K B^n_{i,k} Z_k \right),
\qquad
Y = \max_i \left( a_i + \sum_{k=1}^K B_{i,k} Z_k \right),
\]
and show that Y^n → Y in L^1. To see this, let ε > 0 and take N such that, for all n > N, we have |a^n_i − a_i|, |B^n_{i,k} − B_{i,k}| < ε for all i, k. Then, for n > N,
\[
\begin{aligned}
\mathbb{E}\left| Y^n - Y \right|
&= \mathbb{E}\left| \max_i \left( a^n_i + \sum_{k=1}^K B^n_{i,k} Z_k \right) - \max_i \left( a_i + \sum_{k=1}^K B_{i,k} Z_k \right) \right| \\
&\le \mathbb{E} \max_i \left| \left( a^n_i + \sum_{k=1}^K B^n_{i,k} Z_k \right) - \left( a_i + \sum_{k=1}^K B_{i,k} Z_k \right) \right| \\
&\le \mathbb{E} \max_i \left( \left| a^n_i - a_i \right| + \left| \sum_{k=1}^K \left( B^n_{i,k} - B_{i,k} \right) Z_k \right| \right) \\
&\le \varepsilon + \mathbb{E} \max_i \sum_{k=1}^K \left| B^n_{i,k} - B_{i,k} \right| \cdot \left| Z_k \right| \\
&\le \varepsilon + \max_{i,k} \left| B^n_{i,k} - B_{i,k} \right| \sum_{k=1}^K \mathbb{E}\left| Z_k \right| \\
&\le \varepsilon \left( 1 + \sum_{k=1}^K \mathbb{E}\left| Z_k \right| \right).
\end{aligned}
\]
Since the random variables Z_k are i.i.d. and integrable, we have IE|Y^n − Y| → 0 as n → ∞. □

Corollary 4.1 The function h(·, j) is continuous for all j. That is, if s^n → s, then h(s^n, j) → h(s, j).

Proof. Recall that h(s^n, j) = IE V(c^n + Z∆c^n_j) − V(c^n). Because any LP always has a basic optimal solution, V can be rewritten as a finite maximum over extreme points, that is,
\[
V\left( c^n + Z \Delta c^n_j \right) = \max_i \left( c^n \right)^T x_i + Z \left( \Delta c^n_j \right)^T x_i,
\]
which means that the expected value IE V(c^n + Z∆c^n_j) can be put in the framework of Lemma 4.1 with K = 1 and
\[
a^n_i = \left( c^n \right)^T x_i, \qquad B^n_{i,1} = \left( \Delta c^n_j \right)^T x_i.
\]
If s^n → s, it follows that c^n → c^(s) and ∆c^n_j → ∆c^(s)_j for
\[
\Delta c^{(s)}_j = \frac{\Sigma^{(s)} e_j}{\sqrt{\lambda_j + \Sigma^{(s)}_{j,j}}},
\]
and h(s^n, j) → h(s, j) by Lemma 4.1. □

Lemma 4.1 can also be applied to the functions g(·, j) and ḡ. However, in the case where s^n → s and Σ^(s)_{j,j} = 0, the limit ∆c^(s)_j of ∆c^n_j = Σ^n e_j / √Σ^n_{j,j} may be undefined. Observe, however, that Σ^n_{j,j′} / √Σ^n_{j,j} ≤ √Σ^n_{j′,j′} by the Cauchy-Schwarz inequality. Thus, in the special case where Σ^(s)_{j,j} = 0 for all j, ∆c^n_j → 0 and Lemma 4.1 can be applied to show the continuity of g(·, j) at s.

We can also show that ḡ is continuous at s with Σ^(s) = 0. Repeating the computation from (27) once more, we find that
\[
\bar{g}(s^n) = \mathbb{E}^n V(c) - V(c^n).
\]
Observe that the distribution of c given s^n can be written as c^n + B^n Z where Z = (Z_1, ..., Z_K) for some finite integer K, Z_1, ..., Z_K are i.i.d. N(0, 1) random variables, and B^n is an M × K matrix satisfying Σ^n = B^n (B^n)^T. Again writing V as a finite maximum over extreme points, we obtain IE^n V(c) = IE max_i (c^n)^T x_i + Z^T (B^n)^T x_i. We can put this in the framework of Lemma 4.1 by letting
\[
a^n_i = \left( c^n \right)^T x_i, \qquad B^n_{i,k} = \sum_{j=1}^M x_{i,j} B^n_{j,k}.
\]
If Σ^n → 0 componentwise, it follows that B^n → 0 as well. Our proof of asymptotic optimality relies on the continuity of g(·, j) and ḡ only at those knowledge states s for which Σ^(s) = 0.


Proposition 4.2 Suppose that s^n is a sequence of knowledge states with s^n → s, and g(s^n, j) → 0 for all j. Then, ḡ(s^n) → 0 also.

Proof. By Lemma 9 of Frazier and Powell (2009) [19], we know that 0 ≤ h(s^n, j) ≤ g(s^n, j) for all n. It follows that h(s^n, j) → 0 for all j. By Corollary 4.1, it follows that h(s, j) = 0 for all j. Then, by Proposition 4.1, it follows that Σ^(s)_{j,j} = 0 for all j, whence Σ^(s) = 0.

Recall that ḡ(s) = IE^(s) V(c) − V(c^(s)). Because Σ^(s) = 0, it follows that c = c^(s) almost surely, whence ḡ(s) = 0. By using the continuity of ḡ at the particular state s, we have ḡ(s^n) → ḡ(s), as required. □

Proposition 4.2 corresponds to the first condition of Theorem 4.1. It remains to show that the second condition holds, completing the proof that the KG policy is asymptotically optimal.

Theorem 4.2 Let s ∉ M*. Then, s has an open neighbourhood U ⊆ S̄ such that
\[
\sup_{s' \in U} P\left( J^{KG}(s') \in B_s \right) < 1.
\]

Proof. Let s ∈ S̄ − M*. There are two possibilities: either B_s = ∅, or B_s ≠ ∅. If B_s = ∅, then P(J^{KG}(s′) ∈ ∅) = 0 for any s′, so we can take U = S̄, which is open in itself, and (26) holds.

Suppose now that s is such that B_s ≠ ∅. We argue that h(s, j) = 0 for all j ∈ B_s. To see this, take j ∈ B_s. It follows that s ∈ M_j. Then there exists a sequence s^n of knowledge states with s^n → s and g(s^n, j) → 0. It follows that h(s^n, j) → 0. By the continuity of h, it follows that h(s, j) = 0.

Now define
\[
U = \left\{ s' \in \bar{S} \;\middle|\; \max_{j \in B_s} h(s', j) < \min_{j \notin B_s} h(s', j) \right\}.
\]
This set is open by the continuity of h. We argue that s ∈ U. To see this, observe that max_{j∈B_s} h(s, j) = 0. Thus, the only way we can have s ∉ U is if h(s, j) = 0 for all j. But if this were true, by Proposition 4.1 we would have Σ^(s)_{j,j} = 0 and g(s, j) = 0 for all j. Then, for every s^n → s, we would have g(s^n, j) → 0 for all j, which would imply that s ∈ M* by Proposition 4.2. Thus, if s ∉ M*, then s ∈ U.

Finally, for any s′ ∈ U, the definition of U implies that arg max_j h(s′, j) ∉ B_s, whence we have P(J^{KG}(s′) ∈ B_s) = 0, and (26) holds. We conclude that (26) holds for any s ∉ M*. □

5. Numerical examples We now present a numerical illustration of the performance of the KG method. In many respects, the problem posed in this paper, of allocating sequential measurements to adaptively learn about cost coefficients in a linear program, is itself fundamentally new. To our knowledge, the KG policy is the first approach to be proposed for systematically making measurement decisions in this problem. Thus, our objective in this section is to illustrate how systematic measurements via KG lead us to better LP solutions than a baseline policy of making measurements at random.

The underlying LP in our example is a minimum-cost flow problem. We used the NETGEN algorithm of Klingman et al. (1974) [37] to generate a network with 50 nodes and 100 arcs. A diagram of this network is given in Figure 4. There are 10 supply nodes and 10 demand nodes. The total flow in the system is 500. The arc costs generated by NETGEN are integers between 1 and 10. Thus, when converted to the standard form of (1), our problem has 100 constraints and 200 variables.

Figure 4: Schematic of the network generated for our experiments. Squares represent supply nodes; triangles represent demand nodes. The visualization uses the Graphviz package of Ellson et al. (2002) [18].

The arc costs generated by NETGEN were used as our starting prior c^0. The initial covariance matrix Σ^0 was created as follows. First, the prior variance of the cost on arc j was set to Σ^0_{jj} = 2. Second, the correlation coefficient of arcs i and j was set to 0.25 if those arcs were adjacent. The correlation structure was thus based on the physical proximity of arcs in the network. In larger networks, it may be desirable to consider correlation structures with multiple degrees of correlation, gradually decreasing the correlation coefficient as the distance between two arcs increases. For the purposes of our demonstration, we consider a simple model where correlations are determined by adjacency only.
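A sketch of this covariance construction (our own helper, with arcs given as (tail, head) pairs and adjacency meaning that two arcs share a node):

```python
import numpy as np

def build_prior_cov(arcs, var=2.0, rho=0.25):
    """Sigma^0 with variance var on the diagonal and correlation rho
    between adjacent arcs; the covariance is rho * var since variances
    are equal. For larger networks, one should verify that the result
    is positive semidefinite."""
    M = len(arcs)
    Sigma = var * np.eye(M)
    for i in range(M):
        for j in range(i + 1, M):
            if set(arcs[i]) & set(arcs[j]):   # arcs share an endpoint
                Sigma[i, j] = Sigma[j, i] = rho * var
    return Sigma
```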

We used the CVX package of Grant and Boyd (2010) [26] to implement the algorithm for finding breakpoints given in Figure 3. In our implementation, we encountered some numerical issues. For example, the constraint s(c^n)^T x = 0 in (20) and (22) occasionally caused the solver to erroneously report the LP as being infeasible. In this situation, the equivalent constraint (c^n)^T x = (c^n)^T x(c^n) can be used for much more stable performance (Terlaky 2010 [49]). Similarly, the analogous constraints x_l(c^n)^T s = 0 and x_u(c^n)^T s = 0 in (21) and (23) can be replaced by (c^n + z∆c^n_j)^T x_l(c^n) = b^T y and (c^n + z∆c^n_j)^T x_u(c^n) = b^T y.

To evaluate the performance of a learning policy π, we first fix a vector c representing the true values of the objective coefficients. These true values are used to generate the measurements ĉ^{n+1}_{J^{π,n}(s^n)} when we follow the policy π at each time step. However, the policy does not see the exact values of c_j. After N measurements, the opportunity cost incurred by following π is defined as
\[
C^\pi = V(c) - c^T x\left( c^{\pi,N} \right)
\]
where c^{π,N} is our posterior mean at time N, based on the measurements we made according to π. This is the difference between the true optimal value of the LP and the true objective value of the solution x(c^{π,N}) that is optimal according to our final beliefs under policy π. In the optimal learning literature, the opportunity cost is also known under the name "regret" (Auer et al. 2002 [3]).
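In code, the opportunity cost is two LP solves; a sketch reusing solve_lp from Section 2:

```python
import numpy as np

def opportunity_cost(c_true, c_post, A, b):
    """C^pi = V(c) - c^T x(c^{pi,N}): the true optimal value minus the
    true value of the solution that is optimal under the posterior mean
    c_post obtained at the end of the run."""
    V_true, _ = solve_lp(c_true, A, b)
    _, x_post = solve_lp(c_post, A, b)
    return V_true - float(np.dot(c_true, x_post))
```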

As in Ryzhov and Powell (2010) [46], we consider two different prior structures. The heterogeneous-prior approach generates c from the prior distribution N(c^0, Σ^0). This represents a situation where we have access to accurate historical data, so that our starting estimate is reasonably accurate. In the equal-prior approach, we let c^0 be a vector of constants, and use the costs obtained from NETGEN to create the true objective function. We thus have less starting information; the prior gives us a rough idea of the magnitudes of the costs, but no way to distinguish between them. We considered two policies: the KG policy of Section 3, and the pure exploration policy (Explore), which measures a randomly chosen arc in each time step. The two policies were compared on 100 randomly generated cost functions of each type, and in both sets of experiments the measurement noise λ_j was set to 2 uniformly for all arcs in the network.
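The simulation loop behind these comparisons is short. A sketch (names ours) that reuses the transition function from Section 2 and accepts any decision rule, such as the KG rule (13) or the random rule of Explore:

```python
import numpy as np

def run_policy(choose_j, c_true, c0, Sigma0, lam, N, seed=0):
    """Simulate N measurements of a learning policy. choose_j takes the
    current knowledge state (c_n, Sigma_n) and returns the coefficient
    to measure; observations are drawn around the fixed truth c_true."""
    rng = np.random.default_rng(seed)
    c_n, Sigma_n = np.array(c0, dtype=float), np.array(Sigma0, dtype=float)
    for _ in range(N):
        j = choose_j(c_n, Sigma_n)
        obs = c_true[j] + np.sqrt(lam[j]) * rng.standard_normal()
        c_n, Sigma_n = transition(c_n, Sigma_n, j, obs, lam)
    return c_n, Sigma_n

def explore(c_n, Sigma_n, _rng=np.random.default_rng(1)):
    """Pure exploration baseline: measure a uniformly random arc."""
    return int(_rng.integers(len(c_n)))
```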

Figure 5(a) shows how the opportunity cost for both policies (averaged over 100 heterogeneous-prior problems) changes with the number of measurements. The KG policy assumes a commanding lead over pure exploration in the early iterations. The difference in opportunity cost between KG and Explore (in other words, the amount by which KG outperforms Explore) is shown in Figure 5(b) along with 95% confidence bounds. It is clear that KG learns significantly more efficiently. Additionally, there is a clear downward trend in the opportunity cost, indicating that KG continues to find better solutions as the measurement budget increases. This empirical result is in line with our asymptotic analysis in Section 4.

Figure 5: Results for heterogeneous-prior experiments: (a) opportunity costs for both policies, and (b) the difference in opportunity costs, shown with 95% confidence bounds.

Figure 6 shows analogous results for the equal-prior case. Because the prior conveys much less information about the true objective function, our initial solution $x(c^0)$ is much worse than in the heterogeneous-prior case. However, the KG policy is able to reduce the opportunity cost by more than half in 50 iterations, and achieves a much bigger lead over the pure exploration policy. Whereas in the heterogeneous-prior setting, we start with a good understanding of the LP, and can obtain a reasonably good solution even without making the most of our measurements, the equal-prior setting provides us with little guidance about the underlying optimization problem, making it even more important to learn efficiently.

Finally, Figure 7 shows the average number of distinct arcs measured by each policy in both types of experiments. Not surprisingly, we find that KG measures fewer distinct arcs than pure exploration, meaning that there are more arcs that get measured multiple times under KG. The number of distinct arcs measured is slightly higher in the equal-prior setting, where it is necessary to do more exploration. In the heterogeneous-prior setting, it is easier to discern the most important arcs and focus on them.


Figure 7: Number of distinct arcs measured by each policy in the (a) heterogeneous-prior and (b) equal-prior settings.

These numerical examples indicate that there is a clear benefit to incorporating the value of information, as defined by the KG logic, into our decision-making. The empirical evidence shown here reflects the particular problem setting chosen for our illustration. However, we feel that these results demonstrate that the KG algorithm proposed in this paper runs as expected, and produces successively better solutions to the underlying LP.

6. Conclusion We have posed a new type of optimal learning problem: solving an LP with unknown cost coefficients, given finitely many chances to learn about them through noisy measurements. Our uncertainty about the objective function is represented by a Bayesian belief structure. We have developed a knowledge gradient strategy for making measurement decisions efficiently. In the setting of a multivariate Gaussian prior and independent Gaussian measurements, the KG approach provides us with a closed-form expression for the value of information contributed by a single measurement to our understanding of the underlying LP. This quantity can be computed exactly by applying techniques from LP sensitivity analysis. We have shown that this policy is asymptotically optimal, meaning that it learns the true optimal value of the LP as the number of measurements becomes large. Our numerical examples show how, on average, KG gradually leads us to improve our LP solution.

We view this work as a bridge between the fields of stochastic optimization and optimal learning. Many real-world stochastic optimization problems are characterized by a high degree of uncertainty about key model parameters. Whenever approximations are used to make decisions, we run the risk of making a poor decision by relying on an inaccurate approximation. It becomes crucial to develop intelligent, systematic strategies for balancing existing approximations with the value of new information that has the potential to give us a better approximation down the road. This dilemma is studied extensively in the literature on optimal learning, but typically for a very simple underlying optimization structure. We believe that the idea explored in our work, namely the idea of quantifying and formalizing the value of information in a more general and complex optimization framework, has a great deal of potential for resolving information collection issues in many broad classes of stochastic optimization problems, of which linear programs are a very important example.

Acknowledgments This work was supported in part by AFOSR contract FA9550-08-1-0195 through the Center for Dynamic Data Analysis.

References

[1] I. Adler and R.D.C. Monteiro, A Geometric View of Parametric Linear Programming, Algorithmica 8 (1992), 161–176.

[2] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin, Network flows: Theory, algorithms and applications, Prentice Hall, New York, 1992.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47 (2002), no. 2-3, 235–256.

[4] R.E. Bechhofer, T.J. Santner, and D.M. Goldsman, Design and analysis of experiments for statistical selection, screening and multiple comparisons, J. Wiley & Sons, New York, 1995.

[5] R. Bellman, Dynamic programming, Princeton University Press, Princeton, 1957.

[6] D. A. Berry and B. Fristedt, Bandit problems, Chapman and Hall, London, 1985.

[7] D. Bertsimas and J.N. Tsitsiklis, Introduction to linear optimization, Athena Scientific, 1997.

[8] P.J. Bickel and K.A. Doksum, Mathematical statistics: basic ideas and selected topics, 2nd ed., updated printing, Prentice Hall, 2007.

[9] J.R. Birge, The value of the stochastic solution in stochastic linear programs with fixed recourse, Mathematical Programming 24 (1982), 314–325.

[10] J.R. Birge and F. Louveaux, Introduction to stochastic programming, Springer-Verlag, New York, 1997.

[11] C. Carlsson and P. Korhonen, A parametric approach to fuzzy linear programming, Fuzzy Sets and Systems 20 (1986), no. 1, 17–30.

[12] S.E. Chick, J. Branke, and C. Schmidt, Sequential Sampling to Myopically Maximize the Expected Value of Information, INFORMS J. on Computing 22 (2010), no. 1, 71–80.

[13] S.E. Chick and K. Inoue, New procedures to select the best simulated system using common random numbers, Management Science 47 (2001), no. 8, 1133–1149.

[14] , New two-stage and sequential procedures for selecting the best simulated system, Operations Research 49 (2001), no. 5, 732–743.

[15] G.B. Dantzig, Linear programming under uncertainty, Management Science 1 (1955), no. 3-4, 197–206.

[16] M. H. DeGroot, Optimal Statistical Decisions, John Wiley and Sons, 1970.

[17] T. El-Nazer and B.A. McCarl, The choice of crop rotation: a modeling approach and case study, American Journal of Agricultural Economics 68 (1986), no. 1, 127–136.

[18] J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull, Graphviz – open source graph drawing tools, Proceedings of the 9th International Symposium on Graph Drawing (P. Mutzel, M. Junger, and S. Leipert, eds.), 2002, pp. 594–597.

[19] P. I. Frazier and W. B. Powell, Convergence to Global Optimality with Sequential Bayesian Sampling Policies, Submitted for publication (2009).

[20] P. I. Frazier, W. B. Powell, and S. Dayanik, A knowledge gradient policy for sequential information collection, SIAM Journal on Control and Optimization 47 (2008), no. 5, 2410–2439.

[21] , The knowledge-gradient policy for correlated normal rewards, INFORMS J. on Computing 21 (2009), no. 4, 599–613.

[22] N. Gans, H. Shen, Y.P. Zhou, N. Korolev, A. McCord, and H. Ristock, Parametric Stochastic Programming for Call-Center Workforce Scheduling, Working paper, 2010.

[23] S. Gass and T. Saaty, The computational algorithm for the parametric objective function, Naval Research Logistics Quarterly 2 (1955), no. 1-2, 39–45.

[24] A.B. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin, Bayesian data analysis, second ed., CRC Press, 2004.

[25] J.C. Gittins, Multi-armed bandit allocation indices, John Wiley and Sons, New York, 1989.

[26] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex programming, version 1.21,http://cvxr.com/cvx, October 2010.

[27] S.S. Gupta and K.J. Miescke, Bayesian look ahead one stage sampling allocations for selecting the largest normal mean, Statistical Papers 35 (1994), 169–177.

[28] , Bayesian look ahead one-stage sampling allocations for selection of the best population, Journal of Statistical Planning and Inference 54 (1996), no. 2, 229–244.

[29] A.G. Hadigheh and T. Terlaky, Sensitivity analysis in linear optimization: Invariant support set intervals, European Journal of Operational Research 169 (2006), no. 3, 1158–1175.

[30] W.K. Haneveld and A.W. Stegeman, Crop succession requirements in agricultural production planning, European Journal of Operational Research 166 (2005), no. 2, 406–429.

[31] H. Itami, Expected Objective Value of a Stochastic Linear Program and the Degree of Uncertainty of Parameters, Management Science 21 (1974), no. 3, 291–301.

[32] B. Jansen, J.J. de Jong, C. Roos, and T. Terlaky, Sensitivity analysis in linear programming: just be careful!, European Journal of Operational Research 101 (1997), no. 1, 15–28.

[33] B. Jansen, C. Roos, T. Terlaky, and J. Vial, Interior-point methodology for linear programming: duality, sensitivity analysis and computational aspects, Tech. Report 93-28, Faculty of Technical Mathematics and Informatics, Delft University of Technology, 1993.

[34] P. Kall and S.W. Wallace, Stochastic programming, John Wiley & Sons, New York, 1994.


[35] S.H. Kim and B.L. Nelson, Selecting the best system, Handbooks of Operations Research and Management Science, vol. 13: Simulation (S.G. Henderson and B.L. Nelson, eds.), North-Holland Publishing, Amsterdam, 2006, pp. 501–534.

[36] , Recent advances in ranking and selection, Proceedings of the 39th conference on Winter simulation, IEEE Press, Piscataway, NJ, USA, 2007, pp. 162–172.

[37] D. Klingman, A. Napier, and J. Stutz, NETGEN: A program for generating large scale capacitated assignment, transportation, and minimum cost flow network problems, Management Science 20 (1974), no. 1, 814–821.

[38] V.V. Kolbin, Stochastic programming, D. Reidel Publishing Company, Dordrecht, Holland, 1977. Translated from Russian by I.P. Grigoryev.

[39] A. Madansky, Inequalities for stochastic linear programming problems, Management Science 6 (1960), no. 2, 197–204.

[40] K.G. Murty, Computational complexity of parametric linear programming, Mathematical Programming 19 (1980), 213–219.

[41] D.M. Negoescu, P.I. Frazier, and W.B. Powell, The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery, INFORMS J. on Computing (to appear) (2010).

[42] M.W. Padberg, Linear optimization and extensions, Springer, 1999.

[43] C. Roos, T. Terlaky, and J.-P. Vial, Theory and Algorithms for Linear Optimization: An Interior Point Approach, John Wiley and Sons, Chichester, UK, 1997.

[44] A. Ruszczynski and A. Shapiro, Handbooks in operations research and management science: Stochastic programming, vol. 10, Elsevier, Amsterdam, 2003.

[45] I. O. Ryzhov and W. B. Powell, The knowledge gradient algorithm for online subset selection, Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, TN, 2009, pp. 137–144.

[46] , Information collection on a graph, Operations Research (to appear) (2010).

[47] I. O. Ryzhov, W. B. Powell, and P. I. Frazier, The knowledge gradient algorithm for a general class of online learning problems, Submitted for publication (2009).

[48] I. O. Ryzhov, M. R. Valdez-Vivas, and W. B. Powell, Optimal Learning of Transition Probabilities in the Two-Agent Newsvendor Problem, Proceedings of the 2010 Winter Simulation Conference (B. Johansson, S. Jain, J. Montoya-Torres, J. Hugan, and E. Yucesan, eds.), 2010.

[49] T. Terlaky, Private communication, 2010.

[50] R.E. Wendell, The Tolerance Approach to Sensitivity Analysis in Linear Programming, Management Science 31 (1985), no. 5, 564–578.