Derivative Free Trust Region Algorithms for Stochastic
Optimization
Vijay Bharadwaj
Anton J. Kleywegt
School of Industrial and Systems Engineering
Georgia Institute of Technology
Atlanta, GA 30332-0205
September 7, 2008
1 Introduction and Review of Related Literature
In this article we study the following stochastic optimization problem. Let (Ω,F ,P) be a probability space.
Let ζ(ω) (where ω denotes a generic element of Ω) be a random variable on (Ω,F ,P), taking values in
the probability space (Ξ,G,Q), where Q denotes the probability measure of ζ. Suppose for some open set
E ⊂ Rl, F : E × Ξ → R is a real-valued function such that for each x ∈ E , F (x, ·) : Ξ → R is G-measurable
and integrable. Then, given a closed and convex set X ⊂ E , we consider the problem
    min_{x∈X}  f(x) := E_Q[F (x, ζ)]                                        (P)
If f is sufficiently smooth on X and its higher order derivatives can be computed exactly without much
effort at any x ∈ X , then there are many deterministic optimization techniques that can be used to solve
problem (P). However, in many practical problems, one is faced with the following conditions.
• For a given x ∈ X and ζ ∈ Ξ, F (x, ζ) can easily be computed exactly.
• It is prohibitively expensive to compute f(x) or its higher order derivatives exactly at any x ∈ X ,
typically because it is difficult to compute the (often multidimensional) integral that defines f(x).
• For a fixed x ∈ X and ζ ∈ Ξ, ∇xF (x, ζ) cannot easily be computed exactly, or may not even exist at
all (x, ζ), even though f is differentiable at all x.
For example, one may have a simulator that takes x as input, generates ζ according to a specified
distribution, and computes F (x, ζ) with a single simulation run. Often such a simulator does not compute
∇xF (x, ζ). This may be because the simulator code required to compute ∇xF (x, ζ) is so complicated that
the analyst does not want to take the time and effort to do the coding or run the risk of introducing errors
in the code. It may also be that the simulator is a black box to the user who wants to do the optimization
and the simulator only returns F (x, ζ) for a given x, and the user cannot or does not want to change the
simulator to also compute ∇xF (x, ζ). Also, as mentioned above, it may be that ∇xF (x, ζ) does not exist at
all (x, ζ) even though f is differentiable at all x. A combination of these reasons held in an application that
the authors worked on, and motivated the work in this paper.
Under the above assumptions it is natural to approximate f using a sample average function that can be
obtained at relatively low cost, as follows. Consider a sequence {ζ_j(ω)}_{j∈N} of i.i.d. random variables
on (Ω,F ,P) taking values in Ξ such that ζ_j is F-measurable for each j ∈ N and P is such that Q is the
measure induced on (Ξ,G) by ζ_1. Then, for each N ∈ N, we define the sample average function
fN : Rl × Ω → R as

    fN(x, ω) := (1/N) Σ_{j=1}^{N} F (x, ζ_j(ω))                             (1)
Then for each x ∈ Rl, fN(x, ω) is an unbiased and consistent estimator of f(x). That is, E_P[fN(x, ω)] = f(x)
and

    lim_{N→∞} fN(x, ω) = f(x)   for P-almost all ω
The above result follows from the strong law of large numbers. Similarly, under some stronger conditions
on the probability spaces and the differentiability of the function F (·, ζ) with respect to x, it is possible to
show that the random function ∇fN : Rl × Ω → Rl defined as

    ∇fN(x, ω) := (1/N) Σ_{j=1}^{N} ∇xF (x, ζ_j(ω))
is an unbiased and consistent estimator of ∇f(x) for each x ∈ Rl. Given any x ∈ X and N ∈ N, in order
to evaluate the sample average function for some fixed ω ∈ Ω, we must be able to generate the independent
replications {ζj = ζj(ω) : j = 1, . . . , N}. In practice, ζ is usually a real-valued random vector and hence the
set {ζj : j = 1, . . . , N} can be obtained by first generating a set {ξ1, . . . , ξN} of pseudo-random numbers that
are independent and uniformly distributed on [0, 1] and subsequently applying an appropriate transformation
to this set.
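The sample average estimator (1) can be sketched in a few lines. The integrand F(x, ζ) = (x − ζ)² and the uniform distribution for ζ are assumptions chosen purely so that f(x) = E[F(x, ζ)] has a closed form to compare against; nothing here is specific to the paper's setting.

```python
import random

# Toy integrand standing in for F(x, zeta); chosen (an assumption) so that
# f(x) = E[(x - zeta)^2] = x^2 - x + 1/3 for zeta ~ U(0,1).
def F(x, zeta):
    return (x - zeta) ** 2

def f_N(x, N, rng):
    # Sample average f_N(x, omega) = (1/N) * sum_j F(x, zeta_j(omega)),
    # equation (1); zeta_j generated from the pseudo-random stream.
    return sum(F(x, rng.random()) for _ in range(N)) / N

rng = random.Random(0)
x = 0.5
estimate = f_N(x, 100_000, rng)
exact = x * x - x + 1.0 / 3.0
print(estimate, exact)
```

By the strong law of large numbers the estimate approaches the exact value as N grows; with N = 10⁵ the two agree to a couple of decimal places.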
There are two essentially different ways in which such sampling techniques can be incorporated in
algorithms that solve the problem (P). First, let us consider the class of algorithms popularly known as the
stochastic approximation approach. A description of the simplest algorithm in this class (first proposed in
?) is sufficient to illustrate the basic sampling strategy. The algorithm essentially generates a sequence of
{xn}_{n∈N} ⊂ X , starting at some x0 ∈ X , by applying the following recursion:

    x_{n+1} = Π_X ( xn − αn ∇̂f(xn) )                                       (2)
In (2), Π_X denotes the projection operation onto the set X and ∇̂f(xn) is an estimate of ∇f(xn) which
is obtained for each iteration n as follows. For some sample size N ∈ N, we first generate N independent
replications {ζjn : j = 1, . . . , N} of the random vector ζ, that are also independent of all replications generated
in earlier iterations. Then, we set

    ∇̂f(xn) := (1/N) Σ_{j=1}^{N} ∇xF (xn, ζ_n^j)
Further, the sequence {αn}_{n∈N} of positive step-sizes is chosen to satisfy

    Σ_{n=1}^{∞} αn = ∞   and   Σ_{n=1}^{∞} αn² < ∞
Under certain regularity conditions, it can be shown that the sequence {xn}_{n∈N} (which is a sequence of
random vectors) converges with probability one to a local minimum of f in X . Such an approach is
attractive because of its relative ease of implementation. However, in practice this approach has been found
to be extremely sensitive to the choice of the step-size sequence {αn}_{n∈N}; a bad choice can lead
to an extremely slow rate of convergence. Therefore, subsequent work in this area has aimed to improve the
rate of convergence of this algorithm by averaging the gradient estimates obtained in consecutive iterations
along with the use of adaptive step-size sequences. Further, the algorithm has also been applied to non-
smooth problems where an estimate of a sub-gradient of the objective function is obtained through sampling
at each iteration. We refer the reader to ?, (?) and (?) for details. Also, ? and ? contain detailed treatments
of stochastic approximation algorithms for various cases.
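A minimal sketch of the recursion (2) with a sampled gradient estimate, under assumed toy choices (F(x, ζ) = (x − ζ)², ζ ~ U(0,1), X = [0, 2], and αn = 1/n), illustrates the basic stochastic approximation scheme:

```python
import random

def grad_F(x, zeta):
    # gradient of the toy integrand F(x, zeta) = (x - zeta)^2 (an assumption)
    return 2.0 * (x - zeta)

def project(x, lo=0.0, hi=2.0):
    # projection Pi_X onto X = [0, 2]
    return max(lo, min(hi, x))

rng = random.Random(1)
x = 2.0    # x_0
N = 4      # fresh, independent replications at every iteration
for n in range(1, 5001):
    # sampled gradient estimate, averaged over the batch
    g = sum(grad_F(x, rng.random()) for _ in range(N)) / N
    alpha = 1.0 / n                  # sum alpha_n = inf, sum alpha_n^2 < inf
    x = project(x - alpha * g)       # recursion (2)

print(x)  # approaches the minimizer x* = E[zeta] = 0.5
```

Note how every iteration spends N fresh evaluations and none are reused; this is exactly the cost pattern the paper later argues against under Assumption A 1.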
The second approach, known variously as stochastic counterpart method, sample path optimization and
sample average approximation, works as follows. We first fix some ω ∈ Ω and some N ∈ N, and generate N
independent replications of the random vector ζ denoted by {ζj = ζj(ω) : j = 1, . . . , N}. Then, using this
fixed sample, we solve the optimization problem
    min_{x∈X}  fN(x, ω) := (1/N) Σ_{j=1}^{N} F (x, ζ_j)                     (P_N)
Now, fN(·, ω) is a deterministic function of x, and we can compute fN(x, ω) and its higher
order derivatives (if they exist) exactly at any x ∈ X . Therefore, the problem (P_N) can be solved using any
appropriate deterministic optimization algorithm. This is an appealing feature, since there exists
a vast store of algorithms developed for deterministic optimization from which this choice can be made.
Suppose that

    v*_N(ω) := min_{x∈X} fN(x, ω)   and   x*_N(ω) ∈ arg min_{x∈X} fN(x, ω).

Then, it can be shown under mild conditions that as N → ∞, v*_N and x*_N converge, respectively, to the optimal
objective value and an optimal solution of the “true” problem (P), for P-almost all ω. Further, there is a well-
developed theory of statistical inference for the optimal value v*_N and optimal solution x*_N of the problem (P_N)
that helps in setting the sample size N and in the design of stopping tests. In particular, given any candidate
optimal solution x* ∈ X for the true problem, ? shows how the optimality gap f(x*) − min_{x∈X} f(x) can
be estimated by solving M sample average problems as in (P_N) for M independent realizations {ω1, . . . , ωM}.
The same article also provides methods to statistically test the validity of the first order Karush-Kuhn-Tucker
optimality conditions at the point x* using an independently generated estimate of the gradient ∇f(x*).
We refer the reader to ? and ? for details regarding the stochastic counterpart method and to ? and ? for
a general description of Monte Carlo sampling methods in the solution of stochastic programs.
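The sample average approximation workflow — fix the sample once, then hand the resulting deterministic function to any classical solver — can be sketched as follows. The integrand, sample size, and use of ternary search (valid here because the sample average is convex in x) are all illustrative assumptions.

```python
import random

def F(x, zeta):
    return (x - zeta) ** 2   # toy integrand (an assumption)

# Fix omega: draw the whole sample once, up front.
rng = random.Random(2)
sample = [rng.random() for _ in range(10_000)]

def f_N(x):
    # deterministic once the sample is fixed
    return sum(F(x, z) for z in sample) / len(sample)

# Deterministically minimize f_N over X = [0, 2] by ternary search;
# f_N is convex in x here, so this converges to the SAA solution.
lo, hi = 0.0, 2.0
for _ in range(100):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if f_N(m1) < f_N(m2):
        hi = m2
    else:
        lo = m1
x_star_N = (lo + hi) / 2
print(x_star_N)  # near the true minimizer x* = 0.5 for large N
```

Here the SAA minimizer is simply the sample mean of the ζ's, so it converges to the true minimizer 0.5 as the fixed sample grows.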
In this paper, we deal with a specific class of stochastic programs that satisfy, apart from the previously
stated assumptions regarding the intractability of computing f or its higher order derivatives, the following
assumptions.
A 1. (a) The cost of computing F (x, ζ) for a given x ∈ Rl and ζ ∈ Ξ, while being small relative to the
cost of computing f(x), is still large enough to warrant attempts to lower the number of evaluations of
F (x, ζ) as much as possible.
(b) No sensitivity measures related to F can be computed directly. That is, if F (·, ζ) is continuously
differentiable with respect to x for some ζ ∈ Ξ, then we assume that ∇xF (x, ζ) cannot be evaluated for any
x ∈ X . Otherwise, if F (·, ζ) is convex for some ζ ∈ Ξ, then we assume that a subgradient cannot be
computed for any x ∈ X .
(c) The function f is continuously differentiable on E.
Indeed, the first two statements of Assumption A 1 hold quite often in the case of simulation optimization.
In many practical settings, F (x, ζ) is computed by a large and complex computer simulation model whose
source code is proprietary and hence unavailable. Obviously, apart from making the computation of F (x, ζ)
very expensive, such a situation also precludes the use of automatic differentiation techniques to calculate
the gradient ∇xF (x, ζ) (if it exists).
If F (·, ζ) is continuously differentiable on E for Q-almost all ζ, then it is easy to see that f is also
continuously differentiable on E . However, many examples of stochastic optimization problems exist where
f is continuously differentiable on E even though F (·, ζ) is only Lipschitz continuous on E for every ζ ∈ Ξ.
The following is a simple example of such a situation.
Example 1.1. Consider the well known news-vendor problem. A company has to decide the quantity x of
a seasonal product, to order. The product can be purchased at a unit cost of c and sold at a unit price of
r > c while in season. After that however, the remaining unsold stock can only be sold at a unit salvage
price of s, where s < c. Suppose further, that the demand ζ ∈ R+ for that product is uncertain and has
a distribution function G : R −→ [0, 1] associated with it that is continuous on R. In order to make the
decision regarding the quantity x, the company wishes to solve the optimization problem

    max_{x∈[0,∞)}  f(x) := E[F (x, ζ)]

where the profit F : R+ × R+ → R can be written, for any fixed x and ζ, as

    F (x, ζ) := { (s − c)x + (r − s)ζ   if x ≥ ζ
                { (r − c)x              if x < ζ
Indeed, for any given ζ ≥ 0, F (·, ζ) is not differentiable at x = ζ.
However, the expected profit f : R+ → R can be shown to be given, for any x ∈ R+, by

    f(x) = E[F (x, ζ)]
         = ∫_(−∞,∞) F (x, ζ) dG(ζ)
         = (r − c)x − (r − s) ∫_[0,x] G(ζ) dζ
Since G is known to be continuous on R, it follows that f is continuously differentiable on R+.
It can be shown that the same situation occurs in many two stage stochastic linear programming problems,
where the random data have densities associated with their distributions.
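A quick numerical check of the news-vendor example, under assumed parameter values r = 10, c = 6, s = 2 and demand ζ ~ U(0,1) (so G(t) = t), compares a Monte Carlo estimate of f(x) against the closed form above:

```python
import random

r, c, s = 10.0, 6.0, 2.0   # price, cost, salvage (illustrative values)

def F(x, zeta):
    # profit for order quantity x and realized demand zeta
    return (s - c) * x + (r - s) * zeta if x >= zeta else (r - c) * x

def f_exact(x):
    # f(x) = (r - c) x - (r - s) * integral_0^x G(t) dt with G(t) = t,
    # valid for x in [0, 1] when zeta ~ U(0,1)
    return (r - c) * x - (r - s) * x * x / 2.0

rng = random.Random(3)
x = 0.5   # critical-fractile optimum here: G(x*) = (r - c)/(r - s) = 0.5
N = 200_000
estimate = sum(F(x, rng.random()) for _ in range(N)) / N
print(estimate, f_exact(x))
```

Although F(·, ζ) has a kink at x = ζ, the expectation f is smooth, and the Monte Carlo average reproduces the closed-form value to within sampling error.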
Assumption A 1 immediately suggests that any sampling-based solution methodology used on (P) should
attempt to incorporate the following features.
• The sample sizes used in the algorithm have to remain manageably small. Of course, more precise
statements regarding the sample size can be made only if the computational effort required to evaluate
F (x, ζ) and the total available computational budget are known.
• Once F (x, ζ) is evaluated for a given x and ζ, this value must be reused as much as possible.
Now, let us again consider the two approaches to solving the problem (P) that we discussed earlier, in light of
Assumption A 1 and the above requirements. In the case of stochastic approximation, the same algorithmic
recursion as in (2) can be used along with a biased gradient estimator which is generated as follows. Given
xn ∈ X , cn > 0 and some N ∈ N, we generate independent realizations ζ_n^{ij} ∈ Ξ for i = 0, . . . , l (where l is
the dimension of the search space) and j = 1, . . . , N, that are also independent of all realizations generated
earlier in the procedure. Then, we define the gradient estimator as

    ∇̂f(xn) := (1/(N cn)) ( Σ_{j=1}^{N} [F (xn + cn e_1, ζ_n^{1j}) − F (xn, ζ_n^{0j})],
                            . . . ,
                            Σ_{j=1}^{N} [F (xn + cn e_l, ζ_n^{lj}) − F (xn, ζ_n^{0j})] )^T
where ei denotes the unit vector in the i-th coordinate direction in Rl. Essentially, this is a finite difference
gradient approximation where the function values f(xn + cnei) for i = 1, . . . , l are approximated by sample
averages found using independently generated samples. This idea, first suggested in ?, has the advantage that
convergence can be shown even for a sample size N = 1. However, it is explicitly required in the algorithm
that each new gradient estimate be generated using independent samples. Therefore, each iteration of the
algorithm requires (l + 1)×N new evaluations of the function F . Thus, given also the slow progress of this
algorithm due to small step-sizes, the stated algorithm may not be suitable for solving (P) under Assumption
A 1. However, subsequent research in this field has resulted in the development of gradient approximations
that can be generated using at most 2N evaluations of F . Algorithms that use such gradient approximations
are collectively known in the literature as Simultaneous Perturbation Stochastic Approximation (SPSA). A
combination of these gradient approximations, used along with adaptive step sizes, may prove useful in solving
(P). We refer the reader to ? for the basic SPSA algorithm and its convergence analysis.
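The finite-difference estimator above can be sketched as follows; the toy integrand, the dimension l = 2, and the batch sizes are assumptions. Note the (l + 1) independent batches: one at the base point xn and one per perturbed point, matching the (l + 1) × N evaluation cost discussed in the text.

```python
import random

l = 2                       # dimension of the search space (assumption)
x = [1.0, 1.0]

def F(x, zeta):
    # toy integrand ||x - zeta||^2 with zeta uniform on [0,1]^2 (assumption)
    return sum((xi - zi) ** 2 for xi, zi in zip(x, zeta))

def sample_mean_F(x, N, rng):
    # one independent batch of N replications at the point x
    return sum(F(x, [rng.random() for _ in range(l)]) for _ in range(N)) / N

rng = random.Random(4)
N, c_n = 50_000, 0.05
base = sample_mean_F(x, N, rng)          # batch {zeta_n^{0j}}
grad_hat = []
for i in range(l):
    x_pert = list(x)
    x_pert[i] += c_n                     # x_n + c_n * e_i
    # batch {zeta_n^{ij}}, independent of the base batch
    grad_hat.append((sample_mean_F(x_pert, N, rng) - base) / c_n)

print(grad_hat)
```

For this integrand the true gradient at x = (1, 1) is 2(x − E[ζ]) = (1, 1); the estimate carries both Monte Carlo noise and an O(cn) forward-difference bias.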
In this paper however, in order to be able to reuse evaluations of F , we will not consider algorithms that
require independent samples to be generated at each iteration. Accordingly, let us look at the sample average
approximation approach in light of Assumption A 1. As we noted earlier, having fixed the sample size N
and some ω ∈ Ω, (P_N) becomes a deterministic optimization problem. With Assumption A 1 however, it is
clear that sensitivity measures of the sample average function fN (·, ω) like derivatives or subgradients, are
also unavailable. Fortunately, there exist several different types of iterative algorithms for the optimization
of deterministic functions, without the use of derivative information. The literature in this area of research
is quite extensive but not all the different approaches are suited for solving the problem (P) under the
assumption that function evaluations are expensive. Hence, we will mention only a couple of ideas that are
related to the approach that we propose in this paper, and refer the reader to ? for an excellent review of
the state of the art in derivative free algorithms.
The simplest approach may be to use a direct search algorithm, i.e., an algorithm that proceeds using only
direct comparisons of objective function values at different points. Common examples of such algorithms
include the simplex reflection algorithm of ? and the Parallel Direct Search and Multi-directional Search
algorithms proposed respectively in ? and ?. Indeed, if fN (·, ω) is not continuously differentiable on X ,
then there is often no alternative but to use such direct search algorithms to solve (P_N). On the other
hand, if fN (·, ω) is smooth, then such algorithms tend to progress slower than algorithms that exploit the
smoothness of fN .
When fN (·, ω) is smooth, then one strategy is to use traditional algorithms for deterministic smooth
optimization but with finite difference approximations of the gradient (and if possible, the Hessian). But
again, in general, it is unlikely that function evaluations can be reused in such an algorithm. Indeed, it is
extremely improbable that at some iteration n, given the current solution xn, the function fN has already
been evaluated at xn + cnei where ei is a unit vector in some coordinate direction. Thus, it is highly likely
that such a method will require (l+1)×N new function evaluations at each iteration, in order to approximate
just the gradient.
Recently however, there has been considerable interest in trust region algorithms for derivative free
unconstrained deterministic optimization of smooth functions under the assumption that function evaluations
are expensive. The idea is that instead of trying to approximate the unavailable higher order derivatives
of the objective function, we could construct a polynomial model that approximates the objective function
itself in a neighborhood of interest, using interpolation. Let us explore this approach in greater detail since
its core ideas are also applicable to the class of algorithms that we propose in this paper. Assume for the
moment that the feasible set X = Rl and that fN(·, ω) : Rl → R is sufficiently smooth. At iteration n,
given the current iterate xn, we construct an interpolation set Yn with the appropriate number of points.
That is, if we are interested in approximating fN with a linear model then there are l + 1 parameters to be
determined and hence the set Yn must contain l+1 points including xn. Similarly, if the model is quadratic,
then Yn must contain 1 + l + l(l + 1)/2 points including xn. Obviously, we try to include as many points
in the interpolation set as possible, at which the function fN has been evaluated already. Then, we find the
parameters in the model mn : Rl → R using the interpolation equations

    mn(x_i) = fN(x_i, ω)   for i = 1, . . . , |Yn|                          (3)
Finally, the model mn is optimized in a trust-region Tn := {x ∈ Rl : ‖x− xn‖ ≤ Δn} which (hopefully)
yields a point with a lower objective function value. The points in the interpolation set Yn and the trust
region radius Δn are updated as the algorithm progresses.
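In the simplest case l = 1, the interpolation step can be carried out in closed form: three points determine the quadratic model, the interpolation equations (3) hold exactly, and the model is then minimized over the trust region. The test function and the symmetric placement of the interpolation points are assumptions made for illustration.

```python
# Quadratic interpolation model for l = 1: Y_n = {x_n - D, x_n, x_n + D}
# supplies exactly the 1 + l + l(l+1)/2 = 3 points needed.
def f(x):
    return x ** 4 - 2.0 * x ** 2 + x   # stand-in objective (assumption)

x_n, delta = 1.5, 0.5
# m(x) = a + b (x - x_n) + 0.5 h (x - x_n)^2; the symmetric points give
# closed-form interpolation coefficients.
a = f(x_n)
b = (f(x_n + delta) - f(x_n - delta)) / (2.0 * delta)
h = (f(x_n + delta) + f(x_n - delta) - 2.0 * f(x_n)) / delta ** 2

def m(x):
    d = x - x_n
    return a + b * d + 0.5 * h * d * d

# The interpolation equations (3) hold at every point of Y_n:
for y in (x_n - delta, x_n, x_n + delta):
    assert abs(m(y) - f(y)) < 1e-9

# Trust-region step: minimize m over T_n = [x_n - delta, x_n + delta]
candidates = [x_n - delta, x_n + delta]
if h > 0:
    candidates.append(min(x_n + delta, max(x_n - delta, x_n - b / h)))
x_new = min(candidates, key=m)
print(x_new, f(x_new), f(x_n))
```

Minimizing the interpolated model inside the trust region produces a point with a lower objective value without ever evaluating a derivative of f.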
This idea has its origins in an algorithm proposed in ?, where the author proposed the use of a quadratic
interpolation model. Much later, ? suggested an algorithm for constrained optimization where both the
objective function and the constraints were modeled using linear interpolation. However, there is some
crucial insight regarding such methods in ? that relates the interpolation set Yn to the quality of the model
mn as an approximation to the objective function within the trust region. Consider a model function that
is traditionally used in trust region algorithms.
    mn(x) := fN(xn, ω) + ∇fN(xn, ω)^T (x − xn) + (1/2)(x − xn)^T Hn (x − xn)        (4)

In (4), Hn ∈ Sl×l (where Sl×l is the space of real symmetric l × l matrices) is either set to ∇²fN(xn, ω)
or obtained using a quasi-Newton update. In any case, if fN(·, ω) ∈ C²(Tn) and ‖∇²fN(x, ω)‖₂ ≤ Kn for
all x ∈ Tn, we can derive the following bound on |fN(x, ω) − mn(x)| for x ∈ Tn, using the Taylor series
expansion of fN(·, ω) at xn:

    |fN(x, ω) − mn(x)| = | fN(x, ω) − fN(xn, ω) − ∇fN(xn, ω)^T (x − xn) − (1/2)(x − xn)^T Hn (x − xn) |
                       = | (1/2)(x − xn)^T ∇²fN(xn + tn(x − xn), ω)(x − xn) − (1/2)(x − xn)^T Hn (x − xn) |,  for some tn ∈ [0, 1],
                       ≤ [ (‖Hn‖₂ + Kn) / 2 ] Δn²                                   (5)
Such a bound relating the accuracy of the model function to the trust region radius is critical to the proper
performance of any trust region algorithm.
Now, instead of (4), suppose a quadratic model given by

    mn(x) := fN(xn, ω) + βn^T (x − xn) + (1/2)(x − xn)^T Λn (x − xn),               (6)

that is obtained by solving the interpolation equations in (3) for the components of βn ∈ Rl and Λn ∈ Sl×l,
is used in a trust region algorithm. In order for such an algorithm to be successful, the interpolation points
in Yn must be chosen such that the resulting model satisfies a bound similar to that in (5). It is well
known that for a finite difference gradient approximation, the error in the approximation decreases to zero
as the step-size used to calculate the gradient approximation reduces to zero. Analogously, the interpolation
points in Yn must be chosen to lie in a sufficiently small neighborhood of xn in order for βn and Λn to be
accurate estimates of ∇fN(xn, ω) and ∇²fN(xn, ω) respectively. Further, it can be shown that under certain
conditions, as the points in Yn get progressively closer to xn, βn converges to ∇fN(xn, ω) and Λn converges
to ∇²fN(xn, ω). However, in order to obtain a bound similar to that in (5), it turns out that the points in Yn
also have to satisfy certain geometric conditions imposed on their positions relative to xn. In particular, the
interpolating points cannot lie on any quadratic surface in Rl. If they do, then it can be shown that the
equations in (3) have multiple solutions and that there exist solutions that give rise to models mn that are
bad approximations of the objective function within the trust region. In order to avoid such a situation, in
practice we usually require that the interpolation points should be sufficiently far away from any quadratic
surface in Rl. An interpolation set that satisfies such a condition is referred to as being well poised.
? defined the set Yn for each n ∈ N to consist of l + l(l + 1)/2 points chosen entirely from points
at which the objective function had been previously evaluated and assumed that at each n ∈ N, Yn would
automatically satisfy the geometric requirement mentioned above. ? however, ensured that the interpolation
set is updated in such a way that it continues to be well poised throughout. All subsequent algorithms that fit
in this framework have included in some form or the other, methods to keep the interpolation set well-poised.
? suggested a variant that used quadratic models and showed how Lagrange interpolation polynomials could
be used to add new points to the interpolation set to ensure its well-poisedness. The same paper also reported
some promising numerical results. Further, ? gave the first convergence proof for such algorithms and also
showed how Newton Fundamental Polynomials can be used to build quadratic models even when there are
not sufficiently many points in the interpolation set to exactly specify a fully quadratic model. ? provides a
good introduction to the use of Lagrange and Newton polynomials in order to maintain the well-poisedness
of the interpolation set. ? proposed an interesting variant of this idea, where instead of evaluating the
objective function at new points purely in order to keep the interpolation set well-poised, the model function
is minimized over a modified (albeit non-convex) trust region, which ensures that the minimizer, when added
to the interpolation set, will keep it well-poised. The authors show significant gains over the use of finite
difference gradients along with quasi-Newton updates for the Hessian. In a series of recent papers, ?,(?)
and (?) has revisited the idea of using Lagrange interpolation polynomials and has also investigated quasi-
Newton-like updating formulas for the Hessian matrix of the model found using interpolation, in an attempt
to reduce the routine linear algebra work in each iteration.
Indeed, the computational experience reported in the references given above seems to indicate that
if the sample average function is smooth enough, then solving the sample average problem (P_N) using such
interpolation models can result in substantial reductions in the number of required evaluations of the function
F . However, rather than working with a fixed sample size N , we believe that further reduction in the number
of evaluations of F can be achieved by gradually increasing the sample size in an adaptive manner as the
algorithm progresses. The following considerations motivate this belief.
Recall that we are looking to solve the problem (P) and we wish to do this by generating a sequence
{xn}_{n∈N} ⊂ X of points with successively lower objective function values. However, instead of f , we have
access only to a sequence {fN}_{N∈N} of sample average functions that converge to f as the sample size
N increases to infinity. Thus, by solving the sample average problem (P_N), we generate a sequence with
successively lower values of the sample average function fN and hope that this sequence also achieves
reduction in the objective function. Under this setting, the following rationale naturally emerges.
• In the initial stages of the algorithm, when large reductions are possible in the value of f , a crude
approximation of the objective function may be sufficient to generate points with lower values of f .
Accordingly, a sample average approximating function fN obtained using a small sample size N may
be sufficient to achieve improvement. Therefore, it makes sense to start the optimization process with
a small sample size.
• Further, we should not use a large number of function evaluations in generating minute reductions in
the value of the sample average function unless we are confident that this will also lead to improvements
in the value of f . Therefore, we use a sample size N only as long as the reductions obtained in the value
of fN are significant compared to some measure of the error between fN and f .
With this survey of some of the issues involved in the solution of (P) and various algorithmic approaches,
next we turn to a class of algorithms that incorporates the aforementioned refinements.
2 A Class of Derivative Free Trust Region Algorithms
In this paper, we propose and analyze a class of iterative algorithms for solving the problem (P) under
Assumption A 1, when f is continuously differentiable on X . Our class of algorithms incorporates Monte
Carlo sampling techniques along with the use of polynomial model functions in a trust region framework.
When the objective function f : X −→ R and its gradient ∇f(x) can be evaluated easily, a typical trust
region algorithm used to find the stationary points of f in X , works as follows.
Algorithm 1. Let us set the constants used in the algorithm as
    0 < η1 ≤ η2 < 1,   0 < σ1 ≤ σ2 < 1,   Δmax > 0
Also let the initial feasible point be x0 ∈ X and the initial trust region radius be 0 < Δ0 < Δmax. Then, for
any iteration n and current solution xn ∈ X , we generate the next point xn+1 in the following manner.
Step 1: Define a model function mn : X −→ R that approximates f within a trust region Tn := {xn + d : ‖d‖ ≤ Δn}
Step 2: Find xn + dn ∈ arg min {mn(x) : x ∈ X ∩ Tn}.
Step 3: Evaluate

    ρn = [ f(xn) − f(xn + dn) ] / [ mn(xn) − mn(xn + dn) ]

If ρn ≥ η1, then set xn+1 = xn + dn; else set xn+1 = xn.

Step 4: Update the trust region radius as

    Δn+1 ∈ [Δn, Δmax]       if ρn ≥ η2
    Δn+1 ∈ [σ2Δn, Δn]       if ρn ∈ [η1, η2)
    Δn+1 ∈ [σ1Δn, σ2Δn]     if ρn < η1
Thus, at any iteration n, given the current iterate xn ∈ X , we first define trust region Tn, which is a
ball of radius Δn, centered at xn and defined in the norm ‖·‖ on Rl. We will refer to Δn as the trust region
radius and ‖·‖ as the trust region norm. Next, we define a model function mn : X → R that approximates
the function f in X ∩ Tn. Since f(xn) and ∇f(xn) can be evaluated easily, mn is defined to be a quadratic
function of the form

    mn(x) = f(xn) + ∇f(xn)^T (x − xn) + (1/2)(x − xn)^T Hn (x − xn)                 (7)

where Hn ∈ Sl×l. Further, if ∇²f(xn) is also available, then we set Hn = ∇²f(xn). Otherwise Hn may
be obtained for example, via a quasi-Newton update. After defining the model function and trust region,
we find an approximate minimizer xn + dn for mn within the trust region Tn, that satisfies a minimum
improvement condition (which we will later elaborate on). We evaluate the relative improvement ρn, which
is the actual decrease in the value of f at xn + dn as compared to xn, divided by the decrease predicted by
mn. If there is sufficient decrease in the function f then we update the solution for the next iteration as
xn+1 = xn + dn and otherwise we set xn+1 = xn. Finally, depending on the relative improvement, we either
reduce, leave unchanged, or increase the trust region radius Δn for the next iteration.
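A one-dimensional rendering of Algorithm 1, assuming the gradient and Hessian are available exactly and using a simple quadratic test function (both assumptions; real implementations solve the trust-region subproblem only approximately):

```python
# Algorithm 1 in 1-D with an exact quadratic model (7).
def f(x):  return (x - 3.0) ** 2   # test objective (assumption)
def df(x): return 2.0 * (x - 3.0)
H = 2.0                            # constant Hessian of this f

eta1, eta2 = 0.25, 0.75
sigma1, sigma2 = 0.25, 0.5
delta_max = 10.0

x, delta = 0.0, 1.0                # x_0 and Delta_0
for n in range(100):
    # Steps 1-2: minimize the model over [x - delta, x + delta]
    # (for H > 0 in 1-D this is the clamped Newton step)
    d = max(-delta, min(delta, (x - df(x) / H) - x))
    pred = -(df(x) * d + 0.5 * H * d * d)   # m_n(x_n) - m_n(x_n + d_n)
    if pred <= 1e-15:
        break                               # no predicted decrease left
    rho = (f(x) - f(x + d)) / pred          # Step 3
    if rho >= eta1:
        x = x + d
    # Step 4: update the trust-region radius within the allowed intervals
    if rho >= eta2:
        delta = min(2.0 * delta, delta_max)
    elif rho >= eta1:
        delta = sigma2 * delta
    else:
        delta = sigma1 * delta

print(x)  # converges to the minimizer x* = 3
```

Because the model is exact here, ρn = 1 at every step, the radius grows, and the iterates reach the minimizer in a few iterations; the acceptance test and radius update are what make the method robust when the model is only approximate.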
In order to solve the optimization problem (P) under Assumption A 1, using a trust region algorithm,
several modifications have to be made to Algorithm 1. First, since f(x) cannot be evaluated exactly for
any x ∈ Rl, we will instead have to use sample averages wherever required. The use of sampling in our
algorithms is akin to the sample average approximation approach. That is, we fix ω ∈ Ω and generate the
sequence {ζ_i := ζ_i(ω)}_{i∈N} ⊂ Ξ before we begin the optimization process. All the sample averages required
by the algorithm, are taken with respect to this sample. Therefore, as far as our algorithms are concerned,
our sample average functions are deterministic functions whose parameters are x ∈ Rl and the sample size
N . Accordingly, it will be convenient for us to discard the dependence on ω ∈ Ω from the notation for the
sample average functions and denote the sample average function for any x ∈ Rl and a sample size N ∈ N
as f(x,N), where f : Rl × N → R is given by

    f(x,N) := (1/N) Σ_{j=1}^{N} F (x, ζ_j)                                          (8)
Obviously, the model function defined in (7) cannot be used since neither f(xn) nor ∇f(xn) can be
evaluated exactly. Therefore, we propose the following alternative model function.
    mn(x) := f(xn, N_n^0) + ∇nf(xn)^T (x − xn) + (1/2)(x − xn)^T ∇²nf(xn) (x − xn)  (9)

In (9), since f cannot be evaluated exactly, we fix a sample size N_n^0 and use the sample average f(xn, N_n^0) to
approximate f(xn). Similarly, ∇nf(xn) ∈ Rl and ∇²nf(xn) ∈ Sl×l are approximations to the gradient ∇f(xn)
and Hessian ∇²f(xn) respectively, obtained as follows. We evaluate the sample averages f(xn + y_n^i, N_n^i) at Mn
points {xn + y_n^1, . . . , xn + y_n^{Mn}} in a neighborhood of xn, using sample sizes N_n^i ∈ N. We will refer to the points
{xn + y_n^i : i = 1, . . . ,Mn} as design points and to the corresponding vectors {y_n^i : i = 1, . . . ,Mn}
as perturbations. Then, we assign non-negative weights {w_n^1, . . . , w_n^{Mn}} to the design points and find
∇nf(xn) and ∇²nf(xn) such that

    (∇nf(xn), ∇²nf(xn)) ∈ arg min_{(β,Λ) ∈ Rl × Sl×l}  Σ_{i=1}^{Mn} w_n^i [ f(xn + y_n^i, N_n^i) − f(xn, N_n^i) − β^T y_n^i − (1/2)(y_n^i)^T Λ y_n^i ]²    (10)
Thus, our model function mn for each n ∈ N is a quadratic polynomial which best fits (in the least squares
sense) the sample average function values at the design points. We will refer to mn as defined in (9) as the
regression model function, since we essentially perform linear regression to obtain the model.
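In the scalar case l = 1, the fitting problem (10) reduces to a 2×2 weighted least-squares system that can be solved by hand. The target function, perturbations, and unit weights below are illustrative assumptions; because the sample averages are taken noiseless and the target is exactly quadratic, the fit recovers the first and second derivatives exactly.

```python
# Regression model fit (10) for l = 1, where Lambda is a single number.
def f(x):
    return x ** 2   # stand-in for the sample average function (assumption)

x_n = 1.0
perturbations = [-0.2, -0.1, 0.1, 0.2, 0.3]   # y_n^i
weights = [1.0] * len(perturbations)          # w_n^i > 0

# Least squares for d_i = beta*y_i + 0.5*lam*y_i^2, with
# d_i = f(x_n + y_i) - f(x_n). Normal equations form a 2x2 system.
d = [f(x_n + y) - f(x_n) for y in perturbations]
a11 = sum(w * y * y for w, y in zip(weights, perturbations))
a12 = sum(w * 0.5 * y ** 3 for w, y in zip(weights, perturbations))
a22 = sum(w * 0.25 * y ** 4 for w, y in zip(weights, perturbations))
b1 = sum(w * di * y for w, di, y in zip(weights, d, perturbations))
b2 = sum(w * di * 0.5 * y * y for w, di, y in zip(weights, d, perturbations))

det = a11 * a22 - a12 * a12     # nonzero for these well-placed points
beta = (b1 * a22 - a12 * b2) / det
lam = (a11 * b2 - a12 * b1) / det

print(beta, lam)  # f'(1) = 2 and f''(1) = 2 are recovered exactly
```

With noisy sample averages the same system yields regression (rather than interpolation) estimates, which is precisely why the choice of design points, sample sizes, and weights governs the model's accuracy.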
It is well known that the success of any trust region algorithm depends crucially on its ability to monitor
and adaptively improve the accuracy of the model function in approximating the objective function within
the trust region. When a Taylor series based model function as in (7) is used in Algorithm 1, Steps 3 and 4
serve this purpose. At iteration n, the relative improvement ρn evaluated in Step 3, serves as an indicator
for the quality of the model. Whenever ρn is small (ρn < η1), a reduction in the trust region radius in Step
4, improves the accuracy of the model within the trust region.
However, when a regression model function as in (9) is used, its accuracy in approximating f within the
trust region depends not only on the trust region radius, but also on the quality of f(xn, N_n^0), ∇nf(xn)
and ∇²nf(xn) as approximations to f(xn), ∇f(xn) and ∇²f(xn) respectively. Therefore, apart from controlling
the trust region radius, we must appropriately choose N_n^0, the number of design points Mn, the locations
of the design points {xn + y_n^i : i = 1, . . . ,Mn}, the corresponding sample sizes {N_n^i : i = 1, . . . ,Mn} and
weights {w_n^i : i = 1, . . . ,Mn} for each n ∈ N, so that the resulting model function may be sufficiently
accurate. Perhaps more importantly, in light of Assumption A 1, we must also seek to minimize the number
of evaluations of the function F used in the optimization process.
Accordingly, in Section 3, we consider the accuracy of the regression model function mn defined in (9) as
an approximation of f and describe procedures to pick the design points, sample sizes and weights required
to construct mn with a specified accuracy. Subsequently, we describe the working of a typical trust region
algorithm that uses such regression model functions and show its convergence in Section ??.
3 The Regression Model Function

In this section, we analyze the properties of the regression model function as defined in (9) and develop practical schemes to appropriately pick the design points, sample sizes and weights required for its construction.
Consider a sequence {x_n}_{n∈N} ⊂ X, where X ⊂ R^l is assumed to be closed. For each n ∈ N, we suppose that f(x_n, N^0_n) is calculated for some sample size N^0_n. Also, we assume that the sample averages {f(x_n + y^i_n, N^i_n) : i = 1, …, M_n} are evaluated at a set of design points {x_n + y^i_n : i = 1, …, M_n} ⊂ R^l using the sample sizes {N^i_n : i = 1, …, M_n}. Finally, we assume that, using a set of non-negative weights {w^i_n : i = 1, …, M_n}, ∇_n f(x_n) ∈ R^l and ∇²_n f(x_n) ∈ S^{l×l} are determined to satisfy
    (∇_n f(x_n), ∇²_n f(x_n)) ∈ argmin_{(β,Λ) ∈ R^l × S^{l×l}} Σ_{i=1}^{M_n} w^i_n [ f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n) − β^T y^i_n − (1/2)(y^i_n)^T Λ y^i_n ]²   (11)
Note: Without loss of generality, we can assume that w^i_n > 0 for i = 1, …, M_n. This is because setting w^i_n = 0 for some i ∈ {1, …, M_n} is equivalent to not including the design point x_n + y^i_n in the set used to determine ∇_n f(x_n) and ∇²_n f(x_n). Therefore, we assume that only the points that had positive weights associated with them were included in the set of design points to begin with. For the same reason, we also assume without loss of generality that ∥y^i_n∥₂ > 0 for each i = 1, …, M_n and n ∈ N.
Our aim is to first establish when the accuracy of the regression model function approaches that of the "exact" model function defined in (7) as n → ∞. In particular, we investigate conditions on the choice of the various quantities required to find f(x_n, N^0_n), ∇_n f(x_n) and ∇²_n f(x_n) that are sufficient to ensure |f(x_n, N^0_n) − f(x_n)| → 0, ∥∇_n f(x_n) − ∇f(x_n)∥₂ → 0 and ∥∇²_n f(x_n) − ∇²f(x_n)∥₂ → 0 as n → ∞. Let us start with a result regarding |f(x_n, N^0_n) − f(x_n)|.
Lemma 3.1. Suppose the following assumptions hold.

A 2. For any compact set D ⊂ E, f is continuous on D and the sequence {f(·, N)}_{N∈N} converges uniformly to f on D, i.e.,

    lim_{N→∞} sup_{x∈D} |f(x, N) − f(x)| = 0   (12)

A 3. The sequence {N^0_n}_{n∈N} of sample sizes satisfies N^0_n → ∞ as n → ∞.

Then, for any sequence {x_n}_{n∈N} ⊂ C ⊂ X such that C is compact, we get

    lim_{n→∞} |f(x_n, N^0_n) − f(x_n)| = 0

In particular, if x_n → x, then

    lim_{n→∞} f(x_n, N^0_n) = f(x)
Proof. For each n ∈ N we have

    |f(x_n, N^0_n) − f(x_n)| ≤ sup_{x∈C} |f(x, N^0_n) − f(x)|

Therefore, taking limits as n → ∞ and noting that N^0_n → ∞, we get

    lim_{n→∞} |f(x_n, N^0_n) − f(x_n)| ≤ lim_{n→∞} sup_{x∈C} |f(x, N^0_n) − f(x)| = 0

Next, from the continuity of f on C, we get

    lim_{n→∞} |f(x_n, N^0_n) − f(x)| ≤ lim_{n→∞} |f(x_n, N^0_n) − f(x_n)| + lim_{n→∞} |f(x_n) − f(x)| = 0
Thus, we see that if the sequence {f(·, N)}_{N∈N} of sample average functions converges uniformly to f on some compact set C ⊃ {x_n}_{n∈N}, then the accuracy of f(x_n, N^0_n) as an approximation of f(x_n) can be improved by increasing the sample size N^0_n.
Next, we consider the accuracy of ∇_n f(x_n) and ∇²_n f(x_n) as approximations of ∇f(x_n) and ∇²f(x_n). In particular, we will first consider conditions sufficient to ensure that

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ = 0
Accordingly, let us define the notation required to represent and characterize the set of optimal solutions on the right side of (11). First, let the perturbation matrix Y_n ∈ R^{M_n×l} be defined by stacking the perturbation vectors as rows,

    Y_n := ( (y^1_n)^T ; (y^2_n)^T ; … ; (y^{M_n}_n)^T )   (13)

Let N_n := {N^1_n, …, N^{M_n}_n} denote the set of sample sizes used to evaluate the sample averages at the design points. Define f(x, Y_n, N_n) ∈ R^{M_n} for any x ∈ R^l as the vector with ith component

    f(x, Y_n, N_n)_i := f(x + y^i_n, N^i_n) − f(x, N^i_n),  i = 1, …, M_n   (14)

Also, we define f(x, Y_n) ∈ R^{M_n} for any x ∈ R^l as the vector with ith component

    f(x, Y_n)_i := f(x + y^i_n) − f(x),  i = 1, …, M_n   (15)
Next, we develop notation to represent the quadratic forms (y^i_n)^T Λ y^i_n in (11). Let y ∈ R^l be any vector with components y_1, …, y_l, and let H ∈ S^{l×l} be any symmetric matrix with entries h_{jk}. Then the quadratic form y^T H y expands as

    y^T H y = Σ_{j=1}^l Σ_{k=1}^l h_{jk} y_j y_k

Since H is assumed to be symmetric, we know that h_{jk} = h_{kj} for all j, k ∈ {1, …, l}. Thus

    y^T H y = 2 Σ_{j=1}^l Σ_{k=j+1}^l h_{jk} y_j y_k + Σ_{j=1}^l h_{jj} y_j²

Therefore,

    (1/2) y^T H y = Σ_{j=1}^l Σ_{k=j+1}^l h_{jk} y_j y_k + (1/2) Σ_{j=1}^l h_{jj} y_j²
We wish to write the right side of the above equation as the scalar product of two appropriately defined vectors. Accordingly, we define the vector y^Q ∈ R^{l(l+1)/2} corresponding to y ∈ R^l as

    y^Q := ( y_1 y_2, y_1 y_3, y_2 y_3, …, y_1 y_l, y_2 y_l, …, y_{l−1} y_l, (1/√2) y_1², …, (1/√2) y_l² )^T
        = ( {y_j y_k}_{j,k=1,…,l, j<k}, {(1/√2) y_j²}_{j=1,…,l} )^T   (16)

Note the following useful relationship between the Euclidean norms of y and y^Q:

    ∥y^Q∥₂² = (1/2) [ Σ_{j=1}^l Σ_{k=j+1}^l 2 y_j² y_k² + Σ_{j=1}^l y_j⁴ ] = (1/2) ( Σ_{j=1}^l y_j² )² = (1/2) ∥y∥₂⁴   (17)
Similarly, we write the components of H as a vector H_v ∈ R^{l(l+1)/2} as follows:

    H_v := ( h_{12}, h_{13}, h_{23}, …, h_{1l}, h_{2l}, …, h_{(l−1)l}, (1/√2) h_{11}, (1/√2) h_{22}, …, (1/√2) h_{ll} )^T
        = ( {h_{jk}}_{j,k=1,…,l, j<k}, {(1/√2) h_{jj}}_{j=1,…,l} )^T   (18)

Then it is not hard to see that

    (1/2) y^T H y = H_v^T y^Q   (19)
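The definitions above can be checked numerically. The sketch below builds y^Q and H_v for random data; the enumeration order of the j < k pairs differs from the one in (16), but it is used consistently for both vectors, which is all that the identities (17) and (19) require.

```python
import numpy as np

def vec_Q(y):
    # y^Q per (16): off-diagonal products y_j y_k (j < k), then y_j^2 / sqrt(2)
    l = y.size
    off = [y[j] * y[k] for j in range(l) for k in range(j + 1, l)]
    return np.array(off + list(y**2 / np.sqrt(2)))

def vec_H(H):
    # H_v per (18): entries h_jk (j < k) in the same order, then h_jj / sqrt(2)
    l = H.shape[0]
    off = [H[j, k] for j in range(l) for k in range(j + 1, l)]
    return np.array(off + list(np.diag(H) / np.sqrt(2)))

rng = np.random.default_rng(1)
y = rng.standard_normal(4)
A = rng.standard_normal((4, 4))
H = A + A.T                      # a symmetric test matrix

yQ, Hv = vec_Q(y), vec_H(H)
# (17): ||y^Q||_2^2 = ||y||_2^4 / 2
assert np.isclose(yQ @ yQ, 0.5 * np.linalg.norm(y)**4)
# (19): (1/2) y^T H y = H_v^T y^Q
assert np.isclose(0.5 * y @ H @ y, Hv @ yQ)
```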
Using the notation in (16), let for each i ∈ {1, …, M_n}

    (y^i_n)^Q := ( (y^i_n)_1 (y^i_n)_2, (y^i_n)_1 (y^i_n)_3, (y^i_n)_2 (y^i_n)_3, …, (y^i_n)_{l−1} (y^i_n)_l, (1/√2)(y^i_n)_1², …, (1/√2)(y^i_n)_l² )^T

where (y^i_n)_j denotes the jth component of y^i_n. Also, define the matrix Y^Q_n ∈ R^{M_n × l(l+1)/2} by stacking these vectors as rows,

    Y^Q_n := ( ((y^1_n)^Q)^T ; … ; ((y^{M_n}_n)^Q)^T )   (20)

For each i = 1, …, M_n, let (z^i_n)^T := ( (y^i_n)^T  ((y^i_n)^Q)^T ). Define the regression matrix Z_n ∈ R^{M_n × (l + l(l+1)/2)} as

    Z_n := ( (z^1_n)^T ; … ; (z^{M_n}_n)^T )   (21)
Obviously, we have Z_n = ( Y_n  Y^Q_n ). Finally, we let W_n = diag(w^1_n, …, w^{M_n}_n). Now, we can write the set of optimal solutions on the right side of (11) as follows:

    argmin_{(β,Λ) ∈ R^l × S^{l×l}} Σ_{i=1}^{M_n} w^i_n [ f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n) − β^T y^i_n − (1/2)(y^i_n)^T Λ y^i_n ]²
    ≡ argmin_{(β, Λ_v)^T ∈ R^{l + l(l+1)/2}} ∥ √W_n { f(x_n, Y_n, N_n) − Z_n (β ; Λ_v) } ∥₂²   (22)
Consider the optimization problem occurring on the right side of (22):

    min_{(β, Λ_v)^T ∈ R^{l + l(l+1)/2}} ∥ √W_n { f(x_n, Y_n, N_n) − Z_n (β ; Λ_v) } ∥₂²   (23)

It is easy to see that this optimization problem is equivalent to computing the weighted projection of the vector f(x_n, Y_n, N_n) ∈ R^{M_n} onto the subspace of R^{M_n} spanned by the columns of Z_n. We know that such a projection always exists, since a subspace is a closed convex set. Hence the set of optimal solutions of the optimization problem in (23) is non-empty. Thus, (22) shows that our method of finding ∇_n f(x_n) ∈ R^l and ∇²_n f(x_n) ∈ S^{l×l} such that (11) holds is well defined. Also, it is easily seen that for the existence of a unique optimal solution, the columns of Z_n have to be linearly independent. Since there are l components of ∇_n f(x_n) and l(l + 1)/2 components of ∇²_n f(x_n), Z_n contains p := l + l(l + 1)/2 columns. Thus, a unique optimal solution exists for (23) only if Z_n contains at least p linearly independent rows.
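As a quick numerical illustration of this rank condition (with an assumed but internally consistent enumeration of the regression features): for l = 2 we have p = 5, so any four design points necessarily leave the regression rank-deficient, while generic points in sufficient number give full column rank.

```python
import numpy as np

def z_row(y):
    # z_i = (y, y^Q): the point y, its pairwise products, and its squares / sqrt(2)
    l = y.size
    off = [y[j] * y[k] for j in range(l) for k in range(j + 1, l)]
    return np.concatenate([y, off, y**2 / np.sqrt(2)])

rng = np.random.default_rng(2)
l = 2
p = l + l * (l + 1) // 2                       # p = 5 unknowns for l = 2

Z_few = np.vstack([z_row(rng.standard_normal(l)) for _ in range(4)])
Z_enough = np.vstack([z_row(rng.standard_normal(l)) for _ in range(6)])

assert np.linalg.matrix_rank(Z_few) < p        # 4 rows can never give rank 5
assert np.linalg.matrix_rank(Z_enough) == p    # generic points: full column rank
```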
It is well known that the projection problem in (23) can be solved by solving the so-called normal equations associated with this problem. That is, we have

    argmin_{(β, Λ_v)^T ∈ R^p} ∥ √W_n { f(x_n, Y_n, N_n) − Z_n (β ; Λ_v) } ∥₂²
    = { (β ; Λ_v) ∈ R^p : (Z_n^T W_n Z_n) (β ; Λ_v) = Z_n^T W_n f(x_n, Y_n, N_n) }
Using the notation in (18), we define [∇²_n f(x)]_v ∈ R^{l(l+1)/2} corresponding to ∇²_n f(x) by

    [∇²_n f(x)]_v = ( {(∇²_n f(x))_{jk}}_{j,k=1,…,l, j<k}, {(1/√2)(∇²_n f(x))_{jj}}_{j=1,…,l} )^T   (24)

where (∇²_n f(x))_{jk} denotes the element in the jth row and kth column of ∇²_n f(x). Now, we finally get that choosing ∇_n f(x_n) ∈ R^l and ∇²_n f(x_n) ∈ S^{l×l} from (11) is equivalent to choosing the components of ∇_n f(x_n) and [∇²_n f(x_n)]_v from a solution of the set of linear equations given by

    (Z_n^T W_n Z_n) ( ∇_n f(x_n) ; [∇²_n f(x_n)]_v ) = Z_n^T W_n f(x_n, Y_n, N_n)   (25)
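A minimal sketch of solving the normal equations (25) in the noiseless case: for a quadratic objective the differences are fitted exactly, so the recovered β and Λ coincide with the true gradient and Hessian. The test function and design points below are illustrative, not from the paper.

```python
import numpy as np

def z_row(y):
    l = y.size
    off = [y[j] * y[k] for j in range(l) for k in range(j + 1, l)]
    return np.concatenate([y, off, y**2 / np.sqrt(2)])

def fit(x, Ys, w, fval):
    # solve the normal equations (25): (Z^T W Z) (beta; Lambda_v) = Z^T W d
    l = x.size
    Z = np.vstack([z_row(y) for y in Ys])
    W = np.diag(w)
    d = np.array([fval(x + y) - fval(x) for y in Ys])
    sol = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ d)
    beta, lam = sol[:l], sol[l:]
    # unpack Lambda_v back into a symmetric matrix, inverting (18)/(24)
    H = np.zeros((l, l))
    idx = 0
    for j in range(l):
        for k in range(j + 1, l):
            H[j, k] = H[k, j] = lam[idx]; idx += 1
    H[np.diag_indices(l)] = lam[idx:] * np.sqrt(2)
    return beta, H

# noiseless sanity check on an illustrative quadratic f(x) = x^T A x / 2 + b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x + b @ x
x = np.array([0.5, -0.25])
rng = np.random.default_rng(3)
Ys = [0.1 * rng.standard_normal(2) for _ in range(8)]
g, H = fit(x, Ys, np.ones(8), f)
assert np.allclose(g, A @ x + b)   # beta recovers the gradient
assert np.allclose(H, A)           # Lambda recovers the Hessian
```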
With this notation, we are ready to state conditions on the design points, sample sizes and weights sufficient to ensure that ∥∇_n f(x_n) − ∇f(x_n)∥₂ → 0 and ∥∇²_n f(x_n) − ∇²f(x_n)∥₂ → 0 as n → ∞. Let us begin by considering some intuitive ideas that motivate our conditions.
Consider for a moment the finite difference approximation ∇_h g(x*) ∈ R^l of the gradient ∇g(x*) of some function g ∈ C¹(R^l) at some x* ∈ R^l, defined as

    ∇_h g(x*) := ( (g(x* + h e_1) − g(x*))/h, …, (g(x* + h e_l) − g(x*))/h )^T

where {e_1, …, e_l} denote the unit vectors along the coordinate directions and h > 0 is the step size. It is easy to see that such a gradient approximation is nothing but the gradient of the linear function

    m(x) = g(x*) + ∇_h g(x*)^T (x − x*)

that satisfies

    ∇_h g(x*) ∈ argmin_{β∈R^l} Σ_{i=1}^l [ g(x* + h e_i) − g(x*) − h β^T e_i ]²

Thus, the linear function m(x) with the finite difference gradient approximation ∇_h g(x*) is the linear function that best fits the function values {g(x* + h e_i) : i = 1, …, l} at the corresponding design points {x* + h e_i : i = 1, …, l}. Further, it is well known that

    lim_{h→0} ∥∇_h g(x*) − ∇g(x*)∥₂ = 0

That is, the finite difference approximation gets progressively more accurate as the step size h, i.e., the Euclidean distance between x* and the design points, decreases to zero.
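This equivalence between the forward-difference gradient and the least-squares fit is easy to confirm numerically; the test function g below is an arbitrary smooth choice.

```python
import numpy as np

def fd_gradient(g, x, h):
    # componentwise forward differences along the coordinate directions
    return np.array([(g(x + h * e) - g(x)) / h for e in np.eye(x.size)])

def ls_gradient(g, x, h):
    # fit beta to the differences g(x + h e_i) - g(x) at design points x + h e_i
    Y = h * np.eye(x.size)                       # perturbation matrix
    d = np.array([g(x + y) - g(x) for y in Y])
    beta, *_ = np.linalg.lstsq(Y, d, rcond=None)
    return beta

g = lambda x: np.sin(x[0]) + x[0] * x[1]**2      # illustrative smooth test function
x = np.array([0.3, -1.2])
assert np.allclose(fd_gradient(g, x, 1e-3), ls_gradient(g, x, 1e-3))
```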
It is intuitive to expect that, in the same fashion, the accuracies of ∇_n f(x_n) and ∇²_n f(x_n) determined as in (11) depend on the Euclidean distances {∥y^i_n∥₂ : i = 1, …, M_n} of the design points {x_n + y^i_n : i = 1, …, M_n} from x_n. Indeed, we should expect that in order to ensure ∥∇_n f(x_n) − ∇f(x_n)∥₂ → 0 and ∥∇²_n f(x_n) − ∇²f(x_n)∥₂ → 0 as n → ∞, the Euclidean distances between the design points and the point x_n must decrease to zero as n → ∞. Thus, in order to monitor and control the Euclidean distances of the design points from x_n, we define a neighborhood of x_n, called the design region, for each n as follows:

    D_n := { x ∈ R^l : ∥x − x_n∥₂ ≤ δ_n }   (26)

We will refer to δ_n > 0 as the design region radius for iteration n. Without loss of generality, we will assume that ∥y^i_n∥₂ ≤ δ_n for i = 1, …, M^I_n, for some M^I_n ≤ M_n, and ∥y^i_n∥₂ > δ_n for i = M^I_n + 1, …, M_n. We use the terms inner and outer to denote the design points lying respectively within and outside the design region. In order to obtain the convergence of ∇_n f(x_n) to ∇f(x_n) and of ∇²_n f(x_n) to ∇²f(x_n), we will ensure that δ_n → 0 as n → ∞, while at the same time requiring that a certain number of inner design points exist for each n ∈ N.
For the notation that we defined earlier with regard to the design points, we will use the superscripts "I" and "O" to denote the corresponding notation for the inner and outer design points respectively. Thus, we let

    Y^I_n := ( (y^1_n)^T ; … ; (y^{M^I_n}_n)^T )  and  Y^O_n := ( (y^{M^I_n+1}_n)^T ; … ; (y^{M_n}_n)^T ),  so that  Y_n = ( Y^I_n ; Y^O_n )   (27)

and define the matrices (Y^I_n)^Q, (Y^O_n)^Q, Z^I_n and Z^O_n in the same manner. Analogous to N_n, we let N^I_n := {N^1_n, …, N^{M^I_n}_n} and N^O_n := {N^{M^I_n+1}_n, …, N^{M_n}_n} denote the sets of sample sizes used to evaluate the sample average functions at the inner and outer design points respectively. Then, the vectors f(x, Y^I_n, N^I_n) ∈ R^{M^I_n} and f(x, Y^O_n, N^O_n) ∈ R^{M_n − M^I_n} are defined analogously to f(x, Y_n, N_n). Also, we let W^I_n = diag(w^1_n, …, w^{M^I_n}_n) and W^O_n = diag(w^{M^I_n+1}_n, …, w^{M_n}_n).
We noted earlier that we will ensure that δ_n → 0 as n → ∞. This implies that

    lim_{n→∞} max_{i=1,…,M^I_n} ∥y^i_n∥₂ = 0

That is, the inner perturbation vectors corresponding to larger n will be much smaller in norm than those used earlier in the sequence. Therefore, in order to compare the sets of design points for different n ∈ N, we will scale the corresponding perturbations so that the corresponding design region radii are all equal to 1. Accordingly, we define for each i = 1, …, M_n and n ∈ N the scaled perturbation vector ȳ^i_n ∈ R^l and the scaled regression vector z̄^i_n ∈ R^p as

    ȳ^i_n := y^i_n / δ_n  and  z̄^i_n := ( y^i_n / δ_n ; (y^i_n)^Q / δ_n² )   (28)
The corresponding scaled perturbation and regression matrices are defined as follows:

    Ȳ_n := ( (ȳ^1_n)^T ; … ; (ȳ^{M_n}_n)^T ) = Y_n / δ_n   (29)

    Ȳ^Q_n := ( ((y^1_n)^Q)^T / δ_n² ; … ; ((y^{M_n}_n)^Q)^T / δ_n² ) = Y^Q_n / δ_n²   (30)

    Z̄_n := ( Ȳ_n  Ȳ^Q_n ) = Z_n D_n   (31)

where

    D_n = diag( 1/δ_n, …, 1/δ_n, 1/δ_n², …, 1/δ_n² )   (32)

with l entries equal to 1/δ_n followed by l(l+1)/2 entries equal to 1/δ_n². The corresponding scaled matrices Ȳ^I_n, Ȳ^O_n, (Ȳ^I_n)^Q, (Ȳ^O_n)^Q, Z̄^I_n and Z̄^O_n are defined analogously.
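The scaling identity (31) can be verified directly: building the regression rows from the scaled perturbations y^i_n/δ_n is the same as post-multiplying Z_n by D_n. The feature enumeration below is an assumed but internally consistent order.

```python
import numpy as np

def z_row(y):
    l = y.size
    off = [y[j] * y[k] for j in range(l) for k in range(j + 1, l)]
    return np.concatenate([y, off, y**2 / np.sqrt(2)])

rng = np.random.default_rng(4)
l, M, delta = 3, 7, 0.05
p = l + l * (l + 1) // 2
Ys = [delta * rng.standard_normal(l) for _ in range(M)]

Z = np.vstack([z_row(y) for y in Ys])
Z_scaled = np.vstack([z_row(y / delta) for y in Ys])        # rows from y_i / delta
D = np.diag([1 / delta] * l + [1 / delta**2] * (p - l))     # D_n from (32)
assert np.allclose(Z_scaled, Z @ D)                         # verifies (31)
```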
Also, note that we use sample averages of the form f(xn + yin, N i
n) for i = 1, . . . , Mn in the calcu-
lation of ∇nf(xn) and ∇2nf(xn) for each n ∈ N. In order to obtain
∥∥∥∇nf(xn)−∇f(xn)∥∥∥
2→ 0 and∥∥∥∇2
nf(xn)−∇2f(xn)∥∥∥
2→ 0 as n → ∞, since ∇f(xn) and ∇2f(xn) are quantities related to the function
f , it is intuitive to expect that the sequence{
f(·, N)}
N∈N
must converge to the the function f on X , in
a certain sense as the sample size N grows to ∞. For example, we already know from the strong law of
large numbers that the sequence∣∣∣f(x,N)− f(x)
∣∣∣→ 0 point wise for each x ∈ Rl such that f(x) is finite for
P-almost all ω. However, we require stronger forms of convergence of the sample average function f(·, N)
to f . With regard to such a requirement, let us define notation for the required function spaces and their
norms. First, for any D ⊂ Rl, let W0(D) denote the space of all Lipschitz continuous functions φ : D −→ R.
The standard norm (referred to as the Lipschitz norm) defined on W0(D) is as follows. For any φ ∈W0(D),
‖φ‖W0(D) := sup
x∈D|φ(x)| + sup
x,x+y∈Dy �=0
|φ(x)− φ(y)|‖y‖2
< ∞
As usual, we let C1(D) denote the space of continuously differentiable functions on D with
‖φ‖C1(D) = sup
x∈D|φ(x)| + sup
x∈D‖∇φ(x)‖2
Further, let W1(D) denote the space of Lipschitz continuously differentiable functions on D. That is φ ∈W1(D) if
‖φ‖W1(D) := sup
x∈D|φ(x)| + sup
x∈D‖∇φ(x)‖2 + sup
x,x+y∈Dy �=0
‖∇φ(x + y)−∇φ(x)‖2‖y‖2
< ∞
Finally, we let C2(D) denote the set of twice continuously differentiable functions on D. From the aforemen-
tioned definitions, it is clear that
W0(D) ⊃ C1(D) ⊃ W1(D) ⊃ C2(D)
Finally, using the notation we have developed so far, we note that

    Z_n^T W_n Z_n = ( Y_n^T W_n Y_n   Y_n^T W_n Y^Q_n ; (Y^Q_n)^T W_n Y_n   (Y^Q_n)^T W_n Y^Q_n )

and

    Z_n^T W_n f(x_n, Y_n, N_n) = ( Y_n^T W_n f(x_n, Y_n, N_n) ; (Y^Q_n)^T W_n f(x_n, Y_n, N_n) )

Therefore, the system of equations in (25) can be divided into the two sets of equations given below:

    (Y_n^T W_n Y_n) ∇_n f(x_n) + (Y_n^T W_n Y^Q_n) [∇²_n f(x_n)]_v = Y_n^T W_n f(x_n, Y_n, N_n)   (33)

    ((Y^Q_n)^T W_n Y_n) ∇_n f(x_n) + ((Y^Q_n)^T W_n Y^Q_n) [∇²_n f(x_n)]_v = (Y^Q_n)^T W_n f(x_n, Y_n, N_n)   (34)
While considering sufficient conditions to ensure that ∥∇_n f(x_n) − ∇f(x_n)∥₂ → 0 as n → ∞, we will assume that ∇_n f(x_n) and ∇²_n f(x_n) are picked to satisfy only the first of the two sets of equations, i.e., the set given in (33). Obviously, this is a weaker requirement than picking ∇_n f(x_n) and ∇²_n f(x_n) such that both (33) and (34) are satisfied. Therefore, the following result also holds when ∇_n f(x_n) and ∇²_n f(x_n) are picked such that (25) holds.
Theorem 3.2. Consider a sequence {x_n}_{n∈N} ⊂ X and, for each n ∈ N, let ∇_n f(x_n) ∈ R^l and ∇²_n f(x_n) ∈ S^{l×l} be picked so as to satisfy (33). Suppose that the following assumptions hold.

A 4. For any compact set D ⊂ E, we have f ∈ C¹(D), {f(·, N)}_{N∈N} ⊂ W⁰(D) and

    lim_{N→∞} ∥f(·, N) − f∥_{W⁰(D)} = 0

A 5. The sample sizes N^i_n corresponding to the inner design points all increase to infinity as n → ∞. That is,

    lim_{n→∞} min_{i=1,…,M^I_n} N^i_n = ∞

A 6. The sequence {δ_n}_{n∈N} of design region radii is such that δ_n > 0 for each n ∈ N and δ_n → 0 as n → ∞.

A 7. The set of design points {x_n + y^i_n : i = 1, …, M_n and n ∈ N} and the sequence {W_n}_{n∈N} of weight matrices satisfy the following.

1. There exists a compact set C ⊂ E such that the point x_n and the set of design points {x_n + y^i_n : i = 1, …, M_n} satisfy

    x_n ∈ C and {x_n + y^i_n : i = 1, …, M_n} ⊂ C   (35)

for each n ∈ N.

2. There exists K^I_λ < ∞ such that for each n ∈ N,

    ∥(Ȳ^I_n)^T W^I_n Ȳ^I_n∥₂ ∥((Ȳ^I_n)^T W^I_n Ȳ^I_n)^{−1}∥₂ < K^I_λ   (36)

3.
    lim_{n→∞} ∥(Ȳ^O_n)^T W^O_n Ȳ^O_n∥₂ / ∥(Ȳ^I_n)^T W^I_n Ȳ^I_n∥₂ = 0   (37)

A 8. The sequence {∇²_n f(x_n)}_{n∈N} of Hessian approximation matrices is norm-bounded. That is, there exists K_H < ∞ such that ∥∇²_n f(x_n)∥₂ < K_H for all n ∈ N.

Then, it holds that

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ = 0

In particular, if {x_n}_{n∈N} ⊂ X is such that x_n → x ∈ X as n → ∞, then

    lim_{n→∞} ∇_n f(x_n) = ∇f(x)
We will first establish a couple of useful lemmas and then proceed with the proof of Theorem 3.2. For the next lemma, and for the rest of this article, we let λ_max(A) and λ_min(A) denote respectively the maximum and minimum eigenvalues of any square matrix A.

Lemma 3.3. Let Assumption A 7 hold. Then Y_n^T W_n Y_n is positive definite for each n ∈ N and the following assertions hold.

1.
    ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ < K^I_λ for all n ∈ N   (38)

2.
    lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ = 0   (39)
Proof. First, from the definition of Ȳ_n, we can see that

    ∥(Ȳ^I_n)^T W^I_n Ȳ^I_n∥₂ ∥((Ȳ^I_n)^T W^I_n Ȳ^I_n)^{−1}∥₂ = ∥(Y^I_n)^T W^I_n Y^I_n∥₂ ∥((Y^I_n)^T W^I_n Y^I_n)^{−1}∥₂

and

    ∥(Ȳ^O_n)^T W^O_n Ȳ^O_n∥₂ / ∥(Ȳ^I_n)^T W^I_n Ȳ^I_n∥₂ = ∥(Y^O_n)^T W^O_n Y^O_n∥₂ / ∥(Y^I_n)^T W^I_n Y^I_n∥₂

Therefore, (36) and (37) imply that

    ∥(Y^I_n)^T W^I_n Y^I_n∥₂ ∥((Y^I_n)^T W^I_n Y^I_n)^{−1}∥₂ < K^I_λ for all n ∈ N   (40)

and

    lim_{n→∞} ∥(Y^O_n)^T W^O_n Y^O_n∥₂ / ∥(Y^I_n)^T W^I_n Y^I_n∥₂ = 0   (41)
Now, since (Y^I_n)^T W^I_n Y^I_n is symmetric and positive semidefinite, we get from (40) that

    λ_min((Y^I_n)^T W^I_n Y^I_n) = 1 / ∥((Y^I_n)^T W^I_n Y^I_n)^{−1}∥₂ > 0

Further, we also have

    Y_n^T W_n Y_n = (Y^I_n)^T W^I_n Y^I_n + (Y^O_n)^T W^O_n Y^O_n

Now, since (Y^O_n)^T W^O_n Y^O_n is positive semidefinite, from the interlocking eigenvalues theorem we know that

    λ_min(Y_n^T W_n Y_n) ≥ λ_min((Y^I_n)^T W^I_n Y^I_n) > 0

Therefore, Y_n^T W_n Y_n is positive definite and

    ∥(Y_n^T W_n Y_n)^{−1}∥₂ = 1 / λ_min(Y_n^T W_n Y_n) ≤ 1 / λ_min((Y^I_n)^T W^I_n Y^I_n) = ∥((Y^I_n)^T W^I_n Y^I_n)^{−1}∥₂   (42)

Therefore, for any n ∈ N, using (40) and (42),

    ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ ≤ ∥((Y^I_n)^T W^I_n Y^I_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ ≤ K^I_λ
Thus, we have shown that (38) holds. Similarly,

    ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ ≤ ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ · ( ∥(Y^O_n)^T W^O_n Y^O_n∥₂ / ∥(Y^I_n)^T W^I_n Y^I_n∥₂ )

Using (38) on the right hand side of the above inequality, we get

    ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ ≤ K^I_λ ( ∥(Y^O_n)^T W^O_n Y^O_n∥₂ / ∥(Y^I_n)^T W^I_n Y^I_n∥₂ )

Finally, we use (41) to get that

    lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ ≤ K^I_λ lim_{n→∞} ∥(Y^O_n)^T W^O_n Y^O_n∥₂ / ∥(Y^I_n)^T W^I_n Y^I_n∥₂ = 0
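Two linear-algebra facts carry this proof: for a symmetric positive definite matrix M, ∥M^{−1}∥₂ = 1/λ_min(M), and adding a positive semidefinite matrix cannot decrease the smallest eigenvalue. A quick numerical check on random matrices:

```python
import numpy as np

rng = np.random.default_rng(5)

def spd(l):
    A = rng.standard_normal((l, l))
    return A @ A.T + 0.1 * np.eye(l)   # symmetric positive definite

def psd(l):
    A = rng.standard_normal((l, 2))    # rank-deficient, hence only semidefinite
    return A @ A.T

M, B = spd(4), psd(4)
lam = np.linalg.eigvalsh               # eigenvalues in ascending order
# ||M^{-1}||_2 = 1 / lambda_min(M) for symmetric positive definite M
assert np.isclose(np.linalg.norm(np.linalg.inv(M), 2), 1 / lam(M)[0])
# adding a positive semidefinite term cannot decrease the smallest eigenvalue
assert lam(M + B)[0] >= lam(M)[0] - 1e-12
```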
Lemma 3.4. Consider any matrix Y ∈ R^{m×l} and a positive diagonal matrix W ∈ R^{m×m}. Let y^i ∈ R^l denote row i of Y, and let w^i ∈ R denote diagonal element W_{ii}, i = 1, …, m. For some m₁ ≤ m, let Y^I consist of the first m₁ rows of Y and Y^O of the remaining m − m₁ rows. Similarly, let W^I = diag(w^1, …, w^{m₁}) and W^O = diag(w^{m₁+1}, …, w^m).
Let {α₁, α₂, …, α_m} be a set of real numbers and {v^1, v^2, …, v^m} be a set of vectors in R^l. Also, let a, b, c ∈ R^m be given by

    a := ( ∥y^1∥₂ α₁ ; … ; ∥y^m∥₂ α_m ),
    b := ( (y^1)^T v^1 ; … ; (y^m)^T v^m ),
    c := ( (y^1)^T v^1 ; … ; (y^{m₁})^T v^{m₁} ; ∥y^{m₁+1}∥₂ α_{m₁+1} ; … ; ∥y^m∥₂ α_m )

Then,

    ∥Y^T W a∥₂ ≤ l [ ∥(Y^I)^T W^I Y^I∥₂ max_{i∈{1,…,m₁}} |α_i| + ∥(Y^O)^T W^O Y^O∥₂ max_{i∈{m₁+1,…,m}} |α_i| ]   (43)

    ∥Y^T W b∥₂ ≤ l [ ∥(Y^I)^T W^I Y^I∥₂ max_{i∈{1,…,m₁}} ∥v^i∥₂ + ∥(Y^O)^T W^O Y^O∥₂ max_{i∈{m₁+1,…,m}} ∥v^i∥₂ ]   (44)

    ∥Y^T W c∥₂ ≤ l [ ∥(Y^I)^T W^I Y^I∥₂ max_{i∈{1,…,m₁}} ∥v^i∥₂ + ∥(Y^O)^T W^O Y^O∥₂ max_{i∈{m₁+1,…,m}} |α_i| ]   (45)
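Inequality (43) can be sanity-checked numerically on random data (a single random instance, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(6)
m, m1, l = 6, 4, 3
Y = rng.standard_normal((m, l))
w = rng.random(m) + 0.1                 # positive weights
alpha = rng.standard_normal(m)

# a_i = ||y^i||_2 * alpha_i, as in the statement of Lemma 3.4
a = np.array([np.linalg.norm(Y[i]) * alpha[i] for i in range(m)])
lhs = np.linalg.norm(Y.T @ np.diag(w) @ a)

def block_norm(Yb, wb):
    # spectral norm of Y_b^T W_b Y_b
    return np.linalg.norm(Yb.T @ np.diag(wb) @ Yb, 2)

rhs = l * (block_norm(Y[:m1], w[:m1]) * np.max(np.abs(alpha[:m1]))
           + block_norm(Y[m1:], w[m1:]) * np.max(np.abs(alpha[m1:])))
assert lhs <= rhs + 1e-12               # inequality (43)
```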
Proof. We first prove (43).

    ∥Y^T W a∥₂ = ∥ Σ_{i=1}^m y^i w^i ∥y^i∥₂ α_i ∥₂
    ≤ Σ_{i=1}^m w^i ∥y^i∥₂² |α_i|
    = Σ_{i=1}^{m₁} w^i ∥y^i∥₂² |α_i| + Σ_{i=m₁+1}^m w^i ∥y^i∥₂² |α_i|
    ≤ Σ_{i=1}^{m₁} w^i ∥y^i∥₂² max_{k∈{1,…,m₁}} |α_k| + Σ_{i=m₁+1}^m w^i ∥y^i∥₂² max_{k∈{m₁+1,…,m}} |α_k|
    = trace((Y^I)^T W^I Y^I) max_{k∈{1,…,m₁}} |α_k| + trace((Y^O)^T W^O Y^O) max_{k∈{m₁+1,…,m}} |α_k|
    ≤ l [ ∥(Y^I)^T W^I Y^I∥₂ max_{i∈{1,…,m₁}} |α_i| + ∥(Y^O)^T W^O Y^O∥₂ max_{i∈{m₁+1,…,m}} |α_i| ]

Inequality (44) can be established in a similar fashion, by using the Cauchy-Schwarz inequality |(y^i)^T v^i| ≤ ∥y^i∥₂ ∥v^i∥₂:
    ∥Y^T W b∥₂ ≤ Σ_{i=1}^m ∥y^i w^i (y^i)^T v^i∥₂
    = Σ_{i=1}^m w^i |(y^i)^T v^i| ∥y^i∥₂
    ≤ Σ_{i=1}^m w^i ∥y^i∥₂² ∥v^i∥₂
    = Σ_{i=1}^{m₁} w^i ∥y^i∥₂² ∥v^i∥₂ + Σ_{i=m₁+1}^m w^i ∥y^i∥₂² ∥v^i∥₂
    ≤ l [ ∥(Y^I)^T W^I Y^I∥₂ max_{i∈{1,…,m₁}} ∥v^i∥₂ + ∥(Y^O)^T W^O Y^O∥₂ max_{i∈{m₁+1,…,m}} ∥v^i∥₂ ]

Similarly, we get
    ∥Y^T W c∥₂ = ∥ Σ_{i=1}^{m₁} y^i w^i (y^i)^T v^i + Σ_{i=m₁+1}^m y^i w^i ∥y^i∥₂ α_i ∥₂
    ≤ Σ_{i=1}^{m₁} ∥y^i w^i (y^i)^T v^i∥₂ + Σ_{i=m₁+1}^m ∥y^i w^i ∥y^i∥₂ α_i∥₂
    ≤ Σ_{i=1}^{m₁} w^i ∥y^i∥₂² ∥v^i∥₂ + Σ_{i=m₁+1}^m w^i ∥y^i∥₂² |α_i|
    ≤ l [ ∥(Y^I)^T W^I Y^I∥₂ max_{i∈{1,…,m₁}} ∥v^i∥₂ + ∥(Y^O)^T W^O Y^O∥₂ max_{i∈{m₁+1,…,m}} |α_i| ]
Proof of Theorem 3.2:
From Lemma 3.3, we know that Y_n^T W_n Y_n is non-singular for each n ∈ N. Therefore, we can rewrite (33), for each n ∈ N, as

    ∇_n f(x_n) = (Y_n^T W_n Y_n)^{−1} Y_n^T W_n f(x_n, Y_n, N_n) − (Y_n^T W_n Y_n)^{−1} (Y_n^T W_n Y^Q_n) [∇²_n f(x_n)]_v   (46)

We proceed by manipulating the above expression for ∇_n f(x_n) and showing that ∥∇_n f(x_n) − ∇f(x_n)∥₂ → 0 as n → ∞. From (46), we get

    ∇_n f(x_n) − ∇f(x_n) = (Y_n^T W_n Y_n)^{−1} Y_n^T W_n { f(x_n, Y_n, N_n) − Y^Q_n [∇²_n f(x_n)]_v } − (Y_n^T W_n Y_n)^{−1} Y_n^T W_n Y_n ∇f(x_n)

Adding and subtracting appropriate quantities, we get

    ∇_n f(x_n) − ∇f(x_n) = (Y_n^T W_n Y_n)^{−1} Y_n^T W_n [ f(x_n, Y_n, N_n) − f(x_n, Y_n) ]
      + (Y_n^T W_n Y_n)^{−1} Y_n^T W_n [ f(x_n, Y_n) − Y_n ∇f(x_n) ]
      − (Y_n^T W_n Y_n)^{−1} Y_n^T W_n Y^Q_n [∇²_n f(x_n)]_v

Recall that we assumed, without loss of generality, that ∥y^i_n∥₂ > 0 for all i ∈ {1, …, M_n} and n ∈ N. Using this, we define

    a_n := f(x_n, Y_n, N_n) − f(x_n, Y_n)   (47)

whose ith component can be written as

    ∥y^i_n∥₂ · [ (f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n)) ] / ∥y^i_n∥₂
Also, let

    b_n := f(x_n, Y_n) − Y_n ∇f(x_n)   (48)

whose ith component is

    f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n) = ∥y^i_n∥₂ · [ f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n) ] / ∥y^i_n∥₂
Finally, we set

    c_n := Y^Q_n [∇²_n f(x_n)]_v   (49)

whose ith component is ((y^i_n)^Q)^T [∇²_n f(x_n)]_v.   (50)
Using the identity (19) and the fact that ∥y^i_n∥₂ > 0 for all i = 1, …, M_n and n ∈ N, the ith component of c_n equals

    (1/2) (y^i_n)^T ∇²_n f(x_n) y^i_n = ∥y^i_n∥₂ · [ (1/2) (y^i_n)^T ∇²_n f(x_n) y^i_n ] / ∥y^i_n∥₂

Then, using the notation defined above, we get

    ∇_n f(x_n) − ∇f(x_n) = (Y_n^T W_n Y_n)^{−1} Y_n^T W_n a_n + (Y_n^T W_n Y_n)^{−1} Y_n^T W_n b_n − (Y_n^T W_n Y_n)^{−1} Y_n^T W_n c_n

Using the triangle inequality,

    ∥∇_n f(x_n) − ∇f(x_n)∥₂ ≤ ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n a_n∥₂ + ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n b_n∥₂ + ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n c_n∥₂   (51)

Let us consider the three terms on the right hand side of (51) in order.
Using (43) in Lemma 3.4, we get

    ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n a_n∥₂ ≤ ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥Y_n^T W_n a_n∥₂
    ≤ l ∥(Y_n^T W_n Y_n)^{−1}∥₂ [ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂
      + ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂ ]   (52)
Now, we consider the two terms of (52) in order. First, from Lemma 3.3, we know that ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ ≤ K^I_λ for all n ∈ N. From Assumption A 5, we know that

    lim_{n→∞} min_{i=1,…,M^I_n} N^i_n = ∞

Also, we know that {x_n}_{n∈N} ⊂ C and x_n + y^i_n ∈ C for each i = 1, …, M_n and n ∈ N. Now, from Assumption A 4 and the definition of the norm on W⁰(C), we get that for any ε > 0 there exists N_ε ∈ N such that for all N > N_ε,

    sup_{x,x+y∈C, y≠0} |(f(x + y, N) − f(x, N)) − (f(x + y) − f(x))| / ∥y∥₂ ≤ ε

Also, from Assumption A 5, we know that there exists n_ε ∈ N such that for all n > n_ε, min{N^i_n : i = 1, …, M^I_n} > N_ε. Combining these two facts, we get that for any ε > 0 there exists n_ε ∈ N such that for all n > n_ε,

    max_{i∈{1,…,M^I_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂ ≤ ε

Therefore,

    lim_{n→∞} max_{i∈{1,…,M^I_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂ = 0

Consequently, we get that

    lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂ = 0
Next, consider the second term on the right side of (52). First, from Lemma 3.3, we get

    lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ = 0

Now, we note that for each n ∈ N,

    max_{i∈{M^I_n+1,…,M_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂
    ≤ max_{i∈{M^I_n+1,…,M_n}} { |f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)| / ∥y^i_n∥₂ + |f(x_n + y^i_n) − f(x_n)| / ∥y^i_n∥₂ }
    ≤ max_{i∈{M^I_n+1,…,M_n}} { sup_{x,x+y∈C, y≠0} |f(x + y, N^i_n) − f(x, N^i_n)| / ∥y∥₂ + sup_{x,x+y∈C, y≠0} |f(x + y) − f(x)| / ∥y∥₂ }
    ≤ ∥f∥_{W⁰(C)} + max_{i∈{M^I_n+1,…,M_n}} ∥f(·, N^i_n)∥_{W⁰(C)}

From Assumption A 4, we know that

    lim_{N→∞} ∥f(·, N) − f∥_{W⁰(C)} = 0

Hence, the sequence {∥f(·, N)∥_{W⁰(C)}}_{N∈N} is bounded, and there exists K_f < ∞ such that ∥f∥_{W⁰(C)} < K_f and sup_{N∈N} ∥f(·, N)∥_{W⁰(C)} < K_f, where C ⊂ E is the compact set mentioned in Assumption A 7. Since we also know from Assumption A 7 that x_n ∈ C for each n ∈ N and x_n + y^i_n ∈ C for each n ∈ N, we get

    max_{i∈{M^I_n+1,…,M_n}} ∥f(·, N^i_n)∥_{W⁰(C)} < K_f

Consequently, we get that

    ∥f∥_{W⁰(C)} + max_{i∈{M^I_n+1,…,M_n}} ∥f(·, N^i_n)∥_{W⁰(C)} ≤ 2 K_f

Hence, using the fact that lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ = 0, we finally get

    lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} |(f(x_n + y^i_n, N^i_n) − f(x_n, N^i_n)) − (f(x_n + y^i_n) − f(x_n))| / ∥y^i_n∥₂ = 0

Thus, we have shown that

    lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n a_n∥₂ = 0
Next, we consider the second term on the right side of (51). Using (43) in Lemma 3.4, we get

    ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n b_n∥₂ ≤ l ∥(Y_n^T W_n Y_n)^{−1}∥₂ × [ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂
      + ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ ]   (53)
We will first show that

    lim_{n→∞} l ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ = 0

From Assumption A 4, we know that ∇f : C → R^l is continuous on C. Since the set C ⊂ E (defined in Assumption A 7) is compact, we get that ∇f is uniformly continuous on C. That is, for any ε > 0, there exists δ > 0 such that

    sup_{x,x+y∈C, 0<∥y∥₂≤δ} |f(x + y) − f(x) − ∇f(x)^T y| / ∥y∥₂ < ε   (54)

From Assumption A 7, we know that {x_n}_{n∈N} ⊂ C and {x_n + y^i_n : i = 1, …, M_n} ⊂ C for each n ∈ N. Further, we know from Assumption A 6 that δ_n → 0 as n → ∞. Since ∥y^i_n∥₂ ≤ δ_n for each i = 1, …, M^I_n and each n ∈ N, we get that

    lim_{n→∞} max_{i∈{1,…,M^I_n}} ∥x_n + y^i_n − x_n∥₂ = lim_{n→∞} max_{i∈{1,…,M^I_n}} ∥y^i_n∥₂ = 0   (55)

It follows from (55) that given any δ > 0, there exists an n₁ ∈ N such that ∥x_n + y^i_n − x_n∥₂ < δ for all i ∈ {1, …, M^I_n} and n > n₁. Therefore, using (54), we get that for any ε > 0, there exists n₁ ∈ N such that

    |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ < ε for all i = 1, …, M^I_n and n > n₁

Thus,

    lim_{n→∞} max_{i∈{1,…,M^I_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ = 0
Now, using (38) in Lemma 3.3, we get

    lim_{n→∞} l ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂
    ≤ l K^I_λ lim_{n→∞} max_{i∈{1,…,M^I_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ = 0

Hence, the first term on the right side of (53) converges to zero. Next we show that

    lim_{n→∞} l ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ = 0
Again, since f is continuously differentiable on C and C ⊂ E is compact, we get from Lemma 2.2 in ? that there exists K_{1f} < ∞ such that

    sup_{x,x+y∈C, y≠0} |f(x + y) − f(x) − y^T ∇f(x)| / ∥y∥₂ < K_{1f}

Since we know from Assumption A 7 that x_n ∈ C for each n ∈ N and x_n + y^i_n ∈ C for each n ∈ N and i ∈ {1, …, M_n}, we get that for each n ∈ N

    max_{i∈{M^I_n+1,…,M_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂ ≤ K_{1f}

Now, using (39) in Lemma 3.3, we get

    lim_{n→∞} l ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} |f(x_n + y^i_n) − f(x_n) − (y^i_n)^T ∇f(x_n)| / ∥y^i_n∥₂
    ≤ l K_{1f} lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ = 0

Thus, the second term on the right side of (53) converges to zero. Hence ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n b_n∥₂ → 0 as n → ∞.
Finally, consider the third term on the right in (51). Again using (43),

    ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n c_n∥₂ ≤ (l/2) ∥(Y_n^T W_n Y_n)^{−1}∥₂ [ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} |(y^i_n)^T ∇²_n f(x_n) y^i_n| / ∥y^i_n∥₂
      + ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} |(y^i_n)^T ∇²_n f(x_n) y^i_n| / ∥y^i_n∥₂ ]
Since ∇²_n f(x_n) ∈ S^{l×l} is a symmetric matrix, it is well known that

    |(y^i_n)^T ∇²_n f(x_n) y^i_n| ≤ ∥∇²_n f(x_n)∥₂ ∥y^i_n∥₂²

for each n ∈ N and i ∈ {1, …, M_n}. Using ∥y^i_n∥₂ > 0 for each i = 1, …, M_n and n ∈ N, we get

    |(y^i_n)^T ∇²_n f(x_n) y^i_n| / ∥y^i_n∥₂ ≤ ∥∇²_n f(x_n)∥₂ ∥y^i_n∥₂² / ∥y^i_n∥₂ = ∥∇²_n f(x_n)∥₂ ∥y^i_n∥₂

Now, from Assumption A 8 we know ∥∇²_n f(x_n)∥₂ ≤ K_H for all n ∈ N. Hence,

    ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n c_n∥₂ ≤ (l K_H / 2) ∥(Y_n^T W_n Y_n)^{−1}∥₂ [ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} ∥y^i_n∥₂ + ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} ∥y^i_n∥₂ ]   (56)
But we know from Assumption A 6 that max_{i∈{1,…,M^I_n}} ∥y^i_n∥₂ → 0 as n → ∞. Therefore, using (38) in Lemma 3.3, we get

    lim_{n→∞} (l K_H / 2) ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^I_n)^T W^I_n Y^I_n∥₂ max_{i∈{1,…,M^I_n}} ∥y^i_n∥₂ ≤ (l K_H K^I_λ / 2) lim_{n→∞} max_{i∈{1,…,M^I_n}} ∥y^i_n∥₂ = 0

Now, we know from Assumption A 7 that {x_n}_{n∈N} ⊂ C and x_n + y^i_n ∈ C for all i ∈ {M^I_n + 1, …, M_n} and n ∈ N, where C ⊂ E is compact. Therefore, there must exist K_y < ∞ such that max_{i∈{M^I_n+1,…,M_n}} ∥y^i_n∥₂ < K_y for all n ∈ N. Thus, using (39) in Lemma 3.3, we get

    lim_{n→∞} (l K_H / 2) ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ max_{i∈{M^I_n+1,…,M_n}} ∥y^i_n∥₂ ≤ (l K_H K_y / 2) lim_{n→∞} ∥(Y_n^T W_n Y_n)^{−1}∥₂ ∥(Y^O_n)^T W^O_n Y^O_n∥₂ = 0

Therefore ∥(Y_n^T W_n Y_n)^{−1} Y_n^T W_n c_n∥₂ → 0 as n → ∞. Hence we have shown that

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ = 0

In particular, if x_n → x ∈ X as n → ∞, then we get from the continuous differentiability of f that

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x)∥₂ ≤ lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ + lim_{n→∞} ∥∇f(x_n) − ∇f(x)∥₂ = 0

□
Corollary 3.5. Consider any sequence {x_n}_{n∈N} ⊂ X and let all the assumptions of Theorem 3.2 hold. Suppose for each n ∈ N, ∇_n f(x_n) ∈ R^l and ∇²_n f(x_n) ∈ S^{l×l} are picked to satisfy (25). Then,

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ = 0

Proof. The result follows from the fact that if ∇_n f(x_n) and ∇²_n f(x_n) satisfy (25) for each n ∈ N, then they also automatically satisfy (33).

Corollary 3.6. Consider any sequence {x_n}_{n∈N} ⊂ X and let all the assumptions of Theorem 3.2 hold. Suppose for each n ∈ N, ∇_n f(x_n) ∈ R^l is picked such that

    (Y_n^T W_n Y_n) ∇_n f(x_n) = Y_n^T W_n f(x_n, Y_n, N_n)

Then we have

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ = 0

Proof. The required result follows from Theorem 3.2 when we set ∇²_n f(x_n) = 0 in (33), for each n ∈ N.
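The gradient-only estimator of Corollary 3.6 is simple to implement. The sketch below (with an illustrative smooth f and an arbitrary fixed design) also exhibits the behavior motivating Assumption A 6: the error decreases as the design region radius shrinks.

```python
import numpy as np

def gradient_only_fit(x, Ys, w, f):
    # solve (Y^T W Y) beta = Y^T W d with d_i = f(x + y_i) - f(x),
    # i.e. equation (33) with the Hessian term dropped
    Y = np.vstack(Ys)
    W = np.diag(w)
    d = np.array([f(x + y) - f(x) for y in Ys])
    return np.linalg.solve(Y.T @ W @ Y, Y.T @ W @ d)

f = lambda x: np.exp(x[0]) + x[0] * x[1]                  # illustrative smooth function
grad_f = lambda x: np.array([np.exp(x[0]) + x[1], x[0]])  # its exact gradient
x = np.array([0.2, 0.4])

rng = np.random.default_rng(7)
dirs = np.vstack([np.eye(2), rng.standard_normal((3, 2))])  # 5 fixed directions

errs = []
for delta in [1e-1, 1e-2, 1e-3]:
    beta = gradient_only_fit(x, [delta * v for v in dirs], np.ones(5), f)
    errs.append(np.linalg.norm(beta - grad_f(x)))

assert errs[0] > errs[1] > errs[2]   # accuracy improves as the design region shrinks
```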
Next, we present some stronger conditions that ensure the convergence of both the gradient and Hessian approximations.

Theorem 3.7. Consider a sequence {x_n}_{n∈N} ⊂ X and, for each n ∈ N, let ∇_n f(x_n) ∈ R^l and ∇²_n f(x_n) ∈ S^{l×l} be picked so as to satisfy (25). Suppose that the following assumptions hold.

A 9. For any compact set D ⊂ E, we have f ∈ C²(D), {f(·, N)}_{N∈N} ⊂ W¹(D) and

    lim_{N→∞} ∥f(·, N) − f∥_{W¹(D)} = 0

A 10. There exists a sequence {N^#_n}_{n∈N} of sample sizes such that

1. N^i_n = N^#_n for all i = 1, …, M_n, and
2. N^#_n → ∞ as n → ∞.

A 11. The set of design points {x_n + y^i_n : i = 1, …, M_n and n ∈ N} and the sequence {W_n}_{n∈N} of weight matrices satisfy the following.

1. There exists a compact set C ⊂ E such that the point x_n and the set of design points {x_n + y^i_n : i = 1, …, M_n} satisfy

    x_n ∈ C and {x_n + y^i_n : i = 1, …, M_n} ⊂ C   (57)

for each n ∈ N.

2. There exists K^I_λ < ∞ such that for each n ∈ N,

    ∥(Z̄^I_n)^T W^I_n Z̄^I_n∥₂ ∥((Z̄^I_n)^T W^I_n Z̄^I_n)^{−1}∥₂ < K^I_λ   (58)

3.
    lim_{n→∞} ∥(Z̄^O_n)^T W^O_n Z̄^O_n∥₂ / ∥(Z̄^I_n)^T W^I_n Z̄^I_n∥₂ = 0   (59)

Further, let Assumption A 6 also hold. Then, we have

    lim_{n→∞} ∥∇_n f(x_n) − ∇f(x_n)∥₂ = 0  and  lim_{n→∞} ∥∇²_n f(x_n) − ∇²f(x_n)∥₂ = 0

In particular, if {x_n}_{n∈N} ⊂ X is such that x_n → x ∈ X as n → ∞, then

    lim_{n→∞} ∇_n f(x_n) = ∇f(x)  and  lim_{n→∞} ∇²_n f(x_n) = ∇²f(x)
Before proving Theorem 3.7, we state and prove a useful lemma.
Lemma 3.8. Let Assumption A 11 hold. Then, ZTn WnZn and ZT
n WnZn are positive definite for each n ∈ N
and the following hold true.
30
1.

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 < K^I_λ for all n ∈ N    (60)

2.

    lim_{n→∞} ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 = 0    (61)

Proof. Since (Z^I_n)^T W^I_n Z^I_n is a symmetric and positive semidefinite matrix, we get from (58) that

    λ_min((Z^I_n)^T W^I_n Z^I_n) = 1 / ‖((Z^I_n)^T W^I_n Z^I_n)^{−1}‖_2 > 0

Further, it is easily seen that

    Z_n^T W_n Z_n = (Z^I_n)^T W^I_n Z^I_n + (Z^O_n)^T W^O_n Z^O_n

Now, since (Z^O_n)^T W^O_n Z^O_n is positive semidefinite, from the interlocking eigenvalues theorem we know that

    λ_min(Z_n^T W_n Z_n) ≥ λ_min((Z^I_n)^T W^I_n Z^I_n) > 0

Therefore, Z_n^T W_n Z_n is positive definite. Further, from the definition of D_n in (32), we know that D_n is positive definite for each n ∈ N. Since congruence by a nonsingular matrix preserves positive definiteness, we get that Z̄_n^T W_n Z̄_n = D_n^{−1}(Z_n^T W_n Z_n)D_n^{−1} is positive definite for each n ∈ N. Now,

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 = 1/λ_min(Z_n^T W_n Z_n) ≤ 1/λ_min((Z^I_n)^T W^I_n Z^I_n) = ‖((Z^I_n)^T W^I_n Z^I_n)^{−1}‖_2    (62)

Therefore, we see that for any n ∈ N, using (62),

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 ≤ ‖((Z^I_n)^T W^I_n Z^I_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 < K^I_λ

Thus, we have shown that (60) holds. Similarly,

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 ≤ ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 ( ‖(Z^O_n)^T W^O_n Z^O_n‖_2 / ‖(Z^I_n)^T W^I_n Z^I_n‖_2 )

Using (60) on the right-hand side of the above inequality, we get

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 ≤ K^I_λ ( ‖(Z^O_n)^T W^O_n Z^O_n‖_2 / ‖(Z^I_n)^T W^I_n Z^I_n‖_2 )

Finally, we use (59) in Assumption A 11 to get that

    lim_{n→∞} ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 ≤ K^I_λ lim_{n→∞} ‖(Z^O_n)^T W^O_n Z^O_n‖_2 / ‖(Z^I_n)^T W^I_n Z^I_n‖_2 = 0 □
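The key matrix step in the proof — adding the positive semidefinite outer-point matrix cannot decrease the smallest eigenvalue — is easy to check numerically. The sketch below uses random matrices standing in for (Z^I_n)^T W^I_n Z^I_n and (Z^O_n)^T W^O_n Z^O_n; it is purely a sanity check, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
l = 4
# A symmetric positive definite "inner" matrix, playing the role of Z_I^T W_I Z_I.
M = rng.normal(size=(6, l))
A = M.T @ M + 0.1 * np.eye(l)
# A symmetric positive semidefinite "outer" matrix, playing the role of Z_O^T W_O Z_O.
P = rng.normal(size=(2, l))
B = P.T @ P

lam_min = lambda S: np.linalg.eigvalsh(S)[0]   # eigvalsh returns ascending eigenvalues
# Adding a PSD matrix cannot decrease the smallest eigenvalue (Weyl's inequality).
assert lam_min(A + B) >= lam_min(A) - 1e-12
```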
Proof of Theorem 3.7:
From Lemma 3.8, we know that Z̄_n^T W_n Z̄_n is positive definite and hence non-singular for each n ∈ N. Therefore, we can rewrite (25) as

    (∇_n f(x_n), [∇²_n f(x_n)]_v) − (∇f(x_n), [∇²f(x_n)]_v)
        = (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n f(x_n, Y_n, N_n) − (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n Z̄_n (∇f(x_n), [∇²f(x_n)]_v)    (63)

where (u, v) denotes the column vector obtained by stacking u on top of v.

Noting from Assumption A 9 that f(·, N) ∈ W¹(C) for each N ∈ N, from Assumption A 10 that N^i_n = N^#_n for i = 1, …, M_n, and that {x_n}_{n∈N} ⊂ C, we define the vector b_n ∈ R^{M_n} with i-th entry y_n^{iT}(∇f(x_n, N^i_n) − ∇f(x_n)), so that

    b_n = Y_n (∇f(x_n, N^#_n) − ∇f(x_n))

Adding and subtracting (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n (b_n + f(x_n, Y_n)) in (63), we get

    (∇_n f(x_n), [∇²_n f(x_n)]_v) − (∇f(x_n), [∇²f(x_n)]_v)
        = (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n [f(x_n, Y_n, N_n) − f(x_n, Y_n) − b_n]
          + (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n b_n
          + (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n [f(x_n, Y_n) − Z̄_n (∇f(x_n), [∇²f(x_n)]_v)]    (64)

Using the fact that (Z̄_n^T W_n Z̄_n)^{−1} Z̄_n^T W_n = D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n, we get

    (∇_n f(x_n), [∇²_n f(x_n)]_v) − (∇f(x_n), [∇²f(x_n)]_v)
        = D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n [f(x_n, Y_n, N_n) − f(x_n, Y_n) − b_n]
          + D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n b_n
          + D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n [f(x_n, Y_n) − Z̄_n (∇f(x_n), [∇²f(x_n)]_v)]    (65)

We assumed without loss of generality that ‖y^i_n‖_2 > 0 for all i ∈ {1, …, M_n} and n ∈ N. Using (17), it is easily seen that this means ‖z^i_n‖_2 > 0 for all i ∈ {1, …, M_n} and n ∈ N. Therefore, again using Assumption A 10, we set for each n ∈ N,

    a_n := f(x_n, Y_n, N_n) − f(x_n, Y_n) − b_n,

whose i-th entry is

    a^i_n = (f(x_n + y^i_n, N^#_n) − f(x_n, N^#_n)) − (f(x_n + y^i_n) − f(x_n)) − y_n^{iT}(∇f(x_n, N^#_n) − ∇f(x_n))
Let us simplify the third term on the right side of (65):

    f(x_n, Y_n) − Z̄_n (∇f(x_n), [∇²f(x_n)]_v) = f(x_n, Y_n) − Y_n ∇f(x_n) − Y^Q_n [∇²f(x_n)]_v,

whose i-th entry is

    f(x_n + y^i_n) − f(x_n) − y_n^{iT} ∇f(x_n) − (y^{iQ}_n)^T [∇²f(x_n)]_v
        = f(x_n + y^i_n) − f(x_n) − y_n^{iT} ∇f(x_n) − (1/2) y_n^{iT} ∇²f(x_n) y^i_n

Again noting that δ_n > 0 for each n ∈ N and ‖z^i_n‖_2 > 0 for each i ∈ {1, …, M_n} and n ∈ N, we set

    c_n := f(x_n, Y_n) − Z̄_n (∇f(x_n), [∇²f(x_n)]_v),

whose i-th entry is

    c^i_n = f(x_n + y^i_n) − f(x_n) − y_n^{iT} ∇f(x_n) − (1/2) y_n^{iT} ∇²f(x_n) y^i_n

Therefore, we finally have for each n ∈ N,

    (∇_n f(x_n), [∇²_n f(x_n)]_v) − (∇f(x_n), [∇²f(x_n)]_v)
        = D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n a_n + D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n b_n + D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n c_n

and hence

    ‖(∇_n f(x_n), [∇²_n f(x_n)]_v) − (∇f(x_n), [∇²f(x_n)]_v)‖_2
        ≤ ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n a_n‖_2 + ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n b_n‖_2 + ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n c_n‖_2    (66)

Next, we show that all three terms on the right side of (66) converge to zero as n → ∞. But first, we note that from Assumption A 6, δ_n → 0 as n → ∞. Therefore, there exists n* ∈ N such that for all n > n*, δ_n < 1. This means that for all n > n*, (1/δ_n)² > (1/δ_n), and therefore ‖D_n‖_2 = (1/δ_n)² for all n > n*.

Let us consider the first term on the right side of (66) for n > n*. First, we note that

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n a_n‖_2 ≤ ‖D_n‖_2 ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖Z_n^T W_n a_n‖_2 = (1/δ_n²) ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖Z_n^T W_n a_n‖_2
Recall that the matrix Z_n has p = l + l(l+1)/2 columns, and let a^i_n denote the i-th entry of a_n, namely

    a^i_n = (f(x_n + y^i_n, N^#_n) − f(x_n, N^#_n)) − (f(x_n + y^i_n) − f(x_n)) − y_n^{iT}(∇f(x_n, N^#_n) − ∇f(x_n))

Using the column count in (43) of Lemma 3.4, we get

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n a_n‖_2 ≤ p ‖(Z_n^T W_n Z_n)^{−1}‖_2 [ ‖(Z^I_n)^T W^I_n Z^I_n‖_2 max_{i∈{1,…,M^I_n}} |a^i_n| / (δ_n² ‖z^i_n‖_2) + ‖(Z^O_n)^T W^O_n Z^O_n‖_2 max_{i∈{M^I_n+1,…,M_n}} |a^i_n| / (δ_n² ‖z^i_n‖_2) ]

From the above, we get

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n a_n‖_2 ≤ p ‖(Z_n^T W_n Z_n)^{−1}‖_2 ( ‖(Z^I_n)^T W^I_n Z^I_n‖_2 + ‖(Z^O_n)^T W^O_n Z^O_n‖_2 ) max_{i∈{1,…,M_n}} |a^i_n| / (δ_n² ‖z^i_n‖_2)

First, from Assumption A 11 and Lemma 3.8, we get that

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 ≤ K^I_λ for all n ∈ N

and

    lim_{n→∞} ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 = 0

Therefore, clearly there exists a constant K_λ < ∞ such that for all n ∈ N,

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ( ‖(Z^I_n)^T W^I_n Z^I_n‖_2 + ‖(Z^O_n)^T W^O_n Z^O_n‖_2 ) < K_λ    (67)

Next, we note from (17) that for any n ∈ N and i ∈ {1, …, M_n},

    δ_n² ‖z^i_n‖_2 = √( δ_n² ‖y^i_n‖_2² + (1/2) ‖y^i_n‖_2⁴ ) ≥ √(1/2) ‖y^i_n‖_2²    (68)

Therefore, for each i = 1, …, M_n,

    |a^i_n| / (δ_n² ‖z^i_n‖_2) ≤ √2 |a^i_n| / ‖y^i_n‖_2²

Now, from Assumption A 10, we know that N^#_n → ∞ as n → ∞. From Assumption A 9 we then get that

    lim_{n→∞} ‖f(·, N^#_n) − f‖_{W¹(C)} = 0

where C ⊂ E is the compact set defined in Assumption A 11. Therefore, we get from Lemma 2.4 in ? that

    lim_{n→∞} sup_{x, x+y ∈ C, y ≠ 0} | (f(x + y, N^#_n) − f(x + y)) − (f(x, N^#_n) − f(x)) − y^T(∇f(x, N^#_n) − ∇f(x)) | / ‖y‖_2² = 0

Further, from Assumption A 11, we know that x_n + y^i_n ∈ C for each i = 1, …, M_n and n ∈ N, and {x_n}_{n∈N} ⊂ C. Therefore, for each n ∈ N,

    max_{i∈{1,…,M_n}} |a^i_n| / ‖y^i_n‖_2² ≤ sup_{x, x+y ∈ C, y ≠ 0} | (f(x + y, N^#_n) − f(x + y)) − (f(x, N^#_n) − f(x)) − y^T(∇f(x, N^#_n) − ∇f(x)) | / ‖y‖_2²

Therefore, we get that

    lim_{n→∞} max_{i∈{1,…,M_n}} |a^i_n| / ‖y^i_n‖_2² = 0

And hence we finally get

    lim_{n→∞} ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n a_n‖_2 ≤ √2 p K_λ lim_{n→∞} max_{i∈{1,…,M_n}} |a^i_n| / ‖y^i_n‖_2² = 0
Next, let us consider the second term on the right side of (66). Letting 0_{l(l+1)/2} denote a column vector with l(l+1)/2 elements, each of which is equal to zero, it is easily seen that

    b_n = Y_n (∇f(x_n, N^#_n) − ∇f(x_n)) = Z̄_n (∇f(x_n, N^#_n) − ∇f(x_n), 0_{l(l+1)/2}) = Z_n D_n^{−1} (∇f(x_n, N^#_n) − ∇f(x_n), 0_{l(l+1)/2})

Using the definition of b_n, we get for each n ∈ N,

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n b_n‖_2 = ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n Z_n D_n^{−1} (∇f(x_n, N^#_n) − ∇f(x_n), 0_{l(l+1)/2})‖_2 = ‖∇f(x_n, N^#_n) − ∇f(x_n)‖_2

From Assumption A 9 and the fact that C ⊂ E is compact, we know that ‖f(·, N) − f‖_{W¹(C)} → 0 as N → ∞. Since N^#_n → ∞ as n → ∞, we get that ‖f(·, N^#_n) − f‖_{W¹(C)} → 0 as n → ∞. Further, using the fact that {x_n}_{n∈N} ⊂ C and the definition of the norm on W¹(C), we get that

    lim_{n→∞} ‖∇f(x_n, N^#_n) − ∇f(x_n)‖_2 = 0

Therefore,

    lim_{n→∞} ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n b_n‖_2 = 0
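The cancellation used for the second term — the weighted regression operator reproduces exactly any right-hand side lying in the column space of the design matrix — can be verified numerically. The matrices below are random stand-ins for Z_n, W_n and D_n, under the illustrative assumption that Z has full column rank:

```python
import numpy as np

rng = np.random.default_rng(2)
M, p = 12, 5
Z = rng.normal(size=(M, p))                    # full column rank a.s.
W = np.diag(rng.uniform(0.5, 2.0, size=M))     # positive weights
D = np.diag(rng.uniform(0.5, 2.0, size=p))     # positive definite scaling
v = rng.normal(size=p)

# D (Z^T W Z)^{-1} Z^T W (Z D^{-1} v) = v : the regression recovers exactly
# any data vector of the form Z D^{-1} v, which is how b_n drops out above.
A = Z.T @ W @ Z
lhs = D @ np.linalg.solve(A, Z.T @ W @ (Z @ np.linalg.inv(D) @ v))
assert np.allclose(lhs, v)
```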
Finally, we consider the third term on the right side of (66) for n > n*. We have

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n c_n‖_2 ≤ ‖D_n‖_2 ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖Z_n^T W_n c_n‖_2 = (1/δ_n²) ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖Z_n^T W_n c_n‖_2

Let c^i_n denote the i-th entry of c_n, namely

    c^i_n = f(x_n + y^i_n) − f(x_n) − y_n^{iT} ∇f(x_n) − (1/2) y_n^{iT} ∇²f(x_n) y^i_n

Using Lemma 3.4 and the definition of c_n, we get

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n c_n‖_2 ≤ p ‖(Z_n^T W_n Z_n)^{−1}‖_2 [ ‖(Z^I_n)^T W^I_n Z^I_n‖_2 max_{i∈{1,…,M^I_n}} |c^i_n| / (δ_n² ‖z^i_n‖_2) + ‖(Z^O_n)^T W^O_n Z^O_n‖_2 max_{i∈{M^I_n+1,…,M_n}} |c^i_n| / (δ_n² ‖z^i_n‖_2) ]

Again using (68), we get that

    ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n c_n‖_2 ≤ √2 p ‖(Z_n^T W_n Z_n)^{−1}‖_2 [ ‖(Z^I_n)^T W^I_n Z^I_n‖_2 max_{i∈{1,…,M^I_n}} |c^i_n| / ‖y^i_n‖_2² + ‖(Z^O_n)^T W^O_n Z^O_n‖_2 max_{i∈{M^I_n+1,…,M_n}} |c^i_n| / ‖y^i_n‖_2² ]    (69)

Let us consider the two terms on the right side of (69). First, we show that

    lim_{n→∞} √2 p ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 max_{i∈{1,…,M^I_n}} |c^i_n| / ‖y^i_n‖_2² = 0

From Assumption A 11 and Lemma 3.8, we know that for all n ∈ N,

    ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 ≤ K^I_λ

From Assumption A 9, we know that f ∈ C²(C). Since C ⊂ E is compact, ∇²f is uniformly continuous on C. That is, for any ε > 0, there exists δ > 0 such that

    sup_{x, x+y ∈ C, 0 < ‖y‖_2 ≤ δ} | f(x + y) − f(x) − y^T ∇f(x) − (1/2) y^T ∇²f(x) y | / ‖y‖_2² < ε    (70)

Now, from Assumption A 11, we know that {x_n + y^i_n : i = 1, …, M_n} ⊂ C for each n ∈ N and {x_n}_{n∈N} ⊂ C. Further, using Assumption A 6 we get

    lim_{n→∞} max_{i∈{1,…,M^I_n}} ‖x_n + y^i_n − x_n‖_2 = lim_{n→∞} max_{i∈{1,…,M^I_n}} ‖y^i_n‖_2 ≤ lim_{n→∞} δ_n = 0    (71)

It follows that for any δ > 0, there exists n̄ > n* such that for all n > n̄, δ_n < δ. Thus, 0 < ‖y^i_n‖_2 ≤ δ for all i = 1, …, M^I_n and n > n̄. Thus, we get from (70) that for any ε > 0, there exists n̄ ∈ N such that

    |c^i_n| / ‖y^i_n‖_2² < ε for i ∈ {1, …, M^I_n} and n > n̄

Consequently,

    lim_{n→∞} max_{i∈{1,…,M^I_n}} |c^i_n| / ‖y^i_n‖_2² = 0

Therefore, we see that

    lim_{n→∞} √2 p ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^I_n)^T W^I_n Z^I_n‖_2 max_{i∈{1,…,M^I_n}} |c^i_n| / ‖y^i_n‖_2² ≤ √2 p K^I_λ lim_{n→∞} max_{i∈{1,…,M^I_n}} |c^i_n| / ‖y^i_n‖_2² = 0

Finally, we show that

    lim_{n→∞} √2 p ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 max_{i∈{M^I_n+1,…,M_n}} |c^i_n| / ‖y^i_n‖_2² = 0

As we noted earlier, from Assumption A 9, f is twice continuously differentiable on C. Further, since C is compact, we get from Lemma 2.3 in ? that there exists K_{2f} < ∞ such that

    sup_{x, x+y ∈ C, y ≠ 0} | f(x + y) − f(x) − y^T ∇f(x) − (1/2) y^T ∇²f(x) y | / ‖y‖_2² < K_{2f}

Since {x_n + y^i_n : i = 1, …, M_n} ⊂ C for each n ∈ N and {x_n}_{n∈N} ⊂ C, it follows that

    max_{i∈{M^I_n+1,…,M_n}} |c^i_n| / ‖y^i_n‖_2² ≤ K_{2f}

Also, from Assumption A 11 and Lemma 3.8, we get that

    lim_{n→∞} ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 = 0

Therefore, we finally get

    lim_{n→∞} √2 p ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 max_{i∈{M^I_n+1,…,M_n}} |c^i_n| / ‖y^i_n‖_2² ≤ √2 p K_{2f} lim_{n→∞} ‖(Z_n^T W_n Z_n)^{−1}‖_2 ‖(Z^O_n)^T W^O_n Z^O_n‖_2 = 0
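Both maxima controlled above are second-order Taylor remainders divided by ‖y‖²: the inner-point term vanishes because the steps shrink with δ_n, while the outer-point term merely stays bounded over the compact set C. A one-dimensional illustration with the assumed toy function f(x) = eˣ:

```python
import numpy as np

def remainder_ratio(x, y):
    """|f(x+y) - f(x) - y f'(x) - 0.5 y^2 f''(x)| / y^2 for f = exp."""
    f = df = d2f = np.exp(x)           # exp is its own derivative
    return abs(np.exp(x + y) - f - y * df - 0.5 * y**2 * d2f) / y**2

x = 0.3
# The ratio shrinks with the step size (inner design points, delta_n -> 0) ...
assert remainder_ratio(x, 1e-3) < remainder_ratio(x, 1e-1)
# ... and stays bounded for moderate steps inside a compact set.
assert remainder_ratio(x, 1.0) < np.exp(x + 1.0)
```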
Thus, we have shown that ‖D_n (Z_n^T W_n Z_n)^{−1} Z_n^T W_n c_n‖_2 → 0 as n → ∞. Hence it holds that

    lim_{n→∞} ‖(∇_n f(x_n), [∇²_n f(x_n)]_v) − (∇f(x_n), [∇²f(x_n)]_v)‖_2 = 0

This in turn gives us

    lim_{n→∞} ‖∇_n f(x_n) − ∇f(x_n)‖_2 = 0 and lim_{n→∞} ‖∇²_n f(x_n) − ∇²f(x_n)‖_2 = 0.

In particular, if x_n → x ∈ X as n → ∞, then since f ∈ C²(C),

    lim_{n→∞} ‖∇_n f(x_n) − ∇f(x)‖_2 ≤ lim_{n→∞} ‖∇_n f(x_n) − ∇f(x_n)‖_2 + lim_{n→∞} ‖∇f(x_n) − ∇f(x)‖_2 = 0

    lim_{n→∞} ‖∇²_n f(x_n) − ∇²f(x)‖_2 ≤ lim_{n→∞} ‖∇²_n f(x_n) − ∇²f(x_n)‖_2 + lim_{n→∞} ‖∇²f(x_n) − ∇²f(x)‖_2 = 0 □

Thus, we have three results that relate the accuracy of f(x_n, N⁰_n), ∇_n f(x_n) and ∇²_n f(x_n), as respective approximations of f(x_n), ∇f(x_n) and ∇²f(x_n), to the design points {x_n + y^i_n : i = 1, …, M_n}, the corresponding sample sizes N⁰_n and {N^i_n : i = 1, …, M_n}, and the weights {w^i_n : i = 1, …, M_n} used in the determination of f(x_n, N⁰_n), ∇_n f(x_n) and ∇²_n f(x_n) (ultimately used to define the model function m_n). In the following sections, we discuss various practical schemes that can be used to satisfy the various assumptions made in Lemma 3.1 and Theorems 3.2 and 3.7, which can then be used in our trust region algorithms in Section ?? to construct a sufficiently accurate regression model function for each iteration n.
3.1 Smoothness of f and Convergence of {f(·, N)}_{N∈N}

Let us first consider Assumptions A 2, A 4 and A 9, made in Lemma 3.1 and Theorems 3.2 and 3.7 respectively. These assumptions place (successively stronger) restrictions on the smoothness of the objective function f and on the strength of convergence of the sequence {f(·, N)}_{N∈N} of sample average functions to the objective function f as N → ∞, on a set C ⊂ E which contains all the candidate solutions and design points. Next, we obtain sufficient conditions on the smoothness of F on E such that Assumptions A 2 and A 4 may be satisfied. Accordingly, we begin by noting a few facts regarding the Clarke generalized gradient, defined for Lipschitz continuous functions on E.

Consider a function g : E → R such that

    K_g := sup_{x,y ∈ E, x ≠ y} | (g(x) − g(y)) / ‖x − y‖_2 | < ∞

The following properties of g are well known.
P 3.1. From Rademacher's Theorem, there exists a set U ⊂ E with L(U) = 0 (where L denotes the Lebesgue measure on R^l) such that g is Fréchet differentiable at all x ∈ E \ U. That is, for any x ∈ E \ U, the vector ∇g(x) = (∂g/∂x_1(x), …, ∂g/∂x_l(x))^T exists, where ∂g/∂x_i(x) denotes the partial derivative of g with respect to the i-th vector e_i from the standard basis for R^l, and satisfies

    lim_{y→0} |g(x + y) − g(x) − y^T ∇g(x)| / ‖y‖_2 = 0

P 3.2. Further, it is known that

    sup_{x ∈ E\U} ‖∇g(x)‖_2 = sup_{x,y ∈ E, x ≠ y} | (g(x) − g(y)) / ‖x − y‖_2 | = K_g    (72)

Consequently, if g ∈ W⁰(E), then

    ‖g‖_{W⁰(E)} := sup_{x ∈ E} |g(x)| + sup_{x,y ∈ E, x ≠ y} | (g(x) − g(y)) / ‖x − y‖_2 | = sup_{x ∈ E} |g(x)| + sup_{x ∈ E\U} ‖∇g(x)‖_2

P 3.3. For any sequence {x_n}_{n∈N} ⊂ E \ U such that x_n → x ∈ E \ U, we have ∇g(x_n) → ∇g(x) as n → ∞.

Now, for any x ∈ E, the Clarke generalized gradient ∂g(x) is the set defined as follows:

    ∂g(x) := conv{ v ∈ R^l : v = lim_{k→∞} ∇g(x_k), where {x_k}_{k∈N} ⊂ E \ U and x_k → x }    (73)

P 3.4. For any x ∈ E, ∂g(x) = {d(x)} for some d(x) ∈ R^l if and only if g is Fréchet differentiable at x, with ∇g(x) = d(x).

P 3.5. Suppose that g_1, …, g_N ∈ W⁰(E) and α_1, …, α_N ∈ R. Then we have, for each x ∈ E,

    ∂( Σ_{j=1}^N α_j g_j )(x) ⊆ Σ_{j=1}^N α_j ∂g_j(x)
We will make the following assumptions regarding the function F : E × Ξ → R.

A 12. F(x, ·) : Ξ → R is G-measurable and Q-integrable for each x ∈ E.

A 13. There exists a G-measurable and Q-integrable function K : Ξ → R_+ such that for Q-almost all ζ, given any x, y ∈ E we have

    |F(x, ζ) − F(y, ζ)| ≤ K(ζ) ‖x − y‖_2    (74)

In particular, there exists a Q-null set N¹_Ξ ⊂ Ξ such that for ζ ∉ N¹_Ξ, (74) holds with K(ζ) < ∞.

A 14. For any fixed x ∈ E, there exists a Q-null set N²_Ξ(x) ⊂ Ξ such that for ζ ∉ N²_Ξ(x), F(·, ζ) is differentiable at x, i.e., ∇F(x, ζ) exists. Consequently, from Property P 3.4, ∂F(x, ζ) = {∇F(x, ζ)}.

Under these assumptions, we show next, using results in ? and ?, that the sequence {f(·, N)}_{N∈N} (for P-almost all ω ∈ Ω) and f satisfy the requirements of Assumptions A 2 and A 4.
Lemma 3.9. Suppose F : E × Ξ → R satisfies Assumptions A 12 through A 14. Then, given any compact set D ⊂ E, the following assertions hold.

1. f (as defined in (P)) is continuously differentiable on E, i.e., ∇f(x) exists and is continuous for each x ∈ E. Further, for each compact set D ⊂ E, f ∈ C¹(D).

2. For P-almost all ω ∈ Ω, the sequence {f(·, N)}_{N∈N} (where f(·, N) is as defined in (8)) lies in W⁰(D).

3. For P-almost all ω ∈ Ω, we have

    lim_{N→∞} sup_{x ∈ D} |f(x, N) − f(x)| = 0    (75)

4. For P-almost all ω ∈ Ω, we have

    lim_{N→∞} sup_{x ∈ D} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 = 0    (76)
Proof. We show each of the above assertions in order.

1. Under Assumptions A 12 through A 14, Lemma A2 (on page 21) in ? shows that ∇f(x) exists and is continuous for each x ∈ E. Therefore, f is also continuous on E. Now, since D ⊂ E is compact, it is easily seen that f and ∇f are bounded on D. Consequently, ‖f‖_{C¹(D)} < ∞ and hence we get that f ∈ C¹(D).

2. For any N ∈ N and x, y ∈ D, we get from (8)

    |f(x, N) − f(y, N)| = (1/N) | Σ_{j=1}^N (F(x, ζ^j) − F(y, ζ^j)) | ≤ (1/N) Σ_{j=1}^N |F(x, ζ^j) − F(y, ζ^j)|    (77)

where ζ^j = ζ^j(ω) for each j ∈ N and ω ∈ Ω. Consider the set

    N¹_Ω := ∪_{j∈N} (ζ^j)^{−1}(N¹_Ξ)    (78)

where N¹_Ξ is defined in Assumption A 13. It is clear that N¹_Ω is P-null, and {ζ^j}_{j∈N} ⊂ Ξ \ N¹_Ξ for all ω ∈ Ω \ N¹_Ω. Therefore, for ω ∈ Ω \ N¹_Ω, using (74) in (77) we get that

    |f(x, N) − f(y, N)| ≤ ( Σ_{j=1}^N K(ζ^j) / N ) ‖x − y‖_2 for any x, y ∈ D    (79)

where Σ_{j=1}^N K(ζ^j)/N < ∞.

Consequently, f(·, N) is Lipschitz continuous on D for ω ∈ Ω \ N¹_Ω. Since D is compact, we get that sup_{x∈D} |f(x, N)| < K_N for some K_N < ∞. Therefore, we finally get that for ω ∈ Ω \ N¹_Ω,

    ‖f(·, N)‖_{W⁰(D)} = sup_{x∈D} |f(x, N)| + sup_{x, x+y ∈ D, y ≠ 0} | (f(x + y, N) − f(x, N)) / ‖y‖_2 | ≤ K_N + Σ_{j=1}^N K(ζ^j)/N < ∞

Hence {f(·, N)}_{N∈N} ⊂ W⁰(D) for P-almost all ω ∈ Ω.

3. It is clear from (74) in Assumption A 13 that for Q-almost all ζ, the function F(·, ζ) is continuous on D. Now, consider some x* ∈ D such that E_Q[F(x*, ζ)] < ∞. Such an x* exists from Assumption A 12. Since D is compact, there exists K_D < ∞ such that sup_{x∈D} ‖x − x*‖_2 < K_D. Therefore, we get using (74) that for Q-almost all ζ,

    |F(x, ζ)| ≤ |F(x*, ζ)| + K(ζ) ‖x − x*‖_2 ≤ |F(x*, ζ)| + K(ζ) K_D

From Assumption A 12 we get that |F(x*, ζ)| is Q-integrable, and from Assumption A 13 we know that K(ζ) is Q-integrable. Therefore the family {|F(x, ζ)| : x ∈ D} is dominated by a Q-integrable function. Thus, using Lemma A1 (on page 67) in ?, we get that (75) holds.
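Assertion 3 is the uniform (over the compact set D) law of large numbers for the sample averages f(·, N). A small numerical illustration with the assumed toy integrand F(x, ζ) = (x − ζ)² and ζ ~ N(0, 1), for which f(x) = E[F(x, ζ)] = x² + 1 exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-1.0, 1.0, 201)   # compact set D = [-1, 1]
f = grid**2 + 1.0                     # f(x) = E[(x - zeta)^2] = x^2 + 1

def sup_deviation(N):
    """sup_{x in D} |f(x, N) - f(x)| along one sample path zeta^1, ..., zeta^N."""
    zeta = rng.normal(size=N)
    m1, m2 = zeta.mean(), (zeta**2).mean()
    f_hat = grid**2 - 2.0 * grid * m1 + m2   # expand (x - zeta)^2 and average
    return np.abs(f_hat - f).max()

# The sup-deviation over D is small for a large sample size.
assert sup_deviation(200_000) < 0.05
```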
4. Finally, we show that (76) holds. Note that the statement of (76) is well defined, since we have already shown that f ∈ C¹(D) and that there exists a P-null set N¹_Ω ⊂ Ω, defined as in (78), such that for all ω ∉ N¹_Ω, {f(·, N)}_{N∈N} ⊂ W⁰(D).

In order to show (76), we define for any x̄ ∈ D and δ > 0,

    B(x̄, δ) := { x ∈ D : ‖x − x̄‖_2 < δ }    (80)

Now, let us fix a sequence {δ_k}_{k∈N} of positive real numbers with δ_k → 0 as k → ∞, and define for each k ∈ N a function G^x̄_k : Ξ → R as

    G^x̄_k(ζ) := sup_{x ∈ B(x̄, δ_k)} sup_{d ∈ ∂F(x, ζ)} ‖d − ∇F(x̄, ζ)‖_2    (81)

For each k ∈ N, we know that B(x̄, δ_k) ⊂ D ⊂ E. Also, we know from Assumption A 13 that for ζ ∉ N¹_Ξ (where Q(N¹_Ξ) = 0), (74) holds for x, y ∈ E, and consequently we have

    sup_{x,y ∈ D, x ≠ y} | (F(x, ζ) − F(y, ζ)) / ‖x − y‖_2 | ≤ K(ζ) < ∞

Therefore, for each x ∈ D, we can define the Clarke generalized gradient ∂F(x, ζ) as in (73). Further, from Assumption A 14, we get that there exists a Q-null set N²_Ξ(x̄) such that for ζ ∉ N²_Ξ(x̄), F(·, ζ) is Fréchet differentiable at x̄, i.e., ∇F(x̄, ζ) exists. Thus, it is easily seen that {G^x̄_k}_{k∈N} is a sequence of random variables defined on Ξ \ (N¹_Ξ ∪ N²_Ξ(x̄)).

From Assumption A 13 and Property P 3.1, we know that if ζ ∉ N¹_Ξ, then ∂F(x, ζ) = {∇F(x, ζ)} for x ∈ D \ U_D(ζ), where L(U_D(ζ)) = 0. Using this, we show next that for each ζ ∈ Ξ \ (N¹_Ξ ∪ N²_Ξ(x̄)) and k ∈ N,

    G^x̄_k(ζ) = sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2    (82)

From (81) and Property P 3.4, it is easily seen that

    G^x̄_k(ζ) ≥ sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} sup_{d ∈ ∂F(x, ζ)} ‖d − ∇F(x̄, ζ)‖_2 = sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2

Therefore, we only have to show the converse inequality to prove (82). To do this, first we note that for any set C ⊂ R^l and d* ∈ R^l, we have

    sup_{d ∈ C} ‖d − d*‖_2 = sup_{d ∈ conv(C)} ‖d − d*‖_2    (83)

Let us fix ζ ∈ Ξ \ (N¹_Ξ ∪ N²_Ξ(x̄)) and k ∈ N. Then, for each x ∈ B(x̄, δ_k), using (83) and the definition of the Clarke generalized gradient in (73), we get

    sup_{d ∈ ∂F(x, ζ)} ‖d − ∇F(x̄, ζ)‖_2 = sup{ ‖d − ∇F(x̄, ζ)‖_2 : d = lim_{x_j → x} ∇F(x_j, ζ), {x_j}_{j∈N} ⊂ B(x̄, δ_k) \ U_D(ζ) }    (84)

Now, consider some d ∈ R^l such that d = lim_{x_j → x} ∇F(x_j, ζ) for some sequence {x_j}_{j∈N} ⊂ B(x̄, δ_k) \ U_D(ζ). We get for each j ∈ N,

    ‖d − ∇F(x̄, ζ)‖_2 ≤ ‖d − ∇F(x_j, ζ)‖_2 + ‖∇F(x_j, ζ) − ∇F(x̄, ζ)‖_2 ≤ ‖d − ∇F(x_j, ζ)‖_2 + sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2

Taking limits as j → ∞ on both sides of the last inequality above, and noting that ∇F(x_j, ζ) → d, we get that

    ‖d − ∇F(x̄, ζ)‖_2 ≤ sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2

Consequently, using (84),

    sup_{d ∈ ∂F(x, ζ)} ‖d − ∇F(x̄, ζ)‖_2 ≤ sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2

Since the above inequality is true for each x ∈ B(x̄, δ_k), we finally get

    G^x̄_k(ζ) = sup_{x ∈ B(x̄, δ_k)} sup_{d ∈ ∂F(x, ζ)} ‖d − ∇F(x̄, ζ)‖_2 ≤ sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2

Thus, we have shown the converse inequality, and hence (82) holds for each k ∈ N and ζ ∈ Ξ \ (N¹_Ξ ∪ N²_Ξ(x̄)). As a consequence, we get the following two properties of the sequence {G^x̄_k}_{k∈N}.
• For each k ∈ N and ζ ∈ Ξ \ (N¹_Ξ ∪ N²_Ξ(x̄)), we get from Property P 3.2 that

    G^x̄_k(ζ) ≤ 2 sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ)‖_2 ≤ 2 sup_{x ∈ D \ U_D(ζ)} ‖∇F(x, ζ)‖_2 ≤ 2 K(ζ)

Thus, for each k ∈ N, G^x̄_k is bounded above by the integrable function 2K(ζ) for Q-almost all ζ.

• For ζ ∈ Ξ \ (N¹_Ξ ∪ N²_Ξ(x̄)), we get from Property P 3.3 and the fact that δ_k → 0 as k → ∞ that

    lim_{k→∞} G^x̄_k(ζ) = lim_{k→∞} sup_{x ∈ B(x̄, δ_k) \ U_D(ζ)} ‖∇F(x, ζ) − ∇F(x̄, ζ)‖_2 = 0

Thus, the sequence {G^x̄_k}_{k∈N} converges pointwise (in ζ) to 0 for Q-almost all ζ.

Therefore, using the Lebesgue Dominated Convergence Theorem, we get that

    lim_{k→∞} E_Q[G^x̄_k(ζ)] = 0

That is, for each ε > 0, there exists k(x̄, ε) ∈ N such that for all k ≥ k(x̄, ε), E_Q[G^x̄_k(ζ)] < ε.

Now, we define the P-null set N²_Ω(x̄) ⊂ Ω as

    N²_Ω(x̄) := ∪_{j∈N} (ζ^j)^{−1}(N²_Ξ(x̄))    (85)

If ω ∉ N²_Ω(x̄), then by definition ζ^j := ζ^j(ω) ∉ N²_Ξ(x̄) for each j ∈ N. Recall that we also analogously defined the P-null set N¹_Ω ⊂ Ω in (78). Next, we consider the sample average functions {f(·, N)}_{N∈N} for ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄)). First of all, it is easily seen that for ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄)), f(·, N) is differentiable at x̄ for each N ∈ N and

    ∇f(x̄, N) = (1/N) Σ_{j=1}^N ∇F(x̄, ζ^j)

Further, for any x ∈ D, N ∈ N and ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄)), it is easily seen from Property P 3.5 that

    ∂f(x, N) = ∂( (1/N) Σ_{j=1}^N F(·, ζ^j) )(x) ⊆ (1/N) Σ_{j=1}^N ∂F(x, ζ^j)

Using these two observations, we get for ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄)), k ∈ N and x ∈ B(x̄, δ_k),

    sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x̄, N)‖_2
        = sup{ ‖d − ∇f(x̄, N)‖_2 : d ∈ ∂( (1/N) Σ_{j=1}^N F(·, ζ^j) )(x) }
        ≤ sup{ ‖d − ∇f(x̄, N)‖_2 : d ∈ (1/N) Σ_{j=1}^N ∂F(x, ζ^j) }
        = sup{ (1/N) ‖ Σ_{j=1}^N (d_j − ∇F(x̄, ζ^j)) ‖_2 : d_j ∈ ∂F(x, ζ^j) for j = 1, …, N }
        ≤ sup{ (1/N) Σ_{j=1}^N ‖d_j − ∇F(x̄, ζ^j)‖_2 : d_j ∈ ∂F(x, ζ^j) for j = 1, …, N }
        = (1/N) Σ_{j=1}^N sup_{d_j ∈ ∂F(x, ζ^j)} ‖d_j − ∇F(x̄, ζ^j)‖_2
Therefore, we get that for ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄)) and k ∈ N,

    sup_{x ∈ B(x̄, δ_k)} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x̄, N)‖_2 ≤ sup_{x ∈ B(x̄, δ_k)} { (1/N) Σ_{j=1}^N sup_{d_j ∈ ∂F(x, ζ^j)} ‖d_j − ∇F(x̄, ζ^j)‖_2 }
        ≤ (1/N) Σ_{j=1}^N { sup_{x ∈ B(x̄, δ_k)} sup_{d_j ∈ ∂F(x, ζ^j)} ‖d_j − ∇F(x̄, ζ^j)‖_2 } = (1/N) Σ_{j=1}^N G^x̄_k(ζ^j)

Now, consider the right side of the last inequality given above. Since {ζ^j}_{j∈N} is an i.i.d. sequence, we get from the strong law of large numbers that (1/N) Σ_{j=1}^N G^x̄_k(ζ^j) converges to E_Q[G^x̄_k(ζ)] = E_P[G^x̄_k(ζ¹(ω))] for P-almost all ω. That is, there exists a P-null set N³_Ω(x̄, k) ⊂ Ω such that for ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄) ∪ N³_Ω(x̄, k)), the following statement holds: for each ε > 0, there exists N(x̄, k, ω, ε) such that

    | Σ_{j=1}^N G^x̄_k(ζ^j)/N − E_Q[G^x̄_k(ζ)] | < ε for all N ≥ N(x̄, k, ω, ε)

Therefore, we finally get that given any ε > 0 and x̄ ∈ D, for each k > k(x̄, ε) there exists a P-null set N³_Ω(x̄, k) such that if ω ∈ Ω \ (N¹_Ω ∪ N²_Ω(x̄) ∪ N³_Ω(x̄, k)), then

    sup_{x ∈ B(x̄, δ_k)} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x̄, N)‖_2 < 2ε for all N > N(x̄, k, ω, ε)    (86)

Also, we have already shown that f ∈ C¹(D). Since D is compact, this means that ∇f is uniformly continuous on D. That is, for each ε > 0, there exists δ(ε) > 0 such that

    sup{ ‖∇f(x) − ∇f(y)‖_2 : x, y ∈ D, ‖y − x‖_2 ≤ δ(ε) } < ε    (87)

Now, consider the sequence {(1/i)}_{i∈N} and some i ∈ N. For each x̄ ∈ D, we first pick k^i_x̄ ∈ N such that k^i_x̄ > k(x̄, (1/i)) and δ_{k^i_x̄} < δ(1/i). Then clearly, the collection {B(x̄, δ_{k^i_x̄})}_{x̄∈D} is an open cover of D (in the subspace topology on D). Since D is compact, there exists a finite subcover of D; i.e., there exist x̄^i_1, …, x̄^i_{m_i} ∈ D such that

    D ⊆ ∪_{j=1}^{m_i} B(x̄^i_j, δ_{k^i_{x̄^i_j}})    (88)

For each j = 1, …, m_i, the following statements clearly hold.

• From Assumption A 14, if ω ∉ N²_Ω(x̄^i_j), then ∇f(x̄^i_j, N) exists for each N ∈ N.

• From the Strong Law of Large Numbers, there exists a P-null set N⁴_Ω(x̄^i_j) such that if ω ∈ Ω \ (N²_Ω(x̄^i_j) ∪ N⁴_Ω(x̄^i_j)), then for any ε > 0 there exists N(x̄^i_j, ω, ε) such that

    ‖∇f(x̄^i_j, N) − ∇f(x̄^i_j)‖_2 < ε for all N > N(x̄^i_j, ω, ε)    (89)
Let us define

    N_Ω(i) := ∪_{j=1}^{m_i} { N¹_Ω ∪ N²_Ω(x̄^i_j) ∪ N³_Ω(x̄^i_j, k^i_{x̄^i_j}) ∪ N⁴_Ω(x̄^i_j) }

    N(i, ω) := max_{j=1,…,m_i} max{ N(x̄^i_j, k^i_{x̄^i_j}, ω, (1/i)), N(x̄^i_j, ω, (1/i)) }

It is easily seen that P(N_Ω(i)) = 0.

Now, for any ω ∈ Ω \ N_Ω(i) and x ∈ D, we know from (88) that there exists j ∈ {1, …, m_i} such that x ∈ B(x̄^i_j, δ_{k^i_{x̄^i_j}}). Therefore, for each N > N(i, ω) (≥ N(x̄^i_j, k^i_{x̄^i_j}, ω, (1/i))), we get from (86) that

    sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x̄^i_j, N)‖_2 < 2/i

Similarly, for each N > N(i, ω) (≥ N(x̄^i_j, ω, (1/i))), we have from (89) that ‖∇f(x̄^i_j, N) − ∇f(x̄^i_j)‖_2 < (1/i). Finally, since we chose k^i_{x̄^i_j} such that δ_{k^i_{x̄^i_j}} < δ(1/i), we get from (87) that ‖∇f(x) − ∇f(x̄^i_j)‖_2 < (1/i). Thus, combining these three observations, we get

    sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 ≤ sup_{d ∈ ∂f(x, N)} { ‖d − ∇f(x̄^i_j, N)‖_2 + ‖∇f(x̄^i_j, N) − ∇f(x̄^i_j)‖_2 + ‖∇f(x̄^i_j) − ∇f(x)‖_2 } < 4/i

Since the above inequality is true for each x ∈ D, we get that for ω ∈ Ω \ N_Ω(i) and all N > N(i, ω),

    sup_{x ∈ D} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 ≤ 4/i    (90)

Finally, if we set N_Ω := ∪{N_Ω(i) : i ∈ N}, it is clear that P(N_Ω) = 0. Then, for ω ∈ Ω \ N_Ω, we get that for each i ∈ N, there exists N(i, ω) such that for N > N(i, ω), (90) holds. Therefore, we get that for P-almost all ω,

    lim_{N→∞} sup_{x ∈ D} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 = 0 □
From Lemma 3.9, we get the following corollary.

Corollary 3.10. Let Assumptions A 12 through A 14 hold. Then, for P-almost all ω ∈ Ω, we get

    lim_{N→∞} ‖f(·, N) − f‖_{W⁰(D)} = 0

Consequently, there exists K_f < ∞ such that ‖f‖_{W⁰(D)} < K_f and ‖f(·, N)‖_{W⁰(D)} < K_f for each N ∈ N.

Proof. Since all the assumptions of Lemma 3.9 hold, we have {f(·, N)}_{N∈N} ⊂ W⁰(D) for P-almost all ω. Also, f ∈ C¹(D) ⊂ W⁰(D). Therefore, the statement of Corollary 3.10 is well defined.

Now, from Lemma 3.9, there exists a P-null set N_Ω ⊂ Ω such that for all ω ∈ Ω \ N_Ω, {f(·, N)}_{N∈N} ⊂ W⁰(D) and (75), (76) hold. Consider any such ω ∈ Ω \ N_Ω. From Rademacher's theorem, we know that for each N ∈ N there exists a set U_N(ω) ⊂ D with L(U_N(ω)) = 0 such that ∇f(·, N) exists at all x ∈ D \ U_N(ω). Using Property P 3.4, we get that

    ‖f(·, N) − f‖_{W⁰(D)} = sup_{x ∈ D} |f(x, N) − f(x)| + sup_{x ∈ D \ U_N(ω)} ‖∇f(x, N) − ∇f(x)‖_2    (91)

It follows from (75) that

    lim_{N→∞} sup_{x ∈ D} |f(x, N) − f(x)| = 0

Also, for each N ∈ N,

    sup_{x ∈ D} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 ≥ sup_{x ∈ D \ U_N(ω)} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2

But we know that for x ∈ D \ U_N(ω), f(·, N) is differentiable and hence ∂f(x, N) = {∇f(x, N)}. Thus,

    sup_{x ∈ D \ U_N(ω)} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 = sup_{x ∈ D \ U_N(ω)} ‖∇f(x, N) − ∇f(x)‖_2

Therefore, we get from (76) that

    0 = lim_{N→∞} sup_{x ∈ D} sup_{d ∈ ∂f(x, N)} ‖d − ∇f(x)‖_2 ≥ lim_{N→∞} sup_{x ∈ D \ U_N(ω)} ‖∇f(x, N) − ∇f(x)‖_2

Thus, we have for all ω ∈ Ω \ N_Ω,

    lim_{N→∞} ‖f(·, N) − f‖_{W⁰(D)} = lim_{N→∞} sup_{x ∈ D} |f(x, N) − f(x)| + lim_{N→∞} sup_{x ∈ D \ U_N(ω)} ‖∇f(x, N) − ∇f(x)‖_2 = 0

Therefore, ‖f(·, N) − f‖_{W⁰(D)} → 0 as N → ∞ for P-almost all ω. Hence ‖f(·, N)‖_{W⁰(D)} → ‖f‖_{W⁰(D)} as N → ∞, and consequently the sequence {‖f(·, N)‖_{W⁰(D)}}_{N∈N} is bounded. Therefore, there exists K_f ∈ (‖f‖_{W⁰(D)}, ∞) such that ‖f(·, N)‖_{W⁰(D)} < K_f for each N ∈ N. □
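The W⁰(D) norm above combines a sup norm with a Lipschitz seminorm. On a one-dimensional grid it can be estimated crudely as follows; this helper is our own illustration, not a construction from the paper:

```python
import numpy as np

def w0_norm_estimate(g, grid):
    """Grid estimate of the W^0(D) norm of g on a 1-D compact set D:
    sup |g| plus the Lipschitz constant sup |g(x) - g(y)| / |x - y|,
    both approximated over the given grid points."""
    vals = np.array([g(x) for x in grid])
    sup_abs = np.abs(vals).max()
    lip = np.max(np.abs(np.diff(vals)) / np.diff(grid))
    return sup_abs + lip

# For g(x) = 2x on [0, 1]: sup |g| = 2 and Lipschitz constant = 2, so the norm is 4.
assert abs(w0_norm_estimate(lambda x: 2.0 * x, np.linspace(0.0, 1.0, 101)) - 4.0) < 1e-9
```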
From Lemma 3.9 and Corollary 3.10, it is clear that both Assumptions A 2 and A 4 are satisfied if Assumptions A 12 through A 14 hold. For the rest of this paper, we will assume that we have chosen a particular ω ∈ Ω such that Assumptions A 2 and A 4 hold.

Note that the results of Lemma 3.9 and Corollary 3.10 do not establish Assumption A 9, which requires that f ∈ C²(D) and ‖f(·, N) − f‖_{W¹(D)} → 0 as N → ∞ for any compact set D ⊂ E. We do not consider the conditions required to satisfy Assumption A 9, since we will not require the Hessian approximation ∇²_n f(x_n) to become progressively more accurate as n → ∞ in order to show convergence of our trust region algorithms. We will not even require that f ∈ C²(E), i.e., that ∇²f(x) exist at any x ∈ E. Indeed, the only assumption we will make regarding the quadratic component of our model functions is Assumption A 8, which ensures that the sequence {∇²_n f(x_n)}_{n∈N} remains bounded.

However, we will present the rest of our analysis assuming that we use a quadratic model function as in (9). Further, in the next section, we will discuss methods to pick the design points and the corresponding sample sizes and weights, not only to satisfy the assumptions in Theorem 3.2 but also to satisfy the remaining assumptions (A 10 and A 11) in Theorem 3.7. Thus, in the event that f and {f(·, N)}_{N∈N} are known to satisfy Assumption A 9, such methods may be used to find a good approximation of the Hessian ∇²f(x_n) and, in turn, a more accurate model function m_n for each n ∈ N.
3.2 Picking the Design Points, Sample Sizes and Weights

In this section, we discuss methods to pick the number of design points M_n, the design points {x_n + y^i_n : i = 1, …, M_n}, the sample sizes {N^i_n : i = 1, …, M_n} and the weights {w^i_n : i = 1, …, M_n} for each n ∈ N, so as to satisfy the assumptions made in Theorems 3.2 and 3.7. Here we will assume that for each n ∈ N, we have knowledge of x_n ∈ X, the design region radius δ_n > 0 and the sample size N⁰_n. Further, we will assume that {N⁰_n}_{n∈N} satisfies

    lim_{n→∞} N⁰_n = ∞ and N⁰_{n+1} ≥ N⁰_n for each n ∈ N    (92)

and that

    lim_{n→∞} δ_n = 0    (93)

When we describe our trust region algorithms in Section ??, we will show how δ_n and N⁰_n can be adaptively set for each iteration n such that (92) and (93) hold.

It is clear that for each n ∈ N, once the M_n design points {x_n + y^i_n : i = 1, …, M_n} and the corresponding sample sizes {N^i_n : i = 1, …, M_n} are set, in order to construct the regression model function m_n as in (9), we need to evaluate, for each n ∈ N, f(x_n, N⁰_n) and also f(x_n, N^i_n) and f(x_n + y^i_n, N^i_n) for i = 1, …, M_n. Consequently, we require max{N⁰_n, N¹_n, …, N^{M_n}_n} + Σ_{i=1}^{M_n} N^i_n evaluations of F for each n ∈ N. Now, keeping in mind that such a regression model function will eventually be used in a trust region algorithm to solve (P), recall that in Section 1 we made the assumption (A 1) that the evaluation of the function F is computationally expensive for any x ∈ R^l and ζ ∈ Ξ. Thus, the methods we develop to pick the design points and sample sizes for each n ∈ N must ensure not only that the assumptions of Theorem 3.2 or Theorem 3.7 are satisfied, but also that the number of evaluations of F required to subsequently evaluate ∇_n f(x_n) and ∇²_n f(x_n) is minimized.

We begin in Section 3.2.1 by developing procedures to pick the inner and outer design points for each n ∈ N. Subsequently, we consider methods to pick the corresponding sample sizes and weights in Sections 3.2.2 and 3.2.3 respectively.
3.2.1 Design points

The conditions imposed on the design points and weights in Theorems 3.2 and 3.7 occur in (35) through (37) in Assumption A 7 and in (57) through (59) in Assumption A 11, respectively. In this section, we provide methods to pick design points $\{x_n + y_n^i : i = 1, \ldots, M_n\}$ for each $n \in \mathbb{N}$ such that, under certain assumptions on the weights $\{w_n^i : i = 1, \ldots, M_n\}$, (35) and (36) in Assumption A 7 and (57) and (58) in Assumption A 11 are satisfied. Later, in Section 3.2.3, we deal with methods to assign an appropriate weight to each design point so as to satisfy the assumptions made in this section and the conditions (37) and (59).
First, we consider the conditions (36) and (58) in Assumptions A 7 and A 11 respectively. Let us start by stating some well known and relevant properties of matrices. Consider a matrix $Y \in \mathbb{R}^{m \times l}$ given by
\[ Y := \begin{pmatrix} y^{1T} \\ y^{2T} \\ \vdots \\ y^{mT} \end{pmatrix} \quad \text{where } y^i \in \mathbb{R}^l \text{ for each } i = 1, \ldots, m \tag{94} \]

1. The matrix $Y^T Y = \sum_{i=1}^m y^i y^{iT} \in \mathbb{R}^{l \times l}$ is symmetric and positive semidefinite.

2. We get from the Courant-Fischer eigenvalue characterization that
\[ \left\| Y^T Y \right\|_2 = \lambda_{\max}(Y^T Y) = \max_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} x^T (Y^T Y) x = \max_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^m \left( y^{iT} x \right)^2 \tag{95} \]
and
\[ \lambda_{\min}(Y^T Y) = \min_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} x^T (Y^T Y) x = \min_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^m \left( y^{iT} x \right)^2 \tag{96} \]
If $Y^T Y$ is positive definite, then $\lambda_{\min}(Y^T Y) > 0$, $(Y^T Y)^{-1}$ exists and $\left\| (Y^T Y)^{-1} \right\|_2 = 1/\lambda_{\min}(Y^T Y)$.

3. The quantity $\left\| Y^T Y \right\|_2 \left\| (Y^T Y)^{-1} \right\|_2 = \lambda_{\max}(Y^T Y)/\lambda_{\min}(Y^T Y)$ is called the (two-norm) condition number of the matrix $Y^T Y$. We have
\[ 1 \le \left\| Y^T Y \right\|_2 \left\| (Y^T Y)^{-1} \right\|_2 \le \infty \]
with the convention that $\left\| Y^T Y \right\|_2 \left\| (Y^T Y)^{-1} \right\|_2 := \infty$ whenever $\lambda_{\min}(Y^T Y) = 0$ (and $(Y^T Y)^{-1}$ does not exist).
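As a quick numerical sanity check of the characterizations (95) and (96) (this sketch is not part of the paper; the matrix $Y$ and sample counts are arbitrary illustrative choices), one can compare the eigenvalues of $Y^T Y$ against sampled values of the quadratic form over the unit sphere:

```python
# Sanity check of (95)-(96): for random Y, lambda_min and lambda_max of
# Y^T Y bound the Rayleigh quotient sum_i (y_i^T x)^2 over unit vectors x.
import numpy as np

rng = np.random.default_rng(0)
m, l = 8, 4
Y = rng.standard_normal((m, l))
G = Y.T @ Y                      # symmetric positive semidefinite

lam = np.linalg.eigvalsh(G)      # eigenvalues in ascending order
lam_min, lam_max = lam[0], lam[-1]

# Sample many unit vectors x and evaluate x^T G x = sum_i (y_i^T x)^2.
X = rng.standard_normal((1000, l))
X /= np.linalg.norm(X, axis=1, keepdims=True)
rayleigh = np.einsum('ki,ij,kj->k', X, G, X)

assert lam_min - 1e-9 <= rayleigh.min() and rayleigh.max() <= lam_max + 1e-9

# Two-norm condition number of Y^T Y (= lambda_max / lambda_min).
cond = lam_max / lam_min
```

With $m > l$ and rows in general position, $\lambda_{\min} > 0$ and the condition number is finite and at least 1, as stated above.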
Now, consider the matrix $(Y_n^I)^T W_n^I Y_n^I$ occurring in (36). Recall that $W_n^I = \mathrm{diag}(w_n^1, \ldots, w_n^{M_n^I})$, where $w_n^i > 0$ is the weight assigned to inner design point $x_n + y_n^i$ for each $i = 1, \ldots, M_n^I$ and $n \in \mathbb{N}$. It is easily seen that for each $n \in \mathbb{N}$,
\[ (Y_n^I)^T W_n^I Y_n^I = \left( \sqrt{W_n^I}\, Y_n^I \right)^T \left( \sqrt{W_n^I}\, Y_n^I \right) \]
Therefore, using (95) and (96), we get
\begin{align*}
\left\| (Y_n^I)^T W_n^I Y_n^I \right\|_2 &= \lambda_{\max}\!\left( \left( \sqrt{W_n^I}\, Y_n^I \right)^T \left( \sqrt{W_n^I}\, Y_n^I \right) \right) = \max_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^{M_n^I} w_n^i \left( (y_n^i)^T x \right)^2 \\
&\le \left( \max_{i=1,\ldots,M_n^I} w_n^i \right) \left\{ \max_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^{M_n^I} \left( (y_n^i)^T x \right)^2 \right\} = \left( \max_{i=1,\ldots,M_n^I} w_n^i \right) \left\| (Y_n^I)^T Y_n^I \right\|_2
\end{align*}
and
\begin{align*}
\lambda_{\min}\!\left( \left( \sqrt{W_n^I}\, Y_n^I \right)^T \left( \sqrt{W_n^I}\, Y_n^I \right) \right) &= \min_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^{M_n^I} w_n^i \left( (y_n^i)^T x \right)^2 \\
&\ge \left( \min_{i=1,\ldots,M_n^I} w_n^i \right) \left\{ \min_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^{M_n^I} \left( (y_n^i)^T x \right)^2 \right\} = \left( \min_{i=1,\ldots,M_n^I} w_n^i \right) \lambda_{\min}\!\left( (Y_n^I)^T Y_n^I \right)
\end{align*}
which gives us
\[ \left\| \left( (Y_n^I)^T W_n^I Y_n^I \right)^{-1} \right\|_2 = \frac{1}{\lambda_{\min}\!\left( \left( \sqrt{W_n^I}\, Y_n^I \right)^T \left( \sqrt{W_n^I}\, Y_n^I \right) \right)} \le \left( \frac{1}{\min_{i=1,\ldots,M_n^I} w_n^i} \right) \left\| \left( (Y_n^I)^T Y_n^I \right)^{-1} \right\|_2 \]
Therefore, it is clear that
\[ \left\| (Y_n^I)^T W_n^I Y_n^I \right\|_2 \left\| \left( (Y_n^I)^T W_n^I Y_n^I \right)^{-1} \right\|_2 \le \left( \frac{\max_{i=1,\ldots,M_n^I} w_n^i}{\min_{i=1,\ldots,M_n^I} w_n^i} \right) \left\| (Y_n^I)^T Y_n^I \right\|_2 \left\| \left( (Y_n^I)^T Y_n^I \right)^{-1} \right\|_2 \tag{97} \]
Similarly, it is easy to show that
\[ \left\| (Z_n^I)^T W_n^I Z_n^I \right\|_2 \left\| \left( (Z_n^I)^T W_n^I Z_n^I \right)^{-1} \right\|_2 \le \left( \frac{\max_{i=1,\ldots,M_n^I} w_n^i}{\min_{i=1,\ldots,M_n^I} w_n^i} \right) \left\| (Z_n^I)^T Z_n^I \right\|_2 \left\| \left( (Z_n^I)^T Z_n^I \right)^{-1} \right\|_2 \tag{98} \]
Now, in order to satisfy (36) and (58), we will obtain separate bounds on each of the terms occurring in the products on the right sides of (97) and (98) respectively.
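The bound (97) is easy to confirm numerically. The following sketch (not from the paper; the matrix `Y`, the weights `w`, and the helper name `cond2` are illustrative choices) compares the condition number of the weighted Gram matrix with the weight-ratio bound:

```python
# Numerical illustration of (97): cond(Y^T W Y) is at most
# (max_i w_i / min_i w_i) * cond(Y^T Y) for positive weights w_i.
import numpy as np

rng = np.random.default_rng(1)
m, l = 10, 3
Y = rng.standard_normal((m, l))
w = rng.uniform(0.5, 2.0, size=m)        # positive weights

def cond2(G):
    """Two-norm condition number of a symmetric positive definite G."""
    lam = np.linalg.eigvalsh(G)
    return lam[-1] / lam[0]

G = Y.T @ Y
Gw = Y.T @ np.diag(w) @ Y                # (sqrt(W) Y)^T (sqrt(W) Y)

lhs = cond2(Gw)
rhs = (w.max() / w.min()) * cond2(G)
assert lhs <= rhs + 1e-9
```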
First, we will make the following assumption in this section regarding the weights $\{w_n^i : i = 1, \ldots, M_n^I\}$ for each $n \in \mathbb{N}$.

A 15. The weights $w_n^i$ for each $i = 1, \ldots, M_n^I$ and $n \in \mathbb{N}$ are chosen such that
\[ \left( \frac{\max_{i=1,\ldots,M_n^I} w_n^i}{\min_{i=1,\ldots,M_n^I} w_n^i} \right) \le K_w^I < \infty \quad \text{for each } n \in \mathbb{N} \tag{99} \]
With Assumption A 15, suppose we choose our design points such that there exist constants $0 < K_y^I, \overline{K}_y^I < \infty$ where for each $n \in \mathbb{N}$,
\[ \left\| (Y_n^I)^T Y_n^I \right\|_2 = \lambda_{\max}\!\left( (Y_n^I)^T Y_n^I \right) < \overline{K}_y^I < \infty \tag{100} \]
and
\[ \lambda_{\min}\!\left( (Y_n^I)^T Y_n^I \right) > K_y^I > 0, \quad \text{i.e.,} \quad \left\| \left( (Y_n^I)^T Y_n^I \right)^{-1} \right\|_2 < \frac{1}{K_y^I} < \infty \tag{101} \]
Then, it is clear from (97) that (36) is satisfied for $K_\lambda^I = K_w^I \left( \overline{K}_y^I / K_y^I \right)$. Analogously, suppose we choose our design points such that there exist constants $0 < K_z^I, \overline{K}_z^I < \infty$ where for each $n \in \mathbb{N}$,
\[ \left\| (Z_n^I)^T Z_n^I \right\|_2 = \lambda_{\max}\!\left( (Z_n^I)^T Z_n^I \right) < \overline{K}_z^I < \infty \tag{102} \]
and
\[ \lambda_{\min}\!\left( (Z_n^I)^T Z_n^I \right) > K_z^I > 0, \quad \text{i.e.,} \quad \left\| \left( (Z_n^I)^T Z_n^I \right)^{-1} \right\|_2 < \frac{1}{K_z^I} < \infty \tag{103} \]
Then, from Assumption A 15 and (98), we get that (58) is satisfied for $K_\lambda^I = K_w^I \left( \overline{K}_z^I / K_z^I \right)$. Accordingly, in this section, we will provide methods to pick the design points such that (100), (101), (102) and (103) hold for each $n \in \mathbb{N}$. In Section 3.2.3, we will provide an example of a scheme to set the weights $\{w_n^i : i = 1, \ldots, M_n^I\}$ such that (99) holds for some $K_w^I < \infty$ and each $n \in \mathbb{N}$.
First, let us consider (100) and (102). We know that by definition, the inner design points satisfy $\|y_n^i\|_2 \le \delta_n$, so that after scaling by $\delta_n$ we have $\|y_n^i\|_2 \le 1$ for each $i \in \{1, \ldots, M_n^I\}$. Using (17) we then get that $\|z_n^i\|_2 \le \sqrt{3/2}$ for each $i = 1, \ldots, M_n^I$. Now, suppose we can ensure that for each $n \in \mathbb{N}$ the design points we pick satisfy the following assumption.

A 16. For each $n \in \mathbb{N}$, $M_n^I \le M_{\max}^I$ for some constant $M_{\max}^I < \infty$.

Then from (95) we get that for each $n \in \mathbb{N}$,
\[ \left\| (Y_n^I)^T Y_n^I \right\|_2 = \max_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^{M_n^I} \left( (y_n^i)^T x \right)^2 \le M_n^I \max_{i \in \{1, \ldots, M_n^I\}} \left\| y_n^i \right\|_2^2 \le M_{\max}^I \tag{104} \]
\[ \left\| (Z_n^I)^T Z_n^I \right\|_2 = \max_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^{M_n^I} \left( (z_n^i)^T x \right)^2 \le M_n^I \max_{i \in \{1, \ldots, M_n^I\}} \left\| z_n^i \right\|_2^2 \le \left( \frac{3}{2} \right) M_{\max}^I \tag{105} \]
Next, we consider (101) and (103). In order to see that these are conditions on the relative positions of the inner design points within the design region, we first state some well-known facts from basic linear algebra and set the related notation.

Projection Operation: For any $y \in \mathbb{R}^l$ and any subspace $\mathcal{M} \subset \mathbb{R}^l$, we let
\[ \Pi(y, \mathcal{M}) := \arg\min_{x \in \mathcal{M}} \|y - x\|_2 \tag{106} \]
denote the orthogonal projection of $y$ onto $\mathcal{M}$. The following properties of the projection operator are well known.

P 3.6. For any $y \in \mathbb{R}^l$ and subspace $\mathcal{M} \subset \mathbb{R}^l$, we have $\|y\|_2 \ge \|\Pi(y, \mathcal{M})\|_2$.

P 3.7. For any $c_1, c_2 \in \mathbb{R}$, $y^1, y^2 \in \mathbb{R}^l$ and subspace $\mathcal{M} \subset \mathbb{R}^l$, we have
\[ \Pi(c_1 y^1 + c_2 y^2, \mathcal{M}) = c_1 \Pi(y^1, \mathcal{M}) + c_2 \Pi(y^2, \mathcal{M}) \]

P 3.8. Given a subspace $\mathcal{M} \subset \mathbb{R}^l$ with $\dim(\mathcal{M}) = j \in \{1, \ldots, l\}$ and an orthonormal basis $\{v^1, \ldots, v^j\}$ for $\mathcal{M}$, we have
\[ \Pi(y, \mathcal{M}) = \sum_{i=1}^j \left( y^T v^i \right) v^i \]
for any $y \in \mathbb{R}^l$. Consequently, $\|\Pi(y, \mathcal{M})\|_2 = \sqrt{\sum_{i=1}^j (y^T v^i)^2}$.

Orthogonal Complements: Given any subspace $\mathcal{M} \subset \mathbb{R}^l$, we let $\mathcal{M}^\perp$ denote the orthogonal complement of $\mathcal{M}$ in $\mathbb{R}^l$.

P 3.9. For any $y \in \mathbb{R}^l$ and subspace $\mathcal{M} \subset \mathbb{R}^l$, there exist unique vectors $y_{\mathcal{M}} \in \mathcal{M}$ and $y_{\mathcal{M}^\perp} \in \mathcal{M}^\perp$ such that $y = y_{\mathcal{M}} + y_{\mathcal{M}^\perp}$. In fact, it can be shown that $y_{\mathcal{M}} = \Pi(y, \mathcal{M})$ and $y_{\mathcal{M}^\perp} = \Pi(y, \mathcal{M}^\perp)$.

Consequently, using (106), we get the following property.

P 3.10. For any $y \in \mathbb{R}^l$ and subspace $\mathcal{M} \subset \mathbb{R}^l$, $\|\Pi(y, \mathcal{M}^\perp)\|_2 = \|y - \Pi(y, \mathcal{M})\|_2$ measures the distance of $y$ from the subspace $\mathcal{M}$. In particular, $y \in \mathcal{M}$ if and only if $y = \Pi(y, \mathcal{M})$, i.e., if and only if $\|\Pi(y, \mathcal{M}^\perp)\|_2 = 0$.
Now, let us again consider the matrix $Y \in \mathbb{R}^{m \times l}$ defined in (94). The following lemma provides an alternative characterization of the minimum eigenvalue of $Y^T Y$.

Lemma 3.11. Let $Y \in \mathbb{R}^{m \times l}$ and let $\{(y^1)^T, \ldots, (y^m)^T\}$ denote its rows. Then, there exists a subspace $\mathcal{M}^* \subset \mathbb{R}^l$ with $\dim(\mathcal{M}^*) = l - 1$ such that
\[ \inf_{\substack{\mathcal{M} \subset \mathbb{R}^l \\ \dim(\mathcal{M}) = l - 1}} \sum_{i=1}^m \left\| \Pi(y^i, \mathcal{M}^\perp) \right\|_2^2 = \sum_{i=1}^m \left\| \Pi\!\left( y^i, (\mathcal{M}^*)^\perp \right) \right\|_2^2 \]
Further, we have
\[ \lambda_{\min}(Y^T Y) = \min_{\substack{\mathcal{M} \subset \mathbb{R}^l \\ \dim(\mathcal{M}) = l - 1}} \sum_{i=1}^m \left\| \Pi(y^i, \mathcal{M}^\perp) \right\|_2^2 = \sum_{i=1}^m \left\| \Pi\!\left( y^i, (\mathcal{M}^*)^\perp \right) \right\|_2^2 \tag{107} \]
Consequently, $\lambda_{\min}(Y^T Y) = 0$ if and only if the row rank of $Y$ is less than $l$.
Proof. For each subspace $\mathcal{M} \subset \mathbb{R}^l$ with $\dim(\mathcal{M}) = l - 1$, we have $\dim(\mathcal{M}^\perp) = 1$ and consequently there exists $x \in \mathcal{M}^\perp \subset \mathbb{R}^l$ with $\|x\|_2 = 1$ such that $\mathcal{M}^\perp = \mathrm{span}\{x\}$. Similarly, for each $x \in \mathbb{R}^l$ with $\|x\|_2 = 1$, $\mathrm{span}\{x\}$ is a subspace of dimension $1$ and consequently the subspace $\mathcal{M} = (\mathrm{span}\{x\})^\perp$ satisfies $\dim(\mathcal{M}) = l - 1$.

Now, consider any subspace $\mathcal{M} \subset \mathbb{R}^l$ with $\dim(\mathcal{M}) = l - 1$ and $x \in \mathbb{R}^l$ with $\|x\|_2 = 1$ such that $\mathrm{span}\{x\} = \mathcal{M}^\perp$. Then $\{x\}$ is an orthonormal basis for $\mathcal{M}^\perp$. Therefore, from Property P 3.8, we get that for each $i = 1, \ldots, m$,
\[ \Pi(y^i, \mathcal{M}^\perp) = \left( y^{iT} x \right) x \quad \text{and} \quad \left\| \Pi(y^i, \mathcal{M}^\perp) \right\|_2 = \left| y^{iT} x \right| \]
and consequently
\[ \sum_{i=1}^m \left\| \Pi(y^i, \mathcal{M}^\perp) \right\|_2^2 = \sum_{i=1}^m \left( y^{iT} x \right)^2 \tag{108} \]
Therefore, we clearly get that
\[ \inf_{\substack{\mathcal{M} \subset \mathbb{R}^l \\ \dim(\mathcal{M}) = l - 1}} \sum_{i=1}^m \left\| \Pi(y^i, \mathcal{M}^\perp) \right\|_2^2 = \inf_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^m \left( y^{iT} x \right)^2 \tag{109} \]
Now, since $\sum_{i=1}^m (y^{iT} x)^2$ is a continuous function of $x$ and $\{x \in \mathbb{R}^l : \|x\|_2 = 1\}$ is a compact set, we know that there exists $x^* \in \mathbb{R}^l$ with $\|x^*\|_2 = 1$ such that
\[ \sum_{i=1}^m \left( y^{iT} x^* \right)^2 = \inf_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^m \left( y^{iT} x \right)^2 \]
Therefore, setting $\mathcal{M}^* = (\mathrm{span}\{x^*\})^\perp$ and using (108) and (109), we get
\[ \sum_{i=1}^m \left\| \Pi\!\left( y^i, (\mathcal{M}^*)^\perp \right) \right\|_2^2 = \sum_{i=1}^m \left( y^{iT} x^* \right)^2 = \inf_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^m \left( y^{iT} x \right)^2 = \inf_{\substack{\mathcal{M} \subset \mathbb{R}^l \\ \dim(\mathcal{M}) = l - 1}} \sum_{i=1}^m \left\| \Pi(y^i, \mathcal{M}^\perp) \right\|_2^2 \]
Finally, using (96), it is clear that (107) holds.

In order to show the final assertion of Lemma 3.11, let us first assume that $\lambda_{\min}(Y^T Y) = 0$. Then, from (107), there exists a subspace $\mathcal{M}^* \subset \mathbb{R}^l$ with $\dim(\mathcal{M}^*) = l - 1$ such that
\[ \sum_{i=1}^m \left\| \Pi\!\left( y^i, (\mathcal{M}^*)^\perp \right) \right\|_2^2 = 0 \]
This means that $\Pi\!\left( y^i, (\mathcal{M}^*)^\perp \right) = 0$ for each $i = 1, \ldots, m$. Thus, using Property P 3.10, we get that $y^i \in \mathcal{M}^*$ for each $i = 1, \ldots, m$. But now, since $\mathrm{span}\{y^1, \ldots, y^m\} \subset \mathcal{M}^*$ and $\dim(\mathcal{M}^*) = l - 1$, we get that the row rank of $Y$ can be at most $l - 1$.

Conversely, if the row rank of $Y$ is less than $l$, then $\dim(\mathrm{span}\{y^1, \ldots, y^m\}) < l$. Therefore, for $x^* \in \left( \mathrm{span}\{y^1, \ldots, y^m\} \right)^\perp$ such that $\|x^*\|_2 = 1$, we get
\[ \sum_{i=1}^m \left( y^{iT} x^* \right)^2 = 0 \]
Therefore, using (96), we get that
\[ 0 \le \lambda_{\min}(Y^T Y) \le \sum_{i=1}^m \left( y^{iT} x^* \right)^2 = 0 \]
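Lemma 3.11 can also be checked numerically: the eigenvector corresponding to $\lambda_{\min}$ spans $(\mathcal{M}^*)^\perp$, so the sum of squared distances of the rows to $\mathcal{M}^*$ equals $\lambda_{\min}(Y^T Y)$. A small sketch, not from the paper; the test matrices are arbitrary:

```python
# Check of Lemma 3.11: lambda_min(Y^T Y) equals the sum of squared
# distances of the rows of Y to M* = (span{x*})^perp, where x* is the
# unit eigenvector for the smallest eigenvalue; rank-deficient rows
# give lambda_min = 0.
import numpy as np

rng = np.random.default_rng(2)
m, l = 6, 3
Y = rng.standard_normal((m, l))
lam, V = np.linalg.eigh(Y.T @ Y)
x_star = V[:, 0]                        # unit eigenvector for lambda_min

# Distance of row y^i to M* is |y^i . x*|, so sum of squares is:
dist_sq = ((Y @ x_star) ** 2).sum()
assert abs(dist_sq - lam[0]) < 1e-8

# Rank-deficient case: rows confined to a 2-dimensional subspace of R^3.
Y_def = Y[:, :2] @ rng.standard_normal((2, l))     # rank <= 2 < l
lam_def = np.linalg.eigvalsh(Y_def.T @ Y_def)
assert lam_def[0] < 1e-8
```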
Using Property P 3.10 and Lemma 3.11, we can restate (107) as follows: the minimum eigenvalue of the matrix $Y^T Y$ is equal to the minimum, over all subspaces $\mathcal{M} \subset \mathbb{R}^l$ of dimension $l - 1$, of the sum of squared distances of the row vectors of $Y$ to $\mathcal{M}$. Thus, it is intuitively clear that $\lambda_{\min}(Y^T Y)$ is close to zero whenever there exists a subspace $\mathcal{M} \subset \mathbb{R}^l$ of dimension $l - 1$ such that $\|\Pi(y^i, \mathcal{M}^\perp)\|_2$ is small for each $i = 1, \ldots, m$, i.e., whenever all the row vectors of $Y$ lie close to $\mathcal{M}$. Indeed, as we showed in Lemma 3.11, we get $\lambda_{\min}(Y^T Y) = 0$ whenever the distance from each of the rows of $Y$ to some such subspace is zero, i.e., whenever the rank of $Y$ is less than $l$.
In order to formalize this intuitive notion, we next present a series of results quantitatively relating the geometry of the vectors $\{y^1, \ldots, y^m\}$ to the minimum eigenvalue of $Y^T Y$. First, let us consider the vectors corresponding to the first $l$ rows $\{y^1, \ldots, y^l\} \subset \mathbb{R}^l$. For the sake of notational brevity and consistency, we define $\mathcal{Q}_i := \mathrm{span}\{y^1, \ldots, y^i\}$ for each $i = 1, \ldots, l$. For the present, we will assume that for each $i = 1, \ldots, l$,
\[ \left\| y^i \right\|_2 = 1 \tag{110} \]
and that there exists $\mu \in (0, 1]$ such that for $i = 2, \ldots, l$,
\[ \left\| \Pi\!\left( y^i, (\mathcal{Q}_{i-1})^\perp \right) \right\|_2 \ge \mu \tag{111} \]
Note that from (110), $\|y^1\|_2 > 0$. Also, for each $i = 2, \ldots, l$, (111) implies that the distance $\left\| \Pi\!\left( y^i, (\mathcal{Q}_{i-1})^\perp \right) \right\|_2$ from $y^i$ to $\mathrm{span}\{y^1, \ldots, y^{i-1}\}$ has to be at least $\mu > 0$. In particular, this means that $y^i \notin \mathrm{span}\{y^1, \ldots, y^{i-1}\}$. Naturally, under these conditions, we would expect the set $\{y^1, \ldots, y^l\}$ to be linearly independent. The following lemma shows this result.
Lemma 3.12. Consider $j \le l$ vectors $x^1, \ldots, x^j \in \mathbb{R}^l$ that satisfy $\|x^1\|_2 > 0$ and, if $j \ge 2$,
\[ \left\| \Pi\!\left( x^i, \left( \mathrm{span}\{x^1, \ldots, x^{i-1}\} \right)^\perp \right) \right\|_2 > 0 \quad \text{for each } i = 2, \ldots, j \tag{112} \]
Then, the vectors $\{x^1, \ldots, x^j\}$ are linearly independent.

Proof. If $j = 1$, then it is trivially clear that $\{x^1\}$ is a linearly independent set since $\|x^1\|_2 > 0$. Therefore, let us assume that $j \ge 2$.

First, using Property P 3.6, it is easily seen that (112) implies $\|x^i\|_2 > 0$ for each $i = 1, \ldots, j$. Further, it is also clear from Property P 3.10 that for each $i = 2, \ldots, j$,
\[ x^i \notin \mathrm{span}\{x^1, \ldots, x^{i-1}\} \tag{113} \]
Now, in order to show that the vectors $\{x^1, \ldots, x^j\}$ are linearly independent, we must show that the only solution to
\[ \alpha_1 x^1 + \alpha_2 x^2 + \ldots + \alpha_j x^j = 0 \tag{114} \]
is $\alpha_1 = \alpha_2 = \ldots = \alpha_j = 0$. We show this by contradiction.

Let us assume that there exists a solution to (114) with $\alpha_j \neq 0$. In this case, we can write
\[ x^j = \left( \frac{-1}{\alpha_j} \right) \left( \alpha_1 x^1 + \alpha_2 x^2 + \ldots + \alpha_{j-1} x^{j-1} \right) \]
If $\alpha_1 = \ldots = \alpha_{j-1} = 0$, then we get that $x^j = 0$, which contradicts the fact that $\|x^j\|_2 > 0$. Therefore, there exists some $k \in \{1, \ldots, j-1\}$ such that $\alpha_k \neq 0$. But this means that $x^j \in \mathrm{span}\{x^1, \ldots, x^{j-1}\}$, which contradicts (113) for $i = j$. Therefore, there exists no solution to (114) with $\alpha_j \neq 0$.

Now assume that there exists a solution to (114) with $\alpha_{j-1} \neq 0$. Then, since $\alpha_j = 0$, we can write
\[ x^{j-1} = \left( \frac{-1}{\alpha_{j-1}} \right) \left( \alpha_1 x^1 + \alpha_2 x^2 + \ldots + \alpha_{j-2} x^{j-2} \right) \]
As before, since $\|x^{j-1}\|_2 > 0$, there exists some $k \in \{1, \ldots, j-2\}$ such that $\alpha_k \neq 0$. This means that $x^{j-1} \in \mathrm{span}\{x^1, \ldots, x^{j-2}\}$, which contradicts (113) for $i = j - 1$. Thus, $\alpha_{j-1} = 0$.

Proceeding in the same manner, we can show that $\alpha_j = \alpha_{j-1} = \ldots = \alpha_2 = 0$. But now, since $\|x^1\|_2 > 0$, the only solution to $\alpha_1 x^1 = 0$ is $\alpha_1 = 0$.

Thus, we have shown that the only solution to (114) is $\alpha_1 = \ldots = \alpha_j = 0$. Therefore, the vectors $\{x^1, \ldots, x^j\}$ are linearly independent.
Now, let us apply Lemma 3.12 to the vectors $\{y^1, \ldots, y^l\} \subset \mathbb{R}^l$ that satisfy (110) and (111). From Property P 3.6, it is clear that conditions (110) and (111) together imply (112). Therefore, the vectors $\{y^1, \ldots, y^l\}$ are linearly independent. Consequently, the row rank of $Y$ (defined in (94)) is equal to $l$ and, from Lemma 3.11, we get $\lambda_{\min}(Y^T Y) > 0$. However, we want to show not only that $\lambda_{\min}(Y^T Y) > 0$ but also that there exists a positive lower bound on this minimum eigenvalue that is a function of $\mu$. To this end, we will first show that if (110) is satisfied for each $i = 1, \ldots, l$ and (111) is satisfied for some $\mu \in (0, 1]$, then we get
\[ \min_{\substack{x \in \mathbb{R}^l \\ \|x\|_2 = 1}} \sum_{i=1}^l \left( (y^i)^T x \right)^2 \ge S_l(\mu) \tag{115} \]
where $S_l(\mu)$ is the $l$th term in the sequence $\{S_k(\mu)\}_{k \in \mathbb{N}} \subset \mathbb{R}$ defined iteratively for any $\mu \in (0, 1]$ as follows:
\[ S_1(\mu) := 1 \quad \text{and} \quad S_j(\mu) = \left( \frac{1 + S_{j-1}(\mu)}{2} \right) \left[ 1 - \sqrt{1 - \frac{4 S_{j-1}(\mu) \mu^2}{(S_{j-1}(\mu) + 1)^2}} \right] \quad \text{for } j = 2, 3, \ldots \tag{116} \]
The following lemma shows that the sequence $\{S_j(\mu)\}_{j \in \mathbb{N}}$ is well defined for any $\mu \in (0, 1]$ and that $S_j(\mu) > 0$ for each $j \in \mathbb{N}$.
Lemma 3.13. Consider the sequence $\{S_j(\mu)\}_{j \in \mathbb{N}} \subset \mathbb{R}$ defined in (116). For any $\mu \in (0, 1]$, the following assertions hold.

1. For each $j \in \mathbb{N}$, $S_j(\mu)$ is a well defined real number and, in particular, $S_j(\mu) > 0$.

2. For each $j \ge 2$, $S_j(\mu) \le \mu^2 S_{j-1}(\mu)$.

Proof. We show the first assertion by induction. First, it is trivially clear that $S_1(\mu)$ is a real number and $S_1(\mu) > 0$. Next, let us suppose that $S_{j-1}(\mu) \in \mathbb{R}$ and $S_{j-1}(\mu) > 0$ for some $j \ge 2$. Let us rearrange the expression for $S_j(\mu)$ (for $j \ge 2$) in (116) as follows:
\[ S_j(\mu) = \left( \frac{1 + S_{j-1}(\mu)}{2} \right) - \left( \frac{1}{2} \right) \sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} \tag{117} \]
Then, since $\mu \in (0, 1]$,
\[ (1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2 \ge (1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) = (1 - S_{j-1}(\mu))^2 \ge 0 \tag{118} \]
Thus, $\sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} \in \mathbb{R}$ and $S_j(\mu)$ is a well-defined real number. Further, since $S_{j-1}(\mu) > 0$ and $\mu > 0$, we have $4 S_{j-1}(\mu) \mu^2 > 0$. Therefore,
\begin{align*}
(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2 &< (1 + S_{j-1}(\mu))^2 \\
\implies \sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} &< 1 + S_{j-1}(\mu) \\
\implies \left( \frac{1 + S_{j-1}(\mu)}{2} \right) - \left( \frac{1}{2} \right) \sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} &> 0 \\
\implies S_j(\mu) &> 0
\end{align*}
Consequently, by induction, we get that $S_j(\mu) \in \mathbb{R}$ and $S_j(\mu) > 0$ for each $j \in \mathbb{N}$.

Next, let us consider the second assertion. Since $\mu \in (0, 1]$ and $S_j(\mu) > 0$ for each $j \in \mathbb{N}$, we know that $4 S_{j-1}(\mu)^2 \mu^2 (\mu^2 - 1) \le 0$ for each $j \ge 2$. Therefore, we get for each $j \ge 2$,
\begin{align*}
\frac{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2 + 4 S_{j-1}(\mu)^2 \mu^2 (\mu^2 - 1)}{4} &\le \frac{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2}{4} \\
\implies \left( \frac{1 + S_{j-1}(\mu) - 2 S_{j-1}(\mu) \mu^2}{2} \right)^2 &\le \left( \left( \frac{1}{2} \right) \sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} \right)^2
\end{align*}
From (118), we know that the square root on the right side of the final inequality above exists (i.e., is a real number). Now, for any $a, b \in \mathbb{R}$, we know that if $a^2 \le b^2$, then $a \le |a| \le |b|$. Therefore, we get that for each $j \ge 2$,
\begin{align*}
\frac{1 + S_{j-1}(\mu) - 2 S_{j-1}(\mu) \mu^2}{2} &\le \left( \frac{1}{2} \right) \sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} \\
\implies \left( \frac{1 + S_{j-1}(\mu)}{2} \right) - \left( \frac{1}{2} \right) \sqrt{(1 + S_{j-1}(\mu))^2 - 4 S_{j-1}(\mu) \mu^2} &\le S_{j-1}(\mu) \mu^2 \\
\implies S_j(\mu) &\le S_{j-1}(\mu) \mu^2
\end{align*}
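The recursion (116) and both assertions of Lemma 3.13 are easy to evaluate numerically. A minimal sketch, not from the paper; the helper name `S_sequence` and the choice $\mu = 0.5$ are illustrative:

```python
# Compute S_1(mu), ..., S_n(mu) as in (116) and check Lemma 3.13:
# every term is positive, and S_j(mu) <= mu^2 * S_{j-1}(mu).
import math

def S_sequence(mu, n):
    """Return [S_1(mu), ..., S_n(mu)] as defined in (116)."""
    S = [1.0]
    for _ in range(n - 1):
        s = S[-1]
        S.append(((1 + s) / 2) * (1 - math.sqrt(1 - 4 * s * mu**2 / (1 + s)**2)))
    return S

mu = 0.5
S = S_sequence(mu, 10)
assert all(s > 0 for s in S)                                       # assertion 1
assert all(S[j] <= mu**2 * S[j - 1] + 1e-15 for j in range(1, 10)) # assertion 2
```

For example, $S_2(0.5) = 1 - \sqrt{1 - 0.25} \approx 0.134$, which is indeed below $\mu^2 S_1(0.5) = 0.25$.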
Now, in order to show that (115) is satisfied, we will use the Gram-Schmidt orthogonalization process on the vectors $\{y^1, \ldots, y^l\}$. Before we do so, let us review some well known properties of the Gram-Schmidt process.

Gram-Schmidt Orthogonalization: Consider any $1 \le j \le l$ linearly independent vectors $\{x^1, \ldots, x^j\} \subset \mathbb{R}^l$. We can successfully perform the Gram-Schmidt orthogonalization process on this set. That is, we can define the sets of vectors $\{u^1, \ldots, u^j\} \subset \mathbb{R}^l$ and $\{q^1, \ldots, q^j\} \subset \mathbb{R}^l$ as follows:
\[ u^1 := x^1 \quad \text{and} \quad q^1 := \frac{u^1}{\|u^1\|_2} \tag{119} \]
and, if $j \ge 2$, then
\[ u^i := x^i - \sum_{k=1}^{i-1} \left( x^{iT} q^k \right) q^k \quad \text{and} \quad q^i := \frac{u^i}{\|u^i\|_2} \quad \text{for each } i = 2, \ldots, j \tag{120} \]
The following properties of $\{u^1, \ldots, u^j\}$ and $\{q^1, \ldots, q^j\}$ are well known.

P 3.11. $\{u^1, \ldots, u^j\}$ is an orthogonal set of vectors and $\{q^1, \ldots, q^j\}$ is an orthonormal set of vectors.

P 3.12. 1. For each $i = 1, \ldots, j$,
\[ \mathrm{span}\{x^1, \ldots, x^i\} = \mathrm{span}\{u^1, \ldots, u^i\} = \mathrm{span}\{q^1, \ldots, q^i\} \]
2. Thus, from Property P 3.11, we get that for each $i = 1, \ldots, j$, $\{q^1, \ldots, q^i\}$ forms an orthonormal basis for $\mathrm{span}\{x^1, \ldots, x^i\}$. In particular, if $j = l$, then $\{q^1, \ldots, q^l\}$ forms an orthonormal basis for $\mathrm{span}\{x^1, \ldots, x^l\} = \mathbb{R}^l$.

P 3.13. If $2 \le j \le l$, then for each $i = 2, \ldots, j$, using Property P 3.8 we get
\[ \Pi\!\left( x^i, \mathrm{span}\{x^1, \ldots, x^{i-1}\} \right) = \sum_{k=1}^{i-1} \left( x^{iT} q^k \right) q^k \]
But, since $\{q^1, \ldots, q^i\}$ is an orthonormal basis for $\mathrm{span}\{x^1, \ldots, x^i\}$ (Property P 3.12), we know that for each $i = 2, \ldots, j$, $x^i = \sum_{k=1}^i \left( x^{iT} q^k \right) q^k$. Therefore, using the fact that $x^i = \Pi\!\left( x^i, \mathrm{span}\{x^1, \ldots, x^{i-1}\} \right) + \Pi\!\left( x^i, \left( \mathrm{span}\{x^1, \ldots, x^{i-1}\} \right)^\perp \right)$, we get
\[ \Pi\!\left( x^i, \left( \mathrm{span}\{x^1, \ldots, x^{i-1}\} \right)^\perp \right) = x^i - \sum_{k=1}^{i-1} \left( x^{iT} q^k \right) q^k = u^i = \left( x^{iT} q^i \right) q^i \]
Consequently, if $j \ge 2$, then for each $i = 2, \ldots, j$, $\mathrm{span}\{q^i\}$ is the orthogonal complement of the subspace $\mathcal{Q}_{i-1}$ within the vector space $\mathcal{Q}_i$. That is, any $x \in \mathcal{Q}_i$ can be written uniquely as the sum of the two vectors $\Pi(x, \mathcal{Q}_{i-1}) \in \mathcal{Q}_{i-1}$ and $(x^T q^i) q^i \in \mathrm{span}\{q^i\}$.
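The process (119)-(120) translates directly into code. The following sketch (not from the paper; `gram_schmidt` is a hypothetical helper name) assumes, as in the text, that the input vectors are linearly independent:

```python
# Direct transcription of the Gram-Schmidt process (119)-(120).
import numpy as np

def gram_schmidt(X):
    """Rows of X are x^1, ..., x^j (assumed linearly independent).
    Returns (U, Q) whose rows are u^i and q^i as in (119)-(120)."""
    U, Q = [], []
    for x in X:
        # Subtract the projections onto q^1, ..., q^{i-1} (empty sum for i=1).
        u = x - sum((x @ q) * q for q in Q)
        U.append(u)
        Q.append(u / np.linalg.norm(u))
    return np.array(U), np.array(Q)

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 4))
U, Q = gram_schmidt(X)

# Property P 3.11: the rows of Q are orthonormal.
assert np.allclose(Q @ Q.T, np.eye(4), atol=1e-8)
# (119): u^1 = x^1.
assert np.allclose(U[0], X[0])
```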
Now, analogous to (119) and (120), we define the Gram-Schmidt vectors corresponding to the set $\{y^1, \ldots, y^l\} \subset \mathbb{R}^l$ (satisfying (110) and (111)) as
\[ u^1 := y^1 \quad \text{and} \quad q^1 := \frac{u^1}{\|u^1\|_2} \tag{121} \]
\[ u^i := y^i - \sum_{k=1}^{i-1} \left( y^{iT} q^k \right) q^k \quad \text{and} \quad q^i := \frac{u^i}{\|u^i\|_2} \quad \text{for each } i = 2, \ldots, l \tag{122} \]
From Property P 3.12 we know that for $i = 1, \ldots, l$, $\{q^1, \ldots, q^i\}$ forms an orthonormal basis for $\mathcal{Q}_i$. Thus, for any $x \in \mathcal{Q}_i$, we can write
\[ x = \sum_{k=1}^i \left( x^T q^k \right) q^k \quad \text{and} \quad \|x\|_2 = \sqrt{\sum_{k=1}^i \left( x^T q^k \right)^2} \tag{123} \]
where $\left( x^T q^1, \ldots, x^T q^i \right)$ are the unique coordinates of $x$ in $\mathcal{Q}_i$ with respect to the orthonormal basis $\{q^1, \ldots, q^i\}$. Since we will repeatedly use such coordinates in our analysis, for the sake of notational brevity we let $x_{q^i} = x^T q^i$ denote the coordinate of any vector $x \in \mathbb{R}^l$ in the direction of the basis vector $q^i$, for any $i = 1, \ldots, l$.

Since we intend to show by induction that (115) is satisfied, we start with a result involving only the first two vectors $y^1$ and $y^2$ from our set $\{y^1, \ldots, y^l\}$.
Lemma 3.14. Consider the vectors $y^1, y^2$ from the set $\{y^1, \ldots, y^l\} \subset \mathbb{R}^l$ that satisfies (110) and (111). We have
\[ \min_{\substack{x \in \mathcal{Q}_2 \\ \|x\|_2 = 1}} \left( y^{1T} x \right)^2 + \left( y^{2T} x \right)^2 \ge S_2(\mu) \tag{124} \]
where $S_2(\mu)$ is the second term in the sequence $\{S_j(\mu)\}_{j \in \mathbb{N}}$ defined in (116). Consequently,
\[ \left( y^{1T} x \right)^2 + \left( y^{2T} x \right)^2 \ge S_2(\mu) \|x\|_2^2 \]
for all $x \in \mathcal{Q}_2$.
In order to prove Lemma 3.14, we will need the following lemma regarding a related one dimensional optimization problem. In what follows, and for the rest of this article, we assume that for any non-negative real number $\alpha$, $\sqrt{\alpha}$ denotes the non-negative square root of $\alpha$. In particular, for any $\alpha \in \mathbb{R}$, $\sqrt{\alpha^2} = |\alpha|$.

Lemma 3.15. Let $s, a \in (0, 1]$. Consider the function $g : [0, 1] \to \mathbb{R}$ defined as
\[ g(t) := \left( a t - \sqrt{1 - a^2} \sqrt{1 - t^2} \right)^2 + s \left( 1 - t^2 \right) \tag{125} \]
Then,
\[ \min_{t \in [0, 1]} g(t) = \left( \frac{1 + s}{2} \right) \left( 1 - \sqrt{1 - \frac{4 s a^2}{(1 + s)^2}} \right) \tag{126} \]
Consequently, for any $\mu \in (0, a]$, we get
\[ \min_{t \in [0, 1]} g(t) \ge \left( \frac{1 + s}{2} \right) \left( 1 - \sqrt{1 - \frac{4 s \mu^2}{(1 + s)^2}} \right) \tag{127} \]
Proof. First, we note that for any $s, a \in \mathbb{R}$,
\[ (1 + s)^2 - 4 s a^2 = (1 + s - 2a^2)^2 + 4 a^2 (1 - a^2) \tag{128} \]
Consequently, for any $a \in (0, 1]$, we get that $(1 + s)^2 - 4 s a^2 \ge 0$. Therefore, the minimum objective value given on the right side of (126) is a real number and hence the statement of Lemma 3.15 is well defined.

Now, we break the assertion in (126) into two cases.

1. First, let us consider the case $a = 1$. Then, it can be seen from (125) that $g(t) = s + (1 - s) t^2$. Thus, since $1 - s \ge 0$, obviously $\min_{t \in [0,1]} g(t) = s$. But now, using the fact that $s \in (0, 1]$, we get
\[ \left( \frac{1 + s}{2} \right) \left( 1 - \sqrt{1 - \frac{4s}{(1 + s)^2}} \right) = \left( \frac{1 + s}{2} \right) \left( 1 - \left| \frac{1 - s}{1 + s} \right| \right) = \left( \frac{1 + s}{2} \right) \left( 1 - \frac{1 - s}{1 + s} \right) = s \]
Therefore, (126) holds for $a = 1$.

2. Next, we assume that $a \in (0, 1)$. In this case, note that we get from (128) that
\[ (1 + s)^2 - 4 s a^2 > 0 \tag{129} \]
Now, the function $g$ can be rearranged and written as
\begin{align*}
g(t) &= \left( a t - \sqrt{1 - a^2} \sqrt{1 - t^2} \right)^2 + s \left( 1 - t^2 \right) \\
&= a^2 t^2 + (1 - a^2)(1 - t^2) - 2 \left( a \sqrt{1 - a^2} \right) \left( t \sqrt{1 - t^2} \right) + s (1 - t^2) \\
&= (2a^2 - s - 1) t^2 - 2 \left( a \sqrt{1 - a^2} \right) \left( t \sqrt{1 - t^2} \right) + (1 + s - a^2)
\end{align*}
It is easily seen that $g$ is differentiable on $(0, 1)$. Accordingly, let us find the values of $t$ in the interval $(0, 1)$ that satisfy $\frac{dg}{dt}(t) = 0$. We have
\begin{align*}
\frac{dg}{dt}(t) &= 2 (2a^2 - s - 1) t - 2 \left( a \sqrt{1 - a^2} \right) \sqrt{1 - t^2} + \frac{2 \left( a \sqrt{1 - a^2} \right) t^2}{\sqrt{1 - t^2}} \\
&= 2 \left[ \frac{(2a^2 - s - 1) \left( t \sqrt{1 - t^2} \right) - \left( a \sqrt{1 - a^2} \right) (1 - 2t^2)}{\sqrt{1 - t^2}} \right]
\end{align*}
Thus, in order to find first order critical points of $g$ in $(0, 1)$, we must find the solutions in $(0, 1)$ of the equation
\[ (2a^2 - s - 1) \left( t \sqrt{1 - t^2} \right) = \left( a \sqrt{1 - a^2} \right) (1 - 2t^2) \tag{130} \]
Squaring both sides to remove the square roots and rearranging, we get the following quadratic equation in $t^2$:
\[ \left( (s + 1)^2 - 4 s a^2 \right) t^4 - \left( (s + 1)^2 - 4 s a^2 \right) t^2 + a^2 (1 - a^2) = 0 \tag{131} \]
Let us set $A = (s + 1)^2 - 4 s a^2$, $B = -A$ and $C = a^2 (1 - a^2)$. From (129), we get that $A > 0$, $B < 0$ and $C > 0$. Then the quadratic equation in (131) is $A t^4 + B t^2 + C = 0$. Now, we get
\[ B^2 - 4AC = \left( (s + 1)^2 - 4 s a^2 \right)^2 - 4 \left( (s + 1)^2 - 4 s a^2 \right) a^2 (1 - a^2) = \left( (s + 1)^2 - 4 s a^2 \right) \left( s + 1 - 2a^2 \right)^2 \ge 0 \]
Therefore, we get two real-valued solutions that satisfy (131), given by
\[ t^2 = \frac{-B \pm \sqrt{B^2 - 4AC}}{2A} = \begin{cases} \left( \dfrac{1}{2} \right) \left\{ 1 + \dfrac{s + 1 - 2a^2}{\sqrt{(s + 1)^2 - 4 s a^2}} \right\} \\[3mm] \left( \dfrac{1}{2} \right) \left\{ 1 - \dfrac{s + 1 - 2a^2}{\sqrt{(s + 1)^2 - 4 s a^2}} \right\} \end{cases} \]
Now, since we have assumed that $a \in (0, 1)$, we get from (128) that
\[ \left| \frac{s + 1 - 2a^2}{\sqrt{(s + 1)^2 - 4 s a^2}} \right| < 1 \tag{132} \]
Thus, we get the following four potential real-valued solutions of (130):
\[ t_1^* = \sqrt{\left( \tfrac{1}{2} \right) \left\{ 1 + \tfrac{s + 1 - 2a^2}{\sqrt{(s + 1)^2 - 4 s a^2}} \right\}}, \quad t_2^* = \sqrt{\left( \tfrac{1}{2} \right) \left\{ 1 - \tfrac{s + 1 - 2a^2}{\sqrt{(s + 1)^2 - 4 s a^2}} \right\}}, \quad t_3^* = -t_1^* \quad \text{and} \quad t_4^* = -t_2^* \]
Since we are interested only in solutions in the interval $(0, 1)$, we consider only the two solutions $t_1^*$ and $t_2^*$. From (132), it is easily seen that $t_1^*, t_2^* \in (0, 1)$.

Now, substituting $t_1^*$ and $t_2^*$ back into (130), it is easily seen that only $t_1^*$ satisfies (130). Therefore, there is exactly one point $t_1^* \in (0, 1)$ such that $\frac{dg}{dt}(t_1^*) = 0$. Consequently, we get that
\[ \min_{t \in [0, 1]} g(t) = \min\{ g(0),\; g(1),\; g(t_1^*) \} \]
It is easy to verify that
\[ g(0) = 1 + s - a^2, \quad g(1) = a^2 \quad \text{and} \quad g(t_1^*) = \left( \frac{1 + s}{2} \right) \left( 1 - \sqrt{1 - \frac{4 s a^2}{(1 + s)^2}} \right) \]
Further, using (132) and the triangle inequality, we get
\[ g(1) - g(t_1^*) = \left( \frac{1}{2} \right) \left( \sqrt{(s + 1)^2 - 4 s a^2} + (2a^2 - 1 - s) \right) \ge \left( \frac{1}{2} \right) \left( \sqrt{(s + 1)^2 - 4 s a^2} - \left| 2a^2 - 1 - s \right| \right) \ge 0 \]
Similarly,
\[ g(0) - g(t_1^*) = \left( \frac{1}{2} \right) \left( \sqrt{(s + 1)^2 - 4 s a^2} + (1 + s - 2a^2) \right) \ge \left( \frac{1}{2} \right) \left( \sqrt{(s + 1)^2 - 4 s a^2} - \left| 1 + s - 2a^2 \right| \right) \ge 0 \]
Therefore, (126) holds for the case $a \in (0, 1)$ as well.

Thus, from the two cases considered above, we get that (126) holds for any $s, a \in (0, 1]$.

Now, for any $\mu \in (0, a]$, it is easily seen that
\[ \left( \frac{1 + s}{2} \right) \left( 1 - \sqrt{1 - \frac{4 s a^2}{(1 + s)^2}} \right) \ge \left( \frac{1 + s}{2} \right) \left( 1 - \sqrt{1 - \frac{4 s \mu^2}{(1 + s)^2}} \right) \]
Therefore, (127) holds for all $\mu \in (0, a]$.
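The closed form (126) can be confirmed against a brute-force grid minimization of $g$ over $[0, 1]$. A sketch, not from the paper; the grid resolution and the $(a, s)$ test pairs are arbitrary:

```python
# Compare the analytic minimum (126) of g(t) from (125) with a dense
# grid search over t in [0, 1].
import numpy as np

def g(t, a, s):
    return (a * t - np.sqrt(1 - a**2) * np.sqrt(1 - t**2))**2 + s * (1 - t**2)

def g_min(a, s):
    """Closed-form minimum from (126)."""
    return ((1 + s) / 2) * (1 - np.sqrt(1 - 4 * s * a**2 / (1 + s)**2))

t = np.linspace(0.0, 1.0, 200001)
for a, s in [(0.3, 0.7), (0.9, 0.2), (1.0, 0.5)]:
    assert abs(g(t, a, s).min() - g_min(a, s)) < 1e-6
```

Note the $a = 1$ pair reproduces case 1 of the proof, where the minimum is simply $s$.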
Proof of Lemma 3.14:
Using the notation developed earlier, we write the vectors $y^1 \in \mathcal{Q}_1$ and $y^2 \in \mathcal{Q}_2$ in terms of the orthonormal vectors $q^1$ and $q^2$ as follows. First, since $y^1 \in \mathcal{Q}_1$ and $q^1 = y^1 / \|y^1\|_2$ (from (121)) is a basis for $\mathcal{Q}_1$, we write $y^1 = y^1_{q^1} q^1$ where $y^1_{q^1} = \|y^1\|_2 = 1$ (using (110)). Similarly, we know from Property P 3.12 that $\{q^1, q^2\}$ forms an orthonormal basis for $\mathcal{Q}_2$. Therefore, since $y^2 \in \mathcal{Q}_2$, we write
\[ y^2 = y^2_{q^1} q^1 + y^2_{q^2} q^2 \]
Here, we get using Property P 3.13 and (111) that
\[ y^2_{q^2} = y^{2T} q^2 = \left\| y^2 - \left( y^{2T} q^1 \right) q^1 \right\|_2 \ge \mu \tag{133} \]
Now, from Property P 3.12 we know that $\mathcal{Q}_2 = \{a_1 q^1 + a_2 q^2 : a_1, a_2 \in \mathbb{R}\}$. Therefore, we get that
\begin{align*}
\left\{ x \in \mathcal{Q}_2 : \|x\|_2 = 1 \right\} &= \left\{ a_1 q^1 + a_2 q^2 : -1 \le a_1, a_2 \le 1 \text{ and } (a_1)^2 + (a_2)^2 = 1 \right\} \\
&= \left\{ a_1 q^1 + a_2 q^2 : a_2 \in [-1, 1] \text{ and } a_1 = \pm\sqrt{1 - (a_2)^2} \right\} \\
&= \left\{ a_1 q^1 + a_2 q^2 : |a_2| \le 1 \text{ and } |a_1| = \sqrt{1 - |a_2|^2} \right\} \tag{134}
\end{align*}
Further, for any $a_1, a_2 \in \mathbb{R}$ such that $x = a_1 q^1 + a_2 q^2 \in \mathcal{Q}_2$, we have
\[ y^{1T} x = y^1_{q^1} a_1 = a_1 \quad \text{and} \quad y^{2T} x = y^2_{q^1} a_1 + y^2_{q^2} a_2 \]
Therefore,
\[ \min_{\substack{x \in \mathcal{Q}_2 \\ \|x\|_2 = 1}} \left( y^{1T} x \right)^2 + \left( y^{2T} x \right)^2 = \min_{\substack{|a_2| \le 1 \\ |a_1| = \sqrt{1 - |a_2|^2}}} (a_1)^2 + \left( y^2_{q^1} a_1 + y^2_{q^2} a_2 \right)^2 \]
Using the triangle inequality for real numbers, we know that
\[ \left( y^2_{q^1} a_1 + y^2_{q^2} a_2 \right)^2 \ge \left( |y^2_{q^2}| |a_2| - |y^2_{q^1}| |a_1| \right)^2 \]
Therefore,
\[ \min_{\substack{|a_2| \le 1 \\ |a_1| = \sqrt{1 - |a_2|^2}}} (a_1)^2 + \left( y^2_{q^1} a_1 + y^2_{q^2} a_2 \right)^2 \ge \min_{\substack{|a_2| \le 1 \\ |a_1| = \sqrt{1 - |a_2|^2}}} (|a_1|)^2 + \left( |y^2_{q^2}| |a_2| - |y^2_{q^1}| |a_1| \right)^2 \tag{135} \]
Let us simplify the optimization problem on the right side of (135). First, we know from (110) that $\|y^2\|_2 = 1$. Therefore, $(y^2_{q^1})^2 + (y^2_{q^2})^2 = 1$, which gives us
\[ |y^2_{q^1}| = \sqrt{1 - |y^2_{q^2}|^2} \tag{136} \]
Similarly, the constraint set in (134) contains the equality constraint $|a_1| = \sqrt{1 - |a_2|^2}$. We can use this to eliminate the variable $a_1$ and get
\begin{align*}
\min_{\substack{|a_2| \le 1 \\ |a_1| = \sqrt{1 - |a_2|^2}}} (|a_1|)^2 + \left( |y^2_{q^2}| |a_2| - |y^2_{q^1}| |a_1| \right)^2 &= \min_{0 \le |a_2| \le 1} \left( 1 - (|a_2|)^2 \right) + \left( |y^2_{q^2}| |a_2| - \sqrt{1 - |y^2_{q^2}|^2} \sqrt{1 - |a_2|^2} \right)^2 \\
&= \min_{t \in [0, 1]} \left( 1 - t^2 \right) + \left( |y^2_{q^2}| t - \sqrt{1 - |y^2_{q^2}|^2} \sqrt{1 - t^2} \right)^2 \tag{137}
\end{align*}
We already showed in (133) that $y^2_{q^2} \ge \mu$. Further, from (110) we know that $\|y^2\|_2 = 1$; therefore, it is clear that $y^2_{q^2} \le \|y^2\|_2 = 1$.

Thus, finally applying Lemma 3.15 with $s = 1$, $a = |y^2_{q^2}|$ and $\mu \in (0, a]$ to the optimization problem on the right side of (137), we get
\[ \min_{\substack{x \in \mathcal{Q}_2 \\ \|x\|_2 = 1}} \left( y^{1T} x \right)^2 + \left( y^{2T} x \right)^2 \ge \min_{t \in [0, 1]} \left( 1 - t^2 \right) + \left( |y^2_{q^2}| t - \sqrt{1 - |y^2_{q^2}|^2} \sqrt{1 - t^2} \right)^2 \ge 1 - \sqrt{1 - \mu^2} = S_2(\mu) \]
Consequently, for any $x \in \mathcal{Q}_2$, we can write
\[ \left( y^{1T} x \right)^2 + \left( y^{2T} x \right)^2 \ge \left\{ \left( \frac{y^{1T} x}{\|x\|_2} \right)^2 + \left( \frac{y^{2T} x}{\|x\|_2} \right)^2 \right\} \|x\|_2^2 \ge S_2(\mu) \|x\|_2^2 \]
$\blacksquare$
Next, we consider a result involving the vectors $\{y^1, \ldots, y^i, y^{i+1}\}$ for some $2 \le i \le l - 1$.

Lemma 3.16. Consider the vectors $\{y^1, \ldots, y^l\}$ satisfying (110) and (111). For some fixed $i \in \{2, \ldots, l-1\}$, suppose that
\[ \sum_{k=1}^i \left( y^{kT} x \right)^2 \ge S_i(\mu) \|x\|_2^2 \quad \text{for all } x \in \mathcal{Q}_i \tag{138} \]
where $S_i(\mu)$ is the $i$th term in the sequence $\{S_j(\mu)\}_{j \in \mathbb{N}}$ defined in (116). Then,
\[ \min_{\substack{x \in \mathcal{Q}_{i+1} \\ \|x\|_2 = 1}} \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 \ge S_{i+1}(\mu) \tag{139} \]
Consequently,
\[ \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 \ge S_{i+1}(\mu) \|x\|_2^2 \quad \text{for all } x \in \mathcal{Q}_{i+1} \]

In order to prove Lemma 3.16, we will need the following result.
Lemma 3.17. Consider a subspace $\mathcal{M}$ of $\mathbb{R}^l$ with $\dim(\mathcal{M}) \ge 2$. Further, let $y \in \mathcal{M}$ and $a, b > 0$. Then,
\[ \min_{\substack{x \in \mathcal{M} \\ \|x\|_2 = b}} \left( a - |y^T x| \right)^2 = \begin{cases} 0 & \text{if } a \le \|y\|_2\, b \\ \left( a - \|y\|_2\, b \right)^2 & \text{if } a > \|y\|_2\, b \end{cases} \]

Proof. First, we note that since $\dim(\mathcal{M}) \ge 2$, the set $S := \{x \in \mathcal{M} : \|x\|_2 = b\}$ is a compact and connected set in $\mathcal{M}$. Also, the function $x \mapsto y^T x$ is continuous on $\mathcal{M}$. Therefore, the set $\{y^T x : x \in S\}$ is a connected and compact subset of $\mathbb{R}$, i.e., a closed and bounded interval of $\mathbb{R}$.

Now, from the Cauchy-Schwarz inequality, we know that for any $x \in S$,
\[ -\|y\|_2\, b = -\|y\|_2 \|x\|_2 \le y^T x \le \|y\|_2 \|x\|_2 = \|y\|_2\, b \]
Therefore,
\[ \left\{ y^T x : x \in S \right\} \subset \left[ -\|y\|_2\, b,\; \|y\|_2\, b \right] \tag{140} \]
Next, consider $x^* = b\, (y / \|y\|_2) \in \mathcal{M}$. Clearly, $\|x^*\|_2 = b$ and hence $x^* \in S$. Also, it is clear that $y^T x^* = \|y\|_2\, b$. Similarly, $-x^* \in S$ and $y^T(-x^*) = -\|y\|_2\, b$. Therefore, since $\{y^T x : x \in S\}$ is an interval, we get that
\[ \left\{ y^T x : x \in S \right\} \supset \left[ -\|y\|_2\, b,\; \|y\|_2\, b \right] \tag{141} \]
Therefore, we have from (140) and (141) that
\[ \left\{ y^T x : x \in S \right\} = \left[ -\|y\|_2\, b,\; \|y\|_2\, b \right] \]
and hence
\[ \left\{ |y^T x| : x \in S \right\} = \left[ 0,\; \|y\|_2\, b \right] \tag{142} \]
Clearly, $\left( a - |y^T x| \right)^2 \ge 0$ for all $x \in S$. Now, if $a \le \|y\|_2\, b$, then we know from (142) that there exists $x^* \in S$ such that $|y^T x^*| = a$, i.e., such that $\left( a - |y^T x^*| \right)^2 = 0$. Therefore, if $a \le \|y\|_2\, b$, then
\[ \min_{\substack{x \in \mathcal{M} \\ \|x\|_2 = b}} \left( a - |y^T x| \right)^2 = 0 \]
If $a - \|y\|_2\, b > 0$, then using (142), we get
\[ \left\{ a - |y^T x| : x \in S \right\} = \left[ a - \|y\|_2\, b,\; a \right] \quad \Longrightarrow \quad \left\{ \left( a - |y^T x| \right)^2 : x \in S \right\} = \left[ \left( a - \|y\|_2\, b \right)^2,\; a^2 \right] \]
Therefore, if $a - \|y\|_2\, b > 0$, we get
\[ \min_{\substack{x \in \mathcal{M} \\ \|x\|_2 = b}} \left( a - |y^T x| \right)^2 = \left( a - \|y\|_2\, b \right)^2 \]
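Lemma 3.17 can be checked by densely sampling the sphere $\{x \in \mathcal{M} : \|x\|_2 = b\}$ on a 2-dimensional subspace $\mathcal{M} \subset \mathbb{R}^4$. A sketch, not from the paper; the subspace, $b$, and the two test values of $a$ are arbitrary choices covering the two branches of the lemma:

```python
# Check of Lemma 3.17: min over {x in M : |x| = b} of (a - |y^T x|)^2
# is 0 when a <= |y| b, and (a - |y| b)^2 otherwise.
import numpy as np

rng = np.random.default_rng(4)
B = np.linalg.qr(rng.standard_normal((4, 2)))[0]   # orthonormal basis of M
y = B @ rng.standard_normal(2)                     # some y in M
b = 0.8

# Parametrize the circle {x in M : |x| = b} densely.
theta = np.linspace(0, 2 * np.pi, 100000)
X = b * (np.outer(np.cos(theta), B[:, 0]) + np.outer(np.sin(theta), B[:, 1]))

ny = np.linalg.norm(y)
for a in (0.5 * ny * b, 2.0 * ny * b):             # one value per branch
    h_min = ((a - np.abs(X @ y)) ** 2).min()
    expected = 0.0 if a <= ny * b else (a - ny * b) ** 2
    assert abs(h_min - expected) < 1e-6
```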
Proof of Lemma 3.16:
We know from Property P 3.13 and the notation developed earlier that any $x \in \mathcal{Q}_{i+1}$ can be written as
\[ x = \Pi(x, \mathcal{Q}_i) + (x^T q^{i+1}) q^{i+1} = \Pi(x, \mathcal{Q}_i) + x_{q^{i+1}} q^{i+1} \]
Conversely, for any $\bar{x} \in \mathcal{Q}_i \subset \mathcal{Q}_{i+1}$ and $a_{i+1} \in \mathbb{R}$, it is obvious that $\bar{x} + a_{i+1} q^{i+1} \in \mathcal{Q}_{i+1}$. Therefore, it is clear that
\[ \mathcal{Q}_{i+1} = \left\{ \bar{x} + a_{i+1} q^{i+1} : \bar{x} \in \mathcal{Q}_i \text{ and } a_{i+1} \in \mathbb{R} \right\} \tag{143} \]
Further, we know from Property P 3.11 that $q^{i+1} \perp \mathcal{Q}_i$. Consequently, for any $\bar{x} \in \mathcal{Q}_i$ and $a_{i+1} \in \mathbb{R}$, we get that
\[ \left\| \bar{x} + a_{i+1} q^{i+1} \right\|_2 = \sqrt{\|\bar{x}\|_2^2 + (a_{i+1})^2} \tag{144} \]
In particular, $y^{i+1} \in \mathcal{Q}_{i+1}$ can be written as
\[ y^{i+1} = \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) + y^{i+1}_{q^{i+1}} q^{i+1} \tag{145} \]
Since $\|y^{i+1}\|_2 = 1$ (from (110)), we get using (144) that
\[ \left\| \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right\|_2 = \sqrt{1 - (y^{i+1}_{q^{i+1}})^2} \tag{146} \]
Further, using the fact that $\|y^{i+1}\|_2 = 1$ and Property P 3.13 along with (111) for $y^{i+1}$, we know that
\[ 1 = \left\| y^{i+1} \right\|_2 \ge y^{i+1}_{q^{i+1}} = (y^{i+1})^T q^{i+1} = \left\| \Pi\!\left( y^{i+1}, (\mathcal{Q}_i)^\perp \right) \right\|_2 \ge \mu \tag{147} \]
Also, using (143) and (144), the constraint set in the optimization problem on the left side of (139) can be written as follows:
\begin{align*}
\left\{ x \in \mathcal{Q}_{i+1} : \|x\|_2 = 1 \right\} &= \left\{ \bar{x} + a_{i+1} q^{i+1} : a_{i+1} \in [-1, 1],\ \bar{x} \in \mathcal{Q}_i \text{ and } \|\bar{x}\|_2 = \sqrt{1 - (a_{i+1})^2} \right\} \\
&= \left\{ \bar{x} + a_{i+1} q^{i+1} : |a_{i+1}| \le 1,\ \bar{x} \in \mathcal{Q}_i \text{ and } \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2} \right\}
\end{align*}
Now, for any $\bar{x} \in \mathcal{Q}_i$ and $a_{i+1} \in \mathbb{R}$, from (145) and the fact that $q^{i+1} \perp \mathcal{Q}_i$, we get that $x = \bar{x} + a_{i+1} q^{i+1} \in \mathcal{Q}_{i+1}$ satisfies
\[ y^{kT} x = y^{kT} \bar{x} \text{ for } k = 1, \ldots, i \quad \text{and} \quad y^{(i+1)T} x = \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} + y^{i+1}_{q^{i+1}} a_{i+1} \]
Therefore, we can now write the optimization problem in (139) as
\[ \min_{\substack{x \in \mathcal{Q}_{i+1} \\ \|x\|_2 = 1}} \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 = \min_{\substack{|a_{i+1}| \le 1 \\ \bar{x} \in \mathcal{Q}_i,\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} \left\{ \sum_{k=1}^{i} \left( y^{kT} \bar{x} \right)^2 \right\} + \left( \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} + y^{i+1}_{q^{i+1}} a_{i+1} \right)^2 \]
We know from (138) that for any $\bar{x} \in \mathcal{Q}_i$,
\[ \sum_{k=1}^{i} \left( y^{kT} \bar{x} \right)^2 \ge S_i(\mu) \|\bar{x}\|_2^2 \]
Also, any feasible $\bar{x} + a_{i+1} q^{i+1}$ satisfies $\|\bar{x}\|_2^2 = 1 - |a_{i+1}|^2$. Using these two observations along with the triangle inequality for real numbers, we get
\begin{align*}
\min_{\substack{x \in \mathcal{Q}_{i+1} \\ \|x\|_2 = 1}} \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 &\ge \min_{\substack{|a_{i+1}| \le 1 \\ \bar{x} \in \mathcal{Q}_i,\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} S_i(\mu) \|\bar{x}\|_2^2 + \left( \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} + y^{i+1}_{q^{i+1}} a_{i+1} \right)^2 \\
&= \min_{\substack{|a_{i+1}| \le 1 \\ \bar{x} \in \mathcal{Q}_i,\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) + \left( \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} + y^{i+1}_{q^{i+1}} a_{i+1} \right)^2 \\
&\ge \min_{\substack{|a_{i+1}| \le 1 \\ \bar{x} \in \mathcal{Q}_i,\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) + \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \left| \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} \right| \right)^2 \tag{148}
\end{align*}
Now, let us consider the optimization problem on the right side of (148). We can rewrite it as
\[ \min_{|a_{i+1}| \le 1} \left[ S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) + \min_{\substack{\bar{x} \in \mathcal{Q}_i \\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \left| \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} \right| \right)^2 \right] \]
For any fixed $a_{i+1} \in [-1, 1]$, we get by applying Lemma 3.17 and using (146) that
\[ \min_{\substack{\bar{x} \in \mathcal{Q}_i \\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \left| \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} \right| \right)^2 = \begin{cases} 0 & \text{if } |y^{i+1}_{q^{i+1}}| |a_{i+1}| \le \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2} \sqrt{1 - |a_{i+1}|^2} \\ \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2} \sqrt{1 - |a_{i+1}|^2} \right)^2 & \text{otherwise} \end{cases} \]
It is easy to verify that $|y^{i+1}_{q^{i+1}}| |a_{i+1}| \le \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2} \sqrt{1 - |a_{i+1}|^2}$ if and only if $|a_{i+1}| \le \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2}$.

Therefore, we finally get
\[ \min_{\substack{|a_{i+1}| \le 1 \\ \bar{x} \in \mathcal{Q}_i,\ \|\bar{x}\|_2 = \sqrt{1 - |a_{i+1}|^2}}} S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) + \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \left| \left( \Pi\!\left( y^{i+1}, \mathcal{Q}_i \right) \right)^T \bar{x} \right| \right)^2 = \min\{h_1^*, h_2^*\} \]
where
\begin{align*}
h_1^* &:= \min_{|a_{i+1}| \le \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2}} S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) \\
h_2^* &:= \min_{|a_{i+1}| \in \left[ \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2},\; 1 \right]} S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) + \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2} \sqrt{1 - |a_{i+1}|^2} \right)^2
\end{align*}
Using (147), it is easily seen that
\[ h_1^* = S_i(\mu) |y^{i+1}_{q^{i+1}}|^2 \ge S_i(\mu) \mu^2 \]
Also, using Lemma 3.15 with $s = S_i(\mu)$ (note that $S_i(\mu) \in (0, 1]$ by Lemma 3.13) and $a = |y^{i+1}_{q^{i+1}}| \in [\mu, 1]$, we get
\begin{align*}
h_2^* &\ge \min_{|a_{i+1}| \in [0, 1]} S_i(\mu) \left( 1 - |a_{i+1}|^2 \right) + \left( |y^{i+1}_{q^{i+1}}| |a_{i+1}| - \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2} \sqrt{1 - |a_{i+1}|^2} \right)^2 \\
&= \min_{t \in [0, 1]} S_i(\mu) \left( 1 - t^2 \right) + \left( |y^{i+1}_{q^{i+1}}| t - \sqrt{1 - |y^{i+1}_{q^{i+1}}|^2} \sqrt{1 - t^2} \right)^2 \\
&\ge \left( \frac{1 + S_i(\mu)}{2} \right) \left[ 1 - \sqrt{1 - \frac{4 S_i(\mu) \mu^2}{(S_i(\mu) + 1)^2}} \right] = S_{i+1}(\mu)
\end{align*}
Thus, using (148), we get that
\[ \min_{\substack{x \in \mathcal{Q}_{i+1} \\ \|x\|_2 = 1}} \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 \ge \min\{h_1^*, h_2^*\} \ge \min\{S_i(\mu)\mu^2,\; S_{i+1}(\mu)\} \]
Finally, from Lemma 3.13, we know that $S_{i+1}(\mu) \le S_i(\mu) \mu^2$. Therefore,
\[ \min_{\substack{x \in \mathcal{Q}_{i+1} \\ \|x\|_2 = 1}} \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 \ge S_{i+1}(\mu) \]
Consequently, for any $x \in \mathcal{Q}_{i+1}$, we have
\[ \sum_{k=1}^{i+1} \left( y^{kT} x \right)^2 = \left( \sum_{k=1}^{i+1} \left( \frac{y^{kT} x}{\|x\|_2} \right)^2 \right) \|x\|_2^2 \ge S_{i+1}(\mu) \|x\|_2^2 \]
$\blacksquare$
Now, using Lemmas 3.14 and 3.16, we can show that if $\{y^1, \ldots, y^l\}$ satisfy (110) and (111), then (115) holds.

Theorem 3.18. Consider $l$ vectors $\{y^1, \ldots, y^l\} \subset \mathbb{R}^l$ that satisfy (110) and (111). Then (115) holds.

Proof. Since we have assumed that $\{y^1, \ldots, y^l\} \subset \mathbb{R}^l$ satisfy (110) and (111), the requirements of Lemma 3.14 are satisfied and consequently (124) holds. Therefore, (138), required in Lemma 3.16, holds for $i = 2$. Repeatedly applying Lemma 3.16 for $i = 2, \ldots, l - 1$, we finally get that (139) holds for $i = l - 1$. That is,
\[ \min_{\substack{x \in \mathcal{Q}_l \\ \|x\|_2 = 1}} \sum_{k=1}^l \left( y^{kT} x \right)^2 \ge S_l(\mu) \]
However, from Lemma 3.12, we know that $\{y^1, \ldots, y^l\}$ are $l$ linearly independent vectors in $\mathbb{R}^l$. Therefore, $\mathcal{Q}_l = \mathrm{span}\{y^1, \ldots, y^l\} = \mathbb{R}^l$, and hence (115) holds.
Now, having shown our main result regarding the $l$ vectors $\{y^1,\dots,y^l\} \subset \mathbb{R}^l$ that satisfy (110) and (111), let us turn next to the properties of matrices whose rows satisfy a relaxed form of these conditions.
Theorem 3.19. Consider $m$ row vectors $\{y^1,\dots,y^m\} \subset \mathbb{R}^l$ of the matrix $Y \in \mathbb{R}^{m\times l}$ defined in (94). As before, let $\mathcal{Q}_i := \mathrm{span}\{y^1,\dots,y^i\}$ for $i=1,\dots,l$. Now, given $m \ge l$ and constants $\mu \in (0,1]$ and $\kappa > 0$, suppose that for each $i=1,\dots,l$,
$$\|y^i\|_2 \ge \kappa \qquad (149)$$
and, for each $i=2,\dots,l$,
$$\frac{\bigl\|\Pi\bigl(y^i,(\mathcal{Q}_{i-1})^\perp\bigr)\bigr\|_2}{\|y^i\|_2} \ge \mu \qquad (150)$$
Then we have
$$\lambda_{\min}(Y^TY) \ge \kappa^2 S_l(\mu) > 0 \qquad (151)$$
where $S_l(\mu)$ is the $l$th term of the sequence $\{S_j(\mu)\}_{j\in\mathbb{N}}$ defined in (116).
Proof. Using (96) and (149), we get
$$\begin{aligned}
\lambda_{\min}(Y^TY) &= \min_{\substack{x\in\mathbb{R}^l \\ \|x\|_2=1}} \sum_{i=1}^{m}\bigl(y^{iT}x\bigr)^2 \ \ge\ \min_{\substack{x\in\mathbb{R}^l \\ \|x\|_2=1}} \sum_{i=1}^{l}\bigl(y^{iT}x\bigr)^2 \\
&= \min_{\substack{x\in\mathbb{R}^l \\ \|x\|_2=1}} \sum_{i=1}^{l} \|y^i\|_2^2 \Bigl[\Bigl(\frac{y^i}{\|y^i\|_2}\Bigr)^T x\Bigr]^2 \ \ge\ \kappa^2 \left\{ \min_{\substack{x\in\mathbb{R}^l \\ \|x\|_2=1}} \sum_{i=1}^{l} \Bigl[\Bigl(\frac{y^i}{\|y^i\|_2}\Bigr)^T x\Bigr]^2 \right\} \qquad (152)
\end{aligned}$$
Let us define
$$\bar{y}^i = \frac{y^i}{\|y^i\|_2} \quad \text{for } i=1,\dots,l$$
Then it is clear that, for each $i=1,\dots,l$,
$$\mathcal{Q}_i = \mathrm{span}\{y^1,\dots,y^i\} = \mathrm{span}\{\bar{y}^1,\dots,\bar{y}^i\}$$
Further, it is also easily seen that
$$\|\bar{y}^i\|_2 = 1 \quad \text{for } i=1,\dots,l \qquad (153)$$
Using Property P 3.7 and (150), we get
$$\bigl\|\Pi\bigl(\bar{y}^i,(\mathcal{Q}_{i-1})^\perp\bigr)\bigr\|_2 = \Bigl\|\Pi\Bigl(\frac{y^i}{\|y^i\|_2},(\mathcal{Q}_{i-1})^\perp\Bigr)\Bigr\|_2 = \frac{\bigl\|\Pi\bigl(y^i,(\mathcal{Q}_{i-1})^\perp\bigr)\bigr\|_2}{\|y^i\|_2} \ge \mu \quad \text{for } i=2,\dots,l \qquad (154)$$
Since $0 < \mu \le 1$, we see from (153) and (154) that the vectors $\{\bar{y}^1,\dots,\bar{y}^l\} \subset \mathbb{R}^l$ satisfy the requirements of Theorem 3.18. Therefore, we get that
$$\min_{\substack{x\in\mathbb{R}^l \\ \|x\|_2=1}} \sum_{i=1}^{l} \bigl((\bar{y}^i)^T x\bigr)^2 \ge S_l(\mu)$$
where $S_l(\mu)$ is defined as in (116). Finally, we use this in (152), along with Lemma 3.13, to get that
$$\lambda_{\min}(Y^TY) \ge \kappa^2 S_l(\mu) > 0 \qquad (155)$$
$\square$
Note: The requirement that the first $l$ rows of the matrix $Y$ satisfy (149) and (150) was imposed only for notational convenience. Indeed, the bound on the condition number of $Y^TY$ can be shown to hold as long as there exists some sequence of $l$ rows of $Y$ that satisfies (149) and (150). This can be seen from (95) and (96), which show that the maximum and minimum eigenvalues of $Y^TY$ are invariant with respect to changes in the ordering of the rows of $Y$.
Thus, we have established a relationship between the relative positioning of the rows of the matrix $Y \in \mathbb{R}^{m\times l}$ in $\mathbb{R}^l$ and the condition number of $Y^TY$. The intuition behind the result in Theorem 3.19 is as follows. As we noted immediately after the proof of Lemma 3.11, the minimum eigenvalue of $Y^TY$ is small whenever there exists a subspace $\mathcal{M} \subset \mathbb{R}^l$ of dimension $l-1$ such that all the row vectors $\{y^1,\dots,y^m\}$ are close to $\mathcal{M}$. Consequently, in order to ensure that $\lambda_{\min}(Y^TY)$ is bounded away from zero, we must ensure that there exists no subspace $\mathcal{M}$ such that all the row vectors of $Y$ are close to $\mathcal{M}$, i.e., that there exist sufficiently many row vectors of $Y$ that are "well spread out" in $\mathbb{R}^l$.
Our notion of the row vectors being well spread out in $\mathbb{R}^l$ is given by conditions (149) and (150). In particular, (149) ensures that there exist $l$ row vectors $\{y^1,\dots,y^l\}$ that lie at least a minimum distance of $\kappa > 0$ away from the origin (which clearly belongs to every subspace of $\mathbb{R}^l$). Further, note that for each $i=2,\dots,l$, $\bigl\|\Pi\bigl(y^i,(\mathcal{Q}_{i-1})^\perp\bigr)\bigr\|_2 \big/ \|y^i\|_2 = \sin(\theta_i)$, where $\theta_i \in [0,\pi/2]$ is the angle between $y^i$ and $\mathcal{Q}_{i-1}$. Thus, (150) ensures that for each $i=2,\dots,l$, the angle between the row vector $y^i$ and the subspace $\mathcal{Q}_{i-1}$ is at least $\sin^{-1}(\mu) > 0$. This in turn ensures that the row vectors $\{y^1,\dots,y^l\}$ are well spread out, thus giving us the lower bound in (151).
Indeed, there exist other conditions on the rows of $Y$ that provide similar lower bounds on $\lambda_{\min}(Y^TY)$. Conditions (149) and (150) are particularly appealing because there exists an easy and well-known algorithm to verify whether they are satisfied. Obviously, given $\{y^1,\dots,y^m\}$ and $\kappa > 0$, it is trivial to check whether $\|y^i\|_2 \ge \kappa$ for $i=1,\dots,l$. Now, consider the result of the Gram-Schmidt orthogonalization process performed on $\{y^1,\dots,y^l\}$, in particular the vectors $\{u^1,\dots,u^l\}$ defined in (121) and (122). From Property P 3.13, we know that $u^i = \Pi\bigl(y^i,(\mathcal{Q}_{i-1})^\perp\bigr)$ for each $i=2,\dots,l$. Thus, the row vectors $\{y^1,\dots,y^l\}$ satisfy (150) if and only if, at each step of the Gram-Schmidt process applied to these vectors, we get $\|u^i\|_2 \ge \mu\,\|y^i\|_2$, where $\mu \in (0,1]$.
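The verification procedure just described can be sketched in a few lines (the function name and the numpy usage are mine, not from the text): run Gram-Schmidt on the rows and reject as soon as a row is shorter than $\kappa$, or its residual against the span of its predecessors is shorter than a fraction $\mu$ of its norm.

```python
import numpy as np

def well_spread(rows, kappa, mu):
    """Check (149)-(150) via Gram-Schmidt: every row must have norm >= kappa,
    and each row's component orthogonal to the span of its predecessors must
    be at least a fraction mu of the row's norm."""
    U = []  # orthonormal Gram-Schmidt vectors built so far
    for i, y in enumerate(np.asarray(rows, dtype=float)):
        if np.linalg.norm(y) < kappa:
            return False
        u = y - sum((y @ uj) * uj for uj in U)  # residual u^i = Pi(y^i, (Q_{i-1})^perp)
        if i > 0 and np.linalg.norm(u) < mu * np.linalg.norm(y):
            return False
        U.append(u / np.linalg.norm(u))
    return True
```

For instance, the standard basis of $\mathbb{R}^3$ passes with $\kappa = \mu = 1$, while two nearly collinear rows fail for any moderate $\mu$; by Theorem 3.19, a passing set of rows guarantees $\lambda_{\min}(Y^TY) \ge \kappa^2 S_l(\mu) > 0$.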
Now, let us apply Theorem 3.19 towards satisfying (101). First, suppose we ensure that for each $n\in\mathbb{N}$, the design points $\{x_n + y_n^i : i=1,\dots,M_n^I\}$ are picked such that the following assumption is satisfied.

A 17. 1. The number of inner design points $M_n^I$ is at least $l$.

2. Given some constant $\mu_L \in (0,1]$, the scaled perturbation vectors $\{y_n^1,\dots,y_n^l\}$ satisfy, for each $i=1,\dots,l$,
$$\|y_n^i\|_2 \ge \mu_L \qquad (156)$$
and, for each $i=2,\dots,l$,
$$\frac{\bigl\|\Pi\bigl(y_n^i,(\mathcal{Q}_n^{i-1})^\perp\bigr)\bigr\|_2}{\|y_n^i\|_2} \ge \mu_L \qquad (157)$$
where $\mathcal{Q}_n^i := \mathrm{span}\{y_n^1,\dots,y_n^i\}$ for $i=1,\dots,l$.
Then, using Theorem 3.19, we get that for each $n\in\mathbb{N}$,
$$\lambda_{\min}\bigl((Y_n^I)^T Y_n^I\bigr) \ge \mu_L^2\, S_l(\mu_L) \qquad (158)$$
Therefore, (101) is satisfied with $K_y^I := \mu_L^2 S_l(\mu_L)$. Thus, if for each $n\in\mathbb{N}$ we pick our inner design points $\{x_n + y_n^i : i=1,\dots,M_n^I\}$ so as to satisfy Assumptions A 16 and A 17, and the corresponding weights $\{w_n^i : i=1,\dots,M_n^I\}$ to satisfy Assumption A 15, then from (97), (104) and (158) we get that for each $n\in\mathbb{N}$,
$$\bigl\|(Y_n^I)^T W_n^I Y_n^I\bigr\|_2\, \bigl\|\bigl((Y_n^I)^T W_n^I Y_n^I\bigr)^{-1}\bigr\|_2 \le \frac{K_w^I M_{\max}^I}{\mu_L^2\, S_l(\mu_L)} \qquad (159)$$
Similarly, suppose we pick the inner design points $\{x_n + y_n^i : i=1,\dots,M_n^I\}$ for each $n\in\mathbb{N}$ such that the following assumption is satisfied.

A 18. 1. The number of inner design points $M_n^I$ is at least $p$.

2. Given some constant $\mu_Q \in (0,1]$, the scaled regression vectors $\{z_n^1,\dots,z_n^p\}$ satisfy, for each $i=1,\dots,p$,
$$\|z_n^i\|_2 \ge \sqrt{\tfrac{3}{2}}\,\mu_Q \qquad (160)$$
and, for each $i=2,\dots,p$,
$$\frac{\bigl\|\Pi\bigl(z_n^i,(\mathcal{R}_n^{i-1})^\perp\bigr)\bigr\|_2}{\|z_n^i\|_2} \ge \mu_Q \qquad (161)$$
where $\mathcal{R}_n^i := \mathrm{span}\{z_n^1,\dots,z_n^i\}$ for $i=1,\dots,p$.

Then, using Theorem 3.19, we get that for each $n\in\mathbb{N}$,
$$\lambda_{\min}\bigl((Z_n^I)^T Z_n^I\bigr) \ge \tfrac{3}{2}\,\mu_Q^2\, S_p(\mu_Q) \qquad (162)$$
Therefore, (101) is satisfied with $K_z^I := \tfrac{3}{2}\,\mu_Q^2 S_p(\mu_Q)$. Thus, if for each $n\in\mathbb{N}$ we pick our inner design points $\{x_n + y_n^i : i=1,\dots,M_n^I\}$ so as to satisfy Assumptions A 16 and A 18, and pick the corresponding weights $\{w_n^i : i=1,\dots,M_n^I\}$ to satisfy Assumption A 15, then from (98), (105) and (162) we get that for each $n\in\mathbb{N}$,
$$\bigl\|(Z_n^I)^T W_n^I Z_n^I\bigr\|_2\, \bigl\|\bigl((Z_n^I)^T W_n^I Z_n^I\bigr)^{-1}\bigr\|_2 \le \frac{K_w^I M_{\max}^I}{\mu_Q^2\, S_p(\mu_Q)} \qquad (163)$$
Thus, we have two sets of sufficient conditions on the inner design points for each $n\in\mathbb{N}$, which respectively ensure that (36) in Assumption A 7 and (58) in Assumption A 11 will be satisfied. Next, we develop procedures to pick the design points $\{x_n + y_n^i : i=1,\dots,M_n\}$ to satisfy these conditions. Of course, as we noted earlier, we intend to eventually use the procedures we develop here to construct appropriate regression model functions in trust region algorithms that solve the optimization problem (P). Consequently, in light of Assumption A 1, the procedures we develop to pick $\{x_n + y_n^i : i=1,\dots,M_n\}$ for each $n\in\mathbb{N}$ so as to satisfy (36) or (58) must at the same time also minimize the number of evaluations of $F$ required to subsequently evaluate $\nabla_n f(x_n)$ and $\nabla_n^2 f(x_n)$.
One obvious way to reduce the number of evaluations of $F$ required to construct $m_n$ for each iteration $n$ of our trust region algorithms is to store and reuse sample averages computed in earlier iterations. In general, suppose that we wish to calculate a sample average $\bar{f}(x,N_2)$ for some $x\in\mathcal{E}$ and $N_2\in\mathbb{N}$. If we have stored the value of $\bar{f}(x,N_1)$ for some $0 < N_1 < N_2$, then we can calculate $\bar{f}(x,N_2)$ using
$$\bar{f}(x,N_2) = \frac{N_1\,\bar{f}(x,N_1) + \sum_{j=N_1+1}^{N_2} F(x,\zeta_j)}{N_2} \qquad (164)$$
Clearly, such a computation involves only $N_2 - N_1$ extra evaluations of $F$, as against the $N_2$ evaluations required if we do not have $\bar{f}(x,N_1)$ stored. Accordingly, in our trust region algorithms we will maintain a list of at most $M_{\max} < \infty$ sample averages evaluated during the course of the algorithm's operation, along with the corresponding points and sample sizes used. In particular, at the beginning of each iteration $n$, for some $M_n \le M_{\max}$, we will have the following (totally) ordered sets:
$$\mathcal{X}_n := \bigl\{x_n^k : k=1,\dots,M_n\bigr\} \qquad (165)$$
$$\mathcal{N}_n := \bigl\{N_n^k : k=1,\dots,M_n\bigr\} \qquad (166)$$
$$\mathcal{F}_n := \bigl\{\bar{f}(x_n^k, N_n^k) : k=1,\dots,M_n\bigr\} \qquad (167)$$
$$\Phi_n := \bigl\{\sigma(x_n^k, N_n^k) : k=1,\dots,M_n\bigr\} \qquad (168)$$
where, for any $x\in\mathcal{E}$ and $N\in\mathbb{N}$,
$$\sigma(x,N) := \sqrt{\frac{\bigl(\sum_{j=1}^{N} F(x,\zeta_j)^2\bigr) - N\,\bar{f}(x,N)^2}{N(N-1)}} \qquad (169)$$
denotes the standard deviation of the sample average $\bar{f}(x,N)$. Here, for each $k=1,\dots,M_n$, $x_n^k$ is either a design point $x_{n^*} + y_{n^*}^i$ or a solution point $x_{n^*}$ for an earlier iteration $n^*\in\{1,\dots,n-1\}$. If $x_n^k = x_{n^*} + y_{n^*}^i$ for some $n^*\in\{1,\dots,n-1\}$, then $N_n^k = N_{n^*}^i$, and $\bar{f}(x_{n^*}+y_{n^*}^i, N_{n^*}^i)$ and $\sigma(x_{n^*}+y_{n^*}^i, N_{n^*}^i)$ are stored in $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$ respectively. Similarly, if $x_n^k = x_{n^*}$ for some $n^*\in\{1,\dots,n-1\}$, then $N_n^k = N_{n^*}^0$, and $\bar{f}(x_{n^*}, N_{n^*}^0)$ and $\sigma(x_{n^*}, N_{n^*}^0)$ are stored in $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$ respectively. We store $\sigma(x_n^k, N_n^k)$ for each $k=1,\dots,M_n$ for use in a test that adaptively sets the sample size $N_n^0$ such that (92) may be satisfied.
The advantage of maintaining the lists $\mathcal{X}_n$, $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$ is clear from (164). Suppose that we are able to set $x_n^k$, for some $k\in\{1,\dots,M_n\}$, as the design point $x_n + y_n^i$ for iteration $n$. Then we can evaluate $\bar{f}(x_n+y_n^i, N_n^i)$ using $\bar{f}(x_n^k, N_n^k)$ and $N_n^i - N_n^k$ extra evaluations of $F$ with (164), whereas we would need $N_n^i$ evaluations of $F$ for a new design point (i.e., a point that is not originally in the set $\mathcal{X}_n$). Also, note that for any $x\in\mathcal{E}$ and $0 < N_1 < N_2$, if we have $\sigma(x,N_1)$ and $\bar{f}(x,N_1)$ stored, then after having evaluated $\bar{f}(x,N_2)$ using (164), we can evaluate $\sigma(x,N_2)$ as follows:
$$\sigma(x,N_2) := \sqrt{\frac{N_1(N_1-1)\,\sigma(x,N_1)^2 + \bigl(\sum_{j=N_1+1}^{N_2} F(x,\zeta_j)^2\bigr) - \bigl(N_2\,\bar{f}(x,N_2)^2 - N_1\,\bar{f}(x,N_1)^2\bigr)}{N_2(N_2-1)}} \qquad (170)$$
Consequently, we can also compute $\sigma(x_n+y_n^i, N_n^i)$ using $\sigma(x_n^k, N_n^k)$, $\bar{f}(x_n^k, N_n^k)$, $\bar{f}(x_n+y_n^i, N_n^i)$ and $N_n^i - N_n^k$ extra evaluations of $F$.
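A minimal way to realize (164), (169) and (170) in code is to store, for each point, the running sums $\sum_j F(x,\zeta_j)$ and $\sum_j F(x,\zeta_j)^2$ rather than reapplying the update formulas explicitly; the class below is an illustrative sketch (the names are mine, not from the text), and extending from $N_1$ to $N_2$ samples then costs exactly $N_2 - N_1$ new evaluations of $F$.

```python
import math

class RunningAverage:
    """Incrementally maintained sample average f_bar(x, N) and standard
    deviation sigma(x, N) of that average, as in (164), (169), (170)."""
    def __init__(self):
        self.n = 0
        self.sum_f = 0.0    # running sum of F(x, zeta_j)
        self.sum_f2 = 0.0   # running sum of F(x, zeta_j)**2

    def extend(self, new_samples):
        # Only the N2 - N1 new evaluations of F are consumed here.
        for v in new_samples:
            self.n += 1
            self.sum_f += v
            self.sum_f2 += v * v

    def mean(self):
        return self.sum_f / self.n                      # f_bar(x, N)

    def std_of_mean(self):
        n = self.n                                      # sigma(x, N) per (169)
        return math.sqrt((self.sum_f2 - n * self.mean() ** 2) / (n * (n - 1)))
```

Storing the two sums is algebraically equivalent to updating via (164) and (170), but sidesteps the subtraction of previously computed averages.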
Thus, in each iteration $n$ of our trust region algorithms, we will pick as many design points as possible from $\mathcal{X}_n$, and evaluate sample averages at the fewest possible number of "new" (i.e., not in $\mathcal{X}_n$) design points required to satisfy either Assumption A 17 or Assumption A 18, depending on whether we wish to satisfy (36) or (58). Also, whenever we evaluate a sample average at a point not in $\mathcal{X}_n$, we will insert this "new" point into $\mathcal{X}_n$, and the corresponding sample size, sample average and standard deviation into $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$ respectively. Finally, at the end of iteration $n$, we will pick at most $M_{\max}$ appropriate points from $\mathcal{X}_n$ (for example, the points in $\mathcal{X}_n$ that lie closest to $x_{n+1}$ in the Euclidean norm) and define the set of these points to be $\mathcal{X}_{n+1}$, and define the corresponding sets of sample sizes, sample averages and standard deviations to be $\mathcal{N}_{n+1}$, $\mathcal{F}_{n+1}$ and $\Phi_{n+1}$ respectively. Thus, $\mathcal{X}_n$, $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$ start as empty sets for $n=1$. As our algorithms progress, these sets grow to contain $M_{\max}$ elements each, and $M_n$ remains equal to $M_{\max}$ thereafter.
Accordingly, in this section we will assume that for each $n\in\mathbb{N}$, the ordered sets $\mathcal{X}_n$, $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$, as defined in (165) through (168) respectively, are available to us with $M_n \le M_{\max}$. We will provide algorithms to pick as many inner design points as possible from $\mathcal{X}_n$ towards satisfying Assumption A 17 or A 18. Further, we provide algorithms to appropriately pick the required number of "new" inner design points if sufficiently many points do not already exist in $\mathcal{X}_n$ to satisfy Assumption A 17 or A 18. Using these algorithms, we then provide methods to pick the inner and outer design points for each $n\in\mathbb{N}$ so as to satisfy Assumption A 16 and either Assumption A 17 or Assumption A 18. Since the algorithms we discuss in this section involve the manipulation of sets such as $\mathcal{X}_n$, we first briefly describe the notation required for such manipulations.
1. For our purposes, we define an ordered set $(A, f_A)$ as an ordered pair consisting of a set $A$ of $m$ unique objects (for some integer $m \ge 0$) along with a bijective mapping $f_A : \{1,\dots,m\} \to A$. If the set $A = \emptyset$, i.e., $m = 0$, then we define the ordered set $(A, f_A)$ also to be empty. Thus, an ordered set for us is essentially a finite set on which a total order has been imposed. For notational brevity, however, we will not explicitly refer to the bijection $f_A$ when referring to an ordered set $(A, f_A)$, since in most cases the ordering of the elements will be clear from context. Instead, whenever we refer to a set $A := \{a, b, c, \dots\}$, it is understood that the elements of the set $A$ have been written in the order defined by $f_A$, i.e., that $a = f_A(1)$, $b = f_A(2)$ and so on. Further, so as not to explicitly involve $f_A$ in the notation, for each $i\in\{1,\dots,m\}$ we will denote $f_A(i)$ by $A(i)$ and refer to it as the $i$th element of the set $A$. Finally, we let $|A|$ denote the number of elements in the set $A$. If $|A| = 0$, then we say that the ordered set $A$ is empty and write $A = \emptyset$.

2. In contexts where the ordering of the elements in a set $A$ is not important, we will treat $A$ as a regular set and perform common set operations on it.

3. Suppose that $A = \{a_1,\dots,a_m\}$ is an ordered set of $m$ elements for some $m > 0$. Let $K = \{i_1,\dots,i_j\}$ be an ordered set of positive integers containing $j \le m$ elements such that $K(k)\in\{1,\dots,m\}$ for each $k=1,\dots,j$. Then we say that $K$ is an index set for $A$, and define $A(K)$ to be the ordered set $\{a_{i_1}, a_{i_2}, \dots, a_{i_j}\}$.
4. Next, we define some operations that can be performed on an ordered set.

• We can set the object at position $j$ (for $1\le j\le m$) in $A = \{a_1,\dots,a_m\}$ to be $b$. After such an operation is performed, $A$ is defined as the set of $m$ elements $\{a_1,\dots,a_{j-1}, b, a_{j+1},\dots,a_m\}$.

• We can insert the object $b$ at position $j$, for some $1\le j\le m+1$, into the set $A$. After the insertion operation, $A$ is a set of $m+1$ objects defined as follows:
$$A := \begin{cases} \{b, a_1, a_2, \dots, a_m\} & \text{if } j = 1 \\ \{a_1,\dots,a_{j-1}, b, a_j,\dots,a_m\} & \text{if } 2\le j\le m \\ \{a_1,\dots,a_m, b\} & \text{if } j = m+1 \end{cases}$$

• We can delete the $j$th object from the set $A$, for some $1\le j\le m$. After the deletion operation, $A$ is a set of $m-1$ elements defined as follows:
$$A := \begin{cases} \{a_1,\dots,a_{j-1}, a_{j+1},\dots,a_m\} & \text{if } m > 1 \\ \emptyset & \text{if } m = 1 \end{cases}$$
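The ordered-set operations above correspond one-to-one with list operations in most programming languages; for example, in Python (0-indexed, whereas the text uses 1-based positions):

```python
A = ["a1", "a2", "a3"]

A[1] = "b"           # set the object at position 2 to b
assert A == ["a1", "b", "a3"]

A.insert(0, "b0")    # insert b0 at position 1
assert A == ["b0", "a1", "b", "a3"]

del A[2]             # delete the 3rd object
assert A == ["b0", "a1", "a3"]
```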
In keeping with the terminology developed earlier in Section 3, for each $k=1,\dots,M_n$ we will refer to the vectors $(\mathcal{X}_n(k) - x_n)$ and $\begin{pmatrix} \mathcal{X}_n(k) - x_n \\ (\mathcal{X}_n(k) - x_n)_Q \end{pmatrix}$ as the perturbation and regression vectors, respectively, corresponding to $\mathcal{X}_n(k)$. Similarly, we will refer to $(\mathcal{X}_n(k) - x_n)/\delta_n$ and $\begin{pmatrix} (\mathcal{X}_n(k) - x_n)/\delta_n \\ (\mathcal{X}_n(k) - x_n)_Q/\delta_n^2 \end{pmatrix}$ as the scaled perturbation and regression vectors, respectively, corresponding to $\mathcal{X}_n(k)$. Now, we start by providing an algorithm that separates the points in $\mathcal{X}_n$ lying within and outside the design region $\mathcal{D}_n$, and also identifies as many points from $\mathcal{X}_n$ as possible that can be used as design points that satisfy Assumption A 17.
Algorithm 2. Let the ordered set $\mathcal{X}_n$, containing $M_n$ (where $M_n \le M_{\max}$) entries, be given. Also, we assume that the design region radius $\delta_n > 0$, the current solution $x_n$ and a constant $\mu_L\in(0,1]$ are known. Initialize the index sets $\bar{K}_n^G = \bar{K}_n^B = \bar{K}_n^O = \emptyset$ and the corresponding counters $\bar{M}_n^G = \bar{M}_n^B = \bar{M}_n^O = 0$. Also initialize the sets $\mathcal{Y}_n = \mathcal{Z}_n = \mathcal{U}_n = \emptyset$. Repeat the following steps for each $k\in\{1,\dots,M_n\}$.

Step 1: Set $y_n \leftarrow (\mathcal{X}_n(k) - x_n)/\delta_n$ and compute $(y_n)_Q$ (as defined in (16)).

Step 2: Insert $y_n$ into the set $\mathcal{Y}_n$ and $\begin{pmatrix} y_n \\ (y_n)_Q \end{pmatrix}$ into the set $\mathcal{Z}_n$, both at position $k$.

Step 3: If $0 < \|y_n\|_2 \le 1$, then

• If $0 \le \bar{M}_n^G < l$, then

– Set
$$u_n \leftarrow \begin{cases} y_n & \text{if } \bar{M}_n^G = 0 \\ y_n - \sum_{j=1}^{\bar{M}_n^G} \bigl(y_n^T\,\mathcal{U}_n(j)\bigr)\,\mathcal{U}_n(j) & \text{otherwise} \end{cases} \qquad (171)$$

– If $\|y_n\|_2 < \mu_L$ or $\|u_n\|_2 < \mu_L\,\|y_n\|_2$, then set $\bar{M}_n^B \leftarrow \bar{M}_n^B + 1$ and insert $k$ into the set $\bar{K}_n^B$ at the position $\bar{M}_n^B$.

– If $\|y_n\|_2 \ge \mu_L$ and $\|u_n\|_2 \ge \mu_L\,\|y_n\|_2$, then

∗ Set $\bar{M}_n^G \leftarrow \bar{M}_n^G + 1$ and insert $k$ into the set $\bar{K}_n^G$ at the position $\bar{M}_n^G$.

∗ Insert $u_n/\|u_n\|_2$ into the set $\mathcal{U}_n$ at the position $\bar{M}_n^G$.

• If $\bar{M}_n^G = l$, then set $\bar{M}_n^B \leftarrow \bar{M}_n^B + 1$ and insert $k$ into the set $\bar{K}_n^B$ at the position $\bar{M}_n^B$.

Step 4: If $\|y_n\|_2 > 1$, then set $\bar{M}_n^O \leftarrow \bar{M}_n^O + 1$ and insert $k$ into the set $\bar{K}_n^O$ at the position $\bar{M}_n^O$.
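The classification loop of Algorithm 2 can be sketched as follows (an illustrative translation with my own names; the bookkeeping for the regression vectors in $\mathcal{Z}_n$ is omitted, since $(y_n)_Q$ is defined in (16), outside this excerpt):

```python
import numpy as np

def classify_points(X_pts, x_n, delta_n, mu_L, l):
    """Split stored points into good inner points K_G (usable towards
    Assumption A 17), remaining inner points K_B, and outer points K_O,
    running a thresholded Gram-Schmidt on the scaled perturbations."""
    K_G, K_B, K_O, U = [], [], [], []
    for k, x in enumerate(X_pts):
        y = (np.asarray(x, dtype=float) - x_n) / delta_n   # Step 1
        r = np.linalg.norm(y)
        if r > 1.0:                                        # Step 4: outside D_n
            K_O.append(k)
        elif len(K_G) < l:                                 # Step 3
            u = y - sum((y @ uj) * uj for uj in U)         # as in (171)
            if r >= mu_L and np.linalg.norm(u) >= mu_L * r:
                K_G.append(k)
                U.append(u / np.linalg.norm(u))
            else:
                K_B.append(k)
        else:                                              # already l good points
            K_B.append(k)
    return K_G, K_B, K_O, U
```

For example, with $x_n = 0$, $\delta_n = 1$, $\mu_L = 0.5$ and $l = 2$, the points $(1,0)$, $(0,0.9)$, $(2,0)$, $(0.5,0)$ are classified into $K_G$ (first two), $K_O$ (third) and $K_B$ (fourth).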
It is easily seen from the operation of Algorithm 2 that the various sets generated in the algorithm satisfy the following properties.

P 3.14. For each $k=1,\dots,M_n$, $\mathcal{Y}_n(k)$ and $\mathcal{Z}_n(k)$ are, respectively, the scaled perturbation and regression vectors for the point $\mathcal{X}_n(k)$, evaluated with respect to $x_n$.

P 3.15. 1. The sets $\bar{K}_n^G$, $\bar{K}_n^B$ and $\bar{K}_n^O$, containing $\bar{M}_n^G$, $\bar{M}_n^B$ and $\bar{M}_n^O$ elements respectively, satisfy $\bar{K}_n^G \cap \bar{K}_n^B = \bar{K}_n^B \cap \bar{K}_n^O = \bar{K}_n^O \cap \bar{K}_n^G = \emptyset$ and $\bar{K}_n^G \cup \bar{K}_n^B \cup \bar{K}_n^O = \{1,\dots,M_n\}$. Consequently, we have $\bar{M}_n^G + \bar{M}_n^B + \bar{M}_n^O = M_n$.

2. $\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \le 1$ for each $i=1,\dots,\bar{M}_n^G$, and $\|\mathcal{Y}_n(\bar{K}_n^B(i))\|_2 \le 1$ for each $i=1,\dots,\bar{M}_n^B$. Consequently, we get that $\mathcal{X}_n(\bar{K}_n^G) \cup \mathcal{X}_n(\bar{K}_n^B) \subset \mathcal{D}_n$, where $\mathcal{D}_n$ is the design region for iteration $n$ defined as in (26). Thus, the points in $\mathcal{X}_n(\bar{K}_n^G)$ and $\mathcal{X}_n(\bar{K}_n^B)$ together form a set of potential inner design points for iteration $n$. Similarly, $\|\mathcal{Y}_n(\bar{K}_n^O(i))\|_2 > 1$ for each $i=1,\dots,\bar{M}_n^O$, and hence the points in $\mathcal{X}_n(\bar{K}_n^O) \subset \mathbb{R}^l \setminus \mathcal{D}_n$ form a set of potential outer design points for iteration $n$.
P 3.16. 1. If $\bar{M}_n^G \ge 1$, the vectors in the set $\mathcal{U}_n$ are the Gram-Schmidt vectors corresponding to those in $\mathcal{Y}_n(\bar{K}_n^G)$. That is, if $\bar{M}_n^G \ge 1$, then
$$\mathcal{U}_n(1) = \frac{\mathcal{Y}_n(\bar{K}_n^G(1))}{\|\mathcal{Y}_n(\bar{K}_n^G(1))\|_2} \qquad (172)$$
and if $\bar{M}_n^G \ge 2$, then for $i=2,\dots,\bar{M}_n^G$,
$$\mathcal{U}_n(i) = \frac{\mathcal{Y}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\bigl(\mathcal{Y}_n(\bar{K}_n^G(i))^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)}{\bigl\|\mathcal{Y}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\bigl(\mathcal{Y}_n(\bar{K}_n^G(i))^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)\bigr\|_2} \qquad (173)$$

2. If $\bar{M}_n^G \ge 1$, then for each $i=1,\dots,\bar{M}_n^G$,
$$\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \ge \mu_L \qquad (174)$$
Further, if $\bar{M}_n^G \ge 2$, then we get from Property P 3.13 that, for $i=2,\dots,\bar{M}_n^G$,
$$\frac{\bigl\|\Pi\bigl(\mathcal{Y}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G(1)),\dots,\mathcal{Y}_n(\bar{K}_n^G(i-1))\}\bigr)^\perp\bigr)\bigr\|_2}{\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2} = \frac{\bigl\|\mathcal{Y}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\bigl(\mathcal{Y}_n(\bar{K}_n^G(i))^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)\bigr\|_2}{\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2} \ge \mu_L \qquad (175)$$
Indeed, it is easily seen that Algorithm 2 is only a slightly modified form of the Gram-Schmidt orthogonalization process; the only difference is that in Algorithm 2, the set of vectors on which the Gram-Schmidt process is performed, i.e., $\mathcal{Y}_n(\bar{K}_n^G)$, is chosen from $\mathcal{Y}_n$ as the algorithm progresses, so as to ensure that the vectors in $\mathcal{Y}_n(\bar{K}_n^G)$ satisfy (174) and (175). Consequently, from Property P 3.16, it is clear that if $\bar{M}_n^G \ge 1$ and we set $x_n + y_n^i = \mathcal{X}_n(\bar{K}_n^G(i))$ for each $i=1,\dots,\bar{M}_n^G$, then (156) is satisfied for $i=1,\dots,\bar{M}_n^G$, and if $\bar{M}_n^G \ge 2$, then (157) is satisfied for $i=2,\dots,\bar{M}_n^G$.

However, recall that $l$ inner design points satisfying (156) and (157) are required to satisfy Assumption A 17, and it is clearly possible that after Algorithm 2 is executed we get fewer than $l$ points in $\mathcal{X}_n(\bar{K}_n^G)$, i.e., $\bar{M}_n^G < l$. Thus, we next describe a procedure that inserts an appropriate point into $\mathcal{X}_n$ such that $\bar{M}_n^G$ is incremented by one and Properties P 3.14 through P 3.16 continue to hold.
Algorithm 3. Let us assume that we have knowledge of the design region radius $\delta_n$, the sets $\mathcal{X}_n$, $\mathcal{Y}_n$, $\mathcal{Z}_n$, $\mathcal{N}_n$, $\mathcal{F}_n$, $\Phi_n$ and $\mathcal{U}_n$, the index sets $\bar{K}_n^G$, $\bar{K}_n^B$ and $\bar{K}_n^O$, and a constant $\mu_L\in(0,1]$, such that Properties P 3.14 through P 3.16 are satisfied. Further, we assume that $\bar{M}_n^G < l$, and let $\{e_1,\dots,e_l\}\subset\mathbb{R}^l$ denote the standard orthonormal basis of $\mathbb{R}^l$.

Step 1: If $\bar{M}_n^G = 0$, then set $y_n \leftarrow e_1$.

Step 2: If $\bar{M}_n^G \ge 1$, then

• Find $i^*\in\{1,\dots,l\}$ such that
$$u_n = e_{i^*} - \sum_{j=1}^{\bar{M}_n^G}\bigl(e_{i^*}^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j) \quad \text{and} \quad \|u_n\|_2 > 0 \qquad (176)$$

• Set $y_n \leftarrow u_n/\|u_n\|_2$.

Step 3: Set $M_n \leftarrow M_n + 1$.

Step 4: Insert $y_n$ into $\mathcal{Y}_n$, $\begin{pmatrix} y_n \\ (y_n)_Q \end{pmatrix}$ into $\mathcal{Z}_n$, and $x_n + \delta_n y_n$ into $\mathcal{X}_n$, all at position $M_n$.

Step 5: Insert the value 0 into $\mathcal{F}_n$, $\Phi_n$ and $\mathcal{N}_n$, all at position $M_n$.

Step 6: Set $\bar{M}_n^G \leftarrow \bar{M}_n^G + 1$ and insert $M_n$ into the set $\bar{K}_n^G$ at position $\bar{M}_n^G$.

Step 7: Insert $y_n$ into the set $\mathcal{U}_n$ at position $\bar{M}_n^G$.
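Steps 1 and 2 of Algorithm 3 amount to finding a standard basis vector with a nonzero residual against $\mathrm{span}\{\mathcal{U}_n\}$ and normalizing that residual; a sketch (the function name and the numerical tolerance are mine, not from the text):

```python
import numpy as np

def new_design_direction(U, l):
    """Return a unit vector orthogonal to span(U), where U is the current
    list of orthonormal Gram-Schmidt vectors with len(U) < l, by projecting
    span(U) out of some standard basis vector e_i, as in (176)."""
    if not U:                           # Step 1: no good points yet
        e1 = np.zeros(l)
        e1[0] = 1.0
        return e1
    for i in range(l):                  # Step 2: find i* with nonzero residual
        e = np.zeros(l)
        e[i] = 1.0
        u = e - sum((e @ uj) * uj for uj in U)
        if np.linalg.norm(u) > 1e-12:   # numerical stand-in for "> 0"
            return u / np.linalg.norm(u)
    raise ValueError("U already spans R^l")  # cannot happen when len(U) < l
```

With $\mathcal{U}_n = \{e_1\}$ in $\mathbb{R}^3$, the residual of $e_1$ vanishes and the loop moves on to $e_2$, which is returned unchanged; Lemma 3.20 below is exactly the statement that such an $e_{i^*}$ always exists when $\bar{M}_n^G < l$.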
Algorithm 3 essentially computes a unit vector $y_n \in \bigl(\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G)\}\bigr)^\perp$ and sets $x_n + \delta_n y_n$ as the new design point to be added to $\mathcal{X}_n$. Note that the sample size, sample average and standard deviation of the new point are each set to zero; we will provide methods to set these quantities to their proper values in Section 3.2.2. The following lemma shows that, as long as $\bar{M}_n^G < l$ before the execution of Algorithm 3, the algorithm will successfully add a point to $\mathcal{X}_n$ such that the updated set of scaled perturbations $\mathcal{Y}_n(\bar{K}_n^G)$ continues to satisfy (174) and (175).
Lemma 3.20. Suppose that the sets $\mathcal{X}_n$, $\mathcal{Y}_n$, $\mathcal{Z}_n$, $\mathcal{N}_n$, $\mathcal{F}_n$, $\Phi_n$ and $\mathcal{U}_n$ and the index sets $\bar{K}_n^G$, $\bar{K}_n^B$ and $\bar{K}_n^O$ satisfy Properties P 3.14 through P 3.16, and that $\bar{M}_n^G < l$ before the execution of Algorithm 3. Then the algorithm is well defined, and upon its termination, $\bar{M}_n^G$ is incremented by one and Properties P 3.14 through P 3.16 continue to hold.
Proof. In order to show that Algorithm 3 is well defined, we need only show that Step 2 of the algorithm is well defined, since it is trivial to see that the rest of the steps are well defined. In particular, for Step 2 to be well defined, we need to show that if $\bar{M}_n^G < l$, there exists some $i^*\in\{1,\dots,l\}$ such that (176) holds. Suppose $1 \le \bar{M}_n^G < l$ before running Algorithm 3. Then we know from Property P 3.16 that $\mathcal{U}_n$ contains the Gram-Schmidt orthonormal vectors corresponding to those in $\mathcal{Y}_n(\bar{K}_n^G)$, where $\bar{M}_n^G < l$. Therefore,
$$\dim\bigl(\mathrm{span}\{\mathcal{U}_n\}\bigr) = \dim\bigl(\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G)\}\bigr) = \bar{M}_n^G < l$$
Since $\dim(\mathrm{span}\{e_1,\dots,e_l\}) = l$, there clearly exists some $i^*\in\{1,\dots,l\}$ such that $e_{i^*}\notin\mathrm{span}\{\mathcal{U}_n\}$, i.e., such that
$$\|u_n\|_2 > 0, \quad \text{where } u_n = e_{i^*} - \sum_{j=1}^{|\mathcal{U}_n|}\bigl((e_{i^*})^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)$$
Therefore, Algorithm 3 is able to set $y_n = u_n/\|u_n\|_2$, and hence Step 2 of Algorithm 3 is well defined.
Next, we show that if Properties P 3.14 and P 3.15 were satisfied before the execution of Algorithm 3, they continue to be satisfied after its execution. It is easily seen that upon termination of Algorithm 3, $M_n$ is incremented by one and $\mathcal{Y}_n(M_n) = y_n$, $\mathcal{Z}_n(M_n) = \begin{pmatrix} y_n \\ (y_n)_Q \end{pmatrix}$ and $\mathcal{X}_n(M_n) = x_n + \delta_n y_n$. Since no other changes are made to $\mathcal{X}_n$, $\mathcal{Y}_n$ and $\mathcal{Z}_n$, and Property P 3.14 held before the execution of the algorithm, it is clear that this property holds after its execution.

Similarly, upon termination of Algorithm 3, $\bar{M}_n^G$ is incremented by one and we get $\bar{K}_n^G(\bar{M}_n^G) = M_n$. Further, note that since $M_n$ was incremented by one, $M_n \notin \bar{K}_n^B \cup \bar{K}_n^O$. That is, a new entry ($M_n$) is inserted only into $\bar{K}_n^G$, and no changes are made to $\bar{K}_n^B$ or $\bar{K}_n^O$. Consequently, since the first part of Property P 3.15 was assumed to hold before the execution of Algorithm 3, it continues to hold after the execution of the algorithm. Further, since $\|\mathcal{Y}_n(M_n)\|_2 = \|y_n\|_2 = 1$ (where $y_n$ is calculated in either Step 1 or Step 2), it is clear that $\mathcal{X}_n(M_n)\in\mathcal{D}_n$. Therefore, after the execution of the algorithm, $\mathcal{X}_n(\bar{K}_n^G)\subset\mathcal{D}_n$, $\mathcal{X}_n(\bar{K}_n^B)\subset\mathcal{D}_n$ and $\mathcal{X}_n(\bar{K}_n^O)\subset\mathbb{R}^l\setminus\mathcal{D}_n$, and the second part of Property P 3.15 continues to hold.
Next, let us show that if Property P 3.16 held before the execution of Algorithm 3, then it holds after its execution. Note again that, upon termination of the algorithm, the value of $\bar{M}_n^G$ is incremented by one and $\mathcal{U}_n(\bar{M}_n^G) = \mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G)) = y_n$. First, if $\bar{M}_n^G = 0$ before the execution of Algorithm 3, then after its execution $\bar{M}_n^G = 1$ and $\mathcal{U}_n(\bar{M}_n^G) = \mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G)) = e_1$. Thus, it is clear that (172) holds. Further, since $\|\mathcal{Y}_n(\bar{K}_n^G(1))\|_2 = 1$, it is easily seen that (174) is satisfied for $i=1$.

Finally, suppose that $\bar{M}_n^G \ge 1$ before the execution of Algorithm 3. Then $\bar{M}_n^G$ is incremented by one, and consequently $\bar{M}_n^G \ge 2$ after its execution. Since we assumed that the first part of Property P 3.16 held before the algorithm was executed, we know that after it is executed, (172) holds, and if $\bar{M}_n^G \ge 3$ after the execution of the algorithm, then (173) holds for $i=2,\dots,\bar{M}_n^G - 1$. Therefore, from Property P 3.11, we get that $\{\mathcal{U}_n(1),\dots,\mathcal{U}_n(\bar{M}_n^G - 1)\}$ is an orthonormal set. Using this in (176), we get that
$$u_n^T\,\mathcal{U}_n(j) = (e_{i^*})^T\,\mathcal{U}_n(j) - (e_{i^*})^T\,\mathcal{U}_n(j) = 0 \quad \text{for each } j=1,\dots,\bar{M}_n^G - 1$$
Thus, since $y_n = u_n/\|u_n\|_2$, we get that $\|y_n\|_2 = 1$ and
$$(y_n)^T\,\mathcal{U}_n(j) = 0 \quad \text{for each } j=1,\dots,\bar{M}_n^G - 1 \qquad (177)$$
Since $\mathcal{U}_n(\bar{M}_n^G) = \mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G)) = y_n$, we get, using (177) and $\|y_n\|_2 = 1$, that
$$\mathcal{U}_n(\bar{M}_n^G) = y_n = \frac{y_n - \sum_{j=1}^{\bar{M}_n^G-1}\bigl(y_n^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)}{\bigl\|y_n - \sum_{j=1}^{\bar{M}_n^G-1}\bigl(y_n^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)\bigr\|_2} = \frac{\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G)) - \sum_{j=1}^{\bar{M}_n^G-1}\bigl(\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G))^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)}{\bigl\|\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G)) - \sum_{j=1}^{\bar{M}_n^G-1}\bigl(\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G))^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)\bigr\|_2}$$
Therefore, (173) holds for $i = \bar{M}_n^G$. Thus, the first part of Property P 3.16 holds after the execution of Algorithm 3.
Similarly, since the second part of Property P 3.16 was assumed to hold before the execution of Algorithm 3, we get that after its execution, (174) holds for $i=1,\dots,\bar{M}_n^G - 1$, and if $\bar{M}_n^G \ge 3$, (175) holds for $i=2,\dots,\bar{M}_n^G - 1$. Thus, in order to show that the second part of Property P 3.16 holds after the execution of Algorithm 3, we only have to show that (174) and (175) hold for $i = \bar{M}_n^G$. Obviously, since $\|\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G))\|_2 = \|y_n\|_2 = 1 \ge \mu_L$, (174) holds for $i = \bar{M}_n^G$. Further, using Property P 3.13 and (177), we get that
$$\frac{\bigl\|\Pi\bigl(\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G)),\,\bigl(\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G(1)),\dots,\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G - 1))\}\bigr)^\perp\bigr)\bigr\|_2}{\|\mathcal{Y}_n(\bar{K}_n^G(\bar{M}_n^G))\|_2} = \frac{\bigl\|y_n - \sum_{j=1}^{\bar{M}_n^G-1}\bigl((y_n)^T\,\mathcal{U}_n(j)\bigr)\mathcal{U}_n(j)\bigr\|_2}{\|y_n\|_2} = \frac{\|y_n\|_2}{\|y_n\|_2} = 1 \ge \mu_L$$
Therefore, (175) holds for $i = \bar{M}_n^G$, and consequently the second part of Property P 3.16 continues to hold upon termination of Algorithm 3. $\square$
Now, if we have the set $\mathcal{X}_n$ (and the corresponding $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$) for each $n\in\mathbb{N}$, we can use Algorithms 2 and 3 to find a set of design points $\{x_n + y_n^i : i=1,\dots,M_n\}$ such that Assumptions A 16 and A 17 are satisfied.
Algorithm 4. Let us assume that for each $n\in\mathbb{N}$ we have the sets $\mathcal{X}_n$, $\mathcal{N}_n$, $\mathcal{F}_n$ and $\Phi_n$ with $M_n \le M_{\max}$. Initialize the index sets $K_n^G = K_n^B = K_n^O = \emptyset$ and the corresponding counters $M_n^G = M_n^B = M_n^O = 0$. Let the constant $M_{\max}^I$, with $l \le M_{\max}^I < \infty$, be given.

Step 1: Run Algorithm 2 once. Then run Algorithm 3 as many times as necessary to obtain the sets $\mathcal{Y}_n$, $\mathcal{Z}_n$, $\mathcal{U}_n$ and the index sets $\bar{K}_n^G$, $\bar{K}_n^B$ and $\bar{K}_n^O$, with $\bar{M}_n^G = l$, $\bar{M}_n^B$ and $\bar{M}_n^O$ elements respectively. Note that Algorithm 3 has to be run at most $l$ times, since the value of $\bar{M}_n^G$ is incremented by one whenever Algorithm 3 is run.

Step 2: Set $K_n^G \leftarrow \bar{K}_n^G$ and $M_n^G \leftarrow \bar{M}_n^G$.

Step 3: Set the index set $K_n^B$, with $0 \le M_n^B \le \min\bigl\{\bar{M}_n^B,\ M_{\max}^I - l\bigr\}$ elements, such that it contains a subset of the elements in $\bar{K}_n^B$ in some order (the order can be chosen arbitrarily). For example, we may set $K_n^B \leftarrow \emptyset$, or $K_n^B \leftarrow \bar{K}_n^B$, or $K_n^B \leftarrow \bigl\{i\in\bar{K}_n^B : \mathcal{N}_n(\bar{K}_n^B(i)) \ge \alpha N_n^0\bigr\}$ for some constant $\alpha\in(0,1]$.

Step 4: Set the index set $K_n^O$, with $0 \le M_n^O \le \bar{M}_n^O$ elements, such that it contains a subset of the elements in $\bar{K}_n^O$ in some order (the order can be chosen arbitrarily). For example, we may set $K_n^O \leftarrow \emptyset$, or $K_n^O \leftarrow \bar{K}_n^O$, or $K_n^O \leftarrow \bigl\{i\in\bar{K}_n^O : \|\mathcal{Y}_n(\bar{K}_n^O(i))\|_2 \le \alpha_\delta\bigr\}$ for some constant $\alpha_\delta > 1$.

Step 5: For each $i=1,\dots,M_n^G$, set $x_n + y_n^i \leftarrow \mathcal{X}_n\bigl(K_n^G(i)\bigr)$.

Step 6: If $M_n^B > 0$, then for each $i = M_n^G+1,\dots,M_n^G+M_n^B$, set $x_n + y_n^i \leftarrow \mathcal{X}_n\bigl(K_n^B(i - M_n^G)\bigr)$.

Step 7: If $M_n^O > 0$, then for each $i = M_n^G+M_n^B+1,\dots,M_n^G+M_n^B+M_n^O$, set $x_n + y_n^i \leftarrow \mathcal{X}_n\bigl(K_n^O(i - M_n^G - M_n^B)\bigr)$.
If Algorithm 4 is run for each $n\in\mathbb{N}$, then from Lemma 3.20 it is easily seen that the following statements hold.

1. The number of entries in $K_n^G$ is equal to $l$, and the number of entries in $K_n^B$ is at most $M_{\max}^I - l$. Thus, since $\mathcal{X}_n(K_n^G)\cup\mathcal{X}_n(K_n^B)$ forms the set of inner design points, it is clear that there exist at most $M_{\max}^I$ inner design points for each $n\in\mathbb{N}$. Therefore, Assumption A 16 is satisfied for each $n\in\mathbb{N}$.

2. From Lemma 3.20, we know that Properties P 3.14 through P 3.16 are satisfied after the execution of Step 1 in Algorithm 4. Thus, from Property P 3.16 we know that the $l$ vectors in $\mathcal{Y}_n(\bar{K}_n^G)$ satisfy (174) and (175), and from Property P 3.14 we know that these vectors are the scaled perturbation vectors corresponding to the points in $\mathcal{X}_n(\bar{K}_n^G)$. Finally, we note from Steps 2 and 5 of Algorithm 4 that the points in $\mathcal{X}_n(K_n^G) = \mathcal{X}_n(\bar{K}_n^G)$ are chosen to be the first $l$ design points. Therefore, (156) and (157) are satisfied by the first $l$ design points, and consequently Assumption A 17 is satisfied.

Thus, for each $n\in\mathbb{N}$, as long as the design points are chosen using Algorithm 4 and the weights $\{w_n^i : i=1,\dots,M_n^I\}$ are chosen so as to satisfy Assumption A 15, we get from (97), (104) and (158) that (159) holds, i.e., (36) in Assumption A 7 holds with $K_\lambda^I = K_w^I M_{\max}^I \big/ \bigl(\mu_L^2\, S_l(\mu_L)\bigr)$ for each $n\in\mathbb{N}$.
Next, we look at methods to pick $\{x_n + y_n^i : i=1,\dots,M_n^I\}$ so as to satisfy Assumption A 18. The algorithms we devise to find $p$ design points satisfying conditions (160) and (161) will be similar to Algorithms 2 and 3. That is, we will first provide an algorithm to identify as many points as possible from the set $\mathcal{X}_n$ that can be used to satisfy (160) and (161), and then provide another procedure to find the required number of new design points to insert into $\mathcal{X}_n$ so as to ensure that these requirements are satisfied.

To this end, we start with a lemma showing that the $l$ scaled regression vectors in $\mathcal{Z}_n(\bar{K}_n^G)$, obtained after running Algorithm 2 once and Algorithm 3 at most $l$ times, can be set as the first $l$ design points satisfying (160) and (161).
Lemma 3.21. Consider the $l$ scaled regression vectors in the set $\mathcal{Z}_n(\bar{K}_n^G)$, obtained after the execution of Algorithm 2 once and of Algorithm 3 as many times as required to get $\bar{M}_n^G = l$. We have, for each $i=1,\dots,l$,
$$\|\mathcal{Z}_n(\bar{K}_n^G(i))\|_2 \ge \mu_L = \Bigl(\sqrt{\tfrac{3}{2}}\Bigr)\Bigl(\sqrt{\tfrac{2}{3}}\Bigr)\mu_L$$
and, for $i=2,\dots,l$,
$$\frac{\bigl\|\Pi\bigl(\mathcal{Z}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(1)),\dots,\mathcal{Z}_n(\bar{K}_n^G(i-1))\}\bigr)^\perp\bigr)\bigr\|_2}{\|\mathcal{Z}_n(\bar{K}_n^G(i))\|_2} \ge \Bigl(\sqrt{\tfrac{2}{3}}\Bigr)\mu_L$$
Consequently, the Gram-Schmidt orthogonalization process can be successfully performed on the $l$ vectors in $\mathcal{Z}_n(\bar{K}_n^G)$.
Proof. First, we note from Lemma 3.20 that the $l$ vectors in $\mathcal{Y}_n(\bar{K}_n^G)$, obtained after running Algorithm 2 once and Algorithm 3 at most $l$ times, satisfy Properties P 3.14 through P 3.16. Now, from (17) it is easy to see that, for each $i=1,\dots,l$,
$$\|\mathcal{Z}_n(\bar{K}_n^G(i))\|_2 = \sqrt{\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2^2 + \tfrac{1}{2}\,\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2^4}$$
From Property P 3.15, we know that $\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \le 1$. Using this, we get
$$\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \le \|\mathcal{Z}_n(\bar{K}_n^G(i))\|_2 \le \sqrt{\tfrac{3}{2}}\,\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \qquad (178)$$
Since (174) in Property P 3.16 holds, we get from (178) that, for each $i=1,\dots,l$,
$$\|\mathcal{Z}_n(\bar{K}_n^G(i))\|_2 \ge \|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \ge \mu_L = \Bigl(\sqrt{\tfrac{3}{2}}\Bigr)\Bigl(\sqrt{\tfrac{2}{3}}\Bigr)\mu_L > 0$$
Next, for each $i=2,\dots,l$, we get, using the definition of the orthogonal projection operator, that
$$\begin{aligned}
&\bigl\|\Pi\bigl(\mathcal{Z}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)^\perp\bigr)\bigr\|_2 \\
&\qquad = \bigl\|\mathcal{Z}_n(\bar{K}_n^G(i)) - \Pi\bigl(\mathcal{Z}_n(\bar{K}_n^G(i)),\,\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)\bigr\|_2 \\
&\qquad = \min\bigl\{\|\mathcal{Z}_n(\bar{K}_n^G(i)) - v\|_2 : v\in\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr\}
\end{aligned}$$
Clearly, any $v\in\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}$ can be written as $v = \sum_{j=1}^{i-1}\alpha_j\,\mathcal{Z}_n(\bar{K}_n^G(j))$ for some $\alpha_1,\dots,\alpha_{i-1}\in\mathbb{R}$. Therefore,
$$\min\bigl\{\|\mathcal{Z}_n(\bar{K}_n^G(i)) - v\|_2 : v\in\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr\} = \min_{\alpha_1,\dots,\alpha_{i-1}\in\mathbb{R}} \Bigl\|\mathcal{Z}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\alpha_j\,\mathcal{Z}_n(\bar{K}_n^G(j))\Bigr\|_2$$
Further, using the definition of $\mathcal{Z}_n(\bar{K}_n^G(i))$ for $i=1,\dots,l$, we get that for any $\alpha_1,\dots,\alpha_{i-1}\in\mathbb{R}$,
$$\Bigl\|\mathcal{Z}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\alpha_j\,\mathcal{Z}_n(\bar{K}_n^G(j))\Bigr\|_2 = \left\|\begin{pmatrix}\mathcal{Y}_n(\bar{K}_n^G(i)) \\ \bigl(\mathcal{Y}_n(\bar{K}_n^G(i))\bigr)_Q\end{pmatrix} - \sum_{j=1}^{i-1}\alpha_j\begin{pmatrix}\mathcal{Y}_n(\bar{K}_n^G(j)) \\ \bigl(\mathcal{Y}_n(\bar{K}_n^G(j))\bigr)_Q\end{pmatrix}\right\|_2 \ge \Bigl\|\mathcal{Y}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\alpha_j\,\mathcal{Y}_n(\bar{K}_n^G(j))\Bigr\|_2$$
Thus, we have, for each $i=2,\dots,l$,
$$\begin{aligned}
\bigl\|\Pi\bigl(\mathcal{Z}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)^\perp\bigr)\bigr\|_2 &= \min_{\alpha_1,\dots,\alpha_{i-1}\in\mathbb{R}} \Bigl\|\mathcal{Z}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\alpha_j\,\mathcal{Z}_n(\bar{K}_n^G(j))\Bigr\|_2 \\
&\ge \min_{\alpha_1,\dots,\alpha_{i-1}\in\mathbb{R}} \Bigl\|\mathcal{Y}_n(\bar{K}_n^G(i)) - \sum_{j=1}^{i-1}\alpha_j\,\mathcal{Y}_n(\bar{K}_n^G(j))\Bigr\|_2 \\
&= \bigl\|\mathcal{Y}_n(\bar{K}_n^G(i)) - \Pi\bigl(\mathcal{Y}_n(\bar{K}_n^G(i)),\,\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)\bigr\|_2 \\
&= \bigl\|\Pi\bigl(\mathcal{Y}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)^\perp\bigr)\bigr\|_2
\end{aligned}$$
Finally, using (178) and (175) in Property P 3.16, we get, for each $i=2,\dots,l$,
$$\frac{\bigl\|\Pi\bigl(\mathcal{Z}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Z}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)^\perp\bigr)\bigr\|_2}{\|\mathcal{Z}_n(\bar{K}_n^G(i))\|_2} \ge \Bigl(\sqrt{\tfrac{2}{3}}\Bigr)\frac{\bigl\|\Pi\bigl(\mathcal{Y}_n(\bar{K}_n^G(i)),\,\bigl(\mathrm{span}\{\mathcal{Y}_n(\bar{K}_n^G(j)) : j=1,\dots,i-1\}\bigr)^\perp\bigr)\bigr\|_2}{\|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2} \ge \Bigl(\sqrt{\tfrac{2}{3}}\Bigr)\mu_L$$
Consequently, using Lemma 3.12, we get that $\{\mathcal{Z}_n(\bar{K}_n^G(1)),\dots,\mathcal{Z}_n(\bar{K}_n^G(l))\}\subset\mathbb{R}^p$ is a linearly independent set, and therefore we can perform Gram-Schmidt orthogonalization on this set. $\square$
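The squeeze (178) at the heart of this proof is easy to sanity-check numerically: writing $r = \|\mathcal{Y}_n(\bar{K}_n^G(i))\|_2 \le 1$, the claim is $r \le \sqrt{r^2 + r^4/2} \le \sqrt{3/2}\,r$ (the closed form for the norm of the scaled regression vector is taken from the display above, which in turn relies on (17), outside this excerpt):

```python
import math

# For r = ||y||_2 in (0, 1], the scaled regression vector's norm
# sqrt(r^2 + r^4/2) is squeezed between r and sqrt(3/2)*r, as in (178).
for r in [0.1, 0.25, 0.5, 0.9, 1.0]:
    z = math.sqrt(r**2 + 0.5 * r**4)
    assert r <= z <= math.sqrt(1.5) * r + 1e-15
```

Both bounds are tight at $r = 1$ (upper) and as $r \to 0$ (lower), which is why the factor $\sqrt{2/3}$ appears in the lemma's conclusion.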
Thus, if we set the first l design points {xn + yin : i = 1, . . . , l} to be those in Xn(KG
n), then from
Lemma 3.21, it is clear that (160) and (161) in Assumption A 18 are satisfied for i = 1, . . . , l with μQ =(√2/3
)μL. The advantages of such a “warm start” are obvious. First, by running Algorithm 2 we have
already separated the potential inner and outer design points in Xn. Also, in order satisfy Assumption A 18,
instead of having to pick p design points, we only have to find p − l additional points. Finally, if for any
reason, we wish to stop without picking the required p− l design points we are still assured of the bound on
the condition number of (Y In )T W I
n Y In as in (159).
Motivated by Lemma 3.21, we next provide an algorithm that performs the Gram-Schmidt orthogonalization process on the scaled regression vectors corresponding to the points in $X_n(K_n^G)$ and searches the set $X_n(K_n^B)$ for more points that can be used to satisfy Assumption A 18.
Algorithm 5. Let us assume that we have knowledge of the design region radius $\delta_n$, the sets $X_n$, $Y_n$, $Z_n$, $N_n$, $F_n$ and $\Phi_n$, and the index sets $K_n^G$, $K_n^B$ and $K_n^O$ such that Properties P 3.14 through P 3.16 are satisfied. Further, we assume that $M_n^G = l$. Initialize the set $V_n = \emptyset$, the temporary index set $K^{tmp} = \emptyset$ and $M^{tmp} = 0$. Choose a constant $\mu_{Q1} \in (0, 1]$.

Step 1: Repeat the following steps for each $i = 1, \ldots, M_n^G$.

• If $i = 1$, then set $v_n \leftarrow Z_n(K_n^G(i))$.

• If $i \ge 2$, then set
$$v_n \leftarrow Z_n(K_n^G(i)) - \sum_{j=1}^{i-1} \bigl(Z_n(K_n^G(i))^T V_n(j)\bigr) V_n(j)$$

• Insert $(v_n/\|v_n\|_2)$ into the set $V_n$ at position $i$.

Step 2: Repeat the following steps for $i = 1, \ldots, M_n^B$.

• If $M_n^G < p$, then

– Set
$$v_n \leftarrow Z_n(K_n^B(i)) - \sum_{j=1}^{M_n^G} \bigl(Z_n(K_n^B(i))^T V_n(j)\bigr) V_n(j) \tag{179}$$

– If $\left\|Z_n(K_n^B(i))\right\|_2 \ge \left(\sqrt{3/2}\right)\mu_{Q1}$ and $\|v_n\|_2 \ge \mu_{Q1} \left\|Z_n(K_n^B(i))\right\|_2$, then

∗ Set $M_n^G \leftarrow M_n^G + 1$.

∗ Insert $K_n^B(i)$ into the set $K_n^G$ at position $M_n^G$.

∗ Insert $(v_n/\|v_n\|_2)$ into the set $V_n$ at position $M_n^G$.

– Otherwise, if $\left\|Z_n(K_n^B(i))\right\|_2 < \left(\sqrt{3/2}\right)\mu_{Q1}$ or $\|v_n\|_2 < \mu_{Q1} \left\|Z_n(K_n^B(i))\right\|_2$, then set $M^{tmp} \leftarrow M^{tmp} + 1$ and insert $K_n^B(i)$ into the set $K^{tmp}$ at position $M^{tmp}$.

• Otherwise, if $M_n^G = p$, then set $M^{tmp} \leftarrow M^{tmp} + 1$ and insert $K_n^B(i)$ into the set $K^{tmp}$ at position $M^{tmp}$.

Step 3: Set $K_n^B \leftarrow K^{tmp}$ and $M_n^B \leftarrow M^{tmp}$.
We know from Lemma 3.21 that we can perform Gram-Schmidt orthogonalization on the vectors in $Z_n(K_n^G)$. Step 1 of Algorithm 5 does exactly this. Step 2 checks each scaled regression vector $Z_n(K_n^B(i))$, for $i = 1, \ldots, M_n^B$, to see if it satisfies
$$\left\|Z_n(K_n^B(i))\right\|_2 \ge \left(\sqrt{3/2}\right)\mu_{Q1}
\quad\text{and}\quad
\frac{\left\|\Pi\!\left(Z_n(K_n^B(i)),\,\operatorname{span}\{Z_n(K_n^G)\}^{\perp}\right)\right\|_2}{\left\|Z_n(K_n^B(i))\right\|_2} \ge \mu_{Q1}$$
If it does, then the index $K_n^B(i)$ is moved to $K_n^G$, and the set $V_n$, containing the Gram-Schmidt vectors corresponding to those in $Z_n(K_n^G)$, is updated appropriately. Thus, it is easily seen that after the execution of Algorithm 5, Properties P 3.14 and P 3.15 continue to hold. Further, from Lemma 3.21 and Steps 1 and 2 of Algorithm 5, it can be seen that the sets $Z_n(K_n^G)$ and $V_n$ satisfy the following property (analogous to Property P 3.16) for $\mu_Q = \min\{\mu_{Q1}, (\sqrt{2/3})\mu_L\}$.
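The Gram-Schmidt pass and the acceptance test of Algorithm 5 are straightforward to express in code. The sketch below is an illustrative rendering, not the authors' implementation: the scaled regression vectors are assumed to be given as NumPy arrays, and `mu_Q1` plays the role of $\mu_{Q1}$.

```python
import numpy as np

def extend_basis(Z_good, Z_bad, mu_Q1, p):
    """Sketch of Algorithm 5: Gram-Schmidt orthonormalization of the 'good'
    scaled regression vectors, then a scan of the 'bad' ones for vectors
    that pass the norm and projection tests of Step 2."""
    V = []  # Gram-Schmidt orthonormal vectors (the set V_n)
    # Step 1: orthonormalize the good vectors (linearly independent by Lemma 3.21).
    for z in Z_good:
        v = z - sum((z @ u) * u for u in V)
        V.append(v / np.linalg.norm(v))
    good, leftover = list(Z_good), []
    # Step 2: promote a bad vector when it is long enough and its component
    # orthogonal to span(V) is a large enough fraction of its length.
    for z in Z_bad:
        if len(good) < p:
            v = z - sum((z @ u) * u for u in V)
            if (np.linalg.norm(z) >= np.sqrt(3.0 / 2.0) * mu_Q1
                    and np.linalg.norm(v) >= mu_Q1 * np.linalg.norm(z)):
                good.append(z)
                V.append(v / np.linalg.norm(v))
                continue
        leftover.append(z)
    return good, leftover, V
```

A rejected vector (one nearly parallel to the current span, or too short) ends up in `leftover`, mirroring the transfer of its index back into $K_n^B$ in Step 3.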
P 3.17. 1. The vectors in the set $V_n$ are the Gram-Schmidt orthonormal vectors corresponding to those in $Z_n(K_n^G)$. That is,
$$V_n(1) = \frac{Z_n(K_n^G(1))}{\left\|Z_n(K_n^G(1))\right\|_2} \tag{180}$$
and for $i = 2, \ldots, M_n^G$,
$$V_n(i) = \frac{Z_n(K_n^G(i)) - \sum_{j=1}^{i-1} \bigl(Z_n(K_n^G(i))^T V_n(j)\bigr) V_n(j)}{\left\|Z_n(K_n^G(i)) - \sum_{j=1}^{i-1} \bigl(Z_n(K_n^G(i))^T V_n(j)\bigr) V_n(j)\right\|_2} \tag{181}$$

2. The $M_n^G$ vectors in $Z_n(K_n^G)$ satisfy, for each $i = 1, \ldots, M_n^G$,
$$\left\|Z_n(K_n^G(i))\right\|_2 \ge \left(\sqrt{3/2}\right)\mu_Q \tag{182}$$
and, from Property P 3.13, we get that for $i = 2, \ldots, M_n^G$,
$$\frac{\left\|\Pi\!\left(Z_n(K_n^G(i)),\,\bigl(\operatorname{span}\{Z_n(K_n^G(\{1, \ldots, i-1\}))\}\bigr)^{\perp}\right)\right\|_2}{\left\|Z_n(K_n^G(i))\right\|_2}
= \frac{\left\|Z_n(K_n^G(i)) - \sum_{j=1}^{i-1} \bigl(Z_n(K_n^G(i))^T V_n(j)\bigr) V_n(j)\right\|_2}{\left\|Z_n(K_n^G(i))\right\|_2} \ge \mu_Q \tag{183}$$
From Property P 3.17, it is clear that upon termination of Algorithm 5 we can set the first $M_n^G$ design points to be those in $X_n(K_n^G)$ and satisfy (160) and (161) in Assumption A 18 with $\mu_Q = \min\{\mu_{Q1}, (\sqrt{2/3})\mu_L\}$. However, in order to satisfy Assumption A 18, we require $p$ inner design points satisfying (160) and (161), and it is clearly possible that after the execution of Algorithm 5 we get $M_n^G < p$. Accordingly, we next provide an algorithm, analogous to Algorithm 3, that adds an appropriate point to $X_n$ and increments $M_n^G$ by one such that Properties P 3.14 through P 3.17 continue to hold.
Let us recall the operation of Algorithm 3. There, if $\dim(\operatorname{span}\{Y_n(K_n^G)\}) = M_n^G < l$, we merely found a unit vector $y_n \in \operatorname{span}\{Y_n(K_n^G)\}^{\perp}$ and inserted $x_n + \delta_n y_n$ into $X_n$. Now, if $\dim(\operatorname{span}\{Z_n(K_n^G)\}) = M_n^G < p$, we could, in the same manner, find a unit vector $w_n \in \operatorname{span}\{Z_n(K_n^G)\}^{\perp}$. However, in this case, in order to be able to insert a point into $X_n$, $w_n$ has to have a very specific form. That is, we can add a point $x_n + \delta_n y_n$ to $X_n$ only if $w_n \in \operatorname{span}\{Z_n(K_n^G)\}^{\perp}$ is such that
$$w_n = \begin{pmatrix} y_n \\ (y_n)^Q \end{pmatrix}$$
The following two lemmas show that even though we cannot pick an arbitrary $w_n \in \operatorname{span}\{Z_n(K_n^G)\}^{\perp}$, there exists an appropriate vector $z_n = \begin{pmatrix} y_n \\ (y_n)^Q \end{pmatrix}$ sufficiently close to $w_n$ such that $x_n + \delta_n y_n$ can be inserted into $X_n$.
Lemma 3.22. Consider any $w_1 \in \mathbb{R}^p$ with $\|w_1\|_2 = 1$. Then there exists $z \in \mathbb{R}^p$, given by $z = \begin{pmatrix} y \\ y^Q \end{pmatrix}$ for some $y \in \mathbb{R}^l$, such that $\|y\|_2 \le 1$ and
$$\left|z^T w_1\right| = \left|\begin{pmatrix} y^T & (y^Q)^T \end{pmatrix} w_1\right| \ge \frac{1}{2\sqrt{p}} \tag{184}$$
Further, given any such $z \in \mathbb{R}^p$ and a subspace $\mathcal{M} \subset \mathbb{R}^p$ such that $w_1 \in \mathcal{M}^{\perp}$, we have
$$\|z\|_2 \ge \left(\sqrt{3/2}\right)\left(\frac{1}{\sqrt{6p}}\right)
\quad\text{and}\quad
\frac{\left\|\Pi\bigl(z, \mathcal{M}^{\perp}\bigr)\right\|_2}{\|z\|_2} \ge \frac{1}{\sqrt{6p}}$$
Proof. Define the vector $w \in \mathbb{R}^p$ as
$$w := \frac{w_1}{\|w_1\|_\infty}$$
Then clearly $\|w\|_\infty = 1$. Now, let the components of $w \in \mathbb{R}^p$ be as follows:
$$w^T = \begin{pmatrix} u_1, \ldots, u_l, v_{12}, v_{13}, v_{23}, \ldots, v_{1l}, \ldots, v_{(l-1)l}, v_{11}, \ldots, v_{ll} \end{pmatrix}$$
Then, since $\|w\|_\infty = 1$, one of the following cases must hold:

(a) $|u_{j^*}| = 1$ for some $j^* \in \{1, \ldots, l\}$.

(b) $|v_{j^*j^*}| = 1$ for some $j^* \in \{1, \ldots, l\}$.

(c) $|v_{j^*k^*}| = 1$ for some $j^*, k^* \in \{1, \ldots, l\}$ with $j^* \neq k^*$.

For any $y \in \mathbb{R}^l$, let the components of $y$ be given by $y^T = \begin{pmatrix} y_1 & \cdots & y_l \end{pmatrix}$. Then, from (16), we get that
$$\begin{pmatrix} y^T & (y^Q)^T \end{pmatrix} = \begin{pmatrix} y_1, \ldots, y_l, y_1 y_2, y_1 y_3, y_2 y_3, \ldots, y_1 y_l, \ldots, y_{l-1} y_l, \tfrac{1}{\sqrt{2}} y_1^2, \ldots, \tfrac{1}{\sqrt{2}} y_l^2 \end{pmatrix}$$
Thus, for any $y \in \mathbb{R}^l$, we get that
$$\begin{pmatrix} y^T & (y^Q)^T \end{pmatrix} w = \sum_{j=1}^{l} \left(u_j y_j + \frac{1}{\sqrt{2}} v_{jj} y_j^2\right) + \sum_{k=2}^{l} \sum_{j=1}^{k-1} v_{jk}\, y_j y_k$$
Also, we will use the following inequality repeatedly in our proof:
$$\max\{|\alpha_j| : j = 1, \ldots, N\} \ge \sqrt{\frac{\sum_{j=1}^{N} |\alpha_j|^2}{N}} \quad\text{for any } \alpha_1, \ldots, \alpha_N \in \mathbb{R} \tag{185}$$
Now, let us consider the three cases listed earlier, in order.
(a) Suppose that $|u_{j^*}| = 1$ for some $j^* \in \{1, \ldots, l\}$. Then consider the following two points $y^1, y^2 \in \mathbb{R}^l$:
$$(y^1)^T = \begin{pmatrix} y_1^1 & \cdots & y_l^1 \end{pmatrix} \text{ where } y_{j^*}^1 = 1 \text{ and } y_j^1 = 0 \text{ for } j \neq j^*$$
$$(y^2)^T = \begin{pmatrix} y_1^2 & \cdots & y_l^2 \end{pmatrix} \text{ where } y_{j^*}^2 = -1 \text{ and } y_j^2 = 0 \text{ for } j \neq j^* \tag{186}$$
Clearly, $\left\|y^1\right\|_2 = \left\|y^2\right\|_2 = 1$. Now, using the fact that $|u_{j^*}| = 1$, we get
$$\max\left\{\left|\begin{pmatrix} (y^1)^T & ((y^1)^Q)^T \end{pmatrix} w\right|, \left|\begin{pmatrix} (y^2)^T & ((y^2)^Q)^T \end{pmatrix} w\right|\right\}
= \max\left\{\left|u_{j^*} + \frac{v_{j^*j^*}}{\sqrt{2}}\right|, \left|-u_{j^*} + \frac{v_{j^*j^*}}{\sqrt{2}}\right|\right\}
\ge \sqrt{|u_{j^*}|^2 + \frac{v_{j^*j^*}^2}{2}} \quad\text{(using (185))}
\;\ge\; |u_{j^*}| = 1$$
(b) Suppose that $|v_{j^*j^*}| = 1$ for some $j^* \in \{1, \ldots, l\}$. Then again we consider the points $y^1, y^2 \in \mathbb{R}^l$ as defined in (186). Using the fact that $|v_{j^*j^*}| = 1$, we get
$$\max\left\{\left|\begin{pmatrix} (y^1)^T & ((y^1)^Q)^T \end{pmatrix} w\right|, \left|\begin{pmatrix} (y^2)^T & ((y^2)^Q)^T \end{pmatrix} w\right|\right\}
= \max\left\{\left|u_{j^*} + \frac{v_{j^*j^*}}{\sqrt{2}}\right|, \left|-u_{j^*} + \frac{v_{j^*j^*}}{\sqrt{2}}\right|\right\}
\ge \sqrt{|u_{j^*}|^2 + \frac{v_{j^*j^*}^2}{2}} \quad\text{(using (185))}
\;\ge\; \frac{|v_{j^*j^*}|}{\sqrt{2}} = \frac{1}{\sqrt{2}}$$
(c) Finally, suppose that $|v_{j^*k^*}| = 1$ for some $j^*, k^* \in \{1, \ldots, l\}$ with $j^* \neq k^*$. Then we consider the following four points:
$$(y^1)^T = \begin{pmatrix} y_1^1 & \cdots & y_l^1 \end{pmatrix} \text{ where } y_{j^*}^1 = \tfrac{1}{\sqrt{2}},\; y_{k^*}^1 = \tfrac{1}{\sqrt{2}} \text{ and } y_j^1 = 0 \text{ for } j \neq j^*, k^*$$
$$(y^2)^T = \begin{pmatrix} y_1^2 & \cdots & y_l^2 \end{pmatrix} \text{ where } y_{j^*}^2 = -\tfrac{1}{\sqrt{2}},\; y_{k^*}^2 = -\tfrac{1}{\sqrt{2}} \text{ and } y_j^2 = 0 \text{ for } j \neq j^*, k^*$$
$$(y^3)^T = \begin{pmatrix} y_1^3 & \cdots & y_l^3 \end{pmatrix} \text{ where } y_{j^*}^3 = \tfrac{1}{\sqrt{2}},\; y_{k^*}^3 = -\tfrac{1}{\sqrt{2}} \text{ and } y_j^3 = 0 \text{ for } j \neq j^*, k^*$$
$$(y^4)^T = \begin{pmatrix} y_1^4 & \cdots & y_l^4 \end{pmatrix} \text{ where } y_{j^*}^4 = -\tfrac{1}{\sqrt{2}},\; y_{k^*}^4 = \tfrac{1}{\sqrt{2}} \text{ and } y_j^4 = 0 \text{ for } j \neq j^*, k^*$$
Again, it is easily seen that $\left\|y^1\right\|_2 = \left\|y^2\right\|_2 = \left\|y^3\right\|_2 = \left\|y^4\right\|_2 = 1$. Now, using the fact that $|v_{j^*k^*}| = 1$, we get
$$\begin{aligned}
&\max\left\{\left|\begin{pmatrix} (y^i)^T & ((y^i)^Q)^T \end{pmatrix} w\right| : i = 1, \ldots, 4\right\} \\
&\quad= \max\Biggl\{\left|\frac{v_{j^*j^*} + v_{k^*k^*}}{2\sqrt{2}} + \frac{v_{j^*k^*}}{2} + \frac{u_{j^*} + u_{k^*}}{\sqrt{2}}\right|,\;
\left|\frac{v_{j^*j^*} + v_{k^*k^*}}{2\sqrt{2}} + \frac{v_{j^*k^*}}{2} - \frac{u_{j^*} + u_{k^*}}{\sqrt{2}}\right|, \\
&\qquad\qquad \left|\frac{v_{j^*j^*} + v_{k^*k^*}}{2\sqrt{2}} - \frac{v_{j^*k^*}}{2} + \frac{u_{j^*} - u_{k^*}}{\sqrt{2}}\right|,\;
\left|\frac{v_{j^*j^*} + v_{k^*k^*}}{2\sqrt{2}} - \frac{v_{j^*k^*}}{2} - \frac{u_{j^*} - u_{k^*}}{\sqrt{2}}\right|\Biggr\} \\
&\quad\ge \sqrt{\frac{(v_{j^*j^*} + v_{k^*k^*})^2}{8} + \frac{v_{j^*k^*}^2}{4} + \frac{u_{j^*}^2 + u_{k^*}^2}{2}} \quad\text{(using (185))} \\
&\quad\ge \frac{|v_{j^*k^*}|}{2} = \frac{1}{2}
\end{aligned}$$
Therefore, we see that in all three cases there exists a vector $z = \begin{pmatrix} y \\ y^Q \end{pmatrix}$ such that $\|y\|_2 \le 1$ and
$$\left|z^T w\right| = \left|\begin{pmatrix} y^T & (y^Q)^T \end{pmatrix} w\right| \ge \frac{1}{2}$$
Thus, since $\|w_1\|_2 = 1$,
$$\left|z^T w_1\right| = \|w_1\|_\infty \left|z^T w\right| \ge \frac{\|w_1\|_\infty}{2} \ge \frac{\|w_1\|_2}{2\sqrt{p}} = \frac{1}{2\sqrt{p}}$$
Next, since $w_1 \in \mathcal{M}^{\perp}$ and $\|w_1\|_2 = 1$, let us extend it to an orthonormal basis for $\mathcal{M}^{\perp}$ given by $\{w_1, w_2, \ldots, w_k\} \subset \mathcal{M}^{\perp}$. Let $z \in \mathbb{R}^p$ be such that
$$z = \begin{pmatrix} y \\ y^Q \end{pmatrix} \text{ with } y \in \mathbb{R}^l \text{ and } \|y\|_2 \le 1$$
Now, we know from Property P 3.8 of the projection operation that
$$\Pi\bigl(z, \mathcal{M}^{\perp}\bigr) = \sum_{j=1}^{k} \bigl(z^T w_j\bigr) w_j$$
Therefore, we get
$$\left\|\Pi\bigl(z, \mathcal{M}^{\perp}\bigr)\right\|_2 = \sqrt{\sum_{j=1}^{k} \bigl(z^T w_j\bigr)^2} \ge \left|z^T w_1\right| \ge \frac{1}{2\sqrt{p}}$$
Therefore, using Property P 3.6, we get that
$$\|z\|_2 \ge \left\|\Pi\bigl(z, \mathcal{M}^{\perp}\bigr)\right\|_2 \ge \frac{1}{2\sqrt{p}} = \left(\sqrt{3/2}\right)\left(\frac{1}{\sqrt{6p}}\right)$$
Further, since $\|y\|_2 \le 1$, we get from (17) that
$$\|z\|_2 = \sqrt{\|y\|_2^2 + \frac{1}{2}\|y\|_2^4} \le \sqrt{\frac{3}{2}}$$
Consequently, we get
$$\frac{\left\|\Pi\bigl(z, \mathcal{M}^{\perp}\bigr)\right\|_2}{\|z\|_2} \ge \frac{\left(\frac{1}{2\sqrt{p}}\right)}{\left(\sqrt{\frac{3}{2}}\right)} = \frac{1}{\sqrt{6p}}$$
Lemma 3.22 motivates the following algorithm that adds a single appropriate point to Xn.
Algorithm 6. Let us assume that we have knowledge of the design region radius $\delta_n$, the sets $X_n$, $Y_n$, $Z_n$, $N_n$, $F_n$, $\Phi_n$ and $V_n$, and the index sets $K_n^G$, $K_n^B$ and $K_n^O$ such that Properties P 3.14 through P 3.17 are satisfied. Let $l \le M_n^G < p$ and let $\{e_1, \ldots, e_p\}$ denote the standard basis for $\mathbb{R}^p$.

Step 1: Find an $i^* \in \{1, \ldots, p\}$ such that
$$v_n = e_{i^*} - \sum_{j=1}^{M_n^G} \bigl((e_{i^*})^T V_n(j)\bigr) V_n(j)$$
satisfies $\|v_n\|_2 > 0$.

Step 2: Compute $z_n = \begin{pmatrix} y_n \\ (y_n)^Q \end{pmatrix} \in \mathbb{R}^p$ with $y_n \in \mathbb{R}^l$ such that $\|y_n\|_2 \le 1$ and
$$\left|z_n^T \left(\frac{v_n}{\|v_n\|_2}\right)\right| \ge \frac{1}{2\sqrt{p}} \tag{187}$$

Step 3: Set $M_n \leftarrow M_n + 1$.

Step 4: Insert $y_n$ into $Y_n$, $z_n$ into $Z_n$ and $x_n + \delta_n y_n$ into $X_n$, all at position $M_n$.

Step 5: Insert the value 0 into $F_n$, $\Phi_n$ and $N_n$, all at position $M_n$.

Step 6: Set $M_n^G \leftarrow M_n^G + 1$ and insert $M_n$ into the set $K_n^G$ at position $M_n^G$.

Step 7: Set
$$v_n \leftarrow z_n - \sum_{j=1}^{M_n^G - 1} \bigl(z_n^T V_n(j)\bigr) V_n(j)$$

Step 8: Insert $(v_n/\|v_n\|_2)$ into $V_n$ at position $M_n^G$.
Note: In Step 2, we could potentially calculate the vector $z_n = \begin{pmatrix} y_n \\ (y_n)^Q \end{pmatrix}$ as
$$z_n \in \arg\max\left\{\left|z^T \left(\frac{v_n}{\|v_n\|_2}\right)\right| : z = \begin{pmatrix} y \\ y^Q \end{pmatrix} \text{ where } y \in \mathbb{R}^l \text{ and } \|y\|_2 \le 1\right\} \tag{188}$$
If $z_n$ is chosen as in (188), then clearly, from Lemma 3.22, (187) is satisfied. However, to find $z_n$ as in (188), we would need to solve two nonlinearly constrained quadratic optimization problems, which could be computationally expensive. Alternatively, we can find an appropriate vector $z_n$ satisfying (187) by a simple enumeration, as shown in the proof of Lemma 3.22.
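That enumeration is cheap: there are only $2l + 4\binom{l}{2}$ candidate vectors $y$, namely $\pm e_j$ and $(\pm e_j \pm e_k)/\sqrt{2}$. The sketch below is a hypothetical implementation; the coordinate ordering of the lifted vector $z = (y; y^Q)$ (cross terms $y_j y_k$ for $j < k$, then squares $y_j^2/\sqrt{2}$) is an assumption matching the layout shown for (16).

```python
import numpy as np

def lift(y):
    """Map y in R^l to z = (y; y^Q) in R^p, p = l(l+3)/2. The coordinate
    order (cross terms y_j*y_k for j < k, then squares y_j^2/sqrt(2)) is
    an assumption matching the layout of (16)."""
    l = len(y)
    cross = [y[j] * y[k] for k in range(1, l) for j in range(k)]
    squares = [y[j] ** 2 / np.sqrt(2.0) for j in range(l)]
    return np.concatenate([y, cross, squares])

def enumerate_z(w1):
    """For a unit vector w1 in R^p, find y with ||y||_2 <= 1 such that
    |lift(y)^T w1| >= 1/(2 sqrt(p)), by trying the candidates from the
    proof of Lemma 3.22 and keeping the best one."""
    p = len(w1)
    l = int(round((-3 + np.sqrt(9 + 8 * p)) / 2))  # invert p = l(l+3)/2
    candidates = []
    for j in range(l):
        e = np.zeros(l)
        e[j] = 1.0
        candidates += [e, -e]
        for k in range(j + 1, l):
            for sj in (1.0, -1.0):
                for sk in (1.0, -1.0):
                    y = np.zeros(l)
                    y[j], y[k] = sj / np.sqrt(2.0), sk / np.sqrt(2.0)
                    candidates.append(y)
    return max(candidates, key=lambda y: abs(lift(y) @ w1))
```

By the case analysis in the proof, the best candidate attains $|z^T w_1| \ge \|w_1\|_\infty / 2 \ge 1/(2\sqrt{p})$, so this routine can serve as Step 2 of Algorithm 6.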
The following lemma shows that, as long as $M_n^G < p$ before the execution of Algorithm 6, the algorithm successfully inserts an appropriate point into $X_n$ such that Properties P 3.14, P 3.15 and P 3.17 continue to hold.
Lemma 3.23. Suppose that the sets $X_n$, $Y_n$, $Z_n$, $N_n$, $F_n$, $\Phi_n$ and $V_n$ and the index sets $K_n^G$, $K_n^B$ and $K_n^O$ satisfy Properties P 3.14, P 3.15 and P 3.17 for $\mu_Q = \min\{(\sqrt{2/3})\mu_L, \mu_{Q1}, 1/\sqrt{6p}\}$, and that $M_n^G < p$ before the execution of Algorithm 6. Then the algorithm is well defined and, upon its termination, an appropriate point is added to $X_n$ and $M_n^G$ is incremented by one such that Properties P 3.14 and P 3.15 continue to hold and Property P 3.17 holds for $\mu_Q = \min\{(\sqrt{2/3})\mu_L, \mu_{Q1}, 1/\sqrt{6p}\}$.
Proof. First we show that, as long as $M_n^G < p$ before the execution of Algorithm 6, all its steps are well defined. Let us begin with Step 1. From our assumptions, we know that $M_n^G < p$ before the execution of the algorithm. Since Property P 3.17 holds before the execution of the algorithm, we know from the first part of this property that $V_n$ contains a set of $M_n^G < p$ Gram-Schmidt vectors corresponding to those in $Z_n(K_n^G)$. Therefore, $\dim(\operatorname{span}\{V_n\}) = \dim(\operatorname{span}\{Z_n(K_n^G)\}) = M_n^G < p$. Since $\dim(\operatorname{span}\{e_1, \ldots, e_p\}) = p$, clearly there exists $i^* \in \{1, \ldots, p\}$ such that $e_{i^*} \notin \operatorname{span}\{V_n\}$, i.e., such that
$$\|v_n\|_2 = \left\|e_{i^*} - \sum_{j=1}^{M_n^G} \bigl((e_{i^*})^T V_n(j)\bigr) V_n(j)\right\|_2 = \left\|\Pi\!\left(e_{i^*},\, \bigl(\operatorname{span}\{Z_n(K_n^G)\}\bigr)^{\perp}\right)\right\|_2 > 0$$
Therefore, Step 1 is well defined. We know from Lemma 3.22 that there exists $y_n \in \mathbb{R}^l$ with $\|y_n\|_2 \le 1$ such that (187) holds. Therefore, Step 2 in Algorithm 6 is well defined. It is trivial to see that the rest of the steps in the algorithm are well defined, and consequently that the algorithm itself is well defined.
Next we show that if Properties P 3.14, P 3.15 and P 3.17 were satisfied before the execution of Algorithm 6, they continue to be satisfied after its execution. It is clear that upon termination of Algorithm 6, $M_n$ is incremented by one and $Y_n(M_n) = y_n$, $Z_n(M_n) = \begin{pmatrix} y_n \\ (y_n)^Q \end{pmatrix}$ and $X_n(M_n) = x_n + \delta_n y_n$. Since no other changes are made to $X_n$, $Y_n$ and $Z_n$, and Property P 3.14 was assumed to hold before the algorithm's execution, it is clear that this property continues to hold. Further, we also know that Algorithm 6 increments $M_n^G$ by one and sets $K_n^G(M_n^G) = M_n$. It is easily seen that the incremented value of $M_n$ cannot lie in either $K_n^B$ or $K_n^O$. Thus, since a new entry ($M_n$) is inserted only into $K_n^G$ and no changes are made to $K_n^B$ or $K_n^O$, the first part of Property P 3.15 continues to hold after the execution of Algorithm 6. Finally, since $y_n$ as defined in (187) in Step 2 satisfies $\|y_n\|_2 \le 1$, it is also clear that the second part of Property P 3.15 continues to hold.
Next, let us show that if Property P 3.17 holds before the execution of Algorithm 6 for $\mu_Q = \min\{(\sqrt{2/3})\mu_L, \mu_{Q1}, 1/\sqrt{6p}\}$, then the property holds after the execution of Algorithm 6 for the same value of $\mu_Q$. Note again that upon termination of the algorithm, the value of $M_n^G$ is incremented by one with $Z_n(K_n^G(M_n^G)) = z_n$ and, from Step 8,
$$V_n(M_n^G) = \frac{Z_n(K_n^G(M_n^G)) - \sum_{j=1}^{M_n^G - 1} \bigl(Z_n(K_n^G(M_n^G))^T V_n(j)\bigr) V_n(j)}{\left\|Z_n(K_n^G(M_n^G)) - \sum_{j=1}^{M_n^G - 1} \bigl(Z_n(K_n^G(M_n^G))^T V_n(j)\bigr) V_n(j)\right\|_2} \tag{189}$$
Since the first part of Property P 3.17 held before the execution of the algorithm, clearly after its execution (180) holds and (181) holds for $i = 2, \ldots, M_n^G - 1$. From (189) it is clear that (181) holds for $i = M_n^G$. Therefore, the first part of Property P 3.17 continues to hold after the execution of Algorithm 6.
Similarly, since the second part of Property P 3.17 was assumed to hold before the execution of Algorithm 6, we know that after its execution, (182) and (183) hold for $i = 1, \ldots, M_n^G - 1$ and $\mu_Q = \min\{(\sqrt{2/3})\mu_L, \mu_{Q1}, 1/\sqrt{6p}\}$. Thus, it only remains for us to show that (182) and (183) hold for $i = M_n^G$ and the same value of $\mu_Q$. Now, we know that after the algorithm is executed and $M_n^G$ is incremented, the vector $v_n$ computed in Step 1 of the algorithm satisfies
$$v_n = e_{i^*} - \sum_{j=1}^{M_n^G - 1} \bigl((e_{i^*})^T V_n(j)\bigr) V_n(j) \in \bigl(\operatorname{span}\{V_n(\{1, \ldots, M_n^G - 1\})\}\bigr)^{\perp} = \bigl(\operatorname{span}\{Z_n(K_n^G(\{1, \ldots, M_n^G - 1\}))\}\bigr)^{\perp}$$
Thus, $(v_n/\|v_n\|_2)$ used in Step 2 is a unit vector in the subspace $\bigl(\operatorname{span}\{Z_n(K_n^G(\{1, \ldots, M_n^G - 1\}))\}\bigr)^{\perp}$. Since we chose $Z_n(K_n^G(M_n^G)) = z_n$ in Step 2 such that (187) is satisfied, we get, using Lemma 3.22, that
$$\left\|Z_n(K_n^G(M_n^G))\right\|_2 \ge \left(\sqrt{3/2}\right)\left(\frac{1}{\sqrt{6p}}\right) \ge \left(\sqrt{3/2}\right)\mu_Q$$
and
$$\frac{\left\|\Pi\!\left(Z_n(K_n^G(M_n^G)),\, \bigl(\operatorname{span}\{Z_n(K_n^G(\{1, \ldots, M_n^G - 1\}))\}\bigr)^{\perp}\right)\right\|_2}{\left\|Z_n(K_n^G(M_n^G))\right\|_2} \ge \frac{1}{\sqrt{6p}} \ge \mu_Q$$
Therefore, (182) and (183) hold for $i = M_n^G$. Thus, it is clear that Property P 3.17 holds after the execution of Algorithm 6 for $\mu_Q = \min\{(\sqrt{2/3})\mu_L, \mu_{Q1}, 1/\sqrt{6p}\}$.
Finally, we show that if we have the set $X_n$ (and the corresponding $N_n$, $F_n$ and $\Phi_n$) with $M_n \le M_{max}$ at the beginning of each iteration $n \in \mathbb{N}$, we can use Algorithms 2, 3, 5 and 6 to find a set of design points $\{x_n + y_n^i : i = 1, \ldots, M_n\}$ such that, for each $n \in \mathbb{N}$, Assumptions A 16 and A 18 are satisfied.
Algorithm 7. Let us assume that, for each $n \in \mathbb{N}$, we have the sets $X_n$, $N_n$, $F_n$ and $\Phi_n$ with $M_n \le M_{max}$. Write $\bar{K}_n^G$, $\bar{K}_n^B$, $\bar{K}_n^O$ and $\bar{M}_n^G$, $\bar{M}_n^B$, $\bar{M}_n^O$ for the index sets and counters maintained by Algorithms 2 through 6, and initialize them as $\bar{K}_n^G = \bar{K}_n^B = \bar{K}_n^O = \emptyset$ and $\bar{M}_n^G = \bar{M}_n^B = \bar{M}_n^O = 0$. Let $M_{max}^I \ge p$.

Step 1: Run Algorithm 2 once and Algorithm 3 repeatedly (at most $l$ times) until $\bar{M}_n^G = l$, to obtain the sets $Y_n$ and $Z_n$ and the index sets $\bar{K}_n^G$, $\bar{K}_n^B$ and $\bar{K}_n^O$.

Step 2: Run Algorithm 5 once and Algorithm 6 repeatedly (at most $p - l$ times) until $\bar{M}_n^G = p$.

Step 3: Set $K_n^G \leftarrow \bar{K}_n^G$ and $M_n^G \leftarrow \bar{M}_n^G$.

Step 4: Set the set $K_n^B$, with $0 \le M_n^B \le \min\{\bar{M}_n^B, M_{max}^I - p\}$ elements, such that it contains a subset of the elements in $\bar{K}_n^B$ in some order (the order can be chosen arbitrarily). For example, we may set $K_n^B \leftarrow \emptyset$, or $K_n^B \leftarrow \bar{K}_n^B$, or $K_n^B \leftarrow \{i \in \bar{K}_n^B : N_n(\bar{K}_n^B(i)) \ge \alpha N_n^0\}$ for some constant $\alpha \in (0, 1]$.

Step 5: Set the set $K_n^O$, with $0 \le M_n^O \le \bar{M}_n^O$ elements, such that it contains a subset of the elements in $\bar{K}_n^O$ in some order (the order can be chosen arbitrarily). For example, we may set $K_n^O \leftarrow \emptyset$, or $K_n^O \leftarrow \bar{K}_n^O$, or $K_n^O \leftarrow \{i \in \bar{K}_n^O : \|Y_n(\bar{K}_n^O(i))\|_2 \le \alpha_\delta\}$ for some constant $\alpha_\delta > 1$.

Step 6: For each $i = 1, \ldots, M_n^G$, set $x_n + y_n^i \leftarrow X_n(K_n^G(i))$.

Step 7: If $M_n^B > 0$, then

• For each $i = M_n^G + 1, \ldots, M_n^G + M_n^B$, set $x_n + y_n^i \leftarrow X_n(K_n^B(i - M_n^G))$.

Step 8: If $M_n^O > 0$, then

• For each $i = M_n^G + M_n^B + 1, \ldots, M_n^G + M_n^B + M_n^O$, set $x_n + y_n^i \leftarrow X_n(K_n^O(i - M_n^G - M_n^B))$.
If Algorithm 7 is run for each $n \in \mathbb{N}$, then it is easily seen that the following statements hold.

1. The number of entries in $K_n^G$ is equal to $p$, and the number of entries in $K_n^B$ is at most $M_{max}^I - p$. Thus, since $X_n(K_n^G) \cup X_n(K_n^B)$ forms the set of inner design points, it is clear that there exist at most $M_{max}^I$ inner design points for each $n \in \mathbb{N}$. Therefore, Assumption A 16 is satisfied for each $n \in \mathbb{N}$.

2. From Lemma 3.23, we know that Properties P 3.14, P 3.15 and P 3.17 are satisfied after the execution of Steps 1 and 2 in Algorithm 7. Thus, from Property P 3.17, we know that the $p$ vectors in $Z_n(K_n^G)$ satisfy (182) and (183), and from Property P 3.14, we know that these vectors are the scaled regression vectors corresponding to the points in $X_n(K_n^G)$. Also, we note from Steps 3 and 6 of Algorithm 7 that the points in $X_n(K_n^G)$ are chosen to be the first $p$ design points. Therefore, from (182) and (183), we get that (160) and (161) are satisfied by the first $p$ design points, and consequently Assumption A 18 is satisfied.

Thus, for each $n \in \mathbb{N}$, as long as the design points are chosen using Algorithm 7 and the weights $\{w_n^i : i = 1, \ldots, M_n^I\}$ are chosen so as to satisfy Assumption A 15, we get from (98), (105) and (162) that (163) holds, i.e., (58) in Assumption A 11 holds with $K_\lambda^I = K_w^I M_{max}^I / \bigl(\mu_Q^2 S_p(\mu_Q)\bigr)$ for each $n \in \mathbb{N}$, where $\mu_Q = \min\{(\sqrt{2/3})\mu_L, \mu_{Q1}, 1/\sqrt{6p}\}$.
Finally, in this section, we consider (35) in Assumption A 7 and (57) in Assumption A 11. Both of these conditions require all the design points $\{x_n + y_n^i : i = 1, \ldots, M_n \text{ and } n \in \mathbb{N}\}$ to lie in a compact set $\bar{C} \subset \mathcal{E}$. Let us see how this can be satisfied. Suppose that the following assumption holds.

A 19. There exists a compact set $C \subseteq \mathcal{X}$ such that $\{x_n\}_{n \in \mathbb{N}} \subset C$.

Then, since $\mathcal{X} \subset \mathcal{E}$, we get that $C \subset \mathcal{E}$. Since $C$ is compact and $\mathcal{E}$ is open, we get from Lemma 2.1 in ? that there exist $\delta^* > 0$ and a compact set
$$\bar{C} := \left\{x = x^1 + x^2 : x^1 \in C \text{ and } \left\|x^2\right\|_2 \le \delta^*\right\}$$
such that $\bar{C} \subset \mathcal{E}$. Now, suppose we choose our design region radii such that $\delta_n \le \delta^*$ for each $n \in \mathbb{N}$. Then it is clear that for each $n \in \mathbb{N}$, we have $\{x_n + y_n^i : i = 1, \ldots, M_n^I\} \subset \bar{C} \subset \mathcal{E}$. Further, note that in Algorithms 4 and 7, we never evaluate sample averages at new design points that lie outside the design region $D_n$ for any $n \in \mathbb{N}$. Instead, the only outer design points we include at each iteration $n$ are the points in $X_n(K_n^O)$, selected from points already in $X_n$. Now, recall our description of the use of $X_n$ in our trust region algorithms. For $n = 1$, $X_n$ starts as an empty set. As and when we evaluate sample averages at new inner design points, we insert these points into $X_n$. Then, at the end of each iteration $n$, the set $X_{n+1}$ is made up of at most $M_{max}$ points appropriately chosen from $X_n$. Thus, in our algorithms, for any $n \in \mathbb{N}$ and $i = 1, \ldots, M_n^O$, $X_n(K_n^O(i))$ was an inner design point at an earlier iteration, i.e., $X_n(K_n^O(i)) \in D_j$ for some $j \in \{1, \ldots, n-1\}$. Consequently, it is easy to see that under Assumption A 19, if for each $n \in \mathbb{N}$ we choose $\delta_n \le \delta^*$, use either Algorithm 4 or Algorithm 7 to pick $\{x_n + y_n^i : i = 1, \ldots, M_n\}$, and update the set $X_{n+1}$ using points only from $X_n$, then (35) in Assumption A 7 and (57) in Assumption A 11 are satisfied.
3.2.2 Sample Sizes

The conditions imposed on the sample sizes are stated in Assumption A 5 for Theorem 3.2 and in Assumption A 10 for Theorem 3.7. In order to set the sample sizes $\{N_n^i : i = 1, \ldots, M_n\}$, we assume that we know the sample size $N_n^0$ for each $n \in \mathbb{N}$ and that $\{N_n^0\}_{n \in \mathbb{N}}$ is a monotonically increasing sequence with $N_n^0 \to \infty$ as $n \to \infty$. In Section ??, when we describe a typical trust region algorithm that uses the regression model function, we will show how such a sequence $\{N_n^0\}_{n \in \mathbb{N}}$ can be adaptively generated.

First, note that Assumption A 5 in Theorem 3.2 requires that
$$\lim_{n \to \infty} \min_{i \in \{1, \ldots, M_n^I\}} N_n^i = \infty$$
That is, the minimum sample size used to evaluate sample averages at the inner design points must increase to infinity as $n \to \infty$. Recall that in the previous section, in order to satisfy (36) from Assumption A 7 in Theorem 3.2, we picked the design points using Algorithm 4. In Algorithm 4, we picked our inner design points to be those in $X_n(K_n^G) \cup X_n(K_n^B)$. Accordingly, we now provide an algorithm to update the corresponding entries in $N_n$, $F_n$ and $\Phi_n$ so that Assumption A 5 may be satisfied.
Algorithm 8. Let us assume that we have the sample size $N_n^0$, the sets $N_n$, $F_n$ and $\Phi_n$, and the index sets $K_n^G$ (with $M_n^G = l$), $K_n^B$ and $K_n^O$ obtained after the execution of Algorithm 4. Initialize the temporary index set $K^{tmp} = \emptyset$ and $M^{tmp} = 0$, and choose a constant $\alpha \in (0, 1]$. For any $a \in \mathbb{R}$, let $\lceil a \rceil$ denote the smallest integer greater than or equal to $a$.

Step 1: First set $K^{tmp} \leftarrow K_n^G$ and $M^{tmp} \leftarrow M_n^G$ and repeat the following steps for each $i = 1, \ldots, M^{tmp}$. Then set $K^{tmp} \leftarrow K_n^B$ and $M^{tmp} \leftarrow M_n^B$ and again repeat the following steps for $i = 1, \ldots, M^{tmp}$.

• If $N_n(K^{tmp}(i)) < \alpha N_n^0$, then

– Set $N \leftarrow \lceil \alpha N_n^0 \rceil$.

– If $N_n(K^{tmp}(i)) > 0$, then compute $f(X_n(K^{tmp}(i)), N)$ using $F_n(K^{tmp}(i))$ in (164), and $\sigma(X_n(K^{tmp}(i)), N)$ using $N_n(K^{tmp}(i))$, $F_n(K^{tmp}(i))$, $f(X_n(K^{tmp}(i)), N)$ and $\Phi_n(K^{tmp}(i))$ in (170).

– If $N_n(K^{tmp}(i)) = 0$, then compute $f(X_n(K^{tmp}(i)), N)$ and $\sigma(X_n(K^{tmp}(i)), N)$ using (8) and (169), respectively.

– Set $N_n(K^{tmp}(i)) \leftarrow N$, $F_n(K^{tmp}(i)) \leftarrow f(X_n(K^{tmp}(i)), N)$ and $\Phi_n(K^{tmp}(i)) \leftarrow \sigma(X_n(K^{tmp}(i)), N)$.

Step 2: Set $N_n^i \leftarrow N_n(K_n^G(i))$ for each $i = 1, \ldots, M_n^G$.

Step 3: If $M_n^B > 0$, set $N_n^i \leftarrow N_n(K_n^B(i - M_n^G))$ for each $i = M_n^G + 1, \ldots, M_n^G + M_n^B$.

Step 4: If $M_n^O > 0$, set $N_n^i \leftarrow N_n(K_n^O(i - M_n^G - M_n^B))$ for each $i = M_n^G + M_n^B + 1, \ldots, M_n^G + M_n^B + M_n^O$.
For each $n \in \mathbb{N}$, upon execution of Algorithm 8, it is easily seen that $N_n^i \ge \alpha N_n^0$ for each $i = 1, \ldots, M_n^I$. Therefore, if we ensure that $N_n^0 \to \infty$ as $n \to \infty$, then clearly Assumption A 5 is satisfied.
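The bookkeeping in Algorithm 8 reduces to a simple floor on the recorded sample sizes. The following sketch is an illustration under the assumption that the stored averages are refreshed by a user-supplied callback, which stands in for the updates via (164) and (170):

```python
import math

def raise_sample_sizes(N, alpha, N0, refresh):
    """Raise every recorded sample size below alpha*N0 to ceil(alpha*N0).
    `refresh(i, target)` is a placeholder for recomputing the stored sample
    average F_n(i) and variance estimate Phi_n(i) at the new sample size."""
    target = math.ceil(alpha * N0)
    for i in range(len(N)):
        if N[i] < alpha * N0:
            refresh(i, target)  # update the stored averages before recording the size
            N[i] = target
    return N

# Example: sizes 10 and 0 fall below 0.5 * 64 = 32 and are raised.
sizes = raise_sample_sizes([10, 50, 0], 0.5, 64, lambda i, n: None)
```

After the pass, every entry satisfies $N_n^i \ge \alpha N_n^0$, which is all that is needed for Assumption A 5 once $N_n^0 \to \infty$.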
Next, in order to satisfy Assumption A 10, we have to set $N_n^i = N_n^{\#}$ for each $i = 1, \ldots, M_n$ and $n \in \mathbb{N}$, where $N_n^{\#} \to \infty$ as $n \to \infty$. In the previous section, we used Algorithm 7 to pick the design points so as to satisfy (58) from Assumption A 11 in Theorem 3.7. Recall that upon execution of Algorithm 7, the inner design points were set to be those in $X_n(K_n^G) \cup X_n(K_n^B)$, and the outer design points were those in $X_n(K_n^O)$. Therefore, we next provide an algorithm that updates the corresponding elements of $N_n$, $F_n$ and $\Phi_n$.
Algorithm 9. Let us assume that we have the sample size $N_n^0$, the sets $N_n$, $F_n$ and $\Phi_n$, and the index sets $K_n^G$ (with $M_n^G = p$), $K_n^B$ and $K_n^O$ obtained after the execution of Algorithm 7. Initialize the temporary index set $K^{tmp} = \emptyset$ and $M^{tmp} = 0$, and choose a constant $\alpha \in (0, 1]$. For any $a \in \mathbb{R}$, let $\lceil a \rceil$ denote the smallest integer greater than or equal to $a$.

Step 1: Set
$$N_n^{\#} \leftarrow \max\left\{\lceil \alpha N_n^0 \rceil,\; \max\{N_n(k) : k \in K_n^G \cup K_n^B \cup K_n^O\}\right\}$$

Step 2: First set $K^{tmp} \leftarrow K_n^G$ and $M^{tmp} \leftarrow M_n^G$ and repeat the following steps for each $i = 1, \ldots, M^{tmp}$. Then set $K^{tmp} \leftarrow K_n^B$ and $M^{tmp} \leftarrow M_n^B$ and again repeat these steps for each $i = 1, \ldots, M^{tmp}$. Finally, set $K^{tmp} \leftarrow K_n^O$ and $M^{tmp} \leftarrow M_n^O$ and repeat these steps for each $i = 1, \ldots, M^{tmp}$.

• If $N_n(K^{tmp}(i)) < N_n^{\#}$, then

– If $N_n(K^{tmp}(i)) > 0$, then compute $f(X_n(K^{tmp}(i)), N_n^{\#})$ using $F_n(K^{tmp}(i))$ in (164), and $\sigma(X_n(K^{tmp}(i)), N_n^{\#})$ using $N_n(K^{tmp}(i))$, $F_n(K^{tmp}(i))$, $f(X_n(K^{tmp}(i)), N_n^{\#})$ and $\Phi_n(K^{tmp}(i))$ in (170).

– If $N_n(K^{tmp}(i)) = 0$, then compute $f(X_n(K^{tmp}(i)), N_n^{\#})$ and $\sigma(X_n(K^{tmp}(i)), N_n^{\#})$ using (8) and (169), respectively.

– Set $N_n(K^{tmp}(i)) \leftarrow N_n^{\#}$, $F_n(K^{tmp}(i)) \leftarrow f(X_n(K^{tmp}(i)), N_n^{\#})$ and $\Phi_n(K^{tmp}(i)) \leftarrow \sigma(X_n(K^{tmp}(i)), N_n^{\#})$.

Step 3: Set $N_n^i \leftarrow N_n(K_n^G(i))$ for each $i = 1, \ldots, M_n^G$.

Step 4: If $M_n^B > 0$, set $N_n^i \leftarrow N_n(K_n^B(i - M_n^G))$ for each $i = M_n^G + 1, \ldots, M_n^G + M_n^B$.

Step 5: If $M_n^O > 0$, set $N_n^i \leftarrow N_n(K_n^O(i - M_n^G - M_n^B))$ for each $i = M_n^G + M_n^B + 1, \ldots, M_n^G + M_n^B + M_n^O$.
From the definition of $N_n^{\#}$ in Step 1 of the above algorithm, it is clear that before the execution of Step 2, $N_n(k) \le N_n^{\#}$ for each $k \in K_n^G \cup K_n^B \cup K_n^O$. Thus, for each $n \in \mathbb{N}$, upon termination of the algorithm, we clearly get that $N_n^i = N_n^{\#} \ge \alpha N_n^0$ for each $i = 1, \ldots, M_n$. Consequently, since we ensure that $N_n^0 \to \infty$ as $n \to \infty$, if we run Algorithm 9 for each $n \in \mathbb{N}$, then Assumption A 10 is satisfied.
3.2.3 Weights

In this section we consider methods to set the weights $\{w_n^i : i = 1, \ldots, M_n\}$ for each $n \in \mathbb{N}$. Note that the weight matrices corresponding to the inner and outer design points, specified in the matrices $W_n^I$ and $W_n^O$ respectively, occur in (36) and (37) in Assumption A 7 and in (58) and (59) in Assumption A 11. We already showed in Section 3.2.1 how the design points may be picked for each $n \in \mathbb{N}$ so as to satisfy (36) and (58), using Assumption A 15. Here, we provide methods to pick the weights $\{w_n^i : i = 1, \ldots, M_n\}$ for each $n \in \mathbb{N}$ such that the conditions (37) in Assumption A 7 and (59) in Assumption A 11, and also Assumption A 15, may be satisfied.

The intuition behind the two conditions (37) and (59) is as follows. As we mentioned earlier, we introduced the design region $D_n$ for each iteration $n$ in order to be able to control the Euclidean distances $\left\|y_n^i\right\|_2$ of the design points $x_n + y_n^i$ from the iterate $x_n$. Indeed we could, if we wanted, use only points within the design region $D_n$ to determine $\nabla_n f(x_n)$ and $\nabla_n^2 f(x_n)$ for each $n \in \mathbb{N}$. However, as our optimization algorithm progresses, it is likely that some points in the set $X_n$ (i.e., points at which sample averages have been evaluated in earlier iterations) will lie outside the design region $D_n$ for iteration $n$. Since we have already evaluated sample averages at these points, we wish to include them in the determination of $m_n$. However, in order for $\nabla_n f(x_n)$ to better approximate $\nabla f(x_n)$, the inner design points must provide progressively higher contributions to the determination of $\nabla_n f(x_n)$ and $\nabla_n^2 f(x_n)$ than the outer design points. Thus, the conditions (37) and (59) balance these two conflicting requirements by requiring that the outer design points be assigned progressively smaller weights $\{w_n^{M_n^I+1}, \ldots, w_n^{M_n}\}$ as $n \to \infty$, compared to the weights $\{w_n^1, \ldots, w_n^{M_n^I}\}$ assigned to the inner design points. Essentially, these conditions ensure that as $n \to \infty$, the outer design points are ignored and only the points within the design region are used to calculate $\nabla_n f(x_n)$ and $\nabla_n^2 f(x_n)$.
Consider the following method for setting the weights $\{w_n^i : i = 1, \ldots, M_n\}$.

Algorithm 10. Let us suppose that the design points $\{x_n + y_n^i : i = 1, \ldots, M_n\}$ are available to us. Let $r \in \mathbb{R}$ be a positive constant.

Step 1: Define $\pi_n \leftarrow \min_{i \in \{1, \ldots, M_n^I\}} \left\|y_n^i\right\|_2$

Step 2: Set, for each $i = 1, \ldots, M_n$,
$$w_n^i \leftarrow \begin{cases} 1 & \text{for } i \in \{1, \ldots, M_n^I\} \\[4pt] \left(\dfrac{\pi_n^2}{\left\|y_n^i\right\|_2^2}\right) \min\{1, (\delta_n)^r\} & \text{for } i \in \{M_n^I + 1, \ldots, M_n\} \end{cases} \tag{190}$$
Algorithm 10 warrants the following comments.

• In order for (190) to be well defined, we must have $\left\|y_n^i\right\|_2 > 0$ for each $i = M_n^I + 1, \ldots, M_n$. By our convention, since $\{x_n + y_n^i : i = M_n^I + 1, \ldots, M_n\}$ is the set of outer design points, we have $\left\|y_n^i\right\|_2 > \delta_n$ for each $i = M_n^I + 1, \ldots, M_n$. Therefore, as long as $\delta_n > 0$, Algorithm 10 is well defined.

• It is easily seen that $w_n^i < 1$ for each $i = M_n^I + 1, \ldots, M_n$ and $n \in \mathbb{N}$. Thus, the outer design points always get a lower weight than the inner design points. Further, it is easy to see that if $\delta_n \to 0$ as $n \to \infty$, then the weights given to the outer design points decrease to zero as $n \to \infty$.
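As a concrete illustration, the weight rule (190) can be written in a few lines. This is a hypothetical sketch, not the authors' code; the design points are represented by their displacement vectors $y_n^i$, split into inner and outer lists.

```python
import numpy as np

def assign_weights(Y_inner, Y_outer, delta_n, r):
    """Weights per (190): inner design points get weight 1; outer points get
    (pi_n^2 / ||y||_2^2) * min(1, delta_n^r), which is < 1 since outer points
    satisfy ||y||_2 > delta_n >= pi_n."""
    pi_n = min(np.linalg.norm(y) for y in Y_inner)
    w_inner = [1.0] * len(Y_inner)
    w_outer = [(pi_n ** 2 / np.linalg.norm(y) ** 2) * min(1.0, delta_n ** r)
               for y in Y_outer]
    return w_inner, w_outer

# Example: pi_n = 0.1, one outer point at distance 1, delta_n = 0.2, r = 3.
w_in, w_out = assign_weights([np.array([0.1, 0.0]), np.array([0.0, 0.2])],
                             [np.array([1.0, 0.0])], 0.2, 3)
```

With $\delta_n \to 0$, the outer weights vanish at rate $\delta_n^r$, which is exactly the behavior that (37) and (59) require.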
Lemma 3.24. Let the following assumptions hold.

1. The sequence $\{\delta_n\}_{n \in \mathbb{N}}$ of design region radii satisfies $\delta_n \to 0$ as $n \to \infty$.

2. There exists a constant $M_{max}^D < \infty$ such that $1 \le M_n^I \le M_n \le M_{max}^D$ for each $n \in \mathbb{N}$.

Then, if the weights $\{w_n^i : i = 1, \ldots, M_n\}$ are assigned using Algorithm 10 for each $n \in \mathbb{N}$, (37) in Assumption A 7 is satisfied for any $r > 0$. Further, if there exists $K_y < \infty$ such that $\left\|y_n^i\right\|_2 < K_y$ for all $i = 1, \ldots, M_n$ and $n \in \mathbb{N}$, then (59) in Assumption A 11 is satisfied for any $r > 2$.
Proof. First, let us consider (37). It is not hard to see that
$$\frac{\left\|(Y_n^O)^T W_n^O Y_n^O\right\|_2}{\left\|(Y_n^I)^T W_n^I Y_n^I\right\|_2} \le l \left(\frac{\operatorname{trace}\bigl((Y_n^O)^T W_n^O Y_n^O\bigr)}{\operatorname{trace}\bigl((Y_n^I)^T W_n^I Y_n^I\bigr)}\right) = l \left(\frac{\sum_{i=M_n^I+1}^{M_n} w_n^i \left\|y_n^i\right\|_2^2}{\sum_{i=1}^{M_n^I} w_n^i \left\|y_n^i\right\|_2^2}\right)$$
Substituting the weights given in (190) into the above expression and manipulating the result (note that $\pi_n \le \left\|y_n^i\right\|_2$ for each inner design point, so $\sum_{i=1}^{M_n^I} \left\|y_n^i\right\|_2^2 \ge M_n^I \pi_n^2$), we get
$$\frac{\left\|(Y_n^O)^T W_n^O Y_n^O\right\|_2}{\left\|(Y_n^I)^T W_n^I Y_n^I\right\|_2} \le l \left(\frac{\min\{1, (\delta_n)^r\}\, \pi_n^2 \,(M_n - M_n^I)}{\sum_{i=1}^{M_n^I} \left\|y_n^i\right\|_2^2}\right) \le l \left(\frac{\min\{1, (\delta_n)^r\}\,(M_n - M_n^I)}{M_n^I}\right)$$
We know that $M_{max}^D \ge M_n \ge M_n^I \ge 1$ for each $n \in \mathbb{N}$. Therefore, $\bigl((M_n - M_n^I)/M_n^I\bigr) \le 2 M_{max}^D$. Further, since $\delta_n \to 0$ as $n \to \infty$ and $r > 0$, we finally get
$$\lim_{n \to \infty} \frac{\left\|(Y_n^O)^T W_n^O Y_n^O\right\|_2}{\left\|(Y_n^I)^T W_n^I Y_n^I\right\|_2} \le 2 l M_{max}^D \lim_{n \to \infty} \min\{1, (\delta_n)^r\} = 0$$
Thus, (37) in Assumption A 7 is satisfied.
Next, assuming in addition that $\left\|y_n^i\right\|_2 < K_y$ for all $i = 1, \ldots, M_n$ and $n \in \mathbb{N}$, consider (59). It is easily seen that
$$\frac{\left\|(Z_n^O)^T W_n^O Z_n^O\right\|_2}{\left\|(Z_n^I)^T W_n^I Z_n^I\right\|_2} \le p \left(\frac{\sum_{i=M_n^I+1}^{M_n} w_n^i \left\|z_n^i\right\|_2^2}{\sum_{i=1}^{M_n^I} w_n^i \left\|z_n^i\right\|_2^2}\right)$$
Now, substituting the weight distribution from (190) into the numerator of the above expression and using the definition of $z_n^i$, we get
$$\sum_{i=M_n^I+1}^{M_n} w_n^i \left\|z_n^i\right\|_2^2 = \sum_{i=M_n^I+1}^{M_n} \min\{1, (\delta_n)^r\} \left(\frac{\pi_n^2}{\left\|y_n^i\right\|_2^2}\right) \left\|z_n^i\right\|_2^2 \le \sum_{i=M_n^I+1}^{M_n} (\delta_n)^r \left(\frac{\pi_n^2}{\left\|y_n^i\right\|_2^2}\right) \left(\frac{\left\|y_n^i\right\|_2^2}{\delta_n^2} + \frac{\left\|y_n^i\right\|_2^4}{2\delta_n^4}\right) \le \sum_{i=M_n^I+1}^{M_n} (\delta_n)^{r-2} \pi_n^2 \left(1 + \frac{\left\|y_n^i\right\|_2^2}{2\delta_n^2}\right) \tag{191}$$
Also, we have
$$\sum_{i=1}^{M_n^I} w_n^i \left\|z_n^i\right\|_2^2 = \sum_{i=1}^{M_n^I} \left(\frac{\left\|y_n^i\right\|_2^2}{\delta_n^2} + \frac{\left\|y_n^i\right\|_2^4}{2\delta_n^4}\right) \ge \sum_{i=1}^{M_n^I} \frac{\left\|y_n^i\right\|_2^2}{\delta_n^2} \tag{192}$$
Thus, combining (191) and (192) and noting that $\left\|y_n^i\right\|_2 < K_y$ for all $i = 1, \ldots, M_n$ and $n \in \mathbb{N}$, we have
$$\frac{\sum_{i=M_n^I+1}^{M_n} w_n^i \left\|z_n^i\right\|_2^2}{\sum_{i=1}^{M_n^I} w_n^i \left\|z_n^i\right\|_2^2} \le \frac{\sum_{i=M_n^I+1}^{M_n} (\delta_n)^{r-2} \pi_n^2 \left(1 + \frac{\left\|y_n^i\right\|_2^2}{2\delta_n^2}\right)}{\sum_{i=1}^{M_n^I} \left(\frac{\left\|y_n^i\right\|_2^2}{\delta_n^2}\right)} \le \frac{(\delta_n)^{r-2} \left(\delta_n^2 + \frac{K_y^2}{2}\right)(M_n - M_n^I)}{\sum_{i=1}^{M_n^I} \left(\frac{\left\|y_n^i\right\|_2^2}{\pi_n^2}\right)} \le (\delta_n)^{r-2} \left(\delta_n^2 + \frac{K_y^2}{2}\right) \left(\frac{M_n - M_n^I}{M_n^I}\right) \tag{193}$$
As before, since $\bigl((M_n - M_n^I)/M_n^I\bigr) \le 2 M_{max}^D$, $\delta_n \to 0$ as $n \to \infty$ and $r > 2$, we get
$$\lim_{n \to \infty} \frac{\sum_{i=M_n^I+1}^{M_n} w_n^i \left\|z_n^i\right\|_2^2}{\sum_{i=1}^{M_n^I} w_n^i \left\|z_n^i\right\|_2^2} \le 2 M_{max}^D \lim_{n \to \infty} (\delta_n)^{r-2} \left(\delta_n^2 + \frac{K_y^2}{2}\right) = 0$$
Thus, we finally get that
$$\lim_{n \to \infty} \frac{\left\|(Z_n^O)^T W_n^O Z_n^O\right\|_2}{\left\|(Z_n^I)^T W_n^I Z_n^I\right\|_2} \le p \lim_{n \to \infty} \left(\frac{\sum_{i=M_n^I+1}^{M_n} w_n^i \left\|z_n^i\right\|_2^2}{\sum_{i=1}^{M_n^I} w_n^i \left\|z_n^i\right\|_2^2}\right) = 0$$
Note the following regarding the assumptions of Lemma 3.24.

1. Suppose we pick our design points $\{x_n + y_n^i : i = 1, \ldots, M_n\}$ for each $n \in \mathbb{N}$ using either Algorithm 4 or Algorithm 7. Then, for each $n \in \mathbb{N}$, before either of these algorithms is executed, the number of entries in $X_n$ is at most $M_{max}$. In Algorithm 4 we add at most $l$ new points to $X_n$, and in Algorithm 7 we add at most $p$ new points to $X_n$. Therefore, in either case, it is easily seen that $l \le M_n^I \le M_n \le M_{max} + p$.

2. Further, it is easily seen that if there exists a compact set $\bar{C} \subset \mathcal{E}$ such that $x_n + y_n^i \in \bar{C}$ for each $n \in \mathbb{N}$ and $i = 1, \ldots, M_n$, then there exists $K_y < \infty$ such that $\left\|y_n^i\right\|_2 < K_y$ for each $n \in \mathbb{N}$ and $i = 1, \ldots, M_n$.

Finally, it is also clear that if the weights $\{w_n^i : i = 1, \ldots, M_n^I\}$ corresponding to the inner design points are picked as in Algorithm 10, then
$$\min\{w_n^i : i = 1, \ldots, M_n^I\} = \max\{w_n^i : i = 1, \ldots, M_n^I\} = 1$$
Therefore, in this case, Assumption A 15 used in Section 3.2.1 is satisfied with $K_w^I = 1$.