
Computational Statistics & Data Analysis 44 (2004) 527–546
www.elsevier.com/locate/csda

Using spreadsheet solvers in sample design

Lynne Stokes a,∗, John Plummer b

a Department of Statistics, Southern Methodist University, Dallas, TX 75275, USA
b Department of Management Science and Information Systems, University of Texas, Austin, TX 78712, USA

Received 1 November 2001; received in revised form 1 September 2002

Abstract

Nonlinear programming (NLP) has been used for some time in planning of sample designs more complex than a single objective stratified or two-stage design. This paper discusses the new NLP solver tools that are now widely available in commercial spreadsheet products, and illustrates how to use them in these kinds of sample design problems. These tools eliminate the need for acquiring special purpose software or writing code, and have a natural interface so that setting up the problem is straightforward. However, they often require considerable tuning and experimentation to produce usable solutions. In this paper, we give practical advice on helping the solver tools find solutions in typical sample design problems. The methods are illustrated using Excel's solver on three examples of real sample design problems with complex features.
© 2002 Published by Elsevier B.V.

Keywords: Nonlinear program; Optimization; Excel; Stratified sample

1. Introduction

Planning of a sample design requires a balance between theory and practice. Classical optimization results, such as Neyman allocation for a stratified design, often provide only approximate solutions to the problem facing the practitioner because of issues such as multiple objectives or staffing constraints. Some practical issues have been addressed theoretically, using nonlinear programming (NLP) methods.

For example, many nonlinear programming algorithms have been developed to solve the problem of optimal allocation to strata for surveys having multiple objectives (Kokan, 1963; Kokan and Khan, 1967; Chatterjee, 1968, 1972; Huddleston et al., 1970;

∗ Corresponding author.
E-mail address: [email protected] (L. Stokes).

0167-9473/03/$ - see front matter © 2002 Published by Elsevier B.V.
PII: S0167-9473(02)00322-5


Chromy, 1987; Bethel, 1989; Zayatz and Sigman, 1994; Valliant and Gentle, 1997). Many papers on this topic illustrate that a perfect solution to even this well-defined problem was not easy to achieve. Better and faster solutions have become possible as algorithms and technology have improved. Bethel (1989, p. 53) commented that his new algorithm “... is relatively easy to program, requiring only 40 or 50 lines of code”. Valliant and Gentle (1997) developed a system called ALLOCATE at the Department of Labor, which is code written in C that interfaces to commercially available optimizer software. They note that manuals (Allocate User's Guide and Allocate Programmer's Guide) are available “... to anyone wishing to develop their own systems” (p. 357).

Nowadays, a solution to problems like these is available with off-the-shelf software. Most commercial spreadsheets contain a “solver” tool that requires no code to be written, and requires little knowledge about the optimization algorithms themselves, and so is extremely easy to use. The purpose of this paper is to provide a tutorial in using these tools in typical sample design problems. We discuss some characteristics of nonlinear optimization algorithms in general, as well as specifics for their use in sample design. We believe these tools will be useful to both practitioners and teachers of sampling methods.

The paper proceeds as follows. In Section 2, we discuss how nonlinear optimization arises in sample design, review the methodology, and describe software implementations available today. In Section 3, we illustrate the methods with applications that are typical of the nonstandard problems encountered by practitioners. Each involves a question about efficient allocation to strata. The solutions to these problems and tips for finding them using a solver tool are discussed in Section 4. A summary follows in Section 5.

2. Optimization methods

The classical stratum allocation problem can be stated as follows. Suppose that a population is divided into H strata, where the hth stratum is of size N_h. Let Y_{hi} denote the value of some characteristic for the ith unit in stratum h, h = 1, ..., H; i = 1, ..., N_h, and let S_h^2 denote the variance of that characteristic within stratum h. Suppose that a simple random sample (SRS) of size n_h is to be selected from each stratum and an estimate of the population total is to be made using \hat t = \sum_{h=1}^{H} N_h \bar y_h, where \bar y_h = \sum_{i=1}^{n_h} y_{hi}/n_h. Suppose further that we wish to select the stratum sample sizes in such a way that the cost of sampling, assumed to be linear in the sample sizes,

c_0 + \sum_{h=1}^{H} c_h n_h                                                (2.1)

is minimized, subject to the requirement that the variance of \hat t does not exceed some specified value V^* > 0,

V(\hat t) = \sum_{h=1}^{H} N_h^2 \frac{1 - n_h/N_h}{n_h} S_h^2 \le V^*.     (2.2)


The solution to this problem (n_h \propto N_h S_h/\sqrt{c_h}) is easily obtained using calculus. In practice, one would add the additional constraints that

n_h \le N_h   for all h,                                                    (2.3)

and usually one would prefer that

n_h \ge 2                                                                   (2.4)

to facilitate variance estimation. Enforcing these constraints requires an iterative solution. (The common practice of 100% sampling for strata failing (2.3) and reallocation to the remaining strata was proven optimal by Schneeberger, 1991.) One must also require that n_h be integer-valued, but this is generally attained by rounding the solution obtained for the continuous problem.
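The iteration just described can be sketched in a few lines of code. The following is our own minimal Python sketch, not an implementation from the paper; the function name is ours, and applying the floor of 2 in (2.4) after the reallocation loop is a simplifying assumption (raising a sample size only improves precision, so feasibility is preserved).

```python
import math

def neyman_allocation(N, S, c, V_star):
    """Minimize cost (2.1) subject to V(t_hat) <= V_star (2.2), enforcing
    n_h <= N_h (2.3) by 100% sampling and reallocation (Schneeberger, 1991),
    with the floor n_h >= 2 (2.4) applied afterwards as a heuristic.
    N, S, c: per-stratum sizes, standard deviations, and unit costs."""
    H = len(N)
    take_all = set()  # strata sampled 100%; they contribute zero variance
    while True:
        free = [h for h in range(H) if h not in take_all]
        # Lagrangian solution over the free strata:
        #   n_h = (N_h S_h / sqrt(c_h)) * num / denom
        num = sum(N[h] * S[h] * math.sqrt(c[h]) for h in free)
        denom = V_star + sum(N[h] * S[h] ** 2 for h in free)
        n = {h: (N[h] * S[h] / math.sqrt(c[h])) * num / denom for h in free}
        over = [h for h in free if n[h] > N[h]]
        if not over:
            break
        take_all.update(over)  # sample these strata completely, re-solve
    return [N[h] if h in take_all else max(2.0, n[h]) for h in range(H)]
```

For two strata with N = (100, 100), S = (10, 20), unit costs 1, and V* = 50 000, this returns n = (30, 60), which satisfies (2.2) with equality.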

2.1. Nonlinear programming problem (NLP)

Now we give a brief overview of optimization problems, solution characteristics, and the tools for solving such problems. (For a more detailed presentation of theory, algorithms, and applications, see Lasdon et al., 1996.) An optimization problem (mathematical program) may be stated in general form as:

Problem NLP

Maximize or minimize f(x)                                                   (2.5)

subject to: a_i \le g_i \le b_i,   i = 1, ..., I,                           (2.6)

and l_h \le x_h \le u_h,   h = 1, ..., H,                                   (2.7)

where x = (x_1, ..., x_H) is a vector of H decision variables, f is the objective function, g = (g_1, ..., g_I) is a vector of constraint functions, and a = (a_1, ..., a_I), b = (b_1, ..., b_I), l = (l_1, ..., l_H), and u = (u_1, ..., u_H) are vectors of constants. The objective function represents a measure of how “good” or “bad” a given set of decision variable values is with respect to a goal, and the constraint functions represent limitations of the system being modeled or relationships that must hold for the model to be valid. a and b represent lower and upper bounds on the constraint functions, and l and u lower and upper bounds on the decision variables. The values a_i and b_i in (2.6) are a generalization of the form g_i \le k, where k is referred to in NLPs as the right hand side (RHS) of the constraint. If the objective and all constraints are linear functions, the problem is a linear program. If at least one of the problem functions is nonlinear, the problem is a nonlinear program (NLP). Thus the Neyman allocation problem is an NLP with x_h = n_h, f given by (2.1), I = 1 and g defined in (2.2), and l and u defined in (2.4) and (2.3). When the survey has multiple constraints, I > 1, simple calculus does not yield a solution, and NLP methods must be used. In (2.1)–(2.4), we defined cost as the objective function and variance as a constraint, but these can be reversed. Most sample design problems are NLPs, since variance is proportional to the reciprocal of sample size.
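The correspondence between the allocation problem and Problem NLP can be made concrete in code. The sketch below is ours, not the paper's (the stratum values are illustrative and the function names hypothetical): it encodes (2.1)–(2.4) as an objective f, one constraint function g1 with right hand side V*, and bounds, and then evaluates candidate decision vectors.

```python
# Problem NLP for the Neyman example: x = (n_1,...,n_H),
# f(x) = c_0 + sum c_h n_h  (2.1), one constraint g_1(x) = V(t_hat) (2.2),
# bounds l_h = 2 (2.4) and u_h = N_h (2.3).
N = [100.0, 100.0]          # stratum sizes (illustrative values)
S = [10.0, 20.0]            # stratum standard deviations
c0, c = 5.0, [1.0, 1.0]     # fixed and per-unit costs
V_star = 50000.0            # the right hand side b_1

def f(x):                   # objective function (2.1)
    return c0 + sum(ch * nh for ch, nh in zip(c, x))

def g1(x):                  # constraint function (2.2)
    return sum(Nh ** 2 * (1 - nh / Nh) / nh * Sh ** 2
               for Nh, Sh, nh in zip(N, S, x))

def is_feasible(x, tol=1e-6):
    """True if x satisfies (2.2)-(2.4) to within tol."""
    in_bounds = all(2 - tol <= nh <= Nh + tol for nh, Nh in zip(x, N))
    return in_bounds and g1(x) <= V_star + tol

print(is_feasible([30.0, 60.0]))   # the optimum: feasible, (2.2) binding
print(is_feasible([10.0, 10.0]))   # variance exceeds V*: infeasible
```

With equal unit costs, (30, 60) is the Neyman allocation (n_h proportional to N_h S_h) and meets the variance bound exactly.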


A vector x is feasible if it satisfies all the constraints. The set of all feasible points is called the feasible region, F, and if F is empty, the problem is infeasible. The problem stated in (2.1)–(2.4) is never infeasible (as long as N_h \ge 2) regardless of the value of V^*. However, this will not be the case for all sample design problems, as we shall see in the examples.

2.2. Characteristics of NLP solutions

Let x^* be the solution to the problem defined by (2.5)–(2.7). A constraint g_i is said to be binding if it holds with equality (e.g., g_i \ge 0 is binding if g_i = 0 and nonbinding otherwise). Associated with each constraint g_i in (2.6) is a Lagrange multiplier. The Lagrange multiplier for each constraint represents the change, at point x^*, in the objective function value for a unit change in the right hand side of the constraint. This value therefore represents the marginal impact on the objective function of tightening or loosening that constraint. The Lagrange multipliers for nonbinding constraints have the value 0, since tightening or loosening a constraint (at the margin) which does not hold with equality cannot affect the optimal solution.
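This marginal interpretation can be checked numerically for the single-constraint allocation problem (2.1)–(2.2): the optimal cost, viewed as a function of the right hand side V*, has slope equal to the multiplier. The closed form and illustrative numbers below are our own, not solver output from the paper.

```python
import math

def optimal_cost(V_star, N, S, c):
    """Closed-form minimum of (2.1) (without the fixed cost c_0) subject
    to (2.2), valid when no bound in (2.3)-(2.4) is binding."""
    num = sum(Nh * Sh * math.sqrt(ch) for Nh, Sh, ch in zip(N, S, c))
    return num ** 2 / (V_star + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S)))

N, S, c = [100.0, 100.0], [10.0, 20.0], [1.0, 1.0]
V_star = 50000.0

# analytic Lagrange multiplier of the variance constraint = dC/dV*
num = sum(Nh * Sh * math.sqrt(ch) for Nh, Sh, ch in zip(N, S, c))
denom = V_star + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
multiplier = -num ** 2 / denom ** 2

# central-difference check of the marginal interpretation: relaxing
# the RHS V* by one unit changes the optimal cost by ~multiplier
d = 1.0
slope = (optimal_cost(V_star + d, N, S, c)
         - optimal_cost(V_star - d, N, S, c)) / (2 * d)
```

Here the multiplier is -9e-4: accepting one more unit of variance saves about 0.0009 units of cost at the margin.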

2.3. NLP solution software

Nowadays software for solving NLPs exists in three forms:

(1) Standalone solvers are NLP solution algorithms coded in (mainly) FORTRAN or C, and supplied alone or as part of a library of mathematical software. These are generally the most powerful and general solvers and allow the user the greatest flexibility to “tune” the solution process. They require the user to code his/her own interface to the solver to obtain the problem structure and functions, communicate the problem to the solver, and retrieve the solution. The ALLOCATE system interfaces a standalone solver which uses an algorithm called GRG2, an implementation of the generalized reduced gradient algorithm (Lasdon et al., 1978). A recent addition to this class of solver is the new procedure PROC NLP in SAS version 8. It can solve large problems and allows a choice of many different NLP solution algorithms.

(2) The major algebraic modeling systems (GAMS, AMPL) are supplied with interfaces to well-known nonlinear solvers. These modeling systems allow the user to formulate NLPs in a quasi-algebraic form with such structures as vectors and sets, and constructs that will iterate over such sets. These systems provide probably the greatest combination of power and usability.

(3) All of the major spreadsheet products (Lotus, Quattro, Excel) are supplied with optimization capability, generally linear and nonlinear, perhaps mixed integer (i.e., some decision variables continuous, some discrete). The main advantage of these systems is their wide availability and ease of use, since they do not require custom coding. However, some experimentation is required to achieve good results with these tools.

The examples in this paper employ the Excel solver. The NLP solver packaged with Excel also uses GRG2. It can solve linear and nonlinear problems of up to 200 decision variables, large enough to handle many sample design problems. In addition,


the company that packaged the solver for Microsoft supplies solver add-ins that canincrease the problem size to 400 variables for nonlinear problems and 800 variables forlinear problems. Problems exceeding these dimensions must be reformulated to reducetheir size or solved with one of the other tools described above.

NLP codes work by first evaluating the functions and their derivatives at a starting value of the decision vector, and then iteratively searching for a better solution using a search direction suggested by the derivatives. The search continues until one of several termination conditions is met. Among these are:

(1) Declaring “optimality”, which means that the optimality criteria have been met to within a specified tolerance.

(2) Terminating on “fractional change”, which means that the algorithm is making extremely slow progress; i.e., the difference between the objective values at successive points is less than some tolerance for a specified number of consecutive iterations.

(3) Declaring that a default or user-specified iteration limit or time has been exceeded.

(4) Declaring that a feasible point cannot be found or that a feasible nonoptimal point has been obtained but a direction of improvement cannot be found.

The first of these is a good outcome, indicating location of a local optimum; the second often occurs quite close to a true optimum; and the third is easily corrected by increasing the maximum iterations or time allowed. The fourth outcome may indicate a poorly specified model. We will discuss some methods for improving model specification in Sections 4 and 5.

Spreadsheet solvers require the user to specify starting values for the decision variables. The chosen values will determine which local optimum is reached when the algorithm terminates. Bad starting values can cause an algorithm to fail or to make slow progress. Therefore the user should supply starting values that are as good as information and experience allow. Some suggestions on how to do this for sample design problems are given in Sections 4 and 5. It is important to understand that all common NLP algorithms are capable of finding only local optima. The only way, in the general case, to assess (imperfectly) the relationship of local to global optima is to solve from several starting points.

Though solvers interfaced to spreadsheets do not allow the user as much control as standalone or algebraic modeling system solvers, there are a number of options that the user can invoke to aid in solving the NLP. Among these are options for computing derivatives and for setting various tolerances, which would otherwise be set to defaults by the system. The feasibility tolerance (the “precision” option in Excel's solver) controls how accurately a constraint must be satisfied. The fractional change tolerance (the “convergence” option in Excel) specifies the amount by which the objective value must differ (on a relative basis) from its previous value within a specified number of iterations in order for the algorithm to continue.

NLP codes estimate derivatives (in the absence of user-supplied derivative computations) via finite differences. Most codes, including Excel's, default to forward differences. With forward differences, \partial f/\partial x is estimated by (f(x^* + \delta) - f(x^*))/\delta, where \delta = p x^*. Here p is a perturbation factor, typically in the range 10^{-4}–10^{-8}, so the total perturbation \delta is a fraction of the value of x^*.


Central differences (\partial f/\partial x estimated at point x^* by (f(x^* + \delta) - f(x^* - \delta))/(2\delta)) are more accurate than forward differences (exact for quadratics), but require twice as many function evaluations. If the function evaluations for a model are inherently inaccurate, such as the result of iterative computations with their own convergence tolerance, the perturbation factor will be critical to an accurate solution. Current spreadsheet solvers do not let the user specify the perturbation factor, but do allow the user the option to select central differences instead of forward differences, which might be considered when the NLP fails to solve.
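The accuracy gap between the two schemes is easy to see on a variance-type term N²S²/n, whose exact derivative is known. The short illustration below is ours; the values are arbitrary.

```python
# Forward vs. central differences on f(n) = N^2 * S^2 / n, a typical
# variance term; the exact derivative is -N^2 * S^2 / n^2.
N, S = 100.0, 10.0

def f(n):
    return N ** 2 * S ** 2 / n

n_star = 30.0
p = 1e-4                      # perturbation factor, as in the text
delta = p * n_star            # total perturbation

forward = (f(n_star + delta) - f(n_star)) / delta
central = (f(n_star + delta) - f(n_star - delta)) / (2 * delta)
exact = -N ** 2 * S ** 2 / n_star ** 2

# the central estimate is markedly closer to the exact derivative,
# at the price of one extra function evaluation
err_fwd, err_cen = abs(forward - exact), abs(central - exact)
```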

The performance of many NLPs can be greatly influenced by keeping the variables, function values, and derivatives within 2 or 3 orders of magnitude of each other. If a model contains values that vary by many orders of magnitude, the algorithm may declare the model to be infeasible or terminate at a nonoptimal point. Most codes (including Excel's Solver) allow selection of an automatic scaling option, but well-scaled models are the most desirable. In sample size allocation problems, the variables and functions are often in different units (e.g., decision variables are sample sizes and constraint functions are variances), so this problem can occur.

In the usual sample design problem, we seek solutions to our optimization problem that are integer valued, since decision variables typically represent sample sizes. Seeking an integer solution by treating the decision variables as continuous is referred to as solving the “relaxed” problem. A frequent practice is to round the solution of the relaxed problem to the nearest integer values. If the values of the decision variables are large enough that the impact on the objective function from rounding is small, this technique may produce an acceptable solution. One must take care, however, to verify that the rounded solution is feasible, since rounding can move decision variable values to points outside the feasible region.
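That verification can be automated: round the relaxed solution, re-evaluate the constraint, and bump up strata that were rounded down until feasibility is restored. The sketch below is our own simple heuristic on an illustrative two-stratum problem, not a procedure from the paper.

```python
def variance(n, N, S):
    """V(t_hat) of (2.2) for allocation n."""
    return sum(Nh ** 2 * (1 - nh / Nh) / nh * Sh ** 2
               for Nh, Sh, nh in zip(N, S, n))

def round_and_repair(n_relaxed, N, S, V_star):
    """Round the relaxed solution to the nearest integers; if the variance
    constraint is then violated, increment rounded-down strata (largest
    rounding gap first) until it holds (a heuristic, not guaranteed minimal)."""
    n = [round(x) for x in n_relaxed]
    # strata that were rounded down, sorted so the largest gap pops first
    down = sorted((x - r, i) for i, (x, r) in enumerate(zip(n_relaxed, n))
                  if x > r)
    while variance(n, N, S) > V_star and down:
        _, i = down.pop()
        n[i] += 1
    return n

N, S = [100.0, 100.0], [10.0, 20.0]
n_relaxed = [30.612, 61.224]          # relaxed optimum for V* = 48 000
print(round_and_repair(n_relaxed, N, S, 48000.0))
```

Here nearest-integer rounding to (31, 61) already satisfies the constraint; rounding down to (30, 61) would not, which is exactly the trap the text warns about.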

If rounding is not an acceptable option, true integer programming techniques must be used. Solving mixed-integer programs, which are linear programs where at least one decision variable is specified as an integer, is a well-developed area of optimization. Most solution algorithms employ a form of branch and bound, which attempts an intelligent approach to exhaustive enumeration by pruning branches of the enumeration tree where solutions superior to one already obtained cannot be located. However, finding integer solutions to nonlinear integer programs is very difficult in general. It tends to approach exhaustive enumeration if there are many integer variables and those variables are “general integer” (i.e., able to take on any integer value). Excel's solver implements a branch and bound procedure on nonlinear programs and can take an excessively long time to find a solution.

3. Three examples

3.1. Aggregated strata

A survey of parents of students in a school district was being planned to assess satisfaction with school curriculum, staff, facilities, and similar issues. The parameters of interest were the proportions in various categories, such as the proportion of


Table 1
Number of students in school district by race/ethnicity, income, and grade level (N_h), for example 1

Race/Ethnicity   Low income                 High income
                 Elem.    Middle   High     Elem.    Middle   High     Totals
African Amer.      3872     1721    1517      1183      867     1753   10,913
Asian               282      116     120       590      246      347    1701
Hispanic         12,380     4958    3224      3466     2467     4135   30,630
White              1695      749     607      9689     5229     7109   25,078
Totals           18,229     7544    5468    14,928     8809   13,344   68,322

students whose parents are satisfied or very satisfied with their child's teacher. The school district was interested in having estimates available by race/ethnicity (African American, Asian, Hispanic, and White & Other), income (Low and High), and school level (Elementary, Middle, High). The district specified that a margin of error (half-width of 95% confidence interval) for all estimates of 5 percentage points was desirable.

Table 1 shows the number of students in the district, cross-classified by these three variables. The planned design was a stratified simple random sample without replacement. Our problem is to specify the sample allocation to each of the 24 (4 race × 2 income × 3 school level) strata in such a way that the district's margin of error requirements are satisfied with the minimum possible sample size. Since the precision requirements are for aggregations of strata rather than individual ones, Neyman allocation is not (necessarily) optimal.

Let N_{hkl}, p_{hkl}, n_{hkl}, \hat p_{hkl} denote the stratum size, stratum proportion, sample size, and sample proportion, respectively, for stratum (h, k, l), where h denotes the race (h = 1, ..., 4), k the income (k = 1, 2), and l the school level (l = 1, 2, 3) subpopulations. The estimator of the proportion for race h is

\hat p_{h\cdot\cdot} = \sum_{k=1}^{2} \sum_{l=1}^{3} (N_{hkl}/N_{h\cdot\cdot}) \hat p_{hkl},

where N_{h\cdot\cdot} = \sum_{k}\sum_{l} N_{hkl} is the size of the hth race subpopulation. Its variance is

V(\hat p_{h\cdot\cdot}) = \sum_{k=1}^{2} \sum_{l=1}^{3} (N_{hkl}/N_{h\cdot\cdot})^2 \left(1 - \frac{n_{hkl}}{N_{hkl}}\right) \frac{p_{hkl}(1 - p_{hkl})}{n_{hkl}}.        (3.1)

The analogous estimators for the income and school level groups are denoted by \hat p_{\cdot k\cdot} and \hat p_{\cdot\cdot l}, respectively. We use the conservative substitution of 1/4 for p_{hkl}(1 - p_{hkl}) in (3.1) to define the following NLP. The decision variables are the stratum sample sizes

n_{hkl}   for h = 1, ..., 4;  k = 1, 2;  and  l = 1, 2, 3;

the objective function that will be minimized is

n = \sum_{h} \sum_{k} \sum_{l} n_{hkl},                                     (3.2)


and the constraints on the solution, from (3.1) and the margin of error requirements, are

\sum_{k=1}^{2} \sum_{l=1}^{3} (N_{hkl}/N_{h\cdot\cdot})^2 \left(\frac{1}{n_{hkl}} - \frac{1}{N_{hkl}}\right) \le \left(\frac{(2)(0.05)}{1.96}\right)^2        (3.3)

for all h;

\sum_{h=1}^{4} \sum_{l=1}^{3} (N_{hkl}/N_{\cdot k\cdot})^2 \left(\frac{1}{n_{hkl}} - \frac{1}{N_{hkl}}\right) \le \left(\frac{(2)(0.05)}{1.96}\right)^2        (3.4)

for all k; and

\sum_{h=1}^{4} \sum_{k=1}^{2} (N_{hkl}/N_{\cdot\cdot l})^2 \left(\frac{1}{n_{hkl}} - \frac{1}{N_{hkl}}\right) \le \left(\frac{(2)(0.05)}{1.96}\right)^2        (3.5)

for all l, where the N_{hkl} are as shown in Table 1. We also specify l and u on the decision variables by requiring that

2 \le n_{hkl} \le N_{hkl}.                                                  (3.6)
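Before handing this NLP to a solver it is useful to have the constraint functions available outside the spreadsheet, both to validate the spreadsheet formulas and to re-check rounded allocations. The sketch below is ours; it evaluates the left hand sides of (3.3)–(3.5) from the Table 1 counts.

```python
# Population counts N_hkl from Table 1, indexed [race][income][level]:
# race h = 0..3 (African Amer., Asian, Hispanic, White),
# income k = 0..1 (Low, High), level l = 0..2 (Elem., Middle, High).
N = [[[3872, 1721, 1517], [1183, 867, 1753]],
     [[282, 116, 120], [590, 246, 347]],
     [[12380, 4958, 3224], [3466, 2467, 4135]],
     [[1695, 749, 607], [9689, 5229, 7109]]]

RHS = (2 * 0.05 / 1.96) ** 2       # common right hand side of (3.3)-(3.5)

def margin_constraints(n):
    """Left hand sides of (3.3), (3.4), (3.5) for allocation n[h][k][l]."""
    lhs = []
    for h in range(4):             # (3.3): one constraint per race group
        Nh = sum(N[h][k][l] for k in range(2) for l in range(3))
        lhs.append(sum((N[h][k][l] / Nh) ** 2
                       * (1 / n[h][k][l] - 1 / N[h][k][l])
                       for k in range(2) for l in range(3)))
    for k in range(2):             # (3.4): one per income group
        Nk = sum(N[h][k][l] for h in range(4) for l in range(3))
        lhs.append(sum((N[h][k][l] / Nk) ** 2
                       * (1 / n[h][k][l] - 1 / N[h][k][l])
                       for h in range(4) for l in range(3)))
    for l in range(3):             # (3.5): one per school level
        Nl = sum(N[h][k][l] for h in range(4) for k in range(2))
        lhs.append(sum((N[h][k][l] / Nl) ** 2
                       * (1 / n[h][k][l] - 1 / N[h][k][l])
                       for h in range(4) for k in range(2)))
    return lhs

def feasible(n):
    return all(v <= RHS for v in margin_constraints(n))

census = N                               # n_hkl = N_hkl: every LHS is 0
minimal = [[[2, 2, 2], [2, 2, 2]]] * 4   # n_hkl = 2 everywhere
print(feasible(census), feasible(minimal))
```

A census is trivially feasible (every finite population correction term vanishes), while the lower-bound allocation n_hkl = 2 violates all nine margin-of-error constraints, so the optimum lies strictly between these extremes.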

3.2. Limited resources for interviewing

This example arose in a project undertaken by a sampling class taught by one of the authors. The class implemented a pilot survey for the Texas State Preservation Board (TSPB) to help them develop a method for making quarterly estimates of the number of tourists entering the Texas Capitol Building. For the pilot test, the class planned a survey to estimate the number of tourists, t, entering the Capitol during one specific month.

During that month, the Capitol Building was open for 403 hours. Tourists could enter one of six doors. The sampling units chosen were door-hours, or one-hour periods at a particular door, so that our population contained 403 × 6 = 2418 units. A sample of these units was to be chosen, and students in the class assigned to count the number of tourists entering the building during the sampled door-hours. Since the tourist count varied greatly by time and door, a stratified random sample design was planned. The population was divided into 24 strata defined by eight time periods or shifts and three door groups. Table 2 defines these strata, and shows for each its size and a rough estimate of the standard deviation of its tourist count (N_h and S_h for h = aA, ..., cH). The stratum standard deviation estimates had been made the previous month.

There were 14 students in the class, each of whom was required to contribute eight hours to the sampling effort. One student volunteered to collect 15 hours. However, all students had constraints on the shifts they were available. These constraints are shown in Table 3, where an X in cell (s, i) indicates that student s was not available during shift i. The question we address here is how the door-hours should be allocated to strata and to students. Our goal was to make the estimator of t as efficient as possible.

Let n_{hs} denote the number of door-hours sampled in stratum h by student s, n_{h\cdot} = \sum_s n_{hs}, and n_{\cdot s} = \sum_h n_{hs}. The NLP we are to solve can be stated as follows. The


Table 2
Stratum size (N_h), estimated standard deviation (S_h) of tourist count, and rounded optimal sample size (n_h) for the 24 strata of example 2

                      Shifts
                A     B     C     D     E     F     G     H
Door group a
  N_h           52    52    32    32    56    42    105   32
  S_h           86    86    86    86    126   20    20    20
  n_h           9     10    6     6     21    2     6     2
Door group b
  N_h           52    52    32    32    56    42    105   32
  S_h           17    17    17    17    91    20    2     2
  n_h           2     2     2     2     15    2     2     2
Door group c
  N_h           208   208   128   128   224   168   420   128
  S_h           7     7     7     7     7     7     7     7
  n_h           3     4     2     2     4     3     8     2

Table 3
Student staffing constraints in example 2

Student   Total hrs.   Shifts A–H
1         15           X X X X X
2         8            X X X X X
3         8            X X X X X
4         8            X X X X X X
5         8            X X X X X X
6         8            X X X X X X
7         8            X X X X X X X
8         8            X X X X X X X
9         8            X X X X X X X
10        8            X X X X X X X
11        8            X X X X
12        8            X X X X
13        8            X X X X
14        8            X X X X

X indicates the student is not available at that time.

decision variables are the sample sizes,

n_{hs},   h = aA, ..., cH;  s = 1, ..., 14;

the objective function we wish to minimize is the variance of the estimator \hat t = \sum_{h=1}^{24} N_h \bar y_h,

V(\hat t) = \sum_{h=1}^{24} N_h^2 \frac{1 - n_{h\cdot}/N_h}{n_{h\cdot}} S_h^2;      (3.7)


the constraint functions and limits, which ensure that the strata sample sizes are in the correct range and that each student contributes their agreed upon hours, are

2 \le n_{h\cdot} \le N_h   for all h,                                       (3.8)

n_{\cdot 1} = 15  and  n_{\cdot s} = 8  for s = 2, ..., 14;                 (3.9)

and, to ensure that the staffing availability from Table 3 is honored, we require

n_{aA,s} = n_{bA,s} = n_{cA,s} = 0   for s \notin \{1, 2, 3\},
   ...
n_{aH,s} = n_{bH,s} = n_{cH,s} = 0   for s \notin \{2, 11, 12, 13, 14\}.    (3.10)
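The objective (3.7) can be tabulated directly from Table 2. The sketch below is ours; it evaluates (3.7) at the rounded allocation n_h reported in Table 2 and confirms that the allocation uses exactly the 119 student-hours available (13 × 8 + 15).

```python
# Table 2 data, ordered by door group (a, b, c) and shift (A..H):
Nh = [52, 52, 32, 32, 56, 42, 105, 32,
      52, 52, 32, 32, 56, 42, 105, 32,
      208, 208, 128, 128, 224, 168, 420, 128]
Sh = [86, 86, 86, 86, 126, 20, 20, 20,
      17, 17, 17, 17, 91, 20, 2, 2,
      7, 7, 7, 7, 7, 7, 7, 7]
nh = [9, 10, 6, 6, 21, 2, 6, 2,        # rounded optimal sample sizes
      2, 2, 2, 2, 15, 2, 2, 2,
      3, 4, 2, 2, 4, 3, 8, 2]

def V(n):
    """Variance (3.7) of t_hat for stratum sample sizes n (n = n_h.)."""
    return sum(N * N * (1 - x / N) / x * S * S
               for N, S, x in zip(Nh, Sh, n))

assert sum(Nh) == 2418     # 403 open hours x 6 doors
assert sum(nh) == 119      # 13 students x 8 h + one student at 15 h
print(V(nh))
```

Comparing V at this allocation with V at the all-minimum allocation (n_h = 2 everywhere) shows how much the 119 hours buy beyond the floor imposed by (3.8).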

3.3. Discontinuous cost function

Another component of the class project discussed in Section 3.2 was to adapt the pilot survey design to one that could be administered in an ongoing fashion by the TSPB. Besides their goal of estimating the number of tourists visiting the Capitol building each quarter was a goal of estimating the number of those tourists who were from out of state. In order to obtain the necessary information to make the latter estimate, some tourists had to be interviewed, in addition to simply being counted. We decided to use a stratified two-stage design, where the primary sampling units (psu's) were the door-hours and the secondary sampling units were the tourists themselves. The design problem was to determine the number of door-hours to be chosen from each stratum and the number of tourists selected during each door-hour. We recommended using only seven strata in the TSPB design, collapsing those observed to have similar tourist traffic in the pilot survey. For simplicity of implementation, we specified that the number of tourists sampled remain the same from one psu to another within the same stratum, but could vary among strata. Thus the decision variables were n_h and m_h, for h = 1, ..., 7.

The objective function for the design problem was cost, which we wished to minimize for a specified precision on a quarterly estimate of the number of out-of-state tourists. The factor making our problem nonstandard was the cost function, which does not increase smoothly with m_h. Instead, the number of interviewers needed increases discretely with this design parameter. Experience suggested that a single interviewer could conduct up to 5 interviews during an hour, and staffing constraints dictated that no more than 4 interviewers would be available at any time. We assumed a set-up cost of 0.5 h of time for each psu sampled.

For sample design planning, we used the stratified version of the unbiased two-stage design estimator of the total number of out-of-state tourists (e.g., Lohr, 1999, p. 147):

\hat t_{os} = \sum_{h=1}^{7} \sum_{i=1}^{n_h} \sum_{j=1}^{m_h} y_{hij} \bigg/ \left( \frac{n_h}{N_h} \frac{m_h}{M_{hi}} \right) = \sum_{h=1}^{7} \frac{N_h}{n_h} \sum_{i=1}^{n_h} M_{hi} \hat p_{hi},


Table 4
Stratum size and estimates of stratum parameters for example 3

Parameter     Strata
estimates     1       2       3       4       5       6       7
N_h           504     168     537     504     168     537     4836
s^2_{ht}      485     2091    108     56      20      1       101
\hat M_{hs}   101     319     24      99      23      1.4     4
\hat p_h      0.39    0.22    0.34    0.20    0.42    0.05    0.60

where y_{hij} is 1 or 0 according to whether the jth tourist in psu i from stratum h is from outside Texas or not, M_{hi} is the number of tourists arriving during the ith psu in stratum h, and \hat p_{hi}, the sample proportion of out-of-state tourists present in psu i, is an estimate of p_{hi}, the analogous population parameter. The variance of \hat t_{os} is

\sum_{h=1}^{7} \left[ N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_{ht}^2}{n_h} + \frac{N_h}{n_h} \sum_{i=1}^{N_h} \left(1 - \frac{m_h}{M_{hi}}\right) \frac{M_{hi}^2 p_{hi}(1 - p_{hi})}{m_h} \right],

where S_{ht}^2 is the variance among psu totals within stratum h. For purposes of design optimization, we approximated this variance by setting the within-psu parameters (M_{hi} and p_{hi}) to their stratum averages, yielding

V(\hat t_{os}) \approx \sum_{h=1}^{7} \left[ N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_{ht}^2}{n_h} + \frac{N_h^2}{n_h} \left(1 - \frac{m_h}{\bar M_h}\right) \frac{\bar M_h^2 p_h(1 - p_h)}{m_h} \right],       (3.11)

where \bar M_h is the average psu size and p_h is the proportion of out-of-state tourists in stratum h. Estimates of S_{ht}^2, \bar M_h, and p_h were made from the data of the pilot survey for the seven strata and are shown in Table 4. The precision requirement was a margin of error of 5000 for \hat t_{os} in each quarter. Because the estimated value of \bar M_h is so small (< 5) in strata 6 and 7, we determined in advance that 100% subsampling would be used there; i.e., m_h = M_{hi} for h = 6 and 7.

Thus the optimization problem to be solved has decision variables

n_h for h = 1, ..., 7  and  m_h for h = 1, ..., 5;

objective function,

C = \sum_{h=1}^{7} \{ (0.5) n_h + (1 + [(m_h - 1)/5]) n_h \},               (3.12)

where [x] = greatest integer \le x; a constraint based on the precision requirements and (3.11),

\hat V(\hat t_{os}) = \sum_{h=1}^{7} \left[ N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_{ht}^2}{n_h} + \frac{N_h^2}{n_h} \left(1 - \frac{m_h}{\hat M_{hs}}\right) \frac{\hat M_{hs}^2 \hat p_h(1 - \hat p_h)}{m_h} \right]

\le (5000/1.96)^2,                                                          (3.13)


where s_{ht}^2, \hat M_{hs}, and \hat p_h are estimates of S_{ht}^2, \bar M_h, and p_h; and

2 \le n_h \le N_h   for h = 1, ..., 7;

2 \le m_h \le \min(20, \hat M_{hs})   for h = 1, ..., 5.                    (3.14)
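The cost (3.12) and variance constraint (3.13) can be coded directly from Table 4, which makes the step structure of (3.12) visible: each block of up to 5 interviews per psu-hour adds one interviewer. The sketch and the trial values of n_h and m_h below are ours, not a solution from the paper.

```python
import math

# Table 4 estimates for the seven strata of example 3:
Nh = [504, 168, 537, 504, 168, 537, 4836]
s2t = [485, 2091, 108, 56, 20, 1, 101]       # s^2_ht
Mbar = [101, 319, 24, 99, 23, 1.4, 4]        # M-hat_hs (average psu size)
ph = [0.39, 0.22, 0.34, 0.20, 0.42, 0.05, 0.60]

def cost(n, m):
    """Total cost (3.12): 0.5 h setup per psu, plus one interviewer-hour
    per block of up to 5 interviews ([x] = greatest integer <= x)."""
    return sum(0.5 * n[h] + (1 + math.floor((m[h] - 1) / 5)) * n[h]
               for h in range(7))

def variance(n, m):
    """Approximate variance (3.13); strata 6 and 7 use 100% subsampling
    (m_h = M-bar_h), so their within-psu term vanishes exactly."""
    tot = 0.0
    for h in range(7):
        tot += Nh[h] ** 2 * (1 - n[h] / Nh[h]) * s2t[h] / n[h]
        tot += (Nh[h] ** 2 / n[h]) * (1 - m[h] / Mbar[h]) \
               * Mbar[h] ** 2 * ph[h] * (1 - ph[h]) / m[h]
    return tot

m = [5, 10, 5, 5, 5, Mbar[5], Mbar[6]]   # trial subsample sizes (ours)
n = [50, 60, 30, 20, 20, 10, 40]         # trial psu sample sizes (ours)
print(cost(n, m), variance(n, m) <= (5000 / 1.96) ** 2)
```

Note how the cost jumps when any m_h crosses a multiple of 5 (e.g., m_h = 5 costs 1.5 h per psu, m_h = 6 costs 2.5 h), which is exactly the discontinuity that a smooth NLP solver cannot follow.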

4. Practical tips for solving NLPs with a spreadsheet solver

In this section we present solutions for the examples of Section 3. Then we summarize some techniques that proved helpful for solving them with Excel's solver.

Fig. 1 shows the initial set-up of the Excel spreadsheet for the NLP posed in Example 1. The population sizes (the N_{hkl}'s) were entered in the table delimited (diagonally) by cells B4 and G7. The decision variables (the n_{hkl}'s) will be returned in the table delimited by cells B13 and G16. Starting values must be supplied in these cells before invoking solver. In the case shown, a value of 50 was chosen as a starting value for each decision variable. The objective function is the total sample size (SUM(B13:G16)), defined in G17. The variance formulae for the subpopulation estimates ((3.3)–(3.5)) are defined in cells B18–B26. For example, the formula for V(\hat p_{1\cdot\cdot}), the variance of the estimator for the African American subpopulation, is displayed in the command line.

The solver tool itself is an add-in in Excel that is found on the Tools menu. Its interface, set up to solve this problem, is shown in Fig. 2. The objective function, in this case total sample size, is entered in Set Target Cell and the decision variables in By Changing Cells. Choose the radio button to Min(imize) the objective function, and finally enter constraints as shown by selecting the Add button and completing the constraints menu. The first two sets of constraints visible enforce the sample size inequalities in (3.6), and the last set describes the variance requirements of (3.3)–(3.5).

After pushing the Solve button on the solver interface, the solution is returned in the decision variable cells, as shown in Fig. 3. Note that now all the variance constraints are met, which remained the case even after the stratum sample sizes were rounded to integers.
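Readers without Excel can reproduce this kind of calculation with any NLP library. The sketch below uses `scipy.optimize.minimize` (SLSQP) on a deliberately small, hypothetical two-stratum allocation problem; the population sizes, proportions, and precision target are invented for illustration and are not the survey's actual figures.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-stratum version of the problem (invented numbers)
N = np.array([4000.0, 6000.0])        # stratum population sizes
p = np.array([0.3, 0.5])              # anticipated stratum proportions
W = N / N.sum()                       # stratum weights
V0 = (0.03 / 1.96) ** 2               # variance cap: 3-point margin at 95%

def variance(n):
    # Stratified variance of the estimated proportion, with fpc
    return np.sum(W**2 * (1 - n / N) * p * (1 - p) / n)

res = minimize(
    lambda n: n.sum(),                # analogue of Set Target Cell: total sample size
    x0=np.full(2, 50.0),              # starting values, as in the spreadsheet
    method="SLSQP",
    bounds=[(2.0, Nh) for Nh in N],   # 2 <= n_h <= N_h
    constraints=[{"type": "ineq",     # analogue of the Add-constraint menu
                  "fun": lambda n: V0 - variance(n)}],
)
# Round up (clipped at N) so the variance cap still holds after rounding
n_opt = np.minimum(np.ceil(res.x), N)
```

The correspondence with the Solver interface is direct: the objective function plays the role of Set Target Cell, `x0` of By Changing Cells, and the constraint dictionary of the Add-constraint menu.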

In this application, the sponsors of the survey found the sample size of 1519 to be too expensive. They asked us to determine which estimator was causing the greatest marginal increase in cost. That is, if they were willing to accept less precision for one of the nine estimators, which one should it be? A look at the sensitivity report produced by Solver, part of which is shown in Fig. 4, provides an answer. The variance constraints for three of the nine estimators are not binding, which is indicated in the table by a Lagrange multiplier of 0. Among the remaining six, the estimator for the African American subpopulation shows the largest (in magnitude) Lagrange multiplier, meaning that relaxation of the required precision for that estimate should allow the largest reduction in sample size. When we ran solver again after relaxing the constraint for cell B18 to [(2)(0.06)/1.96]² = 0.003748, we found a solution yielding a sample size of 1431.
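The same reasoning can be automated: among binding constraints (nonzero multiplier), relax the one whose Lagrange multiplier is largest in magnitude. A minimal sketch, with invented multiplier values standing in for the ones read off Solver's sensitivity report (the cell labels and numbers below are hypothetical, not the paper's figures):

```python
def best_constraint_to_relax(multipliers):
    """Among binding constraints (nonzero multiplier), return the one whose
    multiplier is largest in magnitude: relaxing that precision requirement
    should allow the largest reduction in total sample size."""
    binding = {k: v for k, v in multipliers.items() if v != 0.0}
    return max(binding, key=lambda k: abs(binding[k]))

# Invented sensitivity-report values keyed by constraint cell
lam = {"B18": -4.1e5, "B19": -1.2e5, "B20": 0.0, "B21": -3.0e4}
# best_constraint_to_relax(lam) -> "B18"
```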

Fig. 1. Set-up of spreadsheet for example 1.

Fig. 2. Solver menu for example 1.

Fig. 3. Solution returned by Solver for example 1.

Fig. 4. Portion of output from sensitivity analysis in example 1.

For the second example, we first attempted to solve the constrained problem defined in (3.7)–(3.10). However, we encountered a problem because the number of decision variables in this set-up, 14 × 24 = 336, exceeded the 200 allowed in Excel's solver. The problem was re-formulated so that only the allowable nonzero nhs were defined as decision variables (rather than defining them all and enforcing constraints), reducing their number to 105. A portion of the spreadsheet set-up for this problem, in formula view, is shown in Fig. 5. The strata information was entered in rows (2–25), while the students' sample size information (the nhs) is in columns (G–T). The 105 nonzero sample sizes appear in row 28 and below, and are set to 2 for starting values. The values of the decision variables flow back to the matrix of sample sizes in G2–T25. These are in turn summed over the student columns to return the stratum sample sizes in column D. The passing of the solutions from rows 28 through 36 back to rows 2 through 25 was done only to make the results easier to read from the spreadsheet. Having the decision variables defined in contiguous cells on the spreadsheet makes filling in Solver's menu easier to manage. Solver returned a noninteger solution to the problem, which we rounded, yielding the result shown in Table 2.
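The reformulation trick — declaring only the allowable nonzero cells as decision variables and flowing them back into the full matrix — can be expressed generically as a pack/unpack step. A sketch with a made-up 3 × 4 eligibility mask (the real problem's mask has 105 allowed cells in a 14 × 24 table):

```python
import numpy as np

def unpack(x, mask):
    """Scatter the packed decision vector x back into the full
    (strata x students) matrix, leaving disallowed cells at zero."""
    full = np.zeros(mask.shape)
    full[mask] = x
    return full

# Made-up eligibility mask: True where a student may be assigned to a stratum
mask = np.array([[True, False, True, False],
                 [False, True, True, True],
                 [True, True, False, False]])
x = np.full(mask.sum(), 2.0)        # starting values of 2, as in the paper
n = unpack(x, mask)
stratum_totals = n.sum(axis=1)      # analogue of the column-D stratum sums
```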

The third example problem, finding an optimal two-stage design for estimating the total number of out-of-state tourists, proved harder to solve with Excel than the others due to the discontinuity of the objective function. Clearly, any solution must have mh a multiple of 5 (if it is less than M̄hs), since otherwise a better solution can be found by increasing mh to the next multiple of 5. Yet solver presented "solutions" not having this property for the NLP defined by (3.12)–(3.14), even when we specified integer constraints on the decision variables. We reformulated the problem, making use of the observation above. We redefined the decision variables to be nh and qh, the number of interviewers used on shift h. We redefined the objective function as

$$\sum_h \left[(0.5)\,n_h + n_h q_h\right],$$

set mh = min(5qh, M̄hs), and constrained qh to be integer valued, with 1 ≤ qh ≤ 4. The spreadsheet set-up is shown in Fig. 6. The decision variables can be found in B2–B8 and D2–D6, and are set to starting values of all 200's and all 1's, respectively. With this formulation, a reasonable solution was obtained (nh values of 181, 145, 67, 21, 40, 6 and 566 and qh values of 1, 1, 1, 2, 1, 1, 1), which yielded a minimum cost of about 1560 h.
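The quoted cost can be checked directly by plugging the reported solution into the reformulated objective:

```python
# Solution reported in the text for example 3
n = [181, 145, 67, 21, 40, 6, 566]   # shift sample sizes n_h
q = [1, 1, 1, 2, 1, 1, 1]            # interviewers per shift q_h

# Objective: sum over shifts of (0.5)n_h + n_h q_h, in hours
cost = sum(0.5 * nh + nh * qh for nh, qh in zip(n, q))
# cost -> 1560.0, the "about 1560 h" quoted in the text
```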

Some experimentation was required to obtain good solutions in the three examples. The following tactics proved helpful:

(1) Specify the precision constraints on the variance, rather than on the standard deviation or margin of error. Initially the constraints in the first example were defined in terms of margin of error, but the solver returned an error because the optimizer tried to move the decision variables to values that caused domain violations with the square root function (negative values). In general, NLPs should be formulated to avoid functions with discontinuities, infinities, and undefined regions whenever possible.

Fig. 5. A portion of spreadsheet set-up for example 2.

Fig. 6. Set-up of spreadsheet and solver for example 3.

(2) Scale the problem so that objective and constraint values are within a few orders of magnitude of each other, or use the automatic scaling option. In the first example, the objective function's minimum was 1519 and the constraint functions (3.3)–(3.5) were set to have maxima of 0.0026. With this formulation, and without invoking the automatic scaling option, different solutions were obtained from different starting values. When automatic scaling was used, a consistent solution was found. However, it can happen that a better solution is found without scaling, so it is good practice to try it both ways.
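Manual scaling amounts to rewriting each constraint so its typical magnitude is near 1; e.g. a variance cap g(n) ≤ V0 becomes g(n)/V0 − 1 ≤ 0. A small sketch of the transformation (the bound value is the 0.0026 cap from the first example; the test variances are invented):

```python
V0 = 0.0026                       # variance cap from the first example

def raw_constraint(variance):
    # Solver sees values on the order of 1e-3
    return V0 - variance

def scaled_constraint(variance):
    # Same feasible set, but values on the order of 1
    return 1.0 - variance / V0
```

Both functions are nonnegative on exactly the same set of designs; only the numerical scale seen by the optimizer differs.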

(3) Try several starting values for the decision variables. When the same solution is produced from each, you can have more confidence that you have actually obtained a global minimum. When different solutions are obtained from different starting values, it is necessary to explore the objective function more completely. In the first example, local minima were not a problem once the model was properly scaled; i.e., nearly identical solutions were found for all starting values tried. In the second example, all starting values selected again produced solutions giving the same minimum variance and stratum allocation. However, the solutions themselves varied greatly because there are many feasible student assignments resulting in the same allocation to strata, which is what controls the value of the objective function. Sometimes, though, considerably more experimentation is required. We studied the second example with a variety of sets of student staffing constraints, some more and some less restrictive than those shown in Table 3. For one particular (more restrictive) set of constraints, the solver converged to 5 different "minimum" variance values in 20 attempts. We designated the solution producing the minimum cost as optimal. Did we actually find the global minimum? We can't be sure. We tried to increase our chances by continuing to select starting values until several consecutive starting values produced no new local minima.
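Outside the spreadsheet, this tactic is easy to automate as a multistart loop. A minimal sketch; the objective here is a made-up one-dimensional function with two local minima, not one of the paper's problems:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Invented test function with local minima near x = +1 and x = -1;
    # the +0.3x tilt makes the basin near -1 the global one
    return (x[0]**2 - 1.0)**2 + 0.3 * x[0]

# Run the local solver from several spread-out starting points
starts = [np.array([s]) for s in (-2.0, -1.0, 0.0, 1.0, 2.0)]
runs = [minimize(f, x0) for x0 in starts]
best = min(runs, key=lambda r: r.fun)
# Different starts settle in different basins; keep the best result, and
# gain confidence when several starts reproduce the same minimum value.
```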

(4) Select starting values with care. The starting values are not required to be feasible, which is convenient, especially in problems where feasibility is not perfectly obvious. However, the objective function must be defined at the chosen starting value. As an illustration, in the third example, we must not choose nh = 0 for any stratum, but we did choose nh > Nh for some strata. Even in general NLPs, it is good practice to avoid 0's as starting values, since many codes determine Jacobian structure from the derivatives at the starting point, which can result in 0 derivatives for product terms, or can cause division by 0 or domain violations. To increase the chance that you find the global optimum, vary the starting values as much as possible. In the first example, extreme (all nhkl = 2 and all nhkl = Nhkl) and intermediate values were tried as starting values. All produced the same solution, although when the extremes were used, the default number of iterations was exceeded before obtaining a solution. Running solver again from its stopping point took care of this problem. Because of the large number of decision variables in the second example, it was too time-consuming to select starting values by hand. Instead, Excel's binomial random number generator was used to assign starting values automatically. By performing this operation repeatedly, we obtained sufficiently varied starting values.


(5) When the values and range of the decision variables are large, integer solutions to the NLP are best obtained by solving the relaxed problem and rounding to the nearest integer. In the first example this procedure still produced a solution meeting the variance constraints to two significant digits. For the second example, rounding the decision variables to the nearest integer did not necessarily produce a feasible solution. (Constraints (3.8) and (3.9) were sometimes violated.) Experimentation was required to find a feasible solution near the continuous optimum. This process did increase the objective function slightly. When a continuous solution to this problem was allowed, the minimum predicted variance was 14.98 million, while a minimum variance of 15.05 million was found after rounding to a feasible integral solution. We cannot be sure that this rounded solution actually minimizes the objective function among all integer solutions.
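For constraints that are monotone in the sample sizes — as variance caps are — rounding the relaxed solution up rather than to the nearest integer is guaranteed to stay feasible, at a small cost in the objective. A sketch on a hypothetical two-stratum variance function (all numbers invented):

```python
import numpy as np

# Invented stratum data; the variance below is decreasing in every n_h,
# so rounding each n_h UP can only shrink it
N = np.array([500.0, 800.0])
S2 = np.array([1.4, 0.9])

def var(n):
    # Stratified variance term with finite population correction
    return np.sum(N**2 * (1 - n / N) * S2 / n)

n_relaxed = np.array([62.7, 41.3])   # pretend continuous optimum from the solver
n_up = np.ceil(n_relaxed)            # [63, 42]: feasible whenever n_relaxed is
n_near = np.round(n_relaxed)         # [63, 41]: may break a tight variance cap
```

Nearest-integer rounding can push a coordinate below its relaxed value (41.3 → 41 here), which is exactly how constraints such as (3.8) and (3.9) came to be violated in the second example.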

(6) In circumstances other than those outlined in (5), one can try to obtain an integer solution by imposing an integrality constraint in the solver. To speed up the search for a solution with this approach, the following procedure is recommended: (a) make sure that all integer variables are specified with the narrowest possible bounds; (b) solve the relaxed problem without integrality constraints; (c) use the solution to the relaxed problem as the starting point for the problem with integrality constraints added. Even then, runtime can be prohibitively long. In both the first and second examples (constrained problem), the solver ran for more than 72 h without finding a solution (on a 1 GHz machine with 256 MB of RAM). Integrality constraints on qh were used successfully in the third example.

5. Discussion

We have demonstrated that spreadsheet solvers are useful tools for planning efficient sample designs in practical situations that were previously approached by heuristics and mathematical simplifications. They are especially beneficial in situations where the survey being planned is a one-of-a-kind effort, or when the analyst does survey design work only occasionally.

The wide availability and ease of use of the solver tools also make them easy to incorporate into the curriculum of an applied sampling course. They provide the instructor a method for introducing the mathematically less sophisticated student to optimal sample design without the distraction of solving calculus problems. The spreadsheet interface is so familiar to students that it is easy to use for developing their intuition about the effect of changing design parameters.

Despite these advantages, we have also seen that NLPs do not always solve, since there is no algorithm that will solve all NLPs. We showed some approaches to problems that do not solve. The source of most instances where an NLP fails or terminates at a significantly nonoptimal point is poor derivative accuracy or poor interaction between the model values and the algorithm tolerances. This can be addressed by: (a) switching to a different starting point; (b) reformulating models to reduce the degree of nonlinearity or the potential for singularities where possible; (c) switching to central differences; (d) adjusting tolerances where possible and appropriate; and (e) simply restarting a failed run from the final point of that run.


The easiest NLPs to solve are those having as many linear or quadratic functions as possible among the objective and constraint functions, and whose constructs do not result in discontinuities and infinities. Of course, this is often impossible in sample design problems, since the decision variables are generally sample sizes and either the objective or constraint functions are variances, which are inversely related to sample size. Still, variance functions are monotonic in each sample size, which makes local optima less of a problem than when dealing with more pathological functions. The user should formulate the model to avoid domain violations when possible, for example by using variance rather than standard error as an objective function. The user can also help the program to solve by putting bounds on all variables and functions which are as restrictive as possible without excluding potentially favorable points.

References

Bethel, J., 1989. Sample allocation in multivariate surveys. Survey Methodology 15, 47–57.
Chatterjee, S., 1968. Multivariate stratified surveys. J. Amer. Statist. Assoc. 63, 530–534.
Chatterjee, S., 1972. A study of optimum allocation in multivariate stratified surveys. Scand. Actuarietidskrift 55, 73–80.
Chromy, J.R., 1987. Design optimization with multiple objectives. Proceedings of the Survey Research Methods Section of the American Statistical Association, pp. 194–199.
Huddleston, H.F., Claypool, P.L., Hocking, R.R., 1970. Optimal allocation to strata using convex programming. Appl. Statist. 19, 273–278.
Kokan, A.R., 1963. Optimum allocation in multivariate surveys. J. Roy. Statist. Soc. Ser. A 126, 557–565.
Kokan, A.R., Khan, S., 1967. Optimum allocation in multivariate surveys: an analytical solution. J. Roy. Statist. Soc. Ser. B 29, 115–125.
Lasdon, L., Plummer, J., Waren, A., 1996. In: Avriel, M., Golany, B. (Eds.), Mathematical Programming for Industrial Engineers. Marcel Dekker, New York.
Lasdon, L., Waren, A., Jain, A., Ratner, M., 1978. Design and testing of a generalized reduced gradient code for nonlinear programming. ACM Trans. Math. Software 4, 34–50.
Lohr, S., 1999. Sampling: Design and Analysis. Duxbury Press.
Schneeberger, H., 1991. Some comments on sampling optimization. Jahrb. Natl. Okonomie Statist. 208, 67–80.
Valliant, R., Gentle, J., 1997. An application of mathematical programming to sample allocation. Comput. Statist. Data Anal. 25, 337–360.
Zayatz, L., Sigman, R., 1994. Feasibility study of the use of Chromy's algorithm in Poisson sample selection for the annual survey of manufactures. Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, pp. 641–646.