
Lecture Notes in Statistics

Edited by J. Berger, S. Fienberg, J. Gani, K. Krickeberg, I. Olkin, and B. Singer

66

Tommy Wright

Exact Confidence Bounds when Sampling from Small Finite Universes: An Easy Reference Based on the Hypergeometric Distribution

Springer-Verlag Berlin Heidelberg New York London Paris

Tokyo Hong Kong Barcelona Budapest

Author

Tommy Wright Mathematical Sciences Section Oak Ridge National Laboratory Oak Ridge, TN 37831-6367, USA

Mathematical Subject Classification: 62D05, 62Q05, 60C05

ISBN-13: 978-0-387-97515-3

DOI: 10.1007/978-1-4612-3140-0

e-ISBN-13: 978-1-4612-3140-0

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication of this publication or parts thereof is only permitted under the provisions of the German Copyright Law of September 9, 1965, in its current version, and a copyright fee must always be paid. Violations fall under the prosecution act of the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1991

Softcover reprint of the hardcover 1st edition 1991

2847/3140-543210 - Printed on acid-free paper

What is more beautiful than a simple and important

question with a simple and exact answer that is easy

to provide?

To Marsha, Taunya, Tommy II, and Tracy

PREFACE

There is a very simple and fundamental concept to much of probability and statistics that

can be conveyed using the following problem.

PROBLEM. Assume a finite set (universe) of N units where A of the

units have a particular attribute. The value of N is known while the

value of A is unknown. If a proper subset (sample) of size n is

selected randomly and a of the units in the subset are observed to have

the particular attribute, what can be said about the unknown value of

A?

The problem is not new and almost anyone can describe several situations where a particular

problem could be presented in this setting. Some recent references with different focuses

include Cochran (1977); Williams (1978); Hájek (1981); Stuart (1984); Cassel, Särndal, and

Wretman (1977); and Johnson and Kotz (1977). We focus on confidence interval estimation of A. Several methods for exact confidence interval estimation of A exist (Buonaccorsi, 1987,

and Peskun, 1990), and this volume presents the theory and an extensive Table for one of

them.

One of the important contributions in Neyman (1934) is a discussion of the meaning of

confidence interval estimation and its relationship with hypothesis testing which we will call

the Neyman Approach. In Chapter 3 and following Neyman's Approach for simple random

sampling (without replacement), we present an elementary development of exact confidence interval estimation of A as a response to the specific problem cited above. Buonaccorsi (1987)

notes that the exact methods under simple random sampling of Konijn (1973, p. 79) and Katz

(1953) appear to be the same as the result obtained from the Neyman Approach which Buonaccorsi refers to as the T Method. Because the Neyman Approach in our case for

one-sided confidence bounds of A is based on inverting a family of uniformly most powerful

one-sided tests of hypotheses for A, the resulting one-sided confidence bounds (upper and

lower) are uniformly most accurate as noted by Buonaccorsi (1987). That is, the Neyman

Approach leads to the shortest one-sided confidence intervals for the stated confidence levels.

Under simple random sampling, exact confidence intervals for A can be constructed using

the hypergeometric probability distribution (Chapter 3). Chung and DeLury (1950) provide

charts for two-sided confidence limits based on the hypergeometric distribution for N = 500;

2500; and 10000. Buonaccorsi (1987) notes that their method is similar to a method described

by Sukhatme, Sukhatme, Sukhatme, and Asok (1984, p. 46) and that it does not always lead to

a solution, particularly for small values of N.

Perhaps the most familiar method for constructing exact confidence intervals of A is given

in Cochran (1977, p. 57). However, the method presented in Cochran's book is not the same


as that of the Neyman Approach. In fact, Buonaccorsi shows for most observed values of a that Cochran's method leads to longer confidence intervals for A than those obtained by the

Neyman Approach. The difference in length for one-sided intervals does not exceed 1.

To obtain an exact confidence bound for A under simple random sampling requires

extensive computing of hypergeometric probabilities. For this reason, approximations of

confidence bounds (for example, based on the binomial, Poisson, or normal distributions) are

frequently used (Cochran, 1977). For certain combinations of N, n, a, and confidence level,

these approximations are not good and can lead to incorrect inferences about the value of A.
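
As a hedged illustration of how an approximation can fail, the Python sketch below compares a normal-approximation upper bound with an exact hypergeometric one when no sampled unit has the attribute (a = 0). The function names are hypothetical. The normal formula, with a finite population correction and no continuity correction, is one common textbook form assumed here. The exact bound is assumed to be the largest A for which P(X ≤ a) still exceeds alpha, where alpha is one minus the confidence level.

```python
from math import comb, sqrt

def cdf(N, A, n, a):
    """P(X <= a) for the number X of attribute units in a simple random
    sample of n drawn without replacement from N units, A of which have
    the attribute."""
    lo = max(0, n - (N - A))   # fewest attribute units the sample can hold
    hi = min(a, n, A)          # largest count included in the lower tail
    return sum(comb(A, x) * comb(N - A, n - x)
               for x in range(lo, hi + 1)) / comb(N, n)

def exact_upper(N, n, a, alpha):
    # Assumed definition: largest A whose lower-tail probability exceeds alpha.
    return max(A for A in range(N + 1) if cdf(N, A, n, a) > alpha)

def normal_upper(N, n, a, z=1.6449):
    # Assumed textbook form: N * (p + z * sqrt((1 - n/N) * p * (1 - p) / (n - 1)))
    p = a / n
    return N * (p + z * sqrt((1 - n / N) * p * (1 - p) / (n - 1)))

print(exact_upper(10, 5, 0, 0.05))  # exact 95% upper bound for A
print(normal_upper(10, 5, 0))       # the approximation collapses to 0.0 when a = 0
```

With N = 10, n = 5, and a = 0, the approximation reports an upper bound of zero, while under the exact calculation values of A up to 3 cannot be ruled out at the 95% level.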

The computer made possible the publication of a table of the hypergeometric

probability distribution by Liebermann and Owen (1961) for N varying from 1 through 100;

N = 1000; and N = 2000. Although the table is extensive, it can be difficult, and in some cases

impossible, to obtain exact confidence bounds for A.

Tomsky, Nakano, and Iwashika (1979) give a table of upper confidence bounds using the

Neyman Approach for N = 2, 3, 6, 7, 8, 9, 10, 20, 30, 40, 50, and 100. Odeh and Owen (1983)

give a table of upper and lower confidence bounds using the method of Cochran (1977) for

N = 400; 600; 800; 1000; 1200; 1400; 1600; 1800; and 2000.

Currently, some statistical packages contain functions that generate hypergeometric

probabilities which can then be used to generate exact confidence bounds for A under simple

random sampling. For example, Alexander (1986) presents an interactive macro using the

Statistical Analysis System (SAS) function PROBHYPR to produce upper confidence bounds

for A using the Neyman Approach.
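
The following is a minimal Python sketch of that idea, not the SAS macro of Alexander (1986). It assumes the test-inversion definitions commonly used for this problem: the upper bound is the largest A with P(X ≤ a) > alpha, and the lower bound is the smallest A with P(X ≥ a) > alpha, where alpha is one minus the confidence level. The function names are hypothetical.

```python
from math import comb

def pmf(N, A, n, x):
    """Hypergeometric P(X = x): universe of N units, A with the attribute,
    simple random sample of n drawn without replacement."""
    if x < 0 or x > n or x > A or n - x > N - A:
        return 0.0
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

def upper_bound(N, n, a, alpha):
    """Largest A for which seeing a or fewer is not rejected: P(X <= a | A) > alpha."""
    return max(A for A in range(N + 1)
               if sum(pmf(N, A, n, x) for x in range(a + 1)) > alpha)

def lower_bound(N, n, a, alpha):
    """Smallest A for which seeing a or more is not rejected: P(X >= a | A) > alpha."""
    return min(A for A in range(N + 1)
               if sum(pmf(N, A, n, x) for x in range(a, n + 1)) > alpha)

print(upper_bound(20, 10, 2, 0.05))    # 95% upper confidence bound for A
print(lower_bound(20, 10, 5, 0.025))   # 97.5% lower confidence bound for A
```

The brute-force scan over every A from 0 to N is chosen for clarity. For universes of the sizes discussed here it runs quickly, although a bisection search would do less work.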

In spite of these recent computing developments, the existence of theory, and the ability to

produce exact confidence bounds for A, exact results are rarely given in practice. Why

continue to use and teach approximations, including ones that yield bad results for certain

cases, for such a common and simple problem when exact and simple methods can be used?

The purpose of this volume is to provide a complete and elementary development of the

details behind these confidence bounds and to provide an extensive Table (see

Application I, p. vii) of optimal upper and lower confidence bounds for A that is easy to

understand and use. It is primarily intended to be a quick and easy reference for a large group

of users including consulting and research statisticians, practitioners involved in acceptance

sampling type applications, scientists, auditors, engineers, quality control and quality

assurance personnel (especially those engaged in manufacturing settings), government

officials (especially those involved in the collection of data from institutions or

establishments at local, state, and federal levels), managers, social scientists, education


administrators, environment-wildlife-forestry related workers, marketing agencies, health

administrators, economists, personnel managers, etc. Indeed, anyone who has reason to select

a sample from a finite universe and construct a confidence interval or test a hypothesis will

find great use for this volume. Also as mentioned earlier, this volume is instructive and can be

a valuable supplement to courses in sampling techniques and methodology which tend to

devote little or no time to exact methods when sampling from finite universes. In addition to

the elementary development given in Chapter 3, on pages vi-viii, eight specific applications

of the Table of confidence bounds are listed, including tests of hypotheses, guidance for

determination of sample size n, construction of conservative confidence bounds under

stratified random sampling, and conservative comparisons of two separate universes. These

applications and the use of the Table are discussed with examples in Chapter 2 and can be

used without reference to the theory and development in Chapter 3. The Table is given in

Chapter 4.

The Table in this volume was produced on an IBM 3033 computer. I am grateful that

permission was granted to use, in the computer program, the function PROBHYPR which is

part of the SAS® System, a product from SAS Institute Inc., Cary, North Carolina.

I am also grateful to the Naval Facilities Engineering Command, Department of the Navy,

U.S. Department of Defense for initial funding on a project related to the sampling of housing

units at Navy Installations around the world which led to the beginning of this work.

Additional support to complete the work came from the Applied Mathematical Sciences

Program in the Office of Energy Research of the U.S. Department of Energy, under contract

number DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. to operate Oak

Ridge National Laboratory.

My sincere thanks to the following individuals for independent reviews, encouragement,

and for helpful suggestions: John Beauchamp, Kimiko O. Bowman, John P. Buonaccorsi, and

How J. Tsao. Kimiko Bowman and How Tsao each produced separate and independent

computer programs which confirm the computational results given in Chapter 4. It was a

personal joy that I was able to excite one of my students, Paula Baker, at Knoxville College

about statistics by having her proofread an early draft.

Finally, the production of this work would have been impossible without the valuable

assistance of three other members of the Mathematical Sciences Section at Oak Ridge

National Laboratory: Rhonda Harbison and Tammy Darland for typing and retyping the many

drafts of this volume, excluding Chapter 4, and Elmon Leach for the programming that

produced the extensive Table in Chapter 4. I am indeed grateful for their expert support and

patience.

Tommy Wright

A NOTE TO USERS

What is the purpose of this volume?

This volume is of particular interest to anyone who studies solutions to or faces problems

of the following type.

Setting and Problem. Assume a finite universe (population, lot, or urn) of

N units of which an unknown number A (or unknown proportion P = A/N) has a particular attribute or characteristic. If a sample of size n is selected from the

entire universe and a of the sample units are observed to have the particular attribute or characteristic, what can be said about the value of A (or P)?

If the sample is a simple random sample, then this work can be used to

easily find exact one-sided and two-sided confidence bounds for A (or P) for

small values of N. The extent of the Table is indicated under Application I of

question 2 below. Exact tests of hypotheses and sample size determination for

estimation under simple random sampling can also be facilitated using this

volume. Conservative confidence intervals under stratified random sampling

can also be obtained.

Indeed major objectives of this volume are to be instructive and to provide an easy to use

reference. In order to increase usefulness, allow for flexibility, decrease the chance of the need for approximations, and provide exact results, a table that is responsive to the above setting and problem must be extensive to accommodate the many possible combinations of N,

n, a and confidence levels that are most likely to be encountered particularly with small

universes. An attempt has been made to provide exact bounds under simple random sampling for those cases where the approximations are generally not good, i.e., for small values of N,

for small values of n relative to N, and for small and large values of a relative to n.

What specific problems can be solved using this volume?

Eight possible applications of the Table in this volume are listed. Each application is discussed in Chapter 2 with examples.

Application I. Exact 100(1 - α)% one-sided lower and upper confidence bounds for A

under simple random sampling can be found easily in the Table where 1 - α is either .975 or .95 for the following combinations of N, n, and a where

N = the number of units in the finite universe,
n = the number of units in the simple random sample, and
a = the number of units in the sample with a particular attribute or characteristic.


The Table in Chapter 4 has six sections.

Table Section   N                 n         a (3)      a (4)         Pages

4.1             2(1)50 (1)        1(1)…     0(1)n/2    n/2(1)n       58 to 76
4.2             52(2)100 (2)      1(1)…     0(1)n/2    n/2(1)n       77 to 116
4.3             105(5)200         1(1)…     0(1)n/2    n/2(1)n       117 to 190
4.4             210(10)500        1(1)…     0(1)n/2    n/2(1)n       191 to 339
4.5             600(100)1000      1(1)60    0(1)n/2    n/2(1)n       340 to 378
                                  62(2)…    0(1)n/5    (4/5)n(1)n
4.6             1100(100)2000     2(2)…     0(1)n/5    (4/5)n(1)n    379 to 426

(1) 2(1)50 means that N varies from 2 to 50 in steps of 1.

(2) 52(2)100 means that N varies from 52 to 100 in steps of 2.

(3) Displayed in Table.

(4) From Table by subtraction using (2.5) and (2.6).

Actually, the 100(1 - α)% one-sided lower and upper confidence bounds that are provided are

the best in the sense that they give the shortest possible intervals for the given confidence level

1 - α.

Application II. Exact 100(1 - α)% two-sided confidence bounds for A under simple

random sampling can be found easily for given N, n, and a by using appropriate lower and

upper one-sided confidence bounds. For example, 95% two-sided confidence bounds can be

obtained using the 97.5% lower confidence bound with the 97.5% upper confidence bound for

the given N, n, and a.
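
The pairing of two one-sided bounds can be sketched in Python as follows. This is an illustrative sketch under assumed definitions, not the procedure used to build the Table: each one-sided bound is found by scanning A and keeping the values whose tail probability exceeds half of alpha, where alpha is one minus the two-sided confidence level. The names `tail` and `two_sided` are hypothetical.

```python
from math import comb

def tail(N, A, n, xs):
    """Sum of hypergeometric probabilities P(X = x | A) over the counts in xs,
    skipping counts that are impossible for this N, A, and n."""
    total = sum(comb(A, x) * comb(N - A, n - x)
                for x in xs if max(0, n - (N - A)) <= x <= min(n, A))
    return total / comb(N, n)

def two_sided(N, n, a, alpha):
    """Pair a 100(1 - alpha/2)% lower bound with a 100(1 - alpha/2)% upper bound."""
    half = alpha / 2
    lower = min(A for A in range(N + 1) if tail(N, A, n, range(a, n + 1)) > half)
    upper = max(A for A in range(N + 1) if tail(N, A, n, range(a + 1)) > half)
    return lower, upper

print(two_sided(20, 10, 5, 0.05))  # 95% two-sided bounds built from two 97.5% bounds
```

For N = 20, n = 10, and a = 5, the sketch returns the pair (6, 14), so the 95% two-sided statement is 6 ≤ A ≤ 14.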

Application III. Conservative but useful confidence bounds for A when N is not in this

volume but is between two values of N that are in this volume can be obtained easily. Similar

results can be obtained when a particular n is not in this volume but is between two values of

n that are.

Application IV. Exact one- and two-tailed α level tests under simple random sampling of

the hypotheses

H0: A = A0 against Ha: A ≠ A0,
H0: A ≤ A0 against Ha: A > A0, and
H0: A ≥ A0 against Ha: A < A0

can be performed easily, for various values of α including α = .025, .05, .1, etc.
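
As a hedged sketch of one such test, the Python code below tests the null hypothesis that A is at most a stated value A0 against the alternative that A exceeds A0. It rejects when the upper-tail probability of the observed count, computed with A set to A0, is at most alpha. Under the assumed definition of the lower confidence bound (the smallest A whose upper-tail probability exceeds alpha), this is the same as rejecting when A0 falls below that bound. The function names are hypothetical.

```python
from math import comb

def upper_tail(N, A0, n, a):
    """P(X >= a) under the hypergeometric distribution when the universe
    of N units holds exactly A0 units with the attribute."""
    lo = max(a, 0, n - (N - A0))   # smallest feasible count in the upper tail
    hi = min(n, A0)                # largest feasible count
    return sum(comb(A0, x) * comb(N - A0, n - x)
               for x in range(lo, hi + 1)) / comb(N, n)

def reject_upper(N, n, a, A0, alpha):
    """Reject 'A <= A0' in favor of 'A > A0' when the observed a would be
    improbably large if A were equal to A0."""
    return upper_tail(N, A0, n, a) <= alpha

print(reject_upper(20, 10, 5, 5, 0.05))  # is 'A <= 5' rejected after observing a = 5?
print(reject_upper(20, 10, 5, 6, 0.05))  # is 'A <= 6' rejected?
```

Evaluating the tail only at A0 is enough for the composite null hypothesis because the upper-tail probability increases with A, so A0 is the least favorable value.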


Application V. This volume can be used to determine the sample size n needed to estimate

A under simple random sampling without appealing to assumptions of normality (or some

other approximation) for any statistic.

Application VI. The analogous exact inferences and procedures noted in Applications I, II,

III, IV, and V for A can also be performed for P the universe (population) proportion under

simple random sampling.

Application VII. Conservative confidence bounds (both one- and two-sided) of A (or P)

for certain values of 1 - α can be obtained under stratified random sampling with four or less

strata. Hence, this volume can be used for much larger universe sizes with the use of

stratification as long as the number of units in each stratum does not exceed 2000.

Application VIII. Conservative confidence bounds for the difference A' - A'' (or P' - P'')

can also be provided when comparing two different universes.

What is meant by "exact" when sampling from finite universes?

Under simple random sampling without replacement from a finite universe, the

hypergeometric probability distribution is an appropriate distribution on which to base

statistical inferences. Because the hypergeometric probability distribution is discrete, it will be

rare that the confidence level 1 - α will be exactly equal to the actual coverage probabilities,

for example, as illustrated in Table 3.5 of Example 3.8 on page 51. However, the actual

coverage probabilities for the results in this volume will always be at least that of the stated

confidence level 1 - α, and the coverage probabilities will be as close to the stated confidence

level as possible using the hypergeometric distribution. Thus, when the phrases "exact

confidence bounds" or "exact confidence intervals" are used, they are referring to the use of

the hypergeometric distribution under simple random sampling instead of an approximation of

the hypergeometric distribution and to the fact that the actual coverage probability for our

confidence statement will always be at least the stated confidence level and that the excess

probability will be as small as possible.
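
The coverage statement can be checked numerically. The Python sketch below is an illustration under an assumed definition of the upper bound (the largest A with P(X ≤ a) > alpha) and uses hypothetical function names. For every possible true A it adds up the probabilities of the sample counts whose upper bound is at least A, then reports the smallest such coverage.

```python
from math import comb

def pmf(N, A, n, x):
    """Hypergeometric P(X = x); zero outside the feasible range of x."""
    if x < max(0, n - (N - A)) or x > min(n, A):
        return 0.0
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

def upper_bound(N, n, a, alpha):
    """Assumed definition: largest A with P(X <= a | A) > alpha."""
    return max(A for A in range(N + 1)
               if sum(pmf(N, A, n, x) for x in range(a + 1)) > alpha)

def min_coverage(N, n, alpha):
    """Smallest probability, over all possible true A, that the upper
    bound computed from the sample is at least A."""
    bounds = [upper_bound(N, n, x, alpha) for x in range(n + 1)]
    return min(
        sum(pmf(N, A, n, x) for x in range(n + 1) if bounds[x] >= A)
        for A in range(N + 1)
    )

print(min_coverage(20, 10, 0.05))  # worst-case coverage of the 95% upper bound
```

Because the distribution is discrete, the worst-case coverage lands somewhat above the nominal level rather than exactly on it, which is the sense of "exact" described above.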

Finally, the user should note the following point.

• While Applications V, VII, and VIII are theoretically correct, it is not clear that one cannot

provide better results for the finite population setting. Research is underway in search of

better results, and it is expected that others may provide better answers in the future

through the sampling theory literature.

TABLE OF CONTENTS

PREFACE

A NOTE TO USERS
    What is the purpose of this volume?
    What specific problems can be solved using this volume?
    What is meant by "exact" when sampling from finite universes?

1. INTRODUCTION

2. THE APPLICATIONS
    2.1. Application I. Exact 100(1 - α)% One-Sided Upper and Lower Confidence Bounds for A Under Simple Random Sampling
        Application I.1. 100(1 - α)% Upper Confidence Bound for A
        Application I.2. 100(1 - α)% Lower Confidence Bound for A
        Application I.3. When a Particular Value of a Is Not in the Table
    2.2. Application II. Exact 100(1 - α)% Two-Sided Confidence Bounds for A Under Simple Random Sampling
    2.3. Application III. Conservative Confidence Bounds for A Under Simple Random Sampling when N0 Is Not in the Table, but N0 Is Between Two Other Values of N That Are
        Application III.1. When a Particular Value n0 Is Not in the Table
    2.4. Application IV. Exact One- and Two-Sided α Level Tests of Hypotheses Under Simple Random Sampling
        Application IV.1. To Test H0: A = A0 Against Ha: A ≠ A0
        Application IV.2. To Test H0: A ≤ A0 Against Ha: A > A0
        Application IV.3. To Test H0: A ≥ A0 Against Ha: A < A0
    2.5. Application V. Sample Size Determination Under Simple Random Sampling
    2.6. Application VI. The Analogous Exact Inferences and Procedures of Applications I, II, III, IV, and V Can All Be Performed for P, the Universe (Population) Proportion, Under Simple Random Sampling
    2.7. Application VII. Conservative Confidence Bounds Under Stratified Random Sampling with Four or Less Strata
    2.8. Application VIII. Conservative Comparison of Two Universes

3. THE DEVELOPMENT AND THEORY
    3.1. Exact Hypothesis Testing for a Finite Universe
    3.2. Exact Confidence Interval Estimation for a Finite Universe
    3.3. Some Additional Results On One-Sided Confidence Bounds

4. THE TABLE OF LOWER AND UPPER CONFIDENCE BOUNDS
    4.1. N = 2(1)50
    4.2. N = 52(2)100
    4.3. N = 105(5)200
    4.4. N = 210(10)500
    4.5. N = 600(100)1000
    4.6. N = 1100(100)2000

APPENDIX. A SAS MACRO FOR GENERATING EXACT ONE-SIDED LOWER AND UPPER CONFIDENCE BOUNDS FOR A FOR STATED N, n, a, AND 1 - α

REFERENCES

INDEX
