schwarz statistics(cjs) -- 4(survey sampling)

Sampling, Regression, Experimental Design and Analysis for Environmental Scientists, Biologists, and Resource Managers

2006

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University

[email protected]

August 7, 2006

Contents

4 Sampling
  4.1 Introduction
    4.1.1 Difference between sampling and experimental design
    4.1.2 Why sample rather than census?
    4.1.3 Principal steps in a survey
    4.1.4 Probability sampling vs. non-probability sampling
    4.1.5 The importance of randomization in survey design
    4.1.6 Model vs Design based sampling
    4.1.7 Software
  4.2 Overview of Sampling Methods
    4.2.1 Simple Random Sampling
    4.2.2 Systematic Surveys
    4.2.3 Cluster sampling
    4.2.4 Multi-stage sampling
    4.2.5 Multi-phase designs
    4.2.6 Repeated Sampling
  4.3 Notation
  4.4 Simple Random Sampling Without Replacement (SRSWOR)
    4.4.1 Summary of main results
    4.4.2 Estimating the Population Mean
    4.4.3 Estimating the Population Total
    4.4.4 Estimating Population Proportions
    4.4.5 Example - estimating total catch of fish in a recreational fishery
      What is the population of interest?
      What is the frame?
      What is the sampling design and sampling unit?
      Excel analysis
      SAS analysis
      JMP Analysis
  4.5 Sample size determination for a simple random sample
    4.5.1 Example - How many anglers to survey
  4.6 Systematic sampling
    4.6.1 Advantages of systematic sampling


    4.6.2 Disadvantages of systematic sampling
    4.6.3 How to select a systematic sample
    4.6.4 Analyzing a systematic sample
    4.6.5 Technical notes - Repeated systematic sampling
      Example of replicated subsampling within a systematic sample
  4.7 Stratified simple random sampling
    4.7.1 A visual comparison of a simple random sample vs a stratified simple random sample
    4.7.2 Notation
    4.7.3 Summary of main results
    4.7.4 Example - sampling organic matter from a lake
    4.7.5 Example - estimating the total catch of salmon
      What is the population of interest?
      What is the sampling frame?
      What is the sampling design?
      Excel analysis
      SAS analysis
      JMP analysis
      When should the various estimates be used?
  4.8 Sample Size for Stratified Designs
    4.8.1 Total sample size
    4.8.2 Allocating samples among strata
    4.8.3 Example: Estimating the number of tundra swans.
    4.8.4 Multiple stratification
    4.8.5 Post-stratification
    4.8.6 Allocation and precision - revisited
  4.9 Ratio estimation in SRS - improving precision with auxiliary information
    4.9.1 Summary of Main results
    4.9.2 Example - wolf/moose ratio
      Excel analysis
      SAS Analysis
      JMP Analysis
      Post mortem
    4.9.3 Example - Grouse numbers - using a ratio estimator to estimate a population total
      Excel analysis
      SAS analysis
      JMP Analysis
      Post mortem - a question to ponder
  4.10 Additional ways to improve precision
    4.10.1 Using both stratification and auxiliary variables
    4.10.2 Regression Estimators
    4.10.3 Sampling with unequal probability - pps sampling
  4.11 Cluster sampling

©2006 Carl James Schwarz


    4.11.1 Sampling plan
    4.11.2 Advantages and disadvantages of cluster sampling compared to SRS
    4.11.3 Notation
    4.11.4 Summary of main results
    4.11.5 Example - estimating the density of urchins
      Excel Analysis
      SAS Analysis
      JMP Analysis
      Planning for future experiments
    4.11.6 Example - estimating the total number of sea cucumbers
  4.12 Multi-stage sampling - a generalization of cluster sampling
    4.12.1 Introduction
    4.12.2 Notation
    4.12.3 Summary of main results
    4.12.4 Example - estimating number of clams
      Excel Spreadsheet
      SAS Program
    4.12.5 Some closing comments on multi-stage designs
  4.13 Some final comments on descriptive surveys
    4.13.1 Unit size
    4.13.2 Key considerations when designing a survey
  4.14 Analytical surveys - almost experimental design
  4.15 References
  4.16 Frequently Asked Questions (FAQ)
    4.16.1 Confusion about the definition of a population
    4.16.2 How is N defined
    4.16.3 Multi-stage vs Multi-phase sampling
    4.16.4 What is the difference between a Population and a frame?
    4.16.5 How to account for missing transects.


Chapter 4

Sampling

4.1 Introduction

Today the word "survey" is used most often to describe a method of gathering information from a sample of individuals, animals, or areas. This "sample" is usually just a fraction of the population being studied.

You are exposed to survey results almost every day. For example, election polls, the unemployment rate, and the consumer price index are all results of surveys. On the other hand, some common headlines are NOT the results of surveys, but rather the results of experiments, for example, a report on whether a new drug is just as effective as an old drug.

Not only do surveys have a wide variety of purposes, they also can be conducted in many ways – including over the telephone, by mail, or in person. Nonetheless, all surveys do have certain characteristics in common. All surveys require a great deal of planning in order that the results are informative.

Unlike a census, where all members of the population are studied, surveys gather information from only a portion of a population of interest – the size of the sample depending on the purpose of the study. Surprisingly to many people, a survey can give better quality results than a census.

In a bona fide survey, the sample is not selected haphazardly. It is scientifically chosen so that each object in the population will have a measurable chance of selection. This way, the results can be reliably projected from the sample to the larger population.


CHAPTER 4. SAMPLING

Information is collected by means of standardized procedures. The survey’s intent is not to describe the particular objects that, by chance, are part of the sample but to obtain a composite profile of the population.

4.1.1 Difference between sampling and experimental design

There are two key differences between survey sampling and experimental design.

• In experiments, one deliberately perturbs some part of the population to see the effect of the action. In sampling, one wishes to see what the population is like without disturbing it.

• In experiments, the objective is to compare the mean response to changes in levels of the factors. In sampling, the objective is to describe the characteristics of the population. However, refer to the section on analytical surveys later in this chapter for cases where sampling looks very similar to experimental design.

4.1.2 Why sample rather than census?

There are a number of advantages of sampling over a complete census:

• reduced cost

• greater speed - a much smaller scale of operations is performed

• greater scope - if highly trained personnel or equipment is needed

• greater accuracy - easier to train a small crew, supervise them, and reduce data entry errors

• reduced respondent burden

• in destructive sampling you can’t measure the entire population - e.g. crash tests of cars

4.1.3 Principal steps in a survey

The principal steps in a survey are:



• formulate the objectives of the survey - need concise statement

• define the population to be sampled - e.g. what is the range of animals or locations to be measured? Note that the population is the set of final sampling units that will be measured - refer to the FAQ at the end of the chapter for more information.

• establish what data is to be collected - collect a few items well rather than many poorly

• determine the degree of precision required - examine power needed

• establish the frame - this is a list of sampling units that is exhaustive and exclusive

– in many cases the frame is obvious, but in others it is not

– it is often very difficult to establish a frame - e.g. a list of all streams in the lower mainland.

• choose among the various designs; will you stratify? There are a variety of sampling plans, some of which will be discussed in detail later in this chapter. Some common designs in ecological studies are:

– simple random sampling

– systematic sample

– cluster sampling

– multi-stage design

All designs can be improved by stratification, so this should always be considered during the design phase.

• pre-test - very important to try out field methods and questionnaires

• organization of field work - training, pre-test, etc

• summary and data analysis - easiest part if earlier parts done well

• post-mortem - what went well, poorly, etc.

4.1.4 Probability sampling vs. non-probability sampling

There are two types of sampling plans - probability sampling, where units are chosen in a ‘random fashion’, and non-probability sampling, where units are chosen in some deliberate fashion.

In probability sampling



• every unit has a known probability of being in the sample

• the sample is drawn with some method consistent with these probabilities

• these selection probabilities are used when making estimates from the sample
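The last point can be illustrated with a small sketch. One common way to use known selection probabilities is to weight each sampled value by the inverse of its probability of selection (an estimator of the Horvitz-Thompson form). The choice of this estimator, the function name `inverse_probability_total`, and the numbers are assumptions made for illustration; they are not taken from these notes.

```python
def inverse_probability_total(values, selection_probs):
    """Estimate a population total by weighting each sampled value by 1/probability."""
    if len(values) != len(selection_probs):
        raise ValueError("values and selection_probs must have the same length")
    return sum(y / p for y, p in zip(values, selection_probs))


# Hypothetical sample: three units with made-up values and selection probabilities.
sample_values = [12.0, 30.0, 8.0]
sample_probs = [0.1, 0.25, 0.1]

estimated_total = inverse_probability_total(sample_values, sample_probs)
print(estimated_total)
```

A unit selected with probability 0.1 "stands in" for about ten units of the population, which is why its value is multiplied by 10.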

The advantages of probability sampling are:

• we can study biases of the sampling plans

• standard errors and measures of precision (confidence limits) can be obtained

Some types of non-probability sampling plan include:

• quota sampling - select 50 M and 50 F from the population

– less expensive than a probability sample

– may be only option if no frame exists

• judgmental sampling - select ‘average’ or ‘typical’ value. This is a quick and dirty sampling method and can perform well if there are a few extreme points which should not be included.

• convenience sampling - select those readily available. This is useful if it is dangerous or unpleasant to sample directly. For example, selecting blood samples from grizzly bears.

• haphazard sampling (not the same as random sampling). This is often useful if the sampling material is homogeneous and spread throughout the population, e.g. chemicals in drinking water.

The disadvantages of non-probability sampling include

• unable to assess biases in any rational way.

• no estimates of precision can be obtained. In particular, the simple use of formulae from probability sampling is WRONG!

• experts may disagree on what is the “best” sample.



4.1.5 The importance of randomization in survey design

[With thanks to Dr. Rick Routledge for this part of the notes.]

. . . I had to make a ‘cover degree’ study... This involved the use of a Raunkiaer’s Circle, a device designed in hell. In appearance it was all simple innocence, being no more than a big metal hoop; but in use it was a devil’s mechanism for driving sane men mad. To use it, one stood on a stretch of muskeg, shut one’s eyes, spun around several times like a top, and then flung the circle as far away as possible. This complicated procedure was designed to ensure that the throw was truly ‘random’; but, in the event, it inevitably resulted in my losing sight of the hoop entirely, and having to spend an unconscionable time searching for the thing.

Farley Mowat, Never Cry Wolf. McClelland and Stewart, 1963.

Why would a field biologist in the early post-war period be instructed to follow such a bizarre-looking scheme for collecting a representative sample of tundra vegetation? Could she not have obtained a typical cross-section of the vegetation by using her own judgment? Undoubtedly, she could have convinced herself that by replacing an awkward, haphazard sampling scheme with one dependent solely on her own judgment and common sense, she could have been guaranteed a more representative sample. But would others be convinced? A careful, objective scientist is trained to be skeptical. She would be reluctant to accept any evidence whose validity depended critically on the judgment and skills of a stranger. The burden of proof would then rest squarely with Farley Mowat to prove his ability to take representative, judgmental samples. It is typically far easier for a scientist to use randomization in her sampling procedures than it is to prove her judgmental skills.

Hovering and Patrolling Bees

It is often difficult, if not impossible, to take a properly randomized sample. Consider, e.g., the problem faced by Alcock et al. (1977) in studying the behavior of male bees of the species Centris pallida in the deserts of the south-western United States. Females pupate in underground burrows. To maximize the presence of his genes in the next generation, a male of the species needs to mate with as many virgin females as possible. One strategy is to patrol the burrowing area at a low altitude, and nab an emerging female as soon as her presence is detected. This patrolling strategy seems to involve a relatively high risk of confrontation with other patrolling males. The other strategy reported by the authors is to hover farther above the burrowing area, and mate with those females who escape detection by the patrollers. These hoverers appear to be involved in fewer conflicts.



Because the hoverers tend to be less involved in aggressive confrontations, one might guess that they would tend to be somewhat smaller than the more aggressive patrollers. To assess this hypothesis, the authors took measurements of head widths for each of the two subpopulations. Of course, they could not capture every single male bee in the population. They had to be content with a sample.

Sample sizes and results are reported in the Table below. How are we to interpret these results? The sampled hoverers obviously tended to be somewhat smaller than the sampled patrollers, although it appears from the standard deviations that some hoverers were larger than the average-sized patroller and vice-versa. Hence, the difference is not overwhelming, and may be attributable to sampling errors.

Table: Summary of head width measurements on two samples of bees.

Sample        n      Mean (ȳ)    SD
Hoverers      50     4.92 mm     0.15 mm
Patrollers    100    5.14 mm     0.29 mm

If the sampling were truly randomized, then the only sampling errors would be chance errors, whose probable size can be assessed by a standard t-test. Exactly how were the samples taken? Is it possible that the sampling procedure used to select patrolling bees might favor the capture of larger bees, for example? This issue is indeed addressed by the authors. They carefully explain how they attempted to obtain unbiased samples. For example, to sample the patrolling bees, they made a sweep across the sampling area, attempting to catch all the patrolling bees that they observed. To assess the potential for bias, one must in the end make a subjective judgment.
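To make the chance-error calculation concrete, the summary statistics in the table of head widths can be plugged into a two-sample t statistic. This is only a sketch: the notes do not say which form of the t-test is meant, so the unequal-variance (Welch) form is assumed here because the two standard deviations differ, and the function name `welch_t` is hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    v1 = sd1 ** 2 / n1                      # squared standard error of the first mean
    v2 = sd2 ** 2 / n2                      # squared standard error of the second mean
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Patrollers versus hoverers, using the reported n, mean, and SD (in mm)
t_stat, df = welch_t(5.14, 0.29, 100, 4.92, 0.15, 50)
print(t_stat, df)
```

The statistic comes out at roughly 6, so chance error alone is an unlikely explanation for the difference if the samples were random. The calculation says nothing about whether the capture method favored larger bees; that question cannot be answered from the data.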

Why make all this fuss over a technical possibility? It is important to do so because lack of attention to such possibilities has led to some colossal errors in the past. Nowhere are they more obvious than in the field of election prediction. Most of us never find out the real nature of the population that we are sampling. Hence, we never know the true size of our errors. By contrast, pollsters’ errors are often painfully obvious. After the election, the actual percentages are available for everyone to see.

Lessons from Opinion Polling

In the 1930’s, political opinion polling was in its formative years. The pioneers in this endeavor were training themselves on the job. Of the inevitable errors, two were so spectacular as to make international headlines.

In 1936, an American magazine with a large circulation, The Literary Digest, attempted to poll an enormous segment of the American voting public in order to predict the outcome of the presidential election that autumn. Roosevelt, the Democratic candidate, promised to develop programs designed to increase opportunities for the disadvantaged; Landon, the candidate for the Republican Party, appealed more to the wealthier segments of American society. The Literary Digest mailed out questionnaires to about ten million people whose names appeared in such places as subscription lists, club directories, etc. They received over 2.5 million responses, on the basis of which they predicted a comfortable victory for Landon. The election returns soon showed the massive size of their prediction error.

The cumbersome design of this highly publicized survey provided a young, wily pollster with the chance of a lifetime. Between the time that the Digest announced its plans and released its predictions, George Gallup planned and executed a remarkable coup. By polling only a small fraction of these individuals, and a relatively small number of other voters, he correctly predicted not only the outcome of the election, but also the enormous size of the error about to be committed by The Literary Digest.

Obviously, the enormous sample obtained by the Digest was not very representative of the population. The selection procedure was heavily biased in favor of Republican voters. The most obvious source of bias is the method used to generate the list of names and addresses of the people that they contacted. In 1936, only the relatively affluent could afford magazines, telephones, etc., and the more conservative policies of the Republican Party appealed to a greater proportion of this segment of the American public. The Digest’s sample selection procedure was therefore biased in favor of the Republican candidate.

The Literary Digest was guilty of taking a sample of convenience. Samples of convenience are typically prone to bias. Any researcher who, either by choice or necessity, uses such a sample, has to be prepared to defend his findings against possible charges of bias. As this example shows, such bias can have catastrophic consequences.

How did Gallup obtain his more representative sample? He did not use randomization. Randomization is often criticized on the grounds that once in a while, it can produce absurdly unrepresentative samples. When faced with a sample that obviously contains far too few economically disadvantaged voters, it is small consolation to know that next time around, the error will likely not be repeated. Gallup used a procedure that virtually guaranteed that his sample would be representative with respect to such obvious features as age, race, etc. He did so by assigning quotas which his interviewers were to fill. One interviewer might, e.g., be assigned to interview 5 adult males with specified characteristics in a tough, inner-city neighborhood. The quotas were devised so as to make the sample mimic known features of the population.

This quota sampling technique suited Gallup’s needs spectacularly well in 1936 even though he underestimated the support for the Democratic candidate by about 6%. His subsequent polls contained the same systematic error. In 1948, the error finally caught up with him. He predicted a narrow victory for the Republican candidate, Dewey. A newspaper editor was so confident of the prediction that he authorized the printing of a headline proclaiming the victory before the official results were available. It turned out that the Democrat, Truman, won by a narrow margin.

What was wrong with Gallup’s sampling technique? He gave his interviewers the final decision as to who would be interviewed. In a tough inner-city neighborhood, an interviewer had the option of passing by a house with several motorcycles parked out in front and sounds of a raucous party coming from within. In the resulting sample, the more conservative (Republican) voters were systematically over-represented.

Gallup learned from his mistakes. His subsequent surveys replaced interviewer discretion with an objective, randomized scheme at the final stage of sample selection. With the dominant source of systematic error removed, his election predictions became even more reliable.

Implications for Biological Surveys

The bias in samples of convenience can be enormous. It can be surprisingly large even in what appear to be carefully designed surveys. It can easily exceed the typical size of the chance error terms. To completely remove the possibility of bias in the selection of a sample, randomization must be employed. Sometimes this is simply not possible, as, for example, appears to be the case in the study on bees. When this happens and the investigators wish to use the results of a nonrandomized sample, then the final report should discuss the possibility of selection bias and its potential impact on the conclusions.

Furthermore, when reading a report containing the results of a survey, it is important to carefully evaluate the survey design, and to consider the potential impact of sample selection bias on the conclusions.

Should Farley Mowat really have been content to take his samples by tossing Raunkiaer’s Circle to the winds? Definitely not, for at least two reasons. First, he had to trust that by tossing the circle, he was generating an unbiased sample. It is not at all clear that some types of vegetation would not be selected with a higher probability than others. For example, the higher shrubs would tend to intercept the hoop earlier in its descent than would the smaller herbs. Second, he has no guarantee that his sample will be representative with respect to the major habitat types. Leaving aside potential bias, it is possible that the circle could, by chance, land repeatedly in a snowbed community. It seems indeed foolish to use a sampling scheme which admits the possibility of including only snowbed communities when tundra bogs and fellfields may be equally abundant in the population. In subsequent chapters, we shall look into ways of taking more thoroughly randomized surveys, and into schemes for combining judgment with randomization for eliminating both selection bias and the potential for grossly unrepresentative samples. There are also circumstances in which a systematic sample (e.g. taking transects every 200 meters along a rocky shore line) may be justifiable, but this subject is not discussed in these notes.

4.1.6 Model vs Design based sampling

Model-based sampling starts by assuming some sort of statistical model for the data in the population and the goal is to select data to estimate the parameters of this distribution. For example, you may be willing to assume that the distribution of values in the population is log-normally distributed. The data collected from the survey are then used along with a likelihood function to estimate the parameters of the distribution.
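As a minimal sketch of the model-based idea, suppose the log-normal assumption is adopted. Maximizing the log-normal likelihood then reduces to taking the mean and standard deviation of the logged observations. The function name `lognormal_mle` and the data values are hypothetical and serve only to illustrate the calculation.

```python
import math


def lognormal_mle(values):
    """Maximum-likelihood estimates (mu, sigma) for a log-normal model."""
    logs = [math.log(v) for v in values]    # the model requires positive values
    n = len(logs)
    mu = sum(logs) / n
    # The ML estimate of sigma divides by n, not n - 1
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    return mu, sigma


observations = [1.2, 0.8, 3.5, 2.1, 0.5, 7.9, 1.6]   # made-up positive measurements
mu_hat, sigma_hat = lognormal_mle(observations)

# Under the model, the population mean is exp(mu + sigma^2 / 2)
model_mean = math.exp(mu_hat + sigma_hat ** 2 / 2)
print(mu_hat, sigma_hat, model_mean)
```

The estimate of the population mean depends entirely on the assumed distributional form, which is the source of both the power and the risk of the model-based approach.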

Model-based sampling is very powerful because you are willing to make a lot of assumptions about the data process. However, if your model is wrong, there are big problems. For example, what if you assume log-normality but data is not log-normally distributed? In these cases, the estimates of the parameters can be extremely biased and inefficient.

Design-based sampling makes no assumptions about the distribution of data values in the population. Rather it relies upon the randomization procedure to select representative elements of the population. Estimates from design-based methods are unbiased regardless of the distribution of values in the population, but in “strange” populations can also be inefficient. For example, if a population is highly clustered, a random sample of quadrats will end up with mostly zero observations and a few large values, and the resulting estimates will have a large standard error.
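The effect of clustering on precision can be seen in a small calculation. The sketch below assumes the usual expression for the standard error of a sample mean under simple random sampling without replacement; both populations of quadrat counts are made up for illustration, and the function name is hypothetical.

```python
import math


def srs_standard_error(population, n):
    """Exact SE of the sample mean for a simple random sample of size n (without replacement)."""
    N = len(population)
    mean = sum(population) / N
    s2 = sum((y - mean) ** 2 for y in population) / (N - 1)
    return math.sqrt(s2 / n * (1 - n / N))


# Two hypothetical populations of 100 quadrats, each with a mean count of 10 per quadrat
clustered = [200] * 5 + [0] * 95        # animals concentrated in 5 quadrats
even = [9, 11] * 50                     # animals spread almost uniformly

n = 10
se_clustered = srs_standard_error(clustered, n)
se_even = srs_standard_error(even, n)
print(se_clustered, se_even)
```

Both designs give unbiased estimates of the mean, but in the clustered population the standard error is larger than the mean itself, so a sample of 10 quadrats is nearly uninformative.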

Most of the results in this chapter on survey sampling are design-based, i.e. we don’t need to make any assumptions about normality in the population for the results to be valid.

4.1.7 Software

Unfortunately, there is no common, easy to use statistical package for the analysis of survey data. Fortunately, most of the computations are fairly straightforward so that many of the common packages, such as JMP or Excel, can be used to analyze survey data.

SAS includes survey design procedures, but these are not covered in this course.

For a review of packages that can be used to analyze survey data please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html.

CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES

Standard statistical software packages generally do not take into account four common characteristics of sample survey data: (1) unequal probability selection of observations, (2) clustering of observations, (3) stratification, and (4) nonresponse and other adjustments. Point estimates of population parameters are impacted by the value of the analysis weight for each observation. These weights depend upon the selection probabilities and other survey design features such as stratification and clustering. Hence, standard packages will yield biased point estimates if the weights are ignored. Estimated variance formulas for point estimates based on sample survey data are impacted by clustering, stratification and the weights. By ignoring these aspects, standard packages generally underestimate the estimated variance of a point estimate, sometimes substantially so.

Most standard statistical packages can perform weighted analyses, usually via a WEIGHT statement added to the program code. Use of standard statistical packages with a weighting variable may yield the same point estimates for population parameters as sample survey software packages. However, the estimated variance often is not correct and can be substantially wrong, depending upon the particular program within the standard software package.

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

NOTE that SAS has specialized routines for the analysis of survey data that avoid these problems.

4.2 Overview of Sampling Methods

4.2.1 Simple Random Sampling

This is the basic method of selecting survey units. Each unit in the population is selected with equal probability and all possible samples are equally likely to be chosen. This is commonly done by listing all the members in the population (the set of sampling units) and then choosing units using a random number table.

An example of a simple random sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots, and a random sample of 24 plots is selected and analyzed using aerial photos. The map of the units selected might look like:



Units are usually chosen without replacement, i.e. each unit in the population can only be chosen once. In some cases (particularly for multi-stage designs), there are advantages to selecting units with replacement, i.e. a unit in the population may potentially be selected more than once. The analysis of a simple random sample is straightforward. The mean of the sample is an estimate of the population mean. An estimate of the population total is obtained by multiplying the sample mean by the number of units in the population. The sampling fraction, the proportion of units chosen from the entire population, is typically small. If it exceeds 5%, an adjustment (the finite population correction) will result in better estimates of precision (a reduction in the standard error) to account for the fact that a substantial fraction of the population was surveyed.
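The selection and analysis steps just described can be sketched in a few lines. The example reuses the forest stand of 480 plots with a sample of 24; the plot numbering, the seed, the measurements in `cover`, and the function name `srs_estimates` are all made up for illustration, and the standard-error formula with the finite population correction is the usual one for sampling without replacement.

```python
import math
import random
import statistics


def srs_estimates(sample, population_size):
    """Estimate the population mean and total from a simple random sample (no replacement)."""
    n = len(sample)
    mean = statistics.mean(sample)
    fpc = 1 - n / population_size               # finite population correction
    se_mean = math.sqrt(statistics.variance(sample) / n * fpc)
    return {
        "mean": mean,
        "se_mean": se_mean,
        "total": population_size * mean,        # sample mean times number of units
        "se_total": population_size * se_mean,
    }


N = 480                                         # plots in the stand, numbered 1..480
rng = random.Random(2006)                       # fixed seed so the draw is repeatable
selected_plots = sorted(rng.sample(range(1, N + 1), 24))

# Hypothetical measurements on the 24 selected plots (made-up values)
cover = [31, 28, 35, 40, 22, 30, 27, 33, 38, 25, 29, 36,
         32, 26, 34, 30, 28, 37, 24, 31, 33, 29, 35, 27]
result = srs_estimates(cover, N)
print(selected_plots)
print(result)
```

Here the sampling fraction is 24/480 = 5%, so the correction shrinks the standard error only slightly; it matters more as the fraction grows.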

A simple random sample design is often ‘hidden’ in the details of many other survey designs. For example, many surveys of vegetation are conducted using strip transects where the initial starting point of the transect is randomly chosen, and then every plot along the transect is measured. Here the strips are the sampling unit, and are a simple random sample from all possible strips. The individual plots are subsamples from each strip and cannot be regarded as independent samples. For example, suppose a rectangular stand is surveyed using aerial overflights. In many cases, random starting points along one edge are selected, and the aircraft then surveys the entire length of the stand starting at the chosen point. The strips are typically analyzed section-by-section, but it would be incorrect to treat the smaller parts as a simple random sample from the entire stand.

Note that a crucial element of simple random samples is that every sampling unit is chosen independently of every other sampling unit. For example, in strip transects plots along the same transect are not chosen independently - when a particular transect is chosen, all plots along the transect are sampled and so the selected plots are not a simple random sample of all possible plots. Strip-transects are actually examples of cluster-samples. Cluster samples are discussed in greater detail later in this chapter.

4.2.2 Systematic Surveys

In some cases, it is logistically inconvenient to randomly select sample units from the population. An alternative is to take a systematic sample where every kth unit is selected (after a random starting point); k is chosen to give the required sample size. For example, if a stream is 2 km long, and 20 samples are required, then k = 100 and samples are chosen every 100 m along the stream after a random starting point. A common alternative when the population does not naturally divide into discrete units is grid-sampling. Here sampling points are located using a grid that is randomly located in the area. All sampling points are a fixed distance apart.
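The selection rule for the stream example can be sketched in a few lines of Python. This is only an illustration, assuming the population size is an exact multiple of the sample size; the function name `systematic_sample` and the fixed seed are hypothetical choices made here, not part of the original notes.

```python
import random

def systematic_sample(population_size, sample_size, rng=random):
    """Return positions of a 1-in-k systematic sample (0-based).

    Assumes population_size is an exact multiple of sample_size.
    """
    k = population_size // sample_size      # sampling interval
    start = rng.randrange(k)                # random start within the first interval
    return [start + i * k for i in range(sample_size)]

# Stream example: 2 km = 2000 m and 20 samples give k = 100 m
positions_m = systematic_sample(2000, 20, random.Random(1))
```

Every position is exactly k = 100 m from its neighbour; only the starting point is random.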

An example of a systematic sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots. As a total sample size of 24 is required, this implies that we need to sample every 480/24 = 20th plot. We pick a random starting point (the 9th plot) in the first row, and then every 20th plot reading across rows. The final plan could look like:



If a known trend is present in the sample, this can be incorporated into the analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic sample follows an elevation gradient that is known to directly influence the response variable. A regression-type correction can be incorporated into the analysis. However, note that this trend must be known from external sources - it cannot be deduced from the survey.

Pitfall: A systematic sample is typically analyzed in the same fashion as a simple random sample. However, the true precision of an estimator from a systematic sample can be either worse or better than a simple random sample of the same size, depending on whether units within the systematic sample are positively or negatively correlated among themselves. For example, if a systematic sample’s sampling interval happens to match a cyclic pattern in the population, values within the systematic sample are highly positively correlated (the sampled units may all hit the ‘peaks’ of the cyclic trend), and the true sampling precision is worse than a SRS of the same size. What is even more unfortunate is that because the units are positively correlated within the sample, the sample variance will underestimate the true variation in the population, and if the estimated precision is computed using the formula for a SRS, a double dose of bias in the estimated precision occurs (Krebs, 1989, p. 227). On the other hand, if the systematic sample is arranged ‘perpendicular’ to a known trend to try and incorporate additional variability in the sample, the units within a sample are now negatively correlated, the true precision is now better than a SRS of the same size, but the sample variance now overestimates the population variance, and the formula for precision from a SRS will overstate the sampling error. While logistically simpler, a systematic sample is only ‘equivalent’ to a simple random sample of the same size if the population units are ‘in random order’ to begin with (Krebs, 1989, p. 227). Even worse, there is no information in the systematic sample that allows the manager to check for hidden trends and cycles.
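The cyclic case can be illustrated with a small numerical sketch. The population below is entirely hypothetical (a sine wave whose period equals the sampling interval), chosen as a worst case; the code enumerates every possible systematic sample, so no random numbers are involved.

```python
import math
from statistics import mean, pvariance, variance

N, k = 200, 20                  # hypothetical population size and sampling interval
n = N // k                      # units in each systematic sample

# Hypothetical population: a cycle of period 20 units, matching the interval
pop = [10 + 5 * math.sin(2 * math.pi * i / 20) for i in range(N)]

# All k possible systematic samples, one per starting point
samples = [pop[start::k] for start in range(k)]
sample_means = [mean(s) for s in samples]

# True sampling variance of the mean over the k equally likely samples
true_var_systematic = pvariance(sample_means)

# What the SRS formula would report if applied to each systematic sample
srs_formula_var = [variance(s) / n * (1 - n / N) for s in samples]

# True sampling variance of the mean for a genuine SRS of the same size
true_var_srs = variance(pop) / n * (1 - n / N)
```

In this constructed example every systematic sample hits the same point of the cycle, so the SRS formula reports a variance of essentially zero while the sample means actually differ greatly from one starting point to the next. This is the ‘double dose’ of bias described above.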

Nevertheless, systematic samples do offer some practical advantages over SRS if some correction can be made to the bias in the estimated precision:

• it is easier to relocate plots for long term monitoring

• mapping can be carried out concurrently with the sampling effort because the ground is systematically traversed. This is less of an issue now with GPS as the exact position can easily be recorded and the plots revisited later.

• it avoids the problem of poorly distributed sampling units which can occur with a SRS [but this can also be avoided by judicious stratification.]

Solution: Because of the necessity for a strong assumption of ‘randomness’ in the original population, systematic samples are discouraged and statistical advice should be sought before starting such a scheme. If there are no other feasible designs, a slight variation in the systematic sample provides some protection from the above problems. Instead of taking a single systematic sample every kth unit, take 2 or 3 independent systematic samples of every 2kth or 3kth unit, each with a different starting point. For example, rather than taking a single systematic sample every 100 m along the stream, two independent systematic samples can be taken, each selecting units every 200 m along the stream starting at two random starting points. The total sample effort is still the same, but now some measure of the large scale spatial structure can be estimated. This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
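A rough sketch of the replicated design for the stream example follows. The function names are hypothetical, and the standard error shown is a simple between-replicate calculation without a finite population correction, which is an assumption made here rather than a formula given in these notes. With only 2 or 3 replicates such a standard error is itself very imprecise.

```python
import random
from statistics import mean, stdev

def replicated_systematic(length, k, replicates, rng):
    """Pick `replicates` systematic samples, each taking every (replicates*k)th unit."""
    step = replicates * k
    starts = rng.sample(range(step), replicates)   # distinct random starting points
    return [list(range(s, length, step)) for s in sorted(starts)]

def replicate_estimate(replicate_means):
    """Overall mean and a between-replicate standard error (no fpc applied)."""
    r = len(replicate_means)
    return mean(replicate_means), stdev(replicate_means) / r ** 0.5

# Two independent samples, each every 200 m along a 2000 m stream
reps = replicated_systematic(2000, 100, 2, random.Random(7))
```

Each replicate contributes one mean, and the spread among the replicate means is what carries the information about precision.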

4.2.3 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.

The reason cluster samples are used is that costs can be reduced compared to a simple random sample giving the same precision. Because units within a cluster are close together, travel costs among units are reduced. Consequently, more clusters (and more total units) can be surveyed for the same cost as a comparable simple random sample.

For example, consider the vegetation survey of previous sections. The 480 plots can be divided into 60 clusters of size 8. A total sample size of 24 is obtained by randomly selecting three clusters from the 60 clusters present in the map, and then surveying ALL eight members of the selected clusters. A map of the design might look like:



Alternatively, clusters are often formed when a transect sample is taken. For example, suppose that the vegetation survey picked an initial starting point on the left margin, and then flew completely across the landscape in a straight line measuring all plots along the route. A map of the design might look like:



In this case, there are three clusters chosen from a possible 30 clusters and the clusters are of unequal size (the middle cluster only has 12 plots measured compared to the 18 plots measured on the other two transects).

Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: In order to be confident that the reported standard error really reflects the uncertainty of the estimate, it is important that the analytical methods are appropriate for the survey design. The proper analysis treats the clusters as a random sample from the population of clusters. The methods of simple random samples are applied to the cluster summary statistics (Thompson, 1992, Chapter 12).
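As a minimal sketch of this idea, the cluster totals can be treated as the observations and the simple random sample formulas applied to them. The cluster totals below are made-up numbers for the 3-of-60 design, and the function name is hypothetical; other estimators (for example, those that use cluster sizes) may be preferable when clusters differ greatly in size.

```python
from statistics import mean, variance

def cluster_total_estimate(cluster_totals, n_clusters_in_pop):
    """Estimate a population total from a SRS of clusters.

    The SRS formulas are applied to the cluster totals, not to the
    individual units within clusters.
    """
    m = len(cluster_totals)          # clusters sampled
    M = n_clusters_in_pop            # clusters in the population
    est = M * mean(cluster_totals)
    se = M * (variance(cluster_totals) / m * (1 - m / M)) ** 0.5
    return est, se

# Hypothetical totals for 3 sampled clusters out of 60
est, se = cluster_total_estimate([40, 52, 46], 60)
```

Note that the effective sample size here is 3 (clusters), not 24 (plots), which is why the standard error is larger than a naive plot-level analysis would suggest.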

4.2.4 Multi-stage sampling

In many situations, there are natural divisions of the population into several different sizes of units. For example, a forest management unit consists of several stands, each stand has several cutblocks, and each cutblock can be divided into plots. These divisions can be easily accommodated in a survey through the use of multi-stage methods. Selection of units is done in stages. For example, several stands could be selected from a management area; then several cutblocks are selected in each of the chosen stands; then several plots are selected in each of the chosen cutblocks. Note that in a multi-stage design, units at any stage are selected at random only from those larger units selected in previous stages.

Again consider the vegetation survey of previous sections. The population is again divided into 60 clusters of size 8. However, rather than surveying all units within a cluster, we decide to survey only two units within each cluster. Hence, at the first stage we now sample a total of 12 clusters out of the 60. In each cluster, we randomly sample 2 of the 8 units. A sample plan might look like the following where the rectangles indicate the clusters selected, and the checks indicate the sub-sample taken from each cluster:
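A two-stage selection of this kind could be generated along the following lines. This is an illustrative sketch only: the function name, the 0-based labelling of clusters and units, and the seed are all assumptions made here.

```python
import random

def two_stage_sample(n_clusters, cluster_size, m, units_per_cluster, rng):
    """Stage 1: SRS of m clusters. Stage 2: SRS of units within each chosen cluster."""
    chosen = sorted(rng.sample(range(n_clusters), m))
    return {c: sorted(rng.sample(range(cluster_size), units_per_cluster))
            for c in chosen}

# 12 of the 60 clusters, then 2 of the 8 units within each selected cluster
plan = two_stage_sample(60, 8, 12, 2, random.Random(3))
```

The total sample size is still 24 plots, but they are now spread over 12 clusters rather than 3.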



The advantage of multi-stage designs is that costs can be reduced compared to a simple random sample of the same size, primarily through improved logistics. The precision of the results is worse than an equivalent simple random sample, but because costs are less, a larger multi-stage survey can often be done for the same costs as a smaller simple random sample. This often results in a more precise estimate for the same cost. However, due to the misuse of data from complex designs, simple designs are often highly preferred and end up being more cost efficient when costs associated with incorrect decisions are incorporated.

Pitfall: Although random selections are made at each stage, a common error is to analyze these types of surveys as if they arose from a simple random sample. The plots were not independently selected; if a particular cutblock was not chosen, then none of the plots within that cutblock can be chosen. As in cluster samples, the consequences of this erroneous analysis are that the estimated standard errors are too small and do not fully reflect the actual imprecision in the estimates. A manager will be more confident in the estimate than is justified by the survey.

Solution: Again, it is important that the analytical methods are suitable for the sampling design. The proper analysis of multi-stage designs takes into account that random sampling takes place at each stage (Thompson, 1992, Chapter 13). In many cases, the precision of the estimates is determined essentially by the number of first stage units selected. Little is gained by extensive sampling at lower stages.

4.2.5 Multi-phase designs

In some surveys, multiple surveys of the same survey units are performed. In the first phase, a sample of units is selected (usually by a simple random sample). Every unit is measured on some variable. Then in subsequent phases, samples are selected ONLY from those units selected in the first phase, not from the entire population.

For example, refer back to the vegetation survey. An initial sample of 24 plots is chosen in a simple random survey. Aerial flights are used to quickly measure some characteristic of the plots. A second phase sample of 6 units (circled below) is then measured using ground based methods.



Multiphase designs are commonly used in two situations. First, it is sometimes difficult to stratify a population in advance because the values of the stratification variables are not known. The first phase is used to measure the stratification variable on a random sample of units. The selected units are then stratified, and further samples are taken from each stratum as needed to measure a second variable. This avoids having to measure the second variable on every unit when the strata differ in importance. For example, in the first phase, plots are selected and measured for the amount of insect damage. The plots are then stratified by the amount of damage, and second phase allocation of units concentrates on plots with low insect damage to measure total usable volume of wood. It would be wasteful to measure the volume of wood on plots with much insect damage.

The second common occurrence is when it is relatively easy to measure a surrogate variable (related to the real variable of interest) on selected units, and then in the second phase, the real variable of interest is measured on a subset of the units. The relationship between the surrogate and desired variable in the smaller sample is used to adjust the estimate based on the surrogate variable in the larger sample. For example, managers need to estimate the volume of wood removed from a harvesting area. A large sample of logging trucks is weighed (which is easy to do), and weight will serve as a surrogate variable for volume. A smaller sample of trucks (selected from those weighed) is scaled for volume and the relationship between volume and weight from the second phase sample is used to predict volume based on weight only for the first phase sample. Another example is the count plot method of estimating volume of timber in a stand. A selection of plots is chosen and the basal area determined. Then a sub-selection of plots is chosen in the second phase, and volume measurements are made on the second phase plots. The relationship between volume and area in the second phase is used to predict volume from area measurements seen in the first phase.
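One simple way to use the surrogate relationship is a ratio adjustment, sketched below for the logging-truck example. The data are invented, the function name is hypothetical, and a ratio through the origin is assumed here for illustration; the notes do not specify which form of relationship (ratio or regression) is used in practice.

```python
from statistics import mean

def two_phase_ratio_mean(phase1_x, phase2_x, phase2_y):
    """Ratio-adjusted mean of y.

    phase1_x : surrogate values on the large first-phase sample
    phase2_x : surrogate values on the second-phase subsample
    phase2_y : variable of interest on the second-phase subsample
    """
    ratio = mean(phase2_y) / mean(phase2_x)   # e.g. volume per unit weight
    return ratio * mean(phase1_x)

# Hypothetical truck weights (tonnes) and scaled volumes (cubic metres)
weights = [30.0, 28.0, 35.0, 32.0, 25.0, 30.0]   # all weighed trucks
sub_weights = [30.0, 35.0, 25.0]                 # trucks also scaled
sub_volumes = [36.0, 42.0, 30.0]
est_mean_volume = two_phase_ratio_mean(weights, sub_weights, sub_volumes)
```

The estimated mean volume per truck uses the weights of all trucks in the first phase, not just those that were scaled.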

4.2.6 Repeated Sampling

One common objective of long-term studies is to investigate changes over time of a particular population. This will involve repeated sampling from the population. There are three common designs.

First, separate independent surveys can be conducted at each time point. This is the simplest design to analyze because all observations are independent over time. For example, independent surveys can be conducted at five year intervals to assess regeneration of cutblocks. However, precision of the estimated change may be poor because of the additional variability introduced by having new units sampled at each time point.

At the other extreme, units are selected in the first survey and the same units are remeasured over time. For example, permanent study plots can be established that are remeasured for regeneration over time. The advantage of permanent study plots is that comparisons over time are free of additional variability introduced by new units being measured at every time point. One possible problem is that survey units may become ‘damaged’ over time, and the sample size will tend to decline over time. An analysis of these types of designs is more complex because of the need to account for the correlation over time of measurements on the same sample plot and the need to account for possible missing values when units become ‘damaged’ and are dropped from the study.

Intermediate to the above two designs are partial replacement designs where a portion of the survey units are replaced with new units at each time point. For example, 1/5 of the units could be replaced by new units at each time point - units would normally stay in the study for a maximum of 5 time periods. The analysis of these types of designs is very complex.

4.3 Notation

Unfortunately, sampling theory has developed its own notation that is different from that used for design of experiments or other areas of statistics even though the same concepts are used in both. It would be nice to adopt a general convention for all of statistics - maybe in 100 years this will happen.

Even among sampling textbooks, there is no agreement on notation! (sigh).

In the table below, I’ve summarized the “usual” notation used in sampling theory. In general, large letters refer to population values, while small letters refer to sample values.

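Characteristic       Population value                                  Sample value
number of elements   $N$                                               $n$
units                $Y_i$                                             $y_i$
total                $\tau = \sum_{i=1}^{N} Y_i$                       $y = \sum_{i=1}^{n} y_i$
mean                 $\mu = \frac{1}{N}\sum_{i=1}^{N} Y_i$             $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
proportion           $P = \frac{\tau}{N}$                              $p = \frac{y}{n}$
variance             $S^2 = \sum_{i=1}^{N}\frac{(Y_i-\mu)^2}{N-1}$     $s^2 = \sum_{i=1}^{n}\frac{(y_i-\bar{y})^2}{n-1}$
variance of a prop   $S^2 = \frac{N}{N-1}P(1-P)$                       $s^2 = \frac{np(1-p)}{n-1}$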

Note:

• The population mean is sometimes denoted as $\bar{Y}$ in many books.

• The population total is sometimes denoted as $Y$ in many books.

• Again note the distinction between the population quantity (e.g. the population mean µ) and the corresponding sample quantity (e.g. the sample mean $\bar{y}$).

4.4 Simple Random Sampling Without Replacement (SRSWOR)

This forms the basis of many other more complex sampling plans and is the ‘gold standard’ against which all other sampling plans are compared. It often happens that more complex sampling plans consist of a series of simple random samples that are combined in a complex fashion.

In this design, once the frame of units has been enumerated, a sample of size n is selected without replacement from the N population units.

Refer to the previous sections for an illustration of how the units will be selected.

4.4.1 Summary of main results

It turns out that for a simple random sample, the sample mean ($\bar{y}$) is the best estimator for the population mean (µ). The population total is estimated by multiplying the sample mean by the POPULATION size. And, a proportion is estimated by simply coding results as 0 or 1 depending on whether the sampled unit belongs to the class of interest, and taking the mean of these 0,1 values. (Yes, this really does work - refer to a later section for more details).

As with every estimate, a measure of precision is required. We saw in an earlier chapter that the standard error (se) is such a measure. Recall that the standard error measures how variable the results of our survey would be if the survey were to be repeated. The standard error for the sample mean looks very similar to that for a sample mean from a completely randomized design (refer to later chapters) with a common correction of a finite population factor (the (1 − f) term).

The standard error for the population total estimate is found by multiplying the standard error for the mean by the POPULATION SIZE.

The standard error for a proportion is found again by treating each data value as 0 or 1 and applying the same formula as the standard error for a mean.

The following table summarizes the main results:



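Parameter    Population value   Estimator                                        Estimated se
Mean         $\mu$              $\hat{\mu} = \bar{y}$                            $\sqrt{\frac{s^2}{n}(1-f)}$
Total        $\tau$             $\hat{\tau} = N \times \hat{\mu} = N\bar{y}$     $N \times se(\hat{\mu}) = N\sqrt{\frac{s^2}{n}(1-f)}$
Proportion   $P$                $\hat{P} = p = \bar{y}_{0/1} = \frac{y}{n}$      $\sqrt{\frac{p(1-p)}{n-1}(1-f)}$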

Notes:

• Inflation factor The term N/n is called the inflation factor and the estimator for the total is sometimes called the expansion estimator or the simple inflation estimator.

• Sampling weight Many statistical packages that analyze survey data will require the specification of a sampling weight. A sampling weight represents how many units in the population are represented by this unit in the sample. In the case of a simple random sample, the sampling weight is also equal to N/n. For example, if you select 10 units at random from 150 units in the population, the sampling weight for each observation is 15, i.e. each unit in the sample represents 15 units in the population. The sampling weights are computed differently for various designs so won’t always be equal to N/n.

• Sampling fraction The term n/N is called the sampling fraction and is denoted as f.

• Finite population correction (fpc) The term (1 − f) is called the finite population correction factor and reflects that if you sample a substantial part of the population, the variance of the estimator is smaller than what would be expected from experimental design results. If f is less than 5%, this is often ignored.
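The formulas in the summary table and notes translate directly into a few small functions. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not taken from any survey package.

```python
from statistics import mean, variance

def srs_mean(y, N):
    """Sample mean and its estimated se under SRSWOR from a population of N units."""
    n = len(y)
    f = n / N                                   # sampling fraction
    se = (variance(y) / n * (1 - f)) ** 0.5     # includes the fpc (1 - f)
    return mean(y), se

def srs_total(y, N):
    """Expansion estimator of the population total and its estimated se."""
    ybar, se = srs_mean(y, N)
    return N * ybar, N * se

def srs_proportion(flags, N):
    """Proportion via 0/1 coding; the same formula as for a mean is applied."""
    return srs_mean([1 if flag else 0 for flag in flags], N)

# Sampling weight for the example in the notes: 10 units chosen from 150
weight = 150 / 10
```

Because a proportion is just the mean of 0/1 values, `srs_proportion` needs no formula of its own.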

4.4.2 Estimating the Population Mean

The first line of the above table shows the “basic” results and all the remaining lines in the table can be derived from this line as will be shown later.

The population mean (µ) is estimated by the sample mean ($\bar{y}$). The estimated se of the sample mean is

$$se(\bar{y}) = \sqrt{\frac{s^2}{n}(1-f)} = \frac{s}{\sqrt{n}}\sqrt{(1-f)}$$



Note that if the sampling fraction (f) is small, then the standard error of the sample mean can be approximated by:

$$se(\bar{y}) \approx \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$$

which is the familiar form seen previously. In general, the standard error formula changes depending upon the sampling method used to collect the data and the estimator used on the data. Every different sampling design has its own way of computing the estimator and se.

Confidence intervals for parameters are computed in the usual fashion, i.e. an approximate 95% confidence interval would be found as: estimator ± 2 se. Some textbooks use a t-distribution for smaller sample sizes, but most surveys are sufficiently large that this makes little difference.

4.4.3 Estimating the Population Total

Many students find this part confusing, because of the term population total. This does NOT refer to the total number of units in the population, but rather the sum of the individual values over the units. For example, if you are interested in estimating total timber volume in an inventory unit, the trees are the sampling units. A sample of trees is selected to estimate the mean volume per tree. The total timber volume over all trees in the inventory unit is of interest, not the total number of trees in the inventory unit.

As the population total is found by Nµ (total population size times the population mean), a natural estimator is formed by the product of the population size and the sample mean, i.e. $\widehat{TOTAL} = \hat{\tau} = N\bar{y}$. Note that you must multiply by the population size not the sample size.

Its estimated se is found by multiplying the estimated se for the sample mean by the population size as well, i.e.,

$$se(\hat{\tau}) = N\sqrt{\frac{s^2}{n}(1-f)}$$

In general, estimates for population totals in most sampling designs are found by multiplying estimates of population means by the population size.

Confidence intervals are found in the usual fashion.



4.4.4 Estimating Population Proportions

A “standard trick” used in survey sampling when estimating a population proportion is to replace the response variable by a 0/1 code and then treat this coded data in the same way as ordinary data.

For example, suppose you were interested in the proportion of fish in a catch that was of a particular species. A sample of 10 fish was selected (of course in the real world, a larger sample would be taken), and the following data were observed (S=sockeye, C=chum):

S C C S S S S C S S

Of the 10 fish sampled, 3 were chum so that the sample proportion of fish that were chum is 3/10 = 0.30.

If the data are recoded using 1=Chum, 0=Sockeye, the sample values would be:

0 1 1 0 0 0 0 1 0 0

The sample average of these numbers gives $\bar{y}$ = 3/10 = 0.30 which is exactly the proportion seen.

It is not surprising then that by recoding the sample using 0/1 variables, the first line in the summary table reduces to the last line in the summary table. In particular, $s^2$ reduces to $np(1-p)/(n-1)$ resulting in the se seen above.
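The reduction of $s^2$ to $np(1-p)/(n-1)$ can be checked numerically with the ten fish listed above. This is a small verification sketch using the Python standard library; the variable names are arbitrary.

```python
from statistics import mean, variance

catch = "S C C S S S S C S S".split()           # the 10 fish from the example
coded = [1 if fish == "C" else 0 for fish in catch]   # 1 = chum, 0 = sockeye

n = len(coded)
p = mean(coded)                                  # sample proportion of chum
s2_usual = variance(coded)                       # ordinary sample variance of the 0/1 values
s2_shortcut = n * p * (1 - p) / (n - 1)          # shortcut formula for 0/1 data
```

Both routes give the same value for the sample variance, which is why the formula for the se of a mean can be applied unchanged to the 0/1 data.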

Confidence intervals are computed in the usual fashion.

4.4.5 Example - estimating total catch of fish in a recreational fishery

This will illustrate the concepts in the previous sections using a very small illustrative example.

For management purposes, it is important to estimate the total catch by recreational fishers. Unfortunately, there is no central registry of fishers, nor is there a central reporting station. Consequently, surveys are often used to estimate the total catch.



There are two common survey designs used in these types of surveys (generically called creel surveys). In access surveys, observers are stationed at access points to the fishery. For example, if fishers go out in boats to catch the fish, the access points are the marinas where the boats are launched and are returned. From these access points, a sample of fishers is selected and interviews conducted to measure the number of fish captured and other attributes. Roving surveys are commonly used when there is no common access point and you can move among the fishers. In this case, the observer moves about the fishery and questions anglers as they are encountered. Note that in this last design, the chances of encountering an angler are no longer equal - there is a greater chance of encountering an angler who has a longer fishing episode. And, you typically don’t encounter the angler at the end of the episode but somewhere in the middle of the episode. The analysis of roving surveys is more complex - seek help. The following example is based on a real life example from British Columbia. The actual survey is much larger involving several thousand anglers and sample sizes in the low hundreds, but the basic idea is the same.

An access survey was conducted to estimate the total catch at a lake in British Columbia. Fortunately, access to the lake takes place at a single landing site and most anglers use boats in the fishery. An observer was stationed at the landing site, but because of time constraints, could only interview a portion of the anglers returning, but was able to get a total count of the number of fishing parties on that day. A total of 168 fishing parties arrived at the landing during the day, of which 30 were sampled. The decision to sample a fishing party was made using a random number table as the boat returned to the dock.

The objectives are to estimate the total number of anglers and their catch and to estimate the proportion of boat trips (fishing parties) that had sufficient life-jackets for the members on the trip. Here is the raw data - each line is the results for a fishing party.

Number    Party    Sufficient
Anglers   Catch    Life Jackets?
1         1        yes
3         1        yes
1         2        yes
1         2        no
3         2        no
3         1        yes
1         0        no
1         0        no
1         1        yes
1         0        yes
2         0        yes
1         1        yes
2         0        yes
1         2        yes
3         3        yes
1         0        no
1         0        yes
2         0        yes
3         1        yes
1         0        yes
2         0        yes
1         1        yes
1         0        yes
1         0        yes
1         0        no
2         0        yes
2         1        no
1         1        no
1         0        yes
1         0        yes

What is the population of interest?

The population of interest is NOT the fish in the lake. The Fisheries Department is not interested in estimating the characteristics of the fish, such as mean fish weight or the number of fish in the lake. Rather, the focus is on the anglers and fishing parties. Refer to the FAQ at the end of the chapter for more details.

It would be tempting to conclude that the anglers on the lake are the population of interest. However, note that information is NOT gathered on individual anglers. For example, the number of fish captured by each angler in the party is not recorded - only the total fish caught by the party. Similarly, it is impossible to say if each angler had an individual life jacket - if there were 3 anglers in the boat and only two life jackets, which angler was without? 1

For this reason, the population of interest is taken to be the set of boats fishing at this lake. The fisheries agency doesn’t really care about the individual anglers because if a boat with 3 anglers catches one fish, the actual person who caught the fish is not recorded. Similarly, if there are only two life jackets, does it matter which angler didn’t have the jacket?

Under this interpretation, the design is a simple random sample of boats returning to the landing.

What is the frame?

The frame for a simple random sample is a listing of ALL the units in the population. This list is then used to randomly select which units will be measured. In this case, there is no physical list and the frame is conceptual. A random number table was used to decide which fishing parties to interview.

What is the sampling design and sampling unit?

The sampling design will be treated as if it were a simple random sample from all boats (fishing parties) returning, but in actual fact was likely a systematic sample or variant. As you will see later, this may or may not be a problem.

In many cases, special attention should be paid to identify the correct sampling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were selected, not individual anglers. The mistake of treating individuals as the sampling unit is often made when the data are presented on an individual basis rather than on a sampling unit basis. As you will see in later chapters, this is an example of pseudo-replication.

1 If data were collected on individual anglers, then the anglers could be taken as the population of interest. However, in this case, the design is NOT a simple random sample of anglers. Rather, as you will see later in the course, the design is a cluster sample where a simple random sample of clusters (boats) was taken and all members of the cluster (the anglers) were interviewed. As you will see later in the course, a cluster sample can be viewed as a simple random sample if you define the population in terms of clusters.



Excel analysis

As mentioned earlier, Excel should be used with caution in statistical analysis. However, for very simple surveys, it is an adequate tool.

A copy of a sample Excel workbook called creel.xls is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a condensed view of the spreadsheet within the workbook:



The analysis proceeds in a series of logical steps as illustrated for the number of anglers in each party variable.

Enter the data on the spreadsheet

The metadata (information about the survey) is entered at the top of the spreadsheet.

The actual data is entered in the middle of the sheet. One row is used to list the variables recorded for each angling party.

Obtain the required summary statistics.

At the bottom of the data, the summary statistics needed are computed using the Excel built-in functions. This includes the sample size, the sample mean, and the sample standard deviation.

Obtain estimates of the population quantity

Because the sample mean is the estimator for the population mean if the design is a simple random sample, no further computations are needed.

In order to estimate the total number of anglers, we multiply the average number of anglers in each fishing party (1.533 anglers/party) by the POPULATION SIZE (the number of fishing parties for the entire day = 168) to get the estimated total number of anglers (257.6).

Obtain estimates of precision - standard errors

The se for the sample mean is computed using the formula presented earlier. The estimated standard error OF THE MEAN is 0.128 anglers/party.

Because we found the estimated total by multiplying the estimate of the mean number of anglers/boat trip times the number of boat trips (168), the estimated standard error of the POPULATION TOTAL is found by multiplying the standard error of the sample mean by the same value, 0.128 × 168 = 21.5 anglers.

Hence, a 95% confidence interval for the total number of anglers fishing this day is found as 257.6 ± 2(21.5).
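The spreadsheet steps for the number of anglers can be reproduced with a short script. This is a sketch using only the Python standard library, with the angler counts typed in from the raw data table; the variable names are arbitrary. Carrying the unrounded se through gives a se for the total of about 21.6 rather than the 21.5 obtained from the rounded value 0.128.

```python
from statistics import mean, stdev

N = 168                                   # fishing parties on the day (population size)
anglers = [1, 3, 1, 1, 3, 3, 1, 1, 1, 1,
           2, 1, 2, 1, 3, 1, 1, 2, 3, 1,
           2, 1, 1, 1, 1, 2, 2, 1, 1, 1]  # anglers per sampled party

n = len(anglers)                          # sample size (30)
f = n / N                                 # sampling fraction
ybar = mean(anglers)                      # mean anglers per party
se_mean = stdev(anglers) / n ** 0.5 * (1 - f) ** 0.5

total = N * ybar                          # estimated total number of anglers
se_total = N * se_mean
ci_95 = (total - 2 * se_total, total + 2 * se_total)
```

The same pattern applies to any numeric variable recorded on the sampled parties.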

Estimating total catch

The next column follows a similar procedure to estimate the total catch.

Estimating proportion of parties with sufficient life-jackets



First, the character values yes/no are translated into 0,1 variables using the IF statement of Excel.

Then the EXACT same formula as used for estimating the total number of anglers or the total catch is applied to the 0,1 data!

We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4 percentage points.
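The catch and life-jacket columns can be handled the same way. The sketch below uses the Python standard library with the values typed in from the raw data table; the helper `srs_mean_se` is a hypothetical name introduced here.

```python
from statistics import mean, variance

def srs_mean_se(y, N):
    """Mean and estimated se for a SRSWOR of len(y) units from N."""
    n = len(y)
    return mean(y), (variance(y) / n * (1 - n / N)) ** 0.5

N = 168
catch = [1, 1, 2, 2, 2, 1, 0, 0, 1, 0,
         0, 1, 0, 2, 3, 0, 0, 0, 1, 0,
         0, 1, 0, 0, 0, 0, 1, 1, 0, 0]
jackets = ("yes yes yes no no yes no no yes yes "
           "yes yes yes yes yes no yes yes yes yes "
           "yes yes yes yes no yes no no yes yes").split()

# Total catch: expand the mean catch per party by the population size
catch_mean, catch_se = srs_mean_se(catch, N)
catch_total, catch_total_se = N * catch_mean, N * catch_se

# Proportion with sufficient life-jackets: recode to 0/1, then use the same formula
enough = [1 if answer == "yes" else 0 for answer in jackets]
p_hat, p_se = srs_mean_se(enough, N)
```

The estimated proportion and its se agree with the 73.3% and 7.4 percentage points reported above.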

SAS analysis

SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of the sample SAS program called creel.sas and the output called creel.lst are available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is the SAS program:

/* Simple Random Sample - Creel survey */

title 'Creel Survey - Simple Random Sample';
options nodate nonumber nocenter noovp linesize=75;

/*
For management purposes, it is important to estimate the
total catch by recreational fishers.
Unfortunately, there is no central registry of fishers, nor
is there a central reporting station.
Consequently, surveys are often used to estimate the total
catch.

An access survey was conducted to estimate the total catch at
a lake in British Columbia. Fortunately, access to the lake
takes place at a single landing site and most anglers use
boats in the fishery. An observer was stationed at the
landing site, but because of time constraints, could only
interview a portion of the parties returning, but was able to
get a total count of the number of parties fishing on that
day. A total of 168 boats (fishing parties)
arrived at the landing during the
day, of which 30 were sampled. The decision to sample a
party was made using a random number table as the boats
returned.

The objectives are to estimate the total number of anglers
and their catch and to estimate the proportion of boat trips
that had sufficient life-jackets for the members on the trip.
Here is the raw data. */

data creel; /* read in the survey data */
   input angler catch lifej $;
   enough = 0;
   if lifej = 'yes' then enough = 1;
   datalines;
1 1 yes
3 1 yes
1 2 yes
1 2 no
3 2 no
3 1 yes
1 0 no
1 0 no
1 1 yes
1 0 yes
2 0 yes
1 1 yes
2 0 yes
1 2 yes
3 3 yes
1 0 no
1 0 yes
2 0 yes
3 1 yes
1 0 yes
2 0 yes
1 1 yes
1 0 yes
1 0 yes
1 0 no
2 0 yes
2 1 no
1 1 no
1 0 yes
1 0 yes
;;;;

proc print data=creel;
   title2 'raw data';

/* add the sampling weights to the data set. The sampling
   weights are defined as N/n for an SRSWOR */

data creel;
   set creel;
   sampweight = 168/30;

proc surveymeans data=creel
     total=168   /* total population size */
     mean clm    /* find estimates of mean, its se, and a 95% confidence interval */
     sum clsum   /* find estimates of total, its se, and a 95% confidence interval */
     ;
   var angler catch lifej ; /* estimate mean and total for numeric variables, proportions for char variables */
   weight sampweight;
   /* Note that it is not necessary to use the coded 0/1 variables in this procedure */
run;

The program starts with the metadata so that the purpose of the program and how the data were collected etc. are not lost.

The first section of code reads the data and computes the 0,1 variable from the life-jacket information. The data are listed so that it can be verified that they were read correctly.

Most programs for dealing with survey data require that sampling weights be available for each observation. A sampling weight is the weighting factor representing how many units in the population this observation represents. In this case, each of the 30 parties represents 168/30=5.6 parties in the population.
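To make the role of the sampling weight concrete, the following Python sketch (an illustration, not part of the SAS program) attaches the weight N/n to each sampled party and forms the weighted sum of the catch values from the creel data. Under an SRSWOR every weight is the same, so the weighted sum is just 5.6 times the sample total.

```python
# Catch (number of fish) for each of the 30 sampled parties (creel survey data)
catch = [1, 1, 2, 2, 2, 1, 0, 0, 1, 0, 0, 1, 0, 2, 3,
         0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0]
N = 168
n = len(catch)

# Each sampled party "stands for" N/n parties in the population
weights = [N / n] * n

sum_of_weights = sum(weights)                                  # should recover N
est_total_catch = sum(w * y for w, y in zip(weights, catch))   # weighted sum

print(f"weight per party = {N / n:.1f}")
print(f"sum of weights   = {sum_of_weights:.1f}")
print(f"estimated total catch = {est_total_catch:.1f}")
```

The sum of the weights recovers the population size, which is a quick sanity check on any set of sampling weights.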

Finally, the SURVEYMEANS procedure is used to estimate the quantities of interest. It is not necessary to code any formula as these are built into the SAS program. So how does the SAS program know this is a simple random sample? This is the default analysis - more complex designs require additional statements (e.g. a CLUSTER statement) to indicate a more complex design. As well, equal sampling weights indicate that all items were selected with equal probability.

Here is the SAS output:



Creel Survey - Simple Random Sample
raw data

Obs    angler    catch    lifej    enough

  1       1        1      yes        1
  2       3        1      yes        1
  3       1        2      yes        1
  4       1        2      no         0
  5       3        2      no         0
  6       3        1      yes        1
  7       1        0      no         0
  8       1        0      no         0
  9       1        1      yes        1
 10       1        0      yes        1
 11       2        0      yes        1
 12       1        1      yes        1
 13       2        0      yes        1
 14       1        2      yes        1
 15       3        3      yes        1
 16       1        0      no         0
 17       1        0      yes        1
 18       2        0      yes        1
 19       3        1      yes        1
 20       1        0      yes        1
 21       2        0      yes        1
 22       1        1      yes        1
 23       1        0      yes        1
 24       1        0      yes        1
 25       1        0      no         0
 26       2        0      yes        1
 27       2        1      no         0
 28       1        1      no         0
 29       1        0      yes        1
 30       1        0      yes        1

Creel Survey - Simple Random Sample
raw data

The SURVEYMEANS Procedure

Data Summary

Number of Observations     30
Sum of Weights            168

Class Level Information

Class
Variable    Levels    Values

lifej          2      no yes

Statistics

                         Std Error      Lower 95%      Upper 95%
Variable         Mean      of Mean    CL for Mean    CL for Mean
-------------------------------------------------------------------------
angler       1.533333     0.128419       1.270686       1.795980
catch        0.666667     0.139688       0.380972       0.952362
lifej=no     0.266667     0.074425       0.114450       0.418884
lifej=yes    0.733333     0.074425       0.581116       0.885550
-------------------------------------------------------------------------

Statistics

                                        Lower 95%      Upper 95%
Variable          Sum      Std Dev     CL for Sum     CL for Sum
-------------------------------------------------------------------------
angler     257.600000    21.574442     213.475312     301.724688
catch      112.000000    23.467659      64.003248     159.996752
lifej=no    44.800000    12.503462      19.227550      70.372450
lifej=yes  123.200000    12.503462      97.627550     148.772450
-------------------------------------------------------------------------

All of the results match those from the Excel spreadsheet.
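The estimates and standard errors agree, but the confidence limits in the listing are slightly wider than the rough "estimate ± 2 se" intervals. The sketch below suggests why, under the assumption that the limits are based on a t multiplier with n - 1 = 29 degrees of freedom (tabled value about 2.04523); that multiplier is my assumption, chosen because it reproduces the printed limits.

```python
# Values read from the SURVEYMEANS listing for the number of anglers
mean, se_mean = 1.533333, 0.128419

T_MULT_29DF = 2.04523   # assumed: 97.5th percentile of t with 29 df (from tables)

# Interval using the t multiplier
lower_t = mean - T_MULT_29DF * se_mean
upper_t = mean + T_MULT_29DF * se_mean

# Rough interval using a multiplier of 2
lower_2 = mean - 2 * se_mean
upper_2 = mean + 2 * se_mean

print(f"t-based interval : {lower_t:.4f} to {upper_t:.4f}")
print(f"± 2 se interval  : {lower_2:.4f} to {upper_2:.4f}")
```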

JMP Analysis

Unfortunately, while JMP excels (excuse the pun!) in the analysis of experimental data, it is a bit clumsy to analyze survey data using JMP. [2] There are two deficiencies:

• There is no way to specify the finite population correction (the $\sqrt{1-f}$) that is applied to standard errors. Fortunately, in many ecological experiments, the sampling fraction f is very close to 0, the finite population correction is very close to 1, and so there is little effect. In any case, the standard error reported by JMP will be slightly too large, which is conservative.

• It is not easy to take the results from the analysis and use them in future computations. For example, the estimated total is found by multiplying the estimated mean by the population size - this is usually done by hand outside of JMP.

[2] Future versions of JMP will include survey sampling modules.

Obtain the required summary statistics.

JMP assumes, unless you specify otherwise, that the data are collected from a simple random sample. This matches the design of the angler survey so JMP can be used directly.

The data are entered into a JMP spreadsheet directly. A copy of the JMP data file is called creel.jmp and is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. There is no need to code the categorical variable corresponding to sufficient life-jackets. Be sure that the angler and catch variables are continuously scaled and that EnoughLifeJackets is nominally scaled:

The basic summary statistics are found using the Analyze->Distribution platform:




All three variables can be specified simultaneously and JMP will use the scale of the variables to decide which statistics to compute.

This gives summary output as shown below:



The display can be improved by converting to a stacked setting (use the Stacked option in the red-triangle near the Distribution header):



removing the quantile information and the histograms (use the red triangles for each data variable to remove the display):

and asking for standard errors and confidence intervals for the proportion (right click in the table of proportions and ask for the appropriate columns):

The final output is:



Estimating the average number of anglers/boat

The estimates are read directly from the output. The estimated average number of anglers per boat is 1.53 with an estimated se of .14. Notice that the se is slightly larger than the se reported by Excel - this is a result of not applying a finite population correction.

Estimating the total catch of fish over all boats

The estimated average catch per boat is read directly above and is 0.667 fish/boat with a se of .15 fish/boat. To estimate the total catch over all 168 boats, we multiply both the mean catch/boat and the se of the catch/boat by 168. This gives an estimated total catch of 112 fish (se 26 fish). [Again, the standard error is slightly larger than that reported by Excel because the finite population correction factor was not applied.]

Estimating the proportion of parties with sufficient life jackets

There is no need to code the categorical variable as was done in Excel. Reading directly from the above output, we estimate that 73% of parties had sufficient life jackets with a se of .08 (or a se of 8 percentage points). [Again the se is slightly larger than that reported by Excel because of the lack of a finite population correction factor.]
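The size of the discrepancy can be checked with a few lines of Python. This sketch assumes the sample standard deviations of the three variables in the creel data (0.844 for catch is quoted later in the notes; 0.776 and 0.450 are my own computations from the 30 records) and compares the se with and without the finite population correction.

```python
import math

N, n = 168, 30
fpc = math.sqrt(1 - n / N)   # finite population correction, sqrt(1 - f)

# Sample standard deviations (0.776 and 0.450 are computed here, not quoted in the notes)
sample_sd = {
    "anglers/boat": 0.776,
    "catch/boat": 0.844,
    "enough life-jackets (0/1)": 0.450,
}

se_without_fpc = {}
se_with_fpc = {}
for name, s in sample_sd.items():
    se_without_fpc[name] = s / math.sqrt(n)            # what JMP reports
    se_with_fpc[name] = se_without_fpc[name] * fpc     # after applying the correction
    print(f"{name:27s} no fpc: {se_without_fpc[name]:.3f}   with fpc: {se_with_fpc[name]:.3f}")
```

With a sampling fraction of 30/168 the correction shrinks each se by roughly 9%, which accounts for the differences noted above.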

What to do if you want to do everything in JMP

It is possible to do the entire analysis in JMP including multiplying by the population size and applying the finite population correction factor. This will be illustrated by estimating the total number of anglers and their catch over all 168 boats.

For estimating the total number of anglers and their catch, we need the sample size, the average over the sample and the standard deviation over the sample. This can be done using the Tables->Summary pop-down menu. Complete the dialogue box as shown:

This will create a new summary table.



At this point, it may be easier to simply use the summary statistics to compute the relevant quantities by hand rather than actually programming the equations in JMP.

We continue by creating new columns to estimate the se for the mean. In this column we create a formula to estimate the standard error using JMP’s formula editor. The final result is:

To estimate the total number of anglers, multiply the estimate of the mean number of anglers/boat trip by the number of boat trips (168); the se of the total is found by multiplying the se of the mean by the same value. The estimated total number of anglers is 1.53333 x 168 = 257.6 with an estimated standard error of 0.128 x 168 = 21.5.

A similar procedure is followed to estimate the total catch.

We are also interested in estimating the proportion of BOATS that have a sufficient number of life-jackets for passengers.

We first transform the yes/no responses into 1/0 using a formula box, and then repeat the same summary steps as for the mean number of anglers giving:

We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4 percentage points.



4.5 Sample size determination for a simple random sample

I cannot emphasize too strongly the importance of planning in advance of the survey.

There are many surveys where the results are disappointing. For example, a survey of anglers may show that the mean catch per angler is 1.3 fish but that the standard error is .9 fish. In other words, a 95% confidence interval (1.3 ± 2(.9)) stretches from 0 to over 3 fish per angler, something that is known with near certainty even before the survey was conducted. In many cases, a back of the envelope calculation would have shown that the precision obtained from a survey would be inadequate at the proposed sample size even before the survey was started.

In order to determine the appropriate sample size, you will need to first specify some measure of precision that is required to be obtained. For example, a policy decision may require that the results be accurate to within 5% of the true value.

This precision requirement usually occurs in one of two formats:

• an absolute precision, i.e. you wish to be 95% confident that the sample mean will not vary from the population mean by more than a pre-specified amount. For example, a 95% confidence interval for the total number of fish captured should be ± 1,000 fish.

• a relative precision, i.e. you wish to be 95% confident that the sample mean will be within 10% of the true mean.

The latter is more common than the former, but both are equivalent and interchangeable. For example, if the actual estimate is around 200, with a se of about 50, then the 95% confidence interval is ± 100 and the relative precision is within 50% of the true answer (± 100 / 200). Conversely, a 95% confidence interval that is within ± 40% of the estimate of 200 turns out to be ± 80 (40% of 200), and consequently, the se is around 40 (=80/2).
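The two conversions in this paragraph can be written as a pair of small Python helpers. This is only a sketch of the arithmetic above; the function names are hypothetical and a multiplier of 2 is assumed for a 95% confidence interval.

```python
def relative_precision(se, estimate, z=2):
    """Half-width of the confidence interval as a fraction of the estimate."""
    return z * se / estimate


def se_for_relative_precision(rel, estimate, z=2):
    """Standard error needed so that the interval is within +/- rel of the estimate."""
    return rel * estimate / z


# Estimate of 200 with a se of 50: interval is +/- 100, i.e. within 50%
half_width = 2 * 50
rel = relative_precision(50, 200)

# Conversely, to be within +/- 40% of an estimate of 200 requires a se of about 40
needed_se = se_for_relative_precision(0.40, 200)

print(half_width, rel, needed_se)
```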

A common question is:

What is the difference between se/est and 2se/est? When is the relative standard error divided by 2? Does se/est have anything to do with a 95% ci?



Precision requirements are stated in different ways (replace blah below by mean/total/proportion etc.).

Expression                                               Mathematics

- within xxx of the blah                                 se = xxx

- margin of error of xxx                                 2se = xxx
- within xxx of the true value 19 times out of 20        2se = xxx
- within xxx of the true value 95% of the time           2se = xxx

- the width of the 95% confidence interval is xxx        4se = xxx

- within 10% of the blah                                 se/est = .10
- a rse of 10%                                           se/est = .10
- a relative error of 10%                                se/est = .10

- within 10% of the blah 95% of the time                 2se/est = .10
- within 10% of the blah 19 times out of 20              2se/est = .10
- margin of error of 10%                                 2se/est = .10

- width of 95% confidence interval = 10% of the blah     4se/est = .10

As a rough rule of thumb, the following are often used as survey precision guidelines:

• For preliminary surveys, the 95% confidence interval should be ± 50% of the estimate.

• For management surveys, the 95% confidence interval should be ± 25% of the estimate.

• For scientific work, the 95% confidence interval should be ± 10% of the estimate.

Next, some preliminary guess for the standard deviation of individual items in the population (S) needs to be made along with an estimate of the population size (N) and possibly the population mean (µ) or population total (τ). These are not too crucial and can be obtained by:

• taking a pilot study.

• previous sampling of similar populations

• expert opinion



A very rough estimate of the standard deviation can be found by taking the usual range of the data/4. If the population proportion is unknown, the value of 0.5 is often used as this leads to the largest sample size requirement as a conservative guess.
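Both rules of thumb are easy to check numerically. The sketch below uses hypothetical helper names; the range of 0 to 3 fish is taken from the catch values in the creel example, and the grid of proportions is purely illustrative.

```python
import math


def rough_sd(low, high):
    """Rule-of-thumb standard deviation: usual range of the data divided by 4."""
    return (high - low) / 4


def sd_of_01_variable(p):
    """Standard deviation of a 0/1 variable with proportion p of ones."""
    return math.sqrt(p * (1 - p))


# Catch per party in the creel example ranged from 0 to 3 fish
guess = rough_sd(0, 3)

# The sd of a 0/1 variable is largest at p = 0.5, so 0.5 gives the largest sample size
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
worst_case_p = max(grid, key=sd_of_01_variable)

print(f"rough sd for catch = {guess}")
print(f"proportion giving the largest sd = {worst_case_p}")
```

The rough guess for catch (0.75) is reasonably close to the sample standard deviation of .844 observed in the survey.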

These are then used with the formulae for the confidence interval to determine the relevant sample size. Many text books have complicated formulae to do this - it is much easier these days to simply code the formulae in a spreadsheet (see examples) and use either trial and error to find an appropriate sample size, or use the “GOAL SEEKER” feature of the spreadsheet to find the appropriate sample size. This will be illustrated in the example.

The final numbers are not to be treated as the exact sample size but more as a guide to the amount of effort that needs to be expended.

If more than one item is being surveyed, these calculations must be done for each item. The largest sample size needed is then chosen. This may lead to conflict in which case some response items must be dropped or a different sampling method must be used for this other response variable.

Precision essentially depends only on the absolute sample size, not the relative fraction of the population sampled. For example, a sample of 1000 people taken from Canada (population of 33,000,000) is just as precise as a sample of 1000 people taken from the US (population of 330,000,000)! This is highly counter-intuitive and will be explored more in class.

4.5.1 Example - How many anglers to survey

We wish to repeat the angler creel survey next year.

• How many angling-parties should be interviewed to be 95% confident of being within 10% of the true mean catch?

• What sample size would be needed to estimate the proportion of boats with sufficient life-jackets to within 3 percentage points 19 times out of 20? In this case we are asking that the 95% confidence interval be ±0.03 or that the se = 0.015.

The sample size spreadsheet is available in an Excel workbook called SurveySampleSize.xls which can be downloaded from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a condensed view of the spreadsheet:




First note that the computations for sample size require some PRIOR information about population size, the population mean, or the population proportion. We will use information from the previous survey to help plan future studies.

For example, there were about 168 boats (fishing parties) in the population last year. The mean catch per angling party was about .667 fish/boat. The standard deviation of the catch per party was .844. These values are entered in the spreadsheet in column C.

A preliminary sample size of 40 (in green in Column C) was tried. This led to a 95% confidence interval of ± 35% which did not meet the precision requirements.

Now vary the sample size (in green) in column C until the 95% confidence interval (in yellow) is below ± 10%. You will find that you will need to interview almost 135 parties - a very high sampling fraction indeed. The problem for this variable is the very high variation of individual data points.

If you are familiar with Excel, you can use the Goal Seeker function to speed the search.
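The trial-and-error search can also be sketched in Python. This is not the spreadsheet itself; it assumes the planning values quoted above (N = 168, mean .667, standard deviation .844), a multiplier of 2 for a 95% interval, and a hypothetical function name.

```python
import math


def relative_ci_half_width(n, N, mean, sd, z=2):
    """Half-width of the 95% CI for the mean, as a fraction of the mean (SRSWOR)."""
    se = sd / math.sqrt(n) * math.sqrt(1 - n / N)
    return z * se / mean


N, mean, sd = 168, 0.667, 0.844   # planning values from last year's survey
target = 0.10                     # want to be within 10% of the true mean

first_try = relative_ci_half_width(40, N, mean, sd)   # the preliminary n = 40

# Step through candidate sample sizes until the target precision is met
n_needed = next(n for n in range(2, N + 1)
                if relative_ci_half_width(n, N, mean, sd) <= target)

print(f"n = 40 gives +/- {first_try:.0%}")
print(f"smallest n meeting the target: {n_needed}")
```

Under these assumptions the search stops just short of 135 parties, in line with the spreadsheet result.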

Similarly, the proportion of parties with sufficient life-jackets last year was around 73%. Enter this in the blue areas of Column E. The initial sample size of 20 is too small as the 95% confidence interval is ± .186 (about 19 percentage points). Now vary the sample size (in green) until the 95% confidence interval is ± .03. Note that you need to be careful in dealing with percentages - confidence limits are often specified in terms of percentage points rather than percents to avoid problems where percents are taken of percents. This will be explained further in class.
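A corresponding sketch for the proportion is given below. It assumes the se is computed as sqrt(p(1-p)/n) times the finite population correction (this choice reproduces the ± .186 quoted for n = 20, but the exact formula used in the spreadsheet is my assumption) and a multiplier of 2.

```python
import math


def ci_half_width_proportion(n, N, p, z=2):
    """Half-width of the 95% CI for a proportion under SRSWOR (assumed formula)."""
    se = math.sqrt(p * (1 - p) / n) * math.sqrt(1 - n / N)
    return z * se


N, p = 168, 0.73    # planning values from last year's survey
target = 0.03       # within 3 percentage points, 19 times out of 20

first_try = ci_half_width_proportion(20, N, p)   # the initial n = 20

n_needed = next(n for n in range(2, N + 1)
                if ci_half_width_proportion(n, N, p) <= target)

print(f"n = 20 gives +/- {first_try:.3f}")
print(f"smallest n meeting the target: {n_needed}")
```

Under these assumptions most of the 168 parties would have to be interviewed, which illustrates how demanding a ± 3 percentage point requirement is for a small population.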

Try using the spreadsheet to compare the precision of a poll of 1000 people taken from Canada (population 33,000,000) and 1000 people taken from the US (population 330,000,000) if both polls have about 40% in favor of some issue.
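If the spreadsheet is not at hand, the same comparison can be sketched in Python. The se formula is the same assumed SRSWOR formula for a proportion as in the sample-size example above; the function name is hypothetical.

```python
import math


def se_proportion(n, N, p):
    """Standard error of a sample proportion under SRSWOR (assumed formula)."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt(1 - n / N)


se_canada = se_proportion(1000, 33_000_000, 0.40)
se_us = se_proportion(1000, 330_000_000, 0.40)

print(f"Canada: se = {se_canada:.6f}")
print(f"US    : se = {se_us:.6f}")
print(f"difference = {abs(se_canada - se_us):.2e}")
```

The two standard errors agree to several decimal places because the finite population correction is essentially 1 in both cases; only the absolute sample size matters.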

Technical notes

If you really want to know how the sample size numbers are determined, here is the lowdown.

Suppose that you wish to be 95% sure that the sample mean is within 10% of the true mean.

We must solve $z \frac{S}{\sqrt{n}} \sqrt{\frac{N-n}{N}} \le \epsilon \mu$ for $n$, where $z$ is the term representing the multiplier for a particular confidence level (for a 95% c.i. use $z = 2$) and $\epsilon$ is the ‘closeness’ factor (in this case $\epsilon = 0.10$).

Rearranging this equation gives $n = \frac{N}{1 + N \left( \frac{\epsilon \mu}{z S} \right)^2}$
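For readers who want the intermediate steps, the rearrangement can be sketched as follows (squaring both sides, which is permissible because every term is positive, and then isolating n):

```latex
\begin{align*}
z \frac{S}{\sqrt{n}} \sqrt{\frac{N-n}{N}} &\le \epsilon \mu \\
\frac{z^2 S^2}{n} \cdot \frac{N-n}{N} &\le \epsilon^2 \mu^2 \\
\frac{1}{n} - \frac{1}{N} &\le \left( \frac{\epsilon \mu}{z S} \right)^2 \\
\frac{1}{n} &\le \frac{1 + N \left( \frac{\epsilon \mu}{z S} \right)^2}{N} \\
n &\ge \frac{N}{1 + N \left( \frac{\epsilon \mu}{z S} \right)^2}
\end{align*}
```

As a check, substituting the planning values from the creel example ($N = 168$, $\mu = 0.667$, $S = 0.844$, $z = 2$, $\epsilon = 0.10$) gives $n \ge 133.1$, i.e. about 134 parties after rounding up.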

4.6 Systematic sampling

Sometimes, logistical considerations make a true simple random sample not very convenient to administer. For example, in the previous creel survey, a true random sample would require that a random number be generated for each boat returning to the marina. In such cases, a systematic sample could be used to select elements. For example, every 5th angler could be selected after a random starting point.

4.6.1 Advantages of systematic sampling

The main advantages of systematic sampling are:

• it is easier to draw units because only one random number is chosen

• it can be used if a sampling frame is not available but there is a convenient method of selecting items, e.g. the creel survey where every 5th angler is chosen.

• easier instructions for untrained staff

• if the population is in random order relative to the variable being measured, the method is equivalent to a SRS. For example, it is unlikely that the number of anglers in each boat changes dramatically over the period of the day. This is an important assumption that should be investigated carefully in any real life situation!

• it distributes the sample more evenly over the population. Consequently if there is a trend, you will get items selected from all parts of the trend.

4.6.2 Disadvantages of systematic sampling

The primary disadvantages of systematic sampling are:

• Hidden periodicities or trends may cause biased results. In such cases, estimates of mean and variances may be severely biased! See Section 4.2.2 for a detailed discussion.



• Without making an assumption about the distribution of population units, there is no estimate of the standard error. This is an important disadvantage of a systematic sample! Many studies very casually make the assumption that the systematic sample is equivalent to a simple random sample without much justification for this.

4.6.3 How to select a systematic sample

There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every kth record, where k is chosen to meet sample size requirements - an example of choosing k will be given in class. All of the following methods are equivalent if k divides N exactly. These are the two most common methods.

• Method 1 Choose a random number j from 1 · · · k. Then choose the j, j + k, j + 2k, · · · records. One problem is that different samples may be of different size - an example will be given in class where n doesn’t divide N exactly. This causes problems in sampling theory, but is not too much of a problem if n is large.

• Method 2 Choose a random number from 1 · · · N. Choose every kth item and continue in a circle when you reach the end until you have selected n items. This will always give you the same sized sample; however, it requires knowledge of N.
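The two methods can be sketched in Python as follows. The function names, the seed, and the choice of N = 168 boats with k = 5 are illustrative assumptions; records are numbered 1 to N.

```python
import random


def systematic_method1(N, k, rng):
    """Method 1: random start j in 1..k, then records j, j + k, j + 2k, ... up to N."""
    j = rng.randint(1, k)
    return list(range(j, N + 1, k))


def systematic_method2(N, k, n, rng):
    """Method 2: random start in 1..N, then every kth record, wrapping around
    the end of the list, until n records have been selected."""
    start = rng.randint(1, N)
    return [(start - 1 + i * k) % N + 1 for i in range(n)]


rng = random.Random(2006)   # fixed seed so that the sketch is reproducible
N, k = 168, 5               # e.g. every 5th boat out of 168
n = N // k                  # target sample size for Method 2

sample1 = systematic_method1(N, k, rng)
sample2 = systematic_method2(N, k, n, rng)

print(len(sample1), sample1[:5])
print(len(sample2), sample2[:5])
```

Because 5 does not divide 168 exactly, Method 1 returns either 33 or 34 records depending on the random start, whereas Method 2 always returns exactly n records.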

4.6.4 Analyzing a systematic sample

Most surveys casually assume that the population has been sorted in random order when the systematic sample was selected and so treat the results as if they had come from a SRSWOR. This is theoretically not correct and if your assumption is false, the results may be biased, and there is no way of examining the biases from the data at hand.

Before implementing a systematic survey or analyzing a systematic survey, please consult with an expert in sampling theory to avoid problems. This is a case where an hour or two of consultation before spending lots of money could potentially turn a survey where nothing can be estimated, into a survey that has justifiable results.



4.6.5 Technical notes - Repeated systematic sampling

To avoid many of the potential problems with systematic sampling, a common device is to use repeated systematic samples on the same population.

For example, rather than taking a single systematic sample of size 100 from a population, you can take 4 systematic samples (with different starting points) of size 25.

An empirical method of obtaining a variance estimator from a systematic sample is to use repeated systematic sampling. Rather than choosing one systematic subsample of every kth unit, choose m independent systematic subsamples of size n/m. Then estimate the mean of each sub-systematic sample. Treat these means as a simple random sample from the population of possible systematic samples and use the usual sampling theory. The variation among the sub-systematic samples provides an estimate of the sampling variance. This will be illustrated in an example.

Example of replicated subsampling within a systematic sample

A yearly survey has been conducted in the Prairie Provinces to estimate the number of breeding pairs of ducks. One breeding area has been divided into approximately 1000 transects of a certain width, i.e. the breeding area was divided into 1000 strips.

What is the population of interest? As noted in class, the definition of a population depends, in part, upon the interest of the researcher. Two possible definitions are:

• The population is the set of individual ducks on the study area. However, no frame exists for the individual birds. But a frame can be constructed based on the 1000 strips that cover the study area. In this case, the design is a cluster sample, with the clusters being strips.

• The population consists of the 1000 strips that cover the study area and the number of ducks in each strip is the response variable. The design is then a simple random sample of the strips.

In either case, the analysis is exactly the same and the final estimates are exactly the same.

Approximately 100 of the transects are flown by an aircraft and spotters on the aircraft count the number of breeding pairs visible from the aircraft.



For administrative convenience, it is easier to conduct systematic sampling. However, there is structure to the data; it is well known that ducks do not spread themselves randomly throughout the breeding area. After discussions with our Statistical Consulting Service, the researchers flew 10 sets of replicated systematic samples; each set consisted of 10 transects. As each transect is flown, the scientists also classify each transect as ‘prime’ or ‘non-prime’ breeding habitat.

Here is the raw data reporting the number of nests in each set of 10 transects:

          Prime Habitat   Non-Prime Habitat     ALL     Prime   Non-prime
Set       Total     n     Total     n          Total    mean      mean      Diff
           (b)                                  (a)      (c)       (d)      (e)

  1        123      3      345      7           468     41.0      49.3      -8.3
  2         57      2       36      8            93     28.5       4.5      24.0
  3         85      5       46      5           131     17.0       9.2       7.8
  4         97      2      131      8           228     48.5      16.4      32.1
  5         34      5       43      5            77      6.8       8.6      -1.8
  6         85      3       67      7           152     28.3       9.6      18.8
  7         56      7       64      3           120      8.0      21.3     -13.3
  8         46      2       65      8           111     23.0       8.1      14.9
  9         37      4       43      6            80      9.3       7.2       2.1
 10         93      2      104      8           197     46.5      13.0      33.5

Avg       71.3                                165.7                        10.97
s         29.5                                117.0                        16.38
n           10                                   10                           10

Est total 7130                                16570     Est mean           10.97
Est se     885                                 3510     Est se              4.91

Several different estimates can be formed.

1. Total number of nests in the breeding area (refer to column (a) above). The total number of nests in the breeding area for all types of habitat is of interest. Column (a) in the above table is the data that will be used. It represents the total number of nests in the 10 transects of each set.

The principle behind the estimator is that the 1000 total transects can be divided into 100 sets of 10 transects, of which a random sample of size 10 was chosen. The sampling unit is the set of transects – the individual transects are essentially ignored.

Note that this method assumes that the systematic samples are all of the same size. If the systematic samples had been of different sizes (e.g. some sets had 15 transects, other sets had 5 transects), then a ratio-estimator (see later sections) would have been a better estimator.

• Compute the total number of nests for each set. This is found in column (a).

• Then the sets selected are treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as: $\hat{\mu} = (468 + 93 + \cdots + 197)/10 = 165.7$ with an estimated se of $se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{117.0^2}{10}\left(1 - \frac{10}{100}\right)} = 35.1$

• The average number of nests per set is expanded to cover all 100 sets: $\hat{\tau} = 100\hat{\mu} = 16570$ and $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 3510$

2. Total number of nests in the prime habitat only (refer to column (b) above). This is formed in exactly the same way as the previous estimate. This is technically known as estimation in a domain. The number of elements in the domain in the whole population (i.e. how many of the 1000 transects are in prime-habitat) is unknown but is not needed. All that you need is the total number of nests in prime habitat in each set – you essentially ignore the non-prime habitat transects within each set.

• The average number of nests per set in prime habitats is found as before: $\hat{\mu} = \frac{123 + \cdots + 93}{10} = 71.3$ with an estimated se of $se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{29.5^2}{10}\left(1 - \frac{10}{100}\right)} = 8.85$.

• Because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se is $\hat{\tau} = 100\hat{\mu} = 7130$ with a $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 885$

• Note that the total number of transects of prime habitat is not known for the population and so an estimate of the density of nests in prime habitat cannot be computed from this estimated total. However, a ratio-estimator (see later in the notes) could be used to estimate the density.

3. Difference in mean density between prime and non-prime habitats. The scientists suspect that the density of nests is higher in prime habitat than in non-prime habitat. Is there evidence of this in the data? (refer to columns (c)-(e) above). Here everything must be transformed to the density of nests per transect (assuming that the transects were all the same size). Also, pairing (refer to the section on experimental design) is taking place so a difference must be computed for each set and the differences analyzed, rather than trying to treat the prime and non-prime habitats as independent samples.

Again, this is an example of what is known as domain-estimation.



• Compute the domain means for type of habitat for each set (columns (c) and (d)). Note that the totals are divided by the number of transects of each type in each set.

• Compute the difference in the means for each set (column (e))

• Treat this difference as a simple random sample of size 10 taken from the 100 possible sets of transects. What does the final estimated mean difference and se imply?
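The three estimates above can be reproduced with a short Python sketch. It assumes the set totals and transect counts from the table above and treats the 10 sets as a SRSWOR from the 100 possible sets; the function and variable names are illustrative.

```python
import math
import statistics

# Nest totals and number of transects by habitat type for each of the 10 sets
prime_total = [123, 57, 85, 97, 34, 85, 56, 46, 37, 93]
prime_n = [3, 2, 5, 2, 5, 3, 7, 2, 4, 2]
nonprime_total = [345, 36, 46, 131, 43, 67, 64, 65, 43, 104]
nonprime_n = [7, 8, 5, 8, 5, 7, 3, 8, 6, 8]
SETS_IN_POPULATION = 100   # 1000 transects grouped into sets of 10


def srswor_mean_and_se(values, pop_size):
    """Mean of the set-level values and its se, treating the sets as a SRSWOR."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n) * math.sqrt(1 - n / pop_size)
    return mean, se


# 1. Total number of nests, all habitat (column (a))
all_total = [p + q for p, q in zip(prime_total, nonprime_total)]
mean_all, se_all = srswor_mean_and_se(all_total, SETS_IN_POPULATION)
est_total_all = SETS_IN_POPULATION * mean_all
se_total_all = SETS_IN_POPULATION * se_all

# 2. Total number of nests in prime habitat only (column (b))
mean_prime, se_prime = srswor_mean_and_se(prime_total, SETS_IN_POPULATION)
est_total_prime = SETS_IN_POPULATION * mean_prime
se_total_prime = SETS_IN_POPULATION * se_prime

# 3. Paired difference in nests per transect, prime minus non-prime (column (e))
diffs = [p / pn - q / qn
         for p, pn, q, qn in zip(prime_total, prime_n, nonprime_total, nonprime_n)]
mean_diff, se_diff = srswor_mean_and_se(diffs, SETS_IN_POPULATION)

print(f"all habitat  : total {est_total_all:.0f} (se {se_total_all:.0f})")
print(f"prime habitat: total {est_total_prime:.0f} (se {se_total_prime:.0f})")
print(f"mean difference in density: {mean_diff:.2f} (se {se_diff:.2f})")
```

Small differences from the table arise only because the table reports rounded intermediate values.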

4.7 Stratified simple random sampling

A simple modification to a simple random sample can often lead to dramatic improvements in precision. This is known as stratification. All survey methods can potentially benefit from stratification (also known as blocking in the experimental design literature).

Stratification will be beneficial whenever variability in the response variable among the survey units can be anticipated and strata can be formed that are more homogeneous than the original set of survey units.

All stratified designs will have the same basic steps as listed below regardless of the underlying design.

• Creation of strata. Stratification begins by grouping the survey units into homogeneous groups (strata) where survey units within strata should be similar and strata should be different. For example, suppose you wished to estimate the density of animals. The survey region is divided into a large number of quadrats based on aerial photographs. The quadrats can be stratified into high and low quality habitat because it is thought that the density within the high quality quadrats may be similar but different from the density in the low quality habitats. The strata do not have to be physically contiguous – for example, the high quality habitats could be scattered throughout the survey region and can be grouped into one single stratum.

• Determine total sample size. Use the methods in previous sections to determine the total sample size (number of survey units) to select. At this stage, some sort of “average” standard deviation will be used to determine the sample size.

• Allocate effort among the strata. There are several ways to allocate the total effort among the strata.



– In equal allocation, the total effort is split equally among all strata. Equal allocation is preferred when equally precise estimates are required for each stratum. [3]

– In proportional allocation, the total effort is allocated to the strata in proportion to stratum importance. Stratum importance could be related to stratum size (e.g. when allocating effort among the U.S. and Canada, then because the U.S. is 10 times larger than Canada, more effort should be allocated to surveying the U.S.). But if density is your measure of importance, allocate more effort to higher density strata. Proportional allocation is preferred when more precise estimates are required in more important strata.

– Neyman allocation. Neyman determined that if you also have information on the variability within each stratum, then more effort should be allocated to strata that are more important and more variable to give you the most precise overall estimate for a given sample size. This rarely is performed in ecology because often information on intra-stratum variability is unknown. [4]

– Cost allocation. In general, effort should be allocated to more important strata, more variable strata, or strata where sampling is cheaper to give the best overall precision for the entire survey. As in the previous allocation method, ecologists rarely have sufficiently detailed cost information to do this allocation method.

• Conduct separate surveys in each stratum. Separate independent surveys are conducted in each stratum. It is not necessary to use the same survey method in all strata. For example, low density quadrats could be surveyed using aerial methods, while high density strata may require ground based methods. Some strata may use simple random samples, while other strata may use cluster samples. Many textbooks show examples where the same survey method is used in all strata, but this is NOT required.

• Obtain stratum specific estimates. Use the appropriate estimators to estimate stratum means, proportions or totals (along with their se) for each stratum.

• Rollup. The separate stratum estimates are then combined to give an overall value for the entire survey region.
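The allocation step in the list above can be illustrated with a small Python sketch. Everything numeric here is hypothetical (two strata of 400 and 600 quadrats with guessed standard deviations of 30 and 10, and a total sample of 100), stratum size is used as the measure of importance, and the Neyman rule is taken in its standard form of allocating in proportion to stratum size times stratum standard deviation.

```python
def equal_allocation(n_total, n_strata):
    """Split the total effort equally among the strata."""
    return [n_total / n_strata] * n_strata


def proportional_allocation(n_total, sizes):
    """Allocate in proportion to stratum size (size used as the importance measure)."""
    total_size = sum(sizes)
    return [n_total * N_h / total_size for N_h in sizes]


def neyman_allocation(n_total, sizes, sds):
    """Allocate in proportion to stratum size times stratum standard deviation."""
    products = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(products)
    return [n_total * x / total for x in products]


# Hypothetical strata: high quality habitat, low quality habitat
stratum_sizes = [400, 600]   # number of quadrats in each stratum (assumed)
stratum_sds = [30, 10]       # guessed sd of counts per quadrat (assumed)
n_total = 100

equal = equal_allocation(n_total, len(stratum_sizes))
proportional = proportional_allocation(n_total, stratum_sizes)
neyman = neyman_allocation(n_total, stratum_sizes, stratum_sds)

print("equal       :", equal)
print("proportional:", proportional)
print("Neyman      :", [round(x, 1) for x in neyman])
```

In practice the allocations would be rounded to whole survey units, and the guessed standard deviations would come from a pilot study or earlier surveys.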

Stratification can be carried out prior to the survey (pre-stratification) or after the survey (post-stratification). Pre-stratification is used if the stratum variable is known in advance for every plot (e.g. elevation of a plot). Post-stratification is used if the stratum variable can only be ascertained after measuring the plot, e.g. soil texture or soil pH. The advantages of pre-stratification are that samples can be allocated to the various strata in advance to optimize the survey and the analysis is relatively straightforward. With post-stratification, there is no control over the sample size in each of the strata, and the analysis is more complicated (the problem is that the sample sizes in each stratum are now random). Post-stratification can result in significant improvements in precision but does not allow the finer control of the sample sizes found in pre-stratification.

3 Recall from previous sections that the absolute sample size is one of the drivers for precision.

4 However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations based on stratum means often capture this variation as well.

Stratification can be used with any type of sampling design; the concepts introduced here deal with stratification applied to simple random samples but are easily extended to more complex designs.

The advantages of stratification are:

• estimates of the mean or of the total will be more precise (i.e. have smaller variances) than those from an unstratified design if the units can be divided into groups that are more homogeneous within groups than the whole population.

• the cost of conducting a survey under stratification may be less, as units selected within a stratum are in closer proximity.

• different sampling methods may be used in each stratum for cost or convenience reasons. [In the detail below we assume that each stratum has the same sampling method used, but this is only for simplification.]

• because randomization occurs independently in each stratum, corruption of the survey design due to problems experienced in the field may be confined to the affected stratum.

• separate estimates for each stratum with a given precision can be obtained.

• it may be more convenient to take a stratified random sample for administrative reasons. For example, the strata may refer to different district offices.

4.7.1 A visual comparison of a simple random sample vs a stratified simple random sample

You may find it useful to compare a simple random sample of 24 vs a stratified random sample of 24 using the following visual plans:

Select a sample of 24 in each case.



Simple Random Sampling

Describe how the sample was taken.



Stratified Simple Random Sampling

First you will have to define the strata. Suppose that there is a gradient in response from the top to the bottom of the map. Three strata are defined, consisting of the first 3 rows, the next 5 rows, and finally, the last two rows. It was decided to conduct a simple random sample within each stratum, with sample sizes of 8, 10, and 6 in the three strata respectively. [The decision process on allocating samples to strata will be covered later.]



In this design, the same design was used in ALL strata, but this is NOT a requirement for stratification. It is quite possible, and often desirable, to use different methods in the different strata. For example, it may be more efficient to survey desert areas using a fixed-wing aircraft, while ground surveys need to be used in heavily forested areas.
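The two selection schemes compared in this section can be mimicked with a short Python sketch. The map is taken to have 10 rows (3 + 5 + 2, as in the strata defined above); the number of columns (10), the random seed, and all variable names are assumptions made purely for illustration.

```python
import random

rng = random.Random(2006)      # fixed seed so the draw can be repeated

N_ROWS, N_COLS = 10, 10        # N_COLS is a hypothetical value
units = [(r, c) for r in range(N_ROWS) for c in range(N_COLS)]

# Simple random sample: 24 units drawn without replacement from the whole map.
simple_sample = rng.sample(units, 24)

# Stratified sample: an independent SRS within each band of rows.
strata_rows = {"top": range(0, 3), "middle": range(3, 8), "bottom": range(8, 10)}
n_per_stratum = {"top": 8, "middle": 10, "bottom": 6}

stratified_sample = {}
for name, rows in strata_rows.items():
    stratum_units = [(r, c) for r in rows for c in range(N_COLS)]
    stratified_sample[name] = rng.sample(stratum_units, n_per_stratum[name])

print(sorted(simple_sample))
for name, chosen in stratified_sample.items():
    print(name, sorted(chosen))
```

With the simple random sample the number of units falling in each band of rows varies from draw to draw; with the stratified sample it is fixed at 8, 10, and 6 by design.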

4.7.2 Notation

Common notation is to use h as a stratum index and i or j as unit indices within each stratum.

Characteristic        Population quantities                              Sample quantities
number of strata      $H$                                                $H$
stratum sizes         $N_1, N_2, \ldots, N_H$                            $n_1, n_2, \ldots, n_H$
population units      $Y_{hj}$, $h=1,\ldots,H$, $j=1,\ldots,N_h$         $y_{hj}$, $h=1,\ldots,H$, $j=1,\ldots,n_h$
stratum totals        $\tau_h$                                           $y_h$
stratum means         $\mu_h$                                            $\overline{y}_h$
population total      $\tau = N \sum_{h=1}^{H} W_h \mu_h$ where $W_h = N_h/N$
population mean       $\mu = \sum_{h=1}^{H} W_h \mu_h$
standard deviation (squared)   $S_h^2$                                   $s_h^2$

It is assumed that from each stratum, a SRSWOR of size $n_h$ is selected independently of ALL OTHER STRATA!

The results below summarize the computations, which are most easily thought of as occurring in four steps:

1. Compute the estimated mean and its se for each stratum. In this chapter, we use an SRS design in each stratum, but it is not necessary to use this design in a stratum, and each stratum could have a different design. In the case of an SRS, the estimate of the mean for each stratum is found as:

\[ \widehat{\mu}_h = \overline{y}_h \]

with associated standard error:

\[ se(\widehat{\mu}_h) = \sqrt{\frac{s_h^2}{n_h}(1 - f_h)} \]

where the subscript $h$ refers to each stratum and $f_h = n_h/N_h$ is the sampling fraction in stratum $h$.

2. Compute the estimated total and its se for each stratum. In many cases this is simply the estimated mean for the stratum multiplied by the STRATUM POPULATION size. In the case of an SRS in each stratum this gives:

\[ \widehat{\tau}_h = N_h \times \widehat{\mu}_h = N_h \times \overline{y}_h \]

\[ se(\widehat{\tau}_h) = N_h \times se(\widehat{\mu}_h) = N_h \times \sqrt{\frac{s_h^2}{n_h}(1 - f_h)} \]

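3. Compute the grand total and its se over all strata. The grand total is the sum of the individual stratum totals. Because the strata are sampled independently, the se is found by adding the squared se of the stratum totals and taking the square root:

\[ \widehat{\tau} = \widehat{\tau}_1 + \widehat{\tau}_2 + \ldots \]

\[ se(\widehat{\tau}) = \sqrt{se(\widehat{\tau}_1)^2 + se(\widehat{\tau}_2)^2 + \ldots} \]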

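4. Occasionally, the grand mean over all strata is needed. This is found by dividing the estimated grand total by the total POPULATION size (the sum of the stratum population sizes):

\[ \widehat{\mu} = \frac{\widehat{\tau}}{N_1 + N_2 + \ldots} \]

\[ se(\widehat{\mu}) = \frac{se(\widehat{\tau})}{N_1 + N_2 + \ldots} \]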
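The four steps above can be written as one small Python function working from the summary statistics of each stratum. This is a minimal sketch, assuming an SRS without replacement in every stratum; the function name `stratified_estimates`, the dictionary keys, and the example numbers are hypothetical and are not taken from any data set in these notes.

```python
import math

def stratified_estimates(strata):
    """strata: list of dicts with keys N (stratum size), n (sample size),
    mean (sample mean) and sd (sample standard deviation)."""
    total = 0.0
    var_total = 0.0
    for s in strata:
        f = s["n"] / s["N"]                                  # sampling fraction f_h
        se_mean = math.sqrt(s["sd"] ** 2 / s["n"] * (1 - f)) # step 1
        total += s["N"] * s["mean"]                          # step 2: stratum total
        var_total += (s["N"] * se_mean) ** 2                 # step 2: squared se of total
    pop_size = sum(s["N"] for s in strata)
    se_total = math.sqrt(var_total)                          # step 3
    return {                                                 # step 4
        "total": total,
        "se_total": se_total,
        "mean": total / pop_size,
        "se_mean": se_total / pop_size,
    }

# Hypothetical summary statistics for two strata.
example = [
    {"N": 100, "n": 10, "mean": 5.0, "sd": 2.0},
    {"N": 50, "n": 5, "mean": 20.0, "sd": 4.0},
]
result = stratified_estimates(example)
print(result)
```

Working from summary statistics rather than raw data mirrors what is done in the spreadsheet analyses later in this section.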

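This can be summarized in a succinct form as follows. Note that the stratum weights $W_h$ are formed as $N_h/N$ and are often used to derive weighted means etc.

Quantity: Mean
  Pop value:  $\mu = \sum_{h=1}^{H} W_h \mu_h$
  Estimator:  $\widehat{\mu}_{str} = \sum_{h=1}^{H} W_h \overline{y}_h$
  se:         $\sqrt{\sum_{h=1}^{H} W_h^2 \, se^2(\overline{y}_h)} = \sqrt{\sum_{h=1}^{H} W_h^2 \frac{s_h^2}{n_h}(1 - f_h)}$

Quantity: Total
  Pop value:  $\tau = N \sum_{h=1}^{H} W_h \mu_h$ or $\tau = \sum_{h=1}^{H} \tau_h$ or $\tau = \sum_{h=1}^{H} N_h \mu_h$
  Estimator:  $\widehat{\tau}_{str} = N \sum_{h=1}^{H} W_h \overline{y}_h$ or $\widehat{\tau}_{str} = \sum_{h=1}^{H} N_h \overline{y}_h$
  se:         $\sqrt{\sum_{h=1}^{H} N_h^2 \, se^2(\overline{y}_h)}$ or $\sqrt{\sum_{h=1}^{H} N_h^2 \frac{s_h^2}{n_h}(1 - f_h)}$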

Notes



• The estimator for the grand population mean is a weighted average of the individual stratum means using the POPULATION weights $W_h = N_h/N$ rather than the sample weights $n_h/n$. This is NOT the same as the simple unweighted average of all the sampled observations unless the $n_h/n$ equal the $N_h/N$. Such a design is known as proportional allocation in stratified sampling.

• The estimated standard error for the grand total is found as

\[ \sqrt{se_1^2 + se_2^2 + \cdots + se_H^2}, \]

i.e. the square root of the sum of the squared se of the stratum TOTALS.

• The estimators for a proportion are IDENTICAL to those for the mean, except that the variable of interest is replaced by a 0/1 variable where 1 = characteristic of interest present and 0 = characteristic of interest absent.

• Confidence intervals. Once the se has been determined, the usual ±2se will give approximate 95% confidence intervals if the sample sizes are relatively large in each stratum. If the sample sizes are small in each stratum, some authors suggest using a t-distribution with degrees of freedom determined using a Satterthwaite approximation; this will not be covered in this course.

4.7.4 Example - sampling organic matter from a lake

[With thanks to Dr. Rick Routledge for this example].

Suppose that you were asked to estimate the total amount of organic matter suspended in a lake just after a storm. The first scheme that might occur to you could be to cruise around the lake in a haphazard fashion and collect a few sample vials of water which you could then take back to the lab. If you knew the total volume of water in the lake, then you could obtain an estimate of the total amount of organic matter by taking the product of the average concentration in your sample and the total volume of the lake.

The accuracy of your estimate of course depends critically on the extent to which your sample is representative of the entire lake. If you used the haphazard scheme outlined above, you have no way of objectively evaluating the accuracy of the sample. It would be more sensible to take a properly randomized sample. (How might you go about doing this?)

Nonetheless, taking a randomized sample from the entire lake would still not be a totally sensible approach to the problem. Suppose that the lake were to be fed by a single stream, and that most of the organic matter were concentrated close to the mouth of the stream. If the sample were indeed representative, then most of the vials would contain relatively low concentrations of organic matter, whereas the few taken from around the mouth of the stream would contain much higher concentration levels. That is, there is a real potential for outliers in the sample. Hence, confidence limits based on the normal distribution would not be trustworthy.

Furthermore, the sample mean is not as reliable as it might be. Its value will depend critically on the number of vials sampled from the region close to the stream mouth. This source of variation ought to be controlled.

Finally, it might be useful to estimate not just the total amount of organic matter in the entire lake, but the extent to which this total is concentrated near the mouth of the stream.

You can simultaneously overcome all three deficiencies by taking what is called a stratified random sample. This involves dividing the lake into two or more parts called strata. (These are not the horizontal strata that naturally form in most lakes, although these natural strata might be used in a more complex sampling scheme than the one considered here.) In this instance, the lake could be divided into two parts, one consisting roughly of the area of high concentration close to the stream mouth, the other comprising the remainder of the lake.

Then if a simple random sample of fixed size were to be taken from within each of these “strata”, the results could be used to estimate the total amount of organic matter within each stratum. These subtotals could then be added to produce an estimate of the overall total for the lake.

This procedure, because it involves constructing separate estimates for each stratum, permits us to assess the extent to which the organic matter is concentrated near the stream mouth. It also permits the investigator to control the number of vials sampled from each of the two parts of the lake. Hence, the chance variation in the estimated total ought to be sharply reduced. Finally, we shall soon see that the confidence limits that one can construct are free of the outlier problem that invalidated the confidence limits based on a simple random sampling scheme.

A randomized sample is to be drawn independently from within each stratum.

How can we use the results of a stratified random sample to estimate the overall total? The simplest way is to construct an estimate of the totals within each of the strata, and then to sum these estimates. A sensible estimate of the average within the h'th stratum is $\overline{y}_h$. Hence, a sensible estimate of the total within the h'th stratum is $\widehat{\tau}_h = N_h \overline{y}_h$, and the overall total can be estimated by $\widehat{\tau} = \sum_{h=1}^{H} \widehat{\tau}_h = \sum_{h=1}^{H} N_h \overline{y}_h$.

If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the size of the population, $N$. The resulting estimator is called the stratified random sampling estimator of the population average, and is given by $\widehat{\mu} = \sum_{h=1}^{H} N_h \overline{y}_h / N$.

This can be expressed as a fancy average if we adjust the order of operations in the above expression. If, instead of dividing the sum by $N$, we divide each term by $N$ and then sum the results, we shall obtain the same result. Hence,

\[ \widehat{\mu}_{stratified} = \sum_{h=1}^{H} (N_h/N)\, \overline{y}_h = \sum_{h=1}^{H} W_h \overline{y}_h, \]

where $W_h = N_h/N$. These $W_h$-values can be thought of as weighting factors, and $\widehat{\mu}_{stratified}$ can then be viewed as a weighted average of the within-stratum sample averages.

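The estimated standard error is found as:

\[ se(\widehat{\mu}_{stratified}) = se\left\{ \sum_{h=1}^{H} W_h \overline{y}_h \right\} = \sqrt{\sum_{h=1}^{H} W_h^2 \, [se(\overline{y}_h)]^2}, \]

where the estimated $se(\overline{y}_h)$ is given by the formulas for simple random sampling:

\[ se(\overline{y}_h) = \sqrt{\frac{s_h^2}{n_h}(1 - f_h)}. \]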

A Numerical Example

Suppose that for the lake sampling example discussed earlier the lake were subdivided into two strata, and that the following results were obtained. (All readings are in mg per litre; the stratum sizes $N_h$ are the water volumes of the strata.)

Stratum   $N_h$                $n_h$   Sample observations          $\overline{y}_h$   $s_h$
1         $7.5 \times 10^8$    5       37.2 46.6 45.3 38.1 40.4     41.52              4.23
2         $2.5 \times 10^7$    5       365 344 388 347 403          369.4              25.7

We begin by computing the estimated mean for each stratum and its associated standard error. The sampling fraction $n_h/N_h$ is so close to 0 that it can be safely ignored. For example, the standard error of the mean for stratum 1 is found as:

\[ se(\widehat{\mu}_1) = \sqrt{\frac{s_1^2}{n_1}(1 - f_1)} = \sqrt{\frac{4.23^2}{5}} = 1.89 \]

This gives the summary table:

Stratum   $n_h$   $\widehat{\mu}_h$   $se(\widehat{\mu}_h)$
1         5       41.52               1.8935
2         5       369.4               11.492

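Next, we estimate the total organic matter in each stratum. This is found by multiplying the mean concentration and its se for each stratum by the volume of the stratum:

\[ \widehat{\tau}_h = N_h \times \widehat{\mu}_h \]

\[ se(\widehat{\tau}_h) = N_h \, se(\widehat{\mu}_h) \]

For example, the estimated total organic matter in stratum 1 is found as:

\[ \widehat{\tau}_1 = N_1 \times \widehat{\mu}_1 = 7.5 \times 10^8 \times 41.52 = 311.4 \times 10^8 \]

\[ se(\widehat{\tau}_1) = N_1 \, se(\widehat{\mu}_1) = 7.5 \times 10^8 \times 1.89 = 14.175 \times 10^8 \]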

This gives the summary table:

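Stratum   $n_h$   $\widehat{\mu}_h$   $se(\widehat{\mu}_h)$   $\widehat{\tau}_h$      $se(\widehat{\tau}_h)$
1         5       41.52               1.8935                  $311.4 \times 10^8$     $14.175 \times 10^8$
2         5       369.4               11.492                  $92.3 \times 10^8$      $2.873 \times 10^8$

Next, we total the organic content of the two strata and find the se of the grand total as $\sqrt{14.175^2 + 2.873^2} \times 10^8$ to give the summary table:

Stratum   $n_h$   $\widehat{\mu}_h$   $se(\widehat{\mu}_h)$   $\widehat{\tau}_h$      $se(\widehat{\tau}_h)$
1         5       41.52               1.8935                  $311.4 \times 10^8$     $14.175 \times 10^8$
2         5       369.4               11.492                  $92.3 \times 10^8$      $2.873 \times 10^8$
Total                                                         $403.7 \times 10^8$     $14.46 \times 10^8$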

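Finally, the overall grand mean is found by dividing by the total volume of the lake, $7.75 \times 10^8$, to give:

\[ \widehat{\mu} = \frac{403.7 \times 10^8}{7.75 \times 10^8} = 52.09 \text{ mg/L} \]

\[ se(\widehat{\mu}) = \frac{14.46 \times 10^8}{7.75 \times 10^8} = 1.87 \text{ mg/L} \]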

The calculations required to compute the stratified estimate can also be done using the method of weighted averages as shown in the following table:

Stratum   $N_h$                 $W_h$ $(= N_h/N)$   $\overline{y}_h$   $W_h \overline{y}_h$   $se(\overline{y}_h)$   $W_h^2 [se(\overline{y}_h)]^2$
1         $7.5 \times 10^8$     0.9677              41.52              40.180                 1.8935                 3.3578
2         $2.5 \times 10^7$     0.0323              369.4              11.916                 11.492                 0.1374
Totals    $7.75 \times 10^8$    1.0000                                 52.097                                        3.4952

se = $\sqrt{3.4952}$

Hence the estimate of the overall average is 52.097 mg/L, and the associated estimated standard error is $\sqrt{3.4952} = 1.870$ mg/L. An approximate 95% confidence interval is then found in the usual fashion. As expected, these match the previous results.
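The weighted-average computation for the lake can be checked with a short Python sketch that starts from the ten raw readings. The variable names are arbitrary, and the finite population correction is ignored because, as noted above, the sampling fractions are essentially zero.

```python
import math
from statistics import mean, stdev

# Readings (mg/L) and stratum volumes from the table in the text.
readings = {
    1: [37.2, 46.6, 45.3, 38.1, 40.4],
    2: [365, 344, 388, 347, 403],
}
volume = {1: 7.5e8, 2: 2.5e7}
total_volume = sum(volume.values())

mu_hat = 0.0
var_mu = 0.0
for h, y in readings.items():
    w_h = volume[h] / total_volume           # stratum weight W_h
    se_h = stdev(y) / math.sqrt(len(y))      # se of stratum mean, fpc ignored
    mu_hat += w_h * mean(y)
    var_mu += w_h ** 2 * se_h ** 2

se_mu = math.sqrt(var_mu)
ci_95 = (mu_hat - 2 * se_mu, mu_hat + 2 * se_mu)   # the usual +/- 2 se interval
print(round(mu_hat, 3), round(se_mu, 3), ci_95)
```

The estimate and its se agree with the 52.097 mg/L and 1.870 mg/L obtained in the table.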

This discussion swept a number of practical difficulties under the carpet. These include (a) estimating the volume of each of the two portions of the lake, (b) taking properly randomized samples from within each stratum, (c) selecting the appropriate size of each water sample, (d) measuring the concentration for each water sample, and (e) choosing the appropriate number of water samples from each stratum. None of these tasks is simple. Estimating the volume of a portion of a lake, for example, typically involves taking numerous depth readings and then applying a formula for approximating integrals. This problem is beyond the scope of these notes.

The standard error of the estimator of the overall average is markedly reduced in this example by the stratification. The standard error was just estimated for the stratified estimator to be around 2. This result was for a sample of total size 10. By contrast, for an estimator based on a simple random sample of the same size, the standard error can be found to be about 20. [This involves methods not covered in this class.] Stratification has reduced the standard error by an order of magnitude.
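One rough way to see where a figure of about 20 could come from is sketched below in Python. This is not necessarily the calculation the notes have in mind; it is an approximation that assumes the overall population variance can be pieced together from the stratum summary statistics as a weighted within-stratum variance plus a weighted between-stratum variance, and then treats the 10 readings as a simple random sample from the whole lake.

```python
import math

# Stratum weights, means and standard deviations from the lake example.
W = [7.5e8 / 7.75e8, 2.5e7 / 7.75e8]
ybar = [41.52, 369.4]
s = [4.23, 25.7]
n = 10                                   # total sample size

mu_str = sum(w * m for w, m in zip(W, ybar))

# Approximate population variance = within-stratum part + between-stratum part.
within = sum(w * sd ** 2 for w, sd in zip(W, s))
between = sum(w * (m - mu_str) ** 2 for w, m in zip(W, ybar))

se_srs = math.sqrt((within + between) / n)   # fpc ignored
print(round(se_srs, 1))
```

Almost all of the variance comes from the between-stratum term, which is exactly the component that stratification removes from the standard error.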

It is also possible that we could reduce the standard error even further without increasing our sampling effort by allocating this effort more efficiently. Perhaps we should take fewer water samples from the region far from the stream mouth, and take more from the other stratum. This will be covered later in this course.

One can also read in more comprehensive accounts how to construct estimates from samples that are stratified after the sample is selected. This is known as post-stratification. These methods are useful if, e.g., you are sampling a population with a known sex ratio. If you observe that your sample is biased in favor of one sex, you can use this information to build an improved estimate of the quantity of interest through stratifying the sample by sex after it is collected. It is not necessary that you start out with a plan for sampling some specified number of individuals from each sex (stratum).



Nonetheless, in any survey work, it is crucial that you begin with a plan. There are many examples of surveys that produced virtually useless results because the researchers failed to develop an appropriate plan. The plan should include a statement of your main objective, and detailed descriptions of how you plan to generate the sample, collect the data, enter them into a computer file, and analyze the results. The plan should contain discussion of how you propose to check for and correct errors at each stage. It should be tested with a pilot survey, and modified accordingly. Major, ongoing surveys should be reassessed continually for possible improvements. There is no reason to expect that the survey design will be perfect the first time that it is tried, nor that flaws will all be discovered in the first round. On the other hand, one should expect that after many years' experience, the researchers will have honed the survey into a solid instrument. George Gallup's early surveys were seriously biased. Although it took over a decade for the flaws to come to light, once they did, he corrected his survey design promptly, and continued to build a strong reputation.

One should also be cautious in implementing stratified survey designs for long-term studies. An efficient stratification of the Fraser Delta in 1994, e.g., might be hopelessly out of date 50 years from now, with a substantially altered configuration of channels and islands. You should anticipate the need to revise your stratification periodically.

4.7.5 Example - estimating the total catch of salmon

DFO needs to monitor the catch of sockeye salmon as the season progresses so that stocks are not overfished.

The season in one statistical sub-area in a year was a total of 2 days (!), and 250 vessels participated in the fishery on each of these 2 days. A census of the catch of each vessel at the end of each day is logistically difficult.

In this particular year, observers were randomly placed on selected vessels, and at the end of each day the observers contacted DFO managers with a count of the number of sockeye caught on that day.

Here are the raw data; each line corresponds to the observer's count for that vessel for that day. On the second day, a new random sample of vessels was selected. On both days, 250 vessels participated in the fishery.

Date        Sockeye
29-Jul-98   337
29-Jul-98   730
29-Jul-98   458
29-Jul-98   98
29-Jul-98   82
29-Jul-98   28
29-Jul-98   544
29-Jul-98   415
29-Jul-98   285
29-Jul-98   235
29-Jul-98   571
29-Jul-98   225
29-Jul-98   19
29-Jul-98   623
29-Jul-98   180
30-Jul-98   97
30-Jul-98   311
30-Jul-98   45
30-Jul-98   58
30-Jul-98   33
30-Jul-98   200
30-Jul-98   389
30-Jul-98   330
30-Jul-98   225
30-Jul-98   182
30-Jul-98   270
30-Jul-98   138
30-Jul-98   86
30-Jul-98   496
30-Jul-98   215

What is the population of interest?

The population of interest is the set of vessels participating in the fishery on the two days. [The fact that each vessel likely participated on both days is not really relevant.] The population of interest is NOT the salmon captured; the catch is the response variable for each boat, and its total is of interest.

What is the sampling frame?

It is not clear how the list of fishing boats was generated. It seems unlikely that an aerial survey actually provided a picture of the boats on the water from which DFO selected some boats. More likely, the observers were taken onto the water in some systematic fashion, and then each observer selected a boat at random from those seen at that point. Hence the sampling frame is the set of locations chosen to drop off the observers and the set of boats visible from these points.

What is the sampling design?

The sampling unit is a boat on a day. The strata are the two days. On each day, a random sample was selected from the boats participating in the fishery.

This is a stratified design with a simple random sample selected each day.

Note that in this survey, it is logistically impossible to do a simple random sample over both the days as the number of vessels participating really isn't known for any day until the fishery starts. Here, stratification takes the form of administrative convenience.

Excel analysis

A copy of an Excel workbook called sockeye.xls is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A summary of the page appears below:





The data are listed on the spreadsheet on the left.

Summary statistics

The Excel built-in functions are used to compute the summary statistics (sample size, sample mean, and sample standard deviation) for each stratum. Some caution needs to be exercised that the range of each function covers only the data for that stratum (see footnote 5).

You will also need to specify the stratum size (the total number of sampling units in each stratum), i.e. 250 vessels on each day.

Find estimates of the mean catch for each stratum

Because the sampling design in each stratum is a simple random sample, the same formulae as in the previous section can be used.

The mean and its estimated se for each day of the opening are reported in the spreadsheet.

Find the estimates of the total catch for each stratum

The estimated total catch is found by multiplying the average catch per boat by the total number of boats participating in the fishery. The estimated standard error for the total for that day is found by multiplying the standard error for the mean by the stratum size as in the previous section.

For example, in the first stratum (29 July), the estimated total catch is found by multiplying the estimated mean catch per boat (322) by the number of boats participating (250) to give an estimated total catch of 80,500 salmon for the day. The se for the total catch is found by multiplying the se of the mean (57) by the number of boats participating (250) to give the se of the total catch for the day of about 14,200 salmon.

Find estimate of grand total

Once an estimated total is found for each stratum, the estimated grand total is found by summing the individual stratum estimated totals. The estimated standard error of the grand total is found as the square root of the sum of the squares of the standard errors in each stratum; the Excel function sumsq is useful for this computation.

5 If you are proficient with Excel, Pivot-Tables are an ideal way to compute the summary statistics for each stratum. An application of Pivot-Tables is demonstrated in the analysis of a cluster sample where the cluster totals are needed for the summary statistics.

Estimates of the overall grand mean

This was not done in the spreadsheet, but is easily computed by dividing the total catch by the total number of boat-days in the fishery (250+250=500). The se is found by dividing the se of the total catch also by 500.

Note that this is interpreted as the mean number of fish captured per day per boat.
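The spreadsheet steps described above (summary statistics, stratum means, stratum totals, grand total, grand mean) can also be reproduced from the raw data with a short Python sketch. The dictionary and variable names are arbitrary; the only inputs are the 30 observer counts and the 250 vessels per day given in the text.

```python
import math
from statistics import mean, stdev

catch = {
    "29-Jul-98": [337, 730, 458, 98, 82, 28, 544, 415, 285, 235,
                  571, 225, 19, 623, 180],
    "30-Jul-98": [97, 311, 45, 58, 33, 200, 389, 330, 225, 182,
                  270, 138, 86, 496, 215],
}
vessels = {"29-Jul-98": 250, "30-Jul-98": 250}     # stratum sizes N_h

grand_total = 0.0
grand_var = 0.0
for day, y in catch.items():
    n_h, N_h = len(y), vessels[day]
    # se of the stratum mean, including the finite population correction
    se_mean = math.sqrt(stdev(y) ** 2 / n_h * (1 - n_h / N_h))
    total_h = N_h * mean(y)
    se_total_h = N_h * se_mean
    print(day, mean(y), round(se_mean, 1), total_h, round(se_total_h))
    grand_total += total_h
    grand_var += se_total_h ** 2

boat_days = sum(vessels.values())
se_grand_total = math.sqrt(grand_var)
grand_mean = grand_total / boat_days
se_grand_mean = se_grand_total / boat_days
print(grand_total, round(se_grand_total), grand_mean, round(se_grand_mean, 2))
```

The grand total and grand mean obtained this way can be compared directly with the output of the survey procedures in the statistical packages used below.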

SAS analysis

As noted earlier, some care must be used when standard statistical packages are used to analyze survey data as many packages ignore the design used to select the data.

A sample SAS program for the analysis of the sockeye example called sockeye.sas and its output called sockeye.lst is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A copy of the program is as follows:

/* Example of a stratified sample analysis using SAS */

/* On each of two days, a sample of vessels was assigned
   observers who counted the number of sockeye salmon caught
   in that day. On the second day, a new set of vessels was observed. */

title 'Number of sockeye caught - example of stratified simple random sampling';

options nodate nonumber noovp nocenter linesize=75;

data sockeye;  /* read in the data */
   length date $8.;
   input date $ sockeye;
   /* compute the sampling weight. In general,
      these will be different for each stratum */
   if date = '29-Jul' then sampweight = 250/15;
   if date = '30-Jul' then sampweight = 250/15;
   datalines;
29-Jul 337
29-Jul 730
29-Jul 458
29-Jul 98
29-Jul 82
29-Jul 28
29-Jul 544
29-Jul 415
29-Jul 285
29-Jul 235
29-Jul 571
29-Jul 225
29-Jul 19
29-Jul 623
29-Jul 180
30-Jul 97
30-Jul 311
30-Jul 45
30-Jul 58
30-Jul 33
30-Jul 200
30-Jul 389
30-Jul 330
30-Jul 225
30-Jul 182
30-Jul 270
30-Jul 138
30-Jul 86
30-Jul 496
30-Jul 215
;;;;

data n_boats;  /* you need to specify the stratum sizes if you want stratum totals */
   length date $8.;
   date = '29-Jul'; _total_=250; output;  /* the stratum sizes must be variable _total_ */
   date = '30-Jul'; _total_=250; output;

proc print data=sockeye;
   title2 'raw data from the survey';

proc print data=n_boats;
   title2 'number of boats in each stratum';

proc surveymeans data=sockeye
      N = n_boats   /* dataset with the stratum population sizes present */
      mean          /* average catch/boat along with standard error */
      sum ;         /* request estimates of total */
   strata date / list;  /* identify the stratum variable */
   var sockeye;         /* which variable to get estimates for */
   weight sampweight;
run;



The program starts with reading in the raw data and the computation of the sampling weights. Because the population size and sample size are the same for each stratum, the sampling weights are common to all boats. In general, this is not true, and a separate sampling weight computation is required for each stratum.

A separate file is also constructed with the population sizes for each stratum so that estimates of the population total can be constructed.

The SURVEYMEANS procedure then uses the STRATA statement to identify that this is a stratified design. The default analysis in each stratum is again a simple random sample.

The SAS output is:

Number of sockeye caught - example of stratified simple random sampling
raw data from the survey

Obs    date      sockeye    sampweight

  1    29-Jul      337       16.6667
  2    29-Jul      730       16.6667
  3    29-Jul      458       16.6667
  4    29-Jul       98       16.6667
  5    29-Jul       82       16.6667
  6    29-Jul       28       16.6667
  7    29-Jul      544       16.6667
  8    29-Jul      415       16.6667
  9    29-Jul      285       16.6667
 10    29-Jul      235       16.6667
 11    29-Jul      571       16.6667
 12    29-Jul      225       16.6667
 13    29-Jul       19       16.6667
 14    29-Jul      623       16.6667
 15    29-Jul      180       16.6667
 16    30-Jul       97       16.6667
 17    30-Jul      311       16.6667
 18    30-Jul       45       16.6667
 19    30-Jul       58       16.6667
 20    30-Jul       33       16.6667
 21    30-Jul      200       16.6667
 22    30-Jul      389       16.6667
 23    30-Jul      330       16.6667
 24    30-Jul      225       16.6667
 25    30-Jul      182       16.6667
 26    30-Jul      270       16.6667
 27    30-Jul      138       16.6667
 28    30-Jul       86       16.6667
 29    30-Jul      496       16.6667
 30    30-Jul      215       16.6667

Number of sockeye caught - example of stratified simple random sampling
number of boats in each stratum

Obs    date      _total_

  1    29-Jul      250
  2    30-Jul      250

Number of sockeye caught - example of stratified simple random sampling
number of boats in each stratum

Data Summary

Number of Strata                   2
Number of Observations            30
Sum of Weights                   500

Stratum Information

Stratum               Population    Sampling
  Index    date            Total        Rate    N Obs    Variable      N
---------------------------------------------------------------------------
      1    29-Jul            250       6.00%       15    sockeye      15
      2    30-Jul            250       6.00%       15    sockeye      15
---------------------------------------------------------------------------

Statistics

                            Std Error
Variable          Mean        of Mean           Sum       Std Dev
------------------------------------------------------------------------
sockeye     263.500000      33.082758        131750         16541
------------------------------------------------------------------------

The results are the same as before.

The only thing of “interest” is to note that SAS labels the precision of the estimated grand mean as a standard error while it labels the precision of the estimated total as a standard deviation! Both are correct: a standard error is a standard deviation, not of individual units in the population, but of the estimates over repeated sampling from the same population. I think it is clearer to label both as standard errors to avoid any confusion.

If separate analyses are wanted for each stratum, the SURVEYMEANS procedure has to be run a second time, with a BY statement, to estimate the means and totals in each stratum.

Again, it is likely easiest to do planning for future experiments in an Excel spreadsheet rather than using SAS.

JMP analysis

The data are available in a JMP file called sockeye.jmp from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered as usual into JMP. Ensure that the variable that identifies the strata is nominally scaled and that the response variable is continuously scaled:





We start by finding the summary statistics for EACH stratum. This is done using the Analyze->Distribution platform as in the analysis of the creel data, but we want a separate analysis for each stratum. This is obtained by using the By box in the dialogue box:



The output from the Analyze->Distribution platform is stacked and unnecessary information removed as in the creel example. The final summary statistics for each stratum are:



The estimated catch per boat in the first stratum is 322 (se 59) and in the second stratum is 205 (se 35). Note that the differences in se between the JMP and Excel results are small because the sampling fraction (15/250=6%) is very small.

Unfortunately, we now need to complete the rest of the computations by hand. The estimated total catch in each stratum is found by multiplying the estimated catch per boat by the number of boats, giving an estimated total catch of 80,500 (se 14,640) in stratum 1 and 51,250 (se 8,757) in stratum 2.

The estimated grand total catch is found by adding the stratum totals, 131,750 = 80,500 + 51,250. The se of the grand total is found as $\sqrt{14640^2 + 8757^2} = 17060$. [Again, all of the se are slightly larger because the finite population correction factor has not been applied, but the differences in the se are only about 3%.]
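The size of the effect of omitting the finite population correction can be checked with a few lines of Python. This sketch uses only the stratum standard deviations reported for this example (226.818 and 135.695), the sample size of 15 and the stratum size of 250; the variable names are arbitrary.

```python
import math

n, N = 15, 250
sds = [226.818, 135.695]           # stratum standard deviations

# se of each stratum total without the finite population correction (fpc)
se_no_fpc = [N * s / math.sqrt(n) for s in sds]

# the fpc multiplies every se by sqrt(1 - n/N)
fpc = math.sqrt(1 - n / N)
se_with_fpc = [se * fpc for se in se_no_fpc]

grand_no_fpc = math.sqrt(sum(se ** 2 for se in se_no_fpc))
grand_with_fpc = math.sqrt(sum(se ** 2 for se in se_with_fpc))

print(round(grand_no_fpc), round(grand_with_fpc))
print(round(100 * (grand_no_fpc / grand_with_fpc - 1), 1), "% larger without fpc")
```

Because the sampling fraction is the same (6%) in both strata, every se is inflated by the same factor 1/sqrt(0.94), which is where the difference of about 3% comes from.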

Doing everything in JMP

It is possible to do everything in JMP including the finite population factor and totaling over strata. We follow the same steps as in the previous analyses.

We need to find the sample size, the mean, and the standard deviation for each stratum. We can use the Tables->Summary pop-down menu as shown below:

Note that we must specify the grouping variable in this case to identify the strata.

This generates the summary table with statistics for each stratum:

date         N rows   N(sockeye)   Mean   Std dev
29-Jul-98    15       15           322    226.818
30-Jul-98    15       15           205    135.695



At this point, further computations using JMP are clumsy. It may be easier to transfer the summary statistics to a spreadsheet to continue the computations.

Add a new column for the number of vessels in each stratum, and two more columns where you estimate the total catch for the day (mean catch x number of vessels) and the squared standard error (se**2) of the total in each stratum using the formula:

\[ se(\widehat{\tau}_h)^2 = N_h^2 \, \frac{s_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right) \]

This generates the revised summary table with statistics for each stratum:

Finally, add the totals and the squares of the se from each stratum to get the overall total and the overall square of the se,

Take the square root of the square of the se to get the overall se and, voila,

Hence our final estimate is that a total of 131,750 sockeye were caught with a se of 16,541 fish.

When should the various estimates be used?

In a stratified sample, there are many estimates that are obtained with different standard errors. It can sometimes be confusing as to which estimate is used for which purpose.

Here is a brief review of the four possible estimates and the level of interest in each estimate.



Parameter: Stratum mean
Estimator: $\hat{\mu}_h = \bar{Y}_h$
se: $\sqrt{\frac{s_h^2}{n_h}(1-f_h)}$
Example and interpretation: Stratum 1. Estimate is 322; se of 56.8 (not shown). The estimated average catch per boat was 322 fish (se 56.8 fish) on 29 July.
Who would be interested in this quantity? A fisher who wishes to fish ONLY the first day of the season and wants to know if it will meet expenses.

Parameter: Stratum total
Estimator: $\hat{\tau}_h = N_h \hat{\mu}_h = N_h \bar{Y}_h$
se: $N_h \, se(\hat{\mu}_h) = N_h \sqrt{\frac{s_h^2}{n_h}(1-f_h)}$
Example and interpretation: Stratum 1. Estimate is 80,500 = 250 x 322; se of 14,195 = 250 x 56.8. The estimated total catch over all boats on 29 July was 80,500 (se 14,195).
Who would be interested in this quantity? DFO, who wishes to estimate TOTAL catch over ALL boats on this single day so that the quota for the next day can be set.

Parameter: Grand total
Estimator: $\hat{\tau} = \hat{\tau}_1 + \hat{\tau}_2$
se: $\sqrt{se(\hat{\tau}_1)^2 + se(\hat{\tau}_2)^2}$
Example and interpretation: Estimate is 131,750 = 80,500 + 51,250; se is $\sqrt{14195^2 + 8492^2} = 16541$. The estimated total catch over all boats over all days is 132,000 fish (se 17,000 fish).
Who would be interested in this quantity? DFO, who wishes to know total catch over the entire fishing season so that impacts on stock can be examined.

Parameter: Grand average
Estimator: $\hat{\mu} = \hat{\tau}/N$
se: $se(\hat{\tau})/N$
Example and interpretation: Grand mean (not shown). N = 500 vessel-days. Estimate is 131,750/500 = 263.5; se is 16,541/500 = 33.1. The estimated catch per boat per day over the entire season was 263 fish (se 33 fish).
Who would be interested in this quantity? A fisher who wants to know average catch per boat per day for the entire season to see if it will meet expenses.
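The four estimates reviewed above can be reproduced from the stratum summary statistics alone. The following is a minimal Python sketch, not part of the original Excel, SAS, or JMP analyses; the function and variable names are illustrative assumptions, and the inputs are the sockeye summary statistics reported earlier (15 of 250 vessels sampled on each day).

```python
import math

# Stratum summaries from the sockeye example:
# (label, N_h vessels, n_h sampled, sample mean, sample std dev)
strata = [
    ("29-Jul", 250, 15, 322.0, 226.818),
    ("30-Jul", 250, 15, 205.0, 135.695),
]

def stratum_estimates(N_h, n_h, mean, sd):
    """Return (mean, se_mean, total, se_total) for one stratum under SRSWOR."""
    fpc = 1 - n_h / N_h                      # finite population correction
    se_mean = math.sqrt(sd**2 / n_h * fpc)
    return mean, se_mean, N_h * mean, N_h * se_mean

results = {label: stratum_estimates(N, n, m, s) for label, N, n, m, s in strata}

# Grand total: add the stratum totals, and add the squared se's before taking the root
N_total = sum(N for _, N, _, _, _ in strata)
grand_total = sum(r[2] for r in results.values())
se_grand_total = math.sqrt(sum(r[3] ** 2 for r in results.values()))

# Grand average: divide the grand total and its se by the number of vessel-days
grand_mean = grand_total / N_total
se_grand_mean = se_grand_total / N_total

for label, (m, se_m, t, se_t) in results.items():
    print(f"{label}: mean {m:.1f} (se {se_m:.1f}), total {t:.0f} (se {se_t:.0f})")
print(f"grand total {grand_total:.0f} (se {se_grand_total:.0f})")
print(f"grand mean {grand_mean:.1f} (se {se_grand_mean:.1f})")
```

The sketch follows the same order as the spreadsheet: stratum means, stratum totals, then the grand total and grand average.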


4.8 Sample Size for Stratified Designs

4.8.1 Total sample size

As before, the question arises as to how many units should be selected in stratified designs.

Two questions need to be answered. First, what is the total sample size required? Second, how should these be allocated among the strata?

The total sample size can be determined using the same methods as for a simple random sample. I would suggest that you initially ignore the fact that the design will be stratified when finding the initial required total sample size. If stratification proves to be useful, then your final estimate will be more precise than you anticipated (always a nice thing to happen!) but seeing as you are making guesses as to the standard deviations and necessary precision required, I wouldn’t worry about the extra cost in sampling too much.

If you must, it is possible to derive formulae for the overall sample sizes when accounting for stratification, but these are relatively complex. It is likely easier to build a general spreadsheet where a single cell is the total sample size and all other cells in the formula depend upon this quantity according to the allocation used. Then the total sample size can be manipulated to obtain the desired precision. The following information will be required:

• The sizes (or relative sizes) of each stratum (i.e. the Nh or Wh).

• The standard deviation of measurements in each stratum. This can be obtained from past surveys, a literature search, or expert opinion.

• The desired precision – overall – and if needed, for each stratum.

Again refer to the sockeye.xls spreadsheet:


The standard deviations from this survey will be used as ‘guesses’ for what might happen next year. As in this year’s survey, the total sample size will be allocated evenly between the two days.

In this case, the total sample size must be allocated to the two strata. You will see several methods in a later section to do this, but for now, assume that the total sample will be allocated equally among both strata. Hence the proposed sample size of 75 is split in half to give a proposed sample size of 37.5 in each stratum. Don’t worry about the fractional sample size - this is only a planning exercise. We create one cell that has the total sample size, and then use the formulae to allocate the total sample size equally to the two strata. The total and the se of the overall total are found as before, and the relative precision (denoted as the relative standard error (rse), and, unfortunately, in some books as the coefficient of variation (cv)) is found as the estimated standard error/estimated total.

Again, this portion of the spreadsheet is set up so that changes in the total sample size are propagated throughout the sheet. If you change the total sample size from 75 to some other number, this is automatically split among the two strata, which then affects the estimated standard error for each stratum, which then affects the estimated standard error for the total, which then affects the relative standard error. Again, the proposed total sample size can be varied using trial and error, or the Excel Goal-Seek option can be used.

Here is what happens when a sample size of 75 is used. Don’t be alarmed by the fractional sample sizes in each stratum - the goal is again to get a rough feel for the required effort for a certain precision.

Total n = 75

Stratum      n   Mean   Std dev   Vessels   Est total   se (Est total)
29-Jul    37.5    322     226.8       250       80500             8537
30-Jul    37.5    205     135.7       250       51250             5107
Total                                          131750             9948

rse 7.6%

A sample size of 75 is too small. Try increasing the sample size until the rse is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to find the sample size that gives a relative standard error of 5% or less as shown below:


Total n = 145

Stratum      n   Mean   Std dev   Vessels   Est total   se (Est total)
29-Jul    72.5    322     226.8       250       80500             5611
30-Jul    72.5    205     135.7       250       51250             3357
Total                                          131750             6539

rse 5.0%
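The spreadsheet logic (one cell for the total sample size, with everything else derived from it) can also be sketched in a few lines of Python. This is an illustrative sketch only, not the contents of sockeye.xls: the function names are assumptions, the planning values are the stratum means and standard deviations from this year's survey, and the sample is split equally between the two days as in the text.

```python
import math

# Planning values from this year's survey: (N_h vessels, mean, std dev)
strata = [(250, 322.0, 226.8), (250, 205.0, 135.7)]

def rse_equal_allocation(n_total):
    """Relative standard error of the estimated grand total when n_total is
    split equally among the strata (fractional n_h allowed for planning)."""
    n_h = n_total / len(strata)
    total = sum(N * mean for N, mean, _ in strata)
    var = sum((N * sd) ** 2 / n_h * (1 - n_h / N) for N, _, sd in strata)
    return math.sqrt(var) / total

def smallest_n(target_rse, n_max=500):
    """Trial-and-error search: smallest whole total sample size meeting the target."""
    for n in range(2, n_max + 1):
        if rse_equal_allocation(n) <= target_rse:
            return n
    return None

print(f"rse at n=75:  {rse_equal_allocation(75):.1%}")
print(f"rse at n=145: {rse_equal_allocation(145):.1%}")
print("smallest whole n with rse <= 5%:", smallest_n(0.05))
```

The whole-number search plays the role of trial and error or Goal Seek; it may stop a unit or so away from a Goal Seek answer, which is of no consequence in a planning exercise.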

4.8.2 Allocating samples among strata

There are a number of ways of allocating a sample of size n among the various strata. For example,

1. Equal allocation. Under an equal allocation scheme, all strata get the same sample size, i.e. $n_h = n/H$. This allocation is best if variances of strata are roughly equal, equally precise estimates are required for each stratum, and you wish to test for differences in means among strata (i.e. an analytical survey discussed in previous sections).

2. Proportional allocation. Under proportional allocation, sample sizes are allocated to be proportional to the number of sampling units in the strata, i.e. $n_i = n \times \frac{N_i}{N} = n \times \frac{N_i}{\sum N_h} = n \times \frac{N_i}{N_1+N_2+\cdots+N_H} = n \times W_i$. This allocation is simple to plan and intuitively appealing. However, it is not the best design. This design may waste effort because large strata get large sample sizes but precision is determined by sample size not the ratio of sample size to population size. For example, if one stratum is 10 times larger than any other stratum, it is not necessary to allocate 10 times the sampling effort to get the same precision in that stratum.

3. Optimal allocation - a.k.a. Neyman allocation. In optimal allocation, the sample is allocated to minimize the overall variance, keeping the sample size fixed. Tedious algebra gives that the sample should be allocated proportional to the product of the stratum size and the stratum standard deviation, i.e. $n_i = n \times \frac{W_i S_i}{\sum W_h S_h} = n \times \frac{N_i S_i}{\sum N_h S_h} = n \times \frac{N_i S_i}{N_1 S_1 + N_2 S_2 + \cdots + N_H S_H}$. This allocation will be appropriate if the costs of measuring units are the same in all strata. Intuitively, the strata that have the most sampling units should be weighted more heavily; strata with larger standard deviations must have more samples allocated to them to get the se of the sample mean within the stratum down to a reasonable level. A key assumption of this allocation is that the cost to sample a unit is the same in all strata.

4. Optimal allocation when costs are involved. In some cases, the costs of sampling differ among the strata. Suppose that it costs $C_i$ to sample each unit in stratum i. Then the total cost of the survey is $C = \sum n_h C_h$. The allocation rule is that sample sizes should be proportional to the product of stratum sizes, stratum standard deviations, and the inverse of the square root of the cost of sampling, i.e. $n_i = n \times \frac{W_i S_i/\sqrt{C_i}}{\sum (W_h S_h/\sqrt{C_h})} = n \times \frac{N_i S_i/\sqrt{C_i}}{\sum (N_h S_h/\sqrt{C_h})} = n \times \frac{N_i S_i/\sqrt{C_i}}{N_1 S_1/\sqrt{C_1} + N_2 S_2/\sqrt{C_2} + \cdots + N_H S_H/\sqrt{C_H}}$. This implies that large samples are found in strata that are larger, more variable, or cheaper to sample.
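The four allocation rules in the list above differ only in the weight given to each stratum, which the following Python sketch makes explicit. The function name `allocate`, its arguments, and the three-stratum planning values are hypothetical and chosen only for illustration; they do not come from any survey in these notes.

```python
import math

def allocate(n, N, S=None, C=None, rule="proportional"):
    """Split a total sample size n among strata.

    N: stratum sizes; S: stratum standard deviations; C: per-unit costs.
    Returns fractional sample sizes (round them for an actual survey).
    """
    if rule == "equal":
        weights = [1.0] * len(N)
    elif rule == "proportional":
        weights = list(N)
    elif rule == "neyman":
        weights = [Nh * Sh for Nh, Sh in zip(N, S)]
    elif rule == "cost":
        weights = [Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical planning values (illustration only)
N = [100, 200, 700]        # stratum sizes
S = [10.0, 20.0, 5.0]      # guessed stratum standard deviations
C = [1.0, 4.0, 1.0]        # cost to sample one unit in each stratum

for rule in ("equal", "proportional", "neyman", "cost"):
    n_h = allocate(100, N, S, C, rule)
    print(f"{rule:13s}", [round(x, 1) for x in n_h])
```

Note that this sketch does not cap a stratum's allocation at its size $N_h$; with very variable small strata that check would need to be added by hand.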

In practice, most of the gain in precision occurs from moving from equal to proportional allocation, while often only small improvements in precision are gained from moving from proportional allocation to Neyman allocation. Similarly, unless cost differences are enormous, there isn’t much of an improvement in precision from moving to an optimal allocation based on costs.

Example - estimating the size of a caribou herd. We wish to estimate the size of a caribou herd. The density of caribou differs dramatically based on the habitat type. The survey area was divided into six strata based on habitat type. The survey design is to divide each stratum into 4 km² quadrats that will be randomly selected. The number of caribou in the quadrats will be counted from an aerial photograph.

An Excel workbook called caribou.xls is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The key point to examining different allocations is to make a single cell represent the total sample size and then make each of the stratum sample sizes a formula that is a function of the total.

The total sample size can be found by varying the sample total until the desired precision is found.

Results from previous year’s survey: Here are the summary statistics from the survey in a previous year:

(Nh = map-squares in the stratum; nh = map-squares sampled)

Stratum    Nh    nh    ybar       s   Est total   se (total)
1         400    98    24.1    74.7        9640         2621
2          40    10    25.6    63.7        1024          698
3         100    37   267.6   589.5       26760         7693
4          40     6   179     151.0        7160         2273
5          70    39   293.7   351.5       20559         2622
6         120    21    33.2    99.0        3984         2354
Total     770   211                       69127         9172



The estimated size of the herd is 69,127 animals with an estimated se of 9,172 animals.

Equal allocation

What would happen if an equal allocation were used? We now split the 211 total sample size equally among the 6 strata. In this case, the sample sizes are ‘fractional’, but this is OK as we are interested only in planning to see what would have happened. Notice that the estimate of the overall population would NOT change, but the se changes.

Stratum    Nh     nh    ybar       s   Est total   se (total)
1         400   35.2    24.1    74.7        9640         4810
2          40   35.2    25.6    63.7        1024          149
3         100   35.2   267.6   589.5       26760         8005
4          40   35.2   179     151.0        7160          354
5          70   35.2   293.7   351.5       20559         2927
6         120   35.2    33.2    99.0        3984         1684
Total     770    211                       69127         9938

An equal allocation gives rise to worse precision than the original survey. Examining the table in more detail, you see that far too many samples are allocated in an equal allocation to strata 2 and 4 and not enough to strata 1 and 3.

Proportional allocation

What about proportional allocation? Now the sample size is proportional to the stratum population sizes. For example, the sample size for stratum 1 is found as 211 × 400/770. The following results are obtained:

Stratum    Nh      nh    ybar       s   Est total   se (total)
1         400   109.6    24.1    74.7        9640         2431
2          40    11.0    25.6    63.7        1024          656
3         100    27.4   267.6   589.5       26760         9596
4          40    11.0   179     151.0        7160         1554
5          70    19.2   293.7   351.5       20559         4787
6         120    32.9    33.2    99.0        3984         1765
Total     770     211                       69127        11263

This has an even worse standard error! It looks like not enough samples are placed in stratum 3 or 5.

Optimal allocation

What if both the stratum sizes and the stratum variances are to be used in allocating the sample? We create a new column (at the extreme right) which is equal to $N_h S_h$. Now the sample sizes are proportional to these values, i.e. the sample size for the first stratum is now found as 211 × 29866.4/133893.8. Again the estimate of the total doesn’t change but the se is reduced.

Stratum    Nh     nh    ybar       s   Est total   se (total)       NhSh
1         400   47.1    24.1    74.7        9640         4089    29866.4
2          40    4.0    25.6    63.7        1024         1206     2550.0
3         100   92.9   267.6   589.5       26760         1629    58953.9
4          40    9.5   179     151.0        7160         1709     6039.6
5          70   38.8   293.7   351.5       20559         2639    24607.6
6         120   18.7    33.2    99.0        3984         2522    11876.4
Total     770    211                       69127         6089   133893.8
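The comparison of allocations for the caribou survey can be reproduced outside the spreadsheet. The following Python sketch is an illustration, not the contents of caribou.xls; the helper names are assumptions, and the stratum standard deviations are recovered from the $N_h S_h$ column of the last table. It computes the se of the estimated herd size under the actual, equal, proportional, and Neyman allocations of the same 211 quadrats.

```python
import math

# Caribou strata from the previous year's survey: (N_h, n_h actually used, N_h * S_h)
strata = [
    (400, 98, 29866.4),
    (40, 10, 2550.0),
    (100, 37, 58953.9),
    (40, 6, 6039.6),
    (70, 39, 24607.6),
    (120, 21, 11876.4),
]
N = [s[0] for s in strata]
NS = [s[2] for s in strata]
n_total = sum(s[1] for s in strata)   # 211 quadrats in total

def se_total(n_h):
    """se of the estimated herd size for a given allocation (fpc included)."""
    var = sum(ns ** 2 / n * (1 - n / Nh) for Nh, ns, n in zip(N, NS, n_h))
    return math.sqrt(var)

def share(weights):
    """Allocate n_total in proportion to the given stratum weights."""
    return [n_total * w / sum(weights) for w in weights]

allocations = {
    "actual": [s[1] for s in strata],
    "equal": share([1] * len(strata)),
    "proportional": share(N),
    "neyman": share(NS),
}
se = {name: se_total(n_h) for name, n_h in allocations.items()}
for name, value in se.items():
    print(f"{name:13s} se(total) = {value:7.0f}")
```

Because the estimated total does not depend on the allocation, only the se needs to be recomputed for each scheme.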

4.8.3 Example: Estimating the number of tundra swans.

The Tundra Swan Cygnus columbianus, formerly known as the Whistling Swan, is a large bird with white plumage and black legs, feet, and beak. 6 The USFWS is responsible for conserving and protecting tundra swans as a migratory bird under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation Act of 1980. As part of these responsibilities, it conducts regular aerial surveys at one of their prime breeding areas in Bristol Bay, Alaska. The Bristol Bay population of tundra swans is of particular interest because suitable habitat for nesting is available earlier than in most other nesting areas. This example is based on one such survey. 7

Tundra swans are highly visible on their nesting grounds, making them easy to monitor during aerial surveys.

The Bristol Bay refuge has been divided into 186 survey units, each being a quarter section. These survey units have been divided into three strata based on density, and previous years’ data provide the following information about the strata:

Density    Total          Past      Past
Stratum    Survey Units   Density   Std Dev
High                 60        20        10
Medium               68        10         6
Low                  58         2         3
Total               186

6 Additional information about the tundra swan is available at http://www.hww.ca/hww2.asp?id=78&cid=7

7 Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska Peninsula, June 2002.

Based on past years’ results and budget considerations, approximately 30 survey units can be sampled.

The three strata are all approximately the same total area (number of survey units) so allocations based on stratum area will be approximately equal across strata. However, that would place about 1/3 of the effort into the low density stratum, which typically has fewer birds. It is felt that stratum density is a suitable measure of stratum importance (notice the close relationship between stratum density and stratum standard deviations which is often found in biological surveys). Consequently, an allocation based on stratum density was used. This allocation would place about $30 \times \frac{20}{32} \approx 18$ units in the high density stratum; about $30 \times \frac{10}{32} \approx 9$ units in the medium density stratum; and the remainder (3 units) in the low density stratum.

The survey was conducted with the following results:

Survey Unit   Stratum   Area (km2)   Swans in flocks   Single Birds   Pairs   Total birds
dilai2 h 148 12 6 24
naka41 h 137 13 15 43
naka43 h 137 6 16 38
naka51 h 16 10 3 2 17
nakb32 h 137 10 10 30
nakb44 h 135 6 18 12 48
nakc42 h 83 4 5 6 21
nakc44 h 109 17 15 47
nakd33 h 134 11 11 33
ugac34 h 65 2 10 22
ugac44 h 138 28 15 58
ugad5/63 h 159 9 20 49
dugad56/4 m 102 7 4 15
guad43 m 137 6 4 14
ugad42 m 137 5 11 15 46
low1 l 143 2 2
low3 l 138 1 1

The first thing to notice from the table above is that not all survey units could be surveyed because of poor weather. As always with missing data, it is important to determine if the data are Missing Completely at Random (MCAR). In this case, it seems reasonable that swans did not adjust their behavior knowing that certain survey units would be sampled on the poor weather days and so there is no impact of the missing data other than a loss of precision compared to a survey with a full 30 survey units chosen.

Also notice that “blanks” in the table (missing values) represent zeros and not really missing data.

Finally, not all of the survey units are the same area. This could introduce additional variation into our data which may affect our final standard errors. Even though the survey units are of different areas, the survey units were chosen as a simple random sample so ignoring the area will NOT introduce bias into the estimates (why?). You will see in later sections how to compute a ratio estimator which could take the area of each survey unit into account and potentially lead to more precise estimates.

JMP analysis. The data are imported into JMP and are available in the JMP file tundra.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data sheet appears below. Notice that zeros have been inserted in theappropriate locations.

Because the units in each stratum were selected using a simple random sample, the appropriate summary measures are means and standard errors for each mean for each stratum. Use the Tables->Summary to select appropriate statistics for each stratum:

Because the sample fraction is smallish, the finite population correction factor is ignored in each stratum. This gives the summary table:

In order to estimate the total number of swans in each stratum, augment the table with the number of survey units in each stratum, 8 and then multiply the mean and standard error by the number of survey units to estimate the total swans and se(total) in each stratum using the formula editor: 9

Hence we estimate that about 2000 swans are present in each of the H and M strata, but just under 100 in the L stratum. The grand total is found by adding the estimated totals from the strata, 2150+87+1700=3937, and the standard error of the grand total is found in the usual way, $\sqrt{230.12^2 + 29.00^2 + 714.27^2} = 751$, either using a calculator or a spreadsheet.

The standard error is larger than desired, mostly because of the very small sample size in the M stratum where only 3 of the 9 proposed survey units could be surveyed.

8 Notice that JMP sorts the strata alphabetically.

9 It may be easiest at this point to import back to Excel.

SAS analysis. A copy of the SAS program (tundra.sas) is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are read into SAS in the usual fashion with the code fragment:




data swans;
   length survey_unit $10 stratum $1;
   input survey_unit $ stratum $ area num_flocks num_single num_pairs;
   num_swans = num_flocks + num_single + 2*num_pairs;
   datalines;
... data inserted here ...

The total number of survey units in each stratum is also read into SAS using the code fragment below. Notice that the variable that has the number of stratum units must be called _total_ as required by the SurveyMeans procedure.

data total_survey_units;
   length stratum $1.;
   input stratum $ _total_; /* must use _total_ as variable name */
   datalines;
h 60
m 68
l 58
;;;;

Next the data are sorted by stratum (not shown), and the number of actual survey units surveyed in each stratum is found using Proc Means:

proc means data=swans noprint;
   by stratum;
   var num_swans;
   output out=n_units n=n;
run;

Most survey procedures in SAS require the use of sampling weights. These are the reciprocal of the probability of selection. In this case, this is simply the number of units in the stratum divided by the number sampled in each stratum:

data swans;
   merge swans total_survey_units n_units;
   by stratum;
   sampling_weight = _total_ / n;
run;

Now the individual stratum estimates are obtained using the code fragment:

/* first estimate the numbers in each stratum */
proc surveymeans data=swans
      total=total_survey_units /* inflation factors */
      sum clsum mean clm;
   by stratum; /* separate estimates by stratum */
   var num_swans;
   weight sampling_weight;
   ods output statistics=IndivEst;
run;

This gives the output:

                                  Lower    Upper                   Lower   Upper
Obs  stratum    Mean   StdErr    CLMean   CLMean    Sum   StdDev   CLSum   CLSum
1    h         35.83     3.43     28.28    43.38   2150      206    1697    2603
2    l          1.50     0.49     -4.74     7.74     87       28    -275     449
3    m         25.00    10.27    -19.19    69.19   1700      698   -1305    4705

The estimates in the L and M strata are not very precise because of the small number of survey units selected. SAS has incorporated the finite population correction factor when estimating the se for the individual stratum estimates.

We estimate that about 2000 swans are present in each of the H and M strata, but just under 100 in the L stratum. The grand total is found by adding the estimated totals from the strata, 2150+87+1700=3937, and the standard error of the grand total is found in the usual way, $\sqrt{206^2 + 28^2 + 698^2} = 729$.

Proc SurveyMeans can be used to estimate the grand total number of swans over all strata using the code fragment:

proc surveymeans data=swans
      total=total_survey_units /* inflation factors for each stratum */
      sum clsum mean clm; /* want to estimate grand totals */
   title2 'Estimate total number of swans';
   strata stratum /list; /* which variable define the strata */
   var num_swans; /* which variable to analyze */
   weight sampling_weight; /* sampling weight for each obs */
run;

This gives the output:

                        Lower    Upper                   Lower   Upper
Obs    Mean   StdErr   CLMean   CLMean    Sum   StdDev   CLSum   CLSum
1     21.17     3.92    12.77    29.57   3937      729    2374    5500


The standard error is larger than desired, mostly because of the very small sample size in the M stratum where only 3 of the 9 proposed survey units could be surveyed.
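The JMP and SAS results differ only in whether the finite population correction is applied within each stratum. The following Python sketch, which is an illustrative assumption and not part of tundra.jmp or tundra.sas, recomputes the grand total and its se both ways from the total birds counted on each surveyed unit.

```python
import math
from statistics import mean, stdev

# Total birds counted on each surveyed unit, and number of survey units per stratum
counts = {
    "h": [24, 43, 38, 17, 30, 48, 21, 47, 33, 22, 58, 49],
    "m": [15, 14, 46],
    "l": [2, 1],
}
units = {"h": 60, "m": 68, "l": 58}

def stratified_total(use_fpc):
    """Return (grand total, se) by adding stratum totals and squared se's."""
    total, var = 0.0, 0.0
    for h, y in counts.items():
        N_h, n_h = units[h], len(y)
        fpc = 1 - n_h / N_h if use_fpc else 1.0
        se_mean = math.sqrt(stdev(y) ** 2 / n_h * fpc)
        total += N_h * mean(y)
        var += (N_h * se_mean) ** 2
    return total, math.sqrt(var)

total, se_no_fpc = stratified_total(use_fpc=False)   # as in the JMP analysis
_, se_fpc = stratified_total(use_fpc=True)           # as in the SAS analysis
print(f"total {total:.0f}; se without fpc {se_no_fpc:.0f}; se with fpc {se_fpc:.0f}")
```

The M stratum dominates the variance in both versions, which is the point made in the text about the three surveyed units.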

4.8.4 Multiple stratification

In some cases, we need to allocate units based upon two or more variables. We first find the optimal allocation for each variable and see how they differ from one another - in many cases, there may not be a big difference.

There are quite complicated schemes available to optimize allocation for two or more variables. These will not be covered in this course.

4.8.5 Post-stratification

In some cases, it is inconvenient or impossible to stratify the population elements into strata before sampling, for example, when an attribute used for stratification is only available once the unit is sampled. For example, we may wish to stratify baby births by birth weight to estimate the proportion of birth defects; or we may wish to stratify by family size when looking at day care costs.

In these cases, we do simple random sampling, and post-stratify after the sample is taken. We assume that the stratum sizes Nh are still known, say from other sources.

The estimates of the population mean, total, etc., don’t change. However, the variances must be increased to account for the fact that the sample size in each stratum is no longer fixed. This introduces an additional source of variation for the estimate, i.e. estimates will vary from sample to sample not only because a new sample is drawn each time, but also because the sample size within a stratum will change, leading to different precisions in each new sample.

This is covered in many standard books on sampling theory and is not covered further in this course.

4.8.6 Allocation and precision - revisited

A student wrote:

I’m a little confused about sample allocation in stratified sampling. Earlier in the course, you stated that precision is independent of population size, i.e. a sample of 1000 gave estimates that were equally precise for Canada and the US (assuming a simple random sample). Yet in stratified sampling, you also said that precision is improved by proportional allocation where larger strata get larger sample sizes.

Both statements are correct. If you are interested in estimates for individualpopulations, then absolute sample size is important.

If you wanted equally precise estimates for BOTH Canada and the US then you would have equal sample sizes from both populations, say 1000 from both populations even though their overall population sizes differ by a factor of 10:1.

However, in stratified sampling designs, you may also be interested in the OVERALL estimate, over both populations. In this case, a proportional allocation where sample size is allocated in proportion to population size often performs better. In this case, the overall sample of 2000 people would be allocated proportional to the population sizes as follows:

Stratum   Population    Fraction of total population   Sample size
US        300,000,000    91%                           91% x 2000 = 1818
Canada     30,000,000     9%                            9% x 2000 = 182
Total     330,000,000   100%                           2000

Why does this happen? Well, if you are interested in the overall population, then the US results essentially drive everything and Canada has little effect on the overall estimate. Consequently, it doesn’t matter that the Canadian estimate is not as precise as the US estimate.

4.9 Ratio estimation in SRS - improving precision with auxiliary information

An association between the measured variable of interest and a second variable of interest can be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related to soil nitrogen content. A simple random sample of plots is selected and the height of trees in the sample plot is measured along with the soil nitrogen content in the plot. A regression model is fit (Thompson, 1992, Chapters 7 and 8) between the two variables to account for some of the variation in tree height as a function of soil nitrogen content. This can be used to make precise predictions of the mean height in stands if the soil nitrogen content can be easily measured. This method will be successful if there is a direct relationship between the two variables, and, the stronger the relationship, the better it will perform. This technique is often called ratio-estimation or regression-estimation.

Notice that multi-phase designs often use an auxiliary variable but this second variable is only measured on a subset of the sample units and should not be confused with ratio estimators in this section.

Ratio estimation has two purposes. First, in some cases, you are interested in the ratio of two variables, e.g. what is the ratio of wolves to moose in a region of the province?

Second, a strong relationship between two variables can be used to improve precision without increasing sampling effort. This is an alternative to stratification when you can measure two variables on each sampling unit.

We define the population ratio as $R = \frac{\tau_Y}{\tau_X} = \frac{\mu_Y}{\mu_X}$. Here Y is the variable of interest; X is a secondary variable not really of interest. Note that notation differs among books - some books reverse the role of X and Y.

Why is the ratio defined in this way? There are two common ratio estimators, traditionally called the mean-of-ratios and the ratio-of-means estimators. Suppose you had the following data for Y and X which represent the counts of animals of species 1 and 2 taken on 3 different days:

       Sample
       1     2     3
Y     10   100    20
X      3    20     1

The mean-of-ratios estimator would compute the estimated ratio between Y and X as:

$$R_{\text{mean-of-ratios}} = \frac{\frac{10}{3} + \frac{100}{20} + \frac{20}{1}}{3} = 9.44$$

while the ratio-of-means would be computed as:

$$R_{\text{ratio-of-means}} = \frac{(10 + 100 + 20)/3}{(3 + 20 + 1)/3} = \frac{10 + 100 + 20}{3 + 20 + 1} = 5.42$$
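The two estimators are easy to compare side by side. The following short Python sketch (variable names are illustrative) applies both to the three-day example data.

```python
y = [10, 100, 20]   # counts of species 1 on the 3 days
x = [3, 20, 1]      # counts of species 2 on the same days

# Mean-of-ratios: average the per-day ratios, so each day gets equal weight
mean_of_ratios = sum(yi / xi for yi, xi in zip(y, x)) / len(y)

# Ratio-of-means: ratio of the totals (equivalently, of the means),
# so each animal gets equal weight
ratio_of_means = sum(y) / sum(x)

print(f"mean-of-ratios = {mean_of_ratios:.2f}")
print(f"ratio-of-means = {ratio_of_means:.2f}")
```

The large gap between the two values comes almost entirely from day 3, where a single animal of species 2 produces a per-day ratio of 20.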

Which is "better"?

The mean-of-ratios estimator should be used when you wish to give equal weight to each pair of numbers regardless of the magnitude of the numbers. For example, you may have three plots of land, and you measure Y and X on each plot, but because of observer efficiencies that differ among plots, the raw numbers cannot be compared. For example, on a cloudy, rainy day it is hard to see animals (first case), but on a clear, sunny day, it is easy to see animals (second case). The actual numbers themselves cannot be combined directly.

The ratio-of-means estimator (considered in this chapter) gives every value of Y and X equal weight. Here the fact that unit 2 has 10 times the number of animals as unit 1 is important as we are interested in the ratio over the entire population of animals. Hence, by adding the values of Y and X first, each animal is given equal weight.

When is a ratio estimator better - what other information is needed? The higher the correlation between $X_i$ and $Y_i$, the better the ratio estimator is compared to a simple expansion estimator. It turns out that the ratio estimator is the ‘best’ linear estimator if

• the relation between $Y_i$ and $X_i$ is linear through the origin

• the variation around the regression line is proportional to the X value, i.e. the spread around the regression line increases as X increases unlike an ordinary regression line where the spread is assumed to be constant in all parts of the line.

In practice, plot $y_i$ vs. $x_i$ from the sample and see what type of relation exists.

When can a ratio estimator be used? A ratio estimator will require that another variable (the X variable) be measured on the selected sampling units. Furthermore, if you are estimating the overall mean or total, the total value of the X-variable over the entire population must also be known. For example, as seen in the examples to come, the total area must be known to estimate the total animals once the density (animals/ha) is known.

4.9.1 Summary of Main results

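Quantity: Ratio
Population value: $R = \frac{\tau_Y}{\tau_X} = \frac{\mu_Y}{\mu_X}$
Sample estimate: $r = \frac{\bar{y}}{\bar{x}} = \frac{\sum y_i}{\sum x_i}$
se: $\sqrt{\frac{1}{\mu_X^2}\frac{s_{diff}^2}{n}(1-f)}$

Quantity: Total
Population value: $\tau_Y = R\tau_X$
Sample estimate: $\hat{\tau}_{ratio} = r\tau_X$
se: $\tau_X \times \sqrt{\frac{1}{\mu_X^2}\frac{s_{diff}^2}{n}(1-f)}$

Quantity: Mean
Population value: $\mu_Y = R\mu_X$
Sample estimate: $\hat{\mu}_{Y,ratio} = r\mu_X$
se: $\mu_X \times \sqrt{\frac{1}{\mu_X^2}\frac{s_{diff}^2}{n}(1-f)}$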

Notes

Don’t be alarmed by the apparent complexity of the formulae above. They are relatively simple to implement in spreadsheets.

• The term $s_{diff}^2 = \frac{\sum_{i=1}^{n}(y_i - r x_i)^2}{n-1}$ is computed by creating a new column $y_i - r x_i$ and finding the (sample standard deviation)$^2$ of this new derived variable. This will be illustrated in the examples.

• In some cases the $\mu_X^2$ in the denominator may or may not be known, and either it or its estimate $\bar{x}^2$ can be used. There doesn’t seem to be any empirical evidence that either is better.

• The term $\tau_X^2/\mu_X^2$ reduces to $N^2$.

• Confidence intervals. Confidence limits are found in the usual fashion. In general, the distribution of $r$ is positively skewed and so the upper bound is usually too small. This skewness is caused by the variation in the denominator of the ratio. For example, suppose that a random variable (Z) has a uniform distribution between 0.5 and 1.5 centered on 1. The inverse of the random variable (i.e. 1/Z) now ranges between 0.667 and 2 - no longer symmetrical around 1. So if a symmetric confidence interval is created, the width will tend not to match the true distribution.

This skewness is not generally a problem if the sample size is at least 30 and the relative standard errors of $\bar{y}$ and $\bar{x}$ are both less than 10%.

• Sample size determination: We once again invert the formulae for the se of the ratio, but again this is likely easiest done on a spreadsheet using trial and error or the solver feature.

4.9.2 Example - wolf/moose ratio

[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs interchanges the use of x and y in the ratio.]

Wildlife ecologists interested in measuring the impact of wolf predation on moose populations in BC obtained estimates by aerial counting of the population size of wolves and moose on 11 sub-areas (all roughly equal size) selected as SRSWOR from a total of 200 sub-areas in the game management zone.

In this example, the actual ratio of wolves to moose is of interest.

Here are the raw data:

Sub-areas   Wolves   Moose
 1               8     190
 2              15     370
 3               9     460
 4              27     725
 5              14     265
 6               3      87
 7              12     410
 8              19     675
 9               7     290
10              10     370
11              16     510

What is the population and parameter of interest?

As in previous situations, there is some ambiguity:

• The population of interest is the 200 sub-areas in the game-management zone. The sampling units are the 11 sub-areas. The response variables are the wolf and moose populations in the game management sub-area. We are interested in the wolf/moose ratio.

• The populations of interest are the moose and wolves. If individual measurements were taken of each animal, then this definition would be fine. However, only the total number of wolves and moose within each sub-area are counted - hence a more proper description of this design would be a cluster design. As you will see in a later section, the analysis of a cluster design starts by summing to the cluster level and then treating the clusters as the population and sampling unit as is done in this case.

Having said this, do the number of moose and wolves measured on each sub-area include young moose and young wolves or just adults? How will immigration and emigration be taken care of?

What was the frame? Was it complete?

The frame consists of the 200 sub-areas of the game management zone. Presumably these 200 sub-areas cover the entire zone, but what about emigration and immigration? Moose and wolves may move into and out of the zone.

What was the sampling design?

It appears to be an SRSWOR design - the sampling units are the sub-areas of the zone.

How did they determine the counts in the sub-areas? Perhaps they simply looked for tracks in the snow in winter - it seems difficult to get estimates from the air in summer when there is lots of vegetation blocking the view.


CHAPTER 4. SAMPLING

Excel analysis

A copy of the workbook to perform the analysis of this data is called wolf.xlsand is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a summary shot of the spreadsheet:

Assessing conditions for a ratio estimator




The ratio estimator works well if the relationship between Y and X is linear, through the origin, with increasing variance with X. Begin by plotting Y (wolves) vs X (moose).

The data appears to satisfy the conditions for a ratio estimator.

Compute summary statistics for both Y and X

Refer to the screen shot of the spreadsheet. The Excel built-in functions are used to compute the sample size, sample mean, and sample standard deviation for each variable.

Compute the ratio

The ratio is computed using the formula for a ratio estimator in a simple random sample, i.e.

$$r = \frac{\bar{y}}{\bar{x}}$$

Compute the difference variable

Then for each observation, the difference between the observed Y (the actual number of wolves) and the predicted Y based on the number of moose ($\hat{Y}_i = rX_i$) is found. Notice that the sum of the differences must equal zero.

The standard deviation of the differences will be needed to compute the standard error for the estimated ratio.
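A short check of why the differences must sum to zero, using only the fact that r is the ratio of the sample means (equivalently, of the sample totals):

```latex
\sum_{i=1}^{n} (y_i - r x_i)
  = \sum_{i=1}^{n} y_i - r \sum_{i=1}^{n} x_i
  = n\bar{y} - \frac{\bar{y}}{\bar{x}} \, n\bar{x}
  = 0
```

A non-zero sum in the spreadsheet therefore signals an arithmetic or data-entry error.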

Estimate the standard error of the estimated ratio

Use the formula given at the start of the section.

Final estimate

Our final result is that the estimated ratio is 0.03217 wolf/moose with an estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval would be computed in the usual fashion.
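The spreadsheet steps can also be sketched in a few lines of Python. This is only an illustration, not part of the course materials; the function name `ratio_estimate` is made up here, and the se formula is the one used for ratio estimators in this chapter (standard deviation of the differences, sample size, finite population correction, and the mean of X).

```python
import math

# Data for the 11 sampled sub-areas (from the table above)
wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
N = 200  # number of sub-areas in the game management zone


def ratio_estimate(y, x, N):
    """Return (r, se) for the ratio of means under SRSWOR (illustrative helper)."""
    n = len(y)
    r = sum(y) / sum(x)                      # ratio of totals = ratio of means
    x_bar = sum(x) / n
    diffs = [yi - r * xi for yi, xi in zip(y, x)]
    # The differences sum to zero, so their variance is sum(d^2)/(n-1)
    s2_diff = sum(d * d for d in diffs) / (n - 1)
    f = n / N                                # sampling fraction
    se = math.sqrt(s2_diff / n * (1 - f)) / x_bar
    return r, se


r, se = ratio_estimate(wolves, moose, N)
print(f"ratio = {r:.5f} wolves/moose, se = {se:.5f}")
```

The printed values should agree with the 0.03217 and 0.00244 reported above.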

Planning for future surveys

Our final estimate has an approximate rse of 0.00244/.03217 = 7.6% which is pretty good. You could try different n values to see what sample size would be needed to get an rse of better than 5% or perhaps this is too precise and you only want an rse of about 10%.

The key quantities for the standard error are the total sample size (which you can modify) and the standard deviation of the differences - which is estimated from the previous survey.

As before, create a new spreadsheet where you can modify the total sample size and see the effect upon precision. This will be left as an exercise for the reader.
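For readers who prefer a script to a spreadsheet, the same what-if calculation might look like the sketch below. It assumes that the standard deviation of the differences, the mean number of moose per sub-area, and the ratio itself will be about the same in a future survey as in this one; the helper name `rse_for_n` is hypothetical.

```python
import math

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
N = 200  # sub-areas in the zone

# Planning values taken from the current survey (assumed to carry over)
n0 = len(wolves)
r = sum(wolves) / sum(moose)
x_bar = sum(moose) / n0
s2_diff = sum((y - r * x) ** 2 for y, x in zip(wolves, moose)) / (n0 - 1)


def rse_for_n(n):
    """Relative standard error of the ratio for a future sample of n sub-areas."""
    se = math.sqrt(s2_diff / n * (1 - n / N)) / x_bar
    return se / r


for n in (6, 7, 11, 15, 20, 23, 24, 30):
    print(f"n = {n:3d}   rse = {100 * rse_for_n(n):5.1f}%")
```

Under these assumptions the table of printed values shows where the rse crosses 10% and 5%.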

SAS Analysis

The above computations can also be done in SAS with the program wolf.sas available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. It uses Proc SurveyMeans which gives the output contained in wolf.lst.

Here is the SAS program:

   /* Example of a ratio estimator in simple random sampling */

   /* Wildlife ecologists interested in measuring the impact of wolf
      predation on moose populations in BC obtained estimates by aerial
      counting of the population size of wolves and moose on 11
      subareas (all roughly equal size) selected as SRSWOR from a total of
      200 subareas in the game management zone.

      In this example, the actual ratio of wolves to moose is of interest. */

   title 'Wolf-moose ratio - ratio estimator in SRS design';
   options nodate nonumber noovp nocenter linesize=75;

   data wolf;
      input subregion wolf moose;
      datalines;
    1  8 190
    2 15 370
    3  9 460
    4 27 725
    5 14 265
    6  3  87
    7 12 410
    8 19 675
    9  7 290
   10 10 370
   11 16 510
   ;;;

   proc print data=wolf;
      title2 'raw data';
      sum wolf moose;

   proc plot data=wolf;
      title2 'plot to assess assumptions';
      plot wolf*moose;

   proc surveymeans data=wolf ratio clm N=200;
      title2 'Estimate of wolf to moose ratio';
      /* ratio clm - request a ratio estimator with confidence intervals */
      /* N=200 specifies total number of units in the population */
      var moose wolf;
      ratio wolf/moose;   /* this statement asks for the ratio estimator */

The SAS program again starts with the DATA step to read in the data. Because the sampling weights are equal for all observations, it is not necessary to include them when estimating a ratio (the weights cancel out in the formula used by SAS).

The PLOT procedure creates the plot similar to that in the Excel spreadsheet.

The RATIO statement in the SURVEYMEANS procedure requests the computation of the ratio estimator.

Here is the output:

   Wolf-moose ratio - ratio estimator in SRS design
   raw data

   Obs    subregion    wolf    moose

     1         1          8      190
     2         2         15      370
     3         3          9      460
     4         4         27      725
     5         5         14      265
     6         6          3       87
     7         7         12      410
     8         8         19      675
     9         9          7      290
    10        10         10      370
    11        11         16      510
                       ====    =====
                        140     4352

   Plot of wolf*moose. Legend: A = 1 obs, B = 2 obs, etc.

   [Character plot of wolf (vertical axis, 3 to 27) against moose (horizontal axis, 0 to 800); each of the 11 sub-areas appears as a single point marked A.]


   Data Summary

   Number of Observations    11

   Statistics

                              Std Error      Lower 95%      Upper 95%
   Variable          Mean       of Mean    CL for Mean    CL for Mean
   ------------------------------------------------------------------------
   moose       395.636364     56.458162     269.839740     521.432988
   wolf         12.727273      1.926872       8.433935      17.020611
   ------------------------------------------------------------------------

   Ratio Analysis

   Numerator  Denominator       Ratio     Std Err    95% Confidence Interval
   ---------------------------------------------------------------------------
   wolf       moose          0.032169    0.002438      0.026737     0.037601
   ---------------------------------------------------------------------------

The results are identical to those from the spreadsheet.

Again, it is easier to do planning in the Excel spreadsheet rather than in the SAS program.

JMP Analysis

The JMP data table is available here in the file wolf.jmp from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are two problems in using standard statistical packages for regression and ratio estimation of survey data:

• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.

• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem.

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.





Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/the number of moose) has been created. Be sure all variables are continuously scaled:

The Analyze->Fit Y-by-X platform will be used. The Y variable is the number of wolves; the X variable is the number of moose. Be sure to specify that the inverse of the X variable (1/X) is the weighting variable:

The graph looks like it is linear through the origin which is one of the assumptions of the ratio estimator. Then use Fit Special:

and select the no intercept option to force the regression line through the origin:

This gives the following output:


We see that the estimated ratio (.032 wolves/moose) matches the Excel output, but the estimated standard error (.0026) does not quite match Excel. The difference is a bit larger than can be accounted for by not using the finite population correction factor.

As a matter of interest, if you use the Analyze->Fit Y-by-X platform WITHOUT using the inverse of the X variable as the weighting variable, you obtain an estimated ratio of .0317 (se .0022). All of these estimates are similar and it likely makes very little difference which is used.
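The connection between the weighted regression and the ratio estimator can be checked numerically. The sketch below assumes that the weighted no-intercept fit is an ordinary weighted least-squares fit with the usual regression standard error; `fit_through_origin` is a made-up helper, not a JMP function. With weights 1/X the slope reduces algebraically to (sum of Y)/(sum of X), which is exactly r.

```python
import math

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]


def fit_through_origin(y, x, w):
    """Weighted least squares for y = b*x with no intercept.

    Returns the slope and its usual regression standard error
    (no finite population correction is applied).
    """
    n = len(y)
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    mse = sum(wi * (yi - b * xi) ** 2 for wi, xi, yi in zip(w, x, y)) / (n - 1)
    return b, math.sqrt(mse / sxx)


# Weights 1/X: slope equals the ratio estimator
b_w, se_w = fit_through_origin(wolves, moose, [1 / xi for xi in moose])
# Equal weights: ordinary regression through the origin
b_u, se_u = fit_through_origin(wolves, moose, [1.0] * len(moose))

print(f"weighted:   slope = {b_w:.4f}, se = {se_w:.4f}")
print(f"unweighted: slope = {b_u:.4f}, se = {se_u:.4f}")
```

The weighted slope agrees with the ratio estimate, while its standard error differs from the survey-sampling se because the regression uses a different error structure and no finite population correction.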

Using JMP the hard way


It is possible to reproduce the ratio estimator and its standard error exactly. As in previous sections, the use of JMP is a bit clumsy.

Use the Tables->Summary to get the total wolves and total moose as in previous examples. The estimate is easily obtained as $r = \bar{y}/\bar{x} = y/x$ where y and x refer to the total wolves and moose respectively. Create a new column to compute the ratio giving:

Hence our estimate of the wolf/moose ratio is 0.03217 wolves/moose. In order to compute a se we create a new column in the ORIGINAL data where we compute $y_i - r x_i = wolves_i - r \times moose_i$.

Find the standard deviation of this new column again using the Tables->Summary command. Compute the se as:

This gives the final result, which matches the previous results.

Post mortem

No population numbers can be estimated using the ratio estimator in this case because of a lack of suitable data.

In particular, if you had wanted to estimate the total wolf population, you would have to use the simple inflation estimator that we discussed earlier unless you had some way of obtaining the total number of moose that are present in the ENTIRE management zone. This seems unlikely.

However, refer to the next example, where the appropriate information is available.

4.9.3 Example - Grouse numbers - using a ratio estimator to estimate a population total

In some cases, a ratio estimator is used to estimate a population total. In these cases, the improvement in precision is caused by the close relationship between two variables.

Note that the population total of the auxiliary variable will have to be known in order to use this method.

Grouse Numbers

A wildlife biologist has estimated the grouse population in a region containing isolated areas (called pockets) of bush as follows: She selected 12 pockets of bush at random, and attempted to count the numbers of grouse in each of these. (One can assume that the grouse are almost all found in the bush, and for the purpose of this question, that the counts were perfectly accurate.) The total number of pockets of bush in the region is 248, comprising a total area of 3015 hectares. Results are as follows:

Area (ha)   Number of Grouse
    8.9            24
    2.7             3
    6.6            10
   20.6            36
    3.7             8
    4.1             8
   25.8            60
    1.8             5
   20.1            35
   14.0            34
   10.1            18
    8.0            22

What is the population of interest and parameter to be estimated?

As before, there is some ambiguity:

• The population of interest is the set of pockets of brush in the region. The sampling unit is the pocket of brush. The number of grouse in each pocket is the response variable.

• The population of interest is the grouse. These happen to be clustered into pockets of brush. This leads back to the previous case.

What is the frame?

Here the frame is explicit - the set of all pockets of bush. It isn’t clear if all grouse will be found in these pockets - will some be itinerant and hence missed? What about movement between looking at the pockets of bush?

Summary statistics

Variable    n     mean    std dev
area       12    10.53      7.91
grouse     12    21.92     16.95

Simple inflation estimator ignoring the pocket areas

Using our earlier results for the simple inflation estimator, our estimate of the total number of grouse is $\hat{\tau} = N\bar{y} = 248 \times 21.92 = 5435.33$ with an estimated se of

$$se = N \times \sqrt{\frac{s^2}{n}(1-f)} = 248 \times \sqrt{\frac{16.95^2}{12}\left(1-\frac{12}{248}\right)} = 1183.4.$$

The estimate isn’t very precise with an rse of 1183.4/5435.3 = 22%.
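As a cross-check on the arithmetic, the inflation estimator can be computed directly from the 12 counts. This is only a sketch; the variable names are chosen here for illustration. Small differences in the last digit relative to the hand calculation come from rounding the mean and standard deviation before multiplying.

```python
import math

# Grouse counts in the 12 sampled pockets (from the table above)
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N = 248  # pockets of bush in the region

n = len(grouse)
y_bar = sum(grouse) / n
s2 = sum((y - y_bar) ** 2 for y in grouse) / (n - 1)   # sample variance
f = n / N                                              # sampling fraction

tau_hat = N * y_bar                                    # inflation estimate of the total
se_tau = N * math.sqrt(s2 / n * (1 - f))               # its estimated se
rse = se_tau / tau_hat

print(f"total = {tau_hat:.1f}, se = {se_tau:.1f}, rse = {100 * rse:.0f}%")
```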


Ratio estimator - why?

Why did the inflation estimator do so poorly? Part of the reason is the relatively large standard deviation in the number of grouse in the pockets. Why does this number vary so much?

It seems reasonable that larger pockets of brush will tend to have more grouse. Perhaps we can do better by using the relationship between the area of the bush and the number of grouse through a ratio estimator.

Excel analysis

An Excel worksheet is available in grouse.xls from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Preliminary plot to assess if ratio estimator will work

First plot numbers of grouse vs. area and see if this has a chance of succeeding.

The graph shows a linear relationship, through the origin. There is some evidence that the variance is increasing with X (area of the plot).

Find the ratio between grouse numbers and area

The spreadsheet is set up similarly to the previous example:

The total of the X variable (area) will need to be known.


As before, you find summary statistics for X and Y, compute the ratio estimate, find the difference variables, find the standard deviation of the difference variable, and find the se of the estimated ratio.

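The estimated ratio is: $r = \bar{y}/\bar{x} = 21.92/10.53 = 2.081$ grouse/ha.

The se of r is found as

$$se(r) = \sqrt{\frac{1}{\bar{x}^2} \times \frac{s^2_{diff}}{n} \times (1-f)} = \sqrt{\frac{1}{10.533^2} \times \frac{4.7464^2}{12} \times \left(1-\frac{12}{248}\right)} = 0.1269$$

grouse/ha.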

Expand ratio by total of X

In order to estimate the population total of Y, you now multiply the estimated ratio by the population total of X. We know the pockets cover 3015 ha, and so the estimated total number of grouse is found by $\hat{\tau}_Y = \tau_X \times r = 3015 \times 2.081 = 6273.3$ grouse.

To estimate the se of the total, multiply the se of r by 3015 as well: $se(\hat{\tau}_Y) = \tau_X \times se(r) = 3015 \times 0.1269 = 382.6$ grouse.

The precision is much improved compared to the simple inflation estimator. This improvement is due to the very strong relationship between the number of grouse and the area of the pockets.
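The whole chain (ratio, differences, se of the ratio, expansion by the known total area) can be sketched as below. The names are illustrative only; the formulas are the ones written out above.

```python
import math

# Sampled pockets: area (ha) and number of grouse (from the table above)
area = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N = 248          # pockets in the region
tau_x = 3015.0   # known total area of all pockets (ha)

n = len(area)
r = sum(grouse) / sum(area)                       # grouse per hectare
x_bar = sum(area) / n
diffs = [y - r * x for y, x in zip(grouse, area)]
s2_diff = sum(d * d for d in diffs) / (n - 1)     # differences have mean zero
se_r = math.sqrt(s2_diff / n * (1 - n / N)) / x_bar

tau_y = tau_x * r                                 # estimated total grouse
se_tau_y = tau_x * se_r

print(f"r = {r:.3f} (se {se_r:.4f}) grouse/ha")
print(f"total = {tau_y:.1f} (se {se_tau_y:.1f}) grouse")
```

Note that the expansion uses the known total area (3015 ha), not the number of pockets, which is why this estimator needs the population total of the auxiliary variable.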

Sample size for future surveys

If you wish to investigate different sample sizes, the simplest way would be to modify the cell corresponding to the count of the differences. This will be left as an exercise for the reader.

The final ratio estimate has an rse of about 6% - quite good. It is relatively straightforward to investigate the sample size needed for a 5% rse. We find this to be about 17 pockets.
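A rough version of this search in Python is sketched below. It assumes the standard deviation of the differences (4.7464) and the mean count per pocket (21.92) from this survey also apply to the future survey. Because the rse of the ratio is se(r)/r, the mean of X cancels and only these two values are needed. The function name `rse_ratio` is hypothetical.

```python
import math

N = 248           # pockets in the region
y_bar = 21.92     # mean grouse per pocket in the current survey
s_diff = 4.7464   # std dev of the differences y_i - r*x_i


def rse_ratio(n):
    """Relative se of the ratio (and of the estimated total) for n pockets."""
    return math.sqrt(s_diff ** 2 / n * (1 - n / N)) / y_bar


# Smallest n whose rse is at or below the 5% target
n_needed = next(n for n in range(2, N + 1) if rse_ratio(n) <= 0.05)

print(f"rse at n = 12: {100 * rse_ratio(12):.1f}%")
print(f"rse at n = 17: {100 * rse_ratio(17):.1f}%")
print(f"smallest n with rse <= 5%: {n_needed}")
```

Under these assumptions 17 pockets give an rse just above 5% and 18 just below, in line with the "about 17" quoted above.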

SAS analysis

The analysis is done in SAS using the program grouse.sas from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a program listing:




   /* Example of using SAS to compute a ratio estimator in survey sampling */

   /* A wildlife biologist has estimated the grouse population
      in a region containing isolated areas (called pockets) of
      bush as follows: She selected 12 pockets of bush at random, and
      attempted to count the numbers of grouse in each of these.
      (One can assume that the grouse are almost all found in the bush, and for the
      purpose of this question, that the counts were perfectly accurate.)
      The total number of pockets of bush in the region is 248,
      comprising a total area of 3015 hectares. */

   title 'Number of grouse - Ratio estimator';
   options nodate nonumber noovp nocenter linesize=75;

   data grouse;
      input area grouse;   /* sampling weights not needed */
      datalines;
    8.9 24
    2.7  3
    6.6 10
   20.6 36
    3.7  8
    4.1  8
   25.8 60
    1.8  5
   20.1 35
   14.0 34
   10.1 18
    8.0 22
   ;;;;

   proc print data=grouse;
      title2 'raw data';

   proc plot data=grouse;
      plot grouse * area;

   proc surveymeans data=grouse ratio clm N=248;
      /* the ratio clm keywords request a ratio estimator and a confidence interval. */
      title2 'Estimation using a ratio estimator';
      var grouse area;
      ratio grouse / area;
      ods output ratio=outratio;   /* extract information so that total can be estimated */

   data outratio;   /* compute estimates of the total */
      set outratio;
      Est_total = ratio * 3015;
      Se_total  = stderr * 3015;
      UCL_total = uppercl * 3015;
      LCL_total = lowercl * 3015;
      format est_total se_total ucl_total lcl_total 7.1;
      format ratio stderr lowercl uppercl 7.3;

   proc print data=outratio split='_';
      title2 'the computed estimates';
      var ratio stderr lowercl uppercl Est_total Se_total LCL_total UCL_total;

The DATA step reads in the data. It is not necessary to include a computation of the sampling weight if the data are collected in a simple random sample for a ratio estimator - the weights will cancel out in the formulae used by SAS.

The SURVEYMEANS procedure can estimate the ratio of grouse/ha but cannot directly estimate the population total. The ODS statement redirects the results from the RATIO statement to a new dataset that is processed further to multiply by the total area of the pockets.

The output is as follows:

   Number of grouse - Ratio estimator
   raw data

   Obs    area    grouse

     1     8.9      24
     2     2.7       3
     3     6.6      10
     4    20.6      36
     5     3.7       8
     6     4.1       8
     7    25.8      60
     8     1.8       5
     9    20.1      35
    10    14.0      34
    11    10.1      18
    12     8.0      22

   Number of grouse - Ratio estimator
   raw data

   Plot of grouse*area. Legend: A = 1 obs, B = 2 obs, etc.

   [Character plot of grouse (vertical axis, 0 to 60) against area (horizontal axis, 0 to 30); each of the 12 pockets appears as a single point marked A.]

   Number of grouse - Ratio estimator
   Estimation using a ratio estimator


   Data Summary

   Number of Observations    12

   Statistics

                              Std Error      Lower 95%      Upper 95%
   Variable          Mean       of Mean    CL for Mean    CL for Mean
   ------------------------------------------------------------------------
   grouse       21.916667      4.772130      11.413279      32.420054
   area         10.533333      2.227746       5.630097      15.436570
   ------------------------------------------------------------------------

   Ratio Analysis

   Numerator  Denominator       Ratio     Std Err    95% Confidence Interval
   ---------------------------------------------------------------------------
   grouse     area           2.080696    0.126893      1.801406     2.359986
   ---------------------------------------------------------------------------

   Number of grouse - Ratio estimator
   the computed estimates

                                                  Est       Se      LCL      UCL
   Obs    Ratio   StdErr   LowerCL   UpperCL    total    total    total    total

     1    2.081    0.127     1.801     2.360   6273.3    382.6   5431.2   7115.4

The results are exactly the same as before.

Again, it is easiest to do the sample size computations in Excel.


JMP Analysis

The JMP data table is available here in the file grouse.jmp from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data file contains both variables and the derived variable 1/area:

Estimating total grouse using inflation estimator

As in Excel, let us first use the simple inflation estimate by finding the average number of grouse/pocket and then expanding by the number of pockets. The Analyze->Distribution platform is used to estimate the mean grouse/pocket:

The estimated mean number of grouse per pocket is 21.9 (se 4.9). The estimated total number is found by multiplying the mean number per pocket by the total number of pockets (N = 248) to give an estimated total of 5435 (se 1213) grouse. The standard error is larger than that computed by Excel because of the lack of the finite population correction.

Estimating total grouse using ratio estimator

We must first estimate the ratio (grouse/hectare), and then expand this to estimate the overall number of grouse.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are two problems in using standard statistical packages for regression and ratio estimation of survey data:

• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.

• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem.

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/area of pocket) has been created. Be sure all variables are continuously scaled:

The Analyze->Fit Y-by-X platform will be used. The Y variable is the number of grouse; the X variable is the area of the pocket. Be sure to specify that the inverse of the X variable (1/X) is the weighting variable:

The graph looks like it is linear through the origin which is one of the assumptions of the ratio estimator. Then use Fit Special: and select the no intercept option to force the regression line through the origin:

The final estimates are found:


The estimated density is 2.081 (se .123) grouse/hectare. The point estimate is bang on, and the estimated se is within about 3% of the correct se.

This now needs to be multiplied by the total area of the pockets (3015 ha) which gives an estimated total number of grouse of 6274 (se 371) grouse. [Here the estimated se is slightly smaller than the correct se even though no finite population correction is applied; the difference comes from the different error structure assumed by the weighted regression.]

The ratio estimator is much more precise than the inflation estimator because of the strong relationship between the number of grouse and the area of the pocket.



Post mortem - a question to ponder

What if it were to turn out that grouse population size tended to be proportional to the perimeter of a pocket of bush rather than its area? Would using the above ratio estimator based on a relationship with area introduce serious bias into the ratio estimate, increase the standard error of the ratio estimate, or do both?

4.10 Additional ways to improve precision

This section will not be examined on the exams or term tests

4.10.1 Using both stratification and auxiliary variables

It is possible to use both methods to improve precision. However, this comes at a cost of increased computational complexity.

There are two ways of combining ratio estimators in stratified simple random sampling.

1. combined ratio estimate: Estimate the numerator and denominator using stratified random sampling and then form the ratio of these two estimates:

$$r_{stratified,combined} = \frac{\hat{\mu}_{Y,stratified}}{\hat{\mu}_{X,stratified}}$$

and

$$\hat{\tau}_{Y,stratified,combined} = \frac{\hat{\mu}_{Y,stratified}}{\hat{\mu}_{X,stratified}} \tau_X$$

We won’t consider the estimates of the se in this course, but it can be found in any textbook on sampling.

2. separate ratio estimator - make a ratio total for each stratum, and form a grand ratio by taking a weighted average of these estimates. Note that we weight by the covariate total rather than the stratum sizes. We get the following estimators for the grand ratio and grand total:

$$r_{stratified,separate} = \frac{1}{\tau_X} \sum_{h=1}^{H} \tau_{Xh} r_h$$

and

$$\hat{\tau}_{Y,stratified,separate} = \sum_{h=1}^{H} \tau_{Xh} r_h$$

Again, we won’t worry about the estimates of the se.
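To make the two point estimators concrete, here is a small sketch on a purely hypothetical two-stratum survey. Every number in `strata` is invented for illustration, and the function names are not from any package. Only the point estimates are computed, since the se formulas are not covered here.

```python
# Hypothetical data: stratum size N, known covariate total tau_x,
# and the sampled y and x values. All values are made up.
strata = [
    {"N": 100, "tau_x": 1200.0, "y": [20.0, 26.0, 30.0], "x": [10.0, 12.0, 16.0]},
    {"N": 50, "tau_x": 2000.0, "y": [120.0, 135.0], "x": [38.0, 42.0]},
]


def mean(values):
    return sum(values) / len(values)


def combined_ratio_total(strata):
    """Ratio of the stratified means of Y and X, expanded by the overall X total."""
    N = sum(s["N"] for s in strata)
    tau_x = sum(s["tau_x"] for s in strata)
    mu_y = sum(s["N"] / N * mean(s["y"]) for s in strata)
    mu_x = sum(s["N"] / N * mean(s["x"]) for s in strata)
    r = mu_y / mu_x
    return r, r * tau_x


def separate_ratio_total(strata):
    """One ratio per stratum, each expanded by its own stratum X total."""
    tau_x = sum(s["tau_x"] for s in strata)
    tau_y = sum(s["tau_x"] * mean(s["y"]) / mean(s["x"]) for s in strata)
    return tau_y / tau_x, tau_y


r_c, total_c = combined_ratio_total(strata)
r_s, total_s = separate_ratio_total(strata)
print(f"combined: r = {r_c:.4f}, total = {total_c:.1f}")
print(f"separate: r = {r_s:.4f}, total = {total_s:.1f}")
```

In this made-up example the stratum ratios differ (2.0 and about 3.19), so the two estimators give different totals; they would coincide if the ratio were the same in every stratum.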

Why use one over the other?

• You need the stratum totals of the covariate for the separate estimate, but only the population total for the combined estimate.

• The combined ratio is less subject to risk of bias (see Cochran, p. 165 and following). In general, the biases in the separate estimator are added together, and if they fall in the same direction, then there is trouble. In the combined estimator these biases are reduced through stratification for numerator and denominator.

• When the ratio estimate is appropriate (regression through the origin and variance proportional to covariate), the last term vanishes. Consequently, the combined ratio estimator will have greater variance than the separate ratio estimator unless R is relatively constant from stratum to stratum. However, see above, the bias may be more severe for the separate ratio estimator. You must consider the combined effects of bias and precision, i.e. MSE.

4.10.2 Regression Estimators

A ratio estimator works well when the relationship between Yi and Xi is linear, through the origin, with the variance of observations about the ratio line increasing with X. In some cases, the relationship may be linear, but not through the origin.

In these cases, the ratio estimator is generalized to a regression estimator where the linear relationship is no longer constrained to go through the origin.

We won’t be covering this in this course.

Regression estimators are also useful if there is more than one X variable.

Whenever you use a regression estimator, be sure to plot y vs x to assess if the assumptions for a ratio estimator are reasonable.

CAUTION: If ordinary statistical packages are used to do regression analysis on survey data, you could obtain misleading results because the usual packages ignore the way in which the data were collected. Virtually all standard regression packages assume you’ve collected data under a simple random sample. If your sampling design is more complex, e.g. stratified design, cluster design, multi-stage design, etc. then you should use a package specifically designed for the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.

4.10.3 Sampling with unequal probability - pps sampling

All of the designs discussed in previous sections have assumed that each sample unit was selected with equal probability. In some cases, it is advantageous to select units with unequal probabilities, particularly if they differ in their contribution to the overall total. This technique can be used with any of the sampling designs discussed earlier. An unequal probability sampling design can lead to smaller standard errors (i.e. better precision) for the same total effort compared to an equal probability design. For example, forest stands may be selected with probability proportional to the area of the stand (i.e. a stand of 200 ha will be selected with twice the probability of a stand of 100 ha in size) because large stands contribute more to the overall population and it would be wasteful of sampling effort to spend much effort on smaller stands.
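The selection step alone can be sketched as follows. The stand names and areas are invented, and the draw is made WITH replacement purely to keep the sketch simple; selection without replacement and the matching estimators are more involved and are not shown.

```python
import random

# Hypothetical forest stands and their areas in hectares (made-up values)
stands = {"A": 200.0, "B": 100.0, "C": 50.0, "D": 150.0}

total_area = sum(stands.values())
# Probability that a given stand is chosen on any single draw
draw_prob = {name: area / total_area for name, area in stands.items()}

rng = random.Random(2006)   # fixed seed so the sketch is repeatable
# Two draws with replacement, probability proportional to area
sample = rng.choices(list(stands), weights=list(stands.values()), k=2)

for name, p in draw_prob.items():
    print(f"stand {name}: area {stands[name]:5.0f} ha, draw probability {p:.2f}")
print("selected stands:", sample)
```

The 200 ha stand has exactly twice the draw probability of the 100 ha stand, as described above.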

The variable used to assign the probabilities of selection to individual study units does not need to have an exact relationship with an individual unit's contribution to the total. For example, in probability proportional to prediction (3P sampling), all trees in a small area are visited. A simple, cheap characteristic is measured which is used to predict the value of the tree. A sub-sample of the trees is then selected with probability proportional to the predicted value, remeasured using a more expensive measuring device, and the relationship between the cheap and expensive measurement in the second phase is used with the simple measurement from the first phase to obtain a more precise estimate for the entire area. This is an example of two-phase sampling with unequal probability of selection.

Please consult with a sampling expert before implementing or analyzing an unequal probability sampling design.

4.11 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.

Some examples of cluster samples are:

• urchin estimation - transects are taken perpendicular to the shore and a diver swims along the transect and counts the number of urchins in each m^2 along the line.

• aerial surveys - a plane flies along a line and observers count the number of animals they see in a strip on both sides of the aircraft.

• forestry surveys - often circular plots are located on the ground and ALL trees within that plot are measured.

Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e. treat the cluster as a whole as the sampling unit, and deal only with the cluster total as the response measure.

4.11.1 Sampling plan

In simple random sampling, a frame of all elements was required in order to draw a random sample. Individual units are selected one at a time. In many cases, this is impractical because it may not be possible to list all of the individual units or may be logistically impossible to do this. In many cases, the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - almost always you measure things on an individual quadrat level, but the actual sampling unit is the cluster.

This problem is analogous to pseudo-replication in experimental design - the breaking of the transect into individual quadrats is like having multiple fish within the tank.

A visual comparison of a simple random sample vs a cluster sample

You may find it useful to compare a simple random sample of 24 vs a cluster sample of 24 using the following visual plans:

Select a sample of 24 in each case.



Simple Random Sampling

Describe how the sample was taken.



Cluster Sampling

First, the clusters must be defined. In this case, the units are naturally clustered in blocks of size 8. The following units were selected.

Describe how the sample was taken. Note the differences between stratified simple random sampling and cluster sampling!

4.11.2 Advantages and disadvantages of cluster sampling compared to SRS

• Advantage: It may not be feasible to construct a frame for every elemental unit, but possible to construct a frame for larger units, e.g. it is difficult to locate individual quadrats upon the sea floor, but easy to lay out transects from the shore.

• Advantage: Cluster sampling is often more economical. Because all units within a cluster are close together, travel costs are much reduced.

• Disadvantage: Cluster sampling has a higher standard error than an SRSWOR of the same total size because units are typically homogeneous within clusters. The cluster itself serves as the sampling unit. For the same number of units, cluster sampling almost always gives worse precision. This is the problem that we have seen earlier of pseudo-replication.

• Disadvantage: A cluster sample is more difficult to analyze, but with modern computing equipment, this is less of a concern. The difficulties are not arithmetic but rather being forced to treat the clusters as the survey unit - there is a natural tendency to think that data are being thrown away.

4.11.3 Notation

The key thing to remember is to work with the cluster TOTALS.

Traditionally, the cluster size is denoted by M rather than by X, but as you will see in a few moments, estimation in cluster sampling is nothing more than ratio estimation performed on the cluster totals.

Attribute             Population value   Sample value
Number of clusters    N                  n
Cluster totals        τi                 yi             NOTE: τi and yi are the cluster i TOTALS
Cluster sizes         Mi                 mi
Total area            M


The key concept in cluster sampling is to treat the cluster TOTAL as the re-sponse variable and ignore all the individual values within the cluster. Because


CHAPTER 4. SAMPLING

the clusters are a simple random sample from the population of clusters, simplyapply all the results you had before for a SRS to the CLUSTER TOTALS.

If the clusters are roughly equal in size, a simple inflation estimator can beused; In many cases, there is strong relationship between the size of the clusterand cluster total – in these cases a ratio estimator would likely be suitable wherethe X variable is the cluster size. If there is no relationship between cluster sizeand the cluster total, a simple inflation estimator can be used as well even in thecase of unequal cluster sizes.. You should do a preliminary plot of the clustertotals against the cluster sizes to see if this relationship holds.

You will also have to know the size of each cluster - this is simply the number of sub-units within each cluster.

The perils of ignoring a cluster design This design is used frequently, but often analyzed incorrectly. The key thing to note is that the sampling unit is a cluster, not the individual quadrats. In general, whenever the quadrats have been gathered using a transect of some sort, you have a cluster sampling problem.

The biggest danger of ignoring the clustering aspects and treating the individual quadrats as if they came from an SRS is that, typically, your estimated se will be too small. That is, the true standard error from your design may be substantially larger than your estimated standard error. The precision is thought to be far better than is justified based on the survey results. This has been seen before - refer to the paper by Underwood where the dangers of estimation with positively correlated data were discussed.

Extensions of cluster analysis - unequal size sampling In some cases, the clusters are of quite unequal sizes. A better design choice may be to select clusters with an unequal probability design rather than using a simple random sample. In this case, clusters that are larger typically contribute more to the population total, and would be selected with a higher probability.

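Computational formulae

Parameter: Overall mean
    Population value:  $\mu = \dfrac{\sum_{i=1}^{N} \tau_i}{\sum_{i=1}^{N} M_i}$
    Estimator:         $\hat{\mu} = \dfrac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$
    Estimated se:      $\sqrt{\dfrac{1}{\bar{m}^2}\,\dfrac{s^2_{diff}}{n}\,(1-f)}$

Parameter: Overall total
    Population value:  $\tau = M \times \mu$
    Estimator:         $\hat{\tau} = M \times \hat{\mu}$
    Estimated se:      $\sqrt{M^2 \times \dfrac{1}{\bar{m}^2}\,\dfrac{s^2_{diff}}{n}\,(1-f)}$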

• You never use the mean per unit within a cluster.


• The term $s^2_{diff} = \dfrac{\sum_{i=1}^{n} (y_i - \hat{\mu} m_i)^2}{n-1}$ is again found in the same fashion as in ratio estimation - create a new variable which is the difference $y_i - \hat{\mu} m_i$, find its sample standard deviation, and then square the standard deviation.

• Sometimes the ratio of two variables measured within each cluster is required, e.g. you conduct aerial surveys to estimate the ratio of wolves to moose - this has already been done in an earlier example! In these cases, the actual cluster length is not used.
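The computational formulae above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cluster_ratio_estimate`, its argument names, and the toy numbers are hypothetical and do not come from the notes or from any survey package.

```python
import math
import statistics


def cluster_ratio_estimate(totals, sizes, n_clusters_pop=None, total_size_pop=None):
    """Ratio estimate of the mean per unit from an SRS of clusters.

    totals         -- cluster totals y_i for the sampled clusters
    sizes          -- cluster sizes m_i for the sampled clusters
    n_clusters_pop -- N, number of clusters in the population (None: ignore fpc)
    total_size_pop -- M, total size of the population (None: no total estimated)
    """
    n = len(totals)
    mu_hat = sum(totals) / sum(sizes)          # ratio of the two sample totals
    m_bar = sum(sizes) / n                     # mean size of sampled clusters
    # diff column; it sums to zero, so its sample variance is s^2_diff
    diffs = [y - mu_hat * m for y, m in zip(totals, sizes)]
    s2_diff = statistics.variance(diffs)
    f = n / n_clusters_pop if n_clusters_pop else 0.0
    se_mu = math.sqrt(s2_diff / n * (1 - f) / m_bar ** 2)
    result = {"mean": mu_hat, "se_mean": se_mu}
    if total_size_pop is not None:
        result["total"] = total_size_pop * mu_hat
        result["se_total"] = total_size_pop * se_mu
    return result


# Hypothetical toy data: 4 clusters sampled out of N = 40, population size M = 500
toy = cluster_ratio_estimate([12, 30, 7, 21], [10, 20, 5, 15],
                             n_clusters_pop=40, total_size_pop=500)
print(toy)
```

Note that the individual values within a cluster never enter the computation; only the cluster totals and the cluster sizes are used.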

Confidence intervals

As before, once you have an estimator for the mean and for the se, use the usual ±2se rule. If the number of clusters is small, then some text books advise using a t-distribution for the multiplier - this is not covered in this course.

Sample size determination

Again, this is no real problem - except that you will get a value for the number of CLUSTERS, not the individual quadrats within the clusters.

4.11.5 Example - estimating the density of urchins

Red sea urchins are considered a delicacy and the fishery is worth several millions of dollars to British Columbia.

In order to set harvest quotas and in order to monitor the stock, it is important that the density of sea urchins be determined each year.

To do this, the managers lay out a number of transects perpendicular to the shore in the urchin beds. Divers then swim along the transect, and roll a 1 m² quadrat along the transect line and count the number of legal sized and sub-legal sized urchins in the quadrat.

The number of possible transects is so large that the correction for finite population sampling can be ignored.

The raw data is available in an ASCII file, urchin.dat, from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data isn't all listed here as it has over 1000 lines!

What is the population of interest and the parameter?




The population of interest is the sea urchins in the harvest area. These happened to be (artificially) "clustered" into transects which are sampled. All sea urchins within the cluster are measured.

The parameter of interest is the density of legal sized urchins.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather, the managers pick random points along the shore and then lay the transects out from those points.


The sampling design is a cluster sample - the clusters are the transect lines while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?).

As the points along the shore were chosen using a simple random sample, the analysis proceeds as a SRS design on the cluster totals.

Excel Analysis

An Excel worksheet with the data and analysis is called urchin.xls and is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A reduced view appears below:





Summarize to cluster level

The key, first step in any analysis of a cluster survey is to first summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the length of the transect). The Pivot Table feature of Excel is quite useful for doing this automatically. Unfortunately, you still have to play around with the final table in order to get the data displayed in a nice format.

In many transect studies, there is a tendency to NOT record quadrats with 0 counts as they don't affect the cluster sum. However, you still have to know the correct size of the cluster (i.e. how many quadrats), so you can't simply ignore these 'missing' values. In this case, you could examine the maximum of the quadrat number and the number of listed quadrats to see if these agree (why?).
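The summarize-and-check step can be sketched in Python as follows. The function `check_transects`, the record layout `(transect, quadrat, count)`, and the example rows are assumptions made for illustration; they are not taken from urchin.dat.

```python
from collections import defaultdict


def check_transects(records):
    """Summarize (transect, quadrat, count) rows to the cluster level.

    For each transect, report the cluster total, the number of quadrats
    listed, the largest quadrat number, and whether the two agree.
    """
    quadrats = defaultdict(set)
    totals = defaultdict(int)
    for transect, quadrat, count in records:
        quadrats[transect].add(quadrat)
        totals[transect] += count
    summary = {}
    for t in sorted(quadrats):
        n_listed = len(quadrats[t])
        max_q = max(quadrats[t])
        summary[t] = {
            "total": totals[t],
            "n_listed": n_listed,
            "size": max_q,                  # cluster size if quadrats are numbered 1..max
            "complete": n_listed == max_q,  # False suggests unrecorded zero counts
        }
    return summary


# Hypothetical rows: transect 2 has no record for quadrat 2 (presumably a zero count)
example = check_transects([(1, 1, 2), (1, 2, 0), (1, 3, 5), (2, 1, 4), (2, 3, 1)])
for transect, row in example.items():
    print(transect, row)
```

This check cannot detect empty quadrats dropped from the far end of a transect, because the largest recorded quadrat number would then understate the cluster size.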

Preliminary plot

Plot the cluster totals vs. the cluster size to see if a ratio estimator is appropriate, i.e. linear relationship through the origin with variance increasing with cluster size.

The plot (not shown) shows a weak relationship between the two variables.

Summary Statistics

Compute the summary statistics on the cluster TOTALS. You will need the totals over all sampled clusters of both variables.

sum(legal)   sum(quad)   n(transect)
1507         1120        28

Compute the ratio

The estimated density is then $\widehat{density} = \dfrac{sum(legal)}{sum(quad)} = 1507/1120 = 1.345536$ urchins/m².

Compute the difference column

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation.

Compute the se of the ratio estimate

The estimated se is then found as: $se(\widehat{density}) = \sqrt{\dfrac{s^2_{diff}}{n_{transects}} \times \dfrac{1}{\overline{quad}^2}} = \sqrt{\dfrac{48.09933^2}{28} \times \dfrac{1}{40^2}} = 0.2272$ urchins/m².

(Optional) Expand final answer to a population total

In order to estimate the total number of urchins in the harvesting area, you simply multiply the estimated ratio and its standard error by the area to be harvested.



SAS Analysis

SAS v.8 has procedures for the analysis of survey data taken in a cluster design. A program to analyze the data is urchin.sas and is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a program listing:

/* Example of a cluster sample */

/* This dataset consists of the results from a series
   of transects conducted perpendicular to the shore.
   As divers swam along the transect, they counted
   the number of red sea urchins in 1 m**2 quadrats.

   The variables are (from left to right):
   transect, quadrat, legal size, sublegal sized.

   If a quadrat is not present in this listing, then the count was 0
   for both variables. It does NOT indicate that the quadrat
   was not measured - rather that no urchins were found.

   There was no transect numbered 5, 12, 17, 19, or 32. */

filename urchin 'urchin.dat'; /* name of datafile containing urchin data */

title 'Estimating urchin density - example of cluster analysis';

options nodate nonumber noovp nocenter linesize=75;

data urchin;
   infile urchin firstobs=2 missover; /* the first record has the variable names */
   input transect quadrat legal sublegal;
   /* no need to specify sampling weights because transects are an SRS */

/***** First check to see if any transects are missing quadrats *************/

proc sort data=urchin; by transect;

proc means data=urchin noprint;
   by transect;
   var quadrat legal;
   output out=check min=min max=max n=n sum(legal)=tlegal;

data check;
   set check;
   length problem $15.;
   problem = ' ';
   if max ^= n then problem = 'missing quadrat?';
   drop _type_ _freq_;

proc print data=check;
   title2 'check to see if any transect is missing quadrats';

proc plot data=check;
   title2 'plot the relationship between the cluster total and cluster size';
   plot tlegal *n$ transect; /* use the transect number as the plotting character */

/****************************************************************************/

/* Now for the cluster analysis */

proc surveymeans data=urchin; /* do not specify a pop size as fpc is negligible */
   cluster transect;
   var legal;
run;

Because we are computing a ratio estimator from a simple random sample of transects, it is not necessary to specify the sampling weights.

The key feature of the SAS program is the use of the CLUSTER statement to identify the clusters in the data.

The population number of transects was not specified as the finite population correction is negligible.

Here are the results:

Estimating urchin density - example of cluster analysis
check to see if any transect is missing quadrats

Obs   transect   min   max     n   tlegal   problem

  1       1       1    100   100     151
  2       2       1     30    30     150
  3       3       1    100   100       0
  4       4       1     73    73     158
  5       6       1     39    39      22
  6       7       1     21    21       5
  7       8       1     46    46      85
  8       9       1     24    24      46
  9      10       1     37    37      27
 10      11       1     24    24       9
 11      13       1     40    40      50
 12      14       1     37    37      50
 13      15       1     32    32      15
 14      16       1     21    21      58
 15      18       1     32    32      42
 16      20       1     42    42      12
 17      21       1     21    21      52
 18      22       1     13    13       0
 19      23       1     88    88     100
 20      24       1     15    15       1
 21      25       1     23    23      16
 22      26       1     17    17      49
 23      27       1     16    16      46
 24      28       1     18    18      40
 25      29       1     30    30      37
 26      30       1     39    39      40
 27      31       1     42    42     175
 28      33       1    100   100      71

Estimating urchin density - example of cluster analysis
plot the relationship between the cluster total and cluster size

[Plot of tlegal*n$transect: cluster total tlegal (vertical axis, 0 to 180) against cluster size n (horizontal axis, 0 to 100), each point labelled with its transect number.]


Data Summary

Number of Clusters          28
Number of Observations    1120

Statistics

                                Std Error     Lower 95%      Upper 95%
Variable      N        Mean      of Mean     CL for Mean    CL for Mean
------------------------------------------------------------------------
legal      1120    1.345536     0.227248       0.879261       1.811810
------------------------------------------------------------------------

The results are identical to above.
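As a cross-check, the same numbers can be reproduced from the 28 transect sizes and totals printed in the SAS check listing. The short Python sketch below is an illustration only (the variable names are mine); it ignores the finite population correction, as in the text, and uses the ±2se rule rather than the t-based limits that SAS reports.

```python
import math
import statistics

# (cluster size n, cluster total tlegal) for the 28 transects in the check listing
transects = [
    (100, 151), (30, 150), (100, 0), (73, 158), (39, 22), (21, 5), (46, 85),
    (24, 46), (37, 27), (24, 9), (40, 50), (37, 50), (32, 15), (21, 58),
    (32, 42), (42, 12), (21, 52), (13, 0), (88, 100), (15, 1), (23, 16),
    (17, 49), (16, 46), (18, 40), (30, 37), (39, 40), (42, 175), (100, 71),
]
sizes = [m for m, _ in transects]
totals = [y for _, y in transects]
n = len(transects)

density = sum(totals) / sum(sizes)                  # 1507 / 1120
mean_size = sum(sizes) / n                          # 40 quadrats per transect
diffs = [y - density * m for m, y in transects]     # the diff column
s_diff = statistics.stdev(diffs)
se_density = s_diff / math.sqrt(n) / mean_size      # fpc ignored
lower, upper = density - 2 * se_density, density + 2 * se_density

print(f"density = {density:.6f}, s_diff = {s_diff:.3f}, se = {se_density:.4f}")
print(f"approximate 95% interval (+/- 2 se): {lower:.3f} to {upper:.3f}")
```

The ±2se interval is slightly narrower than the SAS interval because SAS uses a t multiplier based on the number of clusters.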

JMP Analysis

The urchin data is available in a JMP file urchin.jmp from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The raw data contains variables for the transect, the quadrat within each transect, and the number of legal and sub-legal sized urchins.

Use the Tables->Summary to get the cluster totals by summing the number of legal sized urchins and counting the number of quadrats present:



This gives a table with one row for each transect:

(note that there was no transect 5, 12, 17, 19, or 32).

We compare the maximum(quadrat) number to the number of quadrat values actually recorded and see that they all match, indicating that no empty quadrats appear to have been left out of the data.

Now we are back to the case of a ratio estimator with the Y variable being the number of legal sized urchins measured on the transect, and the X variable being the size of the transect. As in the previous examples of a ratio estimator, we create a weighting variable equal to 1/X = 1/size of transect:

We use the Analyze->Fit Y-by-X platform and don't forget to specify 1/X as the weighting variable.

After the plot is created, we use the Fit Special from the red triangle near the plot and request a line with intercept of 0:



The final estimates are presented:


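The estimated density is 1.346 (se .216) urchins/m². The se is a bit smaller than the 0.227 found above (by about 5%) because the weighted regression estimates the variance about the line in a slightly different way than the survey formula; the finite population correction was ignored in both analyses.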

Planning for future experiments

The rse of the estimate is 0.2272/1.3455 = 17% - not terrific. The determination of sample size is done in the same manner as in the ratio estimator case dealt with in earlier sections except that the number of CLUSTERS is found. If we wanted to get a rse near to 5%, we would need almost 320 transects - this is likely too costly.
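The figure of almost 320 transects can be checked with a few lines of Python. The sketch assumes that the se shrinks in proportion to 1/sqrt(number of clusters) and that the finite population correction remains negligible; the helper name `clusters_needed` is hypothetical.

```python
import math


def clusters_needed(n_current, rse_current, rse_target):
    """Clusters needed to reach rse_target, assuming se shrinks like 1/sqrt(n)."""
    return math.ceil(n_current * (rse_current / rse_target) ** 2)


rse_now = 0.227248 / 1.345536            # se / estimate from the urchin survey
n_needed = clusters_needed(28, rse_now, 0.05)
print(f"current rse = {rse_now:.3f}, transects needed = {n_needed}")
```

The answer is a number of transects (clusters), not a number of quadrats.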


4.11.6 Example - estimating the total number of sea cucumbers

Sea cucumbers are considered a delicacy among some, and the fishery is of growing importance.

In order to set harvest quotas and in order to monitor the stock, it is important that the number of sea cucumbers in a certain harvest area be estimated each year.

The following is an example taken from Griffith Passage in BC in 1994.

To do this, the managers lay out a number of transects across the cucumber harvest area. Divers then swim along the transect, and while carrying a 4 m wide pole, count the number of cucumbers within the width of the pole during the swim.

The number of possible transects is so large that the correction for finite population sampling can be ignored.

Here is the information summarized up to the transect level (the preliminary raw data is unavailable):

Transect area (m²)   Sea cucumbers
       260               124
       220                67
       200                 6
       180                62
       120                35
       200                 3
       200                 1
       120                49
       140                28
       400                 1
       120                89
       120               116
       140                76
       800                10
      1460                50
      1000               122
       140                34
       180               109
        80                48


The total harvest area is 3,769,280 m² as estimated by a GIS system.

The transects were laid out from one edge of the bed and the length of the edge is 51,436 m. Note that because each transect was 4 m wide, the number of transects is 1/4 of this value.

What is the population of interest and the parameter?

The population of interest is the sea cucumbers in the harvest area. These happen to be (artificially) "clustered" into transects which are the sampling unit. All sea cucumbers within the transect (cluster) are measured.

The parameter of interest is the total number of cucumbers in the harvest area.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather they pick random points along the edge of the harvest area, and then lay out the transect from there.

The sampling design is a cluster sample - the clusters are the transect lines while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?).

Analysis - abbreviated

As the analysis is similar to the previous example, a detailed description of the Excel, SAS, or JMP versions will not be done.

The workbook cucumber.xls from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms illustrates the computations in Excel. There are three sheets. The workbook also computes the two estimators when two potential outliers are deleted and for a second harvest area.

The SAS program is available in cucumber.sas and the relevant output in cucumber.lst. Because only the summary data is available, you cannot use the CLUSTER statement of SurveyMeans. Rather, as noted earlier in the notes, you form a ratio estimator based on the cluster totals.

A JMP file called cucumber.jmp is also available.

Summarize to cluster level




The key, first step in any analysis of a cluster survey is to first summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the area of the transect). This has already been done in the above data.

Now this summary table is simply an SRSWOR from the set of all transects. We first estimate the density, and then multiply by the area to estimate the total.

Note that after summarizing up to the transect level, this example proceeds in an analogous fashion as the grouse in pockets of brush example that we looked at earlier.

Preliminary Plot

A plot of the cucumber total vs the transect size shows a very poor relationship between the two variables. It will be interesting to compare the results from the simple inflation estimator and the ratio estimator.

Simple Inflation Estimator

First, estimate the number ignoring the area of the transects by using a simple inflation estimator.

The summary statistics that we need are:

n         19      transects
Mean      54.21   cucumbers/transect
Std Dev   42.37   cucumbers/transect

We compute an estimate of the total as $\hat{\tau} = N\bar{y} = (51{,}436/4) \times 54.21 = 697{,}093$ sea cucumbers. [Why did we use 51,436/4 rather than 51,436?]

We compute an estimate of the se of the total as: $se(\hat{\tau}) = \sqrt{N^2 s^2/n \times (1-f)} = \sqrt{(51{,}436/4)^2 \times 42.37^2/19} = 124{,}981$ sea cucumbers.

The finite population correction factor is so small we simply ignore it.

This gives a relative standard error (se/est) of 18%.
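The inflation estimate can be reproduced directly from the 19 transect totals in the table above. This Python sketch is for illustration only; the variable names are mine, and the finite population correction is ignored as in the text.

```python
import math
import statistics

# Sea cucumber counts on the 19 sampled transects (from the summary table)
cucumbers = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
             89, 116, 76, 10, 50, 122, 34, 109, 48]

N = 51436 / 4                 # number of possible 4 m wide transects along the edge
n = len(cucumbers)
y_bar = statistics.mean(cucumbers)
s = statistics.stdev(cucumbers)

total_inflation = N * y_bar                  # tau-hat = N * y-bar
se_inflation = N * s / math.sqrt(n)          # fpc ignored
rse_inflation = se_inflation / total_inflation

print(f"mean = {y_bar:.2f}, std dev = {s:.2f}")
print(f"total = {total_inflation:,.0f}, se = {se_inflation:,.0f}, rse = {rse_inflation:.0%}")
```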

Ratio Estimator

We use the methods outlined earlier for ratio estimators from SRSWOR to get the following summary table:

        area     cucumbers
Mean    320.00   54.21       per transect

The estimated density of sea cucumbers is then $\widehat{density} = \dfrac{mean(cucumbers)}{mean(area)} = 54.21/320.00 = 0.169$ cucumbers/m².

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation as $s_{diff} = 73.63$. The estimated se of the ratio is then found as: $se(\widehat{density}) = \sqrt{\dfrac{s^2_{diff}}{n_{transects}} \times \dfrac{1}{\overline{area}^2}} = \sqrt{\dfrac{73.63^2}{19} \times \dfrac{1}{320^2}} = 0.053$ cucumbers/m².

We once again ignore the finite population correction factor.

In order to estimate the total number of cucumbers in the harvesting area, you simply multiply the above by the area to be harvested:

$\hat{\tau}_{ratio} = area \times \widehat{density} = 3{,}769{,}280 \times 0.169 = 638{,}546$ sea cucumbers.

The se is found as: $se(\hat{\tau}_{ratio}) = area \times se(\widehat{density}) = 3{,}769{,}280 \times 0.053 = 198{,}983$ sea cucumbers for an overall rse of 31%.
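The ratio-estimator arithmetic can be checked from the transect table in the same way. The Python sketch below is an illustration with variable names of my own choosing; it keeps full precision throughout, which is why the totals agree with the text even though 3,769,280 × 0.169 computed with the rounded density would not.

```python
import math
import statistics

# (transect area in m^2, sea cucumbers counted) for the 19 sampled transects
transects = [
    (260, 124), (220, 67), (200, 6), (180, 62), (120, 35), (200, 3), (200, 1),
    (120, 49), (140, 28), (400, 1), (120, 89), (120, 116), (140, 76), (800, 10),
    (1460, 50), (1000, 122), (140, 34), (180, 109), (80, 48),
]
harvest_area = 3_769_280      # m^2, from the GIS estimate quoted in the text
n = len(transects)

density = sum(y for _, y in transects) / sum(a for a, _ in transects)
mean_area = sum(a for a, _ in transects) / n
diffs = [y - density * a for a, y in transects]      # the diff column
s_diff = statistics.stdev(diffs)
se_density = s_diff / math.sqrt(n) / mean_area       # fpc ignored

total_ratio = harvest_area * density
se_ratio = harvest_area * se_density

print(f"density = {density:.3f} (se {se_density:.3f}) cucumbers/m^2, s_diff = {s_diff:.2f}")
print(f"total = {total_ratio:,.0f}, se = {se_ratio:,.0f}, rse = {se_ratio / total_ratio:.0%}")
```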

Comparing the two approaches

Why did the ratio estimator do worse in this case than the simple inflation estimator in Griffith Passage? The plot of the number of sea cucumbers vs the area of the transect shows virtually no relationship between the two - hence there is no advantage to using a ratio estimator.

In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator if the correlation between the two variables is greater than 1/2 of the ratio of their respective relative variation (std dev/mean). Advanced computations show that half of the ratio of their relative variations is 0.732, while the correlation between the two variables is 0.041. Hence the ratio estimator will not do well.
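A sketch of where this criterion comes from, using the usual large-sample approximation to the variance of a ratio estimator and ignoring the fpc. The symbols $S_x$, $S_y$ (standard deviations of transect area and cucumber count), $\rho$ (their correlation) and $R = \bar{y}/\bar{x} > 0$ are introduced here for illustration only.

```latex
\begin{align*}
Var(\hat{\tau}_{inflation}) &\approx \frac{N^2}{n}\, S_y^2 \\
Var(\hat{\tau}_{ratio}) &\approx \frac{N^2}{n}\left(S_y^2 - 2R\rho S_x S_y + R^2 S_x^2\right) \\
Var(\hat{\tau}_{ratio}) < Var(\hat{\tau}_{inflation})
  &\iff R^2 S_x^2 < 2R\rho S_x S_y
  \iff \rho > \frac{1}{2}\,\frac{S_x/\bar{x}}{S_y/\bar{y}}
\end{align*}
```

For the transect table above, the relative variations are roughly $S_x/\bar{x} \approx 365.9/320 = 1.14$ and $S_y/\bar{y} \approx 42.37/54.21 = 0.78$, so the threshold is about $\tfrac{1}{2}(1.14/0.78) \approx 0.73$, in line with the 0.732 quoted in the text and far above the observed correlation of 0.041.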

The Excel workbook also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than for an inflation estimator. It appears that in Griffith Passage there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate that even the best laid plans can go astray - always plot the data.


A third worksheet in the workbook analyses the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide factor.

4.12 Multi-stage sampling - a generalization of cluster sampling

Not part of Stat403/650. Please consult with a sampling expert before implementing or analyzing a multi-stage design.

4.12.1 Introduction

All of the designs considered above select a sampling unit from the population and then do a complete measurement upon that item. In the case of cluster sampling, this is facilitated by dividing the sampling unit into small observational units, but all of the observational units within the sampled cluster are measured.

If the units within a cluster are fairly homogeneous, then it seems wasteful to measure every unit. In the extreme case, if every observational unit within a cluster was identical, only a single observational unit from the cluster needs to be selected in order to estimate (without any error) the cluster total. Suppose then that the observational units within a cluster were not identical, but had some variation. Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than every quadrat on the transect?

This method is called two-stage sampling. In the first stage, larger sampling units are selected using some probability design. In the second stage, smaller units within the selected first-stage units are selected according to a probability design. The design used at each stage can be different, e.g. first stage units selected using a simple random sample, but second stage units selected using a systematic design as proposed for the urchin survey above.
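The second-stage selection proposed for the urchin survey (every second or third quadrat after a random start) could be sketched in Python as below. The function `systematic_subsample` and the transect length of 40 quadrats are hypothetical choices made for illustration.

```python
import random


def systematic_subsample(n_quadrats, step, rng=random):
    """Quadrat numbers chosen by a systematic sample with a random start.

    A start is drawn at random from 1..step, then every step-th quadrat
    along the transect is taken.
    """
    start = rng.randint(1, step)
    return list(range(start, n_quadrats + 1, step))


# Hypothetical transect of 40 quadrats, measuring every third quadrat
rng = random.Random(2006)     # seeded only so that the illustration is repeatable
selected = systematic_subsample(40, 3, rng)
print(selected)
```

A separate random start would be drawn for each selected transect, in keeping with the separate randomization within each first-stage unit.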

This sampling design can be generalized to multi-stage sampling.

Some examples of multi-stage designs are:

• Vegetation Resource Inventory. The forest land mass of BC has been mapped using aerial methods and divided into a series of polygons representing homogeneous stands of trees (e.g. a stand dominated by Douglas-fir). In order to estimate timber volumes in an inventory unit, a sample of polygons is selected using a probability-proportional-to-size design. In the selected polygons, ground measurement stations are selected on a 100 m grid and crews measure standing timber at these selected ground stations.

• Urchin survey Transects are selected using a simple random sample design. Every second or third quadrat is measured after a random starting point.

• Clam surveys Beaches are divided into 1 ha sections. A random sample of sections is selected and a series of 1 m² quadrats are measured within each section.

• Herring spawns biomass Schweigert et al. (1985, CJFAS, 42, 1806-1814) used a two-stage design to estimate herring spawn in the Strait of Georgia.

• Georgia Strait Creel Survey The Georgia Strait Creel Survey uses a multi-stage design to select landing sites within strata, times of days to interview at these selected sites, and which boats to interview in a survey of angling effort on the Georgia Strait.

Some consequences of simple two-stage designs are:

• If the selected first-stage units are completely enumerated then complete cluster sampling results.

• If every first-stage unit in the population is selected, then a stratified design results.

• A complete frame is required for all first-stage units. However, a frame of second-stage and lower-stage units need only be constructed for the selected upper-stage units.

• The design is very flexible allowing (in theory) different selection methods to be used at each stage, and even different selection methods within each first stage unit.

• A separate randomization is done within each first-stage unit when selecting the second-stage units.

• Multi-stage designs are less precise than a simple random sample of the same number of final sampling units, but more precise than a cluster sample of the same number of final sampling units. [Hint: think of what happens if the second-stage units are very similar.]

• Multi-stage designs are cheaper than a simple random sample of the same number of final sampling units, but more expensive than a cluster sample of the same number of final sampling units. [Hint: think of the travel costs in selecting more transects or measuring quadrats within a transect.]

• As in all sampling designs, stratification can be employed at any level and ratio and regression estimators are available. As expected, the theory becomes more and more complex, the more "variations" are added to the design.

The primary incentives for multi-stage designs are that

1. frames of the final sampling units are typically not available

2. it often turns out that most of the variability in the population occurs among first-stage units. Why spend time and effort in measuring lower stage units that are relatively homogeneous within the first-stage unit?

4.12.2 Notation

A sample of $n$ first-stage units (FSU) is selected from a total of $N$ first-stage units. Within the $i$th first-stage unit, $m_i$ second-stage units (SSU) are selected from the $M_i$ units available.

Item                 Population value        Sample value
First stage units    $N$                     $n$
Second stage units   $M_i$                   $m_i$
SSUs in population   $M = \sum M_i$
Value of SSU         $Y_{ij}$                $y_{ij}$
Total of FSU         $\tau_i$                $\hat{\tau}_i = \dfrac{M_i}{m_i} \sum_{j=1}^{m_i} y_{ij}$
Total in pop         $\tau = \sum \tau_i$
Mean in pop          $\mu = \tau/M$


We will only consider the case when simple random sampling occurs at both stages of the design.

The intuitive explanation for the results is that a total is estimated for each FSU selected (based on the SSU selected). These estimated totals are then used in a similar fashion to a cluster sample to estimate the grand total.

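Parameter: Total
    Population value:  $\tau = \sum \tau_i$
    Estimate:          $\hat{\tau} = \dfrac{N}{n} \sum_{i=1}^{n} \hat{\tau}_i$
    Estimated se:      $se(\hat{\tau}) = \sqrt{N^2 (1-f_1) \dfrac{s_1^2}{n} + \dfrac{N^2 f_1}{n^2} \sum_{i=1}^{n} M_i^2 (1-f_{2i}) \dfrac{s_{2i}^2}{m_i}}$

Parameter: Mean
    Population value:  $\mu = \dfrac{\tau}{M}$
    Estimate:          $\hat{\mu} = \dfrac{\hat{\tau}}{M}$
    Estimated se:      $se(\hat{\mu}) = \sqrt{\dfrac{se^2(\hat{\tau})}{M^2}}$

where

$s_1^2 = \dfrac{\sum_{i=1}^{n} \left(\hat{\tau}_i - \bar{\hat{\tau}}\right)^2}{n-1}$

$s_{2i}^2 = \dfrac{\sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2}{m_i - 1}$

$\bar{\hat{\tau}} = \dfrac{1}{n} \sum_{i=1}^{n} \hat{\tau}_i$

$f_1 = n/N$ and $f_{2i} = m_i/M_i$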

Notes:

• There are two contributions to the estimated se - variation among first stage totals ($s_1^2$) and variation among second stage units ($s_{2i}^2$).

• If the FSU vary considerably in size, a ratio estimator (not discussed in these notes) may be more appropriate.

Confidence Intervals The usual large sample confidence intervals can be used.



4.12.4 Example - estimating number of clams

The Klahoose First Nations (situated near Desolation Sound in the Strait of Georgia) wished to develop a wild oyster fishery. As a first stage in the development of the fishery, a survey was needed to establish the current stock in a number of oyster beds.

This example looks at the estimate of oyster numbers at Lloyd Point from a survey conducted in 1994.

The survey was conducted by running a line through the oyster bed - the total length was 105 m. Several random locations were chosen along the line in increments of 1 m. At each randomly chosen location, the width of the bed was measured and about 3 random locations along the perpendicular transect at that point were taken. A 1 m² quadrat was applied, and the number of oysters of various sizes was counted in the quadrat.

The raw data is available as a data file called wildoyster.dat from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.




           transect  width                                               total   net weight
Location     (m)      (m)   quadrat  seed  xsmall  small   med   large   count      (kg)
Lloyd          5       17       3     18     18     41     48     14      139      14.6
Lloyd          5       17       5      6      4     30      9      4       53       5.2
Lloyd          5       17      10     15     21     44     13     11      104       8.2
Lloyd          7       18       5      8     10     14      5      3       40       6.0
Lloyd          7       18      12     10     38     36     16      4      104      10.2
Lloyd          7       18      13      0     15     12      3      3       33       4.6
Lloyd         18       14       1     11      8      5      9     19       52       7.8
Lloyd         18       14       5     13     23     68     18     11      133      12.6
Lloyd         18       14       8      1     29     60      2      1       93      10.2
Lloyd         30       11       3     17      1     13     13      2       46       5.4
Lloyd         30       11       8     12     16     23     22     14       87       6.6
Lloyd         30       11      10     23     15     19     17      1       75       7.0
Lloyd         49        9       3     10     27     15      1      0       53       2.0
Lloyd         49        9       5     13      7     14     11      4       49       6.8
Lloyd         49        9       8     10     25     17     16     11       79       6.0
Lloyd         76       21       4      3      3     11      7      0       24       4.0
Lloyd         76       21       7     15      4     32     26     24      101      12.4
Lloyd         76       21      11      2     19     14     19      0       54       5.8
Lloyd         79       18       1     14     13      7      9      0       43       3.6
Lloyd         79       18       4      0     32     32     27     16      107      12.8
Lloyd         79       18      11     16     22     43     18      8      107      10.6
Lloyd         84       19       1     14     32     25     39      7      117      10.2
Lloyd         84       19       8     25     43     42     17      3      130       7.2
Lloyd         84       19      15      5     22     61     30     13      131      14.2
Lloyd         86       17       8      1     19     32     10      8       70       8.6
Lloyd         86       17      11      8     17     13     10      3       51       4.8
Lloyd         86       17      12      7     22     55     11      4       99       9.8
Lloyd         95       20       1     17     12     20     18      4       71       5.0
Lloyd         95       20       8     32      4     26     29     12      103      11.6
Lloyd         95       20      15      3     34     17     11      1       66       6.0

These multi-stage designs are complex to analyze. Rather than trying to implement the various formulae by hand, I would suggest that a proper sampling package (such as SAS) be used.

The first step after importing the data into JMP is to collect information to estimate the FSU (the transect) total and to compute some components of the variance from the second stage of sampling.

As in cluster sampling, use the Tables->Summary to create summary statistics. The Grouping variable is the transect. You will need to compute the average of the weights, the standard deviation of the weights, the number of quadrats measured, and the width of each transect. [The latter is found as the average of the width variable which was replicated for each individual quadrat.]

This will create a summary table as shown below:

Now you will need to add some columns to estimate the total for each FSU and the contribution of the second stage sampling to the overall variance. These columns will be created using the formula boxes as shown below.


First the formula for the FSU total, i.e. the estimated total weight for the entire transect. This is simply the average weight per quadrat times the width of the strip.

Second, the component of variance for the second stage. [Typically, if the first stage sampling fraction is small, this can be ignored.]

This gives the summary table shown below:


Now to summarize up to the population level. We again must summarize this table using the Tables->Summary menu item.

This gives us the final summary table

The variance component from the first stage is found as:



And the final overall se is found as:

This gives us the final solution:

Our final estimate is a total biomass of 14,070 kg with an estimated se of 1484 kg.
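The two-stage formulae can be checked against this answer with a short Python sketch using the net weights from the data table. Two assumptions are made here that are not stated explicitly in the notes: the number of possible transects is taken as N = 105 (one per metre of the 105 m line), and the number of 1 m² quadrats available on a transect, M_i, is taken to be its width in metres. Under these assumptions the sketch reproduces the estimate and se quoted above.

```python
import math
import statistics

N = 105   # assumed number of possible transects (1 m increments along a 105 m line)

# transect position (m): (width in m, net weights in kg of the sampled quadrats)
oysters = {
    5: (17, [14.6, 5.2, 8.2]),   7: (18, [6.0, 10.2, 4.6]),
    18: (14, [7.8, 12.6, 10.2]), 30: (11, [5.4, 6.6, 7.0]),
    49: (9, [2.0, 6.8, 6.0]),    76: (21, [4.0, 12.4, 5.8]),
    79: (18, [3.6, 12.8, 10.6]), 84: (19, [10.2, 7.2, 14.2]),
    86: (17, [8.6, 4.8, 9.8]),   95: (20, [5.0, 11.6, 6.0]),
}

n = len(oysters)
f1 = n / N
fsu_totals = []       # estimated total weight on each sampled transect
second_stage = 0.0    # sum of M_i^2 (1 - f_2i) s_2i^2 / m_i
for width, weights in oysters.values():
    m_i = len(weights)
    fsu_totals.append(width * statistics.mean(weights))
    second_stage += width ** 2 * (1 - m_i / width) * statistics.variance(weights) / m_i

total = N / n * sum(fsu_totals)
var_first = N ** 2 * (1 - f1) * statistics.variance(fsu_totals) / n
var_second = N ** 2 * f1 / n ** 2 * second_stage
se_total = math.sqrt(var_first + var_second)

print(f"estimated total = {total:,.0f} kg, se = {se_total:,.0f} kg")
print(f"share of variance from the second stage = {var_second / (var_first + var_second):.1%}")
```

The second-stage term is a small share of the total variance here, which is consistent with the remark that it can often be ignored when the first stage sampling fraction is small.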

A similar procedure can be used for the other variables. The nice feature about JMP is that once a series of operations has been done, you can save the script and apply it easily to other variables - refer to the JMP manual for more details.

Excel Spreadsheet

The above computations can also be done in Excel as shown in the attached workbook klahoose.xls from the Sample Program Library.

As in the case of a pure cluster sample, the PivotTable feature can be used to compute summary statistics needed to estimate the various components.



SAS Program

SAS can also be used to analyze the data as shown in the program klahoose.sas and output Klahoose.lst.

Note that the Proc SurveyMeans computes the se using only the first stage variance. As the first stage sampling fraction is usually quite small, this will tend to give only slight underestimates of the true standard error of the estimate.

4.12.5 Some closing comments on multi-stage designs

The above example barely scratches the surface of multi-stage designs. Multi-stage designs can be quite complex and the formulae for the estimates and estimated standard errors fearsome. If you have to analyze such a design, it is likely better to invest some time in learning one of the statistical packages designed for surveys (e.g. SAS v.8) rather than trying to program the tedious formulae by hand.

There are also several important design decisions for multi-stage designs.

• Two-stage designs have reduced costs of data collection because units within the FSU are easier to collect but also have a poorer precision compared to a simple random sample with the same number of final sampling units. However, because of the reduced cost, it often turns out that more units can be sampled under a multi-stage design, leading to an improved precision for the same cost as a simple random sample design. There is a tradeoff between sampling more first stage units and taking a small sub-sample in the secondary stage. An optimal allocation strategy can be constructed to decide upon the best strategy - consult some of the reference books on sampling for details.

• As with ALL sampling designs, stratification can be used to improve precision. The stratification usually takes place at the first sampling unit stage, but can take place at all stages. The details of estimation under stratification can be found in many sampling texts.

• Similarly, ratio or regression estimators can also be used if auxiliary information is available that is correlated with the response variable. This leads to very complex formulae!

One very nice feature of multi-stage designs is that if the first stage is sampled with replacement, then the formulae for the estimated standard errors simplify considerably to a single term regardless of the design used in the lower stages! If there are many first stage units in the population and if the sampling fraction is small, the chances of selecting the same first stage unit twice are very small. Even if this occurs, a different set of second stage units will likely be selected so there is little danger of having to measure the same final sampling unit more than once. In such situations, the design at second and lower stages is very flexible as all that you need to ensure is that an unbiased estimate of the first-stage unit total is available.
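
The single-term standard error can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the first stage units are drawn with replacement and with equal probability, an unbiased estimate of each sampled unit's total is already in hand, and the function name and data values are hypothetical.

```python
import math

def total_with_replacement(fsu_total_estimates, n_fsu_population):
    """Estimate a population total when first stage units (FSUs) are drawn
    with replacement and with equal probability (an assumption of this sketch).

    fsu_total_estimates: unbiased estimates of the total of each sampled FSU,
                         however the lower stages were sampled.
    n_fsu_population:    number of FSUs in the population.
    """
    n = len(fsu_total_estimates)
    if n < 2:
        raise ValueError("need at least two sampled first stage units")
    mean_total = sum(fsu_total_estimates) / n
    # Sample variance among the estimated FSU totals (divisor n - 1).
    s2 = sum((t - mean_total) ** 2 for t in fsu_total_estimates) / (n - 1)
    estimate = n_fsu_population * mean_total
    # Single term: no separate second-stage component and no
    # finite population correction.
    se = n_fsu_population * math.sqrt(s2 / n)
    return estimate, se

# Hypothetical estimated totals for 5 FSUs sampled from a population of 60.
est, se = total_with_replacement([120.0, 95.0, 140.0, 110.0, 135.0], 60)
print(est, round(se, 1))
```

The variation among the estimated first stage totals already reflects the sampling error made at the lower stages, which is why no further terms are needed in this setting.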

4.13 Some final comments on descriptive surveys

4.13.1 Unit size

A typical concern with any of the survey methods occurs when the population does not have natural discrete sampling units. For example, a large section of land may be arbitrarily divided into 1 m² plots, or 10 m² plots. A natural question to ask is what is the ‘best size’ of unit. This has no simple answer and depends upon several factors which must be addressed for each survey:

• Cost. All else being equal, sampling many small plots may be more expensive than sampling fewer larger plots. The primary difference in cost is the overhead in traveling and setup to measure the unit.

• Size of unit. An intuitive feeling is that more smaller plots are better than few large plots because the sample size is larger. This will be true if the characteristic of interest is ‘patchy’, but surprisingly, makes no difference if the characteristic is randomly scattered throughout the area (Krebs, 1989, p. 64). Indeed if the characteristic shows ‘avoidance’, then larger plots are better. For example, competition among trees implies they are spread out more than expected if they were randomly located. Logistical considerations often influence the plot size. For example, if trampling the soil affects the response, then sample plots must be small enough to measure without trampling the soil.

• Edge effects. Because the population does not have natural boundaries, decisions often have to be made about objects that lie on the edge of the sample plot. In general larger square or circular plots are better because of smaller edge-to-area ratio. [A large narrow rectangular plot can have more edge than a similar area square plot.]

• Size of object being measured. Clearly a 1 m² plot is not appropriate when counting mature Douglas-fir, but may be appropriate for a lichen survey.
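
The edge-to-area comparison in the list above can be checked with a short calculation. The sketch below is illustrative only; the function names and the 10:1 rectangle are hypothetical choices, and plots are assumed to be ideal squares, circles, or rectangles.

```python
import math

def edge_to_area_square(area):
    """Perimeter divided by area for a square plot."""
    side = math.sqrt(area)
    return 4 * side / area

def edge_to_area_circle(area):
    """Circumference divided by area for a circular plot."""
    radius = math.sqrt(area / math.pi)
    return 2 * math.pi * radius / area

def edge_to_area_rectangle(area, length_to_width):
    """Perimeter divided by area for a rectangle with the given
    length:width ratio (e.g. 10 means ten times longer than wide)."""
    width = math.sqrt(area / length_to_width)
    length = length_to_width * width
    return 2 * (length + width) / area

for area in (1.0, 10.0):  # plot areas in square metres
    print(
        area,
        round(edge_to_area_square(area), 2),
        round(edge_to_area_circle(area), 2),
        round(edge_to_area_rectangle(area, 10.0), 2),
    )
```

For a fixed shape the ratio falls as the plot grows, and for a fixed area the circle has the least edge, then the square, then the narrow rectangle.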


A pilot study should be carried out prior to a large scale survey to investigate factors that influence the choice of sampling unit size.

4.13.2 Key considerations when designing a survey

Key considerations when designing a survey are:

• What is the sampling unit? This should be carefully distinguished from the observational unit.

• Is a frame available for all of the sampling units? If so, then direct sampling can be used. If not, is there a frame of groups of units suitable for a two-stage sample?

• Are all the units the same size? If so, then a simple random sample (or variant thereof) is likely a suitable design. If the units vary considerably in size, then an unequal probability design may be more suitable.

• Are there precision requirements that must be met? How will you determine the sample size needed to meet them?

• Can stratification be employed either for administrative convenience or to improve precision?

When analyzing a survey, a key step is to recognize the design that was used to collect the data. Key pointers to help recognize various designs are:

• How were the units selected? A true simple random sample makes a list of all possible items and then chooses from that list.

• Is there more than one size of sampling unit? For example, were transects selected at random, and then quadrats within transects selected at random? This is usually a multi-stage design.

• Is there a cluster? For example, transects are selected, and these are divided into a series of quadrats - all of which are measured.

4.14 Analytical surveys - almost experimental design

In descriptive surveys, the objective was to simply obtain information about one large group. In observational studies, two deliberately chosen sub-populations are surveyed, but no attempt is made to generalize the results to the whole population. In analytical studies, sub-populations are selected and sampled in order to generalize the observed differences among the sub-populations to this and other similar populations.

As such, there are similarities between analytical and observational surveys and experimental design. The primary difference is that in experimental studies, the manager controls the assignment of the explanatory variables while measuring the response variables, while in analytical and observational surveys, neither set of variables is under the control of the manager. [Refer back to Examples B, C, and D in the earlier chapters.] The analysis of complex surveys for analytical purposes can be very difficult (Kish, 1987; Kish, 1984; Rao, 1973; Sedransk, 1965a, 1965b, 1966).

As in experimental studies, the first step in analytical surveys is to identify potential explanatory variables (similar to factors in experimental studies). At this point, analytical surveys can usually be further subdivided into three categories depending on the type of stratification:

• the population is pre-stratified by the explanatory variables and surveys are conducted in each stratum to measure the outcome variables;

• the population is surveyed in its entirety, and post-stratified by the explanatory variables;

• the explanatory variables can be used as auxiliary variables in ratio or regression methods.

[It is possible that all three types of stratification take place - these are very complex surveys.]

The choice between the categories is usually made by the ease with which the population can be pre-stratified and the strength of the relationship between the response and explanatory variables. For example, sample plots can be easily pre-stratified by elevation or by exposure to the sun, but it would be difficult to pre-stratify by soil pH.

Pre-stratification has the advantage that the manager has control over the number of sample points collected in each stratum, whereas in post-stratification, the numbers are not controllable, and may lead to very small sample sizes in certain strata just because they form only a small fraction of the population.

For example, a manager may wish to investigate the difference in regeneration (as measured by the density of new growth) as a function of elevation. Several cut blocks will be surveyed. In each cut block, the sample plots will be pre-stratified into three elevation classes, and a simple random sample will be taken in each elevation class. The allocation of effort in each stratum (i.e. the number of sample plots) will be equal. The density of new growth will be measured on each selected sample plot. On the other hand, suppose that the regeneration is a function of soil pH. This cannot be determined in advance, and so the manager must take a simple random sample over the entire stand, measure the density of new growth and the soil pH at each sampling unit, and then post-stratify the data based on measured pH. The number of sampling units in each pH class is not controllable; indeed it may turn out that certain pH classes have no observations.
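
The risk of an empty or nearly empty post-stratum can be roughed out before the survey. The sketch below is an approximation under stated assumptions: the sample is a simple random sample with a small sampling fraction (so draws are treated as independent), and the 5% class fraction and sample size of 30 are hypothetical.

```python
def expected_units(class_fraction, n_units):
    """Expected number of sampled units that fall in a class making up
    `class_fraction` of the population."""
    return class_fraction * n_units

def prob_empty_class(class_fraction, n_units):
    """Approximate chance that the class receives no sampled units at all,
    treating the draws as independent."""
    return (1.0 - class_fraction) ** n_units

# Hypothetical: a pH class covering 5% of the stand, 30 sample plots.
print(expected_units(0.05, 30))
print(round(prob_empty_class(0.05, 30), 2))
```

With these made-up numbers the class is expected to receive only 1.5 plots, and there is roughly a one in five chance that it receives none.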

If explanatory variables are treated as auxiliary variables, then there must be a strong relationship between the response and explanatory variables. Additionally, we must be able to measure the auxiliary variable precisely for each unit. Then, methods like multiple regression can also be used to investigate the relationship between the response and the explanatory variable. For example, rather than classifying elevation into three broad elevation classes or soil pH into broad pH classes, the actual elevation or soil pH must be measured precisely to serve as an auxiliary variable in a regression of regeneration density vs. elevation or soil pH.

If the units have been selected using a simple random sample, then the analysis of the analytical surveys proceeds along similar lines as the analysis of designed experiments (Kish, 1987; also refer to Chapter 2). In most analyses of analytical surveys, the observed results are postulated to have been taken from a hypothetical super-population of which the current conditions are just one realization. In the above example, cut blocks would be treated as a random blocking factor; elevation class as an explanatory factor; and sample plots as samples within each block and elevation class. Hypothesis testing about the effect of elevation on mean density of regeneration occurs as if this were a planned experiment.

Pitfall: Any one of the sampling methods described in Section 2 for descriptive surveys can be used for analytical surveys. Many managers incorrectly use the results from a complex survey as if the data were collected using a simple random sample. As Kish (1987) and others have shown, this can lead to substantial underestimates of the true standard error, i.e. the precision is thought to be far better than is justified based on the survey results. Consequently the manager may erroneously detect differences more often than expected (i.e. make a Type I error) and make decisions based on erroneous conclusions.
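
One common way to get a feel for the size of this underestimate is the design effect for cluster samples. The sketch below is an added illustration, not a result from these notes: it assumes equal-sized clusters and uses the familiar approximation deff = 1 + (m - 1) x rho, where m is the number of units per cluster and rho is the intraclass correlation. The function names and numbers are hypothetical.

```python
import math

def design_effect(cluster_size, intraclass_corr):
    """Approximate variance inflation relative to a simple random sample
    of the same number of units (equal-sized clusters assumed)."""
    return 1.0 + (cluster_size - 1) * intraclass_corr

def adjusted_se(naive_se, cluster_size, intraclass_corr):
    """Standard error after inflating the naive (simple random sample)
    standard error by the square root of the design effect."""
    return naive_se * math.sqrt(design_effect(cluster_size, intraclass_corr))

# Hypothetical: 10 units per cluster, intraclass correlation 0.2,
# and a naive standard error of 1.5 computed as if the data were an SRS.
print(round(design_effect(10, 0.2), 2))
print(round(adjusted_se(1.5, 10, 0.2), 2))
```

Even a modest correlation among units in the same cluster nearly triples the variance in this made-up case, so tests based on the naive standard error would reject far too often.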

Solution: As in experimental design, it is important to match the analysis of the data with the survey design used to collect it. The major difficulties in the analysis of analytical surveys are:


1. Recognizing and incorporating the sampling method used to collect the data in the analysis. The survey design used to obtain the sampling units must be taken into account in much the same way as the analysis of the collected data is influenced by actual experimental design. A table of ‘equivalences’ between terms in a sample survey and terms in experimental design is provided in Table 1.

Table 1. Equivalences between terms used in surveys and in experimental design.

Survey Term             Experimental Design Term
----------------------  -----------------------------------------------------
Simple Random Sample    Completely randomized design

Cluster Sampling        (a) Clusters are random effects; units within a
                            cluster treated as sub-samples; or
                        (b) Clusters are treated as main plots; units within
                            a cluster treated as sub-plots in a split-plot
                            analysis.

Multi-stage sampling    (a) Nested designs with units at each stage nested in
                            units in higher stages. Effects of units at each
                            stage are treated as random effects; or
                        (b) Split-plot designs with factors operating at
                            higher stages treated as main plot factors and
                            factors operating at lower stages treated as
                            sub-plot factors.

Stratification          Fixed factor or random block depending on the
                        reasons for stratification.

Sampling Unit           Experimental unit or treatment unit

Sub-sample              Sub-sample

There is no quick easy method for the analysis of complex surveys (Kish, 1987). The super-population approach seems to work well if the selection probabilities of each unit are known (these are used to weight each observation appropriately) and if random effects corresponding to the various strata or stages are employed. The major difficulty caused by complex survey designs is that the observations are not independent of each other.

2. Unbalanced designs (e.g. unequal numbers of sample points in each combination of explanatory factors). This typically occurs if post-stratification is used to classify units by the explanatory variables but can also occur in pre-stratification if the manager decides not to allocate equal effort in each stratum. The analysis of unbalanced data is described by Milliken and Johnson (1984).

3. Missing cells, i.e. certain combinations of explanatory variables may not occur in the survey. The analysis of such surveys is complex, but refer to Milliken and Johnson (1984).

4. If the range of the explanatory variable is naturally limited in the population, then extrapolation outside of the observed range is not recommended.

More sophisticated techniques can also be used in analytical surveys. For example, correspondence analysis, ordination methods, factor analysis, multidimensional scaling, and cluster analysis all search for post-hoc associations among measured variables that may give rise to hypotheses for further investigation. Unfortunately, most of these methods assume that units have been selected independently of each other using a simple random sample; extensions where units have been selected via a complex sampling design have not yet been developed. Simpler designs are often highly preferred to avoid erroneous conclusions based on inappropriate analysis of data from complex designs.

Pitfall: While the analysis of analytical surveys and designed experiments are similar, the strength of the conclusions is not. In general, causation cannot be inferred without manipulation. An observed relationship in an analytical survey may be the result of a common response to a third, unobserved variable. For example, consider the following two experiments. In the first experiment, the explanatory variable is elevation (high or low). Ten stands are randomly selected at each elevation. The amount of growth is measured and it appears that stands at higher elevations have less growth. In the second experiment, the explanatory variable is the amount of fertilizer applied. Ten stands are randomly assigned to each of two doses of fertilizer. The amount of growth is measured and it appears that stands that receive a higher dose of fertilizer have greater growth. In the first experiment, the manager is unable to say whether the differences in growth are a result of differences in elevation or amount of sun exposure or soil quality as all three may be highly related. In the second experiment, all uncontrolled factors are present in both groups and their effects will, on average, be equal. Consequently, the assignment of cause to the fertilizer dose is justified because it is the only factor that differs (on average) among the groups.

As noted by Eberhardt and Thomas (1991), there is a need for a rigorous application of the techniques for survey sampling when conducting analytical surveys. Otherwise they are likely to be subject to biases of one sort or another. Experience and judgment are very important in evaluating the prospects for bias, and attempting to find ways to control and account for these biases. The most common source of bias is the selection of survey units and the most common pitfall is to select units based on convenience rather than on a probabilistic sampling design. The potential problems that this can lead to are analogous to those that occur when it is assumed that callers to a radio phone-in show are representative of the entire population.


4.15 References

• Cochran, W.G. (1977). Sampling Techniques. New York: Wiley. One of the standard references for survey sampling. Very technical.

• Gillespie, G.E. and Kronlund, A.R. (1999). A manual for intertidal clam surveys. Canadian Technical Report of Fisheries and Aquatic Sciences 2270. A very nice summary of using sampling methods to estimate clam numbers.

• Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New York: American Chemical Society. A series of papers on sampling mainly for environmental contaminants in ground and surface water, soils, and air. A detailed discussion on sampling for pattern.

• Kish, L. (1965). Survey Sampling. New York: Wiley. An extensive discussion of descriptive surveys mostly from a social science perspective.

• Kish, L. (1984). On Analytical Statistics from complex samples. Survey Methodology, 10, 1-7. An overview of the problems in using complex surveys in analytical surveys.

• Kish, L. (1987). Statistical designs for research. New York: Wiley. One of the more extensive discussions of the use of complex surveys in analytical surveys. Very technical.

• Krebs, C. (1989). Ecological Methodology. A collection of methods commonly used in ecology including a section on sampling.

• Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999). Survey methodology for intertidal bivalves. Canadian Technical Report of Fisheries and Aquatic Sciences 2214. An overview of how to use surveys for assessing intertidal bivalves - more technical than Gillespie and Kronlund (1999).

• Myers, W.L. and Shelton, R.L. (1980). Survey methods for ecosystem management. New York: Wiley. Good primer on how to measure common ecological data using direct survey methods, aerial photography, etc. Includes a discussion of common survey designs for vegetation, hydrology, soils, geology, and human influences.

• Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal of the Royal Statistical Society, Series B, 27, 264-278.


• Thompson, S.K. (1992). Sampling. New York: Wiley. A good companion to Cochran (1977). Has many examples of using sampling for biological populations. Also has chapters on mark-recapture, line-transect methods, spatial methods, and adaptive sampling.

4.16 Frequently Asked Questions (FAQ)

4.16.1 Confusion about the definition of a population

What is the difference between the "population total" and the "population size"?

Population size normally refers to the number of “final sampling” units in the population. Population total refers to the total of some variable over these units.

For example, if you wish to estimate the total family income of families in Vancouver, the “final” sampling units are families, the population size is the number of families in Vancouver, the response variable is the income of each family, and the population total is the total family income over all families in Vancouver.

Things become a bit confusing when sampling units differ from “final” units that are clustered and you are interested in estimates of the number of “final” units. For example, in the grouse/pocket of brush example, the population consists of the grouse, which are clustered into 248 pockets of brush. The grouse is the final sampling unit, but the sampling unit is a pocket of brush. In cluster sampling, you must expand the estimator by the number of CLUSTERS, not by the number of final units. Hence the expansion factor is the number of pockets (248), the variable of interest for a cluster is the number of grouse in each pocket, and the population total is the number of grouse over all pockets.

The same holds for the oysters on the lease. The population is the oysters on the lease. But you don’t randomly sample individual oysters – you randomly sample quadrats which are clusters of oysters. The expansion factor is now the number of quadrats.
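
The role of the expansion factor can be shown with a small calculation. The sketch below is illustrative only: it assumes the pockets were chosen by simple random sampling without replacement, the grouse counts are invented, and the function name is hypothetical. Only the figure of 248 pockets comes from the example.

```python
import math

def cluster_total_estimate(cluster_totals, n_clusters_population):
    """Estimate a population total from a simple random sample of clusters.

    cluster_totals:        total of the variable in each sampled cluster
                           (e.g. number of grouse in each sampled pocket).
    n_clusters_population: number of clusters in the population; this is
                           the expansion factor, NOT the number of final units.
    """
    n = len(cluster_totals)
    mean_per_cluster = sum(cluster_totals) / n
    s2 = sum((y - mean_per_cluster) ** 2 for y in cluster_totals) / (n - 1)
    total = n_clusters_population * mean_per_cluster
    # Finite population correction for sampling without replacement.
    fpc = 1 - n / n_clusters_population
    se = n_clusters_population * math.sqrt(s2 / n * fpc)
    return total, se

# Invented counts of grouse in 8 sampled pockets out of 248.
grouse_counts = [3, 0, 5, 2, 4, 1, 6, 3]
total, se = cluster_total_estimate(grouse_counts, 248)
print(total, round(se, 1))
```

The number of grouse never appears as a population size here; it only enters as the response measured on each pocket.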

In the salmon example, the boats are surveyed. The fact that the number of salmon was measured is incidental - you could have measured the amount of food consumed, etc.

In the angling survey problem, the boats are the sampling units. The anglers they contain and the fish they caught are what is being measured, but the population of interest is the set of boats that were at the lake that day.

4.16.2 How is N defined?

How is N (the expansion factor) defined? What is the best way to find this value?

This can get confusing in the case of cluster or multi-phase designs as there are different N’s at each stage of the design. It might be easier to think of N as an expansion factor.

The expansion factor will be known once the frame is constructed. In some cases, this can only be done after the fact - for example, when surveying angling parties, the total number of parties returning in a day is unknown until the end of the day. For planning purposes, some reasonable guess may have to be made in order to estimate the sample size. If this is impossible, just choose some arbitrary large number - the estimated future sample size will be an overestimate (by a small amount) but close enough. Of course, once the survey is finished, you would then use the actual value of N in all computations.
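
The effect of guessing N at the planning stage can be seen in a short calculation. The sketch below assumes a simple random sample and the usual normal-approximation sample size formula with a finite population correction; the function name, the guessed standard deviation, and the target margin of error are hypothetical inputs.

```python
import math

def required_sample_size(sd_guess, margin, population_size, z=1.96):
    """Approximate sample size needed to estimate a mean to within
    `margin` (at the confidence level implied by z) under simple
    random sampling."""
    # Sample size ignoring the finite population correction.
    n0 = (z * sd_guess / margin) ** 2
    # Adjust downward for the finite population of size N.
    n = n0 / (1.0 + n0 / population_size)
    return math.ceil(n)

# Hypothetical planning values: guessed SD of 20, margin of error of 5.
print(required_sample_size(20, 5, 500))        # N guessed at 500
print(required_sample_size(20, 5, 1_000_000))  # arbitrary large N
```

With these inputs the answers are 55 and 62, so using an arbitrarily large N errs on the side of a slightly larger sample than is strictly needed.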

4.16.3 Multi-stage vs Multi-phase sampling

What is the difference between multi-stage sampling and multi-phase sampling?

In multi-stage sampling, the selection of the final sampling units takes place in stages. For example, suppose you are interested in sampling angling parties as they return from fishing. The region is first divided into different landing sites. A random selection of landing sites is selected. At each landing site, a random selection of angling parties is selected.

In multi-phase sampling, the units are NOT divided into larger groups. Rather a first phase selects some units and they are measured quickly. A second phase takes a sub-sample of the first phase and measures more intently. Returning to the angling survey, a multi-phase design would select angling parties. All of the selected parties could fill out a brief questionnaire. A week later, a sample of the questionnaires is selected, and the angling parties RECONTACTED for more details.

The key difference is that in multi-phase sampling, some units are measured TWICE; in multi-stage sampling, there are different sizes of sampling units (landing sites vs angling parties), but each sampling unit is only selected once.
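
The two selection schemes can be contrasted in a short sketch. Everything below is hypothetical: the landing sites, the party labels, the sample sizes, and the fixed random seed are made up purely to show the mechanics of selection.

```python
import random

rng = random.Random(2006)  # fixed seed so the sketch is repeatable

# Hypothetical frame: landing sites and the angling parties at each site.
sites = {
    "site_A": ["A1", "A2", "A3", "A4"],
    "site_B": ["B1", "B2", "B3"],
    "site_C": ["C1", "C2", "C3", "C4", "C5"],
    "site_D": ["D1", "D2", "D3"],
}

# Multi-stage: first select landing sites, then select parties
# within each selected site. Each unit is selected only once.
stage1_sites = rng.sample(sorted(sites), k=2)
stage2_parties = {s: rng.sample(sites[s], k=2) for s in stage1_sites}

# Multi-phase: select parties directly (no grouping into larger units),
# measure them briefly, then re-contact a sub-sample of the SAME parties.
all_parties = [p for s in sorted(sites) for p in sites[s]]
phase1 = rng.sample(all_parties, k=8)   # brief questionnaire
phase2 = rng.sample(phase1, k=3)        # recontacted for more detail

print(stage1_sites, stage2_parties)
print(phase1, phase2)
```

In the multi-stage selection the two stages draw different kinds of units, whereas every party in the second phase already appears in the first phase and is therefore measured twice.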

4.16.4 What is the difference between a Population and a frame?

Frame = list of sampling units from which a sample will be taken. The sampling units may not be the same as the “final” units that are measured. For example, in cluster sampling, the frame is the list of clusters, but the final units are the objects within the cluster.

Population = list of all “final” units of interest. Usually the “final units” are the actual things measured in the field, i.e. what is the final object upon which a measurement is taken.

In some cases, the frame doesn’t match the population which may cause biases, but in ideal cases, the frame covers the population.

4.16.5 How to account for missing transects.

What do you do if an entire cluster is “missing”?

Missing data can occur at various parts in a survey and for various reasons. The easiest data to handle is data ‘missing completely at random’ (MCAR). In this situation, the missing data provide no information about the problem that is not already captured by other data points and the ‘missingness’ is also non-informative. In this case, and if the design was a simple random sample, the data point is just ignored. So if you wanted to sample 80 transects, but were only able to get 75, only the 75 transects are used. If some of the data are missing within a transect, the problem changes from a cluster sample to a two-stage sample so the estimation formulae change slightly.

If data is not MCAR, this is a real problem - welcome to a Ph.D. in statisticsin how to deal with it!

