You want to survey a school

• You draw your sample from the first day of school student enrollment list

• This list would be your ____???____

• Which students are not on this list?

• A phenomenon known as?

• Potentially problematic because?

• (Hint: Dillman, p. 196)

Some reminders…

• Population: The group about whom we want to draw our inference

• Sample Frame: Members of the population who could potentially be in our sample

• Coverage Error: The extent to which members of population are excluded from sample frame (not good)
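These three definitions can be made concrete with a toy roster. Below is a minimal Python sketch. The student IDs and the `coverage_error` helper are made up for illustration and come from neither the slides nor Dillman.

```python
def coverage_error(population, sample_frame):
    """Share of the population that is missing from the sample frame."""
    missing = set(population) - set(sample_frame)
    return len(missing) / len(set(population))

# Hypothetical roster: 100 students attend during the year,
# but only the first 90 were on the first-day enrollment list.
population = range(1, 101)
sample_frame = range(1, 91)

# Share of students who can never end up in the sample.
print(coverage_error(population, sample_frame))
```

If the missing students differ on the outcome of interest, no amount of careful sampling from the frame will recover them.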

Welcome…

• …to a hopefully productive lesson on SAMPLING METHODOLOGY!

• What’s ideal?

• Nifty tricks??

• Common misconceptions???

• Limitations of our methods?????????

• P.S. We are going to do (some) math and it is going to be FUN!!!

Simple Random Sampling (what’s ideal)

• Members of a sample frame, which hopefully includes our entire population, are selected one at a time, at random & without replacement (every remaining member is equally likely to be drawn)

• (Drawing names out of a hat)

• Sample is equal in expectation to population on all outcomes, but no guarantees
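Drawing names out of a hat can be sketched in a few lines of Python. The frame below (10,000 people, exactly half saying "yes") is invented for illustration, and `simple_random_sample` is just a thin wrapper around the standard library's `random.sample`.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sample frame: 10,000 people, 1 = "yes", 0 = "no"; exactly 50% say yes.
frame = [1] * 5000 + [0] * 5000

def simple_random_sample(frame, n):
    """Draw n members at random, without replacement (names out of a hat)."""
    return random.sample(frame, n)

sample = simple_random_sample(frame, 1100)
sample_mean = sum(sample) / len(sample)
frame_mean = sum(frame) / len(frame)
print(frame_mean, sample_mean)  # close to each other, but not guaranteed to match
```

Rerunning with a different seed gives a slightly different sample mean each time, which is the "no guarantees" part.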

Stratified Random Sampling (possibly even more ideal)

• Use criterion to divide sample frame by group membership (e.g. racial category)

• Randomly sample within each group

• What is the advantage of this procedure?
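The two steps (divide the frame by group, then sample at random within each group) might look like this in Python. This is a sketch under stated assumptions: the frame, the group labels "A" and "B", and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict

random.seed(2)  # fixed seed so the sketch is repeatable

def stratified_sample(frame, stratum_of, n_per_stratum):
    """Randomly sample n_per_stratum members within each stratum."""
    strata = defaultdict(list)
    for member in frame:
        strata[stratum_of(member)].append(member)
    return {
        name: random.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical frame of (person_id, group) pairs: group "B" is only 1% of it.
frame = [(i, "B" if i % 100 == 0 else "A") for i in range(200_000)]
sample = stratified_sample(frame, stratum_of=lambda m: m[1], n_per_stratum=1100)
print({name: len(members) for name, members in sample.items()})
```

Every group ends up with the same number of respondents, however small its share of the frame.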

Scenario…

• We want to know what percentage of Americans support Obama for president

• We need 1100 members from each racial group to be confident about group means (more on this later)

• American Indians / Alaskan Natives comprise 1% of our population.

• Through simple random sampling, how large of a sample would we theoretically need to reach n = 1100 for this subgroup?
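The arithmetic behind this question can be checked with a short Python sketch. The helper name `total_sample_needed` is made up here, and the result is the sample size at which we would *expect* 1100 subgroup members; any single random draw could land a little above or below that.

```python
import math
from fractions import Fraction

def total_sample_needed(subgroup_target, subgroup_share):
    """Total simple random sample size at which the expected number of
    subgroup members reaches subgroup_target."""
    # Fraction("0.01") is exactly 1/100, which avoids floating-point rounding.
    return math.ceil(subgroup_target / Fraction(subgroup_share))

print(total_sample_needed(1100, "0.01"))  # 110000
```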

Scenario cont’d…

• OR, we could use stratified random sampling and draw 1100 from each subgroup without all this trouble.

• BUT, now we have oversampled from American Indians--they are over-represented in our sample!

• Implications?

• Solutions?

(This data is very fake)

• Proportion supporting B.O.

African American: .50

Asian American: .50

Latino: .50

White: .50

American Indian: 0

Unweighted avg: ??

Weighting (nifty trick)

• Now, let’s do a weighted average instead…

What’s going on here?

(99% × .50) + (1% × 0) = 49.50%

• Big difference, eh?
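Both averages can be reproduced in Python from the (very fake) proportions above. The only population shares used are the ones on the slides: 1% for American Indians and 99% for the other four groups combined.

```python
# Proportions from the (very fake) data above.
support = {
    "African American": 0.50,
    "Asian American": 0.50,
    "Latino": 0.50,
    "White": 0.50,
    "American Indian": 0.0,
}

# Unweighted: every group counts equally, as if each were 20% of the country.
unweighted = sum(support.values()) / len(support)

# Weighted: the four .50 groups together are 99% of the population,
# American Indians are 1%.
weighted = 0.99 * 0.50 + 0.01 * support["American Indian"]

print(round(unweighted, 3), round(weighted, 3))  # 0.4 0.495
```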

So, why was 1100 an ideal subgroup number?

• Because no matter how large your population, a sample of 1100 will get you very close to the true population value if your outcome is binary (e.g. Obama: Yes or No)

• How come?

Because this man said so

• William Sealy Gosset (1876-1937)

• Chemist, “math person”, Guinness Brewery worker

• A patient man

Yes, a patient man

• Using barley (somehow), spent two years empirically studying relationship between sample means and population means.

• “The Probable (Standard) Error of a Mean” (1908)

• Standard errors are what we use to estimate sampling error

Sampling error

• Describes how closely our sample mean allows us to estimate our population mean

• Conceptually similar to a confidence interval (Dillman, p. 207; http://www.researchsolutions.co.nz/sample_sizes.htm)

• Depends on: population variance (“spread”, estimated by sample variance); sample size; population size (to a point)
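A rough Python sketch of how those three ingredients combine. It assumes the textbook standard error of a mean under simple random sampling, with the usual finite population correction; the slides do not spell out a formula, so treat the function below as one common version rather than the one Dillman uses.

```python
import math

def standard_error(variance, n, population_size=None):
    """Standard error of a sample mean under simple random sampling.

    With population_size given, applies the finite population correction
    sqrt((N - n) / (N - 1)), which matters only when n is a sizable share of N.
    """
    se = math.sqrt(variance / n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

# Binary outcome at its most spread-out (variance .25), n = 1100.
for N in (2_000, 20_000, 2_000_000, None):
    print(N, round(standard_error(0.25, 1100, N), 4))
```

Once the population is much larger than the sample, the correction is essentially 1 and population size stops mattering, which is the "to a point" part.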

Sampling error: big picture

• Larger variances and (to a point) larger population sizes require larger samples to estimate the population mean at a given level of precision

• Increasing sample size reduces sampling error, BUT there are diminishing returns to increasing our sample size

Sampling error: big picture

• Diminishing Returns? For large populations…

• Increasing “n” from 100 to 200 is helpful

• Increasing from 500-600 is less helpful

• Increasing from 1200-1300 helps very little (no matter how large the population)
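The three comparisons above can be put in numbers with a small Python sketch. It assumes a yes/no outcome split 50/50, a large population, and the conventional 1.96 multiplier for an approximate 95% margin of error; none of these specifics are stated on the slides.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

for small, large in [(100, 200), (500, 600), (1200, 1300)]:
    gain = margin_of_error(small) - margin_of_error(large)
    print(f"n {small} -> {large}: margin shrinks by {100 * gain:.2f} points")
```

Under these assumptions, the same 100 extra respondents buy roughly 2.9, 0.4 and 0.1 percentage points respectively.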

Why Diminishing Returns?

• Because there is an upper bound (“ceiling”) on the variance of any sample.

• For binary (Yes/no, “1” or “0”) outcomes, max variance is .25

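• Thus, with the variance capped, it’s only a matter of time till more “n” in the denominator of the standard error, sqrt(variance / n), makes it very low. And because of the square root, each extra respondent helps a little less than the last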
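To see the ceiling and what it implies for n = 1100, here is a short Python sketch. It assumes the usual variance formula for a yes/no outcome, p × (1 − p), and a 1.96 multiplier for an approximate 95% margin of error; both are standard but neither is written out on the slides.

```python
import math

# Variance of a yes/no outcome is p * (1 - p); scan p from 0 to 1 in steps of .01.
variances = {p / 100: (p / 100) * (1 - p / 100) for p in range(101)}
worst_p = max(variances, key=variances.get)
max_variance = variances[worst_p]
print(worst_p, max_variance)  # 0.5 0.25

# Worst-case standard error and approximate 95% margin of error at n = 1100.
n = 1100
standard_error = math.sqrt(max_variance / n)
margin = 1.96 * standard_error
print(round(100 * margin, 1))  # 3.0 (percentage points)
```

So even in the worst case, a simple random sample of 1100 pins a yes/no percentage down to about plus or minus 3 points.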

Why Diminishing Returns?

• Even for continuous outcomes, there is still an upper bound on variance unless scale is infinite

• Thus, there are still diminishing returns on increasing “n”

• For more on this topic…

• Take S-012

• Look up Confidence Intervals in stats books

• “You don't need a large sample of users to obtain meaningful data: Continuous Data (e.g. Task Time)” http://www.measuringusability.com/sample_continuous.htm

Limitations of Sampling error calculations

• Does not take coverage error into account!

• Assumes you have drawn a simple random sample (e.g. does not take “clustering” into account)

Clustering???

• There are 20,000 students in a city with 40 schools. We want a sample of 1100

• Ideally, we would draw students at random from every school.

• But, it would be cheaper and easier if we drew a few schools at random and obtained information from every student in those schools

• Implications?

Clustering???

• If there is a lot of school-level variation in our outcome, the few schools we happen to draw may not be representative, and our sample estimate will be far less precise than a simple random sample of the same size would be.

• Sampling error formula does not account for this possibility
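A small simulation makes the cost of clustering visible. Everything in this Python sketch is hypothetical: 40 equal-sized schools of 500 students, a yes/no outcome whose rate ranges from 20% to 80% across schools, and samples of 1000 (two whole schools) to stay close to the 1100 in the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_SCHOOLS, STUDENTS_PER_SCHOOL = 40, 500  # 40 * 500 = 20,000 students

# Hypothetical yes/no outcome whose rate differs a lot from school to school.
school_rates = [random.uniform(0.2, 0.8) for _ in range(N_SCHOOLS)]
schools = [
    [1 if random.random() < rate else 0 for _ in range(STUDENTS_PER_SCHOOL)]
    for rate in school_rates
]
all_students = [y for school in schools for y in school]

def mean(values):
    return sum(values) / len(values)

def srs_estimate(n):
    """Estimate from n students drawn at random across the whole city."""
    return mean(random.sample(all_students, n))

def cluster_estimate(n_schools):
    """Estimate from every student in a few randomly drawn schools."""
    chosen = random.sample(schools, n_schools)
    return mean([y for school in chosen for y in school])

# Repeat each design many times and compare how much the estimates bounce around.
REPS = 200
srs_spread = statistics.stdev([srs_estimate(1000) for _ in range(REPS)])
cluster_spread = statistics.stdev([cluster_estimate(2) for _ in range(REPS)])
print(round(srs_spread, 3), round(cluster_spread, 3))  # cluster spread is far larger
```

Both designs survey 1000 students, but the cluster estimates swing much more widely from draw to draw, and the simple-random-sample formula knows nothing about that extra swing.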

One more limitation of sampling error formula

• Non-response bias

• Even if you have drawn a beautifully random sample, your sample estimate will be biased if those who do not return your survey are different on your outcome of interest.
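The size of this bias is easy to work out for a yes/no outcome. The Python sketch below uses invented response rates (60% for "yes" people, 30% for "no" people), and `respondent_estimate` is a hypothetical helper rather than anything from Dillman.

```python
def respondent_estimate(true_share_yes, response_rate_yes, response_rate_no):
    """Share answering "yes" among people who actually return the survey."""
    yes_returned = true_share_yes * response_rate_yes
    no_returned = (1 - true_share_yes) * response_rate_no
    return yes_returned / (yes_returned + no_returned)

# Hypothetical: truth is 50% yes, but "yes" people respond at 60% and "no" people at 30%.
print(round(respondent_estimate(0.50, 0.60, 0.30), 3))  # 0.667

# Equal response rates leave the estimate unbiased, even when they are low.
print(round(respondent_estimate(0.50, 0.45, 0.45), 3))  # 0.5
```

Note that a larger sample does nothing to fix the first case: the estimate stays near 67% no matter how many surveys go out.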

• That’s why Dillman’s advice on getting high response rates is so important!
