comparing datasets and comparing a dataset with a standard how different is enough?

Comparing Datasets and Comparing a Dataset with a Standard

How different is enough?

module 7 2

Concepts:

• Independence of each data point
• Test statistics
• Central Limit Theorem
• Standard error of the mean
• Confidence interval for a mean
• Significance levels
• How to apply in Excel


Independent measurements:

• Each measurement must be independent (shake up the basket of tickets)

• Examples of non-independent measurements:
  – Public responses to questions (one result affects the next person's answer)
  – Samplers placed too close together, so air flows are affected


Test statistics:

• Some number that is calculated based on the data

• In Student's t test, for example, the test statistic is t
• If t >= 1.96 and you have a normally distributed population (and a large sample), you are in the right-hand tail of the curve: the central 95% of the distribution lies symmetrically between -1.96 on the left and +1.96 on the right


Test statistics correspond to significance levels

• "P" stands for percentile
• The Pth percentile is the value that a fraction p of the data falls below and a fraction 1 - p falls above
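The link between a z value and a percentile can be sketched in a few lines of Python. This is only an illustration, assuming a standard normal curve; it uses `statistics.NormalDist` from the Python standard library, and the variable names are made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# Fraction of the curve that falls below (and above) a given z value.
for z in (1.0, 1.96, 2.0, 3.0):
    below = std_normal.cdf(z)
    print(f"z = {z:4.2f}: {below:.4f} below, {1 - below:.4f} above")

# Going the other way: the z value that 97.5% of the curve falls below.
# Leaving 2.5% in each tail is what gives the familiar 95% in the middle.
z_975 = std_normal.inv_cdf(0.975)
print(f"97.5th percentile is at z = {z_975:.2f}")
```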


Two major types of questions:

• Comparing the mean against a standard
  – Does the air quality here meet the NAAQS?
• Comparing two datasets
  – Is the air quality different in 2006 than in 2005?
  – Or, is the air quality better?
  – Or, is the air quality worse?


Comparing mean to a standard:

• Did the air quality meet the CARB annual standard of 12 µg/m3?

Year   Ft Smith avg   Ft Smith min   Ft Smith max   N (Ft Smith)
'05    14.78          0.1            37.9           77


Central Limit Theorem (magic!)

• Even if the underlying population is not normally distributed
• If we repeatedly take datasets
• These different datasets will have means that cluster around the true mean
• And the distribution of these means is normally distributed!
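A small simulation can make this concrete. The sketch below is not from the slides: it assumes a made-up, strongly skewed population (exponential, true mean 10), draws 2000 datasets of 50 points each, and checks whether the dataset means behave like a normal distribution.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed (non-normal) population: exponential with true mean 10.
TRUE_MEAN = 10.0

def dataset_mean(n):
    """Draw one dataset of n points and return its mean."""
    return statistics.fmean(random.expovariate(1 / TRUE_MEAN) for _ in range(n))

# Repeatedly take datasets of 50 points and record each dataset's mean.
means = [dataset_mean(50) for _ in range(2000)]

center = statistics.fmean(means)
spread = statistics.stdev(means)

# If the means are roughly normal, about 95% of them should fall
# within 1.96 spreads of the center.
inside = sum(abs(m - center) <= 1.96 * spread for m in means) / len(means)
print(f"mean of means = {center:.2f}, spread = {spread:.2f}, inside = {inside:.3f}")
```

The means cluster near 10 even though individual exponential values are far from normally distributed.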


magic concept #2: Standard error of the mean

• Represents uncertainty around the mean

• as sample size N gets bigger, your error gets smaller!

• The bigger the N, the more tightly you can estimate mean

• LIKE standard deviation for a population, but this is for YOUR sample

$\text{SE} = \dfrac{s}{\sqrt{N}}$
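To see how the standard error shrinks as N grows, here is a minimal sketch. It assumes the usual formula (standard deviation divided by the square root of N) and borrows the Ft Smith standard deviation of 8.66 used later in this module; the sample sizes other than 77 are made up.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation over sqrt(N)."""
    return sd / math.sqrt(n)

# Same spread (SD = 8.66), increasing sample size.
for n in (10, 77, 300, 1000):
    print(f"N = {n:4d}: standard error = {standard_error(8.66, n):.2f}")
```

Quadrupling N only halves the standard error, because N sits under a square root.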


For a "large" sample (N > 60), or when very close to a normal distribution:

A confidence interval for a population mean is:

$\bar{x} \pm Z \dfrac{s}{\sqrt{n}}$

Choice of z determines 90%, 95%, etc.
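As a rough illustration of how the choice of Z changes the interval, the sketch below computes large-sample intervals at three confidence levels. The function name and the sample numbers (mean 20, s = 6, n = 100) are hypothetical, chosen only for this example.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Large-sample confidence interval: mean +/- Z * sd / sqrt(n)."""
    # Two-sided interval: split the leftover probability between both tails.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical sample: mean 20, standard deviation 6, 100 measurements.
intervals = {
    level: z_confidence_interval(20.0, 6.0, 100, level)
    for level in (0.90, 0.95, 0.99)
}
for level, (low, high) in intervals.items():
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f}")
```

Asking for more confidence means a larger Z and therefore a wider interval.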


For a “small” sample:

Replace the Z value with a t value to get:

$\bar{x} \pm t \dfrac{s}{\sqrt{n}}$

where “t” comes from Student’s t distribution, and depends on the sample size.
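A minimal sketch of the small-sample calculation follows. The six measurements are invented for illustration, and the t value (2.571 for a two-sided 95% interval with 5 degrees of freedom) is typed in from a standard t table rather than computed.

```python
import math
import statistics

# Hypothetical small sample of 6 measurements (not from the slides).
sample = [11.2, 14.9, 9.8, 16.3, 12.7, 13.5]
n = len(sample)
mean = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)

# Two-sided 95% t value for n - 1 = 5 degrees of freedom, from a t table.
T_95_5DF = 2.571
half_width = T_95_5DF * s / math.sqrt(n)

print(f"mean = {mean:.2f}, s = {s:.2f}")
print(f"95% CI: {mean - half_width:.2f} to {mean + half_width:.2f}")
```

With a different sample size, the t value has to be looked up again for the new degrees of freedom.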


Student’s t distribution versus Normal Z distribution

[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) versus value (-5 to 5), with one curve for T with 5 d.f. and one for the Z distribution.]


Compare t and Z values:

Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58
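The comparison can be turned into a statement about interval width. In the sketch below the t values are copied from this slide, while the Z values are computed with Python's `statistics.NormalDist`; the variable names are made up for the example.

```python
from statistics import NormalDist

# Two-sided t values for 5 d.f., copied from the slide.
T_5DF = {0.90: 2.015, 0.95: 2.571, 0.99: 4.032}

ratios = {}
for level, t_value in T_5DF.items():
    z_value = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ratios[level] = t_value / z_value
    print(f"{level:.0%}: Z = {z_value:.3f}, t = {t_value:.3f}, "
          f"t-based CI is {ratios[level]:.2f} times as wide")
```

For the same s and n, a t-based interval with 5 d.f. is always wider than the Z-based one, and the gap grows at higher confidence levels.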


What happens as sample gets larger?

[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) versus value (-5 to 5), with one curve for the Z distribution and one for T with 60 d.f.]


What happens to CI as sample gets larger?

$\bar{x} \pm Z \dfrac{s}{\sqrt{n}}$ versus $\bar{x} \pm t \dfrac{s}{\sqrt{n}}$

For large samples:

Z and t values become almost identical, so CIs are almost identical.


First, graph and review data:

• Use box plot add-in
• Evaluate spread
• Evaluate how far apart mean and median are
• (Assume the sampling design and the QC are good)


Excel summary stats:


[Figure: box plot for Ft Smith, N = 77, vertical axis from 0 to 40]

Min      0.1
25th     7.5
Median   13.7
75th     18.1
Max      37.9
Mean     14.8
SD       8.7

1. Use the box-plot add-in
2. Calculate summary stats


Our question:

• How confident (90%, 95%, or some other level) can we be that this mean of 14.78 is really greater than the standard of 12?

• We saw that N = 77 (a large sample) and that the mean and median are not too different

• So use z (normal) rather than t


The mean is 14.8 +- what?

• We know the equation for the CI is $\bar{x} \pm Z \dfrac{s}{\sqrt{n}}$

• The width of the confidence interval depends on how sure we want to be that the CI includes the true mean

• Now all we need to decide is how confident we want to be


CI calculation:

• For 95%, z = 1.96 (often rounded to 2)
• Standard error (sigma / square root of N) = 8.66 / square root of 77 = 0.987
• CI half-width around the mean = 2 x 0.987, or about 2
• We can be 95% sure that the true mean is included in (mean +- 2), from 14.8 - 2 = 12.8 at the low end to 14.8 + 2 = 16.8 at the high end
• This does NOT include 12!


Excel can also calculate a confidence interval around the mean:

The mean plus and minus 1.93 is a 95% confidence interval that does NOT include 12!
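The same check can be reproduced outside Excel. The sketch below uses the Ft Smith summary statistics from the slides (mean 14.78, SD 8.66, N = 77) and the exact z of about 1.96 rather than the rounded value of 2; the variable names are chosen for this example.

```python
import math
from statistics import NormalDist

mean, sd, n = 14.78, 8.66, 77   # Ft Smith summary statistics from the slides
standard = 12.0                 # the annual standard being tested

std_error = sd / math.sqrt(n)
z_95 = NormalDist().inv_cdf(0.975)      # about 1.96 for a two-sided 95% interval
half_width = z_95 * std_error
low, high = mean - half_width, mean + half_width

print(f"standard error = {std_error:.3f}")
print(f"95% CI = {mean} +- {half_width:.2f}, i.e. {low:.2f} to {high:.2f}")
print("standard inside CI:", low <= standard <= high)
```

The half-width comes out at about 1.93, and the lower end of the interval stays above 12.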


We know we are more than 95% confident, but how confident can we be that Ft Smith mean > 12?

• Calculate where on the curve our mean of 14.8 is, in terms of the z (normal) score,

• Or if N small, use the t score:

http://upload.wikimedia.org/wikipedia/commons/2/25/The_Normal_Distribution.svg


To find where we are on the curve, calc the test statistic:

• Ft Smith mean = 14.8, sigma =8.66, N =77

• Calculate the test statistic, which in this case is the z factor (we decided we can use the z rather than the t distribution)

• If N was < 60, the test stat is t, but calculated the same way

$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{N}}$

where $\bar{x}$ is the data's mean and $\mu$ is the standard of 12.


Calculate z easily:

• Our mean of 14.8 minus the standard of 12 (treat the real mean µ (mu) as the standard) is the numerator (= 2.8)
• The standard error is sigma / square root of N = 0.987 (same as for the CI)
• So z = 2.8 / 0.987 = 2.84
• So where is this z on the curve?
• Remember that at z = 3 we are to the right of ~99%
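The arithmetic above is short enough to check directly. This sketch uses the slide's numbers (mean 14.8, sigma 8.66, N = 77, standard of 12); the variable names are chosen for this example.

```python
import math

mean, sd, n = 14.8, 8.66, 77   # Ft Smith values used on the slide
standard = 12.0                # treated as the "true mean" mu in the formula

std_error = sd / math.sqrt(n)          # same standard error as for the CI
z = (mean - standard) / std_error      # test statistic

print(f"numerator = {mean - standard:.1f}")
print(f"standard error = {std_error:.3f}")
print(f"z = {z:.2f}")
```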


Where on the curve?

[Figure: normal curve with Z = 2 and Z = 3 marked]

Our z of 2.84 falls between Z = 2 and Z = 3, so we are between 95% and 99% confident that the true mean is greater than 12


Can calculate exactly where on the curve, using Excel:

• Use the NORMSDIST function, with z

If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)

This yields a 99.8% probability that the true mean is greater than 12
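For readers without Excel, the same lookup can be sketched in Python. This assumes the standard normal curve (the z case, not the t case) and uses `statistics.NormalDist().cdf`, which returns the fraction of the curve below a given z.

```python
from statistics import NormalDist

z = 2.84  # test statistic calculated for Ft Smith

# Fraction of the standard normal curve that lies below z.
p_below = NormalDist().cdf(z)

print(f"fraction of the curve below z = {z}: {p_below:.4f}")
print(f"as a percentage: {p_below:.1%}")
```

For a small sample, where the test statistic is t, the normal curve is only an approximation and a t distribution lookup would be needed instead.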
