Comparing Datasets and Comparing a Dataset with a Standard
How different is enough?
module 7 2
Concepts:
• Independence of each data point
• Test statistics
• Central Limit Theorem
• Standard error of the mean
• Confidence interval for a mean
• Significance levels
• How to apply in Excel
Independent measurements:
• Each measurement must be independent (shake up the basket of tickets)
• Examples of non-independent measurements:
  – Public responses to questions (one result affects the next person’s answer)
  – Samplers placed too close together, so air flows are affected
Test statistics:
• A test statistic is a number calculated from the data
• In Student’s t test, for example, the test statistic is t
• If t >= 1.96 and the population is normally distributed (with a large sample, so that t is close to z), you are in the right-hand tail of the curve: 95% of the distribution lies in the central portion, symmetrically between -1.96 on the left and +1.96 on the right
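The 95% figure can be checked numerically. The sketch below is only an illustration using Python's standard library (the slides themselves use Excel); it computes the area of the standard normal curve between -1.96 and +1.96 and the area in the right-hand tail.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area in the central portion, between -1.96 and +1.96.
central = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Area to the right of +1.96 (the right-hand tail).
right_tail = 1.0 - std_normal.cdf(1.96)

print(round(central, 3), round(right_tail, 3))
```

The central area comes out at about 0.95, with about 0.025 in each tail.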
Test statistics correspond to significance levels
• “P” stands for percentile
• The Pth percentile is the value that a fraction p of the data falls below, and a fraction 1-p falls above
Two major types of questions:
• Comparing the mean against a standard
  – Does the air quality here meet the NAAQS?
• Comparing two datasets
  – Is the air quality different in 2006 than in 2005?
  – Or, is the air quality better?
  – Or, is the air quality worse?
Comparing mean to a standard:
• Did the air quality meet the CARB annual standard of 12 µg/m3?
Year   Ft Smith avg   Ft Smith min   Ft Smith max   N (Ft Smith)
‘05    14.78          0.1            37.9           77
Central Limit Theorem (magic!)
• Even if the underlying population is not normally distributed
• If we repeatedly take datasets
• These different datasets will have means that cluster around the true mean
• And the distribution of these means is approximately normal, provided each dataset is reasonably large!
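A small simulation makes this concrete. The sketch below is illustrative only: it assumes a skewed (exponential) population with a true mean of 1.0, which is not from the slides, and draws 2000 datasets of 77 points each. The function name `sample_means` is a hypothetical helper.

```python
import random
import statistics

def sample_means(n_datasets, n_per_dataset, seed=0):
    """Draw repeated datasets from a skewed population and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_datasets):
        # Exponential population with true mean 1.0 (an assumption for illustration).
        data = [rng.expovariate(1.0) for _ in range(n_per_dataset)]
        means.append(statistics.mean(data))
    return means

means = sample_means(n_datasets=2000, n_per_dataset=77)
center = statistics.mean(means)   # expected to be close to the true mean, 1.0
spread = statistics.stdev(means)  # expected to be close to 1.0 / sqrt(77)

# Fraction of dataset means within 1.96 spreads of the center;
# close to 0.95 if the means are roughly normally distributed.
inside = sum(abs(m - center) <= 1.96 * spread for m in means) / len(means)
print(round(center, 3), round(spread, 3), round(inside, 3))
```

Even though individual points from this population are strongly skewed, the dataset means should cluster tightly and roughly symmetrically around 1.0.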
Magic concept #2: Standard error of the mean
• Represents uncertainty around the mean
• As sample size N gets bigger, your error gets smaller!
• The bigger the N, the more tightly you can estimate the mean
• LIKE the standard deviation for a population, but this is for the mean of YOUR sample
$\mathrm{SE} = \sigma / \sqrt{N}$
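To see how quickly the error shrinks, the sketch below computes the standard error as the standard deviation divided by the square root of N, for several sample sizes. It uses a standard deviation of 8.66 (the Ft Smith value used in the later slides); the sample sizes other than 77 are arbitrary choices for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    return sd / math.sqrt(n)

sd = 8.66  # Ft Smith standard deviation from the later slides
for n in (10, 77, 300, 1000):  # 77 is the Ft Smith N; the others are illustrative
    print(n, round(standard_error(sd, n), 2))
```

Because of the square root, quadrupling the sample size only halves the standard error.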
For a “large” sample (N > 60), or when very close to a normal distribution:
A confidence interval for a population mean is:
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
Choice of Z determines 90%, 95%, etc.
For a “small” sample:
Replace the Z value with a t value to get:
$\bar{x} \pm t \frac{s}{\sqrt{n}}$
where “t” comes from Student’s t distribution, and depends on the sample size.
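The two formulas differ only in the critical value, so one function can serve both. The sketch below is a minimal illustration: `confidence_interval` is a hypothetical helper, the sample summary (mean 10, standard deviation 2, n = 6) is made up, and the t value 2.571 for 95% with 5 degrees of freedom is taken from a standard t table because Python's standard library has no t distribution.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, sd, n, critical):
    """Return (low, high) for mean +/- critical * sd / sqrt(n)."""
    half_width = critical * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Large-sample case: two-sided 95% critical value from the standard normal.
z95 = NormalDist().inv_cdf(0.975)  # about 1.96

# Small-sample case: t for 95% with 5 d.f. (n = 6), from a standard t table.
t95_5df = 2.571

# Hypothetical sample summary, for illustration only.
mean, sd, n = 10.0, 2.0, 6
z_interval = confidence_interval(mean, sd, n, z95)
t_interval = confidence_interval(mean, sd, n, t95_5df)
print("Z interval:", z_interval)
print("t interval:", t_interval)
```

For the same data the t interval is wider, which reflects the extra uncertainty of estimating the spread from only a few points.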
Student’s t distribution versus Normal Z distribution
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) versus value (-5 to 5) for T with 5 d.f. and the Z distribution.]
Compare t and Z values:
Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58
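The Z column can be reproduced from the standard normal distribution. The sketch below is an illustration using Python's standard library; `z_critical` is a hypothetical helper, and the t values are simply copied from this slide's table rather than computed.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value of the standard normal for a confidence level."""
    tail = (1.0 - confidence) / 2.0  # area left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

# t values with 5 d.f., copied from the slide's table (not computed here).
t_5df = {0.90: 2.015, 0.95: 2.571, 0.99: 4.032}

for level, t in t_5df.items():
    z = z_critical(level)
    print(f"{level:.0%}  t = {t:.3f}  Z = {z:.3f}  t/Z = {t / z:.2f}")
```

The ratio t/Z grows with the confidence level, so the penalty for a small sample is largest when the most confidence is demanded.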
What happens as sample gets larger?
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) versus value (-5 to 5) for the Z distribution and T with 60 d.f.]
What happens to CI as sample gets larger?
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$ versus $\bar{x} \pm t \frac{s}{\sqrt{n}}$
For large samples:
Z and t values become almost identical, so CIs are almost identical.
First, graph and review data:
• Use box plot add-in
• Evaluate spread
• Evaluate how far apart mean and median are
• (assume the sampling design and the QC are good)
Excel summary stats:
[Figure: box plot of the Ft Smith data, N=77, vertical axis from 0 to 40]
Ft Smith summary statistics:
Min      0.1
25th     7.5
Median   13.7
75th     18.1
Max      37.9
Mean     14.8
SD       8.7
1. Use the box-plot add-in
2. Calculate summary stats
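For readers working outside Excel, the same summary statistics can be sketched in Python. Everything here is an assumption for illustration: `summarize` is a hypothetical helper, the data values are made up (they are NOT the Ft Smith measurements), and the "inclusive" quartile method is chosen because it is believed to match the usual spreadsheet quartile convention.

```python
import statistics

def summarize(data):
    """Return the summary statistics listed on the slide for one dataset."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "N": len(data),
        "Min": min(data),
        "25th": q1,
        "Median": median,
        "75th": q3,
        "Max": max(data),
        "Mean": statistics.mean(data),
        "SD": statistics.stdev(data),  # sample standard deviation
    }

# Hypothetical concentrations, NOT the Ft Smith measurements.
example = [0.1, 5.2, 7.5, 9.8, 13.7, 15.0, 18.1, 22.4, 37.9]
summary = summarize(example)
for name, value in summary.items():
    print(name, round(value, 2))
```

Comparing the mean with the median in this output is a quick check on skew, which is the review step the slide recommends before choosing z or t.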
Our question:
• How confident can we be (90%? 95%? more?) that the true mean behind our sample mean of 14.78 is really greater than the standard of 12?
• We saw that N = 77, and that the mean and median are not too different
• So use z (normal) rather than t
The mean is 14.8 +- what?
• We know the equation for the CI is $\bar{x} \pm Z \frac{s}{\sqrt{n}}$
• The width of the confidence interval represents how sure we want to be that this CI includes the true mean
• Now all we need to decide is how confident we want to be
CI calculation:
• For 95%, z = 1.96 (often rounded to 2)
• Standard error (sigma / square root of N) = 8.66 / square root of 77 = 0.99
• Half-width of the CI around the mean = 2 x 0.99, or about 2
• We can be 95% sure that the true mean is included in (mean +- 2), or 14.8 - 2 at the low end, to 14.8 + 2 at the high end
• This interval, 12.8 to 16.8, does NOT include 12!
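The same calculation without rounding z to 2 can be sketched as follows. This is an illustration in Python rather than the Excel workflow of the slides; the inputs are the Ft Smith values quoted above, and the variable names are my own.

```python
import math
from statistics import NormalDist

mean, sd, n = 14.78, 8.66, 77  # Ft Smith 2005 values from the slides
standard = 12.0                # annual standard, micrograms per cubic meter

z = NormalDist().inv_cdf(0.975)     # two-sided 95% critical value, about 1.96
standard_error = sd / math.sqrt(n)  # about 0.99
half_width = z * standard_error     # about 1.93
low, high = mean - half_width, mean + half_width

print(f"95% CI: {low:.2f} to {high:.2f}")
print("standard inside CI:", low <= standard <= high)
```

Using z = 1.96 instead of 2 narrows the interval slightly, but the conclusion is unchanged: the lower limit stays above 12.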
Excel can also calculate a confidence interval around the mean:
The mean plus and minus 1.93 is a 95% confidence interval that does NOT include 12!
We know we are more than 95% confident, but how confident can we be that the Ft Smith mean is > 12?
• Calculate where on the curve our mean of 14.8 is, in terms of the z (normal) score
• Or, if N is small, use the t score
To find where we are on the curve, calculate the test statistic:
• Ft Smith mean = 14.8, sigma = 8.66, N = 77
• Calculate the test statistic, which in this case is the z factor (we decided we can use the z rather than the t distribution)
• If N were < 60, the test statistic would be t, but calculated the same way
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$
where $\bar{x}$ is the data’s mean and $\mu$ is the standard of 12
Calculate z easily:
• Our mean of 14.8 minus the standard of 12 (treat the standard as the real mean mu) is the numerator (= 2.8)
• The standard error is sigma / square root of N = 0.987 (same as for the CI)
• So z = 2.8 / 0.987 = 2.84
• So where is this z on the curve?
• Remember at z = 3 we are to the right of ~99%
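The arithmetic is short enough to check in a few lines. The sketch below is illustrative; `z_statistic` is a hypothetical helper name, and the inputs are the Ft Smith values from the slides.

```python
import math

def z_statistic(sample_mean, standard, sd, n):
    """(sample mean - standard) divided by the standard error of the mean."""
    return (sample_mean - standard) / (sd / math.sqrt(n))

z = z_statistic(sample_mean=14.8, standard=12.0, sd=8.66, n=77)
print(round(z, 2))
```

The denominator is the same standard error used for the confidence interval, so the test statistic and the interval are two views of the same comparison.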
Where on the curve?
[Figure: normal curve with Z = 2 and Z = 3 marked]
So we are between 95% and 99% confident that the true mean is above the standard of 12
Can calculate exactly where on the curve, using Excel:
• Use the NORMSDIST function, with z
If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)
Yields 99.8% confidence that the true mean is greater than the standard of 12
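The same lookup can be sketched outside Excel. The example below uses Python's standard library, where the cumulative standard normal function plays the role of NORMSDIST; it is an illustration, not part of the original Excel workflow.

```python
from statistics import NormalDist

z = 2.84
# Area under the standard normal curve to the left of z.
below = NormalDist().cdf(z)
print(f"{below:.1%}")
```

The remaining area to the right of z, about 0.2%, is the part of the curve beyond our test statistic.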