ANOVA example
Polychlorinated biphenyls (PCBs), previously used in the manufacture of large electrical transformers and capacitors, are extremely hazardous contaminants when released into the environment. Samples of fish were taken from each of four rivers and analyzed for PCB concentration (in ppm).
Question
Do the data provide sufficient evidence to indicate differences in the mean PCB concentration in fish for the four rivers?
Hypotheses:
– $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
– $H_A$: the means are not all equal (at least one mean is not equal to the others).
First Step
Examine the data. What does this mean?
– Boxplots
– Histograms
– Normal quantile plots
– Note the command-line code for drawing a grid of 4 probability plots at once. Go to File-->New-->Script File, paste the commands into the script file window, and press F10; the 4 plots are produced automatically.
– OR: just paste these into the command line:
# 2 x 2 grid of plots, one normal quantile plot per river
par(mfrow=c(2,2))
for (i in 1:4) {
  # column 1 of PCBfish holds the PCB values, column 2 the river number
  qqnorm(PCBfish[,1][PCBfish[,2]==i], ylab="Data quantiles")
  title(paste("River ", i, sep=""))
}
Can we do an ANOVA? What are the criteria?
Normally distributed
Equal standard deviations
Independent samples across treatments
– What might this look like if it weren't true?
– Rivers connected?
Independent samples within treatments
– What might this look like if it weren't true?
– Clustering?
Transformations (p. 65 & 69 of Sleuth)
Log transformation.
– Why try this?
– Ratio of largest to smallest > 10, data are skewed, and the group with the larger average has the larger spread
When to use the reciprocal?
– Data are waiting times
When to use the square root?
– Data are counts
Better?
Why or why not?
– Standard deviations are much more similar
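A minimal sketch of the check described on the transformation slides, assuming simulated stand-in data because the PCBfish values are not listed here. The vectors `river` and `pcb` and the simulation settings are hypothetical; only the idea of comparing group standard deviations before and after taking logs comes from the slides.

```r
# Hypothetical stand-in data (NOT the PCBfish values): skewed PCB by river
set.seed(42)
river <- rep(1:4, times = c(25, 20, 23, 24))
pcb <- exp(rnorm(length(river),
                 mean = c(1.09, 1.52, 1.89, 2.09)[river], sd = 0.92))

# Rule of thumb from the slide: largest / smallest > 10 suggests trying logs
range_ratio <- max(pcb) / min(pcb)

# Group standard deviations on the raw and log scales
sd_raw <- tapply(pcb, river, sd)
sd_log <- tapply(log(pcb), river, sd)
print(round(rbind(sd_raw, sd_log), 2))
```

If the log scale is the better choice, the `sd_log` row should look much more even across rivers than the `sd_raw` row.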
Do an ANOVA
Read table:
– sum of squares
– $s_{pooled}$ and $s_{pooled}^2$
– F-value
– p-value
What are your conclusions?
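As a rough sketch of how the table can be produced and read in R (S+ labels some columns slightly differently), the code below fits a one-way ANOVA to simulated stand-in data. The data frame `demo`, its column names, and the simulated values are hypothetical and are not the PCBfish data.

```r
# Hypothetical stand-in data: log(PCB) for 92 fish in four rivers
set.seed(42)
demo <- data.frame(river = factor(rep(1:4, times = c(25, 20, 23, 24))))
demo$log_pcb <- rnorm(nrow(demo),
                      mean = c(1.09, 1.52, 1.89, 2.09)[as.integer(demo$river)],
                      sd = 0.92)

# One-way ANOVA: separate-means model compared with the equal-means model
fit <- aov(log_pcb ~ river, data = demo)
tab <- summary(fit)[[1]]
print(tab)

# Pieces to read off the table
s_pooled <- sqrt(tab[["Mean Sq"]][2])  # square root of the residual mean square
f_value  <- tab[["F value"]][1]
p_value  <- tab[["Pr(>F)"]][1]
```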
Conclusions
We can reject the null hypothesis of no difference in these group means.
At least one of the means is different from the others (is this statement the same as accepting the alternative hypothesis?)
“Convincing evidence exists that median PCB concentration of fish in these rivers is different (p-value of 0.002; analysis of variance F-test).”
Compare just two rivers...
Average and 95% CI for the difference in PCB in fish between Rivers 1 and 2
Logged data, so…
– River 1 - River 2 = 1.09 - 1.52 = -0.43
– $e^{-0.43} = 0.65$
– The median concentration of PCB in fish in River 1 is 0.65 times that of fish in River 2.
$H_0: \mu_{\text{River 1}} - \mu_{\text{River 2}} = 0$; $H_A: \mu_{\text{River 1}} - \mu_{\text{River 2}} \neq 0$
Is this significant?
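Two-sided, two-sample t-test: the t-statistic (and p-value) must be calculated by hand, because the SE has to be based on $s_{pooled}$ from all four rivers.
$s_{pooled}$:
$$s_{pooled} = \sqrt{\frac{24(0.806)^2 + 19(0.956)^2 + 22(1.023)^2 + 23(0.897)^2}{24 + 19 + 22 + 23}} = 0.92$$
SE:
$$SE = 0.92\sqrt{\frac{1}{25} + \frac{1}{20}} = 0.28$$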
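The pooled standard deviation and the standard error can be reproduced from the group summaries. This is a minimal sketch, assuming the four group standard deviations shown on the slide and sample sizes of 25, 20, 23, and 24 fish. The River 2 sample size of 20 is an inference from the 1/20 term in the SE and from the 88 residual degrees of freedom in the S+ printout; the variable names are illustrative.

```r
# Group SDs of log(PCB) for Rivers 1-4 (from the slide) and assumed sample sizes
s <- c(0.806, 0.956, 1.023, 0.897)
n <- c(25, 20, 23, 24)   # assumption: River 2 has 20 fish
df <- n - 1

# Pooled SD: square root of the df-weighted average of the group variances
s_pooled <- sqrt(sum(df * s^2) / sum(df))

# SE of the River 1 minus River 2 difference in mean log(PCB)
se_diff <- s_pooled * sqrt(1 / n[1] + 1 / n[2])

print(round(c(s_pooled = s_pooled, se_diff = se_diff), 2))
```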
Hypothesis test
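Test the hypothesis that River 1 - River 2 = 0
– Estimate/SE: t = -0.43/0.28 = -1.54
– pt(-1.54,45)
[1] 0.06528174
– This is the one-sided p-value (0.065); doubling it for the two-sided test gives about 0.13.
– At the 0.05 level we would not reject the null, and the evidence for a difference is at best suggestive, but we'll still do a CI for practice.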
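A short sketch of the same calculation in R, using the estimate and SE from the slides. One assumption differs from the slide's call: the reference distribution here uses the 88 residual degrees of freedom that go with the pooled SD rather than 45, which changes the p-value only slightly.

```r
estimate <- -0.43   # River 1 minus River 2, mean log(PCB), from the slide
se_diff  <- 0.28    # SE based on the pooled SD, from the slide
df_resid <- 88      # assumption: residual df from the ANOVA printout

t_stat <- estimate / se_diff

# Lower-tail (one-sided) and two-sided p-values from the t distribution
p_one_sided <- pt(t_stat, df_resid)
p_two_sided <- 2 * pt(-abs(t_stat), df_resid)

print(round(c(t = t_stat, one_sided = p_one_sided, two_sided = p_two_sided), 3))
```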
95% CI
95% CI for the difference in group means
– qt(0.975,88)
[1] 1.98729
– -0.43 ± (1.99)(0.28) --> (-0.98, 0.13)
– $e^{-0.98} = 0.37$; $e^{0.13} = 1.14$
– Fish in River 1 have between 0.37 and 1.14 times as much PCB in their muscle as fish in River 2. (Are we surprised that this interval covers 1?)
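The interval and its back-transformation can be sketched in a few lines of R, using the estimate, SE, and degrees of freedom from the slides. Variable names are illustrative, and because the code does not round the intermediate values, the endpoints may differ from the hand calculation in the second decimal place.

```r
estimate <- -0.43          # difference in mean log(PCB), River 1 minus River 2
se_diff  <- 0.28           # SE based on the pooled SD
t_crit   <- qt(0.975, 88)  # multiplier for a 95% interval with 88 df

# Interval on the log scale
ci_log <- estimate + c(-1, 1) * t_crit * se_diff

# Back-transform: ratio of medians (River 1 / River 2) and its interval
ratio    <- exp(estimate)
ci_ratio <- exp(ci_log)

print(round(ci_log, 2))
print(round(c(ratio = ratio, lower = ci_ratio[1], upper = ci_ratio[2]), 2))
```

An interval for the ratio that contains 1 corresponds to an interval for the log-scale difference that contains 0.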
ANOVA Explanation
Reduced model = equal means model
– All these rivers have the same mean PCB concentration in the fish: null hypothesis
How wrong are we for this hypothesis?
– Residual error is how wrong we are
– Large residuals here mean the null hypothesis fits poorly
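Graph of PCB in Each River: Equal Means
[Figure: log.pcb (vertical axis, -1 to 3) plotted against river (1 to 4), with the equal-means estimate of 1.64 drawn across all rivers. A bracket marks the residual from the highest point in River 1 to the equal-means average.]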
ANOVA by hand (conceptual)
River  Log(PCB)  Equal Means Est.  Equal Means Res.  Sep. Means Est.  Sep. Means Res.
1       0.83     1.64              -0.81
1       1.86     1.64               0.22
2       0.06     1.64              -1.58
2       0.22     1.64              -1.42
3       1.14     1.64              -0.50
3      -0.78     1.64              -2.42
4       1.45     1.64              -0.19
4       3.11     1.64               1.47
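Graph of PCB in Each River: Separate Means
[Figure: log.pcb (vertical axis, -1 to 3) plotted against river (1 to 4), with a separate mean drawn for each river. A bracket marks the residual from the highest point in River 1 to the separate-means estimate for River 1.]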
ANOVA by hand (conceptual)
River  Log(PCB)  Equal Means Est.  Equal Means Res.  Sep. Means Est.  Sep. Means Res.
1       0.83     1.64              -0.81             1.09             -0.26
1       1.86     1.64               0.22             1.09              0.77
2       0.06     1.64              -1.58             1.52             -1.46
2       0.22     1.64              -1.42             1.52             -1.30
3       1.14     1.64              -0.50             1.89             -0.75
3      -0.78     1.64              -2.42             1.89             -2.67
4       1.45     1.64              -0.19             2.09             -0.64
4       3.11     1.64               1.47             2.09              1.02
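The two residual columns can be reproduced directly. A minimal sketch in R, using only the eight rows shown in the table, the equal-means estimate of 1.64, and the four river means; variable names are illustrative. These eight fish are only a subset of the sample, so the residual sums of squares used in the ANOVA come from all of the fish, not from these rows alone.

```r
# The eight example rows from the table
river   <- c(1, 1, 2, 2, 3, 3, 4, 4)
log_pcb <- c(0.83, 1.86, 0.06, 0.22, 1.14, -0.78, 1.45, 3.11)

grand_mean  <- 1.64                        # equal-means estimate
river_means <- c(1.09, 1.52, 1.89, 2.09)   # separate-means estimates

# Residual = observation minus the model's estimate for that observation
res_equal <- log_pcb - grand_mean
res_sep   <- log_pcb - river_means[river]

print(data.frame(river, log_pcb,
                 res_equal = round(res_equal, 2),
                 res_sep   = round(res_sep, 2)))
```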
Model Inaccuracy
If the null hypothesis is correct
– The two models should be about equal in their ability to explain the data
– AND, the magnitudes of the residuals should be about the same
If the null hypothesis is incorrect
– The magnitudes of the residuals from the equal-means model will tend to be larger
– Their larger sizes reflect model inaccuracy
Residual Sum of Squares
We need a single summary of the residuals for a particular model.
Statisticians have chosen the sum of the squared residuals: the residual sum of squares (RSS).
Extra Sum of Squares
The error from the reduced (equal means) model minus the error from the full (separate means) model measures how much smaller the residuals are under the full model.
This is called the Extra Sum of Squares (ESS). Another way to say this: the ESS measures the amount of unexplained variability in the reduced model that is explained by the full model.
How much better is it to say that each river has its own mean than to say that all the rivers have the same mean?
Thus: $ESS = RSS_{reduced} - RSS_{full}$
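Written out for the one-way layout, the two residual sums of squares and their difference take the following form. The notation is an assumption added here for illustration: $Y_{ij}$ is the log PCB value for fish $j$ in river $i$, $\bar{Y}_i$ is the mean for river $i$, $\bar{Y}$ is the overall mean, and $n_i$ is the number of fish from river $i$.

$$RSS_{reduced} = \sum_{i=1}^{4}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2, \qquad RSS_{full} = \sum_{i=1}^{4}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$$

$$ESS = RSS_{reduced} - RSS_{full} = \sum_{i=1}^{4} n_i (\bar{Y}_i - \bar{Y})^2$$

The last expression shows that the ESS is large exactly when the river means sit far from the overall mean.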
F-Statistic
How much difference in the models is enough to say it is significant (the same questions we’ve asked through t-tests, etc)?
We compare these two levels of unexplained variability in an F-test.
We take their difference (the extra sum of squares), divide it by the extra degrees of freedom, and scale the result by our best estimate of the variance, $\hat{\sigma}^2$ from the full model.
F-test (cont)
Large F-statistics are associated with large differences in the size of residuals from the two models.
This is evidence against the reduced model (null hyp) and in favor of the full model (different means).
This test is summarized by its p-value (based on an F-distribution).
$$F\text{-statistic} = \frac{\text{Extra sum of squares} \,/\, \text{Extra degrees of freedom}}{\hat{\sigma}^2_{full}}$$
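As a check on the formula, the F-statistic and its p-value can be rebuilt from the sums of squares and degrees of freedom reported in the S+ printout for this example. A minimal sketch in R; the variable names are illustrative.

```r
# Values from the S+ printout
ess      <- 14.018   # extra sum of squares (river row)
extra_df <- 3        # extra degrees of freedom (4 rivers - 1)
rss_full <- 74.488   # residual sum of squares, separate-means model
df_full  <- 88       # residual degrees of freedom, separate-means model

# Best estimate of the variance: residual mean square of the full model
sigma2_full <- rss_full / df_full

# F-statistic and upper-tail p-value from the F distribution
f_stat  <- (ess / extra_df) / sigma2_full
p_value <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```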
ANOVA Table
Source of variation  In other words          SS                      df                    MS        F
Between groups       Error (reduced - full)  Total SSR - within SSR  Total df - within df  A: SS/df  A/B
Within groups        Full model              SSR full                n - number of groups  B: SS/df
Total                Reduced model           SSR reduced             n - 1
A/B = (SS between / df between) / (SS within / df within); n = total number of observations.
S+ Printout
Residual standard error: 0.9200322
           Df  Sum of Sq  Mean Sq  F Value  Pr(F)
river       3     14.018    4.673    5.520  0.0016
Residuals  88     74.488    0.846
We can reject the null hypothesis of no difference in medians. At least one river has a different median PCB concentration.
For some reason, S+ does not print out the reduced model information (total) that is on the ANOVA table we make by hand.
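The missing total (reduced model) row can be recovered by adding the two printed rows, since the between-groups and within-groups sums of squares and degrees of freedom add up to the totals. A small sketch in R using the printout values; the variable names are illustrative.

```r
# Rows printed by S+
ss_between <- 14.018; df_between <- 3    # river
ss_within  <- 74.488; df_within  <- 88   # Residuals (full model)

# Total row = reduced (equal-means) model
ss_total <- ss_between + ss_within       # residual SS of the reduced model
df_total <- df_between + df_within       # number of fish minus 1

# The printed residual standard error is the square root of the within MS
resid_se <- sqrt(ss_within / df_within)

print(c(ss_total = ss_total, df_total = df_total, resid_se = round(resid_se, 3)))
```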