
ANOVA example

Polychlorinated biphenyls (PCBs), previously used in the manufacture of large electrical transformers and capacitors, are extremely hazardous contaminants when released into the environment. Samples of fish were taken from each of four rivers and analyzed for PCB concentration (in ppm).

Question

Do the data provide sufficient evidence to indicate differences in the mean PCB concentration in fish for the four rivers?

Hypotheses:

– H0: $\mu_1 = \mu_2 = \mu_3 = \mu_4$

– HA: the means are not all equal (at least one mean is not equal to the others).

First Step

Examine the data. What does this mean?

– Boxplots

– Histograms

– Normal quantile plots

– Note the command-line language for drawing a grid of 4 probability plots at once. Go to File-->New-->Script File, paste the commands below into the script file window, and press F10; the 4 plots are produced automatically.

– OR: just paste these into the command line.

par(mfrow=c(2,2))   # arrange the plots in a 2 x 2 grid

for (i in 1:4) {
  # column 1 of PCBfish holds the measurements, column 2 the river number
  qqnorm(PCBfish[,1][PCBfish[,2]==i], ylab="Data quantiles")
  title(paste("River ", i, sep=""))
}

Can we do an ANOVA? What are the criteria?

– Normally distributed

– Equal standard deviations

– Independent samples across treatments: what might this look like if it weren’t true? Rivers connected?

– Independent samples within treatments: what might this look like if it weren’t true? Clustering?

Transformations (p. 65 & 69 of Sleuth)

Log transformation.

– Why try this? The ratio of largest to smallest is > 10, the data are skewed, and the group with the larger average has the larger spread.

When to use the reciprocal?

– Data are waiting times

When to use the square root?

– Data are counts

Better?

Why or why not? Standard deviations are much more similar.

Do an ANOVA

Read the table:

– sum of squares

– $s_{pooled}$ and $s_{pooled}^2$

– F-value

– p-value

What are your conclusions?

Conclusions

We can reject the null hypothesis of no difference in these group means.

At least one of the means is different from the others (is this statement the same as accepting the alternative hypothesis?)

“Convincing evidence exists that median PCB concentration of fish in these rivers is different (p-value of 0.002; analysis of variance F-test).”

Compare just two rivers...

Average and 95% CI for the difference in PCB in fish between Rivers 1 and 2.

Logged data, so…

– 1.09 − 1.52 = river1 − river2 = −0.43

– $e^{-0.43} = 0.65$

– The median concentration of PCB in fish in River 1 is 0.65 times that of fish in River 2.
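The back-transformation above can be checked in a few lines. This is a minimal sketch that takes the two group means of log(PCB) from the slides (1.09 for River 1, 1.52 for River 2); the variable names are illustrative only.

```r
# Group means of log(PCB), taken from the slides
mean_log_river1 <- 1.09
mean_log_river2 <- 1.52

# Difference on the log scale (River 1 minus River 2)
diff_log <- mean_log_river1 - mean_log_river2

# Exponentiating a difference of log-scale means estimates a ratio of medians
ratio <- exp(diff_log)

print(round(diff_log, 2))
print(round(ratio, 2))
```

Because the analysis is on the log scale, the back-transformed difference is read as a multiplicative effect on medians rather than an additive effect on means.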

Is this significant?

$H_0: \mu_{River1} - \mu_{River2} = 0$; $H_A: \mu_{River1} - \mu_{River2} \neq 0$

Two-sided, two-sample t-test: we must calculate the t-statistic (and p-value) by hand, because we need to use $s_{pooled}$ to calculate the SE.

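$s_{pooled}$:

$$ s_{pooled} = \sqrt{\frac{24(0.806)^2 + 19(0.956)^2 + 22(1.023)^2 + 23(0.897)^2}{24 + 19 + 22 + 23}} = 0.92 $$

SE:

$$ SE = 0.92\sqrt{\frac{1}{25} + \frac{1}{20}} = 0.28 $$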
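A minimal sketch of the pooled standard deviation and the standard error, using the four group standard deviations from the slides. The group sizes 25, 20, 23 and 24 are an assumption: 25 and 20 appear in the SE formula, and the other two are chosen so that the degrees of freedom (n − 1 per group) sum to the 88 residual df reported in the S+ printout.

```r
# Group standard deviations of log(PCB), from the slides
sds <- c(0.806, 0.956, 1.023, 0.897)

# Assumed group sizes (see lead-in); degrees of freedom are n - 1 per group
n  <- c(25, 20, 23, 24)
df <- n - 1

# Pooled SD: square root of the df-weighted average of the group variances
s_pooled <- sqrt(sum(df * sds^2) / sum(df))

# SE of the difference between the River 1 and River 2 means
se_diff <- s_pooled * sqrt(1 / n[1] + 1 / n[2])

print(round(s_pooled, 2))
print(round(se_diff, 2))
```

Note that sum(df * sds^2) is the within-group (residual) sum of squares, which is why the pooled SD agrees with the residual standard error in the ANOVA output.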

Hypothesis test

Test the hypothesis that River1 − River2 = 0

– Estimate/SE: $-0.43 / 0.28 = -1.54$

pt(-1.54,45)

[1] 0.06528174

– pt() returns the one-sided tail area; doubling it gives a two-sided p-value of about 0.13.

– Suggestive only of a difference (in fact, at the 0.05 level, we would not reject the null), but we’ll still do a CI for practice
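The hand calculation can be sketched as follows. The estimate and SE are the rounded values from the slides. The slide's pt() call uses 45 degrees of freedom, while the pooled SD carries 88 residual df; which is intended is not stated, so both are computed here as an assumption, along with the two-sided p-value that matches the stated two-sided test.

```r
# Rounded values from the slides
estimate <- -0.43   # difference in log-scale means, River 1 - River 2
se_diff  <- 0.28    # standard error based on the pooled SD

# t-statistic: estimate divided by its standard error
t_stat <- estimate / se_diff

# One-sided tail area with 45 df, as in the slide's pt() call
p_one_sided_45 <- pt(t_stat, df = 45)

# Two-sided p-value with the 88 residual df of the pooled SD (assumption)
p_two_sided_88 <- 2 * pt(-abs(t_stat), df = 88)

print(round(t_stat, 2))
print(p_one_sided_45)
print(p_two_sided_88)
```

Either way the conclusion on the slide is unchanged: the p-value is above 0.05.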

95% CI

95% CI for the difference in group means

– qt(0.975,88)

[1] 1.98729

– $-0.43 \pm (1.99)(0.28) \rightarrow (-0.98, 0.13)$

– $e^{-0.98} = 0.37$; $e^{0.13} = 1.14$

– Fish in River 1 have between 0.37 and 1.14 times as much PCB in their muscle as fish in River 2. (Are we surprised that this covers 1?)
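A short sketch of the interval calculation, starting from the rounded estimate (−0.43) and SE (0.28) on the slides. Because the inputs are rounded, the last digit of the endpoints can differ slightly from the slide's values.

```r
# Rounded values from the slides
estimate <- -0.43   # difference in log-scale means, River 1 - River 2
se_diff  <- 0.28    # standard error based on the pooled SD
df_resid <- 88      # residual degrees of freedom

# t multiplier for a 95% two-sided interval
t_mult <- qt(0.975, df_resid)

# Interval on the log scale, then back-transformed to a ratio of medians
ci_log   <- estimate + c(-1, 1) * t_mult * se_diff
ci_ratio <- exp(ci_log)

print(round(t_mult, 5))
print(round(ci_log, 2))
print(round(ci_ratio, 2))
```

The log-scale interval contains 0, so the ratio interval contains 1, which agrees with the t-test not rejecting at the 0.05 level.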

ANOVA Explanation

Reduced model = equal-means model

– All these rivers have the same mean PCB concentration in the fish: null hypothesis

How wrong are we for this hypothesis?

– Residual error is how wrong we are

– Large residuals here mean the null hypothesis fits poorly

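Graph of PCB in Each River: Equal Means

[Figure: log.pcb (vertical axis) plotted against river 1–4 (horizontal axis), with a horizontal line at the equal-means average = 1.64. A bracket marks the residual from the highest point in River 1 to the equal-means average.]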

ANOVA by hand (conceptual)

River  Log(PCB)  Equal Means Est.  Equal Means Res.  Sep. Means Est.  Sep. Means Res.
1       0.83     1.64              -0.81
1       1.86     1.64               0.22
2       0.06     1.64              -1.58
2       0.22     1.64              -1.42
3       1.14     1.64              -0.50
3      -0.78     1.64              -2.42
4       1.45     1.64              -0.19
4       3.11     1.64               1.47

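Graph of PCB in Each River: Separate Means

[Figure: log.pcb (vertical axis) plotted against river 1–4 (horizontal axis), with a separate mean for each river. A bracket marks the residual from the highest point in River 1 to the separate-means model.]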

ANOVA by hand (conceptual)

River  Log(PCB)  Equal Means Est.  Equal Means Res.  Sep. Means Est.  Sep. Means Res.
1       0.83     1.64              -0.81             1.09             -0.26
1       1.86     1.64               0.22             1.09              0.77
2       0.06     1.64              -1.58             1.52             -1.46
2       0.22     1.64              -1.42             1.52             -1.30
3       1.14     1.64              -0.50             1.89             -0.75
3      -0.78     1.64              -2.42             1.89             -2.67
4       1.45     1.64              -0.19             2.09             -0.64
4       3.11     1.64               1.47             2.09              1.02
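The residual columns in the table can be reproduced with a short sketch. It uses the eight sample points and the model estimates shown on the slides (equal-means average 1.64; separate means 1.09, 1.52, 1.89, 2.09); the estimates come from the full data set, not from these eight points alone.

```r
# Eight example observations from the slides
river   <- c(1, 1, 2, 2, 3, 3, 4, 4)
log_pcb <- c(0.83, 1.86, 0.06, 0.22, 1.14, -0.78, 1.45, 3.11)

# Model estimates from the slides (fitted to the full data set)
equal_mean <- 1.64
sep_means  <- c(1.09, 1.52, 1.89, 2.09)

# Residual = observation minus the model's estimate for that observation
res_equal <- log_pcb - equal_mean
res_sep   <- log_pcb - sep_means[river]   # look up each river's own mean

print(data.frame(river, log_pcb,
                 res_equal = round(res_equal, 2),
                 res_sep   = round(res_sep, 2)))
```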

Model Inaccuracy

If the null hypothesis is correct,

– The two models should be about equal in their ability to explain the data

– AND, the magnitudes of the residuals should be about the same

If the null hypothesis is incorrect,

– The magnitudes of the residuals from the equal-means model will tend to be larger

– Their larger sizes reflect model inaccuracy

Residual Sum of Squares

We need a single summary of the residuals for a particular model.

Statisticians have chosen the sum of the squared residuals -- the residual sum of squares

Extra Sum of Squares

The error from the reduced (equal-means) model minus the error from the full (separate-means) model is the difference in the sizes of the residuals from the two models.

This is called the Extra Sum of Squares (ESS). Another way to say this: the ESS measures the amount of unexplained variability in the reduced model that is explained by the full model.

How much better is it to say that each river has its own mean than to say that all the rivers have the same mean?

Thus: $ESS = RSS_{reduced} - RSS_{full}$
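A minimal sketch of the extra sum of squares computed directly from data. For illustration it uses only the eight example observations from the by-hand table, so the means and sums of squares here are those of this small subset and will not match the full-data values (1.64, 14.018, and so on).

```r
# Eight example observations from the slides (illustrative subset only)
log_pcb <- c(0.83, 1.86, 0.06, 0.22, 1.14, -0.78, 1.45, 3.11)
river   <- factor(c(1, 1, 2, 2, 3, 3, 4, 4))

# Reduced (equal-means) model: every observation is estimated by the grand mean
rss_reduced <- sum((log_pcb - mean(log_pcb))^2)

# Full (separate-means) model: each observation is estimated by its river's mean
rss_full <- sum((log_pcb - ave(log_pcb, river))^2)

# Extra sum of squares: variability explained by letting the means differ
ess <- rss_reduced - rss_full

print(c(rss_reduced = rss_reduced, rss_full = rss_full, ess = ess))
```

The same quantity appears as the between-groups sum of squares in the table produced by anova(lm(log_pcb ~ river)).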

F-Statistic

How much difference in the models is enough to say it is significant (the same questions we’ve asked through t-tests, etc)?

We compare these two levels of unexplained variability in an F-test.

We take their difference, divide by the extra degrees of freedom, and scale the result by the best estimate we have of the variance (the estimate from the full model).

F-test (cont)

Large F-statistics are associated with large differences in the size of residuals from the two models.

This is evidence against the reduced model (null hyp) and in favor of the full model (different means).

This test is summarized by its p-value (based on an F-distribution).

$$ F\text{-statistic} = \frac{(\text{Extra sum of squares}) / (\text{Extra degrees of freedom})}{\hat{\sigma}^2_{full}} $$
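The F-statistic can be rebuilt from the sums of squares and degrees of freedom reported in the S+ printout (river: 3 df, 14.018; residuals: 88 df, 74.488). This is a sketch of the arithmetic only; the variable names are illustrative.

```r
# Values from the S+ printout in the slides
ess      <- 14.018   # extra (between-groups) sum of squares
extra_df <- 3        # number of groups minus 1
rss_full <- 74.488   # residual sum of squares of the full model
df_full  <- 88       # residual degrees of freedom of the full model

# Best estimate of the variance: full-model residual mean square
sigma2_full <- rss_full / df_full

# F-statistic: extra sum of squares per extra df, scaled by the variance estimate
f_stat <- (ess / extra_df) / sigma2_full

# p-value: upper tail of the F distribution with (extra_df, df_full) df
p_value <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)

# Residual sum of squares of the reduced (equal-means) model
rss_reduced <- ess + rss_full

print(round(f_stat, 2))
print(p_value)
print(rss_reduced)
```

The last line recovers the reduced-model (total) sum of squares, which is the row missing from the printout.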

ANOVA Table

Var source | In other words | SS | df | MS | F
Between groups | Error | Total SSR − within SSR (reduced − full) | Total df − within df (reduced − full) | A: SS/df | A/B: (SS between / df) / (SS within / df)
Within groups | Full model | SSR full | Total df − number of groups | B: SS/df |
Total | Reduced model | SSR reduced | Total df − 1 | |

S+ Printout

Residual standard error: 0.9200322

           Df  Sum of Sq  Mean Sq  F Value   Pr(F)
river       3     14.018    4.673    5.520  0.0016
Residuals  88     74.488    0.846

We can reject the null hypothesis of no difference in medians. At least one river has a different median PCB concentration.

For some reason, S+ does not print out the reduced model information (total) that is on the ANOVA table we make by hand.
