stat6140_assignment4_2014
Assignment 4 Design and Analysis of Experiments
STAT 6140
Due: Wednesday in the 12th week of class
* You may discuss this assignment among yourselves but the work that you submit must be your own.
* Your work must be typed up in the spaces provided. Additional sheets with handwritten material will not be graded.
* DO NOT ERASE THE MARKS ALLOCATED FOR EACH SECTION.
* Appropriate R commands and output as well as output from Minitab and SPSS need to be pasted in the spaces provided. Please adjust the space to accommodate ALL your answers.
* There are 3 questions in this Assignment and you are required to answer all questions.
NAME: PAUL GOKOOL
ID#: 806003199
QUESTION 1
Consider the following scenario:
An experiment on the yield of three varieties of oats (factor A) and four different levels of manure (factor B) was described by F. Yates in his 1935 paper Complex Experiments. The experimental area was divided into s = 6 blocks. Each of these was then subdivided into a = 3 whole plots. The varieties of oat were sown on the whole plots according to a randomized complete block design (so that every variety appeared in every block exactly once). Each whole plot was then divided into b = 4 split plots, and the levels of manure were applied to the split plots according to a randomized complete block design (so that every level of B appeared in every whole plot exactly once).
(i) Identify the Factors and blocks in this experiment. Identify the whole plot factor and the split plot factor. (3 marks)
The factors are oat variety (factor A, a = 3 levels) and manure (factor B, b = 4 levels).
The blocks are the s = 6 sections into which the experimental area was divided.
Each block is divided into 3 equal-sized whole plots, and the oat varieties are assigned to the whole plots within each block according to a randomized complete block design. Oat variety is therefore the whole-plot factor.
Each whole plot is divided into 4 split plots, and the four levels of manure are randomly assigned to the 4 split plots. Manure is therefore the split-plot factor.
(ii) Write the model for this experiment. ( 3 marks)
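$Y_{hij} = \mu + \theta_h + \alpha_i + \varepsilon^{W}_{i(h)} + \beta_j + (\alpha\beta)_{ij} + \varepsilon^{S}_{j(hi)}$,
where $\varepsilon^{W}_{i(h)} \sim NID(0, \sigma_W^2)$ and $\varepsilon^{S}_{j(hi)} \sim NID(0, \sigma_S^2)$ are mutually independent, with $h = 1, 2, \ldots, s$, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$.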
(iii) Complete an ANOVA Table showing clearly the df column, SS, MS and F ratios that will be of interest in this experiment. (3 marks)
Source of Variation   DF            SS        MS     F
Block                 s-1           SSθ       ___    ___
Oats (A)              a-1           SSA       MSA    MSA/MSEW
Whole-plot error      (a-1)(s-1)    SSEW      MSEW
Manure (B)            b-1           SSB       MSB    MSB/MSES
AB                    (a-1)(b-1)    SSAB      MSAB   MSAB/MSES
Split-plot error      a(b-1)(s-1)   SSES      MSES
Total                 n-1           SSTotal
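As a hedged illustration of how a table of this form could be produced in R, the sketch below assumes the `oats` data frame from the MASS package, which records this field trial with columns `B` (block), `V` (variety), `N` (manure level) and `Y` (yield). The `Error(B / V)` term declares whole plots nested within blocks, so variety is tested against the whole-plot error while manure and the interaction are tested against the split-plot error. The object names `oats.aov`, `df.whole` and `df.split` are illustrative only.

```r
# Sketch only: assumes the MASS package (distributed with R) is available.
library(MASS)

# Split-plot ANOVA: whole plots (varieties) are nested within blocks.
oats.aov <- aov(Y ~ V * N + Error(B / V), data = oats)
summary(oats.aov)

# Error degrees of freedom in each stratum, to compare with the table above
df.whole <- oats.aov[["B:V"]]$df.residual     # (a-1)(s-1)
df.split <- oats.aov[["Within"]]$df.residual  # a(b-1)(s-1)
c(whole.plot = df.whole, split.plot = df.split)
```

With a = 3, b = 4 and s = 6 the two error lines should carry (a-1)(s-1) = 10 and a(b-1)(s-1) = 45 degrees of freedom.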
QUESTION 2
(10 marks, 10 marks)
R Code:
life.hours <- matrix(c(22,31,25,32,43,29,35,34,50,55,47,46,
44,45,38,40,37,36,60,50,54,39,41,47),byrow=T,ncol=3)
dimnames(life.hours) <- list(
c("(1)","a","b","ab","c","ac","bc","abc"),c("Rep1","Rep2","Rep3"))
A <- rep(c(-1,1),4)
B <- rep(c(-1,-1,1,1),2)
C <- c(rep(-1,4),rep(1,4))
Total <- apply(life.hours,1,sum)
cbind(A,B,C,life.hours,Total)
# #reps: n=3
n <- 3
# Effect estimates are differences of averages of 4 means ("runs")
# Effect estimates:
Aeff <- (Total %*% A)/(4*n)
Beff <- (Total %*% B)/(4*n)
Ceff <- (Total %*% C)/(4*n)
# Interaction effects
AB <- A*B
AC <- A*C
BC <- B*C
ABC <- A*B*C
cbind(A,B,C,AB,AC,BC,ABC,Total)
ABeff <- (Total %*% AB)/(4*n)
ACeff <- (Total %*% AC)/(4*n)
BCeff <- (Total %*% BC)/(4*n)
ABCeff <- (Total %*% ABC)/(4*n)
# Summary
Effects <- t(Total) %*% cbind(A,B,C,AB,AC,BC,ABC)/(4*n)
Summary <- rbind( cbind(A,B,C,AB,AC,BC,ABC),Effects )
dimnames(Summary)[[1]] <- c(dimnames(life.hours)[[1]],"Effect")
Summary
# Fit as an ANOVA model
life.vec <- c(t(life.hours))
Af <- rep(as.factor(A),rep(3,8))
Bf <- rep(as.factor(B),rep(3,8))
Cf <- rep(as.factor(C),rep(3,8))
options(contrasts=c("contr.sum","contr.poly"))
life.lm <- lm(life.vec ~ Af*Bf*Cf)
summary(life.lm)
anova(life.lm)
model.matrix(life.lm)
R Output:
> life.hours <- matrix(c(22,31,25,32,43,29,35,34,50,55,47,46,
+ 44,45,38,40,37,36,60,50,54,39,41,47),byrow=T,ncol=3)
> dimnames(life.hours) <- list(
+ c("(1)","a","b","ab","c","ac","bc","abc"),c("Rep1","Rep2","Rep3"))
> A <- rep(c(-1,1),4)
> B <- rep(c(-1,-1,1,1),2)
> C <- c(rep(-1,4),rep(1,4))
> Total <- apply(life.hours,1,sum)
> cbind(A,B,C,life.hours,Total)
A B C Rep1 Rep2 Rep3 Total
(1) -1 -1 -1 22 31 25 78
a 1 -1 -1 32 43 29 104
b -1 1 -1 35 34 50 119
ab 1 1 -1 55 47 46 148
c -1 -1 1 44 45 38 127
ac 1 -1 1 40 37 36 113
bc -1 1 1 60 50 54 164
abc 1 1 1 39 41 47 127
> # #reps: n=3
> n <- 3
> # Effect estimates are differences of averages of 4 means ("runs")
> # Effect estimates:
> Aeff <- (Total %*% A)/(4*n)
> Beff <- (Total %*% B)/(4*n)
> Ceff <- (Total %*% C)/(4*n)
> # Interaction effects
> AB <- A*B
> AC <- A*C
> BC <- B*C
> ABC <- A*B*C
> cbind(A,B,C,AB,AC,BC,ABC,Total)
A B C AB AC BC ABC Total
(1) -1 -1 -1 1 1 1 -1 78
a 1 -1 -1 -1 -1 1 1 104
b -1 1 -1 -1 1 -1 1 119
ab 1 1 -1 1 -1 -1 -1 148
c -1 -1 1 1 -1 -1 1 127
ac 1 -1 1 -1 1 -1 -1 113
bc -1 1 1 -1 -1 1 -1 164
abc 1 1 1 1 1 1 1 127
> ABeff <- (Total %*% AB)/(4*n)
> ACeff <- (Total %*% AC)/(4*n)
> BCeff <- (Total %*% BC)/(4*n)
> ABCeff <- (Total %*% ABC)/(4*n)
> # Summary
> Effects <- t(Total) %*% cbind(A,B,C,AB,AC,BC,ABC)/(4*n)
> Summary <- rbind( cbind(A,B,C,AB,AC,BC,ABC),Effects )
> dimnames(Summary)[[1]] <- c(dimnames(life.hours)[[1]],"Effect")
> Summary
A B C AB AC BC ABC
(1) -1.0000000 -1.00000 -1.000000 1.000000 1.000000 1.000000 -1.000000
a 1.0000000 -1.00000 -1.000000 -1.000000 -1.000000 1.000000 1.000000
b -1.0000000 1.00000 -1.000000 -1.000000 1.000000 -1.000000 1.000000
ab 1.0000000 1.00000 -1.000000 1.000000 -1.000000 -1.000000 -1.000000
c -1.0000000 -1.00000 1.000000 1.000000 -1.000000 -1.000000 1.000000
ac 1.0000000 -1.00000 1.000000 -1.000000 1.000000 -1.000000 -1.000000
bc -1.0000000 1.00000 1.000000 -1.000000 -1.000000 1.000000 -1.000000
abc 1.0000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000
Effect 0.3333333 11.33333 6.833333 -1.666667 -8.833333 -2.833333 -2.166667
> # Fit as an ANOVA model
> life.vec <- c(t(life.hours))
> Af <- rep(as.factor(A),rep(3,8))
> Bf <- rep(as.factor(B),rep(3,8))
> Cf <- rep(as.factor(C),rep(3,8))
> options(contrasts=c("contr.sum","contr.poly"))
> life.lm <- lm(life.vec ~ Af*Bf*Cf)
> summary(life.lm)
Call:
lm(formula = life.vec ~ Af * Bf * Cf)
Residuals:
Min 1Q Median 3Q Max
-5.667 -3.500 -1.167 3.167 10.333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.8333 1.1211 36.421 < 2e-16 ***
Af1 -0.1667 1.1211 -0.149 0.883680
Bf1 -5.6667 1.1211 -5.054 0.000117 ***
Cf1 -3.4167 1.1211 -3.048 0.007679 **
Af1:Bf1 -0.8333 1.1211 -0.743 0.468078
Af1:Cf1 -4.4167 1.1211 -3.939 0.001172 **
Bf1:Cf1 -1.4167 1.1211 -1.264 0.224475
Af1:Bf1:Cf1 1.0833 1.1211 0.966 0.348282
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.492 on 16 degrees of freedom
Multiple R-squared: 0.7696, Adjusted R-squared: 0.6689
F-statistic: 7.637 on 7 and 16 DF, p-value: 0.0003977
> anova(life.lm)
Analysis of Variance Table
Response: life.vec
Df Sum Sq Mean Sq F value Pr(>F)
Af 1 0.67 0.67 0.0221 0.8836803
Bf 1 770.67 770.67 25.5470 0.0001173 ***
Cf 1 280.17 280.17 9.2873 0.0076787 **
Af:Bf 1 16.67 16.67 0.5525 0.4680784
Af:Cf 1 468.17 468.17 15.5193 0.0011722 **
Bf:Cf 1 48.17 48.17 1.5967 0.2244753
Af:Bf:Cf 1 28.17 28.17 0.9337 0.3482825
Residuals 16 482.67 30.17
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Table 1. Effect Estimate Summary for Question 2
Factor   Effect Estimate   Sum of Squares   Percent Contribution
A          0.3333333            0.67              0.03198
B         11.33333            770.67             36.77984
C          6.833333           280.17             13.37097
AB        -1.666667            16.67              0.79557
AC        -8.833333           468.17             22.34318
BC        -2.833333            48.17              2.29889
ABC       -2.166667            28.17              1.34440
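The effect estimates for B and C and for the AC interaction are large relative to the others; together they account for about 72% of the total sum of squares.
From the analysis of variance B has p-value 0.0001173, C has p-value 0.0076787 and AC has p-value 0.0011722. These are all significant at α=0.05. Because the AC interaction is significant, the effect of C should be interpreted together with the level of A.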
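The sums of squares and percent contributions in Table 1 can be recovered directly from the contrast totals, since in a replicated 2^3 design each effect has SS = (contrast)^2 / (8n). The following is a minimal sketch of that arithmetic using the data and sign columns from the R code above; the names `contrasts`, `contrast.totals`, `ss` and `pct` are introduced here for illustration.

```r
# Data from Question 2: rows are treatment combinations, columns are replicates
life.hours <- matrix(c(22, 31, 25, 32, 43, 29, 35, 34, 50, 55, 47, 46,
                       44, 45, 38, 40, 37, 36, 60, 50, 54, 39, 41, 47),
                     byrow = TRUE, ncol = 3)
n <- 3                                   # replicates per treatment combination
A <- rep(c(-1, 1), 4)
B <- rep(c(-1, -1, 1, 1), 2)
C <- c(rep(-1, 4), rep(1, 4))
contrasts <- cbind(A, B, C, AB = A * B, AC = A * C, BC = B * C, ABC = A * B * C)

Total <- rowSums(life.hours)
contrast.totals <- drop(Total %*% contrasts)

effects <- contrast.totals / (4 * n)     # effect estimates
ss <- contrast.totals^2 / (8 * n)        # sum of squares for each effect
ss.total <- sum((life.hours - mean(life.hours))^2)
pct <- 100 * ss / ss.total               # percent contribution

round(cbind(Effect = effects, SS = ss, Percent = pct), 3)
```

The percent contribution is taken relative to the total sum of squares, which includes the residual sum of squares, so the seven percentages do not add to 100.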
QUESTION 3
At the beginning of this class, I mentioned that the paired t-test can be used to compare the prices of the SAME grocery items in two stores; this is useful in finding the supermarket with the lower prices. This concept is now extended to 4 stores. The data are given below. Write an appropriate model, and analyze the data at the 0.05 level. The goal is to determine if the prices differ between stores. Can you advise a customer which store they should purchase from to save money?
                    storeA  storeB  storeC  storeD
lettuce               1.17    1.78    1.29    1.29
potatoes              1.77    1.98    1.99    1.99
milk                  1.49    1.69    1.79    1.59
eggs                  0.65    0.99    0.69    1.09
bread                 1.58    1.70    1.89    1.89
cereal                3.13    3.15    2.99    3.09
ground.beef           2.09    1.88    2.09    2.49
tomato.soup           0.62    0.65    0.65    0.69
laundry.detergent     5.89    5.99    5.99    6.99
aspirin               4.46    4.84    4.99    5.15
Analysis of Data [You are free to use ANY software package of your choice!]
(i) Model: (3 marks)
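This is a single-factor design with repeated measures:
$y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$, $i = 1, \ldots, a$ ($a = 4$ stores), $j = 1, \ldots, n$ ($n = 10$ items)
$y_{ij}$ is the response of subject (grocery item) $j$ to treatment (store) $i$; only $n$ subjects are used, and each is observed under every treatment
$\alpha_i$ is the effect of the $i$th treatment
$\beta_j$ is a parameter associated with the $j$th subject
$\varepsilon_{ij}$ is the random error
$\sum_{i=1}^{a} \alpha_i = 0$, $\varepsilon_{ij} \sim NID(0, \sigma^2)$, $\beta_j \sim NID(0, \sigma_\beta^2)$
The treatment effects $\alpha_i$ are fixed and the subject effects $\beta_j$ are random.
Store and subject are our sources of variability. The treatment we are interested in is store, and every store is observed within each subject, so store is a within-subject factor.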
(ii) Hypotheses (based on the model above): (3 marks)
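$H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
$H_A$: at least one $\alpha_i \neq 0$, $i = 1, 2, 3, 4$
$H_0: \sigma_\beta^2 = 0$ (the subject effect $\beta_j$ is random, so the hypothesis concerns its variance)
$H_A: \sigma_\beta^2 > 0$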
(iii) Computer Output: (3 marks)
R code:
> groceries = read.table("groceriesdata.txt", header=T);groceries
price store subject
1 1.17 storeA lettuce
2 1.77 storeA potatoes
3 1.49 storeA milk
4 0.65 storeA eggs
5 1.58 storeA bread
6 3.13 storeA cereal
7 2.09 storeA ground.beef
8 0.62 storeA tomato.soup
9 5.89 storeA laundry.detergent
10 4.46 storeA aspirin
11 1.78 storeB lettuce
12 1.98 storeB potatoes
13 1.69 storeB milk
14 0.99 storeB eggs
15 1.70 storeB bread
16 3.15 storeB cereal
17 1.88 storeB ground.beef
18 0.65 storeB tomato.soup
19 5.99 storeB laundry.detergent
20 4.84 storeB aspirin
21 1.29 storeC lettuce
22 1.99 storeC potatoes
23 1.79 storeC milk
24 0.69 storeC eggs
25 1.89 storeC bread
26 2.99 storeC cereal
27 2.09 storeC ground.beef
28 0.65 storeC tomato.soup
29 5.99 storeC laundry.detergent
30 4.99 storeC aspirin
31 1.29 storeD lettuce
32 1.99 storeD potatoes
33 1.59 storeD milk
34 1.09 storeD eggs
35 1.89 storeD bread
36 3.09 storeD cereal
37 2.49 storeD ground.beef
38 0.69 storeD tomato.soup
39 6.99 storeD laundry.detergent
40 5.15 storeD aspirin
> with(groceries, tapply(price, store, sum))
storeA storeB storeC storeD
22.85 24.65 24.36 26.26
> # It appears we should shop at Store A, which has the lowest total (and hence lowest mean) price over the 10 items.
> aov.out = aov(price ~ store + Error(subject/store), data=groceries)
> summary(aov.out)
Error: subject
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 9 115.2 12.8
Error: subject:store
Df Sum Sq Mean Sq F value Pr(>F)
store 3 0.5859 0.19529 4.344 0.0127 *
Residuals 27 1.2137 0.04495
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Store is significant at alpha=0.05.
> with(groceries, pairwise.t.test(price, store,p.adjust.method="holm", paired=T))
Pairwise comparisons using paired t tests
data: price and store
storeA storeB storeC
storeB 0.17 - -
storeC 0.17 0.69 -
storeD 0.07 0.49 0.33
P value adjustment method: holm
> # None of the pairwise differences is significant at alpha=0.05 after the Holm adjustment.
> # However, store A vs store D is the closest to being different with p=0.07.
> bartlett.test(price~store, data=groceries)
Bartlett test of homogeneity of variances
data: price by store
Bartlett's K-squared = 0.2701, df = 3, p-value = 0.9655
> # Variances appear to be constant across stores.
> shapiro.test(groceries$price)
Shapiro-Wilk normality test
data: groceries$price
W = 0.8384, p-value = 4.751e-05
> # The distribution of prices does not appear to be normal
> kruskal.test(price~store, data=groceries)
Kruskal-Wallis rank sum test
data: price by store
Kruskal-Wallis chi-squared = 0.5431, df = 3, p-value = 0.9093
> # The prices at different stores do not differ significantly.
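(iv) Diagnostics: (3 marks)
The Bartlett test gives no evidence against equal variances across stores (K-squared = 0.2701, df = 3, p-value = 0.9655). The Shapiro-Wilk test rejects normality of the prices (W = 0.8384, p-value = 0.00005). Note that this test was applied to the raw prices, which vary widely from item to item, rather than to the residuals of the model.
(v) Conclusions (3 marks)
The repeated-measures ANOVA shows a significant store effect at the 0.05 level (F = 4.344 on 3 and 27 df, p-value = 0.0127), so the mean prices differ between stores. Store A has the lowest total price for the 10 items (22.85, against 24.65, 24.36 and 26.26 for stores B, C and D), so it appears that a customer should shop at store A.
(vi) Further tests (if necessary- if you don’t need to do any more give a reason why) (3 marks)
We carry out pairwise paired t-tests with the Holm adjustment to determine which stores differ. None of the pairwise differences is significant at the 0.05 level; store A versus store D comes closest (p-value = 0.07). Since the prices are not normally distributed, we also use a nonparametric test, the Kruskal-Wallis test, to determine whether prices are equal across stores. It does not reject equality (p-value = 0.9093). However, the Kruskal-Wallis test treats the 40 prices as independent samples and ignores the matching by item, so it is far less sensitive here than the repeated-measures ANOVA.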
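A block-aware version of these checks can be sketched in R as follows. This is not part of the submitted analysis: it assumes the price table from the question is typed in as a matrix (the names `prices`, `rb.fit` and `rb.tab` are illustrative), fits the equivalent randomized block model with item as the block, applies the Shapiro-Wilk test to the residuals instead of the raw prices, and uses the Friedman test as one nonparametric alternative that respects the matching by item.

```r
items  <- c("lettuce", "potatoes", "milk", "eggs", "bread", "cereal",
            "ground.beef", "tomato.soup", "laundry.detergent", "aspirin")
stores <- c("storeA", "storeB", "storeC", "storeD")

# Rows are items (blocks), columns are stores (treatments)
prices <- matrix(c(1.17, 1.78, 1.29, 1.29,
                   1.77, 1.98, 1.99, 1.99,
                   1.49, 1.69, 1.79, 1.59,
                   0.65, 0.99, 0.69, 1.09,
                   1.58, 1.70, 1.89, 1.89,
                   3.13, 3.15, 2.99, 3.09,
                   2.09, 1.88, 2.09, 2.49,
                   0.62, 0.65, 0.65, 0.69,
                   5.89, 5.99, 5.99, 6.99,
                   4.46, 4.84, 4.99, 5.15),
                 ncol = 4, byrow = TRUE, dimnames = list(items, stores))

# Long format: c(prices) stacks the matrix column by column (store by store)
groceries <- data.frame(price = c(prices),
                        store = factor(rep(stores, each = length(items))),
                        item  = factor(rep(items, times = length(stores))))

# Randomized block fit; the F test for store matches the repeated-measures ANOVA
rb.fit <- lm(price ~ item + store, data = groceries)
rb.tab <- anova(rb.fit)
rb.tab

# Normality check on the residuals of the fitted model
shapiro.test(residuals(rb.fit))

# Nonparametric test that keeps the blocking by item
friedman.test(prices)
```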
QUESTION 4
A manufacturer of industrial textiles is trying to develop a method to produce synthetic spider silk with high strength. There are 5 types of polymer solution. The manufacturer has prepared 4 replicate batches of fiber for each solution type and has measured the tensile strength of the fiber from a sample from each batch. The manufacturer wishes to determine if there are solution effects on tensile strength.
Fiber diameter is an important determinant of strength. Each of the solutions should be able to create fibers of the same diameter. However, because the industrial process is not perfected, diameter varies significantly from batch to batch. Hence, the manufacturer decides to use fiber diameter as a covariate in the analysis.
The data are below and are also on the class Web page.
obs  rep  solution  diameter  strength
  1    1         1       117      73.5
  2    2         1       115      70.9
  3    3         1       103      68.5
  4    4         1       115      71.6
  5    1         2       105      69.3
  6    2         2       102      63.9
  7    3         2       109      71.0
  8    4         2       106      70.4
  9    1         3       108      72.9
 10    2         3       108      71.2
 11    3         3       107      70.3
 12    4         3       116      74.3
 13    1         4       103      69.5
 14    2         4       118      78.8
 15    3         4       103      70.9
 16    4         4       119      77.2
 17    1         5       110      64.6
 18    2         5       110      67.7
 19    3         5       103      62.9
 20    4         5       104      65.8
a. Write a linear model for the observations of strength, which includes effects for solution (S), diameter (D) and all of the interactions. Make sure that all of the terms in the model are fully defined, including distribution of any random terms, and any constraints on the model parameters. (Assume that the solution and diameter effects are fixed.) (3 marks)
$Y_{ij} = \mu + \tau_i + \beta_i (x_{ij} - \bar{x}_{..}) + \varepsilon_{ij}$, $i = 1, 2, \ldots, 5$ (5 solutions), $j = 1, 2, \ldots, n$ ($n = 4$)
$Y_{ij}$ is the value of strength for the $j$th sample from the $i$th solution
$\tau_i$ is the effect of the $i$th solution
$\beta_i$ is the slope of the regression of strength on diameter for the $i$th solution; allowing the slope to depend on $i$ is what carries the solution by diameter interaction
$x_{ij}$ is the value of diameter for the $j$th sample from the $i$th solution
$\bar{x}_{..}$ is the mean of the $x_{ij}$ values
$\varepsilon_{ij} \sim NID(0, \sigma^2)$ is the random error
$\sum_{i=1}^{5} \tau_i = 0$, $\beta_i \neq 0$, and $x_{ij}$ is not affected by the treatments
b. Fit the model in part a and test for a treatment by covariate interaction. Include null and alternative hypotheses, test statistic, p-value, and your conclusion written as a sentence. (You do not need to check the residual plots.) (USE R!) (3 marks)
Analysis of Variance Table
Response: strength
                          Df  Sum Sq Mean Sq F value    Pr(>F)
factor(solution)           4 187.413  46.853 21.2645 7.053e-05
diameter                   1  94.372  94.372 42.8312 6.518e-05
factor(solution):diameter  4  12.829   3.207  1.4557    0.2864
Residuals                 10  22.034   2.203
$H_0: \beta_1 = \beta_2 = \cdots = \beta_5$    $H_A$: at least one $\beta_i$ differs
F* = 1.46, d.f. = 4, 10, p = 0.2864
There is no evidence of a statistically significant covariate by treatment interaction.
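The R commands are not shown with the output above; a minimal sketch that should reproduce this table is given below. It assumes the data from the question are entered by hand into a data frame called `fiber` (a name chosen here), with `solution` stored as a factor so that the row labels read `solution` instead of `factor(solution)`.

```r
# Data from Question 4, in observation order (4 batches per solution)
fiber <- data.frame(
  solution = factor(rep(1:5, each = 4)),
  diameter = c(117, 115, 103, 115,  105, 102, 109, 106,  108, 108, 107, 116,
               103, 118, 103, 119,  110, 110, 103, 104),
  strength = c(73.5, 70.9, 68.5, 71.6,  69.3, 63.9, 71.0, 70.4,
               72.9, 71.2, 70.3, 74.3,  69.5, 78.8, 70.9, 77.2,
               64.6, 67.7, 62.9, 65.8)
)

# Separate-slopes model: solution, diameter and their interaction
full <- lm(strength ~ solution * diameter, data = fiber)
full.tab <- anova(full)
full.tab

# The interaction line is the test of H0: all slopes are equal
full.tab["solution:diameter", c("Df", "F value", "Pr(>F)")]
```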
c. Fit the ANCOVA model with no interaction. Is there a difference in tensile strength among the fibers from the different polymer solutions, after adjusting for diameter? Include null and alternative hypotheses, test statistic, p-value, and your conclusion written as a sentence. (3 marks)
Analysis of Variance Table
Response: strength
                 Df  Sum Sq Mean Sq F value    Pr(>F)
factor(solution)  4 187.413  46.853  18.815 1.612e-05
diameter          1  94.372  94.372  37.898 2.493e-05
Residuals        14  34.863   2.490
$H_0: \tau_1 = \tau_2 = \cdots = \tau_5 = 0$    $H_A$: at least one $\tau_i$ is not zero
F* = 18.815, d.f. = 4, 14, p = 0.000016
There is a statistically significant difference in strength among the different solutions, after adjusting for diameter.
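One point worth checking in R: `anova()` on an `lm` fit reports sequential sums of squares, so in a table where solution is entered first, the solution line is computed before diameter enters the model. A hedged sketch of one way to obtain the solution test adjusted for diameter is to enter diameter first. The data frame `fiber` and the object names below are illustrative, with the data typed in from the question.

```r
fiber <- data.frame(
  solution = factor(rep(1:5, each = 4)),
  diameter = c(117, 115, 103, 115,  105, 102, 109, 106,  108, 108, 107, 116,
               103, 118, 103, 119,  110, 110, 103, 104),
  strength = c(73.5, 70.9, 68.5, 71.6,  69.3, 63.9, 71.0, 70.4,
               72.9, 71.2, 70.3, 74.3,  69.5, 78.8, 70.9, 77.2,
               64.6, 67.7, 62.9, 65.8)
)

# Solution entered first: the solution line ignores diameter
seq.solution.first <- anova(lm(strength ~ solution + diameter, data = fiber))
seq.solution.first

# Diameter entered first: the solution line is adjusted for diameter
seq.diameter.first <- anova(lm(strength ~ diameter + solution, data = fiber))
seq.diameter.first
```

Both fits have the same residual line; only the split of the remaining sum of squares between solution and diameter depends on the order.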
d. Solutions 2 and 3 differ only by the type of catalyst used. Test whether fibers produced from these solutions have the same mean strength after adjusting for diameter, and give a 95% confidence interval for the difference. (DO THIS QUESTION BY HAND!!) (6 marks)
treat   strength LSMEAN   Standard Error   Pr > |t|
1          61.1922432        1.7960735      <.0001
2          61.8074342        1.3630872      <.0001
3          63.4562468        1.6212340      <.0001
4          64.9397909        1.6842405      <.0001
5          57.8556144        1.4371183      <.0001
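$H_0: \tau_2 = \tau_3$    $H_A: \tau_2 \neq \tau_3$
For the model in (c) the MSE is 2.490 with 14 degrees of freedom.
The difference between the adjusted means for solutions 2 and 3 is 61.807 - 63.456 = -1.649.
The standard error of this difference is $\sqrt{MSE\,[2/n + (\bar{x}_{2.} - \bar{x}_{3.})^2 / E_{xx}]}$, where $\bar{x}_{2.} = 105.5$ and $\bar{x}_{3.} = 109.75$ are the mean diameters for solutions 2 and 3 and $E_{xx} = 484.25$ is the within-solution sum of squares of diameter. This gives $\sqrt{2.490\,(0.5 + 18.0625/484.25)} = 1.157$.
$t_{calc} = -1.649 / 1.157 = -1.43$.
The tabulated t value with 14 degrees of freedom is $t_{.025,14} = 2.145$. Since |-1.43| < 2.145 we do not reject $H_0$ and conclude that there is no significant difference in strength between solutions 2 and 3 after adjusting for diameter.
A 95% CI for the difference in means is
$-1.649 \pm t_{.025,14} \times 1.157 = -1.649 \pm 2.145 \times 1.157 = -1.649 \pm 2.482$, that is, (-4.131, 0.833). The interval contains 0, in agreement with the test.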
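Although this part is to be done by hand, the calculation can be cross-checked in R. The sketch below assumes the data are typed in from the question into a data frame named `fiber` (an illustrative name), fits the common-slope ANCOVA with treatment contrasts, and forms the solution 2 minus solution 3 difference as a linear combination of the coefficients, with its standard error taken from `vcov()`.

```r
fiber <- data.frame(
  solution = factor(rep(1:5, each = 4)),
  diameter = c(117, 115, 103, 115,  105, 102, 109, 106,  108, 108, 107, 116,
               103, 118, 103, 119,  110, 110, 103, 104),
  strength = c(73.5, 70.9, 68.5, 71.6,  69.3, 63.9, 71.0, 70.4,
               72.9, 71.2, 70.3, 74.3,  69.5, 78.8, 70.9, 77.2,
               64.6, 67.7, 62.9, 65.8)
)

# Common-slope ANCOVA; treatment contrasts are set explicitly so that the
# coefficients are (Intercept), solution2, ..., solution5, diameter
fit <- lm(strength ~ solution + diameter, data = fiber,
          contrasts = list(solution = "contr.treatment"))

# Contrast: adjusted mean of solution 2 minus adjusted mean of solution 3
k <- c(0, 1, -1, 0, 0, 0)
names(k) <- names(coef(fit))

est   <- sum(k * coef(fit))
se    <- sqrt(drop(t(k) %*% vcov(fit) %*% k))
tcrit <- qt(0.975, df.residual(fit))

c(estimate = est, se = se, t = est / se,
  lower = est - tcrit * se, upper = est + tcrit * se)
```

Because both adjusted means are evaluated at the same diameter, their difference does not depend on which diameter value is used.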