stat6140_assignment4_2014

Assignment 4 Design and Analysis of Experiments

STAT 6140

Due: Wednesday in the 12th week of class

* You may discuss this assignment among yourselves but the work that you submit must be your own.

*Your work must be typed up in the spaces provided. Additional sheets with handwritten material will not be graded.

*DO NOT ERASE THE MARKS ALLOCATED FOR EACH SECTION.

*Appropriate R commands and output as well as output from Minitab and SPSS need to be pasted in the spaces provided. Please adjust the space to accommodate ALL your answers.

*There are 3 questions in this Assignment and you are required to answer all questions.

NAME: PAUL GOKOOL

ID#: 806003199

QUESTION 1

Consider the following scenario:

An experiment on the yield of three varieties of oats (factor A) and four different levels of manure (factor B) was described by F. Yates in his 1935 paper Complex Experiments. The experimental area was divided into s = 6 blocks. Each of these was then subdivided into a = 3 whole plots. The varieties of oat were sown on the whole plots according to a randomized complete block design (so that every variety appeared in every block exactly once). Each whole plot was then divided into b =4 split plots, and the levels of manure were applied to the split plots according to a randomized complete block design (so that every level of B appeared in every whole plot exactly once).

(i) Identify the Factors and blocks in this experiment. Identify the whole plot factor and the split plot factor. (3 marks)

The factors are oat variety (factor A, a = 3 levels) and manure (factor B, b = 4 levels).

The blocks are the s = 6 sections into which the experimental area was divided.

Each block is divided into 3 equal-sized whole plots, and each whole plot is assigned a variety of oat at random within its block. Oat variety is therefore the whole-plot factor.

Each whole plot is divided into 4 split plots, and the four levels of manure are randomly assigned to the 4 split plots within each whole plot. Manure is therefore the split-plot factor.

(ii) Write the model for this experiment. ( 3 marks)

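$$Y_{hij} = \mu + \theta_h + \alpha_i + \epsilon^{W}_{i(h)} + \beta_j + (\alpha\beta)_{ij} + \epsilon^{S}_{j(hi)},$$

where $\theta_h$ is the effect of block h, $\alpha_i$ is the effect of oat variety i, $\beta_j$ is the effect of manure level j, $(\alpha\beta)_{ij}$ is their interaction, and the whole-plot errors $\epsilon^{W}_{i(h)} \sim NID(0, \sigma_W^2)$ and split-plot errors $\epsilon^{S}_{j(hi)} \sim NID(0, \sigma_S^2)$ are mutually independent, with h = 1, 2, ..., s, i = 1, 2, ..., a, j = 1, 2, ..., b.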

(iii) Complete an ANOVA Table showing clearly the df column, SS, MS and F ratios that will be of interest in this experiment. (3 marks)

Source of Variation   DF             SS        MS      F
Block                 s-1            SSθ       ___     ___
Oats (A)              a-1            SSA       MSA     MSA/MSEW
Whole-plot error      (a-1)(s-1)     SSEW      MSEW
Manure (B)            b-1            SSB       MSB     MSB/MSES
AB                    (a-1)(b-1)     SSAB      MSAB    MSAB/MSES
Split-plot error      a(b-1)(s-1)    SSES      MSES
Total                 n-1            SSTotal
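As a quick check on the df column, the sketch below plugs in the design sizes given in the question (s = 6, a = 3, b = 4) and confirms that the component degrees of freedom add up to n - 1. The vector name `dof` and its labels are illustrative only.

```r
# Design sizes taken from the question: s blocks, a oat varieties, b manure levels
s <- 6
a <- 3
b <- 4
n <- s * a * b   # total number of split plots (observations)

# Degrees of freedom for each line of the split-plot ANOVA table
dof <- c(
  Block          = s - 1,
  Oats           = a - 1,
  WholePlotError = (a - 1) * (s - 1),
  Manure         = b - 1,
  OatsByManure   = (a - 1) * (b - 1),
  SplitPlotError = a * (b - 1) * (s - 1)
)
print(dof)

# The component degrees of freedom should account for all n - 1 total df
stopifnot(sum(dof) == n - 1)
```

In R a model of this form is commonly fitted with a call such as aov(yield ~ variety * manure + Error(block/variety)), where the column names here are hypothetical.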

QUESTION 2

(10 marks, 10 marks)

R Code:

life.hours <- matrix(c(22,31,25,32,43,29,35,34,50,55,47,46,

44,45,38,40,37,36,60,50,54,39,41,47),byrow=T,ncol=3)

dimnames(life.hours) <- list(

c("(1)","a","b","ab","c","ac","bc","abc"),c("Rep1","Rep2","Rep3"))

A <- rep(c(-1,1),4)

B <- rep(c(-1,-1,1,1),2)

C <- c(rep(-1,4),rep(1,4))

Total <- apply(life.hours,1,sum)

cbind(A,B,C,life.hours,Total)

# #reps: n=3

n <- 3

# Effect estimates are differences of averages of 4 means ("runs")

# Effect estimates:

Aeff <- (Total %*% A)/(4*n)

Beff <- (Total %*% B)/(4*n)

Ceff <- (Total %*% C)/(4*n)

# Interaction effects

AB <- A*B

AC <- A*C

BC <- B*C

ABC <- A*B*C

cbind(A,B,C,AB,AC,BC,ABC,Total)

ABeff <- (Total %*% AB)/(4*n)

ACeff <- (Total %*% AC)/(4*n)

BCeff <- (Total %*% BC)/(4*n)

ABCeff <- (Total %*% ABC)/(4*n)

# Summary

Effects <- t(Total) %*% cbind(A,B,C,AB,AC,BC,ABC)/(4*n)

Summary <- rbind( cbind(A,B,C,AB,AC,BC,ABC),Effects )

dimnames(Summary)[[1]] <- c(dimnames(life.hours)[[1]],"Effect")

Summary

# Fit as an ANOVA model

life.vec <- c(t(life.hours))

Af <- rep(as.factor(A),rep(3,8))

Bf <- rep(as.factor(B),rep(3,8))

Cf <- rep(as.factor(C),rep(3,8))

options(contrasts=c("contr.sum","contr.poly"))

life.lm <- lm(life.vec ~ Af*Bf*Cf)

summary(life.lm)

anova(life.lm)

model.matrix(life.lm)

R Output:

> life.hours <- matrix(c(22,31,25,32,43,29,35,34,50,55,47,46,

+ 44,45,38,40,37,36,60,50,54,39,41,47),byrow=T,ncol=3)

> dimnames(life.hours) <- list(

+ c("(1)","a","b","ab","c","ac","bc","abc"),c("Rep1","Rep2","Rep3"))

> A <- rep(c(-1,1),4)

> B <- rep(c(-1,-1,1,1),2)

> C <- c(rep(-1,4),rep(1,4))

> Total <- apply(life.hours,1,sum)

> cbind(A,B,C,life.hours,Total)

A B C Rep1 Rep2 Rep3 Total

(1) -1 -1 -1 22 31 25 78

a 1 -1 -1 32 43 29 104

b -1 1 -1 35 34 50 119

ab 1 1 -1 55 47 46 148

c -1 -1 1 44 45 38 127

ac 1 -1 1 40 37 36 113

bc -1 1 1 60 50 54 164

abc 1 1 1 39 41 47 127

> # #reps: n=3

> n <- 3

> # Effect estimates are differences of averages of 4 means ("runs")

> # Effect estimates:

> Aeff <- (Total %*% A)/(4*n)

> Beff <- (Total %*% B)/(4*n)

> Ceff <- (Total %*% C)/(4*n)

> # Interaction effects

> AB <- A*B

> AC <- A*C

> BC <- B*C

> ABC <- A*B*C

> cbind(A,B,C,AB,AC,BC,ABC,Total)

A B C AB AC BC ABC Total

(1) -1 -1 -1 1 1 1 -1 78

a 1 -1 -1 -1 -1 1 1 104

b -1 1 -1 -1 1 -1 1 119

ab 1 1 -1 1 -1 -1 -1 148

c -1 -1 1 1 -1 -1 1 127

ac 1 -1 1 -1 1 -1 -1 113

bc -1 1 1 -1 -1 1 -1 164

abc 1 1 1 1 1 1 1 127

> ABeff <- (Total %*% AB)/(4*n)

> ACeff <- (Total %*% AC)/(4*n)

> BCeff <- (Total %*% BC)/(4*n)

> ABCeff <- (Total %*% ABC)/(4*n)

> # Summary

> Effects <- t(Total) %*% cbind(A,B,C,AB,AC,BC,ABC)/(4*n)

> Summary <- rbind( cbind(A,B,C,AB,AC,BC,ABC),Effects )

> dimnames(Summary)[[1]] <- c(dimnames(life.hours)[[1]],"Effect")

> Summary

A B C AB AC BC ABC

(1) -1.0000000 -1.00000 -1.000000 1.000000 1.000000 1.000000 -1.000000

a 1.0000000 -1.00000 -1.000000 -1.000000 -1.000000 1.000000 1.000000

b -1.0000000 1.00000 -1.000000 -1.000000 1.000000 -1.000000 1.000000

ab 1.0000000 1.00000 -1.000000 1.000000 -1.000000 -1.000000 -1.000000

c -1.0000000 -1.00000 1.000000 1.000000 -1.000000 -1.000000 1.000000

ac 1.0000000 -1.00000 1.000000 -1.000000 1.000000 -1.000000 -1.000000

bc -1.0000000 1.00000 1.000000 -1.000000 -1.000000 1.000000 -1.000000

abc 1.0000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000

Effect 0.3333333 11.33333 6.833333 -1.666667 -8.833333 -2.833333 -2.166667

> # Fit as an ANOVA model

> life.vec <- c(t(life.hours))

> Af <- rep(as.factor(A),rep(3,8))

> Bf <- rep(as.factor(B),rep(3,8))

> Cf <- rep(as.factor(C),rep(3,8))

> options(contrasts=c("contr.sum","contr.poly"))

> life.lm <- lm(life.vec ~ Af*Bf*Cf)

> summary(life.lm)

Call:

lm(formula = life.vec ~ Af * Bf * Cf)

Residuals:

Min 1Q Median 3Q Max

-5.667 -3.500 -1.167 3.167 10.333

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 40.8333 1.1211 36.421 < 2e-16 ***

Af1 -0.1667 1.1211 -0.149 0.883680

Bf1 -5.6667 1.1211 -5.054 0.000117 ***

Cf1 -3.4167 1.1211 -3.048 0.007679 **

Af1:Bf1 -0.8333 1.1211 -0.743 0.468078

Af1:Cf1 -4.4167 1.1211 -3.939 0.001172 **

Bf1:Cf1 -1.4167 1.1211 -1.264 0.224475

Af1:Bf1:Cf1 1.0833 1.1211 0.966 0.348282

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.492 on 16 degrees of freedom

Multiple R-squared: 0.7696, Adjusted R-squared: 0.6689

F-statistic: 7.637 on 7 and 16 DF, p-value: 0.0003977

> anova(life.lm)

Analysis of Variance Table

Response: life.vec

Df Sum Sq Mean Sq F value Pr(>F)

Af 1 0.67 0.67 0.0221 0.8836803

Bf 1 770.67 770.67 25.5470 0.0001173 ***

Cf 1 280.17 280.17 9.2873 0.0076787 **

Af:Bf 1 16.67 16.67 0.5525 0.4680784

Af:Cf 1 468.17 468.17 15.5193 0.0011722 **

Bf:Cf 1 48.17 48.17 1.5967 0.2244753

Af:Bf:Cf 1 28.17 28.17 0.9337 0.3482825

Residuals 16 482.67 30.17

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Table 1. Effect Estimate Summary for Question 2

Factor   Effect Estimate   Sum of Squares   Percent Contribution
A          0.3333333           0.67             0.03198
B         11.33333           770.67            36.77984
C          6.833333          280.17            13.37097
AB        -1.666667           16.67             0.79557
AC        -8.833333          468.17            22.34318
BC        -2.833333           48.17             2.29889
ABC       -2.166667           28.17             1.34440

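The effects of factors B and C and of the AC interaction appear to be large; together they account for about 72% of the total sum of squares.

From the analysis of variance, B (p-value 0.0001173), C (p-value 0.0076787) and AC (p-value 0.0011722) are all significant at α=0.05. No other effect is significant at this level.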
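The entries of Table 1 can be reproduced from the treatment totals alone. The sketch below is one way to do it, assuming the totals and the residual sum of squares (482.67) are copied from the R output shown earlier; the object names `X`, `contrast`, `SS` and `pct` are illustrative.

```r
# Treatment totals over n = 3 replicates, in standard order (from the R output)
n <- 3
Total <- c("(1)" = 78, a = 104, b = 119, ab = 148,
           c = 127, ac = 113, bc = 164, abc = 127)

# Signs of the main effects and interactions in standard order
A <- rep(c(-1, 1), 4)
B <- rep(c(-1, -1, 1, 1), 2)
C <- rep(c(-1, 1), each = 4)
X <- cbind(A, B, C, AB = A * B, AC = A * C, BC = B * C, ABC = A * B * C)

contrast <- drop(Total %*% X)     # contrast totals
effect   <- contrast / (4 * n)    # effect estimates
SS       <- contrast^2 / (8 * n)  # sums of squares, 1 df each

SSE <- 482.67                     # residual SS copied from the anova() output
SST <- sum(SS) + SSE              # total sum of squares
pct <- 100 * SS / SST             # percent contribution of each effect

print(round(cbind(effect, SS, pct), 3))
```

Each sum of squares is the squared contrast divided by 8n because a 2^3 design with n replicates has 8n observations in total.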

QUESTION 3

At the beginning of this class, I mentioned that the paired t-test can be used to compare the prices of the SAME grocery items in two stores. This is useful in finding the supermarket with the lower prices. This concept is now extended to 4 stores. The data are given below. Write an appropriate model, and analyze the data at the 0.05 level. The goal is to determine if the prices differ between stores. Can you advise a customer which store they should purchase from to save money?

                    storeA  storeB  storeC  storeD
lettuce               1.17    1.78    1.29    1.29
potatoes              1.77    1.98    1.99    1.99
milk                  1.49    1.69    1.79    1.59
eggs                  0.65    0.99    0.69    1.09
bread                 1.58    1.70    1.89    1.89
cereal                3.13    3.15    2.99    3.09
ground.beef           2.09    1.88    2.09    2.49
tomato.soup           0.62    0.65    0.65    0.69
laundry.detergent     5.89    5.99    5.99    6.99
aspirin               4.46    4.84    4.99    5.15

Analysis of Data [You are free to use ANY software package of your choice!]

(i) Model: (3 marks)

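This is a single-factor design with repeated measures:

$y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$, i = 1, 2, ..., a, j = 1, 2, ..., n

$y_{ij}$ is the response (price) of subject j (grocery item) under treatment i (store), with a = 4 stores and n = 10 subjects.

$\alpha_i$ is the effect of the ith treatment

$\beta_j$ is a parameter associated with the jth subject

$\epsilon_{ij}$ is the random error

$\sum_{i=1}^{a} \alpha_i = 0$, $\epsilon_{ij} \sim NID(0, \sigma^2)$, $\beta_j \sim NID(0, \sigma_\beta^2)$

The treatment effects $\alpha_i$ are fixed and the subject effects $\beta_j$ are random.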

Store and subject are our sources of variability. The treatment we are interested in is store, and this treatment effect is observed within each subject, since every item is priced at all four stores.

(ii) Hypotheses (based on the model above): (3 marks)

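H0: α1 = α2 = α3 = α4 = 0

HA: at least one αi ≠ 0, i = 1, 2, 3, 4

Because the subject effects βj are random, the corresponding hypotheses concern their variance:

H0: σ²β = 0

HA: σ²β > 0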

(iii) Computer Output: (3 marks)

R code:

> groceries = read.table("groceriesdata.txt", header=T);groceries

price store subject

1 1.17 storeA lettuce

2 1.77 storeA potatoes

3 1.49 storeA milk

4 0.65 storeA eggs

5 1.58 storeA bread

6 3.13 storeA cereal

7 2.09 storeA ground.beef

8 0.62 storeA tomato.soup

9 5.89 storeA laundry.detergent

10 4.46 storeA aspirin

11 1.78 storeB lettuce

12 1.98 storeB potatoes

13 1.69 storeB milk

14 0.99 storeB eggs

15 1.70 storeB bread

16 3.15 storeB cereal

17 1.88 storeB ground.beef

18 0.65 storeB tomato.soup

19 5.99 storeB laundry.detergent

20 4.84 storeB aspirin

21 1.29 storeC lettuce

22 1.99 storeC potatoes

23 1.79 storeC milk

24 0.69 storeC eggs

25 1.89 storeC bread

26 2.99 storeC cereal

27 2.09 storeC ground.beef

28 0.65 storeC tomato.soup

29 5.99 storeC laundry.detergent

30 4.99 storeC aspirin

31 1.29 storeD lettuce

32 1.99 storeD potatoes

33 1.59 storeD milk

34 1.09 storeD eggs

35 1.89 storeD bread

36 3.09 storeD cereal

37 2.49 storeD ground.beef

38 0.69 storeD tomato.soup

39 6.99 storeD laundry.detergent

40 5.15 storeD aspirin

> with(groceries, tapply(price, store, sum))

storeA storeB storeC storeD

22.85 24.65 24.36 26.26

> # It appears we should shop at Store A as this has the lowest total price.

> aov.out = aov(price ~ store + Error(subject/store), data=groceries)

> summary(aov.out)

Error: subject

Df Sum Sq Mean Sq F value Pr(>F)

Residuals 9 115.2 12.8

Error: subject:store

Df Sum Sq Mean Sq F value Pr(>F)

store 3 0.5859 0.19529 4.344 0.0127 *

Residuals 27 1.2137 0.04495

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> # Store is significant at alpha=0.05.

> with(groceries, pairwise.t.test(price, store,p.adjust.method="holm", paired=T))

Pairwise comparisons using paired t tests

data: price and store

storeA storeB storeC

storeB 0.17 - -

storeC 0.17 0.69 -

storeD 0.07 0.49 0.33

P value adjustment method: holm

> # No pairwise difference between stores is significant at alpha=0.05 after the Holm adjustment.

> # However, store A vs store D is the closest to being different with p=0.07.

> bartlett.test(price~store, data=groceries)

Bartlett test of homogeneity of variances

data: price by store

Bartlett's K-squared = 0.2701, df = 3, p-value = 0.9655

> # Variances appear to be constant across stores.

> shapiro.test(groceries$price)

Shapiro-Wilk normality test

data: groceries$price

W = 0.8384, p-value = 4.751e-05

> # The distribution of prices does not appear to be normal

> kruskal.test(price~store, data=groceries)

Kruskal-Wallis rank sum test

data: price by store

Kruskal-Wallis chi-squared = 0.5431, df = 3, p-value = 0.9093

> # The prices at different stores do not differ significantly.

(iv) Diagnostics: (3marks)

The equal-variance assumption is supported by the Bartlett test (p-value = 0.9655 > 0.05). The Shapiro-Wilk test rejects normality of the prices (W = 0.8384, p-value = 0.00005).

(v) Conclusions (3marks)

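The repeated-measures ANOVA shows a significant store effect at α=0.05 (F = 4.344 on 3 and 27 df, p-value = 0.0127), so prices differ between stores. Store A has the lowest total price for the ten items (22.85), so it appears that a customer should shop at store A to save money.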

(vi) Further tests (if necessary- if you don’t need to do any more give a reason why) (3 marks)

We carry out pairwise paired t-tests with the Holm adjustment to determine which stores differ. None of the pairwise differences is significant at α=0.05; store A vs store D comes closest (p-value = 0.07). Since the prices do not appear to be normally distributed, we also use a nonparametric Kruskal-Wallis test to compare the stores. This test does not detect a difference in prices between stores (p-value = 0.9093 > 0.05).
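Two of the checks above ignore the pairing by item: the Kruskal-Wallis test treats the 40 prices as independent, and the Shapiro-Wilk test is applied to raw prices, which vary far more between items than between stores. A hedged sketch of alternatives that respect the pairing is given below. It assumes the prices are entered as in the table in the question, and it uses the Friedman rank sum test and the residuals of an additive store + item fit; neither is part of the original analysis.

```r
# Prices from the table in the question: rows are items (blocks), columns are stores
items <- c("lettuce", "potatoes", "milk", "eggs", "bread", "cereal",
           "ground.beef", "tomato.soup", "laundry.detergent", "aspirin")
prices <- matrix(
  c(1.17, 1.78, 1.29, 1.29,
    1.77, 1.98, 1.99, 1.99,
    1.49, 1.69, 1.79, 1.59,
    0.65, 0.99, 0.69, 1.09,
    1.58, 1.70, 1.89, 1.89,
    3.13, 3.15, 2.99, 3.09,
    2.09, 1.88, 2.09, 2.49,
    0.62, 0.65, 0.65, 0.69,
    5.89, 5.99, 5.99, 6.99,
    4.46, 4.84, 4.99, 5.15),
  ncol = 4, byrow = TRUE,
  dimnames = list(items, c("storeA", "storeB", "storeC", "storeD"))
)

# Friedman rank sum test: ranks the four stores within each item
fr <- friedman.test(prices)
print(fr)

# Long format (column-major order, so all storeA prices come first)
long <- data.frame(
  price   = c(prices),
  store   = factor(rep(colnames(prices), each = nrow(prices))),
  subject = factor(rep(rownames(prices), times = ncol(prices)))
)

# Residuals from the additive store + item fit, used for the normality check
fit <- lm(price ~ store + subject, data = long)
print(shapiro.test(residuals(fit)))
```

The residual degrees of freedom of this additive fit (27) match the subject:store error stratum in the aov() output, so the residuals checked here are the ones the F test for store relies on.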

QUESTION 4

A manufacturer of industrial textiles is trying to develop a method to produce synthetic spider silk with high strength. There are 5 types of polymer solution. The manufacturer has prepared 4 replicate batches of fiber for each solution type and has measured the tensile strength of the fiber from a sample from each batch. The manufacturer wishes to determine if there are solution effects on tensile strength.

Fiber diameter is an important determinant of strength. Each of the solutions should be able to create fibers of the same diameter. However, because the industrial process is not perfected, diameter varies significantly from batch to batch. Hence, the manufacturer decides to use fiber diameter as a covariate in the analysis.

The data are below and are also on the class Web page.

obs  rep  solution  diameter  strength
  1    1         1       117      73.5
  2    2         1       115      70.9
  3    3         1       103      68.5
  4    4         1       115      71.6
  5    1         2       105      69.3
  6    2         2       102      63.9
  7    3         2       109      71.0
  8    4         2       106      70.4
  9    1         3       108      72.9
 10    2         3       108      71.2
 11    3         3       107      70.3
 12    4         3       116      74.3
 13    1         4       103      69.5
 14    2         4       118      78.8
 15    3         4       103      70.9
 16    4         4       119      77.2
 17    1         5       110      64.6
 18    2         5       110      67.7
 19    3         5       103      62.9
 20    4         5       104      65.8

a. Write a linear model for the observations of strength, which includes effects for solution (S), diameter (D) and all of the interactions. Make sure that all of the terms in the model are fully defined, including distribution of any random terms, and any constraints on the model parameters. (Assume that the solution and diameter effects are fixed.) (3marks)

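$Y_{ij} = \mu + \tau_i + \beta_i (x_{ij} - \bar{x}_{..}) + \epsilon_{ij}$, i = 1, 2, …, 5 (5 solutions), j = 1, 2, …, n (n = 4)

$Y_{ij}$ is the value of strength for the jth sample from the ith solution

$\tau_i$ is the effect of the ith solution

$\beta_i$ is the regression slope for the regression of strength on diameter for the ith solution. Allowing the slope to differ by solution is what carries the solution by diameter interaction; with a common slope $\beta_i = \beta$ there is no interaction.

$x_{ij}$ is the value of diameter for the jth sample from the ith solution

$\bar{x}_{..}$ is the mean of the $x_{ij}$ values.

$\epsilon_{ij} \sim NID(0, \sigma^2)$ is the random error.

$\sum_{i=1}^{5} \tau_i = 0$, $\beta_i \neq 0$, and $x_{ij}$ is not affected by the treatments.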

b. Fit the model in part a and test for a treatment by covariate interaction. Include null and alternative hypotheses, test statistic, p-value, and your conclusion written as a sentence. (You do not need to check the residual plots.) (USE R!) (3marks)


Response: strength

                          Df  Sum Sq Mean Sq F value    Pr(>F)
factor(solution)           4 187.413  46.853 21.2645 7.053e-05
diameter                   1  94.372  94.372 42.8312 6.518e-05
factor(solution):diameter  4  12.829   3.207  1.4557    0.2864
Residuals                 10  22.034   2.203

H0: β1 = β2 = … = β5   HA: At least one βi differs

F* = 1.46, d.f. = 4, 10, p = 0.2864

There is no evidence of a statistically significant covariate by treatment interaction.
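The R commands that produced the table in part b are not shown. A minimal sketch that should reproduce it is given below, assuming the data are entered as in the table in the question; the data frame name `fiber` and the object names `full` and `tab` are illustrative.

```r
# Fiber data from the question: 5 solutions x 4 replicate batches
fiber <- data.frame(
  solution = rep(1:5, each = 4),
  diameter = c(117, 115, 103, 115,  105, 102, 109, 106,  108, 108, 107, 116,
               103, 118, 103, 119,  110, 110, 103, 104),
  strength = c(73.5, 70.9, 68.5, 71.6,  69.3, 63.9, 71.0, 70.4,
               72.9, 71.2, 70.3, 74.3,  69.5, 78.8, 70.9, 77.2,
               64.6, 67.7, 62.9, 65.8)
)

# Separate-slopes model: solution, diameter and their interaction
full <- lm(strength ~ factor(solution) * diameter, data = fiber)

# Sequential ANOVA; the last line before Residuals tests equality of the slopes
tab <- anova(full)
print(tab)
```

The interaction is entered last, so its sequential F test is the test of H0: β1 = β2 = … = β5.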

c. Fit the ANCOVA model with no interaction. Is there a difference in tensile strength among the fibers from the different polymer solutions, after adjusting for diameter? Include null and alternative hypotheses, test statistic, p-value, and your conclusion written as a sentence. (3marks)


Response: strength

                 Df  Sum Sq Mean Sq F value    Pr(>F)
factor(solution)  4 187.413  46.853  18.815 1.612e-05
diameter          1  94.372  94.372  37.898 2.493e-05
Residuals        14  34.863   2.490

H0: τ1 = τ2 = … = τ5 = 0   HA: At least one τi is not zero

F* = 18.815, d.f. = 4, 14, p = 0.000016

There is a statistically significant difference in strength among the different solutions, after adjusting for diameter.
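One caveat: anova() reports sequential sums of squares, so a solution line that is listed before diameter is computed before the covariate enters the model. A hedged sketch of the adjusted test is given below; it enters diameter first so that the solution line is adjusted for diameter. The data frame `fiber` and the object names are illustrative, and the data are assumed to be entered as in the table in the question.

```r
# Fiber data from the question: 5 solutions x 4 replicate batches
fiber <- data.frame(
  solution = rep(1:5, each = 4),
  diameter = c(117, 115, 103, 115,  105, 102, 109, 106,  108, 108, 107, 116,
               103, 118, 103, 119,  110, 110, 103, 104),
  strength = c(73.5, 70.9, 68.5, 71.6,  69.3, 63.9, 71.0, 70.4,
               72.9, 71.2, 70.3, 74.3,  69.5, 78.8, 70.9, 77.2,
               64.6, 67.7, 62.9, 65.8)
)

# Common-slope ANCOVA with the covariate entered first
ancova <- lm(strength ~ diameter + factor(solution), data = fiber)

# The factor(solution) line is now the test of solution adjusted for diameter
adj <- anova(ancova)
print(adj)
```

The residual line is the same whichever term is entered first; only the split of the remaining sum of squares between diameter and solution changes.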

d. Solutions 2 and 3 differ only by the type of catalyst used. Test whether fibers produced from these solutions have the same mean strength after adjusting for diameter, and give a 95% confidence interval for the difference. (DO THIS QUESTION BY HAND!!) (6marks)

treat   strength LSMEAN   Standard Error   Pr > |t|
1          61.1922432        1.7960735      <.0001
2          61.8074342        1.3630872      <.0001
3          63.4562468        1.6212340      <.0001
4          64.9397909        1.6842405      <.0001
5          57.8556144        1.4371183      <.0001

H0: τ2 = τ3   HA: τ2 ≠ τ3

For the model in (c) the MSE is 2.490 with 14 degrees of freedom, and the standard error of the difference between the two adjusted means is 1.1578.

The difference between the adjusted means for solutions 2 and 3 is 61.807 - 63.456 = -1.649.

tcalc = -1.649/1.1578 = -1.424.

The tabulated t value with 14 degrees of freedom is t.025,14 = 2.145. Since |-1.424| < 2.145 we do not reject H0 and conclude that there is no significant difference in strength between solutions 2 and 3 after adjusting for diameter.

A 95% CI for the difference in means is

-1.649 ± t.025,14 * 1.1578 = -1.649 ± 2.145 * 1.1578 = -1.649 ± 2.483, i.e. (-4.132, 0.834).
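The hand calculation can be cross-checked in R. The sketch below assumes the data are entered as in the table in the question and uses the usual ANCOVA formula for the standard error of a difference between two adjusted means, SE = sqrt(MSE * (1/n + 1/n + (xbar2 - xbar3)^2 / Exx)), where Exx is the within-solution sum of squares of diameter. All object names are illustrative, and small rounding differences from the hand-calculated values are to be expected.

```r
# Fiber data from the question: 5 solutions x 4 replicate batches
fiber <- data.frame(
  solution = rep(1:5, each = 4),
  diameter = c(117, 115, 103, 115,  105, 102, 109, 106,  108, 108, 107, 116,
               103, 118, 103, 119,  110, 110, 103, 104),
  strength = c(73.5, 70.9, 68.5, 71.6,  69.3, 63.9, 71.0, 70.4,
               72.9, 71.2, 70.3, 74.3,  69.5, 78.8, 70.9, 77.2,
               64.6, 67.7, 62.9, 65.8)
)

# Common-slope ANCOVA from part c
fit  <- lm(strength ~ factor(solution) + diameter, data = fiber)
mse  <- sum(residuals(fit)^2) / df.residual(fit)
bhat <- coef(fit)[["diameter"]]          # pooled within-solution slope

# Solution means of the covariate and the response
xbar <- tapply(fiber$diameter, fiber$solution, mean)
ybar <- tapply(fiber$strength, fiber$solution, mean)

# Within-solution sum of squares of diameter
Exx <- sum((fiber$diameter - ave(fiber$diameter, fiber$solution))^2)

# Adjusted mean difference (solution 2 minus solution 3) and its standard error
dx    <- xbar[["2"]] - xbar[["3"]]
delta <- (ybar[["2"]] - ybar[["3"]]) - bhat * dx
se    <- sqrt(mse * (1 / 4 + 1 / 4 + dx^2 / Exx))

tcalc <- delta / se
tcrit <- qt(0.975, df.residual(fit))
ci    <- delta + c(-1, 1) * tcrit * se
print(c(difference = delta, se = se, t = tcalc, lower = ci[1], upper = ci[2]))
```

The difference between adjusted means does not depend on the diameter at which the means are evaluated, which is why it can be read directly from the LSMEAN table.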
