Scott Cunningham Final Project

STAT 512

31/7/2015


PART I

1) The graph below shows that the two pieces are not the same line, which the sameline test below confirms. The p-value is just barely significant if we don’t get too picky about thousandths and ten-thousandths: the extra 0.0003 is not enough for me to call it insignificant. Because I treat the p-value as significant, we reject the null hypothesis β2 = 0 in favor of the alternative hypothesis β2 ≠ 0, where β2 is the coefficient of the change-of-slope term.
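The piecewise model behind this test can be sketched in a few lines. The knot at 175 and the name cslope come from the SAS code at the end of this report; the Python function and the sample gestation values are illustrative assumptions, not part of the original analysis.

```python
def cslope(gestation, knot=175.0):
    """Change-of-slope term: 0 up to the knot, (gestation - knot) beyond it."""
    return max(0.0, gestation - knot)

# Model: TotalSleep = b0 + b1*gestation + b2*cslope + error.
# Below the knot the slope is b1; above it the slope is b1 + b2,
# so testing b2 = 0 asks whether both pieces lie on one line.
for g in (100, 175, 200):  # illustrative gestation values only
    print(g, cslope(g))
```

Because cslope is zero at the knot itself, the two pieces always meet there; only the slope is allowed to change.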


2) (a) Extra sum of squares = SSE(R) – SSE(F) = 306.48106 – 277.86404 = 28.61702

F-value: F(1, 34) = [(SSE(R) – SSE(F)) / (dfR – dfF)] / [SSE(F) / dfF] = (28.61702 / 1) / (277.86404 / 34) = 3.50163

(b) F(1, 34) = 3.50, p-value = 0.0699. Conclusion: fail to reject H0: β7 = 0; there is not enough evidence that β7 ≠ 0.

(c) t* = −1.87, p-value = 0.0699, (t*)2 = 3.4969. A t-test with n degrees of freedom is equivalent to an F-test with (1, n) degrees of freedom. The squared t-value matches the F-value from the previous parts up to rounding, so both tests produce the same p-value.
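The arithmetic in this question can be checked with a short script. This is a minimal sketch using only the figures reported above; the reduced-model error degrees of freedom (35) is an assumption inferred from F(1, 34), since the reduced model drops exactly one term.

```python
sse_reduced = 306.48106   # SSE(R): model without the sum term
sse_full = 277.86404      # SSE(F): model with the sum term
df_reduced, df_full = 35, 34  # 35 is inferred: one fewer parameter than the full model

extra_ss = sse_reduced - sse_full
f_value = (extra_ss / (df_reduced - df_full)) / (sse_full / df_full)

t_star = -1.87            # rounded t statistic reported in part (c)
print(round(extra_ss, 5), round(f_value, 3), round(t_star ** 2, 4))
```

The small gap between (t*)2 and the F-value comes from squaring a t statistic that was already rounded to two decimals.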


3) Σ Type-I SS = 632.17685, Σ Type-II SS = 104.00489. The Type-I SS sum to SSM. The two types of SS are equal for the Danger predictor because Danger is entered last in the model, so both condition on the same set of predictors: SS(Danger | BodyWt BrainWt Dreaming LifeSpan Gestation Predation Exposure).

4)

Explanatory Variables                                                       R2
BodyWt BrainWt Dreaming LifeSpan Gestation Predation Exposure Danger    0.6952
BodyWt                                                                  0.1175
BrainWt                                                                 0.1136
Dreaming                                                                0.5287
NonDreaming                                                             0.9364
LifeSpan                                                                0.1463
Gestation                                                               0.3776
Predation                                                               0.0078
Exposure                                                                0.3861
Danger                                                                  0.3652
BodyWt BrainWt                                                          0.1186
BodyWt BrainWt Sum                                                      0.4118
LifeSpan Gestation                                                      0.3780
Dreaming Sum                                                            0.6346
NonDreaming Sum                                                         0.9376
Predation Exposure Gestation                                            0.5404
BodyWt BrainWt Dreaming                                                 0.5991


PART II

1) These are the initial scatter plots for each variable:


I decided that BodyWt, BrainWt, Gestation and Lifespan needed to be transformed because their relationships with TotalSleep are all non-linear. At the professor’s suggestion, I took the ratio of BrainWt to BodyWt to transform those two. I also took the inverse of both Gestation and Lifespan.


LifeInverse ended up being fairly non-linear so I checked two further transformations: square root and log. I decided to go with logLifeInv as it is more linear.
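The transformations described above can be written out as a small function. This is a sketch only: the variable names mirror the SAS data step at the end of this report, while the Python function and the input values are hypothetical.

```python
import math

def transform(body_wt, brain_wt, gestation, lifespan):
    """Return the transformed predictors used in Part II."""
    life_inv = 1.0 / lifespan
    return {
        "brainbody": brain_wt / body_wt,      # ratio of BrainWt to BodyWt
        "gestinv": 1.0 / gestation,           # inverse of Gestation
        "lifeinv": life_inv,                  # inverse of Lifespan
        "loglifeinv": math.log(life_inv),     # natural log, as SAS log() uses
        "sqrtlifeinv": math.sqrt(life_inv),   # the alternative that was rejected
    }

# Hypothetical animal, values chosen only to make the arithmetic easy to follow.
row = transform(body_wt=2.0, brain_wt=10.0, gestation=50.0, lifespan=20.0)
print(row)
```

Since log(1/Lifespan) = −log(Lifespan), logLifeInv is simply the log of Lifespan with its sign flipped, so it carries the same information as a plain log transform.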


Note: From here onward I use NonDreaming and drop Dreaming, because NonDreaming has a more linear relationship with TotalSleep. NonDreaming also has a correlation with each of the other variables that is lower than or equal to that of Dreaming (see below). I ran the correlation procedure and found that Predation, Exposure and Danger were all highly correlated. Thus, I removed Danger, as it had the highest correlation with the other two (it’s also the least linear of the three).


Box-Cox gave the optimal λ as 0.75, which is close to 1 (no transformation), so I decided Y did not need to be transformed. I ran a regression with the following model to obtain residuals: TotalSleep = NonDreaming + Predation + Exposure + BrainBody + GestInv + logLifeInv

Thus, I obtained the following residual plots:


I also produced the following histogram and QQ-plot, which both indicate the residuals are approximately Normal.


I also checked TotalSleep with a QQ-plot and found that it was approximately Normal as well.

In summary, I conclude that TotalSleep = NonDreaming + Predation + Exposure + BrainBody + GestInv + logLifeInv is a good starting model.
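For reference, the Box-Cox family behind the λ reported earlier in this question can be sketched as follows. This shows the standard one-parameter transform, not the SAS procedure itself, and the sample values are hypothetical.

```python
import math

def box_cox(y, lam):
    """Standard Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)          # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam

# lambda = 1 only shifts y by a constant, i.e. no real transformation.
for y in (4.0, 10.0, 16.0):         # hypothetical TotalSleep-like values
    print(y, box_cox(y, 1.0), round(box_cox(y, 0.75), 4))
```

With λ = 1 the transform is y − 1, which leaves the shape of the response unchanged; that is why a λ near 1 is read as "no transformation needed".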


2) Mallows’ Cp reported the following models (I’ve only included the first 12). I’ve highlighted my selection for the best model in blue.

I would use this model because it has Cp < p, which is desirable, and a very high R2. Adding more variables increases R2 very little, so it is unnecessary. I don’t want to include the BrainBody term, since a negative coefficient would not make sense for a variable that is positively correlated with TotalSleep, as its scatter plot shows. Given the fairly high (~0.62) correlation between Predation and Exposure, it does not make sense to have both in the model, and switching Exposure with Predation (last entry) results in Cp > p, which would not be a good model.

The following is a summary of the part of the table I have removed. The Cp values slowly increase as different combinations of variables are tried, as long as NonDreaming stays in the model. Once NonDreaming is removed, the Cp values skyrocket, increasing by approximately 2000% for the “best” model without NonDreaming (highlighted in yellow below), and increasing further from there as other combinations without NonDreaming are tried.

This result makes sense as TotalSleep = NonDreaming + Dreaming, so it is the most influential explanatory variable. In conclusion, I choose the best model to be: TotalSleep = β0 + β1NonDreaming – β2Predation + β3GestationInverse
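The Cp criterion used in this question follows the usual definition Cp = SSE_p / MSE(full) − (n − 2p), where p counts the intercept. A minimal sketch follows; all numbers in it are hypothetical and are not taken from the SAS output.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp for a subset model with p parameters (intercept included)."""
    return sse_p / mse_full - (n - 2 * p)

# Hypothetical values, for illustration only.
n = 42
p_full = 7
sse_full = 70.0
mse_full = sse_full / (n - p_full)   # 2.0

cp_full = mallows_cp(sse_full, mse_full, n, p_full)
cp_subset = mallows_cp(sse_p=80.0, mse_full=mse_full, n=n, p=4)
print(cp_full, cp_subset)
```

For the full model Cp always equals p by construction, so the comparison against p is only informative for subset models.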


3) The stepwise selection method produced the following result:

As this is the same model as selected above, I will not restate my reasons for selecting it. Some points of note:

Predation contributes very little to R2; however, it must be included to satisfy Cp < p. The same is true of GestationInverse.

As pointed out above, NonDreaming contributes very heavily to TotalSleep, which is why it has such a large partial R2.

The stepwise selection produces the same “best” model as the Cp criterion.

To reiterate, I choose the best model to be: TotalSleep = β0 + β1NonDreaming – β2Predation + β3GestationInverse*

*Note: On Mixable I saw Professor Sharabati saying he would suggest keeping Dreaming and throwing out NonDreaming. I tested my model at every step after switching NonDreaming with Dreaming and had much worse results, which prompted me to continue using NonDreaming.


4) The residual plot for NonDreaming appears to be okay, except for maybe an outlier at ~ −2.75.

The residual plot for Predation is fine, no discernible pattern.


However, the residual plot for GestationInverse indicates that the constant variance assumption may be violated. We can also see that possible outlier at ~ −2.75.

Summary of residuals: The GestationInverse residual plot makes me cautious, but I would not denounce the model just yet. I see no reason to assume the responses are not independent. The following histogram and QQ-plot indicate that the residuals are approximately Normally distributed. The histogram tells me that the possible outlier is probably not an outlier, but I will confirm this in the next question. Based on the scatter plots in Question 1, I would say the linearity assumption is not violated. Overall, I would say this is an acceptable model to use with some caution, because the GestationInverse residuals show a slight problem with constant variance.


5) I use the Studentized Residuals and Cook’s Distance to check for outliers and influential observations, and VIF to check for multicollinearity.

VIF results: All VIF scores are well below the threshold for determining multicollinearity. I conclude that multicollinearity is not a problem in the model.

For the residuals, I only include output for the most influential/unusual points. Looking at the Studentized Residuals and fences, none of the largest are considered outliers.

Cook’s Distance: the critical value is the 50th percentile of the F distribution, F(0.5; 4, 40) = 0.85356585. None of the largest Cook’s D values come close to exceeding the critical value.

In conclusion, these diagnostics do not flag the suspected outlier from the residual and QQ-plots as an outlier, and there are no influential observations.
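The VIF check in this question rests on the definition VIF = 1 / (1 − R2), where R2 comes from regressing one predictor on all the others. The sketch below is illustrative: it plugs in the pairwise Predation and Exposure correlation (~0.62) from Question 2 as if those were the only two predictors, which is an assumption and not the fitted model. The threshold of 10 is a common rule of thumb, not a value taken from the SAS output.

```python
def vif(r_squared):
    """Variance inflation factor for a predictor with the given R^2 on the others."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# With only two predictors, R^2 is the squared pairwise correlation.
r = 0.62                      # approximate correlation reported in Question 2
print(round(vif(r ** 2), 3))  # well below the rule-of-thumb threshold of 10
```

Even a correlation of this size inflates the variance by a factor under 2, which is consistent with reading modest VIF scores as unproblematic.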


6) (a) Fitted equation: predicted TotalSleep = 1.153133 + 1.05652(NonDreaming) – 0.28902(Predation) + 35.46003(GestationInverse)

(b) 90% C.I. for µh: Highlighted in green below (first 20 obs.)

(c) 90% P.I. for Yh(new): Highlighted in pink below (first 20 obs.)

(d) 90% C.I. for βi: Highlighted in blue below
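The fitted equation from part (a) can be evaluated directly. In this minimal sketch the coefficients are the ones reported above; the function name and the input values are hypothetical and do not correspond to any observation in the data set.

```python
def predict_total_sleep(nondreaming, predation, gestation):
    """Point prediction from the fitted model; GestationInverse = 1 / gestation."""
    gest_inv = 1.0 / gestation
    return (1.153133
            + 1.05652 * nondreaming
            - 0.28902 * predation
            + 35.46003 * gest_inv)

# Hypothetical inputs chosen only to illustrate the calculation.
y_hat = predict_total_sleep(nondreaming=8.0, predation=3.0, gestation=100.0)
print(round(y_hat, 4))
```

This gives only the point estimate; the confidence and prediction intervals in parts (b) and (c) also need the standard errors from the regression output.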


SAS CODE

*data imported using File menu

PART 1 ;

symbol1 v=dot i=sm75S;

proc gplot data = sleep;

plot TotalSleep* (BodyWt BrainWt NonDreaming Dreaming Lifespan

Gestation Predation Exposure Danger);

run; *I used this to figure out which variable I wanted to use;

quit;

data sleep; *I decided on gestation, so I create the cslope term;

set sleep;

if gestation le 175

then cslope=0;

if gestation gt 175

then cslope=(gestation-175);

proc reg data=sleep; *regression to get equation;

model totalsleep=gestation cslope / p;

output out=sleepoutpred p=pred;

sameline: test cslope; *sameline test;

run;

quit;

symbol1 v=circle i=none c=black;

symbol2 v=none i=join c=red;

title1 'Question 1 - Piecewise Regression';

title2 'Scott Cunningham';

axis1 label = (angle=90 'TotalSleep');

proc sort data=sleepoutpred; by gestation;

proc gplot data=sleepoutpred;

plot (totalsleep pred)*gestation / overlay

vaxis=axis1;

run;

quit; *plotting the graph;

* END PROBLEM 1

--------------------------------------------------------------

PROBLEM 2 ;

data sleep; *creating sum;

set sleep;

sum = lifespan+gestation;

proc reg data = sleep; *running the two regressions;

model totalsleep = bodywt brainwt dreaming predation exposure danger;

model totalsleep = bodywt brainwt dreaming predation exposure danger

sum;

nilsum: test sum; *F-test;

run;

quit;

* END PROBLEM 2

-------------------------------------------------------------


PROBLEM 3 ;

proc reg data = sleep;

model totalsleep = bodywt brainwt dreaming lifespan gestation predation

exposure danger / ss1 ss2;

run;

quit;

* END PROBLEM 3

--------------------------------------------------------------

PROBLEM 4 ;

proc reg data = sleep;

model totalsleep = bodywt;

model totalsleep = brainwt;

model totalsleep = dreaming;

model totalsleep = nondreaming;

model totalsleep = lifespan;

model totalsleep = gestation;

model totalsleep = predation;

model totalsleep = exposure;

model totalsleep = danger;

model totalsleep = bodywt brainwt;

model totalsleep = bodywt brainwt sum;

model totalsleep = lifespan gestation;

model totalsleep = dreaming sum;

model totalsleep = nondreaming sum;

model totalsleep = predation exposure danger;

model totalsleep = bodywt brainwt dreaming;

run;

quit;

* END PROBLEM 4

--------------------------------

PART 2

PROBLEM 1 ;

symbol1 v=dot i=sm75S;

title1 'Question 1 - Scatter Plot with Smoothing Curve';

title2 'Scott Cunningham';

proc gplot data = sleep;

plot TotalSleep* (BodyWt BrainWt NonDreaming Dreaming Lifespan

Gestation Predation Exposure Danger) / vaxis=axis1;

run; *to examine the explanatory variables against the response;

quit;

data sleep; *creating transforms of the variables I think need it;

set sleep;

brainbody = brainwt/bodywt;

gestinv = 1/gestation;

lifeinv = 1/lifespan;

proc gplot data = sleep;

plot TotalSleep*(brainbody gestinv lifeinv) / vaxis=axis1;

run; *checking the new transformed variables;

quit;


data sleep; *checking two possible further transformations;

set sleep;

loglifeinv=log(lifeinv);

sqrtlifeinv=sqrt(lifeinv);

proc gplot data = sleep; *checking again;

plot TotalSleep*(loglifeinv sqrtlifeinv) / vaxis=axis1;

run;

quit;

proc corr data=sleep; *checking correlation between explanatory variables;

var nondreaming dreaming predation exposure danger brainbody gestinv

loglifeinv;

proc transreg data = sleep; *performing Box-Cox to check if Y needs to be

transformed;

model boxcox(totalsleep)=identity(nondreaming predation exposure

brainbody gestinv loglifeinv);

run;

quit;

proc reg data = sleep;

model totalsleep = nondreaming predation exposure brainbody gestinv

loglifeinv / r;

output out=sleepoutresid r=resid;

run; *computing the residuals;

quit;

symbol1 v=dot i=none;

title1 'Question 1 - Residual Plot';

title2 'Scott Cunningham';

axis1 label = (angle=90 'Residual');

proc gplot data = sleepoutresid;

plot resid*(nondreaming predation exposure brainbody gestinv

loglifeinv) / vref=0

vaxis=axis1;

proc univariate data=sleepoutresid noprint;

qqplot resid totalsleep / normal (L=1 mu=est sigma=est)

odstitle='Question 1 - QQ-plot'

odstitle2='Scott Cunningham';

histogram resid / odstitle='Question 1 - Histogram'

odstitle2='Scott Cunningham'

normal(noprint);

run;

quit; *these graphs are just sort of a final check to see if I did well in

refining the model;

*END PROBLEM 1

---------------------------------------------------

PROBLEMS 2 & 3 ;

proc reg data = sleep;


model totalsleep = nondreaming predation exposure brainbody gestinv

loglifeinv / selection=cp b;

proc reg data = sleep;

model totalsleep = nondreaming predation exposure brainbody gestinv

loglifeinv / selection=stepwise;

run;

quit;

*END PROBLEMS 2 & 3

------------------------------------------------------

PROBLEMS 4, 5 & 6;

proc reg data=sleep;

model totalsleep = nondreaming predation gestinv / r vif clm cli clb

alpha=0.1;

output out=sleepresidbest r=resid;

run; *regression with all the options I need;

quit;

symbol1 v=dot i=none;

title1 'Question 4 - Residual Plot';

title2 'Scott Cunningham';

axis1 label = (angle=90 'Residual');

proc gplot data=sleepresidbest;

plot resid*(nondreaming predation gestinv) / vref=0

vaxis=axis1;

proc univariate data=sleepresidbest noprint;

qqplot resid / normal (L=1 mu=est sigma=est)

odstitle='Question 4 - QQ-plot'

odstitle2='Scott Cunningham';

histogram resid / odstitle='Question 4 - Histogram'

odstitle2='Scott Cunningham'

normal(noprint);

run;

quit; *graphs to check assumptions (problem 4)

*END PROBLEMS 4, 5 & 6;