regression analysis of sat scores final

Regression Analysis:

SAT Score

By:

Matthew Adkins

John Michael Croft

Chima Iheme

Anthony Podolak

Table of Contents

Regression Analysis of SAT Scores

I.) Table of Contents
II.) Abstract
III.) Introduction
IV.) Model Specifications
    A.) Explanation of Coefficients
    B.) Test for Non-Linearity
    C.) Test for Heteroskedasticity
    D.) Normality Test
        i.) Histogram, PP Plot, other graphical methods
        ii.) Quantitative Analysis of Normality
    E.) Outlier Control
        i.) Decision to keep/disregard outliers
        ii.) Explanation for decision
V.) Conclusion
    A.) Final Model
    B.) Limitations of Model
    C.) Possible Improvements

Appendix: Includes relevant graphs, plots, and non-essential output reports.

Abstract

In this project, we analyzed a data set to determine how its variables relate to SAT scores. We first created a baseline model against which we could compare all of our later results. We then tested our independent variables for non-linearity to identify the forms of the variables that explained the most variation in SAT scores. After building a preliminary model, we tested it for multi-collinearity, heteroskedasticity, and normality. Once we had adjusted the model to deal with these issues, we decided to retain the outliers, as they did not appear to affect the model significantly. As a result, we are comfortable that our model accurately reflects the information we were given. However, we are constrained by the limited data; to build a truly explanatory model, we would need more data.

Model Introduction

This project is a required assignment for our Econometrics course. It involves building a model and testing that model for problems, and it gives us a chance to practice the concepts we learned this semester. We walk through the process step by step so that the reader can follow how we arrived at our final model.

Initially, we were given a data set that included our dependent variable, SAT score, and a number of possible explanatory factors. From these we built a baseline model, which produced the following adjusted R².

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .531a   .282       .281                118.239

This model is sub-optimal and is useful only as a starting point for us to compare with our later

models. However, it does allow us to have an idea of how our results compare to this and

whether we are progressing or regressing in our analysis of this data.
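Since every later model is judged against the baseline by its adjusted R², it may help to see how that figure is derived from R². The sketch below uses the standard adjustment formula; the sample size and number of predictors are hypothetical placeholders, because the report does not state them.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical n and k for illustration only; they are not taken from the data set.
n_obs, n_predictors = 4000, 6
baseline_adj = adjusted_r_squared(0.282, n_obs, n_predictors)
print(f"adjusted R^2 = {baseline_adj:.3f}")
```

The adjustment penalizes each added predictor, which is why adjusted R² rather than R² is the fairer yardstick when comparing models with different numbers of variables.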

Model Specifications

After many regressions, we arrived at what we believe to be the optimal model, given the data we are working with and the limitations placed upon us. To achieve this, we regressed SAT scores against every variation we considered (the inverse, square root, log, quadratic, and cubic forms) of our available independent variables. We first eliminated variables using the stepwise function. After that initial round of elimination, we removed variables with a VIF of over ten. Lastly, once the VIFs of all of our remaining variables were under ten, we eliminated variables that we could not show to be statistically significant. Through this process, we raised our adjusted R² from .281 to .328, which increases the explanatory power of our model. The summary of the final model is as follows:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .574a   .330       .328                114.277                      1.853
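The VIF screen described above can be sketched as follows. This is an illustration on synthetic data, not the computation we ran: the helper names `r_squared` and `vif` are our own, and the example columns are made up to show how two forms of the same variable can push each other's VIF above ten.

```python
import numpy as np


def r_squared(y, X):
    """R^2 from an OLS fit of y on X (X must already contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


def vif(X):
    """Variance inflation factor for each column of X (X has no constant column)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        # Regress column j on a constant plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        factors.append(1.0 / (1.0 - r_squared(X[:, j], others)))
    return np.array(factors)


# Synthetic illustration: x and x**2 over a narrow positive range are nearly collinear.
rng = np.random.default_rng(0)
x = rng.uniform(2.0, 4.0, size=500)
z = rng.normal(size=500)
X = np.column_stack([x, x ** 2, z])

vifs = vif(X)
flagged = vifs > 10  # the cutoff used in this report
print(np.round(vifs, 1), flagged)
```

In practice it is usually safer to drop one flagged variable at a time and recompute, since removing one collinear term often brings the others back under the cutoff.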

Explanation of the Coefficients

Now that we have our model we can move on to testing it for reliability, but before we do

that we will explain the meaning behind the independent variables we incorporated.

Female: If the person is female, then SAT scores are reduced by 61.38 points.

Athlete: If the person is an athlete, then SAT scores are reduced by 81.95.

Inv Colgpa: If the inverse of College GPA increases by 1, then SAT scores increase by 94 points.

Inv Hsize: If the inverse of High school size increases by 1, then SAT scores increase by 6.3 points.

Sqrt Hsize: If the square root of High school size increases by 1, then SAT scores increase by 31.37.

Black: If the person is black, then SAT scores decrease by 113.56.

Sqrt Colgpa: If the square root of College GPA increases by 1, then SAT scores increase by 237.84.

LN Hsperc: If the LN of high school percentage increases by 1, then SAT scores decrease by 49.93.

Hsperc: If the high school percentage increases by 1, then SAT scores increase by 1.14.

Squared Tothrs: If the square of total hours increases by 1, then SAT scores decrease by 0.001.

Verb/Math: If Verbal/Math increase by 1, then SAT scores increase by 25.68.
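Because College GPA enters the model in two forms (inverse and square root), neither coefficient alone describes what a one-point change in GPA does to the predicted score. The sketch below combines the two terms into a single marginal effect by differentiating them with respect to GPA. The coefficients come from our final model; the GPA values in the loop are hypothetical.

```python
import math

# Coefficients on the two College GPA terms in the final model.
B_INV_COLGPA = 94.03    # multiplies 1 / colgpa
B_SQRT_COLGPA = 237.84  # multiplies sqrt(colgpa)


def colgpa_contribution(colgpa: float) -> float:
    """Combined contribution of both College GPA terms to the predicted SAT score."""
    return B_INV_COLGPA / colgpa + B_SQRT_COLGPA * math.sqrt(colgpa)


def colgpa_marginal_effect(colgpa: float) -> float:
    """d(SAT)/d(colgpa) = -94.03 / colgpa^2 + 237.84 / (2 * sqrt(colgpa))."""
    return -B_INV_COLGPA / colgpa ** 2 + B_SQRT_COLGPA / (2.0 * math.sqrt(colgpa))


# Hypothetical GPA values, chosen only to show how the effect varies.
for gpa in (2.0, 3.0, 4.0):
    print(f"colgpa={gpa:.1f}: marginal effect = {colgpa_marginal_effect(gpa):.2f} points")
```

The same reasoning applies to High school size and high school percentage, which also appear in two forms each.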

Testing for Non-Linearity

One of the main problems in regression analysis comes from non-linear relationships between the independent variables and the dependent variable. It is not always easy to tell whether a relationship is linear, but there are a couple of methods we can use to account for possible non-linearity in our variables. The first is to inspect the data visually by plotting each independent variable against SAT scores. We did this and found a few variables that were questionable. We then incorporated all of the variations that we could think of into our regression and eliminated the ones that did not increase the explanatory power of our model. These two methods allowed us to account for the non-linearity in our variables, which in turn allowed us to further optimize our model.
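The idea of trying several functional forms and keeping the ones that explain the most variation can be sketched as below. This is a simplified, one-variable-at-a-time version on synthetic data; our actual selection was done inside a multiple regression with the stepwise function. The helper names and the generated data are assumptions made for the example.

```python
import numpy as np


def candidate_transforms(x):
    """Functional forms considered in this report, for a strictly positive x."""
    return {
        "linear": x,
        "inverse": 1.0 / x,
        "sqrt": np.sqrt(x),
        "log": np.log(x),
        "quadratic": x ** 2,
        "cubic": x ** 3,
    }


def simple_r_squared(y, x):
    """R^2 from regressing y on a constant and a single regressor x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


# Synthetic data whose true relationship is logarithmic.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 100.0, size=400)
y = 50.0 * np.log(x) + rng.normal(scale=5.0, size=400)

scores = {name: simple_r_squared(y, t) for name, t in candidate_transforms(x).items()}
best = max(scores, key=scores.get)
print(best, {name: round(value, 3) for name, value in scores.items()})
```

A scatter plot of each transformed variable against the dependent variable remains a useful companion check, since a high R² alone does not show where a form fits poorly.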

Testing for Heteroskedasticity

Inspecting several residual charts and graphs, we found no evidence of heteroskedasticity. The residual histogram shows that the residuals are normally distributed, and the residual scatter plot shows no obvious trend. The spread of the residuals is consistent with the Empirical Rule, with 99.7% of the data within three standard deviations of the expected value. Given these results and the Breusch-Pagan test, we decided that no correction by way of GLS was necessary.
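For readers unfamiliar with the Breusch-Pagan test, the sketch below computes its LM statistic in the common studentized (Koenker) form: regress the squared OLS residuals on the regressors and multiply the resulting R² by the sample size. The data are synthetic and the function names are our own; this is not the output of the software we used.

```python
import numpy as np


def ols_residuals(y, X):
    """Residuals from an OLS fit of y on X (X must contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def breusch_pagan_lm(y, X):
    """LM statistic n * R^2 from regressing squared residuals on X.

    Under homoskedasticity it is approximately chi-squared with
    (number of columns of X - 1) degrees of freedom.
    """
    e2 = ols_residuals(y, X) ** 2
    aux_resid = ols_residuals(e2, X)
    r2_aux = 1.0 - aux_resid @ aux_resid / np.sum((e2 - e2.mean()) ** 2)
    return len(y) * r2_aux


# Synthetic comparison: constant error variance vs. variance that grows with x.
rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
y_homo = 1.0 + 2.0 * x + rng.normal(scale=3.0, size=n)
y_hetero = 1.0 + 2.0 * x + x * rng.normal(size=n)

CHI2_CRIT_1DF_5PCT = 3.841  # 5% critical value, chi-squared with 1 degree of freedom
lm_homo = breusch_pagan_lm(y_homo, X)
lm_hetero = breusch_pagan_lm(y_hetero, X)
print(f"LM (homoskedastic) = {lm_homo:.2f}, LM (heteroskedastic) = {lm_hetero:.2f}")
```

An LM statistic below the critical value is consistent with the decision to leave the model uncorrected; a value well above it would point toward GLS or robust standard errors.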

Testing for Normality

There are two ways to assess whether the residuals of a model are normally distributed: graphically and quantitatively. We use two types of graphs to assess the normality of our regression. When we plot our standardized residuals in a P-P plot and a histogram, we see a distribution that looks very close to normal. These visual aids are helpful, but they cannot by themselves establish normality, so we also test it quantitatively using the Kolmogorov-Smirnov and Shapiro-Wilk tests. Based on these tests, we conclude that the residuals of our regression are in fact normally distributed.
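To make the quantitative check more concrete, the sketch below computes the Kolmogorov-Smirnov distance between standardized residuals and the standard normal distribution using only the Python standard library. It covers the test statistic only; because the mean and standard deviation are estimated from the same data, the usual K-S critical values do not apply directly and a corrected table (Lilliefors) is needed for a formal decision. The samples are synthetic stand-ins for residuals.

```python
import math
import random
import statistics


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ks_statistic_normal(residuals):
    """Largest gap between the empirical CDF of the standardized residuals
    and the standard normal CDF."""
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    z = sorted((r - mean) / sd for r in residuals)
    n = len(z)
    d = 0.0
    for i, value in enumerate(z):
        cdf = normal_cdf(value)
        # Compare against the empirical CDF just after and just before this point.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Synthetic stand-ins for residuals: one normal sample, one clearly skewed sample.
random.seed(0)
normal_like = [random.gauss(0.0, 114.0) for _ in range(1000)]
skewed = [random.expovariate(1.0) for _ in range(1000)]

d_normal = ks_statistic_normal(normal_like)
d_skewed = ks_statistic_normal(skewed)
print(f"D (normal sample) = {d_normal:.3f}, D (skewed sample) = {d_skewed:.3f}")
```

A small distance, as in the first sample, is what the P-P plot shows visually when the points hug the diagonal.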

Outlier Control

In order to obtain a better perspective on our outliers, we created a box plot to see where they fit in relation to our data as a whole. We also ran descriptive statistics on our standardized residuals, which allowed us to examine our outliers statistically. We found that most of our outliers lay at or within three standard deviations of the mean, and even the few that fell outside three standard deviations were within four.

Because of these statistics and the size of our sample, we decided that it would be best to leave our outliers in the model. We do not believe that they significantly skew our results, and keeping them may help us explain the variation in our model.
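The descriptive check on standardized residuals can be sketched as a simple count of how many observations fall beyond three and four standard deviations. The function names and the small residual list below are hypothetical and serve only to show the calculation.

```python
import statistics


def standardized(residuals):
    """Residuals rescaled to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    return [(r - mean) / sd for r in residuals]


def outlier_summary(residuals):
    """Counts of standardized residuals beyond 3 and 4 standard deviations."""
    z = [abs(v) for v in standardized(residuals)]
    return {
        "n": len(z),
        "beyond_3_sd": sum(v > 3.0 for v in z),
        "beyond_4_sd": sum(v > 4.0 for v in z),
        "share_within_3_sd": sum(v <= 3.0 for v in z) / len(z),
    }


# Hypothetical residuals: 100 ordinary values and one extreme value.
example_residuals = [-1.0, 1.0] * 50 + [10.0]
summary = outlier_summary(example_residuals)
print(summary)
```

With a large sample, a handful of residuals between three and four standard deviations is expected even under normality, which supports keeping such points in the model.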

Conclusion

After analyzing the data, we constructed a model that, as far as we can tell, explains the most variation of any model available to us. As we tested the model, we were able to refine it further and eliminate a great deal of the multi-collinearity and many of the insignificant factors. As a result, we improved our adjusted R² by over four percentage points (from .281 to .328) and arrived at this final model.

SAT = 675.28 - 61.38 Female - 81.95 Athlete + 94.03 (1/Colgpa) + 6.3 (1/Hsize) + 31.37 √Hsize
      - 113.56 Black + 237.84 √Colgpa - 49.93 ln(Hsperc) + 1.143 Hsperc - .001 Tothrs²
      + 25.68 Verb/Math
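As a worked illustration, the final model can be written as a small function. The pairing of each coefficient with its transformed variable follows the Explanation of the Coefficients section (inverse and square root of College GPA and High school size, natural log of high school percentage). The input values for the example student are hypothetical, and the report does not state the units of each variable, so the printed number should not be read as a realistic prediction.

```python
import math


def predict_sat(female, athlete, black, colgpa, hsize, hsperc, tothrs, verbmath):
    """Predicted SAT score from the final model's coefficients."""
    return (
        675.28
        - 61.38 * female
        - 81.95 * athlete
        + 94.03 / colgpa
        + 6.3 / hsize
        + 31.37 * math.sqrt(hsize)
        - 113.56 * black
        + 237.84 * math.sqrt(colgpa)
        - 49.93 * math.log(hsperc)
        + 1.143 * hsperc
        - 0.001 * tothrs ** 2
        + 25.68 * verbmath
    )


# Hypothetical student; every value here is made up for illustration.
student = dict(female=0, athlete=0, black=0, colgpa=3.0, hsize=4.0,
               hsperc=10.0, tothrs=60.0, verbmath=1.0)
print(round(predict_sat(**student), 1))

# Holding everything else fixed, switching a dummy variable shifts the
# prediction by exactly its coefficient.
gap = predict_sat(**{**student, "female": 1}) - predict_sat(**student)
print(round(gap, 2))
```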

Model Limitations

Due to the nature of this model, there are certain inherent limitations. Foremost among

these is its limited ability to predict future values. A model like this one is more of an

explanatory model. It lacks multiple years of historical data, which would allow it to then

predict future values based on the analyzed trends of older data. In addition to its lack of predictive power, this model also lacks fairly obvious variables, such as an SAT preparatory class dummy variable and a tutor dummy variable.

Possible Improvements

The major improvement that could be made to this model would be the addition of more variables. If we were able to add variables such as the affluence of the test taker’s family, the time spent studying for the SAT, and attendance at an SAT preparatory class, then we would likely be able to create a more accurate model. This improvement would allow us to account for a great deal more variation than we currently can within the scope of this data.

Appendix

IV. Final Model

IV.C.

IV.D.

IV.E.
