Regression Analysis:
SAT Score
By:
Matthew Adkins
John Michael Croft
Chima Iheme
Anthony Podolak
Table of Contents
Regression Analysis of SAT Scores
I.) Table of Contents
II.) Abstract
III.) Introduction
IV.) Model Specifications
    A.) Explanation of Coefficients
    B.) Test for Non-Linearity
    C.) Test for Heteroskedasticity
    D.) Normality Test
        i.) Histogram, PP Plot, other graphical methods
        ii.) Quantitative Analysis of Normality
    E.) Outlier Control
        i.) Decision to keep/disregard outliers
        ii.) Explanation for decision
V.) Conclusion
    A.) Final Model
    B.) Limitations of Model
    C.) Possible Improvements
Appendix: Includes relevant graphs, plots, and non-essential output reports.
Abstract
In this project, we analyzed a data set to determine how its variables relate to SAT scores. We
first created a baseline model against which we could compare all of our later results. We then
tested for non-linearity in our independent variables to ascertain the forms of the variables that
explained the most variation in SAT scores. After building a preliminary model, we tested it for
multi-collinearity, heteroskedasticity, and normality. Once we had adjusted our model to deal with
these issues, we decided to retain the outliers, as they did not seem to significantly affect our
model. Having taken these steps, we feel comfortable that our model accurately assesses the
information that we were given. However, we are constrained by the limited data; to build a truly
explanatory model, we would need more data.
Model Introduction
This project is a required assignment for our Econometrics course. It involves building a
model and testing that model for problems. In addition, it gives us an avenue to practice the
concepts that we learned this semester. As we proceed, we will lead you step by step through
the process so that you can intuitively understand how we arrived at our specific model.
Initially, we were given a data set that included our dependent variable, SAT score, and a
number of possible determinants of SAT scores. Based on this, we created a baseline model with
the following adjusted R².
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .531   .282       .281                118.239
This model is sub-optimal and is useful only as a starting point against which to compare our
later models. It does, however, show us whether we are progressing or regressing in our analysis
of this data.
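For reference, the adjusted R² reported in the Model Summary is the usual degrees-of-freedom correction of R². In the notation below, which is ours rather than the report's, n is the number of observations and k is the number of independent variables:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

Because the penalty grows with k, adding a variable raises adjusted R² only if it improves the fit enough to offset the lost degree of freedom.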
Model Specifications
After many regressions, we were able to create what we believe to be the optimal model,
given the data we are working with and the limitations placed upon us. In order to achieve this,
we regressed SAT scores against every candidate transformation of our available independent
variables (in this case the inverse, square root, log, quadratic, and cubic forms). We
worked to eliminate variables by first using the stepwise function. After we completed that
initial round of variable elimination, we focused on eliminating variables with a variance
inflation factor (VIF) of over ten. Lastly, once all of our remaining variables had VIFs under
ten, we eliminated variables that were not statistically significant. This process raised our
adjusted R² from .281 to .328, which increases the explanatory power of our model. The summary
of the final model is as follows:
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .574   .330       .328                114.277                      1.853
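The VIF screen described above can be sketched as follows. This is a minimal illustration, not the SPSS procedure used for the report: it relies on the identity that each predictor's VIF equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The function names, column names, and values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right-hand side.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """Map each predictor name to its variance inflation factor."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Hypothetical predictor values, for illustration only.
hsize = [0.5, 1.0, 2.0, 3.0, 4.5, 6.0]
example = {
    "hsize": hsize,
    "sqrt_hsize": [math.sqrt(v) for v in hsize],
    "hsperc": [30.0, 10.0, 45.0, 5.0, 60.0, 20.0],
}
for name, value in vifs(example).items():
    flag = "drop candidate" if value > 10 else "ok"
    print(f"{name}: VIF = {value:.1f} ({flag})")
```

A variable and its own transformation tend to be highly correlated, which is why a screen like this is useful after adding many functional forms of the same variable.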
Explanation of the Coefficients
Now that we have our model we can move on to testing it for reliability, but before we do
that we will explain the meaning behind the independent variables we incorporated.
Female: If the person is female, then SAT scores are reduced by 61.38 points.
Athlete: If the person is an athlete, then SAT scores are reduced by 81.95.
Inv Colgpa: If the inverse of College GPA increases by 1, then SAT scores increase by 94 points.
Inv Hsize: If the inverse of High school size increases by 1, then SAT scores increase by 6.3 points.
Sqrt Hsize: If the square root of High school size increases by 1, then SAT scores increase by 31.37.
Black: If the person is black, then SAT scores decrease by 113.56.
Sqrt Colgpa: If the square root of College GPA increases by 1, then SAT scores increase by 237.84.
LN Hsperc: If the LN of high school percentage increases by 1, then SAT scores decrease by 49.93.
Hsperc: If the high school percentage increases by 1, then SAT scores increase by 1.14.
Squared Tothrs: If the square of total hours increases by 1, then SAT scores decrease by 0.001.
Verb/Math: If Verbal/Math increase by 1, then SAT scores increase by 25.68.
Testing for Non-Linearity
One of the main problems that can be found in regression analysis comes from the non-
linearity of variables. It is not always easy to tell whether a variable is linear or not, but there are
a couple of methods we can use to account for possible non-linearity in our variables. The first
thing to do is visually inspect the data by plotting each independent variable out against SAT
scores. We did this and found a few variables that were questionable. We then incorporated all
of the possible variations that we could think of into our regression and eliminated the ones that
did not increase the explanatory power of our model. These two methods allowed us to account
for the non-linearity in our variables, which in turn allowed us to further optimize our model.
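The idea of generating every candidate form of a variable and keeping the ones that explain the most can be sketched as below. This is a simplified illustration under our own assumptions: the helper names are hypothetical, the variable is assumed to be strictly positive, and the forms are ranked by their one-variable R² against the outcome rather than by their contribution to the full multiple regression.

```python
import math


def candidate_forms(name, values):
    """Return the candidate functional forms of one strictly positive variable."""
    return {
        name: list(values),
        f"inv_{name}": [1.0 / v for v in values],
        f"sqrt_{name}": [math.sqrt(v) for v in values],
        f"ln_{name}": [math.log(v) for v in values],
        f"{name}_sq": [v ** 2 for v in values],
        f"{name}_cu": [v ** 3 for v in values],
    }


def simple_r2(x, y):
    """R^2 of a one-regressor OLS fit (the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def rank_by_r2(forms, y):
    """Sort the candidate forms from most to least explanatory for y."""
    scored = [(name, simple_r2(col, y)) for name, col in forms.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Synthetic data in which the outcome is exactly the square root of x.
x = [1.0, 4.0, 9.0, 16.0, 25.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
for name, r2 in rank_by_r2(candidate_forms("x", x), y):
    print(f"{name}: R^2 = {r2:.3f}")
```

In the synthetic example the square-root form ranks first because the outcome was constructed that way; with real data the ranking only suggests which forms deserve a place in the next regression.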
Testing for Heteroskedasticity
Our analysis of the residual output charts and graphs shows no evidence of heteroskedasticity.
The residual histogram shows that the residuals are approximately normally distributed, and the
residual scatter plot shows no obvious trend. The residuals are consistent with the Empirical
Rule, with 99.7% of them within three standard deviations of the expected value. In light of the
Breusch-Pagan test, we decided that no correction by way of GLS was necessary.
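For readers unfamiliar with the Breusch-Pagan test mentioned above, here is a minimal sketch of its n·R² (Koenker) form for the special case of a single regressor. It is an illustration on synthetic data, not the computation behind the report: the function names are ours, and with several regressors the auxiliary regression would include all of them and the degrees of freedom would change accordingly.

```python
def ols_residuals(x, y):
    """Residuals from the one-regressor OLS fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def simple_r2(x, y):
    """R^2 of a one-regressor OLS fit (the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


def breusch_pagan_lm(x, y):
    """LM statistic: n times the R^2 from regressing squared residuals on x."""
    squared = [e ** 2 for e in ols_residuals(x, y)]
    return len(x) * simple_r2(x, squared)


# 5% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_5PCT = 3.841

# Synthetic data: the second series has an error spread that grows with x.
x = list(range(1, 21))
y_constant_spread = [2 + 3 * xi + (-1) ** xi for xi in x]
y_growing_spread = [2 + 3 * xi + 0.5 * xi * (-1) ** xi for xi in x]

for label, y in [("constant", y_constant_spread), ("growing", y_growing_spread)]:
    lm = breusch_pagan_lm(x, y)
    verdict = "reject homoskedasticity" if lm > CHI2_1DF_5PCT else "no evidence"
    print(f"{label} spread: LM = {lm:.2f} ({verdict})")
```

A statistic below the critical value is the situation in which no GLS correction would be called for.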
Testing for Normality
There are two ways to assess whether a model's residuals are normally distributed: graphically
and quantitatively. We use two types of graphs. When we plot our standardized residuals in a
P-P plot and a histogram, we see a distribution that looks very close to normal. These visual
aids are helpful, but we cannot accurately determine whether the residuals are normally
distributed through this method alone; we also need to test it quantitatively. Thus, we use the
Kolmogorov-Smirnov and Shapiro-Wilk tests. Through these tests we are able to determine that our
residuals are in fact normally distributed.
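The quantitative side of the normality check can be illustrated with the Kolmogorov-Smirnov distance, which is the largest gap between the empirical distribution of the standardized residuals and the standard normal CDF. The sketch below computes only that distance, using the Python standard library; it does not produce a p-value, and the function name and sample data are our own. In practice a statistics package (for example `scipy.stats.kstest` or `scipy.stats.shapiro`) would supply the significance level.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(residuals):
    """K-S distance between standardized residuals and the standard normal CDF."""
    m, s = mean(residuals), stdev(residuals)
    z = sorted((r - m) / s for r in residuals)
    n = len(z)
    cdf = NormalDist().cdf
    # Compare the CDF with the empirical step function just after and just
    # before each sorted observation, and keep the largest gap.
    return max(max((i + 1) / n - cdf(v), cdf(v) - i / n)
               for i, v in enumerate(z))


# Synthetic samples: evenly spaced normal quantiles versus a heavily skewed series.
n = 99
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [2.0 ** i for i in range(20)]

print(f"normal-like sample: D = {ks_statistic_normal(normal_like):.3f}")
print(f"skewed sample:      D = {ks_statistic_normal(skewed):.3f}")
```

A small distance is what a histogram or P-P plot that "looks very normal" corresponds to numerically; a large one signals a departure that the plots alone might understate.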
Outlier Control
In order to obtain a better perspective on our outliers, we created a box plot to see where
they fall in relation to our data as a whole. We also ran descriptive statistics on our
standardized residuals, which allowed us to examine our outliers statistically. We found that
most of our outliers lay at or within three standard deviations of the mean, and even the few
that were outside three standard deviations were within four.
Because of these statistics and the size of our sample, we decided that it would be
best to leave our outliers in the model. We do not believe that they significantly skew our
results, and keeping them may help us explain the variation in SAT scores.
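The outlier check described above amounts to counting how many standardized residuals fall beyond three and beyond four standard deviations. A minimal sketch, with a hypothetical function name and made-up residual values:

```python
def outlier_bands(std_residuals):
    """Count standardized residuals by how far they lie from zero."""
    beyond3 = [r for r in std_residuals if abs(r) > 3]
    beyond4 = [r for r in beyond3 if abs(r) > 4]
    n = len(std_residuals)
    return {
        "n": n,
        "beyond_3sd": len(beyond3),
        "beyond_4sd": len(beyond4),
        "share_within_3sd": 1 - len(beyond3) / n,
    }


# Made-up standardized residuals, for illustration only.
sample = [0.1, -2.9, 3.2, -3.5, 1.0, 0.4, -0.7, 2.2, -1.3, 0.0]
print(outlier_bands(sample))
```

If the count beyond four standard deviations is zero and the share within three is close to the 99.7% suggested by the Empirical Rule, retaining the outliers is a defensible choice for a large sample.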
Conclusion
After analyzing the data, we constructed a model that, as far as we can tell, explains the
most variation of any model we could build from this data. As we tested the model, we were able
to further refine it and eliminate a great deal of the multi-collinearity and insignificant
factors. By doing all of this, we were able to improve our adjusted R² by over four percentage
points, from .281 to .328, resulting in this final model:
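$$
\begin{aligned}
\text{SAT} ={}& 675.28 - 61.38\,\text{Female} - 81.95\,\text{Athlete} + 94.03\,\frac{1}{\text{Colgpa}} + 6.3\,\frac{1}{\text{Hsize}} + 31.37\sqrt{\text{Hsize}} \\
& - 113.56\,\text{Black} + 237.84\sqrt{\text{Colgpa}} - 49.93\,\ln(\text{Hsperc}) + 1.143\,\text{Hsperc} - 0.001\,\text{Tothrs}^2 \\
& + 25.68\,\text{Verb/Math}
\end{aligned}
$$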
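To make the final model concrete, the sketch below evaluates it for a single student, pairing each coefficient with the variable named in the Explanation of the Coefficients section. The function name, argument names, and input values are hypothetical, and the units of the inputs are assumed to match those of the original data set, which this report does not state.

```python
import math


def predict_sat(female, athlete, black, colgpa, hsize, hsperc, tothrs, verbmath):
    """Evaluate the fitted equation for one student.

    female, athlete, and black are 0/1 dummies; colgpa, hsize, and hsperc
    must be strictly positive because of the inverse and log terms.
    """
    return (
        675.28
        - 61.38 * female
        - 81.95 * athlete
        + 94.03 / colgpa
        + 6.3 / hsize
        + 31.37 * math.sqrt(hsize)
        - 113.56 * black
        + 237.84 * math.sqrt(colgpa)
        - 49.93 * math.log(hsperc)
        + 1.143 * hsperc
        - 0.001 * tothrs ** 2
        + 25.68 * verbmath
    )


# Hypothetical student profile, for illustration only.
profile = dict(colgpa=3.0, hsize=2.5, hsperc=20.0, tothrs=60.0, verbmath=1.0)
baseline = predict_sat(female=0, athlete=0, black=0, **profile)
athlete = predict_sat(female=0, athlete=1, black=0, **profile)
print(f"predicted SAT, non-athlete: {baseline:.1f}")
print(f"predicted SAT, athlete:     {athlete:.1f}")
```

Holding everything else fixed, switching a dummy from 0 to 1 shifts the prediction by exactly that dummy's coefficient, which is how the coefficient explanations should be read.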
Model Limitations
Due to the nature of this model, there are certain inherent limitations. Foremost among
these is its limited ability to predict future values. A model like this one is more of an
explanatory model. It lacks multiple years of historical data, which would allow it to
predict future values based on trends in older data. In addition to its lack of
predictive power, this model also lacks fairly obvious variables, such as an SAT preparatory
class dummy variable and a tutor dummy variable.
Possible Improvements
The major improvement that could be made to this model would be the addition of
more variables. If we were able to add variables such as the affluence of the test taker’s
family, the time spent studying for the SAT, and enrollment in an SAT preparatory class, then we
would likely be able to create a more accurate model. This improvement would allow us to account
for a great deal more variation than we are currently able to with this data.
Appendix
IV. Final Model
IV.C. Test for Heteroskedasticity
IV.D. Normality Test
IV.E. Outlier Control