LESSON 30: MAKING INFERENCES ABOUT POPULATION PARAMETERS
© Copyright: Rai University 11.556
Essentially, when we run a regression we are estimating parameters on the basis of a sample of observations. Therefore y = a + bx, for example, is a sample regression line, in much the same way that x-bar is a sample estimate of the population parameter mu. Similarly, the population regression line, or the true relationship in the data, is Y = A + BX. This equation, however, is unknown, and we have to use sample data to estimate it. The true form of the unknown equation for the k-variable case is:
Y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk
Even in the case of the population regression plane, not all data points will lie on it. Why? Consider our IRS problem:
• Not all payments to informants will be equally effective.
• Some of the computer hours may be used for organizing data rather than analyzing accounts.
For these and other reasons, some of the data points will lie above the regression plane and some below it. Therefore, instead of satisfying the above equation exactly, the individual data points will satisfy:
y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk + e
This is the population regression plane plus a random disturbance term. The term e is a random disturbance which equals zero on average. The standard deviation of this term is sigma-e. The standard error of the regression, se, which we discussed in the earlier section, is an estimate of sigma-e. Our sample regression equation:
y-hat = a + b1x1 + b2x2 + b3x3 + ... + bkxk
estimates the unknown population regression plane:
Y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk
As we can see, the estimation of a regression plane can also be thought of as a problem of statistical inference, where we make inferences about an unknown population relationship on the basis of a relationship estimated from sample data. Much as in hypothesis testing for a mean, we can set up confidence intervals for the parameters of the estimated equation. We can also make inferences about the slope parameters of the true regression equation (B1, B2, ..., Bk) on the basis of the slope coefficients of the estimated equation (b1, b2, ..., bk).
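As a minimal sketch of this estimation step, the sample coefficients a, b1, ..., bk can be obtained by ordinary least squares. The data below are made-up illustrative numbers (not the IRS data from the text):

```python
import numpy as np

# Hypothetical sample with two explanatory variables x1, x2.
x1 = np.array([4.0, 6.0, 8.0, 5.0, 7.0, 9.0])
x2 = np.array([1.0, 2.0, 2.5, 1.5, 3.0, 3.5])
y  = np.array([7.1, 10.2, 12.8, 8.9, 12.1, 14.6])

# Design matrix [1, x1, x2]; least squares gives the sample
# estimates (a, b1, b2) of the unknown (A, B1, B2).
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs
print(a, b1, b2)
```

The fitted plane is the sample estimate; the true plane Y = A + B1x1 + B2x2 remains unknown, which is why the inference machinery of the following sections is needed.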
Tests of Inference for an Individual Slope Parameter Bi
As explained earlier, we can use the value of an individual bi, the estimated slope for the ith variable, to test a hypothesis about the value of Bi, the true population slope for the ith variable. The process of hypothesis testing is the same as that delineated for testing a mean.
When we perform a regression we are frequently interested in whether y actually depends on x. In our example, we may ask whether the volume of unpaid tax recovery actually depends on the number of computer hours of research the field researcher puts in. Essentially we are asking: is x a significant explanatory variable for y? There is some relationship between y and x if Bi is not 0; there is no relationship between x and y if Bi = 0. Thus we can formulate our hypotheses for the test of significance of the xi coefficient as follows:
Ho: Bi = 0 (null hypothesis: xi is not a significant explanatory variable for y)
Ha: Bi ≠ 0 (alternative hypothesis: xi is a significant explanatory variable for y)
We can test this hypothesis using the t ratio:
to = (bi − Bio) / sbi
where
bi: slope of the fitted regression
Bio: hypothesized value of the slope for the population
sbi: standard error of the regression coefficient
Why is the t Statistic Used?
In multiple regression we use n data points to estimate k + 1 coefficients: the intercept a and the slope coefficients b1, ..., bk. These coefficients are used to calculate se, which estimates sigma-e, the standard deviation of the disturbance in the data. We use se to estimate sbi.
Therefore, since se has n − k − 1 degrees of freedom, sbi also has n − k − 1 degrees of freedom. The value of sbi is given in the output in the Stdev column. Because our hypothesized value of Bi is 0, the standardized value of the regression coefficient becomes:
to = bi / sbi
The value of to is called the observed or computed t value. This is the number that appears in the column headed t-ratio in the computer output. We test the significance of the t ratio by checking the column headed p-value. This column gives the prob values for the two-tailed test of the hypotheses:
Ho: Bi = 0
Ha: Bi ≠ 0
The prob value is the probability that bi would be as far (or farther) from zero, the hypothesized value of Bi, if Ho is true. This is shown in Figure 2. We need only compare the p-value with alpha, the level of significance, to determine whether xi is a significant explanatory variable for y.
Figure 2
If p > alpha, xi is not a significant explanatory variable.
If p < alpha, xi is a significant explanatory variable.
This test of significance of an explanatory variable is always a two-tailed test. The independent variable xi is a significant explanatory variable if bi is significantly different from zero, which requires that the t ratio be a large positive or negative number. In our IRS example, p is less than 0.01 for each of the three explanatory variables, so we conclude that each one is a significant explanatory variable.
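The t test just described can be sketched numerically. The slope and standard error below are assumed values chosen for illustration, not figures from the lecture's output:

```python
from scipy import stats

# Hypothetical slope estimate and its standard error.
bi, s_bi = 0.597, 0.081
n, k = 10, 3                 # 10 observations, 3 predictors
df = n - k - 1               # degrees of freedom of the t test

# Observed t ratio under Ho: Bi = 0.
t0 = (bi - 0.0) / s_bi
# Two-tailed p-value: P(|T| >= |t0|) under Ho.
p_value = 2 * stats.t.sf(abs(t0), df)

alpha = 0.01
significant = p_value < alpha
print(t0, p_value, significant)
```

Here the t ratio is about 7.4 on 6 degrees of freedom, so the p-value falls well below alpha = 0.01 and xi would be judged a significant explanatory variable.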
Test of Significance of the Regression as a Whole
It is quite possible to get a high value of R² by pure chance. After all, if we threw darts at a board to generate a scatter plot, we could fit a regression to it, and that regression might conceivably have a high R². Therefore we need to ask: does a high value of R² necessarily mean that the independent variables explain a large proportion of the variation in y, or could it be a fluke? In statistical terms, we ask: is the regression as a whole significant? In the last section we looked at whether the individual xi were significant; now we ask whether all the xi (i = 1, ..., k) together significantly explain the variability in y. Our hypotheses are:
Ho: B1 = B2 = ... = Bk = 0 (null hypothesis: y does not depend on the xi)
Ha: at least one Bi ≠ 0 (alternative hypothesis: at least one Bi is not zero)
To explain this concept we go back to our initial diagram for the two-variable case.
Total variation in y: Σ(y − y-bar)²
Variation explained by the regression: Σ(y-hat − y-bar)²
Unexplained variation: Σ(y − y-hat)²
This is shown in Figure 3 for the one-variable case for simplicity; conceptually the same applies in the multiple-variable case.
Figure 3
Thus when we look at the variation in y we look at three different terms, each of which is a sum of squares. These are denoted as follows:
SST = total sum of squares = Σ(y − y-bar)²
SSR = regression sum of squares = Σ(y-hat − y-bar)²
SSE = error sum of squares = Σ(y − y-hat)²
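The identity SST = SSR + SSE holds for any least-squares fit that includes an intercept, and can be checked numerically; a small sketch on made-up data:

```python
import numpy as np

# Illustrative one-variable data set (arbitrary numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = a + bx by least squares and form the fitted values.
X = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coefs

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
print(sst, ssr + sse)                  # the two agree
```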
The total variation in y can be broken into two parts, the explained and the unexplained:
SST = SSR + SSE
Each of these has an associated number of degrees of freedom. SST has n − 1 degrees of freedom. SSR has k degrees of freedom because there are k independent variables. SSE has n − k − 1 degrees of freedom because we used n observations to estimate k + 1 parameters, a, b1, b2, ..., bk. To test the null hypothesis we compute the following F ratio:
F = (SSR / k) / (SSE / (n − k − 1))
If the null hypothesis is true, this ratio has an F distribution with k numerator degrees of freedom and n − k − 1 denominator degrees of freedom. If the null hypothesis is false, i.e., the explanatory variables have a significant effect on y, then the F ratio tends to be higher than when the null hypothesis is true. So if the F ratio is large, we reject the null hypothesis that the explanatory variables have no effect on the variation of y, and conclude that the regression is significant. Going back to our IRS example, we now look at the computer output. A typical regression output also includes the computed F ratio for the regression. This is sometimes called the ANOVA for the regression, because we break up the analysis of variation in y into explained variance (variance explained by the regression, the between-column variance) and unexplained variance (the within-column variance). This is shown in Table 3.
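With the sums of squares in hand, the F ratio is a one-line computation. The sketch below uses the IRS sums of squares quoted in Table 3:

```python
# F ratio for the overall significance of the regression,
# assuming SSR and SSE have already been computed from a fit.
def f_ratio(ssr, sse, n, k):
    """F = (SSR/k) / (SSE/(n-k-1))."""
    return (ssr / k) / (sse / (n - k - 1))

# IRS example: SST = 29.600, SSR = 29.1088, n = 10, k = 3.
sst, ssr = 29.600, 29.1088
sse = sst - ssr              # 0.4912, by SST = SSR + SSE
F = f_ratio(ssr, sse, n=10, k=3)
print(F)
```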
Table 3
Analysis of Variance
Source       DF    SS        MS       F        P
Regression    3    29.1088   9.7029   118.52   0.00
Error         6     0.4912   0.0819
Total         9    29.6000
The sample output for the IRS problem is given above. SSR = 29.109 with k = 3; SSE = 0.491 with n − k − 1 = 6 degrees of freedom.
F = (29.109 / 3) / (0.491 / 6) ≈ 118.5
The MS column is the sum of squares divided by its degrees of freedom. The output also gives the p-value, which is 0.00. Because p < alpha = 0.01, we conclude that the regression as a whole is highly significant.
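The reported p-value of 0.00 can be recovered from the upper tail of the F distribution with k = 3 and n − k − 1 = 6 degrees of freedom; a quick check:

```python
from scipy import stats

# P(F_{3,6} >= 118.52): the probability of an F ratio this
# large if the null hypothesis (all Bi = 0) were true.
F, k, df_err = 118.52, 3, 6
p_value = stats.f.sf(F, k, df_err)
print(p_value)
```

The result is far below 0.01, matching the 0.00 shown (to two decimals) in the ANOVA table.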
Exercises
Q1. Bill Buxton, a statistics professor in a leading business school, has a keen interest in the factors affecting students' performance on exams. The midterm exam for the past semester had a wide distribution of grades, but Bill feels certain that several factors explain the distribution: he allowed his students to study from as many different books as they liked, their IQs vary, they are of different ages, and they studied varying amounts of time for the exam. To develop a predicting formula for exam grades, Bill asked each student to answer, at the end of the exam, questions regarding study time and the number of books used. Bill's teaching record already contained the IQs and ages of the students, so he compiled the data for the class and ran a multiple regression with Minitab. The output from Bill's computer run was as follows:
Predictor    Coef       Stdev     t-ratio   P
Constant    -49.948     41.55     -1.20     0.268
Hours         1.06931    0.98163   1.09     0.312
IQ            1.36460    0.37627   3.63     0.008
Books         2.03982    1.50799   1.35     0.218
Age          -1.78990    0.67332  -2.67     0.319

S = 11.657   R-sq = 76.7%
a. What is the best-fitting regression equation for these data?
b. What percentage of the variation in grades is explained by this equation?
c. What grade would you expect for a 21-year-old student with an IQ of 113 who studied 5 hours and used three different books?
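One way to check the arithmetic of part (c) is to substitute the given values directly into the fitted equation built from the coefficients in the output above:

```python
# Coefficients taken from the Minitab output in the exercise.
coef = {"const": -49.948, "hours": 1.06931, "iq": 1.36460,
        "books": 2.03982, "age": -1.78990}

# Predicted grade for the student described in part (c).
grade = (coef["const"]
         + coef["hours"] * 5     # studied 5 hours
         + coef["iq"] * 113      # IQ of 113
         + coef["books"] * 3     # used three books
         + coef["age"] * 21)     # 21 years old
print(round(grade, 2))
```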
Notes