multiple regression

© Copy Right: Rai University11.556 179

RESEARCH METHODOLOGY

So far we have talked of regression analysis using only one independent explanatory variable. At this level the regression can also be estimated manually. However, it is very rare that we have just one explanatory variable, and the explanatory power of the estimated equation can be substantially improved by the addition of more independent variables. For example, in our earlier example of household consumption we can probably improve the explanatory power of the equation by adding more variables such as household size, age distribution of the household, etc.

However, when we use two or more independent variables the process of regression becomes much more complex, and it is not feasible to solve for the parameters of the equation manually. In practice, multiple regression analysis is almost always carried out by computer, which enables us to carry out complex calculations on large volumes of data easily. Therefore our stress when discussing multiple regression will be on understanding and interpreting computer output.

Multiple Regression Equation

The general form of the multiple regression equation is as follows. The three-variable case, for example, is:

\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + e \]

For the two-variable case the multiple regression equation is:

\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + e \]

The normal equations for this are as follows:

\[ \sum y = na + b_1 \sum x_1 + b_2 \sum x_2 \]

\[ \sum x_1 y = a \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \]

\[ \sum x_2 y = a \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 \]

These can be solved to obtain the values of the parameters a, b1 and b2.
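To see what solving the normal equations involves, here is a minimal Python sketch for the two-variable case. It builds the three normal equations from the sums of the data and solves them by Gaussian elimination. The function names and the small data set are our own illustrative assumptions (the data were made up so that y = 2 + 3x1 - x2 exactly); they do not come from the lesson.

```python
def dot(u, v):
    """Sum of element-wise products, e.g. the sum of x1*y."""
    return sum(a * b for a, b in zip(u, v))


def gauss_solve(A, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M[r][n] - tail) / M[r][r]
    return sol


def solve_normal_equations(y, x1, x2):
    """Return [a, b1, b2] from the three normal equations."""
    n = len(y)
    A = [
        [n,       sum(x1),     sum(x2)],
        [sum(x1), dot(x1, x1), dot(x1, x2)],
        [sum(x2), dot(x1, x2), dot(x2, x2)],
    ]
    rhs = [sum(y), dot(x1, y), dot(x2, y)]
    return gauss_solve(A, rhs)


# Made-up illustrative data satisfying y = 2 + 3*x1 - x2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [3, 7, 7, 11, 11]

a, b1, b2 = solve_normal_equations(y, x1, x2)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Because the illustrative data lie exactly on a plane, the solution recovers a = 2, b1 = 3 and b2 = -1. With real data the same three equations give the least-squares estimates.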

So far we have referred to a as the y-intercept and to b1 and b2 as the slopes of the multiple regression. More precisely, a, b1 and b2 are the estimated regression coefficients. The constant a is the value of y when both x1 and x2 are zero. The coefficients b1 and b2 describe how changes in x1 and x2 affect the value of y. Thus b1 measures the effect on y of a change in x1, holding x2 constant. Similarly, b2 measures the effect on y of a change in x2, holding x1 constant.

Simple linear regression estimates a regression line between two variables. In multiple regression there is a regression plane among y, x1 and x2. This regression plane is determined in the same way as the regression line, by minimizing the sum of squared deviations of the data points from the regression plane. Each independent variable accounts for some of the variation in the dependent variable. This is shown in Figure 1 below.

Figure 1

The Computer and Multiple Regression

A manager in any managerial situation deals with complex problems requiring large samples and several independent variables.

The generalized multiple regression model is specified for k independent variables with n data points. That is, for each of the k independent variables we have n data points. The regression equation that we estimate is:

\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k \]

This equation is estimated by the computer. We now look at how a statistical package such as SPSS or Minitab handles the data. An example will help make the process clearer.

Suppose the IRS in the US wishes to model the discovery of unpaid taxes. The model uses three independent variables (items 1 to 3) and one dependent variable (item 4):
1. No. of hours of field audit (00s)
2. No. of computer hours (00s)
3. Reward to informants ($000s)
4. Actual unpaid taxes discovered ($1,000,000s)

The data is shown in Table 1.

LESSON 29: MULTIPLE REGRESSION


Table 1

Month   Field audit   Comp hours   Reward to informers   Actual unpaid taxes
Jan     45            16           71                    29
Feb     42            14           70                    24
March   44            15           72                    27
Apr     43            13           71                    25
May     46            13           75                    26
Jun     44            14           74                    28
July    45            16           76                    30
Aug     44            16           69                    28
Sept    43            15           74                    28
Oct     42            15           73                    27

A regression is now run on Minitab, and the sample output is given in Table 2. How do we interpret this output?

1. The regression equation is of the form:

\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 \]

From the numbers given in the coefficient column we can read the estimating equation:

\[ \hat{y} = -45.8 + 0.597\,\text{Audit} + 1.18\,\text{Comp} + 0.405\,\text{Rewards} \]

How do we interpret this equation? The interpretation is similar to that of the one-variable simple linear regression case.

• If we hold the number of field audit labour hours (x1) and the number of computer hours (x2) constant and change rewards to informants (x3), then y will change by $405,000 for each additional $1,000 paid to informants.

• Similarly, holding x1 and x3 constant, an additional 100 hours of computer time will increase recoveries by $1,177,000.

• Similarly, holding x2 and x3 constant, an additional 100 hours of field audit will increase recoveries by $597,000.

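We can also use this equation to solve problems such as the following. Suppose that in November the IRS plans to leave field audit hours and computer hours at their October levels but increase rewards to $75,000. How much in recoveries can it expect to make in November? We can get a forecast value by substituting in the equation:

\[ \hat{y} = -45.8 + 0.597(43) + 1.18(15) + 0.405(75) \]

\[ \hat{y} = 27.905 \]

or approximately $28 million. (With the rounded coefficients shown, the arithmetic gives 27.946; the figure 27.905 is consistent with less heavily rounded coefficients in the computer output, such as the 1.177 implied for computer hours above.)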
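The substitution above can be checked with a few lines of Python. This is only a sketch: the dictionary and function names are our own, and the coefficients are the rounded values printed in the estimating equation, so the result differs slightly from the 27.905 quoted in the text.

```python
# Rounded coefficients as printed in the estimating equation.
COEF = {"intercept": -45.8, "audit": 0.597, "comp": 1.18, "rewards": 0.405}


def forecast_unpaid_taxes(audit, comp, rewards):
    """Forecast unpaid taxes discovered (in $ millions).

    audit and comp are in hundreds of hours, rewards in $ thousands.
    """
    return (COEF["intercept"]
            + COEF["audit"] * audit
            + COEF["comp"] * comp
            + COEF["rewards"] * rewards)


november = forecast_unpaid_taxes(audit=43, comp=15, rewards=75)
print(round(november, 3))  # 27.946, i.e. roughly $28 million
```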

Standard Error of the Regression

Now that we have our equation, we need some measure of the dispersion of the actual observations around the estimated regression plane. We can expect the estimation to be more accurate when the degree of dispersion around this regression plane is smaller. We measure this dispersion or variation by the standard error of the estimate, se:

\[ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - k - 1}} \]

where
y = sample values of the dependent variable
ŷ = corresponding estimated values from the regression equation
n = number of data points in the sample
k = number of independent variables (3 in our example)
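The formula for se translates directly into code. The sketch below is illustrative only: the function name is our own, and the observed and fitted values are made-up numbers, not the IRS data.

```python
import math


def standard_error_of_estimate(y, y_hat, k):
    """Return s_e = sqrt( sum((y - y_hat)^2) / (n - k - 1) )."""
    n = len(y)
    dof = n - k - 1  # degrees of freedom
    if dof <= 0:
        raise ValueError("need more data points than estimated parameters")
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    return math.sqrt(sse / dof)


# Hypothetical observed and fitted values from a regression with k = 2.
observed = [3, 5, 7, 10, 12, 15]
fitted = [2, 6, 7, 9, 13, 15]

se = standard_error_of_estimate(observed, fitted, k=2)
print(round(se, 4))  # sum of squared errors is 4, dof is 3, so sqrt(4/3)
```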

The denominator of this equation shows that in a regression with k independent variables, the standard error has n - k - 1 degrees of freedom. This is because one more degree of freedom is used up by the estimation of the intercept term a. Thus we have k + 1 parameters to estimate from the sample data.

The standard error of the regression is also called the root mean square error, that is, the square root of the mean square error (MSE). In our sample output this is indicated by s. The standard error of the regression in our problem is 0.286.

We can also use the standard error of estimate and the t distribution to form an approximate confidence interval around our estimated value. The t value at the 95% level of confidence, given our n - k - 1 = 6 degrees of freedom, is 2.447. For example, in our problem, for values of

x1 = 4,300 hours
x2 = 1,500 hours
x3 = $75,000

our estimate for y is $27,905,000 and our se is $286,000. If we want to construct a 95% confidence interval around this estimate of $27,905,000, we can do it as follows:

ŷ ± t(se):
$27,905,000 + 2.447($286,000) = $28,604,800 (upper limit)
$27,905,000 - 2.447($286,000) = $27,205,200 (lower limit)

The standard error of the estimate measures the dispersion of data points around the regression plane. Smaller values of se indicate a better regression. If the addition of another variable reduces se, then we say that the inclusion of that variable improves the fit of the regression.
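The interval arithmetic can be reproduced with a short Python sketch. The variable names are our own; the estimate, the standard error and the t value of 2.447 (6 degrees of freedom) are the figures used in the calculation above. The limits quoted in the text round the margin to the nearest hundred dollars.

```python
y_hat = 27_905_000  # point estimate of unpaid taxes discovered, in dollars
s_e = 286_000       # standard error of estimate, in dollars
t_value = 2.447     # t for 95% confidence with n - k - 1 = 6 degrees of freedom

margin = t_value * s_e
lower = y_hat - margin
upper = y_hat + margin

print(f"margin: {margin:,.0f}")  # 699,842
print(f"lower:  {lower:,.0f}")   # 27,205,158
print(f"upper:  {upper:,.0f}")   # 28,604,842
```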

The Coefficient of Multiple Determination

In a multiple regression we measure the strength of the relationship between the three independent variables and the dependent variable by the coefficient of multiple determination, R2. It is defined as follows: R2 is the proportion of the total variation in y that is explained by the regression plane.

In our example we have R2 = 98.3%. This tells us that 98.3% of the variation in unpaid taxes is explained by the three independent variables. As we add more variables to a regression, the explanatory power of the equation improves if R2 increases.
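As a rough illustration of "proportion of total variation explained", the sketch below computes R2 as one minus the ratio of unexplained variation to total variation, which is the usual form for a least-squares fit with an intercept. The function name and the observed and fitted values are hypothetical; they are not taken from the IRS output.

```python
def r_squared(y, y_hat):
    """Return R^2 = 1 - (unexplained variation / total variation)."""
    mean_y = sum(y) / len(y)
    total = sum((obs - mean_y) ** 2 for obs in y)                    # total variation in y
    unexplained = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # left over after the fit
    return 1 - unexplained / total


# Hypothetical observed values and fitted values from some regression.
observed = [3, 5, 7, 10, 12, 15]
fitted = [2, 6, 7, 9, 13, 15]

r2 = r_squared(observed, fitted)
print(f"{r2:.1%}")  # about 96.1% of the variation is explained
```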

Example

Pam Schneider owns and operates an accounting firm in Ithaca, New York. Pam feels that it would be useful to be able to predict in advance the number of rush income-tax returns during the busy March 1 to April 15 period so that she can better plan her personnel needs during this time. She has hypothesized that several factors may be useful in her prediction. Data for these factors and the number of rush returns for past years are as follows:

X1               X2                                   X3                         Y
Economic index   Population within 1 mile of office   Average income in Ithaca   Number of rush returns, March 1 to April 15
99               10188                                21465                      2306
106              8566                                 22228                      1266
100              10557                                27665                      1422
129              10219                                25200                      1721
179              9662                                 26300                      2544

a. Use the following Minitab output to determine the best-fitting regression equation for these data:

The regression equation is

Predictor   Coef      Stdev    T-ratio   P
Constant    -1275     2699     -0.47     0.719
X1          17.059    6.098    2.47      0.245
X2          0.5406    0.3144   1.72      0.335
X3          -0.1743   0.1005   -1.73     0.333

S = 396.1   R-sq = 87.2%

b. What percentage of the total variation in the number of rush returns is explained by this equation?

c. For this year, the economic index is 169, the population within 1 mile of the office is 10,212, and the average income in Ithaca is $26,925. How many rush returns should Pam expect to process between March 1 and April 15?

Results

a. ŷ = -1275 + 17.059 X1 + 0.5406 X2 - 0.1743 X3

b. R2 = 87.2%; 87.2% of the total variation in Y is explained by the model.

c. ŷ = -1275 + 17.059(169) + 0.5406(10,212) - 0.1743(26,925) = 2436 rush returns.
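The prediction in the results can be verified with a small Python sketch. The function name is our own; the coefficients are those of the fitted equation given in the results.

```python
def predict_rush_returns(economic_index, population, avg_income):
    """Predicted number of rush returns from the fitted equation in the results."""
    return (-1275
            + 17.059 * economic_index
            + 0.5406 * population
            - 0.1743 * avg_income)


this_year = predict_rush_returns(economic_index=169,
                                 population=10212,
                                 avg_income=26925)
print(round(this_year))  # 2436 rush returns
```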

Exercises

Q1. Given the following set of data, use whatever computer package is available to find the best-fitting regression equation and answer the following:

a. What is the regression equation?
b. What is the standard error of estimate?
c. What is R2 for this regression?
d. What is the predicted value for Y when X1 = 5.8, X2 = 4.2 and X3 = 5.1?

Y      X1    X2    X3
64.7   3.5   5.3   8.5
80.9   7.4   1.6   2.6
24.6   2.5   6.3   4.5
43.9   3.7   9.4   8.8
77.7   5.5   1.4   3.6
20.6   8.3   9.2   2.5
66.9   6.7   2.5   2.7
34.3   1.2   2.2   1.3

Q2. Given the following set of data, use whatever computer package is available to find the best-fitting regression equation and answer the following:

a. What is the regression equation?
b. What is the standard error of estimate?
c. What is R2 for this regression?
e. Give an approximate 95 percent confidence interval for the value of Y when the values of X1, X2, X3 and X4 are 52.4, 41.6, 35.8 and 3, respectively.

Q3. We are trying to predict the annual demand for widgets (Demand) using the following independent variables:
Price = price of widgets (in $)
Income = consumer income (in $)
Sub = price of a substitute commodity (in $)

(Note: A substitute commodity is one that can be substituted for another commodity. For example, margarine is a substitute commodity for butter.)

Year Demand Price ($) Income Sub ($)

1982 40 9 400 10

1983 45 8 500 14

1984 50 9 600 12

1985 55 8 700 13

1986 60 7 800 11

1987 70 6 900 15

1988 65 6 1000 26

1989 65 8 1100 27

1990 75 5 1200 22

1991 75 5 1300 19

1992 80 5 1400 20

1993 100 3 1500 23

1994 90 4 1600 18

1995 95 3 1700 24

1996 85 4 1800 21


a. Using whatever computer package is available, determine the best-fitting regression equation for these data.

b. Are the signs (+ or -) of the regression coefficients of the independent variables as one would expect? Explain briefly.

c. State and interpret the coefficient of multiple determination for this problem.

d. State and interpret the standard error of estimate for this problem.

e. Using the equation, what would you predict for Demand if the price of widgets was $6, consumer income was $1200 and the price of the substitute commodity was $17?
