
Department of Applied Mathematics Chapter 4. Linear regression and correlation analysis

Marjorie Chiu, 2009 4-1

Chapter 4. Linear regression and correlation analysis

4.1. Linear regression

Linear regression is used to infer the relationship between a dependent variable and its regressors from a set of sample data, and predictions can be made from the fitted model. In general, the true model is

  Y = β₀ + β₁x₁ + ⋯ + βₚxₚ + E = μ(Y | x₁, …, xₚ) + E,

where Y is the dependent (response) variable that depends on the xᵢ's (independent variables, regressors or predictors) and E is the error, which is normally distributed with zero mean and independent of the regressors. Given a set of data of size n, the model becomes

  yᵢ = β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ + εᵢ,  i = 1, …, n,  with εᵢ ~ N(0, σ²) independently;

or in matrix form

  ⎡ y₁ ⎤   ⎡ 1  x₁₁ ⋯ x₁ₚ ⎤ ⎡ β₀ ⎤   ⎡ ε₁ ⎤
  ⎢ y₂ ⎥ = ⎢ 1  x₂₁ ⋯ x₂ₚ ⎥ ⎢ β₁ ⎥ + ⎢ ε₂ ⎥
  ⎢ ⋮  ⎥   ⎢ ⋮    ⋮      ⋮  ⎥ ⎢ ⋮  ⎥   ⎢ ⋮  ⎥
  ⎣ yₙ ⎦   ⎣ 1  xₙ₁ ⋯ xₙₚ ⎦ ⎣ βₚ ⎦   ⎣ εₙ ⎦

that is, y = Xβ + ε.

The vector of the least squares estimates b = (b₀, b₁, …, bₚ)′ is given by

  b = (X′X)⁻¹X′y.

The fitted regression equation is

  ŷᵢ = b₀ + b₁xᵢ₁ + ⋯ + bₚxᵢₚ,

with residual eᵢ = yᵢ − ŷᵢ. The fit of the regression line is measured by the coefficient of determination R², where 0 ≤ R² ≤ 1.
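The least squares estimate above can be computed directly from the normal equations (X′X)b = X′y. A minimal pure-Python sketch, with made-up data for illustration (the chapter's own examples use SPSS throughout):

```python
# Least squares sketch: b = (X'X)^(-1) X'y, computed here by solving
# the normal equations (X'X) b = X'y with Gaussian elimination.
# The data below are made up for illustration.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Design matrix with a leading column of ones for the intercept b0.
X = [[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]]
y = [[3.1], [4.9], [7.2], [8.8]]

XtX = matmul(transpose(X), X)
Xty = [row[0] for row in matmul(transpose(X), y)]
b = solve(XtX, Xty)          # [b0, b1]
```

For this toy data set the solution is b₀ = 1.15, b₁ = 1.94, which can be verified against the simple-regression closed form b₁ = Σ(xᵢ−x̄)(yᵢ−ȳ)/Σ(xᵢ−x̄)².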

Example 1: (car.sav)

• Fit a multiple linear regression model for miles on weight and temperature

Analyze → Regression → Linear


Dependent: miles

Independents: weight, temperature

Method: enter

• Analyze the results

Remark: When there are a number of candidate variables that might be considered as regressors in the regression model, we can use the Enter, Stepwise, Backward or Forward methods to choose an appropriate subset of variables for the model.

Regression (Enter)

Variables Entered/Removed(b)
  Model 1 — Variables Entered: Temperature in Fahrenheit, Weight in tons; Method: Enter
  a. All requested variables entered.
  b. Dependent Variable: Miles per gallon

Model Summary
  Model 1: R = .795(a), R Square = .632, Adjusted R Square = .527, Std. Error of the Estimate = .7412
  a. Predictors: (Constant), Temperature in Fahrenheit, Weight in tons

Notes on the output: R Square is the proportion of variability in y explained by the linear model; R is the absolute value of the correlation coefficient; Adjusted R Square measures goodness of fit in multiple regression, adjusted for the degrees of freedom (the number of regressors).


63.2% of the variation in “mile” is explained by the regression model.
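The Adjusted R Square in the Model Summary can be recomputed from R Square by penalizing for the number of regressors p. A quick Python check using the values reported above (R² = .632, n = 10 observations, p = 2):

```python
# Adjusted R-square penalizes R-square for the number of regressors p:
#   adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# Values below are taken from the Model Summary (Enter method).
r2, n, p = 0.632, 10, 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.527, matching the table
```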

The ANOVA (Analysis of Variance) approach is used to test the hypothesis

  H₀: β₁ = β₂ = ⋯ = βₚ = 0
  H₁: at least one of the βᵢ's is non-zero

The significance of an individual regressor is tested by the t-test:

  H₀: βᵢ = 0
  H₁: βᵢ ≠ 0

The fitted regression equation is

  m̂ile = 21.994 − 2.848 × weight − 0.00762 × temperature.

If the temperature is fixed and the weight of an automobile is increased by 1 ton, the gasoline mileage is expected to decrease by 2.848 miles per gallon. If the weight of an automobile is unchanged, then a one-degree Fahrenheit increase in temperature will decrease the gasoline mileage by 0.00762 miles per gallon. From the t-test results below, only “weight” is significant and “temperature” is not significant at the 5% level of significance.
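The fitted equation can be used directly for prediction. A short sketch; the inputs (a 2-ton car at 60 °F) are made-up values for illustration:

```python
# Prediction from the fitted equation
#   mile-hat = 21.994 - 2.848*weight - 0.00762*temperature
# The inputs below are made-up values, not from car.sav.
def predict_mile(weight, temperature):
    return 21.994 - 2.848 * weight - 0.00762 * temperature

print(round(predict_mile(2.0, 60.0), 3))  # 15.841 miles per gallon
```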

Coefficients(a)
  Model 1 (Unstandardized B, Std. Error; Standardized Beta; t; Sig.):
    (Constant): B = 21.994, SE = 1.487, t = 14.794, Sig. = .000
    Weight in tons: B = −2.848, SE = .845, Beta = −.774, t = −3.372, Sig. = .012
    Temperature in Fahrenheit: B = −0.00762, SE = .012, Beta = −.147, t = −.642, Sig. = .541
  a. Dependent Variable: Miles per gallon

The t-tests assess the significance of the individual regressors in the model; the variable “temperature” is not significant at the 5% level of significance.

ANOVA(b)
  Model 1: Regression SS = 6.614, df = 2, MS = 3.307, F = 6.020, Sig. = .030(a)
           Residual   SS = 3.846, df = 7, MS = .549
           Total      SS = 10.460, df = 9
  a. Predictors: (Constant), Temperature in Fahrenheit, Weight in tons
  b. Dependent Variable: Miles per gallon

F is the test statistic for the significance of the regression model. The probability of F greater than 6.020 equals 0.030, so the regression model is significant at the 5% level of significance.
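The F statistic in the ANOVA table is simply the ratio of the two mean squares. Recomputing it from the sums of squares reported above:

```python
# The ANOVA F statistic is the ratio of mean squares, MS = SS / df.
# Values below are taken from the ANOVA table (Enter method).
ss_reg, df_reg = 6.614, 2
ss_res, df_res = 3.846, 7
F = (ss_reg / df_reg) / (ss_res / df_res)
print(round(F, 2))  # 6.02, matching F = 6.020 in the table up to rounding
```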


Regression (Stepwise)

Variables Entered/Removed(a)
  Model 1 — Variables Entered: Weight in tons; Method: Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100)
  a. Dependent Variable: Miles per gallon

Model Summary
  Model 1: R = .781(a), R Square = .611, Adjusted R Square = .562, Std. Error of the Estimate = .7134
  a. Predictors: (Constant), Weight in tons

ANOVA(b)
  Model 1: Regression SS = 6.388, df = 1, MS = 6.388, F = 12.550, Sig. = .008(a)
           Residual   SS = 4.072, df = 8, MS = .509
           Total      SS = 10.460, df = 9
  a. Predictors: (Constant), Weight in tons
  b. Dependent Variable: Miles per gallon

Coefficients(a)
  Model 1:
    (Constant): B = 21.639, SE = 1.329, t = 16.284, Sig. = .000
    Weight in tons: B = −2.876, SE = .812, Beta = −.781, t = −3.543, Sig. = .008
  a. Dependent Variable: Miles per gallon


The fitted regression equation is m̂ile = 21.639 − 2.876 × weight. If the weight of an automobile is increased by 1 ton, the gasoline mileage is expected to decrease by 2.876 miles per gallon.

Regression (Backward)

Variables Entered/Removed(b)
  Model 1 — Variables Entered: Temperature in Fahrenheit, Weight in tons; Method: Enter
  Model 2 — Variables Removed: Temperature in Fahrenheit; Method: Backward (criterion: Probability of F-to-remove >= .100)
  a. All requested variables entered.
  b. Dependent Variable: Miles per gallon

Model Summary
  Model 1: R = .795(a), R Square = .632, Adjusted R Square = .527, Std. Error of the Estimate = .7412
  Model 2: R = .781(b), R Square = .611, Adjusted R Square = .562, Std. Error of the Estimate = .7134
  a. Predictors: (Constant), Temperature in Fahrenheit, Weight in tons
  b. Predictors: (Constant), Weight in tons

Excluded Variables(b)
  Model 2 — Temperature in Fahrenheit: Beta In = −.147(a), t = −.642, Sig. = .541, Partial Correlation = −.236, Collinearity Statistics Tolerance = .997
  a. Predictors in the Model: (Constant), Weight in tons
  b. Dependent Variable: Miles per gallon

Notes on the output: Tolerance is the proportion of a variable's variance not accounted for by the other independent variables in the equation (weight). The partial correlation is the correlation that remains between two variables (mile & temperature) after removing the correlation that is due to their mutual association with the other variables (weight).


The fitted regression equation is m̂ile = 21.639 − 2.876 × weight.

ANOVA(c)
  Model 1: Regression SS = 6.614, df = 2, MS = 3.307, F = 6.020, Sig. = .030(a); Residual SS = 3.846, df = 7, MS = .549; Total SS = 10.460, df = 9
  Model 2: Regression SS = 6.388, df = 1, MS = 6.388, F = 12.550, Sig. = .008(b); Residual SS = 4.072, df = 8, MS = .509; Total SS = 10.460, df = 9
  a. Predictors: (Constant), Temperature in Fahrenheit, Weight in tons
  b. Predictors: (Constant), Weight in tons
  c. Dependent Variable: Miles per gallon

Coefficients(a)
  Model 1:
    (Constant): B = 21.994, SE = 1.487, t = 14.794, Sig. = .000
    Weight in tons: B = −2.848, SE = .845, Beta = −.774, t = −3.372, Sig. = .012
    Temperature in Fahrenheit: B = −0.00762, SE = .012, Beta = −.147, t = −.642, Sig. = .541
  Model 2:
    (Constant): B = 21.639, SE = 1.329, t = 16.284, Sig. = .000
    Weight in tons: B = −2.876, SE = .812, Beta = −.781, t = −3.543, Sig. = .008
  a. Dependent Variable: Miles per gallon

Excluded Variables(b)
  Model 2 — Temperature in Fahrenheit: Beta In = −.147(a), t = −.642, Sig. = .541, Partial Correlation = −.236, Tolerance = .997
  a. Predictors in the Model: (Constant), Weight in tons
  b. Dependent Variable: Miles per gallon


Regression (Forward)

The fitted regression equation is m̂ile = 21.639 − 2.876 × weight.

Variables Entered/Removed(a)
  Model 1 — Variables Entered: Weight in tons; Method: Forward (Criterion: Probability-of-F-to-enter <= .050)
  a. Dependent Variable: Miles per gallon

Model Summary
  Model 1: R = .781(a), R Square = .611, Adjusted R Square = .562, Std. Error of the Estimate = .7134
  a. Predictors: (Constant), Weight in tons

ANOVA(b)
  Model 1: Regression SS = 6.388, df = 1, MS = 6.388, F = 12.550, Sig. = .008(a); Residual SS = 4.072, df = 8, MS = .509; Total SS = 10.460, df = 9
  a. Predictors: (Constant), Weight in tons
  b. Dependent Variable: Miles per gallon

Coefficients(a)
  Model 1:
    (Constant): B = 21.639, SE = 1.329, t = 16.284, Sig. = .000
    Weight in tons: B = −2.876, SE = .812, Beta = −.781, t = −3.543, Sig. = .008
  a. Dependent Variable: Miles per gallon

Excluded Variables(b)
  Model 1 — Temperature in Fahrenheit: Beta In = −.147(a), t = −.642, Sig. = .541, Partial Correlation = −.236, Tolerance = .997
  a. Predictors in the Model: (Constant), Weight in tons
  b. Dependent Variable: Miles per gallon


Using these variable-selection methods, only one regressor, weight, is significant in the regression model.

4.2. Diagnosis of the regression model

In fitting the regression model, the error terms are assumed to follow a normal distribution with zero mean and constant variance, and to be independent. We should check whether these assumptions are fulfilled, as violations may affect the validity of the regression analysis.

Analyze → Regression → Linear

Select the Durbin-Watson statistic to test for first-order autocorrelation of the errors. Use the residual plots to examine the distribution of the errors: with a normal probability plot or histogram, check whether the errors follow a normal distribution. Plot the standardized residuals against the standardized fitted values of the dependent variable to see whether the error variance is constant. Save the standardized residuals to detect outliers. Use scatter plots to obtain further residual plots of the standardized residuals against the regressors.

Remark: remedial action should be taken if the assumptions about the errors are violated, for example:
• Polynomial regression
• Reciprocal transformation of the x variable
• Log transformation of the x variable
• Log transformation of both the x and y variables
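The Durbin-Watson statistic mentioned above is easy to compute by hand from the residuals. A sketch of the standard formula; the residual series below is made up for illustration:

```python
# Durbin-Watson statistic for first-order autocorrelation of residuals:
#   DW = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2
# DW ranges from 0 to 4; values near 2 suggest no first-order
# autocorrelation. The residuals below are made up for illustration.
def durbin_watson(e):
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1]
dw = durbin_watson(residuals)
```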

Example 2: (tele)

A company that sells transportation services uses a telemarketing division to help sell

its services. The division manager is interested in the time spent on the phone by the

telemarketers in the division. Data on the number of months of employment and the number of calls placed per day (an average over 20 working days) are recorded for 20 employees. These data are saved in the file “tele.sav”. Performing regression analysis yields the following results.


Regression

Variables Entered/Removed(b)
  Model 1 — Variables Entered: MONTHS(a); Method: Enter
  a. All requested variables entered.
  b. Dependent Variable: CALLS

Model Summary(b)
  Model 1: R = .935(a), R Square = .874, Adjusted R Square = .867, Std. Error of the Estimate = 1.79, Durbin-Watson = .570
  a. Predictors: (Constant), MONTHS
  b. Dependent Variable: CALLS

ANOVA(b)
  Model 1: Regression SS = 397.446, df = 1, MS = 397.446, F = 124.409, Sig. = .000(a); Residual SS = 57.504, df = 18, MS = 3.195; Total SS = 454.950, df = 19
  a. Predictors: (Constant), MONTHS
  b. Dependent Variable: CALLS

Coefficients(a)
  Model 1 (with 95% confidence intervals for the regression coefficients B):
    (Constant): B = 13.671, SE = 1.427, t = 9.580, Sig. = .000, 95% CI = (10.673, 16.669)
    MONTHS: B = .744, SE = .067, Beta = .935, t = 11.154, Sig. = .000, 95% CI = (.603, .884)
  a. Dependent Variable: CALLS

The Durbin-Watson statistic equals 0.57, which is less than dL = 1.20 at the 5% level of significance; therefore the error terms have first-order autocorrelation.


Residuals Statistics(a)
  Predicted Value: Min = 21.11, Max = 35.98, Mean = 28.95, Std. Deviation = 4.57, N = 20
  Residual: Min = −3.11, Max = 2.97, Mean = 8.88E-16, Std. Deviation = 1.74, N = 20
  Std. Predicted Value: Min = −1.715, Max = 1.536, Mean = .000, Std. Deviation = 1.000, N = 20
  Std. Residual: Min = −1.738, Max = 1.663, Mean = .000, Std. Deviation = .973, N = 20
  a. Dependent Variable: CALLS

Charts

[Histogram of the regression standardized residuals (Dependent Variable: CALLS); Std. Dev = .97, Mean = 0.00, N = 20]


[Scatterplot of the regression standardized residuals against the standardized predicted values (Dependent Variable: CALLS)]

A systematic pattern can be observed in both of the residual plots. The standardized residuals plot in a curvilinear pattern, suggesting that a curvilinear component may be omitted from the equation expressing the relationship between CALLS and MONTHS. The plots of the standardized residuals versus the fitted values and versus MONTHS show identical patterns in this case.

[Normal P-P plot of the regression standardized residuals (Dependent Variable: CALLS)]

If the normal probability plot is close to a straight line, the error terms follow a normal distribution.


A scatter plot is used to produce a residual plot of the standardized residuals against the regressor:

Graphs → Legacy Dialogs → Scatter → Simple
  y-axis: zre_1
  x-axis: months

[Scatterplot of the standardized residuals against MONTHS]

A second-order polynomial regression will be tried.

Regression

Variables Entered/Removed(b)
  Model 1 — Variables Entered: MONTHS2, MONTHS(a); Method: Enter
  a. All requested variables entered.
  b. Dependent Variable: CALLS


Model Summary(b)
  Model 1: R = .981(a), R Square = .962, Adjusted R Square = .958, Std. Error of the Estimate = 1.00, Durbin-Watson = 2.092
  a. Predictors: (Constant), MONTHS2, MONTHS
  b. Dependent Variable: CALLS

ANOVA(b)
  Model 1: Regression SS = 437.839, df = 2, MS = 218.920, F = 217.503, Sig. = .000(a); Residual SS = 17.111, df = 17, MS = 1.007; Total SS = 454.950, df = 19
  a. Predictors: (Constant), MONTHS2, MONTHS
  b. Dependent Variable: CALLS

Coefficients(a)
  Model 1 (with 95% confidence intervals for B):
    (Constant): B = −.140, SE = 2.323, t = −.060, Sig. = .952, 95% CI = (−5.041, 4.760)
    MONTHS: B = 2.310, SE = .250, Beta = 2.904, t = 9.236, Sig. = .000, 95% CI = (1.782, 2.838)
    MONTHS2: B = −.0401, SE = .006, Beta = −1.992, t = −6.335, Sig. = .000, 95% CI = (−.053, −.027)
  a. Dependent Variable: CALLS

The fitted model becomes ĉalls = −0.14 + 2.310 × months − 0.0401 × months².
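The quadratic fit can likewise be used for prediction. A sketch; the input (20 months of employment) is a made-up value for illustration:

```python
# Fitted quadratic model from the Coefficients table:
#   calls-hat = -0.14 + 2.310*months - 0.0401*months^2
# The input below is a made-up value, not from tele.sav.
def predict_calls(months):
    return -0.14 + 2.310 * months - 0.0401 * months ** 2

print(round(predict_calls(20), 2))  # 30.02 calls per day
```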

Residuals Statistics(a)
  Predicted Value: Min = 18.95, Max = 33.12, Mean = 28.95, Std. Deviation = 4.80, N = 20
  Residual: Min = −1.54, Max = 1.73, Mean = 3.73E-15, Std. Deviation = .95, N = 20
  Std. Predicted Value: Min = −2.083, Max = .868, Mean = .000, Std. Deviation = 1.000, N = 20
  Std. Residual: Min = −1.536, Max = 1.728, Mean = .000, Std. Deviation = .946, N = 20
  a. Dependent Variable: CALLS

The Durbin-Watson statistic has value 2.092, which is greater than dU = 1.54 at the 5% level of significance. Hence the error terms of this regression model do not have first-order autocorrelation.


Charts

[Histogram of the regression standardized residuals (Dependent Variable: CALLS); Std. Dev = .95, Mean = 0.00, N = 20]

[Normal P-P plot of the regression standardized residuals (Dependent Variable: CALLS)]


[Scatterplot of the regression standardized residuals against the standardized predicted values (Dependent Variable: CALLS)]

The residual plots do not show any systematic pattern; the residuals are scattered around zero within a horizontal band. Therefore, the error terms appear to have zero mean and constant variance.

[Partial regression plot of CALLS against MONTHS]


[Partial regression plot of CALLS against MONTHS2]

[Scatterplot of the standardized residuals against MONTHS]


[Scatterplot of the standardized residuals against MONTHS2]

When a sample data point has a y value that is much different from the y values of the

other points in the sample, it is called an outlier.

Example 3: (car.sav)

Regression

Variables Entered/Removed(b)
  Model 1 — Variables Entered: Weight in tons(a); Method: Enter
  a. All requested variables entered.
  b. Dependent Variable: Miles per gallon

Model Summary
  Model 1: R = .781(a), R Square = .611, Adjusted R Square = .562, Std. Error of the Estimate = .7134
  a. Predictors: (Constant), Weight in tons


ANOVA(b)
  Model 1: Regression SS = 6.388, df = 1, MS = 6.388, F = 12.550, Sig. = .008(a); Residual SS = 4.072, df = 8, MS = .509; Total SS = 10.460, df = 9
  a. Predictors: (Constant), Weight in tons
  b. Dependent Variable: Miles per gallon

Coefficients(a)
  Model 1 (with 95% confidence intervals for B):
    (Constant): B = 21.639, SE = 1.329, t = 16.284, Sig. = .000, 95% CI = (18.575, 24.704)
    Weight in tons: B = −2.876, SE = .812, Beta = −.781, t = −3.543, Sig. = .008, 95% CI = (−4.748, −1.004)
  a. Dependent Variable: Miles per gallon

Coefficient Correlations(a)
  Model 1 — Weight in tons: Correlation = 1.000, Covariance = .659
  a. Dependent Variable: Miles per gallon

Residuals Statistics(a)
  Predicted Value: Min = 15.743, Max = 18.245, Mean = 17.000, Std. Deviation = .8425, N = 10
  Residual: Min = −1.445, Max = .900, Mean = .000, Std. Deviation = .6726, N = 10
  Std. Predicted Value: Min = −1.492, Max = 1.478, Mean = .000, Std. Deviation = 1.000, N = 10
  Std. Residual: Min = −2.026, Max = 1.261, Mean = .000, Std. Deviation = .943, N = 10
  a. Dependent Variable: Miles per gallon


Charts

[Normal P-P plot of the regression standardized residuals (Dependent Variable: MILE)]

[Histogram of the regression standardized residuals (Dependent Variable: Miles per gallon); Std. Dev = .94, Mean = 0.00, N = 10]


An outlier is an unusual observation that deserves further examination. It does not mean that the observation is useless or that it should be deleted from the analysis. When an outlier is to be deleted from the regression model, filter out the observation and perform the linear regression again.

4.3. Correlation analysis

Linear association between two quantitative variables can be analyzed by correlation analysis. The Pearson correlation coefficient r (−1 ≤ r ≤ 1) is used to measure the strength of linear association between two sets of sample data (say x and y). The population correlation coefficient is estimated by the sample correlation coefficient

  r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
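The formula above translates directly into code. A pure-Python sketch with made-up data (SPSS computes r for the chapter's examples):

```python
# Pearson correlation coefficient, a direct transcription of
#   r = sum (x_i - xbar)(y_i - ybar)
#       / sqrt( sum (x_i - xbar)^2 * sum (y_i - ybar)^2 )
# The data below are made up for illustration.
import math

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pearson_r(x, y))  # 1.0 for a perfectly linear relationship
```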

Example 4: (car.sav)

• Compute the correlation coefficients for the variables and test for their significance

Analyze → Correlate

[Scatterplot of the regression standardized residuals against the standardized predicted values (Dependent Variable: Miles per gallon), continuing the Example 3 charts. One point is an outlier, as its studentized residual is greater than 2 in absolute value.]


Bivariate

Variables: miles, weight, temperature

Correlation coefficients: Pearson

Statistics such as the mean, standard deviation, cross-product deviations and covariances can be produced.

Correlations

Let ρ₁ and ρ₂ be the correlation coefficients of gasoline mileage and the weight of the automobile, and of gasoline mileage and the temperature at the time of operation, respectively. We can test whether the correlation coefficients are significantly different from zero. The hypothesis for the correlation coefficient of gasoline mileage and weight is

  H₀: ρ₁ = 0
  H₁: ρ₁ ≠ 0

Descriptive Statistics
  Miles per gallon: Mean = 17.000, Std. Deviation = 1.0781, N = 10
  Weight in tons: Mean = 1.6130, Std. Deviation = .29292, N = 10
  Temperature in Fahrenheit: Mean = 52.50, Std. Deviation = 20.850, N = 10

Correlations
  Miles per gallon vs Weight in tons: Pearson Correlation = −.781**, Sig. (2-tailed) = .008, Sum of Squares and Cross-products = −2.221, Covariance = −.247, N = 10
  Miles per gallon vs Temperature in Fahrenheit: Pearson Correlation = −.188, Sig. (2-tailed) = .603, Sum of Squares and Cross-products = −38.000, Covariance = −4.222, N = 10
  Weight in tons vs Temperature in Fahrenheit: Pearson Correlation = .052, Sig. (2-tailed) = .886, Sum of Squares and Cross-products = 2.875, Covariance = .319, N = 10
  (Diagonal entries: Miles per gallon SS = 10.460, Variance = 1.162; Weight SS = .772, Variance = .086; Temperature SS = 3912.500, Variance = 434.722)
  **. Correlation is significant at the 0.01 level (2-tailed).


The sample correlation coefficient is −0.781, with significance level equal to 0.008, which is smaller than 0.05. Therefore, the null hypothesis is rejected and the correlation coefficient of gasoline mileage and weight is significantly different from zero: gasoline mileage and weight have a significant linear association at the 5% level of significance. Similarly, the linear association between gasoline mileage and the temperature at the time of operation is tested with the hypothesis

  H₀: ρ₂ = 0
  H₁: ρ₂ ≠ 0

The sample correlation coefficient is −0.188, with significance level equal to 0.603, which is greater than 0.05. Therefore, the null hypothesis cannot be rejected and the correlation coefficient of gasoline mileage and temperature is not significantly different from zero: gasoline mileage and temperature at the time of operation do not have a significant linear association at the 5% level of significance.
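The significance levels SPSS reports for r come from a t statistic. As a cross-check, the standard test statistic for H₀: ρ = 0, applied to the reported r for mileage and weight:

```python
# Significance test for a correlation coefficient: under H0: rho = 0,
#   t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom.
# With the reported r = -0.781 and n = 10 this reproduces, up to the
# rounding of r, the t = -3.543 shown for weight in the Coefficients table.
import math

def t_for_r(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_for_r(-0.781, 10), 2))  # -3.54
```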


Exercise 4

Question 1. (R&D)

A company is interested in the relationship between profit on a number of projects and two

explanatory variables. These variables are the expenditure on research and development for

the project (RD) and a measure of risk assigned at the outset of the project (RISK). The

following table shows the data on the three variables PROFIT, RISK, and RD. PROFIT is

measured in thousands of dollars and RD is measured in hundreds of dollars. Fit a multiple

linear regression model for the PROFIT, using RISK and RD as regressors.

RD       RISK  PROFIT     RD       RISK  PROFIT
132.580  8.5   396        74.816   7.5   102
81.928   7.5   130        108.752  6.0   214
145.992  10.0  508        92.372   8.5   200
90.020   8.0   172        92.260   7.0   158
114.408  7.0   256        60.732   6.5   32
53.704   7.5   32         78.120   7.5   116
76.244   7.0   102        90.000   5.5   120
71.680   8.0   102        105.532  9.0   270
151.592  9.5   536        111.832  8.0   270

Interpret the results, and check whether the error terms have first-order autocorrelation and whether they follow a normal distribution with zero mean and constant variance. Modify the model if necessary.

Question 2. (marketing)

The marketing department of a firm desires to develop a one-factor linear model to forecast

the sales of a product. It is believed that the most prominent factors that affect sales are

“Advertising costs” and “Price”. The following semi-annual data are collected:

Sales (in $,000)  Advertising costs (in $,000)  Price (in $,000)
1,320             38                            6
1,440             42                            6
1,480             46                            6
1,520             48                            5
1,440             50                            5
1,560             52                            5

By calculating correlation coefficients, determine the factor (i.e. advertising costs or price)

which is more relevant to sales. Find the regression equation for the factor that you have

determined.