Multiple Regression Analysis
Ram Akella
University of California Berkeley Silicon Valley Center/SC
Lecture 2
January 26, 2011
Introduction
We extend the concept of simple linear regression as we investigate a response y which is affected by several independent variables x1,
x2, x3,…, xk.
Our objective is to use the information provided by the xi to predict the value of y.
Example
Rating Prediction
We have a database of movie ratings (on a scale of 1 to 10) from various users. We can predict a missing rating for a movie m from a user based on:
other movie ratings from the same user
ratings given to movie m by other users
People (age, sex, income): ratings for Airplane (comedy), Matrix (action), Room with a View (romance), ..., Hidalgo (action)
Joe (27, M, 70k): 9, 7, 2, ..., 7
Carol (53, F, 20k): 8, 9
...
Kumar (25, M, 22k): 9, 3, 6
Ua (48, M, 81k): 4, 7, ?, ?, ?
Illustrative Example
Body Fat Prediction
Let y be the body fat index of an individual. This might be a function of several variables:
x1 = height
x2 = weight
x3 = abdomen measurement
x4 = hip measurement
We want to predict y using knowledge of x1, x2, x3 and x4.
Formalization of the Regression Model
For each observation i, the expected value of the dependent variable y conditional on the information x1, x2, …, xp is given by:

E(y | x1, x2, …, xp) = b0 + b1x1 + b2x2 + … + bpxp

We add some noise to this expectation model, and the value of y becomes:

yi = E(y | x1, x2, …, xp) + εi

Combining both equations we have:

yi = b0 + b1xi1 + b2xi2 + … + bpxip + εi
Formalization of the Regression Model
We can express the regression model in matrix terms:

y = Xb + ε

where y is a vector of order (n x 1), b is a vector of order ((p+1) x 1), and X is configured as:

X = [ 1  x11  x12  ...  x1p
      1  x21  x22  ...  x2p
      1  x31  x32  ...  x3p
      ...
      1  xn1  xn2  ...  xnp ]

The column of 1s corresponds to the dummy variable that is multiplied by the intercept parameter b0.
Assumptions of the Regression Model
Assumptions of the data matrix X:
It is fixed for fitting purposes
It is full rank
Assumptions of the random variable ε:
The εi are independent
They have mean 0 and common variance σ² for any set x1, x2, …, xk
They have a normal distribution
Method of Least Squares
The method of least squares is used to estimate the values of b that achieve the minimum sum of squared differences between the observations yi and the fitted values formed by these parameters and the variables x1, x2, …, xk.
Mechanics of the Least Squares Estimator in Multiple Variables
The objective is to choose b̂ to minimize

S(b̂) = (y - Xb̂)'(y - Xb̂)

Differentiating the expression above with respect to b̂ and setting it to zero, we obtain:

-2X'(y - Xb̂) = 0

Solving for b̂ we obtain

b̂ = (X'X)⁻¹X'y
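The closed-form estimator b̂ = (X'X)⁻¹X'y can be sketched in a few lines of NumPy. The data below are hypothetical (generated as y = 2 + x1 + x2 with no noise), used only to exercise the formula:

```python
import numpy as np

# Hypothetical toy data: 6 observations, 2 predictors (not from the lecture).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = 2.0 + X[:, 0] + X[:, 1]  # exact linear relation y = 2 + x1 + x2

# Prepend the column of 1s that multiplies the intercept b0.
Xd = np.column_stack([np.ones(len(y)), X])

# Closed-form least squares: b_hat = (X'X)^{-1} X'y,
# computed with solve() rather than an explicit matrix inverse.
b_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
print(b_hat)  # recovers [2, 1, 1] on this noise-free data
```

Using `np.linalg.solve` instead of forming (X'X)⁻¹ explicitly is the numerically safer way to evaluate the same formula.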
Mechanics of the Least Squares Estimator in Multiple Variables
The variance of the estimator is

var(b̂) = E[(b̂ - b)(b̂ - b)']

We know that:

b̂ = b + (X'X)⁻¹X'ε

And the expected value of εε' is

E(εε') = σ²I

Substituting in the equation above:

var(b̂) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
var(b̂) = (X'X)⁻¹X'E(εε')X(X'X)⁻¹
var(b̂) = (X'X)⁻¹X'(σ²I)X(X'X)⁻¹
var(b̂) = σ²(X'X)⁻¹
Properties of the Least Squares Estimator
The estimate b̂ is unbiased: E(b̂ - b) = 0
The estimate b̂ has the lowest variance among linear unbiased estimators of b:

var(b̂) = σ²(X'X)⁻¹
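Both properties can be checked numerically. The sketch below, using a hypothetical design matrix and σ of my own choosing, refits b̂ over many noise draws and compares the empirical covariance of the estimates with σ²(X'X)⁻¹:

```python
import numpy as np

# Sketch: verify E(b_hat) = b and var(b_hat) = sigma^2 (X'X)^{-1} by
# simulation. X, b_true, and sigma are hypothetical, not from the lecture.
rng = np.random.default_rng(0)
n, sigma = 200, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
b_true = np.array([1.0, 2.0, -1.0])

XtX_inv = np.linalg.inv(X.T @ X)
fits = []
for _ in range(5000):
    y = X @ b_true + sigma * rng.normal(size=n)  # fresh noise draw
    fits.append(XtX_inv @ X.T @ y)               # refit b_hat
fits = np.array(fits)

emp_cov = np.cov(fits.T)                  # empirical covariance of b_hat
theo_cov = sigma**2 * XtX_inv             # theoretical sigma^2 (X'X)^{-1}
print(np.max(np.abs(emp_cov - theo_cov)))  # small: the two agree
```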
Example
A computer database in a small community contains: the listed selling price y (in thousands of dollars) per acre, the amount of living area x1 in acres, and the number of floors x2, bedrooms x3, and bathrooms x4, for n = 15 randomly selected residences currently on the market.

Property  y      x1  x2  x3  x4
1         69.0   6   1   2   1
2         118.5  10  1   2   2
3         116.5  10  1   3   2
…         …      …   …   …   …
15        209.9  21  2   4   3

Fit a first order model to the data using the method of least squares.
Example
The first order model is

y = b0 + b1x1 + b2x2 + b3x3 + b4x4

With the column of 1s for the intercept, the data give

X = [ 1   6  1  2  1        y = [  69.0
      1  10  1  2  2              118.5
      1  10  1  3  2              116.5
      ...                           ...
      1  21  2  4  3 ]            209.9 ]

b̂ = (X'X)⁻¹X'y = (18.763, 6.2698, -16.203, -2.673, 30.271)'
Some Questions
1. How well does the model fit?
2. How strong is the relationship between y and the predictor variables?
3. Have any assumptions been violated?
4. How good are the estimates and predictions?

To answer these questions we need the n observations on the response y and the independent variables x1, x2, x3, …, xk.
Residuals
The difference between the observed value yi and the corresponding fitted value ŷi is a residual, defined as:

ei = yi - ŷi ~ N(0, σ²)

If we square the residuals, sum them, and divide by σ², we obtain a chi-square distribution:

Σi ei² / σ² ~ χ²
If the normal assumption is valid, the plot of the residuals should appear as a random scatter around the zero center line.
If not, you will see a pattern in the residuals.
Residuals versus Fits
If we see a pattern in the residuals, then this method may not be appropriate to fit the data: the descriptors and the predicted variable do not follow a linear relationship. We can transform the descriptors in order to achieve a better fit.
How good is the fit?
Our objective in regression is to choose the parameters b0, b1, …, bk that provide the most accurate fitted values of y by minimizing the uncertainty remaining after using the information about X. This uncertainty may be denoted by the measure R², which is equal to:

R² = [Σi (yi - ȳ)² - Σi (yi - ŷi)²] / Σi (yi - ȳ)²  =  1 - Σi (yi - ŷi)² / Σi (yi - ȳ)²
How good is the fit?
We are interested in having a high value of R² when we are evaluating our data. One drawback of R² is that whenever an independent variable is added to the model, this measure always increases, regardless of the cost in degrees of freedom. We therefore calculate the adjusted R² to weigh the improvement in fit against that cost:

R²adj = 1 - [Σi (yi - ŷi)² / (n - p - 1)] / [Σi (yi - ȳ)² / (n - 1)]

where p is the number of predictors and n the number of samples used to fit the model.
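A minimal sketch of both measures; the y and ŷ values below are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

# Sketch: R^2 and adjusted R^2 from observed and fitted values.
def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    # p = number of predictors, n = number of samples
    n = len(y)
    ms_res = np.sum((y - y_hat) ** 2) / (n - p - 1)
    ms_tot = np.sum((y - y.mean()) ** 2) / (n - 1)
    return 1.0 - ms_res / ms_tot

y = np.array([69.0, 118.5, 116.5, 209.9])      # hypothetical observations
y_hat = np.array([70.0, 117.0, 118.0, 208.0])  # hypothetical fitted values
print(r_squared(y, y_hat), adjusted_r_squared(y, y_hat, p=1))
```

Adjusted R² is always at most R², since the residual mean square is penalized by the p lost degrees of freedom.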
Is it significant?
The first question to ask is whether the regression model is of any use in predicting y. If it is not, then the value of y does not change, regardless of the values of the independent variables x1, x2, …, xk. This implies that the partial regression coefficients b1, b2, …, bk are all zero.

H0: b1 = b2 = … = bk = 0   versus   Ha: at least one bi is not zero
F Statistic
An F statistic is the ratio of two independent χ² random variables, each divided by its respective degrees of freedom. The key point is to show that both are independent.

F = [Σi (ŷi - ȳ)² / p] / [Σi (yi - ŷi)² / (n - p - 1)]

Is it significant?
Testing the model uses this F test: the ratio of the mean square explained by the model and the mean square of the residual terms (e). If the model is useful, the value of F will be large. The statistic F has an F distribution with (p, n - p - 1) degrees of freedom.
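As a quick numerical check, the F ratio can be computed directly from the mean squares; the MSR and MSE values below are taken from the real estate ANOVA table in this lecture, and SciPy's F distribution gives the tail probability:

```python
import numpy as np
from scipy import stats

# Sketch: overall F statistic from fitted values. p = number of predictors.
def f_statistic(y, y_hat, p):
    n = len(y)
    msr = np.sum((y_hat - y.mean()) ** 2) / p       # regression mean square
    mse = np.sum((y - y_hat) ** 2) / (n - p - 1)    # residual mean square
    return msr / mse

# Check against the real estate ANOVA table: MSR = 3978.3, MSE = 46.9,
# degrees of freedom (4, 10).
F = 3978.3 / 46.9
print(F)                     # ~84.8, matching the F column of the table
print(stats.f.sf(F, 4, 10))  # p-value: essentially zero
```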
Analysis of Variance Table

Source      df         SS        MS           F
Regression  k          SSR       SSR/k        MSR/MSE
Error       n - k - 1  SSE       SSE/(n-k-1)
Total       n - 1      Total SS

SS: sum of squares; df: degrees of freedom; MS: mean square; F: F statistic

Table of variance for the real estate example

S = 6.849   R-Sq = 97.1%   R-Sq(adj) = 96.0%

Source          DF  SS       MS      F      P
Regression      4   15913.0  3978.3  84.80  0.000
Residual Error  10  469.1    46.9
Total           14  16382.2
Testing Individual Parameters
Is a particular independent variable useful in the model, in the presence of all the other independent variables? The test statistic is a function of b̂i, our best estimate of bi:

H0: bi = 0   versus   Ha: bi ≠ 0

t Statistic
A t distribution arises as a normal distribution divided by the square root of a chi-square distribution over its degrees of freedom. If we want to test the significance of a particular coefficient, we calculate the t value of the coefficient:

Test statistic:  t = b̂i / (s √vii)

where vii is the (i, i) element of the matrix (X'X)⁻¹ and s² estimates σ². The statistic has a t distribution with error df = n - p - 1. We reject bi = 0 if |t0| > t(α/2, n-p-1).
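A sketch of the computation on hypothetical simulated data, where the second predictor's true coefficient is deliberately set to zero so its t statistic should be small:

```python
import numpy as np

# Sketch: coefficient t statistics, t_i = b_hat_i / (s * sqrt(v_ii)),
# where v_ii is the (i, i) element of (X'X)^{-1}. Data are hypothetical.
rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 3.0, 0.0]) + rng.normal(size=n)  # x2 is truly useless

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y
resid = y - X @ b_hat
s2 = resid @ resid / (n - p - 1)  # s^2, the estimate of sigma^2
t_stats = b_hat / np.sqrt(s2 * np.diag(XtX_inv))
print(t_stats)  # |t| should be large for b1, small for b2
```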
The Real Estate Problem
In the presence of the other three independent variables, is the number of bedrooms significant in predicting the list price of homes? Test using α = .05.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor  Coef     SE Coef  T      P
Constant   18.763   9.207    2.04   0.069
SqFeet     6.2698   0.7252   8.65   0.000
NumFlrs    -16.203  6.212    -2.61  0.026
Bdrms      -2.673   4.494    -0.59  0.565
Baths      30.271   6.849    4.42   0.001
Detecting Problems in the Regression
Multicollinearity
This problem is related to high correlation between the independent variables. Multicollinearity leads to instability in the calculation of the inverse (X'X)⁻¹. We can measure it by calculating the condition index (CI), given by

CI = dmax / dmin

where dmax and dmin are the maximum and minimum values of the matrix D obtained from the SVD decomposition of X. The higher the value, the more unstable the inverse of the matrix is. The matrix has multicollinearity if CI > 30.
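A sketch of the CI computation using NumPy's SVD, comparing two hypothetical designs: one well-conditioned and one with nearly collinear columns:

```python
import numpy as np

# Sketch: condition index CI = d_max / d_min from the singular values of X.
def condition_index(X):
    d = np.linalg.svd(X, compute_uv=False)  # singular values of X
    return d.max() / d.min()

# Hypothetical designs (not from the lecture):
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X_ok = np.column_stack([x1, x2])                 # independent columns
X_bad = np.column_stack([x1, x1 + 1e-4 * x2])    # nearly collinear columns

print(condition_index(X_ok))   # modest, well below 30
print(condition_index(X_bad))  # very large -> multicollinearity (CI > 30)
```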
Detecting Problems in the Regression
Heteroscedasticity
When the assumption of the same variance σ² in the error terms εi is violated. This leads to low efficiency in the estimation of b: the estimator no longer has the lowest variance, and the usual variance estimates may be biased. We can address this problem by transforming the dependent variable (for example, taking its logarithm), or by using weighted least squares to estimate b:

y = Xb + ε,   ε ~ N(0, σ²W)

b̂ = (X'W⁻¹X)⁻¹X'W⁻¹y
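A sketch of the weighted least squares formula, assuming a hypothetical known diagonal W of per-observation variances:

```python
import numpy as np

# Sketch of WLS, b_hat = (X'W^{-1}X)^{-1} X'W^{-1} y, on hypothetical data
# where each observation has its own known noise variance w_i.
rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = np.exp(rng.normal(size=n))  # per-observation variances (diagonal of W)
y = X @ np.array([1.0, 2.0]) + np.sqrt(w) * rng.normal(size=n)

W_inv = np.diag(1.0 / w)  # down-weight the noisier observations
b_wls = np.linalg.solve(X.T @ W_inv @ X, X.T @ W_inv @ y)
print(b_wls)  # close to the true coefficients [1, 2]
```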
Detecting Problems in the Regression
Autocorrelation
This is the problem of consecutive error terms in time series data being correlated. The consequences of this problem are similar to heteroscedasticity. We can detect this problem by plotting the residuals and looking for patterns, or by using the Durbin-Watson test.

Durbin-Watson Test
This test is given by the ratio

DW = Σt (et - et-1)² / Σt et²

where the e are the residuals. This is approximately equal to 2(1 - r), where r is the autocorrelation. Thus if DW is close to 2 there is no evidence of autocorrelation.
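The DW ratio is a one-liner; the sketch below applies it to hypothetical white-noise residuals and to residuals simulated with autocorrelation r = 0.8:

```python
import numpy as np

# Sketch: Durbin-Watson statistic, DW = sum (e_t - e_{t-1})^2 / sum e_t^2.
def durbin_watson(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)
e_white = rng.normal(size=1000)  # uncorrelated residuals
e_ar = np.empty(1000)            # AR(1) residuals with r = 0.8
e_ar[0] = rng.normal()
for t in range(1, 1000):
    e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal()

print(durbin_watson(e_white))  # close to 2: no evidence of autocorrelation
print(durbin_watson(e_ar))     # close to 2 * (1 - 0.8) = 0.4
```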
Comparing Regression Models
To fairly compare two models, we can use:
The adjusted R²
The F test
These take into account the difference in degrees of freedom between the models.
Estimation and Prediction
Once you have:
determined that the regression line is useful
used the diagnostic plots to check for violation of the regression assumptions
you are ready to use the regression line to predict a particular value of y for a given value of x.

Estimation and Prediction
Enter the appropriate values of x1, x2, …, xk:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

Particular values of y are more difficult to predict than the mean response, so the prediction interval must be wider than the confidence interval.
The Real Estate Problem
Estimate the average list price for a home with 1000 square feet of living space, one floor, 3 bedrooms and two baths, with a 95% confidence interval.

Predicted Values for New Observations
New Obs  Fit     SE Fit  95.0% CI           95.0% PI
1        117.78  3.11    (110.86, 124.70)   (101.02, 134.54)

Values of Predictors for New Observations
New Obs  SqFeet  NumFlrs  Bdrms  Baths
1        10.0    1.00     3.00   2.00

We estimate that the average list price will be between $110,860 and $124,700 for a home like this.
Example: Body Fat Calculation
Predict the body fat index of the body based on the following measures:
x1 = age in years
x2 = weight in lbs
x3 = height in inches
x4 = neck measure in cm
x5 = chest measure in cm
x6 = abdomen measure in cm
x7 = hip measure in cm
x8 = thigh in cm
Example
Data set example (Y = bodyfat, X = the eight measures):

bodyfat  age  weight  height  neck  chest  abdomen  hip    thigh
12.3     23   154.25  67.75   36.2  93.1   85.2     94.5   59
6.1      22   173.25  72.25   38.5  93.6   83       98.7   58.7
25.3     22   154     66.25   34    95.8   87.9     99.2   59.6
10.4     26   184.75  72.25   37.4  101.8  86.4     101.2  60.1
28.7     24   184.25  71.25   34.4  97.3   100      101.9  63.2
20.9     24   210.25  74.75   39    104.5  94.4     107.8  66
19.2     26   181     69.75   36.4  105.1  90.7     100.3  58.4
12.4     25   176     72.5    37.8  99.6   88.5     97.1   60
4.1      25   191     74      38.1  100.9  82.5     99.9   62.9
11.7     23   198.25  73.5    42.1  99.6   88.6     104.1  63.1
7.1      26   186.25  74.5    38.5  101.5  83.6     98.2   59.7
7.8      27   216     76      39.4  103.6  90.9     107.7  66.2
20.8     32   180.5   69.5    38.4  102    91.6     103.9  63.4
Regression Analysis
The result of the regression analysis is the following:
Feature Value
b0 -23.2763
b1 0.0265
b2 -0.0980
b3 -0.0933
b4 -0.5725
b5 0.0258
b6 0.9576
b7 -0.2475
b8 0.3493
Measure Value P value
R2 0.7304
R2 adj 0.7237
F 81.9439 0
Total variance
19.4037