Multicollinearity in Regression: Principal Components Analysis

Standing Heights and Physical Stature Attributes Among Female Police Officer Applicants

S.Q. Lafi and J.B. Kaneene (1992). “An Explanation of the Use of Principal Components Analysis to Detect and Correct for Multicollinearity,” Preventive Veterinary Medicine, Vol. 13, pp. 261-275





Data Description

• Subjects: 33 females applying for police officer positions
• Dependent Variable: Y ≡ Standing Height (cm)
• Independent Variables:
    X1 ≡ Sitting Height (cm)
    X2 ≡ Upper Arm Length (cm)
    X3 ≡ Forearm Length (cm)
    X4 ≡ Hand Length (cm)
    X5 ≡ Upper Leg Length (cm)
    X6 ≡ Lower Leg Length (cm)
    X7 ≡ Foot Length (inches)
    X8 ≡ BRACH (100·X3/X2)
    X9 ≡ TIBIO (100·X6/X5)
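X8 and X9 are computed directly from X2, X3, X5, and X6, which is worth keeping in mind when the collinearity diagnostics appear later. As a minimal sketch, the two indices can be recomputed from the first three applicants in the data table; the helper name `ratio_index` is illustrative and not from the source.

```python
def ratio_index(numerator, denominator):
    """Return 100 * numerator / denominator, the form shared by BRACH and TIBIO."""
    return 100.0 * numerator / denominator

# (X2, X3, X5, X6) for applicants 1-3, copied from the data table
rows = [
    (31.8, 28.1, 40.3, 38.9),
    (32.4, 29.1, 43.3, 42.7),
    (33.6, 29.5, 43.7, 41.1),
]

for x2, x3, x5, x6 in rows:
    brach = ratio_index(x3, x2)  # X8 = 100 * X3 / X2
    tibio = ratio_index(x6, x5)  # X9 = 100 * X6 / X5
    print(f"BRACH = {brach:.1f}, TIBIO = {tibio:.1f}")
```

The printed values should match the tabulated X8 and X9 up to the rounding of the recorded measurements.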


Data

ID   Y      X1    X2    X3    X4    X5    X6    X7   X8    X9
1    165.8  88.7  31.8  28.1  18.7  40.3  38.9  6.7  88.4  96.5
2    169.8  90.0  32.4  29.1  18.3  43.3  42.7  6.4  89.8  98.6
3    170.7  87.7  33.6  29.5  20.7  43.7  41.1  7.2  87.8  94.1
4    170.9  87.1  31.0  28.2  18.6  43.7  40.6  6.7  91.0  92.9
5    157.5  81.3  32.1  27.3  17.5  38.1  39.6  6.6  85.0  103.9
6    165.9  88.2  31.8  29.0  18.6  42.0  40.6  6.5  91.2  96.7
7    158.7  86.1  30.6  27.8  18.4  40.0  37.0  5.9  90.8  92.5
8    166.0  88.7  30.2  26.9  17.5  41.6  39.0  5.9  89.1  93.8
9    158.7  83.7  31.1  27.1  18.1  38.9  37.5  6.1  87.1  96.4
10   161.5  81.2  32.3  27.8  19.1  42.8  40.1  6.2  86.1  93.7
11   167.3  88.6  34.8  27.3  18.3  43.1  41.8  7.3  78.4  97.0
12   167.4  83.2  34.3  30.1  19.2  43.4  42.2  6.8  87.8  97.2
13   159.2  81.5  31.0  27.3  17.5  39.8  39.6  4.9  88.1  99.5
14   170.0  87.9  34.2  30.9  19.4  43.1  43.7  6.3  90.4  101.4
15   166.3  88.3  30.6  28.8  18.3  41.8  41.0  5.9  94.1  98.1
16   169.0  85.6  32.6  28.8  19.1  42.7  42.0  6.0  88.3  98.4
17   156.2  81.6  31.0  25.6  17.0  44.2  39.0  5.1  82.6  88.2
18   159.6  86.6  32.7  25.4  17.7  42.0  37.5  5.0  77.7  89.3
19   155.0  82.0  30.3  26.6  17.3  37.9  36.1  5.2  87.8  95.3
20   161.1  84.1  29.5  26.6  17.8  38.6  38.2  5.9  90.2  99.0
21   170.3  88.1  34.0  29.3  18.2  43.2  41.4  5.9  86.2  95.8
22   167.8  83.9  32.5  28.6  20.2  43.3  42.9  7.2  88.0  99.1
23   163.1  88.1  31.7  26.9  18.1  40.1  39.0  5.9  84.9  97.3
24   165.8  87.0  33.2  26.3  19.5  43.2  40.7  5.9  79.2  94.2
25   175.4  89.6  35.2  30.1  19.1  45.1  44.5  6.3  85.5  98.7
26   159.8  85.6  31.5  27.1  19.2  42.3  39.0  5.7  86.0  92.2
27   166.0  84.9  30.5  28.1  17.8  41.2  43.0  6.1  92.1  104.4
28   161.2  84.1  32.8  29.2  18.4  42.6  41.1  5.9  89.0  96.5
29   160.4  84.3  30.5  27.8  16.8  41.0  39.8  6.0  91.1  97.1
30   164.3  85.0  35.0  27.8  19.0  47.2  42.4  5.0  79.4  89.8
31   165.5  82.6  36.2  28.6  20.2  45.0  42.3  5.6  79.0  94.0
32   167.2  85.0  33.6  27.1  19.8  46.0  41.6  5.6  80.7  90.4
33   167.2  83.4  33.5  29.7  19.4  45.2  44.0  5.2  88.7  97.3


Standardizing the Predictors

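$$X^*_{ij} = \frac{X_{ij} - \bar{X}_j}{\sqrt{n-1}\, S_j}, \qquad i = 1,\ldots,33; \quad j = 1,\ldots,9$$

$$\bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}, \qquad S_j^2 = \frac{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2}{n-1}$$

$$\mathbf{X}^* = \begin{bmatrix} X^*_{11} & X^*_{12} & \cdots & X^*_{19} \\ X^*_{21} & X^*_{22} & \cdots & X^*_{29} \\ \vdots & \vdots & & \vdots \\ X^*_{33,1} & X^*_{33,2} & \cdots & X^*_{33,9} \end{bmatrix}$$

$$r_{jk} = \frac{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)}{\sqrt{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2 \, \sum_{i=1}^{n} (X_{ik} - \bar{X}_k)^2}}$$

$$\mathbf{X}^{*\prime}\mathbf{X}^* = \mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{19} \\ r_{21} & 1 & \cdots & r_{29} \\ \vdots & \vdots & & \vdots \\ r_{91} & r_{92} & \cdots & 1 \end{bmatrix}$$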
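The standardization above (centering each predictor, then dividing by the square root of n-1 times its standard deviation) can be sketched as follows. This is an illustrative NumPy example on synthetic data, not the applicants' measurements; the function name `correlation_transform` and the simulated columns are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 33, 4
X = rng.normal(size=(n, p))                      # synthetic predictors (assumption)
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=n)     # make two columns highly correlated

def correlation_transform(X):
    """Center each column and scale by sqrt(n - 1) times its sample standard deviation."""
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    S = X.std(axis=0, ddof=1)                    # sample standard deviations S_j
    return centered / (np.sqrt(n - 1) * S)

X_star = correlation_transform(X)
R = X_star.T @ X_star                            # equals the correlation matrix of X

print(np.round(R, 4))
```

With this scaling, the cross-product matrix of the standardized predictors has ones on the diagonal and the pairwise correlations off the diagonal.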


Correlation Matrix of Predictors and Inverse

R
 1.0000   0.1441   0.2791   0.1483   0.1863   0.2264   0.3680   0.1147   0.0212
 0.1441   1.0000   0.4708   0.6452   0.7160   0.6616   0.1468  -0.5820  -0.0984
 0.2791   0.4708   1.0000   0.5050   0.3658   0.7284   0.4277   0.4420   0.4406
 0.1483   0.6452   0.5050   1.0000   0.6007   0.5500   0.3471  -0.1911  -0.0988
 0.1863   0.7160   0.3658   0.6007   1.0000   0.7150  -0.0298  -0.3882  -0.4099
 0.2264   0.6616   0.7284   0.5500   0.7150   1.0000   0.2821   0.0026   0.3434
 0.3680   0.1468   0.4277   0.3471  -0.0298   0.2821   1.0000   0.2445   0.3971
 0.1147  -0.5820   0.4420  -0.1911  -0.3882   0.0026   0.2445   1.0000   0.5082
 0.0212  -0.0984   0.4406  -0.0988  -0.4099   0.3434   0.3971   0.5082   1.0000

R^(-1)
   1.52     -3.48      3.15    0.41     13.15    -13.28   -0.62     -3.41     10.21
  -3.48    436.47   -390.31   -1.26    -83.83     77.01    1.18    425.55    -62.66
   3.15   -390.31    353.99   -0.07     91.67    -87.90   -1.25   -382.59     68.23
   0.41     -1.26     -0.07    2.46      4.89     -5.40   -0.81     -0.49      4.57
  13.15    -83.83     91.67    4.89    817.17   -807.75   -2.21    -76.90    603.81
 -13.28     77.01    -87.90   -5.40   -807.75    801.94    2.65     71.74   -597.88
  -0.62      1.18     -1.25   -0.81     -2.21      2.65    1.77      1.12     -2.49
  -3.41    425.55   -382.59   -0.49    -76.90     71.74    1.12    417.39    -58.24
  10.21    -62.66     68.23    4.57    603.81   -597.88   -2.49    -58.24    448.37


Variance Inflation Factors (VIFs)

• VIF measures the extent to which a regression coefficient’s variance is inflated due to correlations among the set of predictors
• $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of multiple determination when Xj is regressed on the remaining predictors
• Values > 10 are often considered to be problematic
• VIFs can be obtained as the diagonal elements of R^(-1)

VIFs
X1     X2       X3       X4     X5       X6       X7     X8       X9
1.52   436.47   353.99   2.46   817.17   801.94   1.77   417.39   448.37

Not surprisingly, X2, X3, X5, X6, X8, and X9 are problems (see definitions of X8 and X9)
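The two routes to the VIFs described above (the diagonal of the inverse correlation matrix, and 1/(1 - R_j^2) from regressing each predictor on the others) can be compared numerically. The sketch below uses synthetic data with a derived ratio column loosely imitating BRACH; the data, seed, and helper name `vif_by_regression` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 33
a = rng.normal(30, 2, size=n)                 # stand-in for an arm-length measurement
b = 0.8 * a + rng.normal(0, 1, size=n)        # correlated second measurement
c = rng.normal(18, 1, size=n)                 # roughly independent measurement
ratio = 100 * b / a                           # derived index, as with BRACH
X = np.column_stack([a, b, c, ratio])

# Route 1: diagonal elements of the inverse correlation matrix
R = np.corrcoef(X, rowvar=False)
vif_from_inverse = np.diag(np.linalg.inv(R))

# Route 2: regress each column on the others and use 1 / (1 - R_j^2)
def vif_by_regression(X, j):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

vif_from_r2 = np.array([vif_by_regression(X, j) for j in range(X.shape[1])])

print(np.round(vif_from_inverse, 2))
print(np.round(vif_from_r2, 2))
```

In this setup the columns involved in the ratio would be expected to show large VIFs, while the unrelated column stays near 1.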


Regression of Y on [1|X*]

$$E\{Y_i\} = \beta_0 + \beta_1 X^*_{i1} + \cdots + \beta_9 X^*_{i9} \quad\Rightarrow\quad E\{\mathbf{Y}\} = \beta_0 \mathbf{1} + \mathbf{X}^*\boldsymbol{\beta}$$

Regression Statistics
Multiple R           0.944825
R Square             0.892694
Adjusted R Square    0.850704
Standard Error       1.890412
Observations         33

ANOVA
             df   SS         MS        F         Significance F
Regression   9    683.7823   75.9758   21.2600   0.0000
Residual     23   82.1941    3.5737
Total        32   765.9764

            Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   164.5636       0.3291           500.0743   0.0000    163.8829    165.2444
X1*         11.8900        2.3307           5.1015     0.0000    7.0686      16.7114
X2*         4.2752         39.4941          0.1082     0.9147    -77.4246    85.9751
X3*         -3.2845        35.5676          -0.0923    0.9272    -76.8616    70.2927
X4*         4.2764         2.9629           1.4433     0.1624    -1.8528     10.4057
X5*         -9.8372        54.0398          -0.1820    0.8571    -121.6270   101.9525
X6*         25.5626        53.5337          0.4775     0.6375    -85.1802    136.3055
X7*         3.3805         2.5166           1.3433     0.1923    -1.8255     8.5865
X8*         6.3735         38.6215          0.1650     0.8704    -73.5211    86.2682
X9*         -9.6391        40.0289          -0.2408    0.8118    -92.4453    73.1670

Note the surprising negative coefficients for X3*, X5*, and X9*


Principal Components Analysis

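Using a statistical or matrix computer package, decompose the correlation matrix into its eigenvalues and eigenvectors:

$$\mathbf{X}^{*\prime}\mathbf{X}^* = \mathbf{R} = \sum_{j=1}^{p} \lambda_j \mathbf{v}_j \mathbf{v}_j' = \mathbf{V}\mathbf{L}\mathbf{V}'$$

where $\lambda_j \equiv$ $j$th eigenvalue and $\mathbf{v}_j = (v_{j1}, v_{j2}, \ldots, v_{jp})' \equiv$ $j$th eigenvector, with

$$\mathbf{V} = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_p \end{bmatrix}, \qquad \mathbf{L} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{bmatrix}$$

subject to: $\mathbf{v}_j'\mathbf{v}_j = 1$ and $\mathbf{v}_j'\mathbf{v}_k = 0$ for $j \neq k$

Condition Index: $\sqrt{\lambda_{\max} / \lambda_j}$

Principal Components: $\mathbf{W} = \mathbf{X}^*\mathbf{V}$

While the columns of X* are highly correlated, the columns of W are uncorrelated. The λs represent the variance corresponding to each principal component.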
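The decomposition and the resulting uncorrelated components can be illustrated numerically. The following is a minimal sketch on synthetic data (three predictors, one nearly a sum of the other two); the data and variable names are assumptions, and `numpy.linalg.eigh` stands in for whatever statistical or matrix package is actually used.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 33
z = rng.normal(size=(n, 3))
# Third predictor is almost an exact linear combination of the first two
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + z[:, 1] + 0.05 * z[:, 2]])

# Standardize so that X_star' X_star is the correlation matrix R
centered = X - X.mean(axis=0)
X_star = centered / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))
R = X_star.T @ X_star

# eigh returns eigenvalues in ascending order; reorder to descending
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]          # diagonal of L
V = eigvecs[:, order]         # columns are the eigenvectors v_j

W = X_star @ V                # principal components
WtW = W.T @ W                 # should be diagonal, with the eigenvalues on the diagonal

print(np.round(lam, 4))
print(np.round(WtW, 4))
```

The off-diagonal entries of `WtW` vanish up to floating-point error, and its diagonal reproduces the eigenvalues, which sum to the number of predictors.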


Police Applicants Height Data - I

V
 0.1853   0.1523   0.8017   0.2782  -0.3707  -0.2327   0.1754  -0.0005   0.0104
 0.4413  -0.2348  -0.0986  -0.2312  -0.2551  -0.3191  -0.3973   0.5850  -0.1414
 0.3934   0.3336  -0.1642   0.2336   0.1239  -0.3183  -0.4953  -0.5205   0.1397
 0.4182  -0.0813   0.0284  -0.2063   0.5765  -0.3703   0.5529   0.0009   0.0040
 0.4125  -0.3000  -0.0121   0.3508   0.0559   0.4669   0.0250   0.1487   0.6106
 0.4645   0.1011  -0.2518   0.1658  -0.2697   0.3798   0.2786  -0.1539  -0.6040
 0.2141   0.3577   0.3790  -0.5862   0.2139   0.4811  -0.2484   0.0009  -0.0022
-0.0852   0.5467  -0.0498   0.4536   0.3674   0.0367  -0.0418   0.5738  -0.1352
 0.0474   0.5261  -0.3320  -0.2685  -0.4396  -0.1027   0.3445   0.1089   0.4521

L
3.6304   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
0.0000   2.4427   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
0.0000   0.0000   1.0145   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
0.0000   0.0000   0.0000   0.7656   0.0000   0.0000   0.0000   0.0000   0.0000
0.0000   0.0000   0.0000   0.0000   0.6109   0.0000   0.0000   0.0000   0.0000
0.0000   0.0000   0.0000   0.0000   0.0000   0.3024   0.0000   0.0000   0.0000
0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.2322   0.0000   0.0000
0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0009   0.0000
0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0005
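The eigenvalues on the diagonal of L are enough to compute a condition index for each component. The sketch below assumes the index is defined as the square root of the largest eigenvalue divided by each eigenvalue; with the rounded eigenvalues listed above, that definition gives values close to the 63.5 and 85.2 quoted later with the regression on W.

```python
import math

# Diagonal of L, copied from the slide (rounded to four decimals)
eigenvalues = [3.6304, 2.4427, 1.0145, 0.7656, 0.6109,
               0.3024, 0.2322, 0.0009, 0.0005]

lam_max = max(eigenvalues)

# Assumed definition: sqrt(lambda_max / lambda_j)
condition_indices = [math.sqrt(lam_max / lam) for lam in eigenvalues]

for j, (lam, ci) in enumerate(zip(eigenvalues, condition_indices), start=1):
    print(f"component {j}: eigenvalue = {lam:.4f}, condition index = {ci:.1f}")
```

Because the last two eigenvalues are reported to only one significant digit, the computed indices for components 8 and 9 are sensitive to that rounding.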


Police Applicants Height Data - II

VLV'
 1.0000   0.1441   0.2791   0.1483   0.1863   0.2263   0.3680   0.1147   0.0212
 0.1441   1.0000   0.4708   0.6452   0.7160   0.6617   0.1468  -0.5820  -0.0985
 0.2791   0.4708   1.0000   0.5051   0.3658   0.7284   0.4277   0.4420   0.4406
 0.1483   0.6452   0.5051   1.0000   0.6007   0.5500   0.3471  -0.1911  -0.0988
 0.1863   0.7160   0.3658   0.6007   1.0000   0.7150  -0.0298  -0.3882  -0.4098
 0.2263   0.6617   0.7284   0.5500   0.7150   1.0000   0.2821   0.0026   0.3434
 0.3680   0.1468   0.4277   0.3471  -0.0298   0.2821   1.0000   0.2445   0.3971
 0.1147  -0.5820   0.4420  -0.1911  -0.3882   0.0026   0.2445   1.0000   0.5083
 0.0212  -0.0985   0.4406  -0.0988  -0.4098   0.3434   0.3971   0.5083   1.0000

R
 1.0000   0.1441   0.2791   0.1483   0.1863   0.2264   0.3680   0.1147   0.0212
 0.1441   1.0000   0.4708   0.6452   0.7160   0.6616   0.1468  -0.5820  -0.0984
 0.2791   0.4708   1.0000   0.5050   0.3658   0.7284   0.4277   0.4420   0.4406
 0.1483   0.6452   0.5050   1.0000   0.6007   0.5500   0.3471  -0.1911  -0.0988
 0.1863   0.7160   0.3658   0.6007   1.0000   0.7150  -0.0298  -0.3882  -0.4099
 0.2264   0.6616   0.7284   0.5500   0.7150   1.0000   0.2821   0.0026   0.3434
 0.3680   0.1468   0.4277   0.3471  -0.0298   0.2821   1.0000   0.2445   0.3971
 0.1147  -0.5820   0.4420  -0.1911  -0.3882   0.0026   0.2445   1.0000   0.5082
 0.0212  -0.0984   0.4406  -0.0988  -0.4099   0.3434   0.3971   0.5082   1.0000


Regression of Y on [1|W]

$$E\{\mathbf{Y}\} = \gamma_0 \mathbf{1} + \mathbf{W}\boldsymbol{\gamma}$$

Regression Statistics
Multiple R           0.944825
R Square             0.892694
Adjusted R Square    0.850704
Standard Error       1.890412
Observations         33

ANOVA
             df   SS         MS        F         Significance F
Regression   9    683.7823   75.9758   21.2600   0.0000
Residual     23   82.1941    3.5737
Total        32   765.9764

            Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   164.5636       0.3291           500.0743   0.0000    163.8829    165.2444
W1          12.1269        0.9922           12.2227    0.0000    10.0744     14.1793
W2          4.5224         1.2096           3.7389     0.0011    2.0202      7.0245
W3          7.6160         1.8769           4.0578     0.0005    3.7334      11.4985
W4          4.9552         2.1605           2.2935     0.0313    0.4858      9.4246
W5          -3.5819        2.4185           -1.4810    0.1522    -8.5850     1.4213
W6          3.2973         3.4376           0.9592     0.3474    -3.8139     10.4085
W7          6.8268         3.9230           1.7402     0.0952    -1.2885     14.9422
W8          1.4226         64.0508          0.0222     0.9825    -131.0766   133.9219
W9          -27.5954       87.0588          -0.3170    0.7541    -207.6903   152.4995

Note that W8 and W9 have very small eigenvalues and very small t-statistics. Their condition indices are 63.5 and 85.2, both well above 10.
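The regression on all p principal components reproduces the fit of the regression on X* exactly (same R Square, same ANOVA table), because W = X*V is just an orthogonal rotation of the standardized predictors. A minimal sketch on synthetic data illustrates this and the implied relation β = Vγ; the data, seed, and helper `ols` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 33, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)   # induce collinearity
y = 165 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)

# Standardized predictors and their principal components
X_star = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))
lam, V = np.linalg.eigh(X_star.T @ X_star)
W = X_star @ V

def ols(design, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

ones = np.ones((n, 1))
beta = ols(np.hstack([ones, X_star]), y)    # intercept, then beta_1..beta_p
gamma = ols(np.hstack([ones, W]), y)        # intercept, then gamma_1..gamma_p

fitted_x = np.hstack([ones, X_star]) @ beta
fitted_w = np.hstack([ones, W]) @ gamma

print(np.round(beta[1:], 4))
print(np.round(V @ gamma[1:], 4))           # same values: beta = V gamma
```

Since the standardized columns are centered, the intercept equals the mean of Y in both regressions, which is consistent with the identical intercept rows in the two coefficient tables.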


Reduced Model

• Removing the last 2 principal components due to small, insignificant t-statistics and high condition indices
• Let V(g) be the p×g matrix of the eigenvectors for the g retained principal components (p=9, g=7)
• Let W(g) = X*V(g)
• Then regress Y on [1|W(g)]

V(g)
 0.1853   0.1523   0.8017   0.2782  -0.3707  -0.2327   0.1754
 0.4413  -0.2348  -0.0986  -0.2312  -0.2551  -0.3191  -0.3973
 0.3934   0.3336  -0.1642   0.2336   0.1239  -0.3183  -0.4953
 0.4182  -0.0813   0.0284  -0.2063   0.5765  -0.3703   0.5529
 0.4125  -0.3000  -0.0121   0.3508   0.0559   0.4669   0.0250
 0.4645   0.1011  -0.2518   0.1658  -0.2697   0.3798   0.2786
 0.2141   0.3577   0.3790  -0.5862   0.2139   0.4811  -0.2484
-0.0852   0.5467  -0.0498   0.4536   0.3674   0.0367  -0.0418
 0.0474   0.5261  -0.3320  -0.2685  -0.4396  -0.1027   0.3445
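The reduced-model steps above (keep the first g eigenvectors, form W(g), regress Y on [1|W(g)]) can be sketched as follows. The example uses synthetic data with p = 3 and g = 2 rather than the slide's p = 9 and g = 7; the data and the choice of g are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 33
a = rng.normal(size=n)
b = rng.normal(size=n)
# Third predictor is nearly a + b, so one eigenvalue will be close to zero
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=n)])
y = 165 + 2 * a + 1 * b + rng.normal(size=n)

X_star = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))

eigvals, eigvecs = np.linalg.eigh(X_star.T @ X_star)
order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
lam, V = eigvals[order], eigvecs[:, order]

g = 2                                         # retained components (chosen by inspection here)
V_g = V[:, :g]                                # p x g matrix of retained eigenvectors
W_g = X_star @ V_g                            # n x g matrix of retained components

design = np.column_stack([np.ones(n), W_g])
gamma_g, *_ = np.linalg.lstsq(design, y, rcond=None)

resid = y - design @ gamma_g
s2 = resid @ resid / (n - g - 1)              # residual mean square on n - g - 1 df

print(np.round(lam, 5))
print(np.round(gamma_g, 4), round(s2, 4))
```

The residual degrees of freedom n - g - 1 correspond to the 25 shown in the reduced ANOVA table when n = 33 and g = 7.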


Reduced Regression Fit

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.944575
R Square             0.892223
Adjusted R Square    0.862045
Standard Error       1.817195
Observations         33

ANOVA
             df   SS         MS        F         Significance F
Regression   7    683.4215   97.6316   29.5657   0.0000
Residual     25   82.5549    3.3022
Total        32   765.9764

            Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   164.5636       0.3163           520.2229   0.0000    163.9121    165.2151
W1          12.1268        0.9537           12.7151    0.0000    10.1625     14.0910
W2          4.5224         1.1627           3.8895     0.0007    2.1277      6.9170
W3          7.6160         1.8042           4.2213     0.0003    3.9002      11.3317
W4          4.9551         2.0768           2.3859     0.0249    0.6777      9.2324
W5          -3.5819        2.3249           -1.5407    0.1360    -8.3701     1.2063
W6          3.2972         3.3044           0.9978     0.3279    -3.5084     10.1028
W7          6.8268         3.7711           1.8103     0.0823    -0.9398     14.5934


Transforming Back to X-scale

$$\hat{\boldsymbol{\beta}}_{(g)} = \mathbf{V}_{(g)}\hat{\boldsymbol{\gamma}}_{(g)}, \qquad s^2\{\hat{\boldsymbol{\beta}}_{(g)}\} = s^2\, \mathbf{V}_{(g)} \mathbf{L}_{(g)}^{-1} \mathbf{V}_{(g)}'$$

s^2 = 3.3022

      gamma-hat(g)         beta-hat(g)   StdErr
W1    12.1268        X1*   12.1779       2.0639
W2    4.5224         X2*   -0.4583       2.0549
W3    7.6160         X3*   1.3113        2.3006
W4    4.9551         X4*   4.3866        2.8275
W5    -3.5819        X5*   6.8020        1.7926
W6    3.2972         X6*   9.1146        1.8993
W7    6.8268         X7*   3.3197        2.4118
                     X8*   1.8268        1.4407
                     X9*   2.6829        1.9731

V{beta-hat(g)} (estimated variance-covariance matrix of beta-hat(g))
 4.2598  -0.1779  -0.6883   1.0454  -0.8386  -0.0887  -1.8757  -0.4214   0.9289
-0.1779   4.2228   3.6089  -2.2379  -1.9307  -2.4561  -0.1330  -1.0423  -0.7562
-0.6883   3.6089   5.2928  -2.3318  -1.3892  -2.9496  -0.3347   1.1128  -2.2031
 1.0454  -2.2379  -2.3318   7.9948  -1.6401  -0.1911  -2.6329   0.1667   1.9223
-0.8386  -1.9307  -1.3892  -1.6401   3.2135   2.3480   1.4626   0.7180  -1.1223
-0.0887  -2.4561  -2.9496  -0.1911   2.3480   3.6074   0.1090  -0.1452   1.7520
-1.8757  -0.1330  -0.3347  -2.6329   1.4626   0.1090   5.8170  -0.1949  -1.7317
-0.4214  -1.0423   1.1128   0.1667   0.7180  -0.1452  -0.1949   2.0755  -1.2055
 0.9289  -0.7562  -2.2031   1.9223  -1.1223   1.7520  -1.7317  -1.2055   3.8931
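The back-transformation can be checked numerically from the quantities already shown: the 9×7 matrix V(g), the seven retained eigenvalues, the reduced-model coefficients, and s^2 = 3.3022. The sketch below assumes the relations beta-hat(g) = V(g) gamma-hat(g) and covariance s^2 V(g) L(g)^(-1) V(g)'; because the inputs are rounded to four decimals, results agree with the tabulated values only to about three decimals.

```python
import numpy as np

# V(g): eigenvectors of the 7 retained components (rows = X1*..X9*), from the slide
V_g = np.array([
    [ 0.1853,  0.1523,  0.8017,  0.2782, -0.3707, -0.2327,  0.1754],
    [ 0.4413, -0.2348, -0.0986, -0.2312, -0.2551, -0.3191, -0.3973],
    [ 0.3934,  0.3336, -0.1642,  0.2336,  0.1239, -0.3183, -0.4953],
    [ 0.4182, -0.0813,  0.0284, -0.2063,  0.5765, -0.3703,  0.5529],
    [ 0.4125, -0.3000, -0.0121,  0.3508,  0.0559,  0.4669,  0.0250],
    [ 0.4645,  0.1011, -0.2518,  0.1658, -0.2697,  0.3798,  0.2786],
    [ 0.2141,  0.3577,  0.3790, -0.5862,  0.2139,  0.4811, -0.2484],
    [-0.0852,  0.5467, -0.0498,  0.4536,  0.3674,  0.0367, -0.0418],
    [ 0.0474,  0.5261, -0.3320, -0.2685, -0.4396, -0.1027,  0.3445],
])

# Retained eigenvalues, reduced-model coefficients (W1..W7), and residual mean square
lam_g = np.array([3.6304, 2.4427, 1.0145, 0.7656, 0.6109, 0.3024, 0.2322])
gamma_g = np.array([12.1268, 4.5224, 7.6160, 4.9551, -3.5819, 3.2972, 6.8268])
s2 = 3.3022

beta_g = V_g @ gamma_g                                  # coefficients on the X* scale
cov_beta = s2 * V_g @ np.diag(1.0 / lam_g) @ V_g.T      # 9 x 9 covariance matrix
se_beta = np.sqrt(np.diag(cov_beta))                    # standard errors

for j, (b, se) in enumerate(zip(beta_g, se_beta), start=1):
    print(f"X{j}*: beta = {b:8.4f}, std err = {se:.4f}")
```

Running this should recover coefficients and standard errors close to the beta-hat(g) and StdErr columns above, up to rounding of the inputs.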


Comparison of Coefficients and SEs

            Original Model                   Principal Components
            Coefficients   Standard Error    beta-hat(g)   StdErr
Intercept   164.5636       0.3291
X1*         11.8900        2.3307            12.1779       2.0639
X2*         4.2752         39.4941           -0.4583       2.0549
X3*         -3.2845        35.5676           1.3113        2.3006
X4*         4.2764         2.9629            4.3866        2.8275
X5*         -9.8372        54.0398           6.8020        1.7926
X6*         25.5626        53.5337           9.1146        1.8993
X7*         3.3805         2.5166            3.3197        2.4118
X8*         6.3735         38.6215           1.8268        1.4407
X9*         -9.6391        40.0289           2.6829        1.9731