Multiple and complex regression

Extensions of simple linear regression

• Multiple regression models: predictor variables are continuous

• Analysis of variance: predictor variables are categorical (grouping variables)

• But general linear models can include both continuous and categorical predictors

Relative abundance of C3 and C4 plants

• Paruelo & Lauenroth (1996)

• Geographic distribution and the effects of climate variables on the relative abundance of a number of plant functional types (PFTs): shrubs, forbs, succulents, C3 grasses and C4 grasses.

Data

73 sites across temperate central North America

Response variable

• Relative abundance of PFTs (based on cover, biomass, and primary production) for each site

Predictor variables

• Longitude
• Latitude
• Mean annual temperature
• Mean annual precipitation
• Winter (%) precipitation
• Summer (%) precipitation
• Biomes (grassland, shrubland)

Relative abundance was transformed as ln(dat + 1) because it was positively skewed

[Figure: four histograms (x-axis: value, y-axis: frequency): Histogram of C3, Histogram of log_10_C3, Histogram of log_C3, Histogram of SQRT_C3]

Collinearity

• Causes computational problems: it pushes the determinant of the matrix of X-variables close to zero, and because matrix inversion essentially involves dividing by the determinant, the estimates become very sensitive to small differences in the numbers

• Standard errors of the estimated regression slopes are inflated

Detecting collinearity

• Check tolerance values

• Plot the variables

• Examine a matrix of correlation coefficients between predictor variables
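As a rough illustration of how tolerance relates to the correlations among predictors, the sketch below uses the special case of a model with exactly two predictors, where the tolerance of either one is 1 − r² and the variance inflation factor (VIF) is its reciprocal. The function names are hypothetical, and the value of r is the LAT and MAT correlation from this data set's correlation matrix. With more than two predictors, r² would instead be the R² from regressing each predictor on all the others.

```python
def tolerance_two_predictors(r):
    """Tolerance of either predictor when the model has exactly two predictors."""
    return 1.0 - r ** 2


def vif(tolerance):
    """Variance inflation factor is the reciprocal of tolerance."""
    return 1.0 / tolerance


# Pearson correlation between LAT and MAT, taken from the correlation matrix.
r_lat_mat = -0.839

tol = tolerance_two_predictors(r_lat_mat)
print(f"tolerance = {tol:.3f}, VIF = {vif(tol):.2f}")
```

A tolerance near zero (equivalently, a very large VIF) signals that a predictor is almost completely explained by the others.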

Dealing with collinearity

• Omit predictor variables if they are highly correlated with other predictor variables that remain in the model

Correlations

[Figure: scatterplot matrix of the predictor variables LAT, LONG, MAP, MAT, JJAMAP and DJFMAP]

Correlations (N = 73 for every pair)

                               LAT       LONG      MAP       MAT       JJAMAP    DJFMAP
LAT      Pearson Correlation   1         .097      -.247*    -.839**   .074      -.065
         Sig. (2-tailed)       .         .416      .036      .000      .533      .584
LONG     Pearson Correlation   .097      1         -.734**   -.213     -.492**   .771**
         Sig. (2-tailed)       .416      .         .000      .070      .000      .000
MAP      Pearson Correlation   -.247*    -.734**   1         .355**    .112      -.405**
         Sig. (2-tailed)       .036      .000      .         .002      .344      .000
MAT      Pearson Correlation   -.839**   -.213     .355**    1         -.081     .001
         Sig. (2-tailed)       .000      .070      .002      .         .497      .990
JJAMAP   Pearson Correlation   .074      -.492**   .112      -.081     1         -.792**
         Sig. (2-tailed)       .533      .000      .344      .497      .         .000
DJFMAP   Pearson Correlation   -.065     .771**    -.405**   .001      -.792**   1
         Sig. (2-tailed)       .584      .000      .000      .990      .000      .

*  Correlation is significant at the 0.05 level (2-tailed).
** Correlation is significant at the 0.01 level (2-tailed).

Coefficients (dependent variable: LC3)

Model 1       B       Std. Error   Beta     t        Sig.   Tolerance   VIF
(Constant)    7.391   3.625                 2.039    .045
LAT           -.191   .091         -3.095   -2.101   .039   .003        307.745
LONG          -.093   .035         -1.824   -2.659   .010   .015        66.784
LOXLA         .002    .001         4.323    2.572    .012   .002        400.939

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.

Coefficients (dependent variable: LC3)

Model 1       B       Std. Error   Beta     t         Sig.   Tolerance   VIF
(Constant)    -.553   .027                  -20.131   .000
LONRE         -.003   .004         -.051    -.597     .552   .980        1.020
LATRE         .048    .006         .783     8.484     .000   .827        1.209
RELALO        .002    .001         .238     2.572     .012   .820        1.220

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.

$\ln(\mathrm{C3}) = \beta_0 + \beta_1(\mathrm{lat}) + \beta_2(\mathrm{long}) + \beta_3(\mathrm{lat} \times \mathrm{long})$

After centering both lat and long
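Why centering helps can be seen in a small sketch: an uncentered product term such as lat × long is strongly correlated with lat itself, whereas the product of the centered variables need not be. The grid of latitudes and longitudes below is hypothetical (it is not the Paruelo & Lauenroth data) and is deliberately balanced, so the correlation after centering drops to essentially zero; with real, unbalanced data the reduction would be large but not complete.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical balanced grid: every latitude paired with every longitude.
lats = (30, 35, 40, 45, 50)
longs = (95, 105, 115)
lat = [la for la in lats for _ in longs]
lon = [lo for _ in lats for lo in longs]

# Interaction term built from the raw variables.
raw_product = [la * lo for la, lo in zip(lat, lon)]

# Interaction term built from the centered variables.
lat_c, lon_c = center(lat), center(lon)
centered_product = [la * lo for la, lo in zip(lat_c, lon_c)]

r_raw = pearson(lat, raw_product)
r_centered = pearson(lat_c, centered_product)
print(f"raw: r = {r_raw:.3f}; centered: r = {r_centered:.3f}")
```

The same mechanism accounts for the drop in VIF between the two coefficient tables: the slope of the interaction term is unchanged, but the main-effect terms are no longer nearly redundant with it.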

Analysis of variance

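Source of variation   SS                               df            MS
Regression            $\sum (\hat{y}_i - \bar{y})^2$   $p$           $\sum (\hat{y}_i - \bar{y})^2 / p$
Residual              $\sum (y_i - \hat{y}_i)^2$       $n - p - 1$   $\sum (y_i - \hat{y}_i)^2 / (n - p - 1)$
Total                 $\sum (y_i - \bar{y})^2$         $n - 1$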
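The partition in the table can be checked numerically. The sketch below is a minimal illustration with hypothetical data and a single predictor (p = 1), fitted by least squares; the function name `anova_partition` is an assumption for this example. It shows that the regression and residual sums of squares add up to the total, and how the mean squares are formed.

```python
def anova_partition(y_obs, y_hat, p):
    """Return SS, df and MS for the regression, residual and total rows."""
    n = len(y_obs)
    y_bar = sum(y_obs) / n
    ss_reg = sum((f - y_bar) ** 2 for f in y_hat)
    ss_res = sum((o - f) ** 2 for o, f in zip(y_obs, y_hat))
    ss_tot = sum((o - y_bar) ** 2 for o in y_obs)
    return {
        "regression": {"SS": ss_reg, "df": p, "MS": ss_reg / p},
        "residual": {"SS": ss_res, "df": n - p - 1, "MS": ss_res / (n - p - 1)},
        "total": {"SS": ss_tot, "df": n - 1},
    }


# Hypothetical data, one predictor.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares slope and intercept for the single-predictor case.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * a for a in x]

table = anova_partition(y, y_hat, p=1)
f_ratio = table["regression"]["MS"] / table["residual"]["MS"]
for source, row in table.items():
    print(source, row)
print("F =", round(f_ratio, 3))
```

The ratio of the regression mean square to the residual mean square is the F statistic for the overall regression.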

Matrix algebra approach to OLS estimation of multiple regression models

• Model: $Y = X\beta + \varepsilon$

• Normal equations: $X'Xb = X'Y$

• OLS estimate: $b = (X'X)^{-1}X'Y$
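The normal equations can be solved directly for small problems. The following is a minimal sketch using only the Python standard library: it builds X'X and X'Y, then solves the linear system by Gauss-Jordan elimination rather than forming the inverse explicitly. The helper names and the data are hypothetical; the response is constructed as exactly 1 + 2·x1 + 3·x2 so the recovered coefficients are easy to check.

```python
def solve(a, rhs):
    """Solve a*b = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(predictor_rows, y):
    """OLS coefficients (intercept first) from the normal equations X'Xb = X'Y."""
    X = [[1.0] + list(row) for row in predictor_rows]  # column of ones for the intercept
    Xt = list(zip(*X))                                 # columns of X
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in Xt] for ci in Xt]
    XtY = [sum(a * b for a, b in zip(ci, y)) for ci in Xt]
    return solve(XtX, XtY)


# Hypothetical predictors and a response built from known coefficients.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

b = ols(list(zip(x1, x2)), y)
print([round(v, 6) for v in b])
```

When predictors are nearly collinear, X'X is close to singular and this solve becomes numerically unstable, which is the computational problem described under Collinearity.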

Criteria for “best” fitting in multiple regression with p predictors.

Criterion                            Formula

r2                                   $r^2 = \dfrac{SS_{\mathrm{Regression}}}{SS_{\mathrm{Total}}} = 1 - \dfrac{SS_{\mathrm{Residual}}}{SS_{\mathrm{Total}}}$

Adjusted r2                          $1 - \dfrac{n - 1}{n - p - 1}\,(1 - r^2)$

Akaike Information Criterion (AIC)   $n \ln\!\left(\dfrac{SS_{\mathrm{Residual}}}{n}\right) + 2(p + 1)$

Hierarchical partitioning and model selection

No. pred   Model                   r2       Adj r2   P        AIC (R)
1          Lon                     0.0006   -0.013   0.84     30.15
1          Lat                     0.47     0.46     <0.001   -16.16
2          Lon + Lat               0.48     0.46     <0.001   -15.25
3          Lon + Lat + Lon x Lat   0.54     0.52     <0.001   -22.55
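The adjusted r2 column can be reproduced from the r2 column using the usual adjustment, 1 − [(n − 1)/(n − p − 1)](1 − r2), with n = 73 sites. The short sketch below does this for three of the models; the function name is hypothetical, and because the tabulated r2 values are already rounded, agreement is only expected to about two decimal places.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r2 for a model with p predictors fitted to n observations."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)


n_sites = 73

# (model, number of predictors, r2) taken from the table above.
models = [
    ("Lon", 1, 0.0006),
    ("Lat", 1, 0.47),
    ("Lon + Lat + Lon x Lat", 3, 0.54),
]

for name, p, r2 in models:
    print(f"{name}: adjusted r2 = {adjusted_r2(r2, n_sites, p):.3f}")
```

The penalty grows with p, which is why adding longitude to the latitude-only model raises r2 slightly but leaves adjusted r2 essentially unchanged.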

[Figure: diagram relating Longitude and Latitude to C3; R2 = 0.48]

[Figure: Model Lat + Long and model Long x Lat. Fitted surfaces (Y_hats.longlat and Y_hats.longxlat) plotted over centered longitude (cLONG) and centered latitude (cLAT), each shown with the two horizontal axes in both orders]

[Figure: C3 grasses in North America, model Lat * Long. Relative abundance against longitude, with fitted lines at 35 and 45 degrees latitude]

The final forward model selection is:

Step: AIC=-228.67
SQRT_C3 ~ LAT + MAP + JJAMAP + DJFMAP

         Df   Sum of Sq   RSS      AIC
<none>                    2.7759   -228.67
+ LONG   1    0.0209705   2.7549   -227.23
+ MAT    1    0.0001829   2.7757   -226.68

Call:
lm(formula = SQRT_C3 ~ LAT + MAP + JJAMAP + DJFMAP)

Coefficients:
(Intercept)   LAT         MAP         JJAMAP       DJFMAP
-0.7892663    0.0391180   0.0001538   -0.8573419   -0.7503936

The final backward selection model is:

Step: AIC=-229.32
SQRT_C3 ~ LAT + JJAMAP + DJFMAP

           Df   Sum of Sq   RSS      AIC
<none>                      2.8279   -229.32
- DJFMAP   1    0.26190     3.0898   -224.85
- JJAMAP   1    0.31489     3.1428   -223.61
- LAT      1    2.82772     5.6556   -180.72

Call:
lm(formula = SQRT_C3 ~ LAT + JJAMAP + DJFMAP)

Coefficients:
(Intercept)   LAT       JJAMAP     DJFMAP
-0.53148      0.03748   -1.02823   -1.05164
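The AIC values printed in the stepwise output can be recovered from the residual sums of squares. The sketch below assumes the least-squares form n·ln(RSS/n) + 2(p + 1), where p is the number of predictors and the extra 1 counts the intercept, and assumes n = 73 sites; the function name is hypothetical. Under those assumptions the results should agree with the reported values to within rounding.

```python
from math import log


def aic_least_squares(rss, n, p):
    """AIC (up to an additive constant) for a least-squares fit with p predictors."""
    return n * log(rss / n) + 2 * (p + 1)


n_sites = 73

# RSS values are taken from the <none> rows of the two selection tables.
aic_forward = aic_least_squares(2.7759, n_sites, p=4)   # LAT + MAP + JJAMAP + DJFMAP
aic_backward = aic_least_squares(2.8279, n_sites, p=3)  # LAT + JJAMAP + DJFMAP

print(f"forward:  {aic_forward:.2f}")
print(f"backward: {aic_backward:.2f}")
```

Dropping MAP raises the residual sum of squares slightly, but the saving of one parameter more than offsets it, so the three-predictor model has the lower AIC.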
