Regression


TRANSCRIPT

Page 1: Regression

Regression

Page 2: Regression

Georg Simon Ohm, 1789-1854

Page 3: Regression

Ohm's law: V = R·I. Measured current I_j and voltage V_j:

j:    1    2    3
I_j:  2    3    4
V_j:  6    8   13

R = ?  The individual ratios V_j / I_j disagree: 6/2 = 3.00, 8/3 ≈ 2.67, 13/4 = 3.25.

Page 4: Regression

Can a single R reproduce the data exactly? That would require

6 − 2R = 0,   8 − 3R = 0,   13 − 4R = 0

simultaneously, i.e. R = 3, R = 8/3 and R = 13/4 at once, which is impossible.

Page 5: Regression

Instead, choose R so that the residuals V_j − R·I_j are collectively small: minimize

(6 − 2R)² + (8 − 3R)² + (13 − 4R)²

Page 6: Regression

[Figure: scatter plot of V against I with the fitted line V = R·I]

Page 7: Regression

Observed vs. expected voltage under V = R·I:

j   I_j   V_j (Obs.)   R·I_j (Expec.)   (V_j − R·I_j)²
1    2     6            2R               (6 − 2R)²
2    3     8            3R               (8 − 3R)²
3    4    13            4R               (13 − 4R)²

Σ (O − E)² = (6 − 2R)² + (8 − 3R)² + (13 − 4R)²

Page 8: Regression

Q(R) = (6 − 2R)² + (8 − 3R)² + (13 − 4R)²
     = (2² + 3² + 4²)·R² − 2·(2·6 + 3·8 + 4·13)·R + C

Minimizing over R:

R̂ = (2·6 + 3·8 + 4·13) / (2² + 3² + 4²) = 88/29 ≈ 3.03
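A minimal R sketch of this calculation, using the current and voltage values from the slide:

# Ohm's law example: V = R*I; least-squares estimate of R (no intercept)
i <- c(2, 3, 4)      # current
v <- c(6, 8, 13)     # voltage
sum(i * v) / sum(i^2)      # closed form: 88/29 = 3.034...
coef(lm(v ~ 0 + i))        # same value from a no-intercept fit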

Page 9: Regression

The same structure arises with gas-mileage efficiency: distance driven D_j = e·G_j, where G_j is the fuel used and e is the efficiency to be estimated.

j:    1    2    3
G_j:  2    3    4
D_j:  6    8   13

e = ?

Page 10: Regression

[Figure: scatter plot of Y against X]

Francis Galton (1822 – 1911)

Page 11: Regression

How can you get Jack’s beanstalk ?

Page 12: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²)

simple (linear) regression

α + β·X_j : genetic character (the model part);   β : slope of the line
ε_j : acquired character (residual, noise)
X : independent variable;   Y : dependent variable
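A minimal R sketch of this model, with made-up values of α, β and σ (they are not from the slides), showing how data generated this way are fitted with lm():

set.seed(1)
n     <- 50
alpha <- 2      # assumed intercept (not from the slides)
beta  <- 0.7    # assumed slope
sigma <- 0.3    # assumed noise standard deviation
x   <- runif(n, 2, 5)
eps <- rnorm(n, 0, sigma)
y   <- alpha + beta * x + eps     # Y_j = alpha + beta*X_j + eps_j
fit <- lm(y ~ x)
coef(fit)             # estimates of alpha and beta
summary(fit)$sigma    # estimate of sigma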

Page 13: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²)

is equivalent to

Y_j ~ N(α + β·X_j, σ²),   j = 1, 2, …, n,   independent,

i.e.   Y_j ~ N(μ_j, σ²) with μ_j = α + β·X_j,   j = 1, 2, …, n,   independent.

Page 14: Regression

Given the model

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²),

estimate α and β by minimizing Σ_j (Y_j − α − β·X_j)².  What about σ²?

Question: what happens if all X_j's are 1?

Page 15: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²)

Minimize Σ_j (Y_j − α − β·X_j)² to obtain α̂ and β̂; then estimate σ².

ε_j = Y_j − α − β·X_j : noise (residual)
ε̂_j = Y_j − α̂ − β̂·X_j : (estimated) residual

Page 16: Regression

[Figure: scatter plot of Y against X with the fitted simple (linear) regression line]

Page 17: Regression

[Figure: scatter plot of Y against X]

Page 18: Regression

ANOVA table:

Source            | SS  | df  | MS              | F
Regression        | SSR | 1   | MSR = SSR/1     | MSR/MSE
Error (residual)  | SSE | n−2 | MSE = SSE/(n−2) |
Total             | SST | n−1 |                 |

SST = SSR + SSE;   R² = SSR/SST : coefficient of determination
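The identity SST = SSR + SSE and the definition R² = SSR/SST can be checked directly in R; the built-in cars data set is used here purely for illustration (it is not part of the slides):

fit <- lm(dist ~ speed, data = cars)
a   <- anova(fit)
SSR <- a["speed", "Sum Sq"]
SSE <- a["Residuals", "Sum Sq"]
SST <- sum((cars$dist - mean(cars$dist))^2)
c(SST = SST, SSR_plus_SSE = SSR + SSE)                     # identical
c(R2 = SSR / SST, from_summary = summary(fit)$r.squared)   # identical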

Page 19: Regression

Fisher's Iris data (species: setosa, versicolor, virginica)

     S.Length  S.Width  P.Length  P.Width  Species
1         5.1      3.5       1.4      0.2  setosa
2         4.9      3.0       1.4      0.2  setosa
…
49        5.3      3.7       1.5      0.2  setosa
50        5.0      3.3       1.4      0.2  setosa
51        7.0      3.2       4.7      1.4  versicolor
52        6.4      3.2       4.5      1.5  versicolor
…
99        6.2      2.9       4.3      1.3  versicolor
100       5.7      2.8       4.1      1.3  versicolor
101       6.3      3.3       6.0      2.5  virginica
…
150       5.9      3.0       5.1      1.8  virginica

[Figure: scatter-plot matrix (pairs plot) of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
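A scatter-plot matrix like the one on this slide comes from pairs(); colouring the points by species is an assumption about how the original figure was drawn:

pairs(iris[, 1:4], col = as.numeric(iris$Species), pch = 16)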

Page 20: Regression

For setosa of Iris data

[Figure: scatter plot of sepal.length against sepal.width for the 50 setosa observations]

Page 21: Regression

For setosa of Iris data

> setosa <- iris[1:50, 1:4]
> head(setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
….
6          5.4         3.9          1.7         0.4
> names(setosa) <- c("sl","sw","pl","pw")
> head(setosa)
   sl  sw  pl  pw
1 5.1 3.5 1.4 0.2
….
6 5.4 3.9 1.7 0.4
>
> plot(setosa$sw, setosa$sl, pch=16, xlab="sepal.width", ylab="sepal.length")
> ( rout <- lm(sl ~ sw, data=setosa) )

Call:
lm(formula = sl ~ sw, data = setosa)

Coefficients:
(Intercept)           sw
     2.6390       0.6905

> abline(rout, col="red")

Page 22: Regression

For setosa of Iris data

> summary(rout)

Call:
lm(formula = sl ~ sw, data = setosa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52476 -0.16286  0.02166  0.13833  0.44428

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
sw            0.6905     0.0899   7.681 6.71e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2385 on 48 degrees of freedom
Multiple R-squared: 0.5514,  Adjusted R-squared: 0.542
F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

Page 23: Regression

For setosa of Iris data

> anova(rout)
Analysis of Variance Table

Response: sl
          Df Sum Sq Mean Sq F value   Pr(>F)
sw         1 3.3569  3.3569  58.994 6.71e-10 ***
Residuals 48 2.7313  0.0569
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>


Page 24: Regression

Y_j = f(X_j) + ε_j,   i.e.   Y_j = V_j + ε_j  with  V_j = f(X_j)

e.g.  Y_j = α·(1 − exp(β·X_j)) + ε_j  — non-linear regression
e.g.  Y_j = α + β·log(X_j) + ε_j  — linear regression after data transformation

Non-linear Regression vs. Linear Regression: some relationships become linear after a data transformation.
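A minimal sketch of the "linear after data transformation" case, with made-up parameter values: Y = α + β·log(X) + ε is non-linear in X but linear in log(X), so lm() can fit it after transforming the predictor:

set.seed(2)
x <- runif(60, 1, 20)
y <- 1.5 + 2 * log(x) + rnorm(60, 0, 0.2)   # assumed alpha = 1.5, beta = 2, sigma = 0.2
coef(lm(y ~ log(x)))                        # recovers roughly (1.5, 2)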

Page 25: Regression

J. Kepler, 1571-1630; Tycho Brahe, 1546-1601

Page 27: Regression

Kepler's third law:  P_j = k·a_j^(3/2)   (P = orbital period, a = semi-major axis)

Taking logarithms gives  Y_j = α + β·X_j + ε_j  with  Y_j = log P_j,  X_j = log a_j,  and slope β = 3/2.
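The same idea can be tried with rough orbital data (semi-major axis in AU, period in years; the rounded values below are standard textbook figures, not taken from the slides). Regressing log P on log a gives a slope close to 3/2:

a <- c(0.387, 0.723, 1.000, 1.524, 5.203, 9.537)   # Mercury, Venus, Earth, Mars, Jupiter, Saturn
P <- c(0.241, 0.615, 1.000, 1.881, 11.86, 29.46)
coef(lm(log(P) ~ log(a)))                          # slope approximately 1.5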

Page 28: Regression

Forbes' data — boiling point and barometric pressure (James D. Forbes, 1875):

ID   Boiling Temp. (°F)   Air Press. (inches Hg)
 1        194.5                 20.79
 2        194.3                 20.79
 3        197.9                 22.40
 4        198.4                 22.67
 5        199.4                 23.15
 6        199.9                 23.35
 7        200.9                 23.89
 8        201.1                 23.99
 9        201.4                 24.02
10        201.3                 24.01
11        203.6                 25.14
12        204.6                 26.57
13        209.5                 28.49
14        208.6                 27.76
15        210.7                 29.04
16        211.9                 29.88
17        212.2                 30.06

Page 29: Regression

[Figure: log(Air Pressure) plotted against Boiling Temperature (BT), roughly 195–212 °F on the x-axis and 3.1–3.4 on the y-axis, with the fitted line]

log(Air Pressure) = β₀ + β₁·(BT) + ε,    β̂₀ = −0.9709,  β̂₁ = 0.0206

> install.packages("forward")
> library(forward)
> forbes
> lm(log(pres) ~ bp, data=forbes)

Call:
lm(formula = log(pres) ~ bp, data = forbes)

Coefficients:
(Intercept)           bp
   -0.97087      0.02062

Page 30: Regression

Intercept model (the default):
> lm(y ~ 1 + x)
> lm(y ~ x)

No-intercept model example:
> lm(y ~ 0 + x)
> lm(y ~ -1 + x)

Lawn roller data:

> install.packages("DAAG")
> install.packages("randomForest")
> library(DAAG)
> roller
> rout0 <- lm(depression ~ -1 + weight, data=roller)
> rout1 <- lm(depression ~ weight, data=roller)
> with(roller, plot(weight, depression, pch=16, xlim=c(-1,14)))
> abline(rout1)
> abline(rout0, col="red")
> points(0, 0, pch=16, col="red")

[Figure: scatter plot of depression against weight for the lawn roller data, with the intercept fit (black) and the no-intercept fit through the origin (red)]

Page 31: Regression

No-intercept model example: lawn roller data

> roller
   weight depression
1     1.9          2
2     3.1          1
3     3.3          5
4     4.8          5
5     5.3         20
6     6.1         20
7     6.4         23
8     7.6         10
9     9.8         30
10   12.4         25

> anova(rout0)
Analysis of Variance Table
Response: depression
          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 2637.32  2637.3  63.862 2.233e-05 ***
Residuals  9  371.68    41.3

The residual df for the no-intercept model is n − 1 = 9 (n = 10).

> anova(rout1)
Analysis of Variance Table
Response: depression
          Df Sum Sq Mean Sq F value   Pr(>F)
weight     1 657.97  657.97  14.503 0.005175 **
Residuals  8 362.93   45.37

The residual df for the intercept model is n − 2 = 8.

The SSE of the no-intercept model is always at least as large as the SSE of the intercept model. The intercept model therefore often has the smaller MSE as well, but for this lawn roller data the MSE of the no-intercept model (41.3) is smaller than that of the intercept model (45.37). This suggests that the no-intercept model is suitable here.

Page 32: Regression

No-intercept model example: lawn roller data

> summary(rout0)

Call: lm(formula = depression ~ -1 + weight, data = roller)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
weight   2.3919     0.2993   7.991 2.23e-05 ***

Residual standard error: 6.426 on 9 degrees of freedom
Multiple R-squared: 0.8765,  Adjusted R-squared: 0.8628
F-statistic: 63.86 on 1 and 9 DF,  p-value: 2.233e-05

> summary(rout1)

Call: lm(formula = depression ~ weight, data = roller)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0871     4.7543  -0.439  0.67227
weight        2.6667     0.7002   3.808  0.00518 **

Residual standard error: 6.735 on 8 degrees of freedom
Multiple R-squared: 0.6445,  Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF,  p-value: 0.005175

The p-value of the no-intercept model is smaller than that of the intercept model for this lawn roller data.

Page 33: Regression

[Figure: Y plotted against two predictors X1 and X2]

Page 34: Regression

Y_j = α + β₁·X1_j + β₂·X2_j + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²)

multiple (linear) regression

In vector form, with Y = (Y₁, …, Yₙ)ᵗ, 1 = (1, …, 1)ᵗ, X1 = (X1_1, …, X1_n)ᵗ, X2 = (X2_1, …, X2_n)ᵗ and ε = (ε₁, …, εₙ)ᵗ:

Y = α·1 + β₁·X1 + β₂·X2 + ε

Page 35: Regression

For versicolor of Iris data

> vscolor <- iris[51:100, 1:4]
> names(vscolor) <- c("sl","sw","pl","pw")
> rout <- lm(sl ~ sw + pl, data=vscolor)
> summary(rout)
….

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.1164     0.4943   4.282 9.06e-05 ***
sw            0.2476     0.1868   1.325    0.191
pl            0.7356     0.1248   5.896 3.87e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3399 on 47 degrees of freedom
Multiple R-squared: 0.5841,  Adjusted R-squared: 0.5664
F-statistic: 33.01 on 2 and 47 DF,  p-value: 1.11e-09

Page 36: Regression

For versicolor of Iris data

> summary( lm(sl ~ sw + pl + pw, data=vscolor) )

Call:
lm(formula = sl ~ sw + pl + pw, data = vscolor)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7248 -0.2406 -0.0321  0.2958  0.5594

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.8955     0.5071   3.738 0.000511 ***
sw            0.3869     0.2045   1.891 0.064890 .
pl            0.9083     0.1654   5.491 1.67e-06 ***
pw           -0.6792     0.4354  -1.560 0.125599
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3348 on 46 degrees of freedom
Multiple R-squared: 0.605,  Adjusted R-squared: 0.5793
F-statistic: 23.49 on 3 and 46 DF,  p-value: 2.28e-09

Page 37: Regression

The same slopes, but different intercepts

> irs <- iris
> names(irs) <- c("sl","sw","pl","pw","sp")
> setosa <- irs[1:50,]; vscolor <- irs[51:100,]
> vginica <- irs[101:150,]

> with(irs, plot(sw, sl, xlab="sepal.width",
+   ylab="sepal.length"))
> with(setosa, points(sw, sl, pch=16, col="red"))
> with(vscolor, points(sw, sl, pch=16, col="blue"))
> with(vginica, points(sw, sl, pch=16, col="sienna2"))

> rout1 <- lm(sl ~ sw, data=setosa)
> rout2 <- lm(sl ~ sw, data=vscolor)
> rout3 <- lm(sl ~ sw, data=vginica)

> cfx <- coef(lm(sl ~ sw + sp, data=irs))
> cf1 <- c( cfx[1],          cfx[2] )
> cf2 <- c( cfx[1] + cfx[3], cfx[2] )
> cf3 <- c( cfx[1] + cfx[4], cfx[2] )

> lines(c(2.5,4.3), c(sum(cf1*c(1,2.5)), sum(cf1*c(1,4.3))), lwd=4, col="pink")
> lines(c(2.0,3.5), c(sum(cf2*c(1,2.0)), sum(cf2*c(1,3.5))), lwd=4, col="skyblue")
> lines(c(2.2,3.9), c(sum(cf3*c(1,2.2)), sum(cf3*c(1,3.9))), lwd=4, col="tan1")

[Figure: sepal.length against sepal.width for all three species, with three parallel fitted lines (same slope, different intercepts)]

Page 38: Regression

Y_kj = α_k + β·X_kj + ε_kj,   k = 1, 2, 3 (K = 3 species),   j = 1, 2, …, 50 (n = 50),
ε_kj iid ~ N(0, σ²)

for setosa:      Y_1j = α₁ + β·X_1j + ε_1j
for versicolor:  Y_2j = α₂ + β·X_2j + ε_2j
for virginica:   Y_3j = α₃ + β·X_3j + ε_3j

One common slope β, three different intercepts α₁, α₂, α₃.

Stacking all 150 observations, the model is written with dummy variables: the design matrix contains a column of 1's, the X column, and 0/1 indicator columns for the groups, so that each group gets its own intercept. The grouping variable (species) is a factor (categorical) variable; its indicator columns are the dummy variables. With R's default coding the reported coefficients are α₁, α₂ − α₁ and α₃ − α₁ (see the next pages).
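The 0/1 dummy columns that lm() builds for the factor sp can be inspected with model.matrix(); a short sketch using the renamed data frame irs created on the previous page:

# Design matrix for sl ~ sw + sp: intercept, sw, and two indicator columns
# (setosa is the baseline level, so it has no column of its own)
head(model.matrix(~ sw + sp, data = irs))
unique(model.matrix(~ sw + sp, data = irs)[, c("spversicolor", "spvirginica")])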

Page 39: Regression

> is.factor(irs$sp)
[1] TRUE
> is.factor(irs$sw)
[1] FALSE
> rout <- lm(sl ~ sw + sp, data=irs)
> coef(rout)
 (Intercept)           sw spversicolor  spvirginica
   2.2513932    0.8035609    1.4587431    1.9468166
> summary(rout)

Call: lm(formula = sl ~ sw + sp, data = irs)

Residuals:
     Min       1Q   Median       3Q      Max
-1.30711 -0.25713 -0.05325  0.19542  1.41253

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.2514     0.3698   6.089 9.57e-09 ***
sw             0.8036     0.1063   7.557 4.19e-12 ***
spversicolor   1.4587     0.1121  13.012  < 2e-16 ***
spvirginica    1.9468     0.1000  19.465  < 2e-16 ***

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared: 0.7259,  Adjusted R-squared: 0.7203
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

> anova(rout)
Analysis of Variance Table

Response: sl
           Df Sum Sq Mean Sq  F value   Pr(>F)
sw          1  1.412   1.412   7.3628  0.00746 **
sp          2 72.752  36.376 189.6512  < 2e-16 ***
Residuals 146 28.004   0.192

sp is a factor variable for 3 groups (3 species, K = 3), so its df is K − 1 = 2.

Page 40: Regression

Regression for several groups in one command, using a factor variable

> coef( lm(sl ~ sp*sw, data=irs) )
    (Intercept)    spversicolor     spvirginica              sw
      2.6390012       0.9007335       1.2678352       0.6904897
spversicolor:sw  spvirginica:sw
      0.1745880       0.2110448

> ( csetosa <- coef( lm(sl ~ sw, data=setosa) ) )
(Intercept)          sw
  2.6390012   0.6904897

> coef( lm(sl ~ sw, data=vscolor) ) - csetosa
(Intercept)          sw
  0.9007335   0.1745880

> coef( lm(sl ~ sw, data=vginica) ) - csetosa
(Intercept)          sw
  1.2678352   0.2110448

Page 41: Regression

Thank you !!

Page 42: Regression

Minimizing a quadratic: f(x) = a·x² + b·x + c   (a > 0)

f′(x) = 2a·x + b = 0   ⇒   x̂ = −b / (2a)

Page 43: Regression

Q(α, β) = Σ_j (Y_j − α − β·X_j)²

∂Q/∂α = −2 Σ_j (Y_j − α − β·X_j) = 2 ( n·α + β·Σ_j X_j − Σ_j Y_j )

∂Q/∂β = −2 Σ_j X_j (Y_j − α − β·X_j) = 2 ( α·Σ_j X_j + β·Σ_j X_j² − Σ_j X_j Y_j )

Setting both derivatives to zero at (α̂, β̂):

n·α̂ + β̂·Σ_j X_j − Σ_j Y_j = 0
α̂·Σ_j X_j + β̂·Σ_j X_j² − Σ_j X_j Y_j = 0

Page 44: Regression

n·α̂ + β̂·Σ_j X_j − Σ_j Y_j = 0
α̂·Σ_j X_j + β̂·Σ_j X_j² − Σ_j X_j Y_j = 0

⇔   (the normal equations)

n·α̂ + β̂·Σ_j X_j = Σ_j Y_j
α̂·Σ_j X_j + β̂·Σ_j X_j² = Σ_j X_j Y_j
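The normal equations can be solved numerically and compared with lm(); a sketch using the setosa data from the earlier slides:

setosa <- iris[1:50, 1:4]; names(setosa) <- c("sl", "sw", "pl", "pw")
x <- setosa$sw;  y <- setosa$sl;  n <- length(y)
A   <- matrix(c(n,      sum(x),
                sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))
solve(A, rhs)       # (alpha-hat, beta-hat)
coef(lm(y ~ x))     # 2.6390 and 0.6905, as reported on the earlier slides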

Page 45: Regression

Vector notation:

1 = (1, 1, …, 1)ᵗ     (n × 1 vector of ones)
X = (X₁, X₂, …, Xₙ)ᵗ
Y = (Y₁, Y₂, …, Yₙ)ᵗ

Page 46: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n

In vector form, with Y = (Y₁, …, Yₙ)ᵗ, X = (X₁, …, Xₙ)ᵗ and ε = (ε₁, …, εₙ)ᵗ:

Y = α·1 + β·X + ε

Page 47: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n

Which coefficient pair fits better? Compare Y with two candidate fitted vectors, 150·1 + 0.1·X and 100·1 + 0.5·X:

  Y    1    X    150·1 + 0.1X   100·1 + 0.5X
160    1  176          167.6            188
173    1  168          166.8            184
185    1  174          167.4            187
  …    …    …              …              …
189    1  182          168.2            191

Choose the candidate Ŷ that makes Σ_j (Y_j − Ŷ_j)² small.
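Using only the four rows visible on the slide (the "…" rows are not shown), the two candidate coefficient pairs can be compared by their sums of squared errors; a sketch:

Y <- c(160, 173, 185, 189)      # rows shown on the slide; the elided rows are omitted
X <- c(176, 168, 174, 182)
fit1 <- 150 + 0.1 * X           # candidate (alpha, beta) = (150, 0.1)
fit2 <- 100 + 0.5 * X           # candidate (alpha, beta) = (100, 0.5)
c(sum((Y - fit1)^2), sum((Y - fit2)^2))   # the candidate with the smaller value fits better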

Page 48: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²)

With Y = (Y₁, …, Yₙ)ᵗ, 1 = (1, …, 1)ᵗ and X = (X₁, …, Xₙ)ᵗ, the fitted vector is

Ŷ = α̂·1 + β̂·X

Page 49: Regression

Y_j = α + β·X_j + ε_j,   j = 1, 2, …, n

Minimizing Σ_j (Y_j − α − β·X_j)² gives α̂ and β̂.  What about σ²?

SSE = Σ_j (Y_j − α̂ − β̂·X_j)²,    SSE/σ² ~ χ²(n − 2)

MSE = SSE / (n − 2)   (estimate of σ²)
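In R, the "Residual standard error" printed by summary() is sqrt(MSE). A quick check with the setosa fit from the earlier slides:

setosa <- iris[1:50, 1:4]; names(setosa) <- c("sl", "sw", "pl", "pw")
rout <- lm(sl ~ sw, data = setosa)
SSE <- sum(residuals(rout)^2)
MSE <- SSE / (50 - 2)                 # n = 50 setosa observations
c(MSE = MSE, sqrtMSE = sqrt(MSE), reported = summary(rout)$sigma)
# MSE is about 0.0569 and sqrt(MSE) about 0.2385, matching the anova and summary output above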

Page 50: Regression

A simpler model first:  Y_j = μ + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²),
i.e.  Y_j iid ~ N(μ, σ²),   j = 1, 2, …, n.

In vector form:  Y = μ·1 + ε,   Y = (Y₁, …, Yₙ)ᵗ.

Minimize Σ_j (Y_j − μ)²: the minimizer is μ̂ = Ȳ, with minimum value Σ_j (Y_j − Ȳ)².

Page 51: Regression

Vector operations:

a = (a₁, a₂, …, aₙ)ᵗ,   b = (b₁, b₂, …, bₙ)ᵗ

v·a = (v·a₁, v·a₂, …, v·aₙ)ᵗ               (scalar multiple)
a + b = (a₁ + b₁, a₂ + b₂, …, aₙ + bₙ)ᵗ    (sum)
aᵗb = Σ_{i=1}^{n} a_i·b_i                  (inner product)
a ⊥ b   ⇔   aᵗb = 0                        (orthogonality)

Page 52: Regression

Y_j = μ + ε_j,   j = 1, 2, …, n,   ε_j iid ~ N(0, σ²);   Y = μ·1 + ε

Write  Y = Ŷ + (Y − Ŷ)  with  Ŷ = μ̂·1,  and require the residual to be orthogonal to the fit:

Ŷᵗ(Y − Ŷ) = 0   ⇒   (μ̂·1)ᵗ(Y − μ̂·1) = 0   ⇒   1ᵗ(Y − μ̂·1) = 0

⇒   μ̂·(1ᵗ1) = 1ᵗY   ⇒   μ̂ = 1ᵗY / 1ᵗ1 = Ȳ

σ̂² = Σ_j (Y_j − Ȳ)² / (n − 1)

Page 53: Regression

Y_j = α + β₁·X1_j + β₂·X2_j + ε_j,   j = 1, 2, …, n

With Y = (Y₁, …, Yₙ)ᵗ, 1 = (1, …, 1)ᵗ, X1 = (X1_1, …, X1_n)ᵗ and X2 = (X2_1, …, X2_n)ᵗ, the fitted vector is

Ŷ = α̂·1 + β̂₁·X1 + β̂₂·X2