Regression

DESCRIPTION

Lecture slides on regression: Ohm's law (Georg Simon Ohm, 1789-1854) as a least-squares example, gas-mileage efficiency, simple linear regression after Francis Galton (1822-1911), ANOVA, Fisher's Iris data, data transformations, no-intercept models, multiple regression, and factor (dummy) variables.
Regression
Georg Simon Ohm, 1789-1854
Ohm's law:  V = R * I

Suppose we measure three (current, voltage) pairs:

  j   I_j   V_j
  1    2     6
  2    3     8
  3    4    13

The ratios V_j / I_j are 6/2 = 3, 8/3 = 2.67, 13/4 = 3.25 -- so which R?

Least squares: choose R to minimize the sum of squared differences
between observed (O_j = V_j) and expected (E_j = R * I_j) values,

  sum_j (O_j - E_j)^2 = (6 - 2R)^2 + (8 - 3R)^2 + (13 - 4R)^2
                      = (2^2 + 3^2 + 4^2) R^2 - 2 (6*2 + 8*3 + 13*4) R + C
                      = 29 R^2 - 176 R + C

The minimizer is

  R-hat = (6*2 + 8*3 + 13*4) / (2^2 + 3^2 + 4^2) = 88/29 = 3.03

  j   I_j   V_j (Obs.)   R-hat * I_j (Expec.)
  1    2      6                 6.07
  2    3      8                 9.10
  3    4     13                12.14
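The least-squares arithmetic in the Ohm example is easy to verify numerically. A quick cross-check (Python is used here purely for the arithmetic; the data are the three (I, V) pairs from the slide):

```python
# Least-squares estimate of R in V = R*I for the three measurements
# (I, V) = (2, 6), (3, 8), (4, 13) from the slide.
I = [2, 3, 4]
V = [6, 8, 13]

r_hat = sum(i * v for i, v in zip(I, V)) / sum(i * i for i in I)
print(round(r_hat, 2))  # 3.03  (= 88/29)
```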
Gas-mileage efficiency: the same idea. With G_j the gas consumed and D_j the
distance driven on trip j, the model is D_j = e * G_j. Given a table of
(G_j, D_j) pairs, which e? Estimate it by least squares, exactly as above.
Francis Galton (1822-1911)

How can you get Jack's beanstalk?

simple (linear) regression:

  Y_j = alpha + beta * X_j + eps_j,   j = 1, 2, ..., n,   eps_j ~ iid N(0, sigma^2)

  Y : dependent variable
  X : independent variable
  beta : slope of the line
  alpha + beta * X_j : genetic character (model)
  eps_j : noise -- acquired character (residual)

Equivalently,

  Y_j ~ N(alpha + beta * X_j, sigma^2),   j = 1, 2, ..., n,   independent,

i.e.  Y_j ~ N(mu_j, sigma^2)  with  mu_j = alpha + beta * X_j.

How do we estimate alpha, beta, sigma^2?
Least squares: choose (alpha-hat, beta-hat) to minimize

  sum_j (Y_j - alpha - beta * X_j)^2.

Question: What happens if all X_j's are 1?

eps_j is the (unobserved) noise; Y_j - alpha-hat - beta-hat * X_j is the
(estimated) residual.

[Scatterplot of Y versus X with fitted simple (linear) regression line]
ANOVA table

  Source             SS    df    MS                F
  Regression         SSR   1     MSR = SSR/1       MSR/MSE
  Error (residual)   SSE   n-2   MSE = SSE/(n-2)
  Total              SST   n-1

  SST = SSR + SSE

  R^2 = SSR / SST : coefficient of determination
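The decomposition SST = SSR + SSE can be checked numerically on a small made-up dataset (the X and Y values below are illustrative, not from the slides):

```python
# Verifying SST = SSR + SSE and R^2 = SSR/SST for a least-squares fit.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

beta = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
       sum((x - xbar) ** 2 for x in X)
alpha = ybar - beta * xbar
fitted = [alpha + beta * x for x in X]

sst = sum((y - ybar) ** 2 for y in Y)
ssr = sum((f - ybar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(Y, fitted))

print(round(sst, 6), round(ssr, 6), round(sse, 6))  # 8.8 6.4 2.4
print(round(ssr / sst, 4))                          # R^2 = 0.7273
```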
Fisher's Iris data: setosa, versicolor, virginica

      S.Length  S.Width  P.Length  P.Width  Species
  1      5.1      3.5      1.4       0.2    setosa
  2      4.9      3.0      1.4       0.2    setosa
  ................
  49     5.3      3.7      1.5       0.2    setosa
  50     5.0      3.3      1.4       0.2    setosa
  51     7.0      3.2      4.7       1.4    versicolor
  52     6.4      3.2      4.5       1.5    versicolor
  ................
  99     6.2      2.9      4.3       1.3    versicolor
  100    5.7      2.8      4.1       1.3    versicolor
  101    6.3      3.3      6.0       2.5    virginica
  ................
  150    5.9      3.0      5.1       1.8    virginica
[Pairs plot of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]

[Scatterplot of sepal.length versus sepal.width for setosa]
For setosa of Iris data
For setosa of Iris data

> setosa <- iris[1:50, 1:4]
> head(setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
....
6          5.4         3.9          1.7         0.4
> names(setosa) <- c("sl","sw","pl","pw")
> head(setosa)
   sl  sw  pl  pw
1 5.1 3.5 1.4 0.2
....
6 5.4 3.9 1.7 0.4
> plot(setosa$sw, setosa$sl, pch=16,
+      xlab="sepal.width", ylab="sepal.length")
> ( rout <- lm(sl ~ sw, data=setosa) )

Call:
lm(formula = sl ~ sw, data = setosa)

Coefficients:
(Intercept)           sw
     2.6390       0.6905

> abline(rout, col="red")
For setosa of Iris data

> summary(rout)

Call:
lm(formula = sl ~ sw, data = setosa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52476 -0.16286  0.02166  0.13833  0.44428

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
sw            0.6905     0.0899   7.681 6.71e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2385 on 48 degrees of freedom
Multiple R-squared: 0.5514,  Adjusted R-squared: 0.542
F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10
For setosa of Iris data

> anova(rout)
Analysis of Variance Table

Response: sl
          Df Sum Sq Mean Sq F value   Pr(>F)
sw         1 3.3569  3.3569  58.994 6.71e-10 ***
Residuals 48 2.7313  0.0569
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
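The ANOVA table ties back to the R-squared that summary() reports: R^2 = SSR / (SSR + SSE). A quick check with the two sums of squares from the setosa ANOVA output:

```python
# R^2 from the ANOVA decomposition: SSR is the sw row's Sum Sq,
# SSE is the Residuals row's Sum Sq (values from anova(rout) above).
ssr = 3.3569
sse = 2.7313
r_squared = ssr / (ssr + sse)
print(round(r_squared, 4))  # 0.5514, matching Multiple R-squared in summary(rout)
```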
Non-linear Regression

  Y_j = f(X_j) + eps_j

With V_j = f(X_j), this is Y_j = V_j + eps_j.

  e.g.)  Y_j = alpha * (1 - exp(-beta * X_j)) + eps_j   -- non-linear regression
  e.g.)  Y_j = alpha + beta * log(X_j) + eps_j          -- linear after data transformation

Linear Regression / Linear after data transformation

J. Kepler, 1571-1630; Tycho Brahe, 1546-1601

Kepler's third law: P_j = k * a_j^(3/2), with orbital period P and semi-major
axis a. Taking logs, with Y_j = log P_j and X_j = log a_j,

  Y_j = alpha + beta * X_j + eps_j,  with slope beta = 3/2.
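The 3/2 slope can be checked numerically. The semi-major axes (AU) and orbital periods (years) below are standard approximate values for six planets, not data from the slides:

```python
import math

# Log-log least squares: log P = alpha + beta * log a should give beta near 3/2.
a = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537]    # semi-major axis, AU
P = [0.241, 0.615, 1.000, 1.881, 11.862, 29.447]  # orbital period, years

x = [math.log(v) for v in a]
y = [math.log(v) for v in P]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
       sum((xi - xbar) ** 2 for xi in x)
print(round(beta, 3))  # close to 1.5
```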
Forbes' data (James D. Forbes, 1857): boiling point of water and air pressure
at 17 locations.

  ID   Boiling Tem. (F)   Air Press. (inches Hg)
   1       194.5               20.79
   2       194.3               20.79
   3       197.9               22.40
   4       198.4               22.67
   5       199.4               23.15
   6       199.9               23.35
   7       200.9               23.89
   8       201.1               23.99
   9       201.4               24.02
  10       201.3               24.01
  11       203.6               25.14
  12       204.6               26.57
  13       209.5               28.49
  14       208.6               27.76
  15       210.7               29.04
  16       211.9               29.88
  17       212.2               30.06
[Scatterplot of log(Air Pressure) versus Boiling Temperature (BT), with fitted line]

  log(Air Pressure) = beta0 + beta1 * (BT) + eps

  beta0-hat = -0.9709,   beta1-hat = 0.0206
> install.packages("forward")
> library(forward)
> forbes
> lm(log(pres) ~ bp, data=forbes)

Call:
lm(formula = log(pres) ~ bp, data = forbes)

Coefficients:
(Intercept)           bp
   -0.97087      0.02062
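The coefficients R reports can be reproduced by hand from the table (a Python cross-check of the same least-squares fit, using the 17 rows above; R's log is the natural log):

```python
import math

# Refitting log(pressure) = b0 + b1 * bp by ordinary least squares.
bp = [194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2]
pres = [20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
        24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06]

y = [math.log(p) for p in pres]
n = len(bp)
xbar, ybar = sum(bp) / n, sum(y) / n

b1 = sum((x - xbar) * (yi - ybar) for x, yi in zip(bp, y)) / \
     sum((x - xbar) ** 2 for x in bp)
b0 = ybar - b1 * xbar
print(round(b0, 5), round(b1, 5))  # approximately -0.97087 and 0.02062
```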
Intercept model (equivalent calls):

> lm(y ~ 1 + x)
> lm(y ~ x)

No-intercept model example:

> lm(y ~ 0 + x)
> lm(y ~ -1 + x)
[Scatterplot of depression versus weight for the lawn roller data, with the
intercept and no-intercept fitted lines]
> install.packages("DAAG")
> install.packages("randomForest")
> library(DAAG)
> roller
> rout0 <- lm(depression ~ -1 + weight, data=roller)
> rout1 <- lm(depression ~ weight, data=roller)
> with(roller, plot(weight, depression, pch=16, xlim=c(-1,14)))
> abline(rout1)
> abline(rout0, col="red")
> points(0, 0, pch=16, col="red")
Lawn roller data
> anova(rout0)
Analysis of Variance Table
Response: depression
          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 2637.32  2637.3  63.862 2.233e-05 ***
Residuals  9  371.68    41.3

> anova(rout1)
Analysis of Variance Table
Response: depression
          Df Sum Sq Mean Sq F value   Pr(>F)
weight     1 657.97  657.97  14.503 0.005175 **
Residuals  8 362.93   45.37

Df of residuals for the intercept model is n-2.
No-intercept model example: Lawn roller data

> roller
   weight depression
1     1.9          2
2     3.1          1
3     3.3          5
4     4.8          5
5     5.3         20
6     6.1         20
7     6.4         23
8     7.6         10
9     9.8         30
10   12.4         25
The SSE of the no-intercept model is always at least the SSE of the intercept model. In many cases the intercept model therefore also has the smaller MSE, but for this lawn roller data the MSE of the no-intercept model (41.3) is smaller than that of the intercept model (45.37). This alludes to the suitability of the no-intercept model here.
Df of residuals for no-intercept model is n-1. (n=10)
> summary(rout0)

Call:
lm(formula = depression ~ -1 + weight, data = roller)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
weight   2.3919     0.2993   7.991 2.23e-05 ***

Residual standard error: 6.426 on 9 degrees of freedom
Multiple R-squared: 0.8765,  Adjusted R-squared: 0.8628
F-statistic: 63.86 on 1 and 9 DF,  p-value: 2.233e-05

> summary(rout1)

Call:
lm(formula = depression ~ weight, data = roller)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0871     4.7543  -0.439  0.67227
weight        2.6667     0.7002   3.808  0.00518 **

Residual standard error: 6.735 on 8 degrees of freedom
Multiple R-squared: 0.6445,  Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF,  p-value: 0.005175
No-intercept model example: Lawn roller data

The p-value of the no-intercept model is smaller than that of the intercept model for this lawn roller data.
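For a regression through the origin the slope has the closed form b-hat = sum(x*y) / sum(x^2), so R's numbers for the roller data can be reproduced directly (a Python cross-check using the data table above):

```python
# Reproducing the no-intercept fit lm(depression ~ -1 + weight) by hand,
# with the lawn roller data copied from the slides.
weight     = [1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4]
depression = [2, 1, 5, 5, 20, 20, 23, 10, 30, 25]

sxy = sum(x * y for x, y in zip(weight, depression))
sxx = sum(x * x for x in weight)
b_hat = sxy / sxx                      # least-squares slope through the origin
sse = sum((y - b_hat * x) ** 2 for x, y in zip(weight, depression))

print(round(b_hat, 4))  # 2.3919, matching summary(rout0)
print(round(sse, 2))    # 371.68, matching anova(rout0)
```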
multiple (linear) regression

  Y_j = beta0 + beta1 * X_1j + beta2 * X_2j + eps_j,   j = 1, 2, ..., n,
  eps_j ~ iid N(0, sigma^2)

In vector form,

  [ Y_1 ]   [ 1 ]           [ X_11 ]           [ X_21 ]           [ eps_1 ]
  [ Y_2 ] = [ 1 ] beta0  +  [ X_12 ] beta1  +  [ X_22 ] beta2  +  [ eps_2 ]
  [  .  ]   [ . ]           [   .  ]           [   .  ]           [   .   ]
  [ Y_n ]   [ 1 ]           [ X_1n ]           [ X_2n ]           [ eps_n ]

i.e.  Y = beta0 * 1 + beta1 * X_1 + beta2 * X_2 + eps.
For versicolor of Iris data
> vscolor <- iris[51:100, 1:4]
> names(vscolor) <- c("sl","sw","pl","pw")
> rout <- lm(sl ~ sw + pl, data=vscolor)
> summary(rout)
....
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.1164     0.4943   4.282 9.06e-05 ***
sw            0.2476     0.1868   1.325    0.191
pl            0.7356     0.1248   5.896 3.87e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3399 on 47 degrees of freedom
Multiple R-squared: 0.5841,  Adjusted R-squared: 0.5664
F-statistic: 33.01 on 2 and 47 DF,  p-value: 1.11e-09
For versicolor of Iris data
> summary( lm(sl ~ sw + pl + pw, data=vscolor) )

Call:
lm(formula = sl ~ sw + pl + pw, data = vscolor)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7248 -0.2406 -0.0321  0.2958  0.5594

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.8955     0.5071   3.738 0.000511 ***
sw            0.3869     0.2045   1.891 0.064890 .
pl            0.9083     0.1654   5.491 1.67e-06 ***
pw           -0.6792     0.4354  -1.560 0.125599
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3348 on 46 degrees of freedom
Multiple R-squared: 0.605,  Adjusted R-squared: 0.5793
F-statistic: 23.49 on 3 and 46 DF,  p-value: 2.28e-09
The same slope, but different intercepts
> irs <- iris
> names(irs) <- c("sl","sw","pl","pw","sp")
> setosa <- irs[1:50,]; vscolor <- irs[51:100,]
> vginica <- irs[101:150,]

> with(irs, plot(sw, sl, xlab="sepal.width",
+                ylab="sepal.length"))
> with(setosa,  points(sw, sl, pch=16, col="red"))
> with(vscolor, points(sw, sl, pch=16, col="blue"))
> with(vginica, points(sw, sl, pch=16, col="sienna2"))

> rout1 <- lm(sl ~ sw, data=setosa)
> rout2 <- lm(sl ~ sw, data=vscolor)
> rout3 <- lm(sl ~ sw, data=vginica)

> cfx <- coef(lm(sl ~ sw + sp, data=irs))
> cf1 <- c( cfx[1],          cfx[2] )
> cf2 <- c( cfx[1] + cfx[3], cfx[2] )
> cf3 <- c( cfx[1] + cfx[4], cfx[2] )

> lines(c(2.5,4.3), c(sum(cf1*c(1,2.5)), sum(cf1*c(1,4.3))), lwd=4, col="pink")
> lines(c(2.0,3.5), c(sum(cf2*c(1,2.0)), sum(cf2*c(1,3.5))), lwd=4, col="skyblue")
> lines(c(2.2,3.9), c(sum(cf3*c(1,2.2)), sum(cf3*c(1,3.9))), lwd=4, col="tan1")
[Scatterplot of sepal.length versus sepal.width for the three species, with
three parallel fitted lines]
Model with a factor variable (Iris species):

  Y_kj = alpha_k + beta * X_kj + eps_kj,
  j = 1, 2, ..., 50 (n = 50),   k = 1, 2, 3 (K = 3),
  eps_kj ~ iid N(0, sigma^2)

  Y_1j = alpha_1 + beta * X_1j + eps_1j   for setosa
  Y_2j = alpha_2 + beta * X_2j + eps_2j   for versicolor
  Y_3j = alpha_3 + beta * X_3j + eps_3j   for virginica

The same slope beta, but different intercepts alpha_1, alpha_2, alpha_3.

Stacking the three groups, the design matrix has three indicator columns and
the X column:

        [ 1  0  0  X_11 ]
        [ .  .  .   .   ]
        [ 1  0  0  X_1n ]
        [ 0  1  0  X_21 ]                [ alpha_1 ]
  X  =  [ .  .  .   .   ] ,   theta  =   [ alpha_2 ] ,   Y = X theta + eps,
        [ 0  1  0  X_2n ]                [ alpha_3 ]
        [ 0  0  1  X_31 ]                [ beta    ]
        [ .  .  .   .   ]
        [ 0  0  1  X_3n ]

  eps_kj ~ iid N(0, sigma^2).

The first three columns are dummy variables; together they encode the factor
(categorical) variable Species with levels setosa, versicolor, virginica.
> is.factor(irs$sp)
[1] TRUE
> is.factor(irs$sw)
[1] FALSE
> rout <- lm(sl ~ sw + sp, data=irs)
> coef(rout)
(Intercept)          sw spversicolor  spvirginica
  2.2513932   0.8035609    1.4587431    1.9468166
> summary(rout)

Call:
lm(formula = sl ~ sw + sp, data = irs)

Residuals:
     Min       1Q   Median       3Q      Max
-1.30711 -0.25713 -0.05325  0.19542  1.41253

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.2514     0.3698   6.089 9.57e-09 ***
sw             0.8036     0.1063   7.557 4.19e-12 ***
spversicolor   1.4587     0.1121  13.012  < 2e-16 ***
spvirginica    1.9468     0.1000  19.465  < 2e-16 ***

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared: 0.7259,  Adjusted R-squared: 0.7203
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16
> anova(rout)
Analysis of Variance Table

Response: sl
           Df Sum Sq Mean Sq  F value  Pr(>F)
sw          1  1.412   1.412   7.3628 0.00746 **
sp          2 72.752  36.376 189.6512 < 2e-16 ***
Residuals 146 28.004   0.192

sp is a factor variable for 3 groups (3 species, K = 3), so its df = K - 1 = 2.
> coef( lm(sl ~ sp*sw, data=irs) )
    (Intercept)    spversicolor     spvirginica              sw
      2.6390012       0.9007335       1.2678352       0.6904897
spversicolor:sw  spvirginica:sw
      0.1745880       0.2110448

> ( csetosa <- coef( lm(sl ~ sw, data=setosa) ) )
(Intercept)          sw
  2.6390012   0.6904897

> coef( lm(sl ~ sw, data=vscolor) ) - csetosa
(Intercept)          sw
  0.9007335   0.1745880

> coef( lm(sl ~ sw, data=vginica) ) - csetosa
(Intercept)          sw
  1.2678352   0.2110448
Regression for many datasets in one command, using a factor variable
Thank you !!
Minimizing a quadratic: for f(x) = a x^2 + b x + c with a > 0,

  f'(x-hat) = 2 a x-hat + b = 0   =>   x-hat = -b / (2a).
Least squares: minimize  f(alpha, beta) = sum_j (Y_j - alpha - beta * X_j)^2.
Setting the partial derivatives to zero,

  d f / d alpha = -2 sum_j (Y_j - alpha - beta * X_j) = 0
  d f / d beta  = -2 sum_j X_j (Y_j - alpha - beta * X_j) = 0

gives the normal equations

  n * alpha-hat            + (sum_j X_j)   * beta-hat = sum_j Y_j
  (sum_j X_j) * alpha-hat  + (sum_j X_j^2) * beta-hat = sum_j X_j Y_j
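The 2x2 system of normal equations can be solved directly, e.g. by Cramer's rule. A small numerical sketch (the X and Y values below are made up for illustration):

```python
# Solving the normal equations
#   n*a + sx*b = sy ;  sx*a + sxx*b = sxy
# directly for a tiny dataset, via Cramer's rule.
X = [1, 2, 3, 4]
Y = [2, 3, 5, 6]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))

det = n * sxx - sx * sx
alpha_hat = (sy * sxx - sx * sxy) / det
beta_hat = (n * sxy - sx * sy) / det

print(alpha_hat, beta_hat)  # 0.5 1.4
```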
In matrix form, the simple regression model Y_j = alpha + beta * X_j + eps_j,
j = 1, 2, ..., n, is

  [ Y_1 ]   [ 1 ]           [ X_1 ]          [ eps_1 ]
  [ Y_2 ] = [ 1 ] alpha  +  [ X_2 ] beta  +  [ eps_2 ]
  [  .  ]   [ . ]           [  .  ]          [   .   ]
  [ Y_n ]   [ 1 ]           [ X_n ]          [ eps_n ]

i.e.  Y = alpha * 1 + beta * X + eps.
Y_j = alpha + beta * X_j + eps_j, j = 1, 2, ..., n -- which line (alpha, beta)?
Two candidates, 150 * 1 + 0.1 * X and 100 * 1 + 0.5 * X:

    Y     1    X    150*1 + 0.1*X   100*1 + 0.5*X
   160    1   176       167.6            188
   173    1   168       166.8            184
   185    1   174       167.4            187
   ...   ...  ...        ...             ...
   189    1   182       168.2            191
Write Y-hat = alpha-hat * 1 + beta-hat * X for the vector of fitted values,
i.e. Y-hat_j = alpha-hat + beta-hat * X_j. The least-squares estimates minimize

  sum_j (Y_j - Y-hat_j)^2 = sum_j (Y_j - alpha - beta * X_j)^2.

For the model Y_j = alpha + beta * X_j + eps_j, eps_j ~ iid N(0, sigma^2),
how do we estimate sigma^2?

  SSE = sum_j (Y_j - alpha-hat - beta-hat * X_j)^2 ~ sigma^2 * chi^2(n-2),

so  MSE = SSE / (n - 2)  is an unbiased estimator of sigma^2.
The simplest model:  Y_j = mu + eps_j,  j = 1, 2, ..., n,
i.e.  Y_j ~ iid N(mu, sigma^2);  in vector form  Y = mu * 1 + eps,
with Y = (Y_1, Y_2, ..., Y_n)^t and 1 = (1, 1, ..., 1)^t.

  mu-hat = Y-bar,   sigma-hat^2 = sum_j (Y_j - Y-bar)^2 / (n - 1).
Vector notation: for a = (a_1, a_2, ..., a_n)^t and b = (b_1, b_2, ..., b_n)^t,

  v * a = (v a_1, v a_2, ..., v a_n)^t   for a scalar v,
  a + b = (a_1 + b_1, a_2 + b_2, ..., a_n + b_n)^t,
  a^t b = sum_{i=1}^{n} a_i b_i,

and a is orthogonal to b (a ⊥ b) when a^t b = 0.
For Y = mu * 1 + eps with eps_j ~ iid N(0, sigma^2), decompose

  Y = mu-hat * 1 + (Y - mu-hat * 1)

and require the residual to be orthogonal to the model term:

  (mu-hat * 1)^t (Y - mu-hat * 1) = 0,   i.e.   1^t (Y - mu-hat * 1) = 0
  1^t Y - mu-hat * (1^t 1) = 0
  mu-hat = (1^t 1)^{-1} 1^t Y = (1/n) sum_j Y_j = Y-bar

  sigma-hat^2 = sum_j (Y_j - Y-bar)^2 / (n - 1)
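The orthogonality condition is easy to confirm numerically: the residual vector Y - Y-bar * 1 always sums to zero, i.e. it is orthogonal to the column of ones (the Y values below are made up for illustration):

```python
# Checking 1^t (Y - mu-hat * 1) = 0 for mu-hat = Y-bar.
Y = [3.0, 5.0, 4.0, 8.0]
n = len(Y)
ybar = sum(Y) / n                 # mu-hat = (1^t 1)^{-1} 1^t Y

residual = [y - ybar for y in Y]
dot_with_ones = sum(residual)     # 1^t (Y - mu-hat * 1)

print(ybar)            # 5.0
print(dot_with_ones)   # 0.0
```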
For simple regression, Y_j = alpha + beta * X_j + eps_j, the fitted vector is

  Y-hat = alpha-hat * 1 + beta-hat * X,

with Y = (Y_1, ..., Y_n)^t, 1 = (1, ..., 1)^t, X = (X_1, X_2, ..., X_n)^t.

For two explanatory variables,

  Y_j = beta0 + beta1 * X_1j + beta2 * X_2j + eps_j,   j = 1, 2, ..., n,
  Y-hat = beta0-hat * 1 + beta1-hat * X_1 + beta2-hat * X_2,

with X_1 = (X_11, X_12, ..., X_1n)^t and X_2 = (X_21, X_22, ..., X_2n)^t.