Regression

DESCRIPTION

Lecture slides on regression: Ohm's law (Georg Simon Ohm, 1789-1854) as a least-squares example, gas-mileage efficiency, simple linear regression after Francis Galton (1822-1911), ANOVA, Fisher's Iris data, data transformations, no-intercept models, multiple regression, and factor (dummy) variables.
Regression
Georg Simon Ohm, 1789-1854
Ohm's law:  V = R * I

Suppose we measure three (current, voltage) pairs:

  j   I_j   V_j
  1    2     6
  2    3     8
  3    4    13

The ratios V_j / I_j are 6/2 = 3, 8/3 = 2.67, 13/4 = 3.25 -- so which R?

Least squares: choose R to minimize the sum of squared differences
between observed (O_j = V_j) and expected (E_j = R * I_j) values,

  sum_j (O_j - E_j)^2 = (6 - 2R)^2 + (8 - 3R)^2 + (13 - 4R)^2
                      = (2^2 + 3^2 + 4^2) R^2 - 2 (6*2 + 8*3 + 13*4) R + C
                      = 29 R^2 - 176 R + C

The minimizer is

  R-hat = (6*2 + 8*3 + 13*4) / (2^2 + 3^2 + 4^2) = 88/29 = 3.03

  j   I_j   V_j (Obs.)   R-hat * I_j (Expec.)
  1    2      6                 6.07
  2    3      8                 9.10
  3    4     13                12.14
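The least-squares arithmetic in the Ohm example is easy to verify numerically. A quick cross-check (Python is used here purely for the arithmetic; the data are the three (I, V) pairs from the slide):

```python
# Least-squares estimate of R in V = R*I for the three measurements
# (I, V) = (2, 6), (3, 8), (4, 13) from the slide.
I = [2, 3, 4]
V = [6, 8, 13]

r_hat = sum(i * v for i, v in zip(I, V)) / sum(i * i for i in I)
print(round(r_hat, 2))  # 3.03  (= 88/29)
```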
Gas-mileage efficiency: the same idea. With G_j the gas consumed and D_j the
distance driven on trip j, the model is D_j = e * G_j. Given a table of
(G_j, D_j) pairs, which e? Estimate it by least squares, exactly as above.
Francis Galton (1822-1911)

How can you get Jack's beanstalk?

simple (linear) regression:

  Y_j = alpha + beta * X_j + eps_j,   j = 1, 2, ..., n,   eps_j ~ iid N(0, sigma^2)

  Y : dependent variable
  X : independent variable
  beta : slope of the line
  alpha + beta * X_j : genetic character (model)
  eps_j : noise -- acquired character (residual)

Equivalently,

  Y_j ~ N(alpha + beta * X_j, sigma^2),   j = 1, 2, ..., n,   independent,

i.e.  Y_j ~ N(mu_j, sigma^2)  with  mu_j = alpha + beta * X_j.

How do we estimate alpha, beta, sigma^2?
Least squares: choose (alpha-hat, beta-hat) to minimize

  sum_j (Y_j - alpha - beta * X_j)^2.

Question: What happens if all X_j's are 1?

eps_j is the (unobserved) noise; Y_j - alpha-hat - beta-hat * X_j is the
(estimated) residual.

[Scatterplot of Y versus X with fitted simple (linear) regression line]
ANOVA table

  Source             SS    df    MS                F
  Regression         SSR   1     MSR = SSR/1       MSR/MSE
  Error (residual)   SSE   n-2   MSE = SSE/(n-2)
  Total              SST   n-1

  SST = SSR + SSE

  R^2 = SSR / SST : coefficient of determination
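The decomposition SST = SSR + SSE can be checked numerically on a small made-up dataset (the X and Y values below are illustrative, not from the slides):

```python
# Verifying SST = SSR + SSE and R^2 = SSR/SST for a least-squares fit.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

beta = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
       sum((x - xbar) ** 2 for x in X)
alpha = ybar - beta * xbar
fitted = [alpha + beta * x for x in X]

sst = sum((y - ybar) ** 2 for y in Y)
ssr = sum((f - ybar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(Y, fitted))

print(round(sst, 6), round(ssr, 6), round(sse, 6))  # 8.8 6.4 2.4
print(round(ssr / sst, 4))                          # R^2 = 0.7273
```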
Fisher's Iris data: setosa, versicolor, virginica

      S.Length  S.Width  P.Length  P.Width  Species
  1      5.1      3.5      1.4       0.2    setosa
  2      4.9      3.0      1.4       0.2    setosa
  ................
  49     5.3      3.7      1.5       0.2    setosa
  50     5.0      3.3      1.4       0.2    setosa
  51     7.0      3.2      4.7       1.4    versicolor
  52     6.4      3.2      4.5       1.5    versicolor
  ................
  99     6.2      2.9      4.3       1.3    versicolor
  100    5.7      2.8      4.1       1.3    versicolor
  101    6.3      3.3      6.0       2.5    virginica
  ................
  150    5.9      3.0      5.1       1.8    virginica
[Pairs plot of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]

[Scatterplot of sepal.length versus sepal.width for setosa]
For setosa of Iris data
For setosa of Iris data

> setosa <- iris[1:50, 1:4]
> head(setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
....
6          5.4         3.9          1.7         0.4
> names(setosa) <- c("sl","sw","pl","pw")
> head(setosa)
   sl  sw  pl  pw
1 5.1 3.5 1.4 0.2
....
6 5.4 3.9 1.7 0.4
> plot(setosa$sw, setosa$sl, pch=16,
+      xlab="sepal.width", ylab="sepal.length")
> ( rout <- lm(sl ~ sw, data=setosa) )

Call:
lm(formula = sl ~ sw, data = setosa)

Coefficients:
(Intercept)           sw
     2.6390       0.6905

> abline(rout, col="red")
For setosa of Iris data

> summary(rout)

Call:
lm(formula = sl ~ sw, data = setosa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52476 -0.16286  0.02166  0.13833  0.44428

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
sw            0.6905     0.0899   7.681 6.71e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2385 on 48 degrees of freedom
Multiple R-squared: 0.5514,  Adjusted R-squared: 0.542
F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10
For setosa of Iris data

> anova(rout)
Analysis of Variance Table

Response: sl
          Df Sum Sq Mean Sq F value   Pr(>F)
sw         1 3.3569  3.3569  58.994 6.71e-10 ***
Residuals 48 2.7313  0.0569
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
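The ANOVA table ties back to the R-squared that summary() reports: R^2 = SSR / (SSR + SSE). A quick check with the two sums of squares from the setosa ANOVA output:

```python
# R^2 from the ANOVA decomposition: SSR is the sw row's Sum Sq,
# SSE is the Residuals row's Sum Sq (values from anova(rout) above).
ssr = 3.3569
sse = 2.7313
r_squared = ssr / (ssr + sse)
print(round(r_squared, 4))  # 0.5514, matching Multiple R-squared in summary(rout)
```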
Non-linear Regression

  Y_j = f(X_j) + eps_j

With V_j = f(X_j), this is Y_j = V_j + eps_j.

  e.g.)  Y_j = alpha * (1 - exp(-beta * X_j)) + eps_j   -- non-linear regression
  e.g.)  Y_j = alpha + beta * log(X_j) + eps_j          -- linear after data transformation

Linear Regression / Linear after data transformation

J. Kepler, 1571-1630; Tycho Brahe, 1546-1601

Kepler's third law: P_j = k * a_j^(3/2), with orbital period P and semi-major
axis a. Taking logs, with Y_j = log P_j and X_j = log a_j,

  Y_j = alpha + beta * X_j + eps_j,  with slope beta = 3/2.
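The 3/2 slope can be checked numerically. The semi-major axes (AU) and orbital periods (years) below are standard approximate values for six planets, not data from the slides:

```python
import math

# Log-log least squares: log P = alpha + beta * log a should give beta near 3/2.
a = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537]    # semi-major axis, AU
P = [0.241, 0.615, 1.000, 1.881, 11.862, 29.447]  # orbital period, years

x = [math.log(v) for v in a]
y = [math.log(v) for v in P]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
       sum((xi - xbar) ** 2 for xi in x)
print(round(beta, 3))  # close to 1.5
```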
Forbes' data (James D. Forbes, 1857): boiling point of water and air pressure
at 17 locations.

  ID   Boiling Tem. (F)   Air Press. (inches Hg)
   1       194.5               20.79
   2       194.3               20.79
   3       197.9               22.40
   4       198.4               22.67
   5       199.4               23.15
   6       199.9               23.35
   7       200.9               23.89
   8       201.1               23.99
   9       201.4               24.02
  10       201.3               24.01
  11       203.6               25.14
  12       204.6               26.57
  13       209.5               28.49
  14       208.6               27.76
  15       210.7               29.04
  16       211.9               29.88
  17       212.2               30.06
[Scatterplot of log(Air Pressure) versus Boiling Temperature (BT), with fitted line]

  log(Air Pressure) = beta0 + beta1 * (BT) + eps

  beta0-hat = -0.9709,   beta1-hat = 0.0206
> install.packages("forward")
> library(forward)
> forbes
> lm(log(pres) ~ bp, data=forbes)

Call:
lm(formula = log(pres) ~ bp, data = forbes)

Coefficients:
(Intercept)           bp
   -0.97087      0.02062
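The coefficients R reports can be reproduced by hand from the table (a Python cross-check of the same least-squares fit, using the 17 rows above; R's log is the natural log):

```python
import math

# Refitting log(pressure) = b0 + b1 * bp by ordinary least squares.
bp = [194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
      201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2]
pres = [20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
        24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06]

y = [math.log(p) for p in pres]
n = len(bp)
xbar, ybar = sum(bp) / n, sum(y) / n

b1 = sum((x - xbar) * (yi - ybar) for x, yi in zip(bp, y)) / \
     sum((x - xbar) ** 2 for x in bp)
b0 = ybar - b1 * xbar
print(round(b0, 5), round(b1, 5))  # approximately -0.97087 and 0.02062
```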
Intercept model (equivalent calls):

> lm(y ~ 1 + x)
> lm(y ~ x)

No-intercept model example:

> lm(y ~ 0 + x)
> lm(y ~ -1 + x)
[Scatterplot of depression versus weight for the lawn roller data, with the
intercept and no-intercept fitted lines]
> install.packages("DAAG")
> install.packages("randomForest")
> library(DAAG)
> roller
> rout0 <- lm(depression ~ -1 + weight, data=roller)
> rout1 <- lm(depression ~ weight, data=roller)
> with(roller, plot(weight, depression, pch=16, xlim=c(-1,14)))
> abline(rout1)
> abline(rout0, col="red")
> points(0, 0, pch=16, col="red")
Lawn roller data
> anova(rout0)
Analysis of Variance Table
Response: depression
          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 2637.32  2637.3  63.862 2.233e-05 ***
Residuals  9  371.68    41.3

> anova(rout1)
Analysis of Variance Table
Response: depression
          Df Sum Sq Mean Sq F value   Pr(>F)
weight     1 657.97  657.97  14.503 0.005175 **
Residuals  8 362.93   45.37

Df of residuals for the intercept model is n-2.
No-intercept model example: Lawn roller data

> roller
   weight depression
1     1.9          2
2     3.1          1
3     3.3          5
4     4.8          5
5     5.3         20
6     6.1         20
7     6.4         23
8     7.6         10
9     9.8         30
10   12.4         25
The SSE of the no-intercept model is always at least the SSE of the intercept model. In many cases the intercept model therefore also has the smaller MSE, but for this lawn roller data the MSE of the no-intercept model (41.3) is smaller than that of the intercept model (45.37). This alludes to the suitability of the no-intercept model here.
Df of residuals for no-intercept model is n-1. (n=10)
> summary(rout0)

Call:
lm(formula = depression ~ -1 + weight, data = roller)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
weight   2.3919     0.2993   7.991 2.23e-05 ***

Residual standard error: 6.426 on 9 degrees of freedom
Multiple R-squared: 0.8765,  Adjusted R-squared: 0.8628
F-statistic: 63.86 on 1 and 9 DF,  p-value: 2.233e-05

> summary(rout1)

Call:
lm(formula = depression ~ weight, data = roller)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0871     4.7543  -0.439  0.67227
weight        2.6667     0.7002   3.808  0.00518 **

Residual standard error: 6.735 on 8 degrees of freedom
Multiple R-squared: 0.6445,  Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF,  p-value: 0.005175
No-intercept model example: Lawn roller data

The p-value of the no-intercept model is smaller than that of the intercept model for this lawn roller data.
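For a regression through the origin the slope has the closed form b-hat = sum(x*y) / sum(x^2), so R's numbers for the roller data can be reproduced directly (a Python cross-check using the data table above):

```python
# Reproducing the no-intercept fit lm(depression ~ -1 + weight) by hand,
# with the lawn roller data copied from the slides.
weight     = [1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4]
depression = [2, 1, 5, 5, 20, 20, 23, 10, 30, 25]

sxy = sum(x * y for x, y in zip(weight, depression))
sxx = sum(x * x for x in weight)
b_hat = sxy / sxx                      # least-squares slope through the origin
sse = sum((y - b_hat * x) ** 2 for x, y in zip(weight, depression))

print(round(b_hat, 4))  # 2.3919, matching summary(rout0)
print(round(sse, 2))    # 371.68, matching anova(rout0)
```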
multiple (linear) regression

  Y_j = beta0 + beta1 * X_1j + beta2 * X_2j + eps_j,   j = 1, 2, ..., n,
  eps_j ~ iid N(0, sigma^2)

In vector form,

  [ Y_1 ]   [ 1 ]           [ X_11 ]           [ X_21 ]           [ eps_1 ]
  [ Y_2 ] = [ 1 ] beta0  +  [ X_12 ] beta1  +  [ X_22 ] beta2  +  [ eps_2 ]
  [  .  ]   [ . ]           [   .  ]           [   .  ]           [   .   ]
  [ Y_n ]   [ 1 ]           [ X_1n ]           [ X_2n ]           [ eps_n ]

i.e.  Y = beta0 * 1 + beta1 * X_1 + beta2 * X_2 + eps.
For versicolor of Iris data
> vscolor <- iris[51:100, 1:4]
> names(vscolor) <- c("sl","sw","pl","pw")
> rout <- lm(sl ~ sw + pl, data=vscolor)
> summary(rout)
....
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.1164     0.4943   4.282 9.06e-05 ***
sw            0.2476     0.1868   1.325    0.191
pl            0.7356     0.1248   5.896 3.87e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3399 on 47 degrees of freedom
Multiple R-squared: 0.5841,  Adjusted R-squared: 0.5664
F-statistic: 33.01 on 2 and 47 DF,  p-value: 1.11e-09
For versicolor of Iris data
> summary( lm(sl ~ sw + pl + pw, data=vscolor) )

Call:
lm(formula = sl ~ sw + pl + pw, data = vscolor)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7248 -0.2406 -0.0321  0.2958  0.5594

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.8955     0.5071   3.738 0.000511 ***
sw            0.3869     0.2045   1.891 0.064890 .
pl            0.9083     0.1654   5.491 1.67e-06 ***
pw           -0.6792     0.4354  -1.560 0.125599
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3348 on 46 degrees of freedom
Multiple R-squared: 0.605,  Adjusted R-squared: 0.5793
F-statistic: 23.49 on 3 and 46 DF,  p-value: 2.28e-09
The same slope, but different intercepts
> irs <- iris
> names(irs) <- c("sl","sw","pl","pw","sp")
> setosa <- irs[1:50,]; vscolor <- irs[51:100,]
> vginica <- irs[101:150,]

> with(irs, plot(sw, sl, xlab="sepal.width",
+                ylab="sepal.length"))
> with(setosa,  points(sw, sl, pch=16, col="red"))
> with(vscolor, points(sw, sl, pch=16, col="blue"))
> with(vginica, points(sw, sl, pch=16, col="sienna2"))

> rout1 <- lm(sl ~ sw, data=setosa)
> rout2 <- lm(sl ~ sw, data=vscolor)
> rout3 <- lm(sl ~ sw, data=vginica)

> cfx <- coef(lm(sl ~ sw + sp, data=irs))
> cf1 <- c( cfx[1],          cfx[2] )
> cf2 <- c( cfx[1] + cfx[3], cfx[2] )
> cf3 <- c( cfx[1] + cfx[4], cfx[2] )

> lines(c(2.5,4.3), c(sum(cf1*c(1,2.5)), sum(cf1*c(1,4.3))), lwd=4, col="pink")
> lines(c(2.0,3.5), c(sum(cf2*c(1,2.0)), sum(cf2*c(1,3.5))), lwd=4, col="skyblue")
> lines(c(2.2,3.9), c(sum(cf3*c(1,2.2)), sum(cf3*c(1,3.9))), lwd=4, col="tan1")
[Scatterplot of sepal.length versus sepal.width for the three species, with
three parallel fitted lines]
Model with a factor variable (Iris species):

  Y_kj = alpha_k + beta * X_kj + eps_kj,
  j = 1, 2, ..., 50 (n = 50),   k = 1, 2, 3 (K = 3),
  eps_kj ~ iid N(0, sigma^2)

  Y_1j = alpha_1 + beta * X_1j + eps_1j   for setosa
  Y_2j = alpha_2 + beta * X_2j + eps_2j   for versicolor
  Y_3j = alpha_3 + beta * X_3j + eps_3j   for virginica

The same slope beta, but different intercepts alpha_1, alpha_2, alpha_3.

Stacking the three groups, the design matrix has three indicator columns and
the X column:

        [ 1  0  0  X_11 ]
        [ .  .  .   .   ]
        [ 1  0  0  X_1n ]
        [ 0  1  0  X_21 ]                [ alpha_1 ]
  X  =  [ .  .  .   .   ] ,   theta  =   [ alpha_2 ] ,   Y = X theta + eps,
        [ 0  1  0  X_2n ]                [ alpha_3 ]
        [ 0  0  1  X_31 ]                [ beta    ]
        [ .  .  .   .   ]
        [ 0  0  1  X_3n ]

  eps_kj ~ iid N(0, sigma^2).

The first three columns are dummy variables; together they encode the factor
(categorical) variable Species with levels setosa, versicolor, virginica.
> is.factor(irs$sp)
[1] TRUE
> is.factor(irs$sw)
[1] FALSE
> rout <- lm(sl ~ sw + sp, data=irs)
> coef(rout)
(Intercept)          sw spversicolor  spvirginica
  2.2513932   0.8035609    1.4587431    1.9468166
> summary(rout)

Call:
lm(formula = sl ~ sw + sp, data = irs)

Residuals:
     Min       1Q   Median       3Q      Max
-1.30711 -0.25713 -0.05325  0.19542  1.41253

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.2514     0.3698   6.089 9.57e-09 ***
sw             0.8036     0.1063   7.557 4.19e-12 ***
spversicolor   1.4587     0.1121  13.012  < 2e-16 ***
spvirginica    1.9468     0.1000  19.465  < 2e-16 ***

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared: 0.7259,  Adjusted R-squared: 0.7203
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16
> anova(rout)
Analysis of Variance Table

Response: sl
           Df Sum Sq Mean Sq  F value  Pr(>F)
sw          1  1.412   1.412   7.3628 0.00746 **
sp          2 72.752  36.376 189.6512 < 2e-16 ***
Residuals 146 28.004   0.192

sp is a factor variable for 3 groups (3 species, K = 3), so its df = K - 1 = 2.
> coef( lm(sl ~ sp*sw, data=irs) )
    (Intercept)    spversicolor     spvirginica              sw
      2.6390012       0.9007335       1.2678352       0.6904897
spversicolor:sw  spvirginica:sw
      0.1745880       0.2110448

> ( csetosa <- coef( lm(sl ~ sw, data=setosa) ) )
(Intercept)          sw
  2.6390012   0.6904897

> coef( lm(sl ~ sw, data=vscolor) ) - csetosa
(Intercept)          sw
  0.9007335   0.1745880

> coef( lm(sl ~ sw, data=vginica) ) - csetosa
(Intercept)          sw
  1.2678352   0.2110448
Regression for many datasets in one command, using a factor variable
Thank you !!
Minimizing a quadratic: for f(x) = a x^2 + b x + c with a > 0,

  f'(x-hat) = 2 a x-hat + b = 0   =>   x-hat = -b / (2a).
Least squares: minimize  f(alpha, beta) = sum_j (Y_j - alpha - beta * X_j)^2.
Setting the partial derivatives to zero,

  d f / d alpha = -2 sum_j (Y_j - alpha - beta * X_j) = 0
  d f / d beta  = -2 sum_j X_j (Y_j - alpha - beta * X_j) = 0

gives the normal equations

  n * alpha-hat            + (sum_j X_j)   * beta-hat = sum_j Y_j
  (sum_j X_j) * alpha-hat  + (sum_j X_j^2) * beta-hat = sum_j X_j Y_j
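The 2x2 system of normal equations can be solved directly, e.g. by Cramer's rule. A small numerical sketch (the X and Y values below are made up for illustration):

```python
# Solving the normal equations
#   n*a + sx*b = sy ;  sx*a + sxx*b = sxy
# directly for a tiny dataset, via Cramer's rule.
X = [1, 2, 3, 4]
Y = [2, 3, 5, 6]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))

det = n * sxx - sx * sx
alpha_hat = (sy * sxx - sx * sxy) / det
beta_hat = (n * sxy - sx * sy) / det

print(alpha_hat, beta_hat)  # 0.5 1.4
```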
In matrix form, the simple regression model Y_j = alpha + beta * X_j + eps_j,
j = 1, 2, ..., n, is

  [ Y_1 ]   [ 1 ]           [ X_1 ]          [ eps_1 ]
  [ Y_2 ] = [ 1 ] alpha  +  [ X_2 ] beta  +  [ eps_2 ]
  [  .  ]   [ . ]           [  .  ]          [   .   ]
  [ Y_n ]   [ 1 ]           [ X_n ]          [ eps_n ]

i.e.  Y = alpha * 1 + beta * X + eps.
Y_j = alpha + beta * X_j + eps_j, j = 1, 2, ..., n -- which line (alpha, beta)?
Two candidates, 150 * 1 + 0.1 * X and 100 * 1 + 0.5 * X:

    Y     1    X    150*1 + 0.1*X   100*1 + 0.5*X
   160    1   176       167.6            188
   173    1   168       166.8            184
   185    1   174       167.4            187
   ...   ...  ...        ...             ...
   189    1   182       168.2            191
Write Y-hat = alpha-hat * 1 + beta-hat * X for the vector of fitted values,
i.e. Y-hat_j = alpha-hat + beta-hat * X_j. The least-squares estimates minimize

  sum_j (Y_j - Y-hat_j)^2 = sum_j (Y_j - alpha - beta * X_j)^2.

For the model Y_j = alpha + beta * X_j + eps_j, eps_j ~ iid N(0, sigma^2),
how do we estimate sigma^2?

  SSE = sum_j (Y_j - alpha-hat - beta-hat * X_j)^2 ~ sigma^2 * chi^2(n-2),

so  MSE = SSE / (n - 2)  is an unbiased estimator of sigma^2.
The simplest model:  Y_j = mu + eps_j,  j = 1, 2, ..., n,
i.e.  Y_j ~ iid N(mu, sigma^2);  in vector form  Y = mu * 1 + eps,
with Y = (Y_1, Y_2, ..., Y_n)^t and 1 = (1, 1, ..., 1)^t.

  mu-hat = Y-bar,   sigma-hat^2 = sum_j (Y_j - Y-bar)^2 / (n - 1).
Vector notation: for a = (a_1, a_2, ..., a_n)^t and b = (b_1, b_2, ..., b_n)^t,

  v * a = (v a_1, v a_2, ..., v a_n)^t   for a scalar v,
  a + b = (a_1 + b_1, a_2 + b_2, ..., a_n + b_n)^t,
  a^t b = sum_{i=1}^{n} a_i b_i,

and a is orthogonal to b (a ⊥ b) when a^t b = 0.
For Y = mu * 1 + eps with eps_j ~ iid N(0, sigma^2), decompose

  Y = mu-hat * 1 + (Y - mu-hat * 1)

and require the residual to be orthogonal to the model term:

  (mu-hat * 1)^t (Y - mu-hat * 1) = 0,   i.e.   1^t (Y - mu-hat * 1) = 0
  1^t Y - mu-hat * (1^t 1) = 0
  mu-hat = (1^t 1)^{-1} 1^t Y = (1/n) sum_j Y_j = Y-bar

  sigma-hat^2 = sum_j (Y_j - Y-bar)^2 / (n - 1)
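The orthogonality condition is easy to confirm numerically: the residual vector Y - Y-bar * 1 always sums to zero, i.e. it is orthogonal to the column of ones (the Y values below are made up for illustration):

```python
# Checking 1^t (Y - mu-hat * 1) = 0 for mu-hat = Y-bar.
Y = [3.0, 5.0, 4.0, 8.0]
n = len(Y)
ybar = sum(Y) / n                 # mu-hat = (1^t 1)^{-1} 1^t Y

residual = [y - ybar for y in Y]
dot_with_ones = sum(residual)     # 1^t (Y - mu-hat * 1)

print(ybar)            # 5.0
print(dot_with_ones)   # 0.0
```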
For simple regression, Y_j = alpha + beta * X_j + eps_j, the fitted vector is

  Y-hat = alpha-hat * 1 + beta-hat * X,

with Y = (Y_1, ..., Y_n)^t, 1 = (1, ..., 1)^t, X = (X_1, X_2, ..., X_n)^t.

For two explanatory variables,

  Y_j = beta0 + beta1 * X_1j + beta2 * X_2j + eps_j,   j = 1, 2, ..., n,
  Y-hat = beta0-hat * 1 + beta1-hat * X_1 + beta2-hat * X_2,

with X_1 = (X_11, X_12, ..., X_1n)^t and X_2 = (X_21, X_22, ..., X_2n)^t.