Lecture 10: Linear regression analysis
DESCRIPTION
Simple (single-predictor) linear regression analysis
TRANSCRIPT
-
Introduction to linear regression analysis
Dr. Tuan V. Nguyen
Garvan Institute of Medical Research
Sydney
-
Give a person three weapons (correlation, regression and a pen) and he will use all three (Anon, 1978)
-
Example
Age and cholesterol levels measured in 18 people:
ID   Age   Chol (mg/ml)
 1   46    3.5
 2   20    1.9
 3   52    4.0
 4   30    2.6
 5   57    4.5
 6   25    3.0
 7   28    2.9
 8   36    3.8
 9   22    2.1
10   43    3.8
11   57    4.1
12   33    3.0
13   22    2.5
14   63    4.6
15   40    3.2
16   48    4.2
17   28    2.3
18   49    4.0
-
Entering the data into R
id
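The R code on this slide is cut off after `id`. A minimal sketch of how the 18 rows of the example table could be entered, assuming the vector names `id`, `age` and `chol` (`age` and `chol` match the names in the R output later in the lecture) and a hypothetical data frame name `dat`:

```r
# Values transcribed from the 18-row example table
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# 'dat' is a hypothetical name; the slide's own object names are not recoverable
dat <- data.frame(id, age, chol)

# Scatter plot of cholesterol against age
plot(chol ~ age, data = dat, pch = 16, xlab = "age", ylab = "chol")
```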
-
Relationship between age and cholesterol level
-
Research questions
Is there a relationship between age and cholesterol level?
How strong is the relationship?
Can cholesterol level be predicted for a given age?
These questions call for correlation and regression analysis.
-
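Variance and covariance: algebra
Let x and y be two random variables observed in a sample of n subjects.
Variance measures the variability of x and of y; covariance measures how x and y vary together.
If x and y are independent: var(x + y) = var(x) + var(y)
In general: var(x + y) = var(x) + var(y) + 2cov(x, y)
where:
$$ \mathrm{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 $$
$$ \mathrm{var}(y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 $$
$$ \mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) $$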
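The identity for var(x + y) can be checked numerically. A small sketch using the age and cholesterol values from the example table (the names `lhs` and `rhs` are only illustrative):

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# Left-hand side: variance of the sum
lhs <- var(age + chol)

# Right-hand side: sum of the variances plus twice the covariance
rhs <- var(age) + var(chol) + 2 * cov(age, chol)

all.equal(lhs, rhs)   # the two sides agree up to floating-point error
```

Dropping the covariance term would understate var(age + chol) here, because the covariance between age and cholesterol is positive.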
-
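Variance and covariance: geometry
Independence and dependence between x and y can be represented geometrically, with x and y as two sides of a triangle and h as the third side.
Independent (x and y at a right angle): $h^2 = x^2 + y^2$
Dependent (angle H between x and y): $h^2 = x^2 + y^2 - 2xy\cos(H)$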
-
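Meaning of variance and covariance
Variance is never negative.
If x and y are independent, their covariance is 0. A covariance of 0 means there is no linear association, which is weaker than independence.
Covariance is a sum of cross-products, so it can be negative or positive.
Negative covariance: x and y deviate from their means in opposite directions.
Positive covariance: x and y deviate from their means in the same direction.
Covariance measures the strength of the association.
-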
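Covariance and correlation
Covariance depends on the units of measurement.
The correlation coefficient (r) between x and y is a standardized covariance.
r is defined as:
$$ r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y} $$
-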
Positive and negative correlation
r = 0.9
r = -0.9
-
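Testing a hypothesis about the correlation
Hypotheses: H0: $\rho = 0$ versus H1: $\rho \neq 0$, where $\rho$ is the population correlation.
Standard error of r:
$$ SE(r) = \sqrt{\frac{1 - r^2}{n - 2}} $$
The t-statistic:
$$ t = \frac{r}{SE(r)} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} $$
This statistic follows a t distribution with n - 2 degrees of freedom.
Fisher's z-transformation:
$$ z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) $$
Standard error of z:
$$ SE(z) = \frac{1}{\sqrt{n - 3}} $$
The 95% CI of z is therefore:
$$ z \pm \frac{1.96}{\sqrt{n - 3}} $$
-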
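Illustration of correlation analysis
ID     Age (x)   Cholesterol (y; mg/100ml)
 1     46        3.5
 2     20        1.9
 3     52        4.0
 4     30        2.6
 5     57        4.5
 6     25        3.0
 7     28        2.9
 8     36        3.8
 9     22        2.1
10     43        3.8
11     57        4.1
12     33        3.0
13     22        2.5
14     63        4.6
15     40        3.2
16     48        4.2
17     28        2.3
18     49        4.0
Mean   38.83     3.33
SD     13.60     0.84
Cov(x, y) = 10.68
$$ r = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y} = \frac{10.68}{13.60 \times 0.84} = 0.94 $$
$$ z = \frac{1}{2}\ln\left(\frac{1 + 0.94}{1 - 0.94}\right) = 1.74 $$
$$ SE(z) = \frac{1}{\sqrt{n - 3}} = \frac{1}{\sqrt{15}} = 0.26 $$
Test statistic = z / SE(z) = 1.74 / 0.26 = 6.7
The two-sided critical value at alpha = 5% is 1.96. (The equivalent t-test of r uses n - 2 = 16 df, with critical value 2.12.)
Conclusion: there is a statistically significant correlation between age and cholesterol level.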
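The calculation above can be reproduced in R. This is a sketch rather than the lecture's own code. The variable names are illustrative, and `cor.test()` is used only as a cross-check:

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
n <- length(age)

# r as a standardized covariance (same value as cor(age, chol))
r <- cov(age, chol) / (sd(age) * sd(chol))

# t-test of H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Fisher's z, its standard error, and a 95% CI back-transformed to the r scale
z    <- 0.5 * log((1 + r) / (1 - r))
se_z <- 1 / sqrt(n - 3)
ci_r <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

# Cross-check against R's built-in Pearson correlation test
ct <- cor.test(age, chol)
```

Because `r` is not rounded to 0.94 here, `z` comes out slightly below the hand calculation on the slide.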
-
Simple linear regression analysis
Assessment: quantify the relationship between two variables.
Prediction: build and evaluate a predictive model.
Control: adjust for confounding factors (in multivariable analysis).
Simple linear regression examines only two variables: one response variable and one predictor variable.
It makes no adjustment for confounders or other covariates.
-
Relationship between age and cholesterol level
-
The linear regression model
Y: a random variable, the response.
X: a random variable, the predictor or risk factor.
Both Y and X can be categorical (e.g. yes / no) or continuous (e.g. age).
If Y is categorical, use a logistic regression model; if Y is continuous, use a simple linear regression model.
Model:
$$ Y = \alpha + \beta X + \varepsilon $$
$\alpha$: intercept
$\beta$: slope / gradient
$\varepsilon$: random error (the variation in y between subjects when x is held fixed, e.g. the variation in cholesterol within a group of people of the same age)
-
Assumptions of the linear model
The relationship between X and Y is linear (a straight line).
X is measured without error.
The Y values are independent of each other (e.g. Y1 is not correlated with Y2).
The random error ($\varepsilon$) is normally distributed with mean 0 and constant variance.
-
Expected value and variance
If the assumptions hold:
The expected value of Y is: $E(Y \mid x) = \alpha + \beta x$
The variance of Y is: $\mathrm{var}(Y \mid x) = \mathrm{var}(\varepsilon) = \sigma^2$
-
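Estimating the parameters of the linear regression model
Given two points A(x1, y1) and B(x2, y2) in a two-dimensional plane, there is exactly one straight line joining them.
Gradient:
$$ m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1} $$
Equation: y = mx + a, where a is the intercept on the y axis.
[Figure: the line through A(x1, y1) and B(x2, y2), showing the intercept a and the increments dx and dy]
What if we have more than two points?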
-
Estimating a and b
We have a series of pairs: (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn).
Let a and b be the estimates of the parameters $\alpha$ and $\beta$.
The sample equation is: Y* = a + bx
Aim: find the values of a and b that make the differences (Y - Y*) as small as possible.
Let SSE = the sum of $(Y_i - a - b x_i)^2$.
The values of a and b that minimize SSE are called the least squares estimates.
-
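Estimation criterion
[Figure: scatter plot of Chol against Age with the fitted line; $d_i$ is the vertical distance between the observed $y_i$ and the line]
$$ \hat{y}_i = a + b x_i, \qquad d_i = y_i - \hat{y}_i $$
The least squares estimates are the values of a and b for which the sum of the $d_i^2$ is smallest.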
-
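Estimating a and b
After some calculation, we obtain:
$$ b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x} $$
where:
$$ S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) $$
If the regression assumptions hold, the estimates of $\alpha$ and $\beta$ are:
Unbiased
Minimum variance (that is, efficient)
-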
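The closed-form least squares estimates can be computed by hand and compared with `lm()`. A minimal sketch on the age and cholesterol data (names `Sxx`, `Sxy`, `a`, `b` follow the slide notation; `fit` is an illustrative name):

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# Corrected sums of squares and cross-products
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))

# Least squares slope and intercept
b <- Sxy / Sxx
a <- mean(chol) - b * mean(age)

# The same estimates from R's linear model function
fit <- lm(chol ~ age)
coef(fit)
```

The hand-computed `a` and `b` should match the `(Intercept)` and `age` rows of the R output shown later in the lecture.
-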
Goodness-of-fit
We now have the equation: Y = a + bX + e
Question: how well does this equation describe the data?
Answer: the coefficient of determination ($R^2$), the proportion of the variation in Y that is explained by the variation in X.
-
Partitioning the variation: concept
SST = sum of squared differences between each value $y_i$ and the mean of y.
SSR = sum of squared differences between the predicted values of y and the mean of y.
SSE = sum of squared differences between the observed and the predicted values of y.
SST = SSR + SSE
The coefficient of determination is then: $R^2 = SSR / SST$
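A short sketch of this decomposition on the age and cholesterol data, assuming the fitted values come from `lm(chol ~ age)` (the object names are illustrative):

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

fit   <- lm(chol ~ age)
y_hat <- fitted(fit)                  # predicted cholesterol for each subject

SST <- sum((chol - mean(chol))^2)     # total variation
SSR <- sum((y_hat - mean(chol))^2)    # variation explained by the model
SSE <- sum((chol - y_hat)^2)          # residual variation

R2 <- SSR / SST                       # coefficient of determination
c(SST = SST, SSR = SSR, SSE = SSE, R2 = R2)
```

`SSR` and `SSE` correspond to the "Sum Sq" column of the ANOVA table later in the lecture, and `R2` to "Multiple R-Squared".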
-
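Partitioning the variation: geometric illustration
[Figure: Chol (Y) against Age (X) with the fitted line and the mean of Y, marking SST, SSR and SSE for one observation]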
-
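Partitioning the variation: algebra
Some statistics:
Total variation:
$$ SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 $$
Attributed to the model:
$$ SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 $$
Residual sum of squares:
$$ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
SST = SSR + SSE
SSR = SST - SSE
-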
Analysis of variance
Sums of squares (SS) increase with the sample size (n).
Mean squares (MS) are sums of squares standardized by their degrees of freedom (df):
MSR = SSR / p (p = number of predictors, i.e. the regression degrees of freedom)
MSE = SSE / (n - p - 1)
MST = SST / (n - 1)
Summary table of the analysis of variance (ANOVA):
Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
Regression   p           SSR                   MSR                 MSR/MSE
Residual     n - p - 1   SSE                   MSE
Total        n - 1       SST
-
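The ANOVA quantities can be built step by step for the age and cholesterol model (p = 1 predictor). A sketch with illustrative names, cross-checked against `anova()`:

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
n <- length(chol)
p <- 1                                   # one predictor: age

fit <- lm(chol ~ age)
SSR <- sum((fitted(fit) - mean(chol))^2)
SSE <- sum(residuals(fit)^2)

MSR    <- SSR / p                        # regression mean square
MSE    <- SSE / (n - p - 1)              # residual mean square
F_stat <- MSR / MSE
p_val  <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# R's own ANOVA table for comparison
tab <- anova(fit)
tab
```
-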
Hypothesis testing in regression analysis
We now have:
Sample data: Y = a + bX + e
Population: $Y = \alpha + \beta X + \varepsilon$
H0: $\beta = 0$. There is no linear relationship between the outcome and the predictor (risk factor).
In plain language: if there were truly no relationship (the null hypothesis), what is the probability of observing a sample at least as inconsistent with that hypothesis as the one obtained?
-
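Inference about the slope (parameter $\beta$)
Recall that $\varepsilon$ is assumed to be normally distributed with mean 0 and variance $\sigma^2$.
$\sigma^2$ is estimated by MSE (or $s^2$).
It can also be shown that the expected value of b is $\beta$, i.e. $E(b) = \beta$.
The standard error of b is:
$$ SE(b) = \frac{s}{\sqrt{S_{xx}}} $$
The test of whether $\beta = 0$ is then t = b / SE(b), which follows a t distribution with n - 2 degrees of freedom.
-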
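A sketch of the slope test computed from its ingredients on the age and cholesterol data (names such as `se_b` and `t_b` are illustrative):

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
n <- length(age)

fit <- lm(chol ~ age)
b   <- unname(coef(fit)["age"])

s   <- sqrt(sum(residuals(fit)^2) / (n - 2))   # residual standard error, sqrt(MSE)
Sxx <- sum((age - mean(age))^2)

se_b <- s / sqrt(Sxx)                          # standard error of the slope
t_b  <- b / se_b                               # t statistic for H0: beta = 0
p_b  <- 2 * pt(-abs(t_b), df = n - 2)          # two-sided p-value, n - 2 df
```

These values correspond to the `age` row and the "Residual standard error" line of the `summary(reg)` output later in the lecture.
-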
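Confidence interval around a predicted value
The observed value is $Y_i$.
The predicted value is:
$$ \hat{Y}_i = a + b x_i $$
The standard error of the predicted value is:
$$ SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}} $$
Interval estimate for the values $Y_i$:
$$ \hat{Y}_i \pm SE(\hat{Y}_i)\, t_{n-p-1,\,1-\alpha/2} $$
-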
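A sketch of this interval for a single new subject. The age of 50 is a hypothetical value chosen for illustration, and the result is compared with `predict()` using `interval = "prediction"`:

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
n <- length(age)
new_age <- 50                                    # hypothetical age, for illustration

fit <- lm(chol ~ age)
s   <- sqrt(sum(residuals(fit)^2) / (n - 2))     # residual standard error
Sxx <- sum((age - mean(age))^2)

# Predicted value and its standard error (formula from the slide)
y_hat   <- unname(coef(fit)[1] + coef(fit)[2] * new_age)
se_pred <- s * sqrt(1 + 1 / n + (new_age - mean(age))^2 / Sxx)

# 95% interval with n - p - 1 = n - 2 degrees of freedom
manual <- y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_pred

# Same interval from R
pr <- predict(fit, newdata = data.frame(age = new_age),
              interval = "prediction", level = 0.95)
```

The leading 1 under the square root makes this an interval for an individual observation. Without it, the formula gives the narrower interval for the mean response (`interval = "confidence"` in `predict()`).
-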
Checking the assumptions
Constant variance
Normal distribution
Correct model form
Stable model
All of these can be checked graphically. The residuals of the model play a central role in every step of model diagnostics.
-
Checking the assumptions
Constant variance
Plot the studentized residuals against the predicted values. Check whether the spread of the residuals is roughly constant across the range of fitted values.
Normal distribution
Plot the residuals against their expected values under normality, also known as a normal probability plot. If the residuals follow a normal distribution, the points should lie on the 45° diagonal line.
Correct model form?
Plot the residuals against the fitted values. Check whether the residuals show a non-linear trend across the range of fitted values.
Stable model
Check whether one or more observations are influential. Use Cook's distance.
-
Checking the assumptions (continued)
Cook's distance (D) measures how much the fitted values of the regression model change when the i-th observation is removed from the data set.
Leverage measures how extreme the value $x_i$ is relative to the remaining x values.
The studentized residual measures how extreme the value $y_i$ is relative to the remaining y values.
-
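Base R provides all three measures. A sketch for the age and cholesterol model (the data frame name `infl` is illustrative):

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
fit <- lm(chol ~ age)

infl <- data.frame(
  age, chol,
  leverage = hatvalues(fit),       # how extreme x_i is relative to the other x values
  rstudent = rstudent(fit),        # studentized residual: how extreme y_i is
  cook     = cooks.distance(fit)   # change in the fit if observation i is removed
)

# The three observations with the largest Cook's distance
infl[order(-infl$cook), ][1:3, ]
```

The leverages always sum to the number of fitted coefficients (2 here), which is a quick sanity check on the output.
-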
Remedial measures
Non-constant variance
Transforming the response (y) to another scale (e.g. logarithm) is often useful.
If the transformation does not resolve the non-constant variance, use a more robust estimator, such as iterative weighted least squares.
Non-normality
Non-normality and non-constant variance often go hand in hand.
Outliers
Check whether the data are correct.
Use a robust estimation method.
-
Regression analysis using R
id
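The code on this slide is cut off after `id`. A sketch consistent with the output on the following slides, which shows the call `lm(formula = chol ~ age)` and the object name `reg`:

```r
# Data from the example table
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# Fit the simple linear regression of cholesterol on age
reg <- lm(chol ~ age)

# Coefficient table, residual standard error, R-squared and overall F-test
summary(reg)

# Analysis of variance table
anova(reg)
```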
-
Regression analysis
summary(reg)
Call:
lm(formula = chol ~ age)
Residuals:
Min 1Q Median 3Q Max
-0.40729 -0.24133 -0.04522 0.17939 0.63040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218 0.221466 4.918 0.000154 ***
age 0.057788 0.005399 10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
-
ANOVA
anova(reg)
Analysis of Variance Table
Response: chol
Df Sum Sq Mean Sq F value Pr(>F)
age 1 10.4944 10.4944 114.57 1.058e-08 ***
Residuals 16 1.4656 0.0916
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-
Diagnostics: influence of the data
p
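The code on this slide is cut off after `p`. One plausible way to obtain the four standard diagnostic panels (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) is sketched below. The 2 x 2 layout is an assumption, not something recoverable from the transcript:

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
reg <- lm(chol ~ age)

# Assumed layout: four panels on one page; the previous settings are restored afterwards
op <- par(mfrow = c(2, 2))
plot(reg)    # default diagnostic plots for an lm object
par(op)
```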
-
A non-linear illustration: BMI and sexual attractiveness
A study of 44 university students
Body mass index (BMI) was measured
Sexual attractiveness (SA) was scored
id
-
Linear regression analysis of BMI and SA
reg
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.92512 0.64489 7.637 1.81e-09 ***
bmi -0.05967 0.02862 -2.084 0.0432 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323
-
BMI and SA: residual analysis
plot(reg)
-
BMI and SA: scatter plot
reg
-
Re-analysing these data
# Fit 3 regression models
linear
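The code on this slide is cut off after `linear`, and the 44 BMI and SA measurements are not included in the transcript. The sketch below therefore runs on simulated placeholder values, not the study data. It also assumes that the three models are polynomials of degree 1, 2 and 3, and the names `quadratic` and `cubic` are hypothetical:

```r
# SIMULATED placeholder data, NOT the study data (which are not in this transcript)
set.seed(123)
bmi <- runif(44, min = 12, max = 34)
sa  <- 4.5 - 0.03 * (bmi - 20)^2 + rnorm(44, sd = 1)

# Fit 3 regression models (assumed here: linear, quadratic, cubic in bmi)
linear    <- lm(sa ~ bmi)
quadratic <- lm(sa ~ poly(bmi, 2))
cubic     <- lm(sa ~ poly(bmi, 3))

# Sequential F-tests: does each extra polynomial term improve the fit?
anova(linear, quadratic, cubic)

# R-squared cannot decrease as terms are added to a nested model
r2 <- sapply(list(linear = linear, quadratic = quadratic, cubic = cubic),
             function(m) summary(m)$r.squared)
r2
```

Because R-squared never falls when terms are added, the F-tests (or adjusted R-squared) are the more useful guide to whether the curvature is real.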
-
Some remarks: interpreting correlation
The correlation coefficient lies between -1 and +1.
A very small correlation coefficient does not mean that there is no relationship between the two variables; the relationship may be non-linear.
For curved (monotonic) relationships, a rank correlation is better than Pearson's correlation.
A low correlation coefficient (e.g. 0.1) can be statistically significant without being clinically meaningful.
$R^2$ is a measure of the strength of the relationship. r = 0.7 looks attractive, but $R^2$ is only 0.49!
Correlation does not imply causation.
-
Some remarks: interpreting correlation
Beware of multiple correlations. With p variables there are p(p - 1)/2 pairwise correlations, which raises the problem of false positives (spurious correlations).
A correlation cannot be inferred from other correlations. r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not mean that r(age, fat) is near zero. In fact, r(age, fat) = 0.79.
-
Some remarks: interpreting correlation
The regression line is only an estimate of the relationship between these variables in the population.
There is uncertainty associated with the estimated parameters.
The regression line should not be used to predict at x values outside the observed range (extrapolation).
A statistical model is an approximation. The true relationship may be non-linear, but a linear relationship is often a reasonably good approximation.
-
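Some remarks: reporting results
The results of a correlation or regression analysis should be described in full: the nature of the response variable (outcome) and of the predictors (risk factors), any transformation used, the checks of the assumptions, and so on.
The regression coefficients (a, b), their standard errors, and $R^2$ should also be reported.
-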
Some final remarks
Equations are the milestones on which scientific ideas take hold and flourish.
Equations can be as beautiful as poems, but they can even be like onions. So be very vigilant and careful when building an equation!
-
Acknowledgements
We sincerely thank Bridge Healthcare Pharmaceuticals, Australia, for sponsoring this trip.
-
Formulas shown on the slides
Variance and covariance:
$$ \mathrm{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \mathrm{var}(y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 $$
$$ \mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) $$
Correlation coefficient:
$$ r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y} $$
Testing the correlation:
$$ SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}, \qquad t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} $$
$$ z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right), \qquad SE(z) = \frac{1}{\sqrt{n - 3}}, \qquad \text{95\% CI: } z \pm \frac{1.96}{\sqrt{n - 3}} $$
Worked example (age and cholesterol):
$$ r = \frac{10.68}{13.60 \times 0.84} = 0.94, \qquad z = \frac{1}{2}\ln\left(\frac{1 + 0.94}{1 - 0.94}\right) = 1.74, \qquad SE(z) = \frac{1}{\sqrt{15}} = 0.26 $$
Gradient of the line through two points:
$$ m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1} $$
Least squares estimates:
$$ \hat{y}_i = a + b x_i, \qquad d_i = y_i - \hat{y}_i $$
$$ b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x} $$
$$ S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) $$
Sums of squares:
$$ SST = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
Standard error of the slope:
$$ SE(b) = \frac{s}{\sqrt{S_{xx}}} $$
Predicted value and its interval:
$$ \hat{Y}_i = a + b x_i, \qquad SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}, \qquad \hat{Y}_i \pm SE(\hat{Y}_i)\, t_{n-p-1,\,1-\alpha/2} $$
-
Figures shown on the slides
[Figure: scatter plot of chol against age]
[Figure: scatter plot of sa against bmi]
[Figure: two x-y scatter plots illustrating negative and positive correlation]
[Figure: diagnostic plots for the chol ~ age model: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance); labelled observations 6, 8 and 17, and 2, 6 and 8 in the leverage panel]
[Figure: the same four diagnostic panels for the BMI and SA model; labelled observations 10, 20 and 21, and 1, 3 and 10 in the leverage panel]