Lecture 10: Linear regression analysis

Introduction to linear regression analysis. Dr. Tuan V. Nguyen, Garvan Institute of Medical Research, Sydney


Post on 02-Oct-2015


DESCRIPTION

Simple (single-predictor) regression analysis

TRANSCRIPT

  • Introduction to linear regression analysis

    Dr. Tuan V. Nguyen

    Garvan Institute of Medical Research

    Sydney

  • Give a person three weapons (correlation, regression and a pen) and he will use all three (Anon, 1978)

  • Example

    Age and cholesterol concentration measured in 18 people:

    ID   Age   Chol (mg/ml)
     1    46    3.5
     2    20    1.9
     3    52    4.0
     4    30    2.6
     5    57    4.5
     6    25    3.0
     7    28    2.9
     8    36    3.8
     9    22    2.1
    10    43    3.8
    11    57    4.1
    12    33    3.0
    13    22    2.5
    14    63    4.6
    15    40    3.2
    16    48    4.2
    17    28    2.3
    18    49    4.0

  • Entering the data into R

    id
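    A minimal sketch of how the 18 observations in the example table could be entered, assuming the vector names id, age and chol (the regression output later in the lecture uses chol ~ age). The data frame name dat is hypothetical.

```r
# Values transcribed from the example table (18 subjects)
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# 'dat' is a hypothetical name for the assembled data frame
dat <- data.frame(id, age, chol)

# Sanity checks against the summary statistics quoted in the lecture
# (means 38.83 and 3.33, SDs 13.60 and 0.84)
round(colMeans(dat[, c("age", "chol")]), 2)
round(sapply(dat[, c("age", "chol")], sd), 2)
```

    Comparing the means and standard deviations with the values quoted in the lecture is a cheap way to catch transcription errors before any modelling.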

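  • Relationship between age and cholesterol concentration

    [Figure: scatter plot of chol against age]

  • Research questions

    What is the relationship between age and cholesterol concentration?
    How strong is the relationship?
    What cholesterol concentration is predicted at each age?

    These questions call for correlation and regression analysis.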

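  • Variance and covariance: algebra

    Let x and y be two random variables observed in a sample of n subjects.
    The variation in x and in y is measured by the variance:

    $\mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, $\mathrm{var}(y) = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$

    The covariance between x and y appears in the variance of their sum.

    If x and y are independent: var(x + y) = var(x) + var(y)

    In general: var(x + y) = var(x) + var(y) + 2cov(x, y)

    where:

    $\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$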
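    A quick numerical check of the general identity var(x + y) = var(x) + var(y) + 2cov(x, y), sketched with the age and cholesterol values of the lecture's example. R's var() and cov() use the n − 1 divisor.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Covariance from its definition (n - 1 divisor)
n <- length(age)
cov_xy <- sum((age - mean(age)) * (chol - mean(chol))) / (n - 1)

# Left- and right-hand sides of the identity
lhs <- var(age + chol)
rhs <- var(age) + var(chol) + 2 * cov_xy

all.equal(lhs, rhs)          # the two sides agree
all.equal(cov_xy, cov(age, chol))
```

    Because the covariance here is positive, dropping the 2cov(x, y) term would understate the variance of the sum.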

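  • Variance and covariance: geometry

    Independence and dependence between x and y can be represented geometrically.

    Independent (x and y at right angles, h the hypotenuse): $h^2 = x^2 + y^2$

    Dependent (x and y at an angle H, h the side opposite H): $h^2 = x^2 + y^2 - 2xy\cos(H)$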

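  • Meaning of variance and covariance

    Variance is never negative.
    If x and y are independent, their covariance is 0. A covariance of 0 means there is no linear association, which does not by itself guarantee independence.
    Covariance is a sum of cross-products, so it can be negative or positive.
    Negative covariance: x and y deviate from their means in opposite directions.
    Positive covariance: x and y deviate from their means in the same direction.
    Covariance measures the strength of the linear association.
  • Covariance and correlation

    Covariance depends on the units of measurement.
    The correlation coefficient (r) between x and y is a standardised covariance.
    r is defined by:

    $r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y}$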
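  • Positive and negative correlation

    r = 0.9 (positive correlation)

    r = -0.9 (negative correlation)

    [Figure: two scatter plots of y against x, one for each value of r]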

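  • Testing a hypothesis about a correlation

    Hypotheses: H0: ρ = 0 versus H1: ρ ≠ 0.
    Standard error of r: $SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}$
    The t-statistic: $t = \frac{r}{SE(r)} = r\sqrt{\frac{n - 2}{1 - r^2}}$
    This statistic has a t distribution with n − 2 degrees of freedom.
    Fisher's z-transformation: $z = \frac{1}{2}\ln\frac{1 + r}{1 - r}$
    Standard error of z: $SE(z) = \frac{1}{\sqrt{n - 3}}$
    The 95% CI of z can therefore be calculated as: $z \pm 1.96 / \sqrt{n - 3}$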
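  • Illustration of correlation analysis

    ID    Age (x)   Cholesterol (y; mg/100ml)
     1      46        3.5
     2      20        1.9
     3      52        4.0
     4      30        2.6
     5      57        4.5
     6      25        3.0
     7      28        2.9
     8      36        3.8
     9      22        2.1
    10      43        3.8
    11      57        4.1
    12      33        3.0
    13      22        2.5
    14      63        4.6
    15      40        3.2
    16      48        4.2
    17      28        2.3
    18      49        4.0
    Mean    38.83     3.33
    SD      13.60     0.84

    Cov(x, y) = 10.68

    $r = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y} = \frac{10.68}{13.60 \times 0.84} \approx 0.94$

    $z = \frac{1}{2}\ln\frac{1 + 0.94}{1 - 0.94} = 1.74$

    $SE(z) = \frac{1}{\sqrt{n - 3}} = \frac{1}{\sqrt{15}} = 0.26$

    Test statistic = z / SE(z) = 1.74 / 0.26 ≈ 6.7

    Under H0 this ratio is approximately standard normal, so the two-sided 5% critical value is 1.96. The equivalent t-test, $t = r\sqrt{(n - 2)/(1 - r^2)}$, gives t = 10.7 with the unrounded r = 0.937, against a 5% critical t-value of 2.12 on n − 2 = 16 df.

    Conclusion: there is a statistically significant correlation between age and cholesterol concentration.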
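    The correlation, its t-test and the Fisher z interval can be sketched in R as follows. This assumes the 18 observations of the example. cor.test() is the built-in equivalent, and the names r, se_r, t_r, z, se_z, ci_z and ci_r are chosen here for illustration only.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Correlation as a standardised covariance
r <- cov(age, chol) / (sd(age) * sd(chol))

# t-test of H0: rho = 0, with n - 2 degrees of freedom
se_r <- sqrt((1 - r^2) / (n - 2))
t_r  <- r / se_r
p_t  <- 2 * pt(-abs(t_r), df = n - 2)

# Fisher's z-transformation and its 95% confidence interval
z    <- 0.5 * log((1 + r) / (1 - r))
se_z <- 1 / sqrt(n - 3)
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
ci_r <- tanh(ci_z)   # back-transform the interval to the r scale

# Built-in equivalent: same t statistic and same interval for r
ct <- cor.test(age, chol)
ct
```

    Back-transforming with tanh() keeps the interval for r inside (−1, +1), which a symmetric interval built directly on r would not guarantee.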

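  • Simple linear regression analysis

    Purposes:
    Assessment: quantify the relationship between two variables.
    Prediction: build and evaluate a predictive model.
    Control: adjust for confounding factors (in multivariable analysis).

    Simple linear regression handles only two variables: a response variable and a predictor variable.
    It makes no adjustment for confounders or other covariates.
  • Relationship between age and cholesterol concentration

    [Figure: scatter plot of chol against age]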

  • The linear regression model

    Y: a random variable, the response.
    X: a random variable, the predictor or risk factor.
    Both Y and X can be categorical (e.g., yes / no) or continuous (e.g., age).
    If Y is categorical (binary), use logistic regression; if Y is continuous, use simple linear regression.
    Model:

    Y = α + βX + ε

    α : intercept

    β : slope / gradient

    ε : random error (the variation in y between subjects even when x is the same, e.g. the variation in cholesterol within a group of people of the same age)

  • Assumptions of the linear model

    The relationship between X and Y is linear (a straight line) in the parameters;
    X is measured without error;
    The values of Y are independent of each other (e.g., Y1 is not correlated with Y2);
    The random error (ε) is normally distributed with mean 0 and constant variance.
  • Expected value and variance

    If the assumptions hold:
    The expected value of Y is: E(Y | x) = α + βx
    The variance of Y is: var(Y | x) = var(ε) = σ²
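  • Estimating the parameters of the linear regression model

    Given two points A(x1, y1) and B(x2, y2) in a two-dimensional plane, we can write the equation of the straight line joining them.

    Slope: $m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$

    Equation: y = mx + a

    [Figure: straight line through A(x1, y1) and B(x2, y2), with intercept a on the y axis and the rise dy over the run dx]

    So what if we have more than 2 points?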

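  • Estimating α and β

    We have a series of pairs: (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn).
    Let a and b be estimates of the parameters α and β.
    The fitted equation for the sample is: Y* = a + bx
    Aim: find the values of a and b that make the differences (Y − Y*) as small as possible.
    Let SSE = the sum of (Yi − a − bxi)².
    The values of a and b that minimise SSE are called the least squares estimates.
  • Estimation criterion

    [Figure: scatter plot of Chol against Age with the fitted line $\hat{y}_i = a + b x_i$ and the vertical distance $d_i = y_i - \hat{y}_i$ from each observed $y_i$ to the line]

    The aim of least squares estimation is to find the values of a and b that make the sum of the d² as small as possible.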

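  • Estimating α and β

    After some algebra, we obtain:

    $b = \frac{S_{xy}}{S_{xx}}$, $a = \bar{y} - b\bar{x}$

    where:

    $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

    If the regression assumptions hold, the estimators of α and β are:
    unbiased;
    minimum-variance (that is, efficient).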
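    The closed-form least squares estimates can be computed directly, as a sketch using the age and cholesterol values of the lecture's example. The names Sxx, Sxy, a and b mirror the symbols in the formulas, and the comparison with lm() is added here only as a check.

```r
# Age (x) and cholesterol (y) values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Corrected sums of squares and cross-products
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))

# Least squares estimates of the slope and intercept
b <- Sxy / Sxx
a <- mean(chol) - b * mean(age)
c(intercept = a, slope = b)

# Check: lm() minimises the same sum of squared residuals
coef(lm(chol ~ age))
```

    The hand-computed values agree with the Estimate column of the R output reported in the lecture (1.089218 and 0.057788).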
  • Goodness-of-fit

    We now have the equation:

    Y = a + bX + e

    Question: how well does this equation describe the data?
    Answer: the coefficient of determination (R²), the proportion of the variation in Y that is explained by the variation in X.
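  • Partitioning the variation: concept

    SST = the sum of squared differences between each value yi and the mean of y.
    SSR = the sum of squared differences between the predicted values of y and the mean of y.
    SSE = the sum of squared differences between the observed values and the predicted values of y.

    SST = SSR + SSE

    The coefficient of determination is then: R² = SSR / SST

  • Partitioning the variation: geometric illustration

    [Figure: Chol (Y) against Age (X) with the fitted line and the mean of Y, showing SST split into SSR and SSE]

  • Partitioning the variation: algebra

    Some statistics:
    Total variation: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
    Attributed to the model: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
    Residual sum of squares: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
    SST = SSR + SSE
    SSR = SST − SSE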
  • Analysis of variance

    Sums of squares grow with the sample size (n).
    Mean squares (MS) are sums of squares standardised by their degrees of freedom (df):
    MSR = SSR / p (p = regression degrees of freedom, i.e. the number of predictors)
    MSE = SSE / (n − p − 1)
    MST = SST / (n − 1)

    Summary analysis of variance (ANOVA) table:

    Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
    Regression   p           SSR                   MSR                 MSR/MSE
    Residual     n − p − 1   SSE                   MSE
    Total        n − 1       SST
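    The partition of variation and the ANOVA quantities can be reproduced by hand, as a sketch using the age and cholesterol values of the lecture's example with p = 1 predictor. The name f_stat is chosen here for illustration.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg  <- lm(chol ~ age)
yhat <- fitted(reg)
n <- length(chol)
p <- 1                                  # one predictor

# Sums of squares
SST <- sum((chol - mean(chol))^2)       # total
SSR <- sum((yhat - mean(chol))^2)       # explained by the model
SSE <- sum((chol - yhat)^2)             # residual

# Mean squares, F statistic and coefficient of determination
MSR <- SSR / p
MSE <- SSE / (n - p - 1)
f_stat <- MSR / MSE
R2 <- SSR / SST

round(c(SST = SST, SSR = SSR, SSE = SSE, F = f_stat, R2 = R2), 4)
```

    These values match the anova(reg) and summary(reg) output shown in the lecture: SSR 10.4944, SSE 1.4656, F about 114.6 and R² 0.8775.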
  • Hypothesis testing in regression analysis

    We now have:

    Sample: Y = a + bX + e

    Population: Y = α + βX + ε

    H0: β = 0. There is no linear relationship between the outcome and the predictor (risk factor).
    In plain language: if there were truly no relationship (the null hypothesis), what is the probability of obtaining a sample as inconsistent with that hypothesis as the one observed?
  • Inference about the slope (parameter β)

    Recall that ε is assumed to be normally distributed with mean 0 and variance σ².
    σ² is estimated by MSE (or s²).
    It can also be shown that the expected value of b is β, i.e. E(b) = β.
    The standard error of b is: $SE(b) = s / \sqrt{S_{xx}}$
    The test of β = 0 is therefore t = b / SE(b), which follows a t distribution with n − 2 degrees of freedom.
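    A sketch of the slope test computed from first principles, using the age and cholesterol values of the lecture's example. The names s, se_b, t_b and p_b are illustrative. The residual degrees of freedom are n − 2 because two parameters are estimated.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Least squares estimates
Sxx <- sum((age - mean(age))^2)
b   <- sum((age - mean(age)) * (chol - mean(chol))) / Sxx
a   <- mean(chol) - b * mean(age)

# Residual standard deviation s = sqrt(MSE)
SSE <- sum((chol - (a + b * age))^2)
s   <- sqrt(SSE / (n - 2))

# Standard error of the slope, t statistic and two-sided p-value
se_b <- s / sqrt(Sxx)
t_b  <- b / se_b
p_b  <- 2 * pt(-abs(t_b), df = n - 2)

c(s = s, se_b = se_b, t = t_b, p = p_b)
```

    The results correspond to the age row of the lecture's R output: standard error 0.005399, t value 10.704, and a residual standard error of 0.3027 on 16 degrees of freedom.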
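  • Confidence interval around the predicted value

    The observed value is Yi.
    The predicted value is: $\hat{Y}_i = a + b x_i$
    The standard error of the predicted value is: $SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}$
    Interval estimate for the values Yi: $\hat{Y}_i \pm t_{n-p-1,\,1-\alpha/2}\, SE(\hat{Y}_i)$ (here α is the significance level, not the intercept)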
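    A sketch of the interval for a new observation, using the age and cholesterol values of the lecture's example. The age x0 = 40 is a hypothetical value chosen for illustration. Because the standard error includes the leading 1 under the square root, the result is what predict() calls a prediction interval.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
n   <- length(age)
s   <- summary(reg)$sigma                 # residual standard error
Sxx <- sum((age - mean(age))^2)

x0   <- 40                                # hypothetical age of interest
yhat <- unname(coef(reg)[1] + coef(reg)[2] * x0)

# Standard error of a new observation at x0 and the 95% interval
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(age))^2 / Sxx)
t_crit  <- qt(0.975, df = n - 2)
lower   <- yhat - t_crit * se_pred
upper   <- yhat + t_crit * se_pred

# Built-in equivalent
pi_r <- predict(reg, newdata = data.frame(age = x0), interval = "prediction")
pi_r
```

    Passing interval = "confidence" to predict() instead drops the leading 1 and gives the narrower interval for the mean response at x0.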
  • Checking the assumptions

    Constant variance
    Normality
    Correct functional form
    Stable model
    All of these can be examined graphically. The residuals of the model play a central role in every step of model diagnostics.
  • Checking the assumptions

    Constant variance
      Plot the studentized residuals against the predicted values. Check whether the spread of the residuals is roughly constant across the whole range of fitted values.
    Normality
      Plot the residuals against their expected values under normality (a normal probability plot). If the residuals are normally distributed, the points should lie along the 45° line.
    Correct functional form?
      Plot the residuals against the fitted values. Check whether the residuals show a non-linear trend across the range of fitted values.
    Stable model
      Check whether one or more observations are influential, using Cook's distance.
  • Checking the assumptions (cont.)

    Cook's distance (D) measures how much the fitted values of the regression model change when the i-th observation is removed from the data set.
    Leverage measures how extreme the value xi is relative to the remaining x values.
    The studentized residual measures how extreme the value yi is relative to the remaining y values.
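    These three influence measures are available in base R. A minimal sketch for the age and cholesterol model of the lecture's example follows. The data frame name infl is hypothetical, and the manual Cook's distance is included only to show how leverage and residual size combine.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
reg <- lm(chol ~ age)

# Built-in influence measures, one value per observation
infl <- data.frame(
  cook     = cooks.distance(reg),  # change in fitted values if case i is dropped
  leverage = hatvalues(reg),       # how extreme x_i is
  rstud    = rstudent(reg)         # externally studentized residual
)

# Observations with the largest Cook's distance
head(infl[order(-infl$cook), ], 3)

# Cook's distance by hand: D_i = e_i^2 * h_i / (k * s^2 * (1 - h_i)^2),
# where k = 2 is the number of estimated coefficients
e  <- residuals(reg)
h  <- hatvalues(reg)
s2 <- summary(reg)$sigma^2
cook_manual <- e^2 * h / (2 * s2 * (1 - h)^2)
```

    An observation needs both a sizeable residual and high leverage to produce a large Cook's distance, which is why the two are usually inspected together.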
  • Remedial measures

    Non-constant variance
      Transforming the response (y) to another scale (e.g. the logarithm) is often helpful.
      If transformation does not resolve the non-constant variance, use a more robust estimator, such as iterative weighted least squares.
    Non-normality
      Non-normality and non-constant variance often go together.
    Outliers
      Check that the data are correct.
      Use an alternative (robust) estimation method.
  • Regression analysis using R

    id
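    A minimal sketch of the model fit, assuming the vectors age and chol hold the 18 observations of the lecture's example. The call lm(chol ~ age) and the object name reg both appear in the output reported in the lecture.

```r
# Age and cholesterol values from the lecture's example
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Simple linear regression of cholesterol on age
reg <- lm(chol ~ age)

# Coefficients, standard errors, R-squared and the overall F-test
fit_summary <- summary(reg)
fit_summary

# ANOVA table for the same model
anova(reg)
```

    summary() reports the coefficient tests and R², while anova() gives the sums of squares behind the F-statistic.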

  • Regression analysis

    summary(reg)

    Call:
    lm(formula = chol ~ age)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.40729 -0.24133 -0.04522  0.17939  0.63040

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1.089218   0.221466   4.918 0.000154 ***
    age         0.057788   0.005399  10.704 1.06e-08 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3027 on 16 degrees of freedom
    Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
    F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

  • ANOVA

    anova(reg)

    Analysis of Variance Table

    Response: chol
              Df  Sum Sq Mean Sq F value    Pr(>F)
    age        1 10.4944 10.4944  114.57 1.058e-08 ***
    Residuals 16  1.4656  0.0916
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • Diagnostics: influential observations

    p

  • A non-linear illustration: BMI and sexual attractiveness

    A study of 44 university students
    Body mass index (BMI) was measured
    Each student was given a sexual attractiveness (SA) score

    id

  • Linear regression of SA on BMI

    reg

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
    bmi         -0.05967    0.02862  -2.084   0.0432 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.354 on 42 degrees of freedom
    Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
    F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323

  • BMI and SA: residual analysis

    plot(reg)

  • BMI and SA: scatter plot

    reg

  • Re-analysing these data

    # Fit 3 regression models

    linear
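    The BMI and SA measurements are not reproduced in this transcript, so the sketch below uses simulated data with a curved relationship purely to illustrate fitting and comparing three polynomial models. The simulated values, the seed and the object names quad and cubic are assumptions, not the lecture's data or code.

```r
# Simulated stand-in for the 44 students (NOT the lecture's data)
set.seed(1)
bmi <- runif(44, min = 12, max = 34)
sa  <- 3.5 - 0.02 * (bmi - 22)^2 + rnorm(44, sd = 0.8)

# Fit 3 regression models of increasing flexibility
linear <- lm(sa ~ bmi)
quad   <- lm(sa ~ poly(bmi, 2))   # adds a squared term
cubic  <- lm(sa ~ poly(bmi, 3))   # adds a cubic term

# Sequential F-tests: does each extra term reduce the residual SS enough?
anova(linear, quad, cubic)

# Residual sums of squares and R-squared for each model
rss <- c(linear = deviance(linear), quad = deviance(quad), cubic = deviance(cubic))
r2  <- c(linear = summary(linear)$r.squared,
         quad   = summary(quad)$r.squared,
         cubic  = summary(cubic)$r.squared)
rbind(rss, r2)
```

    The residual sum of squares can only fall as terms are added, so the F-tests (or adjusted R²) are what indicate whether the extra curvature is worth keeping.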

  • Some remarks: interpreting correlation

    The correlation coefficient lies between −1 and +1.
    A very small correlation coefficient does not mean that there is no relationship between the two variables; the relationship may be non-linear.
    For curved (monotonic) relationships, a rank correlation is preferable to Pearson's correlation.
    A low correlation coefficient (e.g. 0.1) can be statistically significant yet clinically unimportant.
    R² measures the strength of the relationship: r = 0.7 looks impressive, but R² is only 0.49!
    Correlation does not imply causation.
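    The point about curved relationships can be illustrated with a small sketch. The exponential data below are hypothetical, chosen so that y increases with x along a strongly curved path.

```r
# Hypothetical monotonic but strongly curved relationship
x <- 1:20
y <- exp(x / 3)

# Pearson's r measures linear association only
pearson <- cor(x, y)

# Spearman's rank correlation depends only on the ordering of the values
spearman <- cor(x, y, method = "spearman")

c(pearson = pearson, spearman = spearman)

# r = 0.7 corresponds to only 49% of the variance explained
0.7^2
```

    Because y always increases with x, the rank correlation is exactly 1 while Pearson's r falls short of 1.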
  • Some remarks: interpreting correlation

    Be careful with multiple correlations. With p variables there are p(p − 1)/2 pairwise correlations, and false positives (spurious correlations) become a problem.
    Correlations cannot be inferred from one another (they are not transitive).
    r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is close to zero. In fact r(age, fat) = 0.79.
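    A rough sketch of why many pairwise correlations invite false positives. The choice of p = 10 variables is hypothetical, and the calculation assumes the tests are independent, which pairwise correlations among the same variables are not, so the figure is only indicative.

```r
# Hypothetical number of variables
p <- 10

# Number of distinct pairwise correlations: p(p - 1)/2
m <- choose(p, 2)

# Chance of at least one false positive at the 5% level,
# under the simplifying assumption of m independent tests
alpha <- 0.05
fwer  <- 1 - (1 - alpha)^m

# Bonferroni-adjusted per-test significance level
alpha_bonf <- alpha / m

c(pairs = m, fwer = fwer, alpha_bonf = alpha_bonf)
```

    With 10 variables there are already 45 pairs, so at least one nominally significant correlation is very likely even when no real associations exist.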
  • Some remarks: interpreting correlation

    The fitted (regression) line is only an estimate of the relationship between the variables in the population.
    The estimated parameters carry uncertainty.
    The regression line should not be used to estimate values for x outside the observed range (extrapolation).
    A statistical model is an approximation; the true relationship may be non-linear, but a linear relationship is often a reasonable approximation.
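  • Some remarks: reporting results

    The results of a correlation or regression analysis should be described in full: the nature of the response variable (outcome) and of the predictors (risk factors); any transformation used; the checks of the assumptions; and so on.
    The regression coefficients (a, b), with their standard errors, and R² should also be reported.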
  • Some final remarks

    Equations are the cornerstone on which scientific ideas take hold and flourish.
    Equations can be as beautiful as poems, but they can also be a torment. So be vigilant and careful when building equations!
  • Acknowledgement

    We sincerely thank Bridge Healthcare Pharmaceutical Company, Australia, for sponsoring the trip.

  • Formulas and figures from the slides

    Variance and covariance:
    $\mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    $\mathrm{var}(y) = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$
    $\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

    Correlation:
    $r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y}$
    $SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}$
    $t = r\sqrt{\frac{n - 2}{1 - r^2}}$
    $z = \frac{1}{2}\ln\frac{1 + r}{1 - r}$
    $SE(z) = \frac{1}{\sqrt{n - 3}}$
    95% CI of z: $z \pm 1.96 / \sqrt{n - 3}$
    Worked example: $r = \frac{10.68}{13.60 \times 0.84} \approx 0.94$, $SE(z) = \frac{1}{\sqrt{15}} = 0.26$

    Straight line through two points:
    $m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$

    Least squares estimates:
    $b = \frac{S_{xy}}{S_{xx}}$
    $a = \bar{y} - b\bar{x}$
    $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
    $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
    $\hat{y}_i = a + b x_i$
    $d_i = y_i - \hat{y}_i$

    Sums of squares:
    $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
    $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
    $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    Inference and prediction:
    $SE(b) = s / \sqrt{S_{xx}}$
    $SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}$
    $\hat{Y}_i \pm t_{n-p-1,\,1-\alpha/2}\, SE(\hat{Y}_i)$

    [Figure: scatter plot of chol against age]
    [Figure: scatter plot of sa against bmi]
    [Figure: two scatter plots of y against x illustrating positive and negative correlation]
    [Figure: diagnostic panels for the chol ~ age model: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage with Cook's distance]
    [Figure: the same four diagnostic panels for the regression of sa on bmi]