
RNDr. Marie Forbelská, Ph.D.

M7222 – Exercise 2: GLM02a

(Recovery as a function of illness severity and of the hospital visited)

First we load the input data with read.csv2() and inspect their structure with str().

> fileDat <- paste(data.library, "InfectionSeverity.csv", sep = "")

> data <- read.csv2(fileDat, header = TRUE, sep = ";", dec = ".")

> str(data)

'data.frame': 49 obs. of 3 variables:

$ Infection_Severity: num 9.3 18.2 22.7 32.9 38 39.9 44 44.9 46.8 47.7 ...

$ Treatment_Outcome : int 0 0 0 0 0 0 0 0 0 0 ...

$ Hospital : int 3 2 1 3 1 2 1 2 1 2 ...

We convert the categorical variables to factors with factor().

> data$Treatment_Outcome <- factor(data$Treatment_Outcome, labels = c("survived",

"died"))

> data$Hospital <- factor(data$Hospital, labels = c("A", "B", "C"))

We plot the data:

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) - 1, type = "p",

pch = unclass(data$Hospital), col = unclass(data$Hospital), cex = 1.5,

main = "Recovery as a Function of Illness Severity")

Figure 1: The data plotted with plot(). [Scatter plot titled "Recovery as a Function of Illness Severity"; x-axis: data$Infection_Severity, y-axis: unclass(data$Treatment_Outcome) - 1.]


M7222 Generalized Linear Models (GLM02a)

Because this plot is hard to read, we first categorize the variable Infection_Severity into 10 subintervals. We then count the patients who survived or died in each subinterval and, from these counts, compute the corresponding relative frequencies of deaths.

> breaks_f_sev <- seq(0, 170, length.out = 11)

> f_sev <- cut(data$Infection_Severity, breaks = breaks_f_sev)

> (TabSurvDied <- table(f_sev, data$Treatment_Outcome))

f_sev survived died

(0,17] 1 0

(17,34] 3 0

(34,51] 9 1

(51,68] 3 2

(68,85] 6 6

(85,102] 4 4

(102,119] 0 6

(119,136] 0 1

(136,153] 0 1

(153,170] 0 2

> RelativDied <- TabSurvDied[, 2]/rowSums(TabSurvDied)

> (tab2 <- cbind(TabSurvDied, RelativDied))

survived died RelativDied

(0,17] 1 0 0.0

(17,34] 3 0 0.0

(34,51] 9 1 0.1

(51,68] 3 2 0.4

(68,85] 6 6 0.5

(85,102] 4 4 0.5

(102,119] 0 6 1.0

(119,136] 0 1 1.0

(136,153] 0 1 1.0

(153,170] 0 2 1.0
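The same row-wise relative frequencies can also be obtained with prop.table(). The following is a minimal, self-contained sketch: since the CSV file is not available here, the counts are re-entered by hand from the table printed above, and the object names counts and rel_died are ours, not part of the original script.

```r
# Counts copied from the printed table above (rows = severity intervals).
counts <- matrix(
  c(1, 0,  3, 0,  9, 1,  3, 2,  6, 6,
    4, 4,  0, 6,  0, 1,  0, 1,  0, 2),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("(0,17]", "(17,34]", "(34,51]", "(51,68]", "(68,85]",
      "(85,102]", "(102,119]", "(119,136]", "(136,153]", "(153,170]"),
    c("survived", "died")
  )
)

# margin = 1 divides every row by its row total.
rel_died <- prop.table(counts, margin = 1)[, "died"]
print(round(rel_died, 2))
```

With the original objects, prop.table(TabSurvDied, 1)[, "died"] should give the same column as RelativDied.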

We now modify the previous plot by adding the relative frequencies of deaths in the individual categories:

> delta2 <- 0.5 * diff(breaks_f_sev)[1]

> N <- length(breaks_f_sev)

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "p", pch = 3, xlim = c(-10, 180), col = unclass(data$Hospital),

main = "Recovery as a Function of Illness Severity")

> points(breaks_f_sev[1:(N - 1)] + delta2, tab2[, 3],

pch = 21, col = "darkred", cex = 2, bg = "red")

> lines(breaks_f_sev, c(0, tab2[, 3]), type = "S")


Figure 2: Relative frequencies of deaths. [Plot titled "Recovery as a Function of Illness Severity"; x-axis: data$Infection_Severity, y-axis: unclass(data$Treatment_Outcome) - 1.]

Model 1 - binary regression model with a single continuous covariate Infection_Severity

For g we choose one of the link functions, which gives

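$$\eta(x) = g_1(\pi(x)) = \Phi^{-1}(\pi(x)) \qquad \text{probit model} \qquad (1)$$

$$\eta(x) = g_2(\pi(x)) = \log\left(\frac{\pi(x)}{1-\pi(x)}\right) \qquad \text{logistic model} \qquad (2)$$

$$\eta(x) = g_3(\pi(x)) = \log\left[-\log(1-\pi(x))\right] \qquad \text{complementary log-log model} \qquad (3)$$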
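The three link functions and their inverses can be written down directly in R. This is only an illustrative sketch; the function names (probit, inv_probit and so on) are ours, and glm() itself uses the equivalent definitions available through make.link().

```r
# The three link functions g ...
probit  <- function(p) qnorm(p)
logit   <- function(p) log(p / (1 - p))
cloglog <- function(p) log(-log(1 - p))

# ... and their inverses, which map the linear predictor eta back to (0, 1).
inv_probit  <- function(eta) pnorm(eta)
inv_logit   <- function(eta) 1 / (1 + exp(-eta))
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

p <- c(0.1, 0.5, 0.9)

# One column per link; probit and logit are zero at p = 0.5, cloglog is not.
links <- cbind(p, probit = probit(p), logit = logit(p), cloglog = cloglog(p))
print(round(links, 4))
```

The table shows that the probit and logit links are symmetric about p = 0.5, whereas the complementary log-log link is not.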

First we choose the canonical link function, which gives the logistic regression model:

> m1.logit <- glm(Treatment_Outcome ~ Infection_Severity,

family = binomial(logit), data = data)

> summary(m1.logit)

Call:

glm(formula = Treatment_Outcome ~ Infection_Severity, family = binomial(logit),

data = data)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.7891 -0.6459 -0.2365 0.7533 1.9474


Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.64050 1.38335 -3.355 0.000795 ***

Infection_Severity 0.05921 0.01758 3.368 0.000756 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 67.745 on 48 degrees of freedom

Residual deviance: 45.994 on 47 degrees of freedom

AIC: 49.994

Number of Fisher Scoring iterations: 5

We see that the covariate Infection_Severity is statistically significant in this model.
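To read the estimates on a more familiar scale, one can convert the slope to an odds ratio and evaluate the fitted probability of death at a few severities. The sketch below copies the two estimates from the summary output above; the severity values 50, 100 and 150 are chosen arbitrarily for illustration, and the variable names are ours.

```r
# Estimates copied from the summary(m1.logit) output above.
b0 <- -4.64050
b1 <-  0.05921

# Multiplicative change in the odds of death per 1 and per 10 severity units.
odds_ratio_1  <- exp(b1)
odds_ratio_10 <- exp(10 * b1)

# Fitted probability of death at a few arbitrarily chosen severities.
severity <- c(50, 100, 150)
p_hat <- plogis(b0 + b1 * severity)

print(round(c(OR_1 = odds_ratio_1, OR_10 = odds_ratio_10), 3))
print(round(setNames(p_hat, severity), 3))
```

With the fitted object available, exp(coef(m1.logit)) and predict(m1.logit, type = "response") would be the usual way to obtain these quantities.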

We plot the resulting logistic curve together with the asymptotic confidence intervals:

> predicted.logit <- predict(m1.logit, type = "link",

newdata = data, se = T)

> data$CI.lower.logit <- plogis(predicted.logit$fit -

1.96 * predicted.logit$se.fit)

> data$fitted.logit <- plogis(predicted.logit$fit)

> data$CI.higher.logit <- plogis(predicted.logit$fit +

1.96 * predicted.logit$se.fit)

> x <- c(data$Infection_Severity, rev(data$Infection_Severity))

> y <- c(data$CI.lower.logit, rev(data$CI.higher.logit))

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "n", pch = 3, ylab = "Outcome", xlab = "Severity",

main = "Logisticka regrese")

> polygon(x, y, col = "gray85", border = "gray85")

> points(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, pch = 3, ylab = "Outcome", xlab = "Severity",

col = "gray35")

> lines(data$Infection_Severity, data$fitted.logit, col = "red",

lwd = 2)

> points(data$Infection_Severity, data$fitted.logit, col = "red")
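The interval above is built on the scale of the linear predictor and only then transformed with plogis(). The following sketch suggests why: it uses simulated data as a stand-in for the real file (the data frame sim, its columns and the simulation parameters are all assumptions of ours) and compares this construction with a naive interval built directly on the probability scale.

```r
set.seed(42)

# Simulated stand-in for the real data (the CSV file is not available here).
sim <- data.frame(severity = sort(runif(49, 0, 170)))
sim$died <- rbinom(49, 1, plogis(-4.6 + 0.06 * sim$severity))

fit <- glm(died ~ severity, family = binomial(logit), data = sim)
z <- qnorm(0.975)  # approximately 1.96

# (a) Interval on the link scale, then back-transformed: always inside (0, 1).
pl <- predict(fit, type = "link", se.fit = TRUE)
lower_link <- plogis(pl$fit - z * pl$se.fit)
upper_link <- plogis(pl$fit + z * pl$se.fit)

# (b) Naive interval on the response scale: may fall outside [0, 1].
pr <- predict(fit, type = "response", se.fit = TRUE)
lower_resp <- pr$fit - z * pr$se.fit
upper_resp <- pr$fit + z * pr$se.fit

print(range(lower_link, upper_link))
print(range(lower_resp, upper_resp))
```

Because plogis() is increasing and maps into (0, 1), construction (a) can never produce a bound below 0 or above 1, which is not guaranteed for (b).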


Figure 3: Logistic regression with confidence intervals. [Plot titled "Logistická regrese"; x-axis: Severity, y-axis: Outcome.]

Let us see how the probit regression model with the single covariate Infection_Severity performs; that is, instead of the canonical link function we choose a different one, namely the probit link function.

> m1.probit <- glm(Treatment_Outcome ~ Infection_Severity,

family = binomial(probit), data = data)

> summary(m1.probit)

Call:

glm(formula = Treatment_Outcome ~ Infection_Severity, family = binomial(probit),

data = data)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.7885 -0.6427 -0.1742 0.7577 1.9636

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.824141 0.763012 -3.701 0.000214 ***

Infection_Severity 0.036010 0.009699 3.713 0.000205 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 67.745 on 48 degrees of freedom

Residual deviance: 45.597 on 47 degrees of freedom

AIC: 49.597

Number of Fisher Scoring iterations: 6


In this model, too, the covariate Infection_Severity is statistically significant.

Again we plot the resulting probit curve together with the asymptotic confidence intervals:

> predicted.probit <- predict(m1.probit, type = "link",

newdata = data, se = T)

> data$CI.lower.probit <- pnorm(predicted.probit$fit -

1.96 * predicted.probit$se.fit)

> data$fitted.probit <- pnorm(predicted.probit$fit)

> data$CI.higher.probit <- pnorm(predicted.probit$fit +

1.96 * predicted.probit$se.fit)

> x <- c(data$Infection_Severity, rev(data$Infection_Severity))

> y <- c(data$CI.lower.probit, rev(data$CI.higher.probit))

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "n", pch = 3, ylab = "Outcome", xlab = "Severity",

main = "Probitova regrese")

> polygon(x, y, col = "gray85", border = "gray85")

> points(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, pch = 3, ylab = "Outcome", xlab = "Severity",

col = "gray35")

> lines(data$Infection_Severity, data$fitted.probit, col = "darkgreen",

lwd = 2)

> points(data$Infection_Severity, data$fitted.probit,

col = "darkgreen")

Figure 4: Probit regression with confidence intervals. [Plot titled "Probitová regrese"; x-axis: Severity, y-axis: Outcome.]

Finally we consider the last option, the complementary log-log link function. The GLM for the binary variable is again built with the single covariate Infection_Severity.

> m1.cloglog <- glm(Treatment_Outcome ~ Infection_Severity,

family = binomial(cloglog), data = data)

> summary(m1.cloglog)


Call:

glm(formula = Treatment_Outcome ~ Infection_Severity, family = binomial(cloglog),

data = data)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.7952 -0.6639 -0.3177 0.7782 1.8936

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.74248 1.00860 -3.711 0.000207 ***

Infection_Severity 0.04153 0.01159 3.582 0.000341 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 67.745 on 48 degrees of freedom

Residual deviance: 45.964 on 47 degrees of freedom

AIC: 49.964

Number of Fisher Scoring iterations: 6

In this model, too, the covariate Infection_Severity is statistically significant.

As in the previous cases, we plot the resulting curve together with the asymptotic confidence intervals:

> Icloglog <- function(x) return(1 - exp(-exp(x)))

> predicted.cloglog <- predict(m1.cloglog, type = "link",

newdata = data, se = T)

> data$CI.lower.cloglog <- Icloglog(predicted.cloglog$fit -

1.96 * predicted.cloglog$se.fit)

> data$fitted.cloglog <- Icloglog(predicted.cloglog$fit)

> data$CI.higher.cloglog <- Icloglog(predicted.cloglog$fit +

1.96 * predicted.cloglog$se.fit)

> x <- c(data$Infection_Severity, rev(data$Infection_Severity))

> y <- c(data$CI.lower.cloglog, rev(data$CI.higher.cloglog))

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "n", pch = 3, ylab = "Outcome", xlab = "Severity",

main = "Binarnı regrese cloglog linkovaci funkci")

> polygon(x, y, col = "gray85", border = "gray85")

> points(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, pch = 3, ylab = "Outcome", xlab = "Severity",

col = "gray35")

> lines(data$Infection_Severity, data$fitted.cloglog,

col = "dodgerblue", lwd = 2)

> points(data$Infection_Severity, data$fitted.cloglog,

col = "dodgerblue")


Figure 5: Binary regression with the complementary log-log link function, with confidence intervals. [Plot titled "Binární regrese cloglog linkovaci funkci"; x-axis: Severity, y-axis: Outcome.]

We now draw all three curves in a single plot. To obtain a smoother picture we do not reuse the previous estimates at the 49 data points but evaluate the models on a finer grid of x values.

> xx <- seq(0, 170, length.out = 200)

> yy.logit <- predict(m1.logit, list(Infection_Severity = xx),

type = "response")

> yy.probit <- predict(m1.probit, list(Infection_Severity = xx),

type = "response")

> yy.cloglog <- predict(m1.cloglog, list(Infection_Severity = xx),

type = "response")

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, pch = 3, ylab = "Outcome", xlab = "Severity",

main = "Logisticka, probitova a cloglog regrese")

> lines(xx, yy.logit, col = "red", lwd = 2)

> lines(xx, yy.probit, col = "darkgreen", lwd = 2)

> lines(xx, yy.cloglog, col = "dodgerblue", lwd = 2)

> legend(100, 0.5, bty = "n", col = c("red", "darkgreen",

"dodgerblue"), lty = c(1, 1, 1), lwd = c(2, 2, 2),

legend = c("logisticka krivka", "probitova krivka",

"clog-log krivka"))


Figure 6: Comparison of all three binary regressions. [Plot titled "Logistická, probitová a cloglog regrese"; x-axis: Severity, y-axis: Outcome; legend: logistic curve, probit curve, clog-log curve.]

We try to select a suitable model on the basis of a residual analysis.

> par(mfrow = c(1, 3))

> plot(m1.logit, which = 2, cex = 0.75)

> plot(m1.probit, which = 2, cex = 0.75)

> plot(m1.cloglog, which = 2, cex = 0.75)

Figure 7: Comparison of the logistic, probit and cloglog regressions using Q-Q plots. [Three "Normal Q-Q" panels; x-axis: Theoretical Quantiles, y-axis: Std. deviance resid.; labelled observations 11, 37 and 39 in the first two panels and 11, 16 and 39 in the third.]

These plots do not allow us to decide which model is the most suitable. For binary outcomes, however, a very effective graphical tool is available, the so-called ROC curve.
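A quick numerical check points the same way as the Q-Q plots. The sketch below simply re-enters the AIC values printed in the three summaries above (the vector name aic is ours) and looks at their differences.

```r
# AIC values copied from the three summary() outputs above.
aic <- c(logit = 49.994, probit = 49.597, cloglog = 49.964)

# Difference of each model from the best (smallest) AIC.
delta <- aic - min(aic)
print(round(sort(delta), 3))
```

All three models have two parameters, so these differences equal the differences in residual deviance; they are all below 0.5, which gives no practical reason to prefer one link over another. With the fitted objects available, AIC(m1.logit, m1.probit, m1.cloglog) would return the same values.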


Notes on ROC analysis in relation to binary regression

The classical ROC curve is defined for binary classification rules, i.e. rules whose output is one of only two categories (classes, populations).

The abbreviation ROC stands for Receiver Operating Characteristic, because the curve was originally used as an operating characteristic of radar receivers.

A binary classification rule is a prescription that determines whether an individual or object, described by a one-dimensional or multidimensional statistical feature, belongs to one of two distinguishable classes or populations. The output of a classification rule is therefore the label of a class, population or category, e.g.

• healthy, ill;

• passed, failed.

It is a qualitative variable, usually coded by numbers, e.g. 0 and 1.

In binary regression we have at our disposal (for $i = 1, \ldots, n$):

$Y_i$: binary variables taking values in $\{0, 1\}$;

$x_i = (x_{i1}, \ldots, x_{im})^T$: independent variables, i.e. a one-dimensional or multidimensional feature characterizing the individual.

Using a GLM with a suitable link function g we obtain the estimates

$$\hat{\pi}_i = \hat{\pi}(x_i) = g^{-1}(x_i^T \hat{\beta}_{MLE}), \qquad i = 1, \ldots, n,$$

which can be obtained in R with, for example, fitted(model). The following classification rule is then used as the classifier:

$$\hat{Y}_i = \begin{cases} 1 & \text{if the posterior probability } \hat{\pi}(x_i) > c \text{ (usually } c = 0.5\text{)}, \\ 0 & \text{otherwise.} \end{cases}$$

The point c is called the cutoff or critical point (decision limit, cutoff point, threshold).

To evaluate the performance of a particular binary classification rule (tied to a precisely given cutoff), one starts from the following contingency table, also called the confusion matrix.

Contingency table

Actual category    Classified as 0 (negative)    Classified as 1 (positive)
0 (negative)       true negative                 false positive
                   specificity                   type I error
                   1 − α                         α
1 (positive)       false negative                true positive
                   type II error                 sensitivity
                   β                             1 − β

Depending on which error has the more serious consequences, the cutoff can then be changed so as to keep the size of the chosen error under control.
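The trade-off between the two errors can be illustrated with a short sketch. The data here are simulated and purely hypothetical (the vectors truth and p_hat and the helper rates() are ours); with a fitted model one would use the observed outcome and fitted(model) instead.

```r
set.seed(7)

# Hypothetical example: true classes and fitted probabilities (simulated).
truth <- rep(c(0, 1), times = c(26, 23))
p_hat <- c(rbeta(26, 2, 4), rbeta(23, 4, 2))

# Sensitivity and specificity of the rule "classify as 1 if p_hat > cutoff".
rates <- function(cutoff) {
  pred <- as.integer(p_hat > cutoff)
  c(cutoff      = cutoff,
    sensitivity = mean(pred[truth == 1] == 1),  # 1 - beta
    specificity = mean(pred[truth == 0] == 0))  # 1 - alpha
}

# One row per cutoff.
res <- t(sapply(c(0.3, 0.5, 0.7), rates))
print(round(res, 3))
```

Raising the cutoff can only lower the sensitivity and raise the specificity, so reducing one type of error comes at the price of the other.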

The binary classification rule (in diagnostics also called a diagnostic test or test criterion) is a random variable which in our case takes continuous values between zero and one. Let us denote it by T.


Further, let $T_0$ denote the random variable T given that the individual actually belongs to group 0 (undiseased population), and similarly let $T_1$ denote the random variable T given that the individual actually belongs to group 1 (diseased population).

We denote the corresponding densities and distribution functions by $f_0$, $F_0$ and $f_1$, $F_1$.

We further assume that higher values of the criterion T lead to a higher probability of the disease under study (i.e. to a higher probability that the individual belongs to population 1 and is therefore positive).

Then for every c

$$FP(c) = P(T_0 > c) = 1 - F_0(c) \qquad \text{(false positive rate = 1 - specificity)}$$

$$TP(c) = P(T_1 > c) = 1 - F_1(c) \qquad \text{(true positive rate = sensitivity)}$$

and the ROC curve is formed by the points

$$(FP(c), TP(c)) = (\underbrace{1 - F_0(c)}_{x}, \underbrace{1 - F_1(c)}_{y}),$$

or equivalently

$$ROC(p) = 1 - F_1(F_0^{-1}(1 - p)), \qquad 0 \le p \le 1.$$

The following plots illustrate how the ROC curve is constructed as the cutoff changes. In this case method 2 represents the population with index 0.

[Figure: four panels showing the densities of method 2 and method 1 with the cutoffs c_A to c_D, and the resulting ROC curve R(p) with the points A to D and the diagonal labelled "random".]

Point    Cutoff c     1 − specificity (x)    Sensitivity (y)
A        −2           0.14326                0.98742
B        −0.66667     0.07074                0.96743
C        0.66667      0.030408               0.92608
D        2            0.011323               0.85296

x-axis: probability of being wrongly classified as positive (the so-called 1 − specificity)

y-axis: probability of being correctly classified as positive (sensitivity of the test)


A classification rule is the more accurate, the closer its ROC curve comes to the upper left corner (0, 1).

Another very important characteristic is the area under the ROC curve

$$AUC = \int_0^1 ROC(p)\,dp \qquad \text{(Area Under the ROC Curve).}$$

If the covariates described by the vector x have no influence on the classification into the two classes, the AUC equals 0.5. The better the covariates were chosen, the closer the AUC is to one.

Since the true distribution of the classification criterion is unknown, the ROC curve has to be estimated. A number of approaches are used:

• parametric: assuming, for example, normality of the test criterion;

• nonparametric: the unknown conditional distribution functions are estimated, for example, by kernel estimators.

Most often, however, a nonparametric approach based on the empirical distribution functions is used. In that case the estimated ROC curve is step-shaped.
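The empirical estimate can be sketched in a few lines of base R. The scores below are simulated (the vectors y and score are hypothetical stand-ins for the observed outcome and the fitted probabilities); the code computes the points (FP(c), TP(c)) over all cutoffs and the area under the resulting step curve.

```r
set.seed(1)

# Hypothetical data: 26 individuals from population 0, 23 from population 1.
y     <- rep(c(0, 1), times = c(26, 23))
score <- c(rnorm(26, mean = 0), rnorm(23, mean = 1.5))
t0 <- score[y == 0]
t1 <- score[y == 1]

# Cutoffs from +Inf down to -Inf, so the curve runs from (0, 0) to (1, 1).
cuts <- c(Inf, sort(unique(score), decreasing = TRUE), -Inf)
FP <- sapply(cuts, function(ct) mean(t0 > ct))  # 1 - empirical F0(c)
TP <- sapply(cuts, function(ct) mean(t1 > ct))  # 1 - empirical F1(c)

# Area under the empirical curve (trapezoidal rule).
auc_curve <- sum(diff(FP) * (head(TP, -1) + tail(TP, -1)) / 2)

# Proportion of (positive, negative) pairs in which the positive scores higher.
auc_pairs <- mean(outer(t1, t0, ">"))

plot(FP, TP, type = "s", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)
print(c(curve = auc_curve, pairs = auc_pairs))
```

When there are no tied scores the two numbers coincide: the empirical AUC is the proportion of pairs in which the individual from population 1 has the higher score.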

R offers a number of packages that can draw the ROC curve and compute the AUC. First we show the plots obtained with the epicalc package.

> library(epicalc)

> par(mfrow = c(1, 3), mar = c(5, 5, 3, 0) + 0.1)

> graf1 <- lroc(m1.logit, title = TRUE, auc.coords = c(0.05,

0.1), cex = 1.5, cex.main = 1.25)

> graf2 <- lroc(m1.probit, title = TRUE, auc.coords = c(0.05,

0.1), cex = 1.5, cex.main = 1.25)

> graf3 <- lroc(m1.cloglog, title = TRUE, auc.coords = c(0.05,

0.1), cex = 1.5, cex.main = 1.25)

Figure 8: Comparison of the ROC curves and AUC values for the logistic, probit and cloglog binary regressions (using lroc from the epicalc package). [Three panels titled "Treatment_Outcome ~ Infection_Severity"; x-axis: 1-Specificity, y-axis: Sensitivity; each panel annotated "Area under the curve = 0.846".]

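The resulting plots show that the three models differ neither in the ROC curve nor in the AUC value. This is to be expected: in each model the fitted probability is an increasing function of the single covariate Infection_Severity, so all three models rank the patients identically, and the empirical ROC curve depends only on that ranking.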


We also show the type of ROC plot offered by the verification package. Here, besides the empirical estimate, one can also obtain a ROC curve estimated from the data under the assumption that both T0 and T1 are normally distributed.

> library(verification)

> par(mfrow = c(1, 3), mar = c(5, 5, 3, 0) + 0.1)

> binvar <- unclass(data$Treatment_Outcome) - 1

> T <- fitted(m1.logit)

> AUC <- roc.area(binvar, T)

> auc.txt <- paste("AUC=", round(AUC$A, 3), " (p.value=", round(AUC$p.value,

6), ")", sep = "")

> # T now holds the fitted values, so TRUE must be written out in full

> roc.plot(binvar, T, binormal = TRUE, plot = "both")

> text(0.2, 0.1, auc.txt, adj = c(0, 0), cex = 1.25)

> T <- fitted(m1.probit)

> AUC <- roc.area(binvar, T)

> auc.txt <- paste("AUC=", round(AUC$A, 3), " (p.value=", round(AUC$p.value,

6), ")", sep = "")

> roc.plot(binvar, T, binormal = TRUE, plot = "both")

> text(0.2, 0.1, auc.txt, adj = c(0, 0), cex = 1.25)

> T <- fitted(m1.cloglog)

> AUC <- roc.area(binvar, T)

> auc.txt <- paste("AUC=", round(AUC$A, 3), " (p.value=", round(AUC$p.value,

6), ")", sep = "")

> roc.plot(binvar, T, binormal = TRUE, plot = "both")

> text(0.2, 0.1, auc.txt, adj = c(0, 0), cex = 1.25)

Figure 9: Comparison of the ROC curves and AUC values for the logistic, probit and cloglog binary regressions (using roc.area and roc.plot from the verification package). [Three panels titled "ROC Curve"; x-axis: False Alarm Rate, y-axis: Hit Rate; thresholds 0.1 to 0.9 marked along each curve; each panel annotated "AUC=0.846 (p.value=1.8e−05)".]

The p-value in parentheses next to the AUC value refers to the test of the hypothesis H0: AUC = 0.5. We see that this hypothesis is rejected, which indicates that the variable Infection_Severity has a significant effect on survival.
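A test of H0: AUC = 0.5 can be sketched with base R alone, because the empirical AUC is a rescaled Wilcoxon-Mann-Whitney statistic. We assume here, without having checked it against the package source, that the p-value reported by roc.area is of this type. The data are simulated and the names y and score are ours.

```r
set.seed(1)

# Hypothetical data: class labels and a score that tends to be higher in class 1.
y     <- rep(c(0, 1), times = c(26, 23))
score <- c(rnorm(26, mean = 0), rnorm(23, mean = 1.5))

# One-sided Wilcoxon rank-sum test: are the scores in class 1 shifted upwards?
w <- wilcox.test(score[y == 1], score[y == 0], alternative = "greater")

# W counts the (class 1, class 0) pairs in which the class 1 score is larger,
# so dividing by the number of pairs gives the empirical AUC.
auc <- unname(w$statistic) / (sum(y == 1) * sum(y == 0))
print(c(AUC = auc, p.value = w$p.value))
```

A small p-value means that the scores separate the two classes better than chance, i.e. that the AUC is significantly above 0.5.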


Let us look at the confusion matrices of the individual binary models:

> fitY.logit <- factor(fitted(m1.logit) > 0.5, labels = c("pred.survived",

"pred.died"))

> table(data$Treatment_Outcome, fitY.logit)

fitY.logit

pred.survived pred.died

survived 20 6

died 8 15

> fitY.probit <- factor(fitted(m1.probit) > 0.5, labels = c("pred.survived",

"pred.died"))

> table(data$Treatment_Outcome, fitY.probit)

fitY.probit

pred.survived pred.died

survived 20 6

died 8 15

> fitY.cloglog <- factor(fitted(m1.cloglog) > 0.5, labels = c("pred.survived",

"pred.died"))

> table(data$Treatment_Outcome, fitY.cloglog)

fitY.cloglog

pred.survived pred.died

survived 21 5

died 9 14

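We see that the logit and probit models give identical confusion matrices, whereas the cloglog model differs slightly: it classifies one more survivor correctly and one more death incorrectly, so all three models classify 35 of the 49 patients correctly. Next we show how to obtain relative instead of absolute frequencies. First for the whole table: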

> prop.table(table(data$Treatment_Outcome, fitY.logit))

fitY.logit

pred.survived pred.died

survived 0.4081633 0.1224490

died 0.1632653 0.3061224

Row-wise relative frequencies are obtained with

> prop.table(table(data$Treatment_Outcome, fitY.probit), 1)

fitY.probit

pred.survived pred.died

survived 0.7692308 0.2307692

died 0.3478261 0.6521739


And finally, column-wise relative frequencies are obtained as follows:

> prop.table(table(data$Treatment_Outcome, fitY.cloglog), 2)

fitY.cloglog

pred.survived pred.died

survived 0.7000000 0.2631579

died 0.3000000 0.7368421
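The usual summary measures follow directly from these tables. The sketch below re-enters the two distinct confusion matrices printed above by hand (the helper summarise_cm() and the object names are ours) and computes accuracy, sensitivity and specificity, treating "died" as the positive class.

```r
# Confusion matrices copied from the output above (rows = actual outcome).
labels <- list(actual    = c("survived", "died"),
               predicted = c("pred.survived", "pred.died"))
cm_logit   <- matrix(c(20, 6, 8, 15), nrow = 2, byrow = TRUE, dimnames = labels)
cm_cloglog <- matrix(c(21, 5, 9, 14), nrow = 2, byrow = TRUE, dimnames = labels)

# "died" is treated as the positive class.
summarise_cm <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["died", "pred.died"] / sum(cm["died", ]),
    specificity = cm["survived", "pred.survived"] / sum(cm["survived", ]))
}

res <- rbind(logit = summarise_cm(cm_logit), cloglog = summarise_cm(cm_cloglog))
print(round(res, 3))
```

The sensitivity and specificity are the diagonal entries of the row-wise relative frequencies shown above; the overall accuracy is the same for both matrices even though the individual cells differ.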

Many more options are available with the CrossTable() command from the gmodels package.

> library(gmodels)

> CrossTable(table(data$Treatment_Outcome, fitY.cloglog), prop.r = TRUE,

prop.c = TRUE, prop.t = TRUE, prop.chisq = FALSE)

Cell Contents

|-------------------------|

| N |

| N / Row Total |

| N / Col Total |

| N / Table Total |

|-------------------------|

Total Observations in Table: 49

| fitY.cloglog

| pred.survived | pred.died | Row Total |

-------------|---------------|---------------|---------------|

survived | 21 | 5 | 26 |

| 0.808 | 0.192 | 0.531 |

| 0.700 | 0.263 | |

| 0.429 | 0.102 | |

-------------|---------------|---------------|---------------|

died | 9 | 14 | 23 |

| 0.391 | 0.609 | 0.469 |

| 0.300 | 0.737 | |

| 0.184 | 0.286 | |

-------------|---------------|---------------|---------------|

Column Total | 30 | 19 | 49 |

| 0.612 | 0.388 | |

-------------|---------------|---------------|---------------|

We now return to binary regression and examine whether the model improves when we add another explanatory variable, namely Hospital.


Model 2 - binary regression model with a single continuous variable Infection_Severity and the categorical variable Hospital

We start with a model that also includes the interactions between the variables Infection_Severity and Hospital. First we choose the canonical link function.

> m2a.logit <- glm(Treatment_Outcome ~ Infection_Severity *

Hospital, family = binomial(logit), data = data)

> summary(m2a.logit)

Call:

glm(formula = Treatment_Outcome ~ Infection_Severity * Hospital,

family = binomial(logit), data = data)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.2565 -0.4597 -0.1565 0.3539 2.0989

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.664355 2.803449 -2.020 0.0433 *

Infection_Severity 0.055830 0.032493 1.718 0.0858 .

HospitalB -0.012466 3.989376 -0.003 0.9975

HospitalC 2.817310 3.600700 0.782 0.4340

Infection_Severity:HospitalB 0.012591 0.049197 0.256 0.7980

Infection_Severity:HospitalC 0.006657 0.046875 0.142 0.8871

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 67.745 on 48 degrees of freedom

Residual deviance: 34.676 on 43 degrees of freedom

AIC: 46.676

Number of Fisher Scoring iterations: 6

We see that the significance of the individual variables has deteriorated markedly, so we consider a simpler model without interactions:

> m2b.logit <- glm(Treatment_Outcome ~ Infection_Severity +

Hospital, family = binomial(logit), data = data)

> summary(m2b.logit)

Call:

glm(formula = Treatment_Outcome ~ Infection_Severity + Hospital,

family = binomial(logit), data = data)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.2528 -0.4932 -0.1835 0.3643 2.1508

Coefficients:


Estimate Std. Error z value Pr(>|z|)

(Intercept) -6.18858 1.81807 -3.404 0.000664 ***

Infection_Severity 0.06209 0.01985 3.128 0.001760 **

HospitalB 0.98306 1.01251 0.971 0.331595

HospitalC 3.36626 1.20231 2.800 0.005113 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 67.745 on 48 degrees of freedom

Residual deviance: 34.742 on 45 degrees of freedom

AIC: 42.742

Number of Fisher Scoring iterations: 6
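In the additive model the hospital only shifts the logistic curve along the severity axis. One way to see this, sketched below with the estimates copied from the output above (the names b, shift and midpoint are ours), is to compute for each hospital the severity at which the fitted probability of death equals 0.5.

```r
# Estimates copied from the summary(m2b.logit) output above.
b <- c(intercept = -6.18858, severity = 0.06209,
       HospitalB = 0.98306, HospitalC = 3.36626)

# Hospital A is the reference level, so its shift of the intercept is zero.
shift <- c(A = 0, B = b[["HospitalB"]], C = b[["HospitalC"]])

# Severity at which the linear predictor is 0, i.e. the fitted probability is 0.5.
midpoint <- -(b[["intercept"]] + shift) / b[["severity"]]
print(round(midpoint, 1))
```

A larger hospital coefficient moves the curve to the left, so the fitted risk of death reaches 50% at a lower severity.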

Let us also check that this model is not markedly worse than the previous one:

> anova(m2a.logit, m2b.logit, test = "Chisq")

Analysis of Deviance Table

Model 1: Treatment_Outcome ~ Infection_Severity * Hospital

Model 2: Treatment_Outcome ~ Infection_Severity + Hospital

Resid. Df Resid. Dev Df Deviance P(>|Chi|)

1 43 34.676

2 45 34.742 -2 -0.066078 0.9675

Since the p-value is not below 0.05, dropping the interactions did not significantly worsen the model.
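The p-value in the analysis of deviance table can be reproduced by hand, which makes the underlying likelihood-ratio test explicit. The sketch below uses only the residual deviances and degrees of freedom printed above; the variable names are ours.

```r
# Residual deviances and degrees of freedom copied from the two summaries above.
dev_interaction <- 34.676; df_interaction <- 43
dev_additive    <- 34.742; df_additive    <- 45

# Likelihood-ratio statistic: the increase in deviance after dropping the interactions.
lr_stat <- dev_additive - dev_interaction
lr_df   <- df_additive - df_interaction

# Upper tail of the chi-squared distribution with lr_df degrees of freedom.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p.value = p_value))
```

Up to the rounding of the printed deviances, this reproduces the value 0.9675 reported by anova().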

We plot the resulting logistic curves for the individual hospitals.

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "p", pch = 3, xlim = c(-10, 180), col = unclass(data$Hospital),

ylab = "Outcome", xlab = "Severity", main = "Logistic curves")

> points(data$Infection_Severity, fitted(m2b.logit), col = unclass(data$Hospital))

> xx <- seq(0, 170, length.out = 200)

> yA.logit <- predict(m2b.logit, list(Infection_Severity = xx,

Hospital = factor(rep("A", 200))), type = "response")

> lines(xx, yA.logit, col = 1, lwd = 2)

> yB.logit <- predict(m2b.logit, list(Infection_Severity = xx,

Hospital = factor(rep("B", 200))), type = "response")

> lines(xx, yB.logit, col = 2, lwd = 2)

> yC.logit <- predict(m2b.logit, list(Infection_Severity = xx,

Hospital = factor(rep("C", 200))), type = "response")

> lines(xx, yC.logit, col = 3, lwd = 2)

> legend(100, 0.3, bty = "n", col = c(1, 2, 3), lty = c(1,

1, 1), lwd = c(2, 2, 2), legend = c("Hospital A", "Hospital B",

"Hospital C"))


Figure 10: Logistic curves for the individual hospitals. [Plot titled "Logistic curves"; x-axis: Severity, y-axis: Outcome; legend: Hospital A, Hospital B, Hospital C.]

For this model we again carry out a graphical residual analysis and draw the ROC curve.

> plot(m2b.logit, which = 2, cex = 0.75)

Figure 11: Q-Q plot of the residuals of Model 2b with the logit link function. [Panel "Normal Q-Q" for glm(Treatment_Outcome ~ Infection_Severity + Hospital); x-axis: Theoretical Quantiles, y-axis: Std. deviance resid.; labelled observations 19, 31 and 32.]

> library(verification)

> par(mar = c(5, 5, 3, 0) + 0.1)

> binvar <- unclass(data$Treatment_Outcome) - 1

> T <- fitted(m2b.logit)

> AUC <- roc.area(binvar, T)

> auc.txt <- paste("AUC=", round(AUC$A, 3), " (p.value=", round(AUC$p.value,

8), ")", sep = "")


> roc.plot(binvar, T, binormal = TRUE, plot = "both")

> text(0.2, 0.1, auc.txt, adj = c(0, 0), cex = 1.25)

[Figure: "ROC Curve"; x-axis: False Alarm Rate (0.0 to 1.0), y-axis: Hit Rate (0.0 to 1.0); threshold labels 0.1 to 0.9 along the curve; annotation: AUC=0.926 (p.value=1e-08)]

Figure 12: ROC curve and AUC value for binary logistic regression, MODEL 2b (using the roc.area and roc.plot functions from the verification package).
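The AUC can be read as the probability that a randomly chosen patient who died receives a higher fitted probability than a randomly chosen survivor. The following is a minimal base-R sketch of that rank-based (Mann-Whitney) calculation. It uses made-up toy values rather than the course data, and the helper name auc_rank is hypothetical; its result should agree with the area reported by roc.area up to the handling of ties.

```r
# Rank-based AUC (Mann-Whitney form). auc_rank is an illustrative helper,
# not a function of the verification package.
auc_rank <- function(y, score) {
  r  <- rank(score)            # mid-ranks are used for tied scores
  n1 <- sum(y == 1)            # number of events
  n0 <- sum(y == 0)            # number of non-events
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (made up): 0 = survived, 1 = died
y     <- c(0, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(y, score)   # 0.75: 3 of the 4 (died, survived) pairs are ordered correctly
```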

We see that adding another variable raised the AUC from 0.846 to 0.926. Let us not forget the confusion matrix either:

> library(gmodels)

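> fitY.logit <- factor(fitted(m2b.logit) > 0.5,
      labels = c("pred.survived", "pred.died"))

> # Note: the output reproduced below is labelled fitY.cloglog, i.e. it was
> # generated from an object of that name defined for an earlier model,
> # not from fitY.logit.
> CrossTable(table(data$Treatment_Outcome, fitY.logit), prop.r = TRUE,
      prop.c = TRUE, prop.t = TRUE, prop.chisq = FALSE)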

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  49

             | fitY.cloglog
             | pred.survived |     pred.died |     Row Total |
-------------|---------------|---------------|---------------|
    survived |            21 |             5 |            26 |
             |         0.808 |         0.192 |         0.531 |
             |         0.700 |         0.263 |               |
             |         0.429 |         0.102 |               |
-------------|---------------|---------------|---------------|
        died |             9 |            14 |            23 |
             |         0.391 |         0.609 |         0.469 |
             |         0.300 |         0.737 |               |
             |         0.184 |         0.286 |               |
-------------|---------------|---------------|---------------|
Column Total |            30 |            19 |            49 |
             |         0.612 |         0.388 |               |
-------------|---------------|---------------|---------------|

For completeness, let us fit Model 2b with the two remaining link functions. We fit the model, draw the resulting curves for all hospitals in a single plot, carry out a graphical analysis of the residuals, construct the ROC curve, and finally compute the confusion matrix.
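The three link functions differ only in how the linear predictor is mapped to a probability. The following small base-R sketch evaluates the three inverse links on an arbitrary grid of linear-predictor values; the grid and the object names are made up for illustration.

```r
eta <- c(-3, -1.5, 0, 1.5, 3)       # arbitrary values of the linear predictor

p_logit   <- plogis(eta)            # 1 / (1 + exp(-eta))
p_probit  <- pnorm(eta)             # standard normal distribution function
p_cloglog <- 1 - exp(-exp(eta))     # complementary log-log

round(cbind(eta, p_logit, p_probit, p_cloglog), 3)

# The same mapping is stored in the family object that glm() uses
all.equal(p_cloglog, binomial(link = "cloglog")$linkinv(eta))
```

At a linear predictor of 0 the logit and probit links both give 0.5, whereas cloglog gives 1 - exp(-1), roughly 0.632, which reflects its asymmetry.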

> m2b.probit <- glm(Treatment_Outcome ~ Infection_Severity +

Hospital, family = binomial(probit), data = data)

> summary(m2b.probit)

Call:
glm(formula = Treatment_Outcome ~ Infection_Severity + Hospital,
    family = binomial(probit), data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2245  -0.4837  -0.1249   0.3579   2.1404

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -3.61785    0.95185  -3.801 0.000144 ***
Infection_Severity  0.03655    0.01058   3.454 0.000552 ***
HospitalB           0.53274    0.57472   0.927 0.353948
HospitalC           1.88787    0.64180   2.942 0.003266 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 67.745  on 48  degrees of freedom
Residual deviance: 34.511  on 45  degrees of freedom
AIC: 42.511

Number of Fisher Scoring iterations: 7

> m2b.cloglog <- glm(Treatment_Outcome ~ Infection_Severity +

Hospital, family = binomial(cloglog), data = data)

> summary(m2b.cloglog)

Call:
glm(formula = Treatment_Outcome ~ Infection_Severity + Hospital,
    family = binomial(cloglog), data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2131  -0.5651  -0.2920   0.3321   2.0368

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -4.59689    1.23198  -3.731 0.000191 ***
Infection_Severity  0.04039    0.01258   3.210 0.001329 **
HospitalB           0.70634    0.75040   0.941 0.346561
HospitalC           2.05929    0.73198   2.813 0.004903 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 67.745  on 48  degrees of freedom
Residual deviance: 35.482  on 45  degrees of freedom
AIC: 43.482

Number of Fisher Scoring iterations: 7
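For ungrouped binary responses the saturated model has log-likelihood zero, so the residual deviance equals minus twice the log-likelihood and the AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the two summaries above using only the printed values. The likelihood-ratio comparison with the intercept-only model is added as an illustration and is not part of the original exercise.

```r
p_coef   <- 4                                   # intercept, severity, HospitalB, HospitalC
res_dev  <- c(probit = 34.511, cloglog = 35.482)
null_dev <- 67.745

res_dev + 2 * p_coef                            # 42.511 and 43.482, matching the AIC lines

# Likelihood-ratio statistic against the intercept-only model (48 - 45 = 3 df)
lr_stat <- null_dev - res_dev
pchisq(lr_stat, df = 3, lower.tail = FALSE)
```

On this criterion the probit fit is slightly better than the cloglog fit, since both models have the same number of coefficients and the probit deviance is lower.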

> par(mfrow = c(1, 2), mar = c(5, 5, 3, 0) + 0.1)

> xx <- seq(0, 170, length.out = 200)

> yA.probit <- predict(m2b.probit, list(Infection_Severity = xx,

Hospital = factor(rep("A", 200))), type = "response")

> yB.probit <- predict(m2b.probit, list(Infection_Severity = xx,

Hospital = factor(rep("B", 200))), type = "response")

> yC.probit <- predict(m2b.probit, list(Infection_Severity = xx,

Hospital = factor(rep("C", 200))), type = "response")

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "p", pch = 3, xlim = c(-10, 180), col = unclass(data$Hospital),

ylab = "Outcome", xlab = "Severity", main = "Probit curves")

> points(data$Infection_Severity, fitted(m2b.probit), col = unclass(data$Hospital))

> lines(xx, yA.probit, col = 1, lwd = 2)

> lines(xx, yB.probit, col = 2, lwd = 2)

> lines(xx, yC.probit, col = 3, lwd = 2)

> legend(100, 0.3, bty = "n", col = c(1, 2, 3), lty = c(1,

1, 1), lwd = c(2, 2, 2), legend = c("Hospital A", "Hospital B",

"Hospital C"))

> plot(data$Infection_Severity, unclass(data$Treatment_Outcome) -

1, type = "p", pch = 3, xlim = c(-10, 180), col = unclass(data$Hospital),

ylab = "Outcome", xlab = "Severity", main = "cloglog curves")

> points(data$Infection_Severity, fitted(m2b.cloglog), col = unclass(data$Hospital))

> yA.cloglog <- predict(m2b.cloglog, list(Infection_Severity = xx,

Hospital = factor(rep("A", 200))), type = "response")

> yB.cloglog <- predict(m2b.cloglog, list(Infection_Severity = xx,

Hospital = factor(rep("B", 200))), type = "response")

> yC.cloglog <- predict(m2b.cloglog, list(Infection_Severity = xx,

Hospital = factor(rep("C", 200))), type = "response")

> lines(xx, yA.cloglog, col = 1, lwd = 2)

> lines(xx, yB.cloglog, col = 2, lwd = 2)

> lines(xx, yC.cloglog, col = 3, lwd = 2)

> legend(100, 0.3, bty = "n", col = c(1, 2, 3), lty = c(1,

1, 1), lwd = c(2, 2, 2), legend = c("Hospital A", "Hospital B",

"Hospital C"))


[Figure, two panels: "Probit curves" (left) and "cloglog curves" (right); in each, x-axis: Severity (0 to 150), y-axis: Outcome (0.0 to 1.0); legend: Hospital A, Hospital B, Hospital C]

Figure 13: Resulting curves: probit and complementary log-log.
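The per-hospital prediction code above repeats the same predict() call three times for each model. One way to shorten it is to loop over the factor levels. The sketch below runs on simulated data so that it is self-contained: the data frame toy, its coefficients and the object names are made up, and this is not the code that produced the figures in this exercise.

```r
# Simulated stand-in for the course data (all values are made up)
set.seed(1)
n   <- 60
toy <- data.frame(
  Infection_Severity = runif(n, 0, 170),
  Hospital = factor(sample(c("A", "B", "C"), n, replace = TRUE),
                    levels = c("A", "B", "C"))
)
shift <- c(A = 0, B = 0.5, C = 1.5)[as.character(toy$Hospital)]
toy$Treatment_Outcome <- rbinom(n, 1, plogis(-3 + 0.04 * toy$Infection_Severity + shift))

fit <- glm(Treatment_Outcome ~ Infection_Severity + Hospital,
           family = binomial(probit), data = toy)

# One column of fitted probabilities per hospital
xx <- seq(0, 170, length.out = 200)
curves <- sapply(levels(toy$Hospital), function(h) {
  nd <- data.frame(Infection_Severity = xx,
                   Hospital = factor(h, levels = levels(toy$Hospital)))
  predict(fit, newdata = nd, type = "response")
})

matplot(xx, curves, type = "l", lty = 1, lwd = 2, col = 1:3,
        xlab = "Severity", ylab = "Fitted probability")
legend("bottomright", bty = "n", col = 1:3, lty = 1, lwd = 2,
       legend = paste("Hospital", colnames(curves)))
```

Giving the new factor the full set of levels keeps the prediction data consistent with the fitted model, whichever hospital is being drawn.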

> par(mfrow = c(1, 2))

> binvar <- unclass(data$Treatment_Outcome) - 1

> T <- fitted(m2b.probit)

> AUC <- roc.area(binvar, T)

> auc.txt <- paste("AUC=", round(AUC$A, 3), " (p.value=", round(AUC$p.value,

8), ")", sep = "")

> roc.plot(binvar, T, binormal = TRUE, plot = "both")

> text(0.2, 0.1, auc.txt, adj = c(0, 0), cex = 1.25)

> T <- fitted(m2b.cloglog)

> AUC <- roc.area(binvar, T)

> auc.txt <- paste("AUC=", round(AUC$A, 3), " (p.value=", round(AUC$p.value,

8), ")", sep = "")

> roc.plot(binvar, T, binormal = TRUE, plot = "both")

> text(0.2, 0.1, auc.txt, adj = c(0, 0), cex = 1.25)

[Figure, two panels, each titled "ROC Curve"; x-axis: False Alarm Rate (0.0 to 1.0), y-axis: Hit Rate (0.0 to 1.0); threshold labels along the curves; annotations: AUC=0.926 (p.value=1e-08) for the probit model (left) and AUC=0.921 (p.value=1e-08) for the cloglog model (right)]

Figure 14: ROC curves for the models with the probit and complementary log-log link functions.


> par(mfrow = c(1, 2), mar = c(5, 5, 3, 0) + 0.1)

> plot(m2b.probit, which = 2, cex = 0.75)

> plot(m2b.cloglog, which = 2, cex = 0.75)

[Figure, two panels, each titled "Normal Q-Q"; x-axis: Theoretical Quantiles (-2 to 2), y-axis: Std. deviance resid. (-2 to 2); labelled observations in both panels: 31, 19, 32]

Figure 15: Q-Q plots of the residuals of model 2b with the probit and cloglog link functions.

> library(gmodels)

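> fitY.probit <- factor(fitted(m2b.probit) > 0.5,
      labels = c("pred.survived", "pred.died"))

> # Note: the output reproduced below is labelled fitY.cloglog, i.e. it was
> # generated from an object of that name defined for an earlier model,
> # not from fitY.probit.
> CrossTable(table(data$Treatment_Outcome, fitY.probit), prop.r = TRUE,
      prop.c = TRUE, prop.t = TRUE, prop.chisq = FALSE)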

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  49

             | fitY.cloglog
             | pred.survived |     pred.died |     Row Total |
-------------|---------------|---------------|---------------|
    survived |            21 |             5 |            26 |
             |         0.808 |         0.192 |         0.531 |
             |         0.700 |         0.263 |               |
             |         0.429 |         0.102 |               |
-------------|---------------|---------------|---------------|
        died |             9 |            14 |            23 |
             |         0.391 |         0.609 |         0.469 |
             |         0.300 |         0.737 |               |
             |         0.184 |         0.286 |               |
-------------|---------------|---------------|---------------|
Column Total |            30 |            19 |            49 |
             |         0.612 |         0.388 |               |
-------------|---------------|---------------|---------------|

> fitY.cloglog <- factor(fitted(m2b.cloglog) > 0.5,
      labels = c("pred.survived", "pred.died"))

> CrossTable(table(data$Treatment_Outcome, fitY.cloglog), prop.r = TRUE,
      prop.c = TRUE, prop.t = TRUE, prop.chisq = FALSE)

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  49

             | fitY.cloglog
             | pred.survived |     pred.died |     Row Total |
-------------|---------------|---------------|---------------|
    survived |            23 |             3 |            26 |
             |         0.885 |         0.115 |         0.531 |
             |         0.793 |         0.150 |               |
             |         0.469 |         0.061 |               |
-------------|---------------|---------------|---------------|
        died |             6 |            17 |            23 |
             |         0.261 |         0.739 |         0.469 |
             |         0.207 |         0.850 |               |
             |         0.122 |         0.347 |               |
-------------|---------------|---------------|---------------|
Column Total |            29 |            20 |            49 |
             |         0.592 |         0.408 |               |
-------------|---------------|---------------|---------------|
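The usual summary rates can be read directly off a confusion matrix. As a worked check on the last table (Model 2b with the cloglog link), the sketch below re-enters its four counts by hand and computes accuracy, sensitivity and specificity, treating "died" as the event. The object names are illustrative.

```r
# Counts copied from the last CrossTable output (rows: actual, columns: predicted)
cm <- matrix(c(23, 6, 3, 17), nrow = 2,
             dimnames = list(actual    = c("survived", "died"),
                             predicted = c("pred.survived", "pred.died")))

accuracy    <- sum(diag(cm)) / sum(cm)                                   # (23 + 17) / 49
sensitivity <- cm["died", "pred.died"] / sum(cm["died", ])               # 17 / 23
specificity <- cm["survived", "pred.survived"] / sum(cm["survived", ])   # 23 / 26

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
# accuracy 0.816, sensitivity 0.739, specificity 0.885
```

The sensitivity and specificity agree with the row proportions 0.739 and 0.885 printed in the table.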