14 Logistic regression
-
8/7/2019 14 Logistic regression
1/18
Université d'Ottawa - Bio 4518 - Biostatistiques appliquées
Antoine Morin and Scott Findlay
11-02-12 01:20
Logistic regression
-
Logistic regression
Member of the GLM family.
Unlike standard linear regression, the dependent variable is binary (0, 1), so that each case's value is either 0 or 1.
Normally, 0 is taken to mean the absence of some attribute, 1 its presence.
Logistic regression can be extended to the case where there are more than two possible values for the dependent variable (e.g. low, medium, high: multinomial regression).
-
Example: incidence of heart attacks in relation to age
[Figure: cardiaque (heart attack, 0 or 1) plotted against age]
Linear regression inappropriate because:
Residuals not normal
Residuals heteroscedastic
Predicted values nonsense (e.g. what does a predicted value of 0.3 mean?)
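The problem with predicted values can be sketched in a few lines of Python (the slides themselves use S-Plus). The data below are simulated purely for illustration; the "true" curve and its constants are assumptions, not the course's heart attack data. The sketch fits a straight line to a 0/1 response and counts fitted values that fall outside the 0 to 1 range.

```python
import numpy as np

# Simulated illustration only: ages and 0/1 outcomes are generated here,
# they are NOT the heart attack data shown on the slide.
rng = np.random.default_rng(0)
age = rng.uniform(10, 90, size=500)
p_true = 1.0 / (1.0 + np.exp(-(-6.0 + 0.1 * age)))  # assumed "true" curve
y = (rng.uniform(size=age.size) < p_true).astype(float)

# Ordinary least squares straight line: y_hat = b0 + b1 * age
b1, b0 = np.polyfit(age, y, deg=1)
fitted = b0 + b1 * age

# A straight line is unbounded, so it can leave the [0, 1] range.
pred_at_10 = b0 + b1 * 10.0
n_outside = int(np.sum((fitted < 0) | (fitted > 1)))

print("slope:", b1)
print("predicted value at age 10:", pred_at_10)
print("fitted values outside [0, 1]:", n_outside)
```

A fitted value below 0 or above 1 cannot be read as a probability, which is one reason to model the probability on the logit scale instead.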
-
Logistic regression: dependent variable
Variable of interest is the probability p of obtaining a one as a function of predictor variables.
The magnitude of regression coefficients in the model depends on the distribution of the predictor variables in the two groups, Y = 0 and Y = 1.
[Figure: two panels of Y (0 or 1) plotted against X]
-
Dependent variable: logit(p)
$$y = \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
$$p = \frac{e^{y}}{1 + e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1 + e^{\operatorname{logit}(p)}}$$
[Figure: p (in percent, 0 to 100) plotted against logit (-4 to 4)]
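As a minimal Python sketch of the two transformations (the function names `logit` and `inv_logit` are chosen here for illustration), the following converts between a probability and its logit and prints p in percent over the same range of logit values as the figure axis.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(y: float) -> float:
    """Probability corresponding to a logit value y: e^y / (1 + e^y)."""
    return math.exp(y) / (1.0 + math.exp(y))

# p (in percent) for logit values from -4 to 4
for y in (-4, -2, 0, 2, 4):
    print(y, round(100 * inv_logit(y), 1))
```

The two functions are inverses of each other, so `logit(inv_logit(y))` returns `y` up to rounding error.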
-
Logistic regression: model coefficients
Negative regression coefficient means probability of success decreases with increasing value of predictor.
Positive regression coefficient means probability of success increases with increasing value of predictor.
[Figure: two panels of Y (0 or 1) plotted against X, one with regression coefficient > 0 and one with regression coefficient < 0]
-
Logistic regression: model coefficients
The magnitude of the regression coefficient depends on how abruptly p changes with X, with large values indicating abrupt change.
[Figure: two panels of Y (0 or 1) plotted against X, one with regression coefficient > 0 and small, one with regression coefficient > 0 and large]
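The effect of the size of the coefficient can be illustrated with a short Python sketch. The intercept `b0`, the slope `b1` and the values 0.5 and 5.0 are hypothetical, chosen only to contrast a gentle curve with an abrupt one.

```python
import math

def prob(x: float, b0: float, b1: float) -> float:
    """p = e^y / (1 + e^y) with y = b0 + b1 * x."""
    y = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-y))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Coefficient values chosen only for illustration.
for label, b1 in (("small (0.5)", 0.5), ("large (5.0)", 5.0)):
    curve = [round(prob(x, 0.0, b1), 3) for x in xs]
    print(label, curve)
```

With the small coefficient, p drifts gradually from below 0.5 to above 0.5 across the range of X; with the large one, it jumps from near 0 to near 1 around X = 0. At p = 0.5 the slope of the curve is b1 / 4, so the steepness scales directly with the coefficient.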
-
Least squares estimation (LSE)
An ordinary least squares (OLS) estimate of a model parameter is that which minimizes the sum of squared differences between observed and predicted values:
$$SS_R = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Predicted values are derived from some model whose parameters we wish to estimate:
$$\hat{y} = f(x, \theta)$$
[Figure: SS_R plotted against the parameter value, with the OLS estimate at the minimum]
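A minimal Python sketch of the idea, assuming a one-parameter model y_hat = theta * x and a small made-up data set (neither comes from the slides): the residual sum of squares is evaluated over a grid of candidate parameter values, and the grid minimum is compared with the closed-form OLS estimate for this model.

```python
import numpy as np

# Made-up data for illustration: roughly a line through the origin.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def ss_r(theta: float) -> float:
    """Residual sum of squares for the model y_hat = theta * x."""
    y_hat = theta * x
    return float(np.sum((y - y_hat) ** 2))

# Evaluate SS_R over a grid of candidate parameter values (step 0.001).
grid = np.linspace(0.0, 4.0, 4001)
ss = np.array([ss_r(t) for t in grid])
theta_grid = float(grid[np.argmin(ss)])

# Closed-form OLS estimate for this one-parameter model.
theta_ols = float(np.sum(x * y) / np.sum(x * x))

print("grid minimum:", theta_grid)
print("closed form: ", theta_ols)
```

The grid search mirrors the curve of SS_R against the parameter: the estimate is simply the parameter value at the bottom of that curve.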
-
Maximum likelihood estimation (MLE)
A maximum likelihood estimate (MLE) of a model parameter for a given distribution is that which maximizes the probability of generating the observed sample data.
MLEs are obtained by maximizing the likelihood function
$$L = \prod_{i=1}^{n} f(x_i; \theta)$$
or equivalently, by minimizing the negative log likelihood function, where
$$\log L = \sum_{i=1}^{n} \ln\big(f(x_i; \theta)\big)$$
[Figure: L or -log L plotted against the parameter value, with the MLE at the maximum of L (the minimum of -log L)]
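As a small illustration in Python, assume the simplest binary case: independent 0/1 observations with a single unknown probability p. The sample below is made up. The sketch evaluates -log L over a grid of candidate values of p and picks the minimum, which should coincide with the sample proportion of ones.

```python
import numpy as np

# Made-up 0/1 sample for illustration: 7 ones out of 10.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_lik(p: float) -> float:
    """-log L for independent Bernoulli(p) observations."""
    return float(-np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

# Candidate values 0.01, 0.02, ..., 0.99
grid = np.linspace(0.01, 0.99, 99)
nll = np.array([neg_log_lik(p) for p in grid])
p_mle = float(grid[np.argmin(nll)])

print("MLE from grid:    ", p_mle)
print("sample proportion:", x.mean())
```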
-
How are the model parameters estimated?
Estimated not by least squares, but rather by maximum likelihood.
Based on an estimate of the likelihood of obtaining the observed results given different values of the model parameters.
In principle, parameter estimates should converge to those maximizing the log-likelihood or, equivalently, minimizing -log L.
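One common way to carry out this iterative search is Newton-Raphson, which for the logit link coincides with Fisher scoring. The Python sketch below is an assumption about how such a fit can be written, not a description of what any particular package does internally; the function name `fit_logistic`, the simulated data and the "true" coefficients are all hypothetical.

```python
import numpy as np

def fit_logistic(x, y, max_iter=50, tol=1e-8):
    """Maximize the Bernoulli log-likelihood of logit(p) = b0 + b1 * x
    by Newton-Raphson (same as Fisher scoring for the logit link)."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
    beta = np.zeros(2)                          # starting values
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # current fitted probabilities
        gradient = X.T @ (y - p)                # derivative of log L
        information = X.T @ (X * (p * (1 - p))[:, None])  # X' W X
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # estimates have converged
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    log_lik = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return beta, log_lik

# Simulated data for illustration (assumed true values b0 = -1, b1 = 2).
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
y = (rng.uniform(size=x.size) < p_true).astype(float)

beta_hat, log_lik = fit_logistic(x, y)
print("estimates:", beta_hat)
print("-2 log L: ", -2 * log_lik)
```

Each pass updates the estimates in the direction that increases log L, and the loop stops once the update becomes negligible.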
-
Hypothesis testing
Deviance = -2 log L, where L is the likelihood
Is approximately distributed as chi-square.
Measures the variation unexplained by the fitted model, analogous to the residual sum of squares.
Model comparison
Change in deviance when model terms are added (or deleted) is also approximately distributed as chi-square, so we can test hypotheses relating to individual model terms.
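For a single added term, the change in deviance is compared to a chi-square distribution with 1 degree of freedom. A minimal Python sketch follows; the function name and the two deviances are hypothetical, and the shortcut used for the tail probability holds for 1 degree of freedom only.

```python
import math

def deviance_change_test_1df(deviance_reduced: float, deviance_full: float):
    """Test for adding ONE term to a model: the drop in deviance is
    compared to a chi-square distribution with 1 degree of freedom."""
    change = deviance_reduced - deviance_full
    if change < 0:
        raise ValueError("the fuller model should not have a larger deviance")
    # For 1 df only: P(chi-square > c) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(change / 2.0))
    return change, p_value

# Illustrative deviances, not taken from any fitted model.
change, p_value = deviance_change_test_1df(120.0, 116.16)
print("change in deviance:", round(change, 2))
print("p-value:", p_value)
```

A small p-value indicates that the added term reduces the unexplained variation by more than would be expected by chance.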
-
Model assumptions
Observations are independent.
Dependent variable has a binomial distribution.
Little error in measurement of dependent variables.
-
Logistic regression in SPlus
*** Generalized Linear Model ***
Call: glm(formula = cardiaque ~ age, family = binomial(link = logit), data = SDF12, na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = F))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.545637 -0.5732664 -0.272312 -0.1404323 2.679875
Coefficients:
Value Std. Error t value
(Intercept) -7.76838060 0.376403465 -20.63844
age 0.09557905 0.005097055 18.75182
(Dispersion Parameter for Binomial family taken to be 1 )
Null Deviance: 2050.515 on 1999 degrees of freedom
Residual Deviance: 1490.001 on 1998 degrees of freedom
Number of Fisher Scoring Iterations: 4
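The numbers in this output can be read with a few lines of arithmetic. The Python sketch below copies the age coefficient, its standard error and the two deviances from the output; the variable names are chosen here for illustration, and the odds ratios are an added interpretation rather than something the output reports.

```python
import math

# Values copied from the S-Plus output above.
b_age, se_age = 0.09557905, 0.005097055
null_dev, resid_dev = 2050.515, 1490.001

# Coefficient divided by its standard error reproduces the "t value" column.
t_age = b_age / se_age

# exp(coefficient) is the multiplicative change in the odds of a
# heart attack for each additional year (or ten years) of age.
odds_ratio_per_year = math.exp(b_age)
odds_ratio_per_decade = math.exp(10 * b_age)

# Drop in deviance from adding age, on 1999 - 1998 = 1 degree of freedom.
deviance_drop = null_dev - resid_dev

print("t value for age:      ", round(t_age, 3))
print("odds ratio per year:  ", round(odds_ratio_per_year, 3))
print("odds ratio per decade:", round(odds_ratio_per_decade, 2))
print("drop in deviance:     ", round(deviance_drop, 3))
```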
-
Incidence of heart attack in relation to age
[Figure: cardiaque (heart attack, 0 or 1) plotted against age (30 to 90)]
$$y = \operatorname{logit}(p) = -7.77 + 0.096\,\mathrm{Age}$$
$$p = \frac{e^{y}}{1 + e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1 + e^{\operatorname{logit}(p)}} = \frac{e^{-7.77 + 0.096\,\mathrm{Age}}}{1 + e^{-7.77 + 0.096\,\mathrm{Age}}}$$
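The fitted equation can be turned into predicted probabilities directly. This Python sketch uses the unrounded coefficients from the S-Plus output; the function name `p_heart_attack` is hypothetical. It also computes the age at which the predicted probability is 0.5, which is where logit(p) = 0.

```python
import math

# Coefficients from the S-Plus output (intercept and age).
B0, B_AGE = -7.76838060, 0.09557905

def p_heart_attack(age: float) -> float:
    """Predicted probability: e^y / (1 + e^y) with y = B0 + B_AGE * age."""
    y = B0 + B_AGE * age
    return math.exp(y) / (1.0 + math.exp(y))

for age in (30, 50, 70, 90):
    print(age, round(p_heart_attack(age), 3))

# logit(p) = 0 when p = 0.5, so the corresponding age is -B0 / B_AGE.
age_at_half = -B0 / B_AGE
print("age at which p = 0.5:", round(age_at_half, 1))
```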
-
Presence of post-operative kyphosis using logistic regression
Kyphosis: a binary variable indicating the presence/absence of a postoperative spinal deformity called kyphosis.
Age: the age of the child in months.
Number: the number of vertebrae involved in the spinal operation.
Start: the beginning of the range of the vertebrae involved in the operation.
-
Evidence that the distribution of predictor variables differs among levels of response variable
-
The model
-
Testing hypotheses