Logistic Regression Notes


  • 7/27/2019 Logistic Regression Notes

    1/50

    A COMPARISON OF MULTIPLE REGRESSION, LOGISTIC REGRESSION AND
    DISCRIMINATION FUNCTION IN CLASSIFICATION OF OBSERVATIONS

    by: Dr. Yap Bee Wah
    UNIVERSITI TEKNOLOGI MARA
    Faculty of Information Technology & Quantitative Sciences

    [Scatter plot of petal width (PETALWID) by iris TYPE: iris virginica,
    iris versicolor, iris setosa; and overlapping normal density curves.]

    Kolokium Statistik, 24 July 2004, Th 5, FTMSK

  • Slide 2/50

    OVERVIEW OF PRESENTATION

    Introduction

    Multiple Regression

    Logistic Regression

    Discriminant Function

    Methodology (Model Building and Evaluation Process)

    Results

    Conclusion

  • Slide 3/50

    Introduction:

    Two (2) pioneer studies:

    Efron (1975) studied the relative efficiency of logistic regression and
    normal discriminant analysis. He found that typically, logistic regression
    is between one half and two thirds as effective as normal discrimination.
    (Efron, B. (1975). The Efficiency of Logistic Regression Compared to
    Normal Discriminant Function Analysis. Journal of the American Statistical
    Association, Vol. 70, No. 352, Theory and Methods Section.)

    Press and Wilson (1978) compared logistic regression and parametric
    discriminant analysis and concluded that logistic regression is preferable
    to parametric discriminant analysis in cases for which the variables do
    not have multivariate normal distributions. However, for normal
    distributions, logistic regression is less efficient than parametric
    discriminant analysis. (Press, S. J. & Wilson, S. (1978). Choosing between
    logistic regression and discriminant analysis. Journal of the American
    Statistical Association, 73, 699-705.)

  • Slide 4/50

    Introduction to Multiple Linear Regression

    Multiple Linear Regression is a useful statistical modeling technique for
    describing the relationship between a response (dependent) variable and
    one or several predictor variables.

    When the response variable is dichotomous (2 categories) or polytomous
    (more than 2 categories), logistic regression or discriminant analysis is
    frequently used to model the relationship.

  • Slide 5/50

    Multiple Regression Model

    Consider k predictor variables; the multiple regression model is stated as
    follows:

    Y_i = β_0 + β_1 X_1i + β_2 X_2i + ... + β_k X_ki + ε_i,   ε_i ~ N(0, σ²)

    where β_0, β_1, ..., β_k are the regression coefficients. The response Y
    must be a quantitative variable.
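    The model above can be fitted by ordinary least squares. A minimal sketch
    on synthetic data (all values below are made up for illustration, not from
    the deck's data set):

    ```python
    import numpy as np

    # Minimal OLS sketch of Y_i = b0 + b1*X_1i + b2*X_2i + e_i, e_i ~ N(0, sigma^2).
    rng = np.random.default_rng(0)
    n = 200
    X = rng.normal(size=(n, 2))                        # two quantitative predictors
    beta_true = np.array([1.0, 2.0, -0.5])             # b0, b1, b2 (made up)
    y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

    Xd = np.column_stack([np.ones(n), X])               # design matrix with intercept column
    beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # least-squares estimates of b0, b1, b2
    ```

    With small noise and n = 200, beta_hat lands close to the coefficients
    that generated the data.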

  • Slide 6/50

    Regression: Research Application Example

    IS Faculty Research Productivity: Influential Factors and Implications
    by: Qing Hu & T. Grandon Gill (Florida Atlantic University)
    (Information Resources Management Journal, Vol 13, No 2, 2000)

    Response: Research productivity (annual rate of publication)

    Predictors:
    - Number of years in IS faculty
    - Percentage of time allocated for teaching
    - Percentage of time allocated for research
    - Percentage of time allocated for academic services
    - Type of degree

  • Slide 7/50

    Introduction to Logistic Regression

    Allows estimating the probability of an event happening.

    Useful for modeling data with a dichotomous dependent variable (Y)
    (e.g.: survive/die; purchase/do not purchase; pass/fail, etc.)

    Allows a mixture of quantitative and qualitative predictor variables (X).

  • Slide 8/50

    Application examples

    Dependent variable                 Independent variables

    Y = 1 if survive,                  X_1: age
        0 otherwise                    X_2: length of illness
                                       X_3: dosage level
                                       X_4: gender

    Y = 1 if settle credit card        X_1: age
        bills, 0 otherwise             X_2: number of credit cards
                                       X_3: income
                                       X_4: number of children
                                       X_5: gender

  • Slide 9/50

    Logit model, otherwise known as the logistic regression model

    For k explanatory variables and i = 1, 2, ..., n the model is

    log[ p_i / (1 - p_i) ] = β_0 + β_1 x_1i + β_2 x_2i + ... + β_k x_ki

    where p_i = P(Y_i = 1). The left-hand side is referred to as the logit or
    log-odds.

  • Slide 10/50

    We can solve the logit equation to obtain:

    Pr(Y = 1) = 1 / (1 + exp[-(β_0 + β_1 X_1 + β_2 X_2 + ... + β_k X_k)])

    In mathematical expression, this formula is called the logistic function
    and can be written as:

    f(z) = 1 / (1 + e^(-z)),   where z = β_0 + β_1 X_1 + ... + β_k X_k
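    The logistic function is a one-liner; a small sketch:

    ```python
    import math

    def logistic(z):
        """Logistic function f(z) = 1 / (1 + e^(-z))."""
        return 1.0 / (1.0 + math.exp(-z))

    # z = b0 + b1*X1 + ... + bk*Xk is the linear predictor (the log-odds);
    # f(0) = 0.5, and f(z) rises toward 1 as the log-odds grow.
    ```

    Note the symmetry f(z) + f(-z) = 1, which is why swapping the two
    categories of Y simply flips the signs of the coefficients.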

  • Slide 11/50

    Simple logit model

    Let Y and X_1 be defined as follows:

    Y   = 1 if develop lung cancer, 0 otherwise
    X_1 = 1 if smoker, 0 otherwise

    log[ P(Y = 1) / P(Y = 0) ] = β_0 + β_1 X_1

    Hence,

    log[ odds(Y = 1 | X_1 = 1) ] = β_0 + β_1
    log[ odds(Y = 1 | X_1 = 0) ] = β_0

    OR (odds ratio): a ratio of 2 odds,

    OR = odds(smoker) / odds(nonsmoker) = e^(β_0 + β_1) / e^(β_0) = e^(β_1)

  • Slide 12/50

    Interpretation of the odds-ratio

    If, for example, β_1 = 1.0986, then e^(1.0986) = 3, i.e. odds ratio = 3.

    This odds ratio (OR) indicates that a smoker has 3 times the odds of
    developing lung cancer compared to a nonsmoker.
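    Checking the arithmetic on the slide's example coefficient:

    ```python
    import math

    # The slide's example: a logit coefficient b1 = 1.0986 for the smoker indicator.
    b1 = 1.0986
    odds_ratio = math.exp(b1)   # e^1.0986, approximately 3

    # Going the other way: an odds ratio of 3 corresponds to b1 = ln(3) ≈ 1.0986.
    ```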

  • Slide 13/50

    Introduction to Discriminant Analysis

    An appropriate technique for classifying or separating individuals into
    different groups (dependent variable) based on a set of quantitative
    independent random variables.

    Involves deriving the linear combination of predictor variables (called
    the discriminant function) that will discriminate best between the given
    groups.

    The main objective of discriminant analysis is to predict group membership
    based on a set of quantitative variables.

    Assumptions: the predictor variables for each group have a multivariate
    normal distribution.


  • Slide 14/50

    Scatter Plot of Income vs Lotsize

    [Scatter plot of INCOME against LOTSIZE, points labelled by GROUP:
    owners vs nonowners.]

    Can we find a discriminant function based on income and lot size of house
    to predict if a house owner will or will not purchase a lawn mower?
    (Johnson and Wichern, Applied Multivariate Statistical Analysis, Wiley,
    2002).

  • Slide 15/50

    We can classify a new observation (x_0) using:

    1) Linear or quadratic discriminant functions
    2) Posterior probabilities:

    P(group k | x) = p_k f_k(x) / Σ_{i=1}^{g} p_i f_i(x)

    i.e. the probability that x comes from group k given that x was observed,
    where p_i is the prior probability of group i.
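    The posterior rule can be sketched with univariate normal densities f_i;
    the means, standard deviations and priors below are made-up illustrative
    values, not from the deck's data:

    ```python
    import math

    def posteriors(x, means, sds, priors):
        """Return [P(group i | x)] = p_i f_i(x) / sum_k p_k f_k(x)."""
        def npdf(x, m, s):  # univariate normal density
            return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        weighted = [p * npdf(x, m, s) for p, m, s in zip(priors, means, sds)]
        total = sum(weighted)
        return [w / total for w in weighted]

    post = posteriors(0.0, means=[-1.0, 1.0], sds=[1.0, 1.0], priors=[0.5, 0.5])
    # x = 0 is equidistant from both group means, so the posteriors are equal
    ```

    The new observation is allocated to the group with the largest posterior.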

  • Slide 16/50

    Classification for two (2) normal populations

    Homoscedastic case (when Σ_1 = Σ_2):

    Allocate an observation x_0 to π_1 if

    (x̄_1 - x̄_2)' S_pooled^{-1} x_0
        - (1/2)(x̄_1 - x̄_2)' S_pooled^{-1} (x̄_1 + x̄_2)
        ≥ ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]

    Otherwise allocate x_0 into π_2.

    The left-hand side is the linear discriminant function; c(1|2) and c(2|1)
    are costs of misclassification, and p_1, p_2 are prior probabilities.

    Note: assume c(1|2) = c(2|1) and p_1 = p_2 if they are unknown; hence
    ln(1) = 0.

    Source: Johnson & Wichern, 2002
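    A sketch of this allocation rule with equal costs and equal priors, so the
    right-hand side is ln(1) = 0 (the sample means and S_pooled below are made
    up for illustration):

    ```python
    import numpy as np

    x1bar = np.array([2.0, 3.0])      # sample mean vector, group 1 (made up)
    x2bar = np.array([1.0, 1.0])      # sample mean vector, group 2 (made up)
    S_pooled = np.array([[1.0, 0.2],
                         [0.2, 1.0]])

    a = np.linalg.solve(S_pooled, x1bar - x2bar)   # S_pooled^{-1} (x1bar - x2bar)
    m = 0.5 * a @ (x1bar + x2bar)                  # midpoint cutoff

    def allocate(x0):
        """Group 1 if the linear discriminant a'x0 - m >= 0, else group 2."""
        return 1 if a @ x0 - m >= 0 else 2
    ```

    Each group mean is, as expected, allocated to its own group, since the
    quadratic form (x̄_1 - x̄_2)' S_pooled^{-1} (x̄_1 - x̄_2) is positive.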

  • Slide 17/50

    Classification for two (2) normal populations

    Heteroscedastic case (when Σ_1 ≠ Σ_2):

    Allocate x_0 to π_1 if

    -(1/2) x_0' (S_1^{-1} - S_2^{-1}) x_0
        + (x̄_1' S_1^{-1} - x̄_2' S_2^{-1}) x_0 - k
        ≥ ln[ (c(1|2)/c(2|1)) (p_2/p_1) ]

    where

    k = (1/2) ln( |S_1| / |S_2| )
        + (1/2)( x̄_1' S_1^{-1} x̄_1 - x̄_2' S_2^{-1} x̄_2 )

    Otherwise allocate x_0 into π_2. The left-hand side is the quadratic
    discriminant function.

    Source: Johnson & Wichern, 2002

  • Slide 18/50

    Example: Admission into graduate programs based on GPA and GMAT

    Response variable (Y): 1 = admit, 2 = do not admit
    Independent variables (X): X_1 = undergraduate GPA, X_2 = GMAT score

    Sample sizes: n_1 = 31 (admit), n_2 = 28 (do not admit)
    Sample means: x̄_1 = (3.40, 561.23)',  x̄_2 = (2.48, 447.07)'

    The sample covariance matrices S_1, S_2 and the pooled inverse
    S_pooled^{-1} are as given in Johnson & Wichern (2002).

  • Slide 19/50

  • Slide 20/50

    Classification with several populations

    Allocate x_0 to π_k if the linear discriminant score d_k(x) is the largest
    of d_1(x), d_2(x), ..., d_g(x), where

    d_i(x) = x̄_i' S_pooled^{-1} x - (1/2) x̄_i' S_pooled^{-1} x̄_i + ln p_i,
             i = 1, 2, ..., g

    This is Fisher's discriminant function given in SPSS/SAS output, and
    assumes (1) equal covariance matrices.

    For (2) unequal covariance matrices, use the quadratic discriminant score

    d_i^Q(x) = -(1/2) ln|S_i| - (1/2)(x - x̄_i)' S_i^{-1} (x - x̄_i) + ln p_i,
               i = 1, 2, ..., g

    Source: Johnson & Wichern, 2002
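    The quadratic score can be sketched directly from its formula; the group
    means, covariance matrices and priors below are made-up illustrative
    inputs:

    ```python
    import numpy as np

    def quad_score(x, xbar, S, prior):
        """d_Q(x) = -0.5*ln|S| - 0.5*(x - xbar)' S^{-1} (x - xbar) + ln(prior)."""
        d = x - xbar
        return (-0.5 * np.log(np.linalg.det(S))
                - 0.5 * d @ np.linalg.solve(S, d)
                + np.log(prior))

    x0 = np.array([0.0, 0.0])
    s1 = quad_score(x0, np.array([0.0, 0.0]), np.eye(2), 0.5)        # group 1 score
    s2 = quad_score(x0, np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.5)  # group 2 score
    # allocate x0 to the group with the larger score (here, group 1)
    ```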

  • Slide 21/50

    Assessing the performance of the classification functions

    Error rate: percentage of observations misclassified.

                           Predicted Membership
    Actual Membership      Owners    Non-owners    Sample size
    Owners                 n_1c      n_1m          n_1
    Non-owners             n_2m      n_2c          n_2

    Error rate = (n_1m + n_2m) / (n_1 + n_2)
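    The error-rate formula above is a one-liner over the four table counts; the
    example counts below (a validation sample of 50 with 15 misses in one
    group and none in the other) are illustrative:

    ```python
    # Error rate from a 2x2 classification table:
    # n1c, n2c are correctly classified counts; n1m, n2m are misclassified.
    def error_rate(n1c, n1m, n2m, n2c):
        """(n1m + n2m) / (n1 + n2): share of misclassified observations."""
        return (n1m + n2m) / (n1c + n1m + n2m + n2c)

    e = error_rate(34, 0, 15, 1)   # (15 + 0) / 50 = 0.30
    ```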

  • Slide 22/50

    Comparing the performance of multiple regression, logistic regression, and
    discrimination functions in classification of observations

    These three statistical methods were applied to a data set to compare
    their predictive ability of classifying a baby as low birth weight or
    normal based on several predictor variables.

  • Slide 23/50

    Dependent variable: Y = Birth weight (g)

    Independent variables:
    X1 = Race (Malay, Chinese and Indian)    X6 = Abortion (yes, no)
    X2 = Gender (male, female)               X7 = Mother's height (cm)
    X3 = Mother's age (years)                X8 = Vitamin (mg)
    X4 = Father's income (RM)                X9 = Weight gain (kg)
    X5 = Parity (children)                   X10 = Antenatal visits (number of times)

    Data set (collected in 1997) courtesy of Hospital Kuala Lumpur

  • Slide 24/50

    Methodology (The Process of Developing and Evaluating the Models)

    1. Split the data into the training data set (n1 = 365) and the
       validation data set (n2 = 50).
    2. Build the model(s) using the training data set.
    3. Check the model adequacy using plots of residuals and other
       diagnostics. Are remedial measures needed? If yes, apply them and
       rebuild; if no, continue.
    4. Evaluate the performance of the models using the validation data set.
    5. Find the probabilities of misclassification: E1, E2 and E3.
    6. Compare the error rates E1, E2 and E3. Select the best model.

  • Slide 25/50

    SPSS Results (Multiple Linear Regression Analysis)

    ANOVA
    Model 1       Sum of Squares    df     Mean Square    F        Sig.
    Regression    16531299          4      4132824.770    15.500   .000
    Residual      95990816          360    266641.157
    Total         1.13E+08          364

    Coefficients
                  Unstandardized          Standardized
    Model 1       B           Std. Error  Beta     t        Sig.
    (Constant)    -1532.707   857.305              -1.788   .075
    PARITY        45.828      17.534      .131     2.614    .009
    MUM_HEIG      23.679      5.506       .210     4.300    .000
    WGHTGAIN      39.234      9.698       .210     4.046    .000
    ANT_VST       51.366      14.606      .178     3.517    .000

    Model Summary
    Model 1: R = .383, R Square = .147, Adjusted R Square = .137,
    Std. Error of the Estimate = 516.37

    All four predictor variables are significant.

  • Slide 26/50

    SPSS Results (Multiple Linear Regression)

    The final estimated regression function is:

    Birth Weight = -1532.707 + 45.828(Parity) + 23.679(Mother's Height)
                   + 39.234(Weight Gain) + 51.366(Antenatal Visits)
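    Plugging illustrative values into the estimated function (the mother below
    is made up, not a case from the data set):

    ```python
    # Predicted birth weight (g) from the estimated regression function.
    def predict_birth_weight(parity, mum_height_cm, weight_gain_kg, antenatal_visits):
        return (-1532.707 + 45.828 * parity + 23.679 * mum_height_cm
                + 39.234 * weight_gain_kg + 51.366 * antenatal_visits)

    # Made-up case: parity 2, mother 155 cm, 10 kg weight gain, 8 antenatal visits.
    w = predict_birth_weight(2, 155, 10, 8)   # ≈ 3032 g
    ```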

  • Slide 27/50

    Multiple Regression Results

    Interpretation of the estimated regression coefficients:

    1. For parity (b1 = 45.828): for every additional child in the family,
       the birth weight of babies will increase by approximately 46 g,
       holding mother's height, weight gain and antenatal visits constant.
    2. For mother's height (b2 = 23.679): the birth weight of babies will
       increase by approximately 24 g for every 1 cm increase in mother's
       height, holding parity, weight gain and antenatal visits constant.
       (Birth weight is higher for taller mothers.)
    3. For weight gain (b3 = 39.234): the birth weight of babies will
       increase by approximately 39 g for every 1 kg increase in weight gain,
       holding parity, mother's height and antenatal visits constant.
    4. For antenatal visits (b4 = 51.366): the birth weight of babies will
       increase by approximately 51 g for every one unit (time) increase in
       the number of antenatal visits, holding parity, mother's height and
       weight gain constant.

  • Slide 28/50

    Checking Model Adequacy Through Diagnostic Plots

    [Q-Q plot of residuals; plot of residuals against regression standardized
    predicted values.]

    Notes: Kolmogorov-Smirnov = 0.045, p-value = 0.077;
    Skewness = -0.153; Kurtosis = 0.048.

    No violation of the regression model assumptions of normal errors with
    constant variance.

  • Slide 29/50

    Evaluating Regression Model Performance Through Error Rate

    The estimated regression function is then used to predict the birth
    weight of the 50 observations in the validation sample.

    Predicted values below 2500 g were classified as low birth weight;
    otherwise, they were classified as normal birth weight.

    The following classification table gives the true and predicted
    categories obtained.

  • Slide 30/50

    Classification Table

                                Predicted
    Observed           Normal weight    Low weight    Total
    Normal weight      34               0             34
    Low weight         15               1             16
    Total              49               1             50

    Error rate for the Multiple Regression Model:
    E1 = (15 + 0) / 50 = 0.30

  • Slide 31/50

    APPLYING LOGISTIC REGRESSION

    Y = 1 if low birth weight (birth weight < 2500 g), 0 otherwise

    Independent variables
    X1 = Race (Malay, Chinese and Indian)    X6 = Abortion (yes, no)
    X2 = Gender (male, female)               X7 = Mother's height (cm)
    X3 = Mother's age (years)                X8 = Vitamin (mg)
    X4 = Father's income (RM)                X9 = Weight gain (kg)
    X5 = Parity (children)                   X10 = Antenatal visits (number of times)

  • Slide 32/50

    SPSS Results for Multiple Logistic Regression

    Step 1      B        S.E.    Wald      df    Sig.    Exp(B)
    WGHTGAIN    -.193    .052    13.666    1     .000    .824
    ANT_VST     -.194    .070    7.624     1     .006    .824
    ABORT(1)    .648     .308    4.428     1     .035    1.912
    MUM_HEIG    -.108    .028    15.136    1     .000    .898
    Constant    18.247   4.380   17.356    1     .000    8.4E+07

    The estimated logistic regression model obtained:

    P(Y_j = 1) = 1 / (1 + e^(-z_j))

    where

    z_j = 18.247 - 0.193(Weight gain) - 0.194(Antenatal visits)
          + 0.648(History of abortion) - 0.108(Mother's height)
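    Evaluating the fitted logit at illustrative predictor values (the case
    below is made up, not from the data set):

    ```python
    import math

    # Predicted probability of low birth weight from the fitted logit.
    def p_low(weight_gain, antenatal_visits, abortion, mum_height_cm):
        z = (18.247 - 0.193 * weight_gain - 0.194 * antenatal_visits
             + 0.648 * abortion - 0.108 * mum_height_cm)
        return 1.0 / (1.0 + math.exp(-z))

    # Made-up case: 10 kg gain, 8 visits, no abortion history, mother 155 cm.
    p = p_low(10, 8, 0, 155)   # z ≈ -1.975, so p ≈ 0.12
    ```

    Since the coefficient of abortion history is positive (+0.648), the same
    case with abortion = 1 gets a higher predicted probability.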

  • Slide 33/50

    Interpretation of the odds-ratios

    1. For weight gain: the odds ratio (0.824) means that for every 1 kg
       increase in weight gain, the odds of low birth weight will decrease.
    2. For antenatal visits: the odds ratio (0.824) indicates that when a
       mother increases antenatal visits by 1, the odds of low birth weight
       will decrease.
    3. For abortion: the odds ratio (1.912) indicates that a mother who has
       had abortion(s) has approximately 2 times the odds of having a baby
       with low birth weight compared to those who have no history of
       abortion(s).
    4. For mother's height: the odds ratio (0.898) indicates that the odds of
       low birth weight are lower for mothers who are taller.

  • Slide 34/50

  • Slide 35/50

    Evaluating the performance of the logistic regression model

    The estimated logistic function is then used to predict the 50
    observations in the validation data set.

    If P(Y_j = 1) ≥ 0.5, j = 1, 2, ..., 50,

    we classify the observation as belonging to π_1 (low birth weight).

  • Slide 36/50

    Error Rate for the Logistic Regression Model

                                Predicted
    Observed           Normal weight    Low weight    Total
    Normal weight      33               1             34
    Low weight         11               5             16
    Total              44               6             50

    E2 = (11 + 1) / 50 = 0.24

  • Slide 37/50

    Discriminant Analysis (Checking the assumption of multivariate normal
    distribution)

    Variables          Normal birth weight     Low birth weight
    Mother's age       Approximately Normal    Approximately Normal
    Father's income    Nonnormal               Nonnormal
    Parity             Approximately Normal    Approximately Normal
    Mother's height    Approximately Normal    Approximately Normal
    Vitamin            Approximately Normal    Approximately Normal
    Weight gain        Approximately Normal    Approximately Normal
    Antenatal visits   Approximately Normal    Approximately Normal

  • Slide 38/50

    Chi-square plots for checking multivariate normality

    [Chi-square plots: Mahalanobis distance against chi-square quantiles for
    the Low Birth Weight Group and the Normal Birth Weight Group.]

    The chi-square plots indicate both groups have approximately multivariate
    normal distributions.

  • Slide 39/50

    Discriminant Analysis Results

    Box's M Test of Equality of Covariance Matrices

    H_0: Σ_1 = Σ_2
    H_1: Σ_1 ≠ Σ_2

    Box's M    12.83
    F          2.11
    df 1       6
    df 2       217363
    Sig.       0.049

    We can assume equal covariance matrices, so use Fisher's linear
    discriminant function.

  • Slide 40/50

    SPSS Output (Discriminant Functions)

    Classification Function Coefficients (Fisher's linear discriminant
    functions)

    WEI_CODE      0 normal weight    1 low weight
    MUM_HEIG      6.699              6.601
    WGHTGAIN      .397               .238
    ANT_VST       2.965              2.797
    (Constant)    -532.689           -515.224

    Normal birth weight category:
    d_1 = -532.689 + 6.699(mother's height) + 0.397(weight gain)
          + 2.965(antenatal visits)

    Low birth weight category:
    d_2 = -515.224 + 6.601(mother's height) + 0.238(weight gain)
          + 2.797(antenatal visits)
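    Scoring a new case with the two classification functions: the case is
    assigned to the category whose function gives the larger score (the two
    example cases below are made up for illustration):

    ```python
    # Fisher's classification functions from the SPSS output.
    def d_normal(height_cm, weight_gain, antenatal_visits):
        return -532.689 + 6.699 * height_cm + 0.397 * weight_gain + 2.965 * antenatal_visits

    def d_low(height_cm, weight_gain, antenatal_visits):
        return -515.224 + 6.601 * height_cm + 0.238 * weight_gain + 2.797 * antenatal_visits

    def classify(height_cm, weight_gain, antenatal_visits):
        """'normal' or 'low' by the larger classification score."""
        args = (height_cm, weight_gain, antenatal_visits)
        return "normal" if d_normal(*args) >= d_low(*args) else "low"

    # Made-up cases: (155 cm, 10 kg, 8 visits) scores as normal weight;
    # (145 cm, 2 kg, 2 visits) scores as low weight.
    ```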

  • Slide 41/50

    Discriminant Analysis Results (Cont'd)

    Classification Results

                                          Predicted Group Membership
                    WEI_CODE              normal weight  low weight  Total
    Original        normal weight  Count  170            96          266
                    low weight     Count  28             71          99
                    normal weight  %      63.9           36.1        100.0
                    low weight     %      28.3           71.7        100.0
    Cross-validated normal weight  Count  169            97          266
                    low weight     Count  28             71          99
                    normal weight  %      63.5           36.5        100.0
                    low weight     %      28.3           71.7        100.0

    Cross-validation error rate of the model = (97 + 28) / 365 = 0.34

  • Slide 42/50

    Evaluating the performance of the discriminant functions

    The estimated discriminant functions are then used to predict the group
    membership of the 50 observations in the validation data set.

    If d_2(x_j) ≥ d_1(x_j), we classify the observation into π_2 (low birth
    weight).

  • Slide 43/50

    Evaluate Discriminant Functions' Performance Through Error Rate

    Classification Table

                                Predicted
    Observed           Normal weight    Low weight    Total
    Normal weight      22               12            34
    Low weight         6                10            16
    Total              28               22            50

    E3 = (12 + 6) / 50 = 0.36

  • Slide 44/50

    Summary of the Models' Performances

    Statistical model                 Significant variables          Error rate
    1. Multiple linear regression     1. Mother's height             0.30
                                      2. Weight gain
                                      3. Antenatal visits
                                      4. Parity
    2. Multiple logistic regression   1. Mother's height             0.24
                                      2. Weight gain
                                      3. Antenatal visits
                                      4. History of abortion(s)
    3. Discriminant analysis          1. Mother's height             0.36
                                      2. Weight gain
                                      3. Antenatal visits

    Note: comparing the models with the same significant predictor variables,
    the error rates for Multiple Regression, Logistic Regression and
    Discriminant Analysis are 0.28, 0.26 and 0.36 respectively.

  • Slide 45/50

    Conclusion of Study

    The significant predictor variables affecting birth weight of babies are
    weight gain, number of antenatal visits, parity, mother's height and
    history of abortions.

    The logistic regression model is found to be the best model in this study
    as it has the lowest error rate.

  • Slide 46/50

    Some interesting research papers

    (1) Logistic Regression for Data Mining and High-Dimensional
        Classification
        Paul Komarek (PhD thesis, Carnegie Mellon University, 2004, 138 pages)
        (www.autonlab.org/autonweb/showPaper.jsp?ID=komarek:Ir_thesis)

    (2) Predicting Housing Value: A Comparison of Multiple Regression and
        Artificial Neural Networks
        Nghiep Nguyen & Al Cripps
        Journal of Real Estate Research, Vol 22, p313-336, 2001.
  • Slide 47/50

    Some interesting research papers

    (3) Application of f-regression to fuzzy classification problem
        Boris Izyumov
        Proceedings of 3rd International Conference on Fuzzy Logic and
        Technology (EUS, 2003), Zittau, Germany (2003), pp781-766

    (4) Assessing and Predicting Information and Communication Technology
        Literacy in Education Undergraduates
        JoAnne Davies (PhD thesis, Department of Educational Psychology,
        Edmonton, Alberta, 2002)
  • Slide 48/50

    Some interesting research papers

    (5) Discriminant Analysis for recognition of human face images
        Kamran Etemad & Rama Chellapa
        J. Optical Soc. of America, Vol 14, No 8, 1997
  • Slide 49/50

    FACULTY OF INFORMATION TECHNOLOGY & QUANTITATIVE SCIENCES, UiTM
  • Slide 50/50

    [Scatter plot of petal width (PETALWID) by iris TYPE: iris virginica,
    iris versicolor, iris setosa; and normal density curves.]

    FACULTY OF INFORMATION TECHNOLOGY & QUANTITATIVE SCIENCES, UiTM