LOGISTIC REGRESSION APPLICATIONS
AND CLUSTER ANALYSIS
by
JENNIFER KRISTI PETERSON, B.A.
A THESIS
IN
STATISTICS
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
December, 1998
ACKNOWLEDGMENTS
Thanks goes out to my thesis advisor, Dr. Duran, for his advice, training, and
support. He always had a kind word to say, even when things veered a little off course.
Thanks also to Dr. Mansouri, for serving on my committee, and posing his questions
in a timely manner. Dr. Duran and Dr. Mansouri were also terrific professors
who assigned numerous projects, and gave challenging tests, but encouraged me to
continue working and to do my best. For that, I graciously thank them both. Thank
you to Dr. Bennett, my graduate advisor, who always seemed happy to see me, and
even when he was extremely busy, would take the time to ask how things were going.
I also want to thank my family and friends. To my parents, grandparents, and
younger brother, thanks for the support and encouragement from afar. I especially
wish to acknowledge those close friends who acted as my "second" family. To all who
helped by reading through drafts, and correcting whatever part they could, I truly
appreciate all that you have done. I would like to also acknowledge the SAS Institute
Inc., for permission to reproduce and analyze their data sets. In conclusion, thank
you to everyone who has helped me grow throughout my graduate school experience.
CONTENTS
ACKNOWLEDGMENTS ii
LIST OF TABLES iv
LIST OF FIGURES v
I. INTRODUCTION 1
II. BACKGROUND AND PRELIMINARIES 3
III. SOME APPLICATIONS OF LOGISTIC REGRESSION 12
IV. REGRESSION ON TWO DATA SETS 17
4.1 Summary Statistics of Data Sets 17
4.2 Logistic Regression on the Data Sets 23
4.2.1 Logistic Regression on DIABETES Data Set 23
4.2.2 Logistic Regression on PROSTATE Data Set 28
4.3 Linear Regression on the Data Sets 29
4.3.1 Linear Regression on DIABETES Data Set 29
4.3.2 Linear Regression on PROSTATE Data Set 32
4.4 Comparison of Logistic and Linear Regression Analyses 34
V. CLUSTER ANALYSIS 41
5.1 What is Cluster Analysis? 41
5.2 Cluster Analysis on the DIABETES Data Set 46
VI. CONCLUSION 54
REFERENCES 56
APPENDIX A: SAS CODE FOR DIABETES DATA SET TO GENERATE VARIOUS REGRESSION RESULTS 59
APPENDIX B: SAS CODE TO GENERATE CLUSTER ANALYSIS RESULTS 65
LIST OF TABLES
4.1 Description of Variables in DIABETES Data Set 18
4.2 Descriptive Statistics for DIABETES Data 19
4.3 Correlation Matrix for Overall DIABETES Data 20
4.4 Description of Variables in PROSTATE Data Set 21
4.5 Descriptive Statistics for PROSTATE Data 22
4.6 Correlation Matrix for Overall PROSTATE Data 23
4.7 Best Logistic and Linear Regression Model for Each Data Set 35
LIST OF FIGURES
4.1 Distribution of p̂ For Best Logistic Model on DIABETES Data 37
4.2 Distribution of p̂ For Best Logistic Model on PROSTATE Data 38
4.3 Distribution of p̂ For Best Linear Model on DIABETES Data 39
4.4 Distribution of p̂ For Best Linear Model on PROSTATE Data 40
5.1 Diabetes Cluster Analysis Plot of GLUFAST*GLUTEST 49
5.2 Diabetes Cluster Analysis Plot of SSPG*GLUTEST 50
5.3 Diabetes Cluster Analysis Plot of RELWT*GLUTEST 51
5.4 Diabetes Cluster Analysis Plot of INSTEST*SSPG 52
CHAPTER I
INTRODUCTION
Logistic regression is a mathematical modeling approach in which the best-fitting,
yet least-restrictive model is desired to describe the relationship between several
independent explanatory variables and a dependent dichotomous response variable.
In many regression applications the response or dependent variable of interest is
continuous, and therefore, can take on an infinite number of values with no upper
or lower bounds. Researchers determine the importance of each of the independent
explanatory variables in predicting the response variable. Then, they generate a
model based on their findings and evaluate the appropriateness of the model
using different statistical measures, such as goodness of fit tests. If the model is
successful, it can be used to predict the mean response of the response variable for
a large range of conditions. When the response variable is categorical or dichotomous,
this least squares linear regression approach should be replaced by logistic
regression or some other categorical data modeling technique. The fundamental
difference between logistic regression and least squares linear regression is that the
response variable is constrained to a limited number of integer values. A dependent
dichotomous response variable, with values limited to 0 or 1, is the most common
one. One reason for using logistic regression analysis is that it offers a technique
to solve problems within the familiar context of multiple linear regression analysis.
There are, of course, differences between the two procedures, especially in setting
up the parametric model and considering the underlying assumptions, but generally,
the same principles apply.
The objectives of this thesis are (1) to give a brief overview of the logistic regression
model, (2) review some applications of the model, (3) present a comparison of the
logistic regression model with the standard multiple regression model via examples,
and (4) consider the use of cluster analysis as an aid in determining the groupings for
a logistic regression analysis.
This paper contains an overview of logistic regression and a brief discussion
of cluster analysis along with applications of each. Chapter II covers the basic
preliminary ideas surrounding logistic regression including how it differs from linear
regression. Different estimation techniques are also discussed. Chapter III considers
various areas of logistic regression through summaries of four concrete applications.
Chapter IV specifically considers the analysis of two data sets by beginning with a
preliminary analysis and exploring different model selection techniques to find the
most appropriate model. These same concepts are reconsidered when interactions
among the variables are also included.
At some point it is useful to investigate whether or not the sample actually has
an underlying separation into groups or clusters. This investigation is executed using
different clustering procedures, many of which are based on the Euclidean definition
of distance. The different characteristics of an individual pinpoint its coordinate in
k-dimensional Euclidean space, and then distances between individuals and groups of
individuals are defined. A more thorough evaluation of cluster analysis is addressed
in Chapter V, where several variations of the procedure are discussed. This chapter
also includes an example of clustering which verifies the underlying groups assumed
in the logistic regression of one of the data sets in Chapter IV. A summary of results
and conclusions is contained in Chapter VI.
CHAPTER II
BACKGROUND AND PRELIMINARIES
In this chapter, the differences between logistic regression and linear regression
analyses will be explained using the corresponding mathematical models. Also
included is a generalized description of logistic regression, risk and odds ratios and
corresponding parameter estimation methods.
In least squares linear regression, the dependent response variable $Y$ is
conditioned on the given value of the vector $\mathbf{x}$ of $k$ independent explanatory variables,
$\mathbf{x} = (x_1, x_2, \ldots, x_k)$. This relationship is expressed as
$$E(Y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k. \tag{2.1}$$
With the dichotomous property of the response variable, $Y$, the conditional mean,
$E(Y_i \mid \mathbf{x})$, must take on values equal to or between 0 and 1, where $Y_i$ is a Bernoulli
random variable with $P(Y_i = 1) = \pi$ and $P(Y_i = 0) = 1 - \pi$, for $i = 1, 2, \ldots, n$ with
$n$ samples. Substitution gives $E(Y_i) = 1(\pi) + 0(1 - \pi) = \pi$, or $E(Y_i \mid \mathbf{x}) = \pi$. Thus, the
response variable $E(Y_i)$ is the probability that $Y_i = 1$ given a particular vector $\mathbf{x}$. The
required assumption that $\pi$ be constrained such that $0 \le E(Y) = \pi \le 1$ cannot be
met using the usual linear regression model. Therefore, the use of the logistic model,
where the probabilities 0 and 1 are reached asymptotically, is much more appropriate.
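The boundary problem above can be seen numerically. The following Python sketch uses entirely made-up coefficients (the thesis fits its models in SAS, not Python): a least-squares line fitted to a 0/1 response can produce "probabilities" outside $[0, 1]$, while the logistic curve cannot.

```python
import numpy as np

# Illustrative sketch with hypothetical coefficients, not estimates from the
# thesis data: a fitted least-squares line can predict outside [0, 1] for a
# 0/1 response, while the logistic curve stays strictly between 0 and 1.
x = np.array([-3.0, 0.0, 3.0])

b0, b1 = 0.5, 0.4   # linear model:   E(Y|x) = b0 + b1*x
a, b = 0.0, 1.2     # logistic model: pi(x) = 1 / (1 + exp(-(a + b*x)))

linear_pred = b0 + b1 * x
logistic_pred = 1.0 / (1.0 + np.exp(-(a + b * x)))

print(linear_pred)    # the endpoints fall below 0 and above 1
print(logistic_pred)  # every value lies strictly in (0, 1)
```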
The parametric model of logistic regression is based on the logistic distribution.
The logistic cumulative distribution function (c.d.f.) is given by
$$F(y) = \frac{1}{1 + \exp[-y]}, \qquad -\infty < y < \infty, \tag{2.2}$$
and the logistic probability density function (p.d.f.) is given by
$$f(y) = \frac{\exp[-y]}{(1 + \exp[-y])^2}, \qquad -\infty < y < \infty. \tag{2.3}$$
With the inclusion of location and scale parameters, $\mu$ and $\sigma$, the c.d.f. and p.d.f.
become, respectively,
$$F(y) = \frac{1}{1 + \exp\left[-\left(\frac{y - \mu}{\sigma}\right)\right]} \tag{2.4}$$
and
$$f(y) = \frac{\exp\left[-\left(\frac{y - \mu}{\sigma}\right)\right]}{\left(1 + \exp\left[-\left(\frac{y - \mu}{\sigma}\right)\right]\right)^2} \cdot \frac{1}{\sigma}. \tag{2.5}$$
Generally there is a response variable of primary interest that depends upon
$k$ independent explanatory variables $x_1, \ldots, x_k$, for which $\alpha$ becomes the location
parameter and the $\beta_j$, for $j = 1, \ldots, k$, become the scale parameters.
Even though there are different approaches to the estimation of the logistic model,
the general response function or logistic regression function is always given by
$$\pi(\mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\alpha + \sum_j \beta_j x_j\right)\right]}, \tag{2.6}$$
which can be found by algebraically manipulating the c.d.f.
$$F(y) = \frac{1}{1 + \exp[-y]} = \frac{\exp[y]}{1 + \exp[y]} \tag{2.7}$$
and substituting the alternative location and scale parameters suggested above. The
conditional distribution of $y = \pi(\mathbf{x}) + \varepsilon$ is binomial with $p = \pi(\mathbf{x})$.
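As a minimal sketch of the response function (2.6), the Python fragment below evaluates $\pi(\mathbf{x})$ for an arbitrary, purely illustrative choice of $\alpha$ and $\beta_j$ (these are not estimates from any data set in this thesis) and checks numerically that the two algebraic forms of the c.d.f. in (2.7) agree.

```python
import numpy as np

# Sketch of the logistic response function pi(x); the coefficient values are
# arbitrary illustrations, not estimates from the thesis data.
def pi(x, alpha, beta):
    """pi(x) = 1 / (1 + exp[-(alpha + sum_j beta_j * x_j)])."""
    z = alpha + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

alpha = -1.0
beta = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

p = pi(x, alpha, beta)

# the two algebraic forms of the c.d.f. (2.7) agree:
z = alpha + beta @ x
assert np.isclose(p, np.exp(z) / (1.0 + np.exp(z)))
print(p)
```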
Given specific data values, the model parameters, $\alpha$ and $\beta_j$, can be "fit" using some
method of estimation to find the point estimates, $\hat{\alpha}$ and $\hat{\beta}_j$. These point estimates,
when substituted into the logistic model
$$P(Y = 1 \mid x_1, \ldots, x_k) = \hat{P}(\mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\hat{\alpha} + \sum_j \hat{\beta}_j x_j\right)\right]}, \tag{2.8}$$
form the predicted risk, or the estimated probability of disease, $\hat{P}(\mathbf{x})$. Notice that this
is simply the logistic c.d.f. with alternate location and scale parameters included.
The risk ratio (RR) is formed when the predicted risk of one individual, separated
by only one dichotomous explanatory variable, is compared to that of another individual.
The effect of that particular variable is shown by comparing it to the base risk. This
results in the formula
$$RR = \frac{\hat{P}(\mathbf{x}_1)}{\hat{P}(\mathbf{x}_0)}, \tag{2.9}$$
where $\hat{P}(\mathbf{x}_i)$ is the predicted risk of an individual whose dichotomous variable takes
the value $i$.
Unfortunately, this method of estimating the risk ratio is restricted to follow-
up studies as opposed to cross-sectional or case-control studies. In the design of a
follow-up study, the explanatory variables are observed followed by the observation
of the response variable. In the design of a cross-sectional study, subjects are sampled
and simultaneously classified according to the response variable and explanatory
characteristics. A case-control study looks into the past to find the information on
the individuals. Additionally, the risk ratio requires that the explanatory variables
must be known and specified, not just held constant. If either of these conditions are
not met, the risk ratio cannot be determined directly, and some alternative approach
must be used.
Effects in the logistic model refer to odds, or the likelihood that a particular
situation will occur. The estimated odds for the individual specified by $\mathbf{x}$ is the
probability that the event will occur divided by the probability it will not occur,
$$\text{Odds for } \mathbf{x} = \frac{\hat{P}(\mathbf{x})}{1 - \hat{P}(\mathbf{x})}. \tag{2.10}$$
The ratio of odds, called the odds ratio, is a measure of association comparing the
odds of two individuals, that is,
$$\text{Odds ratio} = OR = \frac{\text{odds for } \mathbf{x}_1}{\text{odds for } \mathbf{x}_0}. \tag{2.11}$$
A major advantage of the odds ratio is that it is the only measure of association
directly estimated from the logistic model that does not require any special assumptions
regarding the study design. The use of the odds ratio requires only the assumption
that $OR$ is a good approximation for the risk ratio. This approximation is accurate for
"rare" response variables (i.e., response variables which occur with a low probability).
The logistic regression function is not a linear function; however, it can be linearized
by applying the logit transformation to it. By definition, the logit transformation is
the log of the odds for a particular vector $\mathbf{x}$, that is,
$$\text{logit } \hat{P}(\mathbf{x}) = \ln\left[\frac{\hat{P}(\mathbf{x})}{1 - \hat{P}(\mathbf{x})}\right] = \ln\left[\frac{1/(1 + \exp[-z])}{\exp[-z]/(1 + \exp[-z])}\right] = z. \tag{2.12}$$
In the case of logistic regression, the logit transformation (2.12), with $z = \alpha + \sum_j \beta_j x_j$,
is used as the logistic regression model since it has a simple reduced form that is similar
to the general model used for usual linear regression (2.1).
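The linearization can be confirmed numerically. This sketch, with invented coefficients (not values from the thesis), applies the log-odds to the logistic response and recovers the linear predictor, as the logit transformation (2.12) promises.

```python
import numpy as np

# Sketch of the logit transformation: applying log-odds to the logistic
# response recovers the linear predictor. Coefficients here are illustrative.
alpha, beta = -0.5, np.array([1.0, 2.0])
x = np.array([0.3, -0.7])

z = alpha + beta @ x                  # linear predictor
p = 1.0 / (1.0 + np.exp(-z))          # logistic response
logit_p = np.log(p / (1.0 - p))       # log of the odds

# logit P(x) = alpha + sum_j beta_j * x_j
assert np.isclose(logit_p, z)
```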
The parameters in the logistic regression model have specific interpretations.
There are two interpretations for $\alpha$. The first interpretation is that $\alpha$ is the log
odds for an individual having $x_j = 0$ for all $k$ explanatory variables. This is usually
unproductive because it is not meaningful to give explanatory variables, such as
weight or age, the value of zero. The alternative interpretation for $\alpha$ is that it is
the background or baseline odds. In other words, it is a baseline risk in which all
explanatory variables are ignored. The interpretation for $\beta_j$ is the change in log odds,
or logit, when the change in $x_j$ is 1 but all other $x_j$'s are fixed.
The specific odds ratio for the logistic c.d.f. can be obtained when the logistic
model, $\hat{P}(\mathbf{x})$, is applied to find the odds for the two groups, $\mathbf{x}_1$ and $\mathbf{x}_0$. The resulting
formula is called the risk odds ratio (ROR), since the probabilities in the odds ratio
are all defined as risks, and is given by
$$ROR(\mathbf{x}_1, \mathbf{x}_0) = \frac{\hat{P}(\mathbf{x}_1)/[1 - \hat{P}(\mathbf{x}_1)]}{\hat{P}(\mathbf{x}_0)/[1 - \hat{P}(\mathbf{x}_0)]} = \frac{\exp\left[-\left(\hat{\alpha} + \sum_j \hat{\beta}_j x_{0j}\right)\right]}{\exp\left[-\left(\hat{\alpha} + \sum_j \hat{\beta}_j x_{1j}\right)\right]} = \prod_j \exp[\hat{\beta}_j (x_{1j} - x_{0j})]. \tag{2.13}$$
Therefore, using the logistic model there is a multiplicative contribution of the
explanatory variables to the odds ratio. A model other than the logistic regression
model might have an alternative contribution of the variables.
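The multiplicative form of (2.13) is easy to verify numerically. In the sketch below, all coefficients and covariate vectors are invented for illustration; the odds ratio computed directly from the fitted probabilities agrees with the product over $\exp[\hat{\beta}_j (x_{1j} - x_{0j})]$.

```python
import numpy as np

# Sketch of the risk odds ratio with illustrative coefficients: the ratio of
# odds for two covariate vectors reduces to a product of exp[beta_j * (x1j - x0j)].
alpha = 0.2
beta = np.array([0.7, -0.3])
x1 = np.array([1.0, 2.0])   # the two vectors differ only in their first entry
x0 = np.array([0.0, 2.0])

def risk(x):
    return 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))

odds = lambda x: risk(x) / (1.0 - risk(x))
ror_from_odds = odds(x1) / odds(x0)
ror_multiplicative = np.prod(np.exp(beta * (x1 - x0)))

assert np.isclose(ror_from_odds, ror_multiplicative)
```

Because only the first coordinate changes, the whole product collapses to the single factor $\exp(\hat{\beta}_1)$, which is the usual per-unit odds-ratio interpretation of a coefficient.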
As mentioned previously, some method of estimation must be used to find point
estimates for $\alpha$ and $\beta_j$ for $j = 1, 2, \ldots, k$. In maximum likelihood estimation, the
leading method, one finds the maximum likelihood estimates (MLEs) of the parameters
by taking the partial derivatives of the log likelihood function with respect to
the parameters. Setting these partial derivatives equal to zero allows the resulting
equations to be solved simultaneously for the estimates. In the case of the logistic
model, we need to maximize the likelihood function based on $\pi(\mathbf{x})$. First, consider the
specific case of only one explanatory variable, where $\mathbf{x} = x_i$. The pair of values, $(x_i, y_i)$,
contributes to the likelihood function by
$$\ell(x_i) = [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}, \qquad i = 1, 2, \ldots, n, \tag{2.14}$$
where $n$ is the sample size. Therefore, the log likelihood function is
$$L(\alpha, \beta) = \ln[\ell(\alpha, \beta)] = \ln\left[\prod_i \ell(x_i)\right]. \tag{2.15}$$
After taking partials with respect to each of the two parameters, $\alpha$ and $\beta$, the
equations to solve simultaneously are
$$\sum_i [y_i - \pi(x_i)] = 0 \qquad \text{and} \qquad \sum_i x_i [y_i - \pi(x_i)] = 0. \tag{2.16}$$
Using a generalized iteratively reweighted least squares procedure (such as Newton-Raphson),
solutions can be found. An interesting consequence of the above equations
is that
$$\sum_i y_i = \sum_i \pi(x_i), \tag{2.17}$$
that is, the sum of the observed values is equal to the sum of the predicted values.
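A minimal Newton-Raphson fit makes both the score equations (2.16) and the consequence (2.17) concrete. The tiny data set below is made up purely for illustration and is unrelated to the thesis's data; the thesis's own computations use SAS.

```python
import numpy as np

# Sketch of maximum likelihood fitting by Newton-Raphson for one explanatory
# variable; the tiny data set is invented purely for illustration.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

X = np.column_stack([np.ones_like(x), x])   # design matrix rows [1, x_i]
theta = np.zeros(2)                         # parameters (alpha, beta)

for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))    # pi(x_i) at current parameters
    grad = X.T @ (y - p)                    # the score equations (2.16)
    W = p * (1.0 - p)
    hess = -(X.T * W) @ X                   # Hessian of the log likelihood
    theta = theta - np.linalg.solve(hess, grad)

p = 1.0 / (1.0 + np.exp(-X @ theta))
# consequence (2.17): the sum of observed values equals the sum of fitted values
assert np.isclose(y.sum(), p.sum())
```

At convergence the gradient is (numerically) zero, so both equations in (2.16) hold, and summing the first one gives (2.17) directly.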
For the general case where there are multiple ($k$) explanatory variables, the
method is similar, and only the resulting equations to solve simultaneously will vary.
There will be double subscripts, one for the sample size, $i$, and one for the explanatory
variables, $j$. Also, the partials will have to be taken with respect to the vector $\boldsymbol{\beta}$,
instead of the single parameter $\beta$ as in equation (2.15). After taking the partials, the
$(k + 1)$ equations to solve simultaneously are
$$\sum_{i=1}^n [y_i - \pi(\mathbf{x}_i)] = 0$$
and
$$\sum_{i=1}^n [y_i - \pi(\mathbf{x}_i)] x_{ij} = 0 \qquad \text{for } j = 1, 2, \ldots, k, \tag{2.18}$$
where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$. Then the corresponding interesting result is
$$\sum_i y_i = \sum_i \pi(\mathbf{x}_i), \tag{2.19}$$
which is similar to equation (2.17).
The information matrix, whose inverse provides the asymptotic covariance matrix,
is formed by taking the negative second derivatives of the log likelihood function with
respect to all distinct pairs of the location and scale parameters. For the ML method
with the vector of parameters $[\alpha, \beta_1, \beta_2, \ldots, \beta_k]$, the second partials are
$$\frac{\partial^2 L(\alpha, \boldsymbol{\beta})}{\partial \alpha^2} = -\sum_{i=1}^n [1 - \pi(\mathbf{x}_i)][\pi(\mathbf{x}_i)], \qquad \frac{\partial^2 L(\alpha, \boldsymbol{\beta})}{\partial \alpha \, \partial \beta_p} = -\sum_{i=1}^n x_{ip} [1 - \pi(\mathbf{x}_i)][\pi(\mathbf{x}_i)], \tag{2.20}$$
and
$$\frac{\partial^2 L(\alpha, \boldsymbol{\beta})}{\partial \beta_p \, \partial \beta_q} = -\sum_{i=1}^n x_{ip} x_{iq} [1 - \pi(\mathbf{x}_i)][\pi(\mathbf{x}_i)], \tag{2.21}$$
where $p$ and $q$, respectively, identify the entry's row and column in the matrix, $\beta_p$
represents the $(p + 1)$st element in the vector of parameters, and $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$.
Recall that $i = 1, 2, \ldots, n$ indexes the individuals while $j = 1, 2, \ldots, k$ indexes the
explanatory variables.
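The weighted-sum structure of the entries in (2.20)-(2.21) can be assembled in a few lines. In this sketch the parameter values and the single-covariate design are invented for illustration; the information matrix is built as $Z^\top \operatorname{diag}(W) Z$, where $Z$ has rows $(1, x_{i1}, \ldots, x_{ik})$ and $W_i = \pi(\mathbf{x}_i)(1 - \pi(\mathbf{x}_i))$.

```python
import numpy as np

# Sketch of the information matrix entries: every second partial reduces to a
# weighted sum with weights pi(x_i)(1 - pi(x_i)). All values are illustrative.
alpha, beta = 0.1, np.array([0.5])
X = np.array([[0.0], [1.0], [2.0], [3.0]])      # x_i1 for n = 4 individuals

Z = np.column_stack([np.ones(len(X)), X])       # rows (1, x_i1)
p = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
W = p * (1.0 - p)

# information matrix: entry (p, q) is sum_i z_ip * z_iq * pi(x_i)(1 - pi(x_i))
info = (Z.T * W) @ Z

assert np.isclose(info[0, 0], W.sum())               # the alpha-alpha entry
assert np.isclose(info[0, 1], (X[:, 0] * W).sum())   # the alpha-beta entry
```

The matrix is symmetric by construction, mirroring the fact that the mixed partials in (2.20) do not depend on the order of differentiation.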
Two additional estimation methods, minimum $\chi^2$ and minimum-logit $\chi^2$, provide
alternatives to the ML method. The minimum $\chi^2$ method is mostly used in bio-assay
applications where the explanatory variables, $x_j$ for $j = 1, \ldots, k$, represent the $k$
dose levels with $n_j$ test subjects at each level. The value $y_j$ represents the number
of positive responses at the respective level among $x_1, x_2, \ldots, x_k$. So $Y_j$ has a binomial
distribution with parameters $n_j$ and $P(x_j) = (1 + \exp[-(\alpha + \beta_j x_j)])^{-1}$. In minimum
$\chi^2$ estimation, the values $\hat{\alpha}$ and $\hat{\beta}_j$ are found by minimizing the $\chi^2$ statistic,
$$\chi^2 = \sum_j \frac{n_j (\hat{P}(x_j) - P(x_j))^2}{P(x_j)(1 - P(x_j))}, \tag{2.22}$$
where $\hat{P}(x_j) = \frac{y_j}{n_j}$. As in the ML method, the simultaneous equations to be solved
are
$$\sum_j \frac{n_j (\hat{P}(x_j) - P(x_j))}{P(x_j)(1 - P(x_j))} = 0$$
and
$$\sum_j \frac{n_j x_j (\hat{P}(x_j) - P(x_j))}{P(x_j)(1 - P(x_j))} = 0. \tag{2.23}$$
Since the coefficients of these equations are functions of $P(x_j)$, which itself is an
estimate and is not linear in the parameters, the minimum $\chi^2$ method, like the maximum
likelihood method, generally requires iterative techniques to find the solutions of the
simultaneous equations (Berkson, 1955).
The minimum-logit $\chi^2$ estimates, $\hat{\alpha}$ and $\hat{\beta}_j$, are found by minimizing the logit $\chi^2$
statistic, defined by
$$\text{logit } \chi^2 = \sum_j n_j \hat{P}(x_j)(1 - \hat{P}(x_j)) \left[\text{logit } \hat{P}(x_j) - \text{logit } P(x_j)\right]^2, \tag{2.24}$$
where $\text{logit } P(x_j) = \alpha + \beta_j x_j$ and $\text{logit } \hat{P}(x_j) = \hat{\alpha} + \hat{\beta}_j x_j$, as shown in equation (2.12).
The normal equations for obtaining minimum-logit $\chi^2$ estimates of $\alpha$ and $\beta_j$ are
$$\sum_j n_j \hat{P}(x_j)(1 - \hat{P}(x_j)) \left[\text{logit } \hat{P}(x_j) - \text{logit } P(x_j)\right] = 0$$
and
$$\sum_j n_j \hat{P}(x_j)(1 - \hat{P}(x_j)) \, x_j \left[\text{logit } \hat{P}(x_j) - \text{logit } P(x_j)\right] = 0. \tag{2.25}$$
The evaluation of equation (2.25) requires the use of a procedure that simply takes
the least-squares solution of the straight line having slope $\hat{\beta}$ and intercept $\hat{\alpha}$, with
$n_j \hat{P}(x_j)(1 - \hat{P}(x_j))$ as the weights of the known observations (Berkson, 1955).
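That weighted least-squares view can be sketched in a few lines of Python. The dose-response counts below are invented for illustration (they come from no real bioassay): the empirical logits are regressed on dose with weights $n_j \hat{P}(x_j)(1 - \hat{P}(x_j))$, yielding the minimum-logit $\chi^2$ estimates in a single non-iterative step.

```python
import numpy as np

# Sketch of Berkson's minimum-logit chi-square: one weighted least-squares fit
# of the empirical logits on dose. The counts below are entirely invented.
dose = np.array([1.0, 2.0, 3.0, 4.0])
n = np.array([40, 40, 40, 40])     # n_j subjects per dose level
pos = np.array([5, 14, 27, 35])    # y_j positive responses per level

p_hat = pos / n
logit_hat = np.log(p_hat / (1.0 - p_hat))
w = n * p_hat * (1.0 - p_hat)      # weights n_j * p_hat * (1 - p_hat)

# weighted least squares of logit_hat on [1, dose]
Z = np.column_stack([np.ones_like(dose), dose])
A = (Z.T * w) @ Z
b = (Z.T * w) @ logit_hat
alpha_hat, beta_hat = np.linalg.solve(A, b)

print(alpha_hat, beta_hat)         # the minimum-logit chi-square estimates
```

The non-iterative solution is exactly the "ease of evaluation" credited to the method in the text, in contrast with the iterative fitting required by the ML and minimum $\chi^2$ methods.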
The criteria for judging which of the several estimation techniques provides the
"best" parameter estimates are numerous. One frequently used method compares
the size of the mean squared errors of the estimates, that is, the expected value of
the squared difference of the estimate from the true value of the parameter. The
minimum-logit $\chi^2$ estimate has a smaller mean squared error than either the ML
estimate or the minimum $\chi^2$ estimate. This lower mean squared error makes the
minimum-logit $\chi^2$ estimates appear to be the best. However, because the ML estimates
are functions of the sufficient statistics, $\sum_j y_j$ and $\sum_j x_j y_j$, a highly desirable
quality, and the minimum-logit $\chi^2$ estimates are not, the ML method becomes the
preferable technique. Fortunately, minimum-logit $\chi^2$ estimates can be improved by
the use of the Rao-Blackwell Theorem to become functions of the sufficient statistics,
but they lose their ease of evaluation over the original minimum-logit $\chi^2$ estimates in
the process (Ferguson, 1967).
CHAPTER III
SOME APPLICATIONS OF LOGISTIC REGRESSION
Logistic regression has received much attention, during the latter part of the
twentieth century, as a viable technique for analyzing a dichotomous response variable.
It has been used successfully in many areas of statistical modeling. In this chapter
certain areas of logistic regression will be illustrated through the discussion of four
real concrete applications. These examples will serve to give the reader some idea of
the wide applicability of logistic regression.
The first application builds on the Cardell-Steinberg estimator. The Cardell-Steinberg
estimator, evaluated for general use by Tam (1992), is an alternative to
the logistic regression model (2.12) that is valuable as a method of finding parameter
estimates. Tam's research on the Cardell-Steinberg estimator is primarily concerned
with the idea that even for "choice-restricted" samples, when the samples contain
information on only one value ($y = 1$) of the dependent response variable, the binary
logistic regression model could still be estimated. This concept is noted throughout
the research of Cardell and Steinberg (1978, 1987, 1992). In addition, the research
of Cardell and Steinberg (1987) indicates that pooling the "choice-restricted" sample
with a supplementary sample, a sample that contains information on both values
of the response variable, when the probability of y = 1 can be correctly estimated,
allows consistent estimators to be found. However, when the probability is improperly
estimated, the parameter estimates have a definite increase in their percent bias and
root mean square errors. The research conducted by Tam (1992) confirms these
findings and offers results that support Cardell's conjecture that percent bias and
root mean square errors of the estimates for the ratio of slopes would remain at a low
level regardless of the correctness of the estimation of the probability.
As a test of her research, Tam (1992) applied the Cardell-Steinberg estimator to
college dropout data for the freshman classes of 1984 and 1988 at UCLA. The
application attempts to identify student pre-enrollment characteristics that predict
undergraduate withdrawal in advance of degree completion. Admission records for about
16,000 undergraduate students, including information on gender, age, interaction of
gender with age, ethnicity, self-reported high school grade point average, SAT verbal
and mathematics scores, majors applied to, and high school type, represent the
characteristics considered as explanatory variables. The response variable was coded 1 if
the student was determined a dropout by not registering for two consecutive quarters,
and not having received a degree. Logistic regression and Cardell-Steinberg analyses
were found to be similar for the class of 1988 in areas such as age, race, high school
GPA, SAT scores, and area of study. Yet, for the class of 1984, the logistic regression
analyses did not find the age factor to be significant, but the Cardell-Steinberg
analyses indicated that student age had a pronounced predictive power.
The second application is an evaluation of the performance of two classification
methods using misclassification probabilities and was investigated by Rylance (1996).
Simulations were completed on data from two different distributions, the bivariate
normal and bivariate uniform. Logistic regression (2.12) was performed on the data
and misclassification probabilities were found and compared to the theoretical
misclassification probabilities. The theoretical misclassification probabilities were found
using the likelihood-based discriminant method. In other words, the model that best
fits the data is used to find the estimated response value for every individual's set of
explanatory values. The misclassification probability is the percent of data categorized
incorrectly by the fitted model. Results showed that for normally distributed
data the two probabilities were comparable, and that, as the sample size increased,
the difference between the two misclassification probabilities decreased. For the
uniformly distributed data, the likelihood approach outperformed the logistic approach,
yet with an average misclassification probability of around 50%. Therefore, for uniform
data, neither approach performed well.
Another interesting application, by Nottingham and Birch (1998), investigated
the effectiveness of the logistic regression procedure (2.12) for analyzing binary dose-
response data, especially for small sample sizes. They stress that the results of a
poorly designed experiment can be seriously compromised. For three doses, two
goodness of fit tests were performed. The use of the Pearson $\chi^2$ test found the
probability of a Type I error to be much smaller than the nominal $\alpha$-level, and the
likelihood ratio $\chi^2$, or deviance, test was found to have difficulty when the response
dosages were less spread out over the entire range of dosage levels. The problem
continued when only three dosage levels were considered, even with fifty subjects
per dose. Even for a very wide range of dose levels selected, the results when the
number of dosages is small (around three) and the number of subjects at each dose
is small (around ten) should still cause concern. The logistic regression procedure
can have over 31% of its mean squared error due to bias from having design points at
the extremes of the dose range. Therefore, the placement of the doses is extremely
important to ensure reliable results. They also found that it is better to do a five-level
design with ten replicates at each dose level than a three-level design with
twenty replicates per dose level, if the analyst is concerned with the mean squared
error of fit or the variance of the estimated coefficients. This direct contradiction to
maximum likelihood theory is due to the fact that asymptotic theory is inappropriate
for experiments of this size (Nottingham and Birch, 1998).
The final application considered in this chapter is also concerned with small sample
sizes. Duke (1992) evaluated the effects of various small sample sizes on the accuracy
of odds ratios (2.11) which were estimated by the logistic regression model (2.12)
using data collected following the repeated sample technique from a cross-sectional
study of live births occurring in Oklahoma over the ten-year period from 1975 to 1984.
Duke further investigated parametric odds ratios by looking at the accuracy of the
coverage in test-based 95% confidence intervals. The investigation also determined the
efficacy of the same 95% confidence intervals through significance testing. Information
regarding the accuracy of odds ratios derived through weighted least squares logistic
regression, and the performance of the confidence intervals applied before and after
the transformation of the logistic coefficients was also provided.
Duke (1992) found that with larger sample sizes, the regression coefficients became
more stable, that is, the size of the standard error of the coefficients decreased. Also,
an increased sample size improved the reliability of the estimates and the accuracy
of the odds ratio. However, the exponential transformation of individual logistic
regression coefficients appeared to overestimate the population odds ratio. Therefore,
the conversion of the logistic transformation will inflate the size of the odds ratio.
The coverage of the parametric odds ratio by 95% confidence intervals more closely
approached 95% with the larger sample sizes. This evaluation of risk factors associated
with low birth weight deliveries was conducted to hopefully aid in assessing the need
and impact of perinatal health programs for Oklahoma.
The examples in this chapter show that logistic regression can be applied to a
variety of areas of statistical modeling. One offered a possible alternative to the
logistic regression model useful when the estimation of probability is not accurate,
and another showed that a comparison of theoretical and actual misclassification
probabilities can be conducted using logistic regression. Two applications dealt with
logistic regression for small sample sizes, one relating to dosage levels in drug testing,
and the other to risk factors associated with low birth weight. These are but a few
of the many applications of logistic regression. The next chapter compares multiple
linear regression to logistic regression via two specific sets of data.
CHAPTER IV
REGRESSION ON TWO DATA SETS
Now that the background has been explained, two data sets will be considered and
analyzed using SAS, a statistical software package. Multiple linear regression will be
compared with logistic regression via these two sets of data. Linear regression of the
dichotomous variable on independent \ariables, although not the most appropriate
technique, was used for prediction (Gehan. 1959) prior to the more widespread use
of logistic regression which began in the 1960's. The assumptions on the response
variable cause the linear regression model to be invalid for estimation or testing
procedures. Comparisons of the two procedures include but are not limited to the
differences in parameter estimates for the same models, along with the best model
each regression procedure found using various model selection techniques. The two
data sets were reproduced and used with permission of SAS Institute Inc., Cary, NC.
In SAS (1995), some similar logistic regression analysis was presented. The aim is
not to present a comprehensive logistic regression analysis of the SAS data sets, but
to use the analyses to compare with the standard multiple regression results on the
same data.
4.1 Summary Statistics of Data Sets
The first set of data will be referred to as the DIABETES data set (SAS, 1995).
Data was collected from 145 nonobese adults who were diagnosed as subclinical
(chemical) diabetics, overt diabetics, and normals (nondiabetics). The relationship between
various blood chemistry measures and diabetic status was investigated. The names of
the collected explanatory variables and their descriptions are given in Table 4.1. For
the purposes of logistic regression, the response variable must be binary. Therefore,
an indicator variable, DIAB, was defined which classified overt and chemical diabetics
into a single group having value 1, and normals or nondiabetics into a separate
group having value 0. To maximize the quality of the comparisons, the dichotomous
variable DIAB is the response variable which was used for both regression analysis
techniques.
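The appendices give the actual SAS code used for the analyses; as a language-neutral illustration, the DIAB recoding described above might be sketched in Python as follows. The GROUP values here are invented examples, not the actual 145 patient records.

```python
import numpy as np

# Sketch of the DIAB indicator recoding: GROUP values 3 (overt diabetic) and
# 2 (chemical diabetic) collapse to 1, and GROUP 1 (normal) maps to 0.
# These GROUP values are illustrative, not the thesis's patient data.
group = np.array([3, 2, 1, 1, 2, 3, 1])
diab = (group >= 2).astype(int)

print(diab)  # -> [1 1 0 0 1 1 0]
```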
Table 4.1: Description of Variables in DIABETES Data Set

Variable   Description
PATIENT    patient number
RELWT      relative weight (ratio of actual weight to expected weight, based on height)
GLUFAST    fasting plasma glucose
GLUTEST    test plasma glucose (a glucose intolerance measure)
INSTEST    plasma insulin during test (measure of insulin response to oral glucose)
SSPG       steady state plasma glucose (measure of insulin resistance)
GROUP      clinical group (3=overt diabetic, 2=chemical diabetic, and 1=normal)
Preliminary analysis of the DIABETES data set was conducted using PROC
CORR, the correlation analysis procedures of SAS. There were 76 individuals in the
nondiabetic clinical group, and a total of 69 in the diabetic group (a combination of
the 36 chemical and 33 overt diabetic patients). The means and standard deviations
of each of the variables for the overall sample are compared in Table 4.2 to those found
separately for the two response groups. The relative weight of the diabetic group is
larger than that of the nondiabetic group, showing the average actual weight of those
diagnosed diabetic is larger than their expected average weight. In fact, for all of the
blood chemistry variables the mean values for the diabetic group are larger than the
overall means by more than the nondiabetic group is smaller than the overall means.
The standard deviations for the diabetic group were expected to be fairly large, since
that group consists of both the chemical diabetics and the overt diabetics. When the
diabetics' averages were calculated separately, the values for the chemical diabetics
were at the complete opposite end of the spectrum from those of the overt diabetics.
For example, the average test plasma glucose level for the group of chemical diabetics
was 493.94, as opposed to 1043.75 for the overt diabetics.
Table 4.2: Descriptive Statistics for DIABETES Data

Variable      Overall                Nondiabetic            Diabetic
Name       Mean       Std Dev     Mean       Std Dev     Mean       Std Dev
RELWT      0.9773     0.1292      0.9372     0.1285      1.0214     0.1156
GLUFAST    121.9862   63.9304     91.1842    8.2279      155.9130   79.6996
GLUTEST    543.6138   316.9509    349.9737   36.8706     756.8986   350.9525
INSTEST    186.1172   120.9352    172.6447   68.8538     200.9565   159.1102
SSPG       184.2069   106.0299    114.0000   57.5328     261.5362   92.6276
Correlation analysis of the overall DIABETES data set, given in Table 4.3, also
showed many significantly correlated variables at the 1% significance level. All three
pairings of the variables GLUFAST, GLUTEST, and SSPG had positive correlations
above .70. In fact, the testing and fasting glucose levels had a .96 correlation.
The two glucose levels were also significantly correlated with the patients' plasma
insulin during test: GLUFAST and INSTEST produced a negative correlation of
-.39, and GLUTEST and INSTEST had a negative correlation of -.33.
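The significance of a sample correlation can be judged from its t statistic with n - 2 degrees of freedom; a minimal pure-Python sketch (illustrative only, not the thesis's PROC CORR output):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, referred to n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# With n = 145 patients, the .96 correlation between GLUFAST and GLUTEST
# yields an enormous t statistic, hence the 1%-level significance noted above.
print(round(t_statistic(0.96463, 145), 1))
```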
Table 4.3: Correlation Matrix for Overall DIABETES Data

               RELWT     GLUFAST   GLUTEST   INSTEST   SSPG
    RELWT       1.00000  -0.00881   0.02398   0.22224   0.38432
    GLUFAST    -0.00881   1.00000   0.96463  -0.39623   0.71548
    GLUTEST     0.02398   0.96463   1.00000  -0.33702   0.77094
    INSTEST     0.22224  -0.39623  -0.33702   1.00000   0.00791
    SSPG        0.38432   0.71548   0.77094   0.00791   1.00000
When the groups were separated into diabetics and nondiabetics, the significant
correlations for the diabetic group were between GLUFAST and GLUTEST and all of
the other variables. These two variables were positively correlated with each other and
with SSPG, while negatively correlated with the other variables. For the nondiabetics,
SSPG was significantly correlated with RELWT and INSTEST, and at the 5% level
INSTEST and GLUTEST were just barely significantly correlated.
Correlation analysis on a further separation of the diabetics into chemical and
overt diabetics was also conducted. For the overt diabetics, GLUFAST and GLUTEST
maintain their significance with INSTEST and SSPG, but RELWT is no longer
significantly correlated with either at the 1% level, and only with GLUTEST at the 5%
level. For the chemical diabetics alone, GLUFAST and GLUTEST are still significantly
correlated at the 1% level, while SSPG and RELWT are correlated at the 5% level.
Another measure that can be investigated besides correlation is the variance
inflation factor (VIF) for each of the regression coefficients. The VIF represents
the inflation which occurs when one explanatory variable is regressed against the
other explanatory variables. The higher the value of the VIF, the lower the precision
of the parameter estimates given in the model. Those variables whose VIF value
exceeds ten should be seriously considered for possible deletion from the model. For
the DIABETES data, the VIF values for GLUTEST and GLUFAST were around 15,
which alerts the researcher to possible multicollinearity effects in fitted models
containing these explanatory variables. The other explanatory variables had VIF
values of little concern, mostly around one.
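In the special case of only two correlated predictors, the auxiliary regression of one on the other has R² equal to their squared correlation, so VIF = 1/(1 - r²); a sketch (illustrative, not the thesis's SAS computation):

```python
def vif_two_predictors(r):
    """VIF when a predictor is regressed on one other predictor:
    the auxiliary R^2 is r^2, so VIF = 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r * r)

# The .96 correlation between GLUFAST and GLUTEST alone already pushes
# the VIF close to the "around 15" reported for the DIABETES data.
print(round(vif_two_predictors(0.96463), 1))
```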
The second data set will be referred to as the PROSTATE data set (SAS, 1995).
Data were collected from 53 patients diagnosed with prostate cancer. Since treatment
depends on whether or not the cancer has spread to the surrounding lymph nodes,
a surgical procedure is used to determine the extent of nodal involvement. The
relationship between certain variables and nodal involvement is investigated in order
to see if the involvement can be determined without the surgery. Each patient provided
data on several explanatory variables considered predictive of nodal involvement. The
collected explanatory variable names and their descriptions are given in Table 4.4.
Table 4.4: Description of Variables in PROSTATE Data Set

    Variable   Description
    CASE       an identification variable
    AGE        age in years of patient at time of diagnosis
    ACID       level of serum acid phosphatase
    XRAY       X-ray examination results (0=positive, 1=negative)
    SIZE       size of tumor (0=small, 1=large)
    GRADE      pathological grade of the tumor as determined by biopsy
               (0=less serious, 1=more serious)
    NODALINV   surgical procedure results (0=no involvement, 1=involvement)
Preliminary analyses were also conducted on the PROSTATE data set. There
were 33 patients without nodal involvement and 20 patients with nodal involvement.
The means and standard deviations of each of the variables for the overall sample
are compared in Table 4.5 to those found separately for the two response groups.
The means for the patients with nodal involvement are higher than those for
patients without nodal involvement in all four variables besides AGE. For the three
dichotomous variables, the averages for nodal involvement are at least twice those for
noninvolvement. The difference between the two groups in the mean logarithm of the
acid level is comparatively small. The overall ages of the patients ranged from
45 to 68, and the separate groups showed only a slight difference in averages; the
patients without nodal involvement had a mean age exactly equal to the overall median.
Table 4.5: Descriptive Statistics for PROSTATE Data

               Overall             No Involvement      Involvement
    Variable   Mean     Std Dev    Mean     Std Dev    Mean     Std Dev
    AGE        59.3774   6.1682    60.0606   5.6010    58.2500   7.0103
    LACD       -0.4189   0.3151    -0.4959   0.3169    -0.2918   0.2743
    XRAY        0.2830   0.4548     0.1212   0.3314     0.5500   0.5104
    SIZE        0.5094   0.5047     0.3636   0.4885     0.7500   0.4443
    GRADE       0.3774   0.4894     0.2727   0.4523     0.5500   0.5104
The correlation coefficients for the overall PROSTATE data are given in Table 4.6.
The only significant correlation was between GRADE and SIZE, which had a positive
correlation of .37. For those patients with nodal involvement, none of the correlations
were significant; the strongest was negative, between LACD and SIZE, but all
three pairings of XRAY, AGE, and LACD, although positive, were not far below
its magnitude of .25. As for those patients without involvement, the strongest (and
only significant) correlation was between GRADE and SIZE at .52, and the only
other pair which comes close to being correlated is GRADE and LACD, which had
a negative correlation of -.29.
Table 4.6: Correlation Matrix for Overall PROSTATE Data

             AGE       LACD      XRAY      SIZE      GRADE
    AGE       1.00000  -0.01921  -0.00453  -0.01970  -0.04808
    LACD     -0.01921   1.00000   0.18075   0.01127  -0.06414
    XRAY     -0.00453   0.18075   1.00000   0.19761   0.20217
    SIZE     -0.01970   0.01127   0.19761   1.00000   0.37463
    GRADE    -0.04808  -0.06414   0.20217   0.37463   1.00000
The correlation analyses were useful as preliminary studies on the data because they
gave the researcher an idea of the general variation of the data. They also showed
possible variable interactions which may need to be considered in regression analyses.
As mentioned previously, the VIF represents the inflation which occurs when one
explanatory variable is regressed against the other explanatory variables. The higher
the value of the VIF, the lower the precision of the parameter estimate in the model.
For the PROSTATE data, the VIF values for all the explanatory variables were close
to one, showing no sign of multicollinearity among the variables.
4.2 Logistic Regression on the Data Sets
4.2.1 Logistic Regression on DIABETES Data Set
The two types of regression analyses discussed in this paper have been linear
and logistic regression. For the DIABETES data set, as an illustration, the logistic
regression model (2.12) involving the single explanatory variable GLUTEST was fit
to the data. In order for this variable to predict the dichotomous response value for
the i-th patient, its value, x_i, is substituted into the resulting model

    logit P(Y_i) = 90.4017 - 0.2153 x_i,  for i = 1, ..., 145.  (4.1)
There are a variety of measures which can be used to compare different models. The
square of the correlation coefficient (R²) is a measure of the fraction of the variation
in the data (i.e., in the response variable) accounted for by the model. The adjusted
R² value is similar, but it also takes into account the number of variables in the
model. Since the full model, consisting of all possible explanatory variables, may
include extra variables that are not really necessary for the prediction of the response
variable, the adjusted R² value will give a more accurate depiction of the variation
of the response variable accounted for by the model. Thus, the adjusted R² value is
useful in detecting overfitting of the model.

Another measure useful in the comparison of fitted models is the sum of squared
errors (SSE), that is, the sum of squares of the differences between the response variable
and the value of the response variable predicted by the model. These squared errors
are totaled for the entire data set; the smaller the SSE, the better the model.
The simple model presented in equation (4.1), with SSE = 1.82503, has r² = 0.7295
and adjusted r² = 0.9734. Therefore, this one-variable logistic model accounts for
97.34% of the variation in the response variable, and its sum of squared errors is
fairly small.
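The fitted coefficients of equation (4.1) can be turned into estimated probabilities and an SSE in a short Python sketch (the GLUTEST values and 0/1 responses below are hypothetical, coded to match the sign of the fitted logit):

```python
import math

def p_hat(x):
    """Estimated probability from the fitted logit in equation (4.1):
    logit p = 90.4017 - 0.2153 * GLUTEST."""
    logit = 90.4017 - 0.2153 * x
    if logit < -700:  # avoid math.exp overflow for very large GLUTEST
        return 0.0
    return 1.0 / (1.0 + math.exp(-logit))

def sse(pairs):
    """Sum of squared errors between 0/1 responses and fitted probabilities."""
    return sum((y - p_hat(x)) ** 2 for x, y in pairs)

# The fitted curve is essentially 0 or 1 away from the crossover region:
print(round(p_hat(350), 4), round(p_hat(1000), 4))
```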
Using the above information alone, one might see this model as accurate enough to
predict whether or not an individual is diabetic. However, with all of the information on
the other explanatory variables, model selection methods can be used to investigate
their inclusion into the model. The three selection techniques available for both
logistic regression and linear regression in SAS are forward, backward, and stepwise
selection. These methods are iterative processes which add or remove variables from
the model based on the appropriate significance levels at each step. In logistic
regression, the significance levels of the χ² score statistic are compared to specified
entrance and removal threshold values.
The forward method begins with no variables in the model; at the first
step the intercept is added. After that, the variable which is most significant
when added to the model (having the smallest significance level) is entered. This
continues until no variable has a significance level below the threshold value for
entrance into the model. Backward selection is the opposite, starting with all
variables in the model. The significance level for testing each variable's parameter
estimate equal to zero is calculated, and the least significant variable (with the largest
significance level) is removed at each step. The process continues until all variables
left in the model fall below the threshold value for removal from the model.
Stepwise selection combines the two previously mentioned methods. It begins with
no variables in the model and checks both entrance and removal thresholds at each
step, continuing until a variable entered on one step is removed on the next step. It is
necessary to realize that when comparing significance levels for different combinations
of explanatory variables, the overall significance level increases with each comparison.
Therefore, the threshold values for each comparison test should be very small in order
to keep down the overall significance level. Also, model selection techniques are meant
to be exploratory, so the fit of the model should be verified on other data.
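The forward step described above can be sketched as a greedy loop; the `score` callback stands in for the significance level of the χ² score statistic, and the p-values below are hypothetical illustrations, not the thesis's SAS output:

```python
def forward_select(candidates, score, enter_level=0.15):
    """Greedy forward selection: at each step add the candidate whose addition
    has the smallest significance level, stopping once no candidate falls
    below the entry threshold.  `score(model, var)` returns the significance
    level of `var` when added to the current `model`."""
    model, remaining = [], list(candidates)
    while remaining:
        best = min(remaining, key=lambda v: score(model, v))
        if score(model, best) >= enter_level:
            break
        model.append(best)
        remaining.remove(best)
    return model

# Toy significance levels: GLUTEST is highly significant, SSPG enters second,
# and the rest never clear the .15 entry threshold (hypothetical numbers).
pvals = {"GLUTEST": 0.0001, "SSPG": 0.02, "RELWT": 0.40, "INSTEST": 0.55}
chosen = forward_select(pvals, lambda model, v: pvals[v])
print(chosen)
```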
The results from using the logistic regression procedure with the forward model
selection technique provided a model containing only GLUTEST and SSPG with
R² = .7373 and SSE = 0.8002, whereas both backward and stepwise selection named
the above mentioned model with the single variable GLUTEST. Note that the
significance levels used were .15 for entry into the model and .10 for removal
from the model. The forward selection for logistic regression required two steps,
although the others required at least four steps.
Logistic regression also has the option of determining the best model by evaluating
the model's score statistic. This selection option is different from the previous
techniques in that it finds a specified number of best models of each possible model
size, ranging from the k one-variable models up to the single full model. The model
with the largest score statistic value was the full model, having R² = .7419. Of
all the four-variable models, the best one left out the variable SSPG and had
R² = .7414, and then also removing INSTEST left the best three-variable model,
having R² = .7409. Comparing these R² values, the researcher must determine
whether accounting for about one tenth of a percent more of the variation
in the data is worth the addition of two explanatory variables to the model. As for
a comparison of the sums of squared errors: the full model has SSE = .42291, the best
four-variable model SSE = .45301, and the best three-variable model SSE = .66796.
The best model should not be limited to first-order variables. As the
preliminary analyses show, many variables are correlated; therefore, the interactions
between the variables should also be investigated. When the five variables are allowed
to interact, the new model can include up to twenty variables plus, of course, the
intercept. As mentioned previously, when the number of steps increases, the overall
significance level also increases. When the interactions of the variables are included,
the full model fit by the logistic regression procedure had R² = 0.5847 and
SSE = .88737. The logistic procedure using forward selection required four steps to find
a four-variable model including the variables RELWT*RELWT, RELWT*GLUFAST,
RELWT*GLUTEST, and SSPG, with R² = .7430 and SSE = .34065. The backward
selection for this procedure required nineteen steps to eliminate all other explanatory
variables except GLUTEST*GLUTEST, with R² = .7295 and SSE = 1.82298.
The stepwise selection technique also yielded a single-variable model, with the variable
RELWT*GLUTEST, and it needed only five steps. This model therefore has a
better significance level; unfortunately, its SSE = 5.5394 is quite high compared
to the previously mentioned models, and its R² = .6813 is much smaller, which
makes it less appropriate.
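The count of twenty candidate terms mentioned above (five first-order variables, their five squares, and ten cross products) can be checked with a short sketch:

```python
from itertools import combinations_with_replacement

variables = ["RELWT", "GLUFAST", "GLUTEST", "INSTEST", "SSPG"]

# Second-order terms: every product of two variables, squares included
# (10 cross products + 5 squares), plus the 5 first-order terms = 20.
terms = list(variables)
terms += [f"{a}*{b}" for a, b in combinations_with_replacement(variables, 2)]

print(len(terms))
```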
The logistic regression procedure with the score selection option found that the
best two models included eleven and ten variables, respectively. These best models
include the interaction of the variable RELWT with each of the other variables except
itself and SSPG. Similarly, GLUTEST is included interacting with all other variables
except GLUFAST, and the original variables RELWT, GLUFAST, and INSTEST are
also included. Along with these are the variables GLUTEST*INSTEST and
GLUFAST*SSPG, and the eleventh variable added in the best model is
GLUFAST*INSTEST. For the eleven- and ten-variable models, the logistic sums of
squared errors are SSE = .0441 and SSE = .05796, while R² = .7476 and R² = .7474,
respectively. This remarkably small sum of squares makes the eleven-variable model
the best overall fit to the data for the logistic regression procedure.
4.2.2 Logistic Regression on PROSTATE Data Set
As an illustration with the PROSTATE data set, a logistic regression model involving
the variables ACID, XRAY, and SIZE was first fit to the data. Instead of directly
fitting the variable ACID, the logarithm of the acid levels, LACD, was used to give a
better discrimination between the closely spaced values of this variable. In order for
a combination of these three variables to predict the dichotomous response values for
each patient, their values are substituted into the resulting model

    logit P(Y_i) = -1.1994 + 2.2922 x_1i + 2.0550 x_2i + 1.7638 x_3i,  for i = 1, ..., 53,  (4.2)

where x_1i represents the logarithm of the i-th individual's level of serum acid
phosphatase, while x_2i and x_3i are the corresponding values for XRAY and SIZE,
respectively. The model in equation (4.2) has R² = 0.3305 and adjusted R² = 0.4501
with SSE = 8.0364. In this case, the logistic regression model did not account for as
much variation as it did in the previous data set, and the sum of squared errors is not
especially small. Accordingly, the next step would be to determine the best model to
fit the data. The same model selection techniques described in the previous section
are available for this data set.
First considering only the original five variables without interactions, the logistic
regression procedure found the model presented in equation (4.2) as best using all
three types of selection techniques. When the alternative selection technique was
used, the logistic regression procedure with the score option yielded the full model,
containing five variables, as the best, with R² = 0.3605 and SSE = 7.6656. The full
model with GRADE removed was given as the best four-variable model, with R² =
0.3468 and SSE = 7.8703. Of course, the full model with both AGE and GRADE
removed, presented in equation (4.2), was given as the best three-variable model.
When the interactions of the variables are included, the full model fit by the
logistic regression procedure had R² = 0.5847 and SSE = 3.898. The forward
selection yielded a two-variable model in only two steps, including LACD*LACD and
XRAY*SIZE, with R² = 0.3305 and SSE = 8.0920. Backward elimination required
twelve steps to determine the best model as a five-variable model with R² = 0.4894
and SSE = 5.3994. The stepwise selection technique found the same two-variable model
as the forward selection technique for this procedure. Using the alternative selection
technique, logistic regression found the best model to contain eleven variables, and
the second best to contain ten variables, with SSE = 4.1199 and SSE = 4.1342, while
R² = 0.5673 and R² = 0.5660, respectively. The best fit to the data using logistic
regression is the full model with interactions, having R² = 0.5847 and SSE = 3.898.
4.3 Linear Regression on the Data Sets
4.3.1 Linear Regression on DIABETES Data Set
For the DIABETES data set, also as an illustration, a linear regression model (2.1)
involving the single explanatory variable GLUTEST was fit to the data. In order for
this variable to predict the dichotomous response value for the i-th patient, its value,
x_i, is substituted into the resulting model

    y_i = -0.0772 + 0.0010 x_i,  for i = 1, ..., 145.  (4.3)

As mentioned previously, the linear regression model is not ideal for predicting
binary response variables, since it is more useful in modeling continuous response
variables. The model in equation (4.3) has r² = 0.4140 and adjusted r² = 0.4099
with SSE = 21.1938. With such a large sum of squared errors and so small a share of
the variation in the response variable accounted for by this model, the next step is to
search for alternative models which offer a better fit to the data.
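The least-squares fit itself has a simple closed form; the sketch below (pure Python, with a tiny set of hypothetical 0/1 responses against a GLUTEST-like predictor) also shows why the linear model is awkward for a binary response: the fitted line is not confined to the interval [0, 1].

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = a + b*x via the closed-form normal equations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical binary responses: the prediction at x = 1000 exceeds 1.
a, b = fit_simple_linear([300, 400, 700, 1000], [0, 0, 1, 1])
print(a, b, a + b * 1000)
```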
In linear regression, the significance levels of the F statistic are compared to
specified entrance and removal threshold values in the forward, backward, and stepwise
selection procedures. The results from using the multiple linear regression procedure
with the forward model selection technique yielded the full model, containing all of the
explanatory variables, with a multiple R² = .7054 and SSE = 7.6673, as the best fit.
The same procedure with both the backward and stepwise techniques produced as best
the model with all explanatory variables except SSPG, which had R² = .7024
and SSE = 10.7632. Again with significance levels of .15 for entry into the model
and .10 for removal, the backward elimination technique took only one step, while
the others required at least four steps.
A process similar to the logistic regression score option is the linear regression
option which uses the R² or adjusted R² values to determine the best model. Again,
this selection option is different from the previous techniques in that it finds a specified
number of best models of each possible model size, ranging from the k one-variable
models up to the single full model. Interestingly, the linear regression R² procedure
yielded the same best models found using the logistic score option: the
full model with R² = .7054 and SSE = 7.6673, the best four-variable model being the
full model with SSPG removed, with R² = .7054 and SSE = 10.7632, and the full model
with SSPG and INSTEST removed, with R² = .6667 and SSE = 12.0548.
Again, the best model should not be limited to first-order variables. As the
preliminary analyses show, many variables are correlated; therefore, the interactions
between the variables should also be investigated. When the five variables are allowed
to interact, the new model can include up to twenty variables plus, of course, the
intercept. As mentioned previously, when the number of steps increases, the overall
significance level also increases. For the linear regression procedure, forward selection
provides a model with all but two variables, GLUTEST*GLUFAST and
INSTEST*SSPG, with R² = .8047 and SSE = 7.0474, but the process took sixteen
steps. The backward technique with this procedure required only twelve steps to
determine an eight-variable model having R² = .7944 and SSE = 7.4339. The stepwise
method yielded a six-variable model with R² = .7786 and SSE = 8.0088. Again, the
three-percent difference gained by the addition of twelve variables requires a trade-off
between the predictive power and the simplicity of the model.
Using the linear regression procedure with the R² selection option, the best two
models again included eleven and ten variables, with R² = .8024 and R² = .8009,
respectively. As for the sums of squared errors of these models, with eleven variables
SSE = 7.1477 and with ten variables SSE = 7.2003. These best models include
the interaction of the variable RELWT with each of the other variables except itself
and SSPG. Similarly, GLUTEST is included interacting with all other variables except
GLUFAST, and the original variables RELWT, GLUFAST, and INSTEST are also
included. Along with these are the variables GLUTEST*INSTEST and
GLUFAST*SSPG, and the eleventh variable added in the best model is
GLUFAST*INSTEST.
The best model among all of the multiple linear regression procedures considered
on this data set is the model given by forward selection with interactions, containing
eighteen variables and having R² = .8047 and SSE = 7.0474. Recall that for logistic
regression, the best model included the same eleven variables as the linear regression
model above, but had SSE = .0441 and R² = .7476, which makes it the best overall
fit to the data using either regression procedure.
4.3.2 Linear Regression on PROSTATE Data Set
For the PROSTATE data set, also as an illustration, a linear regression model
involving the variables LACD, XRAY, and SIZE was fit to the data. In order for a
combination of these three variables to predict the dichotomous response values for
each patient, their values are substituted into the resulting model

    y_i = -0.2819 + 0.3869 x_1i + 0.3840 x_2i + 0.2922 x_3i,  for i = 1, ..., 53,  (4.4)

where x_1i represents the logarithm of the i-th individual's level of serum acid
phosphatase, while x_2i and x_3i are the corresponding values for XRAY and SIZE.
Again, the linear regression model is not ideal for predicting binary response
variables such as nodal involvement, and this is apparent in the SSE, R², and adjusted
R² values for this model. The SSE = 8.0139, which again is not especially small, and
the R² value indicates that 35.65% of the variation in the response variable is explained
by the model. When the number of variables in the model is taken into account by the
adjusted R², the model only accounts for 31.71% of the variation. Accordingly, the
next step would be to determine the best model to fit the data. The same model
selection techniques described in the previous sections are available for this data set.
First considering only the original five variables without interactions, the forward
selection technique used with the linear regression procedure determined the
full model, with R² = .3864 and SSE = 7.6412, as the best model. The backward
elimination and stepwise selection both provided the model presented in equation (4.4),
with LACD, XRAY, and SIZE, as the best model. When the alternative selection
technique was used, both procedures yielded the same best three models: the full
model with R² = .3864 and SSE = 7.66562, the full model with GRADE removed,
having R² = .3737 and SSE = 7.87025, and the full model with both AGE and GRADE
removed, with R² = .3565 and SSE = 8.03644.
When the interactions of the variables are included, the full model with all eighteen
variables had SSE = 4.5944 and R² = 0.6311. The forward selection using the linear
regression procedure found the best model to contain nine variables, with R² = .5977
and SSE = 6.4345; unfortunately, it required nine comparison steps to produce it.
The backward elimination using this same procedure required eleven steps to find a
six-variable model similar to the five-variable model found with the same technique
using logistic regression. This six-variable model had R² = .5358 and SSE = 5.7812,
an increase in R² and a decrease in SSE from previous models. The stepwise selection
required only three steps to find the three-variable model which it considered
best. This model included LACD*SIZE, LACD*LACD, and XRAY*SIZE, with
SSE = 7.6410 and R² = .3864, almost the same SSE and R² as the full model
without interactions. Using the alternative selection technique, linear regression
again found the best model to contain eleven variables, and the second best to
contain ten variables, with SSE = 4.6625 and SSE = 4.8804, while R² = .6256 and
R² = .6081, respectively.
The best multiple linear regression model found for this data set is the full model
with interactions, containing eighteen variables and having R² = .6311 and
SSE = 4.5944. The model with the best overall fit to the data using the logistic
regression procedure was the full logistic model with interactions, having R² = 0.5847
and SSE = 3.898. Therefore, with the smallest sum of squared errors, this full logistic
regression model is the model with the best overall fit to the data using any regression
procedure.
4.4 Comparison of Logistic and Linear Regression Analyses
For the DIABETES data set, the models found using logistic regression usually
contained fewer variables than the models determined by the selection techniques
with linear regression. The sums of squared errors were also significantly smaller for
the logistic regression models, suggesting a better fit to the data. The best logistic
regression model was produced using the score option and contained eleven variables,
with SSE = .0441 and R² = .7476. The linear R² selection option confirmed that a
model containing these same eleven variables was in competition for the best model,
with R² = .8024; however, the parameter estimates varied significantly between the two
procedures, and for linear regression SSE = 7.1477. The best multiple linear regression
model was produced by the forward selection technique with interactions, containing
eighteen variables and having R² = .8047 and SSE = 7.0474. Therefore, the overall
best model produced for this data set was the eleven-variable logistic model.
For the PROSTATE data set, again the models found using the selection
techniques with logistic regression contained somewhat fewer variables, but
only when interactions of the variables were allowed. When only first-order variables
were considered, both logistic and linear regression with all selection techniques
basically agreed on the best fitting models, and even had very similar sums of squared
errors. The sums of squared errors differed for the two regression procedures when
interactions were considered, but not nearly as markedly as in the other data set. The
logistic regression score selection technique also produced the best model according
to SSE in this data set as one containing eleven variables. The sum of squared errors
for this model with logistic regression was SSE = 4.11986, whereas for linear regression
SSE = 4.6625, although the difference is much more noticeable in their R² values:
for the logistic regression of the eleven-variable model R² = 0.5673, but for the
multiple linear regression R² = .6256. The best multiple linear regression model found
for this data set also contained eighteen variables and was the full linear model with
interactions, having R² = .6311 and SSE = 4.5944. All of these models were surpassed
by the full logistic model with interactions, having SSE = 3.8979, making it the
overall best model for this data set, even though its R² = 0.5847.
The above best models using the two regression procedures on each data set are
given below in Table 4.7. The logistic procedure outperformed the multiple linear
regression procedure on both data sets, as was expected. Therefore, with response
variables which are dichotomous, the logistic regression model is the appropriate and
preferred model.
Table 4.7: Best Logistic and Linear Regression Model for Each Data Set

    Regression   Data
    Procedure    Set         R²        SSE
    Logistic     DIABETES    0.7476    0.0441
    Logistic     PROSTATE    0.5847    3.8979
    Linear       DIABETES    0.8047    7.0474
    Linear       PROSTATE    0.6311    4.5944
For each of the four best regression models given in Table 4.7, the frequency
distribution of the estimated probability, p̂, is given in Figures 4.1 through 4.4. The
estimates may be used to investigate how many patients would be misclassified if
a particular value of p̂, perhaps p̂ = 0.5, were chosen to separate those predicted
to have a response variable equal to zero from those having a response variable equal
to one. Figure 4.1 is based on the logistic regression model found using the score
selection option, containing eleven variables, for the DIABETES data set. With such
a small sum of squared errors, SSE = .0441, the distribution of p̂ is very close
to the actual dichotomous values of the response variable DIAB. Figure 4.2 is based
on the linear regression model found using the forward selection technique for the
DIABETES data set and shows the distribution of p̂. Figure 4.3 gives the distribution
of p̂ for the full logistic regression model including interactions for the PROSTATE
data set. Figure 4.4 shows the distribution of p̂ for the full linear regression model
including interactions for the PROSTATE data set. The two linear regression models
have much more dispersed distributions of p̂, whereas the distributions for the two
logistic regression models show a much clearer distinction between the two possible
response variable values.
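The misclassification count described above can be sketched directly; the estimated probabilities and responses below are hypothetical, not read from the figures:

```python
def misclassified(p_hats, responses, cutoff=0.5):
    """Count patients whose 0/1 response disagrees with the prediction
    obtained by thresholding the estimated probability at `cutoff`."""
    return sum((p >= cutoff) != bool(y) for p, y in zip(p_hats, responses))

# Hypothetical estimated probabilities against true responses: only the
# patient with p-hat = 0.20 but response 1 falls on the wrong side of 0.5.
print(misclassified([0.05, 0.20, 0.55, 0.90], [0, 1, 1, 1]))
```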
The next chapter discusses cluster analysis and how it can be used to help
determine the groupings used in logistic regression. The DIABETES data set is revisited
as an illustration of cluster analysis techniques.
[Histogram omitted: frequency (0 to 70) versus estimated probability, in bins from 0.05 to 0.95.]
Figure 4.1: Distribution of p̂ For Best Logistic Model on DIABETES Data
[Histogram omitted: frequency (0 to 40) versus predicted value of DIAB, in bins from 0.05 to 0.95.]
Figure 4.2: Distribution of p̂ For Best Linear Model on DIABETES Data
(The horizontal label could also be interpreted as estimated probability.)
[Histogram omitted: frequency (0 to 30) versus estimated probability, in bins from 0.05 to 0.95.]
Figure 4.3: Distribution of p̂ For Best Logistic Model on PROSTATE Data
[Histogram omitted: frequency (0 to 15) versus predicted value of NODALINV, in bins from 0.05 to 0.95.]
Figure 4.4: Distribution of p̂ For Best Linear Model on PROSTATE Data
(The horizontal label could also be interpreted as estimated probability.)
CHAPTER V
CLUSTER ANALYSIS
5.1 What is Cluster Analysis?
Cluster analysis is a technique applicable to situations involving data from a
population where there exists some set of features or characteristics which may be
used to separate the data values into groups or clusters. Specifically, let the set
I = (I_1, I_2, ..., I_n) represent n individuals from a population denoted π, and let
the set C = (C_1, C_2, ..., C_k) represent observable characteristics possessed by each
individual in I. Usually these observable characteristics yield quantitative data, also
called measurements. Sometimes, however, the characteristics can yield qualitative
or categorical data. The value of the measurement on the j-th characteristic of the
individual I_i is denoted by the symbol x_ij, and X_i = [x_ij] represents a k x 1 vector of
measurements. The researcher, therefore, has available for each set I a corresponding
set of k x 1 measurement vectors X = (X_1, X_2, ..., X_n) which describes the set of
individuals, I. The set X can be thought of as n points in k-dimensional Euclidean
space, where the distance between the points can be measured. Based on the data
contained in the set X and an integer m, where m < n, the cluster problem is to
determine m clusters or subsets of the individuals in I, say π_1, π_2, ..., π_m, such that
each I_i belongs to one and only one subset. Those individuals which are assigned to
the same cluster are required to be determined significantly similar, while those
assigned to different clusters are required to be determined significantly different. The
general ideas presented in this chapter are a summary of those presented in Duran
and Odell (1974).
A solution to the cluster problem usually involves a partitioning of the individuals
which satisfies some optimality criterion. This optimality criterion, often called an
objective function, may be given in terms of a functional relation that reflects the
levels of desirability of the various partitions. Although various types of objective
functions can be defined, many can be formulated in a unified and general manner.
To accomplish this, a clear definition of what it means for two individuals, I_i and I_k,
to be similar is needed. One possible interpretation of similar individuals is to assign
the i-th and k-th individuals to the same cluster if the distance between the points X_i and
X_k is "sufficiently small," and likewise to assign the individuals to different clusters
if the distance is "sufficiently large." This distance between points can be defined by
various distance functions; however, the Euclidean distance function is most commonly
used.
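As a concrete illustration, the Euclidean distance between two measurement vectors can be computed as follows (a minimal Python sketch; the sample values are hypothetical):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two k-dimensional measurement vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical individuals measured on k = 3 characteristics.
x1 = [1.0, 2.0, 2.0]
x2 = [4.0, 6.0, 2.0]
print(euclidean(x1, x2))  # -> 5.0
```

Individuals whose distance is "sufficiently small" under this function would be candidates for the same cluster.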
Using the above concept of distance, a measure of the scatter or heterogeneity
of the set of individuals I is desired. Statisticians generally use the following k × k
matrix, called the scatter matrix for the set X, as a measure of scatter:

    S_x = Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄)^T,    (5.1)

where X̄ = (1/n) Σ_{i=1}^{n} X_i is a k × 1 vector of arithmetic averages. The matrix
S_x is also sometimes referred to as the matrix sum of squares. Other scatter measures
include the trace of S_x, denoted s_t; the determinant of S_x, denoted s_D; and the
matrix of correlation coefficients, denoted R. The measure

    s_t = tr S_x = Σ_{i=1}^{n} (X_i − X̄)^T (X_i − X̄)    (5.2)

is the sum of the squared distances of the n points from the group mean X̄ and is termed
the error or within sum of squares. The measure s_D = |S_x| is the statistical scatter
with respect to the determinant. The matrix of correlation coefficients, R = [r_ij],
can be computed from the matrix S_x = [s_ij] defined in equation (5.1).
Using the definition of S_x, define the diagonal matrix Diag(S_x) =
diag(s_11, s_22, ..., s_kk) and its inverse square root [Diag(S_x)]^{−1/2} =
diag(s_11^{−1/2}, s_22^{−1/2}, ..., s_kk^{−1/2}). Then

    R = [Diag(S_x)]^{−1/2} S_x [Diag(S_x)]^{−1/2}    (5.3)

is the matrix of correlation coefficients. These measures of scatter are useful in
determining how tightly the set of individuals is grouped.
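The scatter measures above can be sketched in a few lines of Python (a pure-Python illustration on a small hypothetical data set; a statistical package would normally compute these):

```python
def scatter_matrix(X):
    """Scatter matrix S_x = sum_i (X_i - mean)(X_i - mean)^T, eq. (5.1)."""
    n, k = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(k)]
    return [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in X)
             for b in range(k)] for a in range(k)]

def trace(S):
    """s_t = tr S_x, the within (error) sum of squares, eq. (5.2)."""
    return sum(S[j][j] for j in range(len(S)))

def correlation(S):
    """R = Diag(S_x)^(-1/2) S_x Diag(S_x)^(-1/2), eq. (5.3)."""
    k = len(S)
    d = [S[j][j] ** 0.5 for j in range(k)]
    return [[S[a][b] / (d[a] * d[b]) for b in range(k)] for a in range(k)]

X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # hypothetical, perfectly correlated
S = scatter_matrix(X)
print(S)               # -> [[2.0, 4.0], [4.0, 8.0]]
print(trace(S))        # -> 10.0
print(correlation(S))  # off-diagonal entries are approximately 1.0
```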
Many clustering procedures are hierarchical. In other words, the two closest
objects are grouped and treated as a single cluster in the next step. Thus, the number
of clusters is decreased to n − 1: a single cluster of two objects and n − 2 clusters of
a single object each. This process is repeated until all of the n objects are
grouped into one cluster containing n objects. This hierarchical process involves the
concept of measuring the distance between an object and a cluster and the distance
between two clusters. The concept of the optimality criterion or objective function
determines when the most desirable partition has been obtained. Therefore, we need
measures of homogeneity within a cluster and measures of disparity between two
clusters. These two measures also depend on how the distance between two clusters is
defined.
The distance between two clusters can be defined in various ways. Let I =
(I_1, I_2, ..., I_{n1}) and J = (J_1, J_2, ..., J_{n2}) represent two clusters of individuals from
a population. Let C = (C_1, C_2, ..., C_k) be a set of characteristics which generate the
two measurement sets X = (X_1, X_2, ..., X_{n1}) and Y = (Y_1, Y_2, ..., Y_{n2}), associated
with I and J, respectively. From these definitions, the nearest neighbor distance,
furthest neighbor distance, and the average distance follow directly. The nearest
neighbor distance is defined as the minimum distance between any pairing of one
individual from I and one individual from J. Likewise, the maximum distance between
any such pairing of two individuals from the sets I and J is defined as the furthest
neighbor distance. The average distance between the clusters I and J is calculated
by finding the arithmetic average of all possible pairwise distances between the two
sets.
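The three inter-cluster distances just described can be sketched as follows (Python for illustration; the one-dimensional clusters are hypothetical):

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(X, Y):
    """All distances between one point of X and one point of Y."""
    return [dist(x, y) for x in X for y in Y]

def nearest_neighbor(X, Y):   # single linkage
    return min(pairwise(X, Y))

def furthest_neighbor(X, Y):  # complete linkage
    return max(pairwise(X, Y))

def average_distance(X, Y):   # average linkage
    d = pairwise(X, Y)
    return sum(d) / len(d)

I = [[0.0], [1.0]]
J = [[4.0], [6.0]]
print(nearest_neighbor(I, J))   # -> 3.0
print(furthest_neighbor(I, J))  # -> 6.0
print(average_distance(I, J))   # -> 4.5
```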
Using the concept of statistical scatter, the measure of distance between the
clusters I and J is defined as

    d(I, J) = (n_1 n_2 / (n_1 + n_2)) (X̄ − Ȳ)^T (X̄ − Ȳ),    (5.4)

where Ȳ = (1/n_2) Σ_{i=1}^{n_2} Y_i and X̄ is defined as it was in equation (5.1). This measure
of distance is also referred to as the within group or error sum of squares.
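A sketch of the scatter-based cluster distance in equation (5.4); note that the multiplier n_1 n_2/(n_1 + n_2) is reconstructed here from the standard minimum-variance criterion, and the sample clusters are hypothetical:

```python
def mean(X):
    n, k = len(X), len(X[0])
    return [sum(row[j] for row in X) / n for j in range(k)]

def scatter_distance(X, Y):
    """d(I, J) = n1*n2/(n1+n2) * (Xbar - Ybar)^T (Xbar - Ybar)."""
    n1, n2 = len(X), len(Y)
    xb, yb = mean(X), mean(Y)
    sq = sum((a - b) ** 2 for a, b in zip(xb, yb))
    return n1 * n2 / (n1 + n2) * sq

I = [[0.0, 0.0], [2.0, 0.0]]   # cluster mean (1, 0)
J = [[4.0, 0.0], [6.0, 0.0]]   # cluster mean (5, 0)
print(scatter_distance(I, J))  # -> 16.0
```

This quantity equals the increase in the within-group sum of squares when the two clusters are merged, which is why it pairs naturally with the minimum-variance methods described next.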
Most clustering and classification methods concentrate on the construction
of methods based on the minimization of the within group sum of squares. These
methods are called minimum-variance constraint methods and are easily described
using squared Euclidean distances. Various clustering techniques have been developed,
and the most common include average linkage, furthest and nearest neighbor
methods, and the centroid method. These methods will be described in the next
section in relation to their corresponding SAS options.
Another popular technique is the within sum of squares method. In this technique,
the sum of squares of the distances from each point in a cluster to the mean of that
cluster is found. Again, this is a hierarchical process: at each step, the two clusters
whose union produces the least increase in the within group sum of squares are
joined.
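The within sum of squares method can be sketched as a simple agglomerative loop (an illustration only, not the SAS implementation; the one-dimensional data are hypothetical):

```python
def within_ss(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    n, k = len(cluster), len(cluster[0])
    m = [sum(p[j] for p in cluster) / n for j in range(k)]
    return sum(sum((p[j] - m[j]) ** 2 for j in range(k)) for p in cluster)

def ward_step(clusters):
    """Join the pair of clusters whose union least increases the total within SS."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            inc = (within_ss(clusters[i] + clusters[j])
                   - within_ss(clusters[i]) - within_ss(clusters[j]))
            if best is None or inc < best[0]:
                best = (inc, i, j)
    _, i, j = best
    merged = clusters[i] + clusters[j]
    return [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]

# Start with each point in its own cluster; merge until two clusters remain.
clusters = [[[0.0]], [[0.2]], [[5.0]], [[5.3]]]
while len(clusters) > 2:
    clusters = ward_step(clusters)
print(sorted(sorted(c) for c in clusters))  # -> [[[0.0], [0.2]], [[5.0], [5.3]]]
```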
Many other variations of the clustering procedure begin by initializing certain
values as starting points for clusters. One technique chooses starting points at random,
and then objects within a certain threshold distance form the first cluster. The
process continues until all the points are accounted for by being assigned to their nearest
cluster center, thus forming a specific desired number of disjoint clusters.

Another variation involves choosing "typical" points to initialize the clusters.
These points are determined by some preliminary study of the individuals. If the
desired number of clusters is known, call it m, then m points could be
chosen at random, and each of the remaining n − m objects can be assigned to the
center to which it is nearest.
An improvement on this method is to then join any of the m clusters whose centers
fall within a threshold radius, and split any cluster in which the within-cluster
variance s_i^2 of any one variable exceeds a specified threshold value s^2. The
variances s_j^2 of each of the resulting clusters are then constrained by s_j^2 ≤ k s^2, where k is
the number of variables. At each step the cluster centroids replace the original cluster
centers, and the process is continued until convergence is achieved. This updating of
the centroids until convergence is one of the more popular variations.
Still another variation starts the updating of centroids almost immediately. Again,
a certain number of objects are chosen at random to be used as cluster centers, and
each object is assigned to the center nearest it if its distance from that center is within
a certain threshold. If the object falls beyond the threshold distance, it initializes a
new cluster center. With each allocation of an object to a cluster, the centroid is
recomputed and becomes the new cluster center. Of course, if the distance between
two clusters becomes less than another threshold value, the clusters are joined, and
the process continues until convergence is attained.
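The last variation can be sketched as follows (a minimal illustration; the threshold, data, and single-pass assignment order are all assumptions, and the joining of close clusters is omitted for brevity):

```python
import math

def assign_with_threshold(points, threshold):
    """Assign each point to its nearest center if within the threshold;
    otherwise start a new cluster. The centroid is recomputed after
    each allocation and becomes the new cluster center."""
    centers, members = [], []
    for p in points:
        if centers:
            d, i = min((math.dist(p, c), i) for i, c in enumerate(centers))
        else:
            d, i = float("inf"), -1
        if d <= threshold:
            members[i].append(p)
            k = len(p)
            centers[i] = [sum(q[j] for q in members[i]) / len(members[i])
                          for j in range(k)]
        else:
            centers.append(list(p))
            members.append([p])
    return members

pts = [[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]]
print(len(assign_with_threshold(pts, 2.0)))  # -> 2
```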
The main reason for the popularity of the Euclidean distance is probably its
intuitive appeal and direct relation to the within group sum of squares. There are,
inevitably, objections to the use of the minimum variance approach to cluster analysis,
since changes in the scale of the variables will modify the resulting clusters. As the
number of variations suggests, each seems to improve on the results of the previous one.
This clustering procedure is useful in determining how many different underlying
groups one population may contain. Then, based on the number of clusters, either
logistic regression or some other categorical modeling technique can be used to
analyze the data as accurately as possible. The next section describes SAS clustering
procedures and the results from the cluster analysis of the DIABETES data set.
5.2 Cluster Analysis on the DIABETES Data Set
The DIABETES data set used for the regression analysis in the previous chapter
was also investigated using various method options of the clustering procedure of
SAS, a statistical software package. The PROC CLUSTER command finds the
hierarchical clusters of the observations in the data set, provided the data set is entered
as coordinates or distances. For the DIABETES data set, each individual's information
is treated as a coordinate, and the squared Euclidean distances between each
possible pairing of the observations (or coordinates) are computed by the CLUSTER
procedure. Before performing a cluster analysis on coordinate data, some scaling, or
transformation, of the data should be considered, since variables with large variances
tend to affect the clustering more than those with smaller variances. One choice to
eliminate this effect is to use the STD option in CLUSTER, which standardizes the
variables to mean 0 and standard deviation 1. Some transformations may change the
number of population clusters, so use caution when transforming the variables.
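The effect of the STD option can be illustrated by standardizing a single variable by hand (a Python sketch with hypothetical glucose readings):

```python
import statistics

def standardize(values):
    """Scale a variable to mean 0 and (sample) standard deviation 1."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

glufast = [80.0, 90.0, 100.0, 110.0, 120.0]  # hypothetical readings
z = standardize(glufast)
print(round(statistics.mean(z), 10))  # -> 0.0
print(statistics.stdev(z))            # close to 1.0
```

After this rescaling, no single variable dominates the Euclidean distances simply because it is measured on a larger scale.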
The basic procedure begins with each observation in a cluster by itself; then
the two closest clusters are joined to form one new cluster which replaces the two
old clusters. The process continues until there is only one cluster remaining, or
until a specified number of clusters (which may be entered as an option) is reached. The
various clustering methods differ in how the distance between clusters is computed;
for example, the distance between two clusters may or may not be updated each time
a new observation or cluster merges with one of the existing clusters.
Some of the methods of clustering mentioned previously which are available in
SAS include average linkage, complete linkage (furthest neighbor), density linkage,
single linkage (nearest neighbor), and the centroid method. In average linkage, the
distance between two clusters is the average distance between pairs of observations,
one in each cluster. Average linkage is biased in that it usually joins those clusters with
small variances, and the resulting clusters tend to have the same variances. In complete
linkage, the distance between two clusters is the maximum distance
between one observation in the first cluster and an observation in the second cluster.
Complete linkage is biased in that it usually produces clusters of equal diameters, and
can be largely distorted by even moderate outliers. Density linkage refers to the
class of clustering methods which use nonparametric probability density estimates.
Single linkage calculates the distance between two clusters as the minimum distance
between an observation from each of the two clusters. This approach is beneficial in
that it can detect elongated clusters, but in the process tends to sacrifice performance
in obtaining compact clusters. In the centroid method, the distance between two
clusters is the squared Euclidean distance between their corresponding centroids or
means. This method is able to handle outliers but may not perform as well as some
of the other methods, especially average linkage.
The CLUSTER procedure prints a history of the clustering process which lists the
observations joined in each step, and also gives statistics useful in estimating the true
number of clusters in the population from which the data were sampled. CLUSTER
also produces an output data set that can be used with the TREE procedure to draw
a tree diagram of the cluster hierarchy. This provides a visual record of which observations
were clustered and when. The output lists all stages of the procedure, from n clusters
down to a single cluster, so this option is mostly useful with small data sets.
The CLUSTER procedure was executed on the DIABETES data set, and the
results separated the population into the three clusters that, not surprisingly, matched
the underlying definition of the clinical groups. Figures 5.1 through 5.4 plot each
individual with its corresponding cluster number, which was determined by using
the centroid method to find three disjoint clusters. Cluster 1 represents the overt
diabetics, cluster 2 contains the chemical diabetics, and cluster 3 is the nondiabetic
subgroup. Some of the observations are hidden; that is, they are too close to each
other and show up as single points. Figure 5.1 shows the clusters when looking at the
variables GLUFAST versus GLUTEST, which, as you may recall, had a correlation of
.96 in the overall data set. The moderate correlation between GLUTEST and SSPG
resulted in the clusters in Figure 5.2 being similar to those in Figure 5.1. There may
still be evidence of clusters for uncorrelated variables, and sometimes the distinctions
between clusters are not very well defined, as seen in Figures 5.3 and 5.4, respectively.
This overlap indicates the number of existing clusters may be less than expected.
[Scatter plot omitted: Fasting Plasma Glucose versus Test Plasma Glucose,
points labeled by cluster number 1-3.]
Figure 5.1: Diabetes Cluster Analysis Plot of GLUFAST*GLUTEST
(72 observations hidden).
[Scatter plot omitted: Steady State Plasma Glucose versus Test Plasma Glucose,
points labeled by cluster number 1-3.]
Figure 5.2: Diabetes Cluster Analysis Plot of SSPG*GLUTEST
(35 observations hidden).
[Scatter plot omitted: Relative Weight versus Test Plasma Glucose,
points labeled by cluster number 1-3.]
Figure 5.3: Diabetes Cluster Analysis Plot of RELWT*GLUTEST
(19 observations hidden).
In most methods, when SAS was to determine only two clusters, the overt diabetics
were sorted into a cluster of their own, while the chemical diabetic cluster merged
with the normal cluster. Intuition may suggest that the two diabetic clusters would
merge and form a diabetic and a nondiabetic cluster. Apparently, chemical diabetics
are more similar to normals than to overt diabetics, as in Figure 5.4, where the plots
of chemical diabetics and nondiabetics, clusters 2 and 3, respectively, overlap.
[Scatter plot omitted: Plasma Insulin during Test versus Steady State Plasma
Glucose, points labeled by cluster number 1-3.]
Figure 5.4: Diabetes Cluster Analysis Plot of INSTEST*SSPG
(21 observations hidden).
Cluster analysis was not very helpful in this situation because the groups were well
defined. However, cluster analysis can be more useful in situations where the groups
are not as well defined. Cluster analysis was presented as a potential aid in dealing
with the type of data studied; therefore, its effectiveness depends on the definition of
the groups within the data and their degree of separation.
CHAPTER VI
CONCLUSION
Logistic regression analysis incorporates the familiarity and general principles of
linear regression, while offering a solution to the problem of dealing with the special
case of dichotomous response variables. An overview of logistic regression was presented,
yet only some of its many different applications have been discussed. The comparisons
throughout the discussion of the specific data sets show that more cumbersome
models are frequently required when linear regression is used instead of logistic
regression. Therefore, when response variables are not continuous, an alternative to
linear regression analysis should be sought.
Both the logistic and linear regression procedures agreed that the best models for
the two data sets contained eleven variables when interactions of the variables were
considered. The sums of squared errors were smaller for the logistic regression procedure
on both data sets. In the DIABETES data set, the logistic procedure significantly
outperformed the multiple linear regression procedure. For the PROSTATE data
set, logistic regression still outperformed the linear approach; however, the difference
between the sums of squared errors for the best models given by the two procedures
was minimal.
A data set can be investigated using clustering procedures to detect any underlying
separation within the data. Many techniques to identify these underlying groups or
clusters have been presented throughout the literature. Euclidean distance methods
are the most commonly used, since representing observations by points in a
many-dimensional space of their characteristics is intuitive. The cluster analysis
would verify the number of different categories with which to classify the observations
and determine whether logistic regression or some other categorical data modeling technique
should be used.
A next step in this research would be to investigate the use of principal components
to further aid in the classification of the data sets. The principal components
would specify the order of influence of the characteristics. They would also serve to
reduce the dimensionality of the characteristics by selecting the few most important
components involved in the prediction of the response variable.
REFERENCES
[1] Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley, p 13-18, 84-97.
[2] Berkson, J. (1955). Maximum Likelihood and Minimum χ² Estimates of the Logistic Function, Journal of the American Statistical Association, 50, 130-162.
[3] Cardell, N. and Steinberg, D. (1978, August). Estimating Logistic Regression Models Where the Dependent Variable has No Variance. Paper presented at the Joint Statistical Meetings of the American Statistical Association, San Francisco.
[4] Cardell, N. and Steinberg, D. (1987). Logistic Regression on Pooled Choice Based Samples and Samples Missing the Dependent Variable, American Statistical Association Proceedings of Social Statistics Section, 158-160.
[5] Cardell, N. and Steinberg, D. (1992). Estimating Logistic Regression Models Where the Dependent Variable has No Variance, Communications in Statistics: Theory and Methods, 21(2), 423-450.
[6] Duke, J. (1992). Sample Size and Estimated Odds Ratio in Logistic Regression: A Study with Repeated Samples from a Low Birth Weight Population, Ph.D. Dissertation, University of Oklahoma Health Sciences Center.
[7] Duran, B. and Odell, P. (1974). Cluster Analysis: A Survey, Lecture Notes in Economics and Mathematical Systems, 100, VI, p 1-30.
[8] Ferguson, T. (1967). Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press, 119-125.
[9] Gehan, E. (1959). Use of Medical Measurements to Predict the Course of Disease. Proceedings of Conference on Experimental Clinical Cancer Chemotherapy, Washington, D.C.: National Cancer Institute Monograph No. 3, 51-58.
[10] Hosmer, D. and Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley, p 1-47.
[11] Hogg, R. and Craig, A. (1978). Introduction to Mathematical Statistics, Englewood Cliffs, New Jersey: Prentice-Hall, Inc., p 168-179.
[12] Kleinbaum, D. (1994). Logistic Regression: A Self-Learning Text. New York: Springer-Verlag, p 1-61.
[13] Nottingham, Q. and Birch, J. (1998). A Note on the Small Sample Behavior of Logistic Regression in a Bioassay Setting, Journal of Biopharmaceutical Statistics, In press.
[14] Rylance, J. (1996). A Comparison of the Likelihood-Based Approach with Logistic Regression as a Method for Classification, Master's Thesis, North Dakota State University.
[15] SAS Institute Inc. (1995). Logistic Regression Examples Using the SAS System, Version 6, 1st Edition, Cary, NC: SAS Institute Inc., 163 pp.
[16] Tam, T. (1992). Binary Logistic Regression with Data That Have No Variance on the Dependent Variable: An Application to College Dropout Analysis, Ph.D. Dissertation, University of California, Los Angeles.
APPENDIX A
SAS CODE FOR DIABETES DATA SET
TO GENERATE VARIOUS REGRESSION RESULTS.
This code is an extension of what is given in SAS (1995).
options ls=72; data diabet;
infile diabet; input patient relwt glufast glutest instest sspg
group diab overt chem; label relwt = 'Relative Weight'
glufast = 'Fasting Plasma Glucose' glutest = 'Test Plasma Glucose' instest = 'Plasma Insulin during Test' sspg = 'Steady State Plasma Glucose' group = 'Clinical Group' diab = 'Diabetics (both)' overt = 'Overt Diabetics Only' chem = 'Chemical Diabetics Only';
/* Other variables defined for the interactions of each */ /* explanatory variable with all others including itself. */
rlwsq=relwt*relwt; rlw_glf=relwt*glufast; rlw_glt=relwt*glutest; rlw_ins=relwt*instest; rlw_sspg=relwt*sspg; glfsq=glufast*glufast; glf_glt=glufast*glutest; glf_ins=glufast*instest; glf_sspg=glufast*sspg; gltsq=glutest*glutest; glt_ins=glutest*instest; glt_sspg=glutest*sspg; inssq=instest*instest; ins_sspg=instest*sspg; sspgsq=sspg*sspg;
/* Runs a preliminary data analysis on the data set which */ /* includes finding the overall means for each variable */ /* and then each variable's mean for each of the subgroups */ /* separately. The correlation matrices are also found. */ /***********************************************************/
/* For the Overall Model */
proc means maxdec=4 n mean std; var relwt glufast glutest instest sspg; title 'Overall Diabetic Data Set';
proc corr; var relwt glufast glutest instest sspg;
run;
/* Separately for the Diabetics and Nondiabetics */ /**************•***********************************/
proc sort; by diab;
run; proc means maxdec=4 n mean std; by diab; var relwt glufast glutest instest sspg; title 'Diabetic Data Set'; title2 'Descriptive Statistics By DIAB';
proc corr; by diab; var relwt glufast glutest instest sspg;
run;
/**************************************************/
/* Separately for the Chemical Diabetics and Others */
proc sort; by chem;
run; proc means maxdec=4 n mean std; by chem; var relwt glufast glutest instest sspg; title 'Diabetic Data Set'; title2 'Descriptive Statistics By DIAB';
proc corr; by chem; var relwt glufast glutest instest sspg;
rim; /•*************************************************/
/* Separately for the Overt Diabetics and Others */ /***•**********************************************/
proc sort; by overt;
run; proc means maxdec=4 n mean std; by overt; var relwt glufast glutest instest sspg; title 'Diabetic Data Set'; title2 'Descriptive Statistics By OVERT';
proc corr; by overt; var relwt glufast glutest instest sspg;
run;
/* Investigate the one variable model first */ /* with linear then logistic regression. */
proc reg data=diabet; model diab= glutest; title 'Linear Regression of Diabetes Data';
run; proc logistic data=diabet descending;
model diab= glutest; title 'Logistic Regression of Diabetes Data';
run;
/* All original explanatory variables are investigated */ /* using the linear regression model selection techniques: */ /* Forward, backward and stepwise elimination. */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
/selection= forward; title 'Linear Regression of Diabetes Data';
run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg /selection= backward;
run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg /selection= stepwise;
run;
/**********************************************************/
/* Using the same options, now with logistic regression. */
proc logistic data=diabet descending; model diab= relwt glutest glufast instest sspg
/selection= forward; title 'Logistic Regression of Diabetes Data';
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg /selection= backward;
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg /selection= stepwise;
run;
/************************************************/
/* Now use the selection techniques specific to */ /* linear regression (adjrsq) and logistic */ /* regression (score). */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
/selection= adjrsq rsquare; title 'Linear Regression of Diabetes Data';
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg /selection= score;
title 'Logistic Regression of Diabetes Data'; run;
/***********************************************************/
/* All explanatory variables including interactions are */ /* investigated using the model selection techniques: */ /* forward, backward, and stepwise for linear regression. */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= forward;
title 'Linear Regression of Diabetes Data'; run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= backward;
run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= stepwise;
run;
/**********************************************************/
/* Using the same options, now with logistic regression. */ /**********************************************************/
proc logistic data=diabet descending; model diab= relwt glutest glufast instest sspg
rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= forward;
title 'Logistic Regression of Diabetes Data'; run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= backward;
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= stepwise;
run;
/* All explanatory variables including interactions are */ /* investigated using the model selection techniques which */ /* are specific to linear regression (adjrsq) and logistic */ /* regression (score). */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection=adjrsq rsquare best=2;
title 'Linear Regression of Diabetes Data'; run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection=score;
title 'Logistic Regression of Diabetes Data'; run;
APPENDIX B SAS CODE TO GENERATE CLUSTER ANALYSIS RESULTS
options ls=72; data diabet;
infile diabet; input patient relwt glufast glutest instest sspg
group diab overt chem; label relwt = 'Relative Weight'
glufast = 'Fasting Plasma Glucose' glutest = 'Test Plasma Glucose' instest = 'Plasma Insulin during Test' sspg = 'Steady State Plasma Glucose' group = 'Clinical Group' diab = 'Diabetics Group' overt = 'Overt Group' chem = 'Chemical Diabetic Group';
/* Before Cluster analysis can be done, the data */ /* must be in a certain form — sorted in order. */
proc sort data=diabet out=diabet2; by group;
run; /***************************************************/
/* Using average linkage method, first it clusters */ /* the data, and then a dendrogram is printed */ /* showing the clusters created at each step. */
proc cluster data=diabet2 method=average noprint outtree=tree; id patient;
run; proc tree horizontal sort height=n; run; /*************************************************************/
/* The data is sorted into three and then two clusters, with */ /* group number able to be compared to cluster number. */ /*************************************************************/
proc tree noprint out=out nclusters=3; copy patient relwt glutest glufast sspg instest group;
run; proc sort;
by cluster; run; proc print label uniform;
id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Average Linkage with Three Clusters';
run;
proc cluster data=diabet2 method=average noprint outtree=tree; id patient;
run; proc tree noprint out=out nclusters=2;
copy patient relwt glutest glufast sspg instest group; title 'Cluster Analysis: Average Linkage with Two Clusters';
run; proc sort;
by cluster; run; proc print label uniform;
id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Average Linkage with Two Clusters';
run;
/* Similar results requested, but this time using the */ /* complete linkage method of clustering. The data is */ /* again sorted into three and then two clusters, with */ /* group number able to be compared to cluster number. */
proc cluster data=diabet2 method=complete noprint outtree=tree; id patient;
run; proc tree noprint out=out nclusters=3;
copy patient relwt glutest glufast sspg instest group; title 'Cluster Analysis: Complete Linkage with Three Clusters';
run; proc sort;
by cluster; run; proc print label uniform;
id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Complete Linkage with Three Clusters';
run; proc cluster data=diabet2 method=complete noprint outtree=tree;
id patient; run; proc tree noprint out=out nclusters=2;
copy patient relwt glutest glufast sspg instest group; title 'Cluster Analysis: Complete Linkage with Two Clusters';
run; proc sort;
by cluster; run;
proc print label uniform; id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Complete Linkage with Two Clusters';
run;
/* Similar results requested, but this time using the */ /* single linkage method of clustering. The data is */ /* again sorted into three and then two clusters, with */ /* group number able to be compared to cluster number. */
proc cluster data=diabet2 method=single noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=3;
   copy patient relwt glutest glufast sspg instest group;
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Single Linkage with Three Clusters';
run;
proc cluster data=diabet2 method=single noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=2;
   copy patient relwt glutest glufast sspg instest group;
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Single Linkage with Two Clusters';
run;
/********************************************************/
/* Similar results requested, but this time using the   */
/* centroid method of clustering. The data is again     */
/* sorted into three and then two clusters, with group  */
/* number able to be compared to cluster number.        */
/********************************************************/
proc cluster data=diabet2 method=centroid noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=3;
   copy patient relwt glutest glufast sspg instest group;
   title 'Cluster Analysis: Centroid Method with Three Clusters';
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Centroid Method with Three Clusters';
run;
proc cluster data=diabet2 method=centroid noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=2;
   copy patient relwt glutest glufast sspg instest group;
   title 'Cluster Analysis: Centroid Method with Two Clusters';
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Centroid Method with Two Clusters';
run;
/***********************************************************/
/* Now to visually show the clusters, plots can be created */
/* for any two variables, and the underlying groups should */
/* be apparent in the graph. Need to reestablish three     */
/* clusters, otherwise will use last clustering completed. */
/***********************************************************/
proc cluster data=diabet2 method=centroid noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=3;
   copy relwt glutest glufast sspg instest;
run;
proc sort;
   by cluster;
run;
proc plot;
   plot relwt*glutest=cluster;
   title 'Diabetes Cluster Analysis Plot';
run;
proc plot;
   plot sspg*glutest=cluster;
run;
proc plot;
   plot instest*sspg=cluster;
run;
proc plot;
   plot glufast*glutest=cluster;
run;
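The comment boxes above note that the cluster numbers are meant to be compared with the known group numbers. A minimal sketch of how that comparison could be tabulated directly, assuming the OUT= data set from PROC TREE still contains both the copied GROUP variable and the assigned CLUSTER variable, is a cross-tabulation with PROC FREQ (this step is an illustration, not part of the original program):

```sas
/* Hypothetical check: cross-tabulate known group against    */
/* assigned cluster. A strong diagonal pattern in the table  */
/* would indicate close agreement between the clustering     */
/* and the underlying groups.                                */
proc freq data=out;
   tables group*cluster / norow nocol nopercent;
   title 'Agreement Between Known Groups and Assigned Clusters';
run;
```

Each cell of the resulting table counts the patients with a given group label that were placed in a given cluster, so label-switching between group and cluster numbers is easy to spot by eye.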
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a
master's degree at Texas Tech University or Texas Tech University Health Sciences
Center, I agree that the Library and my major department shall make it freely
available for research purposes. Permission to copy this thesis for scholarly
purposes may be granted by the Director of the Library or my major professor.
It is understood that any copying or publication of this thesis for financial gain
shall not be allowed without my further written permission and that any user
may be liable for copyright infringement.
Agree (Permission is granted.)
Student's Signature                    Date
Disagree (Permission is not granted.)
Student's Signature Date