LOGISTIC REGRESSION APPLICATIONS
AND CLUSTER ANALYSIS
by
JENNIFER KRISTI PETERSON, B.A.
A THESIS
IN
STATISTICS
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
December, 1998
ACKNOWLEDGMENTS
Thanks goes out to my thesis advisor, Dr. Duran, for his advice, training, and
support. He always had a kind word to say, even when things veered a little off course.
Thanks also to Dr. Mansouri, for serving on my committee, and posing his questions
in a timely manner. Dr. Duran and Dr. Mansouri were also terrific professors
who assigned numerous projects, and gave challenging tests, but encouraged me to
continue working and to do my best. For that, I graciously thank them both. Thank
you to Dr. Bennett, my graduate advisor, who always seemed happy to see me, and
even when he was extremely busy, would take the time to ask how things were going.
I also want to thank my family and friends. To my parents, grandparents, and
younger brother, thanks for the support and encouragement from afar. I especially
wish to acknowledge those close friends who acted as my "second" family. To all who
helped by reading through drafts, and correcting whatever part they could, I truly
appreciate all that you have done. I would like to also acknowledge the SAS Institute
Inc., for permission to reproduce and analyze their data sets. In conclusion, thank
you to everyone who has helped me grow throughout my graduate school experience.
CONTENTS
ACKNOWLEDGMENTS ii
LIST OF TABLES iv
LIST OF FIGURES v
I. INTRODUCTION 1
II. BACKGROUND AND PRELIMINARIES 3
III. SOME APPLICATIONS OF LOGISTIC REGRESSION 12
IV. REGRESSION ON TWO DATA SETS 17
4.1 Summary Statistics of Data Sets 17
4.2 Logistic Regression on the Data Sets 23
4.2.1 Logistic Regression on DIABETES Data Set 23
4.2.2 Logistic Regression on PROSTATE Data Set 28
4.3 Linear Regression on the Data Sets 29
4.3.1 Linear Regression on DIABETES Data Set 29
4.3.2 Linear Regression on PROSTATE Data Set 32
4.4 Comparison of Logistic and Linear Regression Analyses 34
V. CLUSTER ANALYSIS 41
5.1 What is Cluster Analysis? 41
5.2 Cluster Analysis on the DIABETES Data Set 46
VI. CONCLUSION 54
REFERENCES 56
APPENDIX A: SAS CODE FOR DIABETES DATA SET TO GENERATE VARIOUS REGRESSION RESULTS 59
APPENDIX B: SAS CODE TO GENERATE CLUSTER ANALYSIS RESULTS 65
LIST OF TABLES
4.1 Description of Variables in DIABETES Data Set 18
4.2 Descriptive Statistics for DIABETES Data 19
4.3 Correlation Matrix for Overall DIABETES Data 20
4.4 Description of Variables in PROSTATE Data Set 21
4.5 Descriptive Statistics for PROSTATE Data 22
4.6 Correlation Matrix for Overall PROSTATE Data 23
4.7 Best Logistic and Linear Regression Model for Each Data Set 35
LIST OF FIGURES
4.1 Distribution of p̂ For Best Logistic Model on DIABETES Data 37
4.2 Distribution of p̂ For Best Logistic Model on PROSTATE Data 38
4.3 Distribution of p̂ For Best Linear Model on DIABETES Data 39
4.4 Distribution of p̂ For Best Linear Model on PROSTATE Data 40
5.1 Diabetes Cluster Analysis Plot of GLUFAST*GLUTEST 49
5.2 Diabetes Cluster Analysis Plot of SSPG*GLUTEST 50
5.3 Diabetes Cluster Analysis Plot of RELWT*GLUTEST 51
5.4 Diabetes Cluster Analysis Plot of INSTEST*SSPG 52
CHAPTER I
INTRODUCTION
Logistic regression is a mathematical modeling approach in which the best-fitting,
yet least-restrictive model is desired to describe the relationship between several
independent explanatory variables and a dependent dichotomous response variable.
In many regression applications the response or dependent variable of interest is
continuous, and therefore, can take on an infinite number of values with no upper
or lower bounds. Researchers determine the importance of each of the independent
explanatory variables in predicting the response variable. Then, they generate a
model based on their findings and evaluate the appropriateness of the model
using different statistical measures, such as goodness of fit tests. If the model is
successful, it can be used to predict the mean response of the response variable for
a large range of conditions. When the response variable is categorical or dichotomous,
this least squares linear regression approach should be replaced by logistic
regression or some other categorical data modeling technique. The fundamental
difference between logistic regression and least squares linear regression is that the
response variable is constrained to a limited number of integer values. A dependent
dichotomous response variable, with values limited to 0 or 1, is the most common
one. One reason for using logistic regression analysis is that it offers a technique
to solve problems within the familiar context of multiple linear regression analysis.
There are, of course, differences between the two procedures, especially in setting
up the parametric model and considering the underlying assumptions, but generally,
the same principles apply.
The objectives of this thesis are (1) to give a brief overview of the logistic regression
model, (2) review some applications of the model, (3) present a comparison of the
logistic regression model with the standard multiple regression model via examples,
and (4) consider the use of cluster analysis as an aid in determining the groupings for
a logistic regression analysis.
This paper contains an overview of logistic regression and a brief discussion
of cluster analysis along with applications of each. Chapter II covers the basic
preliminary ideas surrounding logistic regression including how it differs from linear
regression. Different estimation techniques are also discussed. Chapter III considers
various areas of logistic regression through summaries of four concrete applications.
Chapter IV specifically considers the analysis of two data sets by beginning with a
preliminary analysis and exploring different model selection techniques to find the
most appropriate model. These same concepts are reconsidered when interactions
among the variables are also included.
At some point it is useful to investigate whether or not the sample actually has
an underlying separation into groups or clusters. This investigation is executed using
different clustering procedures, many of which are based on the Euclidean definition
of distance. The different characteristics of an individual pinpoint its coordinate in
k-dimensional Euclidean space, and then distances between individuals and groups of
individuals are defined. A more thorough evaluation of cluster analysis is addressed
in Chapter V, where several variations of the procedure are discussed. This chapter
also includes an example of clustering which verifies the underlying groups assumed
in the logistic regression of one of the data sets in Chapter IV. A summary of results
and conclusions is contained in Chapter VI.
CHAPTER II
BACKGROUND AND PRELIMINARIES
In this chapter, the differences between logistic regression and linear regression
analyses will be explained using the corresponding mathematical models. Also
included is a generalized description of logistic regression, risk and odds ratios and
corresponding parameter estimation methods.
In least squares linear regression, the dependent response variable $Y$ is
conditioned on the given value of the vector $\mathbf{x}$ of $k$ independent explanatory variables,
$\mathbf{x} = (x_1, x_2, \ldots, x_k)$. This relationship is expressed as
$$E(Y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k. \tag{2.1}$$
With the dichotomous property of the response variable, $Y$, the conditional mean,
$E(Y_i \mid \mathbf{x})$, must take on values equal to or between 0 and 1, where $Y_i$ is a Bernoulli
random variable with $P(Y_i = 1) = \pi$ and $P(Y_i = 0) = 1 - \pi$, for $i = 1, 2, \ldots, n$ with
$n$ samples. Substitution gives $E(Y_i) = 1(\pi) + 0(1 - \pi) = \pi$, or $E(Y_i \mid \mathbf{x}) = \pi$. Thus, the
response variable $E(Y_i)$ is the probability that $Y_i = 1$ given a particular vector $\mathbf{x}$. The
required assumption that $\pi$ be constrained such that $0 \le E(Y) = \pi \le 1$ cannot be
met using the usual linear regression model. Therefore, the use of the logistic model,
where the probabilities 0 and 1 are reached asymptotically, is much more appropriate.
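The boundary problem above can be seen numerically. The following Python sketch uses entirely made-up coefficients (the thesis fits its models in SAS, not Python): a least-squares line fitted to a 0/1 response can produce "probabilities" outside $[0, 1]$, while the logistic curve cannot.

```python
import numpy as np

# Illustrative sketch with hypothetical coefficients, not estimates from the
# thesis data: a fitted least-squares line can predict outside [0, 1] for a
# 0/1 response, while the logistic curve stays strictly between 0 and 1.
x = np.array([-3.0, 0.0, 3.0])

b0, b1 = 0.5, 0.4   # linear model:   E(Y|x) = b0 + b1*x
a, b = 0.0, 1.2     # logistic model: pi(x) = 1 / (1 + exp(-(a + b*x)))

linear_pred = b0 + b1 * x
logistic_pred = 1.0 / (1.0 + np.exp(-(a + b * x)))

print(linear_pred)    # the endpoints fall below 0 and above 1
print(logistic_pred)  # every value lies strictly in (0, 1)
```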
The parametric model of logistic regression is based on the logistic distribution.
The logistic cumulative distribution function (c.d.f.) is given by
$$F(y) = \frac{1}{1 + \exp[-y]}, \qquad -\infty < y < \infty, \tag{2.2}$$
and the logistic probability density function (p.d.f.) is given by
$$f(y) = \frac{\exp[-y]}{(1 + \exp[-y])^2}, \qquad -\infty < y < \infty. \tag{2.3}$$
With the inclusion of location and scale parameters, $\mu$ and $\sigma$, the c.d.f. and p.d.f.
become, respectively,
$$F(y) = \frac{1}{1 + \exp\left[-\left(\frac{y - \mu}{\sigma}\right)\right]} \tag{2.4}$$
and
$$f(y) = \frac{\exp\left[-\left(\frac{y - \mu}{\sigma}\right)\right]}{\left(1 + \exp\left[-\left(\frac{y - \mu}{\sigma}\right)\right]\right)^2} \cdot \frac{1}{\sigma}. \tag{2.5}$$
Generally there is a response variable of primary interest that depends upon
$k$ independent explanatory variables $x_1, \ldots, x_k$, for which $\alpha$ becomes the location
parameter and the $\beta_j$, for $j = 1, \ldots, k$, become the scale parameters.
Even though there are different approaches to the estimation of the logistic model,
the general response function or logistic regression function is always given by
$$\pi(\mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\alpha + \sum_j \beta_j x_j\right)\right]}, \tag{2.6}$$
which can be found by algebraically manipulating the c.d.f.
$$F(y) = \frac{1}{1 + \exp[-y]} = \frac{\exp[y]}{1 + \exp[y]} \tag{2.7}$$
and substituting the alternative location and scale parameters suggested above. The
conditional distribution of $y = \pi(\mathbf{x}) + \varepsilon$ is binomial with $p = \pi(\mathbf{x})$.
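As a minimal sketch of the response function (2.6), the Python fragment below evaluates $\pi(\mathbf{x})$ for an arbitrary, purely illustrative choice of $\alpha$ and $\beta_j$ (these are not estimates from any data set in this thesis) and checks numerically that the two algebraic forms of the c.d.f. in (2.7) agree.

```python
import numpy as np

# Sketch of the logistic response function pi(x); the coefficient values are
# arbitrary illustrations, not estimates from the thesis data.
def pi(x, alpha, beta):
    """pi(x) = 1 / (1 + exp[-(alpha + sum_j beta_j * x_j)])."""
    z = alpha + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

alpha = -1.0
beta = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

p = pi(x, alpha, beta)

# the two algebraic forms of the c.d.f. (2.7) agree:
z = alpha + beta @ x
assert np.isclose(p, np.exp(z) / (1.0 + np.exp(z)))
print(p)
```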
Given specific data values, the model parameters, $\alpha$ and $\beta_j$, can be "fit" using some
method of estimation to find the point estimates, $\hat{\alpha}$ and $\hat{\beta}_j$. These point estimates,
when substituted into the logistic model
$$P(Y = 1 \mid x_1, \ldots, x_k) = \hat{P}(\mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\hat{\alpha} + \sum_j \hat{\beta}_j x_j\right)\right]}, \tag{2.8}$$
form the predicted risk, or the estimated probability of disease, $\hat{P}(\mathbf{x})$. Notice that this
is simply the logistic c.d.f. with alternate location and scale parameters included.
The risk ratio (RR) is formed when the predicted risk of one individual, separated
by only one dichotomous explanatory variable, is compared to that of another individual.
The effect of that particular variable is shown by comparing it to the base risk. This
results in the formula
$$RR = \frac{\hat{P}(\mathbf{x}_1)}{\hat{P}(\mathbf{x}_0)}, \tag{2.9}$$
where $\hat{P}(\mathbf{x}_i)$ is the predicted risk of an individual whose dichotomous variable takes
the value $i$.
Unfortunately, this method of estimating the risk ratio is restricted to follow-
up studies as opposed to cross-sectional or case-control studies. In the design of a
follow-up study, the explanatory variables are observed followed by the observation
of the response variable. In the design of a cross-sectional study, subjects are sampled
and simultaneously classified according to the response variable and explanatory
characteristics. A case-control study looks into the past to find the information on
the individuals. Additionally, the risk ratio requires that the explanatory variables
must be known and specified, not just held constant. If either of these conditions are
not met, the risk ratio cannot be determined directly, and some alternative approach
must be used.
Effects in the logistic model refer to odds, or the likelihood that a particular
situation will occur. The estimated odds for the individual specified by $\mathbf{x}$ is the
probability that the event will occur divided by the probability it will not occur,
$$\text{Odds for } \mathbf{x} = \frac{\hat{P}(\mathbf{x})}{1 - \hat{P}(\mathbf{x})}. \tag{2.10}$$
The ratio of odds, called the odds ratio, is a measure of association comparing the
odds of two individuals, that is,
$$\text{Odds ratio} = OR = \frac{\text{odds for } \mathbf{x}_1}{\text{odds for } \mathbf{x}_0}. \tag{2.11}$$
A major advantage of the odds ratio is that it is the only measure of association
directly estimated from the logistic model that does not require any special assumptions
regarding the study design. The use of the odds ratio requires only the assumption
that $OR$ is a good approximation for the risk ratio. This approximation is accurate for
"rare" response variables (i.e., response variables which occur with a low probability).
The logistic regression function is not a linear function; however, it can be linearized
by applying the logit transformation to it. By definition, the logit transformation is
the log of the odds for a particular vector $\mathbf{x}$, that is,
$$\text{logit } \hat{P}(\mathbf{x}) = \ln\left[\frac{\hat{P}(\mathbf{x})}{1 - \hat{P}(\mathbf{x})}\right] = \ln\left[\frac{1/(1 + \exp[-z])}{\exp[-z]/(1 + \exp[-z])}\right] = z. \tag{2.12}$$
In the case of logistic regression, the logit transformation (2.12), with $z = \alpha + \sum_j \beta_j x_j$,
is used as the logistic regression model since it has a simple reduced form that is similar
to the general model used for usual linear regression (2.1).
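The linearization can be confirmed numerically. This sketch, with invented coefficients (not values from the thesis), applies the log-odds to the logistic response and recovers the linear predictor, as the logit transformation (2.12) promises.

```python
import numpy as np

# Sketch of the logit transformation: applying log-odds to the logistic
# response recovers the linear predictor. Coefficients here are illustrative.
alpha, beta = -0.5, np.array([1.0, 2.0])
x = np.array([0.3, -0.7])

z = alpha + beta @ x                  # linear predictor
p = 1.0 / (1.0 + np.exp(-z))          # logistic response
logit_p = np.log(p / (1.0 - p))       # log of the odds

# logit P(x) = alpha + sum_j beta_j * x_j
assert np.isclose(logit_p, z)
```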
The parameters in the logistic regression model have specific interpretations.
There are two interpretations for $\alpha$. The first interpretation is that $\alpha$ is the log
odds for an individual having $x_j = 0$ for all $k$ explanatory variables. This is usually
unproductive because it is not meaningful to give explanatory variables, such as
weight or age, the value of zero. The alternative interpretation for $\alpha$ is that it is
the background or baseline odds. In other words, it is a baseline risk in which all
explanatory variables are ignored. The interpretation for $\beta_j$ is the change in log odds,
or logit, when the change in $x_j$ is 1 but all other $x_j$'s are fixed.
The specific odds ratio for the logistic c.d.f. can be obtained when the logistic
model, $\hat{P}(\mathbf{x})$, is applied to find the odds for the two groups, $\mathbf{x}_1$ and $\mathbf{x}_0$. The resulting
formula is called the risk odds ratio (ROR), since the probabilities in the odds ratio
are all defined as risks, and is given by
$$ROR(\mathbf{x}_1, \mathbf{x}_0) = \frac{\hat{P}(\mathbf{x}_1)/[1 - \hat{P}(\mathbf{x}_1)]}{\hat{P}(\mathbf{x}_0)/[1 - \hat{P}(\mathbf{x}_0)]} = \frac{\exp\left[-\left(\hat{\alpha} + \sum_j \hat{\beta}_j x_{0j}\right)\right]}{\exp\left[-\left(\hat{\alpha} + \sum_j \hat{\beta}_j x_{1j}\right)\right]} = \prod_j \exp[\hat{\beta}_j (x_{1j} - x_{0j})]. \tag{2.13}$$
Therefore, using the logistic model there is a multiplicative contribution of the
explanatory variables to the odds ratio. A model other than the logistic regression
model might have an alternative contribution of the variables.
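The multiplicative form of (2.13) is easy to verify numerically. In the sketch below, all coefficients and covariate vectors are invented for illustration; the odds ratio computed directly from the fitted probabilities agrees with the product over $\exp[\hat{\beta}_j (x_{1j} - x_{0j})]$.

```python
import numpy as np

# Sketch of the risk odds ratio with illustrative coefficients: the ratio of
# odds for two covariate vectors reduces to a product of exp[beta_j * (x1j - x0j)].
alpha = 0.2
beta = np.array([0.7, -0.3])
x1 = np.array([1.0, 2.0])   # the two vectors differ only in their first entry
x0 = np.array([0.0, 2.0])

def risk(x):
    return 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))

odds = lambda x: risk(x) / (1.0 - risk(x))
ror_from_odds = odds(x1) / odds(x0)
ror_multiplicative = np.prod(np.exp(beta * (x1 - x0)))

assert np.isclose(ror_from_odds, ror_multiplicative)
```

Because only the first coordinate changes, the whole product collapses to the single factor $\exp(\hat{\beta}_1)$, which is the usual per-unit odds-ratio interpretation of a coefficient.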
As mentioned previously, some method of estimation must be used to find point
estimates for $\alpha$ and $\beta_j$ for $j = 1, 2, \ldots, k$. In maximum likelihood estimation, the
leading method, one finds the maximum likelihood estimates (MLEs) of the parameters
by taking the partial derivatives of the log likelihood function with respect to
the parameters. Setting these partial derivatives equal to zero allows the resulting
equations to be solved simultaneously for the estimates. In the case of the logistic
model, we need to maximize the likelihood function based on $\pi(\mathbf{x})$. First, consider the
specific case of only one explanatory variable, where $\mathbf{x} = x_i$. The pair of values, $(x_i, y_i)$,
contributes to the likelihood function by
$$\ell(x_i) = [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}, \qquad i = 1, 2, \ldots, n, \tag{2.14}$$
where $n$ is the sample size. Therefore, the log likelihood function is
$$L(\alpha, \beta) = \ln[\ell(\alpha, \beta)] = \ln\left[\prod_i \ell(x_i)\right]. \tag{2.15}$$
After taking partials with respect to each of the two parameters, $\alpha$ and $\beta$, the
equations to solve simultaneously are
$$\sum_i [y_i - \pi(x_i)] = 0 \qquad \text{and} \qquad \sum_i x_i [y_i - \pi(x_i)] = 0. \tag{2.16}$$
Using a generalized iteratively reweighted least squares procedure (such as Newton-Raphson),
solutions can be found. An interesting consequence of the above equations
is that
$$\sum_i y_i = \sum_i \pi(x_i), \tag{2.17}$$
that is, the sum of the observed values is equal to the sum of the predicted values.
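A minimal Newton-Raphson fit makes both the score equations (2.16) and the consequence (2.17) concrete. The tiny data set below is made up purely for illustration and is unrelated to the thesis's data; the thesis's own computations use SAS.

```python
import numpy as np

# Sketch of maximum likelihood fitting by Newton-Raphson for one explanatory
# variable; the tiny data set is invented purely for illustration.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

X = np.column_stack([np.ones_like(x), x])   # design matrix rows [1, x_i]
theta = np.zeros(2)                         # parameters (alpha, beta)

for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))    # pi(x_i) at current parameters
    grad = X.T @ (y - p)                    # the score equations (2.16)
    W = p * (1.0 - p)
    hess = -(X.T * W) @ X                   # Hessian of the log likelihood
    theta = theta - np.linalg.solve(hess, grad)

p = 1.0 / (1.0 + np.exp(-X @ theta))
# consequence (2.17): the sum of observed values equals the sum of fitted values
assert np.isclose(y.sum(), p.sum())
```

At convergence the gradient is (numerically) zero, so both equations in (2.16) hold, and summing the first one gives (2.17) directly.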
For the general case where there are multiple ($k$) explanatory variables, the
method is similar, and only the resulting equations to solve simultaneously will vary.
There will be double subscripts, one for the sample size, $i$, and one for the explanatory
variables, $j$. Also, the partials will have to be taken with respect to the vector $\boldsymbol{\beta}$,
instead of the single parameter $\beta$ as in equation (2.15). After taking the partials, the
$(k + 1)$ equations to solve simultaneously are
$$\sum_{i=1}^n [y_i - \pi(\mathbf{x}_i)] = 0$$
and
$$\sum_{i=1}^n [y_i - \pi(\mathbf{x}_i)] x_{ij} = 0 \qquad \text{for } j = 1, 2, \ldots, k, \tag{2.18}$$
where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$. Then the corresponding interesting result is
$$\sum_i y_i = \sum_i \pi(\mathbf{x}_i), \tag{2.19}$$
which is similar to equation (2.17).
The information matrix, whose inverse provides the asymptotic covariance matrix,
is formed by taking the negative second derivatives of the log likelihood function with
respect to all distinct pairs of the location and scale parameters. For the ML method
with the vector of parameters $[\alpha, \beta_1, \beta_2, \ldots, \beta_k]$, the second partials are
$$\frac{\partial^2 L(\alpha, \boldsymbol{\beta})}{\partial \alpha^2} = -\sum_{i=1}^n [1 - \pi(\mathbf{x}_i)][\pi(\mathbf{x}_i)], \qquad \frac{\partial^2 L(\alpha, \boldsymbol{\beta})}{\partial \alpha \, \partial \beta_p} = -\sum_{i=1}^n x_{ip} [1 - \pi(\mathbf{x}_i)][\pi(\mathbf{x}_i)], \tag{2.20}$$
and
$$\frac{\partial^2 L(\alpha, \boldsymbol{\beta})}{\partial \beta_p \, \partial \beta_q} = -\sum_{i=1}^n x_{ip} x_{iq} [1 - \pi(\mathbf{x}_i)][\pi(\mathbf{x}_i)], \tag{2.21}$$
where $p$ and $q$, respectively, identify the entry's row and column in the matrix, $\beta_p$
represents the $(p + 1)$st element in the vector of parameters, and $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$.
Recall that $i = 1, 2, \ldots, n$ indexes the individuals while $j = 1, 2, \ldots, k$ indexes the
explanatory variables.
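The weighted-sum structure of the entries in (2.20)-(2.21) can be assembled in a few lines. In this sketch the parameter values and the single-covariate design are invented for illustration; the information matrix is built as $Z^\top \operatorname{diag}(W) Z$, where $Z$ has rows $(1, x_{i1}, \ldots, x_{ik})$ and $W_i = \pi(\mathbf{x}_i)(1 - \pi(\mathbf{x}_i))$.

```python
import numpy as np

# Sketch of the information matrix entries: every second partial reduces to a
# weighted sum with weights pi(x_i)(1 - pi(x_i)). All values are illustrative.
alpha, beta = 0.1, np.array([0.5])
X = np.array([[0.0], [1.0], [2.0], [3.0]])      # x_i1 for n = 4 individuals

Z = np.column_stack([np.ones(len(X)), X])       # rows (1, x_i1)
p = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
W = p * (1.0 - p)

# information matrix: entry (p, q) is sum_i z_ip * z_iq * pi(x_i)(1 - pi(x_i))
info = (Z.T * W) @ Z

assert np.isclose(info[0, 0], W.sum())               # the alpha-alpha entry
assert np.isclose(info[0, 1], (X[:, 0] * W).sum())   # the alpha-beta entry
```

The matrix is symmetric by construction, mirroring the fact that the mixed partials in (2.20) do not depend on the order of differentiation.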
Two additional estimation methods, minimum $\chi^2$ and minimum-logit $\chi^2$, provide
alternatives to the ML method. The minimum $\chi^2$ method is mostly used in bio-assay
applications where the explanatory variables, $x_j$ for $j = 1, \ldots, k$, represent the $k$
dose levels with $n_j$ test subjects at each level. The value $y_j$ represents the number
of positive responses at the respective level among $x_1, x_2, \ldots, x_k$. So $Y_j$ has a binomial
distribution with parameters $n_j$ and $P(x_j) = (1 + \exp[-(\alpha + \beta_j x_j)])^{-1}$. In minimum
$\chi^2$ estimation, the values $\hat{\alpha}$ and $\hat{\beta}_j$ are found by minimizing the $\chi^2$ statistic,
$$\chi^2 = \sum_j \frac{n_j (\hat{P}(x_j) - P(x_j))^2}{P(x_j)(1 - P(x_j))}, \tag{2.22}$$
where $\hat{P}(x_j) = \frac{y_j}{n_j}$. As in the ML method, the simultaneous equations to be solved
are
$$\sum_j \frac{n_j (\hat{P}(x_j) - P(x_j))}{P(x_j)(1 - P(x_j))} = 0$$
and
$$\sum_j \frac{n_j x_j (\hat{P}(x_j) - P(x_j))}{P(x_j)(1 - P(x_j))} = 0. \tag{2.23}$$
Since the coefficients of these equations are functions of $P(x_j)$, which itself is an
estimate and is not linear in the parameters, the minimum $\chi^2$ method, like the maximum
likelihood method, generally requires iterative techniques to find the solutions of the
simultaneous equations (Berkson, 1955).
The minimum-logit $\chi^2$ estimates, $\hat{\alpha}$ and $\hat{\beta}_j$, are found by minimizing the logit $\chi^2$
statistic, defined by
$$\text{logit } \chi^2 = \sum_j n_j \hat{P}(x_j)(1 - \hat{P}(x_j)) \left[\text{logit } \hat{P}(x_j) - \text{logit } P(x_j)\right]^2, \tag{2.24}$$
where $\text{logit } P(x_j) = \alpha + \beta_j x_j$ and $\text{logit } \hat{P}(x_j) = \hat{\alpha} + \hat{\beta}_j x_j$, as shown in equation (2.12).
The normal equations for obtaining minimum-logit $\chi^2$ estimates of $\alpha$ and $\beta_j$ are
$$\sum_j n_j \hat{P}(x_j)(1 - \hat{P}(x_j)) \left[\text{logit } \hat{P}(x_j) - \text{logit } P(x_j)\right] = 0$$
and
$$\sum_j n_j \hat{P}(x_j)(1 - \hat{P}(x_j)) \, x_j \left[\text{logit } \hat{P}(x_j) - \text{logit } P(x_j)\right] = 0. \tag{2.25}$$
The evaluation of equation (2.25) requires the use of a procedure that simply takes
the least-squares solution of the straight line having slope $\hat{\beta}$ and intercept $\hat{\alpha}$, with
$n_j \hat{P}(x_j)(1 - \hat{P}(x_j))$ as the weights of the known observations (Berkson, 1955).
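That weighted least-squares view can be sketched in a few lines of Python. The dose-response counts below are invented for illustration (they come from no real bioassay): the empirical logits are regressed on dose with weights $n_j \hat{P}(x_j)(1 - \hat{P}(x_j))$, yielding the minimum-logit $\chi^2$ estimates in a single non-iterative step.

```python
import numpy as np

# Sketch of Berkson's minimum-logit chi-square: one weighted least-squares fit
# of the empirical logits on dose. The counts below are entirely invented.
dose = np.array([1.0, 2.0, 3.0, 4.0])
n = np.array([40, 40, 40, 40])     # n_j subjects per dose level
pos = np.array([5, 14, 27, 35])    # y_j positive responses per level

p_hat = pos / n
logit_hat = np.log(p_hat / (1.0 - p_hat))
w = n * p_hat * (1.0 - p_hat)      # weights n_j * p_hat * (1 - p_hat)

# weighted least squares of logit_hat on [1, dose]
Z = np.column_stack([np.ones_like(dose), dose])
A = (Z.T * w) @ Z
b = (Z.T * w) @ logit_hat
alpha_hat, beta_hat = np.linalg.solve(A, b)

print(alpha_hat, beta_hat)         # the minimum-logit chi-square estimates
```

The non-iterative solution is exactly the "ease of evaluation" credited to the method in the text, in contrast with the iterative fitting required by the ML and minimum $\chi^2$ methods.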
The criteria for judging which of the several estimation techniques provides the
"best" parameter estimates are numerous. One frequently used method compares
the size of the mean squared errors of the estimates, that is, the expected value of
the squared difference of the estimate from the true value of the parameter. The
minimum-logit $\chi^2$ estimate has a smaller mean squared error than either the ML
estimate or the minimum $\chi^2$ estimate. This lower mean squared error makes the
minimum-logit $\chi^2$ estimates appear to be the best. However, because the ML estimates
are functions of the sufficient statistics, $\sum_j y_j$ and $\sum_j x_j y_j$, a highly desirable
quality, and the minimum-logit $\chi^2$ estimates are not, the ML method becomes the
preferable technique. Fortunately, minimum-logit $\chi^2$ estimates can be improved by
the use of the Rao-Blackwell Theorem to become functions of the sufficient statistics,
but they lose their ease of evaluation over the original minimum-logit $\chi^2$ estimates in
the process (Ferguson, 1967).
CHAPTER III
SOME APPLICATIONS OF LOGISTIC REGRESSION
Logistic regression has received much attention, during the latter part of the
twentieth century, as a viable technique for analyzing a dichotomous response variable.
It has been used successfully in many areas of statistical modeling. In this chapter
certain areas of logistic regression will be illustrated through the discussion of four
real concrete applications. These examples will serve to give the reader some idea of
the wide applicability of logistic regression.
The first application builds on the Cardell-Steinberg estimator. The Cardell-Steinberg
estimator, evaluated for general use by Tam (1992), is an alternative to
the logistic regression model (2.12) that is valuable as a method of finding parameter
estimates. Tam's research on the Cardell-Steinberg estimator is primarily concerned
with the idea that even for "choice-restricted" samples, when the samples contain
information on only one value ($y = 1$) of the dependent response variable, the binary
logistic regression model could still be estimated. This concept is noted throughout
the research of Cardell and Steinberg (1978, 1987, 1992). In addition, the research
of Cardell and Steinberg (1987) indicates that pooling the "choice-restricted" sample
with a supplementary sample, a sample that contains information on both values
of the response variable, when the probability of y = 1 can be correctly estimated,
allows consistent estimators to be found. However, when the probability is improperly
estimated, the parameter estimates have a definite increase in their percent bias and
root mean square errors. The research conducted by Tam (1992) confirms these
findings and offers results that support Cardell's conjecture that percent bias and
root mean square errors of the estimates for the ratio of slopes would remain at a low
level regardless of the correctness of the estimation of the probability.
As a test of her research, Tam (1992) applied the Cardell-Steinberg estimator to
college dropout data for the freshman classes of 1984 and 1988 at UCLA. The
application attempts to identify student pre-enrollment characteristics that predict
undergraduate withdrawal in advance of degree completion. Admission records for about
16,000 undergraduate students, including information on gender, age, interaction of
gender with age, ethnicity, self-reported high school grade point average, SAT verbal
and mathematics scores, majors applied to, and high school type, represent the
characteristics considered as explanatory variables. The response variable was coded 1 if
the student was determined a dropout by not registering for two consecutive quarters,
and not having received a degree. Logistic regression and Cardell-Steinberg analyses
were found to be similar for the class of 1988 in areas such as age, race, high school
GPA, SAT scores, and area of study. Yet, for the class of 1984, the logistic regression
analyses did not find the age factor to be significant, but the Cardell-Steinberg
analyses indicated that student age had a pronounced predictive power.
The second application is an evaluation of the performance of two classification
methods using misclassification probabilities and was investigated by Rylance (1996).
Simulations were completed on data from two different distributions, the bivariate
normal and bivariate uniform. Logistic regression (2.12) was performed on the data
and misclassification probabilities were found and compared to the theoretical
misclassification probabilities. The theoretical misclassification probabilities were found
using the likelihood-based discriminant method. In other words, the model that best
fits the data is used to find the estimated response value for every individual's set of
explanatory values. The misclassification probability is the percent of data categorized
incorrectly by the fitted model. Results showed that for normally distributed
data the two probabilities were comparable, and that, as the sample size increased,
the difference between the two misclassification probabilities decreased. For the
uniformly distributed data, the likelihood approach outperformed the logistic approach,
yet with an average misclassification probability of around 50%. Therefore, for uniform
data, neither approach performed well.
Another interesting application, by Nottingham and Birch (1998), investigated
the effectiveness of the logistic regression procedure (2.12) for analyzing binary dose-
response data, especially for small sample sizes. They stress that the results of a
poorly designed experiment can be seriously compromised. For three doses, two
goodness of fit tests were performed. The use of the Pearson $\chi^2$ test found the
probability of a Type I error to be much smaller than the nominal $\alpha$-level, and the
likelihood ratio $\chi^2$, or deviance, test was found to have difficulty when the response
dosages were less spread out over the entire range of dosage levels. The problem
continued when only three dosage levels were considered, even with fifty subjects
per dose. Even for a very wide range of dose levels selected, the results when the
number of dosages is small (around three) and the number of subjects at each dose
is small (around ten) should still cause concern. The logistic regression procedure
can have over 31% of its mean squared error due to bias from having design points at
the extremes of the dose range. Therefore, the placement of the doses is extremely
important to ensure reliable results. They also found that it is better to do a five-level
design with ten replicates at each dose level than a three-level design with
twenty replicates per dose level, if the analyst is concerned with the mean squared
error of fit or the variance of the estimated coefficients. This direct contradiction to
maximum likelihood theory is due to the fact that asymptotic theory is inappropriate
for experiments of this size (Nottingham and Birch, 1998).
The final application considered in this chapter is also concerned with small sample
sizes. Duke (1992) evaluated the effects of various small sample sizes on the accuracy
of odds ratios (2.11) which were estimated by the logistic regression model (2.12)
using data collected following the repeated sample technique from a cross-sectional
study of live births occurring in Oklahoma over the ten-year period from 1975 to 1984.
Duke further investigated parametric odds ratios by looking at the accuracy of the
coverage in test-based 95% confidence intervals. The investigation also determined the
efficacy of the same 95% confidence intervals through significance testing. Information
regarding the accuracy of odds ratios derived through weighted least squares logistic
regression, and the performance of the confidence intervals applied before and after
the transformation of the logistic coefficients was also provided.
Duke (1992) found that with larger sample sizes, the regression coefficients became
more stable, that is, the size of the standard error of the coefficients decreased. Also,
an increased sample size improved the reliability of the estimates and the accuracy
of the odds ratio. However, the exponential transformation of individual logistic
regression coefficients appeared to overestimate the population odds ratio. Therefore,
the conversion of the logistic transformation will inflate the size of the odds ratio.
The coverage of the parametric odds ratio by 95% confidence intervals more closely
approached 95% with the larger sample sizes. This evaluation of risk factors associated
with low birth weight deliveries was conducted to hopefully aid in assessing the need
and impact of perinatal health programs for Oklahoma.
The examples in this chapter show that logistic regression can be applied to a
variety of areas of statistical modeling. One offered a possible alternative to the
logistic regression model useful when the estimation of probability is not accurate,
and another showed that a comparison of theoretical and actual misclassification
probabilities can be conducted using logistic regression. Two applications dealt with
logistic regression for small sample sizes, one relating to dosage levels in drug testing,
and the other to risk factors associated with low birth weight. These are but a few
of the many applications of logistic regression. The next chapter compares multiple
linear regression to logistic regression via two specific sets of data.
CHAPTER IV
REGRESSION ON TWO DATA SETS
Now that the background has been explained, two data sets will be considered and
analyzed using SAS, a statistical software package. Multiple linear regression will be
compared with logistic regression via these two sets of data. Linear regression of the
dichotomous variable on independent \ariables, although not the most appropriate
technique, was used for prediction (Gehan. 1959) prior to the more widespread use
of logistic regression which began in the 1960's. The assumptions on the response
variable cause the linear regression model to be invalid for estimation or testing
procedures. Comparisons of the two procedures include but are not limited to the
differences in parameter estimates for the same models, along with the best model
each regression procedure found using various model selection techniques. The two
data sets were reproduced and used with permission of SAS Institute Inc., Cary, NC.
In SAS (1995), some similar logistic regression analysis was presented. The aim is
not to present a comprehensive logistic regression analysis of the SAS data sets, but
to use the analyses to compare with the standard multiple regression results on the
same data.
4.1 Summary Statistics of Data Sets
The first set of data will be referred to as the DIABETES data set (SAS, 1995).
Data was collected from 145 nonobese adults who were diagnosed as subclinical
(chemical) diabetics, overt diabetics, and normals (nondiabetics). The relationship between
various blood chemistry measures and diabetic status was investigated. The names of
the collected explanatory variables and their descriptions are given in Table 4.1. For
the purposes of logistic regression, the response variable must be binary. Therefore,
an indicator variable, DIAB, was defined which classified overt and chemical diabetics
into a single group having value 1, and normals or nondiabetics into a separate
group having value 0. To maximize the quality of the comparisons, the dichotomous
variable DIAB is the response variable which was used for both regression analysis
techniques.
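The appendices give the actual SAS code used for the analyses; as a language-neutral illustration, the DIAB recoding described above might be sketched in Python as follows. The GROUP values here are invented examples, not the actual 145 patient records.

```python
import numpy as np

# Sketch of the DIAB indicator recoding: GROUP values 3 (overt diabetic) and
# 2 (chemical diabetic) collapse to 1, and GROUP 1 (normal) maps to 0.
# These GROUP values are illustrative, not the thesis's patient data.
group = np.array([3, 2, 1, 1, 2, 3, 1])
diab = (group >= 2).astype(int)

print(diab)  # -> [1 1 0 0 1 1 0]
```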
Table 4.1: Description of Variables in DIABETES Data Set

Variable   Description
PATIENT    patient number
RELWT      relative weight (ratio of actual weight to expected weight, based on height)
GLUFAST    fasting plasma glucose
GLUTEST    test plasma glucose (a glucose intolerance measure)
INSTEST    plasma insulin during test (measure of insulin response to oral glucose)
SSPG       steady state plasma glucose (measure of insulin resistance)
GROUP      clinical group (3=overt diabetic, 2=chemical diabetic, and 1=normal)
Preliminary analysis of the DIABETES data set was conducted using PROC
CORR, the correlation analysis procedures of SAS. There were 76 individuals in the
nondiabetic clinical group, and a total of 69 in the diabetic group (a combination of
the 36 chemical and 33 overt diabetic patients). The means and standard deviations
of each of the variables for the overall sample are compared in Table 4.2 to those found
separately for the two response groups. The relative weight of the diabetic group is
larger than that of the nondiabetic group, showing the average actual weight of those
diagnosed diabetic is larger than their expected average weight. In fact, for all of the
blood chemistry variables the mean values for the diabetic group are larger than the
overall means by more than the nondiabetic group is smaller than the overall means.
The standard deviations for the diabetic group were expected to be fairly large, since
that group consists of both the chemical diabetics and the overt diabetics. When the
diabetics' averages were calculated separately, the values for the chemical diabetics
were at the complete opposite end of the spectrum from those of the overt diabetics.
For example, the average test plasma glucose level for the group of chemical diabetics
was 493.94, as opposed to 1043.75 for the overt diabetics.
Table 4.2: Descriptive Statistics for DIABETES Data

Variable      Overall                Nondiabetic            Diabetic
Name       Mean       Std Dev     Mean       Std Dev     Mean       Std Dev
RELWT      0.9773     0.1292      0.9372     0.1285      1.0214     0.1156
GLUFAST    121.9862   63.9304     91.1842    8.2279      155.9130   79.6996
GLUTEST    543.6138   316.9509    349.9737   36.8706     756.8986   350.9525
INSTEST    186.1172   120.9352    172.6447   68.8538     200.9565   159.1102
SSPG       184.2069   106.0299    114.0000   57.5328     261.5362   92.6276
Correlation analysis of the overall DIABETES data set, given in Table 4.3, also
showed many significantly correlated variables at the 1% significance level. All three
pairings of the variables GLUFAST, GLUTEST, and SSPG had positive correlations
above .70. In fact, the testing and fasting glucose levels had a .96 correlation.
The two glucose levels were also significantly correlated with the patients' plasma
insulin during test: GLUFAST and INSTEST produced a negative correlation of
-.39, and GLUTEST and INSTEST had a negative correlation of -.33.
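The significance of a sample correlation can be judged from its t statistic with n - 2 degrees of freedom; a minimal pure-Python sketch (illustrative only, not the thesis's PROC CORR output):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, referred to n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# With n = 145 patients, the .96 correlation between GLUFAST and GLUTEST
# yields an enormous t statistic, hence the 1%-level significance noted above.
print(round(t_statistic(0.96463, 145), 1))
```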
Table 4.3: Correlation Matrix for Overall DIABETES Data

               RELWT     GLUFAST   GLUTEST   INSTEST   SSPG
    RELWT       1.00000  -0.00881   0.02398   0.22224   0.38432
    GLUFAST    -0.00881   1.00000   0.96463  -0.39623   0.71548
    GLUTEST     0.02398   0.96463   1.00000  -0.33702   0.77094
    INSTEST     0.22224  -0.39623  -0.33702   1.00000   0.00791
    SSPG        0.38432   0.71548   0.77094   0.00791   1.00000
When the groups were separated into diabetics and nondiabetics, the significant
correlations for the diabetic group were between GLUFAST and GLUTEST and all of
the other variables. These two variables were positively correlated with each other and
with SSPG, while negatively correlated with the other variables. For the nondiabetics,
SSPG was significantly correlated with RELWT and INSTEST, and at the 5% level
INSTEST and GLUTEST were just barely significantly correlated.
Correlation analysis on a further separation of the diabetics into chemical and
overt diabetics was also conducted. For the overt diabetics, GLUFAST and GLUTEST
maintain their significance with INSTEST and SSPG, but RELWT is no longer
significantly correlated with either at the 1% level, and only with GLUTEST at the 5%
level. For the chemical diabetics alone, GLUFAST and GLUTEST are still significantly
correlated at the 1% level, while SSPG and RELWT are correlated at the 5% level.
Another measure that can be investigated besides correlation is the variance
inflation factor (VIF) for each of the regression coefficients. The VIF represents
the inflation which occurs when one explanatory variable is regressed against the
other explanatory variables. The higher the value of the VIF, the lower the precision
of the parameter estimates given in the model. Those variables whose VIF value
exceeds ten should be seriously considered for possible deletion from the model. For
the DIABETES data, the VIF values for GLUTEST and GLUFAST were around 15,
which alerts the researcher to possible multicollinearity effects in fitted models
containing these explanatory variables. The other explanatory variables had VIF
values of little concern, mostly around one.
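In the special case of only two correlated predictors, the auxiliary regression of one on the other has R² equal to their squared correlation, so VIF = 1/(1 - r²); a sketch (illustrative, not the thesis's SAS computation):

```python
def vif_two_predictors(r):
    """VIF when a predictor is regressed on one other predictor:
    the auxiliary R^2 is r^2, so VIF = 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r * r)

# The .96 correlation between GLUFAST and GLUTEST alone already pushes
# the VIF close to the "around 15" reported for the DIABETES data.
print(round(vif_two_predictors(0.96463), 1))
```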
The second data set will be referred to as the PROSTATE data set (SAS, 1995).
Data were collected from 53 patients diagnosed with prostate cancer. Since treatment
depends on whether or not the cancer has spread to the surrounding lymph nodes,
a surgical procedure is used to determine the extent of nodal involvement. The
relationship between certain variables and nodal involvement is investigated in order
to see if the involvement can be determined without the surgery. Each patient provided
data on several explanatory variables considered predictive of nodal involvement. The
collected explanatory variable names and their descriptions are given in Table 4.4.
Table 4.4: Description of Variables in PROSTATE Data Set

    Variable   Description
    CASE       an identification variable
    AGE        age in years of patient at time of diagnosis
    ACID       level of serum acid phosphatase
    XRAY       X-ray examination results (0=positive, 1=negative)
    SIZE       size of tumor (0=small, 1=large)
    GRADE      pathological grade of the tumor as determined by biopsy
               (0=less serious, 1=more serious)
    NODALINV   surgical procedure results (0=no involvement, 1=involvement)
Preliminary analyses were also conducted on the PROSTATE data set. There
were 33 patients without nodal involvement and 20 patients with nodal involvement.
The means and standard deviations of each of the variables for the overall sample
are compared in Table 4.5 to those found separately for the two response groups.
The means for the patients with nodal involvement are higher than those for
patients without nodal involvement in all four variables besides AGE. For the three
dichotomous variables, the averages for nodal involvement are at least twice those for
noninvolvement. The difference between the two groups in the mean logarithm of the
acid level is comparatively small. The overall ages of the patients ranged from
45 to 68, and the separate groups showed only a slight difference in averages; the
patients without nodal involvement had a mean age exactly equal to the overall median.
Table 4.5: Descriptive Statistics for PROSTATE Data

               Overall             No Involvement      Involvement
    Variable   Mean     Std Dev    Mean     Std Dev    Mean     Std Dev
    AGE        59.3774   6.1682    60.0606   5.6010    58.2500   7.0103
    LACD       -0.4189   0.3151    -0.4959   0.3169    -0.2918   0.2743
    XRAY        0.2830   0.4548     0.1212   0.3314     0.5500   0.5104
    SIZE        0.5094   0.5047     0.3636   0.4885     0.7500   0.4443
    GRADE       0.3774   0.4894     0.2727   0.4523     0.5500   0.5104
The correlation coefficients for the overall PROSTATE data are given in Table 4.6.
The only significant correlation was between GRADE and SIZE, which had a positive
correlation of .37. For those patients with nodal involvement, none of the correlations
were significant; the strongest was negative, between LACD and SIZE, but all
three pairings of XRAY, AGE, and LACD, although positive, were not far below
its magnitude of .25. As for those patients without involvement, the strongest (and
only significant) correlation was between GRADE and SIZE at .52, and the only
other pair which comes close to being correlated is GRADE and LACD, which had
a negative correlation of -.29.
Table 4.6: Correlation Matrix for Overall PROSTATE Data

             AGE       LACD      XRAY      SIZE      GRADE
    AGE       1.00000  -0.01921  -0.00453  -0.01970  -0.04808
    LACD     -0.01921   1.00000   0.18075   0.01127  -0.06414
    XRAY     -0.00453   0.18075   1.00000   0.19761   0.20217
    SIZE     -0.01970   0.01127   0.19761   1.00000   0.37463
    GRADE    -0.04808  -0.06414   0.20217   0.37463   1.00000
The correlation analyses were useful as preliminary studies on the data because they
gave the researcher an idea of the general variation of the data. They also showed
possible variable interactions which may need to be considered in regression analyses.
As mentioned previously, the VIF represents the inflation which occurs when one
explanatory variable is regressed against the other explanatory variables. The higher
the value of the VIF, the lower the precision of the parameter estimate in the model.
For the PROSTATE data, the VIF values for all the explanatory variables were close
to one, showing no sign of multicollinearity among the variables.
4.2 Logistic Regression on the Data Sets
4.2.1 Logistic Regression on DIABETES Data Set
The two types of regression analyses discussed in this paper have been linear
and logistic regression. For the DIABETES data set, as an illustration, the logistic
regression model (2.12) involving the single explanatory variable GLUTEST was fit
to the data. In order for this variable to predict the dichotomous response value for
the i-th patient, its value, x_i, is substituted into the resulting model

    logit P(Y_i) = 90.4017 - 0.2153 x_i,  for i = 1, ..., 145.  (4.1)
There are a variety of measures which can be used to compare different models. The
square of the correlation coefficient (R²) is a measure of the fraction of the variation
in the data (i.e., in the response variable) accounted for by the model. The adjusted
R² value is similar, but it also takes into account the number of variables in the
model. Since the full model, consisting of all possible explanatory variables, may
include extra variables that are not really necessary for the prediction of the response
variable, the adjusted R² value will give a more accurate depiction of the variation
of the response variable accounted for by the model. Thus, the adjusted R² value is
useful in detecting overfitting of the model.

Another measure useful in the comparison of fitted models is the sum of squared
errors (SSE), that is, the sum of squares of the differences between the response variable
and the value of the response variable predicted by the model. These squared errors
are totaled for the entire data set; the smaller the SSE, the better the model.
The simple model presented in equation (4.1), with SSE = 1.82503, has r² = 0.7295
and adjusted r² = 0.9734. Therefore, this one-variable logistic model accounts for
97.34% of the variation in the response variable, and its sum of squared errors is
fairly small.
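The fitted coefficients of equation (4.1) can be turned into estimated probabilities and an SSE in a short Python sketch (the GLUTEST values and 0/1 responses below are hypothetical, coded to match the sign of the fitted logit):

```python
import math

def p_hat(x):
    """Estimated probability from the fitted logit in equation (4.1):
    logit p = 90.4017 - 0.2153 * GLUTEST."""
    logit = 90.4017 - 0.2153 * x
    if logit < -700:  # avoid math.exp overflow for very large GLUTEST
        return 0.0
    return 1.0 / (1.0 + math.exp(-logit))

def sse(pairs):
    """Sum of squared errors between 0/1 responses and fitted probabilities."""
    return sum((y - p_hat(x)) ** 2 for x, y in pairs)

# The fitted curve is essentially 0 or 1 away from the crossover region:
print(round(p_hat(350), 4), round(p_hat(1000), 4))
```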
Using the above information alone, one might see this model as accurate enough to
predict whether or not an individual is diabetic. However, with all of the information on
the other explanatory variables, model selection methods can be used to investigate
their inclusion into the model. The three selection techniques available for both
logistic regression and linear regression in SAS are forward, backward, and stepwise
selection. These methods are iterative processes which add or remove variables from
the model based on the appropriate significance levels at each step. In logistic
regression, the significance levels of the χ² score statistic are compared to specified
entrance and removal threshold values.
The forward method begins with no variables in the model; at the first
step the intercept is added. After that, the variable which is most significant
when added to the model (having the smallest significance level) is entered. This
continues until no variable has a significance level below the threshold value for
entrance into the model. Backward selection is the opposite, starting with all
variables in the model. The significance level for testing each variable's parameter
estimate equal to zero is calculated, and the least significant variable (with the largest
significance level) is removed at each step. The process continues until all variables
left in the model fall below the threshold value for removal from the model.
Stepwise selection combines the two previously mentioned methods. It begins with
no variables in the model and checks both entrance and removal thresholds at each
step, continuing until a variable entered on one step is removed on the next step. It is
necessary to realize that when comparing significance levels for different combinations
of explanatory variables, the overall significance level increases with each comparison.
Therefore, the threshold values for each comparison test should be very small in order
to keep down the overall significance level. Also, model selection techniques are meant
to be exploratory, so the fit of the model should be verified on other data.
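The forward step described above can be sketched as a greedy loop; the `score` callback stands in for the significance level of the χ² score statistic, and the p-values below are hypothetical illustrations, not the thesis's SAS output:

```python
def forward_select(candidates, score, enter_level=0.15):
    """Greedy forward selection: at each step add the candidate whose addition
    has the smallest significance level, stopping once no candidate falls
    below the entry threshold.  `score(model, var)` returns the significance
    level of `var` when added to the current `model`."""
    model, remaining = [], list(candidates)
    while remaining:
        best = min(remaining, key=lambda v: score(model, v))
        if score(model, best) >= enter_level:
            break
        model.append(best)
        remaining.remove(best)
    return model

# Toy significance levels: GLUTEST is highly significant, SSPG enters second,
# and the rest never clear the .15 entry threshold (hypothetical numbers).
pvals = {"GLUTEST": 0.0001, "SSPG": 0.02, "RELWT": 0.40, "INSTEST": 0.55}
chosen = forward_select(pvals, lambda model, v: pvals[v])
print(chosen)
```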
The results from using the logistic regression procedure with the forward model
selection technique provided a model containing only GLUTEST and SSPG with
R² = .7373 and SSE = 0.8002, whereas both backward and stepwise selection named
the above mentioned model with the single variable GLUTEST. Note that the
significance levels used were .15 for entry into the model and .10 for removal
from the model. The forward selection for logistic regression required two steps,
although the others required at least four steps.
Logistic regression also has the option of determining the best model by evaluating
the model's score statistic. This selection option is different from the previous
techniques in that it finds a specified number of best models of each possible model
size, ranging from the k one-variable models up to the single full model. The model
with the largest score statistic value was the full model, having R² = .7419. Of
all the four-variable models, the best one left out the variable SSPG and had
R² = .7414, and then also removing INSTEST left the best three-variable model,
having R² = .7409. Comparing these R² values, the researcher must determine
whether accounting for about one tenth of a percent more of the variation
in the data is worth the addition of two explanatory variables to the model. As for
a comparison of the sums of squared errors: the full model has SSE = .42291, the best
four-variable model SSE = .45301, and the best three-variable model SSE = .66796.
The best model should not be limited to first-order variables. As the
preliminary analyses show, many variables are correlated; therefore, the interactions
between the variables should also be investigated. When the five variables are allowed
to interact, the new model can include up to twenty variables plus, of course, the
intercept. As mentioned previously, when the number of steps increases, the overall
significance level also increases. When the interactions of the variables are included,
the full model fit by the logistic regression procedure had R² = 0.5847 and
SSE = .88737. The logistic procedure using forward selection required four steps to find
a four-variable model including the variables RELWT*RELWT, RELWT*GLUFAST,
RELWT*GLUTEST, and SSPG, with R² = .7430 and SSE = .34065. The backward
selection for this procedure required nineteen steps to eliminate all other explanatory
variables except GLUTEST*GLUTEST, with R² = .7295 and SSE = 1.82298.
The stepwise selection technique also yielded a single-variable model, with the variable
RELWT*GLUTEST, and it needed only five steps. This model therefore has a
better significance level; unfortunately, its SSE = 5.5394 is quite high compared
to the previously mentioned models, and its R² = .6813 is much smaller, which
makes it less appropriate.
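The count of twenty candidate terms mentioned above (five first-order variables, their five squares, and ten cross products) can be checked with a short sketch:

```python
from itertools import combinations_with_replacement

variables = ["RELWT", "GLUFAST", "GLUTEST", "INSTEST", "SSPG"]

# Second-order terms: every product of two variables, squares included
# (10 cross products + 5 squares), plus the 5 first-order terms = 20.
terms = list(variables)
terms += [f"{a}*{b}" for a, b in combinations_with_replacement(variables, 2)]

print(len(terms))
```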
The logistic regression procedure with the score selection option found that the
best two models included eleven and ten variables, respectively. These best models
include the interaction of the variable RELWT with each of the other variables except
itself and SSPG. Similarly, GLUTEST is included interacting with all other variables
except GLUFAST, and the original variables RELWT, GLUFAST, and INSTEST are
also included. Along with these are the variables GLUTEST*INSTEST and
GLUFAST*SSPG, and the eleventh variable added in the best model is
GLUFAST*INSTEST. For the eleven- and ten-variable models, the logistic sums of
squared errors are SSE = .0441 and SSE = .05796, while R² = .7476 and R² = .7474,
respectively. This remarkably small sum of squares makes the eleven-variable model
the best overall fit to the data for the logistic regression procedure.
4.2.2 Logistic Regression on PROSTATE Data Set
As an illustration with the PROSTATE data set, a logistic regression model involving
the variables ACID, XRAY, and SIZE was first fit to the data. Instead of directly
fitting the variable ACID, the logarithm of the acid levels, LACD, was used to give a
better discrimination between the closely spaced values of this variable. In order for
a combination of these three variables to predict the dichotomous response values for
each patient, their values are substituted into the resulting model

    logit P(Y_i) = -1.1994 + 2.2922 x_1i + 2.0550 x_2i + 1.7638 x_3i,  for i = 1, ..., 53,  (4.2)

where x_1i represents the logarithm of the i-th individual's level of serum acid
phosphatase, while x_2i and x_3i are the corresponding values for XRAY and SIZE,
respectively. The model in equation (4.2) has R² = 0.3305 and adjusted R² = 0.4501
with SSE = 8.0364. In this case, the logistic regression model did not account for as
much variation as it did in the previous data set, and the sum of squared errors is not
especially small. Accordingly, the next step would be to determine the best model to
fit the data. The same model selection techniques described in the previous section
are available for this data set.
First considering only the original five variables without interactions, the logistic
regression procedure found the model presented in equation (4.2) as best using all
three types of selection techniques. When the alternative selection technique was
used, the logistic regression procedure with the score option yielded the full model,
containing five variables, as the best, with R² = 0.3605 and SSE = 7.6656. The full
model with GRADE removed was given as the best four-variable model, with R² =
0.3468 and SSE = 7.8703. Of course, the full model with both AGE and GRADE
removed, presented in equation (4.2), was given as the best three-variable model.
When the interactions of the variables are included, the full model fit by the
logistic regression procedure had R² = 0.5847 and SSE = 3.898. The forward
selection yielded a two-variable model in only two steps, including LACD*LACD and
XRAY*SIZE, with R² = 0.3305 and SSE = 8.0920. Backward elimination required
twelve steps to determine the best model as a five-variable model with R² = 0.4894
and SSE = 5.3994. The stepwise selection technique found the same two-variable model
as the forward selection technique for this procedure. Using the alternative selection
technique, logistic regression found the best model to contain eleven variables, and
the second best to contain ten variables, with SSE = 4.1199 and SSE = 4.1342, while
R² = 0.5673 and R² = 0.5660, respectively. The best fit to the data using logistic
regression is the full model with interactions, having R² = 0.5847 and SSE = 3.898.
4.3 Linear Regression on the Data Sets
4.3.1 Linear Regression on DIABETES Data Set
For the DIABETES data set, also as an illustration, a linear regression model (2.1)
involving the single explanatory variable GLUTEST was fit to the data. In order for
this variable to predict the dichotomous response value for the i-th patient, its value,
x_i, is substituted into the resulting model

    y_i = -0.0772 + 0.0010 x_i,  for i = 1, ..., 145.  (4.3)

As mentioned previously, the linear regression model is not ideal for predicting
binary response variables, since it is more useful in modeling continuous response
variables. The model in equation (4.3) has r² = 0.4140 and adjusted r² = 0.4099
with SSE = 21.1938. With such a large sum of squared errors and so small a share of
the variation in the response variable accounted for by this model, the next step is to
search for alternative models which offer a better fit to the data.
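The least-squares fit itself has a simple closed form; the sketch below (pure Python, with a tiny set of hypothetical 0/1 responses against a GLUTEST-like predictor) also shows why the linear model is awkward for a binary response: the fitted line is not confined to the interval [0, 1].

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = a + b*x via the closed-form normal equations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical binary responses: the prediction at x = 1000 exceeds 1.
a, b = fit_simple_linear([300, 400, 700, 1000], [0, 0, 1, 1])
print(a, b, a + b * 1000)
```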
In linear regression, the significance levels of the F statistic are compared to
specified entrance and removal threshold values in the forward, backward, and stepwise
selection procedures. The results from using the multiple linear regression procedure
with the forward model selection technique yielded the full model, containing all of the
explanatory variables, with a multiple R² = .7054 and SSE = 7.6673, as the best fit.
The same procedure with both the backward and stepwise techniques produced as best
the model with all explanatory variables except SSPG, which had R² = .7024
and SSE = 10.7632. Again with significance levels of .15 for entry into the model
and .10 for removal, the backward elimination technique took only one step, while
the others required at least four steps.
A process similar to the logistic regression score option is the linear regression
option which uses the R² or adjusted R² values to determine the best model. Again,
this selection option is different from the previous techniques in that it finds a specified
number of best models of each possible model size, ranging from the k one-variable
models up to the single full model. Interestingly, the linear regression R² procedure
yielded the same best models found using the logistic score option: the
full model with R² = .7054 and SSE = 7.6673, the best four-variable model being the
full model with SSPG removed, with R² = .7054 and SSE = 10.7632, and the full model
with SSPG and INSTEST removed, with R² = .6667 and SSE = 12.0548.
Again, the best model should not be limited to first-order variables. As the
preliminary analyses show, many variables are correlated; therefore, the interactions
between the variables should also be investigated. When the five variables are allowed
to interact, the new model can include up to twenty variables plus, of course, the
intercept. As mentioned previously, when the number of steps increases, the overall
significance level also increases. For the linear regression procedure, forward selection
provides a model with all but two variables, GLUTEST*GLUFAST and
INSTEST*SSPG, with R² = .8047 and SSE = 7.0474, but the process took sixteen
steps. The backward technique with this procedure required only twelve steps to
determine an eight-variable model having R² = .7944 and SSE = 7.4339. The stepwise
method yielded a six-variable model with R² = .7786 and SSE = 8.0088. Again, the
three-percent difference gained by the addition of twelve variables requires a trade-off
between the predictive power and the simplicity of the model.
Using the linear regression procedure with the R² selection option, the best two
models again included eleven and ten variables, with R² = .8024 and R² = .8009,
respectively. As for the sums of squared errors of these models, with eleven variables
SSE = 7.1477 and with ten variables SSE = 7.2003. These best models include
the interaction of the variable RELWT with each of the other variables except itself
and SSPG. Similarly, GLUTEST is included interacting with all other variables except
GLUFAST, and the original variables RELWT, GLUFAST, and INSTEST are also
included. Along with these are the variables GLUTEST*INSTEST and
GLUFAST*SSPG, and the eleventh variable added in the best model is
GLUFAST*INSTEST.
The best model among all of the multiple linear regression procedures considered
on this data set is the model given by forward selection with interactions, containing
eighteen variables and having R² = .8047 and SSE = 7.0474. Recall that for logistic
regression, the best model included the same eleven variables as the linear regression
model above, but had SSE = .0441 and R² = .7476, which makes it the best overall
fit to the data using either regression procedure.
4.3.2 Linear Regression on PROSTATE Data Set
For the PROSTATE data set, also as an illustration, a linear regression model
involving the variables LACD, XRAY, and SIZE was fit to the data. In order for a
combination of these three variables to predict the dichotomous response values for
each patient, their values are substituted into the resulting model

    y_i = -0.2819 + 0.3869 x_1i + 0.3840 x_2i + 0.2922 x_3i,  for i = 1, ..., 53,  (4.4)

where x_1i represents the logarithm of the i-th individual's level of serum acid
phosphatase, while x_2i and x_3i are the corresponding values for XRAY and SIZE.
Again, the linear regression model is not ideal for predicting binary response
variables such as nodal involvement, and this is apparent in the SSE, R², and adjusted
R² values for this model. The SSE = 8.0139, which again is not especially small, and
the R² value indicates that 35.65% of the variation in the response variable is explained
by the model. When the number of variables in the model is taken into account by the
adjusted R², the model only accounts for 31.71% of the variation. Accordingly, the
next step would be to determine the best model to fit the data. The same model
selection techniques described in the previous sections are available for this data set.
First considering only the original five variables without interactions, the forward
selection technique used with the linear regression procedure determined the
full model, with R² = .3864 and SSE = 7.6412, as the best model. The backward
elimination and stepwise selection both provided the model presented in equation (4.4),
with LACD, XRAY, and SIZE, as the best model. When the alternative selection
technique was used, both procedures yielded the same best three models: the full
model with R² = .3864 and SSE = 7.66562, the full model with GRADE removed,
having R² = .3737 and SSE = 7.87025, and the full model with both AGE and GRADE
removed, with R² = .3565 and SSE = 8.03644.
When the interactions of the variables are included, the full model with all eighteen
variables had SSE = 4.5944 and R² = 0.6311. The forward selection using the linear
regression procedure found the best model to contain nine variables, with R² = .5977
and SSE = 6.4345; unfortunately, it required nine comparison steps to produce it.
The backward elimination using this same procedure required eleven steps to find a
six-variable model similar to the five-variable model found with the same technique
using logistic regression. This six-variable model had R² = .5358 and SSE = 5.7812,
an increase in R² and a decrease in SSE from previous models. The stepwise selection
required only three steps to find the three-variable model which it considered
best. This model included LACD*SIZE, LACD*LACD, and XRAY*SIZE, with
SSE = 7.6410 and R² = .3864, almost the same SSE and R² as the full model
without interactions. Using the alternative selection technique, linear regression
again found the best model to contain eleven variables, and the second best to
contain ten variables, with SSE = 4.6625 and SSE = 4.8804, while R² = .6256 and
R² = .6081, respectively.
The best multiple linear regression model found for this data set is the full model
with interactions, containing eighteen variables and having R² = .6311 and
SSE = 4.5944. The model with the best overall fit to the data using the logistic
regression procedure was the full logistic model with interactions, having R² = 0.5847
and SSE = 3.898. Therefore, with the smallest sum of squared errors, this full logistic
regression model is the model with the best overall fit to the data using any regression
procedure.
4.4 Comparison of Logistic and Linear Regression Analyses
For the DIABETES data set, the models found using logistic regression usually
contained fewer variables than the models determined by the selection techniques
with linear regression. The sums of squared errors were also significantly smaller for
the logistic regression models, suggesting a better fit to the data. The best logistic
regression model was produced using the score option and contained eleven variables,
with SSE = .0441 and R² = .7476. The linear R² selection option confirmed that a
model containing these same eleven variables was in competition for the best model,
with R² = .8024; however, the parameter estimates varied significantly between the two
procedures, and for linear regression SSE = 7.1477. The best multiple linear regression
model was produced by the forward selection technique with interactions, containing
eighteen variables and having R² = .8047 and SSE = 7.0474. Therefore, the overall
best model produced for this data set was the eleven-variable logistic model.
For the PROSTATE data set, again the models found using the selection
techniques with logistic regression contained somewhat fewer variables, but
only when interactions of the variables were allowed. When only first-order variables
were considered, both logistic and linear regression with all selection techniques
basically agreed on the best fitting models, and even had very similar sums of squared
errors. The sums of squared errors differed for the two regression procedures when
interactions were considered, but not nearly as markedly as in the other data set. The
logistic regression score selection technique also produced the best model according
to SSE in this data set as one containing eleven variables. The sum of squared errors
for this model with logistic regression was SSE = 4.11986, whereas for linear regression
SSE = 4.6625, although the difference is much more noticeable in their R² values:
for the logistic regression of the eleven-variable model R² = 0.5673, but for the
multiple linear regression R² = .6256. The best multiple linear regression model found
for this data set also contained eighteen variables and was the full linear model with
interactions, having R² = .6311 and SSE = 4.5944. All of these models were surpassed
by the full logistic model with interactions, having SSE = 3.8979, making it the
overall best model for this data set, even though its R² = 0.5847.
The above best models using the two regression procedures on each data set are
given below in Table 4.7. The logistic procedure outperformed the multiple linear
regression procedure on both data sets, as was expected. Therefore, with response
variables which are dichotomous, the logistic regression model is the appropriate and
preferred model.
Table 4.7: Best Logistic and Linear Regression Model for Each Data Set

    Regression   Data
    Procedure    Set         R²        SSE
    Logistic     DIABETES    0.7476    0.0441
    Logistic     PROSTATE    0.5847    3.8979
    Linear       DIABETES    0.8047    7.0474
    Linear       PROSTATE    0.6311    4.5944
For each of the four best regression models given in Table 4.7, the frequency
distribution of the estimated probability, p̂, is given in Figures 4.1 through 4.4. The
estimates may be used to investigate how many patients would be misclassified if
a particular value of p̂, perhaps p̂ = 0.5, were chosen to separate those predicted
to have a response variable equal to zero from those having a response variable equal
to one. Figure 4.1 is based on the logistic regression model found using the score
selection option, containing eleven variables, for the DIABETES data set. With such
a small sum of squared errors, SSE = .0441, the distribution of p̂ is very close
to the actual dichotomous values of the response variable DIAB. Figure 4.2 is based
on the linear regression model found using the forward selection technique for the
DIABETES data set and shows the distribution of p̂. Figure 4.3 gives the distribution
of p̂ for the full logistic regression model including interactions for the PROSTATE
data set. Figure 4.4 shows the distribution of p̂ for the full linear regression model
including interactions for the PROSTATE data set. The two linear regression models
have much more dispersed distributions of p̂, whereas the distributions for the two
logistic regression models show a much clearer distinction between the two possible
response variable values.
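The misclassification count described above can be sketched directly; the estimated probabilities and responses below are hypothetical, not read from the figures:

```python
def misclassified(p_hats, responses, cutoff=0.5):
    """Count patients whose 0/1 response disagrees with the prediction
    obtained by thresholding the estimated probability at `cutoff`."""
    return sum((p >= cutoff) != bool(y) for p, y in zip(p_hats, responses))

# Hypothetical estimated probabilities against true responses: only the
# patient with p-hat = 0.20 but response 1 falls on the wrong side of 0.5.
print(misclassified([0.05, 0.20, 0.55, 0.90], [0, 1, 1, 1]))
```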
The next chapter discusses cluster analysis and how it can be used to help
determine the groupings used in logistic regression. The DIABETES data set is revisited
as an illustration of cluster analysis techniques.
[Histogram omitted: frequency (0 to 70) versus estimated probability, in bins from 0.05 to 0.95.]
Figure 4.1: Distribution of p̂ For Best Logistic Model on DIABETES Data
[Histogram omitted: frequency (0 to 40) versus predicted value of DIAB, in bins from 0.05 to 0.95.]
Figure 4.2: Distribution of p̂ For Best Linear Model on DIABETES Data
(The horizontal label could also be interpreted as estimated probability.)
[Histogram omitted: frequency (0 to 30) versus estimated probability, in bins from 0.05 to 0.95.]
Figure 4.3: Distribution of p̂ For Best Logistic Model on PROSTATE Data
[Histogram omitted: frequency (0 to 15) versus predicted value of NODALINV, in bins from 0.05 to 0.95.]
Figure 4.4: Distribution of p̂ For Best Linear Model on PROSTATE Data
(The horizontal label could also be interpreted as estimated probability.)
CHAPTER V
CLUSTER ANALYSIS
5.1 What is Cluster Analysis?
Cluster analysis is a technique applicable to situations involving data from a
population where there exists some set of features or characteristics which may be
used to separate the data values into groups or clusters. Specifically, let the set
I = (I_1, I_2, ..., I_n) represent n individuals from a population denoted π, and let
the set C = (C_1, C_2, ..., C_k) represent observable characteristics possessed by each
individual in I. Usually these observable characteristics yield quantitative data, also
called measurements. Sometimes, however, the characteristics can yield qualitative
or categorical data. The value of the measurement on the j-th characteristic of the
individual I_i is denoted by the symbol x_ij, and X_i = [x_ij] represents a k x 1 vector of
measurements. The researcher, therefore, has available for each set I a corresponding
set of k x 1 measurement vectors X = (X_1, X_2, ..., X_n) which describes the set of
individuals, I. The set X can be thought of as n points in k-dimensional Euclidean
space, where the distance between the points can be measured. Based on the data
contained in the set X and an integer m, where m < n, the cluster problem is to
determine m clusters or subsets of the individuals in I, say π_1, π_2, ..., π_m, such that
each I_i belongs to one and only one subset. Those individuals which are assigned to
the same cluster are required to be determined significantly similar, while those
assigned to different clusters are required to be determined significantly different. The
general ideas presented in this chapter are a summary of those presented in Duran
and Odell (1974).
A solution to the cluster problem usually involves a partitioning of the individuals
which satisfies some optimality criterion. This optimality criterion, often called an
objective function, may be given in terms of a functional relation that reflects the
levels of desirability of the various partitions. Although various types of objective
functions can be defined, many can be formulated in a unified and general manner.
To accomplish this, a clear definition of what it means for two individuals, I_i and I_k,
to be similar is needed. One possible interpretation of similar individuals is to assign
the i-th and k-th individuals to the same cluster if the distance between the points X_i and
X_k is "sufficiently small," and likewise to assign the individuals to different clusters
if the distance is "sufficiently large." This distance between points can be defined by
various distance functions; however, the Euclidean distance function is most commonly
used.
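As a concrete illustration, the Euclidean distance between two measurement vectors can be computed as follows (a minimal Python sketch; the sample values are hypothetical):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two k-dimensional measurement vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical individuals measured on k = 3 characteristics.
x1 = [1.0, 2.0, 2.0]
x2 = [4.0, 6.0, 2.0]
print(euclidean(x1, x2))  # -> 5.0
```

Individuals whose distance is "sufficiently small" under this function would be candidates for the same cluster.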
Using the above concept of distance, a measure of the scatter or heterogeneity
of the set of individuals I is desired. Statisticians generally use the following k × k
matrix, called the scatter matrix for the set X, as a measure of scatter:

    S_x = Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄)^T,    (5.1)

where X̄ = (1/n) Σ_{i=1}^{n} X_i is a k × 1 vector of arithmetic averages. The matrix
S_x is also sometimes referred to as the matrix sum of squares. Other scatter measures
include the trace of S_x, denoted s_t; the determinant of S_x, denoted s_D; and the
matrix of correlation coefficients, denoted R. The measure

    s_t = tr S_x = Σ_{i=1}^{n} (X_i − X̄)^T (X_i − X̄)    (5.2)

is the sum of the squared distances of the n points from the group mean X̄ and is termed
the error or within sum of squares. The measure s_D = |S_x| is the statistical scatter
with respect to the determinant. The matrix of correlation coefficients, R = [r_ij],
can be computed from the matrix S_x = [s_ij] defined in equation (5.1).
Using the definition of S_x, define the diagonal matrix Diag(S_x) =
diag(s_11, s_22, ..., s_kk) and its inverse square root [Diag(S_x)]^{−1/2} =
diag(s_11^{−1/2}, s_22^{−1/2}, ..., s_kk^{−1/2}). Then

    R = [Diag(S_x)]^{−1/2} S_x [Diag(S_x)]^{−1/2}    (5.3)

is the matrix of correlation coefficients. These measures of scatter are useful in
determining how tightly the set of individuals is grouped.
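The scatter measures above can be sketched in a few lines of Python (a pure-Python illustration on a small hypothetical data set; a statistical package would normally compute these):

```python
def scatter_matrix(X):
    """Scatter matrix S_x = sum_i (X_i - mean)(X_i - mean)^T, eq. (5.1)."""
    n, k = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(k)]
    return [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in X)
             for b in range(k)] for a in range(k)]

def trace(S):
    """s_t = tr S_x, the within (error) sum of squares, eq. (5.2)."""
    return sum(S[j][j] for j in range(len(S)))

def correlation(S):
    """R = Diag(S_x)^(-1/2) S_x Diag(S_x)^(-1/2), eq. (5.3)."""
    k = len(S)
    d = [S[j][j] ** 0.5 for j in range(k)]
    return [[S[a][b] / (d[a] * d[b]) for b in range(k)] for a in range(k)]

X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # hypothetical, perfectly correlated
S = scatter_matrix(X)
print(S)               # -> [[2.0, 4.0], [4.0, 8.0]]
print(trace(S))        # -> 10.0
print(correlation(S))  # off-diagonal entries are approximately 1.0
```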
Many clustering procedures are hierarchical. In other words, the two closest
objects are grouped and treated as a single cluster in the next step. Thus, the number
of clusters is decreased to n − 1: a single cluster of two objects and n − 2 clusters of
a single object each. This process is repeated until all of the n objects are
grouped into one cluster containing n objects. This hierarchical process involves the
concept of measuring the distance between an object and a cluster and the distance
between two clusters. The concept of the optimality criterion or objective function
determines when the most desirable partition has been obtained. Therefore, we need
measures of homogeneity within a cluster and measures of disparity between two
clusters. These two measures also depend on how the distance between two clusters is
defined.
The distance between two clusters can be defined in various ways. Let I =
(I_1, I_2, ..., I_{n1}) and J = (J_1, J_2, ..., J_{n2}) represent two clusters of individuals from
a population. Let C = (C_1, C_2, ..., C_k) be a set of characteristics which generate the
two measurement sets X = (X_1, X_2, ..., X_{n1}) and Y = (Y_1, Y_2, ..., Y_{n2}), associated
with I and J, respectively. From these definitions, the nearest neighbor distance,
furthest neighbor distance, and the average distance follow directly. The nearest
neighbor distance is defined as the minimum distance between any pairing of one
individual from I and one individual from J. Likewise, the maximum distance between
any such pairing of two individuals from the sets I and J is defined as the furthest
neighbor distance. The average distance between the clusters I and J is calculated
by finding the arithmetic average of all possible pairwise distances between the two
sets.
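The three inter-cluster distances just described can be sketched as follows (Python for illustration; the one-dimensional clusters are hypothetical):

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(X, Y):
    """All distances between one point of X and one point of Y."""
    return [dist(x, y) for x in X for y in Y]

def nearest_neighbor(X, Y):   # single linkage
    return min(pairwise(X, Y))

def furthest_neighbor(X, Y):  # complete linkage
    return max(pairwise(X, Y))

def average_distance(X, Y):   # average linkage
    d = pairwise(X, Y)
    return sum(d) / len(d)

I = [[0.0], [1.0]]
J = [[4.0], [6.0]]
print(nearest_neighbor(I, J))   # -> 3.0
print(furthest_neighbor(I, J))  # -> 6.0
print(average_distance(I, J))   # -> 4.5
```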
Using the concept of statistical scatter, the measure of distance between the
clusters I and J is defined as

    d(I, J) = (n_1 n_2 / (n_1 + n_2)) (X̄ − Ȳ)^T (X̄ − Ȳ),    (5.4)

where Ȳ = (1/n_2) Σ_{i=1}^{n_2} Y_i and X̄ is defined as it was in equation (5.1). This measure
of distance is also referred to as the within group or error sum of squares.
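A sketch of the scatter-based cluster distance in equation (5.4); note that the multiplier n_1 n_2/(n_1 + n_2) is reconstructed here from the standard minimum-variance criterion, and the sample clusters are hypothetical:

```python
def mean(X):
    n, k = len(X), len(X[0])
    return [sum(row[j] for row in X) / n for j in range(k)]

def scatter_distance(X, Y):
    """d(I, J) = n1*n2/(n1+n2) * (Xbar - Ybar)^T (Xbar - Ybar)."""
    n1, n2 = len(X), len(Y)
    xb, yb = mean(X), mean(Y)
    sq = sum((a - b) ** 2 for a, b in zip(xb, yb))
    return n1 * n2 / (n1 + n2) * sq

I = [[0.0, 0.0], [2.0, 0.0]]   # cluster mean (1, 0)
J = [[4.0, 0.0], [6.0, 0.0]]   # cluster mean (5, 0)
print(scatter_distance(I, J))  # -> 16.0
```

This quantity equals the increase in the within-group sum of squares when the two clusters are merged, which is why it pairs naturally with the minimum-variance methods described next.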
Most clustering and classification methods concentrate on the construction
of methods based on the minimization of the within group sum of squares. These
methods are called minimum-variance constraint methods and are easily described
using squared Euclidean distances. Various clustering techniques have been developed,
and the most common include average linkage, furthest and nearest neighbor
methods, and the centroid method. These methods will be described in the next
section in relation to their corresponding SAS options.
Another popular technique is the within sum of squares method. In this technique,
the sum of squares of the distances from each point in a cluster to the mean of that
cluster is found. Again, this is a hierarchical process: at each step, the two clusters
whose union produces the least increase in the within group sum of squares are
joined.
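The within sum of squares method can be sketched as a simple agglomerative loop (an illustration only, not the SAS implementation; the one-dimensional data are hypothetical):

```python
def within_ss(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    n, k = len(cluster), len(cluster[0])
    m = [sum(p[j] for p in cluster) / n for j in range(k)]
    return sum(sum((p[j] - m[j]) ** 2 for j in range(k)) for p in cluster)

def ward_step(clusters):
    """Join the pair of clusters whose union least increases the total within SS."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            inc = (within_ss(clusters[i] + clusters[j])
                   - within_ss(clusters[i]) - within_ss(clusters[j]))
            if best is None or inc < best[0]:
                best = (inc, i, j)
    _, i, j = best
    merged = clusters[i] + clusters[j]
    return [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]

# Start with each point in its own cluster; merge until two clusters remain.
clusters = [[[0.0]], [[0.2]], [[5.0]], [[5.3]]]
while len(clusters) > 2:
    clusters = ward_step(clusters)
print(sorted(sorted(c) for c in clusters))  # -> [[[0.0], [0.2]], [[5.0], [5.3]]]
```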
Many other variations of the clustering procedure begin by initializing certain
values as starting points for clusters. One technique chooses starting points at random,
and then objects within a certain threshold distance form the first cluster. The
process continues until all the points are accounted for by being assigned to their nearest
cluster center, thus forming a specific desired number of disjoint clusters.

Another variation involves choosing "typical" points to initialize the clusters.
These points are determined by some preliminary study of the individuals. If the
desired number of clusters is known, call it m, then m points could be
chosen at random, and each of the remaining n − m objects can be assigned to the
center to which it is nearest.
An improvement on this method is to then join any of the m clusters whose centers
fall within a threshold radius, and split any cluster in which the within-cluster
variance s_i^2 of any one variable exceeds a specified threshold value s^2. The
variances s_j^2 of each of the resulting clusters are then constrained by s_j^2 ≤ k s^2, where k is
the number of variables. At each step the cluster centroids replace the original cluster
centers, and the process is continued until convergence is achieved. This updating of
the centroids until convergence is one of the more popular variations.
Still another variation starts the updating of centroids almost immediately. Again,
a certain number of objects are chosen at random to be used as cluster centers, and
each object is assigned to the center nearest it if its distance from that center is within
a certain threshold. If the object falls beyond the threshold distance, it initializes a
new cluster center. With each allocation of an object to a cluster, the centroid is
recomputed and becomes the new cluster center. Of course, if the distance between
two clusters becomes less than another threshold value, the clusters are joined, and
the process continues until convergence is attained.
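The last variation can be sketched as follows (a minimal illustration; the threshold, data, and single-pass assignment order are all assumptions, and the joining of close clusters is omitted for brevity):

```python
import math

def assign_with_threshold(points, threshold):
    """Assign each point to its nearest center if within the threshold;
    otherwise start a new cluster. The centroid is recomputed after
    each allocation and becomes the new cluster center."""
    centers, members = [], []
    for p in points:
        if centers:
            d, i = min((math.dist(p, c), i) for i, c in enumerate(centers))
        else:
            d, i = float("inf"), -1
        if d <= threshold:
            members[i].append(p)
            k = len(p)
            centers[i] = [sum(q[j] for q in members[i]) / len(members[i])
                          for j in range(k)]
        else:
            centers.append(list(p))
            members.append([p])
    return members

pts = [[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]]
print(len(assign_with_threshold(pts, 2.0)))  # -> 2
```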
The main reason for the popularity of the Euclidean distance is probably its
intuitive appeal and direct relation to the within group sum of squares. There are,
inevitably, objections to the use of the minimum variance approach to cluster analysis,
since changes in the scale of the variables will modify the resulting clusters. As the
number of variations suggests, each seems to improve on the results of the previous one.
This clustering procedure is useful in determining how many different underlying
groups one population may contain. Then, based on the number of clusters, either
logistic regression or some other categorical modeling technique can be used to
analyze the data as accurately as possible. The next section describes SAS clustering
procedures and the results from the cluster analysis of the DIABETES data set.
5.2 Cluster Analysis on the DIABETES Data Set
The DIABETES data set used for the regression analysis in the previous chapter
was also investigated using various method options of the clustering procedure of
SAS, a statistical software package. The PROC CLUSTER command finds the
hierarchical clusters of the observations in the data set, provided the data set is entered
as coordinates or distances. For the DIABETES data set, each individual's information
is treated as a coordinate, and the squared Euclidean distances between each
possible pairing of the observations (or coordinates) are computed by the CLUSTER
procedure. Before performing a cluster analysis on coordinate data, some scaling, or
transformation, of the data should be considered, since variables with large variances
tend to affect the clustering more than those with smaller variances. One choice to
eliminate this effect is to use the STD option in CLUSTER, which standardizes the
variables to mean 0 and standard deviation 1. Some transformations may change the
number of population clusters, so use caution when transforming the variables.
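The effect of the STD option can be illustrated by standardizing a single variable by hand (a Python sketch with hypothetical glucose readings):

```python
import statistics

def standardize(values):
    """Scale a variable to mean 0 and (sample) standard deviation 1."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

glufast = [80.0, 90.0, 100.0, 110.0, 120.0]  # hypothetical readings
z = standardize(glufast)
print(round(statistics.mean(z), 10))  # -> 0.0
print(statistics.stdev(z))            # close to 1.0
```

After this rescaling, no single variable dominates the Euclidean distances simply because it is measured on a larger scale.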
The basic procedure begins with each observation in a cluster by itself; then
the two closest clusters are joined to form one new cluster which replaces the two
old clusters. The process continues until there is only one cluster remaining, or
until a specified number of clusters (which may be entered as an option) is reached. The
various clustering methods differ in how the distance between clusters is computed;
for example, the distance between two clusters may or may not be updated each time
a new observation or cluster merges with one of the existing clusters.
Some of the methods of clustering mentioned previously which are available in
SAS include average linkage, complete linkage (furthest neighbor), density linkage,
single linkage (nearest neighbor), and the centroid method. In average linkage, the
distance between two clusters is the average distance between pairs of observations,
one in each cluster. Average linkage is biased in that it usually joins those clusters with
small variances, and the resulting clusters tend to have the same variances. In complete
linkage, the distance between two clusters is the maximum distance
between one observation in the first cluster and an observation in the second cluster.
Complete linkage is biased in that it usually produces clusters of equal diameters, and
can be largely distorted by even moderate outliers. Density linkage refers to the
class of clustering methods which use nonparametric probability density estimates.
Single linkage calculates the distance between two clusters as the minimum distance
between an observation from each of the two clusters. This approach is beneficial in
that it can detect elongated clusters, but in the process tends to sacrifice performance
in obtaining compact clusters. In the centroid method, the distance between two
clusters is the squared Euclidean distance between their corresponding centroids or
means. This method is able to handle outliers but may not perform as well as some
of the other methods, especially average linkage.
The CLUSTER procedure prints a history of the clustering process which lists the
observations joined in each step, and also gives statistics useful in estimating the true
number of clusters in the population from which the data were sampled. CLUSTER
also produces an output data set that can be used with the TREE procedure to draw
a tree diagram of the cluster hierarchy. This provides a visual record of which observations
were clustered and when. The output lists all stages of the procedure, from n clusters
down to a single cluster, so this option is mostly useful with small data sets.
The CLUSTER procedure was executed on the DIABETES data set, and the
results separated the population into the three clusters that, not surprisingly, matched
the underlying definition of the clinical groups. Figures 5.1 through 5.4 plot each
individual with its corresponding cluster number, which was determined by using
the centroid method to find three disjoint clusters. Cluster 1 represents the overt
diabetics, cluster 2 contains the chemical diabetics, and cluster 3 is the nondiabetic
subgroup. Some of the observations are hidden; that is, they are too close to each
other and show up as single points. Figure 5.1 shows the clusters when looking at the
variables GLUFAST versus GLUTEST, which, as you may recall, had a correlation of
.96 in the overall data set. The moderate correlation between GLUTEST and SSPG
resulted in the clusters in Figure 5.2 being similar to those in Figure 5.1. There may
still be evidence of clusters for uncorrelated variables, and sometimes the distinctions
between clusters are not very well defined, as seen in Figures 5.3 and 5.4, respectively.
This overlap indicates the number of existing clusters may be less than expected.
[Scatter plot omitted: Fasting Plasma Glucose versus Test Plasma Glucose,
points labeled by cluster number 1-3.]
Figure 5.1: Diabetes Cluster Analysis Plot of GLUFAST*GLUTEST
(72 observations hidden).
[Scatter plot omitted: Steady State Plasma Glucose versus Test Plasma Glucose,
points labeled by cluster number 1-3.]
Figure 5.2: Diabetes Cluster Analysis Plot of SSPG*GLUTEST
(35 observations hidden).
[Scatter plot omitted: Relative Weight versus Test Plasma Glucose,
points labeled by cluster number 1-3.]
Figure 5.3: Diabetes Cluster Analysis Plot of RELWT*GLUTEST
(19 observations hidden).
In most methods, when SAS was to determine only two clusters, the overt diabetics
were sorted into a cluster of their own, while the chemical diabetic cluster merged
with the normal cluster. Intuition may suggest that the two diabetic clusters would
merge and form a diabetic and a nondiabetic cluster. Apparently, chemical diabetics
are more similar to normals than to overt diabetics, as in Figure 5.4, where the plots
of chemical diabetics and nondiabetics, clusters 2 and 3, respectively, overlap.
[Scatter plot omitted: Plasma Insulin during Test versus Steady State Plasma
Glucose, points labeled by cluster number 1-3.]
Figure 5.4: Diabetes Cluster Analysis Plot of INSTEST*SSPG
(21 observations hidden).
Cluster analysis was not very helpful in this situation because the groups were well
defined. However, cluster analysis can be more useful in situations where the groups
are not as well defined. Cluster analysis was presented as a potential aid in dealing
with the type of data studied; therefore, its effectiveness depends on the definition of
the groups within the data and their degree of separation.
CHAPTER VI
CONCLUSION
Logistic regression analysis incorporates the familiarity and general principles of
linear regression, while offering a solution to the problem of dealing with the special
case of dichotomous response variables. An overview of logistic regression was presented,
yet only some of its many different applications have been discussed. The comparisons
throughout the discussion of the specific data sets show that more cumbersome
models are frequently required when linear regression is used instead of logistic
regression. Therefore, when response variables are not continuous, an alternative to
linear regression analysis should be sought.
Both the logistic and linear regression procedures agreed that the best models for
the two data sets contained eleven variables when interactions of the variables were
considered. The sums of squared errors were smaller for the logistic regression procedure
on both data sets. In the DIABETES data set, the logistic procedure significantly
outperformed the multiple linear regression procedure. For the PROSTATE data
set, logistic regression still outperformed the linear approach; however, the difference
between the sums of squared errors for the best models given by the two procedures
was minimal.
A data set can be investigated using clustering procedures to detect any underlying
separation within the data. Many techniques to identify these underlying groups or
clusters have been presented throughout the literature. Euclidean distance methods
are the most commonly used, since representing observations by points in a
many-dimensional space of their characteristics is intuitive. The cluster analysis
would verify the number of different categories with which to classify the observations
and determine whether logistic regression or some other categorical data modeling technique
should be used.
A next step in this research would be to investigate the use of principal components
to further aid in the classification of the data sets. The principal components
would specify the order of influence of the characteristics. They would also serve to
reduce the dimensionality of the characteristics by selecting the few most important
components involved in the prediction of the response variable.
REFERENCES
[1] Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley, p 13-18, 84-97.
[2] Berkson, J. (1955). Maximum Likelihood and Minimum χ² Estimates of the Logistic Function, Journal of the American Statistical Association, 50, 130-162.
[3] Cardell, N. and Steinberg, D. (1978, August). Estimating Logistic Regression Models Where the Dependent Variable has No Variance. Paper presented at the Joint Statistical Meetings of the American Statistical Association, San Francisco.
[4] Cardell, N. and Steinberg, D. (1987). Logistic Regression on Pooled Choice Based Samples and Samples Missing the Dependent Variable, American Statistical Association Proceedings of Social Statistics Section, 158-160.
[5] Cardell, N. and Steinberg, D. (1992). Estimating Logistic Regression Models Where the Dependent Variable has No Variance, Communications in Statistics: Theory and Methods, 21(2), 423-450.
[6] Duke, J. (1992). Sample Size and Estimated Odds Ratio in Logistic Regression: A Study with Repeated Samples from a Low Birth Weight Population, Ph.D. Dissertation, University of Oklahoma Health Sciences Center.
[7] Duran, B. and Odell, P. (1974). Cluster Analysis: A Survey, Lecture Notes in Economics and Mathematical Systems, 100, VI, p 1-30.
[8] Ferguson, T. (1967). Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press, 119-125.
[9] Gehan, E. (1959). Use of Medical Measurements to Predict the Course of Disease. Proceedings of Conference on Experimental Clinical Cancer Chemotherapy, Washington, D.C.: National Cancer Institute Monograph No. 3, 51-58.
[10] Hosmer, D. and Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley, p 1-47.
[11] Hogg, R. and Craig, A. (1978). Introduction to Mathematical Statistics, Englewood Cliffs, New Jersey: Prentice-Hall, Inc., p 168-179.
[12] Kleinbaum, D. (1994). Logistic Regression: A Self-Learning Text. New York: Springer-Verlag, p 1-61.
[13] Nottingham, Q. and Birch, J. (1998). A Note on the Small Sample Behavior of Logistic Regression in a Bioassay Setting, Journal of Biopharmaceutical Statistics, In press.
[14] Rylance, J. (1996). A Comparison of the Likelihood-Based Approach with Logistic Regression as a Method for Classification, Master's Thesis, North Dakota State University.
[15] SAS Institute Inc. (1995). Logistic Regression Examples Using the SAS System, Version 6, 1st Edition, Cary, NC: SAS Institute Inc., 163 pp.
[16] Tam, T. (1992). Binary Logistic Regression with Data That Have No Variance on the Dependent Variable: An Application to College Dropout Analysis, Ph.D. Dissertation, University of California, Los Angeles.
APPENDIX A
SAS CODE FOR DIABETES DATA SET
TO GENERATE VARIOUS REGRESSION RESULTS.
This code is an extension of what is given in SAS (1995).
options ls=72; data diabet;
infile diabet; input patient relwt glufast glutest instest sspg
group diab overt chem; label relwt = 'Relative Weight'
glufast = 'Fasting Plasma Glucose' glutest = 'Test Plasma Glucose' instest = 'Plasma Insulin during Test' sspg = 'Steady State Plasma Glucose' group = 'Clinical Group' diab = 'Diabetics (both)' overt = 'Overt Diabetics Only' chem = 'Chemical Diabetics Only';
/* Other variables defined for the interactions of each */ /* explanatory variable with all others including itself. */
rlwsq=relwt*relwt; rlw_glf=relwt*glufast; rlw_glt=relwt*glutest; rlw_ins=relwt*instest; rlw_sspg=relwt*sspg; glfsq=glufast*glufast; glf_glt=glufast*glutest; glf_ins=glufast*instest; glf_sspg=glufast*sspg; gltsq=glutest*glutest; glt_ins=glutest*instest; glt_sspg=glutest*sspg; inssq=instest*instest; ins_sspg=instest*sspg; sspgsq=sspg*sspg;
/* Runs a preliminary data analysis on the data set which */ /* includes finding the overall means for each variable */ /* and then each variable's mean for each of the subgroups */ /* separately. The correlation matrices are also found. */ /***********************************************************/
/* For the Overall Model */
proc means maxdec=4 n mean std; var relwt glufast glutest instest sspg; title 'Overall Diabetic Data Set';
proc corr; var relwt glufast glutest instest sspg;
run;
/* Separately for the Diabetics and Nondiabetics */ /**************•***********************************/
proc sort; by diab;
run; proc means maxdec=4 n mean std; by diab; var relwt glufast glutest instest sspg; title 'Diabetic Data Set'; title2 'Descriptive Statistics By DIAB';
proc corr; by diab; var relwt glufast glutest instest sspg;
run;
/**************************************************/
/* Separately for the Chemical Diabetics and Others */
proc sort; by chem;
run; proc means maxdec=4 n mean std; by chem; var relwt glufast glutest instest sspg; title 'Diabetic Data Set'; title2 'Descriptive Statistics By DIAB';
proc corr; by chem; var relwt glufast glutest instest sspg;
rim; /•*************************************************/
/* Separately for the Overt Diabetics and Others */ /***•**********************************************/
proc sort; by overt;
run; proc means maxdec=4 n mean std; by overt; var relwt glufast glutest instest sspg; title 'Diabetic Data Set'; title2 'Descriptive Statistics By OVERT';
proc corr; by overt; var relwt glufast glutest instest sspg;
run;
/* Investigate the one variable model first */ /* with linear then logistic regression. */
proc reg data=diabet; model diab= glutest; title 'Linear Regression of Diabetes Data';
run; proc logistic data=diabet descending;
model diab= glutest; title 'Logistic Regression of Diabetes Data';
run;
/* All original explanatory variables are investigated */ /* using the linear regression model selection techniques: */ /* Forward, backward and stepwise elimination. */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
/selection= forward; title 'Linear Regression of Diabetes Data';
run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg /selection= backward;
run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg /selection= stepwise;
run;
/**********************************************************/
/* Using the same options, now with logistic regression. */
proc logistic data=diabet descending; model diab= relwt glutest glufast instest sspg
/selection= forward; title 'Logistic Regression of Diabetes Data';
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg /selection= backward;
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg /selection= stepwise;
run;
/************************************************/
/* Now use the selection techniques specific to */ /* linear regression (adjrsq) and logistic */ /* regression (score). */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
/selection= adjrsq rsquare; title 'Linear Regression of Diabetes Data';
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg /selection= score;
title 'Logistic Regression of Diabetes Data'; run;
/***********************************************************/
/* All explanatory variables including interactions are */ /* investigated using the model selection techniques: */ /* forward, backward, and stepwise for linear regression. */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= forward;
title 'Linear Regression of Diabetes Data'; run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= backward;
run; proc reg data=diabet;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= stepwise;
run;
/**********************************************************/
/* Using the same options, now with logistic regression. */ /**********************************************************/
proc logistic data=diabet descending; model diab= relwt glutest glufast instest sspg
rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= forward;
title 'Logistic Regression of Diabetes Data'; run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= backward;
run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection= stepwise;
run;
/* All explanatory variables including interactions are */ /* investigated using the model selection techniques which */ /* are specific to linear regression (adjrsq) and logistic */ /* regression (score). */
proc reg data=diabet; model diab= relwt glutest glufast instest sspg
rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection=adjrsq rsquare best=2;
title 'Linear Regression of Diabetes Data'; run; proc logistic data=diabet descending;
model diab= relwt glutest glufast instest sspg rlwsq rlw_glf rlw_glt rlw_ins rlw_sspg glfsq glf_glt glf_ins glf_sspg gltsq glt_ins glt_sspg inssq ins_sspg sspgsq / selection=score;
title 'Logistic Regression of Diabetes Data'; run;
APPENDIX B SAS CODE TO GENERATE CLUSTER ANALYSIS RESULTS
options ls=72; data diabet;
infile diabet; input patient relwt glufast glutest instest sspg
group diab overt chem; label relwt = 'Relative Weight'
glufast = 'Fasting Plasma Glucose' glutest = 'Test Plasma Glucose' instest = 'Plasma Insulin during Test' sspg = 'Steady State Plasma Glucose' group = 'Clinical Group' diab = 'Diabetics Group' overt = 'Overt Group' chem = 'Chemical Diabetic Group';
/* Before Cluster analysis can be done, the data */ /* must be in a certain form — sorted in order. */
proc sort data=diabet out=diabet2; by group;
run; /***************************************************/
/* Using average linkage method, first it clusters */ /* the data, and then a dendrogram is printed */ /* showing the clusters created at each step. */
proc cluster data=diabet2 method=average noprint outtree=tree; id patient;
run; proc tree horizontal sort height=n; run; /*************************************************************/
/* The data is sorted into three and then two clusters, with */ /* group number able to be compared to cluster number. */ /*************************************************************/
proc tree noprint out=out nclusters=3; copy patient relwt glutest glufast sspg instest group;
run; proc sort;
by cluster; run; proc print label uniform;
id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Average Linkage with Three Clusters';
run;
proc cluster data=diabet2 method=average noprint outtree=tree; id patient;
run; proc tree noprint out=out nclusters=2;
copy patient relwt glutest glufast sspg instest group; title 'Cluster Analysis: Average Linkage with Two Clusters';
run; proc sort;
by cluster; run; proc print label uniform;
id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Average Linkage with Two Clusters';
run;
/* Similar results requested, but this time using the */ /* complete linkage method of clustering. The data is */ /* again sorted into three and then two clusters, with */ /* group number able to be compared to cluster number. */
proc cluster data=diabet2 method=complete noprint outtree=tree; id patient;
run; proc tree noprint out=out nclusters=3;
copy patient relwt glutest glufast sspg instest group; title 'Cluster Analysis: Complete Linkage with Three Clusters';
run; proc sort;
by cluster; run; proc print label uniform;
id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Complete Linkage with Three Clusters';
run; proc cluster data=diabet2 method=complete noprint outtree=tree;
id patient; run; proc tree noprint out=out nclusters=2;
copy patient relwt glutest glufast sspg instest group; title 'Cluster Analysis: Complete Linkage with Two Clusters';
run; proc sort;
by cluster; run;
proc print label uniform; id patient; var group relwt glufast glutest sspg instest; by cluster; title 'Cluster Analysis: Complete Linkage with Two Clusters';
run;
/* Similar results requested, but this time using the */ /* single linkage method of clustering. The data is */ /* again sorted into three and then two clusters, with */ /* group number able to be compared to cluster number. */
proc cluster data=diabet2 method=single noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=3;
   copy patient relwt glutest glufast sspg instest group;
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Single Linkage with Three Clusters';
run;
proc cluster data=diabet2 method=single noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=2;
   copy patient relwt glutest glufast sspg instest group;
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Single Linkage with Two Clusters';
run;
/********************************************************/
/* Similar results requested, but this time using the   */
/* centroid method of clustering. The data is again     */
/* sorted into three and then two clusters, with group  */
/* number able to be compared to cluster number.        */
/********************************************************/
proc cluster data=diabet2 method=centroid noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=3;
   copy patient relwt glutest glufast sspg instest group;
   title 'Cluster Analysis: Centroid Method with Three Clusters';
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Centroid Method with Three Clusters';
run;
proc cluster data=diabet2 method=centroid noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=2;
   copy patient relwt glutest glufast sspg instest group;
   title 'Cluster Analysis: Centroid Method with Two Clusters';
run;
proc sort;
   by cluster;
run;
proc print label uniform;
   id patient;
   var group relwt glufast glutest sspg instest;
   by cluster;
   title 'Cluster Analysis: Centroid Method with Two Clusters';
run;
/***********************************************************/
/* Now to visually show the clusters, plots can be created */
/* for any two variables, and the underlying groups should */
/* be apparent in the graph. Need to reestablish three     */
/* clusters, otherwise will use last clustering completed. */
/***********************************************************/
proc cluster data=diabet2 method=centroid noprint outtree=tree;
   id patient;
run;
proc tree noprint out=out nclusters=3;
   copy relwt glutest glufast sspg instest;
run;
proc sort;
   by cluster;
run;
proc plot;
   plot relwt*glutest=cluster;
   title 'Diabetes Cluster Analysis Plot';
run;
proc plot;
   plot sspg*glutest=cluster;
run;
proc plot;
   plot instest*sspg=cluster;
run;
proc plot;
   plot glufast*glutest=cluster;
run;
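The comment boxes above note that the cluster numbers are meant to be compared with the known group numbers. A minimal sketch of how that comparison could be tabulated directly, assuming the OUT= data set from PROC TREE still contains both the copied GROUP variable and the assigned CLUSTER variable, is a cross-tabulation with PROC FREQ (this step is an illustration, not part of the original program):

```sas
/* Hypothetical check: cross-tabulate known group against    */
/* assigned cluster. A strong diagonal pattern in the table  */
/* would indicate close agreement between the clustering     */
/* and the underlying groups.                                */
proc freq data=out;
   tables group*cluster / norow nocol nopercent;
   title 'Agreement Between Known Groups and Assigned Clusters';
run;
```

Each cell of the resulting table counts the patients with a given group label that were placed in a given cluster, so label-switching between group and cluster numbers is easy to spot by eye.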
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a
master's degree at Texas Tech University or Texas Tech University Health Sciences
Center, I agree that the Library and my major department shall make it freely
available for research purposes. Permission to copy this thesis for scholarly
purposes may be granted by the Director of the Library or my major professor.
It is understood that any copying or publication of this thesis for financial gain
shall not be allowed without my further written permission and that any user
may be liable for copyright infringement.
Agree (Permission is granted.)
Student's Signature                    Date
Disagree (Permission is not granted.)
Student's Signature Date