Psychological Assessment, 2000, Vol. 12, No. 1, 40-51
Copyright 2000 by the American Psychological Association, Inc. 1040-3590/00/$5.00 DOI: 10.1037/1040-3590.12.1.40
Applying Artificial Neural Network Models to Clinical Decision Making
Rumi Kato Price, Washington University School of Medicine
Edward L. Spitznagel, Washington University
Thomas J. Downey and Donald J. Meyer, Partek, Inc.
Nathan K. Risk, Washington University School of Medicine
Omar G. El-Ghazzawy, Washington University
Because psychological assessment typically lacks biological gold standards, it traditionally has relied on
clinicians' expert knowledge. A more empirically based approach frequently has applied linear models
to data to derive meaningful constructs and appropriate measures. Statistical inferences are then used to
assess the generality of the findings. This article introduces artificial neural networks (ANNs), flexible
nonlinear modeling techniques that test a model's generality by applying its estimates against "future"
data. ANNs have potential for overcoming some shortcomings of linear models. The basics of ANNs and
their applications to psychological assessment are reviewed. Two examples of clinical decision making
are described in which an ANN is compared with linear models, and the complexity of the network
performance is examined. Issues salient to psychological assessment are addressed.
In the area of psychological assessment, a computer-based al-
gorithm approach has relied heavily on the expert knowledge of
clinicians. In a more empirically based approach—used when
insufficient knowledge is available, existing knowledge is debated,
or a special population is investigated—linear models such as
factor analysis or regression models are frequently applied to
empirical data to derive meaningful constructs and appropriate
measures. In this approach, statistical inferences are needed to
assess the generality of constructs and rules because the gold
standard of expert knowledge may not be applied.
Making a statistical inference generally requires estimating the
standard error of the parameter; hence linear models have most
frequently been used. Linear models, however, are not appropriate
for all instances. Loss of information is inevitable when extreme
values of a measure must be truncated or rescaled to fit data to a
linear scale. In multivariate linear models, emphasis on statistical
significance often leads to situations in which a substantively meaningful predictive measure with a large effect and a large standard error may be dropped in favor of another measure with a small effect and an even smaller standard error (Achen, 1982). Such practice limits the ability of these models to construct a potentially more predictive model.

Rumi Kato Price and Nathan K. Risk, Department of Psychiatry, Washington University School of Medicine; Edward L. Spitznagel, Department of Mathematics, Washington University; Thomas J. Downey and Donald J. Meyer, Partek, Inc., St. Peters, Missouri; Omar G. El-Ghazzawy, Department of Chemistry, Washington University.
Preparation of this article was supported in part by the Independent Scientist Award (K02DA00221) and Research Grants R01DA07939, R01DA09281, and R01DA10021 from the National Institute on Drug Abuse of the National Institutes of Health. A portion of this article was first presented at the Annual Scientific Meeting of the College on Problems of Drug Dependence, San Juan, Puerto Rico, June 1996. We thank Paul Monro for his critical and constructive comments on earlier versions of this article.
Correspondence concerning this article should be addressed to Rumi K. Price, Department of Psychiatry, Washington University School of Medicine, 40 N. Kingshighway, Suite 2, St. Louis, Missouri 63108. Electronic mail may be sent to [email protected].
A more drastic approach than that exemplified by the current
linear models in psychological research can be taken by relying
primarily on empirical knowledge and minimizing the importance
of generalization by statistical inference, if the assumption is
accepted that gold standards in psychological assessment exist
only at a data-specific level. Such an assumption may be reason-
able for psychological assessment given the lack of biological gold
standards. Once such an approach is taken, validity assessment
depends more on the ability of the constructed gold standards to
predict "future" data. The importance of making statistical infer-
ences is reduced if future data are indeed available to validate the
empirically obtained constructs and rules. The importance of cap-
turing nonlinear association becomes an even more pertinent issue
because increased precision for prediction is needed with validity
testing against real data.
In this article, we introduce artificial neural networks (ANNs)
suitable for a variety of clinical decision problems such as assessment
of psychological state, diagnosis of psychiatric disorders, and predic-
tion of behavior outcomes such as suicide attempts, hospitalization,
and death. ANNs have become important in a variety of areas in
psychology and medicine, but their application to diagnostic decision
making in psychology and psychiatry has been minimal.
The basics of multilayer perceptron (MLP) modeling, the most
frequently used type of ANN, are described after a brief introduc-
tion to ANNs. We provide two examples of how an ANN can be
used in judgment tasks. The first example uses item questions to
predict a psychiatric diagnosis; the second example uses behav-
ioral measures collected in early adulthood to predict long-term
SPECIAL SECTION: ARTIFICIAL NEURAL NETWORK MODELS 41
mortality. The details of ANN methodology in these contexts
include ANN parameter specification, software, cross-validation,
and the receiver operating characteristic (ROC) evaluation. The
performance of the ANN is compared with the performance of the
linear statistical analyses frequently used in the area of psycho-
logical assessment: linear and quadratic discriminant analyses and
logistic regression analyses. We examine the performance of ANN
models with varying numbers of hidden neurons to show when
complex network architecture is needed to achieve a better level of
prediction than that achieved in linear models. We then discuss
issues for research practitioners to consider in order to make
optimal use of ANNs. Future directions of ANN applications and
current salient issues in psychological assessment are discussed.
Artificial Neural Networks for Psychological Assessment
Background
Since the first article on neural network modeling (McCulloch
& Pitts, 1943) appeared over half a century ago, continuing fas-
cination with this technique appears to be related to its historical
linkage to artificial intelligence, in that ANNs mimic neural activ-
ities of the human brain (Cheng & Titterington, 1994). Neural
network modeling has been used for various tasks, including
speech recognition, stock market prediction, mine detection, can-
cerous cell identification, and handwriting recognition. Current
applications of ANNs usually fall into three areas: (a) in modeling
biological nervous systems, (b) as real-time adaptive signal pro-
cessors or controllers implemented in hardware, and (c) as data-
analytic methods (Sarle, 1994). The applications described in this
article fall into the third area.
With increased communication between the fields of neural
network modeling and statistics, areas of overlap have become
better defined. Today, the ANN can be characterized as a highly
flexible nonlinear regression technique. Limitations of linear mod-
els have been shown in the "exclusive-or (XOR)" problem
(Rumelhart & McClelland, 1986). The simplest case consists of four points in two classes: the input vectors (0, 0) and (1, 1) belong to Class 1, and the input vectors (0, 1) and (1, 0) belong to Class 2. No single linear decision boundary can correctly classify all four points. This problem can be generalized
to higher dimensions. Thus, an ANN's ability to transform non-
linear input covariances to linearly separable ones via activation
functions supposedly gives the ANN an edge over the linear
modeling technique in some situations.
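The XOR geometry is easy to verify directly. The following sketch is not from the article; the weights are hand-set for illustration. It uses threshold-activation neurons to show how a single hidden layer makes the four XOR points separable:

```python
import numpy as np

def step(v):
    """Threshold activation: the neuron turns on when its net input exceeds 0."""
    return (v > 0).astype(float)

# Hand-set (hypothetical) weights: hidden unit 1 computes OR, hidden
# unit 2 computes AND; the output fires when OR is on but AND is off,
# which is exactly XOR.
W_hidden = np.array([[1.0, 1.0],    # weights into the OR unit
                     [1.0, 1.0]])   # weights into the AND unit
b_hidden = np.array([-0.5, -1.5])   # OR fires past 0.5, AND past 1.5
w_out = np.array([1.0, -1.0])
b_out = -0.5

def xor_net(x):
    h = step(W_hidden @ x + b_hidden)   # hidden layer transforms the inputs
    return float(step(np.array([w_out @ h + b_out]))[0])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_net(np.array(x, dtype=float)))
```

The hidden layer maps the four points to (OR, AND) coordinates, where a single line does separate the classes; this is the nonlinear-to-linearly-separable transformation the text describes.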
In practice, however, reports of the ability of ANNs to surpass
linear models for classification and prediction problems have been
inconsistent (Anderer, Saletu, Kloppel, Semlitsch, & Werner,
1994; Baxt, 1995; Doig, Inman, Sibbald, Martin, & Robertson,
1993; Duh, Walker, Pagano, & Kronlund, 1998; Dybowski,
Weller, Chang, & Gant, 1996). Attempts to examine weight ma-
trices, analogous to regression coefficients in linear models such as
logistic regression, are a logical step in identifying reasons for the
inconsistent performance of ANNs relative to linear models (Duh,
Walker, & Ayanian, 1998). Nonetheless, examination of individ-
ual weights becomes infeasible for a complex network model
because the number and dimension of weight matrices increase
with the complexity of models.
ANN performance associated with variations in complexity is
rarely examined in the context of comparison with linear models.
Our results from two experiments relate to the differences in ANN
performance with varying numbers of neurons, types of activa-
tions, and input variables. We suggest that better prediction of
ANN performance depends in part on the complexity of models
relative to the complexity of predictive input variables.
Multilayer Perceptron Model of Artificial
Neural Networks
Of several types of ANN models amenable to data-analytic
applications, MLP modeling has been used most often in clinical
medicine and the psychological literature.1 An MLP model con-
sists of multiple layers of interconnected neurons. Figure 1 illus-
trates the iterative process of MLP training for two-layer (or single
hidden-layer) models, used in two application examples described
later. In this instance, the two layers include hidden and output
layers, because input variables are not usually counted as a layer.
A hidden layer is used for internal transformation of the ANN:
Hidden neurons transform the "signals" from the input layer and
generate signals to be sent to the next layer, so that the MLP model
is sometimes called the "feedforward" neural network.
The input layer consists of input variables plus an input bias,
which adjusts the weighted mean of the input variables. All input
variables are regressed on the next layer, in the same fashion as
estimation of multiple regression. Thus, if the model includes only
a single output without a hidden layer, the ANN becomes a logistic
regression with a logistic activation function; it becomes a linear
discriminant function with a threshold activation function at out-
put.2 However, the activation functions of the hidden neurons
between the input variables and the output layer make the ANN a
highly flexible model. Any function can be approximated with
increasing numbers of hidden neurons using the sigmoid (logistic)
activation function (Bishop, 1995).
Once the predicted values are forwarded from the hidden layer
to the output layer, the error amount is computed on the basis of
the target (actual) value of outputs. Weights are recalculated after
each observation until all observations are processed.
For a more general multiple-hidden-neuron, multiple-hidden-layer model, the output for pattern (observation) p at the jth neuron on layer l can be expressed formally as

x_{pj}^{l} = f(v_{pj}^{l}),    (1)
1 MLP models are considered to be parametric, quasi-parametric, or
nonparametric (infinitely parametric, to be more precise) depending on the
complexity of the model. As a reference point, the generalized additive
model (Hastie & Tibshirani, 1990) can be placed somewhere between
parametric linear regression and fully nonparametric nonlinear regression
such as MLP, because the generalized additive model uses a nonlinear
transformation estimated by a nonparametric smoother (Cheng & Titter-
ington, 1994; Sarle, 1994; White, 1992).
2 Because both the ANN and logistic regression use an iterative
maximum-likelihood method to estimate parameters, the result of an ANN
with a single output and the result of multiple logistic regression should be
identical. On the other hand, although an ANN with a single threshold
activation function without a hidden layer uses the same model as a linear
discriminant analysis, the two methods would not yield exactly the same
results because the parameters in a linear discriminant analysis are com-
puted by ordinary least squares and an ANN uses a maximum-likelihood
method.
42 PRICE ET AL.
Figure 1. Flow diagram of the learning process in a multilayer perceptron (MLP) model. The notations in the diagram correspond to those in the text. Squares correspond to input and output variables; ovals correspond to MLP functions and parameters. For simplicity, the diagram is drawn for a two-layer (single hidden-layer) model; that is, l = 1 for the hidden layer and l = 2 for the output layer. The formulas in the text are expressed for more general multiple-hidden-layer models in which l can be higher than 2. For clarity, the input vector x is indexed by i, and the output vector y is indexed by k in the figure; however, in a multiple-hidden-layer model, k is subsumed in i and y is subsumed in x.
where the previous layer, l - 1, is input to a node in layer l through

v_{pj}^{l} = \sum_{i} w_{ji}^{l} x_{pi}^{l-1} + B^{l},    (2)

where w_{ji}^{l} is the weight (coefficient) from node i on layer l - 1 to node j on layer l, and x_{pi}^{l-1} is node i on the previous layer l - 1, which can be an input variable or an output from the previous layer. B^{l}, the output bias, can be added to adjust the decision threshold of the predicted outputs to evaluate MLP performance free of thresholds; B_{in}, the input bias, can be used to adjust the weighted mean of the input variables. These biases are not usually applied to hidden layers.
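As an illustration of Equations 1 and 2, a minimal feedforward pass might look like the following sketch; the layer sizes, random weights, and sigmoid activation here are hypothetical choices for demonstration, not the article's fitted models:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_layer(x_prev, W, bias=0.0):
    """Equation 2 then Equation 1: weighted sum v, then activation f(v)."""
    v = W @ x_prev + bias
    return sigmoid(v)

# Hypothetical sizes: 3 input variables, 4 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
x_input = np.array([1.0, 0.0, 1.0])
W1 = rng.normal(scale=0.1, size=(4, 3))   # weights w_ji into the hidden layer
W2 = rng.normal(scale=0.1, size=(2, 4))   # weights into the output layer
hidden = forward_layer(x_input, W1)       # signals sent forward ("feedforward")
output = forward_layer(hidden, W2)
```

Each layer's output becomes the next layer's input, which is why the text notes that output and input indices are subsumed in the generic node notation.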
In our applications, two activation functions are used for f(v_{pj}^{l}). The sigmoid (logistic) activation function is most commonly used in MLP modeling. This activation function takes the form

f(v_{pj}^{l}) = 1 / (1 + e^{-v_{pj}^{l}}).    (3)
The generalized Gaussian activation function is expressed as

f(v_{pj}^{l}) = e^{-(v_{pj}^{l})^{2}},    (4)

where the right-hand side is a generalized activation function of the Gaussian distribution. Additional parameters can be added to change the shapes of these functions.
The two functions are plotted along the axis of v_{pj}^{l} in Figure 2. Although the sigmoid activation function is more commonly used, the Gaussian activation function is more suitable for examining a finer range of the distribution of an input vector, as the shape of this function suggests. In contrast, a sigmoid neuron "turns on" and stays "on" once passing a threshold value. The maximum value of the activation function can be adjusted to scale the distribution range of the output vector. For classification and prediction problems involving dichotomous outcomes, the maximum value is set to 1.
Our MLP training used the "back-propagation" algorithm in the "on-line" mode, in which the weights are updated after processing each pattern (Rumelhart, Hinton, & Williams, 1986). In this algorithm, small initial weights are randomly assigned to the input neurons and are fed forward through the network. The error of the estimate with respect to the target at the jth output for the pth pattern is

e_{pj} = t_{pj} - y_{pj},    (5)

where t_{pj} is the target value and y_{pj} is the predicted output. The total error at the output layer is

E = \sum_{p} W_{p} \sum_{j} e_{pj}^{2}.    (6)

It sums errors over all outputs and over all patterns. W_p is the optional pattern weight to adjust the error function to give weighting to individual patterns. Pattern weighting was introduced because the minimization function can be problematic for maximizing prediction when using population survey data in which a prevalence estimate is very low. The pattern weight is typically set as the inverse of the prevalence, and W_p = 1 when pattern weighting is not added. In practice, we found pattern weights to be more helpful in increasing the efficiency of estimation than in improving prediction.
The information on the total error is used to adjust the weights in the output layer, which are propagated back to the hidden layer to adjust weights on the hidden layer. These changes are accomplished with the use of the generalized delta rule, which employs a gradient descent on parameter space. Each epoch (iteration) of back-propagation continues until the error is minimized globally.
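A highly simplified sketch of this on-line gradient descent follows. To stay short it assumes a single sigmoid output unit with no hidden layer (so no error is propagated back through hidden weights), the pattern-weighted squared error of Equation 6, and the learning-rate and momentum parameters described below; the toy data and all settings are hypothetical:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_online(X, t, pattern_w, lr=0.2, momentum=0.8, epochs=500, seed=0):
    """On-line training of a single sigmoid output unit.

    Weights are updated after each pattern. Each step moves down the
    gradient of the (optionally pattern-weighted) squared error, and
    momentum adds a fraction of the previous step to damp oscillations.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # small random initial weights
    prev_step = np.zeros_like(w)
    for _ in range(epochs):
        for x, target, wp in zip(X, t, pattern_w):
            y = sigmoid(w @ x)
            e = target - y                              # Equation 5, per pattern
            grad = -2.0 * wp * e * y * (1.0 - y) * x    # d(wp * e^2) / dw
            step = -lr * grad + momentum * prev_step
            w = w + step
            prev_step = step
    return w

# Hypothetical toy problem: the outcome equals the first input variable;
# the last column of X is a constant bias input.
X = np.array([[1., 1., 1.], [1., 0., 1.], [0., 1., 1.], [0., 0., 1.]])
t = np.array([1., 1., 0., 0.])
w = train_online(X, t, pattern_w=np.ones(len(t)))
preds = sigmoid(X @ w)
```

Setting `pattern_w` to the inverse of an outcome's prevalence reproduces the pattern-weighting idea above; with `pattern_w` all ones it reduces to the unweighted error of Equation 6.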
Specification of a few other parameters is usually necessary for the back-propagation algorithm. The learning rate is a step size taken in the downward direction of the gradient of the error function, E, with respect to a parameter. The momentum smooths out the changes in parameters such as weights (coefficients) to avoid short-term error oscillations during learning and to speed up convergence.3 These parameters are concerned with convergence speed or assurance of the global minimum. The other basics of MLP modeling are found in neural network textbooks (Anderson, 1995; Bishop, 1995; Ripley, 1996; Rumelhart & McClelland, 1986).

Figure 2. Sigmoid and Gaussian activation functions commonly used in MLP modeling. The input bias vector in Figure 1 is omitted from the input to the neuron for simplicity; the maximum value is 1 by default.
Applications to Psychiatric and Psychological Research
In early 1995, we conducted a MEDLINE literature search on
neural networks. We identified, cross-sectionally, over 1,200 en-
tries for current articles in engineering, computer science, medi-
cine, psychology, and neuroscience and methodological articles
relating to statistics and new techniques. This search produced
about 30 entries on topics in medicine and psychology.
Ten of the articles were on psychiatric or psychological topics.
They can be divided into two groups: (a) conceptual or simulation
models that use neural network models as a frame of reference to
describe the complexity of psychological phenomena and (b) ap-
plications of ANN models to improve the outcome of interest. The
topics in the first group included a neural network model of
cortical processing to provide a framework .for understanding
pathological processes in schizophrenia (Chen, 1994); neural net-
work simulation of psychiatric symptom recognition to illustrate
the deficiency of traditional algorithmic models of psychiatric
diagnoses (Berrios & Chen, 1993); neural network simulation
illustrating the structure of the relationships among situation, be-
havior, and childhood experience in order to predict personality
traits (Gonzalez-Heydrich, 1993); a mathematical approach to
hypnosis that connects general ideas about the organization of the
nervous system and neural network models (Kuzin, 1995); and a
review of the relation between dream construction and memory
using the neural network as a model architecture (Palombo, 1992).
In this literature, a neural network model is seen as a machine
analogue of the biological neural network system.
In the literature in the second group, neural network modeling was
used as an analytical tool. This group included use of ANN modeling
as a diagnostic test to predict the admission decision among psychi-
atric emergency patients (Somoza & Somoza, 1993); a comparison of
treatment decisions made by ANNs and clinicians for depressed and
psychotic patients (Modai, Stoler, Inbar-Saban, & Saban, 1993); use
of an ANN to predict length of hospital stay given psychiatric diag-
noses, demographics, illness severity, and other predictors (Davis,
Walter, & Davis, 1993; Lowell & Davis, 1994); and an application of
Bayesian decision models to an ANN in order to predict teenagers'
substance use (Wu & Gustafson, 1994).
About half of the articles in medicine and psychology were pub-
lished by investigators in institutions outside the United States. Most
noticeably, fewer than 40% of the investigators attempted to compare
or incorporate neural network models with standard statistical meth-
ods, which supports the view that there is a need for increased
communication between statisticians and neural network researchers
(Ripley, 1994). Comparison with, or incorporation of, multiple logis-
tic regression occurred in three studies (Doig et al., 1993; Lette et al.,
1994; Spackman, 1992). Comparison with other methods, such as the
Z statistic and discriminant analysis, was proposed (Anderer et al.,
1994). Simple ROC methods were applied to compare the perfor-
mance of an ANN and other methods in two studies (Liestøl,
Andersen, & Andersen, 1994; Spackman, 1992).
A more recent review of literature suggests that the use of ANNs
in psychiatric and psychological research has been increasing in
the last few years. An accelerating diffusion into diverse disci-
plines is indicated by the fact that articles on the use of ANNs are
appearing in highly regarded nontechnical journals. For example,
an introduction to ANNs (Baxt, 1995) and an ANN application that
used a large database collected on intensive-care-unit patients
(Dybowski et al., 1996) appeared in The Lancet. Studies that used
ANNs have also appeared in widely circulated journals in epide-
miology (Ioannidis, McQueen, Goedert, & Kaslow, 1998), genet-
ics (Lucek & Ott, 1997), psychiatry (Zou et al., 1996), and statis-
tics (Warner & Misra, 1996). In these articles, the results were
often compared with results obtained through commonly used
statistical methods (Dybowski et al., 1996; Ioannidis et al., 1998).
3 Another parameter, the active range, is Partek's invention (Partek, Inc., 1998) and not used in most MLP applications. It is a weight rescaling method used during the network initialization. Neurons can get "caught" at extreme values near 0 and 1, known as the saturation region, where learning becomes very slow (Fahlman, 1989). Active range scaling rescales the weights so that, upon initialization, the neurons all lie in their "active" as opposed to saturated regions (usually 90% to 95% of the range). In our application, the default .95 was used.
ANNs have been applied to studies involving standardized
psychological assessment instruments. For example, an ANN was
used to examine how varying the values of each of the four Stage
of Change (SOC) scales (Prochaska & DiClemente, 1983) affected
anxiety outcomes, thus allowing prediction of the amount of
changes in anxiety outcomes given a certain level of change in an
SOC scale (Reid, Nair, Mistry, & Beitman, 1996). Measures from
the Diagnostic Interview for Children and Adolescents (Herjanic
& Reich, 1982) were input to an ANN to predict adolescent
hopelessness (Kashani, Nair, Rao, Nair, & Reid, 1996). The Com-
posite International Diagnostic Interview (Robins et al., 1988) was
analyzed with an ANN to compare ANN-derived diagnoses with
those derived from computer algorithms based on the third edition
of the Diagnostic and Statistical Manual of Mental Disorders
(DSM-III; American Psychiatric Association, 1980); clinician-
based diagnoses were used as the gold standard (Zou et al., 1996).
Overall, however, we found that ANNs are still rarely used in
psychological assessment research.
Empirical Examples
In this section, using two large-scale psychiatric surveys, we
describe applications of ANNs to two problems in psychological
assessment: (a) whether an ANN is able to "mimic" a computer
diagnostic algorithm that translates the logic of the clinician diag-
nostic process and (b) whether an ANN is able to predict a
behavior outcome with various self-report measures expected to
contain a level of measurement error. To assess the utility of
ANNs, we also examined (c) whether ANNs outperform common
statistical methods and (d) whether the patterns of performance
improvement differ depending on architecture, initialization, and
complexity of input variables.
Method
Data Sets
Experiment 1: St. Louis Epidemiologic Catchment Area (ECA) Project.
The ECA project is one of the largest epidemiological psychiatric studies
ever conducted in the United States (Robins & Regier, 1991). Survey
respondents were persons 18 years of age or older residing in five catch-
ment areas of U.S. communities. For each catchment area, both community
and institutional samples were obtained with complex multistage sampling
methods, so that the sample represented the demographic distribution of the
U.S. population according to the 1980 census with proper sampling
weights (Eaton & Kessler, 1985). For this article, the household sample of
the data set from St. Louis was used (N = 2,809 after elimination of
missing values).
The ECA data set was chosen for several reasons: the advantage of using
general-population surveys for drawing inferences that can be generalized,
the importance of having a large sample when prevalence for a disorder is
low, and our familiarity with the data set (Price, Robins, Helzer, & Cottler,
1998). The diagnostic variables were computed on the basis of the DSM-III
(American Psychiatric Association, 1980), because the surveys were con-
ducted between 1981 and 1985. Psychiatric nosology has subsequently
changed with the publication of the revised DSM-III (DSM-III-R; Amer-
ican Psychiatric Association, 1987) and the fourth edition (DSM-IV; Amer-
ican Psychiatric Association, 1994). It should be noted that the appropri-
ateness of a specific nosological system is a peripheral issue in this article
because the ANN was tested to mimic the clinical decision process given
a specific nosological system, in this case, the DSM-III.
Experiment 2: Vietnam Era Study—Phase III. For the second set of
experiments, we used data from the Washington University Vietnam Era
Study—Phase III (VES). The previous surveys were conducted in 1972 and
1974. About half of the veterans in the target sample were drawn randomly
from the list of Army enlisted (E1-E9) returnees whose urine tested
positive for opiates, amphetamines, or barbiturates at the time of departure
from Vietnam in September 1971. The drug-positive veteran population
comprised an estimated 10.5% of the total 13,760 Army enlisted returnees
who departed at that time. The other half, the general sample, comprised
veterans who were randomly drawn from the same total population and
included 4.1% of the sample already drawn from the drug-positive list
(Robins, 1974). A civilian control sample was ascertained from Selective
Service registrations and individually matched to the general sample mem-
bers interviewed in 1974 for draft eligibility, draft board location, age, and
education (Robins & Helzer, 1975). The cohort, although not a general-
population sample, was ascertained from a large nonclinical population,
and veterans arguably can be considered a general-population sample
weighted to slightly lower socioeconomic status (O'Donnell, Voss, Clay-
ton, Slatin, & Room, 1976).
As part of the feasibility phase of the 25-year follow-up study in 1996,
mortality information was collected. Further details of the follow-up study
are described elsewhere (Price, Risk, & Spitznagel, 1999; Price, Risk,
Stwalley, et al., 1999). The sample for Experiment 2 was composed of
veterans and civilians who were interviewed in 1974 (N = 855).
Measures
Experiment I. Using data from the St. Louis ECA study, we assessed
the judgment ability of ANNs in the absence of information on DSM-III
rules for the diagnosis of antisocial personality disorder (ASPD). The input
variables were measures of 12 childhood and 9 adult DSM-III symptom
criteria for ASPD obtained from the Diagnostic Interview Schedule, a
well-validated structured instrument designed to be administered by non-
clinicians (Helzer et al., 1985; Robins, Helzer, Croughan, & Ratcliff,
1981). The childhood symptoms included truancy, school expulsion, arrest,
running away from home, lying, sex before age 15, getting drunk or
engaging in illicit drug use before age 15, stealing, vandalism, poor grades,
serious trouble at school, and starting fights. The adult symptoms included
job troubles; negligence toward children; a nontraffic arrest for prostitu-
tion, pimping, or drug sale; marital or intimate relationship problems;
violence; trouble with debts; vagrancy; lying; and traffic offenses. These
measures have been shown to correlate highly with each other (e.g.,
Loeber, 1988).
The output variable was the ASPD diagnosis. Validity studies reported
kappas of .63 for test-retest agreement (Robins, Helzer, Ratcliff, & Sey-
fried, 1982) and .63 for agreement between a psychiatrist and a trained
nonclinician interviewer (Robins et al., 1981). The exclusion criteria of
ASPD—that is, mental retardation, schizophrenia, or mania—were not
applied to the operational diagnostic measure of ASPD because the infor-
mation for the exclusion criteria is external to the information contained in
the childhood and adult symptoms.
Experiment 2. For the VES, the output variable was mortality. That is,
the judgment task was to predict which veterans and civilians would be
dead at the time of the follow-up study more than two decades after their
return from Vietnam. It was worthwhile to test the ability of an ANN with
a clearly defined behavioral measure that avoids fuzzy classifications
typical in psychological assessment. We do not claim that our measure of
mortality is perfect; nonetheless, we felt it was relatively free from clas-
sification errors.
By the end of December 1993, 97 of the 1,227 (7.9%) participants in the
original study had been identified as deceased (Price, Eisen, Virgo, Mur-
ray, & Robins, 1995). Three sources of death information were used, which
were later augmented by information from family members in order to
maximize accuracy for case identification (Price et al., 1999). We input two
sets of predictors to the ANN to see if differences in measures related to the
ANN's performance relative to that of logistic models.
The first set of input variables consisted of 23 significant predictors
chosen separately by logistic regression for each of three periods—pre-
Vietnam, in-Vietnam, and post-Vietnam up to 1974—plus lifetime ques-
tions and demographics. These variables were selected from over 120
variables covering 11 domains (demographics, socioeconomic status, mil-
itary experience, family history, psychiatric problems, antisocial behavior,
crime, social networks, drug use, alcohol use, and a residual category) that
were found to be associated with mortality by bivariate analyses. The input
variables were as follows: (a) pre-Vietnam—heroin use, regular narcotics
use, narcotics injection, types of drugs used, marijuana use level, too much
drinking, and childhood antisocial behavior, (b) in-Vietnam—narcotics
withdrawal, stimulant use, heavy drug use, and arrest; and (c) post-Vietnam
to 1974—wanting to take narcotics, feeling addicted to narcotics, knowing
where to buy heroin, taking drugs to relieve effects of other drugs, taking
drugs to help depression, depression while drinking, arrest, and number of
arrests. The lifetime behavioral predictors included heroin use level and use
of opiates five or more times. Two demographic variables, African Amer-
ican race and veteran status, were also added. These variables were
extensively analyzed previously (Robins, Helzer, & Davis, 1975), and
findings from the earlier surveys continue to be circulated (U.S. General
Accounting Office, 1998).
The second set of input variables included 8 variables that were signif-
icant at the .05 probability level when the previous 23 variables were input
to logistic regression with backward elimination. They included pre-
Vietnam heroin use, narcotics injection, and too much drinking; in-
Vietnam narcotics withdrawal; and post-Vietnam wanting to take narcotics,
knowing where to buy heroin, and using drugs to help depression. African
American race was also included. We hypothesized that an ANN would
improve prediction when a number of highly collinear variables were input
whereas the level of prediction with a logistic model would stay about the
same because a linear regression technique does not gain much additional
information from highly collinear variables.
Scaling. The input variables in Experiment 1 with the ECA data set
were all scaled to be dichotomous. Five of the input variables in Experi-
ment 2 with the VES data set were not dichotomous: types of drugs used
(0-4), marijuana use level (0-3), childhood antisocial behavior (0-10),
number of arrests (0-4), and heroin use (0-3). These variables were scaled
to vary between 0 and 1, a standard practice used to increase efficiency of
estimation in MLP modeling. The same scaling was applied to the input
variables of logistic regression for consistency.
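The scaling just described can be sketched as follows; the min-max formula is the standard practice the text refers to, and the variable name and range are taken from the list above (the function name is our own).

```python
def scale_01(values, lo, hi):
    """Min-max scale an ordinal variable with known range [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

# E.g., childhood antisocial behavior, scored 0-10 in the VES data set:
print(scale_01([0, 3, 7, 10], lo=0, hi=10))  # [0.0, 0.3, 0.7, 1.0]
```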
Analyses
Experiment 1. Two-output models were used because they provide a
simple solution to the threshold decision in a classification problem since
the difference in predictive values is used for assigning the class (Bishop,
1995). Both sigmoid and Gaussian activation functions were tried. The
range of a hidden neuron varied between 0 and 1, which is a default for a
classification problem. The number of hidden neurons was increased by an
increment of one from 1 to 5, then to 7, to 10, and thereafter by an
increment of five to a maximum of 30. After an initial series of experi-
ments, the learning rate was set to .001 and the momentum to .01. Both are
considered to be conservative, which is to say that convergence is achieved
slowly. The maximum iteration was set to 600 after initial analyses, which
used as many as 1,200 iterations. Estimation was accomplished both with
and without pattern weighting.
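The two-output design and its decision rule can be sketched with a minimal forward pass; the weights below are invented for illustration, and Partek's internal implementation is not described in the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """One hidden layer of sigmoid neurons feeding two output neurons."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h))) for w in w_out]

# Invented weights: 3 inputs -> 2 hidden -> 2 outputs.
w_hidden = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.2]]
w_out = [[1.0, -1.0], [-1.0, 1.0]]

y_pos, y_neg = forward([1.0, 0.0, 1.0], w_hidden, w_out)
label = "case" if y_pos > y_neg else "noncase"  # class = larger output value
```

The threshold decision reduces to comparing the two output values, which is the simple solution the text attributes to Bishop (1995).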
The results of MLP neural network models were compared to the results
of linear and quadratic discriminant analyses. We carried out analysis of
suboptimal network performance both for sigmoid and Gaussian neurons
with two different initializations in order to examine whether complex
MLP architecture is required to improve prediction.
Experiment 2. Two-output models were also used for this experiment.
The same sets of activation functions and learning parameters were tried.
The maximum number of iterations was set to 600 after initial trials.
Pattern weights were used throughout for this experiment because the point
prevalence was low at .079. Logistic regression analyses were used to
compare the results because beta estimates illustrate a limitation of linear
models when predictive collinear variables are used.4 We also examined
suboptimal network performance to assess whether the improvement in
prediction using the full set of predictors was related to the complexity of
the MLP architecture.
Software. Numerous software packages for ANNs are available. We
identified 32 free and 25 commercial packages at the time these experi-
ments were planned (Sarle, 1996). After a trial period, we chose UNIX-
based Partek, written by Thomas J. Downey and Donald J. Meyer (Partek
Inc., 1998). We also applied an ANN to the ECA data set using source code
modified from the "Neural Network Classes for the NeXT Computer,"
which is available from the University of Arizona. The results were judged
to be similar, although this source code used a one-output model and Partek
used two-output models.
Cross-validation. Because real future data are lacking in practice,
cross-validation is an essential part of the ANN methodology: it
allows assessment of the generalizability of ANN results obtained from the
training phase. Tenfold validation was used: 90% of the data were used to
train the ANN, and the remaining 10% of the data were used to compute
the numbers of positive and negative errors. Results of each validation
were obtained from the predicted outcome for each observation based on
estimates obtained from the training phase. The process was repeated 10
times, and the results were summed over 10 validation results to compute
the sensitivity and specificity scores. We cross-validated corresponding
linear models using the same procedure.
We selected tenfold validation, a typical choice in ANN modeling
(Bishop, 1995). Tenfold validation was suitable for our problems because
the outcome prevalence rates of both data sets were low (11.6% for ASPD
without exclusion criteria and 7.9% for mortality). Therefore, we wanted to
use as much information as possible for the positive cases for training
without increasing the processing time too much.5
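The tenfold procedure can be sketched as follows; the fold construction and the perfect "predictor" used in the sanity check are illustrative assumptions, and the actual ANN training step is elided.

```python
def tenfold_splits(n, k=10):
    """Yield (train, test) index lists; every observation is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for j in range(k):
        test = folds[j]
        held = set(test)
        train = [i for i in range(n) if i not in held]
        yield train, test

def summary_rates(y, predict):
    """Sum positive/negative errors over the 10 validation runs, then compute
    the summary sensitivity and specificity."""
    tp = fp = fn = tn = 0
    for train, test in tenfold_splits(len(y)):
        # A real run would train the model on `train` here.
        for i in test:
            p, t = predict(i), y[i]
            if t:
                tp, fn = tp + p, fn + (1 - p)
            else:
                fp, tn = fp + p, tn + (1 - p)
    return tp / (tp + fn), tn / (tn + fp)

# Sanity check with a perfect "predictor" (illustration only):
y = [1, 0, 0, 0, 0] * 20            # prevalence .20, n = 100
sens, spec = summary_rates(y, lambda i: y[i])
```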
Evaluation by ROC analysis. The ANN results were evaluated with a
standard ROC method. Although it was originally invented as an engineer-
ing technique for radar detection, ROC analysis has been subsequently
developed in the medical decision theory and signal detection literatures
(Swets & Pickett, 1982; Weinstein & Fineberg, 1980). ROC analysis is a
simple but useful tool for evaluating the predictive utility of a test. It
combines information on both true positives and false positives by plotting
sensitivity and (1 - specificity); therefore, ROC analysis is free of the
problems of cutoffs (Dwyer, 1996). For assessing the overall predictive
utility of measures, the area under the ROC curve (AUC) is commonly
used, which ranges between .5 and 1.0. When prediction is made from
self-reported behaviors, the AUC rarely extends beyond .8 even with
established risk factors. For example, the AUC value was in the .75 to .8
range in our past analysis predicting adult ASPD from childhood conduct
problems (Robins & Price, 1991). This range is similar to that of the AUC
for a blood glucose test before a meal in the prediction of diabetes
(Erdreich & Lee, 1981). We used the value of the complement of the area
under the curve (CAUC), that is, 1 - AUC, as an overall measure of
predictive utility. Differences in prediction are easier to see with the use of
CAUC when the prediction range is very good to excellent.
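The AUC, and hence the CAUC, can be computed without choosing a cutoff. The sketch below uses the pairwise-comparison form, which Hanley and McNeil (1982) showed equals the geometric area under the curve; the risk scores and outcomes are invented.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive outscores a
    randomly chosen negative (ties count one half); this pairwise form
    equals the area under the ROC curve (Hanley & McNeil, 1982)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented risk scores and outcomes:
a = auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0])
cauc = 1 - a  # complement of the area under the curve
```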
The ROCs for MLP models were obtained with the use of an output bias.
The decision threshold was varied to obtain 101 coordinates of the sum-
mary sensitivity and (1 - specificity); the ROCs for discriminant analyses
4 Although it might have been preferable to use Cox regression, which
estimates the impact of predictors on the proportional hazard rate, we chose
logistic regression because initial analyses indicated that estimates ob-
tained from Cox regression were nearly identical to those obtained from the
logistic regression analysis, which is often the case with a low-prevalence
dependent variable. Therefore, we used logistic regression estimates for
ease of interpretation.
5 Because of time constraints, we did not use leave-one-out cross-
validation, which repeats validation the same number of times as the
sample size, using all observations but one to train the ANN and testing
on the one left out.
were obtained by varying the threshold of the canonical discriminant function (Ripley, 1996); and the ROC for logistic regression was obtained
by varying the output threshold probability level.
Results
Experiment 1
Figure 3 shows the results of MLP modeling in an ANN com-
pared with the results of linear and quadratic discriminant analyses
in the assessment of the diagnosis of ASPD. The best-fit MLP
model was obtained with 30 Gaussian hidden neurons without
pattern weights at the 600-iteration stop point.6 This model out-
performed the best estimate of MLP with pattern weights as well
as those of the linear and quadratic discriminant analyses. The
CAUC was less than .001 for the MLP model, .023 for the linear
discriminant analysis, and .083 for the quadratic discriminant
analysis. The high level of diagnostic accuracy achieved by all
three methods is not surprising because the information consisted
of the DSM-III symptom criteria. However, it is noteworthy that
the MLP model yielded a better result than did linear ratings of the
symptoms.
Figure 4 plots the performance of suboptimal networks for both
sigmoid and Gaussian hidden neurons. We chose two sets of
randomly assigned initial weights for each type of activation
function to assess whether the initial weights also have an impact
on performance. Different initial weights resulted in some differ-
[Figure 3 plot: ROC curves (sensitivity vs. 1 - specificity) for MLP (CAUC = .001), LDA (CAUC = .023), and QDA (CAUC = .083).]
Figure 3. Predicting antisocial personality disorder (ASPD) from childhood and adult symptoms: comparison of MLP and discriminant analyses (N = 2,804). A superior prediction is measured by the CAUC value (the smaller the value, the better the prediction) and corresponds to the line
farthest away from the diagonal line, which indicates a random prediction. The input variables consisted of 12 childhood and 9 adult symptoms corresponding to those listed in the DSM-III (American Psychiatric Association, 1980). The output variable was ASPD meeting the DSM-III diagnostic criteria except for exclusion criteria. The MLP results are based on 30 Gaussian neurons without pattern weights. Tenfold validation was used for the MLP analysis and the LDA and QDA. MLP = multilayer perceptron; LDA = linear discriminant analysis; QDA = quadratic discriminant analysis; ROC = receiver operating characteristic.
ences in the performance level within the range in which the
CAUC was improving rapidly. When the number of hidden neu-
rons was very small or very large, the choice of initial weights
made little difference in the CAUC.
With the Gaussian activation, the MLP model's performance
improved rapidly with a small number of hidden neurons. With
four neurons, the CAUC decreased to less than .01 and thereafter
improved very slowly with larger numbers of hidden neurons. On
the other hand, network performance with the sigmoid activation
function was more affected by the number of hidden neurons used.
The CAUC did not improve to the same level as for the Gaussian
models until the number of neurons was increased to the 10-15
range.
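The two hidden-neuron activations compared here can be sketched directly; exp(-z²) is one common Gaussian parameterization, and the exact form used by the Partek package is not specified in the text.

```python
import math

def sigmoid(z):
    """Monotone S-shaped activation: one threshold splitting the input space."""
    return 1.0 / (1.0 + math.exp(-z))

def gaussian(z):
    """Bell-shaped activation, here exp(-z^2): responds only in a narrow band
    around zero, so it can localize a decision threshold in a finer region."""
    return math.exp(-z * z)

print(sigmoid(0.0), gaussian(0.0))  # 0.5 1.0
print(sigmoid(3.0), gaussian(3.0))  # far from zero, the sigmoid saturates
                                    # high while the Gaussian falls toward 0
```

This localized response is consistent with the observation that fewer Gaussian than sigmoid neurons were needed to reach the low CAUC plateau.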
Experiment 2
In Figure 5, MLP performance results were compared with
results of logistic regressions in predicting the 23-year mortal-
ity outcome for two sets of input variables. The smallest CAUC
value of .04 was achieved with 40 Gaussian hidden neurons and
the full 23 variables. Given that the behavioral data were
obtained two decades ago, we consider this level of the CAUC
as excellent. The parallel logistic regression produced a much
worse result (CAUC = .21) with the same 23 predictors.
However, when only eight variables were used, the MLP per-
formance was about the same as that of logistic regression,
yielding a CAUC value of .19. The gain in the CAUC value by
the MLP model with the 23 variables is primarily attributable to
increased sensitivity; that is, the MLP model did much better in
predicting deaths correctly.
As shown in Figure 6, the network performance results were
different depending on whether 23 or 8 input variables were
used. When all 23 variables were used, the increase in the
number of hidden neurons yielded smaller CAUC values for
both sigmoid and Gaussian activation functions. However, with
the 8 input variables, neither the sigmoid nor the Gaussian
models were able to reduce the CAUC much below a level of
.19, and the minimum was achieved with only five sigmoid
neurons. Beyond this level of complexity, the CAUC obtained
with sigmoid hidden neurons essentially stayed flat, whereas
the CAUC obtained with Gaussian neurons increased, which is
likely to be a sign of overfitting. These results indicate that
complex network architecture did not help when the number of
predictive input variables was limited.
The predictors in the logistic regression without cross-validation
are shown in Table 1.7 The 8 most significant variables chosen by
the logistic regression were all dichotomous when the 23 mixed
dichotomous and scaled variables were input for selection with
backward elimination. The 23 variables included several clusters
of collinear variables involving heroin use, narcotics injection,
types of drugs used, marijuana use level, narcotics withdrawal, felt
6 It would have been possible to achieve a slightly better fit with a higher number of neurons or of iterations. However, we judged further improvement was unnecessary for the purposes of this experiment given the small level of the CAUC achieved by the MLP model.
7 No method exists to our knowledge for adjusting beta estimates obtained from multifold validation. Therefore, the logistic regression in Table 1 used 100% of the data, compared with 90% in a tenfold estimation.
[Figure 4 plot: CAUC vs. number of hidden neurons.]
Figure 4. Multilayer perceptron network performance predicting antisocial personality disorder using sigmoid versus Gaussian neurons with different initializations. The sample corresponds to that used in Figure 3, and the same input variables were used. The number of neurons plotted in this figure varied among 1, 2, 3, 4, 5, 7, 10, 15, and 20 (the complement of the area under the curve [CAUC] almost reached a plateau at 20). Two sets of initial weights were randomly chosen for both sigmoid and Gaussian activation functions.
addiction to narcotics, drug use to relieve the effects of other drugs, drug use to help depression, depression while drinking, arrest, number of arrests, heroin use level, and lifetime use of opiates five or more times. Odds ratios transformed from the beta estimates involving these 13 variables, therefore, are biased with respect to the magnitude of the predictive effects. Also, effects of 7 variables yielded odds ratios less than 1, erroneously indicating that they were protective factors. A usual treatment would be to remove highly collinear variables to make the results of odds ratios meaningful. Without such a treatment, odds ratios can lead to misleading or outright wrong interpretations. Furthermore, prediction with logistic regression suffers from having these collinear variables, as shown by the higher CAUC with 23 variables than with 8 variables.
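The odds-ratio transformation noted under Table 1 is simply OR = e^β. The coefficients below are hypothetical values chosen to show how a negative beta yields OR < 1, the "protective" reading that becomes erroneous when collinearity flips the sign of an individual beta.

```python
import math

def odds_ratio(beta):
    """OR = e^beta, the transformation noted under Table 1."""
    return math.exp(beta)

# Hypothetical coefficients: a positive beta gives OR > 1 (a risk factor),
# while a negative beta gives OR < 1 (apparently "protective").
print(round(odds_ratio(1.17), 2))   # 3.22
print(round(odds_ratio(-1.27), 2))  # 0.28
```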
Discussion
Findings Summary
The results of the two experiments in using ANNs in this article were related to (a) improving judgment and decision making in the area of psychological assessment, (b) comparing ANN performance with the performance of linear statistical methods, and (c) delineating some of the reasons for the superiority of ANNs over linear methods. The ANN outperformed linear and quadratic discriminant analyses in a psychiatric diagnosis assessment when a set of clinical measures was provided a priori (the ECA example). It also made a substantial improvement over logistic regression analyses in predicting cumulative mortality from measures collected two decades earlier when the input variables consisted of a large number of dichotomous and ordinal measures known to be associated with the outcome. When the input vari-
[Figure 5 plot: ROC curves for the MLP full model (CAUC = .04), logistic regression (.21), the MLP reduced model (.19), and the reduced logistic regression (.19).]
Figure 5. Comparison of 23-year mortality prediction between the MLP (multilayer perceptron) model and logistic regression (N = 855). The sample was a subset of the Vietnam Era Study data set, which consisted of veterans and nonveterans interviewed in the 1974 survey. All 23 variables listed in Table 1 were used, and this model was compared with a reduced model with 8 variables selected by backward elimination in a logistic regression. The receiver operating characteristic (ROC) curve for the MLP full model was obtained with 40 Gaussian hidden neurons; the ROC curve for the MLP reduced model was obtained with 5 sigmoid hidden neurons. Both used pattern weights. All ROC curves used tenfold cross-validation.
[Figure 6 plot: CAUC vs. number of hidden neurons, for the full 23-variable and reduced 8-variable models.]
Figure 6. Multilayer perceptron (MLP) network performance for predicting mortality using sigmoid and
Gaussian neurons with different sets of predictors. The sample corresponds to that used in Figure 5. All 23
variables were used in the full MLP models; the reduced MLP models used 8 variables selected by backward
elimination in a logistic regression. For both sigmoid and Gaussian activation functions, the number of hidden
neurons varied among 1, 2, 3, 4, 5, 7, 10, 15, 20, 25, 30, and 40. CAUC = complement of the area under the
curve.
ables were limited to a smaller set of measures considered most
significant by logistic regression, MLP modeling did only as well
as logistic regression (the VES example).
Examination of network performance yielded several observa-
tions. Initial weights were found not to be critical if the number of
hidden neurons was sufficiently large. The results indicate that the
types of hidden neurons seem to affect the level of prediction. With
the ECA data set, we found differences in a middle range of network complexity: Gaussian models reached the low plateau of
the CAUC with fewer neurons than did models with sigmoid
neurons. A similar trend was observed with the VES data set;
however, when the input variables were limited, Gaussian models
risked overfitting even with a smaller number of hidden neurons.
There appear to be no solid rules on the number of hidden
neurons other than a commonsense understanding that the number
of response patterns is one factor (Zurada, 1992). Nevertheless,
because of the Gaussian model's purported sensitivity for finding
a threshold in a finer region, better performance was expected with
a smaller number of Gaussian hidden neurons. With increased
numbers of hidden neurons, however, the distinction between
sigmoid and Gaussian models was expected to blur.
Our second experiment suggests that the number and the nature
of input variables are particularly important for MLP models to be
able to improve prediction over linear models. The lack of im-
provement in the CAUC for the MLP model using the eight
variables may be due in part to the fact that all of these eight
variables were dichotomous (Warner & Misra, 1996). The infor-
mation for potentially nonlinear association with the underlying
distribution is diminished with the use of dichotomous variables. In the logistic analyses, highly correlated input variables hurt more
than helped estimation and prediction. However, collinear vari-
ables have the potential to provide additional information to im-
prove prediction in MLP models by transforming a linearly non-
separable input covariate space to a potentially linearly separable
"image" space in hidden layers (Zurada, 1992), which may be a
reason for the much improved prediction by the MLP model with
the 23 variables.
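This "image space" idea can be illustrated with the classic XOR problem, which no single linear boundary separates; the hand-chosen weights below are purely illustrative and are not the article's trained networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_image(x1, x2):
    """Map an input through two hand-chosen sigmoid hidden neurons; in this
    'image' space the XOR classes become linearly separable."""
    h1 = sigmoid(10 * (x1 + x2) - 5)    # roughly: x1 OR x2
    h2 = sigmoid(10 * (x1 + x2) - 15)   # roughly: x1 AND x2
    return h1, h2

def xor_net(x1, x2):
    h1, h2 = hidden_image(x1, x2)
    return sigmoid(10 * (h1 - h2) - 5)  # high only when exactly one input is on

preds = [round(xor_net(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# preds == [0, 1, 1, 0]
```

In the raw input space no single line separates the two classes; in the (h1, h2) space the line h1 - h2 = 0.5 does, which is the transformation Zurada (1992) describes.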
Future Directions for ANNs and Associated Methods
Our data analyses using ANNs could be improved in several
ways. With respect to ANNs, a variety of algorithms are now
available in addition to the back-propagation algorithm, which is
known to be slow. These newer algorithms use the information
from second derivatives and therefore reach convergence faster
and require fewer learning parameters. For example, the learning
rate and momentum parameters we used in our experiments are not
needed with the use of newer algorithms (Bishop, 1995; Ripley,
1996). Employing these newer algorithms with the increasingly
powerful hardware available is likely to remedy the current com-
putational inefficiency associated with ANNs.
Our method in neural network estimation did not directly min-
imize errors based on the values of the CAUC, because its deriv-
atives with respect to weights could not be mathematically ob-
tained. It may be possible to explore alternative minimization
functions in neural networks with the use of ROC analysis by
linear transformation of the ROC or by using equivalent estimates
such as Wilcoxon or C statistics (Hanley & McNeil, 1982). Alter-
natively, obtaining the derivatives numerically may become a
feasible option, as is already done with nonlinear regression algo-
rithms (Ralston & Jennrich, 1978).
Table 1
Estimates of Logistic Regression Coefficients With All 23
Variables Versus 8 Most Significant Variables (N = 855)

                                              All 23 variables     Only 8 most significant
Variable                                      OR        p          OR        p
Demographics
  African American (race)                     3.22      .002       3.16      .001
  Veteran                                     2.00      .36
Pre-Vietnam
  Heroin use                                  0.05      .005       0.07      .007
  Regular narcotics use                       1.29      .72
  Narcotics injection                         24.72     .0009      26.28     .0002
  Types of drugs used (0-4)                   0.87      .86
  Marijuana use level (0-3)                   2.19      .35
  Too much drinking                           2.46      .03        2.23      .05
  Childhood antisocial behavior (0-10)        2.83      .22
In-Vietnam
  Narcotics withdrawal                        3.04      .13        4.32      .0004
  Stimulant use                               1.87      .13
  Heavy drug use                              1.95      .44
  Arrest                                      1.39      .37
Post-Vietnam
  Wanted to take narcotics                    4.32      .003       4.30      .001
  Felt addicted to narcotics                  2.03      .21
  Know where to buy heroin                    1.64      .19        2.06      .04
  Drug use to relieve effects of other drugs  0.53      .21
  Drug use to help depression                 2.76      .03        2.16      .05
  Depression while drinking                   0.50      .20
  Arrest                                      2.03      .21
  Number of arrests (0-4)                     0.26      .22
Lifetime
  Heroin use level (0-3)                      0.28      .25
  Opium use five or more times                0.66      .58

Note. The two logistic regressions correspond to the two receiver operating characteristic curves shown in Figure 5. Estimates were onefold; that is, all available observations were used. OR (odds ratio) = e^β; p = probability level of the beta coefficient. The lifetime assessment refers to "ever" using up to the 1974 survey. All variables are dichotomous except those followed by numbers in parentheses. All ordinal variables were subsequently scaled to be between 0 and 1, to be identical to the scales used in the multilayer perceptron model.
Exploration of fuzzy neural networks, which applies fuzzy set
theory to neural network architecture, might prove useful for
diagnostic decision making (Fu & Shann, 1994). However, this
type of ANN architecture involves multiple hidden layers and is
much more complex than the type of ANN we have modeled so
far.
One way to describe the validity of a clinical diagnosis is to
evaluate how well it predicts longitudinal outcomes. Neural net-
work application to longitudinal measures at this time remains
theoretical at best (De Laurentiis & Ravdin, 1994; Liestøl et al.,
1994). In our example of behavioral outcome prediction, we used
logistic regression to predict cumulative mortality. Nonrecurring
events, such as death, however, are better handled by survival
analysis and Cox regression given their ability to take the timing of
events into account.
Neural network analogues to the statistical methods for longi-
tudinal measures have been tested with small data sets. The ap-
proach involves (a) starting with a one-layer model with the
logistic activation function, which, with back-propagation, should
compute the predicted conditional probability of the event, and
then (b) extending it to a nonlinear nonproportional hazard model
by introducing a hidden layer (Liestøl et al., 1994).
The ROC methodology can be applied in a more sophisticated
fashion. The probit-probit transformation of the ROC (Swets &
Pickett, 1982) would yield linear estimation. Thus, regression
coefficients can be estimated with Z statistics to test for differences
in prediction (Erdreich & Lee, 1981). The recently developed
"two-truth" method (Phelps & Hutson, 1995) appears to improve
the estimate of the accuracy of a test when the estimate of the true
rates is uncertain. It may provide a solution to the problem of noisy
data, frequently found with psychosocial data.
Issues of Psychological Assessment
The results of our experiments demonstrated several advantages
of ANNs over linear models for psychological assessment. ANNs
are far from problem free, however. Perhaps most important is the
"black-box" nature of current ANN applications. As mentioned,
one obvious way to open the black box is with a detailed inter-
pretation of weights, moving from a simple to a more complex
model. For example, analyses of input and output spaces created
from weight matrices can aid in deciding the optimal number of
neurons that yield substantive interpretation of data (e.g., Duh et
al., 1998). In a more complex model involving a large number of
input variables and hidden neurons, it is still possible to assess the
relative importance of input variables by taking a summary mea-
sure of the weights (Lucek & Ott, 1997). Alternatively, the struc-
ture of hidden layers can be explored by using the information
stored in vectors of hidden neurons, once training is completed.
Principal-components analysis, canonical discriminant analysis
(e.g., Wiles & Bloesch, 1992), and cluster analysis (Elman, 1990)
have been used to explore the network structure.
Linear methods and ANNs are both sensitive to low prevalence,
which is often the case with rare diseases or psychiatric disorders.
When the prevalence is very low, an ANN does what any smart
human would do: It throws all cases into the negative category at
the cost of a small number of false negatives. This practice yields
excellent specificity and unimpressive sensitivity and could yield
unfortunate clinical consequences. For example, a clinician may
find misdiagnosing 1 depressive who subsequently commits sui-
cide to be more "costly" than prescribing Prozac to 10 dysphoric
individuals. We used pattern weights to address this issue, but this
strategy was more helpful for improving the efficiency of estima-
tion than prediction. The concern for low prevalence can be
substantially diminished if the error function in MLP modeling is
derived directly to minimize the CAUC in the ROC method.
Guidelines are needed for the sample size. The literature avail-
able on sample size requirements for neural network modeling is
limited. In the case of a rare disease or condition, the sample size
needed is probably related to the number of positives, so enriched
samples or populations will be valuable. It would be possible to
obtain the sampling distribution of weights by a bootstrap method
to estimate necessary sample size (White, 1989).
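A bootstrap of the kind suggested by White (1989) can be sketched as follows; the resampled statistic here is a sample mean standing in for a trained network weight, an illustrative assumption, since refitting a full ANN per resample is omitted.

```python
import random

def bootstrap_distribution(data, statistic, n_boot=200, seed=0):
    """Approximate the sampling distribution of a fitted quantity by
    recomputing it on resamples drawn with replacement."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        draws.append(statistic(resample))
    return draws

# The sample mean stands in for a trained network weight (illustration only).
dist = bootstrap_distribution(list(range(100)), lambda s: sum(s) / len(s))
spread = max(dist) - min(dist)  # width hints at how stable the estimate is
```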
Equally important for the purposes of psychological assessment,
information is needed on the conditions under which ANNs out-
perform conventional statistical methods. The examination of net-
work performance and comparisons with logistic analysis in our
application provided empirical observations about some likely
50 PRICE ET AL.
conditions that make application of ANNs suitable. We attempted
to provide some information as to how to better model ANNs, such
as the number and nature of input variables and performance
variation associated with types of activation functions and the
number of hidden neurons. The next step will be to conduct
simulation studies in which linearity, noise level, prevalence, and
sample size are concurrently varied with other parameters of ANN
modeling. Such studies are likely to provide information useful to
research practitioners for deciding when and how to apply ANN
models to their data.
It should be stressed that an ANN is only as good as the data it
analyzes. We have had several failed experiments when input
variables were limited or were not constructed to be useful for
ANN modeling. Greater attention needs to be paid to improving
techniques for selecting optimal measures. Newer techniques such
as genetic algorithms can be applied to selecting optimal measures
(Goldberg, 1989; Downey & Meyer, 1994). Genetic algorithms
can be combined with ANN modeling (So & Karplus, 1996) to
further improve ANN prediction. Given ongoing refinements to
the ANN methodology, we expect that ANNs will gain increased
acceptance in psychiatric and psychological research and will be
applied more widely in clinical decision making.
References
Achen, C. (1982). Interpreting and using regression. In J. L. Sullivan &R. G. Niemi (Series Eds.), Sage University paper series on quantitativeapplications in the social sciences (07-001). Beverly Hills, CA: Sage.
American Psychiatric Association. (1980). Diagnostic and statistical man-ual of mental disorders (3rd ed.). Washington, DC: Author.
American Psychiatric Association. (1987). Diagnostic and statistical man-
ual of mental disorders (3rd ed., rev.). Washington, DC: Author.American Psychiatric Association. (1994). Diagnostic and statistical man-
ual of mental disorders (4th ed.). Washington, DC: Author.Anderer, P., Saletu, B., KLBppel, B., Semlitsch, H. V., & Werner, H.
(1994). Discrimination between demented patients and normals based ontopographic EEG slow wave activity: Comparison between Z statistics,discriminant analysis, and artificial neural network classifies. Electro-encephalography and Clinical Neurophysiology, 91, 108-117.
Anderson, J. A. (1995). An introduction to neural networks. Cambridge,
MA: MIT Press.Baxt, W. G. (1995). Application of artificial neural networks to clinical
medicine. Lancet, 346, 1135-1138.Berries, G. E., & Chen, E. Y. H. (1993). Recognizing psychiatric symp-
toms. Relevance to the diagnostic process. British Journal of Psychiatry,163, 308-314.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford,England: Oxford University Press.
Chen, E. Y. H. (1994). A neural network model of cortical informationprocessing in schizophrenia. 1: Interaction between biological and socialfactors in symptom formation. Canadian Journal of Psychiatry, 39,362-367.
Cheng, B., & Titterington, D. M. (1994). Neural networks: A review froma statistical perspective. Statistical Science, 9, 2-54.
Davis, G. E., Walter, E. L., & Davis, G. L. (1993). A neural network thatpredicts psychiatric length of stay. Clinical Computing, 10, 87-92.
De Laureatiis, M., & Ravdin, P. M. (1994). Survival analysis of censoreddata: Neural network analysis detection of complex interaction betweenvariables. Breast Cancer Research and Treatment, 32, 113-118.
Doig, G. S., Inman, K. I., Sibbald, W. J., Martin, C. M., & Robertson, J. M.(1994). Modeling mortality in the intensive care unit: Comparing theperformance of a back-propagation, associative-learning network withmultivariate logistic regression. Proceedings of the Annual Symposiumon Computer Applications in Medical Care, 361-365.
Downey, T. J., & Meyer, D. J. (1994). A genetic algorithm for feature
selection. In C. H. Dagli, B. R. Femandes, J. Ghosh, & R. T. S. Kumura(Eds.), Intelligent engineering systems through artificial neural net-
works (Vol. 4, pp. 363-369). New York: ASME Press.
Dun, M. S., Walker, A. M., & Ayanian, J. Z. (1998). Epidemiologic
interpretation of artificial neural networks. American Journal of Epide-
miology, 147, 1112-1122.Duh, M. S., Walker, A. M., Pagano, M., & Kronlund, K. (1998). Prediction
and cross-validation of neural networks versus logistic regression: Using
hepatic disorders as an example. American Journal of Epidemiology,147, 407-413.
Dwyer, C. A. (1996). Cut scores and testing: Statistics, judgment, truth, and
error. Psychological Assessment, 8, 360-362.
Dybowski, R., Weller, P., Chang, R., & Gant, V. (1996). Prediction of
outcome in critically ill patients using artificial neural network synthe-
sized by genetic algorithm. Lancet, 347, 1146-1150.
Eaton, W. W., & Kessler, L. G. (Eds.). (1985). Epidemiologic field meth-ods in psychiatry: The NIMH Epidemiologic Catchment Area Program.
Orlando, FL: Academic Press.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14,
179-211.Erdreich, L. S., & Lee, E. T. (1981). Use of relative operating characteristic
analysis in epidemiology. A method for dealing with subjective judge-
ment. American Journal of Epidemiology, 14, 649-662.Fahlman, S. E. (1989). Faster-learning variations on back-propagation: An
empirical study. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.),
Proceedings of the 1988 Connectionist Models Summer School (pp.
38-51). San Mateo, CA: Morgan Kaufmann.
Fu, H. C., & Shann, J. (1994). A fuzzy neural network for knowledgelearning. International Journal of Neural System, 5. 13-22.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization andmachine learning. Reading, MA: Addison-Wesley.
Gonzalez-Heydrich, J. (1993). Using neural networks to model personality
development. Medical Hypotheses, 41, 123-130.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models.
London: Chapman & Hall.
Helzer, J. E., Stoltzman, R. K., Farmer, A., Brockington, I. F., Plesons, D., Singerman, B., & Works, J. (1985). Comparing the DIS with a DIS/DSM-III-based physician reevaluation. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic field methods in psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 285-308). Orlando, FL: Academic Press.
Herjanic, B., & Reich, W. (1982). Development of a structured psychiatric
interview for children: Agreement between child and parent on individ-
ual symptoms. Journal of Abnormal Child Psychology, 10, 307-324.
Ioannidis, J. P. A., McQueen, P. G., Goedert, J. J., & Kaslow, R. A. (1998). Use of neural networks to model complex immunogenetic associations of disease: Human leukocyte antigen impact on the progression of human immunodeficiency virus infection. American Journal of Epidemiology, 147, 464-471.
Kashani, J. H., Nair, S. S., Rao, V. G., Nair, J., & Reid, J. C. (1996). Relationship of personality, environmental and DICA variables to adolescent hopelessness: A neural network sensitivity approach. Journal of the American Academy of Child and Adolescent Psychiatry, 35, 640-645.
Kuzin, I. A. (1995). Hypnotic modeling in neural networks. Bulletin of Mathematical Biology, 57, 1-20.
Lette, J. L., Colletti, B., Cerino, M., McNamara, D., Eybalin, M. C., Levasseur, A., & Nattel, S. (1994). Artificial intelligence versus logistic regression statistical modeling to predict cardiac complications after noncardiac surgery. Clinical Cardiology, 17, 609-614.
Liestøl, K., Andersen, P. K., & Andersen, U. (1994). Survival analysis and neural nets. Statistics in Medicine, 13, 1189-1200.
Loeber, R. (1988). Natural histories of conduct problems, delinquency, and associated substance use: Evidence for developmental progressions. In B. B. Lahey & A. E. Kazdin (Eds.), Advances in clinical child psychology (Vol. 11, pp. 73-124). New York: Plenum Press.
Lowell, W. E., & Davis, G. E. (1994). Predicting length of stay for
psychiatric diagnosis-related groups using neural networks. Journal of
the American Medical Informatics Association, 6, 459-466.
Lucek, P. R., & Ott, J. (1997). Neural network analysis of complex traits.
Genetic Epidemiology, 14, 1101-1106.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Modai, I., Stoler, M., Inbar-Saban, N., & Saban, N. (1993). Clinical
decisions for psychiatric inpatients and their evaluation by a trained
neural network. Methods of Information in Medicine, 32, 396-399.
O'Donnell, J. A., Voss, H. L., Clayton, R. R., Slatin, G. T., & Room,
R. G. W. (1976). Young men and drugs—Nationwide survey (NIDA
Research Monograph No. 5; DHEW Publication No. ADM 76-311).
Springfield, VA: National Institute on Drug Abuse. (NTIS No. PB
247-446)
Palombo, S. R. (1992). Connectivity and condensation in dreaming. Jour-
nal of the American Psychoanalytic Association, 40, 1139-1159.
Partek, Inc. (1998). Partek: Pattern Analysis and Recognition Technology
(Version 2.0B7) [Computer software]. St. Peters, MO: Author.
Phelps, C. E., & Hutson, L. (1995). Estimating diagnostic test accuracy
using a "fuzzy gold standard." Medical Decision Making, 15, 44-57.
Price, R. K., Eisen, S. A., Virgo, K. S., Murray, K. S., & Robins, L. N.
(1995). Vietnam drug users two decades later. I. Mortality results [Ab-
stract]. In Proceedings of the 56th Annual Scientific Meeting of the
College on Problems of Drug Dependence, Inc. (Vol. 2, p. 204; NIDA
Research Monograph No. 153; DHHS Publication No. NIH95-3883).
Washington, DC: U.S. Government Printing Office.
Price, R. K., Risk, N. K., & Spitznagel, E. L. (1999). Remission from illicit
drug use over a 24-year period. Part I. Patterns and predictors of
remission and treatment use. Manuscript submitted for publication.
Price, R. K., Risk, N. K., Stwalley, D., Murray, K. S., Eisen, S., Virgo,
K. S., Spitznagel, E. L., & Robins, L. N. (1999). Vietnam drug users
twenty-five years later. Patterns and early predictors of mortality.
Manuscript submitted for publication.
Price, R. K., Robins, L. N., Helzer, J. E., & Cottler, L. B. (1998).
Examining the internal validity of DSM-III and -III-R symptom criteria
for substance dependence and abuse. In T. Widiger & A. Frances (Eds.),
DSM-IV sourcebook (Vol. 4, pp. 51-68). Washington, DC: American
Psychiatric Press.
Prochaska, J. O., & DiClemente, C. C. (1983). Stages and processes of self-change of smoking: Toward an integrative model of change. Journal of Consulting and Clinical Psychology, 51, 390-395.
Ralston, M. L., & Jennrich, R. I. (1978). Dud, a derivative-free algorithm
for nonlinear least squares. Technometrics, 20, 7-14.
Reid, J. C., Nair, S. S., Mistry, S. L., & Beitman, B. D. (1996). Effectiveness of stages of change and adinazolam SR in panic disorder: A neural network analysis. Journal of Anxiety Disorders, 10, 331-345.
Ripley, B. D. (1994). Comment to "Neural networks: A review from a
statistical perspective." Statistical Science, 9, 45-48.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge,
England: Cambridge University Press.
Robins, L. N. (1974). The Vietnam drug user returns (Special Action
Office Monograph, Series A, No. 2). Washington, DC: U.S. Government
Printing Office.
Robins, L. N., & Helzer, J. E. (1975). Drug use among Vietnam veterans—
three years later. Medical World News—Psychiatry, 16, 44-49.
Robins, L. N., Helzer, J. E., Croughan, J., & Ratcliff, K. S. (1981). The
NIMH Diagnostic Interview Schedule: Its history, characteristics, and
validity. Archives of General Psychiatry, 38, 381-389.
Robins, L. N., Helzer, J. E., & Davis, D. H. (1975). Narcotic use in
Southeast Asia and afterward: An interview study of 898 Vietnam
returnees. Archives of General Psychiatry, 32, 955-961.
Robins, L. N., Helzer, J. E., Ratcliff, K. S., & Seyfried, W. (1982). Validity
of the Diagnostic Interview Schedule, Version II: DSM-III diagnoses.
Psychological Medicine, 12, 855-870.
Robins, L. N., & Price, R. K. (1991). Adult disorders predicted by child-
hood conduct problems: Results from the Epidemiologic Catchment
Area Program. Psychiatry, 54, 116-132.
Robins, L. N., & Regier, D. A. (Eds.). (1991). Psychiatric disorders in
America. New York: Free Press.
Robins, L. N., Wing, J., Wittchen, H. U., Helzer, J. E., Babor, T. F., Burke,
J., Farmer, A., Jablenski, A., Pickens, R., Regier, D. A., Sartorius, N., &
Towle, L. H. (1988). The composite international diagnostic interview.
Archives of General Psychiatry, 45, 1069-1077.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning
representations by back-propagating errors. Nature, 323, 533-536.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed
processing: Explorations in the microstructure of cognition: Vol. 1.
Foundations. Cambridge, MA: MIT Press.
Sarle, W. S. (1994, April). Neural networks and statistical models. In Proceedings of the 19th Annual SAS Users Group International Conference, Cary, NC.
Sarle, W. S. (1996, June). Comp.ai.neural-nets FAQ [7 parts]. Usenet
newsgroup comp.ai.neural-nets. Available FTP: ftp://ftp.sas.com/pub/
neural/FAQ.html.
So, S-S., & Karplus, M. (1996). Evolutionary optimization in quantitative structure-activity relationship: An application of genetic neural networks. Journal of Medicinal Chemistry, 39, 1521-1530.
Somoza, E., & Somoza, J. R. (1993). A neural-network approach to
predicting admission decisions in a psychiatric emergency room. Med-
ical Decision Making, 13, 273-280.
Spackman, K. A. (1992). Combining logistic regression and neural net-
works to create predictive models. In Proceedings of the Annual Sym-
posium on Computer Applications in Medical Care, 456-459.
Swets, J. A., & Pickett, R. M. (1982). Evaluation of diagnostic systems:
Methods from signal detection theory. New York: Academic Press.
U.S. General Accounting Office. (1998). Drug abuse—Research shows treatment is effective, but benefits may be overstated (Report to congressional requesters; GAO/HEHS-98-72). Washington, DC: U.S. Government Printing Office.
Warner, B., & Misra, M. (1996). Understanding neural networks as statis-
tical tools. American Statistician, 50, 284-293.
Weinstein, M. C., & Fineberg, H. V. (1980). Clinical decision analysis.
Philadelphia: W.B. Saunders.
White, H. (1989). Learning in artificial neural networks: A statistical
perspective. Neural Computation, 1, 425-464.
White, H. (1992). Artificial neural networks: Approximation and learning
theory. Oxford, England: Blackwell.
Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems 4 (pp. 325-332). San Mateo, CA: Morgan Kaufmann.
Wu, Y. C., & Gustafson, D. H. (1994). Removing the assumption of
conditional independence from Bayesian decision models by using ar-
tificial neural networks: Some practical techniques and a case study.
Artificial Intelligence in Medicine, 6, 437-454.
Zou, Y., Shen, Y., Shu, L., Wang, Y., Feng, P., Xu, K., Qu, Y., Song, Y.,
Zhong, Y., Wang, M., & Liu, W. (1996). Artificial neural networks to
assist psychiatric diagnosis. British Journal of Psychiatry, 169, 64-67.
Zurada, J. M. (Ed.). (1992). Introduction to artificial neural systems. St.
Paul, MN: West Publishing.
Received June 26, 1998
Revision received August 19, 1999
Accepted August 26, 1999