
Psychological Assessment
2000, Vol. 12, No. 1, 40-51

Copyright 2000 by the American Psychological Association, Inc.
1040-3590/00/$5.00 DOI: 10.1037/1040-3590.12.1.40

Applying Artificial Neural Network Models to Clinical Decision Making

Rumi Kato Price
Washington University School of Medicine

Edward L. Spitznagel
Washington University

Thomas J. Downey and Donald J. Meyer
Partek, Inc.

Nathan K. Risk
Washington University School of Medicine

Omar G. El-Ghazzawy
Washington University

Because psychological assessment typically lacks biological gold standards, it traditionally has relied on clinicians' expert knowledge. A more empirically based approach frequently has applied linear models to data to derive meaningful constructs and appropriate measures. Statistical inferences are then used to assess the generality of the findings. This article introduces artificial neural networks (ANNs), flexible nonlinear modeling techniques that test a model's generality by applying its estimates against "future" data. ANNs have potential for overcoming some shortcomings of linear models. The basics of ANNs and their applications to psychological assessment are reviewed. Two examples of clinical decision making are described in which an ANN is compared with linear models, and the complexity of the network performance is examined. Issues salient to psychological assessment are addressed.

In the area of psychological assessment, a computer-based algorithm approach has relied heavily on the expert knowledge of clinicians. In a more empirically based approach—used when insufficient knowledge is available, existing knowledge is debated, or a special population is investigated—linear models such as factor analysis or regression models are frequently applied to empirical data to derive meaningful constructs and appropriate measures. In this approach, statistical inferences are needed to assess the generality of constructs and rules because the gold standard of expert knowledge may not be applied.

Making a statistical inference generally requires estimating the standard error of the parameter; hence linear models have most frequently been used. Linear models, however, are not appropriate for all instances. Loss of information is inevitable when extreme values of a measure must be truncated or rescaled to fit data to a linear scale. In multivariate linear models, emphasis on statistical significance often leads to situations in which a substantively meaningful predictive measure with a large effect and a large standard error may be dropped in favor of another measure with a small effect and an even smaller standard error (Achen, 1982). Such practice limits the ability of these models to construct a potentially more predictive model.

Rumi Kato Price and Nathan K. Risk, Department of Psychiatry, Washington University School of Medicine; Edward L. Spitznagel, Department of Mathematics, Washington University; Thomas J. Downey and Donald J. Meyer, Partek, Inc., St. Peters, Missouri; Omar G. El-Ghazzawy, Department of Chemistry, Washington University.

Preparation of this article was supported in part by the Independent Scientist Award (K02DA00221) and Research Grants R01DA07939, R01DA09281, and R01DA10021 from the National Institute on Drug Abuse of the National Institutes of Health. A portion of this article was first presented at the Annual Scientific Meeting of the College on Problems of Drug Dependence, San Juan, Puerto Rico, June 1996. We thank Paul Monro for his critical and constructive comments on earlier versions of this article.

Correspondence concerning this article should be addressed to Rumi K. Price, Department of Psychiatry, Washington University School of Medicine, 40 N. Kingshighway, Suite 2, St. Louis, Missouri 63108. Electronic mail may be sent to [email protected].


A more drastic approach than that exemplified by the current linear models in psychological research can be taken by relying primarily on empirical knowledge and minimizing the importance of generalization by statistical inference, if the assumption is accepted that gold standards in psychological assessment exist only at a data-specific level. Such an assumption may be reasonable for psychological assessment given the lack of biological gold standards. Once such an approach is taken, validity assessment depends more on the ability of the constructed gold standards to predict "future" data. The importance of making statistical inferences is reduced if future data are indeed available to validate the empirically obtained constructs and rules. The importance of capturing nonlinear association becomes an even more pertinent issue because increased precision for prediction is needed with validity testing against real data.

In this article, we introduce artificial neural networks (ANNs) suitable for a variety of clinical decision problems such as assessment of psychological state, diagnosis of psychiatric disorders, and prediction of behavior outcomes such as suicide attempts, hospitalization, and death. ANNs have become important in a variety of areas in psychology and medicine, but their application to diagnostic decision making in psychology and psychiatry has been minimal.

The basics of multilayer perceptron (MLP) modeling, the most frequently used type of ANN, are described after a brief introduction to ANNs. We provide two examples of how an ANN can be used in judgment tasks. The first example uses item questions to predict a psychiatric diagnosis; the second example uses behavioral measures collected in early adulthood to predict long-term mortality. The details of ANN methodology in these contexts include ANN parameter specification, software, cross-validation, and the receiver operating characteristic (ROC) evaluation. The performance of the ANN is compared with the performance of the linear statistical analyses frequently used in the area of psychological assessment: linear and quadratic discriminant analyses and logistic regression analyses. We examine the performance of ANN models with varying numbers of hidden neurons to show when complex network architecture is needed to achieve a better level of prediction than that achieved in linear models. We then discuss issues for research practitioners to consider in order to make optimal use of ANNs. Future directions of ANN applications and current salient issues in psychological assessment are discussed.

Artificial Neural Networks for Psychological Assessment

Background

Since the first article on neural network modeling (McCulloch & Pitts, 1943) appeared over half a century ago, continuing fascination with this technique appears to be related to its historical linkage to artificial intelligence, in that ANNs mimic neural activities of the human brain (Cheng & Titterington, 1994). Neural network modeling has been used for various tasks, including speech recognition, stock market prediction, mine detection, cancerous cell identification, and handwriting recognition. Current applications of ANNs usually fall into three areas: (a) in modeling biological nervous systems, (b) as real-time adaptive signal processors or controllers implemented in hardware, and (c) as data-analytic methods (Sarle, 1994). The applications described in this article fall into the third area.

With increased communication between the fields of neural network modeling and statistics, areas of overlap have become better defined. Today, the ANN can be characterized as a highly flexible nonlinear regression technique. Limitations of linear models have been shown in the "exclusive-or (XOR)" problem (Rumelhart & McClelland, 1986). The simplest case consists of two vectors: An input vector consisting of (0, 0) and (1, 1) belongs to Class 1; and another input vector of (0, 1) and (1, 0) belongs to Class 2. There is no single linear decision boundary that can correctly classify all four points. This problem can be generalized to higher dimensions. Thus, an ANN's ability to transform nonlinear input covariances to linearly separable ones via activation functions supposedly gives the ANN an edge over the linear modeling technique in some situations.
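To make the XOR limitation concrete, the following sketch (ours, not the article's; Python with NumPy, names assumed) trains a minimal single-hidden-layer network on the four XOR patterns. A model restricted to one linear boundary cannot separate the two classes, whereas the hidden layer's nonlinear transformation can:

# Minimal illustration (assumed setup): a 2-4-1 MLP with sigmoid
# activations learns XOR, which no single linear boundary solves.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])              # XOR targets

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)    # input -> hidden
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)    # hidden -> output
eta = 0.5                                           # learning rate
for _ in range(20000):
    h = sigmoid(X @ W1 + b1)                        # hidden-layer signals
    y = sigmoid(h @ W2 + b2)                        # network output
    g_out = (y - t) * y * (1 - y)                   # output-layer gradient
    g_hid = (g_out @ W2.T) * h * (1 - h)            # back-propagated gradient
    W2 -= eta * h.T @ g_out;  b2 -= eta * g_out.sum(0)
    W1 -= eta * X.T @ g_hid;  b1 -= eta * g_hid.sum(0)
print(y.ravel().round(2))   # typically approaches [0, 1, 1, 0]

Removing the hidden layer reduces this model to a logistic regression, which cannot do better than chance on these four patterns.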

In practice, however, reports of the ability of ANNs to surpass linear models for classification and prediction problems have been inconsistent (Anderer, Saletu, Kloppel, Semlitsch, & Werner, 1994; Baxt, 1995; Doig, Inman, Sibbald, Martin, & Robertson, 1993; Duh, Walker, Pagano, & Kronlund, 1998; Dybowski, Weller, Chang, & Gant, 1996). Attempts to examine weight matrices, analogous to regression coefficients in linear models such as logistic regression, are a logical step in identifying reasons for the inconsistent performance of ANNs relative to linear models (Duh, Walker, & Ayanian, 1998). Nonetheless, examination of individual weights becomes infeasible for a complex network model because the number and dimension of weight matrices increase with the complexity of models.

ANN performance associated with variations in complexity is rarely examined in the context of comparison with linear models. Our results from two experiments relate to the differences in ANN performance with varying numbers of neurons, types of activations, and input variables. We suggest that better prediction of ANN performance depends in part on the complexity of models relative to the complexity of predictive input variables.

Multilayer Perceptron Model of Artificial Neural Networks

Of several types of ANN models amenable to data-analytic applications, MLP modeling has been used most often in clinical medicine and the psychological literature.1 An MLP model consists of multiple layers of interconnected neurons. Figure 1 illustrates the iterative process of MLP training for two-layer (or single hidden-layer) models, used in two application examples described later. In this instance, the two layers include hidden and output layers, because input variables are not usually counted as a layer. A hidden layer is used for internal transformation of the ANN: Hidden neurons transform the "signals" from the input layer and generate signals to be sent to the next layer, so that the MLP model is sometimes called the "feedforward" neural network.

The input layer consists of input variables plus an input bias, which adjusts the weighted mean of the input variables. All input variables are regressed on the next layer, in the same fashion as estimation of multiple regression. Thus, if the model includes only a single output without a hidden layer, the ANN becomes a logistic regression with a logistic activation function; it becomes a linear discriminant function with a threshold activation function at output.2 However, the activation functions of the hidden neurons between the input variables and the output layer make the ANN a highly flexible model. Any function can be approximated with increasing numbers of hidden neurons using the sigmoid (logistic) activation function (Bishop, 1995).

Once the predicted values are forwarded from the hidden layer to the output layer, the error amount is computed on the basis of the target (actual) value of outputs. Weights are recalculated after each observation until all observations are processed.

For a more general multiple-hidden-neuron, multiple-hidden-layer model, the output for pattern (observation) p at the jth neuron on layer l can be expressed formally as

x_pj^l = f(v_pj^l),  (1)

1 MLP models are considered to be parametric, quasi-parametric, or nonparametric (infinitely parametric, to be more precise) depending on the complexity of the model. As a reference point, the generalized additive model (Hastie & Tibshirani, 1990) can be placed somewhere between parametric linear regression and fully nonparametric nonlinear regression such as MLP, because the generalized additive model uses a nonlinear transformation estimated by a nonparametric smoother (Cheng & Titterington, 1994; Sarle, 1994; White, 1992).

2 Because both the ANN and logistic regression use an iterative maximum-likelihood method to estimate parameters, the result of an ANN with a single output and the result of multiple logistic regression should be identical. On the other hand, although an ANN with a single threshold activation function without a hidden layer uses the same model as a linear discriminant analysis, the two methods would not yield exactly the same results because the parameters in a linear discriminant analysis are computed by ordinary least squares and an ANN uses a maximum-likelihood method.


[Figure 1: input variables and input bias → weighted sums (w_j) → hidden neurons (activation functions) → weighted sums (w_k) → predicted output; weights are recalculated on the basis of predicted and actual outputs.]

Figure 1. Flow diagram of the learning process in a multilayer perceptron (MLP) model. The notations in the diagram correspond to those in the text. Squares correspond to input and output variables; ovals correspond to MLP functions and parameters. For simplicity, the diagram is drawn for a two-layer (single hidden-layer) model; that is, l = 1 for the hidden layer and l = 2 for the output layer. The formulas in the text are expressed for more general multiple-hidden-layer models in which l can be higher than 2. For clarity, the input vector x is indexed by i, and the output vector y is indexed by k in the figure; however, in a multiple-hidden-layer model, k is subsumed in j and y is subsumed in x.

where the previous layer, l − 1, is input to a node in layer l through

v_pj^l = Σ_i w_ji^l x_pi^(l−1) + B^l,  (2)

where w_ji^l is the weight (coefficient) from node i on layer l − 1 to node j on layer l, and x_pi^(l−1) is node i on the previous layer, l − 1, which can be an input variable or an output from the previous layer. B^l, the output bias, can be added to adjust the decision threshold of the predicted outputs to evaluate MLP performance free of thresholds; B_in, the input bias, can be used to adjust the weighted mean of the input variables. These biases are not usually applied to hidden layers.

In our applications, two activation functions are used for f(v_pj^l). The sigmoid (logistic) activation function is most commonly used in MLP modeling. This activation function takes the form

f(v_pj^l) = 1 / (1 + e^(−v_pj^l)).  (3)

The generalized Gaussian activation function is expressed as

f(v_pj^l) = e^(−(v_pj^l)^2),  (4)

where the right-hand side is a generalized activation function of the Gaussian distribution. Additional parameters can be added to change the shapes of these functions.

The two functions are plotted along the axis of v_pj^l in Figure 2. Although the sigmoid activation function is more commonly used, the Gaussian activation function is more suitable for examining a finer range of the distribution of an input vector, as the shape of this function suggests. In contrast, a sigmoid neuron "turns on" and stays "on" once passing a threshold value. The maximum value of the activation function can be adjusted to scale the distribution range of the output vector. For classification and prediction problems involving dichotomous outcomes, the maximum value is set to 1.
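As a minimal sketch of Equations 1 through 4 (our illustration; function and variable names are assumed, not the article's code):

import numpy as np

def sigmoid(v):
    # Equation 3: f(v) = 1 / (1 + e^(-v))
    return 1.0 / (1.0 + np.exp(-v))

def gaussian(v):
    # Equation 4 (base form; additional shape parameters omitted)
    return np.exp(-v ** 2)

def layer_output(x_prev, W, bias=0.0, f=sigmoid):
    # Equation 2: v_j = sum_i w_ji * x_i + bias, for every node j in the layer
    v = W @ x_prev + bias
    # Equation 1: x_j = f(v_j)
    return f(v)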

Our MLP training used the "back-propagation" algorithm in the "on-line" mode, in which the weights are updated after processing each pattern (Rumelhart, Hinton, & Williams, 1986). In this algorithm, small initial weights are randomly assigned to the input neurons and are fed forward through the network. The error of the estimate with respect to the target at the jth output for the pth pattern is

e_pj = t_pj − y_pj,  (5)

where t_pj is the target and y_pj the predicted output. The total error at the output layer is

E = Σ_p W_p Σ_j (e_pj)^2.  (6)

It sums errors over all outputs and over all patterns. W_p is the optional pattern weight to adjust the error function to give weighting to individual patterns. Pattern weighting was introduced because the minimization function can be problematic for maximizing prediction when using population survey data in which a prevalence estimate is very low. The pattern weight is typically set as the inverse of the prevalence, and W_p = 1 when pattern weighting is not added. In practice, we found pattern weights to be more helpful in increasing the efficiency of estimation than in improving prediction.

The information on the total error is used to adjust the weights in the output layer, which are propagated back to the hidden layer to adjust weights on the hidden layer. These changes are accomplished with the use of the generalized delta rule, which employs a gradient descent on parameter space. Each epoch (iteration) of back-propagation continues until the error is minimized globally.

Figure 2. Sigmoid and Gaussian activation functions commonly used in MLP modeling. The input bias vector in Figure 1 is omitted from the input to the neuron for simplicity; the maximum value is 1 by default.

Specification of a few other parameters is usually necessary for the back-propagation algorithm. The learning rate is a step size taken in the downward direction of the gradient of the error function, E, with respect to a parameter. The momentum smoothes out the changes in parameters such as weights (coefficients) to avoid short-term error oscillations during learning and to speed up convergence.3 These parameters are concerned with convergence speed or assurance of the global minima. The other basics of MLP modeling are found in neural network textbooks (Anderson, 1995; Bishop, 1995; Ripley, 1996; Rumelhart & McClelland, 1986).
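The weight update implied by the generalized delta rule, with the learning rate and momentum just described, can be sketched as follows (names assumed; the gradient comes from back-propagating the pattern-weighted error E, and the defaults mirror the conservative values used in our experiments):

def delta_rule_step(w, grad_E, prev_delta, eta=0.001, alpha=0.01):
    # eta: learning rate (step size down the error gradient)
    # alpha: momentum (smooths short-term oscillations, speeds convergence)
    delta = -eta * grad_E + alpha * prev_delta
    return w + delta, delta   # updated weight and the delta to reuse next step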

Applications to Psychiatric and Psychological Research

In early 1995, we conducted a MEDLINE literature search on neural networks. We identified, cross-sectionally, over 1,200 entries for current articles in engineering, computer science, medicine, psychology, and neuroscience and methodological articles relating to statistics and new techniques. This search produced about 30 entries on topics in medicine and psychology.

Ten of the articles were on psychiatric or psychological topics. They can be divided into two groups: (a) conceptual or simulation models that use neural network models as a frame of reference to describe the complexity of psychological phenomena and (b) applications of ANN models to improve the outcome of interest. The topics in the first group included a neural network model of cortical processing to provide a framework for understanding pathological processes in schizophrenia (Chen, 1994); neural network simulation of psychiatric symptom recognition to illustrate the deficiency of traditional algorithmic models of psychiatric diagnoses (Berrios & Chen, 1993); neural network simulation illustrating the structure of the relationships among situation, behavior, and childhood experience in order to predict personality traits (Gonzalez-Heydrich, 1993); a mathematical approach to hypnosis that connects general ideas about the organization of the nervous system and neural network models (Kuzin, 1995); and a review of the relation between dream construction and memory using the neural network as a model architecture (Palombo, 1992). In this literature, a neural network model is seen as a machine analogue of the biological neural network system.

In the literature in the second group, neural network modeling was used as an analytical tool. This group included use of ANN modeling as a diagnostic test to predict the admission decision among psychiatric emergency patients (Somoza & Somoza, 1993); a comparison of treatment decisions made by ANNs and clinicians for depressed and psychotic patients (Modai, Stoler, Inbar-Saban, & Saban, 1993); use of an ANN to predict length of hospital stay given psychiatric diagnoses, demographics, illness severity, and other predictors (Davis, Walter, & Davis, 1993; Lowell & Davis, 1994); and an application of Bayesian decision models to an ANN in order to predict teenagers' substance use (Wu & Gustafson, 1994).

About half of the articles in medicine and psychology were published by investigators in institutions outside the United States. Most noticeably, fewer than 40% of the investigators attempted to compare or incorporate neural network models with standard statistical methods, which supports the view that there is a need for increased communication between statisticians and neural network researchers (Ripley, 1994). Comparison with, or incorporation of, multiple logistic regression occurred in three studies (Doig et al., 1993; Lette et al., 1994; Spackman, 1992). Comparison with other methods, such as the Z statistic and discriminant analysis, was proposed (Anderer et al., 1994). Simple ROC methods were applied to compare the performance of an ANN and other methods in two studies (Liestøl, Andersen, & Andersen, 1994; Spackman, 1992).

A more recent review of literature suggests that the use of ANNs in psychiatric and psychological research has been increasing in the last few years. An accelerating diffusion into diverse disciplines is indicated by the fact that articles on the use of ANNs are appearing in highly regarded nontechnical journals. For example, an introduction to ANNs (Baxt, 1995) and an ANN application that used a large database collected on intensive-care-unit patients (Dybowski et al., 1996) appeared in The Lancet. Studies that used ANNs have also appeared in widely circulated journals in epidemiology (Ioannidis, McQueen, Goedert, & Kaslow, 1998), genetics (Lucek & Ott, 1997), psychiatry (Zou et al., 1996), and statistics (Warner & Misra, 1996). In these articles, the results were often compared with results obtained through commonly used statistical methods (Dybowski et al., 1996; Ioannidis et al., 1998).

3 Another parameter, the active range, is Partek's invention (Partek, Inc., 1998) and not used in most MLP applications. It is a weight rescaling method used during the network initialization. Neurons can get "caught" at extreme values near 0 and 1, known as the saturation region, where learning becomes very slow (Fahlman, 1989). Active range scaling rescales the weights so that, upon initialization, the neurons all lie in their "active" as opposed to saturated regions (usually 90% to 95% of the range). In our application, the default .95 was used.


ANNs have been applied to studies involving standardized psychological assessment instruments. For example, an ANN was used to examine how varying the values of each of the four Stage of Change (SOC) scales (Prochaska & DiClemente, 1983) affected anxiety outcomes, thus allowing prediction of the amount of changes in anxiety outcomes given a certain level of change in an SOC scale (Reid, Nair, Mistry, & Beitman, 1996). Measures from the Diagnostic Interview for Children and Adolescents (Herjanic & Reich, 1982) were input to an ANN to predict adolescent hopelessness (Kashani, Nair, Rao, Nair, & Reid, 1996). The Composite International Diagnostic Interview (Robins et al., 1988) was analyzed with an ANN to compare ANN-derived diagnoses with those derived from computer algorithms based on the third edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-III; American Psychiatric Association, 1980); clinician-based diagnoses were used as the gold standard (Zou et al., 1996). Overall, however, we found that ANNs are still rarely used in psychological assessment research.

Empirical Examples

In this section, using two large-scale psychiatric surveys, we describe applications of ANNs to two problems in psychological assessment: (a) whether an ANN is able to "mimic" a computer diagnostic algorithm that translates the logic of the clinician diagnostic process and (b) whether an ANN is able to predict a behavior outcome with various self-report measures expected to contain a level of measurement error. To assess the utility of ANNs, we also examined (c) whether ANNs outperform common statistical methods and (d) whether the patterns of performance improvement differ depending on architecture, initialization, and complexity of input variables.

Method

Data Sets

Experiment 1: St. Louis Epidemiologic Catchment Area (ECA) Project. The ECA project is one of the largest epidemiological psychiatric studies ever conducted in the United States (Robins & Regier, 1991). Survey respondents were persons 18 years of age or older residing in five catchment areas of U.S. communities. For each catchment area, both community and institutional samples were obtained with complex multistage sampling methods, so that the sample represented the demographic distribution of the U.S. population according to the 1980 census with proper sampling weights (Eaton & Kessler, 1985). For this article, the household sample of the data set from St. Louis was used (N = 2,809 after elimination of missing values).

The ECA data set was chosen for several reasons: the advantage of using general-population surveys for drawing inferences that can be generalized, the importance of having a large sample when prevalence for a disorder is low, and our familiarity with the data set (Price, Robins, Helzer, & Cottler, 1998). The diagnostic variables were computed on the basis of the DSM-III (American Psychiatric Association, 1980), because the surveys were conducted between 1981 and 1985. Psychiatric nosology has subsequently changed with the publication of the revised DSM-III (DSM-III-R; American Psychiatric Association, 1987) and the fourth edition (DSM-IV; American Psychiatric Association, 1994). It should be noted that the appropriateness of a specific nosological system is a peripheral issue in this article because the ANN was tested to mimic the clinical decision process given a specific nosological system, in this case, the DSM-III.

Experiment 2: Vietnam Era Study—Phase III. For the second set of experiments, we used data from the Washington University Vietnam Era Study—Phase III (VES). The previous surveys were conducted in 1972 and 1974. About half of the veterans in the target sample were drawn randomly from the list of Army enlisted (E1-E9) returnees whose urine tested positive for opiates, amphetamines, or barbiturates at the time of departure from Vietnam in September 1971. The drug-positive veteran population comprised an estimated 10.5% of the total 13,760 Army enlisted returnees who departed at that time. The other half, the general sample, comprised veterans who were randomly drawn from the same total population and included 4.1% of the sample already drawn from the drug-positive list (Robins, 1974). A civilian control sample was ascertained from Selective Service registrations and individually matched to the general sample members interviewed in 1974 for draft eligibility, draft board location, age, and education (Robins & Helzer, 1975). The cohort, although not a general-population sample, was ascertained from a large nonclinical population, and veterans arguably can be considered a general-population sample weighted to slightly lower socioeconomic status (O'Donnell, Voss, Clayton, Slatin, & Room, 1976).

As part of the feasibility phase of the 25-year follow-up study in 1996, mortality information was collected. Further details of the follow-up study are described elsewhere (Price, Risk, & Spitznagel, 1999; Price, Risk, Stwalley, et al., 1999). The sample for Experiment 2 was composed of veterans and civilians who were interviewed in 1974 (N = 855).

Measures

Experiment 1. Using data from the St. Louis ECA study, we assessed the judgment ability of ANNs in the absence of information on DSM-III rules for the diagnosis of antisocial personality disorder (ASPD). The input variables were measures of 12 childhood and 9 adult DSM-III symptom criteria for ASPD obtained from the Diagnostic Interview Schedule, a well-validated structured instrument designed to be administered by nonclinicians (Helzer et al., 1985; Robins, Helzer, Croughan, & Ratcliff, 1981). The childhood symptoms included truancy, school expulsion, arrest, running away from home, lying, sex before age 15, getting drunk or engaging in illicit drug use before age 15, stealing, vandalism, poor grades, serious trouble at school, and starting fights. The adult symptoms included job troubles; negligence toward children; a nontraffic arrest for prostitution, pimping, or drug sale; marital or intimate relationship problems; violence; trouble with debts; vagrancy; lying; and traffic offenses. These measures have been shown to correlate highly with each other (e.g., Loeber, 1988).

The output variable was the ASPD diagnosis. Validity studies reported kappas of .63 for test-retest agreement (Robins, Helzer, Ratcliff, & Seyfried, 1982) and .63 for agreement between a psychiatrist and a trained nonclinician interviewer (Robins et al., 1981). The exclusion criteria of ASPD—that is, mental retardation, schizophrenia, or mania—were not applied to the operational diagnostic measure of ASPD because the information for the exclusion criteria is external to the information contained in the childhood and adult symptoms.

Experiment 2. For the VES, the output variable was mortality. That is, the judgment task was to predict which veterans and civilians would be dead at the time of the follow-up study more than two decades after their return from Vietnam. It was worthwhile to test the ability of an ANN with a clearly defined behavioral measure that avoids fuzzy classifications typical in psychological assessment. We do not claim that our measure of mortality is perfect; nonetheless, we felt it was relatively free from classification errors.

By the end of December 1993, 97 of the 1,227 (7.9%) participants in the original study had been identified as deceased (Price, Eisen, Virgo, Murray, & Robins, 1995). Three sources of death information were used, which were later augmented by information from family members in order to maximize accuracy for case identification (Price et al., 1999). We input two sets of predictors to the ANN to see if differences in measures related to the ANN's performance relative to that of logistic models.

The first set of input variables consisted of 23 significant predictors chosen separately by logistic regression for each of three periods—pre-Vietnam, in-Vietnam, and post-Vietnam up to 1974—plus lifetime questions and demographics. These variables were selected from over 120 variables covering 11 domains (demographics, socioeconomic status, military experience, family history, psychiatric problems, antisocial behavior, crime, social networks, drug use, alcohol use, and a residual category) that were found to be associated with mortality by bivariate analyses. The input variables were as follows: (a) pre-Vietnam—heroin use, regular narcotics use, narcotics injection, types of drugs used, marijuana use level, too much drinking, and childhood antisocial behavior; (b) in-Vietnam—narcotics withdrawal, stimulant use, heavy drug use, and arrest; and (c) post-Vietnam to 1974—wanting to take narcotics, feeling addicted to narcotics, knowing where to buy heroin, taking drugs to relieve effects of other drugs, taking drugs to help depression, depression while drinking, arrest, and number of arrests. The lifetime behavioral predictors included heroin use level and use of opiates five or more times. Two demographic variables, African American race and veteran status, were also added. These variables were extensively analyzed previously (Robins, Helzer, & Davis, 1975), and findings from the earlier surveys continue to be circulated (U.S. General Accounting Office, 1998).

The second set of input variables included 8 variables that were significant at the .05 probability level when the previous 23 variables were input to logistic regression with backward elimination. They included pre-Vietnam heroin use, narcotics injection, and too much drinking; in-Vietnam narcotics withdrawal; and post-Vietnam wanting to take narcotics, knowing where to buy heroin, and using drugs to help depression. African American race was also included. We hypothesized that an ANN would improve prediction when a number of highly collinear variables were input, whereas the level of prediction with a logistic model would stay about the same because a linear regression technique does not gain much additional information from highly collinear variables.

Scaling. The input variables in Experiment 1 with the ECA data set were all scaled to be dichotomous. Five of the input variables in Experiment 2 with the VES data set were not dichotomous: types of drugs used (0-4), marijuana use level (0-3), childhood antisocial behavior (0-10), number of arrests (0-4), and heroin use (0-3). These variables were scaled to vary between 0 and 1, a standard practice used to increase efficiency of estimation in MLP modeling. The same scaling was applied to the input variables of logistic regression for consistency.
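The 0-1 rescaling described here is ordinary min-max scaling; a sketch (ours, with assumed names):

import numpy as np

def minmax(x, lo, hi):
    # Rescale an ordinal measure with known range [lo, hi] to [0, 1].
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

arrests01 = minmax([0, 1, 2, 4], lo=0, hi=4)  # number of arrests (0-4) -> [0, .25, .5, 1]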

Analyses

Experiment 1. Two-output models were used because they provide a simple solution to the threshold decision in a classification problem since the difference in predictive values is used for assigning the class (Bishop, 1995). Both sigmoid and Gaussian activation functions were tried. The range of a hidden neuron varied between 0 and 1, which is a default for a classification problem. The number of hidden neurons was increased by an increment of one from 1 to 5, then to 7, to 10, and thereafter by an increment of five to a maximum of 30. After an initial series of experiments, the learning rate was set to .001 and the momentum to .01. Both are considered to be conservative, which is to say that convergence is achieved slowly. The maximum iteration was set to 600 after initial analyses, which used as many as 1,200 iterations. Estimation was accomplished both with and without pattern weighting.
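In outline, the settings just described amount to a sweep such as the following (a hypothetical reconstruction; the original analyses were run in Partek, and these names are ours):

# Hidden-neuron counts and learning parameters described in the text.
hidden_sizes = [1, 2, 3, 4, 5, 7, 10, 15, 20, 25, 30]
settings = {
    "learning_rate": 0.001,     # conservative: slow but steady convergence
    "momentum": 0.01,
    "max_iterations": 600,      # stop point after initial runs of up to 1,200
    "activation": ("sigmoid", "gaussian"),
    "pattern_weighting": (False, True),
}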

The results of MLP neural network models were compared to the results of linear and quadratic discriminant analyses. We carried out analysis of suboptimal network performance both for sigmoid and Gaussian neurons with two different initializations in order to examine whether complex MLP architecture is required to improve prediction.

Experiment 2. Two-output models were also used for this experiment. The same sets of activation functions and learning parameters were tried. The maximum number of iterations was set to 600 after initial trials. Pattern weights were used throughout for this experiment because the point prevalence was low at .079. Logistic regression analyses were used to compare the results because beta estimates illustrate a limitation of linear models when predictive collinear variables are used.4 We also examined suboptimal network performance to assess whether the improvement in prediction using the full set of predictors was related to the complexity of the MLP architecture.

Software. Numerous software packages for ANNs are available. We identified 32 free and 25 commercial packages at the time these experiments were planned (Sarle, 1996). After a trial period, we chose UNIX-based Partek, written by Thomas J. Downey and Donald J. Meyer (Partek, Inc., 1998). We also applied an ANN to the ECA data set using source code modified from the "Neural Network Classes for the NeXT Computer," which is available from the University of Arizona. The results were judged to be similar, although this source code used a one-output model and Partek used two-output models.

Cross-validation. Because real future data are lacking in practice, cross-validation is an essential part of the ANN methodology, because it allows assessment of the generalizability of ANN results obtained from the training phase. Tenfold validation was used: 90% of the data were used to train the ANN, and the remaining 10% of the data were used to compute the numbers of positive and negative errors. Results of each validation were obtained from the predicted outcome for each observation based on estimates obtained from the training phase. The process was repeated 10 times, and the results were summed over 10 validation results to compute the sensitivity and specificity scores. We cross-validated corresponding linear models using the same procedure.

We selected tenfold validation, a typical choice in ANN modeling (Bishop, 1995). Tenfold validation was suitable for our problems because the outcome prevalence rates of both data sets were low (11.6% for ASPD without exclusion criteria and 7.9% for mortality). Therefore, we wanted to use as much information as possible for the positive cases for training without increasing the processing time too much.5
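The fold logic can be sketched as follows (our illustration using scikit-learn's StratifiedKFold; the original analyses used Partek, and stratified splitting is our assumption—any model object with fit/predict methods would do):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def tenfold_sensitivity_specificity(model, X, y):
    # Train on 90%, test on the held-out 10%, and sum the fold-level
    # true/false positives and negatives before computing summary scores.
    tp = tn = fp = fn = 0
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        actual = y[test_idx]
        tp += np.sum((pred == 1) & (actual == 1))
        tn += np.sum((pred == 0) & (actual == 0))
        fp += np.sum((pred == 1) & (actual == 0))
        fn += np.sum((pred == 0) & (actual == 1))
    return tp / (tp + fn), tn / (tn + fp)   # sensitivity, specificity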

Evaluation by ROC analysis. The ANN results were evaluated with a standard ROC method. Although it was originally invented as an engineering technique for radar detection, ROC analysis has been subsequently developed in the medical decision theory and signal detection literatures (Swets & Pickett, 1982; Weinstein & Fineberg, 1980). ROC analysis is a simple but useful tool for evaluating the predictive utility of a test. It combines information on both true positives and false negatives by plotting sensitivity and (1 − specificity); therefore, ROC analysis is free of the problems of cutoffs (Dwyer, 1996). For assessing the overall predictive utility of measures, the area under the ROC curve (AUC) is commonly used, which ranges between .5 and 1.0. When prediction is made from self-reported behaviors, the AUC rarely extends beyond .8 even with established risk factors. For example, the AUC value was in the .75 to .8 range in our past analysis predicting adult ASPD from childhood conduct problems (Robins & Price, 1991). This range is similar to that of the AUC for a blood glucose test before a meal in the prediction of diabetes (Erdreich & Lee, 1981). We used the value of the complement of the area under the curve (CAUC), that is, 1 − AUC, as an overall measure of predictive utility. Differences in prediction are easier to see with the use of CAUC when the prediction range is very good to excellent.

The ROCs for MLP models were obtained with the use of an output bias. The decision threshold was varied to obtain 101 coordinates of the summary sensitivity and (1 − specificity); the ROCs for discriminant analyses were obtained by varying the threshold of the canonical discriminant function (Ripley, 1996); and the ROC for logistic regression was obtained by varying the output threshold probability level.
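A sketch of that threshold sweep (ours; scores are the continuous predicted values for the positive class, as NumPy arrays):

import numpy as np

def cauc(scores, y):
    # Vary the decision threshold over 101 values to trace the ROC curve,
    # then return CAUC = 1 - AUC (trapezoidal rule).
    points = []
    for thr in np.linspace(0.0, 1.0, 101):
        pred = scores >= thr
        points.append((pred[y == 0].mean(),    # 1 - specificity
                       pred[y == 1].mean()))   # sensitivity
    points.sort()
    fpr, sens = np.array(points).T
    return 1.0 - np.trapz(sens, fpr)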

4 Although it might have been preferable to use Cox regression, which estimates the impact of predictors on the proportional hazard rate, we chose logistic regression because initial analyses indicated that estimates obtained from Cox regression were nearly identical to those obtained from the logistic regression analysis, which is often the case with a low-prevalence dependent variable. Therefore, we used logistic regression for ease of interpretation.

5 Because of time constraints, we did not use the leave-one-out cross-validation, which repeats validation for the same number of times as the sample size by using all observations but one to train the ANN and by testing on the one left out.



Results

Experiment 1

Figure 3 shows the results of MLP modeling in an ANN compared with the results of linear and quadratic discriminant analyses in the assessment of the diagnosis of ASPD. The best-fit MLP model was obtained with 30 Gaussian hidden neurons without pattern weights at the 600-iteration stop point.6 This model outperformed the best estimate of MLP with pattern weights as well as those of the linear and quadratic discriminant analyses. The CAUC was less than .001 for the MLP model, .023 for the linear discriminant analysis, and .083 for the quadratic discriminant analysis. The high level of diagnostic accuracy achieved by all three methods is not surprising because the information consisted of the DSM-III symptom criteria. However, it is noteworthy that the MLP model yielded a better result than did linear ratings of the symptoms.

[Figure 3: ROC curves (sensitivity vs. 1 − specificity) for MLP (CAUC = .001), LDA (CAUC = .023), and QDA (CAUC = .083).]

Figure 3. Predicting antisocial personality disorder (ASPD) from childhood and adult symptoms: comparison of MLP and discriminant analyses (N = 2,804). A superior prediction is measured by the CAUC value (the smaller the value, the better the prediction) and corresponds to the line farthest away from the diagonal line, which indicates a random prediction. The input variables consisted of 12 childhood and 9 adult symptoms corresponding to those listed in the DSM-III (American Psychiatric Association, 1980). The output variable was ASPD meeting the DSM-III diagnostic criteria except for exclusion criteria. The MLP results are based on 30 Gaussian neurons without pattern weights. Tenfold validation was used for the MLP analysis and the LDA and QDA. MLP = multilayer perceptron; LDA = linear discriminant analysis; QDA = quadratic discriminant analysis; CAUC = complement of the area under the ROC; ROC = receiver operating characteristic.

Figure 4 plots the performance of suboptimal networks for both sigmoid and Gaussian hidden neurons. We chose two sets of randomly assigned initial weights for each type of activation function to assess whether the initial weights also have an impact on performance. Different initial weights resulted in some differences in the performance level within the range in which the CAUC was improving rapidly. When the number of hidden neurons was very small or very large, the choice of initial weights made little difference in the CAUC.

With the Gaussian activation, the MLP model's performance improved rapidly with a small number of hidden neurons. With four neurons, the CAUC decreased to less than .01 and thereafter improved very slowly with larger numbers of hidden neurons. On the other hand, network performance with the sigmoid activation function was more affected by the number of hidden neurons used. The CAUC did not improve to the same level as for the Gaussian models until the number of neurons was increased to the 10-15 range.

Experiment 2

In Figure 5, MLP performance results were compared with results of logistic regressions in predicting the 23-year mortality outcome for two sets of input variables. The smallest CAUC value of .04 was achieved with 40 Gaussian hidden neurons and the full 23 variables. Given that the behavioral data were obtained two decades ago, we consider this level of the CAUC as excellent. The parallel logistic regression produced a much worse result (CAUC = .21) with the same 23 predictors. However, when only eight variables were used, the MLP performance was about the same as that of logistic regression, yielding a CAUC value of .19. The gain in the CAUC value by the MLP model with the 23 variables is primarily attributable to increased sensitivity; that is, the MLP model did much better in predicting deaths correctly.

As shown in Figure 6, the network performance results were different depending on whether 23 or 8 input variables were used. When all 23 variables were used, the increase in the number of hidden neurons yielded smaller CAUC values for both sigmoid and Gaussian activation functions. However, with the 8 input variables, neither the sigmoid nor the Gaussian models were able to reduce the CAUC much below a level of .19, and the minimum was achieved with only five sigmoid neurons. Beyond this level of complexity, the CAUC obtained with sigmoid hidden neurons essentially stayed flat, whereas the CAUC obtained with Gaussian neurons increased, which is likely to be a sign of overfitting. These results indicate that complex network architecture did not help when the number of predictive input variables was limited.

The predictors in the logistic regression without cross-validation are shown in Table 1.7 The 8 most significant variables chosen by the logistic regression were all dichotomous when the 23 mixed dichotomous and scaled variables were input for selection with backward elimination. The 23 variables included several clusters of collinear variables involving heroin use, narcotics injection, types of drugs used, marijuana use level, narcotics withdrawal, felt addiction to narcotics, drug use to relieve the effects of other drugs, drug use to help depression, depression while drinking, arrest, number of arrests, heroin use level, and lifetime use of opiates five or more times. Odds ratios transformed from the beta estimates involving these 13 variables, therefore, are biased with respect to the magnitude of the predictive effects. Also, effects of 7 variables yielded odds ratios less than 1, erroneously indicating that they were protective factors. A usual treatment would be to remove highly collinear variables to make the results of odds ratios meaningful. Without such a treatment, odds ratios can result in misleading or outright wrong interpretations. Furthermore, prediction with logistic regression suffers from having these collinear variables, as shown by the higher CAUC with 23 variables than with 8 variables.

6 It would have been possible to achieve a slightly better fit with a higher number of neurons or of iterations. However, we judged further improvement was unnecessary for the purposes of this experiment given the small level of the CAUC achieved by the MLP model.

7 No method exists to our knowledge for adjusting beta estimates obtained from multifold validation. Therefore, the logistic regression in Table 1 used 100% of the data, compared with 90% in a tenfold estimation.


[Figure 4: CAUC as a function of the number of hidden neurons, for sigmoid and Gaussian neurons with two initializations each.]

Figure 4. Multilayer perceptron network performance predicting antisocial personality disorder using sigmoid versus Gaussian neurons with different initializations. The sample corresponds to that used in Figure 3, and the same input variables were used. The number of neurons plotted in this figure varied among 1, 2, 3, 4, 5, 7, 10, 15, and 20 (the complement of the area under the curve [CAUC] almost reached a plateau at 20). Two sets of initial weights were randomly chosen for both sigmoid and Gaussian activation functions.


Discussion

Findings Summary

The results of the two experiments in using ANNs in this article were related to (a) improving judgment and decision making in the area of psychological assessment, (b) comparing ANN performance with the performance of linear statistical methods, and (c) delineating some of the reasons for the superiority of ANNs over linear methods. The ANN outperformed linear and quadratic discriminant analyses in a psychiatric diagnosis assessment when a set of clinical measures was provided a priori (the ECA example). It also made a substantial improvement over logistic regression analyses in predicting cumulative mortality from measures collected two decades earlier when the input variables consisted of a large number of measures of dichotomous and ordinal measures known to be associated with the outcome. When the input variables were limited to a smaller set of measures considered most significant by logistic regression, MLP modeling did only as well as logistic regression (the VES example).

[Figure 5: ROC curves for the MLP full model (CAUC = .04), logistic regression full model (CAUC = .21), MLP reduced model (CAUC = .19), and logistic regression reduced model (CAUC = .19).]

Figure 5. Comparison of 23-year mortality prediction between the MLP model and logistic regression (N = 855). The sample was a subset of the Vietnam Era Study data set, which consisted of veterans and nonveterans interviewed in the 1974 survey. All 23 variables listed in Table 1 were used, and this model was compared with a reduced model with 8 variables selected by backward elimination in a logistic regression. The receiver operating characteristic (ROC) curve for the MLP full model was obtained with 40 Gaussian hidden neurons; the ROC curve for the MLP reduced model was obtained with 5 sigmoid hidden neurons. Both used pattern weights. All ROC curves used tenfold cross-validation. MLP = multilayer perceptron; CAUC = complement of the area under the ROC.


[Figure 6: CAUC as a function of the number of hidden neurons for the full (23-variable) and reduced (8-variable) models, with sigmoid and Gaussian neurons.]

Figure 6. Multilayer perceptron (MLP) network performance for predicting mortality using sigmoid and Gaussian neurons with different sets of predictors. The sample corresponds to that used in Figure 5. All 23 variables were used in the full MLP models; the reduced MLP models used 8 variables selected by backward elimination in a logistic regression. For both sigmoid and Gaussian activation functions, the number of hidden neurons varied among 1, 2, 3, 4, 5, 7, 10, 15, 20, 25, 30, and 40. CAUC = complement of the area under the curve.


Examination of network performance yielded several observations. Initial weights were found not to be critical if the number of hidden neurons was sufficiently large. The results indicate that the types of hidden neurons seem to affect the level of prediction. With the ECA data set, we found differences in a middle range of network complexity: Gaussian models reached the low plateau of the CAUC with fewer neurons than did models with sigmoid neurons. A similar trend was observed with the VES data set; however, when the input variables were limited, Gaussian models risked overfitting even with a smaller number of hidden neurons.

There appear to be no solid rules on the number of hidden neurons other than a commonsense understanding that the number of response patterns is one factor (Zurada, 1992). Nevertheless, because of the Gaussian model's purported sensitivity for finding a threshold in a finer region, better performance was expected with a smaller number of Gaussian hidden neurons. With increased numbers of hidden neurons, however, the distinction between sigmoid and Gaussian models was expected to blur.

Our second experiment suggests that the number and the nature of input variables are particularly important for MLP models to be able to improve prediction over linear models. The lack of improvement in the CAUC for the MLP model using the eight variables may be due in part to the fact that all eight of these variables were dichotomous (Warner & Misra, 1996): Dichotomization discards the information needed to detect a potentially nonlinear association with the underlying distribution. In the logistic analyses, highly correlated input variables hurt more than helped estimation and prediction. However, collinear variables have the potential to provide additional information to improve prediction in MLP models by transforming a linearly nonseparable input covariate space to a potentially linearly separable "image" space in hidden layers (Zurada, 1992), which may be one reason for the much improved prediction by the MLP model with the 23 variables.
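This transformation argument can be demonstrated directly. In the sketch below (ours, on simulated XOR-like data, assuming scikit-learn), logistic regression on the raw, linearly nonseparable inputs performs near chance, whereas a logistic regression fit to the trained hidden-layer activations, the "image" space, typically classifies almost perfectly.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)    # XOR: not linearly separable

# Logistic regression on the raw inputs is typically near chance here.
print("raw inputs:", LogisticRegression().fit(X, y).score(X, y))

net = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                    solver="lbfgs", max_iter=5000, random_state=1).fit(X, y)
# The hidden-layer activations are the "image" of the inputs; a linear
# classifier on this image is typically near perfect.
H = 1.0 / (1.0 + np.exp(-(X @ net.coefs_[0] + net.intercepts_[0])))
print("hidden image:", LogisticRegression().fit(H, y).score(H, y))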

Future Directions for ANNs and Associated Methods

Our data analyses using ANNs could be improved in several ways. With respect to ANNs, a variety of algorithms are now available in addition to the back-propagation algorithm, which is known to be slow. These newer algorithms use the information from second derivatives and therefore reach convergence faster and require fewer learning parameters. For example, the learning rate and momentum parameters we used in our experiments are not needed with the use of newer algorithms (Bishop, 1995; Ripley, 1996). Employing these newer algorithms with the increasingly powerful hardware available is likely to remedy the current computational inefficiency associated with ANNs.
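As an illustration of the difference (a sketch under our assumptions, using scikit-learn rather than the software used in this article), the quasi-Newton L-BFGS solver exploits curvature information internally and needs no learning-rate or momentum settings, unlike plain gradient descent:

import time
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=23, random_state=0)

# Plain gradient descent: the analyst must supply a learning rate and momentum.
sgd = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic", solver="sgd",
                    learning_rate_init=0.05, momentum=0.9, max_iter=2000,
                    random_state=0)
# L-BFGS approximates second-derivative (curvature) information internally,
# so neither parameter is needed.
lbfgs = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                      solver="lbfgs", max_iter=2000, random_state=0)

for name, net in (("gradient descent + momentum", sgd), ("quasi-Newton (L-BFGS)", lbfgs)):
    start = time.perf_counter()
    net.fit(X, y)
    print(name, "iterations:", net.n_iter_,
          "seconds:", round(time.perf_counter() - start, 2))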

Our method in neural network estimation did not directly minimize errors based on the values of the CAUC, because its derivatives with respect to weights could not be mathematically obtained. It may be possible to explore alternative minimization functions in neural networks with the use of ROC analysis by linear transformation of the ROC or by using equivalent estimates such as Wilcoxon or C statistics (Hanley & McNeil, 1982). Alternatively, obtaining the derivatives numerically may become a feasible option, as is already done with nonlinear regression algorithms (Ralston & Jennrich, 1978).
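The equivalence cited here can be verified numerically: the area under the ROC curve equals the Mann-Whitney U (Wilcoxon) statistic divided by the number of positive-negative pairs, that is, the probability that a randomly chosen positive case outscores a randomly chosen negative one. A minimal check (ours, on simulated scores, assuming SciPy and scikit-learn):

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
score = 0.8 * y + rng.normal(size=500)          # a noisy continuous predictor

# Recent SciPy returns U for the first sample; dividing by the number of
# positive-negative pairs gives the probability that a positive case
# outscores a negative one, which is exactly the area under the ROC curve.
u = mannwhitneyu(score[y == 1], score[y == 0], alternative="two-sided").statistic
print(u / ((y == 1).sum() * (y == 0).sum()), roc_auc_score(y, score))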


Table 1
Estimates of Logistic Regression Coefficients With All 23 Variables Versus 8 Most Significant Variables (N = 855)

                                                All 23 variables    Only 8 most significant
Variable                                          OR       p           OR       p
Demographics
  African American (race)                        3.22    .002         3.16    .001
  Veteran                                        2.00    .36
Pre-Vietnam
  Heroin use                                     0.05    .005         0.07    .007
  Regular narcotics use                          1.29    .72
  Narcotics injection                           24.72    .0009       26.28    .0002
  Types of drugs used (0-4)                      0.87    .86
  Marijuana use level (0-3)                      2.19    .35
  Too much drinking                              2.46    .03          2.23    .05
  Childhood antisocial behavior (0-10)           2.83    .22
In-Vietnam
  Narcotics withdrawal                           3.04    .13          4.32    .0004
  Stimulant use                                  1.87    .13
  Heavy drug use                                 1.95    .44
  Arrest                                         1.39    .37
Post-Vietnam
  Wanted to take narcotics                       4.32    .003         4.30    .001
  Felt addicted to narcotics                     2.03    .21
  Know where to buy heroin                       1.64    .19          2.06    .04
  Drug use to relieve effects of other drugs     0.53    .21
  Drug use to help depression                    2.76    .03          2.16    .05
  Depression while drinking                      0.50    .20
  Arrest                                         2.03    .21
  Number of arrests (0-4)                        0.26    .22
Lifetime
  Heroin use level (0-3)                         0.28    .25
  Opium use five or more times                   0.66    .58

Note. The two logistic regressions correspond to the two receiver operating characteristic curves shown in Figure 5. Estimates were onefold; that is, all available observations were used. OR (odds ratio) = e^β; p = probability level of the beta coefficient. The lifetime assessment refers to "ever" use up to the 1974 survey. All variables are dichotomous except those followed by numbers in parentheses. All ordinal variables were subsequently scaled to be between 0 and 1, to be identical to the scales used in the multilayer perceptron model.

Exploration of fuzzy neural networks, which apply fuzzy set theory to neural network architecture, might prove useful for diagnostic decision making (Fu & Shann, 1994). However, this type of ANN architecture involves multiple hidden layers and is much more complex than the type of ANN we have modeled so far.

One way to describe the validity of a clinical diagnosis is to evaluate how well it predicts longitudinal outcomes. Neural network application to longitudinal measures at this time remains theoretical at best (De Laurentiis & Ravdin, 1994; Liestøl et al., 1994). In our example of behavioral outcome prediction, we used logistic regression to predict cumulative mortality. Nonrecurring events such as death, however, are better handled by survival analysis and Cox regression, given their ability to take the timing of events into account.

Neural network analogues to the statistical methods for longitudinal measures have been tested with small data sets. The approach involves (a) starting with a one-layer model with the logistic activation function, which, with back-propagation, should compute the predicted conditional probability of the event, and then (b) extending it to a nonlinear, nonproportional hazard model by introducing a hidden layer (Liestøl et al., 1994).
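A sketch of this two-step approach as we read it (ours, with simulated, uncensored event times for brevity): survival data are expanded into person-period records so that a logistic model estimates the discrete-time hazard, and a hidden layer then relaxes the linearity and proportionality assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, periods = 300, 10
X = rng.normal(size=(n, 3))                        # baseline covariates
event_time = rng.integers(1, periods + 1, size=n)  # toy event times; no censoring

# One record per subject per period at risk; the outcome is "event this period."
rows, events = [], []
for i in range(n):
    for t in range(1, event_time[i] + 1):
        rows.append(np.concatenate([X[i], [t / periods]]))
        events.append(int(t == event_time[i]))
rows, events = np.array(rows), np.array(events)

# Step (a): logistic regression on person-periods = a discrete-time hazard model.
hazard_glm = LogisticRegression(max_iter=1000).fit(rows, events)
# Step (b): a hidden layer makes the hazard nonlinear and nonproportional.
hazard_net = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                           solver="lbfgs", max_iter=2000,
                           random_state=0).fit(rows, events)
# Predicted conditional probability of the event in a period, given survival so far:
print(hazard_net.predict_proba(rows[:3])[:, 1])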

The ROC methodology can be applied in a more sophisticated fashion. The probit-probit transformation of the ROC (Swets & Pickett, 1982) would yield linear estimation; thus, regression coefficients can be estimated with Z statistics to test for differences in prediction (Erdreich & Lee, 1981). The recently developed "two-truth" method (Phelps & Hutson, 1995) appears to improve the estimate of the accuracy of a test when the estimate of the true rates is uncertain. It may provide a solution to the problem of noisy data, frequently found with psychosocial data.
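A small sketch of the probit-probit linearization (ours, on simulated binormal scores): transforming hit and false-alarm rates by the inverse normal turns a binormal ROC into a straight line whose slope and intercept can then be estimated by ordinary linear regression.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 2000)   # scores for true negatives
pos = rng.normal(1.2, 1.0, 2000)   # scores for true positives

thresholds = np.linspace(-1.5, 2.5, 15)
fpr = np.array([(neg > c).mean() for c in thresholds])   # false-alarm rates
tpr = np.array([(pos > c).mean() for c in thresholds])   # hit rates

# Probit-probit transform: under the binormal model these points fall on a line.
slope, intercept = np.polyfit(norm.ppf(fpr), norm.ppf(tpr), 1)
print(slope, intercept)    # close to 1.0 and 1.2 for these simulated scores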

Issues of Psychological Assessment

The results of our experiments demonstrated several advantages of ANNs over linear models for psychological assessment. ANNs are far from problem free, however. Perhaps most important is the "black-box" nature of current ANN applications. As mentioned, one obvious way to open the black box is with a detailed interpretation of weights, moving from a simple to a more complex model. For example, analyses of input and output spaces created from weight matrices can aid in deciding the optimal number of neurons that yields a substantive interpretation of the data (e.g., Duh et al., 1998). In a more complex model involving a large number of input variables and hidden neurons, it is still possible to assess the relative importance of input variables by taking a summary measure of the weights (Lucek & Ott, 1997). Alternatively, the structure of hidden layers can be explored by using the information stored in vectors of hidden neurons once training is completed. Principal-components analysis, canonical discriminant analysis (e.g., Wiles & Bloesch, 1992), and cluster analysis (Elman, 1990) have been used to explore the network structure.
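Two of these box-opening tactics are easy to sketch (our illustration, not the cited procedures themselves): a crude input-importance summary in the spirit of Lucek and Ott (1997), and principal-components analysis of the hidden-layer activation vectors.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                    solver="lbfgs", max_iter=2000, random_state=0).fit(X, y)

# (1) Summary measure of weights: |input-to-hidden| x |hidden-to-output|,
# summed over hidden neurons, as a rough relative-importance index.
W_in, W_out = net.coefs_
importance = np.abs(W_in) @ np.abs(W_out).ravel()
print("relative importance:", importance / importance.sum())

# (2) PCA of the hidden-layer activation vectors after training.
H = 1.0 / (1.0 + np.exp(-(X @ W_in + net.intercepts_[0])))
print("variance explained by two components:",
      PCA(n_components=2).fit(H).explained_variance_ratio_.sum())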

Linear methods and ANNs are both sensitive to low prevalence, which is often the case with rare diseases or psychiatric disorders. When the prevalence is very low, an ANN does what any smart human would do: It throws all cases into the negative category at the cost of a small number of false negatives. This practice yields excellent specificity and unimpressive sensitivity and could have unfortunate clinical consequences. For example, a clinician may find misdiagnosing 1 depressive who subsequently commits suicide to be more "costly" than prescribing Prozac to 10 dysphoric individuals. We used pattern weights to address this issue, but this strategy was more helpful for improving the efficiency of estimation than prediction. The concern for low prevalence can be substantially diminished if the error function in MLP modeling is derived directly to minimize the CAUC in the ROC method.
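The trade-off is easy to reproduce. In the sketch below (ours; scikit-learn's class_weight option stands in for the pattern weights we used), a model fit to data with roughly 5% prevalence tends to call nearly everything negative until cases are reweighted by class, which recovers sensitivity at the price of specificity.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Roughly 5% prevalence, with some label noise.
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.05,
                           random_state=0)

for cw in (None, "balanced"):
    model = LogisticRegression(class_weight=cw, max_iter=1000).fit(X, y)
    tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
    print(cw, "sensitivity:", round(tp / (tp + fn), 2),
          "specificity:", round(tn / (tn + fp), 2))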

Guidelines are needed for the sample size. The literature available on sample size requirements for neural network modeling is limited. In the case of a rare disease or condition, the sample size needed is probably related to the number of positives, so enriched samples or populations will be valuable. It would be possible to obtain the sampling distribution of weights by a bootstrap method to estimate necessary sample size (White, 1989).
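A bootstrap along these lines is straightforward to sketch (ours): refit the network on resampled data and examine the spread of the estimates. Because hidden neurons can swap roles across fits, we track a permutation-invariant summary of the weights rather than individual weights.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)

norms = []
for b in range(50):                                 # bootstrap replicates
    idx = rng.integers(0, len(y), size=len(y))      # resample with replacement
    net = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                        solver="lbfgs", max_iter=2000, random_state=0)
    net.fit(X[idx], y[idx])
    norms.append(np.linalg.norm(net.coefs_[0]))     # first-layer weight norm
print("bootstrap mean:", np.mean(norms), "bootstrap SE:", np.std(norms, ddof=1))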

Equally important for the purposes of psychological assessment, information is needed on the conditions under which ANNs outperform conventional statistical methods. The examination of network performance and comparisons with logistic analysis in our application provided empirical observations about some likely conditions that make application of ANNs suitable. We attempted to provide some information as to how to better model ANNs, such as the number and nature of input variables and performance variation associated with types of activation functions and the number of hidden neurons. The next step will be to conduct simulation studies in which linearity, noise level, prevalence, and sample size are concurrently varied with other parameters of ANN modeling. Such studies are likely to provide information useful to research practitioners for deciding when and how to apply ANN models to their data.

It should be stressed that an ANN is only as good as the data it analyzes. We have had several failed experiments when input variables were limited or were not constructed to be useful for ANN modeling. Greater attention needs to be paid to improving techniques for selecting optimal measures. Newer techniques such as genetic algorithms can be applied to this selection task (Goldberg, 1989; Downey & Meyer, 1994) and can be combined with ANN modeling (So & Karplus, 1996) to further improve ANN prediction; a minimal sketch follows below. Given ongoing refinements to the ANN methodology, we expect that ANNs will gain increased acceptance in psychiatric and psychological research and will be applied more widely in clinical decision making.
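A minimal genetic-algorithm feature-selection sketch (ours, in the spirit of Downey & Meyer, 1994, not their implementation): candidate subsets are bitmasks over the measures, fitness is the cross-validated AUC of a simple model on the selected measures, and new generations arise by selection, one-point crossover, and mutation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=23, n_informative=6,
                           random_state=0)
rng = np.random.default_rng(0)
pop = rng.integers(0, 2, size=(20, X.shape[1]))          # 20 random bitmasks

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y,
                           cv=5, scoring="roc_auc").mean()

for gen in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]              # keep the fittest half
    kids = []
    for _ in range(10):
        a, b = parents[rng.integers(0, 10, size=2)]
        cut = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])       # one-point crossover
        flip = rng.random(X.shape[1]) < 0.02             # mutation
        kids.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected measures:", np.flatnonzero(best))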

References

Achen, C. (1982). Interpreting and using regression. In J. L. Sullivan & R. G. Niemi (Series Eds.), Sage University paper series on quantitative applications in the social sciences (07-001). Beverly Hills, CA: Sage.
American Psychiatric Association. (1980). Diagnostic and statistical manual of mental disorders (3rd ed.). Washington, DC: Author.
American Psychiatric Association. (1987). Diagnostic and statistical manual of mental disorders (3rd ed., rev.). Washington, DC: Author.
American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Anderer, P., Saletu, B., Klöppel, B., Semlitsch, H. V., & Werner, H. (1994). Discrimination between demented patients and normals based on topographic EEG slow wave activity: Comparison between Z statistics, discriminant analysis, and artificial neural network classifiers. Electroencephalography and Clinical Neurophysiology, 91, 108-117.
Anderson, J. A. (1995). An introduction to neural networks. Cambridge, MA: MIT Press.
Baxt, W. G. (1995). Application of artificial neural networks to clinical medicine. Lancet, 346, 1135-1138.
Berrios, G. E., & Chen, E. Y. H. (1993). Recognizing psychiatric symptoms: Relevance to the diagnostic process. British Journal of Psychiatry, 163, 308-314.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford, England: Oxford University Press.
Chen, E. Y. H. (1994). A neural network model of cortical information processing in schizophrenia. 1: Interaction between biological and social factors in symptom formation. Canadian Journal of Psychiatry, 39, 362-367.
Cheng, B., & Titterington, D. M. (1994). Neural networks: A review from a statistical perspective. Statistical Science, 9, 2-54.
Davis, G. E., Walter, E. L., & Davis, G. L. (1993). A neural network that predicts psychiatric length of stay. Clinical Computing, 10, 87-92.
De Laurentiis, M., & Ravdin, P. M. (1994). Survival analysis of censored data: Neural network analysis detection of complex interaction between variables. Breast Cancer Research and Treatment, 32, 113-118.
Doig, G. S., Inman, K. I., Sibbald, W. J., Martin, C. M., & Robertson, J. M. (1994). Modeling mortality in the intensive care unit: Comparing the performance of a back-propagation, associative-learning network with multivariate logistic regression. Proceedings of the Annual Symposium on Computer Applications in Medical Care, 361-365.
Downey, T. J., & Meyer, D. J. (1994). A genetic algorithm for feature selection. In C. H. Dagli, B. R. Fernandes, J. Ghosh, & R. T. S. Kumura (Eds.), Intelligent engineering systems through artificial neural networks (Vol. 4, pp. 363-369). New York: ASME Press.
Duh, M. S., Walker, A. M., & Ayanian, J. Z. (1998). Epidemiologic interpretation of artificial neural networks. American Journal of Epidemiology, 147, 1112-1122.
Duh, M. S., Walker, A. M., Pagano, M., & Kronlund, K. (1998). Prediction and cross-validation of neural networks versus logistic regression: Using hepatic disorders as an example. American Journal of Epidemiology, 147, 407-413.
Dwyer, C. A. (1996). Cut scores and testing: Statistics, judgment, truth, and error. Psychological Assessment, 8, 360-362.
Dybowski, R., Weller, P., Chang, R., & Gant, V. (1996). Prediction of outcome in critically ill patients using artificial neural network synthesized by genetic algorithm. Lancet, 347, 1146-1150.
Eaton, W. W., & Kessler, L. G. (Eds.). (1985). Epidemiologic field methods in psychiatry: The NIMH Epidemiologic Catchment Area Program. Orlando, FL: Academic Press.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Erdreich, L. S., & Lee, E. T. (1981). Use of relative operating characteristic analysis in epidemiology: A method for dealing with subjective judgement. American Journal of Epidemiology, 114, 649-662.
Fahlman, S. E. (1989). Faster-learning variations on back-propagation: An empirical study. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38-51). San Mateo, CA: Morgan Kaufmann.
Fu, H. C., & Shann, J. (1994). A fuzzy neural network for knowledge learning. International Journal of Neural Systems, 5, 13-22.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley.
Gonzalez-Heydrich, J. (1993). Using neural networks to model personality development. Medical Hypotheses, 41, 123-130.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman & Hall.
Helzer, J. E., Stoltzman, R. K., Farmer, A., Brockington, I. F., Plesons, D., Singerman, B., & Works, J. (1985). Comparing the DIS with a DIS/DSM-III-based physician reevaluation. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic field methods in psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 285-308). Orlando, FL: Academic Press.
Herjanic, B., & Reich, W. (1982). Development of a structured psychiatric interview for children: Agreement between child and parent on individual symptoms. Journal of Abnormal Child Psychology, 10, 307-324.
Ioannidis, J. P. A., McQueen, P. G., Goedert, J. J., & Kaslow, R. A. (1998). Use of neural networks to model complex immunogenetic associations of disease: Human leukocyte antigen impact on the progression of human immunodeficiency virus infection. American Journal of Epidemiology, 147, 464-471.
Kashani, J. H., Nair, S. S., Rao, V. G., Nair, J., & Reid, J. C. (1996). Relationship of personality, environmental and DICA variables to adolescent hopelessness: A neural network sensitivity approach. Journal of the American Academy of Child and Adolescent Psychiatry, 35, 640-645.
Kuzin, I. A. (1995). Hypnotic modeling in neural networks. Bulletin of Mathematical Biology, 57, 1-20.
Lette, J. L., Colletti, B., Cerino, M., McNamara, D., Eybalin, M. C., Levasseur, A., & Nattel, S. (1994). Artificial intelligence versus logistic regression statistical modeling to predict cardiac complications after noncardiac surgery. Clinical Cardiology, 17, 609-614.
Liestøl, K., Andersen, P. K., & Andersen, U. (1994). Survival analysis and neural nets. Statistics in Medicine, 13, 1189-1200.
Loeber, R. (1988). Natural histories of conduct problems, delinquency, and associated substance use: Evidence for developmental progressions. In B. B. Lahey & A. E. Kazdin (Eds.), Advances in clinical child psychology (Vol. 11, pp. 73-124). New York: Plenum Press.
Lowell, W. E., & Davis, G. E. (1994). Predicting length of stay for psychiatric diagnosis-related groups using neural networks. Journal of the American Medical Informatics Association, 6, 459-466.
Lucek, P. R., & Ott, J. (1997). Neural network analysis of complex traits. Genetic Epidemiology, 14, 1101-1106.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Modai, I., Stoler, M., Inbar-Saban, N., & Saban, N. (1993). Clinical decisions for psychiatric inpatients and their evaluation by a trained neural network. Methods of Information in Medicine, 32, 396-399.
O'Donnell, J. A., Voss, H. L., Clayton, R. R., Slatin, G. T., & Room, R. G. W. (1976). Young men and drugs: A nationwide survey (NIDA Research Monograph No. 5; DHEW Publication No. ADM 76-311). Springfield, VA: National Institute on Drug Abuse. (NTIS No. PB 247-446)
Palombo, S. R. (1992). Connectivity and condensation in dreaming. Journal of the American Psychoanalytic Association, 40, 1139-1159.
Partek, Inc. (1998). Partek: Pattern Analysis and Recognition Technology (Version 2.0B7) [Computer software]. St. Peters, MO: Author.
Phelps, C. E., & Hutson, L. (1995). Estimating diagnostic test accuracy using a "fuzzy gold standard." Medical Decision Making, 15, 44-57.
Price, R. K., Eisen, S. A., Virgo, K. S., Murray, K. S., & Robins, L. N. (1995). Vietnam drug users two decades later. I. Mortality results [Abstract]. In Proceedings of the 56th Annual Scientific Meeting of the College on Problems of Drug Dependence, Inc. (Vol. 2, p. 204; NIDA Research Monograph No. 153; DHHS Publication No. NIH 95-3883). Washington, DC: U.S. Government Printing Office.
Price, R. K., Risk, N. K., & Spitznagel, E. L. (1999). Remission from illicit drug use over a 24-year period. Part I. Patterns and predictors of remission and treatment use. Manuscript submitted for publication.
Price, R. K., Risk, N. K., Stwalley, D., Murray, K. S., Eisen, S., Virgo, K. S., Spitznagel, E. L., & Robins, L. N. (1999). Vietnam drug users twenty-five years later: Patterns and early predictors of mortality. Manuscript submitted for publication.
Price, R. K., Robins, L. N., Helzer, J. E., & Cottler, L. B. (1998). Examining the internal validity of DSM-III and -III-R symptom criteria for substance dependence and abuse. In T. Widiger & A. Frances (Eds.), DSM-IV sourcebook (Vol. 4, pp. 51-68). Washington, DC: American Psychiatric Press.
Prochaska, J. O., & DiClemente, C. C. (1983). Stages and process of self-change in smoking: Toward an integrative model of change. Journal of Consulting and Clinical Psychology, 51, 390-395.
Ralston, M. L., & Jennrich, R. I. (1978). Dud, a derivative-free algorithm for nonlinear least squares. Technometrics, 20, 7-14.
Reid, J. C., Nair, S. S., Mistry, S. L., & Beitman, B. D. (1996). Effectiveness of stages of change and adinazolam SR in panic disorder: A neural network analysis. Journal of Anxiety Disorders, 10, 331-345.
Ripley, B. D. (1994). Comment on "Neural networks: A review from a statistical perspective." Statistical Science, 9, 45-48.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge, England: Cambridge University Press.
Robins, L. N. (1974). The Vietnam drug user returns (Special Action Office Monograph, Series A, No. 2). Washington, DC: U.S. Government Printing Office.
Robins, L. N., & Helzer, J. E. (1975). Drug use among Vietnam veterans: Three years later. Medical World News-Psychiatry, 16, 44-49.
Robins, L. N., Helzer, J. E., Croughan, J., & Ratcliff, K. S. (1981). The NIMH Diagnostic Interview Schedule: Its history, characteristics, and validity. Archives of General Psychiatry, 38, 381-389.
Robins, L. N., Helzer, J. E., & Davis, D. H. (1975). Narcotic use in Southeast Asia and afterward: An interview study of 898 Vietnam returnees. Archives of General Psychiatry, 32, 955-961.
Robins, L. N., Helzer, J. E., Ratcliff, K. S., & Seyfried, W. (1982). Validity of the Diagnostic Interview Schedule, Version II: DSM-III diagnoses. Psychological Medicine, 12, 855-870.
Robins, L. N., & Price, R. K. (1991). Adult disorders predicted by childhood conduct problems: Results from the Epidemiologic Catchment Area Program. Psychiatry, 54, 116-132.
Robins, L. N., & Regier, D. A. (Eds.). (1991). Psychiatric disorders in America. New York: Free Press.
Robins, L. N., Wing, J., Wittchen, H. U., Helzer, J. E., Babor, T. F., Burke, J., Farmer, A., Jablensky, A., Pickens, R., Regier, D. A., Sartorius, N., & Towle, L. H. (1988). The Composite International Diagnostic Interview. Archives of General Psychiatry, 45, 1069-1077.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Sarle, W. S. (1994, April). Neural networks and statistical models. In Proceedings of the 19th Annual SAS Users Group International Conference, Cary, NC.
Sarle, W. S. (1996, June). Comp.ai.neural-nets FAQ [7 parts]. Usenet newsgroup comp.ai.neural-nets. Available FTP: ftp://ftp.sas.com/pub/neural/FAQ.html
So, S.-S., & Karplus, M. (1996). Evolutionary optimization in quantitative structure-activity relationship: An application of genetic neural networks. Journal of Medicinal Chemistry, 39, 1521-1530.
Somoza, E., & Somoza, J. R. (1993). A neural-network approach to predicting admission decisions in a psychiatric emergency room. Medical Decision Making, 13, 273-280.
Spackman, K. A. (1992). Combining logistic regression and neural networks to create predictive models. Proceedings of the Annual Symposium on Computer Applications in Medical Care, 456-459.
Swets, J. A., & Pickett, R. M. (1982). Evaluation of diagnostic systems: Methods from signal detection theory. New York: Academic Press.
U.S. General Accounting Office. (1998). Drug abuse: Research shows treatment is effective, but benefits may be overstated (Report to congressional requesters; GAO/HEHS-98-72). Washington, DC: U.S. Government Printing Office.
Warner, B., & Misra, M. (1996). Understanding neural networks as statistical tools. American Statistician, 50, 284-293.
Weinstein, M. C., & Fineberg, H. V. (1980). Clinical decision analysis. Philadelphia: W. B. Saunders.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425-464.
White, H. (1992). Artificial neural networks: Approximation and learning theory. Oxford, England: Blackwell.
Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems 4 (pp. 325-332). San Mateo, CA: Morgan Kaufmann.
Wu, Y. C., & Gustafson, D. H. (1994). Removing the assumption of conditional independence from Bayesian decision models by using artificial neural networks: Some practical techniques and a case study. Artificial Intelligence in Medicine, 6, 437-454.
Zou, Y., Shen, Y., Shu, L., Wang, Y., Feng, P., Xu, K., Qu, Y., Song, Y., Zhong, Y., Wang, M., & Liu, W. (1996). Artificial neural networks to assist psychiatric diagnosis. British Journal of Psychiatry, 169, 64-67.
Zurada, J. M. (1992). Introduction to artificial neural systems. St. Paul, MN: West Publishing.

Received June 26, 1998

Revision received August 19, 1999

Accepted August 26, 1999