On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology

Guido Schwarzer, Werner Vach and Martin Schumacher
Center for Data Analysis and Model Building
and
Institute of Medical Biometry and Medical Informatics
University of Freiburg

Abstract

The application of artificial neural networks (ANNs) for prognostic and diagnostic classification in clinical medicine has become very popular. Some indication might be derived from a recent "mini-series" in the Lancet[7,23,30,94] with three more or less enthusiastic review articles and an additional commentary expressing at least some scepticism. In this paper, the essentials of feed-forward neural networks and their statistical counterparts (e.g. logistic regression models) are reviewed. We point to serious problems of ANNs, such as the fitting of implausible functions to describe the probability of class membership and the underestimation of misclassification probabilities. In applications of ANNs to survival data, many suggested procedures result in predicted survival probabilities which are not necessarily monotone functions of time and which lack a proper incorporation of censored observations. Finally, the results of a search in the medical literature from 1991 to 1995 on applications of ANNs in oncology and some important common mistakes are reported. It is concluded that there is no evidence that ANNs can provide real progress in the field of diagnosis and prognosis in oncology.

Introduction

During the last years, the application of artificial neural networks (ANNs) for prognostic and diagnostic classification in clinical medicine has attracted growing interest in the medical literature. For example, a "mini-series" on neural networks that appeared recently in the Lancet contained three more or less enthusiastic review articles[7,23,30] and an additional commentary expressing some scepticism.[94] In this commentary, as well as in other, more balanced reviews,[69] the comparison with competing statistical methods is emphasized.

The relationship between ANNs and statistical methods, especially logistic regression models, has been described in several articles; most of them, however, appeared in the statistical literature[17,76,87,88,91] and may therefore not be sufficiently recognized in applications.


Hence, we start with a brief summary of feed-forward neural networks and their statistical counterparts, i.e. logistic regression models and their extensions. We then show that the functions describing the probability of class membership are usually far from plausible when fitted by ANNs. This is often referred to as an advantage of ANNs;[23] we argue that it must be seen as a major drawback. The problems associated with the estimation of misclassification probabilities and with the application of ANNs to the survival data that often occur in oncology are outlined. In addition to these general methodological considerations, we report the results of a literature search on the application of ANNs in oncology; the most frequently occurring mistakes are identified and commented on. Finally, we summarize our critical remarks in the conclusions section.

Essentials of logistic regression and neural networks

The common principle of diagnosis and prognosis in oncology and other areas can be described in a general way as follows. Given a set of p covariate values x = (x_1, ..., x_p)' for an individual, decide to which class k, k \in \{0, ..., K\}, this individual belongs. Synonyms for the term individual are the words object, patient or example, among others. The simplest but most common case is to consider only two possible classes. Examples are the diagnosis of a tumor as "benign" or "malignant" or the prediction of the recurrence of a disease before a specified point in time. We restrict our attention to this special case and denote the class indicator in the following by Y \in \{0, 1\}.

The aim is to construct a decision rule for individuals with known covariate values but unknown class level. Many approaches to constructing such a decision rule are based on estimating the conditional probability of observing an individual with class level 1 given the covariates x = (x_1, ..., x_p)'. Such an estimation is often based on a parametric model

    P(Y = 1 | X = x) = f(x; \beta)    (1)

for the conditional probabilities, where \beta is a vector of unknown parameters. The parameters are called "regression coefficients" in statistics and "weights" in the neural net community. These parameters are estimated using information from a sample of individuals with known class levels and known covariate values. With such an estimate \hat{\beta} we can estimate, for each future individual with covariate values x, the conditional probability of belonging to class 1 by f(x; \hat{\beta}). The classification rule is then: put the individual in class 1 if f(x; \hat{\beta}) is greater than 0.5, otherwise put it in class 0.

Logistic regression and feed-forward neural networks differ in the functional form f(x; \beta) assumed for the conditional probability P(Y = 1 | X = x).
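Stated as code, such a rule is merely a thresholding of the estimated conditional probability. The following is a minimal sketch of our own, not taken from any of the cited articles; the fitted model f_hat and its coefficients are invented purely to make the example runnable.

    import numpy as np

    def classify(f_hat, x, threshold=0.5):
        """Generic decision rule: assign class 1 if the estimated
        conditional probability P(Y=1 | X=x) exceeds the threshold."""
        return int(f_hat(x) > threshold)

    # Hypothetical fitted model: a linear logistic model with
    # made-up coefficients beta = (beta_0, beta_1, ..., beta_p).
    beta = np.array([-0.5, 1.2, -0.8])

    def f_hat(x):
        eta = beta[0] + beta[1:] @ x           # linear predictor
        return 1.0 / (1.0 + np.exp(-eta))      # logistic function

    print(classify(f_hat, np.array([1.0, 0.3])))   # -> 1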


Logistic regression

The linear logistic regression model[20,22] assumes that

    f(x; \beta) = 1 / (1 + \exp(-\beta_0 - \sum_{i=1}^p \beta_i x_i)) = \Lambda(\beta_0 + \sum_{i=1}^p \beta_i x_i)    (2)

with \Lambda(\cdot) denoting the logistic function. The model can be re-expressed in terms of the odds of observing an individual with class level 1 given the covariates x in the following way:

    Odds(x) = P(Y = 1 | X = x) / P(Y = 0 | X = x) = \exp(\beta_0 + \sum_{i=1}^p \beta_i x_i).

Thus the essential assumption of the linear logistic regression model is that the logarithm of the odds is linear in x. The odds ratio of two individuals, showing a difference of \Delta for the i-th covariate and no difference for the others, is equal to \exp(\beta_i \Delta). Hence the regression coefficients \beta_i are directly interpretable as log-odds ratios or, in terms of \exp(\beta_i), as odds ratios. This property of the logistic regression model enables one to assess the effect of a single covariate and is one reason for the widespread use of this method in the biomedical sciences.

A natural extension of the linear logistic regression model is to include quadratic terms and multiplicative interaction terms:

    f(x; \beta) = \Lambda(\beta_0 + \sum_{i=1}^p \beta_i x_i + \sum_{i=1}^p \gamma_i x_i^2 + \sum_{i<i'} \delta_{ii'} x_i x_{i'}).    (3)

These models are usually considered the first step to overcome the stringent assumption of additive and purely linear effects of the covariates. Other approaches to model the conditional probability P(Y = 1 | X = x) in a flexible manner are, among others, generalized additive models,[45] fractional polynomials[79] and feed-forward neural networks.

Feed-forward neural networks

A typical description of an artificial neural network covered in a biomedical journal is the following, by Moul et al.:[63]

    "Neural network analysis is a computer-based data processing structure based on processing elements that function in a manner analogous to that of the human neuron. Input data to the processing elements are modified by weighted summation and transferred to succeeding layers of processing elements through nonlinear transfer functions, eventually resulting in an output answer."



Figure 1: Graphical representation of a feed-forward neural network with p input units, r hidden units and 1 output unit.

The description is often accompanied by a graphical representation like that in Figure 1. This figure illustrates a feed-forward neural network with one hidden layer. The network consists of p input units, r hidden units and one output unit. The flow of information is indicated by the arrows.

The units in the input layer x = (x_1, ..., x_p)' correspond to the covariates in a logistic regression model. The hidden units h_1, ..., h_r are the result of applying a so-called "activation function" to a weighted sum of the input units plus a constant, which is called "bias" in neural net jargon. The value of a hidden unit h_j, j = 1, ..., r, is given by

    h_j = \phi_j^h(w_{0j} + \sum_{i=1}^p w_{ij} x_i),

with unknown parameters w_{0j}, ..., w_{pj}. Although the activation functions \phi_j^h(\cdot) could be mutually different, they are almost always chosen as the logistic function \Lambda(\cdot). Thus the value of a hidden unit becomes

    h_j = 1 / (1 + \exp(-w_{0j} - \sum_{i=1}^p w_{ij} x_i)) = \Lambda(w_{0j} + \sum_{i=1}^p w_{ij} x_i).

The value of the output unit y is calculated by applying another activation function \phi^o(\cdot) to the weighted sum of the values of the hidden units plus a bias. The output activation function need not coincide with the \phi_j^h(\cdot), but in many applications it is chosen as the logistic function, too. Hence

    y = f(x; w) = \Lambda(W_0 + \sum_{j=1}^r W_j h_j) = \Lambda(W_0 + \sum_{j=1}^r W_j \Lambda(w_{0j} + \sum_{i=1}^p w_{ij} x_i))    (4)
      = 1 / (1 + \exp(-W_0 - \sum_{j=1}^r W_j [1 + \exp(-w_{0j} - \sum_{i=1}^p w_{ij} x_i)]^{-1}))

with unknown weights w = (W_0, ..., W_r, w_{01}, ..., w_{pr}), which have to be estimated in a so-called "learning" or "training" process.
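For readers who prefer code to formula (4), the following minimal sketch implements the forward pass of such a p-r-1 network with logistic activations in both layers; the weight values are invented solely to make the example runnable.

    import numpy as np

    def logistic(z):
        # the logistic function Lambda(.)
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, w, b_hidden, W, W0):
        """Forward pass of a p-r-1 feed-forward network, formula (4).
        w: (r, p) input-to-hidden weights w_ij, b_hidden: (r,) biases w_0j,
        W: (r,) hidden-to-output weights W_j, W0: output bias."""
        h = logistic(b_hidden + w @ x)   # hidden units h_1, ..., h_r
        return logistic(W0 + W @ h)      # output y = f(x; w)

    # Tiny 2-3-1 example with invented weights:
    rng = np.random.default_rng(0)
    x = np.array([0.5, -1.0])
    print(forward(x, rng.normal(size=(3, 2)), rng.normal(size=3),
                  rng.normal(size=3), 0.1))   # estimated P(Y = 1 | X = x)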


This representation of a feed-forward neural network lacks the appeal of Figure 1 or of the description by Moul et al.,[63] but it is the basis for an analytical comparison of feed-forward neural nets and logistic regression models.

One drawback of feed-forward neural networks is that the weights w have no plain interpretation. Except for a neural net without a hidden layer and with a logistic activation function, the weights lack the interpretation as log-odds ratios. In this special case formula (4) reduces to formula (2), which is a linear logistic regression model. This specific network type is called a "logistic perceptron" in the neural network literature.

An extension of model (4) for K > 2 unordered categories is a feed-forward neural network with K output units, a linear activation function in the hidden layer and a softmax activation function[10,11] in the output layer. This is equivalent to using a polytomous logistic regression model[31,59] for the conditional probabilities P(Y = k | X = x), k = 1, ..., K.

A feed-forward neural network with p input units, r hidden units and K output units is denoted by p-r-K in the sequel.

Estimation

We consider the situation that a sample of N individuals with known covariate values x^{(n)} = (x_1^{(n)}, ..., x_p^{(n)})' and known class level y^{(n)} \in \{0, 1\} is used to estimate the regression coefficients \beta or the weights w, respectively. In neural net jargon the phrase "to train a network" is equivalent to estimation.

In the neural net literature, estimation is usually based on the minimization of a distance measure, the so-called "energy" or "learning" function. In most applications the distance measure chosen is the least squares criterion:

    D_{LS}(w) = \sum_{n=1}^N ( y^{(n)} - f(x^{(n)}; w) )^2.    (5)

Especially for classification problems, the Kullback-Leibler distance[53] has been proposed in the neural net literature[6,47] as an alternative:

    D_{KL}(w) = \sum_{n=1}^N [ y^{(n)} \log( y^{(n)} / f(x^{(n)}; w) ) + (1 - y^{(n)}) \log( (1 - y^{(n)}) / (1 - f(x^{(n)}; w)) ) ].    (6)

In statistics, estimation is usually based on the maximum likelihood principle.[61] Fitting neural networks or logistic regression models by the maximum likelihood principle is equivalent to minimizing the Kullback-Leibler distance.
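Both distance measures are easily computed for a vector of class labels and fitted probabilities, as the following sketch with invented numbers shows. Note that for binary y^{(n)} the terms y log y in (6) vanish, so (6) is exactly the negative Bernoulli log-likelihood and minimizing it is maximum likelihood.

    import numpy as np

    def d_ls(y, p):
        """Least squares criterion (5)."""
        return np.sum((y - p) ** 2)

    def d_kl(y, p):
        """Kullback-Leibler distance (6) for binary class labels y;
        since y*log(y) = 0 for y in {0, 1}, this equals the negative
        Bernoulli log-likelihood."""
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 0, 0, 1, 1])              # observed class labels y^(n)
    p = np.array([0.9, 0.2, 0.4, 0.7, 0.6])    # fitted probabilities f(x^(n); w)
    print(d_ls(y, p), d_kl(y, p))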


Standard numerical optimization algorithms, like quasi-Newton or Levenberg-Marquardt algorithms, can be used to minimize the chosen distance measure. However, the neural net community prefers other training algorithms. In particular, the backpropagation algorithm or one of its variants is implemented in most commercial neural net programs and thus utilized by the (biomedical) users of this software. It is well known[77,80] that this algorithm suffers from convergence problems and from difficulties in choosing values for the tuning parameters embedded in it ("learning rate", "momentum", ...). The literature on numerical optimization knows many simple alternatives, like Newton-Raphson methods, allowing minimization of (5) or (6) within a few steps and with automatic choice of the tuning parameters; these have been used by statisticians for many years.

Note that fitting a simple logistic perceptron using the Kullback-Leibler distance will give the same results as fitting a linear logistic regression by the maximum likelihood principle.[87]

Implausible functions fitted by neural networks

It is well known that feed-forward neural networks with one hidden layer are universal approximators[24,39,48] and thus can approximate any function defined by the conditional probability (1) with arbitrary precision by increasing the number of hidden units to infinity. However, the question in practice is whether we can expect valid approximations to the true function f by solving (5) or (6) based on the limited data available in biomedical applications. To examine this question we consider an example that was first presented by Finney[36] and has been used extensively for illustrative purposes.[35,62,71]

The data set consists of a class indicator Y \in \{0, 1\} and two covariates X_1 and X_2 measured on 39 individuals. A scatterplot of the data displayed in Figure 2 indicates that, when considering both covariates, the two classes are well separated, except for two individuals marked in Figure 2. However, for each single covariate the separation is not as obvious.

In the single plots of Figure 3 we included the pairs (Y, X_2), and we observe that the ranges of values of X_2 for Y = 0 and Y = 1, respectively, clearly overlap. For small values of X_2 the probability of Y = 1 is small, but it increases with increasing X_2. We first fit a linear logistic regression model to these data, using the maximum likelihood principle as provided by PROC LOGISTIC in SAS 6.11.[81] The resulting estimated function x \mapsto f(x; \hat{\beta}) is shown in the first plot; it is a smooth and hence plausible proposal for the true f.

Next we fit feed-forward neural networks with logistic activation functions by minimizing the Kullback-Leibler distance (6), using the backpropagation algorithm provided by the public domain software package Stuttgart Neural Network Simulator (SNNS),[95] version 4.1, with a self-written extension for the Kullback-Leibler criterion.
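Incidentally, the equivalence noted above between minimizing (6) for a logistic perceptron and maximum likelihood estimation is easy to verify numerically with a general-purpose quasi-Newton optimizer of the kind just mentioned. The sketch below does so on invented data (not the Finney data), using scipy and statsmodels in place of SAS or SNNS; both are assumptions of ours, not tools used in the paper.

    import numpy as np
    from scipy.optimize import minimize
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))                    # invented covariates
    eta = -0.5 + X @ np.array([1.0, -2.0])
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))      # invented class labels

    def d_kl(beta):
        """Kullback-Leibler distance (6) of a logistic perceptron,
        i.e. the negative Bernoulli log-likelihood."""
        p = 1 / (1 + np.exp(-(beta[0] + X @ beta[1:])))
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    fit_kl = minimize(d_kl, x0=np.zeros(3), method="BFGS")
    fit_ml = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(fit_kl.x)        # weights from minimizing (6)
    print(fit_ml.params)   # maximum likelihood estimates: the same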


Figure 2: Scatterplot of the Finney data;[36] class membership is indicated by the plotting symbol; two influential observations are marked with a square.

Using a simple logistic perceptron, we achieve of course the same function x \mapsto f(x; \hat{w}) as in the case of linear logistic regression. Using logistic perceptrons with a hidden layer, we find functions becoming more and more warped with increasing numbers of hidden units. And these functions are more and more implausible, because the true function f reflects some biological or medical relationship, and such relationships tend to be smooth, not warped.

Mathematically, the extreme fluttering of the functions shown in Figure 3 is due to very large weights w_{1j}, resulting in output functions of the hidden layer which are nearly jump functions. This undesirable feature is well known in the literature on neural nets, and it has been suggested to introduce some weight decay[76] to overcome it. However, it seems that this development did not have any impact on biomedical applications.

The tendency of neural nets to fit implausible functions can also be shown if we consider both covariates. Fitting a linear logistic model results here in a function (Figure 4) which is still smooth, but which clearly indicates the difference between the lower left and upper right parts of Figure 2. The two observations marked with a square in Figure 2 are here regarded as indicators that the change from the upper right part to the lower left part is not a change from sure membership in class 1 to sure membership in class 0, but that there is an intermediate region where both memberships are possible.


Figure 3: Estimated functions x \mapsto f(x; \hat{\beta}) for the probability of belonging to class 1, from linear logistic regression and from feed-forward neural networks (1-1-1, 1-3-1, 1-5-1, 1-10-1, 1-15-1) with different numbers r of hidden units and logistic activation function, for covariate X_2 of the Finney data represented in Figure 2.

In contrast to this, a feed-forward neural network with three hidden units fits a function lacking this intermediate region. Instead, a strict frontier between class 1 and class 0 membership is postulated, but this frontier is not smooth and hence not very plausible.

Besides this tendency to fit implausible functions, neural networks lack an easy interpretation of the results of the fitting process. We try to illustrate this in Tables 1 and 2. Table 1 presents the parameter estimates of the linear logistic regression model. We can see that both covariates have a strong influence, with estimated odds ratios of a magnitude of 100. Additionally, asymptotic theory allows the computation of standard errors, so that we can compute p-values and assess the significance of our findings. In contrast, the weights of the fitted neural network (Table 2) lack easy interpretation. In particular, we are unable to detect that for both covariates increasing values imply an increase of the probability of being a class 1 object, because the weights of the covariates are always negative here.
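The warping displayed in Figure 3 is easy to reproduce with current off-the-shelf software. The following sketch uses scikit-learn (our choice; the paper itself used SNNS) and invented data rather than the Finney data. With weight decay switched off (alpha=0) and a growing number of hidden units, the fitted probability curve typically becomes increasingly jumpy.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(2)
    x = rng.uniform(-1.5, 1.5, size=39).reshape(-1, 1)    # invented sample
    y = rng.binomial(1, 1 / (1 + np.exp(-3 * x[:, 0])))   # smooth true f

    grid = np.linspace(-1.5, 1.5, 7).reshape(-1, 1)
    for r in (1, 3, 5, 10, 15):                 # 1-r-1 nets, as in Figure 3
        net = MLPClassifier(hidden_layer_sizes=(r,), activation="logistic",
                            alpha=0.0,          # no weight decay, see above
                            max_iter=5000, random_state=0)
        net.fit(x, y)
        # fitted probability of class 1 along the covariate axis
        print(r, np.round(net.predict_proba(grid)[:, 1], 2))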



Figure 4: Gray scale image plots of the estimated function x \mapsto f(x; \hat{\beta}) for the probability of belonging to class 1, from linear logistic regression and from a neural network (2-3-1), for both covariates of the Finney data represented in Figure 2. Darker gray scale levels correspond to higher probabilities of membership in class 1; displayed gray scale levels range from 0 to 1.

Survival data and neural networks - a trivial task?

In some recent articles the application of feed-forward neural nets to survival data has been discussed.[13,25,34,50,55,74,73] It is fairly straightforward[55] to consider a 'grouped' version of the Cox regression model[21] in the framework of this paper. Denoting by T the survival time random variable and by I_k the time interval t_{k-1} \le t < t_k, k = 1, ..., K, where 0 = t_0 < t_1 < ... < t_K < \infty, the model can be specified through the conditional probabilities

    P(T \in I_k | T \ge t_{k-1}, x) = \Lambda(\beta_{0k} + \sum_{i=1}^p \beta_{ik} x_i)    (7)

for k = 1, ..., K. This is the original proposal by Cox;[21] other models for grouped survival data can be obtained using other link functions.[28,72] The neural network corresponding to (7) is an extension of model (4) without a hidden unit to K output units y_1, ..., y_K, where the output y_k at the k-th output unit corresponds to the conditional probability of dying in the k-th time interval I_k, k = 1, ..., K.


Table 1: Results of fitting a linear logistic regression to the Finney data represented in Figure 2.

    Variable     Estimated regression coefficient    Standard error    P-value
    Intercept    -2.92  (\hat{\beta}_0)              1.29              --
    X1            4.63  (\hat{\beta}_1)              1.79              0.010
    X2            5.22  (\hat{\beta}_2)              1.86              0.005

Table 2: Results of fitting a feed-forward neural network with three hidden units to the Finney data represented in Figure 2.

    Input -> Hidden     1. hidden unit   2. hidden unit   3. hidden unit
    bias                     17.7             -1.7             22.6
    weight of X1            -31.4            -25.4            -30.3
    weight of X2            -63.5            -22.1            -31.7

    Hidden -> Output
    bias                     16.1
    1. hidden unit           36.2
    2. hidden unit          -25.4
    3. hidden unit          -35.8

Data for the n-th individual consist of a vector x^{(n)} = (x_1^{(n)}, ..., x_p^{(n)})' of regressor variables and a vector y^{(n)} = (y_1^{(n)}, ..., y_{K_n}^{(n)})', where y_k^{(n)} is an indicator for individual n dying in the interval I_k and K_n \le K is the number of intervals in which individual n is observed. Thus y_1^{(n)}, ..., y_{K_n - 1}^{(n)} are all zero, and y_{K_n}^{(n)} is equal to 1 if this individual died in I_{K_n} and equal to 0 if it is censored. This situation implies that the network has a randomly varying number of output nodes according to those time intervals where an individual is "at risk". This is a standard problem of survival analysis[49] and can easily be handled by using a slight modification of distance measure (6).

In particular, setting

    y_k = f_k(x; w) = \Lambda(\beta_{0k} + \sum_{i=1}^p \beta_{ik} x_i),

where w summarizes \beta_{01}, ..., \beta_{0K}, \beta_{11}, ..., \beta_{pK}, we obtain

    D*_{KL}(w) = \sum_{n=1}^N \sum_{k=1}^{K_n} [ y_k^{(n)} \log( y_k^{(n)} / f_k(x^{(n)}; w) ) + (1 - y_k^{(n)}) \log( (1 - y_k^{(n)}) / (1 - f_k(x^{(n)}; w)) ) ].

This could also be written as a sum over all K output units by introducing an indicator for individual n being at risk in the k-th time interval I_k, k = 1, ..., K, as is usually done in survival analysis.[2] The proportional hazards assumption can then be implemented in (7) through the constraints

    \beta_{ik} = \beta_i    for i = 1, ..., p and k = 1, ..., K,

whereas the parameters \beta_{0k} are allowed to differ for k = 1, ..., K.
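To make the modified distance concrete, the sketch below evaluates D*_{KL} from an at-risk representation of invented toy data. The function name and the constant-hazard probability matrix f are our own illustrative choices; f[n, k] stands for \Lambda(\beta_{0k} + \sum_i \beta_{ik} x_i^{(n)}).

    import numpy as np

    def d_kl_survival(y, at_risk, f):
        """Modified Kullback-Leibler distance D*_KL for grouped survival
        data. y[n, k] = 1 if individual n died in interval I_k;
        at_risk[n, k] = 1 while individual n is still under observation;
        f[n, k] is the conditional dying probability
        P(T in I_k | T >= t_{k-1}, x)."""
        ll = y * np.log(f) + (1 - y) * np.log(1 - f)
        return -np.sum(at_risk * ll)    # intervals after censoring drop out

    # Invented toy data: 3 individuals, K = 4 intervals.
    y = np.array([[0, 0, 1, 0],     # died in I_3
                  [0, 0, 0, 0],     # censored after I_2
                  [1, 0, 0, 0]])    # died in I_1
    at_risk = np.array([[1, 1, 1, 0],
                        [1, 1, 0, 0],
                        [1, 0, 0, 0]])
    f = np.full((3, 4), 0.2)        # constant hazard as a placeholder model
    print(d_kl_survival(y, at_risk, f))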


There have been some other, naive proposals for analysing grouped survival data by means of a feed-forward neural net. In order to predict the outcome (death or recurrence) of individual breast cancer patients, Ravdin & Clark[74] and Ravdin et al.[73] use a network with only one output unit, but with the number k of the time interval as an additional input. Moreover, they consider the unconditional probability of dying before t_k, rather than the conditional one, as output. Their underlying model then reads

    P(T < t_k | x) = \Lambda(\beta_0 + \sum_{i=1}^p \beta_i x_i + \beta_{p+1} k)    (8)

for k = 1, ..., K. This parametrization ensures monotonicity of the survival probabilities, but it also implies a rather stringent and unusual shape of the survival distribution, since in the case that no covariates are considered (8) reduces to

    P(T < t_k) = \Lambda(\beta_0 + \beta_{p+1} k)

for k = 1, ..., K. Obviously, the survival probabilities do not depend on the lengths of the time intervals I_k, which is a rather strange and undesirable feature. Including a hidden layer in (8) is a straightforward extension that still has all the features summarized above. De Laurentiis & Ravdin[25] call this type of neural network "time-coded models".

Another form of neural network that has been applied to survival data is the so-called "single time point model".[25] Since these are identical to a logistic perceptron or to a feed-forward neural network with a hidden layer (4), they correspond to fitting logistic regression models (2) or their generalizations (3) to survival data. In practice, a single time point t is fixed and the network is trained to predict the t-year survival probabilities. The corresponding model is given by

    P(T < t | x) = \Lambda(\beta_0 + \sum_{i=1}^p \beta_i x_i)

or its generalization when introducing a hidden layer. This approach is used by Burke[13] to predict 10-year survival of breast cancer patients based on various patient and tumor characteristics at the time of primary diagnosis. McGuire et al.[60] utilized this approach to predict 5-year disease-free survival of patients with axillary node-negative breast cancer based on seven potentially prognostic variables. Kappen & Neijt[50] applied this approach to predict 2-year survival of patients with advanced ovarian cancer[90] from 17 pretreatment characteristics. The neural network they actually used reduced to a logistic perceptron.

Of course, such a procedure can be applied repeatedly for the prediction of survival probabilities at fixed time points t_1 < t_2 < ... < t_K. For example, Kappen & Neijt[50] trained several (K = 6) neural networks to predict survival of patients with ovarian cancer after 1, 2, ..., 6 years.


The corresponding model reads

    P(T < t_k | x) = \Lambda(\beta_{0k} + \sum_{i=1}^p \beta_{ik} x_i)

in the case that no hidden layer is introduced; the incorporation of a hidden layer is straightforward. Note that without restrictions on the parameters such an approach does not ensure that the probabilities P(T < t_k | x) increase with k, and hence it may result in life-table estimators suggesting non-monotone survival functions. Closely related to such an approach are the so-called "multiple time point models",[25] where one neural network with K output units, with or without a hidden layer, is used.

The common drawback of these naive approaches is that they do not allow one to incorporate censored observations in a straightforward manner, which is closely related to the fact that they are based on unconditional survival probabilities instead of conditional survival probabilities, as in the Cox model. Neither omission of the censored observations, as suggested by Burke,[13] nor treating censored observations as uncensored is a valid approach; both are a serious source of bias, which is well known in the statistical literature. De Laurentiis & Ravdin[25] propose to impute estimated survival probabilities for the censored cases that could be derived from a Cox regression model, i.e. they use a well-established statistical procedure just to make an artificial neural network work. Future developments should be based on the considerations given by Liestøl et al.,[55] where the standard requirements for the analysis of survival data are incorporated. Faraggi & Simon,[34] for instance, use the regression model arising from a neural network with a hidden layer to define an extension of Cox's proportional hazards regression model, which they apply to the well-known prostate cancer data.[15]

Estimation of misclassification probabilities - an often overlooked problem

In constructing a new rule for prognosis or diagnosis, it is necessary not only to know the new rule, but also to know its accuracy. And, as one usually wants to show that the new rule is better than an existing one, it is very important to have an unbiased estimate of the accuracy, because if the estimate tends to overestimate the true accuracy, this may result in propagating the use of the new rule although the existing one is better. However, estimation of measures of accuracy is a nontrivial task.

One simple, meaningful and popular measure to describe the accuracy of a rule is its misclassification probability, i.e. the probability of classifying future objects incorrectly. The first estimate of the misclassification probability is the apparent error rate, defined as the relative frequency of incorrect classifications when we apply the new rule to the current sample, i.e. to the sample we have used to construct the rule. However, this rate tends to underestimate the true misclassification probability, because in fitting logistic models or neural nets, we make the rule as similar as possible to the current data, not to future data.


Figure 5: Scatterplot of an artificial example; class membership is indicated by the plotting symbol; the region with a high probability for Y = 1 in the upper right corner is fenced off by the dotted line.

To avoid this underestimation of the true misclassification probability, a widespread approach is to split the sample into two subsamples, the learning set and the test set. The learning set is used to construct the new rule, and the test set is used to estimate the misclassification probability by applying the new rule and counting the incorrect classifications. This test set based error rate is unbiased.

We illustrate the difference between the apparent error rate and the test set based error rate by a small artificial example. We consider a constellation with 5 covariates, which are independently and identically standard Gaussian distributed. The probability of an object belonging to one of the classes Y = 0 or Y = 1 depends only on the first two covariates, and it is given by

    P(Y = 1 | X = x) = 0.85 if x_1 > 0 and x_2 > 0, and 0.15 otherwise.

This reflects a situation not completely implausible in medical applications: if two specific variables both exceed a given threshold, we have a high probability that the object belongs to class Y = 1; otherwise this probability is small. We draw a sample of size 400 according to this law and split it randomly into a learning set and a test set.
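A reduced version of this artificial example can be run with off-the-shelf software. The following sketch (scikit-learn, an invented random seed, and a coarser grid of hidden-unit numbers than in Figure 6; all our own choices) draws data from the law above and contrasts the apparent error rate with the test set based error rate.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 5))                  # 5 standard Gaussian covariates
    p1 = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 0.85, 0.15)
    y = rng.binomial(1, p1)                        # class labels from the law above
    Xl, yl = X[:200], y[:200]                      # learning set
    Xt, yt = X[200:], y[200:]                      # test set

    for r in (1, 2, 3, 5, 10, 15):                 # 5-r-1 networks
        net = MLPClassifier(hidden_layer_sizes=(r,), activation="logistic",
                            alpha=0.0, max_iter=5000, random_state=0).fit(Xl, yl)
        apparent = np.mean(net.predict(Xl) != yl)  # optimistic
        test = np.mean(net.predict(Xt) != yt)      # unbiased
        print(r, round(apparent, 3), round(test, 3))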


Figure 5 shows the distribution of the first two covariates and the class membership indicator in the learning set. To the data of the learning set we fit feed-forward neural nets with a logistic activation function and one hidden layer, with the number of hidden units r varying between 0 and 15. Figure 6 shows the apparent error rates observed in the learning sample and the test set based error rates. We observe that the apparent error rates decrease with an increasing number of hidden units and hence suggest (incorrectly) choosing a neural net with many hidden units, whereas the unbiased test set based error rates suggest choosing a neural net with 2 or 3 hidden units and indicate that neural nets with a large number of hidden units have a higher misclassification probability. This is what we have to expect if we remember the results of Section 3. With an increasing number of hidden units we fit more and more implausible functions, which are further and further away from the true law f, and hence the misclassification probability increases. Conversely, the functions become more and more similar to the data of the learning set, and the apparent error rate decreases. As we know the true data-generating law in our artificial example, we can compute for each rule the true misclassification probability, which is also shown in Figure 6. The test set based error rates vary around the true probabilities, which illustrates their unbiasedness.

Although such a splitting into a learning set and a test set is widespread, it is in general wasteful of information. One should be aware that accurate estimation of misclassification probabilities requires large samples. For example, estimating a true misclassification probability of 20% with an absolute standard error of 1% requires a sample size of 1600. Hence it is always desirable to use the complete sample both for the construction of a rule and for the estimation of misclassification probabilities. There exist several techniques for this task; the most well known is cross validation. The idea of cross validation is to split the complete data set several times into a learning set and a test set, to perform estimation of the model parameters (here the "learning" of the weights) in each learning set and estimation of the misclassification probability in the corresponding test set, and finally to use the average of the test set based error rates as an estimate of the misclassification probability of the rule derived from estimating the model parameters in the complete sample. Usually the sample is split k times with k disjoint test sets, and we speak of k-fold cross validation. If each test set contains only one data point, we speak of leave-one-out cross validation. To illustrate the power of this approach, we have used it in the above example to estimate the misclassification probabilities using only the learning set. The resulting error rates based on leave-one-out cross validation are also shown in Figure 6. We can observe that these estimates vary around the true misclassification probability in the same manner as the test set based estimates, so that they are of comparable quality.

In developing rules based on neural nets, the computation of many test set based error rates is often an intermediate step in the process of finding the optimal number of hidden units. The neural net with the smallest test set based error rate is then selected as the new rule.
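Cross validation as just described takes only a few lines of code. The sketch below, our own illustration, estimates the misclassification probability of a rule with r hidden units by k-fold cross validation using only a learning set such as (Xl, yl) from the previous sketch; a library routine such as scikit-learn's cross_val_score would serve equally well.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def cv_error(X, y, r, k=10, seed=0):
        """k-fold cross validation estimate of the misclassification
        probability of a rule based on a network with r hidden units."""
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)             # k disjoint test sets
        errors = []
        for test in folds:
            train = np.setdiff1d(idx, test)
            net = MLPClassifier(hidden_layer_sizes=(r,), activation="logistic",
                                max_iter=5000, random_state=0)
            net.fit(X[train], y[train])
            errors.append(np.mean(net.predict(X[test]) != y[test]))
        return np.mean(errors)                     # average test set based error

    # e.g. cv_error(Xl, yl, r=3) for the learning set of the example above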


Figure 6: Error rates and true misclassification probabilities, plotted against the number r of hidden units, of rules based on fitting feed-forward neural networks with one hidden layer and logistic activation function; shown are the true misclassification probability, the learning set error rate, the test set based error rate and the cross validation error rate. (The feed-forward neural network with no hidden units is the simple logistic perceptron and is equivalent to a simple linear logistic regression model.)

A common mistake is now to report the minimal test set based error rate, i.e. the rate observed for the selected rule, as an estimate for the misclassification probability in classifying future objects using the selected rule. However, this is a biased estimate, and it tends to underestimate the true misclassification probability. A closer look at Figure 6 illustrates the reason: the test set based error rates vary around the true misclassification probabilities, and by selecting the rule with the minimal rate (here the rule based on a 5-2-1 or 5-3-1 network) we have a high chance of selecting a rate smaller than the corresponding true rate. Because a single example may not be convincing, we performed a small simulation study, in which we generated 500 pairs of learning and test sets of size 200 according to the above law. We always fitted the 16 different neural nets as above and selected the one with the minimal test set based error rate. On average we observe a true misclassification probability of 0.233, but a test set based error rate of only 0.211, i.e. an average difference of 0.022. Looking at the ratio between the test set based error rate of the selected rule and the true misclassification probability of the selected rule, we observe on average an underestimation of 9%. Moreover, in 75% of the repetitions the test set based error rate of the selected rule was smaller than the true misclassification probability. If we consider the same problem with a smaller sample size of 100 for the learning set and the test set (and many examples of the use of neural nets in medical applications are even smaller, cf. the next section), then the results are even worse: we observe an average difference of 0.036 and an average relative underestimation of 13%.
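The selection effect does not depend on neural networks at all: it is the familiar downward bias of the minimum of noisy, individually unbiased estimates. The following schematic simulation, with invented true misclassification probabilities for 16 candidate rules and binomially distributed test set error rates, reproduces the pattern reported above.

    import numpy as np

    rng = np.random.default_rng(4)
    # Assumed true misclassification probabilities of 16 candidate rules:
    true_p = np.linspace(0.22, 0.30, 16)
    n_test, reps = 200, 500
    gap = []
    for _ in range(reps):
        # test set based error rate of each rule: binomial noise around true_p
        observed = rng.binomial(n_test, true_p) / n_test
        best = np.argmin(observed)          # select the rule with minimal rate
        gap.append(true_p[best] - observed[best])
    print(np.mean(gap))    # positive: the reported minimal rate is biased low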


A general strategy to avoid this underestimation is to split the available sample into three subsamples: the learning set, the validation set and the test set.[69,77] The learning set is used to estimate the parameters of the different models, i.e. here to "learn" the weights of networks with varying numbers of hidden units. The validation set is used for an unbiased estimation of the misclassification probability of the rule based on the parameter estimates obtained in the learning set for each model. Then the rule with the smallest error rate is selected, and finally the test set is used to obtain an unbiased estimate of the misclassification probability of the selected rule. This is a nice paradigm, but wasteful of information. Again, cross validation allows one to make use of all the data. However, it requires a computationally very intensive nested loop: in each subset with one observation left out, we have to repeat the complete process of selecting a rule, i.e. for each considered model we have to estimate first the parameters and then the error rate by cross validation. From this point of view, alternative methods that allow a direct selection among different models are helpful. Such a selection can be achieved by adding to the minimal Kullback-Leibler distance achieved in fitting each single model a penalty that depends on the number of parameters of the model. The most prominent measures following this idea are Akaike's information criterion[1] and its analogue, the network information criterion.[64] Using such criteria, we need cross validation only to assess the misclassification probability of the resulting rule.

Besides the misclassification probability, other measures of accuracy are possible. Especially in diagnostic studies, one should additionally report the sensitivity and specificity of a new rule, or the whole receiver operating characteristic (ROC).[26] If, as is often the case in diagnostic studies, the sample is not drawn from one population but separately from a population of diseased and a population of undiseased subjects, these are the only meaningful measures, as the misclassification probability depends on the prevalence, i.e. on the arbitrary rate at which diseased and undiseased subjects are selected.

Frequently made mistakes

We reviewed the medical literature between 1991 and 1995 in order to examine applications of feed-forward neural networks for prognostic and diagnostic classification in oncological studies. A Medline search conducted in May 1996 yielded 173 articles in English or German concerning the use of artificial neural networks and artificial intelligence in oncology.


After reading the available abstracts, 43 articles, summarized in Table 3, remained in which feed-forward neural nets were used for classification; some of them considered statistical methods as alternatives. The topics of these articles were diagnosis of cancer (23 articles), automatic tumor grading (6 articles) and prognosis (10 articles), plus 3 reviews and 2 commentaries. Most of the excluded articles dealt with the application of artificial intelligence in image analysis and the use of expert systems for classification. It became apparent to us that some mistakes occurred over and over again. These mistakes are indicated in Table 3 and are described in detail in the sequel.

1) Mistakes in estimation of misclassification probabilities

a) Biased estimation

The dramatic underestimation of the apparent error rate seems to be well known; in none of the articles was the apparent error rate used as an estimate of the misclassification probability. For the estimation, either an independent test set, cross validation or another resampling method was always used. However, in most cases the network with the smallest test set based error rate or the smallest cross validation error rate was selected, and this minimal error rate was regarded as a valid estimate of the true misclassification probability, which is not true, as shown in the last section. There are only a few articles following the paradigm of a split into learning, validation and test set, which ensures unbiased error rates.

b) Inefficient estimation

In many articles the size of the test set was very small; in 6 articles it contained no more than 20 observations. Consequently, the estimates of the misclassification probabilities are highly unstable, but this problem is totally ignored. In general, all applications using a split sample approach could have been improved by using cross validation techniques.

2) Fitting of implausible functions

In Section 3 we demonstrated that fitting neural nets with many hidden units results in implausible functions to describe the probability of class membership, a phenomenon often called "overfitting" in the statistical literature. There exist no generally valid limits at which overfitting begins, but a key number is the ratio between the number of observations and the number of parameters in a model. In the example of Figure 3 with 39 observations, we can observe distinct overfitting for a 1-5-1 network with 16 parameters, and already some overfitting for a 1-3-1 network with 10 parameters. This agrees with traditional rules of thumb in the statistical literature, requiring 5 or 10 observations for each parameter to be estimated, which are also considered in the literature on neural networks.[56] If we assume that overfitting occurs at least when the ratio between the number of observations and the number of parameters is smaller than 2, then in 23 of our 43 examples we must expect that the resulting rule suffers from overfitting, i.e. that it is based on a highly implausible function.
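The parameter counts used here follow directly from the architecture: with biases, consecutive layers with a and b units contribute (a+1)b weights, and a p-0-K network without hidden layer connects inputs directly to outputs with (p+1)K weights. The small helper below, a sketch of our own rather than anything from the cited articles, reproduces the counts quoted above and the weight column of Table 3 for networks with at least one unit per layer.

    def n_weights(*layers):
        """Number of parameters of a feed-forward network with biases,
        given the unit counts per layer, e.g. n_weights(1, 5, 1) for 1-5-1."""
        return sum((a + 1) * b for a, b in zip(layers, layers[1:]))

    print(n_weights(1, 3, 1))       # 10, as for the 1-3-1 network above
    print(n_weights(1, 5, 1))       # 16, as for the 1-5-1 network above
    print(n_weights(21, 10, 3))     # 253, cf. Maclin (1991) in Table 3
    print(n_weights(6, 12, 9, 3))   # 231, cf. Ostrem (1991) in Table 3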


Table 3: Applications of feed-forward neural networks for diagnostic and prognostic classification in oncological studies.

    Year  First author    Study  No. of individuals               Network     No. of   Common
                          type   Training  Validation  Test                   weights  mistakes
    1991  Dawson[27]      g      12        --          31         17-4-4      92       2, 5
          Maclin[57]      d      52        leave-one-out          21-10-3     253      2
          Ostrem[68]      g      87        --          31         6-12-9-3    231      2
          Piraino[70]     d      110       --          44         20-40-55    3095     2
    1992  Astion[3]       d      57        --          20         9-15-2      182      1, 2, 3, 5, 6
          Cicchetti[16]   r      Commentary                       --          --       3
          Goldberg[41]    d      200       leave-one-out          3-2-1       11       --
          McGuire[60]     p      133       --          66         7-6-2       62       1
          Nafe[65]        d      58        leave-one-out          8-7-2 ...   79 ...   2, 5, 6
                                                                  40-14-5     649
          O'Leary[67]     d      36        --          19         2-4-1       17       1
          Ravdin[74]      p      500       420         453        8-6-1       61       7
          Ravdin[73]      p      508       500         960        8-4-1       41       7
    1993  Kappen[50]      p      269       10% of the data        17-0-1      18       6, 7
          Schweiger[84]   d      90        --          16         1-5-2 ...   22 ...   1, 2, 3, 4, 5, 6
                                                                  7-15-5      200
    1994  Becker[8]       d      36        --          19         2-4-1       17       1, 3, 5
          Bugliosi[12]    p      153       --          19         19-?-?      ?        1, 4
          Burke[13]       p      3500      3500        3500       3-9-1       46       7
          Burke[14]       r      Review                           --          --       4
          Clark[19]       r      Review                           --          --       --
          Ercal[32]       d      216; 240  40%, 60%, 80%          14-7-1      113      2
                                           of the data            8-4-1       41
          Erler[33]       d      45        --          45         5-3-1 or    22 or    2, 5, 6
                                                                  5-5-2       42
          Floyd[37]       d      260       leave-one-out          8-16-1      161      1, 2
          Giger[40]       d      53        leave-one-out          2-4-1       17       --
          Kolles[51]      g      ?         ?           ?          ?-?-?       ?        4
          Maclin[58]      d      72        leave-four-out         35-35-35-5  2700     1, 2
          Reinus[75]      d      709       4-fold crossval.       95-0-45     4320     1, 2


Table 3 (continued):

    Year  First author       Study  No. of individuals               Network      No. of   Common
                             type   Training  Validation  Test                    weights  mistakes
    1994  Rogers[78]         r      Review                           --           --       --
          Snow[83]           d      1578      --          209        14-50-1      801      1, 2
                             p      938       5% of the data         12-5-1       71
          Wilding[92]        d      104; 98   5-fold crossval.       3-6-2 ...    38 ...   2
                                                                     10-10-2      132
          Wu[93]             d      727       leave-one-out          9-8-1        89       1, 5
    1995  Attikiouzel[4]     p      78        --          258        6-18-1 ...   145 ...  1, 2
                                    65        --          248        4-12-1       73
                                    477       --          952
          Baker[5]           d      206       leave-one-out          18-10-1      201      1, 2
          Christy[18]        g      52        --          29         10-10-2      132      2, 5
          Doyle[29]          p      528       5-fold crossval.       33-48-30-1   3133     1, 2
          Fogel[38]          d      400       --          283        9-2-1        23       --
          Gurney[42]         d      318       271, bootstrap         7-3-1        28       --
          Hamamoto[43]       p      54        --          11         9-14-1       155      1, 2
          Kolles[52]         g      226       --          679        4-30-10-3    493      1, 2, 5
          Moul[63]           g      93        9-fold crossval.       4-1-1 ...    7 ...    1, 2
                                                                     7-14-1       127
          Niederberger[66]   r      Commentary                       --           --       --
          Simpson[82]        d      64        --          27         ?-?-?        ?        1, 4, 5, 6
          Strotzer[85]       d      115       --          115        28-?-1 ...   ?        2, 4
                                    45                               28-?-23      ?
          Truong[86]         d      56        --          56         7-10-1       91       1, 2

    Study type: d = diagnosis/differential diagnosis; g = automatic grading/cytology;
    p = prognosis/survival; r = review/commentary.

    Common mistakes:
    1) Mistakes in estimation of misclassification probabilities
    2) Fitting of implausible functions
    3) Incorrectly describing the complexity of a network
    4) No information on complexity of the network
    5) Use of inadequate statistical competitors
    6) Insufficient comparison with statistical method
    7) Naive application of artificial neural networks to survival data


3) Incorrectly describing the complexity of a network

The above rules of thumb are also used in the literature on neural networks. In the selected applications, several authors tried to forestall possible criticism with respect to overfitting by reporting the ratio between the number of observations and the complexity of the network. However, to measure the complexity they use only the number of input units, which is usually much smaller than the number of parameters needed to describe the network. Hence the reported ratios are much too large and the danger of overfitting is underestimated.

4) No information on complexity of the network

In order to enable the reader to judge the danger of an overfitted rule using some rule of thumb, it is necessary to know the number of hidden layers and hidden units, so that the number of fitted weights can be calculated. However, some publications even miss this basic information.

5) Use of inadequate statistical competitors

Some authors compared the performance of a feed-forward neural network with that of a statistical competitor. Linear or quadratic discriminant analysis was the statistical competitor used in all but one application. All results reported were in favour of the artificial neural network. This is not astonishing, since feed-forward neural networks are highly flexible classification tools and "may yield better classification results than DFs (discriminant functions) since they form complex, nonlinear associations between input data and output results".[33] An adequate and fair comparison of the performance of feed-forward neural networks and statistical methods must be based on statistical tools of similar flexibility, like nearest-neighbour methods,[44] generalized additive models,[45] CART[9] or logistic regression models with quadratic terms and multiplicative interaction terms. Such comparisons were done by Ripley[76] and Vach et al.,[89] for example, and demonstrated that the discriminative power of neural networks and of flexible statistical methods does not differ.

6) Insufficient comparison with statistical method

Astion and Wilding[3] concluded that "results from neural networks compare favorably with those obtained for the same sets of patients as analyzed by QDFA (quadratic discriminant function analysis)."


This conclusion was solely based on 4 (feed-forward neural network) and 5 (quadratic discriminant analysis) misclassified observations in a test set of 20 observations, i.e. on highly unstable estimates of the misclassification probabilities. No attempt was made to establish the significance of the difference between the observed misclassification rates, which is necessary for the conclusion that one rule outperforms the other. Such testing can be done using the McNemar test, for example. We observed this practice of evaluating the performance of a feed-forward neural network and a statistical competitor by simply inspecting the estimates of the misclassification probability in several applications.

7) Naive application of artificial neural networks to survival data

All applications of feed-forward neural networks to survival data[13,50,74,73] suffer from the deficiencies mentioned in Section 4. The approach used by Kappen & Neijt[50] does not guarantee monotonicity of the estimated survival curves. Burke[13] omitted the censored cases, which is a serious source of bias. Ravdin & Clark[74] and Ravdin et al.[73] used the number of the time interval as an additional input unit; hence the estimated survival probabilities do not depend on the lengths of the time intervals. These obvious deficits show that there is some danger that the fruitful developments in statistical methodology for survival data during the last three decades might be wasted by such naive applications of neural networks to these data.

Conclusions

We have shown that most applications of artificial neural networks for prognosis and diagnosis in oncology during the last years suffer from methodological deficiencies. In many applications some overfitting probably occurs, resulting in biologically implausible functions describing the class membership probability. In many applications the finally reported error rate for the selected neural network underestimates the true misclassification probability. Moreover, we often observed a fundamental misunderstanding of the mathematics of neural networks and of basic statistical principles. Examples are the use of two output units in the case of two classes, which gives the same result as using only one output unit; the fitting of non-monotone survival curves; misunderstanding of sensitivity and specificity measures; and error rates based on very small samples, ignoring the instability of such estimates. One might expect that review papers and textbooks on artificial neural networks available to any potential user (e.g. Cheng,[17] Hecht-Nielsen,[46] Ripley[76]), which warn against the common mistakes we have mentioned, would lead to a better understanding and use of artificial neural networks. However, until now we cannot observe any improvement over time in the quality of applications of artificial neural networks in oncological studies (cf. Table 3).


Incidentally, we came across a paper by Peter Lachenbruch entitled "Some misuses of discriminant analysis",[54] in which he identified frequently made mistakes of a similar type. This paper was written in 1977, at a time when statistical packages like BMDP, SAS or SPSS became broadly available also to non-specialists. This seems to have been a situation similar to today's, where numerous user-friendly neural network packages are available which can in principle be used by everyone who is able to click the mouse button of his personal computer.

The application of artificial neural networks in biomedical applications is often accompanied by overwhelming outlooks, praising neural networks as the ultimate solution to the problem of diagnosis and prognosis. For example, neural networks' "ability to learn ... make them formidable tools in the fight against cancer"[13] and "neural computation may be as beneficial to medicine and urology in the twenty-first-century as molecular biology has been in the twentieth".[66]

In our opinion, there is neither evidence nor hope that artificial neural networks can provide real progress in the field of diagnosis and prognosis in oncology. We have tried to demonstrate that feed-forward neural networks are nothing else than regression models, like logistic regression models, with the only difference that feed-forward neural networks (with hidden layers) provide a larger class of regression functions than linear logistic models. This is often referred to as the greater flexibility of neural networks. However, greater flexibility is only of value if the true regression function is far away from that of a linear logistic model. Small deviations from a linear logistic model do not matter, because, due to the usually small sample sizes of a few hundred typical in oncological applications, such a difference is small relative to the random errors. Large deviations, especially functions with many jumps, are not very plausible, because biological relationships tend to be smooth. Hence one cannot expect the greater flexibility of neural networks to outperform logistic models, especially if the latter are combined with careful model building, allowing quadratic or higher interaction terms, for example. Additionally, such logistic models still allow one to describe and evaluate the influence of the single covariates, in contrast to neural networks. This information is often of equal importance as a decision rule, as it may allow a better understanding of the underlying process.

How can we explain that, nevertheless, the comparisons with statistical procedures made in many of the selected articles almost always give results favouring neural networks? In our opinion, this is due to some methodological deficiencies. Usually many different neural networks are fitted on the same data set. Without using an independent test set, the minimal error rate observed among the neural nets underestimates the true misclassification probability. In contrast, only one or two statistical procedures are used for a comparison, and consequently here the error rates are not biased. Moreover, it is absolutely not fair to allow an intensive model selection procedure for the neural networks, but not for the statistical competitors. Indeed, investigations comparing neural networks with flexible statistical methods that also allow some model building do not indicate great differences.


One should also be aware that all the results favouring neural nets are based on a comparison with statistical procedures. To prove the significance of neural nets for oncology, one should at least find one example where neural nets succeed in finding a new rule, proven to be better than an existing one in the literature and depending on the same covariates, which is undetectable by statistical procedures. All examples in our list avoid the comparison with an existing prognostic index or diagnostic rule, and hence are far away from demonstrating the value of neural networks for medical research in oncology.

References

1. Akaike H. Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics. 1969;21:243-7.
2. Andersen PK, Borgan Ø, Gill RD, Keiding N. Statistical Models Based on Counting Processes. Springer Series in Statistics, New York: Springer-Verlag, 1993.
3. Astion ML, Wilding P. Application of neural networks to the interpretation of laboratory data in cancer diagnosis. Clin Chem. 1992;38:34-8.
4. Attikiouzel Y, deSilva CJ. Applications of neural networks in medicine. Australas Phys Eng Sci Med. 1995;18:158-64.
5. Baker JA, Kornguth PJ, Lo JY, Williford ME, Floyd CE. Breast cancer: prediction with artificial neural network based on BI-RADS standardized lexicon. Radiology. 1995;196:817-22.
6. Barron AR, Barron RL. Statistical learning networks: a unifying view. In: Wegman E, ed. Computing Science and Statistics: Proceedings 20th Symposium Interface. Washington: American Statistical Association, 1988.
7. Baxt WG. Application of artificial neural networks to clinical medicine. Lancet. 1995;346:1135-8.
8. Becker RL. Computer-assisted image classification: use of neural networks in anatomic pathology. Cancer Lett. 1994;77:111-7.
9. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Monterey, California: Wadsworth and Brooks, 1984.
10. Bridle JS. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In: Fogleman Soulié F, Hérault J, eds. Neuro-Computing: Algorithms, Architectures and Applications. Berlin: Springer, 1990.
11. Bridle JS. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In: Touretzky DS, ed. Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann, 1990.


12. Bugliosi R, Tribalto M, Avvisati G et al. Classification of patients affected by multiple myeloma using a neural network software. Eur J Haematol. 1994;52:182-3.
13. Burke HB. Artificial neural networks for cancer research: outcome prediction. Semin Surg Oncol. 1994;10:73-9.
14. Burke HB. Increasing the power of surrogate endpoint biomarkers: the aggregation of predictive factors. J Cell Biochem. 1994;19, Supplement:278-82.
15. Byar DP, Green SB. The choice of treatment for cancer patients based on covariate information: applications to prostate cancer. Bull Cancer. 1980;67:477-90.
16. Cicchetti DV. Neural networks and diagnosis in the clinical laboratory: state of the art. Clin Chem. 1992;38:9-10.
17. Cheng B, Titterington DM. Neural networks: a review from a statistical perspective (with discussion). Stat Sci. 1994;9:2-54.
18. Christy PS, Tervonen O, Scheithauer BW, Forbes GS. Use of a neural network and a multiple regression model to predict histologic grade of astrocytoma from MRI appearances. Neuroradiology. 1995;37:89-93.
19. Clark GM, Hilsenbeck SG, Ravdin PM, De Laurentiis M, Osborne CK. Prognostic factors: rationale and methods of analysis and integration. Breast Cancer Res Treat. 1994;32:105-12.
20. Cox DR. Analysis of Binary Data. London: Methuen, 1970.
21. Cox DR. Regression models and life tables (with discussion). J R Statist Soc. 1972;B 34:187-220.
22. Cox DR, Snell EJ. Analysis of Binary Data, 2nd edition. London: Chapman & Hall, 1992.
23. Cross SS, Harrison RF, Kennedy RL. Introduction to neural networks. Lancet. 1995;346:1075-9.
24. Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Controls, Signals, and Systems. 1989;2:303-14.
25. De Laurentiis M, Ravdin PM. Survival analysis of censored data: neural network analysis detection of complex interactions between variables. Breast Cancer Res Treat. 1994;32:113-8.
26. De Leo JM, Campbell G. Receiver operating characteristic (ROC) methodology in artificial neural networks with biomedical applications. Proceedings of the World Congress on Neural Networks. 1995.


27. Dawson AE, Austin RE, Weinberg DS. Nuclear grading of breast carcinoma by image analysis. Classification by multivariate and neural network analysis. Am J Clin Pathol. 1991;95, Supplement:S29-37.
28. Doksum KA, Gasko M. On a correspondence between models in binary regression analysis and in survival analysis. International Statistical Review. 1990;58:243-52.
29. Doyle HR, Parmanto B, Munro PW et al. Building clinical classifiers using incomplete observations: a neural network ensemble for hepatoma detection in patients with cirrhosis. Methods Inf Med. 1995;34:253-8.
30. Dybowski R, Gant V. Artificial neural networks in pathology and medical laboratories. Lancet. 1995;346:1203-7.
31. Engel J. Polytomous logistic regression. Statistica Neerlandica. 1988;42:233-52.
32. Ercal F, Chawla A, Stoecker WV, Lee H, Moss RH. Neural network diagnosis of malignant melanoma from color images. IEEE Trans Biomed Eng. 1994;41:837-45.
33. Erler BS, Hsu L, Truong HM et al. Image analysis and diagnostic classification of hepatocellular carcinoma using neural networks and multivariate discriminant functions. Lab Invest. 1994;71:446-51.
34. Faraggi D, Simon R. A neural network model for survival data. Stat Med. 1995;14:73-82.
35. Fahrmeir L, Tutz G. Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer, 1994.
36. Finney DJ. The estimation from individual records of the relationship between dose and quantal response. Biometrika. 1947;34:320-34.
37. Floyd Jr. CE, Lo JY, Yun AJ, Sullivan DC, Kornguth PJ. Prediction of breast cancer malignancy using an artificial neural network. Cancer. 1994;74:2944-8.
38. Fogel DB, Wasson 3rd EC, Boughton EM. Evolving neural networks for detecting breast cancer. Cancer Lett. 1995;96:49-53.
39. Funahashi K. On the approximate realization of continuous mappings by neural networks. Neural Networks. 1989;2:183-92.
40. Giger ML, Vyborny CJ, Schmidt RA. Computerized characterization of mammographic masses: analysis of spiculation. Cancer Lett. 1994;77:201-11.
41. Goldberg V, Manduca A, Ewert DL, Gisvold JJ, Greenleaf JF. Improvement in specificity of ultrasonography for diagnosis of breast tumors by means of artificial intelligence. Med Phys. 1992;19:1475-81.


42. Gurney JW, Swensen SJ. Solitary pulmonary nodules: determining the likelihood of malignancy with neural network analysis. Radiology. 1995;196:823-9.
43. Hamamoto I, Okada S, Hashimoto T, Wakabayashi H, Maeba T, Maeta H. Prediction of the early prognosis of the hepatectomized patient with hepatocellular carcinoma with a neural network. Comput Biol Med. 1995;25:49-59.
44. Hand DJ. Discrimination and Classification. New York: Wiley, 1981.
45. Hastie TJ, Tibshirani RJ. Generalized Additive Models. London: Chapman & Hall, 1990.
46. Hecht-Nielsen R. Neurocomputing. New York: Addison-Wesley, 1990.
47. Hinton GE. Connectionist learning procedures. Artificial Intelligence. 1989;40:185-234.
48. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks. 1989;2:359-66.
49. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: Wiley, 1980.
50. Kappen HJ, Neijt JP. Neural network analysis to predict treatment outcome. Ann Oncol. 1993;4, Supplement:S31-4.
51. Kolles H, von Wangenheim A, Niedermayer I, Vince GH, Feiden W. Computer assisted grading of astrocytoma. Verh Dtsch Ges Pathol. 1994;78:427-31.
52. Kolles H, von Wangenheim A, Vince GH, Niedermayer I, Feiden W. Automated grading of astrocytomas based on histomorphometric analysis of Ki-67 and Feulgen stained paraffin sections. Classification results of neuronal networks and discriminant analysis. Anal Cell Pathol. 1995;8:101-16.
53. Kullback S, Leibler RA. On information and sufficiency. Annals of Mathematical Statistics. 1951;22:79-86.
54. Lachenbruch PA. Some misuses of discriminant analysis. Methods Inf Med. 1977;16:255-8.
55. Liestøl K, Andersen PK, Andersen U. Survival analysis and neural nets. Stat Med. 1994;13:1189-200.
56. Livingstone DJ, Manallack DT. Statistics using neural networks: chance effects. J Med Chem. 1993;36:1295-7.
57. Maclin PS, Dempsey J, Brooks J, Rand J. Using neural networks to diagnose cancer. J Med Syst. 1991;15:11-9.
58. Maclin PS, Dempsey J. How to improve a neural network for early detection of hepatic cancer. Cancer Lett. 1994;77:95-101.


59. McCullagh P, Nelder JA. Generalized Linear Models, 2nd edition. New York: Chapman & Hall, 1989.
60. McGuire WL, Tandon AK, Allred DC, Chamness GC, Ravdin PM, Clark GM. Treatment decisions in axillary node-negative breast cancer patients. Monographs - National Cancer Institute. 1992;11:173-80.
61. Mood AM, Graybill FA, Boes DC. Introduction to the Theory of Statistics, 3rd edition. Singapore: McGraw-Hill, 1974.
62. Morgenthaler S. Least-absolute-deviations fits for generalized linear models. Biometrika. 1992;79:747-54.
63. Moul JW, Snow PB, Fernandez EB, Maher PD, Sesterhenn IA. Neural network analysis of quantitative histological factors to predict pathological stage in clinical stage I nonseminomatous testicular cancer. J Urol. 1995;153:1674-7.
64. Murata N, Yoshizawa S, Amari SI. Network information criterion - determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks. 1994;5:865-72.
65. Nafe R, Choritz H. Introduction of a neuronal network as a tool for diagnostic analysis and classification based on experimental pathologic data. Exp Toxicol Pathol. 1992;44:17-24.
66. Niederberger CS. Commentary on the use of neural networks in clinical urology. J Urol. 1995;153:1362.
67. O'Leary TJ, Mikel UV, Becker RL. Computer-assisted image interpretation: use of a neural network to differentiate tubular carcinoma from sclerosing adenosis. Mod Pathol. 1992;5:402-5.
68. Ostrem JS, Valdes AD, Edmonds PD. Application of neural nets to ultrasound tissue characterization. Ultrason Imaging. 1991;13:298-9.
69. Penny W, Frost D. Neural networks in clinical medicine. Med Decis Making. 1996;16:386-98.
70. Piraino DW, Amartur SC, Richmond BJ et al. Application of an artificial neural network in radiographic diagnosis. J Digit Imaging. 1991;4:226-32.
71. Pregibon D. Resistant fits for some commonly used logistic models with medical applications. Biometrics. 1982;38:485-98.
72. Prentice RL, Gloeckler LA. Regression analysis of grouped survival data with application to breast cancer data. Biometrics. 1978;34:57-67.
73. Ravdin PM, Clark GM, Hilsenbeck SG et al. A demonstration that breast cancer recurrence can be predicted by neural network analysis. Breast Cancer Res Treat. 1992;21:47-53.


74. Ravdin PM, Clark GM. A practical application of neural network analysis for predicting outcome of individual breast cancer patients. Breast Cancer Res Treat. 1992;22:285-93.
75. Reinus WR, Wilson AJ, Kalman B, Kwasny S. Diagnosis of focal bone lesions using neural networks. Invest Radiol. 1994;29:606-11.
76. Ripley BD. Statistical aspects of neural networks. In: Barndorff-Nielsen OE et al., eds. Networks and Chaos - Statistical and Probabilistic Aspects. London: Chapman & Hall, 1993.
77. Ripley BD. Pattern Recognition and Neural Networks. Cambridge: University Press, 1996.
78. Rogers SK, Ruck DW, Kabrisky M. Artificial neural networks for early detection and diagnosis of cancer. Cancer Lett. 1994;77:79-83.
79. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Applied Statistics. 1994;43:429-67.
80. Sarle WS. Neural networks and statistical models. Proceedings of the 19th Annual SAS Users Group International Conference. 1994.
81. SAS Institute Inc. SAS/STAT User's Guide, Version 6, 4th edition. Cary, NC: SAS Institute Inc, 1990.
82. Simpson HW, McArdle C, Pauson AW, Hume P, Turkes A, Griffiths K. A non-invasive test for the pre-cancerous breast. Eur J Cancer. 1995;31A:1768-72.
83. Snow PB, Smith DS, Catalona WJ. Artificial neural networks in the diagnosis and prognosis of prostate cancer: a pilot study. J Urol. 1994;152:1923-6.
84. Schweiger CR, Soeregi G, Spitzauer S, Maenner G, Pohl AL. Evaluation of laboratory data by conventional statistics and by three types of neural networks. Clin Chem. 1993;39:1966-71.
85. Strotzer M, Krös P, Held P, Feuerbach S. Die Treffsicherheit künstlicher neuronaler Netze in der radiologischen Differentialdiagnose solitärer Knochenläsionen [The accuracy of artificial neural networks in the radiological differential diagnosis of solitary bone lesions]. Fortschritte auf dem Gebiete der Röntgenstrahlen und der Neuen Bildgebenden Verfahren. 1995;163:245-9.
86. Truong H, Morimoto R, Walts AE, Erler B, Marchevsky A. Neural networks as an aid in the diagnosis of lymphocyte-rich effusions. Anal Quant Cytol Histol. 1995;17:48-54.
87. Schumacher M, Roßner R, Vach W. Neural networks and logistic regression: Part I. Computational Statistics and Data Analysis. 1996;21:661-82.
88. Stern HS. Neural networks in applied statistics (with discussion). Technometrics. 1996;38:205-20.


89. Vach W, Roßner R, Schumacher M. Neural networks and logistic regression: Part II. Computational Statistics and Data Analysis. 1996;21:683-701.
90. Van Houwelingen JC, ten Bokkel Huinink WW, van der Burg ME, van Oosterom AT, Neijt JP. Predictability of the survival of patients with advanced ovarian cancer. J Clin Oncol. 1989;7:769-73.
91. Warner B, Misra M. Understanding neural networks as statistical tools. The American Statistician. 1996;50:284-93.
92. Wilding P, Morgan MA, Grygotis AE, Shoffner MA, Rosato EF. Application of backpropagation neural networks to diagnosis of breast and ovarian cancer. Cancer Lett. 1994;77:145-53.
93. Wu YC, Doi K, Giger ML, Metz CE, Zhang W. Reduction of false positives in computerized detection of lung nodules in chest radiographs using artificial neural networks, discriminant analysis, and a rule-based scheme. J Digit Imaging. 1994;7:196-207.
94. Wyatt J. Nervous about artificial neural networks? Lancet. 1995;346:1175-7.
95. Zell A, Mamier G, Vogt M et al. Stuttgart Neural Network Simulator, User Manual, Version 4.1, Report no. 6/95. Institute for Parallel and Distributed High Performance Systems, University of Stuttgart. 1995.
