IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 3, MAY 2002 497

Density Estimation and Random Variate Generation Using Multilayer Networks

Malik Magdon-Ismail, Member, IEEE, and Amir Atiya, Senior Member, IEEE

Abstract—In this paper we consider two important topics: density estimation and random variate generation. We present a framework that is easily implemented using the familiar multilayer neural network. First, we develop two new methods for density estimation, a stochastic method and a related deterministic method. Both methods are based on approximating the distribution function, the density being obtained by differentiation. In the second part of the paper, we develop new random number generation methods. Our methods do not suffer from some of the restrictions of existing methods in that they can be used to generate numbers from any density provided that certain smoothness conditions are satisfied. One of our methods is based on an observed inverse relationship between the density estimation process and random number generation. We present two variants of this method, a stochastic and a deterministic version. We propose a second method that is based on a novel control formulation of the problem, where a “controller network” is trained to shape a given density into the desired density. We justify the use of all the methods that we propose by providing theoretical convergence results. In particular, we prove that the convergence to the true density for both the density estimation and random variate generation techniques occurs at a rate $O\big((\log\log N/N)^{(1-\epsilon)/2}\big)$, where $N$ is the number of data points and $\epsilon$ can be made arbitrarily small for sufficiently smooth target densities. This bound is very close to the optimally achievable convergence rate under similar smoothness conditions. Also, for comparison, the $L_2$ root mean square (rms) convergence rate of a positive kernel density estimator is $O(N^{-2/5})$ when the optimal kernel width is used. We present numerical simulations to illustrate the performance of the proposed density estimation and random variate generation methods. In addition, we present an extended introduction and bibliography that serves as an overview and reference for the practitioner.

Index Terms—Convergence rate, density estimation, distribution function, estimation error, multilayer network, neural network, random number generation, stochastic algorithms.

I. INTRODUCTION

A MAJORITY of problems in science and engineering have to be modeled in a probabilistic manner. Even if the underlying phenomena are inherently deterministic, the complexity of these phenomena often makes a probabilistic formulation the only feasible approach from the computational point of view.

Manuscript received January 6, 2000; revised September 18, 2000 and February 14, 2001. This work was supported by the National Science Foundation Engineering Research Center at the California Institute of Technology, Pasadena.

M. Magdon-Ismail is with the Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180 USA (e-mail: [email protected]).

A. Atiya is with Countrywide Capital Markets, Calabasas, CA 91302 USA (e-mail: [email protected]).

Publisher Item Identifier S 1045-9227(02)03981-4.

Although quantities such as the mean, the variance, and possibly higher order moments of a random variable have often been sufficient to characterize a particular problem, the quest for higher modeling accuracy and for more realistic assumptions drives us toward modeling the available random variables using their probability density. This of course leads us to the problem of density estimation (see [51]). Examples of applications that require a density estimation step as an essential component include the following: the application of the optimal pattern classification procedure, the Bayes procedure (see [12] and [22]); the determination of the optimal detection threshold for the signal detection problem [64]; time series prediction applications where the density for a future data point is required rather than a single forecast for the value (see [61], [67]); clustering for unsupervised classifier design [22]; the design of optimal scalar quantizers for the signal encoding problem [26]; and finding estimates of confidence intervals and quantile levels for the parameter estimation and regression problems [34].

A. Existing Density Estimation Techniques

Traditional density estimation methods can be grouped into two broad categories (see [12], [22], and [51]). The first of these is the parametric approach. It assumes that the density has a specific functional form, such as a Gaussian or a mixture of Gaussians. The unknown density is estimated by using the data to obtain estimates for the parameters of the functional form. Typically, the parameters are estimated using maximum likelihood or Bayesian techniques. The drawback of the parametric approach is that the functional form of the density is rarely known beforehand, and the commonly assumed Gaussian or mixture of Gaussian models rarely fit densities that are encountered in practice.

The more common approach for density estimation is the nonparametric approach, where no functional form for the underlying density is assumed. Rather than expressing the density as a specific parametric form, it is determined according to a formula involving the data points available. The most common nonparametric method is the kernel density estimator, also known as the Parzen window estimator [43]. A problem with the kernel method is its extreme sensitivity to the choice of the kernel width, which acts as a regularization parameter. A wrong choice can lead to either under-smoothing or over-smoothing. Methods to estimate the kernel width are either asymptotic and given in terms of the unknown density, and thus impractical, or are based on cross-validation, and thus prone to statistical error. Another drawback of the kernel method is its tendency to exhibit bumpy behavior at the tails [51]. If the bumps are smoothed out by increasing the kernel width, then essential detail in the main part of the density is masked out, or the width of the main part of the density is exaggerated.
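For concreteness, the following is a minimal Python/NumPy sketch (not from the paper) of a Gaussian kernel (Parzen window) estimator; the two-Gaussian sample and the bandwidth values are illustrative and chosen only to show how strongly the estimate depends on the kernel width h.

```python
import numpy as np

def parzen_density(x, data, h):
    """Gaussian kernel (Parzen window) density estimate at the points x.

    data : 1-D array of N samples drawn from the unknown density
    h    : kernel width (bandwidth), acting as the regularization parameter
    """
    x = np.asarray(x)[:, None]          # (M, 1)
    d = np.asarray(data)[None, :]       # (1, N)
    kernels = np.exp(-0.5 * ((x - d) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)         # average over the N kernels

# Bandwidth sensitivity: too small -> bumpy tails, too large -> oversmoothed peaks.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(2, 1.0, 50)])
grid = np.linspace(-6, 6, 200)
for h in (0.05, 0.3, 2.0):
    print(h, parzen_density(grid, sample, h).max())
```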


Another common nonparametric approach is the k-nearest neighbor technique [25]. The k-nearest neighbor approach shares some of the drawbacks of the kernel density estimator, such as the difficulty of obtaining the best smoothing parameter k. In addition, the density estimate is not a smooth function, has a very heavy tail, and integrates to infinity on noncompact sets. One of the advantages of the kernel and k-nearest neighbor methods is their ease of implementation.

A number of other nonparametric or semiparametric methods have been proposed in the literature. For example, penalized likelihood methods (see for example [19], [52]) penalize the likelihood by some regularizing functional. Orthogonal expansions [16] and wavelet-based methods [14], [38] obtain density estimates by expanding in some suitable basis. Complexity-based approaches [6], [45] are similar to penalized likelihood methods in that they try to achieve a compromise between likelihood and simplicity. Methods using families of exponentials represent the density in terms of some exponential functions (see, for example, [8]). Histospline approaches use splines to fit the distribution function and obtain the density by differentiation [48], [66].

There have been a number of methods on using neural networks for density estimation. The majority of the approaches tend to be parametric in nature, and therefore share many of the limitations of the parametric approach discussed above. For example, Traven’s approach [62] and Cwik and Koronacki’s method [18] are based on the mixture-of-Gaussians density estimation method. Bishop and Legleye [13], Bishop [11], and Husmeier and Taylor [30] consider a mixture-of-Gaussians model where the mean and variance are estimated using a multilayer network. Williams [68] proposes a similar approach for the multidimensional case. Schioler and Kulczyki’s method [47] is based on the kernel estimation method. Roth and Baram [46] and Miller and Horn [39] developed elegant methods by training networks to maximize the entropy of the outputs. Van Hulle [63] developed an interesting technique based on a self-organizing approach, whereby the algorithm converges to a solution such that, at any point, the density of the weight vectors is an estimate of the unknown density. Martinez [37] used a competitive learning tree to create equiprobable quantizations of the input space. Modha and Fainman [40] proposed a multilayer network that is trained by maximizing the log-likelihood of the data. Weigend and Srivastava [67] proposed a fractional binning approach, developed in the context of time series prediction. It is based on partitioning the input space into bins, assigning a softmax output neuron for every bin, and training the network on the fraction of data falling in each respective bin. Smyth and Wolpert [54] suggested a method for combining the estimates of several density estimators (such as kernel estimators or mixture-of-Gaussian estimators). Zeevi and Meir [70] proposed the use of convex combinations of density estimators and analyzed their estimation error. Neural network based methods have also been developed for estimating discrete distributions; see for example Thathachar and Arvind [59] and Adali et al. [2].

B. Overview of the New Density Estimation Techniques

First we will propose two new methods for density estimation using multilayer networks. The approaches can be considered as semiparametric models. The reason is that the model is described in terms of a number of parameters (the weights of the network), rather than the actual data points, but at the same time the number of parameters can be increased in a way that would ultimately achieve an all-powerful model capable of approximating any (continuous) density function. One important advantage of multilayer networks over local methods such as the kernel estimator is that they have been shown in theory and in practical problems to be superior for high-dimensional problems. For example, Hornik et al. [28], [29] show that a neural network’s ability to simultaneously approximate a function and its derivatives does not decay with increasing input dimension. Modha and Masry [41] obtain a similar result for the case of neural networks approximating density functions. In both cases, the coefficient of convergence may depend on the dimension. Further, multilayer networks give us the flexibility to choose an error function to suit our application. The methods developed here are based on approximating the distribution function, in contrast to most previous works, which focus on approximating the density itself. Straightforward differentiation then gives us the estimate of the density function. One expects that estimating the distribution function is less sensitive to statistical fluctuations (since the integral operator that converts the histogram into the distribution function acts as a regularizer). The distribution function is often useful in its own right—one can directly evaluate quantiles or the probability that the random variable occurs in a particular interval. Although the methods by Wahba [66] are also based on approximating the distribution function, our approach is quite different and obtains a better convergence rate for the error than the cubic spline techniques presented there.

The first method that we propose is based on a stochastic algorithm (SLC), and the second method is a deterministic technique based on learning the cumulative (SIC). We show that these two techniques yield equivalent results. The stochastic technique will generally be smoother on smaller numbers of data points; however, the deterministic technique is faster and applies to the multidimensional case.

We analyze the consistency and the convergence rate of the estimation error for our methods in the univariate case. When the unknown distribution is bounded and has bounded derivatives up to order $K$, we prove that the estimation error is $O\big((\log\log N/N)^{(1-1/K)/2}\big)$, where $N$ is the number of data points. As a comparison, for the kernel density estimator (with nonnegative kernels) and the $k$th nearest neighbor density estimator, the (rms) estimation error is $O(N^{-2/5})$, under the assumptions that the unknown density has a square integrable second derivative (see [51]) and that the optimal kernel width is used, which is not possible in practice because computing the optimal kernel width requires knowledge of the true density. One can see that for smooth density functions with bounded derivatives, our methods achieve an error rate that is better than $N^{-1/2+\delta}$ for any $\delta > 0$.

We have found in the literature only the following methods that achieve comparable error rates. The first is a kernel estimator with the specific choice of a kernel whose $j$th-order moments vanish for $1 \le j \le s-1$, $s$ being even [51]. For such kernels, the root mean square (rms) convergence rate is $O(N^{-s/(2s+1)})$. A problem, however, with this method is that the kernel has negative parts. The other method is the adaptive kernel estimator [51], for which the rms error rate is theoretically $O(N^{-4/9})$. However, the error estimate for this method is based on using the optimum smoothing function, which is given in terms of the unknown density and its derivatives. Even if one uses pilot estimates of these unknown characteristics of the true density to determine the smoothing function, the resulting statistical error will leave a residual term that cannot entirely cancel out. In short, no positive kernel estimator can have a faster rms rate of convergence than $O(N^{-2/5})$ [8]. The error estimate of our method, in contrast to these methods, does not rely on optimizing parameters based on the unknown density (other than knowing $K$). A class of semiparametric methods that achieve comparable error rates has also been proposed (see [8], [15], [23], [48], [57]), the rate depending on the maximum order $s$ for which the derivative $F^{(s)}$ of the unknown distribution function $F$ exists. These methods have been shown to achieve the optimal convergence rates on compact sets (see [23], [55], [56]). Thus, our proposed methods achieve a convergence rate very close to the optimally achievable rate.

C. The Random Variate Generation Problem

In the second part of the paper, we consider the problem of generating a random variable according to a given density function. Monte Carlo simulation techniques in areas such as physics, biology, engineering, and economics rely on the ability to generate a random number from a given (or estimated) distribution efficiently. Examples of such applications include the simulation of a biological system, such as how nerve cells interact together; the simulation of a communication network where the packets are assumed to arrive according to a particular density function; the simulation of a control system where the plant disturbance is a random variable; and the simulation of the stock market in an attempt to estimate a valuation for some derivative instrument. In the past, the assumption of standard density functions such as Gaussian or exponential densities for these random variables often allowed analytical solutions for these problems. But now, with more accuracy needed, the next step has been to assume more realistic density functions for the random variables (possibly estimated from the data).

1) Existing Random Variate Generation Techniques: Unfortunately, there are very few techniques that can generate a random variable possessing an arbitrary density function (see [20] and [44] for reviews of the different approaches). One well known method is the transformation method: a uniformly distributed random number $z$ is generated, and this number is passed through the nonlinear transformation $x = F^{-1}(z)$, where $F$ is the given (cumulative) distribution function. The problem with this method is that in the majority of cases $F^{-1}$ (or even $F$) is not known analytically. Solving for $F^{-1}(z)$ numerically is not practical since speed is of the essence for the random number generation problem. In Monte Carlo simulations, random variables are typically generated millions of times in order to obtain statistically reliable results. There are many variations of the transformation method, such as obtaining transformations of several random variables (see [31] and [32]). These methods are largely ad hoc, and can be used to generate only some of the well-known density functions such as Gaussian, Gamma, Beta, etc.
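As a concrete illustration (not from the paper), the following Python sketch applies the transformation method to one of the cases where $F^{-1}$ is known analytically, the exponential density; the function name and the parameter lam are hypothetical.

```python
import numpy as np

def sample_exponential(lam, size, rng=np.random.default_rng()):
    """Transformation (inverse-CDF) method for an exponential density.

    F(x) = 1 - exp(-lam * x), so F^{-1}(z) = -ln(1 - z) / lam.
    """
    z = rng.uniform(size=size)          # uniform deviates in [0, 1)
    return -np.log(1.0 - z) / lam

x = sample_exponential(lam=2.0, size=100_000)
print(x.mean())   # should be close to 1 / lam = 0.5
```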

The other method for random variable generation is the rejection method. It is a more general method than the transformation method. To use the rejection method, a comparison function $g(x)$ that dominates $f(x)$ everywhere is required. A point $(x, y)$ is generated uniformly in the two-dimensional region bounded by the comparison function and the $x$-axis. For this purpose, $G^{-1}$ is needed, where $G$ is the distribution function corresponding to (the normalized) $g$. The point will be accepted or rejected based on the realization of a second uniform-in-$[0, 1]$ random variable $u$ as follows: $u$ is compared to $f(x)/g(x)$, and the point is accepted if $u \le f(x)/g(x)$ and rejected otherwise. A problem with the rejection method is that for some large-tail densities it is not possible to find a comparison function with an analytically invertible distribution function (a necessary condition for the method) that is everywhere larger than the given density. Therefore, it cannot be guaranteed that this method is capable of generating random variates from an arbitrary density. More importantly, the acceptance rate is $1/A$, where $A$ is the area under $g$. The acceptance rate can be small (e.g., 0.01), thus making the random variate generation process unacceptably slow. Further, if the functional form of the given density is not known and it is only characterized by a finite set of data points, then it becomes a computationally extensive procedure and one has to resort to other techniques. One such technique would be to form a kernel density estimate and generate the points from that estimate. This approach, however, is prone to statistical error in the estimate of the density, which is of order $O(N^{-2/5})$ for the rms error and therefore fairly high.
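The following Python sketch (not from the paper) shows the rejection method in its simplest form, using a constant comparison function on a bounded support; the triangular target density and all names are illustrative.

```python
import numpy as np

def rejection_sample(f, x_lo, x_hi, f_max, size, rng=np.random.default_rng()):
    """Rejection method with a constant comparison function g(x) = f_max.

    f      : target density (only point evaluations are needed)
    f_max  : any bound with f(x) <= f_max on [x_lo, x_hi]
    The acceptance rate is (area under f) / (f_max * (x_hi - x_lo)).
    """
    out = []
    while len(out) < size:
        x = rng.uniform(x_lo, x_hi, size)      # candidate points
        y = rng.uniform(0.0, f_max, size)      # uniform height under g
        out.extend(x[y <= f(x)])               # accept points falling under f
    return np.array(out[:size])

# Example: a triangular density on [0, 2], f(x) = 1 - |x - 1|.
tri = lambda x: 1.0 - np.abs(x - 1.0)
print(rejection_sample(tri, 0.0, 2.0, 1.0, 10_000).mean())   # close to 1.0
```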

Methods for approximate random variate generation include Metropolis–Hastings algorithms and Markov chain Monte Carlo (MCMC) algorithms (see, for example, [20] and [53]). These algorithms focus on generating from the density function rather than going through the inverse distribution function.

2) New Random Variate Generation Techniques: Our goal is to develop several new random variate generating techniques using multilayer networks. The methods we propose can be used to generate points from a distribution given by a functional form or given by a set of data points drawn from that distribution. The first method (for which we develop stochastic and deterministic versions) is based on viewing random number generation as the inverse process of density estimation (this inverse relationship between density estimation and random variate generation has also been pointed out by [39], [46]). One of the advantages of our method is that a density estimation step need not be performed. We also propose a method that formulates the problem within a “control” framework. A cascade structure is proposed that consists of a density shaping network (acting as the “controller”) and a density estimation network (acting as the “plant”). For the methods that we propose, the bulk of the time is spent in “learning” to generate the random variates. Once this learning is done, the actual generation of these random numbers is fast and can thus be used for efficient Monte Carlo simulations.

In the remainder of the paper we develop and theoretically justify the new methods we propose. In Section II, we develop the two new density estimation techniques in detail. In Section III, we present the convergence results of the proposed density estimators. In Section IV, we describe the new random number generating techniques, followed by their convergence properties in Section V. In Section VI, we present the simulation results for both the density estimation techniques and the random variate generation techniques. We conclude with a brief discussion in Section VII. We present the proofs of the convergence theorems in Appendix B.

II. NEW DENSITY ESTIMATION TECHNIQUES

In this section we describe the two new methods for densityestimation. For both methods, one can use a standard multilayernetwork, such as a one-hidden-layer network. We will use neuralnetworks throughout for illustrative purposes, but stress that anysufficiently general class of functions will do just as well. Thenetwork’s output will represent an estimate of the distributionfunction, and its derivative will be an estimate of the density.

A. SLC (Stochastic Learning of the Cumulative)

Let $x_1, x_2, \ldots, x_N$ be the given data points. Let the underlying density be $f(x)$ and let its distribution function be $F(x)$. We assume the density to be continuous and have continuous derivatives of all orders. Let the neural network output be $H(x, w)$, where $w$ represents the set of weights of the network. Ideally, after training the neural network, we would like to have $H(x, w) = F(x)$. Learning requires a set of targets. To determine a target for the network training, we make use of the following observation. The density of the variable $z = F(x)$ ($x$ being generated according to $f$) is uniform in $[0, 1]$. The reason is as follows. Let $p_z(z)$ represent the density of $z$. Using the well-known formula for transformation of random variables

$p_z(z) = \dfrac{f(x)}{|dz/dx|} = \dfrac{f(x)}{f(x)} = 1$ for $0 \le z \le 1$ (1)

and $p_z(z) = 0$ for $z < 0$ or $z > 1$. We used the fact that $dF(x)/dx = f(x)$. Thus, if $H(x, w)$ is to be as close as possible to $F(x)$, then the network output should have a density as close as possible to uniform in $[0, 1]$ when fed an input $x$ that is distributed according to $f$. This is what our goal will be. We will attempt to train the network such that its output density is uniform. Having achieved this goal, the network mapping should represent the distribution function $F(x)$. A somewhat similar philosophy is used in [39] and [46], where they show that their entropy maximization method implies transformation to a uniform density (and also implies the necessity to maximize the expected logarithm of the determinant of the transformation Jacobian). We present here (and fully analyze) a different and more direct method for obtaining a mapping to a uniform distribution than the methods proposed in [39] and [46].
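This observation can be checked numerically with a few lines of Python (not from the paper): feeding Gaussian samples through their own distribution function produces values that pass a standard uniformity test; the Gaussian parameters are arbitrary.

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=5000)   # x ~ F (here a Gaussian)
z = norm.cdf(x, loc=1.0, scale=2.0)             # z = F(x)
print(kstest(z, 'uniform'))                     # large p-value: z is uniform on [0, 1]
```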

The basic idea behind the proposed algorithm is to use the data points drawn from the unknown density as inputs to the network. For every training cycle we generate a different set of network targets randomly from a uniform distribution in $[0, 1]$, and adjust the weights to map the data points (sorted in ascending order) to these generated targets (also sorted in ascending order). Thus we are training the network to map the data to a uniform distribution.

Before describing the steps of the algorithm, we note that the resulting network has to represent a monotonically nondecreasing mapping, otherwise it will not represent a legitimate distribution function. Let $u_1 \le u_2 \le \cdots \le u_N$ be the sorted targets. Then we wish to pick $w$ such that $H(x_i, w)$ is as close as possible to $u_i$. In practice, enforcing monotonicity could be done by using a class of “monotonic networks” [50], or else by using hint penalty terms [1]. Hints are auxiliary information or constraints on the target function that are known a priori (independent of the training examples). By using hints in the form of penalty terms, we can guide the learning process and obtain a network that satisfies the hints. In our simulations, we used the latter approach of adding a term that penalizes nonmonotone mappings to the error function. The proposed algorithm is as follows.

1) Let $x_1, x_2, \ldots, x_N$ be the points drawn from the unknown density. Without loss of generality assume the points are sorted in ascending order: $x_1 \le x_2 \le \cdots \le x_N$.

2) Set $k = 1$, where $k$ is the training cycle number. Initialize the weights randomly to $w^{(1)}$.

3) Generate randomly from a uniform distribution in $[0, 1]$ another $N$ points $z_1, \ldots, z_N$. These are the network targets for this cycle.

4) Sort the targets in ascending order, i.e., $u_1 \le u_2 \le \cdots \le u_N$ (where we have renamed the ordered targets $u_i$). Then, the point $u_i$ is the target output for $x_i$.

5) Adjust the network weights according to the backpropagation scheme $w^{(k+1)} = w^{(k)} - \eta_k\, \partial E/\partial w$, where $E$ is the objective function that includes the error term and the monotonicity hint penalty term

$E = \sum_{i=1}^{N}\big[H(x_i, w) - u_i\big]^2 + \lambda \sum_{j}\theta\big(H(y_j, w) - H(y_j + \delta, w)\big)\big[H(y_j, w) - H(y_j + \delta, w)\big]^2$ (2)

The second term is the monotonicity penalty term, $\lambda$ is a positive weighting constant, $\delta$ is a small positive number, $\theta(\cdot)$ is the familiar unit step function, and the $y_j$'s are any set of points where we wish to enforce the monotonicity. Because the hint is known to be true, $\lambda$ can be chosen very large.


6) Set $k = k + 1$, and go to step 3) to perform another cycle until the error is small enough.

7) Upon convergence, the density estimate becomes $\hat f(x) = \partial H(x, w)/\partial x$.

Note that as presented, the randomly generated targets are different for every cycle, which will have a smoothing effect that will allow convergence to a truly uniform distribution. One other version, which we have implemented in our simulation studies, is to generate new targets after every fixed number of cycles, rather than every cycle. This generally improves the speed of convergence as there is more “continuity” in the learning process. Also note that it is preferable to choose the activation function for the output node to be in the range of zero to one, to ensure that the estimate of the distribution function is in this range. A sample run of how SLC performs on 100 data points drawn from a mixture of two Gaussians is shown in Fig. 4 of Section VI-A.
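The following PyTorch sketch (not from the paper) is a minimal rendering of the SLC loop under illustrative assumptions: a small one-hidden-layer network, a sigmoid output to keep $H(x, w)$ in $[0, 1]$, and a relu-squared penalty that equals the step-function-weighted squared violation used in (2); all sizes, learning rates, and constants are hypothetical.

```python
import torch
import torch.nn as nn

# One-hidden-layer network; the sigmoid output keeps H(x, w) in [0, 1].
net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1), nn.Sigmoid())

x = torch.sort(torch.randn(100))[0].reshape(-1, 1)       # step 1: sorted data points
y = torch.linspace(-3.0, 3.0, 50).reshape(-1, 1)         # points y_j for the monotonicity hint
delta, lam = 0.01, 100.0                                 # small offset and hint weight
opt = torch.optim.SGD(net.parameters(), lr=0.1)

for k in range(1, 2001):
    u = torch.sort(torch.rand(100))[0].reshape(-1, 1)    # steps 3-4: fresh sorted uniform targets
    err = ((net(x) - u) ** 2).sum()
    viol = torch.relu(net(y) - net(y + delta))            # > 0 only where monotonicity is violated
    loss = err + lam * (viol ** 2).sum()                  # objective of the form (2)
    opt.zero_grad(); loss.backward(); opt.step()          # step 5: backpropagation update
    for g in opt.param_groups:                            # decreasing learning-rate schedule
        g["lr"] = 0.1 / (1.0 + k / 500.0)

# Step 7: the density estimate is the derivative of the network output.
xq = torch.linspace(-3.0, 3.0, 200).reshape(-1, 1).requires_grad_(True)
f_hat = torch.autograd.grad(net(xq).sum(), xq)[0]
```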

SLC is only applicable to estimating univariate densities. The reason is that for the multivariate case the nonlinear mapping $z = F(x)$ will not necessarily result in a uniformly distributed output $z$. Fortunately, many, if not the majority of, problems encountered in practice are univariate. This is because multivariate problems, even with a modest number of dimensions, need a huge amount of data to obtain statistically accurate results. Our second method, described next, is applicable to the multivariate case as well.

B. SIC (Smooth Interpolation of the Cumulative)

Again, we have a multilayer network, to which we input the point $x$, and the network outputs the estimate of the distribution function. Let $f(x)$ be the true density function, and let $F(x)$ be the corresponding distribution function. Let $x = (x^{(1)}, \ldots, x^{(d)})^T$. The distribution function is given by

$F(x) = \displaystyle\int_{-\infty}^{x^{(1)}} \cdots \int_{-\infty}^{x^{(d)}} f\big(t^{(1)}, \ldots, t^{(d)}\big)\, dt^{(1)} \cdots dt^{(d)}$ (3)

A straightforward estimate of $F(x)$ could be the fraction of data points falling in the area of integration

$\hat F_N(x) = \dfrac{1}{N}\displaystyle\sum_{n=1}^{N} \theta(x - x_n)$ (4)

where $\theta(x - x_n) = 1$ if $x^{(i)} \ge x_n^{(i)}$ for all $i = 1, \ldots, d$ and zero otherwise. The statistical properties of the estimate (4) have been widely studied ([24], [49]). For the method we propose, we will use such an estimate for the target outputs of the neural network.
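A minimal NumPy sketch (not from the paper) of the estimate (4), treating $\theta$ as the componentwise-domination indicator; the array names and the Gaussian test data are illustrative.

```python
import numpy as np

def empirical_cdf(points, data):
    """Estimate (4): fraction of data points dominated componentwise by each query point.

    points : (M, d) array of query points x
    data   : (N, d) array of samples drawn from the unknown density
    """
    dominated = np.all(data[None, :, :] <= points[:, None, :], axis=2)  # (M, N)
    return dominated.mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
print(empirical_cdf(np.array([[0.0, 0.0]]), data))   # roughly 0.25 for independent Gaussians
```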

The estimate given by (4) has a staircase-like shape if plotted against $x$, and is thus discontinuous. The neural network method developed here provides a smooth, and hence more realistic, estimate of the distribution function. Further, the density can be obtained by differentiating the output of the network with respect to its inputs.

For the low-dimensional case, we can uniformly sample (4) using a grid to obtain the examples for the network. Beyond two or three dimensions, this is not feasible because of computational considerations. One idea is to sample the input space randomly (using, say, a uniform distribution over the approximate range of the $x_n$'s), and for every point determine the network target according to (4). Another option is to use the data points themselves as examples. The target for a point $x_n$ would then be

$u_n = \dfrac{1}{N-1}\displaystyle\sum_{m \ne n} \theta(x_n - x_m)$ (5)

This target is unbiased, i.e., $E[u_n \mid x_n] = F(x_n)$. Another alternative would be to use $E[F(x_n)]$. This expected value can be calculated only for the one-dimensional case. This is because for the one-dimensional case, the data points can be naturally ordered. Let $x_{(i)}$ be the $i$th-order statistic of the data points. Then, one can show that

$E\big[F(x_{(i)})\big] = \dfrac{i}{N+1}$ (6)

Thus for one dimension we can use the targets represented in (6); however, for more than one dimension we resort to (5).

For the one-dimensional case, the algorithm is similar to SLC except for steps 3 and 4. Instead, the targets are fixed and given by $u_i = i/(N+1)$. It is easy to see how the algorithm generalizes to the multidimensional case. Once again, we use monotonicity as a hint to guide the training. Once training is performed, and $H(x, w)$ approximates $F(x)$, the density estimate can be obtained as

$\hat f(x) = \dfrac{\partial^d H(x, w)}{\partial x^{(1)} \cdots \partial x^{(d)}}$ (7)

We note that for a few dimensions, a derivation of the derivatives in (7) will be straightforward. For larger dimensions a numerical differentiation scheme such as the simple differencing method is more feasible. A sample run of how SIC performs on the same 100 data points used to run SLC is shown in Fig. 5 in Section VI-A.
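A minimal one-dimensional PyTorch sketch (not from the paper) of SIC, under the same illustrative assumptions as the SLC sketch above: the only change is that the fixed targets $i/(N+1)$ of (6) replace the per-cycle random uniform targets, and the density estimate (7) is obtained by automatic differentiation.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1), nn.Sigmoid())

x = torch.sort(torch.randn(100))[0].reshape(-1, 1)                  # ordered data x_(1) <= ... <= x_(N)
N = x.shape[0]
t = (torch.arange(1, N + 1, dtype=torch.float32) / (N + 1)).reshape(-1, 1)  # targets (6): i/(N+1)
y = torch.linspace(-3.0, 3.0, 50).reshape(-1, 1)                    # monotonicity hint points
delta, lam = 0.01, 100.0
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(3000):
    viol = torch.relu(net(y) - net(y + delta))                       # monotonicity hint penalty
    loss = ((net(x) - t) ** 2).sum() + lam * (viol ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

# Density estimate (7): differentiate the fitted distribution function.
xq = torch.linspace(-3.0, 3.0, 200).reshape(-1, 1).requires_grad_(True)
f_hat = torch.autograd.grad(net(xq).sum(), xq)[0]
```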

III. CONVERGENCE OF THE DENSITY ESTIMATION TECHNIQUES

In this section we derive the convergence properties of the proposed density estimation techniques. Our goal is to justify the methods introduced in Section II by showing that convergence to the true density does occur. Further, we analyze the rate of convergence and compare with various other estimators. Due to the technical nature of such convergence issues, we will take this opportunity to summarize the essential details of the section.

First, we consider the stochastic method (SLC). The targets are uniform in $[0, 1]$. The expected value of the $i$th target, the $i$th order statistic of a uniform distribution, is $i/(N+1)$. Therefore we should expect the learned function to be approximately performing the mapping $x_{(i)} \mapsto i/(N+1)$, where $x_{(i)}$ represents the $i$th element of the ordered data set. But this is exactly the mapping we are trying to learn in SIC (smooth interpolation of the cumulative). Thus we expect SLC to converge to a solution that is also a solution of SIC. The formal statement and proof of this claim are the contents of Section III-A. For the proof, we will need some results from recursive stochastic approximation theory, which we review briefly in the Appendix.


Having shown that SLC converges to a solution of SIC, we will restrict our analysis to SIC. First we define a set of functions which we call generalized sample distribution functions. These functions are exactly that set of functions that approximately interpolate the sample distribution function given by (6). We prove that this set of functions converges uniformly to the true cumulative distribution function in the $L_2$ (rms) and $L_\infty$ sense. In fact, we will show that the rate of convergence is $O(N^{-1/2})$ for the $L_1$ and $L_2$ errors, and $O\big((\log\log N/N)^{1/2}\big)$ for the $L_\infty$ error. Further, because the neural network is an arbitrarily powerful class of functions, these generalized distribution functions can be implemented; therefore the neural-network implementations we have discussed in Section II also converge.

How about the convergence to the true density? Some assumptions have to be made about the true density. It is well known that the density estimation problem is an ill-posed problem.¹ Without any a priori assumptions on the true density function, it will be hard to judge the suitability of an estimator. Even the trivial estimator consisting of the summation of delta functions centered at the data points could be as valid as other estimators. In fact, [7] and [21] show that no density estimator is consistent for certain types of error measure unless one makes some a priori assumptions about the distribution function. Typical a priori assumptions used in the density estimation literature are smoothness constraints on the class of considered densities. These are realistic assumptions that are usually obeyed by densities typically encountered in real-world applications. We will consider such constraints in our analysis. We assume boundedness of the derivatives. We prove that when the true distribution function has bounded derivatives up to order $K$, the convergence rate is $O\big((\log\log N/N)^{(1-1/K)/2}\big)$, which for large $K$ is faster than $N^{-1/2+\delta}$ for any $\delta > 0$. For comparison, any positive kernel estimate has an rms convergence rate of $O(N^{-2/5})$ using the optimal smoothing parameter, which is inaccessible in practice as it depends on some detailed properties of the true density. We then show that neural networks can achieve these same rates on compact sets by using the universal approximation results in [28], [29]. We will illustrate our convergence rate with simulations using neural networks.

A. Convergence of SLC to SIC

In this section we will analyze the stochastic method (SLC). Let $x_1 \le x_2 \le \cdots \le x_N$ be the data points sorted in ascending order, and let $u_1^{(k)} \le \cdots \le u_N^{(k)}$ be the output targets for training cycle $k$, where the $u_i^{(k)}$'s are generated from a uniform density and then sorted in ascending order. Ignore for the time being the effect of the hint penalty term in (2), as this term can only help convergence, and, if monotonicity is satisfied (as in the case of convergence to the true density), the hint term will equal zero. The convergence behavior is given in the following theorem.

¹A more detailed discussion of ill-posed problems can be found in [60] and [65].

Theorem 3.1: Let the data $x_1, \ldots, x_N$ be generated according to the true distribution $F$. Then, SLC converges with probability 1 to a local minimum of the error function

$E(w) = \displaystyle\sum_{i=1}^{N}\Big[H(x_{(i)}, w) - \dfrac{i}{N+1}\Big]^2$ (8)

provided that the learning rate $\eta_k$ is a decreasing sequence that satisfies the following conditions:

a) $\lim_{k\to\infty} \eta_k = 0$;
b) $\sum_{k=1}^{\infty} \eta_k^{\rho} < \infty$, for some $\rho > 1$;
c) $\sum_{k=1}^{\infty} \eta_k = \infty$.

Proof: See Theorem B.1 in Appendix B.1.

We note that the conditions on $\eta_k$ guarantee that the learning rate is decreasing in a way that will dampen the random fluctuations around the minimum, but at the same time is not decreasing so fast that the minimum is never reached. A possible choice could be $\eta_k = c/k$, where $c$ is a constant. A constant learning rate, however, would not work. This theorem shows that SLC trains the network to map point $x_{(i)}$ to $i/(N+1)$ as $k \to \infty$. By comparing with SIC, described in Section II, we see that both methods possess similarities for the univariate case. In fact, as $N \to \infty$, both methods are equivalent. Therefore, we will restrict the remaining analysis to SIC—smooth interpolation of the sample cumulative. We will first look at the convergence to the true distribution function and then the convergence to the density.

B. Convergence to the True Distribution Function

SIC gives an estimate of the distribution function, from which we get the density by differentiation. The distribution function is useful in its own right, so we will first look at the consistency and convergence rate of SIC as an estimator of the distribution function in the limit of large $N$.

Two well studied statistics of the sample distribution function are the Kolmogorov–Smirnov statistic $D_N$ and the Cramér–von Mises statistic $C_N$, defined as follows:

$D_N = \sup_x \big|\hat F_N(x) - F(x)\big|$ (9)

$C_N = N \displaystyle\int_{-\infty}^{\infty} \big[\hat F_N(x) - F(x)\big]^2\, dF(x)$ (10)

where $\hat F_N$ is the sample distribution function given in (4). The following theorems are valid [49, pp. 61–64].

Theorem 3.2: With probability 1

$\limsup_{N\to\infty} \Big(\dfrac{2N}{\log\log N}\Big)^{1/2} D_N = 1.$ (11)

Proof: See [17].

Theorem 3.3: With probability 1

$\limsup_{N\to\infty} \dfrac{C_N}{2\log\log N} = \dfrac{1}{\pi^2}.$ (12)

Proof: See [24].

It is not hard to extend these theorems to apply to the generalized sample distribution functions, which we define shortly. These theorems give the probability-1 convergence behavior. We will look at the expected performance in the rms error criterion. The expectations are with respect to the data set. Let us introduce the following definition.

Let $\mathcal{F}$ be the space of functions $G$ such that for every $G \in \mathcal{F}$, the following holds:

• $0 \le G(x) \le 1$ for all $x$;
• $G$ is continuously differentiable and $G'(x) \ge 0$;
• $\lim_{x\to-\infty} G(x) = 0$ and $\lim_{x\to\infty} G(x) = 1$.

Thus, $\mathcal{F}$ is the space of (monotonic) distribution functions on the real line that possess continuous density functions, which is the class of functions that we will be interested in. We define a metric, the $L_2(\mu)$-norm of $G_1 - G_2$, as follows:

$\|G_1 - G_2\|_\mu^2 = E_\mu\big[(G_1(x) - G_2(x))^2\big]$ (13)

$E_\mu$ is the expectation with respect to the distribution $\mu$. The $L_\infty$-norm is defined by $\|G_1 - G_2\|_\infty = \sup_x |G_1(x) - G_2(x)|$. Let the data set be $\{x_1, \ldots, x_N\}$, and corresponding to each ordered point $x_{(i)}$, let the target be $i/(N+1)$. As mentioned in Section II, SIC attempts to map the order statistics $x_{(i)}$ to $i/(N+1)$. In general this is possible given a large enough neural network. However, we will allow the mapping to be approximate because, under some smoothness assumptions on the true distribution, a smoother fit would warrant a small sacrifice in the fit accuracy. With this in mind, we define the set of approximate sample distribution functions as follows.

Definition 3.4: A $\delta$-approximate generalized sample distribution function $G$ satisfies the following two conditions:

• $G \in \mathcal{F}$;
• $\big|G(x_{(i)}) - i/(N+1)\big| \le \delta$ for $i = 1, \ldots, N$;

where $\delta$ is some function of $N$. We will denote the set of all $\delta$-approximate sample distribution functions for a data set of size $N$ and a given $\delta$ by $\mathcal{G}_N^{\delta}$.

The set $\mathcal{G}_N^{\delta}$ contains those continuously differentiable distribution functions that approximately interpolate the data set. For $\delta = 0$ we have a generalized sample distribution function, which is a smooth generalization of the conventional sample distribution function. Note that the conventional distribution function estimator (4) is not in this class of generalized sample distribution functions, but it is the limit of functions that are in this class.

Fig. 1. Interpolation error when interpolating the distribution function.

Let us now derive the estimation error for the distribution function estimator. Let $G \in \mathcal{G}_N^{\delta}$, and let $F$ be the true distribution function. As we mentioned, $G$ approximates the distribution function by approximate interpolation. Write $G(x_{(i)}) = i/(N+1) + \delta_i$, where $|\delta_i| \le \delta$. The true distribution function evaluated at $x_{(i)}$ is generally not equal to $i/(N+1)$ due to statistical error. Therefore we will write $F(x_{(i)}) = i/(N+1) + \epsilon_i$. There are two sources of estimation error (see Fig. 1): the error at each data point, and the interpolation error between the data points—an error that would persist even if the $\epsilon_i$'s were all zero. As a first step, let us analyze the statistics of the $\epsilon_i$'s. It is well known that the random variable $F(x)$, $x$ being generated according to $F$, has a uniform distribution in $[0, 1]$ (see (1) in Section II-A). Therefore, $F(x_{(i)})$ is distributed as the $i$th-order statistic of the uniform distribution, which is distributed according to the well-known Beta distribution. The joint density of $F(x_{(i)})$ and $F(x_{(j)})$, $i < j$, can be obtained as (see [10, pp. 200])

$p(u, v) = \dfrac{N!}{(i-1)!\,(j-i-1)!\,(N-j)!}\; u^{i-1}(v-u)^{j-i-1}(1-v)^{N-j}, \quad 0 \le u \le v \le 1$ (14)

Noticing that $\epsilon_i = F(x_{(i)}) - i/(N+1)$, we can calculate the first two moments of the $\epsilon_i$'s as

$E[\epsilon_i] = 0, \qquad E[\epsilon_i \epsilon_j] = \dfrac{i(N+1-j)}{(N+1)^2(N+2)}, \quad i \le j$ (15)

The following theorem provides a bound for the estimationerror, which will prove consistency of the distribution functionestimator.

Theorem 3.5 (Convergence to the True Distribution): Let the data set consist of $N$ data points, drawn i.i.d. from the (true) distribution $F$. For every $G \in \mathcal{G}_N^{\delta}$ and every distribution $\mu$, the inequality

(16)

holds, where

(17)

Proof: See Theorem B.2 in Appendix B.1.

Note 1: We stress that the theorem applies to any $G \in \mathcal{G}_N^{\delta}$ and any $\mu$, in particular to $\mu = F$, the true distribution function. Thus the expected mean integrated squared error approaches zero asymptotically at a rate $O(1/N)$.

Note 2: The same proof can be modified for the case of the integrated squared error over some bounded interval.

Note 3: Without much extra effort, this result can be extended to a larger class of interpolators where the monotonicity condition is replaced by the less restrictive condition that $G(x_{(i)}) \le G(x_{(j)})$ whenever $i \le j$, with no other constraints being made on continuity or the existence of derivatives. Note that the conventional distribution function (4) belongs to this extended class.

Note 4: If $\delta = O(N^{-1/2})$, convergence to the true distribution function with respect to the rms error is obtained at a rate $O(N^{-1/2})$.

In the Appendix, we also give the $L_1$ and $L_\infty$ rates, which are $O(N^{-1/2})$ and $O\big((\log\log N/N)^{1/2}\big)$, respectively (Theorems B.3 and B.4).

C. Convergence to the True Density Function

The density estimate is the derivative of the distribution function estimate, and thus we need to consider $G'$. Obtaining a tight convergence rate for the density estimation error is a tougher job. The reason is that the derivative operation accentuates the noise. In the next theorem, we present a bound on the estimation error for the density estimate of SIC. Its essential content is that if the true distribution function has bounded derivatives to order $K$, then by picking the approximate distribution function with “minimum” value for $A_K$, where $A_K$ is a bound on the magnitude of the $K$th derivative, we obtain convergence in probability at a rate $O\big((\log\log N/N)^{(1-1/K)/2}\big)$ in the $L_1$ norm. Our main theorem is as follows.

Theorem 3.6 ($L_1$ Convergence in Probability to the True Density): Let $N$ data points be drawn independently and identically distributed (i.i.d.) from the distribution $F$. Let $\big|F^{(k)}(x)\big| \le A_k$ for $k = 1, \ldots, K$, where $K \ge 2$. Let $G \in \mathcal{G}_N^{\delta}$ be a $\delta$-approximate distribution function with bounded $K$th derivative (by the definition of $\mathcal{G}_N^{\delta}$, such a $\delta$-approximate distribution function must exist). Then, the inequality

(18)

where

(19)

holds with probability 1, as $N \to \infty$. By this is meant

(20)

Proof: See Theorem B.10 in Appendix B.1.

Note 5: This theorem holds for any $N$, any $K \ge 2$, and any $G$ satisfying the conditions of the theorem. Therefore, we see that for smooth density functions with bounded higher derivatives, the convergence rate approaches $O\big((\log\log N/N)^{1/2}\big)$.

Note 6: No smoothing parameter needs to be determined, unlike the other traditional techniques. The regularization is accomplished by minimizing the norm of the $K$th derivative.

Note 7: The convergence rate is very close to the rate proven to be optimal under similar smoothness assumptions (see [23], [55], [56]).

Note 8: From the theorem, it is clear that one should try to find a generalized $\delta$-approximate distribution function with the smallest possible derivatives. Specifically, of all the sample distribution functions, pick the one that “minimizes” $A_K$, the bound on the $K$th derivative. Thus, when using optimization techniques to find the generalized distribution function, one would be justified in introducing penalty terms penalizing the magnitudes of the derivatives (for example, Tikhonov-type regularizers [60]).

Note 9: For compact support, if we choose $\mu$ to be the uniform measure on that support, we obtain a result for the integrated squared error measure.

Note 10: Consistent density estimation involves solving a constrained optimization problem in order to guarantee convergence at the prescribed rate. The objective function would be the bound on the $K$th derivative, and the constraints are that the density be fit sufficiently closely. The constraints can be enforced softly by penalizing any violation of the constraints. Practically speaking, it is often more convenient to approximately impose the smoothness constraints (for example, by starting at small weights or using weight decay [4], [12]) while attempting to fit the distribution function. Our simulations (see Section VI) indicate that this works quite well.

D. Implementation by Neural Networks

In this section, we have proved the convergence to the distribution and the convergence to the density. Any functions satisfying the conditions of the theorems will converge at the given rates. In particular, if neural networks can be found that satisfy the required conditions, then these convergence rates will also apply to implementations with these networks. The goal here is to show that one can find neural networks that satisfy the required conditions. In fact, letting the size of the neural networks increase at an appropriate rate with $N$ would suffice.

In order to show that neural networks can be chosen to satisfy the conditions of the theorems, it suffices to show that neural networks are dense in a certain sense in the space $\mathcal{F}$. In other words, let a sequence of functions $G_N$ be given, where $N$ indexes the number of data points. Suppose that $G_N$ satisfies the conditions of Theorem 3.5 or 3.6 (whichever we are interested in). Then we know that the error for the $G_N$'s converges to zero. If we show that there is also a sequence of neural networks $H_N$ such that the following conditions hold simultaneously:

1) $H_N$ approximates $G_N$ to sufficient accuracy;
2) $H_N'$ approximates $G_N'$ to sufficient accuracy;

then the error for the $H_N$'s converges to zero at the same rate as the $G_N$'s. To show the existence of such a sequence of neural networks, it suffices to show that given an arbitrary degree of accuracy $\epsilon$, there is a sequence of $H_N$'s that approximate the $G_N$'s to within $\epsilon$ and simultaneously the $H_N'$'s also approximate the $G_N'$'s to within $\epsilon$. To this end, we will use the approximation theorems proved in [28], [29]. First we set up the notation to state the theorem, following very closely the setup in [28], [29].


Define the class of neural networks we are interested in as the functions defined by the set

$\Sigma(\sigma) = \Big\{ h : \mathbb{R}^d \to \mathbb{R} \;\Big|\; h(x) = \displaystyle\sum_{j=1}^{m} \beta_j\, \sigma\big(a_j^{\top} x + b_j\big) \Big\}$ (21)

where $\sigma$ is conventionally called the activation function. We make some assumptions on $\sigma$. Though some of these assumptions could be dropped (see [28] and [29]), there is negligible practical gain in maintaining more generality.

1) $\sigma$ is a symmetric sigmoidal nonlinearity, that is, a bounded monotonically increasing function with exponentially decaying derivatives of all orders.

2) $\sigma$ and all of its derivatives are bounded on every open set $U$, i.e., $\sigma$ belongs to the Sobolev space² $S^{m,\infty}(U)$ in the $L_\infty$ metric on the open set $U$, for every order $m$.

We make one further restriction, which is once again of no practical consequence. We suppose that we are interested in approximating the distribution on some arbitrary compact set $K$. Let $U$ be an open bounded set containing $K$. Then, the restriction of $\mathcal{F}$ to $U$ is a subset of $S^{1,\infty}(U)$. The following theorem is valid.

Theorem 3.7 (Universal Approximation): Let $K \subset \mathbb{R}^d$ be a compact set. Let $g \in S^{m,\infty}(U)$, where $K \subset U$, and let $\epsilon > 0$ be given. Then there exists $h \in \Sigma(\sigma)$ such that, simultaneously for all multi-indexes $\alpha$ with $|\alpha| \le m$,

$\sup_{x \in K} \big| D^{\alpha} g(x) - D^{\alpha} h(x) \big| < \epsilon$ (22)

Proof: See Hornik et al. [28], Theorem 3.1 and Corollary 3.5.

This is a very powerful theorem. It states that neural networks can simultaneously approximate a function and its derivatives provided that certain conditions are met. The ability to approximate a function and its derivatives up to order $m$ is termed $m$-uniform denseness in [28]. Thus Theorem 3.7 can be summarized by saying that neural networks are $m$-uniformly dense³ in $S^{m,\infty}(U)$.

Corollary 3.8: $\Sigma(\sigma)$ is one-uniformly dense in $\mathcal{F}$ restricted to $U$.

Proof: Since $\sigma$ and $\mathcal{F}$ restricted to $U$ satisfy the conditions of Theorem 3.7, the corollary is immediate.

The following theorem is now immediate.

Theorem 3.9 (Consistent Distribution and Density Estimation Using Neural Networks):

1) Consistent distribution estimation with an error rate $O(N^{-1/2})$ can be obtained on any given compact set using neural networks.

2) Consistent density estimation with an error rate $O\big((\log\log N/N)^{(1-1/K)/2}\big)$ can be obtained on any given compact set using neural networks, where we assume the true distribution has bounded derivatives up to order $K$.

Proof: Let $G_N$ be a sequence that has the desired convergence for the distribution (Theorem 3.5) or density (Theorem 3.6). Then there exists a sequence of neural networks that simultaneously approximates $G_N$ and $G_N'$ with a maximum error less than $\epsilon$ (Corollary 3.8). Thus the added approximation error between $G_N$ and the networks does not affect the order of convergence.

²For more details on Sobolev spaces, see [3]. ³See [28] for the precise definition.

Fig. 2. Training a network to implement $G^{-1}(x)$. The first network learns $G(x)$ using SLC or SIC. This is done using the data available. Now, for an arbitrary set of inputs, we take Network 1's outputs as the inputs to Network 2 and the targets are the inputs that went into Network 1. Thus we train Network 2 to implement the inverse of Network 1, which should be $G^{-1}$.

Further, by the results given in [29], Corollary 2.5, we see that in order to guarantee that a neural network will be able to approximate to within a given accuracy, the number of hidden units must grow accordingly. Thus we see that by increasing the size of the neural network at a suitable rate with $N$, one can obtain consistent distribution/density estimation using neural networks on any given compact set. As a practical note, the theoretical results justify SIC and provide definite guidelines as to how to choose the size of the network as a function of $N$ to guarantee convergence. In practice, one knows a lot more about the distribution, and usually a small number of hidden units will suffice.

IV. NEW RANDOM VARIATE GENERATION TECHNIQUES

A. Learning the Inverse Distribution Function From a Finite Sample (SLCI and SICI)

Suppose that we wish to generate random variates from a specified univariate density $f(x)$. It could also be that $f(x)$ is represented only by a number of data points drawn from it, rather than a functional form. Typical approaches to such a problem would estimate the density from the given data, and apply some of the well-known methods such as the transformation method or the rejection method (which would be computationally extensive, since a pass through all the given data points has to be performed for each point to be generated).

We propose a method using multilayer networks, that is in-spired by the transformation method. But, unlike the transfor-mation method, we do not assume that is known ana-lytically. The network learns to implement the function .The basic structure of the method is illustrated in Fig. 2. It con-sists of two cascaded multilayer networks. The first networkis trained to estimate the distribution function from thedata using one of the techniques described in Sections II and III(SLC or SIC). Once the first network is trained, the density ofits output is uniform in as discussed in Sections II-A andB. Then, we train the second network to invert the mapping pro-duced by the first network. This inversion can be accomplishedto an arbitrary precision by using as large a network and as manydata points as we want. Learning the inverse proceeds as fol-lows. Let and be the functions implementedby Network 1 and Network 2, respectively. Let be an ar-bitrary set of inputs. Then, the input/output examples for Net-work 2 will be . Once the entire training processis complete, random variate generation according to is ac-complished by passing, a uniform deviate in , through


Network 2, i.e., . To see why this is so, observe that for the cascade structure of the two networks in Fig. 2, the density of the output is equal to the density of the input , because Network 2 is the inverse of Network 1. Since the density of the output of Network 1 is uniform, inputting a uniform deviate into Network 2 should produce a variable having a density equal to that of . More formally, assume that Network 1 was implementing . Then, Network 2 is implementing . Therefore, , where has a uniform distribution. Using the formula for transformations of random variables, the density of the output is evaluated as

(23)

This method applies only to the one-dimensional case, because the mapping transforms into a uniform density only for the univariate case. A second method, based on a control formulation of the problem and described later, is more general and applies to the multidimensional case as well.
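The change-of-variables step invoked in (23) is the standard transformation-of-random-variables identity. In generic notation (the symbols below are illustrative and not necessarily those used in (23)): if z is uniform on [0,1] and y = G^{-1}(z) for a differentiable, strictly increasing distribution function G with density g, then

\[
p_Y(y) \;=\; p_Z\bigl(G(y)\bigr)\,\left|\frac{dG(y)}{dy}\right| \;=\; 1\cdot g(y) \;=\; g(y),
\]

so the output of the cascade indeed has the desired density.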

Similar concepts to those discussed above have been used independently in the methods proposed in [39] and [46]. A drawback of this approach is that it is wasteful to first learn and then invert it; thus, we propose here two new methods that are based on directly learning . These two methods are inspired by the density estimation techniques that we have developed in Sections II and III. They are variants of the method described above in that they build upon the idea that a network mapping a uniform distribution to the true distribution must be implementing . The two methods described below can be applied either when the true distribution function is given or when a finite data set drawn from is given.

1) SLCI (Stochastic Learning of the Cumulative Inverse): This technique is very similar to the stochastic method for estimating and can be viewed as the inverse of SLC. Once again, is a monotonic function, so a monotonicity hint should be used here as well. The algorithm is as follows.

1) Sort the data points .
2) Set and initialize the weights of the network to .
3) Generate numbers from a uniform density in and sort them so that .
4) Train the network to map input to output target exactly as in steps 5 and 6 of SLC. Every cycle or every cycles, generate new 's.
5) After training is complete, input to the network a uniformly generated number. The output is (approximately) distributed according to .
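A minimal sketch of SLCI, assuming a standard MLP regressor in place of the monotonic network (the monotonicity hint is omitted, and regenerating the uniform inputs every cycle is approximated by repeated warm-started fits); all parameter values are illustrative:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
data = np.sort(rng.normal(size=300))            # step 1: sorted data points
N = data.size

# warm_start lets us refit the same network on freshly drawn uniforms,
# mimicking the periodic regeneration of the uniform inputs in step 4.
net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=200,
                   warm_start=True, random_state=0)

for cycle in range(50):                         # steps 3-4: repeated training cycles
    z = np.sort(rng.uniform(size=N))            # sorted uniform inputs
    net.fit(z.reshape(-1, 1), data)             # map the sorted uniforms to the sorted data

# step 5: generation -- feed uniform deviates through the trained network
u = rng.uniform(size=100000)
samples = net.predict(u.reshape(-1, 1))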

2) SICI (Smooth Interpolation of the Cumulative Inverse): This approach is analogous to the smooth interpolation approach for estimating the density (SIC). It is identical to SLCI, except that the input examples are (corresponding to output example ), instead of the uniform deviates. The algorithm is as follows.

1) Sort the data points .
2) Set and initialize the weights of the network to .
3) Let for .
4) Train the network to map input to output target as in steps 5 and 6 of SLC.
5) After training is complete, input to the network a uniformly generated number. The output is (approximately) distributed according to .

Another version of these methods that we have implemented is to determine the input–output examples by gridding the space into points . One then computes . The input–output examples are then . The advantage of this version is that when only a small number of sample points are available, one can generate many more examples that help to learn .
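A sketch of SICI along the same lines, including the gridding variant just described; the grid size, the input convention, and the network settings are illustrative assumptions:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
data = np.sort(rng.normal(size=300))
N = data.size

# Plain SICI: fixed inputs z_i = i/(N+1) (one common convention for the
# expected uniform order statistics), targets are the sorted data.
z = np.arange(1, N + 1) / (N + 1)
net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
net.fit(z.reshape(-1, 1), data)

# Gridding variant: grid the x-space, evaluate an estimate of G at the grid
# points (here the empirical CDF), and use (G_hat(x_j), x_j) as examples.
grid = np.linspace(data.min(), data.max(), 2000)
G_hat = np.searchsorted(data, grid, side="right") / N
net_grid = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
net_grid.fit(G_hat.reshape(-1, 1), grid)

# Generation from either network: pass uniform deviates through it.
samples = net_grid.predict(rng.uniform(size=100000).reshape(-1, 1))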

The two methods presented above are applicable to generating random variates given a sample of points. It is often the case that one has a functional form for the density from which one wishes to generate. The next method uses a neural network to learn directly.

B. Learning the Inverse Distribution Function Given the Density

Suppose that one is given the functional form for the density. The data for training the neural network are generated as follows. The -space is gridded into (say) points, . Assume that the 's have been ordered (we label the -space points by as they will be used as targets for the learning). The input examples to the network are given by , i.e., one computes numerically, where is given. As the 's form an increasing sequence, we do not have to perform numerical integrations. A single numerical integration from to will suffice if one outputs the value of the integral each time a is crossed. One now trains a neural network using (say) the squared error criterion to implement the mapping . Known facts about should be used in this training, such as monotonicity. The resulting neural network now implements an estimate of . Random variates are obtained by passing a uniform deviate through the resulting network.
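A sketch of this procedure, assuming a concrete density g (the density and grid size below are illustrative, not the paper's example) and a single running numerical integration:

import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed target density (illustrative): a two-component Gaussian mixture.
def g(x):
    return 0.5 * np.exp(-0.5 * ((x + 1.5) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi)) \
         + 0.5 * np.exp(-0.5 * ((x - 1.5) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

# Grid the x-space and compute G(x_j) with a single running (trapezoidal)
# integration, emitting the accumulated value each time a grid point is crossed.
x = np.linspace(-5.0, 5.0, 5000)
G = np.concatenate(([0.0], np.cumsum(0.5 * (g(x[1:]) + g(x[:-1])) * np.diff(x))))
G /= G[-1]                               # normalize so the last value is 1

# Train a network to implement the mapping G(x_j) -> x_j, i.e. the inverse.
net = MLPRegressor(hidden_layer_sizes=(30,), max_iter=5000, random_state=0)
net.fit(G.reshape(-1, 1), x)

# Random variates: pass uniform deviates through the trained network.
u = np.random.default_rng(3).uniform(size=100000)
samples = net.predict(u.reshape(-1, 1))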

The universal approximation theorems in Section III-D guarantee that by choosing large enough and a large enough neural network, one can generate random variates possessing a density arbitrarily close to .

C. Control Formulation of Random Variate Generation

The main contribution of this section is a novel formulation of the random number generation problem as a control problem. Fig. 3 illustrates the basic idea. The structure consists of two cascaded units. A random variable is generated according to any standard density function, for example Gaussian or uniform. In principle, an unlimited number of 's can be generated. This random variable is the input to the composite structure. The first network acts as a nonlinear transformation whose goal is to shape the density of into the target density. Let the distribution function of the transformed variable be , where is the transformed variable. The second unit serves to estimate this distribution function; for example, SLC or SIC (Sections II-A and B) or a kernel estimator could be used. The output of the second unit is an estimate of the distribution function of Network 1's output. An error signal4 is generated, where is the target distribution function. This error signal is backpropagated through the networks and is used to adjust the weights of Network 1 so as to minimize . Such a training process would change Network 1's mapping in such a way as to shape the distribution function of to get it as close as possible to the desired distribution function. Once training is complete, we can generate numbers (approximately) according to the density , by generating according to the standard density, and then passing it through Network 1.

4 The error signal could also penalize deviations of G from Z, thus ensuring that one learns the density.

Fig. 3. Control formulation for random variate generation. The first network shapes the density of x. The transformed density is estimated, and then compared with the true density. The parameters of the first network are altered until the transformed density matches the required density.

The analogy with the control problem is as follows. The second unit represents the "plant," and Network 1 is the "controller." The controller is trained such that the plant produces the desired output. This technique has a similar form to a neural network control structure (see [42]) where, typically, a neural network is trained to identify the plant (estimate the plant output), analogous in our setup to the density estimation unit. Then, a neural network controller is trained to control the plant, i.e., the identification network estimates the plant output and the controller network is altered until this output is as desired. The details of the algorithm are as follows.

I: Initialization:
1. Let the desired density be and let the corresponding distribution function be . Generate samples according to any standard density such as a Gaussian density. The larger is, the more accurate the final result will be.
2. Let and be the functions implemented by Network 1 and Network 2, respectively (we have assumed that the density estimator is a neural network such as those developed in Sections II and III). Network 1 has outputs, and Network 2 has one output. Initialize the weights ( and ) of the networks.
3. Calculate Network 1's output: , .
4. Train Network 2 using SLC or SIC in order to estimate the distribution of the 's. Thus, is implementing the distribution function of . Note that implicitly depends on .
II: Training Iterations:
5. Compute the error function , where are any set of points. One could even choose to be ; however, choosing to be on a finer grid usually works better.
6. Adjust the weights of the first network in the direction of the negative gradient: . For the case of the squared error above, we have
(24)
For clarity, we restrict the remainder of the algorithm to the case where the set is the same as the set . Leave fixed for a certain number of training cycles. Then, the change in , as changes, can be computed by the chain rule. We get
(25)
(26)
The weights in the first network can now be updated using a backpropagation-type scheme [27, p. 115], where one backpropagates the 's first through Network 2, then through Network 1, and one only adjusts the weights of Network 1.
7. After cycles of weight updates for Network 1 in step 6, train Network 2 for a few cycles on the new 's, to allow it to track the small change in the distribution of . Thus we now change , which was kept constant in step 6.
8. Go to step 6 to perform another iteration of training, and continue doing so until the error becomes small enough.
9. After training is complete, the random numbers can be generated by generating according to the standard density that was initially used (e.g., Gaussian), and computing (where represents the final weights after the entire learning process). Now, should be (approximately) distributed according to .
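A compact numerical sketch of this loop, assuming a tiny one-hidden-layer shaping network with manually coded gradients, a Gaussian-kernel estimate of the distribution function as the "plant" (recomputed at every iteration rather than every few cycles), and an illustrative Gaussian-mixture target. All symbols, rates, and sizes below are assumptions for illustration, not the authors' settings.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
N, H, h, lr = 2000, 20, 0.1, 0.05

# Target distribution function F (illustrative two-component Gaussian mixture).
F = lambda t: 0.5 * norm.cdf(t, -1.5, 0.5) + 0.5 * norm.cdf(t, 1.5, 0.5)

x = rng.normal(size=N)                              # step 1: standard-density inputs
W, b = 0.5 * rng.standard_normal(H), np.zeros(H)    # step 2: Network 1 weights
v, c = 0.5 * rng.standard_normal(H), 0.0
t = np.linspace(-4, 4, 200)                         # test points for the error

for it in range(500):                               # steps 5-8: training iterations
    a = np.tanh(np.outer(x, W) + b)                 # hidden activations, shape (N, H)
    y = a @ v + c                                   # step 3: Network 1 outputs
    # "Plant": Gaussian-kernel estimate of the distribution function of the y's.
    Ghat = norm.cdf((t[:, None] - y[None, :]) / h).mean(axis=1)
    r = 2.0 * (Ghat - F(t)) / t.size                # dE/dGhat (mean squared error form)
    # Chain rule: dE/dy_i, backpropagated through the kernel CDF estimate.
    dE_dy = (r[:, None] * (-norm.pdf((t[:, None] - y[None, :]) / h) / (h * N))).sum(axis=0)
    # Backpropagate through Network 1 and update only its weights (step 6).
    dv = a.T @ dE_dy
    dc = dE_dy.sum()
    s = (dE_dy[:, None] * (1.0 - a ** 2)) * v[None, :]
    dW = (s * x[:, None]).sum(axis=0)
    db = s.sum(axis=0)
    W, b, v, c = W - lr * dW, b - lr * db, v - lr * dv, c - lr * dc

# Step 9: generate variates by passing fresh standard-density inputs through Network 1.
x_new = rng.normal(size=100000)
samples = np.tanh(np.outer(x_new, W) + b) @ v + c

The learning rate and kernel width would need tuning in practice; the point of the sketch is only the structure: the error is measured on the plant's output, but only the controller (Network 1) is updated.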


Note 1: This algorithm is not restricted to the case of a scalar random variable, and so it can be used to generate multidimensional variables.
Note 2: Although, for illustration, we used the squared error function (step 6), the algorithm can be used with other error functions, such as the cross entropy error function (Kullback–Leibler distance). In addition, the error function need not only penalize differences in the distributions, but could also penalize differences in the derivatives of the distributions, ensuring that the correct density and higher order smoothness characteristics are learned.
Note 3: For clarity, in step 6 we considered the case where the 's were the same as the 's, and we did not update the second network weights for some fixed number of iterations. The general case is treated in Appendix A. The exact update rule is given by (24) and is computed for two cases: the case where the density estimator is a neural network trained according to SIC (see Appendix A.1), and the case where the density estimator is a Gaussian kernel estimate (see Appendix A.2). The general case as described in Appendix A generally produces better results.
Note 4: To generate from a density represented only by a set of data points, we use the same method above but replace by an estimate of using the data points. This estimate can be obtained using one of the methods described in Sections II-A and B.
Note 5: The entire learning process can be an involved process. However, once the learning has been done, the generation of the random variates is fast, since it only involves the evaluation of a multilayer network function.

V. CONVERGENCE OF THE RANDOM VARIATE GENERATION TECHNIQUES

In this section, we discuss the random variate generation methods, showing that they will indeed be generating from a distribution arbitrarily close to the true distribution in the large limit. We present the theorems in Appendix B, and refer to [36] for the proofs. The random number generation techniques (SLCI and SICI) are essentially techniques for estimating . We restrict the analysis to the case where is continuous on .

Theorem B.13 in Appendix B.2 shows that SLCI and SICI produce solutions to the same local minimization problem in the large data limit; hence we restrict our discussion to SICI. We will make the simplifying assumption that is continuously differentiable on the compact interval . Therefore the input support is a compact support and, further, is bounded. Let the bound on the derivative be . Then, the input distribution is bounded in a range of size . Let and let be a generalized inverse distribution function that approximately interpolates the points , i.e.,

(27)

We would like to analyze the rate of convergence of the integrated squared error , where is any continuous measure on . Write . Then Theorem B.14 in Appendix B.2 proves that these generalized inverse distribution functions converge to the true inverse distribution function and that a convergence rate of can be obtained in the sense.

We now look at the convergence to the density. We impose one additional restriction, that on the compact support of interest, (in other words, the true density is bounded away from zero). Further, we already assume that for . Then, Theorem B.15 in Appendix B.2 proves that the resulting density obtained from these inverse distribution functions converges to the true density, with an error rate approaching for smooth densities. Further, since neural networks can arbitrarily closely approximate these functions (see Section III-D), we conclude that neural networks can give these convergence rates.

These results are not only relevant to generating a number given a finite amount of data, but also to the control formulation, where an arbitrarily large amount of data to learn from is available for designing the neural network.

VI. SIMULATION RESULTS

A. Density Estimation Techniques

To test the proposed density estimation techniques, we considered the following mixture-of-Gaussians density:

(28)
We generated 100 i.i.d. data points from this density, and have used these data points to design the density estimators according to SLC and SIC (Sections II-A and -B). The results of SLC and SIC are shown in Figs. 4 and 5, respectively. One can see that even with 100 data points, the density estimates are quite reasonable.

To compare the proposed methods with the kernel estimation method, we implemented the three methods (SLC, SIC, and the kernel estimation method) on two samples, of sizes 100 and 200, respectively, drawn from the above density (28). The optimal kernel width has been calculated according to the formula [51, p. 40]. Since this is not possible in practice, the results of the kernel estimation method tend to be optimistic. The results are shown in Fig. 6. As can be seen, the kernel estimator displays bumpy behavior, even with the optimal choice of the kernel width. The neural network implementations, on the other hand, are relatively smooth. More quantitatively, for 200 data points, the and errors for SIC were 0.141 and 0.0229, respectively, and for the kernel estimate, they were 0.152 and 0.0233, respectively. Of course, performance on a single test case might not be conclusive, but backed by the theoretical results of Section III they are certainly compelling.
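For readers who want to reproduce this style of comparison, the snippet below forms a Gaussian-kernel estimate with a rule-of-thumb bandwidth and computes L1/L2-type errors against a known density. The bandwidth rule shown (Silverman's 1.06 sigma N^{-1/5}) is a common stand-in and not necessarily the exact expression from [51] used by the authors, and the mixture parameters are illustrative.

import numpy as np

rng = np.random.default_rng(5)

def true_density(x):                     # illustrative Gaussian mixture
    return 0.5 * np.exp(-0.5 * ((x + 1.5) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi)) \
         + 0.5 * np.exp(-0.5 * ((x - 1.5) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

# Sample, pick a rule-of-thumb bandwidth, and form the kernel estimate.
data = np.concatenate([rng.normal(-1.5, 0.5, 100), rng.normal(1.5, 0.5, 100)])
h = 1.06 * data.std() * data.size ** (-0.2)
grid = np.linspace(-5, 5, 1000)
kde = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2).sum(axis=1) \
      / (data.size * h * np.sqrt(2 * np.pi))

# L1 and L2 errors of the estimate against the true density.
dx = grid[1] - grid[0]
err = kde - true_density(grid)
L1, L2 = np.abs(err).sum() * dx, np.sqrt((err ** 2).sum() * dx)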



Fig. 4. Stochastic learning of the distribution function (SLC). (a) The cumulative estimate with 100 data points is shown along with the sample and true distribution functions. The lower curve is the neural network estimate, the upper curve is the true distribution function, and the staircase-like curve is the sample distribution function. (b) The density estimate.

The details of the neural network implementation for both the SLC and SIC methods are as follows. We used a one-hidden-layer network with three hidden units. The activation function of the hidden units is , and that of the output unit is erf.5 We trained the network for 10 000 iterations using the conjugate gradient algorithm. We enforced the monotonicity hint by a penalty term with weight 10 000. As we mentioned in the derivation of the convergence rates, we wish to pick the "smoothest" interpolator. We softly enforced this by starting the learning with small weights [4].
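One simple way to realize such a monotonicity hint is sketched below, assuming the network output can be evaluated on an ordered set of hint points; the penalty form and the helper names are illustrative, in the spirit of the hints framework [1], [50] rather than the authors' exact implementation.

import numpy as np

def monotonicity_penalty(net_output, hint_points, weight=10000.0):
    """Penalize decreases of the learned distribution function.

    net_output  : callable mapping an array of x values to network outputs
    hint_points : 1-D array of x values, assumed sorted in increasing order
    weight      : hint weight (10 000 is the value quoted in the text)
    """
    y = net_output(hint_points)
    decreases = np.maximum(0.0, y[:-1] - y[1:])   # positive wherever the estimate decreases
    return weight * np.sum(decreases ** 2)

# The total training objective is then, e.g.,
#   E_total = squared_interpolation_error + monotonicity_penalty(net, hint_points)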

5 erf(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\,dt.

Fig. 5. Smooth interpolation of the distribution function (SIC). (a) The cumulative estimate is shown along with the sample and true distribution functions. The lower curve is the neural network estimate, the upper curve is the true distribution function, and the staircase-like curve is the sample distribution function. (b) The density estimate.

We also implemented the proposed approaches on a real-world example, namely estimating the density of stock price changes (in the log space, i.e., ). Theoretical studies suggest that the logarithm of a stock price follows a Brownian motion [69], and hence the log price difference is Gaussian. However, practical observation has shown that stock prices appear to have densities with fatter-than-Gaussian tails. The accurate measurement of the density of stock price changes is very important, since it can significantly impact the pricing of stock derivative instruments such as options. We implemented our proposed methods on such a problem, by applying them to three representative Dow-component stocks (IBM, GE, and JP Morgan). Fig. 7 shows the estimated densities for the SIC method along with the true histograms and the Gaussian with the same mean and variance. Notice how the true distribution significantly deviates from Gaussian, displaying the well-established fat-tail behavior.

B. Random Variate Generation Techniques

Fig. 6. Comparison of the optimal kernel estimate with the neural-network estimators. Plotted are the true density and the estimates (SLC, SIC, kernel). (a) Density estimation with 100 data points. (b) Density estimation with 200 data points.

1) Learning the Inverse Distribution Function: We generated 5000 data points according to the mixture-of-Gaussians density given by (28). Using these data points, we implemented SICI (Section IV-A) in order to generate data according to the specified density. We used a single hidden layer network with 50 hidden nodes and a linear output node. In order to create singularities at the end points and , we added a tangent function to the output. We enforced the monotonicity of the mapping by using a hint penalty term with a hint weight of 10 000. The training algorithm we used was 10 000 iterations of a conjugate gradient descent algorithm on the squared error, and the data for learning was generated as follows: the -space was gridded into 10 000 points and these grid points served as the outputs. The inputs were computed as , as described in Section IV-A. To test how well was learned, we generated two million points from a uniform distribution and passed them through the network. The output of the network should then be (approximately) distributed according to the density in (28). Fig. 8 shows the sample distribution and histogram of these 2 000 000 network outputs as compared to the desired distribution and density. For comparison, Fig. 9 shows the resulting distribution and density that were obtained by first forming a kernel estimate using the optimal kernel width given by [51, p. 40] and then generating according to this kernel estimate. In practice, the optimal kernel width is not available, but even using the optimal kernel width, we see that SICI appears to outperform the kernel method. Once again, we note that performance on a single test case might not be conclusive, but the theoretical results of Section V provide additional support for SICI.

We also implemented the method described in Section IV-B for learning the inverse distribution function given the functional form in (28). The data was generated as follows: 5000 uniformly spaced points on the interval served as the target outputs (outside the range the density in (28) is essentially zero). The inputs are given by , which are obtained by numerical integration as described in Section IV-B. Once again, to test how well was learned, we generated 2 000 000 points from a uniform distribution and passed them through the network. Fig. 10 shows the sample distribution and histogram of these 2 000 000 network outputs. The results indicate that the network learned almost perfectly. In addition, the resulting density of the network output matched the true density extremely well. This implies that has also been learned well, which might come as a surprise as this derivative was not incorporated as part of the error function. The fact that the neural network could implement the distribution inverse and its derivative simultaneously is a consequence of Theorems B.14 and B.15 and Corollary 3.8.

2) Control Formulation for Random Variate Generation: We implemented the control formulation method described in Section IV-C to generate data according to the mixture-of-Gaussians density in (28). We used the kernel estimator as the density estimation network and a single hidden layer network with 50 hidden nodes and a linear output node as the density shaping network. To create singularities at the end points and , we added a tangent function to the output. Training was performed with 4000 data points using the algorithm proposed in Section IV-C. The inputs to the first network were generated according to a uniform distribution. To test how well the control method performed, two million random variates were generated and their distribution was compared to the true distribution. Fig. 11 shows the results. The results indicate that the generated points obey the target density fairly accurately. The difference between the true and generated densities can be made even smaller by using more training data (increasing and ). In fact, by increasing the number of training data one can get arbitrarily close to the true distribution. This is a consequence of the theoretical results in Section V. Again, constraints on such as monotonicity could be used to improve the learning.

3) Convergence of Density Estimation: To investigate the convergence of the error for our density estimator, we have performed a Monte Carlo experiment to measure the rms estimation error as a function of the number of data points.



Fig. 7. Density estimates for the log stock price changes of a number of U.S. stocks. Shown are the density estimates of the changes in log stock price for IBM, JP Morgan (JPM), and General Electric (GE) using SIC. For each company, about 1530 data points were available. Also shown for comparison are the histogram of the data and a Gaussian with the same mean and variance as the data. The estimated distribution is significantly non-Gaussian. We magnify the tail behavior in (b), (d), and (f). Notice that the true density has a considerably fatter tail.

We used a five-hidden-unit neural network and trained it according to the SIC method. For various numbers of data points, , the resulting density estimation error, , was computed by averaging over 100 runs for each . Shown in Fig. 12 is the behavior of versus , and for comparison, we show the best fit. The optimal linear fit had slope 0.97. For comparison, we present the same simulation (using 1000 runs each) for the kernel estimator using the optimal kernel width and the corresponding linear fit, which had a slope 0.77.
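The quoted slopes are simply the coefficients of linear fits of log error against log N; a minimal way to compute such a slope is shown below (the error values are placeholders, not the experiment's data).

import numpy as np

# Placeholder (N, rms error) pairs; in the experiment these came from the
# Monte Carlo runs described above.
N_values = np.array([50, 100, 200, 400, 800, 1600])
rms_err = np.array([0.10, 0.052, 0.027, 0.014, 0.0074, 0.0039])

# Least-squares fit of log(error) on log(N); the slope is the empirical
# convergence exponent.
slope, intercept = np.polyfit(np.log(N_values), np.log(rms_err), 1)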

Fig. 8. Using SICI for random variate generation from the density in (28). (a) The learned cumulative compared to the true cumulative. (b) The density resulting from the random variate generation compared to the desired density.

VII. DISCUSSION

We have developed two techniques for density estimation based on the idea of learning the cumulative by mapping the data points to a uniform density. In doing so, we placed the density estimation problem, traditionally an unsupervised learning problem, within the supervised learning framework. We focused on implementations using multilayer networks, because multilayer networks have certain desirable properties such as the universal approximation theorems. However, it should be noted that any sufficiently general function class could be used in conjunction with our methods.

Two techniques were presented, a stochastic technique (SLC), which is expected to inherit the characteristics of most stochastic iterative algorithms, and a deterministic technique (SIC). We showed that in the limit of infinite training, SLC "converges" to SIC in the sense that they produce solutions to the same local minimization problem. In reality, however, one cannot train forever, and hence these two algorithms will behave differently. SLC tends to be slow in practice; however, because each set of targets is drawn from the uniform distribution, it is anticipated to have a smoothing/regularizing effect; this can be seen by comparing SLC and SIC in Fig. 6(a). A similar outcome is obtained by adding small random perturbations to the targets of SIC.

Fig. 9. Using a Gaussian kernel technique for random variate generation from the density in (28). (a) The learned cumulative compared to the true cumulative. (b) The density resulting from the random variate generation compared to the desired density.

Our simulations demonstrated that our techniques performed well in comparison to the kernel estimator using the optimal kernel width. Further, there is no smoothing parameter that needs to be chosen. Smoothing occurs naturally by picking the interpolator with the lowest bound for a certain derivative. In our simulations, we enforced smoothing by starting the network at small weights, which for practical purposes seemed sufficient. For our methods, the majority of time is spent in the learning phase, but once learning is done, evaluating the density is fast. We used our methods to demonstrate the fat-tailed behavior in the stock markets: price changes in the stock markets obey a distribution that has a fatter-than-Gaussian tail.

Fig. 10. Learning the inverse distribution function for random variate generation from the density in (28), as described in Section IV-B. (a) The learned cumulative compared to the true cumulative. (b) The density resulting from the random variate generation compared to the desired density.

We then developed techniques for nonuniform random variate generation that are applicable to the generation of random variates from a density represented by a set of data points (SLCI or SICI) or from a density whose functional form is given (learning of the inverse distribution). We presented a second method based on a control formulation. Our methods are applicable to a general class of densities, and the generation of random variates is fast once the learning is performed (in comparison with MCMC techniques, where the time needed for generation can be long depending on the accuracy required). The learning phase is the rate-limiting step, but this becomes insignificant when one needs to generate millions of points, many times.

We have also provided convergence results applicable to the use of generalized distribution functions. Thus, we have laid a theoretical foundation upon which our methods stand. The conditions that we required of the true distribution function are not unreasonable from the practical point of view. We see that the convergence rate for the density estimation approaches for smooth densities. We speculate that it is possible to obtain a better convergence rate by a factor of . Our theoretical convergence results are near optimal and were obtained without any restrictions being placed on the support of the true distribution. However, for practical purposes, one is usually interested in some compact set.

Fig. 11. Control formulation for random variate generation from the density in (28). (a) The learned cumulative compared to the true cumulative. (b) The density resulting from the random variate generation compared to the desired density.

Fig. 12. Convergence of the density estimation error to zero. Plotted are the estimation errors versus N on a log–log scale. For comparison, also shown are the best N fit (for SIC) and the best N fit for the kernel estimate. Note that the figure plots the mean square error.

Extensions along the following lines are possible: why restrict oneself to a uniform density? One could map to any standard density (say) . For example, one can replace the uniform targets for SLC by targets drawn from (say) a Gaussian density. Let the network function being implemented be , and let the mapping that converts a Gaussian random variable to a uniform be . This is just the error function . Then the composite mapping, , is the required estimate for , from which the density can be obtained by differentiation with respect to . For SIC, the targets would be , the th order statistic of . Thus our methods could be extended by mapping to any standard density for which is known analytically. The same comments apply to the random variate generation process. The advantage of doing this can be seen by supposing that the true density is close to a Gaussian or some other standard density. Then, we let the mapping "do the bulk of the work," and the neural network has to learn only the deviation of the true density from the standard density. This mapping could be considerably simpler to learn, as it will by assumption be close to the identity mapping.6
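In generic notation (illustrative symbols, not necessarily the authors'), if Phi denotes the standard Gaussian distribution function and h the learned network mapping, the composite estimate of the distribution and the implied density are

\[
\hat G(x) \;=\; \Phi\bigl(h(x)\bigr),
\qquad
\hat g(x) \;=\; \frac{d\hat G(x)}{dx} \;=\; \phi\bigl(h(x)\bigr)\,h'(x),
\]

where phi is the standard Gaussian density; when the true density is itself close to Gaussian, h is close to the identity.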

APPENDIX A

WEIGHT UPDATES FOR CONTROL FORMULATION

In this part of the Appendix, we present the exact weight update rules for the control algorithm in Section IV-C for the case where the density estimator is either a neural network trained according to SIC or a kernel density estimator. The steps of the algorithm are given in Section IV-C. The only change would be the weight update step, as will be described next.

A. Neural Network Density Estimator

The error gradient can be written as

(29)

where could be either the true distribution or density. When using SIC as the density estimator, is a minimizer of the error function . Define the matrix of second derivatives (the Hessian, ) of with respect to and the function by

(30)

which are functions of and . After some algebra, we finally get

(31)

and the weight updates are given by and , where the summation is over the weights in .

6 A technicality in the proof of convergence with probability 1 requires that the random targets be bounded with probability 1. However, we can ignore this technical requirement without significant practical loss.

B. Gaussian Kernel Density Estimator

In this case let , the desired density. Let be the density on the "test points." The error function is

(32)

where now is given by

(33)

where is appropriately chosen to give a good density estimator on the points. If are the weights of , then . A straightforward computation shows that

(34)

and the weight update is given by.
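For orientation, in generic notation (illustrative symbols, and assuming the error takes the squared form over the test points as in (32)), with kernel estimate \(\hat q(t) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\left(\frac{t - y_i}{h}\right)\), target density \(g\), test points \(t_j\), and \(y_i\) the outputs of Network 1 with weights \(w\), the chain rule gives a gradient of the form

\[
\frac{\partial E}{\partial w}
= \sum_{j} 2\bigl(\hat q(t_j) - g(t_j)\bigr)
  \sum_{i=1}^{N} \left(-\frac{1}{N h^{2}}\right) K'\!\left(\frac{t_j - y_i}{h}\right)
  \frac{\partial y_i}{\partial w}.
\]

This is a sketch of the kind of expression summarized by (34), not a transcription of it.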

APPENDIX B

PROOFS FOR CONVERGENCE THEOREMS

A. Convergence of the Density Estimation Techniques

Theorem B.1: Let the data be generated according to the true distribution . Then, SLC converges with probability 1 to a local minimum of the error function

(35)

provided that the learning rate is a decreasing sequence that satisfies the following conditions:
a) ;
b) , for some ;
c) .
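Conditions a)–c) are the usual decreasing-learning-rate requirements of stochastic approximation. For orientation, a classical set of such conditions (shown here as a representative example, not a transcription of a)–c)) is

\[
\text{a) } \sum_{t=1}^{\infty} \eta_t = \infty;
\qquad
\text{b) } \sum_{t=1}^{\infty} \eta_t^{\,p} < \infty \ \text{for some } p;
\qquad
\text{c) } \eta_t \downarrow 0 .
\]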

Proof: We first summarize some aspects of the theory of stochastic approximation that will be the main tool in proving the theorem.7 A general stochastic iterative algorithm is given by , where is a sequence of independent random variables (Ljung considers a more general form, where is given by an iteration in terms of the random input and ). The algorithm converges with probability 1 (w.p. 1) only to a stable stationary point of the ordinary differential equation

(36)

if the following assumptions are satisfied (see [35]):

A1: is bounded w.p. 1.

7 Stochastic approximation deals with the convergence analysis of iterative algorithms where the inputs are random variables. We follow the analysis of Ljung [35], though other approaches that have different sets of assumptions, such as Kushner and Clark [33], could also be used. For more details on stochastic algorithms in general, see [9].


A2: The function is continuously differentiable with regard to and , and the derivatives, for fixed , are bounded in .
A3: is a decreasing sequence, and
• ;
• , for some ;
• .

A4: exists for all .
By observing the proof of convergence in [35], it can be shown that condition A2 can be replaced by the following condition:

A2′: is locally Lipschitz continuous in and , meaning that , for , for some , and , for , where is the ball centered at with radius . Moreover, the Lipschitz constant is a bounded function of for fixed and bounded .
To see why this is the case, we note that condition A2 is used in the proof in [35] only in Lemma 2, p. 569. There, the equivalent condition A2′ can be used in its place.

The update equation for SLC is given by , where is the vector of 's that are presented for training cycle , and

(37)

Note that is fixed and considered constant in the proof, because it does not change from one cycle to the next. The first condition, the independence of successive vectors, is satisfied by the construction of the algorithm. We now verify that conditions A1, A2, A3, and A4 are satisfied.

• Condition A1 is satisfied, because is generated from a uniform density, and hence it is bounded.

• is a piecewise (with respect to ) differentiable function of and (assuming standard sigmoidal neural networks). Further, the derivatives are bounded in for fixed because is bounded in . Therefore, is locally Lipschitz. In addition, the local Lipschitz constant is bounded in time (for fixed ) as the derivatives are bounded in time for fixed . Hence condition A2′ is satisfied.

• Condition A3 is equivalent to the conditions of the theorem.

• Let us consider Condition A4:

(38)

The variables are the order statistics of a sample of size drawn from a distribution that is uniform in . The density of the order statistic can be found in most texts on probability theory, e.g., [10]. We get

(39)

which is a beta distribution. The first moment can be obtained in a straightforward manner as . Substituting in (38), we see that the limit exists and is independent of for all .
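The standard fact being invoked here: if \(z_{(k)}\) denotes the \(k\)th order statistic of \(N\) i.i.d. uniform\([0,1]\) variables, its density and mean are (in generic notation)

\[
f_{z_{(k)}}(z) \;=\; \frac{N!}{(k-1)!\,(N-k)!}\; z^{k-1}(1-z)^{N-k},
\qquad
\mathbb{E}\bigl[z_{(k)}\bigr] \;=\; \frac{k}{N+1}.
\]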

Thus, all conditions of [35] are satisfied; therefore, we conclude that the algorithm converges w.p. 1 to the stable stationary points of the following ordinary differential equation:

(40)

which are precisely the local minima of the error function (35), because the right-hand side of (40) is the negative gradient of the error function in (35).8

Theorem B.2 ( Convergence to the True Distribution): Let the data set consist of data points, drawn i.i.d. from the true distribution . For every and every , the inequality

(41)

holds, where

(42)

Proof: Let ,

(43)

Now by definition, and . Let . Because and are monotonically increasing, we have the inequalities and therefore

(44)

Thus, using the well-known -inequality with and , we have

(45)

8 Observing this proof, it should be clear that if the hint term had been kept in (2), the proof would have proceeded in exactly the same manner, resulting in convergence to the error function in (2). If we require that the solution be monotonic, then the hint term must be zero; therefore, the function converged to must be a minimum of the remaining term, which is precisely the error function (35). Ultimately, the same final result is obtained.


(46)

(47)

Combining this inequality with (15), we get

(48)

where the last inequality follows by maximizing over and using the bound for the 's. Thus we find

(49)

(50)

The last line follows because , since is a distribution function.

The previous theorem can easily be extended to obtain the convergence.

Theorem B.3 ( Convergence to the True Distribution): Let the data set consist of data points, drawn i.i.d. from the true distribution . For every and every , the inequality

(51)

holds, where

(52)

Proof: We refer the reader to an accompanying technical report for the proof [36].

Thus we have convergence at a rate for . Let denote a sequence of functions in for increasing data set sizes . Define the statistics

(53)

(54)

which are analogous to (9) and (10). The following theorems are valid.

Theorem B.4 ( Convergence to the True Distribution):

(55)

where we assume that exists.
Proof: We refer the reader to an accompanying technical report for the proof [36].
Theorem B.5:

(56)

where we assume that exists.
Proof: We refer the reader to an accompanying technical report for the proof [36].
To prove convergence to the true density, we will need the following results relating the higher derivatives of a function to lower-order derivatives. In the following, we use the notation to denote the th derivative of .
Lemma B.6: Let a twice differentiable function be defined on the interval , where the lower limit, , could be . Suppose , , and , where the is taken over . Then

(57)

Proof: Let . Then for , by Taylor's theorem, we have
(58)
for some . Therefore, . This inequality holds for every , in particular for the that minimizes the right-hand side. This minimum is attained at and so we have the bound . Thus .
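The bound established by this argument is of the classical Landau type. In generic notation (symbols illustrative): if \(|f(x)| \le A\) and \(|f''(x)| \le B\) on the interval, the optimization over the step \(h\) sketched above gives

\[
|f'(x)| \;\le\; \min_{h>0}\left(\frac{2A}{h} + \frac{hB}{2}\right)
\;=\; \left.\frac{2A}{h} + \frac{hB}{2}\right|_{h = 2\sqrt{A/B}}
\;=\; 2\sqrt{AB}.
\]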

Corollary B.7: Let be differentiable times on the interval . Let for . Then
(59)
for .
Proof: Let in Lemma B.6 be .

Corollary B.8: Let be differentiable times on the interval . Let for . Then

(60)

Proof: We prove the corollary by induction on . For , the claim is equivalent to Lemma B.6. Suppose (60) holds for . We show that it holds for . From Lemma B.6 we have . By applying the induction hypothesis to we have the following equation: . Combining these two equations, we find that , and thus (60) holds for . This concludes the proof.


We assume that the true distribution function has bounded derivatives up to order . Then, in the asymptotic limit , one expects that with probability 1, an interpolator should exist with bounded derivatives with the same bounds, because in the asymptotic limit, itself becomes an interpolator. To proceed in a more formal manner, let ,

and define by

(61)

for fixed . Note that by definition, for all such that . is the lowest possible bound on the th derivative for the -approximate sample distribution functions given a particular data set. In a sense, the "smoothest" approximating sample distribution function with respect to the th derivative has an th derivative bounded by . One expects that , at least in the limit . This is the content of the next lemma. Let .
Lemma B.9: Let . Then, for all

(62)

Proof: This is a straightforward application of Theorem 3.2 as follows. Suppose the lemma were false. Then there exists such that . Therefore, with probability , as , as if then . Thus, with probability

(63)

as (the arises because our distribution function differs from at by at most ). Thus, with probability , implying that with probability , contradicting Theorem 3.2.
Theorem B.10 ( Convergence in Probability to the True Density): Let data points be drawn i.i.d. from the distribution . Let for , where . Let and fix . Let . Let be a -approximate distribution function with (by the definition of , such a -approximate distribution function must exist). Then, the inequality

(64)

where

(65)

holds with probability 1, as . By this is meant

(66)

Proof: Let and let . By Corollary B.8
(67)
where and by construction, . Therefore

If we can show that the right-hand side tends to one as , then we are done. We look at the probability of the complementary event and show that it tends to zero. We can bound by
OR
which in turn can be bounded by the sum of the probabilities of the individual events. By Lemma B.9, as , with probability 1, so . Similar reasoning to that which led to (44) can now be applied to show that for

(68)

therefore

(69)

thus, we see that

(70)

because , by an application of Theorem 3.2. Therefore we have shown that


thus proving the theorem.
The following is an immediate corollary of the theorem.
Corollary B.11 ( Convergence to the True Density): Let data points be drawn i.i.d. from the distribution . Let for , where . Let and fix . Let . Let be a -approximate distribution function with (by the definition of , such a -approximate sample distribution function must exist). Then, for any , as ,
(71)
where is as defined in Theorem 3.6.

B. Convergence of the Random Variate Generation Techniques

In this section, we state the theorems relating to the convergence of the random variate generation techniques. We omit the proofs, as they are similar in flavor to the analogous theorems for the density estimates. For the details, the reader is referred to the technical report [36].

Theorem B.12: Let the data be generated according to the true distribution . Then SLCI converges with probability 1 to a local minimum of the error function

(72)

(where the expectation is with respect to , the order statistics of a uniform distribution), provided that the learning rate is a decreasing sequence that satisfies the following conditions:
a) ;
b) , for some ;
c) .

Theorem B.13: Let the data be generated according to . If the conditions of Theorem B.12 hold, then, in the limit , SLCI converges with probability 1 to a minimum of the error function

(73)

Theorem B.14 ( Convergence to ): Let the data set consist of data points, drawn i.i.d. from the (true) distribution . Assume that is continuous on the closed interval and let its derivative be bounded by . For every generalized inverse distribution function that is zero outside the range of , and for any continuous measure on (with bound on ), the inequality

(74)

holds, where

(75)

Theorem B.15 (Convergence in Probability to the True Density): Let data points be drawn i.i.d. from the distribution . Let for , where , and let . Fix and let . Let , where represents a generalized -approximate inverse distribution function and the corresponding distribution function. Let be a generalized -approximate inverse distribution function with (by the definition of , such a -approximate inverse distribution function must exist). Then, the inequality

(76)

where

(77)

holds with probability 1, as . By this is meant

(78)

ACKNOWLEDGMENT

The authors would like to acknowledge the helpful comments of Dr. K. Hornik, Y. Abu-Mostafa, and the Caltech Learning Systems Group.

REFERENCES

[1] Y. Abu-Mostafa, "Hints," Neural Comput., vol. 4, no. 7, pp. 639–671, 1995.
[2] T. Adali, X. Liu, and K. Sönmez, "Conditional distribution learning with neural networks and its application to channel equalization," IEEE Trans. Signal Processing, vol. 45, pp. 1051–1064, 1997.
[3] R. A. Adams, Sobolev Spaces. New York: Academic Press, 1975.
[4] A. Atiya and C. Ji, "How initial conditions affect generalization performance in large networks," IEEE Trans. Neural Networks, vol. 8, pp. 448–451, Mar. 1997.
[5] A. Atiya, "On the required size of multilayer networks for implementing real-valued functions," in Proc. IEEE World Congr. Comput. Intell., 1994.
[6] A. Barron and T. Cover, "Minimum complexity density estimation," IEEE Trans. Inform. Theory, vol. 37, pp. 1034–1054, July 1991.
[7] A. Barron, L. Györfi, and E. van der Meulen, "Distribution estimation consistent in total variation and in two types of information divergence," IEEE Trans. Inform. Theory, vol. 38, pp. 1437–1454, Sept. 1992.
[8] A. R. Barron and C.-H. Sheu, "Approximation of density functions by sequences of exponential families," Ann. Statist., vol. 19, no. 3, pp. 1347–1369, 1991.
[9] A. Benveniste, M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations. New York: Springer-Verlag, 1990.


[10] P. Billingsley, "Probability and measure," in Wiley Series in Probability and Mathematical Statistics. New York: Wiley, 1986.
[11] C. M. Bishop, "Mixture density networks," Neural Comput. Res. Group Rep. NCRG/4288, 1994.
[12] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[13] C. M. Bishop and C. Legleye, "Estimating conditional probability densities for periodic variables," Advances Neural Inform. Processing Syst., vol. 7, 1995.
[14] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard, "Density estimation by wavelet thresholding," Ann. Statist., vol. 24, no. 2, pp. 508–539, 1996.
[15] J. Bretagnolle and C. Huber, "Estimation des densités: Risque minimax," Z. Wahrsch. Verw. Gebiete, vol. 47, pp. 119–137, 1979.
[16] B. R. Crain, "Estimation of distributions using orthogonal expansions," Ann. Statist., vol. 2, pp. 454–463, 1974.
[17] E. Csáki, "An iterated logarithm law for semimartingales and its application to the empirical distribution function," Studia Sci. Math. Hung., vol. 3, pp. 287–292, 1968.
[18] J. Cwik and J. Koronacki, "Probability density estimation using a Gaussian clustering algorithm," Neural Comput. Applicat., vol. 4, no. 3, pp. 149–160, 1996.
[19] G. M. de Montricher, R. A. Tapia, and J. R. Thompson, "Nonparametric maximum likelihood estimation of probability densities by penalty function methods," Ann. Statist., vol. 3, pp. 1329–1348, 1975.
[20] L. Devroye, Non-Uniform Random Variate Generation. New York: Springer-Verlag, 1986.
[21] L. Devroye and L. Györfi, "No empirical probability measure can converge in total variation sense for all distributions," Ann. Statist., vol. 18, pp. 1496–1499, 1990.
[22] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[23] S. Y. Efroimovich and M. S. Pinsker, "Estimation of square integrable probability density of a random variable," Problems Inform. Transmission, vol. 18, pp. 175–189, 1983.
[24] H. Finkelstein, "The law of the iterated logarithm for empirical distributions," Ann. Math. Statist., vol. 42, no. 2, pp. 607–615, 1971.
[25] K. Fukunaga and L. D. Hostetler, "Optimization of k-nearest neighbor density estimates," IEEE Trans. Inform. Theory, vol. 19, pp. 320–326, 1973.
[26] A. Gersho and R. Gray, Vector Quantization and Signal Compression. Boston, MA: Kluwer, 1992.
[27] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
[28] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Networks, vol. 3, pp. 551–560, 1990.
[29] K. Hornik, M. Stinchcombe, H. White, and P. Auer, "Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives," Neural Comput., vol. 6, pp. 1262–1275, 1994.
[30] D. Husmeier and J. Taylor, "Predicting conditional probability densities: Improved training scheme combining EM and RVFL," Neural Networks, vol. 11, no. 1, pp. 89–116, Jan. 1998.
[31] G. Johnson, "Construction of particular random processes," Proc. IEEE, vol. 82, no. 2, pp. 270–281, Feb. 1994.
[32] D. Knuth, The Art of Computer Programming. Redwood City, CA: Addison-Wesley, 1981, vol. 1.
[33] H. Kushner and D. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag, 1978.
[34] R. Larsen and M. Marx, An Introduction to Mathematical Statistics. Englewood Cliffs, NJ: Prentice-Hall, 1986.
[35] L. Ljung, "Analysis of recursive stochastic algorithms," IEEE Trans. Automat. Contr., vol. 22, pp. 551–575, Aug. 1977.
[36] M. Magdon-Ismail and A. F. Atiya, "Some results regarding the estimation of densities and random variate generation using multilayer networks," California Inst. Technol., 2000, submitted for publication.
[37] D. Martinez, "Neural tree density estimation for novelty detection," IEEE Trans. Neural Networks, vol. 9, pp. 519–523, 1994.
[38] E. Masry, "Probability density estimation from dependent observations using wavelets orthonormal bases," Statist. Probability Lett., vol. 21, pp. 181–194, 1994.
[39] G. Miller and D. Horn, "Probability density estimation using entropy maximization," Neural Comput., vol. 10, no. 7, pp. 1925–1938, 1998.
[40] D. Modha and Y. Fainman, "A learning law for density estimation," IEEE Trans. Neural Networks, vol. 5, pp. 519–523, May 1994.
[41] D. S. Modha and E. Masry, "Rate of convergence in density estimation using neural networks," Neural Comput., vol. 8, pp. 1107–1122, 1996.
[42] K. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4–27, 1990.
[43] E. Parzen, "On the estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, pp. 1065–1076, 1962.
[44] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical Recipes. Cambridge, U.K.: Cambridge Univ. Press, 1988.
[45] J. Rissanen, "Density estimation by stochastic complexity," IEEE Trans. Inform. Theory, vol. 38, pp. 315–323, 1992.
[46] Z. Roth and Y. Baram, "Multidimensional density shaping by sigmoids," IEEE Trans. Neural Networks, vol. 7, pp. 1291–1298, Sept. 1996.
[47] H. Schioler and P. Kulczyki, "Neural network for estimating conditional distributions," IEEE Trans. Neural Networks, vol. 8, pp. 1015–1025, Sept. 1997.
[48] I. J. Schoenberg, "Splines and histograms," in Spline Functions and Approximation Theory. Basel, Switzerland, 1973, vol. 21, pp. 277–327.
[49] R. Serfling, Approximation Theorems of Mathematical Statistics, ser. Wiley Series in Probability and Mathematical Statistics. New York: Wiley, 1980.
[50] J. Sill, "Monotonic networks," in Advances in Neural Information Processing Systems (NIPS), 1998, vol. 10.
[51] B. Silverman, Density Estimation for Statistics and Data Analysis. London, U.K.: Chapman and Hall, 1993.
[52] B. W. Silverman, "On the estimation of a probability density function by the maximum penalized likelihood method," Ann. Statist., vol. 10, no. 3, pp. 795–810, 1982.
[53] A. Sinclair, Algorithms for Random Generation and Counting: A Markov Chain Approach. Boston, MA: Birkhauser, 1993.
[54] P. Smyth and D. Wolpert, "Stacked density estimation," in Advances in Neural Information Processing Systems (NIPS). Cambridge, MA: MIT Press, 1998, vol. 10.
[55] C. J. Stone, "Optimal uniform rate of convergence for nonparametric estimators of a density function or its derivatives," in Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday, M. H. Rezvi, J. S. Rustagi, and D. Siegmund, Eds. New York: Academic, 1983, pp. 393–406.
[56] C. J. Stone, "An asymptotically optimal window selection rule for kernel density estimates," Ann. Statist., vol. 12, no. 4, pp. 1285–1297, 1984.
[57] C. J. Stone, "Large sample inference for log-spline models," Ann. Statist., vol. 18, no. 2, pp. 717–741, 1990.
[58] J. Struckmeier, "Generation of random variates using asymptotic expansions," Computing, vol. 59, no. 4, pp. 331–347, 1997.
[59] M. Thathachar and M. Arvind, "Global Boltzmann perceptron network for on-line learning of conditional distributions," IEEE Trans. Neural Networks, vol. 10, pp. 1090–1098, Sept. 1999.
[60] A. N. Tikhonov and V. I. Arsenin, Solutions of Ill-Posed Problems, ser. Scripta Series in Mathematics, F. John, Ed. Winston, NY: Halsted, 1977.
[61] "Special issue on conditional density forecasting in economics and finance," J. Forecasting, 2000, to be published.
[62] H. Traven, "A neural network approach to statistical pattern classification by semiparametric estimation of probability density functions," IEEE Trans. Neural Networks, vol. 2, pp. 366–377, May 1991.
[63] M. van Hulle, "Topographic map formation by maximizing unconditional entropy: A plausible strategy for on-line unsupervised competitive learning and nonparametric density estimation," IEEE Trans. Neural Networks, vol. 7, pp. 1299–1305, Sept. 1996.
[64] H. van Trees, Detection, Estimation, and Modulation Theory: Part I. New York: Wiley, 1968.
[65] V. N. Vapnik, Statistical Learning Theory, ser. Adaptive and Learning Systems for Signal Processing, Communications and Control. New York: Wiley, 1998.
[66] G. Wahba, "Histosplines with knots which are order statistics," J. Roy. Statist. Soc. B, vol. 38, pp. 140–151, 1976.


[67] A. Weigend and A. Srivastava, "Predicting conditional probability distributions: A connectionist approach," Int. J. Neural Syst., vol. 6, no. 2, pp. 109–118, 1995.
[68] P. Williams, "Using neural networks to model conditional multivariate densities," Neural Comput., vol. 8, pp. 843–854, 1996.
[69] P. Wilmott, S. Howison, and J. Dewynne, The Mathematics of Financial Derivatives. Cambridge, U.K.: Cambridge Univ. Press, 1995.
[70] A. Zeevi and R. Meir, "Density estimation through convex combinations of densities: Approximation and estimation bounds," Neural Networks, vol. 10, no. 1, pp. 99–110, Jan. 1997.

Malik Magdon-Ismail (S'98–M'99) received the B.S. degree in physics from Yale University, New Haven, CT, in 1993 and the Master's degree in physics and the Ph.D. degree in electrical engineering with a minor in physics from the California Institute of Technology (Caltech), Pasadena, in 1995 and 1998, respectively.
From 1998 to 2000, he was a Research Fellow in the Learning Systems Group at Caltech. He is currently an Assistant Professor of Computer Science at Rensselaer Polytechnic Institute (RPI), Troy, NY. He is a founding faculty member of the Machine and Computational Learning Group (MaCL) at RPI. His research interests include the theory and applications of machine and computational learning (supervised, reinforcement, and unsupervised), computational finance, and bioinformatics. He has numerous publications in refereed journals and conferences. He has been a financial consultant and has collaborated with a number of companies.

Amir Atiya (S'86–M'91–SM'97) was born in Cairo, Egypt, in 1960. He received the B.S. degree from Cairo University in 1982 and the M.S. and Ph.D. degrees in 1986 and 1991 from the California Institute of Technology (Caltech), Pasadena, all in electrical engineering.
From 1985 to 1990, he was a Teaching and Research Assistant at Caltech. From September 1990 to July 1991, he held a Research Associate position at Texas A&M University, Lubbock. From July 1991 to February 1993, he was a Senior Research Scientist with QANTXX, Houston, TX, a financial modeling firm. In March 1993, he joined the Computer Engineering Department at Cairo University as an Assistant Professor. He spent the summer of 1993 with Tradelink Inc., Chicago, IL, and the summers of 1994, 1995, and 1996 with Caltech. From March 1997 to June 2001, he was a Visiting Associate in the Department of Electrical Engineering at Caltech. In addition, from January 1998 to January 2000, he was a Research Scientist at Simplex Risk Management, Inc., Hong Kong. He is currently a Quantitative Analyst at Countrywide Capital Markets, Calabasas, CA. His research interests are in the areas of neural networks, learning theory, pattern recognition, Monte Carlo methods, and time series prediction. His most recent interests are the application of learning theory and computational methods to finance.
Dr. Atiya received the Egyptian State Prize for Best Research in Science and Engineering in 1994. He also received the Young Investigator Award from the International Neural Network Society in 1996. He has been an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS since 1998. He was a Guest Editor of a special issue of the IEEE TRANSACTIONS ON NEURAL NETWORKS on Neural Networks in Financial Engineering. He has served on the organizing and program committees of several conferences, the most recent of which are Computational Finance CF2000, London, U.K., and CIFER'03, Hong Kong, for which he is the Program Co-Chair.