
Biol. Cybern. 77, 49–61 (1997) Biological Cybernetics, © Springer-Verlag 1997

Nonparametric density estimation and regression achieved with topographic maps maximizing the information-theoretic entropy of their outputs

Marc M. Van Hulle

K.U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, Herestraat 49, B-3000 Leuven, Belgium

Received: 12 August 1996 / Accepted in revised form: 9 April 1997

Abstract. We introduce an unsupervised competitive learning rule, called the extended Maximum Entropy learning Rule (eMER), for topographic map formation. Unlike Kohonen's Self-Organizing Map (SOM) algorithm, the presence of a neighborhood function is not a prerequisite for achieving topology-preserving mappings, but instead it is intended: (1) to speed up the learning process and (2) to perform nonparametric regression. We show that, when the neighborhood function vanishes, the neural weight density at convergence approaches a linear function of the input density, so that the map can be regarded as a nonparametric model of the input density. We apply eMER to density estimation and compare its performance with that of the SOM algorithm and the variable kernel method. Finally, we apply the 'batch' version of eMER to nonparametric projection pursuit regression and compare its performance with that of back-propagation learning, projection pursuit learning, constrained topological mapping, and the Heskes and Kappen approach.

1 Introduction

Kohonen's Self-Organizing (feature) Map (SOM) (Kohonen 1982, 1995) is a biologically inspired algorithm aimed at establishing, in an unsupervised way, a mapping from a d-dimensional space V of input signals v onto an equal or lower-dimensional discrete lattice A of N formal neurons. The critical factor in generating topology-preserving mappings with the SOM algorithm is the use of a neighborhood function for updating the neuron weights. As a result of the latter, neighboring neurons cooperate and specialize for similar input signals, and the lattice organizes into an orderly, topology-preserving state according to the statistical properties of the input signals.

Due to this structured representation of the input data, the SOM algorithm has seen a wide range of statistical applications such as clustering and pattern recognition, vector quantization, density estimation and regression (for overviews see: Ritter et al. 1992; Kohonen 1995). The converged topographic map is intended to capture the principal dimensions of the input space and has been regarded as a discrete, nonparametric model of the input probability density (Ritter et al. 1992; Mulier and Cherkassky 1995; Kohonen 1995). There is at least one serious problem with density estimation, however: when the neighborhood function has vanished, the SOM algorithm converges towards a mapping which minimizes the mean squared error (MSE) distortion between the input samples v and the N weight vectors $w_i$, quantizing V-space into N disjoint (Voronoi) partitionings. [In fact, the 'batch' SOM algorithm is similar to the LBG algorithm (Linde et al. 1980) for building scalar and vector quantizers, except for the neighborhood function (Luttrell 1991).] For the SOM algorithm, this implies that the weight density $p(w_i)$ at convergence is proportional to $p^{2/3}(v)$ in the one-dimensional case, in the limit of an infinite density of neighbor neurons (continuum approximation) (Ritter and Schulten 1986). For a discrete map in d-dimensional space, the weight density is expected to be proportional to $p^{1/(1+2/d)}(v)$ for large N and for minimum MSE quantization (Gersho 1979). However, for density estimation purposes, the weight density should be a linear function of the input density. Kohonen's SOM algorithm is unable to achieve this since it tends to undersample high-probability regions and oversample low-probability ones.

When applied to regression analysis, the neurons of a topographic map are regarded as 'knots' joining piecewise smooth functions or splines. Since these knots are dynamically allocated, one expects that the SOM algorithm will improve regression performance: dynamic knot allocation improves regression performance but is computationally very hard to achieve in traditional regression procedures (Friedman and Silverman 1989). Contrary to what is expected, the SOM algorithm performs poorly on regression problems of even low dimensionality. This is mainly due to the fact that the topology-preserving ordering aimed for in the d-dimensional space of the data points may be lost after projection onto the (d − 1)-dimensional subspace of the independent variables (nonfunctional mapping). The solution proposed by Ritter and Schulten (1989) is to alleviate the occurrence of nonfunctional mappings by assuming that the multi-variable function to be fitted can be decomposed into a number of single-variable functions. Cherkassky and Lari-Najafi (1991) modified the original SOM algorithm for the same reason by constraining the learning process (Constrained Topological Mapping, CTM): the best matching neuron ('winner') is defined in the subspace of independent variables, but the weight updates are performed in the space of the data points. Recently, Heskes and Kappen (1995) have combined projection pursuit regression with a modified version of the original SOM algorithm. According to them, nonfunctional mappings are avoided by introducing a noise process for smoothing the dependent variable; however, they do not provide a formal guarantee for this. Furthermore, since the neuron weights are dynamically allocated so that an MSE distortion metric is minimized, and since spline functions are used for joining the neuron weights, there is an immediate connection with the smoothing spline projection pursuit regression approach (Roosen and Hastie 1994).

A more direct way to density estimation with topographic maps is to optimize an information-theoretic criterion instead of a distortion criterion. Recently, we introduced an 'on-line' learning rule, called the Maximum Entropy learning Rule (MER) (Van Hulle 1995, 1997), which achieves an equiprobabilistic topographic map and hence maximizes the unconditional information-theoretic entropy of the map's outputs. As a result of this, the weight density at convergence is a linear function of the input density, so that the map can be regarded as a nonparametric model of the input density. In the present article we will prove this point formally by using a continuum approximation. We apply MER to density estimation and compare its performance with that of the SOM algorithm and the variable kernel density estimation technique of Breiman et al. (1977). Furthermore, we will extend MER with a neighborhood function (eMER); however, unlike the SOM algorithm, the presence of this neighborhood function is not a prerequisite for achieving topology-preserving mappings but, rather, it serves two purposes: (1) to speed up the learning process, and (2) to perform nonparametric regression analysis. We apply the 'batch' version of eMER to nonparametric projection pursuit regression and compare the performance achieved with that of back-propagation learning, projection pursuit learning (Friedman and Stuetzle 1981), constrained topological mapping (Cherkassky and Lari-Najafi 1991), and the Heskes and Kappen approach (Heskes and Kappen 1995). The use of a neighborhood function in eMER has a smoothing effect on the topographic map and improves the regression performance, but the underlying motivation differs from that of the smoothing spline approach (Roosen and Hastie 1994) in the nature of the cost functions minimized and in the explicit use of a smoothness parameter in the smoothing spline case.

2 Extended Maximum Entropy Learning Rule

Consider a d-dimensional rectangular lattice A (Fig. 1A). To each of the N nodes of the lattice corresponds a formal neuron i with weight vector $w_i = [w_{i1}, ..., w_{id}]$. Traditionally, the formal neurons quantize the input space V, with probability density function (p.d.f.) p(v), v ∈ V, into a discrete number of partition cells or quantization regions using the nearest-neighbor rule (Voronoi or Dirichlet tessellation).

In our case, the definition of quantization region is different. Assume for now that V and A have the same dimensionality d; we will relax this assumption later on. (For a particular scheme to determine quantization region membership, see the Appendix.) Since the lattice topology is rectangular, the N nodes of the lattice define d-dimensional quadrilaterals (Fig. 1B) and, by the weights of these nodes, each quadrilateral represents a quantization region, quantizing the input space V. For example in Fig. 1B, the quadrilateral labeled $H_e$ is defined by the neurons i, j, k, m, and the weights of these neurons delimit a 'granular' (closed) quantization region in V-space. We also consider the quadrilaterals facing the outer border of the lattice, which represent the 'overload' (open) quantization regions (Fig. 1B). For example, quadrilateral $H_d$ is defined by extending the links (edges) lj and km leftwards (dashed lines). (For simplicity, we will make no further distinction between a quadrilateral and its corresponding quantization region.) Following the definition of a quadrilateral, every neuron of A is a vertex common to $2^d$ adjacent quadrilaterals. For example in Fig. 2A, neuron j is a vertex common to four quadrilaterals $H_a$, $H_b$, $H_c$, $H_d$: the corresponding quantization regions define neuron j's 'receptive field' (shaded area in Fig. 2A). Conversely, each quadrilateral belongs to the receptive fields of several neurons: e.g. quadrilateral $H_c$ belongs to the receptive fields of neurons i, j, k, l. Hence, receptive fields overlap.

In order to formalize our learning rule, we associate with each quadrilateral a code membership function:

$$\mathbb{1}_{H_j}(v) = \begin{cases} \dfrac{2^d}{n_{H_j}} & \text{if } v \in H_j \\ 0 & \text{if } v \notin H_j \end{cases} \qquad (1)$$

with $n_{H_j}$, $1 \le n_{H_j} \le 2^d$, the number of vertices of $H_j$. We will assume that p(v) is stationary and ergodic and that there is a zero probability that a single v will activate two or more adjacent quadrilaterals unless these quadrilaterals overlap. Note that several quadrilaterals may be active at the same time when the map is folded. Let $S_j$ be the set of $2^d$ quadrilaterals that have neuron j as a common vertex (e.g. $S_j = \{H_a, H_b, H_c, H_d\}$ in Fig. 2A) and, conversely, $S_{H_j}$ the set of neurons that are the vertices of quadrilateral $H_j$ (e.g. $S_{H_c} = \{i, j, k, l\}$ in Fig. 2A). MER is defined as:

$$\Delta w_i = \eta \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v)\,\mathrm{Sgn}(v - w_i), \qquad \forall i \in A \qquad (2)$$

with η the learning rate (a positive constant) and Sgn(.) the sign function taken componentwise. The effect of the rule is shown in Fig. 2A (dashed lines): hence, MER is clearly a local learning rule. It can be formally proven that in the one-dimensional case, MER yields an equiprobable quantization for any N (Van Hulle 1995), and that in the multi-dimensional case, MER yields a quantization which will approximate an equiprobable one for large N (Van Hulle 1997). As a result of an equiprobable quantization, every neuron of the lattice will be active with the same probability. In other words, MER yields a topographic map which maximizes the (unconditional) information-theoretic entropy of its N binary outputs (hence, the rule's name).
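To make rule (2) concrete, here is a minimal Python sketch of MER for a one-dimensional lattice, using the interval-shaped quantization regions described in the Appendix; the function names, the Beta-distributed toy input and the final sort (a guard that keeps the 1-D lattice ordered) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mer_update(w, v, eta):
    """One on-line MER step (eq. 2) for a 1-D lattice with sorted weights w.

    In 1-D the quantization regions are the N+1 intervals bounded by the
    weights; granular intervals have 2 vertices, the two open 'overload'
    intervals have 1, so the code membership value (eq. 1) is 2 / n_H.
    """
    N = len(w)
    j = np.searchsorted(w, v, side='right')              # active interval H_j: w[j-1] <= v < w[j]
    vertices = [k for k in (j - 1, j) if 0 <= k < N]     # neurons that are vertices of H_j
    n_H = len(vertices)
    for i in vertices:                                   # fixed-magnitude step towards v (sign rule)
        w[i] += eta * (2.0 / n_H) * np.sign(v - w[i])
    w.sort()                                             # simplification: keep the 1-D lattice ordered
    return w

# Toy run: with a skewed input density the weights should crowd into the
# high-probability region, in line with MER's equiprobable quantization.
rng = np.random.default_rng(0)
w = np.sort(rng.uniform(0.0, 1.0, 10))
for _ in range(100_000):
    w = mer_update(w, rng.beta(2.0, 5.0), eta=0.001)
```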

[Fig. 1A,B. Definition of quantization region. A Portion of a two-dimensional rectangular lattice A plotted in V-space. B The quadrilaterals $H_e$, $H_f$, $H_h$, $H_i$ are 'granular' quantization regions; the quadrilaterals $H_a$, $H_b$, $H_c$, $H_d$, $H_g$ are 'overload' regions.]

We can easily extend MER by adding a neighborhood function Λ, a decreasing function of the minimum distance, in lattice space, between neuron i and the neurons of the set $S_{H_j}$, with $H_j$ an active quadrilateral. The extended Maximum Entropy learning Rule (eMER) is defined as:

$$\Delta w_i = \eta \sum_{k \in A} \sum_{H_j \in S_k} \Lambda(i, S_{H_j}, \sigma)\, \mathbb{1}_{H_j}(v)\, \mathrm{Sgn}(v - w_i), \qquad \forall i \in A \qquad (3)$$

with σ the neighborhood range and with the neighborhood function normalized to unity: $\sum_{k \in A} \Lambda(k, S_{H_j}, \sigma) = 1$. The effect of eMER is shown graphically in Fig. 2B.

Formally, eMER can be viewed as a Markov process with transition probabilities $T(\mathbf{w}, \mathbf{w}') = \int \delta(\mathbf{w} - U(\mathbf{w}', v, \eta))\, p(v)\, dv$, with $\mathbf{w} = [w_1, ..., w_N]$ and U(.) the new vector obtained after one weight update step given the input v. We can prove the following:

Proposition: For a discrete, statistically stationary probability density $p(v) = \{p(v_m), m = 1, ..., M\}$ there exists a potential E on which (3) on average (for small enough η) performs (stochastic) gradient descent:

$$E = \sum_{v_m \in V} \sum_{i,k \in A} \sum_{H_j \in S_k} \Lambda(i, S_{H_j}, \sigma)\, \mathbb{1}_{H_j}(v_m)\, |v_m - w_i| \qquad (4)$$

with $V = \{v_m\}$ the set of M input samples and $|v_m - w_i| \doteq \sum_{j=1}^{d} |v_{mj} - w_{ij}|$.

Proof: Following the definition of gradient descent, the weights are updated so as to reduce E. Now since (4) is continuous and piecewise differentiable, taking the derivative yields:

$$\langle \Delta w_i \rangle_V = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_{v_m \in V} \sum_{k \in A} \sum_{H_j \in S_k} \Lambda(i, S_{H_j}, \sigma)\, \mathbb{1}_{H_j}(v_m)\, \mathrm{Sgn}(v_m - w_i), \qquad \forall i \in A \qquad (5)$$

The latter exactly corresponds to the right-hand side of (3) summed over all input patterns. QED.

Furthermore, since E is always positive definite and its partial derivatives (5) are continuous, (4) is a Lyapunov function for discrete p.d.f.s. For continuous p.d.f.s, E loses its differentiability and can in principle not be regarded as a Lyapunov function. However, as done before for MER (Van Hulle 1997), we can modify the parallel update scheme to a sequential one: (1) one quadrilateral out of the set of active quadrilaterals is chosen randomly, and (2) one vertex ('winner') out of the vertices of the randomly chosen quadrilateral is updated using eMER. We convolve the sign functions in the weight update rule with, e.g., a Gaussian kernel of arbitrarily small range (in V-space) to solve the problem of differentiability of the sign functions when the input v coincides with the position of the winning vertex. (Note that, since the other vertices of the active quadrilateral are not updated, by definition, they do not affect the definition of the winner's receptive field.) For this modification, a Lyapunov function exists for averaged eMER (i.e., the Hessian is symmetric).

3 Unfolding lattices

For MER and eMER, it is clear what we mean by a topographically ordered map: when, for all inputs v ∈ V, it holds that $\sum_i \mathbb{1}_{H_i}(v) = 1$, then the map represented by the quadrilaterals $H_i$, ∀i, is topographically ordered, and vice versa. Hence, when the map is locally untangled for all quadrilaterals, the map will also be globally topographically ordered.

What is now the key difference between MER (eMER without neighborhood function) and the SOM algorithm, given that the use of neighborhood relations is the most essential principle in the formation of topology-preserving maps (Kohonen 1995, p. 143)? The SOM algorithm, and its many variants, assigns a unique neuron ('winner') to every input pattern, so that mutually non-overlapping receptive fields are defined (Voronoi tessellation). Hence, in order to establish neighborhood relations, the weight update rule needs a neighborhood function in order to let neighboring neurons cooperate and specialize for similar input patterns. In MER the receptive fields of neighboring neurons mutually overlap so that, by the overlap, neighborhood relations are already present at the neural activation stage. Hence, no neighborhood function is required per se in the weight update stage. Indeed, as shown in our previous work on MER, a tangled lattice cannot be a stable configuration in the d-dimensional (as well as in the one-dimensional) case (Van Hulle 1997). We will now show graphically, in an intuitive way, why MER is able to generate topographically ordered maps and the SOM algorithm is not when the neighborhood function has vanished.

[Fig. 2. A,B Portion of a two-dimensional, rectangular lattice represented in the input space and updated with the Maximum Entropy learning Rule (MER; A) and the extended MER (eMER; B). The continuous and dashed lines represent the lattice before and after the weights are updated (not to scale). The black dot represents the current input. C Portion of a one-dimensional lattice in the case of projection pursuit regression. The label on the horizontal axis represents the (d − 1)-dimensional vector of the independent variables x projected onto the k-th projection direction $a_k$; the label on the vertical axis represents the (scalar) residual associated with the k-th projection direction, i.e., the dependent variable in the regression analysis. The thick continuous and dashed lines represent the lattice before and after the weights are updated with eMER. The thin dashed, vertical lines represent the borders of the quantization regions prior to updating the weights. The shaded areas denote neuron j's receptive field and the black dot represents the current projected data point.]

Consider the locally folded lattice portion of Fig. 3A: the quadrilateral with vertices a, b, c, d overlaps with the quadrilateral with vertices b, i, j, k. Neuron i is closest to the current input (black dot). When the neighborhood range has vanished, the SOM algorithm reduces to standard unsupervised competitive learning (standard UCL), and neuron i moves towards the input. As a result, the overlap between the two quadrilaterals increases.

Consider now Fig. 3B: the current input activates the quadrilateral with vertices a, b, c, d, but not the other quadrilateral. Following an update with MER, neurons a, b, c, d move in the general direction of the input and the overlap between the two quadrilaterals decreases. In addition, since the weight update steps are fixed in magnitude, and thus not proportional to the distance between the updated weight and the current input, as is the case with standard UCL, neuron i eventually will jump over neuron b. Hence, in MER, both the definition of quantization region and the use of fixed weight update steps are essential in untangling lattices. Contrary to the SOM algorithm, the use of a neighborhood function in eMER is not a prerequisite for achieving topology-preserving mappings but, rather, a way to dramatically speed up the learning process of MER. Finally, we should mention that, since in our case the quantization regions are not defined by a distance metric, our search for the active quadrilaterals is more involved than the search for the winning neuron in the SOM algorithm (see also the Appendix).

3.1 Simulations

To show the learning dynamics of eMER with and without neighborhood function and to compare it with that of the SOM algorithm, we consider the standard case of mapping a two-dimensional input space V onto a 24 × 24 planar lattice A. The inputs $v = (v_1, v_2)$ are randomly and independently drawn from a two-dimensional uniform p.d.f. p(v) within the unit square $\{0 \le v_1 < 1, 0 \le v_2 < 1\}$. The weights at time t = 0 are chosen randomly within the same unit square. Since MER and eMER operate on different time scales, we use a small, fixed learning rate η and perform 'on-line' learning.[1]

We will take η = 0.001 for MER and eMER. Furthermore, for eMER we use a Gaussian neighborhood function and decrease its range in the following way:

$$\sigma(t) = \sigma_0 \exp\left(-\frac{2\sigma_0 t}{t_{\max}}\right) \qquad (6)$$

with t the present time step, $t_{\max}$ the maximum number of time steps, and $\sigma_0$ the range spanned by the neighborhood function at t = 0; we take $t_{\max}$ = 2 000 000 and $\sigma_0$ = 12. For MER, we take $t_{\max}$ = 10 000 000. For the SOM algorithm, we use the following weight update rule:

$$\Delta w_i = \eta\, \Lambda(i, i^*, t)\, (v - w_i), \qquad \forall i \in A \qquad (7)$$

with $i^*$ the neuron whose weight vector is closest to the current input vector:

$$\| w_{i^*} - v \| \le \| w_i - v \|, \qquad \forall i \in A \qquad (8)$$

(nearest-neighbor rule), and the following Gaussian neighborhood function:

$$\Lambda(i, i^*, t) = \exp\left(-\frac{(r_i - r_{i^*})^2}{2\sigma(t)^2}\right) \qquad (9)$$

where $r_i$ and $r_{i^*}$ represent the lattice coordinates of i and $i^*$, and with σ(t) as in (6). We take $t_{\max}$ = 2 000 000, $\sigma_0$ = 12 and η = 0.001.

[1] Note that, theoretically, we should decrease η to zero at a suitable rate, e.g. η = 1/t (Ljung 1977). However, since we use a small but fixed η, for reasons of comparison, we will verify convergence and topographic ordering experimentally.

[Fig. 3. Untangling locally distorted lattices in the case of the Self-Organizing Map (SOM) algorithm (A) and MER (B).]
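For reference, a minimal Python sketch of the SOM baseline defined by (6)-(9); lattice size and parameter values follow the text, but the code itself is an illustrative reconstruction rather than the author's implementation.

```python
import numpy as np

def train_som(tmax=2_000_000, sigma0=12.0, eta=0.001, n=24, seed=0):
    """SOM baseline of Sect. 3.1: n x n lattice, uniform 2-D input."""
    rng = np.random.default_rng(seed)
    r = np.stack(np.meshgrid(np.arange(n), np.arange(n)), -1).reshape(-1, 2)  # lattice coordinates r_i
    w = rng.uniform(0.0, 1.0, (n * n, 2))                                     # random initial weights
    for t in range(tmax):
        v = rng.uniform(0.0, 1.0, 2)                      # input drawn from the uniform p.d.f.
        winner = np.argmin(np.sum((w - v) ** 2, axis=1))  # nearest-neighbor rule (8)
        sigma = sigma0 * np.exp(-2.0 * sigma0 * t / tmax) # decaying neighborhood range (6)
        d2 = np.sum((r - r[winner]) ** 2, axis=1)         # squared lattice distance to the winner
        lam = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian neighborhood function (9)
        w += eta * lam[:, None] * (v - w)                 # weight update rule (7)
    return w
```

For a quick check one can reduce tmax (e.g. train_som(tmax=200_000)); the full 2 000 000 steps are slow in pure NumPy.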

The results for the SOM algorithm, MER and eMER are shown in Figs. 4, 5 and 6, respectively. We observe that all three learning rules achieve a topology-preserving mapping. Furthermore, MER yields such a mapping even without a neighborhood function (Fig. 5): the neighborhood function is indeed not required, but the presence of such a function speeds up the convergence process by almost two orders of magnitude, as evidenced by the eMER results (Fig. 6). Moreover, eMER and the SOM algorithm have similar untangling speeds. More importantly, the quality of the topographical ordering of the map obtained with eMER is superior to that of the SOM algorithm. This is due to the occurrence of a series of phase transitions in SOMs with short neighborhood range (Der and Herrmann 1994): when the neighborhood has vanished, distortion minimization is pursued instead of topology preservation, and the optimal distortion tessellation does not necessarily correspond to a regular (e.g., square-shaped) topographical ordering. When the neighborhood function has vanished and eMER has reduced to MER, the neuron weights converge to the medians[2] of the corresponding receptive fields (Van Hulle 1997) instead of their averages, as with the SOM algorithm. Medians are less sensitive to outliers than averages, and this explains why it takes more time for the lattice to reach the corners of the input distribution.

[2] The converged weight vectors represent the d-dimensional 'medians' to the extent that the latter are defined as the vectors of the scalar medians determined for each of the d input dimensions separately. (Note that there exists no unique definition of median in d dimensions.)

Finally, as noted by Grossberg (1976) and Rumelhart and Zipser (1985), one concern with unsupervised competitive learning based on 'winner-take-all' classification is the occurrence of neurons that are never active ('dead' units). We have tested this using a U-shaped uniform distribution. The lattice dimensions and the simulation set-up are the same as in the previous example. The SOM and MER results are shown in Fig. 7A and B, respectively. It is clear that the lattice produced by the SOM algorithm has a number of dead units, notwithstanding that the neighborhood kernel was gradually decreased over time. On the other hand, the lattice obtained with MER, thus even without using a neighborhood function, does not contain dead units. Hence, by the tendency of MER to produce equally probable neurons, dead units are avoided.

4 Equilibrium weight distribution, nonparametric density estimation

[Fig. 4. Evolution of a 24 × 24 lattice with a rectangular topology as a function of time in the case of the SOM algorithm. The outer squares outline the uniform input p.d.f.; the values given below the squares represent time (0 to 2 000 000 time steps).]

[Fig. 5. Evolution in the case of MER (0 to 10 000 000 time steps).]

Previously we have proven that, in the d-dimensional case, the weight density at convergence is proportional to the input density, when N grows large and given that the neurons' activation regions $S_i$ span non-zero volumes in V-space (Van Hulle 1997). The proof relies on quantization theory. However, it is instructive to show it in a different way, namely using the continuum approximation introduced by Ritter and Schulten (1986) for the SOM algorithm. We change the index i of the neurons in the lattice for a position vector r. We consider topology-preserving mappings only and assume that, for sufficiently large N and a sufficiently small range σ and learning rate η, the equilibrium weight configurations are slowly varying with r, so that they can be replaced by corresponding smooth functions over a continuum of r-values. We choose our neighborhood function in such a way that it depends only on the distance between r and the location $r^*$ of the 'closest' neuron with an active quadrilateral (closest with respect to r in lattice space).[3] For clarity's sake, we simplify the notation for the (Gaussian) neighborhood function as $\Lambda(r^* - r, \sigma)$ and bear in mind that $r^*$ represents the closest winner. We will now determine the necessary and sufficient condition for the lattice weights to form a stable configuration. We write the equilibrium condition as follows:

$$\langle \Lambda(r^* - r, \sigma)\, \mathrm{Sgn}(v - w(r)) \rangle = 0 \qquad (10)$$

for all r. We approximate v by the weight of the closest neuron $w(r^*)$. We then obtain for the equilibrium equation:

$$0 = \int \Lambda(r^* - r, \sigma)\, \mathrm{Sgn}(w(r^*) - w(r))\, p(v)\, d^d v. \qquad (11)$$

We introduce the vector $q = r^* - r$ as the integration variable, write $p(r^*)$ instead of p(v), and rewrite $d^d v$ as $D(r + q)\, d^d q$, with D(r) the absolute value of the determinant of the Jacobian $J(r) = \partial v / \partial r$. Furthermore, for sufficiently small σ, Λ(.) is sharply peaked, so that we can expand $p(r^*)$ into $p + q_k \partial_k p + \dots$ and $D(r + q)$ into $D + q_l \partial_l D + \dots$ Putting these expansions into the equilibrium equation yields

$$0 = \int \Lambda(r^* - r, \sigma)\, \mathrm{Sgn}(w(r^*) - w(r))\, (p + q_k \partial_k p + \dots)(D + q_l \partial_l D + \dots)\, d^d q \qquad (12)$$

Since Λ(.) Sgn(.) makes an odd function in each dimension, we collect first-order terms only (and drop third- and higher-order terms):

$$0 \approx \left[ \int \Lambda(r^* - r, \sigma)\, \mathrm{Sgn}(w(r^*) - w(r))\, q_i\, d^d q \right] (p\, \partial_i D + D\, \partial_i p) \qquad (13)$$

A necessary and sufficient condition for this equation to hold in the limit of a vanishing σ is $(p\, \partial_i D + D\, \partial_i p) = 0$ or, assuming D and p are nowhere zero:

$$\frac{\partial_i p}{p} = -\frac{\partial_i D}{D} \qquad (14)$$

or

$$\nabla \log p = -\nabla \log D \qquad (15)$$

Now since 1/D corresponds to the weight density, the weight density is indeed proportional to the input density for large N.

[3] In this way, with respect to a given neuron i, the discrete algorithm maps input signals activating the same quadrilateral onto the same winning neuron and, thus, we can assume a bijective weight configuration with respect to neuron i.

4.1 Simulations

[Fig. 6. Evolution in the case of eMER (0 to 2 000 000 time steps).]

[Fig. 7. Lattices obtained for a U-shaped uniform distribution in the case of the SOM algorithm (A) and MER (B). The U-shaped distribution is outlined.]

Since for an equiprobable quantization the density of the weight vectors $p(w_i)$ is proportional to the input density p(v), a nonparametric model of the input p.d.f. can be constructed. Now since MER is aimed at producing an equiprobable quantization, it can thus, at least in principle, be used for nonparametric density estimation. As an example, we consider the quadrimodal density function displayed in Fig. 8A. This function is obtained in the following way. The $(v_1, v_2)$-plane is divided into four equal-sized quadrants. Within each quadrant, the density function is generated by considering two independent product distributions, one for each v-dimension. The product distributions are generated by taking the product of two uniformly and independently distributed random numbers. The analytic equation of the product distribution is $-\log v$ with v the product term. The quadrimodal distribution is in turn obtained by choosing one quadrant over the other with equal probability. The resulting asymmetric distribution is unbounded and comprises heavily skewed but disjunct modes separated by sharp transitions (discontinuities), which makes it difficult to quantize.
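A sketch of how this quadrimodal test density might be sampled in Python; assigning the four modes to the four sign patterns of the coordinates is an assumption read off Fig. 8A rather than an explicit prescription in the text.

```python
import numpy as np

def sample_quadrimodal(M, seed=0):
    """Draw M samples from the quadrimodal test density of Sect. 4.1.

    Within a quadrant, each coordinate is the product of two independent
    U(0,1) variates (density -log v on (0,1]); the quadrant is selected
    with equal probability by drawing an independent sign per axis (an
    illustrative interpretation of the construction described in the text).
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=(M, 2, 2))
    v = u[..., 0] * u[..., 1]                     # product distribution per dimension
    signs = rng.choice([-1.0, 1.0], size=(M, 2))  # equal-probability quadrant choice
    return signs * v
```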

We considered the same lattice dimensions and simulation settings as for eMER and the SOM algorithm in Sect. 3.1. The inverse of the receptive field surface area of a given neuron yields the density estimate located at the neuron's weight vector (after normalization). The reconstructed p.d.f.s are shown in Fig. 8B and C and were obtained by linearly interpolating between the N density estimates. The SOM result looks erratic due to the occurrence of a series of phase transitions (Sect. 3.1), which cause a deterioration in the quality of the density estimate.
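As a one-dimensional analogue of this construction (an illustrative simplification; the paper uses two-dimensional receptive-field surface areas), the density estimate at each weight can be taken as the normalized inverse of the neuron's receptive-field length:

```python
import numpy as np

def density_from_weights_1d(w):
    """Density estimate at each (sorted) weight as the normalized inverse
    of its receptive-field length. In 1-D, neuron i's receptive field spans
    the two adjacent quantization intervals, clipped here at the outermost
    weights; a simplified 1-D analogue of the inverse-area estimate."""
    w = np.sort(np.asarray(w, dtype=float))
    lo = np.r_[w[0], w[:-1]]                      # left receptive-field border
    hi = np.r_[w[1:], w[-1]]                      # right receptive-field border
    est = 1.0 / (hi - lo)
    norm = np.sum(0.5 * (est[1:] + est[:-1]) * np.diff(w))  # trapezoid rule
    return est / norm
```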

For the sake of comparison, we also implemented the variable kernel method of Breiman et al. (1977), i.e., a widely used nonparametric density estimation technique. The basic idea is to place unit-volume kernels at the input samples but to allow the width of the kernels to vary from one point to another, depending on the local sample density. As a pilot estimate, we used the (kth-)nearest-neighbor method (Tukey and Tukey 1981) with $k = \sqrt{M}$ and M the number of samples. For the adaptive kernel method, we took the overall degree of smoothing $h = 0.3/M^{0.2}$ (Silverman 1986, p. 45) and the sensitivity parameter α = 1/2 (Breiman et al. 1977). The result is shown in Fig. 8D for M = 5000 (the result for M = 2000 looks very similar). We observe that the quality of the density estimate is inferior to that obtained with MER when it comes to representing the ridges (discontinuities) in the original p.d.f. Reducing h does not help in this matter, since then the density estimate becomes erratic.
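For orientation, a sketch of an adaptive (variable) kernel density estimator in the spirit of Breiman et al. (1977) and Silverman (1986, Sect. 5.3), with the k-th nearest-neighbor pilot, h = 0.3/M^0.2 and α = 1/2 from the text; the Gaussian kernel, the pilot normalization and the helper names are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def adaptive_kde(samples, grid, alpha=0.5):
    """Adaptive kernel density estimate on the points in `grid`.

    A k-th nearest-neighbour pilot (k = sqrt(M)) sets local bandwidth
    factors lambda_i = (pilot_i / g)^(-alpha), with g the geometric mean
    of the pilot values; h = 0.3 / M**0.2 is the global smoothing.
    """
    M, d = samples.shape
    h = 0.3 / M ** 0.2
    k = int(np.sqrt(M))
    dk = cKDTree(samples).query(samples, k + 1)[0][:, -1]   # distance to k-th neighbour
    pilot = k / (M * dk ** d)                                # crude k-NN pilot density
    lam = (pilot / np.exp(np.mean(np.log(pilot)))) ** (-alpha)
    est = np.zeros(len(grid))
    for x_i, l_i in zip(samples, lam):                       # sum of per-sample Gaussian kernels
        u2 = np.sum((grid - x_i) ** 2, axis=1) / (h * l_i) ** 2
        est += np.exp(-0.5 * u2) / ((2 * np.pi) ** (d / 2) * (h * l_i) ** d)
    return est / M
```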

5 Nonparametric projection pursuit regression

Since the SOM algorithm is often regarded as a nonparametric regression technique (Ritter et al. 1992; Mulier and Cherkassky 1995; Kohonen 1995), we could do the same for eMER. In particular, we consider the regression fitting of a scalar function y of d − 1 independent variables, denoted by the vector $x = [x_1, ..., x_{d-1}]$, from a given set of M possibly noisy data points or measurements $\{(y_m, x_m), m = 1, ..., M\}$ in d-dimensional space:

$$y_m = f(x_m) + \text{noise} \qquad (16)$$

where f is the unknown function to be estimated, and where the noise contribution has zero mean and is independent of the $\{x_m\}$. The function f is approximated in a nonparametric way using piecewise smooth activation functions or splines that join continuously at points called 'knots'. By considering the d-dimensional data points as the input samples $\{v_m\}$, and the knots as the neurons of a (d − 1)-dimensional topographic map A, we can determine the position of these knots adaptively by developing the map in d-dimensional space. Besides the risk of generating nonfunctional mappings, a disadvantage of performing regression in this way is the prohibitive number of neurons needed for function approximation in high-dimensional spaces and the high number of data points needed to allocate the neuron weight vectors reliably (cf. the curse of dimensionality). One of the very few approaches which perform well even in high-dimensional spaces is projection pursuit regression (Friedman and Stuetzle 1981): the d-dimensional data points are interpreted through optimally chosen lower-dimensional projections; the 'pursuit' part refers to optimization with respect to the projection directions. For simplicity, we will consider the case where the function f is approximated by a sum of scalar functions:

$$\hat{f}(x) = \sum_{k=1}^{K} f_k(a_k x^T) \qquad (17)$$

[Fig. 8. Two-dimensional product distribution (A) and nonparametric models obtained in the case of the SOM algorithm (B), MER (C), and the variable kernel method (D). pd, probability density. The theoretical function (A) is unbounded and plotted in steps of 1/25.]

with $\hat{f}$ the estimated function and $a_k$ unit vectors (projection directions), and where T stands for transpose. Each $f_k$ is a piecewise smooth activation function that joins continuously at knots the positions of which are determined by the neuron weights of the corresponding topographic map. The functions $f_k$ and projections $a_k$ are estimated sequentially in order to minimize the mean squared error (MSE) of the residuals. In other words, for increasing k, we search for projections that minimize the residual error:

$$C(a_k) = \frac{1}{M} \sum_{m=1}^{M} \left[ f_k(a_k x_m^T) - \left\{ y_m - \sum_{k'=1}^{k-1} f_{k'}(a_{k'} x_m^T) \right\} \right]^2 \qquad (18)$$

The term between curly brackets denotes the k-th residual of the m-th data point, $r_{mk}$, with $r_{m1} \doteq y_m$. For each $a_k$ estimate, a new topographic map and interpolating function $f_k$ are developed. The one-dimensional topographic maps are developed with eMER: by observing that $r_k$ is the dependent variable, the code membership functions are defined with respect to the projected independent variables ($a_k x^T$) (thin dashed, vertical lines in Fig. 2C), but the weights are updated in the $(r_k, a_k x^T)$ plane using eMER (thick dashed line). Hence, regression is performed in the direction of the dependent variable $r_k$ subject to smoothness constraints set by the interpolating function $f_k$. Furthermore, the occurrence of nonfunctional mappings is avoided since eMER converges to a lattice without overlapping quadrilaterals in each projection direction.
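The surrounding procedure can be summarized by the following projection pursuit skeleton; the eMER-trained map with spline interpolation is abstracted as a generic one-dimensional smoother (fit_1d_smoother is a hypothetical stand-in), so the sketch shows the residual-fitting structure of (17)-(18) rather than the author's full algorithm.

```python
import numpy as np

def projection_pursuit_fit(X, y, K, candidate_dirs, fit_1d_smoother):
    """Sequentially fit f_hat(x) = sum_k f_k(a_k x^T) by residual fitting.

    candidate_dirs: list of unit vectors to try at each stage.
    fit_1d_smoother(z, r): returns a callable f_k estimating r from the
    scalar projection z (here this would be the eMER map with natural
    cubic spline interpolation; any 1-D smoother illustrates the loop).
    """
    residual = y.astype(float).copy()
    stages = []
    for _ in range(K):
        best = None
        for a in candidate_dirs:                   # 'pursuit': pick the direction with lowest C(a_k)
            z = X @ a
            f_k = fit_1d_smoother(z, residual)
            cost = np.mean((f_k(z) - residual) ** 2)
            if best is None or cost < best[0]:
                best = (cost, a, f_k)
        _, a, f_k = best
        residual = residual - f_k(X @ a)           # k-th residual becomes the next stage's target
        stages.append((a, f_k))
    def f_hat(Xnew):
        return sum(f_k(Xnew @ a) for a, f_k in stages)
    return f_hat
```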

5.1 Simulations

We applied our regression procedure to the same test examples used by Hwang and co-workers for assessing the regression performance of back-propagation and projection pursuit learning (Hwang et al. 1994). In addition to running eMER, we also ran constrained topological mapping (Cherkassky and Lari-Najafi 1991) and the Heskes and Kappen approach (Heskes and Kappen 1995) on the same test examples.

Consider a set of M = 225 data points with the $\{x_m\}$ taken homogeneously and independently from the uniform distribution on $[0, 1]^2$. The functions to be estimated are (Fig. 9A, C, E):

$$f_1(x) = 10.391\,((x_1 - 0.4)(x_2 - 0.6) + 0.36) \qquad (19)$$

$$f_4(x) = 1.3356\,\big(1.5(1 - x_1) + \exp(2x_1 - 1)\sin(3\pi(x_1 - 0.6)^2) + \exp(3(x_2 - 0.5))\sin(4\pi(x_2 - 0.9)^2)\big) \qquad (20)$$

$$f_5(x) = 1.9\,\big(1.35 + \exp(x_1)\sin(13(x_1 - 0.6)^2)\,\exp(x_2)\sin(7x_2)\big) \qquad (21)$$

To each of these functions, zero-mean Gaussian noise is added with standard deviation 0.25. Hwang and co-workers considered two-layer perceptrons for which the hidden neurons correspond to the activation functions and for which the weights of these neurons correspond to the projection directions. These two-layer perceptrons were trained with back-propagation learning (BPL) with K = 5 and 10 hidden neurons, or with projection pursuit learning (PPL) with K = 3 and 5 hidden neurons. For BPL, the Gauss-Newton method was used; the projection directions were further optimized by backfitting. Backfitting consists of cyclically minimizing $C(a_k)$ for the residuals of neuron k until there is little or no change. For PPL, a modified supersmoother method was applied for obtaining smooth output function representations for the hidden neurons; the Gauss-Newton method was used for obtaining the optimal projection directions.

The Heskes and Kappen approach (HK) is in essence a combination of PPL with topographic map formation in which a modified SOM algorithm is used for updating the neuron weights along the projection directions. We implemented HK using lattices sized N = 7 or 10 neurons for each one of K = 5 projection directions, natural cubic spline interpolation between the neuron weights, a (fixed) learning rate η = 0.2, and the Gaussian neighborhood function (9) (Sect. 3.1). We took for the noise factor of the dependent variable ε = 0.1 (ε ≪ 1) and for the noise parameter β → ∞ (Heskes and Kappen 1995). The projection directions were re-normalized to unit length after each update. The optimal range of the neighborhood function was determined by cross-validation (in steps of 0.125). After each epoch, we determined the MSE between the training samples and the neuron weights and ran the procedure until the magnitude of the difference between the present and the previous running-averaged MSE became lower than $1.0 \times 10^{-3}$ or until 50 000 epochs had elapsed; the present running average equals 10% of the present, unaveraged MSE added to 90% of the previous running average. Finally, in order to further optimize the projection directions, we also applied backfitting.

Furthermore, we implemented constrained topological mapping (CTM) (Cherkassky and Lari-Najafi 1991), an algorithm which also relies on a modified SOM algorithm, but now for updating the neuron weights in the space of the data points. We implemented CTM using square lattices sized 5 × 5 and 7 × 7 neurons, linear interpolation between the neuron weights, a learning rate η = 0.02, and the same Gaussian neighborhood function, stopping criterion and cross-validation strategy as for HK.

For eMER we used lattices sized N = 7 or 10 neurons, natural cubic spline interpolation between the neuron weights, K = 5 projection directions, a learning rate η = 0.02, and the same Gaussian neighborhood function as in Sect. 3.1; the optimal range of the neighborhood function was determined by cross-validation. As in HK, the projection directions were re-normalized to unit length after each update. After each eMER epoch, we determine the MSE between the actual and the desired, equiprobable code membership function usage. We ran eMER until the magnitude of the difference between the present and the previous running-averaged MSE became lower than $1.0 \times 10^{-10}$ or until 50 000 epochs had elapsed. In order to optimize $C(a_k)$, the procedure was first run for the $a_k$ taken as unit vectors; the components of the unit vector with the lowest residual error were then further optimized by performing hill descent in steps of 0.01. Finally, we also applied backfitting.
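A small sketch of the running-average stopping criterion used for HK, CTM and eMER above; the 10%/90% mixing and the tolerance come from the text, while the function and variable names are illustrative.

```python
def run_until_converged(epoch_mse, tol=1e-10, max_epochs=50_000):
    """Stop when the running-averaged MSE changes by less than tol.

    epoch_mse: callable that performs one training epoch and returns the
    current (unaveraged) MSE; the running average mixes 10% of the current
    MSE with 90% of the previous running average, as described in the text.
    """
    running = prev_running = None
    for epoch in range(max_epochs):
        mse = epoch_mse()
        running = mse if running is None else 0.1 * mse + 0.9 * running
        if prev_running is not None and abs(running - prev_running) < tol:
            break
        prev_running = running
    return epoch
```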

The generalization performance was determined as the fraction of variance unexplained (FVU) for a set of $M_t$ = 10 000 independent test data:

$$\mathrm{FVU} = \frac{\sum_{m=1}^{M_t} (\hat{f}(x_m) - f(x_m))^2}{\sum_{m=1}^{M_t} (f(x_m) - \bar{f})^2} \qquad (22)$$

with $\bar{f}$ the set average. The results are summarized in Table 1. The function estimates obtained with eMER for N = 10 and K = 5 are shown in Fig. 9B, D, F.
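Computing (22) is straightforward; a minimal helper (names are illustrative):

```python
import numpy as np

def fvu(f_hat_vals, f_vals):
    """Fraction of variance unexplained (eq. 22) on a test set: residual sum
    of squares divided by the total sum of squares around the set average
    of the true function values."""
    f_vals = np.asarray(f_vals, dtype=float)
    f_hat_vals = np.asarray(f_hat_vals, dtype=float)
    return np.sum((f_hat_vals - f_vals) ** 2) / np.sum((f_vals - f_vals.mean()) ** 2)
```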

Although the K and/or N values of the different regression techniques are not immediately comparable, several observations can be made. Firstly, the FVU performance of eMER, HK and CTM improves for increasing N values (except for $f_1$ and CTM). Since N determines the complexity of the regression surface, this means that we are not (yet) overfitting the data set. The slower increase in the FVU performance for CTM is due to the rapidly increasing number of data samples needed in order to allocate the N weight vectors reliably (curse of dimensionality) and, to a lesser extent, also due to the relative roughness of the CTM regression surface, since it was obtained by using piecewise linear interpolation. Secondly, the FVU performance improves for increasing K values for BPL and PPL (except for $f_1$) since K determines the number of projection directions. This is also the case with eMER: for $f_4$ and K = 3, FVU equals 0.0244 for N = 7 and 0.0145 for N = 10 (not listed in Table 1). As before, this also implies that there is no overfitting of the data set. For cases where the multi-variable functions underlying the data set can be approximated by a decomposition into a number of single-variable functions, eMER, HK, BPL and PPL perform better than CTM, as evidenced by the $f_1$ and $f_4$ results in Table 1. In the opposite case, CTM is better than the other techniques, at least for small K values, as shown by the $f_5$ results; however, for larger K values this is not always the case. Thirdly, eMER, HK and PPL have an additional advantage over BPL: the effect of adding an additional activation function $f_k$ can be evaluated in terms of the decrease in the residual error $C(a_k)$ achieved or in terms of the increase in the FVU performance. This allows for a better control of the network size (and of the overall training time). Finally, eMER is a good or even superior regression technique for cases which, only up to a limited extent, rely on extrapolation for capturing the overall shape of the surface underlying the data points, such as in the $f_1$ and $f_4$ examples; on the other hand, in the $f_5$ example most of the surface complexity is on the edge of the data space, and this explains the lower performance of eMER. Due to backfitting and the fact that eMER is less sensitive to outliers (Sect. 3.1), the results are relatively well reproducible: e.g. for $f_4$ and N = 10 the average FVU obtained for 10 different runs is 0.00747.

Table 2. Nonparametric regression performance of HK and eMER as a function of training set size M, for $f_4$ and K = 5 and N = 10

M       HK        eMER
50      0.112     0.0630
225     0.0260    0.00686
2250    0.00916   0.0106

Both HK and eMER rely on projection pursuit learning but use different strategies for updating the neuron weights: by using eMER, one is aiming for a faithful input representation (or smoothed for non-zero radii of the neighborhood function); by using HK, one is aiming for a representation which minimizes a (weighted) MSE distortion metric (also radius dependent). We observe from Table 1 that eMER yields a superior FVU result everywhere (except for $f_5$ and N = 7). The superiority of eMER continues even for smaller training sets (e.g. M = 50) but is lost for larger ones (M = 2250, see Table 2). Hence, the motivation to use eMER or HK is influenced by the size of the training set available for the regression task. This distinction in performance stems from the fact that in eMER the weights converge towards the (weighted) medians along each projection direction, whereas in HK they converge towards the (weighted) means. Medians are less sensitive to outliers than means (Sect. 3.1), which is often an advantage for small training sets; on the other hand, medians are more sensitive to biased noise than means. Fortunately, the effect of biased noise can often be reduced by backfitting.

Finally, we will add a note on the computation time needed by the different algorithms.[4] When comparing BPL, using the Gauss-Newton method, with PPL, using the supersmoother method, the learning speeds for the two methods are similar for the same number of hidden neurons (Hwang et al. 1994). Both HK and eMER differ, basically, from PPL and BPL by the fact that the output functions of the hidden neurons are replaced by one-dimensional lattices comprising N neurons. Hence these methods are always slower than BPL and PPL. The learning speed of HK is similar to that of eMER, as far as the development of the neuron weights along the projection directions is concerned, but is overall slower when backfitting is taken into account: HK needs more backfitting cycles (at least twice as many), with little improvement at the end, to achieve its desired accuracy. Finally, CTM is in turn about an order of magnitude slower than eMER and HK in achieving its accuracy.

[4] A detailed, technical analysis of the time- and memory-complexity properties of these algorithms is beyond the scope of the present article.

6 Conclusion

We have introduced a new learning rule, called the extended Maximum Entropy learning Rule (eMER), for topographic map formation. In eMER the receptive fields of neighboring neurons mutually overlap so that, by the overlap, neighborhood relations are already present at the neural activation stage. Hence, no neighborhood function is required per se in the weight update stage, and the map untangles even when the neighborhood range has vanished, but its presence dramatically speeds up convergence. We have shown that, for a vanishing neighborhood range, the neural weight density at convergence becomes proportional to the input density (the map's binary outputs maximize information-theoretic entropy), so that eMER can, in principle, be used as an adaptive, nonparametric density estimation technique. We have verified this point by comparing eMER's density estimation performance with that of the Self-Organizing Map (SOM) algorithm (Kohonen 1982, 1995) and the variable kernel method (Breiman et al. 1977), i.e. a classic nonparametric density estimation technique. Since eMER does not rely on the use of kernels, it is expected to outperform the variable kernel method in the vicinity of sharp transitions in the input space. We have shown that eMER outperforms the SOM algorithm mainly due to the occurrence of phase transitions when the neighborhood range vanishes (Der and Herrmann 1994) and, to a lesser extent, due to the SOM's weight density being a power function of the input density. On the other hand, the search for active quadrilaterals in eMER is more involved than the search for the winning neuron in the SOM algorithm, and the extension of eMER to the case where the dimensionalities of the map and the input space differ is less evident.

Since topographic map formation is often conceived as a nonparametric regression technique (Ritter et al. 1992; Mulier and Cherkassky 1995; Kohonen 1995), we have applied eMER to projection pursuit regression and shown that eMER can yield a regression performance which is superior to that achieved with back-propagation learning, projection pursuit learning (Friedman and Stuetzle 1981), constrained topological mapping (Cherkassky and Lari-Najafi 1991), and the Heskes and Kappen approach (Heskes and Kappen 1995). Finally, although the number of examples we have considered is limited, the use of an information-theoretic criterion (i.e., entropy maximization) instead of a distortion criterion could lead to an improved regression performance of topographic maps.

Appendix

The definition and search for active quantization regions in eMER can be implemented in several ways. We first assume that the input space V and lattice A have the same dimensionality, but relax this assumption at the end when we simplify our implementation.

One-dimensional case

The real line is partitioned by the N weights $w_1, ..., w_N$ of the one-dimensional lattice A. The N + 1 quantization intervals are defined as follows:

$$H_j = \{v \in V \mid w_{j-1} \le v < w_j\} \cup \{v \in V \mid w_j < v \le w_{j-1}\}, \qquad \forall j \ne 1, N+1,$$
$$H_1 = \{v \in V \mid v < w_1\}, \qquad H_{N+1} = \{v \in V \mid w_N \le v\}$$

and, hence, interval membership can be easily verified. However, from a methodological viewpoint, it is more interesting to implement a search for matching distances: interval $H_j$, $\forall j \ne 1, N+1$, is active when $|v - w_{j-1}| + |v - w_j| = |w_{j-1} - w_j|$ (the interval membership definitions of $H_1$ and $H_{N+1}$ are not modified).

[Fig. 9. A 'Simple' interaction function f1 (19), C 'additive' interaction function f4 (20), and E 'complex' interaction function f5 (21). B, D, F Estimated functions using eMER, N = 10, K = 5.]
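Returning to the one-dimensional membership test above, a minimal sketch (the function name is illustrative, and a small tolerance replaces exact equality for floating-point inputs):

```python
def interval_is_active(v, w, j, tol=1e-12):
    """Matching-distance test for the 1-D granular interval H_j bounded by
    the weights w[j-1] and w[j] (0-based indexing): H_j is active when
    |v - w[j-1]| + |v - w[j]| equals |w[j-1] - w[j]| (up to tol)."""
    return abs(v - w[j - 1]) + abs(v - w[j]) <= abs(w[j - 1] - w[j]) + tol
```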

Two-dimensional case

We proceed in a similar way for the two-dimensional case, except that the intervals become triangles and the search for matching distances becomes one for matching surface areas. We first determine the surface area of each granular quantization region (quadrilateral) by partitioning it into two non-overlapping triangles (Fig. A1A). Let $SA_{H_j}$ be the surface area of $H_j$. We then verify that v belongs to $H_j$ by considering the surface areas of the four triangles drawn between v and each pair of adjacent vertices of $H_j$ (Fig. A1B): if the sum of these surface areas matches $SA_{H_j}$, then $H_j$ is active. For the overload regions, we take arbitrary points on the extended links (dashed lines in Fig. 1B) and proceed in the same way as for a granular region. Instead of using surface areas, we can also consider two vertices in both triangles of Fig. A1A, e.g., a and c in the shaded triangle, and verify the angles made by av and cv with the edges of the triangle (arc segments in Fig. A1A).
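A sketch of the surface-area test for a granular quadrilateral, using the shoelace formula for the triangle areas; the formula choice and the vertex ordering convention are assumptions, since the paper does not prescribe how the areas are computed.

```python
def tri_area(p, q, r):
    """Unsigned triangle area via the shoelace formula."""
    return 0.5 * abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def quad_is_active(v, quad, tol=1e-9):
    """Surface-area membership test for a granular quadrilateral.

    quad: the four vertex weights in cyclic (boundary) order.  The area of
    the quadrilateral is obtained from two non-overlapping triangles
    (Fig. A1A); v lies inside when the four triangles drawn between v and
    each pair of adjacent vertices (Fig. A1B) sum to the same area.
    """
    a, b, c, d = quad
    area = tri_area(a, b, c) + tri_area(a, c, d)
    fan = sum(tri_area(v, p, q) for p, q in ((a, b), (b, c), (c, d), (d, a)))
    return abs(fan - area) <= tol * max(area, 1.0)
```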

There exist at least three ways to improve the efficiency of the search algorithm. Firstly, for both batch and incremental learning, only the surface areas of the quantization regions of which the vertices are updated need to be recalculated for the next learning step. Secondly, we determine one by one the surface areas of the four triangles into which $H_j$ can be partitioned and which all have v as a common vertex (Fig. A1B): we can stop considering quantization region $H_j$ when the surface area of the next triangle makes the sum of the surface areas of this and the triangles considered up to now larger than $SA_{H_j}$, the surface area of $H_j$. Finally, these triangles can be shared between neighboring quantization regions.

[Fig. A1A,B. Definition and search for active quadrilaterals, two-dimensional case. Two methods are shown: one based on surface areas and another based on angles. A Partitioning of quadrilateral $H_j$ (thick continuous line), defined by the weights of the vertices a, b, c, d, into two non-overlapping triangles. The current input (black dot) falls into the shaded triangle. The angles made by the lines joining the current input and vertices a and c are indicated (arc segments) and correspond to that of a valid constellation, i.e., one for which the shaded triangle is active. B The four triangles that can be drawn between the current input and each pair of adjacent vertices of $H_j$. One triangle is shaded for the sake of illustration. The current input activates $H_j$ since the surface area of $H_j$, determined in A, matches the sum of the surface areas of the four triangles determined in B.]

Table 1. Nonparametric regression performance on an independent test data set, expressed as the fraction of variance unexplained (FVU). (HK and eMER use K = 5 projection directions; CTM uses square lattices of 5 × 5 and 7 × 7 neurons.)

Function   BPL K=5   BPL K=10   PPL K=3   PPL K=5   CTM 5x5   CTM 7x7   HK N=7   HK N=10   eMER N=7   eMER N=10
f1         0.00631   0.0178     0.00896   0.0147    0.0244    0.0261    0.0212   0.0178    0.0137     0.0113
f4         0.577     0.0198     0.334     0.0186    0.148     0.118     0.0285   0.0260    0.0121     0.00686
f5         0.905     0.0700     0.269     0.0432    0.145     0.123     0.272    0.221     0.299      0.0829

Three-dimensional case

In the three-dimensional case, surface areas become volumes and triangles become tetrahedra. Hence, we section each quantization region $H_j$ (i.e., an irregular prism with 8 vertices) into 6 non-overlapping tetrahedra to determine its volume $SA_{H_j}$. We then section $H_j$ into 12 non-overlapping tetrahedra, all with v as a common vertex, and verify whether the total volume of these tetrahedra matches $SA_{H_j}$. (Note also for this case that, instead of using volumes, we can equally well consider the flat angles made by v and the edges and faces of the tetrahedra.)

Higher-dimensional case

Evidently, the previous scheme is far too complicated to be practical in the higher-dimensional case. However, for increasing numbers of neurons in the lattice, the distinction between overload and granular regions becomes less important, as does the exact shape of an individual quantization region. Hence, we only determine the winning neuron for each input, consider it to be a vertex of a granular region, and update the lattice using eMER. According to Sect. 4, the equilibrium weight density obtained with this simplified scheme will approach a linear function of the input density, as desired. Finally, the same simplified scheme also enables us to apply eMER to cases in which the dimensionalities of the lattice and the input space differ.

Acknowledgements. The author is a research associate of the Fund for Scientific Research – Flanders (Belgium) and is supported by research grants received from the Fund for Scientific Research (G.0185.96) and the European Commission (ECVnet EP8212).

References

Breiman L, Meisel W, Purcell E (1977) Variable kernel estimates of multivariate densities. Technometrics 19:135-144

Cherkassky V, Lari-Najafi H (1991) Constrained topological mapping for nonparametric regression analysis. Neural Networks 4:27-40

Der R, Herrmann M (1994) Instabilities in self-organized feature maps with short neighborhood range. In: Verleysen M (ed) Proceedings of the European Symposium on Artificial Neural Networks – ESANN'94, Brussels, Belgium, pp 271-276

Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76:817-823

Friedman JH, Silverman BW (1989) Flexible parsimonious smoothing and additive modeling. Technometrics 31:3-21

Gersho A (1979) Asymptotically optimal block quantization. IEEE Trans Inform Theory 25:373-380

Grossberg S (1976) Adaptive pattern classification and universal recoding. I. Parallel development and coding of neural feature detectors. Biol Cybern 23:121-134

Heskes T, Kappen B (1995) Self-organization and nonparametric regression. Proc ICANN'95, Vol I, pp 81-86

Hwang J-N, Lay S-R, Maechler M, Martin RD, Schimert J (1994) Regression modeling in back-propagation and projection pursuit learning. IEEE Trans Neural Networks 5:342-353

Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59-69

Kohonen T (1995) Self-organizing maps. Springer, Berlin Heidelberg New York

Linde Y, Buzo A, Gray RM (1980) An algorithm for vector quantizer design. IEEE Trans Commun 28:84-95

Ljung L (1977) Analysis of recursive stochastic algorithms. IEEE Trans Automat Control 22:551-575

Luttrell SP (1991) Code vector density in topographic mappings: scalar case. IEEE Trans Neural Networks 2:427-436

Mulier F, Cherkassky V (1995) Self-organization as an iterative kernel smoothing process. Neural Comput 7:1165-1177

Ritter H, Schulten K (1986) On the stationary state of Kohonen's self-organizing sensory mapping. Biol Cybern 54:99-106

Ritter H, Schulten K (1989) Combining self-organizing maps. Proc Int Joint Conf Neural Networks 2:499-502

Ritter H, Martinetz T, Schulten K (1992) Neural computation and self-organizing maps: an introduction. Addison-Wesley, Reading, Mass

Roosen CB, Hastie TJ (1994) Automatic smoothing spline projection pursuit. J Comput Graphical Stat 3:235-248

Rumelhart DE, Zipser D (1985) Feature discovery by competitive learning. Cogn Sci 9:75-112

Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London

Tukey PA, Tukey JW (1981) Graphical display of data sets in 3 or more dimensions. In: Barnett V (ed) Interpreting multivariate data. Wiley, Chichester, pp 189-275

Van Hulle MM (1995) Globally-ordered topology-preserving maps achieved with a learning rule performing local weight updates only. Proc IEEE NNSP95, pp 95-104

Van Hulle MM (1997) The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals. Neural Comput 9:595-606