Communicated by Helge Ritter

GTM: The Generative Topographic Mapping

Christopher M. Bishop, Markus Svensén, Christopher K. I. Williams
Neural Computing Research Group, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, U.K.

Neural Computation 10, 215–234 (1998). © 1997 Massachusetts Institute of Technology

Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this article, we introduce a form of nonlinear latent variable model called the generative topographic mapping, for which the parameters of the model can be determined using the expectation-maximization algorithm. GTM provides a principled alternative to the widely used self-organizing map (SOM) of Kohonen (1982) and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multiphase oil pipeline.

1 Introduction

Many data sets exhibit significant correlations between the variables. One way to capture such structure is to model the distribution of the data in terms of latent, or hidden, variables. A familiar example of this approach is factor analysis, which is based on a linear transformation from latent space to data space. In this article, we show how the latent variable framework can be extended to allow nonlinear transformations while remaining computationally tractable. This leads to the GTM (generative topographic mapping) algorithm, which is based on a constrained mixture of gaussians whose parameters can be optimized using the EM (expectation-maximization) algorithm.

One of the motivations for this work is to provide a principled alternative to the widely used self-organizing map (SOM) algorithm (Kohonen, 1982), in which a set of unlabeled data vectors tn (n = 1, ..., N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. Although this algorithm has achieved many successes in practical applications, it also suffers from some significant deficiencies, many of which are highlighted in Kohonen (1995).

They include the absence of a cost function, the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced to the heuristic origins of the SOM algorithm.¹ We show that the GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages.

An important application of latent variable models is to data visualization. Many of the models used in visualization are regarded as defining a projection from the D-dimensional data space onto a two-dimensional visualization space. We shall see that, by contrast, the GTM model is defined in terms of a mapping from the latent space into the data space. For the purposes of data visualization, the mapping is then inverted using Bayes' theorem, giving rise to a posterior distribution in latent space.

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t1, ..., tD) in terms of a number L of latent variables x = (x1, ..., xL). This is achieved by first considering a function y(x;W), which maps points x in the latent space into corresponding points y(x;W) in the data space. The mapping is governed by a set of parameters W and could consist, for example, of a feedforward neural network, in which case W would represent the weights and biases. We are interested in the situation in which the dimensionality L of the latent variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data set itself has an intrinsic dimensionality that is less than D. The transformation y(x;W) then maps the latent variable space into an L-dimensional non-Euclidean manifold S embedded within the data space.² This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1.

If we define a probability distribution p(x) on the latent variable space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x, for reasons that will become clear shortly. Since L < D, the distribution in t-space would be confined to the L-dimensional manifold and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector.

¹ Biological metaphor is sometimes invoked when motivating the SOM procedure. It should be stressed that our goal here is not neurobiological modeling, but rather the development of effective algorithms for data analysis, for which biological realism need not be considered.

² We assume that the matrix of partial derivatives ∂yk/∂xi has full column rank.

Figure 1: The nonlinear function y(x;W) defines a manifold S embedded in data space given by the image of the latent variable space under the mapping x → y.

We choose the distribution of t, for given x and W, to be a radially symmetric gaussian centered on y(x;W) having variance β⁻¹, so that

p(t \mid x, W, \beta) = \left( \frac{\beta}{2\pi} \right)^{D/2} \exp\left\{ -\frac{\beta}{2} \, \| y(x;W) - t \|^{2} \right\}.   (2.1)

Note that other models for p(t|x) might also be appropriate, such as a Bernoulli for binary variables (with a sigmoid transformation of y) or a multinomial for mutually exclusive classes (with a softmax, or normalized exponential, transformation of y [Bishop, 1995]), or even combinations of these. The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution,

p(t \mid W, \beta) = \int p(t \mid x, W, \beta) \, p(x) \, dx.   (2.2)

For a given data set D = (t1, ..., tN) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood. In practice it is convenient to maximize the log likelihood, given by

L(W, \beta) = \ln \prod_{n=1}^{N} p(t_n \mid W, \beta).   (2.3)

Once we have specified the prior distribution p(x) and the functional form of the mapping y(x;W), we can in principle determine W and β by maximizing L(W, β).

Figure 2: In order to formulate a latent variable model similar in spirit to the SOM, we consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node xi is mapped to a corresponding point y(xi;W) in data space and forms the center of a corresponding gaussian distribution.

However, the integral over x in equation 2.2 will, in general, be analytically intractable. If we choose y(x;W) to be a linear function of W, and we choose p(x) to be gaussian, then the integral becomes a convolution of two gaussians, which is itself a gaussian. For a noise distribution p(t|x) that is gaussian with a diagonal covariance matrix, we obtain the standard factor analysis model. In the case of the radially symmetric gaussian given by equation 2.1, the model is closely related to principal component analysis, since the maximum likelihood solution for W has columns given by the scaled principal eigenvectors (Tipping & Bishop, 1997). Here we wish to extend this formalism to nonlinear functions y(x;W), and in particular to develop a model similar in spirit to the SOM algorithm. We therefore consider a specific form for p(x) given by a sum of delta functions centered on the nodes of a regular grid in latent space,

p(x) = \frac{1}{K} \sum_{i=1}^{K} \delta(x - x_i),   (2.4)

in which case the integral in equation 2.2 can again be performed analytically. Each point xi is then mapped to a corresponding point y(xi;W) in data space, which forms the center of a gaussian density function, as illustrated in Figure 2.

From equations 2.2 and 2.4, we see that the distribution function in data space then takes the form

p(t \mid W, \beta) = \frac{1}{K} \sum_{i=1}^{K} p(t \mid x_i, W, \beta),   (2.5)

and the log likelihood function becomes

L(W, \beta) = \sum_{n=1}^{N} \ln \left\{ \frac{1}{K} \sum_{i=1}^{K} p(t_n \mid x_i, W, \beta) \right\}.   (2.6)

For the particular noise model p(t|x,W,β) given by equation 2.1, the distribution p(t|W,β) corresponds to a constrained gaussian mixture model (Hinton, Williams, & Revow, 1992), since the centers of the gaussians, given by y(xi;W), cannot move independently but are related through the function y(x;W). Note that, provided the mapping function y(x;W) is smooth and continuous, the projected points y(xi;W) will necessarily have a topographic ordering in the sense that any two points xA and xB that are close in latent space will map to points y(xA;W) and y(xB;W), which are close in data space.

2.1 The EM Algorithm. If we now choose a particular parameterized form for y(x;W), which is a differentiable function of W (for example, a feedforward network with sigmoidal hidden units), then we can use standard techniques for nonlinear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W* and an inverse variance β* that maximize L(W, β).

However, our model consists of a mixture distribution, which suggests that we might seek an EM algorithm (Dempster, Laird, & Rubin, 1977; Bishop, 1995). By making a suitable choice of model y(x;W), we will see that the M-step corresponds to the solution of a set of linear equations. In particular, we shall choose y(x;W) to be given by a generalized linear regression model of the form

y(x; W) = W \phi(x),   (2.7)

where the elements of φ(x) consist of M fixed basis functions φj(x), and W is a D × M matrix. Generalized linear regression models possess the same universal approximation capabilities as multilayer adaptive networks, provided the basis functions φj(x) are chosen appropriately. The usual limitation of such models, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space (Bishop, 1995). In our context, this is not a significant problem, since the dimensionality is governed by the number of latent variables, which will typically be small. In fact, for data visualization applications, we generally use L = 2.
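To make this construction concrete, the following sketch (an illustration in Python/NumPy, not the authors' original implementation; the grid sizes, the [-1, 1] latent range, and the width parameter are arbitrary choices) builds a regular grid of latent points xi, a grid of gaussian basis function centers, and the K × M matrix Φ with elements Φij = φj(xi), including a constant bias basis function.

    import numpy as np

    def latent_grid(points_per_dim, L=2):
        """Regular grid of K = points_per_dim**L latent points in [-1, 1]^L."""
        axes = [np.linspace(-1.0, 1.0, points_per_dim)] * L
        return np.array(np.meshgrid(*axes)).reshape(L, -1).T      # shape (K, L)

    def basis_matrix(X, centers, sigma):
        """Phi with Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma^2)), plus a bias column."""
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        Phi = np.exp(-sq_dist / (2.0 * sigma ** 2))
        return np.hstack([Phi, np.ones((X.shape[0], 1))])          # shape (K, M+1)

    # Example: 20x20 latent grid, 5x5 gaussian basis centers,
    # width set to twice the center spacing (the choice used in section 3).
    X = latent_grid(20)                  # K = 400 latent sample points
    mu = latent_grid(5)                  # 25 basis function centers
    sigma = 2.0 * (mu[1, 0] - mu[0, 0])  # spacing of neighboring centers, doubled
    Phi = basis_matrix(X, mu, sigma)     # shape (400, 26)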

The maximization of L(W, β) given by equation 2.6 can be regarded as a missing-data problem in which the identity i of the component that generated each data point tn is unknown. We can formulate the EM algorithm for this model as follows. First, suppose that, at some point in the algorithm, the current weight matrix is given by Wold and the current inverse noise variance is given by βold. In the E-step, we use Wold and βold to evaluate the posterior probabilities, or responsibilities, of each gaussian component i for every data point tn using Bayes' theorem in the form

R_{in}(W_{old}, \beta_{old}) = p(x_i \mid t_n, W_{old}, \beta_{old})   (2.8)

                             = \frac{p(t_n \mid x_i, W_{old}, \beta_{old})}{\sum_{i'=1}^{K} p(t_n \mid x_{i'}, W_{old}, \beta_{old})}.   (2.9)

We now consider the expectation of the complete-data log likelihood in the form

\langle L_{comp}(W, \beta) \rangle = \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in}(W_{old}, \beta_{old}) \ln \left\{ p(t_n \mid x_i, W, \beta) \right\}.   (2.10)

Maximizing equation 2.10 with respect to W and using equations 2.1 and 2.7, we obtain

\sum_{n=1}^{N} \sum_{i=1}^{K} R_{in}(W_{old}, \beta_{old}) \left\{ W_{new} \phi(x_i) - t_n \right\} \phi^{T}(x_i) = 0.   (2.11)

This can conveniently be written in matrix notation in the form

\Phi^{T} G_{old} \Phi \, W^{T}_{new} = \Phi^{T} R_{old} T,   (2.12)

where Φ is a K × M matrix with elements Φij = φj(xi), T is an N × D matrix with elements tnk, R is a K × N matrix with elements Rin, and G is a K × K diagonal matrix with elements

G_{ii} = \sum_{n=1}^{N} R_{in}(W, \beta).   (2.13)

We can now solve equation 2.12 for Wnew using standard matrix techniques, based on singular value decomposition to allow for possible ill conditioning. Note that the matrix Φ is constant throughout the algorithm and so needs only to be evaluated once at the start.

Similarly, maximizing equation 2.10 with respect to β, we obtain the following reestimation formula:

\frac{1}{\beta_{new}} = \frac{1}{ND} \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in}(W_{old}, \beta_{old}) \left\| W_{new} \phi(x_i) - t_n \right\|^{2}.   (2.14)
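As an illustration of the M-step, the following sketch (assuming NumPy arrays Phi (K × M), T (N × D), and responsibilities R (K × N) are already available; the function and variable names are my own, not the authors') solves equation 2.12 for Wnew with an SVD-based least-squares solver and then reestimates β using equation 2.14. The optional ridge term anticipates the regularized update of equation 2.16 below.

    import numpy as np

    def m_step(Phi, T, R, alpha=0.0):
        """One M-step of GTM.

        Phi : (K, M) basis matrix, R : (K, N) responsibilities, T : (N, D) data.
        alpha plays the role of lambda/beta in equation 2.16; set to 0 for the
        unregularized update of equation 2.12.
        """
        K, M = Phi.shape
        N, D = T.shape
        G = np.diag(R.sum(axis=1))                      # (K, K), equation 2.13
        A = Phi.T @ G @ Phi + alpha * np.eye(M)         # (M, M)
        B = Phi.T @ R @ T                               # (M, D)
        # Solve A W^T = B; lstsq uses an SVD, which copes with ill conditioning.
        W_new = np.linalg.lstsq(A, B, rcond=None)[0].T  # (D, M)

        # Reestimate the inverse variance, equation 2.14.
        Y = Phi @ W_new.T                               # (K, D) projected centers
        sq_dist = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)  # (K, N)
        beta_new = N * D / np.sum(R * sq_dist)
        return W_new, beta_new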

The EM algorithm alternates between the E-step, corresponding to the evaluation of the posterior probabilities in equation 2.9, and the M-step, given by the solution of equations 2.12 and 2.14. Jensen's inequality can be used to show that at each iteration of the algorithm, the objective function will increase unless it is already at a (local) maximum, as discussed, for example, in Bishop (1995). Typically the EM algorithm gives satisfactory convergence after a few tens of cycles, particularly since we are primarily interested in convergence of the distribution, and this is often achieved much more rapidly than convergence of the parameters themselves.

If desired, a regularization term can be added to the objective function to control the mapping y(x;W). This can be interpreted as a MAP (maximum a posteriori) estimator corresponding to a choice of prior over the weights W. In the case of a radially symmetric gaussian prior of the form

p(W \mid \lambda) = \left( \frac{\lambda}{2\pi} \right)^{MD/2} \exp\left\{ -\frac{\lambda}{2} \sum_{j=1}^{M} \sum_{k=1}^{D} w_{jk}^{2} \right\},   (2.15)

where λ is the regularization coefficient, this leads to a modification of the M-step (equation 2.12) to give

\left( \Phi^{T} G_{old} \Phi + \frac{\lambda}{\beta} I \right) W^{T}_{new} = \Phi^{T} R_{old} T,   (2.16)

where I is the identity matrix.

2.2 Data Visualization. One application for GTM is in data visualization, in which Bayes' theorem is used to invert the transformation from latent space to data space. For the particular choice of prior distribution given by equation 2.4, the posterior distribution is again a sum of delta functions centered at the lattice points, with coefficients given by the responsibilities Rin. These coefficients can be used to provide a visualization of the posterior responsibility map for individual data points in the two-dimensional latent space. If it is desired to visualize a set of data points, then a complete posterior distribution for each data point may provide too much information, and it is often convenient to summarize the posterior by its mean, given for each data point tn by

\langle x \mid t_n, W^{*}, \beta^{*} \rangle = \int p(x \mid t_n, W^{*}, \beta^{*}) \, x \, dx   (2.17)

                                            = \sum_{i=1}^{K} R_{in} x_i.   (2.18)

Keep in mind, however, that the posterior distribution can be multimodal, in which case the posterior mean can give a very misleading summary of the true distribution.

An alternative approach is therefore to evaluate the mode of the distribution, given by

i_{max} = \arg\max_{i} R_{in}.   (2.19)

In practice, it is often convenient to plot both the mean and the mode for each data point, because significant differences between them can be indicative of a multimodal distribution.
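Both summaries follow directly from the responsibility matrix; a minimal sketch (Python/NumPy, with names of my own choosing) that implements equations 2.18 and 2.19 is:

    import numpy as np

    def latent_projections(R, X):
        """Posterior-mean and posterior-mode projections of each data point.

        R : (K, N) responsibilities R_in, columns summing to one.
        X : (K, L) latent sample points x_i.
        Returns (N, L) means (equation 2.18) and (N, L) modes (equation 2.19).
        """
        means = R.T @ X                   # <x | t_n> = sum_i R_in x_i
        modes = X[np.argmax(R, axis=0)]   # latent point with the largest responsibility
        return means, modes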

2.3 Choice of Model Parameters. The problem of density estimation from a finite data set is fundamentally ill posed, since there exist infinitely many distributions that could have given rise to the observed data. An algorithm for density modeling therefore requires some form of "prior knowledge" in addition to the data set. The assumption that the distribution can be described in terms of a reduced number of latent variables is itself part of this prior. In the GTM algorithm, the prior distribution over mapping functions y(x;W) is governed by the prior over weights W, given, for example, by equation 2.15, as well as by the basis functions. We typically choose the basis functions φj(x) to be radially symmetric gaussians whose centers are distributed on a uniform grid in x-space, with a common width parameter σ, whose value, along with the number and spacing of the basis functions, determines the smoothness of the manifold. Examples of surfaces generated by sampling the prior are shown in Figure 3.

In addition to the basis functions φj(x), it is also necessary to select the latent space sample points {xi}. Note that if there are too few sample points in relation to the number of basis functions, then the gaussian mixture centers in data space become relatively independent, and the desired smoothness properties can be lost. Having a large number of sample points, however, causes no difficulty beyond increased computational cost. In particular, there is no overfitting if the number of sample points is increased, since the number of degrees of freedom in the model is controlled by the mapping function y(x;W). One way to view the role of the latent space samples {xi} is as a Monte Carlo approximation to the integral over x in equation 2.2 (MacKay, 1995; Bishop, Svensén, & Williams, 1996). The choice of the number K and location of the sample points xi in latent space is not critical, and we typically choose gaussian basis functions and set K so that, in the case of a two-dimensional latent space, O(100) sample points lie within 2σ of the center of each basis function.

Note that we have considered the basis function parameters (widths and locations) to be fixed, with a gaussian prior on the weight matrix W. In principle, priors over the basis function parameters could also be introduced, and these could again be treated by maximum a posteriori (MAP) estimation or by Bayesian integration.

We initialize the parameters W so that the GTM model initially approximates principal component analysis (PCA).

Figure 3: Examples of manifolds generated by sampling from the prior distribution over W given by equation 2.15, showing the effect of the choice of basis functions on the smoothness of the manifold. Here the basis functions are gaussian with width σ = 4s in the left-hand plot (where s is the spacing of the basis function centers) and σ = 2s in the right-hand plot. Different values of λ simply affect the linear scaling of the embedded manifold.

To do this, we first evaluate the data covariance matrix and obtain the first and second principal eigenvectors, and then we determine W by minimizing the error function,

E = \frac{1}{2} \sum_{i} \left\| W \phi(x_i) - U x_i \right\|^{2},   (2.20)

where the columns of U are given by the eigenvectors. This represents the sum-of-squares error between the projections of the latent points into data space by the GTM model and the corresponding projections obtained from PCA. The value of β⁻¹ is initialized to be the larger of either the (L + 1)th eigenvalue from PCA (representing the variance of the data away from the PCA plane) or the square of half of the grid spacing of the PCA-projected latent points in data space.
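One possible reading of this initialization, sketched below in Python/NumPy, fits W by least squares to the PCA targets Uxi and then sets β from the two quantities described above. This is an illustrative reconstruction, not the authors' reference code: the estimate of the grid spacing assumes consecutive rows of X are neighboring grid points, and in practice the data mean is often added to the manifold through the bias basis function.

    import numpy as np

    def pca_initialization(Phi, X, T):
        """Initialize W and beta so that the GTM manifold approximates the PCA plane.

        Phi : (K, M) basis matrix, X : (K, L) latent sample points, T : (N, D) data.
        """
        L = X.shape[1]
        evals, evecs = np.linalg.eigh(np.cov(T, rowvar=False))
        order = np.argsort(evals)[::-1]          # eigenvalues in decreasing order
        U = evecs[:, order[:L]]                  # columns: first L principal eigenvectors

        # Minimize E = 1/2 sum_i ||W phi(x_i) - U x_i||^2 (equation 2.20).
        W = np.linalg.lstsq(Phi, X @ U.T, rcond=None)[0].T   # (D, M)

        # beta^{-1}: the larger of the (L+1)th PCA eigenvalue and the square of
        # half the grid spacing of the projected latent points (crude estimate).
        Y = Phi @ W.T
        spacing = np.linalg.norm(Y[1] - Y[0])
        beta_inv = max(evals[order[L]], (spacing / 2.0) ** 2)
        return W, 1.0 / beta_inv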

Finally, we note that in a numerical implementation, care must be taken over the evaluation of the responsibilities, since this involves computing the exponentials of the distances between the projected latent points and the data points, which may span a significant range of values.
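One standard way to guard against this (a sketch under the assumption that the distances are held in a K × N array; the shift-by-the-maximum device is a generic numerical trick, not something prescribed by the text) is to work with log probabilities and subtract the per-column maximum before exponentiating:

    import numpy as np

    def responsibilities(Y, T, beta):
        """E-step of equation 2.9, evaluated in a numerically stable way.

        Y : (K, D) projected latent points y(x_i; W), T : (N, D) data, beta : scalar.
        """
        sq_dist = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)   # (K, N)
        log_p = -0.5 * beta * sq_dist            # log p(t_n | x_i) up to a shared constant
        log_p -= log_p.max(axis=0, keepdims=True)    # subtract the column max before exp
        R = np.exp(log_p)
        return R / R.sum(axis=0, keepdims=True)      # columns sum to one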

2.4 Summary of the GTM Algorithm. Although the foregoing discussion has been somewhat detailed, the underlying GTM algorithm itself is straightforward and is summarized here for convenience.

GTM consists of a constrained mixture of gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm.

Figure 4: Results from a toy problem involving data (○) generated from a one-dimensional curve embedded in two dimensions, together with the projected latent points (+) and their gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and the converged configuration, obtained after 15 iterations of EM, is shown on the right.

It is defined by specifying a set of points {xi} in latent space, together with a set of basis functions {φj(x)}. The adaptive parameters W and β define a constrained mixture of gaussians with centers Wφ(xi) and a common covariance matrix given by β⁻¹I. After initializing W and β, training involves alternating between the E-step, in which the posterior probabilities are evaluated using equation 2.9, and the M-step, in which W and β are reestimated using equations 2.12 and 2.14, respectively. Evaluation of the log likelihood using equation 2.6 at the end of each cycle can be used to monitor convergence.
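Putting these steps together, a compact batch training loop might look like the following sketch (Python/NumPy; it assumes Phi, T, and initial values of W and beta, for example from the PCA-based scheme above, and the iteration count is an arbitrary choice):

    import numpy as np

    def train_gtm(Phi, T, W, beta, n_iter=40):
        """Batch EM for GTM: alternate the E-step (eq. 2.9) and M-step (eqs. 2.12, 2.14)."""
        K = Phi.shape[0]
        N, D = T.shape
        log_lik = []
        for _ in range(n_iter):
            # E-step: responsibilities of each latent point for each data point.
            Y = Phi @ W.T                                              # (K, D) centers
            sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)    # (K, N)
            log_p = (D / 2.0) * np.log(beta / (2 * np.pi)) - 0.5 * beta * sq
            shift = log_p.max(axis=0, keepdims=True)
            P = np.exp(log_p - shift)
            R = P / P.sum(axis=0, keepdims=True)                       # (K, N)
            # Log likelihood (equation 2.6), used to monitor convergence.
            log_lik.append(np.sum(np.log(P.sum(axis=0) / K) + shift.ravel()))
            # M-step: solve for W (equation 2.12), then reestimate beta (equation 2.14).
            G = np.diag(R.sum(axis=1))
            W = np.linalg.lstsq(Phi.T @ G @ Phi, Phi.T @ R @ T, rcond=None)[0].T
            sq_new = ((Phi @ W.T)[:, None, :] - T[None, :, :]) ** 2
            beta = N * D / np.sum(R * sq_new.sum(axis=2))
        return W, beta, log_lik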

3 Experimental Results

We now present results from the application of this algorithm first to a toy problem involving data in two dimensions and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multiphase pipelines. In both examples, we choose the basis functions φj(x) to be radially symmetric gaussians whose centers are distributed on a uniform grid in x-space, with a common width parameter chosen equal to twice the separation of neighboring basis function centers. Results from a toy problem for the case of a two-dimensional data space and a one-dimensional latent space are shown in Figure 4.

Figure 5: The left plot shows the posterior-mean projection of the oil flow data in the latent space of the GTM model; the plot on the right shows the same data set visualized using principal component analysis. In both plots, crosses, circles, and plus signs represent stratified, annular, and homogeneous multiphase configurations, respectively. Note how the nonlinearity of GTM gives an improved separation of the clusters.

3.1 Oil Flow Data. Our second example arises from the problem of determining the fraction of oil in a multiphase pipeline carrying a mixture of oil, water, and gas (Bishop & James, 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data are used that model accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). The three phases in the pipe (oil, water, and gas) can belong to one of three different geometrical configurations, corresponding to laminar, homogeneous, and annular flows, and the data set consists of 1000 points drawn with equal probability from the three configurations. We take the latent variable space to be two-dimensional, since our goal is data visualization.

Figure 5 shows the oil data visualized in the latent variable space in which, for each data point, we have plotted the posterior mean vector. Each point has then been labeled according to its multiphase configuration. For comparison, Figure 5 also shows the corresponding results obtained using PCA.

4 Relation to the Self-Organizing Map

Since one motivation for GTM is to provide a principled alternative to the SOM, it is useful to consider the precise relationship between GTM and SOM. Focusing on the batch versions of both algorithms helps to make the relationship particularly clear.

The batch version of the SOM algorithm (Kohonen, 1995) can be described as follows. A set of K reference vectors zi is defined in the data space, in which each vector is associated with a node on a regular lattice in a (typically) two-dimensional feature map (analogous to the latent space of GTM). The algorithm begins by initializing the reference vectors (for example, by setting them to random values, setting them equal to a random subset of the data points, or using PCA). Each cycle of the algorithm then proceeds as follows. For every data vector tn, the corresponding "winning node" j(n) is identified, corresponding to the reference vector zj having the smallest Euclidean distance ‖zj − tn‖² to tn. The reference vectors are then updated by setting them equal to weighted averages of the data points, given by

z_i = \frac{\sum_{n} h_{ij(n)} t_n}{\sum_{n} h_{ij(n)}},   (4.1)

in which hij is a neighborhood function associated with the ith node. This is generally chosen to be a unimodal function of the feature map coordinates centered on the winning node, for example, a gaussian. The steps of identifying the winning nodes and updating the reference vectors are repeated iteratively. A key ingredient in the algorithm is that the width of the neighborhood function hij starts with a relatively large value and is gradually reduced after each iteration.
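For concreteness, one cycle of the batch update in equation 4.1 might be sketched as follows (Python/NumPy; the gaussian neighborhood over the feature-map lattice is an illustrative choice, and the width schedule is left to the caller):

    import numpy as np

    def batch_som_step(Z, nodes, T, width):
        """One cycle of the batch SOM update, equation 4.1.

        Z     : (K, D) reference vectors, nodes : (K, 2) feature-map coordinates,
        T     : (N, D) data, width : current neighborhood width.
        """
        # Winning node for every data vector (smallest Euclidean distance).
        d = ((Z[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)          # (K, N)
        winners = np.argmin(d, axis=0)                                  # (N,)
        # Gaussian neighborhood h_{i j(n)} on the feature-map lattice.
        lattice_sq = ((nodes[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
        H = np.exp(-lattice_sq / (2.0 * width ** 2))                    # (K, K)
        h = H[:, winners]                                               # (K, N)
        # Weighted averages of the data points (equation 4.1).
        return (h @ T) / h.sum(axis=1, keepdims=True)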

4.1 Kernel versus Linear Regression. As pointed out by Mulier and Cherkassky (1995), the value of the neighborhood function hij(n) depends only on the identity of the winning node j and not on the value of the corresponding data vector tn. We can therefore perform partial sums over the groups Gj of data vectors assigned to each node j, and hence rewrite equation 4.1 in the form

z_i = \sum_{j} K_{ij} m_j,   (4.2)

in which mj is the mean of the vectors in group Gj and is given by

m_j = \frac{1}{N_j} \sum_{n \in G_j} t_n,   (4.3)

where Nj is the number of data vectors in group Gj. The result (equation 4.2) is analogous to the Nadaraya-Watson kernel regression formula (Nadaraya, 1964; Watson, 1964), with the kernel functions given by

K_{ij} = \frac{h_{ij} N_j}{\sum_{j'} h_{ij'} N_{j'}}.   (4.4)

Thus the batch SOM algorithm replaces the reference vectors at each cycle with a convex combination of the node means mj, with coefficients determined by the neighborhood function. Note that the kernel coefficients satisfy \sum_{j} K_{ij} = 1 for every i.

In the GTM algorithm, the centers y(xi;W) of the gaussian components can be regarded as analogous to the reference vectors zi of the SOM. We can evaluate y(xi;W) by solving the M-step equation 2.12 to find W and then using y(xi;W) = Wφ(xi). If we define the weighted means of the data vectors by

\mu_i = \frac{\sum_{n} R_{in} t_n}{\sum_{n} R_{in}},   (4.5)

then we obtain

y(x_i; W) = \sum_{j} F_{ij} \mu_j,   (4.6)

where we have introduced the effective kernel Fij given by

F_{ij} = \phi^{T}(x_i) \left( \Phi^{T} G \Phi \right)^{-1} \phi(x_j) \, G_{jj}.   (4.7)

Note that the effective kernel satisfies \sum_{j} F_{ij} = 1. To see this, we first use equation 4.7 to show that \sum_{j} F_{ij} \phi_l(x_j) = \phi_l(x_i). Then if one of the basis functions l corresponds to a bias, so that φl(x) = const., the result follows.

The solution for y(xi;W) given by equations 4.6 and 4.7 can be interpreted as a weighted least-squares regression (Mardia, Kent, & Bibby, 1979) in which the target vectors are the µi, and the weighting coefficients are given by Gjj.
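As a check on this correspondence, the effective kernel can be computed directly from the quantities already used in the M-step. The following sketch assumes the arrays Phi and R defined earlier (the row-sum-to-one property holds when one of the basis functions is a constant bias, as noted above):

    import numpy as np

    def effective_kernel(Phi, R):
        """Effective kernel F of equation 4.7.

        Phi : (K, M) basis matrix, R : (K, N) responsibilities.
        """
        g = R.sum(axis=1)                                  # diagonal of G (equation 2.13)
        A = Phi.T @ (Phi * g[:, None])                     # Phi^T G Phi, shape (M, M)
        F = Phi @ np.linalg.solve(A, Phi.T) * g[None, :]   # F_ij = phi_i^T A^{-1} phi_j G_jj
        return F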

Figure 6 shows an example of the effective kernel for GTM corresponding to the oil flow problem discussed in section 3.

From equations 4.2 and 4.6 we see that both GTM and SOM can be regarded as forms of kernel smoothers. However, there are two key differences. The first is that in SOM, the vectors that are smoothed, defined by equation 4.3, correspond to hard assignments of data points to nodes, whereas the corresponding vectors in GTM, given by equation 4.5, involve soft assignments, weighted by the posterior probabilities.

Figure 6: Example of the effective kernel Fij plotted as a function of the node j for a given node i, for the oil flow data set after three iterations of EM. This kernel function is analogous to the (normalized) neighborhood function in the SOM algorithm.

This is analogous to the distinction between K-means clustering (hard assignments) and fitting a standard gaussian mixture model using EM (soft assignments).

The second key difference is that the kernel function in SOM is made to shrink during the course of the algorithm in an arbitrary, handcrafted manner. In GTM, the posterior probability distribution in latent space, for a given data point, forms a localized bubble, and the radius of this bubble shrinks automatically during training, as shown in Figure 7. This responsibility bubble governs the extent to which individual data points contribute toward the vectors µi in equation 4.5 and hence toward the updating of the gaussian centers y(xi;W) via equation 4.6.

4.2 Comparison of GTM with SOM. The most significant difference between the GTM and SOM algorithms is that GTM defines an explicit probability density given by the mixture distribution in equation 2.5. As a consequence, there is a well-defined objective function given by the log likelihood (see equation 2.6), and convergence to a (local) maximum of the objective function is guaranteed by the use of the EM algorithm (Dempster et al., 1977). This also provides a direct means to compare different choices of model parameters and even to compare a GTM solution with another density model by evaluating the likelihood of a test set under the generative distributions of the respective models. For the SOM algorithm, however, there is no probability density and no well-defined objective function that is being minimized by the training process.

Figure 7: Examples of the posterior probabilities (responsibilities) Rin of the latent space points at an early stage (left), intermediate stage (center), and late stage (right) during the convergence of the GTM algorithm. These have been evaluated for a single data point from the training set in the oil flow problem discussed in section 3 and are plotted using a nonlinear scaling of the form p(x|tn)^0.1 to highlight the variation over the latent space. Notice how the responsibility bubble, which governs the updating of the weight matrix, and hence the updating of the data-space vectors y(xi;W), shrinks automatically during the learning process.

Indeed, it has been proved (Erwin, Obermayer, & Schulten, 1992) that such an objective function cannot exist for the SOM.

A further limitation of the SOM, highlighted in Kohonen (1995, p. 234), is that the conditions under which so-called self-organization of the SOM occurs have not been quantified, and so in practice it is necessary to confirm empirically that the trained model does indeed have the desired spatial ordering. In contrast, the neighborhood-preserving nature of the GTM mapping is an automatic consequence of the choice of a continuous function y(x;W).

Similarly, the smoothness properties of the SOM are determined indirectly by the choice of neighborhood function and by the way in which it is changed during the course of the algorithm, and they are therefore difficult to control. Thus, prior knowledge about the form of the map cannot easily be specified. The prior distribution for GTM, however, can be controlled directly, and properties such as smoothness are governed explicitly by the basis function parameters, as illustrated in Figure 3.

Finally, we consider the relative computational costs of the GTM and SOM algorithms. For problems involving data in high-dimensional spaces, the dominant computational cost of GTM arises from the evaluation of the Euclidean distances from every data point to every gaussian center y(xi;W). Since exactly the same calculations must be done for SOM (involving the distances of data points from the reference vectors µi), we expect one iteration of either algorithm to take approximately the same time.

An empirical comparison of the computational cost of GTM and SOM was obtained by running each algorithm on the oil flow data until convergence (defined as no discernible change in the appearance of the visualization map). The GTM algorithm took 1058 sec (40 iterations), while the batch SOM took 1011 sec (25 iterations) using a gaussian neighborhood function. With a simple top-hat neighborhood function, in which each reference vector is updated at each iteration using only data points associated with nearby reference vectors, the CPU time for the SOM algorithm is reduced to 305 sec (25 iterations). One potential advantage of GTM in practical applications arises from a reduction in the number of experimental training runs needed, since both convergence and topographic ordering are guaranteed.

5 Relation to Other Algorithms

Several algorithms in the published literature have close links with GTM. Here we review briefly the most significant of these.

The elastic net algorithm of Durbin and Willshaw (1987) can be viewed as a gaussian mixture density model, fitted by penalized maximum likelihood. The penalty term encourages the centers of gaussians corresponding to neighboring points along the (typically one-dimensional) chain to be close in data space. It differs from GTM in that it does not define a continuous data space manifold. Also, the training algorithm generally involves a handcrafted annealing of the weight penalty coefficient.

There are also similarities between GTM and principal curves and principal surfaces (Hastie & Stuetzle, 1989; LeBlanc & Tibshirani, 1994), which again involve a two-stage algorithm consisting of projection followed by smoothing, although these are not generative models. It is interesting to note that Hastie and Stuetzle (1989) propose reducing the spatial width of the smoothing function during learning, in a manner analogous to the shrinking of the neighborhood function in the SOM. A modified form of the principal curves algorithm (Tibshirani, 1992) introduces a generative distribution based on a mixture of gaussians, with a well-defined likelihood function, and is trained by the EM algorithm. However, the number of gaussian components is equal to the number of data points, and smoothing is imposed by penalizing the likelihood function with the addition of a derivative-based regularization term.

The technique of parameterized self-organizing maps (PSOMs) involves first fitting a standard SOM model to a data set and then finding a manifold in data space that interpolates the reference vectors (Ritter, 1993). Although this defines a continuous manifold, the interpolating surface does not form part of the training algorithm, and the basic problems in using SOM, discussed in section 4.2, remain.

The SOM has also been used for vector quantization.

In this context it has been shown how a reformulation of the vector quantization problem (Luttrell, 1990; Buhmann & Kuhnel, 1993; Luttrell, 1994; Luttrell, 1995) can avoid many of the problems with the SOM procedure discussed earlier.

Finally, the density network model of MacKay (1995) involves transforming a simple distribution in latent space to a complex distribution in data space by propagation through a nonlinear network. A discrete distribution in latent space is again used, which is interpreted as an approximate Monte Carlo integration over the latent variables needed to define the data space distribution. GTM can be seen as a particular instance of this framework in which the sampling of latent space is regular rather than stochastic, a specific form of nonlinearity is used, and the model parameters are adapted using EM.

6 Discussion

In this article, we have introduced a form of nonlinear latent variable model that can be trained efficiently using the EM algorithm. Viewed as a topographic mapping algorithm, it has the key property that it defines a probability density model.

As an example of the significance of having a probability density, consider the important practical problem of dealing with missing values in the data set (in which some components of the data vectors tn are unobserved). If the missing values are missing at random (Little & Rubin, 1987), then the likelihood function is obtained by integrating out the unobserved values. For the GTM model, the integrations can be performed analytically, leading to a simple modification of the EM algorithm.

A further consequence of having a probabilistic approach is that it is straightforward to consider a mixture of GTM models. In this case, the overall density can be written as

p(t) = \sum_{r} P(r) \, p(t \mid r),   (6.1)

where p(t|r) represents the rth model, with its own set of independent parameters, and P(r) are mixing coefficients satisfying 0 ≤ P(r) ≤ 1 and \sum_{r} P(r) = 1. Again, it is straightforward to extend the EM algorithm to maximize the corresponding likelihood function.

The GTM algorithm can be extended in other ways, for instance, by allowing independent mixing coefficients πi (prior probabilities) for each of the gaussian components, which again can be estimated by a straightforward extension of the EM algorithm. Instead of being independent parameters, the πi can be determined as smooth functions of the latent variables using a normalized exponential applied to a generalized linear regression model, although in this case the M-step of the EM algorithm would involve nonlinear optimization. Similarly, the inverse noise variance β can be generalized to a function of x.

An important property of GTM is the existence of a smooth manifold in data space, which allows the local magnification factor between latent and data space to be evaluated as a function of the latent space coordinates using the techniques of differential geometry (Bishop, Svensén, & Williams, in press). Finally, since there is a well-defined likelihood function, it is straightforward in principle to introduce priors over the model parameters (as discussed in section 2.1) and to use Bayesian techniques in place of maximum likelihood.

Throughout this article, we have focused on the batch version of the GTM algorithm in which all of the training data are used together to update the model parameters. In some applications, it will be more convenient to consider sequential adaptation in which data points are presented one at a time. Since we are optimizing a differentiable objective function, given by equation 2.6, a sequential algorithm can be obtained by appealing to the Robbins-Monro procedure (Robbins & Monro, 1951; Bishop, 1995) to find a zero of the objective function gradient. Alternatively, a sequential form of the EM algorithm can be used (Titterington, Smith, & Makov, 1985).

A Web site for GTM is provided at http://www.ncrg.aston.ac.uk/GTM/, which includes postscript files of relevant papers, a software implementation in Matlab (a C implementation is under development), and example data sets used in the development of the GTM algorithm.

Acknowledgments

This work was supported by EPSRC grant GR/K51808: Neural Networks for Visualization of High-Dimensional Data. We thank Geoffrey Hinton, Iain Strachan, and Michael Tipping for useful discussions. Markus Svensén thanks the staff of the SANS group in Stockholm for their hospitality during part of this project.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Bishop, C. M., & James, G. D. (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research, A327, 580–593.

Bishop, C. M., Svensén, M., & Williams, C. K. I. (1996). A fast EM algorithm for latent variable density models. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 465–471). Cambridge, MA: MIT Press.

Bishop, C. M., Svensén, M., & Williams, C. K. I. (in press). Magnification factors for the GTM algorithm. In Proceedings of the Fifth IEE International Conference on Artificial Neural Networks (pp. 64–69). Cambridge, U.K.: IEE.

Buhmann, J., & Kuhnel, K. (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39(4), 1133–1145.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39(1), 1–38.

Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem. Nature, 326, 689–691.

Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84(406), 502–516.

Hinton, G. E., Williams, C. K. I., & Revow, M. D. (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 512–519). San Mateo, CA: Morgan Kaufmann.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.

Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.

LeBlanc, M., & Tibshirani, R. (1994). Adaptive principal surfaces. Journal of the American Statistical Association, 89(425), 53–64.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley.

Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Transactions on Neural Networks, 1(2), 229–232.

Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation, 6(5), 767–794.

Luttrell, S. P. (1995). Using self-organizing maps to classify radar range profiles. In Proc. 5th IEE Conf. on Artificial Neural Networks (pp. 335–340).

MacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A 354(1), 73–80.

Mardia, K., Kent, J., & Bibby, M. (1979). Multivariate analysis. New York: Academic Press.

Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7(6), 1165–1177.

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and Its Applications, 9(1), 141–142.

Ritter, H. (1993). Parameterized self-organizing maps. In Proceedings ICANN'93 International Conference on Artificial Neural Networks, Amsterdam (pp. 568–575). Berlin: Springer-Verlag.

Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.

Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing, 2, 183–190.

Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis (Tech. Rep. NCRG/97/010). Neural Computing Research Group, Dept. of Computer Science & Applied Mathematics, Aston University, Birmingham B4 7ET, U.K.

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26, 359–372.

Received April 17, 1996; accepted April 3, 1997.

Page 21: GTM: The Generative Topographic Mapping

This article has been cited by:

1. Ezequiel López-Rubio, Rafael Luque-BaenaOnline Learning by Stochastic Approximation for Background Modeling 8-1-8-23.[CrossRef]

2. Hiromasa Kaneko, Kimito Funatsu. 2014. Model for predicting transmembrane pressure jump for various membrane bioreactors.Desalination and Water Treatment 1-11. [CrossRef]

3. Hiromasa Kaneko, Kimito Funatsu. 2014. Analysis of a transmembrane pressure (TMP) jump prediction model for preventingTMP jumps. Desalination and Water Treatment 1-6. [CrossRef]

4. Andrej Gisbrecht, Alexander Schulz, Barbara Hammer. 2014. Parametric nonlinear dimensionality reduction using kernel t-SNE.Neurocomputing . [CrossRef]

5. Antonio Gracia, Santiago González, Victor Robles, Ernestina Menasalvas. 2014. A methodology to compare DimensionalityReduction algorithms in terms of loss of quality. Information Sciences 270, 1-27. [CrossRef]

6. T. Vatanen, M. Osmala, T. Raiko, K. Lagus, M. Sysi-Aho, M. Orešič, T. Honkela, H. Lähdesmäki. 2014. Self-organization andmissing values in SOM and GTM. Neurocomputing . [CrossRef]

7. Kiyoshi Hasegawa, Kimito Funatsu. 2014. Generative topographic mapping of binding pocket of β 2 receptor and three-way partialleast squares modeling of inhibitory activities. Journal of Chemometrics n/a-n/a. [CrossRef]

8. César A. Astudillo, B. John Oommen. 2014. Topology-oriented self-organizing maps: a survey. Pattern Analysis and Applications17, 223-248. [CrossRef]

9. M. Griebel, A. Hullmann. 2014. Dimensionality Reduction of High-Dimensional Data with a NonLinear Principal ComponentAligned Generative Topographic Mapping. SIAM Journal on Scientific Computing 36, A1027-A1047. [CrossRef]

10. Asha Viswanath, A. I. J. Forrester, A. J. Keane. 2014. Constrained Design Optimization Using Generative Topographic Mapping.AIAA Journal 52, 1010-1023. [CrossRef]

11. Natalia V. Kireeva, Svetlana I. Ovchinnikova, Igor V. Tetko, Abdullah M. Asiri, Konstantin V. Balakin , Aslan Yu. Tsivadze.2014. Nonlinear Dimensionality Reduction for Visualizing Toxicity Data: Distance-Based Versus Topology-Based Approaches.ChemMedChem n/a-n/a. [CrossRef]

12. Trevor Richardson, Brett Nekolny, Joseph Holub, Eliot H. Winer. 2014. Visualizing Design Spaces Using Two-DimensionalContextual Self-Organizing Maps. AIAA Journal 52, 725-738. [CrossRef]

13. Antoine Deleforge, Florence Forbes, Radu Horaud. 2014. High-dimensional regression with gaussian mixtures and partially-latent response variables. Statistics and Computing . [CrossRef]

14. Wenyue Feng, Zhouwang Yang, Jiansong Deng. 2014. Moving Multiple Curves/Surfaces Approximation of Mixed Point Clouds.Communications in Mathematics and Statistics 2, 107-124. [CrossRef]

15. Bibliography 177-197. [CrossRef]16. Natalia V. Kireeva, Svetlana I. Ovchinnikova, Sergey L. Kuznetsov, Andrey M. Kazennov, Aslan Yu. Tsivadze. 2014. Impact of

distance-based metric learning on classification and visualization model performance and structure–activity landscapes. Journal ofComputer-Aided Molecular Design . [CrossRef]

17. Atish Roy, Araceli S. Romero-Peláez, Tim J. Kwiatkowski, Kurt J. Marfurt. 2014. Generative topographic mapping for seismicfacies estimation of a carbonate wash, Veracruz Basin, southern Mexico. Interpretation 2, SA31-SA47. [CrossRef]

18. Barbara HammerNeural Networks 817-855. [CrossRef]19. Kiyoshi Hasegawa, Kimito Funatsu. 2014. Prediction of ProteinProtein Interaction Pocket Using L-Shaped PLS Approach and

Its Visualizations by Generative Topographic Mapping. Molecular Informatics 33:10.1002/minf.v33.1, 65-72. [CrossRef]20. Seppo Pulkkinen, Marko M. Mäkelä, Napsu Karmitsa. 2014. A generative model and a generalized trust region Newton method

for noise reduction. Computational Optimization and Applications 57, 129-165. [CrossRef]21. Héléna A. Gaspar, Gilles Marcou, Dragos Horvath, Alban Arault, Sylvain Lozano, Philippe Vayer, Alexandre Varnek.

2013. Generative Topographic Mapping-Based Classification Models and Their Applicability Domain: Application to theBiopharmaceutics Drug Disposition Classification System (BDDCS). Journal of Chemical Information and Modeling 53, 3318-3325.[CrossRef]

22. B. Cannas, A. Fanni, A. Murari, A. Pau, G. Sias. 2013. Automatic disruption classification based on manifold learning for real-time applications on JET. Nuclear Fusion 53, 093023. [CrossRef]

23. Alfredo Vellido, David L. García, Àngela Nebot. 2013. Cartogram visualization for nonlinear manifold learning models. DataMining and Knowledge Discovery 27, 22-54. [CrossRef]

Page 22: GTM: The Generative Topographic Mapping

24. Signature Generation Algorithms for Polymorphic Worms 169-260. [CrossRef]25. Sho Yakushiji, Tetsuo Furukawa. 2013. Shape space estimation by higher-rank of SOM. Neural Computing and Applications 22,

1267-1277. [CrossRef]26. Seppo Pulkkinen, Marko Mikael Mäkelä, Napsu Karmitsa. 2013. A continuation approach to mode-finding of multivariate

Gaussian mixtures and kernel density estimates. Journal of Global Optimization 56, 459-487. [CrossRef]27. Jianbo Yu. 2013. A nonlinear probabilistic method and contribution analysis for machine condition monitoring. Mechanical Systems

and Signal Processing 37, 293-314. [CrossRef]28. B Cannas, A Fanni, A Murari, A Pau, G Sias. 2013. Manifold learning to interpret JET high-dimensional operational space.

Plasma Physics and Controlled Fusion 55, 045006. [CrossRef]29. Yi Lin, Juha Hyyppä. 2013. Geometrically modeling 2D scattered points: a review of the potential for methodologically improving

mobile laser scanning in data processing. International Journal of Digital Earth 1-18. [CrossRef]30. Enrique Pelayo, David Buldain, Carlos Orrite. 2013. Magnitude Sensitive Competitive Learning. Neurocomputing . [CrossRef]31. Bassam Mokbel, Wouter Lueks, Andrej Gisbrecht, Barbara Hammer. 2013. Visualizing the quality of dimensionality reduction.

Neurocomputing . [CrossRef]32. N. Kireeva, S.L. Kuznetsov, A.A. Bykov, A. Yu. Tsivadze. 2013. Towards in silico identification of the human ether-a-go-go-

related gene channel blockers: discriminative vs. generative classification models. SAR and QSAR in Environmental Research 24,103-117. [CrossRef]

33. Teuvo Kohonen. 2013. Essentials of the self-organizing map. Neural Networks 37, 52-65. [CrossRef]34. Chao Jin, Thomas Fevens, Sudhir Mudur. 2012. Optimized keyframe extraction for 3D character animations. Computer Animation

and Virtual Worlds 23:10.1002/cav.v23.6, 559-568. [CrossRef]35. Natalia Kireeva, Sergey L. Kuznetsov, Aslan Yu. Tsivadze. 2012. Toward Navigating Chemical Space of Ionic Liquids: Prediction of

Melting Points Using Generative Topographic Maps. Industrial & Engineering Chemistry Research 121024102232003. [CrossRef]36. Valero Laparra, Sandra Jiménez, Gustavo Camps-Valls, Jesús Malo. 2012. Nonlinearities and Adaptation of Color Vision from

Sequential Principal Curves Analysis. Neural Computation 24:10, 2751-2788. [Abstract] [Full Text] [PDF] [PDF Plus]37. ANDREJ GISBRECHT, BASSAM MOKBEL, FRANK-MICHAEL SCHLEIF, XIBIN ZHU, BARBARA HAMMER. 2012.

LINEAR TIME RELATIONAL PROTOTYPE BASED LEARNING. International Journal of Neural Systems 22, 1250021.[CrossRef]

38. MOHAMAD GHASSANY, NISTOR GROZAVU, YOUNES BENNANI. 2012. COLLABORATIVE CLUSTERING USINGPROTOTYPE-BASED TECHNIQUES. International Journal of Computational Intelligence and Applications 11, 1250017.[CrossRef]

39. Gonzalo A. Ruz, Duc Truong Pham. 2012. NBSOM: The naive Bayes self-organizing map. Neural Computing and Applications21, 1319-1330. [CrossRef]

40. C. Jayne, A. Lanitis, C. Christodoulou. 2012. One-to-many neural network mapping techniques for face image synthesis. ExpertSystems with Applications 39, 9778-9787. [CrossRef]

41. Xibin Zhu, Andrej Gisbrecht, Frank-Michael Schleif, Barbara Hammer. 2012. Approximation techniques for clusteringdissimilarity data. Neurocomputing 90, 72-84. [CrossRef]

42. Hiromasa Kaneko, Kimito Funatsu. 2012. Visualization of Models Predicting Transmembrane Pressure Jump for MembraneBioreactor. Industrial & Engineering Chemistry Research 51, 9679-9686. [CrossRef]

43. Junbin Gao, Qinfeng Shi, Tibério S. Caetano. 2012. Dimensionality reduction via compressive sensing. Pattern Recognition Letters33, 1163-1170. [CrossRef]

44. Abdenour Hadid, Matti Pietikäinen. 2012. Demographic classification from face videos using manifold learning. Neurocomputing. [CrossRef]

45. N. Kireeva, I. I. Baskin, H. A. Gaspar, D. Horvath, G. Marcou, A. Varnek. 2012. Generative Topographic Mapping (GTM):Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison. Molecular Informatics n/a-n/a.[CrossRef]

46. Kerstin Bunte, Michael Biehl, Barbara Hammer. 2012. A General Framework for Dimensionality-Reducing Data VisualizationMapping. Neural Computation 24:3, 771-804. [Abstract] [Full Text] [PDF] [PDF Plus]

47. Kadim Taşdemir. 2012. Vector quantization based approximate spectral clustering of large datasets. Pattern Recognition . [CrossRef]48. Seung-Hee Bae, Judy Qiu, Geoffrey Fox. 2012. Adaptive Interpolation of Multidimensional Scaling. Procedia Computer Science

9, 393-402. [CrossRef]

Page 23: GTM: The Generative Topographic Mapping
