semi-supervised geodesic generative topographic mapping

Pattern Recognition Letters 31 (2010) 202–209

Contents lists available at ScienceDirect

Pattern Recognition Letters

journal homepage: www.elsevier .com/locate /patrec

Semi-supervised geodesic Generative Topographic Mapping

Raúl Cruz-Barbosa a,b, Alfredo Vellido a,*

a Technical University of Catalonia, 08034 Barcelona, Spainb Technological University of the Mixteca, 69000 Huajuapan, Oaxaca, Mexico

a r t i c l e i n f o a b s t r a c t

Article history:Received 1 December 2008Received in revised form 17 September2009Available online 29 September 2009

Communicated by W. Pedrycz

Keywords:Semi-supervised learningGeodesic distanceGenerative Topographic MappingLabel propagation

0167-8655/$ - see front matter � 2009 Elsevier B.V. Adoi:10.1016/j.patrec.2009.09.029

* Corresponding author. Tel.: +34 93 4137796/4014017014.

E-mail addresses: [email protected] (R. Cruz-Ba(A. Vellido).

We present a novel semi-supervised model, SS-Geo-GTM, which stems from a geodesic distance-basedextension of Generative Topographic Mapping that prioritizes neighbourhood relationships along a gen-erated manifold embedded in the observed data space. With this, it improves the trustworthiness and thecontinuity of the low-dimensional representations it provides, while behaving robustly in the presence ofnoise. In SS-Geo-GTM, the model prototypes are linked by the nearest neighbour to the data manifoldconstructed by Geo-GTM. The resulting proximity graph is used as the basis for a class label propagationalgorithm. The performance of SS-Geo-GTM is experimentally assessed, comparing positively with that ofan Euclidean distance-based counterpart and with those of alternative manifold learning methods.

� 2009 Elsevier B.V. All rights reserved.

1. Introduction

In many of the databases generated in some of the currentlymost active areas of research such as, for instance, bioinformaticsor web mining, class labels are either completely or partiallyunavailable. The first case scenario is that of unsupervised learning,where the most common task to be performed is that of data clus-tering, which aims to discover the group structure of multivariatedata (Jain and Dubes, 1998). The second case is less frequently con-sidered despite the fact that, quite often, only a reduced number ofclass labels is readily available and even that can be difficult and/orexpensive to obtain. This becomes a task at the interface betweensupervised and unsupervised models: semi-supervised learning(SSL, Chapelle et al., 2006).

One way to categorize the SSL methods (Chapelle et al., 2006),even if not the only one, is by dividing them into generative models,low-density separation methods, and graph-based techniques. Here,we are specifically interested in graph-based methods based on gen-erative models. In graph-based methods, the nodes of a graph cometo represent the observed data points, while its edges are assignedthe pairwise distances between the incident nodes. The way the dis-tance between two data points is computed can be seen as anapproximation of the geodesic distance between the two points withrespect to the overall data manifold (Belkin and Niyogi, 2004).

ll rights reserved.

5863; fax: +34 93 4137833/

rbosa), [email protected]

Zhu and Ghahramani (2002) introduced a label propagation (LP)algorithm for SSL that works under the assumption that close datapoints tend to have similar class labels. Here, the label of a node(label vector) propagates to neighbouring nodes according to theirproximity in a fully connected graph (formed by the input samples,labeled and unlabeled). Thus, labels are propagated through denseunlabeled data regions. An alternative method, presented in (Bel-kin and Niyogi, 2003), assumes that the data lie on a manifold ina high dimensional space. The learning of the underlying manifoldsis accomplished using all the available input samples. A proximitygraph is then constructed, using node’s adjacencies, as a model forthe manifold. The proposed graph Laplacian approximates the La-place–Beltrami operator, which can be thought of as an operatoron functions defined on nodes of the proximity graph. Recently,in (Herrmann and Ultsch, 2007), a two-stage SSL method wasproposed. In the first stage, data points are clustered using theEmergent Self-Organizing Map (ESOM, Ultsch, 2003). Then, ESOMis considered as a proximity graph and a modified LP is carriedout in the second stage.

In this paper, we present a semi-supervised approach, inspiredby that proposed in (Herrmann and Ultsch, 2007). It is based onGeo-GTM (Cruz-Barbosa and Vellido, 2008c), which is an extensionof the statistically principled Generative Topographic Mapping(GTM, Bishop et al., 1998). Geo-GTM prioritizes neighbourhoodrelationships along a generated manifold embedded in theobserved data space. This model has been shown to improve boththe trustworthiness and the continuity of the low-dimensionaldata representations, and also to behave robustly in the presenceof noise (Cruz-Barbosa and Vellido, 2008a,). In our proposal, the

http://dx.doi.org/10.1016/j.patrec.2009.09.029

mailto:[email protected]

mailto:[email protected]

http://www.sciencedirect.com/science/journal/01678655

http://www.elsevier.com/locate/patrec

R. Cruz-Barbosa, A. Vellido / Pattern Recognition Letters 31 (2010) 202–209 203

prototypes are inserted and linked by the nearest neighbour to thedata manifold constructed by Geo-GTM. The resulting graph is con-sidered as a proximity graph for which an ad hoc version of LP isdefined. This becomes semi-supervised Geo-GTM (SS-Geo-GTM),a model that uses the information derived from Geo-GTM trainingto accomplish the semi-supervised task. Following the same meth-odology, we also develop in this paper a semi-supervised versionfor the standard GTM (SS-GTM) and compare its performance withthat of SS-Geo-GTM. The performance of SS-Geo-GTM is furthercompared with those of two alternative semi-supervised manifoldtechniques.

2. Manifolds and geodesic distances

Manifold learning methods work on the assumption that multi-variate data can be faithfully represented by lower-dimensionalmanifolds embedded in the data space. Methods such as ISOMAP(Tenenbaum et al., 2000) and Curvilinear Distance Analysis (Leeet al., 2002), for instance, use the geodesic distance as a basis forgenerating the data manifold. ISOMAP, in fact, can be seen as an in-stance of Multi-Dimensional Scaling (MDS) in which the Euclideandistance is replaced by the geodesic one. This metric measuressimilarity along the embedded manifold, instead of doing itthrough the embedding space. In doing so, it may help to avoidsome of the distortions (such as breaches of topology preservation)that the use of a standard metric such as the Euclidean distancemay introduce when learning the manifold, due to its excessivefolding (that is, undesired manifold curvature effects).

The otherwise computationally intractable geodesic metric canbe approximated by graph distances (Bernstein et al., 2000), so thatinstead of finding the minimum arc-length between two dataitems on a manifold, we find the length of the shortest path be-tween them, where such path is built by connecting the closestsuccessive data items. Here, this is accomplished using the K-rule;there are alternative approaches (Lee and Verleysen, 2007) buttheir study is beyond the scope of this brief paper. A weightedgraph is then constructed by using the data (vertices) and the setof allowed connections (edges). If the resulting graph is discon-nected, some edges are added using a minimum spanning tree pro-cedure in order to connect it. Finally, the distance matrix of theweighted undirected graph is obtained by repeatedly applyingDijkstra’s algorithm (Dijkstra, 1959), which computes the shortestpath between all data items.

3. GTM and Geo-GTM

The standard GTM (Bishop et al., 1998) is a generative latentvariable model of the statistical machine learning family aimedto provide nonlinear dimensionality reduction, usually for datavisualization. Unlike many projection methods that directly createa low-dimensional representation from the observed data space,the GTM is defined as a mapping from the low-dimensional latentspace used for visualization onto the observed data space. Suchmapping is carried through by a set of basis functions generatinga constrained mixture density distribution and is defined as a gen-eralized linear regression model:

y ¼ UW; ð1Þ

where U is a M � K matrix built with the images of M points um inlatent space under K basis functions, which, in the original GTM for-mulation, for continuous data of dimension D, were chosen to bespherically symmetric Gaussians; W is a matrix of adaptive weightswkd. Only M points, arranged as a regular grid, are sampled from thelatent space to avoid computational intractability. Each of them,which can be considered as the representative of a data cluster,

has a fixed prior probability p(um) = 1/M and is mapped to a dataprototype ym using (1). These data prototypes define the low-dimen-sional manifold nonlinearly embedded in the data space. A proba-bility distribution for the multivariate data X ¼ fxngN

n¼1 and acorresponding log-likelihood:

LðW;bjXÞ ¼XN

n¼1

ln1M

XM

m¼1

b2p

� �D=2

exp �b=2kym � xnk2n o( )

ð2Þ

can then be defined, where b is the inverse of the noise variance,which accounts for the fact that data points might not strictly lieon the low dimensional embedded manifold generated by the GTM.

The EM algorithm is an straightforward alternative to obtain themaximum likelihood estimates of the adaptive parameters of themodel, which are the adaptive matrix of weights W and b. In theE-step of the EM algorithm, the mapping is inverted and theresponsibilities zmn (the posterior probability of cluster m member-ship for each data point xn) can be directly computed as

zmn ¼ pðumjxn;W ;bÞ ¼ pðxnjum;W;bÞpðumÞPm0pðxnjum0 ;W;bÞpðum0 Þ

; ð3Þ

where pðxnjum;W ;bÞ ¼Nðyðum;WÞ;bÞ.

3.1. Geo-GTM

The Geo-GTM model is an extension of GTM that favors the sim-ilarity of points along the learned manifold, while penalizing thesimilarity of points that are not contiguous in the manifold, evenif close in terms of the Euclidean distance. This is achieved by mod-ifying the standard calculation of the responsibilities in (3) in pro-portion to the discrepancy between the geodesic (approximated bythe graph) and the Euclidean distances. Following Archambeau andVerleysen (2005), such discrepancy is explicited through the defi-nition of the exponential distribution

Eðdg jde;aÞ ¼1a

exp �dgðxn; ymÞ � deðxn; ymÞa

� �; ð4Þ

where de(xn,ym) and dg(xn,ym) are, in turn, the Euclidean and graphdistances between data point xn and the GTM prototype ym. Respon-sibilities are redefined as:

zgeomn ¼ pðumjxn;W ;bÞ ¼ p0ðxnjum;W;bÞpðumÞP

m0p0ðxnjum0;W;bÞpðum0 Þ; ð5Þ

where p0ðxnjum;W ;bÞ ¼Nðyðum;WÞ;bÞEðdgðxn; ymÞ2jdeðxn; ymÞ

2;1Þ.When there is no agreement between the graph approximation

of the geodesic distance and the Euclidean distance, the value ofthe numerator of the fraction within the exponential in (4) in-creases, pushing the exponential and, as a result, the modifiedresponsibility towards smaller values (i.e., punishing the discrep-ancy between metrics). Once the responsibility is calculated inthe modified E-step, the rest of the model’s parameters are esti-mated following the standard EM procedure.

The main advantage of GTM and its extensions over generalfinite mixture models consists in the fact that both data and re-sults can be intuitively visualized on a low dimensional repre-sentation space. Specifically, it is the mapping defined by Eq.(1) and the possibility to invert it, defined by the responsibilitiesin Eq. (5), what provides Geo-GTM with the data visualizationcapabilities that the alternative Manifold Finite Gaussian Mix-tures model proposed in (Archambeau and Verleysen, 2005)lacks. Given that the posterior probability of every Geo-GTMcluster representative for being the generator of each data point,or responsibility zgeo

mn , is calculated as part of the modified EMalgorithm, data points can be visualized, if required, as the meanof the estimated posterior distribution umean

n ¼PM

m¼1umzgeomn , or in

204 R. Cruz-Barbosa, A. Vellido / Pattern Recognition Letters 31 (2010) 202–209

the form of attributions to the latent space locations bearingmaximum responsibility: umaxresp

n ¼ arg maxum zgeomn .

The faithfulness of the mappings generated by GTM and Geo-GTM was evaluated and compared in (Cruz-Barbosa and Vellido,2008a,), using trustworthiness and continuity measures, as definedin (Venna and Kaski, 2001). Geo-GTM was shown to generate themost faithful representations in terms of these measures. Dataneighbouring relations that are not preserved in the pass fromthe observed data space to the low-dimensional representationhamper the continuity of the latter, while spurious neighbouringrelations in the low-dimensional representation that do not havea correspondence in the observed space limit its trustworthiness.

Geo-GTM was also shown in (Cruz-Barbosa and Vellido, 2008a)to behave more robustly than the standard GTM in the presence ofuninformative noise. This was evaluated using the test log-likeli-hood as a comparing criterion.

4. Semi-supervised Geo-GTM

If only unlabeled data were available and our analyses only con-cerned data clustering, the previously described Geo-GTM wouldsuffice. In many real situations, though, we may well count withonly a limited amount of labeled cases. If this is the case, and weare also interested in classification, the problem can be addressedas a semi-supervised one with the goal of inferring the unavailableclass labels using the information provided by the few availableones as well as by the cluster structure defined by Geo-GTM. Thelatter is contained in the prototypes ym, the responsibilities zgeo

mn de-fined in Eq. (5), and the data manifold obtained for computing thegraph distance.

The basic idea underlying the proposed semi-supervised ap-proach is that neighbouring points are most likely to share their la-bel and that these labels are best propagated through neighbouringnodes according to proximity. Assuming that the Geo-GTM proto-types and the corresponding constructed data manifold can beseen as a proximity graph, we modify an existing label propagation(LP) algorithm (Zhu and Ghahramani, 2002) to account for theinformation provided by the trained Geo-GTM. The result is theproposed semi-supervised Geo-GTM (SS-Geo-GTM, for short).

A label vector Lm 2 [0,1]k is first associated to each Geo-GTMprototype ym. These label vectors can be considered as nodes in aproximity graph. The weights of the edges are derived from thegraph distances dg between prototypes. For this, the prototypesare inserted and linked to the graph through the nearest data point.Only non-empty clusters (corresponding to prototypes with atleast one data point assigned to them) are retained. The edgeweight between nodes m and m0 is calculated as

wmm0 ¼ expð�d2gðm;m0Þ=r2Þ: ð6Þ

4.1. Choice of r parameter

The r parameter in Eq. (6) defines the level of sparseness inthe graph for label information. As stated in (Zhu and Ghahra-mani, 2002), an essential problem in LP for semi-supervisedlearning is finding an adequate value for this parameter r. It isknown that for r ?1, all unclassified data cases are assignedthe same label vector because of label vectors shrinking to a sin-gle point (for large r values, unlabeled cases tend to have simi-lar class probabilities, therefore receiving the same influencefrom all labeled cases). On the other hand, when r ? 0 the per-formance of LP is similar to that of a 1-nearest neighbour classi-fier. Therefore, a suitable value for the parameter should liebetween these two extremes.

Here, we propose an ad hoc criterion that consists on assigningr the value of what we call the main reference inter-prototype(MRIP) distance. It is based on one of the most interesting charac-teristics of GTM-based models: due to their probabilistic definition,the posterior probability of cluster m membership for each datapoint xn can be explicitly calculated (Eqs. (3) and (5) in this paper).As a result, we can quantify how much posterior probability, orresponsibility, is born by each cluster (represented by a prototype)for the complete dataset. This quantitative measure is the Cumula-tive Responsibility (CR), which is the sum of responsibilities overall data items in X, for each cluster m, and is calculated as:

CRm ¼XN

n¼1

zgeomn ð7Þ

Note that this definition implies that those prototypes ym towhich highest CR is assigned could be considered as the mostrepresentative for a given data set. The concept of CR has beenpreviously used for a different purpose in (Cruz-Barbosa and Vel-lido, 2007), where a two-stage clustering procedure was defined.A GTM-based clustering was performed in the first stage, and thenon-contiguous prototypes with highest CR were then selectedas seeds for a subsequent agglomerative clustering procedure.

We propose the MRIP to be chosen as the graph distancedg(ym1, ym2) between the two non-contiguous prototypes ym1, ym2

of highest CR. The use of the graph distance assures the minimal in-ter-prototype path. This choice means that the radius of influencefor the LP procedure described in the next subsection is set to bedetermined by the two best representatives of all data. In turn, itmeans that in the calculation of the edge weights between nodesm and m

0described by Eq. (6), the influence exerted by the distance

between nodes will be counterbalanced by the use, in the denom-inator of the quotient, of the distance between the prototypes bestrepresenting the data. It is hypothesized here that this is an ade-quate choice for r.

4.2. Label propagation

Following Herrmann and Ultsch (2007), the available labelinformation of xn 2 X with class attribution c(xn) = Ct 2 {C1, . . . ,Ck}will be used to fix the label vectors of the prototypes to which theyare assigned (xn is assigned to ym through um ¼ arg maxui

zgeoin ), so

that Lm, j = 1 if j = t, and Lm, j = 0 otherwise. Unlabeled prototypeswill then update their label by propagation, according to

Lnewm ¼

Xm0

wmm0Lm0=X

m0wmm0 ; ð8Þ

until no further changes occur in the label updating. Subsequently,unlabeled data items are labeled by assignment to the class more rep-resented on the label vector of the prototype ym bearing the highestresponsibility for them, according to cðxnÞ ¼ arg maxCj2fC1 ;...;CkgLm;j.The same methodology is used to build a semi-supervised versionof a standard GTM model (SS-GTM).

For illustration, the process of computing the graph distancesbetween prototypes is shown in Fig. 1 (right), using the Dalí setdescribed in Section 5.1.

4.3. The SS-Geo-GTM algorithm in a nutshell

For the sake of completeness, we now provide further details ofthe proposed SS-Geo-GTM algorithm. Given an initial modelling ofthe data by Geo-GTM, the general settings for LP are assumed: boththe class label availability of a dataset X (l labeled and u unlabeleddata points) and the number of classes C are supposed to beknown, and all classes are present in the labeled data in some pro-portion. The algorithm proceeds according to the following steps:

−10−505100

1020

−5

0

5

10

15

20

25

−10−50510−10

010

20−5

0

5

10

15

20

25

Fig. 1. (Left): The artificial 3-D Dalí dataset, where the two contiguous fragments are assumed to correspond to different classes, identified with different symbols. (Right):Results of the Geo-GTM modeling of Dalí. The prototypes are represented by ‘�’ symbols (only the non-empty prototypes are preserved and linked to the graph through thenearest data point). The graph constructed using 4-nearest neighbours is represented by lines connecting the data points, which are, in turn, represented by ‘�’ symbols.


� Pre-processing stage– Create a connected graph by inserting and linking the M pro-

totypes to the nearest neighbour of the data manifold con-structed by Geo-GTM. Here, the nodes are all prototypes.

– Compute the graph distance among prototypes using theconstructed graph in step 1.

– Compute the weights wij of the edges between nodes i, j as inEq. (6), where r is obtained as shown in Section 4.1.

– Compute a M �M transition matrix T as Tij ¼wijPlþu

k¼1wkj

, where

Tij is the probability of propagation from node j to node i.– Define a (l + u) � C label matrix L, whose ith row represents

the label probability distribution of data point xi.– Define a M � C prototypes label matrix L0, whose ith row rep-

resents the label probability distribution of node (prototype)yi. Here, the available label information of xn 2 X (given by L)with class attribution c(xn) = Ct 2 {C1,. . .,Ck} is used to fix thelabel vectors of the prototypes to which they are assigned(xn is assigned to ym through um ¼ arg maxui

zgeoin ), so that

L0m;j ¼ 1 if j = t, and L0m;j ¼ 0 otherwise. The initialization ofunlabeled nodes is not relevant.

� LP algorithm– Propagate L0 TL0, as in Eq. (8).– Row-normalize L0 as L0ij ¼ L0ij=

PkL0ik.

– Clamp the labeled data. Repeat from step 1 until L0 converges.

Finally, unlabeled data points in L are labeled by assignment tothe class more represented on the label vector of the prototype ym

bearing the highest responsibility for them, according tocðxnÞ ¼ arg maxCj2fC1 ;...;CkgL

0m;j.

5. Experiments

5.1. Experimental design and settings

Geo-GTM, SS-Geo-GTM, and SS-GTM were implemented in MAT-LAB�. For the experiments reported next, the adaptive matrix W wasinitialized, following a procedure described in (Bishop et al., 1998),as to minimize the difference between the prototype vectors ym

and the vectors that would be generated in data space by a partialPrincipal Component Analysis (PCA). The inverse variance b was ini-tialised to be the inverse of the 3rd PCA eigenvalue. This initializationprocedure ensures the replicability of the results. The latent grid wasfixed to a square layout of approximately (N/2)1/2 � (N/2)1/2, whereN is the number of points in the dataset. The corresponding grid of

basis functions was equally fixed to a 5 � 5 square layout for alldatasets.

The performance of these GTM-based models is to be comparedwith that of two alternative techniques, namely Laplacian Eigenm-aps (LapEM: Belkin and Niyogi, 2003) and semi-supervised SVM forManifold Learning (SS-SVMan: Wu et al., 2006), which were alsoimplemented in MATLAB� for this study.

Laplacian Eigenmaps are a manifold-based technique. As amodel for the manifold, an adjacency graph is constructed usingthe input data points as nodes. Edge weights between nodes i, jcan be derived from the distances between the correspondingnodes or simply by taking wi, j = 1 if data points xi and xj are con-nected, and wij = 0 otherwise. Then, in order to exploit the struc-ture of the model, the graph Laplacian L is obtained for theadjacency graph. L is a symmetric, positive semidefinite matrixwhich can be thought of as an operator (Laplace–Beltrami opera-tor) on functions defined on vertices of the graph. The classifieris constructed using its eigenfunctions, which provide a natural ba-sis for functions on the manifold. In other words, only input datapoints (labeled and unlabeled) information is needed to recoverthe manifold. Then, the labeled data are used to develop a classifierdefined on this manifold.

Semi-supervised SVM for Manifold Learning follows a related butwell differentiated approach: manifold learning is again adapted tothe semi-supervised setting but, this time, the objective function ismodified to accommodate manifold consistency and the hinge lossof class prediction (an approximation to misclassification error).The result is an SVM-like process. There are three parameters in-volved in the choice of the specific SS-SVMan model: C, c, and q.The last one is, in fact, a coefficient that guarantees the invertibilityof an expression leading to the obtention of the objective function. Inour experiments, even a small variation in the 4th decimal positionof this parameter led to extreme variations in the performance of themodel as measured by classification accuracy. The other two param-eters, typical of an SVM form, lead to another problem: If C and c arechosen to minimize the objective function of the model, the classifi-cation accuracy results are very sub-optimal. This means that thevalues of these parameters which maximize the resulting accuracymust be chosen in an approximate manner, by trial-and-error.

Three datasets were selected for the reported experiments:

� The first one is the artificial 3-D Dalí set (inspired by one ofthe common patterns in Salvador Dalí ’s artworks), as shownin Fig. 1(left). It consists of two groups of 300 data points eachthat are images of the functions x1 = (tcos(t), t2, tsin(t)) andx2 = (tcos(t), t2, �tsin(t) + 20), where t and t2 follow Uðp;3pÞand Uð0;10Þ, respectively.

Table 1Classification accuracy as an average percentage over one hundred runs (with its corresponding standard deviation). The statistical significance (calculated through a one-wayANOVA test) of the differences between SS-Geo-GTM and each of the rest of models is indicated with ‘*’ if p < 0.01 and with ‘**’ if p < 0.05.

Data set SS-Geo-GTM (% ± std) SS-GTM (% ± std) LapEM (% ± std) SS-SVMan (% ± std)

Dalí 99.54 ± 2.24 90.71 ± 7.99* 54.57 ± 3.13* 100 ± 0Iris 88.71 ± 7.88 85.74 ± 8.72** 50.39 ± 3.37* 80.95 ± 9.85*

Oil-Flow 77.43 ± 8.31 36.74 ± 3.29* 63.50 ± 12.08* 56.02 ± 12.60*

Table 2Average classification accuracy (as a percentage) and its standard deviation over one hundred runs for different values of r parameter in the SS-Geo-GTM setting.

Dalí Iris Oil-flow

r < MRIP % ± std r < MRIP % ± std r < MRIP % ± std

5.0 98.06 ± 3.73 0.05 85.72 ± 8.93 0.10 74.74 ± 8.6310.0 98.46 ± 4.69 0.10 87.24 ± 8.97 0.20 75.03 ± 9.0815.0 99.19 ± 2.44 0.12 87.37 ± 7.46 0.25 75.24 ± 9.2620.0 99.37 ± 2.22 0.14 86.94 ± 9.73 0.30 74.38 ± 10.1025.0 99.48 ± 2.13 0.15 88.20 ± 8.14 0.35 75.74 ± 8.98

MRIP = 31.36 99.54 ± 2.24 MRIP=0.21 88.71 ± 7.88 MRIP = 0.43 77.43 ± 8.31

r > MRIP r > MRIP r > MRIP

35.0 98.54 ± 3.96 0.30 88.30 ± 7.46 0.50 75.97 ± 8.5140.0 98.43 ± 4.54 0.40 88.69 ± 8.93 0.55 74.71 ± 8.5645.0 97.95 ± 4.77 1.0 88.64 ± 7.63 0.60 74.70 ± 8.8050.0 96.84 ± 6.55 3.0 88.59 ± 5.32 0.65 73.98 ± 8.7755.0 95.35 ± 8.01 4.0 83.03 ± 7.29 0.75 72.08 ± 9.88


� The second set is the well-known Iris data, available from theUCI repository (Asuncion and Newman, 2007), which consistsof 150 four-dimensional items representing several measure-ments of Iris flowers, which belong to three different classes.

� The third is the more complex Oil-Flow set, also availableonline,1 which simulates measurements in an oil pipe corre-sponding to three possible configurations (classes). It consistsof 1000 items described by 12 attributes.

The central goal of the experiments is the comparison of theperformances of all models in terms of classification accuracy.We hypothesize that, at least, SS-GTM will yield lower rates of clas-sification accuracy in the semi-supervised task than its geodesicdistance-based counterpart, especially for datasets of convolutedgeometry such as Dalí and Oil-Flow.

For the GTM-based models, we first assume that the choice ofthe MRIP, described in Section 4.1, as a value for r is appropriate.In this setting, we evaluate all models in the most extreme semi-supervised setting, that is, when the class label is only availablefor a single input sample for each class and the remaining samplesare unlabeled.

We then proceed to evaluate the performance of SS-Geo-GTM inthis same setting for a range of different values of r, both higherand lower than the MRIP. With this, it should be possible to assessthe adequacy of the MRIP choice.

In the next step of experimentation, the label availabilitycondition is relaxed and all models are evaluated in the pres-ence of higher ratios of labels. Finally, we aim to gauge andcompare the robustness of the methods in the presence of noise.In previous research (Cruz-Barbosa and Vellido, 2008a), the Geo-GTM model was shown to behave better in this respect than thestandard GTM model (with the Euclidean metric) as measuredby the test log-likelihood. In the semi-supervised extension de-fined in this paper, the performance criterion is the classificationaccuracy.

1 http://research.microsoft.com/~cmbishop/PRML/webdatasets/datasets.htm.

5.2. Results and discussion

As outlined in the previous subsection, all models were firstevaluated (average accuracy over one hundred runs) in the mostextreme semi-supervised setting: when the class label is availablefor only one input item for each class, while the rest is unlabeled.The corresponding results are shown in Table 1. SS-Geo-GTM sig-nificantly outperforms SS-GTM and LapEM for all data sets and,most notoriously, for the data sets of more convoluted geometry.The differences with SS-GTM are less notorious for the less convo-luted Iris data set. LapEM yields a very poor behaviour in this set-ting. SS-Geo-GTM also performs significantly better than SS-SVMan with the exception of the linearly separable Dalí dataset,for which results are similar.

The previous results were obtained by setting the value ofr = MRIP. We then proceed to evaluate the performance of SS-Geo-GTM in this same setting for a range of different values ofr, both higher and lower than the MRIP, in order to assess theadequacy of the proposed MRIP choice. We explore the intervalr 2 [MRIP � �, MRIP + �], where � > 0, and measure the perfor-mance of SS-Geo-GTM over a hundred runs. These results are re-ported in Table 2. The models with r = MRIP yield the bestresults in the range of selected r values, which confirms thatthe MRIP value is at least near the optimum value for r.Consequently, from here on MRIP will be used as the default va-lue for r.

The proposed SS-Geo-GTM model has been shown to performwell and significantly better than SS-GTM, LapEM and, in mostcases, SS-SVMan in the most extreme semi-supervised setting.The question remains: will this difference of performance be thesame when the label availability condition is relaxed? To answerthis question, the ratio of randomly selected labeled data is in-creased from a single one to a 1%, and from there, up to a 10%.The experiment is again carried out a hundred times for each data-set. The corresponding results are shown in Table 3.

SS-Geo-GTM clearly (and again significantly according to anANOVA test) outperforms SS-GTM for Dalí and Oil-Flow and, asexpected, the performance monotonically improves with the

http://research.microsoft.com/~cmbishop/PRML/webdatasets/datasets.htm

Table 3Average classification accuracy and its std. deviation over 100 runs, for all models. A randomly increasing percentage of pre-labeled items per class was chosen in each run. The ‘q’symbol means that the experiment was not carried out because the corresponding percentage of available labels was less than or equal to one label per class. A super-index ‘*’indicates that the differences between the corresponding model and SS-Geo-GTM were significant at p < 0.01 in the ANOVA test for all percentages of class labels. A super-index‘**’ indicates that no differences were significant.

% of avail. labels Classification accuracy (% ± std)

Dalí Iris

SS-Geo SS-GTM* LapEM* SS-SVMan** SS-Geo SS-GTM** LapEM* SS-SVMan**

1 100 ± 0 93.43 ± 5.46 64.91 ± 4.52 100 ± 0 q q q q

2 100 ± 0 96.96 ± 3.41 76.00 ± 5.88 100 ± 0 q q q q

3 100 ± 0 97.74 ± 2.05 79.65 ± 9.29 100 ± 0 q q q q

4 100 ± 0 98.29 ± 1.80 75.24 ± 10.56 100 ± 0 90.00 ± 8.11 89.46 ± 5.24 58.10 ± 4.01 88.91 ± 5.775 100 ± 0 98.61 ± 1.32 88.72 ± 8.05 100 ± 0 89.96 ± 6.98 89.18 ± 6.48 57.01 ± 4.57 90.33 ± 5.606 100 ± 0 98.66 ± 1.64 95.01 ± 4.95 100 ± 0 91.30 ± 7.37 91.66 ± 3.02 63.68 ± 4.48 91.90 ± 3.607 100 ± 0 98.98 ± 0.80 97.68 ± 3.16 100 ± 0 90.74 ± 7.62 90.94 ± 3.03 64.22 ± 4.86 91.44 ± 4.548 100 ± 0 99.19 ± 0.82 98.64 ± 2.13 100 ± 0 91.91 ± 5.31 91.90 ± 3.03 69.84 ± 5.26 93.07 ± 3.169 100 ± 0 99.30 ± 0.70 98.88 ± 1.87 100 ± 0 92.35 ± 4.90 91.88 ± 2.48 70.19 ± 4.97 93.70 ± 3.77

10 100 ± 0 99.24 ± 0.73 99.39 ± 1.39 100 ± 0 93.19 ± 4.36 92.32 ± 2.42 74.87 ± 5.92 94.57 ± 2.55

% of avail. labels Oil-Flow (% ± std)

SS-Geo SS-GTM* LapEM* SS-SVMan

1 83.93 ± 5.60 39.96 ± 3.44 76.43 ± 7.55 79.58 ± 5.77*

2 90.08 ± 3.49 55.88 ± 10.95 83.36 ± 5.48 86.45 ± 4.29*

3 91.79 ± 3.07 64.71 ± 7.95 87.56 ± 4.42 89.23 ± 3.99*

4 94.28 ± 2.60 70.69 ± 6.06 89.71 ± 3.51 92.62 ± 3.78*

5 95.14 ± 2.20 74.11 ± 5.05 91.63 ± 3.25 93.57 ± 3.49*

6 95.97 ± 2.01 76.51 ± 4.30 92.63 ± 2.76 95.26 ± 2.41**

7 96.43 ± 1.81 79.10 ± 4.24 93.77 ± 2.36 95.89 ± 2.32**

8 96.65 ± 1.53 80.88 ± 4.27 94.41 ± 2.11 96.66 ± 2.20**

9 97.11 ± 1.66 82.19 ± 3.43 95.18 ± 2.07 97.14 ± 1.77**

10 97.53 ± 1.22 83.91 ± 3.58 95.58 ± 1.53 7.75 ± 1.34**


increasing percentage of labels. The differences for the latter set,more complex and high-dimensional, are striking. Also, SS-Geo-GTM significantly outperforms LapEM for all data sets. For Dalí,SS-Geo-GTM achieves a 100% accuracy even with a 1% of labeleddata, while SS-GTM and LapEM do not reach that average accuracy

−10−50510−1001020

−5

0

5

10

15

20

25

−50510−5

0

5

10

15

20

25

−0.5 0 0.5 1 1.5 2−0.5

0

0.5

1

1.5

2

2.5

Variable #5

Var

iabl

e #9

−0.5 0 0.5−0.5

0

0.5

1

1.5

2

2.5

Variable

Fig. 2. Noisy variations of some of the data used in the experiments, provided for illustrfrom top-left to top-right, noise of standard deviations r = 0.1, r = 0.5, and r = 2.0. For Obottom-right, noise of standard deviations r = 0.01, r = 0.05, and r = 0.2.

even with a 10%. The Iris data set benefits less of the addition ofclass labels and the performances of SS-Geo-GTM and SS-GTMmodels are comparable. This confirms that the use of the geodesicmetric is likely to improve the results mainly for data sets of con-voluted underlying geometry. The performance of SS-SVMan is

−10

−1001020 −10−50510

−100

1020

−5

0

5

10

15

20

25

1 1.5 2 #5

−0.5 0 0.5 1 1.5 2−0.5

0

0.5

1

1.5

2

2.5

Variable #5

ation. The noise scale magnitude is in correspondence with the data scale. For Dalí,il-Flow, we provide three views of variable 5 versus variable 9: From bottom-left to

Table 4Average classification accuracy and its standard deviation over 100 runs, for all models in the presence of increasing levels of uninformative noise. An increasing percentage ofpre-labeled items per class was randomly chosen in each run. Bold and italic lettering is used to distinguish between the results of the models and ease their interpretation.

Dataset Noise level Model Percent of available labels

2 4 6 8 10

Dalí 0.1 SS-Geo 100 ± 0 100 ± 0 100 ± 0 100 ± 0 100 ± 0SS-GTM 96.29 ± 3.37 98.15 ± 1.97 99.09 ± 1.0 99.31 ± 0.99 99.28 ± 0.89LapEM 75.48 ± 6.56 75.73 ± 10.38 94.48 ± 4.66 98.07 ± 2.02 98.50 ± 1.96SS-SVMan 100 ± 0 100 ± 0 100 ± 0 100 ± 0 100 ± 0

0.3 SS-Geo 99.83 ± 1.11 100 ± 0 100 ± 0 100 ± 0 100 ± 0SS-GTM 95.57 ± 4.0 98.11 ± 1.45 98.56 ± 0.83 98.77 ± 0.75 98.88 ± 0.69LapEM 74.47 ± 5.27 75.11 ± 11.11 95.55 ± 4.82 99.03 ± 1.96 99.54 ± 1.12SS-SVMan 100 ± 0 100 ± 0 100 ± 0 100 ± 0 100 ± 0

0.5 SS-Geo 99.04 ± 3.16 100 ± 0 100 ± 0 100 ± 0 100 ± 0SS-GTM 96.52 ± 3.09 98.05 ± 2.16 98.99 ± 1.40 99.31 ± 1.06 99.39 ± 0.78LapEM 77.67 ± 6.79 76.56 ± 10.30 95.06 ± 4.53 97.49 ± 2.76 98.87 ± 1.61SS-SVMan 99.92 ± 0.14 99.98 ± 0.06 99.99 ± 0.03 100 ± 0 100 ± 0

1.0 SS-Geo 95.14 ± 5.52 97.75 ± 2.94 98.71 ± 1.98 99.23 ± 0.73 99.28 ± 0.92SS-GTM 96.12 ± 3.79 98.36 ± 1.53 98.66 ± 1.21 99.04 ± 0.45 99.06 ± 0.35LapEM 73.86 ± 6.07 70.73 ± 10.57 92.15 ± 5.34 97.23 ± 3.09 98.93 ± 1.39SS-SVMan 95.14 ± 3.71 98.01 ± 1.92 98.12 ± 1.86 98.35 ± 1.46 98.84 ± 0.83


Oil-Flow 0.01 SS-Geo 88.13 ± 4.05 93.87 ± 2.71 95.63 ± 2.24 96.87 ± 1.45 97.26 ± 1.18SS-GTM 55.54 ± 11.94 70.66 ± 5.84 77.14 ± 4.65 80.25 ± 3.58 84.15 ± 3.39LapEM 81.35 ± 5.67 88.17 ± 3.41 91.80 ± 2.67 93.20 ± 2.30 94.77 ± 1.70SS-SVMan 86.65 ± 5.01 92.11 ± 3.58 94.90 ± 2.95 96.51 ± 1.95 97.66 ± 1.41






identical to that of SS-Geo-GTM for the relatively easier Dalí data-set, whereas for both Iris and Oil-Flow, the performance of SS-Geo-GTM is slightly better for the lowest levels of label availability,whereas the slight advantage is for SS-SVMan when higher per-centages of labels are available. The only statistically significantdifferences are those in favor of SS-Geo-GTM with low levels ofclass availability for Oil-Flow.

It was shown in (Cruz-Barbosa and Vellido, 2008a) that Geo-GTM can recover the true underlying data structure far better thanthe standard GTM (as reflected in a lower test log-likelihood), evenin the presence of a considerable amount of noise in the data. Wenow extend these results to the semi-supervised setting to gaugeand compare the robustness of the analyzed methods in thepresence of noise in some illustrative experiments. For this,Gaussian noise of zero mean and increasing standard deviationwas added to a noise-free version of the Dalí set (added noise fromr = 0.1 to r = 2.0, partially illustrated by the top displays of Fig. 2)and to the most difficult dataset, Oil-Flow (added noise fromr = 0.01 to r = 0.2, partially illustrated by the bottom displays ofFig. 2). As in the previous experiment, we also analyzed the evolu-tion of the performance of these models as the percentage of avail-able labels for each dataset is increased from 2% to 10%.

These new results are shown in Table 4. In accordance to the re-sults presented in (Cruz-Barbosa and Vellido, 2008a), the geodesicvariant SS-Geo-GTM consistently outperforms SS-GTM across data

sets and noise levels. It also outperforms LapEM, with few excep-tions. It is worth noting that the results for LapEM only becomecomparable as the percentage of available labels increases. Therobustness of the semi-supervised procedure for SS-GTM is sur-prisingly good, though. For the more complex Oil-Flow set, bothmodels deteriorate significantly at high noise levels. Overall, theseresults indicate that the resilience of the models is mostly due tothe inclusion of the geodesic metric and not to the semi-supervisedprocedure itself. The results of the comparison between SS-Geo-GTM and SS-SVMan are somehow mixed. The former performsslightly better with the Dalí dataset for most noise and class la-bel-availability settings and with Oil-Flow at low levels of labelavailability. The latter behaves more robustly for the highest levelof noise with Oil-Flow.

6. Conclusion

Previous research has shown that the use of geodesic distances inthe training of manifold learning models improves the continuityand trustworthiness of the low dimensional representations theygenerate. Geo-GTM is one such model that provides faithful cluster-ing and visualization of datasets of convoluted geometry. A semi-supervised version of Geo-GTM, namely SS-Geo-GTM, has been de-fined in this paper. It makes use of the model prototypes as nodes


in a proximity graph, where the edges are obtained using graph dis-tances as approximation of the geodesic metric. From this setting, amodified class label propagation algorithm performs the semi-supervised task. Information obtained from the training of Geo-GTM is used to derive a criterion (MRIP) for the selection of the rparameter in the modified LP algorithm.

Through several experiments, the performance of SS-Geo-GTMhas been assessed and shown to be consistently better than thatof the semi-supervised version of the standard GTM trained usingthe Euclidean metric, even in the presence of high levels of noise.The SS-Geo-GTM has also clearly outperformed the alternative La-pEM model in most settings. Another alternative model, though,namely SS-SVMan, has matched the performance of SS-Geo-GTMthroughout the experimental settings. SS-Geo-GTM would be thechoice when low levels of class labels are available, whereas SS-SVMan would be preferred in problems with high levels of noise.For the rest of settings, the performance of these two models doesnot differ significantly. It must be noted, though, that the perfor-mance of SS-SVMan is extremely sensible to the choice of itsparameters, which must be carefully tuned in a heuristic manner.Future research is planned to extend these experiments to gaugethe effect of outliers in the performance of the proposed semi-supervised models.

Acknowledgements

Alfredo Vellido acknowledges funding from the Spanish CICyT re-search project TIN2006-08114. Raúl Cruz-Barbosa acknowledgesSEP-SESIC (PROMEP program) of México for his PhD grant.

References

Archambeau, C., Verleysen, M., 2005. Manifold constrained finite Gaussianmixtures. In: Cabestany, J., Prieto, A., Sandoval, D.F. (Eds.), Proc. IWANN, LNCS,vol. 3512. Springer-Verlag, pp. 820–828.

Asuncion, A., Newman, D., 2007. UCI Machine Learning Repository. University ofCalifornia, Irvine, School of Information and Computer Sciences. <http://www.ics.uci.edu/~mlearn/MLRepository.html>.

Belkin, M., Niyogi, P., 2002. Using manifold structure for partially labelledclassification. Advances in Neural Information Processing Systems (NIPS). Vol.15. MIT Press, pp. 929–936.

Belkin, M., Niyogi, P., 2004. Semi-supervised learning on Riemannian manifolds.Machine Learning 56, 209–239.

Bernstein, M., de Silva, V., Langford, J., Tenenbaum, J., 2000. Graph approximationsto geodesics on embedded manifolds. Tech. rep., Stanford University, CA.

Bishop, C.M., Svensén, M., Williams, C.K.I., 1998. The Generative TopographicMapping. Neural Comput. 10 (1), 215–234.

Chapelle, O., Schölkopf, B., Zien, A. (Eds.), 2006. Semi-Supervised Learning. The MITPress.

Cruz-Barbosa, R., Vellido, A., 2007. On the initialization of two-stage clustering withclass-GTM. In: Borrajo, D., Castillo, L., Corchado, J. (Eds.), Proc. 12th Conf. of theSpanish Association for Artificial Intelligence, CAEPIA+TTIA 2007, LNAI, vol.4788, pp. 50–59.

Cruz-Barbosa, R., Vellido, A., 2008. On the improvement of the mappingtrustworthiness and continuity of a manifold learning model. In: Proc. NinthInternat. Conf. on Intelligent Data Engineering and Automated Learning (IDEAL2008). LNCS, vol. 5326. Springer, pp. 266–273.

Cruz-Barbosa, R., Vellido, A., 2008a. Geodesic Generative Topographic Mapping. In:Proc. 11th Ibero-American Conf. on Artificial Intelligence (IBERAMIA 2008).LNAI, vol. 5290. Springer, pp. 113–122.

Cruz-Barbosa, R., Vellido, A., 2008c. Unfolding the manifold in GenerativeTopographic Mapping. In: Proc. Third Internat. Workshop on Hybrid ArtificialIntelligence Systems (HAIS 2008). LNAI, vol. 5271. Springer, pp. 392–399.

Dijkstra, E.W., 1959. A note on two problems in connection with graphs. Numer.Math. 1, 269–271.

Herrmann, L., Ultsch, A., 2007. Label propagation for semi-supervised learning inself-organizing maps. In: Proc. Sixth WSOM 2007.

Jain, A.K., Dubes, R.C., 1998. Algorithms for Clustering Data. Prentice Hall, NewJersey.

Lee, J., Verleysen, M., 2007. Nonlinear Dimensionality Reduction. Springer.Lee, J.A., Lendasse, A., Verleysen, M., 2002. Curvilinear distance analysis versus

isomap. In: Proc. European Symposium on Artificial Neural Networks (ESANN),pp. 185–192.

Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework fornonlinear dimensionality reduction. Science 290, 2319–2323.

Ultsch, A., 2003. Maps for the visualization of high-dimensional data spaces. In:Proc. WSOM 2003, pp. 225–230.

Venna, J., Kaski, S., 2001. Neighborhood preservation in nonlinear projectionmethods: An experimental study. In: Dorffner, G., Bischof, H., Hornik, K. (Eds.),Proc. WSOM. Springer, pp. 485–491.

Wu, Z., Li, C.H., Zhu, J., Huang, J., 2006. A semi-supervised SVM for manifoldlearning. In: Proc. of the 18th International Conference on Pattern Recognition(ICPR’06), vol. 2, pp. 490–493.

Zhu, X., Ghahramani, Z., 2002. Learning from labeled and unlabeled data with labelpropagation. Tech. rep., CMU-CALD-02-107, Carnegie Mellon University.

http://www.ics.uci.edu/~mlearn/MLRepository.html

http://www.ics.uci.edu/~mlearn/MLRepository.html

semi-supervised geodesic generative topographic mapping

Documents