
Learning a kernel matrix for nonlinear dimensionality reduction

Kilian Q. Weinberger [email protected]
Fei Sha [email protected]
Lawrence K. Saul [email protected]

Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

We investigate how to learn a kernel matrix for high dimensional data that lies on or near a low dimensional manifold. Noting that the kernel matrix implicitly maps the data into a nonlinear feature space, we show how to discover a mapping that “unfolds” the underlying manifold from which the data was sampled. The kernel matrix is constructed by maximizing the variance in feature space subject to local constraints that preserve the angles and distances between nearest neighbors. The main optimization involves an instance of semidefinite programming—a fundamentally different computation than previous algorithms for manifold learning, such as Isomap and locally linear embedding. The optimized kernels perform better than polynomial and Gaussian kernels for problems in manifold learning, but worse for problems in large margin classification. We explain these results in terms of the geometric properties of different kernels and comment on various interpretations of other manifold learning algorithms as kernel methods.

1. Introduction

Kernel methods (Schölkopf & Smola, 2002) have proven to be extremely powerful in many areas of machine learning. The so-called “kernel trick” is by now widely appreciated: a canonical algorithm (e.g., the linear perceptron, principal component analysis) is reformulated in terms of Gram matrices, then generalized to nonlinear problems by substituting a kernel function for the inner product. Well beyond this familiar recipe, however, the field continues to develop

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

as researchers devise novel types of kernels, exploiting prior knowledge in particular domains and insights from computational learning theory and convex optimization. Indeed, much work revolves around the simple question: “How to choose the kernel?” The answers are diverse, reflecting the tremendous variety of problems to which kernel methods have been applied. Kernels based on string matching (Lodhi et al., 2004) and weighted transducers (Cortes et al., 2003) have been proposed for problems in bioinformatics, text, and speech processing. Other specialized kernels have been constructed for problems in pattern recognition involving symmetries and invariances (Burges, 1999). Most recently, kernel matrices have been learned by semidefinite programming for large margin classification (Graepel, 2002; Lanckriet et al., 2004).

In this paper, we revisit the problem of nonlinear dimensionality reduction and its solution by kernel principal component analysis (PCA) (Schölkopf et al., 1998). Our specific interest lies in the application of kernel PCA to high dimensional data whose basic modes of variability are described by a low dimensional manifold. The goal of nonlinear dimensionality reduction in these applications is to discover the underlying manifold (Tenenbaum et al., 2000; Roweis & Saul, 2000). For problems of this nature, we show how to learn a kernel matrix whose implicit mapping into feature space “unfolds” the manifold from which the data was sampled. The main optimization of our algorithm involves an instance of semidefinite programming, but unlike earlier work in learning kernel matrices (Graepel, 2002; Lanckriet et al., 2004), the setting here is completely unsupervised.

The problem of manifold learning has recently attracted a great deal of attention (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2003; Saul & Roweis, 2003), and a number of authors (Bengio et al., 2004; Ham et al., 2004) have developed connections between manifold learning algorithms and kernel PCA. In contrast to previous work, however, our paper does not serve to reinterpret pre-existing algorithms such as Isomap and locally linear embedding as instances of kernel PCA. Instead, we propose a novel optimization (based on semidefinite programming) that bridges the literature on kernel methods and manifold learning in a rather different way. The algorithm we describe can be viewed from several complementary perspectives. This paper focuses mainly on its interpretation as a kernel method, while a companion paper (Weinberger & Saul, 2004) focuses on its application to the unsupervised learning of image manifolds.

2. Kernel PCA

Schölkopf, Smola, and Müller (1998) introduced kernel PCA as a nonlinear generalization of PCA (Jolliffe, 1986). The generalization is obtained by mapping the original inputs into a higher (and possibly infinite) dimensional feature space F before extracting the principal components. In particular, consider inputs x1, . . . , xN ∈ R^D and features Φ(x1), . . . , Φ(xN) ∈ F computed by some mapping Φ : R^D → F. Kernel PCA is based on the insight that the principal components in F can be computed for mappings Φ(x) that are only implicitly defined by specifying the inner product in feature space—that is, the kernel function K(x, y) = Φ(x) · Φ(y).

Kernel PCA can be used to obtain low dimensional representations of high dimensional inputs. For this, it suffices to compute the dominant eigenvectors of the kernel matrix Kij = Φ(xi) · Φ(xj). The kernel matrix can be expressed in terms of its eigenvalues λα and eigenvectors vα as K = Σα λα vα vαᵀ. Assuming the eigenvalues are sorted from largest to smallest, the d-dimensional embedding that best preserves inner products in feature space is obtained by mapping the input xi ∈ R^D to the vector yi = (√λ1 v1i, . . . , √λd vdi).

The main freedom in kernel PCA lies in choosing the kernel function K(x, y) or otherwise specifying the kernel matrix Kij. Some widely used kernels are the linear, polynomial and Gaussian kernels, given by:

K(x, y) = x · y,   (1)

K(x, y) = (1 + x · y)^p,   (2)

K(x, y) = e^(−|x−y|²/2σ²).   (3)

The linear kernel simply identifies the feature space with the input space. Implicitly, the polynomial kernel maps the inputs into a feature space of dimensionality O(D^p), while the Gaussian kernel maps the inputs onto the surface of an infinite-dimensional sphere.
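As a concrete illustration (not taken from the paper), the three kernels of eqs. (1)–(3) can be written in NumPy as follows; p and sigma denote the polynomial degree and Gaussian width.

    import numpy as np

    def linear_kernel(x, y):
        # Eq. (1): inner product in the input space.
        return np.dot(x, y)

    def polynomial_kernel(x, y, p=4):
        # Eq. (2): inhomogeneous polynomial kernel of degree p.
        return (1.0 + np.dot(x, y)) ** p

    def gaussian_kernel(x, y, sigma=1.0):
        # Eq. (3): Gaussian kernel with width sigma.
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))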

The dominant eigenvalues of the kernel matrix Kij measure the variance along the principal components in feature space, provided that the features are centered on the origin. The features can always be centered by subtracting out their mean—namely, by the transformation Φ(xi) ← Φ(xi) − (1/N) Σj Φ(xj). When the mapping Φ(x) is only implicitly specified by the kernel function, the “centering” transformation can be applied directly to the kernel matrix. In particular, recomputing the inner products Kij = Φ(xi) · Φ(xj) from the centered features gives:

Kij ← Kij − (2/N) Σk Kkj + (1/N²) Σkℓ Kkℓ.   (4)

For a centered kernel matrix, the relative weight of the leading d eigenvalues, obtained by dividing their sum by the trace, measures the relative variance captured by the leading d eigenvectors. When this ratio is nearly unity, the data can be viewed as inhabiting a d-dimensional subspace of the feature space, or equivalently, a d-dimensional manifold of the input space.
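A minimal NumPy sketch of this procedure, assuming a precomputed N×N kernel matrix K: the symmetric double-centering K ← HKH with H = I − (1/N)11ᵀ plays the role of eq. (4), and the trace ratio implements the dimensionality check just described.

    import numpy as np

    def kernel_pca_embedding(K, d):
        # Top-d kernel PCA embedding from an N x N kernel matrix K.
        N = K.shape[0]
        # Center the features in feature space: K <- H K H with H = I - (1/N) 11^T.
        H = np.eye(N) - np.ones((N, N)) / N
        Kc = H @ K @ H
        # Eigendecompose the (symmetric) centered kernel matrix.
        eigvals, eigvecs = np.linalg.eigh(Kc)
        # Sort eigenvalues from largest to smallest.
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        # Embedding: y_i = (sqrt(lambda_1) v_1i, ..., sqrt(lambda_d) v_di).
        Y = eigvecs[:, :d] * np.sqrt(np.maximum(eigvals[:d], 0.0))
        # Relative variance captured by the leading d eigenvalues.
        variance_ratio = eigvals[:d].sum() / np.trace(Kc)
        return Y, variance_ratio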

3. Learning the kernel matrix

The choice of the kernel plays an important role in kernel PCA, in that different kernels are bound to reveal (or conceal) different types of low dimensional structure. In this section, we show how to learn a kernel matrix that reveals when high dimensional inputs lie on or near a low dimensional manifold. As in earlier work on support vector machines (SVMs) (Graepel, 2002; Lanckriet et al., 2004), we will cast the problem of learning the kernel matrix as an instance of semidefinite programming. The similarity ends there, however, as the optimization criteria for nonlinear dimensionality reduction differ substantially from the criteria for large margin classification. We describe the constraints on the optimization in section 3.1, the objective function in section 3.2, and the optimization itself in section 3.3.

3.1. Constraints

Semipositive definiteness

The kernel matrix K is constrained by three criteria. The first is semipositive definiteness, a condition required to interpret the kernel matrix as storing the inner products of vectors in a Hilbert space. We thus constrain the optimization over K to the cone of symmetric matrices with nonnegative eigenvalues. Though not a linear constraint, the cone of semipositive definite matrices defines a convex domain for the overall optimization.

Centering

The second constraint is that the kernel matrix stores the inner products of features that are centered on the origin, or:

Σi Φ(xi) = 0.   (5)

As described in section 2, this condition enables us to interpret the eigenvalues of the kernel matrix as measures of variance along principal components in feature space. Eq. (5) can be expressed in terms of the kernel matrix as:

0 = |Σi Φ(xi)|² = Σij Φ(xi) · Φ(xj) = Σij Kij.   (6)

Note that this is a linear constraint on the elements of the kernel matrix, thus preserving the convexity of the domain of optimization.

Isometry

The final constraints on the kernel matrix reflect our goals for nonlinear dimensionality reduction. In particular, we are interested in the setting where the inputs lie on or near a low dimensional manifold, and the goals of kernel PCA are to detect the dimensionality of this underlying manifold and discover its modes of variability. We imagine that this manifold is isometric to an open connected subset of Euclidean space (Tenenbaum et al., 2000; Donoho & Grimes, 2003), and the problem of learning the kernel matrix is to discover how the inner products between inputs transform under this mapping. An isometry¹ is a smooth invertible mapping that looks locally like a rotation plus translation, thus preserving local (and hence geodesic) distances. Thus, in our application of kernel PCA to manifold learning, the final constraints we impose on the kernel matrix are to restrict the (implicitly defined) mappings between inputs and features from fully general nonlinear transformations to the special class of isometries.

How is this done? We begin by extending the notion of isometry to discretely sampled manifolds, in particular to sets of inputs {x1, . . . , xN} and features {Φ(x1), . . . , Φ(xN)} in one-to-one correspondence. Let the N×N binary matrix η indicate a neighborhood relation on both sets, such that we regard xj as a neighbor of xi if and only if ηij = 1 (and similarly, for Φ(xj) and Φ(xi)). We will say that the inputs xi and features Φ(xi) are locally isometric under the neighborhood relation η if for every point xi, there exists a rotation and translation that maps xi and its neighbors precisely onto Φ(xi) and its neighbors.

¹Formally, two Riemannian manifolds are said to be isometric if there is a diffeomorphism such that the metric on one pulls back to the metric on the other.

The above definition translates naturally into various sets of linear constraints on the elements of the kernel matrix Kij. Note that the local isometry between neighborhoods will exist if and only if the distances and angles between points and their neighbors are preserved. Thus, whenever both xj and xk are neighbors of xi (that is, ηij ηik = 1), for local isometry we must have that:

(Φ(xi) − Φ(xj)) · (Φ(xi) − Φ(xk)) = (xi − xj) · (xi − xk).   (7)

Eq. (7) is sufficient for local isometry because the triangle formed by any point and its neighbors is determined up to rotation and translation by specifying the lengths of two sides and the angle between them. In fact, such a triangle is similarly determined by specifying the lengths of all its sides. Thus, we can also say that the inputs and features are locally isometric under η if whenever xi and xj are themselves neighbors (that is, ηij = 1) or are common neighbors of another input (that is, [ηᵀη]ij > 0), we have:

|Φ(xi) − Φ(xj)|² = |xi − xj|².   (8)

This characterization of local isometry is equivalent to eq. (7), but expressed only in terms of pairwise distances. Finally, we can express these constraints purely in terms of dot products. Let Gij = xi · xj denote the Gram matrix of the inputs, and recall that the kernel matrix Kij = Φ(xi) · Φ(xj) represents the Gram matrix of the features. Then eq. (8) can be written as:

Kii + Kjj − Kij − Kji = Gii + Gjj − Gij − Gji.   (9)

Eq. (9) constrains how the inner products between nearby inputs are allowed to transform under a locally isometric mapping. We impose these constraints to ensure that the mapping defined (implicitly) by the kernel matrix is an isometry. Note that these constraints are also linear in the elements of the kernel matrix, thus preserving the convexity of the domain of optimization.

The simplest choice for neighborhoods is to let ηij indicate whether input xj is one of the k nearest neighbors of input xi computed using Euclidean distance. In this case, eq. (9) specifies O(Nk²) constraints that fix the distances between each point and its nearest neighbors, as well as the pairwise distances between nearest neighbors. Provided that k ≪ N, however, the kernel matrix is unlikely to be fully specified by these constraints, since it contains O(N²) elements.
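The constrained index pairs can be assembled as in the following sketch, which assumes a brute-force k-nearest-neighbor search (the paper does not prescribe a particular implementation, and the helper name is ours).

    import numpy as np

    def neighborhood_constraints(X, k=4):
        # Binary neighbor matrix eta and the index pairs (i, j) whose
        # distances are fixed by eq. (9).
        N = X.shape[0]
        # Pairwise squared Euclidean distances between inputs.
        D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        eta = np.zeros((N, N), dtype=int)
        for i in range(N):
            # k nearest neighbors of x_i, excluding x_i itself.
            eta[i, np.argsort(D2[i])[1:k + 1]] = 1
        # [eta^T eta]_ij > 0 whenever x_i and x_j share a common neighbor.
        common = eta.T @ eta
        pairs = [(i, j) for i in range(N) for j in range(i + 1, N)
                 if eta[i, j] or eta[j, i] or common[i, j] > 0]
        return eta, pairs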

3.2. Objective function

In the previous section, we showed how to restrict the kernel matrices so that the features Φ(xi) could be regarded as images of a locally isometric mapping. The goal of nonlinear dimensionality reduction is to discover the particular isometric mapping that “unfolds” the underlying manifold of inputs into a Euclidean space of its intrinsic dimensionality. This intrinsic dimensionality may be much lower than the extrinsic dimensionality of the input space. To unfold the manifold, we need to construct an objective function over “locally isometric” kernel matrices that favors this type of dimensionality reduction.

To this end, imagine each input xi as a steel ball connected to its k nearest neighbors by rigid rods. (For simplicity, we assume that the graph formed this way is fully connected; if not, then each connected component of the graph should be analyzed separately.) The effect of the rigid rods is to “lock” the neighborhoods in place, fixing the distances and angles between nearest neighbors. The lattice of steel balls formed in this way can be viewed as a discretized version of the underlying manifold. Now imagine that we pull the steel balls as far apart as possible, recording their final positions by Φ(xi). The discretized manifold will remain connected—due to the constraints imposed by the rigid rods—but it will also flatten, increasing the variance captured by its leading principal components. (For a continuous analogy, imagine pulling on the ends of a string; a string with any slack in it occupies at least two dimensions, whereas a taut, fully extended string occupies just one.)

We can formalize this intuition as an optimization over semipositive definite matrices. The constraints imposed by the rigid rods are, in fact, precisely the constraints imposed by eq. (9). An objective function that measures the pairwise distances between steel balls is given by:

T = 1/(2N) Σij |Φ(xi) − Φ(xj)|².   (10)

It is easy to see that this function is bounded above due to the constraints on distances between neighbors imposed by the rigid rods. Suppose the distance between any two neighbors is bounded by some maximal distance τ. Provided that the graph is connected, then for any two points, there exists a path along the graph of length at most Nτ, which (by the triangle inequality) provides an upper bound on the Euclidean distance between the points that appears in eq. (10). This results in an upper bound on the objective function of order O(N³τ²).
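Spelling out the bound: if every pairwise feature distance in eq. (10) is at most Nτ, then

T ≤ 1/(2N) Σij (Nτ)² = 1/(2N) · N² · (Nτ)² = N³τ²/2 = O(N³τ²).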

Eq. (10) can be expressed in terms of the elements of the kernel matrix by expanding the right hand side:

T = 1/(2N) Σij (Kii + Kjj − 2Kij) = Tr(K).   (11)

The last step in eq. (11) follows from the centering constraint in eq. (6). Thus the objective function for the optimization is simply the trace of the kernel matrix. Maximizing the trace also corresponds to maximizing the variance in feature space.
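In more detail, expanding the squared distances and using Σij Kij = 0 from eq. (6) gives

Σij (Kii + Kjj − 2Kij) = N Σi Kii + N Σj Kjj − 2 Σij Kij = 2N Tr(K),

so that T = 1/(2N) · 2N Tr(K) = Tr(K).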

3.3. Semidefinite embedding (SDE)

The constraints and objective function from the previous sections define an instance of semidefinite programming (SDP) (Vandenberghe & Boyd, 1996). Specifically, the goal is to optimize a linear function of the elements in a semipositive definite matrix subject to linear equality constraints. Collecting the constraints and objective function, we have the following optimization:

Maximize: Tr(K) subject to:

1. K ⪰ 0.

2. Σij Kij = 0.

3. Kii + Kjj − Kij − Kji = Gii + Gjj − Gij − Gji for all i, j such that ηij = 1 or [ηᵀη]ij > 0.

The optimization is convex and does not suffer from local optima. There are several general-purpose toolboxes and polynomial-time solvers available for problems in semidefinite programming. The results in this paper were obtained using the SeDuMi toolbox (Sturm, 1999) in MATLAB. Once the kernel matrix is computed, a nonlinear embedding can be obtained from its leading eigenvectors, as described in section 2. Because the kernel matrices in this approach are optimized by semidefinite programming, we will refer to this particular form of kernel PCA as Semidefinite Embedding (SDE).
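A compact sketch of this semidefinite program in the CVXPY modeling language (the authors used SeDuMi in MATLAB; the solver and helper names here are purely illustrative). G is the Gram matrix of the inputs and pairs is the list of constrained index pairs from section 3.1, for example as produced by the neighborhood_constraints sketch above.

    import cvxpy as cp
    import numpy as np

    def learn_sde_kernel(X, pairs):
        # Learn the SDE kernel matrix by semidefinite programming.
        N = X.shape[0]
        G = X @ X.T                        # Gram matrix of the inputs
        K = cp.Variable((N, N), PSD=True)  # constraint 1: K is semipositive definite
        constraints = [cp.sum(K) == 0]     # constraint 2: centering
        for i, j in pairs:                 # constraint 3: local isometry, eq. (9)
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j]
                               == G[i, i] + G[j, j] - 2 * G[i, j])
        problem = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
        problem.solve()
        return K.value

The returned matrix can then be passed to the kernel PCA step of section 2 to read off the low dimensional coordinates.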

4. Experimental Results

Experiments were performed on several data sets to evaluate the learning algorithm described in section 3. Though the SDE kernels were expressly optimized for problems in manifold learning, we also evaluated their performance for large margin classification.

4.1. Nonlinear dimensionality reduction

We performed kernel PCA with linear, polynomial, Gaussian, and SDE kernels on data sets where we knew or suspected that the high dimensional inputs were sampled from a low dimensional manifold. Where necessary, kernel matrices were centered before computing principal components, as in eq. (4).



Figure 1. Top: results of SDE applied to N = 800 inputs sampled from a “Swiss roll” (top left). The inputs had D = 23 dimensions, of which 20 were filled with small amounts of noise (not shown). The two dimensional plot shows the embedding from kernel PCA with the SDE kernel. Bottom: eigenvalues of different kernel matrices, normalized by their trace. Only the eigenvalues from SDE indicate the correct intrinsic dimensionality (d = 2) of the Swiss roll.

In the first experiment, the inputs were sampled from a three dimensional “Swiss roll”, a data set commonly used to evaluate algorithms in manifold learning (Tenenbaum et al., 2000). Fig. 1 shows the original inputs (top left), the embedding discovered by SDE with k = 4 nearest neighbors (top right), and the eigenvalue spectra from several different kernel matrices (bottom). The color coding of the embedding reveals that the Swiss roll has been successfully unraveled. Note that the kernel matrix learned by SDE has two dominant eigenvalues, indicating the correct underlying dimensionality of the Swiss roll, whereas the eigenspectra of other kernel matrices fail to reveal this structure. In particular, the linear kernel matrix has three dominant eigenvalues, reflecting the extrinsic dimensionality of the Swiss roll, while the eigenspectra of the polynomial (p = 4) and Gaussian (σ = 1.45) kernel matrices² indicate that the variances of their features Φ(xi) are spread across a far greater number of dimensions than the original inputs xi.

²For all the data sets in this section, we set the width parameter σ of the Gaussian kernel to the estimated standard deviation within neighborhoods of size k = 4, thus reflecting the same length scale used in SDE.

The second experiment was performed on a data set consisting of N = 400 color images of a teapot viewed from different angles in the plane. With a resolution of 76×101 and three bytes of color information per pixel, the images were represented as points in a D = 23028 dimensional vector space. Though very high dimensional, the images in this data set are effectively parameterized by a single degree of freedom—namely, the angle of rotation. The low dimensional embedding of these images by SDE and the eigenvalue spectra of different kernel matrices are shown in Fig. 2. The kernel matrix learned by SDE (with k = 4 nearest neighbors) concentrates the variance of the feature space in two dimensions and maps the images to a circle, a highly intuitive representation of the full 360 degrees of rotation. By contrast, the linear, polynomial (p = 4), and Gaussian (σ = 1541) kernel matrices have eigenvalue spectra that do not reflect the low intrinsic dimensionality of the data set.

Figure 2. Results of SDE applied to N = 400 images of a teapot viewed from different angles in the plane, under a full 360 degrees of rotation. The images were represented by inputs in a D = 23028 dimensional vector space. SDE faithfully represents the 360 degrees of rotation by a circle. The eigenvalues of different kernel matrices are also shown, normalized by their trace.

Why does kernel PCA with the Gaussian kernel perform so differently on these data sets when its width parameter σ reflects the same length scale as neighborhoods in SDE? Note that the Gaussian kernel computes a nearly zero inner product (Kij ≈ 0) in feature space for inputs xi and xj that do not belong to the same or closely overlapping neighborhoods. It follows from these inner products that the feature vectors Φ(xi) and Φ(xj) must be nearly orthogonal. As a result, the different patches of the manifold are mapped into orthogonal regions of the feature space: see Fig. 3. Thus, rather than unfolding the manifold, the Gaussian kernel leads to an embedding whose dimensionality is equal to the number of non-overlapping patches of length scale σ. This explains the generally poor performance of the Gaussian kernel for manifold learning (as well as its generally good performance for large margin classification, discussed in section 4.2).

Figure 3. Embeddings from kernel PCA with the Gaussian kernel on the Swiss roll and teapot data sets in Figs. 1 and 2. The first three principal components are shown. In both cases, different patches of the manifolds are mapped to orthogonal parts of feature space.

Another experiment on the data set of teapot images was performed by restricting the angle of rotation to 180 degrees. The results are shown in Fig. 4. Interestingly, in this case, the eigenspectra of the linear, polynomial, and Gaussian kernel matrices are not qualitatively different. By contrast, the SDE kernel matrix now has only one dominant eigenvalue, indicating that the variability for this subset of images is controlled by a single (non-cyclic) degree of freedom.

Figure 4. Results of kernel PCA applied to N = 200 images of a teapot viewed from different angles in the plane, under 180 degrees of rotation. The eigenvalues of different kernel matrices are shown, normalized by their trace. The one dimensional embedding from SDE is also shown.

As a final experiment, we compared the performance of the different kernels on a real-world data set described by an underlying manifold. The data set (Hull, 1994) consisted of N = 953 grayscale images at 16×16 resolution of handwritten twos and threes (in roughly equal proportion). In this data set, it is possible to find a relatively smooth “morph” between any pair of images, and a relatively small number of degrees of freedom describe the possible modes of variability (e.g. writing styles for handwritten digits). Fig. 5 shows the results. Note that the kernel matrix learned by SDE concentrates the variance in significantly fewer dimensions, suggesting it has constructed a more appropriate feature map for nonlinear dimensionality reduction.

Figure 5. Results of kernel PCA applied to N = 953 images of handwritten digits. The eigenvalues of different kernel matrices are shown, normalized by their trace.

4.2. Large margin classification

We also evaluated the use of SDE kernel matrices for large margin classification by SVMs. Several training and test sets for problems in binary classification were created from the USPS data set of handwritten digits (Hull, 1994). Each training and test set had 810 and 90 examples, respectively. For each experiment, the SDE kernel matrices were learned (using k = 4 nearest neighbors) on the combined training and test data sets, ignoring the target labels. The results were compared to those obtained from linear, polynomial (p = 2), and Gaussian (σ = 1) kernels. Table 1 shows that the SDE kernels performed quite poorly in this capacity, even worse than the linear kernels.

Digits    Linear    Polynomial    Gaussian    SDE
1 vs 2    0.00      0.00          0.14        0.59
1 vs 3    0.23      0.00          0.35        1.73
2 vs 8    2.18      1.12          0.63        3.37
8 vs 9    1.00      0.54          0.29        1.27

Table 1. Percent error rates for SVM classification using different kernels on test sets of handwritten digits. Each line represents the average of 10 experiments with different 90/10 splits of training and testing data. Here, the SDE kernel performs much worse than the other kernels.
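The comparison can be reproduced in outline with any SVM implementation that accepts a precomputed kernel matrix; the sketch below assumes scikit-learn and is not the authors' original code. K is an N×N kernel matrix over the combined training and test inputs (for example, an SDE kernel matrix), y holds the binary labels, and train_idx and test_idx index a 90/10 split.

    import numpy as np
    from sklearn.svm import SVC

    def svm_test_error(K, y, train_idx, test_idx):
        # Train an SVM on a precomputed kernel and return the test error rate.
        K_train = K[np.ix_(train_idx, train_idx)]   # train x train block
        K_test = K[np.ix_(test_idx, train_idx)]     # test x train block
        clf = SVC(kernel="precomputed")
        clf.fit(K_train, y[train_idx])
        return np.mean(clf.predict(K_test) != y[test_idx])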

Fig. 6 offers an explanation of this poor performance. The SDE kernel can only be expected to perform well for large margin classification if the decision boundary on the unfolded manifold is approximately linear. There is no a priori reason, however, to expect this type of linear separability. The example in Fig. 6 shows a particular binary labeling of inputs on the Swiss roll for which the decision boundary is much simpler in the input space than on the unfolded manifold. A similar effect seems to be occurring in the large margin classification of handwritten digits. Thus, the strength of SDE for nonlinear dimensionality reduction is generally a weakness for large margin classification. By contrast, the polynomial and Gaussian kernels lead to more powerful classifiers precisely because they map inputs to higher dimensional regions of feature space.

Figure 6. Left: linearly separable inputs (in black versus white) sampled from a Swiss roll. Right: unfolding the manifold leads to a more complicated decision boundary.

5. Related and ongoing work

SDE can be viewed as an unsupervised counterpart to the work of Graepel (2002) and Lanckriet et al. (2004), who proposed learning kernel matrices by semidefinite programming for large margin classification. The kernel matrices learned by SDE differ from those usually employed in SVMs, in that they aim to map inputs into an (effectively) lower dimensional feature space. This explains both our positive results in nonlinear dimensionality reduction (section 4.1), as well as our negative results in large margin classification (section 4.2).

SDE can also be viewed as an alternative to manifold learning algorithms such as Isomap (Tenenbaum et al., 2000), locally linear embedding (LLE) (Roweis & Saul, 2000; Saul & Roweis, 2003), Hessian LLE (hLLE) (Donoho & Grimes, 2003), and Laplacian eigenmaps (Belkin & Niyogi, 2003). All these algorithms share a similar structure, creating a graph based on nearest neighbors, computing an N×N matrix from geometric properties of the inputs, and constructing an embedding from the eigenvectors with the largest or smallest nonnegative eigenvalues. A more detailed discussion of differences between these algorithms is given in a companion paper (Weinberger & Saul, 2004). Here, we comment mainly on their various interpretations as kernel methods (Ham et al., 2004). In general, these other methods give rise to matrices whose geometric properties as kernels are less robust or not as well understood. For example, unlike SDE, the similarity matrix constructed by Isomap from finite data sets can have negative eigenvalues. In some cases, moreover, these negative eigenvalues can be comparable in magnitude to the dominant positive ones: see Fig. 7. Unlike both SDE and Isomap, LLE and hLLE construct matrices whose bottom eigenvectors yield low dimensional embeddings; to interpret these matrices as kernels, their eigenvalues must be “flipped”, either by inverting the matrix itself or by subtracting it from a large multiple of the identity matrix. Moreover, it does not appear that these eigenvalues can be used to estimate the intrinsic dimensionality of underlying manifolds (Saul & Roweis, 2003). A kernel can be derived from the discrete Laplacian by noting its role in the heat diffusion equation, but the intuition gained from this analogy, in terms of diffusion times through a network, does not relate directly to the geometric properties of the kernel matrix. SDE stands apart from these methods in its explicit construction of a semipositive definite kernel matrix that preserves the geometric properties of the inputs up to local isometry and whose eigenvalues indicate the dimensionality of the underlying manifold.

Figure 7. The twenty eigenvalues of leading magnitude from Isomap and SDE on the data sets of teapot images and handwritten digits. Note that the similarity matrices constructed by Isomap have negative eigenvalues.

We are pursuing many directions in ongoing work. The first is to develop faster and potentially distributed (Biswas & Ye, 2003) methods for solving the instance of semidefinite programming in SDE and for out-of-sample extensions (Bengio et al., 2004). Thus far we have been using generic solvers with a relatively high time complexity, relying on toolboxes that do not exploit any special structure in our problem. We are also investigating many variations on the objective function and constraints in SDE—for example, to allow some slack in the preservation of local distances and angles, or to learn embeddings onto spheres. Finally, we are performing more extensive comparisons with other methods in nonlinear dimensionality reduction. Not surprisingly, perhaps, all these directions reflect a more general trend toward the convergence of research in kernel methods and manifold learning.

Acknowledgements

The authors are grateful to S. Boyd and Y. Ye (Stanford) for useful discussions of semidefinite programming and to the anonymous reviewers for many helpful comments.

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.

Bengio, Y., Paiement, J.-F., & Vincent, P. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.

Biswas, P., & Ye, Y. (2003). A distributed method for solving semidefinite programs arising from ad hoc wireless sensor network localization. Stanford University, Department of Electrical Engineering, working paper.

Burges, C. J. C. (1999). Geometry and invariance in kernel based methods. Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press.

Cortes, C., Haffner, P., & Mohri, M. (2003). Rational kernels. Advances in Neural Information Processing Systems 15 (pp. 617–624). Cambridge, MA: MIT Press.

Donoho, D. L., & Grimes, C. E. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100, 5591–5596.

Graepel, T. (2002). Kernel matrix completion by semidefinite programming. Proceedings of the International Conference on Artificial Neural Networks (pp. 694–699). Springer-Verlag.

Ham, J., Lee, D. D., Mika, S., & Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. Proceedings of the Twenty First International Conference on Machine Learning (ICML-04). Banff, Canada.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550–554.

Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Lodhi, H., Saunders, C., Cristianini, N., & Watkins, C. (2004). String matching kernels for text classification. Journal of Machine Learning Research. In press.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.

Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A. J., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12, 625–653.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Vandenberghe, L., & Boyd, S. P. (1996). Semidefinite programming. SIAM Review, 38(1), 49–95.

Weinberger, K. Q., & Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04). Washington, D.C.