
Journal of Machine Learning Research 12 (2011) 491-523 Submitted 6/10; Published 2/11

Learning Multi-modal Similarity

Brian McFee BMCFEE@CS.UCSD.EDU
Department of Computer Science and Engineering, University of California, San Diego, CA 92093-0404, USA

Gert Lanckriet GERT@ECE.UCSD.EDU
Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093-0407, USA

Editor: Tony Jebara

Abstract

In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications.

We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multi-media similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure.

Keywords: multiple kernel learning, metric learning, similarity

1. Introduction

In applications such as content-based recommendation systems, the definition of a proper similarity measure between items is crucial to many tasks, including nearest-neighbor retrieval and classification. In some cases, a natural notion of similarity may emerge from domain knowledge, for example, cosine similarity for bag-of-words models of text. However, in more complex, multi-media domains, there is often no obvious choice of similarity measure. Rather, viewing different aspects of the data may lead to several different, and apparently equally valid notions of similarity. For example, if the corpus consists of musical data, each song or artist may be represented simultaneously by acoustic features (such as rhythm and timbre), semantic features (tags, lyrics), or social features (collaborative filtering, artist reviews and biographies, etc). Although domain knowledge may be incorporated to endow each representation with an intrinsic geometry—and, therefore, a sense of similarity—the different notions of similarity may not be mutually consistent. In such cases, there is generally no obvious way to combine representations to form a unified similarity space which optimally integrates heterogeneous data.

©2011 Brian McFee and Gert Lanckriet.


Without extra information to guide the construction of a similarity measure, the situation seems hopeless. However, if some side-information is available, for example, as provided by human labelers, it can be used to formulate a learning algorithm to optimize the similarity measure.

This idea of using side-information to optimize a similarity function has received a great deal of attention in recent years. Typically, the notion of similarity is captured by a distance metric over a vector space (e.g., Euclidean distance in R^d), and the problem of optimizing similarity reduces to finding a suitable embedding of the data under a specific choice of the distance metric. Metric learning methods, as they are known in the machine learning literature, can be informed by various types of side-information, including class labels (Xing et al., 2003; Goldberger et al., 2005; Globerson and Roweis, 2006; Weinberger et al., 2006), or binary similar/dissimilar pairwise labels (Wagstaff et al., 2001; Shental et al., 2002; Bilenko et al., 2004; Globerson and Roweis, 2007; Davis et al., 2007). Alternatively, multidimensional scaling (MDS) techniques are typically formulated in terms of quantitative (dis)similarity measurements (Torgerson, 1952; Kruskal, 1964; Cox and Cox, 1994; Borg and Groenen, 2005). In these settings, the representation of data is optimized so that distance (typically Euclidean) conforms to side-information. Once a suitable metric has been learned, similarity to new, unseen data can be computed either directly (if the metric takes a certain parametric form, for example, a linear projection matrix), or via out-of-sample extensions (Bengio et al., 2004).

To guide the construction of a similarity space for multi-modal data, we adopt the idea of using similarity measurements, provided by human labelers, as side-information. However, it has to be noted that, especially in heterogeneous, multi-media domains, similarity may itself be a highly subjective concept and vary from one labeler to the next (Ellis et al., 2002). Moreover, a single labeler may not be able to consistently decide if or to what extent two objects are similar, but she may still be able to reliably produce a rank-ordering of similarity over pairs (Kendall and Gibbons, 1990). Thus, rather than rely on quantitative similarity or hard binary labels of pairwise similarity, it is now becoming increasingly common to collect similarity information in the form of triadic or relative comparisons (Schultz and Joachims, 2004; Agarwal et al., 2007), in which human labelers answer questions of the form:

“Is x more similar to y or z?”

Although this form of similarity measurement has been observed to be more stable than quantitative similarity (Kendall and Gibbons, 1990), and clearly provides a richer representation than binary pairwise similarities, it is still subject to problems of consistency and inter-labeler agreement. It is therefore imperative that great care be taken to ensure some sense of robustness when working with perceptual similarity measurements.

In the present work, our goal is to develop a framework for integrating multi-modal data so as to optimally conform to perceptual similarity encoded by relative comparisons. In particular, we follow three guiding principles in the development of our framework:

1. The algorithm should be robust against subjectivity and inter-labeler disagreement.

2. The algorithm must be able to integrate multi-modal data in an optimal way, that is, the distances between embedded points should conform to perceptual similarity measurements.

3. It must be possible to compute distances to new, unseen data as it becomes available.

We formulate this problem of heterogeneous feature integration as a learning problem: given a data set, and a collection of relative comparisons between pairs, we learn a representation of


the data that optimally reproduces the similarity measurements. This type of embedding problem has been previously studied by Agarwal et al. (2007) and Schultz and Joachims (2004). However, Agarwal et al. (2007) provide no out-of-sample extension, and neither support heterogeneous feature integration, nor do they address the problem of noisy similarity measurements.

Figure 1: An overview of our proposed framework for multi-modal feature integration. Data is represented in multiple feature spaces (each encoded by a kernel function). Humans supply perceptual similarity measurements in the form of relative pairwise comparisons, which are in turn filtered by graph processing algorithms, and then used as constraints to optimize the multiple kernel embedding.

A common approach to optimally integrate heterogeneous data is based on multiple kernel learning, where each kernel encodes a different modality of the data. Heterogeneous feature integration via multiple kernel learning has been addressed by previous authors in a variety of contexts, including classification (Lanckriet et al., 2004; Zien and Ong, 2007; Kloft et al., 2009; Jagarlapudi et al., 2009), regression (Sonnenburg et al., 2006; Bach, 2008; Cortes et al., 2009), and dimensionality reduction (Lin et al., 2009). However, none of these methods specifically address the problem of learning a unified data representation which conforms to perceptual similarity measurements.

1.1 Contributions

Our contributions in this work are two-fold. First, we develop the partial order embedding (POE) framework (McFee and Lanckriet, 2009b), which allows us to use graph-theoretic algorithms to filter a collection of subjective similarity measurements for consistency and redundancy. We then formulate a novel multiple kernel learning (MKL) algorithm which learns an ensemble of feature space projections to produce a unified similarity space. Our method is able to produce non-linear embedding functions which generalize to unseen, out-of-sample data. Figure 1 provides a high-level overview of the proposed methods.

The remainder of this paper is structured as follows. In Section 2, we develop a graphical framework for interpreting and manipulating subjective similarity measurements. In Section 3, we derive an embedding algorithm which learns an optimal transformation of a single feature space. In Section 4, we develop a novel multiple-kernel learning formulation for embedding problems, and derive an algorithm to learn an optimal space from heterogeneous data. Section 5 provides experimental results illustrating the effects of graph-processing on noisy similarity data, and the effectiveness of the multiple-kernel embedding algorithm on a music similarity task with human perception measurements. Finally, we prove hardness of dimensionality reduction in this setting in Section 6, and conclude in Section 7.

1.2 Preliminaries

A (strict) partial order is a binary relation R over a set Z (R ⊆ Z²) which satisfies the following properties:¹

• Irreflexivity: (a, a) ∉ R,

• Transitivity: (a, b) ∈ R ∧ (b, c) ∈ R ⇒ (a, c) ∈ R,

• Anti-symmetry: (a, b) ∈ R ⇒ (b, a) ∉ R.

Every partial order can be equivalently represented as a directed acyclic graph (DAG), where each vertex is an element of Z and an edge is drawn from a to b if (a, b) ∈ R. For any partial order, R may refer to either the set of ordered tuples {(a, b)} or the graph (DAG) representation of the partial order; the use will be clear from context.

For a directed graph G, we denote by G∞ its transitive closure, that is, G∞ contains an edge (i, j) if and only if there exists a path from i to j in G. Similarly, the transitive reduction (denoted Gmin) is the minimal graph with equivalent transitivity to G, that is, the graph with the fewest edges such that (Gmin)∞ = G∞.

Let X = {x_1, x_2, . . . , x_n} denote the training set of n items. A Euclidean embedding is a function g : X → R^d which maps X into a d-dimensional space equipped with the Euclidean (ℓ2) metric:

‖x − y‖_2 = √((x − y)^T (x − y)).

For any matrix B, let B_i denote its i-th column vector. A symmetric matrix A ∈ R^{n×n} has a spectral decomposition A = V Λ V^T, where Λ = diag(λ_1, λ_2, . . . , λ_n) is a diagonal matrix containing the eigenvalues of A, and V contains the eigenvectors of A. We adopt the convention that eigenvalues (and corresponding eigenvectors) are sorted in descending order. A is positive semi-definite (PSD), denoted by A ⪰ 0, if each eigenvalue is non-negative: λ_i ≥ 0, i = 1, . . . , n. Finally, a PSD matrix A gives rise to the Mahalanobis distance function

‖x − y‖_A = √((x − y)^T A (x − y)).
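The numpy sketch below is our own illustration of these conventions (not code from the paper): the PSD test via eigenvalues and the Mahalanobis distance induced by a PSD matrix A.

import numpy as np

def mahalanobis(x, y, A):
    # Distance ||x - y||_A = sqrt((x - y)^T A (x - y)) induced by a PSD matrix A.
    d = x - y
    return np.sqrt(d @ A @ d)

def is_psd(A, tol=1e-10):
    # A symmetric matrix is PSD iff all of its eigenvalues are non-negative.
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

# Example: A = B B^T is PSD by construction.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B @ B.T
x, y = rng.normal(size=3), rng.normal(size=3)
print(is_psd(A), mahalanobis(x, y, A))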

2. A Graphical View of Similarity

Before we can construct an embedding algorithm for multi-modal data, we must first establish the form of side-information that will drive the algorithm, that is, the similarity measurements that will be collected from human labelers. There is an extensive body of work on the topic of constructing a geometric representation of data to fit perceptual similarity measurements. Primarily, this work falls under the umbrella of multi-dimensional scaling (MDS), in which perceptual similarity is modeled by numerical responses corresponding to the perceived “distance” between a pair of items, for example, on a similarity scale of 1–10. (See Cox and Cox 1994 and Borg and Groenen 2005 for comprehensive overviews of MDS techniques.)

1. The standard definition of a (non-strict) partial order also includes the reflexive property: ∀a, (a, a) ∈ R. For reasons that will become clear in Section 2, we take the strict definition here, and omit the reflexive property.

Because “distances” supplied by test subjects may not satisfy metric properties—in particular, they may not correspond to Euclidean distances—alternative non-metric MDS (NMDS) techniques have been proposed (Kruskal, 1964). Unlike classical or metric MDS techniques, which seek to preserve quantitative distances, NMDS seeks an embedding in which the rank-ordering of distances is preserved.

Since NMDS only needs the rank-ordering of distances, and not the distances themselves, the task of collecting similarity measurements can be simplified by asking test subjects to order pairs of points by similarity:

“Are i and j more similar than k and ℓ?”

or, as a special case, the “triadic comparison”

“Is i more similar to j or ℓ?”

Based on this kind of relative comparison data, the embedding problem can be formulated as follows. Given is a set of objects X, and a set of similarity measurements C = {(i, j, k, ℓ)} ⊆ X^4, where a tuple (i, j, k, ℓ) is interpreted as “i and j are more similar than k and ℓ.” (This formulation subsumes the triadic comparisons model when i = k.) The goal is to find an embedding function g : X → R^d such that

∀(i, j, k, ℓ) ∈ C : ‖g(i) − g(j)‖² + 1 < ‖g(k) − g(ℓ)‖².    (1)

The unit margin is forced between the constrained distances for numerical stability.

Agarwal et al. (2007) work with this kind of relative comparison data and describe a generalized NMDS algorithm (GNMDS), which formulates the embedding problem as a semi-definite program. Schultz and Joachims (2004) derive a similar algorithm which solves a quadratic program to learn a linear, axis-aligned transformation of data to fit relative comparisons.

Previous work on relative comparison data often treats each measurement (i, j, k, ℓ) ∈ C as effectively independent (Schultz and Joachims, 2004; Agarwal et al., 2007). However, due to their semantic interpretation as encoding pairwise similarity comparisons, and the fact that a pair (i, j) may participate in several comparisons with other pairs, there may be some global structure to C which these previous methods are unable to exploit.

In Section 2.1, we develop a graphical framework to infer and interpret the global structure exhibited by the constraints of the embedding problem. Graph-theoretic algorithms presented in Section 2.2 then exploit this representation to filter this collection of noisy similarity measurements for consistency and redundancy. The final, reduced set of relative comparison constraints defines a partial order, making for a more robust and efficient embedding problem.

2.1 Similarity Graphs

To gain more insight into the underlying structure of a collection of comparisons C, we can represent C as a directed graph over X². Each vertex in the graph corresponds to a pair (i, j) ∈ X², and an edge from (i, j) to (k, ℓ) corresponds to a similarity measurement (i, j, k, ℓ) (see Figure 2). Interpreting C as a graph will allow us to infer properties of the global (graphical) structure of C. In particular, two facts become immediately apparent:


C = {(j, k, j, ℓ), (j, k, i, k), (j, ℓ, i, k), (i, ℓ, j, ℓ), (i, ℓ, i, j), (i, j, j, ℓ)}

Figure 2: The graph representation (left) of a set of relative comparisons (right).

1. If C contains cycles, then there exists no embedding which can satisfy C.

2. If C is acyclic, any embedding that satisfies the transitive reduction Cmin also satisfies C.

The first fact implies that no algorithm can produce an embedding which satisfies all measurements if the graph is cyclic. In fact, the converse of this statement is also true: if C is acyclic, then an embedding exists in which all similarity measurements are preserved (see Appendix A). If C is cyclic, however, by analyzing the graph, it is possible to identify an “unlearnable” subset of C which must be violated by any embedding.

Similarly, the second fact exploits the transitive nature of distance comparisons. In the example depicted in Figure 2, any g that satisfies (j, k, j, ℓ) and (j, ℓ, i, k) must also satisfy (j, k, i, k). In effect, the constraint (j, k, i, k) is redundant, and may also be safely omitted from C.

These two observations allude to two desirable properties in C for embedding methods: transitivity and anti-symmetry. Together with irreflexivity, these fit the defining characteristics of a partial order. Due to subjectivity and inter-labeler disagreement, however, most collections of relative comparisons will not define a partial order. Some graph processing, presented next, based on an approximate maximum acyclic subgraph algorithm, can reduce them to a partial order.

2.2 Graph Simplification

Because a set of similarity measurements C containing cycles cannot be embedded in any Euclidean space, C is inherently inconsistent. Cycles in C therefore constitute a form of label noise. As noted by Angelova (2004), label noise can have adverse effects on both model complexity and generalization. This problem can be mitigated by detecting and pruning noisy (confusing) examples, and training on a reduced, but certifiably “clean” set (Angelova et al., 2005; Vezhnevets and Barinova, 2007).

Unlike most settings, where the noise process affects each label independently—for example, random classification noise (Angluin and Laird, 1988)—the graphical structure of interrelated relative comparisons can be exploited to detect and prune inconsistent measurements. By eliminating similarity measurements which cannot be realized by any embedding, the optimization procedure can be carried out more efficiently and reliably on a reduced constraint set.

Ideally, when eliminating edges from the graph, we would like to retain as much information as possible. Unfortunately, this is equivalent to the maximum acyclic subgraph problem, which is NP-Complete (Garey and Johnson, 1979). A 1/2-approximate solution can be achieved by a simple greedy algorithm (Algorithm 1) (Berger and Shor, 1990).

Once a consistent subset of similarity measurements has been produced, it can be simplified further by pruning redundancies. In the graph view of similarity measurements, redundancies can be easily removed by computing the transitive reduction of the graph (Aho et al., 1972).


Algorithm 1 Approximate maximum acyclic subgraph (Berger and Shor, 1990)
Input: Directed graph G = (V, E)
Output: Acyclic graph G′

E′ ← ∅
for each (u, v) ∈ E in random order do
  if E′ ∪ {(u, v)} is acyclic then
    E′ ← E′ ∪ {(u, v)}
  end if
end for
G′ ← (V, E′)

By filtering the constraint set for consistency, we ensure that embedding algorithms are not learning from spurious information. Additionally, pruning the constraint set by transitive reduction focuses embedding algorithms on the most important core set of constraints while reducing overhead due to redundant information.
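As a concrete illustration of this pipeline (our own sketch, not the authors' implementation), the following Python snippet combines the greedy acyclic-subgraph step of Algorithm 1 with a transitive reduction, using the networkx library and the graph representation of C from Section 2.1; the function name filter_comparisons is ours.

import random
import networkx as nx

def filter_comparisons(comparisons, seed=0):
    # Build the similarity graph: each measurement (i, j, k, l) is an edge
    # from the vertex (i, j) to the vertex (k, l).
    G = nx.DiGraph()
    G.add_edges_from(((i, j), (k, l)) for (i, j, k, l) in comparisons)

    # Greedy approximate maximum acyclic subgraph (Algorithm 1): visit edges
    # in random order, keeping an edge only if it does not create a cycle.
    H = nx.DiGraph()
    H.add_nodes_from(G.nodes())
    edges = list(G.edges())
    random.Random(seed).shuffle(edges)
    for u, v in edges:
        H.add_edge(u, v)
        if not nx.is_directed_acyclic_graph(H):
            H.remove_edge(u, v)

    # Prune redundant (transitively implied) measurements.
    H = nx.transitive_reduction(H)
    return [(u[0], u[1], v[0], v[1]) for u, v in H.edges()]

# The set from Figure 2 is already acyclic; transitive reduction drops the
# redundant measurement (j, k, i, k) and the implied (i, l, j, l).
C = [('j','k','j','l'), ('j','k','i','k'), ('j','l','i','k'),
     ('i','l','j','l'), ('i','l','i','j'), ('i','j','j','l')]
print(filter_comparisons(C))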

3. Partial Order Embedding

Now that we have developed a language for expressing similarity between items, we are ready to formulate the embedding problem. In this section, we develop an algorithm that learns a representation of data consistent with a collection of relative similarity measurements, and allows unseen data to be mapped into the learned similarity space after learning. In order to accomplish this, we will assume a feature representation for X. By parameterizing the embedding function g in terms of the feature representation, we will be able to apply g to any point in the feature space, thereby generalizing to data outside of the training set.

3.1 Linear Projection

To start, we assume that the data originally lies in some Euclidean space, that is, X ⊂ R^D. There are of course many ways to define an embedding function g : R^D → R^d. Here, we will restrict attention to embeddings parameterized by a linear projection matrix M, so that for a vector x ∈ R^D,

g(x) ≐ Mx.

Collecting the vector representations of the training set as columns of a matrix X ∈ R^{D×n}, the inner product matrix of the embedded points can be characterized as

A = X^T M^T M X.

Now, for a relative comparison (i, j, k, ℓ), we can express the distance constraint (1) between embedded points as follows:

(X_i − X_j)^T M^T M (X_i − X_j) + 1 ≤ (X_k − X_ℓ)^T M^T M (X_k − X_ℓ).

These inequalities can then be used to form the constraint set of an optimization problem to solve for M. Because, in general, C may not be satisfiable by a linear projection of X, we soften the


constraints by introducing a slack variable ξ_ijkℓ ≥ 0 for each constraint, and minimize the empirical hinge loss over constraint violations, (1/|C|) ∑_C ξ_ijkℓ. This choice of loss function can be interpreted as a convex approximation to a generalization of the area under an ROC curve (see Appendix C).

To avoid over-fitting, we introduce a regularization term tr(M^T M), and a trade-off parameter β > 0 to control the balance between regularization and loss minimization. This leads to a regularized risk minimization objective:

min_{M, ξ≥0}  tr(M^T M) + (β/|C|) ∑_C ξ_ijkℓ    (2)

s.t.  (X_i − X_j)^T M^T M (X_i − X_j) + 1 ≤ (X_k − X_ℓ)^T M^T M (X_k − X_ℓ) + ξ_ijkℓ,   ∀(i, j, k, ℓ) ∈ C.

After learning M by solving this optimization problem, the embedding can be extended to out-of-sample points x′ by applying the projection: x′ ↦ Mx′.

Note that the distance constraints in (2) involve differences of quadratic terms, and are therefore not convex. However, since M only appears in the form M^T M in (2), the optimization problem can be expressed in terms of a positive semi-definite matrix W ≐ M^T M. This change of variables results in Algorithm 2, a (convex) semi-definite programming (SDP) problem (Boyd and Vandenberghe, 2004), since objective and constraints are linear in W, including the linear matrix inequality W ⪰ 0. The corresponding inner product matrix is

A = X^T W X.

Finally, after the optimal W is found, the embedding function g : R^D → R^D can be recovered from the spectral decomposition of W:

W = V Λ V^T  ⇒  g(x) = Λ^{1/2} V^T x,

and a d-dimensional approximation can be recovered by taking the leading d eigenvectors of W.

Algorithm 2 Linear partial order embedding (LPOE)
Input: n objects X, partial order C, data matrix X ∈ R^{D×n}, β > 0
Output: mapping g : X → R^d

min_{W,ξ}  tr(W) + (β/|C|) ∑_C ξ_ijkℓ

d(x_i, x_j) ≐ (X_i − X_j)^T W (X_i − X_j)

d(x_i, x_j) + 1 ≤ d(x_k, x_ℓ) + ξ_ijkℓ
ξ_ijkℓ ≥ 0   ∀(i, j, k, ℓ) ∈ C
W ⪰ 0
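As an illustration of how Algorithm 2 can be instantiated with an off-the-shelf convex solver, here is a minimal Python sketch using cvxpy (this is our own sketch, assuming the cvxpy package and an SDP-capable solver such as SCS are available; the function name lpoe is ours, not from the paper).

import numpy as np
import cvxpy as cp

def lpoe(X, C, beta=1.0):
    # X: (D, n) data matrix, columns are training points.
    # C: list of relative comparisons (i, j, k, l), meaning d(i, j) + 1 <= d(k, l).
    D = X.shape[0]
    W = cp.Variable((D, D), PSD=True)        # W = M^T M
    xi = cp.Variable(len(C), nonneg=True)    # slack variables

    constraints = []
    for t, (i, j, k, l) in enumerate(C):
        dij = X[:, i] - X[:, j]
        dkl = X[:, k] - X[:, l]
        # (X_i - X_j)^T W (X_i - X_j) + 1 <= (X_k - X_l)^T W (X_k - X_l) + xi_t
        constraints.append(dij @ W @ dij + 1 <= dkl @ W @ dkl + xi[t])

    objective = cp.Minimize(cp.trace(W) + (beta / len(C)) * cp.sum(xi))
    cp.Problem(objective, constraints).solve(solver=cp.SCS)

    # Recover g(x) = Lambda^{1/2} V^T x from the spectral decomposition of W.
    eigvals, V = np.linalg.eigh(W.value)
    order = np.argsort(eigvals)[::-1]        # eigenvalues in descending order
    L = np.clip(eigvals[order], 0.0, None)
    M = np.diag(np.sqrt(L)) @ V[:, order].T
    return M                                 # embed with M @ x; keep leading d rows if desired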


3.2 Non-linear Projection via Kernels

The formulation in Algorithm 2 can be generalized to support non-linear embeddings by the use of kernels, following the method of Globerson and Roweis (2007): we first map the data into a reproducing kernel Hilbert space (RKHS) H via a feature map φ with corresponding kernel function k(x, y) = 〈φ(x), φ(y)〉_H; then, the data is mapped to R^d by a linear projection M : H → R^d. The embedding function g : X → R^d is therefore the composition of the projection M with φ:

g(x) = M(φ(x)).

Because φ may be non-linear, this allows us to learn a non-linear embedding g. More precisely, we consider M as being comprised of d elements of H, that is, {ω_1, ω_2, . . . , ω_d} ⊆ H. The embedding g can thus be expressed as

g(x) = (〈ω_p, φ(x)〉_H)_{p=1}^d,

where (·)_{p=1}^d denotes concatenation.

Note that in general, H may be infinite-dimensional, so directly optimizing M may not be feasible. However, by appropriately regularizing M, we may invoke the generalized representer theorem (Schölkopf et al., 2001). Our choice of regularization is the Hilbert-Schmidt norm of M, which, in this case, reduces to

‖M‖²_HS = ∑_{p=1}^d 〈ω_p, ω_p〉_H.

With this choice of regularization, it follows from the generalized representer theorem that at an optimum, each ω_p must lie in the span of the training data, that is,

ω_p = ∑_{i=1}^n N_{pi} φ(x_i),   p = 1, . . . , d,

for some real-valued matrix N ∈ R^{d×n}. If Φ is a matrix representation of X in H (i.e., Φ_i = φ(x_i) for x_i ∈ X), then the projection operator M can be expressed as

M = N Φ^T.    (3)

We can now reformulate the embedding problem as an optimization over N rather than M. Using (3), the regularization term can be expressed as

‖M‖²_HS = tr(Φ N^T N Φ^T) = tr(N^T N Φ^T Φ) = tr(N^T N K),

where K is the kernel matrix over X:

K = Φ^T Φ,  with  K_ij = 〈φ(x_i), φ(x_j)〉_H = k(x_i, x_j).

To formulate the distance constraints in terms of N, we first express the embedding g in terms of N and the kernel function:

g(x) = M(φ(x)) = N Φ^T φ(x) = N (〈Φ_i, φ(x)〉_H)_{i=1}^n = N (k(x_i, x))_{i=1}^n = N K_x,


where K_x is the column vector formed by evaluating the kernel function k at x against the training set. The inner product matrix of embedded points can therefore be expressed as

A = K N^T N K,

which allows the distance constraints to be expressed in terms of N and the kernel matrix K:

(K_i − K_j)^T N^T N (K_i − K_j) + 1 ≤ (K_k − K_ℓ)^T N^T N (K_k − K_ℓ).

The embedding problem thus amounts to solving the following optimization problem in N and ξ:

min_{N, ξ≥0}  tr(N^T N K) + (β/|C|) ∑_C ξ_ijkℓ    (4)

s.t.  (K_i − K_j)^T N^T N (K_i − K_j) + 1 ≤ (K_k − K_ℓ)^T N^T N (K_k − K_ℓ) + ξ_ijkℓ,   ∀(i, j, k, ℓ) ∈ C.

Again, the distance constraints in (4) are non-convex due to the differences of quadratic terms. And, as in the previous section, N only appears in the form of inner products N^T N in (4)—both in the constraints, and in the regularization term—so we can again derive a convex optimization problem by changing variables to W ≐ N^T N ⪰ 0. The resulting embedding problem is listed as Algorithm 3, again a semi-definite programming problem (SDP), with an objective function and constraints that are linear in W.

After solving for W, the matrix N can be recovered by computing the spectral decomposition W = V Λ V^T, and defining N = Λ^{1/2} V^T. The resulting embedding function takes the form:

g(x) = Λ^{1/2} V^T K_x.
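In numpy terms, this recovery and out-of-sample mapping amounts to a few lines; the sketch below is our own illustration (the name kpoe_embedding is not from the paper), assuming W has already been obtained by solving Algorithm 3.

import numpy as np

def kpoe_embedding(W, d=None):
    # Recover N = Lambda^{1/2} V^T from the learned PSD matrix W,
    # with eigenvalues sorted in descending order as in Section 1.2.
    eigvals, V = np.linalg.eigh(W)
    order = np.argsort(eigvals)[::-1]
    L = np.clip(eigvals[order], 0.0, None)
    N = np.diag(np.sqrt(L)) @ V[:, order].T
    if d is not None:
        N = N[:d]                      # optional d-dimensional approximation
    return lambda K_x: N @ K_x         # g(x) = N K_x

# Usage: g = kpoe_embedding(W); a training point i embeds as g(K[:, i]), and a
# new point x embeds as g applied to the vector of kernel evaluations k(x_i, x).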

As in Schultz and Joachims (2004), this formulation can be interpreted as learning a Mahalanobis distance metric Φ W Φ^T over H. More generally, we can view this as a form of kernel learning, where the kernel matrix A is restricted to the set

A ∈ {K W K : W ⪰ 0}.    (5)

3.3 Connection to GNMDS

We conclude this section by drawing a connection between Algorithm 3 and the generalized non-metric MDS (GNMDS) algorithm of Agarwal et al. (2007).

First, we observe that the i-th column, K_i, of the kernel matrix K can be expressed in terms of K and the i-th standard basis vector e_i:

K_i = K e_i.

From this, it follows that distance computations in Algorithm 3 can be equivalently expressed as

d(x_i, x_j) = (K_i − K_j)^T W (K_i − K_j)
            = (K(e_i − e_j))^T W (K(e_i − e_j))
            = (e_i − e_j)^T K^T W K (e_i − e_j).    (6)


Algorithm 3 Kernel partial order embedding (KPOE)
Input: n objects X, partial order C, kernel matrix K, β > 0
Output: mapping g : X → R^n

min_{W,ξ}  tr(W K) + (β/|C|) ∑_C ξ_ijkℓ

d(x_i, x_j) ≐ (K_i − K_j)^T W (K_i − K_j)

d(x_i, x_j) + 1 ≤ d(x_k, x_ℓ) + ξ_ijkℓ
ξ_ijkℓ ≥ 0   ∀(i, j, k, ℓ) ∈ C
W ⪰ 0

If we consider the extremal case where K = I, that is, we have no prior feature-based knowledge of similarity between points, then Equation 6 simplifies to

d(x_i, x_j) = (e_i − e_j)^T I W I (e_i − e_j) = W_ii + W_jj − W_ij − W_ji.

Therefore, in this setting, rather than defining a feature transformation, W directly encodes the inner products between embedded training points. Similarly, the regularization term becomes

tr(W K) = tr(W I) = tr(W).

Minimizing the regularization term can be interpreted as minimizing a convex upper bound on the rank of W (Boyd and Vandenberghe, 2004), which expresses a preference for low-dimensional embeddings. Thus, by setting K = I in Algorithm 3, we directly recover the GNMDS algorithm.
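A quick numerical sanity check of the simplified distance above (illustrative only, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
W = B @ B.T                               # a PSD (hence symmetric) matrix
i, j = 1, 3
e = np.eye(5)
lhs = (e[i] - e[j]) @ W @ (e[i] - e[j])   # (e_i - e_j)^T W (e_i - e_j)
rhs = W[i, i] + W[j, j] - W[i, j] - W[j, i]
assert np.isclose(lhs, rhs)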

Note that directly learning inner products between embedded training data points rather than a feature transformation does not allow a meaningful out-of-sample extension to embed unseen data points. On the other hand, by Equation 5, it is clear that the algorithm optimizes over the entire cone of PSD matrices. Thus, if C defines a DAG, we could exploit the fact that a partial order over distances always allows an embedding which satisfies all constraints in C (see Appendix A) to eliminate the slack variables from the program entirely.

4. Multiple Kernel Embedding

In the previous section, we derived an algorithm to learn an optimal projection from a kernel space H to R^d such that Euclidean distance between embedded points conforms to perceptual similarity. If, however, the data is heterogeneous in nature, it may not be realistic to assume that a single feature representation can sufficiently capture the inherent structure in the data. For example, if the objects in question are images, it may be natural to encode texture information by one set of features, and color in another, and it is not immediately clear how to reconcile these two disparate sources of information into a single kernel space.


However, by encoding each source of information independently by separate feature spaces H^1, H^2, . . .—equivalently, kernel matrices K^1, K^2, . . .—we can formulate a multiple kernel learning algorithm to optimally combine all feature spaces into a single, unified embedding space. In this section, we will derive a novel, projection-based approach to multiple-kernel learning and extend Algorithm 3 to support heterogeneous data in a principled way.

4.1 Unweighted Combination

Let K^1, K^2, . . . , K^m be a set of kernel matrices, each with a corresponding feature map φ_p and RKHS H^p, for p ∈ 1, . . . , m. One natural way to combine the kernels is to look at the product space, which is formed by concatenating the feature maps:

φ(x_i) = (φ_1(x_i), φ_2(x_i), . . . , φ_m(x_i)) = (φ_p(x_i))_{p=1}^m.

Inner products can be computed in this space by summing across each feature map:

〈φ(x_i), φ(x_j)〉 = ∑_{p=1}^m 〈φ_p(x_i), φ_p(x_j)〉_{H^p},

resulting in the sum-kernel—also known as the average kernel or product space kernel. The corresponding kernel matrix can be conveniently represented as the unweighted sum of the base kernel matrices:

K = ∑_{p=1}^m K^p.    (7)

Since K is a valid kernel matrix itself, we could use K as input for Algorithm 3. As a result, the algorithm would learn a kernel from the family

K_1 = {(∑_{p=1}^m K^p) W (∑_{p=1}^m K^p) : W ⪰ 0} = {∑_{p,q=1}^m K^p W K^q : W ⪰ 0}.

4.2 Weighted Combination

Note that K_1 treats each kernel equally; it is therefore impossible to distinguish good features (i.e., those which can be transformed to best fit C) from bad features, and as a result, the quality of the resulting embedding may be degraded. To combat this phenomenon, it is common to learn a scheme for weighting the kernels in a way which is optimal for a particular task. The most common approach to combining the base kernels is to take a positive-weighted sum

∑_{p=1}^m µ_p K^p   (µ_p ≥ 0),

where the weights µ_p are learned in conjunction with a predictor (Lanckriet et al., 2004; Sonnenburg et al., 2006; Bach, 2008; Cortes et al., 2009). Equivalently, this can be viewed as learning a feature map

φ(x_i) = (√µ_p φ_p(x_i))_{p=1}^m,


where each base feature map has been scaled by the corresponding weight √µ_p.

Applying this reasoning to learning an embedding that conforms to perceptual similarity, one might consider a two-stage approach to parameterizing the embedding (Figure 3(a)): first construct a weighted kernel combination, and then project from the combined kernel space. Lin et al. (2009) formulate a dimensionality reduction algorithm in this way. In the present setting, this would be achieved by simultaneously optimizing W and µ_p to choose an inner product matrix A from the set

K_2 = {(∑_{p=1}^m µ_p K^p) W (∑_{p=1}^m µ_p K^p) : W ⪰ 0, ∀p, µ_p ≥ 0} = {∑_{p,q=1}^m µ_p K^p W µ_q K^q : W ⪰ 0, ∀p, µ_p ≥ 0}.

The corresponding distance constraints, however, contain differences of terms cubic in the optimization variables W and µ_p:

∑_{p,q} (K^p_i − K^p_j)^T µ_p W µ_q (K^q_i − K^q_j) + 1 ≤ ∑_{p,q} (K^p_k − K^p_ℓ)^T µ_p W µ_q (K^q_k − K^q_ℓ),

and are therefore non-convex and difficult to optimize. Even simplifying the class by removing cross-terms, that is, restricting A to the form

K_3 = {∑_{p=1}^m µ_p² K^p W K^p : W ⪰ 0, ∀p, µ_p ≥ 0},

still leads to a non-convex problem, due to the difference of positive quadratic terms introduced by distance calculations:

∑_{p=1}^m (K^p_i − K^p_j)^T µ_p² W (K^p_i − K^p_j) + 1 ≤ ∑_{p=1}^m (K^p_k − K^p_ℓ)^T µ_p² W (K^p_k − K^p_ℓ).

However, a more subtle problem with this formulation lies in the assumption that a single weight can characterize the contribution of a kernel to the optimal embedding. In general, different kernels may be more or less informative on different subsets of X or different regions of the corresponding feature space. Constraining the embedding to a single metric W with a single weight µ_p for each kernel may be too restrictive to take advantage of this phenomenon.

4.3 Concatenated Projection

We now return to the original intuition behind Equation 7. The sum-kernel represents the inner product between points in the space formed by concatenating the base feature maps φ_p. The sets K_2 and K_3 characterize projections of the weighted combination space, and turn out not to be amenable to efficient optimization (Figure 3(a)). This can be seen as a consequence of prematurely combining kernels prior to projection.

Rather than projecting the (weighted) concatenation of φ_p(·), we could alternatively concatenate learned projections M^p(φ_p(·)), as illustrated by Figure 3(b). Intuitively, by defining the embedding as the concatenation of m different projections, we allow the algorithm to learn an ensemble of


projections, each tailored to its corresponding domain space and jointly optimized to produce an optimal space. By contrast, the previously discussed formulations apply essentially the same projection to each (weighted) feature space, and are thus much less flexible than our proposed approach. Mathematically, an embedding function of this form can be expressed as the concatenation

g(x) = (M^p(φ_p(x)))_{p=1}^m.

Now, given this characterization of the embedding function, we can adapt Algorithm 3 to optimize over multiple kernels. As in the single-kernel case, we introduce regularization terms for each projection operator M^p,

∑_{p=1}^m ‖M^p‖²_HS,

to the objective function. Again, by invoking the representer theorem for each M^p, it follows that

M^p = N^p (Φ^p)^T,

for some matrix N^p, which allows the embedding problem to be reformulated as a joint optimization over N^p, p = 1, . . . , m rather than M^p, p = 1, . . . , m. Indeed, the regularization terms can be expressed as

∑_{p=1}^m ‖M^p‖²_HS = ∑_{p=1}^m tr((N^p)^T (N^p) K^p).    (8)

The embedding function can now be rewritten as

g(x) = (M^p(φ_p(x)))_{p=1}^m = (N^p K^p_x)_{p=1}^m,    (9)

and the inner products between embedded points take the form:

A_ij = 〈g(x_i), g(x_j)〉 = ∑_{p=1}^m (N^p K^p_i)^T (N^p K^p_j) = ∑_{p=1}^m (K^p_i)^T (N^p)^T (N^p) (K^p_j).

Similarly, squared Euclidean distance also decomposes by kernel:

‖g(x_i) − g(x_j)‖² = ∑_{p=1}^m (K^p_i − K^p_j)^T (N^p)^T (N^p) (K^p_i − K^p_j).    (10)

Finally, since the matrices N^p, p = 1, . . . , m only appear in the form of inner products in (8) and (10), we may instead optimize over PSD matrices W^p = (N^p)^T (N^p). This renders the regularization terms (8) and distances (10) linear in the optimization variables W^p. Extending Algorithm 3 to this parameterization of g(·) therefore results in an SDP, which is listed as Algorithm 4. To solve the SDP, we implemented a gradient descent solver, which is described in Appendix B.
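The decomposition in Equations (9) and (10) translates directly into code. The following numpy sketch is our own illustration (function names are ours), assuming a list of kernel matrices Ks and the corresponding learned PSD matrices Ws.

import numpy as np

def mk_distance_sq(Ks, Ws, i, j):
    # Equation (10): sum over kernels p of (K^p_i - K^p_j)^T W^p (K^p_i - K^p_j),
    # with W^p = (N^p)^T N^p.
    total = 0.0
    for K, W in zip(Ks, Ws):
        delta = K[:, i] - K[:, j]
        total += delta @ W @ delta
    return total

def mk_embed(Ws, K_x_list):
    # Equation (9): concatenate N^p K^p_x across kernels, where each N^p is
    # recovered from W^p by spectral decomposition.
    pieces = []
    for W, K_x in zip(Ws, K_x_list):
        eigvals, V = np.linalg.eigh(W)
        N = np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ V.T
        pieces.append(N @ K_x)
    return np.concatenate(pieces)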

The class of kernels over which Algorithm 4 optimizes can be expressed simply as the set

K_4 = {∑_{p=1}^m K^p W^p K^p : ∀p, W^p ⪰ 0}.


Kernel class                          Learned kernel matrix
K_1 = {∑_{p,q} K^p W K^q}             [K^1 + K^2 + · · · + K^m] [W] [K^1 + K^2 + · · · + K^m]
K_2 = {∑_{p,q} µ_p µ_q K^p W K^q}     [K^1; K^2; . . . ; K^m]^T [full block matrix with (p, q) block µ_p µ_q W] [K^1; K^2; . . . ; K^m]
K_3 = {∑_p µ_p² K^p W K^p}            [K^1; K^2; . . . ; K^m]^T [block-diagonal matrix with p-th block µ_p² W] [K^1; K^2; . . . ; K^m]
K_4 = {∑_p K^p W^p K^p}               [K^1; K^2; . . . ; K^m]^T [block-diagonal matrix with p-th block W^p] [K^1; K^2; . . . ; K^m]

Table 1: Block-matrix formulations of metric learning for multiple-kernel formulations (K_1–K_4). Each W^p is taken to be positive semi-definite. Note that all sets are equal when there is only one base kernel.

Note that K_4 contains K_3 as a special case when all W^p are positive scalar multiples of each other. However, K_4 leads to a convex optimization problem, where K_3 does not.

Table 1 lists the block-matrix formulations of each of the kernel combination rules described in this section. It is worth noting that it is certainly valid to first form the unweighted combination kernel K and then use K_1 (Algorithm 3) to learn an optimal projection of the product space. However, as we will demonstrate in Section 5, our proposed multiple-kernel formulation (K_4) outperforms the simple unweighted combination rule in practice.

4.4 Diagonal Learning

The MKPOE optimization is formulated as a semi-definite program over m different n×n matrices W^p—or, as shown in Table 1, a single mn×mn PSD matrix with a block-diagonal sparsity structure. Scaling this approach to large data sets can become problematic, as it requires optimizing over multiple high-dimensional PSD matrices.

To cope with larger problems, the optimization problem can be refined to constrain each W^p to the set of diagonal matrices. If W^p are all diagonal, positive semi-definiteness is equivalent to non-negativity of the diagonal values (since they are also the eigenvalues of the matrix). This allows the constraints W^p ⪰ 0 to be replaced by linear constraints W^p_ii ≥ 0, and the resulting optimization problem is a linear program (LP), rather than an SDP. This modification reduces the flexibility of the model, but leads to a much more efficient optimization procedure.


Figure 3: Two variants of multiple-kernel embedding. (a) Weighted combination (K_2): a data point x ∈ X is mapped into m feature spaces via φ_1, φ_2, . . . , φ_m, which are then scaled by µ_1, µ_2, . . . , µ_m to form a weighted feature space H*, which is subsequently projected to the embedding space via M. (b) Concatenated projection (K_4): x is first mapped into each kernel's feature space and then its image in each space is directly projected into a Euclidean space via the corresponding projections M^p. The projections are jointly optimized to produce the embedding space.

Algorithm 4 Multiple kernel partial order embedding (MKPOE)
Input: n objects X, partial order C, m kernel matrices K^1, K^2, . . . , K^m, β > 0
Output: mapping g : X → R^{mn}

min_{W^p,ξ}  ∑_{p=1}^m tr(W^p K^p) + (β/|C|) ∑_C ξ_ijkℓ

d(x_i, x_j) ≐ ∑_{p=1}^m (K^p_i − K^p_j)^T W^p (K^p_i − K^p_j)

d(x_i, x_j) + 1 ≤ d(x_k, x_ℓ) + ξ_ijkℓ
ξ_ijkℓ ≥ 0   ∀(i, j, k, ℓ) ∈ C
W^p ⪰ 0   p = 1, 2, . . . , m

More specifically, our implementation of Algorithm 4 operates by alternating sub-gradient descent on W^p and projection onto the feasible set W^p ⪰ 0 (see Appendix B for details). For full matrices, this projection is accomplished by computing the spectral decomposition of each W^p, and thresholding the eigenvalues at 0. For diagonal matrices, this projection is accomplished simply by

W^p_ii ↦ max{0, W^p_ii},

which can be computed in O(mn) time, compared to the O(mn³) time required to compute m spectral decompositions.
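The two projection steps can be written as short numpy routines; the sketch below is illustrative only (function names are ours) and is not the gradient solver of Appendix B.

import numpy as np

def project_psd(W):
    # Full-matrix case: symmetrize, eigendecompose, and threshold the
    # eigenvalues at zero (costs O(n^3) per kernel).
    eigvals, V = np.linalg.eigh((W + W.T) / 2.0)
    return V @ np.diag(np.clip(eigvals, 0.0, None)) @ V.T

def project_diag(w):
    # Diagonal case: clip the diagonal weights at zero (costs O(n) per kernel).
    return np.maximum(w, 0.0)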

Restricting W^p to be diagonal not only simplifies the problem to linear programming, but carries the added interpretation of weighting the contribution of each (kernel, training point) pair in the construction of the embedding. A large value at W^p_ii corresponds to point i being a landmark for the features encoded in K^p.


Figure 4: The label taxonomy for the experiment in Section 5.1: All splits into Clothing (Sneaker, Hat, White shoe), Toys (X-mas teddy, Pink animal, Ball, Big smurf), and Fruit (Lemon, Pear, Orange).

Note that each of the formulations listed in Table 1 has a corresponding diagonal variant; however, as in the full matrix case, only K_1 and K_4 lead to convex optimization problems.

5. Experiments

To evaluate our framework for learning multi-modal similarity, we first test the multiple kernel learning formulation on a simple toy taxonomy data set, and then on a real-world data set of musical perceptual similarity measurements.

5.1 Toy Experiment: Taxonomy Embedding

For our first experiment, we generated a toy data set from the Amsterdam Library of Object Images (ALOI) data set (Geusebroek et al., 2005). ALOI consists of RGB images of 1000 classes of objects against a black background. Each class corresponds to a single object, and examples are provided of the object under varying degrees of out-of-plane rotation.

In our experiment, we first selected 10 object classes, and from each class, sampled 20 examples. We then constructed an artificial taxonomy over the label set, as depicted in Figure 4. Using the taxonomy, we synthesized relative comparisons to span subtrees via their least common ancestor. For example,

(Lemon#1, Lemon#2, Lemon#1, Pear#1),
(Lemon#1, Pear#1, Lemon#1, Sneaker#1),

and so on. These comparisons are consistent and therefore can be represented as a directed acyclic graph. They are generated so as to avoid redundant, transitive edges in the graph.

For features, we generated five kernel matrices. The first is a simple linear kernel over the grayscale intensity values of the images, which, roughly speaking, compares objects by shape. The other four are Gaussian kernels over histograms in the (background-subtracted) red, green, blue, and intensity channels, and these kernels compare objects based on their color or intensity distributions.

We augment this set of kernels with five “noise” kernels, each of which was generated by sampling random points from the unit sphere in R³ and applying the linear kernel.

The data was partitioned into five 80/20 training and test set splits. To tune β, we further split the training set for 5-fold cross-validation, and swept over β ∈ {10^−2, 10^−1, . . . , 10^6}. For each fold, we learned a diagonally-constrained embedding with Algorithm 4, using the subset of relative comparisons (i, j, k, ℓ) with i, j, k and ℓ restricted to the training set. After learning the embedding, the held out data (validation or test) was mapped into the space, and the accuracy of the embedding was determined by counting the fraction of correctly predicted relative comparisons. In the validation and test sets, comparisons were processed to only include comparisons of the form (i, j, i, k) where i belongs to the validation (or test) set, and j and k belong to the training set.
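This evaluation criterion is easy to state in code; the helper below is our own sketch (not the authors' evaluation script), where embed maps an item identifier to its embedded vector.

import numpy as np

def comparison_accuracy(embed, comparisons):
    # Fraction of comparisons (i, j, k, l) for which the embedded distances
    # satisfy ||g(i) - g(j)|| < ||g(k) - g(l)||.
    correct = 0
    for i, j, k, l in comparisons:
        d_ij = np.linalg.norm(embed(i) - embed(j))
        d_kl = np.linalg.norm(embed(k) - embed(l))
        correct += int(d_ij < d_kl)
    return correct / len(comparisons)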

We repeat this experiment for each base kernel individually (that is, optimizing over K_1 with a single base kernel), as well as the unweighted sum kernel (K_1 with all base kernels), and finally MKPOE (K_4 with all base kernels). The results are averaged over all training/test splits, and collected in Figure 5. For comparison purposes, we include the prediction accuracy achieved by computing distances in each kernel's native space before learning. In each case, the optimized space indeed achieves higher accuracy than the corresponding native space. (Of course, the random noise kernels still predict randomly after optimization.)

As illustrated in Figure 5, taking the unweighted combination of kernels significantly degrades performance (relative to the best kernel) both in the native space (0.718 accuracy versus 0.862 for the linear kernel) and the optimized sum-kernel space (0.861 accuracy for the sum versus 0.951 for the linear kernel), that is, the unweighted sum kernel optimized by Algorithm 3. However, MKPOE (K_4) correctly identifies and omits the random noise kernels by assigning them negligible weight, and achieves higher accuracy (0.984) than any of the single kernels (0.951 for the linear kernel, after learning).

5.2 Musical Artist Similarity

To test our framework on a real data set, we applied the MKPOE algorithm to the task of learning a similarity function between musical artists. The artist similarity problem is motivated by several real-world applications, including recommendation and playlist-generation for online radio. Because artists may be represented by a wide variety of different features (e.g., tags, acoustic features, social data), such applications can benefit greatly from an optimally integrated similarity metric.

The training data is derived from the aset400 corpus of Ellis et al. (2002), which consists of 412 popular musicians, and 16385 relative comparisons of the form (i, j, i, k). Relative comparisons were acquired from human test subjects through a web survey; subjects were presented with a query artist (i), and asked to choose what they believe to be the most similar artist (j) from a list of 10 candidates. From each single response, 9 relative comparisons are synthesized, indicating that j is more similar to i than the remaining 9 artists (k) which were not chosen.

Our experiments here replicate and extend previous work on this data set (McFee and Lanckriet, 2009a). In the remainder of this section, we will first give an overview of the various types of features used to characterize each artist in Section 5.2.1. We will then discuss the experimental procedure in more detail in Section 5.2.2. The MKL embedding results are presented in Section 5.2.3, and are followed by an experiment detailing the efficacy of our constraint graph processing approach in Section 5.2.4.

5.2.1 FEATURES

We construct five base kernels over the data, incorporating acoustic, semantic, and social views of the artists.


Figure 5: Mean test set accuracy for the experiment of Section 5.1. Error bars correspond to one standard deviation across folds. Accuracy is computed by counting the fraction of correctly predicted relative comparisons in the native space of each base kernel, and then in the optimized space produced by KPOE (K_1 with a single base kernel). The unweighted combination of kernels (Sum) significantly degrades performance in both the native and optimized spaces. MKPOE (MKL, K_4) correctly rejects the random kernels, and significantly outperforms the unweighted combination and the single best kernel.

• MFCC: for each artist, we collected between 1 and 10 songs (mean 4). For each song, we extracted a short clip consisting of 10000 half-overlapping 23ms windows. For each window, we computed the first 13 Mel Frequency Cepstral Coefficients (MFCCs) (Davis and Mermelstein, 1990), as well as their first and second instantaneous derivatives. This results in a sequence of 39-dimensional vectors (delta-MFCCs) for each song. Each artist i was then summarized by a Gaussian mixture model (GMM) p_i over delta-MFCCs extracted from the corresponding songs. Each GMM has 8 components and diagonal covariance matrices. Finally, the kernel between artists i and j is the probability product kernel (Jebara et al., 2004) between their corresponding delta-MFCC distributions p_i, p_j:

  K^mfcc_ij = ∫ √(p_i(x) p_j(x)) dx.

• Auto-tags (AT): Using the MFCC features described above, we applied the automatic tagging algorithm of Turnbull et al. (2008), which for each song yields a multinomial distribution over a set T of 149 musically-relevant tag words (auto-tags). Artist-level tag distributions q_i were formed by averaging model parameters (i.e., tag probabilities) across all of the songs of artist i. The kernel between artists i and j for auto-tags is a radial basis function applied to the χ²-distance between the multinomial distributions q_i and q_j:

  K^at_ij = exp(−σ ∑_{t∈T} (q_i(t) − q_j(t))² / (q_i(t) + q_j(t))).

  In these experiments, we fixed σ = 256. (A minimal implementation of this kernel is sketched after this list.)

• Social tags (ST): For each artist, we collected the top 100 most frequently used tag words from Last.fm,² a social music website which allows users to label songs or artists with arbitrary tag words or social tags. After stemming and stop-word removal, this results in a vocabulary of 7737 tag words. Each artist is then represented by a bag-of-words vector in R^7737, and processed by TF-IDF. The kernel between artists for social tags is the cosine similarity (linear kernel) between TF-IDF vectors.

• Biography (Bio): Last.fm also provides textual descriptions of artists in the form of user-contributed biographies. We collected biographies for each artist in the aset400 data set, and after stemming and stop-word removal, we arrived at a vocabulary of 16753 biography words. As with social tags, the kernel between artists is the cosine similarity between TF-IDF bag-of-words vectors.

• Collaborative filtering (CF): Celma (2008) collected collaborative filtering data from Last.fm in the form of a bipartite graph over users and artists, where each user is associated with the artists in her listening history. We filtered this data down to include only the aset400 artists, of which all but 5 were found in the collaborative filtering graph. The resulting graph has 336527 users and 407 artists, and is equivalently represented by a binary matrix where each row i corresponds to an artist, and each column j corresponds to a user. The ij entry of this matrix is 1 if we observe a user-artist association, and 0 otherwise. The kernel between artists in this view is the cosine of the angle between corresponding rows in the matrix, which can be interpreted as counting the amount of overlap between the sets of users listening to each artist and normalizing for overall artist popularity. For the 5 artists not found in the graph, we fill in the corresponding rows and columns of the kernel matrix with the identity matrix.
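As referenced in the Auto-tags item above, the χ²-RBF kernel can be computed with a few lines of numpy. This is our own sketch (names are ours), assuming Q is a float array with one row per artist holding that artist's tag distribution over T.

import numpy as np

def chi2_rbf_kernel(Q, sigma=256.0):
    # K_ij = exp(-sigma * sum_t (q_i(t) - q_j(t))^2 / (q_i(t) + q_j(t))),
    # skipping terms where both distributions assign zero mass to tag t.
    n = Q.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            num = (Q[i] - Q[j]) ** 2
            den = Q[i] + Q[j]
            chi2 = np.sum(np.divide(num, den, out=np.zeros_like(num), where=den > 0))
            K[i, j] = np.exp(-sigma * chi2)
    return K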

5.2.2 EXPERIMENTAL PROCEDURE

The data was randomly partitioned into ten 90/10 training/test splits. Given the inherent ambiguity in the task, and format of the survey, there is a great deal of conflicting information in the survey responses. To obtain a more accurate and internally consistent set of training comparisons, directly contradictory comparisons (e.g., (i, j, i, k) and (i, k, i, j)) were removed from both the training and test sets. Each training set was further cleaned by finding an acyclic subset of comparisons and taking its transitive reduction, resulting in a minimal partial order. (No further processing was performed on test comparisons.)

After training, test artists were mapped into the learned space (by Equation 9), and accuracy was measured by counting the number of measurements (i, j, i, k) correctly predicted by distance in the learned space, where i belongs to the test set, and j, k belong to the training set.

For each experiment, β is chosen from {10^−2, 10^−1, . . . , 10^7} by holding out 30% of the training constraints for validation. (Validation splits are generated from the unprocessed training set, and the

2. Last.fm can be found at http://last.fm.


Figure 6: aset400 embedding results for each of the base kernels. Accuracy is computed in each kernel's native feature space, as well as the space produced by applying Algorithm 3 (i.e., optimizing over K_1 with a single kernel) with either the diagonal or full-matrix formulation. Error bars correspond to one standard deviation across training/test splits.

remaining training constraints are processed as described above.) After finding the best-performing β, the embedding is trained on the full (processed) training set.

5.2.3 EMBEDDING RESULTS

For each base kernel, we evaluate the test-set performance in the native space (i.e., by distances calculated directly from the entries of the kernel matrix), and by learned metrics, both diagonal and full (optimizing over K_1 with a single base kernel). Figure 6 illustrates the results.

We then repeated the experiment by examining different groupings of base kernels: acoustic (MFCC and Auto-tags), semantic (Social tags and Bio), social (Collaborative filter), and combinations of the groups. The different sets of kernels were combined by Algorithm 4 (optimizing over K_4). The results are listed in Figure 7.

In all cases, MKPOE improves over the unweighted combination of base kernels. Moreover, many combinations outperform the single best kernel (ST, 0.777 ± 0.02 after optimization), and the algorithm is generally robust in the presence of poorly-performing kernels (MFCC and AT). Note that the poor performance of MFCC and AT kernels may be expected, as they derive from song-level rather than artist-level features, whereas ST provides high-level semantic descriptions which are generally more homogeneous across the songs of an artist, and Bio and CF are directly constructed at the artist level. For comparison purposes, we trained metrics on the sum kernel with K_1 (Algorithm 3), resulting in accuracies of 0.676 ± 0.05 (diagonal) and 0.765 ± 0.03 (full). The proposed approach (Algorithm 4) applied to all kernels results in 0.754 ± 0.03 (diagonal), and 0.795 ± 0.02 (full).

Figure 8 illustrates the weights learned by Algorithm 4 using all five kernels and diagonally-constrained W^p matrices. Note that the learned metrics are both sparse (many 0 weights) and non-uniform across different kernels. In particular, the (lowest-performing) MFCC kernel is eliminated


Figure 7: aset400 embedding results with multiple kernel learning: the learned metrics are optimized over K_4 by Algorithm 4. Native corresponds to distances calculated according to the unweighted sum of base kernels.

by the algorithm, and the majority of the weight is assigned to the (highest-performing) social tag(ST) kernel.

A t-SNE (van der Maaten and Hinton, 2008) visualization of the space produced by MKPOEis illustrated in Figure 9. The embedding captures a great deal of high-level genre structure: forexample, theclassic rockandmetalgenres lie at the opposite end of the space frompopandhip-hop.

5.2.4 GRAPH PROCESSINGRESULTS

To evaluate the effects of processing the constraint set for consistency and redundancy, we repeatthe experiment of the previous section with different levels of processingapplied toC . Here, wefocus on the Biography kernel, since it exhibits the largest gap in performance between the nativeand learned spaces.

As a baseline, we first consider the full set of similarity measurements as provided by human judgements, including all inconsistencies. To first deal with what appear to be the most egregious inconsistencies, we prune all directly inconsistent training measurements; that is, whenever (i, j, i, k) and (i, k, i, j) both appear, both are removed.3 Finally, we consider the fully processed case by finding a maximal consistent subset (partial order) of C and removing all redundancies. Table 2 lists the number of training constraints retained by each step of processing (averaged over the random splits).
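For concreteness, the direct-inconsistency pruning step can be sketched as follows; this is a minimal illustration (the function name is ours), assuming each comparison is stored as a 4-tuple (i, j, k, ℓ) meaning d(i, j) < d(k, ℓ):

```python
def prune_length2_cycles(C):
    """Remove directly inconsistent comparisons: whenever a comparison and
    its reversal, e.g., (i, j, i, k) and (i, k, i, j), both appear, drop both."""
    constraints = set(C)
    inconsistent = {c for c in constraints
                    if (c[2], c[3], c[0], c[1]) in constraints}
    return constraints - inconsistent
```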

Using each of these variants of the training set, we test the embedding algorithm with both diagonal and full-matrix formulations. The results are presented in Table 2. Each level of graph processing results in a significant reduction in the number of training comparisons (and, therefore, the computational overhead of Algorithm 3), while not degrading the quality of the resulting embedding.

3. A more sophisticated approach could be used here, for example, majority voting, provided there is sufficient over-sampling of comparisons. The aset400 data lacks sufficient over-sampling for majority voting, so we default to this relatively simple approach.


[Figure 8: five bar plots, one per kernel (MFCC, AT, ST, Bio, CF), of the learned diagonal weights.]

Figure 8: The weighting learned by Algorithm 4 using all five kernels and diagonal W_p. Each bar plot contains the diagonal of the corresponding kernel's learned metric. The horizontal axis corresponds to the index in the training set, and the vertical axis corresponds to the learned weight in each kernel space.

C            |C| (Avg.)    Accuracy (Diagonal)    Accuracy (Full)
Full         8951.3        0.622 ± 0.05           0.715 ± 0.04
Length-2     6684.5        0.630 ± 0.05           0.714 ± 0.04
Processed    4814.5        0.628 ± 0.05           0.716 ± 0.04

Table 2: aset400 embedding results (Biography kernel) for three possible refinements of the constraint set. Full includes all similarity measurements, with no pruning for consistency or redundancy. Length-2 removes all length-2 cycles (i.e., (i, j, k, ℓ) and (k, ℓ, i, j)). Processed finds an approximate maximal consistent subset, and removes redundant constraints.

Finally, to test the sensitivity of the algorithm to randomness in the acyclic subgraph routine, we repeated the above experiment ten times, each with a different random maximal acyclic constraint set and the full-matrix formulation of the algorithm. As depicted in Figure 10, the randomness in the constraint generation has little impact on the accuracy of the learned metric: the largest standard deviation is 0.007 (split #7).


[Figure 9: two t-SNE scatter plots labeled with artist names; panel (a) includes artists such as Michael Jackson, Madonna, and Britney Spears, while panel (b) includes artists such as AC/DC, Iron Maiden, and Black Sabbath.]

Figure 9: t-SNE visualizations of an embedding of aset400 produced by MKPOE. The embedding is constructed by optimizing over K_4 with all five base kernels. The two clusters shown roughly correspond to (a) pop/hip-hop, and (b) classic rock/metal genres. Out-of-sample points are indicated by a red +.


Figure 10: Accuracy of the learned embedding for each training/test split, averaged over ten trials with random maximal acyclic constraint subgraphs. Error bars correspond to one standard deviation.

6. Hardness of Dimensionality Reduction

The algorithms given in Sections 3 and 4 attempt to produce low-dimensional solutions by regularizing W, which can be seen as a convex approximation to the rank of the embedding. In general, because rank constraints are not convex, convex optimization techniques cannot efficiently minimize dimensionality; this does not necessarily imply that other techniques could not work. It is therefore natural to ask whether exact solutions of minimal dimensionality can be found efficiently, particularly in the multidimensional scaling scenario, that is, when K = I (Section 3.3).

As a special case, one may wonder if any instance (X, C) can be satisfied in R^1. As Figure 11 demonstrates, not all instances can be realized in one dimension. Moreover, we show that it is NP-complete to decide whether a given C can be satisfied in R^1. Given an embedding, it can be verified in polynomial time whether C is satisfied or not by simply computing the distances between all pairs and checking each comparison in C, so the decision problem is in NP. It remains to show that the R^1 partial order embedding problem (hereafter referred to as 1-POE) is NP-Hard. We reduce from the Betweenness problem (Opatrny, 1979), which is known to be NP-complete.

Definition 1 (Betweenness) Given a finite set Z and a collection T of ordered triples (a, b, c) of distinct elements from Z, is there a one-to-one function f : Z → R such that for each (a, b, c) ∈ T, either f(a) < f(b) < f(c) or f(c) < f(b) < f(a)?

Theorem 1 1-POE is NP-Hard.

Proof Let (Z, T) be an instance of Betweenness. Let X = Z, and for each (a, b, c) ∈ T, introduce constraints (a, b, a, c) and (b, c, a, c) to C. Since Euclidean distance in R^1 is simply line distance, these constraints force g(b) to lie between g(a) and g(c). Therefore, the original instance (Z, T) ∈ Betweenness if and only if the new instance (X, C) ∈ 1-POE. Since Betweenness is NP-Hard, 1-POE is NP-Hard as well.
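The construction used in the proof is straightforward to write down; the following sketch (function name ours, comparisons encoded as 4-tuples (i, j, k, ℓ) meaning d(i, j) < d(k, ℓ)) builds a 1-POE instance from a Betweenness instance:

```python
def betweenness_to_1poe(Z, T):
    """Reduce a Betweenness instance (Z, T) to a 1-POE instance (X, C).

    Each triple (a, b, c) in T yields the comparisons d(a, b) < d(a, c) and
    d(b, c) < d(a, c), which force b to lie between a and c on the real line.
    """
    X = list(Z)
    C = []
    for (a, b, c) in T:
        C.append((a, b, a, c))  # d(a, b) < d(a, c)
        C.append((b, c, a, c))  # d(b, c) < d(a, c)
    return X, C
```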

Since 1-POE can be reduced to the general optimization problem of finding an embedding of minimal dimensionality, we can conclude that dimensionality reduction subject to partial order constraints is also NP-Hard.


[Figure 11: panel (a) depicts a square with vertices A, B, C, D; panel (b) depicts the induced partial order over the pairwise distances.]

Figure 11: (a) The vertices of a square in R^2. (b) The partial order over distances induced by the square: each side is less than each diagonal. This constraint set cannot be satisfied in R^1.

7. Conclusion

We have demonstrated a novel method for optimally integrating heterogeneous data to conform to measurements of perceptual similarity. By interpreting a collection of relative similarity comparisons as a directed graph over pairs, we are able to apply graph-theoretic techniques to isolate and prune inconsistencies in the training set and reduce computational overhead by eliminating redundant constraints in the optimization procedure.

Our multiple-kernel formulation offers a principled way to integrate multiple feature modalities into a unified similarity space. The formulation carries the intuitive geometric interpretation of concatenated projections, and results in a semidefinite program. By incorporating diagonal constraints as well, we are able to reduce the computational complexity of the algorithm and learn a model which is both flexible (using each kernel only in the portions of the space where it is informative) and interpretable (each diagonal weight corresponds to the contribution to the optimized space due to a single point within a single feature space). Table 1 provides a unified perspective of multiple kernel learning formulations for embedding problems, but it is clearly not complete. It will be the subject of future work to explore and compare alternative generalizations and restrictions of the formulations presented here.

Acknowledgments

The authors acknowledge support from NSF Grant DMS-MSPA 0625409 and eHarmony, Inc.

Appendix A. Embedding Partial Orders

In this appendix, we prove that any set X with a partial order over distances C can be embedded into R^n while satisfying all distance comparisons.

In the special case where C is a total ordering over all pairs (i.e., a chain graph), the problem reduces to non-metric multidimensional scaling (Kruskal, 1964), and a constraint-satisfying embedding can always be found by the constant-shift embedding algorithm of Roth et al. (2003).


Algorithm 5 Naïve total order construction
Input: objects X, partial order C
Output: symmetric dissimilarity matrix ∆ ∈ R^{n×n}

    for each i in 1...n do
        ∆_ii ← 0
    end for
    for each (k, ℓ) in topological order do
        if in-degree(k, ℓ) = 0 then
            ∆_kℓ, ∆_ℓk ← 1
        else
            ∆_kℓ, ∆_ℓk ← max_{(i,j,k,ℓ)∈C} ∆_ij + 1
        end if
    end for

In general, C is not a total order, but a C-respecting embedding can always be produced by reducing the partial order to a (weak) total order by topologically sorting the graph (see Algorithm 5).
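A minimal Python sketch of Algorithm 5 is given below; it is only an illustration (function name ours), assuming the comparisons are 4-tuples (i, j, k, ℓ) meaning d(i, j) < d(k, ℓ), that the constraint graph over pairs is acyclic, and that networkx is available for the topological sort:

```python
import networkx as nx
import numpy as np


def naive_total_order(n, C):
    """Construct a dissimilarity matrix Delta respecting the partial order C.

    Builds the constraint graph over (unordered) pairs and assigns each pair
    a dissimilarity one larger than the maximum over its predecessors, in
    topological order, as in Algorithm 5.  Pairs that appear in no comparison
    are left at zero here.
    """
    G = nx.DiGraph()
    for (i, j, k, l) in C:
        G.add_edge(tuple(sorted((i, j))), tuple(sorted((k, l))))

    Delta = np.zeros((n, n))
    for (k, l) in nx.topological_sort(G):
        preds = list(G.predecessors((k, l)))
        if not preds:
            Delta[k, l] = Delta[l, k] = 1.0
        else:
            Delta[k, l] = Delta[l, k] = 1.0 + max(Delta[i, j] for (i, j) in preds)
    return Delta
```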

Let ∆ be the dissimilarity matrix produced by Algorithm 5 on an instance (X, C). An embedding can be found by first applying classical multidimensional scaling (MDS) (Cox and Cox, 1994) to ∆:

\[
A = -\frac{1}{2} H \Delta H,
\]

where H = I − (1/n) 1 1^T is the n×n centering matrix, and 1 is a vector of 1s. Shifting the spectrum of A yields

\[
\hat{A} = A - \lambda_n(A) I \succeq 0,
\]

where λ_n(A) is the minimum eigenvalue of A. The embedding g can be found by decomposing \hat{A} = V Λ V^T, so that g(x_i) is the i-th column of Λ^{1/2} V^T; this is the solution constructed by the constant-shift embedding non-metric MDS algorithm of Roth et al. (2003).

Applying this transformation to A affects distances by

\[
\|g(x_i) - g(x_j)\|^2 = \hat{A}_{ii} + \hat{A}_{jj} - 2\hat{A}_{ij}
= (A_{ii} - \lambda_n) + (A_{jj} - \lambda_n) - 2A_{ij}
= A_{ii} + A_{jj} - 2A_{ij} - 2\lambda_n.
\]

Since adding a constant (−2λ_n) preserves the ordering of distances, the total order (and hence C) is preserved by this transformation. Thus, for any instance (X, C), an embedding can be found in R^{n−1}.
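The constant-shift construction above is easy to reproduce in numpy; the following is a sketch under the same notation (function name ours):

```python
import numpy as np


def constant_shift_embedding(Delta):
    """Embed points from a dissimilarity matrix Delta so that the ordering of
    pairwise distances (and hence any total order produced by Algorithm 5)
    is preserved."""
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 1 1^T
    A = -0.5 * H @ Delta @ H                 # classical MDS
    A_hat = A - np.linalg.eigvalsh(A).min() * np.eye(n)   # spectral shift, PSD
    eigvals, V = np.linalg.eigh(A_hat)
    eigvals = np.clip(eigvals, 0.0, None)    # guard against numerical negatives
    return V * np.sqrt(eigvals)              # row i is the embedded point g(x_i)
```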

Appendix B. Solver

Our implementation of Algorithm 4 is based on a simple projected (sub)gradient descent. To simplify exposition, we show the derivation of the single-kernel SDP version of the algorithm (Algorithm 3) with unit margins. (It is straightforward to extend the derivation to the multiple-kernel and LP settings.)


We first observe that a kernel matrix column K_i can be expressed as K^T e_i, where e_i is the i-th standard basis vector. We can then denote the distance calculations in terms of Frobenius inner products:

\[
d(x_i, x_j) = (K_i - K_j)^T W (K_i - K_j)
= (e_i - e_j)^T K W K (e_i - e_j)
= \mathrm{tr}\bigl(K W K (e_i - e_j)(e_i - e_j)^T\bigr)
= \mathrm{tr}(W K E_{ij} K)
= \langle W, K E_{ij} K \rangle_F,
\]

where E_{ij} = (e_i - e_j)(e_i - e_j)^T.
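As a quick numerical check of this identity (a throwaway numpy sketch with synthetic K and W, not part of the method itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 4))
K = X @ X.T                               # a PSD kernel matrix
L = rng.normal(size=(n, n))
W = L @ L.T                               # a PSD matrix W

i, j = 0, 3
e = np.zeros(n)
e[i], e[j] = 1.0, -1.0
E_ij = np.outer(e, e)                     # E_ij = (e_i - e_j)(e_i - e_j)^T

d_direct = (K[:, i] - K[:, j]) @ W @ (K[:, i] - K[:, j])
d_frobenius = np.sum(W * (K @ E_ij @ K))  # <W, K E_ij K>_F
assert np.isclose(d_direct, d_frobenius)
```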

A margin constraint (i, j, k, ℓ) can now be expressed as:

\[
d(x_i, x_j) + 1 \le d(x_k, x_\ell) + \xi_{ijk\ell}
\;\Rightarrow\;
\langle W, K E_{ij} K \rangle_F + 1 \le \langle W, K E_{k\ell} K \rangle_F + \xi_{ijk\ell}
\;\Rightarrow\;
\xi_{ijk\ell} \ge 1 + \langle W, K(E_{ij} - E_{k\ell}) K \rangle_F.
\]

The slack variables ξ_{ijkℓ} can be eliminated from the program by rewriting the objective in terms of the constraints:

\[
\min_{W \succeq 0} f(W), \quad \text{where} \quad
f(W) = \mathrm{tr}(WK) + \frac{\beta}{|C|} \sum_{C} h\!\left( 1 + \langle W, K(E_{ij} - E_{k\ell}) K \rangle_F \right),
\]

and

\[
h(x) = \begin{cases} 0 & x \le 0 \\ x & x > 0 \end{cases}
\]

is the hinge loss.

The gradient ∇f has two components: one due to regularization, and one due to the hinge loss.

The gradient due to regularization is simply K. The loss term decomposes linearly, and for each (i, j, k, ℓ) ∈ C, a sub-gradient direction can be defined:

\[
\frac{\partial}{\partial W}\, h\bigl( 1 + d(x_i, x_j) - d(x_k, x_\ell) \bigr) =
\begin{cases}
0 & d(x_i, x_j) + 1 \le d(x_k, x_\ell) \\
K(E_{ij} - E_{k\ell})K & \text{otherwise.}
\end{cases}
\tag{11}
\]

Rather than computing each gradient direction independently, we observe that each violated constraint contributes a matrix of the form K(E_{ij} − E_{kℓ})K. By linearity, we can collect all (E_{ij} − E_{kℓ}) terms and then pre- and post-multiply by K to obtain a more efficient calculation of ∇f:

\[
\frac{\partial f}{\partial W} = K + \frac{\beta}{|C|}\, K \left( \sum_{(i,j,k,\ell)} \left( E_{ij} - E_{k\ell} \right) \right) K,
\]

where the sum is taken over the set of all currently violated constraints.

After each gradient step W ↦ W − α∇f, the updated W is projected back onto the set of positive semidefinite matrices by computing its spectral decomposition and thresholding the eigenvalues by λ_i ↦ max(0, λ_i).
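The resulting update is compact enough to sketch directly in numpy. The following single-kernel, unit-margin illustration (variable and function names ours) computes the collected subgradient and the PSD projection described above:

```python
import numpy as np


def projected_subgradient_step(W, K, C, beta, alpha):
    """One step of projected subgradient descent on
    f(W) = tr(WK) + (beta/|C|) sum_C h(1 + <W, K(E_ij - E_kl)K>_F)."""
    n = K.shape[0]

    def E(i, j):
        e = np.zeros(n)
        e[i], e[j] = 1.0, -1.0
        return np.outer(e, e)

    # Collect (E_ij - E_kl) over violated constraints, then multiply by K once.
    S = np.zeros((n, n))
    for (i, j, k, l) in C:
        d_ij = np.sum(W * (K @ E(i, j) @ K))
        d_kl = np.sum(W * (K @ E(k, l) @ K))
        if d_ij + 1.0 > d_kl:                      # margin constraint violated
            S += E(i, j) - E(k, l)
    grad = K + (beta / len(C)) * (K @ S @ K)

    # Gradient step, then projection onto the PSD cone by eigenvalue thresholding.
    W_new = W - alpha * grad
    eigvals, V = np.linalg.eigh(W_new)
    return (V * np.maximum(eigvals, 0.0)) @ V.T
```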


To extend this derivation to the multiple-kernel case (Algorithm 4), we can define

\[
d(x_i, x_j) \doteq \sum_{p=1}^{m} d_p(x_i, x_j),
\]

and exploit linearity to compute each partial derivative ∂/∂W_p independently.

For the diagonally-constrained case, it suffices to substitute

\[
K(E_{ij} - E_{k\ell})K \;\mapsto\; \mathrm{diag}\bigl( K(E_{ij} - E_{k\ell})K \bigr)
\]

in Equation 11. After each gradient step in the diagonal case, the PSD constraint on W can be enforced by the projection W_{ii} ↦ max(0, W_{ii}).
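For the diagonal case, only two lines of the sketch above change (again, an illustration, not the authors' code): the per-constraint term is replaced by its diagonal, and the projection clips each diagonal entry at zero.

```python
import numpy as np

# Diagonal substitution for a single violated constraint (cf. Equation 11):
# term = np.diag(np.diag(K @ (E(i, j) - E(k, l)) @ K))

def project_diagonal(W):
    """Enforce the PSD constraint for a diagonal W by clipping: W_ii <- max(0, W_ii)."""
    return np.diag(np.clip(np.diag(W), 0.0, None))
```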

Appendix C. Relationship to AUC

In this appendix, we formalize the connection between partial orders over distances and query-by-example ranking. Recall that Algorithm 2 minimizes the loss (1/|C|) Σ_C ξ_{ijkℓ}, where each ξ_{ijkℓ} ≥ 0 is a slack variable associated with a margin constraint

\[
d(i, j) + 1 \le d(k, \ell) + \xi_{ijk\ell}.
\]

As noted by Schultz and Joachims (2004), the fraction of relative comparisons satisfied by an embedding g is closely related to the area under the receiver operating characteristic curve (AUC). To make this connection precise, consider the following information retrieval problem. For each point x_i ∈ X, we are given a partition of X \ {x_i}:

\[
X_i^+ = \{ x_j : x_j \in X \text{ relevant for } x_i \}, \quad \text{and} \quad
X_i^- = \{ x_k : x_k \in X \text{ irrelevant for } x_i \}.
\]

If we embed each x_i ∈ X into a Euclidean space, we can then rank the rest of the data X \ {x_i} by increasing distance from x_i. Truncating this ranked list at the top τ elements (i.e., the closest τ points to x_i) will return a certain fraction of relevant points (true positives) and irrelevant points (false positives). Averaging over all values of τ defines the familiar AUC score, which can be compactly expressed as:

\[
\mathrm{AUC}(x_i \mid g) = \frac{1}{|X_i^+| \cdot |X_i^-|}
\sum_{(x_j, x_k) \in X_i^+ \times X_i^-}
\mathbb{1}\left[ \|g(x_i) - g(x_j)\| < \|g(x_i) - g(x_k)\| \right].
\]

Intuitively, AUC can be interpreted as an average over all pairs (x_j, x_k) ∈ X_i^+ × X_i^- of the number of times x_i was mapped closer to a relevant point x_j than to an irrelevant point x_k. This in turn can be conveniently expressed by a set of relative comparisons for each x_i ∈ X:

\[
\forall (x_j, x_k) \in X_i^+ \times X_i^- : \quad (i, j, i, k).
\]

An embedding which satisfies a complete set of constraints of this form will receive an AUC score of 1, since every relevant point must be closer to x_i than every irrelevant point.

Now, returning to the more general setting, we do not assume binary relevance scores or complete observations of relevance for all pairs of points. However, we can define the generalized AUC score (GAUC) as simply the average number of correctly ordered pairs (equivalently, satisfied constraints) given a set of relative comparisons:

\[
\mathrm{GAUC}(g) = \frac{1}{|C|} \sum_{(i,j,k,\ell) \in C}
\mathbb{1}\left[ \|g(x_i) - g(x_j)\| < \|g(x_k) - g(x_\ell)\| \right].
\]

Like AUC, GAUC is bounded between 0 and 1, and the two scores coincide exactly in the previously described ranking problem. A corresponding loss function can be defined by reversing the order of the inequality, that is,

\[
L_{\mathrm{GAUC}}(g) = \frac{1}{|C|} \sum_{(i,j,k,\ell) \in C}
\mathbb{1}\left[ \|g(x_i) - g(x_j)\| \ge \|g(x_k) - g(x_\ell)\| \right].
\]

Note that L_GAUC takes the form of a sum over indicators, and can be interpreted as the average 0/1-loss over C. This function is clearly not convex in g, and is therefore difficult to optimize. Algorithms 2, 3 and 4 instead optimize a convex upper bound on L_GAUC by replacing the indicators with the hinge loss.

As in SVMs, this is accomplished by introducing a unit margin and slack variable ξ_{ijkℓ} for each (i, j, k, ℓ) ∈ C, and minimizing (1/|C|) Σ_C ξ_{ijkℓ}.
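The GAUC score is a direct average of indicator values and can be computed in a few lines (a sketch; here g maps item indices to embedded vectors, and C holds 4-tuples (i, j, k, ℓ)):

```python
import numpy as np


def gauc(g, C):
    """Fraction of comparisons (i, j, k, l) in C with
    ||g[i] - g[j]|| < ||g[k] - g[l]||."""
    satisfied = sum(
        np.linalg.norm(g[i] - g[j]) < np.linalg.norm(g[k] - g[l])
        for (i, j, k, l) in C
    )
    return satisfied / len(C)
```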

References

Sameer Agarwal, Joshua Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multi-dimensional scaling. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2007.

Alfred V. Aho, Michael R. Garey, and Jeffrey D. Ullman. The transitive reduction of a directed graph. SIAM Journal on Computing, 1(2):131–137, 1972. doi: 10.1137/0201008. URL http://link.aip.org/link/?SMJ/1/131/1.

Anelia Angelova. Data pruning. Master's thesis, California Institute of Technology, 2004.

Anelia Angelova, Yaser Abu-Mostafa, and Pietro Perona. Pruning training sets for learning of object categories. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 1:494–501, 2005. ISSN 1063-6919. doi: http://doi.ieeecomputersociety.org/10.1109/CVPR.2005.283.

Dana Angluin and Philip Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988. ISSN 0885-6125. doi: http://dx.doi.org/10.1023/A:1022873112823.

Francis R. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179–1225, 2008. ISSN 1532-4435.

Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux, and Marie Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

Bonnie Berger and Peter W. Shor. Approximation algorithms for the maximum acyclic subgraph problem. In SODA '90: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 236–243, Philadelphia, PA, USA, 1990. Society for Industrial and Applied Mathematics. ISBN 0-89871-251-3.

Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 81–88, 2004.

Ingwer Borg and Patrick J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, second edition, 2005.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Oscar Celma. Music Recommendation and Discovery in the Long Tail. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2008.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 396–404. 2009.

Trevor F. Cox and Michael A.A. Cox. Multidimensional Scaling. Chapman and Hall, 1994.

Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 209–216, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi: http://doi.acm.org/10.1145/1273496.1273523.

Steven B. Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. pages 65–74, 1990.

Daniel P.W. Ellis, Brian Whitman, Adam Berenzweig, and Steve Lawrence. The quest for ground truth in musical artist similarity. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR2002), pages 170–177, October 2002.

Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979. ISBN 0716710447.

Jan-Mark Geusebroek, Gertjan J. Burghouts, and Arnold W. M. Smeulders. The Amsterdam library of object images. Int. J. Comput. Vis., 61(1):103–112, 2005.

Amir Globerson and Sam Roweis. Metric learning by collapsing classes. In Yair Weiss, Bernhard Schölkopf, and John Platt, editors, Advances in Neural Information Processing Systems 18, pages 451–458, Cambridge, MA, 2006. MIT Press.

Amir Globerson and Sam Roweis. Visualizing pairwise similarity via semidefinite embedding. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2007.

Jacob Goldberger, Sam Roweis, Geoffrey Hinton, and Ruslan Salakhutdinov. Neighborhood components analysis. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520, Cambridge, MA, 2005. MIT Press.

Saketha Nath Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-Tal, and Ramakrishnan K.R. On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 844–852. 2009.

Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004. ISSN 1533-7928.

Maurice Kendall and Jean Dickinson Gibbons. Rank Correlation Methods. Oxford University Press, 1990.

Marius Kloft, Ulf Brefeld, Soeren Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and accurate lp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 997–1005. 2009.

Joseph B. Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2), June 1964.

Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004. ISSN 1533-7928.

Yen-Yu Lin, Tyng-Luh Liu, and Chiou-Shann Fuh. Dimensionality reduction for data in multiple feature representations. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 961–968. 2009.

Brian McFee and Gert R. G. Lanckriet. Heterogeneous embedding for subjective artist similarity. In Tenth International Symposium for Music Information Retrieval (ISMIR2009), October 2009a.

Brian McFee and Gert R. G. Lanckriet. Partial order embedding with multiple kernels. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), pages 721–728, June 2009b.

Jaroslav Opatrny. Total ordering problem. SIAM J. Computing, (8):111–114, 1979.

Volker Roth, Julian Laub, Joachim M. Buhmann, and Klaus-Robert Müller. Going metric: denoising pairwise data. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 809–816, Cambridge, MA, 2003. MIT Press.

Bernhard Schölkopf, Ralf Herbrich, Alex J. Smola, and Robert Williamson. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, pages 416–426, 2001.

Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. MIT Press.

Noam Shental, Tomer Hertz, Daphna Weinshall, and Misha Pavel. Adjustment learning and relevant components analysis. In European Conference on Computer Vision, 2002.

Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. J. Mach. Learn. Res., 7:1531–1565, 2006. ISSN 1532-4435.

Warren S. Torgerson. Multidimensional scaling: 1. theory and method. Psychometrika, 17:401–419, 1952.

Douglas Turnbull, Luke Barrington, David Torres, and Gert Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 16(2):467–476, February 2008.

Laurens van der Maaten and Geoffrey Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Alexander Vezhnevets and Olga Barinova. Avoiding boosting overfitting by removing confusing samples. In ECML 2007, pages 430–441, 2007.

Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 577–584, 2001.

Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Yair Weiss, Bernhard Schölkopf, and John Platt, editors, Advances in Neural Information Processing Systems 18, pages 451–458, Cambridge, MA, 2006. MIT Press.

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 505–512, Cambridge, MA, 2003. MIT Press.

Alexander Zien and Cheng Soon Ong. Multiclass multiple kernel learning. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 1191–1198, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi: http://doi.acm.org/10.1145/1273496.1273646.
