
The Annals of Statistics
2013, Vol. 41, No. 3, 1381–1405
DOI: 10.1214/13-AOS1095
© Institute of Mathematical Statistics, 2013

FAST LEARNING RATE OF MULTIPLE KERNEL LEARNING: TRADE-OFF BETWEEN SPARSITY AND SMOOTHNESS

BY TAIJI SUZUKI¹ AND MASASHI SUGIYAMA²

University of Tokyo and Tokyo Institute of Technology

We investigate the learning rate of multiple kernel learning (MKL) with $\ell_1$ and elastic-net regularizations. The elastic-net regularization is a composition of an $\ell_1$-regularizer for inducing sparsity and an $\ell_2$-regularizer for controlling smoothness. We focus on a sparse setting where the total number of kernels is large but the number of nonzero components of the ground truth is relatively small, and show sharper convergence rates than have previously been established for both $\ell_1$ and elastic-net regularizations. Our analysis reveals some relations between the choice of a regularization function and the performance. If the ground truth is smooth, we show a faster convergence rate for the elastic-net regularization under fewer conditions than $\ell_1$-regularization; otherwise, a faster convergence rate for the $\ell_1$-regularization is shown.

Received December 2011; revised January 2013.
¹Supported in part by MEXT KAKENHI 22700289, and the Aihara project, the FIRST program from JSPS, initiated by CSTP.
²Supported in part by the FIRST program.
MSC2010 subject classifications. Primary 62G08, 62F12; secondary 62J07.
Key words and phrases. Sparse learning, restricted isometry, elastic-net, multiple kernel learning, additive model, reproducing kernel Hilbert spaces, convergence rate, smoothness.

1. Introduction. Learning with kernels such as support vector machines has been demonstrated to be a promising approach, given that kernels were chosen appropriately [Schölkopf and Smola (2002), Shawe-Taylor and Cristianini (2004)]. So far, various strategies have been employed for choosing appropriate kernels, ranging from simple cross-validation [Chapelle et al. (2002)] to more sophisticated "kernel learning" approaches [Ong, Smola and Williamson (2005), Argyriou et al. (2006), Bach (2009), Cortes, Mohri and Rostamizadeh (2009a), Varma and Babu (2009)].

Multiple kernel learning (MKL) is one of the systematic approaches to learning kernels, which tries to find the optimal linear combination of prefixed base kernels by convex optimization [Lanckriet et al. (2004)]. The seminal paper by Bach, Lanckriet and Jordan (2004) showed that this linear-combination MKL formulation can be interpreted as $\ell_1$-mixed-norm regularization (i.e., the sum of the norms of the base kernels). Based on this interpretation, several variations of MKL were proposed, and promising performance was achieved by "intermediate" regularization strategies between the sparse ($\ell_1$) and dense ($\ell_2$) regularizers, for example, a mixture of $\ell_1$-mixed-norm and $\ell_2$-mixed-norm called the elastic-net regularization [Shawe-Taylor (2008), Tomioka and Suzuki (2009)] and $\ell_p$-mixed-norm regularization with $1 < p < 2$ [Micchelli and Pontil (2005), Kloft et al. (2009)].

Together with the active development of practical MKL optimization algorithms, theoretical analysis of MKL has also been extensively conducted. For $\ell_1$-mixed-norm MKL, Koltchinskii and Yuan (2008) established the learning rate $d^{(1-s)/(1+s)} n^{-1/(1+s)} + d\log(M)/n$ under rather restrictive conditions, where $n$ is the number of samples, $d$ is the number of nonzero components of the ground truth, $M$ is the number of kernels and $s$ ($0 < s < 1$) is a constant representing the complexity of the reproducing kernel Hilbert spaces (RKHSs). Their conditions include a smoothness assumption on the ground truth. For elastic-net regularization (which we call elastic-net MKL), Meier, van de Geer and Bühlmann (2009) gave a near optimal convergence rate $d(n/\log(M))^{-1/(1+s)}$. Recently, Koltchinskii and Yuan (2010) showed that MKL with a variant of $\ell_1$-mixed-norm regularization (which we call L1-MKL) achieves the minimax optimal convergence rate $dn^{-1/(1+s)} + d\log(M)/n$, which captures a sharper dependency with respect to $\log(M)$ than the bound of Meier, van de Geer and Bühlmann (2009). Another line of research considers the case where the ground truth is not sparse, and bounds the Rademacher complexity of a candidate kernel class by a pseudo-dimension of the kernel class [Srebro and Ben-David (2006), Ying and Campbell (2009), Cortes, Mohri and Rostamizadeh (2009b), Kloft, Rückert and Bartlett (2010)]. Fast learning rates of MKL in nonsparse settings are given by Kloft and Blanchard (2012) for $\ell_p$-mixed-norm regularization and by Suzuki (2011a, 2011b) for regularizations corresponding to arbitrary monotonically increasing norms.

In this paper, we focus on the sparse setting (i.e., the total number of kernels is large, but the number of nonzero components of the ground truth is relatively small), and derive sharp learning rates for both L1-MKL and elastic-net MKL. Our new learning rates,

$$d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} + \frac{d\log(M)}{n} \qquad \text{(L1-MKL)},$$

$$d^{(1+q)/(1+q+s)} n^{-(1+q)/(1+q+s)} R_{2,g^*}^{2s/(1+q+s)} + \frac{d\log(M)}{n} \qquad \text{(Elastic-net MKL)},$$

are faster than all the existing bounds, where $R_{1,f^*}$ is the $\ell_1$-mixed-norm of the truth, $R_{2,g^*}$ is a kind of $\ell_2$-mixed-norm of the truth and $q$ ($0 \le q \le 1$) is a constant depending on the smoothness of the ground truth.

Our contributions are summarized as follows:

(a) The sharpest existing bound for L1-MKL, given by Koltchinskii and Yuan (2010), achieves the minimax rate on the $\ell_\infty$-mixed-norm ball [Raskutti, Wainwright and Yu (2009, 2012)]. Our work follows this line and shows that the learning rates for L1-MKL and elastic-net MKL further achieve the minimax rates on the $\ell_1$-mixed-norm ball and the $\ell_2$-mixed-norm ball, respectively, both of which are faster than that on the $\ell_\infty$-mixed-norm ball. This result implies that the bound by Koltchinskii and Yuan (2010) is tight only when the ground truth is evenly spread over the nonzero components.

(b) We incorporate the smoothness $q$ of the ground truth into our learning rate, where the ground truth is said to be smooth if it is represented as a convolution of a certain function and an integral kernel; see Assumption 2. Intuitively, for larger $q$, the truth is smoother. We show that elastic-net MKL properly makes use of the smoothness of the truth: the smoother the truth is, the faster the convergence rate of elastic-net MKL becomes. That is, the resultant convergence rate of elastic-net MKL becomes as if the complexity of the RKHSs were $\frac{s}{1+q}$ instead of the true complexity $s$. Meier, van de Geer and Bühlmann (2009) and Koltchinskii and Yuan (2010) assumed $q = 0$, and Koltchinskii and Yuan (2008) considered the situation $q = 1$. Our analysis covers both of those situations and is more general since any $0 \le q \le 1$ is allowed.

(c) We investigate the relation between the sparsity and the smoothness. Roughly speaking, L1-MKL generates a sparser solution while elastic-net MKL generates a smoother solution. When the smoothness $q$ of the truth is small (say $q = 0$), we give a faster convergence rate for L1-MKL than for elastic-net MKL. On the other hand, if the truth is smooth, elastic-net MKL can make use of the smoothness of the truth. In that situation, the learning rate of elastic-net MKL can be faster than that of L1-MKL.

The relation between our analysis and existing analyses is summarized in Table 1.

2. Preliminaries. In this section, we formulate elastic-net MKL and summarize the mathematical tools that are needed for our theoretical analysis.

TABLE 1
Relation between our analysis and existing analyses

              Penalty        Smoothness (q)     Minimax optimality     Convergence rate
  KY (2008)   $\ell_1$       $q = 1$            ?                      $d^{(1-s)/(1+s)} n^{-1/(1+s)} + \frac{d\log(M)}{n}$
  MGB (2009)  elastic-net    $q = 0$            ×                      $(\frac{\log(M)}{n})^{1/(1+s)} (d + R_{2,g^*}^2)$
  KY (2010)   $\ell_1$       $q = 0$            $\ell_\infty$-ball     $\frac{d + R_{1,f^*}}{n^{1/(1+s)}} + \frac{d\log(M)}{n}$
  This paper  elastic-net    $0 \le q \le 1$    $\ell_2$-ball          $(\frac{d}{n})^{(1+q)/(1+q+s)} R_{2,g^*}^{2s/(1+q+s)} + \frac{d\log(M)}{n}$
              $\ell_1$       $q = 0$            $\ell_1$-ball          $\frac{d^{(1-s)/(1+s)}}{n^{1/(1+s)}} R_{1,f^*}^{2s/(1+s)} + \frac{d\log(M)}{n}$


2.1. Formulation. Suppose we are given $n$ samples $\{(x_i, y_i)\}_{i=1}^n$, where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i \in \mathbb{R}$. We denote the marginal distribution of $X$ by $\Pi$. We consider an MKL regression problem in which the unknown target function is represented as $f(x) = \sum_{m=1}^M f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$ ($m = 1, \dots, M$) with a kernel $k_m$ over $\mathcal{X} \times \mathcal{X}$.

The elastic-net MKL we consider in this paper is the version considered in Meier, van de Geer and Bühlmann (2009),

$$\hat f = \mathop{\arg\min}_{f_m \in \mathcal{H}_m\,(m=1,\dots,M)}\ \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{m=1}^M f_m(x_i)\Big)^2 + \sum_{m=1}^M \big(\lambda_1^{(n)}\|f_m\|_n + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m} + \lambda_3^{(n)}\|f_m\|_{\mathcal{H}_m}^2\big), \qquad (1)$$

where $\|f_m\|_n := \sqrt{\frac{1}{n}\sum_{i=1}^n f_m(x_i)^2}$ and $\|f_m\|_{\mathcal{H}_m}$ is the RKHS norm of $f_m$ in $\mathcal{H}_m$. The regularizer is the mixture of the $\ell_1$-term $\sum_{m=1}^M (\lambda_1^{(n)}\|f_m\|_n + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m})$ and the $\ell_2$-term $\sum_{m=1}^M \lambda_3^{(n)}\|f_m\|_{\mathcal{H}_m}^2$. In that sense, we say that the regularizer is of the elastic-net type³ [Zou and Hastie (2005)]. Here the $\ell_1$-term is a mixture of the empirical $L_2$-norm $\|f_m\|_n$ and the RKHS norm $\|f_m\|_{\mathcal{H}_m}$. Koltchinskii and Yuan (2010) considered $\ell_1$-regularization that contains only the $\ell_1$-term: $\sum_m (\lambda_1^{(n)}\|f_m\|_n + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m})$. To distinguish the situations of $\lambda_3^{(n)} = 0$ and $\lambda_3^{(n)} > 0$, we refer to the learning method (1) with $\lambda_3^{(n)} = 0$ as L1-MKL and that with $\lambda_3^{(n)} > 0$ as elastic-net MKL.

By the representer theorem [Kimeldorf and Wahba (1971)], the solution $\hat f$ can be expressed as a linear combination of $nM$ kernels: $\exists \alpha_{m,i} \in \mathbb{R}$, $\hat f_m(x) = \sum_{i=1}^n \alpha_{m,i} k_m(x, x_i)$. Thus, using the Gram matrix $K_m = (k_m(x_i, x_j))_{i,j}$, the regularizer in (1) is expressed as

$$\sum_{m=1}^M \Big(\lambda_1^{(n)}\sqrt{\alpha_m^\top \tfrac{K_m K_m}{n}\,\alpha_m} + \lambda_2^{(n)}\sqrt{\alpha_m^\top K_m \alpha_m} + \lambda_3^{(n)}\,\alpha_m^\top K_m \alpha_m\Big),$$

where $\alpha_m = (\alpha_{m,i})_{i=1}^n \in \mathbb{R}^n$. Thus, we can solve the problem by an SOCP (second-order cone programming) solver as in Bach, Lanckriet and Jordan (2004), by coordinate descent algorithms [Meier, van de Geer and Bühlmann (2008)] or by the alternating direction method of multipliers [Boyd et al. (2011)].

³There is another version of MKL with elastic-net regularization considered in Shawe-Taylor (2008) and Tomioka and Suzuki (2009), that is, $\lambda_2^{(n)}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} + \lambda_3^{(n)}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^2$ (i.e., there is no $\|f_m\|_n$ term in the regularizer). However, we focus on equation (1) because the above one is too loose to properly bound the irrelevant components of the estimated function.
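To make the finite-dimensional form above concrete, the following sketch evaluates the objective in (1) for given coefficient vectors $\alpha_m$ using the Gram matrices $K_m$. It is only an illustration of the parametrization (assuming numpy; the function name and interface are ours, not taken from the cited solvers), not an optimizer.

```python
import numpy as np

def elastic_net_mkl_objective(alphas, grams, y, lam1, lam2, lam3):
    """Objective of (1) in its finite-dimensional form f_m(.) = sum_i alpha_{m,i} k_m(., x_i).

    alphas : list of M arrays of shape (n,)   -- coefficient vectors alpha_m
    grams  : list of M arrays of shape (n, n) -- Gram matrices K_m
    y      : array of shape (n,)              -- responses
    lam1, lam2, lam3 : regularization parameters lambda_1^(n), lambda_2^(n), lambda_3^(n)
    """
    n = y.shape[0]
    pred = sum(K @ a for K, a in zip(grams, alphas))   # sum_m f_m evaluated at the sample points
    loss = np.mean((y - pred) ** 2)

    penalty = 0.0
    for K, a in zip(grams, alphas):
        Ka = K @ a
        emp_norm = np.sqrt(Ka @ Ka / n)                # ||f_m||_n = sqrt(alpha' K K alpha / n)
        rkhs_sq = max(float(a @ Ka), 0.0)              # ||f_m||_{H_m}^2 = alpha' K alpha
        penalty += lam1 * emp_norm + lam2 * np.sqrt(rkhs_sq) + lam3 * rkhs_sq
    return loss + penalty
```

Any generic solver could in principle be driven by this objective, although the SOCP, coordinate-descent and ADMM approaches cited above exploit its structure more efficiently.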


2.2. Notation and assumptions. Here, we present several assumptions used in our theoretical analysis and prepare notation.

Let $\mathcal{H} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$. We utilize the same notation $f \in \mathcal{H}$ to indicate both the vector $(f_1, \dots, f_M)$ and the function $f = \sum_{m=1}^M f_m$ ($f_m \in \mathcal{H}_m$). This is a slight abuse of notation because the decomposition $f = \sum_{m=1}^M f_m$ might not be unique as an element of $L_2(\Pi)$. However, this will not cause any confusion. We denote by $f^* \in \mathcal{H}$ the ground truth satisfying the following assumption (the decomposition $f^* = \sum_{m=1}^M f_m^*$ of the truth might not be unique, but we fix one possibility).

ASSUMPTION 1 (Basic assumptions).

(A1-1) There exists $f^* = (f_1^*, \dots, f_M^*) \in \mathcal{H}$ such that $\mathrm{E}[Y|X] = \sum_{m=1}^M f_m^*(X)$, and the noise $\varepsilon_i := y_i - f^*(x_i)$ is bounded as $|\varepsilon_i| \le L$ (a.s.).

(A1-2) For each $m = 1, \dots, M$, the kernel function $k_m$ is continuous and $\sup_{X \in \mathcal{X}} |k_m(X, X)| \le 1$.

The first assumption in (A1-1) ensures that the model $\mathcal{H}$ is correctly specified, and the technical assumption $|\varepsilon_i| \le L$ allows $\varepsilon_i f$ to be Lipschitz continuous with respect to $f$. The assumption of correct specification can be relaxed to misspecified settings, and the bounded noise can be replaced with i.i.d. Gaussian noise as in Raskutti, Wainwright and Yu (2012). However, for the sake of simplicity, we assume these conditions. It is known that assumption (A1-2) gives the relation $\|f_m\|_\infty \le \|f_m\|_{\mathcal{H}_m}$; see Chapter 4 of Steinwart and Christmann (2008).

Let the integral operator $T_m : L_2(\Pi) \to L_2(\Pi)$ corresponding to a kernel function $k_m$ be

$$T_m f = \int k_m(\cdot, x) f(x)\, d\Pi(x).$$

It is known that this operator is compact, positive and self-adjoint [see Theorem 4.27 of Steinwart and Christmann (2008)], and hence the spectral theorem shows that there exist an at most countable orthonormal system $\{\phi_{\ell,m}\}_{\ell=1}^\infty$ and eigenvalues $\{\mu_{\ell,m}\}_{\ell=1}^\infty$ such that

$$T_m f = \sum_{\ell=1}^\infty \mu_{\ell,m}\,\langle \phi_{\ell,m}, f\rangle_{L_2(\Pi)}\,\phi_{\ell,m} \qquad (2)$$

for $f \in L_2(\Pi)$. Here we assume $\{\mu_{\ell,m}\}_{\ell=1}^\infty$ is sorted in descending order, that is, $\mu_{1,m} \ge \mu_{2,m} \ge \mu_{3,m} \ge \cdots \ge 0$. Associated with $T_m$, we can define an operator $\tilde{T}_m : \mathcal{H}_m \to \mathcal{H}_m$ as

$$\langle f_m', \tilde{T}_m f_m\rangle_{\mathcal{H}_m} = \mathrm{E}\big[f_m'(X) f_m(X)\big] = \Big\langle f_m', \int k_m(\cdot, x) f_m(x)\, d\Pi(x)\Big\rangle_{\mathcal{H}_m}.$$

For the canonical inclusion map $\iota_m : \mathcal{H}_m \to L_2(\Pi)$, one can check that the following commutative relation holds between $\mathcal{H}_m$ and $L_2(\Pi)$:

$$T_m \iota_m f_m = \iota_m \tilde{T}_m f_m.$$

Thus we use the same symbol $T_m$ to refer to both operators.

Due to Mercer's theorem [Ferreira and Menegatto (2009)], $k_m$ has the following spectral expansion:

$$k_m(x, x') = \sum_{k=1}^\infty \mu_{k,m}\,\phi_{k,m}(x)\,\phi_{k,m}(x'),$$

where the convergence is absolute and uniform. Thus, the inner product of the RKHS $\mathcal{H}_m$ can be expressed as $\langle f_m, g_m\rangle_{\mathcal{H}_m} = \sum_{k=1}^\infty \mu_{k,m}^{-1}\langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)}\langle \phi_{k,m}, g_m\rangle_{L_2(\Pi)}$.

The following assumption concerns the smoothness of the true function $f_m^*$.

ASSUMPTION 2 (Convolution assumption). There exist a real number $0 \le q \le 1$ and $g_m^* \in \mathcal{H}_m$ such that

$$f_m^* = T_m^{q/2} g_m^*. \qquad \text{(A2)}$$

We denote $(g_1^*, \dots, g_M^*)$ and $\sum_{m=1}^M g_m^*$ by $g^*$ (we use the same notation for both the "vector" and "function" representations with a slight abuse of notation). The constant $q$ represents the smoothness of the truth $f_m^*$ because $f_m^*$ is generated by applying the integral operator $T_m^{q/2}$ to $g_m^*$ ($f_m^*(x) = \sum_{\ell=1}^\infty \mu_{\ell,m}^{q/2}\langle \phi_{\ell,m}, g_m^*\rangle_{L_2(\Pi)}\phi_{\ell,m}(x)$), and high-frequency components are suppressed as $q$ becomes large. Therefore, as $q$ becomes larger, $f^*$ becomes "smoother." Assumption (A2) was considered in Caponnetto and De Vito (2007) to analyze the convergence rate of least-squares estimators in a single kernel setting. In MKL settings, Koltchinskii and Yuan (2008) showed a fast learning rate of MKL assuming $q = 1$, and Bach (2008) showed the consistency of MKL under $q = 1$. Proposition 9 of Bach (2008) gave a sufficient condition to fulfill (A2) with $q = 1$ for translation invariant kernels $k_m(x, x') = h_m(x - x')$. Meier, van de Geer and Bühlmann (2009) considered a situation with $q = 0$ on Sobolev spaces; the analysis of Koltchinskii and Yuan (2010) also corresponds to $q = 0$. Note that (A2) with $q = 0$ imposes nothing on the smoothness of the truth, and our analysis also covers this case.


We show in Appendix A that, as $q$ increases, the space of functions that satisfy (A2) becomes "simpler." Thus, it might be natural to expect that, under convolution assumption (A2), the learning rate becomes faster as $q$ increases. Although this conjecture is actually true, it is not obvious because the convolution assumption only restricts the ground truth, not the search space.

Next we introduce a parameter representing the complexity of RKHSs. By Theorem 4.27 of Steinwart and Christmann (2008), the sum of $\mu_{\ell,m}$ is bounded ($\sum_\ell \mu_{\ell,m} < \infty$), and thus $\mu_{\ell,m}$ decreases faster than order $\ell^{-1}$ ($\mu_{\ell,m} = o(\ell^{-1})$). We further assume that the sequence of eigenvalues converges to zero even faster.

ASSUMPTION 3 (Spectral assumption). There exist $0 < s < 1$ and $c$ such that

$$\mu_{j,m} \le c\, j^{-1/s} \qquad (1 \le \forall j,\ 1 \le \forall m \le M), \qquad \text{(A3)}$$

where $\{\mu_{j,m}\}_{j=1}^\infty$ is the spectrum of the kernel $k_m$; see equation (2).

It was shown that spectral assumption (A3) gives a bound on the entropy number of the RKHSs [Steinwart, Hush and Scovel (2009)]. Remember that the $\varepsilon$-covering number $N(\varepsilon, \mathcal{B}_{\mathcal{G}}, L_2(\Pi))$ with respect to $L_2(\Pi)$ for a Hilbert space $\mathcal{G}$ is the minimal number of balls with radius $\varepsilon$ needed to cover the unit ball $\mathcal{B}_{\mathcal{G}}$ in $\mathcal{G}$ [van der Vaart and Wellner (1996)]. The $i$th entropy number $e_i(\mathcal{G} \to L_2(\Pi))$ is the infimum of $\varepsilon > 0$ for which $N(\varepsilon, \mathcal{B}_{\mathcal{G}}, L_2(\Pi)) \le 2^{i-1}$. If spectral assumption (A3) holds, there exists a constant $\tilde c$ that depends only on $s$ and $c$ such that the $i$th entropy number is bounded as

$$e_i\big(\mathcal{H}_m \to L_2(\Pi)\big) \le \tilde c\, i^{-1/(2s)}, \qquad (3)$$

and the converse is also true; see Theorem 15 of Steinwart, Hush and Scovel (2009) and Steinwart and Christmann (2008) for details. Therefore, if $s$ is large, at least one of the RKHSs is "complex," and if $s$ is small, all the RKHSs are "simple." A more detailed characterization of the entropy number in terms of the spectrum is provided in Appendix A. The entropy number of the space of functions that satisfy the convolution assumption (A2) is also provided there.
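As an illustrative numerical check (not part of the paper's analysis), the polynomial decay in (A3) can be inspected for a concrete kernel by approximating the spectrum of $T_m$ with the eigenvalues of the Gram matrix divided by $n$ and fitting the decay exponent on a log-log scale. The sketch below assumes numpy and uses a Gaussian kernel purely as an example; for such smooth kernels the decay is in fact faster than polynomial, so (A3) holds with a small $s$ and the fitted exponent is only a crude summary.

```python
import numpy as np

def empirical_decay(x, kernel, top=30):
    """Approximate the spectrum of T_m by the eigenvalues of K_m / n and estimate
    the decay rate 1/s in mu_j <= c j^{-1/s} by a log-log least-squares fit."""
    n = x.shape[0]
    K = kernel(x[:, None], x[None, :])                 # Gram matrix (k_m(x_i, x_j))_{i,j}
    mu = np.sort(np.linalg.eigvalsh(K / n))[::-1]      # empirical eigenvalues, descending
    mu_top = mu[:top]
    mu_top = mu_top[mu_top > 1e-12]                    # guard against numerical noise
    j = np.arange(1, mu_top.size + 1)
    slope, _ = np.polyfit(np.log(j), np.log(mu_top), 1)
    return mu, -slope                                  # second value approximates 1/s

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
gauss = lambda a, b: np.exp(-(a - b) ** 2 / 0.5)       # Gaussian kernel, illustration only
mu, inv_s = empirical_decay(x, gauss)
print("fitted decay exponent (approx. 1/s):", inv_s)
```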

Finally, we impose the following technical assumption related to the sup-norm of members of the RKHSs.

ASSUMPTION 4 (Sup-norm assumption). Along with the spectral assumption (A3), there exists a constant $C_1$ such that

$$\|f_m\|_\infty \le C_1 \|f_m\|_{L_2(\Pi)}^{1-s}\,\|f_m\|_{\mathcal{H}_m}^{s} \qquad (\forall f_m \in \mathcal{H}_m,\ m = 1, \dots, M), \qquad \text{(A4)}$$

where $s$ is the exponent defined in spectral assumption (A3).

This assumption might look a bit strong, but it is satisfied if the RKHS is a Sobolev space or is continuously embeddable in a Sobolev space. For example, the RKHSs of Gaussian kernels are continuously embedded in all Sobolev spaces, and thus satisfy sup-norm assumption (A4). More generally, RKHSs with $\gamma$-times continuously differentiable kernels on a closed Euclidean ball in $\mathbb{R}^d$ are also continuously embedded in a Sobolev space, and satisfy the sup-norm assumption (A4) with $s = \frac{d}{2\gamma}$; see Corollary 4.36 of Steinwart and Christmann (2008). Therefore, this assumption is common for practically used kernels. A more general necessary and sufficient condition in terms of real interpolation is shown in Bennett and Sharpley (1988). Steinwart, Hush and Scovel (2009) used this assumption to show the optimal convergence rates for regularized regression with a single kernel function where the true function is not contained in the model, and one can find detailed discussions about the assumption there.

We denote by $I_0$ the indices of the truly active kernels, that is,

$$I_0 := \{m \mid \|f_m^*\|_{\mathcal{H}_m} > 0\}.$$

We define the number of truly active components as $d := |I_0|$. For $f = \sum_{m=1}^M f_m \in \mathcal{H}$ and a subset of indices $I \subseteq \{1, \dots, M\}$, we define $\mathcal{H}_I = \bigoplus_{m\in I}\mathcal{H}_m$, and denote by $f_I \in \mathcal{H}_I$ the restriction of $f$ to the index set $I$, that is, $f_I = \sum_{m\in I} f_m$.

Now we introduce a geometric quantity that represents the dependency between RKHSs. This quantity is related to the restricted eigenvalue condition [Bickel, Ritov and Tsybakov (2009)] and is required to show a nice convergence property of MKL. For a given set of indices $I \subseteq \{1, \dots, M\}$ and $b \ge 0$, we define

$$\beta_b(I) := \sup\Big\{\beta > 0 \;\Big|\; \beta \le \frac{\|\sum_{m=1}^M f_m\|_{L_2(\Pi)}}{\big(\sum_{m\in I} \|f_m\|_{L_2(\Pi)}^2\big)^{1/2}},\ \forall f \in \mathcal{H} \text{ such that } b\sum_{m\in I}\|f_m\|_{L_2(\Pi)} \ge \sum_{m\notin I}\|f_m\|_{L_2(\Pi)}\Big\}.$$

For $I = I_0$, we abbreviate $\beta_b(I_0)$ as $\beta_b := \beta_b(I_0)$. This quantity plays an important role in our analysis. Roughly speaking, it represents the correlation between RKHSs under the condition that the components within the relevant indices $I$ well "dominate" the rest of the components. One can see that $\beta_b(I)$ is nonincreasing with respect to $b$. The quantity $\beta_b$ was first introduced by Bickel, Ritov and Tsybakov (2009) to define the restricted eigenvalue condition in the context of parametric models such as the Lasso and the Dantzig selector. In the context of MKL, Koltchinskii and Yuan (2010) introduced this quantity to analyze the convergence rate of L1-MKL. We will assume that $\beta_b(I_0)$ is bounded from below for some $b > 0$ so that we may focus on bounding the $L_2(\Pi)$-norm of the "low-dimensional" components $\{f_m - f_m^*\}_{m\in I_0}$, instead of all the components.


Here we give a sufficient condition for $\beta_b(I)$ to be bounded from below. For a given set of indices $I \subseteq \{1, \dots, M\}$, we introduce a quantity $\kappa(I)$ representing the correlation of RKHSs inside the indices $I$,

$$\kappa(I) := \sup\Big\{\kappa \ge 0 \;\Big|\; \kappa \le \frac{\|\sum_{m\in I} f_m\|_{L_2(\Pi)}^2}{\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m\in I)\Big\}.$$

Similarly, we define the canonical correlation of RKHSs between $I$ and $I^c$ as follows:

$$\rho(I) := \sup\Big\{\frac{\langle f_I, g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\,\|g_{I^c}\|_{L_2(\Pi)}} \;\Big|\; f_I \in \mathcal{H}_I,\ g_{I^c}\in\mathcal{H}_{I^c},\ f_I \neq 0,\ g_{I^c}\neq 0\Big\}.$$

These quantities give a connection between the $L_2(\Pi)$-norm of $f \in \mathcal{H}$ and the $L_2(\Pi)$-norms of $\{f_m\}_{m\in I}$, as shown in the following lemma. The proof is given in Appendix B.

LEMMA 1. For all $I \subseteq \{1, \dots, M\}$, we have

$$\|f\|_{L_2(\Pi)}^2 \ge \big(1 - \rho(I)^2\big)\kappa(I)\Big(\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2\Big),$$

and thus

$$\beta_\infty(I) \ge \sqrt{\big(1 - \rho(I)^2\big)\kappa(I)}.$$

Koltchinskii and Yuan (2008) and Meier, van de Geer and Bühlmann (2009) analyzed statistical properties of MKL under an incoherence condition where $(1-\rho(I_0)^2)\kappa(I_0)$ is bounded from below, that is, the RKHSs are not too dependent on each other. In this paper, we employ a less restrictive condition where $\beta_b$ is bounded from below for some positive real $b$.
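For intuition only (this is not part of the analysis), one can form a crude empirical proxy for $\rho(I)$ from data: functions in $\mathcal{H}_I$ and $\mathcal{H}_{I^c}$ evaluated at the sample points span two subspaces of $\mathbb{R}^n$, and the largest cosine of a principal angle between them approximates the canonical correlation, with the empirical inner product standing in for $L_2(\Pi)$. The sketch below assumes numpy, truncates each Gram matrix to its leading eigenvectors as a rough regularization, and uses hypothetical names.

```python
import numpy as np

def empirical_rho(grams, I, rank=5):
    """Crude empirical proxy for rho(I): the largest canonical correlation between
    the sample-evaluated spans of {H_m : m in I} and {H_m : m not in I}, with each
    Gram matrix truncated to its `rank` leading eigenvectors as rough regularization."""
    def span_basis(ms):
        cols = []
        for m in ms:
            w, U = np.linalg.eigh(grams[m])
            cols.append(U[:, np.argsort(w)[::-1][:rank]])    # leading eigenvectors of K_m
        Q, _ = np.linalg.qr(np.hstack(cols))                  # orthonormal basis of the joint span
        return Q
    Ic = [m for m in range(len(grams)) if m not in I]
    Q1, Q2 = span_basis(I), span_basis(Ic)
    # Singular values of Q1^T Q2 are the cosines of the principal angles between the
    # two subspaces, i.e., their canonical correlations; the largest one mimics rho(I).
    return float(np.linalg.svd(Q1.T @ Q2, compute_uv=False)[0])
```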

3. Convergence rate analysis. In this section, we present our main result.

3.1. The convergence rate of L1-MKL and elastic-net MKL. Here we derive the learning rate of the estimator $\hat f$ defined by equation (1). We may suppose that the number of kernels $M$ and the number of active kernels $d$ increase with the number of samples $n$. The main purpose of this section is to show that the learning rate can be faster than the existing bounds. The existing bound has already been shown to be optimal on the $\ell_\infty$-mixed-norm ball [Koltchinskii and Yuan (2010), Raskutti, Wainwright and Yu (2012)]. Our claim is that the convergence rates can further achieve the minimax optimal rates on the $\ell_1$-mixed-norm ball and the $\ell_2$-mixed-norm ball, which are faster than that on the $\ell_\infty$-mixed-norm ball.


Define $\eta(t)$ for $t > 0$ and $\xi_n(\lambda)$ for given $\lambda > 0$ as

$$\eta(t) := \max\big(1, \sqrt{t},\, t/\sqrt{n}\big), \qquad \xi_n := \xi_n(\lambda) = \max\Big(\frac{\lambda^{-s/2}}{\sqrt{n}},\ \frac{\lambda^{-1/2}}{n^{1/(1+s)}},\ \sqrt{\frac{\log(M)}{n}}\Big).$$

For a given function $f = \sum_{m=1}^M f_m \in \mathcal{H}$ and $1 \le p \le \infty$, we define the $\ell_p$-mixed-norm of $f$ as

$$R_{p,f} := \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{1/p}.$$

Let

$$b_1 = 16\Big(1 + \frac{\sqrt{d}\,\max_{m\in I_0}\|g_m^*\|_{\mathcal{H}_m}}{R_{2,g^*}}\Big), \qquad b_2 = 16.$$

Then we obtain the convergence rate of L1- and elastic-net MKL as follows.

THEOREM 2 (Convergence rate of L1-MKL and elastic-net MKL). Suppose Assumptions 1–4 are satisfied. Then there exist constants $\tilde C_1$, $\tilde C_2$ and $\psi_s$ depending on $s, c, L, C_1$ such that the following convergence rates hold.

(Elastic-net MKL). Set $\lambda_1^{(n)} = \psi_s\eta(t)\xi_n(\lambda)$, $\lambda_2^{(n)} = \lambda_1^{(n)}\lambda^{1/2}$ and $\lambda_3^{(n)} = \lambda$, where $\lambda = d^{1/(1+q+s)} n^{-1/(1+q+s)} R_{2,g^*}^{-2/(1+q+s)}$. Then, for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and

$$\frac{\tilde C_1}{\beta_{b_1}^2}\,\psi_s \sqrt{n}\,\xi_n(\lambda)^2 d \le 1, \qquad (4)$$

the generalization error of elastic-net MKL is bounded as

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le \frac{\tilde C_2}{\beta_{b_1}^2}\Big(d^{(1+q)/(1+q+s)} n^{-(1+q)/(1+q+s)} R_{2,g^*}^{2s/(1+q+s)} + d^{(q+s)/(1+q+s)} n^{-(1+q)/(1+q+s)-q(1-s)/((1+s)(1+q+s))} R_{2,g^*}^{2/(1+q+s)} + \frac{d\log(M)}{n}\Big)\eta(t)^2, \qquad (5)$$

with probability $1 - \exp(-t) - \exp\big(-\min\big\{\frac{\beta_{b_1}^4\log(M)}{\tilde C_1^2\psi_s^2 n\,\xi_n(\lambda)^4 d^2},\ \frac{\beta_{b_1}^2}{\tilde C_1\psi_s\,\xi_n(\lambda)^2 d}\big\}\big)$ for all $t \ge 1$.

(L1-MKL). Set $\lambda_1^{(n)} = \psi_s\eta(t)\xi_n(\lambda)$, $\lambda_2^{(n)} = \lambda_1^{(n)}\lambda^{1/2}$ and $\lambda_3^{(n)} = 0$, where $\lambda = d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{-2/(1+s)}$. Then, for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and

$$\frac{\tilde C_1}{\beta_{b_2}^2}\,\psi_s \sqrt{n}\,\xi_n(\lambda)^2 d \le 1, \qquad (6)$$

the generalization error of L1-MKL is bounded as

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le \frac{\tilde C_2}{\beta_{b_2}^2}\Big(d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} + d^{(s-1)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2/(1+s)} + \frac{d\log(M)}{n}\Big)\eta(t)^2, \qquad (7)$$

with probability $1 - \exp(-t) - \exp\big(-\min\big\{\frac{\beta_{b_2}^4\log(M)}{\tilde C_1^2\psi_s^2 n\,\xi_n(\lambda)^4 d^2},\ \frac{\beta_{b_2}^2}{\tilde C_1\psi_s\,\xi_n(\lambda)^2 d}\big\}\big)$ for all $t \ge 1$.

The proof of Theorem 2 is provided in Section S.3 of the supplementary material [Suzuki and Sugiyama (2013)]. The bounds presented in the theorem can be further simplified under additional conditions. To show simplified bounds, we assume that $\beta_{b_1}$ and $\beta_{b_2}$ are bounded from below by a positive constant (cf. the restricted eigenvalue condition, Bickel, Ritov and Tsybakov (2009)); that is, there exists $C_2 > 0$ such that $\beta_{b_2} \ge \beta_{b_1} \ge C_2$. This condition is satisfied if $\beta_{16(1+\sqrt{d})} \ge C_2$ because $\frac{\sqrt{d}\,\max_{m\in I_0}\|g_m^*\|_{\mathcal{H}_m}}{R_{2,g^*}} \le \sqrt{d}$. Then we obtain simplified bounds under weak conditions. If $R_{1,f^*} \le Cd$ with a constant $C$ (this holds if $\|f_m^*\|_{\mathcal{H}_m} \le C$ for all $m$), then the first term in the learning rate (7) of L1-MKL dominates the second term, and thus equation (7) becomes

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le O_p\Big(d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} + \frac{d\log(M)}{n}\Big). \qquad (8)$$

Similarly, as for the bound of elastic-net MKL, if $R_{2,g^*}^2 \le C n^{q/(1+s)} d$ with a constant $C$ (this holds if $\|g_m^*\|_{\mathcal{H}_m} \le \sqrt{C}$ for all $m$), then equation (5) becomes

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le O_p\Big(d^{(1+q)/(1+q+s)} n^{-(1+q)/(1+q+s)} R_{2,g^*}^{2s/(1+q+s)} + \frac{d\log(M)}{n}\Big). \qquad (9)$$

Here notice that the tail probability can be bounded as

$$\exp\Big(-\min\Big\{\frac{\beta_{b_1}^4\log(M)}{\tilde C_1^2\psi_s^2 n\,\xi_n(\lambda)^4 d^2},\ \frac{\beta_{b_1}^2}{\tilde C_1\psi_s\,\xi_n(\lambda)^2 d}\Big\}\Big) \le \exp\big(-\min\{\log(M), \sqrt{n}\}\big) = \frac{1}{M},$$

under the conditions of equation (4) and $\frac{\log(M)}{\sqrt{n}} \le 1$ [the same inequality also holds under equation (6), even if we replace $\beta_{b_1}$ with $\beta_{b_2}$].

We note that, as $s$ becomes smaller (the RKHSs become simpler), both learning rates of L1-MKL and elastic-net MKL become faster if $R_{1,f^*}, R_{2,g^*} \ge 1$. Although the solutions of both L1-MKL and elastic-net MKL are derived from the same optimization framework (1), there appear two convergence rates (8) and (9) that possess different characteristics depending on whether $\lambda_3^{(n)} = 0$ or not. There appears to be no dependency on the smoothness parameter $q$ in bound (8) of L1-MKL, while bound (9) of elastic-net MKL depends on $q$. Let us compare these two learning rates in the two situations $q = 0$ and $q > 0$.

be no dependency on the smoothness parameter q in bound (8) of L1-MKL, whilebound (9) of elastic-net MKL depends on q . Let us compare these two learningrates on the two situations: q = 0 and q > 0.

(i) ($q = 0$). In this situation, the true function $f^*$ is not smooth and $g^* = f^*$ from the definition of $q$. The terms with respect to $d$ are $d^{(1-s)/(1+s)}$ for L1-MKL (8) and $d^{1/(1+s)}$ for elastic-net MKL (9). Thus, L1-MKL has a milder dependency on $d$. This might reflect the fact that L1-MKL tends to generate sparser solutions. Moreover, one can check that the learning rate of L1-MKL (8) is better than that of elastic-net MKL (9) because Jensen's inequality $R_{1,f^*} \le \sqrt{d}\, R_{2,f^*}$ gives

$$d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} \le d^{1/(1+s)} n^{-1/(1+s)} R_{2,f^*}^{2s/(1+s)}.$$

This suggests that, when the truth is nonsmooth, L1-MKL is preferred.

(ii) ($q > 0$). We see that, as $q$ becomes large (the truth becomes smooth), the convergence rate of elastic-net MKL becomes faster. The convergence rate with respect to $n$ in the presented bound is $n^{-(1+q)/(1+q+s)}$ for elastic-net MKL, which is faster than that of L1-MKL ($n^{-1/(1+s)}$). We suggest that this shows that elastic-net MKL properly captures the smoothness of the truth $f^*$ using the additional $\ell_2$-regularization term. As we observed above, we obtained a faster convergence bound for L1-MKL than for elastic-net MKL when $q = 0$. However, if $f^*$ is sufficiently smooth ($g^*$ is small), then as $q$ increases there appears a "phase transition," that is, the convergence bound of elastic-net MKL turns out to be faster than that of L1-MKL [$d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} \ge d^{(1+q)/(1+q+s)} n^{-(1+q)/(1+q+s)} R_{2,g^*}^{2s/(1+q+s)}$]; see the numerical sketch below. This might indicate that, when the truth $f^*$ is smooth, elastic-net MKL is preferred.

An interesting observation here is that, depending on the smoothness $q$ of the truth, the preferred regularization changes. Here, we would like to point out that the comparison between L1-MKL and elastic-net MKL is based only on upper bounds of the convergence rates. Thus there is still the possibility that L1-MKL can also make use of the smoothness $q$ of the true function to achieve a faster rate. We discuss this issue in Section 6.
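The following sketch (illustration only; the constants hidden in bounds (8) and (9) are dropped and the parameter values are arbitrary choices, not from the paper) evaluates the leading terms of the two bounds to show the crossover in $q$. In the homogeneous case chosen here the two bounds coincide at $q = 0$, and the elastic-net bound becomes smaller as $q$ grows.

```python
import numpy as np

def l1_bound(n, d, s, R1):
    """Leading term of bound (8) for L1-MKL (constants dropped)."""
    return d ** ((1 - s) / (1 + s)) * n ** (-1 / (1 + s)) * R1 ** (2 * s / (1 + s))

def enet_bound(n, d, s, q, R2):
    """Leading term of bound (9) for elastic-net MKL (constants dropped)."""
    e = 1 + q + s
    return d ** ((1 + q) / e) * n ** (-(1 + q) / e) * R2 ** (2 * s / e)

# Arbitrary homogeneous setting: ||f_m*||_{H_m} = ||g_m*||_{H_m} = 1 on I_0,
# so that R_{1,f*} = d and R_{2,g*} = sqrt(d).
n, d, s = 10_000, 8, 0.5
for q in np.linspace(0.0, 1.0, 6):
    print(f"q={q:.1f}  L1 bound ~ {l1_bound(n, d, s, float(d)):.2e}  "
          f"elastic-net bound ~ {enet_bound(n, d, s, q, np.sqrt(d)):.2e}")
```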

Finally, we give a comprehensive representation of Theorem 2 that gives a clear correspondence to the minimax optimal rate given in the next subsection.

COROLLARY 3. Suppose the same conditions as Theorem 2 and define $\tilde s = \frac{s}{1+q}$. Then there exists a constant $C'$ depending on $s, c, L, C_1$ such that the following convergence rates hold.

(Elastic-net MKL). If $1 \le R_{2,g^*}$ and $\|g_m^*\|_{\mathcal{H}_m} \le C$ ($\forall m \in I_0$) with a constant $C$, then for all $p \ge 2$, elastic-net MKL achieves the following convergence rate:

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le \frac{C'}{\beta_{b_1}^2}\Big(d^{1-2\tilde s/(p(1+\tilde s))}\, n^{-1/(1+\tilde s)}\, R_{p,g^*}^{2\tilde s/(1+\tilde s)} + \frac{d\log(M)}{n}\Big)\eta(t)^2,$$

with probability $1 - \exp(-t) - 1/M$ for all $t \ge 1$.

(L1-MKL). If $1 \le R_{1,f^*}$ and $\|f_m^*\|_{\mathcal{H}_m} \le C$ ($\forall m \in I_0$) with a constant $C$, then for all $p \ge 1$, L1-MKL achieves the following convergence rate:

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le \frac{C'}{\beta_{b_2}^2}\Big(d^{1-2s/(p(1+s))}\, n^{-1/(1+s)}\, R_{p,f^*}^{2s/(1+s)} + \frac{d\log(M)}{n}\Big)\eta(t)^2,$$

with probability $1 - \exp(-t) - 1/M$ for all $t \ge 1$.

PROOF. By Jensen's inequality, we always have $R_{2,g^*} \le d^{1/2-1/p} R_{p,g^*}$ for $p \ge 2$ and $R_{1,f^*} \le d^{1-1/p} R_{p,f^*}$ for $p \ge 1$. Thus we have

$$d^{1/(1+\tilde s)} n^{-1/(1+\tilde s)} R_{2,g^*}^{2\tilde s/(1+\tilde s)} \le d^{1-2\tilde s/(p(1+\tilde s))} n^{-1/(1+\tilde s)} R_{p,g^*}^{2\tilde s/(1+\tilde s)},$$
$$d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} \le d^{1-2s/(p(1+s))} n^{-1/(1+s)} R_{p,f^*}^{2s/(1+s)}.$$

Combining this with the discussions used to derive equations (8) and (9), we have the assertion. □

Below, we show that bounds (8) and (9) achieve the minimax optimal rates on the $\ell_1$-mixed-norm ball and the $\ell_2$-mixed-norm ball, respectively.

3.2. Minimax learning rate on the $\ell_p$-mixed-norm ball. Here we consider a simple setup to investigate the minimax rate. First, we assume that the input space $\mathcal{X}$ is expressed as $\mathcal{X} = \tilde{\mathcal{X}}^M$ for some space $\tilde{\mathcal{X}}$. Second, all the RKHSs $\{\mathcal{H}_m\}_{m=1}^M$ are induced from the same RKHS $\tilde{\mathcal{H}}$ defined on $\tilde{\mathcal{X}}$. Finally, we assume that the marginal distribution $\Pi$ of the input is the product of a probability distribution $Q$, that is, $\Pi = Q^M$. Thus, an input $x = (x^{(1)}, \dots, x^{(M)}) \in \mathcal{X} = \tilde{\mathcal{X}}^M$ is a concatenation of $M$ random variables $\{x^{(m)}\}_{m=1}^M$ independently and identically distributed from the distribution $Q$. Moreover, the function class $\mathcal{H}$ is assumed to be the class of functions $f$ such that $f(x) = f(x^{(1)}, \dots, x^{(M)}) = \sum_{m=1}^M f_m(x^{(m)})$, where $f_m \in \tilde{\mathcal{H}}$ for all $m$. Without loss of generality, we may suppose that all functions in $\tilde{\mathcal{H}}$ are centered: $\mathrm{E}_{X\sim Q}[f(X)] = 0$ ($\forall f \in \tilde{\mathcal{H}}$). Furthermore, we assume that the spectrum of the kernel $\tilde k$ corresponding to the RKHS $\tilde{\mathcal{H}}$ decays at the rate $j^{-1/s}$. That is, in addition to Assumption 3, we impose the following lower bound on the spectrum: there exist $c', c\, (> 0)$ such that

$$c'\, j^{-1/s} \le \mu_j \le c\, j^{-1/s}, \qquad (10)$$

where $\{\mu_j\}_j$ is the spectrum of the integral operator $T_{\tilde k}$ associated with the kernel $\tilde k$; see equation (2). We also assume that the noise $\{\varepsilon_i\}_{i=1}^n$ is generated by the Gaussian distribution with mean 0 and standard deviation $\sigma$.

Let $\mathcal{H}_0(d)$ be the set of functions with $d$ nonzero components in $\mathcal{H}$, defined by $\mathcal{H}_0(d) := \{(f_1,\dots,f_M)\in\mathcal{H} \mid \#\{m \mid \|f_m\|_{\mathcal{H}_m}\neq 0\}\le d\}$. We define the $\ell_p$-mixed-norm ball ($p \ge 1$) with radius $R$ in $\mathcal{H}_0(d)$ as

$$\mathcal{H}^{d,q}_{\ell_p}(R) := \Big\{f = \sum_{m=1}^M f_m \;\Big|\; \exists (g_1,\dots,g_M)\in\mathcal{H}_0(d),\ f_m = T_m^{q/2} g_m,\ \Big(\sum_{m=1}^M \|g_m\|_{\mathcal{H}_m}^p\Big)^{1/p} \le R\Big\}.$$

In Raskutti, Wainwright and Yu (2012), the minimax learning rate on $\mathcal{H}^{d,0}_{\ell_\infty}(R)$ (i.e., $p = \infty$ and $q = 0$) was derived.⁴ We show (a lower bound of) the minimax learning rate for the more general setting ($1 \le p \le \infty$ and $0 \le q \le 1$) in the following theorem.

THEOREM 4. Let $\tilde s = \frac{s}{1+q}$ and assume $d \le M/4$. Then the minimax learning rates are lower bounded as follows. If the radius $R_p$ of the $\ell_p$-mixed-norm ball satisfies $R_p \ge d^{1/p}\sqrt{\frac{\log(M/d)}{n}}$, then there exists a constant $C_1$ such that

$$\inf_{\hat f}\ \sup_{f^*\in\mathcal{H}^{d,q}_{\ell_p}(R_p)} \mathrm{E}\big[\|\hat f - f^*\|_{L_2(\Pi)}^2\big] \ge C_1\Big(d^{1-2\tilde s/(p(1+\tilde s))}\, n^{-1/(1+\tilde s)}\, R_p^{2\tilde s/(1+\tilde s)} + \frac{d\log(M/d)}{n}\Big), \qquad (11)$$

where "inf" is taken over all measurable functions of the samples $\{(x_i, y_i)\}_{i=1}^n$, and the expectation is taken with respect to the sample distribution.

A proof of Theorem 4 is provided in Section S.7 of the supplementary material [Suzuki and Sugiyama (2013)].

Substituting $q = 0$ and $p = 1$ into the minimax learning rate (11), we see that the learning rate (8) of L1-MKL achieves the minimax optimal rate on the $\ell_1$-mixed-norm ball for $q = 0$. Moreover, the learning rate of L1-MKL (i.e., minimax optimal on the $\ell_1$-mixed-norm ball) is the fastest among all the minimax optimal rates on $\ell_p$-mixed-norm balls for $p \ge 1$ when $q = 0$. To see this, let $R_{p,f^*} := (\sum_m \|f_m^*\|_{\mathcal{H}_m}^p)^{1/p}$; then, as in the proof of Corollary 3, we always have $R_{1,f^*} \le d^{1-1/p} R_{p,f^*} \le d R_{\infty,f^*}$ due to Jensen's inequality, and consequently we have

$$d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} \le d^{1-2s/(p(1+s))} n^{-1/(1+s)} R_{p,f^*}^{2s/(1+s)} \le d\, n^{-1/(1+s)} R_{\infty,f^*}^{2s/(1+s)}. \qquad (12)$$

⁴The set $\mathcal{F}_{M,d,\mathcal{H}}(R)$ in Raskutti, Wainwright and Yu (2012) corresponds to $\mathcal{H}^{d,0}_{\ell_\infty}(R)$ in the current paper.

On the other hand, the learning rate (9) of elastic-net MKL achieves the minimax optimal rate (11) on the $\ell_2$-mixed-norm ball ($p = 2$). When $q = 0$, the rate of elastic-net MKL is slower than that of L1-MKL, but the optimal rate is achieved over the whole range of the smoothness parameter $0 \le q \le 1$, which is advantageous against L1-MKL. Moreover, the optimal rate on the $\ell_2$-mixed-norm ball is still faster than that on the $\ell_\infty$-mixed-norm ball due to relation (12).

The learning rates of both L1-MKL and elastic-net MKL coincide with the minimax optimal rate on the $\ell_\infty$-mixed-norm ball when the truth is homogeneous. For simplicity, assume $q = 0$. If $\|f_m^*\|_{\mathcal{H}_m} = 1$ ($\forall m \in I_0$) and $f_m^* = 0$ (otherwise), then $R_{p,f^*} = d^{1/p}$. Thus, both rates are $dn^{-1/(1+s)} + \frac{d\log(M)}{n}$, that is, the minimax rate on the $\ell_\infty$-mixed-norm ball. We also notice that this homogeneous situation is the only situation where those convergence rates coincide with each other. As we will see later, the existing bounds are the minimax rate on the $\ell_\infty$-mixed-norm ball and thus are tight only in the homogeneous setting.

4. Optimal parameter selection. We need knowledge of parameters such as $q, s, d, R_{1,f^*}, R_{2,g^*}$ to obtain the optimal learning rate shown in Theorem 2; however, this is not realistic in practice.

To overcome this problem, we give an algorithmic procedure such as cross-validation to achieve the optimal learning rate. Roughly speaking, we split the data into a training set and a validation set and utilize the validation set to choose the optimal parameter. Given the data $D = \{(x_i, y_i)\}_{i=1}^n$, the training set $D_{\mathrm{tr}}$ is generated by using half of the given data, $D_{\mathrm{tr}} = \{(x_i, y_i)\}_{i=1}^{n'}$ where $n' = \lceil n/2\rceil$, and the remaining data is used as the validation set $D_{\mathrm{te}} = \{(x_i, y_i)\}_{i=n'+1}^{n}$. Let $\hat f_\Lambda$ be the estimator given by our MKL formulation (1) where the parameter setting $\Lambda = (\lambda_1^{(n)}, \lambda_2^{(n)}, \lambda_3^{(n)})$ is employed and the training set $D_{\mathrm{tr}}$ is used instead of the whole data set $D$.

We utilize a clipped estimator so that the estimator is bounded in a way that makes the validation procedure effective. Given the estimator $\hat f_\Lambda$ and a positive real $B > 0$, the clipped estimator $\bar f_\Lambda$ is given as

$$\bar f_\Lambda(x) := \begin{cases} B, & B \le \hat f_\Lambda(x),\\ \hat f_\Lambda(x), & -B < \hat f_\Lambda(x) < B,\\ -B, & \hat f_\Lambda(x) \le -B.\end{cases}$$

To appropriately choose $B$, we assume that we can roughly estimate the sup-norm $\|f^*\|_\infty$ of the true function, and $B$ is set to satisfy $\|f^*\|_\infty < B$. This assumption is not unrealistic because if we set $B$ sufficiently large so that $\max_i |y_i| < B$, then with high probability such a $B$ satisfies $\|f^*\|_\infty < B$. It should be noted that if $\|f^*\|_\infty < B$, the generalization error of the clipped estimator $\bar f_\Lambda$ is not greater than that of the original estimator $\hat f_\Lambda$,

$$\|\bar f_\Lambda - f^*\|_{L_2(\Pi)} \le \|\hat f_\Lambda - f^*\|_{L_2(\Pi)},$$

because $|\bar f_\Lambda(x) - f^*(x)| \le |\hat f_\Lambda(x) - f^*(x)|$ for all $x \in \mathcal{X}$.

Now, for a finite set of parameter candidates $\Lambda_n \subset \mathbb{R}_+\times\mathbb{R}_+\times\mathbb{R}_+$, we choose an optimal parameter that minimizes the error on the validation set,

$$\hat\Lambda_{D_{\mathrm{te}}} := \mathop{\mathrm{argmin}}_{\Lambda\in\Lambda_n}\ \frac{1}{|D_{\mathrm{te}}|}\sum_{(x_i,y_i)\in D_{\mathrm{te}}} \big(\bar f_\Lambda(x_i) - y_i\big)^2. \qquad (13)$$

Then we can show that the estimator $\bar f_{\hat\Lambda_{D_{\mathrm{te}}}}$ achieves the optimal learning rate. To show this, we determine the finite set $\Lambda_n$ of candidate parameters as follows: let $\mathcal{G}_n := \{1/n^2, 2/n^2, \dots, 1\}$ and

$$\Lambda_n = \big\{(\lambda_1,\lambda_2,\lambda_3) \mid \lambda_1,\lambda_3\in\mathcal{G}_n,\ \lambda_2 = \lambda_1\lambda_3^{1/2}\big\} \cup \big\{(\lambda_1,\lambda_2,\lambda_3)\mid \lambda_1,\lambda\in\mathcal{G}_n,\ \lambda_2=\lambda_1\lambda^{1/2},\ \lambda_3=0\big\}.$$

With this parameter set, we have the following theorem, which shows the optimality of the validation procedure (13).

THEOREM 5. Suppose Assumptions 1–4 are satisfied. Assume $R_{1,f^*}, R_{2,g^*} \ge 1$, $\beta_{b_2} \ge \beta_{b_1} \ge C_2$ and $\|f_m^*\|_{\mathcal{H}_m}, \|g_m^*\|_{\mathcal{H}_m} \le C_3$ with some constants $C_2, C_3 > 0$, and suppose $n$ satisfies $\frac{\log(M)}{\sqrt{n}} \le 1$ and

$$\frac{\tilde C_1}{\beta_{b_1}^2}\,\psi_s\sqrt{n}\,\xi_n(\lambda_{(1)})^2 d \le 1 \quad\text{and}\quad \frac{\tilde C_1}{\beta_{b_2}^2}\,\psi_s\sqrt{n}\,\xi_n(\lambda_{(2)})^2 d \le 1,$$

where $\lambda_{(1)} = d^{1/(1+q+s)} n^{-1/(1+q+s)} R_{2,g^*}^{-2/(1+q+s)}$, $\lambda_{(2)} = d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{-2/(1+s)}$ and $\tilde C_1$ is the constant introduced in the statement of Theorem 2. Then there exist a universal constant $C_4$ and a constant $\tilde C_3$ depending on $s, c, L, C_1, C_2, C_3$ such that

$$\|\bar f_{\hat\Lambda_{D_{\mathrm{te}}}} - f^*\|_{L_2(\Pi)}^2 \le \tilde C_3\Big(d^{(1-s)/(1+s)} n^{-1/(1+s)} R_{1,f^*}^{2s/(1+s)} \wedge d^{(1+q)/(1+q+s)} n^{-(1+q)/(1+q+s)} R_{2,g^*}^{2s/(1+q+s)} + \frac{d\log(M)}{n}\Big)\eta(t)^2 + C_4\,\frac{B^2(\tau + \log(1+n))}{n},$$

with probability $1 - 2\exp(-t) - \exp(-\tau) - \frac{2}{M}$, where $a \wedge b$ means $\min\{a, b\}$.


This can be shown by combining our bound in Theorem 2 and the technique used in Theorem 7.2 of Steinwart and Christmann (2008). According to Theorem 5, the estimator $\bar f_{\hat\Lambda_{D_{\mathrm{te}}}}$ with the validated parameter $\hat\Lambda_{D_{\mathrm{te}}}$ achieves the minimum learning rate among the oracle bound (8) for L1-MKL and the oracle bound (9) for elastic-net MKL if $B$ is sufficiently small. Therefore, the optimal rate is almost attainable [at the cost of the term $B^2\log(1+n)/n$] by a simple executable algorithm.
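A minimal sketch of this procedure is given below, assuming numpy and a hypothetical solver `fit_mkl` that implements (1) and returns a prediction function; the bound $B$, the coarsened candidate grid and all names are illustrative choices on our part, not the authors' implementation.

```python
import numpy as np

def clip(pred, B):
    """Clipped estimator: truncate predictions to [-B, B]."""
    return np.clip(pred, -B, B)

def candidate_grid(n, size=20):
    """A coarsened stand-in for the candidate set Lambda_n built from the grid
    {1/n^2, ..., 1}: elastic-net triples and L1-MKL triples (lambda_3 = 0)."""
    G = np.linspace(1.0 / n ** 2, 1.0, size)
    elastic = [(l1, l1 * np.sqrt(l3), l3) for l1 in G for l3 in G]
    l1_only = [(l1, l1 * np.sqrt(lam), 0.0) for l1 in G for lam in G]
    return elastic + l1_only

def select_parameters(fit_mkl, x, y, B):
    """Fit on the first half of the data for each candidate triple and return the
    triple minimizing the validation error of the clipped estimator, as in (13).
    fit_mkl(x_tr, y_tr, lam1, lam2, lam3) is assumed to return a predict(x) callable."""
    n = len(y)
    n_tr = int(np.ceil(n / 2))
    tr, te = np.arange(n_tr), np.arange(n_tr, n)
    best, best_err = None, np.inf
    for (lam1, lam2, lam3) in candidate_grid(n):
        predict = fit_mkl(x[tr], y[tr], lam1, lam2, lam3)
        err = np.mean((clip(predict(x[te]), B) - y[te]) ** 2)
        if err < best_err:
            best, best_err = (lam1, lam2, lam3), err
    return best
```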

5. Comparison with existing bounds. In this section, we compare our bound with the existing bounds. Roughly speaking, the differences from the existing bounds are summarized in the following two points (see also Table 1 summarizing the relations between our analysis and existing analyses):

(a) Our learning rate achieves the minimax rate on the $\ell_1$-mixed-norm ball or the $\ell_2$-mixed-norm ball, instead of the $\ell_\infty$-mixed-norm ball.

(b) Our bound includes the smoothness parameter $q$ (Assumption 2), and thus is more general and faster than the existing bounds.

The first bound on the convergence rate of MKL was derived by Koltchinskii and Yuan (2008), which assumed $q = 1$ and $\frac{1}{d}\sum_{m\in I_0}\big(\|g_m^*\|_{\mathcal{H}_m}^2/\|f_m^*\|_{\mathcal{H}_m}^2\big) \le C$. Under these rather strong conditions, they showed the bound

$$d^{(1-s)/(1+s)} n^{-1/(1+s)} + \frac{d\log(M)}{n}.$$

Our convergence rate (8) of L1-MKL achieves this learning rate without the two strong conditions. Moreover, for the smooth case $q = 1$, we have shown that elastic-net MKL has the faster rate $n^{-2/(2+s)}$ instead of $n^{-1/(1+s)}$ with respect to $n$.

The second bound was given by Meier, van de Geer and Bühlmann (2009), which shows

$$\Big(\frac{\log(M)}{n}\Big)^{1/(1+s)}\big(d + R_{2,f^*}^2\big)$$

for elastic-net regularization under the condition $q = 0$. Their bound almost achieves the minimax rate on the $\ell_\infty$-mixed-norm ball except for the $\log(M)$ factor. Compared with our bound (9), their bound has the additional $\log(M)$ factor, and the term with respect to $d$ and $R_{2,f^*}$ is larger than $d^{1/(1+s)} R_{2,f^*}^{2s/(1+s)}$ in our learning rate of elastic-net MKL because Young's inequality yields

$$d^{1/(1+s)} R_{2,f^*}^{2s/(1+s)} \le \frac{1}{1+s}\, d + \frac{s}{1+s}\, R_{2,f^*}^2 \le d + R_{2,f^*}^2.$$

Moreover, our result for elastic-net MKL covers all $0 \le q \le 1$.

Most recently, Koltchinskii and Yuan (2010) presented the bound

$$n^{-1/(1+s)}\big(d + R_{1,f^*}\big) + \frac{d\log(M)}{n}$$

for L1-MKL and $q = 0$. Their bound achieves the minimax rate on the $\ell_\infty$-mixed-norm ball, but it is looser than our bound (8) for L1-MKL because, by Young's inequality, we have

$$d^{(1-s)/(1+s)} R_{1,f^*}^{2s/(1+s)} \le \frac{1-s}{1+s}\, d + \frac{2s}{1+s}\, R_{1,f^*} \le d + R_{1,f^*}.$$

In fact, their bound is $d^{2s/(1+s)}$ times slower than ours if the ground truth is inhomogeneous. To see this, suppose $\|f_m^*\|_{\mathcal{H}_m} = m^{-1}$ ($m \in I_0 = \{1, \dots, d\}$) and $f_m^* = 0$ (otherwise). Then their bound is $n^{-1/(1+s)} d + \frac{d\log(M)}{n}$, while our bound for L1-MKL is $n^{-1/(1+s)} d^{(1-s)/(1+s)} + \frac{d\log(M)}{n}$. Moreover, their formulation of L1-MKL is slightly different from ours. In their formulation, there are additional constraints such that $\|f_m\|_{\mathcal{H}_m} \le R_m$ ($\forall m$) with some constants $R_m$ in the optimization problem described in equation (1). Due to these constraints, their formulation is a bit different from the practically used one (in practice, we do not usually impose such constraints). Instead, our analysis requires an additional assumption on the sup-norm (Assumption 4) to control the discrepancy between the empirical and population means of the square of an element of an RKHS, $\frac{1}{n}\sum_{i=1}^n f_m^2(x_i) - \mathrm{E}[f_m^2]$ ($f_m \in \mathcal{H}_m$). In addition, they assumed global boundedness, that is, that the sup-norm of $f^*$ is bounded by a constant, $\|f^*\|_\infty = \|\sum_{m=1}^M f_m^*\|_\infty \le C$. This assumption is standard and does not affect the convergence rate in single kernel learning settings. However, in MKL settings, it has been pointed out that the rate is not minimax optimal in the large-$d$ regime [in particular $d = \Omega(\sqrt{n})$] under global boundedness [Raskutti, Wainwright and Yu (2012)]. Our analysis omits the global boundedness by utilizing the sup-norm assumption (Assumption 4).

All of the bounds explained above focused on either $q = 0$ or $q = 1$. On the other hand, our analysis is more general in that the whole range $0 \le q \le 1$ is covered.

6. Discussion about adaptivity of $\ell_1$-regularization. In this section, we discuss the following question: is it really true that $\ell_1$-regularization cannot possess adaptivity to the smoothness? According to Theorem 2 and the discussion following it, the convergence rate of L1-MKL does not depend on the smoothness of the true function. However, this is just an upper bound. Thus, there is still the possibility that L1-MKL can make use of the smoothness of the true function. We give some remarks about this issue.

According to our analysis, it is difficult to improve the bound of Theorem 2 without any additional assumptions. On the other hand, it is possible to do so if we may assume some additional conditions.

A technical reason that makes it difficult to show adaptivity of L1-MKL is that the $\ell_1$-regularization is not differentiable at 0. Indeed, the sub-gradient of $\|f_m\|_{\mathcal{H}_m}$ is $f_m/\|f_m\|_{\mathcal{H}_m}$ if $f_m \neq 0$, and compared with that of $\|f_m\|_{\mathcal{H}_m}^2$ (which is $f_m$, up to a constant factor), there is a difference of a factor $1/\|f_m\|_{\mathcal{H}_m}$. This makes it difficult to control the behavior of the estimator around 0. To avoid this difficulty, we assume that the estimator $\hat f_m$ is bounded from below as follows.


ASSUMPTION 5 (Lower bound assumption). There exist constants $h_m > 0$ ($m \in I_0$) such that

$$\|\hat f_m\|_{\mathcal{H}_m} \ge h_m \quad (\forall m \in I_0), \qquad \text{(A5)}$$

with probability $1 - p_n$.

We will give a justification of this assumption later (Lemma 7). If we admit this assumption, we have the following convergence bound. Define

$$\tilde R_{2,g^*} := \Big(\sum_{m\in I_0}\frac{\|g_m^*\|_{\mathcal{H}_m}^2}{h_m}\Big)^{1/2}, \qquad b_3 := 32\Big(1 + \frac{\sqrt{d}\,\max_{m\in I_0}\big(\|g_m^*\|_{\mathcal{H}_m}/h_m\big)}{\tilde R_{2,g^*}}\Big).$$

THEOREM 6. Suppose Assumptions 1–5 are satisfied, and $\|g_m^*\|_{\mathcal{H}_m} \le C$ for all $m \in I_0$. Set

$$\lambda = d^{1/(1+q+s)}\, n^{-1/(1+q+s)}\, \tilde R_{2,g^*}^{-2/(1+q+s)}.$$

Moreover, we set $\lambda_1^{(n)}$, $\lambda_2^{(n)}$ and $\lambda_3^{(n)}$ as $\lambda_1^{(n)} = 2\psi_s\eta(t)\xi_n(\lambda)$, $\lambda_2^{(n)} = \max\{\lambda\eta(t), \lambda_1^{(n)}\lambda^{1/2}\}$ and $\lambda_3^{(n)} = 0$, where $\psi_s$ is the same as in Theorem 2. Similarly define $\lambda_1^{(n)}(t')$, $\lambda_2^{(n)}(t')$ corresponding to some fixed $t'$, and let $\bar\lambda = \big(\lambda_2^{(n)}(t')/\lambda_1^{(n)}(t')\big)^2$. Then there exist constants $C_3$, $C_3'$, $C_4$ depending on $s, c, L, C_1, C, b_3, t'$ such that, for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and

$$\frac{C_3}{\beta_{b_3}^2}\,\psi_s\sqrt{n}\,\xi_n^2(\lambda)\, d \le 1, \qquad C_3'\,\psi_s\sqrt{n}\,\xi_n^2(\bar\lambda)\,\bar\lambda\, d \le \lambda_2^{(n)}(t'), \qquad (14)$$

we have that

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le \frac{C_4}{\beta_{b_3}^2}\Big(d^{(1+q)/(1+q+s)}\, n^{-(1+q)/(1+q+s)}\, \tilde R_{2,g^*}^{2s/(1+q+s)} + \frac{d\log(M)}{n}\Big)\eta(t)^2, \qquad (15)$$

with probability $1 - \exp(-t) - \exp(-t') - 2/M - p_n$.

The proof of Theorem 6 can be found in Section S.4 of the supplementary material [Suzuki and Sugiyama (2013)]. The theorem shows that, under the rather strong Assumption 5, L1-MKL also possesses adaptivity to the smoothness. Bound (15) is close to the minimax optimal rate on the $\ell_2$-mixed-norm ball, with $\tilde R_{2,g^*}$ appearing instead of $R_{2,g^*}$. Here we observe that $h_m$ appears in the denominator of $\tilde R_{2,g^*}$. Therefore, for small $h_m$, $\tilde R_{2,g^*}$ is larger than $R_{2,g^*}$, which can make bound (15) larger than that of elastic-net MKL. This is due to the nondifferentiability of $\ell_1$-regularization, as explained above.

Next, we give a justification of Assumption 5.

LEMMA 7. If $\|\hat f_m - f_m^*\|_{L_2(\Pi)} \to 0$ in probability, then

$$P\Big(\|\hat f_m\|_{\mathcal{H}_m} \ge \frac{\|f_m^*\|_{\mathcal{H}_m}}{2}\Big) \to 1.$$

PROOF. On the basis of decomposition (2) of the kernel function, we write $f_m^* = \sum_{j=1}^\infty a_{j,m}\phi_{j,m}$ and $\hat f_m = \sum_{j=1}^\infty \hat a_{j,m}\phi_{j,m}$. Then we have that $\|f_m^*\|_{\mathcal{H}_m}^2 = \sum_{j=1}^\infty \mu_{j,m}^{-1} a_{j,m}^2$. Now we define $J_{f_m^*}$ to be a finite number such that $\sqrt{\sum_{j=1}^{J_{f_m^*}} \mu_{j,m}^{-1} a_{j,m}^2} \ge \frac{3}{4}\|f_m^*\|_{\mathcal{H}_m}$. Noticing that $o_p(1) \ge \|\hat f_m - f_m^*\|_{L_2(\Pi)}^2 = \sum_{j=1}^\infty (\hat a_{j,m} - a_{j,m})^2 \ge \sum_{j=1}^{J_{f_m^*}} (\hat a_{j,m} - a_{j,m})^2$, we have that

$$\|\hat f_m\|_{\mathcal{H}_m} = \sqrt{\sum_{j=1}^{J_{f_m^*}} \mu_{j,m}^{-1}\hat a_{j,m}^2 + \sum_{j=J_{f_m^*}+1}^{\infty} \mu_{j,m}^{-1}\hat a_{j,m}^2} \ge \sqrt{\sum_{j=1}^{J_{f_m^*}} \mu_{j,m}^{-1}\hat a_{j,m}^2} \ge \sqrt{\sum_{j=1}^{J_{f_m^*}} \mu_{j,m}^{-1} a_{j,m}^2} - \sqrt{\sum_{j=1}^{J_{f_m^*}} \mu_{j,m}^{-1}(\hat a_{j,m}-a_{j,m})^2} \ge \frac{3}{4}\|f_m^*\|_{\mathcal{H}_m} - \mu_{J_{f_m^*},m}^{-1/2}\sqrt{\sum_{j=1}^{J_{f_m^*}}(\hat a_{j,m}-a_{j,m})^2} = \frac{3}{4}\|f_m^*\|_{\mathcal{H}_m} - o_p(1).$$

This gives the assertion. □

One can see from the proof that the convergence rate in Lemma 7 depends on $f_m^*$. If $d$ is sufficiently small, we observe that the proof of Theorem 2 gives that $\|f_m^* - \hat f_m\|_{L_2(\Pi)} \xrightarrow{p} 0$ ($m \in I_0$). In this situation, if we set $h_m = \|f_m^*\|_{\mathcal{H}_m}/2$, then $\|\hat f_m\|_{\mathcal{H}_m} \ge h_m$ ($m \in I_0$) is satisfied with high probability for sufficiently large $n$.

The above discussion seems a proper justification to support the adaptivity of $\ell_1$-regularization. However, we would like to remark on two concerns about this discussion. First, in a situation where $d$ increases as the number of samples increases, it is hardly expected that $\|f_m^*\|_{\mathcal{H}_m} > c$ with some positive constant $c$. It is more natural to suppose that $\min_{m\in I_0}\|f_m^*\|_{\mathcal{H}_m} \to 0$ as $d$ increases. In that situation, $\tilde R_{2,g^*}$ becomes much larger as $d$ increases. Second, since $T_m$ is not invertible, $\|g_m^*\|_{\mathcal{H}_m}/\|f_m^*\|_{\mathcal{H}_m}$ is not bounded. Thus, for $h_m = \|f_m^*\|_{\mathcal{H}_m}/2$, we have no guarantee that $\tilde R_{2,g^*}$ is reasonably small so that the convergence bound (15) is meaningful. Both of these concerns are caused by the nondifferentiability of $\ell_1$-regularization at 0. Moreover, these concerns are specific to high-dimensional situations. If $d = M = 1$ (or $d$ and $M$ are sufficiently small), then we do not need to worry about such issues.

We have shown that, in a restrictive situation, $\ell_1$-regularization can possess adaptivity to the smoothness of the true function and achieve a near minimax optimal rate on the $\ell_2$-mixed-norm ball. It is future work to clarify whether the lower bound assumption (Assumption 5) is a necessary condition or not.

7. Conclusion. We have presented new learning rates for both L1-MKL and elastic-net MKL, which are tighter than the existing bounds for several MKL formulations. According to our bounds, the learning rates of L1-MKL and elastic-net MKL achieve the minimax optimal rates on the $\ell_1$-mixed-norm ball and the $\ell_2$-mixed-norm ball, respectively, instead of the $\ell_\infty$-mixed-norm ball. We have also shown that a procedure like cross-validation gives the optimal choice of the parameters. We have discussed the relation between the regularization and the convergence rate. Our theoretical analysis suggests that there is a trade-off between the sparsity and the smoothness: if the true function is sufficiently smooth, elastic-net regularization is preferred; otherwise, $\ell_1$-regularization is preferred. This theoretical insight supports recent experimental results [Cortes, Mohri and Rostamizadeh (2009b), Kloft et al. (2009), Tomioka and Suzuki (2009)] showing that intermediate regularization between $\ell_1$ and $\ell_2$ often performs favorably.

APPENDIX A: EVALUATION OF ENTROPY NUMBER

Here, we give a detailed characterization of the covering number in terms of the spectrum, using the operator $T_m$. Accordingly, we give the complexity of the set of functions satisfying the convolution assumption (Assumption 2). We extend the domain and the range of the operator $T_m$ to the whole space $L_2(\Pi)$ and define its power $T_m^\beta : L_2(\Pi) \to L_2(\Pi)$ for $\beta \in [0, 1]$ as

$$T_m^\beta f := \sum_{k=1}^\infty \mu_{k,m}^\beta\,\langle f, \phi_{k,m}\rangle_{L_2(\Pi)}\,\phi_{k,m} \qquad (f \in L_2(\Pi)).$$

Moreover, we define a Hilbert space $\mathcal{H}_{m,\beta}$ as

$$\mathcal{H}_{m,\beta} := \Big\{\sum_{k=1}^\infty b_k\phi_{k,m} \;\Big|\; \sum_{k=1}^\infty \mu_{k,m}^{-\beta} b_k^2 < \infty\Big\},$$

and equip this space with the Hilbert space norm $\big\|\sum_{k=1}^\infty b_k\phi_{k,m}\big\|_{\mathcal{H}_{m,\beta}} := \sqrt{\sum_{k=1}^\infty \mu_{k,m}^{-\beta} b_k^2}$. One can check that $\mathcal{H}_{m,1} = \mathcal{H}_m$; see Theorem 4.51 of Steinwart and Christmann (2008). Here we define, for $R > 0$,

$$\mathcal{H}_m^q(R) := \big\{f_m = T_m^{q/2} g_m \mid g_m \in \mathcal{H}_m,\ \|g_m\|_{\mathcal{H}_m} \le R\big\}. \qquad (16)$$

Then we obtain the following lemma.

LEMMA 8. $\mathcal{H}_m^q(1)$ is equivalent to the unit ball of $\mathcal{H}_{m,1+q}$: $\mathcal{H}_m^q(1) = \{f_m \in \mathcal{H}_{m,1+q} \mid \|f_m\|_{\mathcal{H}_{m,1+q}} \le 1\}$.

This can be shown as follows. For all $f_m \in \mathcal{H}_m^q(1)$, there exists $g_m \in \mathcal{H}_m$ such that $f_m = T_m^{q/2} g_m$ and $\|g_m\|_{\mathcal{H}_m} \le 1$. Thus $g_m = (T_m^{q/2})^{-1} f_m = \sum_{k=1}^\infty \mu_{k,m}^{-q/2}\langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)}\phi_{k,m}$ and $1 \ge \|g_m\|_{\mathcal{H}_m}^2 = \sum_{k=1}^\infty \mu_{k,m}^{-1}\langle g_m, \phi_{k,m}\rangle_{L_2(\Pi)}^2 = \sum_{k=1}^\infty \mu_{k,m}^{-(1+q)}\langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)}^2$. Therefore, $f_m$ is in $\mathcal{H}_m^q(1)$ if and only if the norm of $f_m$ in $\mathcal{H}_{m,1+q}$ is well defined and not greater than 1.

Now Theorem 15 of Steinwart, Hush and Scovel (2009) gives an upper bound on the entropy number of $\mathcal{H}_{m,\beta}$ as

$$e_i\big(\mathcal{H}_{m,\beta} \to L_2(\Pi)\big) \le C\, i^{-\beta/(2s)},$$

where $C$ is a constant depending on $c$, $s$, $\beta$. This inequality with $\beta = 1$ corresponds to equation (3). Moreover, substituting $\beta = 1 + q$ into the above equation, we have

$$e_i\big(\mathcal{H}_{m,1+q} \to L_2(\Pi)\big) \le C\, i^{-(1+q)/(2s)}. \qquad (17)$$

APPENDIX B: PROOF OF LEMMA 1

PROOF OF LEMMA 1. For $J = I^c$, we have

$$\|f\|_{L_2(\Pi)}^2 = \|f_I\|_{L_2(\Pi)}^2 + 2\langle f_I, f_J\rangle_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2 \ge \|f_I\|_{L_2(\Pi)}^2 - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2 \ge \big(1 - \rho(I)^2\big)\|f_I\|_{L_2(\Pi)}^2 \ge \big(1 - \rho(I)^2\big)\kappa(I)\Big(\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2\Big),$$

where we used the Cauchy–Schwarz inequality in the last line. □

Acknowledgments. The authors would like to thank Ryota Tomioka, Alexandre B. Tsybakov, Martin Wainwright and Garvesh Raskutti for suggestive discussions.

SUPPLEMENTARY MATERIAL

Supplementary material for: Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness (DOI: 10.1214/13-AOS1095SUPP; .pdf). Due to space constraints, we have moved the proof of the main theorem to a supplementary document [Suzuki and Sugiyama (2013)].


REFERENCES

ARGYRIOU, A., HAUSER, R., MICCHELLI, C. A. and PONTIL, M. (2006). A DC-programming algorithm for kernel selection. In The 23rd International Conference on Machine Learning (W. W. Cohen and A. Moore, eds.). ACM, New York.
BACH, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225. MR2417268
BACH, F. R. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems 21 (D. Koller, D. Schuurmans, Y. Bengio and L. Bottou, eds.) 105–112. Curran Associates, Red Hook, NY.
BACH, F. R., LANCKRIET, G. and JORDAN, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In The 21st International Conference on Machine Learning 41–48. ACM, New York.
BENNETT, C. and SHARPLEY, R. (1988). Interpolation of Operators. Pure and Applied Mathematics 129. Academic Press, Boston, MA. MR0928802
BICKEL, P. J., RITOV, Y. and TSYBAKOV, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732. MR2533469
BOYD, S., PARIKH, N., CHU, E., PELEATO, B. and ECKSTEIN, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–122.
CAPONNETTO, A. and DE VITO, E. (2007). Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7 331–368. MR2335249
CHAPELLE, O., VAPNIK, V., BOUSQUET, O. and MUKHERJEE, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning 46 131–159.
CORTES, C., MOHRI, M. and ROSTAMIZADEH, A. (2009a). Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams and A. Culotta, eds.) 396–404. Curran Associates, Red Hook, NY.
CORTES, C., MOHRI, M. and ROSTAMIZADEH, A. (2009b). L2 regularization for learning kernels. In The 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009) (J. Bilmes and A. Ng, eds.). AUAI Press, Corvallis.
FERREIRA, J. C. and MENEGATTO, V. A. (2009). Eigenvalues of integral operators defined by smooth positive definite kernels. Integral Equations Operator Theory 64 61–81. MR2501172
KIMELDORF, G. and WAHBA, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82–95. MR0290013
KLOFT, M. and BLANCHARD, G. (2012). On the convergence rate of $\ell_p$-norm multiple kernel learning. J. Mach. Learn. Res. 13 2465–2501. MR2973607
KLOFT, M., RÜCKERT, U. and BARTLETT, P. L. (2010). A unifying view of multiple kernel learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD) (J. L. Balcázar, F. Bonchi, A. Gionis and M. Sebag, eds.). Lecture Notes in Computer Science 6322 66–81. Springer, Berlin.
KLOFT, M., BREFELD, U., SONNENBURG, S., LASKOV, P., MÜLLER, K. R. and ZIEN, A. (2009). Efficient and accurate $\ell_p$-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams and A. Culotta, eds.) 997–1005. Curran Associates, Red Hook, NY.
KOLTCHINSKII, V. and YUAN, M. (2008). Sparse recovery in large ensembles of kernel machines. In Proceedings of the Annual Conference on Learning Theory (R. Servedio and T. Zhang, eds.) 229–238. Omnipress, Madison, WI.
KOLTCHINSKII, V. and YUAN, M. (2010). Sparsity in multiple kernel learning. Ann. Statist. 38 3660–3695. MR2766864
LANCKRIET, G., CRISTIANINI, N., GHAOUI, L. E., BARTLETT, P. and JORDAN, M. (2004). Learning the kernel matrix with semi-definite programming. J. Mach. Learn. Res. 5 27–72.
MEIER, L., VAN DE GEER, S. and BÜHLMANN, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 53–71. MR2412631
MEIER, L., VAN DE GEER, S. and BÜHLMANN, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821. MR2572443
MICCHELLI, C. A. and PONTIL, M. (2005). Learning the kernel function via regularization. J. Mach. Learn. Res. 6 1099–1125. MR2249850
ONG, C. S., SMOLA, A. J. and WILLIAMSON, R. C. (2005). Learning the kernel with hyperkernels. J. Mach. Learn. Res. 6 1043–1071. MR2249848
RASKUTTI, G., WAINWRIGHT, M. and YU, B. (2009). Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams and A. Culotta, eds.) 1563–1570. Curran Associates, Red Hook, NY.
RASKUTTI, G., WAINWRIGHT, M. J. and YU, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res. 13 389–427. MR2913704
SCHÖLKOPF, B. and SMOLA, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
SHAWE-TAYLOR, J. (2008). Kernel learning for novelty detection. In NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal Kernels.
SHAWE-TAYLOR, J. and CRISTIANINI, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press, New York.
SREBRO, N. and BEN-DAVID, S. (2006). Learning bounds for support vector machines with learned kernels. In Learning Theory. Lecture Notes in Computer Science 4005 169–183. Springer, Berlin. MR2280605
STEINWART, I. and CHRISTMANN, A. (2008). Support Vector Machines. Springer, New York. MR2450103
STEINWART, I., HUSH, D. and SCOVEL, C. (2009). Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.) 79–93. Omnipress, Madison, WI.
SUZUKI, T. (2011a). Unifying framework for fast learning rate of non-sparse multiple kernel learning. In Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira and K. Weinberger, eds.) 1575–1583. Curran Associates, Red Hook, NY.
SUZUKI, T. (2011b). Fast learning rate of non-sparse multiple kernel learning and optimal regularization strategies. Available at arXiv:1111.3781.
SUZUKI, T. and SUGIYAMA, M. (2013). Supplement to "Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness." DOI:10.1214/13-AOS1095.
TOMIOKA, R. and SUZUKI, T. (2009). Sparsity-accuracy trade-off in MKL. In NIPS 2009 Workshop: Understanding Multiple Kernel Learning Methods.
VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York. MR1385671
VARMA, M. and BABU, B. R. (2009). More generality in efficient multiple kernel learning. In The 26th International Conference on Machine Learning (L. Bottou and M. Littman, eds.) 1065–1072. Omnipress, Madison, WI.
YING, Y. and CAMPBELL, C. (2009). Generalization bounds for learning the kernel. In Proceedings of the Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.). Omnipress, Madison, WI.
ZOU, H. and HASTIE, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320. MR2137327

DEPARTMENT OF MATHEMATICAL INFORMATICS
GRADUATE SCHOOL OF INFORMATION SCIENCE AND TECHNOLOGY
UNIVERSITY OF TOKYO
7-3-1 HONGO, BUNKYO-KU
TOKYO, JAPAN
E-MAIL: [email protected]

DEPARTMENT OF COMPUTER SCIENCE
GRADUATE SCHOOL OF INFORMATION SCIENCE AND ENGINEERING
TOKYO INSTITUTE OF TECHNOLOGY
2-12-1 O-OKAYAMA, MEGURO-KU
TOKYO, JAPAN
E-MAIL: [email protected]