
OPTIMAL RATES FOR REGULARIZED LEAST-SQUARES ALGORITHM

    A. CAPONNETTO AND E. DE VITO

Abstract. We develop a theoretical analysis of the generalization performance of the regularized least-squares algorithm on a reproducing kernel Hilbert space in the supervised learning setting. The presented results hold in the general framework of vector-valued functions; therefore they can be applied to multi-task problems. In particular, we observe that the concept of effective dimension plays a central role in the definition of a criterion for the choice of the regularization parameter as a function of the number of samples. Moreover, a complete minimax analysis of the problem is described, showing that the convergence rates obtained by regularized least-squares estimators are indeed optimal over a suitable class of priors defined by the considered kernel. Finally, we give an improved lower rate result describing the worst asymptotic behavior on individual probability measures rather than over classes of priors.

    1. Introduction

In this paper we investigate the estimation properties of the regularized least-squares (RLS) algorithm on a reproducing kernel Hilbert space (RKHS) in the regression setting. Following the general scheme of supervised statistical learning theory, the available input-output samples are assumed to be drawn i.i.d. according to an unknown probability distribution. The aim of a regression algorithm is to estimate a particular invariant of the unknown distribution, the regression function, using only the available empirical samples. Hence the asymptotic performance of the algorithm is usually evaluated by the rate of convergence of its estimates to the regression function. The main result of this paper shows a choice for the regularization parameter of RLS such that the resulting algorithm is optimal in a minimax sense for a suitable class of priors.

The RLS algorithm on RKHS of real-valued functions (i.e. when the output space is equal to R) has been extensively studied in the literature; for an account see [33], [30], [17] and references therein. For the case X = R^d with the RKHS a Sobolev space, optimal rates were established assuming a suitable smoothness condition on the regression function (see [18] and references therein). For an arbitrary RKHS and compact input space, in [5] a covering number technique was used to obtain non-asymptotic upper bounds expressed in terms of suitable complexity measures of the regression function (see also [33] and [37]). In [8], [26], [9], [27] the covering techniques were replaced by estimates of integral operators through concentration inequalities for vector-valued random variables. Although expressed in terms of

Date: April 27, 2006.
2000 Mathematics Subject Classification. Primary 68T05, 68P30.
Key words and phrases. Model Selection, Optimal Rates, Regularized Least-Squares Algorithm.
Corresponding author: Andrea Caponnetto, +16172530549 (phone), +16172532964 (fax).



easily computable quantities, the latter bounds do not exploit much information about the fine structure of the kernel. Here we show that such information can be used to obtain tighter bounds. The approach we consider is a refinement of the functional analytical techniques presented in [9]. The central concept in this development is the effective dimension of the problem. This idea was recently used in [36] and [19] in the analysis of the performance of kernel methods for learning. Indeed, in this paper we show that the effective dimension plays a central role in the definition of an optimal rule for the choice of the regularization parameter as a function of the number of samples.

Although the previous investigations in [8], [26], [9], [27] showed that operator and spectral methods are valuable tools for the performance analysis of kernel-based algorithms such as RLS, all these results failed to compare with similar results recently obtained using entropy methods (see [10], [29]). These results (e.g. [29], Theorem 1.3) showed that the optimal rate of convergence is essentially determined by the entropy characteristic of the considered class of priors with respect to a suitable topology induced by ρ_X, the marginal probability measure over the input space. Clearly, entropy numbers, and therefore rates of convergence, depend dramatically on ρ_X. However ρ_X seems not to be crucial in the rates found in [8], [26], [9], [27]. This observation was our original motivation for taking into account the effective dimension: a spectral theoretical parameter which quantifies some capacity properties of ρ_X by means of the kernel. In fact, the effective dimension turned out to be the right parameter, in our operator analytical framework, to get rates comparable to the ones defined in terms of entropy numbers.

Recently various papers, [34], [1], [20], [14], have addressed the multi-task learning problem using kernel techniques. For instance, [34] employs two kernels, one on the input space and the other on the output space, in order to represent similarity measures on the respective domains. The underlying similarity measures are supposed to capture some inherent regularity of the phenomenon under investigation and should be chosen according to the available prior knowledge. On the contrary, in [1] the prior knowledge is encoded by a single kernel on the space of input-output couples, and a generalization of standard support vector machines is proposed. It was in [20] and [14] that for the first time in the learning theory literature it was pointed out that particular scalar kernels defined on input-output couples can be profitably mapped onto operator-valued kernels defined on the input space. However, to our knowledge, a thorough error analysis for the regularized least-squares algorithm when the output space is a general Hilbert space had never been given before. Our result fills this gap and is based on the well known fact (see for example [25] and [3]) that the machinery of scalar positive defined kernels can be elegantly extended to cope with vector-valued functions using operator-valued positive kernels. An asset of our treatment is the extreme generality of the mathematical setting, which in fact subsumes most of the frameworks of its type available in the literature. Here, the input space is an arbitrary Polish space and the output space any separable Hilbert space. We only assume that the output y has finite variance and that the random variable y − E[y|x], conditionally on the input x, satisfies a moment condition à la Bennett (see Hypothesis 2 in Section 3).

The other characterizing feature of this paper is the minimax analysis. The first lower rate result (Th. 2 in Section 4) shows that the error rate attained by the RLS algorithm with our choice for the regularization parameter is optimal on a suitable


class of probability measures. The class of priors we consider depends on two parameters: the first is a measure of the complexity of the regression function, as in [26]; the other one is related to the effective dimension of the marginal probability measure over the input space relative to the chosen kernel; roughly speaking, it counts the number of degrees of freedom associated to the kernel and the marginal measure, available at a given conditioning. This kind of minimax analysis is standard in the statistical literature but it has received less attention in the context of learning theory (see for instance [17], [10], [29] and references therein). The main issue with this kind of approach to minimax problems in statistical learning is that the bad distribution in the prior could depend on the number of available samples. In fact we are mainly interested in a worst case analysis for increasing number of samples and fixed probability measure. This type of problem is well known in approximation theory. The idea of comparing minimax rates of approximation over classes of functions with rates of approximation of individual functions goes back to Bernstein's problem (see [28], Section 2, for an historical account and some examples). In the context of learning theory this problem was recently considered in [17], Section 3, where the notion of individual lower rate was introduced. Theorem 3 in Section 4 gives a new lower rate of this type, greatly generalizing analogous previous results.

The paper is organized as follows. In Section 2 we briefly recall the main concepts of the regression problem in the context of supervised learning theory [6], [15], [23]; however, the formalism could be easily rephrased using the language of non-parametric regression as in [17]. In particular we define the notions of upper, lower and optimal uniform rates over priors of probability measures. These concepts will be the main topic of Sections 4 and 5. In Section 3 we introduce the formalism of operator-valued kernels and the corresponding RKHS. Moreover we describe the mathematical assumptions required by the subsequent developments.

The assumptions specify conditions on both the RKHS (see Hypothesis 1) and the probability measure on the samples (see Hypothesis 2). Finally, we introduce (see Definition 1) the class of priors that will be considered throughout the minimax analysis.

In Section 4 we state the three main results of the paper, establishing upper and lower uniform rates for the regularized least-squares algorithm. The focus of our exposition on asymptotic rates, rather than on confidence analysis for finite sample size, was motivated by the decision to stress the aspects relevant to the main topic of investigation of the paper: optimality. However all the results could be, with a relatively small effort, reformulated in terms of a non-asymptotic confidence analysis.

    The proofs of the theorems stated in Section 4 are postponed to Section 5.

    2. Learning from examples

We now introduce some basic concepts of statistical learning theory in the regression setting for vector-valued outputs (for details see [32], [15], [24], [6], [20] and references therein). In the framework of learning from examples there are two sets of variables: the input space X and the output space Y. The relation between the input x ∈ X and the output y ∈ Y is described by a probability distribution ρ(x, y) = ρ_X(x) ρ(y|x) on X × Y, where ρ_X is the marginal distribution on X and ρ(·|x) is the conditional


distribution of y given x ∈ X. The distribution ρ is known only through a training set z = (x, y) = ((x_1, y_1), …, (x_ℓ, y_ℓ)) of ℓ examples drawn independently and identically distributed (i.i.d.) according to ρ. Given the sample z, the aim of learning theory is to find a function f_z : X → Y such that f_z(x) is a good estimate of the output y when a new input x is given. The function f_z is called the estimator, and a learning algorithm is the rule that, for any ℓ ∈ N, gives to every training set z ∈ Z^ℓ the corresponding estimator f_z. We also use the notation f_ℓ(z), or equivalently f_z^ℓ, for the estimator, every time we want to stress its dependence on the number ℓ of examples. For the same reason, a learning algorithm will often be represented as a sequence {f_ℓ}_{ℓ∈N} of mappings f_ℓ from Z^ℓ to the set of functions Y^X.

If the output space Y is a Hilbert space, given a function f : X → Y, the ability of f to describe the distribution ρ is measured by its expected risk

E[f] = ∫_{X×Y} ‖f(x) − y‖_Y² dρ(x, y).

The minimizer of the expected risk over the space of all the measurable Y-valued functions on X is the regression function

f_ρ(x) = ∫_Y y dρ(y|x).

The final aim of learning theory is to find an algorithm such that E[f_z] is close to E[f_ρ] with high probability. However, if the estimators f_z are picked from a hypothesis space H which is not dense in L²(X, ρ_X), approaching E[f_ρ] is too ambitious, and one can only hope to attain the expected risk inf_{f∈H} E[f].

A learning algorithm {f_ℓ}_{ℓ∈N} is said to be universally consistent¹ if, for every distribution ρ such that ∫_Y ‖y‖_Y² dρ(x, y) < +∞,

lim_{ℓ→+∞} P_z[ E[f_z^ℓ] − inf_{f∈H} E[f] > ε ] = 0 for every ε > 0.

Universal consistency is an important and well known property of many learning algorithms, among which is the regularized least-squares algorithm that will be introduced later. However, if H is infinite dimensional, the rate of convergence in the limit above cannot be uniform on the set of all the distributions, but only on some restricted class P defined in terms of prior assumptions on ρ. In this paper the priors² are suitable classes P of probability distributions encoding our knowledge of the relation between x and y. In particular we consider a family of priors P(b, c) (see Definition 1) depending on two parameters: the effective dimension of H (with respect to ρ_X) and a notion of complexity of the regression function f_ρ which generalizes to arbitrary reproducing kernel Hilbert spaces the degree of smoothness

¹ Various different definitions of consistency can be found in the literature (see [17], [11]): in probability, a.s., weak, strong. In more restrictive settings than ours, for example assuming that ‖f_z(x) − y‖_Y is bounded, it is possible to prove equivalence results between some of these definitions (e.g. between weak and strong consistency). However this is not true under our assumptions. Moreover our definition of consistency is weaker than analogous ones in the literature because we replaced E[f_ρ] with inf_{f∈H} E[f] in order to deal with hypothesis spaces which are not dense in L²(X, ρ_X).
² The concept of prior considered here should not be confused with its homologous in Bayesian statistics.


of f_ρ, usually defined for regression in Sobolev spaces (see [11], [30], [17], [5], [9] and references therein).

Assuming ρ in a suitably small prior P, it is possible to study the uniform convergence properties of learning algorithms. A natural way to do that is considering the confidence function (see [10], [29])

inf_f sup_{ρ∈P} P_z[ E[f(z)] − inf_{f∈H} E[f] > ε ],  ℓ ∈ N, ε > 0,

where the infimum is over all the mappings f : Z^ℓ → H. The learning algorithms {f_ℓ}_{ℓ∈N} attaining the minimization are optimal over P in the minimax sense. The main purpose of this paper (accomplished by Theorems 1 and 2) is showing that, for any P in the considered family of priors, the regularized least-squares algorithm (with a suitable choice of the regularization parameter) shares the asymptotic convergence properties of the optimal algorithms.

Let us now introduce the regularized least-squares algorithm [33], [22], [6], [37]. In this framework the hypothesis space H is a given Hilbert space of functions f : X → Y and, for any λ > 0 and z ∈ Z^ℓ, the RLS estimator f_z^λ is defined as the solution of the minimization problem

min_{f∈H} { (1/ℓ) ∑_{i=1}^ℓ ‖f(x_i) − y_i‖_Y² + λ ‖f‖_H² }.

In the following the regularization parameter λ = λ_ℓ is some function of the number of examples ℓ.
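As a concrete numerical illustration (an editor's sketch, not part of the paper): in the scalar case Y = R, by the standard representer theorem the minimizer has the form f_z^λ(·) = ∑_i α_i K(·, x_i) with α = (K + ℓλ I)^{−1} y, K being the kernel Gram matrix on the inputs. The Gaussian kernel and the synthetic data below are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(s, t, width=1.0):
    # An illustrative scalar kernel on X = R.
    return np.exp(-((s - t) ** 2) / (2 * width ** 2))

def rls_fit(xs, ys, lam, kernel=gaussian_kernel):
    """Regularized least-squares estimator for scalar outputs.

    Minimizing (1/l) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 over the
    RKHS reduces, via the representer theorem, to the linear system
    (K + l * lam * I) alpha = y with K the kernel Gram matrix.
    """
    l = len(xs)
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    alpha = np.linalg.solve(K + l * lam * np.eye(l), ys)
    return lambda x: sum(a * kernel(x, xi) for a, xi in zip(alpha, xs))

# Noisy samples of the regression function f_rho(x) = sin(x).
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 3.0, 40)
ys = np.sin(xs) + 0.1 * rng.standard_normal(40)
f = rls_fit(xs, ys, lam=1e-3)
```

Shrinking lam toward 0 interpolates the noise; Theorem 1 prescribes how λ_ℓ should decay with the sample size.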

The first result of the paper is a bound on the upper rate of convergence for the RLS algorithm with a suitable choice of λ_ℓ, under the assumption ρ ∈ P. That is, we prove the existence of a sequence (a_ℓ)_{ℓ≥1} such that

(1) lim_{τ→+∞} limsup_{ℓ→+∞} sup_{ρ∈P} P_z[ E[f_z^{λ_ℓ}] − inf_{f∈H} E[f] > τ a_ℓ ] = 0.

More precisely, Theorem 1 shows that there is a choice λ = λ_ℓ such that the rate of convergence is a_ℓ = ℓ^{−bc/(bc+1)}, where 1 < c ≤ 2 is a parameter related to the complexity of f_H and b > 1 is a parameter related to the effective dimension of H. The second result shows that this rate is optimal if Y is finite dimensional.

Following the analysis presented in [17], we formulate this problem in the framework of minimax lower rates. More precisely, a minimax lower rate of convergence for the class P is a sequence (a_ℓ)_{ℓ≥1} of positive numbers such that

(2) lim_{τ→0} liminf_{ℓ→+∞} inf_f sup_{ρ∈P} P_z[ E[f(z)] − inf_{f∈H} E[f] > τ a_ℓ ] > 0,

where the infimum is over all the mappings f : Z^ℓ → H. The definitions of lower and upper rates are given with respect to convergence in probability, as in [30], coherently with the optimization problem inherent in the definition of the confidence function. On the contrary, in [17] convergence in expectation was considered. Clearly, an upper rate in expectation induces an upper rate in probability, and a lower rate in probability induces a lower rate in expectation. The choice of the parameter λ = λ_ℓ is optimal over the prior P if it is possible to find a minimax lower rate (a_ℓ)_{ℓ≥1} which is also an upper rate for the algorithm f_z^{λ_ℓ}. Theorem 2 shows the optimality of the choice of λ_ℓ given by Theorem 1.


The minimax lower rates are not completely satisfactory in the statistical learning setting. In fact, from the definitions above it is clear that the bad distribution (the one maximizing P_z[ E[f_z^{λ_ℓ}] − inf_{f∈H} E[f] > τ a_ℓ ]) could change for different values of the number of samples ℓ. Instead, usually one would like to know how the excess error, E[f_z^{λ_ℓ}] − inf_{f∈H} E[f], decreases as the number of samples grows for a fixed probability measure ρ in P. This type of issue is well known, and has been extensively analyzed in the context of approximation theory (see [28], Section 2). To overcome the problem, one needs to consider a different type of lower rate: the individual lower rates. Precisely, an individual lower rate of convergence for the prior P is a sequence (a_ℓ)_{ℓ≥1} of positive numbers such that

(3) inf_{{f_ℓ}_{ℓ∈N}} sup_{ρ∈P} limsup_{ℓ→+∞} E_z[ E[f_z^ℓ] − inf_{f∈H} E[f] ] / a_ℓ > 0,

where the infimum is over the set of learning algorithms {f_ℓ}_{ℓ∈N}. Theorem 3 proves an individual lower bound in expectation. However, in order to show the optimality of the regularized least-squares algorithm in the sense of individual rates, it remains to prove either an upper rate in expectation or an individual lower rate in probability.

    3. Notations and assumptions

The aim of this section is to set the notation, to state and discuss the main assumptions we need to prove our results, and to describe precisely the class of priors on which the bounds hold uniformly.

We assume that the input space X is a Polish space³ and the output space Y is a real separable Hilbert space. We let Z be the product space X × Y, which is a Polish space too.

We let ρ be the probability measure describing the relation between x ∈ X and y ∈ Y. By ρ_X we denote the marginal distribution on X and by ρ(·|x) the conditional distribution on Y given x ∈ X, both existing since Z is a Polish space; see Th. 10.2.2 of [12].

We state the main assumptions on H and ρ.

Hypothesis 1. The space H is a separable Hilbert space of functions f : X → Y such that:
• for all x ∈ X there is a Hilbert–Schmidt⁴ operator K_x : Y → H satisfying
(4) f(x) = K_x^* f, f ∈ H,
where K_x^* : H → Y is the adjoint of K_x;
• the real function from X × X to R,
(5) (x, t) ↦ ⟨K_t v, K_x w⟩_H, is measurable for all v, w ∈ Y;
• there is κ > 0 such that
(6) Tr(K_x^* K_x) ≤ κ, x ∈ X.

³ A Polish space is a separable metrizable topological space which is complete with respect to a metric compatible with the topology. Any locally compact second countable space is Polish.
⁴ An operator A : Y → H is a Hilbert–Schmidt operator if for some (any) basis (v_j)_j of Y it holds Tr(A^* A) = ∑_j ⟨A v_j, A v_j⟩_H < +∞.


Hypothesis 2. The probability measure ρ on Z satisfies the following properties:
• (7) ∫_Z ‖y‖_Y² dρ(x, y) < +∞;
• there exists f_H ∈ H such that
(8) E[f_H] = inf_{f∈H} E[f],
where E[f] = ∫_Z ‖f(x) − y‖_Y² dρ(x, y);
• there are two positive constants Σ and M such that
(9) ∫_Y ( e^{‖y − f_H(x)‖_Y / M} − ‖y − f_H(x)‖_Y / M − 1 ) dρ(y|x) ≤ Σ² / (2M²)
for ρ_X-almost all x ∈ X.

We now briefly discuss the consequences of the above assumptions.
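For intuition (an editor's illustration with assumed numbers, not from the paper), condition (9) can be checked by Monte Carlo for a simple bounded noise model; with noise uniform on [−1/2, 1/2] the choice M = Σ = 1 works comfortably:

```python
import numpy as np

# Monte Carlo check of the Bennett-type moment condition (9) for a bounded
# noise model: y - f_H(x) uniform on [-1/2, 1/2] (illustrative assumption).
rng = np.random.default_rng(4)
M, Sigma = 1.0, 1.0

eps = rng.uniform(-0.5, 0.5, 200_000)       # samples of y - f_H(x)
u = np.abs(eps) / M
lhs = float(np.mean(np.exp(u) - u - 1.0))   # estimates the integral in (9)
rhs = Sigma ** 2 / (2 * M ** 2)
```

Here lhs ≈ 0.05 while rhs = 0.5, so (9) holds with ample margin; bounded, Gaussian and subgaussian noise satisfy (9) for suitable constants, as noted later in the text.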

If Y = R, the operator K_x can be identified with the vector K_x 1 ∈ H and (4) reduces to

f(x) = ⟨f, K_x⟩_H, f ∈ H, x ∈ X,

so that H is a reproducing kernel Hilbert space [2] with kernel

(10) K(x, t) = ⟨K_t, K_x⟩_H.

In fact, the theory of reproducing kernel Hilbert spaces can naturally be extended to vector-valued functions [25]. In particular, the assumption that K_x is a Hilbert–Schmidt operator is useful in keeping the generalized theory similar to the scalar one. Indeed, let L(Y) be the space of bounded linear operators on Y with the uniform norm ‖·‖_{L(Y)}. In analogy with (10), let K : X × X → L(Y) be the (vector-valued) reproducing kernel

K(x, t) = K_x^* K_t, x, t ∈ X.

Since K_x is a Hilbert–Schmidt operator, there is a basis (v_j(x))_j of Y and an orthogonal sequence (k_j(x))_j of vectors in H such that

K_x v = ∑_j ⟨v, v_j(x)⟩_Y k_j(x), v ∈ Y,

with the condition ∑_j ‖k_j(x)‖_H² < +∞. The reproducing kernel becomes

K(x, t) v = ∑_{j,m} ⟨k_j(t), k_m(x)⟩_H ⟨v, v_j(t)⟩_Y v_m(x), v ∈ Y,

and (6) is equivalent to

∑_j ‖k_j(x)‖_H² ≤ κ, x ∈ X.

Remark 1. If Y is finite dimensional, any linear operator is Hilbert–Schmidt and (4) is equivalent to the fact that the evaluation functional on H,

f ↦ f(x),

is continuous for all x ∈ X. Moreover, the reproducing kernel K takes values in the space of d × d matrices (where d = dim Y). In this finite dimensional setting the vector-valued RKHS formalism can be rephrased in terms of ordinary scalar-valued functions. Indeed, let (v_j)_{j=1}^d be a basis of Y, X̃ = X × {1, …, d} and K̃ : X̃ × X̃ → R be the kernel

K̃(x, j; t, i) = ⟨K_t v_i, K_x v_j⟩_H.

Since K̃ is symmetric and positive definite, let H̃ be the corresponding reproducing kernel Hilbert space, whose elements are real functions on X̃ [2]. Any element f ∈ H can be identified with the function f̃ ∈ H̃ given by

f̃(x, j) = ⟨f(x), v_j⟩_Y, x ∈ X, j = 1, …, d.

Moreover the expected risk becomes

E[f] = ∫_Z ∑_j ⟨f(x) − y, v_j⟩_Y² dρ(x, y) = ∑_j ∫_{X×R} (f̃(x, j) − y_j)² dρ_j(x, y_j) = d ∫_{X̃×R} (f̃(x̃) − ỹ)² dρ̃(x̃, ỹ) = d Ẽ[f̃],

where the ρ_j are the marginal distributions with respect to the projections

(x, y) ↦ (x, ⟨y, v_j⟩_Y), (x, y) ∈ X × Y,

and ρ̃ is the probability distribution on X̃ × R given by ρ̃ = (1/d) ∑_j ρ_j. In a similar way, the regularized empirical risk becomes

(1/ℓ) ∑_{i=1}^ℓ ∑_{j=1}^d (f̃(x_i, j) − y_{i,j})² + λ ‖f̃‖_{H̃}², y_{i,j} = ⟨y_i, v_j⟩_Y,

where the example (x_i, y_i) is replaced by the d examples ((x_i, 1), y_{i,1}), …, ((x_i, d), y_{i,d}). However, in this scalar setting the examples are not i.i.d., so we decided to state the results in the framework of vector-valued functions. Moreover this does not result in more complex proofs; the theorems are stated in a basis-independent form and hold also for infinite dimensional Y.
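The correspondence between matrix-valued and scalar kernels can be probed numerically. In this sketch (the separable kernel K(x, t) = k(x, t)B is a standard illustrative construction, not taken from the paper) the Gram matrix of the induced scalar kernel on the augmented input space is checked to be symmetric positive semidefinite:

```python
import numpy as np

d = 2
B = np.array([[2.0, 1.0], [1.0, 2.0]])  # PSD matrix coupling the output components

def k_scalar(x, t):
    return np.exp(-((x - t) ** 2))  # illustrative scalar Gaussian kernel on X = R

def K_matrix(x, t):
    # Matrix-valued kernel K(x, t) in R^{d x d} (separable form k(x, t) * B).
    return k_scalar(x, t) * B

def K_tilde(x, j, t, i):
    # Scalar kernel on the augmented input space X~ = X x {0, ..., d-1}.
    return K_matrix(x, t)[j, i]

# Gram matrix of K~ over augmented points = flattened Gram matrix of K.
xs = np.linspace(0.0, 1.0, 5)
points = [(x, j) for x in xs for j in range(d)]
G = np.array([[K_tilde(x, j, t, i) for (t, i) in points] for (x, j) in points])
eigs = np.linalg.eigvalsh(G)  # should all be >= 0 up to round-off
```

With this point ordering G is the Kronecker product of the scalar Gram matrix with B, which makes its positive semidefiniteness transparent.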

Coming back to the discussion on the assumptions: the requirement that H is separable avoids problems with measurability and allows us to employ vector-valued concentration inequalities. If (5) is replaced by the stronger condition that

(x, t) ↦ ⟨K_t v, K_x w⟩_H is continuous for all v, w ∈ Y,

the fact that X and Y are separable implies that H is separable, too [4]. Condition (5) and the fact that H is separable ensure that the functions f ∈ H are measurable from X to Y, whereas (6) implies that they are bounded functions. Indeed, (6) implies that

(11) ‖K_x^*‖_{L(H,Y)} = ‖K_x‖_{L(Y,H)} ≤ √(Tr(K_x^* K_x)) ≤ √κ,

and (4) gives

‖f(x)‖_Y = ‖K_x^* f‖_Y ≤ √κ ‖f‖_H, x ∈ X.

Regarding the distribution ρ, it is clear that if (7) is not satisfied, then E[f] = +∞ for all f ∈ H and the learning problem does not make sense. If it holds, (5) and (6) are the minimal requirements to ensure that any f ∈ H has a finite expected risk (see item i) of Prop. 1).


In general f_H is not unique as an element of H, but we recover uniqueness by choosing the one with minimal norm in H. If the regression function

f_ρ(x) = ∫_Y y dρ(y|x)

belongs to H, clearly f_H = f_ρ. However, in general the existence of f_H is a weaker condition than f_ρ ∈ H; for example, if H is finite dimensional, f_H always exists. Condition (8) is essential to define the class of priors for which both the upper bound and the lower bound hold uniformly. Finally, (9) is a model of the noise of the output y and it is satisfied, for example, if the noise is bounded, Gaussian or subgaussian [31].

We now introduce some more notation needed to state our bounds. Let L_2(H) be the separable Hilbert space of Hilbert–Schmidt operators on H with scalar product

⟨A, B⟩_{L_2(H)} = Tr(B^* A)

and norm

‖A‖_{L_2(H)} = √(Tr(A^* A)) ≥ ‖A‖_{L(H)}.

Given x ∈ X, let

(12) T_x = K_x K_x^* ∈ L(H),

which is a positive operator. A simple computation shows that

Tr T_x = Tr K_x^* K_x ≤ κ,

so that T_x is a trace class operator and, a fortiori, a Hilbert–Schmidt operator. Hence

(13) ‖T_x‖_{L(H)} ≤ ‖T_x‖_{L_2(H)} ≤ Tr(T_x) ≤ κ.

We let T : H → H be

(14) T = ∫_X T_x dρ_X(x),

where the integral converges in L_2(H) to a positive trace class operator with

(15) ‖T‖_{L(H)} ≤ Tr T = ∫_X Tr T_x dρ_X(x) ≤ κ

(see item ii) of Prop. 1). Moreover, the spectral theorem gives

(16) T = ∑_{n=1}^N t_n ⟨·, e_n⟩_H e_n,

where (e_n)_{n=1}^N is a basis of (Ker T)^⊥ (possibly N = +∞) and 0 < t_{n+1} ≤ t_n, with ∑_{n=1}^N t_n = Tr T ≤ κ. We now discuss the class of priors.

Definition 1. Let us fix the positive constants M, Σ, R, α, and β. Then, given 1 < b ≤ +∞ and 1 ≤ c ≤ 2, we define P = P(b, c) as the set of probability distributions ρ on Z such that:

i) Hypothesis 2 holds with the given choice for M and Σ in (9);

ii) there is g ∈ H such that f_H = T^{(c−1)/2} g with ‖g‖_H² ≤ R;

iii) if b < +∞, then N = +∞ and the eigenvalues of T given by (16) satisfy
(17) α ≤ n^b t_n ≤ β, n ≥ 1,
whereas if b = +∞, then N < +∞.

We recall that, for λ > 0, for any ℓ ∈ N and any training set z = (x, y) = ((x_1, y_1), …, (x_ℓ, y_ℓ)) ∈ Z^ℓ, the estimator f_z^λ is defined as the solution of the minimization problem

(18) min_{f∈H} { (1/ℓ) ∑_{i=1}^ℓ ‖f(x_i) − y_i‖_Y² + λ ‖f‖_H² },

whose existence and uniqueness is well known (see item v) of Prop. 1).


Theorem 1. Given 1 < b ≤ +∞ and 1 ≤ c ≤ 2, let

(19) λ_ℓ = (1/ℓ)^{b/(bc+1)} if b < +∞, c > 1;  λ_ℓ = (log ℓ / ℓ)^{b/(b+1)} if b < +∞, c = 1;  λ_ℓ = (1/ℓ)^{1/2} if b = +∞,

and

(20) a_ℓ = (1/ℓ)^{bc/(bc+1)} if b < +∞, c > 1;  a_ℓ = (log ℓ / ℓ)^{b/(b+1)} if b < +∞, c = 1;  a_ℓ = 1/ℓ if b = +∞.

Then

(21) lim_{τ→+∞} limsup_{ℓ→+∞} sup_{ρ∈P(b,c)} P_z[ E[f_z^{λ_ℓ}] − E[f_H] > τ a_ℓ ] = 0.

The above result gives a family of upper rates of convergence for the RLS algorithm as defined in (1). The following theorem proves that the corresponding minimax lower rates (see eq. (2)) hold.
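For illustration (a helper sketch, not from the paper), the three cases of the schedules (19) and (20) can be tabulated as a function of the sample size ℓ:

```python
import numpy as np

def theorem1_schedule(l, b, c):
    """Return (lambda_l, a_l) following the three cases of eqs. (19)-(20):
    b < +inf with c > 1, b < +inf with c = 1, and b = +inf."""
    if np.isinf(b):
        return l ** -0.5, 1.0 / l
    if c > 1:
        lam = l ** (-b / (b * c + 1))
        rate = l ** (-b * c / (b * c + 1))
    else:  # c == 1, logarithmic correction
        lam = (np.log(l) / l) ** (b / (b + 1))
        rate = lam
    return lam, rate

# With b = 2 and c = 2: lambda_l = l^(-2/5) and a_l = l^(-4/5).
lam, rate = theorem1_schedule(10_000, b=2.0, c=2.0)
```

Larger b (faster eigenvalue decay) and larger c (smoother f_H) both push the rate a_ℓ toward the parametric rate 1/ℓ.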

Theorem 2. Assume that dim Y = d < +∞, 1 < b < +∞ and 1 ≤ c ≤ 2. Then

lim_{τ→0} liminf_{ℓ→+∞} inf_f sup_{ρ∈P(b,c)} P_z[ E[f(z)] − E[f_H] > τ ℓ^{−bc/(bc+1)} ] = 1,

where the infimum is over all the mappings f : Z^ℓ → H.

The above result shows that the rate of convergence given by the RLS algorithm is optimal when Y is finite dimensional for any 1 < b < +∞ (i.e. N = +∞) and 1 < c ≤ 2, and that it is optimal up to a logarithmic factor for c = 1. Finally, we give a result about the individual lower rates in expectation (see eq. (3)).

Theorem 3. Assume that dim Y = d < +∞, 1 < b < +∞ and 1 ≤ c ≤ 2. Then

inf_{{f_ℓ}_{ℓ∈N}} sup_{ρ∈P(b,c)} limsup_{ℓ→+∞} E_z[ E[f_z^ℓ] − E[f_H] ] / ℓ^{−bc/(bc+1)} > 0,

where the infimum is over the set of all learning algorithms {f_ℓ}_{ℓ∈N}.

The advantages of individual lower rates over minimax lower rates have already been discussed in Sections 1 and 2. Here we add that the proof of the theorem above can be straightforwardly modified in order to extend the range of the infimum to general randomized learning algorithms, that is, algorithms whose outputs are random variables depending on the training set. Such a generalization seems not an easy task to accomplish in the standard minimax setting. It should also be remarked that the condition 1 ≤ c ≤ 2 in Th. 3 has been introduced to keep the notation homogeneous throughout the sections of the paper, but it could be relaxed to 0 ≤ c ≤ 2.

    5. Proofs

    In this section we give the proofs of the three theorems stated above.


5.1. Preliminary results. We recall some known facts without reporting their proofs.

The first proposition summarizes some mathematical properties of the regularized least-squares algorithm. It is well known in the framework of linear inverse problems (see [13]), and a proof in the context of learning theory can be found in [7] and, for the scalar case, in [6].

Proposition 1. Assume Hypotheses 1 and 2. The following facts hold.

i) For all f ∈ H, f is measurable and

E[f] = ∫_Z ‖f(x) − y‖_Y² dρ(x, y) < +∞.

ii) The minimizers f_H are the solutions of the equation

(22) T f_H = g,

where T is the positive trace class operator defined by

T = ∫_X K_x K_x^* dρ_X(x) = ∫_X T_x dρ_X(x),

with the integral converging in L_2(H), and

(23) g = ∫_X K_x f_ρ(x) dρ_X(x) ∈ H,

with the integral converging in H.

iii) For all f ∈ H,

(24) E[f] − E[f_H] = ‖ √T (f − f_H) ‖_H².

iv) For any λ > 0, a unique minimizer f^λ of the regularized expected risk

E[f] + λ ‖f‖_H²

exists and is given by

(25) f^λ = (T + λ)^{−1} g = (T + λ)^{−1} T f_H.

v) Given a training set z = (x, y) = ((x_1, y_1), …, (x_ℓ, y_ℓ)) ∈ Z^ℓ, for any λ > 0 a unique minimizer f_z^λ of the regularized empirical risk

(1/ℓ) ∑_{i=1}^ℓ ‖f(x_i) − y_i‖_Y² + λ ‖f‖_H²

exists and is given by

(26) f_z^λ = (T_x + λ)^{−1} g_z,

where T_x : H → H is the positive finite rank operator

(27) T_x = (1/ℓ) ∑_{i=1}^ℓ T_{x_i}

and g_z ∈ H is given by

(28) g_z = (1/ℓ) ∑_{i=1}^ℓ K_{x_i} y_i.
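Items (26)-(28) can be checked in a finite-dimensional toy model (an editor's sketch: Y = R and H = R^4 realized through a hypothetical polynomial feature map phi, so that K_x = phi(x) and T_x = phi(x) phi(x)^T). The solution (T_x + λ)^{−1} g_z is verified to be a stationary point of the regularized empirical risk:

```python
import numpy as np

rng = np.random.default_rng(1)
p, l, lam = 4, 30, 0.1

def phi(x):
    # Hypothetical feature map R -> R^4; H is identified with R^4 and
    # f(x) = <w, phi(x)> for w in H.
    return np.array([1.0, x, x ** 2, x ** 3])

xs = rng.uniform(-1.0, 1.0, l)
ys = np.sin(xs) + 0.05 * rng.standard_normal(l)
Phi = np.stack([phi(x) for x in xs], axis=1)      # p x l matrix of features

Tx = (Phi @ Phi.T) / l                            # eq. (27): T_x = (1/l) sum_i T_{x_i}
gz = (Phi @ ys) / l                               # eq. (28): g_z = (1/l) sum_i K_{x_i} y_i
w = np.linalg.solve(Tx + lam * np.eye(p), gz)     # eq. (26): f_z^lambda

# Stationarity of the empirical functional (1/l)||Phi^T w - y||^2 + lam ||w||^2:
grad = (2.0 / l) * Phi @ (Phi.T @ w - ys) + 2.0 * lam * w
```

The gradient vanishes because 2(T_x w − g_z) + 2λw = 2((T_x + λ)w − g_z) = 0 by construction of w.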


By means of (4), T and g are explicitly given by

(29) (T f)(x) = K_x^* T f = ∫_X K_x^* (K_t K_t^*) f dρ_X(t) = ∫_X K(x, t) f(t) dρ_X(t),

so T acts as the integral operator with kernel K, and

(30) g(x) = K_x^* g = ∫_X K(x, t) f_ρ(t) dρ_X(t).

We also need the following probabilistic inequality, based on a result of [21]; see also Th. 3.3.4 of [35].

Proposition 2. Let (Ω, F, P) be a probability space and let ξ be a random variable on Ω taking values in a real separable Hilbert space K. Assume that there are two positive constants L and σ such that

(31) E[ ‖ξ − E[ξ]‖_K^m ] ≤ (1/2) m! σ² L^{m−2} for all m ≥ 2;

then, for all ℓ ∈ N and 0 < η < 1,

(32) P[ ‖ (1/ℓ) ∑_{i=1}^ℓ ξ(ω_i) − E[ξ] ‖_K ≤ 2 ( L/ℓ + σ/√ℓ ) log(2/η) ] ≥ 1 − η.

In particular, (31) holds whenever

(33) ‖ξ(ω)‖_K ≤ L/2 almost surely and E[ ‖ξ‖_K² ] ≤ σ²/2.

In the bounds below the following three quantities occur:

(1) the residual

A(λ) = E[f^λ] − E[f_H] = ‖ √T (f^λ − f_H) ‖_H²,
where f^λ ∈ H is the minimizer of the regularized expected risk (see item iv) of Prop. 1) and the second equality is a consequence of (24);

(2) the reconstruction error
B(λ) = ‖f^λ − f_H‖_H²;

(3) the effective dimension
N(λ) = Tr[ (T + λ)^{−1} T ],
which is finite due to the fact that T is trace class (see item ii) of Prop. 1).

Roughly speaking, the effective dimension N(λ) controls the complexity of the hypothesis space H according to the marginal measure ρ_X, whereas A(λ) and B(λ), which depend on ρ, control the complexity of f_H.
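As a numerical illustration (not from the paper), take the polynomial eigenvalue decay t_n = n^{−b} allowed by Definition 1 (with α = β = 1). The effective dimension is then finite for every λ > 0 and grows like λ^{−1/b} as λ → 0:

```python
import numpy as np

def effective_dimension(eigenvalues, lam):
    # N(lambda) = Tr[(T + lambda)^{-1} T] = sum_n t_n / (t_n + lambda).
    t = np.asarray(eigenvalues)
    return float(np.sum(t / (t + lam)))

b = 2.0
t = np.arange(1, 1_000_000, dtype=float) ** (-b)  # truncated spectrum t_n = n^{-b}

N1 = effective_dimension(t, 1e-2)
N2 = effective_dimension(t, 1e-4)
# For t_n = n^{-b}, N(lambda) ~ C * lambda^(-1/b); dividing lambda by 100
# with b = 2 should multiply N(lambda) by about 10.
ratio = N2 / N1
```

This λ^{−1/b} growth is exactly what produces the interplay between b and the rates of Theorem 1.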


Remark 3. In the framework of learning theory, A(λ) = ‖√T (f^λ − f_H)‖_H² = E[f^λ] − E[f_H] is called the approximation error, whereas in inverse problems theory the approximation error is usually B(λ) = ‖f^λ − f_H‖_H². In order to avoid confusion we adopt the nomenclature of inverse problems [13].

Next, Prop. 3 studies the asymptotic behavior of the above quantities when λ goes to zero, under the assumption that ρ ∈ P(b, c). Finally, from this result it is easy to derive a best choice for the parameter λ = λ_ℓ, giving rise to the claimed rate of convergence.

The following theorem gives a non-asymptotic upper bound which is of interest by itself.

Theorem 4. Let ρ satisfy Hypothesis 2, ℓ ∈ N, η > 0 and 0 < λ …


Step 2.1: probabilistic bound on ‖√T (T_x + λ)^{−1}‖_{L(H)}. Assume that

(38) θ(λ, z) := ‖(T + λ)^{−1}(T − T_x)‖_{L(H)} = ‖(T − T_x)(T + λ)^{−1}‖_{L(H)} ≤ 1/2

(the second equality holds since, if A, B are selfadjoint operators in L(H), then ‖AB‖_{L(H)} = ‖(AB)^*‖_{L(H)} = ‖BA‖_{L(H)}). Then the Neumann series gives

√T (T_x + λ)^{−1} = √T (T + λ)^{−1} ( I − (T − T_x)(T + λ)^{−1} )^{−1} = √T (T + λ)^{−1} ∑_{n=0}^{+∞} [ (T − T_x)(T + λ)^{−1} ]^n,

so that

‖√T (T_x + λ)^{−1}‖_{L(H)} ≤ ‖√T (T + λ)^{−1}‖_{L(H)} ∑_{n=0}^{+∞} ‖(T − T_x)(T + λ)^{−1}‖_{L(H)}^n ≤ (1/(1 − θ(λ, z))) ‖√T (T + λ)^{−1}‖_{L(H)} ≤ 2 ‖√T (T + λ)^{−1}‖_{L(H)}.

The spectral theorem ensures that ‖√T (T + λ)^{−1}‖_{L(H)} ≤ 1/(2√λ), so that

(39) ‖√T (T_x + λ)^{−1}‖_{L(H)} ≤ 1/√λ.

    We claim that (35) implies (38) with probability greater than 1 . To this aimwe apply Prop. 2 to the random variable 1: X L2(H)

    1(x) = (T+ )1Tx

    so that

    E[1] = (T+ )1T and 1

    i=1

    1(xi) = (T+ )1Tx.

    Moreover (13) and(T+ )1L(H) 1 imply

    L2(H)(T+ )1L(H) TxL2(H) = L12 .

    Condition (13) ensures thatTx is of trace class and the inequality

    (40) Tr(AB) AL(H)Tr B(A positive bounded operator, B positive trace class operator) implies

    E[

    1

    2

    L2(H

    )] = X TrTx T12x (T+ )

    2T12x dX(x)

    X

    TxL(H)Tr

    (T+ )2Tx

    dX(x)

    ( (13) ) Tr (T+ )2T= Tr

    (T+ )1

    (T+ )

    12 T(T+ )

    12

    ( (40) ) (T+ )1L(H)Tr (T+ )1T

    N() = 21.


by definition of the effective dimension $\mathcal{N}(\lambda)$. Hence (33) of Prop. 2 holds and (32) gives

$\|(T+\lambda)^{-1}(T_x - T)\|_{\mathcal{L}_2(\mathcal{H})} \le 2\log(6/\eta)\Big(\frac{2\kappa^2}{\lambda\ell} + \sqrt{\frac{\kappa^2\mathcal{N}(\lambda)}{\lambda\ell}}\Big)$

with probability greater than $1 - \eta/3$. Since $\log(6/\eta) \ge 1$ and the spectral decomposition of $T$ gives

$\mathcal{N}(\lambda) \ge \frac{\|T\|_{\mathcal{L}(\mathcal{H})}}{\|T\|_{\mathcal{L}(\mathcal{H})} + \lambda} \ge \frac12 \qquad\text{if } \lambda \le \|T\|_{\mathcal{L}(\mathcal{H})},$

if (35) holds, then

$\log(6/\eta)\Big(\frac{2\kappa^2}{\lambda\ell} + \sqrt{\frac{\kappa^2\mathcal{N}(\lambda)}{\lambda\ell}}\Big) \le \frac{4\log^2(6/\eta)\,\kappa^2\mathcal{N}(\lambda)}{\lambda\ell} + \sqrt{\frac{\log^2(6/\eta)\,\kappa^2\mathcal{N}(\lambda)}{\lambda\ell}} \le \frac{1}{16} + \frac{1}{8} \le \frac{1}{4},$

so that

(41) $\theta(\lambda, z) \le \|(T+\lambda)^{-1}(T_x - T)\|_{\mathcal{L}_2(\mathcal{H})} \le \frac12$

with probability greater than $1 - \eta/3$.

Step 2.2: probabilistic bound on $\|(T - T_x)(f_\lambda - f_{\mathcal{H}})\|_{\mathcal{H}}$. Now we apply Prop. 2 to the random variable $\xi_2 : X \to \mathcal{H}$,

$\xi_2(x) = T_x(f_\lambda - f_{\mathcal{H}}),$

so that

$\mathbb{E}[\xi_2] = T(f_\lambda - f_{\mathcal{H}}) \qquad\text{and}\qquad \frac{1}{\ell}\sum_{i=1}^{\ell}\xi_2(x_i) = T_x(f_\lambda - f_{\mathcal{H}}).$

Bound (13) and the definition of $B(\lambda)$ give

$\|\xi_2(x)\|_{\mathcal{H}} \le \|T_x\|_{\mathcal{L}(\mathcal{H})}\,\|f_\lambda - f_{\mathcal{H}}\|_{\mathcal{H}} \le \kappa^2\sqrt{B(\lambda)} = \frac{L_2}{2}.$

Since $T_x$ is a positive operator,

(42) $\langle T_x f, f\rangle_{\mathcal{H}} \le \|T_x\|_{\mathcal{L}(\mathcal{H})}\,\langle f, f\rangle_{\mathcal{H}} \qquad \forall f \in \mathcal{H},$

so that

$\mathbb{E}\big[\|\xi_2\|^2_{\mathcal{H}}\big] = \int_X \big\langle T_x\, T_x^{\frac12}(f_\lambda - f_{\mathcal{H}}),\, T_x^{\frac12}(f_\lambda - f_{\mathcal{H}})\big\rangle_{\mathcal{H}}\, d\rho_X(x) \le \int_X \|T_x\|_{\mathcal{L}(\mathcal{H})}\,\langle T_x(f_\lambda - f_{\mathcal{H}}), f_\lambda - f_{\mathcal{H}}\rangle_{\mathcal{H}}\, d\rho_X(x)$

(by (13)) $\le \kappa^2\,\langle T(f_\lambda - f_{\mathcal{H}}), f_\lambda - f_{\mathcal{H}}\rangle_{\mathcal{H}} = \kappa^2\,\|\sqrt{T}(f_\lambda - f_{\mathcal{H}})\|^2_{\mathcal{H}} = \kappa^2 A(\lambda) = \sigma_2^2,$

by definition of $A(\lambda)$. So (33) holds and (32) gives

(43) $\|(T - T_x)(f_\lambda - f_{\mathcal{H}})\|_{\mathcal{H}} \le 2\log(6/\eta)\Big(\frac{2\kappa^2\sqrt{B(\lambda)}}{\ell} + \sqrt{\frac{\kappa^2 A(\lambda)}{\ell}}\Big)$


with probability greater than $1 - \eta/3$. Replacing (39), (43) in (37), if (35) holds, it follows

(44) $S_2(\lambda, z) \le 8\log^2(6/\eta)\Big(\frac{4\kappa^4 B(\lambda)}{\lambda\ell^2} + \frac{\kappa^2 A(\lambda)}{\lambda\ell}\Big)$

with probability greater than $1 - 2\eta/3$.

Step 3: probabilistic bound on $S_1(\lambda, z)$. Clearly

(45) $S_1(\lambda, z) \le \big\|\sqrt{T}(T_x+\lambda)^{-1}(T+\lambda)^{\frac12}\big\|^2_{\mathcal{L}(\mathcal{H})}\;\big\|(T+\lambda)^{-\frac12}(g_z - T_x f_{\mathcal{H}})\big\|^2_{\mathcal{H}}.$

Step 3.1: bound on $\|\sqrt{T}(T_x+\lambda)^{-1}(T+\lambda)^{\frac12}\|_{\mathcal{L}(\mathcal{H})}$. Since

$\sqrt{T}(T_x+\lambda)^{-1}(T+\lambda)^{\frac12} = \sqrt{T}(T+\lambda)^{-\frac12}\Big(I - (T+\lambda)^{-\frac12}(T - T_x)(T+\lambda)^{-\frac12}\Big)^{-1},$

reasoning as in Step 2.1 it follows

$\Big\|\Big(I - (T+\lambda)^{-\frac12}(T - T_x)(T+\lambda)^{-\frac12}\Big)^{-1}\Big\|_{\mathcal{L}(\mathcal{H})} \le 2,$

provided that

(46) $\big\|(T+\lambda)^{-\frac12}(T - T_x)(T+\lambda)^{-\frac12}\big\|_{\mathcal{L}(\mathcal{H})} \le \frac12.$

Moreover, the spectral theorem ensures that $\|\sqrt{T}(T+\lambda)^{-\frac12}\|_{\mathcal{L}(\mathcal{H})} \le 1$, so

(47) $\big\|\sqrt{T}(T_x+\lambda)^{-1}(T+\lambda)^{\frac12}\big\|_{\mathcal{L}(\mathcal{H})} \le 2.$

We will show that (46) holds for the training sets $z$ satisfying (41). Indeed, if $B = (T+\lambda)^{-\frac12}(T - T_x)(T+\lambda)^{-\frac12}$, then

$\|B\|^2_{\mathcal{L}_2(\mathcal{H})} = \operatorname{Tr}\big[(T+\lambda)^{-\frac12}(T - T_x)(T+\lambda)^{-1}(T - T_x)(T+\lambda)^{-\frac12}\big] = \operatorname{Tr}\big[(T+\lambda)^{-1}(T - T_x)(T+\lambda)^{-1}(T - T_x)\big]$
$= \big\langle (T+\lambda)^{-1}(T - T_x),\ (T - T_x)(T+\lambda)^{-1}\big\rangle_{\mathcal{L}_2(\mathcal{H})} \le \|(T+\lambda)^{-1}(T - T_x)\|_{\mathcal{L}_2(\mathcal{H})}\,\|(T - T_x)(T+\lambda)^{-1}\|_{\mathcal{L}_2(\mathcal{H})} = \|(T+\lambda)^{-1}(T - T_x)\|^2_{\mathcal{L}_2(\mathcal{H})}.$

If (35) holds, then (41) ensures that (46) holds with probability greater than $1 - \eta/3$.

Step 3.2: bound on $\|(T+\lambda)^{-\frac12}(g_z - T_x f_{\mathcal{H}})\|_{\mathcal{H}}$. Let $\xi_3 : Z \to \mathcal{H}$ be the random variable

$\xi_3(x, y) = (T+\lambda)^{-\frac12}\, K_x\,(y - f_{\mathcal{H}}(x)).$

First of all, (22) gives

$\mathbb{E}[\xi_3] = (T+\lambda)^{-\frac12}(g - T f_{\mathcal{H}}) = 0$

and (9) implies

$\int_Y \|y - f_{\mathcal{H}}(x)\|_Y^m\, d\rho(y|x) \le \frac12\, m!\,\sigma^2 M^{m-2} \qquad \forall\, m \ge 2$


(see for example [31]). It follows that

$\mathbb{E}\big[\|\xi_3\|^m_{\mathcal{H}}\big] = \int_Z \big\langle K_x^*(T+\lambda)^{-1}K_x\,(y - f_{\mathcal{H}}(x)),\ y - f_{\mathcal{H}}(x)\big\rangle_Y^{m/2}\, d\rho(x, y)$

(by (42)) $\le \int_X \|K_x^*(T+\lambda)^{-1}K_x\|^{\frac{m-2}{2}}_{\mathcal{L}(Y)}\operatorname{Tr}\big[K_x^*(T+\lambda)^{-1}K_x\big]\Big(\int_Y \|y - f_{\mathcal{H}}(x)\|^m_Y\, d\rho(y|x)\Big)\, d\rho_X(x)$

$\le \frac{m!}{2}\,\sigma^2 M^{m-2}\ \sup_{x\in X}\|K_x^*(T+\lambda)^{-1}K_x\|^{\frac{m-2}{2}}_{\mathcal{L}(Y)}\ \int_X \operatorname{Tr}\big[(T+\lambda)^{-1}T_x\big]\, d\rho_X(x)$

(by (13) and $\|(T+\lambda)^{-1}\|_{\mathcal{L}(\mathcal{H})} \le \frac1\lambda$) $\le \frac12\, m!\,\sigma^2 M^{m-2}\Big(\frac{\kappa}{\sqrt{\lambda}}\Big)^{m-2}\operatorname{Tr}\big[(T+\lambda)^{-1}T\big] = \frac12\, m!\,\big(\sigma\sqrt{\mathcal{N}(\lambda)}\big)^2\Big(\frac{M\kappa}{\sqrt{\lambda}}\Big)^{m-2}.$

Hence (31) holds with $L_3 = \frac{M\kappa}{\sqrt{\lambda}}$ and $\sigma_3 = \sigma\sqrt{\mathcal{N}(\lambda)}$, and (32) gives

(48) $\big\|(T+\lambda)^{-\frac12}(g_z - T_x f_{\mathcal{H}})\big\|_{\mathcal{H}} \le 2\log(6/\eta)\Big(\frac{M\kappa}{\sqrt{\lambda}\,\ell} + \sqrt{\frac{\sigma^2\mathcal{N}(\lambda)}{\ell}}\Big)$

with probability greater than $1 - \eta/3$. Replacing (47), (48) in (45),

(49) $S_1(\lambda, z) \le 32\log^2(6/\eta)\Big(\frac{\kappa^2 M^2}{\lambda\ell^2} + \frac{\sigma^2\mathcal{N}(\lambda)}{\ell}\Big)$

with probability greater than $1 - \eta$. Replacing bounds (44), (49) in (36),

$\mathcal{E}[f_z^\lambda] - \mathcal{E}[f_{\mathcal{H}}] \le 3A(\lambda) + 8\log^2(6/\eta)\Big(\frac{4\kappa^4 B(\lambda)}{\lambda\ell^2} + \frac{\kappa^2 A(\lambda)}{\lambda\ell} + \frac{4\kappa^2 M^2}{\lambda\ell^2} + \frac{4\sigma^2\mathcal{N}(\lambda)}{\ell}\Big)$

and (34) follows by bounding the numerical constants with 32.

The second step in the proof of the upper bound is the study of the asymptotic behavior of $\mathcal{N}(\lambda)$, $A(\lambda)$ and $B(\lambda)$ when $\lambda$ goes to zero. It is known that

$\lim_{\lambda\to 0} A(\lambda) = 0, \qquad \lim_{\lambda\to 0} B(\lambda) = 0, \qquad \lim_{\lambda\to 0} \mathcal{N}(\lambda) = N,$

see, for example, [16] and [13]. However, to state uniform rates of convergence we need some prior assumptions on the distribution $\rho$.

Proposition 3. Let $\rho \in \mathcal{P}(b, c)$ with $1 \le c \le 2$ and $1 < b \le +\infty$; then

$A(\lambda) \le \lambda^c\,\big\|T^{\frac{1-c}{2}} f_{\mathcal{H}}\big\|^2_{\mathcal{H}} \qquad\text{and}\qquad B(\lambda) \le \lambda^{c-1}\,\big\|T^{\frac{1-c}{2}} f_{\mathcal{H}}\big\|^2_{\mathcal{H}}.$

Moreover, if $b < +\infty$ (so that $N = +\infty$),

$\mathcal{N}(\lambda) \le \frac{b}{b-1}\,\lambda^{-\frac{1}{b}},$

whereas if $b = +\infty$ (so that $N < +\infty$), $\mathcal{N}(\lambda) \le N$.


so that, since $\lambda_\ell = \ell^{-\frac{b}{bc+1}}$, condition (35) is fulfilled for $\ell$ large enough and, by Prop. 3,

$\frac{\mathcal{N}(\lambda_\ell)}{\ell} \le \frac{b}{b-1}\,\ell^{-\frac{bc}{bc+1}}.$

So for any $\rho \in \mathcal{P}$, bound (50) holds with $\lambda = \lambda_\ell$. Since $\lambda_\ell^c = \ell^{-\frac{bc}{bc+1}}$, it follows that

$\limsup_{\ell\to+\infty}\ \sup_{\rho\in\mathcal{P}(b,c)}\ P_z\Big[\mathcal{E}[f_z^{\lambda_\ell}] - \mathcal{E}[f_{\mathcal{H}}] > \tau\,\ell^{-\frac{bc}{bc+1}}\Big] \le \eta_\tau.$

Since $\lim_{\tau\to+\infty}\eta_\tau = \lim_{\tau\to+\infty} 6\,e^{-\sqrt{\tau/(64D)}} = 0$, the thesis follows.
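The choice $\lambda_\ell = \ell^{-\frac{b}{bc+1}}$ can be motivated by a back-of-the-envelope balance of the two dominant terms of bound (34) under the prior $\mathcal{P}(b, c)$: $A(\lambda) \asymp \lambda^c$ and $\mathcal{N}(\lambda)/\ell \asymp \lambda^{-1/b}/\ell$. The sketch below is an illustration with unit constants (an assumption, not the paper's argument); it minimizes the two-term surrogate on a grid and checks that the minimizer scales as $\ell^{-b/(bc+1)}$.

```python
import numpy as np

b, c = 2.0, 1.5  # assumed prior parameters for the illustration

def bound_terms(lam, ell):
    # Dominant terms of (34) under P(b, c), with all constants set to 1:
    # A(lambda) ~ lambda^c  and  N(lambda)/ell ~ lambda^(-1/b) / ell.
    return lam ** c + lam ** (-1.0 / b) / ell

for ell in [10**4, 10**6]:
    lams = np.logspace(-8, 0, 2000)
    best = lams[np.argmin(bound_terms(lams, ell))]
    predicted = ell ** (-b / (b * c + 1))
    print(ell, best, predicted)
```

Up to a constant factor depending on $b$ and $c$, the grid minimizer follows the predicted power of $\ell$, and plugging it back in gives the rate $\ell^{-bc/(bc+1)}$.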

Assume now $b = +\infty$.


where the constants depend only on $\mathcal{P}$. Finally, we apply a theorem of [10] to obtain the claimed lower bound.

We recall that the Kullback-Leibler information of two measures $\rho_1$ and $\rho_2$ is defined by

$K(\rho_1, \rho_2) = \int_Z \log\varphi(z)\, d\rho_1(z),$

where $\varphi$ is the density of $\rho_1$ with respect to $\rho_2$, that is, $\rho_1(E) = \int_E \varphi(z)\, d\rho_2(z)$ for all measurable sets $E$.

In the following, we choose $\rho_0 \in \mathcal{P}(b, c)$ and we let $\nu$ be its marginal measure. Since $\rho_0$ satisfies Hypothesis 2, the operator $T$ has the spectral decomposition

(52) $T = \int_X T_x\, d\nu(x) = \sum_{n=1}^{+\infty} t_n\,\langle\,\cdot\,, e_n\rangle_{\mathcal{H}}\, e_n,$

where $(e_n)_{n=1}^{+\infty}$ is an orthonormal sequence in $\mathcal{H}$ and, since $\rho_0 \in \mathcal{P}(b, c)$,

$t_n \ge \alpha\, n^{-b} \qquad n \ge 1.$

The proposition below associates to any vector $f$ belonging to a suitable subclass of $\mathcal{H}$ a corresponding probability measure $\rho_f$ that belongs to $\mathcal{P}(b, c)$. In particular, $\rho_f$ will have the same marginal distribution $\nu$, so that the corresponding operator $T_f$ defined by (14) with $\rho_X = (\rho_f)_X$ is in fact given by (52).

Proposition 4. Let $(v_j)_{j=1}^d$ be a basis of $Y$. Given $f \in \mathcal{H}$ such that $f = T^{\frac{c-1}{2}} g$ for some $g \in \mathcal{H}$, $\|g\|^2 \le R$, let $d\rho_f(x, y) = d\nu(x)\, d\rho_f(y|x)$, where

$\rho_f(y|x) = \frac{1}{2dL}\sum_{j=1}^d \Big[(L + \langle f, K_x v_j\rangle_{\mathcal{H}})\,\delta_{dLv_j} + (L - \langle f, K_x v_j\rangle_{\mathcal{H}})\,\delta_{-dLv_j}\Big]$

with $L = 4\kappa^c\sqrt{R}$, and $\delta_{\pm dLv_j}$ is the Dirac measure on $Y$ at the point $\pm dLv_j$. Then $\rho_f$ is a probability measure with marginal distribution $(\rho_f)_X = \nu$ and regression function $f_{\rho_f} = f \in \mathcal{H}$. Moreover, $\rho_f \in \mathcal{P}(b, c)$ provided that

(53) $\min(M, \sigma) \ge 2(4d+1)\,\kappa^c\sqrt{R}.$

Moreover, if $f' \in \mathcal{H}$ is such that $f' = T^{\frac{c-1}{2}} g'$ for some $g' \in \mathcal{H}$, $\|g'\|^2 \le R$, then the Kullback-Leibler information $K(\rho_f, \rho_{f'})$ fulfills the inequality

(54) $K(\rho_f, \rho_{f'}) \le \frac{16}{15dL^2}\,\big\|\sqrt{T}(f - f')\big\|^2_{\mathcal{H}}.$

Proof. The definition of $\rho_f$, (11) and (13) imply

(55) $|\langle f, K_x v_j\rangle_{\mathcal{H}}| \le \|T^{\frac{c-1}{2}} g\|_{\mathcal{H}}\,\|K_x\|_{\mathcal{L}(Y,\mathcal{H})} \le \kappa^c\sqrt{R} = \frac{L}{4}.$

It follows that $\rho_f(y|x)$ is a probability measure on $Y$ and

$\int_Y y\, d\rho_f(y|x) = \sum_j \langle f, K_x v_j\rangle_{\mathcal{H}}\, v_j = K_x^* f = f(x).$

So $\rho_f$ is a probability measure on $Z$, the marginal distribution is $\nu$ and the regression function $f_{\rho_f} = f \in \mathcal{H}$. In particular, condition (8) holds with $f_{\mathcal{H}} = f$


and items ii) and iii) of Def. 1 are satisfied.

Clearly, (7) is satisfied since $\rho_f(y|x)$ has finite support. Moreover, (53) ensures

$\|y - f(x)\|_Y \le \|y\|_Y + \|K_x^* f\|_Y \le dL + \kappa^c\sqrt{R} = (4d+1)\,\kappa^c\sqrt{R} \le M$

and

$\mathbb{E}\big[\|y - f(x)\|^2_Y\big] = \frac{1}{2dL}\sum_{j=1}^d\Big[(L + \langle f, K_xv_j\rangle_{\mathcal{H}})\,\|dLv_j - f(x)\|^2_Y + (L - \langle f, K_xv_j\rangle_{\mathcal{H}})\,\|dLv_j + f(x)\|^2_Y\Big] = d^2L^2 - \|f(x)\|^2_Y \le 16d^2\kappa^{2c}R \le \sigma^2,$

so that (9) is satisfied (indeed, since $\|y - f(x)\|_Y \le M$ almost surely, $\int_Y \|y - f(x)\|^m_Y\, d\rho_f(y|x) \le \frac12\, m!\,\sigma^2 M^{m-2}$ for all $m \ge 2$).

The proof of (54) is the same as Lemma 3.2 of [10]. We only sketch the main steps. If $\varphi = \frac{d\rho_f}{d\rho_{f'}}$, clearly

$\log\varphi(x, \pm dLv_j) = \log\frac{L \pm \langle f, K_xv_j\rangle_{\mathcal{H}}}{L \pm \langle f', K_xv_j\rangle_{\mathcal{H}}} = \log\Big(1 \pm \frac{\langle f - f', K_xv_j\rangle_{\mathcal{H}}}{L \pm \langle f', K_xv_j\rangle_{\mathcal{H}}}\Big) \le \pm\,\frac{\langle f - f', K_xv_j\rangle_{\mathcal{H}}}{L \pm \langle f', K_xv_j\rangle_{\mathcal{H}}},$

so that

$K(\rho_f, \rho_{f'}) \le \frac{1}{2dL}\sum_{j=1}^d\int_X\Big[\frac{\langle f - f', K_xv_j\rangle_{\mathcal{H}}}{L + \langle f', K_xv_j\rangle_{\mathcal{H}}}\,\big(L + \langle f, K_xv_j\rangle_{\mathcal{H}}\big) - \frac{\langle f - f', K_xv_j\rangle_{\mathcal{H}}}{L - \langle f', K_xv_j\rangle_{\mathcal{H}}}\,\big(L - \langle f, K_xv_j\rangle_{\mathcal{H}}\big)\Big]\, d\nu(x)$

$= \frac{1}{d}\sum_{j=1}^d\int_X \frac{\langle f - f', K_xv_j\rangle^2_{\mathcal{H}}}{L^2 - \langle f', K_xv_j\rangle^2_{\mathcal{H}}}\, d\nu(x)$

(by (55)) $\le \frac{16}{15dL^2}\sum_{j=1}^d\int_X \langle f - f', K_xv_j\rangle^2_{\mathcal{H}}\, d\nu(x)$

$= \frac{16}{15dL^2}\int_X \|K_x^*(f - f')\|^2_Y\, d\nu(x) = \frac{16}{15dL^2}\int_X \langle T_x(f - f'), f - f'\rangle_{\mathcal{H}}\, d\nu(x) = \frac{16}{15dL^2}\,\langle T(f - f'), f - f'\rangle_{\mathcal{H}}.$
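The key estimate (54) can be sanity-checked in its simplest scalar instance ($d = 1$, a single point $x$): there the conditional measures are two-point distributions with masses $(L \pm a)/(2L)$ and the Kullback-Leibler divergence is bounded by $\frac{16}{15L^2}(a - a')^2$ whenever $|a|, |a'| \le L/4$. The sketch below is a numerical check of this scalar instance only (the value of $L$ is an arbitrary test choice, not from the paper), not the general proof.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4.0  # arbitrary test value; the constraint |a| <= L/4 mirrors (55)

def kl_two_point(a, a_prime):
    # Two-point conditional measures of Prop. 4 for d = 1:
    # mass (L + a)/(2L) at +L and (L - a)/(2L) at -L.
    p = np.array([(L + a) / (2 * L), (L - a) / (2 * L)])
    q = np.array([(L + a_prime) / (2 * L), (L - a_prime) / (2 * L)])
    return float(np.sum(p * np.log(p / q)))

ok = True
for _ in range(1000):
    a, ap = rng.uniform(-L / 4, L / 4, size=2)
    ok &= kl_two_point(a, ap) <= 16.0 / (15.0 * L ** 2) * (a - ap) ** 2 + 1e-12
print(ok)
```

The bound holds here because $KL \le \chi^2$ and, for two-point measures with $|a'| \le L/4$, the chi-square distance equals $(a - a')^2/(L^2 - a'^2) \le \frac{16}{15L^2}(a - a')^2$.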

Proposition 5. There is an $\varepsilon_0 > 0$ such that for all $0 < \varepsilon \le \varepsilon_0$ there exist $N_\varepsilon \in \mathbb{N}$ and $f_1, \ldots, f_{N_\varepsilon} \in \mathcal{H}$ (depending on $\varepsilon$) satisfying

i) for all $i = 1, \ldots, N_\varepsilon$, $f_i = T^{\frac{c-1}{2}} g_i$ for some $g_i \in \mathcal{H}$ with $\|g_i\|^2_{\mathcal{H}} \le R$;

ii) for all $i \ne j = 1, \ldots, N_\varepsilon$,

(56) $\varepsilon \le \|\sqrt{T}(f_i - f_j)\|^2_{\mathcal{H}} \le 4\varepsilon$;

iii) there is a constant $\gamma$ depending only on $R$, $b$ and $c$ such that

(57) $N_\varepsilon \ge e^{\gamma\,\varepsilon^{-\frac{1}{bc}}}$.

Proof. Let $m \in \mathbb{N}$ be such that $m > 16$ and $\sigma_1, \ldots, \sigma_{N_\varepsilon} \in \{-1, +1\}^m$ be given by Prop. 6, so that

(58) $\sum_{n=1}^m (\sigma_i^n - \sigma_j^n)^2 \ge m,$

(59) $N_\varepsilon \ge e^{\frac{m}{24}}.$

In the following we will choose $m$ as a function of $\varepsilon$ in such a way that the statement of the proposition will be true. Given $\varepsilon > 0$, for all $i = 1, \ldots, N_\varepsilon$ let

$g_i = \sum_{n=1}^m \sqrt{\frac{\varepsilon}{m\, t_n^c}}\;\sigma_i^n\, e_n$

(see (52)). Since $t_n \ge \alpha\, n^{-b}$, then

$\|g_i\|^2_{\mathcal{H}} = \sum_{n=1}^m \frac{\varepsilon}{m\, t_n^c} \le \frac{\varepsilon}{m}\sum_{n=1}^m \Big(\frac{n^b}{\alpha}\Big)^c \le C\,\varepsilon\, m^{bc},$

where here and in the following $C$ is a constant depending only on $R$, $b$ and $c$. Hence $\|g_i\|^2 \le R$ provided that

$\varepsilon\, m^{bc} \le \frac{R}{C},$

and we let $m = m_\varepsilon$ be

(60) $m = \big\lfloor C'\,\varepsilon^{-\frac{1}{bc}}\big\rfloor$

for a suitable constant $C' > 0$ (where $\lfloor x\rfloor$ is the greatest integer less than or equal to $x$). Clearly, since $m_\varepsilon$ goes to $+\infty$ when $\varepsilon$ goes to 0, there is $\varepsilon_0$ such that $m_\varepsilon > 16$ for all $\varepsilon \le \varepsilon_0$.

Let now $f_i = T^{\frac{c-1}{2}} g_i$, as in the statement of the proposition; then

$\|\sqrt{T}(f_i - f_j)\|^2_{\mathcal{H}} = \|T^{\frac{c}{2}}(g_i - g_j)\|^2_{\mathcal{H}} = \sum_{n=1}^m \frac{\varepsilon}{m}\,(\sigma_i^n - \sigma_j^n)^2.$

Condition (58) and $(\sigma_i^n - \sigma_j^n)^2 \le 4$ imply

$\varepsilon \le \|\sqrt{T}(f_i - f_j)\|^2_{\mathcal{H}} \le 4\varepsilon,$

and (59) and (60) ensure

$N_\varepsilon \ge e^{\frac{m}{24}} \ge e^{\gamma\,\varepsilon^{-\frac{1}{bc}}}$

for a suitable constant $\gamma > 0$.

    The proof of the above proposition relies on the following result regarding packingnumbers over sets of binary strings.


Proposition 6. For every $m > 16$ there exist $N \in \mathbb{N}$ and $\sigma_1, \ldots, \sigma_N \in \{-1, +1\}^m$ such that

$\sum_{n=1}^m (\sigma_i^n - \sigma_j^n)^2 \ge m \qquad i \ne j = 1, \ldots, N,$

$N \ge e^{\frac{m}{24}},$

where $\sigma_i = (\sigma_i^1, \ldots, \sigma_i^m)$ and $\sigma_j = (\sigma_j^1, \ldots, \sigma_j^m)$.

Proof. We regard the vectors $\sigma \in \{-1, +1\}^m$ as sets of $m$ i.i.d. binary random variables distributed according to the uniform distribution $\frac12(\delta_{-1} + \delta_{+1})$. Let $\sigma$ and $\sigma'$ be two independent random vectors in $\{-1, +1\}^m$; then the real random variable

$d(\sigma, \sigma') = \sum_{n=1}^m (\sigma^n - \sigma'^n)^2 = \sum_{n=1}^m \zeta_n,$

where the $\zeta_n$ are independent random variables distributed according to the measure $\frac12(\delta_0 + \delta_4)$. The expectation of $d(\sigma, \sigma')$ is $2m$ and Hoeffding's inequality ensures that for every $\varepsilon > 0$

$P\big[\,|d(\sigma, \sigma') - 2m| > \varepsilon\,\big] \le 2\exp\Big(-\frac{\varepsilon^2}{8m}\Big).$

Setting $\varepsilon = m$ in the inequality above, we obtain

(61) $P\big[\,d(\sigma, \sigma') < m\,\big] \le 2\exp\Big(-\frac{m}{8}\Big).$

Now draw $N := \lceil e^{\frac{m}{24}}\rceil$ (where $\lceil x\rceil$ is the lowest integer greater than $x$) independent random points $\sigma_i$, $i = 1, \ldots, N$. From inequality (61), by the union bound it holds

$P\big[\,\exists\ 1 \le i, j \le N,\ i \ne j,\ \text{with } d(\sigma_i, \sigma_j) < m\,\big] \le (N^2 - N)\exp\Big(-\frac{m}{8}\Big) \le \frac{N^2 - N}{(N-1)^3} = \frac{N}{(N-1)^2} < 1,$

where we used that $m > 16$ implies $(N-1)^2 > N$ and $(N-1)^3 \le \exp(\frac{m}{8})$. Since this probability is smaller than 1, there is at least one realization of the points $\sigma_1, \ldots, \sigma_N$ fulfilling both conditions.

The following theorem is a restatement of Th. 3.1 of [10] in our setting.
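Before stating it, note that the packing construction of Prop. 6 is easy to illustrate by simulation: for random $\pm 1$ strings the squared distance $d(\sigma, \sigma') = \sum_n(\sigma^n - \sigma'^n)^2$ (which equals four times the Hamming distance) concentrates around $2m$, so drawing $\lceil e^{m/24}\rceil$ strings typically leaves every pairwise distance well above $m$. The sketch below is a numerical illustration with an arbitrary $m$, not part of the proof.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def draw_packing(m):
    # Draw N = ceil(exp(m/24)) i.i.d. uniform strings in {-1, +1}^m.
    N = math.ceil(math.exp(m / 24))
    sigma = rng.choice([-1, 1], size=(N, m))
    # Pairwise squared distances d(i, j) = sum_n (s_i^n - s_j^n)^2
    #                                    = 4 * (Hamming distance).
    min_d = min(
        int(((sigma[i] - sigma[j]) ** 2).sum())
        for i in range(N) for j in range(i + 1, N)
    )
    return N, min_d

m = 120
N, min_d = draw_packing(m)
print(N, min_d)  # with high probability min_d >= m
```

By (61) and the union bound, the probability that some pair falls below $m$ is at most $N^2 e^{-m/8}$, which for $m = 120$ is already below one percent.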

Theorem 5. Assume (53) and consider an arbitrary learning algorithm $z \mapsto f_z \in \mathcal{H}$, for $\ell \in \mathbb{N}$ and $z \in Z^\ell$. Then for all $\varepsilon \le \varepsilon_0$ and for all $\ell \in \mathbb{N}$ there is a $\rho^* \in \mathcal{P}(b, c)$ such that $f_{\rho^*} \in \mathcal{H}$ and it holds

$P_z\Big[\mathcal{E}[f_z] - \mathcal{E}[f_{\rho^*}] > \frac{\varepsilon}{4}\Big] \ge \min\Big\{\frac{N_\varepsilon}{N_\varepsilon + 1},\ \theta\,\sqrt{N_\varepsilon}\; e^{-\frac{4\ell\varepsilon}{15d\kappa^{2c}R}}\Big\},$

where $N_\varepsilon \ge e^{\gamma\,\varepsilon^{-\frac{1}{bc}}}$ and $\theta = e^{-3/e}$.

Proof. The proof is the same as in [10]. Given $\varepsilon \le \varepsilon_0$, let $N_\varepsilon$ and $f_1, \ldots, f_{N_\varepsilon}$ be as in Prop. 5. According to Prop. 4, let $\rho_i = \rho_{f_i}$. Assumption (53) ensures that $\rho_i \in \mathcal{P}(b, c)$.

Observe that, since all the measures $\rho_i$ have the same marginal distribution $\nu$ and $f_{\mathcal{H}} = f_{\rho_i} = f_i$,

$\mathcal{E}_{\rho_i}[f] - \mathcal{E}_{\rho_i}[f_i] = \|\sqrt{T}(f - f_i)\|^2_{\mathcal{H}} = \int_X \|f_i(x) - f(x)\|^2_Y\, d\nu(x).$

Given $\ell \in \mathbb{N}$, let

$A_i = \Big\{z \in Z^\ell\ \Big|\ \|\sqrt{T}(f_z - f_i)\|^2_{\mathcal{H}} < \frac{\varepsilon}{4}\Big\}$

for all $i = 1, \ldots, N_\varepsilon$. The lower bound of (56) ensures that $A_i \cap A_j = \emptyset$ if $i \ne j$, so that Lemma 3.3 of [10] ensures that there is $\rho^* = \rho_{i^*}$ such that either

$p = P_{\rho^*}\big[A_{i^*}^c\big] \ge \frac{N_\varepsilon}{N_\varepsilon + 1}$

(since $\frac{x}{x+1}$ is an increasing function and (57) holds) or, replacing the upper bound of (56) in Eq. 3.12 of [10],

$\frac{4\ell\varepsilon}{15d\kappa^{2c}R} \ge \ell\,\max_{i,j} K(\rho_i, \rho_j) \ge -\log p + \log\sqrt{N_\varepsilon} - \frac{3}{e}.$

Solving for $p$, the thesis follows.

The proof of Th. 2 is now an easy consequence of the above theorem.

Proof of Th. 2. Since whenever a minimax lower rate holds over a prior it holds a fortiori over a superset of it, without loss of generality we can assume

$R \le \Big(\frac{\min(M, \sigma)}{2(4d+1)\,\kappa^c}\Big)^2,$

hence enforcing condition (53).

Given $\tau > 0$, for all $\ell \in \mathbb{N}$ let $\varepsilon_\ell = \tau\,\ell^{-\frac{bc}{bc+1}}$. Since $\varepsilon_\ell$ goes to 0 when $\ell$ goes to $+\infty$, for $\ell$ large enough $\varepsilon_\ell \le \varepsilon_0$, so Th. 5 applies, ensuring

$\inf_{f}\ \sup_{\rho\in\mathcal{P}(b,c)}\ P_z\Big[\mathcal{E}[f_z] - \mathcal{E}[f_{\mathcal{H}}] > \frac{\tau}{4}\,\ell^{-\frac{bc}{bc+1}}\Big] \ge \min\Big\{\frac{N_{\varepsilon_\ell}}{N_{\varepsilon_\ell} + 1},\ \theta\, e^{(C_1\tau^{-\frac{1}{bc}} - C_2\tau)\,\ell^{\frac{1}{bc+1}}}\Big\},$

where $C_1$, $C_2$ are positive constants independent of $\ell$ and $\tau$. If $\ell$ goes to $\infty$, $\frac{N_{\varepsilon_\ell}}{N_{\varepsilon_\ell}+1}$ goes to 1, whereas, if $\tau$ is small enough, the quantity $C_1\tau^{-\frac{1}{bc}} - C_2\tau$ is positive, so that

$\lim_{\tau\to 0}\ \liminf_{\ell\to+\infty}\ \inf_{f}\ \sup_{\rho\in\mathcal{P}(b,c)}\ P_z\Big[\mathcal{E}[f_z] - \mathcal{E}[f_{\mathcal{H}}] > \frac{\tau}{4}\,\ell^{-\frac{bc}{bc+1}}\Big] = 1.$

5.4. Individual lower rate. The proof of Th. 3 is based on the similar result in [17], see Th. 3.3. Here that result is adapted to the general RKHS setting.

First of all we recall the following proposition, whose proof can be found in [17] (Lemma 3.2, pg. 38).

Proposition 7. Let $g \in \mathbb{R}^\ell$ and $s$ a $\{+1, -1\}$-valued random variable with $P[s = +1] = P[s = -1] = 1/2$. Moreover, let $n = (n_i)_{i=1}^\ell$ be independent random variables distributed according to the Gaussian with zero mean and variance $\bar\sigma^2$, independent of $s$. Set

$y = s\, g + n;$

then the error probability of the Bayes decision for $s$ based on $y$ is

$\min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}} P\big[D(y) \ne s\big] = \Phi\Big(-\frac{\|g\|}{\bar\sigma}\Big),$

where $\Phi$ is the standard normal distribution function.
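Proposition 7 can be verified numerically: with $y = s g + n$, the Bayes decision is the sign of the sufficient statistic $\langle y, g\rangle$, and its error probability equals $\Phi(-\|g\|/\bar\sigma)$. The sketch below is a Monte Carlo check with arbitrary test values of $g$ and $\bar\sigma$ (assumptions for illustration, not data from the paper).

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(1)
g = np.array([0.5, -0.3, 0.8])   # arbitrary fixed vector
sigma_bar = 1.0                  # arbitrary noise level
trials = 200_000

s = rng.choice([-1, 1], size=trials)
n = rng.normal(0.0, sigma_bar, size=(trials, g.size))
y = s[:, None] * g + n

# Bayes decision: the sign of <y, g>, since <y, g> = s*||g||^2 + <n, g>.
decision = np.where(y @ g >= 0, 1, -1)
empirical = np.mean(decision != s)
theoretical = std_normal_cdf(-np.linalg.norm(g) / sigma_bar)
print(empirical, theoretical)
```

The closed form follows because, conditionally on $s$, $\langle y, g\rangle$ is Gaussian with mean $s\|g\|^2$ and standard deviation $\bar\sigma\|g\|$.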

Proof of Th. 3. Let us reason for a fixed $B > b$, and let $\Delta := (B - b)c > 0$. We first define the subset $\mathcal{P}^*$ of $\mathcal{P}(b, c)$, then prove the lower rate on this subset. As in the proof of the minimax lower rate, we fix an arbitrary $\rho_0 \in \mathcal{P}(b, c)$ and let $\nu$ be its marginal measure. For every sequence $s = (s_n)_{n\in\mathbb{N}} \in \{+1, -1\}^{\infty}$, we define a corresponding function in $\mathcal{H}$

$m^{(s)} := \sum_{n=1}^{+\infty} s_n\,\sqrt{t_n^{-1}\gamma_n}\; e_n = \sum_{n=1}^{+\infty} s_n\, g_n,$

where

$\gamma_n := \frac{\Delta}{\Delta+1}\,\alpha^c R\; n^{-(bc+\Delta+1)}, \qquad g_n := \sqrt{t_n^{-1}\gamma_n}\; e_n,$

and we recall that $(t_n)_{n\in\mathbb{N}}$ and $(e_n)_{n\in\mathbb{N}}$ are the eigenvalues and eigenvectors of the operator $T$ defined by (52). Note that $bc + \Delta + 1 = Bc + 1$.

We define $\mathcal{P}^*$ to be the set of probability measures on $Z$ which fulfill the following two conditions:

• the marginal distribution $\rho_X$ is equal to $\nu$;

• there is $s \in \{+1, -1\}^{\infty}$ such that, for all $x \in X$, the conditional distribution of $y$ given $x$ is

$\rho(y|x) = \mathcal{N}\big(m^{(s)}(x),\ \bar\sigma^2 I_d\big),$

that is, the multivariate normal distribution on $Y$ with mean $m^{(s)}(x)$ and diagonal covariance $\bar\sigma^2 I_d$, with

$\bar\sigma^2 = \min\Big\{\frac{M^2}{2},\ \frac{\sigma^2\,\pi^{d/2}}{4 S_d \int_0^{+\infty} e^{-z^2+z}\, z^{d+1}\, dz}\Big\},$

and $S_d$ is the volume of the surface of the $d$-dimensional unit radius sphere.

It is simple to check that $\mathcal{P}^* \subset \mathcal{P}(b, c)$. Indeed, clearly $f_\rho = f_{\mathcal{H}} = m^{(s)}$ and

$\big\|T^{\frac{1-c}{2}} m^{(s)}\big\|^2_{\mathcal{H}} = \sum_{n=1}^{+\infty} t_n^{-(c-1)}\, t_n^{-1}\gamma_n = \sum_{n=1}^{+\infty} t_n^{-c}\gamma_n \le \frac{\Delta}{\Delta+1}\, R\sum_{n=1}^{+\infty} n^{-(1+\Delta)} \le \frac{\Delta}{\Delta+1}\, R\,\Big(\int_1^{+\infty} t^{-(1+\Delta)}\, dt + 1\Big) = \frac{\Delta}{\Delta+1}\,\frac{\Delta+1}{\Delta}\, R = R,$


where we used the lower bound (17). Moreover, $\mathcal{N}(0, \bar\sigma^2 I_d)$ fulfills the moment condition (9) in Hypothesis 2. Indeed, passing to polar coordinates and expanding the exponential in series,

$\int_Y \Big(e^{\frac{\|y - f_{\mathcal{H}}(x)\|_Y}{M}} - \frac{\|y - f_{\mathcal{H}}(x)\|_Y}{M} - 1\Big)\, d\rho(y|x) = (2\pi\bar\sigma^2)^{-\frac{d}{2}}\, S_d \int_0^{+\infty} e^{-\frac{z^2}{2\bar\sigma^2}}\, z^{d-1}\sum_{k=2}^{+\infty}\frac{1}{k!}\Big(\frac{z}{M}\Big)^k\, dz \le \frac{\bar\sigma^2}{M^2}\,\pi^{-\frac{d}{2}}\, S_d \int_0^{+\infty} e^{-z^2+z}\, z^{d+1}\, dz \le \frac{\sigma^2}{2M^2},$

where the middle inequality uses $\bar\sigma\sqrt{2} \le M$ together with the change of variable $z \mapsto \bar\sigma\sqrt{2}\, z$, and the last inequality follows from the definition of $\bar\sigma^2$. Hence (9) is satisfied.

We are now left with proving the lower bound on the reduced set $\mathcal{P}^*$ by showing the inequality

(62) $\inf_{\{f_\ell\}}\ \sup_{\rho\in\mathcal{P}^*}\ \limsup_{\ell\to+\infty}\ \frac{\mathbb{E}_z\big(\mathcal{E}[f_z^\ell] - \mathcal{E}[m^{(s)}]\big)}{\ell^{-\frac{Bc}{Bc+1}}} > 0.$

Since $(e_n)_{n\in\mathbb{N}}$ is an orthonormal sequence in $\mathcal{H}$, for any $s$ it holds

(63) $\mathcal{E}[f_z^\ell] - \mathcal{E}[m^{(s)}] = \|\sqrt{T}(f_z^\ell - m^{(s)})\|^2_{\mathcal{H}} = \sum_{n=1}^{+\infty} (c_{z,n} - s_n)^2\,\gamma_n,$

with

$c_{z,n} = \sqrt{\frac{t_n}{\gamma_n}}\;\langle f_z^\ell, e_n\rangle_{\mathcal{H}}.$

Now let $\hat c_{z,n}$ be 1 if $c_{z,n} \ge 0$ and $-1$ otherwise. Because of the straightforward inequality

(64) $|\hat c_{z,n} - s_n| \le 2\,|c_{z,n} - s_n|,$

from (63) we get

$\mathcal{E}[f_z^\ell] - \mathcal{E}[m^{(s)}] \ge \sum_{n=1}^{+\infty}\frac14\,(\hat c_{z,n} - s_n)^2\,\gamma_n = \sum_{n=1}^{+\infty} I_{\{\hat c_{z,n}\ne s_n\}}\,\gamma_n \ge \sum_{n\in D_\ell} I_{\{\hat c_{z,n}\ne s_n\}}\,\gamma_n,$

where the set $D_\ell$ is defined by $D_\ell := \{n \in \mathbb{N}\ |\ \ell\gamma_n \le 1\}$.

Note that, due to (64), and defining the quantity

(65) $R_\ell(s) := \sum_{n\in D_\ell} P_z\big[\hat c_{z,n}\ne s_n\big]\,\gamma_n \le \sum_{n\in D_\ell}\gamma_n,$

from (62) we are led to prove the inequality

(66) $\inf_{\{f_\ell\}}\ \sup_{s\in\{+1,-1\}^{\infty}}\ \limsup_{\ell\to+\infty}\ \frac{R_\ell(s)}{\ell^{-\frac{Bc}{Bc+1}}} > 0.$

This result is achieved considering a suitable probability measure over the set $\{+1, -1\}^{\infty}$ (and hence over $\mathcal{P}^*$ itself), and proving that the inequality above holds


true not just for the worst $s$, but also on average. Then let us introduce the sequence $S = (S_i)_{i\in\mathbb{N}}$ of independent $\{+1, -1\}$-valued random variables with

$P[S_i = +1] = P[S_i = -1] = \frac12 \qquad i \in \mathbb{N}.$

The plan is first showing that (66) is a consequence of the inequality

(67) $\mathbb{E}\, R_\ell(S) \ge C\sum_{n\in D_\ell}\gamma_n, \qquad C > 0,$

and subsequently proving that indeed (67) is true for some $C > 0$.

Defining the constants $u := \frac{\Delta}{\Delta+1}\,\alpha^c R$ and $v := \frac{1}{2Bc}\, u^{\frac{1}{Bc+1}}$, the definition of $\gamma_n$ gives

(68) $\sum_{n\in D_\ell}\gamma_n = \sum_{n \ge (u\ell)^{\frac{1}{Bc+1}}} u\, n^{-(Bc+1)} \ge u\int_{(u\ell)^{\frac{1}{Bc+1}}+1}^{+\infty} t^{-(Bc+1)}\, dt = \frac{u}{Bc}\,\big((u\ell)^{\frac{1}{Bc+1}}+1\big)^{-Bc} \ge v\,\ell^{-\frac{Bc}{Bc+1}},$

where the last inequality holds for all $\ell$ larger than a constant depending only on $u$, $B$ and $c$.

Then, using inequalities (67) and (68), we get

$\inf_{\{f_\ell\}}\ \sup_{s}\ \limsup_{\ell\to+\infty}\ \frac{R_\ell(s)}{\ell^{-\frac{Bc}{Bc+1}}} \ge C\,v\ \inf_{\{f_\ell\}}\ \sup_{s}\ \limsup_{\ell\to+\infty}\ \frac{R_\ell(s)}{\mathbb{E}\, R_\ell(S)} \ge C\,v\ \inf_{\{f_\ell\}}\ \mathbb{E}\Big[\limsup_{\ell\to+\infty}\frac{R_\ell(S)}{\mathbb{E}\, R_\ell(S)}\Big] \ge C\,v\ \inf_{\{f_\ell\}}\ \limsup_{\ell\to+\infty}\ \mathbb{E}\Big[\frac{R_\ell(S)}{\mathbb{E}\, R_\ell(S)}\Big] = C\,v > 0,$

where in the last estimate we applied Fatou's lemma, recalling that by inequalities (65) and (67) the sequence $\frac{R_\ell(S)}{\mathbb{E}\, R_\ell(S)}$ is uniformly bounded for every $s \in \{+1, -1\}^{\infty}$.

As planned, we finally proceed proving inequality (67). Recall that by definition

$\mathbb{E}\, R_\ell(S) = \sum_{n\in D_\ell} P\big[\hat c_{z,n}\ne S_n\big]\,\gamma_n,$

where $\hat c_{z,n}$ can be interpreted as a decision rule for the value of $S_n$ given $z$. The least error probability for such a problem is attained by the Bayes decision $c^*_{z,n}$, which outputs 1 if $P[S_n = 1\,|\,z] \ge 1/2$ and $-1$ otherwise; therefore

$P\big[\hat c_{z,n}\ne S_n\big] \ge P\big[c^*_{z,n}\ne S_n\big].$


Since by construction $S_n$ is independent of the $X$ component of the data $z$, we can reason conditionally on $(x_i)_{i=1}^\ell$. The dependence of the $Y$ component of $z$ on $S_n$ has the form

$y_i = m^{(s)}(x_i) + n_i = S_n\, g_n(x_i) + n_i + \sum_{k\ne n} S_k\, g_k(x_i), \qquad i = 1, \ldots, \ell,$

with $n_i$ independent $Y$-valued random variables distributed according to the Gaussian $\mathcal{N}(0, \bar\sigma^2 I_d)$. Hence it is clear that the component of $y_i$ perpendicular to $g_n(x_i)$ is independent of $S_n$. Consequently, the only dependence of $z$ on $S_n$ is determined by the longitudinal components

(69) $\bar y_i := \frac{\langle y_i, g_n(x_i)\rangle_Y}{\|g_n(x_i)\|_Y} = S_n\,\|g_n(x_i)\|_Y + \bar n_i + h_i,$

where the $\bar n_i$ are real-valued random variables distributed according to the Gaussian $\mathcal{N}(0, \bar\sigma^2)$ and

$h_i = \sum_{k\ne n} S_k\,\frac{\langle g_k(x_i), g_n(x_i)\rangle_Y}{\|g_n(x_i)\|_Y}.$

From equation (69) we see that the structure of the data available to the Bayes rule $c^*_{z,n}$ for $S_n$ is similar to that assumed in Prop. 7, except for the presence of the term $h = (h_i)_{i=1}^\ell$. However, this term is independent of $S_n$, and it is clear that the Bayes error cannot decrease when such a term is added to the available data; in fact

$\min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}} P\big[D(\bar g + h)\ne S_n\big] = \min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}}\ \mathbb{E}_h\, P\big[D(\bar g + h)\ne S_n\,\big|\, h\big] \ge \mathbb{E}_h \min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}} P\big[D(\bar g + h)\ne S_n\,\big|\, h\big]$
$= \mathbb{E}_h \min_{D:\,\mathbb{R}^{2\ell}\to\{+1,-1\}} P\big[D(\bar g, h)\ne S_n\big] = \min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}} P\big[D(\bar g)\ne S_n\big],$

where the last equality derives from the independence of $h$ from $S_n$ and we let $\bar g = (\|g_n(x_i)\|_Y)_{i=1}^\ell$.

Hence by Prop. 7

$P\big[c^*_{z,n}\ne S_n\,\big|\,(x_i)_i\big] = \min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}} P\big[D(\bar y)\ne S_n\,\big|\,(x_i)_i\big] \ge \min_{D:\,\mathbb{R}^\ell\to\{+1,-1\}} P\big[D(\bar g)\ne S_n\,\big|\,(x_i)_i\big] = \Phi\Big(-\frac{\sqrt{\sum_i \|g_n(x_i)\|^2_Y}}{\bar\sigma}\Big).$

Moreover, since $\Phi(-\sqrt{x})$ is convex, by Jensen's inequality

$P\big[c^*_{z,n}\ne S_n\big] \ge \mathbb{E}\Big[\Phi\Big(-\frac{\sqrt{\sum_i \|g_n(x_i)\|^2_Y}}{\bar\sigma}\Big)\Big] \ge \Phi\Big(-\frac{1}{\bar\sigma}\sqrt{\mathbb{E}\Big[\sum_i \|g_n(x_i)\|^2_Y\Big]}\Big) = \Phi\Big(-\frac{\sqrt{\ell\gamma_n}}{\bar\sigma}\Big),$

where we used

$\mathbb{E}\big[\|g_n(x)\|^2_Y\big] = \int_X \langle K_x^* g_n, K_x^* g_n\rangle_Y\, d\nu(x) = \langle T g_n, g_n\rangle_{\mathcal{H}} = \gamma_n.$


Thus

$\mathbb{E}\, R_\ell(S) \ge \sum_{n\in D_\ell}\Phi\Big(-\frac{\sqrt{\ell\gamma_n}}{\bar\sigma}\Big)\,\gamma_n \ge \Phi\Big(-\frac{1}{\bar\sigma}\Big)\sum_{n\in D_\ell}\gamma_n,$

since $\ell\gamma_n \le 1$ for every $n \in D_\ell$, which finally proves inequality (67) with $C = \Phi(-\frac{1}{\bar\sigma})$ and concludes the proof.

    6. Conclusion

We presented an error analysis of the RLS algorithm on RKHS for general operator-valued kernels. The framework we considered is extremely flexible and generalizes many settings previously proposed for this type of problem. In particular, the output space need not be bounded as long as a suitable moment condition for the output variable is fulfilled, and input spaces which are unbounded domains are dealt with. An asset of working with operator-valued kernels is the extension of our analysis to the multi-task learning problem; this kind of result is, to our knowledge, new.

    We also gave a complete asymptotic worst case analysis for the RLS algorithm inthis setting, showing optimality in the minimax sense on a suitable class of priors.Moreover we extended previous individual lower rate results to our general setting.

    Finally we stress the central role played by the effective dimension in our analysis.It enters in the definition of the priors and in the expression of the non-asymptoticupper bound given by Theorem 4 in subsection 5.1. However, since the effectivedimension depends on both the kernel and the marginal probability distribution

    over the input space, our choice for the regularization parameter depends stronglyon the marginal distribution. This consideration naturally raises the question ofwhether the effective dimension could be estimated by unlabelled data, allowing inthis way the regularization parameter to adapt to the actual marginal distributionin a semi-supervised setting.
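As a concrete illustration of this final point, the effective dimension admits a natural plug-in estimate from unlabelled inputs alone: replacing $T$ by its empirical counterpart $T_x$ gives $\hat{\mathcal{N}}(\lambda) = \operatorname{Tr}[K(K + \ell\lambda I)^{-1}]$, where $K$ is the $\ell \times \ell$ kernel matrix on the unlabelled sample (the eigenvalues of $T_x$ are those of $K/\ell$). The sketch below is an assumption-laden illustration with a Gaussian kernel and synthetic inputs, not a construction from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, width):
    # K_ij = exp(-||x_i - x_j||^2 / (2 * width^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def empirical_effective_dimension(K, lam):
    # Plug-in estimate of N(lambda) = Tr[(T + lambda)^{-1} T] using the
    # empirical operator T_x, whose eigenvalues are those of K / ell.
    ell = K.shape[0]
    mu = np.linalg.eigvalsh(K) / ell
    return float(np.sum(mu / (mu + lam)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # unlabelled inputs only; no outputs needed
K = gaussian_kernel_matrix(X, width=1.0)

for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, empirical_effective_dimension(K, lam))
```

Whether such an estimate is accurate enough to drive the choice of the regularization parameter in a semi-supervised fashion is exactly the open question raised above.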

    Acknowledgements

    The authors wish to thank T. Poggio, L. Rosasco and S. Smale for useful dis-cussions, and the two anonymous referees for helpful comments and suggestions.

This paper describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Sciences & Artificial Intelligence Laboratory (CSAIL), as well as at the Dipartimento di Informatica e Scienze dell'Informazione (DISI), University of Genoa, Italy, as well as at the Dipartimento di Matematica, University of Modena, Italy.

This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1.

Additional support was provided by: Central Research Institute of Electric Power Industry (CRIEPI), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., Industrial Technology Research Institute (ITRI), Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, NEC Fund, Oxygen, Siemens Corporate Research, Inc., Sony, Sumitomo Metal Industries, and Toyota Motor Corporation.

This research has also been funded by the FIRB Project ASTAA and the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

    References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In 20th International Conference on Machine Learning (ICML 2003), Washington DC, 2003.
[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
[3] J. Burbea and P. Masani. Banach and Hilbert spaces of vector-valued functions. Pitman Research Notes in Mathematics Series, 90, 1984.
[4] C. Carmeli, E. De Vito, and A. Toigo. Reproducing kernel Hilbert spaces and Mercer theorem. Technical report, arXiv:math.FA/0504071, 2005.
[5] F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2:413–428, 2002.
[6] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.), 39(1):1–49 (electronic), 2002.
[7] E. De Vito and A. Caponnetto. Risk bounds for regularized least-squares algorithm with operator-valued kernels. Technical report, Massachusetts Institute of Technology, Cambridge, MA, May 2005. CBCL Paper #249/AI Memo #2005-015.
[8] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5(1):59–85, February 2005.

[9] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6:883–904, 2005.
[10] R. DeVore, G. Kerkyacharian, D. Picard, and V. Temlyakov. Mathematical methods for supervised learning. IMI Preprints, 22:1–51, 2004.

[11] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
[12] R. M. Dudley. Real Analysis and Probability, volume 74 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002. Revised reprint of the 1989 original.
[13] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.
[14] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. J. Machine Learning Research, 6:615–637, April 2005.
[15] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Adv. Comp. Math., 13:1–50, 2000.
[16] C. W. Groetsch. The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind, volume 105 of Research Notes in Mathematics. Pitman (Advanced Publishing Program), Boston, MA, 1984.
[17] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics, 2002.
[18] M. Kohler and A. Krzyżak. Nonparametric regression estimation using penalized least squares. IEEE Trans. Inform. Theory, 47(7):3054–3058, 2001.
[19] S. Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4:759–771, 2003.


[20] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.
[21] I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory Probab. Appl., 30(1):143–148, 1985.
[22] T. Poggio and F. Girosi. A theory of networks for approximation and learning. In C. Lau, editor, Foundation of Neural Networks, pages 91–106. IEEE Press, Piscataway, N.J., 1992.
[23] T. Poggio and S. Smale. The mathematics of learning: dealing with data. Notices Amer. Math. Soc., 50(5):537–544, 2003.
[24] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[25] L. Schwartz. Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants). J. Analyse Math., 13:115–256, 1964.
[26] S. Smale and D. Zhou. Shannon sampling II: Connections to learning theory. Preprint, 2004.
[27] S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximations. Preprint, 2005.
[28] V. N. Temlyakov. Nonlinear methods of approximation. Found. Comput. Math., 3:33–107, 2003.
[29] V. N. Temlyakov. Approximation in learning theory. IMI Preprints, 5:1–42, 2005.
[30] S. A. van de Geer. Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.
[31] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.
[32] V. N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York, 1998. A Wiley-Interscience Publication.
[33] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1990.
[34] J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 873–880. MIT Press, Cambridge, MA, 2003.
[35] V. Yurinsky. Sums and Gaussian Vectors, volume 1617 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1995.
[36] T. Zhang. Effective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 454–461.
[37] T. Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 13:1397–1437, 2003.

Andrea Caponnetto, C.B.C.L., McGovern Institute, Massachusetts Institute of Technology, Bldg. E25-206, 45 Carleton St., Cambridge, MA 02142, and D.I.S.I., Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy

    E-mail address: [email protected]

Ernesto De Vito, Dipartimento di Matematica, Università di Modena, Via Campi 213/B, 41100 Modena, Italy, and I.N.F.N., Sezione di Genova, Via Dodecaneso 33, 16146 Genova, Italy

    E-mail address: [email protected]