
Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation

Michael Kearns
AT&T Labs Research

180 Park Avenue
Florham Park, NJ

[email protected]

Dana Ron
Laboratory for Computer Science, MIT

545 Technology Square
Cambridge, MA

[email protected]

Abstract: In this paper we prove sanity-check bounds for the error of the leave-one-out cross-validation estimate of the generalization error: that is, bounds showing that the worst-case error of this estimate is not much worse than that of the training error estimate. The name sanity-check refers to the fact that although we often expect the leave-one-out estimate to perform considerably better than the training error estimate, we are here only seeking assurance that its performance will not be considerably worse. Perhaps surprisingly, such assurance has been given only for limited cases in the prior literature on cross-validation.

Any nontrivial bound on the error of leave-one-out must rely on some notion of algorithmic stability. Previous bounds relied on the rather strong notion of hypothesis stability, whose application was primarily limited to nearest-neighbor and other local algorithms. Here we introduce the new and weaker notion of error stability, and apply it to obtain sanity-check bounds for leave-one-out for other classes of learning algorithms, including training error minimization procedures and Bayesian algorithms. We also provide lower bounds demonstrating the necessity of some form of error stability for proving bounds on the error of the leave-one-out estimate, and the fact that for training error minimization algorithms, in the worst case such bounds must still depend on the Vapnik-Chervonenkis dimension of the hypothesis class.


1 Introduction and Motivation

A fundamental problem in statistics, machine learning, neural networks, and related areas is that of obtaining an accurate estimate for the generalization ability of a learning algorithm trained on a finite data set. Many estimates have been proposed and examined in the literature, some of the most prominent being the training error (also known as the resubstitution estimate), the various cross-validation estimates (which include the leave-one-out or deleted estimate, as well as $k$-fold cross-validation), and the holdout estimate. For each of these estimates, the hope is that for a fairly wide class of learning algorithms, the estimate will usually produce a value $\hat{\epsilon}$ that is close to the true (generalization) error $\epsilon$.

There are surprisingly few previous results providing bounds on the accuracy of the various estimates [RW78, DW79a, DW79b, Vap82, Hol96b, Hol96a, KMN+95, Kea96] (see the recent book of Devroye, Gyorfi and Lugosi [DGL96] for an excellent introduction and survey of the topic). Perhaps the most general results are those given for the (classification) training error estimate by Vapnik [Vap82], who proved that for any target function and input distribution, and for any learning algorithm that chooses its hypotheses from a class of VC dimension $d$, the training error estimate is at most $\tilde{O}(\sqrt{d/m})$ away from the true error, where $m$ is the size of the training sample (the $\tilde{O}(\cdot)$ notation hides logarithmic factors in the same way that the $O(\cdot)$ notation hides constants). On the other hand, among the strongest bounds (in the sense of the quality of the estimate) are those given for the leave-one-out estimate by the work of Rogers and Wagner [RW78], Devroye and Wagner [DW79a, DW79b], and Vapnik [Vap82].

The (classification error) leave-one-out estimate is computed by running the learning algorithm $m$ times, each time removing one of the $m$ training examples, and testing the resulting hypothesis on the training example that was deleted; the fraction of failed tests is the leave-one-out estimate. Rogers and Wagner [RW78] and Devroye and Wagner [DW79a, DW79b] proved that for several specific algorithms, but again for any target function and input distribution, the leave-one-out estimate can be as close as $O(1/\sqrt{m})$ to the true error. The algorithms they consider are primarily variants of nearest-neighbor and other local procedures, and as such do not draw their hypotheses from a fixed class of bounded VC dimension, which is the situation we are primarily interested in here. Devroye, Gyorfi and Lugosi [DGL96, Chap. 24] obtain a bound on the error of the leave-one-out estimate for another particular class of algorithms, namely that of histogram rules. Vapnik [Vap82, Chap. 8] studies the leave-one-out estimate (which he refers to as the moving-control estimate) for a special case of linear regression. He proves bounds of order $1/\sqrt{m}$ on the error of the estimate under certain assumptions on the distribution over the examples and their labels.

A tempting and optimistic intuition about the leave-one-out estimate is that it should typically yield an estimate that falls within $O(1/\sqrt{m})$ of the true error. This intuition derives from viewing each deleted test as an independent trial of the true error. The problem, of course, is that these tests are not independent. The Devroye, Rogers and Wagner results demonstrate that for certain algorithms, the intuition is essentially correct despite the dependencies. In such cases, the leave-one-out estimate may be vastly preferable to the training error, yielding an estimate of the true error whose accuracy is independent of any notion of dimension or hypothesis complexity (even though the true error itself may depend strongly on such quantities).

However, despite such optimism, the prior literature leaves open a disturbing possibility for the leave-one-out proponent: the possibility that its accuracy may often be, for wide classes of natural algorithms, arbitrarily poor. We would like to have what we shall informally refer to as a sanity-check bound: a proof, for large classes of algorithms, that the error of the leave-one-out estimate is not much worse than the $\tilde{O}(\sqrt{d/m})$ worst-case behavior of the training error estimate. The name sanity-check refers to the fact that although we believe that under many circumstances the leave-one-out estimate will perform much better than the training error (and thus justify its computational expense), the goal of the sanity-check bound is simply to prove that it is not much worse than the training error. Such a result is of interest simply because the leave-one-out estimate is in wide experimental use (largely because practitioners expect it to indeed frequently outperform the training error), so it behooves us to understand its performance and limitations.

A moment's reflection should make it intuitively clear that, in contrast to the training error, even a sanity-check bound for leave-one-out cannot come without restrictions on the algorithm under consideration: some form of algorithmic stability is required [DW79b, Hol96b, Koh95]. If the removal of even a single example from the training sample may cause the learning algorithm to "jump" to a different hypothesis with, say, much larger error than the full-sample hypothesis, it seems hard to expect the leave-one-out estimate to be accurate. The precise nature of the required form of stability is less obvious.

Devroye and Wagner [DW79b] first identified a rather strong notion of algorithmic stability that we shall refer to as hypothesis stability, and showed that bounds on hypothesis stability directly lead to bounds on the error of the leave-one-out estimate. This notion of stability demands that the removal of a single example from the training sample results in hypotheses that are "close" to each other, in the sense of having small symmetric difference with respect to the input distribution. For algorithms drawing hypotheses from a class of fixed VC dimension, the first sanity-check bounds for the leave-one-out estimate were provided by Holden [Hol96b] for two specific algorithms in the realizable case (that is, when the target function is actually contained in the class of hypothesis functions).

However, in the more realistic unrealizable (or agnostic [KSS94]) case, the notion of hypothesis stability may simply be too strong to be obeyed by many natural learning algorithms. For example, if there are many local minima of the true error, an algorithm that managed to always minimize the training error might be induced to move to a rather distant hypothesis by the addition of a new training example (we shall elaborate on this example shortly). Many gradient descent procedures use randomized starting points, which may even cause runs on the same sample to end in different local minima. Algorithms behaving according to Bayesian principles will choose two hypotheses of equal training error with equal probability, regardless of their dissimilarity. What we might hope to be relatively stable in such cases would not be the algorithm's hypothesis itself, but the error of the algorithm's hypothesis.

The primary goal of this paper is to give sanity-check bounds for the leave-one-out estimate that are based on the error stability of the algorithm. In Section 2, we begin by stating some needed preliminaries. In Section 3, we review the Devroye and Wagner notion of hypothesis stability, and generalize the results of Holden [Hol96b] by showing that in the realizable case this notion can be used to obtain sanity-check bounds for any consistent learning algorithm; but we also discuss the limitations of hypothesis stability in the unrealizable case. In Section 4, we define our new notion of error stability, and prove our main results: bounds on the error of the leave-one-out estimate that depend on the VC dimension of the hypothesis class and the error stability of the algorithm. The bounds apply to a wide class of algorithms meeting a mild condition that includes training error minimization and Bayesian procedures. While we mainly concentrate on Boolean functions, we also briefly discuss real-valued functions (in Section 4.3). In Section 5, we give a number of lower bound results showing, among other things, the necessity of some form of error stability for proving bounds on the error of the leave-one-out estimate, but also the insufficiency of error stability by itself (thus justifying the need for an additional condition). In Section 6 we conclude with some open problems.

2 Preliminaries

Let $f$ be a fixed target function from domain $X$ to range $Y$, and let $P$ be a fixed distribution over $X$. Both $f$ and $P$ may be arbitrary. We use $S_m$ to denote the random variable $S_m = \langle (x_1,y_1),\ldots,(x_m,y_m)\rangle$, where $m$ is the sample size, each $x_i$ is drawn randomly and independently according to $P$, and $y_i = f(x_i)$. A learning algorithm $A$ is given $S_m$ as input, and outputs a hypothesis $h_m = A(S_m)$, where $h_m : X \to Y$ belongs to a fixed hypothesis class $H$. If $A$ is randomized, it takes an additional input $\vec{r} \in \{0,1\}^{\ell}$ of random bits of the required length $\ell$ to make its random choices. In this paper we study mainly the case in which $Y = \{0,1\}$, and briefly the case in which $Y = \Re$. For now we restrict our attention to boolean functions.

(Footnote: Our results directly generalize to the case in which we allow the target process to be any joint distribution over the sample space $X \times Y$, but it will be convenient to think of there being a distinct target function.)


For any boolean function $h$, we define the generalization error of $h$ (with respect to $f$ and $P$) by

$\epsilon(h) = \epsilon_{f,P}(h) \stackrel{\rm def}{=} \Pr_{x \sim P}[h(x) \neq f(x)].$   (1)

For any two boolean functions $h$ and $h'$, the distance between $h$ and $h'$ (with respect to $P$) is

$\mathrm{dist}(h,h') = \mathrm{dist}_P(h,h') \stackrel{\rm def}{=} \Pr_{x \sim P}[h(x) \neq h'(x)].$   (2)

Since the target function $f$ may or may not belong to $H$, we define

$\epsilon_{opt} \stackrel{\rm def}{=} \min_{h \in H}\{\epsilon(h)\},$   (3)

and let $h_{opt}$ be some function for which $\epsilon(h_{opt}) = \epsilon_{opt}$. Thus, the function $h_{opt}$ is a best approximation to $f$ (with respect to $P$) in the class $H$, and $\epsilon_{opt}$ measures the quality of this approximation. We define the training error of a boolean function $h$ with respect to $S_m$ by

$\hat{\epsilon}(h) = \hat{\epsilon}_{S_m}(h) \stackrel{\rm def}{=} \frac{1}{m}\bigl|\{(x_i,y_i) \in S_m : h(x_i) \neq y_i\}\bigr|,$   (4)

and the (generalized) version space

$VS(S_m) \stackrel{\rm def}{=} \bigl\{h \in H : \hat{\epsilon}(h) = \min_{h' \in H}\{\hat{\epsilon}(h')\}\bigr\},$   (5)

consisting of all functions in $H$ that minimize the training error.

Throughout this paper we assume that the algorithm $A$ is symmetric. This means that $A$ is insensitive to the ordering of the examples in the input sample $S_m$, so for every ordering of $S_m$ it outputs the same hypothesis. (In case $A$ is randomized, it should induce the same distribution on hypotheses.) This is a very mild assumption, as any algorithm can be transformed into a symmetric algorithm by adding a randomizing preprocessing step. Thus, we may refer to $S_m$ as an unordered set of labeled examples rather than as a list of examples. For any index $i \in \{1,\ldots,m\}$, we denote by $S_m^i$ the sample $S_m$ with the $i$th labeled example, $(x_i,y_i)$, removed. That is,

$S_m^i \stackrel{\rm def}{=} S_m \setminus \{(x_i,y_i)\}.$   (6)

The leave-one-out cross-validation estimate, $\hat{\epsilon}_{cv}(S_m)$, of the error of the hypothesis $h_m = A(S_m)$ is defined to be

$\hat{\epsilon}_{cv}(S_m) \stackrel{\rm def}{=} \frac{1}{m}\bigl|\{i \in \{1,\ldots,m\} : h_m^i(x_i) \neq y_i\}\bigr|,$   (7)

where $h_m^i = A(S_m^i)$. We are thus interested in providing bounds on the error $|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))|$ of the leave-one-out estimate.
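To make Equation (7) concrete, here is a short sketch of the estimator; it is not part of the original text, and the `learner` argument (any procedure mapping a labeled sample to a 0/1-valued hypothesis) is a hypothetical placeholder.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, int]          # (x_i, y_i) with y_i in {0, 1}
Hypothesis = Callable[[float], int]  # h: X -> {0, 1}


def leave_one_out_estimate(sample: List[Example],
                           learner: Callable[[List[Example]], Hypothesis]) -> float:
    """Equation (7): fraction of i for which A(S_m^i) errs on (x_i, y_i)."""
    m = len(sample)
    errors = 0
    for i in range(m):
        held_out_x, held_out_y = sample[i]
        deleted_sample = sample[:i] + sample[i + 1:]   # S_m^i, Equation (6)
        h_i = learner(deleted_sample)                  # h_m^i = A(S_m^i)
        if h_i(held_out_x) != held_out_y:
            errors += 1
    return errors / m
```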

The following uniform convergence bound, due to Vapnik [Vap82], will be central to this paper.


THEOREM 2.1 Let $H$ be a hypothesis class with VC dimension $d$. Then, for every $m \geq d$ and for any given $\delta > 0$, with probability at least $1-\delta$, for every $h \in H$,

$|\hat{\epsilon}(h) - \epsilon(h)| \;\leq\; c_0\sqrt{\frac{d\ln(2m/d) + \ln(1/\delta)}{m}},$   (8)

where $c_0$ is a fixed constant.

Let us introduce the shorthand notation

$VC(d,m,\delta) \stackrel{\rm def}{=} c_0\sqrt{\frac{d\ln(2m/d) + \ln(1/\delta)}{m}}.$   (9)

Thus, Theorem 2.1 says that for any learning algorithm $A$ using a hypothesis space of VC dimension $d$, for any $\delta > 0$, with probability at least $1-\delta$ over $S_m$, $|\hat{\epsilon}(A(S_m)) - \epsilon(A(S_m))| \leq VC(d,m,\delta)$.

3 Sanity-Check Bounds via Hypothesis Stability

As we have mentioned already, it is intuitively clear that the performance of the leave-one-out estimate must rely on some kind of algorithmic stability (this intuition will be formalized in the lower bounds of Section 5). Perhaps the strongest notion of stability that an interesting learning algorithm might be expected to obey is that of hypothesis stability: namely, that small changes in the sample can only cause the algorithm to move to "nearby" hypotheses. The notion of hypothesis stability is due to Devroye and Wagner [DW79b], and is formalized in a way that suits our purposes in the following definition.

DEFINITION 3.1 We say that an algorithm $A$ has hypothesis stability $(\beta_1, \beta_2)$ if for every $m$,

$\Pr_{S_m}\bigl[\mathrm{dist}(A(S_m), A(S_{m-1})) \geq \beta_2\bigr] \;\leq\; \beta_1,$   (10)

where $S_{m-1} = S_m \setminus \{(x_m,y_m)\}$, and both $\beta_1$ and $\beta_2$ may be functions of $m$.

Thus, we ask that with high probability, the hypotheses output by $A$ on $S_m$ and $S_{m-1}$ be similar.

THEOREM 3.1 Let � be any symmetric algorithm that has hypothesis stability���

����� � . Then for any��� * , with probability at least

� B�� over ��� ,

1 �� ACV

� ��� � B � � � � ������� 1 ' � � � ��� +* ���

� � � �� � (11)

,Devroye and Wagner [DW79b] formalized hypothesis stability in terms of the expected difference between the hy-

potheses; here we translate to the “high probability” form for consistency.


Thus, if we are fortunate enough to have an algorithm with strong hypothesis stability (that is, small $\beta_1$ and $\beta_2$), the leave-one-out estimate for this algorithm will be correspondingly accurate.

What kind of hypothesis stability should we expect for natural algorithms? Devroye, Rogers and Wagner [RW78, DW79b] gave rather strong hypothesis stability results for certain nonparametric local learning algorithms (such as nearest-neighbor rules), and thus were able to show that the error of the leave-one-out estimate for such algorithms decreases like $1/m^{\alpha}$ (for values of $\alpha$ ranging from $1/4$ to $1/2$, depending on the details of the algorithm).

Note that for nearest-neighbor algorithms, there is no fixed "hypothesis class" of limited VC dimension — the algorithm may choose arbitrarily complex hypotheses. This unlimited complexity often makes it difficult to quantify the performance of the learning algorithm except in terms of the asymptotic generalization error (see Devroye, Gyorfi and Lugosi [DGL96] for a detailed survey of results for nearest-neighbor algorithms). For this and other reasons, practitioners often prefer to commit to a hypothesis class $H$ of fixed VC dimension $d$, and use heuristics to find a good function in $H$. In this case, we gain the possibility of finite-sample generalization error bounds (where we compare the error to that of the optimal model from $H$). However, in such a situation, the goal of hypothesis stability may in fact be at odds with the goal of good performance in the sense of learning. To see this, imagine that the input distribution and target function define a generalization error "surface" over the function space $H$, and that this surface has minima at $h_{opt} \in H$, where $\epsilon(h_{opt}) = \epsilon_{opt} > 0$, and also at $h_1 \in H$, where $\epsilon(h_1) = \epsilon_{opt} + \Delta$ for some small $\Delta > 0$. Thus, $h_{opt}$ is the "global" minimum, and $h_1$ is a "local" minimum. Note that $\mathrm{dist}(h_{opt}, h_1)$ could be as large as $\epsilon_{opt}$, which we are assuming may be a rather large (constant) quantity. Now if the algorithm $A$ minimizes the training error over $H$, then we expect that as $m \to \infty$, algorithm $A$ will settle on hypotheses closer and closer to $h_{opt}$. But for smaller values of $m$, $A$ may well choose hypotheses close to $h_1$. Thus, as more examples are seen, at some point $A$ may need to move from $h_1$ to the rather distant $h_{opt}$.

We do not know how to rule out such behavior for training error minimization algorithms, and so cannot apply Theorem 3.1. Perhaps more importantly, for certain natural classes of algorithms (such as the Bayesian algorithms discussed later), and for popular heuristics such as C4.5 and backpropagation, it is far from obvious that any nontrivial statement about hypothesis stability can be made. For this reason, we would like to have bounds on the error of the leave-one-out estimate that rely on the weakest possible notion of stability. Note that in the informal example given above, the quantity that we might hope would exhibit some stability is not the hypothesis itself, but the error of the hypothesis: even though $h_{opt}$ and $h_1$ may be far apart, if $A$ chooses $h_1$ then $\Delta$ must not be "too large". The main question addressed in this paper is when this weaker notion of error stability is sufficient to prove nontrivial bounds on the leave-one-out error, and we turn to this in Section 4.


First, however, note that the instability of the hypothesis in the above discussion relied on the assumption that $\epsilon_{opt} > 0$ — that is, that we are in the unrealizable setting. In the realizable ($\epsilon_{opt} = 0$) case, there is still hope for applying hypothesis stability. Indeed, Holden [Hol96b] was the first to apply uniform convergence results to obtain sanity-check bounds for leave-one-out via hypothesis stability, for two particular (consistent) algorithms in the realizable setting. Here we generalize Holden's results by giving a sanity-check bound on the leave-one-out error for any consistent algorithm. The simple proof idea again highlights why hypothesis stability seems difficult to apply in the unrealizable case: in the realizable case, minimizing the training error forces the hypothesis to be close to some fixed function (namely, the target). In the unrealizable case, there may be many different functions, all with optimal or near-optimal error.

THEOREM 3.2 Let $H$ be a class of VC dimension $d$, and let the target function $f$ be contained in $H$ (realizable case). Let $A$ be a symmetric algorithm that always finds an $h \in H$ consistent with the input sample. Then for every $\delta > 0$ and $m$, with probability at least $1-\delta$,

$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| \;=\; O\!\left(\sqrt{\frac{d\log(m/d) + \log(1/\delta)}{\delta\, m}}\right).$   (12)

PROOF: By uniform convergence, with probability at least $1-\delta'$,

$\epsilon(A(S_m)) = \mathrm{dist}(A(S_m), f) = O\!\left(\frac{d\log(m/d) + \log(1/\delta')}{m}\right)$   (13)

and

$\epsilon(A(S_{m-1})) = \mathrm{dist}(A(S_{m-1}), f) = O\!\left(\frac{d\log((m-1)/d) + \log(1/\delta')}{m-1}\right).$   (14)

(Here we are using the stronger $\tilde{O}(d/m)$ uniform convergence bounds that are special to the realizable case.) Thus by the triangle inequality, with probability at least $1-2\delta'$,

$\mathrm{dist}(A(S_m), A(S_{m-1})) = O\!\left(\frac{d\log(m/d) + \log(1/\delta')}{m}\right).$   (15)

The theorem follows by applying Theorem 3.1 with this choice of $(\beta_1, \beta_2)$ and an appropriate choice of $\delta'$.   (Theorem 3.2)

(Footnote: Holden [Hol96a] has recently obtained sanity-check bounds, again for the realizable setting, for other cross-validation estimates.)


We should note immediately that the bound of Theorem 3.2 has a polynomial dependence on $1/\delta$, as opposed to the logarithmic dependence for the training error given by Theorem 2.1. Unfortunately, it is well-known [DGL96, Chap. 24] (and demonstrated in Section 5) that, at least in the unrealizable setting, such a polynomial dependence on $1/\delta$ is in general unavoidable for the leave-one-out estimate. Thus, it appears that in order to gain whatever benefits leave-one-out offers, we must accept a worst-case dependence on $\delta$ that is inferior to that of the training error. This again is the price of generality — for particular algorithms, such as $k$-nearest neighbor rules, it is possible to show only logarithmic dependence on $1/\delta$ [RW78] (stated in [DGL96, Thm. 24.2]). Also, we note in passing that Theorem 3.2 can also be generalized (perhaps with a worse power of $1/m$) to the case where the target function lies in $H$ but is corrupted by random classification noise: again, minimizing training error forces the hypothesis to be close to the target.

It is possible to give examples in the realizable case for which the leave-one-out estimate has error $O(1/\sqrt{m})$ while the training error has error $\Omega(\sqrt{d/m})$; such examples merely reinforce the intuition discussed in the introduction that leave-one-out may often be superior to the training error. Furthermore, there are unrealizable examples for which the error of leave-one-out is again independent of $d$, but for which no nontrivial leave-one-out bound can be obtained by appealing to hypothesis stability. It seems that a more general notion of stability is called for.

4 Sanity-Check Bounds via Error Stability

In this section, we introduce the notion of error stability and use it to prove our main results. We give bounds on the error of the leave-one-out estimate that are analogous to those given in Theorem 3.1, in that the quality of the bounds is directly related to the error stability of the algorithm. However, unlike Theorem 3.1, in all of our bounds there will be a residual $\tilde{O}(\sqrt{d/m})$ term that appears regardless of the stability; this is the price we pay for using a weaker — but more widely applicable — type of stability. In Section 5, we will show that some form of error stability (which is slightly weaker than the one defined below) is always necessary, and also that a dependence on the VC dimension $d$ cannot be removed in the case of algorithms which minimize the training error, without further assumptions on the algorithm.

For expository purposes, we limit our attention to deterministic algorithms for now. The generalization to randomized algorithms will be discussed shortly. Our key definition mirrors the form of Definition 3.1.

(Footnote: As we note in Section 5, this lower bound also implies that for any reasonable learning algorithm, namely one for which the probability that the error of its hypothesis increases when a sample point is added is very small, the notion of error stability we define below is in fact necessary.)


DEFINITION 4.1 We say that a deterministic algorithm $A$ has error stability $(\beta_1, \beta_2)$ if for every $m$,

$\Pr_{S_m}\bigl[\,|\epsilon(A(S_m)) - \epsilon(A(S_{m-1}))| \geq \beta_2\,\bigr] \;\leq\; \beta_1,$   (16)

where $S_{m-1} = S_m \setminus \{(x_m,y_m)\}$, and both $\beta_1$ and $\beta_2$ may be functions of $m$.

Thus, we ask that with high probability, the hypotheses output by $A$ on $S_m$ and $S_{m-1}$ have similar error with respect to the target, while allowing them to differ from each other.
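The contrast with Definition 3.1 can be seen on a toy example (not from the paper): under a uniform distribution on $[0,1)$ with a hypothetical target and two hand-chosen hypotheses, the two hypotheses have identical error yet are far apart in $\mathrm{dist}$, so error stability is compatible with a violation of hypothesis stability.

```python
import random

# Uniform distribution on [0, 1); a hypothetical target and two hypotheses chosen so
# that their errors coincide even though the hypotheses themselves are far apart.
f  = lambda x: int(x < 0.5)
h1 = lambda x: int(x < 0.4)
h2 = lambda x: int(0.1 <= x < 0.5)

xs = [random.random() for _ in range(200_000)]
err = lambda h: sum(h(x) != f(x) for x in xs) / len(xs)
dist = sum(h1(x) != h2(x) for x in xs) / len(xs)

print(abs(err(h1) - err(h2)))  # approx 0:   the errors are equally good
print(dist)                    # approx 0.2: the hypotheses are not close
```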

Our goal is thus to prove bounds on the error of the leave-one-out estimate that depend on $\beta_1$ and $\beta_2$. This will require an additional (and hopefully mild) assumption on the algorithm that is quantified by the following definition. We will shortly prove that some natural classes of algorithms do indeed meet this assumption, thus allowing us to prove sanity-check bounds for these classes.

DEFINITION 4.2 For any deterministic algorithm $A$, we say that leave-one-out $(\gamma_1, \gamma_2)$-overestimates the training error for $A$ if for every $m$,

$\Pr_{S_m}\bigl[\,\hat{\epsilon}_{cv}(S_m) \leq \hat{\epsilon}(A(S_m)) - \gamma_2\,\bigr] \;\leq\; \gamma_1,$   (17)

where both $\gamma_1$ and $\gamma_2$ may be functions of $m$.

While we cannot claim that training error overestimation is in general necessary for obtaining bounds on the error of the leave-one-out estimate, we note that it is clearly necessary whenever the training error underestimates the true error, as is the case for algorithms that minimize the training error. In any case, in Section 5 we show that some additional assumptions (beyond error stability) are required to obtain nontrivial bounds for the error of leave-one-out.

Before stating the main theorem of this section, we give the following simple but important lemma. This result is well-known [DGL96, Chap. 24], but we include its proof for the sake of completeness.

LEMMA 4.1 For any symmetric learning algorithm $A$,

$\mathrm{E}_{S_m}\bigl[\hat{\epsilon}_{cv}(S_m)\bigr] \;=\; \mathrm{E}_{S_{m-1}}\bigl[\epsilon(A(S_{m-1}))\bigr].$   (18)

PROOF: For any fixed sample $S_m$, let $h_m^i = A(S_m^i)$, and let $e_i \in \{0,1\}$ be 1 if and only if $h_m^i(x_i) \neq y_i$. Then

$\mathrm{E}_{S_m}\bigl[\hat{\epsilon}_{cv}(S_m)\bigr] \;=\; \mathrm{E}_{S_m}\Bigl[\frac{1}{m}\sum_{i=1}^{m} e_i\Bigr]$   (19)

$\;=\; \frac{1}{m}\sum_{i=1}^{m}\mathrm{E}_{S_m}\bigl[e_i\bigr]$   (20)

$\;=\; \mathrm{E}_{S_m}\bigl[e_m\bigr]$   (21)

$\;=\; \mathrm{E}_{S_{m-1}}\bigl[\epsilon(A(S_{m-1}))\bigr].$   (22)

The first equality follows from the definition of leave-one-out, the second from the additivity of expectation, the third from the symmetry of $A$, and the fourth from the definition of $e_m$.   (Lemma 4.1)
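A quick numerical sanity check of Lemma 4.1 (not part of the paper) for a simple symmetric learner (a majority-label rule under a hypothetical uniform distribution and threshold target) shows the two expectations agreeing up to Monte Carlo noise:

```python
import random

random.seed(0)
f = lambda x: int(x < 0.3)                # hypothetical target
draw = lambda m: [(x, f(x)) for x in (random.random() for _ in range(m))]


def majority_learner(sample):
    """A symmetric learner: predict the majority label of the sample (constant hypothesis)."""
    ones = sum(y for _, y in sample)
    label = int(2 * ones >= len(sample))
    return lambda x: label


def loo(sample, learner):
    m = len(sample)
    return sum(learner(sample[:i] + sample[i + 1:])(sample[i][0]) != sample[i][1]
               for i in range(m)) / m


def true_error(h):
    # Exact error under the uniform distribution for a constant hypothesis and the target above.
    return 0.3 if h(0.0) == 0 else 0.7


m, trials = 11, 300
lhs = sum(loo(draw(m), majority_learner) for _ in range(trials)) / trials
rhs = sum(true_error(majority_learner(draw(m - 1))) for _ in range(trials)) / trials
print(lhs, rhs)   # the two averages agree up to Monte Carlo noise
```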

The first of our main results follows.

THEOREM 4.1 Let $A$ be any deterministic algorithm using a hypothesis space $H$ of VC dimension $d$ such that $A$ has error stability $(\beta_1,\beta_2)$, and leave-one-out $(\gamma_1,\gamma_2)$-overestimates the training error for $A$. Then for any $\delta > 0$, with probability at least $1-\delta$ over $S_m$,

$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| \;\leq\; \gamma_2 + VC(d,m,\delta^2) + \frac{2\beta_1 + \beta_2 + \gamma_1 + \gamma_2 + VC(d,m,\delta^2)}{\delta} + \delta.$   (23)

Let us briefly discuss the form of the bound given in Theorem 4.1. First of all, as we mentioned earlier, there is a residual $\tilde{O}(\sqrt{d/m})$ term that remains no matter how error-stable the algorithm is. This means that we cannot hope to get something better than a sanity-check bound from this result. Our main applications of Theorem 4.1 will be to show specific, natural cases in which $\gamma_1$ and $\gamma_2$ can be eliminated from the bound, leaving us with a bound that depends only on the error stability and the residual $\tilde{O}(\sqrt{d/m})$ term. We now turn to the proof of the theorem.

PROOF: From Theorem 2.1 and the fact that leave-one-out $(\gamma_1,\gamma_2)$-overestimates the training error, we have that with probability at least $1 - \delta' - \gamma_1$ (where $\delta'$ will be determined by the analysis),

$\hat{\epsilon}_{cv}(S_m) \;\geq\; \hat{\epsilon}(A(S_m)) - \gamma_2 \;\geq\; \epsilon(A(S_m)) - VC(d,m,\delta') - \gamma_2.$   (24)

Thus, the fact that leave-one-out does not underestimate the training error by more than $\gamma_2$ (with probability at least $1-\gamma_1$) immediately lets us bound the amount by which leave-one-out could underestimate the true error $\epsilon(A(S_m))$ (where here we set $\delta'$ to be $\delta/2$, and we note that whenever $\gamma_1 \geq \delta/2$, the bound holds trivially). It remains to bound the amount by which leave-one-out could overestimate the true error.

Let us define the random variable $\hat{\Delta}(S_m)$ by

$\hat{\Delta}(S_m) \stackrel{\rm def}{=} \hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m)),$   (25)

let

$\Delta \stackrel{\rm def}{=} VC(d,m,\delta') + \gamma_2,$   (26)

and

$\gamma \stackrel{\rm def}{=} \delta' + \gamma_1.$   (27)

Then Equation (24) says that the probability that $\hat{\Delta}(S_m) < -\Delta$ is at most $\gamma$. Furthermore, it follows from the error stability of $A$ that with probability at least $1 - \beta_1$,

$\hat{\Delta}(S_m) \;\leq\; \hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_{m-1})) + \beta_2$   (28)

(where $S_{m-1} = S_m \setminus \{(x_m,y_m)\}$). By Lemma 4.1 we know that

$\mathrm{E}_{S_m}\bigl[\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_{m-1}))\bigr] = 0,$   (29)

and hence on those samples for which Equation (28) holds (whose total probability weight is at least $1-\beta_1$), the expected value of $\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_{m-1}))$ is at most $\beta_1/(1-\beta_1)$. Assuming $\beta_1 \leq 1/2$ (since otherwise the bound holds trivially), and using the fact that $|\hat{\Delta}(S_m)| \leq 1$, we have that

$\mathrm{E}\bigl[\hat{\Delta}(S_m)\bigr] \;\leq\; 2\beta_1 + \beta_2.$   (30)

Let $\rho$ be such that with probability exactly $\rho$, $\hat{\Delta}(S_m) \geq \Delta'$. Then

$2\beta_1 + \beta_2 \;\geq\; \mathrm{E}\bigl[\hat{\Delta}(S_m)\bigr]$   (31)

$\;\geq\; \rho\Delta' + (1-\rho-\gamma)(-\Delta) + \gamma(-1)$   (32)

$\;\geq\; \rho\Delta' - \Delta - \gamma,$   (33)

where we have again used the fact that $|\hat{\Delta}(S_m)| \leq 1$ always. Thus

$\rho \;\leq\; \frac{2\beta_1 + \beta_2 + \Delta + \gamma}{\Delta'}.$   (34)

From the above, choosing $\Delta' = (2\beta_1+\beta_2+\Delta+\gamma)/\delta$ so that $\rho \leq \delta$, we have that with probability at least $1-\delta$,

$\hat{\epsilon}_{cv}(S_m) \;\leq\; \epsilon(A(S_m)) + \frac{2\beta_1 + \beta_2 + \Delta + \gamma}{\delta}$   (35)

$\;=\; \epsilon(A(S_m)) + \frac{2\beta_1 + \beta_2 + \gamma_1 + \gamma_2 + VC(d,m,\delta') + \delta'}{\delta}.$   (36)

If we set $\delta' = \delta^2$, we get that with probability at least $1-\delta$,

$\hat{\epsilon}_{cv}(S_m) \;\leq\; \epsilon(A(S_m)) + \frac{2\beta_1 + \beta_2 + \gamma_1 + \gamma_2 + VC(d,m,\delta^2)}{\delta} + \delta,$   (37)

which together with Equation (24) proves the theorem.   (Theorem 4.1)

4.1 Application to Training Error Minimization

In this section, we give one of our main applications of Theorem 4.1, by showing that for training error minimization algorithms, a $\tilde{O}(\sqrt{d/m})$-type bound on the error of leave-one-out can be obtained from error stability arguments. We proceed by giving two lemmas, the first bounding the error stability of such algorithms, and the second proving that leave-one-out overestimates their training error.

LEMMA 4.2 Let $A$ be any algorithm performing training error minimization over a hypothesis class $H$ of VC dimension $d$. Then for any $\delta_0 > 0$, $A$ has error stability $\bigl(2\delta_0,\; 2\,VC(d,m-1,\delta_0) + 2\,VC(d,m,\delta_0)\bigr)$.

PROOF: From uniform convergence (Theorem 2.1), we know that with probability at least $1-2\delta_0$, both

$\epsilon(A(S_{m-1})) \;\leq\; \epsilon_{opt} + 2\,VC(d,m-1,\delta_0)$   (38)

and

$\epsilon(A(S_m)) \;\leq\; \epsilon_{opt} + 2\,VC(d,m,\delta_0)$   (39)

hold, while it is always true that both $\epsilon(A(S_m)) \geq \epsilon_{opt}$ and $\epsilon(A(S_{m-1})) \geq \epsilon_{opt}$. Thus with probability at least $1-2\delta_0$,

$|\epsilon(A(S_{m-1})) - \epsilon(A(S_m))| \;\leq\; 2\,VC(d,m-1,\delta_0) + 2\,VC(d,m,\delta_0).$   (40)

(Lemma 4.2)

LEMMA 4.3 Let $A$ be any algorithm performing training error minimization over a hypothesis class $H$. Then leave-one-out $(0,0)$-overestimates the training error for $A$.

PROOF: Let $h_m = A(S_m)$ and $h_m^i = A(S_m^i)$. Let $U(S_m, h_m)$ be the subset of labeled examples in $S_m$ on which $h_m$ errs. We claim that for every $(x_i,y_i) \in U(S_m,h_m)$, $h_m^i$ errs on $(x_i,y_i)$ as well, implying that $\hat{\epsilon}_{cv}(S_m) \geq \hat{\epsilon}(A(S_m))$. Assume, contrary to the claim, that for some $i$, $h_m(x_i) \neq y_i$ while $h_m^i(x_i) = y_i$. For any function $h$ and sample $S$, let $n_S(h)$ denote the number of errors made by $h$ on $S$ (thus $\hat{\epsilon}_S(h) = n_S(h)/|S|$). Since $A$ performs training error minimization, for any function $h' \in H$ we have $n_{S_m}(h_m) \leq n_{S_m}(h')$. Similarly, for any $h' \in H$, we have $n_{S_m^i}(h_m^i) \leq n_{S_m^i}(h')$. In particular this must be true for $h_m$, and thus $n_{S_m^i}(h_m^i) \leq n_{S_m^i}(h_m)$. Since $h_m$ errs on $(x_i,y_i)$, $n_{S_m^i}(h_m) = n_{S_m}(h_m) - 1$, and hence $n_{S_m^i}(h_m^i) \leq n_{S_m}(h_m) - 1$. But since $h_m^i$ does not err on $(x_i,y_i)$, $n_{S_m}(h_m^i) = n_{S_m^i}(h_m^i) \leq n_{S_m}(h_m) - 1 < n_{S_m}(h_m)$, contradicting the assumption that $A$ minimizes the training error on $S_m$.   (Lemma 4.3)

THEOREM 4.2 Let $A$ be any algorithm performing training error minimization over a hypothesis class $H$ of VC dimension $d$. Then for every $\delta > 0$, with probability at least $1-\delta$,

$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| \;=\; O\!\left(\frac{1}{\delta}\sqrt{\frac{d\log(m/d) + \log(1/\delta)}{m}} + \delta\right).$   (41)

PROOF: Follows immediately from Lemma 4.2 (where $\delta_0$ is set to $\delta^2/2$), Lemma 4.3, and Theorem 4.1.   (Theorem 4.2)

Thus, for training error minimization algorithms, the worst-case behavior of the leave-one-out estimate is not worse than that of the training error (modulo the inferior dependence on $1/\delta$ and constant factors). We would like to infer that a similar statement is true if the algorithm almost minimizes the training error. Unfortunately, Lemma 4.3 is extremely sensitive, forcing us to simply assume that leave-one-out overestimates the training error in the following theorem. We will later discuss how reasonable such an assumption might be for natural algorithms; in any case, we will show in Section 5 that some assumptions beyond just error stability are required to obtain interesting bounds for leave-one-out.

THEOREM 4.3 Let $A$ be a deterministic algorithm that comes within $\nu$ of minimizing the training error over $H$ (that is, on any sample $S_m$, $\hat{\epsilon}(A(S_m)) \leq \min_{h \in H}\{\hat{\epsilon}(h)\} + \nu$), and suppose that leave-one-out $(0,0)$-overestimates the training error for $A$. Then with probability at least $1-\delta$,

$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| \;=\; O\!\left(\frac{\nu}{\delta} + \frac{1}{\delta}\sqrt{\frac{d\log(m/d) + \log(1/\delta)}{m}} + \delta\right).$   (42)

Thus, for the above bound to be meaningful, $\nu$ must be relatively small. In particular, for $\nu = \tilde{O}(\sqrt{d/m})$ we obtain a bound of the same order as the bound of Theorem 4.2. It is an open question whether this is an artifact of our proof technique or whether this is the price of generality, as we are interested in a bound that holds for any algorithm that performs approximate training error minimization.

PROOF: The theorem follows from the fact that any algorithm that comes within $\nu$ of minimizing the training error has error stability $\bigl(2\delta_0,\; 2\nu + 2\,VC(d,m-1,\delta_0) + 2\,VC(d,m,\delta_0)\bigr)$ (the proof is similar to that of Lemma 4.2), and from Theorem 4.1.   (Theorem 4.3)

4.2 Application to Bayesian Algorithms

We have just seen that training error minimization in fact implies error stability sufficient to obtain a sanity-check bound on the error of leave-one-out. More generally, we might hope to obtain bounds that depend on whatever error stability an algorithm does possess. In this section, we show that this hope can be realized for a natural class of randomized algorithms that behave in a Bayesian manner.

To begin with, we generalize Definitions 4.1 and 4.2 to include randomization simply by letting the probability in both definitions be taken over both the sample $S_m$ and any randomization required by the algorithm. We use the notation $A(S_m, \vec{r})$ to denote the hypothesis output by $A$ on input sample $S_m$ and random string $\vec{r}$, and $\hat{\epsilon}_{cv}(S_m, \vec{r}^{\,1},\ldots,\vec{r}^{\,m})$ to denote the leave-one-out estimate when the random string $\vec{r}^{\,i}$ is used on the call to $A$ on $S_m^i$.

DEFINITION 4.3 We say that a randomized algorithm $A$ has error stability $(\beta_1,\beta_2)$ if for every $m$,

$\Pr_{S_m,\,\vec{r},\,\vec{r}^{\,\prime}}\bigl[\,|\epsilon(A(S_m,\vec{r})) - \epsilon(A(S_{m-1},\vec{r}^{\,\prime}))| \geq \beta_2\,\bigr] \;\leq\; \beta_1,$   (43)

where $S_{m-1} = S_m \setminus \{(x_m,y_m)\}$.


DEFINITION 4.4 For any randomized algorithm $A$, we say that leave-one-out $(\gamma_1,\gamma_2)$-overestimates the training error for $A$ if for every $m$,

$\Pr_{S_m,\,\vec{r}^{\,1},\ldots,\vec{r}^{\,m},\,\vec{r}}\bigl[\,\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) \leq \hat{\epsilon}(A(S_m,\vec{r})) - \gamma_2\,\bigr] \;\leq\; \gamma_1.$   (44)

The proof of the following theorem is essentially the same as the proof of Theorem 4.1, where the only difference is that all probabilities are taken over the sample $S_m$ and the randomization of the algorithm.

THEOREM 4.4 Let $A$ be any randomized algorithm using a hypothesis space $H$ of VC dimension $d$ such that leave-one-out $(\gamma_1,\gamma_2)$-overestimates the training error for $A$, and $A$ has error stability $(\beta_1,\beta_2)$. Then for any $\delta > 0$, with probability at least $1-\delta$,

$|\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) - \epsilon(A(S_m,\vec{r}))| \;\leq\; \gamma_2 + VC(d,m,\delta^2) + \frac{2\beta_1 + \beta_2 + \gamma_1 + \gamma_2 + VC(d,m,\delta^2)}{\delta} + \delta.$   (45)

Here the probability is taken over the choice of $S_m$, and over the coin flips $\vec{r}^{\,1},\ldots,\vec{r}^{\,m}$ and $\vec{r}$ of $A$ on the $S_m^i$ and $S_m$.

We now apply Theorem 4.1 to the class of Bayesian algorithms — that is, algorithms that choose their hypotheses according to a posterior distribution, obtained from a prior that is modified by the sample data and a temperature parameter. Such algorithms are frequently studied in the simulated annealing and statistical physics literature on learning [SST92, GG84].

DEFINITION 4.5 We say that a randomized algorithm $A$ using hypothesis space $H$ is a Bayesian algorithm if there exists a prior $\pi$ over $H$ and a temperature $T \geq 0$ such that for any sample $S_m$ and any $h \in H$,

$\Pr_{\vec{r}}\bigl[A(S_m,\vec{r}) = h\bigr] \;=\; \frac{1}{Z}\,\pi(h)\,e^{-\hat{\epsilon}_{S_m}(h)\,m/T}.$   (46)

Here $Z = \sum_{h' \in H}\pi(h')\,e^{-\hat{\epsilon}_{S_m}(h')\,m/T}$ is the appropriate normalization.

Note that we still do not assume anything about the target function (for instance, it is not necessarily drawn according to $\pi$ or any other distribution) — it is only the algorithm that behaves in a Bayesian manner. Also, note that the special case in which $T = 0$ and the support of $\pi$ is $H$ results in training error minimization.
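A minimal sketch of such an algorithm for a finite hypothesis class, under the reconstruction of Equation (46) above (weight proportional to the prior times $e^{-m\hat{\epsilon}_{S_m}(h)/T}$, i.e. the number of training errors over $T$); the grid of thresholds, the uniform prior, and the toy all-positive sample are hypothetical choices for illustration only.

```python
import math
import random


def bayesian_algorithm(sample, hypotheses, prior, temperature):
    """Draw h with probability proportional to prior(h) * exp(-m * train_err(h) / T),
    as in the reconstructed Definition 4.5, for a finite hypothesis class."""
    m = len(sample)
    def train_err(h):
        return sum(h(x) != y for x, y in sample) / m
    weights = [prior[i] * math.exp(-m * train_err(h) / temperature)
               for i, h in enumerate(hypotheses)]
    # random.choices normalizes the weights, playing the role of Z.
    return random.choices(hypotheses, weights=weights, k=1)[0]


# Example: thresholds on a grid as the (finite) class, uniform prior, T = 0.5.
hypotheses = [(lambda x, t=t: int(x >= t)) for t in [i / 10 for i in range(11)]]
prior = [1 / len(hypotheses)] * len(hypotheses)
sample = [(random.random(), 1) for _ in range(20)]   # toy all-positive sample
h = bayesian_algorithm(sample, hypotheses, prior, temperature=0.5)
```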

We begin by giving a general lemma that identifies the only property about Bayesian algorithms that we will need; thus, all of our subsequent results will hold for any algorithm meeting the conclusion of this lemma.


LEMMA 4.4 Let $A$ be a Bayesian algorithm. For any sample $S$ and any example $(x,y) \in S$, let $p$ be the probability over $\vec{r}$ that $A(S,\vec{r})$ errs on $(x,y)$, and let $p'$ be the probability over $\vec{r}^{\,\prime}$ that $A(S \setminus \{(x,y)\},\vec{r}^{\,\prime})$ errs on $(x,y)$. Then $p' \geq p$.

PROOF: Let $Q$ be the distribution induced over $H$ when $A$ is called on $S$, and let $Q'$ be the distribution over $H$ induced when $A$ is called on $S \setminus \{(x,y)\}$. Then for any $h \in H$, $Q(h) \propto Q'(h)$ if $h$ does not err on $(x,y)$, and $Q(h) \propto e^{-1/T}\,Q'(h)$ if $h$ does err on $(x,y)$. Thus the only change from $Q'$ to $Q$ is to decrease the probability of drawing an $h$ which errs on $(x,y)$.   (Lemma 4.4)

The key result leading to a sanity-check bound for Bayesian algorithms follows. It bounds the extent to which leave-one-out overestimates the training error in terms of the error stability of the algorithm.

THEOREM 4.5 Let $A$ be a Bayesian algorithm (or any other algorithm satisfying the conclusion of Lemma 4.4) using a hypothesis space $H$ of VC dimension $d$, and suppose $A$ has error stability $(\beta_1,\beta_2)$. Then for any $\delta_0 > 0$, leave-one-out $(\gamma_1,\gamma_2)$-overestimates the training error for $A$ for

$\gamma_1 = 2\delta_0 + 3\sqrt{\beta_1}, \qquad \gamma_2 = 4\beta_2 + 4\,VC(d,m,\delta_0) + 2\sqrt{\beta_1} + \sqrt{\frac{\ln(1/\delta_0)}{2m}}.$

In order to prove Theorem 4.5, we will first need the following lemma, which says that with respect to the randomization of a Bayesian algorithm, the leave-one-out estimate is likely to overestimate the expected training error.

LEMMA 4.5 Let $A$ be a Bayesian algorithm (or any randomized algorithm satisfying the conclusion of Lemma 4.4). Then for any fixed sample $S_m = S_{m-1} \cup \{(x_m,y_m)\}$ and any $\delta_0 > 0$, with probability at least $1-\delta_0$ over $\vec{r}^{\,1},\ldots,\vec{r}^{\,m}$ and $\vec{r}$,

$\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) \;\geq\; \mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr] - \sqrt{\frac{\ln(1/\delta_0)}{2m}}.$   (47)

PROOF: For each $(x_i,y_i) \in S_m$, let $p_i$ be the probability over $\vec{r}$ that $A(S_m,\vec{r})$ errs on $(x_i,y_i)$, and let $p_i'$ be the probability over $\vec{r}^{\,i}$ that $A(S_m^i,\vec{r}^{\,i})$ errs on $(x_i,y_i)$. By Lemma 4.4 we know that $p_i' \geq p_i$. Then

$\mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr] \;=\; \mathrm{E}_{\vec{r}}\Bigl[\frac{1}{m}\bigl|\{i : A(S_m,\vec{r})(x_i) \neq y_i\}\bigr|\Bigr]$   (48)

$\;=\; \frac{1}{m}\sum_{i=1}^{m}\Pr_{\vec{r}}\bigl[A(S_m,\vec{r})(x_i) \neq y_i\bigr]$   (49)

$\;=\; \frac{1}{m}\sum_{i=1}^{m} p_i$   (50)

$\;\leq\; \frac{1}{m}\sum_{i=1}^{m} p_i'.$   (51)

Denote $\frac{1}{m}\sum_i p_i$ by $\bar{p}$, and $\frac{1}{m}\sum_i p_i'$ by $\bar{p}'$. Let $e_i$ be a Bernoulli random variable determined by $\vec{r}^{\,i}$ which is 1 if $A(S_m^i,\vec{r}^{\,i})$ errs on $(x_i,y_i)$ and 0 otherwise. By definition, $\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) = \frac{1}{m}\sum_i e_i$, and

$\mathrm{E}_{\vec{r}^{\,1},\ldots,\vec{r}^{\,m}}\bigl[\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m})\bigr] \;=\; \mathrm{E}_{\vec{r}^{\,1},\ldots,\vec{r}^{\,m}}\Bigl[\frac{1}{m}\sum_i e_i\Bigr]$   (52)

$\;=\; \bar{p}'$   (53)

$\;\geq\; \bar{p} \;=\; \mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr].$   (54)

By Chernoff's inequality, for any $\lambda > 0$,

$\Pr_{\vec{r}^{\,1},\ldots,\vec{r}^{\,m}}\Bigl[\frac{1}{m}\sum_i e_i \leq \bar{p}' - \lambda\Bigr] \;\leq\; e^{-2\lambda^2 m}.$   (55)

By setting $\lambda = \sqrt{\ln(1/\delta_0)/(2m)}$, we have that with probability at least $1-\delta_0$ over the choice of $\vec{r}^{\,1},\ldots,\vec{r}^{\,m}$,

$\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) \;\geq\; \mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr] - \sqrt{\frac{\ln(1/\delta_0)}{2m}}.$   (56)

(Lemma 4.5)

Now we can give the proof of Theorem 4.5.

PROOF (Theorem 4.5): Because $A$ has error stability $(\beta_1,\beta_2)$, if we draw $S_{m-1}$ and $(x_m,y_m)$ at random we have probability at least $1-\sqrt{\beta_1}$ of obtaining an $S_m$ such that

$\Pr_{\vec{r},\vec{r}^{\,\prime}}\bigl[\,|\epsilon(A(S_m,\vec{r})) - \epsilon(A(S_{m-1},\vec{r}^{\,\prime}))| \geq \beta_2\,\bigr] \;\leq\; \sqrt{\beta_1}.$   (57)

Equation (57) relates the error when $A$ is called on $S_m$ and $S_{m-1}$. We would like to translate this to a statement relating the error when $A$ is called on $S_m$ twice. But if $S_m$ satisfies Equation (57), it follows that

$\Pr_{\vec{r},\vec{r}^{\,\prime}}\bigl[\,|\epsilon(A(S_m,\vec{r})) - \epsilon(A(S_m,\vec{r}^{\,\prime}))| \geq 2\beta_2\,\bigr] \;\leq\; 2\sqrt{\beta_1}.$   (58)

The reason is that if $|\epsilon(A(S_m,\vec{r})) - \epsilon(A(S_m,\vec{r}^{\,\prime}))| \geq 2\beta_2$, then $\epsilon(A(S_{m-1},\vec{r}^{\,\prime\prime}))$ can be within $\beta_2$ of only one of $\epsilon(A(S_m,\vec{r}))$ and $\epsilon(A(S_m,\vec{r}^{\,\prime}))$, and each is equally likely to result from a call to $A$ on $S_m$. From Equation (58) and Theorem 2.1, we have that with probability at least $1 - \delta_0 - \sqrt{\beta_1}$, $S_m$ will satisfy

$\Pr_{\vec{r},\vec{r}^{\,\prime}}\bigl[\,|\hat{\epsilon}_{S_m}(A(S_m,\vec{r})) - \hat{\epsilon}_{S_m}(A(S_m,\vec{r}^{\,\prime}))| \geq 2\beta_2 + 2\,VC(d,m,\delta_0)\,\bigr] \;\leq\; 2\sqrt{\beta_1}.$   (59)

If $S_m$ satisfies Equation (59), it follows that there must be a fixed value $\hat{\epsilon}^* \in [0,1]$ such that

$\Pr_{\vec{r}}\bigl[\,|\hat{\epsilon}_{S_m}(A(S_m,\vec{r})) - \hat{\epsilon}^*| \geq 2\beta_2 + 2\,VC(d,m,\delta_0)\,\bigr] \;\leq\; 2\sqrt{\beta_1}.$   (60)

Assuming that Equation (60) holds, we get the following bounds on $\mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr]$:

$\mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr] \;\leq\; \bigl(\hat{\epsilon}^* + 2\beta_2 + 2\,VC(d,m,\delta_0)\bigr) + 2\sqrt{\beta_1}$   (61)

and

$\mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr] \;\geq\; \bigl(1 - 2\sqrt{\beta_1}\bigr)\bigl(\hat{\epsilon}^* - 2\beta_2 - 2\,VC(d,m,\delta_0)\bigr) \;\geq\; \hat{\epsilon}^* - 2\beta_2 - 2\,VC(d,m,\delta_0) - 2\sqrt{\beta_1}.$   (62)

In either case,

$\bigl|\mathrm{E}_{\vec{r}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}))\bigr] - \hat{\epsilon}^*\bigr| \;\leq\; 2\sqrt{\beta_1} + 2\beta_2 + 2\,VC(d,m,\delta_0),$   (63)

and thus by Equation (60), with probability at least $1 - \delta_0 - \sqrt{\beta_1}$ over the draw of $S_m$, $S_m$ will be such that the probability over $\vec{r}$ that

$\bigl|\hat{\epsilon}_{S_m}(A(S_m,\vec{r})) - \mathrm{E}_{\vec{r}^{\,\prime}}\bigl[\hat{\epsilon}_{S_m}(A(S_m,\vec{r}^{\,\prime}))\bigr]\bigr| \;\geq\; 4\beta_2 + 4\,VC(d,m,\delta_0) + 2\sqrt{\beta_1}$   (64)

is at most $2\sqrt{\beta_1}$. Combined with Lemma 4.5, we obtain that with probability at least $1 - 2\delta_0 - 3\sqrt{\beta_1}$ over $S_m$, $\vec{r}^{\,1},\ldots,\vec{r}^{\,m}$ and $\vec{r}$,

$\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) \;\geq\; \hat{\epsilon}_{S_m}(A(S_m,\vec{r})) - \Bigl(4\beta_2 + 4\,VC(d,m,\delta_0) + 2\sqrt{\beta_1} + \sqrt{\tfrac{\ln(1/\delta_0)}{2m}}\Bigr),$   (65)

as desired.   (Theorem 4.5)

Now we can give the main result of this section.

THEOREM 4.6 Let $A$ be a Bayesian algorithm (or any randomized algorithm satisfying the conclusion of Lemma 4.4) that has error stability $(\beta_1,\beta_2)$. Then for any $\delta > 0$, with probability at least $1-\delta$,

$|\hat{\epsilon}_{cv}(S_m,\vec{r}^{\,1},\ldots,\vec{r}^{\,m}) - \epsilon(A(S_m,\vec{r}))| \;=\; O\!\left(\frac{\sqrt{\beta_1} + \beta_2 + VC(d,m,\delta^2)}{\delta} + \delta\right).$   (66)

Here the probability is taken over the choice of $S_m$, and over the coin flips $\vec{r}^{\,1},\ldots,\vec{r}^{\,m}$ and $\vec{r}$ of $A$ on the $S_m^i$ and $S_m$.

Thus, Theorem 4.6 relates the error of leave-one-out to the stability of a Bayesian algorithm: as $\beta_1,\beta_2 \to 0$, we obtain an $O(VC(d,m,\delta^2)/\delta + \delta)$ (that is, roughly $\tilde{O}(\sqrt{d/m})/\delta$) bound. Note that for Bayesian algorithms, we expect increasing error stability (i.e., $\beta_1,\beta_2 \to 0$) as the number of examples increases, or as the temperature decreases.

4.3 Application to Linear Functions and Squared Error

In this section, we briefly describe an extension of the ideas developed so far to problems in which the outputs of both the target function and the hypothesis functions are real-valued, and the error measure is squared loss. The importance of this extension is due to the fact that for squared error, there is a particularly nice case (linear hypothesis functions) for which empirical error minimization can be efficiently implemented, and the leave-one-out estimate can be efficiently computed.

Our samples $S_m$ now consist of examples $(x_i,y_i)$, where $x_i \in \Re^n$ and $y_i \in [-B, B]$. For any function $h : \Re^n \to [-B, B]$, we now define the generalization error by

$\epsilon(h) \stackrel{\rm def}{=} \mathrm{E}_{x \sim P}\bigl[(h(x) - f(x))^2\bigr],$   (67)

and similarly the training error becomes

$\hat{\epsilon}(h) \stackrel{\rm def}{=} \frac{1}{m}\sum_{i=1}^{m}\bigl(h(x_i) - y_i\bigr)^2.$   (68)

For any algorithm $A$, if $h_m^i$ denotes $A(S_m^i)$, the leave-one-out estimate is now

$\hat{\epsilon}_{cv}(S_m) \stackrel{\rm def}{=} \frac{1}{m}\sum_{i=1}^{m}\bigl(h_m^i(x_i) - y_i\bigr)^2.$   (69)

It can be verified that in such situations, provided that a uniform convergence result analogous to Theorem 2.1 can be proved, the analogue to Theorem 4.2 can be obtained (with essentially the same proof), where the expression $VC(d,m,\delta)$ in the bound must be replaced by the appropriate uniform convergence expression. We will not state the general theorem here, but instead concentrate on an important special case. It can easily be verified that Lemma 4.3 still holds in the squared error case: that is, if $A$ performs (squared) training error minimization, then for any sample $S_m$, $\hat{\epsilon}_{cv}(S_m) \geq \hat{\epsilon}(A(S_m))$. Furthermore, if the hypothesis space $H$ consists of only linear functions $\vec{w}\cdot\vec{x}$, then provided the squared loss is bounded for each $\vec{x}$, nice uniform convergence bounds are known.

THEOREM 4.7 Let the target function $f$ be an arbitrary mapping from $\Re^n$ into $[-B, B]$, where $B > 0$ is a constant, and let $P$ be any input distribution over $[-B, B]^n$. Let $A$ perform squared training error minimization over the class of all linear functions $\vec{w}\cdot\vec{x}$ obeying $\|\vec{w}\| \leq B$. Then for every $\delta > 0$, with probability at least $1-\delta$,

$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| \;=\; O\!\left(\frac{1}{\delta}\sqrt{\frac{n\log(m/n) + \log(1/\delta)}{m}} + \delta\right).$   (70)

Note that while the bound given in Theorem 4.7 is weaker than that proved by Vapnik [Vap82, Chap. 8] (for squared error minimization over the class of linear functions), it is much more general. Namely, we make no assumptions on the distribution according to which the examples are generated or on the function labeling them.

Two very fortunate properties of the combination of linear functions and squared error make the sanity-check bound given in Theorem 4.7 of particular interest:

- There exist polynomial-time algorithms for performing minimization of squared training error [DH73] by linear functions. These algorithms do not necessarily obey the constraint $\|\vec{w}\| \leq B$, but we suspect this is not an obstacle to the validity of Theorem 4.7 in most practical settings.

- There is an efficient procedure for computing the leave-one-out estimate for training error minimization of the squared error over linear functions [Mil90]. Thus, it is not necessary to run the error minimization procedure $m$ times; there is a closed-form solution for the leave-one-out estimate that can be computed directly from the data much more quickly.

More generally, many of the results given in this paper can be generalized to other loss functions via the proper generalizations of uniform convergence [Hau92].

4.4 Other Algorithms

We now comment briefly on the application of Theorem 4.1 to algorithms other than error minimization and Bayesian procedures. As we have already noted, the only barrier to applying Theorem 4.1 to obtain bounds on the leave-one-out error that depend only on the error stability and $\tilde{O}(\sqrt{d/m})$ lies in proving that leave-one-out sufficiently overestimates the training error (or more precisely, that with high probability it does not underestimate the training error by much). We believe that while it may be difficult to prove this property in full generality for many types of algorithms, it may nevertheless often hold for natural algorithms running on natural problems.

For instance, note that in the deterministic case, leave-one-out will $(0,0)$-overestimate the training error as long as $A$ has the stronger property that if $A(S)$ erred on an example $(x,y) \in S$, then $A(S \setminus \{(x,y)\})$ errs on $(x,y)$ as well. In other words, the removal of a point from the sample cannot improve the algorithm's performance on that point. This stronger property is exactly what was proven in Lemma 4.3 for training error minimization, and its randomized-algorithm analogue was shown for Bayesian algorithms in Lemma 4.4. To see why this property may be plausible for a natural heuristic, consider (in the squared error case) an algorithm that is performing a gradient descent on the training error over some continuous parameter space $\vec{w}$. Then the gradient with respect to $\vec{w}$ can be written as a sum of gradients, one for each example in $S_m$. The gradient term for $(x,y)$ gives a force on $\vec{w}$ in a direction that causes the error on $(x,y)$ to decrease. Thus, the main effect on the algorithm of removing $(x,y)$ is to remove this term from the gradient, which intuitively should cause the algorithm's performance on $(x,y)$ to degrade. (The reason why this argument cannot be turned into a proof of training error overestimation is that it technically is valid only for one step of the gradient descent.) It is an interesting open problem to verify whether this property holds for widely used heuristics.


5 Lower Bounds

In this section, we establish the following:

- That the dependence on $1/\delta$ is in general unavoidable for the leave-one-out estimate;

- That in the case of algorithms that perform error minimization, the dependence of the error of leave-one-out on the VC dimension cannot be removed without additional assumptions on the algorithm;

- That for any algorithm, some form of error stability is necessary in order to provide nontrivial bounds on the leave-one-out estimate;

- That there exist algorithms with perfect error stability for which the leave-one-out estimate is arbitrarily poor, and furthermore, these algorithms use a hypothesis class with constant VC dimension.

These last two points are especially important. Though we cannot prove that precisely our form of error stability is necessary, we can prove the necessity of a slightly weaker form of error stability. This implies that the leave-one-out estimate cannot provide very good bounds if no error-stability condition is met. On the other hand, we show that error stability by itself is not sufficient even when the hypothesis class has very small VC dimension. Therefore, additional assumptions on the algorithm must be made. The additional assumptions made in Theorem 4.1 were sufficient training error overestimation and bounded VC dimension. In contrast, hypothesis stability alone is a sufficient condition for nontrivial bounds, but is far from necessary. We note that some of our lower bounds (as is often the case with such bounds) use quite singular distributions. It is left as an open question whether one can obtain improved bounds under certain continuity assumptions on the underlying distribution.

We begin with the lower bound giving an example where there is an $\Omega(1/\sqrt{m})$ chance of constant error for the leave-one-out estimate. Setting $\delta = \Theta(1/\sqrt{m})$ in Theorem 4.1 shows that the dependence on $\delta$ given there is tight (up to logarithmic factors). This theorem has appeared elsewhere [DGL96, Chap. 24], but we include it here for completeness.

THEOREM 5.1 There exists an input distribution $P$, a target function $f$, a hypothesis class $H$ of VC dimension 1, and an algorithm $A$ that minimizes the training error over $H$, such that with probability $\Omega(1/\sqrt{m})$, $|\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_m))| \ge 1/2$.

PROOF: Let the input space $X$ consist of a single point $x$, and let the target function $f$ be the probabilistic function that flips a fair coin on each trial to determine the label given to $x$. Thus, the generalization error of any hypothesis is exactly $1/2$. The algorithm $A$ simply takes the majority label of the sample as its hypothesis. Now with probability $\Omega(1/\sqrt{m})$, the sample $S_m$ will have a balanced number of positive and negative examples, in which case $\hat{\epsilon}_{cv}(A, S_m) = 1$, proving the theorem.

(Theorem 5.1)
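The construction is simple enough to simulate directly. The following sketch is our code (the function names are ours): it draws fair-coin labels for the single input point, applies the majority-label rule, and checks that on balanced samples the leave-one-out estimate equals 1, while the generalization error of every hypothesis is exactly 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)

def majority(labels):
    # The algorithm A: output the majority label of its input sample
    # (ties broken in favor of 1; ties never arise in the loop below).
    return 1 if 2 * labels.sum() >= len(labels) else 0

def loo_estimate(labels):
    # Leave-one-out: retrain on the sample minus point i, test on point i.
    m = len(labels)
    mistakes = sum(majority(np.delete(labels, i)) != labels[i] for i in range(m))
    return mistakes / m

m, trials = 20, 20000            # m even, as in the balanced-sample argument
balanced = []
for _ in range(trials):
    labels = rng.integers(0, 2, size=m)   # fair-coin labels of the single point x
    if labels.sum() == m // 2:            # a balanced sample
        balanced.append(loo_estimate(labels))

print("fraction of balanced samples (Theta(1/sqrt(m))):", len(balanced) / trials)
print("leave-one-out estimate on balanced samples (always 1.0):",
      float(np.mean(balanced)))
# The generalization error of every hypothesis is exactly 1/2, so on balanced
# samples |loo estimate - generalization error| = 1/2.
```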

The following theorem shows that in the case of algorithms that perform training error minimization, the dependence of the error of the leave-one-out estimate on the VC dimension is unavoidable without further assumptions on the algorithm.

THEOREM 5.2 For any $d$, there exists an input distribution $P$, a target function $f$, a hypothesis class $H$ of VC dimension $d$, and an algorithm $A$ that minimizes the training error over $H$, such that with probability $\Omega(1)$, $|\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_m))| = \Omega(d/m)$.

PROOF: Let $X = [0,1]$, and let the underlying distribution be uniform. The hypothesis class $H$ consists of all $d$-switch functions over $[0,1]$ (that is, it consists of all functions defined by $d+1$ disjoint intervals covering $[0,1]$, with a binary label associated with each interval). Let $h_0$ be the $d$-switch function over $[0,1]$ in which the switches are evenly spaced, $1/(d+1)$ apart.

The algorithm $A$ behaves as follows: if the sample size is even, $A$ first checks whether $h_0$ minimizes the training error on the sample. If so, it selects $h_0$ as its hypothesis. Otherwise, $A$ chooses the "leftmost" hypothesis that minimizes the training error over $[0,1]$ (that is, the hypothesis that minimizes the training error and always places its switches as far to the left as possible between the two sample points where the switch occurs). If the sample size is odd, $A$ chooses the leftmost hypothesis minimizing the training error over $[0,1]$. Thus, on even samples, $A$ has a strong bias towards choosing $h_0$ over $[0,1]$, but on odd samples it has no such bias.

Now suppose that the target function labels $[0,1]$ according to $h_0$, and let $m$ be even. Then $A$ necessarily chooses $h_0$ (as it is consistent with the sample), and so $\epsilon(A(S_m)) = 0$. But the leave-one-out estimate of this generalization error is much larger: each held-out sample has odd size, so $A$ returns the leftmost training error minimizer, and for every sample point $x_i$ that is the rightmost point in its interval of $h_0$ (other than the last interval), removing $x_i$ lets the switch to its right slide left past $x_i$, so that $h_m^i = A(S_m \setminus \{\langle x_i, y_i \rangle\})$ errs on $x_i$. Since $h_0$ has $d$ switches, with high probability $\hat{\epsilon}_{cv}(A, S_m)$ will be $\Omega(d/m)$, as desired. (Theorem 5.2)

We next show that some form of error stability is essential for providing upper bounds on the error of the leave-one-out estimate.

DEFINITION 5.1 We say that a deterministic algorithm $A$ has error stability $\beta$ in expectation if for every $m$, $\left| \mathbf{E}_{S_{m-1}, \langle x, y \rangle}\left[ \epsilon(A(S_{m-1})) - \epsilon(A(S_m)) \right] \right| \le \beta$, where $S_m = S_{m-1} \cup \{\langle x, y \rangle\}$ (and $\beta$ may be a function of $m$).

Thus we are asking that the difference between the errors (of the hypotheses output by $A$ when trained on $S_m$ and on $S_{m-1}$, respectively) be small, in expectation. This is in general weaker than the requirement in Definition 4.1, since the above expectation could be $0$ while there is high probability that the error sometimes increases by much when a point is added, and sometimes decreases by much. We note, however, that for any reasonable learning algorithm, in which there is very small probability that the error increases when a point is added, the above definition is not effectively weaker. For such reasonable algorithms we are essentially showing that error stability as defined in Definition 4.1 is necessary.
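Because Definition 5.1 constrains only an expectation, it can be estimated directly by Monte Carlo. The sketch below is ours; `algorithm`, `draw_example`, and `gen_error` are hypothetical stand-ins for the learning rule, the example oracle, and an exact generalization-error oracle, and the toy instantiation at the bottom exists only to make the sketch runnable.

```python
import numpy as np

def estimate_error_stability_in_expectation(algorithm, draw_example, gen_error,
                                            m, trials=2000, seed=0):
    # Monte Carlo estimate of E[eps(A(S_{m-1})) - eps(A(S_m))], where S_m is
    # S_{m-1} plus one freshly drawn example; error stability beta in
    # expectation asks that the absolute value of this quantity be at most beta.
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(trials):
        small = [draw_example(rng) for _ in range(m - 1)]
        big = small + [draw_example(rng)]
        diffs.append(gen_error(algorithm(small)) - gen_error(algorithm(big)))
    return float(np.mean(diffs))

# Toy instantiation (ours): labels are Bernoulli(0.3) coins, the algorithm
# outputs the constant hypothesis given by the sample's majority label, and
# the generalization errors of the constants 0 and 1 are 0.3 and 0.7.
P_ONE = 0.3
draw = lambda rng: int(rng.random() < P_ONE)
majority = lambda sample: int(sum(sample) * 2 > len(sample))
gen_err = lambda label: P_ONE if label == 0 else 1.0 - P_ONE

print(estimate_error_stability_in_expectation(majority, draw, gen_err, m=51))
```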

THEOREM 5.3 Let $A$ be any algorithm that does not have error stability $\beta$ in expectation. Then there exist values of $m$ such that for any $\lambda > 0$,

$\Pr_{S_m}\left[\, |\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_m))| \ge \lambda \,\right] \ge \dfrac{\beta - \lambda}{1 - \lambda}.$   (71)

PROOF: Since $A$ does not have error stability $\beta$ in expectation, it is either the case that for some $m$, $\mathbf{E}_{S_{m-1},\langle x,y\rangle}\left[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))\right] > \beta$, or that $\mathbf{E}_{S_{m-1},\langle x,y\rangle}\left[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))\right] < -\beta$. Without loss of generality, assume the former is true. Let $\psi(S_m)$ be a random variable defined as follows: $\psi(S_m) = \hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_m))$. Thus,

$\psi(S_m) = \left(\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_{m-1}))\right) + \left(\epsilon(A(S_{m-1})) - \epsilon(A(S_m))\right)$

and

$\mathbf{E}_{S_m}\left[\psi(S_m)\right] = \mathbf{E}_{S_m}\left[\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_{m-1}))\right] + \mathbf{E}_{S_{m-1},\langle x,y\rangle}\left[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))\right],$   (72)

where $S_m = S_{m-1} \cup \{\langle x, y \rangle\}$. By Lemma 4.1 (which implies that the first expectation is $0$) and our assumption on $A$ (which implies that the second expectation exceeds $\beta$), we get that $\mathbf{E}_{S_m}[\psi(S_m)] > \beta$.

Let $p$ be the exact probability that $|\psi(S_m)| \ge \lambda$. Since $\psi(S_m) \le 1$ always, we have

$\beta < \mathbf{E}_{S_m}[\psi(S_m)]$   (73)

$\le p \cdot 1 + (1 - p)\lambda$   (74)

$= p(1 - \lambda) + \lambda.$   (75)

Thus, $p(1 - \lambda) > \beta - \lambda$, and equivalently,

$p > \dfrac{\beta - \lambda}{1 - \lambda},$   (76)

which means that with probability at least $(\beta - \lambda)/(1 - \lambda)$, $|\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_m))| \ge \lambda$. (Theorem 5.3)
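Equations (73)-(75) are just a Markov-style averaging argument for a random variable bounded by 1; the quick numeric check below (our code) illustrates the resulting inequality $\Pr[|\psi| \ge \lambda] \ge (\beta - \lambda)/(1 - \lambda)$ whenever $\mathbf{E}[\psi] > \beta$.

```python
import numpy as np

rng = np.random.default_rng(3)

# psi is any random variable with |psi| <= 1; if E[psi] > beta, then for any
# lambda, Pr[|psi| >= lambda] >= (beta - lambda) / (1 - lambda).
psi = np.clip(rng.normal(loc=0.25, scale=0.4, size=200000), -1.0, 1.0)
beta = psi.mean() - 1e-3          # any beta strictly below E[psi]

for lam in (0.05, 0.1, 0.2):
    lhs = np.mean(np.abs(psi) >= lam)
    rhs = (beta - lam) / (1.0 - lam)
    print(f"lambda={lam}: Pr[|psi|>=lambda]={lhs:.3f} >= {rhs:.3f}", lhs >= rhs)
```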

Finally, we show that, unlike hypothesis stability, error stability alone is not sufficient to give nontrivial bounds on the error of leave-one-out even when the hypothesis class has very small VC dimension, and hence additional assumptions are required. (Precisely for this reason we were not able to show that the stronger notion of error stability is in fact necessary: we are not able to rule out the case that an algorithm is very unstable according to Definition 4.1, but that this instability averages out when computing the leave-one-out estimate.)


THEOREM 5.4 There exists an input distribution $P$, a target function $f$, a hypothesis class $H$ with constant VC dimension, and an algorithm $A$, such that $A$ has error stability $(0,0)$ with respect to $P$ and $f$, but with probability 1, $|\hat{\epsilon}_{cv}(A, S_m) - \epsilon(A(S_m))| = 1/2$.

PROOF: Let $X = \{0, 1, \ldots, N-1\}$, where $N$ is even, let $f$ be the constant $0$ function, let $P$ be the uniform distribution on $X$, and let $H$ be the following class of (boolean) threshold functions:

$H = \{\, h_T : T \in \{0, 1, \ldots, N-1\} \,\}$, where $h_T(x) = 1$ iff $(x + T) \bmod N \in \{0, \ldots, N/2 - 1\}$.   (77)

Clearly, the VC dimension of $H$ is 2. Furthermore, for every $h \in H$, the distance between $f$ and $h$ is exactly $1/2$, and hence any algorithm using hypothesis class $H$ is $(0,0)$ stable with respect to $f$. It thus remains to show that there exists an algorithm $A$ for which the leave-one-out estimate always has large error.

For a given sample $S_m = \{\langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle\}$, let $T = \sum_{j=1}^{m} x_j \bmod N$, and let $A(S_m) = h_T$, where $h_T$ is as defined in Equation (77). Thus, the algorithm's hypothesis is determined by the sum of the (unlabeled) examples. We next compute the leave-one-out estimate of the algorithm on $S_m$. Assume first that $S_m$ is such that $\sum_{j=1}^{m} x_j \bmod N \in \{0, \ldots, N/2 - 1\}$. Then, by the definition of $A$, for each $x_i$ the hypothesis $h_m^i = A(S_m \setminus \{\langle x_i, y_i \rangle\})$ will label $x_i$ by $1$ (since $(x_i + (T - x_i)) \bmod N = T \bmod N$), whereas $f(x_i) = 0$. Similarly, if $S_m$ is such that $\sum_{j=1}^{m} x_j \bmod N \in \{N/2, \ldots, N-1\}$, then for each $x_i$, $h_m^i(x_i) = 0$, which is the correct label according to $f$. In other words, for half of the samples $S_m$ (since the sum modulo $N$ of independent uniform examples is itself uniformly distributed) we have $\hat{\epsilon}_{cv}(A, S_m) = 1$, which means that leave-one-out overestimates $\epsilon(A(S_m)) = 1/2$ by $1/2$, and for half of the samples it underestimates the error by $1/2$. (Theorem 5.4)
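This construction is easy to simulate. The sketch below is our code, built on the class of Equation (77); it confirms that the leave-one-out estimate is 0 or 1 according to whether the sum of the sample points modulo $N$ lands in the second or the first half of the domain, while every $h_T$ has generalization error exactly 1/2 with respect to the constant-0 target.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100                       # even domain size, X = {0, ..., N-1}

def h(T, x):
    # Hypothesis h_T labels x with 1 iff (x + T) mod N falls in the first half.
    return int((x + T) % N < N // 2)

def algorithm(xs):
    # A outputs h_T with T = sum of the (unlabeled) examples mod N.
    return sum(xs) % N

def loo_estimate(xs):
    # Target f is identically 0, so a mistake is a prediction of 1.
    m = len(xs)
    return sum(h(algorithm(xs[:i] + xs[i+1:]), xs[i]) for i in range(m)) / m

m, trials = 30, 2000
loo_values = []
for _ in range(trials):
    xs = [int(rng.integers(N)) for _ in range(m)]
    loo_values.append(loo_estimate(xs))
    # Leave-one-out is 1 exactly when the full sum lands in the first half.
    assert loo_values[-1] == (1.0 if algorithm(xs) < N // 2 else 0.0)

print("fraction of samples with loo = 1 (about 1/2):",
      float(np.mean([v == 1.0 for v in loo_values])))
# Every h_T labels exactly N/2 points with 1, so its error with respect to the
# constant-0 target under the uniform distribution is exactly 1/2; hence
# |loo estimate - generalization error| = 1/2 on every sample.
```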

6 Extensions and Open Problems

It is worth mentioning explicitly that in the many situations when uniform convergence bounds better than $O(\sqrt{d/m})$ can be obtained [SST92, HKST96], our resulting bounds for leave-one-out will be correspondingly better as well.

There are a number of interesting open problems, both theoretical and experimental. On the experimental side, it would be interesting to determine the "typical" dependence of the leave-one-out estimate's performance on the VC dimension for various commonly used algorithms. It would also be of interest to establish the extent to which these algorithms possess error stability and the extent to which leave-one-out overestimates the training error. On the theoretical side, it would be nice to prove sanity-check bounds for leave-one-out for popular heuristics like C4.5 and backpropagation. Also, it would be interesting to find additional properties, other than training error overestimation, which together with error stability and bounded VC dimension suffice for proving sanity-check bounds. Finally, there is almost certainly room for improvement in both our upper and lower bounds: our emphasis has been on the qualitative behavior of leave-one-out in terms of a number of natural parameters of the problem, not the quantitative behavior.

Acknowledgments

Thanks to Avrim Blum for interesting discussions on cross-validation, to Sean Holden for pointing out a mistake we had in Theorem 5.3 and for other discussions, and to Nabil Kahale for improving the construction in the proof of Theorem 5.4. We would also like to thank two anonymous referees for their comments.

References

[DGL96] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[DW79a] L. P. Devroye and T. J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, IT-25(2):202–207, 1979.

[DW79b] L. P. Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, IT-25(5):601–604, 1979.

[GG84] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

[HKST96] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25:195–236, 1996.

[Hol96a] S. B. Holden. Cross-validation and the PAC learning model. Research Note RN/96/64, Dept. of Computer Science, University College London, 1996.

[Hol96b] S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.


[Kea96] M. Kearns. A bound on the error of cross validation, with consequences for the training-test split. In Advances in Neural Information Processing Systems 8, pages 183–189, 1996. To appear in Neural Computation.

[KMN+95] M. J. Kearns, Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Workshop on Computational Learning Theory, pages 21–30, 1995. To appear in Machine Learning, COLT95 Special Issue.

[Koh95] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, 1995.

[KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.

[Mil90] A. J. Miller. Subset Selection in Regression. Chapman and Hall, 1990.

[RW78] W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3):506–514, 1978.

[SST92] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056–6091, 1992.

[Vap82] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
