

Neural Processing Letters 11: 51–58, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Evaluating the Generalization Ability of Support Vector Machines through the Bootstrap

DAVIDE ANGUITA, ANDREA BONI and SANDRO RIDELLA
Dept. of Biophysical and Electronic Engineering, University of Genova, Via Opera Pia 11a, 16145 Genova, Italy, E-mail: [email protected]

Abstract. The well-known bounds on the generalization ability of learning machines, based on the Vapnik–Chervonenkis (VC) dimension, are very loose when applied to Support Vector Machines (SVMs). In this work we evaluate the validity of the assumption that these bounds are, nevertheless, good indicators of the generalization ability of SVMs. We show that this assumption is, in general, true and assess its correctness, in a statistical sense, on several pattern recognition benchmarks through the use of the bootstrap technique.

Key words: bootstrap, generalization, support vector machines, VC dimension

1. Introduction

Support Vector Machines (SVMs) are a promising class of learning machines whose solid theoretical basis was built during the 1970s by the work of the Russian school on Statistical Learning Theory [14]. Recently, the theory behind SVMs has been further developed [6] and has received increasing attention due to the remarkable performance obtained on real-world problems [11].

The generalization performance of SVMs can be predicted by using the well-known bounds for learning machines based on the Vapnik–Chervonenkis dimension [15, 3]. Unlike that of other pattern recognizers, the VC dimension of SVMs can be easily computed; however, the resulting bounds are, in general, too loose to be of any practical use. Despite this drawback, it has been argued that these bounds are good indicators of the actual generalization performance of SVMs. This paper investigates the validity of such an assumption, using some real-world datasets. In particular, we perform a statistical validation, based on the Bootstrap technique [9], to relate the theoretical generalization bounds and the actual generalization performance of SVMs.

The following section contains a brief introduction to SVM theory and its principal concepts; in Section 3 we describe the bootstrap technique used to perform the statistical validation, and in Section 4 we present and discuss the results on some real-world problems.


2. Support Vector Learning and Generalization Accuracy

Let us consider a two-class labeled data set $\{(x_i, y_i),\ i = 1 \ldots n\}$, where $x_i \in \mathbb{R}^m$ and $y_i = \pm 1$. The linear SVM is a classifier equivalent to a single perceptron [6]:

$$f(x) = \mathrm{sign}(w \cdot x + b) \qquad (1)$$

with weights $w$ given by a linear combination of the vectors of the training set

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad (2)$$
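As a concrete reading of Equations (1) and (2), here is a minimal sketch in plain Python. The multipliers `alpha` and the bias `b` would normally come from solving the quadratic program below; the toy values used here are invented purely for illustration.

```python
# Linear SVM decision function, Equations (1) and (2).
# alpha and b are assumed to come from a trained SVM; the toy
# support vectors below are invented for illustration only.

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def svm_weights(alpha, y, X):
    """w = sum_i alpha_i * y_i * x_i  (Equation 2)."""
    m = len(X[0])
    w = [0.0] * m
    for a, yi, xi in zip(alpha, y, X):
        for j in range(m):
            w[j] += a * yi * xi[j]
    return w

def svm_decision(x, w, b):
    """f(x) = sign(w . x + b)  (Equation 1)."""
    return 1 if dot(w, x) + b >= 0 else -1

# Two toy support vectors, one per class
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.5, 0.5]
w = svm_weights(alpha, y, X)          # -> [1.0, 1.0]
b = 0.0
print(svm_decision([2.0, 0.5], w, b))  # -> 1
```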

and obtained by solving the following constrained quadratic programming problem:

$$E(\alpha) = \frac{1}{2}\alpha^T Q \alpha + r^T \alpha \qquad (3)$$

$$0 \le \alpha_i \le C \quad \forall i = 1 \ldots n \qquad (4)$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (5)$$

where $C$ is a given constant, $\alpha, r \in \mathbb{R}^n$, $r_i = -1\ \forall i$ and $q_{ij} = q_{ji} = y_i y_j x_i \cdot x_j$.

It can be shown that, if the training set is linearly separable, the hyperplane found by solving the above optimization problem is optimal, in the sense that it has the maximum distance from the border patterns of each class [14].

General SVMs extend this concept to the nonlinear case by mapping each input vector into a feature space through a nonlinear transformation $\Phi : \mathbb{R}^m \to \mathbb{R}^M$, with $M \gg m$, and then finding an optimal hyperplane in $\mathbb{R}^M$. Unfortunately, the problem of finding the optimal hyperplane easily becomes intractable due to the huge (possibly infinite) dimensionality of the feature space. However, using Equation (2), it is possible to rewrite Equation (1) for the nonlinear case:

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \cdot \Phi(x) + b\right) \qquad (6)$$

With this formulation, we are not required to work explicitly in the feature space, but only to deal with inner products of the form $\Phi(x_i) \cdot \Phi(x_j)$. This can be accomplished by the use of "kernel functions" of the form

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \qquad (7)$$
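Equation (7) is what makes the nonlinear case tractable: the feature-space inner product is computed without ever forming $\Phi$ explicitly. A textbook sketch of this equivalence, using the homogeneous quadratic kernel and its explicit 2-D feature map (this is an illustrative kernel, not the one used in this paper):

```python
import math

def k_quad(x, z):
    """Homogeneous quadratic kernel K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map with K(x, z) = phi(x) . phi(z) for 2-D inputs."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 0.5]
lhs = k_quad(x, z)                                  # kernel evaluation
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))    # explicit inner product
print(lhs, rhs)  # both 16.0
```

The two values coincide, while `k_quad` touches only the 2-D inputs; for kernels whose feature space is infinite-dimensional (like the RBF kernel below), the explicit route is impossible and the kernel evaluation is the only option.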


The use of different kernel functions changes the mapping from the input to the feature space and, therefore, the behavior of the SVM. Among the possible kernel functions, we focus on Radial Basis Functions (RBFs) of the form

$$K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \qquad (8)$$

Different SVMs are obtained by changing the kernel parameter $\sigma$. See, for example, [7] for a more in-depth discussion on finding an optimal $\sigma$ and speeding up the learning for these kernels.
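A direct transcription of Equation (8), parameterized by $\sigma^2$ as in the experiments below (a sketch; inputs are plain Python lists):

```python
import math

def rbf_kernel(x, z, sigma2):
    """RBF kernel of Equation (8): exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0], 1.0))  # 1.0 -- K(x, x) = 1 for any x
```

Note that $K(x, x) = 1$ for every input, a property of this kernel that is used below when computing $R^2$ and $\|w\|^2$ through kernel evaluations alone.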

In [15, 3] it has been proved that the generalization ability of a SVM is bounded from above by a function of the quantity

$$\theta = \frac{R^2 \|w\|^2}{n} \qquad (9)$$

where $R$ is the radius of the smallest (hyper-)sphere containing the training patterns in the feature space and $w$ are the weights of the optimal hyperplane.

The computation of $R^2$ and $\|w\|^2$ is a relatively easy task. In fact, both values can be computed through the SVM kernel function, without explicitly working in the feature space. For this reason, we are interested in a possible use of $\theta$ as an indicator of the generalization ability of a SVM.

As an example of how easily the above quantities can be computed, from Equation (2) we can write:

$$\|w\|^2 = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (10)$$

Due to space constraints, we refer the reader to [5, 1] for a simple method of computing $R^2$.
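Equation (10) can be sanity-checked against the explicit weight vector of the linear case, where both routes are available. A sketch (the toy $\alpha$, $y$ and training set are invented):

```python
def w_norm_sq(alpha, y, X, kernel):
    """||w||^2 via Equation (10), using only kernel evaluations."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
    return total

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.5, 0.5]
# Explicit route: w = sum_i alpha_i y_i x_i = [1.0, 1.0], so ||w||^2 = 2.0
print(w_norm_sq(alpha, y, X, linear_kernel))  # 2.0
```

With a nonlinear kernel the explicit route disappears, but `w_norm_sq` is unchanged: only kernel evaluations are needed, which is exactly the point of Equation (10).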

3. Bootstrap Validation

In order to perform a statistical validation of the generalization performance of SVMs we use the first order accurate Bootstrap [9]. We build $T^{*1}, \ldots, T^{*B}$ independent bootstrap replicates of the training set $T$, each consisting of $n$ patterns drawn with replacement from $T$. The patterns of $T$ that do not appear in a given bootstrap replicate are collected in the corresponding validation sets $V_1, \ldots, V_B$.
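The replicate construction just described can be sketched index-wise (a sketch; the set size and seed are arbitrary):

```python
import random

def bootstrap_replicate(n, rng):
    """Draw n indices with replacement (the replicate T*b); the unused
    indices form the corresponding validation set V_b."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    val_idx = sorted(set(range(n)) - set(train_idx))
    return train_idx, val_idx

rng = random.Random(0)
train_idx, val_idx = bootstrap_replicate(100, rng)
print(len(train_idx), len(val_idx))  # 100 and, on average, about 0.368 * 100
```

By construction the validation indices never appear in the replicate, which is what later allows them to serve as an unbiased test set.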

Given a fixed $\sigma^2$, we build a SVM for each bootstrap replicate, solving the related quadratic optimization problem; then we compute an estimate of the generalization predictor $\theta^{*b}$ for $b = 1, \ldots, B$. The average and the standard error of the predictor are given by:

$$\bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \theta^{*b} \qquad (11)$$


$$se_\theta = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( \theta^{*b} - \bar{\theta} \right)^2 \right]^{1/2} \qquad (12)$$

Using the bootstrap theory, we can compute the confidence level of these estimates; in fact [9]:

$$\mathrm{Prob}\left\{\theta \in \bar{\theta} \pm z^{(\alpha)} \cdot se_\theta\right\} = 1 - 2\alpha \qquad (13)$$

where $z^{(\alpha)}$ is the $100 \cdot \alpha$-th percentile point of a normal distribution $N(0, 1)$.

After learning a bootstrap replicate of the training set, each SVM is tested on the corresponding validation set. Note that, on average, each validation set contains approximately $\frac{n}{e} \approx 0.368 \cdot n$ patterns and that none of them has been used for learning; therefore they can be effectively used as indicators of the generalization ability of the SVM. The validation sets are used to compute the average accuracy

$$\bar{E} = \frac{1}{B} \sum_{b=1}^{B} E^{*b} \qquad (14)$$

where $E^{*b}$ is the fraction of misclassified patterns of $V_b$. Note that there are several alternatives to Equation (14) for computing the accuracy of the network [12]; however, our interest is not in the accuracy itself, but in its shape as a function of $\sigma^2$.
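The summary statistics of Equations (11)–(13) can be sketched as follows; `statistics.NormalDist` (Python 3.8+) supplies the normal percentile, and the sample values are invented. With $\alpha = 0.05$ this yields the 90% intervals later used for the error bars.

```python
import math
from statistics import NormalDist

def bootstrap_summary(thetas, alpha=0.05):
    """Mean (Eq. 11), standard error (Eq. 12) and first-order accurate
    (1 - 2*alpha) confidence interval (Eq. 13) of bootstrap estimates."""
    B = len(thetas)
    mean = sum(thetas) / B
    se = math.sqrt(sum((t - mean) ** 2 for t in thetas) / (B - 1))
    z = NormalDist().inv_cdf(1.0 - alpha)  # normal percentile for the interval
    return mean, se, (mean - z * se, mean + z * se)

mean, se, (lo, hi) = bootstrap_summary([0.9, 1.1, 1.0, 1.2, 0.8], alpha=0.05)
print(round(mean, 3), round(se, 3))
```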

We are now left with the problem of choosing a suitable number of replicates $B$. It is easy to show that there are $\binom{2n-1}{n}$ different bootstrap replicates; in practice, a few hundred or a thousand replicates are usually sufficient. The number of bootstrap replicates is also limited by the large amount of time required for the experiments. In fact, for each replicate, we must solve the quadratic optimization problem associated to the SVM under investigation. This optimization must then be repeated for $B$ replicates and for several values of $\sigma^2$. We set $B = 1000$, a reasonable number for estimating confidence intervals [9]. This is also confirmed by looking at the distribution of the estimated parameter $\theta$, as shown in Figure 1. After 1000 bootstrap replicates, the distribution approaches a Gaussian one, as predicted by theory, and this guarantees a good estimate of the confidence intervals (the plot refers to the dataset "Hepatitis" as detailed in the next section). This is evidently not true if we only use 100 bootstrap replicates.

To speed up the computation we used the RAIN system [2], consisting of a cluster of one dual Pentium II 300 and four Pentium Pro 200 based computers, working in parallel and acting as a single powerful machine. The cluster runs Windows NT as operating system and PVM (Parallel Virtual Machine) [10] as message passing system. Despite the peak performance of the cluster exceeding 1 GFLOPS (Giga Floating Point Operations per Second), the entire set of experiments took several days of CPU time.


Figure 1. Distribution of θ after 100 (A) and 1000 (B) bootstrap replicates.

Table I. The datasets used for the experiments.

Dataset                   N. of patterns   N. of features   Best reported accuracy (in %)
Breast Cancer Wisconsin   683              9                4.1  [4]
Hepatitis                 80               19               17   [4]
Ionosphere                351              34               3.3  [4]
Voting Records            435              16               5–10 [4]
Sonar                     208              60               14.4 [13]
Tic Tac Toe               958              9                0.9  [4]

4. Experimental Results

We used some well-known datasets from the UCI Machine Learning Repository [4] for our experiments. The main characteristics of each dataset are reported in Table I. We omitted patterns with missing features and normalized the input data in the range $[-1, 1]$. The last column of the table shows the classification performance reported in previous literature, in terms of percentage of misclassified patterns on a validation set. Obviously this figure is purely indicative, due to the large variety of methods used for learning and the different sizes of the validation sets used to assess the performance of each method; on the other hand, it gives an idea of the best results obtained on each dataset.

We varied $\sigma^2$ in the range $[10^{-4}, 5]$ with a step of 0.1 and selected the best SVM by choosing the one with minimum $\theta$. We expected that the selected SVM would also show the best average accuracy on the validation sets. All the runs were performed with $C = \infty$ to avoid any dependence on this parameter. This is equivalent to assuming that the training patterns, after being mapped into the feature space, are linearly separable. Note that this is a reasonable assumption when dealing with RBF-based SVMs, thanks to this particular mapping in the feature space [5].
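The model-selection loop of this section reduces to a grid search minimizing $\theta$. A sketch of that loop; the real $\theta(\sigma^2)$ curve comes from solving the SVM quadratic program on each bootstrap replicate, so the quadratic toy function below is an invented stand-in used only to make the sketch runnable:

```python
def select_sigma2(theta_of, sigma2_grid):
    """Pick the sigma^2 whose generalization predictor theta (Eq. 9) is minimal."""
    return min(sigma2_grid, key=theta_of)

# Invented stand-in for the bootstrap-averaged theta curve
toy_theta = lambda s2: (s2 - 1.9) ** 2 + 0.5

# The grid used in the experiments: [1e-4, ~5] with step 0.1
grid = [round(1e-4 + 0.1 * k, 4) for k in range(51)]
best = select_sigma2(toy_theta, grid)
print(best)  # grid point closest to the toy minimum at 1.9
```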


Table II. Summary of validation results (values are in %).

Dataset                   Best E             E for minimum θ
Breast Cancer Wisconsin   3.9  (σ² = 0.4)    4.0  (σ² = 0.3)
Hepatitis                 15.0 (σ² = 2.2)    15.5 (σ² = 1.9)
Ionosphere                5.8  (σ² = 1.4)    6.6  (σ² = 0.8)
Voting Records            4.6  (σ² = 4.0)    4.9  (σ² = 3.1)
Sonar                     12.6 (σ² = 4.3)    14.3 (σ² = 2.0)
Tic Tac Toe               0.51 (σ² = 2.0)    0.65 (σ² = 1.4)

The results of our experiments are summarized in Table II. For each dataset, the best average accuracy obtained on the validation sets is shown along with the corresponding value of $\sigma^2$. The accuracy obtained at the value of $\sigma^2$ suggested by the minimum of $\theta$ is reported in the last column. Figure 2 shows the average accuracy $\bar{E}$ compared with the parameter $\theta$ for each dataset in greater detail.

Two clear observations emerge from these results. First, the parameter $\theta$ as a function of $\sigma^2$ has a shape similar to that of the actual average generalization error. For large values of $\sigma^2$ the two curves diverge but, at the same time, so does the confidence interval on $\theta$. The second observation is that the minimum value of $\theta$ is, with good accuracy, close to the best performance of the SVM. Even if this is not true for all the datasets, in any case the optimal value of $\sigma^2$ suggested by $\theta$ identifies a SVM with good generalization properties.

These results appear to confirm the initial assumption that $\theta$ is a good indicator of the generalization ability of a SVM. Furthermore, it is interesting to note that the results obtained with the SVMs on these datasets are comparable with the state of the art for these classification tasks, which was obtained with a large variety of classification techniques (see Table I).

As a final remark, we would like to comment on the generality of the obtained results. As pointed out in [8], there are four different sources of variation in statistical validation tests: (a) randomness in the selection of test data, (b) randomness in the selection of training data, (c) internal randomness of the training algorithm and (d) mislabeling of the data set.

Source of variation (a) is of particular concern when the test set is a small fraction of the entire data set. In this case, the variability due to randomly selecting the test set could be problematic; however, this is implicitly avoided by our choice of the bootstrap method. In fact, as mentioned in the previous section, each bootstrap replicate leaves out a test set of approximately one third of the entire data set.

Source (b) could influence the estimate of $\theta$; in fact, the training set is different in every bootstrap replicate. Note, however, that the distribution of $\theta$, as shown in Figure 1, is almost Gaussian. This is a clear example of a "well-behaved" bootstrap estimate [9].


Figure 2. Average accuracy $\bar{E}$ vs. generalization predictor $\theta$ as a function of $\sigma^2$ for different datasets. Error bars are 90% confidence intervals. A) Breast Cancer Wisconsin, B) Hepatitis, C) Ionosphere, D) Sonar, E) Tic Tac Toe, F) Voting Records.

The third source of variation (c) can affect learning algorithms that depend on a random starting point (e.g. backpropagation networks). However, the learning algorithm of a SVM is independent of its initialization, because the solution of the associated optimization problem gives the global minimum of the error function regardless of the starting point.

The mislabeling of the data set is a source of errors for any learning algorithm; nevertheless, SVMs have an intrinsic mechanism for dealing with misclassified data points [6].


For these reasons, we argue that, even though the results shown in this study may not be exhaustive, they are of general validity.

5. Conclusions

We have shown that (1) $\theta$ can be used for selecting SVMs with good generalization properties and (2) SVMs are, in general, a simple but very effective method for many classification problems.

References

1. Anguita, D., Boni, A. and Ridella, S.: Support Vector Machines: a comparison of some kernel functions, Proc. of the 3rd Int. Symp. on Soft Computing, Genova, Italy, 1999.
2. Anguita, D., Boni, A., Chirico, M., Giudici, F., Scapolla, A. M. and Parodi, G.: High performance neurocomputing: Industrial and medical applications of the RAIN system, In: P. Sloot, M. Bubak, B. Hertzberger (eds.), Proc. of HPCN Europe 1998, Amsterdam, The Netherlands, pp. 34–43, Lecture Notes in Computer Science 1401, Springer-Verlag, Berlin, Germany, 1998.
3. Bartlett, P. and Shawe-Taylor, J.: Generalization performance of Support Vector Machines and other pattern classifiers, In: Schölkopf, B., Burges, C., Smola, A. (eds.), Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA, 1999.
4. Blake, C., Keogh, E. and Merz, C. J.: UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA, University of California, Department of Information and Computer Science, 1998.
5. Burges, C. J. C.: A tutorial on Support Vector Machines for pattern recognition, Data Mining and Knowledge Discovery 2(2) (1998), 1–47.
6. Cortes, C. and Vapnik, V. N.: Support Vector Networks, Machine Learning 20 (1995), 1–25.
7. Cristianini, N., Campbell, C. and Shawe-Taylor, J.: Dynamically adapting kernels in Support Vector Machines, NeuroCOLT2 Technical Report Series, NC2-TR-1998-017, Royal Holloway College, University of London, UK, 1998.
8. Dietterich, T. G.: Comparing supervised classification learning algorithms, Neural Computation 10 (1998), 1895–1923.
9. Efron, B. and Tibshirani, R. J.: An Introduction to the Bootstrap, Chapman and Hall, New York, USA, 1993.
10. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Mancheck, R. and Sunderam, V.: PVM: Parallel Virtual Machine, MIT Press, Cambridge, MA, 1994.
11. Hearst, M. A.: Support Vector Machines, IEEE Intelligent Systems 13(4) (1998), 18–28.
12. Jain, A. K., Dubes, R. C. and Chen, C. C.: Bootstrap techniques for error estimation, IEEE Trans. on PAMI 9(5) (1987), 628–633.
13. Torres Moreno, J. M. and Gordon, M. B.: Characterization of the sonar signals benchmark, Neural Processing Letters 7 (1998), 1–4.
14. Vapnik, V. N.: The Nature of Statistical Learning Theory, Springer, New York, USA, 1995.
15. Vapnik, V. N.: Statistical Learning Theory, John Wiley and Sons, New York, USA, 1998.