
THE PITFALLS OF OVERFITTING IN OPTIMIZATION OF A MANUFACTURING QUALITY CONTROL PROCEDURE

Tea Tušar
DOLPHIN Team, INRIA Lille – Nord Europe, France

Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

[email protected]

Klemen Gantar
Faculty of Computer and Information Science, University of Ljubljana, Slovenia

[email protected]

Bogdan Filipič
Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

[email protected]

Abstract We are concerned with the estimation of copper-graphite joints quality in commutator manufacturing, a classification problem in which we wish to detect whether the joints are soldered well or have any of the four known defects. This quality control procedure can be automated by means of an on-line classifier that can assess the quality of commutators as they are being manufactured. A classifier suitable for this task can be constructed by combining computer vision, machine learning and evolutionary optimization techniques. While previous work has shown the validity of this approach, this paper demonstrates that the search for an accurate classifier can lead to overfitting despite cross-validation being used for assessing the classifier performance. We inspect several aspects of this phenomenon and propose to use repeated cross-validation in order to amend it.

Keywords: computer vision, differential evolution, machine learning, parameter tuning, manufacturing, quality control.



BIOINSPIRED OPTIMIZATION METHODS AND THEIR APPLICATIONS

1. Introduction

In the automotive industry, only one part per million of supplied products is allowed to be defective, which imposes strict requirements on the involved manufacturing processes as well as their quality control procedures. We are interested in the manufacturing of graphite commutators (i.e., components of electric motors used, for example, in automotive fuel pumps) produced at an industrial production plant. More specifically, we wish to automatically assess the quality of copper-graphite joints in commutators after the soldering phase of this manufacturing process, which is one of the most critical phases of commutator production.

At present, the soldering quality control at the plant is done manually. Automated on-line quality control would bring several advantages over manual inspection. For example, it can promptly detect irregularities, making error resolution faster and consequently saving a considerable amount of resources. Moreover, it does not slow down the production line and is cheaper than manual inspection. Finally, it does not suffer from fatigue and other human factors that can result in errors. This is why we aim for an automated on-line quality control procedure capable of determining whether the joints are soldered well or have any of the four known defects.

Such automation can be implemented on the production line with a classifier previously constructed on a database of commutator segment images with known defects (or absence of defects). Three previous studies [3, 4, 5] have already tackled this problem, and in all cases 10-fold cross-validation (CV) was used as a measure of classifier accuracy. This work questions 10-fold CV as the measure of choice for such tasks and proposes actions to deal with the inevitable overfitting issue.

The rest of the paper is structured as follows. Section 2 presents details of the problem in question, summarizes previous work and outlines the design of the quality control procedure used in this study. Section 3 is devoted to cross-validation and the overfitting issue. The performed experiments and their results are discussed in Section 4. Finally, Section 5 summarizes the paper and gives ideas for future work.

2. Background

2.1 Soldering in Commutator Manufacturing

The soldering phase in the commutator manufacturing process consists of soldering the metalized graphite to the commutator copper base. The quality of the resulting copper-graphite joints is crucial since the reliability of end user applications directly depends on the strength of these joints.

Figure 1: Images of: (a) a graphite commutator, (b) a commutator segment, (c) a ROI for metalization defect, (d) a ROI for excess of solder, (e) a ROI for deficit of solder, and (f) a ROI for disorientation.

After the soldering phase, the commutators are manually inspected for the presence of any defects. Known defects comprise metalization defect (presence of visible defects on the metalization layer), excess of solder (presence of solder spots on the copper pad), deficit of solder (lack of solder in the graphite-copper joint) and disorientation (disorientation between the copper body and the graphite disc). Commutators are made up of a number of segments, depending on the model (the considered commutator model from Fig. 1 (a) consists of eight segments). If a single segment has any of the listed defects, the whole commutator is labeled as defective and removed from the production process.

Various defects occur in different regions of the commutator segment. For example, the region where excess of solder is usually detected is different from the region where disorientation can be observed. Therefore, images of commutator segments can be divided into four regions of interest (ROIs), one for each defect (see Fig. 1).

Because five different outcomes are possible (rare cases where two ormore defects appear on a single commutator segment are labeled withjust one defect and are not differentiated further), we treat this as aclassification problem with five classes. While the manufacturers areindeed interested in keeping statistics of the detected defects, their mainconcern is that no false positives are found. This means that cases when

Page 4: THE PITFALLS OF OVERFITTING IN OPTIMIZATION OF A … · 2016. 11. 29. · Figure 1: Images of: (a) a graphite commutator, (b) a commutator segment, (c) a ROI for metalization defect,

244 BIOINSPIRED OPTIMIZATION METHODS AND THEIR APPLICATIONS

a defective commutator is labeled as without defects are to be avoidedas much as possible. This is, of course, very hard to achieve.

2.2 Previous Work

Three previous studies investigated different aspects of this challenging real-world problem. The initial experiment [4] explored whether computer vision, machine learning and evolutionary optimization techniques could be employed to find small and accurate classifiers for this problem. First, images of the copper-graphite joints were captured by a camera. Next, a fixed set of features was extracted from these images using digital image processing methods. These data were then used to train decision trees capable of predicting whether a commutator segment has any of the four defects or is soldered well. The DEMO (Differential Evolution for Multiobjective Optimization) algorithm [12] was applied to search for small and accurate trees by navigating through the space of parameter values of the decision tree learning program. The study found this setup to be beneficial, but urged future research to focus on more sophisticated extraction of features from the images, as this seemed to hinder the search for more accurate classifiers.

The second study [5] presented a different setup for the automated quality control procedure to address the issues from the first study. Instead of optimizing decision tree parameter values, differential evolution (DE) [13] was used to search for the best settings of image processing parameters such as filter thresholds. Tuning these parameters can be a tedious task, prone to bias from the engineers who usually do it by trial-and-error experimentation. Moreover, the choice of the right features is crucial for obtaining a good classifier.

The single classification problem with five classes was split into four binary classification subproblems, where each subproblem was dedicated to detecting one of the four defects and used data only from the corresponding ROI. In addition, instead of classification accuracy, the measure to be optimized was set to a function penalizing the portion of false negatives 100 times harder than the portion of false positives. The study found that the new combination of computer vision, machine learning and evolutionary optimization techniques was powerful and achieved some good results. While optimization with DE always found better parameter settings for the image processing methods than those defined by domain experts, some subproblems proved to be harder than others. For example, detection of commutator segments with excess of solder achieved a satisfactory accuracy, while the detection of metalization defects did not.


The third study [3] investigated the correctness of the implicit assumption from [5] that only features of the subproblem-specific ROI would influence the outcome of the classifier for that subproblem. The study found that features from other ROIs can be important as well, suggesting that it might be better not to split the classification problem into subproblems at all.

While being otherwise rather different, all three mentioned studies used 10-fold CV to estimate the performance of the employed classifiers. In this paper we wish to test whether such evaluation of classifiers is appropriate when performing optimization based on this measure.

2.3 Design of the Automated Quality Control Procedure

The automated quality control procedure considered in this paper is very similar to the one presented in [5]. Again, computer vision, machine learning and evolutionary optimization methods are combined in the search for the best settings of image processing parameters. In short, the procedure design consists of the following steps:

1 Determine a set of image features.

2 Use an evolutionary algorithm to search for the values of image processing parameters that result in the highest fitness. Evaluate each solution using these steps:

(a) Based on the chosen parameter values, use the image processing methods to convert each image of a commutator segment into a vector of feature values.

(b) Construct a classifier (in our case a decision tree) where the vectors of feature values serve as learning instances. Estimate the classifier accuracy and use this value as the solution fitness.

3 Choose the best found classifier and the corresponding image processing parameter values to detect defects in images of new commutator segments as they are being manufactured.
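Step 2 above can be sketched as a standard DE loop over the 12 image processing parameters. This is a minimal illustration, not the authors' implementation: the bounds follow Section 2.3.1, while `fitness` is a dummy stand-in for the image processing and decision tree construction of steps 2(a)-(b).

```python
import random

# Parameter bounds: for each of the four ROIs we tune
# (median filter size, binary threshold, particle size threshold).
BOUNDS = [(1, 5), (1, 256), (1, 1000)] * 4  # 12 parameters in total

def fitness(params):
    # Stand-in for steps 2(a)-(b): the real procedure would run the image
    # processing methods with `params`, build a decision tree and return
    # its cross-validated accuracy. Dummy function to keep the sketch runnable.
    return -sum((p - lo) / (hi - lo) for p, (lo, hi) in zip(params, BOUNDS))

def optimize(pop_size=20, generations=50, F=0.5, CR=0.9, seed=1):
    """DE/rand/1/bin maximizing `fitness` over the 12-dimensional space."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    fit = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutate three distinct other members, then binomial crossover.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(len(BOUNDS))
            trial = []
            for j, (lo, hi) in enumerate(BOUNDS):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                else:
                    v = pop[i][j]
                trial.append(min(max(v, lo), hi))  # clip to the bounds
            tf = fitness(trial)
            if tf >= fit[i]:                       # greedy selection
                pop[i], fit[i] = trial, tf
    best = max(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]
```

In the real procedure the parameters are integers and would be rounded before evaluation; the self-adaptive jDE variant used in Section 4.1 additionally evolves F and CR per individual.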

Let us now describe the steps of processing commutator segment images, building decision trees and optimizing classifier performance in more detail.

2.3.1 Processing commutator segment images. Processing of images is the most time-consuming task of our procedure and is done in several steps. First, the image of a commutator segment needs to be properly aligned. Next, the four ROIs shown in Fig. 1 need to be detected. This is done by applying four predefined binary masks to the image, one for each ROI. Each of the ROIs is further processed as follows. Depending on the ROI, the image in RGB format is converted into a gray-scale image by extracting a single color plane. Based on expert knowledge, the red plane is used for all ROIs except the ROI for excess of solder, which uses the blue color plane.

The final three steps require certain parameters to be set. A 2D median filter of size 1×1, 3×3 or 5×5 is applied to reduce noise. Next, a binary threshold that can take values from {1, 2, . . . , 256} is used to eliminate irrelevant pixels. Finally, an additional particle filter is employed to remove all particles (connected pixels with similar properties) with a smaller number of pixels than a threshold value from {1, 2, . . . , 1000}. Note that, because of the diversity of the defects, it is reasonable to assume that these three image processing parameters should be set independently for each ROI. This means that in total 12 image processing parameters need to be set.
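The three parameterized steps can be illustrated in pure Python (the actual procedure uses the OpenCV-based implementation described in Section 4.1); the function names and the choice of 4-connectivity for particles are our assumptions.

```python
def median_filter(img, k):
    """k x k median filter (k odd) on a 2D list of intensities;
    border pixels are kept as-is in this sketch."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [row[:] for row in img]
    for y in range(r, h - r):
        for x in range(r, w - r):
            win = sorted(img[yy][xx]
                         for yy in range(y - r, y + r + 1)
                         for xx in range(x - r, x + r + 1))
            out[y][x] = win[len(win) // 2]
    return out

def binary_threshold(img, t):
    """Keep pixels with intensity >= t, producing a binary mask."""
    return [[1 if v >= t else 0 for v in row] for row in img]

def particle_filter(mask, min_pixels):
    """Remove 4-connected particles smaller than `min_pixels`."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                # Flood-fill one particle and collect its pixels.
                stack, comp = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(comp) >= min_pixels:  # keep only large enough particles
                    for cy, cx in comp:
                        out[cy][cx] = 1
    return out
```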

After these image processing steps, the chosen set of features is extracted from the image of each ROI. We use the same set of features as in [5, 3]:

– number of particles,

– cumulative size of particles in pixels,

– maximal size of particles in pixels,

– minimal size of particles in pixels,

– gross/net ratio of the largest particle,

– gross/net ratio of the cumulative size of particles.

To summarize, computer vision methods are used to convert each commutator segment image into a vector of 24 feature values.
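The six per-ROI features could be computed from the surviving particles roughly as follows. Note that the text does not define the gross/net ratio precisely; bounding-box area over pixel count is our assumption here.

```python
def roi_features(particles):
    """Compute the six per-ROI features; `particles` is a list of particles,
    each given as a list of (y, x) pixel coordinates."""
    if not particles:
        return [0, 0, 0, 0, 0.0, 0.0]

    def gross(p):  # assumed definition: bounding-box area of the particle
        ys = [y for y, _ in p]
        xs = [x for _, x in p]
        return (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)

    sizes = [len(p) for p in particles]          # net size = pixel count
    largest = max(particles, key=len)
    return [len(particles),                      # number of particles
            sum(sizes),                          # cumulative size in pixels
            max(sizes),                          # maximal particle size
            min(sizes),                          # minimal particle size
            gross(largest) / len(largest),       # gross/net of largest particle
            sum(gross(p) for p in particles) / sum(sizes)]  # cumulative gross/net
```

Concatenating these six values for all four ROIs yields the 24-dimensional feature vector mentioned above.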

2.3.2 Building decision trees. Commutator segment images with known classes are used to construct a database of instances, upon which a machine learning classifier can be built. We chose decision trees since they are easy to understand and to implement in the on-line quality control procedure. In accordance with the guidelines from [3], we do not split the machine learning problem into subproblems, but use all instances and all ROIs to build a single classifier with five classes: no defect, metalization defect, excess of solder, deficit of solder and disorientation.

Note that the classifier predicts defects on commutator segments. For the final application, predictions for all segments of a commutator need to be aggregated in order to produce a prediction for the commutator as a whole. While this might be straightforward to do, it is not the focus of this paper. We first wish to find good classifiers on the segment level before dealing with any meta-classifiers.

2.3.3 Optimizing classifier performance. Classifier performance can be measured in several ways, ranging from classification accuracy to the F-measure to other, even custom functions that depend on the domain (as was done for the two-class case in [5]). While we acknowledge that a similar custom function would also be beneficial for our five-class problem, where false 'no defect' classifications bear more serious consequences than other types of misclassifications, classification accuracy is chosen for now, since it is easier to interpret. Classification accuracy is estimated with 10-fold CV, a popular technique for predicting classifier performance on unseen instances that was also used in the three previous studies [4, 5, 3].

In order to find the values of image processing parameters that will result in a classifier with high accuracy, an evolutionary algorithm is employed to search the 12-dimensional space of image processing parameter values.

3. The Pitfalls of Overfitting

When building a classifier, some of the data is used for training the classifier, while the rest is used for testing its performance. Ideally, we would like both sets to be fairly large, since a lot of data is needed to train a classifier well, and a lot of data is needed to truthfully predict how it will perform on unseen instances. However, in reality, the data is often scarce and certain compromises need to be made.

One of the most popular approaches to estimating classifier performance is k-fold cross-validation, where the data is split into k sets of approximately equal cardinality. Next, k − 1 of the sets are used for training the classifier, while the remaining set is used for testing its performance. This is repeated k times so that each set is utilized for testing exactly once. The average of all performance results is then used to estimate the accuracy of the classifier built on the entire data.
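The procedure can be sketched as follows; the hypothetical `train_fn` stands for any learner that returns a single-instance `predict` callable.

```python
import random

def kfold_accuracy(X, y, train_fn, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.
    `train_fn(X, y)` must return a callable predicting one instance's label."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)           # one fixed random split
    folds = [idx[i::k] for i in range(k)]      # k sets of ~equal cardinality
    accs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        accs.append(sum(model(X[i]) == y[i] for i in test) / len(test))
    return sum(accs) / k                       # average over the k folds
```

The estimate depends on the random split determined by `seed`, which is exactly the degree of freedom that an optimizer can exploit, as the experiments in Section 4 show.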

This and other cross-validation techniques (see [1] for a survey) were envisioned in order to avoid overfitting, i.e., constructing classifiers that describe noise in the data instead of the underlying relationships, since a classifier that overfits the training data performs poorly on unseen instances. This happens, for example, if the classifier is too complex. However, it has long been known [6] that there exists another source of overfitting that takes place despite cross-validation: if we compare a high number of classifiers on a small set of instances, the best ones are usually those that overfit these instances.

This means, for example, that if an optimization algorithm that can produce thousands or even millions of solutions is used to find the best classifier for a problem, this best found classifier almost surely overfits the test data. Note, however, that our study does not compare multiple classifiers on the same data. Our optimization problem more closely resembles that of feature selection, where a subset of features needs to be found so that classifiers using these features will achieve good performance. The main difference to standard feature selection is that we are performing feature selection in subgroups: for each of the 12 image processing parameters (one subgroup) we need to select exactly one among all possible values. The danger of overfitting despite using cross-validation has been noticed for feature selection problems as well [11].

In the following we present the results achieved with 10-fold CV for estimating classifier performance on our optimization problem, which show overfitting patterns, and analyze increased pruning and repeated cross-validation as possible remedies for this issue.

4. Experimental Study

4.1 Experimental Setup

The experiments were performed on the commutator soldering domain from the previous studies, which contains 363 instances with an uneven distribution of classes (see Table 1).

Table 1: The commutator soldering domain.

    Class                  Number of instances   Frequency [%]
    No defect                             212            58.4
    Metalization defect                    35             9.6
    Excess of solder                       35             9.6
    Deficit of solder                      49            13.5
    Disorientation                         32             8.8
    Total                                 363           100.0

All computer vision methods were implemented using the Open Computing Language (OpenCL) [7], or more precisely, the OCL programming package [8], an implementation of OpenCL functions in the Open Computer Vision (OpenCV) library [9].


The decision trees were built using the J48 algorithm from the Weka machine learning environment [16], which is a Java implementation of Quinlan's C4.5 decision tree building algorithm [10]. The trees were constructed with default J48 parameter values, except for the increased pruning case (more details are given in Section 4.3).

For optimization we used a self-adaptive DE algorithm called jDE [2] with a population of 80 solutions. The stopping criterion for the algorithm was set to 1000 generations. For each set of experiments, nine runs were performed and all presented results show the average values over the nine runs.

4.2 Results of Single Cross-Validation

First, we look at what happens when a single 10-fold CV is used to estimate classifier accuracy (see the top plot in Fig. 2). The black line shows that jDE is able to find increasingly more accurate classifiers as the evolution progresses. In order to check whether these classifiers show signs of overfitting, we perform the following additional assessment. For each run and each best classifier from the population, we re-estimate the classifier using 10-fold CV ten more times. The span of these accuracies, averaged over the nine runs, is presented in red.

The increasing gap between the black line and the red area means that classifiers that are good on the default split of instances into cross-validation sets perform considerably worse when they are tested again on ten different cross-validation splits, i.e., the classifiers overfit the default cross-validation split. This happens because we are exploring a large number of classifiers and incidentally optimize them also with regard to the default cross-validation split.

Note that this kind of overfitting is different from the 'usual' one, where the classifier overfits the given instances. While we are probably experiencing both, we cannot know about the second one without testing the classifiers on a large number of unseen instances, which we unfortunately do not possess. We have experimented with reserving a small part of the data for validation purposes, as was done in [14], but found that this approach is not suitable for our case because of the small number of instances at our disposal. Since we have five classes with an uneven distribution of instances, it proved very difficult to find representative instances for validation. Without a representative validation set, the resulting estimation of overfitting can be too biased to rely on.


[Figure 2 consists of three plots of classification accuracy (y-axis, 0.84 to 1.00) versus generations (x-axis, 0 to 1000), titled "Results using single cross-validation", "Results using increased pruning" and "Results using repeated cross-validation".]

Figure 2: Experimental results of single cross-validation (top plot), increased pruning (middle plot) and repeated cross-validation (bottom plot). The black line shows the best classification accuracy found by jDE, while the red and yellow areas denote the span of accuracy values when the current-best classifier is re-estimated using additional cross-validation. All plots show average values over nine runs.


4.3 Results of Increased Pruning

Next, we investigate whether increased pruning of decision trees can help improve their generalization ability in our case. Note that the original decision trees (the J48 trees with default parameter settings) used in the previous experiments were also pruned. Here, we intensify the pruning by increasing the m parameter of the J48 algorithm, which defines the minimal number of instances in any tree leaf, from 2 to 5. The results of these experiments are presented in the middle plot of Fig. 2.

While the gap between the single cross-validation and the re-estimation using repeated cross-validation is smaller than in the previous experiments, the overfitting is still obvious. We can conclude that increased pruning does little to alleviate the overfitting brought about by optimization.

4.4 Results of Repeated Cross-Validation

Finally, we explore the case when the fitness of the decision trees built with default parameter values is determined as the average of 10 different assessments by 10-fold CV. Again, we perform an additional assessment of the classifiers. This time we add 20 new estimations to the repeated cross-validation (for a total of 30) to see how they compare. The bottom plot in Fig. 2 shows there are no big differences when additional cross-validation results are added, indicating that repeated cross-validation is less prone to overfitting brought about by optimization than single cross-validation. The average accuracies of the current-best classifiers over 10 and 30 repetitions are very similar, which suggests that 10 repetitions can be chosen over 30 as they require less time.
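Repeated cross-validation as used here simply averages several independent k-fold estimates, each on its own random split. A minimal sketch (the hypothetical `train_fn` again returns a single-instance `predict` callable):

```python
import random

def cv_accuracy(X, y, train_fn, k=10, seed=0):
    # One k-fold CV estimate on the split determined by `seed`.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        accs.append(sum(model(X[i]) == y[i] for i in test) / len(test))
    return sum(accs) / k

def repeated_cv_accuracy(X, y, train_fn, repeats=10, k=10):
    """Average of `repeats` k-fold CV estimates, each on a different random
    split, so the optimizer cannot overfit any single split."""
    return sum(cv_accuracy(X, y, train_fn, k=k, seed=r)
               for r in range(repeats)) / repeats
```

Using `repeated_cv_accuracy` in place of a single `cv_accuracy` call as the DE fitness multiplies the evaluation cost by `repeats`, which is why keeping the number of repetitions at 10 rather than 30 matters in practice.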

These results seem to contradict those presented in [15]; however, this is not the case. In a series of experiments, [15] compares the estimates of classification accuracy from single 10-fold CV, 10-fold CV repeated 10 times and 10-fold CV repeated 30 times to a simulated 'true' performance of the classifier on unseen data. The results show that although the confidence interval narrows when increasing the number of cross-validation repetitions, this does not necessarily mean that the accuracy estimate will converge to the 'true' accuracy. The authors argue that the reason for this behavior is that the same data is continuously being resampled in repeated cross-validations. The experiments in [15] tackle the 'usual' overfitting problem, which is not the subject of this paper. We are concerned with the overfitting brought about by optimization and find that repeated cross-validation can alleviate it.


5. Conclusion

We have presented a challenging real-world problem of estimating the quality of the commutator soldering process. We wish to find a classifier able to distinguish between joints soldered well and those that have one of the four possible defects. The problem is tackled using a combination of computer vision, machine learning and evolutionary optimization methods. In essence, we are searching for parameter settings of computer vision methods that yield a highly accurate classifier.

Since an optimization algorithm that explores a large number of solutions to this problem is being used, we have been confronted with the problem of overfitting. We performed experiments that have shown how overfitting can be detected and discussed possible ways to amend it. From the results we conclude that repeated cross-validation can be used to diminish the overfitting bias brought about by optimization. However, these results cannot be generalized to other machine learning problems without additional experiments that include a number of other datasets. This is a task left for future work.

The presented real-world problem is not yet solved and we see many directions for future work. First, since the accuracies achieved are still not good enough for automotive industry standards, our main focus will be to improve on that (while taking care not to produce even more overfitting). This can be attempted, for example, by choosing other image features in addition to the six we use right now, or by trying more sophisticated classifiers than decision trees. We also intend to consider other measures of classifier performance besides accuracy. For example, we could use a specialized aggregation function or a multiobjective approach. Finally, we will eventually have to combine the classifications of individual commutator segments into a single classification of the commutator as a whole.

Acknowledgment: This work was partially funded by the ARTEMIS Joint Undertaking and the Slovenian Ministry of Economic Development and Technology as part of the COPCAMS project (http://copcams.eu) under Grant Agreement no. 332913, and by the Slovenian Research Agency under research program P2-0209.

The authors wish to thank Valentin Koblar for valuable support regarding the application domain and computer vision issues, and Bernard Ženko for helpful discussions on machine learning algorithms.


References

[1] S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

[2] J. Brest, S. Greiner, B. Bošković, M. Mernik, and V. Žumer. Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. IEEE Transactions on Evolutionary Computation, 10(6):646–657, 2006.

[3] E. Dovgan, K. Gantar, V. Koblar, and B. Filipič. Detection of irregularities on automotive semiproducts. Proceedings of the 17th International Multiconference Information Society (IS), pages 22–25, 2014.

[4] V. Koblar and B. Filipič. Designing a quality-control procedure for commutator manufacturing. Proceedings of the 16th International Multiconference Information Society (IS), pages 55–58, 2013.

[5] V. Koblar, E. Dovgan, and B. Filipič. Tuning of a machine-vision-based quality control procedure for product components in automotive industry. Submitted for publication, 2014.

[6] A. Y. Ng. Preventing "overfitting" of cross-validation data. Proceedings of the Fourteenth International Conference on Machine Learning (ICML), pages 245–253, 1997.

[7] OpenCL: The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/. Retrieved January 25, 2016.

[8] OpenCL module within the OpenCV library. http://docs.opencv.org/modules/ocl/doc/introduction.html. Retrieved January 25, 2016.

[9] OpenCV: Open source computer vision. http://opencv.org/. Retrieved January 25, 2016.

[10] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[11] R. B. Rao, G. Fung, and R. Rosales. On the dangers of cross-validation. An experimental evaluation. Proceedings of the SIAM Conference on Data Mining (SDM), pages 588–596, 2008.

[12] T. Robič and B. Filipič. DEMO: Differential evolution for multiobjective optimization. Lecture Notes in Computer Science, 3410:520–533, 2005.

[13] R. Storn and K. V. Price. Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.

[14] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2013.

[15] G. Vanwinckelen and H. Blockeel. On estimating model accuracy with repeated cross-validation. Proceedings of the 21st Belgian-Dutch Conference on Machine Learning (BeneLearn), pages 39–44, 2013.

[16] Weka Machine Learning Project. http://www.cs.waikato.ac.nz/ml/weka/index.html. Retrieved January 25, 2016.