
Automatic Ground-Truth Validation With Genetic Algorithms for Multispectral Image Classification

Noureddine Ghoggali, Student Member, IEEE, and Farid Melgani, Senior Member, IEEE

Abstract—In this paper, we propose a novel method that aims at assisting the ground-truth expert through an automatic detection of potentially mislabeled learning samples. This method is based on viewing the mislabeled sample detection issue as an optimization problem in which the best subset of learning samples in terms of statistical separability between classes is sought. This problem is formulated within a genetic optimization framework, where each chromosome represents a candidate solution for validating/invalidating the learning samples collected by the ground-truth expert. The genetic optimization process is guided by the joint optimization of two different criteria, which are the maximization of a between-class statistical distance and the minimization of the number of invalidated samples. Experiments conducted on both simulated and real data sets show that the proposed ground-truth validation method succeeds in the following: 1) detecting the mislabeled samples with a high accuracy, even when up to 30% of the learning samples are mislabeled, and 2) strongly limiting the negative impact of the mislabeling issue on the accuracy of the classification process.

Index Terms—Genetic algorithms (GAs), ground-truth validation, Jeffries–Matusita (JM) distance measure, mislabeling issue, multiobjective optimization.

I. INTRODUCTION

THE TYPICAL goal of an inductive learning algorithm is to build discriminant functions from part of the available ground-truth samples (training set) so that the generalization capability of the resulting classifier on previously unseen samples is as high as possible. The quantification of the generalization capability is usually performed on another part of the ground-truth samples, termed the test set. Most of the works on automatic classification have focused efforts on improving the accuracy (generalization capability) of the classification process by acting mainly on the following three levels: 1) data representation; 2) discriminant function model; and 3) criterion on the basis of which the discriminant functions are optimized [1]. These works are, however, based on an essential assumption, namely, that the ground-truth samples are of unquestionable quality. In this paper, we examine this assumption and show that the accuracy of a classification process (whatever the kind of classifier used) critically depends on the quality of the adopted ground-truth.

Manuscript received June 7, 2008; revised October 18, 2008 and January 2, 2009. First published March 27, 2009; current version published June 19, 2009.

The authors are with the Department of Information Engineering and Computer Science, University of Trento, 38050 Trento, Italy (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2009.2013693

The two well-known ground-truth collection approaches are as follows: 1) the in situ observation approach and 2) the photointerpretation approach [2]. Each of them has its own advantages and drawbacks, but both are subject to errors in the labeling process. In the first approach, these may occur because of georeferencing problems, while in the second one, spectral mismatching errors by the human analyst are the main source of problems. Since the presence of mislabeling problems (noise) in a learning (training and test) set has a direct negative impact on the classification process, the development of automatic techniques for validating the collected learning samples is, in our opinion, crucial.

To the best of our knowledge, very scarce attention has been paid to this issue in the literature, and it is mainly faced through two different strategies. The first one, which accepts the presence of noise (mislabeling problems) in the data, consists in designing a sophisticated classifier that is less likely to be influenced by this presence [3]. The second strategy is based on the removal of "suspect" samples from the learning set. An early work derived from this strategy for k-nearest neighbor (kNN) classification suggested first applying a 3NN classification over the whole learning set and then removing the misclassified samples in order to produce a new learning set on the basis of which a 1NN classifier is formed for the classification phase [4]. In [5], in order to avoid overfitting on noisy samples, the author proposed to perform the removal (filtering) process through the C4.5 decision tree classifier. In [6], the suspect samples are identified and removed from the learning set by means of an ensemble of three classifiers (i.e., C4.5, kNN, and linear classifiers). In particular, a sample is expected to be mislabeled if it is misclassified by the ensemble of classifiers.

In this paper, we propose an alternative method that aims at interacting with the ground-truth expert by providing him/her with binary information of the kind "validated"/"invalidated" for each learning sample. For each invalidated sample, the expert may confirm the invalidation or not and thus correct or maintain the adopted labeling before creating the final learning set that will be exploited in the classification process. Our ground-truth validation method is based on viewing the mislabeled sample detection issue as an optimization problem in which the best subset of learning samples in terms of statistical separability between classes is sought. This problem is formulated within a genetic optimization framework for its capability to solve complex pattern recognition issues [7], [8]. In particular, each chromosome is configured as a binary string, which represents a candidate solution for validating/invalidating the available learning samples. The genetic optimization process is guided by the joint optimization of two different criteria: the maximization of a between-class statistical distance and the minimization of the number of invalidated samples.



Fig. 1. Sketch illustrating the proposed ground-truth validation process.

The former is expressed in terms of the Jeffries–Matusita (JM) distance measure [1], [2]. The latter allows one to obtain at convergence a Pareto front from which the ground-truth expert can select the best solution according to his/her prior confidence in the reliability of the collected ground-truth.

Experiments were conducted on both simulated data sets and real remote sensing images. The obtained results reveal that the proposed automatic validation method succeeds in detecting the mislabeled samples with a high accuracy, even when up to 30% of the learning samples are mislabeled. Moreover, we show how the removal of the detected mislabeled samples positively impacts the accuracy of different classifiers, namely, the support vector machine (SVM), the kNN, and the radial basis function (RBF) neural network [1], [9]–[15]. This paper complements and integrates partial results presented in [16].

The remaining part of this paper is organized as follows. In Section II, we recall the basic idea of the multiobjective nondominated sorting genetic algorithm (NSGA-II) and describe the proposed automatic ground-truth validation method. Experimental results obtained on simulated and real data sets are reported in Sections III and IV, respectively. Finally, conclusions are drawn in Section V.

II. PROPOSED METHOD

A. Problem Formulation

Let us consider a learning set L composed of n samples labeled by the ground-truth expert such that L = {(x_i, y_i), i = 1, 2, . . . , n}, where each x_i ∈ ℝ^d represents a vector of d remote observations and/or processed features and y_i ∈ Ω = {ω_1 = 1, ω_2 = 2, . . . , ω_T = T} is the corresponding class label. Our objective is to detect in an automatic way which of these n learning samples are potentially mislabeled and to provide the ground-truth expert with binary information of the kind "validated"/"invalidated" for each learning sample. Note that we do not aim at correcting the labels of mislabeled samples. The label correction work shall be carried out by the ground-truth expert (Fig. 1).

A naive approach to this problem would consist in trying all possible combinations of validated/invalidated learning samples and then choosing the best one according to some predefined criterion. This appears, however, computationally prohibitive, and thus an impractical solution, even for small values of n, since the total number of possible combinations is equal to 2^n. Therefore, the only solution at hand is to adopt a numerical optimizer to look for the hopefully best solution in the binary solution space. In this paper, we propose to carry out this task by means of a multiobjective genetic optimization method. In the following sections, we first recall the basics of genetic algorithms (GAs). Then, after describing its two main components (i.e., the chromosome and the fitness function), we explain the different phases of the proposed genetic solution.

B. General Concepts on GAs

GAs are general-purpose randomized optimization techniques which exploit principles inspired from biological systems [17], [18]. A genetic optimization algorithm performs a search by evolving a population of candidate solutions (individuals) modeled with "chromosomes." From one generation to the next, the population is improved by mechanisms derived from genetics, i.e., through the use of both deterministic and nondeterministic genetic operators. The most common form of GAs involves the following steps. First, an initial population of chromosomes is randomly generated. Then, the goodness of each chromosome is evaluated according to a predefined fitness function representing the considered objective function. This fitness evaluation step allows one to keep the best chromosomes and reject the worst ones by using an appropriate selection rule.


Fig. 2. Illustration of the chromosome structure and its effect on the learning sample distribution.

This rule is based on the principle that the better the fitness, the higher the chance of being selected. Once the selection process is completed, the next step is devoted to reproducing the population. This is done by genetic operators such as the crossover and mutation operators. The entire process is iterated until a user-defined convergence criterion is reached.

Several multiobjective GA-based approaches have been proposed in the literature [19]. In this paper, we adopt the NSGA-II for its low computational requirements and its ability to distribute the solutions uniformly along the Pareto front [8], [20]. It is based on the concept of Pareto dominance. A solution s_1 is said to dominate another solution s_2 if s_1 is not worse than s_2 in all objectives and better than s_2 in at least one objective. A solution is said to be nondominated if it is not dominated by any other solution. The algorithm starts by generating a random parent population. Individuals (chromosomes) selected through a crowded tournament selection undergo crossover and mutation operations to form an offspring population. Both offspring and parent populations are then combined and sorted into fronts of decreasing dominance (rank). After the sorting process, the new population is filled with solutions of different fronts, starting from the best one. If a front can only partially fill the next generation, crowded tournament selection is used again to ensure diversity. Once the next-generation population has been filled, the algorithm loops back to create a new offspring population, and the process continues up to convergence.
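To make the dominance relation concrete, the following minimal sketch (our illustration, not part of the paper) tests whether one objective vector Pareto-dominates another; it assumes both criteria are cast as minimizations, e.g., the negated JM distance and the number of invalidated samples.

```python
import numpy as np

def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (both minimized):
    f1 is no worse in every objective and strictly better in at least one."""
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

# Example: (-JM, #invalidated) pairs; the first solution dominates the second
print(dominates([-1.3, 4], [-1.1, 6]))  # True
```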

C. GA Setup

The success of a genetic optimization process depends mainly on two ingredients, i.e., the chromosome structure and the fitness functions, which translate the considered optimization problem and guide the search toward the best solution, respectively.

Concerning the first ingredient, since we desire either validating or invalidating each of the available n learning samples, we consider a population of N chromosomes C_m (m = 1, 2, . . . , N), where each chromosome C_m ∈ {0, 1}^n is a binary vector of length equal to n encoding a candidate combination of validations and invalidations of the learning samples. As shown in Fig. 2, a gene taking the value "1" or "0" means the invalidation or validation of the corresponding sample, respectively.
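As an illustration of this encoding, a chromosome can be decoded into a filtered learning set as in the sketch below; the data layout (X holding the n feature vectors row-wise, y the corresponding labels) is our assumption.

```python
import numpy as np

def decode_chromosome(chromosome, X, y):
    """Apply a candidate solution: drop the samples whose gene is "1".

    chromosome: length-n binary vector (1 = invalidated, 0 = validated);
    X: (n, d) feature matrix; y: (n,) label vector.
    Returns the filtered learning set on which the fitness is evaluated.
    """
    keep = np.asarray(chromosome) == 0
    return X[keep], y[keep]
```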

The validation/invalidation procedure will be based on the hypothesis that mislabeling a learning sample potentially leads to an increase of the intraclass variability and thus to a decrease of the between-class distance. Therefore, as a first fitness function, we will make use of a between-class statistical distance based on the well-known JM distance measure [1], [2]. This measure is a function of the Bhattacharyya distance measure, which is derived from the Chernoff bound, i.e., an upper bound on the probability of error of the Bayes classifier. In the case of multivariate Gaussian distributions, the JM distance between two generic classes ω_i and ω_j is given by

JM_{ij} = \sqrt{2\left(1 - e^{-B_{ij}}\right)}    (1)

where B_{ij} is the Bhattacharyya distance defined as

B_{ij} = \frac{1}{8}(\mu_i - \mu_j)^T \left[\frac{\Sigma_i + \Sigma_j}{2}\right]^{-1} (\mu_i - \mu_j) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_i + \Sigma_j}{2}\right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}}    (2)

where \Sigma and \mu denote the class covariance matrix and mean vector, respectively, and |\cdot| stands for the determinant operator. The JM distance is a measure bounded by the interval [0, \sqrt{2}]. When the two classes are identical (and, thus, completely overlapped), it assumes the value zero. In contrast, if they are totally separated, it takes the value \sqrt{2}.
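As a worked illustration of (1) and (2), the sketch below estimates the JM distance empirically from two sets of class samples; the use of the sample mean and covariance as Gaussian parameter estimates is our assumption, since the paper does not detail the estimator.

```python
import numpy as np

def jm_distance(X1, X2):
    """Jeffries-Matusita distance between two Gaussian classes, Eqs. (1)-(2).

    X1, X2: sample matrices of shape (n_i, d) for classes omega_i, omega_j.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    S = 0.5 * (S1 + S2)
    diff = mu1 - mu2
    # Bhattacharyya distance, Eq. (2)
    B = diff @ np.linalg.inv(S) @ diff / 8.0 \
        + 0.5 * np.log(np.linalg.det(S)
                       / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    # JM distance, Eq. (1); bounded by [0, sqrt(2)]
    return np.sqrt(2.0 * (1.0 - np.exp(-B)))
```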

The assumption that classes follow a Gaussian distribution is mainly motivated by the need to derive a tractable and easy-to-implement between-class distance measure. It is, however, noteworthy that the general nature of the proposed approach makes it possible to adopt any other type of distance measure.

At this point, in order to be suitably guided, the genetic optimization process needs information from the ground-truth expert, i.e., the expected amount of mislabeled learning samples. Without this information, the process would tend to invalidate all the learning samples but two (the most distant ones), i.e., one for each class. With this information, we could envision running a constrained genetic optimization process, which at convergence would provide the best subset of invalidated samples with prespecified cardinality. The main drawback of this genetic implementation is that it requires an exact knowledge of the amount of mislabeled learning samples. As a more practical alternative, we propose to run a multiobjective genetic optimization process based on the NSGA-II, where the second fitness function is simply a count of the number of invalidated samples. This implementation offers the advantage of providing at convergence a Pareto front of different solutions, from which the ground-truth expert could pick and try one or even several solutions according to his/her (vague) prior confidence in the reliability of the collected ground-truth.

D. Algorithmic Description

The different phases characterizing the automatic ground-truth validation method are as follows.

Phase 1—Decomposition From Multiclass to Binary Classification Problems: The typical multiclass nature of the learning set makes it necessary to resort to a suitable multiclass validation/invalidation search strategy. One strategy could be to project the multiclass nature of the problem into the first fitness function, i.e., to adopt a multiclass JM distance measure. This measure can typically be obtained through a weighted average of the two-class JM measures, where the weights are based on the prior probabilities of the T classes. In order to avoid the problem of correctly estimating the class prior probabilities, we adopt another strategy, which consists of the following: 1) decomposing the multiclass problem into T(T − 1)/2 binary classification tasks (one-against-one strategy, sketched below); 2) performing all "binary" genetic runs, i.e., running a genetic optimization process for each binary learning set; 3) computing at convergence a validation/invalidation score function for each sample from the solutions provided by all binary runs; and 4) invalidating a sample if its "invalidation" score is greater than the "validation" one (winner-takes-all decision rule). Another advantage of this strategy is that it speeds up the whole genetic optimization process, since the binary genetic runs work on much smaller solution spaces compared to that of the former strategy.
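A minimal sketch of the decomposition in point 1) follows; the labels are assumed to be the integers 1, . . . , T, as in Section II-A.

```python
import numpy as np
from itertools import combinations

def binary_subsets(X, y, T):
    """Yield the T(T-1)/2 binary learning sets of the one-against-one
    decomposition; each subset feeds one binary genetic run."""
    for i, j in combinations(range(1, T + 1), 2):
        mask = (y == i) | (y == j)
        yield (i, j), X[mask], y[mask]
```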

Phase 2—Optimization With NSGA-II: Since it is similar for all runs, this phase will be described for a single binary genetic run.

Phase 2.1—Initialization:

Step 1) Randomly generate a population P(t) (t = 0) of N chromosomes C_m (m = 1, 2, . . . , N), each gene taking either a "0" or a "1" value.

Step 2) For each candidate chromosome C_m (m = 1, 2, . . . , N) of P(t), build a new learning set by removing from the original binary learning set the samples invalidated by the corresponding genes (i.e., those with a "1" value) and compute its fitness functions (i.e., its JM distance and the number of invalidated samples).

Step 3) Perform random binary tournament selection, crossover, and mutation operations in order to create a population of offspring Q(t) having the same size N as the parent population P(t).

Phase 2.2—Optimization:

Step 4) Merge the two populations, i.e., R(t) = P(t) ∪ Q(t), to guarantee elitism (a mechanism which ensures that all the best chromosomes are passed to the next generation) and thus the stability and fast convergence of the optimization process. Sort the merged population R(t) into different fronts of descending domination rank according to the nondominated sorting method. Note that a solution is said to dominate another one if and only if it is no worse in all fitness functions and strictly better in at least one (see Section II-B).

Step 5) Create a new generation P(t + 1) of size N by choosing the best N solutions from R(t). The last solutions of the same front are selected so that they span their front as much as possible. This is carried out by integrating a crowding distance into the selection procedure. This distance is computed based on the two solutions surrounding the solution under consideration in the performance space (i.e., the space defined by the two fitness functions). It plays a key role in a multiobjective optimization process since it forces the process to obtain final solutions which are spread as much as possible along the Pareto-optimal front [19], [20]; a sketch of this computation follows the step list.

Step 6) If the stop criterion (e.g., a maximal number of generations and/or a check on the variation of the JM distance measure during the current and last generations) is not satisfied, set t ← t + 1 and go to Step 2).
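The crowding distance mentioned in Step 5) can be computed as in the following sketch, which follows the standard NSGA-II formulation [20]; boundary solutions of each front receive an infinite distance so that they are always retained.

```python
import numpy as np

def crowding_distance(front):
    """Crowding distance of each solution in a front.

    front: (m, k) array holding the k objective values of m solutions.
    """
    m, k = front.shape
    dist = np.zeros(m)
    for obj in range(k):
        order = np.argsort(front[:, obj])
        dist[order[0]] = dist[order[-1]] = np.inf  # keep boundary solutions
        span = front[order[-1], obj] - front[order[0], obj]
        if span == 0:
            continue
        for p in range(1, m - 1):
            dist[order[p]] += (front[order[p + 1], obj]
                               - front[order[p - 1], obj]) / span
    return dist
```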

Phase 3—Sample Validation:

Step 7) Based on the indication from the ground-truth expert about his/her prior confidence in the reliability of the original ground-truth, select from the Pareto front of each binary genetic run (i.e., couple of classes ω_i and ω_j) the chromosome C^*_{m_{ij}} with a number of invalidated learning samples closest to this indication. For instance, if the expert's confidence is equal to 90%, choose the solution (from the Pareto front) closest to 10% of invalidated samples.

Step 8) For each learning sample x_l (l = 1, 2, . . . , n), compute a score function

S(x_l) = \sum_{i = y_l;\, j \neq i} C^*_{m_{ij}}(l)    (3)

where y_l is the original label assigned by the ground-truth expert and C(l) denotes the lth gene of the considered chromosome. Validate x_l if

S(x_l) \leq (T - 1)/2.    (4)

Otherwise, recommend that the ground-truth expert correct or remove x_l.
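The score of (3) and the decision rule of (4) can be sketched as follows; indexing every selected chromosome over the full learning set is our bookkeeping convention (the paper's binary runs operate on the binary subsets), introduced only to keep the example short.

```python
import numpy as np
from itertools import combinations

def validate_samples(y, best, T):
    """Winner-takes-all validation, Eqs. (3)-(4).

    y: (n,) original labels in {1, ..., T}; best[(i, j)]: gene vector of
    the Pareto solution C*_mij selected for the run on classes (i, j),
    assumed indexed over the full learning set. Returns a boolean mask.
    """
    n = len(y)
    validated = np.ones(n, dtype=bool)
    for l in range(n):
        # Sum the invalidation votes over the T-1 runs involving class y_l
        score = sum(best[(i, j)][l]
                    for i, j in combinations(range(1, T + 1), 2)
                    if y[l] in (i, j))
        validated[l] = score <= (T - 1) / 2.0  # decision rule (4)
    return validated
```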

The ground-truth expert confidence plays an important role in the last phase of our approach. Its quantification mainly depends on the homogeneity of the areas from which the learning samples were extracted, since the confidence is, in general, high in homogeneous areas and low/medium in mixed or boundary areas. For instance, if 70 samples were extracted from homogeneous areas and 30 from heterogeneous areas, the confidence level could be set to around 85%. In general, if the confidence value is underestimated (precautionary behavior), this will result in a larger number of detected mislabeled samples and thus will require more interactions with the expert to confirm or reject the proposed invalidations. Conversely, if it is overestimated, the risk of not detecting part of the actually mislabeled samples increases.

Fig. 3. Two-dimensional Gaussian distributions generated for the first experiments with simulated data. (a) Case of separated classes and (b) case of overlapped classes.

III. EXPERIMENTAL RESULTS ON SIMULATED DATA

In order to assess the performance of the proposed method, we first started with a round of experiments on simulated data. In particular, we simulated various ground-truth validation scenarios by adding noise (i.e., mislabeling) with different proportions to two original noise-free (without any mislabeled sample) data sets characterized by different class distributions. This allowed us to create a controlled experimental environment useful to understand how noise affects our approach. The assessment was performed in terms of the following: 1) performance in detecting mislabeled samples and 2) comparison of decision regions. The detection performance was evaluated in terms of the probabilities of detection P_D and false alarm P_FA. The latter gives information about the number of invalidated noiseless samples, while the former expresses the number of correctly invalidated mislabeled samples. In all experiments, including those on real data sets reported in the next section, we used the following standard parameters for the genetic optimization process: population size N = 500, crossover probability p_c = 0.9, mutation probability p_m = 0.01, and maximum number of generations set to 100.
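For reproducibility, the noise-injection step and the two detection metrics can be sketched as follows. Assigning a random wrong label to each picked sample is our simplification of the label permutation described in the next subsection, and computing P_D and P_FA as invalidation rates over the mislabeled and noiseless samples is our reading of the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_label_noise(y, rate, labels):
    """Mislabel a fraction `rate` of the samples with a random wrong label."""
    y_noisy, flipped = y.copy(), np.zeros(len(y), dtype=bool)
    for l in rng.choice(len(y), size=int(rate * len(y)), replace=False):
        y_noisy[l] = rng.choice([c for c in labels if c != y[l]])
        flipped[l] = True
    return y_noisy, flipped

def detection_rates(invalidated, flipped):
    """P_D: share of mislabeled samples invalidated;
    P_FA: share of noiseless samples invalidated."""
    return invalidated[flipped].mean(), invalidated[~flipped].mean()
```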

A. Experiments 1: Binary Classification Problem With Bivariate Gaussian Distributions

Fig. 4. Example of Pareto front obtained at convergence by the multiobjective genetic process for the first experiments with simulated data.

In these experiments, we considered a 2-D two-class original ground-truth by assuming that the classes are drawn from normal distributions with the following parameters: \mu_1 = [0, 0]^T and \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} for the first class, and \mu_2 = [5, 5]^T and \Sigma_2 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} for the second class [see Fig. 3(a)]. The choice of these parameters is motivated by a willingness to start the assessment with a separable classification problem. The number of learning samples generated for each class is equal to 50. Then, we constructed five different scenarios of ground-truth validation, each referring to a given mislabeling proportion. For this purpose, we mislabeled the original noise-free ground-truth by varying the mislabeling rate from 10% to 50% with a step of 10%. Mislabeling was carried out by simply permuting the labels of randomly selected learning samples. At convergence of the genetic optimization process, we selected from the Pareto front the solution closest to the applied mislabeling rate (see Fig. 4).
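The two-class Gaussian ground-truth of this scenario can be generated as in the sketch below (the random seed is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

# Separated-classes scenario of Fig. 3(a): 50 samples per class
mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([5.0, 5.0]), np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.vstack([rng.multivariate_normal(mu1, S1, size=50),
               rng.multivariate_normal(mu2, S2, size=50)])
y = np.repeat([1, 2], 50)  # labels omega_1 = 1, omega_2 = 2
```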

TABLE I
DETECTION PERFORMANCE IN TERMS OF PROBABILITY OF DETECTION (P_D) AND OF FALSE ALARMS (P_FA) ACHIEVED ON THE DATA SETS SIMULATING GAUSSIAN-DISTRIBUTED CLASSES VERSUS THE PROPORTION OF MISLABELED LEARNING SAMPLES. (a) CASE OF SEPARATED CLASSES AND (b) CASE OF OVERLAPPED CLASSES

The results obtained after running our approach for each mislabeling scenario are given in Table I(a). As could be expected, when the true class distributions are well separated, a perfect detection (P_D = 100%, P_FA = 0%) is achievable for all mislabeling proportions except 50% (P_D = 28%, P_FA = 72%), in which case the mislabeled classes are almost completely overlapped (JM distance = 0.28 × 10^−4).

In the second part of the experiments, we wanted to analyze what happens in the opposite case, i.e., when the true classes are strongly overlapped. For this purpose, we kept the previous distributions unchanged, with the exception that the mean vectors were put close to each other, i.e., \mu_1 = [4.5, 4.5]^T and \mu_2 = [5.5, 5.5]^T [see Fig. 3(b)]. The detection results are reported in Table I(b). Again, we can see that satisfactory detection performances are achieved up to a 30% mislabeling rate. The performance drops drastically for higher rates (around 50% of P_D and 40% of P_FA). Note that in these cases, the JM distance is very small, namely, 0.05 and 0.003 for the 40% and 50% mislabeling rates, respectively.

As a further assessment criterion, we compared the decision regions produced by the Bayesian classifier on the two data sets: 1) before mislabeling (ideal decision regions); 2) after mislabeling; and 3) after the removal of the learning samples invalidated by our method. The obtained regions are shown in Figs. 5 and 6 for the data sets characterized by separated and overlapped classes, respectively. A visual inspection of the regions and their boundaries allows drawing the following observations: 1) sample mislabeling can lead to very strong distortions of the decision regions and boundaries, and 2) the proposed method makes it possible to cope with this problem satisfactorily up to large values of the mislabeling rate (i.e., 30% of the total number of learning samples).

B. Experiments 2: Chessboard Classification Problem

Fig. 5. Decision regions obtained for the first experiments with simulated separated classes. (a) Before mislabeling; with (b) 10%, (c) 20%, and (d) 30% of mislabeling rate; after removal of samples invalidated by our method in the cases of (e) 10%, (f) 20%, and (g) 30% of mislabeling rate.

Fig. 6. Decision regions obtained for the first experiments with simulated overlapped classes. (a) Before mislabeling; with (b) 10%, (c) 20%, and (d) 30% of mislabeling rate; after removal of samples invalidated by our method in the cases of (e) 10%, (f) 20%, and (g) 30% of mislabeling rate.

Fig. 7. Distribution of the classes characterizing the chessboard classification problem.

TABLE II
DETECTION PERFORMANCE IN TERMS OF PROBABILITY OF DETECTION (P_D) AND OF FALSE ALARMS (P_FA) ACHIEVED ON THE CHESSBOARD DATA SETS VERSUS THE PROPORTION OF MISLABELED LEARNING SAMPLES

In the second round of experiments, we changed the classification problem completely by moving to a 2-D multiclass problem where the classes are uniformly distributed, i.e., a chessboard-like problem. For this purpose, we generated a 3 × 3 chessboard composed of nine uniformly distributed classes, each represented by 50 learning samples (see Fig. 7). Afterward, we performed experiments according to the same experimental roadmap adopted in the first round. The quantitative results are reported in Table II, where we can observe the high robustness of the proposed method to the presence of mislabeled samples, since it yields very good detection performances up to a 40% mislabeling rate.
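The chessboard ground-truth of this round can be simulated as in the sketch below; unit-square cells on an integer grid are our assumption based on Fig. 7.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 x 3 chessboard: nine uniformly distributed classes, 50 samples each
cells = [(i, j) for i in range(3) for j in range(3)]
X = np.vstack([rng.uniform(low=[i, j], high=[i + 1, j + 1], size=(50, 2))
               for i, j in cells])
y = np.repeat(np.arange(1, 10), 50)  # one class label per cell
```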

For each of the five previous ground-truth validation scenarios (one for each mislabeling rate), we generated the decision regions by training and applying a linear SVM classifier after sample mislabeling and another one after removal of the invalidated samples. The regularization parameter of each SVM classifier was tuned empirically according to a k-fold cross-validation procedure (k = 3) performed on the learning samples associated with the considered classification task. The results are shown in Fig. 8, which reveals the following: 1) without removing the mislabeled learning samples, the chessboard is disfigured starting from a mislabeling rate of 20%, and 2) after removing the samples invalidated by our method, it is possible to keep the chessboard in good shape and to push the start of the disfiguration back to a 50% mislabeling rate.
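A minimal sketch of such a tuning step with scikit-learn follows; the grid of candidate C values is our assumption (the paper only states k = 3), and X, y stand for the learning set of the considered scenario (e.g., the chessboard set generated above).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 3-fold cross-validation of the regularization parameter C
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.1, 1.0, 10.0, 100.0]},
                      cv=3)
search.fit(X, y)  # learning set of the considered scenario
best_svm = search.best_estimator_
```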

IV. EXPERIMENTAL RESULTS ON REAL DATA

A. Data Set Description

In this experimental part, we desired to complete the previous assessment by considering this time real class distributions drawn from two different data sets. In addition, we evaluated the impact of the removal of the detected mislabeled samples on the accuracy of different classifiers, namely, the SVM based on the Gaussian kernel, the kNN, and the RBF neural network. For each classification scenario, parameter tuning and accuracy assessment of these classifiers were carried out empirically by means of a k-fold cross-validation procedure (k = 3) performed on the learning samples associated with the scenario.

The first data set represents a pansharpened multispectral image acquired by the Quickbird satellite in August 2004 in the region of El Tarf in northeastern Algeria. It consists of five main land covers, which are bare soil, forest, residential area, dirt road, and asphalt. The number of learning samples used in the experiments is equal to 50 for each class [see Fig. 9(a)].

The second data set was extracted from a multisensor (ATM and NASA/JPL SAR airborne sensors) image taken in July–August 1989. It is characterized by a total number of channels equal to 15 and refers to an agricultural area near Feltwell, U.K., in which five land cover types are dominant, namely, sugar beets, stubble, bare soil, potatoes, and carrots. The learning set consists of 50 samples for each class [see Fig. 9(b)].

B. Results

As we did previously in the experiments with simulated data, in order to assess the robustness of our method to the mislabeling problem, we contaminated the two data sets with mislabeling rates ranging from 10% to 50% with a step of 10%. This allowed us to create ten classification scenarios for each data set, namely, five with mislabeled learning samples and five with filtered learning samples (i.e., the subset of samples validated by our method). The detection performances are reported in Tables III(a) and IV(a) for the El Tarf and the Feltwell data sets, respectively. As can be seen, they remain particularly high up to a 30% mislabeling rate. Note that the second data set was more difficult to handle for our method because of its stronger class overlap compared to the first one (see Fig. 8).

Subsequently, we trained an SVM, a kNN, and an RBF classifier for each of these ten classification scenarios. For the sake of comparison, we also trained these classifiers on the original noise-free data sets. The achieved overall classification accuracies (i.e., the ratio of the number of correctly classified samples to the total number of samples) are given in Tables V(a) and VI(a) for the El Tarf and the Feltwell data sets, respectively. The first observation we can draw is that, in general, the higher the mislabeling rate, the higher the accuracy decrease. On average over the three kinds of classifiers, if the mislabeled samples are kept, the decrease of accuracy with respect to the noise-free learning set varies from 12.8% to 48.5% and from 12.9% to 41.9% for mislabeling rates ranging from 10% to 50% applied to the El Tarf and the Feltwell data sets, respectively. When the detected mislabeled samples are removed, such a decrease ranges from 1.2% to 26.6% and from −0.9% (i.e., a gain of accuracy) to 17.9% for the El Tarf and the Feltwell data sets, respectively. When the mislabeling rate is on the order of 30%, without mislabeled sample removal, the decrease is 33.9% and 35.5%, while with detected mislabeled sample removal, it is 0.2% and 5.0% for the two data sets, respectively. In other words, the proposed method strongly limits the negative impact of the mislabeled samples on the classification process, even in situations where their presence in the ground-truth is very significant. Note also that the choice of the classifier may play an important role in further limiting such impact. In these experiments, the SVM classifier seems less sensitive to the mislabeling problem than the kNN and the RBF classifiers.


Fig. 8. Decision regions obtained for the chessboard experiments with (a) 10%, (b) 20%, (c) 30%, (d) 40%, and (e) 50% of mislabeling rate; after removal of samples invalidated by our method in the cases of (f) 10%, (g) 20%, (h) 30%, (i) 40%, and (j) 50% of mislabeling rate.

Fig. 9. Distribution of the classes for (a) the El Tarf and (b) the Feltwell data sets.

TABLE III
DETECTION PERFORMANCE IN TERMS OF PROBABILITY OF DETECTION (P_D) AND OF FALSE ALARMS (P_FA) ACHIEVED ON THE EL TARF DATA SET VERSUS THE PROPORTION OF MISLABELED LEARNING SAMPLES. LEARNING SET SIZE OF (a) 50 AND (b) 100 SAMPLES FOR EACH CLASS

TABLE IV
DETECTION PERFORMANCE IN TERMS OF PROBABILITY OF DETECTION (P_D) AND OF FALSE ALARMS (P_FA) ACHIEVED ON THE FELTWELL DATA SET VERSUS THE PROPORTION OF MISLABELED LEARNING SAMPLES. LEARNING SET SIZE OF (a) 50 AND (b) 100 SAMPLES FOR EACH CLASS

TABLE V
OVERALL ERROR OBTAINED BY THE SVM, THE kNN, AND THE RBF CLASSIFIERS FOR ALL THE CONSIDERED GROUND-TRUTH SCENARIOS BUILT FROM THE EL TARF DATA SET. LEARNING SET SIZE OF (a) 50 AND (b) 100 SAMPLES FOR EACH CLASS

TABLE VI
OVERALL ERROR OBTAINED BY THE SVM, THE kNN, AND THE RBF CLASSIFIERS FOR ALL THE CONSIDERED GROUND-TRUTH SCENARIOS BUILT FROM THE FELTWELL DATA SET. LEARNING SET SIZE OF (a) 50 AND (b) 100 SAMPLES FOR EACH CLASS

Finally, we repeated all the aforementioned experiments after doubling the size of the original learning set (i.e., from 50 to 100 samples per class) in order to analyze the effect of the learning set size on our method. In general, the obtained results suggest that the proposed method keeps working well even when the learning set size is significant. Indeed, from Tables III(b) and IV(b), it appears that the detection performances are very satisfactory for mislabeling rates up to 30%. Above this level of noise, the validation/invalidation task starts encountering difficulties whose degree depends mainly on the data set complexity. The results in terms of classification accuracy confirm what was observed in the previous experiments, namely, that the proposed method is capable of capturing a good part of the labeling errors without losing correctly labeled samples, thus containing the propagation of the mislabeling problem to the classification process.

V. CONCLUSION

In this paper, we have proposed a novel methodology to assist the ground-truth expert in his/her work of learning sample collection by drawing his/her attention to learning samples with potential mislabeling problems. It is based on an automatic procedure for learning sample validation/invalidation performed by means of a genetic optimization process. From the experimental results obtained on both simulated and real data, it is possible to draw the following conclusions.

1) Ground-truth mislabeling problems can severely affect the classifier design and performance since they have a direct impact on the class distributions. The impact strength depends mainly on the amount of mislabeled samples and on the complexity of the considered classification problem (overlap between true class distributions).

2) The proposed method allows one to strongly limit the propagation of errors incurred by mislabeled samples in the image classification pipeline, even in situations where their presence in the ground-truth is very significant.

3) The higher the complexity of the ground-truth, the lower the detection performance. However, in general, for mislabeling rates of less than 30%, the method detects the mislabeled samples with high accuracy (high probability of detection) while preserving most of the correctly labeled ones (low probability of false alarms).

We note also that, besides its automatic nature, the proposed method exhibits the advantage that it acts as a filter completely independent of the kind of classification approach adopted in the classifier design phase. Therefore, the ground-truth validation result is not constrained by the use of any specific classifier. In addition, owing to the flexible nature of the method, more complex between-class distance measures, such as those based on nonparametric scatter matrices [21], could be adopted as well to make it suitable for any kind of data distribution. Its main drawback is that the handling of large-size ground-truths composed of thousands of learning samples could be computationally demanding. This problem could, however, be overcome by splitting the ground-truth into several sets of learning samples and then processing each of them separately.

ACKNOWLEDGMENT

This work was carried out within the framework of a project entitled "Development of Advanced Automatic Analysis Methodologies for Environmental, Industrial and Biomedical Monitoring," funded by the Italian Ministry of Education, University, and Research (MIUR).

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.

[2] J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis: An Introduction. Berlin, Germany: Springer-Verlag, 1999.

[3] Y. Li, L. F. A. Wessels, D. de Ridder, and M. J. T. Reinders, "Classification in the presence of class noise using a probabilistic kernel Fisher method," Pattern Recognit., vol. 40, no. 12, pp. 3349–3357, Dec. 2007.

[4] D. R. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, no. 3, pp. 408–421, Jul. 1972.

[5] L. A. Breslow and D. Aha, "Simplifying decision trees: A survey," Knowl. Eng. Rev., vol. 12, no. 1, pp. 1–40, Jan. 1997.

[6] C. E. Brodley and M. A. Friedl, "Identifying mislabeled training data," J. Artif. Intell. Res., vol. 11, pp. 131–167, 1999.

[7] Y. Bazi and F. Melgani, "Toward an optimal SVM classification system for hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 11, pp. 3374–3385, Nov. 2006.

[8] N. Ghoggali and F. Melgani, "Genetic SVM approach to semisupervised multitemporal classification," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 212–216, Apr. 2008.

[9] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[10] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[11] B. Waske and J. A. Benediktsson, "Fusion of support vector machines for classification of multisensor data," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007.

[12] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 241–245, Apr. 2008.

[13] A. Palau, F. Melgani, and S. B. Serpico, "Cell algorithms with data inflation for non-parametric classification," Pattern Recognit. Lett., vol. 27, no. 7, pp. 781–790, May 2006.

[14] E. Blanzieri and F. Melgani, "Nearest neighbor classification of remote sensing images with the maximal margin principle," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 6, pp. 1804–1811, Jun. 2008.

[15] L. Samaniego, A. Bardossy, and K. Schulz, "Supervised classification of remotely sensed imagery using a modified k-NN technique," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 7, pp. 2112–2125, Jul. 2008.

[16] N. Ghoggali and F. Melgani, "A genetic automatic ground-truth validation method for multispectral remote sensing images," in Proc. IEEE IGARSS, Boston, MA, Jul. 2008, vol. 4, pp. 538–541.

[17] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.

[18] L. Davis, Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold, 1991.

[19] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms. Chichester, U.K.: Wiley, 2001.

[20] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.

[21] B.-C. Kuo and D. A. Landgrebe, "Nonparametric weighted feature extraction for classification," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 5, pp. 1096–1105, May 2004.

Noureddine Ghoggali (S'06) received the State Engineer degree in electronics from the University of Batna, Batna, Algeria, in 2000. He is currently working toward the Ph.D. degree in information and communication technologies in the Department of Information Engineering and Computer Science, University of Trento, Trento, Italy.

His research activity is focused on pattern recognition and evolutionary computation methodologies for remote sensing image analysis (multitemporal classification and semisupervised learning).

Farid Melgani (M'04–SM'06) received the State Engineer degree in electronics from the University of Batna, Batna, Algeria, in 1994, the M.Sc. degree in electrical engineering from the University of Baghdad, Baghdad, Iraq, in 1999, and the Ph.D. degree in electronic and computer engineering from the University of Genoa, Genoa, Italy, in 2003.

From 1999 to 2002, he cooperated with the Signal Processing and Telecommunications Group, Department of Biophysical and Electronic Engineering, University of Genoa. Since 2002, he has been an Assistant Professor of telecommunications with the University of Trento, Trento, Italy, where he has taught pattern recognition, machine learning, radar remote-sensing systems, and digital transmission. He is currently the Head of the Intelligent Information Processing Laboratory, Department of Information Engineering and Computer Science, University of Trento. He is the coauthor of more than 80 scientific publications and is a referee for several international journals. His research interests are in the area of processing, pattern recognition, and machine learning techniques applied to remote sensing and biomedical signals/images (classification, regression, multitemporal analysis, and data fusion).

Dr. Melgani has served on the scientific committees of several international conferences and is an Associate Editor of the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS.