THEORETICAL ADVANCES
MOPG: a multi-objective evolutionary algorithm for prototype generation
Hugo Jair Escalante • Maribel Marin-Castro • Alicia Morales-Reyes •
Mario Graff • Alejandro Rosales-Perez • Manuel Montes-y-Gomez •
Carlos A. Reyes • Jesus A. Gonzalez
Received: 26 March 2014 / Accepted: 22 January 2015
© Springer-Verlag London 2015
Abstract Prototype generation deals with the problem
of generating a small set of instances, from a large data
set, to be used by KNN for classification. The two key
aspects to consider when developing a prototype gen-
eration method are: (1) the generalization performance
of a KNN classifier when using the prototypes; and (2)
the amount of data set reduction, as given by the number
of prototypes. Both factors are in conflict because, in
general, maximizing data set reduction implies decreasing
accuracy and vice versa. Therefore, this problem can
be naturally approached with multi-objective optimiza-
tion techniques. This paper introduces a novel multi-
objective evolutionary algorithm for prototype gen-
eration where the objectives are precisely the amount of
reduction and an estimate of generalization performance
achieved by the selected prototypes. Through a com-
prehensive experimental study we show that the pro-
posed approach outperforms most of the prototype
generation methods that have been proposed so far.
Specifically, the proposed approach obtains prototypes
that offer a better tradeoff between accuracy and re-
duction than alternative methodologies.
Keywords Prototype generation · Evolutionary
algorithms · 1NN classification · Multi-objective
optimization
1 Introduction
k-Nearest neighbors (KNN) is one of the most used
models for pattern classification [31]. This is due in part to
its asymptotic behavior as the number of training instances
tends to infinity [6]. Also, bias and variance of the KNN
model can be adjusted by varying the value of k [18]. In
addition to its effectiveness, KNN is quite popular because
it is easy to understand and implement. Unfortunately,
there are two main issues that limit the application of KNN
to certain domains: memory storage requirements and ef-
ficiency. On the one hand, KNN is known to be very ef-
fective when a large number of instances are available.
However, a large set of training objects implies (1) the
requirement of storing all of the training objects into
memory and (2) estimating the distance (or similarity) from
a test object to all of the training objects each time a new
object has to be classified. This is indeed a major com-
plication because nowadays big-data problems are be-
coming ubiquitous. Therefore, to keep KNN as a suitable
option for cutting-edge classification problems, strategies
for making it scalable and efficient must be applied.
Prototype-based classification (PBC) aims to amend
these issues for KNN [17, 24, 28]. PBC is a methodology
that considers only a subset of representative training in-
stances for making predictions, still under KNN’s classi-
fication rule. In PBC there are two main ways of reducing
the original training set: prototype selection (PS) and pro-
totype generation (PG). The goal of PS methods is to se-
lect, from the training data, the objects that better
H. J. Escalante (✉) · M. Marin-Castro · A. Morales-Reyes ·
A. Rosales-Perez · M. Montes-y-Gomez · C. A. Reyes · J. A. Gonzalez
INAOE, Luis Enrique Erro No. 1, Tonantzintla,
Puebla 72840, Mexico
e-mail: [email protected]
M. Graff
INFOTEC, Catedras CONACyT, Aguascalientes, Mexico
Pattern Anal Applic
DOI 10.1007/s10044-015-0454-6
summarize the whole data set [17]. PG methods, on the
other hand, generate new objects to be used as prototypes
by selecting and combining objects in the training set [28].
Although there are no comprehensive studies comparing
PG and PS strategies¹, generation methods are more general
than selection ones and, in fact, PS can be considered a
special case of PG [28].
The success of PG methods is usually assessed by looking
at two aspects separately: (1) the classification performance
obtained by a KNN classifier when using the prototypes
(accuracy), and (2) the amount of reduction achieved with
respect to the original training set (reduction). PG methods
that provide the best tradeoff between both measures are
preferred. Unfortunately, reduction and accuracy objectives
are in conflict because, in general, maximizing reduction
(i.e., favoring the generation of fewer prototypes) causes a
decrease in accuracy when using the prototypes, and vice
versa. Multi-objective optimization, therefore, seems to be an
appropriate methodology for developing a PG method that
looks for prototypes that simultaneously optimize accuracy
and reduction (that is, obtaining the best possible accuracy
with the smallest number of prototypes). This is precisely
the formulation adopted in this paper.
This paper introduces MOPG (multi-objective prototype
generation), a novel multi-objective optimization approach
for PG where the objectives are precisely (1) the amount of
reduction and (2) an estimate of generalization performance
achieved by the generated prototypes. The NSGA-II algo-
rithm is used to model the PG problem [9], where ad hoc
representation and evolutionary operators are proposed. We
propose a strategy to select a single solution from the Pareto
set generated through the optimization process. Through a
comprehensive experimental study we show that the pro-
posed approach outperforms most of the PG methods that
have been proposed so far. A total of 59 data sets with dif-
ferent characteristics (number of classes, instances, attributes,
etc.) were considered for experimentation. We
found that the proposed approach offers a better tradeoff
between the considered objectives than alternative method-
ologies. Although several evolutionary algorithms and re-
lated techniques have been used for PG, to the best of our
knowledge the PG problem has not yet been approached with
multi-objective optimization (see Sect. 2).
This paper is organized as follows. The next section reviews
related work on PG with emphasis on those based on
heuristic optimization. Section 3 introduces the proposed
MOPG approach. Section 4 describes experimental settings
and Sect. 5 reports experimental results obtained by
MOPG. Finally, Sect. 6 outlines conclusions and future
work directions.
2 Related work
A taxonomy and comparative study among several PG
methods is reported by Triguero et al. [28]. There, PG
methods are classified according to different dimensions,
including: type of reduction (incremental, decremental,
fixed and mixed); type of resulting set (condensation,
edition or hybrid); generation mechanism (class relabeling,
centroid-based, space splitting, and positioning adjustment)
and evaluation criteria (filter, semi-wrapper and wrapper).
A total of 32 different strategies are classified and an ex-
perimental comparison among 25 of these methods is re-
ported. In this paper, we compare our proposal to the same
25 methods and to two others recently proposed [14, 30].
It can be concluded from [28] that no single PG method
wins in terms of both accuracy and reduction. This is
understandable as most of the considered methods have
particular characteristics that make them appropriate to
tackle a single objective, either reduction or accuracy. For
instance, the method achieving the highest generalization
performance was GENN (Generalized Editing using
NN) [20], a decremental method that removes and relabels
instances. GENN is a conservative method because it edits
instances only to an extent that does not considerably harm
classification performance. It achieves substantially better
accuracy than any other of the compared PG methods;
however, it was among the worst in terms of
reduction. On the other hand, PSCSA (Prototype Selection
Clonal Selection Algorithm) obtained the best performance
in terms of reduction [16]. PSCSA models the PG problem
with an artificial immune system, the clonal selection al-
gorithm. This method selects exactly one example per
class, achieving the best reduction performance among
the 25 methods considered in [28].
However, its performance in terms of accuracy is worse
than several other strategies. This paper introduces a multi-
objective optimization approach to the PG problem. Our
hypothesis is that by explicitly and simultaneously opti-
mizing both objectives, reduction and accuracy, we can
obtain solutions that offer a better tradeoff between these
objectives.
Among the diversity of PG methods proposed so far, in
recent studies PG methods based on evolutionary algo-
rithms and related techniques have reported better results
than alternative approaches [2, 14–16, 23, 28, 30]. Usually,
these methods start from a set of solutions (sets of proto-
types) and then modify them according to specific op-
erators, through an iterative search procedure that attempts
to optimize a criterion related to the classification per-
formance of the prototypes.
A PG method based on particle swarm optimization
(PSO) was proposed in [23]. A standard PSO algorithm
was designed where the authors try to minimize the
1 One should note there are works comparing a few PS and PG
methods over a small number of data sets [19, 22].
classification error in the training set. The method is run
several times to obtain varied solutions (sets of prototypes).
When classifying a new object the outputs of the whole set
of prototypes are combined via voting. The ensemble
strategy allows this method to obtain better results than
many other methods evaluated in [28]; in fact, this solution
is among the best methods in terms of the reduction–ac-
curacy tradeoff, see Sect. 5. Another variant of PSO,
adaptive Michigan PSO (AMPSO), has also been used for
the generation of prototypes. In AMPSO, each particle of
the swarm is associated with a prototype in such a way that
the whole population is the set of prototypes that are op-
timized [2]. This method achieves similar reduction per-
formance to PSO but obtains lower accuracy.
Regarding evolutionary algorithms, successful ap-
proaches have been proposed as well. For instance, Fer-
nandez et al. proposed ENPC (Evolutionary Design of NN
Classifiers), an evolutionary algorithm that starts from a
single individual that is evolved by applying a variety of
operators that combine and split prototypes [15]. The
method is able to automatically determine the number of
prototypes and requires little information from the user.
ENPC is able to obtain very competitive performance in
terms of accuracy but it is not among the best methods in
reduction. Escalante et al. proposed a PG method based on
genetic programming (GPGP) [14]. The idea consists of
exploiting a tree structure to combine instances using
arithmetic operators. The objective function combines the
accuracy and reduction criteria in a single formula. GPGP
is able to obtain very effective prototypes. Although GPGP
is not included in the comparative study from [28], it
outperforms most of the methods evaluated there.
Finally, Triguero et al. [30] have recently proposed very
effective PG methods. The authors proposed several vari-
ants of the differential evolution technique [25] to ap-
proach the PG problem. The most effective methods
reported in [30] are hybrids that apply a PS algorithm
followed by a PG method based on differential evolution;
as a result, these variants are computationally expensive.
However, the performance of these methods in both
reduction and accuracy is much higher than that of previous
approaches.
Among all methods proposed so far for PG, in most
cases a single objective (either reduction or accuracy) is
being optimized. Some methods combine both objectives
into a single one or exploit the structure of the opti-
mization strategy to incorporate one of the objectives. To
the best of our knowledge no method has been proposed
that aims to optimize accuracy and data set reduction si-
multaneously for PG. Hence, our proposal is novel in that
sense. We are aware of a few works [1, 4, 21] that are
somewhat related to ours. Li and Wang approached the pro-
totype selection problem using a genetic algorithm [21].
Although they call their method multi-objective, they
optimized a single-objective function that combines re-
duction and classification performances. The performance
of their prototype selection method is significantly worse
than that obtained by our PG method in this paper when
using the same databases. Chen et al. [4] proposed a
multi-objective optimization approach (based on the
IMOEA algorithm) for simultaneously selecting features
and instances. However, the proposed method was only
evaluated on small databases. Finally, Aler et al. [1] de-
ploy a multi-objective optimization method that aims to
minimize false-negative and false-positive rates of a
prototype-based classifier. However, reduction perfor-
mance was not optimized therein.
In Sect. 5 we compare the performance of our method
with most of the proposals reviewed in this section and
others considered in [28]. Experimental results show that
MOPG is an effective solution to the PG problem.
3 MOPG: Multi-objective prototype generation
This section describes the proposed approach to prototype
generation (PG). First, the considered scenario is formally
described. Next, the PG problem is posed as one of multi-
objective optimization. Then we provide a detailed de-
scription of the proposed method for PG.
3.1 Considered scenario
Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,N}$ be a data set for
a classification problem with $N$ training instances, where
$\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in C = \{C_1, \ldots, C_K\}$, $d$ is
the dimensionality of the data and $C$ is the set of classes. The goal of PG
methods is to obtain a set of instances
$\mathcal{P} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,P}$ with $P \ll N$, under
the constraints that each pair $(\mathbf{x}_i, y_i) \in \mathcal{P}$
satisfies $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in C$, and that there is
at least one $(\mathbf{x}_i, y_i)$ for each class in $C$. One should note
that for PS there is the additional restriction that each pair
$(\mathbf{x}_i, y_i) \in \mathcal{P}$ is an element of $\mathcal{D}$ [17];
thus PG methods have more freedom in the design of prototypes, as no
restriction is enforced on the relation between instances in $\mathcal{P}$
and $\mathcal{D}$.
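The feasibility constraints above can be checked directly. The following is a small Python sketch (our illustration, with our own function and variable names, not code from the paper): the PG constraint only requires that every class is covered, while PS additionally requires every prototype to be an element of the original training set.

```python
def is_feasible_pg(P_labels, classes):
    """PG constraint: the prototype set covers every class at least once."""
    return set(classes) <= set(P_labels)


def is_feasible_ps(P_rows, P_labels, D_rows, D_labels, classes):
    """PS adds the restriction that every prototype is an element of D."""
    if not is_feasible_pg(P_labels, classes):
        return False
    D_pairs = {(tuple(x), y) for x, y in zip(D_rows, D_labels)}
    return all((tuple(x), y) in D_pairs for x, y in zip(P_rows, P_labels))
```

This makes concrete why PG generalizes PS: any PS-feasible set is PG-feasible, but not conversely.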
3.2 Prototype generation as multi-objective
optimization
Besides the above elementary restrictions, the ultimate goal in PG is that,
using $\mathcal{P}$, one obtains at least the same performance as one would
obtain when using $\mathcal{D}$ for the classification of unseen data
$\mathcal{T} = \{(\mathbf{x}_j, y_j)\}_{j=1,\ldots,T}$. The KNN
classification rule is considered and in most cases $K = 1$ (i.e., 1NN). Let
$\gamma(\mathcal{P}, \mathcal{D})$ be an estimate of
the generalization performance of a 1NN classifier when using $\mathcal{P}$
as training data (e.g., $\gamma$ could be a hold-out estimate), and let
$\delta(\mathcal{D}, \mathcal{P}) = 1 - \frac{P}{N}$ be the proportion of
reduction achieved by $\mathcal{P}$ (recall $P$ is the number of prototypes
in $\mathcal{P}$). The PG problem can be formulated as that of finding
$\mathcal{P}$ such that $\gamma(\mathcal{P}, \mathcal{D})$ and
$\delta(\mathcal{D}, \mathcal{P})$ are maximized.
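The two objectives can be sketched in a few lines of NumPy (our illustration, not code from the paper; the paper leaves $\gamma$ abstract, and here it is instantiated as the hold-out 1NN accuracy mentioned above):

```python
import numpy as np


def reduction(P_size, N):
    """delta(D, P) = 1 - P/N: the fraction of the training set removed."""
    return 1.0 - P_size / N


def holdout_accuracy(P_X, P_y, V_X, V_y):
    """gamma: 1NN accuracy of the prototypes on a held-out validation
    set, used as a simple hold-out estimate of generalization."""
    correct = 0
    for x, y in zip(V_X, V_y):
        d = np.linalg.norm(P_X - x, axis=1)   # distance to each prototype
        correct += (P_y[np.argmin(d)] == y)   # 1NN prediction vs. truth
    return correct / len(V_y)
```

A PG method then searches for prototype sets scoring well on both functions at once.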
Choosing solutions $\mathcal{P}$ for which $P$ is very small (i.e.,
$\delta(\mathcal{D}, \mathcal{P})$ is maximized) may cause
$\gamma(\mathcal{P}, \mathcal{D})$ to decrease. This is because, in general,
the fewer the instances available to approximate the underlying
classification problem, the lower the capacity of the 1NN classifier. For
example, consider the case in which $P = K$, with $K$ the number of classes;
in this scenario, highly biased classifiers with relatively low variance are
built. On the other hand, solutions with a large value of $P$ (i.e., small
$\delta(\mathcal{D}, \mathcal{P})$) may result in large values of
$\gamma(\mathcal{P}, \mathcal{D})$. This is because one can have a
$\mathcal{P}$ with as many points as instances in $\mathcal{D}$; for
instance, when $P = N$ we may perfectly classify every subset of points in
$\mathcal{D}$, resulting in unbiased classifiers that may show very high
variance. Therefore, maximizing $\delta(\mathcal{D}, \mathcal{P})$ may cause
$\gamma(\mathcal{P}, \mathcal{D})$ to decrease and vice versa. Since both
objectives are in conflict, this problem can be naturally approached with
multi-objective optimization: looking for the $\mathcal{P}$ that offers the
best tradeoff between the reduction and accuracy objectives.
In a multi-objective optimization problem one aims to find the solution
$\mathbf{x}^{*}$ that simultaneously maximizes a vector of $q$ objective
functions, $\mathbf{f}(\mathbf{x}^{*}) = \langle f_1(\mathbf{x}^{*}),
\ldots, f_q(\mathbf{x}^{*}) \rangle^{T}$, subject to the condition that the
vector of decision variables, $\mathbf{x} = \langle x_1, \ldots, x_d
\rangle^{T}$, belongs to the feasible region of the problem. In our case,
$\mathbf{x}$ is a solution to the problem at hand, the feasible region
consists of all possible sets of prototypes that satisfy the constraints
previously defined, and each $f_i$, $i \in \{1, \ldots, q\}$, is an
objective function to be maximized, where the objectives may be in conflict
with each other.
In multi-objective optimization the notion of optimum differs from that of
single-objective optimization, since the aim is to find good tradeoffs among
the objectives instead of a single best solution. The most accepted notion
of optimum in multi-objective optimization is so-called Pareto
optimality [5]. To define it, we first introduce the notion of Pareto
dominance. We say that a solution $\mathcal{P}_1$ dominates a solution
$\mathcal{P}_2$ (denoted by $\mathcal{P}_1 \succ \mathcal{P}_2$) if and only
if $\mathcal{P}_1$ is better than $\mathcal{P}_2$ in at least one objective
and not worse in the rest. A solution $\mathcal{P}^{*}$ is Pareto optimal if
there is no other solution $\mathcal{P}'$ in the entire feasible region such
that $\mathcal{P}' \succ \mathcal{P}^{*}$. Thus, a solution is Pareto
optimal if it is not possible to improve one objective without worsening
another. This definition does not produce a single solution, but a set of
tradeoff solutions among the different objectives. The set of these
solutions in the decision variable space is known as the Pareto optimal set.
The image of the elements of the Pareto optimal set in the objective space
constitutes the so-called Pareto front, see Fig. 1.
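For maximized objectives, Pareto dominance and the extraction of the non-dominated set can be sketched as follows (our illustration, a naive quadratic filter over objective vectors, not any particular MOEA's implementation):

```python
def dominates(f_a, f_b):
    """f_a dominates f_b (all objectives maximized) iff f_a is no worse
    in every objective and strictly better in at least one."""
    return (all(a >= b for a, b in zip(f_a, f_b))
            and any(a > b for a, b in zip(f_a, f_b)))


def pareto_front(points):
    """Return the objective vectors not dominated by any other vector."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For instance, with objective vectors (reduction, accuracy) = (0.9, 0.7), (0.8, 0.8) and (0.7, 0.6), the first two are mutually non-dominated tradeoffs, while the third is dominated by both.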
For PG we have two objectives, reduction and accuracy; thus we define
$f_1(\mathcal{P}) = \delta(\mathcal{D}, \mathcal{P})$ as the objective
related to reduction and
$f_2(\mathcal{P}) = \gamma(\mathcal{P}, \mathcal{D})$ as the measure of
generalization performance. Therefore, our goal is to find the solution
$\mathcal{P}^{*}$ that maximizes
$\langle f_1(\mathcal{P}), f_2(\mathcal{P}) \rangle$, subject to
$\mathcal{P} \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of feasible
solutions, that is, $\mathcal{P} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,P}$
with $\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in C$ and
$\forall C_i \; \exists y_j : y_j = C_i$, $i = 1, \ldots, K$,
$j = 1, \ldots, P$.
Each solution to this problem consists of a set of pairs of feature vectors
and their classes, $\mathcal{P} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,P}$.
Fig. 1 Pareto fronts for two selected data sets: banana and ring. Each plot
shows reduction, $f_1(\mathcal{P})$, against accuracy, $f_2(\mathcal{P})$.
We show the performance of each solution in the training (blue squares) and
test (red circles) sets, together with the solutions selected by the
different selection strategies: accuracy-based (star), distance-based
(diamond) and reduction-based (triangle)
Alternatively, one can see $\mathcal{P}$ as a matrix $\mathbf{P}$ of size
$P \times (d + 1)$, where the last column encodes the classes of the
prototypes. Hence solutions lie in an $\mathbb{R}^{P \times (d+1)}$ space,
where the value of $P$ can vary across solutions. In the rest of this paper
we will use $\mathcal{P}$ and $\mathbf{P}$ interchangeably to refer to
prototypes.
3.3 Multi-objective prototype generation
There are several techniques that can be adopted to ap-
proach the above multi-objective optimization problem, in
this paper we consider multi-objective evolutionary algo-
rithms (MOEAs). MOEAs are advantageous over classical
mathematical programming techniques because they can
obtain a set of solutions that approximate the Pareto front
in a single run, and because they are more sensitive to the
shape and continuity of the Pareto optimal front [5, 8].
Also, MOEAs can be implemented in parallel and are
somewhat effective in escaping local optima. Among the
large number of MOEAs reported in the state of the art, we
specifically considered the NSGA-II (Non-Dominated
Sorting Genetic Algorithm II) [9] method for solving this
problem. NSGA-II is one of the most used MOEAs and it
has been successfully applied to several pattern recognition
tasks (e.g., [3, 12, 32]). Compared to other MOEAs,
NSGA-II is highly efficient and it is able to generate a close
approximation to the Pareto front, while maintaining a
diverse pool of solutions.
NSGA-II is a fairly standard genetic algorithm in terms of
representation and evolutionary operators; however, it in-
corporates two main mechanisms that allow one to deal with
multi-objective optimization problems: non-dominance
sorting and crowding distance, see [5, 8, 9] for a detailed
explanation of these concepts. NSGA-II operates as follows;
see Algorithm 1. First, a population of solutions, also called
individuals, is generated and the solutions are evaluated
according to the objectives (steps 1 and 2 in Algorithm 1).
Then, an iterative process begins where evolutionary op-
erators, such as tournament selection, recombination, and
mutation are applied to generate a child population (step 5).
After that, the fitness values for each member in the child
population are estimated (step 6). Next, parent and child
populations are combined, and all non-dominated fronts are
identified using the non-dominance sorting mechanism to
rank solutions according to their non-dominance level (i.e.,
from the combined parent-child population, we identify the
solutions that are non-dominated with respect to the others;
these constitute the first level. The second level is then
formed by the non-dominated solutions among the remaining
ones, and the procedure is repeated until each solution
has been assigned to a non-dominance level) (steps 3 and 7).
A new population of individuals is selected for the next
iteration by choosing solutions from both the parent and
child populations using the previously identified non-
dominated fronts. If the size of the new population is greater
than the population size, individuals from the last added
front are chosen one by one by taking into account their
crowding distance in the objective space (steps 8-12). The
algorithm stops after g repetitions of the iterative process.
The rest of this section describes the NSGA-II
technique as we used it for PG.
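The non-dominance ranking used in steps 3 and 7 can be sketched as follows. This is our own naive quadratic illustration of the idea (peeling off one non-dominated level at a time), not Deb's fast non-dominated sort used in efficient NSGA-II implementations:

```python
def dominates(f_a, f_b):
    """f_a dominates f_b (objectives maximized) iff f_a is no worse in
    every objective and strictly better in at least one."""
    return (all(a >= b for a, b in zip(f_a, f_b))
            and any(a > b for a, b in zip(f_a, f_b)))


def non_dominated_sort(objs):
    """Partition objective vectors (by index) into non-dominance levels:
    level 1 = non-dominated in the whole pool; level 2 = non-dominated
    once level 1 is removed; and so on."""
    remaining = list(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts
```

Solutions in earlier fronts are preferred when filling the next population; crowding distance then breaks ties within the last front that fits.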
3.3.1 Initialization
A common practice in evolutionary algorithms is to ran-
domly initialize solutions [13]. In this work, however, we
initialize them using information from the training data, to
speed up the convergence of MOPG. Recall that in PG each
solution consists of a matrix P 2 RP�ðdþ1Þ, where P can
vary for each solution. Solutions are initialized on a per-
class basis, where for each class k an integer number Pk
between 1 and Nk is randomly chosen, with Nk the number
of instances in D that belong to class k. Next, we randomly
select Pk instances from the Nk that belong to class k to
generate an individual P. We repeat this process for each
Algorithm 1 NSGA-II algorithm [9]
Require: N_pop, f, g
  {N_pop: number of individuals (solutions); g: number of generations;
   f = ⟨f1(P), f2(P)⟩: objectives}
1:  Initialize population X_0
2:  Evaluate objective functions f = ⟨f1(P), f2(P)⟩, ∀P ∈ X_0
3:  Identify fronts F_1, ..., F_l by sorting solutions according to their
    non-dominance level, ∀P ∈ X_0
4:  for i = 1 to g do
5:    Create child population Q_i from X_i by applying evolutionary operators
6:    Evaluate objective functions f, ∀P ∈ Q_i
7:    Identify fronts F_1, ..., F_l by sorting solutions according to their
      non-dominance level, ∀P ∈ X_i ∪ Q_i
8:    X_{i+1} = ∅; j = 1
9:    while |X_{i+1}| < N_pop do
10:     X_{i+1} = X_{i+1} ∪ F_j; j = j + 1
11:   end while
12:   Select the last individuals for X_{i+1} from F_j using crowding distance
13: end for
class to generate each of the initial solutions of the
population. In this way, the initial population of prototypes
belongs to the data set D and a different number of pro-
totypes-per-class is allowed.
We control the number of prototypes to be considered
for the initialization of each individual of the population
through the parameter $I_p = \frac{\sum_k P_k}{N}$, where $I_p$ is
simply the fraction of the training instances that will be
considered for initialization. One should note that the
evolutionary operators we propose allow MOPG to reduce the
number of prototypes from one generation to the next (via
condensation in crossover).
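The per-class initialization can be sketched as follows. This is our illustration under our own naming; for brevity it draws each $P_k$ independently and does not enforce a target $I_p$ fraction:

```python
import numpy as np

rng = np.random.default_rng(0)


def init_solution(X, y, classes):
    """Build one initial individual: for each class k, draw an integer
    P_k in [1, N_k] and copy P_k randomly chosen class-k training
    instances, so initial prototypes all belong to D."""
    rows, labels = [], []
    for k in classes:
        idx = np.flatnonzero(y == k)            # class-k instances in D
        P_k = rng.integers(1, len(idx) + 1)     # 1 <= P_k <= N_k
        chosen = rng.choice(idx, size=P_k, replace=False)
        rows.append(X[chosen])
        labels.extend([k] * P_k)
    return np.vstack(rows), np.array(labels)
```

Repeating this for every individual yields an initial population with a different number of prototypes per class, as the text describes.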
3.3.2 Fitness functions
Solutions are evaluated with respect to the two objectives defined above:
$\mathbf{f}(\mathcal{P}) = \langle f_1(\mathcal{P}), f_2(\mathcal{P})
\rangle$. Objective $f_2(\mathcal{P}) = \gamma(\mathcal{D}, \mathcal{P})$ is
related to the generalization performance of a 1NN classifier using
prototypes $\mathcal{P}$. There are several ways of estimating the
performance of a classifier on unseen data, including cross validation,
bootstrapping, the jackknife, etc. In this work we considered a simple
hold-out estimate in which the training data $\mathcal{D}$ is further
divided into training, $\mathcal{D}_T$, and validation, $\mathcal{D}_V$,
data sets. The training partition is used to initialize the population and
to apply the evolutionary operators, whereas the validation data are used to
evaluate solutions. Hence, we define $\gamma(\mathcal{D}, \mathcal{P})$ as
the accuracy obtained by a 1NN classifier using $\mathcal{P}$ as training
data when classifying $\mathcal{D}_V$. On the other hand, since we defined
$\delta(\mathcal{D}, \mathcal{P}) = 1 - \frac{P}{N}$, we can use
$\delta(\mathcal{D}, \mathcal{P})$ directly as the objective $f_1$, which is
related to the amount of reduction.
To generate $\mathcal{D}_T$ and $\mathcal{D}_V$ we define a parameter
$g \in [0, 1]$ that controls the fraction of instances from each class to be
used for training and validation. For each class $k$, we randomly select
$\lceil N_k \cdot g \rceil$ instances from $\mathcal{D}$ and use them as the
training examples of class $k$; the other
$\lfloor N_k \cdot (1 - g) \rfloor$ instances are used for the validation
set. Repeating this process for each class, we form training and validation
partitions that maintain the original class distribution of $\mathcal{D}$.
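The stratified split can be sketched as follows (our illustration; the parameter name `g` follows the text above, and function names are ours):

```python
import math

import numpy as np

rng = np.random.default_rng(0)


def stratified_split(X, y, g):
    """Per class k, ceil(N_k * g) random instances go to the training
    part D_T and the remaining floor(N_k * (1 - g)) to the validation
    part D_V, preserving the class distribution of D."""
    train_idx, val_idx = [], []
    for k in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == k))
        n_train = math.ceil(len(idx) * g)
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:])
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```

Calling this anew in each generation gives the dynamic fitness function described next.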
In each iteration of the genetic algorithm we update
training and validation partitions, to prevent the prototypes
from overfitting a single validation data set. Each time the
partitions are updated, we re-evaluate all of the solutions in
the Pareto set with the new validation partition, to avoid
evaluating solutions in different iterations of different data.
Please note that using a dynamic fitness function it is still
possible to obtain solutions in the Pareto set that obtained
good performance in a single partition of data (e.g., if a
solution in the last iteration obtains good performance in
the last partition). Anyway, we think that even with this
limitation, a dynamic fitness function is advantageous over
optimizing performance on a fixed data partition.
3.3.3 Evolutionary operators
In evolutionary computation, variation operators are used
to generate new solutions by updating the available
ones [13], where the two fundamental operators in evolu-
tionary algorithms are crossover and mutation. When so-
lutions are encoded as numerical vectors of fixed
dimension, there is a wide diversity of variation operators
that can be applied. However, since in PG each solution is
encoded as a matrix with variable number of rows, we must
propose ad hoc operators for this representation.
The goal of the crossover operator is to generate new
(children) solutions by combining the elements that form two
other individuals (parents) with the aim that new solutions
are better than their ancestors. In PG we want to combine two
sets of prototypes $\mathcal{P}_1$ and $\mathcal{P}_2$ to generate solutions
$\mathcal{P}'_1$ and $\mathcal{P}'_2$. Accordingly, we propose a crossover
operator that interchanges the individual prototypes that form solutions
$\mathcal{P}_1$ and $\mathcal{P}_2$. Given two parent solutions
$\mathcal{P}_1$ and $\mathcal{P}_2$, we randomly select a prototype
$p_{\mathcal{P}_1} \in \mathcal{P}_1$. Then we identify those prototypes
from solution $\mathcal{P}_2$ that belong to the same class as
$p_{\mathcal{P}_1}$. We replace a randomly chosen prototype in
$\mathcal{P}_2$ with $p_{\mathcal{P}_1}$, where the replaced prototype
belongs to the same class as $p_{\mathcal{P}_1}$. Next we apply, with
uniform probability, either: (a) replace
$p_{\mathcal{P}_1} \in \mathcal{P}_1$ with a prototype randomly chosen from
$\mathcal{P}_2$ that belongs to the same class as $p_{\mathcal{P}_1}$, or
(b) replace $p_{\mathcal{P}_1} \in \mathcal{P}_1$ with the average of the
prototypes in $\mathcal{P}_2$ belonging to the same class as
$p_{\mathcal{P}_1}$. The aim of (a) is to replace a prototype with another
one that belongs to the same class, hence allowing the interchange of
information between solutions. On the other hand, the goal of (b) is to
condense a set of prototypes in such a way that the new prototype
summarizes the position in the search space of all of the prototypes of the
same class.
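The crossover can be sketched roughly as below. This is our own reading, not the authors' implementation: in particular, whether step (a) may pick back the prototype just inserted into $\mathcal{P}_2$ is ambiguous in the text and resolved arbitrarily here, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)


def crossover(P1_X, P1_y, P2_X, P2_y):
    """Class-aware crossover sketch: exchange a prototype p of P1 with a
    same-class prototype of P2, then either (a) swap a random same-class
    prototype from P2 into P1, or (b) condense: replace p in P1 with the
    mean of P2's same-class prototypes. Operates on copies."""
    P1_X, P2_X = P1_X.copy(), P2_X.copy()
    i = rng.integers(len(P1_y))                 # prototype p from P1
    same = np.flatnonzero(P2_y == P1_y[i])      # same-class rows in P2
    if len(same) == 0:
        return P1_X, P2_X                       # nothing to exchange
    j = rng.choice(same)
    P2_X[j] = P1_X[i]                           # p replaces one of them
    if rng.random() < 0.5:                      # (a) interchange
        P1_X[i] = P2_X[rng.choice(same)]
    else:                                       # (b) condensation
        P1_X[i] = P2_X[same].mean(axis=0)
    return P1_X, P2_X
```

Both branches preserve the class labels of each solution, so feasibility (one prototype per class) is maintained, and branch (b) is the condensation step that can effectively shrink a class's spread of prototypes.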
Mutation operators aim to incorporate diversity in the
population through random modifications of solutions. For
PG we propose a mutation operator that, given a solution $\mathcal{P}_1$,
generates a new solution $\mathcal{P}'_1$ in which one prototype of
$\mathcal{P}_1$ is modified. We randomly select a prototype
$p_{\mathcal{P}_1} \in \mathcal{P}_1$. Next, we apply, with uniform
probability, either: (a) add a vector of random numbers to
$p_{\mathcal{P}_1}$, where the numbers are uniformly generated in the range
of the values of each dimension of $p_{\mathcal{P}_1}$, or (b) replace
$p_{\mathcal{P}_1}$ with an instance in the training set $\mathcal{D}_T$
that belongs to the same class. The aim of (a) is to introduce slight
perturbations to solutions in the current population, whereas (b) aims to
introduce new instances (not considered during initialization) into the
individuals.
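The mutation operator admits a similar sketch (again our illustration, with our own names; the perturbation range in (a) is approximated here by the per-dimension span of the solution's own prototypes):

```python
import numpy as np

rng = np.random.default_rng(0)


def mutate(P_X, P_y, DT_X, DT_y):
    """Mutation sketch: pick one prototype and either (a) perturb it
    with uniform noise scaled to the per-dimension value range, or
    (b) replace it with a random same-class instance from D_T.
    Returns a modified copy; labels are untouched."""
    P_X = P_X.copy()
    i = rng.integers(len(P_y))
    if rng.random() < 0.5:                       # (a) perturbation
        span = P_X.max(axis=0) - P_X.min(axis=0)
        P_X[i] += rng.uniform(-span, span)
    else:                                        # (b) reseed from D_T
        same = np.flatnonzero(DT_y == P_y[i])
        if len(same) > 0:
            P_X[i] = DT_X[rng.choice(same)]
    return P_X
```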
To select solutions to which the crossover and mutation
operators will be applied, we used a binary tournament
scheme [8, 9, 13]. Crossover and mutation operators are
applied to individuals in the population with probabilities
Prc and Prm. We use default values for these probabilities,
but we also conducted experiments to assess the impact of
the parameter values and the evolutionary operators on
MOPG.
3.3.4 Selecting a single solution
The output of the NSGA-II method is a set of non-
dominated solutions found during the search, i.e., an ap-
proximation of the Pareto optimal set. Theoretically, none
of these solutions is better or worse than any other. However,
for PG a single set of prototypes is needed rather than a set
of solutions. Of course, one could use the union of
all prototypes included in all solutions, but this would be
counterproductive because the amount of reduction would de-
crease and we could have redundant and noisy prototypes.
Instead, we propose a simple mechanism to select a single
solution out of the set of non-dominated solutions.
We evaluate the classification performance of 1NN
when using the prototypes associated with each solution to
classify all the training instances ($\mathcal{D}$). Then we choose, as the
final output of our method, the solution that obtained the
highest classification performance. One should note that,
since prototypes were generated using different partitions
of training and validation data, the performance on the
whole original data set is not expected to be perfect. Thus,
we can use this performance as an indicator of the accuracy
of the prototypes. In Sect. 5 we evaluate the effectiveness
of this selection strategy.
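This accuracy-based selection can be sketched as follows (our illustration, using a brute-force 1NN accuracy over the whole training set $\mathcal{D}$; names are ours):

```python
import numpy as np


def select_solution(solutions, D_X, D_y):
    """From the non-dominated set, pick the solution whose prototypes
    give the highest 1NN accuracy on the whole training set D.
    Each solution is a (prototypes, labels) pair."""
    def acc(P_X, P_y):
        hits = 0
        for x, y in zip(D_X, D_y):
            d = np.linalg.norm(P_X - x, axis=1)
            hits += (P_y[np.argmin(d)] == y)
        return hits / len(D_y)
    return max(solutions, key=lambda s: acc(*s))
```

Because the prototypes were fitted on changing train/validation partitions rather than on $\mathcal{D}$ itself, this score behaves as an indicator of prototype quality rather than a trivially perfect resubstitution accuracy.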
3.4 Discussion
In this section we have described in detail MOPG, our
multi-objective approach to PG. MOPG uses NSGA-II for
exploring the search space of prototypes. A variable-length
representation is adopted and ad hoc evolutionary operators
have been designed. Solutions are evaluated on a subset of
the training data that is updated every iteration. Also, a
strategy for the selection of a single solution from the set of
non-dominated solutions returned by NSGA-II has been
presented.
The main benefit of MOPG when compared to other PG
methods is that the proposed method searches for solutions
that offer a good tradeoff between reduction and accuracy,
which is the ultimate goal of PG methods. Multi-objective
optimization naturally deals with this type of problem.
To the best of our knowledge no other author has ap-
proached the PG problem similarly. Besides, the ways we
evaluate the fitness functions (updating the training and
validation partitions in each iteration) and select a single
solution allow MOPG to avoid overfitting to some extent.
In Sect. 5 we show experimental results that evidence the
effectiveness of MOPG.
One should note that since MOPG is a population-based
technique, it requires the evaluation of many solutions (as
many as Npop × (g + 1)). Thus, depending on the data set size,
applying our method could be a computationally demanding
process. Fortunately, there is growing interest in the research
community in developing efficient and scalable
methods for PG (see, e.g., [29]), which could be applied to
computation there are also alternative methodologies that can
be adopted to speed up MOPG [7, 26].
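For intuition on the multi-objective machinery MOPG inherits from NSGA-II, the core notion of Pareto dominance (here with both objectives maximized, as for accuracy and reduction) can be sketched as follows; this is illustrative only, not NSGA-II's full fast non-dominated sorting:

```python
def dominates(a, b):
    """a dominates b if a is >= on every objective and > on at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(points):
    """Return the objective vectors not dominated by any other point in the set."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

For example, with (accuracy, reduction) pairs, a solution that is worse on both objectives than some other solution is filtered out, while solutions trading one objective for the other all survive.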
4 Experimental settings
This section describes the experimental settings we adopted
for the evaluation of MOPG. We considered the suite of data
sets described in Table 1. These data sets were collected by
the authors of [28] and have been used for benchmarking
many PG methods proposed so far [2, 10, 14, 15, 23, 28, 30].
The data sets are diverse in terms of number of
instances, attributes² (numerical/nominal), and classes,
which allows us to assess the performance of MOPG under
different circumstances. Data sets with a large number of
instances are considered in the benchmark. In [28] the au-
thors distinguished small (less than 2,000 instances) from
large (at least 2,000 instances) data sets.
To make our experimental results comparable with
others [14, 28, 30], we applied tenfold cross validation
over the 59 data sets to evaluate the performance of
MOPG. In each experiment, for each data set we applied
MOPG 10 times using the training partitions generated via
tenfold cross validation; we evaluated the performance of
the generated prototypes in each of the 10 runs using the
corresponding test partitions. Hence, for a single ex-
periment over all of the data sets we ran MOPG 590 times.
We evaluated two main aspects of MOPG: accuracy on
unseen data (test partitions) and amount of training set
reduction. Additionally, we evaluated the tradeoff in
performance by measuring the product (reduction × accuracy)
and the harmonic mean of both objectives,
(2 × reduction × accuracy)/(reduction + accuracy). One should note that since we used
exactly the same partitions of tenfold cross validation as
in [14, 28, 30], we can directly compare the performance of
MOPG with those PG methods.
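The two tradeoff measures used throughout the evaluation can be computed as below; `reduction` and `accuracy` are assumed to be fractions in [0, 1] (the helper names are ours):

```python
def tradeoff_product(reduction, accuracy):
    """Product of the two objectives: high only when both are high."""
    return reduction * accuracy

def tradeoff_harmonic(reduction, accuracy):
    """Harmonic mean of the two objectives; penalizes imbalance more than the product."""
    return 2 * reduction * accuracy / (reduction + accuracy)
```

The harmonic mean of two equal values is that value itself, while a solution that sacrifices one objective entirely scores near zero under both measures.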
5 Experimental results
This section reports experimental results obtained with
MOPG over the suite of data sets introduced in [28] and
described in Sect. 4.²
² One should note that the considered data sets include both numeric
and nominal attributes. For simplicity we have deliberately
transformed nominal attributes into integers and applied MOPG
without any modification.
Pattern Anal Applic
123
The goal of our experiments is to
evaluate the effectiveness of MOPG for the generation of
prototypes and to compare its performance with that re-
ported by alternative methods. First, we report results of a
study that aims at evaluating the reproducibility of results.
Next, we evaluate the effectiveness of the strategy for se-
lecting solutions from the Pareto set. Next, we assess the
performance of MOPG under different parameter settings
related to MOPG itself and to the evolutionary algorithm
we considered (NSGA-II). Finally, we compare the per-
formance of MOPG to that of other state-of-the-art
approaches.
5.1 Reproducibility of results
This section aims to provide evidence on the repro-
ducibility of results obtained with MOPG. Since MOPG is
based on NSGA-II, which is a stochastic optimization
technique, there is no guarantee that the same results will
be obtained with each execution of it. Accordingly, in this
section our aim is to determine whether results obtained
with MOPG are due to chance or not.
We assess the reproducibility of results obtained by
MOPG by performing experiments with two different pa-
rameter settings. The two configurations differed in the
population size and number of generations. For the first
configuration, called 50–50, 50 individuals and 50 gen-
erations were considered; for the second one, called
250–250, 250 individuals and 250 generations were con-
sidered (the rest of the parameter were fixed according to
the best results from Sect. 5.3). The intuition behind con-
sidering two parameter settings was to assess the repro-
ducibility of results when search is non-intensive (50–50
setting) and somewhat intensive (250–250 setting). We ran
Table 1 Description of the data sets considered for evaluation [28].
For each data set we show the number of instances (Ex), attributes
(At), numerical/nominal attributes (Nu/No), and classes (K)
Data set Ex At Nu/No K Data set Ex At Nu/No K
Abalone 4,174 8 7/1 28 Marketing 8,993 13 13/0 9
Appendicitis 106 7 7/0 2 Monks 432 6 6/0 2
Australian 690 14 8/6 2 Movement 360 90 90/0 15
Autos 205 25 15/10 6 Newthyroid 215 5 5/0 3
Balance 625 4 4/0 3 Nursery 12,960 8 0/8 5
Banana 5,300 2 2/0 2 Pageblocks 5,472 10 10/0 5
Bands 539 19 19/0 2 Penbased 10,992 16 16/0 10
Breast 286 9 9/0 2 Phoneme 5,404 5 5/0 2
Bupa 345 6 6/0 2 Pima 768 8 8/0 2
Car 1,728 6 6/0 4 Ring 7,400 20 20/0 2
Chess 3,196 36 36/0 2 Saheart 462 9 8/1 2
Cleveland 297 13 13/0 5 Satimage 6,435 36 36/0 7
Coil2000 9,822 85 85/0 2 Segment 2,310 19 19/0 7
Contraceptive 1,473 9 6/3 3 Sonar 208 60 60/0 2
Crx 125 15 6/9 2 Spambase 4,597 55 55/0 2
Dermatology 366 33 1/32 6 Spectheart 267 44 44/0 2
Ecoli 336 7 7/0 8 Splice 3,190 60 0/60 3
Flare-solar 1,066 9 0/9 2 Tae 151 5 5/0 3
German 1,000 20 6/14 2 Texture 5,500 40 40/0 11
Glass 214 9 9/0 7 Tic-tac-toe 958 9 0/9 2
Haberman 306 3 3/0 2 Thyroid 7,200 21 6/15 3
Hayes-roth 133 4 4/0 3 Titanic 2,201 3 3/0 2
Heart 270 13 6/7 2 Twonorm 7,400 20 20/0 2
Hepatitis 155 19 19/0 2 Vehicle 846 18 18/0 4
Housevotes 435 16 0/16 2 Vowel 990 13 11/2 11
Iris 150 4 4/0 3 Wine 178 13 13/0 3
Led7digit 500 7 6/1 10 Wisconsin 683 9 9/0 2
Lymphography 148 18 3/15 4 Yeast 1,484 8 8/0 10
Magic 19,020 10 10/0 2 Zoo 101 17 0/17 7
Mammographic 961 5 0/5 2
MOPG five times with each parameter configuration for the
59 data sets from Table 1; the results of this experiment are
shown in Table 2.
It can be seen from Table 2 that results do not vary
considerably for either parameter configuration; results
vary slightly more for the 250–250 configuration. For both
parameter configurations, an ANOVA test comparing average
results (across the 59 data sets and 5 runs) reveals
that, with a confidence of 99.9 %, the null hypothesis that
the means across different runs (groups) are equal cannot
be rejected. Therefore, we can conclude that results obtained
by MOPG are reproducible.
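The reproducibility check above boils down to a one-way ANOVA with runs as groups. A minimal sketch of the F statistic follows (the helper name is ours; in practice a library routine such as SciPy's `f_oneway` would be used):

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)                                   # number of groups (runs)
    n = sum(len(g) for g in groups)                   # total number of observations
    grand = sum(sum(g) for g in groups) / n           # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A small F value (large p-value) means the null hypothesis of equal run means cannot be rejected, which is the reproducibility conclusion drawn above.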
5.2 Selection strategy
In this section we evaluate the performance of the strategy
proposed for the selection of a single solution from the
Pareto set as returned by the NSGA-II algorithm. The goal
of this experiment is to determine the effectiveness of the
strategy and its impact on the final performance of MOPG.
To determine the effectiveness of the selection strategy
described in Sect. 3.3.4 we ran an experiment and com-
pared the performance of the proposed strategy to the one
that would be obtained if the best solution from the Pareto
set had been selected. Additionally, we implemented two
other alternative selection criteria and compared them with
the proposed one. We considered a reduction-based crite-
rion, in which we always chose the solution with the
highest reduction performance. Also, we considered an-
other criterion that chooses the solution that is closest to the
theoretical optimum (an accuracy and reduction of 1); we
refer to this strategy as distance-from-optimum.
Table 3 compares the performance of MOPG when us-
ing the proposed technique and the other strategies. The
results in Table 3 are the average over the 5 runs of the
250–250 configuration from Sect. 5.1. From this table it
can be seen that the performance of MOPG when using our
selection strategy is very close to the topline (best). In
terms of reduction, our strategy allows us to select solu-
tions that virtually achieve the same performance as the
topline; in fact, all strategies obtained comparable reduc-
tion performance. On the other hand, in terms of accuracy,
the proposed criterion is the one that is closer to the per-
formance of the topline. The other two strategies obtained
lower performance. From these results we can conclude
that, although there is still a margin for improvement, the
proposed strategy proved to be very effective for selecting
competitive solutions from the Pareto set. Also, we showed
that the proposed strategy outperforms the other two se-
lection techniques.
Figure 1 shows the Pareto front obtained by MOPG for
two data sets (Banana and Ring) for a particular run; we
show the training-set (blue squares)/test-set (red circles)
accuracies and reduction performance of each solution in
the Pareto front. Both plots are representative of the rest of
data sets and runs. It can be seen from these plots that
solutions along the reduction objective achieve very close
fitness values, [0.975, 1] for both data sets. Thus,
choosing a solution with competitive reduction perfor-
mance is not too difficult. On the other hand, the accuracy
objective has a wider range of variation: [0.75, 0.92] for
Banana and [0.81, 0.94] for Ring. Therefore, the selection
of a competitive solution in terms of accuracy is not trivial,
and, in fact, it makes sense to use a criterion related to
accuracy to select a single solution from the Pareto set.
Figure 1 also shows the solutions that would be se-
lected with the proposed strategy (accuracy-based) and
the other two alternative methods. The solutions selected
with each strategy illustrate the benefits of using them:
accuracy-based obtains solutions that achieve better
classification performance; reduction-based selects solutions with
the highest reduction performance; distance-from-opti-
mum returns solutions with a good tradeoff between ac-
curacy and reduction. We emphasize that although better
reduction performance can be obtained with the last two
strategies, the improvement over the accuracy-based
criterion is quite small.
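For concreteness, the distance-from-optimum criterion can be sketched as follows, where each solution is an (accuracy, reduction) pair and the ideal point is (1, 1); the function name is ours, not the paper's:

```python
import math

def distance_from_optimum(solutions):
    """Pick the (accuracy, reduction) pair closest (Euclidean) to the ideal point (1, 1)."""
    return min(solutions, key=lambda s: math.hypot(1 - s[0], 1 - s[1]))
```

Because all Pareto solutions already achieve near-maximal reduction, this criterion mostly trades a tiny amount of reduction for accuracy, which matches its behavior in Fig. 1.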
5.3 Experimental study on MOPG’s parameters
In this section we evaluate the sensitivity of MOPG to var-
iations in its parameters. Specifically we consider parameters
that are directly related to our proposal, namely: Npop
(number of individuals), g (number of generations), η (the
Table 2 Evaluation of reproducibility of results. We report the average
performance (across all data sets) in terms of test-set accuracy
and training-set reduction for each run, and the average performance
across runs. The values of the other parameters were η = 0.3, Ip = 0.1
Run ID Accuracy Reduction
Configuration 1: g = 50, Npop = 50
Run 1 71.84 % ± 18.63 98.65 % ± 1.13
Run 2 71.93 % ± 18.48 98.65 % ± 1.12
Run 3 72.16 % ± 18.27 98.65 % ± 1.13
Run 4 71.85 % ± 18.73 98.64 % ± 1.12
Run 5 71.85 % ± 18.72 98.64 % ± 1.12
Average 71.92 % ± 18.56 98.64 % ± 1.12
Configuration 2: g = 250, Npop = 250
Run 1 77.05 % ± 17.12 98.68 % ± 1.19
Run 2 77.10 % ± 17.16 98.69 % ± 1.19
Run 3 76.90 % ± 17.35 98.63 % ± 1.18
Run 4 76.81 % ± 17.35 98.64 % ± 1.18
Run 5 76.70 % ± 17.15 98.66 % ± 1.17
Average 76.91 % ± 17.22 98.66 % ± 1.12
Pattern Anal Applic
123
portion of examples from each class to be used for the
training partition DT), and Ip (upper bound on the portion of
training instances to be used to initialize individuals). The
goal is to analyze the performance of MOPG when varying
these parameters and to determine an acceptable set of pa-
rameters for the benchmark we considered. Hopefully, the
experimental study from this section will help other re-
searchers using MOPG fix the parameters for their particular
problems.
We ran MOPG using different parameter values and
recorded the average and standard deviation (over 590 re-
sults, 59 data sets and 10 partitions from tenfold cross
validation) of accuracy and reduction. For this experiment
we report the results obtained in a single run, because there
are many parameter configurations, and it would be very
time consuming to report average results of multiple runs.
Anyway, in Sect. 5.1 we showed evidence of the repro-
ducibility of MOPG results; besides, please also notice that
it is not our aim to find the best configuration of parameters
but rather to analyze the performance of MOPG under
different settings.
When evaluating a specific parameter the values of the
other parameters were fixed. Thus, unless otherwise stated,
the default parameter values were Npop = 50, g = 50, η = 0.3,
Ip = 0.1. The results of this experiment are shown in Table 4.
It can be seen from Table 4 that better accuracy was
obtained when using larger values for the number of in-
dividuals and generations. This is a somewhat expected
result as more individuals imply3 that larger portions of the
search space can be explored. On the other hand, a larger
number of generations implies that the search is more
intensive, which may lead to overfitting the data. However, it
seems that our mechanism of updating the validation par-
tition in each generation allows us to overcome this phe-
nomenon, to some extent. Notwithstanding, one should
note that the differences in performance are not very large.
Thus, MOPG is somewhat robust to these parameters.
Regarding the reduction performance all the values con-
sidered for number of individuals and generations virtually
obtained the same performance.
Regarding parameter η, the best performance was
obtained when η = 0.3. This result indicates that a large
number of instances in the validation data set (70 % of the
instances in D when η = 0.3) may lead to better solutions.
This can be due to the fact that a large sample for evaluation
forces MOPG to select prototypes that generalize
better for those amounts of data. The values η = 0.1 and 0.7
also obtained very competitive performance in terms of
accuracy. However, lower values of η are preferred, as the
reduction is larger: as expected, the smaller the number of
instances, the higher the reduction, and vice versa.
The last parameter under analysis in this section is Ip,
the upper bound on the number of instances used to
generate individuals in the initial population. From Table 4
it can be seen that this parameter makes MOPG behave as
expected and, in fact, illustrates the accuracy/reduction
dilemma: small values of Ip result in solutions with extremely
high reduction values but low accuracy. Thus, the
best value for Ip would be the configuration that offers the
best tradeoff; Ip = 0.05 seems to be the best alternative. It
is interesting that using only 0.5 % (i.e., Ip = 0.005) of the
total number of instances, MOPG is still able to obtain
competitive solutions (71.19 % in accuracy) with an extreme
reduction (99.01 %). Therefore, if the user has a
priori preferences for accuracy or reduction, the value of Ip
must be set accordingly.
Table 3 Comparison of the performance obtained by MOPG when using the proposed selection strategy (accuracy-based) with the best
solution in the Pareto front
Method All Large Small
Accuracy
Accuracy-based (proposal) 76.91 ± 17.22 81.33 ± 20.86 74.81 ± 15.02
Reduction-based 72.95 ± 16.79 73.20 ± 19.84 72.83 ± 15.42
Distance-from-optimum 75.78 ± 17.02 79.62 ± 20.53 73.95 ± 15.03
Best 79.07 ± 16.32 82.01 ± 20.25 77.68 ± 13.68
Reduction
Accuracy-based (proposal) 98.67 ± 1.17 99.39 ± 0.32 98.32 ± 1.26
Reduction-based 98.98 ± 1.27 99.75 ± 0.21 98.62 ± 1.41
Distance-from-optimum 98.90 ± 1.24 99.66 ± 0.22 98.54 ± 1.36
Best 98.98 ± 1.27 99.75 ± 0.21 98.62 ± 1.41
Also, we report the performance of alternative techniques (reduction-based and distance-from-optimum). We show results for all (59 data sets),
small (40 data sets) and large (19 data sets)
3 Please note that, in general, in evolutionary algorithms large
populations do not necessarily mean better performance. This
behavior is observed when the search space has not been explored
extensively, which is beneficial for avoiding overfitting.
5.4 Evolutionary operators
In this section, we evaluate the performance of MOPG
when varying the crossover and mutation probabilities.
As in the previous section, the goal of our experiment is
to determine the impact that each of these parameters
has on the performance of MOPG. We ran MOPG using
different values for Prc and Prm, the probabilities of
crossover and mutation, respectively. The results of
these experiments are shown in Fig. 2. As before, we
report the average (over 590 results, 59 data sets and 10
partitions from tenfold cross validation) of accuracy and
reduction.
From Fig. 2 it can be seen that reduction performance
is roughly the same for all of the configurations of values
of Prc and Prm. Regarding accuracy (left plot), different
values of Prm do not seem to significantly modify the
performance of MOPG, although slightly better results
were obtained with larger values of Prm. Regarding the
crossover parameter Prc, accuracy increases considerably
(up to 2 %) as the value of Prc increases. Therefore,
crossover seems to play a key role in MOPG. This is
understandable as the crossover operator allows solutions
to condense prototypes and also to interchange prototypes
between parent solutions.
5.5 Comparison with related works
We compare now the performance of MOPG to that ob-
tained by alternative approaches that have used exactly
the same data. For comparison we considered the 25
methods4 evaluated in [28]. Also, we consider the PG
method introduced in [14], which outperforms most
methods from the previous study. Finally, we also com-
pare the performance of MOPG to that obtained by the
methods in [30], which, to the best of our knowledge, are
the techniques that have obtained the best results for the
data sets we consider.
For this experiment we used the best configuration of
parameters for MOPG found in our previous study
(g = 250 generations, Npop = 250 individuals, η = 0.1
training-set size, Ip = 0.005 initial-population bound). In fact, we
are reporting for MOPG the average performance of 5 runs,
as described in Sect. 5.1. Please note that the same procedure
for fixing parameters was followed for all of the other
methods we compare to, i.e., the results for the other
methods were obtained using the best parameter configurations
on the test sets, as recommended by the authors
of the corresponding papers; see⁵ [14, 28, 30].
Table 5 shows a summary of the comparison of MOPG
and methods evaluated in [28]. In this summary table we
considered the best methods in terms of accuracy
(GENN [20]) and reduction (PSCSA [16]). We also show
the performance obtained by 1NN.
From Table 5 it can be seen that in terms of accuracy,
our method obtains lower accuracy performance than
GENN when considering all of the data sets. In small data
Table 4 Performance of MOPG under different parameter settings
Parameter Value Accuracy Reduction
Individuals (Npop) 50 71.68 % ± 18.18 97.24 % ± 1.21
 100 72.25 % ± 17.80 97.26 % ± 1.25
 250 73.13 % ± 18.10 97.23 % ± 1.31
Generations (g) 50 71.68 % ± 18.18 97.24 % ± 1.21
 100 72.71 % ± 18.03 97.53 % ± 1.29
 250 73.32 % ± 18.11 97.62 % ± 1.34
 500 73.37 % ± 18.08 97.70 % ± 1.31
Training-set size (η) 0.1 72.11 % ± 18.33 98.84 % ± 1.09
 0.3 73.07 % ± 18.18 98.19 % ± 1.13
 0.5 71.68 % ± 18.18 97.24 % ± 1.21
 0.7 72.12 % ± 18.44 96.85 % ± 1.38
 0.9 70.30 % ± 20.39 96.38 % ± 1.67
Initial prot. (Ip) 0.005 71.19 % ± 18.18 99.01 % ± 1.28
 0.01 71.20 % ± 18.31 98.97 % ± 1.26
 0.05 72.14 % ± 18.74 98.37 % ± 1.16
 0.1 71.68 % ± 18.18 97.24 % ± 1.21
 0.2 73.07 % ± 17.76 95.19 % ± 1.82
 0.4 73.20 % ± 17.72 89.86 % ± 3.87
4 Please note that the 25 methods have been evaluated on small data
sets, but only 20 out of the 25 were evaluated on large data sets [28].
Five methods were not considered for large data sets because they
were too computationally expensive; see [28] for details.
5 See also http://sci2s.ugr.es/pgtax/.
sets, the gain of GENN over MOPG is small, but for large
data sets MOPG and GENN achieve virtually the same
performance. This is a very positive result because, even
when MOPG did not outperform GENN in terms of ac-
curacy, MOPG achieves much higher reduction rates than
GENN. Moreover, the fact that MOPG performs better on
large data sets is encouraging as the main target of PG
methods is precisely large databases. Likewise, in terms of
reduction, PSCSA outperforms MOPG. The gain of
PSCSA over MOPG in reduction is small; nevertheless, we
can see that MOPG significantly outperforms PSCSA in
terms of accuracy. Therefore, we can conclude that MOPG
offers a better tradeoff between accuracy and reduction
than the best methods (in either aspect) considered in [28].
Figure 3 graphically shows a comparison between the
25 methods considered in [28] (plus the GPGP method
introduced in [14]) and our proposal for small and large
data sets in terms of reduction (y-axis) and accuracy (x-
axis). Regarding small data sets, MOPG is outperformed by
three methods in terms of accuracy: GENN, ICPI and PSO.
However, the reduction performance of MOPG is better
than any of these methods. In terms of reduction, our
method outperforms most of the evaluated techniques.
Regarding large data sets, it can be seen from Fig. 3 that
our method offers a better tradeoff between accuracy and
reduction than any other method, as MOPG is located at the
upper right corner of the plot. It obtains similar accuracy as
GENN and GPGP but its reduction performance is better.
To the best of our knowledge the results obtained with
MOPG for large data sets in the suite provided in [28] are
the best ones reported so far. MOPG also outperforms our
own previous work GPGP [14], a very effective method
that was recently introduced.
Figure 4 shows boxplots of a tradeoff performance es-
timate for each method that has been evaluated on the data
sets from Table 1. Boxplots report the average, across
small and large data sets, of reduction × accuracy as obtained
by MOPG and each of the other PG methods. It is
clear that, on average, MOPG offers the best tradeoff be-
tween both objectives, for both small and large data sets.
For small data sets the second/third best methods were
PSO [23] and MSE [10], respectively. Regarding large
data sets the second/third best methods were GPGP [14]
and PSO [23]. The average performance for both small and
large data sets was worse than that obtained by MOPG.
A Wilcoxon signed-rank test6 comparing the performances
of MOPG to the other PG methods revealed that for large
data sets there is a statistically significant difference in
tradeoff performance for all methods except MSE and PSO, whereas
for small data sets all of the differences were statistically
significant. From the results presented so far we can con-
clude that the multi-objective approach is indeed obtaining
Fig. 2 Performance of MOPG under different crossover (left) and mutation (right) rates
Table 5 Average performance and percentage of reduction obtained by MOPG for all the data sets; we also show the separate performance
obtained in small and large sets
Test-set accuracy Training-set reduction
Measure All Small Large All Small Large
MOPG 76.91 % ± 17.22 74.81 % ± 15.02 81.33 % ± 20.86 98.67 % 98.29 % 99.39 %
GENN 77.46 % ± 17.71 75.64 % ± 15.45 81.33 % ± 21.70 17.71 % 18.62 % 15.76 %
PSCSA 66.90 % ± 19.67 66.82 % ± 18.74 67.07 % ± 22.05 99.01 % 98.58 % 99.88 %
1NN 75.77 % ± 18.73 73.48 % ± 16.64 80.60 % ± 22.24 0 % 0 % 0 %
6 This is the statistical test recommended by Demsar for comparing
classification methods over multiple data sets [11].
solutions that offer a better tradeoff between accuracy and
reduction than most other techniques proposed so far.
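The statistical comparison above uses the Wilcoxon signed-rank test. A minimal sketch of the test statistic for paired per-data-set scores follows (the function name is ours; no ties or zero differences are handled, and in practice one would use a library routine such as SciPy's `scipy.stats.wilcoxon`):

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples.

    Minimal illustrative version: assumes no zero differences and no tied
    absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the pairs by absolute difference (rank 1 = smallest).
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)
    w_minus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] < 0)
    return min(w_plus, w_minus)
```

A small statistic indicates the differences are consistently in one direction, so the null hypothesis of equal median performance is rejected.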
We now compare the performance of MOPG with the
methods based on differential evolution that were pro-
posed in [30]. To the best of our knowledge these meth-
ods are the ones that have obtained the best performance
so far on PG. We considered three methods out of the 15
variants evaluated in [30]: SFLSDE/RandtoBest/1/Bin is
the best PG method for small data sets, SFLSDE/Rand/1/
Bin is the best one for large data sets, and SSMA+
SFLSDE/RandtoBest/1/Bin is the best method overall.
One should note, however, that the latter method is a
hybrid that combines PS (SSMA) and PG (SFLSDE)
methods; hence its results are not directly comparable
to MOPG. For this experiment we only considered 56 out
of the 59 data sets from Table 1. Three large data sets
were discarded in [30]: ring, phoneme and nursery. This
is because we wanted to use exactly the same data sets
that were used in the reference study.
Table 6 shows the tradeoff results obtained by the best
methods reported in [30] and MOPG. We report two mea-
sures of performance tradeoff: the product (reduction × accuracy)
and the harmonic mean of both objectives,
(2 × reduction × accuracy)/(reduction + accuracy),
to better appreciate the differences among methods.
From Table 6 it can be seen that MOPG obtains a per-
formance comparable (yet worse) to that of the three
Fig. 3 Reduction vs. accuracy in small (top) and large (bottom) data sets
Fig. 4 Box plot of reduction × accuracy in small (top) and large (bottom) data sets. The x-axis indicates the different PG methods
considered in our study
Table 6 Tradeoff (accuracy–reduction) performance for selected methods
Method References Large Small
Reduction × Accuracy
MOPG Ours 80.05 73.56
SFLSDE/RandtoBest/1/Bin [30] 81.54 72.23
SFLSDE/Rand/1/Bin [30] 81.67 71.88
SSMA+SFLSDE/RandtoBest/1/Bin [30] 81.64 74.95
(2 × Reduction × Accuracy)/(Reduction + Accuracy)
MOPG Ours 88.97 84.97
SFLSDE/RandtoBest/1/Bin [30] 89.92 84.48
SFLSDE/Rand/1/Bin [30] 89.99 84.25
SSMA+SFLSDE/RandtoBest/1/Bin [30] 90.02 86.15
methods based on differential evolution. Results are con-
sistent for both evaluation measures. When considering
large data sets (column 3, large), the reference PG methods
(i.e., SFLSDE/Rand/1/Bin and SSMA+SFLSDE/RandtoBest/1/Bin)
outperform ours by about 1 %. However, the performance
on small data sets is virtually the same (column
4, small). The margin of improvement is higher for the
hybrid method (i.e., SSMA+SFLSDE/RandtoBest/1/Bin).
One should note that MOPG outperforms most variants of
differential evolution proposed in [30] (comparison not
shown here) as well as other PG methods considered for
comparison by the authors (9 other methods including
PSO, ENPC, PSCSA). From the results in Table 6 we can
conclude that the performance of MOPG is competitive
with the most effective methods in the state of the art.
6 Conclusions
We introduced MOPG, a novel prototype generation (PG)
method based on multi-objective optimization. We ap-
proach the PG problem as one of multi-objective opti-
mization where we aim to simultaneously optimize
accuracy and reduction of prototypes. Our working hy-
pothesis is that by simultaneously optimizing both objec-
tives we can achieve a better reduction/accuracy tradeoff.
The proposed approach was evaluated on benchmark data
and its performance was compared to many PG methods,
including the best performing ones. The contributions of
this paper can be summarized as follows. (1) Formulation
of the PG problem as one of multi-objective optimization
and proposal of an effective multi-objective evolutionary
algorithm to approach the PG problem. (2) New methods
for representation, initialization, crossover and mutation
for PG using evolutionary algorithms. Likewise, an effec-
tive strategy for the selection of a single solution from the
Pareto front generated by NSGA-II. (3) Extensive eval-
uation of the proposed method over benchmark data, in-
cluding comparisons with many PG methods.
The main findings of this work can be summarized as
follows. We found that the multi-objective formulation for PG
is a promising alternative to single-objective approaches;
we hope our work can foster the development of other
multi-objective optimization methods for PG. We showed
evidence supporting the hypothesis that our proposal,
MOPG, is very competitive in terms of both objectives,
reduction and accuracy. MOPG outperforms most PG
methods proposed so far and obtains similar performance
to the best PG method proposed so far. MOPG can be
improved in many ways (for instance, for reducing its
computational cost), thus we hope our work motivates
further research on new mechanisms to improve it.
Current and future work directions on MOPG include
enhancing our method in terms of efficiency and scalability,
so as to apply it to big-data problems. For this we are
planning to use ad hoc stratification and surrogate model-
ing techniques, see [26, 27, 29]. Also, we are working on
the development of methods that can simultaneously gen-
erate prototypes and features using a multi-objective opti-
mization framework.
Acknowledgments This work was partially supported by the
LACCIR programme under project ID R1212LAC006. Hugo Jair
Escalante was supported by the internships programme of CONACyT
under grant No. 234415.
References
1. Aler R, Handl J, Knowles JD (2013) Comparing multi-objective and
threshold-moving roc curve generation for a prototype-based clas-
sifier. In: Proceedings of the fifteenth annual conference on Genetic
and evolutionary computation conference. ACM, pp 1029–1036
2. Cervantes A, Galvan IM, Isasi P (2009) AMPSO: a new particle
swarm method for nearest neighborhood classification. IEEE
Trans. Sys. Man Cybern. B 39(5):1082–1091
3. Chatelain C, Adam S, Lecourtier Y, Heutte L, Paquet T (2010)
A multi-model selection framework for unknown and/or evolutive
misclassification cost problems. Pattern Recogn. 43(3):815–823
4. Chen JH, Chen HM, Ho SY (2005) Design of nearest neighbor
classifiers: multi-objective approach. Int. J. Approx. Reason.
40:3–22
5. Coello Coello CA, Lamont GB, Veldhuizen DAV (2007) Evo-
lutionary algorithms for solving multi-objective problems. Ge-
netic and evolutionary computation, 2nd edn. Springer, USA
6. Cover T, Hart P (1967) Nearest neighbor pattern classification.
IEEE Trans. Inform. Theory 13(1):21–27
7. Cruz-Vega I, Garcia-Limon M, Escalante HJ (2014) Adaptive
surrogates with a neuro-fuzzy network and granular computing.
In: Proceedings of GECCO 2014. ACM Press, pp 761–768
8. Deb K (2001) Multi-objective optimization using evolutionary
algorithms. Wiley
9. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and
elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans.
Evol. Comput. 6(2):182–197
10. Decaestecker C (1997) Finding prototypes for nearest neighbour
classification by means of gradient descent and deterministic
annealing. Pattern Recogn. 30(2):281–288
11. Demsar J (2006) Statistical comparisons of classifiers over mul-
tiple data sets. J Mach Learn Res 7:1–30
12. Dos-Santos EM, Sabourin R, Maupin P (2008) A dynamic
overproduce-and-choose strategy for the selection of classifier
ensembles. Pattern Recogn. 41:2993–3009
13. Eiben AE, Smith JE (2010) Introduction to evolutionary com-
puting. Natural computing. Springer
14. Escalante HJ, Mendoza KM, Graff M, Morales-Reyes A (2013)
Genetic programming of prototypes for pattern classification. In:
Proceedings of IbPRIA 2013, vol. 7887 of LNCS. Springer,
pp 100–107
15. Fernandez F, Isasi P (2004) Evolutionary design of nearest pro-
totype classifiers. J. Heuristics 10:431–454
16. Garain U (2008) Prototype reduction using an artificial immune
system. Pattern Anal. Appl. 11(3–4):353–363
Pattern Anal Applic
123
17. Garcia S, Derrac J, Cano JR, Herrera F (2012) Prototype selection
for nearest neighbor classification: taxonomy and empirical
study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3):417–435
18. Hastie T, Tibshirani R, Friedman J (2001) The elements of sta-
tistical learning. Springer, New York
19. Kim SW, Oommen BJ (2003) A brief taxonomy and ranking of
creative prototype reduction schemes. Pattern Anal. Appl.
6:232–244
20. Koplowitz J, Brown T (1981) On the relation of performance to
editing in nearest neighbor rules. Pattern Recogn. 13(3):251–255
21. Li J, Wang Y (2013) A nearest prototype selection algorithm
using multi-objective optimization and partition. In: Proceedings
of the 9th International Conference on Computational Intelli-
gence and Security. IEEE, pp. 264–268
22. Lozano M, Sotoca JM, Sanchez JS, Pla F, Pkalska E, Duin RPW
(2006) Experimental study on prototype optimisation algorithms
for prototype-based classification in vector spaces. Pattern
Recogn. 39(10):1827–1838
23. Nanni L, Lumini A (2008) Particle swarm optimization for pro-
totype reduction. Neurocomputing 72(4–6):1092–1097
24. Olvera A, Carrasco-Ochoa JA, Martinez-Trinidad JF, Kittler J
(2010) A review of instance selection methods. Artif. Intell. Rev.
34:133–143
25. Storn R, Price KV (1997) Differential evolution: a simple and
efficient heuristic for global optimization over continuous spaces.
J. Global Optim. 11(10):341–359
26. Rosales A, Coello CA, Gonzalez J, Reyes CA, Escalante HJ
(2013) A hybrid surrogate-based approach for evolutionary multi-
objective optimization. In: Proceedings of Congress on Evolu-
tionary Computation 2013. IEEE, pp 2548–2555
27. Rosales A, Gonzalez J, Coello CA, Escalante HJ, Reyes CA
(2014) Surrogate-assisted multi-objective model selection for
support vector machines. Neurocomputing (in press)
28. Triguero I, Derrac J, Garcia S, Herrera F (2012) A taxonomy and
experimental study on prototype generation for nearest neighbor
classification. IEEE Trans. Sys. Man Cybern. C 42(1):86–100
29. Triguero I, Peralta D, Bacardit J, Garcia S, Herrera F (2014)
MRPR: a mapreduce solution for prototype reduction in big data
classification. Neurocomputing (in press)
30. Triguero I, Garcia S, Herrera F (2011) Differential evolution for
optimizing the positioning of prototypes in nearest neighbor
classification. Pattern Recogn. 44:901–916
31. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H,
McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M,
Hand DJ, Steinberg D (2007) Top 10 algorithms in data mining.
Knowl. Inf. Syst. 14(1):1–37
32. Xia H, Zhuang J, Yu D (2013) Novel soft subspace clustering
with multi-objective evolutionary approach for high-dimensional
data. Pattern Recogn. 46:2562–2575