THEORETICAL ADVANCES
MOPG: a multi-objective evolutionary algorithm for prototype generation
Hugo Jair Escalante • Maribel Marin-Castro • Alicia Morales-Reyes •
Mario Graff • Alejandro Rosales-Perez • Manuel Montes-y-Gomez •
Carlos A. Reyes • Jesus A. Gonzalez
Received: 26 March 2014 / Accepted: 22 January 2015
© Springer-Verlag London 2015
Abstract Prototype generation deals with the problem
of generating a small set of instances, from a large data
set, to be used by KNN for classification. The two key
aspects to consider when developing a prototype gen-
eration method are: (1) the generalization performance
of a KNN classifier when using the prototypes; and (2)
the amount of data set reduction, as given by the number
of prototypes. Both factors are in conflict because, in
general, maximizing data set reduction implies decreasing
accuracy and vice versa. Therefore, this problem can
be naturally approached with multi-objective optimiza-
tion techniques. This paper introduces a novel multi-
objective evolutionary algorithm for prototype gen-
eration where the objectives are precisely the amount of
reduction and an estimate of generalization performance
achieved by the selected prototypes. Through a com-
prehensive experimental study we show that the pro-
posed approach outperforms most of the prototype
generation methods that have been proposed so far.
Specifically, the proposed approach obtains prototypes
that offer a better tradeoff between accuracy and re-
duction than alternative methodologies.
Keywords Prototype generation · Evolutionary
algorithms · 1NN classification · Multi-objective
optimization
1 Introduction
k-Nearest neighbors (KNN) is one of the most used
models for pattern classification [31]. This is due in part to
its asymptotic behavior as the number of training instances
tends to infinity [6]. Also, bias and variance of the KNN
model can be adjusted by varying the value of k [18]. In
addition to its effectiveness, KNN is quite popular because
it is easy to understand and implement. Unfortunately,
there are two main issues that limit the application of KNN
to certain domains: memory storage requirements and ef-
ficiency. On the one hand, KNN is known to be very ef-
fective when a large number of instances are available.
However, a large set of training objects implies (1) the
requirement of storing all of the training objects into
memory and (2) estimating the distance (or similarity) from
a test object to all of the training objects each time a new
object has to be classified. This is indeed a major com-
plication because nowadays big-data problems are be-
coming ubiquitous. Therefore, to keep KNN as a suitable
option for cutting-edge classification problems, strategies
for making it scalable and efficient must be applied.
Prototype-based classification (PBC) aims to amend
these issues for KNN [17, 24, 28]. PBC is a methodology
that considers only a subset of representative training in-
stances for making predictions, still under KNN’s classi-
fication rule. In PBC there are two main ways of reducing
the original training set: prototype selection (PS) and pro-
totype generation (PG). The goal of PS methods is to se-
lect, from the training data, the objects that better
H. J. Escalante (✉) · M. Marin-Castro · A. Morales-Reyes ·
A. Rosales-Perez · M. Montes-y-Gomez · C. A. Reyes · J. A. Gonzalez
INAOE, Luis Enrique Erro No. 1, Tonantzintla,
Puebla 72840, Mexico
e-mail: [email protected]
M. Graff
INFOTEC, Catedras CONACyT, Aguascalientes, Mexico
Pattern Anal Applic
DOI 10.1007/s10044-015-0454-6
summarize the whole data set [17]. PG methods, on the
other hand, generate new objects to be used as prototypes
by selecting and combining objects in the training set [28].
Although there are no comprehensive studies comparing
PG and PS strategies¹, generation methods are more general
than selection ones and, in fact, PS can be considered a
special case of PG [28].
The success of PG methods is usually assessed by looking
at two aspects separately: (1) the classification performance
obtained by a KNN classifier when using the prototypes
(accuracy), and (2) the amount of reduction achieved with
respect to the original training set (reduction). PG methods
that provide the best tradeoff between both measures are
preferred. Unfortunately, reduction and accuracy objectives
are in conflict because, in general, maximizing reduction
(i.e., favoring the generation of fewer prototypes) causes a
decrease in accuracy when using the prototypes, and vice
versa. Multi-objective optimization, therefore, seems to be an
appropriate methodology for developing a PG method that
looks for prototypes that simultaneously optimize accuracy
and reduction (that is, obtaining the best possible accuracy
with the smallest number of prototypes). This is precisely
the formulation adopted in this paper.
This paper introduces MOPG (multi-objective prototype
generation), a novel multi-objective optimization approach
for PG where the objectives are precisely (1) the amount of
reduction and (2) an estimate of generalization performance
achieved by the generated prototypes. The NSGA-II algo-
rithm is used to model the PG problem [9], where ad hoc
representation and evolutionary operators are proposed. We
propose a strategy to select a single solution from the Pareto
set generated through the optimization process. Through a
comprehensive experimental study we show that the pro-
posed approach outperforms most of the PG methods that
have been proposed so far. A total of 59 data sets with dif-
ferent characteristics (number of classes, instances, attributes,
etc.) were considered for experimentation. We
found that the proposed approach offers a better tradeoff
between the considered objectives than alternative method-
ologies. Although several evolutionary algorithms and re-
lated techniques have been used for PG, to the best of our
knowledge the PG problem has not yet been approached with
multi-objective optimization (see Sect. 2).
This paper is organized as follows. The next section reviews
related work on PG with emphasis on those based on
heuristic optimization. Section 3 introduces the proposed
MOPG approach. Section 4 describes experimental settings
and Sect. 5 reports experimental results obtained by
MOPG. Finally, Sect. 6 outlines conclusions and future
work directions.
2 Related work
A taxonomy and comparative study among several PG
methods is reported by Triguero et al. [28]. There, PG
methods are classified according to different dimensions,
including: type of reduction (incremental, decremental,
fixed and mixed); type of resulting set (condensation,
edition or hybrid); generation mechanism (class relabeling,
centroid-based, space splitting, and positioning adjustment)
and evaluation criteria (filter, semi-wrapper and wrapper).
A total of 32 different strategies are classified and an ex-
perimental comparison among 25 of these methods is re-
ported. In this paper, we compare our proposal to the same
25 methods and to two others recently proposed [14, 30].
It can be concluded from [28] that no single PG method
wins in terms of both accuracy and reduction. This is
understandable as most of the considered methods have
particular characteristics that make them appropriate to
tackle a single objective, either reduction or accuracy. For
instance, the method achieving the highest generalization
performance was GENN (Generalized Editing using
NN) [20], a decremental method that removes and relabels
instances. GENN is a conservative method because it edits
instances only to an extent that does not considerably harm
classification performance. It achieves substantially better
accuracy than any other of the compared PG methods;
however, it was among the worst in terms of
reduction. On the other hand, PSCSA (Prototype Selection
Clonal Selection Algorithm) obtained the best performance
in terms of reduction [16]. PSCSA models the PG problem
with an artificial immune system, the clonal selection al-
gorithm. This method selects exactly one example per
class, achieving the best reduction performance among
the 25 methods considered in [28].
However, its performance in terms of accuracy is worse
than several other strategies. This paper introduces a multi-
objective optimization approach to the PG problem. Our
hypothesis is that by explicitly and simultaneously opti-
mizing both objectives, reduction and accuracy, we can
obtain solutions that offer a better tradeoff between these
objectives.
Among the diversity of PG methods proposed so far, in
recent studies PG methods based on evolutionary algo-
rithms and related techniques have reported better results
than alternative approaches [2, 14–16, 23, 28, 30]. Usually,
these methods start from a set of solutions (sets of proto-
types) and then modify them according to specific op-
erators, through an iterative search procedure that attempts
to optimize a criterion related to the classification per-
formance of the prototypes.
A PG method based on particle swarm optimization
(PSO) was proposed in [23]. A standard PSO algorithm
was designed where the authors try to minimize the
1 One should note there are works comparing a few PS and PG
methods over a small number of data sets [19, 22].
classification error in the training set. The method is run
several times to obtain varied solutions (sets of prototypes).
When classifying a new object the outputs of the whole set
of prototypes are combined via voting. The ensemble
strategy allows this method to obtain better results than
many other methods evaluated in [28]; in fact, this solution
is among the best methods in terms of the reduction–ac-
curacy tradeoff, see Sect. 5. Another variant of PSO,
adaptive Michigan PSO (AMPSO), has also been used for
the generation of prototypes. In AMPSO, each particle of
the swarm is associated with a prototype in such a way that
the whole population is the set of prototypes that are op-
timized [2]. This method achieves similar reduction per-
formance to PSO but obtains lower accuracy.
Regarding evolutionary algorithms, successful ap-
proaches have been proposed as well. For instance, Fer-
nandez et al. proposed ENPC (Evolutionary Design of NN
Classifiers), an evolutionary algorithm that starts from a
single individual that is evolved by applying a variety of
operators that combine and split prototypes [15]. The
method is able to automatically determine the number of
prototypes and requires little information from the user.
ENPC is able to obtain very competitive performance in
terms of accuracy but it is not among the best methods in
reduction. Escalante et al. proposed a PG method based on
genetic programming (GPGP) [14]. The idea consists of
exploiting a tree structure to combine instances using
arithmetic operators. The objective function combines the
accuracy and reduction criteria in a single formula. GPGP
is able to obtain very effective prototypes. Although GPGP
is not included in the comparative study from [28], it
outperforms most of the methods evaluated there.
Finally, Triguero et al. [30] have recently proposed very
effective PG methods. The authors proposed several vari-
ants of the differential evolution technique [25] to ap-
proach the PG problem. The most effective methods
reported in [30] are hybrids that apply a PS algorithm
followed by a PG method based on differential evolution;
as a result, these variants are computationally expensive.
However, the performance of these methods in both
reduction and accuracy is much higher than that of previous
approaches.
Among all methods proposed so far for PG, in most
cases a single objective (either reduction or accuracy) is
being optimized. Some methods combine both objectives
into a single one or exploit the structure of the opti-
mization strategy to incorporate one of the objectives. To
the best of our knowledge no method has been proposed
that aims to optimize accuracy and data set reduction si-
multaneously for PG. Hence, our proposal is novel in that
sense. We are aware of a few works [1, 4, 21] that are
somewhat related to ours. Li and Wang approached the pro-
totype selection problem using a genetic algorithm [21].
Although they call their method multi-objective, they
optimized a single-objective function that combines re-
duction and classification performances. The performance
of their prototype selection method is significantly worse
than that obtained by our PG method in this paper when
using the same databases. Chen et al. [4] proposed a
multi-objective optimization approach (based on the
IMOEA algorithm) for simultaneously selecting features
and instances. However, the proposed method was only
evaluated on small databases. Finally, Aler et al. [1] de-
ploy a multi-objective optimization method that aims to
minimize false-negative and false-positive rates of a
prototype-based classifier. However, reduction perfor-
mance was not optimized therein.
In Sect. 5 we compare the performance of our method
with most of the proposals reviewed in this section and
others considered in [28]. Experimental results show that
MOPG is an effective solution to the PG problem.
3 MOPG: Multi-objective prototype generation
This section describes the proposed approach to prototype
generation (PG). First, the considered scenario is formally
described. Next, the PG problem is posed as one of multi-
objective optimization. Then we provide a detailed de-
scription of the proposed method for PG.
3.1 Considered scenario
Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,N}$ be a data set for
a classification problem with $N$ training instances, where
$\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in C = \{C_1, \ldots, C_K\}$, $d$ is
the dimensionality of the data and $C$ is the set of classes. The goal of PG
methods is to obtain a set of instances
$\mathcal{P} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,P}$ with $P \ll N$, under
the constraints that each pair $(\mathbf{x}_i, y_i) \in \mathcal{P}$
satisfies $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in C$, and that there is
at least one $(\mathbf{x}_i, y_i)$ for each class in $C$. One should note
that for PS there is the additional restriction that each pair
$(\mathbf{x}_i, y_i) \in \mathcal{P}$ is an element of $\mathcal{D}$ [17];
thus PG methods have more freedom in the design of prototypes, as no
restriction is enforced on the relation between instances in $\mathcal{P}$
and $\mathcal{D}$.
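The feasibility constraints above can be checked directly. The following is a small Python sketch (our illustration, with our own function and variable names, not code from the paper): the PG constraint only requires that every class is covered, while PS additionally requires every prototype to be an element of the original training set.

```python
def is_feasible_pg(P_labels, classes):
    """PG constraint: the prototype set covers every class at least once."""
    return set(classes) <= set(P_labels)


def is_feasible_ps(P_rows, P_labels, D_rows, D_labels, classes):
    """PS adds the restriction that every prototype is an element of D."""
    if not is_feasible_pg(P_labels, classes):
        return False
    D_pairs = {(tuple(x), y) for x, y in zip(D_rows, D_labels)}
    return all((tuple(x), y) in D_pairs for x, y in zip(P_rows, P_labels))
```

This makes concrete why PG generalizes PS: any PS-feasible set is PG-feasible, but not conversely.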
3.2 Prototype generation as multi-objective
optimization
Besides the above elementary restrictions, the ultimate goal in PG is that,
using $\mathcal{P}$, one obtains at least the same performance as one would
obtain when using $\mathcal{D}$ for the classification of unseen data
$\mathcal{T} = \{(\mathbf{x}_j, y_j)\}_{j=1,\ldots,T}$. The KNN
classification rule is considered and in most cases $K = 1$ (i.e., 1NN). Let
$\gamma(\mathcal{P}, \mathcal{D})$ be an estimate of
the generalization performance of a 1NN classifier when using $\mathcal{P}$
as training data (e.g., $\gamma$ could be a hold-out estimate), and let
$\delta(\mathcal{D}, \mathcal{P}) = 1 - \frac{P}{N}$ be the proportion of
reduction achieved by $\mathcal{P}$ (recall $P$ is the number of prototypes
in $\mathcal{P}$). The PG problem can be formulated as that of finding
$\mathcal{P}$ such that $\gamma(\mathcal{P}, \mathcal{D})$ and
$\delta(\mathcal{D}, \mathcal{P})$ are maximized.
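The two objectives can be sketched in a few lines of NumPy (our illustration, not code from the paper; the paper leaves $\gamma$ abstract, and here it is instantiated as the hold-out 1NN accuracy mentioned above):

```python
import numpy as np


def reduction(P_size, N):
    """delta(D, P) = 1 - P/N: the fraction of the training set removed."""
    return 1.0 - P_size / N


def holdout_accuracy(P_X, P_y, V_X, V_y):
    """gamma: 1NN accuracy of the prototypes on a held-out validation
    set, used as a simple hold-out estimate of generalization."""
    correct = 0
    for x, y in zip(V_X, V_y):
        d = np.linalg.norm(P_X - x, axis=1)   # distance to each prototype
        correct += (P_y[np.argmin(d)] == y)   # 1NN prediction vs. truth
    return correct / len(V_y)
```

A PG method then searches for prototype sets scoring well on both functions at once.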
Choosing solutions $\mathcal{P}$ for which $P$ is very small (i.e.,
$\delta(\mathcal{D}, \mathcal{P})$ is maximized) may cause
$\gamma(\mathcal{P}, \mathcal{D})$ to decrease. This is because, in general,
the fewer the instances available to approximate the underlying
classification problem, the lower the capacity of the 1NN classifier. For
example, consider the case in which $P = K$, with $K$ the number of classes;
in this scenario, highly biased classifiers with relatively low variance are
built. On the other hand, solutions with a large value of $P$ (i.e., small
$\delta(\mathcal{D}, \mathcal{P})$) may result in large values of
$\gamma(\mathcal{P}, \mathcal{D})$. This is because one can have a
$\mathcal{P}$ with as many points as instances in $\mathcal{D}$; for
instance, when $P = N$ we may perfectly classify every subset of points in
$\mathcal{D}$, resulting in unbiased classifiers that may show very high
variance. Therefore, maximizing $\delta(\mathcal{D}, \mathcal{P})$ may cause
$\gamma(\mathcal{P}, \mathcal{D})$ to decrease and vice versa. Since both
objectives are in conflict, this problem can be naturally approached with
multi-objective optimization: looking for the $\mathcal{P}$ that offers the
best tradeoff between the reduction and accuracy objectives.
In a multi-objective optimization problem one aims to find the solution
$\mathbf{x}^{*}$ that simultaneously maximizes a vector of $q$ objective
functions, $\mathbf{f}(\mathbf{x}^{*}) = \langle f_1(\mathbf{x}^{*}),
\ldots, f_q(\mathbf{x}^{*}) \rangle^{T}$, subject to the condition that the
vector of decision variables, $\mathbf{x} = \langle x_1, \ldots, x_d
\rangle^{T}$, belongs to the feasible region of the problem. In our case,
$\mathbf{x}$ is a solution to the problem at hand, the feasible region
consists of all possible sets of prototypes that satisfy the constraints
previously defined, and each $f_i$, $i \in \{1, \ldots, q\}$, is an
objective function to be maximized, where the objectives may be in conflict
with each other.
In multi-objective optimization the notion of optimum differs from that of
single-objective optimization, since the aim is to find good tradeoffs among
the objectives instead of a single best solution. The most accepted notion
of optimum in multi-objective optimization is so-called Pareto
optimality [5]. To define it, we first introduce the notion of Pareto
dominance. We say that a solution $\mathcal{P}_1$ dominates a solution
$\mathcal{P}_2$ (denoted by $\mathcal{P}_1 \succ \mathcal{P}_2$) if and only
if $\mathcal{P}_1$ is better than $\mathcal{P}_2$ in at least one objective
and not worse in the rest. A solution $\mathcal{P}^{*}$ is Pareto optimal if
there is no other solution $\mathcal{P}'$ in the entire feasible region such
that $\mathcal{P}' \succ \mathcal{P}^{*}$. Thus, a solution is Pareto
optimal if it is not possible to improve one objective without worsening
another. This definition does not produce a single solution, but a set of
tradeoff solutions among the different objectives. The set of these
solutions in the decision variable space is known as the Pareto optimal set.
The image of the elements of the Pareto optimal set in the objective space
constitutes the so-called Pareto front, see Fig. 1.
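For maximized objectives, Pareto dominance and the extraction of the non-dominated set can be sketched as follows (our illustration, a naive quadratic filter over objective vectors, not any particular MOEA's implementation):

```python
def dominates(f_a, f_b):
    """f_a dominates f_b (all objectives maximized) iff f_a is no worse
    in every objective and strictly better in at least one."""
    return (all(a >= b for a, b in zip(f_a, f_b))
            and any(a > b for a, b in zip(f_a, f_b)))


def pareto_front(points):
    """Return the objective vectors not dominated by any other vector."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For instance, with objective vectors (reduction, accuracy) = (0.9, 0.7), (0.8, 0.8) and (0.7, 0.6), the first two are mutually non-dominated tradeoffs, while the third is dominated by both.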
For PG we have two objectives, reduction and accuracy; thus we define
$f_1(\mathcal{P}) = \delta(\mathcal{D}, \mathcal{P})$ as the objective
related to reduction and
$f_2(\mathcal{P}) = \gamma(\mathcal{P}, \mathcal{D})$ as the measure of
generalization performance. Therefore, our goal is to find the solution
$\mathcal{P}^{*}$ that maximizes
$\langle f_1(\mathcal{P}), f_2(\mathcal{P}) \rangle$, subject to
$\mathcal{P} \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of feasible
solutions, that is, $\mathcal{P} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,P}$
with $\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in C$ and
$\forall C_i \; \exists y_j : y_j = C_i$, $i = 1, \ldots, K$,
$j = 1, \ldots, P$.
Each solution to this problem consists of a set of pairs of feature vectors
and their classes, $\mathcal{P} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,P}$.
Fig. 1 Pareto fronts for two selected data sets: banana and ring. Each plot
shows reduction, $f_1(\mathcal{P})$, against accuracy, $f_2(\mathcal{P})$.
We show the performance of each solution in the training (blue squares) and
test (red circles) sets, together with the solutions selected by the
different selection strategies: accuracy-based (star), distance-based
(diamond) and reduction-based (triangle)
Alternatively, one can see $\mathcal{P}$ as a matrix $\mathbf{P}$ of size
$P \times (d + 1)$, where the last column encodes the classes of the
prototypes. Hence solutions lie in an $\mathbb{R}^{P \times (d+1)}$ space,
where the value of $P$ can vary across solutions. In the rest of this paper
we will use $\mathcal{P}$ and $\mathbf{P}$ interchangeably to refer to
prototypes.
3.3 Multi-objective prototype generation
There are several techniques that can be adopted to ap-
proach the above multi-objective optimization problem, in
this paper we consider multi-objective evolutionary algo-
rithms (MOEAs). MOEAs are advantageous over classical
mathematical programming techniques because they can
obtain a set of solutions that approximate the Pareto front
in a single run, and because they are more sensitive to the
shape and continuity of the Pareto optimal front [5, 8].
Also, MOEAs can be implemented in parallel and are
somewhat effective in escaping local optima. Among the
large number of MOEAs reported in the state of the art, we
specifically considered the NSGA-II (Non-Dominated
Sorting Genetic Algorithm II) [9] method for solving this
problem. NSGA-II is one of the most used MOEAs and it
has been successfully applied to several pattern recognition
tasks (e.g., [3, 12, 32]). Compared to other MOEAs,
NSGA-II is highly efficient and it is able to generate a close
approximation to the Pareto front, while maintaining a
diverse pool of solutions.
NSGA-II is a fairly standard genetic algorithm in terms of
representation and evolutionary operators; however, it in-
corporates two main mechanisms that allow one to deal with
multi-objective optimization problems: non-dominance
sorting and crowding distance, see [5, 8, 9] for a detailed
explanation of these concepts. NSGA-II operates as follows;
see Algorithm 1. First, a population of solutions, also called
individuals, is generated and the solutions are evaluated
according to the objectives (steps 1 and 2 in Algorithm 1).
Then, an iterative process begins where evolutionary op-
erators, such as tournament selection, recombination, and
mutation are applied to generate a child population (step 5).
After that, the fitness values for each member in the child
population are estimated (step 6). Next, parent and child
populations are combined, and all non-dominated fronts are
identified using the non-dominance sorting mechanism to
rank solutions according to their non-dominance level (i.e.,
from the combined parent-child population, we identify the
solutions that are non-dominated with respect to the others;
these constitute the first level. The second level is then
formed by the non-dominated solutions among the remaining
ones, and the procedure is repeated until each solution
has been assigned to a non-dominance level) (steps 3 and 7).
A new population of individuals is selected for the next
iteration by choosing solutions from both the parent and
child populations using the previously identified non-
dominated fronts. If the size of the new population is greater
than the population size, individuals from the last added
front are chosen one by one by taking into account their
crowding distance in the objective space (steps 8-12). The
algorithm stops after g repetitions of the iterative process.
The rest of this section describes the NSGA-II
technique as we used it for PG.
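The non-dominance ranking used in steps 3 and 7 can be sketched as follows. This is our own naive quadratic illustration of the idea (peeling off one non-dominated level at a time), not Deb's fast non-dominated sort used in efficient NSGA-II implementations:

```python
def dominates(f_a, f_b):
    """f_a dominates f_b (objectives maximized) iff f_a is no worse in
    every objective and strictly better in at least one."""
    return (all(a >= b for a, b in zip(f_a, f_b))
            and any(a > b for a, b in zip(f_a, f_b)))


def non_dominated_sort(objs):
    """Partition objective vectors (by index) into non-dominance levels:
    level 1 = non-dominated in the whole pool; level 2 = non-dominated
    once level 1 is removed; and so on."""
    remaining = list(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts
```

Solutions in earlier fronts are preferred when filling the next population; crowding distance then breaks ties within the last front that fits.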
3.3.1 Initialization
A common practice in evolutionary algorithms is to ran-
domly initialize solutions [13]. In this work, however, we
initialize them using information from the training data, to
speed up the convergence of MOPG. Recall that in PG each
solution consists of a matrix P 2 RP�ðdþ1Þ, where P can
vary for each solution. Solutions are initialized on a per-
class basis, where for each class k an integer number Pk
between 1 and Nk is randomly chosen, with Nk the number
of instances in D that belong to class k. Next, we randomly
select Pk instances from the Nk that belong to class k to
generate an individual P. We repeat this process for each
Algorithm 1 NSGA-II algorithm [9]
Require: N_pop, f, g
  {N_pop: number of individuals (solutions); g: number of generations;
   f = ⟨f1(P), f2(P)⟩: objectives}
1:  Initialize population X_0
2:  Evaluate objective functions f = ⟨f1(P), f2(P)⟩, ∀P ∈ X_0
3:  Identify fronts F_1, ..., F_l by sorting solutions according to their
    non-dominance level, ∀P ∈ X_0
4:  for i = 1 to g do
5:    Create child population Q_i from X_i by applying evolutionary operators
6:    Evaluate objective functions f, ∀P ∈ Q_i
7:    Identify fronts F_1, ..., F_l by sorting solutions according to their
      non-dominance level, ∀P ∈ X_i ∪ Q_i
8:    X_{i+1} = ∅; j = 1
9:    while |X_{i+1}| < N_pop do
10:     X_{i+1} = X_{i+1} ∪ F_j; j = j + 1
11:   end while
12:   Select the last individuals for X_{i+1} from F_j using crowding distance
13: end for
class to generate each of the initial solutions of the
population. In this way, the initial population of prototypes
belongs to the data set D and a different number of pro-
totypes-per-class is allowed.
We control the number of prototypes to be considered
for the initialization of each individual of the population
through the parameter $I_p = \frac{\sum_k P_k}{N}$, where $I_p$ is
simply the fraction of the training instances that will be
considered for initialization. One should note that the
evolutionary operators we propose allow MOPG to reduce the
number of prototypes from one generation to the next (via
condensation in crossover).
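The per-class initialization can be sketched as follows. This is our illustration under our own naming; for brevity it draws each $P_k$ independently and does not enforce a target $I_p$ fraction:

```python
import numpy as np

rng = np.random.default_rng(0)


def init_solution(X, y, classes):
    """Build one initial individual: for each class k, draw an integer
    P_k in [1, N_k] and copy P_k randomly chosen class-k training
    instances, so initial prototypes all belong to D."""
    rows, labels = [], []
    for k in classes:
        idx = np.flatnonzero(y == k)            # class-k instances in D
        P_k = rng.integers(1, len(idx) + 1)     # 1 <= P_k <= N_k
        chosen = rng.choice(idx, size=P_k, replace=False)
        rows.append(X[chosen])
        labels.extend([k] * P_k)
    return np.vstack(rows), np.array(labels)
```

Repeating this for every individual yields an initial population with a different number of prototypes per class, as the text describes.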
3.3.2 Fitness functions
Solutions are evaluated with respect to the two objectives defined above:
$\mathbf{f}(\mathcal{P}) = \langle f_1(\mathcal{P}), f_2(\mathcal{P})
\rangle$. Objective $f_2(\mathcal{P}) = \gamma(\mathcal{D}, \mathcal{P})$ is
related to the generalization performance of a 1NN classifier using
prototypes $\mathcal{P}$. There are several ways of estimating the
performance of a classifier on unseen data, including cross validation,
bootstrapping, the jackknife, etc. In this work we considered a simple
hold-out estimate in which the training data $\mathcal{D}$ is further
divided into training, $\mathcal{D}_T$, and validation, $\mathcal{D}_V$,
data sets. The training partition is used to initialize the population and
to apply the evolutionary operators, whereas the validation data are used to
evaluate solutions. Hence, we define $\gamma(\mathcal{D}, \mathcal{P})$ as
the accuracy obtained by a 1NN classifier using $\mathcal{P}$ as training
data when classifying $\mathcal{D}_V$. On the other hand, since we defined
$\delta(\mathcal{D}, \mathcal{P}) = 1 - \frac{P}{N}$, we can use
$\delta(\mathcal{D}, \mathcal{P})$ directly as the objective $f_1$, which is
related to the amount of reduction.
To generate $\mathcal{D}_T$ and $\mathcal{D}_V$ we define a parameter
$g \in [0, 1]$ that controls the fraction of instances from each class to be
used for training and validation. For each class $k$, we randomly select
$\lceil N_k \cdot g \rceil$ instances from $\mathcal{D}$ and use them as the
training examples of class $k$; the other
$\lfloor N_k \cdot (1 - g) \rfloor$ instances are used for the validation
set. Repeating this process for each class, we form training and validation
partitions that maintain the original class distribution of $\mathcal{D}$.
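The stratified split can be sketched as follows (our illustration; the parameter name `g` follows the text above, and function names are ours):

```python
import math

import numpy as np

rng = np.random.default_rng(0)


def stratified_split(X, y, g):
    """Per class k, ceil(N_k * g) random instances go to the training
    part D_T and the remaining floor(N_k * (1 - g)) to the validation
    part D_V, preserving the class distribution of D."""
    train_idx, val_idx = [], []
    for k in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == k))
        n_train = math.ceil(len(idx) * g)
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:])
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```

Calling this anew in each generation gives the dynamic fitness function described next.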
In each iteration of the genetic algorithm we update
training and validation partitions, to prevent the prototypes
from overfitting a single validation data set. Each time the
partitions are updated, we re-evaluate all of the solutions in
the Pareto set with the new validation partition, to avoid
evaluating solutions in different iterations of different data.
Please note that using a dynamic fitness function it is still
possible to obtain solutions in the Pareto set that obtained
good performance in a single partition of data (e.g., if a
solution in the last iteration obtains good performance in
the last partition). Anyway, we think that even with this
limitation, a dynamic fitness function is advantageous over
optimizing performance on a fixed data partition.
3.3.3 Evolutionary operators
In evolutionary computation, variation operators are used
to generate new solutions by updating the available
ones [13], where the two fundamental operators in evolu-
tionary algorithms are crossover and mutation. When so-
lutions are encoded as numerical vectors of fixed
dimension, there is a wide diversity of variation operators
that can be applied. However, since in PG each solution is
encoded as a matrix with variable number of rows, we must
propose ad hoc operators for this representation.
The goal of the crossover operator is to generate new
(children) solutions by combining the elements that form two
other individuals (parents) with the aim that new solutions
are better than their ancestors. In PG we want to combine two
sets of prototypes $\mathcal{P}_1$ and $\mathcal{P}_2$ to generate solutions
$\mathcal{P}'_1$ and $\mathcal{P}'_2$. Accordingly, we propose a crossover
operator that interchanges the individual prototypes that form solutions
$\mathcal{P}_1$ and $\mathcal{P}_2$. Given two parent solutions
$\mathcal{P}_1$ and $\mathcal{P}_2$, we randomly select a prototype
$p_{\mathcal{P}_1} \in \mathcal{P}_1$. Then we identify those prototypes
from solution $\mathcal{P}_2$ that belong to the same class as
$p_{\mathcal{P}_1}$. We replace a randomly chosen prototype in
$\mathcal{P}_2$ with $p_{\mathcal{P}_1}$, where the replaced prototype
belongs to the same class as $p_{\mathcal{P}_1}$. Next we apply, with
uniform probability, either: (a) replace
$p_{\mathcal{P}_1} \in \mathcal{P}_1$ with a prototype randomly chosen from
$\mathcal{P}_2$ that belongs to the same class as $p_{\mathcal{P}_1}$, or
(b) replace $p_{\mathcal{P}_1} \in \mathcal{P}_1$ with the average of the
prototypes in $\mathcal{P}_2$ belonging to the same class as
$p_{\mathcal{P}_1}$. The aim of (a) is to replace a prototype with another
one that belongs to the same class, hence allowing the interchange of
information between solutions. On the other hand, the goal of (b) is to
condense a set of prototypes in such a way that the new prototype
summarizes the position in the search space of all of the prototypes of the
same class.
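The crossover can be sketched roughly as below. This is our own reading, not the authors' implementation: in particular, whether step (a) may pick back the prototype just inserted into $\mathcal{P}_2$ is ambiguous in the text and resolved arbitrarily here, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)


def crossover(P1_X, P1_y, P2_X, P2_y):
    """Class-aware crossover sketch: exchange a prototype p of P1 with a
    same-class prototype of P2, then either (a) swap a random same-class
    prototype from P2 into P1, or (b) condense: replace p in P1 with the
    mean of P2's same-class prototypes. Operates on copies."""
    P1_X, P2_X = P1_X.copy(), P2_X.copy()
    i = rng.integers(len(P1_y))                 # prototype p from P1
    same = np.flatnonzero(P2_y == P1_y[i])      # same-class rows in P2
    if len(same) == 0:
        return P1_X, P2_X                       # nothing to exchange
    j = rng.choice(same)
    P2_X[j] = P1_X[i]                           # p replaces one of them
    if rng.random() < 0.5:                      # (a) interchange
        P1_X[i] = P2_X[rng.choice(same)]
    else:                                       # (b) condensation
        P1_X[i] = P2_X[same].mean(axis=0)
    return P1_X, P2_X
```

Both branches preserve the class labels of each solution, so feasibility (one prototype per class) is maintained, and branch (b) is the condensation step that can effectively shrink a class's spread of prototypes.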
Mutation operators aim to incorporate diversity in the
population through random modifications of solutions. For
PG we propose a mutation operator that, given a solution $\mathcal{P}_1$,
generates a new solution $\mathcal{P}'_1$ in which one prototype of
$\mathcal{P}_1$ is modified. We randomly select a prototype
$p_{\mathcal{P}_1} \in \mathcal{P}_1$. Next, we apply, with uniform
probability, either: (a) add a vector of random numbers to
$p_{\mathcal{P}_1}$, where the numbers are uniformly generated in the range
of the values of each dimension of $p_{\mathcal{P}_1}$, or (b) replace
$p_{\mathcal{P}_1}$ with an instance in the training set $\mathcal{D}_T$
that belongs to the same class. The aim of (a) is to introduce slight
perturbations to solutions in the current population, whereas (b) aims to
introduce new instances (not considered during initialization) into the
individuals.
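The mutation operator admits a similar sketch (again our illustration, with our own names; the perturbation range in (a) is approximated here by the per-dimension span of the solution's own prototypes):

```python
import numpy as np

rng = np.random.default_rng(0)


def mutate(P_X, P_y, DT_X, DT_y):
    """Mutation sketch: pick one prototype and either (a) perturb it
    with uniform noise scaled to the per-dimension value range, or
    (b) replace it with a random same-class instance from D_T.
    Returns a modified copy; labels are untouched."""
    P_X = P_X.copy()
    i = rng.integers(len(P_y))
    if rng.random() < 0.5:                       # (a) perturbation
        span = P_X.max(axis=0) - P_X.min(axis=0)
        P_X[i] += rng.uniform(-span, span)
    else:                                        # (b) reseed from D_T
        same = np.flatnonzero(DT_y == P_y[i])
        if len(same) > 0:
            P_X[i] = DT_X[rng.choice(same)]
    return P_X
```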
To select solutions to which the crossover and mutation
operators will be applied, we used a binary tournament
scheme [8, 9, 13]. Crossover and mutation operators are
applied to individuals in the population with probabilities
Prc and Prm. We use default values for these probabilities,
but we also conducted experiments to assess the impact of
the parameter values and the evolutionary operators on
MOPG.
3.3.4 Selecting a single solution
The output of the NSGA-II method is a set of non-
dominated solutions found during the search, i.e., an ap-
proximation of the Pareto optimal set. Theoretically, none
of these solutions is better or worse than any other. However,
for PG a single set of prototypes is needed rather than a set
of solutions. Of course, one could use the union of
all prototypes included in all solutions, but this would be
counterproductive because the amount of reduction would de-
crease and we could have redundant and noisy prototypes.
Instead, we propose a simple mechanism to select a single
solution out of the set of non-dominated solutions.
We evaluate the classification performance of 1NN
when using the prototypes associated with each solution to
classify all the training instances ($\mathcal{D}$). Then we choose, as the
final output of our method, the solution that obtained the
highest classification performance. One should note that,
since prototypes were generated using different partitions
of training and validation data, the performance on the
whole original data set is not expected to be perfect. Thus,
we can use this performance as an indicator of the accuracy
of the prototypes. In Sect. 5 we evaluate the effectiveness
of this selection strategy.
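This accuracy-based selection can be sketched as follows (our illustration, using a brute-force 1NN accuracy over the whole training set $\mathcal{D}$; names are ours):

```python
import numpy as np


def select_solution(solutions, D_X, D_y):
    """From the non-dominated set, pick the solution whose prototypes
    give the highest 1NN accuracy on the whole training set D.
    Each solution is a (prototypes, labels) pair."""
    def acc(P_X, P_y):
        hits = 0
        for x, y in zip(D_X, D_y):
            d = np.linalg.norm(P_X - x, axis=1)
            hits += (P_y[np.argmin(d)] == y)
        return hits / len(D_y)
    return max(solutions, key=lambda s: acc(*s))
```

Because the prototypes were fitted on changing train/validation partitions rather than on $\mathcal{D}$ itself, this score behaves as an indicator of prototype quality rather than a trivially perfect resubstitution accuracy.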
3.4 Discussion
In this section we have described in detail MOPG, our
multi-objective approach to PG. MOPG uses NSGA-II for
exploring the search space of prototypes. A variable-length
representation is adopted and ad hoc evolutionary operators
have been designed. Solutions are evaluated on a subset of
the training data that is updated every iteration. Also, a
strategy for the selection of a single solution from the set of
non-dominated solutions returned by NSGA-II has been
presented.
The main benefit of MOPG when compared to other PG
methods is that the proposed method searches for solutions
that offer a good tradeoff between reduction and accuracy,
which is the ultimate goal of PG methods. Multi-objective
optimization naturally deals with this type of problem.
To the best of our knowledge no other author has ap-
proached the PG problem similarly. Besides, the ways we
evaluate the fitness functions (updating the training and
validation partitions in each iteration) and select a single
solution allow MOPG to avoid overfitting to some extent.
In Sect. 5 we show experimental results that evidence the
effectiveness of MOPG.
One should note that since MOPG is a population-based
technique, it requires the evaluation of many solutions (as
many as Npop × (g + 1)). Thus, depending on the data set size,
applying our method could be a computationally demanding
process. Fortunately, there is growing interest in the research
community in developing efficient and scalable
methods for PG (see, e.g., [29]), which could be applied to
computation there are also alternative methodologies that can
be adopted to speed up MOPG [7, 26].
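For intuition on the multi-objective machinery MOPG inherits from NSGA-II, the core notion of Pareto dominance (here with both objectives maximized, as for accuracy and reduction) can be sketched as follows; this is illustrative only, not NSGA-II's full fast non-dominated sorting:

```python
def dominates(a, b):
    """a dominates b if a is >= on every objective and > on at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(points):
    """Return the objective vectors not dominated by any other point in the set."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

For example, with (accuracy, reduction) pairs, a solution that is worse on both objectives than some other solution is filtered out, while solutions trading one objective for the other all survive.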
4 Experimental settings
This section describes the experimental settings we adopted
for the evaluation of MOPG. We considered the suite of data
sets described in Table 1. These data sets were collected by
the authors of [28] and have been used for benchmarking
many PG methods proposed so far [2, 10, 14, 15, 23, 28, 30].
The data sets are diverse in terms of number of
instances, attributes² (numerical/nominal), and classes,
which allows us to assess the performance of MOPG under
different circumstances. Data sets with a large number of
instances are considered in the benchmark. In [28] the au-
thors distinguished small (less than 2,000 instances) from
large (at least 2,000 instances) data sets.
To make our experimental results comparable with
others [14, 28, 30], we applied tenfold cross validation
over the 59 data sets to evaluate the performance of
MOPG. In each experiment, for each data set we applied
MOPG 10 times using the training partitions generated via
tenfold cross validation; we evaluated the performance of
the generated prototypes in each of the 10 runs using the
corresponding test partitions. Hence, for a single ex-
periment over all of the data sets we ran MOPG 590 times.
We evaluated two main aspects of MOPG: accuracy on
unseen data (test partitions) and amount of training set
reduction. Additionally, we evaluated the tradeoff in
performance by measuring the product (reduction × accuracy)
and the harmonic mean of both objectives,
(2 × reduction × accuracy)/(reduction + accuracy). One should note that since we used
exactly the same partitions of tenfold cross validation as
in [14, 28, 30], we can directly compare the performance of
MOPG with those PG methods.
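The two tradeoff measures used throughout the evaluation can be computed as below; `reduction` and `accuracy` are assumed to be fractions in [0, 1] (the helper names are ours):

```python
def tradeoff_product(reduction, accuracy):
    """Product of the two objectives: high only when both are high."""
    return reduction * accuracy

def tradeoff_harmonic(reduction, accuracy):
    """Harmonic mean of the two objectives; penalizes imbalance more than the product."""
    return 2 * reduction * accuracy / (reduction + accuracy)
```

The harmonic mean of two equal values is that value itself, while a solution that sacrifices one objective entirely scores near zero under both measures.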
5 Experimental results
This section reports experimental results obtained with
MOPG over the suite of data sets introduced in [28] and
described in Sect. 4.²
² One should note that the considered data sets include both numeric
and nominal attributes. For simplicity we have deliberately
transformed nominal attributes into integers and applied MOPG
without any modification.
Pattern Anal Applic
123
The goal of our experiments is to
evaluate the effectiveness of MOPG for the generation of
prototypes and to compare its performance with that re-
ported by alternative methods. First, we report results of a
study that aims at evaluating the reproducibility of results.
Next, we evaluate the effectiveness of the strategy for se-
lecting solutions from the Pareto set. Next, we assess the
performance of MOPG under different parameter settings
related to MOPG itself and to the evolutionary algorithm
we considered (NSGA-II). Finally, we compare the per-
formance of MOPG to that of other state-of-the-art
approaches.
5.1 Reproducibility of results
This section aims to provide evidence on the repro-
ducibility of results obtained with MOPG. Since MOPG is
based on NSGA-II, which is a stochastic optimization
technique, there is no guarantee that the same results will
be obtained with each execution of it. Accordingly, in this
section our aim is to determine whether results obtained
with MOPG are due to chance or not.
We assess the reproducibility of results obtained by
MOPG by performing experiments with two different pa-
rameter settings. The two configurations differed in the
population size and number of generations. For the first
configuration, called 50–50, 50 individuals and 50 gen-
erations were considered; for the second one, called
250–250, 250 individuals and 250 generations were con-
sidered (the rest of the parameter were fixed according to
the best results from Sect. 5.3). The intuition behind con-
sidering two parameter settings was to assess the repro-
ducibility of results when search is non-intensive (50–50
setting) and somewhat intensive (250–250 setting). We ran
Table 1 Description of the data sets considered for evaluation [28].
For each data set we show the number of instances (Ex), attributes
(At), numerical/nominal attributes (Nu/No), and classes (K)
Data set Ex At Nu/No K Data set Ex At Nu/No K
Abalone 4,174 8 7/1 28 Marketing 8,993 13 13/0 9
Appendicitis 106 7 7/0 2 Monks 432 6 6/0 2
Australian 690 14 8/6 2 Movement 360 90 90/0 15
Autos 205 25 15/10 6 Newthyroid 215 5 5/0 3
Balance 625 4 4/0 3 Nursery 12,960 8 0/8 5
Banana 5,300 2 2/0 2 Pageblocks 5,472 10 10/0 5
Bands 539 19 19/0 2 Penbased 10,992 16 16/0 10
Breast 286 9 9/0 2 Phoneme 5,404 5 5/0 2
Bupa 345 6 6/0 2 Pima 768 8 8/0 2
Car 1,728 6 6/0 4 Ring 7,400 20 20/0 2
Chess 3,196 36 36/0 2 Saheart 462 9 8/1 2
Cleveland 297 13 13/0 5 Satimage 6,435 36 36/0 7
Coil2000 9,822 85 85/0 2 Segment 2,310 19 19/0 7
Contraceptive 1,473 9 6/3 3 Sonar 208 60 60/0 2
Crx 125 15 6/9 2 Spambase 4,597 55 55/0 2
Dermatology 366 33 1/32 6 Spectheart 267 44 44/0 2
Ecoli 336 7 7/0 8 Splice 3,190 60 0/60 3
Flare-solar 1,066 9 0/9 2 Tae 151 5 5/0 3
German 1,000 20 6/14 2 Texture 5,500 40 40/0 11
Glass 214 9 9/0 7 Tic-tac-toe 958 9 0/9 2
Haberman 306 3 3/0 2 Thyroid 7,200 21 6/15 3
Hayes-roth 133 4 4/0 3 Titanic 2,201 3 3/0 2
Heart 270 13 6/7 2 Twonorm 7,400 20 20/0 2
Hepatitis 155 19 19/0 2 Vehicle 846 18 18/0 4
Housevotes 435 16 0/16 2 Vowel 990 13 11/2 11
Iris 150 4 4/0 3 Wine 178 13 13/0 3
Led7digit 500 7 6/1 10 Wisconsin 683 9 9/0 2
Lymphography 148 18 3/15 4 Yeast 1,484 8 8/0 10
Magic 19,020 10 10/0 2 Zoo 101 17 0/17 7
Mammographic 961 5 0/5 2
MOPG five times with each parameter configuration for the
59 data sets from Table 1; the results of this experiment are
shown in Table 2.
It can be seen from Table 2 that results do not vary
considerably for either parameter configuration; results
vary slightly more for the 250–250 configuration. For both
parameter configurations, an ANOVA test comparing average
results (across the 59 data sets and 5 runs) reveals
that, with a confidence of 99.9 %, the null hypothesis that
the means across different runs (groups) are equal cannot
be rejected. Therefore, we can conclude that results obtained
by MOPG are reproducible.
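The reproducibility check above boils down to a one-way ANOVA with runs as groups. A minimal sketch of the F statistic follows (the helper name is ours; in practice a library routine such as SciPy's `f_oneway` would be used):

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)                                   # number of groups (runs)
    n = sum(len(g) for g in groups)                   # total number of observations
    grand = sum(sum(g) for g in groups) / n           # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A small F value (large p-value) means the null hypothesis of equal run means cannot be rejected, which is the reproducibility conclusion drawn above.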
5.2 Selection strategy
In this section we evaluate the performance of the strategy
proposed for the selection of a single solution from the
Pareto set as returned by the NSGA-II algorithm. The goal
of this experiment is to determine the effectiveness of the
strategy and its impact on the final performance of MOPG.
To determine the effectiveness of the selection strategy
described in Sect. 3.3.4 we ran an experiment and com-
pared the performance of the proposed strategy to the one
that would be obtained if the best solution from the Pareto
set had been selected. Additionally, we implemented two
other alternative selection criteria and compared them with
the proposed one. We considered a reduction-based crite-
rion, in which we always chose the solution with the
highest reduction performance. Also, we considered an-
other criterion that chooses the solution that is closest to the
theoretical optimum (an accuracy and reduction of 1); we
refer to this strategy as distance-from-optimum.
Table 3 compares the performance of MOPG when us-
ing the proposed technique and the other strategies. The
results in Table 3 are the average over the 5 runs of the
250–250 configuration from Sect. 5.1. From this table it
can be seen that the performance of MOPG when using our
selection strategy is very close to the topline (best). In
terms of reduction, our strategy allows us to select solu-
tions that virtually achieve the same performance as the
topline; in fact, all strategies obtained comparable reduc-
tion performance. On the other hand, in terms of accuracy,
the proposed criterion is the one that is closer to the per-
formance of the topline. The other two strategies obtained
lower performance. From these results we can conclude
that, although there is still a margin for improvement, the
proposed strategy proved to be very effective for selecting
competitive solutions from the Pareto set. Also, we showed
that the proposed strategy outperforms the other two se-
lection techniques.
Figure 1 shows the Pareto front obtained by MOPG for
two data sets (Banana and Ring) for a particular run; we
show the training-set (blue squares)/test-set (red circles)
accuracies and reduction performance of each solution in
the Pareto front. Both plots are representative of the rest of
data sets and runs. It can be seen from these plots that
solutions along the reduction objective achieve very close
fitness values, [0.975, 1] for both data sets. Thus,
choosing a solution with competitive reduction perfor-
mance is not too difficult. On the other hand, the accuracy
objective has a wider range of variation: [0.75, 0.92] for
Banana and [0.81, 0.94] for Ring. Therefore, the selection
of a competitive solution in terms of accuracy is not trivial,
and, in fact, it makes sense to use a criterion related to
accuracy to select a single solution from the Pareto set.
Figure 1 also shows the solutions that would be se-
lected with the proposed strategy (accuracy-based) and
the other two alternative methods. The solutions selected
with each strategy illustrate the benefits of using them:
accuracy-based obtains solutions that achieve better
classification performance; reduction-based selects solutions with
the highest reduction performance; distance-from-opti-
mum returns solutions with a good tradeoff between ac-
curacy and reduction. We emphasize that although better
reduction performance can be obtained with the last two
strategies, the improvement over the accuracy-based
criterion is quite small.
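For concreteness, the distance-from-optimum criterion can be sketched as follows, where each solution is an (accuracy, reduction) pair and the ideal point is (1, 1); the function name is ours, not the paper's:

```python
import math

def distance_from_optimum(solutions):
    """Pick the (accuracy, reduction) pair closest (Euclidean) to the ideal point (1, 1)."""
    return min(solutions, key=lambda s: math.hypot(1 - s[0], 1 - s[1]))
```

Because all Pareto solutions already achieve near-maximal reduction, this criterion mostly trades a tiny amount of reduction for accuracy, which matches its behavior in Fig. 1.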
5.3 Experimental study on MOPG’s parameters
In this section we evaluate the sensitivity of MOPG to var-
iations in its parameters. Specifically we consider parameters
that are directly related to our proposal, namely: Npop
(number of individuals), g (number of generations), η (the
Table 2 Evaluation of reproducibility of results. We report the average
performance (across all data sets) in terms of test-set accuracy
and training-set reduction for each run, and the average performance
across runs. The values of the other parameters were η = 0.3, Ip = 0.1
Run ID Accuracy Reduction
Configuration 1: g = 50, Npop = 50
Run 1 71.84 % ± 18.63 98.65 % ± 1.13
Run 2 71.93 % ± 18.48 98.65 % ± 1.12
Run 3 72.16 % ± 18.27 98.65 % ± 1.13
Run 4 71.85 % ± 18.73 98.64 % ± 1.12
Run 5 71.85 % ± 18.72 98.64 % ± 1.12
Average 71.92 % ± 18.56 98.64 % ± 1.12
Configuration 2: g = 250, Npop = 250
Run 1 77.05 % ± 17.12 98.68 % ± 1.19
Run 2 77.10 % ± 17.16 98.69 % ± 1.19
Run 3 76.90 % ± 17.35 98.63 % ± 1.18
Run 4 76.81 % ± 17.35 98.64 % ± 1.18
Run 5 76.70 % ± 17.15 98.66 % ± 1.17
Average 76.91 % ± 17.22 98.66 % ± 1.12
Pattern Anal Applic
123
portion of examples from each class to be used for the
training partition DT), and Ip (upper bound on the portion of
training instances to be used to initialize individuals). The
goal is to analyze the performance of MOPG when varying
these parameters and to determine an acceptable set of pa-
rameters for the benchmark we considered. Hopefully, the
experimental study from this section will help other re-
searchers using MOPG fix the parameters for their particular
problems.
We ran MOPG using different parameter values and
recorded the average and standard deviation (over 590 re-
sults, 59 data sets and 10 partitions from tenfold cross
validation) of accuracy and reduction. For this experiment
we report the results obtained in a single run, because there
are many parameter configurations, and it would be very
time consuming to report average results of multiple runs.
Anyway, in Sect. 5.1 we showed evidence of the repro-
ducibility of MOPG results; besides, please also notice that
it is not our aim to find the best configuration of parameters
but rather to analyze the performance of MOPG under
different settings.
When evaluating a specific parameter the values of the
other parameters were fixed. Thus, unless otherwise stated,
the default parameter values were Npop = 50, g = 50, η = 0.3,
Ip = 0.1. The results of this experiment are shown in Table 4.
It can be seen from Table 4 that better accuracy was
obtained when using larger values for the number of in-
dividuals and generations. This is a somewhat expected
result as more individuals imply3 that larger portions of the
search space can be explored. On the other hand, a larger
number of generations implies that the search is more
intensive, which may lead to overfitting the data. However, it
seems that our mechanism of updating the validation par-
tition in each generation allows us to overcome this phe-
nomenon, to some extent. Notwithstanding, one should
note that the differences in performance are not very large.
Thus, MOPG is somewhat robust to these parameters.
Regarding the reduction performance all the values con-
sidered for number of individuals and generations virtually
obtained the same performance.
Regarding parameter η, the best performance was
obtained when η = 0.3. This result indicates that a large
number of instances in the validation data set (70 % of the
instances in D when η = 0.3) may lead to better solutions.
This can be due to the fact that a large sample for evaluation
forces MOPG to select prototypes that generalize
better for those amounts of data. The values η = 0.1 and 0.7
also obtained very competitive performance in terms of
accuracy. However, lower values of η are preferred, as the
reduction is larger: as expected, the smaller the number of
instances, the higher the reduction, and vice versa.
The last parameter under analysis in this section is Ip,
the upper bound on the number of instances used to
generate individuals in the initial population. From Table 4
it can be seen that this parameter makes MOPG behave as
expected and, in fact, illustrates the accuracy/reduction
dilemma: small values of Ip result in solutions with extremely
high reduction values but low accuracy. Thus, the
best value for Ip would be the configuration that offers the
best tradeoff; Ip = 0.05 seems to be the best alternative. It
is interesting that using only 0.5 % (i.e., Ip = 0.005) of the
total number of instances, MOPG is still able to obtain
competitive solutions (71.19 % in accuracy) with an extreme
reduction (99.01 %). Therefore, if the user has a
priori preferences for accuracy or reduction, the value of Ip
must be set accordingly.
Table 3 Comparison of the performance obtained by MOPG when using the proposed selection strategy (accuracy-based) with the best
solution in the Pareto front
Method All Large Small
Accuracy
Accuracy-based (proposal) 76.91 ± 17.22 81.33 ± 20.86 74.81 ± 15.02
Reduction-based 72.95 ± 16.79 73.20 ± 19.84 72.83 ± 15.42
Distance-from-optimum 75.78 ± 17.02 79.62 ± 20.53 73.95 ± 15.03
Best 79.07 ± 16.32 82.01 ± 20.25 77.68 ± 13.68
Reduction
Accuracy-based (proposal) 98.67 ± 1.17 99.39 ± 0.32 98.32 ± 1.26
Reduction-based 98.98 ± 1.27 99.75 ± 0.21 98.62 ± 1.41
Distance-from-optimum 98.90 ± 1.24 99.66 ± 0.22 98.54 ± 1.36
Best 98.98 ± 1.27 99.75 ± 0.21 98.62 ± 1.41
Also, we report the performance of alternative techniques (reduction-based and distance-from-optimum). We show results for all (59 data sets),
small (40 data sets) and large (19 data sets)
3 Please note that, in general, in evolutionary algorithms large
populations do not necessarily mean better performance. This
behavior is observed when the search space has not been explored
extensively, which is beneficial for avoiding overfitting.
5.4 Evolutionary operators
In this section, we evaluate the performance of MOPG
when varying the crossover and mutation probabilities.
As in the previous section, the goal of our experiment is
to determine the impact that each of these parameters
has on the performance of MOPG. We ran MOPG using
different values for Prc and Prm, the probabilities of
crossover and mutation, respectively. The results of
these experiments are shown in Fig. 2. As before, we
report the average (over 590 results, 59 data sets and 10
partitions from tenfold cross validation) of accuracy and
reduction.
From Fig. 2 it can be seen that reduction performance
is roughly the same for all of the configurations of values
of Prc and Prm. Regarding accuracy (left plot), different
values of Prm do not seem to significantly modify the
performance of MOPG, although slightly better results
were obtained with larger values of Prm. Regarding the
crossover parameter Prc, accuracy increases considerably
(up to 2 %) as the value of Prc increases. Therefore,
crossover seems to play a key role in MOPG. This is
understandable as the crossover operator allows solutions
to condense prototypes and also to interchange prototypes
between parent solutions.
5.5 Comparison with related works
We compare now the performance of MOPG to that ob-
tained by alternative approaches that have used exactly
the same data. For comparison we considered the 25
methods4 evaluated in [28]. Also, we consider the PG
method introduced in [14], which outperforms most
methods from the previous study. Finally, we also com-
pare the performance of MOPG to that obtained by the
methods in [30], which, to the best of our knowledge, are
the techniques that have obtained the best results for the
data sets we consider.
For this experiment we used the best configuration of
parameters for MOPG found in our previous study
(g = 250 generations, Npop = 250 individuals, η = 0.1
training-set size, Ip = 0.005 initial-population bound). In fact, we
are reporting for MOPG the average performance of 5 runs,
as described in Sect. 5.1. Please note that the same procedure
for fixing parameters was followed for all of the other
methods we compare to, i.e., the results for the other
methods were obtained using the best parameter configurations
on the test sets, as recommended by the authors
of the corresponding papers; see⁵ [14, 28, 30].
Table 5 shows a summary of the comparison of MOPG
and methods evaluated in [28]. In this summary table we
considered the best methods in terms of accuracy
(GENN [20]) and reduction (PSCSA [16]). We also show
the performance obtained by 1NN.
From Table 5 it can be seen that in terms of accuracy,
our method obtains lower accuracy performance than
GENN when considering all of the data sets. In small data
Table 4 Performance of MOPG under different parameter settings
Parameter Value Accuracy Reduction
Individuals (Npop) 50 71.68 % ± 18.18 97.24 % ± 1.21
 100 72.25 % ± 17.80 97.26 % ± 1.25
 250 73.13 % ± 18.10 97.23 % ± 1.31
Generations (g) 50 71.68 % ± 18.18 97.24 % ± 1.21
 100 72.71 % ± 18.03 97.53 % ± 1.29
 250 73.32 % ± 18.11 97.62 % ± 1.34
 500 73.37 % ± 18.08 97.70 % ± 1.31
Training-set size (η) 0.1 72.11 % ± 18.33 98.84 % ± 1.09
 0.3 73.07 % ± 18.18 98.19 % ± 1.13
 0.5 71.68 % ± 18.18 97.24 % ± 1.21
 0.7 72.12 % ± 18.44 96.85 % ± 1.38
 0.9 70.30 % ± 20.39 96.38 % ± 1.67
Initial prot. (Ip) 0.005 71.19 % ± 18.18 99.01 % ± 1.28
 0.01 71.20 % ± 18.31 98.97 % ± 1.26
 0.05 72.14 % ± 18.74 98.37 % ± 1.16
 0.1 71.68 % ± 18.18 97.24 % ± 1.21
 0.2 73.07 % ± 17.76 95.19 % ± 1.82
 0.4 73.20 % ± 17.72 89.86 % ± 3.87
4 Please note that the 25 methods have been evaluated on small data
sets, but only 20 out of the 25 were evaluated on large data sets [28].
Five methods were not considered for large data sets because they
were too computationally expensive; see [28] for details.
5 See also http://sci2s.ugr.es/pgtax/.
sets, the gain of GENN over MOPG is small, but for large
data sets MOPG and GENN achieve virtually the same
performance. This is a very positive result because, even
when MOPG did not outperform GENN in terms of ac-
curacy, MOPG achieves much higher reduction rates than
GENN. Moreover, the fact that MOPG performs better on
large data sets is encouraging as the main target of PG
methods is precisely large databases. Likewise, in terms of
reduction, PSCSA outperforms MOPG. The gain of
PSCSA over MOPG in reduction is small; nevertheless, we
can see that MOPG significantly outperforms PSCSA in
terms of accuracy. Therefore, we can conclude that MOPG
offers a better tradeoff between accuracy and reduction
than the best methods (in either aspect) considered in [28].
Figure 3 graphically shows a comparison between the
25 methods considered in [28] (plus the GPGP method
introduced in [14]) and our proposal for small and large
data sets in terms of reduction (y-axis) and accuracy (x-
axis). Regarding small data sets, MOPG is outperformed by
three methods in terms of accuracy: GENN, ICPI and PSO.
However, the reduction performance of MOPG is better
than any of these methods. In terms of reduction, our
method outperforms most of the evaluated techniques.
Regarding large data sets, it can be seen from Fig. 3 that
our method offers a better tradeoff between accuracy and
reduction than any other method, as MOPG is located at the
upper right corner of the plot. It obtains similar accuracy as
GENN and GPGP but its reduction performance is better.
To the best of our knowledge the results obtained with
MOPG for large data sets in the suite provided in [28] are
the best ones reported so far. MOPG also outperforms our
own previous work GPGP [14], a very effective method
that was recently introduced.
Figure 4 shows boxplots of a tradeoff performance es-
timate for each method that has been evaluated on the data
sets from Table 1. Boxplots report the average, across
small and large data sets, of reduction × accuracy as obtained
by MOPG and each of the other PG methods. It is
clear that, on average, MOPG offers the best tradeoff be-
tween both objectives, for both small and large data sets.
For small data sets the second/third best methods were
PSO [23] and MSE [10], respectively. Regarding large
data sets the second/third best methods were GPGP [14]
and PSO [23]. The average performance for both small and
large data sets was worse than that obtained by MOPG.
A Wilcoxon signed-rank test6 comparing the performances
of MOPG to the other PG methods revealed that for large
data sets there is a statistically significant difference in
tradeoff performance for all methods except MSE and PSO, whereas
for small data sets all of the differences were statistically
significant. From the results presented so far we can con-
clude that the multi-objective approach is indeed obtaining
Fig. 2 Performance of MOPG under different crossover (left) and mutation (right) rates
Table 5 Average performance and percentage of reduction obtained by MOPG for all the data sets; we also show the separate performance
obtained in small and large sets
Test-set accuracy Training-set reduction
Measure All Small Large All Small Large
MOPG 76.91 % ± 17.22 74.81 % ± 15.02 81.33 % ± 20.86 98.67 % 98.29 % 99.39 %
GENN 77.46 % ± 17.71 75.64 % ± 15.45 81.33 % ± 21.70 17.71 % 18.62 % 15.76 %
PSCSA 66.90 % ± 19.67 66.82 % ± 18.74 67.07 % ± 22.05 99.01 % 98.58 % 99.88 %
1NN 75.77 % ± 18.73 73.48 % ± 16.64 80.60 % ± 22.24 0 % 0 % 0 %
6 This is the statistical test recommended by Demsar for comparing
classification methods over multiple data sets [11].
solutions that offer a better tradeoff between accuracy and
reduction than most other techniques proposed so far.
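The statistical comparison above uses the Wilcoxon signed-rank test. A minimal sketch of the test statistic for paired per-data-set scores follows (the function name is ours; no ties or zero differences are handled, and in practice one would use a library routine such as SciPy's `scipy.stats.wilcoxon`):

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples.

    Minimal illustrative version: assumes no zero differences and no tied
    absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the pairs by absolute difference (rank 1 = smallest).
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)
    w_minus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] < 0)
    return min(w_plus, w_minus)
```

A small statistic indicates the differences are consistently in one direction, so the null hypothesis of equal median performance is rejected.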
We now compare the performance of MOPG with the
methods based on differential evolution that were pro-
posed in [30]. To the best of our knowledge these meth-
ods are the ones that have obtained the best performance
so far on PG. We considered three methods out of the 15
variants evaluated in [30]: SFLSDE/RandtoBest/1/Bin is
the best PG method for small data sets, SFLSDE/Rand/1/
Bin is the best one for large data sets, and SSMA+
SFLSDE/RandtoBest/1/Bin is the best method overall.
One should note, however, that the latter method is a
hybrid that combines PS (SSMA) and PG (SFLSDE)
methods; hence its results are not directly comparable
to MOPG. For this experiment we only considered 56 out
of the 59 data sets from Table 1. Three large data sets
were discarded in [30]: ring, phoneme and nursery. This
is because we wanted to use exactly the same data sets
that were used in the reference study.
Table 6 shows the tradeoff results obtained by the best
methods reported in [30] and MOPG. We report two mea-
sures of performance tradeoff: the product (reduction × accuracy)
and the harmonic mean of both objectives,
(2 × reduction × accuracy)/(reduction + accuracy),
to better appreciate the differences among methods.
From Table 6 it can be seen that MOPG obtains a per-
formance comparable (yet worse) to that of the three
Fig. 3 Reduction vs. accuracy in small (top) and large (bottom) data sets
Fig. 4 Box plot of reduction × accuracy in small (top) and large (bottom) data sets. The x-axis indicates the different PG methods
considered in our study
Table 6 Tradeoff (accuracy–reduction) performance for selected methods
Method References Large Small
Reduction × Accuracy
MOPG Ours 80.05 73.56
SFLSDE/RandtoBest/1/Bin [30] 81.54 72.23
SFLSDE/Rand/1/Bin [30] 81.67 71.88
SSMA+SFLSDE/RandtoBest/1/Bin [30] 81.64 74.95
(2 × Reduction × Accuracy)/(Reduction + Accuracy)
MOPG Ours 88.97 84.97
SFLSDE/RandtoBest/1/Bin [30] 89.92 84.48
SFLSDE/Rand/1/Bin [30] 89.99 84.25
SSMA+SFLSDE/RandtoBest/1/Bin [30] 90.02 86.15
methods based on differential evolution. Results are con-
sistent for both evaluation measures. When considering
large data sets (column 3, large), the reference PG methods
(i.e., SFLSDE/Rand/1/Bin and SSMA+SFLSDE/RandtoBest/1/Bin)
outperform ours by about 1 %. However, the performance
on small data sets is virtually the same (column
4, small). The margin of improvement is higher for the
hybrid method (i.e., SSMA+SFLSDE/RandtoBest/1/Bin).
One should note that MOPG outperforms most variants of
differential evolution proposed in [30] (comparison not
shown here) as well as other PG methods considered for
comparison by the authors (9 other methods including
PSO, ENPC, PSCSA). From the results in Table 6 we can
conclude that the performance of MOPG is competitive
with the most effective methods in the state of the art.
6 Conclusions
We introduced MOPG, a novel prototype generation (PG)
method based on multi-objective optimization. We ap-
proach the PG problem as one of multi-objective opti-
mization where we aim to simultaneously optimize
accuracy and reduction of prototypes. Our working hy-
pothesis is that by simultaneously optimizing both objec-
tives we can achieve a better reduction/accuracy tradeoff.
The proposed approach was evaluated on benchmark data
and its performance was compared to many PG methods,
including the best performing ones. The contributions of
this paper can be summarized as follows. (1) Formulation
of the PG problem as one of multi-objective optimization
and proposal of an effective multi-objective evolutionary
algorithm to approach the PG problem. (2) New methods
for representation, initialization, crossover and mutation
for PG using evolutionary algorithms. Likewise, an effec-
tive strategy for the selection of a single solution from the
Pareto front generated by NSGA-II. (3) Extensive eval-
uation of the proposed method over benchmark data, in-
cluding comparisons with many PG methods.
The main findings of this work can be summarized as
follows. We found that the multi-objective formulation for PG
is a promising alternative to single-objective approaches;
we hope our work can foster the development of other
multi-objective optimization methods for PG. We showed
evidence supporting the hypothesis that our proposal,
MOPG, is very competitive in terms of both objectives,
reduction and accuracy. MOPG outperforms most PG
methods proposed so far and obtains similar performance
to the best PG method proposed so far. MOPG can be
improved in many ways (for instance, for reducing its
computational cost), thus we hope our work motivates
further research on new mechanisms to improve it.
Current and future work directions on MOPG include
enhancing our method in terms of efficiency and scalability,
so as to apply it to big-data problems. For this we are
planning to use ad hoc stratification and surrogate model-
ing techniques, see [26, 27, 29]. Also, we are working on
the development of methods that can simultaneously gen-
erate prototypes and features using a multi-objective opti-
mization framework.
Acknowledgments This work was partially supported by the
LACCIR programme under project ID R1212LAC006. Hugo Jair
Escalante was supported by the internships programme of CONACyT
under grant No. 234415.
References
1. Aler R, Handl J, Knowles JD (2013) Comparing multi-objective and
threshold-moving roc curve generation for a prototype-based clas-
sifier. In: Proceedings of the fifteenth annual conference on Genetic
and evolutionary computation conference. ACM, pp 1029–1036
2. Cervantes A, Galvan IM, Isasi P (2009) AMPSO: a new particle
swarm method for nearest neighborhood classification. IEEE
Trans. Sys. Man Cybern. B 39(5):1082–1091
3. Chatelain C, Adam S, Lecourtier Y, Heutte L, Paquet T (2010)
A multi-model selection framework for unknown and/or evolutive
misclassification cost problems. Pattern Recogn. 43(3):815–823
4. Chen JH, Chen HM, Ho SY (2005) Design of nearest neighbor
classifiers: multi-objective approach. Int. J. Approx. Reason.
40:3–22
5. Coello Coello CA, Lamont GB, Veldhuizen DAV (2007) Evo-
lutionary algorithms for solving multi-objective problems. Ge-
netic and evolutionary computation, 2nd edn. Springer, USA
6. Cover T, Hart P (1967) Nearest neighbor pattern classification.
IEEE Trans. Inform. Theory 13(1):21–27
7. Cruz-Vega I, Garcia-Limon M, Escalante HJ (2014) Adaptive
surrogates with a neuro-fuzzy network and granular computing.
In: Proceedings of GECCO 2014. ACM Press, pp 761–768
8. Deb K (2001) Multi-objective optimization using evolutionary
algorithms. Wiley
9. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and
elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans.
Evol. Comput. 6(2):182–197
10. Decaestecker C (1997) Finding prototypes for nearest neighbour
classification by means of gradient descent and deterministic
annealing. Pattern Recogn. 30(2):281–288
11. Demsar J (2006) Statistical comparisons of classifiers over mul-
tiple data sets. J Mach Learn Res 7:1–30
12. Dos-Santos EM, Sabourin R, Maupin P (2008) A dynamic
overproduce-and-choose strategy for the selection of classifier
ensembles. Pattern Recogn. 41:2993–3009
13. Eiben AE, Smith JE (2010) Introduction to evolutionary com-
puting. Natural computing. Springer
14. Escalante HJ, Mendoza KM, Graff M, Morales-Reyes A (2013)
Genetic programming of prototypes for pattern classification. In:
Proceedings of IbPRIA 2013, vol. 7887 of LNCS. Springer,
pp 100–107
15. Fernandez F, Isasi P (2004) Evolutionary design of nearest pro-
totype classifiers. J. Heuristics 10:431–454
16. Garain U (2008) Prototype reduction using an artificial immune
system. Pattern Anal. Appl. 11(3–4):353–363
Pattern Anal Applic
123
17. Garcia S, Derrac J, Cano JR, Herrera F (2012) Prototype selection
for nearest neighbor classification: taxonomy and empirical
study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3):417–435
18. Hastie T, Tibshirani R, Friedman J (2001) The elements of sta-
tistical learning. Springer, New York
19. Kim SW, Oommen BJ (2003) A brief taxonomy and ranking of
creative prototype reduction schemes. Pattern Anal. Appl.
6:232–244
20. Koplowitz J, Brown T (1981) On the relation of performance to
editing in nearest neighbor rules. Pattern Recogn. 13(3):251–255
21. Li J, Wang Y (2013) A nearest prototype selection algorithm
using multi-objective optimization and partition. In: Proceedings
of the 9th International Conference on Computational Intelli-
gence and Security. IEEE, pp. 264–268
22. Lozano M, Sotoca JM, Sanchez JS, Pla F, Pkalska E, Duin RPW
(2006) Experimental study on prototype optimisation algorithms
for prototype-based classification in vector spaces. Pattern
Recogn. 39(10):1827–1838
23. Nanni L, Lumini A (2008) Particle swarm optimization for pro-
totype reduction. Neurocomputing 72(4–6):1092–1097
24. Olvera A, Carrasco-Ochoa JA, Martinez-Trinidad JF, Kittler J
(2010) A review of instance selection methods. Artif. Intell. Rev.
34:133–143
25. Storn R, Price KV (1997) Differential evolution: a simple and
efficient heuristic for global optimization over continuous spaces.
J. Global Optim. 11(10):341–359
26. Rosales A, Coello CA, Gonzalez J, Reyes CA, Escalante HJ
(2013) A hybrid surrogate-based approach for evolutionary multi-
objective optimization. In: Proceedings of Congress on Evolu-
tionary Computation 2013. IEEE, pp 2548–2555
27. Rosales A, Gonzalez J, Coello CA, Escalante HJ, Reyes CA
(2014) Surrogate-assisted multi-objective model selection for
support vector machines. Neurocomputing (in press)
28. Triguero I, Derrac J, Garcia S, Herrera F (2012) A taxonomy and
experimental study on prototype generation for nearest neighbor
classification. IEEE Trans. Sys. Man Cybern. C 42(1):86–100
29. Triguero I, Peralta D, Bacardit J, Garcia S, Herrera F (2014)
MRPR: a mapreduce solution for prototype reduction in big data
classification. Neurocomputing (in press)
30. Triguero I, Garcia S, Herrera F (2011) Differential evolution for
optimizing the positioning of prototypes in nearest neighbor
classification. Pattern Recogn. 44:901–916
31. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H,
McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M,
Hand DJ, Steinberg D (2007) Top 10 algorithms in data mining.
Knowl. Inf. Syst. 14(1):1–37
32. Xia H, Zhuang J, Yu D (2013) Novel soft subspace clustering
with multi-objective evolutionary approach for high-dimensional
data. Pattern Recogn. 46:2562–2575