Advanced Review

Revisiting evolutionary algorithms in feature selection and nonfuzzy/fuzzy rule based classification

Satchidananda Dehuri1 and Ashish Ghosh2*

This paper discusses the relevance and possible applications of evolutionary algorithms, particularly genetic algorithms, in the domain of knowledge discovery in databases. Knowledge discovery in databases is a process of discovering knowledge along with its validity, novelty, and potentiality. Various genetic-based feature selection algorithms, with their pros and cons, are discussed in this article. Rule (a kind of high-level representation of knowledge) discovery from databases, posed as a single- or multiobjective problem, is a difficult optimization task. Here, we present a review of some of the genetic-based classification rule discovery methods based on a fidelity criterion. The intractable nature of fuzzy rule mining using single and multiobjective genetic algorithms, as reported in the literature, is reviewed. An extensive list of relevant and useful references is given for further research. © 2013 Wiley Periodicals, Inc.

How to cite this article:
WIREs Data Mining Knowl Discov 2013. doi: 10.1002/widm.1087

INTRODUCTION

The current information era is characterized by a great expansion in the volume of data that are being generated by low-cost devices (e.g., scanners, bar code readers, sensors) and stored. Intuitively, this large amount of stored data contains valuable hidden knowledge, which could be used to improve the decision-making process of an organization. Piatetsky-Shapiro reported1 that it is an urgent requirement to develop a semiautomatic tool to discover hidden knowledge. However, discovering knowledge from such a volume of complex data can be characterized as a problem of intractability.2 Therefore, the development of efficient and effective tools for revealing valuable knowledge hidden in these databases becomes more critical for enterprise decision making. One of the possible approaches to this problem

*Correspondence to: [email protected]
1Department of Systems Engineering, Ajou University, Suwon, South Korea
2Center for Soft Computing Research, Indian Statistical Institute Kolkata, Kolkata, India

DOI: 10.1002/widm.1087

is by means of data mining or knowledge discovery from databases (KDD).3–6 Through data mining, interesting knowledge7 can be extracted, and the discovered knowledge can be applied in the target field to increase working efficiency and to improve the quality of decision making. Some of the knowledge discovery and data mining tools, e.g., DBMiner, DeltaMiner, and CN2, which aim at mainstream business users, provide up-to-date solutions.8,9 Interested readers and practitioners can find a range of existing state-of-the-art data mining and related tools discussed in Ref 10.

Over the last decade and a half, most data mining techniques have been developed from a database perspective; comparatively little effort has been made from the machine learning and soft computing perspectives.11 Recently, however, a growing number of researchers in evolutionary algorithms and multiobjective evolutionary algorithms have turned to data mining applications and are coming up with their own findings. Some of the findings in this direction can be obtained from Refs 12–14. Alcala-Fdez et al.15 have developed a software tool known as knowledge extraction based on evolutionary learning (KEEL) to assess evolutionary algorithms16 for data mining problems of various kinds, including regression, classification, unsupervised learning (clustering), and so on. It includes evolutionary learning algorithms based on different approaches: Pittsburgh,17,18 Michigan,19,20 iterative rule learning (IRL),21 and genetic cooperative competitive learning (GCCL).22 Along with the integration of evolutionary learning with different preprocessing techniques, it allows a complete analysis of any learning model in comparison with existing software tools. Similarly, in recent years,23,24 the development of methods for data mining has attracted increasing attention in the fuzzy set community. A systematic discussion of the possible benefits of fuzzy methods in data mining is presented in Ref 25. To this end, this paper presents a well-balanced review of the literature on evolutionary algorithms and hybrid fuzzy genetic rule based systems in data mining and KDD.

Nowadays, the application domain of data mining is getting more and more complex; it has shifted from traditional scientific26 and market basket database mining27 to biological,28,29 health care,30,31 agriculture,32 process monitoring and control,33 intrusion detection,34–36 and social network analysis.37

For example, detecting unauthorized use, misuse, and attacks that have no previously described patterns on information systems is usually a very complex task for traditional methods. Similarly, data about a hospital's patients might contain interesting knowledge about which kinds of patients are more likely to develop a given disease. Hence, in view of all these complexities of the domains and the limitations of statistical, neural network based, decision tree based, and some nonparametric classifiers, there is inspiration to develop intelligent systems using various evolutionary algorithm and fuzzy system based approaches. Recall that this paper reviews some representative work on data preprocessing using genetic algorithms and on classification rule generation using genetic and fuzzy systems.

It should be noted that the quality of the discovered knowledge (whether it is a classification rule, association rule, prediction rule, or clusters) strongly depends on the quality of the data being mined. This has motivated the improvement and development of several data preprocessing techniques such as attribute selection, attribute construction, and training set selection.38 The requirements of data preprocessing in KDD and its success are reported in the literature.39–43 In the Preliminaries, we present some of the preliminary concepts and definitions of KDD, evolutionary algorithms (EAs), multiobjective evolutionary algorithms (MOEAs), and fuzzy systems. EAs, particularly genetic algorithms (GAs), for attribute selection are discussed in the section Data Mining Using Genetic Algorithms: Attribute Selection.

Data mining involves various tasks such as classification, clustering, association rule mining, regression, and change detection. Each task can be considered a problem to be solved by data mining algorithms. In this paper, the utility of GAs for the classification task is primarily dealt with. For other data mining tasks, the interested reader can refer to Refs 44, 45 for association rule mining based on evolutionary algorithms, Refs 46–49 for evolutionary algorithm based clustering, Ref 50 for genetic-based regression, Ref 51 for genetic-based change detection, and so on. Various issues intertwined with classification rule mining (CRM) using GAs are discussed in the section Data Mining Using Genetic Algorithms.

Later on, fuzzy classification rule mining (FCRM) using genetic and multiobjective genetic algorithms (MOGAs) is presented. The aim is to generate a compact set of classification rules by simultaneous optimization of rule accuracy, length of the rules, and number of rules. The last section presents the summary and future research directions.

PRELIMINARIES

In this section, some preliminaries are discussed.

Knowledge Discovery in Databases

The subject of KDD has evolved, and continues to evolve, from the intersection of research from various fields such as databases, machine learning,52 pattern recognition, statistics, artificial intelligence, reasoning with uncertainties, knowledge acquisition for expert systems,53 data visualization, machine discovery, high-performance computing, evolutionary computation, multiobjective evolutionary computation, and swarm intelligence.12,54 This paper focuses on EAs, particularly GAs, for KDD. Hence, it is important to discuss the definitions and concepts of data mining and the process of KDD.

Definition: KDD is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

KDD Process: The overall KDD process is interactive and iterative, involving four steps: (i) data acquisition and integration, (ii) data preprocessing, (iii) data mining, and (iv) postprocessing. Specifically, the following steps are intertwined in any practical implementation:

Domain specific knowledge includes relevant prior knowledge and the goals of the application.

Extracting/selecting the target data set includes extracting/selecting a data set or focusing on a subset of data instances.

Data cleansing includes basic operations, such as noise removal and handling of missing data. Data from real-world sources are often erroneous, incomplete, and inconsistent, perhaps due to operational error or system implementation flaws. Such low-quality data need to be cleaned prior to mining.

Data integration includes integrating multiple, heterogeneous data sources.

Data reduction and projection includes finding useful features to represent the data (depending on the goal of the task) and using dimensionality reduction or transformation methods.

Choosing the function of data mining includes deciding the purpose of the model derived by data mining algorithms (e.g., summarization, classification, regression, clustering, link analysis, image segmentation/classification, functional dependencies, rule extraction (classification and association rules), or a combination of these).

Choosing the data mining algorithm(s) includes selecting the method(s) to be used for searching for patterns in data, such as deciding which models and parameters may be appropriate.

Data mining includes searching for patterns of interest in a particular representational form or a set of such representations.

Interpretation includes interpreting the discovered patterns, as well as possible visualization of the extracted patterns. One can analyze the patterns automatically or semiautomatically to identify truly interesting/useful patterns for the user.

Using discovered knowledge includes incorporating this knowledge into the performance system and taking actions based on the knowledge.

Data Mining: KDD refers to the overall process of turning low-level data into high-level knowledge. An important step in the KDD process is data mining. Data mining is an interdisciplinary field with the general goal of predicting outcomes and uncovering relationships in data. It uses automated tools employing sophisticated algorithms to discover hidden patterns, associations, anomalies, and/or structures from large amounts of data stored in data warehouses or other information repositories. Data mining tasks can be descriptive, i.e., discovering interesting patterns describing the data, or predictive, i.e., predicting the behavior of the model based on available data. Data mining involves fitting models to or determining patterns from observed data. The fitted models play the role of inferred knowledge. Deciding whether the model reflects useful knowledge is a part of the overall KDD process, for which subjective human judgment is usually required. Typically, a data mining algorithm constitutes some combination of the following three components.

• The model: The function of the model (e.g., classification, clustering, regression) and its representational form (e.g., linear discriminants, neural networks, decision trees). A model contains parameters that are to be determined from the data.

• The preference criterion: A basis for preferring one model or set of parameters over another, depending on the given data. The criterion is usually some form of goodness-of-fit function of the model to the data, perhaps tempered by a smoothing term to avoid overfitting, that is, generating a model with too many degrees of freedom to be constrained by the given data.

• The search algorithm: The specification of an algorithm for finding particular models and parameters, given the data, model(s), and a preference criterion. A particular data mining algorithm is usually an instantiation of the model/preference/search components.

The more common model functions in current data mining practice include the following: (1) Classification classifies a data item into one of several predefined categorical classes. (2) Regression maps a data item to a real-valued prediction variable. (3) Clustering maps a data item into one of several clusters, where clusters are natural groupings of data items based on similarity metrics or probability density models. (4) Rule generation extracts classification rules from the data. (5) Discovering association rules describes association relationships among different attributes. (6) Summarization provides a compact description for a subset of data. (7) Dependency modeling describes significant dependencies among variables. (8) Sequence analysis models sequential patterns, as in time-series analysis, where the goal is to model the states of the process generating the sequence or to extract and report deviations and trends over time.

The development of new generation algorithms is expected to encompass more diverse sources and types of data and to support mixed-initiative data mining, where human experts collaborate with the computer to form hypotheses and test them.

Evolutionary Algorithms

In essence, EAs are computer-based problem-solving algorithms that use computational models of evolutionary processes as key elements in their design and implementation. A variety of evolutionary algorithms have evolved during the past several years. The major representatives are (i) GAs,16,55,56 (ii) evolution strategies (ESs),57,58 (iii) evolutionary programming (EP),59,60 (iv) genetic programming (GP),61,62 (v) estimation of distribution algorithms (EDAs),63 (vi) compact genetic algorithms (CGAs)64 and their variant, extended compact genetic algorithms, and (vii) population-based incremental learning algorithms.65 Each of these constitutes a different approach; however, all are inspired by the same principle of natural evolution. As GAs play a critical role in this paper, it is worth discussing the basic concepts of genetic algorithms.

A GA is a stochastic method inspired by Darwin's theory of evolution. A population of individuals, each representing a possible solution to a problem, is initially created at random. Pairs of individuals combine to produce offspring for the next generation. A mutation operator is used to randomly modify the genetic structure of some individuals of each new generation. The algorithm runs to generate solutions for successive generations. The probability of reproduction of an individual is proportional to the goodness of the solution it represents; hence the quality of the solutions in successive generations improves. The process is terminated when an acceptable or optimum solution is found, or after some predefined time limit. GAs are appropriate for optimization problems with respect to some computable criterion.
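The loop just described can be made concrete in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes a bit-string genome, a user-supplied fitness function, and illustrative parameter values (population size, mutation rate, generation count).

```python
import random

def run_ga(fitness, n_bits, pop_size=50, p_mut=0.01, generations=100):
    """Minimal generational GA: fitness-proportionate selection,
    one-point crossover, bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        # Fitness-proportionate (roulette-wheel) selection of parents.
        parents = random.choices(pop, weights=[max(s, 1e-9) for s in scores], k=pop_size)
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randint(1, n_bits - 1)  # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                nxt.append([bit ^ (random.random() < p_mut) for bit in child])
        pop = nxt
    return max(pop, key=fitness)

# Example: maximize the number of 1s in a 20-bit string.
best = run_ga(fitness=sum, n_bits=20)
```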

Multiobjective Genetic Algorithms

There are many multiobjective problems requiring simultaneous optimization of several competing objectives. Formally, such a problem can be stated as follows:

We want to find $\vec{x} = (x_1, x_2, x_3, \ldots, x_n)$ in decision space that maximizes the values of $p$ objective functions $F(\vec{x}) = (f_1(\vec{x}), f_2(\vec{x}), f_3(\vec{x}), \ldots, f_p(\vec{x}))$ in objective space within a feasible domain $\Omega$. Generally, the answer is not a single optimal solution but a set of solutions called a Pareto-optimal set.

Definitions

• A solution vector $\vec{x} = (x_1, x_2, \ldots, x_n)$ is said to dominate $\vec{x}' = (x'_1, x'_2, \ldots, x'_n)$ iff $\vec{x}$ is no worse than $\vec{x}'$ in all objectives and $\vec{x}$ is strictly better than $\vec{x}'$ in at least one objective.

• A solution $\vec{x} \in \Omega$ is said to be Pareto optimal with respect to $\Omega$ iff there is no $\vec{x}' \in \Omega$ for which $\vec{x}'$ dominates $\vec{x}$.

• For a given multiobjective problem $F(\vec{x})$, the Pareto-optimal set $P_s$ is defined as $P_s = \{\vec{x} \in \Omega \mid \neg\exists\, \vec{x}' \in \Omega \text{ for which } \vec{x}' \text{ dominates } \vec{x}\}$.

• For a given multiobjective problem $F(\vec{x})$ and Pareto-optimal set $P_s$, the Pareto front $P_f$ is defined as $P_f = \{F(\vec{x}) = (f_1(\vec{x}), f_2(\vec{x}), \ldots, f_p(\vec{x})) \mid \vec{x} \in P_s\}$.

Optimization methods generally try to find a given number of Pareto-optimal solutions that are uniformly distributed on the Pareto-optimal front. Such solutions provide the decision maker with sufficient insight into the problem to make the final decision. Methods such as weighted sum, ε-constraint, and goal programming have been proposed to search for Pareto-optimal solutions. However, an a priori articulation of preferences over the objectives is required, which is often difficult to decide beforehand. Besides, these methods can find only one solution at a time; other solutions cannot be obtained without recomputation with the free parameters reset.

In contrast, GAs maintain a population and thus can search for many nondominated solutions in parallel. A GA's ability to find a diverse set of solutions in a single run, and its freedom from any demand for objective preference information, give it an immediate advantage over the aforementioned techniques. A number of multiobjective GAs (MOGAs)44,66 have been introduced in the literature. Basically, an MOGA is characterized by its fitness assignment and diversity maintenance strategy.

In fitness assignment, most MOGAs fall into two categories: non-Pareto and Pareto-based. Non-Pareto methods use the objective values directly as the fitness values to decide an individual's fitness. Schaffer's VEGA67 is such a method. The predator–prey approach68 is another, where randomly walking predators kill a prey or let it survive according to the prey's value in one objective. In contrast, Pareto-based methods measure individuals' fitness according to their dominance property: the nondominated individuals in the population are regarded as fittest regardless of their single objective values. Since Pareto-based approaches better respect the dominance structure of multiobjective problems, their performance is reported to be better.

A diversity maintenance strategy works by distributing the solutions uniformly over the Pareto front, instead of accumulating them in a small region only. Fitness sharing66 is one way to maintain diversity in the Pareto front, by sharing the fitness of an individual with its neighborhood. Restricted mating is an alternative, where mating is permitted only when the distance between two parents is large enough. Pareto archived evolution strategy (PAES),69 strength Pareto evolutionary algorithm (SPEA),68 and nondominated sorting genetic algorithm-II (NSGA-II)66 are some of the leading MOGAs. They all adopt a Pareto-based fitness assignment strategy and implement elitism. A comprehensive study of MOGAs can be found in Ref 13. Let us discuss the working principles of NSGA-II and SPEA here, as they are the most popular in the literature.

Nondominated Sorting Genetic Algorithm-II: The NSGA-II procedure70 attempts to find multiple Pareto-optimal solutions of a multiobjective optimization problem by using an elitist mechanism and an explicit diversity preserving mechanism, and by emphasizing nondominated solutions. At any generation t, the offspring population is first created by using the parent population (size N) and the usual genetic operators. Thereafter, the two populations are combined to form a new population of size 2N, which is classified into different nondominated fronts. The next population is then filled with points of different nondominated fronts, one front at a time. Since the combined population size is 2N, not all fronts can be accommodated in the N slots available for the next population; all fronts that cannot be accommodated are deleted. When the last allowed front is being considered, there may be more points in the front than remaining slots in the new population. Instead of arbitrarily discarding some members from the last front, the points that contribute most to the diversity of the selected set are chosen.
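The ranking step can be sketched as follows, reusing dominates() from above. This naive version recomputes dominance per front for clarity; NSGA-II's fast nondominated sort computes the same partition more efficiently.

```python
def nondominated_fronts(objs):
    """Partition objective vectors into successive nondominated fronts,
    as in NSGA-II's ranking step (front 0 = best)."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

# The next parent population is filled front by front; only the last
# admitted front is truncated, using a diversity (crowding) criterion.
```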

Strength Pareto Evolutionary Algorithm: SPEA71 maintains an external population at every generation, storing all nondominated solutions obtained so far. At each generation, the external population is mixed with the current population. All nondominated solutions in the mixed population are assigned fitness based on the number of solutions they dominate; dominated solutions are assigned fitness worse than the worst fitness of any nondominated solution. A deterministic clustering technique is used to ensure diversity among nondominated solutions. One of its variants is SPEA2.72

Fuzzy Set Theory

Fuzzy Sets: Unlike a classical set, whose boundary is clearly defined, the boundary of a fuzzy set is not clearly defined. An object belongs to a fuzzy set to a certain degree, called the degree of membership, typically represented by a real-valued number in the interval [0, 1] (i.e., $\mu_A: A \to [0, 1]$, where $\mu_A$ is the membership function (MF)). In contrast, the MF $\mu_A$ for a classical (crisp) set is a map of the form $\mu_A: A \to \{0, 1\}$. Thus, a fuzzy set is a generalization of a classical set. For a fuzzy set, the cardinality of a set A can be defined as

$$|A| = \sum_{x \in X} \mu_A(x). \qquad (1)$$

Operations on fuzzy sets are generalizations of operations on sets. This generalization can be done in several different ways, and details can be found in Ref 73.

Membership Functions: In fuzzy classifiers, the range of a continuous feature is divided into several intervals. Each interval is then considered to be a fuzzy set and an associated MF is defined. Thus, the input space is divided into several subregions that are parallel to the input axes. For each subregion, a fuzzy rule is defined: if the input is in the subregion, then it belongs to the class associated with that subregion. For an unknown input pattern, the degree of membership corresponding to all fuzzy sets is calculated, and the input is classified into the class with the maximum degree of membership. Hence, the MFs directly influence the performance of the fuzzy classifier.

MFs can be defined in a number of ways, based on different shapes and different numbers of parameters. Commonly used MFs74 are triangular, trapezoidal, Gaussian, sigmoidal, and reverse-sigmoidal. Some of the multidimensional MFs are rectangular pyramidal, truncated rectangular pyramidal, polyhedral pyramidal, and bell shaped.
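As an illustration of the most common shape, here is a minimal sketch of a triangular MF and a three-set partition of one feature; the breakpoints are illustrative choices, not values from the paper.

```python
def triangular_mf(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Three fuzzy sets ("low", "medium", "high") partitioning a feature in [0, 10]:
low    = lambda x: triangular_mf(x, -5.0, 0.0, 5.0)
medium = lambda x: triangular_mf(x,  0.0, 5.0, 10.0)
high   = lambda x: triangular_mf(x,  5.0, 10.0, 15.0)
print(low(2.0), medium(2.0), high(2.0))  # 0.6 0.4 0.0
```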

DATA MINING USING GENETIC ALGORITHMS: ATTRIBUTE SELECTION

Feature selection is one of the data-preprocessing tasks that has attracted the attention of researchers. It is the process of selecting a subset of the available features to use in constructing the model of interest. The solution to the feature selection problem is neither trivial nor unique. The set of optimal features can be different for different hypothesis spaces; therefore, optimality of a feature subset should only be defined in the context of the family of admissible modeling functions from which the finally deployed one is to be selected. As the dimensionality of a domain expands, the size of the search space grows exponentially: the number of candidate feature subsets is $2^N$, where N is the number of features. Therefore, finding an optimal feature subset is usually intractable,75 and many problems related to feature selection have been shown to be NP-hard.2

The problem can be defined as follows: "the feature selection problem involves the selection of a subset of d features from a total of N features, based on a given optimization criterion (here denoted as J)." Let us denote the N features uniquely by distinct numbers from 1 to N, so that the total set of N features can be written as U = {1, 2, 3, ..., N}. X denotes the subset of selected features, and Y denotes the set of remaining features; therefore, U = X ∪ Y at any time. J(X) denotes a function evaluating the performance of X. J may evaluate either the accuracy of a specific classifier on a specific data set (e.g., the wrapper approach as in Ref 76) or a generic statistical measurement (e.g., the filter approach77). The choice of the evaluation function J depends on the particular application.

Interest in feature selection has increased for several reasons, such as the development of new applications dealing with vast amounts of data, e.g., data mining,78,79 multimedia information retrieval,80,81 and medical data processing.82 Since the first processing of a large volume of data is critical in these applications for the purpose of real-time processing or to provide a quick response to users, limiting the number of features is a very important requirement. Feature selection is also a prerequisite when using multiple sets of features, as is required for subsequent processing involving classification or clustering. Some examples include aerial photo interpretation,83 correspondence in stereo vision,83 and handwriting recognition.84

Feature selection can be broadly categorized into two types: (i) the filter approach and (ii) the wrapper approach. In this work, we give more emphasis to the wrapper approach than to the filter approach. More details about feature selection can be found in Ref 85.

Filter Approach

In this approach, feature selection is performed without taking into account the classification algorithm to measure how good a candidate feature subset is. Here, the main goal is to select a subset of features that preserves as much as possible the relevant information found in the entire set of features. One example of a filter method for feature selection can be found in Ref 77. The basic idea of this method is to use a vertical compactness criterion for evaluating the quality of a given candidate feature subset. The method starts by assuming that one has some idea about how much inconsistency can be tolerated in the data being mined. The term inconsistency refers to the situation where two or more data instances have the same values for all selected features but different goal attribute values (classes). Once a maximum inconsistency rate is specified, the method searches for a feature subset that produces the minimum number of projected instances, provided that the resulting inconsistency rate is not greater than the specified maximum. The projected instances of a given feature subset X are the instances produced by eliminating all features not in X and then eliminating duplicate instances from what is left. The challenge is to find an attribute subset that produces the least possible number of projected instances, provided that the resulting inconsistency rate is not greater than the specified maximum. Some of the feature selection methods based on the filter approach can be found in Refs 86, 87. Sanchez-Marono et al.87 have contributed a good comparative study of filter methods for feature selection.
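A minimal sketch of the inconsistency criterion just described (an illustrative helper in the spirit of Ref 77, not code from the paper): instances that share all selected feature values but disagree on the class are counted as inconsistent, beyond the majority class of each projected pattern.

```python
from collections import Counter

def inconsistency_rate(instances, labels, subset):
    """Fraction of instances that are inconsistent under the projection
    onto `subset`: for each projected pattern, all but the majority-class
    instances count as inconsistent."""
    groups = {}
    for inst, y in zip(instances, labels):
        key = tuple(inst[i] for i in subset)  # projection onto the subset
        groups.setdefault(key, Counter())[y] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(instances)
```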

Wrapper Approach

In this approach, feature selection is performed by taking into account the classification algorithm that will be applied to the selected features. Here, the goal is to select a subset of features that is optimized for a given classification algorithm. The following procedure gives a clear indication of how the wrapper approach proceeds.


Procedure Wrapper()

1. Input all features.

2. Generate a candidate feature subset.

3. Create a model with the feature subset generated in Step 2.

4. Measure the performance of the model.

5. If performance is found satisfactory, then go to Step 6; otherwise go to Step 2.

6. Output a feature subset.

Note that in the wrapper-based approach, the classification algorithm is run many times, each time with a different subset of features. The performance of the classification algorithm in each run is used to evaluate the quality of the corresponding feature set. One important point about this approach is that the performance of the classification algorithm, which is used to evaluate the quality of the feature subset, cannot be evaluated on the test set. To avoid this, one must reserve a part of the training set for evaluating the performance of the classification algorithm within the loop of the feature selection procedure. One simple way of doing this consists of randomly dividing the training set into two subsets: one for training and the other for validation. The former is used to train the classification algorithm; once the algorithm is trained, its performance is measured on the held-out subset, which plays the role of an unseen test set for the classification algorithm. Indeed, when the feature selection method terminates, the best feature subset found by that method is given to the classification algorithm, which is then finally run on the entire training set. The knowledge discovered by the classification algorithm is then evaluated on the real test set, whose data instances remained unseen during the entire run of the feature selection method.

So far our discussion has focused on how to use a classification algorithm for evaluating the quality of a candidate feature subset. We now turn to the problem of how to generate the candidate feature subsets to be evaluated. Clearly, if the data being mined have a small number of attributes, we can generate all possible subsets and measure the performance of the classification algorithm on each of them. Unfortunately, the number of candidate feature subsets grows exponentially with the number of available features, as mentioned above. In data mining, the number of features is typically large, so it is not practicable to apply an exhaustive search procedure to generate and evaluate every possible feature subset.

Genetic Algorithms for Feature Selection

Among the various categories of feature selection algorithms, evolutionary algorithms, particularly GAs, are popular and widely used. Furthermore, a GA is naturally applicable to feature selection, in contrast to nonevolutionary approaches, because the problem has an exponential search space. The individual encoding and the fitness function are two important aspects to be determined before discussing the details of a GA for feature selection; the other genetic operators are the same as in the standard GA.

Individual Encoding

To the best of our knowledge, two individual encoding mechanisms are used in GA-based feature selection.

Binary Individual Encoding: The search space of a feature selection problem consists of all possible feature subsets. Each state in this search space can be represented by a fixed-length string containing N bits, where N is the number of available features. The ith bit, i = 1, 2, ..., N, indicates whether or not feature $A_i$ is selected. As an example, the chromosome of length 8 represented by 00101000 means that the third and fifth features are selected; the rest of the features are not selected for this particular chromosome.

The main advantage of this encoding scheme is its simplicity. Indeed, some authors have emphasized that when using this approach there is no need to develop problem-dependent genetic operators: any standard crossover and mutation operators developed for fixed-length binary strings will do. However, this does not imply that they are the best choice.
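A minimal sketch of this encoding and its decoding (illustrative helpers):

```python
import random

def random_chromosome(n_features: int) -> list:
    """A fixed-length binary individual: bit i selects feature i."""
    return [random.randint(0, 1) for _ in range(n_features)]

def decode(chromosome: list) -> list:
    """Indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]

# The example from the text: 00101000 selects the third and fifth features.
assert decode([0, 0, 1, 0, 1, 0, 0, 0]) == [2, 4]  # 0-based indices
```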

Index-Based Individual Encoding: An alternative form of individual encoding for feature selection was proposed in Ref 88. In this case, an individual consists of m genes, where each gene contains either the index (id) of an attribute or a flag (say, 0) indicating no attribute. The value of m is a user-specified parameter. An attribute is considered selected if its index occurs in at least one of the genes of the individual. For instance, consider the individual 0 F1 F4 0 F1, where m = 5: only features F1 and F4 are selected. The fact that F1 occurs twice in the individual, whereas F4 occurs only once, is irrelevant for the purpose of decoding the individual into a subset of selected features.

Two motivations for this unconventional individual encoding are as follows. First, the fact that an attribute can occur more than once in an individual acts as a redundancy mechanism that increases robustness and slows the loss of genetic diversity. Second, in this encoding, the length of an individual is independent of the number of original attributes. Hence, this approach is more scalable to data sets with a very large number of attributes: we can specify a number of genes much smaller than the number of original attributes, which would not be possible with the standard binary encoding for attribute selection.

The use of such an individual encoding suggests the development of new genetic operators tailored to this kind of encoding. Hence, in addition to crossover and mutation operators, the GA uses a new kind of genetic operator called delete_feature(). This operator accepts one parent as input and produces as output one offspring in which all occurrences of a chosen attribute are removed from the parent. Therefore, this operator has a bias that favors the selection of smaller attribute subsets. For instance, in the above example individual, if feature F1 is chosen to be deleted, the individual becomes 0 0 F4 0 0.
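A hypothetical sketch of this operator, under the reading given above (0 marks an empty gene; not code from Ref 88):

```python
import random

def delete_feature(parent: list) -> list:
    """Pick one attribute present in the parent and remove all of its
    occurrences, biasing the search toward smaller attribute subsets."""
    present = [g for g in parent if g != 0]
    if not present:
        return parent[:]
    victim = random.choice(present)
    return [0 if g == victim else g for g in parent]

# The text's example with m = 5 genes: deleting F1 from [0, 'F1', 'F4', 0, 'F1']
# yields [0, 0, 'F4', 0, 0].
print(delete_feature([0, 'F1', 'F4', 0, 'F1']))
```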

Fitness Function

As stated above, in the wrapper-based approach, the fitness function of a GA for feature selection involves a measure of the performance of a classification algorithm using only the subset of features selected by the corresponding GA individual. This basic idea is illustrated in the following procedure:

Genetic Wrapper Feature()

1. Create a population of individuals, each with a selected set of features.

2. Run the classification algorithm using these individuals.

3. Compute the fitness of each individual based on the performance measure.

4. Select the fittest individuals based on any of the standard selection mechanisms.

5. Apply genetic operators like crossover and mutation.

6. Test the performance of the individuals; if satisfactory, then stop and exit, otherwise go to Step 2.
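Step 3 is where the wrapper character enters. The following is a sketch under stated assumptions: it uses scikit-learn's cross_val_score with a k-NN classifier (any classifier could be substituted), expects X as a NumPy array, and the per-feature penalty weight alpha is an illustrative choice, not a value from the paper.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_fitness(chromosome, X, y, alpha=0.01):
    """Fitness of a binary individual: cross-validated accuracy of the
    classifier on the selected columns, minus a small penalty per
    selected feature."""
    selected = [i for i, bit in enumerate(chromosome) if bit == 1]
    if not selected:
        return 0.0  # an empty subset cannot classify anything
    clf = KNeighborsClassifier(n_neighbors=3)
    acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
    return acc - alpha * len(selected)
```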

Owing to this simplicity, the vast majority of GAs for feature selection follow the wrapper approach. There is, however, an alternative view, in which a GA often has aspects of both the wrapper and filter approaches, depending on what is measured in the fitness function. If the fitness function involves only a measure of performance of the classifier, e.g., the classification accuracy rate, then the GA is definitely following a wrapper-based approach.

Now suppose the fitness function involves both the classifier accuracy rate and the number of selected features. The value of the latter criterion is independent of the classifier; it depends only on the genes of an individual. Hence, its use adds to the fitness function a certain flavor of the filter approach, even though one might argue that the wrapper criterion of classifier accuracy rate still predominates in most cases. The predominance of the wrapper criterion is normally due to the use of a fitness function that is a weighted sum of classification accuracy and the number of selected attributes, where the weight of the former term is usually greater than the weight of the latter. An example of this kind of fitness function for attribute selection is proposed in Ref 89. This function takes the form

$$\text{fitness}(s) = \text{info}(s) - \text{cardinality}(s) + \text{accuracy}(s), \qquad (2)$$

where s is the candidate feature subset associated with an individual, accuracy(s) is the classification accuracy rate of a classification algorithm using only the attributes in s, cardinality(s) is the number of attributes in s, and info(s) is an information-theoretic measure estimating the discriminatory power of the features in s. Note that cardinality(s) and info(s) take on values independent of the classification algorithm used, so they are filter-oriented criteria.

It is also possible to define a fitness function that follows a pure filter approach, ignoring altogether the classification accuracy rate of the classifier. This approach normally has the advantage of reducing the processing time of the GA. This stems from the fact that, in general, the computation of a filter-oriented criterion for an attribute subset is considerably faster than running a classification algorithm with that attribute subset, as required in the wrapper approach.

One example of a GA following the filter approach can be found in the above-mentioned work.89 In one of the experiments reported by the authors, the wrapper component of the fitness function, accuracy(s), was switched off, so that the GA effectively followed a purely filter approach. Unfortunately, this purely filter variant did not produce good results.

A summary of the main aspects of fitness functions for feature selection involving a wrapper criterion is as follows:

Bala et al., in 1995,90 used a decision tree as the classification algorithm, with predictive accuracy and the number of selected features as the fitness function. In 1996, they extended this by introducing one more criterion in the fitness function, known as info. Chen et al.91 in 1999 used predictive accuracy and the number of selected features in the fitness function, considering a Euclidean decision table as the classification algorithm. Guerra-Salcedo and Whitley in 199892 used predictive accuracy as the fitness function and a Euclidean decision table as the classification algorithm. Cherkauer and Shavlik in 199688 used a decision tree as the classification algorithm, with predictive accuracy, the number of selected features, and average decision tree size as the three criteria of the fitness function. Yang and Honavar, in 1998,93 used neural networks for classification, with predictive accuracy and attribute cost as the two objectives in the fitness function. Moser and Murty, in 2000,94 used predictive accuracy and the number of selected attributes as the fitness function and nearest neighbor techniques as the classification algorithm. Ishibuchi and Nakashima, in 2000,95 used predictive accuracy, the number of selected attributes, and the number of selected instances as the fitness criteria, with nearest neighbor as the classification algorithm.

It should be noted that so far our discussion has focused on GAs for selecting attributes for a single classification algorithm, or classifier. However, a GA can also be used to select features for many classifiers (e.g., classifier ensembles) to maximize predictive accuracy. A GA that performs feature selection for generating an ensemble of classifiers is discussed in Ref 96. The basic idea is that a GA for feature selection is run many times, and each run selects a subset of features. Each of the selected feature subsets is then given to a classifier of the ensemble, so that different classifiers of the ensemble are trained with different feature subsets. Some of the other potential contributions in this direction are presented in Refs 97–100.

DATA MINING USING GENETIC ALGORITHMS

Recall that data mining is one of the important steps of the KDD process; since in the present article we are interested in CRM, other data mining tasks are not discussed. The interested reader can refer to Ref 3 for more about those tasks.

Classification Rule Mining

This task has been studied for many decades by the machine learning and statistics communities.101,102 The goal is to predict the value (the class) of a user-specified goal attribute based on the values of other attributes, called predicting attributes. Classification rules can be considered a particular kind of prediction rule, where the rule antecedent ("IF" part) contains predicting attributes and the rule consequent ("THEN" part) contains a predicted value for the goal attribute. Alternatively, CRM can be considered a task that uncovers knowledge represented in the form of IF–THEN statements:

RULE: IF $cond_1$ AND $cond_2$ AND ... AND $cond_m$ THEN Class = value,

where each condition in the rule antecedent can be written as $attribute_i$ OP $value_{ij}$; $attribute_i$ denotes the ith attribute in the set of predictor attributes, $value_{ij}$ denotes the jth value of the domain of attribute i, and OP is a comparison operator, usually in $\{=, \neq, >, <, \leq, \geq\}$.

A classification rule can also be derived from a decision tree.3 The decision tree induction algorithm is one of the most successful learning algorithms, owing to its various attractive features: simplicity, comprehensibility, and the absence of parameters.103 Various improvements over the original decision tree algorithm, such as ID3,104 ID4,105 ID5,106 ITI,107 C4.5,108 and CART,109 have been proposed.

Example: The table below shows a very small training set with 14 samples, four predictor attributes (age, income, student, and credit rating), and one class attribute (buys computer). Based on these training instances, the objective of a CRM algorithm is to discover rules that predict the value of buys computer for test data instances.

Age       Income   Student   Credit rating   Class
≤30       high     no        fair            no
≤30       high     no        excellent       no
31...40   high     no        fair            yes
>40       medium   no        fair            yes
>40       low      yes       fair            yes
>40       low      yes       excellent       no
31...40   low      yes       excellent       yes
≤30       medium   no        fair            no
≤30       low      yes       fair            yes
>40       medium   yes       fair            yes
≤30       medium   yes       excellent       yes
31...40   medium   no        excellent       yes
31...40   high     yes       fair            yes
>40       medium   no        excellent       no

Based on the basic decision tree algorithm, after recursive partitioning of each of the subsets (i.e., until all samples for a given node belong to the same class, or there are no remaining attributes on which the samples may be further partitioned, or there are no samples for the branch of a test attribute), the entire tree can be generated, as illustrated in Figure 1.

[FIGURE 1 | Decision tree.]

The IF–THEN classification rules can be derived by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure 1 are as follows:

IF age = "≤30" AND student = "no" THEN class = "no"
IF age = "≤30" AND student = "yes" THEN class = "yes"
IF age = "31...40" THEN class = "yes"
IF age = ">40" AND credit rating = "excellent" THEN class = "no"
IF age = ">40" AND credit rating = "fair" THEN class = "yes"
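Applying this rule set to an unseen record is straightforward. The following is an illustrative sketch of the five rules above as a classifier (the record keys are hypothetical names, not from the paper):

```python
def classify(record: dict) -> str:
    """The five rules read off the decision tree of Figure 1, applied in order."""
    age, student, credit = record["age"], record["student"], record["credit_rating"]
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31...40":
        return "yes"
    # age == ">40": the class depends only on credit rating
    return "no" if credit == "excellent" else "yes"

print(classify({"age": ">40", "student": "no", "credit_rating": "fair"}))  # yes
```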

Although the generated rules are comprehensible and acceptably accurate, decision tree algorithms do not handle two types of exceptions, namely (i) when the information gain for two or more attributes is the same, and (ii) when two or more classes have equal probabilities in a tree leaf. In addition, a continuous attribute must be discretized with optimal intervals; otherwise, the branching factor will be large. Furthermore, if the dimensionality of the data set is high, which is common in data mining, particularly for biological data, the depth of the tree will be large.

Another problem associated with conventional decision tree building algorithms is that they perform a greedy search for attributes to be put into the tree. By greedy search, we mean that this kind of algorithm builds a tree step by step, adding one attribute at a time to the current partial tree, and at each step making the best possible local choice. Note, however, that a sequence of best possible local decisions does not guarantee the best possible global decision. This limitation is one of the motivations to use genetic algorithms to discover classification rules.44,110

Genetic Algorithms for CRM

In this subsection, we discuss several issues related to developing GAs for CRM, including problem-specific operators, fitness functions, individual representation, and population initialization/seeding. In addition, some recent publications that advance this area are cited.

Genetic Representations

In earlier work on GAs, each individual corresponds to a candidate solution to a given problem. However, as the problem here is to mine an optimal set of classification rules rather than a single rule, it is important to discuss how to encode a set of rules in a GA population. There are two approaches to accomplish this task: Michigan and Pittsburgh.

Pittsburgh versus Michigan: In the Pittsburgh approach, each individual of the GA population represents a set of classification rules, i.e., an entire candidate solution. In contrast, in the Michigan approach each individual represents a single classification rule.

In the Michigan approach, there are at least two possibilities for discovering a set of rules. The first is to let each run of the GA discover a single rule (the best individual produced over all generations) and simply run the GA multiple times to discover a set of rules. An obvious disadvantage of this strategy is that it is computationally expensive, requiring many GA runs. The second possibility is to design a more elaborate GA where a set of individuals, possibly the whole population, corresponds to a set of rules.

In the Pittsburgh approach, since an individual represents a rule set, we can address rule interaction by using a fitness function that directly measures the performance of the rule set as a whole. On the other hand, in this approach an individual tends to be significantly more complex, at least syntactically longer, than in the Michigan approach, which often leads to more complex genetic operators. In the Michigan approach, by contrast, since an individual represents a single rule, it is syntactically shorter and the genetic operators can be simpler; however, the problem of rule interaction is ignored. Hence, assuming that we want to discover a set of rules, in the Michigan approach we often need to add to the GA some method(s) that foster the discovery of a good set of rules, rather than convergence to a single good rule. One way of preventing this undesirable convergence is to use some kind of niching mechanism.111,112 For instance, fitness sharing was successfully used in a GA for predicting rare events.113

A further discussion of the Michigan and Pittsburgh approaches can be found in Refs 22, 114, 115. It is also possible to use a hybrid Michigan/Pittsburgh approach for rule mining. In particular, Ref 116 proposes a GA where the settings of some parameters can be varied to explore different combinations of the Michigan and Pittsburgh approaches.

Throughout this paper, we refer to the encoding of a single rule into an individual, so we are implicitly assuming the use of the Michigan approach.

Each individual in the population represents a candidate rule of the form "if antecedent then consequent." The antecedent of this rule can be formed by a conjunction of at most n − 1 attributes, where n is the number of attributes being mined. Each condition is of the form $A_i = V_{ij}$, where $A_i$ is the ith attribute and $V_{ij}$ is the jth value of the ith attribute's domain. The consequent consists of a single condition of the form $G = g_l$, where G is the goal attribute and $g_l$ is the lth value of the goal attribute's domain.

When using a binary (low-level) encoding, an attribute will in general be assigned a certain number of bits, which depends on the data type of the attribute (i.e., whether categorical or continuous). Some GAs for rule discovery that use binary encoding can be found in Refs 116–118, along with their pros and cons. High-level encoding seems particularly advantageous in the case of continuous attributes, where a binary encoding tends to be somewhat cumbersome and/or inefficient, particularly for a large number of continuous attributes, as discussed above. For more details, readers can refer to Ref 13.

Note that a hybrid low- and high-level encoding can also be used. Indeed, such a hybrid encoding seems suitable for representing classification rules involving mixed data; one such reference is Ref 119.

A string of fixed size encodes an individual with n genes representing the values that each attribute can assume in the rule. In addition, each gene, except the nth, also contains a Boolean flag (fp/fa) indicating whether or not the corresponding condition is present in the rule antecedent. Hence, although all individuals have the same genome length, different individuals represent rules of different lengths.
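One way to make this encoding concrete is the following hypothetical sketch (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gene:
    value_index: int   # index j of value V_ij in attribute A_i's domain
    present: bool      # the f_p/f_a flag: is condition i active?

@dataclass
class RuleIndividual:
    """Michigan-style fixed-length genome: one gene per predicting
    attribute plus the encoded goal value."""
    genes: List[Gene]  # n - 1 condition genes (positional: gene i <-> A_i)
    goal_value: int    # index l of the predicted class g_l

    def antecedent(self):
        """Active conditions as (attribute index, value index) pairs."""
        return [(i, g.value_index) for i, g in enumerate(self.genes) if g.present]

# A rule with two active conditions, e.g. IF A_0 = V_01 AND A_2 = V_23 THEN G = g_1:
rule = RuleIndividual([Gene(1, True), Gene(0, False), Gene(3, True)], goal_value=1)
print(rule.antecedent())  # [(0, 1), (2, 3)]
```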

Note that this approach implicitly assumes a positional encoding of attributes in the genotype, which simplifies the action of genetic operators such as crossover. By choosing the same crossover points in both parents, we can directly swap the corresponding genetic material between them without producing invalid offspring. In contrast, the direct use of a variable-length genotype may cause problems at crossover time. Details with extensive examples can be found in Refs 120, 121.

Similarly, the goal attribute is also encoded in the individual. This is one possibility; choosing the best rule consequent on the fly is another.22 Note that both approaches can produce offspring with lower fitness than the parents after a crossover operation, because different individuals of the population can be associated with different rule consequents. This is a point that deserves more research effort before it can be applied successfully. One solution is to use some kind of speciation mechanism,122 where only individuals of the same species (in this case, having the same rule consequent) can mate with each other.

The third possibility is to associate all individuals of the population with the same predicted class, which is never modified during the execution of the algorithm. Hence, if we want to discover a set of classification rules predicting k different classes, we would need to run the evolutionary algorithm at least k times, so that in the ith run, i = 1, 2, 3, ..., k, the algorithm discovers only rules predicting the ith class.

Fitness Function
Since the general goal of data mining is to extract knowledge from data, it is important to bear in mind some desirable properties of discovered knowledge: namely, discovered knowledge should be accurate, comprehensible, and interesting. Let us discuss how these criteria can be defined and used in the fitness evaluation of individuals in GAs.

Comprehensibility Metric: There are various ways to quantitatively measure rule comprehensibility. A standard way of measuring comprehensibility is to count the number of rules and the number of conditions in these rules. If these numbers increase, then comprehensibility decreases. When using a GA for classification rule discovery, the measure of rule comprehensibility can easily be incorporated into a fitness function. In general, this is done by using a weighted fitness function, with one term measuring predictive accuracy and another term measuring rule comprehensibility, where each term has a user-defined weight. Variants of this basic idea are also used.115,123 Although the objective approach is easy to implement, it ignores the subjective aspects of rule comprehensibility. One approach that does take this subjective aspect into account consists of using a kind of interactive fitness function.124

Predictive Accuracy: As already mentioned, our rules are of the form IF A THEN C. The antecedent part of the rule is a conjunction of conditions. A very simple way to measure the predictive accuracy of a rule is

PA = |A & C| / |A|, (3)

where |A & C| is the number of records satisfying both A and C, and |A| is the number of records satisfying A.

Alternatively, the performance of a classification rule with respect to predictive accuracy can be summarized by a confusion matrix.3,125 In addition, it is sometimes desirable to use application-dependent fitness functions that are not only related to predictive accuracy but also implement a more direct measure of profitability for the user (e.g., Ref 126).
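As an illustration only (the record layout and predicate helpers are our own assumptions), Eq. (3) and the confusion-matrix counts for a single rule can be computed as follows:

```python
def predictive_accuracy(records, antecedent, consequent):
    """PA = |A & C| / |A| from Eq. (3); antecedent/consequent are
    predicates over a record (names are ours, not the papers')."""
    covered = [r for r in records if antecedent(r)]
    if not covered:
        return 0.0
    return sum(1 for r in covered if consequent(r)) / len(covered)

def confusion_counts(records, antecedent, consequent):
    """TP, FP, FN, TN counts for the rule IF A THEN C."""
    tp = fp = fn = tn = 0
    for r in records:
        a, c = antecedent(r), consequent(r)
        if a and c: tp += 1
        elif a: fp += 1
        elif c: fn += 1
        else: tn += 1
    return tp, fp, fn, tn

# Hypothetical records and rule: IF outlook=sunny THEN play=no.
data = [{"outlook": "sunny", "play": "no"},
        {"outlook": "sunny", "play": "yes"},
        {"outlook": "rain", "play": "yes"}]
A = lambda r: r["outlook"] == "sunny"
C = lambda r: r["play"] == "no"
print(predictive_accuracy(data, A, C))  # 0.5
print(confusion_counts(data, A, C))     # (1, 1, 0, 1)
```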

Finally, let us discuss the difficult issue of rule interestingness.

Interestingness: Rule interestingness is measured by two kinds of approaches: (i) subjective and (ii) objective. In the subjective approach, the best judge is the human user; an interactive fitness function can implement a subjective measure of the interestingness of a rule/rule set. Romao et al.127 proposed a kind of subjective approach, where the fitness function evaluates both the predictive accuracy and the surprisingness of the rule represented by an individual.

Alternatively, one can use an objective approach, where the fitness function incorporates an objective measure of rule interestingness. Some of the proposed methods in this direction can be obtained from Ref 128.

The overall fitness is computed as the arithmetic weighted mean

f(rule) = (w1 × rule_comp + w2 × PA + w3 × rule_inter) / (w1 + w2 + w3), (4)

where w1, w2, and w3 are user-defined weights, rule_comp is the measure of comprehensibility, PA is the predictive accuracy, and rule_inter is the interestingness of the rule.
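A minimal sketch of Eq. (4), with a simple size-based comprehensibility term and arbitrary illustrative weights (none of these choices come from the cited works):

```python
def rule_fitness(rule_comp, pa, rule_inter, w1=1.0, w2=2.0, w3=1.0):
    """Arithmetic weighted mean of Eq. (4); the default weights are
    arbitrary illustrative values, not recommendations."""
    return (w1 * rule_comp + w2 * pa + w3 * rule_inter) / (w1 + w2 + w3)

def comprehensibility(num_conditions, max_conditions):
    """One simple option: comprehensibility shrinks as the rule grows."""
    return 1.0 - num_conditions / max_conditions

print(rule_fitness(comprehensibility(2, 10), pa=0.8, rule_inter=0.3))  # 0.675
```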

Although the weighted fitness function is popular in the literature, it is not always a good approach. In particular, when there are several criteria to be evaluated, it is often the case that these criteria are conflicting and/or noncommensurable. For instance, predictive accuracy and rule comprehensibility are often conflicting criteria (improving a rule with respect to one of them can worsen the rule with respect to the other). The three identified criteria (predictive accuracy, comprehensibility, and rule interestingness) are intuitively noncommensurable and can be conflicting in many cases. This suggests the use of a multiobjective approach for rule discovery. Some of the proposals in this direction, with promising solutions, can be obtained from Refs 120, 121. A good comprehensive review of multiobjective evolutionary algorithms for rule discovery can be found in Ref 13.

Genetic Operators
One can also consider the idea of uniform crossover.129 However, depending on whether a rule is too specific or too general, one can design/adopt a special kind of crossover to generalize or specialize a given rule. After crossover is complete, the algorithm checks whether any invalid individual has been created. If so, a repair operator is used to produce valid individuals.

The mutation operator randomly transforms the value of an attribute into another value belonging to the same domain of the attribute.

Besides crossover and mutation, the insert and remove operators directly try to control the size of the rules being evolved, thereby influencing the comprehensibility of the rules. These operators randomly insert or remove a condition in the rule antecedent. They are not part of the regular GA.
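Reusing the hypothetical (value, presence-flag) gene layout from the earlier sketch, the two operators might look like this:

```python
import random

def insert_condition(genes):
    """Specialize a rule: switch ON one currently absent condition flag."""
    absent = [i for i, (_, present) in enumerate(genes) if not present]
    if absent:
        i = random.choice(absent)
        genes[i] = (genes[i][0], True)
    return genes

def remove_condition(genes):
    """Generalize a rule: switch OFF one currently present condition flag."""
    present = [i for i, (_, p) in enumerate(genes) if p]
    if present:
        i = random.choice(present)
        genes[i] = (genes[i][0], False)
    return genes
```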

DATA MINING USING FUZZY-GENETIC APPROACH

Classification using the fuzzy approach is becoming increasingly popular, as it uses linguistic variables, which are very close to human descriptions of structure in data.130 In addition, fuzzy rule based systems are highly capable of handling nonlinear classification problems. How the classification problem is handled is therefore an important research topic in fuzzy classification systems.131–134

Fuzzy systems themselves do not exhibit learning capabilities. EAs, particularly GAs, can be used as learning algorithms for fuzzy systems.135 In other words, how to adapt GAs for mining FCRs is the focus of this section.

Fuzzy Classification Rule Mining
In contrast to classification rules, FCRs are popular because of their comprehensibility136,137 (i.e., each fuzzy if–then rule is interpreted via linguistic values such as large, medium, and so on) and higher classification ability. Fuzzy rules for classification systems are obtained via two approaches: (i) directly from experts and (ii) through an automatic learning process. Several methods have been introduced for fuzzy classification.138–144

Without loss of generality, in FCRM the ranges of the input variables are divided into subregions in advance, and a fuzzy rule is defined for each grid region. We call these subregions fuzzy regions. A fuzzy set with a MF is then defined for each subinterval. The MF defines the degree to which an input belongs to the fuzzy set.

In general, the fuzzy rule for classification problems in a d-dimensional space is represented as follows:

IF x1 is A1 and x2 is A2 and . . . and xd is Ad

THEN class C,

where <x1, x2, . . . , xd> is a d-dimensional pattern vector, Ai, i = 1(1)d, is an antecedent fuzzy set, and C is the class label of the rule.

The if part is connected by the AND operator, and the fuzzy rules are connected by the OR operator. These fuzzy rules are defined according to the experts' knowledge.

The following pseudocode is used for classification:

Algorithm for CRM()

Classifier Design

• Divide the input space into subregions.

• Assign a fuzzy set with a MF to each subregion.

• Define the fuzzy rules according to the experts' knowledge.

Classification

• For each fuzzy rule, considering the if part, compute the degree of membership of each input variable and perform the AND operation to find the minimum value.

• Considering the OR operation among the fuzzy rules, select the rule with the largest degree of membership and assign the unknown sample to the class associated with that rule.
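A runnable sketch of this procedure, assuming triangular MFs and two hand-written expert rules that are purely illustrative:

```python
def tri(a, b, c):
    """Triangular MF with support [a, c] and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Two illustrative rules over a 2D input:
# IF x1 is low AND x2 is low THEN class1; IF x1 is high AND x2 is high THEN class2.
low = tri(-0.5, 0.0, 0.5)
high = tri(0.5, 1.0, 1.5)
rules = [([low, low], "class1"), ([high, high], "class2")]

def classify(x):
    best_label, best_degree = None, 0.0
    for mfs, label in rules:
        # AND over the if part: minimum membership across conditions.
        degree = min(mu(xi) for mu, xi in zip(mfs, x))
        # OR across rules: keep the rule firing with the largest degree.
        if degree > best_degree:
            best_label, best_degree = label, degree
    return best_label, best_degree

print(classify([0.2, 0.1]))  # ('class1', 0.6)
```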

Although this approach is understandable, it has many drawbacks. It is usually difficult to acquire knowledge from experts, and even when acquired, the performance of the resulting classifier is often far from satisfactory. At the same time, it is very difficult to divide the input space into regions beforehand. We do not know to what extent we need to divide the input space, and the size of the division should be determined not for the range of the input variable but for each class that is approximated. In addition, if some of the input variables are correlated, it is inadequate to approximate a region by a rectangle parallel to the input axes. To solve these problems, several fuzzy classifiers with learning capability have been developed. These fuzzy classifiers extract fuzzy rules with variable-size fuzzy regions from data, and the fuzzy regions are not necessarily rectangles parallel to the input axes.

FCRM with Learning Capability
In FCRM, determining how many fuzzy rules must be generated to realize sufficient recognition rates for both the training and test data sets is a challenging problem. The reason is that it depends on how the data for each class are distributed and how data from different classes overlap in the input space. For example, if class i is approximated by one fuzzy region, then one fuzzy rule is sufficient for class i. But if this does not hold, we need to define more than one fuzzy rule to resolve overlaps between classes and thereby improve the recognition rates. There are several ways to solve this problem by introducing learning. Two of them are as follows: (i) detect overlaps; if they exist, then generate additional fuzzy rules or modify existing fuzzy rules to resolve the overlaps; (ii) generate fuzzy rules without considering overlaps and then tune the MFs for overlap resolution. In the latter method, class data are divided into clusters in advance (preclustering) or after rule generation (postclustering).

The following steps are required for rule generation:

1. Generate a fuzzy rule using all or part of the training data included in a class. If no data remain to generate rules, go to Step 3.

2. Check whether the fuzzy region defined in Step 1 overlaps with other fuzzy regions defined previously. If there is no overlap, go to Step 1. Otherwise, resolve the overlap by modifying fuzzy rules and go to Step 1.

3. Tune the MFs so that the recognition rate of the training data is improved.

In the preclustering method, first the training data belonging to a class are divided into several clusters, and then a fuzzy rule is generated for each cluster. In the postclustering method, first one fuzzy rule is generated for each class, and then, if the recognition rate is not sufficient, additional fuzzy rules are generated to resolve overlaps between classes. After rule generation by preclustering or postclustering, we can improve the recognition rate by tuning the fuzzy rules, i.e., the locations and slopes of the MFs, using the training data.

FIGURE 2 | Approaches for FCRM using GAs.

Many approaches have been proposed for FCR generation, such as heuristic procedures,138,145 neurofuzzy techniques,146–149 clustering methods,136,150 data mining,151–153 and GAs.137,140,154–160 In this paper, we restrict ourselves to GA-based FCRM.

GAs for FCRM
Over the past few decades, GAs have been employed for generating fuzzy if–then rules and adjusting the MFs of fuzzy sets.161,162 In this section, we will focus on fuzzy if–then rule generation (specifically for classification) using GAs. Figure 2 shows a taxonomy of GAs for FCR mining.

From the taxonomy, one can conclude that there are at least three different approaches to generating FCRs. The first approach fixes the MFs a priori and generates FCRs by GAs; the second approach generates rules by some CRM algorithm and then fuzzifies the antecedent conditions; and the third approach evolves the MFs and classification rules simultaneously. Let us discuss each of these approaches with their strengths and weaknesses.

Unlike GAs for CRM, whose antecedent conditions are Boolean, here the antecedent conditions are fuzzy. However, the two approaches are conceptually similar. Therefore, many aspects of GAs designed for classification rule generation can be used for FCR generation; in particular, the method of encoding an FCR antecedent into an individual and determining the consequent part.163 The determination of the consequent part suggested in Ref 95 is as follows:

As described above, for M-class, d-dimensional classification problems, the fuzzy if–then rule is of the following form:

Rule R: IF x1 is A1, x2 is A2, and . . . , and xd is Ad THEN Class C, where <x1, x2, . . . , xd> is a d-dimensional pattern vector, Ai is an antecedent fuzzy set, and C is the class label of the rule.

The consequent class C of each fuzzy rule is determined by the following widely used procedure proposed by Ishibuchi and Nakashima95:

• Calculate the compatibility grade of each training pattern xp = <xp1, xp2, . . . , xpd> with the fuzzy if–then rule R, i.e., μ(xp) = μA1(xp1) × μA2(xp2) × . . . × μAd(xpd), where μAi(xpi) is the MF of the antecedent fuzzy set Ai.

• Calculate the sum of the compatibility grades of the training patterns with the fuzzy if–then rule R for each class, i.e., βClassh(R) = Σxp∈Classh μ(xp), h = 1, 2, . . . , M.

• Find the class that has the maximum βClassh(R) value. If a single class takes the maximum value, then the consequent class C will be that class. If two or more classes take the maximum value, then there is no unique consequent class for rule R and C will be φ. Likewise, if βClassh(R) = 0 for all classes, that is, no training pattern is compatible with rule R, then the consequent class C will also be φ.
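A sketch of this consequent-selection procedure under our own simplifying assumptions (product compatibility, integer class labels, and a toy one-dimensional MF):

```python
import math

def consequent_class(patterns, labels, antecedent_mfs, num_classes):
    """Pick the consequent class of one rule from compatibility grades:
    mu(xp) is the product of antecedent memberships, beta[h] sums mu over
    the patterns of class h; ties and all-zero sums give None ('phi')."""
    beta = [0.0] * num_classes
    for xp, h in zip(patterns, labels):
        mu = math.prod(mf(x) for mf, x in zip(antecedent_mfs, xp))
        beta[h] += mu
    best = max(beta)
    if best == 0.0 or beta.count(best) > 1:
        return None
    return beta.index(best)

# Hypothetical 1D rule "IF x is low THEN ?" over three labeled patterns.
low = lambda x: max(0.0, 1.0 - abs(x) / 0.5)
print(consequent_class([[0.1], [0.2], [0.9]], [0, 0, 1], [low], 2))  # 0
```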

The above procedure for choosing the best rule consequent for a given rule antecedent seems intuitive and is a generalization of CRM. However, in the case of fuzzy rules there can be a considerable number of data instances whose degree of matching with the antecedent part is close to 0, yet whose collective effect significantly influences the choice of class. This can lead to undesirable results: the procedure has a bias favoring the majority class in the training set, which tends to be undesirable for data sets with imbalanced class distributions. One solution to mitigate this problem is thresholding.164

While computing the degree of matching of the antecedent part, the design should at least consider a data-sensitive operator, instead of adopting the product operator as the only tool. An example of such an operator is one based on the median of the membership degrees of all rule conditions, as proposed in Ref 165.

After generating the FCRs, we can now classify an unknown sample as follows:

A new input pattern xq = <xq1, xq2, . . . , xqd> can be classified using either of the following two procedures:

(1) Fuzzy reasoning based on a single winner rule: This is a two-step method:

• Calculate αClassh, h = 1, 2, . . . , M, as αClassh = max{μj(xq) : rule Rj has consequent class Classh}.

• Classify xq as the class with the maximum value of αClassh.

(2) The fuzzy reasoning method based on voting by multiple fuzzy if–then rules: The class of an input pattern xp, based on voting by the multiple fuzzy if–then rules in a rule set Q that are compatible with xp, can be determined by the following two-step method:

• Calculate αClassh, h = 1, 2, . . . , M, as αClassh = ΣRj∈Q, Cj=Classh μj(xp).

• Classify xp as the class with the maximum value of αClassh.
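The two procedures differ only in how the per-class scores αClassh are aggregated (maximum versus sum), as the following sketch with hypothetical rule compatibilities shows:

```python
def single_winner(rule_mus, rule_classes, num_classes):
    """Procedure (1): per class, keep the largest rule compatibility."""
    alpha = [0.0] * num_classes
    for mu, c in zip(rule_mus, rule_classes):
        alpha[c] = max(alpha[c], mu)
    return max(range(num_classes), key=lambda h: alpha[h])

def weighted_vote(rule_mus, rule_classes, num_classes):
    """Procedure (2): per class, sum the compatibilities of its rules."""
    alpha = [0.0] * num_classes
    for mu, c in zip(rule_mus, rule_classes):
        alpha[c] += mu
    return max(range(num_classes), key=lambda h: alpha[h])

# Hypothetical compatibilities of four rules with one input pattern.
mus, classes = [0.9, 0.4, 0.5, 0.5], [0, 1, 1, 1]
print(single_winner(mus, classes, 2))  # 0 (the mu = 0.9 rule wins alone)
print(weighted_vote(mus, classes, 2))  # 1 (class 1 accumulates 1.4 > 0.9)
```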

In contrast to the first approach, the second approach evolves only the MFs. In other words, here the GAs are used when crisp prediction rules have already been discovered by a conventional rule induction algorithm and we just want to fuzzify the discovered crisp classification rules. One contribution in this direction is presented in Ref 166.

In the third approach, GAs are used to evolve both the rules and the MFs. In this case, the genotype of an individual has at least two kinds of genes: (i) genes defining the contents of fuzzy rules, and (ii) genes defining MFs. Some representative contributions of this approach are presented in Refs 167, 168.

The fitness evaluation can be carried out by a fuzzified confusion matrix.169 A taxonomy and the current research trends and prospects of genetic fuzzy systems are presented in Ref 170.

GAs for Multiobjective FCRM
Traditionally, the main objective of fuzzy classification systems was to maximize accuracy. The majority of the above-cited techniques focus on accuracy, neglecting interpretability, the most distinctive feature and primary merit of FCRs. Interpretability depends on several factors, such as the number of features, fuzzy rules, and antecedent fuzzy sets, the shape of the fuzzy sets, and the completeness, consistency, and compactness of the fuzzy rules. Studies171–178 have addressed the issue of interpretability in FCRs. These two objectives, accuracy of the system and interpretability of the fuzzy rules, are conflicting, and in practice one of them prevails over the other. Therefore, researchers have recently been trying to find solutions that simultaneously optimize these objectives without prioritizing one over the other.179–181 In Ref 182, Cordon presents a review of the most representative genetic fuzzy systems relying on Mamdani-type fuzzy rule based systems to search for interpretable linguistic fuzzy models with good accuracy.

FCRM is not only a multiobjective problem but also an NP-hard problem. For example, in the M-class, d-dimensional classification problem, the fuzzy if–then rule is of the following form:

IF x1 is A1 and x2 is A2 and . . . and xd is Ad

THEN class C,

where x = <x1, x2, . . . , xd> is a d-dimensional pattern vector, Ai is an antecedent fuzzy set, and C is the class label of the rule.

For each of the d attributes, the antecedent of a fuzzy rule takes one of k linguistic values or don't care. So there are (k + 1)^d possible fuzzy if–then rules for the d-dimensional pattern classification problem; the number of rules is thus very high. The primary goals in fuzzy rule generation are complexity reduction and interpretability improvement.

Let us discuss the objectives of FCRM that are normally considered in multiobjective optimization.

The accuracy of a fuzzy classification system is defined as

J = (1/m) Σ_{k=1}^{m} ek, (5)

where m is the number of training patterns and ek indicates whether pattern xk is classified correctly: ek = 1 if xk is classified correctly and ek = 0 if xk is classified falsely.

Interpretability136,179,183–187 refers to the ability to express the behavior of the system in a human-understandable way. It is a subjective property with no formal definition, but the following aspects are considered to describe it:

• The number of variables: Fuzzy models should use fewer variables.

• The number of fuzzy rules: The fuzzy model should use few fuzzy rules; based on human experience, this number should be less than 10.


• Completeness, consistency, and compactness of fuzzy rules: Fuzzy rules should cover the whole input space, i.e., for each effective input variable combination, at least one fuzzy rule must fire. The rules in the rule base should not contradict each other. There must be no rule whose antecedent is a subset of another rule's antecedent, and no rule may appear more than once in the rule base.

• Characteristics of membership functions: Normality and convexity are two essential properties of the MFs used in modeling.

An overview of interpretability measures in the context of linguistic fuzzy rule based systems is presented in Ref 188. More about interpretability can be obtained from Refs 172, 189, 190.

MOGAs are used to generate fuzzy rules with a better trade-off between interpretability and accuracy. To achieve these objectives, the NSGA-II, SPEA, and SPEA2 algorithms are frequently used, and some modified versions have also been proposed, as discussed below. Furthermore, multiobjective algorithms are used for feature selection and tuning of MFs in addition to rule selection. Some representative studies are as follows:

Ishibuchi et al.191 used a two-stage rule selection procedure. In the first step, a set of candidate fuzzy rules is generated by a heuristic rule generation procedure, and then MOGAs are used to optimize the criteria.

In 1995, Ishibuchi et al.191 proposed a GA for two-objective fuzzy rule generation, where the two objectives are to maximize the number of correctly classified training patterns and to minimize the number of linguistic rules in the rule set. They combine the two objectives into a single scalar function as

f(S) = WNCP × NCP(S) − WS × |S|, (6)

where WNCP and WS are randomly specified weights, S is the rule set, NCP(S) is the number of correctly classified training patterns, and |S| is the number of linguistic rules in S. For each pair of selected parent individuals, the weight WNCP is a random number from [0, 1] and WS = 1 − WNCP. Multiple solutions are preserved from the current generation to the next generation as elite solutions. These elite solutions are randomly selected from a tentative set of nondominated solutions that is stored and updated at each generation of the two-objective GA.
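A sketch of this random-weight scalarization of Eq. (6); the classify helper and the data layout are assumptions made for illustration:

```python
import random

def scalar_fitness(rule_set, training_data, classify):
    """Random-weight scalarization of the two objectives of Eq. (6).
    `classify(rule_set, x)` is an assumed helper returning the predicted
    class; fresh weights are drawn for every evaluation, as described."""
    w_ncp = random.random()      # weight on correctly classified patterns
    w_s = 1.0 - w_ncp            # weight on the number of rules |S|
    ncp = sum(1 for x, y in training_data if classify(rule_set, x) == y)
    return w_ncp * ncp - w_s * len(rule_set)

# Toy usage: a 'rule set' whose classifier always predicts class 0.
data = [((0.1,), 0), ((0.9,), 1)]
print(scalar_fitness(["r1"], data, lambda rs, x: 0))
```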

Furthermore, in 2001, Ishibuchi and Yamamoto154 proposed a three-objective GA to find nondominated rule sets. They considered the three objectives as

1. maximizing f1(S), the number of correctly classified training patterns,

2. minimizing f2(S), the number of fuzzy rules in S, and

3. minimizing f3(S), the total rule length of the fuzzy rules in S,

where S is a subset of the generated fuzzy rules. A MOGA has been employed,191,192 which uses the fitness function

fitness(S) = w1 × f1(S) − w2 × f2(S) − w3 × f3(S), (7)

where w1, w2, and w3 are random weights satisfying w1, w2, w3 ≥ 0 and w1 + w2 + w3 = 1. The nondominated rule sets are stored in a tentative pool, separate from the current population. The pool is updated in every generation to store the nondominated rule sets that are examined. From the pool, Nelite rule sets are randomly selected as elite individuals and added to the new solutions.

In 2004, Ishibuchi et al.153 proposed a multiobjective genetic local search algorithm for generating a small number of fuzzy if–then rules for pattern classification. The method combines local search and rule weight learning.154,193 In this framework, three objectives are used: maximization of classification accuracy, minimization of the number of selected rules, and minimization of the total rule length. In the first stage, fuzzy if–then rules are generated and prescreened using two rule evaluation measures used in data mining,3 confidence and support. The confidence and support of a fuzzy if–then rule Aq ⇒ Cq are defined, respectively, as follows:

Confidence: c(Aq ⇒ Cq) = |D(Aq) ∩ D(Cq)| / |D(Aq)|, (8)

Support: s(Aq ⇒ Cq) = |D(Aq) ∩ D(Cq)| / |D|, (9)

where D is the set of training patterns, |D(Aq)| is the number of training patterns compatible with the antecedent Aq, and |D(Aq) ∩ D(Cq)| is the number of training patterns compatible with both the antecedent Aq and the consequent Cq. In the second stage, a number of nondominated rule sets with respect to the above objectives are obtained using a multiobjective genetic local search algorithm.
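For illustration, a crisp-counting version of Eqs. (8) and (9) is sketched below; for fuzzy rules the counts would be replaced by summed compatibility grades:

```python
def confidence_and_support(db, covers_antecedent, covers_consequent):
    """Crisp-counting versions of Eqs. (8) and (9) over a pattern set db."""
    d_a = [t for t in db if covers_antecedent(t)]
    d_ac = [t for t in d_a if covers_consequent(t)]
    confidence = len(d_ac) / len(d_a) if d_a else 0.0
    support = len(d_ac) / len(db) if db else 0.0
    return confidence, support

# Hypothetical patterns (x, label) and rule: IF x > 0.5 THEN label == 1.
db = [(0.7, 1), (0.9, 0), (0.2, 1), (0.8, 1)]
conf, supp = confidence_and_support(db,
                                    lambda t: t[0] > 0.5,
                                    lambda t: t[1] == 1)
print(conf, supp)  # 0.666..., 0.5
```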

In 2006, they194,195 applied NSGA-II for multiobjective fuzzy rule selection with two problem-specific heuristic tricks, to find small rule sets with high accuracy. That is, they used two objectives in fuzzy rule selection, namely the minimization of the error rate on the training patterns and the minimization of the number of fuzzy rules; biased mutation is used as the first trick, and a kind of local search that removes unnecessary rules is used as the second trick. The latter works as follows: as they used single-winner rule-based methods for the classification of each pattern by the rule set S, some rules in this set are never the winner rule for any pattern. They eliminated these rules from the rule set S. This is performed after the first objective is calculated and before the second objective is calculated.

In 2006, Chen et al.196 proposed an approach based on a MOGA to construct an interpretable and precise fuzzy classification system from data. In the first stage, they used a MOGA for feature selection and dynamic grid partitioning with the following three objectives:

1. the number of wrongly classified training patterns JERR,

2. the number of features used Jf , and

3. the number of fuzzy rules Jr.

The fitness function is

MinF1 = w1 × JERR + w2 × Jf + w3 × Jr, (10)

where w1, w2, and w3 are positive weights chosen by the user. The number of rules Jr is defined as Jr = ∏_{j=1,...,n} kj, where kj is the number of dynamic grid partitions of feature xj. They then apply genetic operators to evolve the population. In the second step, they optimized the initial fuzzy classification system obtained in the first stage. For this, they first used a GA to exclude unnecessary fuzzy rules and extract the significant ones; the objective is to select a subset of rules while keeping the classification performance. They used the single objective function

MinF2 = w4 × JERR + w5 × Jr + w6 × Jm, (11)

where JERR is the number of wrongly classified training patterns, Jr is the number of fuzzy rules, and Jm is the average length of the fuzzy rules; w4, w5, and w6 are positive weights. Then, to further improve the classification performance of the fuzzy classification system, they used a GA to optimize the parameters. They form the chromosomes as a sequence of real numbers by coding the centers of the MFs, the neighboring overlap values, and the certainty degrees of the consequents. They also restricted the search space of the GA: the centers are limited to a range of ±α% around their initial values, the search spaces of the neighboring overlap values are constrained to [0.02, 0.45], and the certainty degrees vary from 0 to 1.

In 2007, Alcala et al.197 proposed an accuracy-oriented multiobjective algorithm, SPEA2ACC, based on SPEA2,72 to obtain fuzzy rule-based systems with a better trade-off between interpretability and accuracy in linguistic fuzzy modeling by performing rule selection together with tuning of the MFs; it minimizes only two objectives to achieve the desired trade-off: the number of rules and the mean squared error. The SPEA2ACC algorithm centers the search on the desired Pareto zone, i.e., the zone having high accuracy with the least possible number of rules, by incorporating two changes into the SPEA2 algorithm. The objective is to put pressure on the selection of solutions with high accuracy. The two changes to the existing algorithm are as follows:

1. A restarting operator is applied at Step 4 of the algorithm, by maintaining the most accurate individual as the single individual in the external population and obtaining the remaining individuals in the population with the same rule configuration as the best individual and tuning parameters generated at random within the corresponding variation intervals. Then return to Step 2 with t = t + 2.

2. As the second change, in each stage of the algorithm, before and after restarting, they sorted the solutions from best to worst using accuracy as the sorting criterion. The number of solutions in the external population considered to form the mating pool is progressively reduced from 100% at the beginning to 50% at the end of each stage, focusing only on those with the best accuracy. They also modified the creation of solutions in the initial population: all possible rules are selected, which favors a progressive extraction of bad rules, first only by means of mutation and then by means of crossover.

They used a double coding scheme for both rule selection (CS) and tuning (CT):

Cp = CSp · CTp, (12)

In the rule selection part, they used binary-coded strings of size m, the number of initial rules. The corresponding gene is assigned "1" if the rule is selected and "0" otherwise, i.e., CSp = (cS1, cS2, . . . , cSm), where cSi ∈ {0, 1}. In the tuning part, a real coding is considered, with mi being the number of labels of each of the n variables comprising the database: Ci = (ai1, bi1, ci1, . . . , aimi, bimi, cimi), i = 1, . . . , n, and CTp = C1C2 . . . Cn. In the CS part of the initial population, all individuals have all genes set to "1"; in the CT part, the initial database is taken as the first individual and the rest of the individuals are generated randomly within the corresponding variation intervals, which are calculated from the initial database. They applied the BLX-0.5 crossover198 operator in the CT part and the HUX crossover199 in the CS part of the chromosome. Finally, four offspring are produced by combining two from the CS part and two from the CT part. The mutation operator changes a randomly selected gene value in each of the CS and CT

parts with probability Pm.
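For reference, a generic BLX-α crossover (α = 0.5 yields the BLX-0.5 operator used above) on a real-coded chromosome might be sketched as follows; the bounds argument stands in for the variation intervals:

```python
import random

def blx_alpha(parent1, parent2, alpha=0.5, bounds=None):
    """BLX-alpha crossover for real-coded genes: each child gene is drawn
    uniformly from the parents' interval widened by alpha on both sides,
    optionally clipped to that gene's variation interval."""
    child = []
    for i, (x, y) in enumerate(zip(parent1, parent2)):
        lo, hi = min(x, y), max(x, y)
        span = hi - lo
        g = random.uniform(lo - alpha * span, hi + alpha * span)
        if bounds is not None:                    # clip to variation interval
            g = min(max(g, bounds[i][0]), bounds[i][1])
        child.append(g)
    return child

# Hypothetical MF-center genes, each constrained to [0, 1].
p1, p2 = [0.2, 0.5, 0.8], [0.3, 0.4, 0.9]
print(blx_alpha(p1, p2, bounds=[(0.0, 1.0)] * 3))
```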

In 2008, Gacto et al.200 proposed an extension of the SPEA2ACC algorithm, SPEA2ACC2, to obtain fuzzy rule based systems with a better trade-off between interpretability and accuracy in linguistic fuzzy modeling by performing rule selection together with tuning of the MFs, again minimizing two objectives to achieve the desired trade-off: the number of rules and the mean squared error. They proposed two changes to the SPEA2ACC algorithm. Instead of the HUX crossover in the CS part of the chromosome, an intelligent crossover is applied. Offspring are generated by applying the following steps:

• Apply the BLX crossover to obtain the CT part of the offspring.

• In the CS part, for every gene, the corresponding rule is extracted from each individual involved in the crossover, namely the offspring, parent 1, and parent 2, after the real parameters are obtained by determining a database. In this way, the same rule is obtained three times with the different MFs that pertain to these three individuals.

• Considering only the center points of the MFs involved in the extracted rules, normalized Euclidean distances are computed between the offspring and each parent. The differences between two points are normalized by the amplitude of their respective variation intervals.

• Whether the present rule is selected for the offspring is determined by the nearest parent, by copying its value for the corresponding gene in the CS part.

• Repeat Steps 1 through 4 until all the CS values are assigned for the offspring.

Using the above process, four offspring are generated. In effect, exploration is performed in the CT part by this operator, and the CS part is obtained directly, based on the parents' previous knowledge regarding the inclusion or exclusion of a specific configuration of MFs for every rule. When an offspring is generated, the mutation operator randomly changes a gene value in the CT part and, in the CS part, sets a randomly selected gene to zero, each with probability Pm.

Pulkkinen and Koivisto201 identified a fuzzy classifier by using decision trees and multiobjective evolutionary algorithms. To obtain compact and accurate fuzzy classifiers, Pulkkinen and coworkers presented a multiobjective genetic fuzzy system in Ref 202. Marquez et al.203 presented a multiobjective evolutionary algorithm with an interpretability improvement mechanism for linguistic fuzzy systems with adaptive defuzzification.

SUMMARY AND FURTHER RESEARCH

The motivation for applying evolutionary algorithms in KDD is that EAs are robust search methods that perform a global search in the candidate solution space (feature space, rules, or another form of knowledge representation). Initially, we presented the preliminary concepts of KDD, EAs and their taxonomy, MOEAs, and fuzzy set theory. Then we discussed the significant advances of EAs, MOEAs, and fuzzy systems in the KDD and data mining area.

Furthermore, this paper has attempted to study CRM using GAs. The reason for applying GAs to classification is that GAs can use the same knowledge representation (IF–THEN rules) as conventional rule induction algorithms. However, the global search nature of GAs tends to cope better with attribute interaction and to discover interesting relationships that would be missed by the greedy search of rule induction algorithms. The flexible algorithmic paradigm can be used to incorporate background knowledge into the GA and/or to hybridize GAs with local search methods that are specifically tailored to the data mining tasks being solved.

Like any other data mining paradigm, GAs also have some disadvantages. One of them is that conventional genetic operators such as crossover and mutation are blind search operators, in the sense that they modify individuals in a way that is independent of the individual's fitness. This characteristic of conventional genetic operators increases the generality of GAs but intuitively tends to reduce their effectiveness in solving a specific kind of problem. Hence, it is important to conduct more studies extending GAs with task-specific operators.

Another disadvantage of GAs is that they are computationally slow. However, if necessary, the processing time can be significantly reduced by using parallel processing techniques204,205 and/or by computing the fitness of individuals using only a subset of the training instances. Another possibility is to compute the fitness of some of the individuals and approximate that of the others. However, this needs intensive research.

An important research direction is to better exploit the power of GP, ES, EP, EDA, and CGAs in data mining. There are several GP algorithms for discovering classification rules206 or for classification in general.207,208 However, the power of GP is still underexplored.

Furthermore, FCRM using GAs is another focal point of this paper. Among the many criteria used by various researchers to design a genetic fuzzy system, the most frequently used ones are to (i) maximize the number of correctly classified training patterns, (ii) minimize the number of rules, and (iii) minimize the length of the fuzzy rules when finding a genetic fuzzy classifier.

From this study, it is worth noting that the multiobjective nature of fuzzy rule based systems needs intensive care for handling the concavity of the problem and scalability when learning from large data sets. It is equally important to consider the adaptation of MOGAs in FCRs to data sets with highly imbalanced class ratios.

Most studies consider only the quantitative measures of fuzzy rule based systems and give less emphasis to qualitative measures. Hence, qualitative measures of fuzzy rule based systems still need further intensive research. In the context of interpretability measures, our future research includes designing appropriate algorithms to handle the growing set of interpretability measures within the framework of the interpretability-accuracy trade-off.

The hybridization between fuzzy systems and EAs in evolutionary fuzzy systems became an important research area during the past decade. Nowadays, it is a mature research area where researchers need to reflect on how to advance the strengths and distinctive features of fuzzy systems. Hybridizing with other metaheuristics such as particle swarm optimization,209–211 ant colony optimization,212–214 and bee colony optimization215–217 can be another improvement in this direction.

NOTES
a. The problem of rule interaction consists of evaluating the quality of a rule set as a whole, rather than just evaluating the quality of each rule in an isolated manner.

REFERENCES

1. Piatetsky-Shapiro G. Knowledge discovery in real databases: a report on the IJCAI-89 workshop. AI Mag 1991, 11(5):68–70.

2. Fortnow L. The status of the P versus NP problem. Commun ACM 2009, 52(9):78–86.

3. Han J, Kamber M. Data Mining: Concepts and Techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann, 2006.

4. Kantardzic M. Data Mining: Concepts, Models, Methods, and Algorithms. New York: John Wiley & Sons, 2003.

5. Miller H, Han J, eds. Geographical Data Mining and Knowledge Discovery. 2nd ed. New York: John Wiley & Sons, 2009.

6. Kargupta H, Han J, Yu P, Motwani R, Kumar V, eds. Next Generation of Data Mining. Boca Raton, FL: CRC Press and Taylor & Francis, 2008.

7. Geng L, Hamilton HJ. Interestingness measures for data mining: a survey. ACM Comput Surv 2006, 38(3):1–32.

8. Goebel M, Gruenwald L. A review of software packages for data mining. SIGKDD Explor 1999, 1(1):20–32.

9. Haughton D, Deichmann J, Eshghi A, Sayek S, Teebagy N, Topi H. A survey of data mining and knowledge discovery software tools. Am Stat 2003, 57(4):290–309.

10. Mikut R, Reischl M. Data mining tools. WIREs Data Mining Knowl Discov 2011, 1:431–443.

11. Peters G, Weber R. Dynamic clustering with soft computing. WIREs Data Mining Knowl Discov 2012, 2:226–236.

12. Coello CA, Dehuri S, Ghosh S, eds. Swarm Intelligence for Multi-objective Problems in Data Mining. Heidelberg, Germany: Springer-Verlag, 2009.

13. Ghosh A, Dehuri S, Ghosh S, eds. Multi-objective Evolutionary Algorithms for Knowledge Discovery in Databases. Heidelberg, Germany: Springer-Verlag, 2008.


14. Freitas AA. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Heidelberg, Germany: Springer-Verlag, 2002.

15. Alcala-Fdez J, Sanchez L, Garcia S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, et al. KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 2009, 13(3):307–318.

16. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.

17. Smith SF. A learning system based on genetic adaptive algorithms. Ph.D. dissertation, University of Pittsburgh, Pittsburgh, PA, 1980.

18. Smith SF. Flexible learning of problem solving heuristics through adaptive search. In: Proceedings of the 8th International Joint Conference on Artificial Intelligence, 1983, 422–425.

19. Holland JH. Escaping brittleness: the possibilities of general purpose learning algorithms applied to parallel rule-based systems. Mach Learn: An Artif Intell Approach 1986, 2:593–623.

20. Booker LB, Goldberg DE, Holland JH. Classifier systems and genetic algorithms. Artif Intell 1989, 40(1–3):235–283.

21. Venturini G. SIA: a supervised inductive algorithm with genetic search for learning attributes based concepts. In: Proceedings of the European Conference on Machine Learning, LNAI 667. Berlin, Germany: Springer-Verlag, 1993, 280–296.

22. Greene DP, Smith SF. Competition-based induction of decision models from examples. Mach Learn 1993, 13(2–3):229–257.

23. Huhn J, Hullermeier E. FR3: a fuzzy rule learner for inducing reliable classifiers. IEEE Trans Fuzzy Syst 2009, 17:138–149.

24. Hullermeier E. Fuzzy sets in machine learning and data mining. Appl Soft Comput J 2011, 11:1493–1505.

25. Hullermeier E. Fuzzy machine learning and data mining. WIREs Data Mining Knowl Discov 2011, 1:269–283.

26. Nisbet R, Elder J, Miner G. Handbook of Statistical Analysis & Data Mining Applications. New York: Academic Press/Elsevier, 2009.

27. Zaki MJ, Parthasarathy S, Ogihara M, Li W. New algorithms for fast discovery of association rules. Technical Report URCS-TR-651, July 1997.

28. Wang JTL, Zaki MJ, Toivonen HTT, Shasha D. Data Mining in Bioinformatics. Berlin, Germany: Springer-Verlag, 2004.

29. Alberto P-M. Gene expression modular analysis: an overview from the data mining perspective. WIREs Data Mining Knowl Discov 2011, 1:381–396.

30. Wang J, ed. Data Mining in Health Care Applications. Hershey, PA: IGI Publishing, 2003.

31. Kokol P, Pohorec S, Stiglic G, Podgorelec V. Evolutionary design of decision trees for medical application. WIREs Data Mining Knowl Discov 2012, 2:237–254.

32. Mucherino A, Papajorgji P, Pardalos P. Data Mining in Agriculture. Berlin, Germany: Springer-Verlag, 2009.

33. Wang XZ. Data Mining and Knowledge Discovery for Process Monitoring and Control. London: Springer-Verlag, 1999.

34. Sequeira K, Zaki MJ. ADMIT: anomaly-based data mining for intrusions. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.

35. Lam K-Y, Hui L, Chung S-L. A data reduction method for intrusion detection. J Syst Software 1996, 33:101–108.

36. Lee W, Stolfo SJ. A framework for constructing features and models for intrusion detection systems. ACM Trans Inform Syst Security 2000, 3(4):227–261.

37. Oliveira M, Gama J. An overview of social network analysis. WIREs Data Mining Knowl Discov 2012, 2:99–115.

38. Nicolas G-P. Evolutionary selection for training set selection. WIREs Data Mining Knowl Discov 2011, 1:512–523.

39. Dy JG, Brodley CE. Feature selection for unsupervised learning. J Mach Learn Res 2004, 5(5):845–889.

40. Brill FZ, Brown DE, Martin WN. Fast genetic selection of features for neural network classifiers. IEEE Trans Neural Netw 1998, 3(2):324–328.

41. Mitra P, Murthy CA, Pal SK. Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 2002, 24(3):301–312.

42. Yang J, Honavar V. Feature subset selection using a genetic algorithm. IEEE Intell Syst Their Appl 1998, 13(2):44–49.

43. Siedlecki W, Sklansky J. A note on genetic algorithms for large-scale feature selection. Pattern Recogn Lett 1989, 10:335–347.

44. Ghosh A, Nath B. Multi-objective rule mining using genetic algorithms. Inform Sci 2004, 163:123–133.

45. Ghosh A, Dehuri S, Ghosh S, eds. Multiobjective Association Rule Mining. Berlin, Germany: Springer-Verlag, 2008.

46. Tseng LY, Yang SB. A genetic approach to the automatic clustering problem. Pattern Recogn 2001, 34:415–424.


47. Krishna K, Murty MN. Genetic k-means algorithm. IEEE Trans Syst Man Cybern B 1999, 29:433–439.

48. Lozano JA, Larranaga P. Applying genetic algorithms to search for the best hierarchical clustering of a dataset. Pattern Recogn Lett 1999, 20:911–918.

49. Hruschka ER, Campello RJGB, Freitas AA, de Carvalho A. A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern C 2009, 39(2):133–154.

50. Aliev RA, Fazlollahi B, Vahidov R. Genetic algorithms-based fuzzy regression analysis. Soft Comput - A Fusion Found Method Appl 2002, 6(6):470–475.

51. Rosin PL, Ioannidis E. Evaluation of global image thresholding for change detection. Pattern Recogn Lett 2003, 24:2345–2356.

52. Mitchell T. Machine learning and data mining. Commun ACM 1999, 42(11):31–36.

53. Dehuri S, Cho S-B, eds. Knowledge Mining Using Intelligent Agents. London: Imperial College Press, 2010.

54. Lim CP, Jain LC, Dehuri S, eds. Innovations in Swarm Intelligence. Heidelberg, Germany: Springer-Verlag, 2009.

55. Fogel DB. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. Piscataway, NJ: IEEE Press, 2006.

56. Michalewicz Z. Genetic Algorithms + Data Structures = Evolution Programs. Berlin, Germany: Springer-Verlag, 1999.

57. Beyer H-G. The Theory of Evolution Strategies. Berlin, Germany: Springer-Verlag, 2001.

58. Beyer H-G, Schwefel H-P. Evolution strategies: a comprehensive introduction. Natural Comput 2002, 1(1):3–52.

59. Eiben AE, Smith JE. Introduction to Evolutionary Computing. Berlin, Germany: Springer-Verlag, 2007.

60. Fogel LJ. Intelligence through Simulated Evolution: Forty Years of Evolutionary Programming. New York: John Wiley & Sons, 1999.

61. Koza JR. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press, 1992.

62. Wong ML, Leung KS. Data Mining Using Grammar Based Genetic Programming and Applications. Amsterdam, the Netherlands: Kluwer Academic Publishers, 2000.

63. Larranaga P, Lozano JA. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Amsterdam, the Netherlands: Kluwer Academic Publishers, 2001.

64. Baluja S. Population-based incremental learning: a method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University, 1994.

65. Harik GR, Lobo FG, Goldberg DE. The compact genetic algorithm. IEEE Trans Evol Comput 1999, 3(4):287–297.

66. Deb K. Multi-Objective Optimization Using Evolutionary Algorithms. New York: John Wiley & Sons, 2001.

67. Schaffer JD. Multiple objective optimization with vector evaluated genetic algorithms. In: ICGA 85, 1985, 93–100.

68. Laumanns M, Rudolph G, Schwefel HP. A spatial predator-prey approach to multi-objective optimization. Parallel Problem Solving Nature 1998, 5:241–249.

69. Zitzler E, Thiele L. Multi-objective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans Evol Comput 1999, 3:257–271.

70. Deb K, Agrawal S, Pratap A, Meyarivan T. A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002, 6(2):182–197.

71. Zitzler E, Thiele L. An evolutionary algorithm for multi-objective optimization: the strength Pareto approach. TIK Report 43, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland, 1998.

72. Zitzler E, Laumanns M, Thiele L. SPEA2: improving the strength Pareto evolutionary algorithm for multi-objective optimization. In: Zitzler E, Giannakoglou KC, Tsahalis D, Periaux J, Papailiou K, Fogarty T, eds. Evolutionary Methods for Design, Optimization and Control with Application to Industrial Problems. Barcelona, Spain: International Center for Numerical Methods in Engineering (CIMNE), 2002, 95–100.

73. Klir GJ, Yuan B. Fuzzy Sets and Fuzzy Logic: Theory and Applications. New Delhi, India: Prentice Hall India, 1995.

74. Shi Y, Eberhart R, Chen Y. Implementation of evolutionary fuzzy systems. IEEE Trans Fuzzy Syst 1999, 7(2):109–119.

75. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997, 97(1–2):273–324.

76. Langley P. Selection of relevant features in machine learning. In: Proceedings of the AAAI Fall Symposium on Relevance, 1994, 1–5.

77. Liu H, Motoda H, eds. Selecting Features by Vertical Compactness of Data. Amsterdam, the Netherlands: Kluwer, 1998.

78. Piramuthu S. Evaluating feature selection methods for learning in data mining applications. In: Proceedings of the 31st Annual Hawaii International Conference on System Sciences, 1998.

79. Martin-Bautista MJ, Vila M-A. A survey of genetic feature selection in mining issues. In: Proceedings of the 1999 Congress on Evolutionary Computation (CEC'99), 1999, 1314–1321.


80. Messer K, Kittler J. Using feature selection to aid an iconic search through an image database. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, 1605–2608.

81. Liu Y, Dellaert F. A classification based similarity metric for 3D image retrieval. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 1998, 800–805.

82. Puuronen S, Tsymbal A, Skrypnik I. Advanced local feature selection in medical diagnostics. In: Proceedings of the 13th IEEE Symposium on Computer-Based Medical Systems, 2000, 25–30.

83. Ishikawa H. Multiscale feature selection in stereo. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 1999, 132–137.

84. Oh I-S, Lee JS, Suen CY. Analysis of class separation and combination of class-dependent features for handwriting recognition. IEEE Trans Pattern Anal Mach Intell 1999, 21(10):1089–1094.

85. Liu H, Motoda H. Feature Selection for Knowledge Discovery and Data Mining. Norwell, MA: Kluwer, 1998.

86. Huang J, Cai Y, Xu X. A filter approach to feature selection based on mutual information. In: Proceedings of the 5th IEEE International Conference on Cognitive Informatics, 2006, 84–89.

87. Sanchez-Marono N, Alonso-Betanzos A, Tombilla-Sanroman M. Filter methods for feature selection: a comparative study. In: Yin H, Tino P, Corchado E, Byrne W, Yao X, eds. Intelligent Data Engineering and Automated Learning (IDEAL 2007), LNCS, vol. 4881. Berlin/Heidelberg: Springer-Verlag, 2007, 178–187.

88. Cherkauer KJ, Shavlik JW. Growing simpler decision trees to facilitate knowledge discovery. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96). Menlo Park, CA: AAAI Press, 1996, 315–318.

89. Bala J. Using learning to facilitate the evolution of features for recognizing visual concepts. Evol Comput 1996, 4(3):297–312.

90. Bala J, Huang J, Vafaie H, DeJong K, Wechsler H. Hybrid learning using genetic algorithms and decision trees for pattern classification. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'95). Montreal, Quebec, Canada: Morgan Kaufmann, 1995, 719–724.

91. Chen S, Guerra-Salcedo C, Smith S. Nonstandard crossover for a standard representation: commonality-based feature subset selection. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'99). Orlando, FL: Morgan Kaufmann, 1999, 129–134.

92. Guerra-Salcedo C. Genetic search for feature subset selection: a comparison between CHC and GENESIS. In: Genetic Programming 1998: Proceedings of the Third Annual Conference, 1999, 504–509.

93. Liu H, Motoda H, eds. Feature Subset Selection Using a Genetic Algorithm. Norwell, MA: Kluwer, 1998.

94. Moser A, Murty MN. On the scalability of genetic algorithms to very large scale feature selection. In: Proceedings of the Real World Applications of Evolutionary Computation (EvoWorkshops 2000), LNCS 1803. Berlin, Germany: Springer-Verlag, 2000, 77–86.

95. Ishibuchi H, Nakashima T. Multi-objective pattern and feature selection by a genetic algorithm. In: Proceedings of the 2000 Genetic and Evolutionary Computation Conference (GECCO'2000), 2000, 1069–1076.

96. Guerra-Salcedo C, Whitley D. Feature selection mechanisms for ensemble creation: a genetic search perspective. Technical Report WS-99-06, Menlo Park, CA: AAAI Press, 1999, 13–17.

97. Derrac J, Garcia S, Herrera F. IFS-CoCo: instance and feature selection based on cooperative coevolution with nearest neighbor rule. Pattern Recogn 2010, 43(6):2082–2105.

98. Huang Y, Cai J, Xu X. A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recogn Lett 2007, 28:1825–1844.

99. Kabir MM, Shahjahan M, Murase K. Involving new local search in hybrid genetic algorithm for feature selection. Lect Notes Comput Sci 2009, 5864:150–158.

100. Li Y, Zeng X. Sequential multi-criteria feature selection algorithm based on agent genetic algorithm. Appl Intell 2010, 33(2):117–131.

101. Lim TS, Loh W-Y, Shih Y-S. A comparison of prediction accuracy, complexity and training time of thirty-three old and new classification algorithms. Mach Learn J 2000, 40:203–228.

102. Michie D, Spiegelhalter DJ, Taylor CC. Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994.

103. Katsikopoulos KV, Fasolo B. New tools for decision analysts. IEEE Trans Syst Man Cybern A 2006, 36(5):960–967.

104. Quinlan JR. Induction of decision trees. J Mach Learn 1986, 1(1):81–106.

105. Utgoff PE. Incremental induction of decision trees. J Mach Learn 1989, 4:161–186.

106. Utgoff PE. ID5: an incremental ID3. In: Proceedings of the Fifth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann, 1988, 107–120.


Learning. San Mateo, CA: Morgan Kaufmann, 1988,107–120.

107. Utgoff PE. An improved algorithm for incremental in-duction of decision trees. In: Proceedings of 11th In-ternational Conference on Machine Learning. 1994,318–325.

108. Quinlan JR. C4.5: Programs for Machine Learning.San Mateo, CA: Morgan Kaufmann, 1993.

109. Loh W-Y. Classification and Regression Trees,WIREs Data Mining Knowl Discov 2011, 1:14–23.

110. Dehuri S, Mall R. Predictive and comprehensible rulediscovery using a multi-objective genetic algorithms.Knowl-Based Syst 2006, 19:413–421.

111. Dhar V, Chou D, Provost F. Discovering interest-ing patterns for investment decision making withGLOWER: a genetic learner overlaid with entropyreduction. Data Mining Knowl Discov J 2000,4(4):251–280.

112. Hekanaho J. Testing different sharing methods inconcept learning. TUCS Technical Report 71, Cen-ter for Computer Science, Finland, 1996.

113. Weiss GM. Timeweaver: a genetic algorithm for iden-tifying predictive patterns in sequences of events. In:Proceedings of the Genetic and Evolutionary Com-putation Conference (GECCO’99). San Mateo, CA:Morgan Kaufmann, 1999, 718–725.

114. Freitas AA. A survey of evolutionary algorithms fordata mining and knowledge discovery. In Ghosh A,Tsutsui S, eds. Advances in Evolutionary Computa-tion. Berlin, Germany: Springer-Verlag, 2002, 819–845.

115. Janikow CZ. A knowledge intensive genetic algorithmfor supervised learning, Mach Learn 1993, 13:189–228.

116. Hekanaho J. Symbiosis in multi-modal concept learn-ing. In: Proceedings of the 1995 International Confer-ence on Machine Learning (ICML’ 96). San Mateo,CA: Morgan Kaufmann, 1996, 234–242.

117. Giordana A, Neri F. Search intensive concept induc-tion. Evol Comput 1995, 3(4):375–416.

118. Mansilla EB, Mekaouche A, Guiu JMG. A study ofgenetic classifier system based on the Pittsburg ap-proach on a medical domain. In: Proceedings of the12th International Conference on Industrial and En-gineering Applications of Artificial Intelligence andExpert Systems (IEA/AIE’99). LNCS, 1611. Berlin,Germany: Springer-Verlag, 1999, 175–184.

119. Kwedlo W, Kretowski M. An evolutionary algorithmusing multivariate discritization for decision rule in-duction. In: Proceedings of the 3rd European Con-ference on Principles and Practice of Knowledge Dis-covery in Databases (PKDD’ 99). LNCS 1704, Berlin,Germany: Springer-Verlag, 1999, 392–397.

120. Dehuri S, Ghosh A, Mall R. Genetic algorithms formulti-criterion classification and clustering in datamining. Int J Comput Inform Sci 2006, 4(3):143–154.

121. Dehuri S, Patnaik S, Ghosh A, Mall R. Application of elitist multi-objective genetic algorithm for classification rule generation. Appl Soft Comput 2008, 8:477–487.

122. Hekanaho J. Background knowledge in GA-based concept learning. In: Proceedings of the 13th International Conference on Machine Learning (ICML'96), 1996, 234–242.

123. Pei M, Goodman ED, Punch WF. Pattern discovery from data using genetic algorithms. In: Proceedings of the 1st Pacific Asia Conference on Knowledge Discovery and Data Mining, 1997.

124. Banzhaf W. Interactive evolution. In: Back T, Fogel DB, Michalewicz Z, eds. Evolutionary Computation 1. London: Institute of Physics Publishing, 2000, 132–135.

125. Weiss SM, Kulikowski CA. Computer Systems that Learn. San Mateo, CA: Morgan Kaufmann, 1991.

126. Thomas JD, Sycara K. In: Freitas AA, ed. Data Mining with Evolutionary Algorithms: Research Directions—Papers from the AAAI'99/GECCO'99 Workshop. Technical Report WS-99-06, Palo Alto, CA: AAAI Press, 1999, 7–11.

127. Romao W, Freitas AA, Pacheco RCS. A genetic algorithm for discovering interesting fuzzy prediction rules: applications to science and technology data. In: Proceedings of the 2002 Genetic and Evolutionary Computation Conference (GECCO'2002). New York: Morgan Kaufmann, 2002, 1188–1195.

128. Noda E, Freitas AA, Lopes HS. Discovering interesting prediction rules with a genetic algorithm. In: Proceedings of the Conference on Evolutionary Computation 1999 (CEC'99). Washington, DC: IEEE Press, 1999, 1322–1329.

129. Syswerda G. Uniform crossover in genetic algorithms. In: Proceedings of the 3rd International Conference on Genetic Algorithms. San Francisco, CA: Morgan Kaufmann, 1989, 2–9.

130. Wang L, Yen J. Extracting fuzzy rules for system modelling using a hybrid of genetic algorithms and Kalman filter. Fuzzy Sets Syst 1999, 101:353–362.

131. Castro JL, Castro-Schez JJ, Zurita J. Learning maximal structure rules in fuzzy logic for knowledge acquisition in expert systems. Fuzzy Sets Syst 1999, 101:331–342.

132. Chang C, Chen S. Constructing membership functions and generating weighted fuzzy rules from training data. In: Proceedings of the Ninth National Conference on Fuzzy Theory and Its Applications, 2001, 708–713.

133. Hong T, Lee C. Induction of fuzzy rules and membership functions from training examples. Fuzzy Sets Syst 1996, 84:33–47.

134. Wu TP, Chen SM. A new method for constructing membership functions and fuzzy rules from training examples. IEEE Trans Syst Man Cybern B 1999, 29:25–40.

135. Wong C-C, Chen C-C. A GA-based method for constructing fuzzy systems directly from numerical data. IEEE Trans Syst Man Cybern B 2002, 30:904–911.

136. Roubos J, Setnes M, Abonyi J. Learning fuzzy classification rules from labeled data. Inform Sci 2003, 150(1-2):77–93.

137. Setnes M, Roubos J. GA-fuzzy modelling and classification: complexity and performance. IEEE Trans Fuzzy Syst 2000, 8(5):509–522.

138. Ishibuchi H, Nozaki K, Tanaka H. Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets Syst 1992, 52(1):21–32.

139. Ishibuchi H, Nozaki K, Yamamoto N, Tanaka H. Construction of fuzzy classification systems with rectangular fuzzy rules using genetic algorithms. Fuzzy Sets Syst 1994, 65(2/3):237–253.

140. Gonzalez A, Perez R. SLAVE: a genetic learning system based on an iterative approach. IEEE Trans Fuzzy Syst 1999, 7(2):176–191.

141. Hu Y, Chen R, Tzeng G. Finding fuzzy classification rules using data mining techniques. Pattern Recogn Lett 2003, 24:509–519.

142. Ho SY, Chen H, Ho SJ, Chen T. Design of accurate classifiers with compact fuzzy-rule base using evolutionary scatter partition of feature space. IEEE Trans Syst Man Cybern B 2004, 34(2):1031–1044.

143. Chen S, Tsai F. Generating fuzzy rules from training instances for fuzzy classification systems. Expert Syst Appl 2008, 35(3):611–621.

144. Chen Y, Wang L, Chen S. Generating weighted fuzzy rules from training data for dealing with the iris data classification problem. Int J Appl Sci Eng 2006, 4(1):41–52.

145. Abe S, Lan M. A method for fuzzy rules extraction directly from numerical data and its application to pattern classification. IEEE Trans Fuzzy Syst 1995, 3(1):18–28.

146. Mitra S, Kuncheva LI. Improving classification performance using fuzzy MLP and two-level selective partitioning of feature space. Fuzzy Sets Syst 1995, 70(1):1–13.

147. Nauck D, Kruse R. A neuro-fuzzy method to learn fuzzy classification rules from data. Fuzzy Sets Syst 1997, 89(3):277–288.

148. Uebele V, Abe S, Lan MS. A neural-network-based fuzzy classifier. IEEE Trans Syst Man Cybern 1995, 25(2):353–361.

149. Chakraborty D, Pal NR. A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification. IEEE Trans Neural Netw 2004, 15(1):110–123.

150. Abe S, Thawonmas R. A fuzzy classifier with ellipsoidal regions. IEEE Trans Fuzzy Syst 1997, 5(3):358–368.

151. Hu Y-C, Tzeng G-H. Elicitation of classification rules by fuzzy data mining. Eng Appl Artif Intell 2003, 16(7–8):709–716.

152. De Cock M, Cornelis C, Kerre EE. Elicitation of fuzzy association rules from positive and negative examples. Fuzzy Sets Syst 2005, 149(1):73–85.

153. Ishibuchi H, Yamamoto T. Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining. Fuzzy Sets Syst 2004, 141(1):59–88.

154. Ishibuchi H, Nakashima T, Murata T. Three-objective genetic-based machine learning for linguistic rule extraction. Inform Sci 2001, 134(1-4):109–133.

155. Chen SM, Chen Y. Automatically constructing membership functions and generating fuzzy rules using genetic algorithms. Cybern Syst 2002, 33(8):841–862.

156. Cordon O, Gomide F, Herrera F, Hoffmann F, Magdalena L. Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Sets Syst 2004, 141:5–31.

157. Zhou E, Khotanzad A. Fuzzy classifier design using genetic algorithms. Pattern Recogn 2007, 40:3401–3414.

158. Saniee Abadeh M, Habibi J, Lucas C. Intrusion detection using a fuzzy genetics-based learning algorithm. J Netw Comput Appl 2007, 30:414–428.

159. Aguilera JJ, Chica M, del Jesus MJ, Herrera F. Niching genetic feature selection algorithms applied to the design of fuzzy rule-based classification systems. In: IEEE International Fuzzy Systems Conference (FUZZ-IEEE'07), 2007, 1–6.

160. Mansoori E, Zolghadri M, Katebi S. SGERD: a steady-state genetic algorithm for extracting fuzzy classification rules from data. IEEE Trans Fuzzy Syst 2008, 16(4):1061–1072.

161. Karr CL. Design of an adaptive fuzzy logic controller using a genetic algorithm. In: Proceedings of the Fourth International Conference on Genetic Algorithms, 1991, 450–457.

162. Karr CL, Gentry EJ. Fuzzy control of pH using genetic algorithms. IEEE Trans Fuzzy Syst 1993, 1(1):46–53.

163. Bentley PJ. "Evolutionary, my dear Watson": investigating committee-based evolution of fuzzy rules for the detection of suspicious insurance claims. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'2000). San Mateo, CA: Morgan Kaufmann, 2000, 702–709.

164. Fertig CS, Freitas AA, Arruda LVR, Kaestner C. A fuzzy beam search rule induction algorithm. In: Principles of Data Mining and Knowledge Discovery (Proceedings of the 3rd European Conference, PKDD'99). LNAI 1704, Berlin, Germany: Springer-Verlag, 1999, 341–347.

165. Walter D, Mohan CK. ClaDia: a fuzzy classifier system for disease diagnosis. In: Proceedings of the Congress on Evolutionary Computation (CEC'2000), 2000, 2:1429–1435.

166. Crockett KA, Bandar Z, Al Attar A. Soft decision trees: a new approach using non-linear fuzzification. In: Proceedings of the 9th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'2000), 2000, 209–215.

167. Mota C, Ferreira H, Rosa A. Independent and simultaneous evolution of fuzzy sleep classifiers by genetic algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'1999). San Mateo, CA: Morgan Kaufmann, 1999, 1622–1629.

168. Chen H-M, Ho S-Y. Designing an optimal evolutionary fuzzy decision tree for data mining. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'2001). San Mateo, CA: Morgan Kaufmann, 2001, 943–950.

169. Mendes RRF, Voznika FB, Freitas AA, Nievola JC. Discovering fuzzy classification rules with genetic programming and co-evolution. In: Principles of Data Mining and Knowledge Discovery (Proceedings of the 5th European Conference, PKDD'2001). LNAI 2168, Berlin, Germany: Springer-Verlag, 2001, 314–325.

170. Herrera F. Genetic fuzzy systems: taxonomy, current research trends and prospects. Evol Intell 2008, 1:27–46.

171. Castellano G, Fanelli A, Gentile E, Roselli T. A GA-based approach to optimisation of fuzzy models learned from data. In: GECCO 2002 Program, New York, 2002, 5–8.

172. Guillaume S. Designing fuzzy inference systems from data: an interpretability-oriented review. IEEE Trans Fuzzy Syst 2001, 9(3):426–443.

173. Jimenez F, Gomez-Skarmeta AF, Roubos H, Babuska R. Accurate, transparent and compact fuzzy models for function approximation and dynamic modelling through multiobjective evolutionary optimisation. In: First International Conference on Evolutionary Multi-criterion Optimisation, 2001, 653–667.

174. Jin Y, von Seelen W, Sendhoff B. An approach to rule-based knowledge extraction. In: Proceedings of the IEEE Conference on Fuzzy Systems, 1998, 1188–1193.

175. Jin Y, von Seelen W, Sendhoff B. On generating FC3 fuzzy rule systems from data using evolution strategies. IEEE Trans Syst Man Cybern B 1999, 29(6):829–845.

176. Jin Y, Sendhoff B. Extracting interpretable fuzzy rules from RBF networks. Neural Process Lett 2003, 17(2):149–164.

177. Roubos H, Setnes M. GA-fuzzy modelling and classification: complexity and performance. IEEE Trans Fuzzy Syst 2000, 8(5):509–522.

178. Rojas I, Pomares H, Ortega J, Prieto A. Self-organised fuzzy system generation from training examples. IEEE Trans Fuzzy Syst 2000, 8(1):23–36.

179. Casillas J, Cordon O, Herrera F, Magdalena L. Accuracy Improvements in Linguistic Fuzzy Modelling. Berlin, Germany: Springer-Verlag, 2003.

180. Alcala R, Alcala-Fdez J, Casillas J, Cordon O, Herrera F. Hybrid learning models to get the interpretability-accuracy trade-off in fuzzy modelling. Soft Comput 2006, 10:717–734.

181. Ishibuchi H, Nojima Y. Analysis of interpretability-accuracy trade-off of fuzzy systems by multi-objective fuzzy genetics-based machine learning. Int J Approx Reason 2007, 44(1):4–31.

182. Cordon O. A historical review of evolutionary learning methods for Mamdani-type fuzzy rule-based systems: designing interpretable genetic fuzzy systems. Int J Approx Reason 2011, 52(6):894–913.

183. Jin Y. Fuzzy modelling of high-dimensional systems: complexity reduction and interpretability improvements. IEEE Trans Fuzzy Syst 2000, 8(2):212–221.

184. Nauck DD. Fuzzy data analysis with NEFCLASS. Int J Approx Reason 2003, 32:103–130.

185. Cordon O, Herrera F, Zwir I. A proposal for improving the accuracy of linguistic modelling. IEEE Trans Fuzzy Syst 2000, 8(3):335–344.

186. Mikut R, Jakel J, Groll L. Interpretability issues in data-based learning of fuzzy systems. Fuzzy Sets Syst 2005, 150:179–197.

187. Oliveira JVD. Semantic constraints for membership function optimisation. IEEE Trans Syst Man Cybern A 1999, 29(1):128–138.

188. Gacto MJ, Alcala R, Herrera F. Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures. Inform Sci 2011, 181(20):4340–4360.

189. Gacto MJ, Alcala R, Herrera F. Integration of an index to preserve the semantic interpretability in the multi-objective evolutionary rule selection and tuning of linguistic fuzzy systems. IEEE Trans Fuzzy Syst 2010, 18:515–531.

190. Zhou SM, Gan JQ. Low-level interpretability and high-level interpretability: a unified view of data-driven interpretable fuzzy system modelling. Fuzzy Sets Syst 2008, 159:3091–3131.

191. Ishibuchi H, Nozaki K, Yamamoto N, Tanaka H. Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Trans Fuzzy Syst 1995, 3(3):260–270.

192. Ishibuchi H, Murata T. A multi-objective genetic-based local search algorithm and its application to flowshop scheduling. IEEE Trans Syst Man Cybern C 1998, 28:392–403.

193. Ishibuchi H, Murata T, Turksen I. Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems. Fuzzy Sets Syst 1997, 89(2):135–149.

194. Ishibuchi H, Doi T, Nojima Y. Incorporation of scalarizing fitness functions into evolutionary multi-objective optimization algorithms. In: PPSN IX, LNCS 4193. Berlin, Germany: Springer-Verlag, 2006.

195. Ishibuchi H, Nojima Y, Kuwajima I. Fuzzy data mining by heuristic rule extraction and multiobjective rule selection. In: 2006 IEEE International Conference on Fuzzy Systems, Canada, 2006, 1633–1640.

196. Chen J-L, Yuan-Long H, Zong-Yi X, Li-Min J, Zhong-Zhi T. A multi-objective genetic-based method for design of fuzzy classification systems. Int J Comput Sci Netw Security 2006, 6(8A):110–117.

197. Alcala R, Gacto M, Herrera F. A multi-objective genetic algorithm for tuning and rule selection to obtain accurate and compact linguistic rule-based systems. Int J Uncertain Fuzziness Knowl-Based Syst 2007, 15(5):539–557.

198. Eshelman LJ, Schaffer JD. Real-coded genetic algorithms and interval schemata. Found Genetic Algorithms 1993, 2:187–202.

199. Eshelman LJ, Schaffer JD. The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. Found Genetic Algorithms 1991, 1:265–283.

200. Gacto MJ, Alcala R, Herrera F. Adaptation and application of multi-objective evolutionary algorithms for rule selection and parameter tuning of fuzzy rule-based systems. Soft Comput 2009, 13:419–436.

201. Pulkkinen P, Koivisto H. Fuzzy classifier identification using decision tree and multi-objective evolutionary algorithms. Int J Approx Reason 2008, 48:526–543.

202. Pulkkinen P. A multi-objective genetic fuzzy system for obtaining compact and accurate fuzzy classifiers with transparent fuzzy partitions. In: Proceedings of the 8th International Conference on Machine Learning and Applications. Miami Beach, FL: IEEE Press, 2009, 84–94.

203. Marquez A, Marquez F, Peregrin A. A multi-objective evolutionary algorithm with an interpretability improvement mechanism for linguistic fuzzy systems with adaptive defuzzification. In: IEEE World Congress on Computational Intelligence, 2010, 277–283.

204. Mishra BSP, Dehuri S, Mall R, Ghosh A. Parallel single and multi-objective genetic algorithms: a survey. Int J Appl Evol Comput 2011, 2(2):21–58.

205. Mishra BSP, Addy AK, Roy R, Dehuri S. Parallel multi-objective genetic algorithms for associative classification rule mining. In: Proceedings of the International Conference on Communication, Computing and Security. New York: ACM Press, 2011.

206. Pappa GL, Freitas AA. Automating the Design of Data Mining Algorithms: An Evolutionary Computation Approach. Berlin, Germany: Springer-Verlag, 2010.

207. Muni DP, Pal NR, Das J. Genetic programming for simultaneous feature selection and classifier design. IEEE Trans Syst Man Cybern B 2006, 36(1):106–117.

208. Folino G, Pizzuti C, Spezzano G. GP ensembles for large-scale data classification. IEEE Trans Evol Comput 2006, 10(5):604–616.

209. Kennedy J, Eberhart RC. Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, 1995, 1942–1948.

210. Clerc M, Kennedy J. The particle swarm: explosion, stability, and convergence in a multi-dimensional complex space. IEEE Trans Evol Comput 2002, 6(1):58–73.

211. Dehuri S, Cho S-B. Multi-criterion Pareto based particle swarm optimized polynomial neural network for classification: a review and state-of-the-art. Comput Sci Rev 2009, 3(1):19–40.

212. Dorigo M, Stutzle T. Ant Colony Optimization. Cambridge, MA: MIT Press, 2004.

213. Parpinelli RS, Lopes HS, Freitas AA. Data mining with an ant colony optimization algorithm. IEEE Trans Evol Comput 2002, 6(4):321–332.

214. Martens D, Backer MD, Haesen R, Vanthienen J, Snoeck M, Baesens B. Classification with ant colony optimization. IEEE Trans Evol Comput 2007, 11(5):651–665.

215. Fathian M, Amiri B, Maroosi A. Application of honey bee mating optimization algorithm on clustering. Appl Math Comput 2007, 190:1502–1513.

216. Karaboga D, Basturk B. Artificial bee colony (ABC) optimization algorithm for solving constrained optimization problems. Lect Notes Artif Intell 2007, 4529:789–798.

217. Karaboga D, Basturk B. On the performance of artificial bee colony (ABC) algorithm. Appl Soft Comput 2008, 8:687–697.
