
Research Article

Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method

Jianhong Yan and Suqing Han

Department of Computer Science, Taiyuan Normal University, Taiyuan 030012, China

Correspondence should be addressed to Jianhong Yan; yan jian [email protected]

Received 3 June 2017; Revised 1 October 2017; Accepted 6 November 2017; Published 23 January 2018

Academic Editor: Michele Migliore

Copyright © 2018 Jianhong Yan and Suqing Han. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking ensemble is an efficient algorithm for normal, balanced data sets, but it has seldom been applied to imbalanced data. In this paper, we propose a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a 2-layer learning model. The first step is Level 0 model generalization, including data preprocessing and base model training. The second step is Level 1 model generalization, involving a cost-sensitive classifier and the logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalanced data learning methods. In the cost-sensitive algorithm, the cost matrix is combined with both data characteristics and algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced data techniques. According to the experimental results obtained with 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed better classification performance than other ensemble and single algorithms. The proposed method is especially more efficient when the performance of the base classifier is low. All these results demonstrate that the proposed method can be applied to the class imbalance problem.

1. Introduction

Classification learning becomes complicated if the class distribution of the data is imbalanced. The class imbalance problem occurs when the number of representative instances of one class is much less than that of the other instances. Recently, the classification problem of imbalanced data has appeared frequently and has received wide attention [1–3].

Usually, imbalanced data sets are grouped into binary classes, and the majority class and minority class are, respectively, denoted as the negative class and the positive class. Traditional techniques are divided into four categories. The RE-sample technique increases the minority class instances (oversampling) [4] or decreases the majority class instances (undersampling) [5, 6]. At the algorithm level, existing algorithms are improved by increasing the weight of positive instances [7]. The classifier ensemble method has been widely adopted to deal with the imbalance problem in the last decade [8]. In cost-sensitive algorithms, data characteristics are incorporated with misclassification costs in the classification phase [9, 10]. In general, the cost-sensitive and algorithm levels are more associated with the imbalance problem, whereas the data level and ensemble learning can be used independently of the single classifier.

Ensemble methods, which involve training base classifiers, integrating their results, and generating a single final class label, can increase the accuracy of classifiers. The Bagging algorithm [11] and the AdaBoost algorithm [12, 13] are the most common ensemble classification algorithms. Ensemble algorithms combined with the other three techniques are widely applied in the classification of imbalanced data sets. Cost-sensitive learning targets the imbalanced learning problem by using different cost matrices that can be considered a numerical representation of the penalty of misclassifying examples from one class to another. Data characteristics are incorporated with misclassification costs in the classification phase, so cost-sensitive learning is closely related to learning from imbalanced data. In order to solve the imbalance problem, ensemble algorithms were combined with data preprocessing, cost-sensitive methods, and relevant algorithms in previous studies. Four ensemble methods for imbalanced data sets are commonly used: boosting-based, bagging-based, cost-sensitive-boosting, and hybrid ensemble methods.

Hindawi, Mathematical Problems in Engineering, Volume 2018, Article ID 5036710, 13 pages, https://doi.org/10.1155/2018/5036710


In boosting-based ensembles, the data preprocessing technique is embedded into boosting algorithms. In each iteration, the data distribution is changed by altering the weights to train the next classifier towards the positive class. These algorithms mainly include the SMOTEBoost [14], RUSBoost [15], MSMOTEBoost [16], and DataBoost-IM [17] algorithms. In bagging-based ensembles, the algorithms combine bagging with data preprocessing techniques. The algorithms are usually simpler than their integration in boosting because of the simplicity and good generalization ability of bagging. The family includes, but is not limited to, OverBagging, UnderOverBagging [18], UnderBagging [19], and IIVotes [20]. In cost-sensitive-boosting ensembles, the general learning framework of AdaBoost is maintained, but a misclassification cost adjustment function is introduced into the weight updating formula. These ensembles usually differ in how they modify the weight update rule. AdaCost [21], CSB1 and CSB2 [22], RareBoost [23], and AdaC1, AdaC2, and AdaC3 [24] are the most representative approaches. Unlike the previous algorithms, hybrid ensemble methods adopt double ensemble learning methods. For example, EasyEnsemble and BalanceCascade use bagging as the main ensemble learning method, but each base classifier is trained by AdaBoost. Therefore, the final classifier is an ensemble of ensembles.

The stacking algorithm is another ensemble method; it can use the same base classifiers as the Bagging and AdaBoost algorithms, but its structure is different. The stacking algorithm has a two-level structure consisting of Level 0 classifiers and Level 1 classifiers and involves two steps. The first step is to collect the output of each model into a new set of data. For each instance in the original training set, the data set represents every model's prediction of that instance's class, where the models are base classifiers. In the second step, based on the new data and the true labels of each instance in the original training set, a learning algorithm is employed to train the second-layer model. In Wolpert's terminology, the first step is referred to as the Level 0 layer, and the second-stage learning algorithm is referred to as the Level 1 layer [25].

Stacking ensemble is a general method in which a high-level model is combined with lower-level models, and it can achieve higher predictive accuracy. Chen et al. adopted ant colony optimization to configure stacking ensembles for data mining [26]. Kadkhodaei and Moghadam proposed an entropy-based approach to search for the best combination of base classifiers in ensemble classifiers based on stacked generalization [27]. Czarnowski and Jędrzejowicz focused on machine classification with data reduction based on stacked generalization [28]. Most of the previous studies focused on the way to use or generate the stacking algorithm. However, the stacking ensemble does not consider the data distribution and is suitable for common data sets rather than imbalanced data.

In order to solve the imbalance problem, this paper introduces cost-sensitive learning into the stacking ensemble and adds a misclassification cost adjustment function into the weights of instances and classifiers. In this way, misclassification costs may be considered in the data set as a form of data space weighting to select the best distribution for training. On the other hand, in the combination stage, metatechniques can be integrated with cost-sensitive classifiers to replace standard cost-minimizing techniques. The weights for the misclassification of positive instances are higher, and the weights for the misclassification of negative instances are relatively lower. The method provides an option for imbalanced learning domains.

In this paper, the RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method is proposed in order to solve the imbalance problem. In the method, preprocessed imbalanced data are used to train the Level 0 layer model. Unlike common ensemble algorithms for imbalanced data, the proposed method utilizes a cost-sensitive algorithm as the Level 1 (meta) layer. Stacking methods combined with imbalanced data approaches, including cost-sensitive learning, have been reported. Kotsiantis proposed a stacking variant methodology with cost-sensitive models as base learners [29]; in that method, the model tree was replaced by MLR in the metalayer to determine the class with the highest probability associated with the true class. Lo et al. proposed a cost-sensitive stacking method for audio tag annotation and retrieval [30]. In these methods, cost-sensitive learners are adopted in the base-layer model, and the metalayer model is trained by other learning algorithms such as SVM and decision tree. In this paper, the Level 0 model generalizer involves resampling the data and training the base classifiers, and the cost-sensitive algorithm is used to train the Level 1 metalayer classifier. The two layers adopt imbalanced data algorithms and take full advantage of mature methods. The Level 1 layer model has a bias towards the performance of the minority class. Therefore, the method proposed in this study is more efficient than other methods in which cost-sensitive algorithms are only used in the Level 0 layer.

The method was compared with common classification methods, including other ensemble algorithms. Additionally, the evaluation metrics of the algorithm were analyzed based on the results of statistical tests. Statistical tests of the evaluation metrics demonstrated that the proposed new approach can effectively solve the imbalance problem.

The paper is structured as follows. Related ensemble approaches and cost-sensitive algorithms are introduced in Section 2. Section 3 introduces the details of the proposed RECSG approach, including the Level 0 and Level 1 model generalizers. In Section 4, experiments and the corresponding results and analysis are presented, and statistical tests of the evaluation metrics of algorithm performance are analyzed and discussed. Finally, Section 5 discusses the advantages and disadvantages of the proposed method.

2. Background

2.1. Performance Evaluation in Imbalanced Domains. The evaluation metric is a vital factor for classifier modeling and performance assessment. In Table 1, the confusion matrix demonstrates the results of correctly and incorrectly classified instances of each class in two-class problems.

Table 1: Confusion matrix for performance evaluation.

                  Positive prediction    Negative prediction
Positive class    True positive (TP)     False negative (FN)
Negative class    False positive (FP)    True negative (TN)

Accuracy is the most popular evaluation metric. However, it cannot effectively measure the correct rates of all the classes, so it is not an appropriate metric for imbalanced data sets. For this reason, in addition to accuracy, more suitable metrics should be considered in the imbalance problem. Other metrics have been proposed to measure the classification performance independently. Based on the confusion matrix (Table 1), these metrics are defined as

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-Measure = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision),   (1)

where β is a coefficient to adjust the relative importance of precision versus recall (usually β = 1), and

FPrate = FP / (FP + TN),
TPrate = TP / (TP + FN).   (2)

The combined evaluation metrics of these measures include the receiver operating characteristic (ROC) graphic [31], the area under the ROC curve (AUC) [2], the average geometric mean of sensitivity and specificity (GeoMean) [32] (see (4)), and the average adjusted geometric mean (AGeoMean) [33] (see (5)), where N_n refers to the proportion of the majority samples. These metrics are defined as

AUC = (1 + TPrate − FPrate) / 2,   (3)

GeoMean = √(TPrate × TNrate),   (4)

AGeoMean = (GeoMean + TNrate × N_n) / (1 + N_n) if TPrate > 0, and AGeoMean = 0 if TPrate = 0.   (5)
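For concreteness, all of these metrics can be computed directly from the confusion-matrix counts of Table 1. The following is a minimal Python sketch; it is ours rather than the paper's (the function name imbalance_metrics and the example counts are chosen for illustration):

import math

def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Evaluation metrics (1)-(5) computed from confusion-matrix counts (Table 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # equals TPrate
    f_measure = ((1 + beta ** 2) * recall * precision) / (beta ** 2 * recall + precision)
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    tn_rate = tn / (fp + tn)
    auc = (1 + tp_rate - fp_rate) / 2            # single-point AUC, Eq. (3)
    geo_mean = math.sqrt(tp_rate * tn_rate)      # Eq. (4)
    n_n = (fp + tn) / (tp + fn + fp + tn)        # proportion of majority samples
    a_geo_mean = (geo_mean + tn_rate * n_n) / (1 + n_n) if tp_rate > 0 else 0.0  # Eq. (5)
    return {"precision": precision, "recall": recall, "f_measure": f_measure,
            "auc": auc, "geo_mean": geo_mean, "a_geo_mean": a_geo_mean}

# Hypothetical counts for a 50/456 minority/majority split:
print(imbalance_metrics(tp=30, fn=20, fp=40, tn=416))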

2.2. Stacking. Bagging and AdaBoost are the most common ensemble learning algorithms. In the bagging method, different base classifier models generate different classification results, and the final decision is made by majority voting. In the AdaBoost algorithm, a series of weak base classifiers are trained with the whole training set, and the final decision is generated by a weighted majority voting scheme; in each round of training, different weights are attributed to each instance. In both algorithms, the base classifiers are of the same kind.

Stacking ensemble is another ensemble algorithm, in which the prediction results of the base classifiers are used as the attributes to train the combination function in the metalayer classifier [25]. The algorithm has a two-level structure consisting of Level 0 classifiers and Level 1 classifiers. It was proposed by Wolpert and used by Breiman [34] and LeBlanc and Tibshirani [35].

Consider a data set

D = {(X_n, Y_n)}, n = 1, …, N,   (6)

where X_n is a vector representing the attribute values of instance n and Y_n is the class value. All N instances are randomly split into j equivalent parts, and j-fold cross-validation is used to train the model. The prediction results over the testing sets give a vector that includes n instances:

y_1, y_2, …, y_{n′}.   (7)

Let M_j, j = 1, …, k, be the different base learning algorithm models obtained with the data set D; the M_j form the Level 0 models. The Level 0 layer consists of k base classifiers, which are employed to estimate the class probability of each instance.

In M_j, j-fold cross-validation of the training data set D gives the prediction results

y^(j)_1, y^(j)_2, …, y^(j)_n, j = 1, …, k.   (8)

All the M_j prediction results generate the input space of the Level 1 model, and the real value of each instance is treated as the output space. The model is expressed as follows:

Input                                         Output
(y^(1)_1, y^(2)_1, …, y^(i)_1, …, y^(k)_1)    y_1
(y^(1)_2, y^(2)_2, …, y^(i)_2, …, y^(k)_2)    y_2
…                                             …
(y^(1)_n, y^(2)_n, …, y^(i)_n, …, y^(k)_n)    y_n   (9)

The above intermediate data are considered the training data of the Level 1 layer model. The input data are treated as features, and the real value of each instance is treated as the output space. The next step is to train the data with some fused learning algorithm. This process is called the Level 1 generalizer, and the Level 1 model is denoted by M̃, which can be regarded as a function of (y^(1), y^(2), …, y^(i), …, y^(k)).

In the stacking process, Level 0 (M_j, j = 1, …, k) is combined with Level 1 (M̃). Given a new instance x′, the models M_j produce a prediction result vector

(y^(1), y^(2), …, y^(i), …, y^(k)), i = 1, …, k.   (10)

Then model M̃ is used to combine the base classifiers and predict the final result for x′.
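A minimal sketch of this two-step procedure is shown below, using scikit-learn as a stand-in (the paper's experiments use Weka, and the synthetic data set here is only for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, roughly 90% negative / 10% positive, for illustration only.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# Level 0: cross-validated predictions y^(j) of each base model M_j (Eq. (8)).
base_models = [GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=1)]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1: train the combiner on the intermediate data of Eq. (9).
meta_model = LogisticRegression().fit(meta_features, y)

# Prediction for new instances x' (Eq. (10)): refit the base models, stack, combine.
for m in base_models:
    m.fit(X, y)
new_meta = np.column_stack([m.predict_proba(X[:5])[:, 1] for m in base_models])
print(meta_model.predict(new_meta))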

In this paper, for imbalanced data, we propose a stacked generalization based on cost-sensitive classification. In the Level 0 model generalizer layer, we first resample the imbalanced data; resampling approaches include oversampling and undersampling. Second, model M_j, j = 1, …, k, is trained by the classifier with the new data, and the prediction output y^(j)_i, j = 1, …, k, is produced with the original data. In the Level 1 model generalizer layer, model M̃ is generated by cost-sensitive classification based on logistic regression.

Figure 1: Flowchart of the RECSG architecture. The training and test data sets D = {(x_i, y_i)} are resampled (oversampling, undersampling, or SMOTE) into New data 1, …, New data k in the Level 0 (base) layer; the base classifiers M_1, …, M_k are trained on these, and their decision spaces u_1(x), …, u_k(x) form the ensemble space classified by the Level 1 (meta) layer model.

3. Stacked Generalization for the Imbalance Problem

The proposed RECSG architecture includes two layers. The first layer (Level 0), consisting of the classifier ensemble, is called the base layer; the second layer (Level 1), which combines the base classifiers, is called the metalayer. The flowchart of the architecture is shown in Figure 1.

3.1. Level 0 Model Generalizers. For the imbalance problem, the Level 0 model generalizer step of RECSG includes preprocessing the data and training the base classifiers. Firstly, the oversampling (SMOTE) method is used in the data preprocessing for the base classifiers. Secondly, the base classifier models are trained with the new data set. In this level, we employed three base algorithms: Naïve Bayes (NB) [36], decision tree C4.5 [37], and k-nearest neighbors (k-NN) [38].

(1) Naïve Bayes (NB). Given an instance x, let P(i | x) be the posterior probability of class i; then

P_i(x) = P(i | x) / Σ_i P(i | x),   (11)

where NB uses a Laplacian estimate for estimating the conditional probabilities.


(2) Decision Tree (C4.5). Decision tree classification is a method commonly used in data mining [39]. A decision tree is a tree in which each nonleaf node is labeled with an input feature and each leaf node is labeled with a class or a probability distribution over the class labels. By recursive partitioning, a tree can be "learned" until no prediction value is added or the subset has the same value. When information entropy is used as the attribute-selection criterion on the training data, the decision tree is named C4.5.

(3) k-NN. k-NN is a nonparametric algorithm used for regression or classification. An instance is classified by a majority vote of its k nearest neighbors. If k = 1, the instance is simply classified in accordance with the class label of the nearest neighbor.

All the above algorithms are simple and have low complexity, so they are applicable as weak base classifiers.
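As an illustration of this Level 0 step, the following sketch pairs SMOTE with the three base learners. It uses scikit-learn and imbalanced-learn as stand-ins for the Weka implementations used in the experiments, and the function name train_level0 is ours:

from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def train_level0(X_train, y_train):
    """Resample the training data with SMOTE, then fit the three weak base models."""
    X_res, y_res = SMOTE(random_state=1).fit_resample(X_train, y_train)
    models = [
        GaussianNB(),                                   # Naive Bayes
        DecisionTreeClassifier(criterion="entropy"),    # entropy splits, C4.5-style
        KNeighborsClassifier(n_neighbors=1),            # 1-NN
    ]
    for m in models:
        m.fit(X_res, y_res)      # trained on the rebalanced data
    return models                # predictions are later made on the original data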

3.2. Level 1 Model Generalizers. The prediction results of the several base classifiers in the Level 0 layer are used as the input space, and the true class value of each instance is used as the output space. Based on the Level 0 layer, the Level 1 layer model is trained by another learning algorithm. For imbalanced data, the cost-sensitive algorithm is used to train the Level 1 metalayer classifier in this paper.

3.2.1. Cost-Sensitive Classifier. Aiming at the imbalanced data learning problem, in a cost-sensitive classifier different cost matrices are used as the penalty for misclassified instances [40]. For example, a cost matrix C has the following structure in a binary classification scenario (Table 2).

Table 2: Cost matrix.

                    Actual negative    Actual positive
Predict negative    C(0,0) = c_00      C(0,1) = c_01
Predict positive    C(1,0) = c_10      C(1,1) = c_11

In the cost matrix, the rows indicate the alternative predicted classes, whereas the columns indicate the actual classes. The cost of a false negative is denoted as c_01, and the cost of a false positive is denoted as c_10. Conceptually, the cost of correctly classified instances should always be less than the cost of incorrectly classified instances. In the imbalance problem, c_10 is always greater than c_01. For the German credit data set previously reported as part of the Statlog project [41], the cost matrix is provided in Table 3.

Table 3: Cost matrix of credit data sets.

                Actual bad    Actual good
Predict bad     0             1
Predict good    5             0

From an economic point of view, the cost of a false good prediction is greater than the cost of a false bad prediction. Therefore, in the Level 1 classifier of stacking for the imbalance problem, the cost-sensitive algorithm is adopted.
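The decision rule implied by such a cost matrix is to choose the class with the minimum expected cost (cf. Elkan [40]). A small sketch follows, with illustrative cost values of our own choosing, not the exact matrices used in the experiments of Section 4:

import numpy as np

# cost[i][j] = cost of predicting class i when the actual class is j (Table 2 layout);
# misclassifying an actual positive (minority) instance is made more expensive here.
cost = np.array([[0.0, 3.0],    # predict negative: c_00, c_01
                 [1.0, 0.0]])   # predict positive: c_10, c_11

def cost_sensitive_predict(class_probs, cost):
    """class_probs: (n_samples, n_classes) posteriors P(j | x).
    Returns, per sample, the class i minimizing sum_j P(j | x) * C(i, j)."""
    expected_cost = class_probs @ cost.T     # shape (n_samples, n_classes)
    return expected_cost.argmin(axis=1)

probs = np.array([[0.7, 0.3], [0.9, 0.1]])
print(cost_sensitive_predict(probs, cost))   # -> [1 0]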

3.2.2. Logistic Regression. Ting and Witten illustrated that the MLR (Multiresponse Linear Regression) algorithm has an advantage over other Level 1 generalizers in stacking [42]. In this paper, the logistic regression classifier is used as the metalayer algorithm. Based on the metalayer, the cost of misclassification is considered in the cost-sensitive algorithm. In the logistic regression classifier, the prediction results of the Level 0 layer are used as the attributes of the Level 1 metalayer, and the real value of each instance is used as the output space. The linear regression for class i is simply obtained as

LR_i(x) = Σ_{j=1}^{K} α_ji P_ji(x), j = 1, …, k.   (12)

Details of the implementation of RECSG are presented below.

3.3. Algorithm Pseudocode. Pseudocode 1 presents the proposed RECSG approach. The input parameters are two data sets, the training set and the testing set; the output is the predicted class labels of the test samples. The first step in the process is data preprocessing: resample new instances and train model M_j (j = 1, …, k), where j indexes the base classifiers and u_j(x_i) is the generated function of x_i based on model M_j (lines (1)-(2) in Pseudocode 1).

Then the metalayer model is constructed based on the data D_meta, which is first predicted with the Level 0 layer (line (3) in Pseudocode 1). Finally, the Level 1 layer classifier (cost-sensitive and logistic regression) is used to predict the ultimate values of the tested samples.

Pseudocode 1: Pseudocode of RE-sample and Cost-Sensitive Stacked Generalization.

Input: training set D = {(x_i, y_i)}, i = 1, …, N; test data set D_test = {x′_i}, i = 1, …, N′
Output: predicted class labels of the test samples {y′_i,meta}, i = 1, …, N′
For each j = 1, 2, …, K do
  (1) Resample the imbalanced data and generate j-fold cross-validation sets to obtain New data_j
  (2) Train and compute {u_j(x_i)}, i = 1, …, N, and {u_j(x′_i)}, i = 1, …, N′, in the Level 0 (base) layer classifier M_j
end
(3) Construct D_meta = {(u_1(x_i), …, u_K(x_i)), y_i}, i = 1, …, N, and D′_meta = {(u_1(x′_i), …, u_K(x′_i))}, i = 1, …, N′
(4) Based on the data D_meta, classification (cost-sensitive and logistic regression) is used to generate the Level 1 (meta) layer model, which is applied to D′_meta to predict {y′_i,meta}, i = 1, …, N′
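Putting the two layers together, the following is a compact sketch of Pseudocode 1 in Python. Scikit-learn and imbalanced-learn stand in for the Weka components used in Section 4, class_weight approximates the misclassification-cost adjustment of the cost-sensitive metaclassifier, and the cost values are illustrative:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def recsg_fit_predict(X, y, X_test, cost_fn=3.0, cost_fp=1.0, n_folds=5):
    """RE-sample and Cost-Sensitive Stacked Generalization, following Pseudocode 1."""
    def bases():
        return [GaussianNB(), DecisionTreeClassifier(criterion="entropy"),
                KNeighborsClassifier(n_neighbors=1)]
    # Lines (1)-(2): fold-wise SMOTE resampling, base models trained on the new data,
    # Level 0 outputs u_j(x_i) produced on the original (held-out) data.
    meta = np.zeros((len(y), 3))
    for train_idx, test_idx in StratifiedKFold(n_splits=n_folds).split(X, y):
        X_res, y_res = SMOTE(random_state=1).fit_resample(X[train_idx], y[train_idx])
        for j, m in enumerate(bases()):
            m.fit(X_res, y_res)
            meta[test_idx, j] = m.predict_proba(X[test_idx])[:, 1]
    # Line (3): D'_meta from base models refit on the full resampled training set.
    X_res, y_res = SMOTE(random_state=1).fit_resample(X, y)
    models = [m.fit(X_res, y_res) for m in bases()]
    meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in models])
    # Line (4): cost-sensitive logistic regression as the Level 1 (meta) classifier;
    # class_weight stands in for the misclassification-cost adjustment.
    lvl1 = LogisticRegression(class_weight={1: cost_fn, 0: cost_fp}).fit(meta, y)
    return lvl1.predict(meta_test)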

4. Empirical Investigation

The experiments aim to verify whether the RECSG approach can improve the classification performance for the imbalance problem. In this paper, the RECSG approach was compared with other algorithms involving various combinations of ensemble techniques and imbalance approaches. For each method, the same training set, testing set, and validation set were used.

The experimental system has been implemented and is composed of 7 learning methods implemented in Weka [43], namely, Naïve Bayes (1) (NB), C4.5 (J48) (2), k-nearest neighbor (k-NN) (3), cost-sensitive (4), AdaBoost (5), Bagging (6), and stacking (7). Brief descriptions, the standard versions, and the parameters of these methods are shown in Table 4.

The RECSG approach includes 2 layers. The Level 1 (metalearning) layer consists of 2 learning methods implemented in Weka, namely, simple logistic regression (9) (MLR) and the cost-sensitive classifier (4); logistic regression is the base classification algorithm of the cost-sensitive classifier. The Level 0 (base-learning) layer of stacking is composed of 3 classifiers: Naïve Bayes (1), C4.5 (2), and k-NN (3). The evaluation metrics of algorithm performance include AUC, GeoMean, and AGeoMean.

4.1. Experimental Settings. Experiments were implemented with 17 data sets from the UCI Machine Learning Repository [44]. These data sets cover various fields and exhibit a range of IR values (from 0.54 to 0.0254), unique data set names, a varying number of samples (from 173 to 2338), and variations in the amount of class overlap (see the KEEL repository [45]). Multiclass data sets were modified to obtain two-class imbalanced problems, so that the union of one or more classes became the positive class and the union of one or more of the remaining classes was labeled as the negative class. A brief description of the used data sets is presented in Table 5, which includes the total number of instances (Sam), the number of instances of each class (Min, Maj), the imbalance ratio (IR, the ratio of the number of minority class instances to majority class instances), and the number of features (Fea).


Table 4: Base classifier methods and corresponding ensemble algorithms.

(1) Naïve Bayes (NB): weka.classifiers.bayes.NaiveBayes
(2) C4.5 (C4.5): weka.classifiers.trees.J48 -C 0.25 -M 2
(3) k-nearest neighbor (KNN): weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last"
(4) Cost-sensitive with Naïve Bayes (Cost-NB): weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(5) AdaBoost with Naïve Bayes (Ada-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(6) AdaBoost with cost-sensitive and Naïve Bayes (Ada-Cost-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.meta.CostSensitiveClassifier -- -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(7) Bagging with Naïve Bayes (Bag-NB): weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(8) Bagging with cost-sensitive and Naïve Bayes (Bag-Cost-NB): weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.meta.CostSensitiveClassifier -- -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(9) Stacking with NB, KNN, C4.5, and logistic (Stacking-log): weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4" -S 1 -num-slots 1 -B weka.classifiers.bayes.NaiveBayes -B "weka.classifiers.lazy.IBk -K 1 -W 0 -A weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last" -B "weka.classifiers.trees.J48 -C 0.25 -M 2"
(10) Stacking cost-sensitive with NB, KNN, C4.5, and logistic (Stacking-Cost-log): weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.meta.CostSensitiveClassifier -cost-matrix [1.0 3.0; 1.0 1.0] -S 1 -W weka.classifiers.functions.Logistic -- -R 1.0E-8 -M -1 -num-decimal-places 4" -S 1 -num-slots 1 -B weka.classifiers.bayes.NaiveBayes -B "weka.classifiers.lazy.IBk -K 1 -W 0 -A weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last" -B "weka.classifiers.trees.J48 -C 0.25 -M 2"


Table 5: Summary of imbalanced data sets.

Data set                          Sam (Min, Maj)     IR      Fea
wisconsin                         683 (239, 444)     0.54    9
pima                              768 (268, 500)     0.53    8
haberman                          306 (81, 225)      0.36    3
Vehicle1                          846 (217, 629)     0.345   18
Vehicle0                          846 (199, 647)     0.307   18
segment0                          2308 (329, 1979)   0.17    19
yeast-0-3-5-9 vs 7-8              506 (50, 456)      0.11    8
yeast-0-2-5-6 vs 3-7-8-9          1004 (99, 905)     0.11    8
vowel0                            988 (90, 898)      0.10    13
led7digit-0-2-4-5-6-7-8-9 vs 1    443 (37, 406)      0.09    7
Glass2                            214 (17, 197)      0.08    10
cleveland-0 vs 4                  173 (13, 160)      0.08    13
glass-0-1-6 vs 5                  184 (9, 175)       0.05    9
car-good                          1728 (69, 1659)    0.04    6
flare-F                           1066 (43, 1023)    0.04    11
car-vgood                         1728 (65, 1663)    0.039   6
abalone-17 vs 7-8-9-10            2338 (58, 2280)    0.0254  9

Our system was compared with other ensemble algorithms, including AdaBoost with Naïve Bayes, AdaBoost with cost-sensitive, Bagging with Naïve Bayes, Bagging with cost-sensitive, and stacking cost-sensitive with NB, k-NN, C4.5, and logistic regression. All the experiments were performed with 10-fold cross-validation.

4.2. Experimental Results. Tables 6, 7, and 8, respectively, show the results of the three metrics (AUC, GeoMean, and AGeoMean) for the algorithms in Table 4 obtained with the data sets in Table 5. The best result for each data set is the largest value in its row.

The results show that the performance of the proposed RECSG method is the best for 12 of the 17 data sets in terms of GeoMean and AGeoMean and for 10 of the 17 data sets in terms of AUC. Some methods are better than others on some evaluation metrics, but not on all metrics and most data sets.

In Table 6 (AUC), the performance of the RECSG method is better than that of the 3 single-base classification methods for 14 of 17 data sets and better than that of the other 5 ensemble algorithms for 13 of 17 data sets. In Table 7 (GeoMean), the RECSG method outperforms all 3 single-base classification methods on 15 of 17 data sets and outperforms the other 5 ensemble algorithms on 15 of 17 data sets. In Table 8 (AGeoMean), the RECSG method outperforms all 3 single-base classification methods on 14 of 17 data sets and outperforms the other 5 ensemble algorithms on 16 of 17 data sets. Over the 17 data sets, the advantages of RECSG under GeoMean and AGeoMean are larger than under AUC.

4.3. Statistical Tests. Statistical tests [46] are adopted to compare different algorithms. In this paper, we use two types of comparisons: pairwise comparison (between a pair of algorithms) and multiple comparisons (among a group of algorithms).

4.3.1. Pair of Algorithms. We performed T-tests to explore whether the RECSG approach is significantly better than the other algorithms on the three metrics (AUC, GeoMean, and AGeoMean). Table 9 shows the results of the RECSG approach compared with the other methods in terms of AUC, GeoMean, and AGeoMean. The values in square brackets indicate the number of data sets with a statistically significant difference in the T-test performed at the confidence level α = 0.05. For the evaluation metric AUC, RECSG outperformed NB on 11 of 17 data sets, and 5 data sets showed a statistically significant difference at the confidence level α = 0.05.
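As an illustration, a paired T-test between two columns of Table 6 can be computed as follows; SciPy is our choice here rather than the paper's tooling, and the values are copied from the Stacking-Cost-log and NB columns of Table 6:

from scipy import stats

# AUC of Stacking-Cost-log versus NB over the 17 data sets (columns of Table 6).
recsg = [0.982, 0.819, 0.63, 0.782, 0.977, 0.999, 0.773, 0.82, 1.0, 0.904,
         0.721, 0.976, 0.99, 0.977, 0.806, 0.998, 0.726]
nb = [0.986, 0.819, 0.64, 0.711, 0.814, 0.987, 0.75, 0.832, 0.98, 0.954,
      0.716, 0.909, 0.934, 0.976, 0.908, 0.997, 0.775]

t_stat, p_value = stats.ttest_rel(recsg, nb)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # significant if p < alpha = 0.05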

4.3.2. Multiple Comparisons. We performed the Holm post hoc test [47] and selected multiple groups to test the three metrics (AUC, GeoMean, and AGeoMean). The post hoc procedures can determine whether a comparison hypothesis should be rejected at a specified confidence level α. The statistical experiment was performed on the platform at http://tec.citius.usc.es/stac. Table 10 provides the results on whether the RECSG approach is significantly different at the confidence level α = 0.05 in terms of AUC, GeoMean, and AGeoMean; H0 indicates that there is no significant difference. According to the evaluation metric GeoMean, RECSG is significantly better than the other 8 methods at the confidence level α = 0.05.


Table 6: Results of the methods in terms of AUC.

Data set                          NB     C4.5   KNN    Cost-NB  Ada-NB  Ada-Cost-NB  Bag-NB  Bag-Cost-NB  Stacking-log  Stacking-Cost-log
wisconsin                         0.986  0.957  0.953  0.989    0.981   0.982        0.99    0.989        0.982         0.982
pima                              0.819  0.751  0.662  0.819    0.804   0.808        0.817   0.819        0.819         0.819
haberman                          0.64   0.615  0.53   0.64     0.667   0.625        0.638   0.647        0.636         0.63
Vehicle1                          0.711  0.722  0.671  0.711    0.769   0.723        0.715   0.712        0.782         0.782
Vehicle0                          0.814  0.93   0.92   0.814    0.784   0.781        0.808   0.808        0.977         0.977
segment0                          0.987  0.971  0.996  0.987    0.944   0.959        0.987   0.986        0.999         0.999
yeast-0-3-5-9 vs 7-8              0.75   0.675  0.653  0.75     0.751   0.686        0.75    0.754        0.769         0.773
yeast-0-2-5-6 vs 3-7-8-9          0.832  0.687  0.782  0.832    0.79    0.808        0.84    0.842        0.847         0.82
vowel0                            0.98   0.974  1      0.98     0.996   0.994        0.981   0.98         1             1
led7digit-0-2-4-5-6-7-8-9 vs 1    0.954  0.925  0.827  0.954    0.898   0.893        0.953   0.949        0.909         0.904
Glass2                            0.716  0.812  0.587  0.716    0.737   0.735        0.748   0.737        0.713         0.721
cleveland-0 vs 4                  0.909  0.824  0.754  0.909    0.826   0.891        0.915   0.913        0.979         0.976
glass-0-1-6 vs 5                  0.934  0.989  0.825  0.934    0.975   0.905        0.983   0.985        0.9           0.99
car-good                          0.976  0.493  0.648  0.976    0.968   0.971        0.976   0.976        0.653         0.977
flare-F                           0.908  0.525  0.59   0.911    0.876   0.893        0.908   0.912        0.825         0.806
car-vgood                         0.997  0.992  0.69   0.997    0.997   0.997        0.997   0.997        0.998         0.998
abalone-17 vs 7-8-9-10            0.775  0.613  0.624  0.775    0.817   0.73         0.773   0.774        0.752         0.726
Mean                              0.864  0.791  0.748  0.864    0.858   0.846        0.869   0.869        0.855         0.875


Table 7: Results of the methods in terms of GeoMean.

Data set                          NB     C4.5   KNN    Cost-NB  Ada-NB  Ada-Cost-NB  Bag-NB  Bag-Cost-NB  Stacking-log  Stacking-Cost-log
wisconsin                         0.965  0.954  0.953  0.968    0.959   0.967        0.968   0.966        0.964         0.970
pima                              0.719  0.697  0.649  0.736    0.715   0.739        0.723   0.731        0.703         0.741
haberman                          0.419  0.566  0.455  0.586    0.466   0.573        0.434   0.594        0.361         0.605
Vehicle1                          0.677  0.622  0.652  0.674    0.607   0.677        0.673   0.661        0.561         0.712
Vehicle0                          0.724  0.912  0.919  0.726    0.735   0.733        0.719   0.733        0.894         0.946
segment0                          0.897  0.977  0.995  0.892    0.897   0.906        0.902   0.895        0.994         0.993
yeast-0-3-5-9 vs 7-8              0.467  0.463  0.609  0.464    0.466   0.550        0.445   0.485        0.444         0.603
yeast-0-2-5-6 vs 3-7-8-9          0.762  0.590  0.762  0.793    0.762   0.792        0.762   0.787        0.706         0.798
vowel0                            0.920  0.969  1.000  0.913    0.944   0.929        0.921   0.914        1.000         1.000
led7digit-0-2-4-5-6-7-8-9 vs 1    0.853  0.874  0.818  0.869    0.863   0.870        0.845   0.850        0.846         0.874
Glass2                            0.578  0.472  0.470  0.564    0.578   0.610        0.586   0.568        0.000         0.607
cleveland-0 vs 4                  0.899  0.724  0.722  0.936    0.611   0.863        0.866   0.861        0.866         0.866
glass-0-1-6 vs 5                  0.932  0.877  0.810  0.932    0.932   0.872        0.872   0.869        0.814         0.877
car-good                          0.000  0.000  0.558  0.794    0.520   0.783        0.000   0.778        0.561         0.813
flare-F                           0.751  0.000  0.450  0.781    0.541   0.647        0.738   0.781        0.000         0.774
car-vgood                         0.496  0.888  0.626  0.986    0.917   0.925        0.481   0.971        0.957         0.988
abalone-17 vs 7-8-9-10            0.628  0.321  0.506  0.640    0.319   0.368        0.627   0.630        0.293         0.664
Mean                              0.687  0.641  0.703  0.779    0.696   0.753        0.680   0.769        0.645         0.814


Table 8: Results of the methods in terms of AGeoMean.

Data set                          NB     C4.5   KNN    Cost-NB  Ada-NB  Ada-Cost-NB  Bag-NB  Bag-Cost-NB  Stacking-log  Stacking-Cost-log
wisconsin                         0.961  0.960  0.961  0.962    0.963   0.962        0.962   0.960        0.967         0.967
pima                              0.768  0.743  0.706  0.734    0.767   0.736        0.775   0.729        0.770         0.726
haberman                          0.642  0.680  0.601  0.718    0.660   0.694        0.653   0.721        0.613         0.663
Vehicle1                          0.690  0.721  0.728  0.658    0.722   0.669        0.690   0.654        0.709         0.740
Vehicle0                          0.666  0.929  0.934  0.658    0.681   0.663        0.664   0.663        0.924         0.936
segment0                          0.859  0.986  0.996  0.850    0.859   0.871        0.863   0.854        0.995         0.995
yeast-0-3-5-9 vs 7-8              0.715  0.706  0.782  0.709    0.713   0.737        0.703   0.721        0.700         0.770
yeast-0-2-5-6 vs 3-7-8-9          0.856  0.778  0.855  0.857    0.856   0.855        0.856   0.854        0.839         0.866
vowel0                            0.929  0.981  1.000  0.909    0.966   0.948        0.931   0.910        1.000         1.000
led7digit-0-2-4-5-6-7-8-9 vs 1    0.853  0.874  0.818  0.869    0.863   0.870        0.845   0.850        0.846         0.874
Glass2                            0.511  0.701  0.695  0.493    0.511   0.551        0.539   0.515        0.000         0.744
cleveland-0 vs 4                  0.927  0.845  0.841  0.943    0.783   0.914        0.918   0.910        0.918         0.918
glass-0-1-6 vs 5                  0.954  0.932  0.894  0.954    0.954   0.923        0.923   0.919        0.902         0.932
car-good                          0.000  0.000  0.763  0.878    0.747   0.870        0.000   0.873        0.769         0.890
flare-F                           0.840  0.000  0.705  0.842    0.751   0.795        0.836   0.842        0.000         0.845
car-vgood                         0.743  0.934  0.793  0.986    0.953   0.956        0.725   0.980        0.974         0.990
abalone-17 vs 7-8-9-10            0.736  0.655  0.745  0.716    0.650   0.672        0.734   0.711        0.641         0.804
Mean                              0.744  0.731  0.813  0.808    0.788   0.805        0.742   0.8039       0.739         0.862


Table 9: Comparison between RECSG and other algorithms in terms of wins/losses of the results.

                               NB             C4.5           KNN            Cost-NB        Ada-NB         Ada-Cost-NB    Bag-NB         Bag-Cost-NB    Stacking-log
Stacking-Cost-log (AUC)        11 [5]/6 [3]   15 [11]/2 [2]  17 [15]/0 [2]  11 [6]/6 [5]   13 [7]/4 [2]   14 [9]/3 [1]   10 [6]/7 [5]   10 [6]/7 [5]   11 [3]/6 [2]
Stacking-Cost-log (GeoMean)    15 [13]/2 [2]  17 [13]/0 [0]  15 [13]/2 [1]  14 [8]/3 [2]   16 [12]/1 [1]  16 [10]/1 [0]  17 [13]/0 [0]  16 [10]/3 [0]  16 [12]/1 [0]
Stacking-Cost-log (AGeoMean)   14 [10]/3 [1]  15 [9]/2 [2]   15 [12]/2 [1]  13 [7]/4 [3]   15 [11]/2 [2]  15 [10]/2 [2]  16 [10]/1 [1]  15 [8]/2 [1]   16 [10]/1 [1]

Table 10: Holm post hoc test to show differences between Stacking-Cost-log and the other algorithms.

                                                        Rejected (H0)    Accepted (H0)
Stacking-Cost-log (AUC) versus other algorithms         3                6
Stacking-Cost-log (GeoMean) versus other algorithms     8                1
Stacking-Cost-log (AGeoMean) versus other algorithms    9                0


4.4. Discussion. According to the experiments performed with 17 different imbalanced data sets and the comparison with 9 other classification methods, the proposed RECSG method showed higher performance than the other methods in terms of AUC, GeoMean, and AGeoMean, as illustrated in Tables 6, 7, and 8. The RECSG method showed better performance for 10 of 17 cases in terms of AUC and for 12 of 17 cases in terms of GeoMean and AGeoMean. The method outperformed the other 5 ensemble algorithms for 16 of 17 data sets in terms of GeoMean and AGeoMean and for 13 of 17 data sets in terms of AUC; across the 17 data sets, its advantages under GeoMean and AGeoMean are larger than under AUC. The means of the RECSG method in terms of GeoMean, AGeoMean, and AUC are all higher than those of the other methods. The Holm post hoc test shows that the RECSG method significantly outperforms the rest at the confidence level α = 0.05 for 9 of 9 methods in terms of AGeoMean, 8 of 9 methods in terms of GeoMean, and 3 of 9 methods in terms of AUC.

Experimental results and statistical tests show that the RECSG approach improves the classification performance on imbalanced data sets. The reasons can be explained as follows.

Firstly, the stacking algorithm uses M̃ to combine the results of the base classifiers, whereas bagging employs a majority vote. Bagging is only a simple decision combination method, which requires neither cross-validation nor Level 1 learning. In this paper, stacked generalization adopts logistic regression in M̃, thus providing the simplest linear combination for pooling the Level 0 models' confidences.

Secondly, the cost-sensitive algorithm affects imbalanced data sets in two ways. Firstly, cost-sensitive modifications can be applied in the probabilistic estimation. Moreover, cost-sensitive factors change the weights of instances, and the weight of the minority class is higher than that of the majority class. Therefore, the method directly changes the relative influence of common instances without discarding or duplicating any of the rare instances.

Thirdly, the RECSG method introduces cost-sensitive learning into the stacking ensemble. The cost-sensitive function in M̃ replaces the error-minimizing function, thus allowing Level 1 learning to focus on the minority class.

The results in Tables 7 and 8 demonstrate that the RECSG method has a larger advantage when the evaluation performance of the base classifiers is weaker (such as on Vehicle1, Vehicle0, and car-vgood). The reason is that the alternative methods of NB, C4.5, and k-NN have shortcomings. The independence assumption hampers the performance of NB on some data sets. In C4.5 tree construction, the selection of attributes affects the model performance. The error ratio of k-NN is relatively high when data sets are imbalanced. Therefore, the logistic regression adopted in M̃ can improve the performance when the base classifiers are weaker.

The performance of the RECSG method is generally better when the IR is low (such as on Glass2, car-good, flare-F, car-vgood, and abalone-17 vs 7-8-9-10). The performance of the RECSG method is probably related to the setting of the cost matrix; different data sets should use different cost matrices. For simplicity, in this paper we adopted the same cost matrix throughout, which may be more suitable for low IR values.

5. Conclusions

In this paper, in order to solve the class imbalance problem, we proposed the RECSG method based on 2-layer learning models. Experimental results and statistical tests showed that the RECSG approach improved the classification performance. The proposed RECSG approach may have relatively high computational complexity in the training stage, because the approach involves 2-layer classifier models, which consist of several base classifiers and metaclassifiers. The number and kinds of the base-level classifiers are closely related to the performance of the stacking algorithm, and in the fusion stage of the base classifiers, the selection of the metaclassifier is also important. In this paper, in order to validate the performance improvement compared with other current classification algorithms, we only selected 3 classification algorithms (NB, k-NN, and C4.5) as base classifiers and the cost-sensitive algorithm as the metaclassifier in the fusion stage; the selection of the number or kind of base classifiers and metaclassifier was not discussed. Therefore, we will explore the diversity and quality of base classifiers in the future. The adoption of these strategies would improve the prediction performance and reduce the training time of the stacking algorithm for imbalance problems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is supported by the Natural Science Foundation of Shanxi Province under Grant no. 2015011039.

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179–192, 2016.

[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.

[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.

[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[5] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965–969, IEEE, Hong Kong, December 2006.

[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.

[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference, pp. 204–213, San Francisco, Calif, USA, August 2001.

[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.

[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.

[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303–312, 2007.

[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.

[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases, vol. 2838, pp. 107–119.

[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering (WCSE 2009), pp. 13–17, China, October 2009.

[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.

[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 324–331, USA, April 2009.

[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.

[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," Lecture Notes in Computer Science, vol. 6086, pp. 148–157, 2010.

[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, San Francisco, CA, USA, 1999.

[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983–990, Stanford, CA, USA, 2000.

[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257–264, San Jose, CA, USA.

[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.

[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.

[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, 2014.

[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation, and Automation (ICCIA 2016), pp. 425–429, Iran, January 2016.

[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pp. 4836–4841, Hungary, 2017.

[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics (PCI 2008), pp. 217–221, Greece, August 2008.

[30] H. Y. Lo, J. C. Wang, H. M. Wang, and S. D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 2308–2311, Czech Republic, May 2011.

[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.

[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179–186, 1997.

[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.

[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49–64, 1996.

[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641–1650, 1996.

[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147–149, 1990.

[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.

[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, August 2001.

[41] D. Michie, D. Spiegelhalter, J. C. Taylor, et al., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1995.

[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, no. 1, pp. 271–289, 1999.

[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.

[45] J. Alcalá-Fdez, A. Fernández, J. Luengo, et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.

[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959–977, 2009.

[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 2: Classifying Imbalanced Data Sets by a Novel RE-Sample and ...downloads.hindawi.com/journals/mpe/2018/5036710.pdf · Cost-Sensitive Stacked Generalization Method ... In this paper,

2 Mathematical Problems in Engineering

In boosting-based ensembles the data preprocessingtechnique is embedded into boosting algorithms In each iter-ation the data distribution is changed by altering the weightto train the next classifier towards the positive class Thesealgorithmsmainly include SMOTEBoost [14] RUSBoost [15]MSMOTEBoost [16] and DataBoost-IM [17] algorithms Inbagging-based ensembles the algorithms combine baggingwith data preprocessing techniques The algorithms are usu-ally simpler than their integration in boosting because ofsimplicity and good generalization ability of bagging Thefamily includes but not limited to OverBagging Under-OverBagging [18] UnderBagging [19] and IIVotes [20]In cost-sensitive-boosting ensembles the general learningframework of AdaBoost is maintained but amisclassificationcost adjustment function is introduced into the weightupdating formula These ensembles are usually different inthemodification way of the weight update rule AdaCost [21]CSB1 CSB2 [22] RareBoost [23] AdaC1 AdaC2 and AdaC3[24] are the most representative approaches Unlike previousalgorithms hybrid ensemble methods adopt the doubleensemble learningmethods For example EasyEnsemble andBalanceCascade use bagging as the main ensemble learningmethod but each base classifier is trained by AdaBoostTherefore the final classifier is an ensemble of ensembles

Stacking is another ensemble method. It can use the same base classifiers as the Bagging and AdaBoost algorithms, but its structure is different: a two-level architecture consisting of Level 0 classifiers and Level 1 classifiers. Stacking involves two steps. The first step is to collect the output of each model into a new data set: for each instance in the original training set, the new data set holds every model's prediction of that instance's class, where the models are the base classifiers. In the second step, based on the new data and the true label of each instance in the original training set, a learning algorithm is employed to train the second-layer model. In Wolpert's terminology, the first step is referred to as the Level 0 layer, and the second-stage learning algorithm is referred to as the Level 1 layer [25].

Stacking ensemble is a general method in which a high-level model is combined with lower-level models, and it can achieve higher predictive accuracy. Chen et al. adopted ant colony optimization to configure stacking ensembles for data mining [26]. Kadkhodaei and Moghadam proposed an entropy-based approach to search for the best combination of base classifiers in ensemble classifiers based on stacked generalization [27]. Czarnowski and Jędrzejowicz focused on machine classification with data reduction based on stacked generalization [28]. Most previous studies focused on how to use or generate the stacking algorithm. However, standard stacking does not consider the data distribution and is therefore suitable for common data sets rather than imbalanced data.

In order to solve the imbalance problem, this paper introduces cost-sensitive learning into the stacking ensemble and adds a misclassification cost adjustment function into the weights of instances and classifiers. In this way, misclassification costs can be considered in the data set as a form of data space weighting to select the best distribution for training. On the other hand, in the combination stage, metatechniques can be integrated with cost-sensitive classifiers to replace standard cost-minimizing techniques: the weights for the misclassification of positive instances are higher, and the weights for the misclassification of negative instances are relatively lower. The method provides an option for imbalanced learning domains.

In this paper, the RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method is proposed in order to solve the imbalance problem. In the method, preprocessed imbalanced data are used to train the Level 0 layer model. Unlike common ensemble algorithms for imbalanced data, the proposed method uses a cost-sensitive algorithm as the Level 1 (meta) layer. Stacking methods combined with imbalanced-data approaches, including cost-sensitive learning, have been reported. Kotsiantis proposed a stacking variant methodology with cost-sensitive models as base learners [29]; in that method, the model tree was replaced by MLR in the metalayer to determine the class with the highest probability associated with the true class. Lo et al. proposed a cost-sensitive stacking method for audio tag annotation and retrieval [30]. In these methods, cost-sensitive learners are adopted in the base-layer model, and the metalayer model is trained by other learning algorithms such as SVM and decision tree. In this paper, the Level 0 model generalizer involves resampling the data and training the base classifiers, while the cost-sensitive algorithm is used to train the Level 1 metalayer classifier. The two layers both adopt imbalanced-data algorithms and take full advantage of mature methods, and the Level 1 layer model is biased towards the performance of the minority class. Therefore, the method proposed in this study is more efficient than methods in which cost-sensitive algorithms are used only in the Level 0 layer.

The method was compared with common classification methods, including other ensemble algorithms. Additionally, the evaluation metrics of the algorithms were analyzed based on the results of statistical tests. Statistical tests of the evaluation metrics demonstrated that the proposed approach can effectively solve the imbalance problem.

The paper is structured as follows. Related ensemble approaches and cost-sensitive algorithms are introduced in Section 2. Section 3 introduces the details of the proposed RECSG approach, including the Level 0 and Level 1 model generalizers. In Section 4, the experiments and the corresponding results and analysis are presented, and statistical tests of the evaluation metrics of algorithm performance are analyzed and discussed. Finally, Section 5 discusses the advantages and disadvantages of the proposed method.

2. Background

2.1. Performance Evaluation in Imbalanced Domains. The evaluation metric is a vital factor for classifier modeling and performance assessment. In Table 1, the confusion matrix shows the counts of correctly and incorrectly classified instances of each class in a two-class problem.

Accuracy is the most popular evaluation metric. However, it cannot effectively measure the correct rates of all the classes, so it is not an appropriate metric for imbalanced data sets. For this reason, in addition to accuracy, more suitable metrics should be considered in the imbalance problem.


Table 1: Confusion matrix for performance evaluation.

                 Positive prediction    Negative prediction
Positive class   True positive (TP)     False negative (FN)
Negative class   False positive (FP)    True negative (TN)

Other metrics have been proposed to measure the classification performance independently. Based on the confusion matrix (Table 1), these metrics are defined as

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F\text{-Measure} = \frac{(1+\beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot \text{Recall} + \text{Precision}} \tag{1} \]

where β is a coefficient to adjust the relative importance of precision versus recall (usually β = 1).

\[ FP_{rate} = \frac{FP}{FP + TN}, \qquad TP_{rate} = \frac{TP}{TP + FN} \tag{2} \]

The combined evaluation metrics based on these measures include the receiver operating characteristic (ROC) graphic [31], the area under the ROC curve (AUC) [2], the geometric mean of sensitivity and specificity (GeoMean) [32] (see (4)), and the adjusted geometric mean (AGeoMean) [33] (see (5)), where N_n refers to the proportion of majority samples. These metrics are defined as

\[ \text{AUC} = \frac{1 + TP_{rate} - FP_{rate}}{2} \tag{3} \]

\[ \text{GeoMean} = \sqrt{TP_{rate} \cdot TN_{rate}} \tag{4} \]

\[ \text{AGeoMean} = \begin{cases} \dfrac{\text{GeoMean} + TN_{rate} \cdot N_n}{1 + N_n}, & TP_{rate} > 0 \\ 0, & TP_{rate} = 0 \end{cases} \tag{5} \]
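For illustration, the following minimal Python sketch computes these three metrics from the entries of a binary confusion matrix; the function and variable names are ours, not from the paper:

```python
import numpy as np

def imbalance_metrics(tp, fn, fp, tn):
    """Compute AUC (eq. 3), GeoMean (eq. 4), and AGeoMean (eq. 5)
    from a binary confusion matrix (Table 1)."""
    tp_rate = tp / (tp + fn)                 # sensitivity (minority-class recall)
    fp_rate = fp / (fp + tn)
    tn_rate = tn / (tn + fp)                 # specificity
    n_n = (tn + fp) / (tp + fn + fp + tn)    # proportion of majority samples
    auc = (1 + tp_rate - fp_rate) / 2
    geo_mean = np.sqrt(tp_rate * tn_rate)
    ageo_mean = (geo_mean + tn_rate * n_n) / (1 + n_n) if tp_rate > 0 else 0.0
    return auc, geo_mean, ageo_mean

# Example: 40 of 50 minority and 430 of 456 majority instances classified correctly
print(imbalance_metrics(tp=40, fn=10, fp=26, tn=430))
```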

2.2. Stacking. Bagging and AdaBoost are the most common ensemble learning algorithms. In the bagging method, different base classifier models generate different classification results, and the final decision is made by majority voting. In the AdaBoost algorithm, a series of weak base classifiers are trained with the whole training set, and the final decision is generated by a weighted majority voting scheme; in each round of training, different weights are attributed to each instance. In both algorithms, the base classifiers are of the same kind.

Stacking is another ensemble algorithm, in which the prediction results of the base classifiers are used as the attributes to train the combination function in the metalayer classifier [25]. The algorithm has a two-level structure consisting of Level 0 classifiers and Level 1 classifiers. It was proposed by Wolpert and used by Breiman [34] and LeBlanc and Tibshirani [35].

Consider a data set

\[ D = \{(X_n, Y_n),\ n = 1, \ldots, N\} \tag{6} \]

where X_n is a vector representing the attribute values of instance n and Y_n is its class value. All N instances are randomly split into j equivalent parts, and j-fold cross-validation is used to train the model. The predictions on the test folds together give a vector covering the n instances

\[ y_1, y_2, \ldots, y_n \tag{7} \]

Let M_j (j = 1, ..., k) be the models obtained by applying different base learning algorithms to the data set D; the M_j constitute the Level 0 models. The Level 0 layer thus consists of k base classifiers, which are employed to estimate the class probability of each instance. For each M_j, j-fold cross-validation on the training data set D gives the prediction results

\[ y_1^{(j)}, y_2^{(j)}, \ldots, y_n^{(j)}, \qquad j = 1, \ldots, k \tag{8} \]

All the M_j prediction results generate the input space of the Level 1 model, and the real class value of each instance is treated as the output space. The model input (left columns) and output (right column) are expressed as follows:

\[ \begin{array}{ccccc|c}
y_1^{(1)} & y_1^{(2)} & \cdots & y_1^{(i)} & \cdots\ y_1^{(k)} & y_1 \\
y_2^{(1)} & y_2^{(2)} & \cdots & y_2^{(i)} & \cdots\ y_2^{(k)} & y_2 \\
\vdots & & & & & \vdots \\
y_n^{(1)} & y_n^{(2)} & \cdots & y_n^{(i)} & \cdots\ y_n^{(k)} & y_n
\end{array} \tag{9} \]

The above intermediate data are the training data of the Level 1 layer model: the input columns are treated as features, and the real class value of each instance is treated as the output space. The next step is to train these data with some fusing learning algorithm. This process is called the Level 1 generalizer, and the Level 1 model is denoted by M̃, which can be regarded as a function of (y^(1), y^(2), ..., y^(i), ..., y^(k)).

In the stacking process, Level 0 (M_j, j = 1, ..., k) is combined with Level 1 (M̃). Given a new instance x', the models M_j produce a prediction result vector

\[ y^{(1)}, y^{(2)}, \ldots, y^{(i)}, \ldots, y^{(k)}, \qquad i = 1, \ldots, k \tag{10} \]

Then the model M̃ is used to combine the base classifiers and predict the final result for x'.
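A minimal sketch of this two-level procedure, Level 0 meta-feature construction via cross-validated predictions followed by a Level 1 combiner, might look as follows in Python with scikit-learn; the classifier choices here are generic placeholders, not the paper's exact configuration:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stack_meta_features(models, X, y, folds=10):
    # Eqs. (8)-(9): each column holds one base model's out-of-fold
    # predicted probability of the positive class.
    cols = [cross_val_predict(m, X, y, cv=folds, method="predict_proba")[:, 1]
            for m in models]
    return np.column_stack(cols)

base = [GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier(1)]
# X, y = ...  (training data)
# Z = stack_meta_features(base, X, y)    # Level 0 output = Level 1 input
# meta = LogisticRegression().fit(Z, y)  # Level 1 generalizer (model M~)
```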

In this paper, for imbalanced data, we propose a stacked generalization based on cost-sensitive classification. In the Level 0 model generalizer layer, we first resample the imbalanced data; the resampling approaches include oversampling and undersampling.


[Figure 1: Flowchart of the RECSG architecture. The training and test data D = {(x_i, y_i)} are resampled (oversampling, undersampling, or SMOTE) into New data_1, ..., New data_k; the Level 0 (base)-layer classifiers M_1, ..., M_k are trained on these sets; their decision spaces {u_j(x_i)} form the ensemble space that feeds the Level 1 (meta)-layer classifier M̃.]

Secondly, each model M_j (j = 1, ..., k) is trained by its classification algorithm on the new data, and the prediction outputs y_i^(j) (j = 1, ..., k) are produced on the original data. In the Level 1 model generalizer layer, the model M̃ is generated by cost-sensitive classification based on logistic regression.

3. Stacked Generalization for the Imbalance Problem

The proposed RECSG architecture includes two layers. The first layer (Level 0), consisting of the classifier ensemble, is called the base layer; the second layer (Level 1), which combines the base classifiers, is called the metalayer. The flowchart of the architecture is shown in Figure 1.

3.1. Level 0 Model Generalizers. For the imbalance problem, the Level 0 model generalizer step of RECSG includes preprocessing the data and training the base classifiers. Firstly, the oversampling (SMOTE) method is used in the data preprocessing for each base classifier. Secondly, the base classifier model is trained with the new data set. In this level, we employed three base algorithms: Naïve Bayes (NB) [36], the decision tree C4.5 [37], and k-nearest neighbors (k-NN) [38].

(1) Naïve Bayes (NB). Given an instance x, let P(i | x) be the posterior probability of class i; then

\[ P_i(x) = \frac{P(i \mid x)}{\sum_i P(i \mid x)} \tag{11} \]

where NB uses a Laplacian estimate for estimating the conditional probabilities.


Table 2: Cost matrix.

                   Actual negative    Actual positive
Predict negative   C(0,0) = c00       C(0,1) = c01
Predict positive   C(1,0) = c10       C(1,1) = c11

Table 3: Cost matrix of credit data sets.

               Actual bad   Actual good
Predict bad    0            1
Predict good   5            0

(2) Decision Tree (C4.5). Decision tree classification is a method commonly used in data mining [39]. A decision tree is a tree in which each input feature labels a nonleaf node, and one class, or a probability distribution over the class labels, labels each leaf node. By recursive partitioning, a tree can be "learned" until no prediction value is added or the subset has the same value. When information entropy is selected as the splitting criterion on the training data, the decision tree is named C4.5.

(3) k-NN. k-NN is a nonparametric algorithm used for regression or classification. An instance is classified by a majority vote of its k nearest neighbors. If k = 1, the instance is simply assigned the class label of the nearest neighbor.

All the above algorithms are simple and have low complexity, so they are applicable as weak base classifiers.

3.2. Level 1 Model Generalizers. The prediction results of the several base classifiers in the Level 0 layer are used as the input space, and the true class value of each instance is used as the output space. On top of the Level 0 layer, the Level 1 layer model is trained by another learning algorithm. For imbalanced data, the cost-sensitive algorithm is used in this paper to train the Level 1 metalayer classifier.

3.2.1. Cost-Sensitive Classifier. Aiming at the imbalanced-data learning problem, a cost-sensitive classifier uses a cost matrix as the penalty for misclassified instances [40]. For example, a cost matrix C has the structure shown in Table 2 for a binary classification scenario.

In the cost matrix, the rows indicate the predicted classes, whereas the columns indicate the actual classes. The cost of a false negative is denoted c01 and the cost of a false positive c10. Conceptually, the cost of correctly classified instances should always be less than the cost of incorrectly classified instances; in the imbalance problem, c10 is always greater than c01. For the German credit data set, previously reported as part of the Statlog project [41], the cost matrix is provided in Table 3.

For economic reasons, the cost of falsely predicting "good" is greater than the cost of falsely predicting "bad". Therefore, in the Level 1 classifier of stacking for the imbalance problem, the cost-sensitive algorithm is adopted.
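To make the role of the cost matrix concrete, a standard cost-sensitive decision rule picks the class with the minimum expected cost given a model's posterior probabilities. The sketch below is a generic illustration of this Elkan-style rule [40], not code from the paper; the mapping good = class 0, bad = class 1 (minority) is our assumption, with the values of Table 3:

```python
import numpy as np

# Rows = predicted class, columns = actual class, as in Tables 2-3.
# Assumed mapping: class 0 = good (majority), class 1 = bad (minority).
COST = np.array([[0.0, 5.0],    # predict good: costs 5 if actually bad
                 [1.0, 0.0]])   # predict bad:  costs 1 if actually good

def cost_sensitive_predict(proba):
    """proba: (n_samples, 2) posteriors P(actual = j | x).
    Returns the class i minimizing the expected cost sum_j P(j|x) * C(i, j)."""
    expected_cost = proba @ COST.T   # (n_samples, 2) expected cost per prediction
    return np.argmin(expected_cost, axis=1)

# With these costs the decision threshold shifts towards the minority class:
print(cost_sensitive_predict(np.array([[0.9, 0.1], [0.7, 0.3]])))  # -> [0 1]
```

Note how the second instance is labeled "bad" even though P(bad) is only 0.3: the 5:1 cost ratio moves the effective threshold from 0.5 to 1/6.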

3.2.2. Logistic Regression. Ting and Witten illustrated that the MLR (Multiresponse Linear Regression) algorithm has an advantage over other Level 1 generalizers in stacking [42]. In this paper, the logistic regression classifier is used as the metalayer algorithm, and, on top of the metalayer, the cost of misclassification is accounted for by the cost-sensitive algorithm. In the logistic regression classifier, the prediction results of the Level 0 layer are used as the attributes of the Level 1 metalayer, and the real class value of each instance is used as the output space. The linear regression for class i is simply obtained as

\[ LR_i(x) = \sum_{j=1}^{K} \alpha_{ji} P_{ji}(x), \qquad j = 1, \ldots, k \tag{12} \]

Details of the implementation of RECSG are presented below.

3.3. Algorithm Pseudocode. Pseudocode 1 presents the proposed RECSG approach. The input parameters are two data sets, the training set and the testing set; the output is the predicted class labels of the test samples. The first step in the process is data preprocessing: resample new instances and train the models M_j (j = 1, ..., K), where K is the number of base classifiers and u_j(x_i) is the prediction function for x_i based on model M_j (lines (1)-(2) in Pseudocode 1).

Then the metalayer model is constructed based on the data D_meta, which is first predicted with the Level 0 layer (line (3) in Pseudocode 1). Finally, the Level 1 layer classification (cost-sensitive and logistic regression) is used to predict the ultimate values of the test samples (line (4)).

4. Empirical Investigation

The experiments aim to verify whether the RECSG approach can improve the classification performance for the imbalance problem. In this paper, the RECSG approach was compared with other algorithms involving various combinations of ensemble techniques and imbalanced-data approaches. For each method, the same training set, testing set, and validation set were used.

The experimental system was implemented with 7 learning methods from Weka [43], namely, Naïve Bayes(1) (NB), C4.5 (J48)(2), k-nearest neighbor (k-NN)(3), cost-sensitive(4), AdaBoost(5), Bagging(6), and stacking(7). Brief descriptions, the standard versions, and the parameters of these methods are shown in Table 4.

The RECSG approach includes 2 layers. The Level 1 layer (metalearning) system consists of 2 learning methods implemented in Weka, namely, simple logistic regression(9) (MLR) and the cost-sensitive classifier(4); MLR is the base classification algorithm of the cost-sensitive classifier. The Level 0 layer (base learning) of stacking is composed of 3 classifiers: Naïve Bayes(1), C4.5(2), and k-NN(3). The evaluation metrics of algorithm performance include AUC, GeoMean, and AGeoMean.

4.1. Experimental Settings. Experiments were implemented with 17 data sets from the UCI Machine Learning Repository [44]. These data sets cover various fields and exhibit a range of IR values (from 0.54 to 0.014), unique data set names, a varying number of samples (from 173 to 2338),


Input: Training set D = {(x_i, y_i)}_{i=1}^{N}; test data set D_test = {(x'_i)}_{i=1}^{N'}
Output: Predicted class labels of the test samples {y'_{i,meta}}_{i=1}^{N'}
For each j = 1, 2, ..., K do
  (1) Resample the imbalanced data and generate j-fold cross-validation sets to obtain New data_j
  (2) Train and compute {u_j(x_i)}_{i=1}^{N} and {u_j(x'_i)}_{i=1}^{N'} in the Level 0 (base)-layer classifier M_j
end
(3) Construct D_meta = {(u_1(x_i), ..., u_K(x_i)), y_i}_{i=1}^{N} and D'_meta = {(u_1(x'_i), ..., u_K(x'_i))}_{i=1}^{N'}
(4) Based on the data D_meta, classification (cost-sensitive and logistic regression) is used to generate the Level 1 (meta)-layer model M̃, which is applied to D'_meta to predict {y'_{i,meta}}_{i=1}^{N'}

Pseudocode 1: Pseudocode of RE-sample and Cost-Sensitive Stacked Generalization.
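Putting the pieces together, the following Python sketch mirrors the structure of Pseudocode 1 using scikit-learn and imbalanced-learn; SMOTE, the three base learners, and a class-weighted logistic regression stand in for the paper's Weka components, and the 5:1 class weight is an assumption borrowed from Table 3, not the paper's tuned cost matrix:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def recsg_fit_predict(X, y, X_test, folds=10):
    base = [GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier(1)]
    # (1) Resample the imbalanced training data (Level 0 preprocessing).
    Xr, yr = SMOTE(random_state=1).fit_resample(X, y)
    meta_train, meta_test = [], []
    for m in base:
        # (2) Out-of-fold predictions u_j(x_i) on the training data ...
        meta_train.append(cross_val_predict(m, Xr, yr, cv=folds,
                                            method="predict_proba")[:, 1])
        # ... and predictions u_j(x'_i) on the test data from a model fit on all of Xr.
        meta_test.append(clone(m).fit(Xr, yr).predict_proba(X_test)[:, 1])
    # (3) D_meta and D'_meta.
    Z, Z_test = np.column_stack(meta_train), np.column_stack(meta_test)
    # (4) Cost-sensitive logistic regression as the Level 1 model M~;
    # class_weight plays the role of the misclassification cost matrix (assumed 5:1).
    meta = LogisticRegression(class_weight={0: 1, 1: 5}).fit(Z, yr)
    return meta.predict(Z_test)
```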

Table 4: Base classifier methods and corresponding ensemble algorithms.

(1) Naïve Bayes (NB): weka.classifiers.bayes.NaiveBayes
(2) C4.5 (C4.5): weka.classifiers.trees.J48 -C 0.25 -M 2
(3) k-nearest neighbor (KNN): weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last"
(4) Cost-sensitive with Naïve Bayes (Cost-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(5) AdaBoost with Naïve Bayes (Ada-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(6) AdaBoost with cost-sensitive and Naïve Bayes (Ada-Cost-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.meta.CostSensitiveClassifier -- -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(7) Bagging with Naïve Bayes (Bag-NB): weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(8) Bagging with cost-sensitive and Naïve Bayes (Bag-Cost-NB): weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.meta.CostSensitiveClassifier -- -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(9) Stacking with NB, KNN, C4.5, and logistic (Stacking-log): weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4" -S 1 -num-slots 1 -B weka.classifiers.bayes.NaiveBayes -B "weka.classifiers.lazy.IBk -K 1 -W 0 -A weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last" -B "weka.classifiers.trees.J48 -C 0.25 -M 2"
(10) Stacking cost-sensitive with NB, KNN, C4.5, and logistic (Stacking-Cost-log): weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.meta.CostSensitiveClassifier -cost-matrix [1.0 3.0; 1.0 1.0] -S 1 -W weka.classifiers.functions.Logistic -- -R 1.0E-8 -M -1 -num-decimal-places 4" -S 1 -num-slots 1 -B weka.classifiers.bayes.NaiveBayes -B "weka.classifiers.lazy.IBk -K 1 -W 0 -A weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last" -B "weka.classifiers.trees.J48 -C 0.25 -M 2"


Table 5: Summary of imbalanced data sets.

Data set                          Sam (Min, Maj)      IR       Fea
wisconsin                         683 (239, 444)      0.54     9
pima                              768 (268, 500)      0.53     8
haberman                          306 (81, 225)       0.36     3
Vehicle1                          846 (217, 629)      0.345    18
Vehicle0                          846 (199, 647)      0.307    18
segment0                          2308 (329, 1979)    0.17     19
yeast-0-3-5-9 vs 7-8              506 (50, 456)       0.11     8
yeast-0-2-5-6 vs 3-7-8-9          1004 (99, 905)      0.11     8
vowel0                            988 (90, 898)       0.10     13
led7digit-0-2-4-5-6-7-8-9 vs 1    443 (37, 406)       0.09     7
Glass2                            214 (17, 197)       0.08     10
cleveland-0 vs 4                  173 (13, 160)       0.08     13
glass-0-1-6 vs 5                  184 (9, 175)        0.05     9
car-good                          1728 (69, 1659)     0.04     6
flare-F                           1066 (43, 1023)     0.04     11
car-vgood                         1728 (65, 1663)     0.039    6
abalone-17 vs 7-8-9-10            2338 (58, 2280)     0.0254   9

and variations in the amount of class overlap (see the KEEL repository [45]). Multiclass data sets were modified to obtain two-class imbalanced problems, so that the union of one or more classes became the positive class and the union of one or more of the remaining classes was labeled as the negative class. A brief description of the used data sets is presented in Table 5. It includes the total number of instances (Sam), the number of instances of each class (Min, Maj), the imbalance ratio (IR, the ratio of the number of minority class instances to majority class instances), and the number of features (Fea).

Our system was compared with other ensemble algorithms, including AdaBoost with Naïve Bayes, AdaBoost with cost-sensitive, Bagging with Naïve Bayes, Bagging with cost-sensitive, and stacking cost-sensitive with NB, k-NN, C4.5, and logistic regression. All the experiments were performed with 10-fold cross-validation.

4.2. Experimental Results. Tables 6, 7, and 8, respectively, show the results of the three metrics (AUC, GeoMean, and AGeoMean) for the algorithms in Table 4 obtained with the data sets in Table 5.

The results show that the performance of the proposed RECSG method was the best for 12 of 17 data sets in terms of GeoMean and AGeoMean and for 10 of 17 data sets in terms of AUC. Some methods are better than others in some evaluation metrics, but no method dominates in all metrics and most data sets.

In Table 6 (AUC), the performance of the RECSG method is better than that of the 3 single-base classification methods for 14 of 17 data sets and better than that of the other 5 ensemble algorithms for 13 of 17 data sets. In Table 7 (GeoMean), the RECSG method outperforms all 3 single-base classification methods in 15 of 17 data sets and outperforms the other 5 ensemble algorithms in 15 of 17 data sets. In Table 8 (AGeoMean), the RECSG method outperforms all 3 single-base classification methods in 14 of 17 data sets and outperforms the other 5 ensemble algorithms in 16 of 17 data sets. Across the 17 data sets, the advantages in GeoMean and AGeoMean are larger than in AUC.

4.3. Statistical Tests. Statistical tests [46] are adopted to compare the different algorithms. In this paper, we use two types of comparisons: pairwise comparison (between a pair of algorithms) and multiple comparisons (among a group of algorithms).

4.3.1. Pair of Algorithms. We performed T-tests to explore whether the RECSG approach is significantly better than the other algorithms in the three metrics (AUC, GeoMean, and AGeoMean). Table 9 shows the results of the RECSG approach compared with the other methods in terms of AUC, GeoMean, and AGeoMean. The values in square brackets indicate the number of data sets with a statistically significant difference in the T-test performed at the confidence level α = 0.05 for the corresponding evaluation metric. For the AUC metric, for example, RECSG outperformed NB on 11 of 17 data sets, and 5 of those data sets showed a statistically significant difference at the confidence level α = 0.05.
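For instance, the pairwise comparison for one metric can be reproduced with a paired t-test over the per-data-set scores; the sketch below uses SciPy, and the arrays are placeholders standing in for the full 17-value AUC columns of Table 6:

```python
from scipy import stats

# Hypothetical per-data-set AUC scores for two methods (truncated for illustration;
# the real comparison uses the 17 paired values from Table 6).
recsg_auc = [0.982, 0.819, 0.630, 0.782]
nb_auc    = [0.986, 0.819, 0.640, 0.711]

t_stat, p_value = stats.ttest_rel(recsg_auc, nb_auc)
significant = p_value < 0.05   # reject H0 (equal means) at alpha = 0.05
```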

4.3.2. Multiple Comparisons. We performed the Holm post hoc test [47] on multiple groups for the three metrics (AUC, GeoMean, and AGeoMean). Post hoc procedures determine whether a comparison hypothesis should be rejected at a specified confidence level α. The statistical experiment was performed on the platform at http://tec.citius.usc.es/stac. Table 10 shows for how many of the other algorithms the RECSG approach is significantly different at the confidence level α = 0.05 in terms of AUC, GeoMean, and AGeoMean; H0 indicates that there is no significant difference.
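The Holm step-down correction itself is straightforward to apply to a collection of pairwise p-values; the sketch below uses statsmodels as one common implementation, with hypothetical p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values of RECSG versus the 9 competing methods for one metric.
p_values = [0.004, 0.010, 0.020, 0.030, 0.041, 0.050, 0.120, 0.300, 0.450]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(reject)   # True where H0 (no significant difference) is rejected
```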


Table 6: Results of the methods in terms of AUC.

Data set                          NB     C4.5   KNN    Cost-NB  Ada-NB  Ada-Cost-NB  Bag-NB  Bag-Cost-NB  Stacking-log  Stacking-Cost-log
wisconsin                         0.986  0.957  0.953  0.989    0.981   0.982        0.990   0.989        0.982         0.982
pima                              0.819  0.751  0.662  0.819    0.804   0.808        0.817   0.819        0.819         0.819
haberman                          0.640  0.615  0.530  0.640    0.667   0.625        0.638   0.647        0.636         0.630
Vehicle1                          0.711  0.722  0.671  0.711    0.769   0.723        0.715   0.712        0.782         0.782
Vehicle0                          0.814  0.930  0.920  0.814    0.784   0.781        0.808   0.808        0.977         0.977
segment0                          0.987  0.971  0.996  0.987    0.944   0.959        0.987   0.986        0.999         0.999
yeast-0-3-5-9 vs 7-8              0.750  0.675  0.653  0.750    0.751   0.686        0.750   0.754        0.769         0.773
yeast-0-2-5-6 vs 3-7-8-9          0.832  0.687  0.782  0.832    0.790   0.808        0.840   0.842        0.847         0.820
vowel0                            0.980  0.974  1.000  0.980    0.996   0.994        0.981   0.980        1.000         1.000
led7digit-0-2-4-5-6-7-8-9 vs 1    0.954  0.925  0.827  0.954    0.898   0.893        0.953   0.949        0.909         0.904
Glass2                            0.716  0.812  0.587  0.716    0.737   0.735        0.748   0.737        0.713         0.721
cleveland-0 vs 4                  0.909  0.824  0.754  0.909    0.826   0.891        0.915   0.913        0.979         0.976
glass-0-1-6 vs 5                  0.934  0.989  0.825  0.934    0.975   0.905        0.983   0.985        0.900         0.990
car-good                          0.976  0.493  0.648  0.976    0.968   0.971        0.976   0.976        0.653         0.977
flare-F                           0.908  0.525  0.590  0.911    0.876   0.893        0.908   0.912        0.825         0.806
car-vgood                         0.997  0.992  0.690  0.997    0.997   0.997        0.997   0.997        0.998         0.998
abalone-17 vs 7-8-9-10            0.775  0.613  0.624  0.775    0.817   0.730        0.773   0.774        0.752         0.726
Mean                              0.864  0.791  0.748  0.864    0.858   0.846        0.869   0.869        0.855         0.875


Table 7: Results of the methods in terms of GeoMean.

Data set                          NB     C4.5   KNN    Cost-NB  Ada-NB  Ada-Cost-NB  Bag-NB  Bag-Cost-NB  Stacking-log  Stacking-Cost-log
wisconsin                         0.965  0.954  0.953  0.968    0.959   0.967        0.968   0.966        0.964         0.970
pima                              0.719  0.697  0.649  0.736    0.715   0.739        0.723   0.731        0.703         0.741
haberman                          0.419  0.566  0.455  0.586    0.466   0.573        0.434   0.594        0.361         0.605
Vehicle1                          0.677  0.622  0.652  0.674    0.607   0.677        0.673   0.661        0.561         0.712
Vehicle0                          0.724  0.912  0.919  0.726    0.735   0.733        0.719   0.733        0.894         0.946
segment0                          0.897  0.977  0.995  0.892    0.897   0.906        0.902   0.895        0.994         0.993
yeast-0-3-5-9 vs 7-8              0.467  0.463  0.609  0.464    0.466   0.550        0.445   0.485        0.444         0.603
yeast-0-2-5-6 vs 3-7-8-9          0.762  0.590  0.762  0.793    0.762   0.792        0.762   0.787        0.706         0.798
vowel0                            0.920  0.969  1.000  0.913    0.944   0.929        0.921   0.914        1.000         1.000
led7digit-0-2-4-5-6-7-8-9 vs 1    0.853  0.874  0.818  0.869    0.863   0.870        0.845   0.850        0.846         0.874
Glass2                            0.578  0.472  0.470  0.564    0.578   0.610        0.586   0.568        0.000         0.607
cleveland-0 vs 4                  0.899  0.724  0.722  0.936    0.611   0.863        0.866   0.861        0.866         0.866
glass-0-1-6 vs 5                  0.932  0.877  0.810  0.932    0.932   0.872        0.872   0.869        0.814         0.877
car-good                          0.000  0.000  0.558  0.794    0.520   0.783        0.000   0.778        0.561         0.813
flare-F                           0.751  0.000  0.450  0.781    0.541   0.647        0.738   0.781        0.000         0.774
car-vgood                         0.496  0.888  0.626  0.986    0.917   0.925        0.481   0.971        0.957         0.988
abalone-17 vs 7-8-9-10            0.628  0.321  0.506  0.640    0.319   0.368        0.627   0.630        0.293         0.664
Mean                              0.687  0.641  0.703  0.779    0.696   0.753        0.680   0.769        0.645         0.814


Table 8: Results of the methods in terms of AGeoMean.

Data set                          NB     C4.5   KNN    Cost-NB  Ada-NB  Ada-Cost-NB  Bag-NB  Bag-Cost-NB  Stacking-log  Stacking-Cost-log
wisconsin                         0.961  0.960  0.961  0.962    0.963   0.962        0.962   0.960        0.967         0.967
pima                              0.768  0.743  0.706  0.734    0.767   0.736        0.775   0.729        0.770         0.726
haberman                          0.642  0.680  0.601  0.718    0.660   0.694        0.653   0.721        0.613         0.663
Vehicle1                          0.690  0.721  0.728  0.658    0.722   0.669        0.690   0.654        0.709         0.740
Vehicle0                          0.666  0.929  0.934  0.658    0.681   0.663        0.664   0.663        0.924         0.936
segment0                          0.859  0.986  0.996  0.850    0.859   0.871        0.863   0.854        0.995         0.995
yeast-0-3-5-9 vs 7-8              0.715  0.706  0.782  0.709    0.713   0.737        0.703   0.721        0.700         0.770
yeast-0-2-5-6 vs 3-7-8-9          0.856  0.778  0.855  0.857    0.856   0.855        0.856   0.854        0.839         0.866
vowel0                            0.929  0.981  1.000  0.909    0.966   0.948        0.931   0.910        1.000         1.000
led7digit-0-2-4-5-6-7-8-9 vs 1    0.853  0.874  0.818  0.869    0.863   0.870        0.845   0.850        0.846         0.874
Glass2                            0.511  0.701  0.695  0.493    0.511   0.551        0.539   0.515        0.000         0.744
cleveland-0 vs 4                  0.927  0.845  0.841  0.943    0.783   0.914        0.918   0.910        0.918         0.918
glass-0-1-6 vs 5                  0.954  0.932  0.894  0.954    0.954   0.923        0.923   0.919        0.902         0.932
car-good                          0.000  0.000  0.763  0.878    0.747   0.870        0.000   0.873        0.769         0.890
flare-F                           0.840  0.000  0.705  0.842    0.751   0.795        0.836   0.842        0.000         0.845
car-vgood                         0.743  0.934  0.793  0.986    0.953   0.956        0.725   0.980        0.974         0.990
abalone-17 vs 7-8-9-10            0.736  0.655  0.745  0.716    0.650   0.672        0.734   0.711        0.641         0.804
Mean                              0.744  0.731  0.813  0.808    0.788   0.805        0.742   0.804        0.739         0.862


Table 9: Comparison between RECSG and the other algorithms in terms of wins/losses; bracketed values are the statistically significant cases.

Versus                        NB           C4.5         KNN          Cost-NB     Ada-NB       Ada-Cost-NB  Bag-NB       Bag-Cost-NB  Stacking-log
Stacking-Cost-log (AUC)       11[5]/6[3]   15[11]/2[2]  17[15]/0[2]  11[6]/6[5]  13[7]/4[2]   14[9]/3[1]   10[6]/7[5]   10[6]/7[5]   11[3]/6[2]
Stacking-Cost-log (GeoMean)   15[13]/2[2]  17[13]/0[0]  15[13]/2[1]  14[8]/3[2]  16[12]/1[1]  16[10]/1[0]  17[13]/0[0]  16[10]/3[0]  16[12]/1[0]
Stacking-Cost-log (AGeoMean)  14[10]/3[1]  15[9]/2[2]   15[12]/2[1]  13[7]/4[3]  15[11]/2[2]  15[10]/2[2]  16[10]/1[1]  15[8]/2[1]   16[10]/1[1]

Table 10: Holm post hoc test showing differences between Stacking-Cost-log and the other algorithms.

                                                       Rejected (H0)   Accepted (H0)
Stacking-Cost-log (AUC) versus other algorithms        3               6
Stacking-Cost-log (GeoMean) versus other algorithms    8               1
Stacking-Cost-log (AGeoMean) versus other algorithms   9               0

According to the GeoMean metric, RECSG is significantly better than 8 of the other methods at the confidence level α = 0.05.

4.4. Discussion. According to the experiments performed with 17 different imbalanced data sets and the comparison with 9 other classification methods, the proposed RECSG method showed higher performance than the other methods in terms of AUC, GeoMean, and AGeoMean, as illustrated in Tables 6, 7, and 8. The RECSG method showed the best performance for 10 of 17 cases in terms of AUC and for 12 of 17 cases in terms of GeoMean and AGeoMean. The method outperformed the other 5 ensemble algorithms for 16 of 17 data sets in terms of GeoMean and AGeoMean and for 13 of 17 data sets in terms of AUC; across the 17 data sets, the gains in GeoMean and AGeoMean are larger than in AUC. The means of the RECSG method in terms of GeoMean, AGeoMean, and AUC are all higher than those of the other methods. The Holm post hoc test shows that the RECSG method significantly outperforms the rest at the confidence level α = 0.05 for 9 of 9 methods in terms of AGeoMean, for 8 of 9 methods in terms of GeoMean, and for 3 of 9 methods in terms of AUC.

Experimental results and statistical tests show that the RECSG approach improves the classification performance on imbalanced data sets. The reasons can be explained as follows.

Firstly, the stacking algorithm uses M̃ to combine the results of the base classifiers, whereas bagging employs a majority vote. Bagging is only a simple decision combination method, which requires neither cross-validation nor Level 1 learning. In this paper, stacked generalization adopts logistic regression in M̃, thus providing the simplest linear combination for pooling the Level 0 models' confidences.

Secondly, the cost-sensitive algorithm can affect imbalanced data sets in two aspects. Cost-sensitive modifications can be applied in the probabilistic estimation; moreover, cost-sensitive factors change the weights of instances, and the weight of the minority class is higher than that of the majority class. Therefore, the method directly changes the influence of the common instances without discarding or duplicating any of the rare instances.

Thirdly, the RECSG method introduces cost-sensitive learning into the stacking ensemble. The cost-sensitive function in M̃ replaces the error-minimizing function, thus allowing the Level 1 learning to focus on the minority class.

The results in Tables 7 and 8 demonstrate that the RECSG method achieves larger improvements when the evaluation performance of the base classifiers is weaker (such as on Vehicle1, Vehicle0, and car-vgood). The reason is that the alternative methods NB, C4.5, and k-NN all have shortcomings: the independence assumption hampers the performance of NB on some data sets; in C4.5 tree construction, the selection of attributes affects the model performance; and the error ratio of k-NN is relatively high when data sets are imbalanced. Therefore, the logistic regression adopted in M̃ can improve the performance when the base classifiers are weaker.

The performance of the RECSG method is generally better when the IR is low (such as on Glass2, car-good, flare-F, car-vgood, and abalone-17 vs 7-8-9-10). The performance of the RECSG method is probably related to the setting of the cost matrix: different data sets should use different cost matrices. For simplicity, in this paper we adopted the same cost matrix throughout, which may be more suitable for low IR values.

5. Conclusions

In this paper, in order to solve the class imbalance problem, we proposed the RECSG method based on 2-layer learning models. Experimental results and statistical tests showed that the RECSG approach improves the classification performance. The proposed RECSG approach may have relatively high computational complexity in the training stage, because the approach involves 2-layer classifier models which consist of several base classifiers and a metaclassifier. The number and kinds of the base-level classifiers are closely related to the performance of the stacking algorithm, and in the fusion stage of the base classifiers the selection of the metaclassifier is also important. In this paper, in order to validate the performance improvement over other current classification algorithms, we only selected 3 classification algorithms (NB, k-NN, and C4.5) as base classifiers and the cost-sensitive algorithm as the metaclassifier in the fusion stage; the selection of the number or kind of base classifiers and metaclassifier was not discussed. Therefore, we will explore the diversity and quality of base classifiers in future work. The adoption of these strategies would improve the prediction performance and reduce the training time of the stacking algorithm for imbalance problems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is supported by the Natural Science Foundation of Shanxi Province under Grant no. 2015011039.

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179–192, 2016.

[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.

[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.

[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[5] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965–969, IEEE, Hong Kong, December 2006.

[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.

[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, San Francisco, Calif, USA, August 2001.

[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.

[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.

[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303–312, 2007.

[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.

[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases, vol. 2838, pp. 107–119.

[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, vol. 40, no. 1, pp. 185–197, 2010.

[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering (WCSE 2009), pp. 13–17, China, October 2009.

[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.

[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 324–331, USA, April 2009.

[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.

[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," Lecture Notes in Computer Science, vol. 6086, pp. 148–157, 2010.

[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, San Francisco, CA, USA, 1999.

[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983–990, Stanford, CA, USA, 2000.

[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257–264, San Jose, CA, USA.

[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.

[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.

[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, 2014.

[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation and Automation (ICCIA 2016), pp. 425–429, Iran, January 2016.

[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pp. 4836–4841, Hungary, 2017.

[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics (PCI 2008), pp. 217–221, Greece, August 2008.

[30] H. Y. Lo, J. C. Wang, H. M. Wang, and S. D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 2308–2311, Czech Republic, May 2011.

[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.

[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179–186, 1997.

[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.

[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49–64, 1996.

[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641–1650, 1996.

[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147–149, 1990.

[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.

[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, New York, NY, USA, August 2001.

[41] D. Michie, D. J. Spiegelhalter, C. C. Taylor, et al., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1995.

[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, no. 1, pp. 271–289, 1999.

[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.

[45] J. Alcalá-Fdez, A. Fernández, J. Luengo, et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.

[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959–977, 2009.

[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 3: Classifying Imbalanced Data Sets by a Novel RE-Sample and ...downloads.hindawi.com/journals/mpe/2018/5036710.pdf · Cost-Sensitive Stacked Generalization Method ... In this paper,

Mathematical Problems in Engineering 3

Table 1 Confusion matrix for performance evaluation

Positive prediction Negative predictionPositive class True positive (TP) False negative (FN)Negative class False positive (FP) True negative (TN)

problem Other metrics have been proposed to measurethe classification performance independently Based on theconfusion matrix (Table 1) these metrics are defined as

Precision = tp(tp + fp)

Recall = tp(tp + fn)

119865-Measure = (1 + 120573)2 lowast Recall lowast Precision1205732 lowast Recall + Precision

(1)

where 120573 is a coefficient to adjust the relative importance ofprecision versus recall (usually 120573 = 1)

FPrate = FP(FP + TN)

TPrate = TP(TP + FN)

(2)

The used combined evaluation metrics of these measuresinclude receiver operating characteristic (ROC) graphic [31]the area under the ROC curve (AUC) [2] average geometricmean of sensitivity and specificity (GeoMean) [32] (see (4))and average adjusted geometric mean (AGeoMean) [33] (see(5)) where 119873119899 refers to the proportion of the majoritysamples These metrics are defined as

AUC = (1 + TPrate minus FPrate)2 (3)

GeoMean = radicTPrate lowast TNrate (4)

AGeoMean

=

GeoMean + TNrate lowast 1198731198991 + 119873119899

TPrate gt 00 TPrate = 0

(5)

22 Stacking Bagging and AdaBoost are the most commonensemble learning algorithms In bagging method differentbase classifier models generate different classification resultsand the final decision is decided by majority voting InAdaBoost algorithm a series of base weak classifiers aretrained with the whole training set and the final decisionis generated by a weighted majority voting scheme In eachround of training iteration different weights are attributed toeach instance In the two algorithms the base classifiers arethe same

Stacking ensemble is another ensemble algorithm inwhich the prediction result of base classifiers is used as the

attribute to train the combination function in metalayer clas-sifier [25] The algorithm has a two-level structure consistingof Level 0 classifiers and Level 1 classifiers It was proposedby Wolpert and used by Breiman [34] and Leblanc andTibshirani [35]

Set a data set

119863 = (119883119899 119884119899) 119899 = 1 119873 (6)

where 119883119899 is a vector representing the attribute values ofthe instance 119899 and 119884119899 is the class value All 119873 instancesare randomly split into 119895 equivalent parts and 119895-fold cross-validation is used to train the model The prediction resultsof every time testing set gives a vector which includes 119899instances

1199101 1199102 1199101198991015840 (7)

Set 119872119895 is the difference base learning algorithm modelobtained with the data set 119863 119895 = 1 119896 and119872119895 is the partof Level 0 models Level 0 layer consists of 119895 base classifierswhich are employed to estimate the class probability of eachinstance

In119872119895 119895-fold cross-validation of training data set119863 givesthe prediction result

119910(119895)1 119910(119895)2 119910(119895)119899 119895 = 1 119896 (8)

All the119872119895 prediction results generate input space of Level1 model and the real value of instance is treated as the outputspace The model is expressed as follows

Input output

119910(1)1 119910(2)1 119910(119894)1 119910(119896)1 1199101119910(1)2 119910(2)2 119910(119894)2 119910(119896)2 1199102

119910(1)119899 119910(2)119899 119910(119894)119899 119910(119896)119899 119910119899

(9)

The above intermediate data are considered as the train-ing data of the Level 1 layer model The input data aretreated as features and the real value of instance is treatedas the output space The next step is to train the data withsome fused leaning algorithms The process is called Level 1generalizer and Level 1 model is denoted by which can beregarded as the function of (119910(1) 119910(2) 119910(119894) 119910(119896))

In the stacking process Level 0 (119872119895 119895 = 1 119896 ) iscombined with Level 1 () Given a new instance 1199091015840 models119872119895 produce a prediction result vector

119910(1) 119910(2) 119910(119894) 119910(119896) 119894 = 1 119896 (10)

Then model is used to combine base classifiers andpredict the final result of 1199091015840

In this paper for imbalance data we propose a stackedgeneralization based on cost-sensitive classification In Level0 model generalizer layer we firstly resample the imbal-ance data Resampling approaches include oversampling and

4 Mathematical Problems in Engineering

Training dataset

andtest dataset

Level 0 (base)-layer

resample(oversample) (undersample) (SMOTE)

Level 0 (base)-layerclassification middot middot middot

middot middot middot

middot middot middot

Ensemble spaceLevel 1 (meta)layerclassification

New data2New data3New data1

LMGJF1 LMGJF2 LMGJFk

M1M2 Mk

Decision space of M2 Decision space of Mk

MGN

Decision space of M1

u1 xiNi=1 u1 x

i

N

i=1( ) ( ) u2 xi

Ni=1 u2 x

i

N

i=1( ) ( ) uk xi

Ni=1 uk x

i

N

i=1( ) ( )

yiGN

N

i=1

D = xi yi

N

i=1( )

DNMN = xi

N

i=1( )

DGN = u1 xi yiNi=1 D

GN = u1 x

i yi

N

i=1 ( ) ( )

Figure 1 Flowchart of the RECSG architecture

undersampling Secondly Model119872119895 119895 = 1 119896 is trainedby the classification with new data and the prediction output119910(119895)119894 by 119895 = 1 119896 is produced with original data In Lever1 model generalizer layer model is generated by cost-sensitive classification based on logistic regression

3 Stacked Generalization forthe Imbalance Problem

The proposed RECSG architecture includes two layers Thefirst layer (Level 0) consisting of classifiers ensemble is calledthe base layer the second layer (Level 1) combined withbase classifiers is called the metalayer The flowchart of thearchitecture is shown in Figure 1

31 Level 0 Model Generalizers For the imbalance prob-lem Level 0 model generalizer step of RECSG includespreprocessing data and training the base classifier Firstlythe oversampling (SMOTE) method has been used in datapreprocessing of base classifier Secondly the base classifiermodel is trained with the new data set In this level weemployed three base algorithms Naıve Bayes (NB) [36]decision tree C45 [37] and 119896-nearest neighbors (119896-NN) [38]

(1) Naıve Bayes (NB) Given instance 119909 set 119875(119894 | 119909) is theposterior probability of class 119894 then

119875119894 (119909) =119875 (119894 | 119909)Σ119894119875 (119894 | 119909)

(11)

where NB uses an Laplacian estimate for estimating theconditional probabilities

Mathematical Problems in Engineering 5

Table 2 Cost matrix

Actual negative Actual positivePredict negative 119862 (0 0) = 11988800 119862 (0 1) = 11988801Predict positive 119862 (1 0) = 11988810 119862 (1 1) = 11988811

Table 3 Cost matrix of credit data sets

Actual bad Actual goodPredict bad 0 1Predict good 5 0

(2) Decision Tree (C45) Decision tree classification is amethod usually used in data mining [39] A decision tree isa tree where each input feature labels a nonleaf node oneclass or probability over the class labels each leaf node of thetree By recursive partitioning a tree can be ldquolearnedrdquo untilno prediction value is added or the subset has the same valueWhen the parameter of training data selects informationentropy the decision tree is named C45

(3) 119896-NN k-NN is a nonparametric algorithm used forregression or classification An instance is classified by amajority vote of its 119896-nearest neighbors If 119896 = 1 the instanceis simply classified in accordance with the class label of thenearest neighbor

All the above algorithms are simple and have low com-plexity so they are applicable weak base classifiers

32 Level 1 Model Generalizers Prediction results of severalbase classifiers in Level 0 layer are used as input space andtrue value class of instance is used as output space Based onLevel 0 layer Level 1 layer model is trained by other learningalgorithms For imbalanced data the cost-sensitive algorithmis used to train the Level 1 metalayer classifier in the paper

321 Cost-Sensitive Classifier Aiming at imbalanced datalearning problem in cost-sensitive classifier different costmatrices are used as the penalty ofmisclassified instance [40]For example a cost matrix 119862 has the following structure in abinary classification scenario in Table 2

In the cost matrix the row indicated alternative predictedclasser whereas the column indicates actual classes The costof false negative is notated as 11988801 and the cost of false positiveis notated as 11988810 Conceptually the cost of correctly classifiedinstances should always be less than the cost of incorrectlyclassified instances In the imbalance problem 11988810 is alwaysgreater than 11988801 For the German credit data set previouslyreported as the part of the Stalog project [41] the cost matrixis provided in Table 3

The cost of false good prediction is greater than the costof false bad prediction in the view of economical reason Soin Level 1 classifier of stacking for the imbalance problem thecost-sensitive algorithm is adopted

322 Logistic Regression Ting and Witten illustrated thatMLR (Multiresponse Linear Regression) algorithm had an

advantage over other Level 1 generalizers in stacking [42]In this paper the logistic regression classifier is used as themetalayer algorithm Based on the metalayer the cost of mis-classification is considered in the cost-sensitive algorithm Inlogistic regression classifier the prediction result of Level 0layer is used as the attribute of Level 1 metalayer and the realvalue of instance is used as output spaceThe linear regressionfor class 119894 is simply obtained as

LR119894 (119909) =119870

sum119895=1

120572119895119894119875119895119894 (119909) 119895 = 1 119896 (12)

Details of the implementation of RECSG are presentedbelow

33 Algorithm Pseudocode Pseudocode 1 presents the pro-posed RECSG approach Input parameters are two data setstraining set and testing set Output predicts class labelsof the test samples The first step in the process is datapreprocessing resample new instances and train model119872119895 (119895 = 1 119896) where 119895 is the number of base classifiersand 119906119895(119909119894) is the generated function of 119909119894 based on model119872119895(lines (1)-(2) in Pseudocode 1)

Then the metalayer model is constructed based onthe data 119863meta which is firstly predicted with Level 0 layer(line (3) in Pseudocode 1) Finally Lever 1 layer classification(cost-sensitive and logistic regression) is used to predict theultimate value of the tested samples

4 Empirical Investigation

The experiment aims to verify whether the RECSG approachcan improve the classification performance for the imbalanceproblem In this paper the RECSG approach was comparedwith other algorithms involving various combination ensem-ble techniques and imbalanced approaches For eachmethodthe same training set testing set and validation set were used

The experimental system has been implemented and iscomposed of 7 learning methods implemented in Weka [43]namely Naıve Bayes(1) (NB) C45(j48)(2) 119896-nearest neigh-bor (k-NN)(3) cost-sensitive(4) AdaBoost(5) Bagging(6) andstacking(7) The brief description the standard version andparameters of these methods are shown in Table 4

The RECSG approach includes 2 layers. The Level 1 layer (metalearning) system consists of 2 learning methods implemented in Weka, namely, simple logistic regression(9) and the cost-sensitive classifier(4); the simple logistic regression is the base classification algorithm of the cost-sensitive classifier. The Level 0 layer (base learning) of stacking is composed of 3 classifiers: Naïve Bayes(1), C4.5(2), and k-NN(3). The evaluation metrics of algorithm performance include AUC, GeoMean, and AGeoMean.
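For reference, these metrics can be computed from a confusion matrix as below. This is our sketch, not code from the paper, and the adjusted G-mean follows the definition of Batuwita and Palade [33] (weighting specificity by the negative-class proportion), which we state here as an assumption.

```python
import numpy as np

def gmean_metrics(tp, fn, tn, fp):
    """Sensitivity/specificity-based metrics used in the experiments."""
    sensitivity = tp / (tp + fn)          # true positive rate on the minority class
    specificity = tn / (tn + fp)          # true negative rate on the majority class
    gm = np.sqrt(sensitivity * specificity)
    nn = (tn + fp) / (tp + fn + tn + fp)  # proportion of negative (majority) samples
    # Adjusted G-mean per [33]: rewards specificity in proportion to the class skew.
    agm = (gm + specificity * nn) / (1 + nn) if sensitivity > 0 else 0.0
    return gm, agm

# Counts shaped like the yeast-0-3-5-9 vs 7-8 data set (50 positive, 456 negative).
print(gmean_metrics(tp=40, fn=10, tn=400, fp=56))
```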

4.1. Experimental Settings. Experiments were implemented with 17 data sets from the UCI Machine Learning Repository [44]. These data sets cover various fields, with IR values ranging from 0.54 down to 0.0254, unique data set names, a varying number of samples (from 173 to 2338),


Input: Training set D = {(x_i, y_i)}_{i=1}^{N}; test data set D_test = {(x'_i)}_{i=1}^{N'}
Output: Predicted class labels of the test samples {y'_{i,meta}}_{i=1}^{N'}
For each j = 1, 2, …, K do
  (1) Resample the imbalanced data and generate j-fold cross-validation sets to obtain New data_j
  (2) Train the Level-0 (base)-layer classifier M_j and compute {u_j(x_i)}_{i=1}^{N} and {u_j(x'_i)}_{i=1}^{N'}
end
(3) Construct D_meta = {u_1(x_i), …, u_K(x_i), y_i}_{i=1}^{N} and D'_meta = {u_1(x'_i), …, u_K(x'_i)}_{i=1}^{N'}
(4) Based on the data D_meta, classification (cost-sensitive and logistic regression) is used to generate the Level-1 (meta)-layer model, which is then applied to D'_meta to predict {y'_{i,meta}}_{i=1}^{N'}

Pseudocode 1: Pseudocode of RE-sample and Cost-Sensitive Stacked Generalization.
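A minimal end-to-end sketch of this two-layer scheme is given below. It is our re-implementation in Python/scikit-learn rather than the authors' Weka setup: plain random oversampling stands in for SMOTE, and cost sensitivity is approximated through class weights on the Level-1 logistic regression; it also assumes class 1 is the minority class.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def recsg_fit_predict(X_train, y_train, X_test, cost_fn=3.0, seed=1):
    rng = np.random.default_rng(seed)

    # (1) Resample: duplicate minority (class 1) instances until classes balance.
    #     The paper uses SMOTE; duplication keeps this sketch dependency-free.
    pos = np.flatnonzero(y_train == 1)
    extra = rng.choice(pos, size=np.sum(y_train == 0) - pos.size, replace=True)
    Xb = np.vstack([X_train, X_train[extra]])
    yb = np.concatenate([y_train, y_train[extra]])

    # (2) Level-0 base classifiers: NB, a C4.5-like tree, and 1-NN.
    bases = [GaussianNB(),
             DecisionTreeClassifier(random_state=seed),
             KNeighborsClassifier(n_neighbors=1)]

    # (3) Meta features u_j(x): out-of-fold P(positive | x) on the training data,
    #     ordinary predictions of the refitted base models on the test data.
    Z_train = np.column_stack([
        cross_val_predict(m, Xb, yb, cv=10, method="predict_proba")[:, 1]
        for m in bases])
    Z_test = np.column_stack([
        m.fit(Xb, yb).predict_proba(X_test)[:, 1] for m in bases])

    # (4) Level-1: logistic regression with a heavier penalty on minority errors,
    #     a class-weight stand-in for the cost matrix of Table 2.
    meta = LogisticRegression(class_weight={0: 1.0, 1: cost_fn})
    meta.fit(Z_train, yb)
    return meta.predict(Z_test)
```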

Table 4: Base classifier methods and corresponding ensemble algorithms.

(1) Naïve Bayes (NB): weka.classifiers.bayes.NaiveBayes
(2) C4.5 (C4.5): weka.classifiers.trees.J48 -C 0.25 -M 2
(3) k-nearest neighbor (KNN): weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last"
(4) Cost-sensitive with Naïve Bayes (Cost-NB): weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(5) AdaBoost with Naïve Bayes (Ada-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(6) AdaBoost with cost-sensitive and Naïve Bayes (Ada-Cost-NB): weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.meta.CostSensitiveClassifier -- -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(7) Bagging with Naïve Bayes (Bag-NB): weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.bayes.NaiveBayes
(8) Bagging with cost-sensitive and Naïve Bayes (Bag-Cost-NB): weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.meta.CostSensitiveClassifier -- -cost-matrix "[1.0 3.0; 1.0 1.0]" -S 1 -W weka.classifiers.bayes.NaiveBayes
(9) Stacking with NB, KNN, C4.5, and logistic (Stacking-log): weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4" -S 1 -num-slots 1 -B weka.classifiers.bayes.NaiveBayes -B "weka.classifiers.lazy.IBk -K 1 -W 0 -A weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last" -B "weka.classifiers.trees.J48 -C 0.25 -M 2"
(10) Stacking cost-sensitive with NB, KNN, C4.5, and logistic (Stacking-Cost-log): weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.meta.CostSensitiveClassifier -cost-matrix [1.0 3.0; 1.0 1.0] -S 1 -W weka.classifiers.functions.Logistic -- -R 1.0E-8 -M -1 -num-decimal-places 4" -S 1 -num-slots 1 -B weka.classifiers.bayes.NaiveBayes -B "weka.classifiers.lazy.IBk -K 1 -W 0 -A weka.core.neighboursearch.LinearNNSearch -A weka.core.EuclideanDistance -R first-last" -B "weka.classifiers.trees.J48 -C 0.25 -M 2"


Table 5: Summary of imbalanced data sets.

Data set | Sam (Min, Maj) | IR | Fea
wisconsin | 683 (239, 444) | 0.54 | 9
pima | 768 (268, 500) | 0.53 | 8
haberman | 306 (81, 225) | 0.36 | 3
Vehicle1 | 846 (217, 629) | 0.345 | 18
Vehicle0 | 846 (199, 647) | 0.307 | 18
segment0 | 2308 (329, 1979) | 0.17 | 19
yeast-0-3-5-9 vs 7-8 | 506 (50, 456) | 0.11 | 8
yeast-0-2-5-6 vs 3-7-8-9 | 1004 (99, 905) | 0.11 | 8
vowel0 | 988 (90, 898) | 0.10 | 13
led7digit-0-2-4-5-6-7-8-9 vs 1 | 443 (37, 406) | 0.09 | 7
Glass2 | 214 (17, 197) | 0.08 | 10
cleveland-0 vs 4 | 173 (13, 160) | 0.08 | 13
glass-0-1-6 vs 5 | 184 (9, 175) | 0.05 | 9
car-good | 1728 (69, 1659) | 0.04 | 6
flare-F | 1066 (43, 1023) | 0.04 | 11
car-vgood | 1728 (65, 1663) | 0.039 | 6
abalone-17 vs 7-8-9-10 | 2338 (58, 2280) | 0.0254 | 9

and variations in the amount of class overlap (see the KEEL repository [45]). Multiclass data sets were modified to obtain two-class imbalanced problems, so that the union of one or more classes became the positive class and the union of one or more of the remaining classes was labeled as the negative class. A brief description of the data sets used is presented in Table 5. It includes the total number of instances (Sam), the number of instances of each class (Min, Maj), the imbalance ratio (IR, the ratio of the number of minority class instances to majority class instances), and the number of features (Fea).
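The IR values in Table 5 follow directly from the class counts; a quick check on three rows:

```python
# (minority, majority) counts transcribed from Table 5.
counts = {"wisconsin": (239, 444),
          "haberman": (81, 225),
          "abalone-17 vs 7-8-9-10": (58, 2280)}

for name, (minority, majority) in counts.items():
    print(f"{name}: IR = {minority / majority:.4f}")
# wisconsin: IR = 0.5383; haberman: IR = 0.3600; abalone-17 vs 7-8-9-10: IR = 0.0254
```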

Our system was compared with other ensemble algorithms, including AdaBoost with Naïve Bayes, AdaBoost with cost-sensitive, bagging with Naïve Bayes, bagging with cost-sensitive, and stacking cost-sensitive with NB, k-NN, C4.5, and logistic regression. All the experiments were performed with 10-fold cross-validation.

4.2. Experimental Results. Tables 6, 7, and 8, respectively, show the results for the three metrics (AUC, GeoMean, and AGeoMean) for the algorithms in Table 4, obtained with the data sets in Table 5. The best result on each data set is emphasized in these tables.

The results showed that the performance of the proposed RECSG method was the best for 12 of 17 data sets in terms of GeoMean and AGeoMean and for 10 of 17 data sets in terms of AUC. Some methods are better than others on some evaluation metrics, but no method dominates on all metrics and most data sets.

In Table 6 (AUC), the performance of the RECSG method is better than that of the 3 single-base classification methods for 14 of 17 data sets and better than that of the other 5 ensemble algorithms for 13 of 17 data sets. In Table 7 (GeoMean), the RECSG method outperforms all 3 single-base classification methods on 15 of 17 data sets and outperforms the other 5 ensemble algorithms on 15 of 17 data sets. In Table 8 (AGeoMean), the RECSG method outperforms all 3 single-base classification methods on 14 of 17 data sets and outperforms the other 5 ensemble algorithms on 16 of 17 data sets. Over the 17 data sets, the improvements on GeoMean and AGeoMean are larger than those on AUC.

4.3. Statistical Tests. Statistical tests [46] are adopted to compare the different algorithms. In this paper, we use two types of comparisons: pairwise comparison (between a pair of algorithms) and multiple comparison (among a group of algorithms).

4.3.1. Pair of Algorithms. We performed T-tests to explore whether the RECSG approach is significantly better than the other algorithms on the three metrics (AUC, GeoMean, and AGeoMean). Table 9 shows the results of the RECSG approach compared with the other methods in terms of AUC, GeoMean, and AGeoMean. The values in square brackets indicate the number of data sets with a statistically significant difference in the T-test performed at the confidence level α = 0.05 for the corresponding evaluation metric. For the AUC metric, for example, RECSG outperformed NB on 11 of 17 data sets, and 5 of those data sets showed a statistically significant difference at the confidence level α = 0.05.
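A paired test over the 17 per-dataset scores can be run as below; this is our sketch (the paper's exact per-fold protocol may differ), with the AUC values transcribed from the NB and Stacking-Cost-log columns of Table 6.

```python
import numpy as np
from scipy import stats

# AUC per data set, from Table 6: RECSG (Stacking-Cost-log) vs. NB.
recsg = np.array([0.982, 0.819, 0.630, 0.782, 0.977, 0.999, 0.773, 0.820, 1.000,
                  0.904, 0.721, 0.976, 0.990, 0.977, 0.806, 0.998, 0.726])
nb    = np.array([0.986, 0.819, 0.640, 0.711, 0.814, 0.987, 0.750, 0.832, 0.980,
                  0.954, 0.716, 0.909, 0.934, 0.976, 0.908, 0.997, 0.775])

t, p = stats.ttest_rel(recsg, nb)   # paired test: same data sets, two methods
print(f"t = {t:.3f}, p = {p:.3f}, significant at alpha = 0.05: {p < 0.05}")
```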

4.3.2. Multiple Comparisons. We performed the Holm post hoc test [47] on multiple groups for the three metrics (AUC, GeoMean, and AGeoMean). Post hoc procedures determine whether a comparison hypothesis should be rejected at a specified confidence level α. The statistical experiments were performed on the platform at http://tec.citius.usc.es/stac. Table 10 reports whether the RECSG approach is significantly different, at the confidence level α = 0.05, in terms of AUC, GeoMean, and AGeoMean; H0 indicates that there is no significant difference. According


Table 6: Results of other methods in terms of AUC.

Data set | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.986 | 0.957 | 0.953 | 0.989 | 0.981 | 0.982 | 0.99 | 0.989 | 0.982 | 0.982
pima | 0.819 | 0.751 | 0.662 | 0.819 | 0.804 | 0.808 | 0.817 | 0.819 | 0.819 | 0.819
haberman | 0.64 | 0.615 | 0.53 | 0.64 | 0.667 | 0.625 | 0.638 | 0.647 | 0.636 | 0.63
Vehicle1 | 0.711 | 0.722 | 0.671 | 0.711 | 0.769 | 0.723 | 0.715 | 0.712 | 0.782 | 0.782
Vehicle0 | 0.814 | 0.93 | 0.92 | 0.814 | 0.784 | 0.781 | 0.808 | 0.808 | 0.977 | 0.977
segment0 | 0.987 | 0.971 | 0.996 | 0.987 | 0.944 | 0.959 | 0.987 | 0.986 | 0.999 | 0.999
yeast-0-3-5-9 vs 7-8 | 0.75 | 0.675 | 0.653 | 0.75 | 0.751 | 0.686 | 0.75 | 0.754 | 0.769 | 0.773
yeast-0-2-5-6 vs 3-7-8-9 | 0.832 | 0.687 | 0.782 | 0.832 | 0.79 | 0.808 | 0.84 | 0.842 | 0.847 | 0.82
vowel0 | 0.98 | 0.974 | 1 | 0.98 | 0.996 | 0.994 | 0.981 | 0.98 | 1 | 1
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.954 | 0.925 | 0.827 | 0.954 | 0.898 | 0.893 | 0.953 | 0.949 | 0.909 | 0.904
Glass2 | 0.716 | 0.812 | 0.587 | 0.716 | 0.737 | 0.735 | 0.748 | 0.737 | 0.713 | 0.721
cleveland-0 vs 4 | 0.909 | 0.824 | 0.754 | 0.909 | 0.826 | 0.891 | 0.915 | 0.913 | 0.979 | 0.976
glass-0-1-6 vs 5 | 0.934 | 0.989 | 0.825 | 0.934 | 0.975 | 0.905 | 0.983 | 0.985 | 0.9 | 0.99
car-good | 0.976 | 0.493 | 0.648 | 0.976 | 0.968 | 0.971 | 0.976 | 0.976 | 0.653 | 0.977
flare-F | 0.908 | 0.525 | 0.59 | 0.911 | 0.876 | 0.893 | 0.908 | 0.912 | 0.825 | 0.806
car-vgood | 0.997 | 0.992 | 0.69 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.998 | 0.998
abalone-17 vs 7-8-9-10 | 0.775 | 0.613 | 0.624 | 0.775 | 0.817 | 0.73 | 0.773 | 0.774 | 0.752 | 0.726
Mean | 0.864 | 0.791 | 0.748 | 0.864 | 0.858 | 0.846 | 0.869 | 0.869 | 0.855 | 0.875


Table 7: Results of other methods in terms of GeoMean.

Data set | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.965 | 0.954 | 0.953 | 0.968 | 0.959 | 0.967 | 0.968 | 0.966 | 0.964 | 0.970
pima | 0.719 | 0.697 | 0.649 | 0.736 | 0.715 | 0.739 | 0.723 | 0.731 | 0.703 | 0.741
haberman | 0.419 | 0.566 | 0.455 | 0.586 | 0.466 | 0.573 | 0.434 | 0.594 | 0.361 | 0.605
Vehicle1 | 0.677 | 0.622 | 0.652 | 0.674 | 0.607 | 0.677 | 0.673 | 0.661 | 0.561 | 0.712
Vehicle0 | 0.724 | 0.912 | 0.919 | 0.726 | 0.735 | 0.733 | 0.719 | 0.733 | 0.894 | 0.946
segment0 | 0.897 | 0.977 | 0.995 | 0.892 | 0.897 | 0.906 | 0.902 | 0.895 | 0.994 | 0.993
yeast-0-3-5-9 vs 7-8 | 0.467 | 0.463 | 0.609 | 0.464 | 0.466 | 0.550 | 0.445 | 0.485 | 0.444 | 0.603
yeast-0-2-5-6 vs 3-7-8-9 | 0.762 | 0.590 | 0.762 | 0.793 | 0.762 | 0.792 | 0.762 | 0.787 | 0.706 | 0.798
vowel0 | 0.920 | 0.969 | 1.000 | 0.913 | 0.944 | 0.929 | 0.921 | 0.914 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.853 | 0.874 | 0.818 | 0.869 | 0.863 | 0.870 | 0.845 | 0.850 | 0.846 | 0.874
Glass2 | 0.578 | 0.472 | 0.470 | 0.564 | 0.578 | 0.610 | 0.586 | 0.568 | 0.000 | 0.607
cleveland-0 vs 4 | 0.899 | 0.724 | 0.722 | 0.936 | 0.611 | 0.863 | 0.866 | 0.861 | 0.866 | 0.866
glass-0-1-6 vs 5 | 0.932 | 0.877 | 0.810 | 0.932 | 0.932 | 0.872 | 0.872 | 0.869 | 0.814 | 0.877
car-good | 0.000 | 0.000 | 0.558 | 0.794 | 0.520 | 0.783 | 0.000 | 0.778 | 0.561 | 0.813
flare-F | 0.751 | 0.000 | 0.450 | 0.781 | 0.541 | 0.647 | 0.738 | 0.781 | 0.000 | 0.774
car-vgood | 0.496 | 0.888 | 0.626 | 0.986 | 0.917 | 0.925 | 0.481 | 0.971 | 0.957 | 0.988
abalone-17 vs 7-8-9-10 | 0.628 | 0.321 | 0.506 | 0.640 | 0.319 | 0.368 | 0.627 | 0.630 | 0.293 | 0.664
Mean | 0.687 | 0.641 | 0.703 | 0.779 | 0.696 | 0.753 | 0.680 | 0.769 | 0.645 | 0.814


Table 8: Results of other methods in terms of AGeoMean.

Data set | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.961 | 0.960 | 0.961 | 0.962 | 0.963 | 0.962 | 0.962 | 0.960 | 0.967 | 0.967
pima | 0.768 | 0.743 | 0.706 | 0.734 | 0.767 | 0.736 | 0.775 | 0.729 | 0.770 | 0.726
haberman | 0.642 | 0.680 | 0.601 | 0.718 | 0.660 | 0.694 | 0.653 | 0.721 | 0.613 | 0.663
Vehicle1 | 0.690 | 0.721 | 0.728 | 0.658 | 0.722 | 0.669 | 0.690 | 0.654 | 0.709 | 0.740
Vehicle0 | 0.666 | 0.929 | 0.934 | 0.658 | 0.681 | 0.663 | 0.664 | 0.663 | 0.924 | 0.936
segment0 | 0.859 | 0.986 | 0.996 | 0.850 | 0.859 | 0.871 | 0.863 | 0.854 | 0.995 | 0.995
yeast-0-3-5-9 vs 7-8 | 0.715 | 0.706 | 0.782 | 0.709 | 0.713 | 0.737 | 0.703 | 0.721 | 0.700 | 0.770
yeast-0-2-5-6 vs 3-7-8-9 | 0.856 | 0.778 | 0.855 | 0.857 | 0.856 | 0.855 | 0.856 | 0.854 | 0.839 | 0.866
vowel0 | 0.929 | 0.981 | 1.000 | 0.909 | 0.966 | 0.948 | 0.931 | 0.910 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.853 | 0.874 | 0.818 | 0.869 | 0.863 | 0.870 | 0.845 | 0.850 | 0.846 | 0.874
Glass2 | 0.511 | 0.701 | 0.695 | 0.493 | 0.511 | 0.551 | 0.539 | 0.515 | 0.000 | 0.744
cleveland-0 vs 4 | 0.927 | 0.845 | 0.841 | 0.943 | 0.783 | 0.914 | 0.918 | 0.910 | 0.918 | 0.918
glass-0-1-6 vs 5 | 0.954 | 0.932 | 0.894 | 0.954 | 0.954 | 0.923 | 0.923 | 0.919 | 0.902 | 0.932
car-good | 0.000 | 0.000 | 0.763 | 0.878 | 0.747 | 0.870 | 0.000 | 0.873 | 0.769 | 0.890
flare-F | 0.840 | 0.000 | 0.705 | 0.842 | 0.751 | 0.795 | 0.836 | 0.842 | 0.000 | 0.845
car-vgood | 0.743 | 0.934 | 0.793 | 0.986 | 0.953 | 0.956 | 0.725 | 0.980 | 0.974 | 0.990
abalone-17 vs 7-8-9-10 | 0.736 | 0.655 | 0.745 | 0.716 | 0.650 | 0.672 | 0.734 | 0.711 | 0.641 | 0.804
Mean | 0.744 | 0.731 | 0.813 | 0.808 | 0.788 | 0.805 | 0.742 | 0.8039 | 0.739 | 0.862


Table 9: Comparison between RECSG and the other algorithms in terms of wins/losses (values in square brackets are the numbers of statistically significant differences).

Stacking-Cost-log (AUC):
NB: 11 [5] / 6 [3]; C4.5: 15 [11] / 2 [2]; KNN: 17 [15] / 0 [2]; Cost-NB: 11 [6] / 6 [5]; Ada-NB: 13 [7] / 4 [2]; Ada-Cost-NB: 14 [9] / 3 [1]; Bag-NB: 10 [6] / 7 [5]; Bag-Cost-NB: 10 [6] / 7 [5]; Stacking-log: 11 [3] / 6 [2]

Stacking-Cost-log (GeoMean):
NB: 15 [13] / 2 [2]; C4.5: 17 [13] / 0 [0]; KNN: 15 [13] / 2 [1]; Cost-NB: 14 [8] / 3 [2]; Ada-NB: 16 [12] / 1 [1]; Ada-Cost-NB: 16 [10] / 1 [0]; Bag-NB: 17 [13] / 0 [0]; Bag-Cost-NB: 16 [10] / 3 [0]; Stacking-log: 16 [12] / 1 [0]

Stacking-Cost-log (AGeoMean):
NB: 14 [10] / 3 [1]; C4.5: 15 [9] / 2 [2]; KNN: 15 [12] / 2 [1]; Cost-NB: 13 [7] / 4 [3]; Ada-NB: 15 [11] / 2 [2]; Ada-Cost-NB: 15 [10] / 2 [2]; Bag-NB: 16 [10] / 1 [1]; Bag-Cost-NB: 15 [8] / 2 [1]; Stacking-log: 16 [10] / 1 [1]

Table 10: Holm post hoc test showing differences between Stacking-Cost-log and the other algorithms.

                                                      Rejected (H0) | Accepted (H0)
Stacking-Cost-log (AUC) versus other algorithms           3         |      6
Stacking-Cost-log (GeoMean) versus other algorithms       8         |      1
Stacking-Cost-log (AGeoMean) versus other algorithms      9         |      0

to the GeoMean metric, for example, RECSG is significantly better than 8 of the other 9 methods at the confidence level α = 0.05.
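The Holm step-down procedure itself is simple to reproduce: sort the p-values and compare the i-th smallest against α/(m − i + 1), stopping at the first acceptance. A sketch with hypothetical p-values:

```python
def holm(p_values, alpha=0.05):
    """Return which null hypotheses are rejected under Holm's step-down test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    rejected = [False] * m
    for rank, k in enumerate(order):
        if p_values[k] <= alpha / (m - rank):  # thresholds alpha/m, alpha/(m-1), ...
            rejected[k] = True
        else:
            break                              # once one is accepted, stop rejecting
    return rejected

# Nine pairwise comparisons against RECSG (hypothetical p-values).
print(holm([0.001, 0.004, 0.019, 0.03, 0.002, 0.008, 0.012, 0.025, 0.21]))
```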

4.4. Discussion. According to the experiments performed with 17 different imbalanced data sets and the comparison with 9 other classification methods, the proposed RECSG method showed higher performance than the other methods in terms of AUC, GeoMean, and AGeoMean, as illustrated in Tables 6, 7, and 8. The RECSG method showed the best performance for 10 of 17 cases in terms of AUC and for 12 of 17 cases in terms of GeoMean and AGeoMean. The method outperformed the other 5 ensemble algorithms on 16 of 17 data sets in terms of GeoMean and AGeoMean and on 13 of 17 data sets in terms of AUC; across the 17 data sets, its advantage is larger on GeoMean and AGeoMean than on AUC. The means of the RECSG method in terms of GeoMean, AGeoMean, and AUC are all higher than those of the other methods. The Holm post hoc test shows that the RECSG method significantly outperforms the rest at the confidence level α = 0.05 for 9 of 9 methods in terms of AGeoMean, for 8 of 9 methods in terms of GeoMean, and for 3 of 9 methods in terms of AUC (Table 10).

Experimental results and statistical tests show that the RECSG approach improves the classification performance on imbalanced data sets. The reasons can be explained as follows.

Firstly, the stacking algorithm uses a learning algorithm to combine the results of the base classifiers, whereas bagging employs a majority vote. Bagging is only a simple decision combination method, which requires neither cross-validation nor Level 1 learning. In this paper, stacked generalization adopts logistic regression, thus providing the simplest linear combination for pooling the Level 0 models' confidences.

Secondly, the cost-sensitive algorithm can affect imbalanced data sets in two aspects. First, cost-sensitive modifications can be applied in the probabilistic estimation. Moreover, cost-sensitive factors change the weights of instances, and the weight of the minority class is higher than that of the majority class. The method therefore directly changes the influence of the common instances without discarding or duplicating any of the rare instances.

Thirdly, the RECSG method introduces cost-sensitive learning into the stacking ensemble. The cost-sensitive function replaces the error-minimizing function in Level 1 learning, thus making the Level 1 learner prone to focus on the minority class.
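Concretely, the swap amounts to weighting each instance's loss term by its misclassification cost; a minimal illustration in our notation (the cost factors are hypothetical):

```python
import numpy as np

def weighted_log_loss(y, p, c_fn=3.0, c_fp=1.0):
    """Cost-weighted log-loss: errors on positives count c_fn times more."""
    w = np.where(y == 1, c_fn, c_fp)
    return float(np.mean(-w * (y * np.log(p) + (1 - y) * np.log(1 - p))))

y = np.array([1, 0, 0, 1])
p = np.array([0.6, 0.2, 0.1, 0.4])
print(weighted_log_loss(y, p))  # missing the minority positives is penalized 3x
```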

The results in Tables 7 and 8 demonstrate that the RECSG method offers a larger improvement when the evaluation performance of the base classifiers is weaker (such as on Vehicle1, Vehicle0, and car-vgood). The reason is that the alternative methods NB, C4.5, and k-NN have shortcomings: the independence assumption hampers the performance of NB on some data sets; in C4.5 tree construction, the selection of attributes affects the model performance; and the error rate of k-NN is relatively high when the data sets are imbalanced. Therefore, the logistic regression adopted in the metalayer can improve the performance when the base classifiers are weaker.

The performance of the RECSG method is generally better when the IR is low (such as on Glass2, car-good, flare-F, car-vgood, and abalone-17 vs 7-8-9-10). This behavior is probably related to the setting of the cost matrix: different data sets should use different cost matrices. For simplicity, we adopted the same cost matrix throughout this paper, which may be more suitable for low IR values.

5. Conclusions

In this paper, in order to solve the class imbalance problem, we proposed the RECSG method based on 2-layer learning models. Experimental results and statistical tests showed that the RECSG approach improved the classification performance. The proposed RECSG approach may have relatively high computational complexity in the training stage, because it involves 2-layer classifier models which consist of several base classifiers and metaclassifiers. The number and kinds of base-level classifiers are closely related to the performance of the stacking algorithm, and in the fusion stage of the base classifiers, the selection of the metaclassifier is also important. In this paper, in order to validate the performance improvement over other current classification algorithms, we only selected 3 classification algorithms (NB, k-NN, and C4.5) as base classifiers and the cost-sensitive algorithm as the metaclassifier in the fusion stage; the selection of the number or kind of base classifiers and metaclassifier was not discussed. Therefore, we will explore the diversity and quality of base classifiers in the future. The adoption of these strategies would improve the prediction performance and reduce the training time of the stacking algorithm for imbalance problems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is supported by the Natural Science Foundation of Shanxi Province under Grant no. 2015011039.

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179–192, 2016.
[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.
[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[5] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965–969, IEEE, Hong Kong, December 2006.
[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.
[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, San Francisco, Calif, USA, August 2001.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.
[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303–312, 2007.
[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.
[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases, vol. 2838, pp. 107–119.
[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering (WCSE 2009), pp. 13–17, China, October 2009.
[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.
[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 324–331, USA, April 2009.
[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.
[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," Lecture Notes in Computer Science, vol. 6086, pp. 148–157, 2010.
[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, San Francisco, CA, USA, 1999.
[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983–990, Stanford, CA, USA, 2000.
[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257–264, San Jose, CA, USA, 2001.
[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.
[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, 2014.
[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation, and Automation (ICCIA 2016), pp. 425–429, Iran, January 2016.
[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pp. 4836–4841, Hungary, 2017.
[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics (PCI 2008), pp. 217–221, Greece, August 2008.
[30] H.-Y. Lo, J.-C. Wang, H.-M. Wang, and S.-D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2011), pp. 2308–2311, Czech Republic, May 2011.
[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179–186, 1997.
[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.
[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49–64, 1996.
[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641–1650, 1996.
[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147–149, 1990.
[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, August 2001.
[41] D. Michie, D. Spiegelhalter, J. C. Taylor, et al., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1995.
[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, pp. 271–289, 1999.
[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.
[45] J. Alcalá-Fdez, A. Fernández, J. Luengo, et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959–977, 2009.
[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.


Page 4: Classifying Imbalanced Data Sets by a Novel RE-Sample and ...downloads.hindawi.com/journals/mpe/2018/5036710.pdf · Cost-Sensitive Stacked Generalization Method ... In this paper,

4 Mathematical Problems in Engineering

Training dataset

andtest dataset

Level 0 (base)-layer

resample(oversample) (undersample) (SMOTE)

Level 0 (base)-layerclassification middot middot middot

middot middot middot

middot middot middot

Ensemble spaceLevel 1 (meta)layerclassification

New data2New data3New data1

LMGJF1 LMGJF2 LMGJFk

M1M2 Mk

Decision space of M2 Decision space of Mk

MGN

Decision space of M1

u1 xiNi=1 u1 x

i

N

i=1( ) ( ) u2 xi

Ni=1 u2 x

i

N

i=1( ) ( ) uk xi

Ni=1 uk x

i

N

i=1( ) ( )

yiGN

N

i=1

D = xi yi

N

i=1( )

DNMN = xi

N

i=1( )

DGN = u1 xi yiNi=1 D

GN = u1 x

i yi

N

i=1 ( ) ( )

Figure 1 Flowchart of the RECSG architecture

undersampling Secondly Model119872119895 119895 = 1 119896 is trainedby the classification with new data and the prediction output119910(119895)119894 by 119895 = 1 119896 is produced with original data In Lever1 model generalizer layer model is generated by cost-sensitive classification based on logistic regression

3 Stacked Generalization forthe Imbalance Problem

The proposed RECSG architecture includes two layers Thefirst layer (Level 0) consisting of classifiers ensemble is calledthe base layer the second layer (Level 1) combined withbase classifiers is called the metalayer The flowchart of thearchitecture is shown in Figure 1

31 Level 0 Model Generalizers For the imbalance prob-lem Level 0 model generalizer step of RECSG includespreprocessing data and training the base classifier Firstlythe oversampling (SMOTE) method has been used in datapreprocessing of base classifier Secondly the base classifiermodel is trained with the new data set In this level weemployed three base algorithms Naıve Bayes (NB) [36]decision tree C45 [37] and 119896-nearest neighbors (119896-NN) [38]

(1) Naıve Bayes (NB) Given instance 119909 set 119875(119894 | 119909) is theposterior probability of class 119894 then

119875119894 (119909) =119875 (119894 | 119909)Σ119894119875 (119894 | 119909)

(11)

where NB uses an Laplacian estimate for estimating theconditional probabilities

Mathematical Problems in Engineering 5

Table 2 Cost matrix

Actual negative Actual positivePredict negative 119862 (0 0) = 11988800 119862 (0 1) = 11988801Predict positive 119862 (1 0) = 11988810 119862 (1 1) = 11988811

Table 3 Cost matrix of credit data sets

Actual bad Actual goodPredict bad 0 1Predict good 5 0

(2) Decision Tree (C45) Decision tree classification is amethod usually used in data mining [39] A decision tree isa tree where each input feature labels a nonleaf node oneclass or probability over the class labels each leaf node of thetree By recursive partitioning a tree can be ldquolearnedrdquo untilno prediction value is added or the subset has the same valueWhen the parameter of training data selects informationentropy the decision tree is named C45

(3) 119896-NN k-NN is a nonparametric algorithm used forregression or classification An instance is classified by amajority vote of its 119896-nearest neighbors If 119896 = 1 the instanceis simply classified in accordance with the class label of thenearest neighbor

All the above algorithms are simple and have low com-plexity so they are applicable weak base classifiers

32 Level 1 Model Generalizers Prediction results of severalbase classifiers in Level 0 layer are used as input space andtrue value class of instance is used as output space Based onLevel 0 layer Level 1 layer model is trained by other learningalgorithms For imbalanced data the cost-sensitive algorithmis used to train the Level 1 metalayer classifier in the paper

321 Cost-Sensitive Classifier Aiming at imbalanced datalearning problem in cost-sensitive classifier different costmatrices are used as the penalty ofmisclassified instance [40]For example a cost matrix 119862 has the following structure in abinary classification scenario in Table 2

In the cost matrix the row indicated alternative predictedclasser whereas the column indicates actual classes The costof false negative is notated as 11988801 and the cost of false positiveis notated as 11988810 Conceptually the cost of correctly classifiedinstances should always be less than the cost of incorrectlyclassified instances In the imbalance problem 11988810 is alwaysgreater than 11988801 For the German credit data set previouslyreported as the part of the Stalog project [41] the cost matrixis provided in Table 3

The cost of false good prediction is greater than the costof false bad prediction in the view of economical reason Soin Level 1 classifier of stacking for the imbalance problem thecost-sensitive algorithm is adopted

322 Logistic Regression Ting and Witten illustrated thatMLR (Multiresponse Linear Regression) algorithm had an

advantage over other Level 1 generalizers in stacking [42]In this paper the logistic regression classifier is used as themetalayer algorithm Based on the metalayer the cost of mis-classification is considered in the cost-sensitive algorithm Inlogistic regression classifier the prediction result of Level 0layer is used as the attribute of Level 1 metalayer and the realvalue of instance is used as output spaceThe linear regressionfor class 119894 is simply obtained as

LR119894 (119909) =119870

sum119895=1

120572119895119894119875119895119894 (119909) 119895 = 1 119896 (12)

Details of the implementation of RECSG are presentedbelow

33 Algorithm Pseudocode Pseudocode 1 presents the pro-posed RECSG approach Input parameters are two data setstraining set and testing set Output predicts class labelsof the test samples The first step in the process is datapreprocessing resample new instances and train model119872119895 (119895 = 1 119896) where 119895 is the number of base classifiersand 119906119895(119909119894) is the generated function of 119909119894 based on model119872119895(lines (1)-(2) in Pseudocode 1)

Then the metalayer model is constructed based onthe data 119863meta which is firstly predicted with Level 0 layer(line (3) in Pseudocode 1) Finally Lever 1 layer classification(cost-sensitive and logistic regression) is used to predict theultimate value of the tested samples

4 Empirical Investigation

The experiment aims to verify whether the RECSG approachcan improve the classification performance for the imbalanceproblem In this paper the RECSG approach was comparedwith other algorithms involving various combination ensem-ble techniques and imbalanced approaches For eachmethodthe same training set testing set and validation set were used

The experimental system has been implemented and iscomposed of 7 learning methods implemented in Weka [43]namely Naıve Bayes(1) (NB) C45(j48)(2) 119896-nearest neigh-bor (k-NN)(3) cost-sensitive(4) AdaBoost(5) Bagging(6) andstacking(7) The brief description the standard version andparameters of these methods are shown in Table 4

The RECSG approach includes 2 layers Level 1 layer(metalearning) system consists of 2 learning methods imple-mented inWeka namely simple logistic regression(9) (MLP)and cost-sensitive classifier(4) MLP is the base classificationalgorithm of cost-sensitive classifier Level 0 layer (base -learning) of stacking is composed of 3 classifiers NaıveBayes(1) C45(2) and 119896-NN(3) Evaluation metrics of algo-rithmperformance includeAUCGeoMean andAGeoMean

41 Experimental Settings Experiments were implementedwith 17 data sets from the UCI Machine Learning Repository[44] These data sets cover various fields and are based onIR measure values (from 054 to 0014) unique data setnames a varying number of samples (from 173 to 2338)

6 Mathematical Problems in Engineering

Input Training set119863 = (119909119894 119910119894)119873119894=1 Test dataset119863test = (1199091015840119894 )1198731015840

119894=1

Output Predict class labels of the test samples 1199101015840119894meta1198731015840

119894=1

For each119895 = 1 2 119870 do(1) Resample imbalance data and generate 119895ndashfold cross-validation sets to obtain New data119895(2) Train and compute 119906119895(119909119894)119873119894=1 and 119906119895(1199091015840119894 )119873

1015840

119894=1 in Level-0 (base)-layer classifier119872119895end(3) Construct119863meta = 1199061(119909119894) 119910119894119873

1015840

119894=1 and1198631015840meta = 1199061(1199091015840119894 ) 1199101198941198731015840

119894=1

(4) Based on the data119863meta classification (cost-sensitive and Logistic Regression) is used togenerate Level-1 (meta)-layer model through with1198631015840meta to predict 1199101015840119894meta119873

1015840

119894=1

Pseudocode 1 Pseudocodes of RE-sample and Cost-Sensitive Stacked Generalization

Table 4 Base classifier methods and corresponding ensemble algorithms

Methods Descriptions and parameters Abbreviations(1) Naıve Bayes wekaclassifiersbayesNaiveBayes NB(2) C45 wekaclassifierstreesJ48 -C 025 -M 2 C45

(3) 119896-nearest neighborwekaclassifierslazyIBk -K 1 -W 0 -A

wekacoreneighboursearchLinearNNSearch -AwekacoreEuclideanDistance -R first-last

KNN

(4) Cost-sensitive with Naıve Bayes wekaclassifiersmetaAdaBoostM1 -P 100 -S 1 -I 10-W wekaclassifiersbayesNaiveBayes Cost-NB

(5) AdaBoost with Naıve Bayes wekaclassifiersmetaAdaBoostM1 -P 100 -S 1 -I 10-W wekaclassifiersbayesNaiveBayes Ada-NB

(6) AdaBoost with cost-sensitive and Naıve Bayes

wekaclassifiersmetaAdaBoostM1 -P 100 -S 1-I10-W wekaClassifiersmetaCostSensitiveClassifier-- -cost-matrix [10 30 10 10] -S 1 -W weka

classifiersbayesNaiveBayes

Ada-Cost-NB

(7) Bagging with Naıve BayeswekaclassifiersmetaBagging -P 100 -S 1

-num-slots 1 -I 10 -WwekaclassifiersbayesNaiveBayes

Bag-NB

(8) Bagging with cost-sensitive and Naıve Bayes

wekaclassifiersmetaBagging -P 100 -S 1-num-slots 1 -I 10 -W

wekaclassifiersmetaCostSensitive Classifier ---cost-matrix [10 30 10 10] -S 1 -W

wekaclassifiersbayesNaiveBayes

Bag-Cost-NB

(9) Stacking with NB KNN C45 and logistic

wekaclassifiersmetaStacking -X 10 -M wekaclassifiersfunctionsLogistic -R 10E-8 -M -1ndashnum ndashdecimal-places 4 -S 1 -num-slots 1 -B

weka classifiersbayesNaiveBayes -Bwekaclassifiers lazyIBk -K 1 -W 0 -A

wekacoreneighboursearch LinearNNSearch-A wekacoreEuclideanDistance -R

first-last -B wekaclassifierstreesJ48 -C025 -M 2

Stacking-log

(10) Stacking cost-sensitive with NB KNN C45and logistic

wekaclassifiersmetaStacking -X 10 -M wekaclassifiersmetaCostSensitiveClassifier ndashcost

-matrix [10 30 10 10] -S 1 -Wwekaclassifiers functionsLogistic -- -R 10E-8 -M-1 ndashnum ndashdecimal ndashplaces 4 -S 1 -num-slots 1 -B

weka classifiers bayesNaiveBayes -BwekaclassifierslazyIBk -K 1 -W 0 -A

wekacoreneighboursearchLinearNN Search-A wekacoreEuclideanDistance -R first-

last -B wekaclassifierstreesJ48 -C 025-M 2

Stacking-Cost-log

Mathematical Problems in Engineering 7

Table 5 Summary of imbalanced data sets

Data sets sam (min maj) IR feawisconsin 683 (239 444) 054 9pima 768 (268 500) 053 8haberman 306 (81 225) 036 3Vehicle1 846 (217 629) 0345 18Vehicle0 846 (199 647) 0307 18segment0 2308 (329 1979) 017 19yeast-0-3-5-9 vs 7-8 506 (50 456) 011 8yeast-0-2-5-6 vs 3-7-8-9 1004 (99 905) 011 8vowel0 988 (90 898) 010 13led7digit-0-2-4-5-6-7-8-9 vs 1 443 (37 406) 009 7Glass2 214 (17 197) 008 10cleveland-0 vs 4 173 (13 160) 008 13glass-0-1-6 vs 5 184 (9 175) 005 9car-good 1728 (69 1659) 004 6flare-F 1066 (43 1023) 004 11car-vgood 1728 (65 1663) 0039 6abalone-17 vs 7-8-9-10 2338 (58 2280) 00254 9

and variations in the amount of class overlap (see the KEELrepository [45]) Multiclass data sets were modified to obtaintwo-class imbalanced problems so that the union of one ormore classes became the positive class and the union of oneor more of the remaining classes was labeled as the negativeclass A brief description of the used date set is presented inTable 5 It includes the total number of instances (Sam)the total number of each class instances (Min Maj) theimbalance ratio (IR = the ratio of the number of minorityclass instance to majority class instance) and the number offeatures (Fea)

Our system was compared with other ensemble algo-rithms including AdaBoost with Naıve Bayes AdaBoost withcost-sensitive bagging with Naıve Bayes bagging with cost-sensitive and stacking cost-sensitive with NB k-NN C45and logistic regression All the experiments were performedby 10-fold cross-validation

42 Experimental Results Tables 6 7 and 8 respectivelyshow the results of the three metrics (AUC GeoMean andAGeoMean) for the algorithms in Table 4 obtained with thedata set in Table 5The best results are emphasized in boldfaceon each data set in these tables

The results showed that the performance of the proposedRECSG method was the best for 12 of 17 data sets in termsof GeoMean and AGeoMean and for 10 of 17 data sets interms of AUC Some methods are better than others in someevaluation metrics but not in all metric and most data sets

In Table 6 (AUC) the performance of the RECSGmethodis better than that of the 3 single-base classification methodsfor 14 of 17 data sets and better than that of other 5 ensemblealgorithms for 13 of 17 data sets In Table 7 (GeoMean)the RECSG method outperforms all the 3 single-base clas-sification methods in 15 of 17 data sets and outperformsother 5 ensemble algorithms in 15 of 17 data sets In Table 8

(AGeoMean) the RECSG method outperforms all the 3single-base classification methods in 14 of 17 data sets andoutperforms other 5 ensemble algorithms in 16 of 17 data setsFor the 17 data sets the evaluation metrics of GeoMean andAGeoMean are better than AUC

43 Statistical Tests Statistical test [46] is adopted to comparedifferent algorithms In the paper we use two types of com-parisons pairwise comparison (between a pair of algorithms)and multiple comparisons (among a group of algorithms)

431 Pair of Algorithms We performed statistic 119879-teststo explore whether RECSG approach is the significantlybetter than other algorithms in the three metrics (AUCGeoMean and AGeoMean) Table 9 shows the results ofRECSG approach compared with the other methods interms of AUC GeoMean and AGeoMean The values inthe square brackets indicate the number of the metrics withstatistically significant difference in the 119879-test performedwith the confidence level 120572 = 005 in the three evaluationmetrics As for the evaluation metric of AUC the RECSGoutperformed NB for 11 of 17 data sets and 5 data sets showedthe statistically significant differencewith the confidence levelof 120572 = 005

432 Multiple Comparisons We performed Holm post hoctest [47] and selected multiple groups to test the threemetrics (AUC GeoMean and AGeoMean) The post hocprocedures can determine whether a comparison hypothesisshould be rejected at a specified confidence level 120572 Statisticalexperiment was performed in the platform on the websitehttpteccitiususcesstac Table 10 provides the results thatRECSGapproach is significantly differentwith the confidencelevel of120572 = 005 in terms of AUCGeoMean andAGeoMean1198670 indicates that there is no significant difference According

8 Mathematical Problems in Engineering

Table6Re

sults

ofotherm

etho

dsin

term

sofA

UC

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking

-Cost-log

wisc

onsin

098

60957

0953

0989

0981

0982

099

0989

0982

0982

pima

0819

0751

0662

0819

0804

0808

0817

0819

0819

0819

haberm

an064

0615

053

064

0667

0625

0638

064

70636

063

Vehicle

10711

0722

0671

0711

0769

0723

0715

0712

0782

0782

Vehicle

00814

093

092

0814

0784

0781

0808

0808

0977

097

7segm

ent0

0987

0971

0996

0987

0944

0959

0987

0986

0999

099

9yeast-0

-3-5-9

vs7-8

075

0675

0653

075

0751

0686

075

0754

0769

0773

yeast-0

-2-5-6

vs3-7-8-9

0832

0687

0782

0832

079

0808

084

0842

084

7082

vowel0

098

0974

1098

0996

0994

0981

098

11

led7digit-0

-2-4-5-6-7-8-9

vs1

095

40925

0827

0954

0898

0893

0953

0949

0909

0904

Glass2

0716

0812

0587

0716

0737

0735

0748

0737

0713

0721

cleveland

-0vs

40909

0824

0754

0909

0826

0891

0915

0913

0979

0976

glass-0-1-6

vs5

0934

0989

0825

0934

0975

0905

0983

0985

09

099

car-good

0976

0493

064

80976

0968

0971

0976

0976

0653

097

7flare-F

0908

0525

059

0911

0876

0893

0908

0912

0825

0806

car-vgoo

d0997

0992

069

0997

0997

0997

0997

0997

0998

099

8abalon

e-17

vs7-8-9-10

0775

0613

0624

0775

0817

073

0773

0774

0752

0726

Mean

0864

0791

0748

0864

0858

0846

0869

0869

0855

0875

Mathematical Problems in Engineering 9

Table7Re

sults

ofotherm

etho

dsin

term

sofG

eoMean

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking

-Cost-log

wisc

onsin

0965

0954

0953

0968

0959

0967

0968

0966

096

4097

0pima

0719

0697

0649

0736

0715

0739

0723

0731

0703

0741

haberm

an0419

0566

0455

0586

046

60573

0434

0594

0361

060

5Ve

hicle

10677

0622

0652

0674

0607

0677

0673

0661

0561

0712

Vehicle

00724

0912

0919

0726

0735

0733

0719

0733

0894

094

6segm

ent0

0897

0977

0995

0892

0897

0906

0902

0895

099

40993

yeast-0

-3-5-9

vs7-8

046

70463

060

9046

4046

60550

044

50485

044

4060

3yeast-0

-2-5-6

vs3-7-8-9

0762

0590

0762

0793

0762

0792

0762

0787

0706

079

8vowel0

0920

0969

1000

0913

0944

0929

0921

0914

1000

1000

led7digit-0

-2-4-5-6-7-8-9

vs1

0853

0874

0818

0869

0863

0870

0845

0850

0846

0874

Glass2

0578

0472

0470

0564

0578

0610

0586

0568

000

00607

cleveland

-0vs

4089

90724

0722

0936

0611

0863

0866

0861

0866

0866

glass-0-1-6

vs5

0932

0877

0810

0932

0932

0872

0872

0869

0814

0877

car-good

000

0000

00558

0794

0520

0783

000

00778

0561

0813

flare-F

0751

000

00450

0781

0541

064

70738

0781

000

00774

car-vgoo

d0496

0888

0626

0986

0917

0925

0481

0971

0957

098

8abalon

e-17

vs7-8-9-10

0628

0321

0506

064

00319

0368

0627

0630

0293

066

4Mean

0687

064

10703

0779

0696

0753

0680

0769

064

50814

10 Mathematical Problems in Engineering

Table8Re

sults

ofotherm

etho

dsin

term

sofA

GeoMean

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking

-Cost-log

wisc

onsin

0961

0960

0961

0962

0963

0962

0962

0960

0967

096

7pima

076

80743

0706

0734

0767

0736

0775

0729

0770

0726

haberm

an064

20680

0601

0718

066

00694

0653

0721

0613

0663

Vehicle

10690

0721

0728

0658

0722

0669

0690

0654

0709

0740

Vehicle

0066

60929

0934

0658

0681

0663

066

40663

0924

093

6segm

ent0

0859

0986

0996

0850

0859

0871

0863

0854

0995

099

5yeast-0

-3-5-9

vs7-8

0715

0706

0782

0709

0713

0737

0703

0721

0700

0770

yeast-0

-2-5-6

vs3-7-8-9

0856

0778

0855

0857

0856

0855

0856

0854

0839

086

6vowel0

0929

0981

1000

0909

0966

0948

0931

0910

1000

1000

led7digit-0

-2-4-5-6-7-8-9

vs1

0853

0874

0818

0869

0863

0870

0845

0850

0846

0874

Glass2

0511

0701

0695

0493

0511

0551

0539

0515

000

00744

cleveland

-0vs

4092

70845

0841

0943

0783

0914

0918

0910

0918

0918

glass-0-1-6

vs5

095

40932

0894

0954

0954

0923

0923

0919

0902

0932

car-good

000

0000

00763

0878

0747

0870

000

00873

0769

089

0flare-F

0840

000

00705

0842

0751

0795

0836

0842

000

0084

5car-vgoo

d0743

0934

0793

0986

0953

0956

0725

0980

0974

099

0abalon

e-17

vs7-8-9-10

0736

0655

0745

0716

0650

0672

0734

0711

064

1080

4Mean

0744

0731

0813

0808

0788

0805

0742

08039

0739

086

2

Mathematical Problems in Engineering 11

Table 9 Comparison between RECSG and other algorithms in terms of winslosses of the results

NB C45 KNN Cost-NB Ada-NB Ada-Cost-NB Bag-NB Bag-Cost-

NB Stacking-log

Stacking-Cost-log(AUC)

11 [5]6 [3] 15 [11]2 [2] 17 [15]0 [2] 11 [6]6 [5] 13 [7]4 [2] 14 [9]3 [1] 10 [6]7 [5] 10 [6]7 [5] 11 [3]6 [2]

Stacking-Cost-log(GeoMean)

15 [13]2 [2] 17 [13]0 [0] 15 [13]2 [1] 14 [8]3 [2] 16 [12]1 [1] 16 [10]1 [0] 17 [13]0 [0] 16 [10]3 [0] 16 [12]1 [0]

Stacking-Cost-log(AGeoMean)

14 [10]3 [1] 15 [9]2 [2] 15 [12]2 [1] 13 [7]4 [3] 15 [11]2 [2] 15 [10]2 [2] 16 [10]1 [1] 15 [8]2 [1] 16 [10]1 [1]

Table 10 Holm post hoc test to show differences of Stacking-Cost-log (AUC) and other algorithms

Rejected (1198670) Accepted (1198670)Stacking-Cost-log (AUC) versus other algorithms 3 6Stacking-Cost-log (GeoMean) versus other algorithms 8 1Stacking-Cost-log (AGeoMean) versus other algorithms 9 0

to the evaluation metric of GeoMean the RECSG is signifi-cantly better than other 8 methods with the confidence levelof 120572 = 005

44 Discussion According to the experiments performedwith 17 different imbalance data sets and the comparisonwith other 9 classification methods the proposed RECSGmethod showed the higher performance than other methodsin terms of AUC GeoMean and AGeoMean as illustratedin Tables 7 8 and 9 The RECSG method showed the betterperformance for 10 of 17 cases in terms of AUC and for 12 of17 cases in terms of GeoMean and AGeoMean The methodoutperformed other 5 ensemble algorithms for 16 of 17 datasets in terms ofGeoMean andAGeoMean and for 13 of 17 datasets in terms of AUC GeoMean and AGeoMean are betterthan AUC in evaluation metrics in 17 data sets The means ofRECSGmethod in terms of GeoMean AGeoMean and AUCare all higher than other methods The Holm post hoc testshows that RECSGmethod significantly outperforms the restwith the confidence level 120572 = 005 in terms of AGeoMean for8 of 9methods in terms of AGeoMean and for 3 of 9methodsin terms of AUC

Experimental results and statistical tests show that theRECSG approach has improved the classification perfor-mance of imbalanced data sets The reasons can be explainedas follows

Firstly stacking algorithm uses to combine the resultof base classifier whereas bagging employs majority voteBagging is only a simple decision combinationmethodwhichrequires neither cross-validation nor Level 1 learning In thispaper stacked generalization adopts logistic regressionthus providing the simplest linear combination of pooling theLevel 0 modelsrsquo confidence

Secondly cost-sensitive algorithm can affect imbalancedata sets in two aspects Firstly cost-sensitive modificationscan be applied in the probabilistic estimationMoreover cost-sensitive factors change theweight of instance and theweight

of the minority class is higher than that of the majorityclass Therefore the method directly changes the number ofcommon instances without discarding or duplicating any ofthe rare instances

Thirdly RECSG method introduces cost-sensitive learn-ing into stacking ensemble The cost-sensitive function in replaces the error-minimizing function thus allowing Level 1learning to be prone to focus on the minority class

The results in Tables 7 and 8 demonstrate that the RECSGmethod has the higher performancewhen the evaluation per-formance metric of base classifier is weaker (such as Vehicle1Vehicle0 and car-vgood) The reason is that the alternativemethods of NB C45 and 119896-NN have shortcomings Theindependence assumption hampers the performance of NBin some data sets In C45 tree construction the selectionof attributes affects the model performances Error ratio of119896-NN is relatively high when data sets are in imbalanceTherefore logistic regression adopted in can improve theperformance when base classifier is weaker

The performance of the RECSG method is generallybetter when the IR is low (such as Glass2 car-good flare-Fcar-vgood and abalone-17 vs 7-8-9-10) The performance ofRECSG method is probably related to the setting of the costmatrix Different data sets should use different cost matricesFor the purpose of simplicity in the paper we have adoptedthe same cost matrices whichmay bemore suitable to low IRvalues

5 Conclusions

In this paper in order to solve the class imbalance problemwe proposed the RECSG method based on 2-layer learningmodels Experimental results and statistical tests showedthat the RECSG approach improved the classification per-formance The proposed RECSG approach may have therelatively high computational complexity in the trainingstage because the approach involves 2-layer classifier models

12 Mathematical Problems in Engineering

which consist of several base classifiers and metaclassifiersThe number and kinds of the base-level classifiers areclosely related to the performance of stacking algorithmIn the fusion stage of base classifiers the selection of themetaclassifier is also important In this paper in order tovalidate the performance improvement compared with othercurrent classification algorithms we only randomly selected3 classification algorithms (NB 119896-NN and C45) as baseclassifier and cost-sensitive algorithm as metaclassifier inthe fuse stage Selection of the number or kind of the baseclassifiers and metaclassifier was not discussed Thereforewe will explore the diversity and quality of base classifier inthe future The adoption of these strategies would improvethe prediction performance and reduce the training time ofstacking algorithm for imbalance problems

Conflicts of Interest

The author declares that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

This research is supported by the Natural Science Foundationof Shanxi Province under Grant no 2015011039

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179–192, 2016.
[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.
[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[5] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965–969, IEEE, Hong Kong, December 2006.
[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.
[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, San Francisco, Calif, USA, August 2001.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.
[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303–312, 2007.
[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.
[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases, vol. 2838 of Lecture Notes in Computer Science, pp. 107–119, 2003.
[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering (WCSE 2009), pp. 13–17, China, October 2009.
[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.
[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 324–331, USA, April 2009.
[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.
[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," in Lecture Notes in Computer Science, vol. 6086, pp. 148–157, 2010.
[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, San Francisco, CA, USA, 1999.
[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983–990, Stanford, CA, USA, 2000.
[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257–264, San Jose, CA, USA.
[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.
[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, 2014.
[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation, and Automation (ICCIA 2016), pp. 425–429, Iran, January 2016.
[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pp. 4836–4841, Hungary, 2017.
[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics (PCI 2008), pp. 217–221, Greece, August 2008.
[30] H. Y. Lo, J. C. Wang, H. M. Wang, and S. D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 2308–2311, Czech Republic, May 2011.
[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179–186, 1997.
[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.
[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49–64, 1996.
[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641–1650, 1996.
[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147–149, 1990.
[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, August 2001.
[41] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1995.
[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, pp. 271–289, 1999.
[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.
[45] J. Alcalá-Fdez, A. Fernández, J. Luengo, et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959–977, 2009.
[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.





6 Mathematical Problems in Engineering

Input Training set119863 = (119909119894 119910119894)119873119894=1 Test dataset119863test = (1199091015840119894 )1198731015840

119894=1

Output Predict class labels of the test samples 1199101015840119894meta1198731015840

119894=1

For each119895 = 1 2 119870 do(1) Resample imbalance data and generate 119895ndashfold cross-validation sets to obtain New data119895(2) Train and compute 119906119895(119909119894)119873119894=1 and 119906119895(1199091015840119894 )119873

1015840

119894=1 in Level-0 (base)-layer classifier119872119895end(3) Construct119863meta = 1199061(119909119894) 119910119894119873

1015840

119894=1 and1198631015840meta = 1199061(1199091015840119894 ) 1199101198941198731015840

119894=1

(4) Based on the data119863meta classification (cost-sensitive and Logistic Regression) is used togenerate Level-1 (meta)-layer model through with1198631015840meta to predict 1199101015840119894meta119873

1015840

119894=1

Pseudocode 1 Pseudocodes of RE-sample and Cost-Sensitive Stacked Generalization

Table 4 Base classifier methods and corresponding ensemble algorithms

Methods Descriptions and parameters Abbreviations(1) Naıve Bayes wekaclassifiersbayesNaiveBayes NB(2) C45 wekaclassifierstreesJ48 -C 025 -M 2 C45

(3) 119896-nearest neighborwekaclassifierslazyIBk -K 1 -W 0 -A

wekacoreneighboursearchLinearNNSearch -AwekacoreEuclideanDistance -R first-last

KNN

(4) Cost-sensitive with Naıve Bayes wekaclassifiersmetaAdaBoostM1 -P 100 -S 1 -I 10-W wekaclassifiersbayesNaiveBayes Cost-NB

(5) AdaBoost with Naıve Bayes wekaclassifiersmetaAdaBoostM1 -P 100 -S 1 -I 10-W wekaclassifiersbayesNaiveBayes Ada-NB

(6) AdaBoost with cost-sensitive and Naıve Bayes

wekaclassifiersmetaAdaBoostM1 -P 100 -S 1-I10-W wekaClassifiersmetaCostSensitiveClassifier-- -cost-matrix [10 30 10 10] -S 1 -W weka

classifiersbayesNaiveBayes

Ada-Cost-NB

(7) Bagging with Naıve BayeswekaclassifiersmetaBagging -P 100 -S 1

-num-slots 1 -I 10 -WwekaclassifiersbayesNaiveBayes

Bag-NB

(8) Bagging with cost-sensitive and Naıve Bayes

wekaclassifiersmetaBagging -P 100 -S 1-num-slots 1 -I 10 -W

wekaclassifiersmetaCostSensitive Classifier ---cost-matrix [10 30 10 10] -S 1 -W

wekaclassifiersbayesNaiveBayes

Bag-Cost-NB

(9) Stacking with NB KNN C45 and logistic

wekaclassifiersmetaStacking -X 10 -M wekaclassifiersfunctionsLogistic -R 10E-8 -M -1ndashnum ndashdecimal-places 4 -S 1 -num-slots 1 -B

weka classifiersbayesNaiveBayes -Bwekaclassifiers lazyIBk -K 1 -W 0 -A

wekacoreneighboursearch LinearNNSearch-A wekacoreEuclideanDistance -R

first-last -B wekaclassifierstreesJ48 -C025 -M 2

Stacking-log

(10) Stacking cost-sensitive with NB KNN C45and logistic

wekaclassifiersmetaStacking -X 10 -M wekaclassifiersmetaCostSensitiveClassifier ndashcost

-matrix [10 30 10 10] -S 1 -Wwekaclassifiers functionsLogistic -- -R 10E-8 -M-1 ndashnum ndashdecimal ndashplaces 4 -S 1 -num-slots 1 -B

weka classifiers bayesNaiveBayes -BwekaclassifierslazyIBk -K 1 -W 0 -A

wekacoreneighboursearchLinearNN Search-A wekacoreEuclideanDistance -R first-

last -B wekaclassifierstreesJ48 -C 025-M 2

Stacking-Cost-log

Mathematical Problems in Engineering 7

Table 5 Summary of imbalanced data sets

Data sets sam (min maj) IR feawisconsin 683 (239 444) 054 9pima 768 (268 500) 053 8haberman 306 (81 225) 036 3Vehicle1 846 (217 629) 0345 18Vehicle0 846 (199 647) 0307 18segment0 2308 (329 1979) 017 19yeast-0-3-5-9 vs 7-8 506 (50 456) 011 8yeast-0-2-5-6 vs 3-7-8-9 1004 (99 905) 011 8vowel0 988 (90 898) 010 13led7digit-0-2-4-5-6-7-8-9 vs 1 443 (37 406) 009 7Glass2 214 (17 197) 008 10cleveland-0 vs 4 173 (13 160) 008 13glass-0-1-6 vs 5 184 (9 175) 005 9car-good 1728 (69 1659) 004 6flare-F 1066 (43 1023) 004 11car-vgood 1728 (65 1663) 0039 6abalone-17 vs 7-8-9-10 2338 (58 2280) 00254 9

and variations in the amount of class overlap (see the KEELrepository [45]) Multiclass data sets were modified to obtaintwo-class imbalanced problems so that the union of one ormore classes became the positive class and the union of oneor more of the remaining classes was labeled as the negativeclass A brief description of the used date set is presented inTable 5 It includes the total number of instances (Sam)the total number of each class instances (Min Maj) theimbalance ratio (IR = the ratio of the number of minorityclass instance to majority class instance) and the number offeatures (Fea)

Our system was compared with other ensemble algo-rithms including AdaBoost with Naıve Bayes AdaBoost withcost-sensitive bagging with Naıve Bayes bagging with cost-sensitive and stacking cost-sensitive with NB k-NN C45and logistic regression All the experiments were performedby 10-fold cross-validation

42 Experimental Results Tables 6 7 and 8 respectivelyshow the results of the three metrics (AUC GeoMean andAGeoMean) for the algorithms in Table 4 obtained with thedata set in Table 5The best results are emphasized in boldfaceon each data set in these tables

The results showed that the performance of the proposedRECSG method was the best for 12 of 17 data sets in termsof GeoMean and AGeoMean and for 10 of 17 data sets interms of AUC Some methods are better than others in someevaluation metrics but not in all metric and most data sets

In Table 6 (AUC) the performance of the RECSGmethodis better than that of the 3 single-base classification methodsfor 14 of 17 data sets and better than that of other 5 ensemblealgorithms for 13 of 17 data sets In Table 7 (GeoMean)the RECSG method outperforms all the 3 single-base clas-sification methods in 15 of 17 data sets and outperformsother 5 ensemble algorithms in 15 of 17 data sets In Table 8

(AGeoMean) the RECSG method outperforms all the 3single-base classification methods in 14 of 17 data sets andoutperforms other 5 ensemble algorithms in 16 of 17 data setsFor the 17 data sets the evaluation metrics of GeoMean andAGeoMean are better than AUC

43 Statistical Tests Statistical test [46] is adopted to comparedifferent algorithms In the paper we use two types of com-parisons pairwise comparison (between a pair of algorithms)and multiple comparisons (among a group of algorithms)

431 Pair of Algorithms We performed statistic 119879-teststo explore whether RECSG approach is the significantlybetter than other algorithms in the three metrics (AUCGeoMean and AGeoMean) Table 9 shows the results ofRECSG approach compared with the other methods interms of AUC GeoMean and AGeoMean The values inthe square brackets indicate the number of the metrics withstatistically significant difference in the 119879-test performedwith the confidence level 120572 = 005 in the three evaluationmetrics As for the evaluation metric of AUC the RECSGoutperformed NB for 11 of 17 data sets and 5 data sets showedthe statistically significant differencewith the confidence levelof 120572 = 005

432 Multiple Comparisons We performed Holm post hoctest [47] and selected multiple groups to test the threemetrics (AUC GeoMean and AGeoMean) The post hocprocedures can determine whether a comparison hypothesisshould be rejected at a specified confidence level 120572 Statisticalexperiment was performed in the platform on the websitehttpteccitiususcesstac Table 10 provides the results thatRECSGapproach is significantly differentwith the confidencelevel of120572 = 005 in terms of AUCGeoMean andAGeoMean1198670 indicates that there is no significant difference According

8 Mathematical Problems in Engineering

Table6Re

sults

ofotherm

etho

dsin

term

sofA

UC

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking

-Cost-log

wisc

onsin

098

60957

0953

0989

0981

0982

099

0989

0982

0982

pima

0819

0751

0662

0819

0804

0808

0817

0819

0819

0819

haberm

an064

0615

053

064

0667

0625

0638

064

70636

063

Vehicle

10711

0722

0671

0711

0769

0723

0715

0712

0782

0782

Vehicle

00814

093

092

0814

0784

0781

0808

0808

0977

097

7segm

ent0

0987

0971

0996

0987

0944

0959

0987

0986

0999

099

9yeast-0

-3-5-9

vs7-8

075

0675

0653

075

0751

0686

075

0754

0769

0773

yeast-0

-2-5-6

vs3-7-8-9

0832

0687

0782

0832

079

0808

084

0842

084

7082

vowel0

098

0974

1098

0996

0994

0981

098

11

led7digit-0

-2-4-5-6-7-8-9

vs1

095

40925

0827

0954

0898

0893

0953

0949

0909

0904

Glass2

0716

0812

0587

0716

0737

0735

0748

0737

0713

0721

cleveland

-0vs

40909

0824

0754

0909

0826

0891

0915

0913

0979

0976

glass-0-1-6

vs5

0934

0989

0825

0934

0975

0905

0983

0985

09

099

car-good

0976

0493

064

80976

0968

0971

0976

0976

0653

097

7flare-F

0908

0525

059

0911

0876

0893

0908

0912

0825

0806

car-vgoo

d0997

0992

069

0997

0997

0997

0997

0997

0998

099

8abalon

e-17

vs7-8-9-10

0775

0613

0624

0775

0817

073

0773

0774

0752

0726

Mean

0864

0791

0748

0864

0858

0846

0869

0869

0855

0875

Mathematical Problems in Engineering 9

Table7Re

sults

ofotherm

etho

dsin

term

sofG

eoMean

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking

-Cost-log

wisc

onsin

0965

0954

0953

0968

0959

0967

0968

0966

096

4097

0pima

0719

0697

0649

0736

0715

0739

0723

0731

0703

0741

haberm

an0419

0566

0455

0586

046

60573

0434

0594

0361

060

5Ve

hicle

10677

0622

0652

0674

0607

0677

0673

0661

0561

0712

Vehicle

00724

0912

0919

0726

0735

0733

0719

0733

0894

094

6segm

ent0

0897

0977

0995

0892

0897

0906

0902

0895

099

40993

yeast-0

-3-5-9

vs7-8

046

70463

060

9046

4046

60550

044

50485

044

4060

3yeast-0

-2-5-6

vs3-7-8-9

0762

0590

0762

0793

0762

0792

0762

0787

0706

079

8vowel0

0920

0969

1000

0913

0944

0929

0921

0914

1000

1000

led7digit-0

-2-4-5-6-7-8-9

vs1

0853

0874

0818

0869

0863

0870

0845

0850

0846

0874

Glass2

0578

0472

0470

0564

0578

0610

0586

0568

000

00607

cleveland

-0vs

4089

90724

0722

0936

0611

0863

0866

0861

0866

0866

glass-0-1-6

vs5

0932

0877

0810

0932

0932

0872

0872

0869

0814

0877

car-good

000

0000

00558

0794

0520

0783

000

00778

0561

0813

flare-F

0751

000

00450

0781

0541

064

70738

0781

000

00774

car-vgoo

d0496

0888

0626

0986

0917

0925

0481

0971

0957

098

8abalon

e-17

vs7-8-9-10

0628

0321

0506

064

00319

0368

0627

0630

0293

066

4Mean

0687

064

10703

0779

0696

0753

0680

0769

064

50814

10 Mathematical Problems in Engineering

Table8Re

sults

ofotherm

etho

dsin

term

sofA

GeoMean

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking


Table 5: Summary of imbalanced data sets.

Data set | Sam (Min, Maj) | IR | Fea
wisconsin | 683 (239, 444) | 0.54 | 9
pima | 768 (268, 500) | 0.53 | 8
haberman | 306 (81, 225) | 0.36 | 3
Vehicle1 | 846 (217, 629) | 0.345 | 18
Vehicle0 | 846 (199, 647) | 0.307 | 18
segment0 | 2308 (329, 1979) | 0.17 | 19
yeast-0-3-5-9 vs 7-8 | 506 (50, 456) | 0.11 | 8
yeast-0-2-5-6 vs 3-7-8-9 | 1004 (99, 905) | 0.11 | 8
vowel0 | 988 (90, 898) | 0.10 | 13
led7digit-0-2-4-5-6-7-8-9 vs 1 | 443 (37, 406) | 0.09 | 7
Glass2 | 214 (17, 197) | 0.08 | 10
cleveland-0 vs 4 | 173 (13, 160) | 0.08 | 13
glass-0-1-6 vs 5 | 184 (9, 175) | 0.05 | 9
car-good | 1728 (69, 1659) | 0.04 | 6
flare-F | 1066 (43, 1023) | 0.04 | 11
car-vgood | 1728 (65, 1663) | 0.039 | 6
abalone-17 vs 7-8-9-10 | 2338 (58, 2280) | 0.0254 | 9

The selected data sets cover a wide range of imbalance ratios and variations in the amount of class overlap (see the KEEL repository [45]). Multiclass data sets were modified to obtain two-class imbalanced problems: the union of one or more classes became the positive class, and the union of one or more of the remaining classes was labeled as the negative class. A brief description of each data set is presented in Table 5, including the total number of instances (Sam), the number of instances in each class (Min, Maj), the imbalance ratio (IR, the ratio of the number of minority-class instances to the number of majority-class instances), and the number of features (Fea).
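For concreteness, the relabeling can be sketched as follows; the class names and grouping below are illustrative rather than taken from a specific KEEL split:

```python
# Sketch of the multiclass-to-binary relabeling described above: the union
# of some original classes becomes the positive (minority) class and the
# remaining classes become negative. Class names here are illustrative.
import numpy as np

def to_binary(labels, positive_classes):
    """Map a multiclass label array to a two-class imbalanced problem."""
    labels = np.asarray(labels)
    return np.where(np.isin(labels, positive_classes), 1, 0)

y = ["c0", "c3", "c7", "c5", "c8", "c9", "c7"]
y_bin = to_binary(y, positive_classes=["c7", "c8"])
print(y_bin)                                      # [0 0 1 0 1 0 1]
print(y_bin.sum() / (len(y_bin) - y_bin.sum()))   # IR = minority / majority
```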

Our system was compared with other ensemble algorithms, including AdaBoost with Naive Bayes, AdaBoost with cost-sensitive NB, bagging with Naive Bayes, bagging with cost-sensitive NB, and cost-sensitive stacking with NB, k-NN, C4.5, and logistic regression. All the experiments were performed with 10-fold cross-validation.
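The experiments in this paper were run in WEKA [43]. For readers who want to reproduce the general setup, the following is a rough scikit-learn sketch of a RECSG-like 2-layer pipeline, not the exact implementation: a CART decision tree stands in for C4.5, and the 10:1 class weight on the Level 1 logistic regression is an assumed stand-in for the paper's cost matrix:

```python
# Hedged sketch (not the paper's WEKA setup): a RECSG-like 2-layer model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A synthetic imbalanced data set (about 9:1 majority to minority).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

recsg_like = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("knn", KNeighborsClassifier(n_neighbors=5)),
                ("tree", DecisionTreeClassifier(random_state=0))],
    # Class-weighted logistic regression as the cost-sensitive Level 1 model;
    # the 10:1 weight is an illustrative assumption, not the paper's matrix.
    final_estimator=LogisticRegression(class_weight={0: 1, 1: 10}),
    stack_method="predict_proba",  # pool Level 0 confidences, as in stacking
    cv=5,                          # internal CV builds the Level 1 training data
)

print(cross_val_score(recsg_like, X, y, cv=10, scoring="roc_auc").mean())
```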

4.2. Experimental Results. Tables 6, 7, and 8, respectively, show the results of the three metrics (AUC, GeoMean, and AGeoMean) for the algorithms in Table 4, obtained with the data sets in Table 5. The best result on each data set was emphasized in boldface in the original tables.
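For reference, GeoMean is the geometric mean of sensitivity and specificity [32], and AGeoMean is the adjusted geometric mean of Batuwita and Palade [33]. Both are sketched below; the exact AGeoMean weighting follows our reading of [33] and should be checked against that paper before reuse:

```python
# Sketch of the two rank metrics used above (AUC is standard).
import numpy as np

def geo_mean(tpr: float, tnr: float) -> float:
    # Geometric mean of sensitivity (TPR) and specificity (TNR), per [32].
    return np.sqrt(tpr * tnr)

def a_geo_mean(tpr: float, tnr: float, nn: float) -> float:
    """Adjusted G-mean; nn is the proportion of majority (negative) examples.

    Assumed form per our reading of [33]: specificity is weighted by nn.
    """
    if tpr == 0:
        return 0.0
    return (geo_mean(tpr, tnr) + tnr * nn) / (1 + nn)

# e.g. a classifier with TPR = 0.6, TNR = 0.9 on a 90%-negative data set:
print(geo_mean(0.6, 0.9), a_geo_mean(0.6, 0.9, 0.9))
```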

The results showed that the performance of the proposed RECSG method was the best for 12 of 17 data sets in terms of GeoMean and AGeoMean and for 10 of 17 data sets in terms of AUC. Some methods are better than others on some evaluation metrics, but none dominates on all metrics and most data sets.

In Table 6 (AUC), the performance of the RECSG method is better than that of the 3 single-base classification methods for 14 of 17 data sets and better than that of the other 5 ensemble algorithms for 13 of 17 data sets. In Table 7 (GeoMean), the RECSG method outperforms all 3 single-base classification methods in 15 of 17 data sets and outperforms the other 5 ensemble algorithms in 15 of 17 data sets. In Table 8 (AGeoMean), the RECSG method outperforms all 3 single-base classification methods in 14 of 17 data sets and outperforms the other 5 ensemble algorithms in 16 of 17 data sets. Across the 17 data sets, RECSG's advantage is larger under GeoMean and AGeoMean than under AUC.

4.3. Statistical Tests. Statistical tests [46] are adopted to compare different algorithms. In this paper, we use two types of comparison: pairwise comparison (between a pair of algorithms) and multiple comparison (among a group of algorithms).

4.3.1. Pair of Algorithms. We performed statistical T-tests to explore whether the RECSG approach is significantly better than the other algorithms in the three metrics (AUC, GeoMean, and AGeoMean). Table 9 shows the results of the RECSG approach compared with the other methods in terms of AUC, GeoMean, and AGeoMean. The values in square brackets indicate the number of data sets with a statistically significant difference in the T-test performed at the confidence level α = 0.05 for the three evaluation metrics. As for the evaluation metric of AUC, RECSG outperformed NB for 11 of 17 data sets, and 5 data sets showed a statistically significant difference at the confidence level of α = 0.05.
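In outline, the pairwise test pairs the two methods' scores data set by data set. A minimal sketch with scipy, shown here on the first five GeoMean entries of Table 7 for Stacking-Cost-log versus NB (a plausible reading of the T-test used above, not the paper's exact procedure):

```python
# Paired t-test over per-data-set scores of two methods.
from scipy import stats

# GeoMean on the first five data sets of Table 7:
recsg = [0.970, 0.741, 0.605, 0.712, 0.946]   # Stacking-Cost-log
nb    = [0.965, 0.719, 0.419, 0.677, 0.724]   # NB

t, p = stats.ttest_rel(recsg, nb)
print(f"t = {t:.3f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```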

4.3.2. Multiple Comparisons. We performed the Holm post hoc test [47] and selected multiple groups to test the three metrics (AUC, GeoMean, and AGeoMean). The post hoc procedure can determine whether a comparison hypothesis should be rejected at a specified confidence level α. The statistical experiment was performed on the STAC web platform (http://tec.citius.usc.es/stac). Table 10 provides the results showing whether the RECSG approach is significantly different at the confidence level of α = 0.05 in terms of AUC, GeoMean, and AGeoMean; H0 indicates that there is no significant difference.
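Holm's procedure itself is short enough to sketch: sort the p-values of the individual comparisons and test them against successively less strict thresholds α/(m − i), stopping at the first acceptance. The p-values below are illustrative, not the ones computed on the STAC platform:

```python
# Holm's step-down procedure [47] for m simultaneous hypotheses.
def holm(p_values, alpha=0.05):
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one H0 is accepted, all larger p-values are too
    return reject

print(holm([0.001, 0.04, 0.004, 0.30]))  # -> [True, False, True, False]
```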

Table 6: Results of other methods in terms of AUC.

Data set | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.986 | 0.957 | 0.953 | 0.989 | 0.981 | 0.982 | 0.990 | 0.989 | 0.982 | 0.982
pima | 0.819 | 0.751 | 0.662 | 0.819 | 0.804 | 0.808 | 0.817 | 0.819 | 0.819 | 0.819
haberman | 0.640 | 0.615 | 0.530 | 0.640 | 0.667 | 0.625 | 0.638 | 0.647 | 0.636 | 0.630
Vehicle1 | 0.711 | 0.722 | 0.671 | 0.711 | 0.769 | 0.723 | 0.715 | 0.712 | 0.782 | 0.782
Vehicle0 | 0.814 | 0.930 | 0.920 | 0.814 | 0.784 | 0.781 | 0.808 | 0.808 | 0.977 | 0.977
segment0 | 0.987 | 0.971 | 0.996 | 0.987 | 0.944 | 0.959 | 0.987 | 0.986 | 0.999 | 0.999
yeast-0-3-5-9 vs 7-8 | 0.750 | 0.675 | 0.653 | 0.750 | 0.751 | 0.686 | 0.750 | 0.754 | 0.769 | 0.773
yeast-0-2-5-6 vs 3-7-8-9 | 0.832 | 0.687 | 0.782 | 0.832 | 0.790 | 0.808 | 0.840 | 0.842 | 0.847 | 0.820
vowel0 | 0.980 | 0.974 | 1.000 | 0.980 | 0.996 | 0.994 | 0.981 | 0.980 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.954 | 0.925 | 0.827 | 0.954 | 0.898 | 0.893 | 0.953 | 0.949 | 0.909 | 0.904
Glass2 | 0.716 | 0.812 | 0.587 | 0.716 | 0.737 | 0.735 | 0.748 | 0.737 | 0.713 | 0.721
cleveland-0 vs 4 | 0.909 | 0.824 | 0.754 | 0.909 | 0.826 | 0.891 | 0.915 | 0.913 | 0.979 | 0.976
glass-0-1-6 vs 5 | 0.934 | 0.989 | 0.825 | 0.934 | 0.975 | 0.905 | 0.983 | 0.985 | 0.900 | 0.990
car-good | 0.976 | 0.493 | 0.648 | 0.976 | 0.968 | 0.971 | 0.976 | 0.976 | 0.653 | 0.977
flare-F | 0.908 | 0.525 | 0.590 | 0.911 | 0.876 | 0.893 | 0.908 | 0.912 | 0.825 | 0.806
car-vgood | 0.997 | 0.992 | 0.690 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.998 | 0.998
abalone-17 vs 7-8-9-10 | 0.775 | 0.613 | 0.624 | 0.775 | 0.817 | 0.730 | 0.773 | 0.774 | 0.752 | 0.726
Mean | 0.864 | 0.791 | 0.748 | 0.864 | 0.858 | 0.846 | 0.869 | 0.869 | 0.855 | 0.875

Table 7: Results of other methods in terms of GeoMean.

Data set | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.965 | 0.954 | 0.953 | 0.968 | 0.959 | 0.967 | 0.968 | 0.966 | 0.964 | 0.970
pima | 0.719 | 0.697 | 0.649 | 0.736 | 0.715 | 0.739 | 0.723 | 0.731 | 0.703 | 0.741
haberman | 0.419 | 0.566 | 0.455 | 0.586 | 0.466 | 0.573 | 0.434 | 0.594 | 0.361 | 0.605
Vehicle1 | 0.677 | 0.622 | 0.652 | 0.674 | 0.607 | 0.677 | 0.673 | 0.661 | 0.561 | 0.712
Vehicle0 | 0.724 | 0.912 | 0.919 | 0.726 | 0.735 | 0.733 | 0.719 | 0.733 | 0.894 | 0.946
segment0 | 0.897 | 0.977 | 0.995 | 0.892 | 0.897 | 0.906 | 0.902 | 0.895 | 0.994 | 0.993
yeast-0-3-5-9 vs 7-8 | 0.467 | 0.463 | 0.609 | 0.464 | 0.466 | 0.550 | 0.445 | 0.485 | 0.444 | 0.603
yeast-0-2-5-6 vs 3-7-8-9 | 0.762 | 0.590 | 0.762 | 0.793 | 0.762 | 0.792 | 0.762 | 0.787 | 0.706 | 0.798
vowel0 | 0.920 | 0.969 | 1.000 | 0.913 | 0.944 | 0.929 | 0.921 | 0.914 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.853 | 0.874 | 0.818 | 0.869 | 0.863 | 0.870 | 0.845 | 0.850 | 0.846 | 0.874
Glass2 | 0.578 | 0.472 | 0.470 | 0.564 | 0.578 | 0.610 | 0.586 | 0.568 | 0.000 | 0.607
cleveland-0 vs 4 | 0.899 | 0.724 | 0.722 | 0.936 | 0.611 | 0.863 | 0.866 | 0.861 | 0.866 | 0.866
glass-0-1-6 vs 5 | 0.932 | 0.877 | 0.810 | 0.932 | 0.932 | 0.872 | 0.872 | 0.869 | 0.814 | 0.877
car-good | 0.000 | 0.000 | 0.558 | 0.794 | 0.520 | 0.783 | 0.000 | 0.778 | 0.561 | 0.813
flare-F | 0.751 | 0.000 | 0.450 | 0.781 | 0.541 | 0.647 | 0.738 | 0.781 | 0.000 | 0.774
car-vgood | 0.496 | 0.888 | 0.626 | 0.986 | 0.917 | 0.925 | 0.481 | 0.971 | 0.957 | 0.988
abalone-17 vs 7-8-9-10 | 0.628 | 0.321 | 0.506 | 0.640 | 0.319 | 0.368 | 0.627 | 0.630 | 0.293 | 0.664
Mean | 0.687 | 0.641 | 0.703 | 0.779 | 0.696 | 0.753 | 0.680 | 0.769 | 0.645 | 0.814

Table 8: Results of other methods in terms of AGeoMean.

Data set | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.961 | 0.960 | 0.961 | 0.962 | 0.963 | 0.962 | 0.962 | 0.960 | 0.967 | 0.967
pima | 0.768 | 0.743 | 0.706 | 0.734 | 0.767 | 0.736 | 0.775 | 0.729 | 0.770 | 0.726
haberman | 0.642 | 0.680 | 0.601 | 0.718 | 0.660 | 0.694 | 0.653 | 0.721 | 0.613 | 0.663
Vehicle1 | 0.690 | 0.721 | 0.728 | 0.658 | 0.722 | 0.669 | 0.690 | 0.654 | 0.709 | 0.740
Vehicle0 | 0.666 | 0.929 | 0.934 | 0.658 | 0.681 | 0.663 | 0.664 | 0.663 | 0.924 | 0.936
segment0 | 0.859 | 0.986 | 0.996 | 0.850 | 0.859 | 0.871 | 0.863 | 0.854 | 0.995 | 0.995
yeast-0-3-5-9 vs 7-8 | 0.715 | 0.706 | 0.782 | 0.709 | 0.713 | 0.737 | 0.703 | 0.721 | 0.700 | 0.770
yeast-0-2-5-6 vs 3-7-8-9 | 0.856 | 0.778 | 0.855 | 0.857 | 0.856 | 0.855 | 0.856 | 0.854 | 0.839 | 0.866
vowel0 | 0.929 | 0.981 | 1.000 | 0.909 | 0.966 | 0.948 | 0.931 | 0.910 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.853 | 0.874 | 0.818 | 0.869 | 0.863 | 0.870 | 0.845 | 0.850 | 0.846 | 0.874
Glass2 | 0.511 | 0.701 | 0.695 | 0.493 | 0.511 | 0.551 | 0.539 | 0.515 | 0.000 | 0.744
cleveland-0 vs 4 | 0.927 | 0.845 | 0.841 | 0.943 | 0.783 | 0.914 | 0.918 | 0.910 | 0.918 | 0.918
glass-0-1-6 vs 5 | 0.954 | 0.932 | 0.894 | 0.954 | 0.954 | 0.923 | 0.923 | 0.919 | 0.902 | 0.932
car-good | 0.000 | 0.000 | 0.763 | 0.878 | 0.747 | 0.870 | 0.000 | 0.873 | 0.769 | 0.890
flare-F | 0.840 | 0.000 | 0.705 | 0.842 | 0.751 | 0.795 | 0.836 | 0.842 | 0.000 | 0.845
car-vgood | 0.743 | 0.934 | 0.793 | 0.986 | 0.953 | 0.956 | 0.725 | 0.980 | 0.974 | 0.990
abalone-17 vs 7-8-9-10 | 0.736 | 0.655 | 0.745 | 0.716 | 0.650 | 0.672 | 0.734 | 0.711 | 0.641 | 0.804
Mean | 0.744 | 0.731 | 0.813 | 0.808 | 0.788 | 0.805 | 0.742 | 0.804 | 0.739 | 0.862

Table 9: Comparison between RECSG and other algorithms in terms of wins/losses (significant wins/losses at α = 0.05 in square brackets).

Comparison | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log
Stacking-Cost-log (AUC) | 11 [5] / 6 [3] | 15 [11] / 2 [2] | 17 [15] / 0 [2] | 11 [6] / 6 [5] | 13 [7] / 4 [2] | 14 [9] / 3 [1] | 10 [6] / 7 [5] | 10 [6] / 7 [5] | 11 [3] / 6 [2]
Stacking-Cost-log (GeoMean) | 15 [13] / 2 [2] | 17 [13] / 0 [0] | 15 [13] / 2 [1] | 14 [8] / 3 [2] | 16 [12] / 1 [1] | 16 [10] / 1 [0] | 17 [13] / 0 [0] | 16 [10] / 3 [0] | 16 [12] / 1 [0]
Stacking-Cost-log (AGeoMean) | 14 [10] / 3 [1] | 15 [9] / 2 [2] | 15 [12] / 2 [1] | 13 [7] / 4 [3] | 15 [11] / 2 [2] | 15 [10] / 2 [2] | 16 [10] / 1 [1] | 15 [8] / 2 [1] | 16 [10] / 1 [1]

Table 10: Holm post hoc test showing differences between Stacking-Cost-log and the other algorithms.

Comparison | Rejected (H0) | Accepted (H0)
Stacking-Cost-log (AUC) versus other algorithms | 3 | 6
Stacking-Cost-log (GeoMean) versus other algorithms | 8 | 1
Stacking-Cost-log (AGeoMean) versus other algorithms | 9 | 0

According to the evaluation metric of GeoMean, the RECSG method is significantly better than 8 of the 9 other methods at the confidence level of α = 0.05.

4.4. Discussion. According to the experiments performed with 17 different imbalanced data sets and the comparison with 9 other classification methods, the proposed RECSG method showed higher performance than the other methods in terms of AUC, GeoMean, and AGeoMean, as illustrated in Tables 6-9. The RECSG method showed the best performance for 10 of 17 cases in terms of AUC and for 12 of 17 cases in terms of GeoMean and AGeoMean. The method outperformed the other 5 ensemble algorithms for 16 of 17 data sets in terms of GeoMean and AGeoMean and for 13 of 17 data sets in terms of AUC. On these 17 data sets, GeoMean and AGeoMean discriminate between the methods better than AUC does. The means of the RECSG method in terms of GeoMean, AGeoMean, and AUC are all higher than those of the other methods. The Holm post hoc test shows that the RECSG method significantly outperforms the rest at the confidence level α = 0.05 for 8 of 9 methods in terms of GeoMean, for 9 of 9 methods in terms of AGeoMean, and for 3 of 9 methods in terms of AUC.

Experimental results and statistical tests show that the RECSG approach improves the classification performance on imbalanced data sets. The reasons can be explained as follows.

Firstly, the stacking algorithm uses a trained Level 1 model to combine the results of the base classifiers, whereas bagging employs a majority vote. Bagging is only a simple decision combination method, which requires neither cross-validation nor Level 1 learning. In this paper, stacked generalization adopts logistic regression, thus providing the simplest linear combination for pooling the Level 0 models' confidences.

Secondly, the cost-sensitive algorithm can affect imbalanced data sets in two respects. First, cost-sensitive modifications can be applied in the probabilistic estimation. Moreover, cost-sensitive factors change the weights of instances, and the weight of the minority class is higher than that of the majority class. Therefore, the method directly changes the effective weight of common instances without discarding or duplicating any of the rare instances.
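A minimal sketch of this re-weighting idea, with an assumed (illustrative) cost matrix rather than the one used in the paper:

```python
# Derive per-instance weights from a 2x2 cost matrix so that minority
# (positive) examples weigh more. The cost values are illustrative
# assumptions, not the matrix used in the paper.
import numpy as np

# cost[i][j] = cost of predicting class j when the true class is i
# (classes: 0 = majority/negative, 1 = minority/positive)
cost = np.array([[0.0, 1.0],    # false positive costs 1
                 [10.0, 0.0]])  # false negative costs 10

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # a 9:1 imbalanced sample
weights = cost.sum(axis=1)[y]   # weight of an instance = cost of misclassifying it
print(weights)                  # the minority instance gets weight 10.0

# Such weights can be fed to any learner that accepts sample_weight, e.g.
# LogisticRegression().fit(X, y, sample_weight=weights) in scikit-learn.
```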

Thirdly, the RECSG method introduces cost-sensitive learning into the stacking ensemble. The cost-sensitive function replaces the error-minimizing function, thus making Level 1 learning prone to focus on the minority class.

The results in Tables 7 and 8 demonstrate that the RECSG method has a larger advantage when the performance of the base classifiers is weaker (such as on Vehicle1, Vehicle0, and car-vgood). The reason is that the alternative methods NB, C4.5, and k-NN have shortcomings: the independence assumption hampers the performance of NB on some data sets; in C4.5 tree construction, the selection of attributes affects the model performance; and the error ratio of k-NN is relatively high when data sets are imbalanced. Therefore, the logistic regression adopted at Level 1 can improve the performance when the base classifiers are weaker.

The performance of the RECSG method is generally better when the IR is low (such as on Glass2, car-good, flare-F, car-vgood, and abalone-17 vs 7-8-9-10). This behavior is probably related to the setting of the cost matrix: different data sets should use different cost matrices. For the purpose of simplicity, in this paper we adopted the same cost matrix throughout, which may be more suitable for low IR values.

5. Conclusions

In this paper, in order to solve the class imbalance problem, we proposed the RECSG method based on 2-layer learning models. Experimental results and statistical tests showed that the RECSG approach improved the classification performance. The proposed RECSG approach may have relatively high computational complexity in the training stage, because the approach involves 2-layer classifier models consisting of several base classifiers and a metaclassifier. The number and kinds of base-level classifiers are closely related to the performance of the stacking algorithm. In the fusion stage of the base classifiers, the selection of the metaclassifier is also important. In this paper, in order to validate the performance improvement compared with other current classification algorithms, we randomly selected only 3 classification algorithms (NB, k-NN, and C4.5) as base classifiers and a cost-sensitive algorithm as the metaclassifier in the fusion stage. Selection of the number or kind of base classifiers and metaclassifier was not discussed. Therefore, we will explore the diversity and quality of base classifiers in the future. The adoption of these strategies would improve the prediction performance and reduce the training time of the stacking algorithm for imbalance problems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is supported by the Natural Science Foundation of Shanxi Province under Grant no. 2015011039.

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179-192, 2016.
[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653-1672, 2015.
[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[5] X.-Y. Liu, J. X. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965-969, IEEE, Hong Kong, December 2006.
[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.
[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference, pp. 204-213, San Francisco, Calif, USA, August 2001.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463-484, 2012.
[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225-252, 2008.
[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303-312, 2007.
[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119-139, 1997.
[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases, vol. 2838, pp. 107-119.
[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 40, no. 1, pp. 185-197, 2010.
[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering, WCSE 2009, pp. 13-17, China, October 2009.
[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30-39, 2004.
[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2009, pp. 324-331, USA, April 2009.
[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," PAA. Pattern Analysis and Applications, vol. 6, no. 3, pp. 245-256, 2003.
[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," Lecture Notes in Computer Science, vol. 6086, pp. 148-157, 2010.
[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97-105, San Francisco, CA, USA, 1999.
[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983-990, Stanford, CA, USA, 2000.
[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257-264, San Jose, CA, USA.
[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358-3378, 2007.
[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241-259, 1992.
[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688-2702, 2014.
[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation, and Automation, ICCIA 2016, pp. 425-429, Iran, January 2016.
[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016, pp. 4836-4841, Hungary, 2017.
[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics, PCI 2008, pp. 217-221, Greece, August 2008.
[30] H. Y. Lo, J. C. Wang, H. M. Wang, and S. D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011, pp. 2308-2311, Czech Republic, May 2011.
[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299-310, 2005.
[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179-186, 1997.
[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.
[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49-64, 1996.
[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641-1650, 1996.
[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147-149, 1990.
[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175-185, 1992.
[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973-978, New York, NY, USA, August 2001.
[41] D. Michie, D. Spiegelhalter, J. C. Taylor, et al., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1995.
[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, no. 1, pp. 271-289, 1999.
[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2009.
[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.
[45] J. Alcalá-Fdez, A. Fernández, J. Luengo, et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255-287, 2011.
[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959-977, 2009.
[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65-70, 1979.


Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 9: Classifying Imbalanced Data Sets by a Novel RE-Sample and ...downloads.hindawi.com/journals/mpe/2018/5036710.pdf · Cost-Sensitive Stacked Generalization Method ... In this paper,

Mathematical Problems in Engineering 9

Table 7: Results of other methods in terms of GeoMean.

Dataset | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.965 | 0.954 | 0.953 | 0.968 | 0.959 | 0.967 | 0.968 | 0.966 | 0.964 | 0.970
pima | 0.719 | 0.697 | 0.649 | 0.736 | 0.715 | 0.739 | 0.723 | 0.731 | 0.703 | 0.741
haberman | 0.419 | 0.566 | 0.455 | 0.586 | 0.466 | 0.573 | 0.434 | 0.594 | 0.361 | 0.605
Vehicle1 | 0.677 | 0.622 | 0.652 | 0.674 | 0.607 | 0.677 | 0.673 | 0.661 | 0.561 | 0.712
Vehicle0 | 0.724 | 0.912 | 0.919 | 0.726 | 0.735 | 0.733 | 0.719 | 0.733 | 0.894 | 0.946
segment0 | 0.897 | 0.977 | 0.995 | 0.892 | 0.897 | 0.906 | 0.902 | 0.895 | 0.994 | 0.993
yeast-0-3-5-9 vs 7-8 | 0.467 | 0.463 | 0.609 | 0.464 | 0.466 | 0.550 | 0.445 | 0.485 | 0.444 | 0.603
yeast-0-2-5-6 vs 3-7-8-9 | 0.762 | 0.590 | 0.762 | 0.793 | 0.762 | 0.792 | 0.762 | 0.787 | 0.706 | 0.798
vowel0 | 0.920 | 0.969 | 1.000 | 0.913 | 0.944 | 0.929 | 0.921 | 0.914 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.853 | 0.874 | 0.818 | 0.869 | 0.863 | 0.870 | 0.845 | 0.850 | 0.846 | 0.874
Glass2 | 0.578 | 0.472 | 0.470 | 0.564 | 0.578 | 0.610 | 0.586 | 0.568 | 0.000 | 0.607
cleveland-0 vs 4 | 0.899 | 0.724 | 0.722 | 0.936 | 0.611 | 0.863 | 0.866 | 0.861 | 0.866 | 0.866
glass-0-1-6 vs 5 | 0.932 | 0.877 | 0.810 | 0.932 | 0.932 | 0.872 | 0.872 | 0.869 | 0.814 | 0.877
car-good | 0.000 | 0.000 | 0.558 | 0.794 | 0.520 | 0.783 | 0.000 | 0.778 | 0.561 | 0.813
flare-F | 0.751 | 0.000 | 0.450 | 0.781 | 0.541 | 0.647 | 0.738 | 0.781 | 0.000 | 0.774
car-vgood | 0.496 | 0.888 | 0.626 | 0.986 | 0.917 | 0.925 | 0.481 | 0.971 | 0.957 | 0.988
abalone-17 vs 7-8-9-10 | 0.628 | 0.321 | 0.506 | 0.640 | 0.319 | 0.368 | 0.627 | 0.630 | 0.293 | 0.664
Mean | 0.687 | 0.641 | 0.703 | 0.779 | 0.696 | 0.753 | 0.680 | 0.769 | 0.645 | 0.814

Table 8: Results of other methods in terms of AGeoMean.

Dataset | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log | Stacking-Cost-log
wisconsin | 0.961 | 0.960 | 0.961 | 0.962 | 0.963 | 0.962 | 0.962 | 0.960 | 0.967 | 0.967
pima | 0.768 | 0.743 | 0.706 | 0.734 | 0.767 | 0.736 | 0.775 | 0.729 | 0.770 | 0.726
haberman | 0.642 | 0.680 | 0.601 | 0.718 | 0.660 | 0.694 | 0.653 | 0.721 | 0.613 | 0.663
Vehicle1 | 0.690 | 0.721 | 0.728 | 0.658 | 0.722 | 0.669 | 0.690 | 0.654 | 0.709 | 0.740
Vehicle0 | 0.666 | 0.929 | 0.934 | 0.658 | 0.681 | 0.663 | 0.664 | 0.663 | 0.924 | 0.936
segment0 | 0.859 | 0.986 | 0.996 | 0.850 | 0.859 | 0.871 | 0.863 | 0.854 | 0.995 | 0.995
yeast-0-3-5-9 vs 7-8 | 0.715 | 0.706 | 0.782 | 0.709 | 0.713 | 0.737 | 0.703 | 0.721 | 0.700 | 0.770
yeast-0-2-5-6 vs 3-7-8-9 | 0.856 | 0.778 | 0.855 | 0.857 | 0.856 | 0.855 | 0.856 | 0.854 | 0.839 | 0.866
vowel0 | 0.929 | 0.981 | 1.000 | 0.909 | 0.966 | 0.948 | 0.931 | 0.910 | 1.000 | 1.000
led7digit-0-2-4-5-6-7-8-9 vs 1 | 0.853 | 0.874 | 0.818 | 0.869 | 0.863 | 0.870 | 0.845 | 0.850 | 0.846 | 0.874
Glass2 | 0.511 | 0.701 | 0.695 | 0.493 | 0.511 | 0.551 | 0.539 | 0.515 | 0.000 | 0.744
cleveland-0 vs 4 | 0.927 | 0.845 | 0.841 | 0.943 | 0.783 | 0.914 | 0.918 | 0.910 | 0.918 | 0.918
glass-0-1-6 vs 5 | 0.954 | 0.932 | 0.894 | 0.954 | 0.954 | 0.923 | 0.923 | 0.919 | 0.902 | 0.932
car-good | 0.000 | 0.000 | 0.763 | 0.878 | 0.747 | 0.870 | 0.000 | 0.873 | 0.769 | 0.890
flare-F | 0.840 | 0.000 | 0.705 | 0.842 | 0.751 | 0.795 | 0.836 | 0.842 | 0.000 | 0.845
car-vgood | 0.743 | 0.934 | 0.793 | 0.986 | 0.953 | 0.956 | 0.725 | 0.980 | 0.974 | 0.990
abalone-17 vs 7-8-9-10 | 0.736 | 0.655 | 0.745 | 0.716 | 0.650 | 0.672 | 0.734 | 0.711 | 0.641 | 0.804
Mean | 0.744 | 0.731 | 0.813 | 0.808 | 0.788 | 0.805 | 0.742 | 0.8039 | 0.739 | 0.862
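As a reading aid for Tables 7 and 8, the sketch below shows how GeoMean and AGeoMean are obtained from a binary confusion matrix. It is our own illustration in Python/scikit-learn (the experiments in this paper were run in WEKA [43]); GeoMean is the square root of sensitivity times specificity, and the AGeoMean line follows our reading of Batuwita and Palade [33], so it should be checked against that reference.

import numpy as np
from sklearn.metrics import confusion_matrix

def gmean_agmean(y_true, y_pred):
    # Minority class is labeled 1 (positive), majority class 0 (negative)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    se = tp / (tp + fn)                   # sensitivity: accuracy on the minority class
    sp = tn / (tn + fp)                   # specificity: accuracy on the majority class
    gm = np.sqrt(se * sp)                 # GeoMean
    nn = (tn + fp) / (tn + fp + fn + tp)  # proportion of majority (negative) examples
    agm = (gm + sp * nn) / (1 + nn) if se > 0 else 0.0  # AGeoMean, per our reading of [33]
    return gm, agm

print(gmean_agmean([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0]))

Note that a classifier which never predicts the minority class gets a GeoMean of 0 (as for Stacking-log on Glass2 and flare-F above), which is exactly why these metrics are preferred to plain accuracy on imbalanced data.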


Table 9: Comparison between RECSG and other algorithms in terms of wins/losses of the results.

Metric | NB | C4.5 | KNN | Cost-NB | Ada-NB | Ada-Cost-NB | Bag-NB | Bag-Cost-NB | Stacking-log
Stacking-Cost-log (AUC) | 11 [5] / 6 [3] | 15 [11] / 2 [2] | 17 [15] / 0 [2] | 11 [6] / 6 [5] | 13 [7] / 4 [2] | 14 [9] / 3 [1] | 10 [6] / 7 [5] | 10 [6] / 7 [5] | 11 [3] / 6 [2]
Stacking-Cost-log (GeoMean) | 15 [13] / 2 [2] | 17 [13] / 0 [0] | 15 [13] / 2 [1] | 14 [8] / 3 [2] | 16 [12] / 1 [1] | 16 [10] / 1 [0] | 17 [13] / 0 [0] | 16 [10] / 3 [0] | 16 [12] / 1 [0]
Stacking-Cost-log (AGeoMean) | 14 [10] / 3 [1] | 15 [9] / 2 [2] | 15 [12] / 2 [1] | 13 [7] / 4 [3] | 15 [11] / 2 [2] | 15 [10] / 2 [2] | 16 [10] / 1 [1] | 15 [8] / 2 [1] | 16 [10] / 1 [1]

Table 10: Holm post hoc test to show differences between Stacking-Cost-log and other algorithms.

Comparison | Rejected (H0) | Accepted (H0)
Stacking-Cost-log (AUC) versus other algorithms | 3 | 6
Stacking-Cost-log (GeoMean) versus other algorithms | 8 | 1
Stacking-Cost-log (AGeoMean) versus other algorithms | 9 | 0

According to the evaluation metric of GeoMean, the RECSG method is significantly better than 8 of the 9 other methods at the confidence level of α = 0.05.
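For completeness, the Holm step-down rule [47] summarized in Table 10 can be sketched as follows; the p-values in the example are placeholders of ours, not values from these experiments.

import numpy as np

def holm_reject(p_values, alpha=0.05):
    # Sort hypotheses by ascending p-value and compare p_(k) with alpha/(m - k + 1)
    p = np.asarray(p_values)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):  # thresholds alpha/m, alpha/(m-1), ...
            reject[idx] = True
        else:
            break                         # the first acceptance stops the procedure
    return reject

# Nine pairwise comparisons against RECSG (placeholder p-values)
print(holm_reject([0.001, 0.002, 0.004, 0.005, 0.008, 0.01, 0.02, 0.04, 0.30]))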

4.4. Discussion. According to the experiments performed on 17 different imbalanced data sets and the comparison with 9 other classification methods, the proposed RECSG method showed higher performance than the other methods in terms of AUC, GeoMean, and AGeoMean, as illustrated in Tables 7, 8, and 9. The RECSG method showed the best performance for 10 of the 17 cases in terms of AUC and for 12 of the 17 cases in terms of GeoMean and AGeoMean. The method outperformed the other 5 ensemble algorithms for 16 of the 17 data sets in terms of GeoMean and AGeoMean and for 13 of the 17 data sets in terms of AUC. GeoMean and AGeoMean discriminate among the methods better than AUC does on these 17 data sets. The means of the RECSG method in terms of GeoMean, AGeoMean, and AUC are all higher than those of the other methods. The Holm post hoc test shows that the RECSG method significantly outperforms the rest at the confidence level α = 0.05 for 9 of the 9 methods in terms of AGeoMean, for 8 of the 9 methods in terms of GeoMean, and for 3 of the 9 methods in terms of AUC.

Experimental results and statistical tests show that the RECSG approach improves the classification performance on imbalanced data sets. The reasons can be explained as follows.

Firstly, the stacking algorithm uses a learned Level 1 model to combine the results of the base classifiers, whereas bagging employs a majority vote. Bagging is only a simple decision-combination method, which requires neither cross-validation nor Level 1 learning. In this paper, stacked generalization adopts logistic regression, thus providing the simplest linear combination for pooling the Level 0 models' confidences.
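The contrast can be made concrete with a short sketch. We use scikit-learn here purely for illustration (the experiments in this paper were carried out in WEKA [43]), with a decision tree standing in for C4.5:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data, roughly 9 : 1
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Bagging: a majority vote over trees fit on bootstrap samples;
# neither cross-validation nor Level 1 learning is involved
bagging = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Stacking: cross-validated predictions of the Level 0 models train a
# Level 1 logistic regression, the linear pooling described above
stacking = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
).fit(X, y)

print(bagging.score(X, y), stacking.score(X, y))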

Secondly, the cost-sensitive algorithm can affect imbalanced data sets in two aspects. On the one hand, cost-sensitive modifications can be applied in the probabilistic estimation. On the other hand, cost-sensitive factors change the weights of instances, and the weight of the minority class is higher than that of the majority class. Therefore, the method directly changes the effective contribution of the common instances without discarding or duplicating any of the rare instances.
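A minimal sketch of this reweighting, assuming for illustration that the minority-class cost is set to the imbalance ratio (the cost matrix actually used in the paper is fixed elsewhere), looks as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)     # majority class 0, minority class 1

ir = (y == 0).sum() / (y == 1).sum()  # imbalance ratio, here 9.0

# A minority error costs IR times a majority error; the instance counts
# themselves are untouched: nothing is discarded or duplicated
clf = LogisticRegression(class_weight={0: 1.0, 1: float(ir)}).fit(X, y)
print(clf.predict_proba(X[:3]))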

Thirdly, the RECSG method introduces cost-sensitive learning into the stacking ensemble. The cost-sensitive objective function replaces the error-minimizing function, thus leading Level 1 learning to focus on the minority class.
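In a generic formulation (in the spirit of Elkan [40]; the cost-matrix notation C(predicted, actual) is ours, not the paper's), the substitution reads

\[
\min_{h}\ \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\,h(x_i) \neq y_i\,\bigr]
\;\longrightarrow\;
\min_{h}\ \frac{1}{N}\sum_{i=1}^{N} C\bigl(h(x_i),\, y_i\bigr),
\qquad C(-,+) > C(+,-),
\]

\[
\text{predict } + \quad \text{iff} \quad \hat{p}(+\mid x) \ \ge\ t^{*},
\qquad t^{*} = \frac{C(+,-)}{C(+,-) + C(-,+)} < \frac{1}{2},
\]

so the cost-minimizing Level 1 decision lowers the threshold for the positive (minority) class below 1/2, which is the bias toward the minority class described above.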

The results in Tables 7 and 8 demonstrate that the RECSG method gives a larger improvement when the performance of the base classifiers is weaker (such as on Vehicle1, Vehicle0, and car-vgood). The reason is that the alternative methods NB, C4.5, and k-NN have shortcomings: the independence assumption hampers the performance of NB on some data sets; in C4.5 tree construction, the selection of attributes affects the model performance; and the error ratio of k-NN is relatively high when data sets are imbalanced. Therefore, the logistic regression adopted at Level 1 can improve the performance when the base classifiers are weak.

The performance of the RECSG method is also generally better when the IR is low (such as on Glass2, car-good, flare-F, car-vgood, and abalone-17 vs 7-8-9-10). This behavior is probably related to the setting of the cost matrix: different data sets should use different cost matrices, but for simplicity we adopted the same cost matrix throughout this paper, which may be more suitable for low IR values.

5. Conclusions

In this paper, in order to solve the class imbalance problem, we proposed the RECSG method based on 2-layer learning models. Experimental results and statistical tests showed that the RECSG approach improved the classification performance. The proposed RECSG approach may have relatively high computational complexity in the training stage, because it involves 2-layer classifier models which consist of several base classifiers and a metaclassifier. The number and kinds of the base-level classifiers are closely related to the performance of the stacking algorithm, and in the fusion stage of the base classifiers the selection of the metaclassifier is also important. In this paper, in order to validate the performance improvement over other current classification algorithms, we only randomly selected 3 classification algorithms (NB, k-NN, and C4.5) as base classifiers and a cost-sensitive algorithm as the metaclassifier in the fusion stage. The selection of the number or kind of base classifiers and metaclassifier was not discussed. Therefore, we will explore the diversity and quality of the base classifiers in future work. The adoption of these strategies would improve the prediction performance and reduce the training time of the stacking algorithm for imbalance problems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the Natural Science Foundation of Shanxi Province under Grant no. 2015011039.

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179–192, 2016.

[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.

[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.

[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[5] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965–969, IEEE, Hong Kong, December 2006.

[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.

[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, San Francisco, Calif, USA, August 2001.

[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.

[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.

[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303–312, 2007.

[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.

[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases (PKDD 2003), vol. 2838 of Lecture Notes in Computer Science, pp. 107–119, 2003.

[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering (WCSE 2009), pp. 13–17, China, October 2009.

[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.

[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 324–331, USA, April 2009.

[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.

[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," in Lecture Notes in Computer Science, vol. 6086, pp. 148–157, 2010.

[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, San Francisco, CA, USA, 1999.

[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983–990, Stanford, CA, USA, 2000.

[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257–264, San Jose, CA, USA, 2001.

[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.

[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.

[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, 2014.

[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation and Automation (ICCIA 2016), pp. 425–429, Iran, January 2016.

[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pp. 4836–4841, Hungary, 2017.

[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics (PCI 2008), pp. 217–221, Greece, August 2008.

[30] H.-Y. Lo, J.-C. Wang, H.-M. Wang, and S.-D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 2308–2311, Czech Republic, May 2011.

[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.

[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179–186, 1997.

[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.

[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49–64, 1996.

[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641–1650, 1996.

[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147–149, 1990.

[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.

[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, August 2001.

[41] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.

[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, pp. 271–289, 1999.

[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.

[45] J. Alcalá-Fdez, A. Fernández, J. Luengo, et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.

[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959–977, 2009.

[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 10: Classifying Imbalanced Data Sets by a Novel RE-Sample and ...downloads.hindawi.com/journals/mpe/2018/5036710.pdf · Cost-Sensitive Stacked Generalization Method ... In this paper,

10 Mathematical Problems in Engineering

Table8Re

sults

ofotherm

etho

dsin

term

sofA

GeoMean

Datas

etNB

C45

KNN

Cost-N

BAd

a-NB

Ada-Cost-N

BBa

g-NB

Bag-Cost-NB

Stacking

-log

Stacking

-Cost-log

wisc

onsin

0961

0960

0961

0962

0963

0962

0962

0960

0967

096

7pima

076

80743

0706

0734

0767

0736

0775

0729

0770

0726

haberm

an064

20680

0601

0718

066

00694

0653

0721

0613

0663

Vehicle

10690

0721

0728

0658

0722

0669

0690

0654

0709

0740

Vehicle

0066

60929

0934

0658

0681

0663

066

40663

0924

093

6segm

ent0

0859

0986

0996

0850

0859

0871

0863

0854

0995

099

5yeast-0

-3-5-9

vs7-8

0715

0706

0782

0709

0713

0737

0703

0721

0700

0770

yeast-0

-2-5-6

vs3-7-8-9

0856

0778

0855

0857

0856

0855

0856

0854

0839

086

6vowel0

0929

0981

1000

0909

0966

0948

0931

0910

1000

1000

led7digit-0

-2-4-5-6-7-8-9

vs1

0853

0874

0818

0869

0863

0870

0845

0850

0846

0874

Glass2

0511

0701

0695

0493

0511

0551

0539

0515

000

00744

cleveland

-0vs

4092

70845

0841

0943

0783

0914

0918

0910

0918

0918

glass-0-1-6

vs5

095

40932

0894

0954

0954

0923

0923

0919

0902

0932

car-good

000

0000

00763

0878

0747

0870

000

00873

0769

089

0flare-F

0840

000

00705

0842

0751

0795

0836

0842

000

0084

5car-vgoo

d0743

0934

0793

0986

0953

0956

0725

0980

0974

099

0abalon

e-17

vs7-8-9-10

0736

0655

0745

0716

0650

0672

0734

0711

064

1080

4Mean

0744

0731

0813

0808

0788

0805

0742

08039

0739

086

2

Mathematical Problems in Engineering 11

Table 9 Comparison between RECSG and other algorithms in terms of winslosses of the results

NB C45 KNN Cost-NB Ada-NB Ada-Cost-NB Bag-NB Bag-Cost-

NB Stacking-log

Stacking-Cost-log(AUC)

11 [5]6 [3] 15 [11]2 [2] 17 [15]0 [2] 11 [6]6 [5] 13 [7]4 [2] 14 [9]3 [1] 10 [6]7 [5] 10 [6]7 [5] 11 [3]6 [2]

Stacking-Cost-log(GeoMean)

15 [13]2 [2] 17 [13]0 [0] 15 [13]2 [1] 14 [8]3 [2] 16 [12]1 [1] 16 [10]1 [0] 17 [13]0 [0] 16 [10]3 [0] 16 [12]1 [0]

Stacking-Cost-log(AGeoMean)

14 [10]3 [1] 15 [9]2 [2] 15 [12]2 [1] 13 [7]4 [3] 15 [11]2 [2] 15 [10]2 [2] 16 [10]1 [1] 15 [8]2 [1] 16 [10]1 [1]

Table 10 Holm post hoc test to show differences of Stacking-Cost-log (AUC) and other algorithms

Rejected (1198670) Accepted (1198670)Stacking-Cost-log (AUC) versus other algorithms 3 6Stacking-Cost-log (GeoMean) versus other algorithms 8 1Stacking-Cost-log (AGeoMean) versus other algorithms 9 0

to the evaluation metric of GeoMean the RECSG is signifi-cantly better than other 8 methods with the confidence levelof 120572 = 005

44 Discussion According to the experiments performedwith 17 different imbalance data sets and the comparisonwith other 9 classification methods the proposed RECSGmethod showed the higher performance than other methodsin terms of AUC GeoMean and AGeoMean as illustratedin Tables 7 8 and 9 The RECSG method showed the betterperformance for 10 of 17 cases in terms of AUC and for 12 of17 cases in terms of GeoMean and AGeoMean The methodoutperformed other 5 ensemble algorithms for 16 of 17 datasets in terms ofGeoMean andAGeoMean and for 13 of 17 datasets in terms of AUC GeoMean and AGeoMean are betterthan AUC in evaluation metrics in 17 data sets The means ofRECSGmethod in terms of GeoMean AGeoMean and AUCare all higher than other methods The Holm post hoc testshows that RECSGmethod significantly outperforms the restwith the confidence level 120572 = 005 in terms of AGeoMean for8 of 9methods in terms of AGeoMean and for 3 of 9methodsin terms of AUC

Experimental results and statistical tests show that theRECSG approach has improved the classification perfor-mance of imbalanced data sets The reasons can be explainedas follows

Firstly stacking algorithm uses to combine the resultof base classifier whereas bagging employs majority voteBagging is only a simple decision combinationmethodwhichrequires neither cross-validation nor Level 1 learning In thispaper stacked generalization adopts logistic regressionthus providing the simplest linear combination of pooling theLevel 0 modelsrsquo confidence

Secondly cost-sensitive algorithm can affect imbalancedata sets in two aspects Firstly cost-sensitive modificationscan be applied in the probabilistic estimationMoreover cost-sensitive factors change theweight of instance and theweight

of the minority class is higher than that of the majorityclass Therefore the method directly changes the number ofcommon instances without discarding or duplicating any ofthe rare instances

Thirdly RECSG method introduces cost-sensitive learn-ing into stacking ensemble The cost-sensitive function in replaces the error-minimizing function thus allowing Level 1learning to be prone to focus on the minority class

The results in Tables 7 and 8 demonstrate that the RECSGmethod has the higher performancewhen the evaluation per-formance metric of base classifier is weaker (such as Vehicle1Vehicle0 and car-vgood) The reason is that the alternativemethods of NB C45 and 119896-NN have shortcomings Theindependence assumption hampers the performance of NBin some data sets In C45 tree construction the selectionof attributes affects the model performances Error ratio of119896-NN is relatively high when data sets are in imbalanceTherefore logistic regression adopted in can improve theperformance when base classifier is weaker

The performance of the RECSG method is generallybetter when the IR is low (such as Glass2 car-good flare-Fcar-vgood and abalone-17 vs 7-8-9-10) The performance ofRECSG method is probably related to the setting of the costmatrix Different data sets should use different cost matricesFor the purpose of simplicity in the paper we have adoptedthe same cost matrices whichmay bemore suitable to low IRvalues

5 Conclusions

In this paper in order to solve the class imbalance problemwe proposed the RECSG method based on 2-layer learningmodels Experimental results and statistical tests showedthat the RECSG approach improved the classification per-formance The proposed RECSG approach may have therelatively high computational complexity in the trainingstage because the approach involves 2-layer classifier models

12 Mathematical Problems in Engineering

which consist of several base classifiers and metaclassifiersThe number and kinds of the base-level classifiers areclosely related to the performance of stacking algorithmIn the fusion stage of base classifiers the selection of themetaclassifier is also important In this paper in order tovalidate the performance improvement compared with othercurrent classification algorithms we only randomly selected3 classification algorithms (NB 119896-NN and C45) as baseclassifier and cost-sensitive algorithm as metaclassifier inthe fuse stage Selection of the number or kind of the baseclassifiers and metaclassifier was not discussed Thereforewe will explore the diversity and quality of base classifier inthe future The adoption of these strategies would improvethe prediction performance and reduce the training time ofstacking algorithm for imbalance problems

Conflicts of Interest

The author declares that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

This research is supported by the Natural Science Foundationof Shanxi Province under Grant no 2015011039

References

[1] O Loyola-Gonzalez J F Martınez-Trinidad J A Carrasco-Ochoa and M Garcıa-Borroto ldquoEffect of class imbalance onquality measures for contrast patterns an experimental studyrdquoInformation Sciences vol 374 pp 179ndash192 2016

[2] C Beyan and R Fisher ldquoClassifying imbalanced data setsusing similarity based hierarchical decompositionrdquo PatternRecognition vol 48 no 5 pp 1653ndash1672 2015

[3] H He and E A Garcia ldquoLearning from imbalanced datardquo IEEETransactions on Knowledge and Data Engineering vol 21 no 9pp 1263ndash1284 2009

[4] V C Chawla K W Bowyer L O Hall and W P KegelmeyerldquoSMOTE synthetic minority over-sampling techniquerdquo Journalof Artificial Intelligence Research vol 16 pp 321ndash357 2002

[5] X-Y Liu J X Wu and Z-H Zhou ldquoExploratory under-sampling for class-imbalance learningrdquo in Proceedings of the 6thInternational Conference on Data Mining (ICDM rsquo06) pp 965ndash969 IEEE Hong Kong December 2006

[6] N Japkowicz ldquoThe class imbalance problem significance andstrategiesrdquo in Proceedings of the International Conference onArtificial Intelligence 2000

[7] B Zadrozny and C Elkan ldquoLearning and making decisionswhen costs and probabilities are both unknownrdquo in Proceedingsof the the seventh ACM SIGKDD international conference pp204ndash213 San Francisco Calif USA August 2001

[8] M Galar A Fernandez E Barrenechea H Bustince andF Herrera ldquoA review on ensembles for the class imbalanceproblem bagging boosting and hybrid-based approachesrdquoIEEE Transactions on Systems Man and Cybernetics Part CApplications and Reviews vol 42 no 4 pp 463ndash484 2012

[9] N V Chawla D A Cieslak L O Hall and A Joshi ldquoAuto-matically countering imbalance and its empirical relationship

to costrdquo Data Mining and Knowledge Discovery vol 17 no 2pp 225ndash252 2008

[10] A Freitas A Costapereira and P Brazdil ldquoCost-SensitiveDecision Trees Applied to Medical Datardquo in InternationalConference on Data Warehousing and Knowledge Discovery pp303ndash312 2007

[11] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[12] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[13] Y Freund and R E Schapire ldquoA decision-theoretic generaliza-tion of on-line learning and an application to boostingrdquo Journalof Computer and System Sciences vol 55 no 1 part 2 pp 119ndash139 1997

[14] N V Chawla A Lazarevic L O Hall and K W BowyerldquoSMOTEBoost improving prediction of the minority class inboostingrdquo in Proceeding of Knowledge Discovery in Databasesvol 2838 pp 107ndash119

[15] C Seiffert T M Khoshgoftaar J VanHulse and A NapolitanoldquoRUSBoost A hybrid approach to alleviating class imbalancerdquoIEEE Transactions on Systems Man and Cybernetics Systemsvol 40 no 1 pp 185ndash197 2010

[16] S Hu Y Liang L Ma and Y He ldquoMSMOTE Improvingclassification performance when training data is imbalancedrdquoin Proceedings of the 2nd International Workshop on ComputerScience and Engineering WCSE 2009 pp 13ndash17 China October2009

[17] H Guo and H L Viktor ldquoLearning from imbalanced data setswith boosting and data generationrdquoACMSIGKDDExplorationsNewsletter vol 6 no 1 pp 30ndash39 2004

[18] SWang and X Yao ldquoDiversity analysis on imbalanced data setsby using ensemble modelsrdquo in Proceedings of the 2009 IEEESymposium on Computational Intelligence and Data MiningCIDM 2009 pp 324ndash331 USA April 2009

[19] R Barandela J S Srsquoanchez and R M Valdovinos ldquoNewapplications of ensembles of classifiersrdquo PAA Pattern Analysisand Applications vol 6 no 3 pp 245ndash256 2003

[20] J Błaszczynski M Deckert J Stefanowski and S WilkldquoIntegrating selective pre-processing of imbalanced data withIvotes ensemblerdquo Lecture Notes in Computer Science (includingsubseries Lecture Notes in Artificial Intelligence and Lecture Notesin Bioinformatics) Preface vol 6086 pp 148ndash157 2010

[21] W Fan S J Stolfo J Zhang and P K Chan ldquoAdacostMisclassication cost-sensitive boostingrdquo in the 6th Int ConfMach Learning pp 97ndash105 San Francisco CA USA 1999

[22] K M Ting ldquoA comparative study of cost-sensitive boostingalgorithmsrdquo in Proc 17th Int Conf Mach Learning pp 983ndash990 Stanford CA USA 2000

[23] M Joshi V Kumar and R Agarwal ldquoEvaluating boosting algo-rithms to classify rare classes comparison and improvementsrdquoinProceedings of the 2001 IEEE International Conference onDataMining pp 257ndash264 San Jose CA USA

[24] Y Sun M S Kamel A K C Wong and Y Wang ldquoCost-sensitive boosting for classification of imbalanced datardquo PatternRecognition vol 40 no 12 pp 3358ndash3378 2007

[25] D H Wolpert ldquoStacked generalizationrdquo Neural Networks vol5 no 2 pp 241ndash259 1992

[26] Y Chen M L Wong and H Li ldquoApplying Ant Colony Opti-mization to configuring stacking ensembles for data miningrdquoExpert Systems with Applications vol 41 no 6 pp 2688ndash27022014

Mathematical Problems in Engineering 13

[27] H Kadkhodaei and A M E Moghadam ldquoAn entropy basedapproach to find the best combination of the base classi-fiers in ensemble classifiers based on stack generalizationrdquoin Proceedings of the 4th International Conference on ControlInstrumentation and Automation ICCIA 2016 pp 425ndash429Iran January 2016

[28] I Czarnowski and P Jędrzejowicz ldquoAn approach to machineclassification based on stacked generalization and instanceselectionrdquo in Proceedings of the IEEE International Conferenceon Systems Man and Cybernetics SMC 2016 pp 4836ndash4841Hungary 2017

[29] S Kotsiantis ldquoStacking cost sensitive modelsrdquo in Proceedings ofthe 12th Pan-Hellenic Conference on Informatics PCI 2008 pp217ndash221 Greece August 2008

[30] H Y Lo J C Wang H MWang and S D Lin ldquoCost-sensitivestacking for audio tag annotation and retrievalrdquo in Proceedingsof the 36th IEEE International Conference on Acoustics Speechand Signal Processing ICASSP 2011 pp 2308ndash2311 CzechRepublic May 2011

[31] JHuang andCX Ling ldquoUsingAUCand accuracy in evaluatinglearning algorithmsrdquo IEEETransactions on Knowledge andDataEngineering vol 17 no 3 pp 299ndash310 2005

[32] M Kubat and S Matwin ldquoAddressing the curse of imbalancedtraining sets one sided selection inrdquo in Proceedings of the 14thInternational Conference on Machine Learning (ICMLrsquo97 pp179ndash186 1997

[33] R Batuwita and V Palade ldquoAdjusted geometric-mean A novelperformance measure for imbalanced bioinformatics datasetslearningrdquo Journal of Bioinformatics and Computational Biologyvol 10 no 4 Article ID 1250003 2012

[34] L Breiman ldquoStacked regressionsrdquoMachine Learning vol 24 no1 pp 49ndash64 1996

[35] M Leblanc and R Tibshirani ldquoCombining estimates in regres-sion and classificationrdquo Journal of the American StatisticalAssociation vol 91 no 436 pp 1641ndash1650 1996

[36] B Cestnik ldquoEstimating Probabilities A Crucial Task inMachine Learningrdquo in Proceedings of the European Conferenceon Artificial Intelligence pp 147ndash149 1990

[37] J R Quinlan C45 Program for machine learning MorganKaufmann 1993

[38] N S Altman ldquoAn introduction to kernel and nearest-neighbornonparametric regressionrdquo The American Statistician vol 46no 3 pp 175ndash185 1992

[39] J R Quinlan ldquoInduction of decision treesrdquo Machine Learningvol 1 no 1 pp 81ndash106 1986

[40] C Elkan ldquoThe foundations of cost-sensitive learningrdquo in Pro-ceedings of the 17th International Joint Conference on ArtificialIntelligence (IJCAI rsquo01) pp 973ndash978 New York NY USAAugust 2001

[41] D Michie D Spiegel halter J C Taylor et alMachine learningneural and statistical classification Ellis Horwood neural andstatistical classification 1995

[42] K M Ting and I F Witten ldquoIssues in stacked generalizationrdquoArtif Intell Res vol 10 no 1 pp 271ndash289 1999

[43] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[44] A Frank and A Asuncion UCI Machine Learning RepositoryUniversity of California School of Information and ComputerScience Irvine CA USA httparchiveicsucieduml

[45] J Alcala-Fdez A Fernandez J Luengo et al ldquoKEEL data-mining software tool data set repository integration ofalgorithms and experimental analysis frameworkrdquo Journal ofMultiple-Valued Logic and Soft Computing vol 17 no 2-3 pp255ndash287 2011

[46] S Garcıa A Fernandez J Luengo and F Herrera ldquoA study ofstatistical techniques and performance measures for genetics-based machine learning accuracy and interpretabilityrdquo SoftComputing vol 13 no 10 pp 959ndash977 2009

[47] S Holm ldquoA simple sequentially rejective multiple test proce-durerdquo Scandinavian Journal of Statistics vol 6 no 2 pp 65ndash701979

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 11: Classifying Imbalanced Data Sets by a Novel RE-Sample and ...downloads.hindawi.com/journals/mpe/2018/5036710.pdf · Cost-Sensitive Stacked Generalization Method ... In this paper,

Mathematical Problems in Engineering 11

Table 9 Comparison between RECSG and other algorithms in terms of winslosses of the results

NB C45 KNN Cost-NB Ada-NB Ada-Cost-NB Bag-NB Bag-Cost-

NB Stacking-log

Stacking-Cost-log(AUC)

11 [5]6 [3] 15 [11]2 [2] 17 [15]0 [2] 11 [6]6 [5] 13 [7]4 [2] 14 [9]3 [1] 10 [6]7 [5] 10 [6]7 [5] 11 [3]6 [2]

Stacking-Cost-log(GeoMean)

15 [13]2 [2] 17 [13]0 [0] 15 [13]2 [1] 14 [8]3 [2] 16 [12]1 [1] 16 [10]1 [0] 17 [13]0 [0] 16 [10]3 [0] 16 [12]1 [0]

Stacking-Cost-log(AGeoMean)

14 [10]3 [1] 15 [9]2 [2] 15 [12]2 [1] 13 [7]4 [3] 15 [11]2 [2] 15 [10]2 [2] 16 [10]1 [1] 15 [8]2 [1] 16 [10]1 [1]

Table 10 Holm post hoc test to show differences of Stacking-Cost-log (AUC) and other algorithms

Rejected (1198670) Accepted (1198670)Stacking-Cost-log (AUC) versus other algorithms 3 6Stacking-Cost-log (GeoMean) versus other algorithms 8 1Stacking-Cost-log (AGeoMean) versus other algorithms 9 0

to the evaluation metric of GeoMean the RECSG is signifi-cantly better than other 8 methods with the confidence levelof 120572 = 005

44 Discussion According to the experiments performedwith 17 different imbalance data sets and the comparisonwith other 9 classification methods the proposed RECSGmethod showed the higher performance than other methodsin terms of AUC GeoMean and AGeoMean as illustratedin Tables 7 8 and 9 The RECSG method showed the betterperformance for 10 of 17 cases in terms of AUC and for 12 of17 cases in terms of GeoMean and AGeoMean The methodoutperformed other 5 ensemble algorithms for 16 of 17 datasets in terms ofGeoMean andAGeoMean and for 13 of 17 datasets in terms of AUC GeoMean and AGeoMean are betterthan AUC in evaluation metrics in 17 data sets The means ofRECSGmethod in terms of GeoMean AGeoMean and AUCare all higher than other methods The Holm post hoc testshows that RECSGmethod significantly outperforms the restwith the confidence level 120572 = 005 in terms of AGeoMean for8 of 9methods in terms of AGeoMean and for 3 of 9methodsin terms of AUC

Experimental results and statistical tests show that theRECSG approach has improved the classification perfor-mance of imbalanced data sets The reasons can be explainedas follows

Firstly stacking algorithm uses to combine the resultof base classifier whereas bagging employs majority voteBagging is only a simple decision combinationmethodwhichrequires neither cross-validation nor Level 1 learning In thispaper stacked generalization adopts logistic regressionthus providing the simplest linear combination of pooling theLevel 0 modelsrsquo confidence

Secondly cost-sensitive algorithm can affect imbalancedata sets in two aspects Firstly cost-sensitive modificationscan be applied in the probabilistic estimationMoreover cost-sensitive factors change theweight of instance and theweight

of the minority class is higher than that of the majorityclass Therefore the method directly changes the number ofcommon instances without discarding or duplicating any ofthe rare instances

Thirdly RECSG method introduces cost-sensitive learn-ing into stacking ensemble The cost-sensitive function in replaces the error-minimizing function thus allowing Level 1learning to be prone to focus on the minority class

The results in Tables 7 and 8 demonstrate that the RECSGmethod has the higher performancewhen the evaluation per-formance metric of base classifier is weaker (such as Vehicle1Vehicle0 and car-vgood) The reason is that the alternativemethods of NB C45 and 119896-NN have shortcomings Theindependence assumption hampers the performance of NBin some data sets In C45 tree construction the selectionof attributes affects the model performances Error ratio of119896-NN is relatively high when data sets are in imbalanceTherefore logistic regression adopted in can improve theperformance when base classifier is weaker

The performance of the RECSG method is generallybetter when the IR is low (such as Glass2 car-good flare-Fcar-vgood and abalone-17 vs 7-8-9-10) The performance ofRECSG method is probably related to the setting of the costmatrix Different data sets should use different cost matricesFor the purpose of simplicity in the paper we have adoptedthe same cost matrices whichmay bemore suitable to low IRvalues

5 Conclusions

In this paper in order to solve the class imbalance problemwe proposed the RECSG method based on 2-layer learningmodels Experimental results and statistical tests showedthat the RECSG approach improved the classification per-formance The proposed RECSG approach may have therelatively high computational complexity in the trainingstage because the approach involves 2-layer classifier models

12 Mathematical Problems in Engineering

which consist of several base classifiers and metaclassifiersThe number and kinds of the base-level classifiers areclosely related to the performance of stacking algorithmIn the fusion stage of base classifiers the selection of themetaclassifier is also important In this paper in order tovalidate the performance improvement compared with othercurrent classification algorithms we only randomly selected3 classification algorithms (NB 119896-NN and C45) as baseclassifier and cost-sensitive algorithm as metaclassifier inthe fuse stage Selection of the number or kind of the baseclassifiers and metaclassifier was not discussed Thereforewe will explore the diversity and quality of base classifier inthe future The adoption of these strategies would improvethe prediction performance and reduce the training time ofstacking algorithm for imbalance problems

Conflicts of Interest

The author declares that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

This research is supported by the Natural Science Foundationof Shanxi Province under Grant no 2015011039

References

[1] O Loyola-Gonzalez J F Martınez-Trinidad J A Carrasco-Ochoa and M Garcıa-Borroto ldquoEffect of class imbalance onquality measures for contrast patterns an experimental studyrdquoInformation Sciences vol 374 pp 179ndash192 2016

[2] C Beyan and R Fisher ldquoClassifying imbalanced data setsusing similarity based hierarchical decompositionrdquo PatternRecognition vol 48 no 5 pp 1653ndash1672 2015

[3] H He and E A Garcia ldquoLearning from imbalanced datardquo IEEETransactions on Knowledge and Data Engineering vol 21 no 9pp 1263ndash1284 2009

[4] V C Chawla K W Bowyer L O Hall and W P KegelmeyerldquoSMOTE synthetic minority over-sampling techniquerdquo Journalof Artificial Intelligence Research vol 16 pp 321ndash357 2002

[5] X-Y Liu J X Wu and Z-H Zhou ldquoExploratory under-sampling for class-imbalance learningrdquo in Proceedings of the 6thInternational Conference on Data Mining (ICDM rsquo06) pp 965ndash969 IEEE Hong Kong December 2006

[6] N Japkowicz ldquoThe class imbalance problem significance andstrategiesrdquo in Proceedings of the International Conference onArtificial Intelligence 2000

[7] B Zadrozny and C Elkan ldquoLearning and making decisionswhen costs and probabilities are both unknownrdquo in Proceedingsof the the seventh ACM SIGKDD international conference pp204ndash213 San Francisco Calif USA August 2001

[8] M Galar A Fernandez E Barrenechea H Bustince andF Herrera ldquoA review on ensembles for the class imbalanceproblem bagging boosting and hybrid-based approachesrdquoIEEE Transactions on Systems Man and Cybernetics Part CApplications and Reviews vol 42 no 4 pp 463ndash484 2012

[9] N V Chawla D A Cieslak L O Hall and A Joshi ldquoAuto-matically countering imbalance and its empirical relationship

to costrdquo Data Mining and Knowledge Discovery vol 17 no 2pp 225ndash252 2008

[10] A Freitas A Costapereira and P Brazdil ldquoCost-SensitiveDecision Trees Applied to Medical Datardquo in InternationalConference on Data Warehousing and Knowledge Discovery pp303ndash312 2007

[11] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[12] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[13] Y Freund and R E Schapire ldquoA decision-theoretic generaliza-tion of on-line learning and an application to boostingrdquo Journalof Computer and System Sciences vol 55 no 1 part 2 pp 119ndash139 1997

[14] N V Chawla A Lazarevic L O Hall and K W BowyerldquoSMOTEBoost improving prediction of the minority class inboostingrdquo in Proceeding of Knowledge Discovery in Databasesvol 2838 pp 107ndash119

[15] C Seiffert T M Khoshgoftaar J VanHulse and A NapolitanoldquoRUSBoost A hybrid approach to alleviating class imbalancerdquoIEEE Transactions on Systems Man and Cybernetics Systemsvol 40 no 1 pp 185ndash197 2010

[16] S Hu Y Liang L Ma and Y He ldquoMSMOTE Improvingclassification performance when training data is imbalancedrdquoin Proceedings of the 2nd International Workshop on ComputerScience and Engineering WCSE 2009 pp 13ndash17 China October2009

[17] H Guo and H L Viktor ldquoLearning from imbalanced data setswith boosting and data generationrdquoACMSIGKDDExplorationsNewsletter vol 6 no 1 pp 30ndash39 2004

[18] SWang and X Yao ldquoDiversity analysis on imbalanced data setsby using ensemble modelsrdquo in Proceedings of the 2009 IEEESymposium on Computational Intelligence and Data MiningCIDM 2009 pp 324ndash331 USA April 2009

[19] R Barandela J S Srsquoanchez and R M Valdovinos ldquoNewapplications of ensembles of classifiersrdquo PAA Pattern Analysisand Applications vol 6 no 3 pp 245ndash256 2003

[20] J Błaszczynski M Deckert J Stefanowski and S WilkldquoIntegrating selective pre-processing of imbalanced data withIvotes ensemblerdquo Lecture Notes in Computer Science (includingsubseries Lecture Notes in Artificial Intelligence and Lecture Notesin Bioinformatics) Preface vol 6086 pp 148ndash157 2010

[21] W Fan S J Stolfo J Zhang and P K Chan ldquoAdacostMisclassication cost-sensitive boostingrdquo in the 6th Int ConfMach Learning pp 97ndash105 San Francisco CA USA 1999

[22] K M Ting ldquoA comparative study of cost-sensitive boostingalgorithmsrdquo in Proc 17th Int Conf Mach Learning pp 983ndash990 Stanford CA USA 2000

[23] M Joshi V Kumar and R Agarwal ldquoEvaluating boosting algo-rithms to classify rare classes comparison and improvementsrdquoinProceedings of the 2001 IEEE International Conference onDataMining pp 257ndash264 San Jose CA USA

[24] Y Sun M S Kamel A K C Wong and Y Wang ldquoCost-sensitive boosting for classification of imbalanced datardquo PatternRecognition vol 40 no 12 pp 3358ndash3378 2007

[25] D H Wolpert ldquoStacked generalizationrdquo Neural Networks vol5 no 2 pp 241ndash259 1992

[26] Y Chen M L Wong and H Li ldquoApplying Ant Colony Opti-mization to configuring stacking ensembles for data miningrdquoExpert Systems with Applications vol 41 no 6 pp 2688ndash27022014

Mathematical Problems in Engineering 13

[27] H Kadkhodaei and A M E Moghadam ldquoAn entropy basedapproach to find the best combination of the base classi-fiers in ensemble classifiers based on stack generalizationrdquoin Proceedings of the 4th International Conference on ControlInstrumentation and Automation ICCIA 2016 pp 425ndash429Iran January 2016

[28] I Czarnowski and P Jędrzejowicz ldquoAn approach to machineclassification based on stacked generalization and instanceselectionrdquo in Proceedings of the IEEE International Conferenceon Systems Man and Cybernetics SMC 2016 pp 4836ndash4841Hungary 2017

[29] S Kotsiantis ldquoStacking cost sensitive modelsrdquo in Proceedings ofthe 12th Pan-Hellenic Conference on Informatics PCI 2008 pp217ndash221 Greece August 2008


which consist of several base classifiers and a metaclassifier. The number and kinds of the base-level classifiers are closely related to the performance of the stacking algorithm, and the selection of the metaclassifier in the fusion stage is equally important. In this paper, in order to validate the performance improvement over other current classification algorithms, we randomly selected only three classification algorithms (NB, k-NN, and C4.5) as base classifiers and used a cost-sensitive algorithm as the metaclassifier in the fusion stage. The selection of the number and kind of the base classifiers and of the metaclassifier was not discussed; we will therefore explore the diversity and quality of base classifiers in future work. Adopting these strategies should improve the prediction performance and reduce the training time of the stacking algorithm for imbalance problems. A minimal sketch of this two-level architecture is given below.
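To make the two-level architecture concrete, the following is a minimal sketch in Python, assuming scikit-learn and imbalanced-learn rather than the WEKA implementation used in the experiments; the synthetic data set, the SMOTE resampling, and the class-weight dictionary standing in for the cost matrix are illustrative assumptions, not the paper's tuned settings.

```python
# Sketch of the RECSG idea: RE-sample the training data, stack
# NB / k-NN / decision-tree base learners (Level 0), and fuse them
# with a cost-sensitive meta-level (Level 1). Parameters are illustrative.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for one of the benchmark sets.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Level 0 preprocessing: oversample the minority class (RE-sample step).
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Level 0 models: the three randomly selected base classifiers.
base = [("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier())]  # stand-in for C4.5

# Level 1 model: logistic regression with a class-weight dictionary as a
# simple proxy for the cost matrix (the 1:10 ratio is an assumption).
meta = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)

clf = StackingClassifier(estimators=base, final_estimator=meta,
                         stack_method="predict_proba", cv=5)
clf.fit(X_res, y_res)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

In practice the class-weight dictionary would be set per data set, for example, in proportion to its imbalance ratio, which mirrors how the cost matrix is tied to the data characteristics in the proposed method.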

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the Natural Science Foundation of Shanxi Province under Grant no. 2015011039.

References

[1] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, "Effect of class imbalance on quality measures for contrast patterns: an experimental study," Information Sciences, vol. 374, pp. 179–192, 2016.
[2] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognition, vol. 48, no. 5, pp. 1653–1672, 2015.
[3] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[5] X.-Y. Liu, J. X. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 965–969, IEEE, Hong Kong, December 2006.
[6] N. Japkowicz, "The class imbalance problem: significance and strategies," in Proceedings of the International Conference on Artificial Intelligence, 2000.
[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, San Francisco, Calif, USA, August 2001.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 225–252, 2008.
[10] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in International Conference on Data Warehousing and Knowledge Discovery, pp. 303–312, 2007.
[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[12] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[13] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.
[14] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proceedings of Knowledge Discovery in Databases, vol. 2838, pp. 107–119.
[15] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: a hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[16] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: improving classification performance when training data is imbalanced," in Proceedings of the 2nd International Workshop on Computer Science and Engineering (WCSE 2009), pp. 13–17, China, October 2009.
[17] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.
[18] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 324–331, USA, April 2009.
[19] R. Barandela, J. S. Sánchez, and R. M. Valdovinos, "New applications of ensembles of classifiers," Pattern Analysis and Applications, vol. 6, no. 3, pp. 245–256, 2003.
[20] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," Lecture Notes in Computer Science, vol. 6086, pp. 148–157, 2010.
[21] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting," in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, San Francisco, CA, USA, 1999.
[22] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, pp. 983–990, Stanford, CA, USA, 2000.
[23] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: comparison and improvements," in Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 257–264, San Jose, CA, USA.
[24] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.
[25] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[26] Y. Chen, M. L. Wong, and H. Li, "Applying Ant Colony Optimization to configuring stacking ensembles for data mining," Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, 2014.
[27] H. Kadkhodaei and A. M. E. Moghadam, "An entropy based approach to find the best combination of the base classifiers in ensemble classifiers based on stack generalization," in Proceedings of the 4th International Conference on Control, Instrumentation, and Automation (ICCIA 2016), pp. 425–429, Iran, January 2016.
[28] I. Czarnowski and P. Jędrzejowicz, "An approach to machine classification based on stacked generalization and instance selection," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pp. 4836–4841, Hungary, 2017.
[29] S. Kotsiantis, "Stacking cost sensitive models," in Proceedings of the 12th Pan-Hellenic Conference on Informatics (PCI 2008), pp. 217–221, Greece, August 2008.
[30] H.-Y. Lo, J.-C. Wang, H.-M. Wang, and S.-D. Lin, "Cost-sensitive stacking for audio tag annotation and retrieval," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 2308–2311, Czech Republic, May 2011.
[31] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[32] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 179–186, 1997.
[33] R. Batuwita and V. Palade, "Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning," Journal of Bioinformatics and Computational Biology, vol. 10, no. 4, Article ID 1250003, 2012.
[34] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49–64, 1996.
[35] M. LeBlanc and R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, vol. 91, no. 436, pp. 1641–1650, 1996.
[36] B. Cestnik, "Estimating probabilities: a crucial task in machine learning," in Proceedings of the European Conference on Artificial Intelligence, pp. 147–149, 1990.
[37] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[38] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[40] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, August 2001.
[41] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1995.
[42] K. M. Ting and I. H. Witten, "Issues in stacked generalization," Journal of Artificial Intelligence Research, vol. 10, pp. 271–289, 1999.
[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[44] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, http://archive.ics.uci.edu/ml.
[45] J. Alcalá-Fdez, A. Fernández, J. Luengo et al., "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
[46] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, no. 10, pp. 959–977, 2009.
[47] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.
