
A Combo Feature Selection Method (Filter + Wrapper) for Microarray Gene Classification

Bibhuprasad Sahu¹

¹Research Scholar, Department of CS&IT, North Orissa University, Odisha, India.

[email protected]

January 12, 2018

Abstract

Cancer is among the most dangerous diseases, and its diagnosis from microarray gene expression datasets is one of the most challenging and promising clinical applications. Gene expression datasets play a major role in the detection and analysis of cancer. Despite advances in the technology and its many applications, cancer classification using microarray gene datasets remains an intricate problem. Microarray datasets available in different repositories are high-dimensional and noisy. Feature selection serves as a preprocessing step before classification; it increases classification accuracy and reduces computational cost. In this research, feature selection is performed with the information gain (IG) technique as the filter approach and improved binary particle swarm optimization (IBPSO) as the wrapper approach. The experimental results indicate that better classification accuracy can be obtained when a small number of genes is selected by a good selection approach.

Key Words: Filter, Wrapper, PSO, Information Gain


International Journal of Pure and Applied Mathematics, Volume 118, No. 16, 2018, 389-401. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.


1 Introduction

Microarray technology allows scientists to observe and measure the expression of a very large number of genes simultaneously. Nowadays it is used for gene analysis and for medical diagnosis, including detecting the stage of a disease. Clustering techniques are used to group related genes and to relate gene expression patterns to one another. The primary aim of researchers is to build a model from the training data that identifies the relevant feature genes accurately [9].

General classification techniques are typically difficult to apply because the data have a very high dimension while the number of samples is small. Of the many genes measured, only a few are really responsible for cancer, so feature selection plays a major role in classification. Two important families of feature selection methods, filter and wrapper, are used to reduce large datasets. In the filter approach, a weight is calculated for each gene and the genes with higher weights are retained. In the wrapper approach, an evaluation function built on a learning algorithm is used to appraise candidate feature subsets drawn from the large dataset. The combo feature selection approach follows the same idea: the filter first reduces the feature space, and an optimization algorithm then searches the reduced space for a good subset. In contrast to the filter method, the wrapper method performs a subset search driven by an optimization algorithm, after which classification algorithms are used to evaluate the feature subsets.

Particle swarm optimization (PSO), a well-known and widely used population-based stochastic optimization technique, is used here [3]. PSO behaves much like a genetic algorithm (GA) in that the system is initialized with a population of randomly generated solutions. It differs from GA in that each candidate solution, called a particle, is assigned a random velocity and moves through the search space. PSO tracks two values, pbest and gbest. pbest is the best position a particle has found so far, and gbest is the best position found so far by any particle in the swarm, as tracked by the global version of the PSO optimizer. Binary PSO (BPSO) is useful for solving discrete problems [4]. Here each component of a particle takes the binary value 0 or 1.


2 Related Works

2.1 Binary particle swarm optimization:

Kennedy and Eberhart proposed a discrete binary version of PSO [3] for binary problems. In their model, each particle makes binary decisions such as "yes" or "no", "true" or "false", "include" or "not include" [4].

In binary PSO (BPSO), a particle's pbest and gbest are updated as in the real-valued version. The difference from the real-valued version is that a velocity is interpreted as the probability that a bit changes to one [3]. Under this definition a velocity must lie in the range [0, 1]. To map every real-valued velocity into this range, we use a sigmoid function, so that the result can be treated as a probability.
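The sigmoid mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's MATLAB code: the function names `sigmoid` and `update_bits`, the example velocities, and the fixed random seed are all assumptions made for the example.

```python
import math
import random

def sigmoid(v):
    """Map a real-valued velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def update_bits(velocity, rng):
    """Sample a binary position: bit d becomes 1 with probability sigmoid(v_d)."""
    return [1 if rng.random() < sigmoid(v) else 0 for v in velocity]

rng = random.Random(0)             # fixed seed, only so the sketch is repeatable
velocity = [-6.0, 0.0, 6.0]        # hypothetical velocities for three genes
position = update_bits(velocity, rng)
```

A strongly negative velocity makes a gene very unlikely to be selected, a strongly positive one makes it very likely, and a velocity of zero gives an even chance.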

Two basic issues of binary PSO (BPSO) are discussed:

I. Parameters of the binary PSO

II. Memory of the binary PSO

2.2 Real-valued PSO:

A PSO system is initialized with a population of random solutions. Over successive generations, this population searches for an optimal solution. Each potential solution in PSO is called a particle. In a d-dimensional search space, every particle uses its own memory together with the information gained by the swarm as a whole to locate the best (optimal) solution. Each particle has its own position and velocity, which together govern its movement. The i-th particle is represented as x_i = (x_i1, x_i2, ..., x_id), where d is the number of dimensions. The velocity of the i-th particle is represented as v_i = (v_i1, v_i2, ..., v_id) and is bounded by V_max, which is specified by the user. The best position previously visited by the i-th particle (the one with the highest fitness) is called pBest_i and is represented as p_i = (p_i1, p_i2, ..., p_id). The global best position (gBest) of the whole population is represented as g = (g_1, g_2, ..., g_d) [1]. At every iteration, the particles are updated by the following equations:

$$v_{id}^{new} = w \times v_{id}^{old} + c_1 \times rand_1 \times (pbest_{id} - x_{id}^{old}) + c_2 \times rand_2 \times (gbest_{d} - x_{id}^{old})$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new}$$

The inertia weight w represents the impact of the previous velocity on the new velocity, rand1 and rand2 are random numbers, and c1 and c2 are positive constants. The V_max limit prevents a particle from moving too rapidly from one region of the search space to another, so its value can be initialized as a function of the range of the given problem. In the star topology, every particle is connected to every other particle, forming a fully connected social network. Each particle moves toward the best particle, that is, the best solution found so far by any member of the swarm, so every particle imitates the overall best particle. After each move, pBest is updated whenever a particle reaches a better position, and the best solution of the swarm is updated accordingly.
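As an illustration of the update rule for a single particle, the following Python sketch applies the velocity and position equations with velocity clamping. The function name `pso_step` and the value `v_max=4.0` are assumptions for the example; the defaults for w, c1 and c2 follow the experimental settings reported in Section 3.3.

```python
import random

def pso_step(x, v, pbest, gbest, w=1.0, c1=2.0, c2=2.0, v_max=4.0, rng=random):
    """One velocity and position update for a single particle."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()   # rand1 and rand2, drawn from [0, 1)
        vd = (w * v[d]
              + c1 * r1 * (pbest[d] - x[d])   # pull toward the particle's own best
              + c2 * r2 * (gbest[d] - x[d]))  # pull toward the swarm's best
        vd = max(-v_max, min(v_max, vd))      # clamp to [-v_max, v_max] (assumed limit)
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v

# A particle already sitting on both bests keeps moving only by inertia.
x_next, v_next = pso_step([1.0, 2.0], [0.5, -0.5], [1.0, 2.0], [1.0, 2.0])
```

In the binary variant the position update is replaced by the sigmoid sampling step, while the velocity update keeps the same form.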

2.3 Improved binary particle swarm optimization (IBPSO):

In this research, we implement improved binary particle swarm optimization (IBPSO), which improves the behavior of the particles compared with standard BPSO, and we compare feature selection with IBPSO and with BPSO. When the technique is applied for several iterations, it is observed that particles influenced by their pBest value stop moving toward gBest. The value of gBest should therefore be managed so that this condition does not occur. Normally, three logical operators (AND, OR, NOT) are used to solve combinational logic problems. We assume that if the gBest value has not changed for three consecutive iterations, the particles have converged to a local optimum.
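The stagnation check can be sketched as follows. The paper states only that gBest unchanged for three iterations signals a local optimum; the escape step shown here (restarting gBest from the all-zero gene mask) is an assumption borrowed from other published IBPSO variants, not something the text confirms. The names `should_reset_gbest` and `reset_gbest` are hypothetical, and "unchanged for three iterations" is read as "the last three recorded gBest fitness values are equal".

```python
def should_reset_gbest(fitness_history, patience=3):
    """Return True when the last `patience` gBest fitness values are identical."""
    if len(fitness_history) < patience:
        return False
    recent = fitness_history[-patience:]
    return all(f == recent[0] for f in recent)

def reset_gbest(n_genes):
    """Assumed escape step: restart gBest from the all-zero gene mask."""
    return [0] * n_genes

history = [0.80, 0.90, 0.90, 0.90]   # hypothetical gBest fitness per iteration
gbest = [1, 0, 1, 1]
if should_reset_gbest(history):
    gbest = reset_gbest(len(gbest))
```

A different escape step, such as re-randomizing part of the swarm, would fit the same check.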


2.4 Information gain (IG)

Information gain is an entropy-based measure used to choose the best split attributes in decision tree classifiers. It measures how much the entropy of the input data is reduced by knowing the value of an individual attribute [5]. A feature is selected or discarded according to its IG value: it is selected when its IG value is higher than the desired threshold. Information gain is calculated as the difference between Info(S) and Info_A(S), where Info_A(S) is obtained by partitioning S according to a test on attribute A:

$$Gain(A) = Info(S) - Info_A(S)$$

where Info(S) and Info_A(S) are defined as

$$Info(S) = -\sum_{i=1}^{k} P(C_i, S) \log P(C_i, S)$$

$$Info_A(S) = \sum_{i=1}^{v} \frac{|S_i|}{|S|} \times Info(S_i)$$

Here S is the set of n instances, C is the set of k classes, S_1, ..., S_v are the subsets of S produced by the test on A, and P(C_i, S) is the fraction of the examples in S that belong to class C_i.
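A small worked example may make the definition concrete. The Python sketch below computes Gain(A) for a discrete attribute; it assumes base-2 logarithms and already-discretized expression levels ("high"/"low"), neither of which is specified in the text, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(S) of a list of class labels (base-2 logarithm assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Gain(A) = Info(S) - Info_A(S) for one discrete attribute A."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)   # partition S by the value of A
    info_a = sum(len(g) / n * info(g) for g in groups.values())
    return info(labels) - info_a

labels = ["tumor", "tumor", "normal", "normal"]
gene_a = ["high", "high", "low", "low"]    # separates the classes perfectly
gene_b = ["high", "low", "high", "low"]    # carries no class information
gain_a = info_gain(gene_a, labels)
gain_b = info_gain(gene_b, labels)
```

With a threshold of 0, gene_a would pass the filter and gene_b would be discarded.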

3 Experimental work:

In this paper, two feature selection approaches, the filter method and the wrapper method, are combined to identify informative genes from existing microarray data, and several classification methods are used to evaluate the results. Fig. 1 describes the proposed combo feature selection model (filter + wrapper). The aim of this research is to find the feature genes that are important with respect to the class. The filter model therefore uses information gain (IG) to evaluate every feature for its ability to differentiate between the categories of genes. For this implementation, we used MATLAB to sort the feature genes according to their IG value. A high IG value identifies a feature that clearly separates one category from the others, and genes with average IG values also play a major role in classification. In this research the threshold value is 0, so genes with an IG value higher than the threshold are selected. For example, if a dataset has 15 features and 10 of them have an IG value above the threshold, those 10 features are passed on to the wrapper method for further selection. To improve the accuracy of biomarker selection, BPSO and IBPSO are applied after the filter model, and the KNN and SVM algorithms are used to assess classification performance.
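The hand-off between the two stages can be sketched as below: the filter keeps the genes whose IG exceeds the threshold, and a binary particle from the wrapper is then mapped back to original gene indices. The IG scores, the particle, and the function names are hypothetical values chosen only to show the index bookkeeping.

```python
def filter_by_ig(ig_scores, threshold=0.0):
    """Filter stage: indices of genes whose IG is strictly above the threshold."""
    return [i for i, score in enumerate(ig_scores) if score > threshold]

def apply_mask(candidates, mask):
    """Wrapper stage output: map a binary particle back to original gene indices."""
    return [gene for gene, bit in zip(candidates, mask) if bit == 1]

ig_scores = [0.0, 0.42, 0.0, 0.10, 0.31]       # hypothetical IG values for five genes
candidates = filter_by_ig(ig_scores)            # genes that survive the filter
selected = apply_mask(candidates, [1, 0, 1])    # hypothetical BPSO/IBPSO particle
```

Because the particle only has one bit per surviving gene, the wrapper searches a space of size 2^3 here rather than 2^5.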


3.1 Datasets:

We used several multi-category cancer gene expression datasets, taken from www.gems-system.org, to evaluate the accuracy of the proposed methods.

3.2 Performance of classifier

To evaluate the proposed technique, the chosen feature subsets were assessed by leave-one-out cross-validation (LOOCV) with a one-nearest-neighbor (1-NN) classifier, and K-fold cross-validation was used for SVM. LOOCV was applied to all datasets: nearest neighbors are found using the Euclidean distance, and the fitness value of each 1-NN evaluation is computed. In LOOCV, a single sample is held out as test data and the rest of the data are used as training data. In K-fold cross-validation (with k = 10), one fold is used as validation data and the other folds are used as training data. Because the problem is multiclass, the one-versus-rest (OVR) method is used.
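The LOOCV 1-NN evaluation can be sketched in plain Python as follows. This is an illustrative version under the assumption that the fitness of a gene subset is the fraction of held-out samples classified correctly; the function names and the toy samples are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def loocv_1nn_accuracy(samples, labels):
    """Hold out each sample in turn and classify it by its nearest remaining sample."""
    correct = 0
    for i, held_out in enumerate(samples):
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: euclidean(held_out, samples[j]),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(samples)

samples = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0]]   # two selected genes
labels = [0, 0, 1, 1]
fitness = loocv_1nn_accuracy(samples, labels)
```

In the wrapper, `samples` would contain only the columns switched on by the particle being evaluated.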

3.3 Parameter set for the experiment

In this research, we evaluated the performance of the combo filter and wrapper feature selection technique on four multi-class microarray gene expression datasets. In this investigation, 30 parameters are used. The values of rand1 and rand2 lie in the range [0, 1], and the acceleration factors c1 and c2 are both set to 2. The inertia weight w is set to 1.0. 100 iterations are performed, with mutation and crossover rates of 0.1 and 1 respectively. We also compared the proposed IBPSO method with other available approaches, such as GA and SVM. After feature selection was complete, the selected subsets were passed to two widely used classification algorithms. Tables 2 and 3 show the accuracy achieved by the filter approach, the wrapper approach, and the proposed combo model that uses both feature selection methods; in these tables the selected biomarkers are classified using KNN and SVM respectively. The experimental results show that classification accuracy on the microarray data improves compared with classification without feature selection.

Table 1.1. (Details of the input genes used in the experiment)

Table 1.2. (KNN accuracy performance comparison)

(Fig. 1.2. Result Analysis 1)


Table 1.2. (SVM accuracy performance comparison)

(Fig. 1.2. Result Analysis 2)

Table 1.2. (Filter vs Wrapper vs Combo)


(Fig. 1.2. Result Analysis 3)

4 Conclusion

We propose a combo model for microarray gene feature selection and test it with the KNN and SVM classifiers to examine classification performance. The experimental results indicate that the proposed method is a simple and effective gene selection approach. The parameter settings used worked effectively and gave higher classification accuracy than the other feature selection methods compared.

References

[1] L. Zhang, F.M. Zhang, Y.F. Hu: A Two-phase Flight Data Feature Selection Method Using both Filter and Wrapper, Proceedings of the Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, SNPD 2007, pp. 447-452, (2007).

[2] R. Xu, G. Anagnostopoulos, and D. Wunsch II: Multi-class cancer classification using semi-supervised ellipsoid ARTMAP and particle swarm optimization with gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 1, pp. 65-77, (2007).


[3] J.R. Quinlan: Induction of decision trees, Machine Learning, No. 1, pp. 81-106, (1986).

[4] J. Kennedy and R.C. Eberhart: A discrete binary version of the particle swarm algorithm, Proceedings of the world multiconference on systemics, cybernetics and informatics, Piscataway, pp. 4104-4109, (1997).

[5] J. Kennedy and R.C. Eberhart: Particle swarm optimization. In Proceedings of the 1995 IEEE International Conference on Neural Networks, vol. 4, pp. 1942-1948, (1995).

[6] E. Frank, M. Hall, L. Trigg, G. Holmes, I.H. Witten: Data mining in bioinformatics using Weka, Bioinformatics, Vol. 20, No. 15, pp. 2479-2481, (2004).

[7] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, S. Levy: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics, Vol. 21, No. 5, pp. 631-643, (2005).

[8] Y. Saeys, I. Inza, and P. Larrañaga: A review of feature selection techniques in bioinformatics, Bioinformatics, 23(19), pp. 2507-2517, (2007).

[9] Shen Q, Shi WM, Kong W: Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data. Comput Biol Chem; 32:52, (2008).

[10] Martinez E, Alvarez MM, Trevino V: Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm. J Comput Biol Chem; 34:244, (2010).

[11] Li L, Weinberg CR, Darden TA, Pedersen LG: Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics; 17:1131, (2001).

[12] Yang CH: A hybrid filter/wrapper method for feature selection of microarray data. J Med Biol Eng; 30:23, (2009).


[13] Önskog J, Freyhult E, Landfors M, Rydén P, Hvidsten TR: Classification of microarrays; synergistic effects between normalization, gene selection and machine learning. BMC Bioinformatics; 12:390, (2011).

[14] Wang X, Simon R: Microarray-based cancer prediction using single genes. BMC Bioinformatics; 12:391, (2011).

[15] Shah S, Kusiak A: Cancer gene search with data mining and genetic algorithms. Comput Biol Med; 37:251, (2007).
