
Development of Ensemble Based Deep Learning Architecture for Breast Cancer Classification and its Performance Comparison of other Classification Methods Utilizing Electronic Health Record Data

K. Kamala Devi1, J. Rajasekar2, J. Senthil Kumar3

1,2Department of CSE, 3Department of ECE

Mepco Schlenk Engineering College, Sivakasi, Tamil Nadu, India

August 6, 2018

Abstract

In this research article, a deep learning based stacking ensemble for improving breast cancer classification is proposed, and its performance is compared with six existing models, including a deep neural network, on two UCI data sets. Five classification methods are applied individually, namely k-nearest neighbor, decision trees, support vector machines, discriminant analysis, and logistic regression analysis, and then a deep learning architecture is adopted to predict from the outputs of these methods under fivefold cross validation. The proposed deep learning based ensemble method is compared with these methods on the two UCI data sets through classification accuracy, ROC curves and c-statistics. Experimental results for the two UCI data sets showed that the proposed deep learning based ensemble outperformed single k-nearest neighbor, decision trees, support vector machines, discriminant analysis, and logistic regression analysis, as well as the deep neural network, in terms of various performance measures.

International Journal of Pure and Applied Mathematics, Volume 120 No. 6 2018, 11097-11113. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ (Special Issue)

Key Words: Breast cancer, Deep learning, Ensemble, Classification, Performance evaluation

1 INTRODUCTION

As the incidence of breast cancer has increased in recent years, early detection and accurate diagnosis to reduce breast cancer mortality have become critically important. According to statistics of the National Institute of Cancer Prevention and Research from September 2017, India now has the third highest number of cancer cases among women, with breast cancer the most common of all cancers. For every two women newly diagnosed with breast cancer in India, one woman dies of it [1]. Mammography, ultrasonography, and fine needle aspiration (FNA) are the basic methods for the diagnosis of breast cancer [2,3]. Breast imaging is the most commonly used test method to detect cancer. Breast ultrasound testing diagnoses breast cancer using high-resolution ultrasound equipment and is mainly applied to women whose mammography shows dense breast tissue. The fine needle test is a simple and accurate method: a needle is inserted at the site of the lump and cells are sampled to check whether they are affected.

Recently, with the rapid growth of artificial intelligence, machine learning and deep learning, these techniques have come to be used for cancer diagnosis. Typical machine learning techniques are k-nearest neighbor, decision tree, support vector machine, neural network, logistic regression analysis, and discriminant analysis [4-8]. The k-nearest neighbors method classifies new data with high accuracy but has the disadvantage of high computation cost. Decision trees have the advantage of being easy to understand, because the classification function is drawn as a tree made up of decision rules. However, the result can vary depending on the variables selected for branch splitting. The support vector machine classifies new data after estimating, in the learning process, a hyperplane that maximizes the margin. It is less prone to over-fitting than the neural network and has higher prediction accuracy. Discriminant analysis is effective when the independent variables follow a multivariate normal distribution, and logistic regression analysis is a regression analysis used when the dependent variable is categorical. It is widely used as an alternative to discriminant analysis, which requires strict assumptions. As noted above, no single machine learning technique performs excellently in all cases; each has advantages and disadvantages.

Deep learning, which has been widely used in recent years, is a field of machine learning built on deep neural networks, that is, artificial neural networks with many layers. There are ensemble methods such as bagging, boosting, and stacking. Boosting combines weak classification models, which do not classify well on their own, to create a strong prediction model. Stacking, which is less well known than boosting, can yield more stable results by combining heterogeneous models [10,11]. The deep learning based ensemble model proposed in this study works in two stages. In the first stage, the stacking ensemble method is performed, and in the second stage, deep learning is applied for final classification.

The data used to evaluate the classification models in this study are the Wisconsin Original Breast Cancer (WOBC) data and the Wisconsin Diagnostic Breast Cancer (WDBC) data from the UCI breast cancer collection. The WOBC data were provided by the UCI Machine Learning Repository [12] in 1992 and have been used by many researchers for pattern recognition and machine learning. The WOBC data were collected from 699 subjects and consist of a class variable and nine variables representing cell characteristics of FNA, as shown in Table I. Here, each cell characteristic value is scaled from 1 to 10: the closer to 1, the more benign, and the closer to 10, the more malignant.


Table I WOBC data attribute information

The WDBC data were provided by the UCI Machine Learning Repository [12] in 1995 and are widely used for pattern recognition and machine learning along with the WOBC data. The WDBC data cover 569 individuals and contain one diagnosis (class) variable indicating benign or malignant and 10 cell characteristics: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each characteristic, the mean, standard deviation, and worst value (or maximum value) are recorded.

Table II WDBC data attribute information

For example, for the radius characteristic there are three variables: the mean radius variable representing the mean, the radius SD variable representing the standard deviation, and the worst radius variable representing the worst value.
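As a small illustration of this layout, the sketch below enumerates the 30 cell-characteristic variables implied by the description (10 characteristics times 3 summaries). The variable names are assumptions modeled on the "mean radius", "radius SD" and "worst radius" examples in the text; they are not taken from the repository files.

```python
# Ten cell characteristics listed in the WDBC description.
CHARACTERISTICS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

def wdbc_variable_names():
    """Build hypothetical names for the three summaries of each characteristic."""
    names = []
    for c in CHARACTERISTICS:
        names.append(f"mean {c}")    # mean value
        names.append(f"{c} SD")      # standard deviation
        names.append(f"worst {c}")   # worst (largest) value
    return names

names = wdbc_variable_names()
print(len(names))    # number of cell-characteristic variables
print(names[:3])     # the three radius variables
```

Together with the single diagnosis variable, this gives the full set of columns described for the WDBC data.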


2 RESEARCH METHODS USING EXISTING CLASSIFICATION MODEL

2.1 k-nearest neighbor

The k-nearest neighbor model finds the k training data points that are most similar to the new data point to be classified, and assigns it to the group to which the majority of those k points belong. The choice of k therefore has a great effect on the classification result. If k is too small, the model may over-fit to noise in the training data. Conversely, if k is too large, the neighborhood may include points from other groups, so the new data point may not be assigned to the group it is actually close to.
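The effect of k can be seen in a minimal sketch. The function name, the toy points and the single mislabeled "noisy" point are all illustrative assumptions; the Euclidean distance matches the setting reported later for "knn".

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: a benign cluster, a malignant cluster, and one noisy
# malignant point sitting next to the benign cluster.
train_X = [[1, 1], [1, 2], [2, 1], [2, 2], [3, 3.2], [8, 8], [9, 8], [8, 9]]
train_y = ["benign"] * 4 + ["malignant"] * 4
query = [3, 3]

pred_k1 = knn_predict(train_X, train_y, query, k=1)   # follows the noisy neighbor
pred_k5 = knn_predict(train_X, train_y, query, k=5)   # majority of the local cluster
print(pred_k1, pred_k5)
```

With k = 1 the prediction is decided by the single noisy point, while k = 5 recovers the label of the surrounding cluster, which is the over-fitting behavior described above.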

2.2 Decision tree

A decision tree is a classification model that divides the independent variable space by applying various rules sequentially. Its construction can be explained step by step [13]. Initially, one of several independent variables is selected and a reference value is set for that variable. This is called a classification rule. Next, the entire learning data set is divided into a data group whose value of the independent variable is smaller than the reference value and a group whose value is not. These two steps are repeated to create child nodes. Finally, if a child node contains only one class of data, it is not divided any further.
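The choice of a reference value for one variable can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses the Gini index as the impurity measure (the setting reported for "rpart" in Section 4), and the helper names and toy values are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best rule 'value < threshold'."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate reference values lie midway between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [t for v, t in zip(values, labels) if v < thr]
        right = [t for v, t in zip(values, labels) if v >= thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best

values = [1, 2, 3, 7, 8, 9]      # one independent variable
labels = [0, 0, 0, 1, 1, 1]      # class of each observation
split = best_threshold(values, labels)
print(split)
```

Here the rule "value < 5.0" produces two child nodes that each contain a single class, so neither would be divided further.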

2.3 Support vector machine

Assume that the learning dataset D represented in Eq. (1) is given.

Where $y_i$ is a value representing the class to which the p-dimensional vector $x_i$ belongs, and is +1 or -1. The optimal separating hyperplane for classifying the $x_i$ into two classes in the support vector machine is determined by maximizing the distance, or margin, between two parallel hyperplanes passing through the support vectors among the points belonging to each class. We can introduce the slack variables $\xi_i \geq 0$ to formulate the optimization problem of Eq. (2), which is subject to the condition specified in Eq. (3).

Where C is a parameter that determines the trade-off between maximizing the margin and minimizing the classification error rate. By introducing the kernel function $K(x_i, x_j)$, the optimization problem of Eq. (2) can be expressed as the maximization of Eq. (4), subject to the condition specified in Eq. (5).

Where $\alpha_i$ is a Lagrange multiplier. The optimal separating hyperplane is therefore represented using the support vectors, as specified in Eq. (6):

Where S represents the set of support vectors and b is the offset from the origin.
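The numbered equations of this subsection are not reproduced in this copy. As a reference point only, the standard soft-margin support vector machine, written with the symbols defined above, takes the following form. The weight vector $w$, the sample size $n$, and the assignment of formulas to Eqs. (1) to (6) are assumptions rather than the authors' typesetting; note that the primal objective is conventionally written as a minimization.

```latex
\begin{align}
D &= \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\; y_i \in \{-1, +1\}\}_{i=1}^{n} \tag{1}\\
\min_{w,\,b,\,\xi}\; & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \tag{2}\\
\text{s.t.}\; & y_i (w^{\top} x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0 \tag{3}\\
\max_{\alpha}\; & \sum_{i=1}^{n} \alpha_i
   - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{4}\\
\text{s.t.}\; & 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{5}\\
f(x) &= \sum_{i \in S} \alpha_i y_i K(x_i, x) + b \tag{6}
\end{align}
```

A new observation x is then assigned to the class given by the sign of $f(x)$.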

2.4 Discriminant analysis

Discriminant analysis is a statistical theory systematized by Fisher [14]. It classifies new objects into the groups to which they belong by deriving discrimination rules that minimize the classification error when the whole population is divided into two or more groups. The groups are assumed to follow multivariate normal distributions, and the covariance matrix of each group is assumed to be the same.
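Under these two assumptions the discrimination rule becomes linear in x. As a sketch in standard notation (the symbols $\mu_k$ for the mean of group k, $\Sigma$ for the shared covariance matrix and $\pi_k$ for the prior probability of group k are assumptions, since the text introduces none), a new object x is assigned to the group with the largest score:

```latex
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k
            + \log \pi_k,
\qquad
\hat{k}(x) = \arg\max_k \, \delta_k(x)
```

In practice $\mu_k$, $\Sigma$ and $\pi_k$ are replaced by their estimates from the learning data.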


2.5 Logistic regression analysis

Logistic regression analysis is a method of analyzing which group individual observations belong to when the objects to be analyzed are divided into two or more groups. When the n independent variables are $X_1, X_2, \dots, X_n$, the logistic regression model is as shown in Eq. (7).

Where $\beta_i$ are regression coefficients, P is the probability that an individual observation belongs to a group, and 1-P is the probability that it does not belong to that group. Therefore, the higher the value, the higher the probability of belonging to the group.
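Eq. (7) is not reproduced in this copy. Assuming it is the usual logit form, $\log\frac{P}{1-P} = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$, the relation between the linear value and P can be sketched as below. The intercept name beta0 and the coefficient values are illustrative assumptions, not fitted results.

```python
import math

def logistic_probability(x, beta0, beta):
    """P = 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i)))."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta = -1.0, [0.8, 0.5]            # illustrative coefficients only

p_low = logistic_probability([0.0, 0.0], beta0, beta)    # linear value -1.0
p_high = logistic_probability([3.0, 2.0], beta0, beta)   # linear value  2.4

# Taking the log-odds of P recovers the linear value beta0 + beta . x.
log_odds = math.log(p_high / (1.0 - p_high))
print(p_low, p_high, log_odds)
```

A larger linear value maps to a larger P, which is the monotone relation the last sentence above describes.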

3 ENSEMBLE MODEL BASED ON DEEP LEARNING

Figure 1 shows an example of a deep neural network with several hidden layers between the input and output layers. The deep neural network is a neural network structure capable of high-level abstraction: increasing the number of hidden layers combines many nonlinear transformations.

Figure 1. Structure of a sample deep neural network.


Let the number of layers in Figure 1 be N. Then $L_1$ is the input layer, $L_N$ is the output layer, and $L_2, \dots, L_{N-1}$ are the hidden layers. The parameters to be learned in the neural network are as shown in Eq. (8) and Eq. (9).

Here $W^{l} = \{W^{l}_{i,j}\}$, $j = 1, \dots, s_l$, $i = 1, \dots, s_{l+1}$, $l = 1, \dots, n_l$, and $b^{l} = \{b^{l}_{i}\}$, $i = 1, \dots, s_{l+1}$, $l = 1, \dots, n_l$, where $s_l$ denotes the number of neurons in layer l, $W^{l}_{i,j}$ is the weight between neuron j in layer l and neuron i in layer l + 1, and $b^{l}_{i}$ is the bias of neuron i in layer l + 1. The m training data to be learned through the deep neural network are as in Eq. (10).

The deep neural network is trained by stochastic gradient descent. The cost function is defined as follows.

Where the first term is the mean square error term, the second term is the regularization term, and $\lambda$ is the weight decay parameter that controls the relative importance of the two terms. The nonlinear activation function $h_{W,b}(x^{i})$ is defined as follows.

The parameters minimizing the cost function of Eq. (11) are obtained by the iterative method represented in Eq. (13) and Eq. (14).


Where $\alpha$ represents the learning rate. In deep neural network learning, the activations are calculated by Eq. (12) in a feedforward pass, and the error between the final output value and the actual value is computed. Then Eq. (13) and Eq. (14) are used to update the weights and biases, in which $\frac{\partial}{\partial W^{l}_{i,j}} J(W, b)$ and $\frac{\partial}{\partial b^{l}_{i}} J(W, b)$ are calculated using the back propagation algorithm.
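Eqs. (8) to (14) are not reproduced in this copy. Assuming the common squared-error cost with weight decay, which matches the two terms and the parameters $\lambda$ and $\alpha$ described above, Eqs. (11), (13) and (14) would read roughly as follows. The constants and summation limits are assumptions; the index i is reused for training samples in the first sum and for neurons in the second.

```latex
\begin{align}
J(W, b) &= \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}
           \bigl\lVert h_{W,b}(x^{i}) - y^{i} \bigr\rVert^{2}
         + \frac{\lambda}{2} \sum_{l} \sum_{j=1}^{s_l} \sum_{i=1}^{s_{l+1}}
           \bigl(W^{l}_{i,j}\bigr)^{2} \tag{11}\\
W^{l}_{i,j} &\leftarrow W^{l}_{i,j}
         - \alpha \frac{\partial}{\partial W^{l}_{i,j}} J(W, b) \tag{13}\\
b^{l}_{i} &\leftarrow b^{l}_{i}
         - \alpha \frac{\partial}{\partial b^{l}_{i}} J(W, b) \tag{14}
\end{align}
```

In this form the weight decay term penalizes only the weights, not the biases, and a larger $\lambda$ shifts the balance from fitting the data toward keeping the weights small.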

Figure 2. Ensemble model flowchart based on deep learning.

3.1 Stacking Ensemble Model

The proposed deep learning based ensemble model is roughly divided into two stages, as shown in Figure 2. In the first stage, the stacking ensemble method is applied to the input data, and in the second stage, deep learning is applied for final classification. The deep learning based ensemble model of Figure 2 is described below in two stages under five-fold cross-validation.

In the first stage (Stage I), the data D is divided into five slices of the same size, $D_1, D_2, D_3, D_4, D_5$, where $D_k = \{x_k, y_k\}$, $k = 1, 2, 3, 4, 5$. The union of four of the five slices, i.e., $\bigcup_{k=2}^{5} D_k$, is used as training data, and the remaining slice, $D_1 = \{x_1, y_1\}$, is used as test data. A model $h_i$ is then created by applying the i-th basic classification model to the training data $\bigcup_{k=2}^{5} D_k$, and its prediction for the test data $D_1 = \{x_1, y_1\}$ is calculated by applying the model to $x_1$. The results of the basic classification models are then assembled as shown in Eq. (15).

The prediction probability $H_1$ of the models obtained in Eq. (15) is merged with the given label (target variable) $y_1$ to generate new data $D'_1 = \{H_1, y_1\}$. By repeating this sequence five times, exchanging the roles of training data and test data by five-fold cross validation, the new data sets $D'_1, D'_2, D'_3, D'_4, D'_5$ are obtained, where $D'_k = \{H_k, y_k\}$, $k = 1, 2, 3, 4, 5$.

In the second stage (Stage II), from the obtained new data sets $D'_1, D'_2, D'_3, D'_4, D'_5$, the data $D'_1 = \{H_1, y_1\}$ is used as test data and the remaining data $D'_2, D'_3, D'_4, D'_5$ are used as training data. A model is generated by applying the deep neural network to the training data $\bigcup_{k=2}^{5} D'_k$, and it is validated by applying it to the test data $D'_1 = \{H_1, y_1\}$ to obtain the final classification. Repeating the above steps five times by five-fold cross validation yields the final classification.
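The construction of the new data $D'_k$ in Stage I can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two base models (a nearest-centroid score and a 1-nearest-neighbor rule) are simple stand-ins for the five classifiers of Section 2, and all function names and the toy data are hypothetical.

```python
import math
import random

def make_folds(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k slices D_1..D_k."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroid(X, y):
    """Stand-in base model: score by relative distance to the two class centroids."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0, d1 = math.dist(x, c0), math.dist(x, c1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)
    return predict

def fit_1nn(X, y):
    """Stand-in base model: label of the single nearest training point."""
    def predict(x):
        j = min(range(len(X)), key=lambda i: math.dist(x, X[i]))
        return float(y[j])
    return predict

def stage_one(X, y, fitters, k=5, seed=0):
    """Return H, where H[i] holds the out-of-fold base predictions for sample i."""
    H = [None] * len(X)
    for test_idx in make_folds(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        models = [fit(X_tr, y_tr) for fit in fitters]   # h_1, h_2, ...
        for i in test_idx:
            H[i] = [h(X[i]) for h in models]            # one row of H_k
    return H

# Toy data; assumes both classes occur in every training split.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [0] * 10 + [1] * 10
H = stage_one(X, y, [fit_centroid, fit_1nn])
D_prime = list(zip(H, y))   # new data D' = {H, y} passed on to Stage II
```

Because every row of H is predicted by models that never saw that sample, the Stage II learner is trained on predictions that resemble those it will receive for unseen data. Stage II would repeat the same fold loop on D_prime with the deep neural network as the learner.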

4 RESULTS AND DISCUSSION ON PERFORMANCE METRICS

To evaluate the performance of the deep learning based ensemble model proposed in this study, it is applied to the WOBC and WDBC data sets and compared with the other models in terms of accuracy, the receiver operating characteristic (ROC) curve, and the c-statistic. The output of the proposed deep learning based ensemble model is a binary classification value, and each performance measure is averaged over the five data sets generated by five-fold cross validation. Accuracy shows the degree of agreement between the actual target value and the predicted value of the model. The ROC curve is a graph with 1 - specificity on the x-axis and sensitivity on the y-axis. 1 - specificity is called the false positive rate, and sensitivity is called the true positive rate. The ROC curve thus shows how the true positive rate and the false positive rate change as the threshold on the posterior probability, which is the output of the classification model, changes.

Here, "knn" uses k = 5 under the Euclidean distance; "svm" uses the radial kernel with C = 0.25, σ = 0.755 in the WOBC data and C = 0.25, σ = 0.0464 in the WDBC data; and "rpart" uses binary splits with the Gini index as the measure of impurity under the CART algorithm. Figure 3 shows the ROC curves obtained by applying the classification models to the WOBC data. Here, kNN is k-nearest neighbor, DT is decision tree, SVM is support vector machine, GLM is logistic regression analysis, LDA is discriminant analysis, DNN is deep neural network, and Ensemble is the deep learning based ensemble model.
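The ROC curve and the c-statistic can be computed from posterior probabilities as in the sketch below. The function names, scores and labels are illustrative assumptions; the c-statistic is computed here as the share of (positive, negative) pairs that the model ranks correctly, which equals the area under the ROC curve.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) for each threshold, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, labels) if s >= thr and t == 1)
        fp = sum(1 for s, t in zip(scores, labels) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def c_statistic(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # toy posterior probabilities
labels = [1, 1, 0, 1, 1, 0, 0, 0]                     # 1 = malignant, 0 = benign

curve = roc_points(scores, labels)
auc = area_under(curve)
c = c_statistic(scores, labels)
print(auc, c)   # both give the same value on this toy example
```

A curve that hugs the upper left corner gives a c-statistic near 1, while a curve along the diagonal reference line gives a value near 0.5.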

Figure 3. ROC curves obtained by applying the classification models to the WOBC data.

Figure 3 shows that the ROC curves of all models except the decision tree lie toward the upper left corner, indicating good performance. These curves overlap one another, so it is not easy to rank the models visually. The ROC curve of the decision tree lies closest to the diagonal reference line, indicating markedly lower performance. Table III shows the accuracy and c-statistics of the classification models obtained on the WOBC and WDBC data. First, in terms of accuracy, the proposed deep learning based ensemble model showed the highest accuracy, followed by the deep neural network and k-nearest neighbors, and decision trees showed the lowest accuracy. In terms of the c-statistic, the deep learning based ensemble model was the highest, followed by logistic regression, discriminant analysis, and the deep neural network. The decision trees showed the lowest c-statistic.

Figure 4. ROC curves obtained by applying the classification models to the WDBC data.

Figure 4 shows a similar pattern to Figure 3. Table III also shows the accuracy and c-statistics obtained for the WDBC data. In terms of accuracy, the proposed ensemble model is the highest, followed by the deep neural network and k-nearest neighbors, and decision trees have the lowest accuracy, as in the WOBC data. In terms of c-statistics, the deep neural network was slightly higher than the proposed ensemble model, followed by discriminant analysis and support vector machine, and decision trees showed the lowest c-statistic.

Table III Performance comparison of classification models for WOBC and WDBC data sets


5 CONCLUSION

The incidence of breast cancer in women is continuously increasing, and the diagnostic accuracy for breast cancer remains too low, so statistical classification methods are important as an auxiliary means of increasing diagnostic accuracy. An ensemble model based on a deep neural network was constructed to improve the performance of breast cancer classification, and its performance was compared with existing single models. A stacking ensemble of five classifiers is performed in the first step, followed by a deep neural network to improve their performance. In this paper, to evaluate the performance of the deep learning based ensemble, it was compared on UCI breast cancer data with existing classification methods such as k-nearest neighbor, decision tree, support vector machine, discriminant analysis, and logistic regression analysis, using accuracy, the ROC curve and the c-statistic. On the WOBC and WDBC data, all the methods including the proposed deep learning ensemble showed good performance in the ROC curve except the decision tree, and the proposed ensemble model showed the highest accuracy on both the WOBC and WDBC data.

References

[1] National Centre for Disease Informatics and Research - National Cancer Registry Programme Center. Annual report of cancer statistics in India in 2015. Indian Council of Medical Research; 2015.

[2] Sewak M, Vaidya P, Chan CC, Duan ZH. SVM approach to breast cancer classification. Conference: Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007). 2007;32-37.

[3] Fiuzy M, Haddadnia J, Mollania N, Hashemian M, Hassanpour K. Cancer based on fine needle aspiration (FNA) test data and combining intelligent systems. Iran J Cancer Prev 2012;5(4):169-177.

[4] Liang Z, Zhang G, Huang JX, Hu QV. Deep learning for healthcare decision making with EMRs. In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): Belfast, UK: IEEE; 2014: 5569.

[5] Cho K, van Merrienboer B, Gulcehre C, et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv [Cs.CL] 2014. http://arxiv.org/abs/1406.1078

[6] Zhang GP. Neural networks for classification: A Survey. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews 2000;30(4):451-462.

[7] Gupta S, Kumar D, Sharma A. Data mining classification techniques applied for breast cancer diagnosis and prognosis. Indian J Comput Sci Eng 2011;2(2):188-195.

[8] Kitbumrungrat K. Comparison logistic regression and discriminant analysis in classification groups for breast cancer. Int J Comput Sci Network Secur 2012;12(5):111-115.

[9] Xiao Y, Wu J, Lin Z, Zhao X. A deep learning-based multi-model ensemble method for cancer prediction. Comput Methods Programs Biomed 2018;153:1-9.

[10] Lim JS, Oh YS, Lim DH. Bagging support vector machine for improving breast cancer classification. J Health Info Stat 2014;39(1):15-24.

[11] Salunkhe UR, Mali SN. Classifier ensemble design for imbalanced data classification: a hybrid approach. Procedia Comput Sci 2016;85:725-732.

[12] UCI Machine Learning Repository. University of California, Center for Machine Learning and Intelligent Systems. Available at http://archive.ics.uci.edu/ml/datasets.html [accessed on October 13, 2017].

[13] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. London, UK: Chapman Hall/CRC; 1984.

[14] Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen 1936;7:111-132.

[15] Egan J. Signal detection theory and ROC analysis. Cambridge, MA: Academic Press; 1975.

[16] Cook NR. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clin Chem 2008;54(1):17-23.

[17] Landry M. Machine learning with R and H2O. Mountain View, CA: H2O.ai, Inc; 2018.

[18] Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28(5):1-26.

[19] Dubois S, Romano N, Jung K, Shah N, Kale DC. The Effectiveness of Transfer Learning in Electronic Health Records Data. 2017. https://openreview.net/forum?idB1 E8xrKe

[20] Lipton ZC. The Mythos of Model Interpretability. arXiv [Cs.LG] 2016. http://arxiv.org/abs/1606.03490

[21] Koh PW, Liang P. Understanding Black-box Predictions via Influence Functions. arXiv [Stat.ML] 2017. http://arxiv.org/abs/1703.04730

[22] Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv [Cs.CL] 2014. http://arxiv.org/abs/1409.0473

[23] Che Z, Purushotham S, Khemani R, Liu Y. Distilling Knowledge from Deep Networks with Applications to Healthcare Domain. arXiv [Stat.ML] 2015. http://arxiv.org/abs/1512.03542

[24] Bradshaw J, Matthews AG. D G, Ghahramani Z. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks. arXiv [Stat.ML] 2017. http://arxiv.org/abs/1707.02476

[25] Cao Z, Long M, Wang J, Jordan MI. Partial Transfer Learning with Selective Adversarial Networks. arXiv [Cs.LG] 2017. http://arxiv.org/abs/1707.07901

[26] Johansson F, Shalit U, Sontag D. Learning Representations for Counterfactual Inference. International Conference on Machine Learning. 2016:30209.

[27] Bhat HS, Goldman-Mellor SJ. Predicting Adolescent Suicide Attempts with Neural Networks. arXiv [Stat.ML] 2017. http://arxiv.org/abs/1711.10057

[28] Miotto R, Li L, Dudley JT. Deep learning to predict patient future diseases from the electronic health records. Advances in Information Retrieval. Springer, Cham; 2016. pp. 768774.

[29] Avati A, Jung K, Harman S, Downing L, Ng A, Shah NH. Improving palliative care with deep learning. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Kansas City, MO, USA: IEEE; 2017: 3116.

[30] Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning for electronic health records. arXiv: 1801.07860 [cs.CY]. 2018

[31] Ching T, Himmelstein DS, Beaulieu-Jones BK, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15 (141): 20170387. DOI: 10.1098/rsif.2017.0387.
