
PERFORMANCE ANALYSIS ON DIABETES PREDICTION WITH DIFFERENT CLASSIFICATION ALGORITHMS USING WEKA

Sathya S1, Rajesh A2

1Research Scholar, Computer Science & Engineering, St. Peter’s University, Chennai, India

2Professor, Computer Science & Engineering, C. Abdul Hakeem College of Engineering & Technology, Vellore, India

ARTICLE INFO

Article History:

Received 9th Nov, 2015

Received in revised form 12th Nov, 2015

Accepted 14th Nov, 2015

Published online 16th Nov, 2015

Keywords:

Diabetic

Data Mining

Naïve Bayes

Random Tree

ABSTRACT

The main objective of this paper is to predict the likelihood of diabetes using the Naïve Bayes, Random Tree and J48 classification algorithms and to compare their performance. Many researchers have studied diabetes prediction using different approaches, but none of these approaches predicts accurately. The method proposed here aims to overcome this drawback. The data set was obtained from a local health center, Arcot Digital X-Ray, E.C.G & Computerized Lab, and includes 18 essential attributes and 633 instances needed for diabetes prediction. This work is carried out to convert data into knowledge. The popular data mining tool WEKA (Waikato Environment for Knowledge Analysis) is used to analyze the collected data set with the Naïve Bayes, Random Tree and J48 classifiers. The correctly classified instances, incorrectly classified instances, Kappa statistic, mean absolute error, root mean squared error, relative absolute error and root relative squared error are measured for each algorithm for comparison and analysis. The TP Rate, FP Rate, Precision, Recall, F-Measure, MCC, ROC Area and PRC Area of each algorithm are also tabulated. Finally, the accuracy of each algorithm is measured and charted for performance analysis. The J48 classifier gives the best prediction accuracy, 99.0521%, while Random Tree stands second with 95.5766% and Naïve Bayes third with 93.8389%.

INTERNATIONAL RESEARCH JOURNAL IN ADVANCED ENGINEERING AND TECHNOLOGY (IRJAET) www.irjaet.com

ISSN (PRINT) : 2454-4744 ISSN (ONLINE): 2454-4752

Vol. 1, Issue 4, pp.178 - 190, November, 2015


INTRODUCTION

The collected dataset is given as input to the machine learning algorithms Naïve Bayes, Random Tree and J48, and different measures are considered for comparison. The proposed concept is shown as a block diagram in Fig 1. The training and test data are given as input to the algorithms, and the results obtained are analyzed from different aspects.

Fig 1. Block diagram of proposed Model

Diabetes

The World Health Organization (WHO) estimates that nearly 200 million people all over the world suffer from diabetes, that this number is likely to double by 2030, and that 80% of diabetes deaths occur in middle-income countries. In India, there are nearly 50 million diabetics, according to the statistics of the International Diabetes Federation. As the incidence of diabetes is on the rise, doctors say, there is a proportionate rise in the complications associated with diabetes [1]. The disease has been named the fifth deadliest disease in the United States, with no imminent cure in sight [2]. It brings many complications, such as a higher risk of eye disease and a higher risk of kidney failure. However, early detection of the disease and proper care management can make a difference [3].

According to the American Diabetes Association, 20.8 million children and adults in the United States (approximately 7% of the population) were diagnosed with diabetes. Thus, the ability to diagnose diabetes early plays an important role in the patient’s treatment process [4]. Diabetes causes sugar to build up in the blood, leading to complications such as heart disease, stroke, neuropathy, poor circulation leading to loss of limbs, blindness, kidney failure, nerve damage and death. General symptoms of diabetes are increased thirst, increased urination, weight loss, increased appetite, fatigue, nausea and/or vomiting, blurred vision, slow-healing infections and impotence in men [5]. Diabetes can result in multi-organ failure, so it is necessary to predict and prevent it early.

Diabetic retinopathy is the most common eye problem affecting people with diabetes; it usually affects only people who have had diabetes for a long time and can result in blindness [6]. Diabetes mellitus, or simply diabetes, is a set of related diseases in which the body cannot regulate the amount of sugar in the blood [7]. It is a group of metabolic diseases in which a person has high blood sugar, either because the body does not produce enough insulin or because cells do not respond to the insulin that is produced [8].

There are two types of diabetes. Type I diabetes, also called Insulin-Dependent Diabetes Mellitus (IDDM) or Juvenile-Onset Diabetes Mellitus, is commonly seen in children and young adults; however, older patients do present with this form of diabetes on occasion. Type II diabetes is also called Non-Insulin-Dependent Diabetes Mellitus (NIDDM) or Adult-Onset Diabetes Mellitus. Preventing diabetes is an ongoing area of interest to the healthcare community [9]. Diabetes is one of the most prevalent diseases worldwide, with an increasing number of complications, retinopathy being one of the most common. Diabetes is a major chronic disorder which has no cure [10].

MATERIALS AND METHODS

Data Mining And Weka

Data mining is the process of extracting hidden knowledge from large volumes of raw data [11]. It has been applied in various fields such as medicine, marketing and banking. In medicine, predictive data mining is used to diagnose disease at an early stage and helps physicians in treatment planning [12]. Much of the data collected by the medical and healthcare industry is not turned into useful information for effective decision making. With data mining, doctors can predict which patients might be diagnosed with diabetes [13]. Data mining is “the science of extracting useful information from large databases” and is one of the tasks in the process of knowledge discovery from databases [14].

The amount of information in biomedical databases is growing so rapidly that researchers cannot convert it into knowledge at the same pace [15]. In general, detecting a disease requires numerous tests to be conducted on a patient. Data mining techniques are used in disease prediction to reduce the number of tests and to increase the detection accuracy [16]. The popular data mining tool WEKA is therefore used to analyze the collected data with classification methods.

Weka contains a collection of machine learning algorithms for data mining tasks and was introduced by the University of Waikato in New Zealand. Weka is open source software issued under the GNU General Public License.

Data sets used

S.No  Attribute    Meaning
1.    PID          PATIENT ID
2.    SEX          SEX
3.    AGE          AGE
4.    WEIGHT       WEIGHT
5.    BP           BLOOD PRESSURE
6.    TYPE         TYPE OF DIABETES
7.    FASTING      FASTING (EMPTY STOMACH)
8.    PP           POSTPRANDIAL
9.    A1C          GLYCOSYLATED Hb
10.   LP TOT       LIPID PROFILE TOTAL CHOLESTEROL
11.   HDL          HIGH DENSITY LIPOPROTEIN
12.   LDL          LOW DENSITY LIPOPROTEIN
13.   VLDL         VERY LOW DENSITY LIPOPROTEIN
14.   TGL          TRIGLYCERIDES
15.   CHL/LDL      RATIO OF CHOLESTEROL
16.   HEIGHT       HEIGHT OF PATIENT
17.   HERIDITORY   HEREDITY OF PATIENT
18.   CATEGORY     CATEGORY (Normal/Diabetes/Prediabetes)

Implementation Tool

"WEKA" stands for the Waikato Environment for Knowledge Analysis, a data mining system developed at the University of Waikato in New Zealand [20]. WEKA is extensible and has become a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost every platform. Weka is open source software issued under the GNU General Public License. It is a state-of-the-art facility for developing machine learning techniques and applying them to real-world data mining problems [17]. The algorithms are applied directly to a dataset. Weka implements algorithms for data preprocessing, classification, regression, clustering and association rules; it also includes visualization tools [18]. By analysing the history of data, strategic decisions can be made [19]. New machine learning schemes can also be developed with this package. Data mining is the process of selecting, exploring and modeling large amounts of data in order to discover unknown patterns or relationships that provide a clear and useful result [21].

Implementation

The training and test mode uses 10-fold cross-validation for all of the algorithms mentioned, with 633 instances and 18 attributes, to reduce bias in the results obtained.
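To make the evaluation protocol concrete, the following is a minimal Python sketch of how 633 instances could be split into 10 stratified folds. It is an illustration only and not Weka's implementation: the function name `stratified_folds`, the seed and the round-robin dealing are assumptions, and the class sizes (308 Normal, 169 Diabetic, 156 Prediabetic) are taken from the row totals of the confusion matrices reported in this paper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Assign every instance index to one of k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so that fold sizes differ by at most one instance.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


# Hypothetical label vector: only the class sizes come from the paper.
labels = ["N"] * 308 + ["D"] * 169 + ["P"] * 156
folds = stratified_folds(labels, k=10)

for fold_number, test_fold in enumerate(folds, start=1):
    train_size = len(labels) - len(test_fold)
    print(f"fold {fold_number}: train={train_size} test={len(test_fold)}")
```

In each round, one fold serves as the test set and the other nine as training data, so every instance is tested exactly once and the reported counts add up to all 633 instances.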

Table 1: Naïve Bayes and its Measures.

Measures                              Values       Accuracy
Correctly Classified Instances        594          93.8389 %
Incorrectly Classified Instances      39           6.1611 %
Kappa statistic                       0.9027
Mean absolute error                   0.0447
Root mean squared error               0.1873
Relative absolute error               10.6132 %
Root relative squared error           40.8274 %
Coverage of cases (0.95 level)        97.3144 %
Mean rel. region size (0.95 level)    36.1769 %
Total Number of Instances             633


Naive Bayes

Naïve Bayes takes 0 seconds to build the model. The measures are tabulated in Table 1 for reference. Of the 633 instances, 594 are classified correctly and 39 are classified incorrectly. The kappa statistic of Naïve Bayes is 0.9027, with a mean absolute error of 0.0447, a root mean squared error of 0.1873 and a relative absolute error of 10.6132%; the rest of the measures are given in Table 1.
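The accuracy and kappa statistic quoted above can be recomputed from the class counts. The sketch below assumes the standard definition of Cohen's kappa (observed agreement corrected for chance agreement) and uses the Naïve Bayes confusion matrix reported in Table 3 of this paper; the function name `accuracy_and_kappa` is illustrative.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    observed = correct / total

    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, given the actual and predicted class totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Naive Bayes confusion matrix from Table 3 (class order: N, D, P).
naive_bayes = [
    [302, 1, 5],
    [0, 144, 25],
    [2, 6, 148],
]

accuracy, kappa = accuracy_and_kappa(naive_bayes)
print(f"accuracy={accuracy:.4%} kappa={kappa:.4f}")
# prints: accuracy=93.8389% kappa=0.9027
```

Both values agree with Table 1, which suggests that the reported kappa follows this definition.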

Detailed Accuracy By Class

The TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area and PRC Area for the three classes Normal, Diabetic and Prediabetic are tabulated in Table 2.

Table 2: Naïve Bayes and its Detailed Accuracy By Class.

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  PRC Area  Class
0.981    0.006    0.993      0.981   0.987      0.996     0.997     N
0.852    0.015    0.954      0.852   0.900      0.975     0.961     D
0.949    0.063    0.831      0.949   0.886      0.974     0.874     P
0.938    0.023    0.943      0.938   0.939      0.985     0.957     Weighted Avg.

Confusion Matrix

The confusion matrix obtained for the Naïve Bayes classifier is given in Table 3. Each row shows how the instances of one actual class (Normal, Diabetic or Prediabetic) were classified, from which the true positives, true negatives, false positives and false negatives of each class can be read.

Table 3: Naïve Bayes Confusion Matrix

  a    b    c    Classified as
302    1    5    a = N
  0  144   25    b = D
  2    6  148    c = P
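The per-class figures in Table 2 follow from this confusion matrix when each class is treated in turn as the positive class and the other two as negative. The Python sketch below assumes the usual one-vs-rest definitions; the function names and dictionary keys are illustrative. ROC Area and PRC Area are not reproduced because they require the predicted class probabilities, which are not listed in the paper.

```python
def per_class_metrics(matrix, class_names):
    """Compute one-vs-rest TP rate, FP rate, precision and F-measure per class."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, name in enumerate(class_names):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                   # actual k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp    # predicted k, actually otherwise
        tn = total - tp - fn - fp
        recall = tp / (tp + fn)                    # identical to the TP rate
        precision = tp / (tp + fp)
        results[name] = {
            "support": tp + fn,
            "tp_rate": recall,
            "fp_rate": fp / (fp + tn),
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
        }
    return results


def weighted_average(results, key):
    """Average a metric over classes, weighted by the number of actual instances."""
    total = sum(m["support"] for m in results.values())
    return sum(m["support"] * m[key] for m in results.values()) / total


# Naive Bayes confusion matrix from Table 3 (rows: actual N, D, P).
naive_bayes = [
    [302, 1, 5],
    [0, 144, 25],
    [2, 6, 148],
]

metrics = per_class_metrics(naive_bayes, ["N", "D", "P"])
for name, m in metrics.items():
    print(name, {key: round(value, 3) for key, value in m.items()})
print("weighted precision:", round(weighted_average(metrics, "precision"), 3))
```

Rounded to three decimals, the results can be checked row by row against Table 2; the weighted averages use the class sizes 308, 169 and 156 as weights.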


Random Tree

The size of the tree is 71 and it takes 0.1 seconds to build the model. The measures are tabulated in Table 4 for reference. Of the 633 instances, 605 are classified correctly and 28 are classified incorrectly. The kappa statistic is 0.93, with a mean absolute error of 0.0289, a root mean squared error of 0.1626 and a relative absolute error of 0.1626%; the rest of the measures are given in Table 4.

Table 4: Random Tree and its Measures.

Detailed Accuracy By Class

Table 5: Random Tree and its Detailed Accuracy By Class.

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.974    0.018    0.980      0.974   0.977      0.956  0.981     0.974     N
0.941    0.022    0.941      0.941   0.941      0.919  0.971     0.916     D
0.936    0.025    0.924      0.936   0.930      0.907  0.961     0.901     P
0.956    0.021    0.956      0.956   0.956      0.934  0.973     0.940     Weighted Avg.


Confusion Matrix

The confusion matrix obtained for the Random Tree classifier is given in Table 6. Each row shows how the instances of one actual class (Normal, Diabetic or Prediabetic) were classified, from which the true positives, true negatives, false positives and false negatives of each class can be read.

Table 6: Random Tree Confusion Matrix

  a    b    c    Classified as
300    5    3    a = N
  1  159    9    b = D
  5    5  146    c = P
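The MCC column of Table 5 can be derived from this matrix in the same one-vs-rest fashion. The sketch below assumes the standard binary Matthews correlation coefficient applied to each class against the other two; the function name `one_vs_rest_mcc` is illustrative.

```python
import math


def one_vs_rest_mcc(matrix, k):
    """Matthews correlation coefficient for class k against all other classes."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


# Random Tree confusion matrix from Table 6 (rows: actual N, D, P).
random_tree = [
    [300, 5, 3],
    [1, 159, 9],
    [5, 5, 146],
]

mcc_by_class = {
    name: one_vs_rest_mcc(random_tree, k) for k, name in enumerate(["N", "D", "P"])
}
for name, value in mcc_by_class.items():
    print(name, round(value, 3))
```

Unlike accuracy, MCC uses all four cells of the one-vs-rest table, so it stays informative when the classes differ in size, as they do here (308, 169 and 156 instances).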

J48

The size of the tree is 8, with 5 leaves, and it takes 0.02 seconds to build the model. The measures are tabulated in Table 7 for reference. Of the 633 instances, 627 are classified correctly and 6 are classified incorrectly. The kappa statistic is 0.985, with a mean absolute error of 0.0107, a root mean squared error of 0.0794 and a relative absolute error of 2.5393%; the rest of the measures are given in Table 7.

Table 7: J48 and its Measures.

Detailed Accuracy By Class

Table 8: J48 and its Detailed Accuracy By Class.

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.997    0.006    0.994      0.997   0.995      0.991  0.997     0.997     N
0.988    0.002    0.994      0.988   0.991      0.988  0.994     0.981     D
0.981    -        0.981      0.981   0.981      0.974  0.985     0.953     P
0.991    0.005    0.991      0.991   0.991      0.986  0.993     0.982     Weighted Avg.

Confusion Matrix

The confusion matrix obtained for the J48 classifier is given in Table 9. Each row shows how the instances of one actual class (Normal, Diabetic or Prediabetic) were classified, from which the true positives, true negatives, false positives and false negatives of each class can be read.


Table 9: J48 Confusion Matrix

  a    b    c    Classified as
307    0    1    a = N
  0  167    2    b = D
  2    1  153    c = P

RESULTS AND DISCUSSION

The J48 classifier gives better accuracy than the Naïve Bayes and Random Tree classifiers. The accuracies of all the algorithms are given in Table 10 and charted in Fig 2.

Table 10: Accuracies of different Classifiers.

CLASSIFIER     ACCURACY
NAIVE BAYES    93.8389 %
RANDOM TREE    95.5766 %
J48            99.0521 %
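The accuracies in Table 10 are the correctly classified counts divided by the 633 instances. As a quick arithmetic check, the short Python sketch below recomputes them from the counts reported for each classifier and ranks the classifiers; the variable names are illustrative.

```python
TOTAL_INSTANCES = 633

# Correctly classified instances reported for each classifier in this paper.
correct_counts = {"Naive Bayes": 594, "Random Tree": 605, "J48": 627}

accuracies = {
    name: 100 * correct / TOTAL_INSTANCES for name, correct in correct_counts.items()
}
ranking = sorted(accuracies, key=accuracies.get, reverse=True)

for name in ranking:
    print(f"{name}: {accuracies[name]:.4f} %")
```

The ranking is J48 first, Random Tree second and Naïve Bayes third, and the gap between J48 and Naïve Bayes corresponds to 33 additional correctly classified instances.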

Fig 2: Accuracies of classifiers (bar chart; vertical axis: accuracy from 91.00% to 100.00%; horizontal axis: Naïve Bayes, Random Tree, J48).


CONCLUSION

The three classifiers were evaluated on the same data set, collected from the local health center, with the help of the Weka tool. Finally, the accuracy of each algorithm was measured and charted for performance analysis. The J48 classifier gives the best prediction accuracy, 99.0521%, while Random Tree stands second with 95.5766% and Naïve Bayes third with 93.8389%. In this paper, the data set is given directly as input to the classifiers. In future, the performance of the proposed methodology can be improved by pre-processing the data. Pre-processing performs data cleaning on the data set so that impure, missing, outdated and inconsistent data can be removed, which should improve the accuracy of diabetes prediction. Such accurate results can be used by healthcare professionals to predict diabetes at an earlier stage, so that lives can be saved.

REFERENCES

[1] http://archive.indianexpress.com/news/-50-million...india...diabetes-/1030869/

[2] Iyer, Aiswarya, S. Jeyalatha, and Ronak Sumbaly. "Diagnosis of diabetes using classification mining

techniques." arXiv preprint arXiv:1502.03774 (2015).

[3] Kumar, Velide Phani, and Lakshmi Velide. "A Data Mining Approach For Prediction And Treatment Of Diabetes Disease."

[4] Pham, Huy Nguyen Anh, and Evangelos Triantaphyllou. "Prediction of diabetes by employing a new

data mining approach which balances fitting and generalization." Computer and Information Science.

Springer Berlin Heidelberg, 2008. 11-26.

[5] Sanakal, Ravi, and Smt T. Jayakumari. "Prognosis of Diabetes Using Data mining Approach-Fuzzy C Means Clustering and Support Vector Machine." International Journal of Computer Trends and Technology 11.2 (2014): 94-8.

[6] Evirgen, Hayrettin, and Menduh Çerkezi. "Prediction and Diagnosis of Diabetic Retinopathy using Data

Mining Technique." The Online Journal of Science and Technology 4.3 (2014).

[7] http://www.emedicinehealth.com/diabetes.

[8] Rajesh, K., and V. Sangeetha. "Application of data mining methods and techniques for diabetes

diagnosis." International Journal of Engineering and Innovative Technology (IJEIT) 2.3 (2012).


[9] Sa-ngasoongsong, Akkarapol, and Jongsawas Chongwatpol. "An Analysis of Diabetes Risk Factors

Using Data Mining Approach." Oklahoma state university, USA (2012).

[10] Balakrishnan, Vimala, et al. "Predictions using data mining and case-based reasoning: A case study for

retinopathy." International Journal of Computer and Information Engineering 6 (2012): 73-76.

[11] Radha, P., and B. Srinivasan. "Predicting Diabetes by cosequencing the various Data Mining

Classification Techniques."

[12] Asha Gowda Karegowda, A.S. Manjunath, M.A. Jayaram, "Application Of Genetic Algorithm Optimized Neural Network Connection Weights For Medical Diagnosis Of Pima Indians Diabetes," International Journal on Soft Computing (IJSC), Vol.2, No.2, May 2011.

[13] Bagdi, Rupa, and Pramod Patil. "Diagnosis of Diabetes Using OLAP and Data Mining

Integration." International Journal of Computer Science & Communication Networks 2.3 (2012).

[14] Elma Kolce (Cela), Neki Frasheri, "A Literature Review of Data Mining Techniques used in Healthcare Databases", ICT Innovations 2012 Web Proceedings - Poster Session.

[15] Krishnaiah, VV Jaya Rama, et al. "Predicting the Diabetes using Duo Mining

Approach." International Journal of Advanced Research in Computer and Communication Engineering 1.6

(2012).

[16] Thirumal, P. C., and N. Nagarajan. "Utilization of Data Mining Techniques For Diagnosis Of Diabetes

Mellitus-A Case Study." (2006).

[17] Jothikumar R., Dr.Sivabalan R.V. (2015). Performance Analysis on Accuracies of Heart Disease

Prediction System Using Weka by Classification Techniques. AJBAS, 9(7), 741-749

[18] Jothikumar R, Dr.Sivabalan R.V. and Kumarasen A.S. Data Cleaning Using Weka For Effective Data

Mining In Health Care Industries. International Journal of Applied Engineering Research.10(30), 2015

[19] Jothikumar.R , Dr. Sivabalan.R.V. Efficient Data Pre-Processing For Data Mining Using Neural

Networks. Int. Journal of Scientific Research and Management Studies, 1(4), 118-123.


[20] Jothikumar.R, Dr. Sivabalan.R.V. E. Sivarajan. Accuracies of j48 weka classifier with different

supervised weka filters for predicting heart diseases, ARPN Journal of Engineering and Applied Sciences,

VOL. 10, NO. 17, September 2015 ISSN 1819-6608, Pg 7788-7793.

[21] Sathya S, Rajesh A, Manivannan R, Prediction of diabetes using Decision Trees, International Journal

of Applied Engineering Research ISSN 0973-4562 Volume 9, Number 24 (2014) pp. 27165-27178.

S. Sathya is working as an Assistant Professor in the Information Technology Program of C. Abdul Hakeem College of Engineering & Technology, Tamilnadu, India, and is currently pursuing her Ph.D. at St. Peter’s University, Chennai. She received her M.E. degree from S.A. Engineering College, Chennai, in June 2009 in the field of Computer Science & Engineering, and her B.Tech. degree from Priyadarshini Engg. College, Vaniyambadi, India, in April 2005 in the field of Information Technology.

A. Rajesh is a Professor & Head in the Computer Science & Engineering Program of C. Abdul Hakeem College of Engineering & Technology, Tamilnadu, India. He received his Ph.D. degree from Dr. M. G. R. Educational and Research Institute University, Chennai, India, in March 2011 and his M.E. degree from Sathyabama University, Chennai, India, in April 2005 in the field of Computer Science and Engineering. His areas of interest include Data Mining, Natural Language
