PERFORMANCE ANALYSIS ON DIABETES PREDICTION WITH
DIFFERENT CLASSIFICATION ALGORITHMS USING WEKA
Sathya S¹, Rajesh A²
¹Research Scholar, Computer Science & Engineering, St. Peter’s University, Chennai, India
²Professor, Computer Science & Engineering, C. Abdul Hakeem College of Engineering & Technology, Vellore, India
ARTICLE INFO
Article History:
Received 9th Nov, 2015
Received in revised form 12th Nov, 2015
Accepted 14th Nov, 2015
Published online 16th Nov, 2015
Keywords:
Diabetic
Data Mining
Weka
Naïve Bayes
Random Tree
J48
ABSTRACT
The main objective of this paper is to predict the likelihood of diabetes
using three classification algorithms, Naïve Bayes, Random Tree and J48,
and to compare their performance with one another. Many researchers have
studied diabetes prediction with different approaches, but none of these
approaches predicts accurately. The method proposed here is intended to
overcome this drawback. The data set was obtained from a local health center,
Arcot Digital X-Ray, E.C.G & Computerized Lab. It comprises 18 essential
attributes and 633 instances needed for diabetes prediction. The aim of this
work is to convert data into knowledge. The popular data mining tool WEKA
(Waikato Environment for Knowledge Analysis) is used to analyze the collected
data set with the Naïve Bayes, Random Tree and J48 classifiers. For each
algorithm, the correctly classified instances, incorrectly classified
instances, Kappa statistic, mean absolute error, root mean squared error,
relative absolute error and root relative squared error are measured for
comparison and analysis. The TP Rate, FP Rate, Precision, Recall, F-Measure,
MCC, ROC Area and PRC Area of each algorithm are also tabulated. Finally, the
accuracy of each algorithm is measured and charted for performance analysis.
The J48 classifier gives the best prediction accuracy, 99.0521%, followed by
Random Tree with 95.5766% and Naïve Bayes with 93.8389%.
INTERNATIONAL RESEARCH JOURNAL IN ADVANCED ENGINEERING AND TECHNOLOGY (IRJAET) www.irjaet.com
ISSN (PRINT) : 2454-4744 ISSN (ONLINE): 2454-4752
Vol. 1, Issue 4, pp.178 - 190, November, 2015
INTRODUCTION
The collected data set is given as input to the machine learning algorithms Naïve Bayes, Random Tree
and J48, and different measures are considered for comparison. The proposed concept is shown as a
block diagram in Fig 1. The training and test data are given as input to the algorithms, and the
results obtained are analyzed from different aspects.
Fig 1. Block diagram of the proposed model: the training and test data are fed to Naive Bayes, Random Tree and J48, and the accuracy of each is measured.
Diabetes
The World Health Organization (WHO) estimates that nearly 200 million people worldwide suffer
from diabetes, that this number is likely to double by 2030, and that 80% of diabetes deaths occur
in middle-income countries. In India, there are nearly 50 million diabetics, according to the statistics of the
International Diabetes Federation. As the incidence of diabetes rises, doctors say, there is a
proportionate rise in the complications associated with it [1]. The disease has been named
the fifth deadliest disease in the United States, with no imminent cure in sight [2]. It has many
complications, such as a higher risk of eye disease and a higher risk of kidney failure.
However, early detection of the disease and proper care management can make a difference [3].
According to the American Diabetes Association, 20.8 million children and adults in the United
States (approximately 7% of the population) have been diagnosed with diabetes. Thus, the ability to
diagnose diabetes early plays an important role in the patient’s treatment process [4]. Diabetes causes
sugar to build up in the blood, leading to complications such as heart disease, stroke, neuropathy, poor
circulation leading to loss of limbs, blindness, kidney failure, nerve damage, and death. General symptoms
of diabetes are increased thirst, increased urination, weight loss, increased appetite, fatigue, nausea and/or
vomiting, blurred vision, slow-healing infections and impotence in men [5]. Diabetes can result in
multi-organ failure, so it is necessary to predict it early and prevent it.
Diabetic retinopathy is the most common eye problem affecting people with diabetes; it usually
affects only people who have had diabetes for a long time and can result in blindness [6]. Diabetes
mellitus, or simply diabetes, is a set of related diseases in which the body cannot regulate the amount of
sugar in the blood [7]. It is a group of metabolic diseases in which a person has high blood sugar, either
because the body does not produce enough insulin or because cells do not respond to the insulin that is
produced [8].
There are two types of diabetes. Type I diabetes, also called Insulin Dependent Diabetes
Mellitus (IDDM) or Juvenile Onset Diabetes Mellitus, is commonly seen in children and young adults;
however, older patients do present with this form of diabetes on occasion. Type II diabetes is also called
Non-Insulin Dependent Diabetes Mellitus (NIDDM) or Adult Onset Diabetes Mellitus. Preventing
diabetes is an ongoing area of interest to the healthcare community [9]. Diabetes is one of the
most prevalent diseases worldwide and has a growing number of complications, of which retinopathy is
one of the most common. Diabetes is a major chronic disorder which has no cure [10].
MATERIALS AND METHODS
Data Mining And Weka
Data mining is the process of extracting hidden knowledge from large volumes of raw data [11].
It has been applied in various fields such as medicine, marketing and banking. In medicine,
predictive data mining is used to diagnose disease at an early stage and helps physicians in
treatment planning [12]. Much of the data collected by the medical and healthcare industry is not turned into
useful information for effective decision making. With data mining, doctors can predict which patients might
be diagnosed with diabetes [13]. Data mining is “the science of extracting useful information from large
databases” and is one of the tasks in the process of knowledge discovery from databases [14].
The amount of information in biomedical databases is growing so rapidly that researchers cannot
convert it into knowledge at the same pace [15]. In general, numerous tests must be conducted on a
patient to detect a disease. Data mining techniques are used in disease prediction to reduce the
number of tests and to increase the detection accuracy [16]. The popular data mining tool
“WEKA” is therefore used to analyze the collected data with classification methods.
Weka contains a collection of machine learning algorithms for data mining tasks and was introduced by
the University of Waikato in New Zealand. Weka is open source software issued under the GNU General
Public License.
Data sets used
S.No  Attribute    Meaning
1.    PID          PATIENT ID
2.    SEX          SEX
3.    AGE          AGE
4.    WEIGHT       WEIGHT
5.    BP           BLOOD PRESSURE
6.    TYPE         TYPE OF DIABETES
7.    FASTING      FASTING (EMPTY STOMACH)
8.    PP           POSTPRANDIAL
9.    A1C          GLYCOSYLATED Hb
10.   LP TOT       LIPID PROFILE TOTAL CHOLESTEROL
11.   HDL          HIGH DENSITY LIPOPROTEIN
12.   LDL          LOW DENSITY LIPOPROTEIN
13.   VLDL         VERY LOW DENSITY LIPOPROTEIN
14.   TGL          TRIGLYCERIDES
15.   CHL/LDL      RATIO OF CHOLESTEROL
16.   HEIGHT       HEIGHT OF PATIENT
17.   HERIDITORY   HEREDITY OF PATIENT
18.   CATEGORY     CATEGORY (Normal/Diabetes/Prediabetes)
Implementation Tool
"WEKA" stands for the Waikato Environment for Knowledge Analysis, a data mining system developed at the
University of Waikato in New Zealand [20]. WEKA is extensible and has become a collection of machine
learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost
every platform. Weka is open source software issued under the GNU General Public License. It is a
state-of-the-art facility for developing machine learning techniques and applying them to real-world
data mining problems [17]. The algorithms are applied directly to a dataset. Weka implements algorithms
for data preprocessing, classification, regression, clustering and association rules; it also includes
visualization tools [18]. By analysing historical data, strategic decisions can be made [19]. New machine
learning schemes can also be developed with this package. Data mining is the process of selecting,
exploring and modeling large amounts of data in order to discover unknown patterns or relationships that
provide a clear and useful result [21].
Implementation
All the algorithms are trained and tested using 10-fold cross-validation on the 633 instances and
18 attributes, to reduce bias in the results obtained.
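The partitioning behind 10-fold cross-validation can be sketched as follows. This is a minimal illustration in Python, not the Weka implementation: the function name `k_fold_indices` and the seed are assumptions made for the example, and details such as stratifying the folds by class are left out. Each of the 633 instances is used for testing exactly once and for training in the other nine rounds.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; it only shows how the
    instances are partitioned, not how a classifier is trained.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


# 633 instances, as in the data set described above.
fold_sizes = [len(test) for _, test in k_fold_indices(633, k=10)]
print(fold_sizes)
```

With 633 instances, three folds hold 64 instances and seven hold 63; the reported measures are then aggregated over the ten test folds.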
Naive Bayes
Naive Bayes takes 0 seconds to build the model. The measures are tabulated below for reference. Of the
633 instances, 594 are classified correctly and 39 are classified incorrectly. The kappa statistic of
Naïve Bayes is 0.9027, with a mean absolute error of 0.0447, a root mean squared error of 0.1873 and a
relative absolute error of 10.6132%; the rest of the measures are given in Table 1.
Table 1: Naïve Bayes and its Measures.
Measure                              Value       Percentage
Correctly Classified Instances       594         93.8389 %
Incorrectly Classified Instances     39          6.1611 %
Kappa statistic                      0.9027
Mean absolute error                  0.0447
Root mean squared error              0.1873
Relative absolute error              10.6132 %
Root relative squared error          40.8274 %
Coverage of cases (0.95 level)       97.3144 %
Mean rel. region size (0.95 level)   36.1769 %
Total Number of Instances            633
Detailed Accuracy By Class
The TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area and PRC Area for the three
classes Normal (N), Diabetic (D) and Prediabetic (P) are tabulated in Table 2.
Table 2: Naïve Bayes and its Detailed Accuracy By Class.
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  PRC Area  Class
0.981    0.006    0.993      0.981   0.987      0.996     0.997     N
0.852    0.015    0.954      0.852   0.900      0.975     0.961     D
0.949    0.063    0.831      0.949   0.886      0.974     0.874     P
0.938    0.023    0.943      0.938   0.939      0.985     0.957     Weighted Avg.
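The per-class figures can be recomputed from the confusion-matrix counts that the paper reports for Naïve Bayes (Table 3). The sketch below is an illustration only: the helper name `per_class_metrics` is an assumption, each class is treated one-vs-rest, and rows are taken as actual classes with columns as predicted classes. ROC Area and PRC Area are not covered because they need the predicted probabilities, which the paper does not list.

```python
# Naive Bayes confusion matrix as reported in the paper (Table 3).
# Rows are actual classes, columns are predicted classes (order: N, D, P).
LABELS = ["N", "D", "P"]
NB_MATRIX = [
    [302, 1, 5],
    [0, 144, 25],
    [2, 6, 148],
]


def per_class_metrics(matrix, labels):
    """Return one-vs-rest TP rate, FP rate, precision, recall and F-measure."""
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # actual class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp  # predicted class i, actually otherwise
        tn = total - tp - fn - fp
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        metrics[label] = {
            "tp_rate": recall,
            "fp_rate": fp / (fp + tn),
            "precision": precision,
            "recall": recall,
            "f_measure": 2 * precision * recall / (precision + recall),
        }
    return metrics


nb_metrics = per_class_metrics(NB_MATRIX, LABELS)
for label in LABELS:
    print(label, {name: round(value, 3) for name, value in nb_metrics[label].items()})
```

Rounded to three decimals, these values agree with the N, D and P rows of Table 2; the low precision for class P reflects the 25 diabetic instances predicted as prediabetic.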
Confusion Matrix
The confusion matrix obtained for the Naïve Bayes classifier is given in Table 3. It shows how the
instances of each actual class (Normal, Diabetic and Prediabetic) were classified, from which the true
positive, true negative, false positive and false negative counts for each class can be read.
Table 3: Naïve Bayes Confusion Matrix
a     b     c     <-- classified as
302   1     5     a = N
0     144   25    b = D
2     6     148   c = P
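The overall accuracy and the kappa statistic follow directly from this matrix. The sketch below assumes the standard Cohen's kappa definition (observed agreement corrected for the agreement expected from the row and column totals); the function name `accuracy_and_kappa` is chosen for this example.

```python
# Naive Bayes confusion matrix as reported in Table 3
# (rows: actual N, D, P; columns: predicted N, D, P).
NB_MATRIX = [
    [302, 1, 5],
    [0, 144, 25],
    [2, 6, 148],
]


def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(size)]
    # Agreement expected by chance if predictions were independent of the truth.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


accuracy, kappa = accuracy_and_kappa(NB_MATRIX)
print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
```

The diagonal sums to 594 of 633 instances, which reproduces the accuracy and kappa values listed for Naïve Bayes in Table 1.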
Random Tree
The size of the tree is 71 and it takes 0.1 seconds to build the model. The measures are tabulated below
for reference. Of the 633 instances, 605 are classified correctly and 28 are classified incorrectly.
The kappa statistic is 0.93, with a mean absolute error of 0.0289, a root mean squared error of 0.1626
and a relative absolute error of 6.866%; the rest of the measures are given in Table 4.
Table 4: Random Tree and its Measures.
Measure                              Value       Percentage
Correctly Classified Instances       605         95.5766 %
Incorrectly Classified Instances     28          4.4234 %
Kappa statistic                      0.93
Mean absolute error                  0.0289
Root mean squared error              0.1626
Relative absolute error              6.866 %
Root relative squared error          35.4535 %
Coverage of cases (0.95 level)       96.3665 %
Mean rel. region size (0.95 level)   33.9652 %
Total Number of Instances            633
Detailed Accuracy By Class
The TP Rate, FP Rate, Precision, Recall, F-Measure, MCC, ROC Area and PRC Area for the three
classes Normal (N), Diabetic (D) and Prediabetic (P) are tabulated in Table 5.
Table 5: Random Tree and its Detailed Accuracy By Class.
TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.974    0.018    0.980      0.974   0.977      0.956  0.981     0.974     N
0.941    0.022    0.941      0.941   0.941      0.919  0.971     0.916     D
0.936    0.025    0.924      0.936   0.930      0.907  0.961     0.901     P
0.956    0.021    0.956      0.956   0.956      0.934  0.973     0.940     Weighted Avg.
Confusion Matrix
The confusion matrix obtained for the Random Tree classifier is given in Table 6. It shows how the
instances of each actual class (Normal, Diabetic and Prediabetic) were classified, from which the true
positive, true negative, false positive and false negative counts for each class can be read.
Table 6: Random Tree Confusion Matrix
a     b     c     <-- classified as
300   5     3     a = N
1     159   9     b = D
5     5     146   c = P
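The MCC column reported for Random Tree can be checked against this matrix. The sketch below assumes the usual binary Matthews correlation coefficient applied to each class one-vs-rest; the helper name `mcc_per_class` is an assumption for this example.

```python
import math

# Random Tree confusion matrix as reported in Table 6
# (rows: actual N, D, P; columns: predicted N, D, P).
LABELS = ["N", "D", "P"]
RT_MATRIX = [
    [300, 5, 3],
    [1, 159, 9],
    [5, 5, 146],
]


def mcc_per_class(matrix, labels):
    """Return the one-vs-rest Matthews correlation coefficient per class."""
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        tn = total - tp - fn - fp
        denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        result[label] = (tp * tn - fp * fn) / denominator
    return result


rt_mcc = mcc_per_class(RT_MATRIX, LABELS)
print({label: round(value, 3) for label, value in rt_mcc.items()})
```

Unlike the TP rate, MCC uses all four cells of each one-vs-rest table, so it penalizes both missed and spurious predictions for a class.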
J48
The tree has a size of 8 with 5 leaves, and the model takes 0.02 seconds to build. The measures are
tabulated below for reference. Of the 633 instances, 627 are classified correctly and 6 are
classified incorrectly. The kappa statistic is 0.985, with a mean absolute error of 0.0107, a root mean
squared error of 0.0794 and a relative absolute error of 2.5393%; the rest of the measures are given in
Table 7.
Table 7: J48 and its Measures.
Measure                              Value       Percentage
Correctly Classified Instances       627         99.0521 %
Incorrectly Classified Instances     6           0.9479 %
Kappa statistic                      0.985
Mean absolute error                  0.0107
Root mean squared error              0.0794
Relative absolute error              2.5393 %
Root relative squared error          17.3039 %
Coverage of cases (0.95 level)       99.0521 %
Mean rel. region size (0.95 level)   33.3333 %
Total Number of Instances            633
Detailed Accuracy By Class
The TP Rate, FP Rate, Precision, Recall, F-Measure, MCC, ROC Area and PRC Area for the three
classes Normal (N), Diabetic (D) and Prediabetic (P) are tabulated in Table 8.
Table 8: J48 and its Detailed Accuracy By Class.
TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.997    0.006    0.994      0.997   0.995      0.991  0.997     0.997     N
0.988    0.002    0.994      0.988   0.991      0.988  0.994     0.981     D
0.981    0.006    0.981      0.981   0.981      0.974  0.985     0.953     P
0.991    0.005    0.991      0.991   0.991      0.986  0.993     0.982     Weighted Avg.
Confusion Matrix
The confusion matrix obtained for the J48 classifier is given in Table 9. It shows how the instances
of each actual class (Normal, Diabetic and Prediabetic) were classified, from which the true positive,
true negative, false positive and false negative counts for each class can be read.
Table 9: J48 Confusion Matrix
a     b     c     <-- classified as
307   0     1     a = N
0     167   2     b = D
2     1     153   c = P
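The "Weighted Avg." row of the per-class table can be reproduced from this matrix by weighting each class's measure by its number of actual instances. The sketch below makes that assumption explicit; the function name `weighted_averages` is chosen for this example.

```python
# J48 confusion matrix as reported in Table 9
# (rows: actual N, D, P; columns: predicted N, D, P).
J48_MATRIX = [
    [307, 0, 1],
    [0, 167, 2],
    [2, 1, 153],
]


def weighted_averages(matrix):
    """Average one-vs-rest measures, weighted by the size of each actual class."""
    total = sum(sum(row) for row in matrix)
    sums = {"tp_rate": 0.0, "fp_rate": 0.0, "precision": 0.0, "f_measure": 0.0}
    for i in range(len(matrix)):
        tp = matrix[i][i]
        support = sum(matrix[i])  # number of actual instances of class i
        fn = support - tp
        fp = sum(row[i] for row in matrix) - tp
        tn = total - tp - fn - fp
        recall = tp / support
        precision = tp / (tp + fp)
        sums["tp_rate"] += support * recall
        sums["fp_rate"] += support * fp / (fp + tn)
        sums["precision"] += support * precision
        sums["f_measure"] += support * 2 * precision * recall / (precision + recall)
    return {name: value / total for name, value in sums.items()}


j48_avg = weighted_averages(J48_MATRIX)
print({name: round(value, 3) for name, value in j48_avg.items()})
```

With this weighting, the average TP rate is identical to the overall accuracy (627 of 633), which is why the two figures coincide in the reported tables.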
RESULTS AND DISCUSSION
The J48 classifier gives better accuracy than the Naïve Bayes and Random Tree classifiers. The
accuracies of all the algorithms are given in Table 10 and charted in Fig 2.
Table 10: Accuracies of the different classifiers.
CLASSIFIER     ACCURACY
NAIVE BAYES    93.8389 %
RANDOM TREE    95.5766 %
J48            99.0521 %
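The relative error measures reported for the three classifiers can be related to the absolute ones. The sketch below assumes that the relative absolute error is the mean absolute error expressed as a percentage of the error of a baseline that predicts from the class prior, and likewise for the root relative squared error; this matches our understanding of Weka's evaluation output but is stated here as an assumption. Under it, all three classifiers should imply the same baseline error, since they share the data set.

```python
# Error measures as reported in the paper's Tables 1, 4 and 7.
REPORTED = {
    "Naive Bayes": {"mae": 0.0447, "rae_pct": 10.6132, "rmse": 0.1873, "rrse_pct": 40.8274},
    "Random Tree": {"mae": 0.0289, "rae_pct": 6.866, "rmse": 0.1626, "rrse_pct": 35.4535},
    "J48": {"mae": 0.0107, "rae_pct": 2.5393, "rmse": 0.0794, "rrse_pct": 17.3039},
}


def implied_baseline(measures):
    """Return the baseline MAE and RMSE implied by the relative measures."""
    baseline_mae = measures["mae"] / (measures["rae_pct"] / 100)
    baseline_rmse = measures["rmse"] / (measures["rrse_pct"] / 100)
    return baseline_mae, baseline_rmse


baselines = {name: implied_baseline(m) for name, m in REPORTED.items()}
for name, (b_mae, b_rmse) in baselines.items():
    print(f"{name}: baseline MAE ~ {b_mae:.3f}, baseline RMSE ~ {b_rmse:.3f}")
```

The implied baselines agree to within rounding (roughly 0.42 for the absolute error and 0.46 for the squared error), so the ranking by relative error is the same as the ranking by accuracy: J48, then Random Tree, then Naïve Bayes.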
Fig 2: Accuracies of classifiers (bar chart; x-axis: Naive Bayes, Random Tree, J48; y-axis: accuracy, 91.00% to 100.00%).
CONCLUSION
The three classifiers were evaluated on the same data set, collected from the local health center, with
the help of the Weka tool. The accuracy of each algorithm was measured and charted for performance
analysis. The J48 classifier gives the best prediction accuracy, 99.0521%, followed by Random Tree with
95.5766% and Naïve Bayes with 93.8389%. In this paper, the data set is given directly as input to the
classifiers. In future, the performance of the proposed methodology can be improved by pre-processing
the data. Pre-processing performs data cleaning on the data sets so that impure, missing, outdated and
inconsistent data are removed, which should improve the accuracy of diabetes prediction. Accurate
results can be used by healthcare professionals to predict diabetes at an earlier stage, so that lives
can be saved.
REFERENCES
[1] http://archive.indianexpress.com/news/-50-million...india...diabetes-/1030869/
[2] Iyer, Aiswarya, S. Jeyalatha, and Ronak Sumbaly. "Diagnosis of diabetes using classification mining
techniques." arXiv preprint arXiv:1502.03774 (2015).
[3] Kumar, Velide Phani, and Lakshmi Velide. "A Data Mining Approach For Prediction And Treatment
Of Diabetes Disease."
[4] Pham, Huy Nguyen Anh, and Evangelos Triantaphyllou. "Prediction of diabetes by employing a new
data mining approach which balances fitting and generalization." Computer and Information Science.
Springer Berlin Heidelberg, 2008. 11-26.
[5] Sanakal, Ravi, and Smt T. Jayakumari. "Prognosis of Diabetes Using Data mining Approach-Fuzzy C
Means Clustering and Support Vector Machine." International Journal of Computer Trends and
Technology 11.2 (2014): 94-8.
[6] Evirgen, Hayrettin, and Menduh Çerkezi. "Prediction and Diagnosis of Diabetic Retinopathy using Data
Mining Technique." The Online Journal of Science and Technology 4.3 (2014).
[7] http://www.emedicinehealth.com/diabetes.
[8] Rajesh, K., and V. Sangeetha. "Application of data mining methods and techniques for diabetes
diagnosis." International Journal of Engineering and Innovative Technology (IJEIT) 2.3 (2012).
[9] Sa-ngasoongsong, Akkarapol, and Jongsawas Chongwatpol. "An Analysis of Diabetes Risk Factors
Using Data Mining Approach." Oklahoma state university, USA (2012).
[10] Balakrishnan, Vimala, et al. "Predictions using data mining and case-based reasoning: A case study for
retinopathy." International Journal of Computer and Information Engineering 6 (2012): 73-76.
[11] Radha, P., and B. Srinivasan. "Predicting Diabetes by cosequencing the various Data Mining
Classification Techniques."
[12] Asha Gowda Karegowda, A.S. Manjunath, M.A. Jayaram, "Application Of Genetic Algorithm
Optimized Neural Network Connection Weights For Medical Diagnosis Of Pima Indians Diabetes,"
International Journal on Soft Computing (IJSC), Vol. 2, No. 2, May 2011.
[13] Bagdi, Rupa, and Pramod Patil. "Diagnosis of Diabetes Using OLAP and Data Mining
Integration." International Journal of Computer Science & Communication Networks 2.3 (2012).
[14] Elma Kolce (Cela), Neki Frasheri, "A Literature Review of Data Mining Techniques used in
Healthcare Databases", ICT Innovations 2012 Web Proceedings - Poster Session.
[15] Krishnaiah, VV Jaya Rama, et al. "Predicting the Diabetes using Duo Mining
Approach." International Journal of Advanced Research in Computer and Communication Engineering 1.6
(2012).
[16] Thirumal, P. C., and N. Nagarajan. "Utilization of Data Mining Techniques For Diagnosis Of Diabetes
Mellitus-A Case Study." (2006).
[17] Jothikumar R., Dr. Sivabalan R.V. (2015). Performance Analysis on Accuracies of Heart Disease
Prediction System Using Weka by Classification Techniques. AJBAS, 9(7), 741-749.
[18] Jothikumar R., Dr. Sivabalan R.V. and Kumarasen A.S. Data Cleaning Using Weka For Effective Data
Mining In Health Care Industries. International Journal of Applied Engineering Research, 10(30), 2015.
[19] Jothikumar R., Dr. Sivabalan R.V. Efficient Data Pre-Processing For Data Mining Using Neural
Networks. Int. Journal of Scientific Research and Management Studies, 1(4), 118-123.
[20] Jothikumar R., Dr. Sivabalan R.V., E. Sivarajan. Accuracies of J48 Weka classifier with different
supervised Weka filters for predicting heart diseases. ARPN Journal of Engineering and Applied Sciences,
Vol. 10, No. 17, September 2015, ISSN 1819-6608, pp. 7788-7793.
[21] Sathya S, Rajesh A, Manivannan R, Prediction of diabetes using Decision Trees, International Journal
of Applied Engineering Research ISSN 0973-4562 Volume 9, Number 24 (2014) pp. 27165-27178.
S. Sathya is working as an Assistant Professor in the Information Technology Program of
C. Abdul Hakeem College of Engineering & Technology, Tamilnadu, India, and is
currently pursuing her Ph.D at St. Peter’s University, Chennai. She received her M.E.
degree in Computer Science & Engineering from S.A. Engineering College, Chennai, in
June 2009, and her B.Tech. degree in Information Technology from Priyadarshini Engg.
College, Vaniyambadi, India, in April 2005.
A. Rajesh is a Professor & Head in the Computer Science & Engineering Program of C.
Abdul Hakeem College of Engineering & Technology, Tamilnadu, India, and he received
his Ph. D Degree from Dr. M. G. R. Educational and Research Institute University,
Chennai, India, in March 2011 and his M.E. Degree from Sathyabama University,
Chennai, India, in April 2005 in the field of Computer Science and Engineering. His area
of interests includes Datamining, Natural Language