Analyzing the Efficiency of Heart Diseases Prediction in Diabetic Patients Using Data Mining Techniques
K. Viswanathan (1) and P. Mayilvahanan (2)
(1) Vels University, Pallavaram, Chennai.
(2) Vels University, Pallavaram, Chennai.
Abstract The main objective of this research paper is to discuss
classification algorithms applied to different types of real-time medical
data sets and to compare their performance. The classification algorithms
that provide maximum accuracy on each kind of medical data set are taken as
the optimum result for the performance analysis report. The performance
analysis report comprises the most frequently used algorithms on the
respective medical data set and the most efficient classification algorithm
for analyzing the specific disease. A comparative study between SVM (support
vector machine) and KNN (K-nearest neighbor) is also carried out on
real-time data sets obtained from an esteemed hospital in Tamil Nadu.
Keywords: Data mining, classification, clustering, SVM, KDD, diabetes
data set, heart disease.
International Journal of Pure and Applied Mathematics, Volume 117 No. 7 2017, 207-218. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.
1. Data Mining
Data mining is the process of identifying actionable information in large
sets of pre-existing data. It involves mathematical analysis to discover
patterns and trends that already exist in the data set. The data mining
process primarily requires the following steps.
Defining Problem.
Preparation of Data.
Exploring Data.
Model Building.
Validating the Model.
Deploying Model.
Figure 1: Data Mining Process Flow
Two major categories of data analysis can be used to build the models.
Classification Model.
Prediction Model.
Classification - The main objective of classification is to accurately
predict the categorical class label of each record in the collected data.
A model can be built in two ways: supervised classification and
unsupervised classification.
Prediction - The main objective of prediction is to model continuous-valued
functions, that is, to predict unknown or missing values.
2. Primary Objective
The primary objectives of the current work are as follows.
Step 1: Collect a real-time heart disease medical data set from a reputed
hospital in Tamil Nadu.
Step 2: Preprocess the real-time data set.
Step 3: Apply the classification algorithms to the collected and preprocessed
heart disease data and record the accuracy.
Step 4: Based on the accuracy rate, identify the best algorithm for
predicting heart disease from the diabetes data.
3. Weka
Waikato Environment for Knowledge Analysis (Weka) is well-known, free,
open-source machine learning software written in Java. Weka was developed
at the University of Waikato, New Zealand, and supports several standard
data mining tasks, some of which are listed below.
Preprocessing of Data and Clustering.
Regression of Data and Classification.
Data Visualization.
Selection of features.
Figure 2: Weka Process Flow
A. Data Pre-Processing
Manual Pre-Processing
The data collected from a reputed hospital in Tamil Nadu is partly
'manually entered' and partly in an optional system data format. This data
set is manually integrated into the system. Data preprocessing is handled at
the time of integration into the real system; 60% of the data preprocessing
is done during integration.
Tool-Based Pre-Processing
Tool-based preprocessing follows the steps below.
Step 1: Convert the data set (.xls file) into CSV file format.
Step 2: Convert the .csv file into .arff file format using the command below.
Command
java -cp "C:\Program Files\Weka-3-8\weka.jar" weka.core.converters.CSVLoader "D:\Local path\Sample.csv" > E:\Sample.arff
Step 3: Select the "Preprocess" tab, open the data set in .arff file format,
choose the attribute fields in the data set that are required to build the
model, and execute it.
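The CSV-to-ARFF conversion in Step 2 can be illustrated with a minimal Python sketch. It is not Weka's CSVLoader; it only approximates the kind of file that step produces, assuming that all-numeric columns become numeric attributes and every other column becomes a nominal attribute. The function name csv_to_arff, the relation name and the three sample records are hypothetical and are not taken from the hospital data.

```python
import csv
import io


def csv_to_arff(csv_text, relation="sample"):
    """Convert CSV text (header row plus data rows) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            attr_type = "numeric"
        else:
            # Non-numeric columns become nominal attributes
            # that list their distinct values.
            attr_type = "{" + ",".join(sorted(set(values))) + "}"
        lines.append("@attribute " + name + " " + attr_type)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical three-record sample, for illustration only.
sample_csv = "age,cholesterol,heart_disease\n54,239,yes\n61,180,no\n47,204,no\n"
arff_text = csv_to_arff(sample_csv)
print(arff_text)
```

The sketch assumes attribute names without spaces and no missing values; real data would need quoting and a marker for missing entries.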
Figure 3: Preprocessing Using Weka
Figure 4: Data Visualization
4. Classification Algorithm
Researchers have proposed various classification methods. The basic
classification methods are listed below.
Support Vector Machine.
KNN.
Bayesian classification.
Decision tree.
C4.5.
SVM (Support Vector Machine)
A support vector machine is a supervised learning method used mostly for
classification and regression problems. SVMs are applied in data mining,
text mining and pattern recognition.
With a suitable kernel function, an SVM acts as a non-linear classifier, and
it is often reported to produce better classification results than other
methods. In this study, SVM gave the best results among the classifiers
compared.
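To make the idea of a maximum-margin classifier concrete, the following is a minimal sketch of a linear SVM trained by subgradient descent on the regularised hinge loss. It is a simplified stand-in, not the SVM implementation used in Weka and not a kernel SVM. The function names, learning rate, regularisation strength and the six toy points are all assumptions made for illustration.

```python
def train_linear_svm(points, labels, learning_rate=0.01, reg=0.01, epochs=100):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    Labels must be +1 or -1.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term always shrinks the weights slightly.
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if margin < 1.0:
                # The point is inside the margin or misclassified,
                # so move the boundary towards classifying it correctly.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, linearly separable toy data.
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [svm_predict(w, b, x) for x in points]
print(predictions)
```

Updating the weights only for points with a margin below 1 is what pushes the boundary away from both classes instead of merely separating them.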
K-NN
K-NN is an instance-based learning method: a simple algorithm that can be
used for both classification and regression. The main advantages of K-NN
are listed below.
Easy implementation.
Minimal cost.
Robustness.
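The instance-based nature of K-NN can be shown in a few lines: there is no training step, and a new record is labelled by a majority vote among its k closest stored records. The sketch below assumes Euclidean distance and k = 3. The function name knn_predict, the toy points and the "low"/"high" labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of records.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (6, 5)))
```

On real medical attributes measured on different scales (for example age and cholesterol), the features would normally be normalised first so that no single attribute dominates the distance.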
Decision Tree
The major step is to identify the best split variable and the best split
criterion. Once a split is made, the same procedure is applied to each
resulting segment to drill down further.
1. Select the leaf node.
2. Find the best splitting attribute value.
3. Split the node using that attribute.
4. Go to each child node and repeat steps 2 and 3.
a) Stopping Criteria
1. Each leaf node contains examples of one type.
2. The algorithm runs out of attributes.
3. No further significant information gain.
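The step of finding the best splitting attribute can be sketched as follows, assuming that information gain (the reduction in entropy) is the split criterion, as the stopping criteria above suggest. The attribute names, the four records and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, attributes):
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical records, for illustration only.
rows = [
    {"chest_pain": "yes", "smoker": "yes"},
    {"chest_pain": "yes", "smoker": "no"},
    {"chest_pain": "no", "smoker": "yes"},
    {"chest_pain": "no", "smoker": "no"},
]
labels = ["disease", "disease", "healthy", "healthy"]

print(best_split(rows, labels, ["chest_pain", "smoker"]))
```

In this toy data, splitting on chest_pain yields pure child nodes, so the first stopping criterion is met after one split, while splitting on smoker gives no gain at all.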
5. Overview – Diabetes – Heart Disease
Diabetes and Symptoms
Diabetes is a group of metabolic diseases in which blood sugar levels remain
high over a prolonged period. There are two types of diabetes: Type 1
(insulin-dependent) diabetes and Type 2 (non-insulin-dependent) diabetes.
Some of the common symptoms of diabetes are:
Loss/gain of weight.
Blurred vision.
Itchy skin.
Polyphagia.
Polyuria.
Fatigue.
Heart Disease and Symptoms
Heart disease involves narrowed or blocked blood vessels, which might lead to
a heart attack or chest pain (angina). The following factors contribute most
to heart disease.
Blood sugar.
Smoking.
Age.
High/low Depression.
Low/High cholesterol.
Figure 5: Symptoms for Diabetes
Some heart disease symptoms are listed below.
Irregular heartbeats.
Sweating, Indigestion.
Pain in the chest.
6. Diabetes Data Set for Heart Disease
Table 1: Data Set Attribute
SNO | Attribute
1 | Age
2 | Heart Rate
3 | Chest Pain
4 | Obesity
5 | Blood Sugar
6 | High BP / Low BP
7 | Cholesterol
8 | BMI (Body Mass Index)
9 | Triceps skin fold thickness
10 | Number of times pregnant
11 | Plasma glucose concentration
12 | Urea
Classification Matrix
The classification matrix compares the actual values in the test data set
with the values predicted by the trained model.
Table 2: Confusion Matrix
Actual Value | Predicted Value 0 | Predicted Value 1
0 | TP: the actual condition is positive and it is TRULY predicted POSITIVE | FN: the actual condition is positive and it is FALSELY predicted NEGATIVE
1 | FP: the actual condition is negative and it is FALSELY predicted POSITIVE | TN: the actual condition is negative and it is TRULY predicted NEGATIVE
Where
TP- True Positive.
FP - False Positive.
TN – True Negative.
FN – False Negative.
The following formula is used to compute the prediction accuracy.
Accuracy rate = (TP + TN) / (TP + FP + TN + FN)
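As a worked example of the accuracy rate, the sketch below counts the four confusion matrix cells from paired actual and predicted labels and then applies the formula. It assumes, following the layout of Table 2, that label 0 marks the positive condition. The eight label pairs and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=0):
    """Count TP, FN, FP and TN for paired actual and predicted labels."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1
        elif a == positive:
            counts["FN"] += 1
        elif p == positive:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def accuracy_rate(counts):
    """Accuracy rate = (TP + TN) / (TP + FP + TN + FN)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (counts["TP"] + counts["TN"]) / total


# Hypothetical test-set labels, for illustration only.
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = accuracy_rate(counts)
print(counts, f"{accuracy:.2%}")
```

Here six of the eight records are classified correctly (TP = 3, TN = 3), so the accuracy rate is 6 / 8 = 75%.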
7. Performance Comparison for Classification Algorithm
The analysis report for the various algorithms is given below.
Table 3: Performance Analysis Report
Algorithm Applied | Accuracy Rate for Prediction (%)
SVM | 92.11
k-NN | 86.15
Decision Tree | 83.56
C4.5 | 82.59
Figure 6: Performance Analysis Report
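Selecting the best algorithm from the accuracy rates amounts to ranking the reported values. The short sketch below uses the figures from Table 3; the variable names are illustrative.

```python
# Accuracy rates (%) as reported in Table 3.
accuracy_by_algorithm = {
    "SVM": 92.11,
    "k-NN": 86.15,
    "Decision Tree": 83.56,
    "C4.5": 82.59,
}

# Rank the algorithms from highest to lowest accuracy.
ranked = sorted(accuracy_by_algorithm.items(), key=lambda item: item[1], reverse=True)
best_name, best_accuracy = ranked[0]

for name, rate in ranked:
    print(f"{name:<14}{rate:6.2f}")
print("Best algorithm:", best_name)
```

SVM ranks first, ahead of k-NN by roughly six percentage points.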
8. Conclusion
A performance comparison of the SVM and K-NN classifiers on diabetes and
heart disease data sets was recorded to identify an effective methodology
for prediction. The WEKA tool was used to build and test the optimum model
for prediction. A real-time comparison of the classifier algorithms was
performed using different performance evaluation measures, which are
recorded in the performance analysis report. The maximum classification
accuracies of the SVM and K-NN classifiers were found to be 92.11% and
86.15%, respectively. Based on the comparative study, it is concluded that,
in terms of classification accuracy rate, the SVM classifier is more
efficient than K-NN and the other classifiers compared for heart disease
prediction.