
Analyzing the Efficiency of Heart Diseases Prediction in Diabetic Patients Using Data Mining Techniques

K. Viswanathan (1) and P. Mayilvahanan (2)

(1) Vels University, Pallavaram, Chennai. [email protected]

(2) Vels University, Pallavaram, Chennai. [email protected]

Abstract

The main objective of this paper is to discuss classification algorithms applied to different types of real-time medical data sets and to compare their performance. The classification algorithms that give the highest accuracy on each kind of medical data set are taken as the optimum result for the performance analysis report. The report comprises the most frequently used algorithms on the respective medical data set and the most efficient classification algorithm for analyzing the specific disease. A comparative study of SVM (support vector machine) and KNN (k-nearest neighbor) is also carried out on real-time data sets obtained from an esteemed hospital in Tamil Nadu.

Keywords: Data Mining, classification, clustering, SVM, KDD, diabetics data set, heart disease.

International Journal of Pure and Applied Mathematics, Volume 117, No. 7, 2017, 207-218. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.


1. Data Mining

Data mining is the process of identifying actionable information in large sets of pre-existing data. It uses mathematical analysis to derive the patterns and trends that exist in a data set. The data mining process primarily requires the following steps.

Defining Problem.

Preparation of Data.

Exploring Data.

Model Building.

Validating the Model.

Deploying Model.

Figure 1: Data Mining Process Flow

There are two major categories of data analysis that can be used to build the models.

Classification Model.

Prediction Model.

Classification – The main objective of classification is to accurately predict the class to which each record in the collected data belongs. Classification can be either supervised or unsupervised.

Prediction – The main objective of prediction is to model continuous-valued functions, that is, to predict unknown or missing numeric values.

2. Primary Objective

The primary objectives of the current work are listed below.


Step 1: Collect the real-time heart disease medical data set from a reputed hospital in Tamil Nadu.

Step 2: Preprocess the real-time data set.

Step 3: Apply the classification algorithms to the preprocessed heart disease data and record the accuracy.

Step 4: Based on the accuracy rate, identify the best algorithm for predicting heart disease from the diabetes data.

3. Weka

Waikato Environment for Knowledge Analysis (Weka) is well-known, free, open-source machine learning software written in Java. Weka was developed at the University of Waikato, New Zealand, and supports several standard data mining tasks, a few of which are listed below.

Preprocessing of Data and Clustering.

Regression of Data and Classification.

Data Visualization.

Selection of features.

Figure 2: Weka Process Flow

A. Data Pre-Processing

Manual Pre-Processing

The data collected from a reputed hospital in Tamil Nadu were in 'manually entered' form, with an optional system data format. This data set was integrated into the system manually. Data preprocessing was taken care of during integration into the real system; 60% of the data preprocessing was done at that stage.

Tool-Based Pre-Processing

Tool-based preprocessing follows the steps below.

Step 1: Convert the data set (.xls file) into CSV file format.


Step 2: Convert the .csv file into .arff file format using the command below.

Command

java -cp "C:\Program Files\Weka-3-8\weka.jar" weka.core.converters.CSVLoader "D:\Local path\Sample.csv" > E:\Sample.arff

Step 3: Select the "Preprocess" tab, open the data set in .arff file format, choose the attribute fields that are required to build the model, and execute it.
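To make the CSV-to-ARFF conversion in Step 2 more concrete, the following is a minimal Python sketch of what such a converter produces. It is not Weka's CSVLoader and does not reproduce its type inference. It assumes that a column is numeric when every non-empty value parses as a number and nominal otherwise, that attribute names contain no spaces, and that empty cells are missing values. The column names and values in the sample are hypothetical.

```python
import csv
import io


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="sample"):
    """Convert CSV text (header row first) into ARFF text.

    Assumption: a column is 'numeric' if every non-empty value parses
    as a float; otherwise it is nominal with the observed values.
    Empty cells are written as '?', the ARFF missing-value marker.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        values = [r[i] for r in data if r[i] != ""]
        if values and all(_is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            nominal = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    for r in data:
        lines.append(",".join(v if v != "" else "?" for v in r))
    return "\n".join(lines) + "\n"


# Hypothetical three-record sample; not taken from the hospital data set.
sample = "age,chest_pain,heart_disease\n63,yes,yes\n45,no,no\n52,,yes\n"
arff = csv_to_arff(sample, relation="heart")
print(arff)
```

The output has the three ARFF parts that Weka expects: a relation name, one attribute declaration per column, and the data rows.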

Figure 3: Preprocessing Using Weka

Figure 4: Data Visualization


4. Classification Algorithm

Researchers have proposed various classification methods. The basic classification methods are listed below.

Support Vector Machine.

KNN.

Bayesian classification.

Decision tree.

C4.5.

SVM (Support Vector Machine)

A support vector machine is a supervised learning method that is mostly used for classification and regression problems. SVM is applied in data mining, text mining and pattern recognition.

With a suitable kernel function it acts as a non-linear classifier, and it is often reported as producing better classification results than other methods. SVM delivers good, optimal solutions.
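As an illustration of how an SVM separates two classes, here is a minimal Python sketch of a linear SVM trained by stochastic sub-gradient descent on the regularised hinge loss. This is a simplified stand-in chosen for readability, not the Weka implementation used in this work, and it has no kernel, so it only handles linearly separable data. The parameter values (lam, epochs, lr) and the two-feature toy data are assumptions made for the example.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Train a linear SVM by stochastic sub-gradient descent.

    X: list of feature lists; y: labels in {-1, +1}.
    Per-sample objective: lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge term is active.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Only the regulariser contributes to the sub-gradient.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, already-scaled toy features; -1 = no disease, +1 = disease.
X = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print(svm_predict(w, b, [0.1, 0.1]), svm_predict(w, b, [0.9, 0.9]))
```

The learned weights define a separating hyperplane; new records are classified by the side of the hyperplane on which they fall.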

K-NN

K-NN is a type of instance-based learning. It is a simple algorithm that can be used for both classification and regression. The main advantages of K-NN are listed below.

Easy implementation.

Minimal cost.

Robustness.
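The instance-based idea behind K-NN can be sketched in a few lines of Python: a new record takes the majority label of its k closest training records. The sketch below assumes Euclidean distance, k = 3 and already-scaled features; the paper does not state which distance or value of k was used, and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled toy features with a yes/no disease label.
train_X = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_X, train_y, [0.25, 0.2]))
print(knn_predict(train_X, train_y, [0.75, 0.85]))
```

There is no training step: all of the work happens at prediction time, which is why the method is easy to implement but slows down as the training set grows.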

Decision Tree

The major step is to identify the best split variable and the best split criterion. Once a split has been made, the same procedure is applied to each resulting segment to drill down further.

1. Select the leaf node.

2. Find the best splitting attribute value.

3. Split the node using that attribute.

4. Go to each child node and repeat steps 2 and 3.

a) Stopping Criteria

5. Each leaf node contains examples of one type.

6. The algorithm runs out of attributes.

7. There is no further significant information gain.
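Step 2 of the procedure above, finding the best splitting attribute, can be sketched as follows. The sketch assumes that information gain (entropy reduction) is the split criterion and that attributes are nominal; the text names information gain only as a stopping criterion, so treating it as the split measure is an assumption. The attribute names and records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on a nominal attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


def best_split(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical records: chest_pain separates the classes, smoker does not.
rows = [
    {"chest_pain": "yes", "smoker": "yes"},
    {"chest_pain": "yes", "smoker": "no"},
    {"chest_pain": "no", "smoker": "yes"},
    {"chest_pain": "no", "smoker": "no"},
]
labels = ["yes", "yes", "no", "no"]

print(best_split(rows, labels, ["chest_pain", "smoker"]))
```

When the best available gain is close to zero, splitting no longer helps, which corresponds to stopping criterion 7.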

5. Overview – Diabetes – Heart Disease

Diabetes and Symptoms

Diabetes is a group of metabolic diseases in which blood sugar levels remain high over a prolonged period. There are two types of diabetes: Type 1 (insulin-dependent diabetes) and Type 2 (non-insulin-dependent diabetes). Some of the common symptoms of diabetes are:

Loss/gain of weight.


Blurred vision.

Itchy skin.

Polyphagia.

Polyuria.

Fatigue.

Heart Disease and Symptoms

Heart disease involves narrowed or blocked blood vessels, which can lead to a heart attack or chest pain (angina). Heart disease is mostly brought on by the following factors.

Blood sugar.

Smoking.

Age.

High/low Depression.

Low/High cholesterol.

Figure 5: Symptoms for Diabetes

Some heart disease symptoms are listed below.

Irregular heartbeats.

Sweating, Indigestion.

Pain in the chest.


6. Diabetes Data Set for Heart Disease

Table 1: Data Set Attributes

SNO   Attribute
1     Age
2     Heart Rate
3     Chest Pain
4     Obesity
5     Blood Sugar
6     High BP / Low BP
7     Cholesterol
8     BMI (Body Mass Index)
9     Triceps skin fold thickness
10    Number of times pregnant
11    Plasma glucose concentration
12    Urea

Classification Matrix

It compares the actual values in the test data set with the values predicted by the trained model.

Table 2: Confusion Matrix

Actual Value | Predicted Value 0 | Predicted Value 1
0 | TP: the actual condition is positive and it is truly predicted positive | FN: the actual condition is positive and it is falsely predicted negative
1 | FP: the actual condition is negative and it is falsely predicted positive | TN: the actual condition is negative and it is truly predicted negative

Where

TP- True Positive.

FP - False Positive.

TN – True Negative.

FN – False Negative.

The following formula is used to compute the accuracy of prediction.

Accuracy rate = (TP + TN) / (TP + FP + TN + FN)
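As a worked illustration of the accuracy formula, the Python sketch below counts TP, TN, FP and FN from paired actual and predicted labels and then applies the formula. The label encoding (1 for the positive class) and the eight example labels are assumptions made for the illustration; they are not results from this study.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy_rate(tp, tn, fp, fn):
    """Accuracy rate = (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical labels for eight test records.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = accuracy_rate(tp, tn, fp, fn)
print(tp, tn, fp, fn, accuracy)
```

Here six of the eight records are classified correctly (TP = 3, TN = 3, FP = 1, FN = 1), so the accuracy rate is 6 / 8 = 0.75.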

7. Performance Comparison for Classification Algorithm

The analysis report for the various algorithms is given below.


Table 3: Performance Analysis Report

Algorithm Applied | Accuracy Rate for Prediction (%)
SVM | 92.11
k-NN | 86.15
Decision Tree | 83.56
C4.5 | 82.59

Figure 6: Performance Analysis Report

8. Conclusion

The performance of the SVM and K-NN classifiers on the diabetes and heart disease data sets was recorded in order to obtain an effective methodology for prediction. The WEKA tool was used to build and test the optimum model for prediction. The classifiers were compared using performance evaluation measures, and the results are recorded in the performance analysis report. The maximum classification accuracies of the SVM and K-NN classifiers were found to be 92.11% and 86.15%, respectively. Based on the comparative study, it is concluded that, in terms of classification accuracy, the SVM classifier is more efficient for heart disease prediction than K-NN and the other classifiers considered.
