

IV-21 ISSN 2085-1944

ASSOCIATION ANALYSIS USING APRIORI ALGORITHM FOR

IMPROVING PERFORMANCE OF NAIVE BAYES CLASSIFIER

Indri Sudanawati Rozas1, Jeany Harmoejanto, Elly Antika, Umi Sa’adah, Ghaluh Indah Permatasari, Susiana Sari, Agus Zainal Arifin1

1 Jurusan Teknik Informatika, Fakultas Teknologi Informasi, Institut Teknologi Sepuluh Nopember Kampus ITS Keputih – Sukolilo, Surabaya, 60111

email : [email protected], [email protected], [email protected], [email protected], [email protected] , [email protected], [email protected]

ABSTRACT

The Naïve Bayes classifier assumes that each attribute is independent of the others given the class Y. It is called naïve because this assumption is quite difficult to satisfy in real life, and when attributes are in fact dependent, the independence assumption reduces accuracy.

This study proposes a new method to improve the accuracy of the Naive Bayes classifier based on an analysis of dependence between attributes in the data. The Apriori algorithm is used to analyze dependence between attributes in the training data. Experiments were carried out by first calculating the baseline accuracy of Naive Bayes on the uncorrected data, and then comparing it with the accuracy obtained when using Naive Bayes Apriori.

The experimental results show that the Naive Bayes Apriori method is, in general, able to increase classification accuracy. The two variants used, Apriori with two value bands and with three value bands, increased the average accuracy by 4.6% and 4.5%, respectively.

Keywords: Naïve Bayes Classifier, dependencies, Apriori, Association.

1 INTRODUCTION

Bayesian classification is based on Bayes' theorem, a mathematical method for calculating a conditional (posterior) probability: the probability of an event Y given that X is known to have occurred, denoted P(Y | X). In Bayesian classification, Bayes' theorem is used to calculate the probability that a data record belongs to a particular class, based on inference from existing data.

Based on the relationship between data attributes, there are two Bayesian classification methods, namely Naive Bayes and the Bayesian Belief Network [1]. Naïve Bayes assumes that each attribute is independent of the other attributes given the class Y. It is called naive because this assumption is quite difficult to achieve in real life. Nevertheless, the method has proved to have reasonably high accuracy for most cases [1]. However, when attributes are in fact dependent, the independence assumption reduces accuracy [1, 2].

Much research has tried to fix the Naïve Bayes algorithm in order to overcome the decrease in accuracy caused by dependencies between data attributes. In general there are two groups of approaches. The first group tries to find a subset of attributes that have a dependency and then corrects their values by taking that dependence into account. The second group broadens the approach by extending the dependence structure of Naïve Bayes itself [3].

The first group of approaches to fixing Naïve Bayes is divided into two parts [4]: combining elements and removing elements. Example research from the first group includes removal of elements [5] and combination of elements [6, 7], whereas the second group, which expands the structure, includes [3, 8].

This research belongs to the first group, improving Naïve Bayes by considering the dependence between attributes. Where Yager's research [7] used Ordered Weighted Averaging (OWA) aggregation operators, this study analyzes dependencies among attributes using the Apriori association algorithm.

This study proposes a new method to improve the data based on inter-attribute dependence analysis. The Apriori algorithm is used to analyze the dependence between attributes. The first step is to analyze dependencies between attributes in the training data; then the records that do not follow the resulting rules are refined according to those rules.

The 6th International Conference on Information & Communication Technology and Systems

This paper is organized as follows. Chapter 1, the Introduction, identifies the problem, discusses previous studies that try to improve the accuracy of the Naive Bayes classifier, and presents the new method proposed in this research. Chapter 2 discusses the calculation of the Naive Bayes classifier, the Naive Bayes Apriori model, and the experimental steps. Chapter 3 presents the results and discussion. Chapter 4 contains the conclusions and the gaps that remain for future research.

2 MODEL

This section discusses the models that form the basis of the research: the formulas used to calculate the Naïve Bayes classifier, the steps of the Apriori association algorithm, the proposed Naïve Bayes association model, and the design of the experiment.

2.1 Naïve Bayes Classifier

The Naive Bayes classifier as presented by Tan, Pang-Ning [1] uses the following equation:

P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)},  (1)

where each record X = (X1, X2, ..., Xd) consists of d attributes. Because the data sets contain continuous data, the value of P(Xi | Y) is calculated with the following Gaussian equation:

P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left( -\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right),  (2)

where \sigma_{ij} is the standard deviation of each attribute in the training data, x_i is the attribute value of the test data, and \mu_{ij} is the mean value of each attribute in the training data.
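As a sketch of how equations (1) and (2) combine in practice, the following minimal Python example scores a record against two classes and picks the more probable one. This is not the authors' Matlab implementation; the class priors, means, and standard deviations below are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Equation (2): Gaussian likelihood P(X_i | Y_j)."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def naive_bayes_score(x, prior, means, stds):
    """Numerator of equation (1): P(Y) * prod_i P(X_i | Y).
    P(X) is the same for every class, so it can be ignored when comparing."""
    score = prior
    for xi, m, s in zip(x, means, stds):
        score *= gaussian_pdf(xi, m, s)
    return score

# Toy two-class example (classes 2 and 4, as in the WBCD labels)
# with hypothetical per-class statistics.
classes = {
    2: {"prior": 0.65, "means": [3.0, 1.5], "stds": [1.0, 0.8]},
    4: {"prior": 0.35, "means": [7.0, 6.0], "stds": [2.0, 2.5]},
}
x = [6.5, 5.0]
predicted = max(
    classes,
    key=lambda c: naive_bayes_score(
        x, classes[c]["prior"], classes[c]["means"], classes[c]["stds"]
    ),
)
```

Because P(X) in the denominator of (1) is identical for all classes, the sketch compares only the numerators.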

2.2 Naïve Bayes Apriori

The Apriori association stage proposed in this research is performed during training to improve the training data before it enters the Naïve Bayes calculation. The steps for improving the training data are:
- Data Cleaning and Transforming
- Association Rule
- Reduction Algorithm
- Data Refining

These steps are explained in the following paragraphs.

1. Data Cleaning and Transforming Process

Data cleaning is done by deleting records that do not have complete attribute values. For example, if there are 10 attributes in the case study and some records have values for only nine of them, those records are not included in the next process, Data Transforming.

Data transforming is done by mapping the value of each attribute into three groups: Low, Medium, and High. The range of each attribute is 1-10, so the values can be grouped as 1-3 = Low, 4-7 = Medium, 8-10 = High. The steps taken are represented in Figure 1.

ID       1  2  3  4  5  6  7  8  9  Class
1000025  5  1  1  1  2  1  3  1  1  2
1002945  5  4  4  5  7  10 3  2  1  2
1015425  3  1  1  1  2  2  3  1  1  2
1016277  6  8  8  1  3  4  3  7  1  2
1017023  4  1  1  3  2  1  3  1  1  2

a.

ID       1   2   3   4   5   6   7   8   9   Class
1000025  1L  2L  3L  4L  5L  6L  7L  8L  9L  2
1002945  1L  2L  3L  4L  5H  6H  7L  8L  9L  2
1015425  1L  2L  3L  4L  5L  6L  7L  8L  9L  2
1016277  1H  2H  3H  4L  5L  6L  7L  8H  9L  2
1017023  1L  2L  3L  4L  5L  6L  7L  8L  9L  2

b.

Figure 1. a. The original data; b. Data transformation results.
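The transformation step can be sketched in Python as follows, using the three-band grouping stated above (1-3 = Low, 4-7 = Medium, 8-10 = High) and column-tagged labels like those in Figure 1b. This is an illustrative sketch, not the authors' code; note that Figure 1b itself reflects the two-band variant described later in the paper, so mid-range values are labeled differently here.

```python
def transform(value, attr_index):
    """Map a raw 1-10 attribute value to a Low/Medium/High band, tagged with
    its column index (e.g. value 1 in column 2 -> '2L')."""
    if value <= 3:
        band = "L"
    elif value <= 7:
        band = "M"
    else:
        band = "H"
    return f"{attr_index}{band}"

row = [5, 1, 1, 1, 2, 1, 3, 1, 1]  # record 1000025 from Figure 1a
transformed = [transform(v, i + 1) for i, v in enumerate(row)]
```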

2. Association Rule

The next step is to find associations between the existing attributes, often referred to as rules. An example of such a rule is 2L → 9L, which means that in the majority of the data, if the second attribute is Low, then the 9th attribute is also Low.

Figure 2. The steps to generate rules

Rules are generated using the Apriori algorithm; the steps taken are depicted in Figure 2. The first step in finding association rules is to generate large itemsets. Large itemsets are attribute values that occur frequently, i.e. whose frequency of occurrence is >= the given support value. The equation used is:

f(x) = \frac{n_x}{Y},  (3)

where f(x) is the relative frequency of occurrence of x, n_x is the number of occurrences of x in the data, and Y is the total number of records.
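A minimal sketch of the large-itemset step, assuming records are represented as sets of transformed labels such as "2L". The support test is equation (3)'s f(x) = n_x / Y compared against a threshold; the records and threshold below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(records, min_support):
    """Keep single items, and pairs of those items, whose relative frequency
    n_x / Y meets the support threshold (equation (3))."""
    total = len(records)
    counts = Counter(item for rec in records for item in rec)
    large_1 = {i for i, n in counts.items() if n / total >= min_support}
    pair_counts = Counter()
    for rec in records:
        # Apriori pruning: only combine items that are themselves frequent.
        for a, b in combinations(sorted(set(rec) & large_1), 2):
            pair_counts[(a, b)] += 1
    large_2 = {p for p, n in pair_counts.items() if n / total >= min_support}
    return large_1, large_2

records = [
    {"2L", "9L"}, {"2L", "9L"}, {"2L", "9L"},
    {"2H", "9L"}, {"2L", "9H"},
]
ones, pairs = frequent_itemsets(records, min_support=0.6)
```

The full algorithm would continue extending frequent pairs to triples and beyond, exactly as described in the next paragraph, until no combination meets the support.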


The second step is to enumerate all possible combinations of rules over the entire set of available large itemsets. The frequency of each combination in the data is computed and compared with the support value; if it is smaller, the combination is not used. Combinations of two values, such as 2L with 3L, are extended with a third value, a fourth, and so on, until no more combinations can be found that meet the support.

3. Reduction Algorithm

After all rules are obtained, the third step is to reduce them with the Reduction Algorithm. The algorithm is applied to obtain the minimal group of rules (r) within the rule group (R) by repeatedly checking R for redundant rules. The algorithm used to reduce rules is shown in Figure 3.

Figure 3. Reduction algorithm

The example of applying the Reduction Algorithm to the rules shown in Figure 4 is as follows. Of the 10 rules obtained from the WEKA association analysis, rule number 1 (4=4L, 8=8L ==> 9=9L) can be reduced to two representative rules, namely rule number 7 (8=8L ==> 9=9L) and rule number 8 (4=4L ==> 9=9L). By the same step, rule number 3 can also be represented by rules number 5 and number 7.
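The redundancy check illustrated above can be sketched as follows, under the assumption (matching the Figure 4 example) that a multi-item rule is redundant when every single-item rule with the same consequent already exists in the rule set. The actual algorithm in Figure 3 is not reproduced in the text, so this is a hypothetical reconstruction.

```python
def reduce_rules(rules):
    """Drop a rule A => c with |A| > 1 when each {a} => c (a in A) is already
    present, as with rule 1 vs rules 7 and 8 in the Figure 4 example."""
    singles = {(next(iter(a)), c) for a, c in rules if len(a) == 1}
    return [
        (a, c)
        for a, c in rules
        if len(a) == 1 or not all((item, c) in singles for item in a)
    ]

rules = [
    (frozenset({"4L", "8L"}), "9L"),  # rule 1: covered by rules 7 and 8
    (frozenset({"8L"}), "9L"),        # rule 7
    (frozenset({"4L"}), "9L"),        # rule 8
]
minimal = reduce_rules(rules)
```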

Figure 4. Analysis Result of WBCD data association

4. Data Refining

After obtaining the minimal rules in the third step, the last step is to refine the data so that it conforms to the minimal rules. For example, for the rule 2L → 9L, in every record where attribute 2 is Low, attribute 9 is also changed to Low. Data refining is shown in Figure 5.

id       1  2  3  4  5  6  7  8  9  class
1033078  4  2  1  1  2  1  2  1  1  2
1035283  1  1  1  1  1  1  3  1  1  2
1036172  2  1  1  1  2  1  2  1  1  2
1041801  5  3  3  3  2  3  4  4  1  4
1043999  1  1  1  1  2  3  3  1  1  2
1044572  8  7  5  10 7  9  5  5  0  4

a.

id       1  2  3  4  5  6  7  8  9  class
1033078  4  2  1  1  2  1  2  1  1  2
1035283  1  1  1  1  1  1  3  1  1  2
1036172  2  1  1  1  2  1  2  1  1  2
1041801  5  3  3  3  2  3  4  4  1  4
1043999  1  1  1  1  2  3  3  1  1  2
1044572  8  7  5  10 7  9  5  5  4  4

b.

Figure 5. a. Primary data; b. Refined data.

In Figure 5, after performing steps 1 to 4 of Naive Bayes Apriori, the data change based on the resulting dependencies as follows: for id 1044572, the value in the 9th column is refined from 0 to 4.
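Refining a record against a single-item rule such as 2L → 9L can be sketched like this. For clarity the record holds band labels per attribute index rather than the raw 1-10 values; the rule and record below are hypothetical.

```python
def refine(record, rules):
    """Force a record to follow each minimal rule: if the antecedent band is
    present, overwrite the consequent attribute with the band the rule demands.
    `record` maps attribute index -> band ('L'/'M'/'H');
    each rule is (antecedent_attr, antecedent_band, consequent_attr, consequent_band)."""
    fixed = dict(record)  # leave the original record untouched
    for attr_a, band_a, attr_c, band_c in rules:
        if fixed.get(attr_a) == band_a:
            fixed[attr_c] = band_c
    return fixed

rules = [(2, "L", 9, "L")]      # the 2L -> 9L rule from Section 2.2
record = {2: "L", 9: "H"}       # attribute 9 violates the rule
refined = refine(record, rules)
```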

2.3 Implementation

The steps undertaken to test the hypothesis are shown in Figure 6. The first step is to process the dataset using the Naïve Bayes classifier. The second step is to refine the data using Naïve Bayes Apriori and then compute the accuracy. The final step, which tests the hypothesis, is to compare the accuracies obtained from steps 1 and 2.

Weka 3.6.2 is used to implement the dependency-analysis process, and Matlab 7.0 is used to program the Naive Bayes classifier application. We demonstrate our approach using datasets from the University of California at Irvine (UCI) Machine Learning Repository. To examine the validity of the model, three datasets with numerical classification characteristics were used: the WBCD dataset (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin), the ionosphere dataset (http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/), and the mammography dataset (http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/).

In this research, the experiments conducted on the WBCD data were divided into seven different treatments. The dataset has 10 attributes and 683 records. Accuracy was calculated for seven treatments: first with 600 training records, followed in succession by 500, 400, 300, 200, 100, and 50.

Figure 6. Experiment Steps

3 RESULT AND DISCUSSION

Analysis and discussion of the reduction algorithm

Experiments with the reduction algorithm on the WBCD dataset with 2 refined variables yielded 10 rules, which could be reduced to 8 rules with an average confidence value of 0.97. With 3 refined variables, 10 rules were again obtained, reducible to 9 rules with a confidence value of 0.99.

The confidence values described above range from 0 to 1. The higher the confidence value, the less data needs refining, because most of the data already follows the rule, and vice versa. The experimental results indicate that for the WBCD data, the reduction process, which aims to reduce the complexity of the refinement process, reduces the number of rules by 10-20%.

Discussion of the three-dataset experiment results

From the experiment using Naïve Bayes Apriori, the accuracies obtained are shown in Table 1.

Table 1. Calculation results for the 3 (three) datasets

Dataset        Initial accuracy (%)   Apriori accuracy (%)
WBCD average   70.7                   73.7
Mammography    53.4                   53.4
Ionosphere     48.5                   48.5

According to the experimental results, the Naïve Bayes Apriori classifier shows an increase in accuracy only on the WBCD dataset, while the others do not. This can be explained by the fact that the WBCD data have a wide range of values, from 1 to 10, while the ionosphere data, which has 34 attributes, ranges from 0 to 1. Similarly for the mammography data: although it has 4 attributes ranging from 1 to 4, and one attribute with a large enough range (1 to 60), that attribute does not pass through the refining process in the Apriori stage, so it does not affect the rules generated.

Discussion of the WBCD dataset experiment results

The accuracy calculations for the WBCD dataset, divided into seven experiments, are shown in Table 2. The results show that in every type of experimental treatment, the Apriori algorithm improved the accuracy by an average of about 4%.

Table 2. Accuracy calculation results for the WBCD dataset

Training data   Initial (%)   Apriori, 2 variables (%)   Apriori, 3 variables (%)
600             78.3          83.1                       83.1
500             75.4          77.0                       77.0
400             70.3          77.4                       77.0
300             70.0          75.2                       74.9
200             67.3          69.1                       68.9
100             66.9          67.4                       67.1
50              66.9          66.9                       66.2

The decrease of accuracy vs. the amount of training data

Table 2 shows the calculation of the initial accuracy. It can be seen that as the amount of training data decreases, the resulting accuracy also decreases: from 600 training records down to 50, the classification accuracy drops from 78.3% to 66.9%.


Figure 7. Comparison Results of Accuracy Calculation

In the experiments using the two-variable Apriori algorithm, accuracy decreased as follows: 83.1% with 600 training records versus 66.9% with 50 training records, a decrease of 16.2%. In the experiment using the three-variable Apriori algorithm, 600 training records gave 83.1% accuracy, while 50 training records gave 66.2%, a decrease of 16.9%. This shows that, in general, the amount of training data used affects the accuracy of the classifier.

When the accuracy calculations of Table 2 are visualized graphically as in Figure 7, it can be seen clearly that the amount of training data used significantly affects the accuracy: in general, the smaller the amount of training data, the lower the resulting accuracy.

Comparative accuracy of standard Naive Bayes vs. Naive Bayes Apriori

Comparing the accuracy calculations on the WBCD data without the Apriori algorithm against those with it shows that Naive Bayes Apriori produces better classification accuracy; the visualization is shown in Figure 8.

In the classification of WBCD data using plain Naïve Bayes (without the Apriori algorithm), each attribute is treated as independent, so no relationships among attributes are considered. Yet in the WBCD data, with its 9 attributes, relationships between attributes are possible. Using the Apriori algorithm, these relationships can be identified, so the classification process produces better accuracy, with the average accuracy increasing by 4%. This occurs because the attribute data used in the classification process contain fewer errors.

Figure 8. Accuracy comparison between standard Naive Bayes and Naive Bayes using Apriori

In applying the Apriori algorithm, each WBCD attribute, whose values range from 1 to 10, was discretized in two ways: first by splitting the reading of each attribute into two bands, and second by splitting it into three bands, as shown in Figure 1.

Comparative accuracy of Apriori with two variables vs. three variables

Related to the initial hypothesis that the association algorithm would improve classification accuracy, Figure 9 provides a good illustration of the improvement. In this study we tested two variants of the Apriori algorithm: with two variables and with three variables. The highest accuracy obtained was 83.1%, with 600 training records, while the lowest accuracy obtained was 74.9%. Differences in accuracy are affected by two factors: the amount of training data and the number of variables used. More training data and more variables increase the accuracy, and vice versa.

Figure 9. Comparison of accuracy-level increase between Apriori with two variables and with three variables.


The experimental results depicted in Figure 9 show that the classification accuracy obtained by applying Apriori to the WBCD data with each attribute divided into two bands is better than with three bands. This is because the data are then read only as Low or High, so no value remains extreme. Without applying Apriori, the accuracy is 78.3% for 600 training records; this happens because attribute values differ greatly, as the value range is wide, from 1 to 10. With two-variable Apriori, the highest increase in accuracy was obtained with 400 training records, up to 77.4%.

In the two-variable Apriori, the data of each attribute are classified as follows: values 1 to 5 are considered Low, and values 6 to 10 are considered High. Classifying each attribute's data into only Low and High reduces the differences among data with extreme values. In the three-variable Apriori, each attribute's data are divided into Low, Medium, and High: values 1-3 are Low, 4-5 are Medium, and 6-10 are High. The association rules are then derived from the existing attribute data, and the data are checked for records that do not comply with the rules; any such records are repaired so that the data follow the rules.

4 CONCLUSION

Based on the above discussion, it can be concluded that:
- The reduction algorithm is used to reduce the rules in order to improve the efficiency of Apriori.
- The amount of training data used significantly impacts the classification accuracy of Naive Bayes.
- The classification results using Naive Bayes Apriori on the WBCD, mammography, and ionosphere data show that accuracy improves only on the WBCD data, by 4.5%.
- The Naïve Bayes Apriori classification method is suited to numerical data with a wide range of values, such as the WBCD data.
- Naïve Bayes classification on the WBCD data produces an average accuracy of 73.5%; Naive Bayes Apriori with two value bands averages 78.1%, and with three value bands averages 78%.
- Naive Bayes Apriori with two value bands produces better accuracy than with three value bands.

For further research, the development of the Naive Bayes Apriori algorithm may be continued as follows:
- Apply the FP-tree algorithm in place of the Apriori algorithm to accelerate the association process on the data to be classified.
- Extend the research domain to classifiers for categorical data types.

ACKNOWLEDGEMENT

The authors thank Hudan Studiawan for his help in completing the Apriori Naive Bayes classifier implementation in Matlab.

REFERENCES

[1] Tan, Pang-Ning, Steinbach, Michael, Kumar, Vipin. (2006). Introduction to Data Mining. Philippines: Pearson Education, Inc.

[2] Han, Jiawei, Kamber, Micheline. (2000). Data Mining: Concepts and Techniques. Massachusetts: Morgan Kaufmann Publishers.

[3] Zhang H, Jiang L, Su J. (2005). Hidden Naïve Bayes. American Association for Artificial Intelligence.

[4] Zheng F, Webb GI. (2005). A Comparative Study of Semi-naive Bayes Methods in Classification Learning. Proceedings of the Fourth Australasian Data Mining Workshop (AusDM05), pp 141-156. Sydney: University of Technology.

[5] Langley P, Sage S. (1994). Induction of selective Bayesian classifiers. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence. Massachusetts: Morgan Kaufmann, pp 399-406.

[6] Kononenko I. (1991). Semi-naive Bayesian classifier. In: Proceedings of the 6th European Working Session on Machine Learning. Berlin: Springer-Verlag, pp 206-219.

[7] Yager RR. (2006). An extension of the naive Bayesian classifier. Information Sciences 176, pp 577-588.

[8] Sahami M. (1996). Learning limited dependence Bayesian classifiers. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press, pp 334-338.