serial / parallel implementation of apriori algorithm · · 2015-04-01association rules are the...
TRANSCRIPT
2362 www.ijifr.com
Copyright © IJIFR 2015
Reviewed Paper
Paper
International Journal of Informative & Futuristic Research ISSN (Online): 2347-1697
Volume 2 Issue 7 March 2015
Abstract
Association rules are the main technique for data Mining Apriori algorithm is a
conventional algorithm of association rule mining. There are many algorithms for
mining association rules and their variations are proposed on basis of Apriori
algorithm, but those algorithms are not efficient. With the rapid development of
networks and information technology, the vast information has paid more and
more attention by people. While getting the information with high speed, the
analysis and mining of the information and rules hidden deep in the data are also
getting more attention. Data mining technology is to organize and analyze the
data, which can extract and discover knowledge from the huge and unstructured
data, so how to apply the data mining techniques in transactional databases is the
focus of this topic studied. In this paper, combining with parallel programming
techniques, Apriori algorithm in association rule mining algorithm is described in
detail, the algorithm implementation process is illustrated, and the optimized
methods of the algorithm are discussed.
Serial / Parallel Implementation of
Apriori Algorithm Paper ID IJIFR/ V2/ E7/ 094 Page No. 2362-2366 Subject Area
Computer
Engineering
Key Words Data Mining, Information Extraction, Apriori Algorithm
Dinesh. D. Jadhav 1 B.E. Student, Department of Computer Engineering Sharadchandra Pawar College of Engineering, Pune
Milind. B. Wakchaure 2 B.E. Student, Department of Computer Engineering Sharadchandra Pawar College of Engineering, Pune
Saurabh.B.Jarande 3 B.E. Student, Department of Computer Engineering Sharadchandra Pawar College of Engineering, Pune
Swapnil.A.Nalawade 4 B.E. Student, Department of Computer Engineering Sharadchandra Pawar College of Engineering, Pune
2363
ISSN (Online): 2347-1697 International Journal of Informative & Futuristic Research (IJIFR)
Volume - 2, Issue - 7, March 2015 19th Edition, Page No: 2362-2366
Prof. Virendra Kumar K. N, Hemalatha B. R , Dr. S. B. Anadinni:: Study on Strength of Concrete using Fly Ash and Bottom Ash as a Partial Replacement for Cement and Sand
1. Introduction
Now a day’s modern computer technology and database technology has been developing with rapid
speed and could support to store and quickly retrieve the grand scale databases or data warehouses,
but these techniques was only to gather these "huge" data, and not to efficiently organize and use the
knowledge hidden in them [7], which in turn led to structured data but data with less knowledge. The
provision of data mining technology met people needs. The technology involved in artificial
intelligence, machine learning, statistical analysis and other technologies, and it makes decision
analysis into a new stage [5]. In this paper the association rule mining algorithm - Apriori algorithm
which is commonly used in data mining is mainly discussed. Frequent item sets play key role in
many data mining tasks that try to find interesting patterns from databases [1]. The original
inspiration of searching frequent pattern came from the need to analyze supermarket transaction data
to examine customer behavior in terms of the purchased products. Frequent Pattern shows how often
items are purchased together [9]. The frequent item set and association rule mining problems are
came to focus. Many research papers have been published since past decades. Many algorithms are
searched and developed for existing algorithms to solve these mining problems more efficiently [13].
In this chapter, we are going to optimize the apriori algorithm in serial and parallel approach.
2. Related Work done Agrawal R, Srikant R present two new algorithms for solving this problem that is fundamentally
different from the known algorithms. Empirical evaluation shows that these algorithms outperform
the known algorithms by factors ranging from three for small problems to more than an order of
magnitude for large problems [5]. Chao Yang, Tzu Chang and Chih Chang presents the comparison
of some tools that are specifically designed to extract the most of data parallelism on multi-core
system using OpenMP [3]. Ying Liu, Fuxiang Gao’s experiments show that the parallel
implementation of the algorithm using results in a speed-up about 200% compared with sequential
implementation on a Dual-core processor, while a speed-up about 400% on a Quad-core processor.
[4]. Ketan shah and Sunita Mahajan present the performance of parallel Apriori algorithm on
heterogeneous nodes with different datasets and n processors on a commodity cluster of machines
[6]. Anuradha.T and Satya Prasad.R presents an evaluation of the performance of Apriori on a hyper
threaded dual core processor compared to the performance on a non-hyper threaded dual core
processor using fread() and mmap() functions [7]. Zhang Zheng, Jaiwen Li, Xuhoo Chen, Li Shen,
and Zhiying Wang present a performance model for OpenMP parallelized loops to address the
critical factors which influences the performance [8].Kyung Min Lee, Tae Houn Song, Seung Hyun
Yoon, Key Ho Kwon and Jae Wook Jeon presents a parallel programming model, OpenMP, and
parallel programs that can be benchmarked to multi-core processors of embedded boards using
OpenMP and Executed parallel programs on a dual-core embedded system, analyzing the
performance of sequential programs and parallel programs by SERPOP analysis [9].Anuradha.T,
Dr.Satya Prasad.R, Dr.Tirumala Rao.S.N, evaluates the performance of Apriori using Linux mmap()
function compared to fread() function in both the serial and parallel environments [10].
3. Apriori Algorithm Notations:
K=pass number, Fk=set of frequent K-itemset(with those min support) , ck=set of candidate K-
itemset.
2364
ISSN (Online): 2347-1697 International Journal of Informative & Futuristic Research (IJIFR)
Volume - 2, Issue - 7, March 2015 19th Edition, Page No: 2362-2366
Prof. Virendra Kumar K. N, Hemalatha B. R , Dr. S. B. Anadinni:: Study on Strength of Concrete using Fly Ash and Bottom Ash as a Partial Replacement for Cement and Sand
1. K=1
2. Fk={ i/i I, (i) min. sup} {Find all frequent 1- itemset}
3. Repeat
4. K=K+1
5. Ck=apriori-gen (Fk-1) {generate candidate itemset}
6. For each transaction t
7. Ct=subset (Ck,t) {identify all candidate that belongs to t}
8. For each candidate itemset C do
9. =δ(c)+1 {Increment suport count}
10. 11. End for
12. Fk={C/C } {exact the frequent itemset}
13. Untill Fk=
14. Result=Ufk
4. Related Example Following is an example based on the transaction database D, of Figure 1. Consisting of total nine
transactions in this database, that is, T=9.We use the improved Apriori algorithm for finding
frequent item sets in D.
Figure 4.1: Generation of candidate item sets and frequent item sets.
2365
ISSN (Online): 2347-1697 International Journal of Informative & Futuristic Research (IJIFR)
Volume - 2, Issue - 7, March 2015 19th Edition, Page No: 2362-2366
Prof. Virendra Kumar K. N, Hemalatha B. R , Dr. S. B. Anadinni:: Study on Strength of Concrete using Fly Ash and Bottom Ash as a Partial Replacement for Cement and Sand
i. Scan database D for counting each candidate. In its first pass, each item is a member of the
set of candidate 1-itemsets, C1. The algorithm scans all of the transactions to count the
number of occurrences of each item.
ii. Compare candidate support count with minsup. Suppose that the minimum transaction
support count required is 2. The set of frequent 1-itemsets, L1, can then be determined. It
consists of the candidate1-itemsets satisfying minimum support.
iii. Generate C2 candidates from L1 and scan D for count of each candidate. To discover the set
of frequent 2-itemsets, L2, the algorithm generates a candidate set of 2-itemsets, C2. And
then the transactions in D are scanned and the support count of each candidate item set in C2
is accumulated, as shown in the table of C2 in Figure 1.
iv. Compare candidate support count with minsup. The set of frequent 2-itemsets, L2, is then
determined, consisting of those candidate 2-itemsets in C2 having minimum support. Then
D2 was determined from L2.
v. Generate C3 candidates from L2 and scan D2 for count of each candidate. First let
C3=L2∞L2={{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},{I2, I3, I5},{I2, I4, I5}}.
Based on the Apriori property that all subsets of a frequent item set must also be frequent,
it can determine that the four latter candidates cannot possibly be frequent, therefore remove
them from C3, thereby saving the effort of unnecessarily obtaining their counts during the
Subsequent scan of D2 to determine L3.
vi. Compare candidate support count with minsup. The transactions in D2 are scanned in order
to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
vii. The algorithm uses L3∞L3 to generate a candidate set of 4-itemsets, C4. Although the join
results in {{I1, I2, I3,I5}}, this item set is pruned since its subset {{I2, I3,I5}}is not
frequent. Thus, C4=Φ, and the algorithm terminates, having found all of the frequent item
sets.
5. Analysis And Evaluation Of The Algorithm The Serial/Parallel Apriori Algorithm has been implemented by Java 8 on a personal computer with
2.10 GHz Intel Core2Duo CPU, 2 GB in memory and Windows 7 operating system. The
experimental transactional database sources are created by Generator and not taken from market. In
the Fig.2, the horizontal axis represents the number of support in database and the vertical axis
represents mining time (second) of the same database with the algorithm Apriori .The two curves
denote different time cost of the algorithm Apriori and RAAT with different minsup.
6. Conclusion In this paper, we have implemented Apriori algorithm on quad core processor in a serial and parallel
manner with four standard datasets using different support counts varying from 30% to 70%. From
our work we can say that parallel results are better than serial once. Our aim of measuring the serial
2366
ISSN (Online): 2347-1697 International Journal of Informative & Futuristic Research (IJIFR)
Volume - 2, Issue - 7, March 2015 19th Edition, Page No: 2362-2366
Prof. Virendra Kumar K. N, Hemalatha B. R , Dr. S. B. Anadinni:: Study on Strength of Concrete using Fly Ash and Bottom Ash as a Partial Replacement for Cement and Sand
and parallel performance of Apriori algorithm on a quad core processor with respect to time and
comparison between them get satisfied
References [1] Jiawei Han and Micheline Kamber, Data Mining concepts and Techniques2nd edition Morgan
Kaufmann Publishers, San Francisco2006.
[2] Anuradha.T, Satya Pasad R and S N Tirumalarao. Parallelizing Apriori on Dual Core using OpenMP.
International Journal of Computer Applications 43(24):33-39, April 2012. Published by Foundation
of Computer Science, New York, USA.
[3] Chao-Tung Yang, Tzu-Chieh Chang, Hsien-Yi Wang, William C.C. Chu, Chih-Hung Chang.
Performance Campion with OpenMP Parallelization for Multi-core Systems. Ninth IEEE
International Symposium on Parallel and Distributed Processing with Applications, 2011 IEEE, pp
232-237.
[4] Ying Liu, Fuxiang Gao. Parallel Implementations of Image Processing Algorithms on Multi-Core.
2010 Fourth International Conference on Genetic and Evolutionary Computing, 2010 IEEE, pp 71-74.
[5] Agrawal R, Srikant R ―Fast algorithms for mining association rules‖ In: Proceedings of the 1994
international conference on very large data bases (VLDB‟94), 1994 Santiago, Chile, and pp 487–499
[6] Ketan D. Shah, Dr. (Mrs.) Sunita Mahajan. Performance Analysis of Parallel Apriori on
Heterogeneous Nodes. 2009International Conference on Advances in Computing, Control, and
Telecommunication Technologies, 2009 IEEE, pp 42-44.
[7] Anuradha.T, Satya Prasad.Parallelizing Apriori on Hyper-Threaded Multicore
Processor.International Journal of Advanced Research in Computer Science and Software Engineer.
Volume 3, Issue 6, June 2013.
[8] Zhong Zheng, Xuhao Chen, Zhiying Wang, Li Shen, Jiawen Li. Performance Model for OpenMP
Parallelized Loops. 2011International Conference on Transportation, Mechanical, and Electrical
Engineering(TMEE) December 16-18, Changchun, China, 2011 IEEE, pp 383-387.
[9] Kyung Min Lee, Tae Houn Song, Seung Hyun Yoon, Key Ho Kwon, Jae Wook Jeon. OpenMP
Parallel Programming Using Dual-Core Embedded System. 2011 11th
International Conference on
Control, Automation and Systems Oct. 26-29, 2011in KINTEX, Gyeonggi-do, Korea, pp762-766.
[10] Anuradha.T, Dr.Satya Prasad.R, Dr.Tirumala Rao.S.N Performance evaluation of apriori with
memory mapped files. International Journal of Computer Science Issues Vol.10, Issue 1, no1, January
2003.
[11] OpenMP Application Program Interface‖ Version 3.0 May 2008.
[12] Tim Mattson and Larry Meadows. ―A ‗Hands-on‘ Introduction to OpenMP ‖ Intel Corporation.
[13]Kent Milfeld. ―Introduction to Programming with OpenMP‖ February 6th 2012, TEXAS
ADVANCED COMPUTING CENTER (TACC).
[13] Multi-core Processor Wikipedia [Available] en.wikipedia.org/wiki/Multi-core processor.