Association Rule Mining by Implementing the Apriori Algorithm Using a Disconnected Approach
Rajneesh Kumar Singh
Student, M.Tech. (CSE)
Jamia Hamdard, Hamdard Nagar
New Delhi, India
Manoj Kumar Pandey
Dept. of Computer Application
Galgotia College of Engg. & Tech
Greater Noida, India
Jawed Ahmed
Dept. of Computer Science
Jamia Hamdard, Hamdard Nagar
New Delhi, India
Abstract—There is a huge amount of data around us, and valuable information can be extracted from it through data mining. Data mining is the process of extracting useful information from the huge amounts of data stored in databases. Association rule mining is one of the data mining techniques used to extract hidden knowledge from datasets, knowledge that an organization's decision makers can use to improve overall profit. One of the most famous association rule learning algorithms is Apriori. A drawback of the Apriori algorithm is that it reads the database as many times as candidate sets are generated. A disconnected approach is implemented in this paper, under which the algorithm needs to scan the database only once. This implementation of the Apriori algorithm greatly reduces the number of database scans, which reduces time and network consumption and can improve the efficiency of the algorithm.
Keywords- Data mining; Association rule; Apriori algorithm; Disconnected approach; ADO.NET; Frequent pattern
I. INTRODUCTION
One of the most popular techniques in data mining is the Apriori algorithm [1][2][3]. Data mining usually involves huge amounts of information. Association rules exhaustively look for hidden patterns, making them suitable for discovering predictive rules involving subsets of data set attributes. Association rules are used to identify relationships among a set of items in a database. These relationships are not based on inherent properties of the data themselves (as with functional dependencies), but rather on co-occurrence of the data items [4]. This paper proposes an implementation of the Apriori algorithm that uses the disconnected approach of ADO.NET to discover association rules from huge amounts of information.
One of the most well-known and popular data mining techniques is association rule (or frequent itemset) mining. The algorithm was originally proposed by Agrawal et al. [1][2] for market basket analysis. Because of its significant applicability, many revised algorithms have been introduced since then, and association rule mining is still a widely researched area. Many variations on the Apriori frequent pattern mining algorithm are discussed in this section.
Agrawal et al. presented the AIS algorithm in [1], which generates candidate itemsets on the fly during each pass of the database scan. Large itemsets from the previous pass are checked for presence in the current transaction, and new itemsets are formed by extending existing ones. This algorithm turns out to be ineffective because it generates too many candidate itemsets: it requires more space, needs too many passes over the whole database, and generates rules with only one consequent
item. Agrawal et al. [2] developed various versions of the Apriori algorithm, namely Apriori, AprioriTid, and AprioriHybrid. Apriori and AprioriTid generate itemsets using the large itemsets found in the previous pass, without considering the transactions.
Rajneesh Kumar Singh et al , Int.J.Computer Technology & Applications,Vol 4 (3),486-493
IJCTA | May-June 2013 Available [email protected]
486
ISSN:2229-6093
AprioriTid improves on Apriori by using the database only in the first pass. Counting in subsequent passes is done using encodings created in the first pass, which are much smaller than the database. This leads to a dramatic performance improvement, about three times faster than AIS. A further improvement, called AprioriHybrid, is achieved by using Apriori in the initial passes and switching to AprioriTid in the later passes once the candidate k-itemsets are expected to fit into main memory. Even though different versions of Apriori are available, the problem with Apriori is that it generates too many 2-itemsets that are not frequent. The Direct Hashing and Pruning (DHP) algorithm developed in [6] reduces the size of the candidate set by filtering out any k-itemset whose hash-table entry does not have minimum support. This powerful filtering capability allows DHP to complete execution while Apriori is still at its second pass, and hence shows improvements in execution time and space utilization. Scalability is another important concern in data mining because of the huge size of the data; algorithms must be able to "scale up" to handle large amounts of data. Eui-Hong et al. [4] tried to make data distribution and candidate distribution scalable with the Intelligent Data Distribution (IDD) algorithm and the Hybrid Distribution (HD) algorithm, respectively.
The quality of a discovered association rule is measured in terms of confidence. Rules with confidence above a certain level (threshold value) are considered interesting and deserve attention. Most algorithms define interestingness in terms of user-supplied thresholds for support and confidence; the problem is that these algorithms rely on the users to set suitable values.
A survey of different methods and algorithms used to find frequent patterns is presented in [12]. Analyses and descriptions of AprioriTid, AprioriHybrid, the Continuous Association Rule Mining Algorithm (CARMA), the Eclat algorithm, and the Direct Hashing and Pruning (DHP) algorithm are given in detail. The conclusions drawn are that for dense databases the Eclat algorithm is better, for sparse databases the Hybrid algorithm is the best choice, and as long as the database fits in main memory the Hybrid algorithm (a combination of an optimized version of Apriori and Eclat) is the most efficient. An improved version of the original AprioriAll algorithm is developed for sequence mining in [8]. It uses the userID property during every step of producing the candidate set and every step of scanning the database to decide whether an item in the candidate set should be used to produce the next candidate set; the algorithm reduces the size of the candidate set in order to reduce the number of database scans. Based on temporal association rules [3][5], retailers can make better promotion strategies. The time dimension exists in all transactions and is included in finding large itemsets, especially when not all items exist throughout the entire data-gathering period. The temporal concept introduced in [7] adds temporal support to the normal support and confidence: the temporal support is the minimum interval width, and a rule is considered as long as it has enough support or temporal support. Different works reported in the literature modify the Apriori logic so as to improve the efficiency of generating rules. An enhanced version of the Apriori algorithm is presented in [9], where efficiency is improved by scanning the database in forward and backward directions. Xiang-Wei Liu et al. [10] presented an improved association rule mining algorithm that reduces the scanning time of candidate sets using a hash tree. Another version of Apriori, called the IApriori algorithm, is reported in [11]; it optimizes the join procedure over the generated frequent itemsets to reduce the size of the candidate itemsets.
Even though fast algorithms have been reported for association mining, they still inherit the drawback of scanning the whole database many times. The survey reveals that more attention is required to reduce the number of database scans, and also to reduce memory usage and execution time. These limitations and other related issues motivated us to continue research in this area. Building on these methods, in this work we propose an implementation of the Apriori algorithm that reduces database scans and time and can improve the efficiency of the algorithm; it is presented in the next section.
II. MATERIALS AND METHODS
ASSOCIATION RULES: In this section we overview the basic concepts of association rule mining; we refer the reader to [1, 2] for further details. Association rule mining was first introduced by Agrawal et al. [1] and was used for market basket analysis. The problem of mining association rules can be stated as follows: there is an itemset I = {i1, i2, . . ., in}, where I is a set of n distinct items, and a set of transactions D, where each transaction T is a set of items such that T ⊆ I. Table 1 gives an example where a database D contains a set of transactions T, and each transaction consists of one or more items.
Table 1: An Example Database
An association rule is an implication of the form A ⇒ B, where A, B ⊂ I and A ∩ B = ∅. The rule A ⇒ B has support s in the transaction set D if s% of the transactions in D contain A ∪ B; the support for a rule is defined as support(A ∪ B). The rule A ⇒ B holds in the transaction set D with confidence c if c% of the transactions in D that contain A also contain B; the confidence for a rule is defined as support(A ∪ B) / support(A). For example, consider the database in Table 1. People buy Milk, Diaper, and Beer together in 40% of the cases, and 67% of the transactions with Milk and Diaper also contain Beer. Such a rule can be represented as "Milk, Diaper ⇒ Beer (support = 0.4, confidence = 0.67)". Not all the rules found are useful, and the number of rules generated may be enormous. Therefore, the task of mining association rules is to generate all association rules that have support and confidence greater than the user-defined minimum support (minsup) and minimum confidence (minconf), respectively. An itemset with minimum support is called a large (or frequent) itemset. The rule A ⇒ B is a strong rule iff A ∪ B is in the large itemsets and its confidence is greater than or equal to minconf.
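These definitions can be checked directly against the transactions of Table 1. The following is an illustrative sketch (Python is used here purely for illustration; the paper's own implementation is in C#):

```python
# Transactions of Table 1.
D = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"Milk", "Diaper"}, {"Beer"}
s = support(A | B, D)                  # support(A ∪ B)
c = support(A | B, D) / support(A, D)  # confidence = support(A ∪ B) / support(A)
print(round(s, 2), round(c, 2))        # 0.4 0.67
```

The computed values match the "Milk, Diaper ⇒ Beer" example: two of five transactions contain all three items, and two of the three transactions containing Milk and Diaper also contain Beer.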
Normally, the task of mining association rules can be decomposed into two sub-tasks:
1. Discover all large itemsets in the set of transactions D.
2. Use the large itemsets to generate the strong rules.
The algorithm for the second task is simple: for every large itemset l1, find all large itemsets l2 such that l2 ⊂ l1 and support(l1) / support(l2) ≥ minconf; for every such large itemset l2, output a rule of the form l2 ⇒ (l1 − l2). The performance of mining association rules depends mainly on the large-itemset discovery process (step 1), since the cost of the entire process comes from reading the database (I/O time) to generate the support of candidates (CPU time) and from the generation of new candidates (CPU time). Therefore, it is important to have an efficient algorithm for large-itemset discovery.
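The rule-generation sub-task above can be sketched as follows (illustrative Python, not the paper's C# implementation; the `supports` dictionary, mapping each large itemset to its support count, is an assumed input that would come from sub-task 1):

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Sub-task 2: for every large itemset l1 and every large l2 that is a
    proper subset of l1, output l2 => (l1 - l2) whenever
    support(l1) / support(l2) >= minconf."""
    rules = []
    for l1 in supports:
        for k in range(1, len(l1)):                      # proper subsets only
            for l2 in map(frozenset, combinations(l1, k)):
                if l2 in supports:                       # l2 must itself be large
                    conf = supports[l1] / supports[l2]
                    if conf >= minconf:
                        rules.append((l2, l1 - l2, conf))
    return rules

# A few of the large itemsets of Table 1 with their support counts
# (out of 5 transactions); the full set would contain more entries.
supports = {
    frozenset({"Milk"}): 4,
    frozenset({"Diaper"}): 4,
    frozenset({"Beer"}): 3,
    frozenset({"Milk", "Diaper"}): 3,
    frozenset({"Milk", "Diaper", "Beer"}): 2,
}
rules = generate_rules(supports, 0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 2))
```

With minconf = 0.6 this yields, among others, {Milk, Diaper} ⇒ {Beer} with confidence 2/3, consistent with the earlier example.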
Apriori Algorithm
The Apriori algorithm [2] uses a bottom-up, breadth-first approach to finding the large itemsets. It starts from the large 1-itemsets and then extends one level up in every pass until all large itemsets are found. Each pass, say pass k, consists of three operations. First, the large (k−1)-itemsets are appended to L. Next, the potential large k-itemsets are generated using the (k−1)-itemsets; such potential large itemsets are called candidate itemsets C. The candidate generation procedure consists of two steps:
1. Join step – generate k-itemsets by joining Lk−1 with itself.
2. Prune step – remove any itemset A generated in the join step if some subset of A is not large, since every subset of a large itemset must itself be large.
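The join and prune steps can be sketched compactly (illustrative Python; itemsets are modeled as frozensets, and the join here unions pairs of (k−1)-itemsets rather than matching sorted prefixes as in [2]):

```python
from itertools import combinations

def apriori_gen(prev_large):
    """Candidate generation: join L_{k-1} with itself, then prune."""
    prev = set(prev_large)
    k = len(next(iter(prev))) + 1
    # Join step: unions of two (k-1)-itemsets that together have k items.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step: a k-candidate survives only if every one of its
    # (k-1)-subsets is already large (the Apriori principle).
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(apriori_gen(L2))  # only {B, C, E} survives the prune
```

Here the join produces {A,B,C}, {A,C,E}, and {B,C,E}, but the first two are pruned because {A,B} and {A,E} are not large.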
The Apriori algorithm significantly reduces the size of candidate sets using the Apriori principle. However, it can suffer from two nontrivial costs:
1. Generating a huge number of candidate sets.
2. Repeatedly scanning the database and checking the candidates by pattern matching.
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Figure 1: An Example of Apriori algorithm
Demonstration of the example:
Step 1: First scan the database and find all itemsets that have minimum support ≥ 2. In the example above, the itemset {D} does not have minimum support ≥ 2, so it is removed from the itemset list.
Step 2: Apply the join step to generate the 2-candidates, scan the database, and then apply the prune step (keeping itemsets with minimum support ≥ 2).
Step 3: Repeat this process until all frequent itemsets (itemsets with minimum support ≥ 2) are found. In the above example the frequent itemsets are {A} {B} {C} {E} {A C} {B C} {B E} {C E} {B C E}.
Step 4: Now the strong association rules can be generated from these itemsets.
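The steps above can be traced end to end in a short sketch. The transaction table of Figure 1 is not reproduced in the text; the frequent itemsets listed in Step 3 match the classic four-transaction example from [2], so that dataset is assumed here (illustrative Python, not the paper's C# implementation):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets occurring in at least `minsup` transactions."""
    def count(cands):
        # Support count of every candidate in one pass over the data.
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    L = {c for c, n in count(items).items() if n >= minsup}
    freq, k = set(), 2
    while L:
        freq |= L
        # Join step: unions of current large itemsets that have k members.
        cands = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be large.
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = {c for c, n in count(cands).items() if n >= minsup}
        k += 1
    return freq

# Assumed dataset (the classic example from [2]); note {D} occurs only once.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset in sorted(apriori(D, 2), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

With minsup = 2 this reproduces exactly the nine frequent itemsets of Step 3, and {D} is dropped in the first pass as described in Step 1.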
Methods:
Repeatedly scanning the database is a very time-consuming process and degrades the performance of the Apriori algorithm. To avoid this problem, in this paper we implement the scanning process using the disconnected approach, in which the whole database is scanned just once and placed into a database object. By using the disconnected approach we reduce the number of round trips for scanning the database, which reduces the scanning time and improves the performance of the Apriori algorithm.
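The idea can be illustrated independently of ADO.NET: fetch the transactions once into an in-memory structure, and let every subsequent candidate-counting pass run against that structure instead of the database. A minimal sketch (illustrative Python; `fetch_all` is a hypothetical stand-in for whatever one-time query the data layer performs):

```python
class TransactionCache:
    """Scan the data source once, then answer all later
    support-counting queries from memory (no further round trips)."""

    def __init__(self, fetch_all):
        # The single scan of the database happens here.
        self._transactions = [frozenset(t) for t in fetch_all()]

    def support_count(self, itemset):
        # Every candidate-counting pass now reads only memory.
        return sum(itemset <= t for t in self._transactions)

# One trip to the "database", then any number of in-memory passes.
cache = TransactionCache(lambda: [{"A", "C", "D"}, {"B", "C", "E"},
                                  {"A", "B", "C", "E"}, {"B", "E"}])
print(cache.support_count(frozenset({"B", "E"})))  # 3
```

Each level of the Apriori algorithm would call `support_count` instead of re-querying the database, which is the performance point the disconnected approach exploits.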
Disconnected Approach
ADO.NET is an object-oriented set of libraries that allows you to interact with data sources. Commonly, the data source is a database, but it could also be a text file, an Excel spreadsheet, or an XML file. For our purposes we will look at ADO.NET as a way to interact with a database.
ADO.NET Components
The ADO.NET components have been designed to separate data access from data manipulation. Two central components of ADO.NET accomplish this: the DataSet, and the .NET Framework data provider, which is a set of components including the Connection, Command, DataReader, and DataAdapter objects.
The DataSet Object
The ADO.NET DataSet is the core component of the disconnected architecture of ADO.NET. The DataSet is explicitly designed for data access independent of any data source. DataSet objects are in-memory representations of data: they contain multiple DataTable objects, which contain columns and rows just like normal database tables, and you can even define relations between tables to create parent-child relationships. The DataSet is specifically designed to help manage data in memory and to support disconnected operations on data, when such a scenario makes sense. The DataSet is used by all of the data providers, which is why it does not have a provider-specific prefix.
Figure 2: .NET Framework Data provider
Implementation of Apriori using ADO.NET
The code to access the database and load the transactions used to generate the association rules is as follows:

public Data GetTransactionsData(string rdbmsConnectionString, string dataSource, CommandType commandType)
{
    myDatabase = new Data();
    // Connect to the RDBMS and prepare the command.
    SqlmyConnection = new SqlConnection(rdbmsConnectionString);
    sqlCommand = new SqlCommand(dataSource, SqlmyConnection);
    if (commandType == CommandType.StoredProcedure)
    {
        sqlCommand.CommandType = CommandType.StoredProcedure;
    }
    else if (commandType == CommandType.TableDirect)
    {
        sqlCommand.CommandType = CommandType.Text;
    }
    // Single scan: fill a disconnected DataSet in one round trip.
    string sql = "SELECT TransactionID, Transactions FROM TransactionsTable";
    SqlmyAdapter = new SqlDataAdapter(sql, SqlmyConnection);
    DataSet ds = new DataSet();
    SqlmyAdapter.Fill(ds);
    // Copy the rows into the in-memory transaction store.
    foreach (DataRow dr in ds.Tables[0].Rows)
    {
        myDatabase.Tables[0].Rows.Add(Convert.ToInt32(dr["TransactionID"]), dr["Transactions"].ToString());
    }
    return myDatabase;
}
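The same single-scan pattern can be expressed in a database-neutral sketch (illustrative Python, with the standard-library sqlite3 module standing in for SQL Server and the DataSet; the table and column names mirror the C# fragment above):

```python
import sqlite3

def get_transactions_data(conn):
    """One round trip: read TransactionsTable in a single scan and keep
    the rows in memory, mirroring the DataSet fill in the C# code."""
    rows = conn.execute(
        "SELECT TransactionID, Transactions FROM TransactionsTable "
        "ORDER BY TransactionID").fetchall()
    # After this point no further round trips to the database are needed.
    return [(int(tid), items) for tid, items in rows]

# Demo against an in-memory database standing in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TransactionsTable "
             "(TransactionID INT, Transactions TEXT)")
conn.executemany("INSERT INTO TransactionsTable VALUES (?, ?)",
                 [(1, "Bread,Milk"), (2, "Bread,Diaper,Beer,Eggs")])
rows = get_transactions_data(conn)
print(rows)  # [(1, 'Bread,Milk'), (2, 'Bread,Diaper,Beer,Eggs')]
```

All subsequent Apriori passes then iterate over the returned list rather than issuing new queries, which is the round-trip saving the disconnected approach provides.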
III. RESULTS AND DISCUSSIONS
Process of generating the Association Rules
We have created a Windows-based project using C# on Visual Studio 2008 with MS SQL Server 2005. For illustration, we take the database table given below.
Figure 3: The database table
The process of generating the association rules with our project is as follows:
1. Click on Data Connection and choose the location of your data, either a database or an XML file.
2. Select the database and then enter the connection string in the given text box.
Figure 4: Selecting the database and entering the connection string
3. Click OK, then enter the minimum support % and minimum confidence % for the desired rules.
4. Click Analyze to generate the association rules.
Figure 5: The association rules displayed for the given minimum support and minimum confidence
The above figure shows the association rules generated for the above database table, with minimum support % ≥ 40 and minimum confidence % ≥ 60.
IV. CONCLUSION
From the above experiment and developing the
project for generating the Association Rules, we
came to conclusion that the implementation of
Apriori algorithm using disconnected approach
(ADO.NET) is to overcome the deficiency of the
Apriori algorithm. The Apriori algorithm require
multiple passes over the database for discovering
frequent patterns, and follows bottom up approach
which suffers from increased number of database
scan. The new implementation of using
disconnected approach need to scan the database
only one time and contain the database table data
into the dataset object. This implementation of
Apriori algorithm is more efficient which takes less
time, hence reflects in high efficiency
REFERENCES
[1] Agrawal, R., Imielinski, T., and Swami, A. N. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[2] Agrawal, R., and Srikant, R. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.
[3] Agrawal, R., and Srikant, R. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, IEEE Computer Society Press, pp. 3-14, 1995.
[4] Anandhavalli, M., Suraj Kumar Sudhanshu, Ayush Kumar, and Ghose, M. K. Optimized Association Rule Mining Using Genetic Algorithm. Advances in Information Mining, ISSN: 0975-3265, 1(2), pp. 1-4, 2009.
[5] Han, J., Dong, G., and Yin, Y. Efficient Mining of Partial Periodic Patterns in Time Series Database. Proceedings of the 15th IEEE International Conference on Data Engineering, pp. 106-115, 1999.
[6] Jong Park, S., Ming-Syan, Chen, and Yu, P. S. Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), pp. 813-825, 1997.
[7] Juan, M. A., Gustavo, H., and Rossi. An Approach to Discovering Temporal Association Rules. Proceedings of the ACM Symposium on Applied Computing, 1, pp. 234-239, 2000.
[8] Wang Tong, and He Pi-Lian. Web Log Mining by Improved AprioriAll Algorithm. Transactions on Engineering, Computing and Technology, 4, pp. 97-100, 2005.
[9] Wei Zhang, Zhang Wei, Dongme Sun, Shaohua Teng, and Haibin Zhu. An Algorithm to Improve Effectiveness of Apriori. Proceedings of the 6th IEEE International Conference on Cognitive Informatics, pp. 385-390, 2007.
[10] Xiang-Wei Liu, and Pi-Lian He. The Research of Improved Association Rules Mining Apriori Algorithm. Proceedings of the 3rd International Conference on Machine Learning and Cybernetics, pp. 1577-1579, 2004.
[11] Yiwu Xie, Yutong Li, Chunli Wang, and Mingyu Lu. The Optimization and Improvement of the Apriori Algorithm. Proceedings of the IEEE International Symposium on Intelligent Information Technology Application Workshops, pp. 1101-1103, 2008.
[12] Velu, C. M., Ramakrishnan, M., Somu, V., and Loganathan, V. Efficient Association Rules for Data Mining. International Journal of Soft Computing, 2, pp. 21-36, 2007.
[13] Han, J., Jian, Pei., and Yiwen, Yin. Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM International Conference on Management of Data, 29(2), pp. 1-12, 2000.
[14] Han, J., Jian, Pei., Yiwen, Yin., and Runying, Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Journal of Data Mining and Knowledge Discovery, 8, pp. 53-87, 2004.
[15] Mata, J., Alvarez, J. L., and Riquelme, J. C. Evolutionary Computing and Optimization: An Evolutionary Algorithm to Discover Numeric Association Rules. Proceedings of the ACM Symposium on Applied Computing, pp. 590-594, 2002.