
Association Rule Mining by Implementing Apriori Algorithm using Disconnected Approach

Rajneesh Kumar Singh
Student, M.Tech. (CSE)
Jamia Hamdard, Hamdard Nagar
New Delhi, India

Manoj Kumar Pandey
Deptt. Of Computer Application
Galgotia College of Engg. & Tech
Greater Noida, India

Jawed Ahmed
Deptt. Of Computer Science
Jamia Hamdard, Hamdard Nagar
New Delhi, India

Abstract—A huge amount of data surrounds us, and data mining is a way to extract valuable information from it. Data mining is the process of extracting useful information from the large amounts of data stored in databases. Association rule mining is one of the data mining techniques used to extract hidden knowledge from datasets, which an organization's decision makers can use to improve overall profit. One of the best-known association rule learning algorithms is Apriori. The drawback of the Apriori algorithm is that it reads the database once for every set of candidates it generates. This paper implements a disconnected approach, in which the algorithm needs to scan the database only one time. This implementation of the Apriori algorithm greatly reduces the number of database scans, which reduces time and network consumption and can improve the efficiency of the algorithm.

Keywords- Data mining; Association rule; Apriori algorithm; Disconnected approach; ADO.NET; Frequent pattern

I. INTRODUCTION

One of the most popular techniques in data mining is the Apriori algorithm [1][2][3]. Data mining usually involves huge amounts of information. Association rules exhaustively look for hidden patterns, making them suitable for discovering predictive rules involving subsets of data set attributes. Association rules are used to identify relationships among a set of items in a database. These relationships are not based on inherent properties of the data themselves (as with functional dependencies), but rather on co-occurrence of the data items [4]. This paper proposes an implementation of the Apriori algorithm that uses the Disconnected Approach of ADO.NET to discover association rules from huge amounts of information.

One of the best-known and most popular data mining techniques is association rule, or frequent item set, mining. The technique was originally proposed by Agrawal et al. [1][2] for market basket analysis. Because of its wide applicability, many revised algorithms have been introduced since then, and association rule mining is still a widely researched area. Many variations on the Apriori frequent pattern mining algorithm are discussed in this section. Agrawal et al. presented the AIS algorithm in [1], which generates candidate item sets on the fly during each pass of the database scan. Large item sets from the previous pass are checked for presence in the current transaction, and new item sets are formed by extending these existing item sets. This algorithm turns out to be inefficient because it generates too many candidate item sets. It requires more space, it requires too many passes over the whole database, and it generates only rules with a single consequent item. Agrawal et al. [2] developed several versions of the Apriori algorithm: Apriori, AprioriTid, and AprioriHybrid. Apriori and AprioriTid generate item sets using the large item sets found in the previous pass, without considering the transactions. AprioriTid improves on Apriori by using the database only in the first pass. Counting in subsequent passes is done using encodings created in the first pass, which are much smaller than the database. This leads to a dramatic performance improvement, about three times faster than AIS. A further improvement, called AprioriHybrid, uses Apriori in the initial passes and switches to AprioriTid in the later passes if the candidate k-itemsets are expected to fit into main memory. Even though different versions of Apriori are available, the problem with Apriori is that it generates too many 2-item sets that are not frequent. A Direct Hashing and Pruning (DHP) algorithm is developed in [6] that reduces the size of the candidate set by filtering out any k-item set whose hash table entry does not have minimum support. This powerful filtering capability allows DHP to complete execution when Apriori is still at its second pass, and hence improves execution time and space utilization.

Scalability is another important concern in data mining because of the huge size of the data. Hence, algorithms must be able to "scale up" to handle large amounts of data. Eui-Hong et al. [4] tried to make data distribution and candidate distribution scalable with the Intelligent Data Distribution (IDD) algorithm and the Hybrid Distribution (HD) algorithm respectively. The quality of a discovered association rule is measured in terms of confidence. Rules with confidence above a certain level (threshold value) are considered interesting and deserving of attention. Most algorithms define interestingness in terms of user-supplied thresholds for support and confidence. The problem is that these algorithms rely on the users to set suitable values.

Rajneesh Kumar Singh et al , Int.J.Computer Technology & Applications,Vol 4 (3),486-493

IJCTA | May-June 2013 Available [email protected]

486

ISSN:2229-6093

A survey of different methods and algorithms used to find frequent patterns is presented in [12]. It analyzes and describes in detail AprioriTid, AprioriHybrid, the Continuous Association Rule Mining Algorithm (CARMA), the Eclat algorithm, and the Direct Hashing and Pruning (DHP) algorithm. Its conclusions are that the Eclat algorithm is better for dense databases, that the Hybrid algorithm is the best choice for sparse databases, and that as long as the database fits in main memory the Hybrid algorithm (a combination of an optimized version of Apriori and Eclat) is the most efficient one. An improved version of the original AprioriAll algorithm is developed for sequence mining in [8]. It adds the userID property during every step of producing the candidate set and every step of scanning the database, to decide whether an item in the candidate set should be used to produce the next candidate set. The algorithm reduces the size of the candidate set in order to reduce the number of database scans. Based on temporal association rules [3][5], retailers can make better promotion strategies. The time dimension exists in all transactions and is included in finding large item sets, especially when not all items exist throughout the entire data gathering period. The temporal concept is introduced in [7] in addition to the normal support and confidence. The temporal support is the minimum interval width. Thus, a rule is considered as long as there is enough support or temporal support. Different works are reported in the literature that modify the Apriori logic so as to improve the efficiency of generating rules. An enhanced version of the Apriori algorithm is presented in [9], where the efficiency is improved by scanning the database in forward and backward directions. Xiang-wei Liu et al. [10] presented an improved association rule mining algorithm that reduces the scanning time of candidate sets using a hash tree. Another version of Apriori is reported in [11] as the IApriori algorithm, which optimizes the join procedure of the generated frequent item sets to reduce the size of the candidate item sets.

Even though fast algorithms have been reported for association mining, they still inherit the drawback of scanning the whole database many times. The survey reveals that more attention is required to reduce the number of database scans, and also to reduce memory space and execution time. These limitations and other related issues motivated us to continue the research work in this area. In light of these methods, in this work we propose an implementation of the Apriori algorithm that reduces database scans and time and can improve the efficiency of the algorithm; it is presented in the next section.

II. MATERIALS AND METHODS

ASSOCIATION RULES: In this section we give an overview of the basic concepts of association rule mining. We refer the reader to [1, 2] for further details. Association rule mining was first introduced by Agrawal et al. [1], and was used for market basket analysis. The problem of mining association rules can be explained as follows: There is the itemset I = {i1, i2, . . ., in}, where I is a set of n distinct items, and a set of transactions D, where each transaction T is a set of items such that T ⊆ I. Table 1 gives an example where a database D contains a set of transactions T, and each transaction consists of one or more items.


Table 1: An Example Database

An association rule is an implication of the form A ⇒ B, where A, B ⊂ I and A ∩ B = ∅. The rule A ⇒ B has support s in the transaction set D if s% of transactions in D contain A ∪ B. The support for a rule is defined as support(A ∪ B). The rule A ⇒ B holds in the transaction set D with confidence c if c% of transactions in D that contain A also contain B. The confidence for a rule is defined as support(A ∪ B) / support(A). For example, consider the database in Table 1. Milk, Diaper, and Beer are bought together in 40% of the transactions, and 67% of the transactions with Milk and Diaper also contain Beer. Such a rule can be represented as "Milk, Diaper ⇒ Beer, support = 0.4, confidence = 0.67". Not all the rules found are useful, and the number of rules generated may be enormous. Therefore, the task of mining association rules is to generate all association rules whose support and confidence are no less than the user-defined minimum support (minsup) and minimum confidence (minconf) respectively. An itemset with minimum support is called a large (or frequent) itemset. The rule A ⇒ B is a strong rule iff A ∪ B is a large itemset and its confidence is greater than or equal to minconf. Normally, the task of mining association rules can be decomposed into two sub-tasks:

1. Discover all large itemsets in the set of transactions D.

2. Use the large itemsets to generate the strong rules. The algorithm for this task is simple. For every large itemset l1, find all large itemsets l2 such that l2 ⊂ l1 and support(l1 ∪ l2) / support(l2) ≥ minconf. For every such large itemset l2, output a rule of the form l2 ⇒ (l1 − l2).

The performance of mining association rules depends mainly on the large itemset discovery process (step 1), since the cost of the entire process comes from reading the database (I/O time) to compute the support of candidates (CPU time) and from generating new candidates (CPU time). Therefore, it is important to have an efficient algorithm for large itemset discovery.
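The support and confidence definitions, and sub-task 2, can be checked against Table 1 with a few lines of code. The following is a minimal Python sketch and not the authors' C# implementation; the function names `support`, `confidence`, and `strong_rules` are illustrative assumptions. Because every subset of a large itemset is itself large, the sketch simply enumerates all non-empty proper subsets of l1 as candidates for l2.

```python
from itertools import combinations

# Table 1 from the text, one set of items per transaction.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A ⇒ B."""
    a = frozenset(antecedent)
    b = frozenset(consequent)
    return support(a | b, transactions) / support(a, transactions)


def strong_rules(large_itemset, transactions, minconf):
    """Sub-task 2: for a large itemset l1, collect l2 ⇒ (l1 − l2) for each
    non-empty proper subset l2 whose confidence reaches minconf."""
    l1 = frozenset(large_itemset)
    rules = []
    for size in range(1, len(l1)):
        for l2 in map(frozenset, combinations(sorted(l1), size)):
            conf = support(l1, transactions) / support(l2, transactions)
            if conf >= minconf:
                rules.append((l2, l1 - l2, conf))
    return rules


s = support({"Milk", "Diaper", "Beer"}, TRANSACTIONS)
c = confidence({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS)
print(f"support={s:.2f} confidence={c:.2f}")  # support=0.40 confidence=0.67
```

Calling `strong_rules({"Milk", "Diaper", "Beer"}, TRANSACTIONS, 0.6)` returns the rules derived from that one itemset, including Milk, Diaper ⇒ Beer.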

Apriori Algorithm

The Apriori algorithm [2] uses a bottom-up, breadth-first approach to find the large itemsets. It starts from the large 1-itemsets and then extends one level up in every pass until all large itemsets are found. For each pass, say pass k, there are three operations. First, append the large (k-1)-itemsets to L. Next, generate the potential large k-itemsets using the (k-1)-itemsets. Such potential large itemsets are called candidate itemsets C. Finally, scan the database to count the support of each candidate and keep the candidates that are large. The candidate generation procedure consists of two steps:

1. Join step – generate k-itemsets by joining Lk−1 with itself.

2. Prune step – remove an itemset A generated in the join step if any of the (k−1)-subsets of A is not large, since any subset of a large itemset must be large. This can be written formally as follows:
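As an illustration of the two steps, here is a minimal Python sketch of candidate generation. It is not the authors' listing, and the name `apriori_gen` is an assumption. For brevity the join unions any two large (k−1)-itemsets whose union has exactly k items, rather than joining on a shared (k−2)-item prefix as in [2]; after the prune step both variants leave exactly the k-itemsets whose (k−1)-subsets are all large.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    `large_prev` is a collection of frozensets that all have size k-1.
    """
    large_prev = set(large_prev)
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1

    # Join step: union two large (k-1)-itemsets that differ in one item.
    joined = {a | b for a in large_prev for b in large_prev if len(a | b) == k}

    # Prune step: drop a candidate if any (k-1)-subset is not large.
    return {
        cand
        for cand in joined
        if all(frozenset(sub) in large_prev for sub in combinations(cand, k - 1))
    }


# Large 2-itemsets {A C}, {B C}, {B E}, {C E} leave one candidate, {B C E}:
# {A B C} and {A C E} are pruned because {A B} and {A E} are not large.
large_2 = {frozenset("AC"), frozenset("BC"), frozenset("BE"), frozenset("CE")}
print(apriori_gen(large_2))
```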

The Apriori algorithm significantly reduces the size of candidate sets using the Apriori principle. However, it can suffer from two nontrivial costs:

1. Generating a huge number of candidate sets

2. Repeatedly scanning the database and checking the candidates by pattern matching

Table 1 (transactions):

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke


Figure 1: An Example of Apriori algorithm

Demonstration of the Example:

Step1: First scan the database and find all the itemsets that have the minimum support (≥ 2). In the above example, itemset {D} does not have the minimum support, so it is removed from the itemset list.

Step2: Apply the join step to generate the 2-candidates, scan the database, and then apply the prune step (keeping the itemsets that have the minimum support ≥ 2).

Step3: Repeat this process until all frequent itemsets (itemsets with support ≥ 2) are found. In the above example the frequent itemsets are {A}, {B}, {C}, {E}, {A C}, {B C}, {B E}, {C E}, {B C E}.

Step4: Now we can generate the strong association rules from these itemsets.
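Steps 1 to 3 can be traced with a short level-wise loop. The sketch below is in Python and is not the authors' implementation. The transactions of Figure 1 are not listed in the text, so the four rows used here are an assumption, chosen only because they reproduce the frequent itemsets named in Step3 with a minimum support count of 2.

```python
from itertools import combinations

# Assumed transactions: Figure 1's table is not in the text, so these four
# rows are a guess that is consistent with the frequent itemsets in Step3.
TRANSACTIONS = [
    frozenset("ACD"),
    frozenset("BCE"),
    frozenset("ABCE"),
    frozenset("BE"),
]


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    # Step1: count single items and keep those with enough support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Step2: join, then prune candidates with a non-frequent (k-1)-subset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One pass over the transactions to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(large)
        k += 1  # Step3: repeat with the next itemset size.
    return frequent


result = apriori(TRANSACTIONS, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(" ".join(sorted(itemset)), result[itemset])
```

Each iteration of the `while` loop makes one full pass over the transactions, which is the repeated scanning cost the paper sets out to reduce.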

Methods:

Repeatedly scanning the database is a very time-consuming process and degrades the performance of the Apriori algorithm. To avoid this problem, in this paper we implement the scanning process using the disconnected approach, in which the whole database is scanned just once and loaded into a database object. By using the disconnected approach, we can reduce the number of round trips needed to scan the database, which reduces the scanning time and improves the performance of the Apriori algorithm.
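The idea can be sketched outside ADO.NET as well. The following Python example uses the standard `sqlite3` module as a stand-in for SQL Server, which is an assumption made only so the sketch is self-contained. The table and column names follow the query in the paper's listing; storing each transaction as a comma-separated string, and the helper names `load_transactions` and `count_support`, are likewise assumptions.

```python
import sqlite3


def load_transactions(connection):
    """Read the whole transaction table once and return it as a list of sets."""
    rows = connection.execute(
        "SELECT TransactionID, Transactions FROM TransactionsTable"
    ).fetchall()
    return [frozenset(i.strip() for i in items.split(",")) for _, items in rows]


def count_support(itemsets, transactions):
    """Count candidates against the in-memory copy; no database access."""
    return {s: sum(s <= t for t in transactions) for s in itemsets}


# Throwaway in-memory database holding Table 1 (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransactionsTable (TransactionID INTEGER, Transactions TEXT)"
)
conn.executemany(
    "INSERT INTO TransactionsTable VALUES (?, ?)",
    [
        (1, "Bread, Milk"),
        (2, "Bread, Diaper, Beer, Eggs"),
        (3, "Milk, Diaper, Beer, Coke"),
        (4, "Bread, Milk, Diaper, Beer"),
        (5, "Bread, Milk, Diaper, Coke"),
    ],
)

transactions = load_transactions(conn)  # the only query that is issued
conn.close()                            # every later pass runs disconnected

pass1 = count_support([frozenset(["Milk"]), frozenset(["Beer"])], transactions)
pass2 = count_support([frozenset(["Milk", "Diaper"])], transactions)
print(pass1[frozenset(["Milk"])], pass2[frozenset(["Milk", "Diaper"])])  # 4 3
```

Both counting passes run after the connection is closed, so the number of round trips to the database stays at one however many Apriori passes follow. The trade-off is that the whole table must fit in memory.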

Disconnected Approach

ADO.NET is an object-oriented set of libraries that allows you to interact with data sources. Commonly, the data source is a database, but it could also be a text file, an Excel spreadsheet, or an XML file. For the purposes of this paper we will look at ADO.NET as a way to interact with a database.

ADO.NET Components

The ADO.NET components have been designed to factor data access from data manipulation. There are two central components of ADO.NET that accomplish this: the DataSet, and the .NET Framework data provider, which is a set of components including the Connection, Command, DataReader, and DataAdapter objects.

The DataSet Object

The ADO.NET DataSet is the core component of the disconnected architecture of ADO.NET. The DataSet is explicitly designed for data access independent of any data source. DataSet objects are in-memory representations of data. They contain multiple DataTable objects, which contain columns and rows, just like normal database tables. You can even define relations between tables to create parent-child relationships. The DataSet is specifically designed to help manage data in memory and to support disconnected operations on data, when such a scenario makes sense. The DataSet is an object that is used by all of the Data Providers, which is why it does not have a Data Provider specific prefix.

Figure 2: .NET Framework Data provider

Implementation of Apriori using ADO.NET

The code that accesses the database to generate the association rules is as follows.

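using System;
using System.Data;
using System.Data.SqlClient;

// Loads all transactions into memory with one query (disconnected approach).
// myDatabase, SqlmyConnection, sqlCommand and SqlmyAdapter are assumed to be
// class-level fields declared elsewhere in the project; Data is the project's
// own DataSet-style type, whose definition is not shown in the paper.
public Data GetTransactionsData(string rdbmsConnectionString, string dataSource, CommandType commandType)
{
    myDatabase = new Data();
    SqlmyConnection = new SqlConnection(rdbmsConnectionString);
    sqlCommand = new SqlCommand(dataSource, SqlmyConnection);

    // SqlClient does not support TableDirect, so it is mapped to plain text.
    if (commandType == CommandType.StoredProcedure)
    {
        sqlCommand.CommandType = CommandType.StoredProcedure;
    }
    else if (commandType == CommandType.TableDirect)
    {
        sqlCommand.CommandType = CommandType.Text;
    }

    // Note: the adapter below runs this fixed query, not sqlCommand.
    string sql = "Select TransactionID,Transactions from TransactionsTable";
    SqlmyAdapter = new SqlDataAdapter(sql, SqlmyConnection);

    // Fill opens the connection, copies the rows into the DataSet, and closes it again.
    DataSet ds = new DataSet();
    SqlmyAdapter.Fill(ds);

    // Copy the rows into the application's own in-memory table.
    foreach (DataRow dr in ds.Tables[0].Rows)
    {
        myDatabase.Tables[0].Rows.Add(Convert.ToInt32(dr["TransactionID"]), dr["Transactions"].ToString());
    }
    return myDatabase;
}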

III. RESULTS AND DISCUSSIONS

Process of generating the Association Rules

We have created a Windows-based project using C# on Visual Studio 2008 with MS-SQL 2005.

For illustration purposes, we take the database table given below.


Figure 3: The database table

The process of generating the association rules with our project is as follows:

1 Click on Data Connection and choose the location of your data, either a database or an XML file.

2 Select the database and then enter the connection string in the given text box.

Figure 4: Selecting the database and entering the connection string

3 Click OK, then enter the minimum support % and minimum confidence % for the desired rules.

4 Click Analyze to generate the association rules.


Figure 5: Displaying the association rules for the given minimum support and minimum confidence

The above figure shows the association rules generated for the database table shown earlier, with minimum support ≥ 40% and minimum confidence ≥ 60%.

IV. CONCLUSION

From the above experiment and from developing the project for generating association rules, we conclude that implementing the Apriori algorithm using the disconnected approach (ADO.NET) overcomes a deficiency of the Apriori algorithm. The Apriori algorithm requires multiple passes over the database to discover frequent patterns, and follows a bottom-up approach that suffers from an increased number of database scans. The new implementation using the disconnected approach needs to scan the database only one time and holds the database table data in the DataSet object. This implementation of the Apriori algorithm takes less time and is therefore more efficient.

REFERENCES

[1] Agrawal, R., Imielinski, T., and Swami, A. N. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.

[2] Agrawal, R., and Srikant, R. Fast Algorithms for Mining Association Rules. Proceedings of 20th International Conference of Very Large Data Bases, pp. 487-499, 1994.

[3] Agrawal, R., and Srikant, R. Mining Sequential Patterns. Proceedings of 11th International Conference on Data Engineering, IEEE Computer Society Press, pp. 3-14, 1995.


[4] Anandhavalli M., Suraj Kumar Sudhanshu, Ayush Kumar, and Ghose M.K. Optimized Association Rule Mining Using Genetic Algorithm. Advances in Information Mining, ISSN: 0975–3265, Volume 1, Issue 2, pp. 01-04, 2009.

[5] Han, J., Dong, G., and Yin, Y. Efficient Mining of Partial Periodic Patterns in Time Series Database. Proceedings of 15th IEEE International Conference on Data Engineering, pp. 106–115, 1999.

[6] Jong Park, S., Ming-Syan, Chen, and Yu, P. S. Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), pp. 813-825, 1997.

[7] Juan, M. A., Gustavo, H., and Rossi. An Approach to Discovering Temporal Association Rules. Proceedings of the ACM Symposium on Applied Computing, 1, pp. 234-239, 2000.

[8] Wang Tong, and He Pi-Lian. Web Log Mining by Improved Apriori All Algorithm. Transaction on Engineering Computing and Technology, 4, pp. 97-100, 2005.

[9] Wei Zhang, Zhang Wei, Dongme Sun, Shaohua Teng, and Haibin Zhu. An Algorithm to Improve Effectiveness of Apriori. Proceedings of 6th IEEE International Conference on Cognitive Informatics, pp. 385-390, 2007.

[10] Xiang-Wei Liu, and Pi-Lian He. The Research of Improved Association Rules Mining Apriori Algorithm. Proceedings of 3rd International Conference on Machine Learning and Cybernetics, pp. 1577-1579, 2004.

[11] Yiwu Xie, Yutong Li, Chunli Wang, and Mingyu Lu. The Optimization and Improvement of the Apriori Algorithm. Proceedings of IEEE International Symposium on Intelligent Information Technology Application Workshops, pp. 1101-1103, 2008.

[12] Velu, C. M., Ramakrishnan, M., Somu, V., and Logznathan, V. Efficient Association Rules for Data Mining. International Journal of Soft Computing, 2, pp. 21-36, 2007.

[13] Han, J., Jian, Pei., and Yiwen, Yin. Mining Frequent Patterns without Candidate Generation. Proceedings of ACM International Conference on Management of Data, 29(2), pp. 1-12, 2000.

[14] Han, J., Jian, Pei., Yiwen, Yin, and Runying, Mao. Mining Frequent Pattern without Candidate Generation: A Frequent-Pattern Tree Approach. Journal of Data Mining and Knowledge Discovery, 8, pp. 53-87, 2004.

[15] Mata, J., Alvarez, J. L., and Riquelme, J. C. Evolutionary Computing and Optimization: An Evolutionary Algorithm to Discover Numeric Association Rules. Proceedings of ACM Symposium on Applied Computing, pp. 590-594, 2002.
