association rules mining in distributed environments

Association Rules Mining in Distributed Environments

By: Shamila Mafazi

Supervised by: Dr. Abrar Haider

Acknowledgment

First I would like to thank my supervisor, Dr. Abrar Haider for his valuable support in this thesis.

I specially like to extend my thank to Dr Jiuyong Li who introduced me to the Data Mining world.

Introduction

Data Mining refers to extraction or mining knowledge from large amount of data (Han & Kamber 2006).

Data Mining Techniques include: Association rules mining, clustering, classification, prediction and so on.

Association Rules Mining retrieves relation and correlation

between the data in a database.

Research Question

How can a more efficient method for discovering association rules in a distributed environment be created?

Methodology

To address the research question, this research has reviewed literature relating to the following areas:

Data mining in the centralised environments. Data mining in the distributed environments. Association rules mining in the distributed/centralised

environments. Comparing the existing algorithms in the distributed/centralised

environments and studying the advantages and disadvantages of them.

Developing a concise representation particularly, distributed deduction rules.

Designing the new algorithm based on DTFIM .

Thesis Contribution

The proposed algorithm by this research intends to resolve marketing problem in distributed environments. Additionally, it aims to profile needs of customers and their preferences in the transaction oriented systems such as, credit cards. One of the most significant problems of market basket data is dealing with large number of candidate itemsets. Retrieving interesting and meaningful patterns from these candidates is extremely difficult. Thus, this research, presents an algorithm which reduces the number of candidate sets and simplifies the process to produce interesting customer preferences and patterns.

The Importance of Association Rules

Better management of inventory.

Better arrangement of the shelves.

Enhancement in sale of items.

These rules can reveal the effect of continuing or discontinuing the sale of an item on the sale of other items.

Two Common Measures of Rules

Interestingness

The discovered rules are interesting if they satisfy the minimum support and confidence thresholds.

Support of an itemset is the percentage of the transactions which contain that itemset.

Confidence is the probability of occuring (such as buying) items together.

Association Rules Mining (ARM)

ARM in centralised environments refers to mining association rules from an integrated database

Apriori AprioriTid AprioriHybrid FIM

ARM in distributed environments refers to mining association rules from distributed databases.

CD FDM

Sampling Partitioning Algorithm DIC algorithm FP Growth algorithm NDI algorithm

ODAM DTFIM

Apriori Algorithm (Agrawal & Srikant 1994)

(Kantardzic 2003, p.168)

Minimum support thershold=2

1-itemsets

2-itemsets

3-itemsets

Centralised ARM Algorithm

Frequent Itemsets Mining (FIM) algorithm (Bodon 2004):

Trie data structure (Ansari et al. 2008)

Uses Trie data structure

Fast construction and information retrieval

Memory efficient

Supp(a)=9Supp(a,b)=6

Centralised ARM Algorithm Non derivable itemsets (NDI) (Calder & Geothals 2002)

Deduction rules divide the itemsets of a data base into derivable and non derivable itemsets.

The deduction rules set tight bounds on the support of an itemset.

The least of the upper bounds and the greatest of the lower bounds are the tight bounds for an itemset.

Derivable itemsets have equal upper and lower bounds.

Derivable itemsets do not represent any new information, not being introduced by their subsets.

The non derivable itemsets are considered as a concise representation of the frequent itemsets.

Deduction Rules:

An example of a DB transactions (Calders & Geothals 2007)

__supp (abc) = supp(a) –supp(ab) – supp(ac) + supp(abc)0 ≥ supp(a) –supp(ab) – supp(ac) + supp(abc)supp(abc) ≥ supp(ab) + supp(ac) – supp(a)

Summary of Deduction Rules

If |I\X| is odd then (odd-itemsets) |I\J| +1

Supp (I) ≤ ∑ (-1) supp(J) X⊆ J I⊂

If |I\X| is even then (even-itemsets) |I\J| +1

Supp (I) ≥ ∑ (-1) supp(J) X ⊆ J I⊂

The Importance of Distributed Data Mining

Distributed databases. Transferring data within sites is extremely costly and time

consuming. Due to security issues and data ownership, transferring the

local data is not permitted

Distributed ARM Algorithms

Distributed Trie-based Frequent Itemset Mining (DTFIM) (Ansari et al. 2008) Trie based data structure. Memory efficient Fast searching The more skewed data bases, the algorithm acts more

efficient.

My Contribution

Creating a concise representation by applying distributed deduction rules.

Applying the distributed deduction rules inside the DTFIM algorithm

Distributed Deduction Rules

|I\J| +1

Supp (I) ≤ ∑ ((-1) ∑ Suppi (J)) X⊆ J I 1 ⊂ ≤i ≤ n

|I\J| +1

Supp (I) ≥ ∑ ((-1) ∑ Suppi (J)) X⊆ J I 1 ⊂ ≤i ≤ n

If |I/X| is odd then

If |I/X| is even then

The Proposed Algorithm (cont.)

Inputs:

DBi (i=1,…,n): the databases at each site Si.

iterationDepth: number of iterations

minSup: the support threshold

Output: The set of all globally large itemsets L.

Method: Execution of the following program fragment (for the k-th iteration) at the participating sites.

The Proposed Algorithm (cont.)

k:=1;

while k ≤ iterationDepth do

{ if k=1 then

TR i(1) := findLocalCandidate (DB i,0,1);

else

{ candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1));

if DL(k-1) ≠ 0 then

dFrequent (DL(k-1), NDL(k-1), DL(k));

TR i(k) := findLocalCandidate (DBi, CG(k), k); }

if CG(k) ≠ 0 then // if the CG(k) is not empty

TRi(k-1) :=findNDFrequent (DBi, CG(k), k);

passLocalCandidate (TRi(k));

GLi(k) := getGlobalFrequent (); // globally large k-itemsets

updateLocalCandidates (TRi(k), GLi(k)); // prunes the local candidates which are not globally large

NDL(k) := ∪ⁿ i=1 GLi(k) ;

k:=k+1; }

L(k) := NDL(k) ∪ DL (k) ;

return L(k);

Candidate Set Generating Procedureprocedure candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1))

for all Z ∈ TR i(k-1) do

{ compute the [l,u] bounds of Z

if Z.sup=Z.l or Z.sup=Z.u then

{ Prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1) ;

if Z.sup=Z.l then

Z.sup=Z.l;

else

Z.sup=Z.u; }

pCG(k) =∪ⁿ i=1 CG i(k) =∪ⁿ i=1 aprioriGen(NDLi(k-1)); //FDM candidate itemsets generator

for all Y ∈ pCG(k) do

{ compute [l,u] bounds on support of Y

if l≠u then

{ Y.l=l; Y.u=u; Insert Y into CG(k) ; }

else {

If u≥ minSup then

{ Insert Y into DL(k), delete it from NDLi(k-1) and TRi(k-1) ; Y.sup=u }

} } }

end procedure

Derivable Frequent Itemsets Procedure

Procedure dFrequent (DL(k-1), NDL(k-1), DL(k))

DCG(k) := aprioriGen2(DL(k-1), NDL(k-1)); // FDM apriori candidate generator.

for all Z ∈ DCG(k) do

{

compute Z.sup; //compute the s support of Z

if Z.sup ≥ minSup then

Insert Z into DL(k), delete it from NDLi(k-1) and TRi(k-1) ;

}

end procedure

Explanation of the Proposed Algorithm

Developing the local 1-itemsets vectors: The local DBs are scanned by their local sites independently. the 1-itemsets are determined and their local support counts are

stored in the local vectors.

Global 1-itemsets: The support counts are exchanged within the sites The globally large 1-itemsets are determined.

Initialising the local Tries: Each local site initialise their local Trie based on Global 1-itemsets

Explanation of the Proposed Algorithm

Production of the global frequent k-itemsets (k≥2): The local candidate 2-itemsets are generated based on the local Tries the local candidate 2-itemsets are stored in a two dimensional array.

Applying the deduction rules: The deduction rules are applied on the local candidate 2-itemsets by each site. The derivable itemsets are deleted from the list of the non-derivable local

candidate 2-itemsets.

Global large 2-itmsests: The support counts of non-derivable locally frequent 2-itemsets are exchanged

within the sites. The global support count for 2-itemsets are determined.

Updating the local Tries: Local sites update their Trie by inserting the global large 2-itmsets.

Conclusion

This research resolves the problem of market basket data in distributed environments and presents an algorithm which reduces the number of candidate sets and simplifies the process to produce interesting customer preferences and patterns.

References Ansari, E, Dastghaibifard, GH, Keshtkaran, M, Kaabi, H

2008,‘Distributed Frequent Itemset Mining using Trie Data Structure’, IAENG International Journal of Computer Science, vol. 35, no. 3, pp. 377-381.

Calders, T& Goethals, B 2007, ‘Non Derivable Itemsets mining’, data mining Knowledge Discovery in Databases, vol. 14, pp. 171-206.

Han, J, Kamber, M 2006, Data mining: concepts and techniques, Diane Cerra, United States of America.

Kantardzic, M 2003, Data Mining Concepts, Models, Methods, and Algorithms, A John Wiley & Sons, INC, United State of America.

Thank you

association rules mining in distributed environments

Documents