association rules mining in distributed environments

27
Association Rules Mining in Distributed Environments By: Shamila Mafazi Supervised by: Dr. Abrar Haider

Upload: hinto

Post on 14-Jan-2016

46 views

Category:

Documents


0 download

DESCRIPTION

Association Rules Mining in Distributed Environments. By: Shamila Mafazi Supervised by: Dr. Abrar Haider. Acknowledgment. First I would like to thank my supervisor, Dr. Abrar Haider for his valuable support in this thesis. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Association Rules Mining in Distributed Environments

Association Rules Mining in Distributed Environments

By: Shamila Mafazi

Supervised by: Dr. Abrar Haider

Page 2: Association Rules Mining in Distributed Environments

Acknowledgment

First I would like to thank my supervisor, Dr. Abrar Haider for his valuable support in this thesis.

I specially like to extend my thank to Dr Jiuyong Li who introduced me to the Data Mining world.

Page 3: Association Rules Mining in Distributed Environments

Introduction

Data Mining refers to extraction or mining knowledge from large amount of data (Han & Kamber 2006).

Data Mining Techniques include: Association rules mining, clustering, classification, prediction and so on.

Association Rules Mining retrieves relation and correlation

between the data in a database.

Page 4: Association Rules Mining in Distributed Environments

Research Question

How can a more efficient method for discovering association rules in a distributed environment be created?

Page 5: Association Rules Mining in Distributed Environments

Methodology

To address the research question, this research has reviewed literature relating to the following areas:

Data mining in the centralised environments. Data mining in the distributed environments. Association rules mining in the distributed/centralised

environments. Comparing the existing algorithms in the distributed/centralised

environments and studying the advantages and disadvantages of them.

Developing a concise representation particularly, distributed deduction rules.

Designing the new algorithm based on DTFIM .

Page 6: Association Rules Mining in Distributed Environments

Thesis Contribution

The proposed algorithm by this research intends to resolve marketing problem in distributed environments. Additionally, it aims to profile needs of customers and their preferences in the transaction oriented systems such as, credit cards. One of the most significant problems of market basket data is dealing with large number of candidate itemsets. Retrieving interesting and meaningful patterns from these candidates is extremely difficult. Thus, this research, presents an algorithm which reduces the number of candidate sets and simplifies the process to produce interesting customer preferences and patterns.

Page 7: Association Rules Mining in Distributed Environments

The Importance of Association Rules

Better management of inventory.

Better arrangement of the shelves.

Enhancement in sale of items.

These rules can reveal the effect of continuing or discontinuing the sale of an item on the sale of other items.

Page 8: Association Rules Mining in Distributed Environments

Two Common Measures of Rules

Interestingness

The discovered rules are interesting if they satisfy the minimum support and confidence thresholds.

Support of an itemset is the percentage of the transactions which contain that itemset.

Confidence is the probability of occuring (such as buying) items together.

Page 9: Association Rules Mining in Distributed Environments

Association Rules Mining (ARM)

ARM in centralised environments refers to mining association rules from an integrated database

Apriori AprioriTid AprioriHybrid FIM

ARM in distributed environments refers to mining association rules from distributed databases.

CD FDM

Sampling Partitioning Algorithm DIC algorithm FP Growth algorithm NDI algorithm

ODAM DTFIM

Page 10: Association Rules Mining in Distributed Environments

Apriori Algorithm (Agrawal & Srikant 1994)

(Kantardzic 2003, p.168)

Minimum support thershold=2

1-itemsets

2-itemsets

3-itemsets

Page 11: Association Rules Mining in Distributed Environments

Centralised ARM Algorithm

Frequent Itemsets Mining (FIM) algorithm (Bodon 2004):

Trie data structure (Ansari et al. 2008)

Uses Trie data structure

Fast construction and information retrieval

Memory efficient

Supp(a)=9Supp(a,b)=6

Page 12: Association Rules Mining in Distributed Environments

Centralised ARM Algorithm Non derivable itemsets (NDI) (Calder & Geothals 2002)

Deduction rules divide the itemsets of a data base into derivable and non derivable itemsets.

The deduction rules set tight bounds on the support of an itemset.

The least of the upper bounds and the greatest of the lower bounds are the tight bounds for an itemset.

Derivable itemsets have equal upper and lower bounds.

Derivable itemsets do not represent any new information, not being introduced by their subsets.

The non derivable itemsets are considered as a concise representation of the frequent itemsets.

Page 13: Association Rules Mining in Distributed Environments

Deduction Rules:

An example of a DB transactions (Calders & Geothals 2007)

__supp (abc) = supp(a) –supp(ab) – supp(ac) + supp(abc)0 ≥ supp(a) –supp(ab) – supp(ac) + supp(abc)supp(abc) ≥ supp(ab) + supp(ac) – supp(a)

Page 14: Association Rules Mining in Distributed Environments

Summary of Deduction Rules

If |I\X| is odd then (odd-itemsets) |I\J| +1

Supp (I) ≤ ∑ (-1) supp(J) X⊆ J I⊂

If |I\X| is even then (even-itemsets) |I\J| +1

Supp (I) ≥ ∑ (-1) supp(J) X ⊆ J I⊂

Page 15: Association Rules Mining in Distributed Environments

The Importance of Distributed Data Mining

Distributed databases. Transferring data within sites is extremely costly and time

consuming. Due to security issues and data ownership, transferring the

local data is not permitted

Page 16: Association Rules Mining in Distributed Environments

Distributed ARM Algorithms

Distributed Trie-based Frequent Itemset Mining (DTFIM) (Ansari et al. 2008) Trie based data structure. Memory efficient Fast searching The more skewed data bases, the algorithm acts more

efficient.

Page 17: Association Rules Mining in Distributed Environments

My Contribution

Creating a concise representation by applying distributed deduction rules.

Applying the distributed deduction rules inside the DTFIM algorithm

Page 18: Association Rules Mining in Distributed Environments

Distributed Deduction Rules

|I\J| +1

Supp (I) ≤ ∑ ((-1) ∑ Suppi (J)) X⊆ J I 1 ⊂ ≤i ≤ n

|I\J| +1

Supp (I) ≥ ∑ ((-1) ∑ Suppi (J)) X⊆ J I 1 ⊂ ≤i ≤ n

If |I/X| is odd then

If |I/X| is even then

Page 19: Association Rules Mining in Distributed Environments

The Proposed Algorithm (cont.)

Inputs:

DBi (i=1,…,n): the databases at each site Si.

iterationDepth: number of iterations

minSup: the support threshold

Output: The set of all globally large itemsets L.

Method: Execution of the following program fragment (for the k-th iteration) at the participating sites.

Page 20: Association Rules Mining in Distributed Environments

The Proposed Algorithm (cont.)

k:=1;

while k ≤ iterationDepth do

{ if k=1 then

TR i(1) := findLocalCandidate (DB i,0,1);

else

{ candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1));

if DL(k-1) ≠ 0 then

dFrequent (DL(k-1), NDL(k-1), DL(k));

TR i(k) := findLocalCandidate (DBi, CG(k), k); }

if CG(k) ≠ 0 then // if the CG(k) is not empty

TRi(k-1) :=findNDFrequent (DBi, CG(k), k);

passLocalCandidate (TRi(k));

GLi(k) := getGlobalFrequent (); // globally large k-itemsets

updateLocalCandidates (TRi(k), GLi(k)); // prunes the local candidates which are not globally large

NDL(k) := ∪ⁿ i=1 GLi(k) ;

k:=k+1; }

L(k) := NDL(k) ∪ DL (k) ;

return L(k);

Page 21: Association Rules Mining in Distributed Environments

Candidate Set Generating Procedureprocedure candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1))

for all Z ∈ TR i(k-1) do

{ compute the [l,u] bounds of Z

if Z.sup=Z.l or Z.sup=Z.u then

{ Prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1) ;

if Z.sup=Z.l then

Z.sup=Z.l;

else

Z.sup=Z.u; }

pCG(k) =∪ⁿ i=1 CG i(k) =∪ⁿ i=1 aprioriGen(NDLi(k-1)); //FDM candidate itemsets generator

for all Y ∈ pCG(k) do

{ compute [l,u] bounds on support of Y

if l≠u then

{ Y.l=l; Y.u=u; Insert Y into CG(k) ; }

else {

If u≥ minSup then

{ Insert Y into DL(k), delete it from NDLi(k-1) and TRi(k-1) ; Y.sup=u }

} } }

end procedure

Page 22: Association Rules Mining in Distributed Environments

Derivable Frequent Itemsets Procedure

Procedure dFrequent (DL(k-1), NDL(k-1), DL(k))

DCG(k) := aprioriGen2(DL(k-1), NDL(k-1)); // FDM apriori candidate generator.

for all Z ∈ DCG(k) do

{

compute Z.sup; //compute the s support of Z

if Z.sup ≥ minSup then

Insert Z into DL(k), delete it from NDLi(k-1) and TRi(k-1) ;

}

end procedure

Page 23: Association Rules Mining in Distributed Environments

Explanation of the Proposed Algorithm

Developing the local 1-itemsets vectors: The local DBs are scanned by their local sites independently. the 1-itemsets are determined and their local support counts are

stored in the local vectors.

Global 1-itemsets: The support counts are exchanged within the sites The globally large 1-itemsets are determined.

Initialising the local Tries: Each local site initialise their local Trie based on Global 1-itemsets

Page 24: Association Rules Mining in Distributed Environments

Explanation of the Proposed Algorithm

Production of the global frequent k-itemsets (k≥2): The local candidate 2-itemsets are generated based on the local Tries the local candidate 2-itemsets are stored in a two dimensional array.

Applying the deduction rules: The deduction rules are applied on the local candidate 2-itemsets by each site. The derivable itemsets are deleted from the list of the non-derivable local

candidate 2-itemsets.

Global large 2-itmsests: The support counts of non-derivable locally frequent 2-itemsets are exchanged

within the sites. The global support count for 2-itemsets are determined.

Updating the local Tries: Local sites update their Trie by inserting the global large 2-itmsets.

Page 25: Association Rules Mining in Distributed Environments

Conclusion

This research resolves the problem of market basket data in distributed environments and presents an algorithm which reduces the number of candidate sets and simplifies the process to produce interesting customer preferences and patterns.

Page 26: Association Rules Mining in Distributed Environments

References Ansari, E, Dastghaibifard, GH, Keshtkaran, M, Kaabi, H

2008,‘Distributed Frequent Itemset Mining using Trie Data Structure’, IAENG International Journal of Computer Science, vol. 35, no. 3, pp. 377-381.

Calders, T& Goethals, B 2007, ‘Non Derivable Itemsets mining’, data mining Knowledge Discovery in Databases, vol. 14, pp. 171-206.

Han, J, Kamber, M 2006, Data mining: concepts and techniques, Diane Cerra, United States of America.

Kantardzic, M 2003, Data Mining Concepts, Models, Methods, and Algorithms, A John Wiley & Sons, INC, United State of America.

Page 27: Association Rules Mining in Distributed Environments

Thank you