mtech project seminar1


Upload: pateal

Post on 15-Nov-2014


Page 1: Mtech Project Seminar1

Overview of this Project

Mining frequent patterns through association rule mining is a fundamentally important task in knowledge discovery in large databases.

This project report focuses on the generation of frequent patterns, which is the most important task in association rule mining.

It does so by analyzing implementations of well-known association rule mining algorithms: Apriori, Dynamic Itemset Counting (DIC), and FP-Growth.

The experimental system is developed in Java under the Windows XP operating system. The run-time behavior of these algorithms is analyzed and compared using the Mushroom dataset.

Page 2: Mtech Project Seminar1

Outline

• Introduction

• Association Rule Mining to Frequent Patterns

• Implementation

• Conclusions

• Future Enhancements

• Bibliography

Page 3: Mtech Project Seminar1

Introduction to Frequent Patterns

Frequent pattern generation is the most important task in association rule mining.

The well-known association rule algorithms for mining frequent patterns are:

Apriori

Dynamic Itemset Counting (DIC)

FP-Growth

Page 4: Mtech Project Seminar1

Association Rule Mining

Association rule mining is one of the fundamental data mining techniques.

An association rule expresses a relationship among a set of objects, such as items that occur together or one item implying another.

The goal of association rule mining is to find interesting association relationships among large sets of data items.

Each rule is assigned two measures: support and confidence.
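The two measures can be illustrated with a small sketch. It assumes the usual definitions (support as the fraction of transactions containing an itemset, confidence of X => Y as support(X and Y together) divided by support(X)); the class and method names are hypothetical and are not taken from the project.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the project described in the slides.
class RuleMeasures {
    // Support: fraction of transactions that contain every item of the itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        if (transactions.isEmpty()) return 0.0;
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // Confidence of the rule X => Y: support of (X union Y) divided by support of X.
    static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        double supportX = support(transactions, x);
        return supportX == 0.0 ? 0.0 : support(transactions, union) / supportX;
    }
}
```

For example, if "bread" occurs in three of four transactions and "bread" with "milk" in two of them, the rule bread => milk has support 0.5 and confidence 2/3.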

Page 5: Mtech Project Seminar1

Generally, association rule mining is performed in two steps:

• Find all frequent itemsets. The foundation of association rule algorithms is the fact that any subset of a frequent itemset must also be frequent; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Frequent itemsets are found iteratively, with cardinality from 1 to k (k-itemsets).

• Use the frequent itemsets to generate strong rules that satisfy the minimum confidence.
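The second step can be sketched as follows. This is a minimal illustration, assuming the support counts of all frequent itemsets are already available in a map; the name RuleGenerator and the rule format "X => Y" are hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the project described in the slides.
class RuleGenerator {
    // Splits one frequent itemset into every antecedent X and consequent Y,
    // and keeps the rules whose confidence count(X and Y) / count(X) reaches minConf.
    static List<String> strongRules(Set<String> itemset,
                                    Map<Set<String>, Integer> counts,
                                    double minConf) {
        List<String> rules = new ArrayList<>();
        Integer whole = counts.get(itemset);
        if (whole == null) return rules;
        List<String> items = new ArrayList<>(new TreeSet<>(itemset));
        int n = items.size();
        // Each bit mask selects a non-empty proper subset as the antecedent.
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            Set<String> x = new TreeSet<>();
            Set<String> y = new TreeSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) x.add(items.get(i));
                else y.add(items.get(i));
            }
            Integer left = counts.get(x);
            if (left == null || left == 0) continue;
            double confidence = (double) whole / left;
            if (confidence >= minConf) rules.add(x + " => " + y);
        }
        return rules;
    }
}
```

Because every subset of a frequent itemset is itself frequent, the count of each antecedent is expected to be present in the map; the null check only guards against incomplete input.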

Page 6: Mtech Project Seminar1

FP Array

• FP-array techniques greatly reduce the need to traverse FP-trees.

• FP-array techniques obtain significantly improved performance compared with FP-tree-based algorithms.

• The FP-array is a new technique for finding maximal and closed frequent itemsets.

Page 7: Mtech Project Seminar1

FP Array Applications

• It generates the frequent patterns from existing datasets.

• It applies the minimum support to the given input data.

• It reports the time taken to search for the frequent itemsets.

• It displays the number of records, row-wise and column-wise, from the datasets.

Page 8: Mtech Project Seminar1

Rule to Mine Frequent Items

Frequent itemset mining algorithms are classified according to the following aspects:

• The type of frequent itemset discovered

• The use of candidates

• The representation of the transactions

• The itemset representation used in the algorithm

• The number of disk accesses

• The length of the maximal frequent pattern

Page 9: Mtech Project Seminar1

The implemented algorithms differ as follows:

[Comparison table; the cell marks were lost in extraction. Columns: Apriori, DIC, FP. Rows: With candidate generation, Without candidate generation, BFS, DFS, FP-Tree.]

Page 10: Mtech Project Seminar1

Stages in Knowledge Discovery in Frequent Databases

Selection - selecting and segmenting the data that are relevant to the given criteria.

Preprocessing - the data cleaning stage, where unnecessary information is removed.

Transformation - the data is made usable and navigable.

Data Mining - extraction of patterns from the data.

Interpretation and Evaluation - the patterns found in the data mining stage are converted into knowledge to support decision-making.

Data Visualization - examining large volumes of data to detect patterns visually.

Page 11: Mtech Project Seminar1

Discoveries in Frequent Databases


Page 12: Mtech Project Seminar1

Apriori Algorithm

The Apriori algorithm is the most popular association rule algorithm. Apriori uses a bottom-up search.

The Apriori algorithm works as follows:

• In the first step, the algorithm generates the candidate 1-itemsets. The count of each itemset is then compared with the minimum support value to find the set L1 of frequent 1-itemsets.

• In the second step, the algorithm uses L1 to construct the set C2 of candidate 2-itemsets. The process repeats with larger itemsets and finishes when there are no more candidates.

Page 13: Mtech Project Seminar1

In each phase, all the transactions in the data set are scanned.

Finally, all frequent itemsets are returned.

Disadvantage: multiple database scans.
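The level-wise search described above can be sketched as follows. This is a minimal in-memory illustration, not the project's implementation: the class name AprioriSketch, the use of an absolute count (minCount) as the support threshold, and the representation of transactions as sets of strings are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the project's code.
class AprioriSketch {
    // Returns every frequent itemset together with its support count.
    static Map<Set<String>, Integer> mine(List<Set<String>> db, int minCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        // C1: every distinct item is a candidate 1-itemset.
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (Set<String> t : db)
            for (String item : t) candidates.add(new TreeSet<>(List.of(item)));
        while (!candidates.isEmpty()) {
            // One full database scan per level: this is the repeated-scan cost.
            Map<Set<String>, Integer> level = new LinkedHashMap<>();
            for (Set<String> c : candidates) {
                int n = 0;
                for (Set<String> t : db) if (t.containsAll(c)) n++;
                if (n >= minCount) level.put(c, n);
            }
            frequent.putAll(level);
            candidates = join(level.keySet());
        }
        return frequent;
    }

    // Builds the (k+1)-candidates from the frequent k-itemsets and prunes
    // any candidate that has an infrequent k-subset.
    static Set<Set<String>> join(Set<Set<String>> lk) {
        Set<Set<String>> next = new LinkedHashSet<>();
        for (Set<String> a : lk) {
            for (Set<String> b : lk) {
                Set<String> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() != a.size() + 1) continue;
                boolean allSubsetsFrequent = true;
                for (String item : union) {
                    Set<String> sub = new TreeSet<>(union);
                    sub.remove(item);
                    if (!lk.contains(sub)) { allSubsetsFrequent = false; break; }
                }
                if (allSubsetsFrequent) next.add(union);
            }
        }
        return next;
    }
}
```

The loop ends when a level produces no candidates, which matches the stopping condition stated on the slide.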

Page 14: Mtech Project Seminar1

DIC Algorithm

The DIC (Dynamic Itemset Counting) algorithm, which uses fewer database scans, presents a new approach for finding large itemsets.

The aim of the DIC algorithm is to improve performance and eliminate repeated database scans.

The DIC algorithm divides the database into partitions (intervals of M transactions) and uses a dynamic counting strategy. It determines stop points for itemset counting: at any appropriate point during the database scan, it can stop counting some itemsets and start counting others.

Four symbols indicate the different states of itemsets: solid box, solid circle, dashed box, and dashed circle. A box marks an itemset whose count has reached the support threshold and a circle one whose count has not; a dashed outline means the itemset is still being counted, and a solid outline means it has been counted over all transactions.

Page 15: Mtech Project Seminar1

The algorithm is described as follows:

Step 1: The empty itemset is marked with a solid box, and all the 1-itemsets are marked with dashed circles.

Step 2: After reading one interval of M transactions from the database, do the following:

• Check each itemset in a dashed circle. If its count exceeds the support threshold, change it from a dashed circle to a dashed box.

• Check each immediate superset of an itemset that has just become a dashed box. If all the subsets of that superset are in a solid box or a dashed box, add the superset as a new dashed circle.

• Check each itemset in a dashed circle or a dashed box. If it has been counted over all the transactions, change it to a solid circle if it is a circle, or to a solid box if it is a box.

Step 3: If the end of the transactions is reached, go back to the beginning. Repeat Step 2 until no itemset remains in a dashed circle or a dashed box.
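The steps above can be sketched in Java as follows. This is a simplified in-memory illustration under several assumptions: the threshold is an absolute count that an itemset must reach (minCount), the empty itemset is left implicit, and the names DicSketch, State, and Counter are hypothetical rather than taken from the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the project's code.
class DicSketch {
    enum State { DASHED_CIRCLE, DASHED_BOX, SOLID_CIRCLE, SOLID_BOX }

    static final class Counter {
        State state = State.DASHED_CIRCLE;
        int count; // transactions containing the itemset so far
        int seen;  // transactions examined since counting started
        boolean dashed() { return state == State.DASHED_CIRCLE || state == State.DASHED_BOX; }
        boolean boxed() { return state == State.DASHED_BOX || state == State.SOLID_BOX; }
    }

    static Set<Set<String>> mine(List<Set<String>> db, int minCount, int m) {
        if (m < 1) throw new IllegalArgumentException("interval length must be at least 1");
        Map<Set<String>, Counter> counters = new LinkedHashMap<>();
        Set<String> items = new TreeSet<>();
        for (Set<String> t : db) items.addAll(t);
        // Step 1: every 1-itemset starts as a dashed circle.
        for (String i : items) counters.put(new TreeSet<>(List.of(i)), new Counter());
        int pos = 0;
        while (counters.values().stream().anyMatch(Counter::dashed)) {
            // Step 2: read one interval of m transactions; wrapping around is Step 3.
            for (int k = 0; k < m; k++) {
                Set<String> t = db.get(pos);
                pos = (pos + 1) % db.size();
                for (Map.Entry<Set<String>, Counter> e : counters.entrySet()) {
                    Counter c = e.getValue();
                    if (c.dashed() && c.seen < db.size()) {
                        c.seen++;
                        if (t.containsAll(e.getKey())) c.count++;
                    }
                }
            }
            // Stop point: dashed circles that reached the threshold become dashed boxes.
            List<Set<String>> promoted = new ArrayList<>();
            for (Map.Entry<Set<String>, Counter> e : counters.entrySet()) {
                Counter c = e.getValue();
                if (c.state == State.DASHED_CIRCLE && c.count >= minCount) {
                    c.state = State.DASHED_BOX;
                    promoted.add(e.getKey());
                }
            }
            // Start counting each immediate superset whose subsets are all boxes.
            for (Set<String> s : promoted) {
                for (String i : items) {
                    if (s.contains(i)) continue;
                    Set<String> sup = new TreeSet<>(s);
                    sup.add(i);
                    if (!counters.containsKey(sup) && allSubsetsBoxed(sup, counters)) {
                        counters.put(sup, new Counter());
                    }
                }
            }
            // Itemsets counted over every transaction become solid.
            for (Counter c : counters.values()) {
                if (c.dashed() && c.seen >= db.size()) {
                    c.state = c.state == State.DASHED_BOX ? State.SOLID_BOX : State.SOLID_CIRCLE;
                }
            }
        }
        Set<Set<String>> frequent = new LinkedHashSet<>();
        counters.forEach((itemset, c) -> { if (c.state == State.SOLID_BOX) frequent.add(itemset); });
        return frequent;
    }

    static boolean allSubsetsBoxed(Set<String> sup, Map<Set<String>, Counter> counters) {
        for (String i : sup) {
            Set<String> sub = new TreeSet<>(sup);
            sub.remove(i);
            Counter c = counters.get(sub);
            if (c == null || !c.boxed()) return false;
        }
        return true;
    }
}
```

Each counter examines exactly one full cycle of the database starting from the stop point at which it was created, so the final counts are exact even though new itemsets begin counting in the middle of a scan.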

Page 16: Mtech Project Seminar1

FP-Growth

FP-Growth is an algorithm for generating frequent itemsets for association rules. It compresses a large database into a compact frequent-pattern tree (FP-tree) structure.

The FP-tree structure stores all the necessary information about the frequent itemsets in a database.

A frequent-pattern tree (FP-tree for short) is defined as follows:

1. The root is labeled “null” and has a set of item-prefix subtrees as its children.

2. Each node consists of three fields: item-name (the frequent item the node represents), count (the number of transactions that share the path reaching that node), and node-link (the next node in the FP-tree carrying the same item-name).

3. The frequent-item header table contains two fields: item-name and head of node-link (which points to the first node in the FP-tree carrying that item).
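The definition above maps onto a small data structure. The sketch below is illustrative only: the class name FpTreeSketch is hypothetical, the children map and the tails map are implementation conveniences that the slides do not mention, and transactions are assumed to arrive already filtered and ordered.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the project's code.
class FpTreeSketch {
    static final class Node {
        final String itemName; // null for the root
        int count;             // transactions sharing the path to this node
        Node nodeLink;         // next node carrying the same item-name
        final Map<String, Node> children = new HashMap<>();
        Node(String itemName) { this.itemName = itemName; }
    }

    final Node root = new Node(null);
    // Header table: item-name -> head of the node-link chain.
    final Map<String, Node> headerTable = new LinkedHashMap<>();
    // Last node of each chain, kept only so new nodes can be appended quickly.
    private final Map<String, Node> tails = new HashMap<>();

    // Inserts one transaction whose items are already filtered and ordered.
    void insert(List<String> orderedItems) {
        Node current = root;
        for (String item : orderedItems) {
            Node child = current.children.get(item);
            if (child == null) {
                child = new Node(item);
                current.children.put(item, child);
                Node previousTail = tails.put(item, child);
                if (previousTail == null) headerTable.put(item, child);
                else previousTail.nodeLink = child;
            }
            child.count++;
            current = child;
        }
    }

    // Total support of an item: the sum of counts along its node-link chain.
    int supportOf(String item) {
        int total = 0;
        for (Node n = headerTable.get(item); n != null; n = n.nodeLink) total += n.count;
        return total;
    }
}
```

Transactions that share a prefix share nodes, which is where the compression comes from; the node-links let all occurrences of one item be visited without traversing the whole tree.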

Page 17: Mtech Project Seminar1

Use Case Diagram for the Proposed System

[Diagram not reproduced. Actor: User. Elements: Data Set File, Apriori, Dynamic Itemset Counting, FP-Growth.]

Page 18: Mtech Project Seminar1

Identifying Classes from the above Use Cases

[Diagram not reproduced.]

Page 19: Mtech Project Seminar1

Architectural design

Architectural design is the division of software into subsystems and components, together with the process of deciding how these will be connected and how they will interact, including determining their interfaces.

GUI for selecting the file, support, and algorithm

Apriori

Dynamic Itemset Counting

FP-Growth

Matrix Based Association

Page 20: Mtech Project Seminar1

User interface design

The user interface is designed to display and obtain the needed information in an accessible, efficient manner. The user interface can employ one or more windows. Each window should serve a clear, specific purpose.

Page 21: Mtech Project Seminar1

Step 1: Select the file name

Page 22: Mtech Project Seminar1

Step 2: Display the contents of the file in the text area

Page 23: Mtech Project Seminar1

Step 3: Enter a valid support

Page 24: Mtech Project Seminar1

Step 4: Select the algorithm

Page 25: Mtech Project Seminar1

Step 5: Display the frequent patterns for Apriori

Page 26: Mtech Project Seminar1

Step 6: If the selected algorithm is DIC, enter the step length

Page 27: Mtech Project Seminar1

Step 7: Display the frequent patterns for DIC

Page 28: Mtech Project Seminar1

Step 8: Display the frequent patterns for FP-Growth

Page 29: Mtech Project Seminar1

Step 9: Display the frequent patterns for MBA

Page 30: Mtech Project Seminar1

RESULTS

The FPMiner tool is implemented in Java, and all the experiments were performed on a 1.7 GHz PC with 256 MB of memory running Windows XP.

Experiment 1:

Execution times of the different algorithms for different support values:

Support   Apriori   DIC         FP-Growth
50        187 ms    226754 ms   94 ms
60        110 ms    184297 ms   74 ms
70        78 ms     161265 ms   46 ms
80        47 ms     106953 ms   32 ms
90        32 ms     74984 ms    31 ms
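Execution times of this kind can be collected with a small timing helper. The sketch below is only an assumption about how such a measurement might be taken; the slides do not say how FPMiner measures time, and the name TimingSketch is hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical helper; the slides do not describe how FPMiner measures time.
class TimingSketch {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static <T> long timeMillis(Supplier<T> task) {
        long start = System.nanoTime();
        task.get();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

A single run includes JVM warm-up effects, so repeated runs would give steadier figures than one measurement per support value.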

Page 31: Mtech Project Seminar1

Experiment 2:

The number of frequent itemsets generated by the different algorithms:

Support   Apriori   MBA
50        153       153
60        51        51
70        31        31
80        23        23
90        9         9

Page 32: Mtech Project Seminar1

CONCLUSION

Frequent pattern mining is used to find frequent itemsets among the items in a given data set.

The results show that:

• Apriori is less efficient than the FP-tree approach.

• Apriori runs slowly because the transactions in the dataset are dense.

• DIC (Dynamic Itemset Counting) is much slower than every other algorithm on the real dataset.

• MBA is better than DIC but not much better than the other two on the Mushroom dataset.

Page 33: Mtech Project Seminar1

FUTURE ENHANCEMENT

There are still many interesting research issues related to the extensions of frequent pattern mining, such as mining structured patterns by further development of these approaches, mining approximate or fault-tolerant patterns in noisy environments, frequent-pattern-based clustering and classification, and so on.

Page 34: Mtech Project Seminar1
Page 35: Mtech Project Seminar1

FP Array Techniques

FP-array techniques greatly reduce the need to traverse FP-trees.

FP-array techniques obtain significantly improved performance compared with FP-tree-based algorithms.

The FP-array is a new technique for finding all maximal and closed frequent itemsets.

Page 36: Mtech Project Seminar1

The FP-tree uses a compact data structure based on the following properties:

- Frequent pattern mining performs one scan of the database to determine the set of frequent items.

- The method stores each frequent item in a compact structure, so more than two database scans are unnecessary.

- Each frequent item is located in the FP-tree, and each node holds an item and the count of that item.

- The items have to be sorted in descending order of frequency.
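The two scans and the ordering requirement can be sketched as follows. This is a minimal illustration, assuming an absolute minimum count and alphabetical tie-breaking (a choice made here so that the order is deterministic); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the project's code.
class FrequencyOrdering {
    // First scan: count the number of transactions containing each item.
    static Map<String, Integer> countItems(List<List<String>> db) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> t : db) {
            for (String item : new LinkedHashSet<>(t)) counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    // Second scan: drop infrequent items and sort the rest by descending count.
    static List<List<String>> order(List<List<String>> db, int minCount) {
        Map<String, Integer> counts = countItems(db);
        List<List<String>> ordered = new ArrayList<>();
        for (List<String> t : db) {
            List<String> kept = new ArrayList<>();
            for (String item : new LinkedHashSet<>(t)) {
                if (counts.get(item) >= minCount) kept.add(item);
            }
            kept.sort((x, y) -> {
                int byCount = Integer.compare(counts.get(y), counts.get(x));
                return byCount != 0 ? byCount : x.compareTo(y); // assumed tie-break
            });
            ordered.add(kept);
        }
        return ordered;
    }
}
```

Ordering every transaction the same way makes transactions with common frequent items share prefixes, which keeps the resulting tree compact.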