mtech project seminar1

Mining frequent patterns through association rule mining is a fundamentally important task in the process of knowledge discovery in large databases.

The main focus of this project report is the generation of frequent patterns, which is the most important task in explaining the fundamentals of association rule mining.

This is done by analyzing implementations of well-known association rule mining algorithms: Apriori, Dynamic Itemset Counting (DIC), and FP-Growth.

The experimental system is developed in Java under the Windows XP operating system. The run-time behavior of these algorithms is analyzed and compared using the Mushroom dataset.

Overview of this Project

Outline

• Introduction

• Association Rule Mining to Frequent Patterns

• Implementation

• Conclusions

• Future Enhancements

• Bibliography

Frequent pattern generation is the most important task in explaining the fundamentals of association rule mining techniques.

The well-known association-rule-based algorithms for mining frequent patterns:

Apriori

Dynamic Itemset Counting

FP-Growth

Introduction to Frequent Patterns

Association rule mining is one of the fundamental data mining techniques.

An association rule expresses a relationship among a set of objects, such as items that occur together or one item implying another.

The goal of association rule mining is to find interesting association relationships among large sets of data items.

Each rule is assigned two factors: Support and Confidence
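The two factors can be illustrated with a minimal Java sketch. It assumes the usual definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule X => Y is support(X and Y together) divided by support(X). The class name, method names and sample transactions are hypothetical and not taken from the project code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SupportConfidenceSketch {
    // Fraction of transactions that contain every item of the itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // confidence(X => Y) = support(X united with Y) / support(X).
    static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> y) {
        Set<String> both = new HashSet<>(x);
        both.addAll(y);
        return support(transactions, both) / support(transactions, x);
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        List<Set<String>> tx = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "butter"),
            Set.of("bread", "milk", "butter"),
            Set.of("milk"));
        System.out.println(support(tx, Set.of("bread", "milk")));
        System.out.println(confidence(tx, Set.of("bread"), Set.of("milk")));
    }
}
```

In this sample, {bread, milk} occurs in 2 of 4 transactions (support 2/4), and bread occurs in 3, so the rule bread => milk has confidence 2/3.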

Association Rule Mining

Generally association rule mining is performed in two steps:

• Find all frequent itemsets. The basic foundation of association rule algorithms is the fact that any subset of a frequent itemset must also be a frequent itemset; i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets. Frequent itemsets are found iteratively, with cardinality from 1 to k (k-itemsets).

• Use frequent item sets to generate strong rules having minimum confidence.
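The second step can be sketched as follows, assuming the support counts of all frequent itemsets are already available in a map. The class name, the method `strongRules` and the map layout are assumptions made for illustration, not the project's actual code.

```java
import java.util.*;

final class RuleGenerationSketch {
    // For one frequent itemset, returns every rule X => (itemset minus X) whose
    // confidence count(itemset) / count(X) reaches minConfidence.
    static List<String> strongRules(Map<Set<String>, Integer> supportCount,
                                    Set<String> itemset, double minConfidence) {
        List<String> items = new ArrayList<>(new TreeSet<>(itemset));
        List<String> rules = new ArrayList<>();
        int n = items.size();
        // Each bitmask from 1 to 2^n - 2 selects a non-empty proper subset X.
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            Set<String> x = new TreeSet<>();
            Set<String> y = new TreeSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) x.add(items.get(i)); else y.add(items.get(i));
            }
            // X is a subset of a frequent itemset, so it is frequent as well
            // and its count is expected to be present in the map.
            double confidence = (double) supportCount.get(itemset) / supportCount.get(x);
            if (confidence >= minConfidence) rules.add(x + " => " + y);
        }
        return rules;
    }

    public static void main(String[] args) {
        // Hypothetical support counts, for illustration only.
        Map<Set<String>, Integer> counts = new HashMap<>();
        counts.put(Set.of("A"), 3);
        counts.put(Set.of("B"), 4);
        counts.put(Set.of("A", "B"), 3);
        System.out.println(strongRules(counts, Set.of("A", "B"), 0.8));
    }
}
```

With these counts, A => B has confidence 3/3 and is kept, while B => A has confidence 3/4 and is rejected at a minimum confidence of 0.8.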

FP Array

• The FP Array technique greatly reduces the need to traverse FP-trees.

• The FP Array technique obtains significantly improved performance compared with FP-tree-based algorithms.

• The FP Array technique is used in new algorithms for finding maximal and closed frequent itemsets.

FP Array Applications

• It generates the frequent patterns from existing datasets.

• It applies the minimum support to the given input data.

• Time complexity of searching for the frequent itemsets.

• It displays the number of records, row-wise and column-wise, from the datasets.

Rule to Mine Frequent Items

The frequent itemset mining algorithms are classified considering the following aspects:

• The type of the discovered frequent itemset

• Using candidates

• The representation of the transactions

• The itemsets representation used in the algorithm

• The number of disk accesses

• The length of the maximal frequent pattern

The implemented algorithms differ as follows:

[Figure: classification of Apriori, DIC and FP-Growth. Labels: with candidate generation, without candidate generation, BFS, DFS, FP-Tree.]

Stages in Knowledge Discovery in Frequent Databases

Selection - selecting and segmenting the data that are relevant to given criteria.

Preprocessing - the data cleaning stage, where unnecessary information is removed.

Transformation - the data is made usable and navigable.

Data Mining - extraction of patterns from the data.

Interpretation and Evaluation - the patterns found in the data mining stage are converted into knowledge to support decision-making.

Data Visualization - examining large volumes of data to detect patterns visually.


Discoveries in Frequent Databases


The Apriori algorithm is the most popular association rule algorithm. Apriori uses a bottom-up search.

The Apriori algorithm works as follows:

• In the first step, the Apriori algorithm generates the candidate 1-itemsets. The count of each itemset is then compared with the minimum support value to find the set L1 of frequent 1-itemsets.

• In the second step, the algorithm uses L1 to construct the set C2 of candidate 2-itemsets. The process repeats level by level and finishes when there are no more candidates.

Apriori Algorithm

In each phase, all the transactions in the data set are scanned.

Finally, all frequent itemsets are returned.

Disadvantage: multiple database scans.
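The level-wise search described above can be sketched in Java as follows. This is a minimal illustration under simplifying assumptions (transactions held in memory as sets of strings, support given as an absolute count); the class and method names are hypothetical and this is not the FPMiner implementation.

```java
import java.util.*;

final class AprioriSketch {
    // Returns every itemset whose support count is at least minCount.
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> transactions, int minCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        // C1: every distinct item is a candidate 1-itemset.
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (Set<String> t : transactions)
            for (String item : t) candidates.add(new TreeSet<>(List.of(item)));

        while (!candidates.isEmpty()) {
            // One full scan of the transactions per level to count the candidates.
            Map<Set<String>, Integer> level = new LinkedHashMap<>();
            for (Set<String> c : candidates) {
                int count = 0;
                for (Set<String> t : transactions) if (t.containsAll(c)) count++;
                if (count >= minCount) level.put(c, count);
            }
            frequent.putAll(level);
            candidates = nextCandidates(level.keySet());
        }
        return frequent;
    }

    // Joins frequent k-itemsets into (k+1)-candidates and prunes any candidate
    // that has an infrequent k-subset.
    static Set<Set<String>> nextCandidates(Set<Set<String>> frequentK) {
        Set<Set<String>> next = new LinkedHashSet<>();
        for (Set<String> a : frequentK) {
            for (Set<String> b : frequentK) {
                Set<String> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() != a.size() + 1) continue;
                boolean allSubsetsFrequent = true;
                for (String item : union) {
                    Set<String> subset = new TreeSet<>(union);
                    subset.remove(item);
                    if (!frequentK.contains(subset)) { allSubsetsFrequent = false; break; }
                }
                if (allSubsetsFrequent) next.add(union);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        List<Set<String>> tx = List.of(
            Set.of("A", "B", "C"), Set.of("A", "B"), Set.of("A", "C"), Set.of("A"));
        System.out.println(frequentItemsets(tx, 2));
    }
}
```

The `while` loop makes the disadvantage visible: every level triggers another pass over all transactions.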

The DIC (Dynamic Itemset Counting) algorithm, which uses fewer database scans, presents a new approach for finding large itemsets.

The aim of the DIC algorithm is to improve performance and eliminate repeated database scans.

The DIC algorithm divides the database into partitions (intervals of M transactions) and uses a dynamic counting strategy. It defines stop points for itemset counting: at any such point during the database scan, the algorithm can stop counting some itemsets and start counting others.

Four symbols indicate the different states of an itemset: solid box, solid circle, dashed box, and dashed circle. A dashed shape is still being counted and a solid shape has been counted over all transactions; a box marks an itemset whose count has reached the support threshold, and a circle marks one whose count has not.

DIC Algorithm

The algorithm is described as follows:

Step 1: The empty itemset is marked with a solid box and all the 1-itemsets are marked with dashed circles.

Step 2: After reading one interval of M transactions from the database, do the following:

• Check each itemset in a dashed circle. If its count exceeds the support threshold, change it from a dashed circle to a dashed box.

• Check each immediate superset of an itemset that has just moved to a dashed box. If all of the superset's subsets are in a solid box or dashed box, add the superset as a new dashed circle.

• Check each itemset in a dashed circle or dashed box. If it has been counted over all the transactions, change it to a solid circle if it is in a circle, or to a solid box if it is in a box.

Step 3: If the end of the transactions is reached, go back to the beginning and repeat Step 2 until no itemset remains in a dashed circle or dashed box.
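The three steps can be sketched in Java as follows. This is a simplified illustration, not the FPMiner implementation: transactions are held in memory, the threshold is an absolute count that must be reached, state changes are checked only at the stop points, and all class, field and method names are hypothetical.

```java
import java.util.*;

final class DicSketch {
    // Per-itemset bookkeeping: support count, number of transactions counted
    // so far, and whether the itemset is drawn as a box (threshold reached).
    static final class State { int count; int seen; boolean box; }

    static Set<Set<String>> frequentItemsets(List<Set<String>> db, int minCount, int m) {
        int n = db.size();
        Set<String> items = new TreeSet<>();
        for (Set<String> t : db) items.addAll(t);

        Map<Set<String>, State> dashed = new LinkedHashMap<>(); // still being counted
        Map<Set<String>, State> solid = new LinkedHashMap<>();  // counted over all n
        State empty = new State();
        empty.box = true;
        solid.put(new TreeSet<>(), empty);                      // Step 1
        for (String item : items) dashed.put(new TreeSet<>(Set.of(item)), new State());

        int pos = 0;
        while (!dashed.isEmpty()) {
            // Step 2: read one interval of m transactions; the modulo wraps
            // around to the beginning of the database (Step 3).
            for (int i = 0; i < m; i++, pos = (pos + 1) % n) {
                for (Map.Entry<Set<String>, State> e : dashed.entrySet()) {
                    State s = e.getValue();
                    if (s.seen == n) continue; // already counted over every transaction
                    if (db.get(pos).containsAll(e.getKey())) s.count++;
                    s.seen++;
                }
            }
            // Dashed circle -> dashed box once the threshold is reached.
            List<Set<String>> newBoxes = new ArrayList<>();
            for (Map.Entry<Set<String>, State> e : dashed.entrySet()) {
                State s = e.getValue();
                if (!s.box && s.count >= minCount) { s.box = true; newBoxes.add(e.getKey()); }
            }
            // Start counting each immediate superset whose subsets are all boxes.
            for (Set<String> boxed : newBoxes) {
                for (String item : items) {
                    Set<String> sup = new TreeSet<>(boxed);
                    if (!sup.add(item) || dashed.containsKey(sup) || solid.containsKey(sup)) continue;
                    if (allSubsetsBoxed(sup, dashed, solid)) dashed.put(sup, new State());
                }
            }
            // Dashed -> solid once an itemset has been counted over all transactions.
            dashed.entrySet().removeIf(e -> {
                if (e.getValue().seen < n) return false;
                solid.put(e.getKey(), e.getValue());
                return true;
            });
        }
        Set<Set<String>> result = new LinkedHashSet<>();
        solid.forEach((k, s) -> { if (s.box && !k.isEmpty()) result.add(k); });
        return result;
    }

    static boolean allSubsetsBoxed(Set<String> sup, Map<Set<String>, State> dashed,
                                   Map<Set<String>, State> solid) {
        for (String item : sup) {
            Set<String> sub = new TreeSet<>(sup);
            sub.remove(item);
            State s = dashed.containsKey(sub) ? dashed.get(sub) : solid.get(sub);
            if (s == null || !s.box) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sample data with an interval of M = 2 transactions.
        List<Set<String>> db = List.of(
            Set.of("A", "B", "C"), Set.of("A", "B"), Set.of("A", "C"), Set.of("A"));
        System.out.println(frequentItemsets(db, 2, 2));
    }
}
```

Because each itemset records how many transactions it has been counted over, an itemset that starts at a stop point in the middle of the database still sees every transaction exactly once after the scan wraps around.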

FP-Growth

FP-Growth is an algorithm for generating frequent itemsets for association rules. It compresses a large database into a compact frequent-pattern tree (FP-tree) structure.

The FP-tree structure stores all the necessary information about the frequent itemsets in a database.

A frequent pattern tree (or FP-tree for short) is defined as follows:

1. The root is labeled “null” and has a set of item nodes as its children.

2. Each node consists of three fields: item-name (holds the frequent item), count (the number of transactions that share that node), and node-link (the next node in the FP-tree carrying the same item).

3. The frequent-item header table contains two fields: item-name and head of node-link (points to the first node in the FP-tree holding the item).
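The definition above maps onto a small data structure. The Java sketch below is illustrative only: the class and field names are hypothetical, the parent and children links are additions needed to navigate the tree, and each transaction is assumed to be already reduced to its frequent items in a fixed order.

```java
import java.util.*;

final class FpTreeSketch {
    // One tree node with the three fields from the definition, plus the
    // parent and children links used to walk the tree.
    static final class Node {
        final String itemName;   // null for the root
        int count;               // transactions sharing this node
        Node nodeLink;           // next node holding the same item
        final Node parent;
        final Map<String, Node> children = new HashMap<>();
        Node(String itemName, Node parent) { this.itemName = itemName; this.parent = parent; }
    }

    final Node root = new Node(null, null);
    // Header table: item-name -> first node in the tree holding that item.
    final Map<String, Node> header = new LinkedHashMap<>();

    // Inserts one transaction; shared prefixes reuse existing nodes.
    void insert(List<String> orderedItems) {
        Node current = root;
        for (String item : orderedItems) {
            Node child = current.children.get(item);
            if (child == null) {
                child = new Node(item, current);
                current.children.put(item, child);
                // Append the new node to the end of this item's node-link chain.
                Node head = header.get(item);
                if (head == null) {
                    header.put(item, child);
                } else {
                    while (head.nodeLink != null) head = head.nodeLink;
                    head.nodeLink = child;
                }
            }
            child.count++;
            current = child;
        }
    }

    // Total count of an item, obtained by following its node-links.
    int itemCount(String item) {
        int total = 0;
        for (Node n = header.get(item); n != null; n = n.nodeLink) total += n.count;
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical transactions, for illustration only.
        FpTreeSketch tree = new FpTreeSketch();
        tree.insert(List.of("f", "c", "a"));
        tree.insert(List.of("f", "c", "b"));
        tree.insert(List.of("c", "b"));
        System.out.println(tree.itemCount("c"));
    }
}
```

In this example the first two transactions share the path f, c, so only one node is created for each of those items on that path, and item c ends up with two nodes connected through the node-link.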

Use case diagram for the proposed system

[Figure: use case diagram. Labels: User, Apriori, Dynamic Itemset Counting, FP-Growth, Data Set File.]

Identifying classes from the above use cases

Architectural design

Architectural design is the division of software into subsystems and components, together with the process of deciding how these will be connected and how they will interact, including determining the interfaces.

[Figure: architecture diagram. Labels: GUI for selecting the file, support and algorithm; Apriori; Dynamic Itemset Counting; FP-Growth; Matrix Based Association.]

User interface design

The user interface is designed to display and obtain the needed information in an accessible, efficient manner. The user interface can employ one or more windows. Each window should serve a clear, specific purpose.

Step 1: Select the file name

Step 2: Display the contents of the file in the text area

Step 3: Enter a valid support

Step 4: Select the algorithm

Step 5: Display the frequent patterns for Apriori

Step 6: If the selected algorithm is DIC, enter the step length

Step 7: Display the frequent patterns for DIC

Step 8: Display the frequent patterns for FP-Growth

Step 9: Display the frequent patterns for MBA

The FPMiner tool is implemented in Java, and all the experiments were performed on a 1.7 GHz PC with 256 MB of memory. The operating system is Windows XP.

Experiment 1:

Execution times of the different algorithms for different support values are tabulated as follows:

RESULTS

Support   Execution time of Apriori   Execution time of DIC   Execution time of FP-Growth
50        187 ms                      226754 ms               94 ms
60        110 ms                      184297 ms               74 ms
70        78 ms                       161265 ms               46 ms
80        47 ms                       106953 ms               32 ms
90        32 ms                       74984 ms                31 ms

Experiment 2:

The number of frequent itemsets generated using different algorithms:

Support   Frequent itemsets generated
          Apriori   MBA
50        153       153
60        51        51
70        31        31
80        23        23
90        9         9

Frequent Pattern mining is used for finding frequent itemsets among items in a given data set.

The results show that:

• Apriori does not run as efficiently as FP-Growth.

• Apriori runs slowly because the dataset is dense: each transaction contains many items.

• DIC (Dynamic Itemset Counting) is much slower than every other algorithm on the real dataset.

• MBA is better than DIC but not much better than the other two in the case of the Mushroom dataset.

CONCLUSION

There are still many interesting research issues related to the extensions of frequent pattern mining, such as mining structured patterns by further development of these approaches, mining approximate or fault-tolerant patterns in noisy environments, frequent-pattern-based clustering and classification, and so on.

FUTURE ENHANCEMENT

FP Array Techniques

The FP Array technique greatly reduces the need to traverse FP-trees.

The FP Array technique obtains significantly improved performance compared with FP-tree-based algorithms.

The FP Array technique is used in new algorithms for finding all maximal and closed frequent itemsets.

The FP-tree is a compact data structure based on the following properties:

- Frequent pattern mining performs one scan of the database to determine the set of frequent items.

- The method stores each item in a compact structure, so more than two database scans are unnecessary.

- Each frequent item is located in the FP-tree, and each node holds an item and the count of that frequent item.

- The items of each transaction have to be sorted in descending order of frequency.
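The two database scans implied by these properties can be sketched in Java as follows. The first scan counts the items; the second drops infrequent items and sorts the rest in descending order of frequency, ready to be inserted into a tree. The class and method names are hypothetical, and breaking ties alphabetically is an assumption made only to keep the order deterministic.

```java
import java.util.*;

final class FpOrderingSketch {
    // Scan 1: count how many transactions contain each item.
    static Map<String, Integer> itemCounts(List<Set<String>> db) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : db)
            for (String item : t) counts.merge(item, 1, Integer::sum);
        return counts;
    }

    // Scan 2: keep only the frequent items of each transaction and sort them
    // by descending count (ties broken alphabetically).
    static List<List<String>> orderedTransactions(List<Set<String>> db, int minCount) {
        Map<String, Integer> counts = itemCounts(db);
        Comparator<String> byCountDesc = (a, b) -> {
            int c = Integer.compare(counts.get(b), counts.get(a));
            return c != 0 ? c : a.compareTo(b);
        };
        List<List<String>> ordered = new ArrayList<>();
        for (Set<String> t : db) {
            List<String> kept = new ArrayList<>();
            for (String item : t) if (counts.get(item) >= minCount) kept.add(item);
            kept.sort(byCountDesc);
            ordered.add(kept);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        List<Set<String>> db = List.of(
            Set.of("a", "b", "c"), Set.of("a", "b"), Set.of("a", "d"));
        System.out.println(orderedTransactions(db, 2));
    }
}
```

Sorting by descending frequency places the most common items nearest the root, which lets more transactions share the same prefix path and keeps the tree compact.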
