survey on ga and rules

Upload: harinima

Post on 02-Jun-2018


  • 8/10/2019 Survey on GA and Rules


    Genetic algorithm in Association Rule Mining : A Survey

    Introduction

    The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of stored data contains valuable hidden knowledge that may be used to improve the decision-making process of an organization. Thus, there is a clear need for (semi-)automatic methods for extracting knowledge from data. This need has led to the emergence of a field called data mining and knowledge discovery. Association rule mining is one such data mining task, and it involves frequent pattern mining.

    The aim of frequent pattern mining is to search for recurring relationships in a given data set, which enables us to discover various kinds of associations and correlations among different items. Let us formally define the problem. Let I = {i1, i2, i3, ..., in} be the set of all items. A k-itemset l, which consists of k items from I, is frequent if l occurs in a transaction database D not less than θ|D| times. Here θ is a user-specified parameter called the minimum support, and |D| is the total number of tuples in the database.

    In this paper the role of genetic algorithms in data mining, and specifically in association rule mining, is taken up for analysis. The effectiveness of the genetic algorithm is found to improve when the algorithm is modified. We present the application of the genetic algorithm to association rule mining, the variations made to the algorithm, and the results achieved.
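    The definition of a frequent k-itemset can be illustrated with a minimal brute-force sketch. The function names, the parameter name min_support, and the toy transactions are assumptions made for illustration; they do not come from any of the surveyed papers.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def frequent_itemsets(transactions, k, min_support):
    """All k-itemsets that occur in at least min_support * |D| transactions."""
    universe = sorted({item for t in transactions for item in t})
    threshold = min_support * len(transactions)
    return [c for c in combinations(universe, k)
            if support_count(c, transactions) >= threshold]


# Toy transaction database D with |D| = 4 (hypothetical data).
D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]
print(frequent_itemsets(D, 2, 0.5))
```

    Enumerating every candidate is exponential in the number of items, which is the cost that Apriori-style pruning and the GA-based searches surveyed here try to avoid.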

    GENETIC ALGORITHM AND DATA MINING

    A. The position of genetic algorithms in data mining

    Genetic algorithms play an important role in data mining technology because of their own characteristics and advantages, mainly in the following aspects:

    1) A genetic algorithm does not process the parameters themselves but the encoded individuals of the parameter set, so it can operate directly on sets, queues, matrices, charts, and other structures.
    2) It has better global search performance, which reduces the risk of being trapped in a local optimum. At the same time, a genetic algorithm is easy to parallelize.
    3) A standard genetic algorithm basically uses no knowledge of the search space or other supporting information. It uses a fitness function to evaluate individuals and carries out the genetic operations on that basis.
    4) A genetic algorithm does not adopt deterministic rules. It uses probabilistic transition rules to guide the search direction.

    The steps in a genetic algorithm are:

    1. Randomly select parents.
    2. Reproduce through crossover. Reproduction is the operator that chooses which individual entities will survive; in other words, some objective function or selection characteristic is needed to determine survival. Crossover relates to changes in future generations of entities.
    3. Select survivors for the next generation through a fitness function.
    4. Mutate: randomly selected attributes of randomly selected entities are changed in subsequent operations.
    5. Iterate until either a given fitness level is attained or the preset number of iterations is reached.
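    The five steps can be put together in a short sketch. It assumes bit-string individuals, one-point crossover, and elitist survivor selection, and it uses the number of 1 bits as a toy fitness; none of these choices is prescribed by the survey, and run_ga and its parameters are hypothetical names.

```python
import random


def run_ga(fitness, length, pop_size=20, generations=50,
           p_cross=0.8, p_mut=0.02, target=None, seed=0):
    """Minimal generational GA over bit strings (illustrative only)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # Step 1: randomly select two parents.
            a, b = rng.sample(pop, 2)
            # Step 2: reproduce through one-point crossover.
            if rng.random() < p_cross:
                cut = rng.randint(1, length - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # Step 4: flip randomly selected genes.
            for child in (a, b):
                children.append([1 - g if rng.random() < p_mut else g
                                 for g in child])
        # Step 3: the fittest individuals survive into the next generation.
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        # Step 5: stop early once the requested fitness level is attained.
        if target is not None and fitness(pop[0]) >= target:
            break
    return pop[0]


best = run_ga(sum, length=16, target=16)
print(best, sum(best))
```

    Keeping parents and children in one pool before truncation means the best fitness never decreases from one generation to the next.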

    a) Coding strategy and coding string length L: Because there are many parameters, multi-parameter coding can be used. The basic idea is to encode each parameter as a substring and then combine these substrings into a complete chromosome. For example, the gene string 18 | 36 | medium | good | man expresses the employee group aged 18 to 36 with medium income, good health condition, and male sex. Several codes are combined, so that 18 | 36 | medium | good | man is encoded as the string 18 | 36 | 01 | 00 | 1.

    b) Selection operator: The selection mechanism is the deterministic expected-value model. The integer part of each individual's expected value [expression missing] fixes the number of times that individual is selected. If a selected individual takes part in crossover, its expected survival value in the next generation is reduced by 0.5; otherwise it is reduced by 1. The decimal parts of the expected values are then sorted from large to small, and individuals are selected in that order until the population of size M is full. Such a mechanism overcomes the randomness of selection.
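    The integer-part and decimal-part rule in (b) can be sketched as follows. The expression for the expected value is not preserved in the text, so this sketch assumes the common choice e_i = M * f_i / sum(f); the function name is hypothetical, and the 0.5 and 1 decrements applied during crossover are not modeled.

```python
import math


def deterministic_sampling(fitnesses, pop_size=None):
    """Return the indices of the individuals selected for the next population.

    Assumption: the expected count of individual i is
    e_i = pop_size * f_i / sum(f).
    """
    n = pop_size or len(fitnesses)
    total = sum(fitnesses)
    expected = [n * f / total for f in fitnesses]

    # The integer part fixes how many times each individual is selected.
    selected = []
    for i, e in enumerate(expected):
        selected.extend([i] * math.floor(e))

    # Remaining slots are filled in order of decreasing decimal part.
    by_fraction = sorted(range(len(expected)),
                         key=lambda i: expected[i] - math.floor(expected[i]),
                         reverse=True)
    selected.extend(by_fraction[: n - len(selected)])
    return selected


print(deterministic_sampling([4, 3, 2, 1]))
```

    Because no random number is drawn, the same fitness values always produce the same selection, which is the sense in which the mechanism removes randomness.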

    c) Crossover operator: Because multi-parameter coding is used, two-point crossover is adopted, taking the characteristics of the string code into account.
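    The coding strategy in (a) and the crossover in (c) can be sketched together. Only the codes medium = 01, good = 00, and man = 1 appear in the text; every other table entry below is a hypothetical value added so that the example runs, and the crossover is assumed to cut at substring boundaries.

```python
import random

# Code tables. Only "medium", "good" and "man" come from the text;
# the remaining entries are hypothetical.
INCOME = {"low": "00", "medium": "01", "high": "10"}
HEALTH = {"good": "00", "fair": "01", "poor": "10"}
SEX = {"man": "1", "woman": "0"}


def encode(min_age, max_age, income, health, sex):
    """Encode each parameter as a substring and join them into a chromosome."""
    return [str(min_age), str(max_age), INCOME[income], HEALTH[health], SEX[sex]]


def two_point_crossover(a, b, rng=random):
    """Swap the substrings lying between two distinct cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


p = encode(18, 36, "medium", "good", "man")
q = encode(25, 40, "high", "fair", "woman")
print("|".join(p))
print(two_point_crossover(p, q, random.Random(0)))
```

    Cutting only between substrings keeps every parameter code intact, so a child can never contain a half-copied age or income code.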

    d) Mutation operator: The basic mutation operator is adopted, mutating the age gene locus by a random integer below 5.

    e) Group size M: When M is small, the evolution speed of the genetic algorithm improves, but the diversity of the group decreases and the algorithm may converge prematurely. When M is large, the evolution speed decreases. Considering both effects, a value of M between 20 and 100 is appropriate.

    f) Fitness function f(x): The best employee group is the one that obtains the highest number of excellent ratings in the comprehensive evaluation under the same age condition, and the ultimate aim is to find young and excellent employees. A restrictive condition is added: the minimum age of the employee group must be less than the maximum age. The objective function can be set to

    [equation missing]

    Here t(x) is the number of excellent comprehensive evaluations of the employees covered by gene string x; T is the total number of excellent comprehensive evaluations of all employees; i(x) is the age span of string x. Generally speaking, the selection intensity should be slightly lower in the initial stage of genetic optimization, so that the population is not dominated by one or a few individuals with higher fitness. In the later stage, the differences between individuals are relatively small and little further improvement is possible, so the selection intensity must be raised for the genetic algorithm to converge to a better solution. The fitness function is therefore designed as

    [equation missing]

    g) Crossover probability Pc: The crossover probability Pc controls the frequency of the exchange operation. A high Pc explores a larger part of the solution space and so reduces the probability of staying at a non-optimal solution, but a large Pc also wastes much time searching unnecessary parts of the solution space. To this end, an adaptive Pc can be used.

    [equation missing]

    Here x is the larger fitness of the two individuals taking part in crossover, f_max is the largest fitness in the group, and f_avg is the average fitness of the group.

    h) Mutation probability Pm: The mutation probability Pm controls the proportion of new genes introduced into the population. If it is too low, some useful genes never enter the selection; if it is too high, there is too much random change and later generations may lose the good characteristics inherited from both parents. To this end, an adaptive Pm can be used, as in (4).

    [equation missing]

    Here y is the fitness of the individual in a particular mutation operation, f_max is the largest fitness in the group, and f_avg is the average fitness of the group.
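    The adaptive formulas for the crossover and mutation probabilities are not preserved in the text. As an illustration only, the sketch below assumes the widely used form in which the probability shrinks linearly as the individual's fitness approaches the best fitness in the group; the constants k1 to k4 and all function names are assumptions and may differ from the surveyed paper.

```python
def adaptive_probability(f, f_max, f_avg, k_high, k_low):
    """Assumed adaptive rule: below-average individuals get the constant
    k_low; the others get a value that falls to 0 as f approaches f_max."""
    if f < f_avg or f_max == f_avg:
        return k_low
    return k_high * (f_max - f) / (f_max - f_avg)


def adaptive_pc(x, f_max, f_avg, k1=1.0, k3=1.0):
    """Crossover probability; x is the larger fitness of the two parents."""
    return adaptive_probability(x, f_max, f_avg, k1, k3)


def adaptive_pm(y, f_max, f_avg, k2=0.5, k4=0.5):
    """Mutation probability; y is the fitness of the individual to mutate."""
    return adaptive_probability(y, f_max, f_avg, k2, k4)


print(adaptive_pc(10, f_max=10, f_avg=5))   # best pair is left intact
print(adaptive_pm(7.5, f_max=10, f_avg=5))
```

    Under this assumption the best individuals are protected from disruption while below-average individuals are recombined and mutated at the full rate.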

    i) Termination: When the genetic algorithm runs to the difference (| (f1 - f2) / f1 |


    [Figures: Basic Structure of GRA; Genotype Expression of GRA]

    The table describes the gene of node i; the set of these genes represents the genotype of a GRA individual. ID_i is an identification number: for example, ID_i = 1 means that node i has directed branches to other nodes, while ID_i = 2 means that node i has undirected branches to other nodes. F_i denotes the function of node i. C_i1, C_i2, ..., C_ik denote the first, second, ..., k-th nodes connected from node i, and S_i1, S_i2, ..., S_ik denote the strength from node i to nodes C_i1, C_i2, ..., C_ik, or the strength between node i and those nodes, depending on the arguments of node i.

    In order to find really important class association rules, the functions of the nodes in GRA should be changed. GRA genetic operations can do this effectively, because mutation and crossover change the connections or contents of the nodes. Three kinds of genetic operators are used: crossover, mutation-1 (change the connections of nodes), and mutation-2 (change the functions of nodes). The algorithm is depicted in the flow chart shown below.
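    The genotype and the two mutation operators described above can be sketched with a small data structure. The class name, field names, and the way replacement values are drawn are assumptions made for illustration; the surveyed paper may implement them differently.

```python
import random
from dataclasses import dataclass


@dataclass
class NodeGene:
    """Gene of node i: (ID_i, F_i, C_i1..C_ik, S_i1..S_ik)."""
    id_type: int        # ID_i, the branch type of the node
    function: int       # F_i, index of the node function
    connections: list   # C_i1..C_ik, indices of the connected nodes
    strengths: list     # S_i1..S_ik, strength of each connection


def mutation_1(genotype, i, rng):
    """Mutation-1: reconnect one branch of node i to a different node."""
    gene = genotype[i]
    k = rng.randrange(len(gene.connections))
    choices = [n for n in range(len(genotype))
               if n != i and n != gene.connections[k]]
    gene.connections[k] = rng.choice(choices)


def mutation_2(genotype, i, num_functions, rng):
    """Mutation-2: replace the function of node i with a different one."""
    gene = genotype[i]
    gene.function = rng.choice([f for f in range(num_functions)
                                if f != gene.function])


rng = random.Random(0)
genotype = [NodeGene(1, 0, [(i + 1) % 4], [0.5]) for i in range(4)]
mutation_1(genotype, 0, rng)
mutation_2(genotype, 0, 3, rng)
print(genotype[0])
```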


    Two datasets from the UCI ML Repository, Lymphography and Vehicle, were used in the experiments. The experimental results show that when the reduction rate is small, GRA achieves accuracy comparable to that of the large rule set (that is, 100% of the rules), especially with partial matching. Furthermore, the accuracy does not change drastically compared with the accuracy of the large rule set.

    [Figures: Accuracy on Dataset Lymphography; Accuracy on Dataset Vehicle]

    The classification accuracy compared with other methods is tabulated in the table below.

    Comparison of Classification Accuracy with other Methods

    In [2] the objective was to optimize the rules generated by association rule mining (Apriori method) using genetic algorithms. The algorithm, with its modifications, is as follows:

    1. Individuals are represented using the Michigan approach, i.e. each individual encodes a single rule.
    2. The rule antecedent is represented using binary encoding.
    3. Genetic operators.
    4. For selection, the roulette wheel sampling procedure is used.
    5. Fitness function:

       Confidence Factor, CF = TP / (TP + FP)
       Comp = TP / (TP + FN)
       Fitness = CF x Comp
       Fitness = w1 x (CF x Comp) + w2 x Simp

    where TP is True Positive, FP is False Positive, FN is False Negative, Simp is a measure of rule simplicity (normalized to take on values in the range 0..1), and w1 and w2 are user-defined weights.

    When tested on a synthetic database with the parameters listed in the table, the algorithm produced some rules with negations in the attributes, as predicted and desired.
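    The two fitness variants can be computed directly from the confusion counts of a rule. The function name, the zero-division guards, and the example numbers are assumptions added for illustration.

```python
def rule_fitness(tp, fp, fn, simp=None, w1=1.0, w2=0.0):
    """Fitness of a rule from its true/false positives and false negatives.

    With simp=None this returns CF x Comp; otherwise it returns the
    weighted form w1 x (CF x Comp) + w2 x Simp.
    """
    cf = tp / (tp + fp) if tp + fp else 0.0     # confidence factor
    comp = tp / (tp + fn) if tp + fn else 0.0   # Comp term
    if simp is None:
        return cf * comp
    return w1 * (cf * comp) + w2 * simp


# Hypothetical rule: 8 true positives, 2 false positives, 2 false negatives.
print(rule_fitness(8, 2, 2))
print(rule_fitness(8, 2, 2, simp=0.5, w1=0.7, w2=0.3))
```

    Multiplying CF by Comp penalizes rules that are accurate but cover few examples as well as rules that cover many examples inaccurately.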

    GA Parameters

    Genetic Algorithm from Application Perspective

    The genetic algorithm, when made adaptable, can be applied to various application areas together with data mining techniques. The areas taken up for analysis are tabulated below.

    Areas of Application           Genetic Algorithm
    Finance                        Evolution Strategy Excellence in GA [3]
    Watermarking                   Data Mining Genetic Algorithm [4]
    MIS                            Mining Association Rules with GA [5]
    Medicine                       Constraint Based GA [6]
    Image Database                 Novel GA [7]
    Students Information System    GA through Adapted Mutation [8]
    Daily Records from API         Dynamic Immune Evolution to GA [9]
    Car Test Data                  Implicating and Optimized Rules [10]
    Software Engineering           Defective Module implication [11]

    In [3] the strengths of the evolution strategy are applied to the evolutionary process of the genetic algorithm. The optimized genetic algorithm is then used for mining association rules. The shortcomings of the traditional GA are overcome with these modifications.

    Improved genetic algorithm

    The genetic algorithm based on the evolution strategy is improved as follows. First, the degree of dissimilarity of the individuals in the population is evaluated after each generation has evolved. The degree of dissimilarity of two random individuals in the population is as follows.

    [equation missing]

    In the formula, l is the length of the gene chain, a_j is the j-th gene in the gene chain of individual a, and b_j is the j-th gene in the gene chain of individual b. Over the whole population, the degree of dissimilarity is as follows.

    [equation missing]

    In the formula, P is the population size. The crossover probability and mutation probability are set as follows.

    [equation missing]

    K is the ratio of the population's degree of dissimilarity in the last generation to that in the current generation.

    In the formulas, Pc' and Pm' are the crossover probability and mutation probability of the last generation, and Pc and Pm are the crossover probability and mutation probability of the current generation. In this way, the evolution of the current generation is based on the last generation. The original population retains the excellent individuals of the last generation. Otherwise, some new individuals are randomly produced, and the crossover probability and mutation probability are set anew. This can enhance the diversity of the population.
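    The dissimilarity formulas are not preserved in the text. The sketch below assumes the most direct reading of the symbol definitions: the dissimilarity of two individuals is the fraction of gene positions at which they differ, and the dissimilarity of the whole group is the average over all pairs. Both function names and both formulas are assumptions, not the formulas of [3].

```python
from itertools import combinations


def dissimilarity(a, b):
    """Assumed form: fraction of the l gene positions where a_j != b_j."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def population_dissimilarity(pop):
    """Assumed form: mean pairwise dissimilarity over a group of P individuals."""
    pairs = list(combinations(pop, 2))
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)


print(dissimilarity([0, 1, 1, 0], [0, 0, 1, 1]))
print(population_dissimilarity([[0, 0], [1, 1], [0, 1]]))
```

    A value near 0 signals that the individuals have become nearly identical, which is the situation in which raising the crossover and mutation probabilities helps restore diversity.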

    When the algorithm was applied to 2050 groups of finance data from a certain city, the following observations were made. The partial association rules shown in the table were obtained after 252 generations, whereas the traditional GA required 850 generations.

    The improved algorithm was therefore faster, and it could also be applied to other domains.


    In [4] a data mining-based GA is presented to efficiently improve the traditional GA (TGA). The flowchart of the algorithm is depicted below.

    Algorithm of the data mining-based GA:

    1. Set up the environment parameters. Initialize the support and confidence arrays; set the DNA pool to be empty. (The support and confidence arrays are introduced in Section 3.1 of the original paper.)
    2. Evaluate all of the chromosomes based on the fitness function. Record the important gene information for each high-quality chromosome by updating the support and confidence arrays.
    3. Recombine new chromosomes based on the traditional GA operations.
    4. Recombine new chromosomes based on the data mining-based GA operation.
       Type 1: Randomly select some chromosomes obtained from step 3, and then perform the new GA operation, DNA implantation, to generate new chromosomes.
       Type 2: Randomly select some chromosomes obtained from step 3, and then disable the genes of a chromosome if those genes appear in the DNA pool.
    5. Repeat steps 2 to 4 until either of the following two conditions is reached.
       Condition 1 (Obtain the optimum solution): The predefined condition is satisfied, i.e. the obtained solution meets our expectation, or a constant number of iterations has been performed.
       Condition 2 (Fall into a DNA trap): The obtained solution does not satisfy our requirements, but there is no improvement after a constant number of new evolutionary generations. Note that this constant number of iterations is much smaller than the one in condition 1. Collect all the important genes based on the support and the confidence to form a new DNA, put the DNA into the DNA pool, and reset the support and confidence arrays. Then go to step 2.

    When implemented on the watermarking problem, the above algorithm showed improved performance over the traditional GA, as is clear from the graph.

    [Figure legend: Traditional GA; Data mining GA]

    In the process of data mining the MIS database [5], some of the history records can be regarded as a population of individuals, each record as a corresponding individual, and the fields representing the properties of the table as the genes. The correspondence between the items and a table is shown in the figure.

    Correspondence of Item and a Table.

    The basic genetic algorithm is divided into six steps:

    1. Encode.
    2. Select the initial population of individuals and compute from it the excellent gene list set A1; select the second population of individuals and compute from it the excellent gene list set B1.
    3. Save the excellent gene list sets above to the result set R.
    4. Mutate A1 and B1, cross the genes of A1 and B1, and generate a new set A2; then select the third population of individuals and compute from it the excellent gene list set B2.
    5. Repeat steps 2 to 3 until there is no new excellent gene list.
    6. Decode the result set R and generate the knowledge of association rules.

    When the algorithm was applied to an MIS repository on an IBM Netfinity (Pentium 1G, 512M RAM), GA-based association rule mining was more efficient than the Apriori algorithm. The time used by the Apriori algorithm increases sharply as the amount of data increases. With the GA-based method the precision decreased a little, while the efficiency increased a lot.

    "A Constraint-Based Genetic Algorithm Approach for Mining Classification Rules" [6] proposes a constraint-based genetic algorithm (CBGA) approach to reveal more accurate and significant classification rules. It describes a rule induction system that consists of three modules: the user interface, the symbol manager, and the constraint-based GA (CBGA). According to the figure, the user interface module allows users to execute the following system operations: loading a constraint program; adding or retracting constraints; controlling the GA's parameter settings; and monitoring the best solutions. Interesting knowledge or given constraints can be issued either by domain experts or by other metaknowledge mechanisms.


    In order to introduce the details of the proposed CBGA approach, a synthetic medical data set of patients' information is used for illustration. This data set includes the following attributes: age, sex, blood pressure (BP), the status of cholesterol (Cho), the values of Na and K, and the quantity (Qty) and frequency (Freq) of taking the drug. The prediction attribute is one of five drug types: Drug A, Drug B, Drug C, Drug D, and Drug E.

    In comparison with a regular GA, CBGA achieves higher classification accuracy rates in rule induction for both UCI data sets. In addition, the rule sets discovered by CBGA not only have higher predictive accuracy but also carry more significant knowledge in accordance with the users' preferences.

    In [7], "A Novel Genetic Algorithm Based on Image Databases for Mining Association Rules" is proposed, using a novel spatial mining algorithm called ARMNGA (Association Rules Mining in Novel Genetic Algorithm).

    Association rule mining based on the novel genetic algorithm is carried out as follows.

    Encoding
    Natural numbers are used to encode the variable Aij. That is, for every column of the matrix A, the number of the row in which the element 1 exists is regarded as a gene. The genes are independent of each other. They are marked by A1, A2, ..., Aj, ..., An, in which Aj ∈ [1, m], j ∈ [1, m], and An may be a repeatedly equal natural number.

    The Fitness
    [equation missing]

    Here, Wc + Ws = 1, Wc ≥ 0, Ws ≥ 0, S_min is the minimum support, and C_min is the minimum confidence.

    Reproduction Operator
    The roulette selection strategy is adopted; each individual's reproduction probability is proportional to its fitness value.
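    Roulette selection can be sketched as a draw on the cumulative fitness scale. The function name and the toy population are assumptions for illustration, and the sketch requires non-negative fitness values with a positive sum.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    cumulative = list(accumulate(fitnesses))
    r = rng.random() * cumulative[-1]
    # The first cumulative sum strictly greater than r marks the chosen slot.
    index = min(bisect_right(cumulative, r), len(population) - 1)
    return population[index]


rng = random.Random(1)
picks = [roulette_select(["a", "b", "c"], [0.0, 1.0, 3.0], rng)
         for _ in range(2000)]
print({name: picks.count(name) for name in "abc"})
```

    An individual with zero fitness occupies no space on the wheel and is never selected, while the others are chosen in proportion to their share of the total fitness.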

    Mutation Operator
    The selection of the mutation probability is the vital point because it influences the action and performance of the ARMNGA. If it is too large, the ARMNGA will become a pure random search.

    [equation missing]

    Here, pm1 = 0.1, pm2 = 0.001, f_max(X) is the maximum fitness value of the population, and f̄(X) is the average fitness value of the population.


    When the algorithm was implemented on an image database, the following observations were made.

    For runtime versus minimum support, with the minimum support varying from 0.25% to 2% on the synthetic dataset, the proposed algorithm runs 2 to 5 times faster than the Apriori algorithm, because a large number of candidates can be pruned by the ARMNGA pruning strategy.

    For runtime versus the average size of transactions, with the average size varying from 4 to 14 on the synthetic dataset, it can be deduced that ARMNGA has a higher convergence speed and a more reasonable selection scheme, which guarantees that the performance of the optimal solution does not degrade.

    The genetic algorithm used to mine association rules can be improved by adopting an adaptive mutation rate and by improving the method of individual selection [8].

    Here an adaptive mutation rate is used in the early stages of evolution; the mutation rate is given by

    [equation missing]


    A phase-out method that improves selection is applied in the latter part of the genetic algorithm:

    1) Sort the individuals by fitness.
    2) The first quarter of the individuals is copied twice and the second quarter (1/4 to 2/4) is copied once; these enter the next round of selection.
    3) The third quarter (2/4 to 3/4) of the individuals is retained and enters the next round of selection.
    4) The last quarter (3/4 to 4/4) of the individuals is phased out and does not enter the next round of selection.
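    The four rules can be sketched as a single function. The sketch assumes that the rules mean two copies of the first quarter and one copy each of the second and third quarters, and that the population size is divisible by 4; the function name is hypothetical.

```python
def phase_out_selection(population, fitness):
    """Quarter-based selection: duplicate the best, drop the worst."""
    ranked = sorted(population, key=fitness, reverse=True)   # rule 1
    q = len(ranked) // 4
    top, second, third = ranked[:q], ranked[q:2 * q], ranked[2 * q:3 * q]
    # Rules 2 to 4: two copies of the first quarter, one copy each of the
    # second and third quarters; the last quarter is discarded.
    return top + top + second + third


print(phase_out_selection(list(range(8)), fitness=lambda x: x))
```

    Under this reading the population size is preserved, because the extra copy of the first quarter exactly replaces the quarter that is phased out.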

    When applied to a database of student achievement in schools in recent years, the new algorithm reduces the number of unnecessary operations, streamlines the generation of frequent itemsets, and improves efficiency compared with the Apriori algorithm, as shown in the graph.

    In [9] an IOGA (Immune Optimization based Genetic Algorithm) approach for incremental association rules is proposed. Dynamic immune evolution and the biomimetic mechanisms of Engineering Immune Computing (EIC), namely immune recognition, immune memory, and immune regulation, are introduced into the GA.

    Immune recognition is critical in the immune system. Its essence is to distinguish self from non-self, which can be evaluated by the affinity between antibodies and antigens.

    The experimental data set comes from a company's daily records of the APIs (of the local computer operating system) that were called by outside files from the network, together with the results of whether the files led to a computer virus.

    "A Method for Finding Implicating Rules Based on the Genetic Algorithm" [10] is applied to car test results with the following algorithm.

    Algorithm GAFIR

    Input: database D, threshold of the strength of implication, maximum number of generations GEN, population size N, crossover probability Pc, mutation probability Pm
    Output: rule set (RS)

    Procedure GAFIR
    1. L0 = Initial(D, N);
    2. TR = GetRules(M, threshold);
    3. For i = 1 to GEN
    4. Begin
    5.     C = Crossover(Li-1, Pc);
    6.     Li = Mutation(C, Pm);
    7.     TR = TR ∪ GetRules(Li, threshold);
    8. End
    9. RS = TR;
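    The control flow of the GAFIR procedure can be sketched with the four operators passed in as functions. The operators themselves are not specified in the text, so the toy versions used in the example are purely hypothetical stand-ins, and the sketch assumes that rules are first extracted from the initial population L0.

```python
def gafir(database, threshold, gen, n, p_c, p_m,
          initial, get_rules, crossover, mutation):
    """Accumulate the rules found in every generation into one rule set."""
    population = initial(database, n)                       # L0
    rules = set(get_rules(population, threshold))           # TR
    for _ in range(gen):                                    # i = 1 .. GEN
        candidates = crossover(population, p_c)             # C
        population = mutation(candidates, p_m)              # Li
        rules |= set(get_rules(population, threshold))      # TR = TR ∪ ...
    return rules                                            # RS


# Toy stand-ins (hypothetical): individuals are integers, a "rule" is any
# individual at or above the threshold, and mutation adds 1 to everyone.
rule_set = gafir(
    database=[1, 2, 3], threshold=3, gen=2, n=3, p_c=0.8, p_m=0.1,
    initial=lambda d, n: d[:n],
    get_rules=lambda pop, th: [x for x in pop if x >= th],
    crossover=lambda pop, pc: pop,
    mutation=lambda pop, pm: [x + 1 for x in pop],
)
print(sorted(rule_set))
```

    Because the rule set is a union over generations, a rule discovered early is kept even if the individuals encoding it later disappear from the population.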

    When tested on car test data, the number of interesting rules levels off after about 400 generations. Generations 1 to 200 form the phase in which interesting implication rules are discovered frequently. The discovery rate then levels off, and by about 700 generations nearly all the interesting rules have been discovered. The greater the fitness threshold, the fewer interesting implication rules are extracted; conversely, the smaller the fitness threshold, the more interesting implication rules are extracted.

    In the area of software engineering, "Searching for Rules to Find Defective Modules in Unbalanced Data Sets" [11] is proposed to find defective modules. Feature selection (attribute selection) is applied so as to work only with those attributes of the data sets that are capable of predicting defective modules. With the reduced data set, a genetic algorithm is used to search for rules characterizing modules with a high probability of being defective.

    For the given data sets, feature selection was a necessary step to reduce the data, and a genetic algorithm was then used as a subgroup discovery technique to generate rules covering only defective modules. The results showed that, in general, the data sets are not very homogeneous either in the features (attributes) selected in each data set or in the rules generated. The results, however, provide some points for further research.

    Mining Large Data Sets

    When the data size becomes too large, a better solution is an efficient distributed genetic algorithm for extracting classification rules, based on a new method of dynamic data distribution that applies parallelism over networks of computers in order to mine large datasets.

    The model is as shown

    Distributed Model


    EDGAR uses a local GA in each node, with some communication with neighboring nodes to exchange individuals and poorly covered examples. The algorithm:

    Generate initial population using seeding
    While (Stop Criteria)
        For a number of generations
            Select g individuals by US
            For each individual
                If % Perform recombination
                If % Perform mutation
            end
            replace g individuals from population
            Exchange individuals
            Exchange training examples
        end
    end
    Extract set of rules by greedy algorithm
    Send set of rules to Central Pool
    If (not improving) reduce training data
    end

    For the experimental study, a well-known problem, Nursery, was chosen from the UCI repository. This dataset has 12,960 instances, which is big enough to test data distribution. Nursery is a complex dataset with 6 characteristics and 5 unbalanced classes, three of which account for more than 97% of the dataset. The results observed were:

    The execution time of the proposed algorithm shows a considerable speedup and better behavior than the compared algorithm as the number of processors increases.

    Classification accuracy is similar in both algorithms and does not follow any tendency relative to the number of processors.

    The number of rules generated is between 60% and 80% smaller in EDGAR.

    Conclusion

    Compared with other association rule generation methods in data mining, the genetic algorithm produces better accuracy and higher efficiency, and its robustness was found to be sound. Its speed is also higher than that of the other methods.

    The pitfalls of the genetic algorithm are overcome by modifying the algorithm, and it is found to be versatile, which enables it to be applied to any dataset.