4 ecology ga regression

Upload: ngo-bich

Post on 07-Apr-2018


  • 8/6/2019 4 Ecology GA Regression

    1/39

    All Subset Regression Using a

    Genetic Algorithm

    Olcay Akman

    Department of Mathematics

    Illinois State University


    OVERVIEW OF GENETIC ALGORITHMS

    Genetic algorithms (GAs) belong to a group of optimization techniques collectively called evolutionary computation.

    They are robust but simple search techniques inspired by the mechanics of natural selection and genetics.


    Ask not what mathematics can do for biology, but ask what biology can do for mathematics.

    Biological evolution has produced organisms capable of living in almost every available landscape, so why don't we take a tip from nature and exploit evolution to do optimization?


    OVERVIEW OF GENETIC ALGORITHMS

    String data structures are used to represent sets of possible problem solutions, where each location in the string contains a character (gene) identifying the state of a particular process variable.

    This is analogous to a chromosomal structure, in which genes occupy fixed locations.


    OVERVIEW OF GENETIC ALGORITHMS

    These string structures experience alterations (mutations) throughout an iterative search process, analogous to biological structures during an evolutionary process.

    Over many generations, stronger strings (individuals) appear and merge with the population.

    These eventually dominate the population through the string selection, breeding and replacement process (natural selection).


    OVERVIEW OF GENETIC ALGORITHMS

    As such, over several generations, the population of strings will experience successive incremental improvements, eventually stabilizing as the best strings emerge.


    COMPONENTS OF GA

    Population

    Mating Pool

    New Offspring

    Selection (Darwinian selection operation)

    Mating and Mutation

    Evaluation


    PSEUDO-CODE

    Randomly generate chromosomes

    While t


    THE BASICS: SELECTION

    There are several methods for choosing which chromosomes will contribute to the next generation. Among the most popular are:

    Proportional (Roulette Wheel) Selection

    Tournament Selection

    Ranking Selection
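
The three selection schemes named on this slide can be sketched roughly as follows. This is an illustrative sketch, not the presenter's code: the function names are hypothetical, fitness is assumed to be non-negative with larger values better, and the rank weighting (worst = 1, best = N) is only one common choice.

```python
import random


def roulette_wheel_select(population, fitness, rng=random):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitness)
    spin = rng.uniform(0, total)
    running = 0.0
    for chromosome, fit in zip(population, fitness):
        running += fit
        if spin < running:
            return chromosome
    return population[-1]  # guard against floating-point round-off


def tournament_select(population, fitness, size=2, rng=random):
    """Draw `size` distinct competitors at random and return the fittest."""
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


def ranking_select(population, fitness, rng=random):
    """Select with probability proportional to rank (assumed: worst = 1, best = N)."""
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    ranks = [0] * len(population)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return roulette_wheel_select(population, ranks, rng)
```

Roulette wheel selection needs the fitness of every member, whereas a tournament only needs the fitness of the few members drawn into it.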


    THE BASICS: MATING

    Mating is arguably the most important part of the algorithm, since it creates new combinations (implicit parallelism).

    Mating of chromosomes is analogous to the biological crossover that diploid organisms use to create new combinations of genes.

    Crossover is repeated until every chromosome in the mating pool has been mated with another chromosome.


    THE BASICS: CROSSOVER

    First, the number of crossover points is chosen. Suppose the number of points is 1 and the two chromosomes are 01001110 and 11101011.

    The point of crossover is chosen at random from [1, l-1], where l denotes the length of the string.

    Parent 1: 01001110  ->  010|01110

    Parent 2: 11101011  ->  111|01011

    Child 1: 11101110

    Child 2: 01001011
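
The single-point crossover on this slide can be written as a short function. This is a minimal sketch: `one_point_crossover` and its `point` argument are names chosen here for illustration, and chromosomes are assumed to be strings of 0s and 1s.

```python
import random


def one_point_crossover(parent1, parent2, point=None, rng=random):
    """Exchange the heads of two equal-length bit strings at one cut point."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    length = len(parent1)
    if point is None:
        # Cut point drawn uniformly from [1, l-1], as on the slide.
        point = rng.randint(1, length - 1)
    child1 = parent2[:point] + parent1[point:]
    child2 = parent1[:point] + parent2[point:]
    return child1, child2


# The slide's example: cut after the third bit.
print(one_point_crossover("01001110", "11101011", point=3))
# ('11101110', '01001011')
```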


    THE BASICS: MUTATION

    Again analogous to biological mutation, in which a nucleotide base of the DNA changes, mutation changes a 0 to a 1 or a 1 to a 0.

    With a certain probability (p_m), each 0 or 1 has a chance of being flipped. Typically this probability is low.

    A general rule is p_m = 1/l, so that on average there is one mutation per chromosome.

    Before: 01001011 (the fifth bit, a 1 shown in red on the slide, is chosen for mutation)

    After:  01000011
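
Bit-flip mutation with the 1/l rule of thumb might look like the sketch below. The function name `mutate` and the default handling of `p_m` are assumptions made for illustration.

```python
import random


def mutate(chromosome, p_m=None, rng=random):
    """Flip each bit independently with probability p_m (default 1/l)."""
    if p_m is None:
        p_m = 1.0 / len(chromosome)  # about one flip per chromosome on average
    mutated = []
    for bit in chromosome:
        if rng.random() < p_m:
            mutated.append("1" if bit == "0" else "0")
        else:
            mutated.append(bit)
    return "".join(mutated)
```

With `p_m=0.0` the chromosome is returned unchanged, and with `p_m=1.0` every bit is flipped, which gives two easy sanity checks.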


    Mating-Offspring

    For every two parents, create two children via crossover and mutation.

    Crossover: Split the two parent chromosomes in two at the same location. Exchange tails.

    Parent 1: 1 1 1 1 | 1 1 Child 1: 1 1 1 1 0 0

    Parent 2: 0 0 0 0 | 0 0 Child 2: 0 0 0 0 1 1

    Parent 1: 1 0 1 | 1 0 1 Child 1: 1 0 1 1 0 0

    Parent 2: 0 0 1 | 1 0 0 Child 2: 0 0 1 1 0 1


    Crossover

    Crossover can produce children that are radically

    different from their parents.

    Parent 1: 1 1 1 1 | 1 1 Child 1: 1 1 1 1 0 0

    Parent 2: 0 0 0 0 | 0 0 Child 2: 0 0 0 0 1 1


    Crossover

    Crossover will not introduce differences at a bit position where both parents have the same value.

    Parent 1: 1 0 1 | 1 0 1 Child 1: 1 0 1 1 0 0

    Parent 2: 0 0 1 | 1 0 0 Child 2: 0 0 1 1 0 1

    An extreme instance occurs when both parents are identical. In such cases crossover can introduce no diversity in the children, which makes mutation a vital component of a GA.


    Offspring-Mutation

    For every two parents, create two children via crossover and mutation.

    Mutation: Randomly mutate every bit in every offspring.

    Old Chromosome   Random Numbers        New Bit   New Chromosome
    1 0 1 0          .801 .102 .266 .373   -         1 0 1 0
    1 1 0 0          .120 .096 .005 .840   0*        1 1 0 0
    0 0 1 0          .760 .473 .894 .001   1         0 0 1 1

    *: Randomly generated bit is the same as the original bit.


    Generation

    Delete all:

    Repeat the reproduction process until you obtain 100

    offspring.

    Kill the original population.

    Start anew with the second generation.

    Stop at (say) 50 generations.
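
Put together, the "delete all" generational scheme might be sketched as below. This is an illustration rather than the presenter's implementation: `run_ga`, the binary tournament used to pick parents, and the tracking of the best string seen so far are assumptions, and fitness is assumed to be maximized.

```python
import random


def flip(bit):
    """Return the opposite bit."""
    return "1" if bit == "0" else "0"


def run_ga(fitness, length, pop_size=100, generations=50, p_m=None, seed=0):
    """Generational GA: breed pop_size offspring, then discard the parents."""
    rng = random.Random(seed)
    if p_m is None:
        p_m = 1.0 / length
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Binary tournament (an assumed choice) picks each parent.
            parent1 = max(rng.sample(population, 2), key=fitness)
            parent2 = max(rng.sample(population, 2), key=fitness)
            point = rng.randint(1, length - 1)
            children = (parent1[:point] + parent2[point:],
                        parent2[:point] + parent1[point:])
            for child in children:
                mutated = "".join(flip(b) if rng.random() < p_m else b
                                  for b in child)
                offspring.append(mutated)
        population = offspring[:pop_size]  # kill the original population
        best = max(population + [best], key=fitness)  # bookkeeping only
    return best


# Toy fitness (hypothetical): count the 1s in the string.
print(run_ga(lambda c: c.count("1"), length=20))
```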



    The Process

    The algorithm begins by creating a random

    initial population, as shown in the figure.


    USING GA FOR SUBSET

    REGRESSION MODEL

    SELECTION


    MULTIPLE LINEAR REGRESSION

    Real-world mathematical models can have many variables to explain a single response variable.

    Determining the correct set of variables can be difficult.

    Methods such as stepwise, backward and forward selection exist; however, they may not produce the optimal set.


    MULTIPLE LINEAR REGRESSION

    General Model:

    $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$

    For the general model there are $2^{k+1} - 1$ different possible sets for k variables.

    [Figure: Number of Possible Models for Multiple Regression. x-axis: number of independent variables (1 to 15); y-axis: number of models (0 to 140,000).]
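
To get a feel for how fast the model count grows, the slide's count can be evaluated for a few values of k. The sketch reads the count as 2^(k+1) - 1, that is, every non-empty subset of k regressors plus an optional intercept; `n_models` is a name chosen here.

```python
def n_models(k):
    """Number of non-empty subsets of k regressors plus an optional intercept."""
    return 2 ** (k + 1) - 1


for k in (5, 10, 15, 16):
    print(k, n_models(k))
# 5 63
# 10 2047
# 15 65535
# 16 131071
```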


    MULTIPLE REGRESSION AND

    GENETIC ALGORITHMS

    The first step in using a GA is to choose an encoding that takes advantage of the genetic operators.

    Binary encoding is used, with the following rule:

    If the ith position is 0, the ith explanatory variable is not included in the model.

    If the ith position is 1, the ith explanatory variable is included in the model.


    ENCODING EXAMPLE

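    Suppose the full model is

    $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$

    The binary structure 1011101 would represent

    $y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{k-2} X_{k-2} + \beta_k X_k$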
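
Decoding a chromosome into the terms it selects can be sketched as follows. The sketch assumes, as the example on this slide suggests, that the first bit switches the intercept and bit j+1 switches X_j; `decode` is a hypothetical helper name.

```python
def decode(chromosome):
    """List the model terms switched on by a bit string.

    Assumed layout: position 0 is the intercept, position j is X_j.
    """
    terms = []
    for position, bit in enumerate(chromosome):
        if bit == "1":
            terms.append("intercept" if position == 0 else f"X{position}")
    return terms


print(decode("1011101"))
# ['intercept', 'X2', 'X3', 'X4', 'X6']
```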


    MULTIPLE REGRESSION AND

    GENETIC ALGORITHMS

    Genetic algorithms operate to find the fittest chromosome.

    Thus, to use genetic algorithms with multiple regression modeling, a fitness function must be chosen.

    Candidates include R^2, adjusted R^2, Mallows' C_p, MSE, AIC and so on.

    In our analysis we employ ICOMP (Bozdogan, 2004).
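
As an illustration of a fitness function for this encoding, the sketch below scores a chromosome by AIC, one of the criteria listed above. It deliberately uses AIC rather than ICOMP, the criterion the authors actually employ. The helper names, the pure-Python least-squares solver, and the AIC convention (constants dropped, error variance counted as a parameter) are all assumptions.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design: columns are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def aic_fitness(chromosome, columns, y):
    """AIC of the least-squares model selected by `chromosome` (lower is better).

    Bit 0 switches the intercept; bit j switches columns[j - 1].
    """
    n_obs = len(y)
    design = []
    if chromosome[0] == "1":
        design.append([1.0] * n_obs)
    design += [columns[j - 1] for j in range(1, len(chromosome))
               if chromosome[j] == "1"]
    if not design:
        return math.inf  # the empty model is never preferred
    p = len(design)
    # Normal equations (X'X) beta = X'y, built column by column.
    xtx = [[sum(u * v for u, v in zip(design[i], design[j])) for j in range(p)]
           for i in range(p)]
    xty = [sum(u * v for u, v in zip(design[i], y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * col[t] for b, col in zip(beta, design))) ** 2
              for t, yi in enumerate(y))
    rss = max(rss, 1e-12)  # avoid log(0) on a perfect fit
    return n_obs * math.log(rss / n_obs) + 2 * (p + 1)
```

Because the GA only compares fitness values, any criterion with the same "lower is better" orientation can be dropped in for `aic_fitness`.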


    ICOMP

    ICOMP is defined as:

    [Equation: the ICOMP formula did not survive extraction. The remaining fragments show logarithmic terms, a trace term and a determinant term in $X'X$; see Bozdogan (2004) for the definition.]

    where

    n is the number of variables in the model

    $\hat{\sigma}^2$ is the mean squared error, RSS/n

    q is the number of observations the model is based on

    X is the (n+1) x q matrix whose first column is filled with 1s and whose kth column holds the values of the (k-1)th variable; if the model has no intercept, X is n x q and the kth column corresponds to the kth variable


    ICOMP

    The best models have low complexity; thus the GA seeks to minimize the ICOMP value.


    An Illustration

    Consider a regression problem with k=6 candidate variables.

    Suppose that three randomly chosen regression subsets are the members of the initial population, with ICOMP as their fitness:

    101110   $y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$   ICOMP=143.7
    110001   $y = \beta_0 + \beta_1 X_1 + \beta_5 X_5$                 ICOMP=138.32
    100101   $y = \beta_0 + \beta_3 X_3 + \beta_5 X_5$                 ICOMP=134.18


    An Illustration

    Parents:  110|001   100|101

    Children: 110101    100001

    110101: $y = \beta_0 + \beta_1 X_1 + \beta_3 X_3 + \beta_5 X_5$ (ICOMP=132.25)

    100001: $y = \beta_0 + \beta_5 X_5$ (ICOMP=140.16)

    We now rank all of the strings in the current population and replace the lowest ranking string with the new string to generate a new population.

    The new population might then consist of the following strings:

    110101 (ICOMP=132.25), 100001 (ICOMP=140.16), 100101 (ICOMP=134.18)


    At this time strings 110101 and 100101 will most likely reproduce. Suppose they did and the following occurred:

    Parents:  110|101   100|101

    Children: 110101    100101

    No new genetic combination is produced.

    Mutation will alter some bits on both strings; a new population will emerge.


    An Illustration

    Question:

    Can mating and/or mutation result in weaker individuals which may eventually enter the general population?

    Answer:

    Yes, possibly. However, because of their lower fitness, these individuals will be less likely to be chosen for mating, and will soon disappear from the population as fitter newborn individuals appear.


    WHAT DID WE CONTRIBUTE?

    Evolution works fastest when the initial genetic variance is the largest.

    Using binary coding, the variance is maximized when, at each position of the chromosome, 50% of the chromosomes have a 1 and the others have a 0.

    In the context of regression model selection, this means that in the initial population each variable has an equal chance of being included in or excluded from the model.

    We have developed a method called Initial Population Diversification, in which half the population is randomly created and the other half is created by taking each chromosome of the first half and changing every 0 to 1 and every 1 to 0.


    INITIAL POPULATION

    DIVERSIFICATION EXAMPLE

    First 10 chromosomes   Last 10 chromosomes (bitwise complements)
    01110101011111         10001010100000
    00001001010001         11110110101110
    00000100001101         11111011110010
    00110001000100         11001110111011
    10111001001110         01000110110001
    10100010111010         01011101000101
    10010111001001         01101000110110
    11101010111101         00010101000010
    01110101010010         10001010101101
    01111101110101         10000010001010
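
The diversification step illustrated here can be sketched in a few lines. This is a sketch under stated assumptions: `complement` and `diversified_population` are hypothetical names, and the population size is assumed to be even.

```python
import random


def complement(chromosome):
    """Change every 0 to 1 and every 1 to 0."""
    return "".join("1" if bit == "0" else "0" for bit in chromosome)


def diversified_population(pop_size, length, rng=random):
    """Random first half, followed by the bitwise complement of each member."""
    if pop_size % 2:
        raise ValueError("pop_size must be even")
    first_half = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size // 2)]
    return first_half + [complement(c) for c in first_half]


print(complement("01110101011111"))
# 10001010100000
```

By construction every bit position holds a 1 in exactly half of the chromosomes, which is the maximum-variance condition described on the previous slide.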


    WHAT DID WE MODIFY?

    Populations start larger and are reduced over time.

    Initial population diversification.

    We use Binary Tournament instead of Proportional Selection, which requires less computation since ICOMP and model parameters only need to be computed for the chromosomes in the tournament.
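
The computational saving from binary tournament selection can be made concrete with a cached fitness wrapper. This is only a sketch: `LazyFitness` and `binary_tournament` are hypothetical names, and the toy score stands in for ICOMP, with lower values winning.

```python
import random


class LazyFitness:
    """Score each distinct chromosome at most once and count the evaluations."""

    def __init__(self, score):
        self.score = score
        self.cache = {}
        self.evaluations = 0

    def __call__(self, chromosome):
        if chromosome not in self.cache:
            self.cache[chromosome] = self.score(chromosome)
            self.evaluations += 1
        return self.cache[chromosome]


def binary_tournament(population, fitness, rng=random):
    """Return the better of two random members; lower fitness wins."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) <= fitness(b) else b


# Toy score (hypothetical stand-in for ICOMP): number of 1s, lower is better.
fitness = LazyFitness(lambda c: c.count("1"))
population = ["111", "000", "101", "110"]
winner = binary_tournament(population, fitness, random.Random(0))
print(winner, fitness.evaluations)
```

Only the two chromosomes drawn into the tournament are scored, whereas proportional selection would need the fitness of the whole population before the first draw.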


    RESULTS

    [Figure: Trials and ICOMP values. x-axis: Trial (1 to 196); y-axis: ICOMP (1468 to 1486); series: Typical deltaF=0, Adapt deltaF=.1, Adapt deltaF=.2, Adapt deltaF=.3, Exponential, Linear, Bozdogan.]

    Trials and ICOMP values ordered by ICOMP value. 1473.9 is the smallest value and the correct solution based on Bozdogan (2004). In all trials 600 computations were allowed and mutation was set at .05.

    Method                                               Frequency of Correct Solution
    Typical deltaF=0                                     0.915
    Adapt deltaF=.1                                      0.935
    Adapt deltaF=.2                                      0.93
    Adapt deltaF=.3                                      0.905
    Exponential                                          0.42
    Linear                                               0.50
    Bozdogan (No bit flipping/no population reduction)   0.09

    Frequency of the GAs that found the correct solution. The first 6 were based on the GAs we developed; the last was Bozdogan's. The first entry did not have any population reduction (deltaF=0) but did have bit flipping, suggesting that bit flipping is beneficial.


    FUTURE WORK

    Understanding how the ruggedness of the search space affects the progress of the algorithm.

    How to calculate this ruggedness for any arbitrary function with little computation.

    Mapping optimum parameters to certain values of ruggedness.


    REFERENCES

    Dumitrescu, D. et al., ed. Evolutionary Computation. CRC Press, 2000.

    Holland, J. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.

    Lima, C.F. and Lobo, F.G. A Review of Adaptive Population Sizing Schemes in Genetic Algorithms. In Proceedings of the 2005 Workshops on Genetic and Evolutionary Computation, pages 228-234. ACM, 2005.

    Wright, S. Evolution in Mendelian Populations. Genetics 16 (1931): 97-159.