

Page 1:

Automatic Method for Data Preprocessing for the GAME Inductive Modelling Method

Miroslav Čepek [email protected]

Miloslav Pavlicek, Pavel Kordik

Miroslav Šnorek

Computational Intelligence Group

Department of Computer Science and Engineering

Faculty of Electrical Engineering

Czech Technical University in Prague

ICIM 2008

Page 2:

International Conference on Inductive Modelling, Kyiv 2008

Miroslav Cepek, [email protected]

Automatic Preprocessing

The GAME neural network, like all other data mining methods, depends heavily on data preprocessing.

Preprocessing involves the selection, setup, and ordering of preprocessing methods.

We want to automate this stage. We will use a genetic algorithm to find an optimal sequence of methods.

Page 3:

GAME Neural Network

The Group of Adaptive Method Evolution (GAME) uses inductive modelling.

The structure of the model is created in an inductive way (data-driven modelling).

Page 4:

Main Ideas of Automatic Preprocessing

The main idea is to use genetic algorithms to find the optimal order and the optimal setup of data preprocessing methods.

In the first stage we will use a simple genetic algorithm.

Because we want to find the sequence that best suits the GAME ANN, we will use a reduced GAME ANN to evaluate the fitness function.

Page 5:

Single Individual in Automatic Preprocessing

An individual in our automatic preprocessing consists of a list of preprocessing methods.

Each method can be applied to different attributes.

Each method has its own setup.

Methods are applied one by one.

Some methods (such as PCA) change the structure of the dataset and must be treated separately.
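
The individual described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Step` class, the toy methods `scale` and `shift`, and the dictionary-based dataset are hypothetical names chosen for the example and are not taken from the GAME implementation. Methods that change the dataset structure (such as PCA) are not modelled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]
Dataset = List[Row]


@dataclass
class Step:
    """One gene (hypothetical): a method, the attributes it touches, its setup."""
    method: Callable[[Dataset, List[str], dict], Dataset]
    attributes: List[str]
    setup: dict = field(default_factory=dict)


def scale(data: Dataset, attributes: List[str], setup: dict) -> Dataset:
    """Toy method: multiply the chosen attributes by setup['factor']."""
    factor = setup.get("factor", 1.0)
    return [{k: (v * factor if k in attributes else v) for k, v in row.items()}
            for row in data]


def shift(data: Dataset, attributes: List[str], setup: dict) -> Dataset:
    """Toy method: add setup['offset'] to the chosen attributes."""
    offset = setup.get("offset", 0.0)
    return [{k: (v + offset if k in attributes else v) for k, v in row.items()}
            for row in data]


def apply_individual(individual: List[Step], data: Dataset) -> Dataset:
    """Apply the methods one by one, in the order stored in the individual."""
    for step in individual:
        data = step.method(data, step.attributes, step.setup)
    return data


individual = [
    Step(scale, ["x"], {"factor": 2.0}),        # applied to attribute x only
    Step(shift, ["x", "y"], {"offset": 1.0}),   # applied to both attributes
]
data = [{"x": 1.0, "y": 10.0}, {"x": 3.0, "y": 20.0}]
result = apply_individual(individual, data)
reversed_result = apply_individual(list(reversed(individual)), data)
```

Because the methods are applied in sequence, reversing the list gives a different dataset, which is why the order is part of what the genetic algorithm searches over.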

Page 6:

GA for Automatic Preprocessing

The genetic algorithm proceeds in the standard way, as shown in the figure on the slide.

Page 7:

GA Properties

Selection: tournament selection. Several individuals are selected at random from the population, and the individual with the highest fitness is selected.

Crossover: standard one-point crossover.

Mutation: adds or removes preprocessing methods from an individual, changes the order of methods, or changes the configuration of methods.
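
The three operators listed above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the gene encoding (a method name plus a setup dictionary), the names in `METHODS`, the `strength` parameter, and the choice of a single shared cut point for chromosomes of different length are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Gene = Tuple[str, dict]      # (method name, setup): hypothetical encoding
Individual = List[Gene]
METHODS = ["normalize", "replace_missing", "remove_outliers"]  # hypothetical


def random_gene(rng: random.Random) -> Gene:
    return (rng.choice(METHODS), {"strength": rng.random()})


def tournament(population: Sequence[Individual],
               fitness: Callable[[Individual], float],
               k: int, rng: random.Random) -> Individual:
    """Draw k individuals at random and return the fittest of them."""
    contestants = rng.sample(list(population), k)
    return max(contestants, key=fitness)


def one_point_crossover(a: Individual, b: Individual, rng: random.Random):
    """Cut both parents at the same position and swap the tails."""
    cut = rng.randint(0, min(len(a), len(b)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(individual: Individual, rng: random.Random) -> Individual:
    """Add or remove a method, change the order, or change one setup."""
    child = [(name, dict(setup)) for name, setup in individual]  # copy genes
    op = rng.choice(["add", "remove", "reorder", "configure"])
    if op == "add" or not child:
        child.insert(rng.randint(0, len(child)), random_gene(rng))
    elif op == "remove":
        child.pop(rng.randrange(len(child)))
    elif op == "reorder":
        rng.shuffle(child)
    else:
        i = rng.randrange(len(child))
        child[i] = (child[i][0], {"strength": rng.random()})
    return child


rng = random.Random(42)
population = [[random_gene(rng) for _ in range(n)] for n in (1, 2, 3, 4)]
# Toy fitness for the demonstration only: longer chromosomes score higher.
parent_a = tournament(population, fitness=len, k=2, rng=rng)
parent_b = tournament(population, fitness=len, k=2, rng=rng)
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, rng)
```

In the method described on the slides, the fitness passed to `tournament` would be the averaged accuracy of GAME models rather than the toy `len` used here.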

Page 8:

Fitness Recalculation

Fitness is the average accuracy of several simple GAME models generated from data preprocessed by the given individual.

The accuracy of the models is not always the same, because a genetic algorithm is involved in their training.

Using several models gives more consistent results.

We assume that a better simple model also means better complex models.
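
The averaging step can be illustrated with a short Python sketch. The function `train_model` stands in for training one simple GAME model and returning its accuracy; it is a hypothetical callable, and the noisy stand-in below only imitates the run-to-run variation mentioned on the slide.

```python
import random
from statistics import mean
from typing import Callable


def average_fitness(train_model: Callable[[object], float],
                    preprocessed_data: object,
                    n_models: int = 5) -> float:
    """Fitness = mean accuracy of n_models independently trained models."""
    return mean(train_model(preprocessed_data) for _ in range(n_models))


def noisy_trainer(rng: random.Random, true_accuracy: float = 0.8,
                  noise: float = 0.05) -> Callable[[object], float]:
    """Hypothetical stand-in for a stochastic training procedure."""
    def train(_data: object) -> float:
        return true_accuracy + rng.uniform(-noise, noise)
    return train


rng = random.Random(0)
single_run = noisy_trainer(rng)(None)                         # one model
averaged = average_fitness(noisy_trainer(rng), None, n_models=10)
constant = average_fitness(lambda _data: 0.5, None, n_models=3)
```

Averaging over more models reduces the spread of the fitness estimate, at the cost of training more models per individual.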

Page 9:

Outline of the Experiment

The complete dataset is split into a training part and a testing part.

A given portion of values is removed from the training data.

Several GAME models are created on the raw data.

Instances with missing values are removed, and then several GAME models are created.

Automatic preprocessing is performed. The best individual is selected, its preprocessing methods are applied, and several GAME models are created.
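
The data preparation steps of the experiment can be sketched in Python as below. The sketch makes several assumptions that the slides do not state: missing values are represented by `None`, they are spread uniformly at random over all cells of the training data, and the split is a random 70/30 split. The function names are hypothetical, and model training is left out.

```python
import random
from typing import List, Optional, Tuple

Row = List[Optional[float]]


def split(data: List[Row], test_fraction: float,
          rng: random.Random) -> Tuple[List[Row], List[Row]]:
    """Shuffle a copy of the data and split it into (training, testing)."""
    rows = list(data)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def remove_values(rows: List[Row], portion: float,
                  rng: random.Random) -> List[Row]:
    """Return a copy in which the given portion of cells is set to None."""
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    missing = set(rng.sample(cells, round(portion * len(cells))))
    return [[None if (i, j) in missing else value
             for j, value in enumerate(row)]
            for i, row in enumerate(rows)]


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Keep only the instances that have no missing value."""
    return [row for row in rows if None not in row]


rng = random.Random(0)
data = [[float(i), float(i * i)] for i in range(100)]
train, test = split(data, test_fraction=0.3, rng=rng)
raw = remove_values(train, portion=0.05, rng=rng)   # 5% of cells missing
complete = drop_incomplete(raw)                     # second experimental setting
n_missing = sum(value is None for row in raw for value in row)
```

Here `raw` corresponds to the setting in which models are built on data with missing values, and `complete` to the setting in which incomplete instances are removed first. The third setting would pass `raw` through the best individual found by the genetic algorithm.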

Page 10:

Artificial data

Page 11:

Best Chromosomes

The best individuals for selected amounts of missing values. Part a) shows the best chromosome for 1% of missing values, part b) shows the best chromosome for 5% of missing values, and part c) shows the best chromosome for 20% of missing values.

Page 12:

Best Chromosomes

Chromosomes for simple problems (low number of missing values) are quite simple.

Chromosomes for complicated problems (high number of missing values) are quite complicated.

In this sense our algorithm works.

Page 13:

Manually vs Automatically selected methods.

Page 14:

Results

The graph shows that GAME is unable to handle missing values. Results on the raw data are quite poor.

When instances with missing data are removed, accuracy increases rapidly.

When automatic preprocessing is used, accuracy is even better.

Page 15:

Conclusion

We proposed an algorithm for automatic selection and ordering of data preprocessing methods.

We performed the first experiment with our method.

It works for artificial data; in future we have to show that it also works for more complicated and real-world data.

Page 16:

Thank you for your attention.

[email protected]