data preparation as a process markku ursin [email protected]

25
Data Preparation as a Process Markku Ursin [email protected]

Upload: allison-skinner

Post on 13-Jan-2016

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Preparation as a Process

Markku Ursin

[email protected]

Page 2: Data Preparation as a Process Markku Ursin mtu@iki.fi

Introduction

• Purpose: make the data better accessible for the mining tool

• No magical general purpose techniques, preparation is half art, half science

• Knowing the limitations and correct use of techniques is more important than thoroughly understanding the actual techniques

Page 3: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Mining Process (simplified)

1. Data Preparation

2. Data Survey

3. Data Modeling

Page 4: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Preparation Process

Analyst insight

Training data set

PIE-I

PIE-O

Test data set

DataPreparation

Problem selection

Solutionselection

Modeling toolselection

Raw trainingdata set

Page 5: Data Preparation as a Process Markku Ursin mtu@iki.fi

Training and Test Data Sets

Page 6: Data Preparation as a Process Markku Ursin mtu@iki.fi
Page 7: Data Preparation as a Process Markku Ursin mtu@iki.fi

Prepared Information Environment Modules

• Input module transforms raw execution data:– categorical values into numerical– filling in / ignoring missing values

• Output module undoes the effect of PIE-I

• Used between the model and the real world

Page 8: Data Preparation as a Process Markku Ursin mtu@iki.fi

Modeling Tools and Data Preparation

• Right tool for the right job• Early general-purpose mining tools were

algorithm centric• Modern tools concentrate on business

problems• “Getting the job done is enough, we don’t

need to know how.”

Page 9: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Separation

• Straight lines parallel to axes

• Straight lines not parallel to axes

• Curves• Closed area• Ideal arrangement

Page 10: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Separation

• Straight lines parallel to axes

• Straight lines not parallel to axes

• Curves• Closed area• Ideal arrangement

Page 11: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Separation

• Straight lines parallel to axes

• Straight lines not parallel to axes

• Curves• Closed area• Ideal arrangement

Page 12: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Separation

• Straight lines parallel to axes

• Straight lines not parallel to axes

• Curves• Closed area• Ideal arrangement

Page 13: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Separation

• Straight lines parallel to axes

• Straight lines not parallel to axes

• Curves• Closed area• Ideal arrangement

Page 14: Data Preparation as a Process Markku Ursin mtu@iki.fi

Data Separation

• Straight lines parallel to axes

• Straight lines not parallel to axes

• Curves• Closed area• Ideal arrangement

Page 15: Data Preparation as a Process Markku Ursin mtu@iki.fi

Algorithms for Data Separation

• Decision Trees• Decision Lists• Neural Networks• Evolution Programs

Page 16: Data Preparation as a Process Markku Ursin mtu@iki.fi

Modeling Data with the Tools

• Discrete and continuous tools - different approaches to different problems

• Binning vs. continuos algorithms• It may be worthwhile trying different

techniques for preparation• Missing and empty values

Page 17: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Accessing the data– not trivial in many cases!– Very case dependent

Page 18: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Accessing the data• Auditing the data

– examining the quality, quantity and source of data– make sure the minimum requirements for solution

are filled, forget unsupported hopes

Page 19: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Accessing the data• Auditing the data• Enhancing and enriching the data

– add more data if needed– apply domain knowledge to ease the work of the

tool

Page 20: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Accessing the data• Auditing the data• Enhancing and enriching the data• Looking for sampling bias

– data sets must accurately represent the population– failure may lead to useless models

Page 21: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Accessing the data• Auditing the data• Enhancing and enriching the data• Looking for sampling bias• Determining data structure

– superstructure: selected scaffolding– macrostructure: eg. granularity– microstructure: relationships between variables

Page 22: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Building the PIE, data issues:– representative samples– categorical values– normalization– missing and empty values– reducing width and depth– well- and ill-formed manifolds

Page 23: Data Preparation as a Process Markku Ursin mtu@iki.fi

Correcting Problems with Ill-Formed Manifolds

Page 24: Data Preparation as a Process Markku Ursin mtu@iki.fi

Stages of Data Preparation

• Accessing the data• Auditing the data• Enhancing and enriching the data• Looking for sampling bias• Determining data structure• Building the PIE• Surveying the Data• Modeling the Data

Page 25: Data Preparation as a Process Markku Ursin mtu@iki.fi

Summary

• Some data preparation is needed for all mining tools

• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool

• Error prediction rate should be lower (or the same) after the preparation as before it

• The miner gains very good insight on the problem during the preparation process