Issues with Data Mining

Data Mining involves Generalization

• Data mining (machine learning) learns generalizations of the instances in the training data
  – E.g. a decision tree learnt from the weather data captures generalizations about how to predict values of the Play attribute
  – This means generalizations predict (or describe) the behaviour of instances beyond the training data
  – This in turn means knowledge is extracted from raw data by data mining
• This knowledge drives the end user's decision-making process

Generalization as Search

• The process of generalization can be viewed as searching a space of all possible patterns or models
  – For a pattern that fits the data
• This view provides a standard framework for understanding all data mining techniques
• E.g. decision tree learning involves searching through all possible decision trees, as sketched below
  – Lecture 4 shows two example decision trees that fit the weather data
  – One of them (Example 2) is a better generalization than the other
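
A minimal way to see this search view concretely is to enumerate a tiny model space and keep the candidate that fits the data best. The sketch below (plain Python; the weather-style table and the restriction to one-attribute "stumps" are illustrative assumptions, not the lecture's actual example) searches the space of single-attribute models:

```python
from collections import Counter, defaultdict

# Illustrative weather-style training data (not the lecture's exact table).
data = [
    ({"outlook": "sunny",    "humidity": "high",   "windy": False}, "no"),
    ({"outlook": "sunny",    "humidity": "normal", "windy": True},  "yes"),
    ({"outlook": "overcast", "humidity": "high",   "windy": False}, "yes"),
    ({"outlook": "rainy",    "humidity": "high",   "windy": True},  "no"),
    ({"outlook": "rainy",    "humidity": "normal", "windy": False}, "yes"),
]

def build_stump(attribute):
    """One-attribute model: predict the majority class for each attribute value."""
    by_value = defaultdict(list)
    for features, label in data:
        by_value[features[attribute]].append(label)
    rule = {value: Counter(labels).most_common(1)[0][0]
            for value, labels in by_value.items()}
    return lambda features: rule.get(features[attribute])

def accuracy(model):
    return sum(model(f) == y for f, y in data) / len(data)

# The "search": enumerate every candidate model in the (tiny) space of
# one-attribute stumps and keep the one that fits the training data best.
candidates = {attr: build_stump(attr) for attr in ["outlook", "humidity", "windy"]}
best = max(candidates, key=lambda a: accuracy(candidates[a]))
print(best, accuracy(candidates[best]))
```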

Bias

• Important choices made in a data mining system are
  – Representation language – the language chosen to represent the patterns or models
  – Search method – the order in which the space is searched
  – Model pruning method – the way overfitting to the training data is avoided
• This means each data mining scheme involves
  – Language bias
  – Search bias
  – Overfitting-avoidance bias

Language Bias

• Different languages are used for representing patterns and models
  – E.g. rules and decision trees
• A concept fits a subset of the training data
  – That subset can be described as a disjunction of rules
  – E.g. a classifier for the weather data can be represented as a disjunction of rules (see the sketch below)
• Languages differ in their ability to represent patterns and models
  – This means that when a language with lower representational ability is used, the data mining system may not achieve good performance
• Domain knowledge (external to the training data) helps to cut down the search space
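
As a small illustration of the rule representation, a classifier for a weather-style dataset can be written directly as a disjunction of rules. The sketch below is illustrative Python (the rule bodies follow the commonly quoted tree for the weather data, but treat them as an assumption rather than the lecture's exact classifier):

```python
# A classifier expressed as a disjunction of rules: the first rule whose
# conditions hold determines the prediction.
RULES = [
    (lambda x: x["outlook"] == "overcast",                            "yes"),
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "normal", "yes"),
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high",   "no"),
    (lambda x: x["outlook"] == "rainy" and not x["windy"],            "yes"),
    (lambda x: x["outlook"] == "rainy" and x["windy"],                "no"),
]

def classify(instance):
    for condition, label in RULES:
        if condition(instance):
            return label
    return "yes"  # default class if no rule fires

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # -> no
```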

Search Bias

• An exhaustive search over the search space is computationally expensive
• Search is sped up by using heuristics (see the sketch below)
  – Pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimal patterns or models
  – Using information gain may mislead us into selecting a suboptimal attribute at the root
• More complex search strategies are possible
  – Those that pursue several alternatives in parallel
  – Those that allow backtracking
• A high-level search bias
  – General-to-specific: start with a root node and grow the decision tree to fit the specific data
  – Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples
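
The information-gain heuristic mentioned above can be made concrete with a short sketch (plain Python; the toy rows and attribute names are assumptions for illustration). It performs one greedy, general-to-specific step: pick the attribute with the highest gain as the root.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute):
    """Expected reduction in entropy from splitting `rows` on `attribute`."""
    labels = [label for _, label in rows]
    split = {}
    for features, label in rows:
        split.setdefault(features[attribute], []).append(label)
    remainder = sum(len(subset) / len(rows) * entropy(subset)
                    for subset in split.values())
    return entropy(labels) - remainder

# Toy rows (illustrative, not the lecture's weather table).
rows = [
    ({"outlook": "sunny",    "windy": False}, "no"),
    ({"outlook": "sunny",    "windy": True},  "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
    ({"outlook": "rainy",    "windy": False}, "yes"),
    ({"outlook": "rainy",    "windy": True},  "no"),
]
root = max(["outlook", "windy"], key=lambda a: information_gain(rows, a))
print(root)  # the greedy choice for the root node
```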

Overfitting-Avoidance Bias

• We want to search for the 'best' patterns and models
• Simpler models are generally preferred
• Two strategies (see the sketch below)
  – Start with the simplest model and stop building it when it starts to become complex
  – Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
  – Each data mining scheme might involve a configuration of biases
  – These biases may serve some problems well
• There is no universal best learning scheme!
  – We saw this in our practicals with Weka
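
The two strategies can be sketched with scikit-learn decision trees (an assumption, since the practicals use Weka): limiting max_depth stops model building before the tree becomes complex, while cost-complexity pruning (ccp_alpha) grows a complex tree and then prunes it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Strategy 1: start simple and stop early (pre-pruning) by limiting depth.
stop_early = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Strategy 2: grow a complex tree, then prune it (post-pruning) via
# minimal cost-complexity pruning.
grow_then_prune = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

for name, model in [("stop early", stop_early), ("grow then prune", grow_then_prune)]:
    print(name, "depth:", model.get_depth(), "test accuracy:", model.score(X_test, y_test))
```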

Combining Multiple Models

• Because there is no ideal data mining scheme, it is useful to combine multiple models
  – The idea of democracy – decisions made based on collective wisdom
  – Each data mining scheme acts like an expert using its knowledge to make decisions
• Three general approaches
  – Bagging
  – Boosting
  – Stacking
• Bagging and boosting both follow the same approach (see the voting sketch below)
  – Take a vote on the class prediction from all the different schemes
  – Bagging uses a simple average of votes, while boosting uses a weighted average
  – Boosting gives more weight to the more knowledgeable experts
• Boosting is generally considered the most effective
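
The difference between the two voting schemes can be written down in a few lines (a minimal sketch; the function names and the example predictions and weights are hypothetical):

```python
from collections import defaultdict

def simple_vote(predictions):
    """Bagging-style combination: every model's vote counts equally."""
    tally = defaultdict(int)
    for label in predictions:
        tally[label] += 1
    return max(tally, key=tally.get)

def weighted_vote(predictions, weights):
    """Boosting-style combination: more knowledgeable models get larger weights."""
    tally = defaultdict(float)
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    return max(tally, key=tally.get)

# Hypothetical predictions from three models for one test instance.
preds = ["yes", "no", "no"]
print(simple_vote(preds))                     # majority wins -> "no"
print(weighted_vote(preds, [0.9, 0.2, 0.2]))  # the strong expert wins -> "yes"
```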

Bias-Variance Decomposition

• Assume
  – Infinitely many training data sets of the same size, n
  – Infinitely many classifiers trained on the above data sets
• For any learning scheme
  – Bias = expected error of the classifier even after increasing the training data infinitely
  – Variance = expected error due to the particular training set used
• Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component (see the sketch below)
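
The decomposition can be illustrated (only roughly, not as the formal derivation) by simulating the "many training sets" assumption: resample training sets, train one unstable classifier on each, and treat disagreement with the ensemble's majority prediction as a proxy for variance. The synthetic dataset, scikit-learn trees and the proxy itself are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

all_preds = []
for _ in range(50):                                    # stand-in for "infinitely many" training sets
    idx = rng.integers(0, len(X_train), len(X_train))  # a resampled training set of the same size
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))
all_preds = np.array(all_preds)                        # shape: (n_classifiers, n_test_instances)

majority = np.round(all_preds.mean(axis=0))            # majority prediction per test instance (binary labels)
variance_proxy = (all_preds != majority).mean()        # how often individual trees disagree with the majority
bias_proxy = (majority != y_test).mean()               # error the majority prediction still makes
print(f"variance ~ {variance_proxy:.3f}, bias ~ {bias_proxy:.3f}")
```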

Bagging

• Bagging stands for bootstrap aggregating
  – Combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
  – Instability – a small change in the training data results in a big change in the model
• Idealized version for a classifier
  – Collect several independent training sets
  – Build a classifier from each training set
    • E.g. learn a decision tree from each training set
  – The class of a test instance is the prediction that received the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets

Bagging Algorithm

• Involves two stages
• Model Generation
  – Let n be the number of instances in the training data
  – For each of t iterations
    • Sample n instances with replacement from training data
    • Apply the learning algorithm to the sample
    • Store the resulting model
• Classification
  – For each of the t models:
    • Predict class of instance using model
  – Return class that has been predicted most often
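
The two stages above translate almost line for line into Python. In the sketch below the function names are illustrative and the base learner is a scikit-learn decision tree (an assumption; the pseudocode leaves the learning algorithm open).

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=10):
    """Model generation: t models, each trained on n instances sampled with replacement."""
    n = len(X)
    models = []
    for _ in range(t):
        idx = [random.randrange(n) for _ in range(n)]  # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, instance):
    """Classification: return the class predicted most often by the t models."""
    votes = [m.predict([instance])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

# Example usage on a toy numeric dataset (illustrative values).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]]
y = [0, 0, 0, 1, 1, 1]
print(bagging_predict(bagging_fit(X, y, t=5), [2, 2]))
```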

Boosting

• Multiple data mining methods might complement each other
  – Each method performing well on a subset of the data
• Boosting combines complementary models
  – Using weighted voting
• Boosting is iterative
  – Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
  – AdaBoost.M1 – based on the idea of giving weights to instances
• Boosting involves two stages
  – Model generation
  – Classification

Boosting Algorithm (AdaBoost.M1)

• Model generation
  – Assign equal weight to each training instance
  – For each of t iterations:
    • Apply the learning algorithm to the weighted dataset and store the resulting model
    • Compute the error e of the model on the weighted dataset and store the error
    • If e = 0 or e >= 0.5
      – Terminate model generation
    • For each instance in the dataset:
      – If the instance is classified correctly by the model:
        – Multiply the weight of the instance by e/(1-e)
    • Normalize the weights of all instances
• Classification
  – Assign a weight of zero to all classes
  – For each of the t (or fewer) models:
    • Add -log(e/(1-e)) to the weight of the class predicted by the model
  – Return the class with the highest weight
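
A direct Python rendering of this pseudocode is sketched below. The weighted base learner is a scikit-learn decision stump (an assumption), and the sketch terminates before storing a model whose error is 0 or at least 0.5, so that the classification weight -log(e/(1-e)) stays finite.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    """Model generation: build up to t models on a reweighted dataset."""
    X, y = np.asarray(X), np.asarray(y)
    weights = np.full(len(X), 1.0 / len(X))       # assign equal weight to each instance
    models, errors = [], []
    for _ in range(t):
        model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
        wrong = model.predict(X) != y
        e = weights[wrong].sum()                  # error e on the weighted dataset
        if e == 0 or e >= 0.5:                    # terminate model generation
            break
        models.append(model)
        errors.append(e)
        weights[~wrong] *= e / (1 - e)            # shrink weights of correctly classified instances
        weights /= weights.sum()                  # normalize weights of all instances
    return models, errors

def adaboost_m1_predict(models, errors, instance):
    """Classification: weighted vote, each model contributing -log(e/(1-e))."""
    class_weight = {}                             # weight of zero to all classes
    for model, e in zip(models, errors):
        label = model.predict([instance])[0]
        class_weight[label] = class_weight.get(label, 0.0) - math.log(e / (1 - e))
    return max(class_weight, key=class_weight.get)
```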

Stacking

• Bagging and boosting combine models of the same type
  – E.g. a set of decision trees
• Stacking is applied to models of different types
  – Because voting may not work when the different models do not perform comparably well
  – Voting is problematic when, say, two out of three classifiers perform poorly
• Stacking uses a metalearner to combine the different base learners (see the sketch below)
  – Base learners: level-0 models
  – Meta learner: level-1 model
  – Predictions of the base learners are fed as inputs to the meta learner
• Base-learner predictions on the training data cannot be used as input to the meta learner
  – Instead, use cross-validation predictions from the base learners
• Because most of the classification work is done by the base learners, the meta learner uses a simple learning scheme
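
With scikit-learn (an assumption; the lecture itself uses Weka) the same idea can be sketched in a few lines: base learners of different types at level 0, a simple logistic regression as the level-1 metalearner, and cross-validated base-learner predictions as its inputs.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),     # level-0 models of different types
                ("knn", KNeighborsClassifier()),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # simple level-1 metalearner
    cv=5,  # level-1 inputs come from 5-fold cross-validated base-learner predictions
)
print(cross_val_score(stack, X, y, cv=3).mean())
```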

Combining Models Using Weka

• Weka offers methods to perform bagging, boosting and stacking over classifiers

• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu

• AdaBoostM1 (one of the boosting methods) on the Iris data classifies only 7 out of 150 instances incorrectly

• You are encouraged to try these methods on your own
