Issues with Data Mining

Data Mining involves Generalization

• Data mining (machine learning) learns generalizations of the instances in the training data
  – E.g. a decision tree learnt from the weather data captures generalizations about how to predict values of the Play attribute
  – This means generalizations predict (or describe) the behaviour of instances beyond the training data
  – This in turn means knowledge is extracted from raw data by data mining
• This knowledge drives the end user's decision-making process

Generalization as Search

• The process of generalization can be viewed as searching a space of all possible patterns or models
  – For a pattern that fits the data
• This view provides a standard framework for understanding all data mining techniques
• E.g. decision tree learning involves searching through all possible decision trees, as sketched below
  – Lecture 4 shows two example decision trees that fit the weather data
  – One of them (Example 2) is a better generalization than the other
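
A minimal way to see this search view concretely is to enumerate a tiny model space and keep the candidate that fits the data best. The sketch below (plain Python; the weather-style table and the restriction to one-attribute "stumps" are illustrative assumptions, not the lecture's actual example) searches the space of single-attribute models:

```python
from collections import Counter, defaultdict

# Illustrative weather-style training data (not the lecture's exact table).
data = [
    ({"outlook": "sunny",    "humidity": "high",   "windy": False}, "no"),
    ({"outlook": "sunny",    "humidity": "normal", "windy": True},  "yes"),
    ({"outlook": "overcast", "humidity": "high",   "windy": False}, "yes"),
    ({"outlook": "rainy",    "humidity": "high",   "windy": True},  "no"),
    ({"outlook": "rainy",    "humidity": "normal", "windy": False}, "yes"),
]

def build_stump(attribute):
    """One-attribute model: predict the majority class for each attribute value."""
    by_value = defaultdict(list)
    for features, label in data:
        by_value[features[attribute]].append(label)
    rule = {value: Counter(labels).most_common(1)[0][0]
            for value, labels in by_value.items()}
    return lambda features: rule.get(features[attribute])

def accuracy(model):
    return sum(model(f) == y for f, y in data) / len(data)

# The "search": enumerate every candidate model in the (tiny) space of
# one-attribute stumps and keep the one that fits the training data best.
candidates = {attr: build_stump(attr) for attr in ["outlook", "humidity", "windy"]}
best = max(candidates, key=lambda a: accuracy(candidates[a]))
print(best, accuracy(candidates[best]))
```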

Bias

• Important choices made in a data mining system are
  – Representation language – the language chosen to represent the patterns or models
  – Search method – the order in which the space is searched
  – Model pruning method – the way overfitting to the training data is avoided
• This means each data mining scheme involves
  – Language bias
  – Search bias
  – Overfitting-avoidance bias

Language Bias

• Different languages are used for representing patterns and models
  – E.g. rules and decision trees
• A concept fits a subset of the training data
  – That subset can be described as a disjunction of rules
  – E.g. a classifier for the weather data can be represented as a disjunction of rules (see the sketch below)
• Languages differ in their ability to represent patterns and models
  – This means that when a language with lower representational ability is used, the data mining system may not achieve good performance
• Domain knowledge (external to the training data) helps to cut down the search space
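
As a small illustration of the rule representation, a classifier for a weather-style dataset can be written directly as a disjunction of rules. The sketch below is illustrative Python (the rule bodies follow the commonly quoted tree for the weather data, but treat them as an assumption rather than the lecture's exact classifier):

```python
# A classifier expressed as a disjunction of rules: the first rule whose
# conditions hold determines the prediction.
RULES = [
    (lambda x: x["outlook"] == "overcast",                            "yes"),
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "normal", "yes"),
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high",   "no"),
    (lambda x: x["outlook"] == "rainy" and not x["windy"],            "yes"),
    (lambda x: x["outlook"] == "rainy" and x["windy"],                "no"),
]

def classify(instance):
    for condition, label in RULES:
        if condition(instance):
            return label
    return "yes"  # default class if no rule fires

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # -> no
```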

Search Bias

• An exhaustive search over the search space is computationally expensive
• Search is sped up by using heuristics (see the sketch below)
  – Pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimal patterns or models
  – Using information gain may mislead us into selecting a suboptimal attribute at the root
• More complex search strategies are possible
  – Those that pursue several alternatives in parallel
  – Those that allow backtracking
• A high-level search bias
  – General-to-specific: start with a root node and grow the decision tree to fit the specific data
  – Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples
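
The information-gain heuristic mentioned above can be made concrete with a short sketch (plain Python; the toy rows and attribute names are assumptions for illustration). It performs one greedy, general-to-specific step: pick the attribute with the highest gain as the root.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute):
    """Expected reduction in entropy from splitting `rows` on `attribute`."""
    labels = [label for _, label in rows]
    split = {}
    for features, label in rows:
        split.setdefault(features[attribute], []).append(label)
    remainder = sum(len(subset) / len(rows) * entropy(subset)
                    for subset in split.values())
    return entropy(labels) - remainder

# Toy rows (illustrative, not the lecture's weather table).
rows = [
    ({"outlook": "sunny",    "windy": False}, "no"),
    ({"outlook": "sunny",    "windy": True},  "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
    ({"outlook": "rainy",    "windy": False}, "yes"),
    ({"outlook": "rainy",    "windy": True},  "no"),
]
root = max(["outlook", "windy"], key=lambda a: information_gain(rows, a))
print(root)  # the greedy choice for the root node
```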

Overfitting-Avoidance Bias

• We want to search for the 'best' patterns and models
• Simpler models are generally preferred
• Two strategies (see the sketch below)
  – Start with the simplest model and stop building it when it starts to become complex
  – Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
  – Each data mining scheme might involve a configuration of biases
  – These biases may serve some problems well
• There is no universal best learning scheme!
  – We saw this in our practicals with Weka
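
The two strategies can be sketched with scikit-learn decision trees (an assumption, since the practicals use Weka): limiting max_depth stops model building before the tree becomes complex, while cost-complexity pruning (ccp_alpha) grows a complex tree and then prunes it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Strategy 1: start simple and stop early (pre-pruning) by limiting depth.
stop_early = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Strategy 2: grow a complex tree, then prune it (post-pruning) via
# minimal cost-complexity pruning.
grow_then_prune = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

for name, model in [("stop early", stop_early), ("grow then prune", grow_then_prune)]:
    print(name, "depth:", model.get_depth(), "test accuracy:", model.score(X_test, y_test))
```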

Combining Multiple Models

• Because there is no ideal data mining scheme, it is useful to combine multiple models
  – The idea of democracy – decisions made based on collective wisdom
  – Each data mining scheme acts like an expert using its knowledge to make decisions
• Three general approaches
  – Bagging
  – Boosting
  – Stacking
• Bagging and boosting both follow the same approach (see the voting sketch below)
  – Take a vote on the class prediction from all the different schemes
  – Bagging uses a simple average of votes, while boosting uses a weighted average
  – Boosting gives more weight to the more knowledgeable experts
• Boosting is generally considered the most effective
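
The difference between the two voting schemes can be written down in a few lines (a minimal sketch; the function names and the example predictions and weights are hypothetical):

```python
from collections import defaultdict

def simple_vote(predictions):
    """Bagging-style combination: every model's vote counts equally."""
    tally = defaultdict(int)
    for label in predictions:
        tally[label] += 1
    return max(tally, key=tally.get)

def weighted_vote(predictions, weights):
    """Boosting-style combination: more knowledgeable models get larger weights."""
    tally = defaultdict(float)
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    return max(tally, key=tally.get)

# Hypothetical predictions from three models for one test instance.
preds = ["yes", "no", "no"]
print(simple_vote(preds))                     # majority wins -> "no"
print(weighted_vote(preds, [0.9, 0.2, 0.2]))  # the strong expert wins -> "yes"
```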

Bias-Variance Decomposition

• Assume
  – Infinitely many training data sets of the same size, n
  – Infinitely many classifiers trained on the above data sets
• For any learning scheme
  – Bias = expected error of the classifier even after increasing the training data infinitely
  – Variance = expected error due to the particular training set used
• Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component (see the sketch below)
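
The decomposition can be illustrated (only roughly, not as the formal derivation) by simulating the "many training sets" assumption: resample training sets, train one unstable classifier on each, and treat disagreement with the ensemble's majority prediction as a proxy for variance. The synthetic dataset, scikit-learn trees and the proxy itself are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

all_preds = []
for _ in range(50):                                    # stand-in for "infinitely many" training sets
    idx = rng.integers(0, len(X_train), len(X_train))  # a resampled training set of the same size
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))
all_preds = np.array(all_preds)                        # shape: (n_classifiers, n_test_instances)

majority = np.round(all_preds.mean(axis=0))            # majority prediction per test instance (binary labels)
variance_proxy = (all_preds != majority).mean()        # how often individual trees disagree with the majority
bias_proxy = (majority != y_test).mean()               # error the majority prediction still makes
print(f"variance ~ {variance_proxy:.3f}, bias ~ {bias_proxy:.3f}")
```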

Bagging

• Bagging stands for bootstrap aggregating
  – Combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
  – Instability – a small change in the training data results in a big change in the model
• Idealized version for a classifier
  – Collect several independent training sets
  – Build a classifier from each training set
    • E.g. learn a decision tree from each training set
  – The class of a test instance is the prediction that received the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets

Bagging Algorithm

• Involves two stages
• Model Generation
  – Let n be the number of instances in the training data
  – For each of t iterations
    • Sample n instances with replacement from training data
    • Apply the learning algorithm to the sample
    • Store the resulting model
• Classification
  – For each of the t models:
    • Predict class of instance using model
  – Return class that has been predicted most often
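
The two stages above translate almost line for line into Python. In the sketch below the function names are illustrative and the base learner is a scikit-learn decision tree (an assumption; the pseudocode leaves the learning algorithm open).

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=10):
    """Model generation: t models, each trained on n instances sampled with replacement."""
    n = len(X)
    models = []
    for _ in range(t):
        idx = [random.randrange(n) for _ in range(n)]  # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, instance):
    """Classification: return the class predicted most often by the t models."""
    votes = [m.predict([instance])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

# Example usage on a toy numeric dataset (illustrative values).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]]
y = [0, 0, 0, 1, 1, 1]
print(bagging_predict(bagging_fit(X, y, t=5), [2, 2]))
```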

Boosting

• Multiple data mining methods might complement each other
  – Each method performing well on a subset of the data
• Boosting combines complementary models
  – Using weighted voting
• Boosting is iterative
  – Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
  – AdaBoost.M1 – based on the idea of giving weights to instances
• Boosting involves two stages
  – Model generation
  – Classification

Boosting Algorithm (AdaBoost.M1)

• Model generation
  – Assign equal weight to each training instance
  – For each of t iterations:
    • Apply the learning algorithm to the weighted dataset and store the resulting model
    • Compute the error e of the model on the weighted dataset and store the error
    • If e = 0 or e >= 0.5
      – Terminate model generation
    • For each instance in the dataset:
      – If the instance is classified correctly by the model:
        – Multiply the weight of the instance by e/(1-e)
    • Normalize the weights of all instances
• Classification
  – Assign a weight of zero to all classes
  – For each of the t (or fewer) models:
    • Add -log(e/(1-e)) to the weight of the class predicted by the model
  – Return the class with the highest weight
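
A direct Python rendering of this pseudocode is sketched below. The weighted base learner is a scikit-learn decision stump (an assumption), and the sketch terminates before storing a model whose error is 0 or at least 0.5, so that the classification weight -log(e/(1-e)) stays finite.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    """Model generation: build up to t models on a reweighted dataset."""
    X, y = np.asarray(X), np.asarray(y)
    weights = np.full(len(X), 1.0 / len(X))       # assign equal weight to each instance
    models, errors = [], []
    for _ in range(t):
        model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
        wrong = model.predict(X) != y
        e = weights[wrong].sum()                  # error e on the weighted dataset
        if e == 0 or e >= 0.5:                    # terminate model generation
            break
        models.append(model)
        errors.append(e)
        weights[~wrong] *= e / (1 - e)            # shrink weights of correctly classified instances
        weights /= weights.sum()                  # normalize weights of all instances
    return models, errors

def adaboost_m1_predict(models, errors, instance):
    """Classification: weighted vote, each model contributing -log(e/(1-e))."""
    class_weight = {}                             # weight of zero to all classes
    for model, e in zip(models, errors):
        label = model.predict([instance])[0]
        class_weight[label] = class_weight.get(label, 0.0) - math.log(e / (1 - e))
    return max(class_weight, key=class_weight.get)
```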

Stacking

• Bagging and boosting combine models of the same type
  – E.g. a set of decision trees
• Stacking is applied to models of different types
  – Because voting may not work when the different models do not perform comparably well
  – Voting is problematic when, say, two out of three classifiers perform poorly
• Stacking uses a metalearner to combine the different base learners (see the sketch below)
  – Base learners: level-0 models
  – Meta learner: level-1 model
  – Predictions of the base learners are fed as inputs to the meta learner
• Base-learner predictions on the training data cannot be used as input to the meta learner
  – Instead, use cross-validation predictions from the base learners
• Because most of the classification work is done by the base learners, the meta learner uses a simple learning scheme
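
With scikit-learn (an assumption; the lecture itself uses Weka) the same idea can be sketched in a few lines: base learners of different types at level 0, a simple logistic regression as the level-1 metalearner, and cross-validated base-learner predictions as its inputs.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),     # level-0 models of different types
                ("knn", KNeighborsClassifier()),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # simple level-1 metalearner
    cv=5,  # level-1 inputs come from 5-fold cross-validated base-learner predictions
)
print(cross_val_score(stack, X, y, cv=3).mean())
```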

Combining Models Using Weka

• Weka offers methods to perform bagging, boosting and stacking over classifiers

• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu

• AdaBoostM1 (one of the boosting methods) on the Iris data classifies only 7 out of 150 instances incorrectly

• You are encouraged to try these methods on your own
