Issues with Data Mining
2
Data Mining involves Generalization
• Data mining (machine learning) learns generalizations of the instances in the training data
  – E.g. a decision tree learnt from the weather data captures generalizations about the prediction of values for the Play attribute
  – This means generalizations predict (or describe) the behaviour of instances beyond the training data
  – This in turn means knowledge is extracted from raw data using data mining
• This knowledge drives the end-user's decision-making process
3
Generalization as Search
• The process of generalization can be viewed as searching a space of all possible patterns or models
  – for a pattern or model that fits the data
• This view provides a standard framework for understanding all data mining techniques
• E.g. decision tree learning involves searching through all possible decision trees
  – Lecture 4 shows two example decision trees that fit the weather data
  – One of them (Example 2) is a better generalization than the other
4
Bias
• Important choices made in a data mining system are
  – representation language: the language chosen to represent the patterns or models
  – search method: the order in which the space is searched
  – model pruning method: the way overfitting to the training data is avoided
• This means each data mining scheme involves
  – Language bias
  – Search bias
  – Overfitting-avoidance bias
5
Language Bias
• Different languages are used for representing patterns and models
  – E.g. rules and decision trees
• A concept fits a subset of the training data
  – That subset can be described as a disjunction of rules
  – E.g. a classifier for the weather data can be represented as a disjunction of rules (see the example after this slide)
• Languages differ in their ability to represent patterns and models
  – This means that when a language with lower representational ability is used, the data mining system may not achieve good performance
• Domain knowledge (external to the training data) helps to cut down the search space
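As an illustration of a disjunction of rules, here is the rule set corresponding to the usual decision tree for the weather data (reconstructed from the standard weather example, not copied from a slide in this lecture). There is one rule per leaf of the tree, and exactly one rule fires for any instance:

  If outlook = overcast                     then play = yes
  If outlook = sunny  and humidity = high   then play = no
  If outlook = sunny  and humidity = normal then play = yes
  If outlook = rainy  and windy = true      then play = no
  If outlook = rainy  and windy = false     then play = yes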
6
Search Bias
• An exhaustive search over the search space is computationally expensive
• Search is speeded up by using heuristics
  – Pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimum patterns or models
  – Using information gain may mislead us into selecting a suboptimal attribute at the root (see the sketch after this slide)
• Complex search strategies are possible
  – Those that pursue several alternatives in parallel
  – Those that allow backtracking
• A high-level search bias
  – General-to-specific: start with a root node and grow the decision tree to fit the specific data
  – Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples
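A small, self-contained Java sketch of the information-gain heuristic mentioned above, using the usual class counts of the weather data (14 instances, 9 yes / 5 no). The class and method names are illustrative only:

public class InfoGain {

    // Entropy (in bits) of a class distribution given as counts
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Whole weather dataset: 9 yes / 5 no
        double before = entropy(9, 5);
        // After splitting on outlook: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
        double after = (5.0 / 14) * entropy(2, 3)
                     + (4.0 / 14) * entropy(4, 0)
                     + (5.0 / 14) * entropy(3, 2);
        System.out.println("Information gain for outlook = " + (before - after)); // about 0.247 bits
    }
}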
7
Overfitting-avoidance bias
• We want to search for the 'best' patterns and models
• Other things being equal, simpler models are preferred
• Two strategies (see the J48 sketch after this slide)
  – Start with the simplest model and stop building the model when it starts to become complex
  – Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
  – Each data mining scheme involves its own configuration of biases
  – These biases may serve some problems well
• There is no universal best learning scheme!
  – We saw this in our practicals with Weka
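A minimal Weka sketch of how the two strategies above appear as options of the J48 decision tree learner; the class name and the parameter values are illustrative, not recommendations:

import weka.classifiers.trees.J48;

public class OverfittingAvoidance {
    public static void main(String[] args) {
        // Strategy 1: keep the model simple while it is being built
        // (require several instances per leaf, so tree growth stops earlier)
        J48 growSimple = new J48();
        growSimple.setMinNumObj(5);

        // Strategy 2: grow a complex tree first, then prune it back
        J48 pruneBack = new J48();
        pruneBack.setUnpruned(false);          // keep post-pruning switched on
        pruneBack.setConfidenceFactor(0.25f);  // smaller values prune more aggressively
    }
}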
8
Combining Multiple Models
• Because there is no ideal data mining scheme, it is useful to combine multiple models
  – The idea of democracy: decisions are made based on collective wisdom
  – Each data mining scheme acts like an expert, using its knowledge to make decisions
• Three general approaches
  – Bagging
  – Boosting
  – Stacking
• Bagging and boosting follow the same basic approach
  – Take a vote on the class prediction from all the different models
  – Bagging uses a simple average of votes, while boosting uses a weighted average
  – Boosting gives more weight to more knowledgeable experts
• Boosting is generally considered the most effective
9
Bias-Variance Decomposition
• Assume
  – infinitely many training datasets of the same size, n
  – an infinite number of classifiers, each trained on one of the above datasets
• For any learning scheme
  – Bias = the expected error of the classifier that remains even after increasing the training data indefinitely
  – Variance = the expected error due to the particular training set used
• Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component
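For intuition, the decomposition can be written out explicitly for squared-error loss; this is the standard regression formulation (not taken from the slides), of which the classification statement above is an informal analogue. Here D ranges over training sets of size n, \hat{f}_D is the classifier learned from D, and \bar{y}(x) is the true expected target value at x:

$$
\mathbb{E}_{D}\big[(\hat{f}_D(x) - \bar{y}(x))^2\big]
= \underbrace{\big(\mathbb{E}_{D}[\hat{f}_D(x)] - \bar{y}(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_{D}\big[(\hat{f}_D(x) - \mathbb{E}_{D}[\hat{f}_D(x)])^2\big]}_{\text{variance}}
$$

Averaging many classifiers trained on different datasets leaves the bias term unchanged but shrinks the variance term, which is why combining models helps.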
10
Bagging
• Bagging stands for bootstrap aggregating
  – It combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
  – Instability: a small change in the training data results in a big change in the model
• Idealized version for a classifier
  – Collect several independent training sets
  – Build a classifier from each training set
    • E.g. learn a decision tree from each training set
  – The class of a test instance is the prediction that receives the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets
  – Instead, bootstrap samples (sampling with replacement) of a single training set are used, as in the algorithm on the next slide
11
Bagging Algorithm
• Involves two stages
• Model generation
  – Let n be the number of instances in the training data
  – For each of t iterations:
    • Sample n instances with replacement from the training data
    • Apply the learning algorithm to the sample
    • Store the resulting model
• Classification
  – For each of the t models:
    • Predict the class of the instance using the model
  – Return the class that has been predicted most often
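A minimal Java sketch of the two stages, using Weka's Instances/Instance classes and J48 as the base learner. SimpleBagger and its methods are hypothetical names for illustration (Weka itself provides weka.classifiers.meta.Bagging), and the training data is assumed to already have its class index set:

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import java.util.Random;

public class SimpleBagger {
    private final Classifier[] models;

    // Model generation: build t models, each from a bootstrap sample of the training data
    public SimpleBagger(Instances train, int t, long seed) throws Exception {
        models = new Classifier[t];
        Random rng = new Random(seed);
        for (int i = 0; i < t; i++) {
            Instances sample = train.resample(rng);  // n instances sampled with replacement
            Classifier model = new J48();            // any unstable learner works well here
            model.buildClassifier(sample);
            models[i] = model;
        }
    }

    // Classification: return the class index predicted most often by the t models
    public double classifyInstance(Instance x, int numClasses) throws Exception {
        int[] votes = new int[numClasses];
        for (Classifier model : models) {
            votes[(int) model.classifyInstance(x)]++;
        }
        int best = 0;
        for (int k = 1; k < numClasses; k++) {
            if (votes[k] > votes[best]) best = k;
        }
        return best;
    }
}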
12
Boosting
• Multiple data mining methods might complement each other
  – Each method performing well on a subset of the data
• Boosting combines complementary models
  – Using weighted voting
• Boosting is iterative
  – Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
  – AdaBoost.M1: based on the idea of giving weights to instances
• Boosting involves two stages
  – Model generation
  – Classification
13
Boosting
• Model generation
  – Assign equal weight to each training instance
  – For each of t iterations:
    • Apply the learning algorithm to the weighted dataset and store the resulting model
    • Compute the error e of the model on the weighted dataset and store the error
    • If e = 0 or e >= 0.5:
      – Terminate model generation
    • For each instance in the dataset:
      – If the instance is classified correctly by the model, multiply its weight by e/(1-e)
    • Normalize the weights of all instances
• Classification
  – Assign a weight of zero to all classes
  – For each of the t (or fewer) models:
    • Add -log(e/(1-e)) to the weight of the class predicted by the model
  – Return the class with the highest weight
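A Java sketch of the two stages, following the pseudocode above. SimpleAdaBoostM1 is a hypothetical illustration (Weka ships its own weka.classifiers.meta.AdaBoostM1); it assumes the class index of the training data is already set and that the base learner can handle instance weights:

import weka.classifiers.Classifier;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instance;
import weka.core.Instances;
import java.util.ArrayList;
import java.util.List;

public class SimpleAdaBoostM1 {
    private final List<Classifier> models = new ArrayList<>();
    private final List<Double> voteWeights = new ArrayList<>();

    // Model generation: t rounds of reweighting, as in the pseudocode above
    public void build(Instances train, int t) throws Exception {
        Instances data = new Instances(train);               // work on a copy
        int n = data.numInstances();
        for (int i = 0; i < n; i++) data.instance(i).setWeight(1.0);  // equal weights

        for (int round = 0; round < t; round++) {
            Classifier model = new DecisionStump();          // weight-aware base learner
            model.buildClassifier(data);

            // Weighted error e of the model on the weighted dataset
            double wrong = 0, total = 0;
            for (int i = 0; i < n; i++) {
                Instance x = data.instance(i);
                total += x.weight();
                if (model.classifyInstance(x) != x.classValue()) wrong += x.weight();
            }
            double e = wrong / total;
            if (e == 0 || e >= 0.5) break;                   // terminate model generation

            models.add(model);
            voteWeights.add(-Math.log(e / (1 - e)));         // used at classification time

            // Multiply the weight of correctly classified instances by e/(1-e), then normalize
            double newTotal = 0;
            for (int i = 0; i < n; i++) {
                Instance x = data.instance(i);
                if (model.classifyInstance(x) == x.classValue())
                    x.setWeight(x.weight() * e / (1 - e));
                newTotal += x.weight();
            }
            for (int i = 0; i < n; i++)
                data.instance(i).setWeight(data.instance(i).weight() * total / newTotal);
        }
    }

    // Classification: weighted vote over the stored models
    public double classifyInstance(Instance x, int numClasses) throws Exception {
        double[] classWeight = new double[numClasses];
        for (int m = 0; m < models.size(); m++)
            classWeight[(int) models.get(m).classifyInstance(x)] += voteWeights.get(m);
        int best = 0;
        for (int k = 1; k < numClasses; k++)
            if (classWeight[k] > classWeight[best]) best = k;
        return best;
    }
}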
14
Stacking
• Bagging and boosting combine models of the same type
  – E.g. a set of decision trees
• Stacking is applied to models of different types
  – Voting may not work well when the different models do not perform comparably well
  – E.g. voting is problematic when two out of three classifiers perform poorly
• Stacking uses a metalearner to combine the different base learners
  – Base learners: level-0 models
  – Metalearner: level-1 model
  – Predictions of the base learners are fed as inputs to the metalearner
• Base learner predictions on the training data cannot be used as inputs to the metalearner
  – They would be too optimistic, because each base learner has already seen that data
  – Instead, use the base learners' cross-validation predictions
• Because classification is done by the base learners, the metalearner uses a simple learning scheme
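A minimal sketch of stacking using Weka's own meta classifier, combining three base learners of different types under a simple logistic regression metalearner. The file path and parameter choices are assumptions for illustration:

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class StackingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");  // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacker = new Stacking();
        // Level-0 (base) learners of different types: a tree, a Bayesian model and a nearest-neighbour model
        stacker.setClassifiers(new Classifier[] { new J48(), new NaiveBayes(), new IBk() });
        // Level-1 metalearner: a simple scheme such as logistic regression
        stacker.setMetaClassifier(new Logistic());
        // Level-1 training data comes from cross-validation predictions of the base learners
        stacker.setNumFolds(10);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacker, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}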
15
Combining models using Weka
• Weka offers methods to perform bagging, boosting and stacking over classifiers
• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical classifier menu
• AdaBoostM1 (one of the boosting methods) misclassifies only 7 of the 150 instances in the Iris data
• You are encouraged to try these methods on your own
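The same experiment can also be run from Java code rather than the Explorer; a minimal sketch follows. The file path, random seed and base learner are assumptions, and the exact number of misclassified instances depends on those settings, so it may differ from the figure quoted above:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class BoostIris {
    public static void main(String[] args) throws Exception {
        // iris.arff ships with Weka in its data/ directory; adjust the path as needed
        Instances data = DataSource.read("data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());   // boost decision trees instead of the default stumps
        booster.setNumIterations(10);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(booster, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());  // reports the number of incorrectly classified instances
    }
}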