Issues with Data Mining
Data Mining involves Generalization
• Data mining (machine learning) learns generalizations of the instances in the training data
  – E.g. a decision tree learnt from the weather data captures generalizations about the prediction of values for the Play attribute
  – This means generalizations predict (or describe) the behaviour of instances beyond the training data
  – This in turn means knowledge is extracted from raw data using data mining
• This knowledge drives the end-user's decision-making process
Generalization as Search

• The process of generalization can be viewed as searching a space of all possible patterns or models
  – For a pattern or model that fits the data
• This view provides a standard framework for understanding all data mining techniques
• E.g. decision tree learning involves searching through all possible decision trees
  – Lecture 4 shows two example decision trees that fit the weather data
  – One of them (Example 2) is a better generalization than the other
Bias
• Important choices made in a data mining system are
  – Representation language – the language chosen to represent the patterns or models
  – Search method – the order in which the space is searched
  – Model pruning method – the way overfitting to the training data is avoided
• This means each data mining scheme involves
  – Language bias
  – Search bias
  – Overfitting-avoidance bias
Language Bias
• Different languages are used for representing patterns and models
  – E.g. rules and decision trees
• A concept fits a subset of the training data
  – That subset can be described as a disjunction of rules
  – E.g. a classifier for the weather data can be represented as a disjunction of rules
• Languages differ in their ability to represent patterns and models
  – This means that when a language with lower representational ability is used, the data mining system may not achieve good performance
• Domain knowledge (external to the training data) helps to cut down the search space
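The idea of a classifier expressed as a disjunction of rules can be sketched in a few lines of Python. The rule set below is an illustrative guess in the spirit of the weather data (attribute names and rules assumed), not the exact classifier from the lecture:

```python
# A classifier represented as a disjunction of rules: an instance is
# classified "yes" if ANY rule matches (rules are OR-ed together).

def classify(instance, rules, default="no"):
    """Return the label of the first matching rule; no match -> default."""
    for conditions, label in rules:
        if all(instance.get(attr) == val for attr, val in conditions.items()):
            return label
    return default

# Hypothetical rules for predicting the Play attribute:
play_rules = [
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
]

print(classify({"outlook": "overcast", "windy": "true"}, play_rules))  # yes
print(classify({"outlook": "sunny", "humidity": "high"}, play_rules))  # no
```

A decision tree can always be flattened into such a disjunction (one rule per leaf), which is why the two representations in the slide are interchangeable for this data.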
Search Bias

• An exhaustive search over the search space is computationally expensive
• Search is sped up by using heuristics
  – E.g. pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimum patterns or models
  – Using information gain may mislead us into selecting a suboptimal attribute at the root
• More complex search strategies are possible
  – Those that pursue several alternatives in parallel
  – Those that allow backtracking
• A high-level search bias
  – General-to-specific: start with a root node and grow the decision tree to fit the data
  – Specific-to-general: choose specific examples in each class and then generalize each class by including k-nearest-neighbour examples
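The information-gain heuristic mentioned above can be made concrete with a short stdlib-only sketch; the toy dataset is invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(instances, attr, labels):
    """Information gain of splitting on attr: the greedy heuristic used to
    pick a decision-tree node. It is only a heuristic, so the highest-gain
    attribute is not guaranteed to yield the best overall tree."""
    n = len(labels)
    partitions = {}
    for inst, label in zip(instances, labels):
        partitions.setdefault(inst[attr], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy data: attribute "a" separates the classes perfectly, "b" not at all.
data = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
labels = ["yes", "yes", "no", "no"]
print(info_gain(data, "a", labels))  # 1.0
print(info_gain(data, "b", labels))  # 0.0
```

Greedily picking the attribute with the highest gain at each node is exactly the kind of heuristic search bias the slide describes: fast, but with no optimality guarantee.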
Overfitting-Avoidance Bias

• We want to search for the 'best' patterns and models
• Other things being equal, simpler models are preferred
• Two strategies
  – Start with the simplest model and stop building when the model starts to become complex
  – Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
  – Each data mining scheme involves its own configuration of biases
  – These biases may serve some problems well
• There is no universally best learning scheme!
  – We saw this in our practicals with Weka
Combining Multiple Models

• Because there is no ideal data mining scheme, it is useful to combine multiple models
  – The idea of democracy – decisions are made based on collective wisdom
  – Each data mining scheme acts like an expert, using its knowledge to make decisions
• Three general approaches
  – Bagging
  – Boosting
  – Stacking
• Bagging and boosting follow the same broad approach
  – Take a vote on the class predictions from all the different models
  – Bagging uses a simple average of votes, while boosting uses a weighted average
  – Boosting gives more weight to the more knowledgeable experts
• Boosting is generally considered the most effective
Bias-Variance Decomposition
• Assume
  – Infinitely many training sets of the same size, n
  – Infinitely many classifiers, one trained on each of the above training sets
• For any learning scheme
  – Bias = the expected error of the classifier that remains even after increasing the training data infinitely
  – Variance = the expected error due to the particular training set used
• Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component
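The decomposition can be illustrated numerically. The sketch below uses a regression analogue (estimating a known mean from noisy samples), where squared error splits cleanly into bias² and variance; the setup and all numbers are invented for illustration:

```python
import random

TRUE_VALUE = 5.0

def train_model(rng, n=10):
    """A 'model' trained on one sampled dataset: it predicts the mean of
    a noisy size-n training sample drawn around TRUE_VALUE."""
    sample = [TRUE_VALUE + rng.gauss(0, 1) for _ in range(n)]
    return sum(sample) / n

def bias_and_variance(predict, trials=2000, seed=0):
    """Approximate the decomposition over many simulated training sets."""
    rng = random.Random(seed)
    preds = [predict(rng) for _ in range(trials)]
    mean = sum(preds) / trials
    bias_sq = (mean - TRUE_VALUE) ** 2                       # error left even with many datasets
    variance = sum((p - mean) ** 2 for p in preds) / trials  # error from the particular dataset
    return bias_sq, variance

single = bias_and_variance(train_model)
# An ensemble that averages 10 independently trained models:
ensemble = bias_and_variance(lambda rng: sum(train_model(rng) for _ in range(10)) / 10)
# Averaging leaves the bias essentially unchanged but shrinks the
# variance component, which is the effect the slide attributes to
# combining multiple classifiers.
```

(For 0-1 classification loss the decomposition is more subtle than this additive regression version, but the variance-reduction intuition carries over.)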
Bagging
• Bagging stands for bootstrap aggregating
  – It combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
  – Instability – a small change in the training data results in a big change in the model
• Idealized version for classification
  – Collect several independent training sets
  – Build a classifier from each training set
    • E.g. learn a decision tree from each training set
  – The class of a test instance is the prediction that received the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets
Bagging Algorithm
• Involves two stages
• Model generation
  – Let n be the number of instances in the training data
  – For each of t iterations:
    • Sample n instances with replacement from the training data
    • Apply the learning algorithm to the sample
    • Store the resulting model
• Classification
  – For each of the t models:
    • Predict the class of the instance using the model
  – Return the class that has been predicted most often
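The two stages above translate almost line-for-line into Python. The 1-nearest-neighbour base learner and the toy 1-D dataset are assumptions for illustration; any (ideally unstable) learner could be plugged in:

```python
import random
from collections import Counter

def bag(train, learner, t=11, seed=0):
    """Model generation: t bootstrap samples of size n, one model each."""
    rng = random.Random(seed)
    n = len(train)
    models = []
    for _ in range(t):
        # sample n instances with replacement from the training data
        sample = [train[rng.randrange(n)] for _ in range(n)]
        models.append(learner(sample))
    return models

def bagged_predict(models, x):
    """Classification: return the class predicted most often."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def nn_learner(sample):
    """Toy base learner: 1-nearest neighbour on a 1-D feature."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

train = [(0.0, "a"), (1.0, "a"), (2.0, "a"),
         (8.0, "b"), (9.0, "b"), (10.0, "b")]
models = bag(train, nn_learner)
print(bagged_predict(models, 1.5))  # a
print(bagged_predict(models, 8.5))  # b
```

Sampling with replacement is the practical stand-in for the "several independent training sets" of the idealized version: each bootstrap sample differs slightly, so an unstable learner produces usefully different models.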
Boosting

• Multiple data mining methods might complement each other
  – Each method performs well on a subset of the data
• Boosting combines complementary models
  – Using weighted voting
• Boosting is iterative
  – Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
  – AdaBoost.M1 – based on the idea of giving weights to instances
• Boosting involves two stages
  – Model generation
  – Classification
Boosting (AdaBoost.M1)

• Model generation
  – Assign equal weight to each training instance
  – For each of t iterations:
    • Apply the learning algorithm to the weighted dataset and store the resulting model
    • Compute the error e of the model on the weighted dataset and store the error
    • If e = 0 or e >= 0.5:
      – Terminate model generation
    • For each instance in the dataset:
      – If the instance is classified correctly by the model, multiply its weight by e/(1-e)
    • Normalize the weights of all instances
• Classification
  – Assign a weight of zero to all classes
  – For each of the t (or fewer) models:
    • Add -log(e/(1-e)) to the weight of the class predicted by the model
  – Return the class with the highest weight
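The pseudocode above maps directly onto Python. The weighted decision-stump learner, the toy data, and the handling of a perfect model (e = 0, where -log(e/(1-e)) would be infinite) are my assumptions; the slide's pseudocode simply terminates in that case:

```python
import math

def adaboost_m1(train, learner, t=10):
    """Model generation, following the slide's pseudocode."""
    n = len(train)
    weights = [1.0 / n] * n
    models = []
    for _ in range(t):
        model = learner(train, weights)
        # weighted error e of the model on the weighted dataset
        e = sum(w for (x, y), w in zip(train, weights) if model(x) != y)
        if e == 0 or e >= 0.5:
            if e == 0:
                models.append((model, e))  # assumption: keep a perfect model
            break
        models.append((model, e))
        for i, (x, y) in enumerate(train):
            if model(x) == y:
                weights[i] *= e / (1 - e)          # down-weight correct instances
        total = sum(weights)
        weights = [w / total for w in weights]     # normalize
    return models

def adaboost_predict(models, x, classes):
    """Classification: weighted vote with weight -log(e/(1-e))."""
    score = dict.fromkeys(classes, 0.0)
    for model, e in models:
        vote = -math.log(e / (1 - e)) if e > 0 else 10.0  # finite stand-in for e = 0
        score[model(x)] += vote
    return max(score, key=score.get)

def stump_learner(train, weights):
    """Toy weighted learner: best single threshold on a 1-D feature."""
    best = None
    for thr in sorted({x for x, _ in train}):
        for left, right in (("a", "b"), ("b", "a")):
            err = sum(w for (x, y), w in zip(train, weights)
                      if (left if x < thr else right) != y)
            if best is None or err < best[0]:
                best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x < thr else right

train = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "b")]
models = adaboost_m1(train, stump_learner)
print(adaboost_predict(models, 0.5, ("a", "b")))  # a
```

Note how both halves of the weight scheme fit together: correctly classified instances are shrunk by e/(1-e) so the next model concentrates on the mistakes, and models with lower e get a larger vote -log(e/(1-e)) at classification time.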
Stacking

• Bagging and boosting combine models of the same type
  – E.g. a set of decision trees
• Stacking is applied to models of different types
  – Voting may not work when the different models do not perform comparably well
  – E.g. voting is problematic when two out of three classifiers perform poorly
• Stacking uses a metalearner to combine the different base learners
  – Base learners: level-0 models
  – Metalearner: level-1 model
  – Predictions of the base learners are fed as inputs to the metalearner
• Base learner predictions on the training data cannot be used as input to the metalearner
  – Instead, use the base learners' cross-validation predictions
• Because the hard work of classification is done by the base learners, the metalearner can use a simple learning scheme
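The level-0/level-1 arrangement can be sketched as follows. The two toy base learners, the fold scheme, and the deliberately simple table-lookup metalearner are all illustrative assumptions:

```python
import random
from collections import Counter

def cv_predictions(train, learners, k=3, seed=0):
    """Level-0 stage: for every training instance, record each base
    learner's prediction from a model that did NOT see that instance
    (k-fold cross-validation), paired with the instance's true class."""
    rng = random.Random(seed)
    idx = list(range(len(train)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    meta_rows = [None] * len(train)
    for fold in folds:
        held_out = set(fold)
        subtrain = [train[i] for i in range(len(train)) if i not in held_out]
        models = [learn(subtrain) for learn in learners]
        for i in fold:
            x, y = train[i]
            meta_rows[i] = (tuple(m(x) for m in models), y)
    return meta_rows

def train_meta(meta_rows):
    """Level-1 stage: a deliberately simple metalearner that maps each
    combination of base predictions to the majority true class seen with
    that combination, falling back to a vote for unseen combinations."""
    by_preds = {}
    for preds, y in meta_rows:
        by_preds.setdefault(preds, []).append(y)
    table = {p: Counter(ys).most_common(1)[0][0] for p, ys in by_preds.items()}
    return lambda preds: table.get(preds, Counter(preds).most_common(1)[0][0])

def nn_learner(sample):
    """Toy base learner: 1-nearest neighbour on a 1-D feature."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

def majority_learner(sample):
    """Toy base learner of a different type: always predict the majority class."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

train = [(0.0, "a"), (1.0, "a"), (2.0, "a"),
         (8.0, "b"), (9.0, "b"), (10.0, "b")]
meta = train_meta(cv_predictions(train, [nn_learner, majority_learner]))
```

Using held-out (cross-validation) predictions as the metalearner's training input is the key step: predictions made on the data a base learner was trained on would look deceptively accurate and mislead the level-1 model.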
Combining models using Weka
• Weka offers methods to perform bagging, boosting, and stacking over classifiers
• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu
• AdaBoostM1 (one of the boosting methods) classifies only 7 out of 150 instances of the Iris data incorrectly
• You are encouraged to try these methods on your own