Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 20
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Logistics
• Homework 4 is due on Friday!
• Project proposal feedback (I have left initial feedback; more feedback from Max).
Learning objectives
• Understand the general idea behind ensembling
• Understand bagging
• Learn about AdaBoost
Overview
Ensemble methods
Bagging and random forest
General idea of boosting
Ensemble methods
Outline of CSCI 4622
We've already covered the topics shown in blue (on the original slides)!
• Problem formulations: classification, regression
• Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, neural networks, support vector machines, kernel methods
• Unsupervised techniques: clustering, linear dimensionality reduction
• “Meta-techniques”: ensembles, expectation-maximization
• Understanding ML: limits of learning, practical issues, bias & fairness
• Recurring themes: (stochastic) gradient descent
Ensemble methods
We have learned:
• Decision trees
• Perceptron
• KNN
• Naïve Bayes
• Logistic regression
• Neural networks
• Support vector machines
A meta question: why do we only use a single model?
In practice, especially to win competitions, many models are used together.
There are many techniques for using multiple models together.
• Bagging
◦ Train classifiers on subsets of the data
◦ Predict based on majority vote
• Boosting
◦ Build a sequence of dumb models
◦ Modify the training data along the way to focus on difficult examples
◦ Predict based on a weighted majority vote of all the models
• Stacking
◦ Take multiple classifiers’ outputs as inputs and train another classifier to make the final prediction
Bagging and random forest
Recap of decision trees
Algorithm: DTREETRAIN
Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return LEAF(y)
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1)
    end
    let φ∗ be the feature with the smallest number of mistakes
    return NODE(φ∗, {0 → DTREETRAIN(D0, Φ \ {φ∗}), 1 → DTREETRAIN(D1, Φ \ {φ∗})})
end
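As a concrete reference, here is a minimal Python sketch of the recursion above for binary features; the function names, data layout, and tie-breaking are my own illustrative choices, not from the slides.

```python
# Hypothetical translation of DTREETRAIN for binary features.
from collections import Counter

def majority(labels):
    # Most common label; used for leaves and for counting non-majority "mistakes".
    return Counter(labels).most_common(1)[0][0]

def dtree_train(data, features):
    # data: list of (x, y) pairs, where x is a dict mapping feature name -> 0/1.
    labels = [y for _, y in data]
    guess = majority(labels)
    if len(set(labels)) == 1 or not features:        # pure node, or no features left
        return ("leaf", guess)

    def mistakes(phi):
        d0 = [y for x, y in data if x[phi] == 0]
        d1 = [y for x, y in data if x[phi] == 1]
        m0 = sum(y != majority(d0) for y in d0) if d0 else 0
        m1 = sum(y != majority(d1) for y in d1) if d1 else 0
        return m0 + m1

    best = min(features, key=mistakes)               # phi* with the fewest mistakes
    d0 = [(x, y) for x, y in data if x[best] == 0]
    d1 = [(x, y) for x, y in data if x[best] == 1]
    if not d0 or not d1:                             # the split separates nothing; stop here
        return ("leaf", guess)
    rest = set(features) - {best}
    return ("node", best, dtree_train(d0, rest), dtree_train(d1, rest))
```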
Fully grown decision tree classifiers tend to overfit.
We could alleviate this a bit with pruning (see the sketch after this list):
• Prepruning
◦ Don't let the tree have too many levels
◦ Stopping conditions
• Postpruning
◦ Build the complete tree
◦ Go back and remove vertices that don't affect performance too much
Bagging
We will see how we can use many decision trees. General idea:
• Train numerous slightly different classifiers
• Use each classifier to make a prediction on a test instance
• Predict by taking majority vote from individual classifiers
Today we will look at two different types of ensembled decision tree classifiers. The simplest is called bootstrap aggregation (or bagging).
Idea: what would have happened if those training examples that we clearly overfit to weren't there?
• Sample n training examples with replacement
• Fit a decision tree classifier to each sample
• Repeat many times
• Aggregate results by majority vote (a from-scratch sketch follows)
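A minimal sketch of these steps (my illustration, assuming NumPy arrays and labels in {-1, +1}; names like bagged_trees are hypothetical):

```python
# Bagging sketch: bootstrap resamples, one tree per resample, majority-vote prediction.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=101, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample n examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    votes = np.sum([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)                            # majority vote = sign of the summed +/-1 votes
```

Using an odd number of trees avoids exact ties; scikit-learn's BaggingClassifier packages the same idea.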
How many unique instances show up in n samples with replacement?
The probability that a particular instance is never chosen in n draws with replacement approaches
$$\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n = \frac{1}{e} \approx 0.37,$$
so roughly 63% of the original training instances appear in each bootstrap sample.
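A quick numerical check of this (my snippet, not from the slides):

```python
# Fraction of instances never drawn in a size-n bootstrap sample; should be close to 1/e.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)        # n draws with replacement from {0, ..., n-1}
never_chosen = n - len(np.unique(sample))
print(never_chosen / n)                    # roughly 0.368, i.e. about 1/e
```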
In general, assume that $y \in \{-1, 1\}$ and that the individual decision tree classifiers are $h_1, \dots, h_K$. Then we can define the bagged classifier as
$$H(x) = \operatorname{sign}\left[\sum_{k=1}^{K} h_k(x)\right]$$
Bagging Pros and Cons
Pros:
• Results in a much lower variance classifier than a single decision tree
Cons:
• Results in a much less interpretable model
We essentially give up some interpretability in favor of better prediction accuracy. But even with bagging we can still get some insight into what our model is doing.
Bagging and feature importance
Although we cannot follow the tree structure any more, we can estimate the average importance of individual features across the trees (a sketch follows below). Feature importance in a single tree can be defined as the information gain achieved by splitting on that feature.
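As a sketch (mine, not the slides'), the per-tree importances can simply be averaged over the ensemble; note that scikit-learn's feature_importances_ reports impurity decrease (Gini by default), a close relative of the information-gain notion above. This reuses the hypothetical bagged_trees helper from the earlier bagging sketch.

```python
# Average impurity-based feature importance across a bagged ensemble.
import numpy as np

trees = bagged_trees(X, y, n_trees=101)                        # X, y: training data as before
avg_importance = np.mean([t.feature_importances_ for t in trees], axis=0)
for j in np.argsort(avg_importance)[::-1][:5]:                 # five most important features
    print(f"feature {j}: {avg_importance[j]:.3f}")
```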
An even better approach: random forests
It turns out we can take the bagging idea and make it even better. Suppose you have a particular feature that is a very strong predictor of the response, so strong that in almost all bagged decision trees it is the first feature that gets split on. When this happens, the trees can all look very similar: they are too correlated. Maybe always splitting on the same feature isn't the best idea. Maybe, as with bootstrapping training examples, we should also split on a random subset of features.
Random forests
• We are still doing bagging, in the sense that we fit trees to bootstrapped resamples of the training data
• But every time we have to choose a feature to split on, we don't consider all p features
• Instead, we consider only $\sqrt{p}$ features to split on (see the sketch after this list)
Advantages:
• Since each split considers fewer than half of the features, this gives very different trees
• Very different trees in the ensemble decrease correlation and give more robust results
• It is also slightly cheaper, because you don't have to evaluate every feature to choose the best split
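In practice this is a one-liner in scikit-learn; a sketch (my example, reusing the synthetic train/test split from the pruning sketch above):

```python
# Random forest: bagged trees plus per-split feature subsampling (max_features="sqrt").
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500,     # number of bootstrapped trees
                                max_features="sqrt",  # consider only sqrt(p) features at each split
                                random_state=0).fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```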
Example: look at the expression of 500 different genes in tissue samples and predict which of 15 cancer states each sample is in.
General idea of boosting
Boosting intuition
Boosting is an ensemble method, but with a different twist.
Idea:
• Build a sequence of dumb models
• Modify the training data along the way to focus on difficult-to-classify examples
• Predict based on a weighted majority vote of all the models
Challenges:
• What do we mean by dumb?
• How do we promote difficult examples?
• Which models get more say in the vote?
What do we mean by dumb? Each model in our sequence will be a weak learner: one whose training error is only slightly better than chance,
$$\mathrm{err} = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\left(y_i \neq h(x_i)\right) = \frac{1}{2} - \gamma, \qquad \gamma > 0$$
The most common weak learner in boosting is a decision stump: a decision tree with a single split (a sketch follows below).
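Here is a minimal weighted decision stump (my sketch; scikit-learn's DecisionTreeClassifier(max_depth=1) fit with sample_weight is an off-the-shelf alternative). It assumes labels in {-1, +1} and normalized example weights w.

```python
# Decision stump: one threshold on one feature, chosen to minimize weighted training error.
import numpy as np

def fit_stump(X, y, w):
    best, best_err = (0, 0.0, +1), np.inf
    for j in range(X.shape[1]):                        # try every feature...
        for t in np.unique(X[:, j]):                   # ...and every observed threshold
            for polarity in (+1, -1):
                pred = polarity * np.where(X[:, j] <= t, -1, +1)
                err = np.sum(w * (pred != y))          # weighted 0/1 error
                if err < best_err:
                    best, best_err = (j, t, polarity), err
    return best

def stump_predict(stump, X):
    j, t, polarity = stump
    return polarity * np.where(X[:, j] <= t, -1, +1)
```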
How do we promote difficult examples? After each iteration, we increase the importance of training examples that we got wrong on the previous iteration and decrease the importance of examples that we got right. Each example carries a weight $w_i$ that plays into both the decision stump fit and the error estimate. The weights are normalized so they act like a probability distribution:
$$\sum_{i=1}^{m} w_i = 1$$
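One concrete instance of this reweighting is the AdaBoost update (a sketch; the model weight alpha is defined on the next slide):

```python
# Multiply weights up for mistakes, down for correct predictions, then renormalize.
import numpy as np

def update_weights(w, alpha, y, pred):
    # y * pred is -1 on mistakes (weight grows by e^alpha) and +1 when correct (shrinks by e^-alpha).
    w = w * np.exp(-alpha * y * pred)
    return w / w.sum()               # keep the weights summing to 1, like a distribution
```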
Which models get more say in the vote? The models that performed better on the training data get more say. For our sequence of weak learners $h_1(x), h_2(x), \dots, h_K(x)$, the boosted classifier is defined by
$$H(x) = \operatorname{sign}\left[\sum_{k=1}^{K} \alpha_k h_k(x)\right]$$
The weight $\alpha_k$ is a measure of the accuracy of $h_k$ on the training data (see the sketch below for one concrete choice).
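Putting the pieces together, here is a minimal AdaBoost-style loop (my sketch, reusing the hypothetical fit_stump, stump_predict, and update_weights helpers above); $\alpha_k = \frac{1}{2}\ln\frac{1-\mathrm{err}_k}{\mathrm{err}_k}$ is the standard AdaBoost choice, so more accurate stumps get more say.

```python
# AdaBoost sketch: fit a stump on weighted data, compute its say (alpha), reweight, repeat.
import numpy as np

def adaboost(X, y, n_rounds=50):
    m = len(y)
    w = np.full(m, 1.0 / m)                          # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # better-than-chance stumps get alpha > 0
        w = update_weights(w, alpha, y, pred)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X):
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                            # H(x) = sign of the weighted vote
```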