Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 20
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Logistics
• Homework 4 is due on Friday!
• Project proposal feedback (I have left initial feedback; more feedback from Max).
Learning objectives
• Understand the general idea behind ensembling
• Understand bagging
• Learn about AdaBoost
Overview
Ensemble methods
Bagging and random forest
General idea of boosting
Ensemble methods
Outline of CSCI 4622
We've already covered the topics shown in blue (on the original slides)!
• Problem formulations: classification, regression
• Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, neural networks, support vector machines, kernel methods
• Unsupervised techniques: clustering, linear dimensionality reduction
• “Meta-techniques”: ensembles, expectation-maximization
• Understanding ML: limits of learning, practical issues, bias & fairness
• Recurring themes: (stochastic) gradient descent
Ensemble methods
We have learned:
• Decision trees
• Perceptron
• KNN
• Naïve Bayes
• Logistic regression
• Neural networks
• Support vector machines
A meta question: why do we only use a single model?
In practice, especially to win competitions, many models are used together.
There are many techniques for using multiple models together.
• Bagging
◦ Train classifiers on subsets of the data
◦ Predict based on majority vote
• Boosting
◦ Build a sequence of dumb models
◦ Modify the training data along the way to focus on difficult examples
◦ Predict based on a weighted majority vote of all the models
• Stacking
◦ Take multiple classifiers’ outputs as inputs and train another classifier to make the final prediction
Bagging and random forest
Recap of decision trees
Algorithm: DTREETRAIN
Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return LEAF(y)
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1)
    end
    let φ∗ be the feature with the smallest number of mistakes
    return NODE(φ∗, {0 → DTREETRAIN(D0, Φ \ {φ∗}), 1 → DTREETRAIN(D1, Φ \ {φ∗})})
end
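As a concrete reference, here is a minimal Python sketch of the recursion above for binary features; the function names, data layout, and tie-breaking are my own illustrative choices, not from the slides.

```python
# Hypothetical translation of DTREETRAIN for binary features.
from collections import Counter

def majority(labels):
    # Most common label; used for leaves and for counting non-majority "mistakes".
    return Counter(labels).most_common(1)[0][0]

def dtree_train(data, features):
    # data: list of (x, y) pairs, where x is a dict mapping feature name -> 0/1.
    labels = [y for _, y in data]
    guess = majority(labels)
    if len(set(labels)) == 1 or not features:        # pure node, or no features left
        return ("leaf", guess)

    def mistakes(phi):
        d0 = [y for x, y in data if x[phi] == 0]
        d1 = [y for x, y in data if x[phi] == 1]
        m0 = sum(y != majority(d0) for y in d0) if d0 else 0
        m1 = sum(y != majority(d1) for y in d1) if d1 else 0
        return m0 + m1

    best = min(features, key=mistakes)               # phi* with the fewest mistakes
    d0 = [(x, y) for x, y in data if x[best] == 0]
    d1 = [(x, y) for x, y in data if x[best] == 1]
    if not d0 or not d1:                             # the split separates nothing; stop here
        return ("leaf", guess)
    rest = set(features) - {best}
    return ("node", best, dtree_train(d0, rest), dtree_train(d1, rest))
```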
Fully grown decision tree classifiers tend to overfit.
We could alleviate this a bit with pruning (see the sketch after this list):
• Prepruning
◦ Don't let the tree have too many levels
◦ Stopping conditions
• Postpruning
◦ Build the complete tree
◦ Go back and remove vertices that don't affect performance too much
Bagging
We will see how we can use many decision trees. General idea:
• Train numerous slightly different classifiers
• Use each classifier to make a prediction on a test instance
• Predict by taking majority vote from individual classifiers
Today we will look at two different types of ensembled decision tree classifiers. The simplest is called bootstrap aggregation (or bagging).
Idea: what would have happened if those training examples that we clearly overfit to weren't there?
• Sample n training examples with replacement
• Fit a decision tree classifier to each sample
• Repeat many times
• Aggregate results by majority vote (a from-scratch sketch follows)
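A minimal sketch of these steps (my illustration, assuming NumPy arrays and labels in {-1, +1}; names like bagged_trees are hypothetical):

```python
# Bagging sketch: bootstrap resamples, one tree per resample, majority-vote prediction.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=101, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample n examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    votes = np.sum([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)                            # majority vote = sign of the summed +/-1 votes
```

Using an odd number of trees avoids exact ties; scikit-learn's BaggingClassifier packages the same idea.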
How many unique instances show up in n samples with replacement?
The probability that a particular instance is never chosen in n draws with replacement approaches
$$\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n = \frac{1}{e} \approx 0.37,$$
so roughly 63% of the original training instances appear in each bootstrap sample.
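A quick numerical check of this (my snippet, not from the slides):

```python
# Fraction of instances never drawn in a size-n bootstrap sample; should be close to 1/e.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)        # n draws with replacement from {0, ..., n-1}
never_chosen = n - len(np.unique(sample))
print(never_chosen / n)                    # roughly 0.368, i.e. about 1/e
```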
In general, assume that $y \in \{-1, 1\}$ and that the individual decision tree classifiers are $h_1, \dots, h_K$. Then we can define the bagged classifier as
$$H(x) = \operatorname{sign}\left[\sum_{k=1}^{K} h_k(x)\right]$$
Bagging Pros and Cons
Pros:
• Results in a much lower variance classifier than a single decision tree
Cons:
• Results in a much less interpretable model
We essentially give up some interpretability in favor of better prediction accuracy. But even with bagging we can still get some insight into what our model is doing.
Bagging and feature importance
Although we cannot follow the tree structure any more, we can estimate the average importance of individual features across the trees (a sketch follows below). Feature importance in a single tree can be defined as the information gain achieved by splitting on that feature.
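As a sketch (mine, not the slides'), the per-tree importances can simply be averaged over the ensemble; note that scikit-learn's feature_importances_ reports impurity decrease (Gini by default), a close relative of the information-gain notion above. This reuses the hypothetical bagged_trees helper from the earlier bagging sketch.

```python
# Average impurity-based feature importance across a bagged ensemble.
import numpy as np

trees = bagged_trees(X, y, n_trees=101)                        # X, y: training data as before
avg_importance = np.mean([t.feature_importances_ for t in trees], axis=0)
for j in np.argsort(avg_importance)[::-1][:5]:                 # five most important features
    print(f"feature {j}: {avg_importance[j]:.3f}")
```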
An even better approach: random forests
It turns out we can take the bagging idea and make it even better. Suppose you have a particular feature that is a very strong predictor of the response, so strong that in almost all bagged decision trees it is the first feature that gets split on. When this happens, the trees can all look very similar: they are too correlated. Maybe always splitting on the same feature isn't the best idea. Maybe, as with bootstrapping training examples, we should also split on a random subset of features.
Random forests
• We are still doing bagging, in the sense that we fit trees to bootstrapped resamples of the training data
• But every time we have to choose a feature to split on, we don't consider all p features
• Instead, we consider only $\sqrt{p}$ features to split on (see the sketch after this list)
Advantages:
• Since each split considers fewer than half of the features, this gives very different trees
• Very different trees in the ensemble decrease correlation and give more robust results
• It is also slightly cheaper, because you don't have to evaluate every feature to choose the best split
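In practice this is a one-liner in scikit-learn; a sketch (my example, reusing the synthetic train/test split from the pruning sketch above):

```python
# Random forest: bagged trees plus per-split feature subsampling (max_features="sqrt").
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500,     # number of bootstrapped trees
                                max_features="sqrt",  # consider only sqrt(p) features at each split
                                random_state=0).fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```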
Example: look at the expression of 500 different genes in tissue samples and predict which of 15 cancer states each sample is in.
General idea of boosting
Boosting intuition
Boosting is an ensemble method, but with a different twist.
Idea:
• Build a sequence of dumb models
• Modify the training data along the way to focus on difficult-to-classify examples
• Predict based on a weighted majority vote of all the models
Challenges:
• What do we mean by dumb?
• How do we promote difficult examples?
• Which models get more say in the vote?
What do we mean by dumb? Each model in our sequence will be a weak learner: one whose training error is only slightly better than chance,
$$\mathrm{err} = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\left(y_i \neq h(x_i)\right) = \frac{1}{2} - \gamma, \qquad \gamma > 0$$
The most common weak learner in boosting is a decision stump: a decision tree with a single split (a sketch follows below).
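Here is a minimal weighted decision stump (my sketch; scikit-learn's DecisionTreeClassifier(max_depth=1) fit with sample_weight is an off-the-shelf alternative). It assumes labels in {-1, +1} and normalized example weights w.

```python
# Decision stump: one threshold on one feature, chosen to minimize weighted training error.
import numpy as np

def fit_stump(X, y, w):
    best, best_err = (0, 0.0, +1), np.inf
    for j in range(X.shape[1]):                        # try every feature...
        for t in np.unique(X[:, j]):                   # ...and every observed threshold
            for polarity in (+1, -1):
                pred = polarity * np.where(X[:, j] <= t, -1, +1)
                err = np.sum(w * (pred != y))          # weighted 0/1 error
                if err < best_err:
                    best, best_err = (j, t, polarity), err
    return best

def stump_predict(stump, X):
    j, t, polarity = stump
    return polarity * np.where(X[:, j] <= t, -1, +1)
```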
How do we promote difficult examples? After each iteration, we increase the importance of training examples that we got wrong on the previous iteration and decrease the importance of examples that we got right. Each example carries a weight $w_i$ that plays into both the decision stump fit and the error estimate. The weights are normalized so they act like a probability distribution:
$$\sum_{i=1}^{m} w_i = 1$$
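One concrete instance of this reweighting is the AdaBoost update (a sketch; the model weight alpha is defined on the next slide):

```python
# Multiply weights up for mistakes, down for correct predictions, then renormalize.
import numpy as np

def update_weights(w, alpha, y, pred):
    # y * pred is -1 on mistakes (weight grows by e^alpha) and +1 when correct (shrinks by e^-alpha).
    w = w * np.exp(-alpha * y * pred)
    return w / w.sum()               # keep the weights summing to 1, like a distribution
```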
Which models get more say in the vote? The models that performed better on the training data get more say. For our sequence of weak learners $h_1(x), h_2(x), \dots, h_K(x)$, the boosted classifier is defined by
$$H(x) = \operatorname{sign}\left[\sum_{k=1}^{K} \alpha_k h_k(x)\right]$$
The weight $\alpha_k$ is a measure of the accuracy of $h_k$ on the training data (see the sketch below for one concrete choice).
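Putting the pieces together, here is a minimal AdaBoost-style loop (my sketch, reusing the hypothetical fit_stump, stump_predict, and update_weights helpers above); $\alpha_k = \frac{1}{2}\ln\frac{1-\mathrm{err}_k}{\mathrm{err}_k}$ is the standard AdaBoost choice, so more accurate stumps get more say.

```python
# AdaBoost sketch: fit a stump on weighted data, compute its say (alpha), reweight, repeat.
import numpy as np

def adaboost(X, y, n_rounds=50):
    m = len(y)
    w = np.full(m, 1.0 / m)                          # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # better-than-chance stumps get alpha > 0
        w = update_weights(w, alpha, y, pred)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X):
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                            # H(x) = sign of the weighted vote
```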