Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 20

Slides adapted from Jordan Boyd-Graber, Chris Ketelsen


Logistics

• Homework 4 is due on Friday!
• Project proposal feedback (I have left initial feedback, more feedback from Max).


Learning objectives

• Understand the general idea behind ensembling
• Understand bagging
• Learn about AdaBoost


Overview

Ensemble methods

Bagging and random forest

General idea of boosting


Ensemble methods

Outline

Ensemble methods

Bagging and random forest

General idea of boosting


Outline of CSCI 4622
We've already covered stuff in blue!

• Problem formulations: classification, regression
• Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, neural networks, support vector machines, kernel methods
• Unsupervised techniques: clustering, linear dimensionality reduction
• "Meta-techniques": ensembles, expectation-maximization
• Understanding ML: limits of learning, practical issues, bias & fairness
• Recurring themes: (stochastic) gradient descent


Ensemble methods

We have learned
• Decision trees
• Perceptron
• KNN
• Naïve Bayes
• Logistic regression
• Neural networks
• Support vector machines

A meta question: why do we only use a single model?

In practice, especially to win competitions, many models are used together.


Ensemble methods

There are many techniques to use multiple models.
• Bagging
◦ Train classifiers on subsets of data
◦ Predict based on majority vote
• Boosting
◦ Build a sequence of dumb models
◦ Modify training data along the way to focus on difficult examples
◦ Predict based on weighted majority vote of all the models
• Stacking
◦ Take multiple classifiers' outputs as inputs and train another classifier to make the final prediction


Bagging and random forest

Outline

Ensemble methods

Bagging and random forest

General idea of boosting


Recap of decision tree

Algorithm: DTREETRAIN

Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return LEAF(y);
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values;
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
    end
    let φ∗ be the feature with the smallest number of mistakes;
    return NODE(φ∗, {0 → DTREETRAIN(D0, Φ \ {φ∗}), 1 → DTREETRAIN(D1, Φ \ {φ∗})});
end
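For concreteness, here is a minimal Python sketch of this pseudocode for binary features and labels. The data layout (a list of (x, y) pairs with x a dict mapping feature names to 0/1) and all names are illustrative choices, not from the slides.

from collections import Counter

def dtree_train(data, features):
    # data: list of (x, y) pairs where x maps feature name -> {0, 1}
    # features: set of feature names still available to split on
    labels = [y for _, y in data]
    majority = Counter(labels).most_common(1)[0][0]
    # Base case: all labels agree, or no features are left (guess the majority label).
    if len(set(labels)) == 1 or not features:
        return ("leaf", majority)

    def mistakes(phi):
        # Non-majority answers summed over the two partitions induced by phi.
        total = 0
        for v in (0, 1):
            part = [y for x, y in data if x[phi] == v]
            if part:
                total += len(part) - Counter(part).most_common(1)[0][1]
        return total

    best = min(features, key=mistakes)
    d0 = [(x, y) for x, y in data if x[best] == 0]
    d1 = [(x, y) for x, y in data if x[best] == 1]
    rest = features - {best}
    return ("node", best,
            dtree_train(d0, rest) if d0 else ("leaf", majority),
            dtree_train(d1, rest) if d1 else ("leaf", majority))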


Recap of decision tree

Full decision tree classifiers tend to overfit

We could alleviate this a bit with pruning.
• Prepruning
◦ Don't let the tree have too many levels
◦ Stopping conditions
• Postpruning
◦ Build the complete tree
◦ Go back and remove vertices that don't affect performance too much


Bagging

We will see how we can use many decision trees.
General idea:
• Train numerous slightly different classifiers
• Use each classifier to make a prediction on a test instance
• Predict by taking majority vote from individual classifiers


Bagging

Today we will look at two different types of ensembled decision tree classifiers. The simplest is called bootstrapped aggregation (or bagging).

Idea: What would have happened if those training examples that we clearly overfit to weren't there?


Bagging

The simplest is called bootstrapped aggregation (or bagging).

• Sample n training examples with replacement
• Fit a decision tree classifier to each sample
• Repeat many times
• Aggregate results by majority vote
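A minimal Python sketch of this procedure, assuming labels in {−1, +1} so that the majority vote is just the sign of the summed predictions. The use of scikit-learn's DecisionTreeClassifier as the base learner and all names here are illustrative, not part of the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, seed=0):
    # Train n_trees trees, each on n examples drawn with replacement.
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # Majority vote: sign of the sum of individual {-1, +1} predictions.
    return np.sign(np.sum([t.predict(X) for t in trees], axis=0))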


Bagging

How many unique instances show up in n samples with replacement?

The likelihood that an instance is never chosen in n draws is
$$\lim_{n\to\infty}\left(1 - \frac{1}{n}\right)^n = \frac{1}{e}$$
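As a quick numerical check of this limit (an illustrative snippet, not from the slides): draw n indices with replacement and count how many distinct training examples actually appear. Roughly a 1 − 1/e ≈ 37% fraction never shows up, so each bootstrap sample contains about 63% of the unique examples.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)    # n draws with replacement
print(len(np.unique(sample)) / n)      # about 0.632
print(1 - 1 / np.e)                    # 0.6321...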


Bagging

In general, assume that $y \in \{-1, 1\}$ and that each individual decision tree classifier is $h_k$.

Then we can define the bagged classifier as
$$H(x) = \operatorname{sign}\left(\sum_{k=1}^{K} h_k(x)\right)$$


Bagging Pros and Cons

Pros:

• Results in a much lower variance classifier than a single decision tree

Cons:
• Results in a much less interpretable model

We essentially give up some interpretability in favor of better prediction accuracy.
But we can still get insight into what our model is doing using bagging.


Bagging and feature importance

Although we cannot follow the tree structure any more, we can estimate the average feature importance of single features.
Feature importance in a single tree can be defined as the information gain achieved by splitting on this feature.
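As an illustrative sketch of this idea (the dataset and settings below are made up, not from the slides): scikit-learn exposes a per-tree feature_importances_ attribute based on impurity decrease, which is closely related to information gain, and we can average it over the bagged ensemble.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100).fit(X, y)

# Average the per-tree importances over all trees in the ensemble.
avg_importance = np.mean([t.feature_importances_ for t in bag.estimators_], axis=0)
print(avg_importance)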


An even better approach: random forests

It turns out, we can take the bagging idea and make it even better.
Suppose you have a particular feature that is a very strong predictor for the response. So strong that in almost all bagged decision trees, it's the first feature that gets split on.

When this happens, the trees can all look very similar. They're too correlated.
Maybe always splitting on the same feature isn't the best idea.
Maybe, like with bootstrapping training examples, we split on a random subset of features too.


Random forests

• Still doing bagging, in the sense that we fit trees to bootstrapped resamples of the training data
• But every time we have to choose a feature to split on, don't consider all $p$ features
• Instead, consider only $\sqrt{p}$ features to split on (see the sketch below)

Advantages:

• Since each split considers fewer than half of the features, this gives very different trees
• Very different trees in your ensemble decrease correlation and give more robust results
• Also, slightly cheaper, because you don't have to evaluate every feature to choose the best split
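A minimal scikit-learn illustration of this change (synthetic data and hyperparameters are illustrative): max_features="sqrt" restricts each split to a random subset of about $\sqrt{p}$ features, which is the only difference from plain bagging of full trees.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# Each split considers only a random subset of sqrt(p) features.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0).fit(X, y)
print(rf.score(X, y))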


Random forests

Look at expression of 500 different genes in tissue samples. Predict 15 cancer states.


General idea of boosting

Outline

Ensemble methods

Bagging and random forest

General idea of boosting


Boosting intuition

Boosting is an ensemble method, but with a different twist.
Idea:
• Build a sequence of dumb models
• Modify training data along the way to focus on difficult-to-classify examples
• Predict based on a weighted majority vote of all the models

Challenges:
• What do we mean by dumb?
• How do we promote difficult examples?
• Which models get more say in the vote?


Boosting intuition

What do we mean by dumb?
Each model in our sequence will be a weak learner:
$$\mathrm{err} = \frac{1}{m}\sum_{i=1}^{m} I(y_i \neq h(x_i)) = \frac{1}{2} - \gamma, \quad \gamma > 0$$

The most common weak learner in boosting is a decision stump, a decision tree with a single split.
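As a small illustrative sketch (not from the slides), a decision stump can be fit as a depth-1 tree, and its weighted training error computed from per-example weights w; with uniform weights this is exactly the err quantity above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_stump(X, y, w):
    # A decision stump is a decision tree with a single split (depth 1).
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    # Weighted training error; a useful weak learner keeps this below 1/2.
    err = np.sum(w * (stump.predict(X) != y)) / np.sum(w)
    return stump, err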


Boosting intuition

How do we promote difficult examples?
After each iteration, we'll increase the importance of training examples that we got wrong on the previous iteration and decrease the importance of examples that we got right on the previous iteration.
Each example will carry around a weight $w_i$ that will play into the decision stump and the error estimation.
Weights are normalized so they act like a probability distribution:
$$\sum_{i=1}^{m} w_i = 1$$
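A minimal sketch of how these weights are typically updated, using the standard AdaBoost rule (shown here only for concreteness; it assumes labels in {−1, +1} and anticipates the AdaBoost details rather than following the slides).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, K=50):
    X, y = np.asarray(X), np.asarray(y)
    m = len(X)
    w = np.full(m, 1.0 / m)                     # start uniform; weights sum to 1
    stumps, alphas = [], []
    for _ in range(K):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))           # weighted error of this stump
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # more accurate => more say
        w = w * np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct
        w = w / np.sum(w)                       # renormalize so the weights sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas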


Boosting intuition

Which models get more say in the vote?
The models that performed better on training data get more say in the vote.
For our sequence of weak learners $h_1(x), h_2(x), \ldots, h_K(x)$, the boosted classifier is defined by
$$H(x) = \operatorname{sign}\left[\sum_{k=1}^{K} \alpha_k h_k(x)\right]$$
The weight $\alpha_k$ is a measure of the accuracy of $h_k$ on the training data.
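A matching sketch of this weighted majority vote, paired with the illustrative training loop above:

import numpy as np

def adaboost_predict(stumps, alphas, X):
    # H(x) = sign(sum_k alpha_k * h_k(x))
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)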
