Source: sierra.ucsd.edu/cse151a-2020-sp/lectures/14-boosting (2020-06-14)
Lecture 14 – Part 01: Boosting
So Far in CSE 151A
▶ Learn a single (sometimes complex) model:
▶ Logistic regression
▶ SVMs
▶ LDA/QDA
▶ Decision trees
▶ …
Today
▶ Can we combine very simple models and get good results?
▶ Yes: boosting.
Weak Learners
▶ A weak classifier is one which performs only a little better than chance.
▶ A learning algorithm capable of consistently producing weak classifiers is called a weak learner.
▶ Weak learners are usually very simple and fast.
Example
▶ A decision stump is a weak classifier.
▶ Weak learner: the strategy discussed last time for picking the question.
Example
▶ The full decision tree learning algorithm is a strong learner.
The Question
▶ Can we “boost” the quality of a weak learner?
Boosting: The Idea
▶ Train a weak classifier, 𝐻1 ∶ X → [−1, 1].
▶ Increase the weight (importance) of misclassified points, then train another classifier 𝐻2.
▶ Repeat, creating more classifiers and updating the weights.
▶ Final classifier: a linear combination of 𝐻1, … , 𝐻𝑘.
The Details
▶ Q1: How do we measure the performance of a classifier on a weighted data set?
▶ Q2: How do we update the point weights?
▶ Q3: How do we combine the classifiers?
AdaBoost
▶ Yoav Freund (UCSD) and Robert Schapire.
▶ A theoretically sound answer to these questions.
Q1: Measuring Performance
▶ Suppose the weights at step 𝑡 are in ω⃗(𝑡).
▶ Assume they are normalized so that the weights sum to one.
▶ We use the weights to learn a classifier 𝐻𝑡 ∶ X → [−1, 1].
▶ The “margin”:

𝑟𝑡 = ∑𝑖=1..𝑛 𝜔𝑖(𝑡) 𝑦𝑖 𝐻𝑡( ⃗𝑥(𝑖)) ∈ [−1, 1]
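As a quick sketch of the margin as a weighted dot product (the numbers here are illustrative, not from the lecture):

```python
import numpy as np

w = np.array([0.25, 0.25, 0.25, 0.25])   # normalized weights, sum to 1
y = np.array([1, 1, -1, -1])             # true labels in {-1, 1}
H = np.array([1, -1, -1, -1])            # classifier outputs (2nd point wrong)

r = np.sum(w * y * H)   # weighted margin, always in [-1, 1]
print(r)                # 0.5: three points right, one wrong
```

A perfect classifier has 𝑟𝑡 = 1, a coin flip has 𝑟𝑡 ≈ 0, and a classifier that is always wrong has 𝑟𝑡 = −1.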
Q1: Measuring Performance
▶ The “performance” of 𝐻𝑡:

𝛼𝑡 = (1/2) ln( (1 + 𝑟𝑡) / (1 − 𝑟𝑡) )
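For intuition, here is a minimal sketch of how 𝛼𝑡 behaves for a few margin values (function name is illustrative):

```python
import math

def alpha(r):
    # performance of a classifier with weighted margin r
    return 0.5 * math.log((1 + r) / (1 - r))

print(alpha(0.0))   # 0.0: no better than chance, gets zero vote
print(alpha(0.5))   # ~0.549: a genuinely useful classifier
print(alpha(0.9))   # ~1.472: better classifiers get larger votes
# alpha(r) -> +infinity as r -> 1: a perfect classifier dominates
```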
Q2: Updating Weights
▶ We use the weights to learn a classifier 𝐻𝑡 ∶ X → [−1, 1].
▶ Weigh misclassified points more heavily.
▶ A point is misclassified if 𝑦𝑖 𝐻𝑡( ⃗𝑥(𝑖)) < 0.
Q2: Updating Weights
▶ This will do the trick:
𝜔𝑖(𝑡+1) ∝ 𝜔𝑖(𝑡) ⋅ exp(−𝛼𝑡 𝑦𝑖 𝐻𝑡( ⃗𝑥(𝑖)))

▶ The ∝ is because we normalize the new weights to sum to one.
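One update step can be sketched as follows (same illustrative numbers as before):

```python
import numpy as np

w = np.array([0.25, 0.25, 0.25, 0.25])
y = np.array([1, 1, -1, -1])
H = np.array([1, -1, -1, -1])            # second point is misclassified

r = np.sum(w * y * H)                    # margin: 0.5
a = 0.5 * np.log((1 + r) / (1 - r))      # performance alpha_t

w_new = w * np.exp(-a * y * H)           # up-weight mistakes, down-weight hits
w_new = w_new / w_new.sum()              # normalize (this is the "∝")
print(w_new)                             # ~[0.167, 0.5, 0.167, 0.167]
```

The misclassified point now carries half the total weight, so the next weak classifier is strongly encouraged to get it right.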
Q3: Combining Classifiers
▶ The final classifier:

𝐻( ⃗𝑥) = ∑𝑡=1..𝑇 𝛼𝑡 𝐻𝑡( ⃗𝑥)
AdaBoost

Given data ( ⃗𝑥(1), 𝑦1), … , ( ⃗𝑥(𝑛), 𝑦𝑛), labels in {−1, 1}.

▶ Initialize the weight vector, ω⃗(1) = (1/𝑛, 1/𝑛, … , 1/𝑛)ᵀ.
▶ Repeat for 𝑡 = 1, … , 𝑇:
▶ Give the data and weights ω⃗(𝑡) to the weak learner. Receive a classifier 𝐻𝑡 ∶ X → {−1, 1} back.
▶ Calculate the “performance”, 𝛼𝑡 = (1/2) ln( (1 + 𝑟𝑡) / (1 − 𝑟𝑡) ).
▶ Update 𝜔𝑖(𝑡+1) ∝ 𝜔𝑖(𝑡) ⋅ exp(−𝛼𝑡 𝑦𝑖 𝐻𝑡( ⃗𝑥(𝑖))), then normalize.
▶ Final classifier: 𝐻( ⃗𝑥) = ∑𝑡=1..𝑇 𝛼𝑡 𝐻𝑡( ⃗𝑥)
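The whole loop above can be sketched in a few dozen lines. This is a minimal illustrative implementation (function names and the toy data set are my own), using exhaustive decision stumps as the weak learner:

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak learner: pick the (feature, threshold, sign) stump that
    maximizes the weighted margin r_t = sum_i w_i * y_i * H(x_i)."""
    best = (-np.inf, 0, 0.0, 1)                 # (margin, feature, threshold, sign)
    for j in range(X.shape[1]):                 # try all features...
        for thr in np.unique(X[:, j]):          # ...and all thresholds
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                r = np.sum(w * y * pred)        # weighted margin
                if r > best[0]:
                    best = (r, j, thr, sign)
    return best

def stump_predict(X, j, thr, sign):
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1 / n)                       # omega^(1) = (1/n, ..., 1/n)
    stumps, alphas = [], []
    for _ in range(T):
        r, j, thr, sign = fit_stump(X, y, w)    # assumes r < 1 (imperfect stump)
        alpha = 0.5 * np.log((1 + r) / (1 - r)) # performance alpha_t
        pred = stump_predict(X, j, thr, sign)
        w = w * np.exp(-alpha * y * pred)       # reweight...
        w = w / w.sum()                         # ...and normalize
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    score = sum(a * stump_predict(X, j, thr, s)
                for a, (j, thr, s) in zip(alphas, stumps))
    return np.sign(score)   # sign of the linear combination of H_1, ..., H_T

# Toy 1-D data set: no single stump classifies it perfectly,
# but three boosting rounds fit it exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
stumps, alphas = adaboost(X, y, T=3)
print(predict(X, stumps, alphas))   # recovers y exactly on the training set
```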
Example: Decision Stumps

▶ To learn a decision stump, given the data and ω⃗(𝑡):
▶ Try all features and thresholds.
▶ Choose the one that maximizes the margin:

𝑟𝑡 = ∑𝑖=1..𝑛 𝜔𝑖(𝑡) 𝑦𝑖 𝐻𝑡( ⃗𝑥(𝑖)) ∈ [−1, 1]
Example: Decision Stumps

▶ To learn a decision stump, given the data and ω⃗(𝑡):
▶ Try all features and thresholds.
▶ Equivalently, choose the one that maximizes the performance:

𝛼𝑡 = (1/2) ln( (1 + 𝑟𝑡) / (1 − 𝑟𝑡) )
Theory
Suppose that on each round 𝑡, the weak learner returns a rule 𝐻𝑡 whose error on the step-𝑡 weighted data is at most 1/2 − 𝛾. Then after 𝑇 rounds, the training error of the combined rule 𝐻 is at most e^(−𝛾²𝑇/2).
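To see what the bound buys us, plug in some sample numbers (illustrative values, not from the lecture):

```python
import math

# Even a tiny edge gamma = 0.1 over chance, compounded over T = 1000
# rounds, drives the bound on the training error to nearly zero.
gamma, T = 0.1, 1000
bound = math.exp(-(gamma ** 2) * T / 2)
print(bound)   # e^(-5), about 0.0067
```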
Generalization
▶ Boosted decision stumps are really resistant to overfitting.

(Figure: training error and true error as a function of the number of nodes in the tree.)
Why not?
▶ Why use weak learners?
▶ What if we replace decision stumps with SVMs or logistic regression?
▶ You can, but weak learners are fast to train.
▶ The point of boosting is that weak learners are “just as good” as strong learners.
Lecture 14 – Part 02: Random Forests
Let’s Try
▶ Decision trees are susceptible to overfitting.
▶ Let’s try using boosted decision trees.
Example: Forest Cover Type
▶ Goal: predict the forest type.
▶ Spruce/fir
▶ Lodgepole pine
▶ etc.; 7 classes in total.
▶ 54 cartographic/geological features.
▶ Elevation, slope, amount of shade, distance to water, etc.
Decision Tree
Depth 20. Training error: 1%. Test error: 12.6%.
Boosted Decision Trees
Boosted Decision Trees
Depth 20: Test error: 8.7%. Slow!
Another Idea
▶ Prevent decision trees from overfitting by “hiding data” randomly.
▶ Train a bunch of trees, quickly.
▶ Average them to make a final prediction.
Random Forests
▶ For 𝑡 = 1 to 𝑇:
▶ Choose 𝑛′ training points randomly, with replacement.
▶ Fit a decision tree, 𝐻𝑡.
▶ At each node, restrict to one of 𝑘 features, chosen randomly.
▶ Final classifier: majority vote of 𝐻1, … , 𝐻𝑇.
▶ Common settings: 𝑛′ = 𝑛 (bootstrap), 𝑘 = √𝑑.
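Off-the-shelf implementations follow this recipe directly. As a sketch (assuming scikit-learn is available; the iris data set here is just a stand-in for illustration), `RandomForestClassifier` exposes the two common settings above as parameters:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True resamples n' = n points with replacement for each tree;
# max_features="sqrt" restricts each split to k = sqrt(d) random features.
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, max_features="sqrt", random_state=0
)
forest.fit(X, y)
print(forest.score(X, y))   # accuracy of the ensemble's vote on the training set
```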
Forest Cover Type
▶ Decision trees: 12.6% error.
▶ Boosted decision trees: 8.7% error (but slow!)
▶ Random forests: 8.8% error.
▶ 50% of features dropped.
▶ Each individual tree 𝐻1, … , 𝐻𝑇 has test error around 15%.