Machine Learning Basics Lecture 6: Overfitting · 2016-09-01
TRANSCRIPT
Machine Learning Basics Lecture 6: Overfitting
Princeton University COS 495
Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data $(x_i, y_i)$, $1 \le i \le n$, drawn i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss
$$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$$
• s.t. the expected loss is small:
$$L(f) = \mathbb{E}_{(x,y) \sim D}\left[ l(f, x, y) \right]$$
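To make the two quantities concrete, here is a minimal numpy sketch (not from the slides; the distribution, hypothesis, and squared loss are illustrative assumptions): the empirical loss averages over the $n$ training points, while the expected loss is approximated by averaging over a large fresh sample from $D$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. pairs (x, y) from an assumed distribution D: y = sin(2*pi*x) + noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def loss(f, x, y):
    """Squared loss l(f, x, y) = (f(x) - y)^2, an illustrative choice."""
    return (f(x) - y) ** 2

f = lambda x: 0.0 * x  # one fixed hypothesis f in H (here: always predict 0)

x_train, y_train = sample(20)
empirical = loss(f, x_train, y_train).mean()   # L_hat(f): average over n training points

x_fresh, y_fresh = sample(200_000)
expected = loss(f, x_fresh, y_fresh).mean()    # Monte Carlo estimate of L(f) = E[l(f, x, y)]

print(f"empirical loss {empirical:.3f}  vs  expected loss ~ {expected:.3f}")
```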
Machine learning 1-2-3
• Collect data and extract features (feature mapping)
• Build model: choose hypothesis class 𝓗 (Occam’s razor) and loss function 𝑙 (maximum likelihood)
• Optimization: minimize the empirical loss (gradient descent; convex optimization)
Overfitting
Linear vs nonlinear models
• Map the input $(x_1, x_2)$ to the feature vector
$$\phi(x) = \left( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2c}\, x_1,\; \sqrt{2c}\, x_2,\; c \right)$$
• Classify with a linear model in feature space: $y = \mathrm{sign}(w^T \phi(x) + b)$
• This feature map corresponds to the (degree-2) polynomial kernel: $\phi(x)^T \phi(z) = (x^T z + c)^2$
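A small sketch (mine, not from the slides) that checks the kernel identity numerically for the 2D feature map above:

```python
import numpy as np

def phi(x, c):
    """Explicit degree-2 polynomial-kernel feature map for 2D input."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

def poly_kernel(x, z, c):
    """k(x, z) = (x.z + c)^2, computed without the explicit map."""
    return (x @ z + c) ** 2

x, z, c = np.array([0.7, -1.2]), np.array([2.0, 0.5]), 1.0
print(phi(x, c) @ phi(z, c))   # inner product in feature space
print(poly_kernel(x, z, c))    # same value via the kernel trick
```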
Linear vs nonlinear models
• Linear model: $f(x) = a_0 + a_1 x$
• Nonlinear model: $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_M x^M$
• Linear model ⊆ nonlinear model (we can always set $a_i = 0$ for $i > 1$)
• So it looks like the nonlinear model can always achieve the same or smaller error
• Why, then, use Occam’s razor (choose a smaller hypothesis class)?
Example: regression using a polynomial curve
• Data: $t = \sin(2\pi x) + \epsilon$
• Fit by regression using a polynomial of degree $M$
• In the figures, low-degree polynomials underfit the data, while a high-degree polynomial passes through every training point yet oscillates between them (overfitting)
Figures from Pattern Recognition and Machine Learning, Bishop
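A minimal numpy sketch in the spirit of this example (sample sizes, noise level, and degrees are my choices, not Bishop's): as the degree $M$ grows, training error keeps shrinking while test error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, t

x_train, t_train = make_data(10)    # small training set
x_test,  t_test  = make_data(1000)  # large held-out set approximating test error

for M in [0, 1, 3, 9]:
    # Least-squares polynomial fit of degree M (polyfit may warn about
    # conditioning at M=9 with only 10 points -- that IS the overfit regime).
    coeffs = np.polyfit(x_train, t_train, deg=M)
    rmse = lambda x, t: np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    print(f"M={M}: train RMSE {rmse(x_train, t_train):.3f}, "
          f"test RMSE {rmse(x_test, t_test):.3f}")
```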
Prevent overfitting
• Empirical loss and expected loss are different
  • Also called training error and test/generalization error
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two (i.e., fits the noise in the training sample), and thus has small training error but large test error (overfitting)
• The larger the data set, the smaller the difference between the two
• A larger data set helps!
  • Use experience/data to prune hypotheses
• Throwing away useless hypotheses also helps!
  • Use prior knowledge/model to prune hypotheses
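A quick sketch of the first remedy (illustrative numbers of my own): holding the hypothesis class fixed at degree-9 polynomials, enlarging the training set shrinks the gap between training and test error.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, t

x_test, t_test = make_data(5000)

for n in [10, 100, 1000]:
    x_tr, t_tr = make_data(n)
    coeffs = np.polyfit(x_tr, t_tr, deg=9)       # same hypothesis class every time
    mse = lambda x, t: np.mean((np.polyval(coeffs, x) - t) ** 2)
    gap = mse(x_test, t_test) - mse(x_tr, t_tr)  # test minus training error
    print(f"n={n:4d}: train {mse(x_tr, t_tr):.3f}, "
          f"test {mse(x_test, t_test):.3f}, gap {gap:.3f}")
```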
Prior vs. data
Prior vs experience
• Super strong prior knowledge: 𝓗 = {𝑓∗}
• No data is needed!
𝑓∗: the best function
Prior vs experience
• Super strong prior knowledge: 𝓗 = {𝑓∗, 𝑓1}
• A few data points suffice to detect 𝑓∗
Prior vs experience
• Super large data set: infinite data
• Hypothesis class 𝓗 can be all functions!
Prior vs experience
• Practical scenarios: finite data; 𝓗 of medium capacity; 𝑓∗ may or may not lie in 𝓗
(Figure: candidate hypothesis classes 𝓗1 and 𝓗2 relative to the best function 𝑓∗)
Prior vs experience
• Practical scenarios lie between the two extreme cases: 𝓗 = {𝑓∗} (pure prior) at one end, infinite data (pure experience) at the other, with practice somewhere in between
General Phenomenon
Figure from Deep Learning, Goodfellow, Bengio and Courville (training and generalization error as a function of model capacity)
Cross validation
Model selection
• How to choose the optimal capacity?
  • E.g., choose the best degree for polynomial curve fitting
• Cannot be done by training data alone
• Create held-out data to approximate the test error
  • Called the validation data set
Model selection: cross validation
• Partition the training data into several groups
• Each time use one group as validation set
Figure from Pattern Recognition and Machine Learning, Bishop
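A from-scratch sketch of this partitioning scheme, used here to pick a polynomial degree (the fold count and candidate degrees are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

def cv_error(deg, k=5):
    """Average validation MSE over k folds for a degree-`deg` polynomial."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)      # partition the training data into k groups
    errs = []
    for val in folds:                   # each time, one group is the validation set
        tr = np.setdiff1d(idx, val)     # remaining groups form the training set
        coeffs = np.polyfit(x[tr], t[tr], deg=deg)
        errs.append(np.mean((np.polyval(coeffs, x[val]) - t[val]) ** 2))
    return np.mean(errs)

scores = {M: cv_error(M) for M in range(10)}
print("cross-validation picks degree", min(scores, key=scores.get))
```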
Model selection: cross validation
• Also used for selecting other hyper-parameters for the model/algorithm
  • E.g., learning rate, stopping criterion of SGD, etc.
• Pros: general, simple
• Cons: computationally expensive; even worse when there are more hyper-parameters
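The cost is easy to quantify: with $k$ folds, the number of training runs is $k$ times the product of the grid sizes, so it grows exponentially in the number of hyper-parameters. A tiny sketch with a hypothetical grid:

```python
import itertools

# Hypothetical hyper-parameter grids for an SGD-trained model
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "stopping_epochs": [10, 50, 100],
    "batch_size": [16, 64],
}
k = 5  # cross-validation folds

combos = list(itertools.product(*grid.values()))
print(f"{len(combos)} settings x {k} folds = {len(combos) * k} training runs")
for combo in combos[:3]:  # each setting is trained and validated k times
    print(dict(zip(grid.keys(), combo)))
```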