COMS 4771 Introduction to Machine Learning (transcript)
Source: cseweb.ucsd.edu/~naverma/teaching/COMS_S4771/lec/lec6.pdf (2015-06-15)
COMS 4771: Introduction to Machine Learning
Nakul Verma
Announcements
• HW2 due now!
• Project proposal due tomorrow
• Midterm next lecture!
• HW3 posted
Last time…
• Linear Regression
• Parametric vs Nonparametric regression
• Logistic Regression for classification
• Ridge and Lasso Regression
• Kernel Regression
• Consistency of Kernel Regression
• Speeding up nonparametric regression with trees
Towards formalizing ‘learning’
What does it mean to learn a concept?
• Gain knowledge or experience of the concept.

The basic process of learning:
• Observe a phenomenon
• Construct a model from observations
• Use that model to make decisions / predictions
How can we make this more precise?
A statistical machinery for learning

Phenomenon of interest:
Input space: X    Output space: Y
There is an unknown distribution D over X × Y. The learner observes m examples (x_1, y_1), …, (x_m, y_m) drawn i.i.d. from D.

Construct a model:
Let F be a collection of models, where each f ∈ F predicts y given x. From the observations, select a model f ∈ F which predicts well.

We can say that we have learned the phenomenon if

  err(f) := P_{(x,y)∼D}[ f(x) ≠ y ] ≤ ε        (generalization error of f)

for any tolerance level ε of our choice.
PAC Learning
For all tolerance levels ε > 0 and all confidence levels δ > 0, if there exists some model selection algorithm A that selects f_m ∈ F from m = m(ε, δ) observations, i.e., A : (x_1, y_1), …, (x_m, y_m) ↦ f_m, and has the property:

  err(f_m) ≤ inf_{f∈F} err(f) + ε

with probability at least 1 − δ over the draw of the sample,

We call:
• The model class F is PAC-learnable.
• If the sample size m(ε, δ) is polynomial in 1/ε and 1/δ, then F is efficiently PAC-learnable.
A popular algorithm: the empirical risk minimizer (ERM), which returns f_ERM := argmin_{f∈F} (1/m) Σ_i 1[f(x_i) ≠ y_i].
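As a concrete illustration (not from the slides), here is a minimal ERM sketch, assuming a toy finite class of one-dimensional threshold classifiers f_t(x) = 1[x ≥ t]; the data generator and the threshold grid are invented for the example:

```python
import random

# Illustrative ERM over a toy finite class: threshold classifiers
# f_t(x) = 1[x >= t] on the real line (this class and the data
# generator are invented for the example, not from the lecture).
def erm(sample, thresholds):
    """Return the threshold whose classifier has the smallest sample error."""
    def sample_error(t):
        return sum((x >= t) != y for x, y in sample) / len(sample)
    return min(thresholds, key=sample_error)

random.seed(0)
# Labels follow the true threshold 0.5, with 10% label noise.
data = []
for _ in range(200):
    x = random.random()
    y = (x >= 0.5) != (random.random() < 0.1)  # XOR flips the label w.p. 0.1
    data.append((x, y))

f_hat = erm(data, [i / 20 for i in range(21)])
print(f_hat)  # typically lands near the true threshold 0.5
```

With 200 noisy samples, the empirical minimizer typically lands close to the best model in the class, which is what the finite-class guarantee discussed next quantifies.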
PAC learning simple model classes
Theorem (finite size F): Pick any tolerance level ε > 0 and any confidence level δ > 0. Let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

If m ≥ (1 / 2ε²) ln(2|F| / δ), then with probability at least 1 − δ:  for all f ∈ F,  |err(f) − err_hat(f)| ≤ ε.

⇒ F is efficiently PAC learnable (e.g., via ERM).
Occam’s Razor Principle:
All things being equal, the simplest explanation of a phenomenon is usually a good hypothesis.
Simplicity = representational succinctness
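To get a feel for the bound, a small calculation of the sample size m ≥ (1/2ε²) ln(2|F|/δ) from the theorem above (the numbers plugged in are illustrative choices, not from the slides):

```python
import math

# Sample complexity from the finite-class theorem:
#   m >= (1 / (2 * eps**2)) * ln(2 * |F| / delta)
# suffices for |err(f) - err_hat(f)| <= eps uniformly over F,
# with probability >= 1 - delta. The numbers below are illustrative.
def sample_complexity(num_models, eps, delta):
    return math.ceil(math.log(2 * num_models / delta) / (2 * eps ** 2))

print(sample_complexity(1000, 0.05, 0.01))     # -> 2442
print(sample_complexity(10 ** 6, 0.05, 0.01))  # -> 3823
```

Note the logarithmic dependence on |F|: a 1000× larger model class needs only modestly more data, which is the quantitative content of Occam's Razor here.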
Proof sketch
Define:  err(f) := P_{(x,y)∼D}[f(x) ≠ y]   (generalization error of f)
         err_hat(f) := (1/m) Σ_{i=1}^m 1[f(x_i) ≠ y_i]   (sample error of f)

We need to analyze, for f* := argmin_{f∈F} err(f):

  err(f_ERM) − err(f*)
    = [err(f_ERM) − err_hat(f_ERM)] + [err_hat(f_ERM) − err_hat(f*)] + [err_hat(f*) − err(f*)]

The middle term is ≤ 0 by the definition of ERM; the two outer terms are controlled by uniform deviations of the sample average of a random variable from its expectation.
Proof sketch
Fix any f ∈ F and a sample (x_1, y_1), …, (x_m, y_m); define the random variable Z_i := 1[f(x_i) ≠ y_i].

Lemma (Chernoff-Hoeffding bound ‘63): Let Z_1, …, Z_m be m Bernoulli r.v. drawn independently from B(p). For any tolerance level ε > 0,

  P[ |(1/m) Σ_i Z_i − p| > ε ] ≤ 2 e^{−2ε²m}.

Here p = E[Z_i] = err(f) (generalization error of f) and (1/m) Σ_i Z_i = err_hat(f) (sample error of f).
A classic result in concentration of measure, proof later
Proof sketch
Need to analyze:

  P[ ∃ f ∈ F : |err(f) − err_hat(f)| > ε ] ≤ Σ_{f∈F} P[ |err(f) − err_hat(f)| > ε ] ≤ 2|F| e^{−2ε²m}   (union bound)

Equivalently, by choosing m ≥ (1 / 2ε²) ln(2|F| / δ): with probability at least 1 − δ, for all f ∈ F,  |err(f) − err_hat(f)| ≤ ε.
PAC learning simple model classes
Theorem (Occam’s Razor): Pick any tolerance level ε > 0 and any confidence level δ > 0. Let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

If m ≥ (1 / 2ε²) ln(2|F| / δ), then with probability at least 1 − δ:  for all f ∈ F,  |err(f) − err_hat(f)| ≤ ε.

⇒ F is efficiently PAC learnable (e.g., via ERM).
One thing left…
Still need to prove:
Lemma (Chernoff-Hoeffding bound ‘63): Let X_1, …, X_m be m Bernoulli r.v. drawn independently from B(p). For any tolerance level ε > 0,

  P[ |(1/m) Σ_i X_i − p| > ε ] ≤ 2 e^{−2ε²m}.

That is: how the sample average (1/m) Σ_i X_i deviates from the true average p, as a function of the number of samples m.

Need to analyze: how the probability measure concentrates towards a central value (like the mean).
Detour: Concentration of Measure
Let’s start with something simple:
Let X be a non-negative random variable. For a given constant c > 0, what is P[X ≥ c]?

Observation:  c · 1[X ≥ c] ≤ X.  Take expectations on both sides:

  P[X ≥ c] ≤ E[X] / c        (Markov’s Inequality)

Why? Because E[1[X ≥ c]] = P[X ≥ c], and X dominates c · 1[X ≥ c] pointwise.
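A quick numerical sanity check of Markov's inequality (an illustrative simulation, not part of the lecture), using an exponential distribution as an arbitrary non-negative example:

```python
import random

# Numerical sanity check of Markov's inequality, P[X >= c] <= E[X] / c,
# using an exponential distribution with mean 1 (an arbitrary
# non-negative example; the true tail is P[X >= c] = e^{-c}).
random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]
c = 3.0
empirical = sum(x >= c for x in xs) / len(xs)  # true value e^{-3} ~ 0.0498
markov = (sum(xs) / len(xs)) / c               # bound ~ 1/3
print(empirical, markov)  # the bound holds, but is quite loose here
```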
Concentration of Measure
Using Markov to bound the deviation from the mean…

Let X be a random variable (not necessarily non-negative). We want to examine P[|X − EX| ≥ c] for some given constant c > 0.

Observation:

  P[|X − EX| ≥ c] = P[(X − EX)² ≥ c²] ≤ E[(X − EX)²] / c² = Var(X) / c²   (by Markov’s Inequality)

This is Chebyshev’s Inequality. True for all distributions!
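Analogously, a small check of Chebyshev's inequality on a uniform(0, 1) sample (again an illustrative choice of distribution):

```python
import random

# Chebyshev check, P[|X - EX| >= c] <= Var(X) / c^2, on uniform(0, 1)
# samples (an arbitrary example; the exact tail at c = 0.4 is 0.2).
random.seed(2)
xs = [random.random() for _ in range(100_000)]
mean = sum(xs) / len(xs)                          # ~ 0.5
var = sum((x - mean) ** 2 for x in xs) / len(xs)  # ~ 1/12
c = 0.4
empirical = sum(abs(x - mean) >= c for x in xs) / len(xs)
print(empirical, var / c ** 2)  # ~0.2 vs the bound ~0.52: valid but loose
```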
Concentration of Measure
Sharper estimates using an exponential!
Let X be a random variable (not necessarily non-negative). For some given constant c > 0:

Observation: for any t > 0,

  P[X − EX ≥ c] = P[e^{t(X − EX)} ≥ e^{tc}] ≤ e^{−tc} · E[e^{t(X − EX)}]   (by Markov’s Inequality)

This is called Chernoff’s bounding method; the free parameter t > 0 is then optimized to make the bound as tight as possible.
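The optimization over t can be seen numerically. A sketch (illustrative, assuming a standard normal X, whose moment generating function E[e^{tX}] = e^{t²/2} is known in closed form):

```python
import math

# Chernoff's bounding method for a standard normal X: its MGF is
# E[e^{tX}] = e^{t^2 / 2}, so P[X >= c] <= min_{t>0} e^{-tc + t^2/2}.
# The exponent -tc + t^2/2 is minimized at t = c, giving e^{-c^2/2}.
def chernoff_bound(c, ts):
    return min(math.exp(-t * c + t ** 2 / 2) for t in ts)

ts = [i / 100 for i in range(1, 500)]  # grid search over t in (0, 5)
print(chernoff_bound(2.0, ts))  # ~ e^{-2} ~ 0.1353
print(math.exp(-2.0 ** 2 / 2))  # the analytic optimum, attained at t = 2
```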
Concentration of Measure
Now, given X_1, …, X_m i.i.d. random variables (assume 0 ≤ X_i ≤ 1) with mean p, define Y_i := X_i − EX_i. The Y_i are i.i.d., zero-mean, and bounded in an interval of length 1.

By Chernoff’s bounding technique, for any t > 0,

  P[ (1/m) Σ_i X_i − p ≥ ε ] = P[ Σ_i Y_i ≥ εm ] ≤ e^{−tεm} · Π_i E[e^{t Y_i}] ≤ e^{−tεm} · e^{t²m/8}

(using independence of the Y_i, and Hoeffding’s lemma E[e^{t Y_i}] ≤ e^{t²/8}). Optimizing over t (take t = 4ε) gives e^{−2ε²m}; the symmetric argument for the lower tail contributes the factor of 2.

This implies the Chernoff-Hoeffding bound!
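A simulation (parameter choices are mine, purely illustrative) comparing the empirical deviation probability of a Bernoulli sample average against the bound 2e^{−2ε²m}:

```python
import math
import random

# Empirical check of the Chernoff-Hoeffding bound,
#   P[|(1/m) sum X_i - p| > eps] <= 2 e^{-2 eps^2 m},
# for Bernoulli(p) samples (parameter choices are illustrative).
random.seed(3)
p, m, eps, trials = 0.3, 100, 0.1, 20_000

deviations = 0
for _ in range(trials):
    avg = sum(random.random() < p for _ in range(m)) / m
    deviations += abs(avg - p) > eps
empirical = deviations / trials
bound = 2 * math.exp(-2 * eps ** 2 * m)  # = 2 e^{-2} ~ 0.27
print(empirical, bound)  # the empirical frequency sits well below the bound
```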
Back to Learning Theory!
Theorem (Occam’s Razor): Pick any tolerance level ε > 0 and any confidence level δ > 0. Let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D.

If m ≥ (1 / 2ε²) ln(2|F| / δ), then with probability at least 1 − δ:  for all f ∈ F,  |err(f) − err_hat(f)| ≤ ε.

⇒ F is efficiently PAC learnable (e.g., via ERM).
Learning general concepts
Consider linear classification: F = { all linear separators in R^d }. Here |F| = ∞, so the Occam’s Razor bound, which scales with ln |F|, is ineffective.
Need to capture the true richness of F.

Definition (Vapnik-Chervonenkis or VC dimension): We say that a model class F has VC dimension d if d is the size of the largest set of points x_1, …, x_d such that for every one of the 2^d possible labelings of x_1, …, x_d, there exists some f ∈ F that achieves that labeling.

Example: F = linear classifiers in R². Three points in general position can be labeled in all 2³ = 8 ways by a halfplane, but no set of four points can be; so the VC dimension is 3.
VC Theory
VC Dimension

Another example: F = axis-aligned rectangles in R². The VC dimension is 4.

VC dimension:
• A combinatorial concept that captures the true richness of F.
• Often (but not always!) proportional to the degrees of freedom or the number of independent parameters in F.

The class of rectangles cannot realize every labeling of five points: label the four extreme points (leftmost, rightmost, topmost, bottommost) positive and a remaining interior point negative; any rectangle containing the four extremes must also contain the interior point.
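The shattering argument for rectangles can be checked by brute force. A sketch (illustrative; the point sets are my own choices) that tests whether axis-aligned rectangles realize all labelings of a given point set:

```python
from itertools import product

# Brute-force shattering test for axis-aligned rectangles in R^2:
# a labeling is realizable iff the bounding box of the positive points
# contains no negative point (the box is the tightest valid rectangle).
def rectangles_shatter(points):
    for labels in product([False, True], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab]
        if not pos:
            continue  # an empty rectangle realizes the all-negative labeling
        lo_x, hi_x = min(x for x, _ in pos), max(x for x, _ in pos)
        lo_y, hi_y = min(y for _, y in pos), max(y for _, y in pos)
        for (x, y), lab in zip(points, labels):
            if not lab and lo_x <= x <= hi_x and lo_y <= y <= hi_y:
                return False  # a negative point is trapped inside the box
    return True

four = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # a "diamond": shattered
five = four + [(0, 0)]                     # add the center: not shattered
print(rectangles_shatter(four), rectangles_shatter(five))  # True False
```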
VC Theorem
Theorem (Vapnik-Chervonenkis ’71): Pick any tolerance level ε > 0 and any confidence level δ > 0. Let (x_1, y_1), …, (x_m, y_m) be examples drawn i.i.d. from an unknown distribution D, and let d be the VC dimension of F.

If m ≥ C · (d ln(1/ε) + ln(1/δ)) / ε² (for a universal constant C), then with probability at least 1 − δ:  for all f ∈ F,  |err(f) − err_hat(f)| ≤ ε.

⇒ F is efficiently PAC learnable whenever d is finite.

The VC Theorem generalizes the Occam’s Razor Theorem: the role of ln |F| is played by the VC dimension d.
Tightness of VC bound
Theorem (VC lower bound): Let A be any model selection algorithm that, given m samples, returns a model from F, that is, A : (x_1, y_1), …, (x_m, y_m) ↦ f^A_m ∈ F. For all tolerance levels ε > 0 and all confidence levels δ > 0, there exists a distribution D such that if m ≤ c · (d + ln(1/δ)) / ε² (d the VC dimension of F, c a universal constant), then with probability at least δ over the draw of the sample,  err(f^A_m) > min_{f∈F} err(f) + ε.
Some implications
• The VC dimension of a model class fully characterizes its learning ability: the upper and lower bounds match up to constants!
• The results are agnostic to the underlying distribution.
One algorithm to rule them all?
From our discussion it may seem that the ERM algorithm is universally consistent. This is not the case!

Theorem (no free lunch, Devroye ‘82): Pick any sample size m, any algorithm A, and any ε > 0. There exists a distribution D such that

  err(f^A_m) > 1/2 − ε,

while the Bayes optimal error is err(f_Bayes) = 0.
Further refinements and extensions
• How to do model class selection? Structural risk minimization results.
• Dealing with kernels – fat margin theory.
• Incorporating priors over the models – PAC-Bayes theory.
• Is it possible to get distribution-dependent bounds? Rademacher complexity.
• How about regression? One can derive similar results for nonparametric regression.
What We Learned…
• Formalizing learning
• PAC learnability
• Occam’s razor Theorem
• VC dimension and VC theorem
• No Free-lunch theorem
Questions?
Next time…
Midterm!
Unsupervised learning.