Introduction to Machine Learning CMU-10701
8. Stochastic Convergence
Barnabás Póczos
Motivation
What have we seen so far?
Several algorithms that seem to work fine on training datasets:
• Linear regression
• Naïve Bayes classifier
• Perceptron classifier
• Support Vector Machines for regression and classification
How good are these algorithms on unknown test sets?
How many training samples do we need to achieve small error?
What is the smallest possible error we can achieve?
To answer these questions, we will need a few powerful tools ⇒ Learning Theory
Basic Estimation Theory
Rolling a Die: Estimation of the Parameters θ1, θ2, …, θ6
Does the MLE estimate converge to the right value? How fast does it converge?
Rolling a Die: Calculating the Empirical Average
Does the empirical average converge to the true mean? How fast does it converge?
Rolling a Die: Calculating the Empirical Average

[Figure: 5 sample traces of the empirical average]

How fast do they converge?
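The behavior of these sample traces can be reproduced with a short simulation. This is a minimal sketch using only the Python standard library; the trace length and the seeds are illustrative choices, not from the slides.

```python
import random

def empirical_average_trace(n_rolls, seed=None):
    """Return the running empirical average of n_rolls fair-die rolls."""
    rng = random.Random(seed)
    total = 0.0
    trace = []
    for k in range(1, n_rolls + 1):
        total += rng.randint(1, 6)
        trace.append(total / k)
    return trace

# Five independent traces, as in the figure; each should approach E[X] = 3.5.
traces = [empirical_average_trace(10_000, seed=s) for s in range(5)]
for t in traces:
    print(round(t[-1], 3))
```

Plotting the traces shows the familiar picture: wild fluctuation for small n, then a slow settling around 3.5.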
Key Questions
I want to know the coin parameter θ ∈ [0,1] within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need?
• Do empirical averages converge?
• Does the MLE converge in the die-rolling problem?
• What do we mean by convergence?
• What is the rate of convergence?
Applications:
• drug testing (Does this drug modify the average blood pressure?)
• user interface design (We will see this later.)
Outline

Theory:
• Stochastic convergence:
  – Weak convergence = convergence in distribution
  – Convergence in probability
  – Strong (almost sure) convergence
  – Convergence in Lp norm
• Limit theorems:
  – Law of large numbers
  – Central limit theorem
• Tail bounds:
  – Markov, Chebyshev
Stochastic convergence definitions and properties
Convergence of vectors
Convergence in Distribution = Weak Convergence = Convergence in Law
Notation: Zn →d Z (also written Zn ⇒ Z).

Let {Z, Z1, Z2, …} be a sequence of random variables, with CDFs F_Z and F_Zn.

Definition: Zn converges to Z in distribution if lim n→∞ F_Zn(t) = F_Z(t) at every point t where F_Z is continuous.
This is the “weakest” convergence.
Only the distribution functions converge! (NOT the values of the random variables)
Continuity is important!

Example: Let Zn be uniform on [0, 1/n]. The limit random variable is the constant 0: Zn →d Z with P(Z = 0) = 1.

Proof: For t < 0, F_Zn(t) = 0 = F_Z(t); for t > 0, F_Zn(t) = 1 = F_Z(t) as soon as 1/n < t. At t = 0, however, F_Zn(0) = 0 for every n while F_Z(0) = 1. This is exactly why the definition only requires convergence at continuity points of F_Z.

In this example the limit Z is discrete, not random (constant 0), although Zn is a continuous random variable.
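Taking Zn ~ Uniform[0, 1/n] (the standard instance of this example), the pointwise behavior of the CDFs can be checked directly. A minimal sketch; the function names are mine, not from the slides.

```python
def F_Zn(t, n):
    """CDF of Zn ~ Uniform[0, 1/n]."""
    if t < 0:
        return 0.0
    if t > 1.0 / n:
        return 1.0
    return n * t

def F_Z(t):
    """CDF of the limit Z, the constant random variable 0."""
    return 1.0 if t >= 0 else 0.0

# At continuity points t != 0 of F_Z, F_Zn(t) converges to F_Z(t):
for t in (-0.5, 0.2, 1.0):
    print(t, F_Zn(t, n=1000), F_Z(t))

# At the discontinuity t = 0 the pointwise limit fails:
print(F_Zn(0.0, n=1000), F_Z(0.0))  # 0.0 vs 1.0
```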
Properties
Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution.
Example (Central Limit Theorem): √n (X̄n − μ)/σ →d N(0, 1).
Zn and Z can still be independent even if their distributions are the same!
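The CLT example can be illustrated numerically: standardized means of die rolls should look approximately standard normal. A minimal sketch, standard library only; the sample counts and the seed are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)
mu, var = 3.5, 35.0 / 12.0  # mean and variance of a single fair-die roll

def standardized_mean(n):
    """(sum of n die rolls - n*mu) / sqrt(n*var); ~ N(0, 1) for large n by the CLT."""
    s = sum(rng.randint(1, 6) for _ in range(n))
    return (s - n * mu) / math.sqrt(n * var)

samples = [standardized_mean(100) for _ in range(5000)]
# Sample mean and standard deviation should be near 0 and 1:
print(round(statistics.mean(samples), 2), round(statistics.stdev(samples), 2))
frac_within_1 = sum(abs(z) <= 1 for z in samples) / len(samples)
print(round(frac_within_1, 2))  # close to 0.68, as for a standard normal
```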
Convergence in Probability
Notation: Zn →P Z.

Definition: Zn converges to Z in probability if, for every ε > 0, lim n→∞ P(|Zn − Z| > ε) = 0.
This indeed measures how far the values of Zn(ω) and Z(ω) are from each other.
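The definition suggests a direct numerical check: estimate P(|Zn − Z| > ε) for growing n and watch it shrink. A sketch for Zn = the empirical average of n die rolls and Z = 3.5, the constant mean; the trial count and ε are arbitrary choices.

```python
import random

rng = random.Random(1)

def prob_deviation(n, eps, trials=1000):
    """Monte Carlo estimate of P(|mean of n die rolls - 3.5| > eps)."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(mean - 3.5) > eps:
            count += 1
    return count / trials

probs = [prob_deviation(n, eps=0.3) for n in (10, 100, 1000)]
print(probs)  # should decrease toward 0 as n grows
```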
Almost Sure Convergence
Notation: Zn →a.s. Z.

Definition: Zn converges to Z almost surely if P(lim n→∞ Zn = Z) = 1.
Convergence in p-th mean, Lp norm
Notation: Zn →Lp Z.

Definition: Zn converges to Z in p-th mean (p ≥ 1) if lim n→∞ E[|Zn − Z|^p] = 0.

Properties: Convergence in Lp implies convergence in probability, and for q > p ≥ 1 convergence in Lq implies convergence in Lp.
Counter Examples
Further Readings on Stochastic convergence
• http://en.wikipedia.org/wiki/Convergence_of_random_variables
• Patrick Billingsley: Probability and Measure
• Patrick Billingsley: Convergence of Probability Measures
Finite sample tail bounds
Useful tools!
Markov's inequality

If X is any nonnegative random variable and a > 0, then

P(X ≥ a) ≤ E[X] / a.

Proof: Decompose the expectation:

E[X] = E[X · 1{X ≥ a}] + E[X · 1{X < a}] ≥ a · P(X ≥ a).

Corollary: Chebyshev's inequality.
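Markov's inequality is easy to sanity-check numerically. A minimal sketch; Exponential(1) is my choice of example, since its tail P(X ≥ a) = e^(−a) and mean E[X] = 1 are known exactly.

```python
import math

# X ~ Exponential(rate=1): E[X] = 1 and P(X >= a) = exp(-a).
for a in (1.0, 2.0, 5.0):
    exact_tail = math.exp(-a)
    markov_bound = 1.0 / a  # E[X] / a
    print(a, round(exact_tail, 4), round(markov_bound, 4))
    assert exact_tail <= markov_bound  # Markov's bound always holds
```

Note how loose the bound is for large a: at a = 5 the true tail is e^(−5) ≈ 0.0067 while the bound is 0.2.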
Chebyshev's inequality

If X is any random variable with finite variance and a > 0, then

P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Proof: Apply Markov's inequality to the nonnegative random variable (X − E[X])²:

P(|X − E[X]| ≥ a) = P((X − E[X])² ≥ a²) ≤ E[(X − E[X])²] / a² = Var(X) / a².

Here Var(X) is the variance of X, defined as Var(X) = E[(X − E[X])²].
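Chebyshev's inequality can be verified exactly for a single fair-die roll, where E[X] = 7/2 and Var(X) = 35/12. A sketch using exact rational arithmetic; the thresholds are illustrative choices.

```python
from fractions import Fraction

mu = Fraction(7, 2)     # E[X] for a fair die
var = Fraction(35, 12)  # Var(X) for a fair die

def tail_exact(a):
    """Exact P(|X - mu| >= a) for a single fair-die roll."""
    return Fraction(sum(abs(Fraction(x) - mu) >= a for x in range(1, 7)), 6)

for a in (Fraction(3, 2), Fraction(5, 2)):
    exact = tail_exact(a)
    bound = var / a**2  # Chebyshev: P(|X - E[X]| >= a) <= Var(X) / a^2
    print(float(a), float(exact), float(bound))
    assert exact <= bound
```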
Generalizations of Chebyshev's inequality
Chebyshev: P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Asymmetric two-sided case (X has an asymmetric distribution)

Symmetric two-sided case (X has a symmetric distribution)

There are many other generalizations, for example to multivariate X.
Higher moments?
Chebyshev: P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Markov: P(X ≥ a) ≤ E[X] / a.

Higher moments: P(|X − E[X]| ≥ a) ≤ E[|X − E[X]|^n] / a^n, where n ≥ 1.

Other functions instead of polynomials? The exp function: for every t > 0,

P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta}.

Proof: Markov's inequality applied to the nonnegative random variable e^{tX}.
Law of Large Numbers
Do empirical averages converge?

Chebyshev's inequality is good enough to study the question: do the empirical averages converge to the true mean?

Answer: Yes, they do. (Law of large numbers)
Let X1, X2, … be i.i.d. random variables with mean μ, and let X̄n = (X1 + … + Xn)/n.

Strong Law of Large Numbers: P(lim n→∞ X̄n = μ) = 1, i.e., X̄n → μ almost surely.

Weak Law of Large Numbers: for every ε > 0, lim n→∞ P(|X̄n − μ| > ε) = 0, i.e., X̄n → μ in probability.
Weak Law of Large Numbers, Proof I:

Assume finite variance σ² = Var(Xi). (This assumption is not essential; it just simplifies the proof.)

Since the Xi are i.i.d., E[X̄n] = μ and Var(X̄n) = σ²/n. Therefore, by Chebyshev's inequality,

P(|X̄n − μ| < ε) ≥ 1 − σ²/(n ε²).

As n approaches infinity, this expression approaches 1, so X̄n → μ in probability.
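This Chebyshev argument also answers the coin question from the "Key Questions" slide: a coin flip has Var(Xi) = θ(1 − θ) ≤ 1/4, so it suffices that 1/(4 n ε²) ≤ δ. A minimal sketch of that computation; the function name is mine.

```python
import math

def chebyshev_sample_size(eps, delta):
    """Flips n so that P(|empirical mean - theta| >= eps) <= delta, by Chebyshev.
    Uses Var(coin flip) <= 1/4, giving the condition 1 / (4 n eps^2) <= delta."""
    return math.ceil(1.0 / (4.0 * eps**2 * delta))

# eps = 0.1, confidence 1 - delta = 0.95:
print(chebyshev_sample_size(0.1, 0.05))  # 500
```

So 500 flips suffice by this bound; sharper tools such as Hoeffding's inequality give a smaller sufficient n.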
What we have learned today
Theory:
• Stochastic convergence:
  – Weak convergence = convergence in distribution
  – Convergence in probability
  – Strong (almost sure) convergence
  – Convergence in Lp norm
• Limit theorems:
  – Law of large numbers
  – Central limit theorem
• Tail bounds:
  – Markov, Chebyshev
Thanks for your attention