Matthieu R Bloch Tuesday, January 14, 2020
WHY SUPERVISED LEARNING MAY WORK
1
LOGISTICS
TAs and office hours:
- Monday: Mehrdad (TSRB), 2pm-3:15pm
- Tuesday: TJ (VL), 1:30pm-2:45pm
- Wednesday: Matthieu (TSRB), 12pm-1:15pm
- Thursday: Hossein (VL), 10:45am-12:00pm
- Friday: Brighton (TSRB), 12pm-1:15pm
Pass/fail policy: same homework/exam requirements as for a letter grade; a B is required to pass.
Self-assessment online: due Friday January 17, 2020 (11:59pm EST) (Friday January 24, 2020 for DL).
(Comic: http://www.phdcomics.com)
2
RECAP: COMPONENTS OF SUPERVISED MACHINE LEARNING
Learning model
1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \dots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$; the $\{y_i\}_{i=1}^N \in \mathcal{Y} \triangleq \mathbb{R}$ are the corresponding targets.
2. An unknown conditional distribution $P_{y|x}$: $P_{y|x}$ models a function $f : \mathcal{X} \to \mathcal{Y}$ with noise.
3. A set of hypotheses $\mathcal{H}$ as to what the function $f$ could be.
4. A loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ capturing the "cost" of prediction.
5. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
3
RECAP: THE SUPERVISED LEARNING PROBLEM
Learning is not memorizing
- Our goal is not to find $h \in \mathcal{H}$ that accurately assigns values to the elements of $\mathcal{D}$; our goal is to find the best $h \in \mathcal{H}$ that accurately predicts values of unseen samples.
- Consider a hypothesis $h \in \mathcal{H}$. We can easily compute the empirical risk (a.k.a. in-sample error)
$$\hat{R}_N(h) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i)).$$
- What we really care about is the true risk (a.k.a. out-of-sample error)
$$R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))].$$
- Question #1: Can we generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
- Question #2: Can we learn well? Given $\mathcal{H}$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$. Our Empirical Risk Minimization (ERM) algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$. Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$? Is $R(h^\sharp) \approx 0$?
4
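The empirical risk defined above is straightforward to compute in code. A minimal sketch, in which the target $f$, the hypothesis $h$, the squared loss, and the uniform input distribution are all illustrative assumptions (none of them are fixed by the lecture):

```python
import random

def f(x):            # "true" function generating the targets (assumed for illustration)
    return 2.0 * x

def h(x):            # one candidate hypothesis in H (assumed for illustration)
    return 1.8 * x

def loss(y, y_hat):  # squared loss, one possible choice of l(y, y_hat)
    return (y - y_hat) ** 2

def empirical_risk(h, data):
    """R_hat_N(h) = (1/N) * sum_i l(y_i, h(x_i))."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

random.seed(0)
# Dataset D = {(x_i, y_i)} with x_i drawn i.i.d. from P_x and noiseless targets y_i = f(x_i)
D = [(x, f(x)) for x in (random.uniform(0, 1) for _ in range(1000))]
print(round(empirical_risk(h, D), 4))
```

The true risk $R(h)$, in contrast, cannot be computed this way: it requires the unknown distribution, which is exactly why the question of generalization arises.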
A SIMPLER SUPERVISED LEARNING PROBLEM
Consider a special case of the general supervised learning problem:
1. Dataset $\mathcal{D} \triangleq \{(x_1, y_1), \dots, (x_N, y_N)\}$ with $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown $P_x$ on $\mathcal{X}$ and labels $\{y_i\}_{i=1}^N$ with $\mathcal{Y} = \{0, 1\}$ (binary classification).
2. Unknown $f : \mathcal{X} \to \mathcal{Y}$, no noise.
3. Finite set of hypotheses $\mathcal{H} \triangleq \{h_i\}_{i=1}^M$, with $|\mathcal{H}| = M < \infty$.
4. Binary loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+ : (y_1, y_2) \mapsto \mathbf{1}\{y_1 \neq y_2\}$.
In this very specific case, the true risk simplifies to
$$R(h) \triangleq \mathbb{E}_{xy}[\mathbf{1}\{h(x) \neq y\}] = P_{xy}(h(x) \neq y),$$
and the empirical risk becomes
$$\hat{R}_N(h) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{h(x_i) \neq y_i\}.$$
5
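The binary setting above can be sketched concretely. Everything specific here (a uniform $P_x$ on $[0,1]$, a threshold labeling function, and a small class of threshold hypotheses) is an illustrative assumption, not part of the lecture's setup:

```python
import random

def zero_one_loss(y1, y2):
    """l(y1, y2) = 1{y1 != y2}."""
    return 1 if y1 != y2 else 0

f = lambda x: 1 if x >= 0.5 else 0        # unknown labeling function, no noise (assumed)
H = [lambda x, t=t: 1 if x >= t else 0    # finite hypothesis class, |H| = M = 3 (assumed)
     for t in (0.3, 0.5, 0.7)]

def empirical_risk(h, data):
    """R_hat_N(h): fraction of samples the hypothesis misclassifies."""
    return sum(zero_one_loss(h(x), y) for x, y in data) / len(data)

random.seed(1)
D = [(x, f(x)) for x in (random.uniform(0, 1) for _ in range(2000))]
# With P_x uniform on [0, 1], the true risk of the threshold-t hypothesis is
# |t - 0.5|, so the three empirical risks should sit near 0.2, 0.0, and 0.2.
print([round(empirical_risk(h, D), 3) for h in H])
```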
6
CAN WE LEARN?
- Our objective is to find a hypothesis $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$ that ensures a small risk.
- For a fixed $h_j \in \mathcal{H}$, how does $\hat{R}_N(h_j)$ compare to $R(h_j)$?
- Observe that for $h_j \in \mathcal{H}$, the empirical risk
$$\hat{R}_N(h_j) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{h_j(x_i) \neq y_i\}$$
is a sum of i.i.d. random variables, with $\mathbb{E}[\hat{R}_N(h_j)] = R(h_j)$.
- Hence $P(|\hat{R}_N(h_j) - R(h_j)| > \epsilon)$ is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean.
- We're in luck! Such bounds, known as concentration inequalities, are a well-studied subject.
7
8
CONCENTRATION INEQUALITIES 101
Lemma (Markov's inequality). Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$,
$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$$
Lemma (Chebyshev's inequality). Let $X$ be a real-valued random variable. Then for all $t > 0$,
$$P(|X - \mathbb{E}[X]| \geq t) \leq \frac{\mathrm{Var}(X)}{t^2}.$$
Proposition (Weak law of large numbers). Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued random variables with finite mean $\mu$ and finite variance $\sigma^2$. Then for all $\epsilon > 0$,
$$P\left(\left|\frac{1}{N}\sum_{i=1}^{N} X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{N\epsilon^2}, \qquad \text{so} \qquad \lim_{N \to \infty} P\left(\left|\frac{1}{N}\sum_{i=1}^{N} X_i - \mu\right| \geq \epsilon\right) = 0.$$
9
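A quick simulation can sanity-check Chebyshev's inequality and the weak law: the deviation probability of the sample mean of $N$ Bernoulli variables should sit below $\sigma^2/(N\epsilon^2)$ and vanish as $N$ grows. The Bernoulli choice and all parameter values are illustrative:

```python
import random

def deviation_prob(p, N, eps, trials=5000, rng=None):
    """Estimate P(|mean of N Bernoulli(p) draws - p| >= eps) by Monte Carlo."""
    rng = rng or random.Random(42)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(N)) / N
        if abs(mean - p) >= eps:
            hits += 1
    return hits / trials

p, eps = 0.5, 0.1
sigma2 = p * (1 - p)                  # variance of one Bernoulli(p) draw
for N in (25, 100, 400):
    bound = sigma2 / (N * eps ** 2)   # Chebyshev bound (may exceed 1 for small N)
    print(N, round(deviation_prob(p, N, eps), 3), round(bound, 3))
```

The estimated probabilities decay much faster than $1/N$, which foreshadows the exponential bound of Hoeffding's inequality later in the lecture.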
10
BACK TO LEARNING
- By the law of large numbers, we know that for all $\epsilon > 0$,
$$P_{\{(x_i, y_i)\}}\left(\left|\hat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq \frac{\mathrm{Var}(\mathbf{1}\{h_j(x_1) \neq y_1\})}{N\epsilon^2} \leq \frac{1}{N\epsilon^2}.$$
- Given enough data, we can generalize. How much data? $N = \lceil \frac{1}{\delta\epsilon^2} \rceil$ ensures $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \delta$.
- That's not quite enough! We care about $\hat{R}_N(h^*)$ where $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$. If $M = |\mathcal{H}|$ is large, we should expect the existence of some $h_k \in \mathcal{H}$ such that $\hat{R}_N(h_k) \ll R(h_k)$.
- How do we bound $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon)$? By the union bound,
$$P\left(\left|\hat{R}_N(h^*) - R(h^*)\right| \geq \epsilon\right) \leq P\left(\exists j : \left|\hat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq \frac{M}{N\epsilon^2}.$$
- If we choose $N \geq \lceil \frac{M}{\delta\epsilon^2} \rceil$, we can ensure $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \delta$.
- That's a lot of samples!
11
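The requirement $N \geq \lceil M/(\delta\epsilon^2) \rceil$ is simple arithmetic; plugging in some illustrative values of $M$, $\epsilon$, and $\delta$ makes the "a lot of samples" remark concrete:

```python
import math

def chebyshev_union_samples(M, eps, delta):
    """Smallest integer N with M / (N * eps^2) <= delta."""
    return math.ceil(M / (delta * eps ** 2))

# Illustrative accuracy/confidence targets (not values from the lecture):
eps, delta = 0.1, 0.05
print(chebyshev_union_samples(1, eps, delta))     # a single hypothesis
print(chebyshev_union_samples(1000, eps, delta))  # M = 1000: a thousand times more data
```

The bound grows linearly in $M$, which is exactly the weakness the Hoeffding-based analysis on the next slide removes.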
CONCENTRATION INEQUALITIES 102
We can obtain much better bounds than with Chebyshev's inequality.
Lemma (Hoeffding's inequality). Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued zero-mean random variables such that $X_i \in [a_i; b_i]$. Then for all $\epsilon > 0$,
$$P\left(\left|\frac{1}{N}\sum_{i=1}^{N} X_i\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2N^2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right).$$
In our learning problem, this gives, for all $\epsilon > 0$,
$$P\left(\left|\hat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq 2\exp(-2N\epsilon^2),$$
$$P\left(\left|\hat{R}_N(h^*) - R(h^*)\right| \geq \epsilon\right) \leq 2M\exp(-2N\epsilon^2).$$
We can now choose $N \geq \lceil \frac{1}{2\epsilon^2} \ln\frac{2M}{\delta} \rceil$: $M$ can be quite large (almost exponential in $N$) and, with enough data, we can generalize $h^*$.
How about learning $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$?
12
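The two sample-size requirements derived in this lecture can be compared numerically: the Chebyshev-plus-union-bound requirement grows linearly in $M$, while the Hoeffding-based one grows only logarithmically. The parameter values below are illustrative:

```python
import math

def n_chebyshev(M, eps, delta):
    """N >= M / (delta * eps^2), from Chebyshev + the union bound."""
    return math.ceil(M / (delta * eps ** 2))

def n_hoeffding(M, eps, delta):
    """N >= (1 / (2 * eps^2)) * ln(2M / delta), from Hoeffding + the union bound."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

eps, delta = 0.1, 0.05
for M in (10, 1000, 10 ** 6):
    print(M, n_chebyshev(M, eps, delta), n_hoeffding(M, eps, delta))
```

For a million hypotheses the Hoeffding requirement stays in the hundreds of samples while the Chebyshev one runs into the billions, which is why $M$ "can be quite large (almost exponential in $N$)".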
13