PAC Learning – Carnegie Mellon School of Computer Science
TRANSCRIPT
![Page 1](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/1.jpg)
PAC Learning
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 28, May 1, 2016
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Learning Theory Readings: Murphy --; Bishop --; HTF --; Mitchell Ch. 7
![Page 2](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/2.jpg)
Reminders
• Homework 9: Applications of ML
  – Release: Mon, Apr. 24
  – Due: Wed, May 3 at 11:59pm
![Page 3](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/3.jpg)
Outline
• Statistical Learning Theory
  – True Error vs. Train Error
  – Function Approximation View (aka. PAC/SLT Model)
  – Three Hypotheses of Interest
• Probably Approximately Correct (PAC) Learning
  – PAC Criterion
  – PAC Learnable
  – Consistent Learner
  – Sample Complexity
• Generalization and Overfitting
  – Realizable vs. Agnostic Cases
  – Finite vs. Infinite Hypothesis Spaces
  – VC Dimension
  – Sample Complexity Bounds
  – Empirical Risk Minimization
  – Structural Risk Minimization
• Excess Risk
![Page 4](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/4.jpg)
LEARNING THEORY
![Page 5](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/5.jpg)
Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error? (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error? (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting? (Structural Risk Minimization)
![Page 6](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/6.jpg)
Statistical Learning Theory
Whiteboard:
– Function Approximation View (aka. PAC/SLT Model)
– True Error vs. Train Error
– Three Hypotheses of Interest
![Page 7](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/7.jpg)
PAC / SLT Model
[Diagram (PAC/SLT models for Supervised Learning): a Data Source generates instances according to a distribution D on X; an Expert / Oracle labels them with the target c* : X → Y; the Learning Algorithm receives the labeled examples (x1, c*(x1)), …, (xm, c*(xm)) and outputs a hypothesis h : X → Y (e.g., a decision tree that tests x1 > 5 and then x6 > 2 to predict +1 or -1, or a linear separator over + and - points).]
Slide from Nina Balcan
![Page 8](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/8.jpg)
PAC / SLT Model
![Page 9](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/9.jpg)
Two Types of Error
Train Error (aka. empirical risk)
True Error (aka. expected risk)
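The formulas for these two quantities are on the slide image. A hedged reconstruction, using the notation of the PAC/SLT model above (distribution D on X, target c* : X → Y, training sample of m labeled examples), is:

```latex
\text{Train error (empirical risk):}\quad
\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\!\left[h(x_i) \neq c^*(x_i)\right]

\text{True error (expected risk):}\quad
R(h) = \Pr_{x \sim D}\!\left[h(x) \neq c^*(x)\right]
```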
![Page 10](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/10.jpg)
Three Hypotheses of Interest
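The slide body is an image; a hedged reconstruction of the three hypotheses the lecture contrasts, consistent with the notation above, is:

```latex
c^* : \text{the true target function (not necessarily in } \mathcal{H}\text{)}

h^* = \arg\min_{h \in \mathcal{H}} R(h) : \text{the best hypothesis in } \mathcal{H} \text{ (lowest true error)}

\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h) : \text{the hypothesis selected by minimizing train error}
```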
![Page 11](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/11.jpg)
PAC LEARNING
![Page 12](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/12.jpg)
Probably Approximately Correct (PAC) Learning
Whiteboard:
– PAC Criterion (see the statement sketched below)
– Meaning of "Probably Approximately Correct"
– PAC Learnable
– Consistent Learner
– Sample Complexity
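The whiteboard derivation is not captured in the transcript. The standard statement of the PAC criterion it refers to: the output hypothesis h should be "approximately correct" (true error at most ε) with "probable" confidence (probability at least 1 − δ over the draw of the training sample S):

```latex
\Pr_{S \sim D^{m}}\!\left[\, R(h) \leq \epsilon \,\right] \;\geq\; 1 - \delta
```

The sample complexity is then the smallest m for which this guarantee holds for given ε and δ.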
![Page 13](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/13.jpg)
PAC Learning
![Page 14](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/14.jpg)
SAMPLE COMPLEXITY RESULTS
![Page 15](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/15.jpg)
Sample Complexity Results
Realizable | Agnostic
Four Cases we care about… We'll start with the finite case…
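The bounds that fill in the finite-|H| row of the table are on the slide images; the standard statements (a hedged reconstruction, following e.g. Mitchell Ch. 7) are:

```latex
% Realizable case: some h in H has zero train error
m \;\geq\; \frac{1}{\epsilon}\left[\ln|\mathcal{H}| + \ln\tfrac{1}{\delta}\right]
\;\Longrightarrow\;
\Pr\!\big[\,R(\hat{h}) \leq \epsilon\,\big] \geq 1-\delta
\;\text{ for any } \hat{h} \text{ consistent with the sample}

% Agnostic case
m \;\geq\; \frac{1}{2\epsilon^{2}}\left[\ln|\mathcal{H}| + \ln\tfrac{1}{\delta}\right]
\;\Longrightarrow\;
\Pr\!\big[\,\forall h \in \mathcal{H}:\; R(h) \leq \hat{R}(h) + \epsilon\,\big] \geq 1-\delta
```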
![Page 16](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/16.jpg)
Generalization and Overfitting
Whiteboard:
– Realizable vs. Agnostic Cases
– Finite vs. Infinite Hypothesis Spaces
– Sample Complexity Bounds (Finite Case)
![Page 17](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/17.jpg)
Sample Complexity Results
Realizable | Agnostic
Four Cases we care about…
![Page 18](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/18.jpg)
Example: Conjunctions
In-Class Quiz: Suppose H = the class of conjunctions over x in {0,1}^M.
If M = 10, ε = 0.1, δ = 0.01, how many examples suffice?
Realizable | Agnostic
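One way to check the quiz: a conjunction over M binary variables uses each variable positively, negated, or not at all, so |H| = 3^M (ignoring degenerate always-false conjunctions). A minimal Python sketch plugging this into the realizable-case bound above (the function name is mine, not from the lecture):

```python
import math

def realizable_sample_bound(h_size, eps, delta):
    """Examples sufficient so that, with probability >= 1 - delta, any
    hypothesis with zero train error drawn from a finite class of size
    h_size has true error <= eps."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

M, eps, delta = 10, 0.1, 0.01
h_size = 3 ** M  # each variable: appears positive, negated, or not at all
print(realizable_sample_bound(h_size, eps, delta))  # -> 156
```

So roughly 156 examples suffice under these assumptions.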
![Page 19](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/19.jpg)
Sample Complexity Results
Realizable | Agnostic
Four Cases we care about…
![Page 20](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/20.jpg)
Sample Complexity Results
Realizable | Agnostic
Four Cases we care about…
We need a new definition of "complexity" of a hypothesis space for these results (see VC Dimension).
![Page 21](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/21.jpg)
VC DIMENSION
![Page 22](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/22.jpg)
What if H is infinite?
E.g., linear separators in R^d
E.g., intervals [a, b] on the real line
E.g., thresholds (at w) on the real line
[Figures: example + / - labelings of points realized by each hypothesis class]
![Page 23](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/23.jpg)
Shattering, VC-dimension
Definition: A set of points S is shattered by H if there are hypotheses in H that split S in all of the 2^|S| possible ways; i.e., all possible ways of classifying points in S are achievable using concepts in H. Writing H[S] for the set of splittings of dataset S using concepts from H, H shatters S if |H[S]| = 2^|S|.
Definition: The VC-dimension (Vapnik–Chervonenkis dimension) of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.
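To make the definition concrete, here is a small sketch (not from the lecture) that brute-forces |H[S]| for H = intervals on the real line and checks whether it equals 2^|S|:

```python
def interval_labels(points, a, b):
    """Label each point +1 if it lies in [a, b], else -1."""
    return tuple(1 if a <= x <= b else -1 for x in points)

def shattered_by_intervals(points):
    """Check whether |H[S]| == 2^|S| for H = intervals on the real line.
    Candidate endpoints at the points themselves, plus one position
    outside them, suffice to realize every achievable labeling."""
    pts = sorted(points)
    grid = [pts[0] - 1.0] + pts + [pts[-1] + 1.0]
    splittings = {interval_labels(points, a, b)
                  for a in grid for b in grid if a <= b}
    return len(splittings) == 2 ** len(points)

print(shattered_by_intervals([0.0, 1.0]))       # True  -> VCdim >= 2
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False -> (+, -, +) unachievable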
![Page 24](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/24.jpg)
Shattering, VC-dimension
Definition: The VC-dimension (Vapnik–Chervonenkis dimension) of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.
To show that the VC-dimension is d, show both that:
– there exists a set of d points that can be shattered, and
– there is no set of d+1 points that can be shattered.
Fact: If H is finite, then VCdim(H) ≤ log₂(|H|), since shattering d points requires 2^d distinct splittings and |H[S]| ≤ |H|.
![Page 25](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/25.jpg)
Shattering, VC-dimension
If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.
E.g., H = thresholds on the real line (+ to the left of w, - to the right): VCdim(H) = 1. A single point can be shattered, but for any two points the labeling (-, +) is unachievable: if the left point is negative, the threshold lies to its left, so the right point must be negative too.
E.g., H = intervals on the real line: VCdim(H) = 2. Two points can be shattered, but for any three points the labeling (+, -, +) is unachievable by a single interval.
![Page 26](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/26.jpg)
Shattering, VC-dimension
If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.
E.g., H = union of k intervals on the real line: VCdim(H) = 2k.
– VCdim(H) ≥ 2k: a sample of size 2k can be shattered (treat each consecutive pair of points as a separate interval).
– VCdim(H) < 2k + 1: 2k+1 points labeled alternately +, -, +, -, …, + contain k+1 positive runs, which would require k+1 disjoint intervals.
![Page 27](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/27.jpg)
Shattering, VC-dimension
E.g., H = linear separators in R^2: VCdim(H) ≥ 3, since three non-collinear points can be shattered.
![Page 28](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/28.jpg)
Shattering, VC-dimension
E.g., H = linear separators in R^2: VCdim(H) < 4. For any set of four points, one of two cases applies:
Case 1: one point lies inside the triangle formed by the others. We cannot label the inside point as positive and the outside points as negative.
Case 2: all points lie on the boundary (convex hull). We cannot label two diagonally opposite points as positive and the other two as negative.
Fact: the VCdim of linear separators in R^d is d+1.
![Page 29](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/29.jpg)
SAMPLE COMPLEXITY RESULTS
![Page 30](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/30.jpg)
Sample Complexity Results
Realizable | Agnostic
Four Cases we care about…
We need a new definition of "complexity" of a hypothesis space for these results (see VC Dimension above).
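The infinite-hypothesis-space entries of the table are on the slide images; the standard VC-dimension versions (a hedged reconstruction, with constants varying by source) are:

```latex
\text{Realizable:}\quad
m = O\!\left(\frac{1}{\epsilon}\left[\mathrm{VCdim}(\mathcal{H})\log\tfrac{1}{\epsilon} + \log\tfrac{1}{\delta}\right]\right)

\text{Agnostic:}\quad
m = O\!\left(\frac{1}{\epsilon^{2}}\left[\mathrm{VCdim}(\mathcal{H}) + \log\tfrac{1}{\delta}\right]\right)
```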
![Page 31](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/31.jpg)
Sample Complexity Results
Realizable | Agnostic
Four Cases we care about…
![Page 32](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/32.jpg)
Generalization and Overfitting
Whiteboard:
– Sample Complexity Bounds (Infinite Case)
– Empirical Risk Minimization
– Structural Risk Minimization (see the sketch below)
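The whiteboard content is not in the transcript. The structural risk minimization idea it refers to, sketched under the usual setup of nested hypothesis classes H₁ ⊆ H₂ ⊆ ⋯, is to pick the hypothesis minimizing train error plus a complexity penalty, which is the theoretical justification for regularization asked about in Question 3:

```latex
\hat{h} \;=\; \arg\min_{i,\; h \in \mathcal{H}_i} \;\; \hat{R}(h) \;+\; \mathrm{penalty}(\mathcal{H}_i, m, \delta)
```

Empirical risk minimization is the special case with no penalty term.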
![Page 33](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/33.jpg)
EXCESS RISK
![Page 34](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/34.jpg)
Excess Risk
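The slide body is an image; the usual definition, consistent with the three hypotheses above, decomposes the excess risk of the learned hypothesis over the best achievable risk R* (a hedged reconstruction):

```latex
R(\hat{h}) - R^{*} \;=\;
\underbrace{\left[R(\hat{h}) - R(h^{*})\right]}_{\text{estimation error}}
\;+\;
\underbrace{\left[R(h^{*}) - R^{*}\right]}_{\text{approximation error}}
```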
![Page 35](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/35.jpg)
Excess Risk Results
![Page 36](https://reader034.vdocuments.net/reader034/viewer/2022051810/6018f9d5fc75f93e296d2fe9/html5/thumbnails/36.jpg)
Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error? (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error? (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting? (Structural Risk Minimization)