vc dimension and classification - stanford university · vc dimension how do we use shatter...
TRANSCRIPT
VC Dimension and classification
John Duchi
Prof. John Duchi
Outline
I Setting: classification problems
II Finite hypothesis classes1 Union bounds2 Zero error case
III Shatter coe�cients and Rademacher complexity
IV VC Dimension
Prof. John Duchi
Setting for the lecture
Binary classification problems: data X 2 X and labelsY 2 {�1, 1}. Hypothesis class H ⇢ {h : X ! R}.Goal: Find h 2 H with
L(h) := E[1 {h(X)Y 0}]
smallLoss is always
`(h; (x, y)) = 1 {h(x)y 0} =
(1 if sign(h(x)) 6= y
0 if sign(h(x)) = y
Prof. John Duchi
Finite hypothesis classes
TheoremLet H be a finite class. Then
P 9h 2 H s.t. |L(h)� bL
n
(h)| �r
log |H|+ t
2n
! 2e
�t
.
Prof. John Duchi
Finite hypothesis classes: generalizationCorollaryLet H be a finite class,
bh
n
2 argmin
h
bL
n
(h). Then (for numerical
constant C < 1)
L(
bh
n
) min
h2HL(h) + C
slog
|H|�
n
w.p. � 1� �
Prof. John Duchi
Finite hypothesis classes: perfect classifiers
Possible to give better guarantees if there are good classifiers! Wewon’t bother looking at bad ones.
TheoremLet H be a finite hypothesis class and assume min
h
L(h) = 0.
Then for t � 0
P✓L(
bh
n
) � L(h
?
) +
log |H|+ t
n
◆ e
�t
.
Prof. John Duchi
Do not pick the bad ones
Prof. John Duchi
Finite function classes: Rademacher complexity
Idea: Use Rademacher complexity to understand generalizationeven for these?
Let F be finite with |f | 1 for f 2 F . Then
R
n
(F) := E"max
f2F
�����1
n
nX
i=1
"
i
f(Z
i
)
�����
#
satisfies
P max
f2F
�����1
n
nX
i=1
f(X
i
)� E[f(Xi
)]
����� � 2R
n
(F) + t
! 2 exp(�cnt
2)
Prof. John Duchi
Finite function classes: sub-Gaussianity
I Let Pn
be empirical distribution
I Define kfk2L
2(Pn) =1n
Pn
i=1 f(xi)2
I What about sum1pn
nX
i=1
"
i
f(x
i
)
Prof. John Duchi
Finite function classes: Rademacher complexity
Proposition (Massart’s finite class bound)
Let F be finite with M := max
f2F kfkL
2(Pn). Then
bR
n
(F) r
2M
2log(2 card(F))
n
.
Prof. John Duchi
Infinite classes with finite labelsWhat if we had a classifier h : X ! {�1, 1} that could only give acertain number of di↵erent labelings to a data set?
Example (Sketchy)
Say X = R and h
t
(x) = sign(x� t). Complexity of
F := {f(x) = 1 {ht
(x) 0}}?
Prof. John Duchi
Complexity of function classes
DefineF(x1:n) := {(f(x1), . . . , f(xn)) | f 2 F} .
ThenbR
n
(F) =
bR
n
(F 0)
whenever F(x1:n) = F 0(x1:n)
PropositionRademacher complexity depends on values of F : if |f(x)| M for
all x then
R
n
(F) c ·M sup
x1,...,xn2X
rlog card(F(x1:n))
n
.
Prof. John Duchi
Proof of complexity
Prof. John Duchi
Shatter coe�cients
Given function class F , shattering coe�cient (growth function) is
sn
(F) := sup
x1,...,xn2Xcard (F(x1:n))
= sup
x1:n2Xncard ((f(x1), . . . , f(xn)) | f 2 F)
ExampleThresholds in R
Prof. John Duchi
Shatter coe�cients and Rademacher complexity
PropositionFor any function class F with |f(x)| M we have
R
n
(F) cM
rlog s
n
(F)
n
.
Prof. John Duchi
VC Dimension
How do we use shatter coe�cients to give complexity guarantees?
Definition (VC Dimension)
Let H be a collection of boolean functions. The Vapnik
Chervonenkis (VC) Dimension of H is
VC(H) := sup {n 2 N : sn
(H) = 2
n} .
Prof. John Duchi
VC Dimension: examples
Example (Thresholds in R)
Example (Intervals in R)
Prof. John Duchi
VC Dimension: examples
Example (Half-spaces in R2)
Prof. John Duchi
Finite dimensional hypothesis classes
Let F be functions f : X ! R and suppose dim(F) = d
I Definition of dimension:
Example (Linear functionals)
If F = {f(x) = w
>x,w 2 Rd} then dim(F) = d
Example (Nonlinear functionals)
If F = {f(x) = w
>�(x), w 2 Rd} then dim(F) = d
Prof. John Duchi
VC dimension of finite dimensional classes
Let F have dim(F) = d and let
H := {h : X ! {�1, 1} s.t. h(x) = sign(f(x)), f 2 F} .
Proposition (Dimension bounds VC dimension)
VC(H) dim(F)
Prof. John Duchi
Finite dimensional hypothesis classes: proof
Prof. John Duchi
Sauer-Shelah Lemma
TheoremLet H be boolean functions with VC(H) = d. Then
sn
(H) dX
i=0
✓n
i
◆(2
n
if n d
�ne
d
�d
if n > d
Prof. John Duchi
Rademacher complexity of VC classes
PropositionLet H be collection of boolean functions with VC(H) = d. Then
R
n
(H) c
rd log
n
d
n
.
Proof is immediate (but a tighter result is possible):
Prof. John Duchi
Generalization bounds for VC classes
PropositionLet H have VC-dimension d and `(h; (x, y)) = 1 {h(x) 6= y}. Then
P
0
@9 h 2 H s.t. |bLn
(h)� L(h)| � c
sd log
d
n
n
+ t
1
A 2e
�nt
2
Prof. John Duchi
Things we have not addressed
I Multiclass problems (Natarajan dimension, due to BalaNatarajan; see also Multiclass Learnability and the ERMPrinciple by Daniely et al.)
I Extending “zero error” results to infinite classes
I Non-boolean classes
Prof. John Duchi
Reading and bibliography
1. M. Anthony and P. Bartlet. Neural Network Learning:
Theoretical Foundations.Cambridge University Press, 1999
2. P. L. Bartlett and S. Mendelson. Rademacher and Gaussiancomplexities: Risk bounds and structural results.Journal of Machine Learning Research, 3:463–482, 2002
3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory ofclassification: a survey of some recent advances.ESAIM: Probability and Statistics, 9:323–375, 2005
4. A. W. van der Vaart and J. A. Wellner. Weak Convergence
and Empirical Processes: With Applications to Statistics.Springer, New York, 1996 (Ch. 2.6)
5. Scribe notes for Statistics 300b:http://web.stanford.edu/class/stats300b/
Prof. John Duchi