vc dimension and classification - stanford university · vc dimension how do we use shatter...


Post on 17-Jul-2020


Page 1: VC Dimension and classification - Stanford University · VC Dimension How do we use shatter coecients to give complexity guarantees? Definition (VC Dimension) Let H be a collection

VC Dimension and classification

John Duchi

Prof. John Duchi


Outline

I Setting: classification problems

II Finite hypothesis classes
   1 Union bounds
   2 Zero error case

III Shatter coefficients and Rademacher complexity

IV VC Dimension


Setting for the lecture

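Binary classification problems: data $X \in \mathcal{X}$ and labels $Y \in \{-1, 1\}$. Hypothesis class $\mathcal{H} \subset \{h : \mathcal{X} \to \mathbb{R}\}$. Goal: find $h \in \mathcal{H}$ with

\[ L(h) := \mathbb{E}[1\{h(X) Y \le 0\}] \]

small. Loss is always

\[ \ell(h; (x, y)) = 1\{h(x) y \le 0\} = \begin{cases} 1 & \text{if } \mathrm{sign}(h(x)) \ne y \\ 0 & \text{if } \mathrm{sign}(h(x)) = y \end{cases} \]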


Finite hypothesis classes

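Theorem

Let $\mathcal{H}$ be a finite class. Then

\[ \mathbb{P}\left( \exists\, h \in \mathcal{H} \text{ s.t. } |L(h) - \hat{L}_n(h)| \ge \sqrt{\frac{\log |\mathcal{H}| + t}{2n}} \right) \le 2 e^{-t}. \]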
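To get a feel for the rate in the theorem above, the following minimal sketch evaluates the deviation radius $\sqrt{(\log|\mathcal{H}| + t)/(2n)}$ and inverts it for a sample size. The function names `uniform_deviation` and `samples_needed` and the example numbers are illustrative assumptions, not part of the lecture.

```python
import math

def uniform_deviation(num_hypotheses: int, n: int, t: float) -> float:
    """Radius eps with P(exists h: |L(h) - L_n(h)| >= eps) <= 2 exp(-t), per the theorem."""
    return math.sqrt((math.log(num_hypotheses) + t) / (2 * n))

def samples_needed(num_hypotheses: int, eps: float, delta: float) -> int:
    """Smallest n whose radius is at most eps at failure probability delta."""
    t = math.log(2 / delta)  # solves 2 exp(-t) = delta
    return math.ceil((math.log(num_hypotheses) + t) / (2 * eps ** 2))

if __name__ == "__main__":
    # Hypothetical numbers: 1000 hypotheses, accuracy 0.05, confidence 99%.
    n = samples_needed(1000, eps=0.05, delta=0.01)
    print(n, uniform_deviation(1000, n, math.log(2 / 0.01)))
```

The dependence on the class size is only logarithmic, so squaring $|\mathcal{H}|$ roughly doubles the $\log|\mathcal{H}|$ term in the required sample size.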


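Finite hypothesis classes: generalization

Corollary

Let $\mathcal{H}$ be a finite class, $\hat{h}_n \in \operatorname{argmin}_h \hat{L}_n(h)$. Then (for a numerical constant $C < \infty$)

\[ L(\hat{h}_n) \le \min_{h \in \mathcal{H}} L(h) + C \sqrt{\frac{\log(|\mathcal{H}| / \delta)}{n}} \quad \text{w.p.} \ge 1 - \delta \]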
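The corollary follows from the uniform deviation bound for finite classes by a standard three-term decomposition. The sketch below is one way to fill in that step; the symbol $h^\star$ for a minimizer of $L$ over $\mathcal{H}$ and the explicit constant $C = 2$ are assumptions made here, not taken from the slide.

```latex
% Assumption: h^\star \in \operatorname{argmin}_{h \in \mathcal{H}} L(h).
\begin{align*}
L(\hat{h}_n) - L(h^\star)
  &= \underbrace{L(\hat{h}_n) - \hat{L}_n(\hat{h}_n)}_{\le \max_h |L(h) - \hat{L}_n(h)|}
   + \underbrace{\hat{L}_n(\hat{h}_n) - \hat{L}_n(h^\star)}_{\le 0 \text{ since } \hat{h}_n \text{ minimizes } \hat{L}_n}
   + \underbrace{\hat{L}_n(h^\star) - L(h^\star)}_{\le \max_h |L(h) - \hat{L}_n(h)|} \\
  &\le 2 \max_{h \in \mathcal{H}} |L(h) - \hat{L}_n(h)|.
\end{align*}
% Take t = \log(2/\delta), so that 2 e^{-t} = \delta. Then with probability at least 1 - \delta,
\[
L(\hat{h}_n) - L(h^\star)
  \le 2 \sqrt{\frac{\log |\mathcal{H}| + \log(2/\delta)}{2n}}
  = \sqrt{\frac{2 \log(2 |\mathcal{H}| / \delta)}{n}}
  \le 2 \sqrt{\frac{\log(|\mathcal{H}| / \delta)}{n}}
  \quad \text{whenever } |\mathcal{H}| / \delta \ge 2.
\]
```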


Finite hypothesis classes: perfect classifiers

Possible to give better guarantees if there are good classifiers! We won’t bother looking at bad ones.

Theorem

Let $\mathcal{H}$ be a finite hypothesis class and assume $\min_h L(h) = 0$. Then for $t \ge 0$

\[ \mathbb{P}\left( L(\hat{h}_n) \ge L(h^\star) + \frac{\log |\mathcal{H}| + t}{n} \right) \le e^{-t}. \]


Do not pick the bad ones
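A standard argument for the zero-error theorem above runs as follows. This is a sketch under the assumption that the samples are i.i.d. and that $\hat{h}_n$ is any empirical risk minimizer; it is not necessarily the proof given in the lecture, and the threshold $\epsilon$ is notation introduced here.

```latex
% Assumption: Z_i = (X_i, Y_i) are i.i.d., and h^\star attains L(h^\star) = 0.
% Step 1: h^\star makes no mistakes on the sample almost surely, so the ERM
% satisfies \hat{L}_n(\hat{h}_n) = 0.
% Step 2: call h "bad" if L(h) \ge \epsilon. A fixed bad h survives n samples with probability
\[
\mathbb{P}\big(\hat{L}_n(h) = 0\big) = (1 - L(h))^n \le (1 - \epsilon)^n \le e^{-n \epsilon}.
\]
% Step 3: union bound over the (at most |\mathcal{H}|) bad hypotheses.
\[
\mathbb{P}\big(L(\hat{h}_n) \ge \epsilon\big)
  \le \mathbb{P}\big(\exists\, h \in \mathcal{H} : L(h) \ge \epsilon,\ \hat{L}_n(h) = 0\big)
  \le |\mathcal{H}|\, e^{-n \epsilon}.
\]
% Step 4: choose \epsilon = (\log |\mathcal{H}| + t) / n, which gives |\mathcal{H}| e^{-n \epsilon} = e^{-t}.
```

The rate is $1/n$ rather than $1/\sqrt{n}$ because only hypotheses with zero empirical error can be selected, and a bad hypothesis is eliminated exponentially fast.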


Finite function classes: Rademacher complexity

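Idea: Use Rademacher complexity to understand generalization even for these?

Let $\mathcal{F}$ be finite with $|f| \le 1$ for $f \in \mathcal{F}$. Then

\[ R_n(\mathcal{F}) := \mathbb{E}\left[ \max_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(Z_i) \right| \right] \]

satisfies

\[ \mathbb{P}\left( \max_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \big( f(X_i) - \mathbb{E}[f(X_i)] \big) \right| \ge 2 R_n(\mathcal{F}) + t \right) \le 2 \exp(-c n t^2) \]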
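For a finite class, the quantity inside the definition can be approximated by sampling random signs. The sketch below estimates the empirical version, conditional on a fixed sample, so the outer expectation over the data is not taken. The function name `estimate_rademacher`, the number of draws, and the toy class are assumptions for illustration.

```python
import random

def estimate_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps[ max_f |(1/n) sum_i eps_i f(z_i)| ].

    `values[j][i]` holds f_j(z_i), so each row is one function evaluated on the sample.
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]  # i.i.d. random signs
        total += max(abs(sum(e * v for e, v in zip(eps, row))) / n for row in values)
    return total / num_draws

if __name__ == "__main__":
    # Hypothetical toy class: three fixed {-1, 1}-valued functions on n = 50 points.
    rng = random.Random(1)
    toy = [[rng.choice((-1, 1)) for _ in range(50)] for _ in range(3)]
    print(estimate_rademacher(toy))
```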


Finite function classes: sub-Gaussianity

- Let $P_n$ be the empirical distribution

- Define $\|f\|_{L^2(P_n)}^2 = \frac{1}{n} \sum_{i=1}^n f(x_i)^2$

- What about the sum $\frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i)$?


Finite function classes: Rademacher complexity

Proposition (Massart’s finite class bound)

Let $\mathcal{F}$ be finite with $M := \max_{f \in \mathcal{F}} \|f\|_{L^2(P_n)}$. Then

\[ \hat{R}_n(\mathcal{F}) \le \sqrt{\frac{2 M^2 \log(2\, \mathrm{card}(\mathcal{F}))}{n}}. \]
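Massart's bound can be checked numerically on a small sample, where the expectation over signs can be computed exactly by enumerating all $2^n$ sign vectors. This is a minimal sketch; the function names and the example class of threshold indicators on 8 points are assumptions chosen for illustration.

```python
import itertools
import math

def exact_empirical_rademacher(values):
    """Exact E_eps max_f |(1/n) sum_i eps_i f(x_i)| by enumerating all 2^n sign vectors."""
    n = len(values[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(abs(sum(e * v for e, v in zip(eps, row))) / n for row in values)
    return total / 2 ** n

def massart_bound(values):
    """sqrt(2 M^2 log(2 card(F)) / n) with M = max_f ||f||_{L2(P_n)}."""
    n = len(values[0])
    m_squared = max(sum(v * v for v in row) / n for row in values)
    return math.sqrt(2 * m_squared * math.log(2 * len(values)) / n)

if __name__ == "__main__":
    # Hypothetical class: indicators 1{i >= k} on 8 ordered points, k = 0, ..., 8.
    rows = [[1 if i >= k else 0 for i in range(8)] for k in range(9)]
    print(exact_empirical_rademacher(rows), massart_bound(rows))
```

Enumeration costs $2^n$ evaluations, so it is only practical for small $n$.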


Infinite classes with finite labels

What if we had a classifier $h : \mathcal{X} \to \{-1, 1\}$ that could only give a certain number of different labelings to a data set?

Example (Sketchy)

Say $\mathcal{X} = \mathbb{R}$ and $h_t(x) = \mathrm{sign}(x - t)$. Complexity of $\mathcal{F} := \{f(x) = 1\{h_t(x) \le 0\}\}$?


Complexity of function classes

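Define

\[ \mathcal{F}(x_{1:n}) := \{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\}. \]

Then

\[ \hat{R}_n(\mathcal{F}) = \hat{R}_n(\mathcal{F}') \quad \text{whenever } \mathcal{F}(x_{1:n}) = \mathcal{F}'(x_{1:n}). \]

Proposition

Rademacher complexity depends on values of $\mathcal{F}$: if $|f(x)| \le M$ for all $x$ then

\[ R_n(\mathcal{F}) \le c \cdot M \sup_{x_1, \ldots, x_n \in \mathcal{X}} \sqrt{\frac{\log \mathrm{card}(\mathcal{F}(x_{1:n}))}{n}}. \]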


Proof of complexity


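Shatter coefficients

Given function class $\mathcal{F}$, the shattering coefficient (growth function) is

\[ s_n(\mathcal{F}) := \sup_{x_1, \ldots, x_n \in \mathcal{X}} \mathrm{card}(\mathcal{F}(x_{1:n})) = \sup_{x_{1:n} \in \mathcal{X}^n} \mathrm{card}\big( \{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\} \big) \]

Example

Thresholds in $\mathbb{R}$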
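For thresholds, the set $\mathcal{F}(x_{1:n})$ can be enumerated directly. The sketch below takes $f_t(x) = 1\{\mathrm{sign}(x - t) \le 0\} = 1\{x \le t\}$ and counts the distinct label vectors on a sample; the function name `threshold_labelings` and the sample values are assumptions for illustration.

```python
def threshold_labelings(points):
    """Distinct vectors (f_t(x_1), ..., f_t(x_n)) over all real t, with f_t(x) = 1{x <= t}.

    The labeling only changes when t crosses a sample point, so it suffices to
    try one t below every point and then t equal to each distinct point.
    """
    candidates = [min(points) - 1.0] + sorted(set(points))
    return {tuple(1 if x <= t else 0 for x in points) for t in candidates}

if __name__ == "__main__":
    sample = [0.3, 1.2, 2.5, 4.0]
    print(len(threshold_labelings(sample)))  # n + 1 labelings for n distinct points
```

On $n$ distinct points this gives $n + 1$ labelings, far fewer than the $2^n$ possible, which is why $\log s_n(\mathcal{F})$ grows only logarithmically in $n$ for this class.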


Shatter coefficients and Rademacher complexity

Proposition

For any function class $\mathcal{F}$ with $|f(x)| \le M$ we have

\[ R_n(\mathcal{F}) \le c M \sqrt{\frac{\log s_n(\mathcal{F})}{n}}. \]


VC Dimension

How do we use shatter coefficients to give complexity guarantees?

Definition (VC Dimension)

Let $\mathcal{H}$ be a collection of boolean functions. The Vapnik-Chervonenkis (VC) Dimension of $\mathcal{H}$ is

\[ \mathrm{VC}(\mathcal{H}) := \sup\{n \in \mathbb{N} : s_n(\mathcal{H}) = 2^n\}. \]


VC Dimension: examples

Example (Thresholds in $\mathbb{R}$)

Example (Intervals in $\mathbb{R}$)
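Both examples can be explored by brute force on a small grid: a set of $k$ points is shattered exactly when the class produces all $2^k$ label vectors on it. The sketch below is an illustration under assumptions: the grid, the finite list of cut points, and the function names are choices made here, and a search restricted to a finite grid only certifies a lower bound on the VC dimension in general.

```python
from itertools import combinations

def labelings(points, hypotheses):
    """Set of label vectors the hypotheses produce on `points`."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def vc_dimension_on_grid(grid, hypotheses, max_size=4):
    """Largest k <= max_size for which some k-subset of `grid` is shattered.

    Stopping at the first k with no shattered subset is valid because every
    subset of a shattered set is itself shattered.
    """
    best = 0
    for k in range(1, max_size + 1):
        if any(len(labelings(s, hypotheses)) == 2 ** k for s in combinations(grid, k)):
            best = k
        else:
            break
    return best

grid = [0, 1, 2, 3, 4]
cuts = [c - 0.5 for c in range(6)]  # -0.5, 0.5, ..., 4.5: one cut in every gap
thresholds = [lambda x, t=t: int(x >= t) for t in cuts]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a, b in combinations(cuts, 2)]

if __name__ == "__main__":
    print(vc_dimension_on_grid(grid, thresholds), vc_dimension_on_grid(grid, intervals))
```

Thresholds cannot label a left point 1 and a right point 0, and intervals cannot produce the pattern 1, 0, 1 on three ordered points, which matches the values 1 and 2 that the search reports.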


VC Dimension: examples

Example (Half-spaces in $\mathbb{R}^2$)
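One way to see that half-spaces in the plane shatter three points in general position is to search for a separating half-space for each of the 8 labelings. The sketch below uses the perceptron algorithm for that search, which is a choice made here rather than the lecture's argument; `find_halfspace`, `is_shattered`, and the epoch cap are hypothetical names and settings. A `None` result only means no separator was found within the cap, so the code illustrates the lower bound $\mathrm{VC} \ge 3$ and does not by itself prove the upper bound.

```python
from itertools import product

def find_halfspace(points, labels, max_epochs=1000):
    """Perceptron search for (w1, w2, b) with y * (w . x + b) > 0 on every point.

    Returns None if no such half-space is found within `max_epochs` passes.
    """
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        updated = False
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:  # misclassified or on the boundary
                w1 += y * x1
                w2 += y * x2
                b += y
                updated = True
        if not updated:
            return w1, w2, b
    return None

def is_shattered(points):
    """True if every {-1, 1} labeling of `points` is realized by some found half-space."""
    return all(
        find_halfspace(points, labels) is not None
        for labels in product((-1, 1), repeat=len(points))
    )

if __name__ == "__main__":
    print(is_shattered([(0, 0), (1, 0), (0, 1)]))          # three points in general position
    print(is_shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # the XOR labeling has no separator
```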


Finite dimensional hypothesis classes

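Let $\mathcal{F}$ be functions $f : \mathcal{X} \to \mathbb{R}$ and suppose $\dim(\mathcal{F}) = d$

- Definition of dimension:

Example (Linear functionals)

If $\mathcal{F} = \{f(x) = w^\top x, \; w \in \mathbb{R}^d\}$ then $\dim(\mathcal{F}) = d$

Example (Nonlinear functionals)

If $\mathcal{F} = \{f(x) = w^\top \phi(x), \; w \in \mathbb{R}^d\}$ then $\dim(\mathcal{F}) = d$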


VC dimension of finite dimensional classes

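Let $\mathcal{F}$ have $\dim(\mathcal{F}) = d$ and let

\[ \mathcal{H} := \{h : \mathcal{X} \to \{-1, 1\} \text{ s.t. } h(x) = \mathrm{sign}(f(x)), \; f \in \mathcal{F}\}. \]

Proposition (Dimension bounds VC dimension)

\[ \mathrm{VC}(\mathcal{H}) \le \dim(\mathcal{F}) \]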


Finite dimensional hypothesis classes: proof
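A standard linear-algebra argument for the bound $\mathrm{VC}(\mathcal{H}) \le \dim(\mathcal{F})$ is sketched below. It rests on two assumptions that the slides leave open: that $\dim(\mathcal{F})$ means the dimension of $\mathcal{F}$ as a vector space of functions, and that $\mathrm{sign}(0) := -1$ (with $\mathrm{sign}(0) := +1$ the same argument works after flipping the labeling). The map $T$ and the vector $u$ are notation introduced here.

```latex
% Assumptions: \mathcal{F} is a vector space of functions of dimension d, and sign(0) := -1.
% Take any points x_1, \ldots, x_{d+1}. The evaluation map
\[
T : \mathcal{F} \to \mathbb{R}^{d+1}, \qquad T(f) = (f(x_1), \ldots, f(x_{d+1}))
\]
% is linear, so its image has dimension at most d. Hence some u \ne 0 satisfies
\[
\sum_{i=1}^{d+1} u_i f(x_i) = 0 \quad \text{for all } f \in \mathcal{F},
\]
% and, replacing u by -u if needed, u_i > 0 for at least one i.
% If \mathcal{H} shattered x_{1:d+1}, some f would have f(x_i) > 0 when u_i > 0
% and f(x_i) \le 0 when u_i \le 0. Then
\[
\sum_{i : u_i > 0} u_i f(x_i) > 0 \quad \text{and} \quad \sum_{i : u_i \le 0} u_i f(x_i) \ge 0,
\]
% so the full sum is strictly positive, contradicting the identity above.
% Therefore no d + 1 points are shattered, and VC(\mathcal{H}) \le d.
```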


Sauer-Shelah Lemma

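Theorem

Let $\mathcal{H}$ be boolean functions with $\mathrm{VC}(\mathcal{H}) = d$. Then

\[ s_n(\mathcal{H}) \le \sum_{i=0}^d \binom{n}{i} \le \begin{cases} 2^n & \text{if } n \le d \\ \left( \frac{ne}{d} \right)^d & \text{if } n > d \end{cases} \]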
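The two regimes of the lemma are easy to compare numerically. The sketch below evaluates the binomial sum and the polynomial bound $(ne/d)^d$; the function names and the example values of $n$ and $d$ are assumptions for illustration.

```python
import math

def sauer_shelah_sum(n: int, d: int) -> int:
    """sum_{i=0}^{d} C(n, i); math.comb returns 0 when i > n, so this equals 2^n for n <= d."""
    return sum(math.comb(n, i) for i in range(d + 1))

def polynomial_bound(n: int, d: int) -> float:
    """(n e / d)^d, the upper bound used when n > d."""
    return (n * math.e / d) ** d

if __name__ == "__main__":
    d = 3
    for n in (2, 3, 10, 100):
        print(n, sauer_shelah_sum(n, d), 2 ** n, polynomial_bound(n, d))
```

For $n > d$ the count grows polynomially in $n$ with degree $d$ instead of exponentially, which is what makes $\log s_n(\mathcal{H})$ of order $d \log(n/d)$.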


Rademacher complexity of VC classes

Proposition

Let $\mathcal{H}$ be a collection of boolean functions with $\mathrm{VC}(\mathcal{H}) = d$. Then

\[ R_n(\mathcal{H}) \le c \sqrt{\frac{d \log \frac{n}{d}}{n}}. \]

Proof is immediate (but a tighter result is possible):
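One way to spell out the immediate proof is to combine the shatter-coefficient bound on Rademacher complexity with the Sauer-Shelah lemma. The sketch below assumes $n \ge e d$ so that the constant can be absorbed; the constant $c'$ is notation introduced here.

```latex
% Boolean functions satisfy |h(x)| \le 1, so take M = 1 in
% R_n(\mathcal{H}) \le c M \sqrt{\log s_n(\mathcal{H}) / n}.
% For n > d, the Sauer-Shelah lemma gives
\[
\log s_n(\mathcal{H}) \le d \log \frac{ne}{d} = d \left( 1 + \log \frac{n}{d} \right) \le 2 d \log \frac{n}{d}
\quad \text{when } n \ge e d.
\]
% Substituting,
\[
R_n(\mathcal{H}) \le c \sqrt{\frac{2 d \log \frac{n}{d}}{n}} = c' \sqrt{\frac{d \log \frac{n}{d}}{n}},
\qquad c' = \sqrt{2}\, c.
\]
```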


Generalization bounds for VC classes

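Proposition

Let $\mathcal{H}$ have VC-dimension $d$ and $\ell(h; (x, y)) = 1\{h(x) \ne y\}$. Then

\[ \mathbb{P}\left( \exists\, h \in \mathcal{H} \text{ s.t. } |\hat{L}_n(h) - L(h)| \ge c \sqrt{\frac{d \log \frac{n}{d}}{n}} + t \right) \le 2 e^{-n t^2} \]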


Things we have not addressed

- Multiclass problems (Natarajan dimension, due to Bala Natarajan; see also Multiclass Learnability and the ERM Principle by Daniely et al.)

- Extending “zero error” results to infinite classes

- Non-boolean classes


Reading and bibliography

1. M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999

2. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002

3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005

4. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996 (Ch. 2.6)

5. Scribe notes for Statistics 300b: http://web.stanford.edu/class/stats300b/
