TRANSCRIPT
VC-dimension of neural networks
Abbas Mehrabian
McGill University
IVADO Postdoctoral Fellow
29 January 2019
Binary classification tasks
✓ Emails: spam / not spam
✓ Mammograms: cancerous / non-cancerous
Input to the learning algorithm: a set of examples labelled by an expert.
Question. How many labelled examples are needed for learning a model? (sample complexity)
Binary classification
Domain X
Distribution D over X
True classifier t : X → {−1, +1}
Goal of learning: output some h : X → {−1, +1} with small error
error(h) := Pr_{X∼D}[h(X) ≠ t(X)]
Input: X1, . . . , Xm ∼ D and t(X1), . . . , t(Xm)
Empirical Risk Minimization (ERM)
error(h) := Pr_{X∼D}[h(X) ≠ t(X)] and X1, . . . , Xm ∼ D.
Choose some class H of functions X → {−1, +1}, and output

h = argmin_{h∈H} (1/m) ∑_{i=1}^m 1[h(Xi) ≠ t(Xi)]

The minimized quantity, (1/m) ∑_{i=1}^m 1[h(Xi) ≠ t(Xi)], is the training error of h: the empirical estimate of error(h).

error(h) = training error + estimation error (generalization error)
(Bias-complexity trade-off)
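To make the ERM rule concrete, here is a minimal Python sketch (mine, not the talk's); the finite class H of threshold classifiers and the uniform distribution D are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X = [0, 1], the true classifier is a threshold at 0.6,
# and H is a small finite class of threshold classifiers.
def make_threshold(c):
    return lambda x: np.where(x >= c, 1, -1)

t = make_threshold(0.6)                                  # true classifier t (unknown to the learner)
H = [make_threshold(c) for c in np.linspace(0, 1, 21)]   # finite hypothesis class H

# Labelled sample: X_1, ..., X_m ~ D (here D = uniform on [0, 1]) with labels t(X_i).
m = 50
X = rng.random(m)
y = t(X)

def training_error(h):
    """Empirical estimate of error(h) on the sample."""
    return np.mean(h(X) != y)

# ERM: pick the h in H with the smallest training error.
h_hat = min(H, key=training_error)
print("training error of the ERM output:", training_error(h_hat))
```

With |H| = 21 and m = 50, the bound of the next slide, 3√(ln|H|/m) ≈ 0.74, is still loose, but it shrinks as m grows.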
Bounding the estimation error: the case of finite H
Fix h ∈ H. Recall error(h) := Pr_{X∼D}[h(X) ≠ t(X)] and Xi ∼ D.
Then 1[h(Xi) ≠ t(Xi)] is Bernoulli with parameter error(h).
Hoeffding's inequality:

Pr[ (1/m) ∑_{i=1}^m 1[h(Xi) ≠ t(Xi)] − error(h) > t ] < exp(−2mt²)

(the bracketed difference is the estimation error of h), so, by a union bound over H,

Pr[ sup_{h∈H} estimation error of h > t ] < |H| exp(−2mt²),

so

E[ sup_{h∈H} estimation error of h ] ≤ 3 √(ln|H| / m)
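The step from the tail bound to the expectation bound is compressed on the slide; one standard derivation (ignoring the two-sided union bound, which only changes constants) integrates the tail:

```latex
% Let Z := sup_{h \in H} (estimation error of h). For any u > 0,
\mathbb{E}[Z] \le \int_0^\infty \Pr[Z > t]\,dt
   \le u + \int_u^\infty |H|\,e^{-2mt^2}\,dt
   \le u + \frac{|H|\,e^{-2mu^2}}{4mu}.
% Choosing u = \sqrt{\ln|H|/(2m)} makes |H| e^{-2mu^2} = 1, and for |H| \ge 2
% the last term is at most u, so
\mathbb{E}[Z] \le 2u = \sqrt{2\ln|H|/m} \le 3\sqrt{\ln|H|/m}.
```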
VC-dimension of H
Say x1, . . . , xm ∈ X is shattered if
|{(h(x1), h(x2), . . . , h(xm)) : h ∈ H}| = 2^m
VC-dim(H) := size of the largest shattered set
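A brute-force illustration of the definition (a sketch, not from the talk): enumerate the label vectors a class realizes on a candidate set and compare against 2^m. Here H is the class of threshold classifiers on R.

```python
import numpy as np

# H = thresholds on R: h_c(x) = sign(x - c). On any m points this class
# realizes only m + 1 of the 2^m possible label vectors.
def dichotomies(points, thresholds):
    pats = set()
    for c in thresholds:
        pats.add(tuple(np.where(points >= c, 1, -1)))
    return pats

points = np.array([0.2, 0.5, 0.8])
# Thresholds between/around the points realize every achievable pattern.
thresholds = [-1.0, 0.3, 0.6, 1.0]

pats = dichotomies(points, thresholds)
print(len(pats), "of", 2 ** len(points), "dichotomies realized")
# Prints: 4 of 8 -> {0.2, 0.5, 0.8} is NOT shattered; a single point IS
# shattered (2 of 2), so the VC-dimension of thresholds on R is 1.
```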
Example: linear classifiers
Let X = R^d. A linear classifier is parametrized by w1, . . . , w_{d+1}:
h(x1, . . . , xd) = sgn(w1x1 + · · · + wdxd + w_{d+1})
The VC-dimension of linear classifiers in R² is 3.
The VC-dimension of linear classifiers in R^d is d + 1.
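The R² claim can be checked computationally. Below is a hedged sketch using scipy.optimize.linprog as a linear-separability oracle: a labelling of the points is separable iff the feasibility LP "find (w, b) with yᵢ(w·xᵢ + b) ≥ 1" has a solution.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Feasibility LP: find w, b with y_i (w . x_i + b) >= 1."""
    n, d = X.shape
    # Variables z = (w_1, ..., w_d, b); constraint -y_i (x_i . w + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # 0 = feasible, 2 = infeasible

def shattered(X):
    return all(separable(X, np.array(y))
               for y in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])         # general position
xor = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])   # "XOR" configuration
print(shattered(three))  # True:  VC-dim >= 3
print(shattered(xor))    # False: this 4-point set is not shattered
```

By Radon's theorem, in fact no 4-point set in R² is shattered, matching VC-dim = 3.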
The Vapnik-Chervonenkis inequality
Recall: given a hypothesis class H, for any h ∈ H,
error(h) = training error + estimation error
Theorem (Vapnik and Chervonenkis 1971, Dudley 1978)
Given a training set of size m,

E[ sup_{h∈H} estimation error of h ] ≤ C √( VC-dim(H) / m )

This inequality is tight up to the value of C.
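Read as a sample-complexity statement (a standard rearrangement, not on the slide): to guarantee expected estimation error at most ε it suffices that

```latex
C\sqrt{\frac{\text{VC-dim}(H)}{m}} \le \varepsilon
\quad\Longleftrightarrow\quad
m \ge \frac{C^2\,\text{VC-dim}(H)}{\varepsilon^2}.
% E.g. for linear classifiers in R^d (VC-dim = d + 1), the required sample
% size grows linearly with the dimension d.
```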
Neural networks
[Figure: a feed-forward neural network with input x ∈ R³, layers 0 through 3, and a final sgn gate producing the output.]
Neural networks
[Figure: a single neuron with inputs x1, x2, . . . , xk, weights w1, . . . , wk, bias w0, and output y.]

y = σ(w0 + w1x1 + w2x2 + · · · + wkxk)

Common activation functions σ:
threshold: sgn(x);  sigmoid: tanh(x);  ReLU: max{0, x}
Neural networks
[Figure: the same network, with output f(x, θ) before the final sgn gate.]
Given a network architecture and activation functions, we obtain a class H of functions:

H = { sgn f(x, θ) : θ ∈ R^p }
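As a concrete (hypothetical) instance of such a class, here is a small Python sketch of a ReLU network f(x, θ) with a 3 → 4 → 4 → 1 architecture, where θ packs all the weight matrices; the classifier is sgn f(x, θ).

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def f(x, theta):
    """A fixed 3 -> 4 -> 4 -> 1 ReLU architecture; theta packs all weights."""
    shapes = [(4, 3), (4, 4), (1, 4)]
    out, pos = x, 0
    for i, (r, c) in enumerate(shapes):
        W = theta[pos:pos + r * c].reshape(r, c)
        pos += r * c
        out = W @ out
        if i < len(shapes) - 1:      # ReLU on hidden layers only
            out = relu(out)
    return out[0]

p = 4 * 3 + 4 * 4 + 1 * 4            # number of parameters (p = 32)
theta = rng.standard_normal(p)
x = np.array([0.5, -1.0, 2.0])
print(np.sign(f(x, theta)))          # the classifier h(x) = sgn f(x, theta)
```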
Main result
Theorem (Bartlett, Harvey, Liaw, M. '17)
Let σ be a piecewise linear function, p the number of parameters, ℓ the number of layers.
Then VC-dim = O(pℓ log p).
There exist networks with VC-dim = Ω(pℓ log(p/ℓ)).
Previous upper bounds: O(p²) [Goldberg and Jerrum '95]; O(pℓ² + pℓ log p) [Bartlett, Maiorov, Meir '98]
Previous lower bounds: Ω(p log p) [Maass '94]; Ω(pℓ) [Bartlett, Maiorov, Meir '98]
GoogLeNet 2014: ℓ = 22, p = 4 million
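For scale, a back-of-envelope comparison of these bounds (my arithmetic, not the slide's) at the GoogLeNet numbers p = 4 × 10⁶, ℓ = 22, using log₂ p ≈ 22:

```latex
p^2 \approx 1.6\times 10^{13},\qquad
p\ell^2 + p\ell\log_2 p \approx 4\times10^{6}\,(484 + 22\cdot 22) \approx 3.9\times10^{9},\qquad
p\ell\log_2 p \approx 4\times10^{6}\cdot 22\cdot 22 \approx 1.9\times10^{9}.
```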
Upper bound proof
Assume all biases are 0, all layers have k nodes, and σ(x) = max{0, x}. So p = (ℓ − 1)k² + k. Let x1, . . . , xm be shattered, and write gj(θ) := f(xj, θ). Then

2^m = |{(sgn f(x1, θ), . . . , sgn f(xm, θ)) : θ ∈ R^p}|
    = |{(sgn g1(θ), . . . , sgn gm(θ)) : θ ∈ R^p}| =: Π

Lemma (Warren '68)
If q1, . . . , qm are polynomials of degree ≤ d in n variables, then
|{(sgn q1(y), . . . , sgn qm(y)) : y ∈ R^n}| ≤ (6md/n)^n

Imagine that each gj(θ) were a polynomial of degree ℓ in the p weights (variables). Then Warren's lemma would give

2^m = Π ≤ (6mℓ/p)^p ≤ (6m)^p = 2^{p log₂(6m)},

so m ≤ p log₂(6m) ⇒ m ≤ O(p log p).
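The final implication uses a standard inversion of m ≤ p log₂(6m); spelled out (my reconstruction of a routine step):

```latex
% Claim: m \le p\log_2(6m) forces m \le 2p\log_2(6p) = O(p\log p) (for p \ge 3).
% The map m \mapsto m - p\log_2(6m) is increasing for m \ge 2p, and at
% m_0 := 2p\log_2(6p) it is already nonnegative:
p\log_2(6m_0) = p\log_2\!\big(12p\,\log_2(6p)\big)
   \le p\big(\log_2(12p) + \log_2(6p) - 1\big) = 2p\log_2(6p) = m_0,
% using \log_2\log_2(6p) \le \log_2(6p) - 1. Hence no m > m_0 can satisfy
% m \le p\log_2(6m).
```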
Upper bound proof
Instead, we build an iterative sequence of refined partitions S1, . . . , S_{ℓ−1} of R^p, so that each gj(θ) is a piecewise polynomial of degree ℓ with pieces given by S_{ℓ−1}, so

2^m = Π ≤ (6mℓ/p)^p · |S_{ℓ−1}|
Iterative construction of the partitions
[Figure: the first layer, mapping x to σ(〈w1, x〉), . . . , σ(〈wk, x〉), where x, w1, . . . , wk ∈ R^k.]
1. {〈wi, xj〉 : i = 1, . . . , k, j = 1, . . . , m} is a collection of km linear polynomials in the k² variables of w1, . . . , wk. By Warren, the number of attained sign vectors is ≤ (6km/k²)^{k²} = (6m/k)^{k²}.
2. Let S1 = the partition specified by this sign pattern. |S1| ≤ (6m/k)^{k²}.
3. Within each part of S1, the sign vector (sgn〈wi, xj〉)_{i,j} is fixed, so the (σ(〈wi, xj〉))_{i,j} are linear polynomials in k² variables, so the inputs to the second layer are quadratic polynomials in 2k² parameters.
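A small numerical illustration of step 1 (my sketch; the sizes k and m are arbitrary): sample many first-layer weight matrices, record the sign patterns (sgn〈wi, xj〉), and compare the number of distinct patterns against Warren's bound.

```python
import numpy as np

rng = np.random.default_rng(2)

k, m = 2, 5                      # k nodes / input dim, m data points (small demo)
X = rng.standard_normal((m, k))  # fixed shattering candidates x_1, ..., x_m

patterns = set()
for _ in range(100_000):
    W = rng.standard_normal((k, k))   # the k^2 first-layer weights w_1, ..., w_k
    signs = np.sign(X @ W.T)          # sign of <w_i, x_j> for all i, j
    patterns.add(tuple(signs.ravel().astype(int)))

bound = (6 * m / k) ** (k * k)        # Warren: (6m/k)^(k^2) = 50625 here
print(len(patterns), "distinct sign patterns; Warren bound:", bound)
# The sampled count stays far below the (loose) bound.
```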
Iterative construction of the partitions
Repeat this argument for layers 2, 3, . . . , ℓ − 1, so that within each part of S_{ℓ−1}, each of f(x1, θ), . . . , f(xm, θ) is a polynomial of degree ℓ in the p variables of θ.

2^m = |{(sgn f(xi, θ))_i : θ ∈ R^p}| ≤ (6mℓ/p)^p · |S_{ℓ−1}|

By construction of the partitions, |S1| ≤ (6m/k)^{k²} and |S_n| / |S_{n−1}| ≤ (6m/k)^{n k²} for n = 2, . . . , ℓ − 1, so

2^m ≤ (6mℓ/p)^p · ∏_{n=1}^{ℓ−1} (6m/k)^{n k²} ≤ (6m/k)^{k² ℓ(ℓ+1)/2} ≤ (6m)^{ℓp},

so m ≤ ℓp log₂(6m), giving m ≤ O(ℓp log(ℓp))
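Unpacking the last two lines (my reconstruction; constants are loose): taking log₂ of the displayed bound and using k²ℓ(ℓ+1)/2 ≤ ℓp (valid for ℓ ≥ 3, since p = (ℓ−1)k² + k),

```latex
m \le \frac{k^2\,\ell(\ell+1)}{2}\,\log_2\frac{6m}{k}
  \le \ell p\,\log_2(6m),
% and inverting as before (m \mapsto m - \ell p \log_2(6m) is eventually
% increasing) yields
m \le O\big(\ell p\,\log(\ell p)\big).
```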
The upper bound
Theorem (Bartlett, Harvey, Liaw, M. '17)
Let σ be a piecewise linear function, p the number of parameters, ℓ the number of layers.
Then VC-dim = O(pℓ log p).
Next, the lower bound:
there exist networks with VC-dim = Ω(pℓ log(p/ℓ)).
Lower bound proof
Let Sn := {e1, . . . , en} be the standard basis for R^n.
Build a network with input dimension n + m that shatters Sn × Sm.
This implies VC-dimension ≥ |Sn × Sm| = nm.
[Figure: a network taking inputs of dimensions n and m and producing a 0/1 output.]
Lower bound proof
For any given g : Sn × Sm → {0, 1}, find a parameter vector θ such that f((ei, ej), θ) = g(ei, ej).
Build an n × m table with entry (i, j) = g(ei, ej):

c1 = 0.  0 1 0
c2 = 0.  1 1 0
c3 = 0.  0 0 0
c4 = 0.  1 1 0

Let ci ∈ [0, 1] have as its binary representation the ith row.
On input (ei, ej), the network must output g(ei, ej) = c_{ij}, the jth bit of the binary representation of ci.
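The arithmetic the network must implement can be mimicked in a few lines of Python (an illustration of the encoding and bit extraction, not of the actual 5-layer blocks shown next):

```python
# Illustration of the encoding/decoding, not the network itself.
table = [
    [0, 1, 0],   # row 1 -> c1 = 0.010 (binary)
    [1, 1, 0],   # row 2 -> c2 = 0.110
    [0, 0, 0],   # row 3 -> c3 = 0.000
    [1, 1, 0],   # row 4 -> c4 = 0.110
]

def encode(row):
    """c_i in [0, 1] whose binary expansion is the row."""
    return sum(b * 2.0 ** -(j + 1) for j, b in enumerate(row))

def extract_bit(c, j):
    """j-th bit of c (1-indexed) via repeated doubling and thresholding."""
    for _ in range(j):
        b = 1 if 2 * c >= 1 else 0   # a threshold, implementable by the net
        c = 2 * c - b                # shift out the leading bit
    return b

c = [encode(r) for r in table]
assert all(extract_bit(c[i], j + 1) == table[i][j]
           for i in range(4) for j in range(3))
print("g(e_i, e_j) recovered from c_i for all i, j")
```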
The bit extractor network
[Figure: ei ∈ R^n enters a selector that outputs ci; a bit extraction unit expands ci into its bits ci1, . . . , cim; ej ∈ R^m enters a second selector that picks out cij.]
The bit extraction unit
[Figure: a chain of r-bit extractor blocks applied to 0.b1 . . . bm; the first block outputs b1, . . . , br, the next b_{r+1}, . . . , b_{2r}, and so on up to b_{m−r+1}, . . . , bm.]
Each block has 5 layers and O(r · 2^r) parameters.
In total: O(m/r) layers and O(m · 2^r) parameters.
The bit extractor network
[Figure: the same selector → bit extraction unit → selector architecture as before.]
Layers = O(1 + m/r) and parameters = O(n + m · 2^r + m)
Given p and ℓ, let r = (1/2) log₂(p/ℓ), m = rℓ/8, n = p/2.
VC-dimension ≥ mn = Ω(pℓ log(p/ℓ))
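It is worth checking that these choices stay within the budget (my arithmetic; constants are loose). With r = ½ log₂(p/ℓ) we get 2^r = √(p/ℓ), so:

```latex
\text{layers} = O(1 + m/r) = O(1 + \ell/8) = O(\ell),
\qquad
\text{parameters} = O(n + m\,2^r + m)
   = O\!\big(\tfrac{p}{2} + \tfrac{r\ell}{8}\sqrt{p/\ell}\,\big) = O(p),
% since r\ell\sqrt{p/\ell} = O(\sqrt{p\ell}\,\log_2(p/\ell)) = O(p) when p \ge \ell.
% The shattered set then gives
mn = \frac{r\ell}{8}\cdot\frac{p}{2} = \frac{p\,\ell\,\log_2(p/\ell)}{32}
   = \Omega\big(p\ell\log(p/\ell)\big).
```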
Conclusion
Theorem (Vapnik and Chervonenkis 1971, Dudley 1978)
Given a training set of size m, the expected estimation error is at most
C √( VC-dim(H) / m ).
Theorem (Bartlett, Harvey, Liaw, M., Conference on Learning Theory (COLT '17))
Let p be the number of parameters, ℓ the number of layers.
Then for piecewise linear networks, VC-dim = O(pℓ log p),
and there exist networks with VC-dim = Ω(pℓ log(p/ℓ)).
These give a nearly tight upper bound on the size of the training set needed to train a given neural network.
Further research on the theory of neural networks
✓ Obtain tighter bounds by making additional assumptions, e.g. on the input data distribution.
✓ Understand the optimization problem: how should the parameters be chosen to minimize the training error?
✓ How should the network architecture be designed for a given learning task?