
Notes on Deep Learning Theory

Evgenii (Eugene) Golikov

Neural Networks and Deep Learning lab.

Moscow Institute of Physics and Technology

Moscow, Russia

[email protected]

December 11, 2020


Abstract

These are the notes for the lectures that I was giving during Fall 2020 at the Moscow Institute of Physics and Technology (MIPT) and at the Yandex School of Data Analysis (YSDA). The notes cover some aspects of initialization, loss landscape, generalization, and neural tangent kernel theory. While many other topics (e.g. expressivity, mean-field theory, the double descent phenomenon) are missing in the current version, we plan to add them in future revisions.


Contents

1 Introduction
  1.1 Generalization ability
  1.2 Global convergence
  1.3 From weight space to function space

2 Initialization
  2.1 Preserving the variance
    2.1.1 Linear case
    2.1.2 ReLU case
    2.1.3 Tanh case
  2.2 Dynamical stability
    2.2.1 Linear case
    2.2.2 ReLU case
  2.3 GD dynamics for orthogonal initialization

3 Loss landscape
  3.1 Wide non-linear nets
    3.1.1 Possible generalizations
  3.2 Linear nets
  3.3 Local convergence guarantees
    3.3.1 Limitations of the result

4 Generalization
  4.1 Uniform bounds
    4.1.1 Upper-bounding the supremum
    4.1.2 Upper-bounding the Rademacher complexity
    4.1.3 Failure of uniform bounds
  4.2 PAC-bayesian bounds
    4.2.1 At most countable case
    4.2.2 General case
    4.2.3 Applying PAC-bayesian bounds to deterministic algorithms

5 Neural tangent kernel
  5.1 Gradient descent training as a kernel method
    5.1.1 Exact solution for a square loss
    5.1.2 Convergence to a gaussian process
    5.1.3 The kernel diverges at initialization
    5.1.4 NTK parameterization
    5.1.5 GD training and posterior inference in gaussian processes
  5.2 Stationarity of the kernel
    5.2.1 Finite width corrections for the NTK
    5.2.2 Proof of Conjecture 1 for linear nets
  5.3 GD convergence via kernel stability
    5.3.1 Component-wise convergence guarantees and kernel alignment


Chapter 1

Introduction

Machine learning aims to solve the following problem:

R(f) → min_{f∈F}. (1.1)

Here R(f) = E_{x,y∼D} r(y, f(x)) is the true risk of a model f from a class F, and D is a data distribution. However, we do not have access to the true data distribution; instead, we have a finite set of i.i.d. samples from it: S_n = {(x_i, y_i)}_{i=1}^n ∼ D^n. For this reason, instead of approaching (1.1), we substitute it with an empirical risk minimization:

R_n(f) → min_{f∈F}, (1.2)

where R_n(f) = E_{x,y∈S_n} r(y, f(x)) is the empirical risk of a model f from the class F.

1.1 Generalization ability

How does the solution of (1.2) relate to (1.1)? In other words, we aim to upper-bound the difference between the two risks:

R(f_n) − R_n(f_n) ≤ bound(f_n, F, n, δ) w.p. ≥ 1 − δ over S_n, (1.3)

where f_n ∈ F is a result of training the model on the dataset S_n. We call the bound (1.3) a-posteriori if it depends on the resulting model f_n, and we call it a-priori if it does not. An a-priori bound allows one to estimate the risk difference before training, while an a-posteriori bound estimates the risk difference based on the final model.

Uniform bounds are instances of the a-priori class:

R(f_n) − R_n(f_n) ≤ sup_{f∈F} |R(f) − R_n(f)| ≤ ubound(F, n, δ) w.p. ≥ 1 − δ over S_n. (1.4)

A typical form of the uniform bound is the following:

ubound(F, n, δ) = O(√((C(F) + log(1/δ)) / n)), (1.5)

where C(F) is a complexity of the class F. The bound above suggests that the generalization ability, measured by the risk difference, decays as the model class becomes larger. This suggestion conforms to the classical bias-variance trade-off curve. The curve can be reproduced if we fit the Runge function with a polynomial using a train set of equidistant points; the same phenomenon can be observed for decision trees.

A typical notion of model class complexity is VC-dimension [Vapnik and Chervonenkis, 1971]. For neural networks, VC-dimension grows at least linearly with the number of parameters [Bartlett et al., 2019]. Hence the bound (1.5) becomes vacuous for large enough nets. However, as we observe, the empirical (train) risk R_n vanishes, while the true (test) risk saturates for large enough width (see Figure 1 of [Neyshabur et al., 2015]).


One might hypothesize that the problem is in VC-dimension, which overestimates the complexity of neural nets. However, the problem turns out to be in uniform bounds in general. Indeed, if the class F contains a bad network, i.e. a network that perfectly fits the train data but fails desperately on the true data distribution, the uniform bound (1.4) becomes at least nearly vacuous. In realistic scenarios, such a bad network can be found explicitly: [Zhang et al., 2016] demonstrated that practically large nets can fit data with random labels; similarly, these nets can fit the training data plus some additional data with random labels. Such nets fit the training data perfectly but generalize poorly.

Up to this point, we know that among the networks with zero training risk, some nets generalize well, while some generalize poorly. Suppose we managed to come up with some model complexity measure that is symptomatic of poor generalization: bad nets have higher complexity than good ones. If we did, we could come up with a better bound by prioritizing less complex models.

Such prioritization is naturally supported by the PAC-bayesian paradigm. First, we come up with a prior distribution P over models. This distribution should not depend on the train dataset S_n. Then we build a posterior distribution Q | S_n over models based on the observed data. For instance, if we fix random seeds, a usual network training procedure gives a posterior distribution concentrated at a single model f_n. The PAC-bayesian bound [McAllester, 1999b] takes the following form:

R(Q | S_n) − R_n(Q | S_n) ≤ O(√((KL(Q | S_n ‖ P) + log(1/δ)) / n)) w.p. ≥ 1 − δ over S_n, (1.6)

where R(Q) is the expected risk for models sampled from Q; similarly for R_n(Q). If more complex models are less likely to be found, then we can embed this information into the prior, thus making the KL-divergence typically smaller.
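To make the bound concrete, here is a minimal numeric sketch for an at most countable model class (the prior, the posteriors, and the choice of setting the O(·) constant in (1.6) to 1 are all toy assumptions for illustration):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as probability vectors."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def pac_bayes_gap(q, p, n, delta):
    """Right-hand side of (1.6) with the O(.) constant set to 1."""
    return float(np.sqrt((kl_divergence(q, p) + np.log(1.0 / delta)) / n))

# A prior that prefers "simpler" models, and two posteriors concentrated
# on a prior-likely and a prior-unlikely model, respectively.
prior = np.array([0.5, 0.25, 0.125, 0.125])
q_simple = np.array([1.0, 0.0, 0.0, 0.0])
q_complex = np.array([0.0, 0.0, 0.0, 1.0])

for name, q in [("simple", q_simple), ("complex", q_complex)]:
    print(name, pac_bayes_gap(q, prior, n=10_000, delta=0.05))
# The posterior concentrated on a model that is more probable under the
# prior gets the smaller bound on the generalization gap.
```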

The PAC-bayesian bound (1.6) is an example of an a-posteriori bound, since it depends on Q. However, it is possible to obtain an a-priori bound using the same paradigm [Neyshabur et al., 2018].

The bound (1.6) becomes better when our training procedure tends to find models that are probable according to the prior. But what kind of models does gradient descent typically find? Does it implicitly minimize some complexity measure of the resulting model? Despite the existence of bad networks, minimizing the train loss using gradient descent typically reveals well-performing solutions. This phenomenon is referred to as the implicit bias of gradient descent.

Another problem with a-priori bounds is that they are all effectively two-sided: all of them bound the absolute value of the risk difference, rather than the risk difference itself. Two-sided bounds fail if there exist networks that generalize well while failing on a given train set. [Nagarajan and Kolter, 2019] have constructed a problem for which such networks are typically found by gradient descent.

1.2 Global convergence

We have introduced the empirical risk minimization problem (1.2) because we were not able to minimize the true risk directly: see (1.1). But are we able to minimize the empirical risk? Let f(x; θ) be a neural net evaluated at input x with parameters θ. Consider a loss function ℓ that is a convex surrogate of the risk r. Then minimizing the train loss will imply empirical risk minimization:

L_n(θ) = E_{x,y∈S_n} ℓ(y, f(x; θ)) → min_θ. (1.7)

Neural nets are complex non-linear functions of both inputs and weights; we can hardly expect the loss landscape L_n(θ) induced by such functions to be simple. At least, for non-trivial neural nets L_n is a non-convex function of θ. Hence it can have local minima that are not global.

The most widely-used method of solving the problem (1.7) in deep learning is gradient descent (GD), or some of its variants. Since GD is a local method, it cannot have any global convergence guarantees in the general case. However, for practically-sized neural nets it always succeeds in finding a global minimum.

Given this observation, it is tempting to hypothesize that despite the non-convexity, all local minima of L_n(θ) are global. This turns out to be true for linear nets [Kawaguchi, 2016, Lu and Kawaguchi, 2017, Laurent and Brecht, 2018], and for non-linear nets if they are sufficiently wide [Nguyen, 2019].

While globality of local minima implies almost sure convergence of gradient descent [Lee et al., 2016, Panageas and Piliouras, 2017], there are no guarantees on convergence speed. Generally, convergence speed depends on initialization. For instance,


initializing linear nets orthogonally makes the optimization speed independent of depth [Saxe et al., 2013]. An ill-posed initialization may heavily slow down the optimization process. [Glorot and Bengio, 2010, He et al., 2015] proposed heuristics for preventing such situations.

1.3 From weight space to function space

Consider the optimization problem (1.7). The gradient descent dynamics for this problem looks as follows:

θ̇_t = −η E_{x,y∈S_n} [∂ℓ(y, z)/∂z |_{z=f(x;θ_t)} ∇_θ f(x; θ_t)]. (1.8)

This dynamics is very complex due to the non-linearity of f(x; θ) as a function of θ. Let us multiply both sides of (1.8) by ∇_θ^T f(x′; θ_t):

ḟ_t(x′) = −η E_{x,y∈S_n} [∂ℓ(y, z)/∂z |_{z=f_t(x)} K_t(x′, x)], (1.9)

where f_t(x) = f(x; θ_t), and K_t(x′, x) is a tangent kernel:

K_t(x′, x) = ∇_θ^T f(x′; θ_t) ∇_θ f(x; θ_t). (1.10)

Generally, the kernel is stochastic and evolves with time. For this reason, the dynamics (1.9) is not completely defined. However, if the network is parameterized in a certain way, the kernel K_t converges to a stationary deterministic kernel K_0 as the number of hidden units (width) goes to infinity [Jacot et al., 2018].

If the kernel is stationary and deterministic, the dynamics (1.9) is much simpler than (1.8). Indeed, for the square loss, (1.9) is a linear ODE which can be solved analytically [Lee et al., 2019], while (1.8) still remains non-linear.
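As a quick illustration, here is a sketch of the linearized dynamics restricted to the training inputs, with a toy PSD matrix standing in for the stationary kernel and ℓ(y, z) = ½‖y − z‖²₂, so that ∂ℓ/∂z = z − y (the 1/n factor of the empirical average is absorbed into η):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, t = 5, 0.1, 50.0

A = rng.normal(size=(n, n))
K0 = A @ A.T / n                 # a stationary deterministic kernel on S_n
y = rng.normal(size=n)           # train targets
f0 = rng.normal(size=n)          # network outputs at initialization

# Closed form of f' = -eta * K0 (f - y): f_t = y + exp(-eta*K0*t)(f0 - y),
# computed via the eigendecomposition of K0.
w, V = np.linalg.eigh(K0)
f_closed = y + V @ (np.exp(-eta * w * t) * (V.T @ (f0 - y)))

# Explicit Euler integration of the same ODE as a cross-check.
f, dt = f0.copy(), 1e-3
for _ in range(int(t / dt)):
    f -= dt * eta * K0 @ (f - y)

print(np.max(np.abs(f - f_closed)))   # small: the two solutions agree
```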

This allows us to prove several results on convergence and generalization for large enough nets [Du et al., 2019, Arora et al., 2019a]. Indeed, for a large enough network, its kernel is almost deterministic, and one has to prove that it is almost stationary. Given this, we can transfer results from stationary deterministic kernels of infinitely wide nets to sufficiently wide ones.

Kernels of realistic finite-sized networks turn out to be non-stationary. It is possible to take this effect into account by introducing finite-width corrections [Huang and Yau, 2019, Dyer and Gur-Ari, 2020].


Chapter 2

Initialization

2.1 Preserving the variance

Consider a network with L hidden layers and no biases:

f(x) = W_L φ(W_{L−1} … φ(W_0 x)), (2.1)

where W_l ∈ R^{n_{l+1}×n_l} and the non-linearity φ is applied element-wise. Note that x ∈ R^{n_0}; we denote by k = n_{L+1} the dimensionality of the output: f : R^{n_0} → R^k. We shall refer to n_l as the width of the l-th hidden layer. Denote n = n_1, and take n_l = α_l n ∀l ∈ [L]. We shall refer to n as the width of the network, and keep the α-factors fixed.

Consider a loss function ℓ(y, z). We try to minimize the average loss of our model: L = E_{x,y} ℓ(y, f(x)). Let us assume both x and y are fixed. Define:

h_1 = W_0 x ∈ R^{n_1}, x_l = φ(h_l) ∈ R^{n_l}, h_{l+1} = W_l x_l ∈ R^{n_{l+1}} ∀l ∈ [L]. (2.2)

Hence, given x, f(x) = h_{L+1}.

This is the forward dynamics; we can express the backward dynamics similarly. Define the loss gradient with respect to the hidden representation:

g_l = ∂ℓ(y, h_{L+1})/∂h_l ∈ R^{n_l} ∀l ∈ [L+1]. (2.3)

We have then:

g_l = D_l W_l^T g_{l+1} ∀l ∈ [L], g_{L+1} = ∂ℓ(y, h)/∂h, (2.4)

where D_l = diag(φ′(h_l)). Using the backward dynamics, we are able to compute gradients with respect to the weights:

∇_l = ∂ℓ(y, h_{L+1})/∂W_l ∈ R^{n_{l+1}×n_l} ∀l ∈ [L]_0. (2.5)

Then,

∇_l = g_{l+1} x_l^T ∀l ∈ [L]_0. (2.6)

Assume the weights are initialized with zero mean and layer-dependent variance v_l:

E W_{l,0}^{ij} = 0, Var W_{l,0}^{ij} = v_l. (2.7)

Note that ∀l ∈ [L+1] all components of the vector h_l are i.i.d. with zero mean. Let q_l be their variance:

q_l = Var h_l^i = E (h_l^i)² = (1/n_l) E h_l^T h_l. (2.8)

The same holds for g_l; we denote its variance by δ_l:

δ_l = Var g_l^i = E (g_l^i)² = (1/n_l) E g_l^T g_l. (2.9)


2.1.1 Linear case

Consider first φ(h) = h. Consider the following two properties of the initialization:

1. Normalized forward dynamics: q_l depends neither on n_{0:l−1} nor on l ∀l ∈ [L+1].

2. Normalized backward dynamics: δ_l depends neither on n_{l+1:L+1} nor on l ∀l ∈ [L+1].

The first property implies that the model stays finite at initialization irrespective of width n and depth L. The two properties combined imply finite weight gradients at initialization: Var ∇_l^{ij} depends neither on n_{0:L+1} nor on l ∀l ∈ [L]_0.

In turn, these two imply that weight increments stay finite at initialization irrespective of width n and depth L if we consider training with SGD:

ΔW_l = −η E_{x,y} ∇_l(x, y). (2.10)

Since the initial weights have zero mean, all hidden representations h_l have zero mean too, and due to initial weight independence:

q_{l+1} = (1/n_{l+1}) E h_l^T W_l^T W_l h_l = v_l E h_l^T h_l = n_l v_l q_l ∀l ∈ [L], q_1 = (1/n_1) E x^T W_0^T W_0 x = ‖x‖²₂ v_0 ∝ n_0 v_0. (2.11)

Hence the forward dynamics is normalized if v_l = n_l^{−1} ∀l ∈ [L]_0.

We can compute variances for gradients with respect to hidden representations in a similar manner:

δ_l = (1/n_l) E g_{l+1}^T W_l W_l^T g_{l+1} = v_l E g_{l+1}^T g_{l+1} = n_{l+1} v_l δ_{l+1} ∀l ∈ [L−1], (2.12)

δ_L = (1/n_L) E g_{L+1}^T W_L W_L^T g_{L+1} = ‖∂ℓ(y, h)/∂h‖²₂ v_L ∝ n_{L+1} v_L. (2.13)

As we see, the backward dynamics is normalized if v_l = n_{l+1}^{−1} ∀l ∈ [L]_0. This means that we cannot have both the forward dynamics and the backward dynamics normalized at the same time. [Glorot and Bengio, 2010] proposed taking a harmonic mean of the variances for the two normalization requirements:

v_l = 2/(n_l + n_{l+1}) ∀l ∈ [L]_0. (2.14)
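The recursion q_{l+1} = n_l v_l q_l is easy to check empirically. Below is a minimal numpy sketch (the widths are a toy assumption): with v_l = n_l^{−1} the forward variance stays near 1, while the harmonic-mean choice (2.14) only approximately preserves it when consecutive widths differ:

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [784, 512, 512, 256, 10]     # n_0, ..., n_{L+1}: toy choice
x = rng.normal(size=widths[0])

def forward_variances(variance):
    """Propagate x through a linear net; report the empirical q_l per layer."""
    h, qs = x, []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(scale=np.sqrt(variance(n_in, n_out)), size=(n_out, n_in))
        h = W @ h                      # phi(h) = h: linear activations
        qs.append(float(h @ h / len(h)))
    return qs

print(forward_variances(lambda nin, nout: 1.0 / nin))           # normalized forward
print(forward_variances(lambda nin, nout: 2.0 / (nin + nout)))  # Glorot, eq. (2.14)
```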

2.1.2 ReLU case

We start with the forward dynamics:

q_{l+1} = (1/n_{l+1}) E x_l^T W_l^T W_l x_l = v_l E x_l^T x_l ∀l ∈ [L], q_1 = (1/n_1) E x^T W_0^T W_0 x = ‖x‖²₂ v_0 ∝ n_0 v_0. (2.15)

E x_l^T x_l = E [h_l]_+^T [h_l]_+ = (1/2) E h_l^T h_l = (1/2) n_l q_l ∀l ∈ [L]. (2.16)

Here the second equality holds due to the symmetry of the distribution of h_l. The latter in its turn holds by induction on l.

Hence for ReLU the forward dynamics is normalized if v_l = 2 n_l^{−1}, a result first presented in [He et al., 2015].

Let us then consider the backward dynamics:

δ_l = (1/n_l) E g_{l+1}^T W_l D_l² W_l^T g_{l+1} = (1/2) v_l E g_{l+1}^T g_{l+1} = (1/2) n_{l+1} v_l δ_{l+1} ∀l ∈ [L−1], (2.17)

δ_L = (1/n_L) E g_{L+1}^T W_L D_L² W_L^T g_{L+1} = (1/2) v_L E g_{L+1}^T g_{L+1} = (1/2) ‖∂ℓ(y, h)/∂h‖²₂ v_L ∝ (1/2) n_{L+1} v_L. (2.18)

Similarly, we have to take v_l = 2 n_{l+1}^{−1} for this type of normalization. Note that here we have assumed that g_{l+1} does not depend on W_l and h_l, which is not true: g_{l+1} depends on h_{l+1} through D_{l+1}, which depends on both W_l and h_l.


Again, we have a contradiction between the two normalization requirements. However, in some practical cases satisfying only one of them is enough. For instance, if we consider minimizing the cross-entropy loss, the model diverging or vanishing at initialization does not break the optimization process. Moreover, the magnitude of hidden representations does not matter, thanks to the homogeneity of ReLU. Hence in this case normalizing the forward dynamics is not a strict requirement.

On the other hand, using an optimizer that normalizes the gradient, e.g. Adam, makes backward normalization unnecessary.

2.1.3 Tanh case

Assume φ ∈ C³(R), φ′(z) > 0, φ(0) = 0, φ′(0) = 1, φ″(0) = 0, φ‴(0) < 0, and φ is bounded. The guiding example is the hyperbolic tangent:

φ(z) = (e^z − e^{−z}) / (e^z + e^{−z}). (2.19)

In this case taking v_l = n_l^{−1} ensures that the activations x_l are neither in a linear regime (Var h_l is not too small) nor in a saturated regime (Var h_l is not too large). However, if we take v_l = n_{l+1}^{−1}, Var g_l still vanishes for large l since |φ′(h)| ≤ 1. Nevertheless, [Glorot and Bengio, 2010] suggests initializing with a harmonic mean of variances for the class of non-linearities we consider. Rationale: in this case the network is almost linear at initialization.

Let us assume that v_l = σ_w²/n_l. Consider the forward dynamics in detail:

q_{l+1} = (1/n_{l+1}) E_{h_l} E_{W_l} φ(h_l)^T W_l^T W_l φ(h_l) = (σ_w²/n_l) E_{h_l} φ(h_l)^T φ(h_l). (2.20)

By the Central Limit Theorem, ∀i h_l^i converges to N(0, q_l) in distribution as n_{l−1} → ∞. Note that for a fixed x, h_1 is always normally distributed. Hence by taking subsequent limits n_1 → ∞, n_2 → ∞, and so on, we come up with the following recurrent relation (see [Poole et al., 2016]):

q_{l+1} = σ_w² E_{z∼N(0,1)} φ(√q_l z)² = V(q_l | σ_w²), q_1 = σ_w² ‖x‖²₂ / n_0. (2.21)

Let us study properties of the length map V:

V′(q | σ_w²) = σ_w² E_{z∼N(0,1)} φ(√q z) φ′(√q z) z/√q > 0. (2.22)

The last inequality holds since φ(√q z) z > 0 for z ≠ 0 due to the monotonicity of φ and since φ(0) = 0. Hence V monotonically increases.

Integrating by parts,

V′(q | σ_w²) = σ_w² E_{z∼N(0,1)} φ(√q z) φ′(√q z) z/√q = σ_w² E_{z∼N(0,1)} (φ′(√q z)² + φ(√q z) φ″(√q z)). (2.23)

In particular,

V′(0 | σ_w²) = σ_w² E_{z∼N(0,1)} (φ′(0))² = σ_w² > 0. (2.24)

[Poole et al., 2016] claimed that the second derivative is always negative for φ being the hyperbolic tangent, which we were not able to show. We can check it for q = 0 though:

V″(0 | σ_w²) = 4σ_w² E_{z∼N(0,1)} φ′(0) φ‴(0) = 4σ_w² φ‴(0) < 0. (2.25)

Hence, at least, V is concave at zero. Whenever σ_w ≤ 1, q = 0 is a stable fixed point of the length map. However, for σ_w > 1, q = 0 becomes unstable; since V(∞ | σ_w²) < ∞ due to the boundedness of φ, there should be at least one stable fixed point of the length map. If we believe that V is indeed concave everywhere, this stable fixed point is unique. We denote it by q_∞.

This means that, assuming L = ∞, q_l = Var h_l^i has a finite non-zero limit as n → ∞ and l → ∞ whenever σ_w² > 1. We shall refer to this property as asymptotically normalized forward dynamics. Note that the asymptotic and non-asymptotic forward dynamics normalizations are equivalent for linear and ReLU nets, and they hold exactly for σ_w² = 1 and σ_w² = 2, respectively.

Let us proceed with the backward dynamics. Similar to the forward case, we have:

δ_l = (1/n_l) E g_{l+1}^T W_l diag(φ′(h_l))² W_l^T g_{l+1}. (2.26)


We cannot factorize the expectation since g_{l+1} depends on W_{0:l} unless φ′ is constant. Nevertheless, assume that g_{l+1} does not depend on W_{0:l}. Hence it does not depend on h_l, and we have the following:

δ_l ≈ (1/n_l) E_{g_{l+1}} (g_{l+1}^T E_{W_l}(W_l E_{h_l} diag(φ′(h_l))² W_l^T) g_{l+1}) = (1/n_l) E_{h∼N(0,q_l)} (φ′(h))² E_{g_{l+1}} (g_{l+1}^T E_{W_l}(W_l W_l^T) g_{l+1}) =
= (σ_w²/n_l) E_{z∼N(0,1)} (φ′(√q_l z))² E_{g_{l+1}} g_{l+1}^T g_{l+1} = σ_w² (α_{l+1}/α_l) δ_{l+1} E_{z∼N(0,1)} (φ′(√q_l z))². (2.27)

We have already noted that, given concavity of V, the latter has a single unique stable fixed point q_∞; this also implies q_l → q_∞ as l → ∞. [Poole et al., 2016] noted that convergence to q_∞ is fast; assume then that q_l = q_∞. This allows us to express the dynamics solely in terms of δ_l:

δ_l = σ_w² (α_{l+1}/α_l) δ_{l+1} E_{z∼N(0,1)} (φ′(√q_∞ z))². (2.28)

For simplicity assume α_l = 1 ∀l ≥ 1 (all hidden layers have the same width n). Define:

χ_1 = σ_w² E_{z∼N(0,1)} (φ′(√q_∞ z))². (2.29)

We get (see [Schoenholz et al., 2016]):

δ_l = χ_1 δ_{l+1}. (2.30)

Obviously, χ_1 > 1 implies exploding gradients, while χ_1 < 1 causes gradients to vanish. We shall refer to the case χ_1 = 1 as asymptotically normalized backward dynamics. Note that for linear and ReLU nets χ_1 = 1 is equivalent to σ_w² = 1 and σ_w² = 2, respectively.
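The fixed point q_∞ and the factor χ_1 are easy to evaluate numerically. Below is a sketch for φ = tanh, using Gauss–Hermite quadrature for the Gaussian expectations (the node count and iteration budget are arbitrary choices):

```python
import numpy as np

# Nodes/weights for E_{z~N(0,1)} F(z) = sum_i w_i F(x_i) / sqrt(2*pi).
nodes, weights = np.polynomial.hermite_e.hermegauss(101)
weights = weights / np.sqrt(2.0 * np.pi)

def gauss(F):
    return float(weights @ F(nodes))

def length_map(q, sigma_w2):
    """V(q | sigma_w^2) from (2.21) for phi = tanh."""
    return sigma_w2 * gauss(lambda z: np.tanh(np.sqrt(q) * z) ** 2)

def chi1(sigma_w2):
    """chi_1 from (2.29) at the fixed point of the length map."""
    q = 1.0
    for _ in range(2000):                  # iterate (2.21) to convergence
        q = length_map(q, sigma_w2)
    phi_prime = lambda z: 1.0 / np.cosh(np.sqrt(q) * z) ** 2
    return sigma_w2 * gauss(lambda z: phi_prime(z) ** 2)

for s2 in [0.8, 1.0, 1.5]:
    print(s2, chi1(s2))
# With no biases, chi_1 < 1 for sigma_w^2 < 1 (ordered phase, vanishing
# gradients) and chi_1 > 1 for sigma_w^2 > 1 (chaotic phase).
```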

Correlation stability

The term χ_1 has a remarkable interpretation in terms of correlation stability (see [Poole et al., 2016]). Consider two inputs, x¹ and x², together with their hidden representations h_l¹ and h_l². Define the terms of the covariance matrix for the latter two:

Σ_l = ( q_l^{11}  q_l^{12} ; q_l^{12}  q_l^{22} ); q_l^{ab} = (1/n_l) E h_l^{a,T} h_l^b, a, b ∈ {1, 2}. (2.31)

Consider the correlation factor c_l^{12} = q_l^{12} / √(q_l^{11} q_l^{22}). We have already derived the dynamics for the diagonal terms in the subsequent limits of infinite width:

q_{l+1}^{aa} = σ_w² E_{z∼N(0,1)} φ(√(q_l^{aa}) z)², q_1^{aa} = σ_w² ‖x^a‖²₂ / n_0, a ∈ {1, 2}. (2.32)

Consider the off-diagonal term:

q_{l+1}^{12} = (1/n_{l+1}) E_{h_l^1, h_l^2} E_{W_l} φ(h_l^1)^T W_l^T W_l φ(h_l^2) = (σ_w²/n_l) E_{h_l^1, h_l^2} φ(h_l^1)^T φ(h_l^2). (2.33)

Taking the same subsequent limits as before, we get:

q_{l+1}^{12} = σ_w² E_{(u^1,u^2)^T∼N(0,Σ_l)} φ(u^1) φ(u^2) = σ_w² E_{(z^1,z^2)^T∼N(0,I)} φ(u_l^1(z^1)) φ(u_l^2(z^1, z^2)) = C(c_l^{12}, q_l^{11}, q_l^{22} | σ_w²), (2.34)

where u_l^1 = √(q_l^{11}) z^1, while u_l^2 = √(q_l^{22}) (c_l^{12} z^1 + √(1 − (c_l^{12})²) z^2). We shall refer to C as the correlation map. As before, assume that q_l^{aa} = q_∞, a ∈ {1, 2}, ∀l. This assumption results in a self-consistent dynamics of the correlation factor:

c_{l+1}^{12} = q_∞^{−1} C(c_l^{12}, q_∞, q_∞ | σ_w²). (2.35)

Note that c^{12} = 1 is a fixed point of the c-dynamics. Indeed:

c^{12} = q_∞^{−1} C(1, q_∞, q_∞ | σ_w²) = q_∞^{−1} σ_w² E_{z∼N(0,1)} φ(√q_∞ z)² = q_∞^{−1} V(q_∞ | σ_w²) = 1. (2.36)


In order to study its stability, we have to consider the derivative of the C-map at c^{12} = 1. Let us compute the derivative for c^{12} < 1 first:

∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = q_∞^{−1} σ_w² E_{(z^1,z^2)^T∼N(0,I)} φ(√q_∞ z^1) φ′(√q_∞ (c z^1 + √(1−c²) z^2)) (√q_∞ (z^1 − z^2 c/√(1−c²))). (2.37)

We shall use the following equivalence (integration by parts):

E_{z∼N(0,1)} F(z) z = (1/√(2π)) ∫_{−∞}^{+∞} F(z) z e^{−z²/2} dz = (1/√(2π)) ∫_{−∞}^{+∞} (−F(z)) de^{−z²/2} = (1/√(2π)) ∫_{−∞}^{+∞} F′(z) e^{−z²/2} dz = E_{z∼N(0,1)} F′(z). (2.38)

We first apply it to one part of the expression:

E_{z∼N(0,1)} φ′(√q_∞ (c z^1 + √(1−c²) z)) √q_∞ z c/√(1−c²) = q_∞ E_{z∼N(0,1)} φ″(√q_∞ (c z^1 + √(1−c²) z)) c. (2.39)

Hence,

∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = q_∞^{−1} σ_w² E_{z^1∼N(0,1)} φ(u^1(z^1)) E_{z^2∼N(0,1)} (√q_∞ z^1 φ′(u^2(z^1, z^2)) − q_∞ c φ″(u^2(z^1, z^2))), (2.40)

where u^1 = √q_∞ z^1, while u^2 = √q_∞ (c z^1 + √(1−c²) z^2). Consider the limit c → 1:

lim_{c→1} ∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = q_∞^{−1} σ_w² E_{z∼N(0,1)} φ(√q_∞ z)(√q_∞ z φ′(√q_∞ z) − q_∞ φ″(√q_∞ z)). (2.41)

Let us compute the first term:

E_{z∼N(0,1)} φ(√q_∞ z) √q_∞ z φ′(√q_∞ z) = q_∞ E_{z∼N(0,1)} ((φ′(√q_∞ z))² + φ(√q_∞ z) φ″(√q_∞ z)). (2.42)

This gives the final result:

lim_{c→1} ∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = σ_w² E_{z∼N(0,1)} (φ′(√q_∞ z))² = χ_1. (2.43)

We see that χ_1 drives the stability of the correlation of strongly correlated hidden representations, or, equivalently, of nearby input points. For χ_1 < 1, nearby points with c^{12} ≈ 1 become more correlated as they propagate through the layers. Hence initially different points become more and more similar. We refer to this regime as ordered. In contrast, for χ_1 > 1, nearby points separate as they propagate deeper into the network. We refer to this regime as chaotic. Hence the case χ_1 = 1 is the edge of chaos.
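One can check (2.36) and (2.43) numerically. The sketch below evaluates the c-map by 2-d Gauss–Hermite quadrature for φ = tanh at σ_w² = 1.5 (an arbitrary chaotic-phase choice), and estimates the slope at c → 1 by a one-sided finite difference; the agreement with χ_1 is approximate:

```python
import numpy as np

nodes, w = np.polynomial.hermite_e.hermegauss(61)
w = w / np.sqrt(2.0 * np.pi)
Z1, Z2 = np.meshgrid(nodes, nodes)
W2 = np.outer(w, w)
sigma_w2 = 1.5

# Fixed point q_inf of the length map (2.21), phi = tanh, no biases.
q = 1.0
for _ in range(500):
    q = sigma_w2 * float(np.sum(W2 * np.tanh(np.sqrt(q) * Z1) ** 2))

def c_map(c):
    """c_{l+1} as a function of c_l, eq. (2.35), by 2d quadrature."""
    u1 = np.sqrt(q) * Z1
    u2 = np.sqrt(q) * (c * Z1 + np.sqrt(1.0 - c ** 2) * Z2)
    return sigma_w2 * float(np.sum(W2 * np.tanh(u1) * np.tanh(u2))) / q

chi1 = sigma_w2 * float(np.sum(W2 * (1.0 / np.cosh(np.sqrt(q) * Z1) ** 2) ** 2))
eps = 1e-4
print(c_map(1.0))                               # ~1: c = 1 is a fixed point
print((c_map(1.0) - c_map(1.0 - eps)) / eps)    # slope at c -> 1 ...
print(chi1)                                     # ... approximately equals chi_1
```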

2.2 Dynamical stability

Following [Pennington et al., 2017], let us turn our attention to the input-output jacobian:

J = ∂h_{L+1}/∂h_1 = ∏_{l=1}^L W_l D_l ∈ R^{n_{L+1}×n_1}. (2.44)

We now compute the mean squared Frobenius norm of J:

E ‖J‖²_F = E tr(J^T J) = E_{W_{0:L}} tr[(∏_{l=1}^L W_l D_l)^T ∏_{l=1}^L W_l D_l] =
= tr E_{W_{0:L−1}} [(∏_{l=1}^{L−1} W_l D_l)^T D_L E_{W_L}(W_L^T W_L) D_L ∏_{l=1}^{L−1} W_l D_l] =
= n_{L+1} v_L tr E_{W_{0:L−1}} [(∏_{l=1}^{L−1} W_l D_l)^T D_L² ∏_{l=1}^{L−1} W_l D_l]. (2.45)


Assuming that tr(D_l²) does not depend on W_{0:l} ∀l ∈ [L] allows us to proceed with the calculation:

E ‖J‖²_F = n_{L+1} v_L E_{h_L} tr(D_L²) v_{L−1} tr E_{W_{0:L−2}} [(∏_{l=1}^{L−2} W_l D_l)^T D_{L−1}² ∏_{l=1}^{L−2} W_l D_l] =
= n_{L+1} v_L ∏_{l=2}^L (E_{h_l} tr(D_l²) v_{l−1}) tr(E_{W_0} D_1²) = n_{L+1} ∏_{l=1}^L v_l E_{h_l} tr(D_l²). (2.46)

Suppose we aim to normalize the backward dynamics: v_l = σ_w²/n_{l+1} ∀l ∈ [L]. Assume then (see Section 2.1.3) that h_l ∼ N(0, q_∞) ∀l ∈ [L]. Then the calculation above gives us the mean average eigenvalue of J^T J:

(1/n_1) E Σ_{i=1}^{n_1} λ_i = (1/n_1) E ‖J‖²_F = σ_w^{2L} (E_{z∼N(0,1)} (φ′(√q_∞ z))²)^L = χ_1^L. (2.47)

Hence χ_1^L is the mean average eigenvalue of J^T J for the input-output jacobian of a network of depth L.

Let us assume that our non-linearity is homogeneous: φ(βz) = βφ(z) for β > 0. This property holds for leaky ReLU with arbitrary slope; in particular, it holds in the linear case. Then we have the following:

h_{L+1} = J h_1; q_{L+1} = (1/n_{L+1}) E ‖J h_1‖²₂ = (1/n_{L+1}) h_1^T (E J^T J) h_1 = (1/n_{L+1}) E Σ_{i=1}^{n_1} λ_i (v_i^T h_1)². (2.48)

g_1 = J^T g_{L+1}; δ_1 = (1/n_1) E ‖J^T g_{L+1}‖²₂ = (1/n_1) g_{L+1}^T (E J J^T) g_{L+1} = (1/n_1) E Σ_{i=1}^{n_1} λ_i (u_i^T g_{L+1})². (2.49)

Here λ_i are the eigenvalues of J^T J, while v_i and u_i are the eigenvectors of J^T J and J J^T, respectively.

One can perceive q_{L+1} as the mean normalized squared length of the network output. We may want to study the distribution of normalized squared lengths instead. In this case it suffices to study the distribution of the empirical spectral density:

ρ(x) = (1/n_1) Σ_{i=1}^{n_1} δ(x − λ_i). (2.50)

Besides being random, it converges to a deterministic limiting spectral density ρ̄ as n → ∞ if we assume n_l = α_l n ∀l ∈ [L+1] with α_l constant.

Assume all matrices W_l are square: n_1 = … = n_{L+1} = n. In this case the choice v_l = 1/n normalizes both the forward and the backward dynamics. On the other hand, in the linear case the limiting spectrum can be parameterized as (see [Pennington et al., 2017]):

λ(φ) = sin^{L+1}((L+1)φ) / (sin φ sin^L(Lφ)). (2.51)

We shall prove this result in the upcoming section. Notice that lim_{φ→0} λ(φ) = (L+1)^{L+1}/L^L ∼ e(L+1) for large L. Hence in this case, although we preserve the lengths of input vectors on average, some of the input vectors get expanded with positive probability during forward propagation, while some get contracted. The same holds for the backward dynamics.

2.2.1 Linear case

Our goal is to compute the limiting spectrum of the matrix JJ^T ∈ R^{n×n} with J = ∏_{l=1}^L W_l, where all W_l are composed of i.i.d. gaussians with variance 1/n; this is referred to as the product Wishart ensemble. The case L = 1, WW^T, is known as the Wishart ensemble. The limiting spectrum of the Wishart ensemble is known as the Marchenko-Pastur law [Marchenko and Pastur, 1967]:

ρ_{WW^T}(x) = (1/(2π)) √(4/x − 1) Ind_{[0,4]}(x). (2.52)


It is possible to derive the limiting spectrum of JJ^T by using the so-called S-transform, which we shall define later. The high-level algorithm is the following. First, we compute the S-transform of the Wishart ensemble:

S_{WW^T}(z) = 1/(1 + z). (2.53)

The S-transform has the following fundamental property. Given two asymptotically free random matrices A and B, we have [Voiculescu, 1987]¹:

S_{AB} = S_A S_B (2.54)

in the limit n → ∞. As we shall see later, the S-transform of J_L = ∏_{l=1}^L W_l depends only on traces of the form n^{−1} tr(J_L^k), which are invariant under cyclic permutations of the matrices W_l. This allows us to compute S_{JJ^T}:

S_{JJ^T} = S_{J_L J_L^T} = S_{W_L^T W_L J_{L−1} J_{L−1}^T} = S_{W_L^T W_L} S_{J_{L−1} J_{L−1}^T} = ∏_{l=1}^L S_{W_l^T W_l} = S_{W^T W}^L. (2.55)

The last equation holds since all W_l are distributed identically. The final step is to recover the spectrum of JJ^T from its S-transform.

Free independence. We say that A and B are freely independent, or just free, if:

τ((P_1(A) − τ(P_1(A)))(Q_1(B) − τ(Q_1(B))) ⋯ (P_k(A) − τ(P_k(A)))(Q_k(B) − τ(Q_k(B)))) = 0, (2.56)

where ∀i ∈ [k] P_i and Q_i are polynomials, while τ(A) = n^{−1} E tr(A) is an analogue of the expectation for scalar random variables. Note that τ is a linear operator and τ(I) = 1. Compare the above with the definition of classical independence:

τ((P(A) − τ(P(A)))(Q(B) − τ(Q(B)))) = 0, (2.57)

for all polynomials P and Q. Note that two scalar-valued random variables are free iff one of them is constant; indeed:

E((ξ − Eξ)(η − Eη)(ξ − Eξ)(η − Eη)) = E((ξ − Eξ)²(η − Eη)²) = (E(ξ − Eξ)²)(E(η − Eη)²) = Var ξ Var η. (2.58)

Hence having Var ξ = 0 or Var η = 0 is necessary; this implies ξ = const or η = const, which gives free independence. This means that the notion of free independence is too strong for scalar random variables. The reason for this is their commutativity; only non-commutative objects can have a non-trivial notion of free independence. As for random matrices with classically independent entries, they have a remarkable property: they become free in the limit n → ∞:

lim_{n→∞} τ((P_1(A_n) − τ(P_1(A_n)))(Q_1(B_n) − τ(Q_1(B_n))) ⋯ (P_k(A_n) − τ(P_k(A_n)))(Q_k(B_n) − τ(Q_k(B_n)))) = 0, (2.59)

for A_n and B_n ∈ R^{n×n} such that the moments τ(A_n^k) and τ(B_n^k) are finite for large n for all k ∈ N. We shall say that the two sequences A_n and B_n are asymptotically free as n → ∞.

Asymptotic free independence for Wigner matrices. In order to illustrate the above property, consider X and Y classically independent n × n Wigner matrices, i.e. X_{ij} = X_{ji} ∼ N(0, n^{−1}), and similarly for Y. Of course, τ(X) = τ(Y) = 0, while τ(X²Y²) = n^{−1} tr(E X² E Y²) = n^{−1} tr(I) = 1. Let us compute τ(XYXY):

τ(XYXY) = (1/n) E X_{ij} Y_{jk} X_{kl} Y_{li} = (1/n³)((δ_{ik}δ_{jl} + δ_{il}δ_{jk})(δ_{jl}δ_{ki} + δ_{ji}δ_{kl}) − Cn) = (1/n³)(n² + (3 − C)n) = O_{n→∞}(n^{−1}). (2.60)

This means that X and Y are not freely independent; however, it suggests that they become free in the limit of large n.
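A quick Monte-Carlo sketch of this decay (matrix sizes and trial counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def tau_xyxy(n, trials=20):
    """Monte-Carlo estimate of tau(XYXY) for independent Wigner matrices."""
    vals = []
    for _ in range(trials):
        X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
        Y = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
        X, Y = (X + X.T) / np.sqrt(2.0), (Y + Y.T) / np.sqrt(2.0)
        vals.append(np.trace(X @ Y @ X @ Y) / n)
    return float(np.mean(vals))

for n in [50, 200, 800]:
    print(n, tau_xyxy(n))   # decays with n, consistent with (2.60)
```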

¹ See also https://mast.queensu.ca/~speicher/survey.html


A sum of freely independent random matrices. Before moving to the definition of the S-transform used for finding the product, we discuss a simpler topic: finding the distribution of the sum of freely independent random matrices.

Let ξ and η be scalar-valued independent random variables. The density of their sum can be computed using the characteristic function:

F_{ξ+η}(t) = E e^{i(ξ+η)t} = E(e^{iξt} e^{iηt}) = (E e^{iξt}) · (E e^{iηt}) = F_ξ(t) · F_η(t). (2.61)

The first equality is the definition of the characteristic function. The third equality holds due to the independence of ξ and η. A (generalized) density of their sum can be computed by taking the inverse Fourier transform:

p_{ξ+η}(x) = (1/(2π)) ∫_R e^{−ixt} F_{ξ+η}(t) dt. (2.62)

Let X and Y be random matrix ensembles of size n × n. We cannot apply the same technique to random matrices since they do not generally commute; for this reason, e^{i(X+Y)t} ≠ e^{iXt} e^{iYt} in general, and the second equality of (2.61) does not hold. However, there exists a related technique for freely independent random matrices.

Following [Tao, 2012], define the Stieltjes transform as:

G_X(z) = τ((z − X)^{−1}), (2.63)

where τ(X) = n^{−1} E tr(X). This allows for formal Laurent series which give the following:

G_X(z) = Σ_{k=0}^∞ τ(X^k)/z^{k+1} = Σ_{k=0}^∞ n^{−1} E_X Σ_{i=1}^n λ_i(X)^k / z^{k+1} = Σ_{k=0}^∞ E_X E_{λ∼ρ_X} λ^k / z^{k+1} = Σ_{k=0}^∞ E_{λ∼ρ̄_X} λ^k / z^{k+1} = E_{λ∼ρ̄_X} (z − λ)^{−1}, (2.64)

where ρ̄_X(λ) denotes the expected spectral density:

ρ̄_X(λ) = E_X ρ_X(λ) = (1/n) E_X Σ_{i=1}^n δ(λ − λ_i(X)). (2.65)

Let ζ = G_X(z) = τ((z − X)^{−1}). Here ζ is a function of z; let us assume that z is a function of ζ: z = z_X(ζ). We then have:

(z_X(ζ) − X)^{−1} = ζ(1 − E_X), (2.66)

where τ(E_X) = 0. Rearranging gives:

X = z_X(ζ) − ζ^{−1}(1 − E_X)^{−1}, (2.67)

while for Y we have the same:

Y = z_Y(ζ) − ζ^{−1}(1 − E_Y)^{−1}, (2.68)

and so:

X + Y = z_X(ζ) + z_Y(ζ) − ζ^{−1}((1 − E_X)^{−1} + (1 − E_Y)^{−1}). (2.69)

We have:

(1 − E_X)^{−1} + (1 − E_Y)^{−1} = (1 − E_X)^{−1}(1 − E_X + 1 − E_Y)(1 − E_Y)^{−1} =
= (1 − E_X)^{−1}((1 − E_X)(1 − E_Y) + 1 − E_X E_Y)(1 − E_Y)^{−1} = 1 + (1 − E_X)^{−1}(1 − E_X E_Y)(1 − E_Y)^{−1}. (2.70)

Hence:

(z_X(ζ) + z_Y(ζ) − X − Y − ζ^{−1})^{−1} = ζ(1 − E_Y)(1 − E_X E_Y)^{−1}(1 − E_X). (2.71)

We have:

(1 − E_Y)(1 − E_X E_Y)^{−1}(1 − E_X) = (1 − E_Y) Σ_{k=0}^∞ (E_X E_Y)^k (1 − E_X). (2.72)

The last expression is a sum of alternating products of E_X and E_Y. Since X and Y are freely independent, E_X and E_Y are freely independent too. Applying τ gives:

τ((z_X(ζ) + z_Y(ζ) − X − Y − ζ^{−1})^{−1}) = ζ. (2.73)


At the same time:

τ((z_{X+Y}(ζ) − X − Y)^{−1}) = ζ. (2.74)

Hence:

z_{X+Y}(ζ) = z_X(ζ) + z_Y(ζ) − ζ^{−1}. (2.75)

Define R_X(ζ) = z_X(ζ) − ζ^{−1}. Hence:

R_{X+Y}(ζ) = R_X(ζ) + R_Y(ζ). (2.76)

Alternatively, we can say that the R-transform is a solution of the following equation:

R_X(G_X(z)) + (G_X(z))^{−1} = z. (2.77)

As a sanity check, consider the R-transform of a scalar constant x. In this case, G_x(z) = (z − x)^{−1}. This gives R_x((z − x)^{−1}) + z − x = z, hence R_x((z − x)^{−1}) = x. This means simply R_x ≡ x.

S-transform. Let us now define the S-transform. We start with defining the moment generating function M:

M(z) = zG(z) − 1 = Σ_{k=1}^∞ τ(X^k)/z^k = Σ_{k=1}^∞ E_{λ∼ρ̄_X} λ^k / z^k = Σ_{k=1}^∞ m_k(X)/z^k, (2.78)

where the k-th moment of ρ̄_X is defined as follows:

m_k(X) = E_{λ∼ρ̄_X} λ^k = τ(X^k). (2.79)

The moment generating function M is a mapping from C \ {0} to C. Let M^{−1} be its functional inverse. We are now ready to define the S-transform:

S(z) = (1 + z) / (z M^{−1}(z)). (2.80)

In order to get some intuition concerning the property (2.54), consider the case ρ̄(λ) = δ(λ − x). In this case M(z) = x/(z − x); hence z = x(1 + 1/M(z)). This gives M^{−1}(z) = x(1 + 1/z), and S(z) = 1/x, which obviously satisfies the property.

Recovering the limiting spectrum. We are not going to compute the S-transform (2.53) of the Wishart ensemble (2.52), but we aim to recover the spectrum of the product Wishart ensemble from its S-transform (2.55). We have:

S_{JJ^T}(z) = S_{W^T W}^L(z) = 1/(1 + z)^L, M^{−1}_{JJ^T}(z) = (1 + z)^{L+1}/z. (2.81)

First we need to recover the Stieltjes transform G. Recall M_{JJ^T}(z) = zG_{JJ^T}(z) − 1. This gives:

z = M^{−1}_{JJ^T}(M_{JJ^T}(z)) = (zG_{JJ^T}(z))^{L+1} / (zG_{JJ^T}(z) − 1), (2.82)

or:

zG_{JJ^T}(z) − 1 = z^L G_{JJ^T}(z)^{L+1}. (2.83)

This equation gives a principled way to recover G_{JJ^T}. However, our goal is the spectral density ρ̄_{JJ^T}. The density can be recovered from its Stieltjes transform using the inversion formula:

ρ(λ) = −(1/π) lim_{ε→0+} ℑ G(λ + iε). (2.84)

Indeed:

lim_{ε→0+} ℑ G(λ + iε) = lim_{ε→0+} ℑ ∫ ρ(t)/(λ − t + iε) dt = lim_{ε→0+} ℑ ∫ ρ(t)(λ − t − iε)/((λ − t)² + ε²) dt =
= lim_{ε→0+} ∫ (−ε)ρ(t)/((λ − t)² + ε²) dt = lim_{ε→0+} ∫ (−ε)ρ(u + λ)/(u² + ε²) du = lim_{ε→0+} ∫ (−1)ρ(vε + λ)/(v² + 1) dv = −πρ(λ). (2.85)


Hence we should consider z = λ + iε and take the limit ε → 0+. Assume also G_{JJ^T}(λ + iε) = re^{iφ}. Substituting this into (2.83) gives:

(λ + iε) re^{iφ} − 1 = (λ + iε)^L r^{L+1} e^{i(L+1)φ}. (2.86)

Let us consider the real and imaginary parts of this equation separately:

r(λ cos φ + O(ε)) − 1 = λ^L r^{L+1}((1 + O(ε²)) cos((L+1)φ) + O(ε)); (2.87)

r(λ sin φ + O(ε)) = λ^L r^{L+1}((1 + O(ε²)) sin((L+1)φ) + O(ε)). (2.88)

Taking the limit ε → 0+ gives:

rλ cos φ − 1 = λ^L r^{L+1} cos((L+1)φ), rλ sin φ = λ^L r^{L+1} sin((L+1)φ). (2.89)

Consequently:

rλ sin φ / (rλ cos φ − 1) = tan((L+1)φ), r^L = λ^{1−L} sin φ / sin((L+1)φ). (2.90)

From the first equality we get:

r = λ^{−1} · 1/(cos φ − sin φ/tan((L+1)φ)) = λ^{−1} sin((L+1)φ) / sin(Lφ). (2.91)

This equality together with the second one on the previous line gives:

1 = λ sin φ sin^L(Lφ) / sin^{L+1}((L+1)φ). (2.92)

Hence:

λ = sin^{L+1}((L+1)φ) / (sin φ sin^L(Lφ)). (2.93)

We also get the density:

ρ(λ) = −(1/π) r sin φ = −(1/π) sin²φ sin^{L−1}(Lφ) / sin^L((L+1)φ). (2.94)

For the sake of convenience, we substitute φ with −φ; this gives:

λ = sin^{L+1}((L+1)φ) / (sin φ sin^L(Lφ)), ρ(λ(φ)) = (1/π) sin²φ sin^{L−1}(Lφ) / sin^L((L+1)φ). (2.95)

All eigenvalues of JJ^T are real and non-negative. This gives us a constraint: φ ∈ [0, π/(L+1)]. The left edge of this segment gives the maximal λ = (L+1)^{L+1}/L^L, while the right edge gives the minimum: λ = 0. Note that the same constraint results in a non-negative spectral density.

As a sanity check, take L = 1 and compare with (2.52):

λ = sin²(2φ)/sin²φ = 4 cos²φ, ρ(λ(φ)) = (1/π) sin²φ/sin(2φ) = (1/(2π)) tan φ = (1/(2π)) √(1/cos²φ − 1) = (1/(2π)) √(4/λ(φ) − 1). (2.96)
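The L = 1 case can also be checked by simulation (a sketch; the matrix size is a toy assumption, and finite-n fluctuations are expected):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
eigs = np.linalg.eigvalsh(W @ W.T)

# Marchenko-Pastur density (2.52) in the form recovered in (2.96).
centers = np.linspace(0.05, 3.95, 40)
rho = np.sqrt(4.0 / centers - 1.0) / (2.0 * np.pi)

hist, _ = np.histogram(eigs, bins=40, range=(0.0, 4.0), density=True)
print(np.max(np.abs(hist[1:] - rho[1:])))  # small away from the singular left edge
```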

2.2.2 ReLU case

For gaussian initialization, we expect problems similar to those we had in the linear case. However, curing the expanding jacobian spectrum for a linear net with square layers is easy: one has to assume orthogonal initialization instead of i.i.d. gaussian:

W_l ∼ U(O_{n×n}) ∀l ∈ [L]. (2.97)

In this case ‖J h_1‖ = ‖h_1‖ a.s.; the same holds for ‖J^T g_{L+1}‖. The goal of the current section is to check whether orthogonal initialization helps in the ReLU case.


Similarly to the linear case, we have:

S_{JJ^T} = S_{J_L J_L^T} = S_{D_L W_L^T W_L D_L J_{L−1} J_{L−1}^T} = S_{D_L W_L^T W_L D_L} S_{J_{L−1} J_{L−1}^T} =
= S_{D_L² W_L W_L^T} S_{J_{L−1} J_{L−1}^T} = S_{D_L²} S_{W_L W_L^T} S_{J_{L−1} J_{L−1}^T} = ∏_{l=1}^L S_{D_l²} S_{W_l W_l^T} = S_{WW^T}^L ∏_{l=1}^L S_{D_l²}. (2.98)

Consider orthogonal initialization. In order to normalize the forward and backward dynamics, we have to introduce a factor σ_w = √2:

W_l ∼ σ_w U(O_{n×n}) ∀l ∈ [L]. (2.99)

For a scaled orthogonal matrix W, S_{WW^T} = S_{σ_w² I} ≡ σ_w^{−2} = 1/2. We then have to compute S_{D_l²}.

Since we have assumed that ∀l ∈ [L] h_l ∼ N(0, q_∞), the spectrum of D_l² is given simply as:

ρ_{D_l²}(x) = (1/2) δ(x) + (1/2) δ(x − 1). (2.100)

Taking the Stieltjes transform we get:

G_{D_l²}(z) = (1/2)(1/z + 1/(z − 1)). (2.101)

This gives the moment generating function and its inverse:

M_{D_l²}(z) = 1/(2(z − 1)), M^{−1}_{D_l²}(z) = 1/(2z) + 1. (2.102)

Finally, we get the S-transform:

S_{D_l²}(z) = (z + 1) / (z M^{−1}_{D_l²}(z)) = (z + 1)/(z + 1/2). (2.103)

The S-transform of JJ^T is then given as:

S_{JJ^T} = σ_w^{−2L} ((z + 1)/(z + 1/2))^L, (2.104)

M^{−1}_{JJ^T}(z) = σ_w^{2L} (z + 1/2)^L / (z(z + 1)^{L−1}). (2.105)

Recall M_{JJ^T}(z) = zG_{JJ^T}(z) − 1. This gives:

z = M^{−1}_{JJ^T}(M_{JJ^T}(z)) = σ_w^{2L} (zG_{JJ^T}(z) − 1/2)^L / ((zG_{JJ^T}(z) − 1)(zG_{JJ^T}(z))^{L−1}), (2.106)

or:

z(zG_{JJ^T}(z) − 1)(zG_{JJ^T}(z))^{L−1} = (2zG_{JJ^T}(z) − 1)^L. (2.107)

As before, take z = λ + iε with ε → 0+ and write G_{JJ^T} = re^{iφ}. Taking the imaginary part of (2.107) gives a sequence of transformations:

λ²r sin φ (λ^{L−1}r^{L−1} cos((L−1)φ)) + λ(λr cos φ − 1)(λ^{L−1}r^{L−1} sin((L−1)φ)) = L(2λr cos φ − 1)^{L−1} 2λr sin φ + O(sin²φ); (2.108)

λ^{L+1}r^L sin(Lφ) − λ^L r^{L−1} sin((L−1)φ) = L(2λr cos φ − 1)^{L−1} 2λr sin φ + O(sin²φ); (2.109)

λ^{L+1}r^L L sin φ − λ^L r^{L−1}(L−1) sin φ = L(2λr − 1)^{L−1} 2λr sin φ + O(sin²φ). (2.110)

Hence for φ = 0 we have:

λ^{L+1}r^L L − λ^L r^{L−1}(L−1) = L(2λr − 1)^{L−1} 2λr = L(2λr − 1)^L + L(2λr − 1)^{L−1}. (2.111)

The real part of (2.107) in its turn gives:

λ(λr cos φ − 1)(λ^{L−1}r^{L−1} cos((L−1)φ)) − λ²r sin φ (λ^{L−1}r^{L−1} sin((L−1)φ)) = (2λr cos φ − 1)^L + O(sin²φ); (2.112)

λ^{L+1}r^L cos(Lφ) − λ^L r^{L−1} cos((L−1)φ) = (2λr cos φ − 1)^L + O(sin²φ); (2.113)

λ^{L+1}r^L − λ^L r^{L−1} = (2λr − 1)^L + O(sin²φ). (2.114)

Hence for φ = 0 we have:

λ^{L+1}r^L − λ^L r^{L−1} = (2λr − 1)^L. (2.115)

Eq. (2.111) − L × eq. (2.115) results in:

λ^L r^{L−1} = L(2λr − 1)^{L−1}. (2.116)

Putting this into (2.115) gives:

L(2λr − 1)^{L−1}(λr − 1) = (2λr − 1)^L; (2.117)

L(λr − 1) = 2λr − 1; (2.118)

λr = (L − 1)/(L − 2). (2.119)

Putting this back into (2.116) gives:

λ ((L−1)/(L−2))^{L−1} = L (L/(L−2))^{L−1}; (2.120)

λ = L (L/(L−1))^{L−1} = L (1 + 1/(L−1))^{L−1}. (2.121)

The last expression is asymptotically equivalent to eL for large L. Hence the spectrum of JJ^T expands with depth at least linearly in the ReLU case, even for orthogonal initialization.
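A direct simulation of a ReLU net with scaled orthogonal weights shows the same growth of the spectral edge (a sketch; width, depth, and the Gaussian input are toy assumptions, and the empirical edge only approximates the limiting prediction at finite n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 400, 10

def haar_orthogonal(n):
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))      # Haar-distributed orthogonal matrix

h = rng.normal(size=n)                  # h_1
J = np.eye(n)
for _ in range(L):
    D = (h > 0).astype(float)           # diag of D_l = diag(phi'(h_l)), ReLU
    W = np.sqrt(2.0) * haar_orthogonal(n)   # sigma_w^2 = 2, eq. (2.99)
    J = W @ (D[:, None] * J)            # chain rule: J <- W_l D_l J
    h = W @ np.maximum(h, 0.0)          # h_{l+1} = W_l phi(h_l)

lam_max = np.linalg.eigvalsh(J @ J.T).max()
print(lam_max, L * (1.0 + 1.0 / (L - 1)) ** (L - 1))   # both grow like e*L
```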

2.3 GD dynamics for orthogonal initialization

It seems natural that a well-conditioned jacobian is necessary for trainability. But does a well-conditioned jacobian ensure trainability? In fact, yes, in the linear case. Following [Saxe et al., 2013], we will show that for a linear net with L hidden layers initialized orthogonally and trained with the square loss, the number of optimization steps required to reach the minimum does not depend on L for large L.

Shallow nets. In order to show this, we start with the case L = 1:

f(x) = W_1 W_0 x. (2.122)

Consider the square loss:

ℓ(y, z) = (1/2)‖y − z‖²₂, L = E_{x,y} ℓ(y, f(x)). (2.123)

The gradient descent dynamics (in the continuous-time limit) reads:

Ẇ_0 = η E_{x,y} W_1^T (yx^T − W_1 W_0 xx^T), Ẇ_1 = η E_{x,y} (yx^T − W_1 W_0 xx^T) W_0^T. (2.124)

Define Σ^{xx} = E xx^T, the input correlation matrix, and Σ^{xy} = E yx^T, the input-output correlation matrix. Assume then that the data is whitened: Σ^{xx} = I. Consider an SVD decomposition of the input-output correlation:

Σ^{xy} = U_2 S_{2,0} V_0^T = Σ_{r=1}^n s_r u_r v_r^T. (2.125)

Perform a change of basis:

W̄_1 = U_2^T W_1, W̄_0 = W_0 V_0. (2.126)

The dynamics becomes:

Ẇ̄_0 = η W̄_1^T (S_{2,0} − W̄_1 W̄_0), Ẇ̄_1 = η (S_{2,0} − W̄_1 W̄_0) W̄_0^T. (2.127)


Note that while the matrix element W_{0,ij} connects a hidden neuron i to an input neuron j, the matrix element W̄_{0,iα} connects a hidden neuron i to an input mode α. Let W̄_0 = [a_1, …, a_n], while W̄_1 = [b_1, …, b_n]^T. Then we get:

(1/η) ȧ_α = s_α b_α − Σ_{γ=1}^n b_γ (b_γ^T a_α) = (s_α − b_α^T a_α) b_α − Σ_{γ≠α} (b_γ^T a_α) b_γ; (2.128)

(1/η) ḃ_α = s_α a_α − Σ_{γ=1}^n (a_γ^T b_α) a_γ = (s_α − a_α^T b_α) a_α − Σ_{γ≠α} (a_γ^T b_α) a_γ. (2.129)

This dynamics is a GD dynamics on the following energy function:

E = (1/2) Σ_{α=1}^n (s_α − a_α^T b_α)² + (1/2) Σ_{α≠γ} (a_α^T b_γ)². (2.130)

Let us assume that there exists an orthogonal matrix R = [r_1, …, r_n] such that a_α ∝ r_α and b_α ∝ r_α. In other words, W̄_0 = R D_0 and W̄_1 = D_1 R^T, where D_0 and D_1 are diagonal matrices. Note that in this case W_1 = U_2 D_1 R^T, while W_0 = R D_0 V_0^T.

Given this, the dynamics above decomposes into a system of independent equations of the same form:

ȧ = η(s − ab) b, ḃ = η(s − ab) a. (2.131)

Note that a² − b² is a motion integral, while the energy function for each individual equation depends only on ab: E = (s − ab)²/2.

There exists a solution of these equations with a = b. In this case D_0 = D_1. Let u = ab. We have:

u̇ = 2η(s − u) u. (2.132)

This ODE is integrable:

t = (1/η) ∫_{u_0}^{u_f} du/(2u(s − u)) = (1/(2sη)) ∫_{u_0}^{u_f} (du/u + du/(s − u)) = (1/(2sη)) (ln(u_f/u_0) − ln((u_f − s)/(u_0 − s))) = (1/(2sη)) ln(u_f(u_0 − s)/(u_0(u_f − s))). (2.133)

Note that u = s is a global minimizer. Hence the time required to reach u_f = s − ε from u_0 = ε is:

t = (1/(2sη)) ln((s − ε)²/ε²) = (1/(sη)) ln(s/ε − 1) ∼ (1/(sη)) ln(s/ε) for ε → 0. (2.134)

This means that the larger the correlation s between the input and output modes, the faster the convergence is.
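The closed-form hitting time (2.134) is easy to verify by integrating the mode ODE directly (a sketch; η, s, and ε are arbitrary toy values):

```python
import numpy as np

eta, s, eps = 1e-3, 3.0, 1e-2

# Integrate u' = 2*eta*(s - u)*u, the a = b case of (2.131)-(2.132),
# and time the passage from u0 = eps to uf = s - eps.
u, t, dt = eps, 0.0, 1e-2
while u < s - eps:
    u += dt * 2.0 * eta * (s - u) * u
    t += dt

t_pred = np.log((s - eps) ** 2 / eps ** 2) / (2.0 * s * eta)   # eq. (2.134)
print(t, t_pred)   # the simulated hitting time matches the closed form
```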

Deep nets. Let us proceed with a linear network with L hidden layers in the same setup:

f(x) = (∏_{l=0}^L W_l) x. (2.135)

The gradient descent dynamics reads:

Ẇ_l = η E_{x,y} (∏_{l′=l+1}^L W_{l′})^T (yx^T − (∏_{l′=0}^L W_{l′}) xx^T) (∏_{l′=0}^{l−1} W_{l′})^T ∀l ∈ [L]_0. (2.136)

Again, assume that Σ^{xx} = I and Σ^{xy} = U_{L+1} S_{L+1,0} V_0^T. Moreover, in analogy to the shallow case, suppose W_l = R_{l+1} D_l R_l^T for l ∈ [L]_0, where D_l is a diagonal matrix, while the R_l are orthogonal; R_0 = V_0, R_{L+1} = U_{L+1}. Note that if all W_l are themselves orthogonal and ∏_{l=0}^L W_l = U_{L+1} V_0^T, then the assumption above holds for D_l = I ∀l ∈ [L]_0, R_0 = V_0, R_{l+1} = W_l R_l. This gives:

Ḋ_l = η (∏_{l′=l+1}^L D_{l′})^T (S_{L+1,0} − ∏_{l′=0}^L D_{l′}) (∏_{l′=0}^{l−1} D_{l′})^T ∀l ∈ [L]_0. (2.137)


The latter decouples into independent modes:

ȧ_l = η (s − ∏_{l′=0}^L a_{l′}) ∏_{l′≠l} a_{l′}, (2.138)

which is a gradient descent on the following energy function:

E(a_{0:L}) = (1/2) (s − ∏_{l=0}^L a_l)². (2.139)

Again, we are looking for solutions of the form a_0 = … = a_L. Define u = ∏_{l=0}^L a_l. This gives an ODE:

u̇ = η(L+1) u^{2L/(L+1)} (s − u). (2.140)

For large L we can approximate this equation with u̇ = η(L+1) u² (s − u) (why?), which is easily integrable:

t = (1/((L+1)η)) ∫_{u_0}^{u_f} du/(u²(s − u)) = (1/((L+1)sη)) ∫_{u_0}^{u_f} (du/u² + du/(u(s − u))) = (1/((L+1)sη)) (1/u_0 − 1/u_f + (1/s) ln(u_f(u_0 − s)/(u_0(u_f − s)))). (2.141)

We see that t ∼ L^{−1}: training time decreases as the number of layers grows. Note, however, that we cannot perform a true gradient flow; we perform gradient descent with discrete steps instead. Hence we have to count the number of steps as a function of L.

The optimal learning rate is inversely proportional to the maximum eigenvalue of the Hessian of the energy function observed during training. Let us first compute the Hessian:

∇_i := ∂E/∂a_i = −(s − ∏_{l=0}^L a_l) ∏_{l≠i} a_l. (2.142)

∇²_{ij} := ∂²E/∂a_i ∂a_j = (∏_{l≠i} a_l)(∏_{l≠j} a_l) − (s − ∏_{l=0}^L a_l) ∏_{l≠i,j} a_l for i ≠ j. (2.143)

∇²_{ii} := ∂²E/∂a_i² = (∏_{l≠i} a_l)². (2.144)

Taking into account our assumption a_0 = … = a_L = a, we get:

∇_i = −(s − a^{L+1}) a^L, ∇²_{ij} = 2a^{2L} − s a^{L−1}, ∇²_{ii} = a^{2L}. (2.145)

There is an eigenvector v_1 = [1, …, 1]^T with eigenvalue λ_1 = ∇²_{ii} + L∇²_{ij} = (1 + 2L)a^{2L} − sLa^{L−1}. Also, there are L eigenvectors of the form v_i = [1, 0, …, 0, −1, 0, …, 0]^T with eigenvalue λ_i = ∇²_{ii} − ∇²_{ij} = sa^{L−1} − a^{2L}. Notice that for large L, λ_1 becomes the largest eigenvalue irrespective of a.
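The eigenvalue formulas are easy to verify numerically by assembling the Hessian from (2.145) (a sketch; L, s, and a are arbitrary toy values):

```python
import numpy as np

L, s, a = 8, 2.0, 0.9
n = L + 1                            # number of factors a_0, ..., a_L

# Hessian of E at a_0 = ... = a_L = a, assembled from (2.145).
H = np.full((n, n), 2.0 * a ** (2 * L) - s * a ** (L - 1))
np.fill_diagonal(H, a ** (2 * L))

lam1 = (1 + 2 * L) * a ** (2 * L) - s * L * a ** (L - 1)
lam_rest = s * a ** (L - 1) - a ** (2 * L)
predicted = np.sort(np.array([lam1] + [lam_rest] * L))

print(np.max(np.abs(np.sort(np.linalg.eigvalsh(H)) - predicted)))  # ~0
```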

During the scope of optimization, u travels inside the segment [0, s], hence a lies inside [0, s^{1/(L+1)}]. Let us find the maximum of λ_1 on this segment:

dλ_1/da = 2L(1 + 2L)a^{2L−1} − sL(L−1)a^{L−2}. (2.146)

Equating this derivative to zero yields:

a_* = (s(L−1)/(2(1 + 2L)))^{1/(L+1)} = s^{1/(L+1)} ((L−1)/(2(1 + 2L)))^{1/(L+1)} < s^{1/(L+1)}. (2.147)

The second solution is, of course, a = 0 if L > 2. Therefore we have three candidates for the maximum: a = 0, a = s^{1/(L+1)}, and a = a_*. Let us check them:

λ_1(0) = 0, λ_1(s^{1/(L+1)}) = (1 + L)s^{2L/(L+1)} ≥ 0. (2.148)


λ_1(a_*) = s^{2L/(L+1)}(1 + 2L)((L−1)/(2(1 + 2L)))^{2L/(L+1)} − sLs^{(L−1)/(L+1)}((L−1)/(2(1 + 2L)))^{(L−1)/(L+1)} =
= s^{2L/(L+1)} (1/(2(1 + 2L)))^{(L−1)/(L+1)} ((L−1)^{2L/(L+1)} − L(L−1)^{(L−1)/(L+1)}) =
= −s^{2L/(L+1)} (1/(2(1 + 2L)))^{(L−1)/(L+1)} (L−1)^{(L−1)/(L+1)} =
= −s^{2L/(L+1)} ((L−1)/(2(1 + 2L)))^{(L−1)/(L+1)} ≤ 0. (2.149)

Hence the maximal λ_1 during the scope of optimization is λ_1(s^{1/(L+1)}) = (1 + L)s^{2L/(L+1)}. Recall that the optimal learning rate is inversely proportional to the maximal eigenvalue of the Hessian:

η_opt ∝ 1/max_t λ_1 = (L + 1)^{−1} s^{−2L/(L+1)}. (2.150)

Substituting it into t yields:

t_opt = (1/((L+1)sη_opt)) (1/u_0 − 1/u_f + (1/s) ln(u_f(u_0 − s)/(u_0(u_f − s)))) = s^{(L−1)/(L+1)} (1/u_0 − 1/u_f + (1/s) ln(u_f(u_0 − s)/(u_0(u_f − s)))). (2.151)

This expression asymptotically does not depend on L. In other words, the training time (in terms of the number of gradient steps) for very deep nets does not depend on depth.
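This depth-independence can be probed with discrete gradient descent on the mode energy (2.139) (a sketch; s, the initial product u_0, and the tolerance are toy assumptions, and the balanced initialization a_l = u_0^{1/(L+1)} is chosen so that all depths start from the same u_0):

```python
import numpy as np

def steps_to_converge(L, s=2.0, u0=1e-2, tol=1e-3):
    """Discrete GD on E(a) = (s - prod a_l)^2 / 2 with the learning rate
    (2.150); returns the number of steps until |prod a_l - s| < tol."""
    a = np.full(L + 1, u0 ** (1.0 / (L + 1)))      # balanced start, prod = u0
    eta = 1.0 / ((L + 1) * s ** (2.0 * L / (L + 1)))
    for step in range(1, 100_000):
        u = np.prod(a)
        grad = -(s - u) * u / a        # dE/da_l = -(s - u) prod_{l' != l} a_l'
        a -= eta * grad
        if abs(np.prod(a) - s) < tol:
            return step
    return None

for L in [2, 5, 10, 20, 40]:
    print(L, steps_to_converge(L))
# The step count stays roughly constant in L, as (2.151) predicts.
```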


Chapter 3

Loss landscape

The neural network training process can be viewed as an optimization problem:

L(θ) = E_{x,y∈S_m} ℓ(y, f(x; θ)) → min_θ, (3.1)

where ℓ is a loss function assumed to be convex, f(·; θ) is a neural net with parameters θ, and S_m = {(x_i, y_i)}_{i=1}^m is a dataset of size m sampled from the data distribution D.

This problem is typically non-convex, hence we do not have any guarantees for gradient descent convergence in general. Nevertheless, in realistic setups we typically observe that gradient descent always succeeds in finding the global minimum of L; moreover, this is done in reasonable time. This observation leads us to the following hypotheses:

1. If the neural network and the data satisfy certain properties, all local minima of L(θ) are global.

2. If the neural network and the data satisfy certain properties, gradient descent converges to a global minimum at a good rate with high probability.

Neither of the two hypotheses is stronger than the other. Indeed, having all local minima global does not tell us anything about the convergence rate, while having a convergence guarantee with high probability does not rule out the possibility of having (a few) local minima.

In the present chapter we shall discuss the first hypothesis only; the second one will be discussed later in the context of the Neural Tangent Kernel.

3.1 Wide non-linear nets

It turns out that if the training data is consistent, one can prove globality of local minima if the network is wide enough.

Following [Yu and Chen, 1995], we start with the simplest case of a two-layered net trained to minimize the square loss:

f(x; W_{0,1}) = W_1 φ(W_0 x); (3.2)

L(W_{0,1}) = (1/2) Σ_{i=1}^m ‖y_i − f(x_i; W_{0,1})‖²₂ = (1/2) ‖Y − W_1 φ(W_0 X)‖²_F, (3.3)

where W_l ∈ R^{n_{l+1}×n_l}, x_i ∈ R^{n_0}, y_i ∈ R^{n_2}, X ∈ R^{n_0×m}, and Y ∈ R^{n_2×m}.

Let W*_{0,1} be a local minimum of L. Consider L with W_0 fixed to W*_0:

L_{W*_0}(W_1) = (1/2) ‖Y − W_1 φ(W*_0 X)‖²_F. (3.4)

Since W*_{0,1} is a minimum of L, W*_1 is a minimum of L_{W*_0}. Minimizing L_{W*_0}(W_1) is a convex problem. Hence W*_1 is a global minimum of L_{W*_0}.


Denote H_1 = W_0 X and X_1 = φ(H_1); then L_{W*_0}(W_1) = (1/2) ‖Y − W_1 X*_1‖²_F. Hence rk X*_1 = m implies min L_{W*_0}(W_1) = 0. Since W*_1 is a global minimum of L_{W*_0}, L(W*_{0,1}) = L_{W*_0}(W*_1) = 0; hence W*_{0,1} is a global minimum of L.

Suppose rk X*_1 < m. If we still have min L_{W*_0}(W_1) = 0, we arrive at the same conclusion as previously. Suppose then L(W*_{0,1}) = L_{W*_0}(W*_1) = min L_{W*_0}(W_1) > 0. We shall prove that W*_{0,1} cannot be a minimum of L in this case, as long as the conditions of the following lemma hold:

Lemma 1. Suppose φ is non-zero real analytic. If n_1 ≥ m and x_i ≠ x_j ∀i ≠ j, then µ{W_0 : rk X_1 < m} = 0, where µ is the Lebesgue measure on R^{n_1×n_0}.

Since L(W*_{0,1}) > 0 and L is a continuous function of W_{0,1}, ∃ε > 0 : ∀W_{0,1} ∈ B_ε(W*_{0,1}) L(W_{0,1}) > 0. By virtue of the lemma, ∀δ > 0 ∃W′_0 ∈ B_δ(W*_0) : rk X′_1 ≥ m.

Take δ ∈ (0, ε). In this case L(W′_0, W*_1) > 0, while rk X′_1 ≥ m. Note that minimizing L_{W′_0}(W_1) is a convex problem and min L_{W′_0}(W_1) = 0 (since rk X′_1 ≥ m). Hence a (continuous-time) gradient descent on L_{W′_0} that starts from W*_1 converges to a point W*,′_1 for which L_{W′_0}(W*,′_1) = 0. Because of the latter, (W′_0, W*,′_1) ∉ B_ε(W*_{0,1}).

Overall, we have the following: ∃ε > 0 : ∀δ ∈ (0, ε) ∃(W′_0, W*_1) ∈ B_δ(W*_{0,1}) such that a continuous-time gradient descent on L that starts from (W′_0, W*_1) and acts only on W_1 converges to a point (W′_0, W*,′_1) ∉ B_ε(W*_{0,1}).

Obviously, we can replace "∀δ ∈ (0, ε)" with "∀δ > 0". Given this, the statement above means that the gradient flow dynamics that acts only on W_1 is unstable in the Lyapunov sense at W*_{0,1}. Hence W*_{0,1} cannot be a minimum, and hence having min L(W_{0,1}) > 0 at a local minimum is impossible as long as the conditions of Lemma 1 hold. This means that all local minima of L are global.

Let us prove Lemma 1. Let I_m ⊂ [n_1] with |I_m| = m. Consider X_{1,I_m} ∈ R^{m×m}, the subset of rows of X_1 indexed by I_m. Note that rk X_1 < m is equivalent to det X_{1,I_m} = 0 ∀I_m.

Since φ is analytic, det X_{1,I_m} is an analytic function of W_0 ∀I_m. We shall use the following lemma:

Lemma 2. Given the conditions of Lemma 1, ∃W_0 : rk X_1 = m.

Given this, ∃W_0 : ∃I_m : det X_{1,I_m} ≠ 0. Since the determinant is an analytic function of W_0, µ{W_0 : det X_{1,I_m} = 0} = 0. This implies the statement of Lemma 1.
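Lemma 1 is easy to probe numerically (a sketch; the sizes are toy assumptions): for a generic draw of W_0 with n_1 ≥ m, the first hidden representation has full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m = 10, 16, 16              # n_1 >= m; gaussian inputs are distinct a.s.

X = rng.normal(size=(n0, m))        # columns are the m training inputs
W0 = rng.normal(size=(n1, n0)) / np.sqrt(n0)
X1 = np.tanh(W0 @ X)                # X_1 = phi(W_0 X), phi real analytic

print(np.linalg.matrix_rank(X1))    # m: full rank for a generic W_0
```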

3.1.1 Possible generalizations

Let us list the properties we have used to prove the theorem of [Yu and Chen, 1995]:

1. The loss is square.

2. The number of hidden layers L is one.

3. n_L ≥ m.

4. φ is real analytic.

Can we relax any of them? First, note that it is enough to have "rk X*_1 = m implies min L_{W*_0}(W_1) = 0". For this it is enough to have ℓ(y, z) convex with respect to z with min_z ℓ(y, z) = 0 ∀y (in particular, the minimum should exist). For this reason, the cross-entropy loss requires a more sophisticated analysis.

In order to relax the second property, it suffices to generalize Lemma 1:

Lemma 3. Suppose φ is non-zero real analytic and l ∈ [L]. If n_l ≥ m and x_i ≠ x_j ∀i ≠ j, then µ{W_{0:l−1} : rk X_l < m} = 0.

The generalized version is proven in [Nguyen and Hein, 2017].

As for the third property, we may want to relax it in two directions. First, we may require not the last hidden layer, but some hidden layer to be wide enough. Second, we may try to make the lower bound on the number of hidden units smaller. It seems that the second direction is not possible for a general dataset S_m: one has to assume some specific properties of the data in order to improve the lower bound.


Deep nets with analytic activations

Following [Nguyen and Hein, 2017], let us elaborate on the first direction. We start with defining the forward dynamics:

H_{l+1} = W_l X_l, X_l = φ(H_l) ∀l ∈ [L], H_1 = W_0 X, (3.5)

where X ∈ R^{n_0×m}, W_l ∈ R^{n_{l+1}×n_l}, H_l ∈ R^{n_l×m}. We also define the backward dynamics:

G_l = ∂L/∂H_l = φ′(H_l) ⊙ (W_l^T ∂L/∂H_{l+1}) = φ′(H_l) ⊙ (W_l^T G_{l+1}), (3.6)

G_{L+1} = ∂L/∂H_{L+1} = (1/2) ∂/∂H_{L+1} ‖Y − H_{L+1}‖²_F = H_{L+1} − Y, (3.7)

where G_l ∈ R^{n_l×m}. Then we have:

∇_l = ∂L/∂W_l = G_{l+1} X_l^T ∈ R^{n_{l+1}×n_l}. (3.8)

Let W*_{0:L} be a local minimum and suppose n_l ≥ m for some l ∈ [L]. As previously, we divide our reasoning into two parts: in the first part we assume that rk X*_l = m, while in the second one we show that if rk X*_l < m, then W*_{0:L} cannot be a minimum.

Assume rk X*_l = m. We have:

0 = ∇*_l = G*_{l+1} X*,T_l, (3.9)

or,

X*_l G*,T_{l+1} = 0 ∈ R^{n_l×n_{l+1}}. (3.10)

Each column of the left-hand side is a linear combination of the columns of X*_l. Since the columns of X*_l are linearly independent, G*_{l+1} = 0. By the recurrent relation,

0 = G*_{l+1} = φ′(H*_{l+1}) ⊙ (W*,T_{l+1} G*_{l+2}). (3.11)

Assume that φ′ is nowhere zero. This gives:

W*,T_{l+1} G*_{l+2} = 0 ∈ R^{n_{l+1}×m}. (3.12)

If we assume that the columns of W*,T_{l+1} (or, equivalently, the rows of W*_{l+1}) are linearly independent, we shall get G*_{l+2} = 0. Linearly independent rows of W*_{l+1} is equivalent to rk W*_{l+1} = n_{l+2}, which implies n_{l+1} ≥ n_{l+2}.

Suppose this assumption holds. If we moreover assume that rk W*_{l′} = n_{l′+1} ∀l′ ∈ {l+1, …, L}, we get G*_{L+1} = 0. This implies L(W*_{0:L}) = 0. The assumption on the ranks of W*_{l′} requires n_{l′} ≥ n_{l′+1} ∀l′ ∈ {l+1, …, L}: the network does not expand after the l-th layer.

Now we assume rk X*_l < m and L(W*_{0:L}) > 0, while still n_l ≥ m. In the shallow case we have shown that, first, an infinitesimal perturbation of W*_{0:l−1} results in rk X_l = m, and second, starting from this perturbed point, the gradient descent dynamics leaves a sufficiently large vicinity of W*_{0:L}. These two together imply that W*_{0:L} cannot be a minimum, which is a contradiction.

While both statements still hold if l = L, the second one does not hold for 0 < l < L, since the problem of minimizing L_{W*_{0:l−1}}(W_{l:L}) is still non-convex; hence we have no guarantees on gradient descent convergence. Hence for the case 0 < l < L we have to come up with another way of reasoning.

Define:

u = vec(W_{0:l−1}), v = vec(W_{l:L}), ψ = ∂L(u, v)/∂v. (3.13)

Since (u*, v*) is a minimum, we have ψ(u*, v*) = 0. Assume that the jacobian of ψ with respect to v is non-singular at (u*, v*):

det(J_v ψ(u*, v*)) ≠ 0. (3.14)

Note that in the case l = L this property is equivalent to rk X*_L = n_L:

ψ(u, v)_{i n_L + j} = (W_{L,ik} X_L^{kl} − Y_i^l) X_{L,jl}; (J_v ψ(u, v))_{i n_L + j, i′ n_L + j′} = δ_{ii′} X_{L,j′}^l X_{L,jl} = δ_{ii′} (X_L X_L^T)_{jj′}. (3.15)


We see that J_v ψ(u, v) is a block-diagonal matrix built of n_{L+1} identical blocks X_L X_L^T. Its determinant at W*_{0:L} is therefore (det(X*_L X*_L^T))^{n_{L+1}}, which is positive as long as rk X*_L = n_L. Note that rk X*_L ≤ m, hence we need n_L ≤ m. Note also that we need n_L ≥ m in order to apply Lemma 1; hence Condition (3.14) actually requires the stronger property n_L = m instead of the n_L ≥ m used before.

Condition (3.14) allows us to apply the implicit function theorem:

∃δ_1 > 0 : ∃v ∈ C¹(B_{δ_1}(u*)) : v(u*) = v* and ∀u ∈ B_{δ_1}(u*) ψ(u, v(u)) = 0. (3.16)

Since all matrices W*_{l+1:L} are full-rank, and the set of non-full-rank matrices has measure zero,

∃ǫ > 0 : ∀v ∈ B_ǫ(v*) ∀l′ ∈ {l+1, …, L} rk W_{l′} = n_{l′+1}. (3.17)

Since v ∈ C¹(B_{δ_1}(u*)),

∃δ_2 ∈ (0, δ_1) : ∀u ∈ B_{δ_2}(u*) v(u) ∈ B_ǫ(v*). (3.18)

Consequently,

∀u ∈ B_{δ_2}(u*) ∀l′ ∈ {l+1, …, L} rk W_{l′} = n_{l′+1}. (3.19)

Due to Lemma 3,

∀ǫ > 0 ∃u ∈ B_ǫ(u*) : rk X_l = m. (3.20)

Hence

∀ǫ ∈ (0, δ_2) ∃u ∈ B_ǫ(u*) : rk X_l = m and ∀l′ ∈ {l+1, …, L} rk W_{l′} = n_{l′+1} and ψ(u, v(u)) = 0. (3.21)

Note that in the first part of the proof we have only used that rk X*_l = m, rk W*_{l′} = n_{l′+1} ∀l′ ∈ {l+1, …, L}, and ∇*_l = 0. Hence we can conclude that L(u, v(u)) = 0. Since this is true for all ǫ ∈ (0, δ_2) and the loss is continuous with respect to the weights, this is also true for ǫ = 0: L(u*, v*) = L(W*_{0:L}) = 0.

Relaxing analyticity and other conditions

Overall, we have proven the following result first:

Proposition 1. Consider a point W*_{0:L} in the weight space. Suppose the following hold:

1. φ′ is nowhere zero;

2. G_{L+1} = 0 implies L → min;

3. rk X*_l = m;

4. rk W*_{l′} = n_{l′+1} ∀l′ ∈ {l+1, …, L};

5. ∇*_l = 0.

Then L(W*_{0:L}) = min L.

After that, we relaxed the 3rd condition at the expense of a few others:

Proposition 2. Consider a point W*_{0:L} in the weight space. Suppose the following hold:

1. φ′ is nowhere zero;

2. G_{L+1} = 0 implies L → min;

3. φ is non-zero real analytic;

4. rk W*_{l′} = n_{l′+1} ∀l′ ∈ {l+1, …, L};

5. det(∇²_{W_{l+1:L}} L(W*_{0:L})) ≠ 0;

6. ∇*_{l′} = 0 ∀l′ ∈ {l, …, L}.

Then L(W*_{0:L}) = min L.


However, besides the 3rd condition, Proposition 1 requires the 4th condition, which is hard to ensure. We can prove the following lemma, which is due to [Nguyen, 2019]:

Lemma 4. Let θ = W_{l+1:L}. Suppose the following hold:

1. rk X_l = m;

2. n_{l′} > n_{l′+1} ∀l′ ∈ {l+1, …, L};

3. φ(R) = R and φ is strictly monotonic.

Then

1. ∃θ′ : ∀l′ ∈ {l+1, …, L} rk W′_{l′} = n_{l′+1} and L(θ′) = L(θ);

2. ∃ a continuous curve connecting θ and θ′ such that the loss is constant on the curve.

Applying this lemma, we can drive W*_{l+1:L} to full-rank W*′_{l+1:L} without altering the loss; however, Lemma 4 does not guarantee that ∇*′_l = 0. Hence by applying Lemma 4 we potentially violate the 5th condition of Proposition 1. Moreover, as we have discussed before, loss convexity is not enough to ensure that minima exist. For example, for cross-entropy loss there could be no critical points of L at all, hence we cannot satisfy the 5th condition. We therefore have to formulate a different variant of Proposition 1.

Following [Nguyen, 2019], we define an α-level set as L^{−1}(α) and an α-sublevel set as L^{−1}((−∞, α)). We also refer to a connected component of a sublevel set as a "local valley", and we call a local valley global if its infimum coincides with inf L. There is a theorem which is due to [Nguyen, 2019]:

Theorem 1. Suppose the following hold:

1. φ(R) = R and φ is strictly monotonic;

2. ℓ(y, z) is convex wrt z with inf_z ℓ(y, z) = 0 ∀y;

3. rk X_l = m;

4. n_{l′} > n_{l′+1} ∀l′ ∈ {l+1, …, L}.

Then

1. every sublevel set is connected;

2. ∀ǫ > 0 L can attain a value < ǫ.

Theorem 1 not only formulates a global minimality condition in a way suitable for cross-entropy (i.e. that all local valleys are global), but also implies that all local valleys are connected. In the case when local minima exist, the latter implies that all of them are connected: a phenomenon empirically observed in [Garipov et al., 2018, Draxler et al., 2018].

Notice that it is enough to prove Theorem 1 for l = 0: otherwise we can just apply this result to the subnetwork starting from the l-th layer. Let Ω_l = R^{n_{l+1}×n_l} be the set of all n_{l+1} × n_l matrices, and let Ω*_l ⊂ Ω_l be the subset of full-rank matrices. We shall state the following result first:

Lemma 5. Suppose the following hold:

1. φ(R) = R and φ is strictly monotonic;

2. rk X = m;

3. n_l > n_{l+1} ∀l ∈ [L].

Then there exists a map h : Ω*_1 × … × Ω*_L × R^{n_{L+1}×m} → Ω_0 such that:

1. ∀H_{L+1} ∈ R^{n_{L+1}×m} and for full-rank W_{1:L}, H_{L+1}(h(W_{1:L}, H_{L+1}), W_{1:L}) = H_{L+1};

2. ∀W_{0:L} where all W_{1:L} are full-rank, there is a continuous curve between W_{0:L} and (h(W_{1:L}, H_{L+1}(W_{0:L})), W_{1:L}) such that the loss is constant on the curve.


The first statement can be proven easily. Indeed, let X† be a left inverse of X, and let W†_l be a right inverse of W_l; this means that X†X = I_m, while W_l W†_l = I_{n_{l+1}} ∀l ∈ [L]. These pseudo-inverses exist because X has full column rank, while all W_l have full row rank (since n_l > n_{l+1}). Define the following recursively:

W_0 = H_1 X†, H_l = φ^{−1}(X_l), X_l = W†_l H_{l+1} ∀l ∈ [L]. (3.22)

This gives the following:

W_0 X = H_1 X† X = H_1, W_l φ(H_l) = W_l W†_l H_{l+1} = H_{l+1} ∀l ∈ [L]. (3.23)

This simply means that H_{L+1}(W_0, W_{1:L}) = H_{L+1}. Hence defining h(W_{1:L}, H_{L+1}) := W_0 gives the result; a numerical sketch of this construction is given below.
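The following numpy sketch runs the construction (3.22) backwards from a target output and verifies (3.23) by a forward pass. Here φ = sinh is one convenient strictly monotonic bijection of R (an illustrative assumption), and the sizes are arbitrary subject to rk X = m and n_l > n_{l+1}.

import numpy as np

rng = np.random.default_rng(1)
phi, phi_inv = np.sinh, np.arcsinh

m, n = 4, [6, 10, 8, 5, 3]            # n_0 >= m; n_1 > n_2 > ... > n_{L+1}
X = rng.normal(size=(n[0], m))         # full column rank almost surely
Ws = [rng.normal(size=(n[l + 1], n[l])) for l in range(1, len(n) - 1)]  # W_1..W_L
H_target = rng.normal(size=(n[-1], m))

# Backward construction (3.22): X_l = W_l^dagger H_{l+1}, H_l = phi^{-1}(X_l).
H = H_target
for Wl in reversed(Ws):
    Xl = np.linalg.pinv(Wl) @ H        # right inverse: W_l W_l^dagger = I
    H = phi_inv(Xl)
W0 = H @ np.linalg.pinv(X)             # left inverse: X^dagger X = I_m

# Forward pass: the constructed W_0 reproduces the target output exactly.
Z = W0 @ X
for Wl in Ws:
    Z = Wl @ phi(Z)
print(np.allclose(Z, H_target))        # True (up to numerical error)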

We shall omit the proof of the second statement. The proof of Theorem 1 then proceeds by constructing paths from two points θ = W_{0:L} and θ′ = W′_{0:L} to a common point such that the loss does not increase along both of these paths. We then show that the common meeting point can attain loss < ǫ for any ǫ > 0.

Denote the losses at the points θ and θ′ by L and L′ respectively. Let us start from the point θ. By virtue of Lemma 4 we can travel from θ to another point for which all matrices are full-rank without altering the loss. Hence without loss of generality assume that all matrices of θ are full-rank; for θ′ we can assume the same. This allows us to use Lemma 5 and travel from θ and θ′ along curves of constant loss to the following points:

θ̄ = (h(W_{1:L}, H_{L+1}(W_{0:L})), W_{1:L}), θ̄′ = (h(W′_{1:L}, H_{L+1}(W′_{0:L})), W′_{1:L}). (3.24)

Since the set of full-rank matrices is connected, ∀l ∈ [L] there is a continuous curve W_l(t) for which W_l(0) = W_l, W_l(1) = W′_l, and W_l(t) is full-rank. Hence we can travel from θ̄ to the following point in the weight space:

θ̂ = (h(W_{1:L}(1), H_{L+1}(W_{0:L})), W_{1:L}(1)) = (h(W′_{1:L}, H_{L+1}(W_{0:L})), W′_{1:L}). (3.25)

Since we do not alter the model output while traveling along the curve, we do not alter the loss either.

Consider some H̃_{L+1} ∈ R^{n_{L+1}×m} such that the corresponding loss is less than min(ǫ, L, L′). Consider a curve H_{L+1}(t) = (1 − t) H_{L+1}(W_{0:L}) + t H̃_{L+1}, and a corresponding curve in the weight space:

θ(t) = (h(W′_{1:L}, H_{L+1}(t)), W′_{1:L}). (3.26)

Note that

L(θ(t)) = L(H_{L+1}(θ(t))) = L((1 − t) H_{L+1}(W_{0:L}) + t H̃_{L+1}) ≤ (1 − t) L(H_{L+1}(W_{0:L})) + t L(H̃_{L+1}) ≤ L. (3.27)

Hence the curve θ(t) is fully contained in any sublevel set containing the initial θ. The same curve started from θ̄′ arrives at the same point. Recall that the endpoint has loss less than ǫ. Hence all sublevel sets are connected and can attain loss less than any positive ǫ.

3.2 Linear nets

The second case for which one can prove globality of local minima is the case of φ(z) = z. Consider:

f(x; W_{0:L}) = W_L ⋯ W_0 x, (3.28)

where W_l ∈ R^{n_{l+1}×n_l}. We are going to prove the following result, which is due to [Laurent and Brecht, 2018]:

Theorem 2. Let ℓ be convex and differentiable, and let there be no bottlenecks in the architecture: min_{l∈[L]_0} n_l = min{n_0, n_{L+1}}. Then all local minima of L(W_{0:L}) = E_{x,y} ℓ(y, f(x; W_{0:L})) are global.

This theorem follows from the result below:

Theorem 3. Assume L is a scalar differentiable function of n_{L+1} × n_0 matrices. Let L̄(W_{0:L}) = L(W_L ⋯ W_0) and let min_{l∈[L]_0} n_l = min{n_0, n_{L+1}}. Then any local minimizer Ŵ_{0:L} of L̄ satisfies ∇L(Â) = 0 for Â = Ŵ_L ⋯ Ŵ_0.


Indeed, consider L(A) = E_{x,y} ℓ(y, Ax). The corresponding L̄ writes as follows: L̄(W_{0:L}) = E_{x,y} ℓ(y, f(x; W_{0:L})); hence we are in the scope of Theorem 3. Take a local minimizer Ŵ_{0:L} of L̄. From Theorem 3, Â = Ŵ_L ⋯ Ŵ_0 is a critical point of L. It follows from convexity of ℓ that Â is a global minimum of L. Since L̄(Ŵ_{0:L}) = L(Â), Ŵ_{0:L} is a global minimum of L̄.

Let us now prove Theorem 3. Define W_{l,+} = W_L ⋯ W_l, W_{l,−} = W_l ⋯ W_0, and A = W_L ⋯ W_0. Note that

∇_l L̄(W_{0:L}) = W_{l+1,+}^T ∇L(A) W_{l−1,−}^T ∀l ∈ [L]_0. (3.29)

Since Ŵ_{0:L} is a local minimum of L̄, we have:

0 = ∇_L L̄(Ŵ_{0:L}) = ∇L(Â) Ŵ_{L−1,−}^T. (3.30)

If ker Ŵ_{L−1,−} is trivial, then ∇L(Â) = 0 as required. Consider the case when the kernel is non-trivial. We shall prove that there exist perturbed matrices W̃_{0:L} such that Ã = Â, W̃_{0:L} is a local minimizer of L̄, and for some l ∈ [L−1]_0 the kernels of both W̃_{l−1,−} and W̃_{l+1,+}^T are trivial. This gives ∇L(Ã) = 0, which is equivalent to ∇L(Â) = 0.

By the definition of a local minimizer, ∃ǫ > 0 such that ‖W_l − Ŵ_l‖_F ≤ ǫ ∀l ∈ [L]_0 implies L̄(W_{0:L}) ≥ L̄(Ŵ_{0:L}).

Proposition 3. Let W̃_{0:L} satisfy the following:

1. ‖W̃_l − Ŵ_l‖_F ≤ ǫ/2 ∀l ∈ [L]_0;

2. Ã = Â.

Then W̃_{0:L} is a local minimizer of L̄.

Proof. Let ‖W_l − W̃_l‖_F ≤ ǫ/2 ∀l ∈ [L]_0. Then ‖W_l − Ŵ_l‖_F ≤ ‖W_l − W̃_l‖_F + ‖W̃_l − Ŵ_l‖_F ≤ ǫ ∀l ∈ [L]_0. Hence L̄(W_{0:L}) ≥ L̄(Ŵ_{0:L}) = L̄(W̃_{0:L}).

Since Ŵ_{l+1,−} = Ŵ_{l+1} Ŵ_{l,−}, we have ker(Ŵ_{l+1,−}) ⊇ ker(Ŵ_{l,−}). Hence there is a chain of inclusions:

ker(Ŵ_{0,−}) ⊆ … ⊆ ker(Ŵ_{L−1,−}). (3.31)

Since the (L−1)-th kernel is non-trivial, there exists l* ∈ [L−1]_0 such that ker(Ŵ_{l,−}) is non-trivial for any l ≥ l*, while for l < l* the l-th kernel is trivial. This gives the following:

0 = ∇_{l*} L̄(Ŵ_{0:L}) = Ŵ_{l*+1,+}^T ∇L(Â) Ŵ_{l*−1,−}^T implies 0 = Ŵ_{l*+1,+}^T ∇L(Â). (3.32)

We cannot guarantee that ker(Ŵ_{l*+1,+}^T) is trivial. However, we can try to construct a perturbation that does not alter the loss and such that the corresponding kernel is trivial.

First, without loss of generality assume that n_{L+1} ≥ n_0. Indeed, if Theorem 3 is already proven for n_{L+1} ≥ n_0, we can get the same result for n_{L+1} < n_0 by applying this theorem to L′(A) = L(A^T). This gives that all local minima of L̄′(W_{L:0}^T) = L′(W_0^T ⋯ W_L^T) = L(W_L ⋯ W_0) correspond to critical points of L′(W_0^T ⋯ W_L^T). This is equivalent to saying that all local minima of L̄(W_{0:L}) = L̄′(W_{L:0}^T) correspond to critical points of L(W_L ⋯ W_0) = L′(W_0^T ⋯ W_L^T).

The combination of the assumptions n_{L+1} ≥ n_0 and min_l n_l = min{n_0, n_{L+1}} gives n_l ≥ n_0 ∀l ∈ [L+1].

Note that Ŵ_{l,−} ∈ R^{n_{l+1}×n_0} ∀l ∈ [L]_0. Since n_{l+1} ≥ n_0, it is a "column" matrix. Consider an SVD of Ŵ_{l,−}:

Ŵ_{l,−} = U_l Σ_l V_l^T. (3.33)

Here U_l is an orthogonal n_{l+1} × n_{l+1} matrix, V_l is an orthogonal n_0 × n_0 matrix, and Σ_l is a diagonal n_{l+1} × n_0 matrix. Since for l ≥ l* Ŵ_{l,−} has a non-trivial kernel, its least singular value is zero: σ_{l,n_0} = 0. Let u_l be the n_0-th column of U_l, which exists since n_0 ≤ n_{l+1}. Let us now define a family of perturbations satisfying the conditions of Proposition 3:

Proposition 4. Let w_{l*+1}, …, w_L be any collection of vectors and δ_{l*+1}, …, δ_L any collection of scalars satisfying:

1. w_l ∈ R^{n_{l+1}}, ‖w_l‖_2 = 1;

2. δ_l ∈ [0, ǫ/2].

Then the tuples W̃_{0:L} defined by

W̃_l = Ŵ_l + δ_l w_l u_{l−1}^T for l > l*, and W̃_l = Ŵ_l otherwise, (3.34)

satisfy the conditions of Proposition 3.

Proof. For l ≤ l* the first condition is trivial. In the opposite case we have:

‖W̃_l − Ŵ_l‖_F² = ‖δ_l w_l u_{l−1}^T‖_F² = δ_l² ‖w_l‖_2² ‖u_{l−1}‖_2² ≤ ǫ²/4, (3.35)

which gives the first condition of Proposition 3.

Let us now prove that W̃_{l,−} = Ŵ_{l,−} ∀l ≥ l* (for l < l* the statement is trivial). For l = l* the statement follows from the definition; this gives the induction base. The induction step is as follows:

W̃_{l+1,−} = W̃_{l+1} W̃_{l,−} = W̃_{l+1} Ŵ_{l,−} = (Ŵ_{l+1} + δ_{l+1} w_{l+1} u_l^T) Ŵ_{l,−} = Ŵ_{l+1} Ŵ_{l,−} = Ŵ_{l+1,−}, (3.36)

where the last transition holds since u_l^T Ŵ_{l,−} = σ_{l,n_0} v_{l,n_0}^T = 0 by the choice of u_l. In particular, Ã = W̃_{L,−} = Ŵ_{L,−} = Â, which gives the second condition.

Hence, by Proposition 3, for any δ_l and w_l satisfying the conditions of Proposition 4, W̃_{0:L} is a local minimum of L̄. We then have an equation similar to (3.32):

0 = ∇_{l*} L̄(W̃_{0:L}) = W̃_{l*+1,+}^T ∇L(Ã) W̃_{l*−1,−}^T. (3.37)

As before, this implies:

0 = ∇L^T(Ã) W̃_{l*+1,+}. (3.38)

For δ_{l*+1} = 0 we have:

0 = ∇L^T(Ã) W̃_L ⋯ W̃_{l*+2} Ŵ_{l*+1}. (3.39)

Subtracting the latter equation from the former gives:

0 = ∇L^T(Ã) W̃_L ⋯ W̃_{l*+2} (W̃_{l*+1} − Ŵ_{l*+1}) = ∇L^T(Ã) W̃_L ⋯ W̃_{l*+2} (δ_{l*+1} w_{l*+1} u_{l*}^T). (3.40)

Right-multiplying this equation by u_{l*} gives:

0 = δ_{l*+1} ∇L^T(Ã) W̃_L ⋯ W̃_{l*+2} w_{l*+1}, (3.41)

which holds for any sufficiently small non-zero δ_{l*+1} and any unit w_{l*+1}. Hence

0 = ∇L^T(Ã) W̃_L ⋯ W̃_{l*+2}. (3.42)

Proceeding in the same manner finally gives ∇L(Ã) = 0. The proof concludes by noting that ∇L(Ã) = ∇L(Â) by the construction of W̃_{0:L}.

3.3 Local convergence guarantees

Let L ∈ C²(R^{dim θ}) and let L have an L-Lipschitz gradient:

‖∇L(θ_1) − ∇L(θ_2)‖_2 ≤ L ‖θ_1 − θ_2‖_2 ∀θ_{1,2}. (3.43)

Consider a GD update rule:

θ_{k+1} = θ_k − η ∇L(θ_k) = g(θ_k). (3.44)

Let θ* be a strict saddle:

∇L(θ*) = 0, λ_min(∇²L(θ*)) < 0. (3.45)

Let θ_0 ∼ P_init. We shall prove the following result, which is due to [Lee et al., 2016]:


Theorem 4. Suppose P_init is absolutely continuous with respect to the Lebesgue measure µ on R^{dim θ}. Then for η ∈ (0, L^{−1}), P(lim_{k→∞} θ_k = θ*) = 0.

Proof. The proof starts with the definition of global stable sets. Define the global stable set of a critical point as the set of initial conditions that lead to convergence to this critical point:

Θ^s(θ*) = {θ_0 : lim_{k→∞} θ_k = θ*}. (3.46)

In order to prove the theorem, it suffices to show that µ(Θ^s(θ*)) = 0.

The proof relies on the following result from the theory of dynamical systems:

Theorem 5. Let 0 be a fixed point of a local diffeomorphism φ : U → E, where U is a vicinity of zero in a Banach space E. Suppose that E = E^s ⊕ E^u, where E^s is the span of the eigenvectors of Dφ(0) that correspond to eigenvalues less than or equal to one, while E^u is the span of the eigenvectors that correspond to eigenvalues greater than one. Then there exists a disk Θ^{sc}_{loc} tangent to E^s at 0, called the local stable center manifold. Moreover, there exists a neighborhood B of 0 such that φ(Θ^{sc}_{loc}) ∩ B ⊂ Θ^{sc}_{loc} and ∩_{k=0}^∞ φ^{−k}(B) ⊂ Θ^{sc}_{loc}.

In order to apply this theorem, we have to prove that g is a diffeomorphism:

Proposition 5. For η ∈ (0, L^{−1}), g is a diffeomorphism.

Given this, we apply the theorem above to φ(θ) = g(θ + θ*): its differential at zero is Dφ(0) = I − η∇²L(θ*). Since λ_min(∇²L(θ*)) < 0, dim E^u > 0, hence dim E^s < dim θ. This means that µ(Θ^{sc}_{loc}) = 0.

Let B be the vicinity of zero promised by Theorem 5. Given θ_0 ∈ Θ^s(θ*), ∃K ≥ 0 : ∀k ≥ K θ_k ∈ B. Equivalently, ∀l ≥ 0 g^l(θ_K) ∈ B. Hence θ_K ∈ ∩_{l=0}^∞ g^{−l}(B) ⊂ Θ^{sc}_{loc}. This gives the following:

Θ^s(θ*) ⊆ ∪_{K=0}^∞ g^{−K}(Θ^{sc}_{loc}). (3.47)

The proof concludes by noting that µ(Θ^s(θ*)) ≤ Σ_{K=0}^∞ µ(g^{−K}(Θ^{sc}_{loc})) = 0, since the diffeomorphism g maps sets of measure zero to sets of measure zero.

Proof of Proposition 5. Being a diffeomorphism is equivalent to being injective, surjective, continuously differentiable, and having a continuously differentiable inverse.

Suppose g(θ) = g(θ′). Then θ − θ′ = η(∇L(θ′) − ∇L(θ)). Hence:

‖θ − θ′‖_2 = η ‖∇L(θ′) − ∇L(θ)‖_2 ≤ ηL ‖θ − θ′‖_2. (3.48)

Since η < 1/L, this implies θ = θ′. Hence g is injective.

Given some point θ_2, we shall construct θ_1 such that θ_2 = g(θ_1). Consider:

h(θ_1, θ_2) = ½ ‖θ_1 − θ_2‖_2² − η L(θ_1). (3.49)

Note that h(θ_1, θ_2) is strongly convex with respect to θ_1:

λ_min(∇²_{θ_1} h(θ_1, θ_2)) ≥ 1 − ηL > 0. (3.50)

Hence it has a unique global minimizer, which is a critical point:

0 = ∇_{θ_1} h(θ_1, θ_2) = θ_1 − θ_2 − η ∇L(θ_1). (3.51)

Hence the unique element of argmin_{θ_1} h(θ_1, θ_2) satisfies θ_2 = g(θ_1). Hence g is surjective.

The fact that g ∈ C¹(R^{dim θ}) follows from g(θ) = θ − η∇L(θ) and L ∈ C²(R^{dim θ}). By virtue of the inverse function theorem, in order to prove that g has a C¹ inverse, it suffices to show that g is itself C¹ and that its Jacobian is non-singular everywhere. The Jacobian is given as Jg(θ) = I − η∇²L(θ); hence its minimal eigenvalue is ≥ 1 − ηL > 0, which means that the Jacobian is non-singular. This concludes the proof that g is a diffeomorphism.
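As a toy numerical illustration of Theorem 4 (not a substitute for the proof), consider the strict saddle of L(x, y) = (x² − y²)/2 at the origin: the stable set of GD is the measure-zero line {y = 0}, so a random absolutely continuous initialization escapes almost surely. The sample and iteration counts below are arbitrary.

import numpy as np

grad = lambda t: np.array([t[0], -t[1]])   # gradient of L(x, y) = (x^2 - y^2)/2
eta = 0.5                                  # grad L is 1-Lipschitz, so take eta < 1

rng = np.random.default_rng(0)
escaped = 0
for _ in range(1000):
    theta = rng.normal(size=2)             # theta_0 ~ P_init, a.c. wrt Lebesgue
    for _ in range(200):
        theta = theta - eta * grad(theta)  # x contracts by 0.5, y expands by 1.5
    escaped += np.linalg.norm(theta) > 1.0
print(escaped, "of 1000 initializations escaped the saddle")  # 1000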


3.3.1 Limitations of the result

Note that Theorem 4 applies under the following assumptions:

1. L ∈ C²;

2. ∇L is L-Lipschitz and η ∈ (0, L^{−1});

3. No gradient noise;

4. The saddle point is strict;

5. The saddle point is isolated.

The first assumption is necessary to ensure that g is a diffeomorphism. ReLU nets violate this assumption, and hence require a generalization of Theorem 5.

The second assumption is a standard assumption in the optimization literature. Note, however, that for, say, a quadratic loss, a network with at least one hidden layer results in a loss surface which is not globally Lipschitz. Fortunately, if we show that there exists a subset S ⊆ R^{dim θ} such that g(S) ⊆ S and restrict initializations to this subset, one can substitute the global Lipschitzness requirement with local Lipschitzness in S; this is done in [Panageas and Piliouras, 2017]. Note that ReLU nets break gradient Lipschitzness anyway.

Full-batch gradient descent is rare in practice; a typical procedure is stochastic gradient descent, which introduces zero-centered noise into the gradient updates. The existence of such noise pulls us away from the scope of dynamical systems theory. Nevertheless, intuitively, this noise should help us escape the stable manifold associated with the saddle point at hand. It turns out that the presence of noise allows one to have guarantees not only for convergence itself, but even for convergence rates: see e.g. [Jin et al., 2017].

Strictness of saddle points is necessary to ensure that second-order information about the Hessian of the loss is enough to identify E^u. We hypothesize that a generalization of Theorem 5 to higher-order saddles is still possible (but out of the scope of conventional dynamical systems theory).

Note that Theorem 4 essentially says that we cannot converge to any a-priori given saddle point. If the set of all saddle points is at most countable, this implies that we cannot converge to any saddle point. However, if this set is uncountable, Theorem 4 does not guarantee that we do not converge to any of them. Moreover, e.g. for ReLU nets, there is a continuous family of weight-space symmetries that preserve criticality (and hence preserve negativity of the least eigenvalue of the Hessian). Indeed, substituting (W_{l+1}, W_l) with (α^{−1}W_{l+1}, αW_l) for any positive α keeps f(x) unchanged. Moreover, if ∇_{l′} = 0 ∀l′, then

∇^{(α)}_{l+1} = g_{l+2} x^{(α),T}_{l+1} = α g_{l+2} x^T_{l+1} = α ∇_{l+1} = 0, (3.52)

and all other ∇^{(α)}_{l′} = 0 by a similar reasoning.

A generalization of Theorem 4 to non-isolated critical points is given in [Panageas and Piliouras, 2017]. Intuitively, if we have a manifold of strict saddle points, the global stable set associated with this manifold is still of measure zero due to the existence of the unstable manifold. Nevertheless, one again has to generalize Theorem 5.


Chapter 4

Generalization

The goal of learning is to minimize a population risk R over some class of predictors F:

f* ∈ Argmin_{f∈F} R(f), (4.1)

where R(f) = E_{x,y∼D} r(y, f(x)); here D is a data distribution and r(y, z) is a risk; we shall assume that r(y, z) ∈ [0, 1]. A typical notion of risk for binary classification problems is the 0/1-risk: r_{0/1}(y, z) = [yz < 0], where the target y ∈ {−1, 1} and the logit z ∈ R; in this case 1 − R(f) is the accuracy of f. Since we do not have access to the true data distribution D, we cannot minimize the true risk. Instead, we can hope to minimize an empirical risk R_m over a set of m i.i.d. samples S_m from the distribution D:

f_m ∈ Argmin_{f∈F} R_m(f), (4.2)

where R_m(f) = E_{x,y∼S_m} r(y, f(x)). Since a risk function is typically non-convex and suffers from poor gradients, one cannot solve problem (4.2) directly with gradient methods. A common solution is to consider a convex differentiable surrogate ℓ for the risk r, and substitute problem (4.2) with a train loss minimization problem:

f_m ∈ Argmin_{f∈F} L_m(f), (4.3)

where L_m(f) = E_{x,y∼S_m} ℓ(y, f(x)); this problem can be attacked directly with gradient methods.

Unfortunately, it is hard to obtain any guarantees for finding solutions even for problem (4.3). Nevertheless, suppose we have a learning algorithm A that takes a dataset S_m and outputs a model f_m. This algorithm may aim to solve problem (4.3) or to tackle problem (4.2) directly, but its purpose does not matter; what matters is the fact that it conditions a model f_m on a dataset S_m. Our goal is to upper-bound some divergence of R_m(f_m) with respect to R(f_m). Since the dataset S_m is random, f_m is also random, and the bound should hold with some failure probability δ with respect to S_m.

4.1 Uniform bounds

First of all, note that R(f) = E_{S_m∼D^m} R_m(f). This fact suggests applying Hoeffding's inequality for upper-bounding R_m(f) − R(f):

Theorem 6 (Hoeffding's inequality [Hoeffding, 1963]). Let X_{1:m} be i.i.d. random variables supported on [0, 1]. Then, given ǫ > 0,

P(Σ_{i=1}^m X_i − E Σ_{i=1}^m X_i ≥ ǫ) ≤ e^{−2ǫ²/m}, P(E Σ_{i=1}^m X_i − Σ_{i=1}^m X_i ≥ ǫ) ≤ e^{−2ǫ²/m}. (4.4)

This gives us the following:

P(R(f) − R_m(f) ≥ ǫ) ≤ e^{−2mǫ²} ∀ǫ > 0 ∀f ∈ F. (4.5)


Hence for any f ∈ F,

R(f) − R_m(f) ≤ √((1/2m) log(1/δ)) w.p. ≥ 1 − δ over S_m. (4.6)

However, the bound above does not suit our needs, since f there is given a-priori and does not depend on S_m. Our goal is to bound the same difference but with f = f_m. The simplest way to do this is to upper-bound this difference uniformly over F:

R(f_m) − R_m(f_m) ≤ sup_{f∈F} (R(f) − R_m(f)). (4.7)

A note on the goodness of uniform bounds. One may worry about how large the supremum over F can be. If the model class F includes a "bad model" which has low train error for a given S_m but large true error, the bound becomes too pessimistic. Unfortunately, in the case of realistic neural nets, one can explicitly construct such a bad model. For instance, for a given S_m consider f_{m,m′} = A(S_m ∪ S_{m′}) with S_{m′} being a dataset with random labels; it is independent of S_m and taken in advance. For m′ ≫ m, f_{m,m′} ≈ A(S_{m′}), a model learned on random labels; see [Zhang et al., 2016]. Hence for binary classification with balanced classes R(f_{m,m′}) ≈ 0.5, while R_m(f_{m,m′}) ≈ 0 whenever the algorithm is able to learn the data perfectly, which is empirically the case for gradient descent applied to realistic neural nets.

Nevertheless, taking F to be the set of all models realizable with a given architecture is not necessary. Indeed, assume the data lies on a certain manifold: supp D ⊆ M. Then, for sure, f_m ∈ A(M^m). Taking F = A(M^m) ensures that F contains only those models that are realizable by our algorithm on realistic data; this excludes the situation discussed above. One can then hope that if our learning algorithm is good for any realistic data, the bound will also be good. The problem then boils down to upper-bounding the supremum as well as possible.

Unfortunately, bounding the supremum for F = A(M^m) is problematic, since it requires taking the algorithm dynamics into account, which is complicated for gradient descent applied to neural nets. As a trade-off, one can consider some larger F ⊇ A(M^m) for which the supremum can be upper-bounded analytically.

4.1.1 Upper-bounding the supremum

When F is finite, we can still apply our previous bound:

P(sup_{f∈F} (R(f) − R_m(f)) ≥ ǫ) = P(∃f ∈ F : R(f) − R_m(f) ≥ ǫ) ≤ Σ_{f∈F} P(R(f) − R_m(f) ≥ ǫ) ≤ |F| e^{−2mǫ²} ∀ǫ > 0. (4.8)

Hence

sup_{f∈F} (R(f) − R_m(f)) ≤ √((1/2m)(log(1/δ) + log |F|)) w.p. ≥ 1 − δ over S_m. (4.9)
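A one-line computation makes the scale of (4.9) concrete; the values of m, |F|, and δ below are arbitrary illustrative assumptions.

import numpy as np

def uniform_gap_bound(m, card_F, delta):
    """Right-hand side of eq. (4.9): sqrt((log(1/delta) + log|F|) / (2m))."""
    return np.sqrt((np.log(1 / delta) + np.log(card_F)) / (2 * m))

print(uniform_gap_bound(m=10_000, card_F=2 ** 20, delta=0.01))  # ~0.03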

In the case when F is infinite, we can rely on a certain generalization of Hoeffding's inequality:

Theorem 7 (McDiarmid's inequality [McDiarmid, 1989]). Let X_{1:m} be i.i.d. random variables and let g be a scalar function of m arguments such that

sup_{x_{1:m}, x̃_i} |g(x_{1:m}) − g(x_{1:i−1}, x̃_i, x_{i+1:m})| ≤ c_i ∀i ∈ [m]. (4.10)

Then, given ǫ > 0,

P(g(X_{1:m}) − E g(X_{1:m}) ≥ ǫ) ≤ e^{−2ǫ² / Σ_{i=1}^m c_i²}. (4.11)

Applying this inequality to g((x_i, y_i)_{i=1}^m) = sup_{f∈F} (R(f) − R_m(f)) gives:

P_{S_m}(sup_{f∈F} (R(f) − R_m(f)) − E_{S′_m} sup_{f∈F} (R(f) − R′_m(f)) ≥ ǫ) ≤ e^{−2mǫ²}, (4.12)


which is equivalent to:

sup_{f∈F} (R(f) − R_m(f)) ≤ E_{S′_m} sup_{f∈F} (R(f) − R′_m(f)) + √((1/2m) log(1/δ)) w.p. ≥ 1 − δ over S_m. (4.13)

Let us upper-bound the expectation:

E_{S′_m} sup_{f∈F} (R(f) − R′_m(f)) = E_{S′_m} sup_{f∈F} (E_{S″_m} R″_m(f) − R′_m(f)) ≤ E_{S′_m} E_{S″_m} sup_{f∈F} (R″_m(f) − R′_m(f))
= E_{S′_m} E_{S″_m} sup_{f∈F} ((1/m) Σ_{i=1}^m (r(y″_i, f(x″_i)) − r(y′_i, f(x′_i)))) = E_{S′_m} E_{S″_m} sup_{f∈F} ((1/m) Σ_{i=1}^m (r″_i(f) − r′_i(f)))
= E_{S′_m} E_{S″_m} E_{σ_m∼{−1,1}^m} sup_{f∈F} ((1/m) Σ_{i=1}^m σ_i (r″_i(f) − r′_i(f)))
≤ E_{S′_m} E_{S″_m} E_{σ_m∼{−1,1}^m} sup_{f∈F} |(1/m) Σ_{i=1}^m σ_i (r″_i(f) − r′_i(f))|
≤ E_{S′_m} E_{S″_m} E_{σ_m∼{−1,1}^m} sup_{f∈F} (|(1/m) Σ_{i=1}^m σ_i r″_i(f)| + |(1/m) Σ_{i=1}^m σ_i r′_i(f)|)
≤ 2 E_{S′_m} E_{σ_m∼{−1,1}^m} sup_{f∈F} |(1/m) Σ_{i=1}^m σ_i r(y′_i, f(x′_i))| = 2 E_{S′_m} Rad(r ∘ F | S′_m), (4.14)

where we have defined the function class r ∘ F such that ∀h ∈ r ∘ F, h(x, y) = r(y, f(x)) for some f ∈ F, and the Rademacher complexity of a class H of functions supported on [0, 1], conditioned on a dataset z_{1:m}:

Rad(H | z_{1:m}) = E_{σ_{1:m}∼{−1,1}^m} sup_{h∈H} |(1/m) Σ_{i=1}^m σ_i h(z_i)|. (4.15)
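Definition (4.15) is straightforward to estimate by Monte-Carlo for a finite hypothesis class; a sketch (the class below is a random table, purely for illustration):

import numpy as np

def rademacher(H, n_draws=10_000, seed=0):
    """MC estimate of Rad(H | z_{1:m}); H[j, i] = h_j(z_i), values in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_h, m = H.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # sup over h of |(1/m) sum_i sigma_i h(z_i)|, averaged over sigma draws
    return np.abs(sigma @ H.T / m).max(axis=1).mean()

H = np.random.default_rng(1).uniform(size=(50, 200))   # 50 hypotheses, m = 200
print(rademacher(H))   # grows with the class size, decays roughly as 1/sqrt(m)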

4.1.2 Upper-bounding the Rademacher complexity

The case of the zero-one risk

Consider r(y, z) = r_{0/1}(y, z) = [yz < 0]. In this case we have:

Rad(r ∘ F | S_m) = E_{σ_m} sup_{f∈F} |(1/m) Σ_{i=1}^m σ_i r_i(f)| = E_{σ_m} max_{f∈F_{S_m}} |(1/m) Σ_{i=1}^m σ_i r_i(f)|
= (1/(ms)) log exp(s E_{σ_m} max_{f∈F_{S_m}} |Σ_{i=1}^m σ_i r_i(f)|)
≤ (1/(ms)) log E_{σ_m} exp(s max_{f∈F_{S_m}} |Σ_{i=1}^m σ_i r_i(f)|)
= (1/(ms)) log E_{σ_m} exp(s max_{h ∈ r∘F_{S_m} ∪ (−r)∘F_{S_m}} Σ_{i=1}^m σ_i h_i)
≤ (1/(ms)) log Σ_{h ∈ r∘F_{S_m} ∪ (−r)∘F_{S_m}} E_{σ_m} exp(s Σ_{i=1}^m σ_i h_i)
≤ (1/(ms)) log Σ_{h ∈ r∘F_{S_m} ∪ (−r)∘F_{S_m}} e^{ms²/2} = (1/(ms)) log(2|F_{S_m}| e^{ms²/2}) = (1/(ms)) log(2|F_{S_m}|) + s/2, (4.16)

where s is any positive real number and F_{S_m} is the set of equivalence classes of functions from F, where two functions are equivalent iff their images on S_m have identical signs; note that this set is finite: |F_{S_m}| ≤ 2^m. We have also used the following lemma:

Lemma 6 (Hoeffding's lemma [Hoeffding, 1963]). Let X be a random variable a.s.-supported on [a, b] with zero mean. Then, for any positive real s,

E e^{sX} ≤ e^{(b−a)²s²/8}. (4.17)


Since the upper bound (4.16) is valid for any s > 0, we can minimize it with respect to s. One can easily deduce that the optimal s is √((2/m) log(2|F_{S_m}|)); plugging it into eq. (4.16) gives:

Rad(r ∘ F | S_m) ≤ √((2/m) log(2|F_{S_m}|)). (4.18)

Define Π_F(m) = max_{S_m} |F_{S_m}|, the growth function of a function class F. The growth function shows how many distinct labelings a function class F can induce on datasets of varying sizes. Obviously, Π_F(m) ≤ 2^m, and Π_F(m) is monotonically non-decreasing. We say "F shatters S_m" whenever |F_{S_m}| = 2^m. Define the VC-dimension [Vapnik and Chervonenkis, 1971] as the maximal m for which F shatters some S_m:

VC(F) = max{m : Π_F(m) = 2^m}. (4.19)

One can relate the growth function to the VC-dimension using the following lemma:

Lemma 7 (Sauer's lemma [Sauer, 1972]). Π_F(m) ≤ Σ_{k=0}^{VC(F)} \binom{m}{k}.

Now we need to express the asymptotic behavior as m → ∞ in a convenient way. Let d = VC(F). For m ≤ d, Π_F(m) = 2^m; consider m > d:

Π_F(m) ≤ Σ_{k=0}^d \binom{m}{k} ≤ (m/d)^d Σ_{k=0}^d \binom{m}{k} (d/m)^k ≤ (m/d)^d Σ_{k=0}^m \binom{m}{k} (d/m)^k = (m/d)^d (1 + d/m)^m ≤ (em/d)^d. (4.20)

Substituting it into (4.18) gives the final bound:

Rad(r ∘ F | S_m) ≤ √((2/m)(log 2 + VC(F)(1 + log m − log VC(F)))) = Θ_{m→∞}(√(2 VC(F) log m / m)). (4.21)

Hence for the bound to be non-vacuous, having VC(F) < m/(2 log m) is necessary. According to [Bartlett et al., 2019], whenever F is the set of all models realizable by a fully-connected network of width U with W parameters, VC(F) = Θ(WU). While the constant is not given here, this result suggests that the corresponding bound will be vacuous for realistic nets trained on realistic datasets, since W ≫ m there.

The case of a margin risk

Suppose now r is a γ-margin risk: r(y, z) = r_γ(y, z) = [yz < γ]. In this case we can bound the true 0/1-risk as:

R_{0/1}(f_m) ≤ R_γ(f_m) ≤ R_{m,γ}(f_m) + 2 E_{S′_m} Rad(r_γ ∘ F | S′_m) + √((1/2m) log(1/δ)) w.p. ≥ 1 − δ over S_m. (4.22)

Here we have a trade-off between the train risk and the Rademacher complexity: as γ grows larger, the former term grows too, but the latter one vanishes. One can hope that a good enough model f_m should be able to classify the dataset it was trained on with a sufficient margin, i.e. R_{m,γ}(f_m) ≈ 0 for large enough γ.

In the case of a margin loss, the Rademacher complexity is upper-bounded via covering numbers:

N_p(H, ǫ, S_m) = inf{|Ĥ| : Ĥ ⊆ H, ∀h ∈ H ∃ĥ ∈ Ĥ : (Σ_{k=1}^m |h(z_k) − ĥ(z_k)|^p)^{1/p} < ǫ}. (4.23)

Note that N_p(H, ǫ, S_m) grows as ǫ → 0 and decreases with p. There are several statements that link the Rademacher complexity with covering numbers; we start with a simple one:

Theorem 8. Suppose H is a class of hypotheses supported on [0, 1]. Then ∀S_m and ∀ǫ > 0,

Rad(H | S_m) ≤ ǫ/m + √((2/m) log(2 N_1(H, ǫ, S_m))). (4.24)


Proof. Take ǫ > 0. Let Ĥ ⊆ H be such that ∀h ∈ H ∃ĥ ∈ Ĥ : Σ_{k=1}^m |h(z_k) − ĥ(z_k)| < ǫ. Let ĥ[h] ∈ Ĥ be a representative of h ∈ H. Then,

Rad(H | S_m) = E_{σ_{1:m}} sup_{h∈H} |(1/m) Σ_{k=1}^m σ_k h(z_k)|
≤ E_{σ_{1:m}} sup_{h∈H} |(1/m) Σ_{k=1}^m σ_k (h(z_k) − ĥ[h](z_k))| + E_{σ_{1:m}} sup_{h∈H} |(1/m) Σ_{k=1}^m σ_k ĥ[h](z_k)|
≤ ǫ/m + E_{σ_{1:m}} sup_{ĥ∈Ĥ} |(1/m) Σ_{k=1}^m σ_k ĥ(z_k)| = ǫ/m + Rad(Ĥ | S_m) ≤ ǫ/m + √((2/m) log(2|Ĥ|)). (4.25)

Taking the infimum with respect to Ĥ concludes the proof.

Note that for γ = 0, r_γ = r_{0/1}, and N_1(r_{0/1} ∘ F, ǫ, S_m) → |F_{S_m}| as ǫ → 0; hence we recover (4.18).

The next theorem is more involved:

Theorem 9 (Dudley entropy integral [Dudley, 1967]). Suppose H is a class of hypotheses supported on [−1, 1]. Then ∀S_m and ∀ǫ > 0,

Rad(H | S_m) ≤ 4ǫ/√m + (12/m) ∫_ǫ^{√m/2} √(log N_2(H, t, S_m)) dt. (4.26)

Now the task is to upper-bound the covering number for H = r_γ ∘ F. It is easier, however, to upper-bound it for r̃_γ ∘ F instead, where r̃_γ is a soft γ-margin risk. Indeed,

N_p(r̃_γ ∘ F, ǫ, S_m) ≤ N_p(γ^{−1} F, ǫ, S_m) = N_p(F, γǫ, S_m). (4.27)

In this case it suffices to upper-bound the covering number for the model class F itself. Note also that we still have an upper bound for the true 0/1-risk:

R_{0/1}(f_m) ≤ R̃_γ(f_m) ≤ R_{m,γ}(f_m) + 2 E_{S′_m} Rad(r̃_γ ∘ F | S′_m) + √((1/2m) log(1/δ)) w.p. ≥ 1 − δ over S_m. (4.28)

When F is the set of all models induced by a neural network of a given architecture, N_p(F, γǫ, S_m) is infinite. Nevertheless, if we restrict F to a class of functions with uniformly bounded Lipschitz constant, the covering number becomes finite, which implies a finite conditional Rademacher complexity. If we moreover assume that the data have bounded support, then the expected Rademacher complexity becomes finite as well.

The set of all neural nets of a given architecture does not have a uniform Lipschitz constant; however, this is the case if we assume the weight norms to be a-priori bounded. For instance, consider a fully-connected network f(·; W_{0:L}) with L hidden layers and without biases. Assume the activation function φ to satisfy φ(0) = 0 and to be 1-Lipschitz. Define:

F_{s_{0:L}, b_{0:L}} = {f(·; W_{0:L}) : ∀l ∈ [L]_0, ‖W_l‖_2 ≤ s_l, ‖W_l^T‖_{2,1} ≤ b_l}. (4.29)

Theorem 10 ([Bartlett et al., 2017]).

log N_2(F_{s_{0:L}, b_{0:L}}, ǫ, S_m) ≤ C² (‖X_m‖_F² / ǫ²) R²_{s_{0:L}, b_{0:L}}, (4.30)

where C = √(log(2 max_l n_l²)) and we have introduced a spectral complexity:

R_{s_{0:L}, b_{0:L}} = (Π_{l=0}^L s_l) × (Σ_{l=0}^L (b_l/s_l)^{2/3})^{3/2} = (Σ_{l=0}^L (b_l Π_{l′≠l} s_{l′})^{2/3})^{3/2}. (4.31)
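The spectral complexity (4.31) is a direct computation from the weight matrices once s_l and b_l are taken to be the actual norms (the tightest admissible choice); a numpy sketch with arbitrary random weights:

import numpy as np

def spectral_complexity(weights):
    """R_{s,b} of eq. (4.31) with s_l = ||W_l||_2 and b_l = ||W_l^T||_{2,1}."""
    s = np.array([np.linalg.norm(W, ord=2) for W in weights])        # spectral norms
    b = np.array([np.linalg.norm(W, axis=1).sum() for W in weights]) # ||W^T||_{2,1}
    return np.prod(s) * np.sum((b / s) ** (2 / 3)) ** 1.5

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(256, 784)) / 28.0, rng.normal(size=(10, 256)) / 16.0]
print(spectral_complexity(Ws))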

Plugging this result into Theorem 9 and using eq. (4.27) gives:

Rad(r̃_γ ∘ F_{s_{0:L}, b_{0:L}} | S_m) ≤ 4ǫ/√m + (12/m) ∫_ǫ^{√m/2} (C ‖X_m‖_F / (γt)) R_{s_{0:L}, b_{0:L}} dt = 4ǫ/√m + (12/m) (C ‖X_m‖_F / γ) R_{s_{0:L}, b_{0:L}} log(√m / (2ǫ)). (4.32)


Differentiating the right-hand side wrt ǫ gives:

d(rhs)/dǫ = 4/√m − (12/m) (C ‖X_m‖_F / (γǫ)) R_{s_{0:L}, b_{0:L}}. (4.33)

Hence the optimal ǫ is given by:

ǫ_opt = (3/√m) (C ‖X_m‖_F / γ) R_{s_{0:L}, b_{0:L}}. (4.34)

Plugging it back into the bound gives:

Rad(r̃_γ ∘ F_{s_{0:L}, b_{0:L}} | S_m) ≤ (12/m) (C ‖X_m‖_F / γ) R_{s_{0:L}, b_{0:L}} (1 − log((6/m) (C ‖X_m‖_F / γ) R_{s_{0:L}, b_{0:L}})). (4.35)

From an a-priori bound to an a-posteriori bound

We have thus obtained a bound for a class of neural networks with a-priori bounded weight norms. Let θ(i_{0:L}, j_{0:L}) be the following set of network weights:

θ(i_{0:L}, j_{0:L}) = {W_{0:L} : ∀l ∈ [L]_0, ‖W_l‖_2 ≤ s_l(i_l), ‖W_l^T‖_{2,1} ≤ b_l(j_l)}, (4.36)

where s_l(·) and b_l(·) are strictly monotonic functions on N growing to infinity. Correspondingly, define a set of failure probabilities:

δ(i_{0:L}, j_{0:L}) = δ / Π_{l=0}^L (i_l(i_l + 1) j_l(j_l + 1)). (4.37)

Note the following equality:

Σ_{j=1}^∞ 1/(j(j+1)) = Σ_{j=1}^∞ (1/j − 1/(j+1)) = 1. (4.38)

This implies the following:

Σ_{i_0=1}^∞ … Σ_{i_L=1}^∞ Σ_{j_0=1}^∞ … Σ_{j_L=1}^∞ δ(i_{0:L}, j_{0:L}) = δ. (4.39)

Hence by applying the union bound, the following holds with probability ≥ 1 − δ over S_m:

sup_{f∈F_{s_{0:L}(i_{0:L}), b_{0:L}(j_{0:L})}} (R(f) − R_m(f)) ≤ 2 E_{S′_m} Rad(r̃_γ ∘ F_{s_{0:L}(i_{0:L}), b_{0:L}(j_{0:L})} | S′_m) + √((1/2m) log(1/δ(i_{0:L}, j_{0:L}))) ∀i_l, j_l ∈ N. (4.40)

Take the smallest i_{0:L} and j_{0:L} such that ‖W_l‖_2 ≤ s_l(i_l) and ‖W_l^T‖_{2,1} ≤ b_l(j_l) ∀l ∈ [L]_0, where W_{0:L} are the weights of a learned network f_m = A(S_m). Denote these sets by i*_{0:L} and j*_{0:L}; let s*_{0:L} = s_{0:L}(i*_{0:L}) and b*_{0:L} = b_{0:L}(j*_{0:L}). Given this, f_m ∈ F_{s*_{0:L}, b*_{0:L}}, and we have with probability ≥ 1 − δ over S_m:

R(f_m) − R_m(f_m) ≤ sup_{f∈F_{s*_{0:L}, b*_{0:L}}} (R(f) − R_m(f)) ≤ 2 E_{S′_m} Rad(r̃_γ ∘ F_{s*_{0:L}, b*_{0:L}} | S′_m) + √((1/2m) log(1/δ(i*_{0:L}, j*_{0:L}))). (4.41)

Let us express the corresponding spectral complexity in a more convenient form:

R_{s*_{0:L}, b*_{0:L}} = (Σ_{l=0}^L (b*_l Π_{l′≠l} s*_{l′})^{2/3})^{3/2} ≤ (Σ_{l=0}^L ((‖W_l^T‖_{2,1} + Δb*_l) Π_{l′≠l} (‖W_{l′}‖_2 + Δs*_{l′}))^{2/3})^{3/2}, (4.42)

where Δs*_l = s_l(i*_l) − s_l(i*_l − 1) and Δb*_l = b_l(j*_l) − b_l(j*_l − 1); the inequality uses the minimality of i*_l and j*_l. At the same time,

R_{s*_{0:L}, b*_{0:L}} ≥ (Σ_{l=0}^L (‖W_l^T‖_{2,1} Π_{l′≠l} ‖W_{l′}‖_2)^{2/3})^{3/2}. (4.43)


These two bounds together give an upper bound for Rad(r̃_γ ∘ F_{s*_{0:L}, b*_{0:L}} | S′_m) that depends explicitly on the learned weight norms, but not on s*_{0:L} and b*_{0:L}.

Note that i*_l = i_l(s*_l) ≤ i_l(‖W_l‖_2) + 1 and j*_l = j_l(b*_l) ≤ j_l(‖W_l^T‖_{2,1}) + 1 ∀l ∈ [L]_0, where i_l(s_l) and j_l(b_l) are the inverse maps of s_l(i_l) and b_l(j_l), respectively. This gives an upper bound for log(δ(i*_{0:L}, j*_{0:L})^{−1}):

log(1/δ(i*_{0:L}, j*_{0:L})) ≤ log(1/δ) + Σ_{l=0}^L (log(1 + i_l(‖W_l‖_2)) + log(2 + i_l(‖W_l‖_2)) + log(1 + j_l(‖W_l^T‖_{2,1})) + log(2 + j_l(‖W_l^T‖_{2,1}))). (4.44)

To sum up, we have expressed the bound on the test-train risk difference in terms of the weights of the learned model f_m = A(S_m), thus arriving at an a-posteriori bound. Note that the bound is valid for any sequences s_l(i_l) and b_l(j_l) taken beforehand. Following [Bartlett et al., 2017], we can take, for instance, s_l(i_l) = i_l/L and b_l(j_l) = j_l/L.

4.1.3 Failure of uniform bounds

Recall the general uniform bound:

R(f_m) − R_m(f_m) ≤ sup_{f∈F} (R(f) − R_m(f)), (4.45)

where f_m = A(S_m). We have already discussed that the bound fails if F contains a "bad model" for which R(f) is large while R_m(f) is small; hence we are interested in taking F as small as possible. We have also noted that the smallest F we can consider is A(M^m), where M ⊇ supp(D).

Consider now the ideal case: ∃ǫ > 0 : R(f) < ǫ ∀f ∈ F. In other words, all models of the class F generalize well. In this case the bound (4.45) becomes simply:

R(f_m) − R_m(f_m) ≤ ǫ w.p. ≥ 1 − δ over S_m, (4.46)

which is perfect. Our next step was to apply McDiarmid's inequality, see eq. (4.13); in our case this results in:

sup_{f∈F} (R(f) − R_m(f)) ≤ ǫ + √((1/2m) log(1/δ)) w.p. ≥ 1 − δ over S_m, (4.47)

which is almost perfect as well. What happened next is that we tried to upper-bound the expected supremum:

E_{S′_m} sup_{f∈F} (R(f) − R′_m(f)) = E_{S′_m} sup_{f∈F} (E_{S″_m} R″_m(f) − R′_m(f)) ≤ E_{S′_m} E_{S″_m} sup_{f∈F} (R″_m(f) − R′_m(f)). (4.48)

The last step is called "symmetrization". Note that having a small true error does not imply having a small empirical error on any train dataset. [Nagarajan and Kolter, 2019] constructed a learning setup in which for any S″_m there exists a model f_m ∈ F such that R″_m(f_m) ≈ 1; this is true even for F = A(M^m). Specifically, they provided a simple algorithm to construct a specific dataset ¬(S″_m) and take f_m = A(¬(S″_m)).

4.2 PAC-bayesian bounds

4.2.1 At most countable case

Recall the following bound for finite F:

P(sup_{f∈F} (R(f) − R_m(f)) ≥ ǫ) = P(∃f ∈ F : R(f) − R_m(f) ≥ ǫ) ≤ Σ_{f∈F} P(R(f) − R_m(f) ≥ ǫ) ≤ |F| e^{−2mǫ²} ∀ǫ > 0. (4.49)

When F has infinite cardinality, the bound above still holds, but it is vacuous. Consider an at most countable F and an ǫ that depends on f. If we take ǫ(f) for which Σ_{f∈F} e^{−2mǫ²(f)} is finite, then we arrive at a finite bound:

P(∃f ∈ F : R(f) − R_m(f) ≥ ǫ(f)) ≤ Σ_{f∈F} P(R(f) − R_m(f) ≥ ǫ(f)) ≤ Σ_{f∈F} e^{−2mǫ²(f)}. (4.50)

For instance, consider some probability distribution P(f) on F. Take ǫ(f) such that e^{−2mǫ²(f)} = P(f) e^{−2mǫ²} for some ǫ ∈ R_+. Solving this equation gives:

ǫ(f) = √(ǫ² + (1/2m) log(1/P(f))). (4.51)

Hence we have ∀ǫ > 0:

P(∃f ∈ F : R(f) − R_m(f) ≥ √(ǫ² + (1/2m) log(1/P(f)))) ≤ e^{−2mǫ²}. (4.52)

Or, equivalently, w.p. ≥ 1 − δ over S_m we have ∀f ∈ F:

R(f) − R_m(f) ≤ √((1/2m)(log(1/δ) + log(1/P(f)))). (4.53)

4.2.2 General case

Let us refer to P(f) as a "prior distribution". Suppose our learning algorithm is stochastic and outputs a model distribution Q(f), which we shall refer to as a "posterior":

f_m ∼ Q_m = A(S_m). (4.54)

We shall now prove the following theorem:

Theorem 11 ([McAllester, 1999a]). For any δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m we have:

R(Q_m) − R_m(Q_m) ≤ √((1/(2m−1))(log(4m/δ) + KL(Q_m || P))), (4.55)

where R(Q) = E_{f∼Q} R(f) and R_m(Q) = E_{f∼Q} R_m(f).

Proof. The proof relies on the following lemmas:

Lemma 8 ([McAllester, 1999a]). For any probability distribution P on F and for any δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m we have:

E_{f∼P} e^{(2m−1)(Δ_m(f))²} ≤ 4m/δ, (4.56)

where Δ_m(f) = |R(f) − R_m(f)|.

Lemma 9 ([Donsker and Varadhan, 1985]). Let P and Q be probability distributions on X. Then for any h : X → R,

E_{x∼Q} h(x) ≤ log E_{x∼P} e^{h(x)} + KL(Q || P). (4.57)

From the D-V lemma, taking X = F, h = (2m−1)Δ_m², and Q = Q_m:

E_{f∼Q_m} (2m−1)(Δ_m(f))² ≤ log E_{f∼P} e^{(2m−1)(Δ_m(f))²} + KL(Q_m || P). (4.58)

Hence from Lemma 8, w.p. ≥ 1 − δ over S_m we have:

E_{f∼Q_m} (2m−1)(Δ_m(f))² ≤ log(4m/δ) + KL(Q_m || P). (4.59)


A simple estimate concludes the proof:

R(Q_m) − R_m(Q_m) ≤ |E_{f∼Q_m}(R(f) − R_m(f))| ≤ E_{f∼Q_m} |R(f) − R_m(f)| = E_{f∼Q_m} Δ_m(f) ≤ √(E_{f∼Q_m} (Δ_m(f))²) ≤ √((1/(2m−1))(log(4m/δ) + KL(Q_m || P))). (4.60)

Let us prove the D-V lemma first; we shall prove it in the case when P ≪ Q and Q ≪ P:

Proof of Lemma 9.

E_{x∼Q} h(x) − KL(Q || P) = E_{x∼Q} (h(x) − log(dQ/dP(x))) = E_{x∼Q} log(e^{h(x)} dP/dQ(x)) ≤ log E_{x∼Q} (e^{h(x)} dP/dQ(x)) = log E_{x∼P} e^{h(x)}, (4.61)

where dQ/dP is a Radon–Nikodym derivative.

We now proceed with proving Lemma 8:

Proof of Lemma 8. Recall Markov's inequality:

Theorem 12 (Markov's inequality). Let X be a non-negative random variable. Then ∀a > 0,

P(X ≥ a) ≤ E X / a. (4.62)

Hence, taking a = 4m/δ, it suffices to show that

E_{S_m} E_{f∼P} e^{(2m−1)(Δ_m(f))²} ≤ 4m. (4.63)

We are going to prove a stronger property:

E_{S_m} e^{(2m−1)(Δ_m(f))²} ≤ 4m ∀f ∈ F. (4.64)

Note that from Hoeffding's inequality we get:

P_{S_m}(Δ_m(f) ≥ ǫ) ≤ 2e^{−2mǫ²} ∀ǫ > 0 ∀f ∈ F. (4.65)

First, assume that the distribution of Δ_m(f) has a density ∀f ∈ F; denote it by p_f(Δ). In this case we can directly upper-bound the expectation over S_m:

E_{S_m} e^{(2m−1)(Δ_m(f))²} = ∫_0^∞ e^{(2m−1)ǫ²} p_f(ǫ) dǫ = ∫_0^∞ e^{(2m−1)ǫ²} d(−∫_ǫ^∞ p_f(Δ) dΔ)
= −(e^{(2m−1)ǫ²} ∫_ǫ^∞ p_f(Δ) dΔ)|_{ǫ=0}^∞ + 2(2m−1) ∫_0^∞ ǫ e^{(2m−1)ǫ²} ∫_ǫ^∞ p_f(Δ) dΔ dǫ
≤ ∫_0^∞ p_f(Δ) dΔ + 2(2m−1) ∫_0^∞ ǫ e^{(2m−1)ǫ²} ∫_ǫ^∞ p_f(Δ) dΔ dǫ
≤ 2 + 4(2m−1) ∫_0^∞ ǫ e^{(2m−1)ǫ²} e^{−2mǫ²} dǫ = 2 + 4(2m−1) ∫_0^∞ ǫ e^{−ǫ²} dǫ = 2 + 2(2m−1) = 4m. (4.66)

We now relax our assumption of density existence. Let µ_f be the distribution of Δ_m(f). Consider the class M of all non-negative sigma-additive measures on R_+ for which a property similar to (4.65) holds:

µ([ǫ, ∞)) ≤ 2e^{−2mǫ²} ∀ǫ > 0 ∀µ ∈ M. (4.67)

Note that M contains the probability distribution of Δ_m(f) for any f ∈ F. Among these measures we shall choose a specific one that maximizes an analogue of the left-hand side of (4.64):

µ* ∈ Argmax_{µ∈M} ∫_0^∞ e^{(2m−1)Δ²} µ(dΔ). (4.68)

Note that constraint (4.67) states that the measure of a right tail of the real line should be upper-bounded. However, µ* should have as much mass to the right as possible. Hence constraint (4.67) should become an equality for this specific µ*:

µ*([ǫ, ∞)) = 2e^{−2mǫ²} ∀ǫ > 0. (4.69)

From this it follows that µ* has density p*(Δ) = 8mΔ e^{−2mΔ²}.

Note that an inequality similar to (4.66) holds for p*. Moreover, since µ* maximizes ∫_0^∞ e^{(2m−1)Δ²} µ(dΔ), we have the following bound:

E_{S_m} e^{(2m−1)(Δ_m(f))²} = E_{Δ∼µ_f} e^{(2m−1)Δ²} = ∫_0^∞ e^{(2m−1)Δ²} µ_f(dΔ) ≤ ∫_0^∞ e^{(2m−1)Δ²} p*(Δ) dΔ ≤ 4m. (4.70)

4.2.3 Applying PAC-bayesian bounds to deterministic algorithms

Consider a deterministic learning rule A(S_m) ∼ Q_m, where Q_m is a Kronecker delta. While this situation is fine for the at most countable case, whenever F is uncountable and P(f) = 0 ∀f ∈ F, KL(Q_m || P) = ∞ and we arrive at a vacuous bound.

Compression and coding

One work-around is to consider some discrete coding c, with enc_c(·) being an encoder and dec_c(·) being a decoder. We assume that dec_c(enc_c(f)) ≈ f ∀f ∈ F and instantiate a bound of the form (4.53) for enc_c(f). Equivalently, we shall write f_c for enc_c(f). Following [Zhou et al., 2019], we take a prior that prioritizes models of small code length:

P_c(f_c) = (1/Z) m(|f_c|) 2^{−|f_c|}, (4.71)

where |f_c| is the code length for f, m(k) is some probability distribution on N, and Z is a normalizing constant. In this case the KL-divergence is given as:

KL(δ_{f_c} || P_c) = log Z + |f_c| log 2 − log(m(|f_c|)). (4.72)

In order to make our bound as small as possible, we need to ensure that our learning algorithm, when fed realistic data, outputs models of small code length. One can ensure this by coding not the model f itself, but rather the result of its compression via a compression algorithm C. We assume that a compressed model C(f) is still a model from F. We also hope that its risk does not change significantly, R(C(f)) ≈ R(f), and that the learning algorithm tends to output models which, in compressed form, have small code length. In this case we are able to upper-bound the test-train risk difference for an encoded compressed model C(f)_c instead of the original one.

When our models are neural nets parameterized with a set of weights θ, a typical form of a compressed model is a tuple (S, Q, C), where

• S = s_{1:k} ⊂ [dim θ] are the locations of non-zero weights;

• C = c_{1:r} ⊂ R is a codebook;

• Q = q_{1:k}, q_i ∈ [r] ∀i ∈ [k], are quantized values.

Then C(θ)_i = c_{q_j} if i = s_j, else 0. In this case naive coding with 32-bit precision gives:

|C(θ)|_c = |S|_c + |Q|_c + |C|_c ≤ k(log dim θ + log r) + 32r. (4.73)
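A sketch of this accounting (the sizes are hypothetical; log is taken base 2 since we count bits):

import numpy as np

def code_length_bits(dim_theta, k, r):
    """Naive bound (4.73): |S|_c + |Q|_c + |C|_c with 32-bit codebook entries."""
    return k * (np.log2(dim_theta) + np.log2(r)) + 32 * r

# e.g. a 1M-parameter net pruned to 5% of its weights, quantized to 64 levels:
print(code_length_bits(dim_theta=10 ** 6, k=50_000, r=64))  # ~1.3e6 bits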


Stochastization

Another work-around is to voluntarily substitute f_m with some Q_m, presumably satisfying E_{f∼Q_m} f = f_m, such that KL(Q_m || P) is finite. In this case we get an upper bound for R(Q_m) instead of R(f_m). One possible goal may be to obtain as good a generalization guarantee as possible; in this case one can optimize the upper bound on R(Q_m) wrt Q_m. Another goal may be to get a generalization guarantee for f_m itself; in this case we have to somehow relate it to a generalization guarantee for Q_m.

Let us discuss the former goal first. Our goal is to optimize the upper bound on the test risk wrt a stochastic model Q:

R(Q) ≤ R_m(Q) + √((1/(2m−1))(log(4m/δ) + KL(Q || P))) → min_Q. (4.74)

In order to make optimization via GD possible, we first substitute R_m with its differentiable convex surrogate L_m:

R(Q) ≤ L_m(Q) + √((1/(2m−1))(log(4m/δ) + KL(Q || P))) → min_Q. (4.75)

The second thing we have to do in order to make GD optimization feasible is to switch from searching an abstract model-distribution space to searching some Euclidean space. Let F be the space of models realizable by a given neural network architecture, and let θ denote a set of weights. Following [Dziugaite and Roy, 2017], we consider an optimization problem in a distribution space Q consisting of non-degenerate diagonal Gaussians:

Q = {N(µ, diag(e^λ)) : µ ∈ R^{dim θ}, λ ∈ R^{dim θ}}. (4.76)

In this case we substitute the model class F with the set of network weights R^{dim θ}. For Q ∈ Q and a Gaussian prior P = N(µ*, e^{λ*} I), the KL-divergence is given as follows:

KL(Q || P) = ½ (e^{−λ*} (‖e^λ‖_1 + ‖µ − µ*‖_2²) + dim θ (λ* − 1) − 1^T λ). (4.77)
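Eq. (4.77) is easy to sanity-check numerically; below is a direct transcription (a sketch; variable names are ours), together with the check that the KL vanishes when Q = P.

import numpy as np

def kl_diag_gauss(mu, lam, mu_star, lam_star):
    """KL(N(mu, diag(exp(lam))) || N(mu_star, exp(lam_star) I)), eq. (4.77)."""
    d = mu.size
    return 0.5 * (np.exp(-lam_star) * (np.exp(lam).sum() + np.sum((mu - mu_star) ** 2))
                  + d * (lam_star - 1.0) - lam.sum())

mu, lam = np.zeros(3), np.full(3, -2.0)
print(kl_diag_gauss(mu, lam, mu, -2.0))        # 0.0: Q coincides with P
print(kl_diag_gauss(mu + 0.1, lam, mu, -2.0))  # > 0 once the means differ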

Since both the KL term and the loss term are differentiable wrt (µ, λ), we can optimize the test risk bound via GD. [Dziugaite and Roy, 2017] suggest starting the optimization process from µ = θ_m, where θ_m is the set of weights of the model f_m = A(S_m), with λ being a sufficiently large negative number.

The next question is how to choose the prior. Note that the distribution we finally choose as a result of the bound optimization does not take into account the stochasticity of the initialization θ(0) of the algorithm A that finds θ_m. For this reason, the prior is allowed to depend on θ(0); following [Dziugaite and Roy, 2017], we take µ* = θ(0). The rationale is that in this case the KL-term depends on ‖µ − θ(0)‖_2². If we hope that both optimization processes do not lead us far away from their initializations, the KL-term will not be too large.

As for the prior log-variance λ*, we apply the same technique as for obtaining an a-posteriori uniform bound; see Section 4.1.2. Define λ*_j = log c − j/b, where c, b > 0, j ∈ N. Take δ_j = 6δ/(π²j²). Then we get a valid bound for any j ≥ 1:

R(Q) ≤ L_m(Q) + √((1/(2m−1))(log(4m/δ_j) + KL(Q || P(µ*, λ*_j)))) w.p. ≥ 1 − δ_j over S_m. (4.78)

A union bound gives:

R(Q) ≤ L_m(Q) + √((1/(2m−1))(log(4m/δ_j) + KL(Q || P(µ*, λ*_j)))) ∀j ∈ N w.p. ≥ 1 − δ over S_m. (4.79)

This allows us to optimize the bound wrt j. However, optimization over real numbers is preferable, since it allows us to apply GD. In order to achieve this, we express j as a function of λ*: j = b(log c − λ*). This gives us the following expression:

R(Q) ≤ L_m(Q) + √((1/(2m−1))(log(2π²mb²(log c − λ*)²/(3δ)) + KL(Q || P(µ*, λ*)))) ∀λ* ∈ {λ*_j}_{j=1}^∞ w.p. ≥ 1 − δ. (4.80)

The expression above allows us to optimize its right-hand side wrt λ* via GD. However, we cannot guarantee that the optimization result lies in {λ*_j}_{j=1}^∞. To remedy this, [Dziugaite and Roy, 2017] simply round the result to the closest λ* in this set. To sum up, we take µ* = θ(0) and optimize the bound (4.80) wrt µ, λ, and λ* via GD.


A bound for a deterministic model

Recall that in the previous section we aimed to find a stochastic model that optimizes the upper bound on the test risk. In the current section we shall discuss how to obtain a bound for a given deterministic model f_m in a PAC-bayesian framework.

Consider a neural network f_θ with L − 1 hidden layers, weights θ, and no biases; let φ(·) be an activation function. Suppose our learning algorithm A outputs weights θ_m when given a dataset S_m. In our current framework, both the prior and the posterior are distributions on R^{dim θ}. Note that McAllester's theorem (Theorem 11) requires computing a KL-divergence between two distributions in the model space. Nevertheless, noting that weights are mapped to models surjectively, we can upper-bound this term with a KL-divergence in the weight space:

Corollary 1 (of Theorem 11). For any δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m we have:

R(Q_m) ≤ R_m(Q_m) + √((1/(2m−1))(log(4m/δ) + KL(Q_m || P))), (4.81)

where R(Q) = E_{θ∼Q} R(f_θ) and R_m(Q) = E_{θ∼Q} R_m(f_θ).

For deterministic A, our Q_m is degenerate, and the bound is vacuous. The bound is, however, valid for any distribution Q_m in the weight space. We take Q_m = N(θ_m, σ²I_{dim θ}) for some σ given beforehand. We take the prior as P = N(0, σ²I_{dim θ}); in this case the train risk term and the KL-term in the right-hand side of Corollary 1 are given as follows:

R_m(Q_m) = E_{ξ∼N(0, σ²I_{dim θ})} R_m(f_{θ_m+ξ}), KL(Q_m || P) = ‖θ_m‖_2² / (2σ²). (4.82)

This gives us an upper bound for R(Q_m); our goal is, however, to bound R(f_m) instead. The following lemma tells us that it is possible to substitute the risk of a stochastic model with a margin risk of a deterministic one:

Lemma 10 ([Neyshabur et al., 2018]). Let the prior P have density p. For any δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m, for any deterministic θ and any random absolutely continuous ξ such that

P_ξ(max_{x∈X} |f_{θ+ξ}(x) − f_θ(x)| < γ/2) ≥ 1/2, (4.83)

we have:

R(f_θ) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(16m/δ) + 2 KL(q′ || p))), (4.84)

where q′ denotes the probability density of θ + ξ.

This lemma requires the noise ξ to conform to a certain property; the next lemma will help us choose the standard deviation σ accordingly:

Lemma 11 ([Neyshabur et al., 2018]). Let φ(z) = [z]_+. For any x ∈ X_B, where X_B = {x ∈ X : ‖x‖_2 ≤ B}, for any θ = vec({W_l}_{l=1}^L), and for any ξ = vec({U_l}_{l=1}^L) such that ‖U_l‖_2 ≤ (1/L) ‖W_l‖_2 ∀l ∈ [L],

|f_{θ+ξ}(x) − f_θ(x)| ≤ eB (Π_{l=1}^L ‖W_l‖_2) Σ_{l=1}^L (‖U_l‖_2 / ‖W_l‖_2). (4.85)

These lemmas allow us to prove the following result:

Theorem 13 ([Neyshabur et al., 2018]). Assume the inputs are supported on X_B and φ(z) = [z]_+; let n = max_l n_l. For any δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m, we have for any θ_m:

R(f_{θ_m}) ≤ R_{m,γ}(f_{θ_m}) + √((1/(2m−1))(log(8Lm/δ) + (1/2L) log m + 8e⁴ (B R(θ_m)/γ)² L²n log(2Ln))), (4.86)


where we have defined a spectral complexity:

R(θ) = (Π_{l=1}^L ‖W_l‖_2) (Σ_{l=1}^L ‖W_l‖_F² / ‖W_l‖_2²)^{1/2}. (4.87)

Compare this with the result of Bartlett and coauthors:

Theorem 14 ([Bartlett et al., 2017]). Assume the inputs are supported on X_B and φ(z) = [z]_+; let n = max_l n_l. For any δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m, we have for any θ_m:

R(f_{θ_m}) ≤ R_{m,γ}(f_{θ_m}) + Rad(r̃_γ ∘ F_{≤θ_m} | S_m) + √((1/2m) log(1/δ)), (4.88)

where we upper-bound the Rademacher complexity as

Rad(r̃_γ ∘ F_{≤θ} | S_m) ≤ (C/√m) (B R(θ, L^{−1})/γ) √(log(2n)) (1 − log((C/(2√m)) (B R(θ, 0)/γ) √(log(2n)))), (4.89)

and we define a spectral complexity as

R(θ, Δ) = (Π_{l=1}^L (‖W_l‖_2 + Δ)) (Σ_{l=1}^L ((‖W_l^T‖_{2,1} + Δ) / (‖W_l‖_2 + Δ))^{2/3})^{3/2}. (4.90)

Both bounds grow linearly with (B/γ) Π_{l=1}^L ‖W_l‖_2, which is a very natural property. While the former result is simpler, the latter does not depend explicitly on the depth L and width n. Nevertheless, the proof of the latter result is rather technically involved, while the proof of the former can be reproduced without substantial effort. A side-by-side numeric comparison of the two complexities is sketched below.
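The sketch evaluates eq. (4.87) and eq. (4.90) (at Δ = 0) on the same random weights; it is purely illustrative, and constants such as B and γ are omitted.

import numpy as np

def neyshabur_R(weights):
    """Spectral complexity (4.87)."""
    s = np.array([np.linalg.norm(W, ord=2) for W in weights])
    f = np.array([np.linalg.norm(W) for W in weights])        # Frobenius norms
    return np.prod(s) * np.sqrt(np.sum(f ** 2 / s ** 2))

def bartlett_R(weights, delta=0.0):
    """Spectral complexity (4.90)."""
    s = np.array([np.linalg.norm(W, ord=2) for W in weights]) + delta
    b = np.array([np.linalg.norm(W, axis=1).sum() for W in weights]) + delta
    return np.prod(s) * np.sum((b / s) ** (2 / 3)) ** 1.5

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(128, 128)) / np.sqrt(128) for _ in range(4)]
print(neyshabur_R(Ws), bartlett_R(Ws))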

Proof of Theorem 13. First of all, define:

β = (Π_{l=1}^L ‖W_l‖_2)^{1/L}, W̃_l = (β / ‖W_l‖_2) W_l. (4.91)

Since ReLU is homogeneous, f_θ = f_θ̃. Also, Π_{l=1}^L ‖W̃_l‖_2 = Π_{l=1}^L ‖W_l‖_2 and Σ_{l=1}^L ‖W̃_l‖_F²/‖W̃_l‖_2² = Σ_{l=1}^L ‖W_l‖_F²/‖W_l‖_2². Hence both the model and the bound do not change if we substitute θ with θ̃. Hence w.l.o.g. assume that ‖W_l‖_2 = β ∀l ∈ [L].

We now use Lemma 11 to find a σ > 0 for which the condition of Lemma 10 is satisfied. In particular, we have to upper-bound the probability of ‖U_l‖_2 ≥ β/L for some l ∈ [L]. Notice that for ξ ∼ N(0, σ²I_{dim θ}), U_l has i.i.d. zero-centered Gaussian entries ∀l ∈ [L]. In the trivial case of 1×1 matrices, we could apply a simple tail bound:

P_{ξ∼N(0,σ²)}(|ξ| ≥ ǫ) = 2 P_{ξ∼N(0,1)}(ξ ≥ ǫ/σ) ≤ 2e^{−ǫ²/(2σ²)}. (4.92)

This bound follows from the Chernoff bound, which is a simple corollary of Markov's inequality:

Theorem 15 (Chernoff bound). For a real-valued random variable X, for any a ∈ R, and for any λ ∈ R_+, we have:

P(X ≥ a) ≤ E e^{λX} / e^{λa}. (4.93)

Pξ∼N (0,1)(ξ ≥ a) ≤ E eλξ

eλa≤ e− supλ(λa−log E eλξ) = e

− supλ

(

λa−λ2

2

)

= e− supλ

(

− 12 (λ−a)2+ a2

2

)

= e−a2

2 , (4.94)

where we have used the moment-generating function for gaussians:

E ξ∼N (0,1)eλξ =

∞∑

k=0

λkE ξk

k!=

∞∑

k=0

λ2k(2k − 1)!!

(2k)!=

∞∑

k=0

λ2k

(2k)!!=

∞∑

k=0

λ2k

2kk!= e

λ2

2 . (4.95)


We can apply the same bound to a linear combination of i.i.d. standard Gaussians:

P_{ξ_{1:m}∼N(0,1)}(|Σ_{i=1}^m a_i ξ_i| ≥ ǫ) = P_{ξ∼N(0, Σ_{i=1}^m a_i²)}(|ξ| ≥ ǫ) ≤ 2 e^{−ǫ² / (2 Σ_{i=1}^m a_i²)}. (4.96)

Moreover, a similar bound holds for matrix-valued linear combinations:

Theorem 16 ([Tropp, 2011]). Let A_{1:m} be n×n deterministic matrices and let ξ_{1:m} be i.i.d. standard Gaussians. Then

P(‖Σ_{i=1}^m ξ_i A_i‖_2 ≥ ǫ) ≤ n e^{−ǫ² / (2 ‖Σ_{i=1}^m A_i²‖_2)}. (4.97)

(Why don't we have a factor of 2 here?)

Let us return to bounding the probability of ‖U_l‖_2 ≥ β/L. For any l ∈ [L], Tropp's theorem gives:

P(‖U_l‖_2 ≥ t) ≤ P(‖Ū_l‖_2 ≥ t) = P_{ξ_{1:n,1:n}∼N(0,1)}(σ ‖Σ_{i,j=1}^n ξ_{ij} 1_{ij}‖_2 ≥ t) ≤ n e^{−t²/(2σ²n)}, (4.98)

where 1_{ij} denotes the matrix with a single unit entry at position (i, j), and Ū_l is an n×n matrix with entries:

Ū_{l,ij} = U_{l,ij} for 1 ≤ i ≤ n_l, 1 ≤ j ≤ n_{l+1}; Ū_{l,ij} ∼ N(0, σ²) otherwise. (4.99)

Hence by a union bound:

P(∃l ∈ [L] : ‖U_l‖_2 ≥ t) ≤ Ln e^{−t²/(2σ²n)}. (4.100)

Equating the right-hand side to 1/2 gives t = σ√(2n log(2Ln)). Next, requiring t ≤ β/L gives

σ ≤ σ_{max,1} = β / (L √(2n log(2Ln))) (4.101)

and allows us to apply Lemma 11: w.p. ≥ 1/2 over ξ,

max_{x∈X_B} |f_{θ+ξ}(x) − f_θ(x)| ≤ eBβ^{L−1} Σ_{l=1}^L ‖U_l‖_2 ≤ eBβ^{L−1} L σ √(2n log(2Ln)). (4.102)

In order to apply Lemma 10, we need to ensure that this quantity is bounded by γ/2. This gives

σ ≤ σ_{max,2} = γ / (2eBβ^{L−1} L √(2n log(2Ln))). (4.103)

Taking σ = σ_max = min(σ_{max,1}, σ_{max,2}) hence ensures the condition of Lemma 10. The problem is that σ now depends on β and hence on θ_m; this means that the prior P = N(0, σ²I_{dim θ}) depends on θ_m. For this reason, we have to apply a union bound argument for choosing σ.

Let B̃ be a discrete subset of R_+. Then ∀β̃ ∈ B̃, ∀δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m, ∀θ such that σ_max(β) ≥ σ_max(β̃) (with β computed from θ as in (4.91)):

R(f_θ) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(16m/δ) + ‖θ‖_2² / σ²_max(β̃))). (4.104)

A union bound gives: ∀δ ∈ (0, 1), w.p. ≥ 1 − δ over S_m, ∀θ and ∀β̃ ∈ B̃ such that σ_max(β) ≥ σ_max(β̃):

R(f_θ) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(16m/δ) + ‖θ‖_2² / σ²_max(β̃) + log |B̃|)). (4.105)

We need B̃ to be finite in order to have a finite bound. First, note that for β^L < γ/(2B) we have ∀x ∈ X_B: |f_θ(x)| ≤ β^L B < γ/2, which implies R_{m,γ}(f_θ) = 1. In this case the bound is trivially true.

Second, for β^L > γ√m/(2B), the second term of the final bound (see Theorem 13) is greater than 1, and the bound again becomes trivially true. Hence it suffices to take any finite B̃ with min(B̃) = β_min = (γ/(2B))^{1/L} and max(B̃) = β_max = (γ√m/(2B))^{1/L}. Note that for β ∈ [β_min, β_max], σ_max = σ_{max,2}; indeed,

σ_{max,1} / σ_{max,2} = 2eγ^{−1}Bβ^L ≥ e > 1. (4.106)

Hence σ_max(β) ≥ σ_max(β̃) is equivalent to β ≤ β̃.

We shall take B̃ such that the following holds:

∀β ∈ [β_min, β_max] ∃β̃ ∈ B̃ : e^{−1} β̃^{L−1} ≤ β^{L−1} ≤ e β̃^{L−1}. (4.107)

In this case, obviously, β ≤ β̃ can be ensured, and

σ²_max(β̃) = γ² / (8e²B²β̃^{2L−2}L²n log(2Ln)) ≥ γ² / (8e⁴B²β^{2L−2}L²n log(2Ln)). (4.108)

We shall prove that the following B̃ conforms to condition (4.107):

B̃ = {β_min (1 + 2k/L)}_{k=0}^K, K = max{k : β_min (1 + 2k/L) ≤ β_max}. (4.109)

Hence 2K = ⌊L(β_max/β_min − 1)⌋ = ⌊L(m^{1/2L} − 1)⌋. This gives:

log |B̃| = log(K + 1) ≤ log(L m^{1/2L} / 2) = log(L/2) + (1/2L) log m. (4.110)

Indeed, for any β ∈ [β_min, β_max], ∃β̃ ∈ B̃ : |β − β̃| ≤ β_min/L ≤ β̃/L. Hence

e β̃^{L−1} ≥ (β̃ + β̃/L)^{L−1} ≥ (β̃ + |β − β̃|)^{L−1} ≥ β^{L−1}, (4.111)

e^{−1} β̃^{L−1} ≤ (β̃ − β̃/L)^{L−1} ≤ (β̃ − |β − β̃|)^{L−1} ≤ β^{L−1}, (4.112)

which proves condition (4.107).

log16m

δ+

‖θ‖22σ2max(β)

+ log |B| ≤ log8Lm

δ+

1

2Llogm+ 8γ−2e4B2β2LL2n log(2Ln)

L∑

l=1

‖Wl‖2Fβ2

. (4.113)

This gives the final bound:

R(f_θ) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(8Lm/δ) + (1/2L) log m + 8e⁴ (B R(θ)/γ)² L²n log(2Ln))), (4.114)

where we have introduced a spectral complexity:

R(θ) = β^L (Σ_{l=1}^L ‖W_l‖_F² / β²)^{1/2} = (Π_{l=1}^L ‖W_l‖_2) (Σ_{l=1}^L ‖W_l‖_F² / ‖W_l‖_2²)^{1/2}. (4.115)

Proof of Lemma 10. Let θ and ξ conform to Condition (4.83), and let θ′ = θ + ξ. Define:

A_θ = {θ′ : max_{x∈X} |f_{θ′}(x) − f_θ(x)| < γ/2}. (4.116)

Following Condition (4.83), we get P(A_θ) ≥ 1/2.

Since ξ has a density, θ′ has a density as well; denote it by q′. Define:

q(θ̃) = (1/Z) q′(θ̃)[θ̃ ∈ A_θ], where Z = P(A_θ). (4.117)

Note that max_{x∈X} |f_{θ̃}(x) − f_θ(x)| < γ/2 a.s. with respect to θ̃ ∼ q. Therefore:

R(f_θ) ≤ R_{γ/2}(f_{θ̃}) and R_{m,γ/2}(f_{θ̃}) ≤ R_{m,γ}(f_θ) a.s. with respect to θ̃ ∼ q. (4.118)

Hence

R(f_θ) ≤ E_{θ̃∼q} R_{γ/2}(f_{θ̃}) ≤ (w.p. ≥ 1 − δ over S_m) ≤ E_{θ̃∼q} R_{m,γ/2}(f_{θ̃}) + √((1/(2m−1))(log(4m/δ) + KL(q || p))) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(4m/δ) + KL(q || p))). (4.119)

The only thing that remains is estimating the KL-term. Define:

q_c(θ̃) = (1/(1−Z)) q′(θ̃)[θ̃ ∉ A_θ]. (4.120)

We then get:

KL(q′ || p) = KL(qZ + q_c(1−Z) || p) = E_{θ′∼q′} (log(q(θ′)Z + q_c(θ′)(1−Z)) − log p(θ′))
= E_{b∼B(Z)} E_{θ′∼q′|b} (log(q′(θ′|1)Z + q′(θ′|0)(1−Z)) − (Z + (1−Z)) log p(θ′))
= Z(log Z + KL(q′|1 || p)) + (1−Z)(log(1−Z) + KL(q′|0 || p))
= Z KL(q || p) + (1−Z) KL(q_c || p) − H(B(Z)). (4.121)

This implies the following:

KL(q || p) = (1/Z)(KL(q′ || p) + H(B(Z)) − (1−Z) KL(q_c || p)) ≤ (1/P(A_θ))(KL(q′ || p) + log 2) ≤ 2 (KL(q′ || p) + log 2). (4.122)

Hence w.p. ≥ 1 − δ over S_m we have:

R(f_θ) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(4m/δ) + KL(q || p))) ≤ R_{m,γ}(f_θ) + √((1/(2m−1))(log(16m/δ) + 2 KL(q′ || p))). (4.123)

Proof of Lemma 11. Recall the forward dynamics:

h2(x; θ) =W1x ∈ Rn2 , xl(x; θ) = φ(hl(x; θ)) ∈ R

nl , hl+1(x; θ) =Wlxl(x; θ) ∈ Rnl+1 ∀l ∈ 2, . . . , L. (4.124)

Assume that x, θ, and ξ are fixed. Define:

∆l = ‖hl+1(x; θ + ξ)− hl+1(x; θ)‖2 ∀l ∈ [L]. (4.125)

We are going to prove the following by induction:

∆l ≤(

1 +1

L

)l

‖x‖2(

l∏

l′=1

‖Wl′‖2)

l∑

l′=1

‖Ul′‖2‖Wl′‖2

. (4.126)

The induction base is given as:

∆1 = ‖h2(x; θ + ξ)− h2(x; θ)‖2 = ‖U1x‖2 ≤ ‖U1‖2‖x‖2, (4.127)

46

Page 49: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

and we prove the induction step below:

∆l = ‖hl+1(x; θ + ξ)− hl+1(x; θ)‖2 = ‖(Wl + Ul)xl(x; θ + ξ)−Wlxl(x; θ)‖2 =

= ‖(Wl + Ul)(xl(x; θ + ξ)− xl(x; θ)) + Ulxl(x; θ)‖2 ≤≤ ‖Wl + Ul‖2‖xl(x; θ + ξ)− xl(x; θ)‖2 + ‖Ul‖2‖xl(x; θ)‖2 ≤

≤ (‖Wl‖2 + ‖Ul‖2)‖hl(x; θ + ξ)− hl(x; θ)‖2 + ‖Ul‖2‖hl(x; θ)‖2 ≤

≤ ‖Wl‖2(

1 +1

L

)

∆l−1 + ‖Ul‖2‖x‖2l−1∏

l′=1

‖Wl′‖2 ≤

≤(

1 +1

L

)l

‖x‖2(

l∏

l′=1

‖Wl′‖2)

l−1∑

l′=1

‖Ul′‖2‖Wl′‖2

+‖Ul‖2‖Wl‖2

‖x‖2l∏

l′=1

‖Wl′‖2 ≤

≤(

1 +1

L

)l

‖x‖2(

l∏

l′=1

‖Wl′‖2)

l∑

l′=1

‖Ul′‖2‖Wl′‖2

. (4.128)

A simple estimate then gives the required statement:

‖fθ+ξ(x) − fθ(x)‖2 = ‖hL+1(x; θ + ξ)− hL+1(x; θ)‖2 =

= ∆L ≤(

1 +1

L

)L

‖x‖2(

L∏

l=1

‖Wl‖2)

L∑

l=1

‖Ul‖2‖Wl‖2

≤ eB

(

L∏

l=1

‖Wl‖2)

L∑

l=1

‖Ul‖2‖Wl‖2

. (4.129)

47

Page 50: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Chapter 5

Neural tangent kernel

5.1 Gradient descent training as a kernel method

Consider a parametric model with scalar output f(x; θ) ∈ R and let θ ∈ Rd. We aim to minimize a loss

E x,yℓ(y, f(x; θ)) via a gradient descent:

θt = −ηE x,y∂ℓ(y, z)

∂z

z=f(x;θt)

∇θf(x; θt). (5.1)

If we define a feature map Φt(x) = ∇θf(x; θt), then we can express the model as:

f(x; θ) = f(x; θt) + ΦTt (x)(θ − θt) +O(‖θ − θt‖22). (5.2)

It is a locally linear model in the vicinity of θt given a feature map Φt.We now multiply both parts of the equation (5.1) by ∇T

θ f(x′; θt):

ft(x′) = −ηE x,y

∂ℓ(y, z)

∂z

z=ft(x)

Θt(x′, x), (5.3)

where Θt(x′, x) = ∇T

θ f(x′; θt)∇θf(x; θt) and ft(x

′) = f(x′; θt).Here Θt is a kernel and Φt(x) = ∇θf(x; θt) is a corresponding feature map. We call Θt an empirical tangent

kernel at time-step t. It depends on the initialization, hence it is random. Given a train dataset (~x, ~y) of size m,the evolution of the responses on this dataset writes as follows:

ft(~x) = − η

mΘt(~x, ~x)

∂ℓ(~y, ~z)

∂~z

~z=ft(~x)

. (5.4)

We see that the gramian of the kernel maps loss gradients wrt model outputs to output increments. Note that whiledynamics (5.1) is complete, (5.3) is not, since Kt cannot be determined solely in terms of ft.

Nevertheless, if we consider linearized dynamics, Kt becomes independent of t and can computed once at theinitialization, thus making dynamics (5.3) complete. Let us define a linearized model:

flin(x; θ) = f(x; θ0) +∇Tθ f(x; θ0)(θ − θ0). (5.5)

This model then evolves similarly to f (eq. (5.3)), but with a kernel fixed at initialization:

flin,t(x′) = −ηE x,y

∂ℓ(y, z)

∂z

z=flin,t(x)

Θ0(x′, x). (5.6)

The gradient descent dynamics becomes:

θt = −ηE x,y∂ℓ(y, z)

∂z

z=flin,t(x)

∇θf(x; θ0). (5.7)

48

Page 51: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

5.1.1 Exact solution for a square loss

These equations are analytically solvable if we consider a square loss: ℓ(y, z) = 12 (y − z)2, see [Lee et al., 2019].

Let (~x, ~y), where ~x = ximi=1 and ~y = yimi=1, is a train set. Let f(~x) = f(xi)mi=1 ∈ Rm be a vector of model

responses on the train data. Finally, let Θt(~x, ~x) ∈ Rm×m be a Gramian of the kernel Θt: Θt(~x, ~x)ij = Θt(xi, xj).Eq. (5.6) evaluated on train set becomes:

flin,t(~x) = η1

mΘ0(~x, ~x)(~y − flin,t(~x)). (5.8)

Its solution writes as follows:flin,t(~x) = ~y + e−ηΘ0(~x,~x)t/m(f0(~x)− ~y). (5.9)

Given this, the weight dynamics (5.7) becomes:

θt = −η 1

m∇θf(~x; θ0)e

−ηΘ0(~x,~x)t/m(f0(~x)− ~y), (5.10)

where we have assumed that ∇θf(~x; θ0) ∈ Rd×m. Solving it gives:

θt = θ0 −∇θf(~x; θ0)Θ−10 (~x, ~x)(I − e−ηΘ0(~x,~x)t/m)(f0(~x)− ~y). (5.11)

Substituting the solution back to (5.5) gives a model prediction on an arbitrary input x:

flin,t(x) = f0(x)− Θ0(x, ~x)Θ−10 (~x, ~x)(I − e−ηΘ0(~x,~x)t/m)(f0(~x)− ~y), (5.12)

where we have defined a row-vector Θ0(x, ~x) with components Θ0,i(x, ~x) = ∇Tθ f(x; θ0)∇θf(xi; θ0).

5.1.2 Convergence to a gaussian process

Consider a network with L hidden layers and no biases:

f(x) =WLφ(WL−1 . . . φ(W0x)), (5.13)

where Wl ∈ Rnl+1×nl and a non-linearity φ is applied element-wise. Note that x ∈ Rn0 ; we denote with k = nL+1

the dimensionality of the output: f : Rn0 → Rk. We shall refer nl as the width of the l-th hidden layer.Let us assume x is fixed. Define:

h1 =W0x ∈ Rn1 , xl = φ(hl) ∈ R

nl , hl+1 =Wlxl ∈ Rnl+1 ∀l ∈ [L]. (5.14)

Hence given x f(x) = hL+1. Define also:

ql =1

nlEhTl hl. (5.15)

Let us assume that the weights are initialized with zero-mean gaussians, so that the forward dynamics isnormalized:

W ijl ∼ N

(

0,σ2w

nl

)

. (5.16)

Obiously, all components of hl are distributed identically. Their means are zeros, let us compute the variances:

ql+1 =1

nl+1ExTl W

Tl Wlxl =

σ2w

nlExTl xl =

σ2w

nlEφ(hl)

Tφ(hl) ∀l ∈ [L], q1 =1

n1ExTWT

0 W0x =σ2w

n0‖x‖22.

(5.17)We are going to prove by induction that ∀l ∈ [L + 1] ∀i ∈ [nl] h

il converges weakly to N (0, ql) as n1:l−1 → ∞

sequentially. Since components of W0 are gaussian, hi1 ∼ N (0, q1) ∀i ∈ [n0] — this gives the induction base. If allhil converge weakly to N (0, ql) as n1:l−1 → ∞ sequentially then limn1:l→∞ ql+1 = σ2

wE z∼N (0,ql)(φ(z))2. Hence by

the virture of CLT, hil+1 converges in distribution to N (0, ql+1) as n1:l → ∞ sequentially — this gives the inductionstep.

49

Page 52: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Consider two inputs, x1 and x2, together with their hidden representations h1l and h2l . Let us prove that

∀l ∈ [L + 1] ∀i ∈ [nl] (h1,il , h2,il )T converges in distribution to N (0,Σl) as n1:l−1 → ∞ sequentially, where the

covariance matrix is defined as follows:

Σl =

(

q11l q12lq12l q22l

)

; qabl =1

nlEha,Tl hbl , a, b ∈ 1, 2. (5.18)

We have already derived the dynamics for the diagonal terms in the subsequent limits of infinite width:

limn1:l→∞

qaal+1 = σ2wE z∼N (0,qaa

l)(φ(z))

2, qaa1 = σ2w

‖xa‖22n0

, a ∈ 1, 2. (5.19)

Consider the diagonal term:

q12l+1 =1

nl+1E h1

l,h2

lEWl

φ(h1l )TWT

l Wlφ(h2l ) =

σ2w

nlE h1

l,h2

lφ(h1l )

Tφ(h2l ). (5.20)

By induction hypothesis, as n1:l−1 → ∞ we have a weak limit:

(

h1,il

h2,il

)

→ N (0,Σl). (5.21)

Hencelim

n1:l→∞q12l+1 = σ2

wE (u1,u2)T∼N (0,Σl)φ(u1)φ(u2). (5.22)

Note that(

h1,il+1

h2,il+1

)

=

nl∑

j=1

W ijl

(

x1,jl

x2,jl

)

. (5.23)

Here we have a sum of nl i.i.d. random vectors with zero mean, and the covariance matrix of the sum is Σl+1.Hence by the multivariate CLT, (h1,il+1, h

2,il+1)

T converges weakly to N (0,Σl+1) as n1:l → ∞ sequentially.Similarly, for any k ≥ 1

h1,il+1

. . .

hk,il+1

=

nl∑

j=1

W ijl

x1,jl

. . .

xk,jl

. (5.24)

Again, these vectors converge to a gaussian by the multivariate CLT. Hence ∀l ∈ [L+ 1] hil(·) converges weakly toa gaussian process as n1:l−1 → ∞ sequentially. Note that a gaussian process is completely defined by its mean andcovariance functions:

Σl(x, x′) =

(

ql(x, x) ql(x, x′)

ql(x′, x) ql(x

′, x′)

)

∀l ∈ [L+ 1]; (5.25)

ql+1(x, x′) = σ2

wE (u,v)T∼N (0,Σl(x,x′))φ(u)φ(v) ∀l ∈ [L], q1(x, x′) =

σ2w

n0xTx′. (5.26)

Hence the model at initialization converges to a gaussian process with zero mean and covariance ΣL+1(·, ·). ThisGP is referred as NNGP, and qL+1 — as NNGP kernel.

5.1.3 The kernel diverges at initialization

For a fixed x, let us define the following quantity:

Bil =

∂f i

∂hl∈ R

nl ∀l ∈ [L+ 1]. (5.27)

We have then:Bi

l = DlWTl B

il+1 ∀l ∈ [L], Bij

L+1 = δij , (5.28)

where Dl = diag(φ′(hl)). This gives:∇Wl

f i(x) = Bil+1x

Tl ∈ R

nl+1×nl . (5.29)

50

Page 53: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Define the scaled covariance for Bl components for two inputs:

βijl (x, x′) = EBi,T

l B′,jl = EBi,T

l+1WlDlD′lW

Tl B

′,jl+1 =

σ2w

nlE tr(DlD

′l)(B

i,Tl+1B

′,jl+1) =

= σ2wβ

ijl+1(x, x

′)E (u,v)T ∼N (0,Σl(x,x′))φ′(u)φ′(v) ∀l ∈ [L− 1], (5.30)

βijL = EBi,T

L B′,jL = EBi,T

L+1WLDLD′LW

TLB

′,jL+1 =

σ2w

nLE tr(DLD

′L)δij = σ2

wE (u,v)T ∼N (0,ΣL(x,x′))φ′(u)φ′(v)δij .

(5.31)Note that βij

l = βlδij . Similarly to ql, define the following:

χl(x, x′) = σ2

wE (u,v)T ∼N (0,Σl(x,x′))φ′(u)φ′(v). (5.32)

This allows us to write:

βl(x, x′) =

L∏

l′=l

χl′(x, x′) ∀l ∈ [L]. (5.33)

In the case of non-scalar output (k > 1), the tangent kernel is a k × k matrix with components defined as:

Θij(x, x′) = ∇Tθ f

i(x)∇θfj(x′). (5.34)

For the sake of convenience, we introduce layer-wise tangent kernels:

Θijl (x, x

′) = tr(∇TWlf i(x)∇Wl

f j(x′)). (5.35)

In this case Θ(x, x′) =∑L

l=0 Θl(x, x′).

We denote Bl and hl evaluated at x′ by B′l and h′l, respectively. This allows us to write:

Θijl (x, x

′) = tr(

φ(hl)Bi,Tl+1B

′,jl+1φ(h

′l)T)

=(

φ(h′l)Tφ(hl)

)

(

Bi,Tl+1B

′,jl+1

)

∀l ∈ [L]. (5.36)

If we assume that the two scalar products are independent then the expected kernel is a product of expectations:

E Θijl (x, x

′) = E(

φ(h′l)Tφ(hl)

)

E

(

Bi,Tl+1B

′,jl+1

)

=nlql+1(x, x

′)σ2w

βl+1(x, x′)δij . ∀l ∈ [L]. (5.37)

Hence (a) each kernel is diagonal, (b) l-th kernel expectation diverges as nl → ∞ ∀l ∈ [L].

5.1.4 NTK parameterization

It is possible to leverage the kernel divergence by altering the network parameterization:

h1 =σw√n0W0x ∈ R

n1 , xl = φ(hl) ∈ Rnl , hl+1 =

σw√nlWlxl ∈ R

nl+1 ∀l ∈ [L]. (5.38)

In this case, the weights are standard gaussians:

W ijl ∼ N (0, 1) . (5.39)

Bil =

∂f i

∂hl∈ R

nl ∀l ∈ [L+ 1]. (5.40)

We have then:Bi

l =σw√nlDlW

Tl B

il+1 ∀l ∈ [L], Bij

L+1 = δij , (5.41)

Both forward and backward dynamics at initialization remains unchanged. What changes are the gradients wrtweights:

∇Wlf i(x) =

σw√nlBi

l+1xTl . (5.42)

51

Page 54: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

This results in a change of the tangent kernel scaling:

E Θijl (x, x

′) =σ2w

nlE(

φ(h′l)Tφ(hl)

)

E

(

Bi,Tl+1B

′,jl+1

)

= ql+1(x, x′)βl+1(x, x

′)δij ∀l ∈ [L]. (5.43)

Now the kernel expectation neither diverges nor vanishes as n→ ∞. Since the expectation is finite, the kernel itselfconverges to it as n→ ∞. Indeed, consider the l-th kernel:

Θijl (x, x

′) =σ2w

nl

(

φ(h′l)Tφ(hl)

)

(

Bi,Tl+1B

′,jl+1

)

. (5.44)

The first multiplier converges to ql+1(x, x′) due to the Law of Large Numbers. Similar holds for the second one: it

converges to βl+1(x, x′)δij by the LLN. Together these two give:

plimnl→∞

. . . plimn1→∞

Θijl (x, x

′) = E Θijl (x, x

′) = ql+1(x, x′)βl+1(x, x

′)δij ∀l ∈ [L]. (5.45)

And for the whole kernel, we have:

plimnL→∞

. . . plimn1→∞

Θij(x, x′) = E Θij(x, x′) =L+1∑

l=1

ql(x, x′)βl(x, x

′)δij =L+1∑

l=1

(

ql(x, x′)

L∏

l′=l

χl′(x, x′)

)

δij . (5.46)

See [Arora et al., 2019b] for the above expression for the expected kernel, and [Jacot et al., 2018] for the formalproof of convergence in subsequent limits. See also [Arora et al., 2019b] for a convergence proof in stronger terms.

5.1.5 GD training and posterior inference in gaussian processes

Denote the limit kernel at initialization by Θ0:

Θ0(x, x′) =

L+1∑

l=1

(

ql(x, x′)

L∏

l′=l

χl′(x, x′)

)

Ik×k. (5.47)

Unlike the empirical kernel, the limit one is deterministic. Similarly to Section 5.1.1, we assume that ~x is a trainingset of size n, and k = 1. Then let Θ0(~x, ~x) ∈ Rn×n be a Gramian for the limit kernel.

Given (a) the kernel has a deterministic limit, and (b) the model at initialization converges to a limit model,the model trained to minimize square loss converges to the following limit model at any time t:

lim flin,t(x) = lim f0(x) −Θ0(x, ~x)Θ−10 (~x, ~x)(I − e−ηΘ0(~x,~x)t/n)(lim f0(~x)− ~y). (5.48)

Looking at this expression we notice that since the limit model at initialization is a gaussian process (see Sec-tion 5.1.2), the limit model is a gaussian process at any time t. Its mean and covariance are given as follows:

µlin,t(x) = Θ0(x, ~x)Θ−10 (~x, ~x)(I − e−ηΘ0(~x,~x)t/n)~y; (5.49)

qlin,t(x, x′) = qL+1(x, x

′)−−(

Θ0(x′, ~x)Θ−1

0 (~x, ~x)(I − e−ηΘ0(~x,~x)t/n)qL+1(~x, x) + Θ0(x, ~x)Θ−10 (~x, ~x)(I − e−ηΘ0(~x,~x)t/n)qL+1(~x, x

′))

+

+Θ0(x, ~x)Θ−10 (~x, ~x)(I − e−ηΘ0(~x,~x)t/n)qL+1(~x, ~x)(I − e−ηΘ0(~x,~x)t/n)Θ−1

0 (~x, ~x)Θ0(~x, x′). (5.50)

Assume that the limit kernel is bounded away from zero: λmin(Θ0(~x, ~x)) ≥ λ0 > 0. Given this, the modelconverges to the following limit GP as t→ ∞:

µlin,∞(x) = Θ0(x, ~x)Θ−10 (~x, ~x)~y; (5.51)

qlin,∞(x, x′) = qL+1(x, x′) + Θ0(x, ~x)Θ

−10 (~x, ~x)qL+1(~x, ~x)Θ

−10 (~x, ~x)Θ0(~x, x

′)−−(

Θ0(x′, ~x)Θ−1

0 (~x, ~x)qL+1(~x, x) + Θ0(x, ~x)Θ−10 (~x, ~x)qL+1(~x, x

′))

. (5.52)

52

Page 55: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Note that the exact bayesian posterior inference gives a different result:

µlin(x | ~x) = qL+1(x, ~x)q−1L+1(~x, ~x)~y; (5.53)

qlin(x, x′ | ~x) = qL+1(x, x

′)− qL+1(x, ~x)q−1L+1(~x, ~x)qL+1(~x, x

′). (5.54)

Nevertheless, if we consider training only the last layer of the network, the tangent kernel becomes:

Θ(x, x′) = ΘL(x, x′) = qL+1(x, x

′). (5.55)

Given this, the two GPs, result of NN training and exact posterior, coincide.Let us return to the assumption of positive defniteness of the limit kernel. [Du et al., 2019] proved that if no

inputs are parallel, this assumption holds:

Theorem 17. If for any i 6= j xTi xj < ‖xi‖2‖xj‖2 then λ0 := λmin(Θ0(~x, ~x)) > 0.

5.2 Stationarity of the kernel

Assume k = 1; in this case NTK is scalar-valued. For analytic activation function φ we have the following:

E θ(Θt(x1, x2)−Θ0(x1, x2)) =

∞∑

k=1

(

E θ

(

dkΘt(x1, x2)

dtk

t=0

)

tk

k!

)

. (5.56)

Hence if we show that all derivatives of the NTK at t = 0 vanish as n → ∞, this would mean that the NTK doesnot evolve with time for large n: Θt(x, x

′) → Θ0(x, x′) as n→ ∞.

Let us consider l2-loss: ℓ(y, z) = 12 (y − z)2. Consider the first derivative:

E θ

(

dΘt(x1, x2)

dt

t=0

)

= E θ

(

d(∇Tθ ft(x1)∇θft(x2))

dt

t=0

)

= E θ

((

θTt ∇θ∇Tθ ft(x1)∇θft(x2) + (x1 ↔ x2)

)∣

t=0

)

=

= E x,yE θ(η(y − f0(x))∇Tθ f0(x)∇θ∇T

θ f0(x1)∇θf0(x2) + (x1 ↔ x2)). (5.57)

We shall show that it is O(n−1), and that it also implies that all higher-order derivatives are O(n−1) too.From now on, we shall consider only initialization: t = 0; for this reason, we shall omit the subscript 0. Following

[Dyer and Gur-Ari, 2020], we start with a definition of a correlation function. Define a rank-k derivative tensorTµ1...µk

as follows:

Tµ1...µk(x; f) =

∂kf(x)

∂θµ1 . . . ∂θµk. (5.58)

For k = 0 we define T (x; f) = f(x). We are now ready to define the correlation function C:

C(x1, . . . , xm) =∑

µ1,...,µkm

∆(π)µ1...µkm

E θ

(

Tµ1...µk1(x1)Tµk1+1...µk2

(x2) . . . Tµkm−1+1...µkm(xm)

)

. (5.59)

Here 0 ≤ k1 ≤ . . . ≤ km, km andm are even, π ∈ Skmis a permutation, and ∆

(π)µ1...µkm

= δµπ(1)µπ(2). . . δµπ(km−1)µπ(km)

.For example,

E θ(f(x)∇Tθ f(x)∇θ∇T

θ f(x1)∇θf(x2)) =∑

µ,ν

E θ(f(x)∂µf(x)∂2µ,νf(x1)∂νf(x2)) =

=∑

µ1,µ2,µ3,µ4

δµ1µ2δµ3µ4E θ(f(x)∂µ1f(x)∂2µ2,µ3

f(x1)∂µ4f(x2)) = C(x, x, x1, x2) (5.60)

is a correlation function with m = 4, k1 = 0, k2 = 1, k3 = 3, k4 = 4, and π(j) = j. Moreover, E θ((f(x) −y)∇T

θ f(x)∇θ∇Tθ f(x1)∇θf(x2)) is a correlation function too: consider fy(x) = f(x)− y instead of f(x). Hence the

whole (5.57) is a linear combination of correlation functions.If two derivative tensors have two indices that are summed over, we shall say that they are contracted. Formally,

we shall say that Tµki−1+1...µki(xi) is contracted with Tµkj−1+1...µkj

(xj) for 1 ≤ i, j ≤ m, if there exists an even

s ≤ km such that ki−1 < π(s− 1) ≤ ki, while kj−1 < π(s) ≤ kj , or vice versa.Define the cluster graph GC(V,E) as a non-oriented non-weighted graph with vertices V = v1, . . . , vm and

edges E = (vi, vj) |T (xi) and T (xj) are contracted in C. Let ne be the number of even-sized connected compo-nents of GC(V,E) and no be the number of odd-sized components.

53

Page 56: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Conjecture 1 ([Dyer and Gur-Ari, 2020]). If m is even, C(x1, . . . , xm) = On→∞(nsC ), where sC = ne+no/2−m/2.If m is odd, C(x1, . . . , xm) = 0.

Applying this conjecture to (5.60) gives C(x, x, x1, x2) = O(n−1) (ne = 0, n0 = 2, m = 4), hence the wholeeq. (5.57) is O(n−1).

Let us show that having the first derivative of the NTK being O(n−1) implies all higher-order derivatives to beO(n−1).

Lemma 12 ([Dyer and Gur-Ari, 2020]). Suppose Conjecture 1 holds. Let C(~x) = E θF (~x; θ) be a correlationfunction and suppose C(~x) = O(nsC ) for sC defined in Conjecture 1. Then E θd

kF (~x; θ)/dtk = O(nsC ) ∀k ≥ 1.

Proof. Consider the first derivative:

E θdF (~x)

dt= E θ(θ

T∇θF (~x)) = E x,yE θ(η(y − f(x))∇Tθ f(x)∇θF (~x)) =

= ηE x,yE θ(y∇Tθ f(x)∇θF (~x))− ηE x,yE θ(f(x)∇T

θ f(x)∇θF (~x)). (5.61)

This is a sum of linear combination of correlation functions. By Conjecture 1, the first sum evaluates to zero, whilethe second one has m′ = m+2, n′

e even clusters, and n′o odd clusters. If ∇θf(x) is contracted with an even cluster

of C, we have n′e = ne − 1, n′

o = no + 2. In contrast, if ∇θf(x) is contracted with an odd cluster of C, we haven′e = ne + 1, n′

o = no.In the first case we have s′C = n′

e + n′o/2−m′/2 = sC − 1, while for the second s′C = sC . In any case, the result

is a linear combination of correlation functions with s′C ≤ sC for each.

5.2.1 Finite width corrections for the NTK

Let us define O1,t(x) = ft(x) and for s ≥ 2

Os,t(x1, . . . , xs) = ∇Tθ Os−1,t(x1, . . . , xs−1)∇θft(xs). (5.62)

In this case O2,t(x1, x2) is the empirical kernel Θt(x1, x2). Note that Os,t evolves as follows:

Os,t(x1, . . . , xs) = ηE x,y(y − ft(x))∇Tθ ft(x)∇θOs,t(x1, . . . , xs) = ηE x,y(y − ft(x))Os+1,t(x1, . . . , xs, x). (5.63)

Since Os has s derivative tensors and a single cluster, by the virtue of Conjecture 1, E θOs,0 = O(n1−s/2)

for even s and E θOs,0 = 0 for odd s. At the same time, E θOs,0 = O(n1−(s+2)/2) = O(n−s/2) for even s and

E θOs,0 = O(n1−(s+1)/2) = O(n1/2−s/2) for odd s.As for the second moments, we have E θ(Os,0)

2 = O(n2−s) for even s and E θ(Os,0)2 = O(n1−s) for odd s.

Similarly, we have E θ(Os,0)2 = O(n2/2−(2s+2)/2) = O(n−s) for even s and E θ(Os,0)

2 = O(n2−(2s+2)/2) = O(n1−s)for odd s.

The asymptotics for the first two moments implies the asymptotic for a random variable itself:

Os,0(x1:s) =

O(n1−s/2) for even s;

O(n1/2−s/2) for odd s;Os,0(x1:s) =

O(n−s/2) for even s;

O(n1/2−s/2) for odd s.(5.64)

Lemma 12 gives ∀k ≥ 1:

dkOs,t

dtk(x1:s)

t=0

=

O(n−s/2) for even s;

O(n1/2−s/2) for odd s.(5.65)

Then given an analytic activation function, we have ∀t ≥ 0:

Os,t(x1:s) =

∞∑

k=1

dkOs,t

dtk(x1:s)

t=0

tk

k!=

O(n−s/2) for even s;

O(n1/2−s/2) for odd s.(5.66)

This allows us to write a finite system of ODE for the model evolution up to O(n−1) terms:

ft(x1) = ηE x,y(y − ft(x))Θt(x1, x), f0(x1) = f(x1; θ), θ ∼ N (0, I), (5.67)

54

Page 57: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Θt(x1, x2) = ηE x,y(y − ft(x))O3,t(x1, x2, x), Θ0(x1, x2) = ∇Tθ f0(x1)∇θf0(x2), (5.68)

O3,t(x1, x2, x3) = ηE x,y(y − ft(x))O4,t(x1, x2, x3, x), O3,0(x1, x2, x3) = ∇Tθ Θ0(x1, x2)∇θf0(x3), (5.69)

O4,t(x1, x2, x3, x4) = O(n−2), O4,0(x1, x2, x3, x4) = ∇Tθ O3,0(x1, x2, x3)∇θf0(x4). (5.70)

Let us expand all the quantities wrt n−1:

Os,t(x1:s) = O(0)s,t (x1:s) + n−1O

(1)s,t (x1:s) +O(n−2), (5.71)

where O(k)s,t (x1:s) = Θn→∞(1). Then the system above transforms into the following:

f(0)t (x1) = ηE x,y(y − f

(0)t (x))Θ

(0)t (x1, x), lim

n→∞f(x1; θ0), (5.72)

f(1)t (x1) = ηE x,y((y − f

(0)t (x))Θ

(1)t (x1, x)− f

(1)t (x)Θ

(0)t (x1, x)), (5.73)

Θ(0)t (x1, x2) = ∇T

θ f(0)0 (x1)∇θf

(0)0 (x2), (5.74)

Θ(1)t (x1, x2) = ηE x,y(y − f

(0)t (x))O

(1)3,t (x1, x2, x), (5.75)

O(1)3,t (x1, x2, x3) = ηE x,y(y − f

(0)t (x))O

(1)4,t (x1, x2, x3, x), (5.76)

O(1)4,t (x1, x2, x3, x4) = ∇T

θ O(0)3,0(x1, x2, x3)∇θf

(0)0 (x4), (5.77)

where we have ignored the initial conditions for the time being. Integrating this system is straightforward:

f(0)t (~x) = ~y + e−ηΘ

(0)0 (~x,~x)t/n(f

(0)0 (~x)− ~y), (5.78)

where ~x is a train dataset of size n. For the sake of brevity, let us introduce the following definition:

∆f(0)t (x) = e−ηΘ

(0)0 (x,~x)t/n(f

(0)0 (~x)− ~y). (5.79)

This gives:

O(1)3,t (x1, x2, x3) = O

(1)3,0(x1, x2, x3)− ηE x′,y′

∫ t

0

O(1)4,0(x1, x2, x3, x

′)∆f (0)t′ (x′) dt′. (5.80)

Θ(1)t (x1, x2) = Θ

(1)0 (x1, x2)− ηE x′,y′

∫ t

0

O(1)3,0(x1, x2, x)∆f

(0)t′ (x′) dt′+

+ η2E x′′,y′′E x′,y′

∫ t

0

∫ t′′

0

O(1)4,0(x1, x2, x

′′, x′)∆f (0)t′ (x′)∆f (0)

t′′ (x′′) dt′ dt′′. (5.81)

f(1)t (x1) = −ηE x,y

(

∆f(0)t (x)Θ

(1)t (x1, x) + f

(1)t (x)Θ

(0)t (x1, x)

)

, (5.82)

f(t) = Af(t) + g(t), f(0) = f0; (5.83)

f(t) = C(t)eAt; C(t)eAt + C(t)AeAt = C(t)AeAt + g(t); C(t) = g(t)e−At. (5.84)

f(1)t (~x1) = e−ηΘ

(0)0 (~x1,~x)t/nCt(~x); (5.85)

Ct(~x) = −ηE x′,y′eηΘ(0)0 (~x,~x1)t/nΘ

(1)t (~x1, x

′)∆f (0)t (x′); (5.86)

Ct(~x) = f(1)0 (~x)− ηE x′,y′

∫ t

0

eηΘ(0)0 (~x,~x1)t

′/nΘ(1)t (~x1, x

′)∆f (0)t′ (x′) dt′; (5.87)

f(1)t (~x1) = e−ηΘ

(0)0 (~x1,~x)t/nf

(1)0 (~x)− ηe−ηΘ

(0)0 (~x1,~x2)t/nE x′,y′

∫ t

0

eηΘ(0)0 (~x2,~x3)t

′/nΘ(1)t (~x3, x

′)∆f (0)t′ (x′) dt′. (5.88)

55

Page 58: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

5.2.2 Proof of Conjecture 1 for linear nets

Shallow nets.

We first consider shallow linear nets:

f(x) =1√naTWx. (5.89)

We shall use the following theorem:

Theorem 18 ([Isserlis, 1918]). Let z = (z1, . . . , zl) be a centered multivariate Gaussian variable. Then, for anypositive k, for any ordered set of indices i1:2k,

E z(zi1 · · · zi2k) =1

2kk!

π∈S2k

E (ziπ(1)ziπ(2)

) · · ·E (ziπ(m−1)ziπ(m)

) =∑

p∈P 22k

j1,j2∈p

E (zij1 zij2 ), (5.90)

where P 22k is a set of all unordered pairings p of a 2k-element set, i.e.

P 22k =

π∈S2k

π(1), π(2), . . . , π(2k − 1), π(2k). (5.91)

At the same time,E z(zi1 · · · zi2k−1

) = 0. (5.92)

For example,

E z(z1z2z3z4) = E z(z1z2)E z(z3z4) + E z(z1z3)E z(z2z4) + E z(z1z4)E z(z2z3). (5.93)

Consider a correlation function without derivatives:

C(x1, . . . , xm) = E θ(f(x1) . . . f(xm)) = n−m/2E θ(ai1W

i1x1 . . . aimWimxm) =

= n−m/2E θ(ai1 · · · aim)E θ(W

i1x1 · · ·W imxm) =

= n−m/2[m = 2k]

pa∈P 22k

ja1 ,ja2 ∈pa

δija1ija

2

pw∈P 22k

jw1 ,jw2 ∈pw

δijw

1ijw

2 xTjw1 xjw2

. (5.94)

For even m, we shall associate a graph γ with each pair (pa, pw). Such a graph has m vertices (v1, . . . , vm). For anyja1 , ja2 ∈ pa there is an edge (vja1 , vja2 ) marked a, and an edge (vjw1 , vjw2 ) marked W for any jw1 , jw2 ∈ pw. Henceeach vertex has a unique a-neighbor and a unique W -neighbor; these two can be the same vertex. Hence γ is aunion of cycles. We call γ a Feynman diagram of C.

Denote by Γ(C) a set of Feynman diagrams of C, and by lγ a number of cycles in the diagram γ. It is easy tonotice that each cycle contributes a factor of n when one takes a sum over i1, . . . , im. Hence we have:

C(x1, . . . , xm) = n−m/2[m = 2k]∑

pa,pw

nlγ(pa,pw)

jw1 ,jw2 ∈pw

xTjw1 xjw2

= [m = 2k]On→∞(nmaxγ∈Γ(C) lγ−m/2).

(5.95)Consider now a correlation function with derivatives. Assume there is an edge (vi, vj) inGC ; hence corresponding

derivative tensors are contracted in C. In this case, we should consider only those Feynman diagrams γ that havean edge (vi, vj), either of a or of w type. Denoting a set of such diagrams as Γ(C), we get the same bound as before:

C(x1, . . . , xm) = [m = 2k]On→∞(nmaxγ∈Γ(C) lγ−m/2). (5.96)

In order to illustrate this principle, let us consider the case m = 4. For simplicity, assume also all inputs to beequal: x1 = . . . = x4 = x. If there are no derivatives, we have:

E θ((f(x))4) = n−2(δi1i2δi3i4 + δi1i3δi2i4 + δi1i4δi2i3)(δ

i1i2δi3i4 + δi1i3δi2i4 + δi1i4δi2i3)(xTx)2 = (3 + 6n−1)(xTx)2.(5.97)

56

Page 59: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

In this case there are three Feynman diagrams with two cycles each, and six diagrams with a single cycle. Let usintroduce a couple of contracted derivative tensors:

E θ((f(x))2∇T

θ f(x)∇θf(x)) = n−2E θ(ai1W

i1xai2Wi2x(δi3kW

i3xδklδi4lWi4x+ ai3δ

i3kδklai4δi4lxTx)) =

= n−2E θ(ai1W

i1xai2Wi2x(δi3i4W

i3xW i4x+ ai3δi3i4ai4x

Tx)) =

= n−2(δi1i2δi3i4)(δi1i2δi3i4 + δi1i3δi2i4 + δi1i4δi2i3)(xT x)2 +n−2(δi1i2δi3i4 + δi1i3δi2i4 + δi1i4δi2i3)(δ

i1i2δi3i4)(xTx)2 =

= (2 + 4n−1)(xTx)2. (5.98)

Here we have only those Feynman diagrams that have an edge (v3, v4). There are two such diagrams with twocycles each, and four with a single cycle.

Note that if there is an edge in a cluster graph GC , there is also an edge, a or w type, in each γ from Γ(C).Note also that each cycle in γ consists of even number of edges. Hence each cycle consists of even clusters and aneven number of odd clusters. Hence there could be at most ne + no/2 cycles in γ, which proves Conjecture 1 forshallow linear nets:

C(x1, . . . , xm) = [m = 2k]On→∞(nmaxγ∈Γ(C) lγ−m/2) = [m = 2k]On→∞(nne+no/2−m/2). (5.99)

Deep nets.

In the case of a network with L hidden layers, there are L + 1 edges of types W0, . . .WL adjacent to each node.Feynman diagrams are still well-defined, however, it is not obvious how to define the number of loops in this case.

The correct way to do it is to count the loops in a corresponding double-line diagram. Given a Feynman diagramγ, define the double-line diagram DL(γ) as follows:

• Each vertex vi of γ maps to L vertices v(1)i , . . . , v

(L)i in DL(γ).

• An edge (vi, vj) of type W0 maps to an edge (v(1)i , v

(1)j ).

• An edge (vi, vj) of type WL maps to an edge (v(L)i , v

(L)j ).

• ∀l ∈ [L− 1] an edge (vi, vj) of type Wl maps to a pair of edges: (v(l)i , v

(l)j ) and (v

(l+1)i , v

(l+1)j ).

We see that each of the Lm vertices of a double-line diagram has degree 2; hence the number of loops is well-defined.For L = 1, a double-line diagram recovers the corresponding Feynman diagram without edge types. For any L, wehave the following:

C(x1, . . . , xm) = [m = 2k]On→∞(nmaxγ∈Γ(C) lγ−Lm/2), (5.100)

where now lγ is a number of loops in DL(γ).In order to get intuition about this result, let us consider a network with two hidden layers. For the sake of

simplicity, assume x is a scalar:

f(x) =1

naTWvx. (5.101)

E θ(f(x1)f(x2)) = n−2E θ(ai1W

i1j1vj1x1ai2Wi2j2vj2x2) = n−2δi1i2δ

i1i2δj1j2δj1j2x1x2 = x1x2. (5.102)

Here both a and v result in a single Kronecker delta, hence they correspond to a single edge in a double-line diagram.At the same time, W results in a product of two deltas, in its turn resulting in a pair of edges in the diagram.

Similar to the case of L = 1, contracted derivative tensors force the existence of corresponding edges in theFeynman diagram. Given a Feynman diagram γ, define sγ = lγ − Lm/2, or, in other words, a number of loops inDL(γ) minus a number of vertices in DL(γ) halved. Let cγ be a number of connected components of γ. We shallprove that

sγ ≤ cγ − m

2. (5.103)

Note that eq. (5.103) holds for L = 1 since all connected components of γ are loops in this case. Let us expressγ as a union of its connected components γ′; given this, sγ =

γ′ sγ′ . We are going to show that sγ′ ≤ 1 −m′/2,where m′ is a number of vertices in the component γ′. The latter will imply sγ ≤ cγ −m/2.

57

Page 60: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Let v, e, and f be a number of vertices, a number of edges, and a number faces of γ′. We already know thatv = m′, e = (L + 1)m′/2, and f = lγ′ . Hence sγ′ = lγ′ − Lm′/2 = f − Lv/2. On the other hand, the Eulercharacteristic of γ′ is χ = v − e+ f = sγ′ +m′(1 + L/2)− (L+ 1)m′/2 = sγ′ +m′/2. Since γ′ is a triangulation ofa planar surface with at least one boundary, χ ≤ 1. Hence sγ′ ≤ 1−m′/2, which was required.

Consequently, we may rewrite (5.100) as follows:

C(x1, . . . , xm) = [m = 2k]On→∞(nmaxγ∈Γ(C) cγ−m/2). (5.104)

It is now easy to conclude that cγ ≤ ne + no/2. Indeed, each connected component of γ consists of connectedcomponents of the cluster graph GC . Hence cγ ≤ ne + no. Moreover, each connected component of γ consists ofeven number of vertices, hence it can contain only even number of odd connected components of GC . This givescγ ≤ ne + no/2, which is required.

5.3 GD convergence via kernel stability

Recall the model prediction dynamics (eq. (5.3)):

ft(x′) = −ηE x,y

∂ℓ(y, z)

∂z

z=ft(x)

Θt(x′, x). (5.105)

On the train dataset (~x, ~y) we have the following:

ft(~x) = − η

mΘt(~x, ~x)

∂ℓ(~y, ~z)

∂~z

~z=ft(~x)

. (5.106)

For the special case of square loss:

ft(~x) =η

mΘt(~x, ~x)(~y − ft(~x)). (5.107)

Let us consider the evolution of a loss:

d

dt

(

1

2‖~y − ft(~x)‖22

)

= − η

m(~y − ft(~x))

T Θt(~x, ~x)(~y − ft(~x)) ≤ − η

mλmin(Θt(~x, ~x))‖~y − ft(~x)‖22 (5.108)

Consider λmin ≥ 0 such that ∀t ≥ 0 λmin(Θt(~x, ~x)) ≥ λmin. This allows us to solve the differential inequality:

‖~y − ft(~x)‖22 ≤ e−2ηλmint/m‖~y − f0(~x)‖22. (5.109)

Hence having λmin > 0 ensures that the gradient descent converges to a zero-loss solution. There is a theoremthat guarantees that the least eigenvalue of the kernel stays separated away from zero for wide-enough NTK-parameterized two-layered networks with ReLU activation:

Theorem 19 ([Du et al., 2019]). Consider the following model:

f(x; a1:n, w1:n) =1√n

n∑

i=1

ai[wTi x]+. (5.110)

Assume we aim to minimize the square loss on the dataset (~x, ~y) of size m via a gradient descent on the inputweights:

wi(t) =1√n

m∑

k=1

(yk − f(xk; a1:n, w1:n(t)))ai[wTi (t)xk > 0]xk, wi(0) ∼ N (0, In0), ai ∼ U(−1, 1) ∀i ∈ [n].

(5.111)Assume also that ‖xk‖2 ≤ 1 and |yk| < 1 ∀k ∈ [m]. Let H∞ be an expected gramian of the NTK at initializationand let λ0 be its least eigenvalue:

H∞kl = Ew∼N (0,In0)

[wTxk > 0][wTxl > 0]xTk xl, λ0 = λmin(H∞). (5.112)

58

Page 61: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Then ∀δ ∈ (0, 1) taking

n > 28 max

(

2532

π

m6

λ40δ3,m2

λ20log

(

2m

δ

))

= Ω

(

m6

λ40δ3

)

(5.113)

guarantees that w.p. ≥ 1− δ over initialization we have an exponential convergence to a zero-loss solution:

‖~y − ft(~x)‖22 ≤ e−λ0t‖~y − f0(~x)‖22. (5.114)

Proof. From what was shown above, it suffices to show that λmin(H(t)) ≥ λ0/2 with given probability for nsufficiently large, where H(t) is a gram matrix of the NTK at time t:

Hkl(t) =1

n

n∑

i=1

[wTi (t)xk > 0][wT

i (t)xl > 0]xTk xl. (5.115)

We shall first show that H(0) ≥ 3λ0/4:

Lemma 13. ∀δ ∈ (0, 1) taking n ≥ 128m2λ−20 log(m/δ) guarantees that w.p. ≥ 1 − δ over initialization we have

‖H(0)−H∞‖2 ≤ λ0/4 and λmin(H(0)) ≥ 3λ0/4.

Next, we shall show that the initial Gram matrix H(0) is stable with respect to initial weights w1:n(0):

Lemma 14. ∀δ ∈ (0, 1) w.p. ≥ 1 − δ over initialization for any set of weights w1:n that satisfy ∀i ∈ [n] ‖wi(0) −wi‖2 ≤ R(δ) := (

√2π/16)δλ0m

−2, the corresponding Gram matrix H satisfies ‖H −H(0)‖ < λ0/4 and λmin(H) >λ0/2.

After that, we shall show that lower bounded eigenvalues of the Gram matrix gives exponential convergence onthe train set. Moreover, weights stay close to initialization, as the following lemma states:

Lemma 15. Suppose for s ∈ [0, t] λmin(H(s)) ≥ λ0/2. Then we have ‖~y − ft(~x)‖22 ≤ e−λ0t‖~y − f0(~x)‖22 and forany i ∈ [n] ‖wi(t)− wi(0)‖2 ≤ R′ := (2/λ0)

(m/n)‖~y − f0(~x)‖2.Finally, we shall show that when R′ < R(δ), the conditions of Lemma 14 and of Lemma 15 hold ∀t ≥ 0

simultaneously:

Lemma 16. Let δ ∈ (0, 1/3). If R′ < R(δ), then w.p. ≥ 1 − 3δ over initialization ∀t ≥ 0 λmin(H(t)) ≥ λ0/2 and∀i ∈ [n] ‖wi(t)− wi(0)‖2 ≤ R′ and ‖~y − ft(~x)‖22 ≤ e−λ0t‖~y − f0(~x)‖22.

Hence for δ ∈ (0, 1), R′ < R(δ/3) suffices for the theorem to hold:

2√m‖~y − f0(~x)‖2λ0

√n

= R′ < R(δ/3) =

√2πδλ048m2

, (5.116)

which is equivalent to:

n >2932m5‖~y − f0(~x)‖22

πλ40δ2

. (5.117)

We further bound:E ‖~y − f0(~x)‖22 = E ‖~y‖22 − 2~yTE f0(~x) + E ‖f0(~x)‖22 ≤ 2m. (5.118)

Hence by Markov’s inequality, w.p. ≥ 1− δ

‖~y − f0(~x)‖22 ≤ E ‖~y − f0(~x)‖22δ

≤ 2m

δ. (5.119)

By a union bound, in order to have the desired properties w.p. ≥ 1− 2δ, we need:

n >21032m6

πλ40δ3. (5.120)

If we want the things hold w.p. ≥ 1− δ, noting Lemma 13, we finally need the following:

n > max

(

(21332/π)m6

λ40δ3, 28

m2

λ20log

(

2m

δ

))

. (5.121)

59

Page 62: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Let us prove the lemmas.

Proof of Lemma 13. Since all Hkl(0) are independent random variables, we can apply Hoeffding’s inequality foreach of them independently:

P(|Hkl(0)−H∞kl | ≥ ǫ) ≤ 2e−nǫ2/2. (5.122)

For a given δ, take ǫ such that δ = 2e−nǫ2/2. This gives ǫ =√

−2 log(δ/2)/n, or,

|Hkl(0)−H∞kl | ≤

2√

log(1/δ)√n

w.p. ≥ 1− δ over initialization. (5.123)

Applying a union bound gives:

|Hkl(0)−H∞kl | ≤

2√

log(m2/δ)√n

≤√

8 log(m/δ)

n∀k, l ∈ [m] w.p. ≥ 1− δ over initialization. (5.124)

Hence

‖H(0)−H∞‖22 ≤ ‖H(0)−H∞‖2F ≤m∑

k,l=1

|Hkl(0)−H∞kl |2 ≤ 8m2 log(m/δ)

n. (5.125)

In order to get ‖H(0)−H∞‖2 ≤ λ0/4, we need to solve:

8m2 log(m/δ)

n≤ λ0

4. (5.126)

This gives:

n ≥ 128m2 log(m/δ)

λ20. (5.127)

This gives ‖H(0)−H∞‖2 ≤ λ0/4, which implies:

λmin(H(0)) = λmin(H∞ + (H(0)−H∞)) ≥ λmin(H

∞)− λmax(H(0)−H∞) ≥ λ0 − λ0/4 = 3λ0/4. (5.128)

Proof of Lemma 14. We define the event in the space of wi(0) realizations:

Aki = ∃w : ‖w − wi(0)‖2 ≤ R, [wTxk ≥ 0] 6= [wTi (0)xk ≥ 0]. (5.129)

When Aki holds, we can always take w = w′ki, where w′

ki is defined as follows:

w′ki =

wi(0)−Rxk, if wTi (0)xk ≥ 0

wi(0) +Rxk, if wTi (0)xk < 0.

(5.130)

Hence Aki is equivalent to the following:

A′ki = [w′,T

ki xk ≥ 0] 6= [wTi (0)xk ≥ 0]. (5.131)

This event holds iff |wTi (0)xk| < R. Since wi(0) ∼ N (0, I), we have

P(Aki) = P(A′ki) = Pz∼N (0,1)|z| < R ≤ 2R√

2π. (5.132)

We can bound the entry-wise deviation of H ′ from the H(0) matrix:

E |Hkl(0)−H ′kl| = E

(

1

n

xTk xl

n∑

i=1

(

[wTi (0)xk > 0][wT

i (0)xl > 0]− [w′,Tki xk > 0][w′,T

li xl > 0])

)

≤ 1

n

n∑

i=1

E [A′ki ∪ A′

li] ≤4R√2π. (5.133)

60

Page 63: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Hence E∑m

k,l=1 |Hkl(0)−H ′kl| ≤ 4m2R/

√2π. Hence by Markov’s inequality,

m∑

k,l=1

|Hkl(0)−H ′kl| ≤

4m2R√2πδ

w.p. ≥ 1− δ over initialization. (5.134)

Since ‖H(0)−H ′‖2 ≤ ‖H(0)−H ′‖F ≤∑mk,l=1 |Hkl(0)−H ′

kl|, the same probabilistic bound holds for ‖H(0)−H ′‖2.Note that ∀k ∈ [m] ∀i ∈ [n] for any w ∈ Rn0 such that ‖w − wi(0)‖2 ≤ R, [wTxk ≥ 0] 6= [wT

i (0)xk ≥ 0] implies

[w′,Tki xk ≥ 0] 6= [wT

i (0)xk ≥ 0]. Hence ∀k, l ∈ [m] for any set of weights w1:n such that ∀i ∈ [n] ‖wi − wi(0)‖2 ≤ R,|Hkl(0)−Hkl| ≤ |Hkl(0)−H ′

kl|. This means that w.p. ≥ 1− δ over initialization, for any set of weights w1:n suchthat ∀i ∈ [n] ‖wi − wi(0)‖2 ≤ R,

‖H(0)−H‖2 ≤ ‖H(0)−H ′‖2 ≤ 4m2R√2πδ

. (5.135)

In order to get the required bound, it suffices to solve the equation:

4m2R√2πδ

=λ04, which gives R =

√2πδλ016m2

. (5.136)

The bound on the minimal eigenvalue is then straightforward:

λmin(H) = λmin(H(0) + (H −H(0))) ≥ λmin(H(0))− λmax(H −H(0)) ≥ 3λ0/4− λ0/4 = λ0/2. (5.137)

Proof of Lemma 15. For s ∈ [0, t] we have:

d‖~y − fs(~x)‖22ds

= −2(~y − fs(~x))TH(s)(~y − fs(~x)) ≤ −λ0‖~y − fs(~x)‖22, (5.138)

which implies:d(log(‖~y − fs(~x)‖22))

ds≤ −λ0. (5.139)

Hencelog(‖~y − fs(~x)‖22) ≤ log(‖~y − f0(~x)‖22)− λ0s, (5.140)

or, equivalently,‖~y − fs(~x)‖22 ≤ e−λ0s‖~y − f0(~x)‖22, (5.141)

which holds, for instance, for s = t. In order to bound weight deviation, we first bound the gradient norm:

dwi(s)

ds

2

=

1√n

m∑

k=1

(yk − fs(xk))ai[wTi (s)xk > 0]xk

2

≤ 1√n

m∑

k=1

|yk − fs(xk)| ≤√

m

n‖~y − fs(~x)‖2 ≤

m

ne−λ0s/2‖~y − f0(~x)‖2. (5.142)

This gives ∀i ∈ [n]:

‖wi(t)− wi(0)‖2 =

∫ t

0

dwi(s)

dsds

2

≤∫ t

0

dwi(s)

ds

2

ds ≤

≤ 2√m

λ0√n

(

1− e−λ0t/2)

‖~y − f0(~x)‖2 ≤ 2√m

λ0√n‖~y − f0(~x)‖2. (5.143)

61

Page 64: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Proof of Lemma 16. Proof by contradiction. Take δ ∈ (0, 1/3) and suppose that R′ < R(δ), however, w.p. > 3δover initialization ∃t∗ > 0 : either λmin(H(t∗)) < λ0/2, or ∃i ∈ [n] ‖wi(t

∗) − wi(0)‖2 > R′, or ‖~y − ft∗(~x)‖2 >exp(−λ0t∗)‖~y − f0(~x)‖2. If either of the last two holds, then by Lemma 15, ∃s ∈ [0, t∗] λmin(H(s)) < λ0/2. If theformer holds, we can take s = t∗. Hence by virtue of Lemma 14, for this particular s w.p. > 2δ over initialization∃i ∈ [n] ‖wi(s)− wi(0)‖2 > R(δ). Define:

t0 = inf

t ≥ 0 : maxi∈[n]

‖wi(t)− wi(0)‖2 > R(δ)

. (5.144)

Note that w.p. > 2δ over initialization t0 ≤ s ≤ t∗ < ∞. Since wi(·) is a continuous map, w.p. > 2δ overinitialization maxi∈[n] ‖wi(t0) − wi(0)‖2 = R(δ). Hence by Lemma 14, w.p. > δ over initialization ∀t ∈ [0, t0]λmin(H(t)) ≥ λ0/2. Hence by Lemma 15, ∀i ∈ [n] ‖wi(t0) − wi(0)‖2 ≤ R′. Hence w.p. > δ over initialization wehave a contradiction with maxi∈[n] ‖wi(t0)− wi(0)‖2 = R(δ) and R′ < R(δ).

5.3.1 Component-wise convergence guarantees and kernel alignment

Denote ~u(t) = ft(~x). We have the following dynamics for quadratic loss:

d~u(t)

dt= H(t)(~y − ~u(t)), uk(0) =

1√n

n∑

i=1

ai[wTi (0)xk]+ ∀k ∈ [m], (5.145)

where

Hkl(t) =1

n

n∑

i=1

[wTi (t)xk ≥ 0][wT

i (t)xl ≥ 0]xTk xl. (5.146)

Additionaly, following [Arora et al., 2019a], consider the limiting linearized dynamics:

d~u′(t)dt

= H∞(~y − ~u′(t)), u′k(0) = uk(0) ∀k ∈ [m], (5.147)

whereH∞

kl = EHkl(0) = Ew∼N (0,I)[wT xk ≥ 0][wTxl ≥ 0]xTk xl. (5.148)

Solving the above gives:~u′(t) = ~y + e−H∞t(~u(0)− ~y) (5.149)

Consider an eigenvalue-eigenvector decomposition for H∞: H∞ =∑m

k=1 λk~vTk ~vk, where ~vkmk=1 forms an orthonor-

mal basis in Rm and λ1 ≥ . . . ≥ λm ≥ 0. Note that exp(−H∞t) then has the same set of eigenvectors, and each

eigenvector ~vk corresponds to an eigenvalue exp(−λkt). Then the above solution is rewritten as:

~u′(t)− ~y = −m∑

k=1

e−λkt(~vTk (~y − ~u(0)))~vk, (5.150)

which implies

‖~u′(t)− ~y‖22 =

m∑

k=1

e−2λkt(~vTk (~y − ~u(0)))2. (5.151)

We see that components ~vTk (~y−~u(0)) that correspond to large λk decay faster. Hence convergence is fast if ∀k ∈ [m]large ~vTk (~y − ~u(0)) implies large λk. In this case, we shall say that the initial kernel aligns well with the dataset.

It turns out, that realistic datasets align well with NTKs of realistic nets, however, datasets with random labelsdo not. This observation substitutes a plausible explanation for a phenomenon noted in [Zhang et al., 2016]: largenetworks learn corrupted datasets much slower than clean ones.

The above speculation is valid for the limiting linearized dynamics ~u′(t). It turns out that given n large enough,the true dynamics ~u(t) stays close to its limiting linearized version:

Theorem 20 ([Arora et al., 2019a]). Suppose λ0 = λmin(H∞) > 0. Take ǫ > 0 and δ ∈ (0, 1). Then there exists a

constant Cn > 0 such that for

n ≥ Cnm7

λ40δ4ǫ2

, (5.152)

w.p. ≥ 1− δ over initialization, ∀t ≥ 0 ‖~u(t)− ~u′(t)‖2 ≤ ǫ.

62

Page 65: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Proof. We start with stating a reformulation of Lemma 16:

Lemma 17. Let δ ∈ (0, 1). There exists C′n > 0 such that for n ≥ C′

nm6

λ40δ

3 , w.p. ≥ 1− δ over initialization, ∀t ≥ 0

‖wi(t)− wi(0)‖2 ≤ R′ :=4√m‖~y − ~u(0)‖2√

n∀i ∈ [n]. (5.153)

We proceed with an analogue of Lemma 14:

Lemma 18. Let δ ∈ (0, 1). There exist CH , CZ > 0 such that w.p. ≥ 1− δ over initialization, ∀t ≥ 0

‖H(t)−H(0)‖F ≤ CHm3

n1/2λ0δ3/2, ‖Z(t)− Z(0)‖F ≤ CZ

m2

n1/2λ0δ3/2. (5.154)

The last lemma we need is an analogue of Lemma 13:

Lemma 19. Let δ ∈ (0, 1). There exist C′H > 0 such that w.p. ≥ 1− δ over initialization,

‖H(0)−H∞‖F ≤ C′H

m

n1/2log(m

δ

)

. (5.155)

Let us elaborate the dynamics over:

d~u(t)

dt= H(t)(~y − ~u(t)) = H∞(~y − ~u(t)) + (H(t)−H∞)(~y − ~u(t)) = H∞(~y − ~u(t)) + ~ζ(t). (5.156)

~u(t) = e−H∞t ~C(t). (5.157)

d~u(t)

dt= −H∞e−H∞t ~C(t) + e−H∞t d

~C(t)

dt= H∞(~y − ~u(t)) + ~ζ(t)−H∞~y + e−H∞t d

~C(t)

dt− ~ζ(t). (5.158)

d~C(t)

dt= eH

∞t(H∞~y + ~ζ(t)). (5.159)

~C(t) = ~u(0) + (eH∞t − I)~y +

∫ t

0

eH∞τ~ζ(τ) dτ. (5.160)

~u(t) = ~y + e−H∞t(~u(0)− ~y) +

∫ t

0

eH∞(τ−t)~ζ(τ) dτ. (5.161)

‖~u(t)− ~u′(t)‖2 =

∫ t

0

eH∞(τ−t)~ζ(τ) dτ

2

≤∫ t

0

∥eH∞(τ−t)~ζ(τ)

2dτ ≤

≤ maxτ∈[0,t]

‖~ζ(τ)‖2∫ t

0

∥e−H∞τ∥

2dτ ≤ max

τ∈[0,t]‖~ζ(τ)‖2

∫ t

0

e−λ0τ dτ ≤

≤ maxτ∈[0,t]

‖~ζ(τ)‖21

λ0

(

1− e−λ0t)

≤ 1

λ0maxτ∈[0,t]

‖~ζ(τ)‖2. (5.162)

‖~ζ(τ)‖2 = ‖(H(τ)−H∞)(~y − ~u(τ))‖2 ≤ (‖H(τ)−H(0)‖2 + ‖H(0)−H∞‖2) ‖~y − ~u(τ)‖2 ≤≤ (‖H(τ)−H(0)‖F + ‖H(0)−H∞‖F ) ‖~y − ~u(0)‖2. (5.163)

Due to Lemma 19 and Lemma 18, and since ‖~y − ~u(0)‖2 ≤√

2m/δ w.p. ≥ 1− δ over initialization, we have

‖~ζ(τ)‖2 ≤(

CHm3

n1/2λ0δ3/2+ C′

H

m

n1/2log(m

δ

)

)

2m

δ=

√2CH

m7/2

n1/2λ0δ2+√2C′

H

m3/2

n1/2δ1/2log(m

δ

)

(5.164)

w.p. ≥ 1− 3δ over initialization. Given ǫ > 0, we need

n ≥ Cnm7

λ40δ4ǫ2

(5.165)

for some Cn > 0 in order to ensure ‖~u(t)− ~u′(t)‖2 ≤ ǫ w.p. ≥ 1− δ over initialization.

63

Page 66: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Bibliography

[Arora et al., 2019a] Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019a). Fine-grained analysis of optimizationand generalization for overparameterized two-layer neural networks. In International Conference on MachineLearning, pages 322–332.

[Arora et al., 2019b] Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019b). On exactcomputation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages8141–8150.

[Bartlett et al., 2017] Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized marginbounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.

[Bartlett et al., 2019] Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight vc-dimensionand pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63–1.

[Donsker and Varadhan, 1985] Donsker, M. and Varadhan, S. (1985). Large deviations for stationary gaussianprocesses. Communications in Mathematical Physics, 97(1-2):187–210.

[Draxler et al., 2018] Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. (2018). Essentially no barriersin neural network energy landscape. In International Conference on Machine Learning, pages 1309–1318.

[Du et al., 2019] Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2019). Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.

[Dudley, 1967] Dudley, R. M. (1967). The sizes of compact subsets of hilbert space and continuity of gaussianprocesses. Journal of Functional Analysis, 1(3):290–330.

[Dyer and Gur-Ari, 2020] Dyer, E. and Gur-Ari, G. (2020). Asymptotics of wide networks from feynman diagrams.In International Conference on Learning Representations.

[Dziugaite and Roy, 2017] Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalizationbounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprintarXiv:1703.11008.

[Garipov et al., 2018] Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Losssurfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems,pages 8789–8798.

[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feed-forward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence andstatistics, pages 249–256.

[He et al., 2015] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computervision, pages 1026–1034.

[Hoeffding, 1963] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal ofthe American Statistical Association, 58(301):13–30.

64

Page 67: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

[Huang and Yau, 2019] Huang, J. and Yau, H.-T. (2019). Dynamics of deep neural networks and neural tangenthierarchy. arXiv preprint arXiv:1909.08156.

[Isserlis, 1918] Isserlis, L. (1918). On a formula for the product-moment coefficient of any order of a normal frequencydistribution in any number of variables. Biometrika, 12(1/2):134–139.

[Jacot et al., 2018] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and gener-alization in neural networks. In Advances in neural information processing systems, pages 8571–8580.

[Jin et al., 2017] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. (2017). How to escape saddlepoints efficiently. In International Conference on Machine Learning, pages 1724–1732.

[Kawaguchi, 2016] Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in neural infor-mation processing systems, pages 586–594.

[Laurent and Brecht, 2018] Laurent, T. and Brecht, J. (2018). Deep linear networks with arbitrary loss: All localminima are global. In International conference on machine learning, pages 2902–2907. PMLR.

[Lee et al., 2019] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J.(2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neuralinformation processing systems, pages 8572–8583.

[Lee et al., 2016] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only convergesto minimizers. In Conference on learning theory, pages 1246–1257.

[Lu and Kawaguchi, 2017] Lu, H. and Kawaguchi, K. (2017). Depth creates no bad local minima. arXiv preprintarXiv:1702.08580.

[Marchenko and Pastur, 1967] Marchenko, V. A. and Pastur, L. A. (1967). Распределение собственных значенийв некоторых ансамблях случайных матриц. Математический сборник, 72(4):507–536.

[McAllester, 1999a] McAllester, D. A. (1999a). Pac-bayesian model averaging. In Proceedings of the twelfth annualconference on Computational learning theory, pages 164–170.

[McAllester, 1999b] McAllester, D. A. (1999b). Some pac-bayesian theorems. Machine Learning, 37(3):355–363.

[McDiarmid, 1989] McDiarmid, C. (1989). On the method of bounded differences. Surveys in combinatorics,141(1):148–188.

[Nagarajan and Kolter, 2019] Nagarajan, V. and Kolter, J. Z. (2019). Uniform convergence may be unable to explaingeneralization in deep learning. In Advances in Neural Information Processing Systems, pages 11615–11626.

[Neyshabur et al., 2018] Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2018). A PAC-bayesian approach tospectrally-normalized margin bounds for neural networks. In International Conference on Learning Representa-tions.

[Neyshabur et al., 2015] Neyshabur, B., Tomioka, R., and Srebro, N. (2015). In search of the real inductive bias:On the role of implicit regularization in deep learning. In ICLR (Workshop).

[Nguyen, 2019] Nguyen, Q. (2019). On connected sublevel sets in deep learning. In International Conference onMachine Learning, pages 4790–4799.

[Nguyen and Hein, 2017] Nguyen, Q. and Hein, M. (2017). The loss surface of deep and wide neural networks. InProceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2603–2612.

[Panageas and Piliouras, 2017] Panageas, I. and Piliouras, G. (2017). Gradient descent only converges to minimizers:Non-isolated critical points and invariant regions. In 8th Innovations in Theoretical Computer Science Conference(ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

[Pennington et al., 2017] Pennington, J., Schoenholz, S., and Ganguli, S. (2017). Resurrecting the sigmoid in deeplearning through dynamical isometry: theory and practice. In Advances in neural information processing systems,pages 4785–4795.

65

Page 68: arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

[Poole et al., 2016] Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponentialexpressivity in deep neural networks through transient chaos. In Advances in neural information processingsystems, pages 3360–3368.

[Sauer, 1972] Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A,13(1):145–147.

[Saxe et al., 2013] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamicsof learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

[Schoenholz et al., 2016] Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2016). Deep informationpropagation. arXiv preprint arXiv:1611.01232.

[Tao, 2012] Tao, T. (2012). Topics in random matrix theory, volume 132. American Mathematical Soc.

[Tropp, 2011] Tropp, J. A. (2011). User-friendly tail bounds for sums of random matrices. Foundations of Compu-tational Mathematics, 12(4):389–434.

[Vapnik and Chervonenkis, 1971] Vapnik, V. N. and Chervonenkis, A. Y. (1971). О равномерной сходимостичастот появления событий к их вероятностям. Теория вероятностей и ее применения, 16(2):264–279.

[Voiculescu, 1987] Voiculescu, D. (1987). Multiplication of certain non-commuting random variables. Journal ofOperator Theory, pages 223–235.

[Yu and Chen, 1995] Yu, X.-H. and Chen, G.-A. (1995). On the local minima free condition of backpropagationlearning. IEEE Transactions on Neural Networks, 6(5):1300–1303.

[Zhang et al., 2016] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deeplearning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

[Zhou et al., 2019] Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2019). Non-vacuous gener-alization bounds at the imagenet scale: a PAC-bayesian compression approach. In International Conference onLearning Representations.

66