Neural Tangent Kernel - GitHub Pages · 2019. 12. 5. · Neural Tangent Kernel: Convergence and Generalization of DNNs
TRANSCRIPT
Neural Tangent KernelConvergence and Generalization of DNNs
Arthur Jacot, Franck Gabriel, Clément Hongler
Ecole Polytechnique Fédérale de Lausanne
October 28, 2019
Neural Networks
- L + 1 layers of n_ℓ neurons with activations α^(ℓ)(x) ∈ R^{n_ℓ}:

      α^(0)(x) = x
      α̃^(ℓ+1)(x) = (√(1 − β²)/√(n_ℓ)) W^(ℓ) α^(ℓ)(x) + β b^(ℓ)
      α^(ℓ+1)(x) = σ(α̃^(ℓ+1)(x))

- Parameters: connection weights W^(ℓ) ∈ R^{n_ℓ×n_{ℓ+1}} and biases b^(ℓ) ∈ R^{n_{ℓ+1}}.
- Weights / bias balance: β ∈ [0, 1].
- Non-linearity: σ : R → R.
- Network function: f_θ(x) = α̃^(L)(x)
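As a concrete illustration, a minimal NumPy sketch of this forward pass, assuming ReLU for σ and illustrative widths (the weight shapes below follow the W α matrix-vector convention, i.e. the transpose of the slides' R^{n_ℓ×n_{ℓ+1}}):

```python
import numpy as np

def init_params(widths, rng):
    """theta ~ N(0, Id_P): all weights and biases i.i.d. standard normal."""
    return [(rng.standard_normal((widths[l + 1], widths[l])),
             rng.standard_normal(widths[l + 1]))
            for l in range(len(widths) - 1)]

def forward(params, x, beta=0.1, sigma=lambda a: np.maximum(a, 0.0)):
    """f_theta(x) = alpha_tilde^(L)(x), with the sqrt(1-beta^2)/sqrt(n_l) scaling."""
    alpha = x
    for l, (W, b) in enumerate(params):
        n_l = alpha.shape[0]
        alpha_tilde = np.sqrt(1 - beta**2) / np.sqrt(n_l) * (W @ alpha) + beta * b
        # sigma is applied to every layer except the output, which stays pre-activated
        alpha = sigma(alpha_tilde) if l < len(params) - 1 else alpha_tilde
    return alpha

rng = np.random.default_rng(0)
params = init_params([3, 512, 512, 1], rng)  # n_0 = 3, two hidden layers, n_L = 1
y = forward(params, rng.standard_normal(3))
```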
Initialization: DNNs as Gaussian processes
- In the infinite-width limit n_1, ..., n_{L−1} → ∞, initialize the parameters θ ∼ N(0, Id_P).
- The preactivations α̃^(ℓ)_i(·; θ) : R^{n_0} → R converge to i.i.d. Gaussian processes of covariance Σ^(ℓ) (Cho and Saul, 2009; Lee et al., 2018):

      Σ^(1)(x, y) = (1 − β²) xᵀy + β²
      Σ^(ℓ+1)(x, y) = (1 − β²) E_{α∼N(0,Σ^(ℓ))}[σ(α(x)) σ(α(y))] + β²

- The network function f_θ = α̃^(L) is also asymptotically Gaussian.
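One way to see the recursion concretely: a Monte Carlo sketch of a single step Σ^(ℓ) → Σ^(ℓ+1) for two fixed inputs, assuming ReLU for σ; the inputs, β, and sample count below are illustrative:

```python
import numpy as np

def sigma_next(S_xx, S_xy, S_yy, beta=0.1, n_samples=200_000, seed=0):
    """One step Sigma^(l) -> Sigma^(l+1)(x, y): estimate the Gaussian
    expectation E[sigma(a(x)) sigma(a(y))] by sampling the pair (a(x), a(y))."""
    rng = np.random.default_rng(seed)
    cov = np.array([[S_xx, S_xy], [S_xy, S_yy]])
    a = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
    relu = np.maximum(a, 0.0)
    return (1 - beta**2) * np.mean(relu[:, 0] * relu[:, 1]) + beta**2

x, y, beta = np.array([1.0, 0.0]), np.array([0.6, 0.8]), 0.1
# Sigma^(1) from the first formula above:
S_xx = (1 - beta**2) * x @ x + beta**2
S_xy = (1 - beta**2) * x @ y + beta**2
S_yy = (1 - beta**2) * y @ y + beta**2
S2_xy = sigma_next(S_xx, S_xy, S_yy, beta)  # estimate of Sigma^(2)(x, y)
```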
Training: Neural Tangent Kernel
- Training set of size N: inputs X ∈ R^{N×n_0} and outputs Y ∈ R^{N×n_L}.
- Gradient descent on the MSE C(θ) = (1/N) ‖f_θ(X) − Y‖²:

      ∂_t θ = −∇C(θ) = (2/N) ∑_{i=1}^{N} ∇f_θ(x_i) (y_i − f_θ(x_i))

- Evolution of f_θ:

      ∂_t f_θ(x) = (∇f_θ(x))ᵀ ∂_t θ = (2/N) ∑_{i=1}^{N} (∇f_θ(x))ᵀ ∇f_θ(x_i) (y_i − f_θ(x_i))

- Neural Tangent Kernel (NTK):

      Θ^(L)(x, y) := (∇f_θ(x))ᵀ ∇f_θ(y)
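The definition can be checked numerically on a toy model: a sketch of the empirical NTK for a small two-layer tanh network (a hypothetical example, not the slides' exact setup), with ∇_θ f computed by central finite differences:

```python
import numpy as np

def f(theta, x, n0=2, n1=8):
    """Tiny two-layer scalar network with 1/sqrt(width) scaling."""
    W1 = theta[: n0 * n1].reshape(n1, n0)
    w2 = theta[n0 * n1 :]
    return w2 @ np.tanh(W1 @ x / np.sqrt(n0)) / np.sqrt(n1)

def grad_f(theta, x, eps=1e-6):
    """grad_theta f(x) by central finite differences, one parameter at a time."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta = rng.standard_normal(2 * 8 + 8)
x, y = rng.standard_normal(2), rng.standard_normal(2)
ntk_xy = grad_f(theta, x) @ grad_f(theta, y)  # Theta(x, y)
ntk_xx = grad_f(theta, x) @ grad_f(theta, x)  # Theta(x, x) >= 0
```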
Asymptotics of the NTK
Theorem (Arora et al., 2019; Jacot et al., 2018)
Let n_1, ..., n_{L−1} → ∞. Then for any t,

      Θ^(L)(t) → Θ^(L)_∞,

where

      Θ^(L)_∞(x, y) = ∑_{ℓ=1}^{L} Σ^(ℓ)(x, y) Σ̇^(ℓ+1)(x, y) ··· Σ̇^(L)(x, y)

with

      Σ̇^(ℓ)(x, y) = (1 − β²) E_{α∼N(0,Σ^(ℓ−1))}[σ̇(α(x)) σ̇(α(y))].
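The sum-of-products structure of Θ^(L)_∞ can be written out directly; the Σ and Σ̇ values below are made-up placeholders, since computing the real ones requires the covariance recursion of the previous slide:

```python
import numpy as np

def ntk_infty(Sigma, Sigma_dot):
    """Theta^(L)_infty = sum_l Sigma^(l) * prod_{l'=l+1}^{L} Sigma_dot^(l').
    Lists are 0-indexed: Sigma[l] holds Sigma^(l+1), same for Sigma_dot."""
    L = len(Sigma)
    total = 0.0
    for l in range(L):
        # product Sigma_dot^(l+2) ... Sigma_dot^(L); empty product = 1 for l = L-1
        total += Sigma[l] * np.prod(Sigma_dot[l + 1 :])
    return total

Sigma = [0.9, 0.8, 0.7]      # hypothetical Sigma^(1..3)(x, y)
Sigma_dot = [1.0, 0.6, 0.5]  # hypothetical Sigma_dot^(1..3); only entries 2..L enter
theta_inf = ntk_infty(Sigma, Sigma_dot)  # 0.9*0.6*0.5 + 0.8*0.5 + 0.7 = 1.37
```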
Convergence

The continuous-time dynamics of the outputs Y_θ(t) = f_{θ(t)}(X) are described by the linear ODE

      ∂_t Y_θ(t) = (2/N) Θ^(L)_∞(X, X) (Y − Y_θ(t))

with solution

      Y_θ(t) = Y − e^{−(2t/N) Θ^(L)_∞(X,X)} (Y − Y_θ(0)).

Convergence and loss surface (Jacot et al., 2019a):

1. Eigenvalues: speed of convergence along the corresponding eigenvector
2. Narrow valley structure:
   2.1 Large eigenvalues are the 'cliffs'
   2.2 Small eigenvalues are the 'bottom'
3. Condition number κ = λ_max/λ_min
Figure: Narrow valley in the loss surface (contour levels 0.1, 1, 10).
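The exponential solution above can be simulated directly via an eigendecomposition of the (symmetric) kernel matrix; the Gram matrix here is a made-up positive-definite stand-in for Θ^(L)_∞(X, X):

```python
import numpy as np

def outputs_at_time(theta_gram, Y, Y0, t, N):
    """Y(t) = Y - exp(-(2t/N) * Theta) @ (Y - Y0), matrix exponential via eigh."""
    lam, U = np.linalg.eigh(theta_gram)
    decay = U @ np.diag(np.exp(-2 * t / N * lam)) @ U.T
    return Y - decay @ (Y - Y0)

N = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
theta_gram = A @ A.T + N * np.eye(N)  # hypothetical positive-definite NTK Gram matrix
Y = rng.standard_normal(N)            # training targets
Y0 = np.zeros(N)                      # outputs at initialization
Yt = outputs_at_time(theta_gram, Y, Y0, t=50.0, N=N)  # close to Y: training has converged
```

Each eigenvector's error decays at rate 2λ_k/N, which is exactly the "cliffs vs. bottom" picture: large eigenvalues converge fast, small ones slowly.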
Freeze and Chaos

Two regimes appear (as in Poole et al. (2016); Schoenholz et al. (2017)).

Theorem (Jacot et al., 2019b)
Consider a twice-differentiable σ satisfying E_{z∼N(0,1)}[σ(z)²] = 1 and two inputs x, y with ‖x‖ = ‖y‖ = √n_0. The characteristic value

      r_{σ,β} = (1 − β²) E_{z∼N(0,1)}[σ̇(z)²]

determines two regimes:

FREEZE (r < 1): For x, y ∈ S^{n_0},

      1 ≥ Θ^(L)_∞(x, y)/Θ^(L)_∞(x, x) ≥ 1 − C L r^L.

CHAOS (r > 1): If x ≠ ±y,

      Θ^(L)_∞(x, y)/Θ^(L)_∞(x, x) → 0.
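The characteristic value r_{σ,β} is just a one-dimensional Gaussian expectation, so it can be sketched with Gauss-Hermite quadrature; the standardized ReLU √2·max(0, u) used below (so that E[σ(z)²] = 1) is an assumption for illustration:

```python
import numpy as np

def characteristic_value(sigma_dot, beta, n_nodes=80):
    """r_{sigma,beta} = (1 - beta^2) * E_{z~N(0,1)}[sigma_dot(z)^2],
    with the expectation computed by probabilists' Gauss-Hermite quadrature."""
    z, w = np.polynomial.hermite_e.hermegauss(n_nodes)  # weight function e^{-z^2/2}
    expectation = np.sum(w * sigma_dot(z) ** 2) / np.sqrt(2 * np.pi)
    return (1 - beta**2) * expectation

# Derivative of the standardized ReLU sqrt(2)*max(0, u): sqrt(2) * 1{z > 0}
relu_dot = lambda z: np.sqrt(2.0) * (z > 0)
r_edge = characteristic_value(relu_dot, beta=0.1)  # approx 1 - beta^2 = 0.99, the EDGE regime
```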
Freeze and Chaos: Properties
- FREEZE (r < 1):
  - Almost constant NTK Gram matrix
  - Bad conditioning λ_max/λ_min
  - Slow convergence
- CHAOS (r > 1):
  - Almost identity NTK Gram matrix
  - Fast convergence
  - Generalization?
- EDGE: for the ReLU, r = 1 − β² ≈ 1
  - NTK Gram matrix converges to Id + c
  - Strong constant mode for large N
  - Slow convergence
Figure: The NTK Θ^(6)_∞(x, y) as a function of the angle between x and y in the different regimes for L = 6 (r = 0.75, r = 0.99, r = 1.2, Batch Norm).
Chaos: normalization
- Normalize the non-linearity (over z ∼ N(0, 1)):

      σ̄(x) = (σ(x) − E[σ(z)]) / √(Var[σ(z)]).

- If σ ≠ id, we have E[σ̄̇(z)²] > 1, and for small enough β > 0,

      r = (1 − β²) E[σ̄̇(z)²] > 1.

- Asymptotically equivalent to Layer Normalization.
- For Batch Normalization we only have a partial result: applying BN after the last non-linearity controls the constant mode,

      (1/N²) ∑_{i,j} Θ^(L)(x_i, x_j) = β².
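A Monte Carlo sketch of this standardization, with tanh as an illustrative non-linearity (E[σ(z)] and Var[σ(z)] are estimated by sampling rather than computed in closed form):

```python
import numpy as np

def standardize(sigma, n_samples=1_000_000, seed=0):
    """Return sigma_bar = (sigma - E[sigma(z)]) / sqrt(Var[sigma(z)]),
    with the Gaussian moments estimated by Monte Carlo."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    vals = sigma(z)
    mean, std = vals.mean(), vals.std()
    return lambda x: (sigma(x) - mean) / std

tanh_bar = standardize(np.tanh)
z = np.random.default_rng(1).standard_normal(1_000_000)
out = tanh_bar(z)  # mean approx 0, variance approx 1 under z ~ N(0, 1)
```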
Generative Adversarial Networks
- Two networks (Goodfellow et al., 2014):
  - Generator G: generates data G(z) from a random code z.
  - Discriminator D: classifies real vs. generated data to guide the generator.
- If the generator lies in the FREEZE regime:
  - Mode collapse: the generator G becomes constant.
- Solutions:
  - Batch Normalization
  - Chaotic generator
Deconvolutional Generator
For G a deconvolutional network with stride s:

Theorem
In the FREEZE regime (r < 1),

      (1 − r^{v+1})/(1 − r^L) − C_1 (v + 1) r^v ≤ Θ^(L)_{p,p′}(x, y)/Θ^(L)_{p,p}(x, x) ≤ (1 − r^{v+1})/(1 − r^L)

for v the maximal value in {0, ..., L − 1} such that s^v divides p − p′.

- Checkerboard patterns: images which repeat every s^v pixels
- Solution: Batch Norm / Chaos
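The exponent v in the bound can be sketched as a small helper (hypothetical, for illustration): it measures how high a power of the stride divides the pixel offset p − p′, which is what produces the s^v-periodic checkerboard correlations:

```python
def checkerboard_level(p, p_prime, s, L):
    """Largest v in {0, ..., L-1} such that s**v divides p - p'."""
    d = abs(p - p_prime)
    if d == 0:
        return L - 1  # every power of s divides 0
    v = 0
    while v < L - 1 and d % s ** (v + 1) == 0:
        v += 1
    return v

# Offsets divisible by high powers of the stride are the most strongly correlated:
levels = [checkerboard_level(p, 0, s=2, L=6) for p in range(9)]
```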
NTK PCA
Figure: Principal components of the NTK in the FREEZE, CHAOS, and Batch Norm regimes.
Conclusion
1. Certain architectures lead to dominating eigenvalues:
   1.1 Constant mode
   1.2 Checkerboard patterns
2. These slow down training.
3. They cause mode collapse in GANs.
4. Use the NTK to identify and fix them.
Bibliography I
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.

Cho, Y. and Saul, L. K. (2009). Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks. In NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, pages 2672–2680.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589. Curran Associates, Inc.
Bibliography II
Jacot, A., Gabriel, F., and Hongler, C. (2019a). The asymptotic spectrum of the Hessian of DNN throughout training.

Jacot, A., Gabriel, F., and Hongler, C. (2019b). Freeze and chaos for DNNs: an NTK view of batch normalization, checkerboard and boundary effects. CoRR, abs/1907.05715.

Lee, J. H., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep Neural Networks as Gaussian Processes. ICLR.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation.