Lecture 4: Some Advanced Topics in Deep Learning (II)
Instructor: Hao Su
Jan 18, 2018
Agenda
• Review of last lecture
• Theories behind Machine Learning
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]
Deep and Shallow Networks: Universality
[Tomaso Poggio's lecture in STATS385]
Classical learning theory and kernel machines (regularization in RKHS)
[Tomaso Poggio's lecture in STATS385]
Classical kernel machines are equivalent to shallow networks
[Tomaso Poggio's lecture in STATS385]
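As an illustration of the equivalence (a minimal NumPy sketch I wrote, not code from the lecture): a kernel machine f(x) = Σᵢ cᵢ K(x, xᵢ) is exactly a one-hidden-layer network whose hidden units are kernel functions centered at the training points and whose output layer holds the coefficients cᵢ. The data, the Gaussian kernel, and all names below are hypothetical.

import numpy as np

def rbf_features(X, centers, gamma=1.0):
    # Hidden layer: one Gaussian (RBF) unit per center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)          # shape: (len(X), len(centers))

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(50, 2))        # hypothetical toy data
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2

# Regularized kernel regression (regularization in the RKHS): solve
# (K + lam * I) c = y for the output-layer coefficients c.
K = rbf_features(X_train, X_train)                # Gram matrix
lam = 1e-3
c = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

# Prediction is a shallow network: RBF hidden layer, then a linear output layer.
X_test = rng.uniform(-1, 1, size=(5, 2))
print(rbf_features(X_test, X_train) @ c)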
Curse of dimensionality
Mhaskar, Poggio, Liao, 2016 [Tomaso Poggio's lecture in STATS385]
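A hedged summary of the bounds this slide refers to (my paraphrase of the statements in Mhaskar, Poggio, Liao 2016, not a quote; here n is the input dimension, m the smoothness of the target, and \epsilon the approximation accuracy):

% Shallow networks approximating a generic smooth function of n variables:
N_{\text{shallow}} = O\!\left(\epsilon^{-n/m}\right) \quad \text{(exponential in } n\text{: the curse of dimensionality)}
% Deep networks matching a hierarchically local compositional target
% (a binary tree of 2-variable constituent functions, each of smoothness m):
N_{\text{deep}} = O\!\left((n-1)\,\epsilon^{-2/m}\right) \quad \text{(only linear in } n\text{)}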
Hierarchically local compositionality
Proof by induction
Intuition
• The family of compositional functions is easier to approximate (it can be approximated with fewer parameters)
• Deep learning leverages the compositional structure (see the sketch below)
• Why are compositional functions important for perception?
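A toy sketch of what "compositional structure" means here (my own illustrative example, not from the slides): a function of 8 variables built as a binary tree of 2-variable constituent functions. A deep network whose architecture mirrors this tree only ever has to approximate small 2-variable pieces.

import numpy as np

def h(a, b):
    # A hypothetical 2-variable constituent function.
    return np.tanh(a + 0.5 * b)

def f_compositional(x):
    # f(x1..x8) = h( h( h(x1,x2), h(x3,x4) ), h( h(x5,x6), h(x7,x8) ) )
    l1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]
    l2 = [h(l1[0], l1[1]), h(l1[2], l1[3])]
    return h(l2[0], l2[1])

print(f_compositional(np.arange(8) / 8.0))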
Locality of constituent functions is key: CIFAR
Other perspectives on why deep is good
The main result of [Telgarsky, 2016, COLT] says that there are functions with many oscillations that cannot be represented by shallow networks unless their complexity grows exponentially, but that can be represented with low complexity by deep networks.
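A small sketch of the standard construction behind this result (the classic "tent map" example, written by me, not taken from the lecture): composing a simple piecewise-linear map k times yields a function with 2^k linear pieces, which a depth-k ReLU network represents with O(k) units, while a shallow network needs exponentially many units to match all the oscillations.

import numpy as np

def tent(x):
    # Tent map on [0, 1]; exactly representable by a few ReLU units:
    # tent(x) = 2*relu(x) - 4*relu(x - 0.5) for x in [0, 1].
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0)

def deep_tent(x, depth):
    # Composing the map `depth` times corresponds to a depth-`depth` network.
    for _ in range(depth):
        x = tent(x)
    return x

xs = np.linspace(0, 1, 1001)
ys = deep_tent(xs, depth=5)
# Count the local maxima: 2^(depth-1) = 16 peaks, i.e. 2^depth = 32 linear pieces.
peaks = np.sum((ys[1:-1] > ys[:-2]) & (ys[1:-1] > ys[2:]))
print(peaks)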
Proof sketch
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]
The shape of the objective function to be optimized
Two challenging cases for optimization
The Hessian characterizes the shape at the bottom
Critical points
‖∇L(w)‖=0
Critical points
• If all eigenvalues are positive, the point is called a local minimum.
• If r of them are negative and the rest are positive, then it is called a saddle point with index r.
• At the critical point, the eigenvectors indicate the directions in which the value of the function locally changes.
• Moreover, the changes are proportional to the corresponding (signed) eigenvalue (see the numerical illustration below).
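A minimal numerical illustration of these bullets (my own toy example, not from the slides): classify a critical point by the signs of the Hessian eigenvalues.

import numpy as np

# Hypothetical function L(w) = w0^2 - w1^2 + 3*w2^2 with a critical point at 0.
H = np.diag([2.0, -2.0, 6.0])            # Hessian at the critical point
eigvals, eigvecs = np.linalg.eigh(H)

r = int(np.sum(eigvals < 0))
if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = f"saddle point with index {r}"
print(kind)                               # -> saddle point with index 1

# Along the i-th eigenvector, the function locally changes like
# 0.5 * eigvals[i] * t^2, i.e. proportionally to the signed eigenvalue.
print(eigvals)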
Critical points
• Classical optimization theory mostly assumes a non-degenerate Hessian
• Is this the case for machine learning loss functions?
Spectrum of Hessian for an MLP
• 5K parameters in total
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
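A sketch of this kind of experiment in PyTorch (my own toy version, not the authors' code; the data, sizes, and names are made up). For a small model, the full Hessian of the loss can be formed with torch.autograd.functional.hessian and its eigenvalues inspected directly.

import torch

torch.manual_seed(0)
X = torch.randn(200, 10)                  # hypothetical toy data
y = torch.randn(200, 1)
d_in, d_hid, d_out = 10, 16, 1
n_params = d_in * d_hid + d_hid + d_hid * d_out + d_out   # 193 parameters

def loss_fn(w):
    # MSE loss of a one-hidden-layer MLP whose parameters are the flat vector w.
    i = 0
    W1 = w[i:i + d_in * d_hid].view(d_in, d_hid); i += d_in * d_hid
    b1 = w[i:i + d_hid]; i += d_hid
    W2 = w[i:i + d_hid * d_out].view(d_hid, d_out); i += d_hid * d_out
    b2 = w[i:i + d_out]
    pred = torch.relu(X @ W1 + b1) @ W2 + b2
    return ((pred - y) ** 2).mean()

w0 = 0.1 * torch.randn(n_params)
H = torch.autograd.functional.hessian(loss_fn, w0)        # (n_params, n_params)
eigvals = torch.linalg.eigvalsh(H)
# Inspect the two ends of the spectrum; the paper reports a bulk of
# near-zero eigenvalues plus a few large outliers near the end of training.
print(eigvals[:5])
print(eigvals[-5:])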
An explanation
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
• One term of the Hessian decomposition is a sum of rank-1 matrices; the other vanishes near a minimum (claimed by the authors)
• If you are interested, check it! (for credit)
An explanation
There are at least M − N trivial (zero) eigenvalues of the Hessian! (M: number of parameters, N: number of samples)
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
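A quick numerical check of the rank argument in the simplest possible setting (a linear model with squared loss, my own example, not from the paper): the Hessian is (2/N)·XᵀX, whose rank is at most N, so with M > N parameters at least M − N eigenvalues are exactly zero. The slide claims the analogous statement for over-parametrized networks.

import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 100                          # over-parametrized: M parameters, N samples
X = rng.standard_normal((N, M))

H = 2.0 / N * X.T @ X                   # Hessian of the mean squared error
eigvals = np.linalg.eigvalsh(H)
num_trivial = int(np.sum(np.abs(eigvals) < 1e-10))
print(num_trivial, ">=", M - N)         # 80 >= 80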
Relation between #params and eigenvalues
• Growing the size of the network with fixed data, architecture, and algorithm.
• 1K samples from MNIST
“the large positive eigenvalues depend on data, so that their number should not, in principle, change with the increasing dimensionality of the parameter space.”
Relation between #params and eigenvalues
“increasing the number of dimensions keeps the ratio of small negative eigenvalues fixed.”
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]
Existing theories are insufficient
• Regularization
• VC dimension
• Rademacher complexity
• Uniform stability
• …
A Statistical Mechanics Analysis
SGD:
Assumptions:
Three Factors Influencing Minima in SGD, Jastrzebski et al.
Main Results
n = η/S is a measure of the noise in the system, set by the choice of learning rate η and batch size S.
η: learning rate, S: batch size
Three Factors Influencing Minima in SGD, Jastrzebski et al.
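A minimal simulation of this claim (my own sketch under simplified assumptions, not from the paper): SGD on a one-dimensional quadratic with noisy per-sample gradients. The stationary spread of the iterates is governed by the ratio η/S, so scaling the learning rate and the batch size together leaves the noise level roughly unchanged.

import numpy as np

def sgd_stationary_std(eta, S, sigma=1.0, steps=200_000, seed=0):
    # SGD on L(w) = w^2 / 2; each mini-batch gradient is w plus zero-mean noise
    # whose standard deviation shrinks like sigma / sqrt(S).
    rng = np.random.default_rng(seed)
    w, tail = 1.0, []
    for t in range(steps):
        g = w + rng.normal(0.0, sigma / np.sqrt(S))
        w -= eta * g
        if t > steps // 2:              # keep only the stationary phase
            tail.append(w)
    return np.std(tail)

print(sgd_stationary_std(eta=0.01, S=4))   # approx sqrt(eta * sigma^2 / (2*S)) ~ 0.035
print(sgd_stationary_std(eta=0.02, S=8))   # same eta/S ratio -> similar spread
print(sgd_stationary_std(eta=0.04, S=4))   # larger eta/S -> larger spread (~0.07)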
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]