Lecture 4: Some Advanced Topics in Deep Learning (II)
Instructor: Hao Su
Jan 18, 2018
Agenda
• Review of last lecture
• Theories behind Machine Learning
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]
Deep and Shallow Networks: Universality
[Tomaso Poggio's lecture in STATS385]
Classical learning theory and kernel machines (regularization in RKHS)
[Tomaso Poggio's lecture in STATS385]
Classical kernel machines are equivalent to shallow networks
[Tomaso Poggio's lecture in STATS385]
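As an illustration of the equivalence (a minimal NumPy sketch I wrote, not code from the lecture): a kernel machine f(x) = Σᵢ cᵢ K(x, xᵢ) is exactly a one-hidden-layer network whose hidden units are kernel functions centered at the training points and whose output layer holds the coefficients cᵢ. The data, the Gaussian kernel, and all names below are hypothetical.

import numpy as np

def rbf_features(X, centers, gamma=1.0):
    # Hidden layer: one Gaussian (RBF) unit per center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)          # shape: (len(X), len(centers))

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(50, 2))        # hypothetical toy data
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2

# Regularized kernel regression (regularization in the RKHS): solve
# (K + lam * I) c = y for the output-layer coefficients c.
K = rbf_features(X_train, X_train)                # Gram matrix
lam = 1e-3
c = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

# Prediction is a shallow network: RBF hidden layer, then a linear output layer.
X_test = rng.uniform(-1, 1, size=(5, 2))
print(rbf_features(X_test, X_train) @ c)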
Curse of dimensionality
Mhaskar, Poggio, Liao, 2016 [Tomaso Poggio's lecture in STATS385]
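A hedged summary of the bounds this slide refers to (my paraphrase of the statements in Mhaskar, Poggio, Liao 2016, not a quote; here n is the input dimension, m the smoothness of the target, and \epsilon the approximation accuracy):

% Shallow networks approximating a generic smooth function of n variables:
N_{\text{shallow}} = O\!\left(\epsilon^{-n/m}\right) \quad \text{(exponential in } n\text{: the curse of dimensionality)}
% Deep networks matching a hierarchically local compositional target
% (a binary tree of 2-variable constituent functions, each of smoothness m):
N_{\text{deep}} = O\!\left((n-1)\,\epsilon^{-2/m}\right) \quad \text{(only linear in } n\text{)}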
Hierarchically local compositionality
Proof by induction
Intuition
• The family of compositional functions is easier to approximate (it can be approximated with fewer parameters)
• Deep learning leverages the compositional structure (see the sketch below)
• Why are compositional functions important for perception?
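A toy sketch of what "compositional structure" means here (my own illustrative example, not from the slides): a function of 8 variables built as a binary tree of 2-variable constituent functions. A deep network whose architecture mirrors this tree only ever has to approximate small 2-variable pieces.

import numpy as np

def h(a, b):
    # A hypothetical 2-variable constituent function.
    return np.tanh(a + 0.5 * b)

def f_compositional(x):
    # f(x1..x8) = h( h( h(x1,x2), h(x3,x4) ), h( h(x5,x6), h(x7,x8) ) )
    l1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]
    l2 = [h(l1[0], l1[1]), h(l1[2], l1[3])]
    return h(l2[0], l2[1])

print(f_compositional(np.arange(8) / 8.0))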
Locality of constituent functions is key: CIFAR
Other perspectives on why deep is good
The main result of [Telgarsky, 2016, COLT] says that there are functions with many oscillations that cannot be represented by shallow networks unless their complexity grows exponentially, but that can be represented with low complexity by deep networks.
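A small sketch of the standard construction behind this result (the classic "tent map" example, written by me, not taken from the lecture): composing a simple piecewise-linear map k times yields a function with 2^k linear pieces, which a depth-k ReLU network represents with O(k) units, while a shallow network needs exponentially many units to match all the oscillations.

import numpy as np

def tent(x):
    # Tent map on [0, 1]; exactly representable by a few ReLU units:
    # tent(x) = 2*relu(x) - 4*relu(x - 0.5) for x in [0, 1].
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0)

def deep_tent(x, depth):
    # Composing the map `depth` times corresponds to a depth-`depth` network.
    for _ in range(depth):
        x = tent(x)
    return x

xs = np.linspace(0, 1, 1001)
ys = deep_tent(xs, depth=5)
# Count the local maxima: 2^(depth-1) = 16 peaks, i.e. 2^depth = 32 linear pieces.
peaks = np.sum((ys[1:-1] > ys[:-2]) & (ys[1:-1] > ys[2:]))
print(peaks)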
Proof sketch
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]
The shape of the objective function to be optimized
Two challenging cases for optimization
The Hessian characterizes the shape at the bottom
Critical points
‖∇L(w)‖=0
Critical points
• If all eigenvalues are positive, the point is called a local minimum.
• If r of them are negative and the rest are positive, then it is called a saddle point with index r.
• At the critical point, the eigenvectors indicate the directions in which the value of the function locally changes.
• Moreover, the changes are proportional to the corresponding (signed) eigenvalue (see the numerical illustration below).
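A minimal numerical illustration of these bullets (my own toy example, not from the slides): classify a critical point by the signs of the Hessian eigenvalues.

import numpy as np

# Hypothetical function L(w) = w0^2 - w1^2 + 3*w2^2 with a critical point at 0.
H = np.diag([2.0, -2.0, 6.0])            # Hessian at the critical point
eigvals, eigvecs = np.linalg.eigh(H)

r = int(np.sum(eigvals < 0))
if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = f"saddle point with index {r}"
print(kind)                               # -> saddle point with index 1

# Along the i-th eigenvector, the function locally changes like
# 0.5 * eigvals[i] * t^2, i.e. proportionally to the signed eigenvalue.
print(eigvals)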
Critical points
• Classical optimization theory mostly assumes a non-degenerate Hessian
• Is this the case for machine learning loss functions?
Spectrum of Hessian for an MLP
• 5K parameters in total
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
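A sketch of this kind of experiment in PyTorch (my own toy version, not the authors' code; the data, sizes, and names are made up). For a small model, the full Hessian of the loss can be formed with torch.autograd.functional.hessian and its eigenvalues inspected directly.

import torch

torch.manual_seed(0)
X = torch.randn(200, 10)                  # hypothetical toy data
y = torch.randn(200, 1)
d_in, d_hid, d_out = 10, 16, 1
n_params = d_in * d_hid + d_hid + d_hid * d_out + d_out   # 193 parameters

def loss_fn(w):
    # MSE loss of a one-hidden-layer MLP whose parameters are the flat vector w.
    i = 0
    W1 = w[i:i + d_in * d_hid].view(d_in, d_hid); i += d_in * d_hid
    b1 = w[i:i + d_hid]; i += d_hid
    W2 = w[i:i + d_hid * d_out].view(d_hid, d_out); i += d_hid * d_out
    b2 = w[i:i + d_out]
    pred = torch.relu(X @ W1 + b1) @ W2 + b2
    return ((pred - y) ** 2).mean()

w0 = 0.1 * torch.randn(n_params)
H = torch.autograd.functional.hessian(loss_fn, w0)        # (n_params, n_params)
eigvals = torch.linalg.eigvalsh(H)
# Inspect the two ends of the spectrum; the paper reports a bulk of
# near-zero eigenvalues plus a few large outliers near the end of training.
print(eigvals[:5])
print(eigvals[-5:])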
An explanation
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
• One term of the Hessian decomposition is a sum of rank-1 matrices; the other vanishes near a minimum (claimed by the authors)
• If you are interested, check it! (for credit)
An explanation
There are at least M − N trivial (zero) eigenvalues of the Hessian! (M: number of parameters, N: number of samples)
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
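A quick numerical check of the rank argument in the simplest possible setting (a linear model with squared loss, my own example, not from the paper): the Hessian is (2/N)·XᵀX, whose rank is at most N, so with M > N parameters at least M − N eigenvalues are exactly zero. The slide claims the analogous statement for over-parametrized networks.

import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 100                          # over-parametrized: M parameters, N samples
X = rng.standard_normal((N, M))

H = 2.0 / N * X.T @ X                   # Hessian of the mean squared error
eigvals = np.linalg.eigvalsh(H)
num_trivial = int(np.sum(np.abs(eigvals) < 1e-10))
print(num_trivial, ">=", M - N)         # 80 >= 80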
Relation between #params and eigenvalues
• Growing the size of the network with fixed data, architecture, and algorithm.
• 1K samples from MNIST
“the large positive eigenvalues depend on data, so that their number should not, in principle, change with the increasing dimensionality of the parameter space.”
Relation between #params and eigenvalues
“increasing the number of dimensions keeps the ratio of small negative eigenvalues fixed.”
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al.
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]
Existing theories are insufficient
• Regularization
• VC dimension
• Rademacher complexity
• Uniform stability
• …
A Statistical Mechanics Analysis
SGD:
Assumptions:
Three Factors Influencing Minima in SGD, Jastrzebski et al.
Main Results
n = η/S is a measure of the noise in the system, set by the choice of learning rate η and batch size S.
η: learning rate, S: batch size
Three Factors Influencing Minima in SGD, Jastrzebski et al.
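A minimal simulation of this claim (my own sketch under simplified assumptions, not from the paper): SGD on a one-dimensional quadratic with noisy per-sample gradients. The stationary spread of the iterates is governed by the ratio η/S, so scaling the learning rate and the batch size together leaves the noise level roughly unchanged.

import numpy as np

def sgd_stationary_std(eta, S, sigma=1.0, steps=200_000, seed=0):
    # SGD on L(w) = w^2 / 2; each mini-batch gradient is w plus zero-mean noise
    # whose standard deviation shrinks like sigma / sqrt(S).
    rng = np.random.default_rng(seed)
    w, tail = 1.0, []
    for t in range(steps):
        g = w + rng.normal(0.0, sigma / np.sqrt(S))
        w -= eta * g
        if t > steps // 2:              # keep only the stationary phase
            tail.append(w)
    return np.std(tail)

print(sgd_stationary_std(eta=0.01, S=4))   # approx sqrt(eta * sigma^2 / (2*S)) ~ 0.035
print(sgd_stationary_std(eta=0.02, S=8))   # same eta/S ratio -> similar spread
print(sgd_stationary_std(eta=0.04, S=4))   # larger eta/S -> larger spread (~0.07)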
Deep Networks: Three Theory Questions
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization Theory: What is the landscape of the empirical risk?
• Learning Theory: How can deep learning not overfit?
[Tomaso Poggio's lecture in STATS385]