


Understanding Deep Learning in Theory

Hui Jiang

Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Canada

Vector Institute Seminar on April 5, 2019


Outline

1 Introduction of Deep Learning Theory
  - Deep learning overview
  - Deep learning theory review
  - Learning problem formulation

2 Optimization Theory for Deep Learning
  - Learning neural nets in literal space
  - Learning neural nets in canonical space
  - From canonical space back to literal space
  - Why do large neural nets learn like convex optimization?

3 Learning Theory for Deep Learning
  - Why are neural nets not overfitting?
  - Bandlimited functions
  - Perfect learning
  - Asymptotic regularization

4 Conclusions


Deep Learning Overview

Deep learning has achieved tremendous success in practice: speech, vision, text, games, and more. At the same time, deep learning is criticized because it cannot be explained by current machine learning theory.

Open Problems

What is the essence of the success of neural nets?

Why do neural nets overtake other ML models in practice?

Why is it so “easy” to learn neural nets?

Why do neural nets generalize well?

What is the limit of neural nets?


Deep Learning Theory

Theoretical Issues of Neural Nets

1 expressiveness: what is the modeling capacity of neural nets?

solved by the universal approximation theorem [Cyb89]

2 optimization: why does simple gradient descent consistently work?

NP-hard to learn even a small neural net [BR92]
high-dimensional and non-convex to learn large neural nets [AHW95]

3 generalization: why do over-parameterized neural nets generalize?

VC theory gives loose bounds for simple models [Vap00]
and totally fails to explain complex models like neural nets


Problem Formulation

Machine Learning as Stochastic Function Fitting

Inputs $x \in \mathbb{R}^K$ follow a p.d.f. $p(x)$: $x \sim p(x)$

A target function $y = \bar{f}(x)$: a deterministic map from input $x \in \mathbb{R}^K$ to output $y \in \mathbb{R}$

Finite training samples: $\mathcal{D}_T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_T, y_T)\}$, where $x_t \sim p(x)$ and $y_t = \bar{f}(x_t)$ for all $1 \le t \le T$

A model $y = f(x \mid \mathcal{D}_T)$ is learned from a function class $\mathcal{C}$ based on $\mathcal{D}_T$ to minimize a loss measure $l(\bar{f}, f)$ between $\bar{f}$ and $f$ w.r.t. $p(x)$

$l(\cdot)$ is normally convex: mean squared error, cross-entropy, hinge loss, ...


Problem Formulation (cont’d)

Ideally, $f(\cdot)$ should be learned to minimize the expected risk:

Expected Risk

$$R(f \mid \mathcal{D}_T) = \mathbf{E}_{p(x)}\Big[\, l\big(\bar{f}(x), f(x \mid \mathcal{D}_T)\big) \,\Big]$$

Practically, $f(\cdot)$ is learned to minimize the empirical loss:

Empirical Risk

$$R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = \frac{1}{T} \sum_{t=1}^{T} l\big(y_t, f(x_t \mid \mathcal{D}_T)\big)$$

Function class $\mathcal{C} = L^1(U^K)$, where $U^K \triangleq [0,1]^K \subset \mathbb{R}^K$:

$$f \in L^1(U^K) \iff \int \cdots \int_{x \in U^K} |f(x)|\, dx < \infty$$
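To make the two risk definitions concrete, here is a minimal numerical sketch (my own illustration, not material from the talk): it fits a toy model to samples of an assumed 1-D target function, then evaluates the empirical risk on $\mathcal{D}_T$ and a Monte Carlo estimate of the expected risk under $p(x)$, with squared error as the convex loss $l$.

```python
# Illustrative sketch (assumptions: 1-D input on [0, 1], uniform p(x),
# a hypothetical target f_bar, and squared-error loss l).
import numpy as np

rng = np.random.default_rng(0)

def f_bar(x):                        # hypothetical bandlimited target function
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

# Finite training set D_T drawn from p(x) = Uniform[0, 1]
T = 50
x_train = rng.uniform(0.0, 1.0, size=T)
y_train = f_bar(x_train)

# A simple learned model f(x | D_T): a least-squares polynomial fit
coeffs = np.polyfit(x_train, y_train, deg=9)
def f_model(x):
    return np.polyval(coeffs, x)

# Empirical risk: average loss over the T training samples
R_emp = np.mean((y_train - f_model(x_train)) ** 2)

# Expected risk: Monte Carlo estimate of E_{p(x)}[ l(f_bar(x), f(x | D_T)) ]
x_test = rng.uniform(0.0, 1.0, size=100_000)
R_exp = np.mean((f_bar(x_test) - f_model(x_test)) ** 2)

print(f"empirical risk = {R_emp:.6f}, expected risk (MC estimate) = {R_exp:.6f}")
```

With few samples and a flexible model, the empirical risk typically comes out far smaller than the expected risk; that gap is exactly what the rest of the talk analyzes.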


2 Optimization Theory for Deep Learning


Learning as Functional Minimization

Learning is formulated as model-free functional minimization in $L^1(U^K)$:

$$f^* = \arg\min_{f \in L^1(U^K)} Q(f \mid \mathcal{D}_T) = \arg\min_{f \in L^1(U^K)} \sum_{t=1}^{T} l\big(y_t, f(x_t)\big)$$

Functional minimization is very generic. But how do we parameterize the function space $L^1(U^K)$ above?


Literal Model Space: Neural Networks

Literal model space

Use neural networks to represent the function space $L^1(U^K)$

Literal space $\Lambda_M$: the set of all well-structured neural nets with $M$ weights; each neural net is denoted by its weight vector $\mathbf{w} \in \mathbb{R}^M$

Universal approximation theorem [Cyb89]: every $f(x) \in L^1(U^K)$ can be well approximated by at least one $\mathbf{w}$ in $\Lambda_M$.

If inputs and all weights are bounded, every $\mathbf{w} \in \Lambda_M$ represents a function in $L^1(U^K)$, denoted $f_{\mathbf{w}}(x)$.

Lemma 1

If $M$ is sufficiently large, $\lim_{M \to \infty} \Lambda_M \equiv L^1(U^K)$.
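As a concrete picture of a single point $\mathbf{w} \in \Lambda_M$, the sketch below (a toy construction of my own, not code from the talk) builds a one-hidden-layer ReLU network $f_{\mathbf{w}}(x)$ on $U^K$ whose $M$ weights live in one flat vector; the architecture and sizes are assumed purely for illustration.

```python
# Minimal sketch of a literal-space model f_w(x): a one-hidden-layer net whose
# M weights are stored in a single flat vector w (illustrative assumptions only).
import numpy as np

K, H = 2, 16                       # input dimension and hidden width (assumed)
M = H * K + H + H + 1              # total number of weights in w

def f_w(x, w):
    """Evaluate f_w(x) for x in U^K = [0,1]^K, given the flat weight vector w."""
    W1 = w[:H * K].reshape(H, K)                  # hidden-layer weights
    b1 = w[H * K:H * K + H]                       # hidden-layer biases
    W2 = w[H * K + H:H * K + 2 * H]               # output-layer weights
    b2 = w[-1]                                    # output bias
    h = np.maximum(0.0, W1 @ x + b1)              # ReLU hidden activations
    return W2 @ h + b2                            # scalar output y

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=M)                 # one random point in Lambda_M
x = rng.uniform(0.0, 1.0, size=K)                 # one input in U^K
print("f_w(x) =", f_w(x, w))
```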


Learning in Literal Space

Learning neural nets in literal space

$$\mathbf{w}^* = \arg\min_{\mathbf{w} \in \mathbb{R}^M} Q(f_{\mathbf{w}} \mid \mathcal{D}_T) = \arg\min_{\mathbf{w} \in \mathbb{R}^M} \sum_{t=1}^{T} l\big(y_t, f_{\mathbf{w}}(x_t)\big)$$


Canonical Model Space: Fourier Series

Fourier coefficients: every $f(x) \in L^1(U^K)$ maps under $\mathcal{F}$ to $\boldsymbol{\theta} = \{\theta_{\mathbf{k}} \mid \mathbf{k} \in \mathbb{Z}^K\}$:

$$\theta_{\mathbf{k}} = \int \cdots \int_{x \in U^K} f(x)\, e^{-2\pi i\, \mathbf{k} \cdot x}\, dx \qquad (\forall \mathbf{k} \in \mathbb{Z}^K)$$

Fourier series: $f(x) := \mathcal{F}^{-1}(x \mid \boldsymbol{\theta})$:

$$f(x) = \sum_{\mathbf{k} \in \mathbb{Z}^K} \theta_{\mathbf{k}}\, e^{2\pi i\, \mathbf{k} \cdot x}$$

Riemann-Lebesgue lemma: for any $\varepsilon > 0$, truncate to the $N$ most significant terms, indexed by $\mathbb{N}_\varepsilon$, to form a finite partial sum

$$\hat{f}(x) = \sum_{\mathbf{k} \in \mathbb{N}_\varepsilon} \theta_{\mathbf{k}}\, e^{2\pi i\, \mathbf{k} \cdot x}$$

where $\int \cdots \int_{x \in U^K} \|f(x) - \hat{f}(x)\|^2\, dx \le \varepsilon^2$.
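A minimal 1-D sketch of this canonical representation (my own illustration, with an assumed target function): it approximates the leading Fourier coefficients $\theta_k$ on $U^1 = [0,1]$ by numerical integration and evaluates the truncated partial sum $\hat{f}(x)$.

```python
# Sketch: truncated Fourier series on U^1 = [0, 1] (illustrative assumptions:
# a hypothetical target f and numerical integration on a dense grid).
import numpy as np

def f(x):                                    # assumed target function on [0, 1]
    return np.sin(2 * np.pi * x) + 0.3 * np.cos(8 * np.pi * x)

grid = np.linspace(0.0, 1.0, 4096, endpoint=False)
fx = f(grid)

# Fourier coefficients theta_k = integral of f(x) e^{-2 pi i k x} dx (Riemann sum)
ks = np.arange(-10, 11)                      # keep the terms with |k| <= 10
theta = np.array([np.mean(fx * np.exp(-2j * np.pi * k * grid)) for k in ks])

# Truncated partial sum f_hat(x) = sum_k theta_k e^{2 pi i k x}
def f_hat(x):
    return np.real(sum(t * np.exp(2j * np.pi * k * x) for k, t in zip(ks, theta)))

x_test = np.linspace(0.0, 1.0, 1000, endpoint=False)
err2 = np.mean((f(x_test) - f_hat(x_test)) ** 2)   # residual energy after truncation
print(f"mean squared truncation error = {err2:.2e}")
```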


Illustration of Canonical Space of $L^1(U^K)$ (figure omitted)


Canonical Space of L1(UK )

Canonical space $\Theta$: each point is one set of Fourier coefficients $\boldsymbol{\theta} \in \Theta$

Learning in canonical space

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta} \in \Theta} Q(\boldsymbol{\theta} \mid \mathcal{D}_T) = \arg\min_{\boldsymbol{\theta} \in \Theta} \sum_{t=1}^{T} l\big(y_t, \mathcal{F}^{-1}(x_t \mid \boldsymbol{\theta})\big)$$


Learning in Canonical Space

Theorem 2

The objective function $Q(f \mid \mathcal{D}_T)$ is convex in the canonical space. If the dimensionality of the canonical space is not less than the number of training samples in $\mathcal{D}_T$, the global minimum achieves zero loss.

Proof sketch:

$Q(f \mid \mathcal{D}_T)$ is represented in the canonical space as

$$Q(\boldsymbol{\theta} \mid \mathcal{D}_T) = \sum_{t=1}^{T} l\big(y_t, \mathcal{F}^{-1}(x_t \mid \boldsymbol{\theta})\big) = \sum_{t=1}^{T} l\Big(y_t, \sum_{\mathbf{k} \in \mathbb{N}_\varepsilon} \theta_{\mathbf{k}} \cdot e^{2\pi i\, \mathbf{k} \cdot x_t}\Big)$$

Each $\mathcal{F}^{-1}(x_t \mid \boldsymbol{\theta})$ is linear in $\boldsymbol{\theta}$ and $l$ is convex, so $Q(\boldsymbol{\theta} \mid \mathcal{D}_T)$ is convex in $\boldsymbol{\theta}$.

$N$ unknown coefficients: $\boldsymbol{\theta} = \{\theta_{\mathbf{k}} \mid \mathbf{k} \in \mathbb{N}_\varepsilon\}$

$T$ linear equations: $y_t = \hat{f}(x_t) = \sum_{\mathbf{k} \in \mathbb{N}_\varepsilon} \theta_{\mathbf{k}} \cdot e^{2\pi i\, \mathbf{k} \cdot x_t}$ $\quad (1 \le t \le T)$

When $N \ge T$, this linear system is solvable, so the global minimum attains zero loss.
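A small numerical check of this argument (an illustration under assumed 1-D settings, not the talk's code): with $N \ge T$ Fourier terms and squared-error loss, the canonical-space fit reduces to solving $T$ linear equations in $N$ unknowns and reaches numerically zero empirical loss.

```python
# Sketch: zero-loss fit in canonical space when N >= T (1-D, illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
T = 20                                        # number of training samples
ks = np.arange(-12, 13)                       # N = 25 >= T Fourier frequencies

x_train = rng.uniform(0.0, 1.0, size=T)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * np.cos(6 * np.pi * x_train)  # assumed targets

# T x N design matrix A[t, n] = e^{2 pi i k_n x_t}; the model is f_hat(x_t) = A @ theta
A = np.exp(2j * np.pi * np.outer(x_train, ks))

# Least-squares solution of the T linear equations y_t = sum_k theta_k e^{2 pi i k x_t}
theta, *_ = np.linalg.lstsq(A, y_train.astype(complex), rcond=None)

residual = np.real(A @ theta) - y_train
print(f"empirical squared loss = {np.sum(residual ** 2):.2e}")   # ~0 up to round-off
```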


From Canonical Space back to Literal Space

Model spaces: $\mathbf{w} \;\Rightarrow\; f_{\mathbf{w}}(x) \;\overset{\mathcal{F}}{\Longrightarrow}\; \boldsymbol{\theta} \;\Longrightarrow\; \mathcal{F}^{-1}(x \mid \boldsymbol{\theta}) := f_{\boldsymbol{\theta}}(x)$

The objective function in the two spaces:

$$Q(f_{\mathbf{w}} \mid \mathcal{D}_T) = Q(f_{\boldsymbol{\theta}} \mid \mathcal{D}_T)$$

The chain rule:

$$\nabla_{\mathbf{w}} Q(f_{\mathbf{w}} \mid \mathcal{D}_T) = \nabla_{\boldsymbol{\theta}} Q(f_{\boldsymbol{\theta}} \mid \mathcal{D}_T)\, \nabla_{\mathbf{w}} \boldsymbol{\theta} = \nabla_{\boldsymbol{\theta}} Q(f_{\boldsymbol{\theta}} \mid \mathcal{D}_T)\, \nabla_{\mathbf{w}} \mathcal{F}\big(f_{\mathbf{w}}(x)\big)$$
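Spelled out component-wise (a step added here for clarity, using the fact that $\theta_{\mathbf{k}} = \mathcal{F}(f_{\mathbf{w}})_{\mathbf{k}}$, so that $\partial \theta_{\mathbf{k}} / \partial w_m$ is the $\mathbf{k}$-th Fourier coefficient of $\partial f_{\mathbf{w}} / \partial w_m$), the chain rule reads

$$\frac{\partial Q}{\partial w_m} = \sum_{\mathbf{k} \in \mathbb{N}_\varepsilon} \frac{\partial Q}{\partial \theta_{\mathbf{k}}}\, \frac{\partial \theta_{\mathbf{k}}}{\partial w_m} = \sum_{\mathbf{k} \in \mathbb{N}_\varepsilon} \mathcal{F}\Big(\frac{\partial f_{\mathbf{w}}(x)}{\partial w_m}\Big)_{\mathbf{k}}\, \frac{\partial Q}{\partial \theta_{\mathbf{k}}}, \qquad m = 1, \dots, M,$$

which stacks row by row into the matrix relation with the disparity matrix $H(\mathbf{w})$ introduced next.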


From Canonical Space back to Literal Space (cont’d)

Disparity Matrix

$$\underbrace{\begin{bmatrix} \frac{\partial Q}{\partial w_1} \\ \vdots \\ \frac{\partial Q}{\partial w_m} \\ \vdots \\ \frac{\partial Q}{\partial w_M} \end{bmatrix}}_{M \times 1} = \underbrace{\begin{bmatrix} \mathcal{F}\big(\frac{\partial f_{\mathbf{w}}(x)}{\partial w_1}\big) \\ \vdots \\ \mathcal{F}\big(\frac{\partial f_{\mathbf{w}}(x)}{\partial w_m}\big) \\ \vdots \\ \mathcal{F}\big(\frac{\partial f_{\mathbf{w}}(x)}{\partial w_M}\big) \end{bmatrix}}_{\text{disparity matrix } H(\mathbf{w}),\ M \times N} \underbrace{\begin{bmatrix} \frac{\partial Q}{\partial \theta_1} \\ \vdots \\ \frac{\partial Q}{\partial \theta_n} \\ \vdots \\ \frac{\partial Q}{\partial \theta_N} \end{bmatrix}}_{N \times 1}$$

Gradients in the literal and canonical spaces are related via a pointwise linear transformation:

$$[\nabla_{\mathbf{w}} Q]_{M \times 1} = [H(\mathbf{w})]_{M \times N}\, [\nabla_{\boldsymbol{\theta}} Q]_{N \times 1}$$


From Canonical Space back to Literal Space (cont’d)

Lemma 3

Assume the neural network is sufficiently large ($M \ge N$). If $\mathbf{w}^*$ is a stationary point of $Q(f_{\mathbf{w}})$ and $H(\mathbf{w}^*)$ has full rank, then $\mathbf{w}^*$ is a global minimum.

Lemma 4

If $\mathbf{w}^{(0)}$ is a stationary point of $Q(f_{\mathbf{w}})$ and $H(\mathbf{w}^{(0)})$ does not have full rank at $\mathbf{w}^{(0)}$, then $\mathbf{w}^{(0)}$ may be a local minimum, a saddle point, or a global minimum.


Why does learning of large-scale neural networks behave like convex optimization?

Stochastic Gradient Descent (SGD)

randomly initialize $\mathbf{w}^{(0)}$, set $k = 0$
for epoch = 1 to L do
    for each minibatch in training set $\mathcal{D}_T$ do
        $\mathbf{w}^{(k+1)} \leftarrow \mathbf{w}^{(k)} - h_k \nabla_{\mathbf{w}} Q(\mathbf{w}^{(k)})$
        $k \leftarrow k + 1$
    end for
end for

Theorem 5

If $M \ge N$, and the initial $\mathbf{w}^{(0)}$ and step sizes $h_k$ are chosen so as to ensure that $H(\mathbf{w})$ maintains full rank at every $\mathbf{w}^{(k)}$, then SGD/GD surely converges to a global minimum of zero loss, just as in convex optimization.
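A runnable toy version of this loop (my own sketch with assumed data, architecture, and hyperparameters; it is not code from the talk), doing minibatch SGD on the squared-error loss of a small over-parameterized one-hidden-layer net:

```python
# Toy minibatch SGD on an over-parameterized one-hidden-layer net (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
T, H, L, lr = 32, 256, 2000, 0.05             # samples, hidden width, epochs, step size

x = rng.uniform(0.0, 1.0, size=(T, 1))
y = np.sin(2 * np.pi * x)                     # assumed bandlimited target values

W1 = rng.normal(scale=1.0, size=(1, H))
b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, 1))
b2 = np.zeros(1)

for epoch in range(L):
    for idx in np.array_split(rng.permutation(T), 4):   # minibatches
        xb, yb = x[idx], y[idx]
        h = np.maximum(0.0, xb @ W1 + b1)                # forward pass (ReLU)
        pred = h @ W2 + b2
        g = 2.0 * (pred - yb) / len(idx)                 # d(loss)/d(pred)
        gh = (g @ W2.T) * (h > 0)                        # backprop through ReLU
        W2 -= lr * h.T @ g                               # gradient steps: w <- w - lr * grad
        b2 -= lr * g.sum(axis=0)
        W1 -= lr * xb.T @ gh
        b1 -= lr * gh.sum(axis=0)

loss = np.mean((np.maximum(0.0, x @ W1 + b1) @ W2 + b2 - y) ** 2)
print(f"final empirical loss = {loss:.2e}")   # should shrink toward zero in this toy setup
```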


Why does learning of large-scale neural networks behave like convex optimization? (cont’d)

When does $H(\mathbf{w})$ degenerate?

dead neurons: zero rows

duplicated neurons: linearly dependent rows

if $M \gg N$, $H(\mathbf{w})$ becomes singular only after at least $M - N$ neurons are dead or duplicated.

Corollary 6

If an over-parameterized neural network is randomly initialized, SGD/GD converges to a global minimum of zero loss in probability.


3 Learning Theory for Deep Learning


Why are over-parameterized neural networks not overfitting?

Current machine learning theory

VC generalization bound for classification [Vap00]:

$$R(f_{\mathbf{w}} \mid \mathcal{D}_T) \le R_{\mathrm{emp}}(f_{\mathbf{w}} \mid \mathcal{D}_T) + \sqrt{\frac{8 d_M \big(\ln \frac{2T}{d_M} + 1\big) + 8 \ln \frac{4}{\delta}}{T}}$$

Error bound for NNs [Bar94]: approximation + estimation errors

$$R(f_{\mathbf{w}} \mid \mathcal{D}_T) \le O\Big(\frac{C_f^2}{M}\Big) + O\Big(\frac{M \cdot K}{T} \log T\Big)$$

Presumably, over-parameterized models will fail due to overfitting:

When $M \to \infty$, $R_{\mathrm{emp}}(f_{\mathbf{w}} \mid \mathcal{D}_T) = 0$, but these bounds predict that $R(f_{\mathbf{w}} \mid \mathcal{D}_T)$ will diverge.
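As a quick numeric illustration of why such bounds become vacuous for big models (my own arithmetic with assumed values of $d_M$, $T$, and $\delta$; not a computation from the talk), plugging numbers into the VC bound's complexity term shows it growing past 1, which says nothing useful about a risk that lives in $[0, 1]$:

```python
# Evaluate the VC complexity term sqrt((8 d (ln(2T/d) + 1) + 8 ln(4/delta)) / T)
# for assumed values of the VC dimension d, sample size T, and confidence delta.
import math

def vc_complexity_term(d, T, delta=0.05):
    return math.sqrt((8 * d * (math.log(2 * T / d) + 1) + 8 * math.log(4 / delta)) / T)

T = 50_000                                    # a moderately large training set (assumed)
for d in (100, 10_000, 50_000):               # VC dimension grows with model size M
    print(f"d_M = {d:>7,}  ->  bound term = {vc_complexity_term(d, T):7.3f}")
# Once d_M is comparable to T, the term exceeds 1, so the bound becomes vacuous.
```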


Learning Problem Formulation (review)

Inputs $x \in \mathbb{R}^K$ follow a p.d.f. $p(x)$: $x \sim p(x)$

A target function $y = \bar{f}(x)$: from input $x \in \mathbb{R}^K$ to output $y \in \mathbb{R}$

$T$ training samples: $\mathcal{D}_T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_T, y_T)\}$, where $x_t \sim p(x)$ and $y_t = \bar{f}(x_t)$ for all $1 \le t \le T$

A model $y = f(x \mid \mathcal{D}_T)$ is learned from $\mathcal{D}_T$ to minimize the loss

Empirical Risk

$$R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = \frac{1}{T} \sum_{t=1}^{T} l\big(y_t, f(x_t \mid \mathcal{D}_T)\big)$$

Expected Risk

$$R(f \mid \mathcal{D}_T) = \mathbf{E}_{p(x)}\Big[\, l\big(\bar{f}(x), f(x \mid \mathcal{D}_T)\big) \,\Big]$$


Bandlimitedness

For real-world applications, target functions must be bandlimited.

Non-bandlimited processes must be driven by unlimited power.

Real-world data are always generated from bandlimited processes.

Definition 7 (strictly bandlimited)

$$F(\omega) = \int \cdots \int_{-\infty}^{+\infty} f(x)\, e^{-i\, x \cdot \omega}\, dx = 0 \quad \text{if } \|\omega\| > B$$

Definition 8 (approximately bandlimited)

For any $\varepsilon > 0$, there exists $B_\varepsilon > 0$ such that the out-of-band residual energy satisfies

$$\int \cdots \int_{\|\omega\| > B_\varepsilon} \|F(\omega)\|^2\, d\omega < \varepsilon^2$$
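A 1-D numerical sketch of checking these definitions (my own illustration with an assumed test signal, using the discrete Fourier transform as a stand-in for the continuous $F(\omega)$): it measures the fraction of spectral energy falling outside a band $B$.

```python
# Sketch: estimate out-of-band energy of a sampled 1-D signal via the FFT
# (illustrative stand-in for the continuous transform F(omega)).
import numpy as np

n, fs = 4096, 4096.0                      # samples and sampling rate (assumed)
t = np.arange(n) / fs

# Assumed test signal: a few low-frequency tones plus a faint high-frequency tail
f = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t) \
    + 0.01 * np.sin(2 * np.pi * 800 * t)

spectrum = np.fft.rfft(f)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)    # frequency (Hz) of each bin

def out_of_band_fraction(B):
    energy = np.abs(spectrum) ** 2
    return energy[freqs > B].sum() / energy.sum()

for B in (10, 50, 1000):
    print(f"B = {B:5d}  ->  out-of-band energy fraction = {out_of_band_fraction(B):.2e}")
# A strictly bandlimited signal has (numerically) zero energy beyond B;
# an approximately bandlimited one has a small residual that can be made < eps^2.
```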


Illustration of Bandlimited Fourier Spectrum

Strictly bandlimited $F(\omega)$: (figure omitted)

Approximately bandlimited $F(\omega)$: (figure omitted)


Illustration of Bandlimited Functions

(Four panels, each plotting output $y$ against input $x$; figures omitted.)

No Band-limiting: a non-bandlimited function

Weakly Band-limiting: a strictly bandlimited function with a large $B$

Strongly Band-limiting: a strictly bandlimited function with a small $B$

Approximately Band-limiting: an approximately bandlimited function


Perfect Learning

Definition 9 (perfect learning)

Learn a model from a finite set of training samples to achieve not only zero empirical risk but also zero expected risk.

Theorem 10 (existence of perfect learning)

If the target function $\bar{f}(x)$ is strictly or approximately bandlimited, there exists a method to learn a model $f(x \mid \mathcal{D}_T)$ from $\mathcal{D}_T$ that not only leads to zero empirical risk, $R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = 0$, but also yields zero expected risk in probability: $R(f \mid \mathcal{D}_T) \xrightarrow{P} 0$ as $T \to \infty$.

Proof sketch: like a stochastic version of the multidimensional sampling theorem [PM62; Me01].


Perfect Learning

Corollary 11

If $\bar{f}(x) \cdot p(x)$ is neither strictly nor approximately bandlimited, then no matter how many training samples are used, $R(f \mid \mathcal{D}_T)$ of every realizable learning algorithm has a nonzero lower bound: $\lim_{T \to \infty} R(f \mid \mathcal{D}_T) \ge \varepsilon > 0$.

Therefore, we conclude:

Perfect Learning

target function f̄ (x) is bandlimited ⇐⇒ perfect learning is feasible


Non-asymptotic Analysis of Perfect Learning

When $T$ is finite, performance is measured by the mean expected risk:

$$\mathcal{R}_T = \mathbf{E}_{\mathcal{D}_T}\Big[\, \mathbf{E}_{p(x)}\big[\, \| f(x \mid \mathcal{D}_T) - \bar{f}(x) \|^2 \,\big] \,\Big]$$

Theorem 12 (error bound of perfect learning)

If $x \sim p(x)$ within a hypercube $[-U, U]^K \subset \mathbb{R}^K$ and the target function $\bar{f}(x)$ is bandlimited by $B$, the perfect learner is upper-bounded as

$$\mathcal{R}^*_T < \left[ \frac{(KBU)^{n+1} \cdot H}{(n+1)!} \right]^2$$

where $n \simeq O(T^{1/K})$ and $H = \sup_x |\bar{f}(x)|$.

This error bound is independent of model complexity M

When T is small, difficulty of learning is quantified by KBU
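To see how quickly this bound shrinks with $T$, here is a back-of-the-envelope evaluation (my own arithmetic with assumed values for $K$, $B$, $U$, and $H$, and taking $n \approx T^{1/K}$ as stated above):

```python
# Evaluate the Theorem-12 style bound [ (KBU)^(n+1) * H / (n+1)! ]^2 with n ~ T^(1/K)
# for assumed constants; this only illustrates how fast the bound decays with T.
import math

K, B, U, H = 2, 3.0, 1.0, 1.0          # assumed dimensionality, band limit, range, sup|f|

def bound(T):
    n = int(round(T ** (1.0 / K)))     # n grows like O(T^(1/K))
    return (((K * B * U) ** (n + 1)) * H / math.factorial(n + 1)) ** 2

for T in (25, 100, 400, 1600):
    n = int(round(T ** (1.0 / K)))
    print(f"T = {T:5d}  (n = {n:3d})  bound < {bound(T):.3e}")
# Once n + 1 exceeds K*B*U, the factorial dominates and the bound collapses toward 0,
# independently of the model size M.
```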


Perfect Learning in Practice

Theorem 13 (perfect learning in practice)

Assume the target function $\bar{f}(x)$ is strictly or approximately bandlimited, and a strictly or approximately bandlimited model $f(x)$ is learned from a sufficiently large training set $\mathcal{D}_T$. If this model yields zero empirical risk on $\mathcal{D}_T$:

$$R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = 0,$$

then it is guaranteed to yield zero expected risk:

$$R(f \mid \mathcal{D}_T) \longrightarrow 0 \quad \text{as } T \to \infty.$$


Proof sketch of Theorem 13

Any bandlimited function may be represented as a series of Fourier basis functions with decaying coefficients:

target function:

$$\bar{f}(x) = \cdots + \eta_{\mathbf{k}_{-1}} e^{2\pi i\, \mathbf{k}_{-1} \cdot x} + \eta_{\mathbf{k}_0} e^{2\pi i\, \mathbf{k}_0 \cdot x} + \eta_{\mathbf{k}_1} e^{2\pi i\, \mathbf{k}_1 \cdot x} + \cdots$$

model to be learned:

$$f(x) = \cdots + \theta_{\mathbf{k}_{-1}} e^{2\pi i\, \mathbf{k}_{-1} \cdot x} + \theta_{\mathbf{k}_0} e^{2\pi i\, \mathbf{k}_0 \cdot x} + \theta_{\mathbf{k}_1} e^{2\pi i\, \mathbf{k}_1 \cdot x} + \cdots$$

training samples: $\mathcal{D}_T = \{(x_t, y_t) \mid 1 \le t \le T\}$

$$y_t = \bar{f}(x_t), \qquad y_t = f(x_t), \qquad t = 1, \cdots, T$$


Proof sketch of Theorem 13 (cont’d)

$T$ linear equations:

$$\bar{f}(x_t) - f(x_t) = 0, \qquad t = 1, \cdots, T$$

target function and the model (each truncated to its $T$ most significant terms):

$$\bar{f}(x) = \cdots + \underbrace{\eta_{\mathbf{k}_{-1}} e^{2\pi i\, \mathbf{k}_{-1} \cdot x} + \eta_{\mathbf{k}_0} e^{2\pi i\, \mathbf{k}_0 \cdot x} + \eta_{\mathbf{k}_1} e^{2\pi i\, \mathbf{k}_1 \cdot x} + \cdots}_{T \text{ most significant terms}} \cdots$$

$$f(x) = \cdots + \underbrace{\theta_{\mathbf{k}_{-1}} e^{2\pi i\, \mathbf{k}_{-1} \cdot x} + \theta_{\mathbf{k}_0} e^{2\pi i\, \mathbf{k}_0 \cdot x} + \theta_{\mathbf{k}_1} e^{2\pi i\, \mathbf{k}_1 \cdot x} + \cdots}_{T \text{ most significant terms}} \cdots$$

determine the $T$ coefficients to good precision: $\theta_{\mathbf{k}} \to \eta_{\mathbf{k}}$

as $T \to \infty$, $f(x) \to \bar{f}(x)$. $\square$


Machine Learning Models vs. Bandlimitedness

All machine learning models are approximately bandlimited under some minor conditions:

1 input x is bounded

2 model size is finite

3 all model parameters are bounded

4 model is piecewise continuous

PAC-learnable models are bandlimited

All PAC-learnable models are approximately bandlimited, including linear models, logistic regression, statistical models, neural networks, ...


Why Neural Nets Generalize: Asymptotic Regularization

Corollary 14

Assume a neural network $f_{\mathbf{w}}(x)$ is learned from a sufficiently large training set $\mathcal{D}_T$ generated by a bandlimited process $\bar{f}(x)$. If $f_{\mathbf{w}}(x)$ yields zero empirical risk on $\mathcal{D}_T$:

$$R_{\mathrm{emp}}(f_{\mathbf{w}} \mid \mathcal{D}_T) = 0,$$

then it surely yields zero expected risk as $T \to \infty$:

$$\lim_{T \to \infty} R(f_{\mathbf{w}} \mid \mathcal{D}_T) = 0.$$

Definition 15 (asymptotic regularization)

Due to the bandlimitedness property, a neural network asymptotically regularizes itself as $T \to \infty$.



Conclusions

Deep Learning Theory: Neural Nets

1 expressiveness: neural nets ⇔ universal approximators in $L^1(U^K)$

2 optimization: learning large neural nets is unexpectedly “simple”
  - behaves like convex optimization, conditional on $H(\mathbf{w})$
  - over-parameterized neural nets are complete in $L^1(U^K)$

3 generalization: asymptotic regularization
  - real-world data are generated by bandlimited processes
  - neural nets are approximately bandlimited
  - neural nets self-regularize on sufficiently large training sets


Conclusions

Under big data + big models, neural nets solve supervised learning problems in the statistical sense:

$$\lim_{T \to \infty} R(\hat{f}_{\mathbf{w}} \mid \mathcal{D}_T) = 0.$$

1 collect sufficient training samples (determined by $KBU$)
2 fit over-parameterized models (neural nets) to them

However, adversarial attacks are new and different ...

all neural nets are approximately bandlimited


References

More details and all proofs are found in:

Hui Jiang, “A New Perspective on Machine Learning: How to do Perfect Supervised Learning”, preprint arXiv:1901.02046, 2019.

Hui Jiang, “Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization”, preprint arXiv:1903.02140, 2019.

Thank You!

(Q&A)


References:

[AHW95] P. Auer, M. Herbster, and M. K. Warmuth. “Exponentially many local minima for single neurons”. In: Advances in Neural Information Processing Systems 8 (NIPS), 1995.

[Bar94] A. R. Barron. “Approximation and Estimation Bounds for Artificial Neural Networks”. In: Machine Learning 14 (1994), pp. 115–131.

[BR92] A. L. Blum and R. L. Rivest. “Training a 3-node neural network is NP-complete”. In: Neural Networks 5.1 (1992), pp. 117–127.

[Cyb89] G. Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of Control, Signals, and Systems 2 (1989), pp. 303–314.


[Me01] F. A. Marvasti et al. Nonuniform Sampling: Theory and Practice. Kluwer Academic / Plenum Publishers, 2001.

[PM62] D. P. Petersen and D. Middleton. “Sampling and Reconstruction of Wave-Number-Limited Functions in N-Dimensional Euclidean Spaces”. In: Information and Control 5 (1962), pp. 279–323.

[Vap00] V. N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 2000.
