
Generalization in Deep Learning
An Overview of Generalization in Deep Learning

Enes Dilber

Middle East Technical University

ImageLab Seminars, Spring 17


Outline

1 Introduction

2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)

3 Generalization in Deep Learning - Kawaguchi et al. (2017)

4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)


Difference between Memorization and Generalization

Memorizing given facts is a straightforward task in learning: it can be done by storing the input samples explicitly.

The ability to identify the underlying rules, i.e. to generalize, allows the system to make predictions on unseen data.


How well can neural networks generalize?

Cybenko (1989) proved that a feed-forward network (multilayer perceptron) with a single hidden layer of sigmoid units can approximate continuous functions on compact subsets of ℝⁿ.

Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.
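As a minimal sketch of this trade-off (a toy example of my own, not from the talk), the snippet below adds an L2 penalty to a least-squares loss; increasing the penalty weight lam shrinks the weights and raises the training error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # 50 samples, 10 features
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=50)    # noisy targets

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    train_err = np.mean((X @ w - y) ** 2)
    print(f"lambda={lam:5.1f}  train MSE={train_err:.3f}  ||w||={np.linalg.norm(w):.2f}")
```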


Neural Networks Optimization Scheme


Understanding deep learning requires rethinking generalization - Zhang et al. (2016)

State-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data.

This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if they replace the true images by completely unstructured random noise.

Consequently, these models are in principle rich enough to memorize the training data. This situation poses a conceptual challenge to statistical learning theory, as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks.
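A toy version of the random-label experiment (my own sketch with scikit-learn's MLPClassifier, not the authors' convolutional setup): a sufficiently large network driven to low training loss can fit labels that carry no information, so training accuracy alone says nothing about generalization.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # unstructured random "images"
y = rng.integers(0, 2, size=200)      # completely random labels

# A wide two-hidden-layer MLP with no weight decay, trained long enough to interpolate.
net = MLPClassifier(hidden_layer_sizes=(512, 512), alpha=0.0,
                    max_iter=5000, tol=1e-6, random_state=0)
net.fit(X, y)
print("train accuracy on random labels:", net.score(X, y))   # expected to be (near) 1.0
```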


On Large-Batch Training For Deep Learning: Generalization Gap And Sharp Minima - Keskar et al. (2016)

Large-batch minimizers are characterized by a significant number of large positive eigenvalues of ∇²f(x), and such sharp (high-curvature) minima lead to poor generalization.

They claim that SGD with small batches finds wide minima, and that these generalize well.

To use large batches, they suggest warm-starting the weights.


Outline

1 Introduction

2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
   Definition of flatness/sharpness
   Deep Rectified Networks
   Reparametrizations

3 Generalization in Deep Learning - Kawaguchi et al. (2017)

4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)


Definition of flatness/sharpness

Define C(L, θ, ε) as the largest connected set containing θ such that ∀θ′ ∈ C(L, θ, ε), L(θ′) < L(θ) + ε. The ε-flatness is then defined as the volume of C(L, θ, ε) (the purple line in the original figure).
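A one-dimensional numerical sketch of this definition (toy loss functions of my own choosing, not from the paper): in 1-D the ε-flatness reduces to the length of the largest interval around the minimum on which the loss stays within ε of its value.

```python
import numpy as np

def eps_flatness_1d(L, theta, eps, lo=-10.0, hi=10.0, n=200001):
    """Length of the largest connected interval containing theta
    on which L(theta') < L(theta) + eps (the 1-D epsilon-flatness)."""
    grid = np.linspace(lo, hi, n)
    inside = L(grid) < L(theta) + eps
    i0 = np.argmin(np.abs(grid - theta))
    left = i0
    while left > 0 and inside[left - 1]:     # walk left while inside the sublevel set
        left -= 1
    right = i0
    while right < n - 1 and inside[right + 1]:   # walk right while inside
        right += 1
    return grid[right] - grid[left]

sharp = lambda t: 50.0 * t ** 2              # sharp minimum at 0
flat  = lambda t: 0.5 * (t - 4.0) ** 2       # flat minimum at 4
print("sharp minimum flatness:", eps_flatness_1d(sharp, 0.0, eps=0.1))
print("flat  minimum flatness:", eps_flatness_1d(flat, 4.0, eps=0.1))
```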


Deep Rectified Networks

Most of the results are stated for deep rectified feedforward networks with a linear output layer, though they can easily be extended to other architectures (e.g. convolutional networks).

y = \Phi_{\mathrm{rect}}\big(\Phi_{\mathrm{rect}}\big(\cdots\Phi_{\mathrm{rect}}(x\,\theta_1)\cdots\big)\theta_{K-1}\big)\,\theta_K
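Written out as a minimal NumPy forward pass (my own sketch; biases are omitted, matching the slide's notation):

```python
import numpy as np

def phi_rect(z):
    """Rectifier nonlinearity, applied elementwise."""
    return np.maximum(z, 0.0)

def deep_rectified_net(x, thetas):
    """y = phi_rect(... phi_rect(x @ theta_1) ... @ theta_{K-1}) @ theta_K
    (ReLU hidden layers, linear output layer, no biases)."""
    h = x
    for theta in thetas[:-1]:
        h = phi_rect(h @ theta)
    return h @ thetas[-1]

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 3))]
x = rng.normal(size=(4, 5))                   # batch of 4 inputs
print(deep_rectified_net(x, thetas).shape)    # (4, 3)
```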


Properties of Deep Rectified Networks

A unit change in the behavior of the model is not constant over parameter space.

Specifically, each line of a given color corresponds to the parameter assignments (θ1, θ2) that result observationally in the same prediction function f_θ.

Without changing the function value, flatness can be changed by moving along the curve.


Non-negative Homogeneity

Non-negative homogeneity: under some conditions, a global (nonconvex) optimization problem with quadratic data is equivalent to a convex minimization problem (Lasserre et al., 2002).

Definition: A given function φ is non-negative homogeneous if

\forall (z, a) \in \mathbb{R} \times \mathbb{R}^{+}, \quad \phi(az) = a\,\phi(z)


Non-negative Homogeneity

For a rectified unit this means:

\Phi_{\mathrm{rect}}\big(x\,(a\theta_1)\big)\theta_2 = \Phi_{\mathrm{rect}}(x\,\theta_1)\,(a\theta_2)

For this one-hidden-layer neural network, the parameters (aθ1, θ2) are observationally equivalent to (θ1, aθ2). The same observational equivalence holds for convolutional layers.
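A quick numerical check of this observational equivalence (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 5))
theta1 = rng.normal(size=(5, 8))
theta2 = rng.normal(size=(8, 3))
a = 7.3                                    # any positive scalar

relu = lambda z: np.maximum(z, 0.0)
left  = relu(x @ (a * theta1)) @ theta2    # scale the first layer
right = relu(x @ theta1) @ (a * theta2)    # scale the second layer instead
print(np.allclose(left, right))            # True: (a*theta1, theta2) ~ (theta1, a*theta2)
```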


Some Common Non-negative Homogeneous Layers

Fully Connected + ReLU

Convolution + ReLU

Max Pooling

Linear Layers

Mean Pooling

Maxout

Not sigmoid (the sigmoid does not satisfy σ(az) = aσ(z))


Model Reparametrization

Definition: If we are allowed to change the parametrization of some function f, we can obtain arbitrarily different geometries without affecting how the function evaluates on unseen data. The same holds for reparametrizations of the input space.

The problem with relating the flatness of minima to their probable generalization is that the choice of parametrization, and its associated geometry, is arbitrary.

Since we are interested in finding a prediction function in a given family of functions, no reparametrization of this family should influence the generalization of any of these functions. Given a bijection g onto θ, we can define the transformed parameter κ = g⁻¹(θ). Since θ and κ represent the same prediction function in different spaces, they should generalize equally well.
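To make this concrete, here is a small numerical sketch (my own construction, using the (aθ1, θ2) ∼ (θ1, aθ2) equivalence from the non-negative homogeneity slide): rescaling the two layers of a one-hidden-layer ReLU network leaves every prediction unchanged, while the Hessian at the very same minimum becomes arbitrarily sharp.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 1))
y = np.abs(x)                                  # target |x|, exactly representable below
relu = lambda z: np.maximum(z, 0.0)

def loss(p):
    th1, th2 = p[:2].reshape(1, 2), p[2:].reshape(2, 1)
    return np.mean((relu(x @ th1) @ th2 - y) ** 2)

def num_hessian(f, p, h=1e-4):
    """Finite-difference Hessian of f at p."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pij, pi, pj = p.copy(), p.copy(), p.copy()
            pij[i] += h; pij[j] += h
            pi[i] += h; pj[j] += h
            H[i, j] = (f(pij) - f(pi) - f(pj) + f(p)) / h ** 2
    return H

theta = np.array([1.0, -1.0, 1.0, 1.0])        # relu(x) + relu(-x) = |x|: a global minimum
alpha = 100.0
theta_alpha = np.array([alpha, -alpha, 1 / alpha, 1 / alpha])   # (a*theta1, theta2/a): same function

print("losses:", loss(theta), loss(theta_alpha))                # both ~0: identical predictions
print("top Hessian eigenvalue, original:",
      np.linalg.eigvalsh(num_hessian(loss, theta))[-1])
print("top Hessian eigenvalue, rescaled:",
      np.linalg.eigvalsh(num_hessian(loss, theta_alpha))[-1])   # orders of magnitude sharper
```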


Exploiting Loss Function

A one-dimensional example of how much the geometry of the loss function depends on the parameter space chosen.

They carefully used parametrization techniques such as weight normalization and batch normalization.


Discussion

In earlier work, it has been observed empirically that minima found by standard deep learning algorithms that generalize well tend to be flatter than minima that do not generalize.

The whole geometry of the error surface with respect to the parameters can be changed arbitrarily under different parametrizations.

The concept of a flat or sharp minimum cannot be divorced from the particular parametrization of the model or of the input space.


Outline

1 Introduction

2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)

3 Generalization in Deep Learning - Kawaguchi et al. (2017)
   Generalization Theory
   Understanding Generalization in the Practical Paradigm
   Theoretical Insight for Practical Usage

4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)


Expected Risk

Definition

R[f]: expected risk; R_m[f]: empirical risk.

R[f] = \mathbb{E}_{(x,y)\sim p(x,y)}\big[\mathrm{loss}(f(x), y)\big]

R_m[f] = \frac{1}{m}\sum_{i=1}^{m} \mathrm{loss}(f(x_i), y_i)

The aim in machine learning is to minimize the expected risk by minimizing the computable empirical risk. The goal of generalization theory is to examine when and why this approach is sensible.
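A toy illustration (my own, not from the paper): for a fixed classifier f, the empirical risk R_m[f] is a sample average that approaches the expected risk R[f] as m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: (x > 0).astype(float)                   # a fixed classifier
zero_one = lambda pred, y: (pred != y).astype(float)  # 0-1 loss

def sample(m):
    """Toy joint distribution p(x, y): y = 1{x + noise > 0}."""
    x = rng.normal(size=m)
    y = (x + 0.5 * rng.normal(size=m) > 0).astype(float)
    return x, y

# Empirical risk for growing m approaches a large-sample estimate of the expected risk.
for m in (10, 100, 10000, 1000000):
    x, y = sample(m)
    print(f"m={m:>7}:  R_m[f] = {zero_one(f(x), y).mean():.4f}")
```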


Generalization Gap

Definition

Generalization gap ≜ R[f_{A(S_m)}] − R_m[f_{A(S_m)}], where f_{A(S_m)} : X → Y is the model learned by algorithm A from a training dataset S_m.

The generalization gap is not computable: it is the difference between the error on the training set and the error under the underlying joint probability distribution.

The aim of many problems in statistical learning theory is to bound or characterize the generalization error in probability.


Apparent Paradox

Zhang et al. (2017a) empirically showed that several deep model classes can memorize random labels, while having the ability to produce zero training error and small test errors on particular natural datasets. This phenomenon even holds for all ML algorithms.

For any model class F whose model complexity is large enough to memorize any dataset, and which includes f_ε possibly at an arbitrarily sharp minimum, there exist learning algorithms A such that the generalization gap of f_{A(S_m)} is at most ε.

There exist arbitrarily unstable and arbitrarily non-robust algorithms A such that the generalization gap of f_{A(S_m)} is at most ε.

Current generalization theory states that how we obtain f, and the set from which we choose f, are all that matters.


Generalization Puzzle

He suggests that the properties we use to explain generalization, such as flat minima, small complexity, and robustness, imply a small generalization gap; however, a small generalization gap does not imply flat minima, small complexity, or robustness.

p: the model is stable, robust, and has flat minima.
q: the model generalizes well.
p implies q, but q does not imply p.

In the space of f ∈ F there exist learning mechanisms that are unstable and non-robust but nevertheless generalize well.


Practical Role of Generalization Theory

1 Provide guarantees on expected risk.

2 Guarantee the generalization gap

   1 to be small for a given fixed S_m, and/or
   2 to approach zero with a fixed model class as m increases.

3 Provide theoretical insights to guide the search over model classes.


Training-Validation Paradigm

Search over model classes by changing the architecture to obtain low validation loss.

We can generalize well because we can obtain a good model via model search guided by validation errors. In the deep learning community there is a cumulative body of traditional knowledge about what generalizes.

Provide theoretical insights to guide the search over model classes.


Upper bound for Expected Risk

Proposition

For any δ > 0, the following bound holds with probability at least 1 − δ:

R[f] \le R_{\mathrm{val}}[f] + \frac{2C\ln(|\mathcal{F}_{\mathrm{val}}|/\delta)}{3\,m_{\mathrm{val}}} + \sqrt{\frac{2\gamma^{2}\ln(|\mathcal{F}_{\mathrm{val}}|/\delta)}{m_{\mathrm{val}}}}

|F_val| is the number of times we use the validation dataset in our decision making to choose a final model.

Consider an MNIST example with m_val = 10000 and δ = 0.1. In a worst-case scenario (C = 1, γ² = 1, |F_val| = 1,000,000,000): R[f] ≤ R_val[f] + 6.94%.

In a more realistic scenario this becomes R[f] ≤ R_val[f] + 0.49%.
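Plugging the worst-case numbers from the slide into the bound (a quick arithmetic check):

```python
import math

# Values stated on the slide for the worst-case MNIST scenario.
C, gamma2, F_val, m_val, delta = 1.0, 1.0, 1e9, 10000, 0.1
log_term = math.log(F_val / delta)
bound = 2 * C * log_term / (3 * m_val) + math.sqrt(2 * gamma2 * log_term / m_val)
print(f"R[f] <= R_val[f] + {100 * bound:.2f}%")   # ~6.94%
```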


Model Description

Definition: The k-th output unit can be written in the following form. The network is structured as a directed acyclic graph with an activation function, and H is the number of hidden layers.

h_k^{(H+1)} = \sum_{\mathrm{path}} x_{\mathrm{path}}\,\sigma_{\mathrm{path}}(x, w_\sigma)\, w_{\mathrm{path},k} = \big[x \circ \sigma(x, w_\sigma)\big]^{\top} w_k


Method

Proposition

\mathrm{loss} = \text{original loss} + \frac{\lambda}{m}\,\mathbb{E}_{S_m,\xi}\!\left[\max_{k} \sum_{i=1}^{m} \xi_i\, h_k^{(H+1)}(x_i)\right]

ξ_i: Rademacher variables (uniform on {−1, +1})

m: batch size of SGD

\sum_{i=1}^{m} \xi_i\, h_k^{(H+1)}(x_i): a family of regularizers


Directly Approximately Regularizing Complexity (DARC1)

Definition

\mathrm{loss} = \text{original loss} + \frac{\lambda}{m}\left(\max_{k}\sum_{i=1}^{m} \left|h_k^{(H+1)}(x_i)\right|\right)

h_k^{(H+1)}(x_i) is already computed when evaluating the original loss.

With this family of regularizers they claim to have outperformed state-of-the-art models on CIFAR-10 and MNIST.
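A minimal NumPy sketch of the DARC1 penalty as I read the formula above (in practice the term would be added to the training loss inside an automatic-differentiation framework; `original_loss` here is a placeholder value):

```python
import numpy as np

def darc1_penalty(logits, lam):
    """DARC1 term: (lam / m) * max_k sum_i |h_k(x_i)|,
    where logits has shape (m, K) = (batch size, output units)."""
    m = logits.shape[0]
    return lam / m * np.max(np.abs(logits).sum(axis=0))

# Toy batch of network outputs (already computed for the original loss).
rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 10))
original_loss = 2.31                      # placeholder cross-entropy value
total_loss = original_loss + darc1_penalty(logits, lam=0.001)
print(total_loss)
```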


Outline

1 Introduction

2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)

3 Generalization in Deep Learning - Kawaguchi et al. (2017)

4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
   Effect of Bayesian Evidence on Generalization
   Bayes Theorem and Stochastic Gradient Descent


A simple Bayesian Model

Let X be the training inputs, Y the training labels, and M a classification model with a single parameter ω.

Definition

P(\omega \mid Y, X; M) = \frac{P(Y \mid \omega, X; M)\,P(\omega; M)}{P(Y \mid X; M)} \propto \underbrace{P(Y \mid \omega, X; M)}_{\text{likelihood}}\;\underbrace{P(\omega; M)}_{\text{prior}}


Posterior

The likelihood: P(Y \mid \omega, X; M) = \prod_i P(y_i \mid \omega, x_i; M) = e^{-H(\omega;M)}

The cross-entropy for one-hot labels: H(\omega; M) = -\sum_i \ln P(y_i \mid \omega, x_i; M)

The prior is a Gaussian, P(\omega; M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2}, where λ is the regularization coefficient.

The posterior is then P(\omega \mid Y, X; M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)}, where C(\omega; M) = H(\omega; M) + \lambda\omega^2/2 denotes the L2-regularized cross-entropy.

ω₀ is the value of ω that maximizes the posterior density.


Prediction of New Input

P(y_t \mid x_t, X, Y; M) = \int P(y_t \mid \omega, x_t; M)\, P(\omega \mid Y, X; M)\, d\omega

This integral is dominated by the region near ω₀, and since P(y_t \mid \omega, x_t; M) is smooth there, it can be approximated by P(y_t \mid x_t; M) \approx P(y_t \mid \omega_0, x_t; M).


Model Comparison

\frac{P(M_1 \mid Y, X)}{P(M_2 \mid Y, X)} = \frac{P(Y \mid X; M_1)\,P(M_1)}{P(Y \mid X; M_2)\,P(M_2)}

To eliminate subjectivity, the prior ratio P(M_1)/P(M_2) is taken to be 1.

To calculate the evidence ratio we need the normalizing constant of the posterior over ω:

P(Y \mid X; M) = \int P(Y \mid \omega, X; M)\, P(\omega; M)\, d\omega = \sqrt{\lambda/2\pi}\int e^{-C(\omega;M)}\, d\omega

C(\omega; M) \approx C(\omega_0) + \tfrac{1}{2}\, C''(\omega_0)\,(\omega - \omega_0)^2
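To see this Laplace approximation in action, here is a numerical sketch for a one-parameter logistic model of my own construction (not the paper's experiment): the evidence computed by quadrature is compared against the closed form √(λ/C''(ω₀)) e^{−C(ω₀)} implied by the quadratic expansion above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = (x + 0.3 * rng.normal(size=30) > 0).astype(float)   # informative labels
lam = 1.0                                                # prior precision = L2 coefficient

def C(w):
    """L2-regularized cross-entropy of a 1-parameter logistic regression."""
    z = (2 * y - 1) * w * x                 # signed margins
    return np.sum(np.logaddexp(0.0, -z)) + lam * w ** 2 / 2

# Evidence by quadrature: P(Y|X;M) = sqrt(lam/2pi) * integral of exp(-C(w)) dw
exact = np.sqrt(lam / (2 * np.pi)) * quad(lambda w: np.exp(-C(w)), -20, 20)[0]

# Laplace approximation around the mode w0: sqrt(lam / C''(w0)) * exp(-C(w0))
w0 = minimize_scalar(C, bounds=(-20, 20), method="bounded").x
h = 1e-4
C2 = (C(w0 + h) - 2 * C(w0) + C(w0 - h)) / h ** 2        # numerical C''(w0)
laplace = np.sqrt(lam / C2) * np.exp(-C(w0))
print("evidence (quadrature):", exact)
print("evidence (Laplace)   :", laplace)
```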


Model Comparison

\frac{P(Y \mid X; M)}{P(Y \mid X; \mathrm{NULL})} = e^{-E(\omega_0)}

The NULL model assumes the labels are entirely random; it has no parameters and is determined by the likelihood alone.

E(\omega_0) = C(\omega_0) + \tfrac{1}{2}\sum_i \ln(\lambda_i/\lambda) - N\ln(n) is the log evidence ratio in favor of the NULL model.

The Bayesian evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parametrization.

Computing the Bayesian evidence exactly for deep neural networks is practically impossible.


Experiment Setup

They compare their intuition with the observations of Zhang et al. (2016).

For inputs, they generated 200 random training samples of dimension 200 from a Gaussian distribution, each belonging to one of two classes. The test set contains 10,000 examples. The loss is the L2-regularized sigmoid cross-entropy (logistic regression).

They created two tasks (see the sketch below):

1 The labels of both the training and test sets are random.
2 The label is y = 1 if Σᵢ xᵢ > 1, and y = 0 otherwise.
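A sketch of this setup using scikit-learn's logistic regression (my own reconstruction from the slide's description, not the authors' code; the "> 1" threshold is taken from the slide):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_train, n_test = 200, 200, 10000

def make_task(informative):
    X = rng.normal(size=(n_train + n_test, d))
    if informative:
        y = (X.sum(axis=1) > 1).astype(int)              # rule stated on the slide
    else:
        y = rng.integers(0, 2, size=n_train + n_test)    # random labels
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])

for name, informative in [("informative", True), ("random", False)]:
    (Xtr, ytr), (Xte, yte) = make_task(informative)
    # Large C = weak L2 regularization, roughly the "weakly regularized" setting.
    clf = LogisticRegression(C=1e4, max_iter=5000).fit(Xtr, ytr)
    print(f"{name:11s} labels: train acc={clf.score(Xtr, ytr):.2f}  "
          f"test acc={clf.score(Xte, yte):.2f}")
# Expected pattern: near-perfect training accuracy on both tasks, but test accuracy
# well above chance only for the informative labels.
```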


Effect of Regularization

Exactly as observed by Zhang et al., weakly regularized logistic regression generalizes well on informative labels but memorizes random labels.


Bayesian Evidence

The Bayesian evidence has successfully explained the generalization of their logistic regression.


Noise in training Deep Learning

Generalization is strongly correlated with the Bayesian evidence, which is a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor).

Small-batch training introduces noise into the gradients, and this noise drives SGD away from sharp minima.


Experiment Setup

MNIST (1000 randomly selected images, to compare full-batch vs. small-batch training).

SGD with learning rate ε = 1 and momentum coefficient m = 0.9.

The purpose is to explore the simplest possible model which shows a generalization gap between small- and large-batch training.


Generalization Gap: Small Batch vs. Full Batch

Small Batch = 30 and Full Batch = 1000


Stochastic Differential Equation of SGD

Random fluctuation: g = \varepsilon\left(\frac{N}{B} - 1\right) \approx \frac{\varepsilon N}{B}

An optimal batch size emerges when the underlying scale of random fluctuations is also optimal:

B_{\mathrm{opt}} \propto \varepsilon, \qquad B_{\mathrm{opt}} \propto N
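A small numeric illustration of these scaling rules (values of my own choosing): scaling the learning rate and the batch size together leaves the noise scale g, and hence the predicted generalization behaviour, roughly unchanged.

```python
def noise_scale(eps, N, B):
    """Scale of the SGD gradient noise, g = eps * (N / B - 1) ~ eps * N / B."""
    return eps * (N / B - 1)

N = 50000                      # training set size
for eps, B in [(0.1, 128), (0.2, 256), (0.4, 512)]:
    print(f"eps={eps:.1f}  B={B:4d}  g={noise_scale(eps, N, B):.1f}")
# Doubling the learning rate while doubling the batch size keeps g roughly constant.
```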


Batch Size and Learning Rate


Batch Size and Training Set Size


Batch Size and Momentum Coefficient

g \approx \frac{\varepsilon N}{B(1-m)}

B_{\mathrm{opt}} \propto \frac{1}{1-m}

Constant learning rate ε = 1, trained for 10,000 gradient updates.


How to Tune Batch Size

1 Set the learning rate to 0.1 and the momentum coefficient to 0.9. Run experiments at a range of batch sizes on a logarithmic scale, and identify the optimal batch size which maximizes the validation set accuracy.

2 Repeatedly increase the batch size by a factor of 3, while scaling the learning rate ε ∝ B, until the validation set accuracy starts to fall. Then repeatedly increase the batch size by a factor of 3, while scaling the momentum coefficient (1 − m) ∝ 1/B, until either the validation set accuracy falls or the batch size reaches the limits of your hardware.

3 Having identified the final learning rate and momentum parameter, retune the batch size on a linear scale in the local neighborhood of the current batch size.
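A skeleton of this recipe in Python (my own sketch; `train_and_validate(batch_size, lr, momentum)` is a hypothetical user-supplied routine that trains a model and returns validation accuracy):

```python
def tune_batch_size(train_and_validate, B_max=4096):
    eps, m = 0.1, 0.9                                   # step 1: fixed starting point
    candidates = [32, 64, 128, 256, 512, 1024]          # logarithmic grid
    B = max(candidates, key=lambda b: train_and_validate(b, eps, m))

    best = train_and_validate(B, eps, m)
    while B * 3 <= B_max:                               # step 2a: grow B, scale eps ~ B
        acc = train_and_validate(B * 3, eps * 3, m)
        if acc < best:
            break
        B, eps, best = B * 3, eps * 3, acc

    while B * 3 <= B_max:                               # step 2b: grow B, scale (1 - m) ~ 1/B
        new_m = 1 - (1 - m) / 3
        acc = train_and_validate(B * 3, eps, new_m)
        if acc < best:
            break
        B, m, best = B * 3, new_m, acc

    # step 3: fine-tune B on a linear scale around the current value
    B = max([int(B * f) for f in (0.75, 1.0, 1.25, 1.5)],
            key=lambda b: train_and_validate(b, eps, m))
    return B, eps, m

if __name__ == "__main__":
    # Dummy stand-in: pretend accuracy peaks around B ~ 768 (purely illustrative).
    fake = lambda B, eps, m: 1.0 - abs(B - 768) / 4096
    print(tune_batch_size(fake))
```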


Discussion

Like deep neural networks, linear models which generalize well on informative labels can memorize random labels of the same inputs. Both behaviours are explained by evaluating the Bayesian evidence.

They show that there is an optimum batch size which maximizes the test set accuracy:

B_{\mathrm{opt}} \propto \frac{\varepsilon N}{1-m}


References

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.

Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

Smith, S. L. and Le, Q. V. (2018). A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
