Generalization in Deep Learning
An Overview of Generalization in Deep Learning
Enes Dilber
Middle East Technical University
ImageLab Seminars, Spring 17
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Difference between Memorization and Generalization
Memorizing given facts is a straightforward task in learning; it can be done by storing the input samples explicitly.
The ability to identify the underlying rules, that is, to generalize, allows the system to make predictions on unseen data.
How well can neural networks generalize?
Cybenko (1989) proved that a feed-forward network with a single hidden layer using sigmoid activation functions can approximate continuous functions on compact subsets of IR^n.
Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.
Neural Networks Optimization Scheme
Understanding deep learning requires rethinking generalization - Zhang et al. (2016)
State-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data.
This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if the true images are replaced by completely unstructured random noise.
Consequently, these models are in principle rich enough to memorize the training data. This situation poses a conceptual challenge to statistical learning theory, as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks.
On Large-Batch Training For Deep Learning: Generalization Gap And Sharp Minima - Keskar et al. (2016)
Large-batch minimizers are characterized by a significant number of large positive eigenvalues of ∇²f(x). Large Hessian eigenvalues lead to poor generalization.
They claim that small-batch SGD finds wide minima, and that these generalize well.
To make large batches usable, they suggest warm-starting the weights.
On Large-Batch Training For Deep Learning: Generalization Gap And Sharp Minima - Keskar et al. (2016)
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
    Definition of flatness/sharpness
    Deep Rectified Networks
    Reparametrizations
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Definition of flatness/sharpness
Define C(L, θ, ε) as the largest connected set containing θ such that ∀θ′ ∈ C(L, θ, ε): L(θ′) < L(θ) + ε. The ε-flatness is defined as the volume of C(L, θ, ε). (In the slide's figure, this is the purple line.)
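To make the definition concrete, here is a minimal 1-D sketch (not from the paper) that measures ε-flatness as the length of the connected sub-level interval around a minimum; the toy loss and grid bounds are assumptions made for illustration.

```python
import numpy as np

# A toy 1-D loss with a minimum at theta = 1.
def loss(theta):
    return (theta**2 - 1.0)**2

def eps_flatness(theta0, eps, lo=-3.0, hi=3.0, n=100001):
    """Length of the connected piece of {theta' : L(theta') < L(theta0) + eps}
    containing theta0 (the 1-D analogue of the volume of C(L, theta, eps))."""
    grid = np.linspace(lo, hi, n)
    below = loss(grid) < loss(theta0) + eps
    i0 = int(np.argmin(np.abs(grid - theta0)))   # grid index closest to theta0
    left, right = i0, i0
    while left > 0 and below[left - 1]:          # walk outward, staying inside
        left -= 1                                # the connected sub-level set
    while right < n - 1 and below[right + 1]:
        right += 1
    return grid[right] - grid[left]

print(eps_flatness(theta0=1.0, eps=0.01))        # ~0.10 for this toy loss
```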
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
    Definition of flatness/sharpness
    Deep Rectified Networks
    Reparametrizations
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Deep Rectified Networks
Most of the results concern deep rectified feedforward networks with a linear output layer, though they can easily be extended to other architectures (e.g. convolutional networks).
y = Φ_rect(Φ_rect(· · · Φ_rect(xθ1) · · ·)θ_{K−1})θ_K
Properties of Deep Rectified Networks
A unit change in parameters does not correspond to a constant change in the model's behavior across parameter space.
Specifically, each line of a given color corresponds to the parameter assignments (θ1, θ2) that observationally result in the same prediction function f_θ.
Without changing the function value, flatness can be changed by moving along the curve.
Non-negative Homogeneity
Non-negative homogeneity: under some conditions, a global (nonconvex) optimization problem with quadratic data is equivalent to a convex minimization problem (Lasserre et al., 2002).
Definition
A given function φ is non-negative homogeneous if
∀(z, a) ∈ IR × IR⁺: φ(az) = aφ(z)
Non-negative Homogeneity
For a rectified unit this means:
Φ_rect(x(aθ1))θ2 = Φ_rect(xθ1)(aθ2)
For this one-hidden-layer neural network, the parameters (aθ1, θ2) are observationally equivalent to (θ1, aθ2). The same observational equivalence holds for convolutional layers.
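A quick numerical check of this equivalence, as a sketch with arbitrary shapes and an arbitrary positive scalar a:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))        # 5 inputs with 4 features (arbitrary)
theta1 = rng.normal(size=(4, 3))   # input -> hidden weights
theta2 = rng.normal(size=(3, 2))   # hidden -> linear output weights
a = 7.3                            # any positive scalar

relu = lambda z: np.maximum(z, 0.0)

y1 = relu(x @ (a * theta1)) @ theta2   # scale the first layer by a
y2 = relu(x @ theta1) @ (a * theta2)   # scale the second layer by a
print(np.allclose(y1, y2))             # True: the same prediction function
```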
Some Common Non-negative Homogeneous Layers
Fully Connected + ReLU
Convolution + ReLU
Max Pooling
Linear Layers
Mean Pooling
Maxout
Not sigmoid (the sigmoid activation is not non-negative homogeneous)
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
    Definition of flatness/sharpness
    Deep Rectified Networks
    Reparametrizations
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Model Reparametrization
Definition
If we are allowed to change the parametrization of some function f, we can obtain arbitrarily different geometries without affecting how the function evaluates on unseen data. The same holds for reparametrization of the input space.
The problem with relating the flatness of minima to their probable generalization is that the choice of parametrization and its associated geometry are arbitrary.
Since we are interested in finding a prediction function in a given family of functions, no reparametrization of this family should influence the generalization of any of these functions. Given a bijection g onto θ, we can define the transformed parameter κ = g⁻¹(θ). Since θ and κ represent the same prediction function in different spaces, they should generalize equally well.
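The consequence for flatness can be seen numerically. Below is a toy sketch (one ReLU unit, one data point, both made up) of Dinh et al.'s α-scale transformation (θ1, θ2) → (aθ1, θ2/a): the loss at the minimum is unchanged while the curvature along θ2 grows like a², so the minimum can be made arbitrarily sharp without changing the function.

```python
# One ReLU unit, one data point: f(x) = theta2 * relu(theta1 * x),
# squared loss against target t. (a*theta1, theta2/a) leaves f unchanged.
x, t = 1.0, 2.0
relu = lambda z: max(z, 0.0)

def loss(theta1, theta2):
    return (theta2 * relu(theta1 * x) - t)**2

def hessian_diag(theta1, theta2, h=1e-4):
    # second derivatives along each axis, by central finite differences
    d11 = (loss(theta1 + h, theta2) - 2*loss(theta1, theta2)
           + loss(theta1 - h, theta2)) / h**2
    d22 = (loss(theta1, theta2 + h) - 2*loss(theta1, theta2)
           + loss(theta1, theta2 - h)) / h**2
    return d11, d22

theta1, theta2 = 1.0, 2.0      # a global minimum: 2 * relu(1) = t
for a in (1.0, 10.0, 100.0):
    print(a, loss(a*theta1, theta2/a), hessian_diag(a*theta1, theta2/a))
# The loss stays 0 while the curvature along theta2 grows like a**2:
# the "same" minimum becomes arbitrarily sharp under reparametrization.
```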
Exploiting Loss Function
A one-dimensional example of how much the geometry of the loss function depends on the chosen parameter space.
They carefully used parametrization techniques such as weight normalization and batch normalization.
Discussion
In earlier research, it was observed empirically that minima found by standard deep learning algorithms that generalize well tend to be flatter than found minima that do not generalize well.
The whole geometry of the error surface with respect to the parameters can be changed arbitrarily under different parametrizations.
The concept of a flat or sharp minimum cannot be divorced from the particular parametrization of the model or input space.
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
    Generalization Theory
    Understanding Generalization in the Practical Paradigm
    Theoretical Insight for Practical Usage
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Expected Risk
Definition
R[f]: expected risk; R_m[f]: empirical risk
R[f] = E_{(x,y)∼p(x,y)}[loss(f(x), y)]
R_m[f] = (1/m) Σ_{i=1}^{m} loss(f(x_i), y_i)
The aim in machine learning is to minimize the expected risk by minimizing the computable empirical risk. The goal of generalization theory is to question whether this approach is a sensible one.
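A minimal sketch of the two definitions on a toy problem where the true distribution is known, so the expected risk can be approximated by Monte Carlo on a large fresh sample (all specifics are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):                          # the (known) data distribution
    x = rng.normal(size=n)
    y = (x > 0.0).astype(float)         # true labeling rule
    return x, y

f = lambda x: (x > 0.1).astype(float)   # a slightly wrong classifier
loss = lambda yhat, y: (yhat != y).astype(float)   # 0-1 loss

x_tr, y_tr = sample(50)                 # S_m with m = 50
empirical_risk = loss(f(x_tr), y_tr).mean()

x_big, y_big = sample(1_000_000)        # Monte Carlo proxy for E_{x,y}
expected_risk = loss(f(x_big), y_big).mean()

print(empirical_risk, expected_risk)    # their difference is the gap
```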
Generalization Gap
Definition
Generalization Gap ≜ R[f_A(S_m)] − R_m[f_A(S_m)], where f_A(S_m): X → Y is a model learned by algorithm A with a training dataset S_m.
The generalization gap is not computable. It is the difference between the error on the training set and the error under the underlying joint probability distribution.
The aim of many problems in statistical learning theory is to bound or characterize the generalization error in probability.
Apparent Paradox
Zhang et al. (2016) empirically showed that several deep model classes can memorize random labels, while still being able to produce zero training error and small test error on particular natural datasets. This phenomenon is not specific to deep learning; it can occur for ML algorithms in general.
For any model class F whose complexity is large enough to memorize any dataset, and which includes f_ε possibly at an arbitrarily sharp minimum, there exist learning algorithms A such that the generalization gap of f_A(S_m) is at most ε.
There exist arbitrarily unstable and arbitrarily non-robust algorithms A such that the generalization gap of f_A(S_m) is at most ε.
Current generalization theory states that how we get f and the set from which we choose f are all that matters.
Generalization Puzzle
The authors suggest that the properties we pursue for generalization, such as flat minima, small complexity, and robustness, imply a small generalization gap; however, a small generalization gap does not imply flat minima, small complexity, or robustness.
p: the model is stable, robust, and has flat minima.
q: the model generalizes well.
p implies q, but that does not mean q implies p.
In the space of f ∈ F there exist learning mechanisms that are unstable and non-robust but generalize well.
Practical Role of Generalization Theory
1 Provide guarantees on expected risk.
2 Guarantee the generalization gap
    1 to be small for a given fixed S_m, and/or
    2 to approach zero with a fixed model class as m increases.
3 Provide theoretical insights to guide the search over model classes.
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
    Generalization Theory
    Understanding Generalization in the Practical Paradigm
    Theoretical Insight for Practical Usage
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Training-Validation Paradigm
Search over model classes by changing the architecture to obtain low validation loss.
We can generalize well because we can obtain a good model via model search using validation errors. In the deep learning community there is a cumulative body of traditional knowledge about what generalizes.
Provide theoretical insights to guide the search over model classes.
Upper bound for Expected Risk
Proposition
For any δ > 0, the following bound holds with probability at least 1 − δ:
R[f] ≤ R_val[f] + 2C ln(|F_val|/δ) / (3 m_val) + √( 2γ² ln(|F_val|/δ) / m_val )
|F_val| is the number of times we use the validation dataset in our decision making to choose a final model.
Consider an MNIST example with m_val = 10000 and δ = 0.1. In a worst-case scenario (C = 1, γ² = 1, |F_val| = 1,000,000,000): R[f] ≤ R_val[f] + 6.94%.
In a realistic scenario this becomes R[f] ≤ R_val[f] + 0.49%.
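The worst-case number can be reproduced directly from the proposition (a transcription of the bound's right-hand side, not code from the paper):

```python
import numpy as np

def gap_bound(C, gamma2, F_val, m_val, delta):
    t = np.log(F_val / delta)
    return 2 * C * t / (3 * m_val) + np.sqrt(2 * gamma2 * t / m_val)

# worst case: C = 1, gamma^2 = 1, |F_val| = 1e9, m_val = 10000, delta = 0.1
print(gap_bound(1.0, 1.0, 1e9, 10_000, 0.1))   # ~0.0694, i.e. 6.94%
```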
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
    Generalization Theory
    Understanding Generalization in the Practical Paradigm
    Theoretical Insight for Practical Usage
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
Model Description
Definition
The k-th output unit has the following form. The network is structured as a directed acyclic graph with activations; H is the number of hidden layers.
h_k^{(H+1)} = Σ_path x_path σ_path(x, w_σ) w_path,k = [x ∘ σ(x, w_σ)]ᵀ w_k
Method
Proposition
loss = original loss + (λ/m) E_{S_m,ξ}[ max_k Σ_{i=1}^{m} ξ_i h_k^{(H+1)}(x_i) ]
ξ_i: Rademacher variables (uniform on {−1, +1})
m: batch size of SGD
Σ_{i=1}^{m} ξ_i h_k^{(H+1)}(x_i): a family of regularizers
Directly Approximately Regularizing Complexity (DARC1)
Definition
loss = original loss + (λ/m) ( max_k Σ_{i=1}^{m} |h_k^{(H+1)}(x_i)| )
h_k^{(H+1)}(x_i) is already computed in the original loss, so the regularizer adds little cost.
With this family of regularizers, they claim to have outperformed state-of-the-art models on CIFAR-10 and MNIST.
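A sketch of what the DARC1 objective looks like on a mini-batch, assuming `outputs` holds the pre-loss network outputs h_k^{(H+1)}(x_i) with shape (m, number of classes); the array contents below are fake:

```python
import numpy as np

def darc1_loss(original_loss, outputs, lam):
    """outputs: array of shape (m, n_classes) holding h_k^{(H+1)}(x_i)."""
    m = outputs.shape[0]
    # sum |h_k(x_i)| over the batch, then take the max over classes k
    penalty = np.abs(outputs).sum(axis=0).max()
    return original_loss + lam / m * penalty

outputs = np.random.default_rng(0).normal(size=(32, 10))  # fake batch
print(darc1_loss(original_loss=2.3, outputs=outputs, lam=1e-3))
```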
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
    Effect of Bayesian Evidence on Generalization
    Bayes Theorem and Stochastic Gradient Descent
A simple Bayesian Model
Let X be the training inputs, Y the training labels, and M a classification model with a single parameter ω.
Definition
P(ω|Y,X;M) = P(Y|ω,X;M) P(ω;M) / P(Y|X;M) ∝ P(Y|ω,X;M) × P(ω;M), i.e. likelihood × prior.
Posterior
The likelihood: P(Y|ω,X;M) = Π_i P(y_i|ω,x_i;M) = e^{−H(ω;M)}
Cross-entropy for one-hot labels: H(ω;M) = −Σ_i ln P(y_i|ω,x_i;M)
The prior is a Gaussian: P(ω;M) = √(λ/2π) e^{−λω²/2}, where λ is the regularization coefficient.
The posterior is then P(ω|Y,X;M) ∝ √(λ/2π) e^{−C(ω;M)}, where C(ω;M) = H(ω;M) + λω²/2 denotes the L2-regularized cross-entropy.
ω₀ is the value that maximizes the posterior density.
Prediction of New Input
P(y_t|x_t,X,Y;M) = ∫ P(y_t|ω,x_t;M) P(ω|Y,X;M) dω
This integral is dominated by the region near ω₀, and since P(y_t|ω,x_t;M) is smooth there, it can be approximated by P(y_t|x_t,X,Y;M) ≈ P(y_t|ω₀,x_t;M).
Model Comparison
P(M₁|Y,X) / P(M₂|Y,X) = [P(Y|X;M₁) P(M₁)] / [P(Y|X;M₂) P(M₂)]
To eliminate subjectivity, the prior ratio can be taken to be 1.
To calculate the evidence ratio, we need the normalizing constant from the posterior over ω:
P(Y|X;M) = ∫ P(Y|ω,X;M) P(ω;M) dω = √(λ/2π) ∫ e^{−C(ω;M)} dω
Around the mode, C(ω;M) ≈ C(ω₀) + C″(ω₀)(ω − ω₀)²/2.
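A 1-D sketch of this Laplace approximation with a made-up cost C(ω), compared against direct numerical integration (both the toy cost and the grid are assumptions):

```python
import numpy as np

lam = 0.5
C = lambda w: 10 * np.log1p(np.exp(-3 * w)) + lam * w**2 / 2  # made-up C(w)

w = np.linspace(-20.0, 20.0, 400001)       # dense parameter grid
Cw = C(w)
w0 = w[np.argmin(Cw)]                      # the mode omega_0

h = 1e-4                                   # C''(w0) by finite differences
C2 = (C(w0 + h) - 2 * C(w0) + C(w0 - h)) / h**2

# Laplace: sqrt(lam/2pi) * exp(-C(w0)) * sqrt(2pi/C''(w0))
laplace = np.exp(-C(w0)) * np.sqrt(lam / C2)
numeric = (np.sqrt(lam / (2 * np.pi)) * np.exp(-Cw)).sum() * (w[1] - w[0])
print(laplace, numeric)                    # the two nearly agree
```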
Model Comparison
P(Y|X;M) / P(Y|X;NULL) = e^{−E(ω₀)}
The NULL model assumes the labels are entirely random; it has no parameters and is controlled only by the likelihood.
E(ω₀) = C(ω₀) + (1/2) Σ_i ln(λ_i/λ) − N ln(n) is the log evidence ratio in favor of the NULL model, where the λ_i are the eigenvalues of the Hessian of C at ω₀.
The Bayesian evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parametrization.
Computing the Bayesian evidence exactly for deep neural networks is practically impossible.
Experiment Setup
They compare their intuition with the observations of Zhang et al. (2016).
For inputs, they generated 200 random samples of size 200 from a Gaussian distribution, each belonging to one of two classes. The test set contains 10,000 examples. The loss is L2-regularized sigmoid cross-entropy.
They created two tasks:
1 The labels of both the training and test sets are random.
2 The label is y = 1 if Σ_i x_i > 1, and y = 0 otherwise.
Effect of Regularization
Exactly as observed by Zhang et al., weakly regularized logistic regression generalizes well on informative labels but memorizes random labels.
Bayesian Evidence
Bayesian evidence has successfully explained the generalization of their logistic regression.
Outline
1 Introduction
2 Sharp Minima Can Generalize For Deep Nets - Dinh et al. (2017)
3 Generalization in Deep Learning - Kawaguchi et al. (2017)
4 A Bayesian Perspective on Generalization and SGD - Smith and Le (2018)
    Effect of Bayesian Evidence on Generalization
    Bayes Theorem and Stochastic Gradient Descent
Noise in training Deep Learning
Generalization is strongly correlated with the Bayesian evidence, which is a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor).
Small-batch training introduces noise into the gradients, and this noise drives SGD away from sharp minima.
Experiment Setup
MNIST (1000 randomly selected images, to compare full-batch vs. small-batch training)
SGD with learning rate ε = 1 and momentum coefficient m = 0.9.
The purpose is to explore the simplest possible model that shows a generalization gap between small- and large-batch training.
Generalization Gap between Small Batch vs Full Batch
Small Batch = 30 and Full Batch = 1000
Stochastic Differential Equation of SGD
Random fluctuation: g = ε(N/B − 1) ≈ εN/B
An optimal batch size emerges when the underlying scale of random fluctuations is also optimal.
B_opt ∝ ε
B_opt ∝ N
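A sketch of the scaling rule these proportionalities imply: rescale B together with ε so the noise scale g stays fixed. All numbers are illustrative; the (1 − m) factor anticipates the momentum correction introduced two slides later.

```python
def noise_scale(eps, N, B, m=0.0):
    # g = eps * N / B, with the momentum correction 1/(1 - m)
    return eps * N / (B * (1.0 - m))

eps0, N, B0 = 0.1, 50_000, 128            # a configuration assumed to work
g0 = noise_scale(eps0, N, B0)

for eps in (0.1, 0.2, 0.4, 0.8):
    B = B0 * eps / eps0                    # keep g fixed: B grows with eps
    print(f"eps={eps:.1f}  B={B:.0f}  g={noise_scale(eps, N, B):.2f}")
```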
Batch Size and Learning Rate
Batch Size and Training Set Size
Batch Size and Momentum Coefficient
g ≈ εN / (B(1 − m))
B_opt ∝ 1/(1 − m)
Constant learning rate ε = 1, trained for 10000 gradient updates.
How to Tune Batch Size
1 Set the learning rate to 0.1 and the momentum coefficient to 0.9. Run experiments at a range of batch sizes on a logarithmic scale, and identify the optimal batch size which maximizes the validation set accuracy.
2 Repeatedly increase the batch size by a factor of 3, while scaling the learning rate ε ∝ B, until the validation set accuracy starts to fall. Then repeatedly increase the batch size by a factor of 3, while scaling the momentum coefficient (1 − m) ∝ 1/B, until either the validation set accuracy falls or the batch size reaches the limits of your hardware.
3 Having identified the final learning rate and momentum parameter, retune the batch size on a linear scale in the local neighborhood of the current batch size. (A skeleton of this recipe is sketched below.)
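A skeleton of the recipe above; `validation_accuracy` is a hypothetical stand-in for "train with these settings and evaluate", and nothing here comes from the paper itself.

```python
def validation_accuracy(batch_size, lr, momentum):
    # hypothetical stand-in: train with these settings, return val accuracy
    raise NotImplementedError

def tune_batch_size(candidates=(16, 32, 64, 128, 256, 512), max_B=4096):
    lr, m = 0.1, 0.9                                        # step 1
    B = max(candidates, key=lambda b: validation_accuracy(b, lr, m))
    acc = validation_accuracy(B, lr, m)
    while True:                                             # step 2: eps scales with B
        new_acc = validation_accuracy(3 * B, 3 * lr, m)
        if new_acc < acc:
            break
        B, lr, acc = 3 * B, 3 * lr, new_acc
    while 3 * B <= max_B:                                   # step 2: (1-m) scales with 1/B
        new_m = 1.0 - (1.0 - m) / 3.0
        new_acc = validation_accuracy(3 * B, lr, new_m)
        if new_acc < acc:
            break
        B, m, acc = 3 * B, new_m, new_acc
    return B, lr, m   # step 3: retune B on a linear scale near this value
```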
Discussion
Like deep neural networks, linear models that generalize well on informative labels can memorize random labels of the same inputs. Both behaviors are explained by evaluating the Bayesian evidence.
They show that there is an optimal batch size which maximizes the test set accuracy:
B_opt ∝ εN / (1 − m)
References
Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.
Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
Smith, S. L. and Le, Q. V. (2018). A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.