Lecture 6: Optimization for Deep Neural Networks
CMSC 35246: Deep Learning
Shubhendu Trivedi & Risi Kondor
University of Chicago
April 12, 2017
Lecture 6 Optimization for Deep Neural Networks CMSC 35246
Things we will look at today

• Stochastic Gradient Descent
• Momentum Method and the Nesterov Variant
• Adaptive Learning Methods (AdaGrad, RMSProp, Adam)
• Batch Normalization
• Initialization Heuristics
• Polyak Averaging
• On Slides but for self study: Newton and Quasi-Newton Methods (BFGS, L-BFGS, Conjugate Gradient)
Optimization
We’ve seen backpropagation as a method for computing gradients

Assignment: Was about implementing SGD in conjunction with backprop

Let’s see a family of first-order methods
Batch Gradient Descent
Algorithm 1 Batch Gradient Descent at Iteration k

Require: Learning rate εk
Require: Initial parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:   g ← (1/N) ∇θ ∑_i L(f(x^(i); θ), y^(i))
4:   Apply update: θ ← θ − εg
5: end while

Positive: Gradient estimates are stable

Negative: Need to compute gradients over the entire training set for one update
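Algorithm 1 can be sketched in a few lines of NumPy; the squared-error loss and the synthetic data below are illustrative assumptions, not part of the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, eps=0.1, n_iters=100):
    """Batch GD for least squares: each update averages gradients over all N examples."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        # g <- (1/N) grad_theta sum_i L(f(x^(i); theta), y^(i)), with L the squared error
        g = (1.0 / N) * X.T @ (X @ theta - y)
        theta = theta - eps * g  # theta <- theta - eps * g
    return theta

# Tiny synthetic problem: recover theta* = [2, -1] from noiseless data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])
theta = batch_gradient_descent(X, y)
```

Note that every update touches all 200 rows of X, which is exactly the "Negative" the slide points out.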
Gradient Descent

(Slides 13–18: animation of gradient descent steps on an error surface; the figures are not recoverable from the transcript.)
Stochastic Gradient Descent

Algorithm 2 Stochastic Gradient Descent at Iteration k

Require: Learning rate εk
Require: Initial parameter θ
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:   g ← ∇θ L(f(x^(i); θ), y^(i))
5:   Apply update: θ ← θ − εg
6: end while

εk is the learning rate at step k

Sufficient condition to guarantee convergence:

∑_{k=1}^∞ εk = ∞ and ∑_{k=1}^∞ εk² < ∞
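A minimal NumPy sketch of Algorithm 2, again on an assumed synthetic least-squares problem; the 1/k decay is one simple schedule satisfying the two convergence conditions above:

```python
import numpy as np

def sgd(X, y, eps0=0.2, n_iters=5000):
    """Plain SGD: each update uses the gradient at a single sampled example."""
    rng = np.random.default_rng(1)
    N, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(N)                # sample (x^(i), y^(i)) from the training set
        g = (X[i] @ theta - y[i]) * X[i]   # g <- grad_theta L(f(x^(i); theta), y^(i))
        eps_k = eps0 / k                   # sum eps_k diverges, sum eps_k^2 converges
        theta = theta - eps_k * g          # theta <- theta - eps_k * g
    return theta

# Illustrative regression data (not from the slides)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([2.0, -1.0])
theta = sgd(X, y)
loss0 = np.mean(y ** 2)                    # loss at theta = 0
loss = np.mean((X @ theta - y) ** 2)       # loss after training
```

Each update costs O(d) regardless of N, but the single-example gradient is a noisy estimate of the batch gradient.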
Learning Rate Schedule
In practice the learning rate is decayed linearly till iteration τ:

εk = (1 − α) ε0 + α ετ with α = k/τ

τ is usually set to the number of iterations needed for a large number of passes through the data

ετ should roughly be set to 1% of ε0

How to set ε0?
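The schedule above is a straightforward function of k; holding the rate constant at ετ after iteration τ is a common convention assumed here, not stated on the slide:

```python
def linear_decay(k, eps0, tau, eps_tau=None):
    """eps_k = (1 - alpha) * eps0 + alpha * eps_tau, with alpha = k / tau."""
    if eps_tau is None:
        eps_tau = 0.01 * eps0  # the slides' rule of thumb: eps_tau ~ 1% of eps0
    if k >= tau:
        return eps_tau         # assumed convention: keep the rate fixed after tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau
```

For example, with ε0 = 0.1 and τ = 100 the rate interpolates from 0.1 at k = 0 down to 0.001 at k = τ.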
Minibatching
Potential Problem: Gradient estimates can be very noisy

Obvious Solution: Use larger mini-batches

Advantage: Computation time per update does not depend on the number of training examples N

This allows convergence on extremely large datasets

See: "Large Scale Learning with Stochastic Gradient Descent" by Léon Bottou
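A mini-batch gradient simply averages the per-example gradients over a sampled subset; the least-squares loss and data here are illustrative assumptions:

```python
import numpy as np

def minibatch_gradient(X, y, theta, batch_size=32, rng=None):
    """Average per-example least-squares gradients over one sampled mini-batch."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Cost of one update scales with batch_size, not with the dataset size N
    return (1.0 / batch_size) * Xb.T @ (Xb @ theta - yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 2))
y = X @ np.array([2.0, -1.0])
g = minibatch_gradient(X, y, np.zeros(2))
```

Averaging over the batch reduces the variance of the gradient estimate relative to single-example SGD while keeping the per-update cost independent of N.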
Stochastic Gradient Descent

(Slides 30–38: animation of SGD updates on an error surface; the figures are not recoverable from the transcript.)
So far..
Batch Gradient Descent:

g ← (1/N) ∇θ ∑_i L(f(x^(i); θ), y^(i))
θ ← θ − εg

SGD:

g ← ∇θ L(f(x^(i); θ), y^(i))
θ ← θ − εg
Momentum
The Momentum method is a method to accelerate learning using SGD

In particular SGD suffers in the following scenarios:

• Error surface has high curvature
• We get small but consistent gradients
• The gradients are very noisy
Momentum

(Figure: an error surface shaped like a long, narrow valley; axis ticks omitted.)

Gradient Descent would move quickly down the walls, but very slowly through the valley floor
Momentum

How do we try and solve this problem?

Introduce a new variable v, the velocity

We think of v as the direction and speed by which the parameters move as the learning dynamics progresses

The velocity is an exponentially decaying moving average of the negative gradients:

v ← αv − ε ∇θ L(f(x^(i); θ), y^(i)) with α ∈ [0, 1)

Update rule: θ ← θ + v
Momentum

Let’s look at the velocity term:

v ← αv − ε ∇θ L(f(x^(i); θ), y^(i))

The velocity accumulates the previous gradients

What is the role of α?

• If α is larger than ε, the current update is more affected by the previous gradients
• Usually values for α are set high, ≈ 0.8–0.9
![Page 54: Lecture 6 Optimization for Deep Neural Networks - …shubhendu/Pages/Files/Lecture...X1 k=1 2](https://reader030.vdocuments.net/reader030/viewer/2022041016/5ec9005b6b6392450e61252a/html5/thumbnails/54.jpg)
Momentum
Gradient Step
Momentum Step Actual Step
Lecture 6 Optimization for Deep Neural Networks CMSC 35246
Momentum: Step Sizes

In SGD, the step size was the norm of the gradient scaled by the learning rate: ε‖g‖. Why?

While using momentum, the step size also depends on the norm and alignment of a sequence of gradients

For example, if at each step we observed g, the step size would be (exercise!):

ε‖g‖ / (1 − α)

If α = 0.9, this multiplies the maximum speed by 10 relative to the current gradient direction
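The 1/(1 − α) factor comes from the geometric series that the velocity settles into when the same gradient g is observed at every step:

```latex
v_{t} = \alpha v_{t-1} - \varepsilon g
\;\Longrightarrow\;
v_{\infty} = -\varepsilon g \sum_{t=0}^{\infty} \alpha^{t}
           = -\frac{\varepsilon g}{1-\alpha},
\qquad
\|v_{\infty}\| = \frac{\varepsilon \|g\|}{1-\alpha}
```

With α = 0.9 the series sums to 10, matching the slide's claim.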
Momentum

Illustration of how momentum traverses such an error surface better compared to Gradient Descent
SGD with Momentum

Algorithm 3 Stochastic Gradient Descent with Momentum

Require: Learning rate εk
Require: Momentum parameter α
Require: Initial parameter θ
Require: Initial velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:   g ← ∇θ L(f(x^(i); θ), y^(i))
5:   Compute the velocity update:
6:   v ← αv − εg
7:   Apply update: θ ← θ + v
8: end while
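A minimal NumPy sketch of the algorithm above, run on an assumed synthetic least-squares problem (the data, loss, and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def sgd_momentum(X, y, eps=0.05, alpha=0.9, n_iters=2000):
    """SGD with momentum: the velocity accumulates an exponential average of gradients."""
    rng = np.random.default_rng(2)
    N, d = X.shape
    theta, v = np.zeros(d), np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(N)
        g = (X[i] @ theta - y[i]) * X[i]  # g <- grad_theta L(f(x^(i); theta), y^(i))
        v = alpha * v - eps * g           # v <- alpha * v - eps * g
        theta = theta + v                 # theta <- theta + v
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([2.0, -1.0])
theta = sgd_momentum(X, y)
loss0 = np.mean(y ** 2)
loss = np.mean((X @ theta - y) ** 2)
```

With α = 0.9 the effective step per unit gradient is ε/(1 − α) = 10ε, so ε is chosen smaller than it would be for plain SGD.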
Nesterov Momentum

Another approach: First take a step in the direction of the accumulated gradient

Then calculate the gradient and make a correction

(Figure: the new accumulated gradient combines the accumulated-gradient step with the correction.)
![Page 64: Lecture 6 Optimization for Deep Neural Networks - …shubhendu/Pages/Files/Lecture...X1 k=1 2](https://reader030.vdocuments.net/reader030/viewer/2022041016/5ec9005b6b6392450e61252a/html5/thumbnails/64.jpg)
Nesterov Momentum
Another approach: First take a step in the direction of theaccumulated gradient
Then calculate the gradient and make a correction
Accumulated GradientCorrection
New Accumulated Gradient
Lecture 6 Optimization for Deep Neural Networks CMSC 35246
![Page 65: Lecture 6 Optimization for Deep Neural Networks - …shubhendu/Pages/Files/Lecture...X1 k=1 2](https://reader030.vdocuments.net/reader030/viewer/2022041016/5ec9005b6b6392450e61252a/html5/thumbnails/65.jpg)
Nesterov Momentum
Another approach: First take a step in the direction of theaccumulated gradient
Then calculate the gradient and make a correction
Accumulated GradientCorrection
New Accumulated Gradient
Lecture 6 Optimization for Deep Neural Networks CMSC 35246
![Page 66: Lecture 6 Optimization for Deep Neural Networks - …shubhendu/Pages/Files/Lecture...X1 k=1 2](https://reader030.vdocuments.net/reader030/viewer/2022041016/5ec9005b6b6392450e61252a/html5/thumbnails/66.jpg)
Nesterov Momentum

(Figure: the next step of the update, built up over several slides)
Let’s Write it out..

Recall the velocity term in the Momentum method:

v ← αv − ε∇θ L(f(x(i); θ), y(i))

Nesterov Momentum:

v ← αv − ε∇θ L(f(x(i); θ + αv), y(i))

Update: θ ← θ + v
SGD with Nesterov Momentum
Algorithm 3 SGD with Nesterov Momentum
Require: Learning rate ε
Require: Momentum parameter α
Require: Initial parameter θ
Require: Initial velocity v
1: while stopping criterion not met do
2:   Sample example (x(i), y(i)) from the training set
3:   Interim (look-ahead) update: θ̃ ← θ + αv
4:   Compute gradient estimate at the interim point: g ← ∇θ̃ L(f(x(i); θ̃), y(i))
5:   Compute the velocity update: v ← αv − εg
6:   Apply update: θ ← θ + v
7: end while
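For comparison, the same loop with the Nesterov look-ahead (again a sketch with a hypothetical `grad_fn` and toy quadratic, not the lecture's code):

```python
import numpy as np

def sgd_nesterov(grad_fn, theta0, eps=0.1, alpha=0.9, steps=300):
    """Nesterov momentum: evaluate the gradient at the look-ahead
    point theta + alpha*v before updating the velocity."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta + alpha * v)  # gradient at the look-ahead point
        v = alpha * v - eps * g         # velocity update
        theta = theta + v               # apply update
    return theta

# Toy quadratic L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta_star = sgd_nesterov(lambda th: th, theta0=[5.0, -3.0])
```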
Adaptive Learning Rate Methods
Motivation
So far we have assigned the same learning rate to all features

If the features vary in importance and frequency, why is this a good idea?
It’s probably not!
Motivation

(Figure: a well-conditioned loss surface. Nice: all features are equally important.)

(Figure: an ill-conditioned loss surface. Harder!)
AdaGrad
Idea: scale each parameter's update by the inverse square root of the sum of squares of all its historical gradient values

Parameters with large partial derivatives of the loss see their effective learning rates decline rapidly
Some interesting theoretical properties
AdaGrad
Algorithm 4 AdaGrad
Require: Global learning rate ε, initial parameter θ, small constant δ
Initialize gradient accumulator r = 0
1: while stopping criterion not met do
2:   Sample example (x(i), y(i)) from the training set
3:   Compute gradient estimate: g ← ∇θ L(f(x(i); θ), y(i))
4:   Accumulate: r ← r + g ⊙ g
5:   Compute update: ∆θ ← −(ε / (δ + √r)) ⊙ g
6:   Apply update: θ ← θ + ∆θ
7: end while
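Algorithm 4 translates almost line for line into NumPy (a sketch on a toy quadratic; the function and variable names are illustrative):

```python
import numpy as np

def adagrad(grad_fn, theta0, eps=1.0, delta=1e-8, steps=500):
    """AdaGrad: accumulate squared gradients in r and scale each
    parameter's step by 1 / (delta + sqrt(r))."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = r + g * g                                   # accumulate squared gradients
        theta = theta - eps / (delta + np.sqrt(r)) * g  # per-parameter step size
    return theta

# Toy quadratic L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta_star = adagrad(lambda th: th, theta0=[5.0, -3.0])
```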
RMSProp
AdaGrad is good when the objective is convex.

AdaGrad might shrink the learning rate too aggressively, because it keeps the entire gradient history in mind

We can adapt it to perform better in non-convex settings by accumulating an exponentially decaying average of the squared gradient

This is an idea that we use again and again in neural networks

Currently has about 500 citations on Google Scholar, but was proposed in a slide of Geoffrey Hinton's Coursera course
RMSProp
Algorithm 5 RMSProp
Require: Global learning rate ε, decay parameter ρ, small constant δ
Initialize accumulator r = 0
1: while stopping criterion not met do
2:   Sample example (x(i), y(i)) from the training set
3:   Compute gradient estimate: g ← ∇θ L(f(x(i); θ), y(i))
4:   Accumulate: r ← ρr + (1 − ρ) g ⊙ g
5:   Compute update: ∆θ ← −(ε / (δ + √r)) ⊙ g
6:   Apply update: θ ← θ + ∆θ
7: end while
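The only change from AdaGrad is the decaying accumulator in line 4 (a sketch on the same toy quadratic as before; names and constants are illustrative):

```python
import numpy as np

def rmsprop(grad_fn, theta0, eps=0.01, rho=0.9, delta=1e-8, steps=2000):
    """RMSProp: exponentially decaying average of squared gradients,
    so distant history is gradually forgotten (unlike AdaGrad)."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = rho * r + (1 - rho) * g * g                 # decaying accumulator
        theta = theta - eps / (delta + np.sqrt(r)) * g  # per-parameter step
    return theta

# Toy quadratic L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta_star = rmsprop(lambda th: th, theta0=[5.0, -3.0])
```

Because the accumulator forgets, the effective step no longer decays to zero; the iterates settle in a small neighborhood of the minimum whose size is set by eps.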
RMSProp with Nesterov
Algorithm 6 RMSProp with Nesterov
Require: Global learning rate ε, decay parameter ρ, small constant δ, momentum parameter α, initial velocity v
Initialize accumulator r = 0
1: while stopping criterion not met do
2:   Sample example (x(i), y(i)) from the training set
3:   Interim (look-ahead) update: θ̃ ← θ + αv
4:   Compute gradient estimate at the interim point: g ← ∇θ̃ L(f(x(i); θ̃), y(i))
5:   Accumulate: r ← ρr + (1 − ρ) g ⊙ g
6:   Compute velocity: v ← αv − (ε / √r) ⊙ g
7:   Apply update: θ ← θ + v
8: end while
Adam
We could have used RMSProp with momentum
Using momentum together with this kind of rescaling is not well motivated

Adam is like RMSProp with momentum, but with bias-correction terms for the first and second moments
Adam: ADAptive Moments
Algorithm 7 Adam

Require: Learning rate ε (suggested default: 0.001), decay rates ρ1 (suggested: 0.9) and ρ2 (suggested: 0.999), initial parameter θ, small constant δ
Initialize moment variables s = 0 and r = 0, time step t = 0
1: while stopping criterion not met do
2:   Sample example (x(i), y(i)) from the training set
3:   Compute gradient estimate: g ← ∇θ L(f(x(i); θ), y(i))
4:   t ← t + 1
5:   Update first moment: s ← ρ1 s + (1 − ρ1) g
6:   Update second moment: r ← ρ2 r + (1 − ρ2) g ⊙ g
7:   Correct biases: ŝ ← s / (1 − ρ1^t), r̂ ← r / (1 − ρ2^t)
8:   Compute update: ∆θ ← −ε ŝ / (√r̂ + δ)
9:   Apply update: θ ← θ + ∆θ
10: end while
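A direct transcription of the Adam update (a sketch; the toy quadratic and the learning rate and step count used in the call are chosen just for the demo):

```python
import numpy as np

def adam(grad_fn, theta0, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8, steps=1000):
    """Adam: RMSProp-style second moment plus a momentum-like first
    moment, each bias-corrected for the zero initialization."""
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)  # first moment estimate
    r = np.zeros_like(theta)  # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)   # bias-corrected first moment
        r_hat = r / (1 - rho2 ** t)   # bias-corrected second moment
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta

# Toy quadratic L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta_star = adam(lambda th: th, theta0=[5.0, -3.0], eps=0.01, steps=3000)
```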
All your GRADs are belong to us!
SGD: θ ← θ − εg

Momentum: v ← αv − εg, then θ ← θ + v

Nesterov: v ← αv − ε∇θ L(f(x(i); θ + αv), y(i)), then θ ← θ + v

AdaGrad: r ← r + g ⊙ g, then ∆θ ← −(ε / (δ + √r)) ⊙ g, then θ ← θ + ∆θ

RMSProp: r ← ρr + (1 − ρ) g ⊙ g, then ∆θ ← −(ε / (δ + √r)) ⊙ g, then θ ← θ + ∆θ

Adam: ŝ ← s / (1 − ρ1^t), r̂ ← r / (1 − ρ2^t), then ∆θ ← −ε ŝ / (√r̂ + δ), then θ ← θ + ∆θ
Batch Normalization
A Difficulty in Training Deep Neural Networks
A deep model involves composition of several functions:

y = W4ᵀ tanh(W3ᵀ tanh(W2ᵀ tanh(W1ᵀ x + b1) + b2) + b3)

(Figure: a deep network with inputs x1, x2, x3, x4 and output y)
A Difficulty in Training Deep Neural Networks
We have a recipe to compute gradients (backpropagation) and update every parameter (we saw half a dozen methods)

Implicit assumption: the other layers don't change, i.e., the other functions are fixed

In practice: we update all layers simultaneously

This can give rise to unexpected difficulties

Let's look at two illustrations
Intuition
Consider a second-order approximation of our cost function (which is a function composition) around the current point θ(0):

J(θ) ≈ J(θ(0)) + (θ − θ(0))ᵀ g + ½ (θ − θ(0))ᵀ H (θ − θ(0))

g is the gradient and H the Hessian at θ(0)

If ε is the learning rate, the new point is

θ = θ(0) − εg
Intuition
Plugging our new point, θ = θ(0) − εg, into the approximation:

J(θ(0) − εg) ≈ J(θ(0)) − ε gᵀg + ½ ε² gᵀHg

There are three terms here:

• Value of the function before the update
• Improvement due to the gradient (i.e. first-order information)
• Correction factor that accounts for the curvature of the function
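For an exactly quadratic cost the expansion above holds with equality, which makes it easy to check numerically (a sketch with a randomly generated positive-definite Hessian; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
H = A @ A.T + 3 * np.eye(3)     # a positive-definite Hessian
b = rng.normal(size=3)

def J(theta):
    """Quadratic cost with gradient H @ theta + b and Hessian H."""
    return 0.5 * theta @ H @ theta + b @ theta

theta0 = rng.normal(size=3)
g = H @ theta0 + b              # gradient at theta0
eps = 0.1

lhs = J(theta0 - eps * g)
rhs = J(theta0) - eps * (g @ g) + 0.5 * eps**2 * (g @ H @ g)
# For a quadratic J the two sides agree to machine precision.
```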
Intuition
J(θ(0) − εg) ≈ J(θ(0)) − ε gᵀg + ½ ε² gᵀHg

Observations:

• gᵀHg too large: the curvature term dominates and the step can move J upwards
• gᵀHg ≤ 0: the approximation says J will decrease even for large ε
• Optimal step size: with positive curvature, minimizing over ε gives ε∗ = gᵀg / (gᵀHg); with zero curvature, any ε decreases J

Conclusion: just neglecting second-order effects can cause problems (remedy: second-order methods). What about higher-order effects?
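The optimal-step formula is easy to sanity-check on a hypothetical 2-D quadratic where the curvature differs per direction (a sketch; the matrix and points are made up for the demo):

```python
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])        # different curvature in each direction
theta0 = np.array([1.0, 2.0])

def J(theta):
    return 0.5 * theta @ H @ theta

g = H @ theta0                    # gradient at theta0
eps_star = (g @ g) / (g @ H @ g)  # curvature-aware optimal step size

# Along the ray theta0 - eps*g, eps_star minimizes J exactly (J is
# quadratic in eps), so it beats any fixed guess such as eps = 0.5.
```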
Higher Order Effects: Toy Model

(Figure: a chain network x → h1 → h2 → ... → hl → y with one weight per layer: w1, w2, ..., wl)

Just one node per layer, no non-linearity

y is linear in x but non-linear in the wi
Higher Order Effects: Toy Model
Suppose the gradient of the cost with respect to the output is δ = 1, so we want to decrease our output y

Usual strategy:

• Using backprop, find the gradient g of the squared-error cost with respect to the weights
• Update the weights: w ← w − εg

The first-order Taylor approximation (previous slide) says the cost will reduce by ε gᵀg

If we need to reduce the cost by 0.1, then the learning rate should be ε = 0.1 / (gᵀg)
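That back-of-the-envelope choice of ε can be checked numerically (a sketch; the quadratic cost below is a stand-in for the toy network's loss):

```python
import numpy as np

def cost(w):
    return 0.5 * np.sum(w ** 2)   # stand-in cost whose gradient is w

w = np.array([1.0, -2.0, 0.5])
g = w                             # gradient of the cost above
eps = 0.1 / (g @ g)               # first-order estimate for a 0.1 decrease

actual_drop = cost(w) - cost(w - eps * g)
# The drop is close to 0.1, short only by the small second-order correction.
```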
Higher Order Effects: Toy Model

The new ŷ will, however, be:

ŷ = x(w1 − εg1)(w2 − εg2) · · · (wl − εgl)

This contains terms like ε³g1g2g3w4w5 · · · wl

If the weights w4, w5, . . . , wl are small, such a term is negligible; but if they are large, it can explode

Conclusion: higher-order terms make it very hard to choose the right learning rate

Second-order methods are already expensive, and nth-order methods are hopeless. Solution?
Lecture 6 Optimization for Deep Neural Networks CMSC 35246
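The effect of these higher-order terms can be seen numerically. A sketch (the deep-product model and all values here are illustrative): with small weights the first-order prediction for the drop in ŷ is accurate, while with large weights it deviates badly.

```python
import numpy as np

def first_order_error(w0, l, eps, x=1.0):
    """Relative error of the first-order prediction for the drop in
    y_hat = x * w_1 * ... * w_l after the update w := w - eps * g."""
    w = np.full(l, w0)
    y_hat = x * np.prod(w)
    g = y_hat / w                          # g_i = x * prod_{j != i} w_j
    predicted_drop = eps * (g @ g)         # first-order Taylor prediction
    actual_drop = y_hat - x * np.prod(w - eps * g)
    return abs(predicted_drop - actual_drop) / predicted_drop

small = first_order_error(w0=0.5, l=10, eps=1.0)    # small weights: prediction good
large = first_order_error(w0=2.0, l=10, eps=3e-4)   # large weights: prediction poor
print(small, large)
```

With w0 = 0.5 the relative error is about 2%; with w0 = 2 the same kind of step is off by well over 20%, matching the slide's conclusion about choosing the learning rate.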
Batch Normalization

A method to reparameterize a deep network so as to reduce the co-ordination of updates across layers

Can be applied to the input layer, or to any hidden layer

Let H be a design matrix holding the activations of a layer for the m examples in the mini-batch:

H = [ h11 h12 h13 . . . h1k
      h21 h22 h23 . . . h2k
       ⋮    ⋮    ⋮   ⋱    ⋮
      hm1 hm2 hm3 . . . hmk ]
Batch Normalization

H = [ h11 h12 h13 . . . h1k
      h21 h22 h23 . . . h2k
       ⋮    ⋮    ⋮   ⋱    ⋮
      hm1 hm2 hm3 . . . hmk ]

Each row contains all the activations of the layer for one example

Idea: Replace H by H′ such that:

H′ = (H − µ) / σ

where µ holds the mean of each unit and σ the standard deviation
Batch Normalization

µ is a vector with µj the mean of column j

σ is a vector with σj the standard deviation of column j

Hi,j is normalized by subtracting µj and dividing by σj
Batch Normalization

During training we have:

µ = (1/m) Σi Hi,:

σ = √( δ + (1/m) Σi (H − µ)²i,: )

where the sums run over the m rows (examples) and δ is a small positive constant (e.g. 10⁻⁸) added for numerical stability

We then operate on H′ as before =⇒ we backpropagate through the normalized activations
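The normalization above is a few lines of NumPy. A minimal sketch of the training-time forward pass (the mini-batch and the δ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 32, 4                                       # m examples, k units in the layer
H = rng.normal(loc=3.0, scale=2.0, size=(m, k))    # raw activations

delta = 1e-8                                       # small constant for stability
mu = H.mean(axis=0)                                # per-unit (column) mean
sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # per-unit std

H_prime = (H - mu) / sigma                         # normalized activations

print(H_prime.mean(axis=0))  # ~0 in every column
print(H_prime.std(axis=0))   # ~1 in every column
```

At test time the same code would be run with running averages of mu and sigma collected during training, as the later slide notes.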
Why is this good?

The gradient update can never act solely to increase the mean and standard deviation of any activation

Previous approaches added penalties to the cost, per layer, to encourage units to have standardized outputs

Batch normalization makes the reparameterization easier to optimize

At test time: use running averages of µ and σ collected during training to evaluate a new input x
An Innovation

Standardizing the output of a unit can limit the expressive power of the neural network

Solution: Instead of replacing H by H′, replace it with γH′ + β

γ and β are also learned by backpropagation

Normalizing away the mean and standard deviation was the goal of batch normalization, so why reintroduce them via γ and β?
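One way to see why γ and β do not simply undo the normalization: the same family of outputs is representable, but the mean and scale of each unit are now controlled directly by two learned numbers rather than by a complicated interaction of all the upstream weights. A sketch (continuing the NumPy notation; the particular choice of γ and β below is illustrative, not trained):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(loc=3.0, scale=2.0, size=(32, 4))

mu = H.mean(axis=0)
sigma = np.sqrt(1e-8 + ((H - mu) ** 2).mean(axis=0))
H_prime = (H - mu) / sigma

# Learnable per-unit scale and shift (set by hand here, normally trained):
# choosing gamma = sigma and beta = mu recovers H exactly, so no
# expressive power has been lost by normalizing.
gamma, beta = sigma, mu
H_out = gamma * H_prime + beta

print(np.allclose(H_out, H, atol=1e-5))  # True
```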
Initialization Strategies
In convex problems, with a good ε, convergence is guaranteed no matter what the initialization

In the non-convex regime, initialization is much more important

Some parameter initializations can be unstable and fail to converge

Neural networks are not understood well enough to have principled, mathematically nice initialization strategies

What is known: initialization should break symmetry (quiz!)

What is known: the scale of the weights is important

Most initialization strategies are based on intuitions and heuristics
Lecture 6 Optimization for Deep Neural Networks CMSC 35246
Some Heuristics

For a fully connected layer with m inputs and n outputs, sample:

Wij ∼ U(−1/√m, 1/√m)

Xavier Initialization: sample

Wij ∼ U(−√6/√(m + n), √6/√(m + n))

Xavier initialization is derived by treating the network as a sequence of matrix multiplications with no nonlinearities

Works well in practice!
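Both heuristics are one-liners. A sketch (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 256, 128                          # fan-in, fan-out

# Simple heuristic: U(-1/sqrt(m), 1/sqrt(m))
W_simple = rng.uniform(-1 / np.sqrt(m), 1 / np.sqrt(m), size=(m, n))

# Xavier: U(-sqrt(6/(m+n)), sqrt(6/(m+n)))
bound = np.sqrt(6.0 / (m + n))
W_xavier = rng.uniform(-bound, bound, size=(m, n))

# The variance of U(-b, b) is b^2/3, so Xavier gives Var(W_ij) = 2/(m+n),
# the value that keeps activation and gradient variances roughly constant
# across (linear) layers.
print(W_xavier.var(), 2.0 / (m + n))
```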
More Heuristics

Saxe et al. 2013 recommend initializing to random orthogonal matrices, with a carefully chosen gain g that accounts for the nonlinearities

If g could be divined, it could solve the vanishing and exploding gradients problem (more later)

The idea of choosing g and initializing the weights accordingly is that we want the norm of the activations to be preserved, and to pass back strong gradients

Martens 2010 suggested a sparse initialization: each unit receives only k non-zero incoming weights

Motivation: it is a bad idea for all initial weights to have the same standard deviation 1/√m
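A sketch of orthogonal initialization with a gain, following the Saxe et al. idea (the gain value is illustrative; e.g. √2 is a common choice for ReLU networks):

```python
import numpy as np

def orthogonal_init(n, gain=1.0, seed=0):
    """Random n x n orthogonal matrix scaled by `gain` (Saxe et al. 2013)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))   # sign fix: uniform over orthogonal matrices
    return gain * Q

W = orthogonal_init(64, gain=np.sqrt(2.0))
# W^T W = gain^2 * I: the map preserves norms exactly up to the gain factor,
# which is what lets strong gradients pass back through many layers.
print(np.allclose(W.T @ W, 2.0 * np.eye(64)))  # True
```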
Polyak Averaging: Motivation

[Figure: gradient descent with a high step size ε in a narrow valley — the gradient alternately points towards the right, then the left, so the iterates oscillate back and forth across the minimum]
A Solution: Polyak Averaging

Suppose in t iterations you have parameters θ(1), θ(2), . . . , θ(t)

Polyak averaging suggests setting θ̂(t) = (1/t) Σi θ(i)

Has strong convergence guarantees in convex settings

Is this a good idea in non-convex problems?
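The average can be maintained online, so no history of iterates needs to be stored. A sketch (the iterates here are synthetic stand-ins for SGD iterates):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_bar = np.zeros(3)
history = []

for t in range(1, 101):
    theta_t = rng.normal(size=3)   # stand-in for the t-th SGD iterate
    history.append(theta_t)
    # Online running mean: theta_bar_t = theta_bar_{t-1} + (theta_t - theta_bar_{t-1}) / t
    theta_bar = theta_bar + (theta_t - theta_bar) / t

print(np.allclose(theta_bar, np.mean(history, axis=0)))  # True
```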
Simple Modification

In non-convex problems, the loss surface can differ greatly across different regions of parameter space

A plain average over all iterates is then not useful

It is typical to consider an exponentially decaying average instead:

θ̂(t) = αθ̂(t−1) + (1 − α)θ(t), with α ∈ [0, 1]
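A sketch of the exponentially decaying average (α is illustrative; the weight on an iterate k steps in the past decays like αᵏ, so the average tracks the current region of parameter space):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9                                           # decay rate (illustrative)

iterates = [rng.normal(size=3) for _ in range(50)]    # stand-ins for SGD iterates

theta_hat = iterates[0].copy()                        # initialize with first iterate
for theta_t in iterates[1:]:
    theta_hat = alpha * theta_hat + (1 - alpha) * theta_t

# Unrolled: theta_hat = alpha^(t-1) * theta_1
#                       + (1 - alpha) * sum_{k=2}^{t} alpha^(t-k) * theta_k
print(theta_hat)
```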
Next time
Convolutional Neural Networks