Lecture 6: Training Neural Networks, Part I
Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 6, April 19, 2018
Lecture 6: Training Neural Networks, Part I
Administrative
Assignment 1 was due yesterday.
Assignment 2 is out, due Wed May 2. Q5 will be released in a few days; keep an eye out for announcements.
Project proposal due Wed April 25.
Where we are now...

Computational graphs

(Figure: a computational graph in which inputs x and W feed a * node to produce s (scores), a hinge loss node computes the data loss, and a regularization term R is added to give the total loss L.)
Where we are now...

Neural Networks

Linear score function: f = Wx
2-layer Neural Network: f = W2 max(0, W1 x)

(For CIFAR-10: x has 3072 dimensions, the hidden layer h has 100, and the scores s have 10.)
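The shapes on this slide can be checked with a quick numpy sketch of the 2-layer net; the random, untrained weights below are for shape-checking only:

```python
import numpy as np

# 2-layer net from the slide: x (3072) -> h (100) -> s (10),
# i.e. f = W2 max(0, W1 x). Random weights for illustration only.
np.random.seed(0)
x = np.random.randn(3072)              # one flattened CIFAR-10 image
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1.dot(x))           # hidden layer, 100 units
s = W2.dot(h)                          # class scores, 10 numbers

print(h.shape, s.shape)                # (100,) (10,)
```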
Where we are now...

Convolutional Neural Networks

Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1
Where we are now... Convolutional Layer

(Figure: a 5x5x3 filter is convolved (slid) over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map.)
Where we are now... Convolutional Layer

For example, if we have six 5x5 filters, we get 6 separate 28x28 activation maps. We stack these up to get a “new image” of size 28x28x6!
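The 32x32 -> 28x28 arithmetic here follows the standard stride-1, no-padding output-size formula (W + 2P - F)/S + 1; a small helper (the function name is ours, not from the lecture):

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution: (W + 2P - F) / S + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# The slide's case: 32x32 input, 5x5 filter, stride 1, no padding.
print(conv_output_size(32, 5))             # 28
# With 2 pixels of zero-padding, the spatial size is preserved.
print(conv_output_size(32, 5, padding=2))  # 32
```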
Where we are now...
Landscape image is CC0 1.0 public domain. Walking man image is CC0 1.0 public domain.
Learning network parameters through optimization
Where we are now...
Mini-batch SGD. Loop:

1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
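The four-step loop can be sketched end to end. The tiny least-squares model below is a stand-in so the loop runs; it is not the course's training code:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 10)
y = X.dot(np.arange(10.0))       # stand-in targets from known weights
W = np.zeros(10)                 # parameters to learn
lr, batch_size = 1e-2, 64

for step in range(500):
    idx = np.random.choice(len(X), batch_size)    # 1. sample a batch
    Xb, yb = X[idx], y[idx]
    pred = Xb.dot(W)                              # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * Xb.T.dot(pred - yb) / batch_size   # 3. backprop the gradient
    W -= lr * grad                                # 4. update the parameters

print(loss)   # decreases toward 0 as W approaches the true weights
```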
Next: Training Neural Networks
Overview

1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles
Part 1

- Activation Functions
- Data Preprocessing
- Weight Initialization
- Batch Normalization
- Babysitting the Learning Process
- Hyperparameter Optimization
Activation Functions
- Sigmoid
- tanh
- ReLU
- Leaky ReLU
- Maxout
- ELU
Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(-x))

- Squashes numbers to range [0,1]
- Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the gradients
Consider a sigmoid gate with input x. What happens when x = -10? When x = 0? When x = 10?
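These questions can be answered numerically. The sigmoid's local gradient is σ(x)(1 - σ(x)), which peaks at 0.25 for x = 0 and vanishes in both saturated regimes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)   # local gradient of the sigmoid gate

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid_grad(x))
# x = -10 and x = 10: gradient ~ 4.5e-5, the gate "kills" upstream gradients
# x = 0: gradient = 0.25, its maximum
```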
2. Sigmoid outputs are not zero-centered
Consider what happens when the input to a neuron (x) is always positive: f(Σᵢ wᵢxᵢ + b).
What can we say about the gradients on w?
What can we say about the gradients on w? Always all positive or all negative :( (This is also why you want zero-mean data!)

(Figure: the allowed gradient update directions span only two quadrants, so updates toward a hypothetical optimal w vector must follow a zig-zag path.)
3. exp() is a bit compute expensive
Activation Functions: tanh(x) [LeCun et al., 1991]

- Squashes numbers to range [-1, 1]
- Zero-centered (nice)
- Still kills gradients when saturated :(
Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]

- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 6 - April 19, 2018Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 6 - April 19, 201825
- Not zero-centered output
- An annoyance: what is the gradient when x < 0?
Consider a ReLU gate with input x. What happens when x = -10? When x = 0? When x = 10?
(Figure: within the data cloud, an active ReLU’s half-space overlaps the data; a dead ReLU lies entirely outside the data cloud, will never activate, and therefore never updates.)

=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
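A small numerical sketch of the dead-ReLU situation (the data cloud, weights, and biases below are made up for illustration): a bias that is strongly negative relative to the pre-activation scale keeps the neuron off for every input, so it gets zero gradient on the whole dataset, while a slightly positive bias keeps it active for some inputs.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 50)      # stand-in "data cloud"
w = 0.01 * np.random.randn(50)     # small random weights

# A strongly negative bias: every pre-activation is < 0, the ReLU never
# fires, and since its gradient is 0 for x < 0 the neuron cannot recover.
dead = X.dot(w) - 1.0
print(np.mean(dead > 0))           # fraction of activating inputs: 0

# A slightly positive bias (e.g. 0.01) keeps the neuron active for some
# inputs at initialization, so it receives gradient and can update.
alive = X.dot(w) + 0.01
print(np.mean(alive > 0))          # fraction of activating inputs: > 0
```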
Activation Functions: Leaky ReLU [Maas et al., 2013] [He et al., 2015]

f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not “die”
Parametric Rectifier (PReLU): f(x) = max(αx, x), where we backprop into α (a learned parameter). [He et al., 2015]
Activation Functions: Exponential Linear Units (ELU) [Clevert et al., 2015]

- All benefits of ReLU
- Closer to zero-mean outputs
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
- Computation requires exp()
Maxout “Neuron” [Goodfellow et al., 2013]: max(w1^T x + b1, w2^T x + b2)

- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters per neuron :(
TLDR: In practice:

- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid
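For reference, minimal numpy versions of the activations compared in this section (the alpha defaults are common choices, not values fixed by the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes to (0, 1)

def relu(x):
    return np.maximum(0, x)                  # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))          # [0. 0. 2.]
print(leaky_relu(x))    # [-0.02  0.    2.  ]
```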
Data Preprocessing
Step 1: Preprocess the data
(Assume X [NxD] is data matrix, each example in a row)
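A minimal sketch of the two steps usually shown here, zero-centering and then normalizing each dimension of the [N x D] matrix; the toy data is made up:

```python
import numpy as np

np.random.seed(0)
X = 5.0 + 2.0 * np.random.randn(100, 3)   # toy data: mean ~5, std ~2 per dim

X = X - np.mean(X, axis=0)   # zero-center each dimension (column)
X = X / np.std(X, axis=0)    # normalize each dimension to unit std

print(np.mean(X, axis=0))    # ~[0. 0. 0.]
print(np.std(X, axis=0))     # [1. 1. 1.]
```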
Remember: consider what happens when the input to a neuron is always positive. The gradients on w are always all positive or all negative :( (This is also why you want zero-mean data!)

(Figure: the allowed gradient update directions span only two quadrants, forcing a zig-zag path toward a hypothetical optimal w vector.)
Step 1: Preprocess the data. In practice, you may also see PCA (decorrelated data: the data has a diagonal covariance matrix) and whitening (the covariance matrix is the identity matrix).
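A sketch of both transforms on a toy zero-centered [N x D] matrix, via the eigendecomposition of the covariance; the mixing matrix and epsilon below are illustrative:

```python
import numpy as np

np.random.seed(0)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
X = np.random.randn(500, 3).dot(A)         # correlated toy data
X = X - X.mean(axis=0)                     # assume zero-centered input

cov = X.T.dot(X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenbasis of the covariance

X_pca = X.dot(eigvecs)                     # decorrelated: diagonal covariance
X_white = X_pca / np.sqrt(eigvals + 1e-5)  # whitened: ~identity covariance

cov_white = X_white.T.dot(X_white) / X.shape[0]
print(np.round(cov_white, 3))              # approximately the 3x3 identity
```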
TLDR: In practice for images: center only. E.g. consider a CIFAR-10 example with [32,32,3] images:

- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)

It is not common to normalize the variance, or to do PCA or whitening.
Weight Initialization
- Q: what happens when W=constant init is used?
- First idea: small random numbers (gaussian with zero mean and 1e-2 standard deviation): W = 0.01 * np.random.randn(D, H)
Works ~okay for small networks, but causes problems with deeper networks.
Let's look at some activation statistics. E.g. a 10-layer net with 500 neurons in each layer, using tanh nonlinearities, initialized as described on the last slide.
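This experiment is easy to reproduce in miniature (sizes as on the slide; the input batch is random data):

```python
import numpy as np

np.random.seed(0)
h = np.random.randn(1000, 500)            # random input batch
stds = []
for layer in range(10):                   # 10 layers, 500 units each
    W = 0.01 * np.random.randn(500, 500)  # the small-gaussian init above
    h = np.tanh(h.dot(W))                 # tanh nonlinearity
    stds.append(h.std())

print(stds[0], stds[-1])   # the activation std shrinks toward 0 layer by layer
```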
All activations become zero!
Q: think about the backward pass. What do the gradients look like?
Hint: think about backward pass for a W*X gate.
With *1.0 instead of *0.01, almost all neurons are completely saturated at either -1 or 1, and the gradients will be all zero.
“Xavier initialization” [Glorot et al., 2010]: W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

A reasonable initialization. (The mathematical derivation assumes linear activations.)
But when using the ReLU nonlinearity, it breaks.
He et al., 2015 (note the additional factor of 2: divide by np.sqrt(fan_in / 2))
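A quick experiment contrasting the two scalings on a deep ReLU net (the layer sizes here are our choice): with Xavier scaling the activations shrink layer by layer, because ReLU discards half the variance, while the extra factor of 2 compensates.

```python
import numpy as np

np.random.seed(0)

def final_std(scale_fn, layers=10, n=500):
    """Forward random data through a deep ReLU net; return last-layer std."""
    h = np.random.randn(1000, n)
    for _ in range(layers):
        W = np.random.randn(n, n) * scale_fn(n)
        h = np.maximum(0, h.dot(W))
    return h.std()

xavier = final_std(lambda fan_in: 1.0 / np.sqrt(fan_in))  # Glorot scaling
he = final_std(lambda fan_in: np.sqrt(2.0 / fan_in))      # He scaling
print(xavier, he)   # Xavier collapses toward 0; He stays on the order of 1
```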
Proper initialization is an active area of research:

- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015
- …
Batch Normalization
Batch Normalization [Ioffe and Szegedy, 2015]

"you want zero-mean unit-variance activations? just make them so."

Consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, normalize it by its empirical mean and standard deviation; this is a vanilla differentiable function...
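The normalization referred to here appears as an image in the original slides; per the Batch Normalization paper it is, for each dimension k:

```latex
\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\left[x^{(k)}\right]}}
```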
Batch Normalization [Ioffe and Szegedy, 2015]

"you want zero-mean unit-variance activations? just make them so."

Given a batch of activations X of shape N x D:
1. Compute the empirical mean and variance independently for each dimension.
2. Normalize.
Batch Normalization [Ioffe and Szegedy, 2015]
FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity.
Batch Normalization [Ioffe and Szegedy, 2015]
FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity.

Problem: do we necessarily want a zero-mean unit-variance input?
Batch Normalization [Ioffe and Szegedy, 2015]

Normalize: x_hat(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)])

And then allow the network to squash the range if it wants to: y(k) = gamma(k) * x_hat(k) + beta(k)

Note, the network can learn gamma(k) = sqrt(Var[x(k)]) and beta(k) = E[x(k)] to recover the identity mapping.
Batch Normalization [Ioffe and Szegedy, 2015]
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
Batch Normalization [Ioffe and Szegedy, 2015]
Note: at test time the BatchNorm layer functions differently: the mean/std are not computed based on the batch. Instead, a single fixed empirical mean and variance of the activations is used (e.g. estimated during training with running averages).
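The train/test asymmetry can be sketched in numpy; the function name, momentum value, and shapes below are illustrative, not from the slides:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      mode="train", eps=1e-5, momentum=0.9):
    """Batch normalization for a batch x of shape (N, D)."""
    if mode == "train":
        mu = x.mean(axis=0)                      # empirical mean per dimension
        var = x.var(axis=0)                      # empirical variance per dimension
        x_hat = (x - mu) / np.sqrt(var + eps)    # normalize using batch statistics
        # keep running estimates for use at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # test time: use the fixed statistics estimated during training
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta                   # learnable scale and shift
    return out, running_mean, running_var
```

With gamma = 1 and beta = 0, the train-mode output is zero-mean unit-variance per dimension.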
Babysitting the Learning Process
Step 1: Preprocess the data
(Assume X [N x D] is the data matrix, with each example in a row)
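The two standard steps (zero-centering and normalization) can be written as a short numpy sketch:

```python
import numpy as np

def preprocess(X):
    """Zero-center then normalize each dimension of the data matrix X [N x D]."""
    X = X - X.mean(axis=0)   # subtract per-dimension mean
    X = X / X.std(axis=0)    # divide by per-dimension standard deviation
    return X
```

For images, typically only the mean subtraction is applied.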
Step 2: Choose the architecture: say we start with one hidden layer of 50 neurons:
- input layer: CIFAR-10 images, 3072 numbers
- hidden layer: 50 hidden neurons
- output layer: 10 output neurons, one per class
Double check that the loss is reasonable:
- returns the loss and the gradient for all parameters
- disable regularization
- loss ~2.3 is "correct" for 10 classes: with random weights, the softmax loss should be about -ln(1/10) = ln(10) ≈ 2.3
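Why ~2.3 is "correct": with tiny random weights the scores are near zero, so softmax assigns roughly 1/10 probability to each class. A quick numerical check (`softmax_loss` here is a stand-in, not the course API):

```python
import numpy as np

def softmax_loss(scores, y):
    """Numerically stable softmax cross-entropy, averaged over the batch."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

N, C = 100, 10
scores = 0.001 * np.random.randn(N, C)   # near-zero scores, as from tiny random weights
y = np.random.randint(C, size=N)
print(softmax_loss(scores, y))           # close to ln(10) ≈ 2.30
```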
Double check that the loss is reasonable:
crank up regularization
loss went up, good. (sanity check)
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data. The code on the slide:
- takes the first 20 examples from CIFAR-10
- turns off regularization (reg = 0.0)
- uses simple vanilla 'sgd'
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.
Very small loss, train accuracy 1.00, nice!
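The sanity check can be sketched end to end with a tiny two-layer net. The data here is random rather than CIFAR-10, and all sizes and hyperparameters are illustrative; the point is only that 20 examples should be memorized easily:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 20, 50, 50, 10                 # 20 examples, as in the tip
X = rng.standard_normal((N, D))             # stand-in for 20 small training examples
y = rng.integers(0, C, size=N)

W1 = 0.1 * rng.standard_normal((D, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, C)); b2 = np.zeros(C)

lr = 0.3                                    # vanilla full-batch SGD, no regularization
for step in range(2000):
    h = np.maximum(0, X @ W1 + b1)          # ReLU hidden layer
    scores = h @ W2 + b2
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()

    # backprop through softmax, second layer, ReLU, first layer
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    dW2 = h.T @ dscores; db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

train_acc = (scores.argmax(axis=1) == y).mean()
```

If a setup like this cannot drive the training loss near zero, something in the pipeline (loss, gradients, or update) is broken.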
Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.
Loss barely changing
Loss not going down / barely changing: learning rate is probably too low.
Notice that the train/val accuracy still climbs to 20%, though. What's up with that? (Remember this is softmax: the loss can stay nearly flat while the correct-class scores slowly become the largest, so accuracy improves before the loss moves much.)
Now let’s try learning rate 1e6.
cost: NaN almost always means high learning rate...
loss not going down: learning rate too low
loss exploding: learning rate too high
3e-3 is still too high. Cost explodes…

=> Rough range for the learning rate we should be cross-validating: somewhere in [1e-5 … 1e-3]
Hyperparameter Optimization
Cross-validation strategy: coarse -> fine cross-validation in stages

First stage: only a few epochs to get a rough idea of what params work
Second stage: longer running time, finer search
… (repeat as necessary)

Tip for detecting explosions in the solver: if the cost is ever > 3 * original cost, break out early
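The break-out tip can be folded into a generic solver loop; `step_fn` here is a hypothetical callable that performs one update and returns the current cost:

```python
def run_with_explosion_check(step_fn, num_steps):
    """Run up to num_steps updates, aborting early if the cost explodes."""
    original_cost = None
    history = []
    for _ in range(num_steps):
        cost = step_fn()
        if original_cost is None:
            original_cost = cost             # remember the starting cost
        if cost > 3 * original_cost:         # cost blew up: stop wasting compute
            break
        history.append(cost)
    return history
```

During a coarse hyperparameter search this lets bad settings fail fast.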
For example: run a coarse search for 5 epochs.

Note: it's best to optimize in log space!
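"Optimize in log space" means drawing the exponent uniformly rather than the value itself. The ranges below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparams():
    lr = 10 ** rng.uniform(-5, -3)    # learning rate uniform in log space over [1e-5, 1e-3]
    reg = 10 ** rng.uniform(-4, 0)    # regularization strength over [1e-4, 1]
    return lr, reg

samples = [sample_hyperparams() for _ in range(100)]
```

A plain `rng.uniform(1e-5, 1e-3)` would almost never sample learning rates near 1e-5; the log-space draw covers every decade evenly.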
Now run a finer search... adjust the range
53% - relatively good for a 2-layer neural net with 50 hidden neurons.
But this best cross-validation result is worrying. Why? If the best settings sit near the edge of the searched range, the true optimum may lie outside it.
Random Search vs. Grid Search
(Figure: Grid Layout vs. Random Layout, axes: Important Parameter vs. Unimportant Parameter. With a grid, many trials repeat the same value of the important parameter; random sampling tries a new value of it on every trial.)

Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017

Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012
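The figure's point in code: with 9 trials, a 3x3 grid tries only 3 distinct values of the important parameter, while 9 random draws almost surely try 9 (the values below are toy numbers, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# 9 trials each over (important, unimportant) in [0, 1] x [0, 1]
grid_trials = [(a, b) for a in (0.1, 0.5, 0.9) for b in (0.1, 0.5, 0.9)]
random_trials = [(rng.uniform(), rng.uniform()) for _ in range(9)]

grid_distinct = len({a for a, _ in grid_trials})       # distinct important-param values tried
random_distinct = len({a for a, _ in random_trials})
```

If performance depends mostly on the important parameter, random search explores it three times as finely for the same budget.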
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / Dropout strength)

(Image: "neural networks practitioner" at work; music = loss function. This image by Paolo Guereta is licensed under CC-BY 2.0)
Cross-validation “command center”
Monitor and visualize the loss curve
(Loss vs. time plot: the loss stays flat at first, then drops sharply.)

Bad initialization is a prime suspect.
Monitor and visualize the accuracy:
- big gap between train and val accuracy = overfitting => increase regularization strength?
- no gap => increase model capacity?
Track the ratio of weight updates / weight magnitudes:
ratio between the updates and values: ~ 0.0002 / 0.02 = 0.01 (about okay)
want this to be somewhere around 0.001 or so
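One way to compute this ratio for a single weight matrix (a sketch assuming plain SGD; the names and magnitudes are illustrative):

```python
import numpy as np

def update_ratio(W, dW, learning_rate):
    """Norm of the parameter update relative to the norm of the parameters."""
    update = -learning_rate * dW
    return np.linalg.norm(update.ravel()) / np.linalg.norm(W.ravel())

W = 0.02 * np.random.randn(100, 100)
dW = 0.02 * np.random.randn(100, 100)
ratio = update_ratio(W, dW, learning_rate=1e-3)   # roughly 1e-3 here
```

If the ratio is much larger, the learning rate may be too high; much smaller, too low.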
Summary

We looked in detail at:
- Activation Functions (use ReLU)
- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier/He init)
- Batch Normalization (use it)
- Babysitting the Learning process
- Hyperparameter Optimization (random sample hyperparams, in log space when appropriate)

TLDRs
Next time: Training Neural Networks, Part 2
- Parameter update schemes
- Learning rate schedules
- Gradient checking
- Regularization (Dropout etc.)
- Evaluation (Ensembles etc.)
- Transfer learning / fine-tuning