Statistical Machine Learning, Lecture 9
Convolutional neural networks. How to train neural networks.
Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University
[email protected]/katalog/nikwa778
1 / 28 N. Wahlström, 2020
Summary of Lecture 8 (I/III) - Neural network
A neural network is a sequential construction of several generalized linear regression models.
[Figure: feed-forward network with inputs x1, …, xp, hidden units q1, …, qU (activation h), and output y. Labels: Inputs, Hidden units, Output.]
$q_1 = h\big(b_1^{(1)} + \sum_{j=1}^{p} w_{1j}^{(1)} x_j\big)$
$q_2 = h\big(b_2^{(1)} + \sum_{j=1}^{p} w_{2j}^{(1)} x_j\big)$
$\vdots$
$q_U = h\big(b_U^{(1)} + \sum_{j=1}^{p} w_{Uj}^{(1)} x_j\big)$
$y = b^{(2)} + \sum_{i=1}^{U} w_i^{(2)} q_i$
Summary of Lecture 8 (I/III) - Neural network
A neural network is a sequential construction of several generalized linear regression models.
[Figure: the same network with inputs x1, …, xp, hidden units, and output y, now in matrix notation.]
$\mathbf{q} = h(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}), \qquad y = \mathbf{W}^{(2)}\mathbf{q} + b^{(2)}$

$\mathbf{W}^{(1)} = \begin{bmatrix} w_{11}^{(1)} & \cdots & w_{1p}^{(1)} \\ \vdots & & \vdots \\ w_{U1}^{(1)} & \cdots & w_{Up}^{(1)} \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} b_1^{(1)} \\ \vdots \\ b_U^{(1)} \end{bmatrix}, \quad \mathbf{q} = \begin{bmatrix} q_1 \\ \vdots \\ q_U \end{bmatrix}$

$\mathbf{W}^{(2)} = \begin{bmatrix} w_1^{(2)} & \cdots & w_U^{(2)} \end{bmatrix}, \quad \mathbf{b}^{(2)} = \begin{bmatrix} b^{(2)} \end{bmatrix}$

Weight matrix, offset vector, hidden units.
Summary of Lecture 8 (I/III) - Neural network
A neural network is a sequential construction of several generalized linear regression models.
[Figure: network with two hidden layers, inputs x1, …, xp, hidden units q(1) and q(2), and output y. Labels: Inputs, Hidden units, Hidden units, Output.]
$\mathbf{q}^{(1)} = h(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}), \quad \mathbf{q}^{(2)} = h(\mathbf{W}^{(2)}\mathbf{q}^{(1)} + \mathbf{b}^{(2)}), \quad y = \mathbf{W}^{(3)}\mathbf{q}^{(2)} + b^{(3)}$

The model learns better using a deep network (several layers) instead of a wide and shallow network.
NN for classification (M > 2 classes)
For M > 2 classes we want to predict the class probability for all M classes, $g_m = p(y = m \,|\, \mathbf{x})$. We extend the logistic function to the softmax activation function

$g_m = f_m(\mathbf{z}) = \frac{e^{z_m}}{\sum_{l=1}^{M} e^{z_l}}, \qquad m = 1, \dots, M.$
[Figure: network with inputs x1, …, xp, hidden units, logits z1, …, zM, softmax(z), and class probabilities g1, …, gM.]
$\mathrm{softmax}(\mathbf{z}) = [f_1(\mathbf{z}), \dots, f_M(\mathbf{z})]^\mathsf{T}$
Summary of Lecture 8 (III/III) - Example M = 3 classes
Consider an example with three classes, M = 3.
[Figure: network with logits z1 = 0.9, z2 = 4.2, z3 = 0.2 passed through softmax(z) to class probabilities g1 = 0.03, g2 = 0.95, g3 = 0.02; true output (one-hot encoding) y1 = 0, y2 = 1, y3 = 0.]
The network is trained by minimizing the cross-entropy
$L(\mathbf{y}, \mathbf{g}) = -\sum_{m=1}^{M} y_m \log(g_m) = -\log 0.95 \approx 0.05$
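The numbers in this example can be checked with a short Python sketch (not part of the lecture material; the function names are ours):

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, g):
    # L(y, g) = -sum_m y_m * log(g_m); with a one-hot y this picks out
    # the log-probability of the true class.
    return -sum(ym * math.log(gm) for ym, gm in zip(y, g))

z = [0.9, 4.2, 0.2]          # logits from the slide
g = softmax(z)               # class probabilities
loss = cross_entropy([0, 1, 0], g)

print([round(gi, 2) for gi in g])   # [0.03, 0.95, 0.02]
print(round(loss, 2))               # 0.05
```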
Guest lectures and content

Guest lecture: Wednesday March 11 at 14.15-15.00, Siegbahnsalen. SEB - Applications of Statistical Machine Learning in Finance.

Outline:
1. Previous lecture: the neural network model
• Neural network for regression
• Neural network for classification
2. This lecture:
• Convolutional neural network
• How to train a neural network
Convolutional neural networks
Convolutional neural networks (CNN) are a special kind of neural network tailored for problems where the input data has a grid-like structure.

Examples:
• Digital images (2D grid of pixels)
• Audio waveform data (1D grid, time series)
• Volumetric data, e.g. CT scans (3D grid)

The description here will focus on images.
Data representation of images
Consider a grayscale image of 6 × 6 pixels.
• Each pixel value represents the color. The value ranges from 0 (total absence, black) to 1 (total presence, white).
• The pixels are the input variables x1,1, x1,2, …, x6,6.
[Figure: a 6 × 6 grayscale image (left) and its data representation (right) as a grid of pixel values between 0.0 and 0.9.]
The convolutional layer
Consider a hidden layer with 6 × 6 hidden units.
• Dense layer: each hidden unit is connected with all pixels. Each pixel/hidden-unit pair has its own unique parameter.
• Convolutional layer: each hidden unit is connected with a region of pixels via a set of parameters, a so-called filter. Different hidden units share the same set of parameters.
[Figure: 6 × 6 input variables x1,1, …, x6,6 (left) mapped to 6 × 6 hidden units (right); a 3 × 3 region of pixels is connected to one hidden unit via the filter weights W(1) and the offset b(1).]
A convolutional layer uses sparse interactions and parameter sharing.
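As an illustration (our own sketch, not from the slides), a stride-1 convolutional layer with one 3 × 3 filter, zero padding at the border, and a ReLU as the activation h can be written in plain Python. The names `conv2d`, `W`, and `b` are chosen here for clarity:

```python
def conv2d(x, W, b):
    """Stride-1 convolution of a 2D image x with a 3x3 filter W and offset b,
    using zero padding so the output has the same size as the input.
    Every output unit reuses the same nine weights (parameter sharing)."""
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            z = b
            for a in range(3):
                for c in range(3):
                    r, s = i + a - 1, j + c - 1        # centre the filter on (i, j)
                    if 0 <= r < rows and 0 <= s < cols:  # zero padding outside
                        z += W[a][c] * x[r][s]
            out[i][j] = max(z, 0.0)                    # ReLU activation h
    return out

x = [[0.0] * 6 for _ in range(6)]    # a 6x6 image with a single bright pixel
x[2][2] = 1.0
W = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
q = conv2d(x, W, 0.0)
print(len(q), len(q[0]))             # 6 6: same spatial size as the input
```

Note that the filter has 9 weights and 1 offset regardless of the image size, in contrast to a dense layer.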
Condensing information with strides
• Problem: as we proceed through the network we want to condense the information.
• Solution: apply the filter at every second pixel, that is, use a stride of 2 (instead of 1).
[Figure: 6 × 6 input variables mapped to 3 × 3 hidden units; the filter W(1) with offset b(1) is applied with stride 2 (with zero padding at the border).]
With stride 2 we get half the number of rows and columns in the hidden layer.
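The effect of the stride on the grid size can be sketched in a few lines (our own illustration, assuming zero padding so that every stride position is valid):

```python
def output_size(n, stride):
    # Number of filter positions along one axis when the filter is applied
    # at every `stride`-th pixel of an n-pixel axis (with zero padding).
    return (n + stride - 1) // stride

# A 6x6 image: stride 1 keeps 6x6 hidden units, stride 2 gives 3x3.
print(output_size(6, 1))  # 6
print(output_size(6, 2))  # 3
```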
Multiple channels
• One filter per layer does not give enough flexibility.
• We therefore use multiple filters (visualized with different colors).
• Each filter produces its own set of hidden units, a channel.
[Figure: input variables of dim 6 × 6 × 1 mapped to hidden units of dim 6 × 6 × 4; each filter W(1) with its offset b(1) produces one channel.]
Hidden layers are organized in tensors of size (rows × columns × channels).
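A multi-channel convolutional layer can be sketched as follows (our own illustration, assuming NumPy is available; `conv_layer` is our name, and a ReLU is chosen as the activation h):

```python
import numpy as np

def conv_layer(x, W, b):
    """Convolutional layer with zero padding and stride 1.
    x: (rows, cols, C_in), W: (3, 3, C_in, C_out), b: (C_out,).
    Returns hidden units of shape (rows, cols, C_out)."""
    rows, cols, _ = x.shape
    c_out = W.shape[3]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))    # zero padding at the border
    q = np.empty((rows, cols, c_out))
    for i in range(rows):
        for j in range(cols):
            patch = xp[i:i + 3, j:j + 3, :]      # 3x3 region around pixel (i, j)
            for k in range(c_out):               # one output channel per filter
                q[i, j, k] = np.sum(patch * W[:, :, :, k]) + b[k]
    return np.maximum(q, 0.0)                    # ReLU activation h

rng = np.random.default_rng(0)
x = rng.random((6, 6, 1))                        # grayscale image, 1 channel
W = rng.standard_normal((3, 3, 1, 4))            # 4 filters -> 4 channels
q = conv_layer(x, W, np.zeros(4))
print(q.shape)                                   # (6, 6, 4)
```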
What is a tensor?
A tensor is a generalization of scalars, vectors and matrices to arbitrary order.
Scalar (order 0): $a = 3$

Vector (order 1): $\mathbf{b} = \begin{bmatrix} 3 \\ -2 \\ -1 \end{bmatrix}$

Matrix (order 2): $\mathbf{W} = \begin{bmatrix} 3 & 2 \\ -2 & 1 \\ -1 & 2 \end{bmatrix}$

Tensor (any order, here order 3): $\mathbf{T}_{:,:,1} = \begin{bmatrix} 3 & 2 \\ -2 & 1 \\ -1 & 2 \end{bmatrix}, \quad \mathbf{T}_{:,:,2} = \begin{bmatrix} -1 & 4 \\ 1 & 2 \\ -5 & 3 \end{bmatrix}$
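These orders map directly to array dimensions, which can be checked with NumPy (our own sketch, using the numbers from the slide; note that NumPy indexes the slices from 0):

```python
import numpy as np

a = np.array(3)                                   # scalar, order 0
b = np.array([3, -2, -1])                         # vector, order 1
W = np.array([[3, 2], [-2, 1], [-1, 2]])          # matrix, order 2
T = np.stack([W, np.array([[-1, 4], [1, 2], [-5, 3]])], axis=2)  # order 3

print(a.ndim, b.ndim, W.ndim, T.ndim)  # 0 1 2 3
print(T.shape)                         # (3, 2, 2)
print(T[:, :, 0])                      # the first slice, equal to W
```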
Multiple channels (cont.)
• A filter operates on all channels in a hidden layer.
• Each filter has dimension (filter rows × filter columns × input channels), here (3 × 3 × 4).
• We stack all filter parameters in a weight tensor with dimensions (filter rows × filter columns × input channels × output channels), here (3 × 3 × 4 × 6).
[Figure: first hidden layer of dim 6 × 6 × 4 mapped to second hidden layer of dim 3 × 3 × 6 by a convolutional layer with $\mathbf{W}^{(2)} \in \mathbb{R}^{3 \times 3 \times 4 \times 6}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{6}$.]
Full CNN architecture
• A full CNN usually consists of multiple convolutional layers (here two) and a few final dense layers (here two).
• If we have a classification problem at hand, we end with a softmax activation function to produce class probabilities.
[Figure: the full architecture]
Input variables, dim 6 × 6 × 1
→ Convolutional layer, $\mathbf{W}^{(1)} \in \mathbb{R}^{3\times3\times1\times4}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{4}$ → Hidden units, dim 6 × 6 × 4
→ Convolutional layer, $\mathbf{W}^{(2)} \in \mathbb{R}^{3\times3\times4\times6}$, $\mathbf{b}^{(2)} \in \mathbb{R}^{6}$ → Hidden units, dim 3 × 3 × 6
→ Dense layer, $\mathbf{W}^{(3)} \in \mathbb{R}^{54\times50}$, $\mathbf{b}^{(3)} \in \mathbb{R}^{50}$ → Hidden units, dim 50
→ Dense layer, $\mathbf{W}^{(4)} \in \mathbb{R}^{50\times10}$, $\mathbf{b}^{(4)} \in \mathbb{R}^{10}$ → Logits, dim 10
→ Softmax → Outputs, dim 10
Here we use 50 hidden units in the last hidden layer and consider a classification problem with M = 10 classes.
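As a sanity check on these dimensions (our own calculation, not from the slides), the number of parameters in each layer can be counted:

```python
# Each convolutional layer has (filter rows x filter cols x in channels
# x out channels) weights plus one offset per output channel; each dense
# layer has (in x out) weights plus one offset per output unit.
conv1 = 3 * 3 * 1 * 4 + 4      # W(1) and b(1)
conv2 = 3 * 3 * 4 * 6 + 6      # W(2) and b(2)
dense1 = 54 * 50 + 50          # W(3) and b(3); 54 = 3 * 3 * 6 flattened
dense2 = 50 * 10 + 10          # W(4) and b(4)
print(conv1, conv2, dense1, dense2)     # 40 222 2750 510
print(conv1 + conv2 + dense1 + dense2)  # 3522 parameters in total
```

Note how cheap the convolutional layers are compared to the dense ones, thanks to parameter sharing.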
Skin cancer – background
One recent result on the use of deep learning in medicine: detecting skin cancer (February 2017).
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M. and Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115-118, February 2017.

Some background figures (from the US) on skin cancer:
• Melanomas represent less than 5% of all skin cancers, but account for 75% of all skin-cancer-related deaths.
• Early detection is absolutely critical. Estimated 5-year survival rate for melanoma: over 99% if detected in its earlier stages and 14% if detected in its later stages.
Skin cancer – task
Image copyright Nature (doi:10.1038/nature21056)
Skin cancer – solution (ultrabrief)
In the paper they used the following network architecture.

Image copyright Nature (doi:10.1038/nature21056)

• Initialize all parameters from a neural network trained on 1.28 million images (transfer learning).
• From this initialization, learn new model parameters using 129 450 clinical images (∼100 times more images than any previous study).
• Use the model to predict the class based on unseen data.
Skin cancer – indication of the results
$\text{sensitivity} = \frac{\text{true positive}}{\text{positive}}, \qquad \text{specificity} = \frac{\text{true negative}}{\text{negative}}$
Image copyright Nature (doi:10.1038/nature21056)
Unconstrained numerical optimization
We train a network by considering the optimization problem
$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad J(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta})$
[Figure: cost function J(θ) with three local minimizers θ(A), θ(B) and θ(C); θ(A) is the global minimizer.]
• The best possible solution $\widehat{\boldsymbol{\theta}}$ is the global minimizer (θ(A)).
• The global minimizer is typically very hard to find, and we have to settle for a local minimizer (θ(A), θ(B) or θ(C)).
Iterative solution - Example 1D
[Figure: cost function J(θ) with iterates up to θ(5).]

In our search for a local minimizer we...
• ...make an initial guess of θ...
• ...and update θ iteratively.
Iterative solution (gradient descent) - Example 2D

[Figure: contour plot of J(θ) over $\boldsymbol{\theta} = [b, w]^\mathsf{T} \in \mathbb{R}^2$, with gradient descent iterates.]

1. Pick a θ(0).
2. While (not converged):
• Update $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \mathbf{d}^{(t)}$, where $\mathbf{d}^{(t)} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$
• Update t := t + 1

We call γ ∈ R the step length or learning rate.
• How big a learning rate γ should we use?
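The loop in step 2 can be sketched on a hypothetical quadratic cost (our own example, not the network cost), with a hand-picked learning rate:

```python
def grad_J(theta):
    # Gradient of the hypothetical cost J(theta) = (b - 1)^2 + (w + 2)^2,
    # whose global minimizer is theta = [1, -2].
    b, w = theta
    return [2 * (b - 1), 2 * (w + 2)]

theta = [0.0, 0.0]     # initial guess theta(0)
gamma = 0.1            # learning rate
for t in range(100):
    d = grad_J(theta)                                    # d(t)
    theta = [theta[k] - gamma * d[k] for k in range(2)]  # gradient step

print([round(x, 3) for x in theta])  # [1.0, -2.0]
```

With gamma = 0.1 each iteration shrinks the error by a factor 0.8, so the iterates converge geometrically; a much larger gamma would make them oscillate or diverge.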
Learning rate
[Figure: three cost curves J(θ): γ too low, γ too high, γ ok.]

Tuning strategy: if the cost function...
• ...decreases very slowly ⇒ increase the learning rate.
• ...oscillates widely ⇒ reduce the learning rate.
Computational challenge 1 - dim(θ) is big
At each optimization step we need to compute the gradient

$\mathbf{d}^{(t)} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t)}).$

Computational challenge 1 - dim(θ) big: a neural network contains a lot of parameters. Computing the gradient is costly.

Solution: a NN is a composition of multiple layers. Hence, each term $\nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta})$ can be computed efficiently by repeatedly applying the chain rule. This is called the back-propagation algorithm. Not part of the course.
Computational challenge 2 - n is big
At each optimization step we need to compute the gradient

$\mathbf{d}^{(t)} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t)}).$

Computational challenge 2 - n big: we typically use a lot of training data n for training the neural network. Computing the gradient is costly.

Solution: at each iteration, we only use a small part of the data set to compute the gradient $\mathbf{d}^{(t)}$. This is called the stochastic gradient.
Stochastic gradient
A big data set is often redundant: many data points are similar.

[Figure: training data x1, …, x20 with outputs y1, …, y20.]

If the training data set is big,

$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}) \quad \text{and} \quad \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}).$

We can do the update with only half the computation cost!

$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t)}),$

$\boldsymbol{\theta}^{(t+2)} = \boldsymbol{\theta}^{(t+1)} - \gamma \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t+1)}).$
Stochastic gradient
[Figure: training data x1, …, x20 with outputs y1, …, y20; data points x11, …, x15 form one mini-batch.]

$\boldsymbol{\theta}^{(3)} = \boldsymbol{\theta}^{(2)} - \gamma \frac{1}{5} \sum_{i=11}^{15} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(2)})$

• The extreme version of this strategy is to use only one data point at each training step (called online learning).
• We typically do something in between (not one data point, and not all data): we use a smaller set called a mini-batch.
• One pass through the training data is called an epoch.
Stochastic gradient
[Figure: randomly reshuffled training data x7, x10, x3, …, x8 divided into mini-batches; iteration 3, epoch 1.]

• If we pick the mini-batches in order, they might be unbalanced and not representative of the whole data set.
• Therefore, we pick data points at random from the training data to form a mini-batch.
• One implementation is to randomly reshuffle the data before dividing it into mini-batches.
• After each epoch we do another reshuffling and another pass through the data set.
Mini-batch gradient descent
The full stochastic gradient algorithm (a.k.a. mini-batch gradient descent) is as follows:

1. Initialize $\boldsymbol{\theta}^{(0)}$, set $t \leftarrow 1$, choose batch size $n_b$ and number of epochs $E$.
2. For $i = 1$ to $E$:
(a) Randomly shuffle the training data $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{n}$.
(b) For $j = 1$ to $\frac{n}{n_b}$:
(i) Approximate the gradient of the loss function using the mini-batch $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=(j-1)n_b+1}^{j n_b}$,
$\mathbf{d}^{(t)} = \frac{1}{n_b} \sum_{i=(j-1)n_b+1}^{j n_b} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}) \big|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}}.$
(ii) Do a gradient step $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \mathbf{d}^{(t)}$.
(iii) Update the iteration index $t \leftarrow t + 1$.
At each iteration we get a stochastic approximation of the true gradient, $\mathbf{d}^{(t)} \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}) \big|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}}$, hence the name.
A few concepts to summarize lecture 9

Convolutional neural network (CNN): a NN with a particular structure tailored for input data with a grid-like structure, for example images.

Filter (a.k.a. kernel): a set of parameters that is convolved with a hidden layer. Each filter produces a new channel.

Channel: a set of hidden units produced by the same filter. Each hidden layer consists of one or more channels.

Stride: a positive integer deciding how many steps to move the filter during the convolution.

Tensor: a generalization of matrices to arbitrary order.

Gradient descent: an iterative optimization algorithm where at each iteration we take a step proportional to the negative gradient.

Learning rate (a.k.a. step length): a scalar tuning parameter deciding the length of each gradient step in gradient descent.

Stochastic gradient (SG): a version of gradient descent where at each iteration we only use a small part of the training data (a mini-batch).

Mini-batch: the group of training data that we use at each iteration in SG.

Batch size: the number of data points in one mini-batch.

Epoch: one complete pass through the entire training data set using SG.