Statistical Machine Learning, Lecture 9
Convolutional neural networks. How to train neural networks.
Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University
[email protected]/katalog/nikwa778
1 / 28 N. Wahlström, 2020
Summary of Lecture 8 (I/III) - Neural network
A neural network is a sequential construction of several generalized linear regression models.
[Figure: feed-forward network with inputs x1, …, xp, hidden units q1, …, qU (activation h), and output y. Labels: Inputs, Hidden units, Output.]
$q_1 = h\big(b_1^{(1)} + \sum_{j=1}^{p} w_{1j}^{(1)} x_j\big)$
$q_2 = h\big(b_2^{(1)} + \sum_{j=1}^{p} w_{2j}^{(1)} x_j\big)$
$\vdots$
$q_U = h\big(b_U^{(1)} + \sum_{j=1}^{p} w_{Uj}^{(1)} x_j\big)$
$y = b^{(2)} + \sum_{i=1}^{U} w_i^{(2)} q_i$
Summary of Lecture 8 (I/III) - Neural network
A neural network is a sequential construction of several generalized linear regression models.
[Figure: the same network with inputs x1, …, xp, hidden units, and output y, now in matrix notation.]
$\mathbf{q} = h(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}), \qquad y = \mathbf{W}^{(2)}\mathbf{q} + b^{(2)}$

$\mathbf{W}^{(1)} = \begin{bmatrix} w_{11}^{(1)} & \cdots & w_{1p}^{(1)} \\ \vdots & & \vdots \\ w_{U1}^{(1)} & \cdots & w_{Up}^{(1)} \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} b_1^{(1)} \\ \vdots \\ b_U^{(1)} \end{bmatrix}, \quad \mathbf{q} = \begin{bmatrix} q_1 \\ \vdots \\ q_U \end{bmatrix}$

$\mathbf{W}^{(2)} = \begin{bmatrix} w_1^{(2)} & \cdots & w_U^{(2)} \end{bmatrix}, \quad \mathbf{b}^{(2)} = \begin{bmatrix} b^{(2)} \end{bmatrix}$

Weight matrix, offset vector, hidden units.
Summary of Lecture 8 (I/III) - Neural network
A neural network is a sequential construction of several generalized linear regression models.
[Figure: network with two hidden layers, inputs x1, …, xp, hidden units q(1) and q(2), and output y. Labels: Inputs, Hidden units, Hidden units, Output.]
$\mathbf{q}^{(1)} = h(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}), \quad \mathbf{q}^{(2)} = h(\mathbf{W}^{(2)}\mathbf{q}^{(1)} + \mathbf{b}^{(2)}), \quad y = \mathbf{W}^{(3)}\mathbf{q}^{(2)} + b^{(3)}$

The model learns better using a deep network (several layers) instead of a wide and shallow network.
NN for classification (M > 2 classes)
For M > 2 classes we want to predict the class probability for all M classes, $g_m = p(y = m \,|\, \mathbf{x})$. We extend the logistic function to the softmax activation function

$g_m = f_m(\mathbf{z}) = \frac{e^{z_m}}{\sum_{l=1}^{M} e^{z_l}}, \qquad m = 1, \dots, M.$
[Figure: network with inputs x1, …, xp, hidden units, logits z1, …, zM, softmax(z), and class probabilities g1, …, gM.]
$\mathrm{softmax}(\mathbf{z}) = [f_1(\mathbf{z}), \dots, f_M(\mathbf{z})]^\mathsf{T}$
Summary of Lecture 8 (III/III) - Example M = 3 classes
Consider an example with three classes, M = 3.
[Figure: network with logits z1 = 0.9, z2 = 4.2, z3 = 0.2 passed through softmax(z) to class probabilities g1 = 0.03, g2 = 0.95, g3 = 0.02; true output (one-hot encoding) y1 = 0, y2 = 1, y3 = 0.]
The network is trained by minimizing the cross-entropy
$L(\mathbf{y}, \mathbf{g}) = -\sum_{m=1}^{M} y_m \log(g_m) = -\log 0.95 \approx 0.05$
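The numbers in this example can be checked with a short Python sketch (not part of the lecture material; the function names are ours):

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, g):
    # L(y, g) = -sum_m y_m * log(g_m); with a one-hot y this picks out
    # the log-probability of the true class.
    return -sum(ym * math.log(gm) for ym, gm in zip(y, g))

z = [0.9, 4.2, 0.2]          # logits from the slide
g = softmax(z)               # class probabilities
loss = cross_entropy([0, 1, 0], g)

print([round(gi, 2) for gi in g])   # [0.03, 0.95, 0.02]
print(round(loss, 2))               # 0.05
```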
Guest lectures and content

Guest lecture: Wednesday March 11 at 14.15-15.00, Siegbahnsalen. SEB - Applications of Statistical Machine Learning in Finance.

Outline:
1. Previous lecture: the neural network model
• Neural network for regression
• Neural network for classification
2. This lecture:
• Convolutional neural network
• How to train a neural network
Convolutional neural networks
Convolutional neural networks (CNN) are a special kind of neural network tailored for problems where the input data has a grid-like structure.

Examples:
• Digital images (2D grid of pixels)
• Audio waveform data (1D grid, time series)
• Volumetric data, e.g. CT scans (3D grid)

The description here will focus on images.
Data representation of images
Consider a grayscale image of 6 × 6 pixels.
• Each pixel value represents the color. The value ranges from 0 (total absence, black) to 1 (total presence, white).
• The pixels are the input variables x1,1, x1,2, …, x6,6.
[Figure: a 6 × 6 grayscale image (left) and its data representation (right) as a grid of pixel values between 0.0 and 0.9.]
The convolutional layer
Consider a hidden layer with 6 × 6 hidden units.
• Dense layer: each hidden unit is connected with all pixels. Each pixel/hidden-unit pair has its own unique parameter.
• Convolutional layer: each hidden unit is connected with a region of pixels via a set of parameters, a so-called filter. Different hidden units share the same set of parameters.
[Figure: 6 × 6 input variables x1,1, …, x6,6 (left) mapped to 6 × 6 hidden units (right); a 3 × 3 region of pixels is connected to one hidden unit via the filter weights W(1) and the offset b(1).]
A convolutional layer uses sparse interactions and parameter sharing.
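As an illustration (our own sketch, not from the slides), a stride-1 convolutional layer with one 3 × 3 filter, zero padding at the border, and a ReLU as the activation h can be written in plain Python. The names `conv2d`, `W`, and `b` are chosen here for clarity:

```python
def conv2d(x, W, b):
    """Stride-1 convolution of a 2D image x with a 3x3 filter W and offset b,
    using zero padding so the output has the same size as the input.
    Every output unit reuses the same nine weights (parameter sharing)."""
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            z = b
            for a in range(3):
                for c in range(3):
                    r, s = i + a - 1, j + c - 1        # centre the filter on (i, j)
                    if 0 <= r < rows and 0 <= s < cols:  # zero padding outside
                        z += W[a][c] * x[r][s]
            out[i][j] = max(z, 0.0)                    # ReLU activation h
    return out

x = [[0.0] * 6 for _ in range(6)]    # a 6x6 image with a single bright pixel
x[2][2] = 1.0
W = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
q = conv2d(x, W, 0.0)
print(len(q), len(q[0]))             # 6 6: same spatial size as the input
```

Note that the filter has 9 weights and 1 offset regardless of the image size, in contrast to a dense layer.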
Condensing information with strides
• Problem: as we proceed through the network we want to condense the information.
• Solution: apply the filter at every second pixel, that is, use a stride of 2 (instead of 1).
[Figure: 6 × 6 input variables mapped to 3 × 3 hidden units; the filter W(1) with offset b(1) is applied with stride 2 (with zero padding at the border).]
With stride 2 we get half the number of rows and columns in the hidden layer.
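The effect of the stride on the grid size can be sketched in a few lines (our own illustration, assuming zero padding so that every stride position is valid):

```python
def output_size(n, stride):
    # Number of filter positions along one axis when the filter is applied
    # at every `stride`-th pixel of an n-pixel axis (with zero padding).
    return (n + stride - 1) // stride

# A 6x6 image: stride 1 keeps 6x6 hidden units, stride 2 gives 3x3.
print(output_size(6, 1))  # 6
print(output_size(6, 2))  # 3
```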
Multiple channels
• One filter per layer does not give enough flexibility.
• We therefore use multiple filters (visualized with different colors).
• Each filter produces its own set of hidden units, a channel.
[Figure: input variables of dim 6 × 6 × 1 mapped to hidden units of dim 6 × 6 × 4; each filter W(1) with its offset b(1) produces one channel.]
Hidden layers are organized in tensors of size (rows × columns × channels).
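A multi-channel convolutional layer can be sketched as follows (our own illustration, assuming NumPy is available; `conv_layer` is our name, and a ReLU is chosen as the activation h):

```python
import numpy as np

def conv_layer(x, W, b):
    """Convolutional layer with zero padding and stride 1.
    x: (rows, cols, C_in), W: (3, 3, C_in, C_out), b: (C_out,).
    Returns hidden units of shape (rows, cols, C_out)."""
    rows, cols, _ = x.shape
    c_out = W.shape[3]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))    # zero padding at the border
    q = np.empty((rows, cols, c_out))
    for i in range(rows):
        for j in range(cols):
            patch = xp[i:i + 3, j:j + 3, :]      # 3x3 region around pixel (i, j)
            for k in range(c_out):               # one output channel per filter
                q[i, j, k] = np.sum(patch * W[:, :, :, k]) + b[k]
    return np.maximum(q, 0.0)                    # ReLU activation h

rng = np.random.default_rng(0)
x = rng.random((6, 6, 1))                        # grayscale image, 1 channel
W = rng.standard_normal((3, 3, 1, 4))            # 4 filters -> 4 channels
q = conv_layer(x, W, np.zeros(4))
print(q.shape)                                   # (6, 6, 4)
```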
What is a tensor?
A tensor is a generalization of scalars, vectors and matrices to arbitrary order.
Scalar (order 0): $a = 3$

Vector (order 1): $\mathbf{b} = \begin{bmatrix} 3 \\ -2 \\ -1 \end{bmatrix}$

Matrix (order 2): $\mathbf{W} = \begin{bmatrix} 3 & 2 \\ -2 & 1 \\ -1 & 2 \end{bmatrix}$

Tensor (any order, here order 3): $\mathbf{T}_{:,:,1} = \begin{bmatrix} 3 & 2 \\ -2 & 1 \\ -1 & 2 \end{bmatrix}, \quad \mathbf{T}_{:,:,2} = \begin{bmatrix} -1 & 4 \\ 1 & 2 \\ -5 & 3 \end{bmatrix}$
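These orders map directly to array dimensions, which can be checked with NumPy (our own sketch, using the numbers from the slide; note that NumPy indexes the slices from 0):

```python
import numpy as np

a = np.array(3)                                   # scalar, order 0
b = np.array([3, -2, -1])                         # vector, order 1
W = np.array([[3, 2], [-2, 1], [-1, 2]])          # matrix, order 2
T = np.stack([W, np.array([[-1, 4], [1, 2], [-5, 3]])], axis=2)  # order 3

print(a.ndim, b.ndim, W.ndim, T.ndim)  # 0 1 2 3
print(T.shape)                         # (3, 2, 2)
print(T[:, :, 0])                      # the first slice, equal to W
```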
Multiple channels (cont.)
• A filter operates on all channels in a hidden layer.
• Each filter has dimension (filter rows × filter columns × input channels), here (3 × 3 × 4).
• We stack all filter parameters in a weight tensor with dimensions (filter rows × filter columns × input channels × output channels), here (3 × 3 × 4 × 6).
[Figure: first hidden layer of dim 6 × 6 × 4 mapped to second hidden layer of dim 3 × 3 × 6 by a convolutional layer with $\mathbf{W}^{(2)} \in \mathbb{R}^{3 \times 3 \times 4 \times 6}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{6}$.]
Full CNN architecture
• A full CNN usually consists of multiple convolutional layers (here two) and a few final dense layers (here two).
• If we have a classification problem at hand, we end with a softmax activation function to produce class probabilities.
[Figure: the full architecture]
Input variables, dim 6 × 6 × 1
→ Convolutional layer, $\mathbf{W}^{(1)} \in \mathbb{R}^{3\times3\times1\times4}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{4}$ → Hidden units, dim 6 × 6 × 4
→ Convolutional layer, $\mathbf{W}^{(2)} \in \mathbb{R}^{3\times3\times4\times6}$, $\mathbf{b}^{(2)} \in \mathbb{R}^{6}$ → Hidden units, dim 3 × 3 × 6
→ Dense layer, $\mathbf{W}^{(3)} \in \mathbb{R}^{54\times50}$, $\mathbf{b}^{(3)} \in \mathbb{R}^{50}$ → Hidden units, dim 50
→ Dense layer, $\mathbf{W}^{(4)} \in \mathbb{R}^{50\times10}$, $\mathbf{b}^{(4)} \in \mathbb{R}^{10}$ → Logits, dim 10
→ Softmax → Outputs, dim 10
Here we use 50 hidden units in the last hidden layer and consider a classification problem with M = 10 classes.
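As a sanity check on these dimensions (our own calculation, not from the slides), the number of parameters in each layer can be counted:

```python
# Each convolutional layer has (filter rows x filter cols x in channels
# x out channels) weights plus one offset per output channel; each dense
# layer has (in x out) weights plus one offset per output unit.
conv1 = 3 * 3 * 1 * 4 + 4      # W(1) and b(1)
conv2 = 3 * 3 * 4 * 6 + 6      # W(2) and b(2)
dense1 = 54 * 50 + 50          # W(3) and b(3); 54 = 3 * 3 * 6 flattened
dense2 = 50 * 10 + 10          # W(4) and b(4)
print(conv1, conv2, dense1, dense2)     # 40 222 2750 510
print(conv1 + conv2 + dense1 + dense2)  # 3522 parameters in total
```

Note how cheap the convolutional layers are compared to the dense ones, thanks to parameter sharing.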
Skin cancer – background
One recent result on the use of deep learning in medicine: detecting skin cancer (February 2017).
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M. and Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115-118, February 2017.

Some background figures (from the US) on skin cancer:
• Melanomas represent less than 5% of all skin cancers, but account for 75% of all skin-cancer-related deaths.
• Early detection is absolutely critical. Estimated 5-year survival rate for melanoma: over 99% if detected in its earlier stages and 14% if detected in its later stages.
Skin cancer – task
Image copyright Nature (doi:10.1038/nature21056)
Skin cancer – solution (ultrabrief)
In the paper they used the following network architecture.

Image copyright Nature (doi:10.1038/nature21056)

• Initialize all parameters from a neural network trained on 1.28 million images (transfer learning).
• From this initialization, learn new model parameters using 129 450 clinical images (∼100 times more images than any previous study).
• Use the model to predict the class based on unseen data.
Skin cancer – indication of the results
$\text{sensitivity} = \frac{\text{true positive}}{\text{positive}}, \qquad \text{specificity} = \frac{\text{true negative}}{\text{negative}}$
Image copyright Nature (doi:10.1038/nature21056)
Unconstrained numerical optimization
We train a network by considering the optimization problem
$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad J(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta})$
[Figure: cost function J(θ) with three local minimizers θ(A), θ(B) and θ(C); θ(A) is the global minimizer.]
• The best possible solution $\widehat{\boldsymbol{\theta}}$ is the global minimizer (θ(A)).
• The global minimizer is typically very hard to find, and we have to settle for a local minimizer (θ(A), θ(B) or θ(C)).
Iterative solution - Example 1D
[Figure: cost function J(θ) with iterates up to θ(5).]

In our search for a local minimizer we...
• ...make an initial guess of θ...
• ...and update θ iteratively.
Iterative solution (gradient descent) - Example 2D

[Figure: contour plot of J(θ) over $\boldsymbol{\theta} = [b, w]^\mathsf{T} \in \mathbb{R}^2$, with gradient descent iterates.]

1. Pick a θ(0).
2. While (not converged):
• Update $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \mathbf{d}^{(t)}$, where $\mathbf{d}^{(t)} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$
• Update t := t + 1

We call γ ∈ R the step length or learning rate.
• How big a learning rate γ should we use?
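The loop in step 2 can be sketched on a hypothetical quadratic cost (our own example, not the network cost), with a hand-picked learning rate:

```python
def grad_J(theta):
    # Gradient of the hypothetical cost J(theta) = (b - 1)^2 + (w + 2)^2,
    # whose global minimizer is theta = [1, -2].
    b, w = theta
    return [2 * (b - 1), 2 * (w + 2)]

theta = [0.0, 0.0]     # initial guess theta(0)
gamma = 0.1            # learning rate
for t in range(100):
    d = grad_J(theta)                                    # d(t)
    theta = [theta[k] - gamma * d[k] for k in range(2)]  # gradient step

print([round(x, 3) for x in theta])  # [1.0, -2.0]
```

With gamma = 0.1 each iteration shrinks the error by a factor 0.8, so the iterates converge geometrically; a much larger gamma would make them oscillate or diverge.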
Learning rate
[Figure: three cost curves J(θ): γ too low, γ too high, γ ok.]

Tuning strategy: if the cost function...
• ...decreases very slowly ⇒ increase the learning rate.
• ...oscillates widely ⇒ reduce the learning rate.
Computational challenge 1 - dim(θ) is big
At each optimization step we need to compute the gradient

$\mathbf{d}^{(t)} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t)}).$

Computational challenge 1 - dim(θ) big: a neural network contains a lot of parameters. Computing the gradient is costly.

Solution: a NN is a composition of multiple layers. Hence, each term $\nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta})$ can be computed efficiently by repeatedly applying the chain rule. This is called the back-propagation algorithm. Not part of the course.
Computational challenge 2 - n is big
At each optimization step we need to compute the gradient

$\mathbf{d}^{(t)} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t)}).$

Computational challenge 2 - n big: we typically use a lot of training data n for training the neural network. Computing the gradient is costly.

Solution: at each iteration, we only use a small part of the data set to compute the gradient $\mathbf{d}^{(t)}$. This is called the stochastic gradient.
Stochastic gradient
A big data set is often redundant: many data points are similar.

[Figure: training data x1, …, x20 with outputs y1, …, y20.]

If the training data set is big,

$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}) \quad \text{and} \quad \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}).$

We can do the update with only half the computation cost!

$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t)}),$

$\boldsymbol{\theta}^{(t+2)} = \boldsymbol{\theta}^{(t+1)} - \gamma \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(t+1)}).$
Stochastic gradient
[Figure: training data x1, …, x20 with outputs y1, …, y20; data points x11, …, x15 form one mini-batch.]

$\boldsymbol{\theta}^{(3)} = \boldsymbol{\theta}^{(2)} - \gamma \frac{1}{5} \sum_{i=11}^{15} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}^{(2)})$

• The extreme version of this strategy is to use only one data point at each training step (called online learning).
• We typically do something in between (not one data point, and not all data): we use a smaller set called a mini-batch.
• One pass through the training data is called an epoch.
Stochastic gradient
[Figure: randomly reshuffled training data x7, x10, x3, …, x8 divided into mini-batches; iteration 3, epoch 1.]

• If we pick the mini-batches in order, they might be unbalanced and not representative of the whole data set.
• Therefore, we pick data points at random from the training data to form a mini-batch.
• One implementation is to randomly reshuffle the data before dividing it into mini-batches.
• After each epoch we do another reshuffling and another pass through the data set.
Mini-batch gradient descent
The full stochastic gradient algorithm (a.k.a. mini-batch gradient descent) is as follows:

1. Initialize $\boldsymbol{\theta}^{(0)}$, set $t \leftarrow 1$, choose batch size $n_b$ and number of epochs $E$.
2. For $i = 1$ to $E$:
(a) Randomly shuffle the training data $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{n}$.
(b) For $j = 1$ to $\frac{n}{n_b}$:
(i) Approximate the gradient of the loss function using the mini-batch $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=(j-1)n_b+1}^{j n_b}$,
$\mathbf{d}^{(t)} = \frac{1}{n_b} \sum_{i=(j-1)n_b+1}^{j n_b} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}) \big|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}}.$
(ii) Do a gradient step $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \mathbf{d}^{(t)}$.
(iii) Update the iteration index $t \leftarrow t + 1$.
At each iteration we get a stochastic approximation of the true gradient, $\mathbf{d}^{(t)} \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}) \big|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}}$, hence the name.
A few concepts to summarize lecture 9

Convolutional neural network (CNN): a NN with a particular structure tailored for input data with a grid-like structure, for example images.

Filter (a.k.a. kernel): a set of parameters that is convolved with a hidden layer. Each filter produces a new channel.

Channel: a set of hidden units produced by the same filter. Each hidden layer consists of one or more channels.

Stride: a positive integer deciding how many steps to move the filter during the convolution.

Tensor: a generalization of matrices to arbitrary order.

Gradient descent: an iterative optimization algorithm where at each iteration we take a step proportional to the negative gradient.

Learning rate (a.k.a. step length): a scalar tuning parameter deciding the length of each gradient step in gradient descent.

Stochastic gradient (SG): a version of gradient descent where at each iteration we only use a small part of the training data (a mini-batch).

Mini-batch: the group of training data that we use at each iteration in SG.

Batch size: the number of data points in one mini-batch.

Epoch: one complete pass through the entire training data set using SG.