
Page 1:

COMP 551 - Applied Machine Learning
Lecture 14 --- Backpropagation
William L. Hamilton (with slides and content from Joelle Pineau)

* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

Page 2:

Midterm

§ The midterm is scheduled for Nov. 18th at 6-8pm.
§ You can bring a 1-page (double-sided) cheat sheet, either hand-written or in minimum 9-point font.
§ I will release practice questions at least 2 weeks before the exam.
§ If you cannot attend the midterm, you must contact the course staff and let us know by the end of the week.
§ (Email the head TA, [email protected], and cc me.)

Page 3:

MiniProject Clarifications

§ MiniProject 3:
§ You can use pre-trained models.
§ You can use TensorFlow, PyTorch, etc.
§ You just can't use things like https://cloud.google.com/vision/
§ MiniProject 4:
§ "Baselines track": Re-implement some baselines in their paper and/or implement a simple model to solve the task in their paper.
§ "Ablations track": Experiment with their code to evaluate model robustness, the utility of different model components, the importance of hyperparameters, etc.

Page 4:

PyTorch Tutorial Hours

§ 10/28 4:30-6:00pm
§ 10/29 5:00-7:00pm
§ 10/30 4:00-6:00pm
§ 10/31 8:30-10:30am

Page 5:

Why neural networks?

§ Key idea: Learn good features/representations rather than doing manual feature engineering.
§ Hidden layers correspond to "higher-level" learned features.
§ Critical in domains where manual feature engineering is difficult/impossible (e.g., images, raw audio), and empirically state-of-the-art in most domains.
§ Allow for "end-to-end" learning, where manual feature engineering/selection is no longer necessary.

Page 6:

Generalizing the feed-forward NN

§ Can use arbitrary output activation functions.
§ In practice, we do not necessarily need to use a sigmoid activation in the hidden layer.
§ We can make networks as deep as we want.
§ We can add regularization.

Each hidden layer computes $h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$, where $\sigma_i$ can be an arbitrary non-linear activation function (see the sketch below).

[Figure: a feed-forward network with two hidden layers (units $h^{(1)}_1, h^{(1)}_2$ and $h^{(2)}_1, h^{(2)}_2, h^{(2)}_3$), output weights $w^{(1)}_{\text{out}}, w^{(2)}_{\text{out}}$, and outputs $y^{(1)}, y^{(2)}$.]
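To make the recurrence $h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$ concrete, here is a minimal NumPy sketch of the forward pass through a stack of such layers. The layer sizes, the ReLU/identity activations, and the random initialization are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def layer(h_prev, W, b, activation):
    """One feed-forward layer: h = activation(W h_prev + b)."""
    return activation(W @ h_prev + b)

relu = lambda z: np.maximum(z, 0)        # example hidden activation
identity = lambda z: z                   # example output "activation"

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                     # input dim 4, two hidden layers, 2 outputs
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)                   # h^(0) is just the input
h = x
for i, (W, b) in enumerate(params):
    act = relu if i < len(params) - 1 else identity
    h = layer(h, W, b, act)
print(h)                                 # the network output y
```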

Page 7:

Activation functions

$h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$

§ There are many choices for the activation function!
§ It must be non-linear.
§ Stacking multiple linear functions is equivalent to a single linear function, so there would be no gain.

Page 8:

Activation functions: Sigmoid

$h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$

§ The sigmoid activation is the "classic"/"original" activation function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

§ Easy to differentiate.
§ Can often be interpreted as a probability.
§ Easily "saturates", i.e., for inputs outside the range [-4, 4] it is essentially constant, which can make it hard to train.

Page 9:

Activation functions: Tanh

$h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$

§ The tanh activation is another popular and traditional activation function:

$\sigma(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

§ Easy to differentiate:

$\frac{\partial \tanh(z)}{\partial z} = 1 - \tanh^2(z)$

§ Can often be interpreted as a probability.
§ Slightly less prone to "saturation" than the sigmoid, but still has a fixed range.

Page 10:

Activation functions: ReLU

$h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$

§ The Rectified Linear Unit (ReLU) is the de facto standard in deep learning:

$\mathrm{ReLU}(z) = \max(z, 0)$

§ Unbounded range, so it never saturates (i.e., the activation can increase indefinitely).
§ Very strong empirical results.

Page 11:

Activation functions: softplus

$h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$

§ There have been many variants of the ReLU proposed in recent years, e.g., the softplus:

$\mathrm{softplus}(z) = \ln(1 + e^{z})$

§ Similar to ReLU but smoother.
§ Expensive derivative compared to ReLU.
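The four activations above are one-liners in code; the following NumPy sketch collects them for reference. The numerically stabilized form of softplus is an implementation choice, not something from the slides.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)); essentially constant outside roughly [-4, 4]
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z); derivative is 1 - tanh(z)^2
    return np.tanh(z)

def relu(z):
    # ReLU(z) = max(z, 0); unbounded above, so it never saturates
    return np.maximum(z, 0.0)

def softplus(z):
    # softplus(z) = ln(1 + e^z), written as max(z, 0) + log1p(exp(-|z|))
    # to avoid overflow for large z (an implementation detail, not from the slides)
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

z = np.linspace(-6, 6, 5)
print(sigmoid(z), tanh(z), relu(z), softplus(z), sep="\n")
```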

Page 12:

Regularization

§ We can combat overfitting in neural networks by adding regularization.
§ The standard approach is to apply L2-regularization to certain layers:

$J_{\text{reg}} = J + \lambda \left( \sum_{i=1}^{H} \|W^{(i)}\|_F^2 + \|w_{\text{out}}\|_2^2 \right)$

Here $J_{\text{reg}}$ is the regularized loss, $J$ is the original loss, $\|W^{(i)}\|_F^2$ is the squared Frobenius norm of the hidden-layer weights ("Frobenius" = 2-norm for matrices), and $\|w_{\text{out}}\|_2^2$ is the squared 2-norm of the output weights (see the code sketch below).

[Figure: the same two-hidden-layer network as before, with layers $h^{(i)} = \sigma_i(W^{(i)} h^{(i-1)} + b^{(i)})$, output weights $w^{(1)}_{\text{out}}, w^{(2)}_{\text{out}}$, and outputs $y^{(1)}, y^{(2)}$.]
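A minimal sketch of how this penalty can be added to a loss, here with PyTorch tensors; the layer shapes, the value of lambda, and the MSE base loss are illustrative assumptions.

```python
import torch

def l2_penalty(hidden_weights, w_out):
    # sum of squared Frobenius norms of the hidden-layer weight matrices
    # plus the squared 2-norm of the output weight vector
    return sum((W ** 2).sum() for W in hidden_weights) + (w_out ** 2).sum()

# toy parameters (shapes are arbitrary illustrations)
hidden_weights = [torch.randn(5, 4, requires_grad=True),
                  torch.randn(3, 5, requires_grad=True)]
w_out = torch.randn(3, requires_grad=True)

lam = 1e-3                                           # regularization strength (hyperparameter)
y_pred, y_true = torch.randn(8), torch.randn(8)
J = torch.nn.functional.mse_loss(y_pred, y_true)     # the "original" loss
J_reg = J + lam * l2_penalty(hidden_weights, w_out)  # the regularized loss
```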

Page 13:

Complex architectures

§ Neural network architectures can get very complicated!
§ E.g., we might want to combine feature information from different modalities (e.g., text and images) using different networks.
§ Many more complex architectures will be covered in upcoming lectures!

[Figure: separate networks processing text features and image features.]

Page 14:

How are we going to train these models?

§ As long as all the operations are differentiable (or "sub-differentiable") we can always just apply the chain rule and compute the derivatives.
§ But manually computing the gradient would be very painful/tedious!
§ Need to recompute every time we modify the architecture...
§ Lots of room for minor bugs in the code...
§ Solution: Automatically compute the gradient.

Page 15:

Computation graphs

§ A very simple "neural network" (i.e., computation graph):

c = a + b
d = b + 1
e = c * d

§ Data is transformed at the nodes and "flows" along the arrows.
§ a and b are source nodes (i.e., the input) and e is the sink (i.e., the output).

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]

Page 16:

Computation graphs

§ The forward pass in a computation graph = setting the source variables/nodes and determining the sink/output value (see the sketch below).

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]
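A minimal sketch of this forward pass on the example graph, using a = 2 and b = 1 as inputs (these particular numbers are an assumption, chosen to be consistent with the edge derivatives quoted on the later slides).

```python
# Forward pass: set the sources, then evaluate each node in topological order.
a, b = 2.0, 1.0      # source nodes (assumed example inputs)
c = a + b            # c = 3
d = b + 1            # d = 2
e = c * d            # sink/output: e = 6
print(e)
```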

Page 17:

Derivatives on computation graphs

§ We can use basic calculus to compute the derivatives on the edges.
§ The derivative tells us: if I change one node by one unit, how much does this impact another node?

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]

Page 18:

Derivatives on computation graphs

§ But what if we want to compute the derivative between distant nodes?

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]

Page 19:

Derivatives on computation graphs

§ But what if we want to compute the derivative between distant nodes?
§ E.g., consider how e is affected by a.
§ Change a at a speed of 1.
§ Then c also changes at a speed of 1.
§ And c changing at a speed of 1 causes e to change at a speed of 2.
§ So e changes at a rate of 1x2 with respect to a.

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]

Page 20:

Derivatives on computation graphs

§ But what if we want to compute the derivative between distant nodes?
§ E.g., consider how e is affected by a.
§ Change a at a speed of 1.
§ Then c also changes at a speed of 1.
§ And c changing at a speed of 1 causes e to change at a speed of 2.
§ So e changes at a rate of 1x2 with respect to a.
§ General rule: sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path, e.g.

$\frac{\partial e}{\partial b} = 1 \cdot 2 + 1 \cdot 3$

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]

Page 21:

Derivatives on computation graphs

§ But what if we want to compute the derivative between distant nodes?
§ E.g., consider how e is affected by a.
§ Change a at a speed of 1.
§ Then c also changes at a speed of 1.
§ And c changing at a speed of 1 causes e to change at a speed of 2.
§ So e changes at a rate of 1x2 with respect to a.
§ General rule: sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path, e.g.

$\frac{\partial e}{\partial b} = 1 \cdot 2 + 1 \cdot 3$

The "sum over paths" idea is just the chain rule!

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]
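Spelling out that chain rule on the example graph, with the same assumed inputs a = 2 and b = 1 (so c = 3 and d = 2):

$$
\frac{\partial e}{\partial b}
  = \frac{\partial e}{\partial c}\frac{\partial c}{\partial b}
  + \frac{\partial e}{\partial d}\frac{\partial d}{\partial b}
  = d \cdot 1 + c \cdot 1
  = 2 \cdot 1 + 3 \cdot 1 = 5,
\qquad
\frac{\partial e}{\partial a}
  = \frac{\partial e}{\partial c}\frac{\partial c}{\partial a}
  = d \cdot 1 = 2.
$$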

Page 22:

Challenge: combinatorial explosion

§ Naively summing over paths (i.e., naively applying the chain rule) can lead to combinatorial explosion!
§ The number of paths between two nodes can be exponential in the number of graph edges!
§ E.g., with three edges ($\alpha, \beta, \gamma$) from X to Y and three edges ($\delta, \epsilon, \zeta$) from Y to Z, there are 3x3 = 9 paths between X and Z:

$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$

[Image and inspiration credit: http://colah.github.io/posts/2015-08-Backprop/]

Page 23:

Challenge: combinatorial explosion

§ Naively summing over paths (i.e., naively applying the chain rule) can lead to combinatorial explosion!
§ The number of paths between two nodes can be exponential in the number of graph edges!

$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$

§ Idea: Factor the paths!

$\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$

Page 24:

Factoring paths

§ Forward mode:
§ Goes from source(s) to sink.
§ At each node, sum all the incoming edges/derivatives.

Page 25:

Factoring paths

§ Forward mode:
§ Goes from source(s) to sink.
§ At each node, sum all the incoming edges/derivatives.
§ Reverse mode:
§ Goes from sink to source(s).
§ At each node, sum all the outgoing edges/derivatives.

Page 26:

Factoring paths

§ Forward mode:
§ Goes from source(s) to sink.
§ At each node, sum all the incoming edges/derivatives.
§ Reverse mode:
§ Goes from sink to source(s).
§ At each node, sum all the outgoing edges/derivatives.
§ Both only touch each edge once!

Page 27:

Computational savings: Forward mode

§ Run forward mode from the source b to the sink e.
§ Note that we get the derivative of each intermediate edge as well!

Page 28:

Computational savings: Forward mode

§ Run forward mode from the source b to the sink e.
§ Note that we get the derivative of each intermediate edge as well!
§ But in neural networks, we want the derivative of the output/loss w.r.t. all the previous layers...

Page 29:

Computational savings: Backward mode

§ Run backward mode from the sink e to the sources a and b.
§ Note that we get the derivative of each intermediate edge as well!

Page 30:

Computational savings: Backward mode

§ Run backward mode from the sink e to the sources a and b.
§ Note that we get the derivative of each intermediate edge as well!
§ We get the derivative w.r.t. the output in one pass!

Page 31:

Automated differentiation vs. backpropagation

§ Reverse-mode automatic differentiation (RV-AD) can efficiently compute the derivative of every node in a computation graph.
§ Neural networks are just computation graphs!
§ We generally call RV-AD "backpropagation" in the context of neural networks and deep learning, since we are propagating the derivative/error backwards from the output node.
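As a minimal sketch of reverse-mode AD in practice, PyTorch's autograd can differentiate the toy graph from earlier in a single backward pass (again using the assumed example inputs a = 2, b = 1):

```python
import torch

# Build the computation graph e = (a + b) * (b + 1) on example inputs.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
c = a + b
d = b + 1
e = c * d

# One reverse-mode pass from the sink gives de/da and de/db together.
e.backward()
print(a.grad)   # tensor(2.)  -> de/da = d = 2
print(b.grad)   # tensor(5.)  -> de/db = d + c = 2 + 3 = 5
```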

Page 32:

Automated differentiation in practice

§ Many automated differentiation frameworks exist.
§ Basic derivatives are hard-coded (often called "kernels") and everything else is computed via automated differentiation.
§ Nearly all are based on Python/NumPy.
§ You specify:
§ The forward model (i.e., the computation graph).
§ The training examples and training schedule (e.g., how the points are batched together).
§ The optimization details (e.g., loss function, learning rate).
§ The framework takes care of the derivative computations!
§ Tutorial 3 gives examples in PyTorch.
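For concreteness, here is a minimal PyTorch sketch of those three ingredients (model, data/batching, optimization). The architecture, toy data, batch size, and learning rate are illustrative assumptions, not the Tutorial 3 code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1) The forward model (the computation graph).
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# 2) The training examples and schedule (how points are batched together).
X, y = torch.randn(256, 4), torch.randn(256, 1)           # toy data
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# 3) The optimization details (loss function, learning rate).
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()      # the framework computes all the derivatives
        opt.step()
```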

Page 33:

Convergence of backpropagation

§ If the learning rate is appropriate, the algorithm is guaranteed to converge to a local minimum of the cost function.
§ NOT the global minimum. (Can be much worse.)
§ There can be MANY local minima.
§ Use random restarts = train multiple nets with different initial weights.
§ In practice, the solution found is often good (i.e., local minima are not a big problem in practice)...
§ Training can take thousands of iterations – CAN BE VERY SLOW! But using the network after training is generally fast.

Page 34:

Overtraining

§ Traditional overfitting is concerned with the number of parameters vs. the number of instances.
§ In neural networks, a related phenomenon called overtraining occurs when weights take on large magnitudes, i.e., unit saturation.
§ As learning progresses, the network has more active parameters.

[Figure: "Error versus weight updates (example 1)" -- error (approx. 0.002 to 0.01) vs. number of weight updates (0 to 20000), with curves for the training set error and the validation set error.]

[Embedded excerpt from COMP-652, Lecture 5 (September 20, 2012), pp. 37-38:]

• Traditional overfitting is concerned with the number of parameters vs. the number of instances.
• In neural networks there is an additional phenomenon called overtraining, which occurs when weights take on large magnitudes, pushing the sigmoids into saturation.
• Effectively, as learning progresses, the network has more actual parameters.
• Use a validation set to decide when to stop training!
• Regularization is also very effective.

More application-specific tricks

• If there is too little data, it can be perturbed by random noise; this helps escape local minima and gives more robust results.
• In image classification and pattern recognition tasks, extra data can be generated, e.g., by applying transformations that make sense.
• Weight sharing can be used to indicate parameters that should have the same value based on prior knowledge.
• In this case, each update is computed separately using backprop, then the tied parameters are updated with an average.

Page 35:

Overtraining

§ Traditional overfitting is concerned with the number of parameters vs. the number of instances.
§ In neural networks, a related phenomenon called overtraining occurs when weights take on large magnitudes, i.e., unit saturation.
§ As learning progresses, the network has more active parameters.
§ Use a validation set to decide when to stop.
§ The number of training updates is a hyper-parameter.

[Figure: the same "Error versus weight updates (example 1)" plot as on the previous slide, with training set error and validation set error curves.]
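A minimal sketch of validation-based early stopping, i.e., treating the number of training updates as a hyper-parameter chosen on the validation set. The patience-style stopping criterion and the `train_step`/`validation_error` helpers are hypothetical placeholders, not code from the course.

```python
import copy

def early_stopping_train(model, train_step, validation_error, max_updates=20000,
                         check_every=100, patience=10):
    """Stop training when validation error has not improved for `patience` checks."""
    best_err, best_model, checks_since_best = float("inf"), None, 0
    for update in range(max_updates):
        train_step(model)                          # one (mini-batch) weight update
        if (update + 1) % check_every == 0:
            err = validation_error(model)          # error on the held-out validation set
            if err < best_err:
                best_err, checks_since_best = err, 0
                best_model = copy.deepcopy(model)  # remember the best model so far
            else:
                checks_since_best += 1
                if checks_since_best >= patience:
                    break                          # validation error stopped improving
    return best_model if best_model is not None else model
```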

Page 36:

Choosing the learning rate

§ Backprop is very sensitive to the choice of learning rate.
§ Too large ⇒ divergence.
§ Too small ⇒ VERY slow learning.
§ The learning rate also influences the ability to escape local optima.
§ The learning rate is a critical hyperparameter.

Page 37:

Adaptive optimization algorithms

§ It is now standard to use "adaptive" optimization algorithms.
§ These approaches modify the learning rate adaptively depending on how the training is progressing.
§ Adam, RMSProp, and AdaGrad are popular approaches, with Adam being the de facto standard in deep learning.
§ All of these approaches scale the learning rate for each parameter based on statistics of the history of gradient updates for that parameter.
§ Intuition: increase the update strength for parameters that have had smaller updates in the past (see the sketch below).
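To illustrate the per-parameter scaling idea, here is a minimal NumPy sketch of an AdaGrad-style update; the toy quadratic objective and the hyperparameter values are assumptions for illustration (Adam and RMSProp follow the same pattern, using decaying averages of the gradient statistics instead of a running sum).

```python
import numpy as np

def grad(w):
    # gradient of a toy quadratic objective J(w) = 0.5 * w^T A w
    A = np.diag([10.0, 0.1])          # badly scaled: one steep direction, one flat direction
    return A @ w

w = np.array([1.0, 1.0])
lr, eps = 0.5, 1e-8
hist = np.zeros_like(w)               # running sum of squared gradients, per parameter

for step in range(100):
    g = grad(w)
    hist += g ** 2                            # accumulate gradient statistics per parameter
    w -= lr * g / (np.sqrt(hist) + eps)       # parameters with small past updates get larger steps

print(w)   # both coordinates have moved toward the minimum at [0, 0]
```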

Page 38:

Adding momentum

§ At each iteration of gradient descent, we compute an update based on the derivative of the current (mini)batch of training examples:

$\Delta_i w = \alpha \frac{\partial J}{\partial w}$

Here $\Delta_i w$ is the update for the weight vector $w$, and $\frac{\partial J}{\partial w}$ is the derivative of the error w.r.t. the weights for the current minibatch.

Page 39:

Adding momentum

§ On the i'th gradient descent update, instead of:

$\Delta_i w = \alpha \frac{\partial J}{\partial w}$

we do:

$\Delta_i w = \alpha \frac{\partial J}{\partial w} + \beta \Delta_{i-1} w$

The second term is called momentum.

Page 40:

Adding momentum

§ On the i'th gradient descent update, instead of:

$\Delta_i w = \alpha \frac{\partial J}{\partial w}$

we do:

$\Delta_i w = \alpha \frac{\partial J}{\partial w} + \beta \Delta_{i-1} w$

The second term is called momentum.

Advantages:
§ Easy to pass small local minima.
§ Keeps the weights moving in areas where the error is flat.

Disadvantages:
§ With too much momentum, it can get out of a global minimum!
§ One more parameter to tune, and more chances of divergence.
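A minimal NumPy sketch of this momentum update rule; the toy quadratic objective and the values of alpha and beta are illustrative assumptions.

```python
import numpy as np

def grad_J(w):
    # gradient of a toy objective J(w) = 0.5 * ||w||^2 on the current "minibatch"
    return w

w = np.array([5.0, -3.0])
alpha, beta = 0.1, 0.9          # learning rate and momentum coefficient
delta_prev = np.zeros_like(w)   # Delta_{i-1} w, initially zero

for i in range(50):
    delta = alpha * grad_J(w) + beta * delta_prev   # Delta_i w = alpha dJ/dw + beta Delta_{i-1} w
    w -= delta                                      # take the step
    delta_prev = delta                              # remember this update for the next iteration

print(w)   # w has moved toward the minimum at [0, 0]
```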

Page 41:

What you should know

§ Basic generalizations of neural networks: activation functions, regularization, and complex architectures.
§ The concept of a computation graph.
§ The basics of automated differentiation (reverse and forward mode) and backpropagation.
§ The issue of overtraining.
§ Optimization hyperparameters for backpropagation: learning rate, momentum, adaptive optimization algorithms.