Natural Language Processing with Deep Learning CS224N/Ling284 Christopher Manning Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)

Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning

Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)

Page 2:

1. Introduction

Assignment 2 is all about making sure you really understand the math of neural networks … then we’ll let the software do it!

We’ll go through it quickly today, but also look at the readings!

This will be a tough week for some! → Make sure to get help if you need it

Visit office hours Friday/Tuesday
Note: Monday is MLK Day – No office hours, sorry! But we will be on Piazza

Read tutorial materials given in the syllabus

Page 3:

NER: Binary classification for center word being location

• We do supervised training and want high score if it’s a location

𝐽" 𝜃 = 𝜎 𝑠 =1

1 + 𝑒*+

3

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
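For concreteness, a minimal numpy sketch of the window classifier these slides assume (s = uᵀ f(Wx + b), J = σ(s)); the dimensions and variable names here are illustrative, not taken from the assignment code:

import numpy as np

d, n = 4, 8                            # illustrative sizes: d-dim word vectors, n hidden units
rng = np.random.default_rng(0)
x = rng.standard_normal(5 * d)         # concatenated window [x_museums; x_in; x_Paris; x_are; x_amazing]
W = rng.standard_normal((n, 5 * d))    # hidden-layer weights
b = rng.standard_normal(n)             # hidden-layer bias
u = rng.standard_normal(n)             # output weights

f = np.tanh                            # elementwise nonlinearity
z = W @ x + b
h = f(z)
s = u @ h                              # score for "center word is a location"
J = 1.0 / (1.0 + np.exp(-s))           # J_t(theta) = sigma(s)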

Page 4:

Remember: Stochastic Gradient Descent

Update equation:
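The update equation itself is an image in the slides; the standard SGD step it refers to is:

θ_new = θ_old − α ∇_θ J(θ)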

How can we compute ∇_θ J(θ)?
1. By hand
2. Algorithmically: the backpropagation algorithm

𝛼 = step size or learning rate

Page 5:

Lecture Plan

Lecture 4: Gradients by hand and algorithmically
1. Introduction (5 mins)
2. Matrix calculus (40 mins)
3. Backpropagation (35 mins)

Page 6:

Computing Gradients by Hand

• Matrix calculus: Fully vectorized gradients

• “multivariable calculus is just like single-variable calculus if you use matrices”

• Much faster and more useful than non-vectorized gradients

• But doing a non-vectorized gradient can be good for intuition; watch last week’s lecture for an example

• Lecture notes and matrix calculus notes cover this material in more detail

• You might also review Math 51, which has a new online textbook: http://web.stanford.edu/class/math51/textbook.html

Page 7:

Gradients

• Given a function with 1 output and 1 input

f(x) = x³

• Its gradient (slope) is its derivative df/dx = 3x²

“How much will the output change if we change the input a bit?”

Page 8:

Gradients

• Given a function with 1 output and n inputs

• Its gradient is a vector of partial derivatives with respect to each input
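In symbols (the formula on the slide is an image): for f(x) = f(x_1, …, x_n),

∇_x f = [ ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n ]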

Page 9:

Jacobian Matrix: Generalization of the Gradient

• Given a function with m outputs and n inputs

• It’s Jacobian is an m x n matrix of partial derivatives

Page 10:

Chain Rule

• For one-variable functions: multiply derivatives

• For multiple variables at once: multiply Jacobians
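For example (matching the setup used later in this lecture): if h = f(z) and z = Wx + b, then

∂h/∂x = (∂h/∂z)(∂z/∂x)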

Page 11:

Example Jacobian: Elementwise activation Function

Page 12:

Example Jacobian: Elementwise activation Function

Function has n outputs and n inputs → n by n Jacobian
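Concretely, for h = f(z) applied elementwise (h_i = f(z_i)):

(∂h/∂z)_ij = f′(z_i) if i = j, and 0 otherwise, i.e. ∂h/∂z = diag(f′(z))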

Page 13:

Example Jacobian: Elementwise activation Function

Page 14:

Example Jacobian: Elementwise activation Function

Page 15:

Example Jacobian: Elementwise activation Function

Page 16:

Other Jacobians

• Compute these at home for practice!

• Check your answers with the lecture notes

Page 17:

Other Jacobians

• Compute these at home for practice!

• Check your answers with the lecture notes

Page 18:

Other Jacobians

• Compute these at home for practice!

• Check your answers with the lecture notes

Fine print: This is the correct Jacobian. Later we discuss the “shape convention”; using it the answer would be h.
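For reference, the formulas on these three build slides are images; the Jacobians they work through (you can verify them against the lecture notes) are:

∂/∂x (Wx + b) = W
∂/∂b (Wx + b) = I
∂/∂u (uᵀh) = hᵀ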

Page 19:

Other Jacobians

• Compute these at home for practice!

• Check your answers with the lecture notes

Page 20:

Back to our Neural Net!

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]

Page 21:

Back to our Neural Net!

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]

• Let’s find

• Really, we care about the gradient of the loss, but we will compute the gradient of the score for simplicity

Page 22:

1. Break up equations into simple pieces
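The pieces (written as images on the slide; this is the model used throughout the lecture) are:

s = uᵀh,   h = f(z),   z = Wx + b,   x = input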

Page 23:

2. Apply the chain rule
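Assuming the target here is ∂s/∂b, as in the lecture, the chain rule expands it as:

∂s/∂b = (∂s/∂h)(∂h/∂z)(∂z/∂b)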

Page 24:

2. Apply the chain rule

Page 25:

2. Apply the chain rule

Page 26:

2. Apply the chain rule

Page 27:

3. Write out the Jacobians

Useful Jacobians from previous slide
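Continuing the ∂s/∂b example with the Jacobians above (∂s/∂h = uᵀ, ∂h/∂z = diag(f′(z)), ∂z/∂b = I):

∂s/∂b = uᵀ diag(f′(z)) I = uᵀ diag(f′(z))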

Page 28:

3. Write out the Jacobians

Useful Jacobians from previous slide

Page 29:

3. Write out the Jacobians

Useful Jacobians from previous slide

Page 30:

3. Write out the Jacobians

Useful Jacobians from previous slide

Page 31:

3. Write out the Jacobians

Useful Jacobians from previous slide

Page 32:

Re-using Computation

• Suppose we now want to compute

• Using the chain rule again:

Page 33:

Re-using Computation

• Suppose we now want to compute

• Using the chain rule again:

The same! Let’s avoid duplicated computation…

Page 34:

Re-using Computation

• Suppose we now want to compute

• Using the chain rule again:
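In symbols (same model as above), the new target is ∂s/∂W:

∂s/∂W = (∂s/∂h)(∂h/∂z)(∂z/∂W) = δ (∂z/∂W),   where δ = (∂s/∂h)(∂h/∂z) = uᵀ diag(f′(z))

The first two factors are exactly the ones already computed for ∂s/∂b.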

𝛿 is local error signal

Page 35:

Derivative with respect to Matrix: Output shape

• What does ∂s/∂W look like?

• 1 output, nm inputs: 1 by nm Jacobian?

• Inconvenient to do

Page 36:

Derivative with respect to Matrix: Output shape

• What does ∂s/∂W look like?

• 1 output, nm inputs: 1 by nm Jacobian?

• Inconvenient to do

• Instead we use shape convention: the shape of the gradient is the shape of the parameters

• So ∂s/∂W is n by m:

Page 37:

Derivative with respect to Matrix

• Remember

• is going to be in our answer

• The other term should be because

• Answer is:


𝛿 is local error signal at 𝑧; 𝑥 is local input signal
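Putting the pieces together (the answer on this slide is an image; under the shape convention it is):

∂s/∂W = δᵀ xᵀ

an n × m outer product of the error signal δ (length n) and the input x (length m).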

Page 38:

Why the Transposes?

• Hacky answer: this makes the dimensions work out!

• Useful trick for checking your work!

• Full explanation in the lecture notes; intuition next
• Each input goes to each output – you get outer product

Page 39:

Why the Transposes?

Page 40:

Deriving local input gradient in backprop

• For this function:

• Let’s consider the derivative of a single weight Wij

• Wij only contributes to zi

• For example: W23 is only used to compute z2 not z1

[figure: two-layer network with inputs x1, x2, x3, +1 (bias), hidden units h1 = f(z1) and h2 = f(z2), score s, output weight u2, and the labelled weight/bias W23, b2]

∂s/∂W = δ ∂z/∂W = δ ∂/∂W (Wx + b)

∂z_i/∂W_ij = ∂/∂W_ij (W_i· x + b_i) = ∂/∂W_ij Σ_{k=1..d} W_ik x_k = x_j
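A quick numerical sanity check of ∂z_i/∂W_ij = x_j; this is a sketch with illustrative sizes and a hypothetical random instance, not part of the assignment:

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)

i, j, h = 1, 2, 1e-4
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += h
Wm[i, j] -= h
numeric = ((Wp @ x + b)[i] - (Wm @ x + b)[i]) / (2 * h)   # central difference in W_ij
print(numeric, x[j])   # the two values should agree closely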

Page 41:

What shape should derivatives be?

• ∂s/∂b is a row vector

• But convention says our gradient should be a column vector because b is a column vector…

• Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
• We expect answers to follow the shape convention

• But Jacobian form is useful for computing the answers

Page 42:

What shape should derivatives be? Two options:

1. Use Jacobian form as much as possible, reshape to follow the convention at the end:

• What we just did. But at the end transpose to make the derivative a column vector

2. Always follow the convention
• Look at dimensions to figure out when to transpose and/or reorder terms

Page 43:

Deriving gradients: Tips

• Tip 1: Carefully define your variables and keep track of their dimensionality!

• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

∂y/∂x = (∂y/∂u)(∂u/∂x)

Keep straight what variables feed into what computations

• Tip 3: For the top softmax part of a model: First consider the derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes)

• Tip 4: Work out element-wise partial derivatives if you’re getting confused by matrix calculus!

• Tip 5: Use Shape Convention. Note: The error message 𝜹 that arrives at a hidden layer has the same dimensionality as that hidden layer

Page 44:

3. Backpropagation

We’ve almost shown you backpropagation

It’s taking derivatives and using the (generalized, multivariate, or matrix) chain rule

Other trick:

We re-use derivatives computed for higher layers in computing derivatives for lower layers to minimize computation

Page 45:

Computation Graphs and Backpropagation


• We represent our neural net equations as a graph

• Source nodes: inputs

• Interior nodes: operations

Page 46:

Computation Graphs and Backpropagation


• We represent our neural net equations as a graph

• Source nodes: inputs

• Interior nodes: operations

• Edges pass along result of the operation

Page 47:

Computation Graphs and Backpropagation


• Representing our neural net equations as a graph

• Source nodes: inputs

• Interior nodes: operations

• Edges pass along result of the operation

“Forward Propagation”

Page 48:

Backpropagation


• Go backwards along edges

• Pass along gradients

Page 49:

Backpropagation: Single Node

• Node receives an “upstream gradient”

• Goal is to pass on the correct “downstream gradient”

Upstream gradient

Downstream gradient

Page 50:

Backpropagation: Single Node

Downstream gradient

Upstream gradient

• Each node has a local gradient

• The gradient of its output with respect to its input

Local gradient

Page 51:

Backpropagation: Single Node

Downstream gradient

Upstream gradient

• Each node has a local gradient

• The gradient of its output with respect to its input

Local gradient


Chain rule!

Page 52:

Backpropagation: Single Node

Downstream gradient

Upstream gradient

• Each node has a local gradient

• The gradient of it’s output with respect to it’s input

Local gradient

• [downstream gradient] = [upstream gradient] x [local gradient]
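In symbols, for a node z = f(x) inside a graph whose final output is s:

∂s/∂x = (∂s/∂z)(∂z/∂x)   (downstream = upstream × local)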

Page 53:

Backpropagation: Single Node


• What about nodes with multiple inputs?

Page 54:

Backpropagation: Single Node

Downstream gradients

Upstream gradient

Local gradients


• Multiple inputs → multiple local gradients

Page 55:

An Example

Page 56:

An Example

Forward prop steps

[figure: computation graph with +, max, and * nodes]

Page 57:

An Example

Forward prop steps

[figure: computation graph annotated with forward values 6, 3, 2, 1, 2, 2, 0]

Page 58:

An Example

Forward prop steps

Local gradients

Page 59:

An Example

Forward prop steps

Local gradients

Page 60:

An Example

Forward prop steps

Local gradients

Page 61:

An Example

Forward prop steps

Local gradients

Page 62:

An Example

Forward prop steps

Local gradients

upstream * local = downstream:
1*3 = 3
1*2 = 2

Page 63:

An Example

Forward prop steps

Local gradients

upstream * local = downstream:
3*1 = 3
3*0 = 0

Page 64:

An Example

Forward prop steps

Local gradients

upstream * local = downstream:
2*1 = 2
2*1 = 2

Page 65:

An Example

Forward prop steps

Local gradients

[figure: completed backward pass with gradients 1, 3, 2, 3, 0, 2, 2 on the edges]
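The figures for this example are not reproduced here, but the numbers above are consistent with the function f(x, y, z) = (x + y) · max(y, z) evaluated at (x, y, z) = (1, 2, 0). A sketch of the forward and backward passes under that assumption:

# Worked example, assuming f(x, y, z) = (x + y) * max(y, z) at (1, 2, 0).
x, y, z = 1.0, 2.0, 0.0

# Forward pass
a = x + y            # 3
b = max(y, z)        # 2
f = a * b            # 6

# Backward pass: downstream = upstream * local
df = 1.0
da = df * b          # 1*2 = 2   (* "switches" the upstream gradient)
db = df * a          # 1*3 = 3
dx = da * 1.0        # 2         (+ "distributes" the upstream gradient)
dy = da * 1.0 + db * (1.0 if y > z else 0.0)   # 2 + 3 = 5 (gradients sum at outward branches)
dz = db * (1.0 if z > y else 0.0)              # 0         (max "routes" the upstream gradient)

print(dx, dy, dz)    # 2.0 5.0 0.0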

Page 66:

Gradients sum at outward branches
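If a variable feeds into several nodes, its gradient is the sum of the gradients flowing back along each outgoing edge. Writing a = x + y and b = max(y, z) as in the sketch above, y branches into both a and b, so

∂f/∂y = (∂f/∂a)(∂a/∂y) + (∂f/∂b)(∂b/∂y) = 2 + 3 = 5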

Page 67:

Gradients sum at outward branches

Page 68:

Node Intuitions


• + “distributes” the upstream gradient to each summand

Page 69:

Node Intuitions


• + “distributes” the upstream gradient to each summand

• max “routes” the upstream gradient

Page 70:

Node Intuitions


• + “distributes” the upstream gradient

• max “routes” the upstream gradient

• * “switches” the upstream gradient

Page 71:

Efficiency: compute all gradients at once


• Incorrect way of doing backprop:

• First compute

Page 72:

Efficiency: compute all gradients at once


• Incorrect way of doing backprop:

• First compute

• Then independently compute

• Duplicated computation!

Page 73:

Efficiency: compute all gradients at once


• Correct way:

• Compute all the gradients at once

• Analogous to using 𝜹 when we computed gradients by hand

Page 74:

Back-Prop in General Computation Graph

1. Fprop: visit nodes in topological sort order
- Compute value of node given predecessors

2. Bprop:
- Initialize output gradient = 1
- Visit nodes in reverse order: compute gradient wrt each node using gradient wrt its successors

Done correctly, big O() complexity of fprop and bprop is the same

In general our nets have regular layer-structure and so we can use matrices and Jacobians…
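In symbols (reconstructing the relation sketched in the figure): if the graph has a single scalar output z and {y_1, …, y_n} are the successors of a node x, then ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x). A minimal Python sketch of the fprop/bprop procedure above; the op/node structures and names are illustrative, not the course's code:

class Add:
    @staticmethod
    def apply(a, b): return a + b
    @staticmethod
    def local_grads(a, b): return (1.0, 1.0)

class Mul:
    @staticmethod
    def apply(a, b): return a * b
    @staticmethod
    def local_grads(a, b): return (b, a)

def fprop(nodes, values):
    # nodes: list of (name, op, input_names) in topological order
    for name, op, inputs in nodes:
        values[name] = op.apply(*[values[i] for i in inputs])

def bprop(nodes, values, output_name):
    grads = {name: 0.0 for name in values}
    grads[output_name] = 1.0                      # initialize output gradient = 1
    for name, op, inputs in reversed(nodes):      # visit nodes in reverse order
        local = op.local_grads(*[values[i] for i in inputs])
        for inp, g in zip(inputs, local):
            grads[inp] += grads[name] * g         # upstream * local, summed over successors
    return grads

# e.g. f = (x + y) * y at x = 1, y = 2  ->  df/dx = 2, df/dy = 5
values = {"x": 1.0, "y": 2.0}
nodes = [("a", Add, ("x", "y")), ("f", Mul, ("a", "y"))]
fprop(nodes, values)
print(values["f"], bprop(nodes, values, "f"))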


Single scalar output

Page 75:

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output

• Modern DL frameworks (Tensorflow, PyTorch, etc.) do backpropagation for you but mainly leave layer/node writer to hand-calculate the local derivative

Page 76:

Backprop Implementations

Page 77:

Implementation: forward/backward API
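The code on these slides is an image; a minimal sketch of the kind of gate-level forward/backward API they illustrate (the class and variable names here are assumptions):

class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y        # cache inputs for the backward pass
        return x * y

    def backward(self, dz):          # dz = upstream gradient dL/dz
        dx = dz * self.y             # local gradient dz/dx = y
        dy = dz * self.x             # local gradient dz/dy = x
        return dx, dy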

Page 78:

Implementation: forward/backward API

Page 79:

Manual Gradient checking: Numeric Gradient

• For small h (≈ 1e-4),
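The approximation referred to (shown as an image on the slide) is the two-sided central difference:

f′(θ) ≈ ( f(θ + h) − f(θ − h) ) / (2h)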

• Easy to implement correctly

• But approximate and very slow:

• Have to recompute f for every parameter of our model

• Useful for checking your implementation

• In the old days when we hand-wrote everything, it was key to do this everywhere.

• Now much less needed, when throwing together layers

Page 80:

Summary

We’ve mastered the core technology of neural nets! 🎉

• Backpropagation: recursively (and hence efficiently) apply the chain rule along the computation graph
• [downstream gradient] = [upstream gradient] x [local gradient]

• Forward pass: compute results of operations and save intermediate values

• Backward pass: apply chain rule to compute gradients

Page 81:

Why learn all these details about gradients?

• Modern deep learning frameworks compute gradients for you!

• But why take a class on compilers or systems when they are implemented for you?
• Understanding what is going on under the hood is useful!

• Backpropagation doesn’t always work perfectly
• Understanding why is crucial for debugging and improving models
• See Karpathy article (in syllabus): https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

• Example in future lecture: exploding and vanishing gradients