Backpropagation

Gradient Descent

Network parameters: $\theta = \{w_1, w_2, \dots, b_1, b_2, \dots\}$

Starting from an initial set of parameters $\theta^0$:
- Compute the gradient $\nabla L(\theta^0)$; update $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$
- Compute the gradient $\nabla L(\theta^1)$; update $\theta^2 = \theta^1 - \eta \nabla L(\theta^1)$
- ……

$\nabla L(\theta) = \left[\, \partial L/\partial w_1,\ \partial L/\partial w_2,\ \dots,\ \partial L/\partial b_1,\ \partial L/\partial b_2,\ \dots \,\right]^{\mathsf{T}}$

A network has millions of parameters, so $\nabla L(\theta)$ is a vector with millions of entries ……

To compute the gradients efficiently, we use backpropagation.
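As a minimal sketch of the update rule above (the two-parameter quadratic loss here is a hypothetical example, not from the slides):

```python
import numpy as np

# Hypothetical quadratic loss over two parameters, for illustration only.
def loss(theta):
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    # Gradient of the loss above, derived by hand.
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

eta = 0.1                      # learning rate
theta = np.array([0.0, 0.0])   # theta^0
for _ in range(100):           # theta^{t+1} = theta^t - eta * grad L(theta^t)
    theta = theta - eta * grad(theta)
```

Each iteration is one arrow in the slide's picture: evaluate the gradient at the current parameters, then step against it.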
Chain Rule

Case 1: $y = g(x)$, $z = h(y)$

A change in $x$ affects $y$, which affects $z$:
$\dfrac{dz}{dx} = \dfrac{dz}{dy}\,\dfrac{dy}{dx}$

Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$

A change in $s$ affects $z$ through both $x$ and $y$, so the two paths are summed:
$\dfrac{dz}{ds} = \dfrac{\partial z}{\partial x}\,\dfrac{dx}{ds} + \dfrac{\partial z}{\partial y}\,\dfrac{dy}{ds}$
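Case 2 can be checked numerically; the functions $g$, $h$, $k$ below are illustrative choices, not from the slides:

```python
import math

g = lambda s: s ** 2          # x = s^2,    so dx/ds = 2s
h = lambda s: math.sin(s)     # y = sin(s), so dy/ds = cos(s)
k = lambda x, y: x * y        # z = x*y,    so dz/dx = y, dz/dy = x

def dz_ds(s):
    x, y = g(s), h(s)
    # Chain rule, Case 2: sum the contributions of both paths.
    return y * (2 * s) + x * math.cos(s)

# Compare against a central finite-difference approximation.
s, eps = 1.3, 1e-6
fd = (k(g(s + eps), h(s + eps)) - k(g(s - eps), h(s - eps))) / (2 * eps)
```

The analytic chain-rule derivative and the finite-difference estimate agree to high precision.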
Backpropagation

The total loss is the sum of the per-example costs over all $N$ training examples:
$L(\theta) = \sum_{n=1}^{N} C^n(\theta)$, so $\dfrac{\partial L(\theta)}{\partial w} = \sum_{n=1}^{N} \dfrac{\partial C^n(\theta)}{\partial w}$

For each example, the network $\mathrm{NN}(\theta)$ maps the input $x^n$ (components $x_1, x_2, \dots$) to the output $y^n$ (components $y_1, y_2, \dots$); $C^n$ measures the distance between $y^n$ and the target $\hat{y}^n$.
Backpropagation

Consider a first-layer neuron with $z = x_1 w_1 + x_2 w_2 + b$, where $z$ is the input to its activation function. By the chain rule,
$\dfrac{\partial C}{\partial w} = \dfrac{\partial z}{\partial w}\,\dfrac{\partial C}{\partial z}$

Backpropagation splits this computation into two passes:
Forward pass: compute $\partial z / \partial w$ for all parameters.
Backward pass: compute $\partial C / \partial z$ for all activation function inputs $z$.
Backpropagation – Forward pass

Compute $\partial z / \partial w$ for all parameters.

For $z = x_1 w_1 + x_2 w_2 + b$:
$\partial z / \partial w_1 = x_1$ and $\partial z / \partial w_2 = x_2$.

The derivative with respect to a weight is simply the value of the input connected by that weight, so the forward pass already produces every $\partial z / \partial w$.
Backpropagation – Forward pass

Compute $\partial z / \partial w$ for all parameters.

[Figure: a worked numerical example. With input $(x_1, x_2) = (1, -1)$, the first-layer neurons produce activations $\sigma(4) \approx 0.98$ and $\sigma(-2) \approx 0.12$; the second layer produces $\approx 0.86$ and $\approx 0.11$. Since $\partial z / \partial w$ is the value of the input connected by the weight, the figure reads off derivatives such as $\partial z / \partial w = -1$, $\partial z / \partial w = 0.12$, and $\partial z / \partial w = 0.11$ for weights fed by those values.]
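The activations in this example can be reproduced with a short forward pass; the weight and bias values below are read off the slide's figure as best they can be recovered, so treat them as a reconstruction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -1.0])                  # network input

# Layer 1: weights and biases as reconstructed from the figure.
W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
z1 = W1 @ x + b1                           # pre-activations
a1 = sigmoid(z1)                           # activations: ~[0.98, 0.12]

# Layer 2.
W2 = np.array([[2.0, -1.0], [-2.0, -1.0]])
b2 = np.array([0.0, 0.0])
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)                           # activations: ~[0.86, 0.11]

# dz/dw for any weight is the input value it multiplies,
# so this single pass yields all the dz/dw values in the figure.
```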
Backpropagation – Backward pass

Compute $\partial C / \partial z$ for all activation function inputs $z$.

The activation $a = \sigma(z)$ feeds the next layer through $w_3$ (into $z'$) and $w_4$ (into $z''$). By the chain rule,
$\dfrac{\partial C}{\partial z} = \dfrac{\partial a}{\partial z}\,\dfrac{\partial C}{\partial a} = \sigma'(z)\,\dfrac{\partial C}{\partial a}$
[Figure: the sigmoid $\sigma(z)$ and its derivative $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$, which peaks at $z = 0$.]
Backpropagation – Backward pass

Compute $\partial C / \partial z$ for all activation function inputs $z$.

With $a = \sigma(z)$, $z' = a w_3 + \cdots$, and $z'' = a w_4 + \cdots$, apply the chain rule to $\partial C / \partial a$:
$\dfrac{\partial C}{\partial a} = \dfrac{\partial z'}{\partial a}\,\dfrac{\partial C}{\partial z'} + \dfrac{\partial z''}{\partial a}\,\dfrac{\partial C}{\partial z''} = w_3\,\dfrac{\partial C}{\partial z'} + w_4\,\dfrac{\partial C}{\partial z''}$

Assume for the moment that $\partial C / \partial z'$ and $\partial C / \partial z''$ are known. Then
$\dfrac{\partial C}{\partial z} = \sigma'(z)\left[w_3\,\dfrac{\partial C}{\partial z'} + w_4\,\dfrac{\partial C}{\partial z''}\right]$
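One backward step as a sketch; the numeric values for $z$, $w_3$, $w_4$, and the two downstream derivatives are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative values, not from the slides.
z = 0.5
w3, w4 = 2.0, -1.0
dC_dz_prime = 0.3     # assumed known (supplied by later backward steps)
dC_dz_dprime = -0.4   # assumed known

# dC/dz = sigma'(z) * (w3 * dC/dz' + w4 * dC/dz'')
dC_dz = sigmoid_prime(z) * (w3 * dC_dz_prime + w4 * dC_dz_dprime)
```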
Backpropagation – Backward pass

Compute $\partial C / \partial z$ for all activation function inputs $z$.

The formula
$\dfrac{\partial C}{\partial z} = \sigma'(z)\left[w_3\,\dfrac{\partial C}{\partial z'} + w_4\,\dfrac{\partial C}{\partial z''}\right]$
can be read as a neuron running in reverse: it takes $\partial C / \partial z'$ and $\partial C / \partial z''$ as inputs, weights them by $w_3$ and $w_4$, and multiplies the sum by $\sigma'(z)$. Here $\sigma'(z)$ is a constant, because $z$ is already determined in the forward pass.
Backpropagation – Backward pass

Compute $\partial C / \partial z$ for all activation function inputs $z$.

Case 1. Output layer: $z'$ and $z''$ feed the output units that produce $y_1$ and $y_2$, so both derivatives can be computed directly:
$\dfrac{\partial C}{\partial z'} = \dfrac{\partial y_1}{\partial z'}\,\dfrac{\partial C}{\partial y_1}$, $\quad\dfrac{\partial C}{\partial z''} = \dfrac{\partial y_2}{\partial z''}\,\dfrac{\partial C}{\partial y_2}$
Done!
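A concrete instance of Case 1; the squared-error cost and sigmoid output unit here are my assumptions, not stated on the slide:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values.
z_prime, y_hat1 = 0.8, 1.0
y1 = sigmoid(z_prime)

# With the assumed cost C = 0.5 * (y1 - y_hat1)^2 + ... :
dC_dy1 = y1 - y_hat1                                 # dC/dy1
dy1_dz = sigmoid(z_prime) * (1 - sigmoid(z_prime))   # dy1/dz' = sigma'(z')
dC_dz_prime = dy1_dz * dC_dy1                        # dC/dz', directly
```

With a different output activation or cost (e.g. softmax with cross-entropy), only these two local factors change.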
Backpropagation – Backward pass

Compute $\partial C / \partial z$ for all activation function inputs $z$.

Case 2. Not the output layer: suppose $a' = \sigma(z')$ feeds the next layer through $w_5$ (into $z_a$) and $w_6$ (into $z_b$). Then, exactly as before,
$\dfrac{\partial C}{\partial z'} = \sigma'(z')\left[w_5\,\dfrac{\partial C}{\partial z_a} + w_6\,\dfrac{\partial C}{\partial z_b}\right]$
and similarly for $\partial C / \partial z''$. Compute $\partial C / \partial z$ recursively, until we reach the output layer ……
Backpropagation – Backward Pass

Compute $\partial C / \partial z$ for all activation function inputs $z$, starting from the output layer.

In a network with activation inputs $z_1, \dots, z_6$ (inputs $x_1, x_2$, outputs $y_1, y_2$), first compute $\partial C / \partial z_5$ and $\partial C / \partial z_6$ at the output layer; then $\partial C / \partial z_3$ and $\partial C / \partial z_4$, each a weighted sum of $\partial C / \partial z_5$ and $\partial C / \partial z_6$ scaled by $\sigma'(z_3)$ or $\sigma'(z_4)$; and finally $\partial C / \partial z_1$ and $\partial C / \partial z_2$, scaled by $\sigma'(z_1)$ or $\sigma'(z_2)$.

This is equivalent to running a reversed network: same weights, direction of propagation reversed, with each unit multiplying its input sum by the constant $\sigma'(z_i)$.
Backpropagation – Summary

Forward pass: compute $a = \dfrac{\partial z}{\partial w}$ for every weight.
Backward pass: compute $\dfrac{\partial C}{\partial z}$ for every activation function input.
Multiply the two:
$\dfrac{\partial C}{\partial w} = a \times \dfrac{\partial C}{\partial z}$, for all $w$.
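Putting the two passes together, a minimal sketch for a two-layer sigmoid network; the squared-error cost and the specific shapes and random values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # layer 1
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2
x, y_hat = np.array([1.0, -1.0]), np.array([1.0, 0.0])

# Forward pass: keep every activation, since dz/dw is the
# input value connected by the weight.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y = sigmoid(z2)
C = 0.5 * np.sum((y - y_hat) ** 2)                     # assumed cost

# Backward pass: dC/dz at the output layer first (Case 1) ...
dC_dz2 = sigmoid(z2) * (1 - sigmoid(z2)) * (y - y_hat)
# ... then propagate through the weights, scaled by sigma'(z1) (Case 2).
dC_dz1 = sigmoid(z1) * (1 - sigmoid(z1)) * (W2.T @ dC_dz2)

# dC/dw = a * dC/dz: the outer product pairs each incoming
# activation with each dC/dz of the layer it feeds.
dC_dW2 = np.outer(dC_dz2, a1)
dC_dW1 = np.outer(dC_dz1, x)
```

Each weight gradient is the product of a forward-pass value and a backward-pass value, exactly as in the summary slide; a finite-difference check on any single weight confirms the result.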