CS7015 (Deep Learning): Lecture 4
Feedforward Neural Networks, Backpropagation

Mitesh M. Khapra

Department of Computer Science and Engineering
Indian Institute of Technology Madras


References/Acknowledgments

See the excellent videos by Hugo Larochelle on Backpropagation


Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)


[Figure: a feedforward network with inputs x_1, x_2, ..., x_n, pre-activations a_1, a_2, a_3, activations h_1, h_2, output h_L = ŷ = f(x), and parameters (W_1, b_1), (W_2, b_2), (W_3, b_3)]

The input to the network is an n-dimensional vector
The network contains L − 1 hidden layers (2, in this case) having n neurons each
Finally, there is one output layer containing k neurons (say, corresponding to k classes)
Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (a_i and h_i are vectors)
The input layer can be called the 0-th layer and the output layer can be called the L-th layer
W_i ∈ R^{n×n} and b_i ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L)
W_L ∈ R^{n×k} and b_L ∈ R^k are the weight and bias between the last hidden layer and the output layer (L = 3 in this case)


The pre-activation at layer i is given by
a_i(x) = b_i + W_i h_{i−1}(x)
The activation at layer i is given by
h_i(x) = g(a_i(x))
where g is called the activation function (for example, logistic, tanh, linear, etc.)
The activation at the output layer is given by
f(x) = h_L(x) = O(a_L(x))
where O is the output activation function (for example, softmax, linear, etc.)
To simplify notation we will refer to a_i(x) as a_i and h_i(x) as h_i


The pre-activation at layer i is given by
a_i = b_i + W_i h_{i−1}
The activation at layer i is given by
h_i = g(a_i)
where g is called the activation function (for example, logistic, tanh, linear, etc.)
The activation at the output layer is given by
f(x) = h_L = O(a_L)
where O is the output activation function (for example, softmax, linear, etc.)
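As a concrete illustration of these equations, here is a minimal NumPy sketch of the forward pass. The layer sizes, the choice of g as the logistic function, and a linear output O are assumptions made only for this example; the slides leave g and O open (logistic, tanh, softmax, linear, etc.).

```python
import numpy as np

def logistic(a):
    # g: elementwise logistic activation (one possible choice of activation function)
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, Ws, bs, g=logistic, O=lambda a: a):
    # Ws = [W_1, ..., W_L], bs = [b_1, ..., b_L]; h_0 = x; O defaults to a linear output
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h          # a_i = b_i + W_i h_{i-1}
        h = g(a)               # h_i = g(a_i)
    a_L = bs[-1] + Ws[-1] @ h  # pre-activation at the output layer
    return O(a_L)              # f(x) = h_L = O(a_L)

# Hypothetical sizes: n = 4 inputs, two hidden layers of 4 neurons each, k = 3 outputs (L = 3)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(3)]
print(forward(rng.standard_normal(4), Ws, bs))
```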


Data: {x_i, y_i}_{i=1}^N

Model:
ŷ_i = f(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)

Parameters:
θ = W_1, ..., W_L, b_1, b_2, ..., b_L (L = 3)

Algorithm: Gradient Descent with Back-propagation (we will see soon)

Objective/Loss/Error function: Say,

min 1/N ∑_{i=1}^{N} ∑_{j=1}^{k} (ŷ_ij − y_ij)²

In general, min L(θ)
where L(θ) is some function of the parameters


Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)


The story so far...

We have introduced feedforward neural networks

We are now interested in finding an algorithm for learning the parameters of this model


Recall our gradient descent algorithm
We can write it more concisely as

Algorithm: gradient descent()
  t ← 0; max_iterations ← 1000;
  Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η ∇θ_t;
  end

where ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]^T

Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]
We can still use the same algorithm for learning the parameters of our model


Except that now our ∇θ looks much more nasty

∇θ =
[ ∂L(θ)/∂W_{111}  ...  ∂L(θ)/∂W_{11n}   ∂L(θ)/∂W_{211}  ...  ∂L(θ)/∂W_{21n}   ...   ∂L(θ)/∂W_{L,11}  ...  ∂L(θ)/∂W_{L,1k}   ∂L(θ)/∂b_{11}  ...  ∂L(θ)/∂b_{L1}
  ∂L(θ)/∂W_{121}  ...  ∂L(θ)/∂W_{12n}   ∂L(θ)/∂W_{221}  ...  ∂L(θ)/∂W_{22n}   ...   ∂L(θ)/∂W_{L,21}  ...  ∂L(θ)/∂W_{L,2k}   ∂L(θ)/∂b_{12}  ...  ∂L(θ)/∂b_{L2}
  ⋮                                      ⋮                                           ⋮                                       ⋮
  ∂L(θ)/∂W_{1n1}  ...  ∂L(θ)/∂W_{1nn}   ∂L(θ)/∂W_{2n1}  ...  ∂L(θ)/∂W_{2nn}   ...   ∂L(θ)/∂W_{L,n1}  ...  ∂L(θ)/∂W_{L,nk}   ∂L(θ)/∂b_{1n}  ...  ∂L(θ)/∂b_{Lk} ]

∇θ is thus composed of ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^{n×n}, ∇W_L ∈ R^{n×k}, ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k
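To make the update θ_{t+1} ← θ_t − η ∇θ_t concrete with θ stored as a list of weight matrices and bias vectors, here is a small runnable sketch. The tiny one-hidden-layer network, the tanh and linear choices, the learning rate, and the finite-difference gradient are all assumptions made for the example; the finite-difference step is only a naive stand-in for ∇θ, which backpropagation (developed later in this lecture) computes far more efficiently.

```python
import numpy as np

def loss(theta, x, y):
    # A toy network: one hidden layer with g = tanh, a linear output, and squared error loss.
    W1, b1, W2, b2 = theta
    h1 = np.tanh(b1 + W1 @ x)
    y_hat = b2 + W2 @ h1
    return np.sum((y_hat - y) ** 2)

def numerical_grad(theta, x, y, eps=1e-6):
    # Naive stand-in for backpropagation: central differences on every entry of every W_i and b_i.
    grads = []
    for p in theta:
        g = np.zeros_like(p)
        it = np.nditer(p, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = p[idx]
            p[idx] = old + eps
            f_plus = loss(theta, x, y)
            p[idx] = old - eps
            f_minus = loss(theta, x, y)
            p[idx] = old
            g[idx] = (f_plus - f_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
# theta_0 = [W_1^0, b_1^0, W_2^0, b_2^0]
theta = [rng.standard_normal((3, 2)), np.zeros(3), rng.standard_normal((1, 3)), np.zeros(1)]
x, y = np.array([0.5, -1.0]), np.array([2.0])
eta, max_iterations = 0.1, 1000
for t in range(max_iterations):
    grads = numerical_grad(theta, x, y)                   # nabla theta_t
    theta = [p - eta * g for p, g in zip(theta, grads)]   # theta_{t+1} <- theta_t - eta * nabla theta_t
print(loss(theta, x, y))  # the loss should now be close to 0
```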


We need to answer two questions

How to choose the loss function L(θ)?
How to compute ∇θ, which is composed of ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^{n×n}, ∇W_L ∈ R^{n×k}, ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?


Module 4.3: Output Functions and Loss Functions


[Figure: a neural network with L − 1 hidden layers mapping movie features x_i (isActor Damon, isDirector Nolan, ...) to three outputs: Critics Rating, IMDb Rating, RT Rating; e.g. y_i = {7.5, 8.2, 7.7}]

The choice of loss function depends on the problem at hand
We will illustrate this with the help of two examples
Consider our movie example again, but this time we are interested in predicting ratings
Here y_i ∈ R^3
The loss function should capture how much ŷ_i deviates from y_i
If y_i ∈ R^n then the squared error loss can capture this deviation

L(θ) = 1/N ∑_{i=1}^{N} ∑_{j=1}^{3} (ŷ_ij − y_ij)²
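A minimal NumPy sketch of this squared error loss; apart from the y_i = {7.5, 8.2, 7.7} taken from the slide, the rating values below are made up for the example.

```python
import numpy as np

def squared_error(y_hat, y):
    # L(theta) = 1/N * sum_i sum_j (y_hat_ij - y_ij)^2
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

y     = np.array([[7.5, 8.2, 7.7],    # true ratings for movie 1 (from the slide)
                  [6.0, 6.3, 5.9]])   # hypothetical true ratings for movie 2
y_hat = np.array([[7.2, 8.0, 7.9],    # hypothetical predicted ratings
                  [6.4, 6.1, 6.2]])
print(squared_error(y_hat, y))
```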


A related question: What should the output function 'O' be if y_i ∈ R?
More specifically, can it be the logistic function?
No, because it restricts ŷ_i to a value between 0 & 1, but we want ŷ_i ∈ R
So, in such cases it makes sense to have 'O' as a linear function

f(x) = h_L = O(a_L) = W_O a_L + b_O

ŷ_i = f(x_i) is no longer bounded between 0 and 1
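A quick sketch of this point; the numbers and the identity choice for W_O are arbitrary assumptions for illustration. The logistic function squashes every output into (0, 1), while a linear output layer leaves ŷ unbounded.

```python
import numpy as np

a_L = np.array([-4.0, 0.5, 6.0])           # hypothetical output pre-activations
logistic_out = 1.0 / (1.0 + np.exp(-a_L))  # always in (0, 1): cannot represent a rating like 7.5
W_O, b_O = np.eye(3), np.zeros(3)          # a (hypothetical) linear output layer
linear_out = W_O @ a_L + b_O               # unbounded: any real-valued y_hat is reachable
print(logistic_out, linear_out)
```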


[Figure: a neural network with L − 1 hidden layers whose output layer scores four classes: Apple, Mango, Orange, Banana; the true label is y = [1 0 0 0]]

Now let us consider another problem for which a different loss function would be appropriate
Suppose we want to classify an image into 1 of k classes
Here again we could use the squared error loss to capture the deviation
But can you think of a better function?


[Figure: the same classification network, with true label y = [1 0 0 0]]

Notice that y is a probability distribution
Therefore we should also ensure that ŷ is a probability distribution
What choice of the output activation 'O' will ensure this?

a_L = W_L h_{L−1} + b_L
ŷ_j = O(a_L)_j = e^{a_{L,j}} / ∑_{i=1}^{k} e^{a_{L,i}}

O(a_L)_j is the jth element of ŷ and a_{L,j} is the jth element of the vector a_L.
This function is called the softmax function
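A minimal NumPy sketch of the softmax output function; the pre-activation values are made up, and subtracting the maximum is a standard numerical-stability trick rather than part of the formula.

```python
import numpy as np

def softmax(a_L):
    # y_hat_j = exp(a_{L,j}) / sum_{i=1}^{k} exp(a_{L,i})
    e = np.exp(a_L - np.max(a_L))   # shift by max(a_L) only for numerical stability
    return e / e.sum()

a_L = np.array([2.0, 1.0, 0.1, -1.3])   # hypothetical pre-activations for the 4 classes
y_hat = softmax(a_L)
print(y_hat, y_hat.sum())               # entries are positive and sum to 1
```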


Neural network with L−1 hidden layers

Apple Mango Orange Banana

y = [1 0 0 0]

Now that we have ensured that both y & ŷ are probability distributions, can you think of a function which captures the difference between them?

Cross-entropy:

L(θ) = −Σ_{c=1}^{k} y_c log ŷ_c

Notice that

y_c = 1 if c = ℓ (the true class label)
    = 0 otherwise

∴ L(θ) = − log ŷ_ℓ
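To make the simplification concrete, here is a small self-contained sketch (the predicted distribution is made up) that computes the cross-entropy between a one-hot y and ŷ both ways and shows the two numbers agree:

import numpy as np

def cross_entropy(y, y_hat):
    # L(theta) = -sum_c y_c * log(y_hat_c); for one-hot y this reduces to -log(y_hat[l])
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.70, 0.15, 0.10, 0.05])  # made-up softmax output for k = 4 classes
y = np.array([1.0, 0.0, 0.0, 0.0])          # true class l = 0 (Apple in the running example)

print(cross_entropy(y, y_hat))              # full sum over classes
print(-np.log(y_hat[0]))                    # -log y_hat_l, the same number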


21/57

[Diagram: feedforward network with inputs x1 x2 … xn, layers (W1, b1) → a1 → h1, (W2, b2) → a2 → h2, (W3, b3) → a3, and output hL = ŷ = f(x)]

So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function

minimize_θ  L(θ) = − log ŷ_ℓ

or maximize_θ  −L(θ) = log ŷ_ℓ

But wait! Is ŷ_ℓ a function of θ = [W1, W2, …, WL, b1, b2, …, bL]?

Yes, it is indeed a function of θ

ŷ_ℓ = [O(W3 g(W2 g(W1 x + b1) + b2) + b3)]_ℓ

What does ŷ_ℓ encode?

It is the probability that x belongs to the ℓ-th class (bring it as close to 1).

log ŷ_ℓ is called the log-likelihood of the data.
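Here is a minimal sketch of this forward computation for the L = 3 network above, with sigmoid assumed as the hidden activation g and softmax as the output activation O. The sizes n and k, the random initialization, and the matrix storage convention (W3 kept as a (k, n) array so that W3 @ h2 gives the k class scores, i.e. the transpose of the slide's W_L ∈ R^{n×k}) are illustrative assumptions, not part of the slide:

import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 4                               # assumed input size and number of classes

# parameters theta = [W1, W2, W3, b1, b2, b3]
W1, b1 = rng.standard_normal((n, n)), np.zeros(n)
W2, b2 = rng.standard_normal((n, n)), np.zeros(n)
W3, b3 = rng.standard_normal((k, n)), np.zeros(k)

def g(a):                                 # hidden-layer activation (logistic/sigmoid)
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def f(x):
    # y_hat = O(W3 g(W2 g(W1 x + b1) + b2) + b3)
    h1 = g(W1 @ x + b1)
    h2 = g(W2 @ h1 + b2)
    return softmax(W3 @ h2 + b3)

x, l = rng.standard_normal(n), 0          # an input and its true class label l
y_hat = f(x)
loss = -np.log(y_hat[l])                  # L(theta) = -log y_hat_l
print(y_hat, loss)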


22/57

Outputs             Real Values      Probabilities
Output Activation   Linear           Softmax
Loss Function       Squared Error    Cross Entropy

Of course, there could be other loss functions depending on the problem at hand but the two loss functions that we just saw are encountered very often

For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
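A compact sketch of the two standard pairings in the table, written for a single example (the target and prediction values are made up):

import numpy as np

def squared_error(y, y_hat):
    # pairs naturally with a linear output layer (real-valued targets)
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    # pairs naturally with a softmax output layer (probability targets)
    return -np.sum(y * np.log(y_hat))

# real-valued target vs. linear output
print(squared_error(np.array([1.2, -0.7]), np.array([1.0, -0.5])))

# one-hot class label vs. softmax output
print(cross_entropy(np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.15, 0.05])))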


23/57

Module 4.4: Backpropagation (Intuition)


24/57

We need to answer two questions

How to choose the loss function L(θ)?

How to compute ∇θ, which is composed of ∇W1, ∇W2, …, ∇W_{L−1} ∈ R^{n×n}, ∇W_L ∈ R^{n×k}, ∇b1, ∇b2, …, ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?
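As a shape check, here is a tiny sketch that allocates gradient buffers with exactly these dimensions, for an assumed n, k and L:

import numpy as np

n, k, L = 5, 4, 3                                  # assumed sizes: n units per hidden layer, k classes, L layers

# one gradient buffer per parameter, with the shapes listed above
grad_W = [np.zeros((n, n)) for _ in range(L - 1)] + [np.zeros((n, k))]
grad_b = [np.zeros(n) for _ in range(L - 1)] + [np.zeros(k)]

for i, (gW, gb) in enumerate(zip(grad_W, grad_b), start=1):
    print(f"grad W{i}: {gW.shape}, grad b{i}: {gb.shape}")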


25/57

Let us focus on this one weight (W112).

To learn this weight using SGD we need a formula for ∂L(θ)/∂W112.

We will see how to calculate this.

[Diagram: network with inputs x1 x2 … xd, the weight W112 in the first layer highlighted, layers (W1, b1), (W2, b2), (W3, b3), output ŷ = f(x)]

Algorithm: gradient_descent()
  t ← 0; max_iterations ← 1000;
  Initialize θ0;
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η ∇θ_t;
  end
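The same update loop as a minimal Python sketch; compute_gradient is a placeholder for the backpropagation computation derived in the rest of the lecture, and eta / max_iterations are assumed hyperparameters:

import numpy as np

def gradient_descent(theta0, compute_gradient, eta=0.1, max_iterations=1000):
    # theta_{t+1} = theta_t - eta * grad L(theta_t), for a fixed number of steps
    theta = theta0
    for t in range(max_iterations):
        grad = compute_gradient(theta)   # placeholder: backpropagation would supply this
        theta = theta - eta * grad
    return theta

# usage on a toy quadratic loss L(theta) = ||theta||^2, whose gradient is 2*theta
theta_star = gradient_descent(np.array([3.0, -2.0]), lambda th: 2 * th)
print(theta_star)                        # close to the minimizer [0, 0]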


26/57

First let us take the simple case when we have a deep but thin network.

In this case it is easy to find the derivative by chain rule.

∂L(θ)/∂W111 = ∂L(θ)/∂ŷ · ∂ŷ/∂aL1 · ∂aL1/∂h21 · ∂h21/∂a21 · ∂a21/∂h11 · ∂h11/∂a11 · ∂a11/∂W111

∂L(θ)/∂W111 = ∂L(θ)/∂h11 · ∂h11/∂W111   (just compressing the chain rule)

∂L(θ)/∂W211 = ∂L(θ)/∂h21 · ∂h21/∂W211

∂L(θ)/∂WL11 = ∂L(θ)/∂aL1 · ∂aL1/∂WL11

[Diagram: deep but thin network x1 → a11 → h11 → a21 → h21 → aL1 → ŷ = f(x), with weights W111, W211, WL11 and loss L(θ)]
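For intuition, here is a small sketch of this chain rule on such a one-neuron-per-layer network. All weights and the input are made-up scalars, the hidden activation is assumed to be a sigmoid, and a squared-error loss with a linear output unit is used just to keep everything scalar; a finite-difference estimate confirms the product of local derivatives:

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

W111, W211, WL11 = 0.5, -0.3, 0.8          # made-up scalar weights
x1, y = 1.2, 0.7                           # input and a real-valued target

def forward(w1):
    a11 = w1 * x1;      h11 = sigmoid(a11)
    a21 = W211 * h11;   h21 = sigmoid(a21)
    aL1 = WL11 * h21;   y_hat = aL1        # linear output unit
    return a11, h11, a21, h21, y_hat

a11, h11, a21, h21, y_hat = forward(W111)

# chain rule: dL/dW111 = dL/dy_hat * dy_hat/dh21 * dh21/da21 * da21/dh11 * dh11/da11 * da11/dW111
grad = ((y_hat - y)                        # dL/dy_hat for L = 0.5 * (y_hat - y)^2
        * WL11                             # dy_hat/dh21
        * h21 * (1 - h21)                  # dh21/da21
        * W211                             # da21/dh11
        * h11 * (1 - h11)                  # dh11/da11
        * x1)                              # da11/dW111

eps = 1e-6                                 # finite-difference check
loss = lambda w1: 0.5 * (forward(w1)[-1] - y) ** 2
print(grad, (loss(W111 + eps) - loss(W111 - eps)) / (2 * eps))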


27/57

Let us see an intuitive explanation of backpropagation before we get into the mathematical details


28/57

We get a certain loss at the output and we try to figure out who is responsible for this loss

So, we talk to the output layer and say "Hey! You are not producing the desired output, better take responsibility".

The output layer says "Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me". After all . . .

f(x) = ŷ = O(WL hL−1 + bL)

[Diagram: network with inputs x1 x2 … xn, layers (W1, b1), (W2, b2), (W3, b3), and loss − log ŷ_ℓ at the output]


29/57

So, we talk to WL, bL and hL and ask them "What is wrong with you?"

WL and bL take full responsibility but hL says "Well, please understand that I am only as good as the pre-activation layer"

The pre-activation layer in turn says that I am only as good as the hidden layer and weights below me.

We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model)

But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do)

∂L(θ)/∂W111 (talk to the weight directly) = ∂L(θ)/∂ŷ · ∂ŷ/∂a3 (talk to the output layer) · ∂a3/∂h2 · ∂h2/∂a2 (talk to the previous hidden layer) · ∂a2/∂h1 · ∂h1/∂a1 (talk to the previous hidden layer) · ∂a1/∂W111 (and now talk to the weights)

[Diagram: network with inputs x1 x2 … xn, layers (W1, b1), (W2, b2), (W3, b3), and loss − log ŷ_ℓ at the output]


30/57

Quantities of interest (roadmap for the remaining part):

Gradient w.r.t. output units

Gradient w.r.t. hidden units

Gradient w.r.t. weights and biases

∂L(θ)/∂W111 (talk to the weight directly) = ∂L(θ)/∂ŷ · ∂ŷ/∂a3 (talk to the output layer) · ∂a3/∂h2 · ∂h2/∂a2 (talk to the previous hidden layer) · ∂a2/∂h1 · ∂h1/∂a1 (talk to the previous hidden layer) · ∂a1/∂W111 (and now talk to the weights)

Our focus is on Cross entropy loss and Softmax output.


31/57

Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units


32/57

Quantities of interest (roadmap for the remaining part):

Gradient w.r.t. output units

Gradient w.r.t. hidden units

Gradient w.r.t. weights

∂L(θ)/∂W111 (talk to the weight directly) = ∂L(θ)/∂ŷ · ∂ŷ/∂a3 (talk to the output layer) · ∂a3/∂h2 · ∂h2/∂a2 (talk to the previous hidden layer) · ∂a2/∂h1 · ∂h1/∂a1 (talk to the previous hidden layer) · ∂a1/∂W111 (and now talk to the weights)

Our focus is on Cross entropy loss and Softmax output.


33/57

Let us first consider the partial derivative w.r.t. the i-th output

L(θ) = − log ŷ_ℓ   (ℓ = true class label)

∂L(θ)/∂ŷ_i = ∂(− log ŷ_ℓ)/∂ŷ_i
           = −1/ŷ_ℓ   if i = ℓ
           = 0         otherwise

More compactly,

∂L(θ)/∂ŷ_i = − 1_(i=ℓ) / ŷ_ℓ

[Diagram: network with inputs x1 x2 … xn, layers (W1, b1), (W2, b2), (W3, b3), and loss − log ŷ_ℓ at the output]


34/57

∂L(θ)/∂ŷ_i = − 1_(ℓ=i) / ŷ_ℓ

We can now talk about the gradient w.r.t. the vector ŷ

∇_ŷ L(θ) = [∂L(θ)/∂ŷ_1, …, ∂L(θ)/∂ŷ_k]^T = −(1/ŷ_ℓ) [1_(ℓ=1), 1_(ℓ=2), …, 1_(ℓ=k)]^T = −(1/ŷ_ℓ) e(ℓ)

where e(ℓ) is a k-dimensional vector whose ℓ-th element is 1 and all other elements are 0.

[Diagram: network with inputs x1 x2 … xn, layers (W1, b1), (W2, b2), (W3, b3), and loss − log ŷ_ℓ at the output]
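A small numeric sketch of this gradient (the predicted distribution is made up; the finite-difference column is only there to check the closed form against the loss −log ŷ_ℓ treated as a function of ŷ):

import numpy as np

y_hat = np.array([0.70, 0.15, 0.10, 0.05])      # assumed softmax output, k = 4
l = 0                                           # true class label

e_l = np.zeros_like(y_hat); e_l[l] = 1.0        # one-hot vector e(l)
grad_yhat = -e_l / y_hat[l]                     # closed form: -(1/y_hat_l) e(l)

# finite-difference check of d(-log y_hat_l)/d y_hat_i, component by component
eps = 1e-6
fd = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    yp, ym = y_hat.copy(), y_hat.copy()
    yp[i] += eps; ym[i] -= eps
    fd[i] = (-np.log(yp[l]) + np.log(ym[l])) / (2 * eps)

print(grad_yhat)
print(fd)                                       # matches: only the l-th entry is non-zero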


35/57

What we are actually interested in is

∂L(θ)/∂a_Li = ∂(− log ŷ_ℓ)/∂a_Li = ∂(− log ŷ_ℓ)/∂ŷ_ℓ · ∂ŷ_ℓ/∂a_Li

Does ŷ_ℓ depend on a_Li? Indeed, it does.

ŷ_ℓ = exp(a_Lℓ) / Σ_i exp(a_Li)

Having established this, we will now derive the full expression on the next slide

[Diagram: network with inputs x1 x2 … xn, layers (W1, b1), (W2, b2), (W3, b3), and loss − log ŷ_ℓ at the output]


36/57

∂(− log ŷ_ℓ)/∂a_Li
  = −(1/ŷ_ℓ) ∂ŷ_ℓ/∂a_Li
  = −(1/ŷ_ℓ) ∂ softmax(a_L)_ℓ / ∂a_Li
  = −(1/ŷ_ℓ) ∂/∂a_Li [ exp(a_L)_ℓ / Σ_{i′} exp(a_L)_{i′} ]
  = −(1/ŷ_ℓ) [ (∂ exp(a_L)_ℓ / ∂a_Li) / Σ_{i′} exp(a_L)_{i′} − exp(a_L)_ℓ (∂/∂a_Li Σ_{i′} exp(a_L)_{i′}) / (Σ_{i′} exp(a_L)_{i′})² ]
  = −(1/ŷ_ℓ) [ 1_(ℓ=i) exp(a_L)_ℓ / Σ_{i′} exp(a_L)_{i′} − (exp(a_L)_ℓ / Σ_{i′} exp(a_L)_{i′}) · (exp(a_L)_i / Σ_{i′} exp(a_L)_{i′}) ]
  = −(1/ŷ_ℓ) [ 1_(ℓ=i) softmax(a_L)_ℓ − softmax(a_L)_ℓ softmax(a_L)_i ]
  = −(1/ŷ_ℓ) [ 1_(ℓ=i) ŷ_ℓ − ŷ_ℓ ŷ_i ]
  = −(1_(ℓ=i) − ŷ_i)

(using the quotient rule: ∂[g(x)/h(x)]/∂x = (∂g(x)/∂x) · 1/h(x) − (g(x)/h(x)²) · ∂h(x)/∂x)


37/57

So far we have derived the partial derivative w.r.t. the i-th element of a_L

∂L(θ)/∂a_L,i = −(1_(ℓ=i) − ŷ_i)

We can now write the gradient w.r.t. the vector a_L

∇_{a_L} L(θ) = [∂L(θ)/∂a_L1, …, ∂L(θ)/∂a_Lk]^T = [−(1_(ℓ=1) − ŷ_1), −(1_(ℓ=2) − ŷ_2), …, −(1_(ℓ=k) − ŷ_k)]^T = −(e(ℓ) − ŷ)

[Diagram: network with inputs x1 x2 … xn, layers (W1, b1), (W2, b2), (W3, b3), and loss − log ŷ_ℓ at the output]
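A quick sketch checking this compact result numerically: the gradient of −log softmax(a_L)_ℓ with respect to a_L should equal −(e(ℓ) − ŷ), i.e. ŷ − e(ℓ). The pre-activation values below are made up:

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a_L = np.array([2.0, 1.0, 0.1, -1.0])           # assumed pre-activations of the output layer
l = 0                                           # true class label
y_hat = softmax(a_L)

e_l = np.zeros_like(a_L); e_l[l] = 1.0
grad_closed_form = -(e_l - y_hat)               # the result derived above

# finite-difference check on L(a_L) = -log softmax(a_L)_l
eps = 1e-6
fd = np.zeros_like(a_L)
for i in range(len(a_L)):
    ap, am = a_L.copy(), a_L.copy()
    ap[i] += eps; am[i] -= eps
    fd[i] = (-np.log(softmax(ap)[l]) + np.log(softmax(am)[l])) / (2 * eps)

print(grad_closed_form)
print(fd)                                       # the two agree up to finite-difference error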

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4


Page 175: CS7015 (Deep Learning): Lecture 4

38/57

Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 176: CS7015 (Deep Learning): Lecture 4

39/57

Quantities of interest (roadmap for the remaining part):

Gradient w.r.t. output units

Gradient w.r.t. hidden units

Gradient w.r.t. weights and biases

$$
\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\substack{\text{Talk to the}\\\text{weight directly}}}
=\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial a_{3}}}_{\substack{\text{Talk to the}\\\text{output layer}}}
\;\underbrace{\frac{\partial a_{3}}{\partial h_{2}}\,\frac{\partial h_{2}}{\partial a_{2}}}_{\substack{\text{Talk to the}\\\text{previous hidden layer}}}
\;\underbrace{\frac{\partial a_{2}}{\partial h_{1}}\,\frac{\partial h_{1}}{\partial a_{1}}}_{\substack{\text{Talk to the}\\\text{previous hidden layer}}}
\;\underbrace{\frac{\partial a_{1}}{\partial W_{111}}}_{\substack{\text{and now talk}\\\text{to the weights}}}
$$

Our focus is on cross entropy loss and softmax output.

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 177: CS7015 (Deep Learning): Lecture 4

40/57

Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results $q_i(z)$, then we have:

$$\frac{\partial p(z)}{\partial z}=\sum_{m}\frac{\partial p(z)}{\partial q_{m}(z)}\,\frac{\partial q_{m}(z)}{\partial z}$$

In our case:

p(z) is the loss function $\mathscr{L}(\theta)$

$z = h_{ij}$

$q_{m}(z) = a_{Lm}$

[Figure: the two-hidden-layer network x1 … xn → (W1, b1) → a1 → h1 → (W2, b2) → a2 → h2 → (W3, b3) → a3 → ŷ, with loss −log ŷℓ]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
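A tiny numerical illustration of this multi-path chain rule (my own toy example, not from the slides): here p depends on z only through two intermediates q1(z) and q2(z), and the sum over the two paths matches a finite-difference estimate of dp/dz.

```python
import numpy as np

q1 = lambda z: z ** 2
q2 = lambda z: np.sin(z)
p  = lambda z: q1(z) * q2(z) + 3.0 * q2(z)   # p(z) = q1*q2 + 3*q2

z = 0.7
# Partials of p w.r.t. each intermediate, evaluated at (q1(z), q2(z))
dp_dq1 = q2(z)
dp_dq2 = q1(z) + 3.0
# Derivatives of the intermediates w.r.t. z
dq1_dz = 2 * z
dq2_dz = np.cos(z)

multi_path = dp_dq1 * dq1_dz + dp_dq2 * dq2_dz   # sum over the two paths
eps = 1e-6
numeric = (p(z + eps) - p(z - eps)) / (2 * eps)  # central difference
assert abs(multi_path - numeric) < 1e-6
```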


Page 182: CS7015 (Deep Learning): Lecture 4

41/57

Intentionally left blank

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 183: CS7015 (Deep Learning): Lecture 4

42/57

$$
\frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}
=\sum_{m=1}^{k}\frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}}\,\frac{\partial a_{i+1,m}}{\partial h_{ij}}
=\sum_{m=1}^{k}\frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}}\,W_{i+1,m,j}
$$

Now consider these two vectors,

$$
\nabla_{a_{i+1}}\mathscr{L}(\theta)=
\begin{bmatrix}\frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,1}}\\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,k}}\end{bmatrix};
\qquad
W_{i+1,\,\cdot\,,j}=
\begin{bmatrix}W_{i+1,1,j}\\ \vdots \\ W_{i+1,k,j}\end{bmatrix}
$$

$W_{i+1,\,\cdot\,,j}$ is the j-th column of $W_{i+1}$; see that,

$$\left(W_{i+1,\,\cdot\,,j}\right)^{T}\nabla_{a_{i+1}}\mathscr{L}(\theta)=\sum_{m=1}^{k}\frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}}\,W_{i+1,m,j}$$

[Figure: the two-hidden-layer network x1 … xn → (W1, b1) → a1 → h1 → (W2, b2) → a2 → h2 → (W3, b3) → a3 → ŷ, with loss −log ŷℓ; here $a_{i+1} = W_{i+1}h_{i} + b_{i+1}$]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
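The identity on this slide — the sum over m equals a dot product with the j-th column of $W_{i+1}$ — can be checked in a couple of lines. The sketch below uses randomly generated stand-ins for $W_{i+1}$ and $\nabla_{a_{i+1}}\mathscr{L}(\theta)$; the sizes k, n and the column index j are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 4, 3
W_next = rng.normal(size=(k, n))        # stand-in for W_{i+1}
grad_a_next = rng.normal(size=k)        # stand-in for nabla_{a_{i+1}} L(theta)

j = 1
# Sum over the k units of the next layer (the "multiple paths")
as_sum = sum(grad_a_next[m] * W_next[m, j] for m in range(k))
# The same quantity written as a dot product with the j-th column of W_{i+1}
as_dot = W_next[:, j] @ grad_a_next
assert np.isclose(as_sum, as_dot)
```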


Page 198: CS7015 (Deep Learning): Lecture 4

43/57

We have,

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}=\left(W_{i+1,\,\cdot\,,j}\right)^{T}\nabla_{a_{i+1}}\mathscr{L}(\theta)$$

We can now write the gradient w.r.t. $h_i$:

$$
\nabla_{h_i}\mathscr{L}(\theta)=
\begin{bmatrix}\frac{\partial \mathscr{L}(\theta)}{\partial h_{i1}}\\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{i2}}\\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{in}}\end{bmatrix}
=\begin{bmatrix}\left(W_{i+1,\,\cdot\,,1}\right)^{T}\nabla_{a_{i+1}}\mathscr{L}(\theta)\\ \left(W_{i+1,\,\cdot\,,2}\right)^{T}\nabla_{a_{i+1}}\mathscr{L}(\theta)\\ \vdots \\ \left(W_{i+1,\,\cdot\,,n}\right)^{T}\nabla_{a_{i+1}}\mathscr{L}(\theta)\end{bmatrix}
=\left(W_{i+1}\right)^{T}\left(\nabla_{a_{i+1}}\mathscr{L}(\theta)\right)
$$

We are almost done, except that we do not know how to calculate $\nabla_{a_{i+1}}\mathscr{L}(\theta)$ for i < L − 1.

We will see how to compute that.

[Figure: the two-hidden-layer network x1 … xn → (W1, b1) → a1 → h1 → (W2, b2) → a2 → h2 → (W3, b3) → a3 → ŷ, with loss −log ŷℓ]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
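Stacking the n column dot products is the same as a single matrix–vector product with $W_{i+1}^{T}$, which is how this step is implemented in practice. A short sketch with random stand-ins (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 3
W_next = rng.normal(size=(k, n))        # stand-in for W_{i+1}
grad_a_next = rng.normal(size=k)        # stand-in for nabla_{a_{i+1}} L(theta)

# Stack the per-component dot products of the previous slide ...
stacked = np.array([W_next[:, j] @ grad_a_next for j in range(n)])
# ... and compare with the single matrix-vector product W_{i+1}^T (nabla_{a_{i+1}} L)
assert np.allclose(stacked, W_next.T @ grad_a_next)
```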


Page 211: CS7015 (Deep Learning): Lecture 4

44/57

$$
\nabla_{a_i}\mathscr{L}(\theta)=
\begin{bmatrix}\frac{\partial \mathscr{L}(\theta)}{\partial a_{i1}}\\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{in}}\end{bmatrix}
$$

$$
\frac{\partial \mathscr{L}(\theta)}{\partial a_{ij}}
=\frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}\,\frac{\partial h_{ij}}{\partial a_{ij}}
=\frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}\,g'(a_{ij})\qquad[\because\ h_{ij}=g(a_{ij})]
$$

$$
\nabla_{a_i}\mathscr{L}(\theta)=
\begin{bmatrix}\frac{\partial \mathscr{L}(\theta)}{\partial h_{i1}}\,g'(a_{i1})\\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{in}}\,g'(a_{in})\end{bmatrix}
=\nabla_{h_i}\mathscr{L}(\theta)\odot[\ldots,\,g'(a_{ik}),\,\ldots]
$$

[Figure: the two-hidden-layer network x1 … xn → (W1, b1) → a1 → h1 → (W2, b2) → a2 → h2 → (W3, b3) → a3 → ŷ, with loss −log ŷℓ]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
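In code this step is just an elementwise multiply by g′ evaluated at the pre-activations. The sketch below assumes a logistic g purely for illustration (the lecture leaves g generic), and uses a random stand-in for $\nabla_{h_i}\mathscr{L}(\theta)$.

```python
import numpy as np

def g(a):                         # logistic activation, one possible choice of g
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):                   # for the logistic function, g'(a) = g(a) * (1 - g(a))
    s = g(a)
    return s * (1.0 - s)

rng = np.random.default_rng(3)
a_i = rng.normal(size=5)          # pre-activations of layer i
grad_h_i = rng.normal(size=5)     # stand-in for nabla_{h_i} L(theta), assumed already computed

grad_a_i = grad_h_i * g_prime(a_i)    # elementwise (Hadamard) product
print(grad_a_i)
```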


Page 225: CS7015 (Deep Learning): Lecture 4

45/57

Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 226: CS7015 (Deep Learning): Lecture 4

46/57

Quantities of interest (roadmap for the remaining part):

Gradient w.r.t. output units

Gradient w.r.t. hidden units

Gradient w.r.t. weights and biases

$$
\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\substack{\text{Talk to the}\\\text{weight directly}}}
=\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial a_{3}}}_{\substack{\text{Talk to the}\\\text{output layer}}}
\;\underbrace{\frac{\partial a_{3}}{\partial h_{2}}\,\frac{\partial h_{2}}{\partial a_{2}}}_{\substack{\text{Talk to the}\\\text{previous hidden layer}}}
\;\underbrace{\frac{\partial a_{2}}{\partial h_{1}}\,\frac{\partial h_{1}}{\partial a_{1}}}_{\substack{\text{Talk to the}\\\text{previous hidden layer}}}
\;\underbrace{\frac{\partial a_{1}}{\partial W_{111}}}_{\substack{\text{and now talk}\\\text{to the weights}}}
$$

Our focus is on cross entropy loss and softmax output.

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 227: CS7015 (Deep Learning): Lecture 4

47/57

Recall that,

$$a_k = b_k + W_k h_{k-1}$$

$$\frac{\partial a_{ki}}{\partial W_{kij}} = h_{k-1,j}$$

$$
\frac{\partial \mathscr{L}(\theta)}{\partial W_{kij}}
=\frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\,\frac{\partial a_{ki}}{\partial W_{kij}}
=\frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\,h_{k-1,j}
$$

$$
\nabla_{W_k}\mathscr{L}(\theta)=
\begin{bmatrix}
\frac{\partial \mathscr{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k12}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k1n}}\\
\vdots & \vdots & \ddots & \vdots\\
\cdots & \cdots & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{knn}}
\end{bmatrix}
$$

[Figure: the two-hidden-layer network x1 … xn → (W1, b1) → a1 → h1 → (W2, b2) → a2 → h2 → (W3, b3) → a3 → ŷ, with loss −log ŷℓ]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
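The first identity, $\partial a_{ki}/\partial W_{kij} = h_{k-1,j}$, is easy to confirm by perturbing a single weight entry: only the i-th component of $a_k$ moves, and it moves at rate $h_{k-1,j}$. The sketch below is illustrative; the indices i, j and the step eps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
W_k = rng.normal(size=(n, n))
b_k = rng.normal(size=n)
h_prev = rng.normal(size=n)              # stand-in for h_{k-1}

a_k = b_k + W_k @ h_prev
i, j, eps = 1, 2, 1e-6

W_pert = W_k.copy()
W_pert[i, j] += eps                      # perturb a single entry W_{kij}
a_pert = b_k + W_pert @ h_prev

numeric = (a_pert - a_k) / eps           # finite-difference rate of change of a_k
assert np.isclose(numeric[i], h_prev[j])           # a_{ki} moves at rate h_{k-1,j}
assert np.allclose(np.delete(numeric, i), 0.0)     # the other components do not move
```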


Page 234: CS7015 (Deep Learning): Lecture 4

48/57

Intentionally left blank

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 235: CS7015 (Deep Learning): Lecture 4

49/57

Let's take a simple example of a $W_k \in \mathbb{R}^{3\times 3}$ and see what each entry looks like:

$$
\nabla_{W_k}\mathscr{L}(\theta)=
\begin{bmatrix}
\frac{\partial \mathscr{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k12}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k13}}\\
\frac{\partial \mathscr{L}(\theta)}{\partial W_{k21}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k22}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k23}}\\
\frac{\partial \mathscr{L}(\theta)}{\partial W_{k31}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k32}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k33}}
\end{bmatrix}
\qquad
\frac{\partial \mathscr{L}(\theta)}{\partial W_{kij}}=\frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\,\frac{\partial a_{ki}}{\partial W_{kij}}
$$

$$
\nabla_{W_k}\mathscr{L}(\theta)=
\begin{bmatrix}
\frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}}h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}}h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}}h_{k-1,3}\\
\frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}}h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}}h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}}h_{k-1,3}\\
\frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}}h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}}h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}}h_{k-1,3}
\end{bmatrix}
=\nabla_{a_k}\mathscr{L}(\theta)\cdot h_{k-1}^{T}
$$

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
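Filling in every entry $\frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\,h_{k-1,j}$ is exactly an outer product, which is what np.outer computes. A minimal sketch with random stand-ins for $\nabla_{a_k}\mathscr{L}(\theta)$ and $h_{k-1}$ (the 3×3 size mirrors the slide's example):

```python
import numpy as np

rng = np.random.default_rng(5)
grad_a_k = rng.normal(size=3)       # stand-in for nabla_{a_k} L(theta)
h_prev = rng.normal(size=3)         # stand-in for h_{k-1}

# Entry (i, j) is (dL/da_{ki}) * h_{k-1,j} ...
elementwise = np.array([[grad_a_k[i] * h_prev[j] for j in range(3)] for i in range(3)])
# ... which is exactly the outer product  nabla_{a_k} L(theta) . h_{k-1}^T
assert np.allclose(elementwise, np.outer(grad_a_k, h_prev))
```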


Page 243: CS7015 (Deep Learning): Lecture 4

50/57

Finally, coming to the biases:

$$a_{ki} = b_{ki} + \sum_{j} W_{kij}\,h_{k-1,j}$$

$$
\frac{\partial \mathscr{L}(\theta)}{\partial b_{ki}}
=\frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\,\frac{\partial a_{ki}}{\partial b_{ki}}
=\frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}
$$

We can now write the gradient w.r.t. the vector $b_k$:

$$
\nabla_{b_k}\mathscr{L}(\theta)=
\begin{bmatrix}\frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}}\\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}}\\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{kn}}\end{bmatrix}
=\nabla_{a_k}\mathscr{L}(\theta)
$$

[Figure: the two-hidden-layer network x1 … xn → (W1, b1) → a1 → h1 → (W2, b2) → a2 → h2 → (W3, b3) → a3 → ŷ, with loss −log ŷℓ]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4


Page 250: CS7015 (Deep Learning): Lecture 4

51/57

Module 4.8: Backpropagation: Pseudo code

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

Page 251: CS7015 (Deep Learning): Lecture 4

52/57

Finally, we have all the pieces of the puzzle:

$\nabla_{a_L}\mathscr{L}(\theta)$ (gradient w.r.t. output layer)

$\nabla_{h_k}\mathscr{L}(\theta),\ \nabla_{a_k}\mathscr{L}(\theta)$ (gradient w.r.t. hidden layers, 1 ≤ k < L)

$\nabla_{W_k}\mathscr{L}(\theta),\ \nabla_{b_k}\mathscr{L}(\theta)$ (gradient w.r.t. weights and biases, 1 ≤ k ≤ L)

We can now write the full learning algorithm

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4


Page 256: CS7015 (Deep Learning): Lecture 4

53/57

Algorithm: gradient_descent()

    t ← 0;
    max_iterations ← 1000;
    Initialize θ^0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
    while t++ < max_iterations do
        h_1, h_2, ..., h_{L-1}, a_1, a_2, ..., a_L, ŷ = forward_propagation(θ^t);
        ∇θ^t = backward_propagation(h_1, h_2, ..., h_{L-1}, a_1, a_2, ..., a_L, y, ŷ);
        θ^{t+1} ← θ^t − η ∇θ^t;
    end

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
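A minimal Python rendering of this outer loop (a sketch, not the course's reference implementation): it assumes forward_propagation and back_propagation helpers with the signatures sketched after the next two slides, and the learning rate eta, max_iterations, and the list layout of the parameters are illustrative choices.

```python
# Outer loop only; forward_propagation and back_propagation are assumed to
# behave like the pseudocode on the next two slides (Python sketches follow them).
def gradient_descent(x, y_onehot, W, b, eta=0.1, max_iterations=1000):
    for t in range(max_iterations):
        h, a, y_hat = forward_propagation(x, W, b)
        dW, db = back_propagation(h, a, y_onehot, y_hat, W)
        W = [Wk - eta * dWk for Wk, dWk in zip(W, dW)]   # theta^{t+1} = theta^t - eta * grad
        b = [bk - eta * dbk for bk, dbk in zip(b, db)]
    return W, b
```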


Page 261: CS7015 (Deep Learning): Lecture 4

54/57

Algorithm: forward_propagation(θ)

    for k = 1 to L − 1 do
        a_k = b_k + W_k h_{k-1};
        h_k = g(a_k);
    end
    a_L = b_L + W_L h_{L-1};
    ŷ = O(a_L);

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
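A possible NumPy rendering of the forward pass (a sketch under assumptions: g is taken to be the logistic function and O the softmax, matching the lecture's softmax-output setting; the list layout of W and b is an illustrative choice):

```python
import numpy as np

def g(a):                                    # hidden-layer activation; logistic chosen for illustration
    return 1.0 / (1.0 + np.exp(-a))

def O(a):                                    # output function; softmax, as in the lecture's setting
    e = np.exp(a - a.max())
    return e / e.sum()

def forward_propagation(x, W, b):
    """W = [W_1, ..., W_L], b = [b_1, ..., b_L]; returns all h's, all a's, and y_hat."""
    L = len(W)
    h, a = [x], []                           # h[0] plays the role of h_0 (the input layer)
    for k in range(L - 1):
        a.append(b[k] + W[k] @ h[-1])        # a_k = b_k + W_k h_{k-1}
        h.append(g(a[-1]))                   # h_k = g(a_k)
    a.append(b[L - 1] + W[L - 1] @ h[-1])    # a_L = b_L + W_L h_{L-1}
    return h, a, O(a[-1])                    # y_hat = O(a_L)
```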


Page 267: CS7015 (Deep Learning): Lecture 4 · Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4. 4/57 x 1 x 2 x n a 1 a 2 a 3 h 1 h 2 h L= ^y= f(x) W 1 b 1 W 2 b 2 W 3 b 3 The input to the

55/57

Just do a forward propagation and compute all h_i's, a_i's, and ŷ

Algorithm: back propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ)

// Compute output gradient;
∇_{a_L} L(θ) = −(e(y) − ŷ);
for k = L to 1 do
    // Compute gradients w.r.t. parameters;
    ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}^T;
    ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
    // Compute gradients w.r.t. layer below;
    ∇_{h_{k−1}} L(θ) = W_k^T (∇_{a_k} L(θ));
    // Compute gradients w.r.t. layer below (pre-activation);
    ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ [..., g′(a_{k−1,j}), ...];
end
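Below is a NumPy sketch of this backward pass together with the gradient-descent loop; it assumes the softmax output and cross-entropy loss used earlier in the lecture (so that the output gradient is −(e(y) − ŷ)), a logistic g, and a tiny made-up network and training point. The forward pass is repeated here so the snippet runs on its own.

import numpy as np

def g(a):                                     # logistic activation (assumed)
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):                               # g'(a) = g(a)(1 - g(a)), derived below
    s = g(a)
    return s * (1.0 - s)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward_propagation(x, W, b, L):
    h, a = {0: x}, {}
    for k in range(1, L):
        a[k] = b[k] + W[k] @ h[k - 1]
        h[k] = g(a[k])
    a[L] = b[L] + W[L] @ h[L - 1]
    return h, a, softmax(a[L])

def backward_propagation(h, a, y_onehot, y_hat, W, L):
    dW, db = {}, {}
    grad_a = -(y_onehot - y_hat)              # output gradient: grad w.r.t. a_L
    for k in range(L, 0, -1):
        dW[k] = np.outer(grad_a, h[k - 1])    # grad w.r.t. W_k
        db[k] = grad_a                        # grad w.r.t. b_k
        if k > 1:
            grad_h = W[k].T @ grad_a          # grad w.r.t. h_{k-1}
            grad_a = grad_h * g_prime(a[k - 1])   # grad w.r.t. a_{k-1} (elementwise)
    return dW, db

# Assumed toy setup: 4 inputs, one hidden layer of 5 units, 3 classes, one example.
rng = np.random.default_rng(0)
sizes, L, eta = [4, 5, 3], 2, 0.5
W = {k: rng.normal(scale=0.5, size=(sizes[k], sizes[k - 1])) for k in range(1, L + 1)}
b = {k: np.zeros(sizes[k]) for k in range(1, L + 1)}
x, y_onehot = rng.normal(size=4), np.array([0.0, 1.0, 0.0])
for t in range(200):                          # the gradient_descent() loop from above
    h, a, y_hat = forward_propagation(x, W, b, L)
    dW, db = backward_propagation(h, a, y_onehot, y_hat, W, L)
    for k in range(1, L + 1):
        W[k] -= eta * dW[k]
        b[k] -= eta * db[k]
print(round(float(-np.log(y_hat[1])), 4))     # cross-entropy on the training point, near 0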


Module 4.9: Derivative of the activation function


Now, the only thing we need to figure out is how to compute g′

Logistic function

g(z) = σ(z) = 1 / (1 + e^{−z})

g′(z) = (−1) · 1/(1 + e^{−z})² · d/dz (1 + e^{−z})
      = (−1) · 1/(1 + e^{−z})² · (−e^{−z})
      = 1/(1 + e^{−z}) · (1 + e^{−z} − 1)/(1 + e^{−z})
      = g(z)(1 − g(z))

tanh

g(z) = tanh(z) = (e^{z} − e^{−z}) / (e^{z} + e^{−z})

g′(z) = [(e^{z} + e^{−z}) · d/dz(e^{z} − e^{−z}) − (e^{z} − e^{−z}) · d/dz(e^{z} + e^{−z})] / (e^{z} + e^{−z})²
      = [(e^{z} + e^{−z})² − (e^{z} − e^{−z})²] / (e^{z} + e^{−z})²
      = 1 − (e^{z} − e^{−z})² / (e^{z} + e^{−z})²
      = 1 − (g(z))²
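A quick numerical check of both identities (the finite-difference step and the test points below are arbitrary choices):

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)        # a few test points
eps = 1e-6                           # finite-difference step (assumed)

# centred finite differences vs. the closed forms derived above
num_sigma = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print(np.allclose(num_sigma, sigma(z) * (1 - sigma(z)), atol=1e-6))   # True
print(np.allclose(num_tanh, 1 - np.tanh(z) ** 2, atol=1e-6))          # True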
