
1/41

CS7015 (Deep Learning) : Lecture 14
Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and Exploding Gradients, Truncated BPTT

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras


2/41

Module 14.1: Sequence Learning Problems


3/41

[Figures: a convolutional network mapping a fixed-size 32 × 32 image to classes such as apple, bus, car, . . . ; a word2vec network that takes the context words "he sat" through shared Wcontext weights and predicts probabilities such as P(he | sat, he), P(chair | sat, he), P(man | sat, he), P(on | sat, he)]

In feedforward and convolutional neural networks the size of the input was always fixed

For example, we fed fixed-size (32 × 32) images to convolutional neural networks for image classification

Similarly, in word2vec we fed a fixed window (k) of words to the network

Further, each input to the network was independent of the previous or future inputs

For example, the computations, outputs and decisions for two successive images are completely independent of each other


4/41

[Figure: character-by-character auto-completion of the word "deep" — inputs d, e, e, p; predicted outputs e, e, p, 〈stop〉]

In many applications the input is not of a fixed size

Further, successive inputs may not be independent of each other

For example, consider the task of auto-completion

Given the first character ‘d’ you want to predict the next character ‘e’, and so on


5/41

[Figure: the same auto-completion chain — inputs d, e, e, p; outputs e, e, p, 〈stop〉]

Notice a few things

First, successive inputs are no longer independent (while predicting ‘e’ you would want to know what the previous input was, in addition to the current input)

Second, the length of the inputs and the number of predictions you need to make is not fixed (for example, “learn”, “deep”, “machine” have different numbers of characters)

Third, each network (orange-blue-green structure) is performing the same task (input: character, output: character)


6/41

[Figure: the auto-completion chain for "deep" again — inputs d, e, e, p; outputs e, e, p, 〈stop〉]

These are known as sequence learning problems

We need to look at a sequence of (dependent) inputs and produce an output (or outputs)

Each input corresponds to one time step

Let us look at some more examples of such problems


7/41

[Figure: part-of-speech tagging of the sentence "man is a social animal" — man/noun, is/verb, a/article, social/adjective, animal/noun]

Consider the task of predicting the part-of-speech tag (noun, adverb, adjective, verb) of each word in a sentence

Once we see an adjective (social) we are almost sure that the next word should be a noun (animal)

Thus the current output (noun) depends on the current input as well as the previous input

Further, the size of the input is not fixed (sentences could have an arbitrary number of words)

Notice that here we are interested in producing an output at each time step

Each network is performing the same task (input: word, output: tag)


8/41

[Figure: sentiment prediction for the review "The movie was boring and long" — the intermediate outputs are don’t-care; only the final output (+/−) matters]

Sometimes we may not be interested in producing an output at every stage

Instead we would look at the full sequence and then produce an output

For example, consider the task of predicting the polarity of a movie review

The prediction clearly does not depend only on the last word but also on some words which appear before

Here again we could think that the network is performing the same task at each step (input: word, output: +/−) but it’s just that we don’t care about intermediate outputs


9/41

[Figure: a sequence of video frames labelled with the activity being performed (Surya Namaskar)]

Sequences could be composed of anything (not just words)

For example, a video could be treated as a sequence of images

We may want to look at the entire sequence and detect the activity being performed



10/41

Module 14.2: Recurrent Neural Networks



11/41

How do we model such tasks involving sequences ?



12/41

Wishlist

Account for dependence between inputs

Account for variable number of inputs

Make sure that the function executed at each time step is the same

We will focus on each of these to arrive at a model for dealing with sequences



13/41

[Figure: two copies of the network, for time steps 1 and 2 — each takes input x_i, computes the state s_i through U and the output y_i through V]

What is the function being executed at each time step ?

s_i = σ(U x_i + b)

y_i = O(V s_i + c)

i = time step

Since we want the same function to be executed at each time step we should share the same network (i.e., the same parameters at each time step)
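A minimal numpy sketch of this shared per-time-step function may help; it is not the lecture's code — the choices of σ as the logistic function and O as softmax are assumptions for illustration, and the parameter shapes follow the convention given later on slide 21/41 (U ∈ R^{n×d}, V ∈ R^{d×k}), so the products are written x @ U and s @ V.

```python
import numpy as np

def step(x, U, V, b, c):
    """One time step: s = sigma(U x + b), y = O(V s + c).

    x: (n,), U: (n, d), V: (d, k), b: (d,), c: (k,)
    """
    s = 1.0 / (1.0 + np.exp(-(x @ U + b)))                # sigma = logistic function (assumed)
    z = s @ V + c
    y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # O = softmax (assumed)
    return s, y

# The same U, V, b, c are reused at every time step -- this is the parameter
# sharing the slide asks for.
```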


14/41

[Figure: the network unrolled over time steps 1, 2, 3, 4, . . . , n — each copy maps x_i to s_i (through U) and y_i (through V), with the same U and V at every step]

This parameter sharing also ensures that the network becomes agnostic to the length (size) of the input

Since we are simply going to compute the same function (with the same parameters) at each time step, the number of time steps doesn’t matter

We just create multiple copies of the network and execute them at each time step


15/41

[Figure: an infeasible alternative — the network at time step i takes all the inputs x_1, . . . , x_i and produces y_i]

How do we account for dependence between inputs ?

Let us first see an infeasible way of doing this

At each time step we will feed all the previous inputs to the network

Is this okay ?

No, it violates the other two items on our wishlist

How ? Let us see


16/41

[Figure: the same growing networks — the network at time step i takes x_1, . . . , x_i as input]

First, the function being computed at each time step is now different

y_1 = f_1(x_1)

y_2 = f_2(x_1, x_2)

y_3 = f_3(x_1, x_2, x_3)

The network is now sensitive to the length of the sequence

For example, a sequence of length 10 will require f_1, . . . , f_10 whereas a sequence of length 100 will require f_1, . . . , f_100


17/41

[Figure: the recurrent network unrolled over time steps 1 . . . n — each copy takes x_i (through U) and the previous state s_{i-1} (through W) to produce s_i and the output y_i (through V)]

The solution is to add a recurrent connection in the network,

s_i = σ(U x_i + W s_{i-1} + b)

y_i = O(V s_i + c)

or

y_i = f(x_i, s_{i-1}, W, U, V, b, c)

s_i is the state of the network at time step i

The parameters are W, U, V, c, b, which are shared across time steps

The same network (and parameters) can be used to compute y_1, y_2, . . . , y_10 or y_100
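As a concrete illustration, here is a minimal numpy sketch of unrolling this recurrence over a sequence; it is not the lecture's code, just the two equations above written out, using the parameter shapes given on slide 21/41 (U ∈ R^{n×d}, W ∈ R^{d×d}, V ∈ R^{d×k}) and assuming σ is the logistic function and O a softmax.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Unroll s_i = sigma(U x_i + W s_{i-1} + b), y_i = softmax(V s_i + c).

    xs: list of input vectors x_1..x_T, each of shape (n,)
    U: (n, d), W: (d, d), V: (d, k), b: (d,), c: (k,)
    """
    d = W.shape[0]
    s = np.zeros(d)                                           # s_0: initial state
    states, outputs = [], []
    for x in xs:                                              # same U, W, V, b, c at every step
        s = 1.0 / (1.0 + np.exp(-(x @ U + s @ W + b)))        # s_i (logistic sigma assumed)
        z = s @ V + c
        y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # O = softmax (assumed)
        states.append(s)
        outputs.append(y)
    return states, outputs
```

Because the loop reuses the same parameters at every step, the same function works for a sequence of length 10 or 100.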


18/41

[Figure: the compact (rolled-up) view — a single cell with input x_i (through U), a state s_i with a recurrent connection W to itself, and output y_i (through V)]

This can be represented more compactly

19/41

[Figures: the earlier examples redrawn with recurrent connections — POS tagging of "man is a social animal", activity recognition (Surya Namaskar) over video frames, auto-completion of "deep" (inputs d, e, e, p; outputs e, e, p, 〈stop〉), and sentiment (+/−) of "The movie was boring and long"]

Let us revisit the sequence learning problems that we saw earlier

We now have recurrent connections between time steps which account for dependence between inputs



20/41

Module 14.3: Backpropagation through time


21/41

[Figure: the unrolled recurrent network with inputs x_1 . . . x_4, outputs y_1 . . . y_4, and shared parameters U, V, W at every step]

Before proceeding let us look at the dimensions of the parameters carefully

x_i ∈ R^n (n-dimensional input)

s_i ∈ R^d (d-dimensional state)

y_i ∈ R^k (say k classes)

U ∈ R^{n×d}

V ∈ R^{d×k}

W ∈ R^{d×d}
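As a quick sanity check on these shapes, here is a small sketch (the sizes n, d, k and the use of tanh as σ are assumptions; with U stored as an (n, d) array, the input enters the product as x @ U):

```python
import numpy as np

n, d, k = 10, 5, 4                       # example sizes (assumed): input, state, classes
rng = np.random.default_rng(0)

U = rng.standard_normal((n, d))          # U in R^{n x d}
V = rng.standard_normal((d, k))          # V in R^{d x k}
W = rng.standard_normal((d, d))          # W in R^{d x d}
b, c = np.zeros(d), np.zeros(k)

x = rng.standard_normal(n)               # x_i in R^n
s_prev = np.zeros(d)                     # s_{i-1} in R^d

s = np.tanh(x @ U + s_prev @ W + b)      # state: shape (d,)
y = s @ V + c                            # pre-softmax scores: shape (k,)
print(s.shape, y.shape)                  # (5,) (4,)
```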


22/41

[Figure: the unrolled recurrent network with inputs x_1 . . . x_4, outputs y_1 . . . y_4, and shared parameters U, V, W]

How do we train this network ? (Ans: using backpropagation)

Let us understand this with a concrete example


23/41

[Figure: the unrolled network for auto-completing "deep" — inputs d, e, e, p; outputs e, e, p, 〈stop〉; shared parameters U, V, W]

Suppose we consider our task of auto-completion (predicting the next character)

For simplicity we assume that there are only 4 characters in our vocabulary (d, e, p, <stop>)

At each time step we want to predict one of these 4 characters

What is a suitable output function for this task ? (softmax)

What is a suitable loss function for this task ? (cross entropy)
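For reference, a minimal sketch of the softmax output and the cross-entropy loss at one time step (not the lecture's code; the example numbers are taken from the next slide):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y_pred, y_true):
    """y_pred: predicted distribution over (d, e, p, <stop>); y_true: one-hot true vector."""
    return -np.sum(y_true * np.log(y_pred))

y_pred = np.array([0.2, 0.7, 0.1, 0.1])   # predicted P(d), P(e), P(p), P(<stop>)
y_true = np.array([0.0, 1.0, 0.0, 0.0])   # true next character is 'e'
print(cross_entropy(y_pred, y_true))       # -log(0.7) ≈ 0.357
```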


[Figure: the same unrolled RNN, now showing at each time-step the predicted distribution over (d, e, p, <stop>), e.g. (0.2, 0.7, 0.1, 0.1) at the first two steps and (0.2, 0.1, 0.7, 0.1) at the last two, against the true one-hot distribution, with per-step losses L_1(θ), L_2(θ), L_3(θ), L_4(θ).]

Suppose we initialize U, V, W randomly and the network predicts the probabilities as shown.

And the true probabilities are as shown.

We need to answer two questions:

What is the total loss made by the model?

How do we backpropagate this loss and update the parameters (θ = {U, V, W, b, c}) of the network?


The total loss is simply the sum of the losses over all time-steps:

L(θ) = ∑_{t=1}^{T} L_t(θ)

L_t(θ) = −log(y_{tc})

where y_{tc} is the predicted probability of the true character at time-step t, and T is the number of time-steps.

For backpropagation we need to compute the gradients w.r.t. W, U, V, b, c.

Let us see how to do that.
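Continuing the sketch above (again with illustrative names), the total loss just accumulates the per-step cross-entropy over the sequence; the predicted distributions below are the ones shown on the slide and the true characters are e, e, p, <stop>.

```python
import numpy as np

def sequence_loss(predicted_dists, true_indices):
    """Total loss L(theta) = sum over t of -log y_{tc}."""
    total = 0.0
    for y_hat, c in zip(predicted_dists, true_indices):
        total += -np.log(y_hat[c])
    return total

# predicted distributions over (d, e, p, <stop>), one per time-step
preds = [np.array([0.2, 0.7, 0.1, 0.1]),
         np.array([0.2, 0.7, 0.1, 0.1]),
         np.array([0.2, 0.1, 0.7, 0.1]),
         np.array([0.2, 0.1, 0.7, 0.1])]
true_idx = [1, 1, 2, 3]   # e, e, p, <stop>
print(sequence_loss(preds, true_idx))
```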


Let us consider ∂L(θ)/∂V (V is a matrix, so ideally we should write ∇_V L(θ)):

∂L(θ)/∂V = ∑_{t=1}^{T} ∂L_t(θ)/∂V

Each term in the summation is simply the derivative of the loss w.r.t. the weights in the output layer.

We have already seen how to do this when we studied backpropagation.
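As a reminder of what one such term looks like, here is a small numpy sketch using the standard softmax-with-cross-entropy result ∂L_t/∂V = (ŷ_t − y_t) s_tᵀ, where y_t is the one-hot true label; the sizes and variable names are illustrative, and this particular formula is the usual backpropagation result rather than something restated in this lecture.

```python
import numpy as np

d, k = 5, 4                      # hidden size and vocabulary size (illustrative)
rng = np.random.default_rng(0)
V = rng.normal(size=(k, d))      # output-layer weights
s_t = rng.normal(size=(d,))      # hidden state at time-step t
y_true = np.eye(k)[1]            # one-hot vector for the true character

o_t = V @ s_t                                            # output pre-activation
y_hat = np.exp(o_t - o_t.max()); y_hat /= y_hat.sum()    # softmax

# gradient of L_t w.r.t. V for softmax + cross-entropy: (y_hat - y_true) s_t^T
dLt_dV = np.outer(y_hat - y_true, s_t)
print(dLt_dV.shape)              # (k, d), the same shape as V
```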


Let us consider the derivative ∂L(θ)/∂W:

∂L(θ)/∂W = ∑_{t=1}^{T} ∂L_t(θ)/∂W

By the chain rule of derivatives we know that ∂L_t(θ)/∂W is obtained by summing gradients along all the paths from L_t(θ) to W.

What are the paths connecting L_t(θ) to W?

Let us see this by considering L_4(θ).


[Figure: the unrolled RNN (inputs x_1, …, x_4, states s_1, …, s_4, losses L_1(θ), …, L_4(θ), shared weights U, W, V), with the dependency chain s_0 → s_1 → s_2 → s_3 → s_4 → L_4(θ) highlighted and W feeding into every state.]

L_4(θ) depends on s_4.

s_4 in turn depends on s_3 and W.

s_3 in turn depends on s_2 and W.

s_2 in turn depends on s_1 and W.

s_1 in turn depends on s_0 and W, where s_0 is a constant starting state.


What we have here is an ordered network.

In an ordered network each state variable is computed one at a time in a specified order (first s_1, then s_2 and so on).

Now we have

∂L_4(θ)/∂W = ∂L_4(θ)/∂s_4 · ∂s_4/∂W

We have already seen how to compute ∂L_4(θ)/∂s_4 when we studied backprop.

But how do we compute ∂s_4/∂W?


Recall that

s_4 = σ(W s_3 + b)

In such an ordered network, we can't compute ∂s_4/∂W by simply treating s_3 as a constant (because it also depends on W).

In such networks the total derivative ∂s_4/∂W has two parts:

Explicit: ∂⁺s_4/∂W, treating all other inputs as constant.

Implicit: summing over all indirect paths from s_4 to W.

Let us see how to do this.


∂s_4/∂W = ∂⁺s_4/∂W (explicit) + ∂s_4/∂s_3 · ∂s_3/∂W (implicit)

        = ∂⁺s_4/∂W + ∂s_4/∂s_3 · [ ∂⁺s_3/∂W (explicit) + ∂s_3/∂s_2 · ∂s_2/∂W (implicit) ]

        = ∂⁺s_4/∂W + ∂s_4/∂s_3 · ∂⁺s_3/∂W + ∂s_4/∂s_3 · ∂s_3/∂s_2 · [ ∂⁺s_2/∂W + ∂s_2/∂s_1 · ∂s_1/∂W ]

        = ∂⁺s_4/∂W + ∂s_4/∂s_3 · ∂⁺s_3/∂W + ∂s_4/∂s_3 · ∂s_3/∂s_2 · ∂⁺s_2/∂W + ∂s_4/∂s_3 · ∂s_3/∂s_2 · ∂s_2/∂s_1 · [ ∂⁺s_1/∂W ]

For simplicity we will short-circuit some of the paths:

∂s_4/∂W = ∂s_4/∂s_4 · ∂⁺s_4/∂W + ∂s_4/∂s_3 · ∂⁺s_3/∂W + ∂s_4/∂s_2 · ∂⁺s_2/∂W + ∂s_4/∂s_1 · ∂⁺s_1/∂W

        = ∑_{k=1}^{4} ∂s_4/∂s_k · ∂⁺s_k/∂W


Finally we have

∂L_4(θ)/∂W = ∂L_4(θ)/∂s_4 · ∂s_4/∂W

∂s_4/∂W = ∑_{k=1}^{4} ∂s_4/∂s_k · ∂⁺s_k/∂W

∴ ∂L_t(θ)/∂W = ∂L_t(θ)/∂s_t · ∑_{k=1}^{t} ∂s_t/∂s_k · ∂⁺s_k/∂W

This algorithm is called backpropagation through time (BPTT) as we backpropagate over all previous time-steps.
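Here is a minimal numpy sketch of BPTT for this simple RNN, assuming the state update s_k = σ(U x_k + W s_{k−1} + b) with a logistic σ and a softmax output with cross-entropy loss at the last time-step (mirroring the L_4(θ) example); all function and variable names are illustrative. The backward loop accumulates ∑_k (∂L_t/∂s_k ⊙ σ′(a_k)) s_{k−1}ᵀ, i.e. the explicit parts ∂⁺s_k/∂W contracted with ∂L_t/∂s_t · ∂s_t/∂s_k.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bptt_dW_last_step(xs, true_idx, U, W, V, b, c):
    """Gradient of the last-step loss L_t w.r.t. W via
       dL_t/dW = sum_k (dL_t/ds_k) * (explicit) d+s_k/dW."""
    d = W.shape[0]
    s, a = [np.zeros(d)], [None]          # s_0: constant starting state
    for x in xs:                          # forward pass, remembering a_k and s_k
        a.append(U @ x + W @ s[-1] + b)
        s.append(sigmoid(a[-1]))
    t = len(xs)

    o = V @ s[t] + c                      # softmax + cross-entropy at step t
    y_hat = np.exp(o - o.max()); y_hat /= y_hat.sum()
    delta = V.T @ (y_hat - np.eye(len(y_hat))[true_idx])   # dL_t/ds_t

    dW = np.zeros_like(W)
    for k in range(t, 0, -1):             # walk back over all previous time-steps
        g = delta * sigmoid(a[k]) * (1 - sigmoid(a[k]))     # dL_t/da_k
        dW += np.outer(g, s[k - 1])       # contribution of the explicit part d+s_k/dW
        delta = W.T @ g                   # dL_t/ds_{k-1} for the next step back
    return dW

# illustrative shapes: hidden size 5, one-hot inputs of size 4, vocabulary size 4
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(4, 5))
b, c = np.zeros(5), np.zeros(4)
xs = [np.eye(4)[i] for i in (0, 1, 1, 2)]                   # characters d, e, e, p
print(bptt_dW_last_step(xs, true_idx=3, U=U, W=W, V=V, b=b, c=c).shape)   # (5, 5)
```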


Module 14.4: The problem of Exploding and Vanishing Gradients

We will now focus on ∂s_t/∂s_k and highlight an important problem in training RNNs using BPTT:

∂s_t/∂s_k = ∂s_t/∂s_{t−1} · ∂s_{t−1}/∂s_{t−2} · … · ∂s_{k+1}/∂s_k = ∏_{j=k}^{t−1} ∂s_{j+1}/∂s_j

Let us look at one such term in the product (i.e., ∂s_{j+1}/∂s_j).


a_j = [a_{j1}, a_{j2}, …, a_{jd}]

s_j = [σ(a_{j1}), σ(a_{j2}), …, σ(a_{jd})]

Since s_{jp} = σ(a_{jp}) depends only on a_{jp}, the Jacobian ∂s_j/∂a_j is a d × d diagonal matrix with σ′(a_{jp}) as its p-th diagonal entry and zeros everywhere else:

∂s_j/∂a_j = diag(σ′(a_j))

We are interested in ∂s_j/∂s_{j−1}. With

a_j = W s_{j−1} + b

s_j = σ(a_j)

we get

∂s_j/∂s_{j−1} = ∂s_j/∂a_j · ∂a_j/∂s_{j−1} = diag(σ′(a_j)) W

We are interested in the magnitude of ∂s_j/∂s_{j−1}: if it is small (large), then ∂s_t/∂s_k, and hence ∂L_t/∂W, will vanish (explode).
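A small numpy sketch of this Jacobian, with σ taken to be the logistic sigmoid and illustrative sizes; the finite-difference check at the end confirms that diag(σ′(a_j)) W really is ∂s_j/∂s_{j−1}.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(0)
W, b = rng.normal(size=(d, d)), np.zeros(d)
s_prev = rng.uniform(size=d)                    # s_{j-1}

a_j = W @ s_prev + b
# Jacobian ds_j/ds_{j-1} = diag(sigma'(a_j)) W
jac = np.diag(sigmoid(a_j) * (1 - sigmoid(a_j))) @ W

# finite-difference check of the first column of the Jacobian
eps = 1e-6
col0 = (sigmoid(W @ (s_prev + eps * np.eye(d)[0]) + b) - sigmoid(a_j)) / eps
print(np.allclose(jac[:, 0], col0, atol=1e-5))  # True
```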


‖∂s_j/∂s_{j−1}‖ = ‖diag(σ′(a_j)) W‖ ≤ ‖diag(σ′(a_j))‖ · ‖W‖

∵ σ is a bounded function (sigmoid, tanh), σ′(a_j) is bounded:

σ′(a_j) ≤ 1/4 = γ   [if σ is logistic]

σ′(a_j) ≤ 1 = γ     [if σ is tanh]

Writing λ for a bound on ‖W‖, we get

‖∂s_j/∂s_{j−1}‖ ≤ γ‖W‖ ≤ γλ

‖∂s_t/∂s_k‖ = ‖∏_{j=k+1}^{t} ∂s_j/∂s_{j−1}‖ ≤ ∏_{j=k+1}^{t} γλ ≤ (γλ)^{t−k}

If γλ < 1 the gradient will vanish.

If γλ > 1 the gradient could explode.

This is known as the problem of vanishing/exploding gradients.
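A quick numerical illustration of the bound, with illustrative sizes and scales: once γλ is below 1 the bound (γλ)^{t−k} collapses to zero exponentially fast in t − k, and once it is above 1 it blows up just as fast.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)

gamma = 0.25                           # bound on sigma'(a) for the logistic sigmoid
for scale in (0.5, 10.0):              # small vs large recurrent weights
    lam = np.linalg.norm(scale * W, 2) # spectral norm as the bound on ||W||
    gl = gamma * lam
    print(f"gamma*lambda = {gl:.3f} -> bound for t-k = 20: {gl**20:.3e}")
```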


[Figure: the RNN unrolled over a long sequence x_1, …, x_n with outputs y_1, …, y_n, a loss L_t, and the same weights U, V, W repeated at every time-step.]

One simple way of avoiding this is to use truncated backpropagation, where we restrict the product to τ (< t − k) terms.
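A sketch of the truncation, written in the same style as the BPTT function above (τ and all names are illustrative): the backward loop simply stops after τ steps instead of running all the way back to k = 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def truncated_bptt_dW_last_step(xs, true_idx, U, W, V, b, c, tau=5):
    """Same as full BPTT for the last-step loss, but the sum over k
       is cut off after tau terms (truncated BPTT)."""
    d = W.shape[0]
    s, a = [np.zeros(d)], [None]
    for x in xs:
        a.append(U @ x + W @ s[-1] + b)
        s.append(sigmoid(a[-1]))
    t = len(xs)

    o = V @ s[t] + c
    y_hat = np.exp(o - o.max()); y_hat /= y_hat.sum()
    delta = V.T @ (y_hat - np.eye(len(y_hat))[true_idx])

    dW = np.zeros_like(W)
    for k in range(t, max(t - tau, 0), -1):     # only the last tau time-steps
        g = delta * sigmoid(a[k]) * (1 - sigmoid(a[k]))
        dW += np.outer(g, s[k - 1])
        delta = W.T @ g
    return dW
```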


Module 14.5: Some Gory Details


∂L_t(θ)/∂W (∈ ℝ^{d×d}) = ∂L_t(θ)/∂s_t (∈ ℝ^{1×d}) · ∑_{k=1}^{t} ∂s_t/∂s_k (∈ ℝ^{d×d}) · ∂⁺s_k/∂W (∈ ℝ^{d×d×d})

We know how to compute ∂L_t(θ)/∂s_t (the derivative of the scalar L_t(θ) w.r.t. the last hidden state, a vector) using backpropagation.

We just saw a formula for ∂s_t/∂s_k, which is the derivative of a vector w.r.t. a vector.

∂⁺s_k/∂W is a tensor ∈ ℝ^{d×d×d}: the derivative of a vector ∈ ℝ^d w.r.t. a matrix ∈ ℝ^{d×d}.

How do we compute ∂⁺s_k/∂W? Let us see.
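To see how the stated shapes fit together, here is a numpy/einsum sketch with dummy arrays of those sizes (purely illustrative values; it only checks that the contraction produces a d × d result, the same shape as W).

```python
import numpy as np

d, t = 4, 6
rng = np.random.default_rng(0)

dL_dst = rng.normal(size=(1, d))                              # dL_t/ds_t   in R^{1 x d}
dst_dsk = [rng.normal(size=(d, d)) for _ in range(t)]         # ds_t/ds_k   in R^{d x d}
dplus_sk_dW = [rng.normal(size=(d, d, d)) for _ in range(t)]  # d+s_k/dW    in R^{d x d x d}

dL_dW = np.zeros((d, d))
for k in range(t):
    # contract (1,d) x (d,d) x (d,d,d) -> (d,d)
    dL_dW += np.einsum('ab,bc,cde->de', dL_dst, dst_dsk[k], dplus_sk_dW[k])
print(dL_dW.shape)   # (4, 4): the same shape as W
```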


We just look at one element of this ∂⁺s_k/∂W tensor.

∂⁺s_{kp}/∂W_{qr} is the (p, q, r)-th element of the 3d tensor.

a_k = W s_{k−1} + b

s_k = σ(a_k)


a_k = W s_{k−1}. Writing this out row by row, the p-th component is

a_{kp} = ∑_{i=1}^{d} W_{pi} s_{k−1,i}

s_{kp} = σ(a_{kp})

∂s_{kp}/∂W_{qr} = ∂s_{kp}/∂a_{kp} · ∂a_{kp}/∂W_{qr} = σ′(a_{kp}) · ∂a_{kp}/∂W_{qr}

∂a_{kp}/∂W_{qr} = ∂(∑_{i=1}^{d} W_{pi} s_{k−1,i})/∂W_{qr} = s_{k−1,r} if p = q, and 0 otherwise

∴ ∂s_{kp}/∂W_{qr} = σ′(a_{kp}) s_{k−1,r} if p = q, and 0 otherwise
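A small numerical check of this formula, with illustrative sizes and a logistic σ: the explicit derivative ∂⁺s_{kp}/∂W_{qr} equals σ′(a_{kp}) s_{k−1,r} when p = q and is 0 otherwise, and a finite difference on W (holding s_{k−1} fixed) agrees.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
s_prev = rng.uniform(size=d)              # s_{k-1}, held fixed (explicit part only)
a_k = W @ s_prev
sig_prime = sigmoid(a_k) * (1 - sigmoid(a_k))

# build the d x d x d tensor T[p, q, r] = d+ s_{kp} / d W_{qr}
T = np.zeros((d, d, d))
for p in range(d):
    T[p, p, :] = sig_prime[p] * s_prev    # non-zero only when q = p

# finite-difference check of one entry, perturbing W_{qr} with s_{k-1} fixed
p, q, r, eps = 1, 1, 2, 1e-6
W_eps = W.copy(); W_eps[q, r] += eps
fd = (sigmoid(W_eps @ s_prev)[p] - sigmoid(a_k)[p]) / eps
print(np.isclose(T[p, q, r], fd, atol=1e-6))   # True
```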
