Recurrent Neural Networks
Deep Learning – Lecture 5
Efstratios Gavves
Sequential Data
So far, all tasks assumed stationary data
Neither all data nor all tasks are stationary, though.
Sequential Data: Text

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?
Memory needed

To handle such sequences, we model the probability of each element given all previous ones:

$$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \ldots, x_{i-1})$$
Sequential data: Video
Quiz: Other sequential data?
- Time series data: stock exchange, biological measurements, climate measurements, market analysis
- Speech/Music
- User behavior in websites
- ...
Applications
Machine Translation

The phrase in the source language is one sequence
- "Today the weather is good"
The phrase in the target language is also a sequence
- "Погода сегодня хорошая"

[Figure: an encoder-decoder of unrolled LSTM cells. The encoder consumes "Today the weather is good <EOS>"; the decoder then generates "Погода сегодня хорошая <EOS>", feeding each output word back as the next decoder input.]
Image captioning

An image is a thousand words, literally!
Pretty much the same as machine translation
- Replace the encoder part with the output of a ConvNet, e.g. AlexNet or a VGG16 network
- Keep the decoder part to operate as a translator

[Figure: a ConvNet encodes the image and feeds the unrolled LSTM decoder, which generates the caption "Today the weather is good <EOS>".]
Question answering

Bleeding-edge research, no real consensus
- Very interesting, open research problems
Again, pretty much like machine translation
Again, the encoder-decoder paradigm
- Insert the question to the encoder part
- Model the answer at the decoder part
Question answering with images also
- Again, bleeding-edge research
- How/where to add the image?
- What has worked so far is to add the image only in the beginning

Q: What are the people playing?
A: They play beach football

Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?
A: John entered the living room.
Recurrent Networks

[Figure: a neural network cell with a state; its output is fed back through recurrent connections, so the cell sees its own previous state together with the new input.]
Sequences
Next data depend on previous data
Roughly equivalent to predicting what comes next:

$$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \ldots, x_{i-1})$$

"What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?"
Why sequences?
- Parameters are reused → fewer parameters, easier modelling
- Generalizes well to arbitrary lengths → not constrained to a specific length

However, often we pick a "frame" T instead of an arbitrary length:

$$\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \ldots, x_{i-1})$$

sequence_modelling("I think, therefore, I am!") ≡ sequence_modelling("Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.")
Quiz: What is really a sequence?

Data inside a sequence are non-i.i.d. (not independently, identically distributed)
- The next "word" depends on the previous "words"
- Ideally on all of them
We need context, and we need memory!
How to model context and memory?

"I am Bond, James ___" → {Bond, McGuire, tired, am, !} → Bond
x_i ≡ one-hot vectors

A vector with all zeros except for the active dimension
- 12 words in a sequence → 12 one-hot vectors
After the one-hot vectors, apply an embedding
- Word2Vec, GloVe

[Figure: the vocabulary {I, am, Bond, ",", James, McGuire, tired, !}; each word of the input sentence becomes a column vector with a single 1 at that word's vocabulary index and 0 everywhere else.]
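To make the encoding concrete, here is a minimal sketch in NumPy; the vocabulary, `word_to_idx`, and the `one_hot` helper are illustrative assumptions, not code from the lecture:

```python
import numpy as np

# Illustrative vocabulary (assumption): the words from the slide's example.
vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's vocabulary index."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

sentence = "I am Bond , James Bond".split()
X = np.stack([one_hot(w) for w in sentence])
print(X.shape)  # (6, 8): 6 words in the sequence -> 6 one-hot vectors
```

An embedding layer (Word2Vec, GloVe) would then map each of these sparse columns to a dense vector.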
Quiz: Why not just indices?

One-hot representation:
x_{t=1,2,3,4} = the four one-hot columns for "I am James McGuire"

OR an index representation:
x_"I" = 1, x_"am" = 2, x_"James" = 4, x_"McGuire" = 7

No, because then some words get closer together for no good reason (an artificial bias between words):
- Indices: distance(idx("I"), idx("am")) = 1, but distance(idx("am"), idx("McGuire")) = 5
- One-hot: distance(x_"I", x_"am") = distance(x_"am", x_"McGuire") = 1
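A quick numerical check of this argument, as a sketch (the index values are the slide's; Euclidean distance is our choice of metric, so one-hot pairs sit at √2 rather than 1, but the point stands: they are all equal):

```python
import numpy as np

idx = {"I": 1, "am": 2, "James": 4, "McGuire": 7}
# Index representation: distances depend on arbitrary index assignments.
print(abs(idx["I"] - idx["am"]))        # 1
print(abs(idx["am"] - idx["McGuire"]))  # 5 -- an artificial bias

def one_hot(i, n=8):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# One-hot representation: every pair of distinct words is equidistant.
d1 = np.linalg.norm(one_hot(idx["I"]) - one_hot(idx["am"]))
d2 = np.linalg.norm(one_hot(idx["am"]) - one_hot(idx["McGuire"]))
print(d1, d2)  # both sqrt(2): no word is artificially closer to another
```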
Memory

A representation of the past
A memory must project information at timestep t onto a latent space c_t using parameters θ
Then, re-use the projected information from t at t + 1:

$$c_{t+1} = h(x_{t+1}, c_t; \theta)$$

Memory parameters θ are shared for all timesteps t = 0, …:

$$c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, \ldots h(x_1, c_0; \theta); \theta); \theta); \theta)$$
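In code, this recursion is just a fold over the inputs with a single shared parameter set; a minimal sketch, assuming the tanh recurrence introduced later in the lecture as the concrete h:

```python
import numpy as np

rng = np.random.default_rng(0)
# Shared memory parameters theta, reused at every timestep.
theta = {"U": rng.normal(size=(3, 5)), "W": rng.normal(size=(3, 3))}

def h(x_t, c_prev, theta):
    # One concrete choice of h (the RNN recurrence shown later).
    return np.tanh(theta["U"] @ x_t + theta["W"] @ c_prev)

c = np.zeros(3)                      # c_0
for x_t in rng.normal(size=(7, 5)):  # a sequence of 7 inputs
    c = h(x_t, c, theta)             # c_{t+1} = h(x_{t+1}, c_t; theta)
```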
Memory as a Graph

Simplest model
- Input with parameters U
- Memory embedding with parameters W
- Output with parameters V

[Figure: at every timestep, the input x_t enters through U, the memory c_t feeds the next memory c_{t+1} through W, and the output y_t leaves through V; unrolled over t, t+1, t+2, …, each step reuses exactly the same U, W, V.]
Folding the memory

[Figure: the unrolled/unfolded network (…, x_t → c_t → y_t, x_{t+1} → c_{t+1} → y_{t+1}, …, with W carrying each c_t to c_{t+1}) folds into a single recurrent cell: x_t → (U) → c_t(c_{t−1}) → (V) → y_t, with W on the loop from the memory back to itself.]
Recurrent Neural Network (RNN)
Only two equations:

$$c_t = \tanh(U x_t + W c_{t-1})$$
$$y_t = \mathrm{softmax}(V c_t)$$
RNN Example

Vocabulary of 5 words
A memory of 3 units
- Hyperparameter that we choose, like layer size
- c_t: [3 × 1], W: [3 × 3]
An input projection of 3 dimensions
- U: [3 × 5]
An output projection of 10 dimensions
- V: [10 × 3]

$$c_t = \tanh(U x_t + W c_{t-1})$$
$$y_t = \mathrm{softmax}(V c_t)$$

U · x_{t=4} = the 4th column of U (a one-hot input simply selects a column)
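A minimal sketch of this example in NumPy; the parameter values are random, and only the shapes follow the slide:

```python
import numpy as np

rng = np.random.default_rng(42)
U = rng.normal(size=(3, 5))    # input projection: 5-word vocabulary -> 3-dim memory
W = rng.normal(size=(3, 3))    # memory-to-memory weights
V = rng.normal(size=(10, 3))   # memory -> 10-dim output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c = np.zeros(3)                # c_0
x = np.zeros(5); x[3] = 1.0    # one-hot input x_{t=4} (0-indexed position 3)
c = np.tanh(U @ x + W @ c)     # c_t = tanh(U x_t + W c_{t-1})
y = softmax(V @ c)             # y_t = softmax(V c_t)
print(np.allclose(U @ x, U[:, 3]))  # True: the one-hot input selects a column of U
```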
Rolled Network vs. MLP?

What is really different?
- Steps instead of layers
- Step parameters W are shared in a Recurrent Network
- In a Multi-Layer Network the parameters W_1, W_2, W_3 are different
- Sometimes the intermediate outputs are not even needed
- Removing them, we almost end up with a standard Neural Network

[Figure: a 3-gram unrolled recurrent network ("Layer/Step" 1-3 with the same U, W, V at every step, final output y) next to a 3-layer neural network (distinct W_1, W_2, W_3 per layer, final output y).]
Training Recurrent Networks

Cross-entropy loss:

$$P = \prod_{t,k} y_{tk}^{l_{tk}} \;\Rightarrow\; \mathcal{L} = -\log P = \sum_t \ell_t = -\frac{1}{T}\sum_t l_t \log y_t$$

Backpropagation Through Time (BPTT)
- Again, chain rule
- Only difference: gradients survive over time steps
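To make the sum over timesteps explicit, here is a sketch of the forward pass that BPTT would differentiate; the toy inputs, targets, and shapes are illustrative assumptions (the output is vocabulary-sized so the cross-entropy makes sense for next-word prediction):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))   # input -> memory
W = rng.normal(size=(3, 3))   # memory -> memory
V = rng.normal(size=(5, 3))   # memory -> vocabulary-sized output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = np.eye(5)[[0, 1, 2, 3]]  # one-hot inputs x_1..x_T
ls = np.eye(5)[[1, 2, 3, 4]]  # one-hot targets l_1..l_T (the next word)

c, loss = np.zeros(3), 0.0
for x_t, l_t in zip(xs, ls):
    c = np.tanh(U @ x_t + W @ c)
    y_t = softmax(V @ c)
    loss += -np.sum(l_t * np.log(y_t)) / len(xs)  # -(1/T) sum_t l_t log y_t
print(loss)
```

Backpropagating this loss walks the same loop in reverse, which is why the gradients "survive" across time steps.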
Training RNNs is hard

Vanishing gradients
- After a few time steps the gradients become almost 0
Exploding gradients
- After a few time steps the gradients become huge
⇒ Can't capture long-term dependencies
Alternative formulation for RNNs

An alternative formulation to derive conclusions and intuitions:

$$c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$$
$$\mathcal{L} = \sum_t \ell_t(c_t)$$
Another look at the gradients

$$\mathcal{L} = L(c_T(c_{T-1}(\ldots c_1(x_1, c_0; \theta); \theta); \theta); \theta)$$

$$\frac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \frac{\partial \ell_t}{\partial c_t}\,\frac{\partial c_t}{\partial c_\tau}\,\frac{\partial c_\tau}{\partial \theta}$$

$$\frac{\partial \ell_t}{\partial c_\tau} = \frac{\partial \ell_t}{\partial c_t} \cdot \frac{\partial c_t}{\partial c_{t-1}} \cdot \frac{\partial c_{t-1}}{\partial c_{t-2}} \cdots \frac{\partial c_{\tau+1}}{\partial c_\tau} \;\leq\; \eta^{\,t-\tau}\,\frac{\partial \ell_t}{\partial c_t}$$

The RNN gradient is a recursive product of ∂c_t/∂c_{t−1}
- τ close to t → short-term factors
- t ≫ τ → long-term factors
Exploding/Vanishing gradients

$$\frac{\partial \mathcal{L}}{\partial c_t} = \frac{\partial \mathcal{L}}{\partial c_T} \cdot \frac{\partial c_T}{\partial c_{T-1}} \cdot \frac{\partial c_{T-1}}{\partial c_{T-2}} \cdots \frac{\partial c_{t+1}}{\partial c_t}$$

If every factor is < 1, then ∂ℒ/∂c_t ≪ 1 ⇒ vanishing gradient
If every factor is > 1, then ∂ℒ/∂c_t ≫ 1 ⇒ exploding gradient
Vanishing gradients

The gradient of the error w.r.t. an intermediate cell state:

$$\frac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \frac{\partial \ell_t}{\partial y_t}\,\frac{\partial y_t}{\partial c_t}\,\frac{\partial c_t}{\partial c_\tau}\,\frac{\partial c_\tau}{\partial \theta}$$

$$\frac{\partial c_t}{\partial c_\tau} = \prod_{t \geq k \geq \tau} \frac{\partial c_k}{\partial c_{k-1}} = \prod_{t \geq k \geq \tau} W \cdot \partial\tanh(c_{k-1})$$

Long-term dependencies get exponentially smaller weights
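The decay is easy to see numerically. A sketch (the matrix W and its scaling are arbitrary choices to make every factor contractive) multiplying the Jacobians W · ∂tanh(c_{k−1}) over 50 steps:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))
W *= 0.9 / np.linalg.norm(W, 2)   # scale so the largest singular value is 0.9 < 1

c = rng.normal(size=3)
J = np.eye(3)                     # running product of Jacobians dc_k/dc_{k-1}
for k in range(50):
    D = np.diag(1.0 - np.tanh(c) ** 2)  # derivative of tanh at c_{k-1}
    J = (W @ D) @ J                     # dc_k/dc_{k-1} = W . dtanh(c_{k-1})
    c = W @ np.tanh(c)                  # advance the state (no input, b = 0)
    if (k + 1) % 10 == 0:
        print(k + 1, np.linalg.norm(J, 2))  # norm shrinks exponentially
```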
Gradient clipping for exploding gradients

Scale the gradients to a threshold θ₀:

1. ĝ ← ∂ℒ/∂θ
2. if ‖ĝ‖ ≥ θ₀:
       ĝ ← (θ₀/‖ĝ‖) ĝ
   else:
       do nothing

Simple, but works!
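The same pseudocode as a runnable function, a minimal sketch in NumPy (the threshold values below are arbitrary):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to have norm at most `threshold` (Pascanu et al., 2013)."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])       # norm 5
print(clip_gradient(g, 1.0))   # [0.6 0.8] -- rescaled to norm 1
print(clip_gradient(g, 10.0))  # unchanged: norm already below the threshold
```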
Rescaling vanishing gradients?

Not a good solution
Weights are shared between timesteps → the loss is summed over timesteps:

$$\mathcal{L} = \sum_t \ell_t \;\Rightarrow\; \frac{\partial \mathcal{L}}{\partial \theta} = \sum_t \frac{\partial \ell_t}{\partial \theta}$$

$$\frac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \frac{\partial \ell_t}{\partial c_\tau}\,\frac{\partial c_\tau}{\partial \theta} = \sum_{\tau=1}^{t} \frac{\partial \ell_t}{\partial c_t}\,\frac{\partial c_t}{\partial c_\tau}\,\frac{\partial c_\tau}{\partial \theta}$$

Rescaling ∂ℓ_t/∂θ for one timestep affects all timesteps
- The rescaling factor for one timestep does not work for another
More intuitively

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \ell_1}{\partial \theta} + \frac{\partial \ell_2}{\partial \theta} + \frac{\partial \ell_3}{\partial \theta} + \frac{\partial \ell_4}{\partial \theta} + \frac{\partial \ell_5}{\partial \theta}$$

[Figure: five per-timestep gradients ∂ℓ_t/∂θ flowing into the shared parameters θ of an unrolled network.]
More intuitively

Let's say ∂ℓ₁/∂θ ≈ 1, ∂ℓ₂/∂θ ≈ 1/10, ∂ℓ₃/∂θ ≈ 1/100, ∂ℓ₄/∂θ ≈ 1/1000, ∂ℓ₅/∂θ ≈ 1/10000

$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_t \frac{\partial \ell_t}{\partial \theta} = 1.1111$$

If ∂ℒ/∂θ is rescaled to 1, then ∂ℓ₅/∂θ ≈ 10⁻⁵
Longer-term dependencies stay negligible
- Weak recurrent modelling
- Learning focuses on the short-term only
Recurrent networks – chaotic systems
Fixing vanishing gradients

Regularization on the recurrent weights
- Force the error signal not to vanish:

$$\Omega = \sum_t \Omega_t = \sum_t \left( \frac{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \frac{\partial c_{t+1}}{\partial c_t} \right\|}{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \right\|} - 1 \right)^2$$

Advanced recurrent modules
- Long Short-Term Memory (LSTM) module
- Gated Recurrent Unit (GRU) module

Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013
Advanced Recurrent Nets

[Figure: the LSTM cell. The input x_t and the previous output m_{t−1} feed three sigmoid gates (input i_t, forget f_t, output o_t) and a tanh candidate; the cell state c_{t−1} is carried to c_t through the forget gate plus the gated candidate, and the new output m_t is tanh(c_t) gated by o_t.]
How to fix the vanishing gradients?

The error signal over time must have neither too large nor too small a norm
Let's have a look at the loss function:

$$\frac{\partial \ell_t}{\partial \theta} = \sum_{\tau=1}^{t} \frac{\partial \ell_t}{\partial y_t}\,\frac{\partial y_t}{\partial c_t}\,\frac{\partial c_t}{\partial c_\tau}\,\frac{\partial c_\tau}{\partial \theta}, \qquad \frac{\partial c_t}{\partial c_\tau} = \prod_{t \geq k \geq \tau} \frac{\partial c_k}{\partial c_{k-1}}$$

Solution: have an activation function with gradient equal to 1
- The identity function has a gradient equal to 1
By doing so, gradients become neither too small nor too large
Long Short-Term Memory (LSTM): a beefed-up RNN

LSTM:
$$i = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$$
$$f = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$$
$$o = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$$
$$\tilde{c}_t = \tanh(x_t U^{(c)} + m_{t-1} W^{(c)})$$
$$c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$$
$$m_t = \tanh(c_t) \odot o$$

Simple RNN:
$$c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$$

- The previous state c_{t−1} is connected to the new c_t with no nonlinearity (identity function).
- The only other factor is the forget gate f, which rescales the previous LSTM state.

More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
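A minimal sketch of one LSTM step in NumPy, following the equations above; the shapes are illustrative, and the code uses column vectors, so the products read U x_t rather than x_t U:

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_mem = 5, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One pair (U, W) per gate/candidate: input i, forget f, output o, candidate c.
P = {g: (rng.normal(size=(n_mem, n_in)), rng.normal(size=(n_mem, n_mem)))
     for g in "ifoc"}

def lstm_step(x_t, m_prev, c_prev):
    i = sigmoid(P["i"][0] @ x_t + P["i"][1] @ m_prev)        # input gate
    f = sigmoid(P["f"][0] @ x_t + P["f"][1] @ m_prev)        # forget gate
    o = sigmoid(P["o"][0] @ x_t + P["o"][1] @ m_prev)        # output gate
    c_tilde = np.tanh(P["c"][0] @ x_t + P["c"][1] @ m_prev)  # candidate memory
    c_t = c_prev * f + c_tilde * i   # identity path from c_prev, only rescaled by f
    m_t = np.tanh(c_t) * o           # new output/memory
    return m_t, c_t

m, c = np.zeros(n_mem), np.zeros(n_mem)
for x_t in rng.normal(size=(4, n_in)):  # a 4-step sequence
    m, c = lstm_step(x_t, m, c)
```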
Cell state

The cell state carries the essential information over time:

$$c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$$

[Figure: the LSTM diagram with the cell state line c_{t−1} → c_t highlighted; it crosses the cell touched only by the forget gate and the gated candidate.]
LSTM nonlinearities

σ ∈ (0, 1): control gate – something like a switch
tanh ∈ (−1, 1): recurrent nonlinearity
(the LSTM equations are as above)
LSTM Step-by-Step: Step (1)

E.g. an LSTM on "Yesterday she slapped me. Today she loves me."
Decide what to forget and what to remember for the new memory
- Sigmoid ≈ 1 → remember everything
- Sigmoid ≈ 0 → forget everything

$$f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$$
LSTM Step-by-Step: Step (2)

Decide what new information is relevant from the new input and should be added to the new memory
- Modulate the input: i_t
- Generate candidate memories: c̃_t

$$i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$$
$$\tilde{c}_t = \tanh(x_t U^{(c)} + m_{t-1} W^{(c)})$$
LSTM Step-by-Step: Step (3)

Compute and update the current cell state c_t
- Depends on the previous cell state
- What we decide to forget
- What inputs we allow
- The candidate memories

$$c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$$
LSTM Step-by-Step: Step (4)

Modulate the output
- Is the new cell state relevant? → sigmoid ≈ 1
- If not → sigmoid ≈ 0
Generate the new memory

$$o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$$
$$m_t = \tanh(c_t) \odot o_t$$
LSTM Unrolled Network

Macroscopically very similar to standard RNNs
The engine is a bit different (more complicated)
- Because of their gates, LSTMs capture long- and short-term dependencies

[Figure: three LSTM cells chained in time, each passing its cell state and output to the next.]
Beyond RNN & LSTM

LSTM with peephole connections
- Gates also have access to the previous cell state c_{t−1} (not only the memories)
- Coupled forget and input gates: c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ c̃_t
Bi-directional recurrent networks
Gated Recurrent Units (GRU)
Deep recurrent architectures
Recursive neural networks
- Tree structured
Multiplicative interactions
Generative recurrent architectures

[Figure: a deep recurrent architecture, two stacked layers LSTM (1) → LSTM (2) unrolled over time.]
Take-away message
Recurrent Neural Networks (RNN) for sequences
Backpropagation Through Time
Vanishing and Exploding Gradients and Remedies
RNNs using Long Short-Term Memory (LSTM)
Applications of Recurrent Neural Networks
Thank you!