![Page 1: Slides credited from Hung-Yi Lee & Richard Sochermiulab/s107-adl/doc/190312... · 2019-03-09 · Pascanu et al., ^On the difficulty of training recurrent neural networks, in ICML,](https://reader033.vdocuments.net/reader033/viewer/2022041717/5e4c12535623cd195b4c5512/html5/thumbnails/1.jpg)
Slides credited from Hung-Yi Lee & Richard Socher
Outline
Language Modeling
◦ N-gram Language Model
◦ Feed-Forward Neural Language Model
◦ Recurrent Neural Network Language Model (RNNLM)
Recurrent Neural Network
◦ Definition
◦ Training via Backpropagation through Time (BPTT)
◦ Training Issue
Applications
◦ Sequential Input
◦ Sequential Output
  ◦ Aligned Sequential Pairs (Tagging)
  ◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Language Modeling
Goal: estimate the probability of a word sequence
Example task: determine whether a sequence is grammatical or makes more sense
◦ “recognize speech” or “wreck a nice beach”?
◦ Output = “recognize speech” if P(recognize speech) > P(wreck a nice beach)
N-Gram Language Modeling
Goal: estimate the probability of a word sequence
N-gram language model
◦ Probability is conditioned on a window of (n-1) previous words
◦ Estimate the probability based on the training data
$$P(\text{beach} \mid \text{nice}) = \frac{C(\text{nice beach})}{C(\text{nice})}$$
where C(nice beach) is the count of “nice beach” in the training data and C(nice) is the count of “nice” in the training data.
Issue: some sequences may not appear in the training data
N-Gram Language Modeling
Training data:
◦ The dog ran ……
◦ The cat jumped ……
With raw counts, P( jumped | dog ) = 0 and P( ran | cat ) = 0
→ smoothing: give each of them some small probability instead (e.g. 0.0001)
➢ The probability is not accurate.
➢ The phenomenon happens because we cannot collect all the possible text in the world as training data.
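The count-based estimate and the effect of smoothing can be sketched on the two toy sentences from the slide. This is a minimal illustration under stated assumptions: the slides do not name a smoothing method, so add-k smoothing is used here only as one simple choice, and the names `bigram_prob`, `context_count` and `k` are hypothetical.

```python
from collections import Counter

# Toy corpus from the slide; whitespace tokenization and lowercasing are assumptions.
corpus = [["the", "dog", "ran"], ["the", "cat", "jumped"]]

context_count = Counter()   # C(prev): how often prev is followed by some word
bigram = Counter()          # C(prev word)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        context_count[prev] += 1
        bigram[(prev, word)] += 1

vocab = sorted({w for sent in corpus for w in sent})

def bigram_prob(word, prev, k=0.0):
    """P(word | prev) with add-k smoothing; k=0 gives the raw count ratio."""
    denom = context_count[prev] + k * len(vocab)
    if denom == 0:
        return 0.0          # context never seen and no smoothing applied
    return (bigram[(prev, word)] + k) / denom

print(bigram_prob("ran", "dog"))              # seen bigram
print(bigram_prob("jumped", "dog"))           # unseen bigram, raw estimate
print(bigram_prob("jumped", "dog", k=0.01))   # unseen bigram, smoothed
```

With `k > 0` every unseen bigram receives a small non-zero probability, and the probabilities for a given context still sum to one over the vocabulary.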
Neural Language Modeling
Idea: estimate not from count, but from the NN prediction
◦ vector of “START” → Neural Network → P(next word is “wreck”)
◦ vector of “wreck” → Neural Network → P(next word is “a”)
◦ vector of “a” → Neural Network → P(next word is “nice”)
◦ vector of “nice” → Neural Network → P(next word is “beach”)
P(“wreck a nice beach”) = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice)
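The product of conditional probabilities can be sketched as follows. The numbers in `cond_prob` are made up for illustration; in a neural language model each value would be read from the network's output distribution. Summing logs instead of multiplying is a common choice to avoid underflow on long sentences, not something the slide prescribes.

```python
import math

# Hypothetical next-word probabilities, keyed by (previous word, next word).
cond_prob = {
    ("START", "wreck"): 0.01,
    ("wreck", "a"): 0.2,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.1,
}

def sentence_log_prob(words):
    """Sum log P(w_i | w_{i-1}) over the sentence, starting from START."""
    log_p = 0.0
    prev = "START"
    for w in words:
        log_p += math.log(cond_prob[(prev, w)])
        prev = w
    return log_p

log_p = sentence_log_prob(["wreck", "a", "nice", "beach"])
print(math.exp(log_p))   # product of the four conditional probabilities
```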
Neural Language Modeling
[Figure: feed-forward neural language model with input, hidden and output layers; the input is the context vector and the output is the probability distribution of the next word]
Bengio et al., “A Neural Probabilistic Language Model,” in JMLR, 2003.
Neural Language Modeling
The input-layer (or hidden-layer) representations of related words are close
◦ If P(jump|dog) is large, P(jump|cat) increases accordingly (even if there is no “… cat jump …” in the data)
[Figure: dog, cat and rabbit lie close together in the hidden space with axes h1 and h2]
Smoothing is automatically done
Issue: fixed context window for conditioning
Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step
Assumption: temporal information matters
RNN Language Modeling
Idea: pass the information from the previous hidden layer to leverage all contexts
◦ vector of “START” → P(next word is “wreck”)
◦ vector of “wreck” → P(next word is “a”)
◦ vector of “a” → P(next word is “nice”)
◦ vector of “nice” → P(next word is “beach”)
[Figure: RNN unrolled over the four steps, with input (context vector), hidden and output (word prob dist) layers; each hidden layer also receives the previous hidden layer]
RNNLM Formulation
At each time step,
◦ input: vector of the current word
◦ output: probability of the next word
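The equations on this slide did not survive in the transcript. A commonly used formulation, assumed here rather than taken from the slides, is s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t). The sketch below runs that step function on a toy word-id sequence; the names `U`, `W`, `V` and `rnnlm_step` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8   # toy sizes, chosen arbitrarily

# U: input-to-hidden, W: hidden-to-hidden, V: hidden-to-output (assumed names).
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def rnnlm_step(word_id, s_prev):
    """One time step: current word in, next-word distribution out."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                     # one-hot vector of the current word
    s = np.tanh(U @ x + W @ s_prev)      # new hidden state, depends on all history
    o = softmax(V @ s)                   # probability of the next word
    return s, o

s = np.zeros(hidden_size)                # initial state
for word_id in [0, 3, 1]:                # a toy word-id sequence
    s, o = rnnlm_step(word_id, s)
print(o)
```

The same `U`, `W` and `V` are reused at every step, which is what tying the weights across time steps means in practice.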
Recurrent Neural Network Definition
Activation function: tanh, ReLU
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Model Training
All model parameters can be updated by
[Figure: unrolled RNN in which the predicted output at each step is compared with the target yt-1, yt, yt+1]
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Backpropagation
[Figure: the weight w_ij^l connects neuron j in Layer l-1 to neuron i in Layer l. The Forward Pass supplies the value entering the weight (a_j^{l-1}, or x_j for the first layer); the Backward Pass supplies the error signal δ_i^l at neuron i.]
Backpropagation
[Figure: the Backward Pass drawn as a network run in reverse. Starting from ∇C(y) at Layer L, the error signal is passed back through Layer L-1 down to Layer l using the transposed weight matrices (W^L)^T, …, (W^{l+1})^T, giving δ^l.]
Backpropagation through Time (BPTT)
Unfold
◦ Input: init, x1, x2, …, xt
◦ Output: ot
◦ Target: yt
[Figure: the RNN unfolded over time into a deep feed-forward network, init → s1 → … → st-2 → st-1 → st → ot, with inputs x1, …, xt-2, xt-1, xt entering at each step. The cost C(y) compares ot with yt, and the error signal ∇C(y) is propagated backward through the unfolded steps.]
Weights are tied together: the copies of a weight at different time steps are pointers to the same memory
BPTT
Forward Pass: Compute s1, s2, s3, s4 ……
Backward Pass: for C(1), for C(2), for C(3), for C(4)
[Figure: RNN unfolded over four steps, init → s1 → s2 → s3 → s4, with inputs x1 to x4, outputs o1 to o4, targets y1 to y4 and per-step costs C(1) to C(4)]
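The forward and backward passes can be sketched in NumPy. This is a minimal illustration, assuming a tanh hidden layer, a softmax output and a cross-entropy cost summed over the steps, none of which is fixed by the transcript; `U`, `W`, `V`, `forward` and `bptt` are hypothetical names. Because the weights are tied, each step adds its contribution into the same gradient arrays.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(params, xs, ys):
    """Forward pass: compute s1..sT, o1..oT and the summed cost."""
    U, W, V = params
    s = {-1: np.zeros(W.shape[0])}          # s[-1] plays the role of "init"
    o, cost = {}, 0.0
    for t, (x, y) in enumerate(zip(xs, ys)):
        s[t] = np.tanh(U @ x + W @ s[t - 1])
        o[t] = softmax(V @ s[t])
        cost -= np.log(o[t][y])             # C(t), cross-entropy at step t
    return s, o, cost

def bptt(params, xs, ys):
    """Backward pass through the unfolded network with tied weights."""
    U, W, V = params
    s, o, cost = forward(params, xs, ys)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(W.shape[0])          # gradient reaching s_t from step t+1
    for t in reversed(range(len(xs))):
        do = o[t].copy()
        do[ys[t]] -= 1.0                    # gradient of C(t) w.r.t. V @ s_t
        dV += np.outer(do, s[t])
        ds = V.T @ do + ds_next             # from this step's cost and from later steps
        dz = (1.0 - s[t] ** 2) * ds         # back through tanh
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, s[t - 1])        # same W at every step: accumulate
        ds_next = W.T @ dz
    return cost, (dU, dW, dV)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 3                # toy sizes
shapes = [(n_hid, n_in), (n_hid, n_hid), (n_out, n_hid)]
params = tuple(rng.normal(scale=0.5, size=shape) for shape in shapes)
xs = [rng.normal(size=n_in) for _ in range(4)]   # x1..x4
ys = [0, 2, 1, 0]                                # target class ids y1..y4
cost, grads = bptt(params, xs, ys)
print(cost)
```

A finite-difference check on a single entry of `W` is a quick way to confirm that the accumulated gradient matches the cost.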
RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation
Multiply the same matrix at each time step during backprop
The gradient becomes very small or very large quickly → vanishing or exploding gradient
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. on Neural Networks, 1994.
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013.
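The effect of multiplying by the same matrix at every step can be seen in a deliberately simplified sketch. It assumes a linear recurrence whose matrix is a scaled identity and ignores the activation derivative, so it only illustrates the trend; `norm_after_steps` and `scale` are hypothetical names.

```python
import numpy as np

def norm_after_steps(scale, steps, size=4):
    """Norm of a gradient vector after `steps` multiplications by the same matrix."""
    W = scale * np.eye(size)      # simplest possible recurrent matrix (assumption)
    g = np.ones(size)
    for _ in range(steps):
        g = W.T @ g               # one backward step through a linear recurrence
    return np.linalg.norm(g)

for steps in [1, 2, 5, 10, 20, 50]:
    print(steps, norm_after_steps(0.9, steps), norm_after_steps(1.1, steps))
```

With a scale below one the norm shrinks geometrically with the number of steps, and with a scale above one it grows geometrically.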
Rough Error Surface
[Figure: Cost plotted over two weights, w1 and w2]
The error surface is either very flat or very steep
Vanishing/Exploding Gradient Example
[Figure: six panels, each with axis ticks from 0 to 35, for 1 step, 2 steps, 5 steps, 10 steps, 20 steps and 50 steps]
Possible Solutions
Recurrent Neural Network
Exploding Gradient: Clipping
Idea: control the gradient value to avoid exploding
[Figure: Cost surface over w1 and w2, showing the clipped gradient]
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013.
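Clipping by norm, the variant described by Pascanu et al., can be sketched in a few lines. The function name `clip_by_norm` and the example values are illustrative.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # keep the direction, shrink the length
    return grad

g = np.array([30.0, 40.0])                 # L2 norm is 50
print(clip_by_norm(g, 5.0))                # rescaled to norm 5
print(clip_by_norm(g, 100.0))              # below the threshold, unchanged
```

Rescaling the whole vector preserves the update direction, which clipping each component separately would not.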
Vanishing Gradient: Initialization + ReLU
IRNN
◦ initialize all W as identity matrix I
◦ use ReLU for activation functions
Le et al., “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” arXiv, 2016.
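A small sketch shows why this initialization helps. With an identity recurrent matrix and ReLU, a non-negative hidden state is copied forward unchanged when nothing else is added, so repeated steps neither shrink nor amplify it. The input weights are set to zero here only to isolate the recurrence; that choice, and the names `irnn_step`, `U` and `b`, are assumptions for illustration.

```python
import numpy as np

hidden_size, input_size = 4, 3
W = np.eye(hidden_size)                     # recurrent weights start as the identity
U = np.zeros((hidden_size, input_size))     # zeroed only to isolate the recurrence
b = np.zeros(hidden_size)

def irnn_step(x, h_prev):
    return np.maximum(0.0, W @ h_prev + U @ x + b)   # ReLU activation

h = np.array([0.5, 0.0, 2.0, 1.0])          # a non-negative starting state
for _ in range(100):
    h = irnn_step(np.ones(input_size), h)
print(h)                                    # carried through 100 steps
```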
Vanishing Gradient: Gating Mechanism
RNN models temporal sequence information
◦ can handle “long-term dependencies” in theory
“I grew up in France…I speak fluent French.”
Issue: RNN cannot handle such “long-term dependencies” in practice due to vanishing gradient
→ apply the gating mechanism to directly encode the long-distance information
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Extension
Recurrent Neural Network
Bidirectional RNN
$h = [\overrightarrow{h}; \overleftarrow{h}]$ represents (summarizes) the past and future around a single token
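The concatenation of a forward and a backward hidden state can be sketched as below. The tanh recurrence, the random parameters and the names `run_rnn` and `make_params` are assumptions for illustration, not the slides' definition.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, T = 3, 4, 5                     # toy sizes

def make_params():
    """Random input-to-hidden and hidden-to-hidden matrices."""
    return (rng.normal(scale=0.1, size=(d_hid, d_in)),
            rng.normal(scale=0.1, size=(d_hid, d_hid)))

def run_rnn(xs, params):
    """Return the hidden state after each input, in the order the inputs are read."""
    U, W = params
    h, states = np.zeros(d_hid), []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    return states

xs = [rng.normal(size=d_in) for _ in range(T)]
fwd = run_rnn(xs, make_params())                # reads left to right (the past)
bwd = run_rnn(xs[::-1], make_params())[::-1]    # reads right to left, re-aligned to token order
h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(h[2].shape)
```

Each `h[t]` has twice the hidden size: the first half summarizes the tokens up to position t and the second half the tokens from position t onward.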
Deep Bidirectional RNN
Each memory layer passes an intermediate representation to the next
How to Frame the Learning Problem?
The learning algorithm f maps the input domain X into the output domain Y
$$f : X \rightarrow Y$$
Input domain: word, word sequence, audio signal, click logs
Output domain: single label, sequence tags, tree structure, probability distribution
Network design should leverage input and output domain properties
Input Domain – Sequence Modeling
Idea: aggregate the meaning from all words into a vector
Method:
◦ Basic combination: average, sum
◦ Neural combination:
  ✓ Recursive neural network (RvNN)
  ✓ Recurrent neural network (RNN)
  ✓ Convolutional neural network (CNN)
[Figure: how to compute an N-dim vector from the words 這 (this), 規格 (specification), 有 (have), 誠意 (sincerity)]
Sentiment Analysis
Encode the sequential input into a vector using RNN
[Figure: an RNN reads the words 這 規格 有 誠意 one at a time up to the last input x4; the last hidden state h4 serves as the Input vector (x1, x2, …, xN) of a classifier whose Output is (y1, y2, …, yM)]
RNN considers temporal information to learn sentence vectors as the input of classification tasks
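Encoding a sentence with an RNN and classifying the resulting vector can be sketched as follows. The parameters are random and untrained, the word vectors are random stand-ins, and the names `encode` and `classify` and the choice of two classes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_classes = 6, 8, 2     # toy sizes; two classes stand for negative/positive

U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(n_classes, d_hid))

def encode(word_vectors):
    """Run the RNN over the words and return the last hidden state."""
    h = np.zeros(d_hid)
    for x in word_vectors:
        h = np.tanh(U @ x + W @ h)
    return h                          # sentence vector

def classify(word_vectors):
    """Softmax classifier on top of the sentence vector."""
    z = V @ encode(word_vectors)
    e = np.exp(z - z.max())
    return e / e.sum()

sentence = [rng.normal(size=d_in) for _ in range(4)]   # stand-ins for four word vectors
probs = classify(sentence)
print(probs)
```

Unlike an average or a sum, the encoding depends on word order: feeding the same vectors in reverse generally gives a different sentence vector.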
Output Domain – Sequence Prediction
POS Tagging
◦ “推薦我台大後門的餐廳” (“Recommend me a restaurant by the NTU back gate”) → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN
Speech Recognition
◦ “大家好” (“Hello, everyone”)
Machine Translation
◦ “How are you doing today?” → “你好嗎?”
The output can be viewed as a sequence of classifications
POS Tagging
Tag a word at each timestamp
◦ Input: word sequence
◦ Output: corresponding POS tag sequence
四樓 (fourth floor) 好 (so) 專業 (professional)
N VA AD
Natural Language Understanding (NLU)
Tag a word at each timestamp
◦ Input: word sequence
◦ Output: IOB-format slot tag and intent tag
<START>/O just/O sent/O email/O to/O bob/B-contact_name about/O fishing/B-subject this/I-subject weekend/I-subject <END>/send_email
→ send_email(contact_name=“bob”, subject=“fishing this weekend”)
Temporal orders for input and output are the same
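Turning per-word IOB tags into the slot values of the frame can be sketched as below. The helper `decode_slots` is a hypothetical illustration of the decoding step, not part of the slides; the words, tags and intent are those of the example.

```python
def decode_slots(words, tags):
    """Group IOB tags into a {slot_name: phrase} dictionary."""
    slots, name, phrase = {}, None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):                 # a new slot begins
            if name is not None:
                slots[name] = " ".join(phrase)
            name, phrase = tag[2:], [word]
        elif tag.startswith("I-") and name == tag[2:]:
            phrase.append(word)                  # continue the current slot
        else:                                    # "O" or an inconsistent I- tag
            if name is not None:
                slots[name] = " ".join(phrase)
            name, phrase = None, []
    if name is not None:                         # flush a slot that ends the sentence
        slots[name] = " ".join(phrase)
    return slots

words = ["just", "sent", "email", "to", "bob", "about", "fishing", "this", "weekend"]
tags = ["O", "O", "O", "O", "B-contact_name", "O", "B-subject", "I-subject", "I-subject"]
intent = "send_email"                            # intent tag predicted at <END>
slots = decode_slots(words, tags)
print(intent, slots)
```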
Machine Translation
Cascade two RNNs, one for encoding and one for decoding
◦ Input: word sequences in the source language
◦ Output: word sequences in the target language
[Figure: an encoder RNN followed by a decoder RNN; example sentence “超棒 的 醬汁” (“awesome sauce”)]
Chit-Chat Dialogue Modeling
Cascade two RNNs, one for encoding and one for decoding
◦ Input: word sequences in the question
◦ Output: word sequences in the response
Temporal ordering for input and output may be different
Sci-Fi Short Film - SUNSPRING
https://www.youtube.com/watch?v=LY7x2Ihqj
Concluding Remarks
Language Modeling
◦ RNNLM
Recurrent Neural Networks
◦ Definition
◦ Backpropagation through Time (BPTT)
◦ Vanishing/Exploding Gradient
Applications
◦ Sequential Input: Sequence-Level Embedding
◦ Sequential Output: Tagging / Seq2Seq (Encoder-Decoder)