Recurrent Neural Networks 2
CS 287
(Based on Yoav Goldberg’s notes)
Review: Representation of Sequence
- Many tasks in NLP involve sequences w_1, ..., w_n.
- Representations as a matrix of dense vectors X.
  (Following YG, slight abuse of notation)

$$x_1 = x^0_1 W^0, \quad \ldots, \quad x_n = x^0_n W^0$$

- Would like a fixed-dimensional representation.
Review: Sequence Recurrence
- Can map from the dense sequence to a dense representation:
  x_1, ..., x_n ↦ s_1, ..., s_n
- For all i ∈ {1, ..., n},

$$s_i = R(s_{i-1}, x_i; \theta)$$

- θ is shared by all applications of R.

Example:

$$
\begin{aligned}
s_4 &= R(s_3, x_4) \\
    &= R(R(s_2, x_3), x_4) \\
    &= R(R(R(s_1, x_2), x_3), x_4) \\
    &= R(R(R(R(s_0, x_1), x_2), x_3), x_4)
\end{aligned}
$$
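As a concrete illustration, here is a minimal numpy sketch of this recurrence, assuming a tanh Elman cell for R and small, arbitrary dimensions (the course itself uses Torch):

```python
import numpy as np

d_in, d_hid, n = 4, 8, 5               # hypothetical sizes
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_in, d_hid))   # input-to-state weights
W_s = rng.normal(size=(d_hid, d_hid))  # state-to-state weights
b = np.zeros(d_hid)                    # together these are the shared theta

def R(s_prev, x):
    """One recurrence step: s_i = tanh(s_{i-1} W^s + x_i W^x + b)."""
    return np.tanh(s_prev @ W_s + x @ W_x + b)

xs = rng.normal(size=(n, d_in))        # x_1, ..., x_n
s = np.zeros(d_hid)                    # s_0
for x in xs:                           # the same parameters are reused at every step
    s = R(s, x)
print(s.shape)                         # s_n: a fixed-dimensional representation
```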
Review: BPTT (Acceptor)
- Run forward propagation.
- Run backward propagation.
- Update all weights (shared).
[Diagram: acceptor RNN unrolled over x_1, ..., x_n with states s_0, ..., s_n and output y computed from s_n]
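Below is a minimal numpy sketch of one BPTT step for such an acceptor, assuming a tanh Elman cell and a hypothetical squared loss on a scalar output read off s_n. It only illustrates the forward / backward / shared-update pattern, not the course's Torch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n = 4, 8, 5                        # hypothetical sizes
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)
w_out = rng.normal(scale=0.1, size=d_hid)       # acceptor output weights (assumed)

xs = rng.normal(size=(n, d_in))                 # one input sequence
y = 1.0                                         # its (made-up) target
lr = 0.1

# Forward propagation: unroll the shared-parameter recurrence.
states = [np.zeros(d_hid)]                      # s_0
for x in xs:
    states.append(np.tanh(states[-1] @ W_s + x @ W_x + b))
pred = states[-1] @ w_out
loss = 0.5 * (pred - y) ** 2

# Backward propagation through time: walk the unrolled graph in reverse,
# accumulating gradients for the *shared* weights at every step.
dW_x, dW_s, db = np.zeros_like(W_x), np.zeros_like(W_s), np.zeros_like(b)
dw_out = (pred - y) * states[-1]
g = (pred - y) * w_out                          # dL/ds_n
for i in range(n, 0, -1):
    dz = g * (1.0 - states[i] ** 2)             # through tanh at step i
    dW_s += np.outer(states[i - 1], dz)
    dW_x += np.outer(xs[i - 1], dz)
    db += dz
    g = dz @ W_s.T                              # dL/ds_{i-1}

# Update all (shared) weights.
for p, dp in [(W_x, dW_x), (W_s, dW_s), (b, db), (w_out, dw_out)]:
    p -= lr * dp
```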
Review: Issues
- Can be inefficient, but batching and GPUs help.
- The model is much deeper than previous approaches.
  - This matters a lot; it is the focus of the next class.
- Variable-size model for each sentence.
  - Have to be a bit more clever in Torch.
Quiz
Consider a ReLU version of the Elman RNN with function R defined as

$$NN(x, s) = \mathrm{ReLU}(s W^s + x W^x + b).$$

We use this RNN with an acceptor architecture over the sequence x_1, ..., x_5. Assume we have computed the gradient for the final layer, ∂L/∂s_5.

- What is the symbolic gradient of the previous state, ∂L/∂s_4?
- What is the symbolic gradient of the first state, ∂L/∂s_1?
Answer
$$
\begin{aligned}
\frac{\partial L}{\partial s_{4,i}} &= \sum_j \frac{\partial s_{5,j}}{\partial s_{4,i}}\, \frac{\partial L}{\partial s_{5,j}} \\
&= \sum_j \begin{cases} W^s_{i,j}\, \frac{\partial L}{\partial s_{5,j}} & s_{5,j} > 0 \\ 0 & \text{o.w.} \end{cases} \\
&= \sum_j \mathbf{1}(s_{5,j} > 0)\, W^s_{i,j}\, \frac{\partial L}{\partial s_{5,j}}
\end{aligned}
$$

- Chain rule
- ReLU gradient rule
- Indicator notation
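A quick numerical sanity check of this expression (a sketch with arbitrary dimensions, assuming the loss is L = v · s_5 so that ∂L/∂s_5 = v):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W_s = rng.normal(size=(d, d))
W_x = rng.normal(size=(4, d))
b = rng.normal(size=d)
x5 = rng.normal(size=4)
s4 = rng.normal(size=d)
v = rng.normal(size=d)                          # plays the role of dL/ds_5

def step(s_prev, x):
    """ReLU Elman step: ReLU(s W^s + x W^x + b)."""
    return np.maximum(0.0, s_prev @ W_s + x @ W_x + b)

s5 = step(s4, x5)
# Symbolic answer: dL/ds_{4,i} = sum_j 1(s_{5,j} > 0) W^s_{i,j} dL/ds_{5,j}
grad_s4 = W_s @ (np.where(s5 > 0, 1.0, 0.0) * v)

# Finite-difference check of the same quantity.
eps = 1e-6
num = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    num[i] = (v @ step(s4 + e, x5) - v @ step(s4 - e, x5)) / (2 * eps)
print(np.allclose(grad_s4, num, atol=1e-5))     # True (barring ReLU kinks at exactly 0)
```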
Answer
$$
\begin{aligned}
\frac{\partial L}{\partial s_{1,j_1}} &= \sum_{j_2} \frac{\partial s_{2,j_2}}{\partial s_{1,j_1}} \cdots \sum_{j_5} \frac{\partial s_{5,j_5}}{\partial s_{4,j_4}}\, \frac{\partial L}{\partial s_{5,j_5}} \\
&= \sum_{j_2} \cdots \sum_{j_5} \mathbf{1}(s_{2,j_2} > 0) \cdots \mathbf{1}(s_{5,j_5} > 0)\, W^s_{j_1,j_2} \cdots W^s_{j_4,j_5}\, \frac{\partial L}{\partial s_{5,j_5}} \\
&= \sum_{j_2 \ldots j_5} \prod_{k=2}^{5} \mathbf{1}(s_{k,j_k} > 0)\, W^s_{j_{k-1},j_k}\, \frac{\partial L}{\partial s_{5,j_5}}
\end{aligned}
$$

- Multiple applications of the chain rule.
- Combine and multiply.
- Product of weights.
The Promise of RNNs
- Hope: learn the long-range interactions of language from data.
- For acceptors this means maintaining early state.
- Example: "How can you not see this movie?" vs. "You should not see this movie."
- The memory interaction here is at s_1, but the gradient signal is at s_n.
Long-Term Gradients
- Gradients go through multiplicative layers.
- Fine for the final layers, but problematic for early layers.
- For instance, consider the quiz with hardtanh:

$$\sum_{j_2 \ldots j_5} \prod_{k=2}^{5} \mathbf{1}(1 > s_{k,j_k} > 0)\, W^s_{j_{k-1},j_k}\, \frac{\partial L}{\partial s_{5,j_5}}$$

- The multiplicative effect tends toward very large (exploding) or very small (vanishing) gradients.
Exploding Gradients
The easier case; it can be handled heuristically.

- Occurs if there is no saturation but exponential blowup:

$$\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}$$

- In these cases there are reasonable short-term gradients, but bad long-term gradients.
- Two practical heuristics:
  - gradient clipping, i.e. bounding each gradient component by a maximum value
  - gradient normalization, i.e. rescaling the RNN gradients if they are above a fixed norm value.
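A sketch of the two heuristics in numpy, operating on a list of gradient arrays such as those from the BPTT sketch above; the threshold values are arbitrary:

```python
import numpy as np

def clip_gradients(grads, max_value):
    """Gradient clipping: bound every gradient component to [-max_value, max_value]."""
    return [np.clip(g, -max_value, max_value) for g in grads]

def normalize_gradients(grads, max_norm):
    """Gradient normalization: rescale all gradients if their joint norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Hypothetical usage with the BPTT gradients computed earlier:
# dW_x, dW_s, db = normalize_gradients([dW_x, dW_s, db], max_norm=5.0)
```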
Vanishing Gradients
- Occurs when combining small weights (or from saturation):

$$\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}$$

- Again, this mainly affects long-term gradients.
- However, the fix is not as simple as boosting the gradients.
LSTM (Hochreiter and Schmidhuber, 1997)
LSTM Formally
$$
\begin{aligned}
R(s_{i-1}, x_i) &= [c_i, h_i] \\
c_i &= j \odot i + f \odot c_{i-1} \\
h_i &= \tanh(c_i) \odot o \\
i &= \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i) \\
j &= \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j) \\
f &= \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f) \\
o &= \tanh(x W^{xo} + h_{i-1} W^{ho} + b^o)
\end{aligned}
$$

- (Eeks.)
Unreasonable Effectiveness of RNNs
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Music Composition
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
Today’s Lecture
- Build up to LSTMs.
- Note: the development is ahistorical; later papers are presented first.
- Main idea: getting around the vanishing gradient issue.
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
Review: Deep ReLU Networks
$$NN_{layer}(x) = \mathrm{ReLU}(x W^1 + b^1)$$

- Given an input x, we can build arbitrarily deep fully-connected networks.
- Deep Neural Network (DNN):

$$NN_{layer}(NN_{layer}(NN_{layer}(\ldots NN_{layer}(x))))$$
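For reference, a minimal numpy sketch of such a stack with arbitrary sizes (this is the plain DNN that the gated variants below modify):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hid, depth = 8, 6                                   # hypothetical sizes
layers = [(rng.normal(scale=0.3, size=(d_hid, d_hid)), np.zeros(d_hid))
          for _ in range(depth)]

def nn_layer(x, W, b):
    """One fully-connected layer: ReLU(x W^1 + b^1)."""
    return np.maximum(0.0, x @ W + b)

h = rng.normal(size=d_hid)
for W, b in layers:                                   # NN_layer(NN_layer(... NN_layer(x)))
    h = nn_layer(h, W, b)
```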
Deep Networks
$$NN_{layer}(x) = \mathrm{ReLU}(x W^1 + b^1)$$

[Diagram: stack of fully-connected layers x → h_1 → h_2 → ... → h_{n-1} → h_n]

- Can have similar issues with vanishing gradients:

$$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \sum_{j_n} \mathbf{1}(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}$$
Review: NLM Skip Connections
Thought Experiment: Additive Skip-Connections
$$NN_{sl1}(x) = \tfrac{1}{2}\,\mathrm{ReLU}(x W^1 + b^1) + \tfrac{1}{2}\, x$$

[Diagram: stack of layers x → h_1 → ... → h_n with additive skip-connections]
Exercise: Gradients
Exercise: What is the gradient ∂L/∂h_{n-1} with skip-connections? Inductively, what happens?

[Diagram: stack of layers x → h_1 → ... → h_n with additive skip-connections]
Exercise
We now have the average of two terms, one of which has no saturation condition or multiplicative weight term:

$$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \frac{1}{2}\Big(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\Big) + \frac{1}{2}\,\frac{\partial L}{\partial h_{n,j_{n-1}}}$$
Thought Experiment 2: Dynamic Skip-Connections
$$
\begin{aligned}
NN_{sl2}(x) &= (1-t)\,\mathrm{ReLU}(x W^1 + b^1) + t\, x \\
t &= \sigma(x W^t + b^t)
\end{aligned}
$$

$$W^1 \in \mathbb{R}^{d_{hid}\times d_{hid}}, \qquad W^t \in \mathbb{R}^{d_{hid}\times 1}$$

[Diagram: stack of layers x → h_1 → ... → h_n with dynamic skip-connections]
Dynamic Skip-Connections: intuition
$$NN_{sl2}(x) = (1-t)\,\mathrm{ReLU}(x W^1 + b^1) + t\, x, \qquad t = \sigma(x W^t + b^t)$$

- No longer directly computing the next layer h_i.
- Instead computing ∆h and a dynamic update proportion t.
- Seems like a small change from a DNN. Why does this matter?
Dynamic Skip-Connections Gradient
$$NN_{sl2}(x) = (1-t)\,\mathrm{ReLU}(x W^1 + b^1) + t\, x, \qquad t = \sigma(x W^t + b^t)$$

- The t values are saved on the forward pass.
- t allows for a direct flow of gradients to earlier layers.

$$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1-t)\Big(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\Big) + t\,\frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots$$

- (What is the ...?)
Dynamic Skip-Connections
$$NN_{sl2}(x) = (1-t)\,\mathrm{ReLU}(x W^1 + b^1) + t\, x, \qquad t = \sigma(x W^t + b^t)$$

$$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1-t)\Big(\sum_{j_n} \mathbf{1}(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\Big) + t\,\frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots$$

- Note: h and W^t are also receiving gradients through the sigmoid!
- (Backprop is fun.)
Highway Network (Srivastava et al., 2015)
- Now add a combination at each dimension.
- ⊙ is point-wise multiplication.

$$
\begin{aligned}
NN_{highway}(x) &= (1-t) \odot h + t \odot x \\
h &= \mathrm{ReLU}(x W^1 + b^1) \\
t &= \sigma(x W^t + b^t)
\end{aligned}
$$

$$W^t, W^1 \in \mathbb{R}^{d_{hid}\times d_{hid}}, \qquad b^t, b^1 \in \mathbb{R}^{1\times d_{hid}}$$

- h: transform (e.g. a standard MLP layer)
- t: carry (dimension-specific dynamic skipping)
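A minimal numpy sketch of a single highway layer in the slide's notation, with arbitrary sizes; here t is the carry gate, so t ≈ 1 copies the input through:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W1, b1, Wt, bt):
    """One highway layer: (1 - t) ⊙ h + t ⊙ x, with a per-dimension carry gate t."""
    h = np.maximum(0.0, x @ W1 + b1)   # transform
    t = sigmoid(x @ Wt + bt)           # carry gate, one value per dimension
    return (1.0 - t) * h + t * x

rng = np.random.default_rng(0)
d_hid = 8
x = rng.normal(size=d_hid)
W1, Wt = rng.normal(size=(d_hid, d_hid)), rng.normal(size=(d_hid, d_hid))
b1, bt = np.zeros(d_hid), np.zeros(d_hid)
out = highway_layer(x, W1, b1, Wt, bt)
```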
Highway Gradients
- The t values are saved on the forward pass.
- t_j determines the update of dimension j.

$$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \Big(\sum_{j_n} (1-t_{j_n})\,\mathbf{1}(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\Big) + t_{j_{n-1}}\,\frac{\partial L}{\partial h_{n,j_{n-1}}} + \ldots$$
Intuition: Gating
- This is known as the gating operation:

$$t \odot x$$

- Allows the vector t to mask or gate x.
- True gating would have t ∈ {0, 1}^{d_hid}.
- Approximate with the sigmoid:

$$t = \sigma(x W^t + b)$$
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
Back To Acceptor RNNs
- Acceptor RNNs are deep networks with shared weights.
- Elman tanh layers:

$$NN(x, s) = \tanh(s W^s + x W^x + b)$$

[Diagram: acceptor RNN unrolled over x_1, ..., x_n with states s_0, ..., s_n]
RNN with Skip-Connections
- Can replace the layer with a modified highway layer:

$$
\begin{aligned}
R(s_{i-1}, x_i) &= (1-t) \odot h + t \odot s_{i-1} \\
h &= \tanh(x W^x + s_{i-1} W^s + b) \\
t &= \sigma(x W^{xt} + s_{i-1} W^{st} + b^t)
\end{aligned}
$$

$$W^{xt}, W^x \in \mathbb{R}^{d_{in}\times d_{hid}}, \quad W^{st}, W^s \in \mathbb{R}^{d_{hid}\times d_{hid}}, \quad b^t, b \in \mathbb{R}^{1\times d_{hid}}$$
RNN with Skip-Connections
[Diagram: RNN with skip-connections unrolled over x_1, ..., x_n with states s_0, ..., s_n]
Final Idea: Stopping flow
- For many tasks, it is useful to halt propagation.
- Can do this by applying a reset gate r:

$$
\begin{aligned}
h &= \tanh(x W^x + (r \odot s_{i-1}) W^s + b) \\
r &= \sigma(x W^{xr} + s_{i-1} W^{sr} + b^r)
\end{aligned}
$$
Gated Recurrent Unit (GRU) (Cho et al., 2014)

$$
\begin{aligned}
R(s_{i-1}, x_i) &= (1-t) \odot h + t \odot s_{i-1} \\
h &= \tanh(x W^x + (r \odot s_{i-1}) W^s + b) \\
r &= \sigma(x W^{xr} + s_{i-1} W^{sr} + b^r) \\
t &= \sigma(x W^{xt} + s_{i-1} W^{st} + b^t)
\end{aligned}
$$

$$W^{xt}, W^{xr}, W^x \in \mathbb{R}^{d_{in}\times d_{hid}}, \quad W^{st}, W^{sr}, W^s \in \mathbb{R}^{d_{hid}\times d_{hid}}, \quad b^t, b \in \mathbb{R}^{1\times d_{hid}}$$

- t: dynamic skip-connections
- r: reset gating
- s: hidden state
- h: gated MLP
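A minimal numpy sketch of one GRU step in this notation, with arbitrary sizes (parameter names follow the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(s_prev, x, p):
    """One GRU step in the slide's notation: s_i = (1 - t) ⊙ h + t ⊙ s_{i-1}."""
    r = sigmoid(x @ p["Wxr"] + s_prev @ p["Wsr"] + p["br"])      # reset gate
    t = sigmoid(x @ p["Wxt"] + s_prev @ p["Wst"] + p["bt"])      # carry/update gate
    h = np.tanh(x @ p["Wx"] + (r * s_prev) @ p["Ws"] + p["b"])   # candidate state
    return (1.0 - t) * h + t * s_prev

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
p = {"Wxr": rng.normal(size=(d_in, d_hid)), "Wsr": rng.normal(size=(d_hid, d_hid)), "br": np.zeros(d_hid),
     "Wxt": rng.normal(size=(d_in, d_hid)), "Wst": rng.normal(size=(d_hid, d_hid)), "bt": np.zeros(d_hid),
     "Wx":  rng.normal(size=(d_in, d_hid)), "Ws":  rng.normal(size=(d_hid, d_hid)), "b":  np.zeros(d_hid)}

s = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):   # run over a short sequence
    s = gru_step(s, x, p)
```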
Example: Language Modeling
- In non-Markovian language modeling, treat the corpus as one long sequence.
- r allows resetting the state after a sequence:
consumers may want to move their telephones a little closer to
the tv set </s> <unk> <unk> watching abc ’s monday
night football can now vote during <unk> for the greatest
play in N years from among four or five <unk> <unk>
</s> two weeks ago viewers of several nbc <unk> consumer
segments started calling a N number for advice on various
<unk> issues </s> and the new syndicated reality show
hard copy records viewers ’ opinions for possible airing on the
next day ’s show </s>
Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM
LSTM
- The Long Short-Term Memory network uses the same idea.
- The model has three gates: input, output, forget.
- Developed first, but a bit harder to follow.
- Seems to be better at LM; comparable to the GRU on other tasks.
LSTM State

The state s_i is made of two components:

- c_i: cell
- h_i: hidden

$$R(s_{i-1}, x_i) = [c_i, h_i]$$
LSTM Development: Input and Forget Gates
The cell is updated with a ∆c, not a convex combination (two separate gates):

$$
\begin{aligned}
R(s_{i-1}, x_i) &= [c_i, h_i] \\
c_i &= j \odot h + f \odot c_{i-1} \\
h &= \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i) \\
j &= \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j) \\
f &= \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f)
\end{aligned}
$$

- c_i: cell
- j: input gate
- f: forget gate
LSTM Development: Hidden
The hidden state h_i squashes c_i:

$$
\begin{aligned}
R(s_{i-1}, x_i) &= [c_i, h_i] \\
h_i &= \tanh(c_i) \\
c_i &= j \odot h + f \odot c_{i-1} \\
h &= \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i) \\
j &= \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j) \\
f &= \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f)
\end{aligned}
$$

- c_i: cell
- h_i: hidden
- j: input gate
- f: forget gate
Long Short-Term Memory
Finally, an output gate o is applied to the hidden state:

$$
\begin{aligned}
R(s_{i-1}, x_i) &= [c_i, h_i] \\
c_i &= j \odot i + f \odot c_{i-1} \\
h_i &= \tanh(c_i) \odot o \\
i &= \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i) \\
j &= \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j) \\
f &= \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f) \\
o &= \tanh(x W^{xo} + h_{i-1} W^{ho} + b^o)
\end{aligned}
$$
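A minimal numpy sketch of one step of this cell, with arbitrary sizes; note that the output gate uses tanh here to match the slide, whereas most standard LSTM implementations use a sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, p):
    """One LSTM step in the slide's notation (candidate i, gates j, f, o)."""
    i = np.tanh(x @ p["Wxi"] + h_prev @ p["Whi"] + p["bi"])     # candidate
    j = sigmoid(x @ p["Wxj"] + h_prev @ p["Whj"] + p["bj"])     # input gate
    f = sigmoid(x @ p["Wxf"] + h_prev @ p["Whf"] + p["bf"])     # forget gate
    o = np.tanh(x @ p["Wxo"] + h_prev @ p["Who"] + p["bo"])     # output gate (tanh, as on the slide)
    c = j * i + f * c_prev
    h = np.tanh(c) * o
    return c, h

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
p = {}
for g in "ijfo":
    p["Wx" + g] = rng.normal(size=(d_in, d_hid))
    p["Wh" + g] = rng.normal(size=(d_hid, d_hid))
    p["b" + g] = np.zeros(d_hid)

c, h = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):   # run over a short sequence
    c, h = lstm_step(c, h, x, p)
```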
Output Gate?
- The output gate feels the most ad hoc (why tanh?).
- Luckily, Jozefowicz et al. (2015) find the output gate is not too important.
Accuracy Results (Jozefowicz et al., 2015)
Accuracy of RNN models on three different tasks. LSTM models
include versions without the forget, input, and output gates.
English LM Results (Jozefowicz et al., 2015)
[Table: NLL (perplexity) results]