![Page 1: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/1.jpg)
Deep learning for Natural Language Processing and Machine Translation
2015.10.16
Seung-Hoon Na
![Page 2: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/2.jpg)
Contents
Introduction: Neural network, deep learning
Deep learning for Natural language processing
Neural network for classification
Word embedding
General architecture for sequential labeling
Recent advances
Neural machine translation
Future plan
![Page 3: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/3.jpg)
Deep learning: Motivation
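• Feature extraction in machine learning methods
Uses handcrafted features
– The feature extraction step is not automated
Features must be improved continually
Requires performance improvement and tuning
Deep learning
Removes or simplifies the feature extraction step
Embeds sophisticated features in the nonlinear learning process
$$f(\mathbf{x}; \mathbf{w}) = \arg\max_{\mathbf{y}} \; \mathbf{w} \cdot \boldsymbol{\Psi}(\mathbf{x}, \mathbf{y})$$
$\boldsymbol{\Psi}(\mathbf{x}, \mathbf{y})$: Feature vector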
![Page 4: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/4.jpg)
Multi-layer perceptron
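Upper hidden layers are abstractions of the outputs of lower hidden layers
Nonlinear model
A multi-layer NN structure can embed abstract features
Simplifies the feature tuning procedure
Deep learning: Abstraction
Conventional machine learning: Raw data → Hand-crafted features → Trainable classifier
Deep learning: Raw data → Trainable features → Trainable classifier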
![Page 5: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/5.jpg)
Neural Network
A neuron
A general computational unit that takes n inputs and produces a single output
Sigmoid & binary logistic regression unit
The most popular neurons
Takes an n-dimensional input vector 𝑥 and produces the scalar activation (output) 𝑎
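$$a = \sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$$
Figure: the inputs $x_1, x_2, \dots, x_n$ are weighted by $w_1, w_2, \dots, w_n$, summed together with the bias $b$, and passed through $\sigma$ to give the activation $a$.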
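As a minimal sketch of the sigmoid neuron described above, the following Python computes the activation for one input vector. It assumes the standard logistic function $\sigma(z) = 1/(1 + \exp(-z))$; the input, weight, and bias values are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Activation a = sigmoid(w^T x + b) for one n-dimensional input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical 3-dimensional example: w^T x + b = 0.5 - 0.5 - 1.0 + 0.5 = -0.5
a = neuron(x=[1.0, 2.0, -1.0], w=[0.5, -0.25, 1.0], b=0.5)
```

The output always lies strictly between 0 and 1, which is why the same unit doubles as a binary logistic regression model.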
![Page 6: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/6.jpg)
Neural network: Setting for Classification
Input layer: Feature values
Output layer: Scores of labels
Softmax layer: Normalization of output values
– To get probabilities of output labels given input
– K: the number of labels
$$P(i \mid x) = \frac{\exp(y_i)}{\sum_{t=1}^{K} \exp(y_t)}$$
Figure: input layer → hidden layer → output layer ($y_1, y_2, y_3, \dots, y_{K-1}, y_K$) → softmax layer
![Page 7: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/7.jpg)
Neural network: Training
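Training data: $Tr = \{(\boldsymbol{x}_1, g_1), \dots, (\boldsymbol{x}_N, g_N)\}$
$\boldsymbol{x}_i$: i-th input feature vector
$g_i \in \{1, \dots, K\}$: i-th target label
Objective function
Negative Log-likelihood (NLL)
$$L = -\sum_{(\boldsymbol{x}, g) \in Tr} \log P(g \mid \boldsymbol{x})$$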
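The softmax normalization and the NLL objective can be sketched in a few lines of Python. This is an illustrative sketch, not the tutorial's code: the score values are invented, labels are 0-based here (the slides use 1..K), and the maximum score is subtracted before exponentiating purely for numerical stability.

```python
import math

def softmax(scores):
    """Normalize raw output scores y_1..y_K into probabilities."""
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll(batch_scores, labels):
    """Sum of -log P(g | x) over (scores, label) pairs; labels are 0-based."""
    return -sum(math.log(softmax(y)[g]) for y, g in zip(batch_scores, labels))

probs = softmax([2.0, 1.0, 0.1])
loss = nll([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]], [0, 2])
```

The second example has three equal scores, so it contributes exactly $\log 3$ to the loss regardless of its label.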
![Page 8: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/8.jpg)
Neural network: Training
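Stochastic gradient method
1. Randomly sample $(\boldsymbol{x}, g)$ from the training data
2. Define the log-likelihood of $(\boldsymbol{x}, g)$, which is the negative of its NLL:
$$L = \log P(g \mid \boldsymbol{x})$$
For each weight matrix $\boldsymbol{W} \in \boldsymbol{\theta}$:
3. Compute the gradient $\frac{\partial L}{\partial \boldsymbol{W}}$
4. Update the weight matrix by gradient ascent on $L$ (equivalently, gradient descent on the NLL):
$$\boldsymbol{W} \leftarrow \boldsymbol{W} + \eta \frac{\partial L}{\partial \boldsymbol{W}}$$
Iterate the above procedure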
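The stochastic gradient procedure can be sketched as follows. To stay short, this sketch assumes a purely linear softmax classifier (scores $y = Wx$, no hidden layer) and writes the update as gradient ascent on $\log P(g \mid x)$, which is the same as gradient descent on the NLL $-\log P(g \mid x)$. The toy data, learning rate, and step count are arbitrary choices for illustration.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scores(W, x):
    """Linear output scores y = W x (assumption: no hidden layer)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_likelihood(W, data):
    return sum(math.log(softmax(scores(W, x))[g]) for x, g in data)

def sgd(W, data, eta=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        x, g = rng.choice(data)                      # 1. sample (x, g)
        p = softmax(scores(W, x))                    # 2. L = log P(g | x)
        for i, row in enumerate(W):
            delta = (1.0 if i == g else 0.0) - p[i]  # 3. dL/dy_i
            for k in range(len(row)):
                row[k] += eta * delta * x[k]         # 4. ascent step on L
    return W

# Toy two-class data set (hypothetical values).
data = [([1.0, 0.0], 0), ([1.0, 0.2], 0), ([0.0, 1.0], 1), ([0.1, 1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]
before = log_likelihood(W, data)
sgd(W, data)
after = log_likelihood(W, data)
```

With all-zero initial weights every label has probability 1/2, so `before` equals $4 \log 0.5$; after training the log-likelihood of the toy set is higher.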
![Page 9: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/9.jpg)
Neural network: Backpropagation
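Output layer
Objective for one example $(x, g)$, with the softmax applied to the output scores $y_1, \dots, y_K$:
$$L = \log P(g \mid x) = \log \frac{\exp(y_g)}{\sum_i \exp(y_i)} = y_g - \log \sum_i \exp(y_i)$$
Compute delta for the i-th output node
$$\delta_i^o = \frac{\partial L}{\partial y_i} = \delta(i, g) - \frac{\exp(y_i)}{\sum_j \exp(y_j)} = \delta(i, g) - P(i \mid x)$$
Here $\delta(i, g)$ is 1 if $i = g$ and 0 otherwise.
Vector form, where $\mathbf{1}_g$ is the one-hot vector of the target label:
$$\boldsymbol{\delta}^o = \mathbf{1}_g - \begin{bmatrix} P(1 \mid x) \\ \vdots \\ P(K \mid x) \end{bmatrix}$$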
![Page 10: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/10.jpg)
Neural network: Backpropagation
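Output weight matrix W
Compute gradient of $w_{ij}$
$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial w_{ij}} = \delta_i^o h_j$$
$$\frac{\partial L}{\partial \boldsymbol{W}} = \boldsymbol{\delta}^o \mathbf{h}^T$$
Figure: hidden nodes $h_1, \dots, h_j, \dots, h_m$ with pre-activations $z_1, \dots, z_m$ and $h_j = g(z_j)$ connect to the output nodes $y_1, \dots, y_K$ through the weights $w_{ij}$; the output deltas $\delta_i^o$ come from the softmax objective $y_g - \log \sum_i \exp(y_i)$.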
![Page 11: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/11.jpg)
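Neural network: Backpropagation
Hidden layer
Compute delta for the j-th hidden node
$$\delta_j^h = \frac{\partial L}{\partial z_j} = \frac{\partial L}{\partial h_j} \frac{\partial h_j}{\partial z_j} = \frac{\partial h_j}{\partial z_j} \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial h_j} = g'(z_j) \sum_i \delta_i^o w_{ij}$$
$$\boldsymbol{\delta}^h = g'(\mathbf{z}) \circ \mathbf{W}^T \boldsymbol{\delta}^o$$
Figure: the output deltas $\delta_i^o$ are propagated back through the weights $w_{ij}$ to the hidden nodes $h_1, \dots, h_m$, where $h_j = g(z_j)$.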
![Page 12: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/12.jpg)
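Neural network: Backpropagation
Hidden weight matrix U
Compute gradient of $u_{jk}$
$$\frac{\partial L}{\partial u_{jk}} = \frac{\partial L}{\partial z_j} \frac{\partial z_j}{\partial u_{jk}} = \delta_j^h x_k$$
$$\frac{\partial L}{\partial \boldsymbol{U}} = \boldsymbol{\delta}^h \mathbf{x}^T$$
Figure: the input $x_k$ connects to the hidden node $h_j = g(z_j)$ through the weight $u_{jk}$; the hidden delta $\delta_j^h$ is propagated back to $u_{jk}$.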
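The backpropagation formulas for the output and hidden layers can be checked numerically. The sketch below is an illustration under stated assumptions: it takes $g = \tanh$ as the hidden activation, omits bias terms as the slides do, uses 0-based labels, and picks small arbitrary weight values. It computes the analytic gradients of $L = \log P(g \mid x)$ and provides a finite-difference helper for comparison.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def forward(U, W, x):
    z = matvec(U, x)                       # hidden pre-activations z = U x
    h = [math.tanh(zj) for zj in z]        # h_j = g(z_j); g = tanh is an assumption
    y = matvec(W, h)                       # output scores y = W h
    m = max(y)
    log_z = m + math.log(sum(math.exp(yi - m) for yi in y))
    p = [math.exp(yi - log_z) for yi in y]  # softmax P(i | x)
    return h, p

def log_likelihood(U, W, x, g):
    return math.log(forward(U, W, x)[1][g])

def backprop(U, W, x, g):
    h, p = forward(U, W, x)
    delta_o = [(1.0 if i == g else 0.0) - pi for i, pi in enumerate(p)]
    dW = outer(delta_o, h)                                  # delta_o h^T
    wt_delta = [sum(W[i][j] * delta_o[i] for i in range(len(W)))
                for j in range(len(h))]                     # W^T delta_o
    delta_h = [(1.0 - hj * hj) * b for hj, b in zip(h, wt_delta)]  # tanh'(z) = 1 - h^2
    dU = outer(delta_h, x)                                  # delta_h x^T
    return dW, dU

# Hypothetical weights: 3 inputs, 2 hidden units, 3 labels.
U = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
W = [[0.2, -0.3], [0.0, 0.5], [-0.4, 0.1]]
x, g = [1.0, 0.5, -1.0], 1
dW, dU = backprop(U, W, x, g)

def numeric_grad(matrix, r, c, eps=1e-6):
    """Central finite difference of log P(g | x) w.r.t. matrix[r][c] (matrix is U or W)."""
    old = matrix[r][c]
    matrix[r][c] = old + eps
    up = log_likelihood(U, W, x, g)
    matrix[r][c] = old - eps
    down = log_likelihood(U, W, x, g)
    matrix[r][c] = old
    return (up - down) / (2 * eps)
```

Comparing `numeric_grad(U, j, k)` with `dU[j][k]`, and likewise for `W`, is a common way to validate a hand-derived backward pass before training.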
![Page 13: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/13.jpg)
Deep learning for NLP
Unsupervised: Raw corpus → NN for word embedding → Learning word embedding matrix → Initialize lookup table
Supervised: Annotated corpus → Application-specific neural network; the lookup table is further fine-tuned
![Page 14: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/14.jpg)
Word embedding: Distributed representation
Distributed representation
n-dimensional latent vector for a word
Semantically similar words are closely located in vector space
![Page 15: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/15.jpg)
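Word embedding matrix: Lookup Table
Word → one-hot vector $e$ ($|V|$-dimensional vector, with a single 1 at the word's index)
Lookup table $L \in \mathbb{R}^{d \times |V|}$: one $d$-dimensional column per vocabulary word (the, cat, mat, …)
Word vector $x$ is obtained from the one-hot vector $e$ by referring to the lookup table
$$x = L e$$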
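A short sketch of why $x = Le$ is called a lookup: multiplying by a one-hot vector simply selects one column of $L$. The three-word vocabulary and the matrix entries below are hypothetical.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

vocab = ["the", "cat", "mat"]
# Lookup table with d = 2 rows and |V| = 3 columns (made-up values).
L = [[0.1, 0.4, 0.7],
     [0.2, 0.5, 0.8]]

e = one_hot(vocab.index("cat"), len(vocab))
x = matvec(L, e)                                   # x = L e
column = [row[vocab.index("cat")] for row in L]    # direct column lookup
```

In practice implementations index the column directly, as in `column`, instead of forming the $|V|$-dimensional one-hot vector.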
![Page 16: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/16.jpg)
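Word embedding matrix: context input layer
Word sequence $w_1 \cdots w_n$ → input layer
n context words: $w_{t-n+1}, \dots, w_{t-2}, w_{t-1}$
Lookup table: each context word $w$ is mapped to a d-dim vector $L e_w$, giving $L e_{w_{t-n+1}}, \dots, L e_{w_{t-2}}, L e_{w_{t-1}}$
$e_w$: one-hot vector of $w$
Input layer: the concatenation of the d-dim vectors, an $n \cdot d$-dim input vector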
![Page 17: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/17.jpg)
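Neural Probabilistic Language Model (Bengio ’03)
Language models
The probability of a sequence of words
N-gram model
$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
Neural probabilistic language model
Estimate $P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$ by neural networks for classification
$\mathbf{x}$: Concatenated input features (input layer)
$$\mathbf{y} = \mathbf{U} \tanh(\mathbf{d} + \mathbf{H}\mathbf{x})$$
With direct connections from the input layer to the output layer:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b} + \mathbf{U} \tanh(\mathbf{d} + \mathbf{H}\mathbf{x})$$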
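The forward pass of the model with direct connections, $y = Wx + b + U\tanh(d + Hx)$ followed by a softmax, can be sketched as below. All sizes and the random initialization are hypothetical, and the variable `dim` stands for the embedding dimension to avoid clashing with the bias vector `d`.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def nplm_probs(context_ids, L, H, d, U, W, b):
    # x: concatenation of the lookup-table columns of the context words
    x = [row[w] for w in context_ids for row in L]
    hidden = [math.tanh(di + hi) for di, hi in zip(d, matvec(H, x))]
    y = [wx + bi + uh
         for wx, bi, uh in zip(matvec(W, x), b, matvec(U, hidden))]
    m = max(y)
    exps = [math.exp(yi - m) for yi in y]
    total = sum(exps)
    return [e / total for e in exps]    # one probability per vocabulary word

rng = random.Random(0)
V, dim, ctx, hid = 5, 3, 2, 4           # hypothetical sizes
L = rand_matrix(dim, V, rng)            # lookup table, dim x |V|
H = rand_matrix(hid, ctx * dim, rng)    # input -> hidden
U = rand_matrix(V, hid, rng)            # hidden -> output
W = rand_matrix(V, ctx * dim, rng)      # input -> output (direct connections)
d = [0.0] * hid
b = [0.0] * V
probs = nplm_probs([1, 3], L, H, d, U, W, b)
```

Note that the last step touches every one of the $|V|$ output scores, which is the cost discussed on the following slides.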
![Page 18: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/18.jpg)
Neural probabilistic language model
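Figure: the context words $w_{t-n+1}, \dots, w_{t-2}, w_{t-1}$ pass through the lookup table to the input layer, then through $\boldsymbol{H}$ to the hidden layer and through $\boldsymbol{U}$ to the output layer ($y_1, y_2, \dots, y_{|V|}$); $\boldsymbol{W}$ connects the input layer to the output layer. The softmax layer predicts $w_t$:
$$P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\exp(y_{w_t})}{\sum_{k=1}^{|V|} \exp(y_k)}$$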
![Page 19: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/19.jpg)
Neural Probabilistic Language Model
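Figure: NPLM architecture
– Lookup table: maps the context words $w_{t-n+1}, \dots, w_{t-2}, w_{t-1}$ to the input layer
– Input layer: directly connected to the output layer
– Hidden layer: tanh
– Output layer: each node indicates a specific word
– Softmax: normalization for probabilistic value
$$P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\exp(y_{w_t})}{\sum_{k=1}^{|V|} \exp(y_k)}$$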
![Page 20: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/20.jpg)
NPLM: Discussion
Limitation: Computational complexity
Softmax layer requires computing scores over all vocabulary words
Vocabulary size is very large
![Page 21: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/21.jpg)
Ranking Approach for Word Embedding
Idea: Sampling a negative example (Collobert & Weston ’08)
$\boldsymbol{s}$: a given sequence of words (in training data)
$\boldsymbol{s}'$: a negative example in which the last word is replaced with another word
$f(\boldsymbol{s})$: score of the sequence $\boldsymbol{s}$
Goal: make the score difference $f(\boldsymbol{s}) - f(\boldsymbol{s}')$ large
Various loss functions are possible
Hinge loss: $\max(0, 1 - f(\boldsymbol{s}) + f(\boldsymbol{s}'))$
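The ranking idea can be sketched in Python as below. The scores passed to the loss are placeholder numbers standing in for $f(s)$ and $f(s')$, and the uniform sampling of the replacement word is an assumption made for illustration.

```python
import random

def hinge_ranking_loss(score_pos, score_neg, margin=1.0):
    """max(0, margin - f(s) + f(s')): zero once f(s) beats f(s') by the margin."""
    return max(0.0, margin - score_pos + score_neg)

def corrupt_last_word(sequence, vocab, rng):
    """Build s' by replacing the last word of s with a different sampled word."""
    candidates = [w for w in vocab if w != sequence[-1]]
    return sequence[:-1] + [rng.choice(candidates)]

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
s = ["the", "cat", "sat"]
s_neg = corrupt_last_word(s, vocab, rng)

satisfied = hinge_ranking_loss(2.0, 0.5)   # margin met, no loss
violated = hinge_ranking_loss(0.2, 0.5)    # negative example scores too high
```

Because only two scores are needed per update, this objective avoids the softmax over the whole vocabulary.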
![Page 22: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/22.jpg)
Ranking Approach for Word Embedding (Collobert and Weston ‘08)
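Figure: the positive example $(w_{t-n+1}, \dots, w_{t-2}, w_{t-1}, w_t)$ and the negative example $(w_{t-n+1}, \dots, w_{t-2}, w_{t-1}, w_t')$, whose last word is sampled, pass through the same lookup table, input layer ($\mathbf{x}$ and $\mathbf{x}'$), and hidden layer ($\boldsymbol{H}$), and are scored with the same output weights $\mathbf{w}$:
$$f(w_{t-n+1}, \dots, w_{t-1}, w_t) = \mathbf{w}^T \tanh(\mathbf{H}\mathbf{x} + \mathbf{b}) + d$$
$$f(w_{t-n+1}, \dots, w_{t-1}, w_t') = \mathbf{w}^T \tanh(\mathbf{H}\mathbf{x}' + \mathbf{b}) + d$$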
![Page 23: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/23.jpg)
Recurrent Neural Network (RNN) based Language Model
NPLM: conditions only on the (n−1) previous words
$$P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
RNNLM: conditions on all previous words, using a recurrent neural network
$$P(w_i \mid w_1, \dots, w_{i-1})$$
![Page 24: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/24.jpg)
Recurrent Neural Network
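$$\boldsymbol{z}_t = \boldsymbol{W}\boldsymbol{h}_{t-1} + \boldsymbol{V}\boldsymbol{x}_t$$
$$\boldsymbol{h}_t = g(\boldsymbol{z}_t)$$
$$\boldsymbol{y}_t = \boldsymbol{U}\boldsymbol{h}_t$$
In short: $\boldsymbol{h}_t = f(\boldsymbol{x}_t, \boldsymbol{h}_{t-1})$
Figure: the RNN unrolled over time steps $t-1$, $t$, $t+1$; $\boldsymbol{V}$ maps the input $\boldsymbol{x}_t$ to the hidden state, $\boldsymbol{W}$ maps $\boldsymbol{h}_{t-1}$ to $\boldsymbol{h}_t$, and $\boldsymbol{U}$ maps $\boldsymbol{h}_t$ to the output $\boldsymbol{y}_t$.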
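The recurrence can be sketched as a simple loop. This illustration assumes $g = \tanh$ and uses a one-dimensional toy network with made-up weights so the values are easy to follow by hand.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, h0, U, V, W):
    """Run the RNN over the inputs xs; returns the hidden states and outputs."""
    h, hs, ys = h0, [], []
    for x in xs:
        z = [a + b for a, b in zip(matvec(W, h), matvec(V, x))]  # z_t = W h_{t-1} + V x_t
        h = [math.tanh(zj) for zj in z]                          # h_t = g(z_t), g = tanh assumed
        hs.append(h)
        ys.append(matvec(U, h))                                  # y_t = U h_t
    return hs, ys

# One-dimensional toy example (hypothetical weights).
U, V, W = [[2.0]], [[1.0]], [[0.5]]
hs, ys = rnn_forward([[1.0], [0.0]], [0.0], U, V, W)
```

The second input is zero, yet the second hidden state is non-zero: it depends on the first input only through $h_{t-1}$, which is how the network conditions on all previous words.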
![Page 25: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/25.jpg)
Recurrent Neural Network: Backpropagation Through Time
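Objective function: $L = \sum_t L(t)$
$$\frac{\partial L}{\partial \boldsymbol{U}} = \sum_t \boldsymbol{\delta}_t^o \boldsymbol{h}_t^T$$
$$\frac{\partial L}{\partial \boldsymbol{W}} = \sum_t \boldsymbol{\delta}_t^h \boldsymbol{h}_{t-1}^T$$
$$\frac{\partial L}{\partial \boldsymbol{V}} = \sum_t \boldsymbol{\delta}_t^h \boldsymbol{x}_t^T$$
$$\boldsymbol{\delta}_t^h = g'(\boldsymbol{z}_{t+1}) \circ \boldsymbol{W}^T \boldsymbol{\delta}_{t+1}^h + \boldsymbol{U}^T \boldsymbol{\delta}_t^o$$
Figure: the RNN unrolled from time $t$ to $T$, sharing $\boldsymbol{U}$, $\boldsymbol{V}$, and $\boldsymbol{W}$ across all time steps.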
![Page 26: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/26.jpg)
Recurrent Neural Network: BPTT
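Objective function: $L = \sum_t L(t)$
Compute gradient of $L(t)$ w.r.t. params, given a specific time $t$
$$\boldsymbol{\delta}_{t,t}^h = \mathbf{U}^T \boldsymbol{\delta}_t^o$$
$$\boldsymbol{\delta}_{t-1,t}^h = g'(\boldsymbol{z}_t) \circ \mathbf{W}^T \boldsymbol{\delta}_{t,t}^h$$
$$\boldsymbol{\delta}_{t-2,t}^h = g'(\boldsymbol{z}_{t-1}) \circ \mathbf{W}^T \boldsymbol{\delta}_{t-1,t}^h$$
$$\frac{\partial L(t)}{\partial \boldsymbol{U}} = \boldsymbol{\delta}_t^o \boldsymbol{h}_t^T$$
$$\frac{\partial L(t)}{\partial \boldsymbol{W}} = \boldsymbol{\delta}_{t,t}^h \boldsymbol{h}_{t-1}^T + \boldsymbol{\delta}_{t-1,t}^h \boldsymbol{h}_{t-2}^T + \cdots + \boldsymbol{\delta}_{1,t}^h \boldsymbol{h}_0^T$$
$$\frac{\partial L(t)}{\partial \boldsymbol{V}} = \boldsymbol{\delta}_{t,t}^h \boldsymbol{x}_t^T + \cdots + \boldsymbol{\delta}_{1,t}^h \boldsymbol{x}_1^T$$
Lookup table, where $C(w)$ is the embedding of word $w$:
$$\frac{\partial L(t)}{\partial C(w)} = \sum_{t' \le t \,:\, w_{t'} = w} \boldsymbol{\delta}_{t',t}^x$$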
![Page 27: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/27.jpg)
Recurrent Neural Network: BPTT
Vanishing gradient & gradient explosion problems
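$$\frac{\partial L(t)}{\partial \boldsymbol{W}} = \boldsymbol{\delta}_{t,t}^h \boldsymbol{h}_{t-1}^T + \boldsymbol{\delta}_{t-1,t}^h \boldsymbol{h}_{t-2}^T + \cdots + \boldsymbol{\delta}_{1,t}^h \boldsymbol{h}_0^T$$
$$\boldsymbol{\delta}_{t-1,t}^h = g'(\boldsymbol{z}_t) \circ \mathbf{W}^T \boldsymbol{\delta}_{t,t}^h$$
$$\boldsymbol{\delta}_{t-2,t}^h = g'(\boldsymbol{z}_{t-1}) \circ \mathbf{W}^T \boldsymbol{\delta}_{t-1,t}^h = g'(\boldsymbol{z}_{t-1}) \circ g'(\boldsymbol{z}_t) \circ \mathbf{W}^T \mathbf{W}^T \boldsymbol{\delta}_{t,t}^h$$
The repeated factors $g'(\boldsymbol{z}_k) \circ \cdots \circ g'(\boldsymbol{z}_{t-1}) \circ g'(\boldsymbol{z}_t)$ and $\mathbf{W}^T \cdots \mathbf{W}^T \mathbf{W}^T$ can easily become a very small or large number!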
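A small numerical sketch of this effect: repeatedly multiplying a delta vector by $W^T$ shrinks or grows its norm geometrically. For simplicity the sketch ignores the $g'$ factors and assumes $W$ is a scaled rotation matrix, so the norm changes by exactly the scale factor at each step; both choices are illustrative assumptions.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def backprop_norms(W, delta, steps):
    """Norms of W^T ... W^T delta after 0..steps multiplications (g' factors ignored)."""
    Wt = transpose(W)
    norms = [norm(delta)]
    for _ in range(steps):
        delta = matvec(Wt, delta)
        norms.append(norm(delta))
    return norms

def scaled_rotation(scale, angle=0.3):
    """Hypothetical recurrent matrix: a rotation scaled by `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * c, -scale * s], [scale * s, scale * c]]

small = backprop_norms(scaled_rotation(0.5), [1.0, 0.0], 20)  # vanishing
large = backprop_norms(scaled_rotation(1.5), [1.0, 0.0], 20)  # exploding
```

After 20 steps the norm is $0.5^{20}$ in the first case and $1.5^{20}$ in the second, a gap of more than nine orders of magnitude.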
![Page 28: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/28.jpg)
Recurrent Neural Network: BPTT
Solutions to the exploding and vanishing gradients
1. Instead of initializing W randomly, start from an identity matrix initialization
2. Use rectified linear units (ReLU) instead of the sigmoid function
The derivative of the ReLU is either 0 or 1
![Page 29: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/29.jpg)
Experiments: NPLM
![Page 30: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/30.jpg)
Experiments: RNN LM
![Page 31: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/31.jpg)
Long Short Term Memory (LSTM)
RNN: Very hard to actually train long-term dependencies because of exploding & vanishing gradients
LSTM: Makes it easier for RNNs to capture long-term dependencies by using gated units
Traditional LSTM (Hochreiter and Schmidhuber, ’97)
– Introduces input gate & output gate
– Limitation: The output is close to zero as long as the output gate is closed.
Modern LSTM: Uses forget gate (Gers et al ’00)
Variants of LSTM
– Add peephole connections (Gers et al ’02)
• Allow all gates to inspect the current cell state even when the output gate is closed.
![Page 32: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/32.jpg)
Long Short Term Memory (LSTM)
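$$i^{(t)} = \sigma\left(W^{(i)} x^{(t)} + U^{(i)} h^{(t-1)}\right) \quad \text{(Input gate)}$$
$$f^{(t)} = \sigma\left(W^{(f)} x^{(t)} + U^{(f)} h^{(t-1)}\right) \quad \text{(Forget gate)}$$
$$o^{(t)} = \sigma\left(W^{(o)} x^{(t)} + U^{(o)} h^{(t-1)}\right) \quad \text{(Output/Exposure gate)}$$
$$\tilde{c}^{(t)} = \tanh\left(W^{(c)} x^{(t)} + U^{(c)} h^{(t-1)}\right) \quad \text{(New memory cell)}$$
$$c^{(t)} = f^{(t)} \circ c^{(t-1)} + i^{(t)} \circ \tilde{c}^{(t)} \quad \text{(Final memory cell)}$$
$$h^{(t)} = o^{(t)} \circ \tanh\left(c^{(t)}\right)$$
In short: $\boldsymbol{h}^{(t)} = f(\boldsymbol{x}^{(t)}, \boldsymbol{h}^{(t-1)})$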
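One LSTM step following the six equations above can be sketched as below. The parameter dictionary and its key names are hypothetical, biases are omitted as on the slide, and the all-zero weights in the example are chosen only so that every gate equals 0.5 and the result is easy to verify by hand.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gate(Wm, Um, x, h_prev, squash):
    """squash(W x + U h_prev), applied element-wise."""
    return [squash(a + b) for a, b in zip(matvec(Wm, x), matvec(Um, h_prev))]

def lstm_step(p, x, h_prev, c_prev):
    i = gate(p["Wi"], p["Ui"], x, h_prev, sigmoid)        # input gate
    f = gate(p["Wf"], p["Uf"], x, h_prev, sigmoid)        # forget gate
    o = gate(p["Wo"], p["Uo"], x, h_prev, sigmoid)        # output/exposure gate
    c_new = gate(p["Wc"], p["Uc"], x, h_prev, math.tanh)  # new memory cell
    c = [fj * cp + ij * cn                                # final memory cell
         for fj, cp, ij, cn in zip(f, c_prev, i, c_new)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]      # hidden state
    return h, c

# One-dimensional example with all-zero weights (hypothetical).
zeros = [[0.0]]
params = {name: zeros for name in
          ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc")}
h, c = lstm_step(params, [1.0], [0.3], [2.0])
```

With every gate at 0.5 and a zero new memory, the cell keeps half of its previous value, and the hidden state exposes half of $\tanh$ of that cell.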
![Page 33: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/33.jpg)
Long Short Term Memory (LSTM)
![Page 34: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/34.jpg)
LSTM: Memory cell
![Page 35: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/35.jpg)
Long Short Term Memory (LSTM)
Input gate: Decides whether the input is worth preserving, and is therefore used to gate the new memory
Forget gate: Decides whether the past memory cell is useful for computing the current memory cell
Final memory generation: Takes the advice of the forget and input gates to produce the final memory
Output/Exposure gate: Decides which parts of the memory need to be exposed in the hidden state
The purpose is to separate the final memory from the hidden state. The final memory contains a lot of information that does not necessarily need to be saved in the hidden state
![Page 36: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/36.jpg)
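Gated Recurrent Units (Cho et al ’14)
Alternative architecture to handle long-term dependencies
$$z^{(t)} = \sigma\left(W^{(z)} x^{(t)} + U^{(z)} h^{(t-1)}\right) \quad \text{(Update gate)}$$
$$r^{(t)} = \sigma\left(W^{(r)} x^{(t)} + U^{(r)} h^{(t-1)}\right) \quad \text{(Reset gate)}$$
$$\tilde{h}^{(t)} = \tanh\left(r^{(t)} \circ U h^{(t-1)} + W x^{(t)}\right) \quad \text{(New memory)}$$
$$h^{(t)} = \left(1 - z^{(t)}\right) \circ \tilde{h}^{(t)} + z^{(t)} \circ h^{(t-1)} \quad \text{(Hidden state)}$$
In short: $\boldsymbol{h}^{(t)} = f(\boldsymbol{x}^{(t)}, \boldsymbol{h}^{(t-1)})$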
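One GRU step following the four equations above can be sketched as below. The parameter names are hypothetical, biases are omitted as on the slide, and the example sets the gate weights to zero so that both gates equal 0.5.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(p, x, h_prev):
    z = [sigmoid(a + b)                                   # update gate
         for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b)                                   # reset gate
         for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    uh = matvec(p["U"], h_prev)
    wx = matvec(p["W"], x)
    h_new = [math.tanh(rj * uj + wj)                      # new memory
             for rj, uj, wj in zip(r, uh, wx)]
    return [(1.0 - zj) * hn + zj * hp                     # hidden state
            for zj, hn, hp in zip(z, h_new, h_prev)]

# One-dimensional example (hypothetical weights).
params = {"Wz": [[0.0]], "Uz": [[0.0]], "Wr": [[0.0]], "Ur": [[0.0]],
          "U": [[1.0]], "W": [[1.0]]}
h = gru_step(params, [0.5], [0.2])
```

Unlike the LSTM, the GRU has no separate memory cell: the update gate interpolates directly between the previous hidden state and the new memory.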
![Page 37: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/37.jpg)
Gated Recurrent Units
![Page 38: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/38.jpg)
Gated Recurrent Units
New memory generation: A new memory $\tilde{h}^{(t)}$ is the consolidation of a new input word $x^{(t)}$ with the past hidden state $h^{(t-1)}$
Reset gate: Determines how important $h^{(t-1)}$ is to the summarization $\tilde{h}^{(t)}$. It can completely diminish the past hidden state if it is irrelevant to the computation of the new memory
Update gate: Determines how much of $h^{(t-1)}$ should be carried forward to the next state
Hidden state: The hidden state $h^{(t)}$ is generated from the past hidden state $h^{(t-1)}$ and the new memory $\tilde{h}^{(t)}$ with the advice of the update gate
![Page 39: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/39.jpg)
LSTM vs. GRU
![Page 40: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/40.jpg)
Neural Machine Translation: RNN Encoder-Decoder (Cho et al ’14)
Computing the log of the translation probability $\log P(y \mid x)$ by two RNNs
![Page 41: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/41.jpg)
Neural Machine Translation
RNN Encoder Decoder
RNN Search
RNN Encoder Decoder with Attention
RNN Search Large Vocabulary
Sequence-to-Sequence Model
LSTM
Rare word model
![Page 42: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/42.jpg)
Neural Machine Translation: RNN Encoder-Decoder (Cho et al ’14)
RNN for the encoder: input $\boldsymbol{x}$ → hidden state $\boldsymbol{c}$
$$\boldsymbol{h}_t = f(\boldsymbol{h}_{t-1}, x_t)$$
– Encodes a variable-length sequence into a fixed-length vector representation
– Reads each symbol of an input sequence $\boldsymbol{x}$ sequentially
– After reading the end of the sequence, we obtain a summary $\boldsymbol{c}$ of the whole input sequence: $\boldsymbol{c} = \boldsymbol{h}_T$
![Page 43: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/43.jpg)
RNN Encoder-Decoder (Cho et al ‘14)
RNN for the decoder: hidden state $\boldsymbol{c}$ → output $\boldsymbol{y}$
$$\boldsymbol{h}_t = f(\boldsymbol{h}_{t-1}, y_{t-1}, \boldsymbol{c})$$
$$P(y_t \mid y_{t-1}, \dots, y_1, \boldsymbol{c}) = g(\boldsymbol{h}_t, y_{t-1}, \boldsymbol{c})$$
– Decodes a given fixed-length vector representation back into a variable-length sequence
– Generates the output sequence by predicting the next symbol $y_t$ given the previous hidden state
– Both $y_t$ and $\boldsymbol{h}_t$ are conditioned on $y_{t-1}$
How to define f?
Gated recurrent unit (GRU) is used to capture long-term dependencies
![Page 44: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/44.jpg)
Using RNN Encoder-Decoder for Statistical Machine Translation
Statistical machine translation
Generative model: $P(\boldsymbol{f} \mid \boldsymbol{e}) \propto p(\boldsymbol{e} \mid \boldsymbol{f}) P(\boldsymbol{f})$
– $p(\boldsymbol{e} \mid \boldsymbol{f})$: Translation model, $P(\boldsymbol{f})$: Language model
Log-linear model: $\log P(\boldsymbol{f} \mid \boldsymbol{e}) = \sum_n w_n f_n(\boldsymbol{f}, \boldsymbol{e}) + \log Z(\boldsymbol{e})$
– In phrase-based SMT
• $\log P(\boldsymbol{e} \mid \boldsymbol{f})$ is factorized into the translation probabilities of matching phrases in the source and target sentences
Scoring phrase pairs with RNN Encoder-Decoder
Train the RNN Encoder-Decoder on a table of phrase pairs, then use its scores as additional features
– Ignore the normalized frequencies of each phrase pair
• The existing translation probabilities in the table already reflect the frequencies of the phrase pairs
![Page 45: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/45.jpg)
RNN Encoder-Decoder: Experiments
Training
– # hidden units: 1,000, # dim for word embedding: 100
– Adadelta & stochastic gradient descent
– Use Controlled vocabulary setting
• The most frequent 15,000 words for both source & target languages
– All the out-of-vocabulary words are mapped to UNK
English-to-French translation
![Page 46: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/46.jpg)
Neural Machine Translation: Sequence-to-sequence (Sutskever et al ’14)
• <EOS>: the end-of-sentence token
1) An LSTM reads the input sentence $\boldsymbol{x}$ and produces the last hidden state $\boldsymbol{v}$
2) A standard LSTM-LM computes the probability of $y_1, \dots, y_{T'}$ given the initial hidden state $\boldsymbol{v}$
– $P(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_t P(y_t \mid v, y_1, \dots, y_{t-1})$
– $P(y_t \mid v, y_1, \dots, y_{t-1})$: Taking softmax over all words
![Page 47: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/47.jpg)
Can the LSTM reconstruct the input sentence?
Can this scheme learn the identity function?
Answer: it can, and it can do it very easily. It just does it effortlessly and perfectly.
Slide from http://www.cs.tau.ac.il/~wolf/deeplearningmeeting/pdfs/Sup+LSTM.pdf
![Page 48: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/48.jpg)
Sequence-to-Sequence Model (Sutskever et al ’14)
Two different LSTMs
One for the input sentence
Another for the output sentence
Deep LSTM (with 4 layers)
1,000 cells at each layer, 1,000-dimensional word embeddings
Deep LSTM uses 8,000 numbers for the sentence representation
Reversing the source sentence
…, C’, B’, A’ → A, B, C, …
Backprop can notice the short-term dependencies first, and slowly extend them to long-range dependencies
Ensemble of deep LSTMs
LSTMs trained from different initializations
![Page 49: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/49.jpg)
Sequence-to-Sequence Model : End-to-end Translation Experiments
English-to-French translation
No SMT is used for reference model
Input/Output vocabulary size: 160,000/80,000
Training objective: maximizing ∑(𝑆,𝑇) 𝑙𝑜𝑔𝑃(𝑇|𝑆)
Decoding: Beam search
![Page 50: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/50.jpg)
Sequence-to-Sequence Model : Re-scoring Experiments
Rescoring the 1000-best list of the baseline system
![Page 51: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/51.jpg)
Sequence-to-Sequence Model : Learned representation
2-dimensional PCA projection of the LSTM hidden states
![Page 52: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/52.jpg)
Sequence-to-Sequence Model: Learned representation
2-dimensional PCA projection of the LSTM hidden states
![Page 53: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/53.jpg)
Sequence-to-Sequence Learning: Performance vs Sentence Length
![Page 54: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/54.jpg)
Sequence-to-Sequence Learning: Performance on rare words
![Page 55: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/55.jpg)
RNN Search: Jointly Learning to Align and Translate (Bahdanau et al ‘15)
RNN Search: Extended architecture of RNN Encoder-Decoder
Encodes the input sentence into a sequence of vectors using a bidirectional RNN
Context vector c
Previously, the last hidden state (Cho et al ’14)
Now, a mixture of the hidden states of the input sentence, computed when generating each target word
– During decoding, chooses a subset of these vectors adaptively
![Page 56: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/56.jpg)
RNN Search: Components
$$p(y_i \mid y_1, \dots, y_{i-1}) = g(y_{i-1}, s_i, c_i)$$
An RNN hidden state:
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
$$c_i = \sum_j \alpha_{ij} h_j$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
Alignment model:
$$e_{ij} = a(s_{i-1}, h_j)$$
![Page 57: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/57.jpg)
RNN Search: Encoder
Bidirectional RNN
![Page 58: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/58.jpg)
RNN Search: Decoder
compute alignment
compute context
generate new output
compute new decoder state
![Page 59: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/59.jpg)
RNN Search: Alignment model
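$$e_{ij} = v^T \tanh(W s_{i-1} + V h_j)$$
Directly computes a soft alignment
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
Expected annotation: the context vector is the $\alpha$-weighted average of the annotations $h_j$
Alignment model is jointly trained with all other components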
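The alignment scores, the soft alignment weights, and the resulting context vector can be sketched together as below. The decoder state, annotations, and parameter values are all made up, and the function name `attention` is a label chosen for this sketch.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention(s_prev, annotations, v, W, V):
    """Return the weights alpha_ij and the context c_i for one decoder step."""
    ws = matvec(W, s_prev)
    e = []
    for h in annotations:
        hidden = [math.tanh(a + b) for a, b in zip(ws, matvec(V, h))]
        e.append(sum(vi * hi for vi, hi in zip(v, hidden)))  # e_ij = v^T tanh(W s + V h_j)
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    total = sum(exps)
    alpha = [x / total for x in exps]                        # softmax over source positions
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alpha, annotations))
               for k in range(dim)]                          # c_i = sum_j alpha_ij h_j
    return alpha, context

# Hypothetical values: three source annotations of dimension 2.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [1.0, -1.0]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
alpha, context = attention([0.2, -0.1], annotations, v, W, V)
```

Because every operation is differentiable, gradients flow through `alpha` into `v`, `W`, and `V`, which is what allows the alignment model to be trained jointly with the rest of the network.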
![Page 60: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/60.jpg)
RNN Search: Experiment – English-to-French
Model
RNN Search, 1,000 units
Baseline
RNN Encoder-Decoder, 1,000 units
Data
English to French translation, 348 million words
30,000 words + UNK token for the networks; all words for Moses
Training
Stochastic gradient descent (SGD) with a minibatch of 80
Decoding
Beam search (Sutskever et al. '14)
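Beam search keeps only the highest-scoring partial translations at each decoding step. A minimal sketch, assuming a hypothetical `step_log_probs(prefix)` callback that returns the decoder's next-token log-probabilities as a dict; `toy_model` is an invented stand-in for the trained network, not the model from the experiment:

```python
import math

def beam_search(step_log_probs, beam_size, max_len, eos):
    """Return (log-probability, tokens) of the best hypothesis found."""
    beams = [(0.0, [])]  # unfinished hypotheses: (score, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [token]))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])

def toy_model(prefix):
    """Hypothetical next-token distribution used only to exercise the search."""
    if not prefix:
        probs = {"a": 0.6, "b": 0.4}
    elif prefix == ["a"]:
        probs = {"</s>": 0.55, "c": 0.45}
    elif prefix == ["b"]:
        probs = {"</s>": 0.9, "c": 0.1}
    else:
        probs = {"</s>": 1.0}
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_model, beam_size=1, max_len=5, eos="</s>")
wider = beam_search(toy_model, beam_size=2, max_len=5, eos="</s>")
```

In this toy setting a beam of size 1 commits to "a" and ends with probability 0.33, while a beam of size 2 recovers "b </s>" with probability 0.36, which illustrates why a wider beam can find a better translation than greedy decoding.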
![Page 61: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/61.jpg)
RNN Search: Experiment – English-to-French
![Page 62: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/62.jpg)
RNN Search: Large Target Vocabulary (Jean et al. '15)
Decoding in RNN search
$P(y_t \mid y_{<t}, x) = \frac{1}{Z} \exp\{ w_t^\top \phi(y_{t-1}, z_t, c_t) + b_t \}$
Normalization constant
$Z = \sum_{k: y_k \in V} \exp\{ w_k^\top \phi(y_{t-1}, z_t, c_t) + b_k \}$
– Computationally inefficient: the sum runs over the entire target vocabulary V
Idea: use only a small subset V' of the target vocabulary to approximate Z
Based on importance sampling (Bengio and Senecal, '08)
![Page 63: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/63.jpg)
RNN Search: Large Target Vocabulary (Jean et al ‘15)
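$\nabla \log p(y_t \mid y_{<t}, x) = \nabla \mathcal{E}(y_t) - \sum_{k: y_k \in V} P(y_k \mid y_{<t}, x) \, \nabla \mathcal{E}(y_k)$
Energy function: $\mathcal{E}(y_j) = w_j^\top \phi(y_{j-1}, z_j, c_j) + b_j$
The second term is the expected gradient of the energy, $E_P[\nabla \mathcal{E}(y)]$
Approximation by importance sampling:
$E_P[\nabla \mathcal{E}(y)] \approx \sum_{k: y_k \in V'} \frac{\omega_k}{\sum_{k': y_{k'} \in V'} \omega_{k'}} \nabla \mathcal{E}(y_k)$
$\omega_k = \exp\{ \mathcal{E}(y_k) - \log Q(y_k) \}$, where Q is the proposal distribution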
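The importance-sampling approximation can be sketched as follows. This is an illustrative toy, not the training code of Jean et al.: the per-word gradients are given as small fixed vectors, the subset V' is passed in rather than drawn from Q, and the function names are hypothetical:

```python
import math

def exact_expectation(energies, grads):
    """E_P[grad E(y)] with P(y_k) proportional to exp(E(y_k)), summed over the full vocabulary V."""
    m = max(energies)
    weights = [math.exp(e - m) for e in energies]
    Z = sum(weights)
    dim = len(grads[0])
    return [sum(w / Z * g[d] for w, g in zip(weights, grads)) for d in range(dim)]

def sampled_expectation(energies, grads, subset, log_q):
    """Approximate E_P[grad E(y)] using only the word indices in subset (V').

    Each word gets weight omega_k = exp(E(y_k) - log Q(y_k)); the weights are
    normalized over V' only, so the full-vocabulary sum Z is never computed.
    """
    log_omega = [energies[k] - log_q[k] for k in subset]
    m = max(log_omega)  # a constant shift cancels in the normalized ratio
    omega = [math.exp(lw - m) for lw in log_omega]
    total = sum(omega)
    dim = len(grads[0])
    return [
        sum(o / total * grads[k][d] for o, k in zip(omega, subset))
        for d in range(dim)
    ]
```

With V' equal to the whole vocabulary and a uniform Q, the approximation reduces to the exact expectation; the saving comes from choosing V' much smaller than V.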
![Page 64: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/64.jpg)
RNN Search LV: Experiments
Setting (Vocabulary size)
RNN Search: 30,000 for En-Fr, 50,000 for En-De
RNN Search LV: 500,000 source and target words
![Page 65: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/65.jpg)
Sequence-to-Sequence Model: Rare Word Models (Luong et al ‘15)
Extends the LSTM-based NMT model (Sutskever et al. '14)
Translation-specific approach
For each OOV word, the model emits a pointer to its corresponding word in the source sentence
The pointer information is later used in a post-processing step
Each OOV word is translated directly using a dictionary,
or copied with the identity translation if no translation is found
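The post-processing step can be sketched as below. The representation is an assumption made for illustration: an unknown target word is stored as a tuple `("UNK", source_index)`, whereas the slides only say that positional information is kept in the target sentence. The function name `replace_unknowns` and the example dictionary are hypothetical:

```python
def replace_unknowns(target_tokens, source_tokens, dictionary):
    """Replace each ("UNK", source_index) pointer with a real word.

    The aligned source word is looked up in the dictionary; if no
    translation is found, the source word is copied (identity translation).
    """
    output = []
    for token in target_tokens:
        if isinstance(token, tuple):
            _, source_index = token
            source_word = source_tokens[source_index]
            output.append(dictionary.get(source_word, source_word))
        else:
            output.append(token)
    return output

source = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
target = ["Le", ("UNK", 2), ("UNK", 1), "de", ("UNK", 4)]
dictionary = {"ecotax": "écotaxe", "portico": "portique"}
result = replace_unknowns(target, source, dictionary)
```

Names such as "Pont-de-Buis" have no dictionary entry and are copied unchanged, which is usually the desired behavior for proper nouns.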
![Page 66: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/66.jpg)
Rare Word Models (Luong et al ‘15)
Replacing rare words with UNK
Positional information is stored in target sentence
![Page 67: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/67.jpg)
Rare Word Models: Experiments
![Page 68: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/68.jpg)
Grammar as Translation (Vinyals et al ‘15)
Extended Sequence-to-Sequence Model
Two LSTMs + Attention (Bahdanau et al ‘15)
Linearizing parse trees: Depth-first traversal
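A depth-first linearization can be sketched as follows, so that a parse tree becomes a flat token sequence a sequence-to-sequence model can emit. The tree encoding (a `(label, children)` tuple for internal nodes, a POS-tag string for leaves) and the labelled closing brackets are assumptions made for this illustration:

```python
def linearize(tree):
    """Depth-first traversal that turns a parse tree into a token list."""
    if isinstance(tree, str):  # leaf: a POS tag
        return [tree]
    label, children = tree
    tokens = ["(" + label]  # opening bracket carries the label
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)  # closing bracket repeats the label
    return tokens

# Parse of a sentence such as "John has a dog ." with words replaced by POS tags.
tree = ("S", [("NP", ["NNP"]), ("VP", ["VBZ", ("NP", ["DT", "NN"])]), "."])
linear = " ".join(linearize(tree))
```

Repeating the label on the closing bracket makes the sequence easy to check for well-formedness when decoding.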
![Page 69: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/69.jpg)
Grammar as Translation: Linearizing Parse Trees
![Page 70: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/70.jpg)
Sequence-to-Sequence Model: Parsing - Experiment Results
![Page 71: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/71.jpg)
Neural Network Transition-Based Parsing (Weiss et al ‘15)
Structured learning!
![Page 72: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/72.jpg)
NN Transition-Based Parsing: Experiments
![Page 73: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/73.jpg)
Neural Network Graph-Based Parsing (Pei et al ‘15)
![Page 74: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/74.jpg)
Neural Network Graph-Based Parsing (Pei et al ‘15)
![Page 75: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/75.jpg)
NN Graph-Based Parsing: Experiments
![Page 76: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/76.jpg)
Discussion: Research Topics
Neural Machine Translation
DNNs represent words, phrases, and sentences in continuous space
How to utilize more syntactic knowledge?
Recursive recurrent NN (Liu et al ‘14)
A deep architecture may approximately represent the input structure
Neural Dependency Parsing
Neural Language Model
![Page 77: Deep learning for Natural Language Processing and Machine …nlp.jbnu.ac.kr/nash_HCLT2015_deeplearning_tutorial.pdf · 2015-10-15 · Deep learning for NLP 13 Learning word embedding](https://reader033.vdocuments.net/reader033/viewer/2022053004/5f07e8de7e708231d41f5e1e/html5/thumbnails/77.jpg)
Reference
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur, Recurrent Neural Network Based Language Model, INTERSPEECH '10
Yoshua Bengio, Jerome Louradour, Ronan Collobert, Jason Weston, Curriculum Learning, ICML '09
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, A Neural Probabilistic Language Model, NIPS '01
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014
S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997
F. A. Gers, N. N. Schraudolph, J. Schmidhuber, Learning Precise Timing with LSTM Recurrent Networks, JMLR, 2002
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014
Ilya Sutskever, Oriol Vinyals, Quoc V. Le, Sequence to Sequence Learning with Neural Networks, NIPS '14
N. Kalchbrenner, Phil Blunsom, Recurrent Continuous Translation Models, EMNLP '13
Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR '15
Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, Wojciech Zaremba, Addressing the Rare Word Problem in Neural Machine Translation, ACL '15
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio, On Using Very Large Target Vocabulary for Neural Machine Translation, ACL '15
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton, Grammar as a Foreign Language, 2015
David Weiss, Chris Alberti, Michael Collins, Slav Petrov, Structured Training for Neural Network Transition-Based Parsing, ACL '15
Wenzhe Pei, Tao Ge, Baobao Chang, An Effective Neural Network Model for Graph-based Dependency Parsing, EMNLP '15
Shujie Liu, Nan Yang, Mu Li, Ming Zhou, A Recursive Recurrent Neural Network for Statistical Machine Translation, ACL '14
Jiajun Zhang, Chengqing Zong, Deep Neural Networks in Machine Translation: An Overview, IEEE Intelligent Systems '15