convnets for nlp
TRANSCRIPT
![Page 2: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/2.jpg)
Theory
![Page 3: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/3.jpg)
Discrete Convolutions. Intuition
• How it works:
1. Slide a small filter matrix (kernel) over the input matrix.
2. At each position, compute the product of kernel and input values, and add them together.
3. The output matrix is the concatenation of the application of the filter over the input matrix.
• The “trainable” weights are the filters (kernels).
![Page 4: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/4.jpg)
Discrete Convolutions. Purpose• Used to detect local features.
• Invariant to feature position.
• Stacked convolutional layers detect hierarchical features.
![Page 5: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/5.jpg)
Discrete Convolutions. Domains• Images (a matrix of pixels - 2D convolution)
• Text (a sequence of tokens - 1D convolution)
• Video (a sequence of images - 3D convolution)
![Page 6: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/6.jpg)
(Notation in images)
https://github.com/vdumoulin/conv_arithmetic
input
kernel
output
![Page 7: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/7.jpg)
• Input size: 4 x 4
• Kernel size: 3
• Output size: 2 x 2
Discrete 2D Convolutions. Hyperparams (1/4)
https://github.com/vdumoulin/conv_arithmetic
![Page 8: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/8.jpg)
• Input size: 5 x 5• Kernel size: 3• Padding: 1• Output size: 5 x 5
Discrete 2D Convolutions. Hyperparams (2/4)
https://github.com/vdumoulin/conv_arithmetic
• Input size: 5 x 5• Kernel size: 3• Padding: 2• Output size: 7 x 7
![Page 9: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/9.jpg)
Discrete 2D Convolutions. Hyperparams (3/4)
https://github.com/vdumoulin/conv_arithmetic
• Input size: 5 x 5• Kernel size: 3• Padding: 1• Stride: 2• Output size: 3 x 3
• Input size: 7 x 7• Kernel size: 3• Padding: 0• Stride: 1• Dilation: 2• Output size: 3 x 3
![Page 10: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/10.jpg)
( Multichannel inputs )
• Input size: 5 x 10 x 3• Kernel size: 3• Padding: 1• Output size: 5 x 10 x 1
https://www.engineersgarage.com/contributions/content-based-image-retrieval-cbir-using-matlab/
![Page 11: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/11.jpg)
Discrete 2D Convolutions. Hyperparams (4/4)
https://www.jeremyjordan.me/convnet-architectures/
• Number of filters (= number of output channels)
![Page 12: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/12.jpg)
Discrete 1D ConvolutionsSame as in 2D, but with 1D input and kernel:
Applied over a continuous representation, e.g.:
(I * k)[n] =M
∑−M
I[n − m]k[m]
0.1-1.2
0.7…
-0.70.5
0.2…
1.11.7
-0.8…
0.20.1
-0.6…
the dog sat on the mat
1.0-0.3
-0.9…
0.70.2
0.2…
0.10.2
-0.1…
0.8-1.2
0.5…filter
(size=3)
word embedding
result 0.7
feature detection at 3-gram level
* * *
+ +
0.1-1.2
0.7…
![Page 13: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/13.jpg)
Example• Sequence length: 6• Embedding dimensionality: 3• Kernel size: 3• Padding: 1• Output length: 6
https://cezannec.github.io/CNN_Text_Classification/
![Page 14: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/14.jpg)
Sliding Window operations: Max pooling over time
• Sliding window over input along time dimension.
• Output is maximum value in window.
• Usually has stride of the same size of the window size.
• Used after convolution.
• Purpose: reduce complexity while capturing most important activation from previous layer.
https://www.wandb.com/classes/intro/class-8-notes
![Page 15: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/15.jpg)
Sliding Window operations: Average pooling over time
• Sliding window over input along time dimension.
• Output is average value in window.
• Usually has stride of the same size of the window size.
• Used after convolution.
• Purpose: reduce complexity while retaining info.
![Page 16: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/16.jpg)
1D Convnets Architectural Fit• 2D convnets are normally used as a group with pooling and
ReLU, over fixed-sized inputs:
• In NLP, convnet inputs are variable-length sequences
• Depending on the task, a “collapsing” operation (e.g. max) may be needed to obtain a fixed-size representation.
https://www.ayasdi.com/awards/artificial-intelligence-100/
![Page 17: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/17.jpg)
Collapsing pooling approaches
• Max pooling: computes the maximum values per channel in the input sequence.
• Average pooling: computes the average values per channel in the input sequence.
• k-max pooling: computes the subsequence of k maximum values in the input sequence. Keeps order of appearance.
![Page 18: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/18.jpg)
Conv. with Kernel size = 1• Position-wise linear transformation.
• Increases or decreases channel dimensionality (depending on number of filters / num. output channels).
• In 2D convolutions they are known as 1x1 convolutions or “network-in-network”.
![Page 19: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/19.jpg)
Depthwise Separable Conv.• Operation divided in two steps:
1. Per-channel normal convolution → output has same number of channels as input
2. Position-wise convolution (kernel width=1)
• Less parameters and less computational cost.
![Page 20: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/20.jpg)
Dynamic convolutions• Normal convolutions have fixed (trainable) kernels.
• Idea: compute kernels dynamically in the neural network.
![Page 21: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/21.jpg)
Batch processing• Most neural networks are fed mini-batches of data.
• In-batch padding is needed (apart from the CNN padding).
![Page 22: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/22.jpg)
Comparison with RNNs• Speed:
‣ Convnets are computed in parallel.
‣ RNNs must be computed sequentially.
• Dependency range / receptive field:
‣ RNNs tend to capture too much of previous vector (bad if output is taken from the last position). Attention mitigates this.
‣ Convnets only handle dependencies within the filter size. Dilation and stacked convolutions mitigate this.
![Page 23: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/23.jpg)
Exercises
![Page 24: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/24.jpg)
E1: vanilla convolution, no padding
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 25: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/25.jpg)
E2: padding = 1
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 26: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/26.jpg)
E3: channels=3, padding = 1
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 27: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/27.jpg)
E4: padding, max pooling
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 28: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/28.jpg)
E5: padding, avg. pooling
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 29: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/29.jpg)
E6: stride=2
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 30: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/30.jpg)
E7: max.pooling over time, stride=2
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 31: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/31.jpg)
E8: k-max.pooling, k=2
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 32: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/32.jpg)
E9: dilation=2
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf
![Page 33: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/33.jpg)
Application Examples
![Page 34: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/34.jpg)
Word Representation Learning (1/4)
Learning Character-level Representations for Part-of-Speech Tagging
Dos Santos and Zadrozny, 2015
![Page 35: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/35.jpg)
Word Representation Learning (2/4)
Character-Aware Neural Language Models
Kim et al., 2015
![Page 36: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/36.jpg)
Word Representation Learning (3/4)
Character-based Neural Machine Translation
Costa-jussà and Fonollosa, 2016
! " # $ % & ' ( )
*)+,)$-)./0.-1&"&-2)".
)3')!!#$45
6,(2#7().-/$8/(,2#/$.
0#(2)"5./0.!#00)")$2.()$4125
6&9./,27,2./0.)&-1.0#(2)"
:#41;&<.=&<)"
:#41;&<.=&<)":#41;&<.=&<)"5
![Page 37: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/37.jpg)
Word Representation Learning (4/4)
Deep contextualized word representations
Peters et al., 2018
LSTM
LSTM LSTM LSTM
LSTMLSTM
CNC1
T2T1 TN
…
…
softmax
project.
softmax
project.
softmax
project.
…
…
LSTM
LSTM LSTM LSTM
LSTMLSTM
…
…
charCNN charCNN
<s> </s>
charCNN charCNN
…
…
![Page 38: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/38.jpg)
Classification (1/2)
Convolutional Neural Networks for Sentence Classification
Kim, 2014
wait for the
video and do n't
rent it
n x k representation of sentence with static and
non-static channels
Convolutional layer with multiple filter widths and
feature maps
Max-over-time pooling
Fully connected layer with dropout and softmax output
![Page 39: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/39.jpg)
I like this
movievery
much!
2 feature maps for
each region size
6 univariate vectors
concatenated together to form a
single feature vector
Sentence matrix 7 × 5
3 region sizes: (2,3,4)2 filters for each region
sizetotally 6 filters
convolution
activation function
1-maxpooling
2 classes
softmax functionregularization in this layer
d=5
Classification (2/2)
A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks
for Sentence Classification Zhang and Wallace, 2015
![Page 40: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/40.jpg)
Natural Language Inference
Convolutional Interaction Network for Natural Language Inference
Gong et al., 2018
![Page 41: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/41.jpg)
Language modeling (1/2)
Convolutional Neural Network Language Models
Pham et al., 2016
![Page 42: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/42.jpg)
Language modeling (2/2)
Language modeling with gated convolutional networks
Dauphin et al., 2016
Input sentence
Text The cat sat on the mat .
w0 w1 w2 w3 w4 w5 w6
Lookup Table
E = Dwi
Convolution
A = E∗W + b
Gating
H0 = A⊗σ(B)
σ
Softmax
Y = softmax(WHL)
B = E∗V + c
Stack L - 1 Convolution+Gating Blocks
Gated Linear Unit (GLU)
Use of padding to ensure causal dependencies
![Page 43: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/43.jpg)
Machine Translation (1/4)
Recurrent Continuous Translation Models
Kalchbrenner and Blunsomet, 2013
![Page 44: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/44.jpg)
Machine Translation (3/4)
Convolutional Sequence to Sequence Learning
Gehring et al., 2017
Gated Linear Unit (GLU)
![Page 45: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/45.jpg)
input
LConv
Linear
Linear
Linear
dynamic weights
GLU
Machine Translation (3/4)
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu et al., 2019
LConv
Linear
Linear
input
GLU
Lightweight conv. Dynamic convolution
depthwise-separable convolution
![Page 46: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/46.jpg)
Machine Translation (4/4)
https://elbayadm.github.io/assets/talks/attn2d/nve_seminar_attn2d.pdf
Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
Elbayad et al., 2018
• 2D convolutions: source x target
• Causality: with masked filters in the target direction.
• Context: grown with stacked convolutions.
• Padding: throughout the network to maintain source/target resolution.
• Each layer grows its input channels by g.
![Page 47: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/47.jpg)
Summary• Convolutions are local feature detectors.
• Convolutions are invariant to position.
• Stacked convolutions detect hierarchical features.
• Convolutions are fast.
• Convolutions have been applied to multiple NLP tasks.
![Page 48: ConvNets for NLP](https://reader030.vdocuments.net/reader030/viewer/2022011903/61d6aec4a28af8738a247f62/html5/thumbnails/48.jpg)
References• (book) Neural Network Methods for Natural Language
Processing Synthesis Lectures on Human Language Technologies. Yoav Goldberg.
• (online course) Stanford’s CS224n: Natural Language Processing with Deep Learning http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf