Lecture 24: Attention
TRANSCRIPT
Harvard IACS CS109B
Pavlos Protopapas, Mark Glickman, and Chris Tanner
NLP Lectures: Part 3 of 4
Outline
How to use embeddings
seq2seq
seq2seq + Attention
Transformers (preview)
Previously, we learned about word embeddings.
word embeddings (type-based) approaches:
• count-based/DSMs (e.g., SVD, LSA)
• predictive models (e.g., word2vec, GloVe)
millions of books → word2vec
[Figure: word2vec maps every vocabulary type (aardvark, apple, before, …, zoo) to a word embedding vector]
"The food was delicious. Amazing!" → 4.8/5
[Figure: the embeddings for "the", "food", "was", "delicious", and "amazing" are summed and averaged; this average embedding is fed to a feed-forward neural net, which predicts 4.8/5]
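A minimal sketch of this averaging pipeline (the vocabulary, embedding matrix, and feed-forward weights below are untrained random stand-ins for illustration, not the lecture's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "food": 1, "was": 2, "delicious": 3, "amazing": 4}
E = 4                                    # toy embedding size
emb = rng.normal(size=(len(vocab), E))   # stand-in for pre-trained word2vec vectors

def average_embedding(tokens):
    # look up each token's type-based embedding and average them;
    # out-of-vocabulary tokens are simply dropped
    vecs = [emb[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0)

# untrained stand-in for the feed-forward regression head
W1, b1 = rng.normal(size=(8, E)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0

x = average_embedding("the food was delicious amazing".split())
score = float(w2 @ np.tanh(W1 @ x + b1) + b2)   # scalar rating prediction
```

Note how all word order is lost: "the food was delicious" and "was the food delicious" get the same average embedding.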
"Waste of money. Tasteless!" → 2.4/5
[Figure: the same pipeline averages the embeddings for "waste", "of", "money", and "tasteless", feeds the average to the feed-forward net, and predicts 2.4/5]
"Daaang. What?! Supa Lit" → 4.9/5
[Figure: the same averaging pipeline, now with informal words like "daaang", "supa", and "lit"]
Strengths and weaknesses of word embeddings (type-based)?
Strengths:
• Leverages tons of existing data
• Don't need to depend on our data to create embeddings
Issues:
• Out-of-vocabulary (OOV) words
• Not tailored to this dataset
Previously, we learned about word embeddings.
contextualized embeddings (token-based) approaches:
• predictive models (e.g., BiLSTMs, GPT-2, BERT)
[Figure: an LSTM unrolled over an input layer x1, x2, x3, with input weights W, recurrent weights V, and output weights U producing ŷ1, ŷ2, ŷ3 at the output layer]
Review #1: the LSTM reads "the food was delicious amazing" and predicts 4.8
Review #2: the LSTM reads "it was cold and tasteless" and predicts 2.5
Review #53,781: the LSTM reads "found a hair in the" and predicts 1.0
Every token in the corpus has a contextualized embedding.
This is where the "meaning" is captured: in the LSTM's hidden states.
Strengths and weaknesses of contextualized embeddings (aka token-based)?
Strengths:
• Tailored to your particular corpus
• No out-of-vocabulary (OOV) words
Weaknesses:
• May not have enough data to produce good results
• Have to train a new model for each use case
• Can't leverage a wealth of existing text data (millions of books)???
WRONG! We can leverage millions of books!
Language Modelling (let's input 1 million documents)
[Figure: an LSTM language model reads "Call me Ishmael. It" and predicts the next word at each step: "me", "Ishmael.", "It", "was"]
[Figure: the same language model reads "An oak tree belongs" and predicts "oak", "tree", "belongs", "to"]
The contextualized embeddings for 1 million docs aren't useful to us for a new task (e.g., predicting Yelp reviews), but the learned weights could be!
Learn a rich, robust W and V.
Using these "pre-trained" W and V, we can possibly increase our performance on other tasks (e.g., Yelp reviews), since they're very experienced at producing/capturing "meaning".
RECAP
• Language Modelling may help us with other tasks
• LSTMs do a great job of capturing "meaning", which can be used for almost every task
• Given a sequence of N words, we can produce 1 output
• Given a sequence of N words, we can produce N outputs
• What if we wish to have M outputs?
We want to produce a variable-length output (e.g., n → m predictions)
Thank you for visiting! → Děkujeme za návštěvu! (the same sentence in Czech)
Outline
How to use embeddings
seq2seq
seq2seq + Attention
Transformers (preview)
Sequence-to-Sequence (seq2seq)
• If our input is a sentence in Language A, and we wish to translate it to Language B, it is clearly sub-optimal to translate word by word (like our current models are suited to do).
• Instead, let a sequence of tokens be the unit that we ultimately wish to work with (a sequence of length N may emit a sequence of length M).
• seq2seq models are comprised of 2 RNNs: 1 encoder, 1 decoder.
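A minimal sketch of the two-RNN setup. All sizes and weights here are untrained random stand-ins (names roughly follow the lecture's W/V/U roles); a real model would be trained end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
H, E, VOCAB = 8, 5, 12                       # toy hidden, embedding, vocab sizes
emb = rng.normal(size=(VOCAB, E))            # shared toy embedding table
W_enc, V_enc = rng.normal(size=(H, E)), rng.normal(size=(H, H))
W_dec, V_dec = rng.normal(size=(H, E)), rng.normal(size=(H, H))
U_out = rng.normal(size=(VOCAB, H))          # hidden -> vocabulary logits

def rnn_step(W, V, h, x):
    return np.tanh(W @ x + V @ h)

def encode(src_ids):
    h = np.zeros(H)
    for t in src_ids:                        # scan the source left to right
        h = rnn_step(W_enc, V_enc, h, emb[t])
    return h                                 # final hidden state: the "meaning"

def decode(h, start_id=0, max_len=6):
    out, x = [], start_id
    for _ in range(max_len):                 # a real model stops at <s>
        h = rnn_step(W_dec, V_dec, h, emb[x])
        x = int(np.argmax(U_out @ h))        # greedy: feed prediction back in
        out.append(x)
    return out

tokens = decode(encode([3, 1, 4, 1]))        # 4 source ids in, 6 target ids out
```

The point of the sketch: the source length (4) and target length (6) are decoupled, because the decoder only sees the encoder through that single final hidden state.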
Sequence-to-Sequence (seq2seq)
[Figure, built up over several slides: an encoder RNN reads "The brown dog ran" at the input layer, producing hidden states h1e, h2e, h3e, h4e. The final hidden state of the encoder RNN is the initial state of the decoder RNN. Starting from a <s> token, the decoder RNN produces hidden states h1d…h6d and outputs ŷ1…ŷ6: "Le", "chien", "brun", "a", "couru", <s>, feeding each predicted word back in as the next input.]
Training occurs like RNNs typically do; the loss (from the decoder outputs) is calculated, and we update weights all the way back to the beginning (the encoder).
Testing generates decoder outputs one word at a time, until we generate a <s> token. Each decoder output ŷi becomes the input xi+1.
See any issues with this traditional seq2seq paradigm?
It's crazy that the entire "meaning" of the 1st sequence is expected to be packed into this one embedding (the encoder's final hidden state), and that the encoder then never interacts with the decoder again. Hands free.
Instead, what if the decoder, at each step, pays attention to a distribution over all of the encoder's hidden states?
Intuition: when we (humans) translate a sentence, we don't just consume the original sentence then regurgitate it in a new language; we continuously look back at the original while focusing on different parts.
Outline
How to use embeddings
seq2seq
seq2seq + Attention
Transformers (preview)
seq2seq + Attention
Q: How do we determine how much to pay attention to each of the encoder's hidden states?
[Figure: the encoder's hidden states h1e…h4e for "The brown dog ran", with candidate attention weights .4? .3? .1? .2?]
A: Let's base it on our decoder's current hidden state (our current representation of meaning) and all of the encoder's hidden states!
[Figure, over four slides: a separate FFNN scores each (encoder hidden state, decoder hidden state) pair: (h1e, h1d) → e1 = 1.5; (h2e, h1d) → e2 = 0.9; (h3e, h1d) → e3 = 0.2; (h4e, h1d) → e4 = −0.5]
Attention (raw scores): e1 = 1.5, e2 = 0.9, e3 = 0.2, e4 = −0.5
Attention (softmax'd): a_i = exp(e_i) / Σ_j exp(e_j)
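Plugging the raw scores from the slide into the softmax reproduces the weights shown on the next slide:

```python
import numpy as np

def attention_weights(scores):
    # softmax: subtract the max first for numerical stability
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

a = attention_weights(np.array([1.5, 0.9, 0.2, -0.5]))
# np.round(a, 2) -> [0.51, 0.28, 0.14, 0.07]
```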
Attention (softmax'd): a1 = 0.51, a2 = 0.28, a3 = 0.14, a4 = 0.07
We multiply each encoder hidden state by its attention weight a_i and sum the results to create a context vector c1d.
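The context vector is just the attention-weighted sum of the encoder's hidden states. Here random vectors stand in for h1e…h4e, and the weights are the softmax'd values from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                      # toy hidden size
enc_states = rng.normal(size=(4, H))       # stand-ins for h1e..h4e ("The brown dog ran")
a = np.array([0.51, 0.28, 0.14, 0.07])     # softmax'd attention weights

c = (a[:, None] * enc_states).sum(axis=0)  # context vector c1d
# equivalently: c = a @ enc_states
```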
[Figure, over five slides: at each decoder step t, fresh attention weights a1…a4 are computed, giving a new context vector c_td; the concatenation [h_td; c_td] produces output ŷt. Step by step: ŷ1 = "Le", ŷ2 = "chien", ŷ3 = "brun", ŷ4 = "a", ŷ5 = "couru", with each predicted word fed back in as the next input.]
REMEMBER: each attention weight a_i is based on the decoder's current hidden state, too.
![Page 71: Lecture 24: Attention - GitHub Pages](https://reader033.vdocuments.net/reader033/viewer/2022052722/628ef15b37fc18273d1b15ae/html5/thumbnails/71.jpg)
71Photo credit: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
For convenience, here’s the Attention calculation summarized on 1 slide
![Page 72: Lecture 24: Attention - GitHub Pages](https://reader033.vdocuments.net/reader033/viewer/2022052722/628ef15b37fc18273d1b15ae/html5/thumbnails/72.jpg)
72Photo credit: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
For convenience, here’s the Attention calculation summarized on 1 slide
The Attention mechanism that produces
scores doesn’t have to be a FFNN like I
illustrated. It can be any function you wish.
![Page 73: Lecture 24: Attention - GitHub Pages](https://reader033.vdocuments.net/reader033/viewer/2022052722/628ef15b37fc18273d1b15ae/html5/thumbnails/73.jpg)
Popular Attention Scoring functions (shown on the slide; commonly: dot-product, bilinear "general", and additive/concat scoring). Photo credit: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
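As a sketch of those three common scoring functions (the weight matrices below are random placeholders for parameters a real model would learn):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
h_enc = rng.normal(size=H)                   # one encoder hidden state
h_dec = rng.normal(size=H)                   # current decoder hidden state

def score_dot(h_e, h_d):
    # dot-product score: no parameters, requires matching dimensions
    return h_e @ h_d

W = rng.normal(size=(H, H))                  # placeholder learned matrix
def score_general(h_e, h_d):
    # bilinear ("general") score
    return h_d @ W @ h_e

W_a = rng.normal(size=(H, 2 * H))            # placeholder learned FFNN weights
v = rng.normal(size=H)
def score_additive(h_e, h_d):
    # additive/concat score: a small FFNN over [h_e; h_d], as in the lecture
    return v @ np.tanh(W_a @ np.concatenate([h_e, h_d]))

s_dot = score_dot(h_enc, h_dec)
s_gen = score_general(h_enc, h_dec)
s_add = score_additive(h_enc, h_dec)
```

Each returns a single raw score e_i; the softmax over these scores then produces the attention weights.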
seq2seq + Attention
Attention:
• greatly improves seq2seq results
• allows us to visualize the contribution each encoder word gave to each decoder word
(Image source: Fig 3 in Bahdanau et al., 2015)
Takeaway:
Having a separate encoder and decoder allows for n → m length predictions.
Attention is powerful; it allows us to conditionally weight our focus.
SUMMARY
• LSTMs yielded state-of-the-art results on most NLP tasks (2014-2018)
• seq2seq + Attention was an even more revolutionary idea (Google Translate used it)
• Attention allows us to place appropriate weight on the encoder's hidden states
• But LSTMs require us to iteratively scan each word and wait until we're at the end before we can do anything
Outline
How to use embeddings
seq2seq
seq2seq + Attention
Transformers (preview)
Transformer Encoder
[Figure: inputs x1…x4 ("The brown dog ran") enter a Self-attention Head, producing z1…z4; each z_i then passes through a FFNN to produce r1…r4]
The Transformer Encoder uses attention on itself (self-attention) to create very rich embeddings which can be used for any task.
BERT is a Bidirectional Transformer Encoder. You can attach a final layer that performs whatever task you're interested in (e.g., Yelp reviews). Its results are unbelievably good.
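A single self-attention head, sketched with the standard scaled dot-product formulation (random weights and toy sizes; a real Transformer learns Wq, Wk, Wv and uses multiple heads):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 4, 8                            # 4 tokens ("The brown dog ran"), toy model dim
X = rng.normal(size=(T, D))            # input embeddings x1..x4
Wq = rng.normal(size=(D, D))           # query projection
Wk = rng.normal(size=(D, D))           # key projection
Wv = rng.normal(size=(D, D))           # value projection

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(D))      # each token attends to every token
Z = A @ V                              # z1..z4: contextualized embeddings
```

Unlike the seq2seq RNN, nothing here is sequential: all four tokens attend to all four tokens in one matrix multiply.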
BERT (a Transformer variant)
BERT is trained on a lot of text data:
• BooksCorpus (800M words)
• English Wikipedia (2.5B words)
The BERT-Base model has 12 transformer blocks, 12 attention heads, and 110M parameters!
The BERT-Large model has 24 transformer blocks, 16 attention heads, and 340M parameters!
Yay for transfer learning!
BERT (a Transformer variant)
[Figure: "The brown dog ran" (x1…x4) passes through a stack of encoders (Encoder #1 … Encoder #8), producing r1…r4 and a prediction y]
Typically, one uses BERT's awesome embeddings to fine-tune toward a different NLP task (this is called Sequential Transfer Learning).
Takeaway: BERT is incredible for learning context-aware representations of words and using transfer learning for other tasks (e.g., classification). It can't generate new sentences though, since it has no decoders.
Transformer
What if we want to generate a new output sequence?
GPT-2 (Generative Pre-trained Transformer 2) to the rescue!
GPT-2 (a Transformer variant)
• GPT-2 uses only Transformer Decoders (no Encoders) to generate new sequences
• As it processes each word/token, it cleverly masks the "future" words and conditions itself on the previous words
• Can generate text from scratch or from a starting sequence
• Easy to fine-tune on your own dataset (language)
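The "masking the future" idea can be sketched directly: before the softmax, set attention scores for future positions to −∞ so they receive zero weight (uniform placeholder scores here, just to show the mask's effect):

```python
import numpy as np

T = 4
mask = np.tril(np.ones((T, T), dtype=bool))   # position t may see positions <= t

scores = np.zeros((T, T))                     # placeholder attention scores
masked = np.where(mask, scores, -np.inf)      # "future" positions get -inf

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(masked)   # row t is uniform over positions 0..t; zero on the future
```

With zero weight on the future, each token's prediction is conditioned only on the words before it, which is exactly what next-word generation requires.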
• GPT-3 is an even bigger version of GPT-2, but isn't open-source
Takeaway: GPT-2 is astounding at generating realistic-looking new text. It can be fine-tuned toward other tasks, too.
GPT-2 is:
• trained on 40GB of text data (8M webpages)!
• 1.5B parameters
GPT-3 is an even bigger version (175B parameters) of GPT-2, but isn't open-source.
Yay for transfer learning!
QUESTIONS?