Contextualized Word Embeddings
Fall 2019
COS 484: Natural Language Processing
Overview
Contextualized Word Representations
• ELMo = Embeddings from Language Models
• BERT = Bidirectional Encoder Representations from Transformers
Overview
• Transformers
Recap: word2vec
[Figure: word = “sweden” and its nearest neighbors in word2vec space]
What’s wrong with word2vec?
• One vector for each word type:

  v_cat = (−0.224, 0.130, −0.290, 0.276)

  e.g., a single v(bank) for every sense of “bank”
• Complex characteristics of word use: semantics, syntactic behavior, and connotations
• Polysemous words, e.g., bank, mouse
Contextualized word embeddings
Let’s build a vector for each word conditioned on its context!
the movie was terribly exciting !
f : (w₁, w₂, …, wₙ) ⟶ x₁, …, xₙ ∈ ℝᵈ
Contextualized word embeddings
(Peters et al., 2018): Deep contextualized word representations
ELMo
• NAACL’18: Deep contextualized word representations
• Key idea:
• Train an LSTM-based language model on some large corpus
• Use the hidden states of the LSTM for each token to compute a vector representation of each word
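A minimal PyTorch sketch of this idea (illustrative only: the real ELMo uses a character CNN input, projections, and residual connections, detailed a few slides below; `BiLM` and its sizes here are assumptions):

```python
import torch.nn as nn

class BiLM(nn.Module):
    """Toy ELMo-style biLM: two independent multi-layer LSTMs, one reading
    left-to-right and one right-to-left, each trained as a language model."""
    def __init__(self, vocab_size, dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fwd = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)
        self.bwd = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)    # softmax over the vocabulary

    def forward(self, ids):                      # ids: (batch, n)
        x = self.embed(ids)
        h_fwd, _ = self.fwd(x)                   # state at t predicts word t+1
        h_bwd, _ = self.bwd(x.flip(1))           # reversed: predicts word t-1
        h_bwd = h_bwd.flip(1)
        # the hidden states h_fwd, h_bwd are what ELMo reuses as embeddings
        return self.out(h_fwd), self.out(h_bwd)
```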
ELMo
[Figure: forward and backward LSTM language models, from the input embeddings up to a softmax over the vocabulary; the LM objective sums over the # words in the sentence]
How to use ELMo?
• Plug ELMo into any (neural) NLP model: freeze all the LM weights and change the input representation to a weighted combination of the biLM layers (could also insert ELMo into higher layers):

  ELMo_k^task = γ^task ∑_{j=0}^{L} s_j^task h_{k,j}^LM    (L = # of layers)

  where h_{k,0}^LM = x_k^LM is the context-independent token representation, and h_{k,j}^LM = [→h_{k,j}^LM ; ←h_{k,j}^LM] concatenates the layer-j forward and backward LSTM states

• γ^task: allows the task model to scale the entire ELMo vector
• s_j^task: softmax-normalized weights across layers
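The weighted combination is straightforward to compute; here is a minimal sketch (tensor shapes and names are assumptions, not the AllenNLP internals):

```python
import torch

def elmo_combine(layer_states, s_task, gamma_task):
    """ELMo_k^task = gamma^task * sum_j softmax(s)_j * h_{k,j}^LM.
    layer_states: list of L+1 tensors, each (n_tokens, 2 * dim)
    s_task:       raw scalar parameters, one per layer
    gamma_task:   scalar scale learned by the task model"""
    s = torch.softmax(s_task, dim=0)             # s_j^task
    stacked = torch.stack(layer_states)          # (L+1, n_tokens, 2*dim)
    return gamma_task * (s[:, None, None] * stacked).sum(dim=0)
```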
More details
• Forward and backward LMs: 2 layers each
• Use a character CNN to build the initial word representation
  • 2048 char n-gram filters and 2 highway layers, 512-dim projection
• Use 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
• A residual connection from the first to the second layer
• Trained 10 epochs on the 1B Word Benchmark
Experimental results
• SQuAD: question answering
• SNLI: natural language inference
• SRL: semantic role labeling
• Coref: coreference resolution
• NER: named entity recognition
• SST-5: sentiment analysis
Intrinsic Evaluation
• First layer > second layer: syntactic tasks. Second layer > first layer: semantic tasks.
• Takeaway: syntactic information is better represented at lower layers, while semantic information is captured at higher layers
Use ELMo in practice
https://allennlp.org/elmo
Also available in TensorFlow
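For example, with the AllenNLP package (the option/weight file paths below are placeholders; download the pre-trained files from the link above):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "elmo_options.json"   # placeholder paths; see allennlp.org/elmo
weight_file = "elmo_weights.hdf5"

elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)
sentences = [["the", "movie", "was", "terribly", "exciting", "!"]]
character_ids = batch_to_ids(sentences)          # character-level input ids
embeddings = elmo(character_ids)["elmo_representations"][0]
# one contextualized vector per token: shape (1, 6, 1024)
```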
BERT
• NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
• First released in Oct 2018.
How is BERT different from ELMo?
#1. Unidirectional context vs bidirectional context
#2. LSTMs vs Transformers (more on this later)
#3. The weights are not frozen; they are updated on the downstream task ("fine-tuning")
Bidirectional encoders
• Language models only use left context or right context (although ELMo used two independent LMs from each direction).
• Language understanding is bidirectional
Lecture 9: Why are LMs unidirectional?
Masked language models (MLMs)
• Solution: Mask out 15% of the input words, and then predict the masked words
• Too little masking: too expensive to train
• Too much masking: not enough context
Masked language models (MLMs)
One more complication: because [MASK] is never seen when BERT is used at fine-tuning/test time, we should not always replace a selected token with [MASK]…
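In the BERT paper the fix is: of the 15% selected tokens, replace 80% with [MASK], 10% with a random word, and leave 10% unchanged. A small sketch of that procedure (helper names are hypothetical):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """BERT-style masking: 80% -> [MASK], 10% -> random word, 10% -> unchanged."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok                      # always predict the original
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # random replacement
            # else: keep the token unchanged
    return inputs, targets
```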
Next sentence prediction (NSP)
Always sample two sentences and predict whether the second sentence actually follows the first one.
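A sketch of how such training pairs might be built (the 50/50 split and the IsNext/NotNext labels are from the BERT paper; the function itself is illustrative):

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """50% of the time take the true next sentence, 50% a random one."""
    i = random.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if random.random() < 0.5:
        second, label = doc_sentences[i + 1], "IsNext"
    else:
        second, label = random.choice(corpus_sentences), "NotNext"
    return first, second, label
```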
Recent papers show that NSP is not necessary:
• (Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
• (Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pre-training and fine-tuning
[Figure: the same model is first pre-trained (left), then fine-tuned (right)]
Key idea: all the weights are fine-tuned on downstream tasks
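Conceptually, fine-tuning puts a small task head on top of the pre-trained encoder and lets gradients flow into everything; a sketch (the `bert` encoder interface here is an assumption):

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, bert, hidden_size, num_labels):
        super().__init__()
        self.bert = bert                               # pre-trained encoder
        self.head = nn.Linear(hidden_size, num_labels) # new task head

    def forward(self, input_ids):
        hidden = self.bert(input_ids)                  # (batch, n, hidden_size)
        return self.head(hidden[:, 0])                 # classify from [CLS]

# Unlike frozen ELMo, *all* parameters are updated on the downstream task:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```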
Applications
More details
• Input representations
• Use word pieces instead of words: playing => play ##ing (see Assignment 4)
• Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)
• Released two model sizes: BERT_base, BERT_large
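Word-piece tokenization is easy to try with the HuggingFace tokenizer (the exact splits depend on the learned vocabulary, so the printed output may differ):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Rare words split into subword pieces marked with "##"
# (the slide's example: playing => play ##ing).
print(tokenizer.tokenize("the movie was unbelievably exciting!"))
```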
Experimental results
(Wang et al., 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
[Table: GLUE benchmark results; for reference, a BiLSTM baseline averages 63.9]
Use BERT in practice
TensorFlow: https://github.com/google-research/bert
PyTorch: https://github.com/huggingface/transformers
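A minimal usage sketch with a recent version of the HuggingFace library (the output object layout assumes a modern `transformers` release):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was terribly exciting!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# one contextualized vector per word piece: (1, num_word_pieces, 768)
print(outputs.last_hidden_state.shape)
```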
Contextualized word embeddings in context
• TagLM (Peters et al., 2017)
• CoVe (McCann et al., 2017)
• ULMfit (Howard and Ruder, 2018)
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• BERT (Devlin et al., 2018)
• OpenAI GPT-2 (Radford et al., 2019)
• XLNet (Yang et al., 2019)
• SpanBERT (Joshi et al., 2019)
• RoBERTa (Liu et al., 2019)
• ALBERT (Anonymous)
• …
Transformers
• NIPS’17: Attention is All You Need
• Key idea: multi-head self-attention
• No recurrence structure any more, so it trains much faster
• Originally proposed for NMT (encoder-decoder framework)
• Used as the base model of BERT (encoder only)
Useful Resources
• nn.Transformer and nn.TransformerEncoder: PyTorch's built-in Transformer modules
• The Annotated Transformer (http://nlp.seas.harvard.edu/2018/04/03/attention.html): a Jupyter notebook which explains how the Transformer works line by line in PyTorch!
RNNs vs Transformers
[Figure: "the movie was terribly exciting !" processed by a stack of Transformer layers (layer 1 → layer 3); unlike an RNN, each layer connects every position to every other position directly]
Multi-head Self Attention
• Attention: maps a query q and a set of key-value pairs (kᵢ, vᵢ) to an output
• Dot-product attention:

  A(q, K, V) = ∑ᵢ ( e^{q⋅kᵢ} / ∑ⱼ e^{q⋅kⱼ} ) vᵢ,    K, V ∈ ℝ^{n×d}, q ∈ ℝ^d

• If we have multiple queries, stack them in a matrix Q:

  A(Q, K, V) = softmax(QK⊺)V,    Q ∈ ℝ^{n_Q×d}, K, V ∈ ℝ^{n×d}
• Self-attention: let’s use each word as query and compute the attention with all the other words
= the word vectors themselves select each other
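A tiny NumPy sketch of this bare form of self-attention, before any learned projections (those appear on the next slide); here the word vectors themselves serve as queries, keys, and values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Each row of X is a word vector acting as query, key, and value:
    A(X, X, X) = softmax(X X^T) X."""
    return softmax(X @ X.T) @ X              # (n, d) -> (n, d)
```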
Multi-head Self Attention
• Scaled dot-product attention:

  A(Q, K, V) = softmax(QK⊺/√d)V

• Input: X ∈ ℝ^{n×d_in}, with learned projections W^Q, W^K, W^V ∈ ℝ^{d_in×d}:

  A(XW^Q, XW^K, XW^V) ∈ ℝ^{n×d}

• Multi-head attention: using more than one head is always useful…

  A(Q, K, V) = Concat(head₁, …, head_h)W^O,    headᵢ = A(XW^Q_i, XW^K_i, XW^V_i)

• In practice: h = 8, d = d_out/h, W^O ∈ ℝ^{d_out×d_out}
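A NumPy sketch of the full multi-head computation (the per-head weight matrices are passed in explicitly; shapes as stated in the docstring are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, WQ, WK, WV, WO):
    """X: (n, d_in); WQ/WK/WV: lists of h matrices, each (d_in, d);
    WO: (h*d, d_out)."""
    heads = []
    for WQi, WKi, WVi in zip(WQ, WK, WV):
        Q, K, V = X @ WQi, X @ WKi, X @ WVi
        d = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d))    # scaled dot-product attention
        heads.append(A @ V)                   # each head: (n, d)
    return np.concatenate(heads, axis=-1) @ WO
```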
Putting it all together
• Each Transformer block has two sub-layers:
  • Multi-head attention
  • 2-layer feedforward NN (with ReLU)
• Each sub-layer has a residual connection and a layer normalization (a code sketch follows after this slide):
LayerNorm(x + SubLayer(x))
(Ba et al., 2016): Layer Normalization
• Input layer has a positional encoding
• BERT_base: 12 layers, 12 heads, hidden size = 768, 110M parameters
• BERT_large: 24 layers, 16 heads, hidden size = 1024, 340M parameters
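A minimal PyTorch sketch of one such block (post-norm, as on this slide; `nn.MultiheadAttention`'s `batch_first` argument requires a recent PyTorch, and the default sizes below mirror BERT_base):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: multi-head attention + 2-layer FFN, each sub-layer
    wrapped as LayerNorm(x + SubLayer(x))."""
    def __init__(self, d=768, heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):                     # x: (batch, n, d)
        a, _ = self.attn(x, x, x)             # self-attention sub-layer
        x = self.ln1(x + a)                   # LayerNorm(x + SubLayer(x))
        return self.ln2(x + self.ffn(x))      # feedforward sub-layer
```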
Have fun using ELMo or BERT in your final project :)