Lecture 2: N-gram
Kai-Wei Chang
CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
Language Models
What are N-gram models?
How to use probabilities
What does P(Y|X) mean?
How can I manipulate it?
How can I estimate its value in practice?
What is a language model?
Probability distributions over sentences (i.e., word sequences):
$P(W) = P(w_1 w_2 w_3 w_4 \ldots w_k)$
Can use them to generate strings:
$P(w_k \mid w_1 w_2 w_3 \ldots w_{k-1})$
Rank possible sentences
P(“Today is Tuesday”) > P(“Tuesday Today is”)
P(“Today is Tuesday”) > P(“Today is Virginia”)
Language model applications
Context-sensitive spelling correction
Language model applications
Autocomplete
Language model applications
Smart Reply
Language model applications
Language generation
https://pdos.csail.mit.edu/archive/scigen/
Bag-of-Words with N-grams
An n-gram is a contiguous sequence of n tokens from a given piece of text
http://recognize-speech.com/language-model/n-gram-model/comparison
N-Gram Models
Unigram model: $P(w_1)\,P(w_2)\,P(w_3) \cdots P(w_n)$
Bigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$
Trigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2})$
N-gram model: $P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})$
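To make the factorization concrete, here is a minimal Python sketch of scoring a sentence under a bigram model. The probability tables are made-up illustrative values, not estimates from any corpus.

# Hypothetical bigram model: P(w1..wn) = P(w1) * prod_i P(w_i | w_{i-1})
P_uni = {"Today": 0.01}                 # P(w1); illustrative value
P_bi = {("is", "Today"): 0.2,           # P(w | previous word)
        ("Tuesday", "is"): 0.05}

def bigram_prob(words):
    # Score the sentence as P(w1) * P(w2 | w1) * ... * P(wn | w_{n-1}).
    prob = P_uni[words[0]]
    for prev, w in zip(words, words[1:]):
        prob *= P_bi[(w, prev)]
    return prob

print(bigram_prob(["Today", "is", "Tuesday"]))  # 0.01 * 0.2 * 0.05 = 1e-4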
Random language via n-gram
http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
Behind the scenes – probability theory
Sampling with replacement
[Slide shows a jar of colored balls. Quiz: compute marginal, joint, and conditional probabilities under sampling with replacement, e.g. P(red), P(blue), P(red, ·), P(red | ·), P(· | red), and the probability of drawing a particular multiset of balls (2 of one color, 3 of another, 4 of a third).]
Sampling words with replacement
Example from Julia Hockenmaier, Intro to NLP
Implementation: how to sample?
Sample from a discrete distribution $p(X)$
Assume $n$ outcomes in the event space $X$
1. Divide the interval [0,1] into $n$ intervals according to the probabilities of the outcomes
2. Generate a random number $r$ between 0 and 1
3. Return $x_i$ such that $r$ falls into the $i$-th interval
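A minimal Python sketch of this procedure (the outcomes and probabilities here are illustrative):

import random

def sample(outcomes, probs):
    # The running cumulative sum partitions [0,1] into intervals
    # whose widths are the outcome probabilities (step 1).
    r = random.random()                 # step 2: r uniform in [0,1)
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:              # step 3: r falls in x's interval
            return x
    return outcomes[-1]                 # guard against floating-point round-off

print(sample(["red", "blue", "green"], [0.5, 0.3, 0.2]))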
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP
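A minimal sketch of generating text word by word, conditioning on the previous word. The bigram table is hypothetical, made up for illustration rather than estimated from data.

import random

# Hypothetical table: context word -> {next word: P(next | context)}
bigram = {
    "<S>":     {"Today": 0.6, "Sam": 0.4},
    "Today":   {"is": 1.0},
    "Sam":     {"is": 1.0},
    "is":      {"Tuesday": 0.5, "sunny": 0.5},
    "Tuesday": {"</S>": 1.0},
    "sunny":   {"</S>": 1.0},
}

def sample_next(dist):
    # Draw one word according to the conditional distribution.
    words = list(dist)
    return random.choices(words, weights=[dist[w] for w in words])[0]

def generate():
    out, w = [], "<S>"
    while True:
        w = sample_next(bigram[w])
        if w == "</S>":
            return " ".join(out)
        out.append(w)

print(generate())  # e.g. "Today is sunny"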
Recap: Probability Theory
Conditional probability (on the colored-ball example above): P(blue | ·) = ?
$P(B \mid A) = P(B, A) / P(A)$
Bayes' rule: $P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$
Verify: P(red | ·), P(· | red), P(·), P(red)
Independence: $P(B \mid A) = P(B)$
Prove: $P(A, B) = P(A)\, P(B)$
The Chain Rule
The joint probability can be expressed in terms of the conditional probability:
P(X,Y) = P(X | Y) P(Y)
More variables:
P(X, Y, Z) = P(X | Y, Z) P(Y, Z)= P(X | Y, Z) P(Y | Z) P(Z)
$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \ldots, X_{n-1}) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$
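For example, applied to a three-word sentence:
$P(\text{Today}, \text{is}, \text{Tuesday}) = P(\text{Today})\, P(\text{is} \mid \text{Today})\, P(\text{Tuesday} \mid \text{Today}, \text{is})$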
Language model for text
Probability distribution over sentences
$p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$
Complexity: $O(V^{n^*})$, where $n^*$ is the maximum sentence length
Chain rule: from conditional probability to joint probability
A rough estimate: $O(475000^{14})$
Average English sentence length is 14.3 words
475,000 main headwords in Webster's Third New International Dictionary
How large is this? $475000^{14} \,/\, (8 \times 1024^4) \approx 3.38 \times 10^{66}$ TB
We need independence assumptions!
Probability models
Building a probability model:
defining the model (making independence assumptions)
estimating the model's parameters
using the model (making inference)
Trigram model: defined in terms of parameters like P("is" | "today"); the parameter values $\Theta$ constitute the definition of P.
Independence assumption: even though X and Y are not actually independent, we treat them as independent.
This makes the model compact (e.g., from $100k^{14}$ to $100k^2$ parameters).
Language model with N-gram
The chain rule: $P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$
An n-gram language model assumes each word depends only on the previous n−1 words (the Markov assumption).
Language model with N-gram
Example: trigram (3-gram)
$P(w_n \mid w_1, \ldots, w_{n-1}) = P(w_n \mid w_{n-2}, w_{n-1})$
$P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1})$
P("Today is a sunny day")
= P("Today") P("is" | "Today") P("a" | "is", "Today") … P("day" | "sunny", "a")
Unigram model
Bigram model
Condition on the previous word
N-gram model
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
First Citizen: Nay, then, that was hers, It
speaks against your other service: But since the
youth of the circumstance be spoken: Your uncle
and one Baptista's daughter.
SEBASTIAN: Do I stand till the break off.
BIRON:
Hide thy head.
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
~~/*
* linux/kernel/time.c
* Please report this on hardware.
*/
void irq_mark_irq(unsigned long old_entries, eval);
/*
* Divide only 1000 for ns^2 -> us^2 conversion values don't
overflow:
seq_puts(m, "\ttramp: %pS",
(void *)class->contending_point]++;
if (likely(t->flags & WQ_UNBOUND)) {
/*
* Update inode information. If the
* slowpath and sleep time (abs or rel)
* @rmtp: remaining (either due
* to consume the state of ring buffer size. */
header_size - size, in bytes, of the chain.
*/
BUG_ON(!error);
} while (cgrp) {
if (old) {
if (kdb_continue_catastrophic;
#endif
Questions?
Maximum Likelihood Estimation
"Best" means "data likelihood reaches maximum"
Unigram language model: $p(w \mid \theta) = ?$
Document: a paper (total #words = 100), with word counts
text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
Parameters to estimate: p(text | θ) = ?, p(mining | θ) = ?, p(association | θ) = ?, p(database | θ) = ?, …, p(query | θ) = ?, …
Estimates: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
$\hat{\theta} = \arg\max_{\theta} P(X \mid \theta)$
Which bag of words is more likely to have generated: "aaaDaaaKoaaaa"?
[Figure: two bags of letters, one containing mostly 'a' with a few other letters (E, K, D, o), the other a varied mix (b, K, a, D, E, P, F, n, o).]
Parameter estimation
General setting:
Given a (hypothesized & probabilistic) model that governs the random experiment
The model gives a probability of any data, $p(X \mid \theta)$, that depends on the parameter $\theta$
Now, given actual sample data $X = \{x_1, \ldots, x_n\}$, what can we say about the value of $\theta$?
Intuitively, take our best guess of $\theta$; "best" means "best explaining/fitting the data"
Generally an optimization problem
Maximum likelihood estimation
Data: a collection of words, $w_1, w_2, \ldots, w_n$
Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$
Maximum likelihood estimator: $\hat{\theta} = \arg\max_{\theta \in \Theta} p(W \mid \theta)$

$p(W \mid \theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$

$\Rightarrow \log p(W \mid \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$

$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$
Maximum likelihood estimation
Maximize $\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$ subject to the requirement from probability that $\sum_{i=1}^{N} \theta_i = 1$.

Lagrange multiplier:
$L(W, \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$

Set partial derivatives to zero:
$\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$

Since $\sum_{i=1}^{N} \theta_i = 1$, we have $\lambda = -\sum_{i=1}^{N} c(w_i)$, which gives the ML estimate
$\theta_i = \frac{c(w_i)}{\sum_{i=1}^{N} c(w_i)}$
Maximum likelihood estimation
For N-gram language models:
$p(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) = \dfrac{c(w_i, w_{i-1}, \ldots, w_{i-n+1})}{c(w_{i-1}, \ldots, w_{i-n+1})}$
Here $c(\emptyset) = N$, the length of the document or the total number of words in the corpus, so the unigram case reduces to $c(w_i)/N$.
A bi-gram example
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>
P(I | <S>) = ?   P(am | I) = ?
P(Sam | am) = ?   P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
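A minimal sketch that answers these by counting over the three training sentences, treating <S> and </S> as ordinary tokens:

from collections import Counter

sentences = ["<S> I am Sam </S>",
             "<S> I am legend </S>",
             "<S> Sam I am </S>"]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p(word, prev):
    # MLE: P(word | prev) = c(prev, word) / c(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<S>"))     # 2/3
print(p("am", "I"))      # 3/3 = 1.0
print(p("Sam", "am"))    # 1/3
print(p("</S>", "Sam"))  # 1/2

# P(<S> I am Sam </S> | bigram model)
#   = (2/3) * 1.0 * (1/3) * (1/2) = 1/9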
Practical Issues
We do everything in log space
Avoids underflow
Adding is faster than multiplying
$\log(p_1 \times p_2) = \log p_1 + \log p_2$
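A toy illustration in Python:

import math

word_probs = [0.1, 0.002, 0.0005]                # per-word probabilities
log_prob = sum(math.log(p) for p in word_probs)  # additions, not multiplications
print(log_prob)                                  # about -16.12
# The raw product (1e-7) is math.exp(log_prob) when it does not underflow;
# for long sentences we simply stay in log space.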
Toolkits
KenLM: https://kheafield.com/code/kenlm/
SRILM: http://www.speech.sri.com/projects/srilm
More resources
Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
More resources
Google n-gram viewer
https://books.google.com/ngrams/
Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Sample entries (ngram, year, match count, volume count):
circumvallate 1978 335 91
circumvallate 1979 261 91
How about unseen words/phrases
Example: the Shakespeare corpus consists of N=884,647 word tokens and a vocabulary of V=29,066 word types
Only about 30,000 word types occurred; words not in the training data get 0 probability
Only 0.04% of all possible bigrams occurred
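Working out that last figure: $V^2 = 29{,}066^2 \approx 8.45 \times 10^8$ possible bigrams, of which $0.04\%$ is only about $3 \times 10^5$ distinct bigrams actually observed.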
Next Lecture
Dealing with unseen n-grams
Key idea: reserve some probability mass for events that don't occur in the training data
How much probability mass should we reserve?
Recap
N-gram language models
How to generate text from a language model
How to estimate a language model
Reading: Speech and Language Processing, Chapter 4: N-Grams