Lecture 2: N-gram
Kai-Wei Chang
CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
Language Models
What are N-gram models?
How to use probabilities
What does P(Y|X) mean?
How can I manipulate it?
How can I estimate its value in practice?
What is a language model?
Probability distributions over sentences (i.e., word sequences):
$P(W) = P(w_1 w_2 w_3 w_4 \ldots w_k)$
Can use them to generate strings:
$P(w_k \mid w_1 w_2 w_3 \ldots w_{k-1})$
Rank possible sentences
P(“Today is Tuesday”) > P(“Tuesday Today is”)
P(“Today is Tuesday”) > P(“Today is Virginia”)
Language model applications
Context-sensitive spelling correction
Language model applications
Autocomplete
Language model applications
Smart Reply
Language model applications
Language generation
https://pdos.csail.mit.edu/archive/scigen/
Bag-of-Words with N-grams
An n-gram is a contiguous sequence of n tokens from a given piece of text
http://recognize-speech.com/language-model/n-gram-model/comparison
N-Gram Models
Unigram model: $P(w_1)\,P(w_2)\,P(w_3) \cdots P(w_n)$
Bigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$
Trigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2})$
N-gram model: $P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})$
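To make the factorization concrete, here is a minimal Python sketch of scoring a sentence under a bigram model. The probability tables are made-up illustrative values, not estimates from any corpus.

# Hypothetical bigram model: P(w1..wn) = P(w1) * prod_i P(w_i | w_{i-1})
P_uni = {"Today": 0.01}                 # P(w1); illustrative value
P_bi = {("is", "Today"): 0.2,           # P(w | previous word)
        ("Tuesday", "is"): 0.05}

def bigram_prob(words):
    # Score the sentence as P(w1) * P(w2 | w1) * ... * P(wn | w_{n-1}).
    prob = P_uni[words[0]]
    for prev, w in zip(words, words[1:]):
        prob *= P_bi[(w, prev)]
    return prob

print(bigram_prob(["Today", "is", "Tuesday"]))  # 0.01 * 0.2 * 0.05 = 1e-4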
Random language via n-gram
http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
Behind the scenes – probability theory
Sampling with replacement
[Slide shows a jar of colored balls. Quiz: compute marginal, joint, and conditional probabilities under sampling with replacement, e.g. P(red), P(blue), P(red, ·), P(red | ·), P(· | red), and the probability of drawing a particular multiset of balls (2 of one color, 3 of another, 4 of a third).]
Sampling words with replacement
Example from Julia Hockenmaier, Intro to NLP
Implementation: how to sample?
Sample from a discrete distribution $p(X)$
Assume $n$ outcomes in the event space $X$
1. Divide the interval [0,1] into $n$ intervals according to the probabilities of the outcomes
2. Generate a random number $r$ between 0 and 1
3. Return $x_i$ such that $r$ falls into the $i$-th interval
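A minimal Python sketch of this procedure (the outcomes and probabilities here are illustrative):

import random

def sample(outcomes, probs):
    # The running cumulative sum partitions [0,1] into intervals
    # whose widths are the outcome probabilities (step 1).
    r = random.random()                 # step 2: r uniform in [0,1)
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:              # step 3: r falls in x's interval
            return x
    return outcomes[-1]                 # guard against floating-point round-off

print(sample(["red", "blue", "green"], [0.5, 0.3, 0.2]))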
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP
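A minimal sketch of generating text word by word, conditioning on the previous word. The bigram table is hypothetical, made up for illustration rather than estimated from data.

import random

# Hypothetical table: context word -> {next word: P(next | context)}
bigram = {
    "<S>":     {"Today": 0.6, "Sam": 0.4},
    "Today":   {"is": 1.0},
    "Sam":     {"is": 1.0},
    "is":      {"Tuesday": 0.5, "sunny": 0.5},
    "Tuesday": {"</S>": 1.0},
    "sunny":   {"</S>": 1.0},
}

def sample_next(dist):
    # Draw one word according to the conditional distribution.
    words = list(dist)
    return random.choices(words, weights=[dist[w] for w in words])[0]

def generate():
    out, w = [], "<S>"
    while True:
        w = sample_next(bigram[w])
        if w == "</S>":
            return " ".join(out)
        out.append(w)

print(generate())  # e.g. "Today is sunny"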
Recap: Probability Theory
Conditional probability (on the colored-ball example above): P(blue | ·) = ?
$P(B \mid A) = P(B, A) / P(A)$
Bayes' rule: $P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$
Verify: P(red | ·), P(· | red), P(·), P(red)
Independence: $P(B \mid A) = P(B)$
Prove: $P(A, B) = P(A)\, P(B)$
The Chain Rule
The joint probability can be expressed in terms of the conditional probability:
P(X,Y) = P(X | Y) P(Y)
More variables:
P(X, Y, Z) = P(X | Y, Z) P(Y, Z)= P(X | Y, Z) P(Y | Z) P(Z)
$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \ldots, X_{n-1}) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$
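For example, applied to a three-word sentence:
$P(\text{Today}, \text{is}, \text{Tuesday}) = P(\text{Today})\, P(\text{is} \mid \text{Today})\, P(\text{Tuesday} \mid \text{Today}, \text{is})$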
Language model for text
Probability distribution over sentences
$p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$
Complexity: $O(V^{n^*})$, where $n^*$ is the maximum sentence length
Chain rule: from conditional probability to joint probability
A rough estimate: $O(475000^{14})$
Average English sentence length is 14.3 words
475,000 main headwords in Webster's Third New International Dictionary
How large is this? $475000^{14} \,/\, (8 \times 1024^4) \approx 3.38 \times 10^{66}$ TB
We need independence assumptions!
Probability models
Building a probability model:
defining the model (making independence assumptions)
estimating the model's parameters
using the model (making inference)
Trigram model: defined in terms of parameters like P("is" | "today"); the parameter values $\Theta$ constitute the definition of P.
Independence assumption: even though X and Y are not actually independent, we treat them as independent.
This makes the model compact (e.g., from $100k^{14}$ to $100k^2$ parameters).
Language model with N-gram
The chain rule: $P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$
An n-gram language model assumes each word depends only on the previous n−1 words (the Markov assumption).
Language model with N-gram
Example: trigram (3-gram)
$P(w_n \mid w_1, \ldots, w_{n-1}) = P(w_n \mid w_{n-2}, w_{n-1})$
$P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1})$
P("Today is a sunny day")
= P("Today") P("is" | "Today") P("a" | "is", "Today") … P("day" | "sunny", "a")
Unigram model
Bigram model
Condition on the previous word
N-gram model
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
First Citizen: Nay, then, that was hers, It
speaks against your other service: But since the
youth of the circumstance be spoken: Your uncle
and one Baptista's daughter.
SEBASTIAN: Do I stand till the break off.
BIRON:
Hide thy head.
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
~~/*
* linux/kernel/time.c
* Please report this on hardware.
*/
void irq_mark_irq(unsigned long old_entries, eval);
/*
* Divide only 1000 for ns^2 -> us^2 conversion values don't
overflow:
seq_puts(m, "\ttramp: %pS",
(void *)class->contending_point]++;
if (likely(t->flags & WQ_UNBOUND)) {
/*
* Update inode information. If the
* slowpath and sleep time (abs or rel)
* @rmtp: remaining (either due
* to consume the state of ring buffer size. */
header_size - size, in bytes, of the chain.
*/
BUG_ON(!error);
} while (cgrp) {
if (old) {
if (kdb_continue_catastrophic;
#endif
Questions?
Maximum Likelihood Estimation
"Best" means "data likelihood reaches maximum"
Unigram language model: $p(w \mid \theta) = ?$
Document: a paper (total #words = 100), with word counts
text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
Parameters to estimate: p(text | θ) = ?, p(mining | θ) = ?, p(association | θ) = ?, p(database | θ) = ?, …, p(query | θ) = ?, …
Estimates: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
$\hat{\theta} = \arg\max_{\theta} P(X \mid \theta)$
Which bag of words is more likely to have generated: "aaaDaaaKoaaaa"?
[Figure: two bags of letters, one containing mostly 'a' with a few other letters (E, K, D, o), the other a varied mix (b, K, a, D, E, P, F, n, o).]
Parameter estimation
General setting:
Given a (hypothesized & probabilistic) model that governs the random experiment
The model gives a probability of any data, $p(X \mid \theta)$, that depends on the parameter $\theta$
Now, given actual sample data $X = \{x_1, \ldots, x_n\}$, what can we say about the value of $\theta$?
Intuitively, take our best guess of $\theta$; "best" means "best explaining/fitting the data"
Generally an optimization problem
Maximum likelihood estimation
Data: a collection of words, $w_1, w_2, \ldots, w_n$
Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$
Maximum likelihood estimator: $\hat{\theta} = \arg\max_{\theta \in \Theta} p(W \mid \theta)$

$p(W \mid \theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$

$\Rightarrow \log p(W \mid \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$

$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$
Maximum likelihood estimation
Maximize $\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$ subject to the requirement from probability that $\sum_{i=1}^{N} \theta_i = 1$.

Lagrange multiplier:
$L(W, \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$

Set partial derivatives to zero:
$\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$

Since $\sum_{i=1}^{N} \theta_i = 1$, we have $\lambda = -\sum_{i=1}^{N} c(w_i)$, which gives the ML estimate
$\theta_i = \frac{c(w_i)}{\sum_{i=1}^{N} c(w_i)}$
Maximum likelihood estimation
For N-gram language models:
$p(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) = \dfrac{c(w_i, w_{i-1}, \ldots, w_{i-n+1})}{c(w_{i-1}, \ldots, w_{i-n+1})}$
Here $c(\emptyset) = N$, the length of the document or the total number of words in the corpus, so the unigram case reduces to $c(w_i)/N$.
A bi-gram example
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>
P(I | <S>) = ?   P(am | I) = ?
P(Sam | am) = ?   P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
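A minimal sketch that answers these by counting over the three training sentences, treating <S> and </S> as ordinary tokens:

from collections import Counter

sentences = ["<S> I am Sam </S>",
             "<S> I am legend </S>",
             "<S> Sam I am </S>"]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p(word, prev):
    # MLE: P(word | prev) = c(prev, word) / c(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<S>"))     # 2/3
print(p("am", "I"))      # 3/3 = 1.0
print(p("Sam", "am"))    # 1/3
print(p("</S>", "Sam"))  # 1/2

# P(<S> I am Sam </S> | bigram model)
#   = (2/3) * 1.0 * (1/3) * (1/2) = 1/9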
Practical Issues
We do everything in log space
Avoids underflow
Adding is faster than multiplying
$\log(p_1 \times p_2) = \log p_1 + \log p_2$
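A toy illustration in Python:

import math

word_probs = [0.1, 0.002, 0.0005]                # per-word probabilities
log_prob = sum(math.log(p) for p in word_probs)  # additions, not multiplications
print(log_prob)                                  # about -16.12
# The raw product (1e-7) is math.exp(log_prob) when it does not underflow;
# for long sentences we simply stay in log space.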
Toolkits
KenLM: https://kheafield.com/code/kenlm/
SRILM: http://www.speech.sri.com/projects/srilm
More resources
Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
More resources
Google n-gram viewer
https://books.google.com/ngrams/
Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Sample entries (ngram, year, match count, volume count):
circumvallate 1978 335 91
circumvallate 1979 261 91
How about unseen words/phrases
Example: the Shakespeare corpus consists of N=884,647 word tokens and a vocabulary of V=29,066 word types
Only about 30,000 word types occurred; words not in the training data get 0 probability
Only 0.04% of all possible bigrams occurred
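Working out that last figure: $V^2 = 29{,}066^2 \approx 8.45 \times 10^8$ possible bigrams, of which $0.04\%$ is only about $3 \times 10^5$ distinct bigrams actually observed.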
Next Lecture
Dealing with unseen n-grams
Key idea: reserve some probability mass for events that don't occur in the training data
How much probability mass should we reserve?
Recap
N-gram language models
How to generate text from a language model
How to estimate a language model
Reading: Speech and Language Processing, Chapter 4: N-Grams