
Page 1:

Lecture 2: N-gram

Kai-Wei Chang

CS @ University of Virginia

[email protected]

Course webpage: http://kwchang.net/teaching/NLP16

CS 6501: Natural Language Processing 1

Page 2:

This lecture

Language Models

What are N-gram models?

How to use probabilities

What does P(Y|X) mean?

How can I manipulate it?

How can I estimate its value in practice?

CS 6501: Natural Language Processing 2

Page 3:

What is a language model?

Probability distributions over sentences (i.e., word sequences)

$P(W) = P(w_1 w_2 w_3 w_4 \dots w_k)$

Can use them to generate strings

$P(w_k \mid w_1 w_2 w_3 \dots w_{k-1})$

Rank possible sentences

P(“Today is Tuesday”) > P(“Tuesday Today is”)

P(“Today is Tuesday”) > P(“Today is Virginia”)

CS 6501: Natural Language Processing 3

Page 4:

Language model applications

Context-sensitive spelling correction

CS 6501: Natural Language Processing 4

Page 5:

Language model applications

Autocomplete

CS 6501: Natural Language Processing 5

Page 6:

Language model applications

Smart Reply

CS 6501: Natural Language Processing 6

Page 7:

Language model applications

Language generation

https://pdos.csail.mit.edu/archive/scigen/

CS 6501: Natural Language Processing 7

Page 8:

Bag-of-Words with N-grams

N-gram: a contiguous sequence of n tokens from a given piece of text

CS 6501: Natural Language Processing 8

http://recognize-speech.com/language-model/n-gram-model/comparison

Page 9:

N-Gram Models

Unigram model: $P(w_1)\, P(w_2)\, P(w_3) \dots P(w_n)$

Bigram model: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \dots P(w_n \mid w_{n-1})$

Trigram model: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1) \dots P(w_n \mid w_{n-1}, w_{n-2})$

N-gram model: $P(w_1)\, P(w_2 \mid w_1) \dots P(w_n \mid w_{n-1}, w_{n-2}, \dots, w_{n-N+1})$

CS 6501: Natural Language Processing 9
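To make the factorizations above concrete, here is a minimal sketch (not from the slides) of how a sentence probability factors under an n-gram model; the lookup function prob(word, context) is a hypothetical stand-in for whatever supplies the conditional probabilities:

import math
from typing import Callable, Sequence, Tuple

def sentence_logprob(words: Sequence[str], n: int,
                     prob: Callable[[str, Tuple[str, ...]], float]) -> float:
    # Sum of log P(w_i | previous n-1 words); for n = 2 this is the bigram
    # factorization P(w1) P(w2|w1) ... P(wk|w_{k-1}).
    total = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - (n - 1)):i])  # at most the last n-1 words
        total += math.log(prob(w, context))
    return total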

Page 10:

Random language via n-gram

http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf

Behind the scenes – probability theory

CS 6501: Natural Language Processing 10

Page 11:

Sampling with replacement

(The slide shows an urn of colored balls and asks for probabilities of draws with replacement, e.g., P(red), P(blue), P(red | a given ball), P(a given ball | red), and the joint probability of drawing a particular multiset of balls (2 of one color, 3 of another, 4 of a third); the ball images did not survive extraction.)

CS 6501: Natural Language Processing 11

Page 12:

Sampling words with replacement

CS 6501: Natural Language Processing 12

Example from Julia Hockenmaier, Intro to NLP

Page 13:

Implementation: how to sample?

Sample from a discrete distribution 𝑝(𝑋)

Assume 𝑛 outcomes in the event space 𝑋

1. Divide the interval [0,1] into n intervals according to the probabilities of the outcomes

2. Generate a random number r between 0 and 1

3. Return the outcome $x_i$ whose interval r falls into
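A minimal sketch of this procedure (my own illustration, not from the slides), using the cumulative distribution and a single uniform draw:

import bisect
import random

def sample(outcomes, probs):
    # 1. Build the interval boundaries (the CDF) from the outcome probabilities.
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)
    # 2. Draw r uniformly from [0, 1); 3. return the outcome whose interval contains r.
    r = random.random()
    return outcomes[bisect.bisect(cdf, r)]

# e.g., sample(["the", "cat", "sat"], [0.5, 0.3, 0.2]) returns "the" about half the time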

CS 6501: Natural Language Processing 13

Page 14:

Conditional on the previous word

CS 6501: Natural Language Processing 14

Example from Julia Hockenmaier, Intro to NLP

Page 15:

Conditional on the previous word

CS 6501: Natural Language Processing 15

Example from Julia Hockenmaier, Intro to NLP

Page 16:

Recap: Probability theory

Conditional probability: P(blue | ·) = ? (the · is a ball shown as an image on the slide)

$P(B \mid A) = P(B, A) / P(A)$

Bayes' rule: $P(B \mid A) = \dfrac{P(A \mid B)\, P(B)}{P(A)}$

Verify: P(red | ·), P(· | red), P(·), P(red)

Independence: $P(B \mid A) = P(B)$

Prove: $P(A, B) = P(A)\, P(B)$
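A one-line answer to that exercise (my own working, not on the slide): by the definition of conditional probability and then independence,

$P(A, B) = P(B \mid A)\, P(A) = P(B)\, P(A).$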

CS 6501: Natural Language Processing 16

Page 17:

The Chain Rule

The joint probability can be expressed in terms of the conditional probability:

P(X,Y) = P(X | Y) P(Y)

More variables:

P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)

$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \dots P(X_n \mid X_1, \dots, X_{n-1})$

$= P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \dots, X_{i-1})$

CS 6501: Natural Language Processing 17

Page 18:

Language model for text

Probability distribution over sentences

$p(w_1 w_2 \dots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \dots p(w_n \mid w_1, w_2, \dots, w_{n-1})$

Complexity: $O(V^{n^*})$, where $n^*$ is the maximum sentence length

CS 6501: Natural Language Processing 18

Chain rule: from conditional probability to joint probability

A rough estimate: $O(475000^{14})$

Average English sentence length is 14.3 words

475,000 main headwords in Webster's Third New International Dictionary

How large is this? $475000^{14} \times 8\ \text{bytes} \,/\, 1024^4 \approx 3.38 \times 10^{66}\ \text{TB}$

We need independence assumptions!
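As a quick back-of-the-envelope check (my own arithmetic, not part of the slides; exact storage figures depend on the conventions used), the raw parameter count alone makes the point:

# One parameter per possible word sequence of length 14.
V = 475_000   # headwords in Webster's Third New International Dictionary
n = 14        # roughly the average English sentence length
print(f"{V ** n:.2e} distinct sequences")  # ~3e79, far too many parameters to estimate or store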

Page 19:

Probability models

Building a probability model:

defining the model (making independence assumptions)

estimating the model's parameters

using the model (making inferences)

CS 6501: Natural Language Processing 19

(Diagram: a trigram model, defined in terms of parameters like P("is" | "today"); the parameter values Θ supply the definition of P.)

Page 20:

Independence assumption

Even though X and Y are not actually independent, we treat them as independent.

This makes the model compact (e.g., from $100\text{k}^{14}$ parameters to $100\text{k}^{2}$).

CS 6501: Natural Language Processing 20

Page 21:

Language model with N-gram

The chain rule: $P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \dots P(X_n \mid X_1, \dots, X_{n-1})$

An N-gram language model assumes each word depends only on the last n-1 words (Markov assumption), as written out below.
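In symbols (a standard statement of the assumption):

$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$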

CS 6501: Natural Language Processing 21

Page 22:

Language model with N-gram

Example: trigram (3-gram)

$P(w_n \mid w_1, \dots, w_{n-1}) = P(w_n \mid w_{n-2}, w_{n-1})$

$P(w_1, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1) \dots P(w_n \mid w_{n-2}, w_{n-1})$

P("Today is a sunny day")
= P("Today") P("is" | "Today") P("a" | "is", "Today") ... P("day" | "sunny", "a")

CS 6501: Natural Language Processing 22

Page 23:

Unigram model

CS 6501: Natural Language Processing 23

Page 24:

Bigram model

Condition on the previous word

CS 6501: Natural Language Processing 24

Page 25:

N-gram model

CS 6501: Natural Language Processing 25

Page 26:

More examples

Yoav’s blog post:

http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

10-gram character-level LM:

CS 6501: Natural Language Processing 26

First Citizen: Nay, then, that was hers, It

speaks against your other service: But since the

youth of the circumstance be spoken: Your uncle

and one Baptista's daughter.

SEBASTIAN: Do I stand till the break off.

BIRON:

Hide thy head.

Page 27:

More examples

Yoav’s blog post:

http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

10-gram character-level LM:

CS 6501: Natural Language Processing 27

~~/*

* linux/kernel/time.c

* Please report this on hardware.

*/

void irq_mark_irq(unsigned long old_entries, eval);

/*

* Divide only 1000 for ns^2 -> us^2 conversion values don't

overflow:

seq_puts(m, "\ttramp: %pS",

(void *)class->contending_point]++;

if (likely(t->flags & WQ_UNBOUND)) {

/*

* Update inode information. If the

* slowpath and sleep time (abs or rel)

* @rmtp: remaining (either due

* to consume the state of ring buffer size. */

header_size - size, in bytes, of the chain.

*/

BUG_ON(!error);

} while (cgrp) {

if (old) {

if (kdb_continue_catastrophic;

#endif

Page 28:

Questions?

CS 6501: Natural Language Processing 28

Page 29:

Maximum likelihood Estimation

“Best” means “data likelihood reaches maximum”

Unigram language model: p(w | θ) = ?

Estimate: text ?, mining ?, association ?, database ?, ..., query ?, ...

Document (a paper, total #words = 100):
text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...

Estimation: text 10/100, mining 5/100, association 3/100, database 3/100, ..., query 1/100, ...

CS 6501: Natural Language Processing 29

$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} P(\mathbf{X} \mid \boldsymbol{\theta})$
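A minimal sketch of this estimate in code (my own illustration, using the counts from the example document above; the relative frequency is exactly the ML estimate derived on the following slides):

def unigram_mle(counts, total=None):
    # theta_w = c(w) / N, where N is the total number of tokens.
    total = total if total is not None else sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts from the example paper (100 words in total; only some words are listed).
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
theta = unigram_mle(counts, total=100)
print(theta["text"], theta["mining"])  # 0.1 0.05, i.e., 10/100 and 5/100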

Page 30:

Which bag of words is more likely to generate: aaaDaaaKoaaaa

CS 6501: Natural Language Processing 30

(The slide shows two bags of letter tiles, mostly a's together with a few other letters such as D, K, E, o, b, P, F, n; the exact contents of each bag did not survive extraction.)

Page 31:

Parameter estimation

General setting:

Given a (hypothesized & probabilistic) model

that governs the random experiment

The model gives a probability of any data

𝑝(𝑋|𝜃) that depends on the parameter 𝜃

Now, given actual sample data X={x1,…,xn},

what can we say about the value of 𝜃?

Intuitively, take our best guess of 𝜃 -- “best”

means “best explaining/fitting the data”

Generally an optimization problem

CS 6501: Natural Language Processing 31

Page 32:

Maximum likelihood estimation

Data: a collection of words, $w_1, w_2, \dots, w_N$

Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$

Maximum likelihood estimator: $\hat{\theta} = \arg\max_{\theta \in \Theta} p(W \mid \theta)$

$p(W \mid \theta) = \binom{N}{c(w_1), \dots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$

$\Rightarrow \log p(W \mid \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$

$\Rightarrow \hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$

CS 6501: Natural Language Processing 32

Page 33:

Maximum likelihood estimation

$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$

Lagrange multiplier, with the constraint $\sum_{i=1}^{N} \theta_i = 1$ (requirement from probability):

$L(W, \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$

Set partial derivatives to zero:

$\dfrac{\partial L}{\partial \theta_i} = \dfrac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\dfrac{c(w_i)}{\lambda}$

Since $\sum_{i=1}^{N} \theta_i = 1$, we have $\lambda = -\sum_{i=1}^{N} c(w_i)$

ML estimate: $\theta_i = \dfrac{c(w_i)}{\sum_{i=1}^{N} c(w_i)}$

CS 6501: Natural Language Processing 33

Page 34:

Maximum likelihood estimation

For N-gram language models:

$p(w_i \mid w_{i-1}, \dots, w_{i-n+1}) = \dfrac{c(w_i, w_{i-1}, \dots, w_{i-n+1})}{c(w_{i-1}, \dots, w_{i-n+1})}$

$c(\emptyset) = N$: the length of the document, or the total number of words in the corpus

CS 6501: Natural Language Processing 34

Page 35:

A bi-gram example

<S> I am Sam </S>

<S> I am legend </S>

<S> Sam I am </S>

P( I | <S>) = ? P(am | I) = ?

P( Sam | am) = ? P( </S> | Sam) = ?

P( <S>I am Sam</S> | bigram model) = ?
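A small sketch (my own, not from the slides) that computes these quantities by maximum likelihood counting over the three sentences above:

from collections import Counter

corpus = [["<S>", "I", "am", "Sam", "</S>"],
          ["<S>", "I", "am", "legend", "</S>"],
          ["<S>", "Sam", "I", "am", "</S>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(bg for sent in corpus for bg in zip(sent, sent[1:]))

def p(w2, w1):
    # MLE bigram probability: c(w1, w2) / c(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p("I", "<S>"), p("am", "I"), p("Sam", "am"), p("</S>", "Sam"))  # 2/3, 1.0, 1/3, 1/2
# Probability of "<S> I am Sam </S>" under the bigram model:
print(p("I", "<S>") * p("am", "I") * p("Sam", "am") * p("</S>", "Sam"))  # 1/9, about 0.111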

CS 6501: Natural Language Processing 35

Page 36:

Practical Issues

We do everything in the log space

Avoid underflow

Adding is faster than multiplying

$\log(p_1 \times p_2) = \log p_1 + \log p_2$
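For example (an illustrative sketch, not from the slides), multiplying many small probabilities underflows to zero, while summing their logs stays well-behaved:

import math

probs = [1e-5] * 300                         # 300 small conditional probabilities
product = 1.0
for p in probs:
    product *= p                             # underflows to 0.0 in floating point
log_prob = sum(math.log(p) for p in probs)   # about -3453.9, perfectly representable
print(product, log_prob)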

Toolkits

KenLM: https://kheafield.com/code/kenlm/

SRILM:

http://www.speech.sri.com/projects/srilm

CS 6501: Natural Language Processing 36

Page 37:

More resources

Google n-gram:

https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

CS 6501: Natural Language Processing 37

File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens: 1,024,908,267,229

Number of sentences: 95,119,665,584

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663

Page 38:

More resources

Google n-gram viewer

https://books.google.com/ngrams/

Data:

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

CS 6501: Natural Language Processing 38

circumvallate 1978 335 91

circumvallate 1979 261 91

Page 39:

CS 6501: Natural Language Processing 39

Page 40:

CS 6501: Natural Language Processing 40

Page 41:

CS 6501: Natural Language Processing 41

Page 42:

CS 6501: Natural Language Processing 42

Page 43:

How about unseen words/phrases

Example: the Shakespeare corpus consists of N=884,647 word tokens and a vocabulary of V=29,066 word types

Only ~30,000 word types occurred

Words not in the training data ⇒ 0 probability

Only 0.04% of all possible bigrams occurred
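To put the 0.04% figure in perspective (my own back-of-the-envelope calculation, not from the slides):

V = 29_066
possible_bigrams = V ** 2              # 844,832,356 possible bigrams
print(possible_bigrams)
print(int(possible_bigrams * 0.0004))  # the 0.04% quoted above is roughly 338k observed bigram types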

CS 6501: Natural Language Processing 43

Page 44:

Next Lecture

Dealing with unseen n-grams

Key idea: reserve some probability mass for events that don't occur in the training data

How much probability mass should we reserve?

CS 6501: Natural Language Processing 44

Page 45:

Recap

N-gram language models

How to generate text from a language model

How to estimate a language model

Reading: Speech and Language

Processing Chapter 4: N-Grams

CS 6501: Natural Language Processing 45