

    Language Models

    Instructor: Rada Mihalcea

    Note: some of the material in this slide set was adapted from an NLP course taught by Bonnie Dorr at Univ. of Maryland


    Slide 1

    Language Models

    A language model is an abstract representation of a (natural) language phenomenon, an approximation to real language.

    Statistical models can be predictive or explicative.


    Slide 2

    Claim

    A useful part of the knowledge needed to allow letter/word predictions can be captured using simple statistical techniques.

    Compute: the probability of a sequence; the likelihood of letters/words co-occurring.

    Why would we want to do this? Rank the likelihood of sequences containing various alternative hypotheses. Assess the likelihood of a hypothesis.


    Slide 3

    Outline

    Applications of language models
    Approximating natural language
    The chain rule
    Learning N-gram models
    Smoothing for language models
    Distribution of words in language: Zipf's law and Heaps' law


    Slide 4

    Why is This Useful?

    Speech recognition
    Handwriting recognition
    Spelling correction
    Machine translation systems
    Optical character recognizers


    Slide 5

    Handwriting Recognition

    Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen)

    NLP to the rescue: "gub" is not a word; "gun", "gum", "Gus", and "gull" are words, but "gun" has a higher probability in the context of a bank.


    Slide 6

    Real Word Spelling Errors

    They are leaving in about fifteen minuets to go to her house.
    The study was conducted mainly be John Black.
    Hopefully, all with continue smoothly in my absence.
    Can they lave him my messages?
    I need to notified the bank of.
    He is trying to fine out.


    Slide 7

    For Spell Checkers

    Collect a list of commonly substituted words: piece/peace, whether/weather, their/there, ...

    Example: "On Tuesday, the whether ..." -> "On Tuesday, the weather ..."


    Slide 8

    Other Applications

    Machine translation Text summarization Optical character recognition


    Slide 9

    Outline

    Applications of language models
    Approximating natural language
    The chain rule
    Learning N-gram models
    Smoothing for language models
    Distribution of words in language: Zipf's law and Heaps' law


    Slide 11

    Letter-based Language Models

    Shannon's Game: Guess the next letter:

    W


    Slide 12

    Letter-based Language Models

    Shannon's Game: Guess the next letter:

    Wh


    Slide 13

    Shannon's Game: Guess the next letter:

    Wha

    Letter-based Language Models


    Slide 14

    Shannon's Game: Guess the next letter:

    What

    Letter-based Language Models


    Slide 15

    Shannon's Game: Guess the next letter:

    What d

    Letter-based Language Models


    Slide 16

    Shannon's Game: Guess the next letter:

    What do

    Letter-based Language Models


    Slide 17

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?

    Letter-based Language Models


    Slide 18

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    Letter-based Language Models


    Slide 19

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What

    Letter-based Language Models


    Slide 20

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What do

    Letter-based Language Models


    Slide 21

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What do you

    Letter-based Language Models


    Slide 22

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What do you think

    Letter-based Language Models


    Slide 23

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What do you think the

    Letter-based Language Models


    Slide 24

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What do you think the next

    Letter-based Language Models


    Slide 25

    Shannon's Game: Guess the next letter:

    What do you think the next letter is?
    Guess the next word:

    What do you think the next word is?

    Letter-based Language Models


    Slide 26

    Approximating Natural Language Words

    zero-order approximation: letter sequences are independent of each other and all equally probable:

    xfoml rxkhrjffjuj zlpwcwkcy ffjeyvkcqsghyd


    Slide 27

    Approximating Natural Language Words

    first-order approximation: letters are independent, but occur with the frequencies of English text:

    ocro hli rgwr nmielwis eu ll nbnesebya th eei alhenhtppaoobttva nah


    Slide 28

    second-order approximation: the probability that a letter appears depends on the previous letter

    on ie antsoutinys are t inctore st bes deamy achin dilonasive tucoowe at teasonare fuzo tizin andy tobeseace ctisbe

    Approximating Natural Language Words


    Slide 29

    third-order approximation: the probability that a certain letter appears depends on the two previous letters

    in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre

    Approximating Natural Language Words
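
    These successive approximations are easy to reproduce with a small letter-level sampler. Below is a minimal Python sketch (not from the slides): the toy corpus string, the restricted alphabet, and the restart rule for unseen contexts are my own simplifying assumptions. Order 0 draws letters uniformly, order 1 draws them by letter frequency, and order k > 1 conditions each letter on the previous k-1 letters.

        import random
        from collections import Counter, defaultdict

        ALPHABET = "abcdefghijklmnopqrstuvwxyz "

        def sample_approximation(corpus, order, length=60, seed=0):
            """Generate text from a letter-level model of the given order:
            order 0 = uniform letters, order 1 = letter frequencies,
            order k = each letter conditioned on the previous k-1 letters."""
            rng = random.Random(seed)
            corpus = "".join(ch for ch in corpus.lower() if ch in ALPHABET)
            if order == 0:
                return "".join(rng.choice(ALPHABET) for _ in range(length))
            if order == 1:
                letters, weights = zip(*Counter(corpus).items())
                return "".join(rng.choices(letters, weights, k=length))
            k = order - 1                              # size of the conditioning context
            following = defaultdict(Counter)
            for i in range(len(corpus) - k):
                following[corpus[i:i + k]][corpus[i + k]] += 1
            context, out = corpus[:k], list(corpus[:k])
            while len(out) < length:
                nxt = following.get(context)
                if not nxt:                            # unseen context: restart from a random one
                    context = rng.choice(list(following))
                    nxt = following[context]
                letters, weights = zip(*nxt.items())
                ch = rng.choices(letters, weights, k=1)[0]
                out.append(ch)
                context = (context + ch)[-k:]
            return "".join(out)

        # corpus here is just a stand-in for any reasonably large English text.
        corpus = "the quick brown fox jumps over the lazy dog " * 50
        for order in (0, 1, 2, 3):
            print(order, sample_approximation(corpus, order))

    With a larger English training text, the order-2 and order-3 output starts to resemble the samples quoted on the slides above.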


    Slide 30

    Highest-frequency trigrams for different languages:
    English: THE, ING, ENT, ION
    German: EIN, ICH, DEN, DER
    French: ENT, QUE, LES, ION
    Italian: CHE, ERE, ZIO, DEL
    Spanish: QUE, EST, ARA, ADO

    Approximating Natural Language Words


    Slide 31

    Language Syllabic Similarity (Anca Dinu, Liviu Dinu)

    Languages within the same family are more similar to one another than to other languages.

    How similar (sounding) are languages within the same family?

    Syllable-based similarity


    Slide 32

    Syllable Ranks

    Gather the most frequent words in each language in the family;
    Syllabify the words;
    Rank the syllables;
    Compute language similarity based on the syllable rankings.


    Slide 33

    Example Analysis: the Romance Family

    Syllables in Romance languages


    Slide 34

    Latin-Romance Languages Similarity

    (figure: "servus", "servus", "ciao")


    Slide 35

    Outline

    Applications of language models
    Approximating natural language
    The chain rule
    Learning N-gram models
    Smoothing for language models
    Distribution of words in language: Zipf's law and Heaps' law


    Slide 36

    Terminology

    Sentence: unit of written language
    Utterance: unit of spoken language
    Word form: the inflected form that appears in the corpus
    Lemma: lexical forms having the same stem, part of speech, and word sense
    Types (V): number of distinct words that might appear in a corpus (vocabulary size)
    Tokens (N_T): total number of words in a corpus
    Types seen so far (T): number of distinct words seen so far in the corpus (smaller than V and N_T)


    Slide 37

    Word-based Language Models

    A model that enables one to compute the probability, or likelihood, of a sentence S, P(S).

    Simple: every word follows every other word with equal probability (0-gram).

    Assume |V| is the size of the vocabulary V.
    The likelihood of a sentence S of length n is 1/|V| × 1/|V| × ... × 1/|V| = (1/|V|)^n.
    If English has 100,000 words, the probability of each next word is 1/100,000 = .00001.


    Slide 38

    Word Prediction: Simple vs. Smart

    Smarter: the probability of each next word is related to its word frequency (unigram).

    Likelihood of sentence S = P(w1) P(w2) ... P(wn).
    Assumes the probability of each word is independent of the probabilities of the other words.

    Even smarter: look at the probability given the previous words (N-gram).

    Likelihood of sentence S = P(w1) P(w2|w1) ... P(wn|wn-1).
    Assumes the probability of each word depends on the probabilities of the other words.


    Slide 39

    Chain Rule

    Conditional probability: P(w1,w2) = P(w1) P(w2|w1)

    The chain rule generalizes to multiple events:
    P(w1, ..., wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1 ... wn-1)

    Examples:
    P(the dog) = P(the) P(dog | the)
    P(the dog barks) = P(the) P(dog | the) P(barks | the dog)
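
    As a quick sanity check of the decomposition, here is a tiny sketch; the numeric values are made up for illustration and are not taken from the slides.

        # Chain rule: P(w1 ... wn) = P(w1) * P(w2|w1) * ... * P(wn|w1 ... wn-1)
        # Toy, made-up probabilities for the example "the dog barks".
        p_the = 0.06                       # P(the)
        p_dog_given_the = 0.01             # P(dog | the)
        p_barks_given_the_dog = 0.20       # P(barks | the dog)

        p_sentence = p_the * p_dog_given_the * p_barks_given_the_dog
        print(p_sentence)                  # 0.00012 under these toy values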



    Slide 40

    Relative Frequencies and Conditional Probabilities

    Relative word frequencies are better than equal probabilities for all words.
    In a corpus with 10K word types, each word would have P(w) = 1/10K.
    This does not match our intuition that different words are more likely to occur (e.g., "the").

    Conditional probability is more useful than individual relative word frequencies:
    "dog" may be relatively rare in a corpus, but if we see "barking", P(dog | barking) may be very large.


    Slide 42

    Markov Assumption

    How do we compute P(wn | w1 ... wn-1)?
    Trick: instead of P(rabbit | I saw a), we use P(rabbit | a). This lets us collect statistics in practice.
    A bigram model: P(the barking dog) = P(the | <s>) P(barking | the) P(dog | barking)

    Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

    Specifically, for N = 2 (bigram): P(w1 ... wn) ≈ ∏_{k=1..n} P(wk | wk-1), with w0 = <s>.

    Order of a Markov model: length of the prior context (bigram is first order, trigram is second order, ...).
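
    A minimal sketch of this first-order (bigram) approximation; the probability table below and its numeric values are illustrative assumptions, not data from the slides.

        # First-order Markov (bigram) approximation:
        # P(w1 ... wn) ~= product over k of P(wk | wk-1), with w0 = <s>
        bigram_p = {                        # hypothetical probabilities, for illustration only
            ("<s>", "the"): 0.16,
            ("the", "barking"): 0.005,
            ("barking", "dog"): 0.12,
        }

        def sentence_prob(words, bigram_p):
            """Multiply bigram probabilities, starting from the <s> marker."""
            prob, prev = 1.0, "<s>"
            for w in words:
                prob *= bigram_p.get((prev, w), 0.0)   # unseen bigram -> 0 (this is what motivates smoothing)
                prev = w
            return prob

        print(sentence_prob(["the", "barking", "dog"], bigram_p))   # 0.16 * 0.005 * 0.12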


    Slide 43

    Counting Words in Corpora

    What is a word? E.g., are "cat" and "cats" the same word? "September" and "Sept"? "zero" and "oh"?

    Is "seventy-two" one word or two? "AT&T"? Punctuation?

    How many words are there in English? Where do we find the things to count?


    Slide 44

    Outline

    Applications of language models
    Approximating natural language
    The chain rule
    Learning N-gram models
    Smoothing for language models
    Distribution of words in language: Zipf's law and Heaps' law


    Slide 45

    Simple N-Grams

    An N-gram model uses the previous N-1 words to predict the next one:
    P(wn | wn-N+1 wn-N+2 ... wn-1)

    unigrams: P(dog)
    bigrams: P(dog | big)
    trigrams: P(dog | the big)
    quadrigrams: P(dog | chasing the big)


    Slide 46

    Using N-Grams

    Recall that:
    N-gram: P(wn | w1 ... wn-1) ≈ P(wn | wn-N+1 ... wn-1)
    Bigram: P(w1 ... wn) ≈ ∏_{k=1..n} P(wk | wk-1)

    For a bigram grammar, P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence.

    Example:
    P(I want to eat Chinese food) =
    P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)


    Slide 47

    A Bigram Grammar Fragment

    Eat on .16 Eat Thai .03

    Eat some .06 Eat breakfast .03

    Eat lunch .06 Eat in .02

    Eat dinner .05 Eat Chinese .02

    Eat at .04 Eat Mexican .02

    Eat a .04 Eat tomorrow .01

    Eat Indian .04 Eat dessert .007

    Eat today .03 Eat British .001


    Slide 48

    Additional Grammar

    I .25 Want some .04

    I'd .06 Want Thai .01

    Tell .04 To eat .26

    I'm .02 To have .14

    I want .32 To spend .09

    I would .29 To be .02

    I don't .08 British food .60

    I have .04 British restaurant .15

    Want to .65 British cuisine .01

    Want a .05 British lunch .01


    Slide 49

    Computing Sentence Probability

    P(I want to eat British food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081

    vs.
    P(I want to eat Chinese food) ≈ .00015

    Probabilities seem to capture "syntactic" facts and "world knowledge": "eat" is often followed by an NP; British food is not too popular.

    N-gram models can be trained by counting and normalization.
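
    The two products above can be recomputed directly from the bigram fragments and count tables on the surrounding slides; the only value not listed in the fragments is P(food | Chinese), which is derived here from the count tables as 120/213 ≈ 0.56.

        # Bigram probabilities copied from the grammar fragments above;
        # ("chinese", "food") is derived from the count tables (120 / 213).
        p = {
            ("<s>", "i"): 0.25, ("i", "want"): 0.32, ("want", "to"): 0.65,
            ("to", "eat"): 0.26, ("eat", "british"): 0.001, ("british", "food"): 0.60,
            ("eat", "chinese"): 0.02, ("chinese", "food"): 0.56,
        }

        def bigram_sentence_prob(sentence):
            """Multiply the bigram probabilities along the sentence, starting at <s>."""
            prob, prev = 1.0, "<s>"
            for w in sentence.lower().split():
                prob *= p[(prev, w)]
                prev = w
            return prob

        print(bigram_sentence_prob("I want to eat British food"))   # ~8.1e-06
        print(bigram_sentence_prob("I want to eat Chinese food"))   # ~1.5e-04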


    Slide 50

    N-grams Issues

    Sparse data: not all N-grams are found in the training data; we need smoothing.
    Change of domain: train on WSJ, then attempt to identify Shakespeare; it won't work.
    N-grams are more reliable than (N-1)-grams, but even more sparse.

    Generating Shakespeare sentences with random unigrams:
    "Every enter now severally so, let"
    With bigrams:
    "What means, sir. I confess she? then all sorts, he is trim, captain."
    With trigrams:
    "Sweet prince, Falstaff shall die."


    Slide 51

    N-grams Issues

    To determine reliable sentence probability estimates:
    have smoothing capabilities (avoid zero counts);
    apply back-off strategies: if N-grams are not possible, back off to (N-1)-grams.

    P(And nothing but the truth) ≈ 0.001

    P(And nuts sing on the roof) ≈ 0


    Slide 52

    Bigram Counts

    I Want To Eat Chinese Food lunch

    I 8 1087 0 13 0 0 0

    Want 3 0 786 0 6 8 6

    To 3 0 10 860 3 0 12

    Eat 0 0 2 0 19 2 52

    Chinese 2 0 0 0 0 120 1

    Food 19 0 17 0 0 0 0

    Lunch 4 0 0 0 0 1 0



    Slide 53

    Bigram Probabilities: Use Unigram Count

    Normalization: divide the bigram count by the unigram count of the first word.

    Computing the probability of "I I": P(I | I) = C(I I) / C(I) = 8 / 3437 = .0023

    A bigram grammar is a V×V matrix of probabilities, where V is the vocabulary size.

    I Want To Eat Chinese Food Lunch

    3437 1215 3256 938 213 1506 459


    Slide 54

    Learning a Bigram Grammar

    The formula P(wn | wn-1) = C(wn-1 wn) / C(wn-1) is used for bigram parameter estimation.
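
    A minimal sketch of this counting-and-normalization (relative frequency) estimation; the three-sentence toy corpus is my own assumption, used only for illustration.

        from collections import Counter

        def train_bigrams(sentences):
            """Estimate P(wn | wn-1) = C(wn-1 wn) / C(wn-1) from tokenized sentences."""
            unigram, bigram = Counter(), Counter()
            for sent in sentences:
                tokens = ["<s>"] + sent.lower().split()
                unigram.update(tokens[:-1])                  # counts of the conditioning word
                bigram.update(zip(tokens, tokens[1:]))
            return {(w1, w2): c / unigram[w1] for (w1, w2), c in bigram.items()}

        # Toy training corpus (an assumption, not from the slides):
        corpus = ["I want to eat Chinese food", "I want to eat lunch", "I want Chinese food"]
        probs = train_bigrams(corpus)
        print(probs[("i", "want")])        # 1.0: every "i" in this toy corpus is followed by "want"
        print(probs[("want", "to")])       # 2/3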


    Slide 55

    Training and Testing

    Probabilities come from a training corpus, which is used to design the model.
    Overly narrow corpus: probabilities don't generalize.
    Overly general corpus: probabilities don't reflect the task or domain.

    A separate test corpus is used to evaluate the model, typically using standard metrics:
    held-out test set
    cross-validation
    evaluation differences should be statistically significant


    Slide 56

    Outline

    Applications of language models
    Approximating natural language
    The chain rule
    Learning N-gram models
    Smoothing for language models
    Distribution of words in language: Zipf's law and Heaps' law


    Slide 57

    Smoothing Techniques

    Every N-gram training matrix is sparse, even for very large corpora (Zipf's law).

    Solution: estimate the likelihood of unseen N-grams


    Slide 59

    Add-one Smoothed Bigrams

    Unsmoothed: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

    Add-one smoothed: P(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]

    Assume a vocabulary of V = 1500.
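
    A short sketch of the add-one estimate, plugging in counts from the earlier tables together with the V = 1500 vocabulary size assumed on this slide.

        V = 1500                      # vocabulary size assumed on the slide

        def add_one(bigram_count, first_word_count, V=V):
            """Add-one (Laplace) smoothed bigram estimate:
            P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)"""
            return (bigram_count + 1) / (first_word_count + V)

        # Counts from the earlier tables: C(I want) = 1087, C(I) = 3437, C(I to) = 0
        print(1087 / 3437)            # unsmoothed estimate, about 0.316
        print(add_one(1087, 3437))    # add-one smoothed, about 0.220
        print(add_one(0, 3437))       # an unseen bigram now gets a small non-zero probability, about 0.0002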


    Slide 61

    Smoothing: Good-Turing

    How many species (words) were seen once? Estimate how many are unseen.

    All other estimates are adjusted (down) to give probabilities for the unseen.


    Slide 62

    Good-Turing Example

    10 carp, 3 cod, 2 tuna, 1 trout, 1 salmon, 1 eel. How likely is new data (p_0)?

    Let n_1 be the number of species occurring once (3), and N the total number of observations (18). Then p_0 = 3/18.

    How likely is eel? For count 1: n_1 = 3, n_2 = 1, so the adjusted count is 1* = 2 × n_2 / n_1 = 2 × 1/3 = 2/3, and P(eel) = 1* / N = (2/3) / 18 = 1/27.

    Notes:

    p_0 refers to the probability of seeing any new data. The probability of seeing a specific unknown item is much smaller, p_0 / #unknown_items, under the assumption that all unknown events occur with equal probability.

    For the words with the highest number of occurrences, use the actual probability (no smoothing).

    For words for which n_{r+1} is 0, go to the next rank, n_{r+2}.
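
    The fish example can be verified in a few lines; this sketch uses only the simple count-of-counts adjustment c* = (c+1) n_{c+1} / n_c described above, without the practical refinements listed in the notes.

        from collections import Counter
        from fractions import Fraction

        # The slide's sample: 10 carp, 3 cod, 2 tuna, 1 trout, 1 salmon, 1 eel
        counts = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
        N = sum(counts.values())                 # 18 observations in total
        n = Counter(counts.values())             # count-of-counts: n[1]=3, n[2]=1, n[3]=1, n[10]=1

        p0 = Fraction(n[1], N)                   # mass reserved for unseen species: 3/18
        print(p0)                                # 1/6

        # Adjusted count for species seen once (e.g. eel): c* = (c+1) * n[c+1] / n[c]
        c_star_1 = Fraction(2 * n[2], n[1])      # 2 * 1 / 3 = 2/3
        print(c_star_1, c_star_1 / N)            # 2/3 and P(eel) = (2/3)/18 = 1/27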


    Slide 63

    Back-off Methods

    Notice that:
    N-grams are more precise than (N-1)-grams (remember the Shakespeare example),
    but N-grams are also more sparse than (N-1)-grams.

    How to combine them? Attempt N-grams and back off to (N-1)-grams if counts are not available.
    E.g., attempt prediction using 4-grams, and back off to trigrams (or bigrams, or unigrams) if counts are not available.
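
    A minimal back-off sketch, crude in the spirit of "stupid backoff" rather than the full Katz formulation (which also discounts and redistributes probability mass): if the trigram estimate is unavailable we fall back to the bigram, then the unigram, relative frequency. The count tables here are hypothetical placeholders, not data from the slides.

        # Hypothetical count tables (placeholders, for illustration only).
        trigram_c = {("sweet", "prince", "falstaff"): 1}
        bigram_c  = {("sweet", "prince"): 2, ("prince", "falstaff"): 2, ("prince", "hal"): 5}
        unigram_c = {"falstaff": 20, "hal": 30, "prince": 12}
        total_tokens = 10_000

        def backoff_prob(w1, w2, w3):
            """Back off from trigram to bigram to unigram relative frequencies."""
            if (w1, w2, w3) in trigram_c and (w1, w2) in bigram_c:
                return trigram_c[(w1, w2, w3)] / bigram_c[(w1, w2)]
            if (w2, w3) in bigram_c and w2 in unigram_c:
                return bigram_c[(w2, w3)] / unigram_c[w2]
            return unigram_c.get(w3, 0) / total_tokens

        print(backoff_prob("sweet", "prince", "falstaff"))   # trigram available: 1/2
        print(backoff_prob("noble", "prince", "falstaff"))   # backs off to the bigram: 2/12
        print(backoff_prob("noble", "king", "falstaff"))     # backs off to the unigram: 20/10000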


    Slide 64

    Outline

    Applications of language models
    Approximating natural language
    The chain rule
    Learning N-gram models
    Smoothing for language models
    Distribution of words in language: Zipf's law and Heaps' law


    Slide 65

    Text properties (formalized)

    Sample word frequency data


    Slide 66

    Zipf's Law

    Rank (r ): The numerical position of a word in a list sorted by decreasing frequency ( f ).

    Zipf (1949) discovered that:

    If the probability of the word of rank r is p_r and N is the total number of word occurrences:

    f × r = k   (for constant k)

    p_r = f / N = A / r   (for a corpus-independent constant A ≈ 0.1)
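
    A short illustration of what the law predicts: with A ≈ 0.1, the rank-r word accounts for roughly A/r of all tokens, so r × f stays roughly constant. The corpus size below is an assumption for illustration only.

        A = 0.1                      # corpus-independent constant from the slide
        N = 1_000_000                # assume a corpus of one million tokens

        for r in (1, 2, 3, 10, 100, 1000):
            p_r = A / r              # Zipf: p_r = f/N = A/r
            f = p_r * N              # expected frequency of the rank-r word
            print(f"rank {r:>5}: p_r = {p_r:.5f}, expected frequency ~ {f:,.0f}, r*f/N = {r * f / N:.2f}")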


    Slide 67

    Zipf curve


    Slide 68

    Predicting Occurrence Frequencies

    By Zipf, a word appearing n times has rank r_n = AN/n.

    If several words occur n times, assume rank r_n applies to the last of these.

    Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.

    So, the number of words appearing exactly n times is:
    I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN / (n(n+1))

    The fraction of words with frequency n is:
    I_n / D = 1 / (n(n+1)), where D = r_1 = AN is the total number of distinct words.

    The fraction of words appearing only once is therefore 1/2.
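
    The predicted fractions 1/(n(n+1)) are easy to tabulate; a short sketch:

        from fractions import Fraction

        # Predicted fraction of vocabulary words that occur exactly n times: 1 / (n * (n + 1))
        for n in range(1, 6):
            frac = Fraction(1, n * (n + 1))
            print(f"words occurring exactly {n} time(s): {frac} = {float(frac):.3f}")
        # n = 1 gives 1/2: about half of all distinct words are expected to appear only once.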


    Slide 70

    Vocabulary Growth

    How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?

    This determines how the size of the inverted index will scale with the size of the corpus.

    The vocabulary is not really upper-bounded, due to proper names, typos, etc.
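
    The standard empirical answer is Heaps' law (the "Heaps law" named in the outline): V ≈ K × N^beta, where N is the number of tokens and K and beta are fit per collection, with beta typically around 0.4 to 0.6. A sketch with assumed parameters:

        # Heaps' law: vocabulary size V grows as K * N**beta with corpus size N (in tokens).
        K, beta = 44.0, 0.49          # assumed illustrative parameters, not fit to any real corpus

        for N in (10_000, 100_000, 1_000_000, 10_000_000):
            V = K * N ** beta
            print(f"N = {N:>10,} tokens  ->  V ~ {V:,.0f} distinct words")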


    Slide 72

    Heaps' Law Data

    Slide 73

    Letter-based models: do WE need them? (a discovery)

    Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe.