part ii. statistical nlp advanced artificial intelligence n-gramms wolfram burgard, luc de raedt,...
Post on 18-Dec-2015
229 views
TRANSCRIPT
![Page 1: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/1.jpg)
Part II. Statistical NLP
Advanced Artificial Intelligence
N-Gramms
Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme
Most slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr, Leila Kosseim and others
![Page 2: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/2.jpg)
Contents
Short recap motivation for SNLP Probabilistic language models N-gramms Predicting the next word in a sentence Language guessing
Largely chapter 6 of Statistical NLP, Manning and Schuetze.
And chapter 6 of Speech and Language Processing, Jurafsky and Martin
![Page 3: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/3.jpg)
Human Language is highly ambiguous at all levels
• acoustic levelrecognize speech vs. wreck a nice beach
• morphological levelsaw: to see (past), saw (noun), to saw (present, inf)
• syntactic levelI saw the man on the hill with a telescope
• semantic levelOne book has to be read by every student
Motivation
![Page 4: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/4.jpg)
Statistical Disambiguation
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative
NLP and Statistics
![Page 5: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/5.jpg)
Speech recognisers use a „noisy channel model“
The source generates a sentence s with probability P(s).The channel transforms the text into an acoustic signalwith probability P(a|s).
The task of the speech recogniser is to find for a givenspeech signal a the most likely sentence s:
s = argmaxs P(s|a) = argmaxs P(a|s) P(s) / P(a)
= argmaxs P(a|s) P(s)
Language Models
source s channel acoustic signal a
![Page 6: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/6.jpg)
Speech recognisers employ two statistical models:
• a language model P(s)
• an acoustics model P(a|s)
Language Models
![Page 7: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/7.jpg)
Language Model
Definition: • Language model is a model that enables
one to compute the probability, or likelihood, of a sentence s, P(s).
Let’s look at different ways of computing P(s) in the context of Word Prediction
![Page 8: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/8.jpg)
Example of bad language model
![Page 9: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/9.jpg)
A bad language model
![Page 10: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/10.jpg)
A bad language model
![Page 11: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/11.jpg)
A Good Language Model
Determine reliable sentence probability estimates
P(“And nothing but the truth”) 0.001 P(“And nuts sing on the roof”) 0
![Page 12: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/12.jpg)
Shannon game Word Prediction
Predicting the next word in the sequence• Statistical natural language ….• The cat is thrown out of the …• The large green …• Sue swallowed the large green …• …
![Page 13: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/13.jpg)
Claim
A useful part of the knowledge needed to allow Word Prediction can be captured using simple statistical techniques.
Compute:- probability of a sequence- likelihood of words co-occurring
Why would we want to do this?• Rank the likelihood of sequences containing various
alternative alternative hypotheses• Assess the likelihood of a hypothesis
![Page 14: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/14.jpg)
Applications
Spelling correction Mobile phone texting Speech recognition Handwriting recognition Disabled users …
![Page 15: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/15.jpg)
Spelling errors
They are leaving in about fifteen minuets to go to her house.
The study was conducted mainly be John Black. Hopefully, all with continue smoothly in my absence. Can they lave him my messages? I need to notified the bank of…. He is trying to fine out.
![Page 16: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/16.jpg)
Handwriting recognition
Assume a note is given to a bank teller, which the teller reads as I have a gub. (cf. Woody Allen)
NLP to the rescue ….• gub is not a word• gun, gum, Gus, and gull are words, but gun has a
higher probability in the context of a bank
![Page 17: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/17.jpg)
For Spell Checkers
Collect list of commonly substituted words• piece/peace, whether/weather, their/there ...
Example:“On Tuesday, the whether …’’“On Tuesday, the weather …”
![Page 18: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/18.jpg)
How to assign probabilities to word sequences?
The probability of a word sequence w1,n is decomposedinto a product of conditional probabilities.
P(w1,n) = P(w1) P(w2 | w1) P(w3 | w1,w2) ... P(wn | w1,n-1)
= i=1..n P(wi | w1,i-1)
Language Models
![Page 19: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/19.jpg)
In order to simplify the model, we assume that
• each word only depends on the 2 preceding words P(wi | w1,i-1) = P(wi | wi-2, wi-1)
• 2nd order Markov model, trigram
• that the probabilities are time invariant (stationary)
P(Wi=c | Wi-2=a, Wi-1=b) = P(Wk=c | Wk-2=a, Wk-1=b)
Final formula: P(w1,n) = i=1..n P(wi | wi-2, wi-1)
Language Models
![Page 20: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/20.jpg)
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one:• P(wn | wn-N+1 wn-N+2… wn-1 )
unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | chasing the big)
![Page 21: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/21.jpg)
A Bigram Grammar Fragment
Eat on .16 Eat Thai .03
Eat some .06 Eat breakfast .03
Eat lunch .06 Eat in .02
Eat dinner .05 Eat Chinese .02
Eat at .04 Eat Mexican .02
Eat a .04 Eat tomorrow .01
Eat Indian .04 Eat dessert .007
Eat today .03 Eat British .001
![Page 22: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/22.jpg)
Additional Grammar
<start> I .25 Want some .04
<start> I’d .06 Want Thai .01
<start> Tell .04 To eat .26
<start> I’m .02 To have .14
I want .32 To spend .09
I would .29 To be .02
I don’t .08 British food .60
I have .04 British restaurant .15
Want to .65 British cuisine .01
Want a .05 British lunch .01
![Page 23: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/23.jpg)
Computing Sentence Probability
P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25x.32x.65x.26x.001x.60 = .000080
vs. P(I want to eat Chinese food) = .00015
Probabilities seem to capture “syntactic'' facts, “world knowledge'' - eat is often followed by a NP- British food is not too popular
N-gram models can be trained by counting and normalization
![Page 24: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/24.jpg)
Some adjustments
product of probabilities… numerical underflow for long sentences
so instead of multiplying the probs, we add the log of the probs
P(I want to eat British food) Computed usinglog(P(I|<s>)) + log(P(want|I)) + log(P(to|want)) + log(P(eat|to)) +
log(P(British|eat)) + log(P(food|British))= log(.25) + log(.32) + log(.65) + log (.26) + log(.001) + log(.6)= -11.722
![Page 25: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/25.jpg)
Why use only bi- or tri-grams?
Markov approximation is still costlywith a 20 000 word vocabulary:• bigram needs to store 400 million parameters• trigram needs to store 8 trillion parameters• using a language model > trigram is impractical
to reduce the number of parameters, we can:• do stemming (use stems instead of word types)• group words into semantic classes• seen once --> same as unseen• ...
![Page 26: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/26.jpg)
![Page 27: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/27.jpg)
![Page 28: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/28.jpg)
Building n-gram Models
Data preparation: • Decide training corpus• Clean and tokenize• How do we deal with sentence boundaries?
I eat. I sleep. • (I eat) (eat I) (I sleep)
<s>I eat <s> I sleep <s> • (<s> I) (I eat) (eat <s>) (<s> I) (I sleep) (sleep <s>)
Use statistical estimators:• to derive a good probability estimates based on training
data.
![Page 29: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/29.jpg)
Statistical Estimators
Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• ( Validation:» Held Out Estimation» Cross Validation )
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 30: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/30.jpg)
Statistical Estimators --> Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• ( Validation:» Held Out Estimation» Cross Validation )
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 31: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/31.jpg)
Maximum Likelihood Estimation Choose the parameter values which gives the
highest probability on the training corpus
Let C(w1,..,wn) be the frequency of n-gram w1,..,wn
PMLE (wn |w1,..,wn-1) =C(w1,..,wn)C(w1,..,wn-1)
![Page 32: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/32.jpg)
Example 1: P(event) in a training corpus, we have 10 instances of “come
across”• 8 times, followed by “as”• 1 time, followed by “more”• 1 time, followed by “a”
with MLE, we have: • P(as | come across) = 0.8 • P(more | come across) = 0.1 • P(a | come across) = 0.1 • P(X | come across) = 0 where X “as”, “more”, “a”
![Page 33: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/33.jpg)
Problem with MLE: data sparseness What if a sequence never appears in training corpus?
P(X)=0• “come across the men” --> prob = 0
• “come across some men” --> prob = 0
• “come across 3 men” --> prob = 0
MLE assigns a probability of zero to unseen events … probability of an n-gram involving unseen words will be
zero!
![Page 34: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/34.jpg)
Maybe with a larger corpus?
Some words or word combinations are unlikely to appear !!!
Recall: • Zipf’s law• f ~ 1/r
![Page 35: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/35.jpg)
in (Balh et al 83) • training with 1.5 million words • 23% of the trigrams from another part of the
same corpus were previously unseen. So MLE alone is not good enough estimator
Problem with MLE: data sparseness (con’t)
![Page 36: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/36.jpg)
Discounting or Smoothing
MLE is usually unsuitable for NLP because of the sparseness of the data
We need to allow for possibility of seeing events not seen in training
Must use a Discounting or Smoothing technique
Decrease the probability of previously seen events to leave a little bit of probability for previously unseen events
![Page 37: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/37.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) --> Smoothing
• --> Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• ( Validation:» Held Out Estimation» Cross Validation )
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 38: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/38.jpg)
Many smoothing techniques
Add-one Add-delta Witten-Bell smoothing Good-Turing smoothing Church-Gale smoothing Absolute-discounting Kneser-Ney smoothing ...
![Page 39: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/39.jpg)
Add-one Smoothing (Laplace’s law)
Pretend we have seen every n-gram at least once
Intuitively:• new_count(n-gram) = old_count(n-gram) + 1
The idea is to give a little bit of the probability space to unseen events
![Page 40: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/40.jpg)
Add-one: Example
I want to eat Chinese food lunch … Total (N)
I 8 1087 0 13 0 0 0 3437
want 3 0 786 0 6 8 6 1215
to 3 0 10 860 3 0 12 3256
eat 0 0 2 0 19 2 52 938
Chinese 2 0 0 0 0 120 1 213
food 19 0 17 0 0 0 0 1506
lunch 4 0 0 0 0 1 0 459
…
unsmoothed bigram counts:
I want to eat Chinese food lunch … Total
I .0023 (8/3437)
.32 0 .0038 (13/3437)
0 0 0 1
want .0025 0 .65 0 .0049 .0066 .0049 1
to .00092 0 .0031 .26 .00092 0 .0037 1
eat 0 0 .0021 0 .020 .0021 .055 1
Chinese .0094 0 0 0 0 .56 .0047 1
food .013 0 .011 0 0 0 0 1
lunch .0087 0 0 0 0 .0022 0 1
…
unsmoothed normalized bigram probabilities:
1st w
ord
2nd word
![Page 41: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/41.jpg)
Add-one: Example (con’t)
I want to eat Chinese food lunch … Total (N+V)
I 8 9 1087
1088
1 14 1 1 1 3437
5053
want 3 4 1 787 1 7 9 7 2831
to 4 1 11 861 4 1 13 4872
eat 1 1 23 1 20 3 53 2554
Chinese 3 1 1 1 1 121 2 1829
food 20 1 18 1 1 1 1 3122
lunch 5 1 1 1 1 2 1 2075
add-one smoothed bigram counts:
I want to eat Chinese food lunch … Total
I .0018 (9/5053)
.22 .0002 .0028 (14/5053)
.0002 .0002 .0002 1
want .0014 .00035 .28 .00035 .0025 .0032 .0025 1
to .00082 .00021 .0023 .18 .00082 .00021 .0027 1
eat .00039 .00039 .0012 .00039 .0078 .0012 .021 1
Chinese .0016 .00055 .00055 .00055 .00055 .066 .0011 1
food .0064 .00032 .0058 .00032 .00032 .00032 .00032 1
lunch .0024 .00048 .00048 .00048 .00048 .0022 .00048 1
add-one normalized bigram probabilities:
![Page 42: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/42.jpg)
Add-one, more formally
N: nb of n-grams in training corpus -
B:nb of bins (of possible n-grams) B = V^2 for bigrams
B = V^3 for trigrams etc. where V is size of vocabulary
PAdd1(w1 w2 …wn)=C(w1w2…wn)+1
N+ B
![Page 43: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/43.jpg)
Problem with add-one smoothing bigrams starting with Chinese are boosted by a factor of
8 ! (1829 / 213) I want to eat Chinese food lunch … Total (N)
I 8 1087 0 13 0 0 0 3437
want 3 0 786 0 6 8 6 1215
to 3 0 10 860 3 0 12 3256
eat 0 0 2 0 19 2 52 938
Chinese 2 0 0 0 0 120 1 213
food 19 0 17 0 0 0 0 1506
lunch 4 0 0 0 0 1 0 459
I want to eat Chinese food lunch … Total (N+V)
I 9 1088 1 14 1 1 1 5053
want 4 1 787 1 7 9 7 2831
to 4 1 11 861 4 1 13 4872
eat 1 1 23 1 20 3 53 2554
Chinese 3 1 1 1 1 121 2 1829
food 20 1 18 1 1 1 1 3122
lunch 5 1 1 1 1 2 1 2075
unsmoothed bigram counts:
add-one smoothed bigram counts:
1st w
ord
1st w
ord
![Page 44: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/44.jpg)
Problem with add-one smoothing (con’t) Data from the AP from (Church and Gale, 1991)
• Corpus of 22,000,000 bigrams• Vocabulary of 273,266 words (i.e. 74,674,306,760 possible bigrams - or
bins)• 74,671,100,000 bigrams were unseen• And each unseen bigram was given a frequency of 0.000137
fMLE fempirical fadd-one
0 0.000027 0.000137
1 0.448 0.000274
2 1.25 0.000411
3 2.24 0.000548
4 3.23 0.000685
5 4.21 0.000822
too high
too low
Freq. from training data
Freq. from held-out
data
Add-one smoothed
freq.
Total probability mass given to unseen bigrams = (74,671,100,000 x 0.000137) / 22,000,000 ~0.465 !!!!
![Page 45: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/45.jpg)
Problem with add-one smoothing
every previously unseen n-gram is given a low probability
but there are so many of them that too much probability mass is given to unseen events
adding 1 to frequent bigram, does not change much but adding 1 to low bigrams (including unseen ones)
boosts them too much !
In NLP applications that are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.
![Page 46: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/46.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• --> Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• Validation:» Held Out Estimation» Cross Validation
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 47: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/47.jpg)
Add-delta smoothing (Lidstone’s law)
instead of adding 1, add some other (smaller) positive value
most widely used value for = 0.5 if =0.5, Lidstone’s Law is called:
• the Expected Likelihood Estimation (ELE) • or the Jeffreys-Perks Law
better than add-one, but still…
PAddD(w1 w2 …wn)=C(w1w2…wn)+
N+ B
PELE(w1 w2 …wn)=C(w1w2…wn)+ 0.5
N+ 0.5B
![Page 48: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/48.jpg)
The expected frequency of a trigram in a randomsample of size N is therefore
f*(w,w‘,w‘‘) = f(w,w‘,w‘‘) + (1- ) N/B
relative discounting
Adding / Lidstone‘s law
![Page 49: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/49.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• --> ( Validation:» Held Out Estimation» Cross Validation )
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 50: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/50.jpg)
Validation / Held-out Estimation
How do we know how much of the probability space to “hold out” for unseen events?
ie. We need a good way to guess in advance Held-out data:
• We can divide the training data into two parts: the training set: used to build initial estimates by counting the held out data: used to refine the initial estimates (i.e. see
how often the bigrams that appeared r times in the training text occur in the held-out text)
![Page 51: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/51.jpg)
Held Out Estimation
For each n-gram w1...wn we compute:• Ctr(w1...wn) the frequency of w1...wn in the training data• Cho(w1...wn) the frequency of w1...wn in the held out data
Let:• r = the frequency of an n-gram in the training data• Nr = the number of different n-grams with frequency r in the training
data• Tr = the sum of the counts of all n-grams in the held-out data that
appeared r times in the training data
• T = total number of n-gram in the held out data So:
Tr = Cho(w1L wn)
{w1L wnwhereCtr (w1L wn )=r}∑
Pho (w1L wn ) =
Tr
Tx1Nr
wherer=Ctr(w1L wn)
![Page 52: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/52.jpg)
Some explanation…
Pho (w1L wn ) =
Tr
Tx1Nr
wherer=Ctr(w1L wn)
probability in held-out data for all n-grams appearing r times in
the training data
since we have Nr different n-grams in the training data that
occurred r times, let's share this probability mass equality among
them
ex: assume • if r=5 and 10 different n-grams (types) occur 5 times in training • --> N5 = 10• if all the n-grams (types) that occurred 5 times in training, occurred in
total (n-gram tokens) 20 times in the held-out data• --> T5 = 20• assume the held-out data contains 2000 n-grams (tokens)
Pho (an n-gram with r =5) =20
2000x110
=0.001
![Page 53: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/53.jpg)
Cross-Validation
Held Out estimation is useful if there is a lot of data available If not, we can use each part of the data both as training data
and as held out data. Main methods:
• Deleted Estimation (two-way cross validation) Divide data into part 0 and part 1 In one model use 0 as the training data and 1 as the held out data In another model use 1 as training and 0 as held out data. Do a weighted average of the two models
• Leave-One-Out Divide data into N parts (N = nb of tokens) Leave 1 token out each time Train N language models
![Page 54: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/54.jpg)
Empirical results for bigram data (Church and Gale)
f femp fGT fadd1 fheld-out
0 0.000027 0.000027 0.000137 0.000037
1 0.448 0.446 0.000274 0.396
2 1.25 1.26 0.000411 1.24
3 2.24 2.24 0.000548 2.23
4 3.23 3.24 0.000685 3.22
5 4.21 4.22 0.000822 4.22
6 5.23 5.19 0.000959 5.20
7 6.21 6.21 0.00109 6.21
8 7.21 7.24 0.00123 7.18
9 8.26 8.25 0.00137 8.18
Comparison
![Page 55: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/55.jpg)
Dividing the corpus Training:• Training data (80% of total data)
To build initial estimates (frequency counts)• Held out data (10% of total data)
To refine initial estimates (smoothed estimates) Testing:
• Development test data (5% of total data) To test while developing
• Final test data (5% of total data) To test at the end
But how do we divide?• Randomly select data (ex. sentences, n-grams)
Advantage: Test data is very similar to training data
• Cut large chunks of consecutive data Advantage: Results are lower, but more realistic
![Page 56: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/56.jpg)
Developing and Testing Models1. Write an algorithm2. Train it
With training set & held-out data
3. Test it With development set
4. Note things it does wrong & revise it 5. Repeat 1-5 until satisfied6. Only then, evaluate and publish results
With final test set Better to give final results by testing on n smaller
samples of the test data and averaging
![Page 57: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/57.jpg)
Statistical Estimators
Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• ( Validation:» Held Out Estimation» Cross Validation )
• --> Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 58: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/58.jpg)
Witten-Bell smoothing
intuition: • An unseen n-gram is one that just did not
occur yet• When it does happen, it will be its first
occurrence• So give to unseen n-grams the probability of
seeing a new n-gram
![Page 59: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/59.jpg)
Witten-Bell: the equations
Total probability mass assigned to zero-frequency N-grams:
(NB: T is OBSERVED types, not V) So each zero N-gram gets the probability:
![Page 60: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/60.jpg)
Witten-Bell: why ‘discounting’
Now of course we have to take away something (‘discount’) from the probability of the events seen more than once:
![Page 61: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/61.jpg)
Witten-Bell for bigrams
We `relativize’ the types to the previous word:
this probability mass, must be distributed in equal parts over all unseen bigrams
• Z (w1) : number of unseen n-grams starting with w1
for each unseen eventP(w2|w1) =1
Z(w1)x
T(w1)N(w1) + T(w1)
![Page 62: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/62.jpg)
Small example
all unseen bigrams starting with a will share a probability mass of
each unseen bigrams starting with a will have an equal part of this
a b c d … Total = N(w1)
nb seen tokens
T(w1)
nb seen types
Z(w1)
nb. unseen types
a 10 10 10 0 30 3 1
b 0 0 30 0 30 1 3
c 0 0 300 0 300 1 3
d
…
P(d|a) =1
Z(a)×
T(a)N(a) + T(a)
=11×0.091=0.091
T(a)
N(a) + T(a)=
330 + 3
=0.091
![Page 63: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/63.jpg)
all unseen bigrams starting with b will share a probability mass of
each unseen bigrams starting with b will have an equal part of this
Small example (con’t)
T(b)
N(b) + T(b)=
130 +1
=0.032
P(a|b) =1
Z(b)x
T(b)N(b)+ T(b)
=13x0.032=0.011
P(b|b) =1
Z(b)x
T(b)N(b)+ T(b)
=13x0.032=0.011
P(d|b) =1
Z(b)x
T(b)N(b)+ T(b)
=13x0.032=0.011
a b c d … Total = N(w1)
nb seen tokens
T(w1)
nb seen types
Z(w1)
nb. unseen types
a 10 10 10 0 30 3 1
b 0 0 30 0 30 1 3
c 0 0 300 0 300 1 3
d
…
![Page 64: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/64.jpg)
all unseen bigrams starting with c will share a probability mass of
each unseen bigrams starting with c will have an equal part of this
Small example (con’t)
T(c)
N(c) + T(c)=
1300 +1
=0.0033
P(a|c) =1
Z(c)x
T(c)N(c)+ T(c)
=13x0.0033=0.0011
P(b|c) =1
Z(c)x
T(c)N(c)+ T(c)
=13x0.0033=0.0011
P(d|c) =1
Z(c)x
T(c)N(c)+ T(c)
=13x0.0033=0.0011
a b c d … Total = N(w1)
nb seen tokens
T(w1)
nb seen types
Z(w1)
nb. unseen types
a 10 10 10 0 30 3 1
b 0 0 30 0 30 1 3
c 0 0 300 0 300 1 3
d
…
![Page 65: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/65.jpg)
Unseen bigrams:• To get from the probabilities back to the counts, we know that:
// N (w1) = nb of bigrams starting with w1
• so we get:
P(w2|w1) =C(w2|w1)
N(w1)
C(w2|w1) =P(w2|w1)×N(w1)
=1
Z(w1)×
T(w1)N(w1)+ T(w1)
×N(w1)
=T(w1)Z(w1)
×N(w1)
N(w1)+ T(w1)
More formally
![Page 66: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/66.jpg)
The restaurant example The original counts were:
T(w)= number of different seen bigrams types starting with w we have a vocabulary of 1616 words, so we can compute Z(w)= number of unseen bigrams types starting with w
Z(w) = 1616 - T(w)
N(w) = number of bigrams tokens starting with w
I want to eat Chinese
food lunch … N(w) seen bigram tokens
T(w) seen bigram types
Z(w) unseen bigram types
I 8 1087 0 13 0 0 0 3437 95 1521
want 3 0 786 0 6 8 6 1215 76 1540
to 3 0 10 860 3 0 12 3256 130 1486
eat 0 0 2 0 19 2 52 938 124 1492
Chinese 2 0 0 0 0 120 1 213 20 1592
food 19 0 17 0 0 0 0 1506 82 534
lunch 4 0 0 0 0 1 0 459 45 1571
![Page 67: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/67.jpg)
Witten-Bell smoothed count
I want to eat Chinese food lunch … Total
I 7.78 1057.76 .061 12.65 .06 .06 .06 3437
want 2.82 .05 739.73 .05 5.65 7.53 5.65 1215
to 2.88 .08 9.62 826.98 2.88 .08 12.50 3256
eat .07 .07 19.43 .07 16.78 1.77 45.93 938
Chinese 1.74 .01 .01 .01 .01 109.70 .91 213
food 18.02 .05 16.12 .05 .05 .05 .05 1506
lunch 3.64 .03 .03 .03 .03 0.91 .03 459
• the count of the unseen bigram “I lunch”
• the count of the seen bigram “want to”
Witten-Bell smoothed bigram counts:
T(I)
Z(I)x
N(I)
N(I)+ T(I)=
951521
x3437
3437 + 95=0.06
count(want to)xN(want)
N(want)+ T(want)=786x
12151215 + 76
=739.73
![Page 68: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/68.jpg)
Witten-Bell smoothed probabilities
I want to eat Chinese food lunch … Total
I .0022
(7.78/3437)
.3078 .000002 .0037
.000002 .000002 .000002 1
want .00230 .00004 .6088 .00004 .0047 .0062 .0047 1
to .00009 .00003 .0030 .2540 .00009 .00003 .0038 1
eat .00008 .00008 .0021 .00008 .0179 .0019 .0490 1
Chinese .00812 .00005 .00005 .00005 .00005 .5150 .0042 1
food .0120 .00004 .0107 .00004 .00004 .00004 .00004 1
lunch .0079 .00006 .00006 .00006 .00006 .0020 .00006 1
Witten-Bell normalized bigram probabilities:
![Page 69: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/69.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• Validation:» Held Out Estimation» Cross Validation
• Witten-Bell smoothing• --> Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 70: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/70.jpg)
Good-Turing Estimator Based on the assumption that words have a binomial
distribution Works well in practice (with large corpora) Idea:
• Re-estimate the probability mass of n-grams with zero (or low) counts by looking at the number of n-grams with higher counts
• Ex:
c*=(c+1)Nc+1
NcNb of ngrams that occur c
times
Nb of ngrams that occur c+1 times
new count for bigrams that never occurred =nbofbigramsthatoccurredoncenbofbigramsthatneveroccurred
co =(0 +1)N1
No
![Page 71: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/71.jpg)
Good-Turing Estimator (con’t) In practice c* is not used for all counts c large counts (> a threshold k) are assumed to be
reliable
If c > k (usually k = 5)c* = c
If c <= k
c* =
(c+1)Nc+1
Nc
−c(k+1)Nk+1
N1
1−(k+1)Nk+1
N1
for1≤c≤k
![Page 72: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/72.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• ( Validation:» Held Out Estimation» Cross Validation )
• Witten-Bell smoothing• Good-Turing smoothing
--> Combining Estimators• Simple Linear Interpolation• General Linear Interpolation • Katz’s Backoff
![Page 73: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/73.jpg)
Combining Estimators so far, we gave the same probability to all unseen n-
grams • we have never seen the bigrams
journal of Punsmoothed(of |journal) = 0 journal from Punsmoothed(from |journal) = 0 journal never Punsmoothed(never |journal) = 0
• all models so far will give the same probability to all 3 bigrams
but intuitively, “journal of” is more probable because...• “of” is more frequent than “from” & “never” • unigram probability P(of) > P(from) > P(never)
![Page 74: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/74.jpg)
observation: • unigram model suffers less from data sparseness than
bigram model• bigram model suffers less from data sparseness than
trigram model• …
so use a lower model estimate, to estimate probability of unseen n-grams
if we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model
Combining Estimators (con’t)
![Page 75: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/75.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• Validation:» Held Out Estimation» Cross Validation
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• --> Simple Linear Interpolation• General Linear Interpolation• Katz’s Backoff
![Page 76: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/76.jpg)
Simple Linear Interpolation Solve the sparseness in a trigram model by mixing with
bigram and unigram models Also called:
• linear interpolation,• finite mixture models • deleted interpolation
Combine linearlyPli(wn|wn-2,wn-1) = 1P(wn) + 2P(wn|wn-1) + 3P(wn|wn-2,wn-1)
• where 0 i 1 and i i =1
![Page 77: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/77.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• Validation:» Held Out Estimation» Cross Validation
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• --> General Linear Interpolation• Katz’s Backoff
![Page 78: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/78.jpg)
General Linear Interpolation
In simple linear interpolation, the weights i are constant So the unigram estimate is always combined with the same
weight, regardless of whether the trigram is accurate (because there is lots of data) or poor
We can have a more general and powerful model where i are a function of the history h
• where 0 i(h) 1 and i i(h) =1
Having a specific (h) per n-gram is not a good idea, but we can set a (h) according to the frequency of the n-gram
Pgli(wn |wn-2 ,wn-1) =1P(wn)+ 2 (wn-1)P(wn|wn-1)+ 3(wn-2 ,wn-1)P (wn|wn-2 ,wn-1)
![Page 79: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/79.jpg)
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
• Add-one -- Laplace• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• Validation:» Held Out Estimation» Cross Validation
• Witten-Bell smoothing• Good-Turing smoothing
Combining Estimators• Simple Linear Interpolation• General Linear Interpolation• --> Katz’s Backoff
![Page 80: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/80.jpg)
Smoothing of Conditional Probabilities
p(Angeles | to, Los)
If „to Los Angeles“ is not in the training corpus,the smoothed probability p(Angeles | to, Los) isidentical to p(York | to, Los).
However, the actual probability is probably close tothe bigram probability p(Angeles | Los).
Backoff Smoothing
![Page 81: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/81.jpg)
(Wrong) Back-off Smoothing of trigram probabilities
if C(w‘, w‘‘, w) > 0P*(w | w‘, w‘‘) = P(w | w‘, w‘‘)
else if C(w‘‘, w) > 0P*(w | w‘, w‘‘) = P(w | w‘‘)
else if C(w) > 0P*(w | w‘, w‘‘) = P(w)
elseP*(w | w‘, w‘‘) = 1 / #words
Backoff Smoothing
![Page 82: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/82.jpg)
Problem: not a probability distribution
Solution:
Combination of Back-off and frequency discounting
P(w | w1,...,wk) = C*(w1,...,wk,w) / N if C(w1,...,wk,w) > 0
else
P(w | w1,...,wk) = (w1,...,wk) P(w | w2,...,wk)
Backoff Smoothing
![Page 83: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/83.jpg)
The backoff factor is defined s.th. the probabilitymass assigned to unobserved trigrams
(w1,...,wk) P(w | w2,...,wk)) w: C(w
1,...,w
k,w)=0
is identical to the probability mass discounted fromthe observed trigrams.
1- P(w | w1,...,wk)) w: C(w
1,...,w
k,w)>0
Therefore, we get:
(w1,...,wk) = ( 1 - P(w | w1,...,wk)) / (1 - P(w | w2,...,wk)) w: C(w
1,...,w
k,w)>0 w: C(w
1,...,w
k ,w)>0
Backoff Smoothing
![Page 84: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/84.jpg)
Other applications of LM
Author / Language identification
hypothesis: texts that resemble each other (same author, same language) share similar characteristics • In English character sequence “ing” is more probable than in
French
Training phase: • construction of the language model • with pre-classified documents (known language/author)
Testing phase: • evaluation of unknown text (comparison with language model)
![Page 85: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/85.jpg)
Example: Language identification
bigram of characters • characters = 26 letters (case insensitive)• possible variations: case sensitivity, punctuation,
beginning/end of sentence marker, …
![Page 86: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/86.jpg)
A B C D … Y Z
A 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
B 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
C 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
D 0.0042 0.0014 0.0014 0.0014 … 0.0014 0.0014
E 0.0097 0.0014 0.0014 0.0014 … 0.0014 0.0014
… … … … … … … 0.0014
Y 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
Z 0.0014 0.0014 0.0014 0.0014 0.0014 0.0014 0.0014
1. Train a language model for English:
2. Train a language model for French
3. Evaluate probability of a sentence with LM-English & LM-French
4. Highest probability -->language of sentence
![Page 87: Part II. Statistical NLP Advanced Artificial Intelligence N-Gramms Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most slides taken](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d235503460f949f9595/html5/thumbnails/87.jpg)
Claim
A useful part of the knowledge needed to allow Word Prediction can be captured using simple statistical techniques.
Compute:- probability of a sequence- likelihood of words co-occurring
It can be useful to do this.