Chapter 6: Statistical Inference: n-gram Models over Sparse Data
TDM Seminar
Jonathan Henke
http://www.sims.berkeley.edu/~jhenke/Tdm/TDM-Ch6.ppt
Slide set modified slightly by Juggy for teaching a class on NLP using the same book: http://www.csee.wvu.edu/classes/nlp/Spring_2007/
Modified slides are marked with a
Basic Idea:
• Examine short sequences of words
• How likely is each sequence?
• “Markov Assumption” – word is affected only by its “prior local context” (last few words)
Possible Applications:
• OCR / Voice recognition – resolve ambiguity
• Spelling correction
• Machine translation
• Confirming the author of a newly discovered work
• “Shannon game”
“Shannon Game”
• Claude E. Shannon. “Prediction and Entropy of Printed English”, Bell System Technical Journal 30:50-64. 1951.
• Predict the next word, given (n-1) previous words
• Determine probability of different sequences by examining training corpus
Forming Equivalence Classes (Bins)
• “n-gram” = sequence of n words
– bigram
– trigram
– four-gram
• Task at hand: estimate $P(w_n \mid w_1, \ldots, w_{n-1})$
Reliability vs. Discrimination
“large green ___________”
tree? mountain? frog? car?
“swallowed the large green ________”
pill? broccoli?
Reliability vs. Discrimination
• larger n: more information about the context of the specific instance (greater discrimination)
• smaller n: more instances in training data, better statistical estimates (more reliability)
Selecting an n
Vocabulary (V) = 20,000 words
n: Number of bins
2 (bigrams): 20,000 × 19,999 ≈ 400 million
3 (trigrams): 20,000 × 19,999 × 19,998 ≈ 8 trillion
4 (4-grams): ≈ 1.6 × 10^17
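The bin counts above can be checked with a short sketch (the slide counts ordered sequences of distinct words, i.e. 20,000 × 19,999 × …; V^n is the usual upper bound; `num_bins` is my name, not the book's):

```python
def num_bins(vocab_size: int, n: int, distinct: bool = True) -> int:
    """Number of possible n-grams over a vocabulary.

    distinct=True reproduces the slide's 20,000 * 19,999 * ... count
    (no word repeated); distinct=False gives the usual V**n bound.
    """
    if not distinct:
        return vocab_size ** n
    count = 1
    for i in range(n):
        count *= vocab_size - i
    return count

V = 20_000
print(num_bins(V, 2))                  # 399,980,000 (~400 million)
print(num_bins(V, 3))                  # ~8 trillion
print(num_bins(V, 4, distinct=False))  # 1.6 x 10**17
```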
Statistical Estimators
• Given the observed training data …
• How do you develop a model (probability distribution) to predict future events?
$$P(w_n \mid w_1 \ldots w_{n-1}) = \frac{P(w_1 \ldots w_n)}{P(w_1 \ldots w_{n-1})}$$
Maximum Likelihood Estimation (MLE)
• Example: 10 training instances of “comes across”
– 8 of them were followed by “as”
– 1 followed by “a”
– 1 followed by “more”
• Resulting MLE estimates:
– P(as) = 0.8
– P(a) = 0.1
– P(more) = 0.1
– P(x) = 0 for any other word x
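A minimal sketch of the MLE computation for this example (the function name is mine; the computation is just relative frequency):

```python
from collections import Counter

def mle_next_word(continuations):
    """MLE distribution over the next word, from observed continuations."""
    counts = Counter(continuations)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# 10 training instances of "comes across": 8 x "as", 1 x "a", 1 x "more"
p = mle_next_word(["as"] * 8 + ["a"] + ["more"])
print(p["as"], p["a"], p["more"])   # 0.8 0.1 0.1
print(p.get("the", 0.0))            # 0.0 -- any unseen word gets zero
```

The last line is the core problem of this chapter: MLE assigns zero probability to everything outside the training data.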
Statistical Estimators
Example:
Corpus: five Jane Austen novels
N = 617,091 words
V = 14,585 unique words
Task: predict the next word of the trigram “inferior to ________”
from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”
“Smoothing”
• Develop a model which decreases the probability of seen events and allows for the occurrence of previously unseen n-grams
• a.k.a. “discounting methods”
• “Validation”: smoothing methods which utilize a second batch of held-out data.
Laplace’s Law (adding one)
Laplace’s Law (adding one)
Laplace’s Law
Lidstone’s Law
$$P_{\text{Lid}}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n) + \lambda}{N + B\lambda}$$

P = probability of specific n-gram
C = count of that n-gram in training data
N = total n-grams in training data
B = number of “bins” (possible n-grams)
λ = small positive number

M.L.E.: λ = 0
Laplace’s Law: λ = 1
Jeffreys-Perks Law: λ = ½
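The whole family is one formula with a different λ. A sketch, using the “was not” counts that appear on a later slide:

```python
def lidstone(count, total, bins, lam):
    """P_Lid = (C + lambda) / (N + B * lambda)."""
    return (count + lam) / (total + bins * lam)

# "was not" numbers from a later slide:
# C("not" after "was") = 608, N = C("was") = 9409, B = 14589 word types.
mle = lidstone(608, 9409, 14589, 0.0)   # lambda = 0   -> MLE
ele = lidstone(608, 9409, 14589, 0.5)   # lambda = 1/2 -> Jeffreys-Perks (ELE)
lap = lidstone(608, 9409, 14589, 1.0)   # lambda = 1   -> Laplace
print(round(mle, 3), round(ele, 3))     # 0.065 0.036
```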
Expected Likelihood Estimation

Rank  Word      MLE    ELE
1     not       0.065  0.036
2     a         0.052  0.030
3     the       0.033  0.019
4     to        0.031  0.017
…
1482  inferior  0      0.00003

“was” appeared 9409 times; “not” appeared after “was” 608 times.
Total # of word types = 14589
MLE = 608/9409 = 0.065
ELE = (608 + 0.5)/(9409 + 14589 × 0.5) = 0.036
The new estimate has been discounted by roughly 50%.
Jeffreys-Perks Law
Objections to Lidstone’s Law
• Need an a priori way to determine λ.
• Predicts all unseen events to be equally likely
• Gives probability estimates linear in the M.L.E. frequency
Smoothing
• Lidstone’s Law (incl. LaPlace’s Law and Jeffreys-Perks Law): modifies the observed counts
• Other methods: modify probabilities.
Held-Out Estimator
• How much of the probability distribution should be “held out” to allow for previously unseen events?
• Validate by holding out part of the training data.
• How often do events unseen in the training data occur in the validation data? (e.g., to choose λ for the Lidstone model)
Held-Out Estimator
$$T_r = \sum_{\{w_1 \ldots w_n \,:\, C_1(w_1 \ldots w_n) = r\}} C_2(w_1 \ldots w_n)$$

$$P_{\text{ho}}(w_1 \ldots w_n) = \frac{T_r}{N_r N}, \quad \text{where } r = C_1(w_1 \ldots w_n)$$

C1(w1… wn) = frequency of w1… wn in the training data
C2(w1… wn) = frequency of w1… wn in the held-out data
Nr = number of n-grams with frequency r in the training text
Tr = total number of times that the n-grams appearing r times in the training text appeared in the held-out data
Average frequency in the held-out data of the n-grams seen r times in training = Tr/Nr
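A sketch of the held-out estimator over lists of n-gram tokens. One assumption in this sketch: N is taken as the number of n-gram tokens in the held-out text, which makes the estimates sum to at most 1 and leaves mass for unseen n-grams.

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """Held-out estimates P_ho = T_r / (N_r * N) for each n-gram seen
    in training, where r is its training frequency."""
    c1 = Counter(train_ngrams)    # C1: training frequencies
    c2 = Counter(heldout_ngrams)  # C2: held-out frequencies
    N = len(heldout_ngrams)       # held-out n-gram tokens (assumption)
    by_r = {}
    for ngram, r in c1.items():
        by_r.setdefault(r, []).append(ngram)
    probs = {}
    for r, ngrams in by_r.items():
        Nr = len(ngrams)                  # types with training frequency r
        Tr = sum(c2[g] for g in ngrams)   # their total held-out count
        for g in ngrams:
            probs[g] = Tr / (Nr * N)
    return probs
```

Every n-gram seen r times in training gets the same estimate, namely its group's average held-out frequency divided by N.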
Testing Models
• Hold out ~5–10% of the data for testing
• Hold out ~10% for validation (smoothing)
• For testing: useful to test on multiple sets of data and report the variance of the results. Are results (good or bad) just the result of chance?
Cross-Validation (a.k.a. deleted estimation)
• Use the same data for both training and validation
• Divide the training data into 2 parts (A and B):
(1) Train on A, validate on B → Model 1
(2) Train on B, validate on A → Model 2
• Combine the two models: Model 1 + Model 2 → Final Model
Cross-Validation
Two estimates:

$$P_{\text{ho}}^{01} = \frac{T_r^{01}}{N_r^0 N} \qquad P_{\text{ho}}^{10} = \frac{T_r^{10}}{N_r^1 N}$$

Combined estimate (arithmetic mean):

$$P_{\text{ho}} = \frac{T_r^{01} + T_r^{10}}{N (N_r^0 + N_r^1)}$$

$N_r^a$ = number of n-grams occurring r times in the a-th part of the training set
$T_r^{ab}$ = total number of those found in the b-th part
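A sketch of the combined estimate over two halves of training data. Assumptions in this sketch: the halves are given as lists of n-gram tokens, and N is their combined size; the result maps each count r to the deleted-estimation probability for n-grams with that count.

```python
from collections import Counter

def deleted_estimation(part0, part1):
    """Combined estimate P = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1)),
    one value per observed count r."""
    c0, c1 = Counter(part0), Counter(part1)
    N = len(part0) + len(part1)
    out = {}
    for r in set(c0.values()) | set(c1.values()):
        types0 = [g for g, c in c0.items() if c == r]   # the N_r^0 types
        types1 = [g for g, c in c1.items() if c == r]   # the N_r^1 types
        Tr01 = sum(c1[g] for g in types0)   # their counts in part 1
        Tr10 = sum(c0[g] for g in types1)   # their counts in part 0
        out[r] = (Tr01 + Tr10) / (N * (len(types0) + len(types1)))
    return out
```

Pooling both directions uses all the data for training and all of it for validation, which stabilizes the estimates compared to a single held-out split.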
Good-Turing Estimator
r* = “adjusted frequency”:

$$r^* = (r+1)\,\frac{E(N_{r+1})}{E(N_r)} \qquad P_{GT} = \frac{r^*}{N}$$

Nr = number of n-gram types which occur r times
E(Nr) = expected value of Nr
Since E(N_{r+1}) < E(N_r), the adjusted counts are discounted. Typically the adjustment is applied only for r < some constant k, since N_{r+1} is 0 when r is the maximum observed frequency.
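A sketch of the Good-Turing adjustment. One simplification here (commonly made): raw counts-of-counts stand in for the expectations E(N_r); real systems smooth the N_r curve first.

```python
from collections import Counter

def good_turing(counts, k=5):
    """Adjusted frequencies r* = (r + 1) * N_{r+1} / N_r for r < k.
    Returns the adjusted counts and the probability mass N_1 / N
    reserved for unseen events."""
    Nr = Counter(counts.values())   # counts of counts
    N = sum(counts.values())
    adjusted = {}
    for ngram, r in counts.items():
        if r < k and Nr.get(r + 1, 0) > 0:
            adjusted[ngram] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            adjusted[ngram] = float(r)   # leave high counts unadjusted
    return adjusted, Nr.get(1, 0) / N
```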
Count of counts in Austen corpus
Good-Turing Estimates for Austen Corpus
• N1 = number of bigram types seen exactly once in the training data = 138,741
• N = 617,091 [number of words in the Austen corpus]
• N1/N = 0.2248 [probability mass reserved for unseen bigrams under the Good-Turing approach]
• Space of bigrams is the vocabulary squared: 14585² ≈ 2.13 × 10^8
• Total # of distinct bigrams seen in the training set: 199,252
• Probability estimate for each unseen bigram = 0.2248 / (14585² − 199,252) ≈ 1.058 × 10^-9
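The slide's arithmetic can be reproduced directly (numbers taken from the slide):

```python
# Mass reserved for unseen bigrams on the Austen corpus (Good-Turing).
N1 = 138_741            # bigram types seen exactly once
N = 617_091             # corpus size
V = 14_585              # vocabulary size
seen_bigrams = 199_252  # distinct bigrams in the training set

unseen_mass = N1 / N                    # ~0.2248
unseen_types = V ** 2 - seen_bigrams    # bigrams never observed
p_each = unseen_mass / unseen_types     # ~1.058e-9 per unseen bigram
print(round(unseen_mass, 4), p_each)
```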
Discounting Methods
First, determine held-out probability
• Absolute discounting: Decrease probability of each observed n-gram by subtracting a small constant
• Linear discounting: Decrease probability of each observed n-gram by multiplying by the same proportion
Combining Estimators
(Sometimes a trigram model is best, sometimes a bigram model is best, and sometimes a unigram model is best.)
• How can you develop a model to utilize different length n-grams as appropriate?
Simple Linear Interpolation (a.k.a. finite mixture models; a.k.a. deleted interpolation)
• weighted average of unigram, bigram, and trigram probabilities
$$P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1})$$
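A minimal sketch of the interpolation step. The λ weights here are illustrative assumptions; they must be non-negative and sum to 1, and in practice are tuned on held-out data (e.g. by EM).

```python
def p_interpolated(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Weighted average of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9   # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(p_interpolated(0.01, 0.1, 0.4))   # 0.2*0.01 + 0.3*0.1 + 0.5*0.4 ~ 0.232
```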
Katz’s Backing-Off
• Use n-gram probability when enough training data– (when adjusted count > k; k usu. = 0 or 1)
• If not, “back-off” to the (n-1)-gram probability
• (Repeat as needed)
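The recursion above can be sketched as follows. This is a simplified sketch, not the full Katz method: a constant back-off weight alpha is assumed, whereas real Katz back-off computes discounted counts and a per-context weight so that the probabilities normalize.

```python
def p_backoff(ngram, counts, alpha=0.4, k=0):
    """Back off to shorter contexts when data is scarce.
    counts maps word tuples (of any length) to frequencies."""
    if len(ngram) == 1:   # unigram base case: plain relative frequency
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get(ngram, 0) / total
    context = ngram[:-1]
    if counts.get(ngram, 0) > k and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]   # enough data: use MLE
    return alpha * p_backoff(ngram[1:], counts, alpha, k)

counts = {("the",): 4, ("cat",): 2, ("the", "cat"): 2}
print(p_backoff(("the", "cat"), counts))   # seen bigram: 2/4 = 0.5
print(p_backoff(("cat", "the"), counts))   # unseen: alpha * P("the")
```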
Problems with Backing-Off
• If bigram w1 w2 is common
• but trigram w1 w2 w3 is unseen
• may be a meaningful gap, rather than a gap due to chance and scarce data– i.e., a “grammatical null”
• May not want to back-off to lower-order probability
Comparison of Estimators