Language Models & Smoothing
Shallow Processing Techniques for NLP, Ling570
October 19, 2011
Announcements
Career exploration talk: Bill McNeill
Thursday (10/20): 2:30-3:30pm, Thomson 135 & Online (Treehouse URL)
Treehouse meeting: Friday 10/21: 11-12, thesis topic brainstorming
GP Meeting: Friday 10/21: 3:30-5pm, PCAR 291 & Online (…/clmagrad)
Roadmap
Ngram language models
Constructing language models
Generative language models
Evaluation: training and testing; perplexity
Smoothing: Laplace smoothing; Good-Turing smoothing; interpolation & backoff
Ngram Language Models
Independence assumptions moderate data needs
Approximate the probability given all prior words by assuming a finite history
Unigram: probability of a word in isolation
Bigram: probability of a word given 1 previous word
Trigram: probability of a word given 2 previous words
N-gram approximation:
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
Bigram sequence:
P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
Berkeley Restaurant Project Sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Bigram Counts
Out of 9222 sentences, e.g., "i want" occurred 827 times
Bigram Probabilities
Divide bigram counts by prefix unigram counts to get probabilities:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
Bigram Estimates of Sentence Probabilities
P(<s> i want english food </s>)
= P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food)
= .000031
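Plugging the bigram estimates into this product is mechanical. In the sketch below, the three probabilities not shown on this slide (P(want|i), P(food|english), P(</s>|food)) are assumed illustrative values, not the course's exact table entries:

```python
# Sketch: sentence probability under a bigram model.
# Values marked "assumed" are illustrative, not from the slide's table.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,         # assumed
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,    # assumed
    ("food", "</s>"): 0.68,      # assumed
}

def sentence_prob(words, p):
    """Multiply bigram probabilities over the <s>-/-</s>-padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        prob *= p[(w1, w2)]
    return prob

prob = sentence_prob(["i", "want", "english", "food"], bigram_p)
# With these assumed values the product comes out near the slide's .000031
```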
Kinds of Knowledge
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat|to) = .28
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25
World knowledge
Syntax
Discourse
What types of knowledge are captured by ngram models?
Probabilistic Language Generation
Coin-flipping models: a sentence is generated by a randomized algorithm
The generator can be in one of several "states"
Flip coins to choose the next state
Flip other coins to decide which letter or word to output
Generated Language: Effects of N
1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Word Models: Effects of N
1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Shakespeare
The Wall Street Journal is Not Shakespeare
Evaluation
Evaluation - General
Evaluation is crucial for NLP systems
Required for most publishable results
Should be integrated early
Many factors: data, metrics, prior results, …
Evaluation Guidelines
Evaluate your system
Use standard metrics
Use (standard) training/dev/test sets
Describing experiments (intrinsic vs extrinsic):
Clearly lay out the experimental setting
Compare to baselines and previous results
Perform error analysis
Show utility in a real application (ideally)
Data Organization
Training:
Training data: used to learn model parameters
Held-out data: used to tune additional parameters
Development (Dev) set: used to evaluate the system during development; helps avoid overfitting
Test data: used for final, blind evaluation
Typical division of data: 80/10/10
Tradeoffs
Cross-validation
Evaluating LMs
Extrinsic evaluation (aka in vivo): embed alternate models in the system and see which improves the overall application (MT, IR, …)
Intrinsic evaluation: a metric applied directly to the model, independent of the larger application (e.g., perplexity)
Why not just extrinsic?
Perplexity
Perplexity
Intuition: a better model will have a tighter fit to the test data, i.e., it will yield higher probability on the test data
Formally, PP(W) = P(w_1 w_2 … w_N)^{-1/N}
For bigrams: PP(W) = (∏_{i=1}^{N} P(w_i | w_{i-1}))^{-1/N}
Inversely related to the probability of the sequence: higher probability => lower perplexity
Can be viewed as the average branching factor of the model
Perplexity Example
Alphabet: 0,1,…,9; equiprobable: P(X) = 1/10
PP(W) = (∏_{i=1}^{N} 1/10)^{-1/N} = ((1/10)^N)^{-1/N} = 10
If the probability of 0 is higher, PP(W) will be lower
Thinking about Perplexity
Given some vocabulary V with a uniform distribution, i.e., P(w) = 1/|V|
Under a unigram LM, the perplexity is
PP(W) = (∏_{i=1}^{N} 1/|V|)^{-1/N} = ((1/|V|)^N)^{-1/N} = |V|
Perplexity is the effective branching factor of the language
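The uniform-distribution result is easy to check numerically. A minimal sketch (the digit alphabet and test length are arbitrary choices for illustration):

```python
import math

# Check that a uniform model's perplexity equals |V|.
# Computed in log space for numerical stability.
def perplexity(probs):
    """PP(W) = P(W)^(-1/N) = 2^(-(1/N) * sum log2 p)."""
    log_p = sum(math.log2(p) for p in probs)
    return 2 ** (-log_p / len(probs))

vocab_size = 10                        # alphabet 0,1,...,9
word_probs = [1 / vocab_size] * 25     # any test sequence under the uniform LM
# perplexity(word_probs) -> 10.0: the effective branching factor is |V|
```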
Perplexity and Entropy
Given that H(L,P) = -(1/N) log_2 P(W),
consider the perplexity equation:
PP(W) = P(W)^{-1/N} = 2^{log_2 P(W)^{-1/N}} = 2^{-(1/N) log_2 P(W)} = 2^{H(L,P)}
where H is the entropy of the language L
Entropy
Information theoretic measure
Measures information in grammar
Conceptually, a lower bound on the # of bits to encode
Entropy: H(X), where X is a random variable and p is its probability function:
H(X) = -∑_{x ∈ X} p(x) log_2 p(x)
E.g., 8 things: number them as a code => 3 bits/transmission
Alternatively: short code if high probability, longer if lower; can reduce the average
Computing Entropy
Picking horses (Cover and Thomas)
Send a message: identify one horse of 8
If all horses are equally likely, p(i) = 1/8:
H(X) = -∑_{i=1}^{8} p(i) log_2 p(i) = -∑_{i=1}^{8} (1/8) log_2 (1/8) = 3 bits
Some horses more likely: 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64
H(X) = -∑_{i=1}^{8} p(i) log_2 p(i) = 2 bits
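Both horse-race results can be verified with a direct transcription of the entropy definition (nothing course-specific here):

```python
import math

# H(X) = -sum_x p(x) log2 p(x); zero-probability outcomes contribute 0.
def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

uniform = [1 / 8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
# entropy(uniform) -> 3.0 bits; entropy(skewed) -> 2.0 bits
```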
Entropy of a Sequence
Basic sequence:
(1/n) H(W_1^n) = -(1/n) ∑_{W_1^n ∈ L} p(W_1^n) log_2 p(W_1^n)
Entropy of a language: infinite lengths
Assume stationary & ergodic:
H(L) = lim_{n→∞} -(1/n) ∑_{w_1…w_n ∈ L} p(w_1, …, w_n) log_2 p(w_1, …, w_n)
= lim_{n→∞} -(1/n) log_2 p(w_1, …, w_n)
Computing P(s): s is a sentence
Let s = w1 w2 … wn
Assume a bigram model:
P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)
Out-of-vocabulary words (OOV): if an n-gram contains an OOV word, remove that n-gram from the computation and increment oov_count
N = sent_leng + 1 – oov_count
Computing P(s): s is a sentence
Let s = w1 w2 … wn
Assume a trigram model:
P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
≈ P(w1|BOS) * P(w2|BOS w1) * … * P(wn|wn-2 wn-1) * P(EOS|wn-1 wn)
Out-of-vocabulary words (OOV): if an n-gram contains an OOV word, remove that n-gram from the computation and increment oov_count
N = sent_leng + 1 – oov_count
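The OOV bookkeeping can be sketched for the bigram case; the tiny model and vocabulary below are hypothetical, purely to exercise the mechanics:

```python
import math

# Sketch: log_2 P(s) under a bigram model, dropping n-grams that contain
# an OOV word and setting N = sent_leng + 1 - oov_count, as on the slide.
# The toy probability table and vocabulary are hypothetical.
bigram_p = {
    ("<s>", "john"): 0.4, ("john", "called"): 0.5,
    ("called", "mary"): 0.3, ("mary", "</s>"): 0.6,
}
vocab = {"john", "called", "mary"}

def sentence_logprob(words, p, vocab):
    padded = ["<s>"] + words + ["</s>"]
    oov_count = sum(1 for w in words if w not in vocab)
    logprob = 0.0
    for w1, w2 in zip(padded, padded[1:]):
        # Drop any n-gram containing an OOV word
        if (w1 != "<s>" and w1 not in vocab) or (w2 != "</s>" and w2 not in vocab):
            continue
        logprob += math.log2(p[(w1, w2)])
    n = len(words) + 1 - oov_count   # N = sent_leng + 1 - oov_count
    return logprob, n
```

Note that one OOV token removes two bigrams from the product but decrements N by only one, since oov_count counts words, not n-grams.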
Computing Perplexity
PP(W) = 2^{-(1/N) log_2 P(W)}
where W is a set of m sentences: s1, s2, …, sm
log P(W) = ∑_{i=1}^{m} log P(si)
N = word_count + sent_count – oov_count
Perplexity Model Comparison
Compare models with different history lengths
Homework #4
Building Language Models
Step 1: Count ngrams
Step 2: Build the model – compute probabilities: MLE; smoothed (Laplace, Good-Turing)
Step 3: Compute perplexity
Steps 2 & 3 depend on model/smoothing choices
Q1: Counting N-grams
Collect real counts from the training data:
ngram_count.* training_data ngram_count_file
Output ngrams and real counts c(w1), c(w1, w2), and c(w1, w2, w3).
Given a sentence: John called Mary
Insert BOS and EOS: <s> John called Mary </s>
Q1: Output
Count, then key:
875 a
…
200 the book
…
20 thank you very
In "chunks" – unigrams, then bigrams, then trigrams
Sort in decreasing order of count within each chunk
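The counting step can be sketched as follows; the ngram_count tool's internals are not specified in the slides, so this just shows BOS/EOS insertion and sliding-window counting:

```python
from collections import Counter

# Sketch of Q1: collect unigram, bigram, and trigram counts from training
# sentences, inserting <s> and </s> markers as on the slide.
def count_ngrams(sentences, max_n=3):
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

counts = count_ngrams(["John called Mary"])
# counts[2][("John", "called")] == 1, etc.; sorting by decreasing count
# within each chunk would then produce the required output order.
```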
Q2: Create Language Model
build_lm.* ngram_count_file lm_file
Store the logprob of ngrams and other parameters in the lm
There are actually three language models: P(w3), P(w3|w2), and P(w3|w1,w2)
The output file is in a modified ARPA format (see next slide)
Lines for n-grams are sorted by n-gram counts
Modified ARPA Format\data\
ngram 1: type = xx; token = yy
ngram 2: type = xx; token = yy
ngram 3: type = xx; token = yy
\1-grams:
count prob logprob w1
\2-grams:
count prob logprob w1 w2
\3-grams:
count prob logprob w1 w2 w3
# xx is the type count; yy is the token count
# prob is P(w1), P(w2|w1), or P(w3|w1,w2) for the respective section
# count is C(w1), C(w1 w2), or C(w1 w2 w3)
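Emitting this layout can be sketched as below. The log base (log10, as in standard ARPA files) and the exact sort key are assumptions, since the slide does not pin them down, and the toy counts/probabilities passed in are hypothetical:

```python
import math

# Sketch: write the modified ARPA format from the slide.
# models: {n: {ngram_tuple: (count, prob)}} -- hypothetical input shape.
def write_modified_arpa(models):
    lines = ["\\data\\"]
    for n in sorted(models):
        types = len(models[n])                         # distinct n-grams
        tokens = sum(c for c, _ in models[n].values()) # total occurrences
        lines.append(f"ngram {n}: type = {types}; token = {tokens}")
    for n in sorted(models):
        lines.append(f"\\{n}-grams:")
        # "Lines for n-grams are sorted by n-gram counts" (descending assumed)
        for ngram, (count, prob) in sorted(models[n].items(),
                                           key=lambda kv: -kv[1][0]):
            lines.append(f"{count} {prob:.6f} {math.log10(prob):.6f} "
                         + " ".join(ngram))
    return "\n".join(lines)
```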
Q3: Calculating Perplexity
pp.* lm_file n test_file outfile
Compute perplexity for n-gram history given the model:

sum = 0; count = 0
for each sentence s in test_file:
    for each word wi in s:
        if the n-gram with history of length n exists:
            compute P(wi | wi-n+1 … wi-1)
            sum += log_2 P(wi | wi-n+1 … wi-1)
            count++
total = -sum / count
pp(test_file) = 2^total
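The loop above in runnable form; the toy log-probability table is hypothetical, and unknown-word/unseen-ngram handling is reduced to skipping missing entries:

```python
# Python rendering of the Q3 pseudocode: corpus perplexity under a bigram
# model. The log_2 probability table below is hypothetical toy data.
bigram_logp = {
    ("<s>", "a"): -1.0, ("a", "b"): -2.0, ("b", "</s>"): -1.0,
}

def corpus_perplexity(sentences, logp):
    total_logp, count = 0.0, 0
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for w1, w2 in zip(padded, padded[1:]):
            if (w1, w2) in logp:          # n-gram with known history exists
                total_logp += logp[(w1, w2)]
                count += 1
            # unknown words / unseen n-grams are skipped, as in the output
            # format on the next slide (logged as -inf, excluded from N)
    return 2 ** (-total_logp / count)

pp = corpus_perplexity([["a", "b"]], bigram_logp)
# total logprob = -4 over 3 scored n-grams: pp = 2^(4/3)
```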
Output format
Sent #1: <s> Influential members of the House … </s>
1: log P(Influential | <s>) = -inf (unknown word)
2: log P(members | <s> Influential) = -inf (unseen ngrams)
4: log P(the | members of) = -0.673243382588536
1 sentence, 38 words, 9 OOVs
logprob=-82.8860891791949 ppl=721.341645452964
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=50 word_num=1175 oov_num=190
logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283
Q4: Compute PerplexityCompute perplexity for different n