speech & nlp (fall 2014): n-grams, n-gram computation, word sequence probabilities, n-gram...
TRANSCRIPT
Speech & NLP
www.vkedco.blogspot.com
N-Grams, N-Gram Computation, N-Gram Smoothing, Word Sequence
Probabilities, N-Gram Vectors, Markov Models
Vladimir Kulyukin
Outline
● N-Grams
● N-Gram Computation
● N-Gram Smoothing
● Markov & Hidden Markov Models (HMMs)
N-Grams
Introduction
● Word prediction is a fundamental task in spell checking,
speech recognition, augmentative communication, and many
other areas of NLP
● Word prediction algorithms are typically trained on
various text corpora
● An N-Gram is a word prediction model that uses the
previous N-1 words to predict the next word
● In statistical NLP, an N-Gram model is called a language
model (LM) or grammar
Word Prediction Examples
● See if you can predict the next word:
– It happened a long time …
– She wants to make a collect phone …
– I need to open a bank …
– Nutrition labels include serving …
– Nutrition labels include amounts of total …
Word Prediction Examples
● It happened a long time ago.
● She wants to make a collect phone call.
● I need to open a bank account.
● Nutrition labels include serving sizes.
● Nutrition labels include amounts of total fat | carbohydrate.
Augmentative Communication
● Many people with physical disabilities experience
problems communicating with other people: many of
them cannot speak or type
● Word prediction models can productively augment their
communication efforts by automatically suggesting the
next word to speak or type
● For example, people with disabilities can use simple
hand movements to choose next words to speak or type
Real-Word Spelling Errors
● Real-word spelling errors are real words incorrectly used
● Examples:
– They are leaving in about fifteen minuets to go to her house.
– The study was conducted mainly be John Black.
– The design an construction of the system will take more than a year.
– Hopefully, all with continue smoothly in my absence.
– I need to notified the bank of this problem.
– He is trying to fine out.
K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM
Computing Surveys, Vol. 24, No. 4, Dec. 1992.
Word Sequence Probabilities
● Word prediction is based on evaluating probabilities of specific
word sequences
● To estimate those probabilities we need a corpus (a speech
corpus or a text corpus)
● We also need to determine what is counted and how:
the most important decision is how to handle punctuation
marks and capitalization (text) or filled pauses like uh and
um (speech)
● What is counted and how depends on the task at hand (e.g.,
punctuation is more important to grammar checking than
spelling correction)
Wordforms, Lemmas, Types, Tokens
● A wordform is an alphanumeric sequence actually used
in the corpus (e.g., begin, began, begun)
● A lemma is a set of wordforms (e.g., {begin, began,
begun})
● Token is a synonym of wordform
● A type is a dictionary entry: for example, a dictionary
lists only begin as the main entry for the lemma
{begin, began, begun}
Unsmoothed N-Grams
Notation: A Sequence of N Words
$w_1^n = w_1 w_2 \ldots w_n$ denotes a sequence of $n$ words
● Example: ‘I understand this algorithm.’
– $w_1$ = ‘I’
– $w_2$ = ‘understand’
– $w_3$ = ‘this’
– $w_4$ = ‘algorithm’
– $w_5$ = ‘.’
Probabilities of Word Sequences
$P(w_1^n) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
Example:
P(‘I understand this algorithm.’) =
P(‘I’) *
P(‘understand’|‘I’) *
P(‘this’|‘I understand’) *
P(‘algorithm’|‘I understand this’) *
P(‘.’|‘I understand this algorithm’)
Probabilities of Word Sequences
● How difficult is it to compute the required probabilities?
– P(‘I’) - this is easy to compute (e.g., frequency of ‘I’ in
the corpus over the size of the corpus)
– P(‘understand’|‘I’) – harder but quite feasible
– P(‘this’|‘I understand’) – harder still but feasible
– P(‘algorithm’|‘I understand this’) – even harder (Why?)
– P(‘.’|‘I understand this algorithm’) – possible but
impractical
Probability Approximation
● Markov assumption: we can estimate the probability of a word
given only N previous words
● If N = 0, we have the unigram model (aka 0th-order Markov
model)
● If N = 1, we have the bigram model (aka 1st-order Markov
model)
● If N = 2, we have the trigram model (aka 2nd-order Markov
model)
● N can be larger, but higher-order models are rare because
reliable counts for long histories are hard to obtain
Bigram Probability Approximation
● <S> is the start of sentence mark; this is a dummy mark
● What is the probability of ‘I understand this algorithm.’?
● P(‘I understand this algorithm.’) =
P(‘I’|<S>) *
P(‘understand’|‘I’) *
P(‘this’|‘understand’) *
P(‘algorithm’|‘this’) *
P(‘.’ |‘algorithm’)
Trigram Probability Approximation
● <S> is the start of sentence mark
● In the trigram model, we assume that at the beginning of the
sentence, there are two start marks <S><S>
● P(‘I understand this algorithm.’) =
P(‘I’|<S><S>) *
P(‘understand’|‘<S>I’) *
P(‘this’|‘I understand’) *
P(‘algorithm’|‘understand this’) *
P(‘.’ |‘this algorithm’)
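As a sketch of how these bigram and trigram factorizations can be enumerated in code (Python; the <S> padding convention follows the slides, the function name is my own):

```python
def conditional_contexts(sentence, n):
    """List the (context, word) pairs of an N-gram factorization.

    The sentence is padded with n-1 <S> start marks, as in the
    bigram and trigram slides above.
    """
    tokens = ['<S>'] * (n - 1) + sentence
    pairs = []
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        pairs.append((context, tokens[i]))
    return pairs

# Trigram factorization of 'I understand this algorithm.'
for context, word in conditional_contexts(
        ['I', 'understand', 'this', 'algorithm', '.'], 3):
    print("P(%s | %s)" % (word, ' '.join(context)))
```

Running this prints the five conditional probabilities listed above, starting with P(I | <S> <S>).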
N-Gram Approximation
General formula: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}), \quad N \ge 1$

$P(w_n \mid w_1^{n-1}) \approx P(w_n), \quad N = 1$
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}), \quad N = 2$
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1}), \quad N = 3$
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-3} w_{n-2} w_{n-1}), \quad N = 4$
Bigram Approximation
$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
Bigram Approximation Example
Bigram            Probability
<S> I             0.25
I understand      0.3
understand this   0.05
this algorithm    0.7
algorithm .       0.45
P(‘I understand this algorithm.’) =
P(‘I’|<S>) * P(‘understand’|‘I’) * P(‘this’|‘understand’) * P(‘algorithm’|‘this’) * P(‘.’ |‘algorithm’) =
0.25 * 0.3 * 0.05 * 0.7 * 0.45 =
0.00118125
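The arithmetic above is easy to reproduce; here is a minimal sketch (Python), with the bigram probabilities hard-coded from the table on this slide:

```python
# Bigram probabilities from the table above
bigram_p = {
    ('<S>', 'I'): 0.25,
    ('I', 'understand'): 0.3,
    ('understand', 'this'): 0.05,
    ('this', 'algorithm'): 0.7,
    ('algorithm', '.'): 0.45,
}

def sentence_prob(words):
    """Multiply the bigram probabilities over the <S>-padded sentence."""
    p, prev = 1.0, '<S>'
    for w in words:
        p *= bigram_p[(prev, w)]
        prev = w
    return p

print(sentence_prob(['I', 'understand', 'this', 'algorithm', '.']))  # ~0.00118125
```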
Logprobs
● If we compute raw probability products, we risk numerical
underflow: on long word sequences the product of many small
probabilities eventually rounds to zero
● To address this problem, the probabilities are computed in the
logarithmic space: instead of computing the product of
probabilities, the sum of logarithms of those probabilities is
computed
● log(P(A)P(B)) = log(P(A)) + log(P(B))
● The original product can be recovered by inverting the logarithm: $P(A)P(B) = \exp(\log(P(A)) + \log(P(B)))$
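A sketch of the same example in log space (Python; the probabilities are the five bigram probabilities from the previous slide):

```python
import math

# Bigram probabilities of 'I understand this algorithm.'
probs = [0.25, 0.3, 0.05, 0.7, 0.45]

# Sum of logarithms instead of a product of probabilities
logprob = sum(math.log(p) for p in probs)
print(logprob)            # ~ -6.741

# The original product is recovered by inverting the logarithm
print(math.exp(logprob))  # ~ 0.00118125
```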
Bigram Computation
$P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{\sum_{i=1}^{V} C(w_{n-1} w_i)} = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$

where $C(w_{n-1} w_n)$ is the count of $w_{n-1} w_n$ in the corpus, $V$ is the dictionary size, and $\sum_{i=1}^{V} C(w_{n-1} w_i) = C(w_{n-1})$
Example
Suppose that our vocabulary is {S, a, b, c, E}, where S and E are the start and end marks, so that $w_1 = S$, $w_2 = a$, $w_3 = b$, $w_4 = c$, $w_5 = E$. Suppose our text is "aabcca." We can represent the text as the sequence of characters S, a, a, b, c, c, a, E.

$C(a) = \sum_i C(a w_i) = C(aS) + C(aa) + C(ab) + C(ac) + C(aE) = 0 + 1 + 1 + 0 + 1 = 3$
$C(b) = C(bS) + C(ba) + C(bb) + C(bc) + C(bE) = 0 + 0 + 0 + 1 + 0 = 1$
$C(c) = C(cS) + C(ca) + C(cb) + C(cc) + C(cE) = 0 + 1 + 0 + 1 + 0 = 2$
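A sketch that reproduces these counts (Python; the S/E padding follows the slide):

```python
from collections import Counter

# The text "aabcca." represented as the character sequence from the slide
seq = ['S', 'a', 'a', 'b', 'c', 'c', 'a', 'E']

unigrams = Counter(seq)               # C(w)
bigrams = Counter(zip(seq, seq[1:]))  # C(w w')

print(unigrams['a'])                        # 3
print(bigrams[('a', 'b')])                  # 1
print(bigrams[('a', 'b')] / unigrams['a'])  # P(b|a) = 1/3
```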
N-Gram Generalization
General formula: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}, \quad N \ge 1$

Examples:
For $N = 2$: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
For $N = 3$: $P(w_n \mid w_{n-2} w_{n-1}) = \dfrac{C(w_{n-2} w_{n-1} w_n)}{C(w_{n-2} w_{n-1})}$
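A sketch of this ratio-of-counts estimate for a general N (Python; the function name is my own, and the toy sequence is the one from the earlier example):

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """Estimate P(word | context) = C(context word) / C(context).

    context is a tuple of N-1 tokens; counts come from the
    training sequence tokens.
    """
    n = len(context) + 1
    ngrams = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1])
                       for i in range(len(tokens) - n + 2))
    return ngrams[context + (word,)] / contexts[context]

seq = ['S', 'a', 'a', 'b', 'c', 'c', 'a', 'E']
print(ngram_prob(seq, ('a',), 'b'))      # bigram:  C(ab)/C(a)   = 1/3
print(ngram_prob(seq, ('a', 'b'), 'c'))  # trigram: C(abc)/C(ab) = 1/1
```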
Maximum Likelihood Estimation
● This N-Gram probability estimation is known as the Maximum
Likelihood Estimation (MLE)
● It is the MLE because it always maximizes the probability of the
training set (the statistics of the training set)
● Example: If a word W occurs 5 times in a training corpus of 100
words, its probability of occurrence is P(W) = 5/100
● This is not necessarily a good estimate of P(W) for other
corpora, but it is the one that maximizes P(W) on the training corpus
N-Gram Smoothing
Unsmoothed N-Gram Problem
● Since any corpus is finite, some valid N-Grams will be
missing from any corpus used for computing N-Grams
● To put it differently, an N-Gram matrix for any corpus is likely
to be sparse: it will have a large number of possible N-Grams
with zero counts
● The MLE methods produce unreliable estimates when counts are
greater than 0 but still small (small, of course, is relative)
● Smoothing is a set of techniques used to overcome zero or low
counts
Add-One Smoothing
One way to smooth is to add one to all N-Gram counts and
add the dictionary size (V) to the normalizing denominator
$P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$ // unsmoothed

$P^{*}(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$ // add-one smoothed
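A minimal sketch of the smoothed estimate (Python; the vocabulary and sequence are from the earlier example):

```python
from collections import Counter

def add_one_bigram_prob(tokens, vocab_size, prev, word):
    """Add-one smoothed P(word | prev) = (C(prev word) + 1) / (C(prev) + V)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

seq = ['S', 'a', 'a', 'b', 'c', 'c', 'a', 'E']
V = 5  # vocabulary {S, a, b, c, E}

print(add_one_bigram_prob(seq, V, 'a', 'b'))  # (1+1)/(3+5) = 0.25
print(add_one_bigram_prob(seq, V, 'a', 'c'))  # (0+1)/(3+5) = 0.125, no longer zero
```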
A Problem with Add-One Smoothing
● Much of the total probability mass moves to the N-
Grams with zero counts
● Researchers attribute it to the arbitrary choice of the
value of 1
● Add-One smoothing appears to be worse than other
methods at predicting N-Grams with zero counts
● Some research indicates that add-one smoothing is no
better than no smoothing
Good-Turing Discounting
● Probability mass is reassigned to N-Grams with zero or low
counts by using the counts of N-Grams with higher counts
● Let $N_c$ be the number of N-Grams that occur c times in a
corpus
● $N_0$ is the number of N-Grams that occur 0 times
● $N_1$ is the number of N-Grams that occur once
● $N_2$ is the number of N-Grams that occur twice
Good-Turing Discounting
Let $C(w_1 \ldots w_n) = c$ be the count of some N-Gram $w_1 \ldots w_n$;
then the new count smoothed by the GTD, i.e., $C^{*}(w_1 \ldots w_n)$, is:

$C^{*}(w_1 \ldots w_n) = (c + 1) \dfrac{N_{c+1}}{N_c}$
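A sketch of the discounted counts (Python). The fallback when $N_{c+1} = 0$ is my own simplification; practical implementations smooth the $N_c$ values themselves:

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each raw count c to c* = (c + 1) * N_{c+1} / N_c."""
    N = Counter(ngram_counts.values())  # N[c] = number of N-Grams seen c times
    smoothed = {}
    for ngram, c in ngram_counts.items():
        if N[c + 1] > 0:
            smoothed[ngram] = (c + 1) * N[c + 1] / N[c]
        else:
            smoothed[ngram] = c  # simplification: no higher-count class to borrow from
    return smoothed

counts = {'ab': 1, 'bc': 1, 'cc': 1, 'ca': 2, 'aa': 2, 'cb': 3}
# N_1 = 3, N_2 = 2, N_3 = 1, so count 1 becomes 2 * 2/3 and count 2 becomes 3 * 1/2
print(good_turing_counts(counts))
```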
N-Gram Vectors
● N-Grams can be computed over any finite symbol set
● Such symbol sets are called alphabets and can consist
of wordforms, waveforms, individual letters, etc.
● The choice of the symbols in the alphabet depends on
the application
● Regardless of the application, the objective is to take
an input sequence over a specific alphabet and
compute its N-Gram frequency vector
Dimensions of N-Gram Vectors
● Let A be an alphabet and n > 0 be the size of the N-Gram
● The number of N-Gram dimensions is $|A|^n$
● Suppose that the alphabet has 26 characters and we
compute trigrams over that alphabet; then the number
of possible trigrams, i.e., the dimension of the N-Gram
frequency vectors, is $26^3 = 17576$
● A practical implication is that N-Gram frequency vectors
are sparse even for low values of n
Example
● Suppose the alphabet A = {a, <space>, <start>}
● The number of possible bigrams (n=2) is |A|2 = 9:
– 1) aa; 2) a<start>; 3) a<space>; 4) <start><start>; 5) <start>a;
6) <start><space>; 7) <space><space> ; 8) <space>a;
9) <space><start>
● Suppose the input is ‘a a’
● The input’s N-Grams are: <start>a, a<space>, <space>a
● Then the input’s N-Gram vector is (0, 0, 1, 0, 1, 0, 0, 1, 0)
(this assumes 1-based indexing into the enumeration above)
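A sketch of the vector computation (Python). Note that the dimension ordering produced by itertools.product differs from the hand enumeration on this slide, so the 1s land at different indices:

```python
from itertools import product

def bigram_vector(alphabet, symbols):
    """Frequency vector of the input's bigrams over all |A|^2 dimensions."""
    dims = list(product(alphabet, repeat=2))  # all possible bigrams
    counts = dict.fromkeys(dims, 0)
    for bigram in zip(symbols, symbols[1:]):
        counts[bigram] += 1
    return [counts[d] for d in dims]

alphabet = ['a', '<space>', '<start>']
symbols = ['<start>', 'a', '<space>', 'a']  # the input 'a a' with a start mark
print(bigram_vector(alphabet, symbols))     # three 1s, six 0s
```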
Markov & Hidden Markov Models
Markov Models
Markov Models are closely related to N-Grams:
the basic idea is to estimate the conditional
probability of the n-th observation given a
sequence of n-1 observations:

$P(w_n \mid w_1 \ldots w_{n-1})$
Markov Assumption
$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1})$ // 1st order
$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1})$ // 2nd order
$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-3} w_{n-2} w_{n-1})$ // 3rd order
● If n = 5 and the size of the observation alphabet is 3, we
need to collect statistics over $3^5 = 243$ sequence types
● If n = 2 and the size of the observation alphabet is 3, we
need to collect statistics over $3^2 = 9$ sequence types
● So the length of the observation history matters
Weather Example 01
Today \ Tomorrow   Sunny   Rainy   Foggy
Sunny              0.8     0.05    0.15
Rainy              0.2     0.6     0.2
Foggy              0.2     0.3     0.5
Weather Today (rows) vs. Weather Tomorrow (columns)
Here is how to read this table:
1st row: P(Sunny|Sunny) = 0.8; P(Rainy|Sunny) = 0.05; P(Foggy|Sunny) = 0.15
2nd row: P(Sunny|Rainy) = 0.2; P(Rainy|Rainy) = 0.6; P(Foggy|Rainy) = 0.2
3rd row: P(Sunny|Foggy) = 0.2; P(Rainy|Foggy) = 0.3; P(Foggy|Foggy) = 0.5

$P(w_2 = Sunny, w_3 = Rainy \mid w_1 = Sunny)$
$= P(w_3 = Rainy \mid w_2 = Sunny, w_1 = Sunny) \, P(w_2 = Sunny \mid w_1 = Sunny)$
$\approx P(w_3 = Rainy \mid w_2 = Sunny) \, P(w_2 = Sunny \mid w_1 = Sunny) = 0.05 \times 0.8 = 0.04$ (by the Markov assumption)
Weather Example 02
Today \ Tomorrow   Sunny   Rainy   Foggy
Sunny              0.8     0.05    0.15
Rainy              0.2     0.6     0.2
Foggy              0.2     0.3     0.5
Weather Today (rows) vs. Weather Tomorrow (columns)
$P(w_3 = Rainy \mid w_1 = Foggy)$
$= P(w_2 = Foggy, w_3 = Rainy \mid w_1 = Foggy) + P(w_2 = Rainy, w_3 = Rainy \mid w_1 = Foggy) + P(w_2 = Sunny, w_3 = Rainy \mid w_1 = Foggy)$
$= P(w_3 = Rainy \mid w_2 = Foggy) P(w_2 = Foggy \mid w_1 = Foggy) + P(w_3 = Rainy \mid w_2 = Rainy) P(w_2 = Rainy \mid w_1 = Foggy) + P(w_3 = Rainy \mid w_2 = Sunny) P(w_2 = Sunny \mid w_1 = Foggy)$
$= 0.3 \times 0.5 + 0.6 \times 0.3 + 0.05 \times 0.2 = 0.15 + 0.18 + 0.01 = 0.34$
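Both weather computations can be checked with a few lines of code (Python; the transition table is hard-coded from the slides):

```python
# Transition probabilities P(tomorrow | today) from the table above
P = {
    'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
    'Rainy': {'Sunny': 0.2, 'Rainy': 0.6,  'Foggy': 0.2},
    'Foggy': {'Sunny': 0.2, 'Rainy': 0.3,  'Foggy': 0.5},
}

def two_step_prob(w1, w3):
    """P(w3 two days ahead | w1 today), summing over tomorrow's weather w2."""
    return sum(P[w1][w2] * P[w2][w3] for w2 in P)

print(P['Sunny']['Sunny'] * P['Sunny']['Rainy'])  # Example 01: 0.8 * 0.05 = 0.04
print(two_step_prob('Foggy', 'Rainy'))            # Example 02: 0.34
```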
Speech Recognition
w is a sequence of tokens
L is a language
y is an acoustic signal
$\hat{w} = \arg\max_{w \in L} P(w \mid y) = \arg\max_{w \in L} \dfrac{P(y \mid w) P(w)}{P(y)} = \arg\max_{w \in L} P(y \mid w) P(w)$

(the denominator $P(y)$ does not depend on $w$, so it can be dropped from the argmax)
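A toy illustration of the argmax (Python). The candidate transcriptions and all the probabilities are made-up numbers, just to show how the language model term disambiguates:

```python
# Hypothetical candidates w with acoustic likelihoods P(y|w) and LM priors P(w)
candidates = {
    'I understand this algorithm': (0.0008, 1.2e-3),
    'eye under stand this algae rhythm': (0.0009, 4.0e-7),
}

# argmax over w of P(y|w) * P(w); P(y) is the same for all w and is dropped
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # the language model favors the sensible transcription
```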
References
● D. Jurafsky & J. Martin. Speech and Language Processing, Ch. 6.
Prentice Hall, ISBN 0-13-095069-6.
● E. Fosler-Lussier. 1998. Markov Models and Hidden Markov
Models: A Brief Tutorial. ICSI, UC Berkeley.