Resolving Word Ambiguities
Post on 24-Feb-2016
Resolving Word Ambiguities
• Description: After determining word boundaries, the speech recognition process matches an array of possible word sequences against the spoken audio
• Issues to consider:
  – How do we determine the intended word sequence?
  – How do we resolve grammatical and pronunciation errors?
• Applications: spell-checking, allophonic variations of pronunciation, automatic speech recognition
• Implementation: Establish word-sequence probabilities
  – Use existing corpora
  – Train the program with run-time data
Entropy
Questions we can answer if we can compute entropy:
• How much information is there in a particular grammar, word, part of speech, phoneme, etc.?
• How predictive is a language for computing the next word based on previous words?
• How difficult is a speech recognition task?
• Entropy is measured in bits
• What is the least number of bits required to encode a piece of information?
Measures the quantity of information in a signal stream
H(X) = - ∑i=1..r pi lg pi
X is a random variable that can assume r values
Knowing the entropy of spoken languages could help focus implementation possibilities
Entropy Example
• Eight horses are in an upcoming race
  – We want to take bets on the winner
  – The naïve approach is to use 3 bits per horse
  – A better approach is to use fewer bits for the horses bet on more frequently
• Entropy – What is the minimum number of bits needed?
H = -∑i=1..8 p(i) log2 p(i)
  = - ½ log ½ - ¼ log ¼ - 1/8 log 1/8 - 1/16 log 1/16 - 4 × (1/64 log 1/64)
  = 1/2 + 2/4 + 3/8 + 4/16 + 4 × 6/64 = (32 + 32 + 24 + 16 + 24)/64 = 2 bits
• The table below shows the optimal coding scheme
• Question: What if the odds were all equal (1/8)?

Horse  Odds   Code
1      1/2    0
2      1/4    10
3      1/8    110
4      1/16   1110
5      1/64   111100
6      1/64   111101
7      1/64   111110
8      1/64   111111
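The entropy calculation and the answer to the question above can be checked numerically. A minimal sketch (the `entropy` helper name is ours):

```python
import math

def entropy(probs):
    """H(X) = -sum over i of p_i * log2(p_i), in bits (terms with p = 0 are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Win probabilities from the horse-race example
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(skewed))      # 2.0 bits, matching the code lengths in the table

# With equal odds of 1/8 for all eight horses, the naive 3-bit code is optimal
print(entropy([1/8] * 8))   # 3.0 bits
```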
Entropy of words and languages
• What is the entropy of a sequence of words?
  H(w1,w2,…,wn) = -∑ p(w1n) log p(w1n)
  – the sum ranges over all sequences w1n in L
  – p(w1n) = probability that w1n is a sequence of n words
• What is the per-word entropy of a word appearing in an n-word sequence?
  H(w1n)/n = -1/n ∑ p(w1n) log p(w1n)
• What is the entropy of a language?
  H(L) = lim n→∞ (-1/n ∑ p(w1n) log p(w1n))
Cross Entropy
• We want to know the entropy of a language L, but don't know its distribution
• We model L by an approximation m to its probability distribution
• We take sequences of words, phonemes, etc. from the real language but use the following formula:

H(p,m) = lim n→∞ -1/n log m(w1,w2,…,wn)
• Cross entropy is always an upper bound on the actual language entropy
• Example:
  – Trigram model trained on 583 million words of English
  – Test corpus of 1,014,312 tokens
  – Character entropy computed based on a tri-gram grammar
  – Result: 1.75 bits per character
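As a sketch of the formula (assumptions: a unigram character model m rather than the trigram model of the example, and add-one smoothing so unseen characters do not zero the product):

```python
import math
from collections import Counter

def char_cross_entropy(train_text, test_text):
    """Approximates H(p, m) = -1/n * log2 m(c1..cn) with a unigram
    character model m trained on train_text (add-one smoothed)."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(test_text)
    n_train = len(train_text)

    def m(c):
        # add-one smoothed character probability under the model
        return (counts[c] + 1) / (n_train + len(vocab))

    return -sum(math.log2(m(c)) for c in test_text) / len(test_text)

h = char_cross_entropy("the cat sat on the mat", "the rat sat")
```

The better the model m matches the true distribution, the lower this bound gets; the 1.75 bits-per-character result comes from a far larger training set and a trigram model.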
Probability Chain Rule
• Conditional probability: P(A1,A2) = P(A1) · P(A2|A1)
• The chain rule generalizes to multiple events:
  – P(A1,…,An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
• Examples:
  – P(the dog) = P(the) P(dog | the)
  – P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
• Conditional probabilities work better than individual relative word frequencies because they consider the context
  – Dog may be a relatively rare word in a corpus
  – But if we see barking, P(dog|barking) is much more likely
• In general, the probability of a complete string of words w1…wn is:

P(w1n) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1) = ∏k=1..n P(wk|w1…wk-1)
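The chain rule can be sketched directly; the probability values below are hypothetical, chosen only to illustrate the product:

```python
def sequence_probability(words, cond_prob):
    """P(w1..wn) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1)."""
    p = 1.0
    for k, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:k]))  # P(wk | w1..wk-1)
    return p

# Hypothetical conditional probabilities for the slide's example sentence
table = {
    ("the", ()): 0.06,
    ("dog", ("the",)): 0.01,
    ("bites", ("the", "dog")): 0.02,
}
p = sequence_probability(["the", "dog", "bites"], lambda w, h: table[(w, h)])
# p == 0.06 * 0.01 * 0.02
```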
Detecting likely word sequences using probabilities
Counts• What’s the probability of “canine”?• What’s the probability of “canine tooth” or tooth | canine?• What’s the probability of “canine companion”?• P(tooth|canine) = P(canine & tooth)/P(canine)
• Sometimes we can use counts to deduce probabilities.• Example: According to google:
– P(canine): occurs 1,750,000 times– P(canine tooth): 6280 times – P(tooth | canine): 6280/1750000 = .0035– P(companion | canine): .01 – So companion is the more likely next word after canine
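The count-based estimate is straightforward to check, using the counts from the slide's Google example:

```python
# Web counts from the slide's Google example
c_canine = 1_750_000
c_canine_tooth = 6_280

# P(tooth | canine) = P(canine & tooth) / P(canine), estimated from counts
p_tooth = c_canine_tooth / c_canine

p_companion = 0.01  # P(companion | canine), as given on the slide

# companion is the more likely next word after canine
assert p_companion > p_tooth
```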
Detecting likely word sequences using counts/table look up
Single Word Probabilities

Word   P([ni]|w)   P(w)       P([ni]|w)P(w)
new    .36         .001       .00036
neat   .52         .00013     .000068
need   .11         .00056     .000062
knee   1.00        .000024    .000024
• Limitation: this ignores context
• We might need to factor in the surrounding words
  – Use P(need|I) instead of just P(need)
  – Note: P(new|I) < P(need|I)

Single word probability: compute the likelihood P([ni]|w), then multiply by the prior P(w)
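The decision rule behind the table, picking the word w that maximizes P(O|w)P(w) for the observed pronunciation O = [ni], can be sketched as:

```python
# Likelihoods P([ni]|w) and priors P(w) from the table above
candidates = {
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.00, 0.000024),
}

# Pick the word maximizing P([ni]|w) * P(w)
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
# best == "new", even though "knee" has the highest likelihood P([ni]|w)
```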
Word Prediction Approaches
Simple: Every word follows every other word with equal probability (0-gram)
  – Assume |V| is the size of the vocabulary
  – Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
  – If English has 100,000 words, the probability of each next word is 1/100,000 = .00001, so a sentence of n words has likelihood .00001^n
Simple vs. Smart
Smarter: The probability of each next word is related to its word frequency
  – Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
  – Assumes the probability of each word is independent of the other words

Even smarter: Look at the probability given the previous word
  – Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
  – Assumes the probability of each word depends on the previous word
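The unigram and bigram models can be compared on a toy corpus (the corpus and test sentence below are ours, for illustration only):

```python
from collections import Counter

corpus = "the dog barks the dog bites the cat sleeps".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def p_unigram(words):
    """'Smarter' model: product of independent word frequencies."""
    p = 1.0
    for w in words:
        p *= unigrams[w] / n
    return p

def p_bigram(words):
    """'Even smarter' model: P(w1) * product of P(wk | wk-1) from counts."""
    p = unigrams[words[0]] / n
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

s = "the dog bites".split()
# The bigram model scores this attested word sequence much higher
```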
Common Spelling Errors
• They are leaving in about fifteen minuets
• The study was conducted manly be John Black.
• The design an construction of the system will take more than a year.
• Hopefully, all with continue smoothly in my absence.
• Can they lave him my messages?
• I need to notified the bank of….
• He is trying to fine out.
Spell check without considering context will fail
Difficulty: Detect grammatical errors, or nonsensical expressions
N-grams
• 0-gram: Every word's likelihood is equal
  – Each word of a 300,000-word vocabulary has probability 1/300,000 ≈ .0000033
• Uni-gram: A word's likelihood depends on its frequency count
  – The occurs 69,971 times in the Brown corpus of 1,000,000 words
• Bi-gram: word likelihood is determined by the previous word, P(wn|wn-1)
  – The appears with frequency .07, rabbit appears with frequency .00001
  – But rabbit is a more likely word to follow white than the is
• Tri-gram: word likelihood is determined by the previous two words, P(wn|wn-2,wn-1)
• Question: How many previous words should we consider?
  – Test: Generate random sentences from Shakespeare
  – Results: Trigram sentences start looking like those of Shakespeare
  – Tradeoffs: Computational overhead and memory requirements
How many previous words should we consider?
The Sparse Data Problem
• Definitions
  – Maximum likelihood: finding the most probable sequence of tokens based on the context of the input
  – N-gram sequence: a sequence of n words whose context speech algorithms consider
  – Training data: a group of probabilities computed from a corpus of text data
  – Sparse data problem: how should algorithms handle n-grams that have very low probabilities?
• Data sparseness is a frequently occurring problem
• Algorithms will make incorrect decisions if it is not handled
• Problem 1: Low-frequency n-grams
  – Assume n-gram x occurs twice and n-gram y occurs once
  – Is x really twice as likely to occur as y?
• Problem 2: Zero counts
  – Probabilities compute to zero for n-grams not seen in the corpus
  – If n-gram y does not occur, should its probability be zero?
Smoothing
An algorithm that redistributes the probability mass

Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus)

Definition: A corpus is a collection of written or spoken material in machine-readable form
Add-One Smoothing
• The naïve smoothing technique
  – Add one to the count of all seen and unseen n-grams
  – Add the total increased count to the probability mass
• Example: Uni-grams
  – Un-smoothed probability for word w: P(w) = c(w)/N
  – Add-one revised probability for word w: P+1(w) = (c(w)+1)/(N+V)
  – N = number of words encountered, V = vocabulary size, c(w) = number of times word w was encountered
Add-One Smoothing Example

P(wn|wn-1) = C(wn-1wn)/C(wn-1)
P+1(wn|wn-1) = [C(wn-1wn)+1]/[C(wn-1)+V]

Note: This example assumes bi-gram counts and a vocabulary of V = 1616 words
Note: each row gives the number of times the word in the column precedes the word on the left, or starts a sentence
Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
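A sketch of the two formulas using the slide's V and C(I); the bigram count c(I want) is an assumed value for illustration, since the slide's full count table is not reproduced here:

```python
V = 1616         # vocabulary size, from the slide
c_I = 3437       # C(I), from the slide
c_I_want = 1087  # hypothetical bigram count C(I want), for illustration

p_mle = c_I_want / c_I               # un-smoothed P(want | I)
p_add1 = (c_I_want + 1) / (c_I + V)  # add-one smoothed P(want | I)
p_zero = (0 + 1) / (c_I + V)         # smoothed P of a bigram never seen after "I"

# Smoothing moves probability mass from seen bigrams to unseen ones
```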
Add-One Discounting
c′(wi-1,wi) = (c(wi-1,wi) + 1) × N/(N+V)
Original Counts
Revised Counts
Note: High counts are reduced by approximately a third in this example
Note: Low counts get larger
Note: N = c(wi-1), V = vocabulary size = 1616

Word      C(wi)
I         3437
Want      1215
To        3256
Eat        938
Chinese    213
Food      1506
Lunch      459
Evaluation of Add-One Smoothing
• Advantage:
  – Simple technique to implement and understand
• Disadvantages:
  – Too much probability mass moves to the unseen n-grams
  – Underestimates the probabilities of the common n-grams
  – Overestimates the probabilities of rare (or unseen) n-grams
  – Relative smoothing of all unseen n-grams is the same
  – Relative smoothing of rare n-grams is still incorrect
• Alternative:
  – Use a smaller add value
  – Disadvantage: does not fully solve the problem
Unigram Witten-Bell Discounting
• Compute the probability of a first-time encounter of a new word
  – Note: every one of the O observed words had a first encounter
  – How many unseen words? U = V - O
  – What is the probability of encountering a new word?
• Answer: P(any newly encountered word) = O/(V+O)
• Spread this probability equally across all unobserved words
  – P(any specific newly encountered word) = 1/U × O/(V+O)
  – Adjusted count = V × 1/U × O/(V+O)
• Discount each encountered word wi to preserve probability space
  – Probability from counti/V to counti/(V+O)
  – Discounted counts from counti to counti × V/(V+O)

Add probability mass to un-encountered words; discount the rest
O = observed words, U = words never seen, V = corpus vocabulary words
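A sketch of the unigram scheme as the slide states it, with O = observed word types, V = total words in the corpus, and U unseen words out of an assumed total vocabulary:

```python
def witten_bell_unigram(counts, total_vocab):
    """Witten-Bell discounting per the slide: reserve mass O/(V+O) for
    first-time encounters, split it equally over the U unseen words, and
    discount each seen word's probability from c/V to c/(V+O)."""
    O = len(counts)           # number of observed word types
    V = sum(counts.values())  # total observed words
    U = total_vocab - O       # unseen word types
    seen = {w: c / (V + O) for w, c in counts.items()}
    p_each_unseen = (O / (V + O)) / U if U else 0.0
    return seen, p_each_unseen

seen, p_unseen = witten_bell_unigram({"the": 3, "dog": 2, "cat": 1}, total_vocab=10)
# Probability mass is preserved: sum over seen words + 7 unseen words == 1
```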
Bi-gram Witten-Bell Discounting
• Consider the bi-gram wn-1wn
  – O(wn-1) = number of uniquely observed bi-grams starting with wn-1
  – V(wn-1) = count of bi-grams starting with wn-1
  – U(wn-1) = number of un-observed bi-grams starting with wn-1
• Compute the probability of a new bi-gram starting with wn-1
  – Answer: P(any newly encountered bi-gram) = O(wn-1)/(V(wn-1)+O(wn-1))
  – Note: we observed O(wn-1) first encounters among V(wn-1)+O(wn-1) events
  – Note: an event is either a bi-gram or a first-time encounter
• Divide this probability among all unseen bi-grams starting with wn-1
  – Adjusted P(new(wn-1)) = 1/U(wn-1) × O(wn-1)/(V(wn-1)+O(wn-1))
  – Adjusted count = V(wn-1) × 1/U(wn-1) × O(wn-1)/(V(wn-1)+O(wn-1))
• Discount observed bi-grams starting with wn-1 to preserve probability space
  – Probability from c(wn-1wn)/V(wn-1) to c(wn-1wn)/(V(wn-1)+O(wn-1))
  – Counts from c(wn-1wn) to c(wn-1wn) × V(wn-1)/(V(wn-1)+O(wn-1))

Add probability mass to un-encountered bi-grams; discount the rest
O = observed bi-grams, U = bi-grams never seen, V = count of bi-grams in the corpus
Witten-Bell Smoothing

Adjusted add-one counts:
c′(wn-1,wn) = (c(wn-1,wn) + 1) × N/(N+V)

Adjusted Witten-Bell counts:
c′(wn-1,wn) = O/U × V/(V+O)           if c(wn-1,wn) = 0
c′(wn-1,wn) = c(wn-1,wn) × V/(V+O)    otherwise

[Original Counts, Adjusted Add-One Counts, and Adjusted Witten-Bell Counts tables appeared here]

Note: V, O, U refer to wn-1 counts; their values are on the next slide
Bi-gram Counts for Example
Word      O(wn-1)   U(wn-1)   V(wn-1)
I            95      1,521     3437
Want         76      1,540     1215
To          130      1,486     3256
Eat         124      1,492      938
Chinese      20      1,596      213
Food         82      1,534     1506
Lunch        45      1,571      459

O(wn-1) = number of observed bi-grams starting with wn-1
V(wn-1) = count of bi-grams starting with wn-1
U(wn-1) = number of un-observed bi-grams starting with wn-1
Evaluation of Witten-Bell
• Estimates probability of already encountered grams to compute probabilities for unseen grams
• Smaller impact on probabilities of already encountered grams
• Generally computes reasonable probabilities
Back-off Discounting
• The general concept
  – Consider the trigram (wn-2, wn-1, wn)
  – If c(wn-2, wn-1, wn) = 0, consider the 'back-off' bi-gram (wn-1, wn)
  – If c(wn-1, wn) = 0, consider the 'back-off' unigram wn
• The goal is to use a hierarchy of approximations
  – trigram > bigram > unigram
  – Degrade gracefully when higher-level grams don't exist
• Given a word sequence fragment: wn-2 wn-1 wn …
• Utilize the following preference rule:
  1. p(wn|wn-2wn-1) if c(wn-2wn-1wn) > 0
  2. α1 p(wn|wn-1) if c(wn-1wn) > 0
  3. α2 p(wn) otherwise

Note: α1 and α2 are values carefully computed to preserve probability mass
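A sketch of the preference rule; the α weights are fixed constants here, whereas the slide computes them carefully to preserve probability mass:

```python
def backoff(w, context, tri, bi, uni, alpha1=0.4, alpha2=0.4):
    """Back-off: use the trigram if seen, else a weighted bigram,
    else a weighted unigram.  tri, bi, uni map n-grams to probabilities."""
    w2, w1 = context  # (wn-2, wn-1)
    if (w2, w1, w) in tri:
        return tri[(w2, w1, w)]
    if (w1, w) in bi:
        return alpha1 * bi[(w1, w)]
    return alpha2 * uni.get(w, 0.0)

# Toy probability tables for illustration
tri = {("the", "dog", "bites"): 0.2}
bi = {("dog", "barks"): 0.3}
uni = {"sleeps": 0.1}
```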
N-grams for Spell Checks
• Non-word detection (easiest). Example: graffe => giraffe
  – Isolated-word (context-free) error correction
  – By definition, cannot correct when the error word is a valid word
• Context-dependent (hardest). Example: your an idiot => you're an idiot
  – Occurs when the mis-typed word happens to be a real word
  – Estimated frequency of such real-word errors: 15% (Peterson, 1986), 25%-40% (Kukich, 1992)
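A context-dependent check can be sketched with bigram probabilities: among real-word candidates, prefer the one likelier after the previous word. The probabilities below are toy values, echoing the "fifteen minuets" error from the earlier slide:

```python
# Toy bigram probabilities; "minuets" is a valid word, so only context catches it
bigram_p = {
    ("fifteen", "minutes"): 0.01,
    ("fifteen", "minuets"): 0.000001,
}

def best_candidate(prev_word, candidates):
    """Pick the candidate with the highest bigram probability after prev_word."""
    return max(candidates, key=lambda w: bigram_p.get((prev_word, w), 0.0))

fixed = best_candidate("fifteen", ["minutes", "minuets"])
# fixed == "minutes"
```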