Resolving Word Ambiguities
Post on 24-Feb-2016
Resolving Word Ambiguities
• Description: After determining word boundaries, the speech recognition process matches an array of possible word sequences against the spoken audio
• Issues to consider:
  – How do we determine the intended word sequence?
  – How do we resolve grammatical and pronunciation errors?
• Applications: spell-checking, allophonic variations of pronunciation, automatic speech recognition
• Implementation: Establish word-sequence probabilities
  – Use existing corpora
  – Train the program with run-time data
Entropy
Questions we can answer if we can compute entropy:
• How much information is there in a particular grammar, word, part of speech, phoneme, etc.?
• How predictive is a language for computing the next word based on previous words?
• How difficult is a speech recognition task?
• Entropy is measured in bits
• What is the least number of bits required to encode a piece of information?
Measures the quantity of information in a signal stream
H(X) = - ∑i=1..r pi lg pi
X is a random variable that can assume r values
Knowing the entropy of spoken languages could help focus implementation possibilities
Entropy Example
• Eight horses are in an upcoming race
  – We want to take bets on the winner
  – The naïve approach is to use 3 bits per horse
  – A better approach is to use fewer bits for the horses bet on more frequently
• Entropy – What is the minimum number of bits needed?
H = -∑i=1..8 p(i) log2 p(i)
  = - ½ log ½ - ¼ log ¼ - 1/8 log 1/8 - 1/16 log 1/16 - 4 × (1/64 log 1/64)
  = 1/2 + 2/4 + 3/8 + 4/16 + 4 × 6/64 = (32 + 32 + 24 + 16 + 24)/64 = 2 bits
• The table below shows the optimal coding scheme
• Question: What if the odds were all equal (1/8)?

Horse  Odds   Code
1      1/2    0
2      1/4    10
3      1/8    110
4      1/16   1110
5      1/64   111100
6      1/64   111101
7      1/64   111110
8      1/64   111111
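The entropy calculation and the answer to the question above can be checked numerically. A minimal sketch (the `entropy` helper name is ours):

```python
import math

def entropy(probs):
    """H(X) = -sum over i of p_i * log2(p_i), in bits (terms with p = 0 are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Win probabilities from the horse-race example
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(skewed))      # 2.0 bits, matching the code lengths in the table

# With equal odds of 1/8 for all eight horses, the naive 3-bit code is optimal
print(entropy([1/8] * 8))   # 3.0 bits
```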
Entropy of words and languages
• What is the entropy of a sequence of words?
  H(w1,w2,…,wn) = -∑ p(w1n) log p(w1n)
  – the sum ranges over all sequences w1n in L
  – p(w1n) = probability that w1n is a sequence of n words
• What is the per-word entropy of a word appearing in an n-word sequence?
  H(w1n)/n = -1/n ∑ p(w1n) log p(w1n)
• What is the entropy of a language?
  H(L) = lim n→∞ (-1/n ∑ p(w1n) log p(w1n))
Cross Entropy
• We want to know the entropy of a language L, but don't know its distribution
• We model L by an approximation m to its probability distribution
• We take sequences of words, phonemes, etc. from the real language but use the following formula:

H(p,m) = lim n→∞ -1/n log m(w1,w2,…,wn)
• Cross entropy is always an upper bound on the actual language entropy
• Example:
  – Trigram model trained on 583 million words of English
  – Test corpus of 1,014,312 tokens
  – Character entropy computed based on a tri-gram grammar
  – Result: 1.75 bits per character
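As a sketch of the formula (assumptions: a unigram character model m rather than the trigram model of the example, and add-one smoothing so unseen characters do not zero the product):

```python
import math
from collections import Counter

def char_cross_entropy(train_text, test_text):
    """Approximates H(p, m) = -1/n * log2 m(c1..cn) with a unigram
    character model m trained on train_text (add-one smoothed)."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(test_text)
    n_train = len(train_text)

    def m(c):
        # add-one smoothed character probability under the model
        return (counts[c] + 1) / (n_train + len(vocab))

    return -sum(math.log2(m(c)) for c in test_text) / len(test_text)

h = char_cross_entropy("the cat sat on the mat", "the rat sat")
```

The better the model m matches the true distribution, the lower this bound gets; the 1.75 bits-per-character result comes from a far larger training set and a trigram model.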
Probability Chain Rule
• Conditional probability: P(A1,A2) = P(A1) · P(A2|A1)
• The chain rule generalizes to multiple events:
  – P(A1,…,An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
• Examples:
  – P(the dog) = P(the) P(dog | the)
  – P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
• Conditional probabilities work better than individual relative word frequencies because they consider the context
  – Dog may be a relatively rare word in a corpus
  – But if we see barking, P(dog|barking) is much more likely
• In general, the probability of a complete string of words w1…wn is:

P(w1n) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1) = ∏k=1..n P(wk|w1…wk-1)
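The chain rule can be sketched directly; the probability values below are hypothetical, chosen only to illustrate the product:

```python
def sequence_probability(words, cond_prob):
    """P(w1..wn) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1)."""
    p = 1.0
    for k, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:k]))  # P(wk | w1..wk-1)
    return p

# Hypothetical conditional probabilities for the slide's example sentence
table = {
    ("the", ()): 0.06,
    ("dog", ("the",)): 0.01,
    ("bites", ("the", "dog")): 0.02,
}
p = sequence_probability(["the", "dog", "bites"], lambda w, h: table[(w, h)])
# p == 0.06 * 0.01 * 0.02
```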
Detecting likely word sequences using probabilities
Counts• What’s the probability of “canine”?• What’s the probability of “canine tooth” or tooth | canine?• What’s the probability of “canine companion”?• P(tooth|canine) = P(canine & tooth)/P(canine)
• Sometimes we can use counts to deduce probabilities.• Example: According to google:
– P(canine): occurs 1,750,000 times– P(canine tooth): 6280 times – P(tooth | canine): 6280/1750000 = .0035– P(companion | canine): .01 – So companion is the more likely next word after canine
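The count-based estimate is straightforward to check, using the counts from the slide's Google example:

```python
# Web counts from the slide's Google example
c_canine = 1_750_000
c_canine_tooth = 6_280

# P(tooth | canine) = P(canine & tooth) / P(canine), estimated from counts
p_tooth = c_canine_tooth / c_canine

p_companion = 0.01  # P(companion | canine), as given on the slide

# companion is the more likely next word after canine
assert p_companion > p_tooth
```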
Detecting likely word sequences using counts/table look up
Single Word Probabilities

Word   P([ni]|w)   P(w)       P([ni]|w)P(w)
new    .36         .001       .00036
neat   .52         .00013     .000068
need   .11         .00056     .000062
knee   1.00        .000024    .000024
• Limitation: this ignores context
• We might need to factor in the surrounding words
  – Use P(need|I) instead of just P(need)
  – Note: P(new|I) < P(need|I)

Single word probability: compute the likelihood P([ni]|w), then multiply by the prior P(w)
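The decision rule behind the table, picking the word w that maximizes P(O|w)P(w) for the observed pronunciation O = [ni], can be sketched as:

```python
# Likelihoods P([ni]|w) and priors P(w) from the table above
candidates = {
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.00, 0.000024),
}

# Pick the word maximizing P([ni]|w) * P(w)
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
# best == "new", even though "knee" has the highest likelihood P([ni]|w)
```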
Word Prediction Approaches
Simple: Every word follows every other word with equal probability (0-gram)
  – Assume |V| is the size of the vocabulary
  – Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
  – If English has 100,000 words, the probability of each next word is 1/100,000 = .00001, so a sentence of n words has likelihood .00001^n
Simple vs. Smart
Smarter: The probability of each next word is related to its word frequency
  – Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
  – Assumes the probability of each word is independent of the other words

Even smarter: Look at the probability given the previous word
  – Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
  – Assumes the probability of each word depends on the previous word
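The unigram and bigram models can be compared on a toy corpus (the corpus and test sentence below are ours, for illustration only):

```python
from collections import Counter

corpus = "the dog barks the dog bites the cat sleeps".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def p_unigram(words):
    """'Smarter' model: product of independent word frequencies."""
    p = 1.0
    for w in words:
        p *= unigrams[w] / n
    return p

def p_bigram(words):
    """'Even smarter' model: P(w1) * product of P(wk | wk-1) from counts."""
    p = unigrams[words[0]] / n
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

s = "the dog bites".split()
# The bigram model scores this attested word sequence much higher
```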
Common Spelling Errors
• They are leaving in about fifteen minuets
• The study was conducted manly be John Black.
• The design an construction of the system will take more than a year.
• Hopefully, all with continue smoothly in my absence.
• Can they lave him my messages?
• I need to notified the bank of….
• He is trying to fine out.
Spell check without considering context will fail
Difficulty: Detect grammatical errors, or nonsensical expressions
N-grams
• 0-gram: Every word's likelihood is equal
  – Each word of a 300,000-word vocabulary has probability 1/300,000 ≈ .0000033
• Uni-gram: A word's likelihood depends on its frequency count
  – The occurs 69,971 times in the Brown corpus of 1,000,000 words
• Bi-gram: word likelihood is determined by the previous word, P(wn|wn-1)
  – The appears with frequency .07, rabbit appears with frequency .00001
  – But rabbit is a more likely word to follow white than the is
• Tri-gram: word likelihood is determined by the previous two words, P(wn|wn-2,wn-1)
• Question: How many previous words should we consider?
  – Test: Generate random sentences from Shakespeare
  – Results: Trigram sentences start looking like those of Shakespeare
  – Tradeoffs: Computational overhead and memory requirements
How many previous words should we consider?
The Sparse Data Problem
• Definitions
  – Maximum likelihood: finding the most probable sequence of tokens based on the context of the input
  – N-gram sequence: a sequence of n words whose context speech algorithms consider
  – Training data: a group of probabilities computed from a corpus of text data
  – Sparse data problem: how should algorithms handle n-grams that have very low probabilities?
• Data sparseness is a frequently occurring problem
• Algorithms will make incorrect decisions if it is not handled
• Problem 1: Low-frequency n-grams
  – Assume n-gram x occurs twice and n-gram y occurs once
  – Is x really twice as likely to occur as y?
• Problem 2: Zero counts
  – Probabilities compute to zero for n-grams not seen in the corpus
  – If n-gram y does not occur, should its probability be zero?
Smoothing
An algorithm that redistributes the probability mass

Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus)

Definition: A corpus is a collection of written or spoken material in machine-readable form
Add-One Smoothing
• The naïve smoothing technique
  – Add one to the count of all seen and unseen n-grams
  – Add the total increased count to the probability mass
• Example: Uni-grams
  – Un-smoothed probability for word w: P(w) = c(w)/N
  – Add-one revised probability for word w: P+1(w) = (c(w)+1)/(N+V)
  – N = number of words encountered, V = vocabulary size, c(w) = number of times word w was encountered
Add-One Smoothing Example

P(wn|wn-1) = C(wn-1wn)/C(wn-1)
P+1(wn|wn-1) = [C(wn-1wn)+1]/[C(wn-1)+V]

Note: This example assumes bi-gram counts and a vocabulary of V = 1616 words
Note: each row gives the number of times the word in the column precedes the word on the left, or starts a sentence
Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
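A sketch of the two formulas using the slide's V and C(I); the bigram count c(I want) is an assumed value for illustration, since the slide's full count table is not reproduced here:

```python
V = 1616         # vocabulary size, from the slide
c_I = 3437       # C(I), from the slide
c_I_want = 1087  # hypothetical bigram count C(I want), for illustration

p_mle = c_I_want / c_I               # un-smoothed P(want | I)
p_add1 = (c_I_want + 1) / (c_I + V)  # add-one smoothed P(want | I)
p_zero = (0 + 1) / (c_I + V)         # smoothed P of a bigram never seen after "I"

# Smoothing moves probability mass from seen bigrams to unseen ones
```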
Add-One Discounting
c′(wi-1,wi) = (c(wi-1,wi) + 1) × N/(N+V)
Original Counts
Revised Counts
Note: High counts are reduced by approximately a third in this example
Note: Low counts get larger
Note: N = c(wi-1), V = vocabulary size = 1616

Word      C(wi)
I         3437
Want      1215
To        3256
Eat        938
Chinese    213
Food      1506
Lunch      459
Evaluation of Add-One Smoothing
• Advantage:
  – Simple technique to implement and understand
• Disadvantages:
  – Too much probability mass moves to the unseen n-grams
  – Underestimates the probabilities of the common n-grams
  – Overestimates the probabilities of rare (or unseen) n-grams
  – Relative smoothing of all unseen n-grams is the same
  – Relative smoothing of rare n-grams is still incorrect
• Alternative:
  – Use a smaller add value
  – Disadvantage: does not fully solve the problem
Unigram Witten-Bell Discounting
• Compute the probability of a first-time encounter of a new word
  – Note: every one of the O observed words had a first encounter
  – How many unseen words? U = V - O
  – What is the probability of encountering a new word?
• Answer: P(any newly encountered word) = O/(V+O)
• Spread this probability equally across all unobserved words
  – P(any specific newly encountered word) = 1/U × O/(V+O)
  – Adjusted count = V × 1/U × O/(V+O)
• Discount each encountered word wi to preserve probability space
  – Probability from counti/V to counti/(V+O)
  – Discounted counts from counti to counti × V/(V+O)

Add probability mass to un-encountered words; discount the rest
O = observed words, U = words never seen, V = corpus vocabulary words
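A sketch of the unigram scheme as the slide states it, with O = observed word types, V = total words in the corpus, and U unseen words out of an assumed total vocabulary:

```python
def witten_bell_unigram(counts, total_vocab):
    """Witten-Bell discounting per the slide: reserve mass O/(V+O) for
    first-time encounters, split it equally over the U unseen words, and
    discount each seen word's probability from c/V to c/(V+O)."""
    O = len(counts)           # number of observed word types
    V = sum(counts.values())  # total observed words
    U = total_vocab - O       # unseen word types
    seen = {w: c / (V + O) for w, c in counts.items()}
    p_each_unseen = (O / (V + O)) / U if U else 0.0
    return seen, p_each_unseen

seen, p_unseen = witten_bell_unigram({"the": 3, "dog": 2, "cat": 1}, total_vocab=10)
# Probability mass is preserved: sum over seen words + 7 unseen words == 1
```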
Bi-gram Witten-Bell Discounting
• Consider the bi-gram wn-1wn
  – O(wn-1) = number of uniquely observed bi-grams starting with wn-1
  – V(wn-1) = count of bi-grams starting with wn-1
  – U(wn-1) = number of un-observed bi-grams starting with wn-1
• Compute the probability of a new bi-gram starting with wn-1
  – Answer: P(any newly encountered bi-gram) = O(wn-1)/(V(wn-1)+O(wn-1))
  – Note: we observed O(wn-1) first encounters among V(wn-1)+O(wn-1) events
  – Note: an event is either a bi-gram or a first-time encounter
• Divide this probability among all unseen bi-grams starting with wn-1
  – Adjusted P(new(wn-1)) = 1/U(wn-1) × O(wn-1)/(V(wn-1)+O(wn-1))
  – Adjusted count = V(wn-1) × 1/U(wn-1) × O(wn-1)/(V(wn-1)+O(wn-1))
• Discount observed bi-grams starting with wn-1 to preserve probability space
  – Probability from c(wn-1wn)/V(wn-1) to c(wn-1wn)/(V(wn-1)+O(wn-1))
  – Counts from c(wn-1wn) to c(wn-1wn) × V(wn-1)/(V(wn-1)+O(wn-1))

Add probability mass to un-encountered bi-grams; discount the rest
O = observed bi-grams, U = bi-grams never seen, V = count of bi-grams in the corpus
Witten-Bell Smoothing

Adjusted add-one counts:
c′(wn-1,wn) = (c(wn-1,wn) + 1) × N/(N+V)

Adjusted Witten-Bell counts:
c′(wn-1,wn) = O/U × V/(V+O)           if c(wn-1,wn) = 0
c′(wn-1,wn) = c(wn-1,wn) × V/(V+O)    otherwise

[Original Counts, Adjusted Add-One Counts, and Adjusted Witten-Bell Counts tables appeared here]

Note: V, O, U refer to wn-1 counts; their values are on the next slide
Bi-gram Counts for Example
Word      O(wn-1)   U(wn-1)   V(wn-1)
I            95      1,521     3437
Want         76      1,540     1215
To          130      1,486     3256
Eat         124      1,492      938
Chinese      20      1,596      213
Food         82      1,534     1506
Lunch        45      1,571      459

O(wn-1) = number of observed bi-grams starting with wn-1
V(wn-1) = count of bi-grams starting with wn-1
U(wn-1) = number of un-observed bi-grams starting with wn-1
Evaluation of Witten-Bell
• Estimates probability of already encountered grams to compute probabilities for unseen grams
• Smaller impact on probabilities of already encountered grams
• Generally computes reasonable probabilities
Back-off Discounting
• The general concept
  – Consider the trigram (wn-2, wn-1, wn)
  – If c(wn-2, wn-1, wn) = 0, consider the 'back-off' bi-gram (wn-1, wn)
  – If c(wn-1, wn) = 0, consider the 'back-off' unigram wn
• The goal is to use a hierarchy of approximations
  – trigram > bigram > unigram
  – Degrade gracefully when higher-level grams don't exist
• Given a word sequence fragment: wn-2 wn-1 wn …
• Utilize the following preference rule:
  1. p(wn|wn-2wn-1) if c(wn-2wn-1wn) > 0
  2. α1 p(wn|wn-1) if c(wn-1wn) > 0
  3. α2 p(wn) otherwise

Note: α1 and α2 are values carefully computed to preserve probability mass
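A sketch of the preference rule; the α weights are fixed constants here, whereas the slide computes them carefully to preserve probability mass:

```python
def backoff(w, context, tri, bi, uni, alpha1=0.4, alpha2=0.4):
    """Back-off: use the trigram if seen, else a weighted bigram,
    else a weighted unigram.  tri, bi, uni map n-grams to probabilities."""
    w2, w1 = context  # (wn-2, wn-1)
    if (w2, w1, w) in tri:
        return tri[(w2, w1, w)]
    if (w1, w) in bi:
        return alpha1 * bi[(w1, w)]
    return alpha2 * uni.get(w, 0.0)

# Toy probability tables for illustration
tri = {("the", "dog", "bites"): 0.2}
bi = {("dog", "barks"): 0.3}
uni = {"sleeps": 0.1}
```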
N-grams for Spell Checks
• Non-word detection (easiest). Example: graffe => giraffe
  – Isolated-word (context-free) error correction
  – By definition, cannot correct when the error word is a valid word
• Context-dependent (hardest). Example: your an idiot => you're an idiot
  – Occurs when the mis-typed word happens to be a real word
  – Estimated frequency of such real-word errors: 15% (Peterson, 1986), 25%-40% (Kukich, 1992)
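A context-dependent check can be sketched with bigram probabilities: among real-word candidates, prefer the one likelier after the previous word. The probabilities below are toy values, echoing the "fifteen minuets" error from the earlier slide:

```python
# Toy bigram probabilities; "minuets" is a valid word, so only context catches it
bigram_p = {
    ("fifteen", "minutes"): 0.01,
    ("fifteen", "minuets"): 0.000001,
}

def best_candidate(prev_word, candidates):
    """Pick the candidate with the highest bigram probability after prev_word."""
    return max(candidates, key=lambda w: bigram_p.get((prev_word, w), 0.0))

fixed = best_candidate("fifteen", ["minutes", "minuets"])
# fixed == "minutes"
```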