Search and Decoding in Speech Recognition: N-Grams


  • Search and Decoding in Speech Recognition: N-Grams

    Veton Kpuska

  • N-Grams: The problem of word prediction. Example: "I'd like to make a collect ..." Very likely next words are "call", "international call", or "phone call", and NOT "the". The idea of word prediction is formalized with probabilistic models called N-grams. N-grams predict the next word from the previous N-1 words. Statistical models of word sequences are also called language models or LMs. Computing the probability of the next word turns out to be closely related to computing the probability of a sequence of words. Example: "all of a sudden I notice three guys standing on the sidewalk" vs. "on guys all I of notice sidewalk three a sudden standing the".

    Veton Kpuska

  • N-Grams: Estimators like N-grams that assign a conditional probability to possible next words can be used to assign a joint probability to an entire sentence. N-gram models are one of the most important tools in speech and language processing.

    N-grams are essential in any task in which words must be identified from ambiguous and noisy inputs.

    Speech recognition: the input speech sounds are highly confusable and many words sound extremely similar.

    Veton Kpuska

  • N-Grams: Handwriting recognition: probabilities of word sequences help in recognition. Woody Allen, in his movie "Take the Money and Run", tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as "I have a gub". Any speech and language processing system could avoid making this mistake by using the knowledge that the sequence "I have a gun" is far more probable than the non-word "I have a gub" or even "I have a gull".

    Veton Kpuska

  • N-Grams: Statistical machine translation. Example: translating a Chinese source sentence, choosing from a set of potential rough English translations:

    he briefed to reporters on the chief contents of the statement
    he briefed reporters on the chief contents of the statement
    he briefed to reporters on the main contents of the statement
    he briefed reporters on the main contents of the statement

    Veton Kpuska

  • N-Grams: An N-gram grammar might tell us that "briefed reporters" is more likely than "briefed to reporters", and "main contents" is more likely than "chief contents". Spelling correction: we need to find and correct spelling errors like the following that accidentally result in real English words: "They are leaving in about fifteen minuets to go to her house." "The design an construction of the system will take more than a year." The problem is that these are real words, so a dictionary search will not help. Note: "in about fifteen minuets" is a much less probable sequence than "in about fifteen minutes". A spell-checker can use a probability estimator both to detect these errors and to suggest higher-probability corrections.

    Veton Kpuska

  • N-Grams: Augmentative communication: helping people who are unable to use speech or sign language to communicate (e.g., Stephen Hawking), using simple body movements to select words from a menu that are spoken by the system. Word prediction can be used to suggest likely words for the menu. Other areas: part-of-speech tagging, natural language generation, word similarity, authorship identification, sentiment extraction, predictive text input (cell phones).

    Veton Kpuska

  • Corpora & Counting Words: Probabilities are based on counting things, so we must decide what to count. Counting things in natural language is based on a corpus (plural corpora), an on-line collection of text or speech. Popular corpora: Brown and Switchboard.

    The Brown corpus is a 1-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.) assembled at Brown University in 1963-1964. Example sentence from the Brown corpus: "He stepped out into the hall, was delighted to encounter a water brother." That is 13 words if we don't count punctuation marks as words, 15 if we do. Treatment of "," and "." depends on the task. Punctuation marks are critical for identifying boundaries (, . ;) and for identifying some aspects of meaning (? !). For some tasks (part-of-speech tagging, parsing, or sometimes speech synthesis) punctuation marks are treated as separate words.

    Veton Kpuska

  • Corpora & Counting Words: The Switchboard corpus is a collection of 2430 telephone conversations averaging 6 minutes each, a total of 240 hours of speech and about 3 million words. Corpora of this kind have no punctuation, which complicates defining words. Example: "I do uh main- mainly business data processing." Two kinds of disfluencies: the broken-off word "main-" is called a fragment; words like "uh" and "um" are called fillers or filled pauses. Whether disfluencies are counted as words depends on the application: an automatic dictation system based on speech recognition will remove them; a speaker identification application can use disfluencies to identify a person; parsing and word prediction can use them. Stolcke and Shriberg (1996) found that treating "uh" as a word improves next-word prediction, and thus most speech recognition systems treat "uh" and "um" as words.

    Veton Kpuska

  • N-Grams: Are capitalized tokens like "They" and un-capitalized tokens like "they" the same word? In speech recognition they are treated the same; in part-of-speech tagging capitalization is retained as a separate feature. In this chapter, models are not case sensitive. Inflected forms: "cats" versus "cat". These two words have the same lemma "cat" but are different wordforms. A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word-sense. A wordform is the full inflected or derived form of the word.

    Veton Kpuska

  • N-Grams: In this chapter N-grams are based on wordforms.

    N-gram models, and counting words in general, require the kind of tokenization or text normalization introduced in the previous chapter: separating out punctuation, dealing with abbreviations (m.p.h.), normalizing spelling, etc.

    Veton Kpuska

  • N-Grams: How many words are there in English? We must first distinguish types, the number of distinct words in a corpus (the vocabulary size V), from tokens, the total number N of running words. Example: "They picnicked by the pool, then lay back on the grass and looked at the stars." 16 tokens, 14 types.

    Veton Kpuska

  • N-Grams: The Switchboard corpus has ~20,000 wordform types and ~3 million wordform tokens. Shakespeare's complete works have 29,066 wordform types and 884,647 wordform tokens. The Brown corpus has 61,805 wordform types, 37,851 lemma types, and 1 million wordform tokens. A very large corpus (Brown et al., 1992a) was found to include 293,181 different wordform types in 583 million wordform tokens. The American Heritage Dictionary, third edition, lists 200,000 boldface forms. It seems that the larger the corpus, the more word types are found; it has been suggested that vocabulary size (the number of types) grows at least as fast as the square root of the number of tokens.

    Veton Kpuska

  • Brief Introduction to Probability: Discrete Probability

    Veton Kpuska

  • Discrete Probability Distributions. Definition: a set S, called the sample space, contains the set of all possible outcomes.

    For each element x of the set S (x ∈ S), a probability value is assigned as a function of x, P(x), with the following properties: (1) P(x) ∈ [0,1] for all x ∈ S; (2) the probabilities sum to one, Σ_{x∈S} P(x) = 1.

    Veton Kpuska

  • Discrete Probability Distributions: An event is defined as any subset E of the sample space S. The probability of the event E is defined as P(E) = Σ_{x∈E} P(x).

    The probability of the entire space S is 1, as indicated by property (2) on the previous slide. The probability of the empty or null event is 0.

    The function P(x) mapping a point in the sample space to its probability value is called a probability mass function (pmf).

    Veton Kpuska

  • Properties of a Probability Function: If A and B are mutually exclusive events in S, then P(A ∪ B) = P(A) + P(B). Mutually exclusive events are those for which A ∩ B = ∅. In general, for n mutually exclusive events A_1, ..., A_n: P(A_1 ∪ ... ∪ A_n) = P(A_1) + ... + P(A_n). (Venn diagram of A and B.)

    Veton Kpuska

  • Elementary Theorems of Probability: If A is any event in S, then P(A') = 1 - P(A), where A' is the complement of A (the set of all outcomes not in A). Proof: P(A ∪ A') = P(A) + P(A') since A and A' are mutually exclusive; also P(A ∪ A') = P(S) = 1; therefore P(A) + P(A') = 1.

    Veton Kpuska

  • Elementary Theorems of Probability: If A and B are any events in S, then P(A ∪ B) = P(A) + P(B) - P(A ∩ B). Proof sketch: P(A ∪ B) = P(A ∩ B') + P(A ∩ B) + P(A' ∩ B); adding and subtracting P(A ∩ B), this equals [P(A ∩ B') + P(A ∩ B)] + [P(A' ∩ B) + P(A ∩ B)] - P(A ∩ B) = P(A) + P(B) - P(A ∩ B). (Venn diagram of A and B in S.)

    Veton Kpuska

  • Conditional Probability: If A and B are any events in S, and P(B) ≠ 0, the conditional probability of A relative to B is given by P(A|B) = P(A ∩ B) / P(B).

    If A and B are any events in S, then P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A).

    Veton Kpuska

  • Independent Events: If A and B are independent events, then P(A ∩ B) = P(A) P(B), and equivalently P(A|B) = P(A).

    Veton Kpuska

  • Bayes Rule: If B_1, B_2, B_3, ..., B_n are mutually exclusive events of which one must occur (that is, B_1 ∪ B_2 ∪ ... ∪ B_n = S), then P(B_i|A) = P(A|B_i) P(B_i) / Σ_{j=1..n} P(A|B_j) P(B_j).

    Veton Kpuska

  • End of Brief Introduction to Probability

    Veton Kpuska

  • Simple N-grams

    Veton Kpuska

  • Simple (Unsmoothed) N-Grams: Our goal is to compute the probability of a word w given some history h, P(w|h). Example: h = "its water is so transparent that", w = "the", i.e., P(the | its water is so transparent that). How can we compute this probability? One way is to estimate it from relative frequency counts: from a very large corpus, count the number of times we see "its water is so transparent that" and count the number of times this is followed by "the". Out of the times we saw the history h, how many times was it followed by the word w:
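
    The relative-frequency estimate described above, written out (a reconstruction in the notation used later in the chapter):

```latex
P(\textrm{the} \mid \textrm{its water is so transparent that}) =
  \frac{C(\textrm{its water is so transparent that the})}
       {C(\textrm{its water is so transparent that})}
```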

    Veton Kpuska

  • Estimating Probabilities: Estimating probabilities from counts works fine in many cases, but it turns out that even the web is not big enough to give us good estimates in most cases. Language is creative: new sentences are created all the time, so it is not possible to count entire sentences.

    Veton Kpuska

  • Estimating Probabilities: Joint probabilities: the probability of an entire sequence of words like "its water is so transparent". Out of all possible sequences of 5 words, how many of them are "its water is so transparent"? We must count all occurrences of "its water is so transparent" and divide by the sum of counts of all possible 5-word sequences.

    That seems like a lot of work for a simple estimate.

    Veton Kpuska

  • Estimating Probabilities: We must figure out cleverer ways of estimating the probability of a word w given some history h, or of an entire word sequence W. Introduction of formal notation: a random variable X_i; the probability of X_i taking on the value "the" is written P(X_i = the), abbreviated P(the). A sequence of n words is written w_1, w_2, ..., w_n, or w_1^n for short. The joint probability of each word in a sequence having a particular value, P(X_1 = w_1, X_2 = w_2, ..., X_n = w_n), is abbreviated P(w_1, w_2, ..., w_n) or P(w_1^n).

    Veton Kpuska

  • *Veton Kpuska*Chain RuleChain rule of Probability:

    Applying the chain rule to words we get:
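
    The two equations referred to above, reconstructed in the w_1^n notation used in this chapter:

```latex
P(X_1 \ldots X_n) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1^{2}) \cdots P(X_n \mid X_1^{n-1})
                  = \prod_{k=1}^{n} P(X_k \mid X_1^{k-1})

P(w_1^{n}) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^{2}) \cdots P(w_n \mid w_1^{n-1})
           = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
```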

    Veton Kpuska

  • Chain Rule: The chain rule provides the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. The equation on the previous slide gives a way of computing the joint probability estimate of an entire sequence as a product of conditional probabilities. However, we still do not know any way of computing the exact probability of a word given a long sequence of preceding words, P(w_n | w_1^{n-1}).

    Veton Kpuska

  • N-Grams: Approximation. The idea of the N-gram model is to approximate the history by just the last few words instead of computing the probability of a word given its entire history. Bigram: the bigram model approximates the probability of a word given all the previous words by the conditional probability given the preceding word. Example: instead of computing the probability P(the | its water is so transparent that),

    it is approximated with the probability P(the | that).
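
    In general form (a reconstruction of the bigram approximation stated above):

```latex
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})
```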

    Veton Kpuska

  • Bi-gram: When the bigram approximation is applied to a whole sentence, the joint probability is approximated as P(w_1^n) ≈ ∏_{k=1..n} P(w_k | w_{k-1}).

    The assumption that the conditional probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

    Veton Kpuska

  • Bi-gram Generalization: A trigram looks two words into the past; an N-gram looks N-1 words into the past. The general equation for the N-gram approximation to the conditional probability of the next word in a sequence is:
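
    A reconstruction of the general N-gram approximation referred to above:

```latex
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})
```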

    The simplest and most intuitive way to estimate probabilities is the method called Maximum Likelihood Estimation or MLE for short.

    Veton Kpuska

  • Maximum Likelihood Estimation: The MLE estimate of the parameters of an N-gram model is obtained by taking counts from a corpus and normalizing them so they lie between 0 and 1.

    Bigram: to compute a particular bigram probability of a word y given a previous word x, we compute the count C(xy) and normalize it by the sum of all bigram counts that share the same first word x.

    Veton Kpuska

  • *Veton Kpuska*Maximum Likelihood EstimateThe previous equation can be further simplified by noting:

    Veton Kpuska

  • Example: A mini-corpus containing three sentences, each marked with the beginning-of-sentence marker <s> and the end-of-sentence marker </s>: "<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>" (from the Dr. Seuss book Green Eggs and Ham). Some of the bigram calculations from this corpus: P(I|<s>) = 2/3 = 0.67, P(Sam|<s>) = 1/3 = 0.33, P(am|I) = 2/3 = 0.67, P(Sam|am) = 1/2 = 0.5, P(</s>|Sam) = 1/2 = 0.5, P(</s>|am) = 1/2 = 0.5, P(do|I) = 1/3 = 0.33.
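
    A minimal sketch (not from the original slides) that reproduces the bigram MLE values above by counting over the three-sentence mini-corpus:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    # MLE bigram probability: P(w | prev) = C(prev w) / C(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3 = 0.67
print(p_mle("am", "I"))      # 2/3 = 0.67
print(p_mle("Sam", "am"))    # 1/2 = 0.5
print(p_mle("</s>", "Sam"))  # 1/2 = 0.5
```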

    Veton Kpuska

  • *Veton Kpuska*N-gram Parameter EstimationIn general case, MLE is calculated for N-gram model using the following:
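
    A reconstruction of the general N-gram relative-frequency (MLE) estimate referred to above:

```latex
P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}
```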

    This equation estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. This ratio is called a relative frequency. Using relative frequencies is one way to estimate probabilities in Maximum Likelihood Estimation. Conventional MLE is not always the best way to compute probability estimates (it is biased toward the training corpus, e.g., Brown). MLE can be modified to better address these considerations.

    Veton Kpuska

  • Example 2: Data from the Berkeley Restaurant Project corpus of 9332 sentences (available on the WWW):

    can you tell me about any good cantonese restaurants close by
    mid priced thai food is what i'm looking for
    tell me about chez panisse
    can you give me a listing of the kinds of food that are available
    i am looking for a good place to eat breakfast
    when is caffe venezia open during the day

    Veton Kpuska

  • Bigram counts for eight of the words (out of V=1446) in the Berkeley Restaurant Project corpus of 9332 sentences (row = first word, column = following word):

             i     want   to    eat   chinese  food  lunch  spend
    i        5     827    0     9     0        0     0      2
    want     2     0      608   1     6        6     5      1
    to       2     0      4     686   2        0     6      211
    eat      0     0      2     0     16       2     42     0
    chinese  1     0      0     0     0        82    1      0
    food     15    0      15    0     1        4     0      0
    lunch    2     0      0     0     0        1     0      0
    spend    1     0      1     0     0        0     0      0

    Veton Kpuska

  • Bigram Probabilities After Normalization. Some other useful probabilities: P(i|<s>) = 0.25, P(english|want) = 0.0011, P(food|english) = 0.5, P(</s>|food) = 0.68.

    Now we can compute the probability of a sentence like "I want English food" or "I want Chinese food" by multiplying the appropriate bigram probabilities together. Unigram counts:

             i     want   to    eat   chinese  food  lunch  spend
             2533  927    2417  746   158      1093  341    278

    Veton Kpuska

  • Bigram probabilities for eight words (out of V=1446) in the Berkeley Restaurant Project corpus of 9332 sentences (row = first word, column = following word):

             i        want     to       eat      chinese  food     lunch    spend
    i        0.002    0.33     0        0.0036   0        0        0        0.00079
    want     0.0022   0        0.66     0.0011   0.0065   0.0065   0.0054   0.0011
    to       0.00083  0        0.0017   0.28     0.00083  0        0.0025   0.087
    eat      0        0        0.0027   0        0.021    0.0027   0.056    0
    chinese  0.0063   0        0        0        0        0.52     0.0063   0
    food     0.014    0        0.014    0        0.00092  0.0037   0        0
    lunch    0.0059   0        0        0        0        0.0029   0        0
    spend    0.0036   0        0.0036   0        0        0        0        0

    Veton Kpuska

  • Bigram Probability: P(<s> i want english food </s>) = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food) = 0.25 x 0.33 x 0.0011 x 0.5 x 0.68 = 0.000031. Exercise: compute the probability of "i want chinese food".

    Some of the bigram probabilities encode facts that we think of as strictly syntactic in nature: what comes after "eat" is usually a noun or an adjective; what comes after "to" is usually a verb.

    Veton Kpuska

  • Trigram Modeling: Although we will generally show bigram models in this chapter for pedagogical purposes, note that when there is sufficient training data we are more likely to use trigram models, which condition on the previous two words rather than the previous word. To compute trigram probabilities at the very beginning of a sentence, we can use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).

    Veton Kpuska

  • Training and Test Sets: N-gram models are estimated from the corpus they are trained on, and then used on some new data in some task (e.g., speech recognition). The new data or task will not be exactly the same as the training data. Formally: the data used to build the N-gram (or any model) is called the Training Set or Training Corpus; the data used to test the model comprises the Test Set or Test Corpus.

    Veton Kpuska

  • Model Evaluation: The training-and-testing paradigm can also be used to evaluate different N-gram architectures: comparing N-grams of different order N, or using different smoothing algorithms (to be introduced later). Train various models on the training corpus, then evaluate each model on the test corpus.

    How do we measure the performance of each model on the test corpus? Perplexity (introduced later in the chapter) is based on computing the probability of each sentence in the test set: the model that assigns a higher probability to the test set (and hence more accurately predicts it) is assumed to be the better model. Because the evaluation metric is based on test-set probability, it is important not to let test sentences into the training set; we must avoid training on the test set.

    Veton Kpuska

  • Other Divisions of Data: Sometimes an extra source of data is needed to augment the training set; this data is called a held-out set. The N-gram counts are based only on the training set; the held-out set is used to set additional parameters of the model, for example the interpolation weights of an N-gram model. Multiple test sets: a test set that is used often while developing and measuring a model is typically called the development (test) set. Because of this high usage the models may become tuned to it, so a completely unseen (or seldom used) data set should be used for the final evaluation; this set is called the evaluation (test) set.

    Veton Kpuska

  • *Veton Kpuska*Picking Train, Development Test and Evaluation Test DataFor training we need as much data as possible.However, for Testing we need sufficient data in order for the resulting measurements to be statistically significant.In practice often the data is divided into 80% training 10% development and 10% evaluation.

    Veton Kpuska

  • N-gram Sensitivity to the Training Corpus: N-gram modeling, like many statistical models, is very dependent on the training corpus. Often the model encodes very specific facts about a given training corpus. N-grams do a better and better job of modeling the training corpus as we increase the value of N. This is another aspect of the model being tuned too specifically to the training data at the expense of generality.

    Veton Kpuska

  • Visualization of N-gram Modeling (Shannon, 1951, and Miller & Selfridge, 1950). The simplest way to visualize how this works is for the unigram case: all words of English cover the probability space between 0 and 1, each word covering an interval of size equal to its relative frequency. We choose a random number between 0 and 1 and print out the word whose interval includes the value we have chosen. We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>. The same technique can be used to generate from a bigram model: first generate a random bigram that starts with <s> (according to its bigram probability), then choose a random bigram to follow it (again according to its conditional probability), and so on.
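
    A minimal sketch (not from the original slides) of this sampling procedure for a bigram model; the toy probability table and its numbers are purely illustrative:

```python
import random

def generate_sentence(bigram_probs, max_len=20):
    """bigram_probs maps a history word to a dict {next_word: P(next | history)}."""
    sentence, prev = [], "<s>"
    while len(sentence) < max_len:
        nexts = bigram_probs[prev]
        words, probs = list(nexts.keys()), list(nexts.values())
        # Pick one word with probability proportional to its weight, i.e. the
        # "interval on [0,1]" picture described in the slide.
        w = random.choices(words, weights=probs, k=1)[0]
        if w == "</s>":
            break
        sentence.append(w)
        prev = w
    return " ".join(sentence)

# Illustrative toy model only:
toy = {
    "<s>": {"i": 0.6, "sam": 0.4},
    "i": {"am": 0.7, "do": 0.3},
    "am": {"sam": 0.5, "</s>": 0.5},
    "do": {"</s>": 1.0},
    "sam": {"</s>": 0.6, "i": 0.4},
}
print(generate_sentence(toy))
```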

    Veton Kpuska

  • Visualization of N-gram Modeling: Unigram. To give an intuition of the increasing power of higher-order N-grams, the examples below show random sentences generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare's works.

    To him swallowed confess hear both. Which. Of save on trail for are ay device an rote life have
    Every enter noe severally so, let
    Hill he late speaks; or! A more to leg less first you enter
    Are where exeunt and sighs have rise excellency took of. Sleep knave we. Near; vile like

    Veton Kpuska

  • Visualization of N-gram Modeling: Bigram.
    What means, sir. I confess she? then all sorts, he is trim, captain.
    Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
    What we, hath got so she that I rest and sent to scold and nature bankrupt, nor the first gentleman?
    Enter Menenius, if it so many good direction found'st thou art a strong upon command of fear not a liberal largess given away, Falstaff! Exeunt

    Veton Kpuska

  • Visualization of N-gram Modeling: Trigram.
    Sweet prince, Falstaff shall die. Harry of Monmouth's grave.
    This shall forbid it should be branded, if renown made it empty.
    Indeed the duke; and had a very good friend.
    Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.

    Veton Kpuska

  • Visualization of N-gram Modeling: Quadrigram.
    King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
    Will you not tell me who I am?
    It cannot be but so.
    Indeed the short and the long. Marry, 'tis a noble Lepidus.

    Veton Kpuska

  • Size of N in N-gram Models: The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words, nor sentence-final punctuation. The bigram sentences have some very local word-to-word coherence (especially if we consider that punctuation counts as a word). The trigram and quadrigram sentences are beginning to look a lot like Shakespeare. Indeed, a careful investigation of the quadrigram sentences shows that they look a little too much like Shakespeare: the words "It cannot be but so" are directly from King John.

    Veton Kpuska

  • Specificity vs. Generality: The variability of words and phrases in Shakespeare is not very large in the context of training corpora used for language modeling: N = 884,647 and V = 29,066. The N-gram probability matrices are very sparse: there are V^2 = 844,000,000 possible bigrams alone, and V^4 = 7x10^17 possible quadrigrams. Once the generator has chosen the first quadrigram, there are only five possible continuations (that, I, he, thou, and so); in fact for many quadrigrams there is only one continuation.

    Veton Kpuska

  • Dependence of a Grammar on its Training Set: Example of the Wall Street Journal (WSJ) corpus, based on the newspaper. Shakespeare's works and the WSJ are both in English, so one might expect some overlap between our N-grams for the two genres. To check whether this is true, the next slide shows sentences generated by unigram, bigram, and trigram grammars trained on 40 million words from the WSJ.

    Veton Kpuska

  • WSJ Example

    Unigram: Months the my and issue of year foreign new exchanges september were recession exchange new endorsed a acquire to six executives

    Bigram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U.S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her

    Trigram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

    Sentences randomly generated from three orders of N-gram computed from 40 million words of the Wall Street Journal. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.

    Veton Kpuska

  • Comparison of Shakespeare and WSJ Examples: While superficially they both seem to model English-like sentences, there is obviously no overlap whatsoever in possible sentences, and little if any overlap even in small phrases. This stark difference tells us that statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and WSJ.

    How should we deal with this problem when we build N-gram models?

    Veton Kpuska

  • Comparison of Shakespeare and WSJ Examples: In general we need to be sure to use a training corpus that looks like our test corpus. We especially wouldn't choose training and test sets from different genres of text, like newspaper text, early English fiction, telephone conversations, and web pages. Sometimes finding appropriate training text for a specific new task can be difficult: to build N-grams for text prediction in SMS (Short Message Service), we need a training corpus of SMS data; to build N-grams for business meetings, we would need corpora of transcribed business meetings. For general research where we know we want written English but don't have a domain in mind, we can use a balanced training corpus that includes cross-sections of different genres, such as the 1-million-word Brown corpus of English (Francis and Kucera, 1982) or the 100-million-word British National Corpus (Leech et al., 1994). Recent research has also studied ways to dynamically adapt language models to different genres.

    Veton Kpuska

  • Unknown Words: Open vs. Closed Vocabulary Tasks. Sometimes we have a language task in which we know all the words that can occur, and hence we know the vocabulary size V in advance. The closed vocabulary assumption is the assumption that we have such a lexicon, and that the test set can only contain words from this lexicon. The closed vocabulary task thus assumes there are no unknown words.

    Veton Kpuska

  • Unknown Words: Open vs. Closed Vocabulary Tasks. As we suggested earlier, the number of unseen words grows constantly, so we can't possibly know in advance exactly how many there are, and we'd like our model to do something reasonable with them. We call these unseen events unknown words, or out of vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate. An open vocabulary system is one in which we model these potential unknown words in the test set by adding a pseudo-word called <UNK>.

    Veton Kpuska

  • Training Probabilities of the Unknown Word Model: We can train the probabilities of the unknown word model as follows: (1) Choose a vocabulary (word list) that is fixed in advance. (2) In the training set, convert any word that is not in this vocabulary (any OOV word) to the unknown word token <UNK> in a text normalization step. (3) Estimate the probabilities for <UNK> from its counts, just like any other regular word in the training set.

    Veton Kpuska

  • Evaluating N-Grams: Perplexity

    Veton Kpuska

  • Perplexity: The correct way to evaluate the performance of a language model is to embed it in an application and measure the total performance of the application. Such end-to-end evaluation, also called in vivo evaluation, is the only way to know whether a particular improvement in a component is really going to help the task at hand. Thus for speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

    Veton Kpuska

  • Perplexity: End-to-end evaluation is often very expensive; evaluating a large speech recognition test set, for example, takes hours or even days. Thus we would like a metric that can be used to quickly evaluate potential improvements in a language model. Perplexity is the most common evaluation metric for N-gram language models. While an improvement in perplexity does not guarantee an improvement in speech recognition performance (or any other end-to-end metric), it often correlates with such improvements. Thus it is commonly used as a quick check on an algorithm; an improvement in perplexity can then be confirmed by an end-to-end evaluation.

    Veton Kpuska

  • Perplexity: Given two probabilistic models, the better model is the one that has a tighter fit to the test data, or predicts the details of the test data better. We can measure better prediction by looking at the probability the model assigns to the test data; the better model will assign a higher probability to the test data.

    Veton Kpuska

  • Definition of Perplexity: The perplexity (sometimes abbreviated PP) of a language model on a test set is a function of the probability that the language model assigns to that test set. For a test set W = w_1 w_2 ... w_N, the perplexity is the probability of the test set, normalized by the number of words:
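
    A reconstruction of the perplexity definition referred to above:

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
```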

    Veton Kpuska

  • Definition of Perplexity: We can use the chain rule to expand the probability of W:

    For bigram language model the perplexity of W is computed as:
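
    Reconstructions of the two equations referred to above: the chain-rule expansion, and its bigram special case:

```latex
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
\qquad
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
```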

    Veton Kpuska

  • Interpretation of Perplexity: Minimizing perplexity is equivalent to maximizing the test set probability according to the language model. The word sequence used in the general equation on the previous slide is the entire sequence of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

    Veton Kpuska

  • Interpretation of Perplexity: Perplexity can also be interpreted as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word.

    Consider the task of recognizing the digits in English (zero, one, two, ..., nine), given that each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this language is in fact 10. To see this, imagine a string of digits of length N. By the equation on the previous slide, the perplexity is:
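
    The worked derivation for the digit language (reconstructed from the description above):

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-\frac{1}{N}}
      = \left(\tfrac{1}{10}\right)^{-1} = 10
```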

    Veton Kpuska

  • Interpretation of Perplexity: Exercise: suppose that the number zero is really frequent and occurs 10 times more often than the other numbers. Show that the perplexity is lower, as expected, since most of the time the next number will be zero. The branching factor, however, is still the same for the digit recognition task (i.e., 10).

    Veton Kpuska

  • Interpretation of Perplexity: Perplexity is also related to the information-theoretic notion of entropy, as will be shown later in this chapter.

    Veton Kpuska

  • Example of Perplexity Use: Perplexity is used in the following example to compare three N-gram models. Unigram, bigram, and trigram grammars are trained on 38 million words (including start-of-sentence tokens) from the WSJ corpus with a 19,979-word vocabulary. Perplexity is computed on a test set of 1.5 million words via the equation in the slide "Definition of Perplexity"; the results are summarized in the table below:

    N-gram Order   Unigram   Bigram   Trigram
    Perplexity     962       170      109

    Veton Kpuska

  • Example of Perplexity Use: As we saw on the previous slide, the more information the N-gram gives us about the word sequence, the lower the perplexity: the perplexity is related inversely to the likelihood of the test sequence according to the model.

    Note that in computing perplexities, the N-gram model P must be constructed without any knowledge of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low.

    For example, we defined above the closed vocabulary task, in which the vocabulary for the test set is specified in advance. This can greatly reduce the perplexity. As long as this knowledge is provided equally to each of the models we are comparing, the closed vocabulary perplexity can still be useful for comparing models, but care must be taken in interpreting the results. In general, the perplexity of two language models is only comparable if they use the same vocabulary.

    Veton Kpuska

  • Smoothing

    Veton Kpuska

  • Smoothing: There is a major problem with the maximum likelihood estimation process we have seen for training the parameters of an N-gram model. This is the problem of sparse data, caused by the fact that our maximum likelihood estimate was based on a particular set of training data. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. This missing data means that the N-gram matrix for any given training corpus is bound to have a very large number of cases of putative zero probability N-grams that should really have some non-zero probability. Furthermore, the MLE method also produces poor estimates when the counts are non-zero but still small.

    Veton Kpuska

  • Smoothing: We need a method which can help get better estimates for these zero or low frequency counts. Zero counts turn out to cause another huge problem. The perplexity metric defined above requires that we compute the probability of each test sentence. But if a test sentence has an N-gram that never appeared in the training set, the Maximum Likelihood estimate of the probability for this N-gram, and hence for the whole test sentence, will be zero!

    This means that in order to evaluate our language models, we need to modify the MLE method to assign some non-zero probability to any N-gram, even one that was never observed in training.

    Veton Kpuska

  • Smoothing: The term smoothing is used for such modifications that address the poor estimates due to variability in small data sets. The name comes from the fact that (looking ahead a bit) we will be shaving a little bit of probability mass from the higher counts, and piling it instead on the zero counts, making the distribution a little less discontinuous. In the next few sections some smoothing algorithms are introduced. The original Berkeley Restaurant example introduced previously will be used to show how smoothing algorithms modify the bigram probabilities.

    Veton Kpuska

  • Laplace Smoothing: One simple way to do smoothing is to take our matrix of bigram counts, before we normalize them into probabilities, and add one to all the counts. This algorithm is called Laplace smoothing, or Laplace's Law. Laplace smoothing does not perform well enough to be used in modern N-gram models, but we begin with it because it introduces many of the concepts that we will see in other smoothing algorithms, and it also gives us a useful baseline.

    Veton Kpuska

  • Laplace Smoothing of Unigram Probabilities: Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N.

    Laplace smoothing adds one to each count. Considering that there are V words in the vocabulary and each one's count was increased, we also need to adjust the denominator to take into account the extra V observations in order to keep legitimate probabilities:
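
    Reconstructions of the unsmoothed and Laplace-smoothed unigram estimates described above:

```latex
P(w_i) = \frac{c_i}{N}
\qquad
P_{\mathrm{Laplace}}(w_i) = \frac{c_i + 1}{N + V}
```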

    Veton Kpuska

  • Laplace Smoothing: It is convenient to describe a smoothing algorithm as a corrective constant that affects the numerator, by defining an adjusted count c* as follows:
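
    A reconstruction of the adjusted-count definition referred to above:

```latex
c_i^{*} = (c_i + 1)\,\frac{N}{N + V}
```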

    Veton Kpuska

  • Discounting: A related way to view smoothing is as discounting (lowering) some non-zero counts in order to obtain the probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts: d_c = c*/c.

    Veton Kpuska

  • Berkeley Restaurant Project: Smoothed (Add-One) Bigram Counts (V=1446)

             i     want   to    eat   chinese  food  lunch  spend
    i        6     828    1     10    1        1     1      3
    want     3     1      609   2     7        7     6      2
    to       3     1      5     687   3        1     7      212
    eat      1     1      3     1     17       3     43     1
    chinese  2     1      1     1     1        83    2      1
    food     16    1      16    1     2        5     1      1
    lunch    3     1      1     1     1        2     1      1
    spend    2     1      2     1     1        1     1      1

    Veton Kpuska

  • Smoothed Bigram Probabilities: Recall that normal bigram probabilities are computed by normalizing each row of counts by the corresponding unigram count.

    For add-one smoothed bigram probabilities we need to augment the unigram count in the denominator by the number of word types in the vocabulary, V:
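
    Reconstructions of the unsmoothed and add-one smoothed bigram estimates described above:

```latex
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}
\qquad
P^{*}_{\mathrm{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
```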

    The result is the smoothed bigram probabilities presented in the table in the next slide.

    Veton Kpuska

  • Smoothed bigram probabilities for eight words (out of V=1446) in the Berkeley Restaurant Project corpus of 9332 sentences:

             i        want     to       eat      chinese  food     lunch    spend
    i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
    want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
    to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
    eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
    chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
    food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
    lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
    spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058

    Veton Kpuska

  • Adjusted Counts Table: It is often convenient to reconstruct the count matrix so that we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed by the equation below; the table in the next slide shows the reconstructed counts.
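
    A reconstruction of the adjusted bigram count referred to above:

```latex
c^{*}(w_{n-1} w_n) = \frac{\left[ C(w_{n-1} w_n) + 1 \right] \times C(w_{n-1})}{C(w_{n-1}) + V}
```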

    Veton Kpuska

  • Adjusted Counts Table

             i     want   to     eat    chinese  food  lunch  spend
    i        3.8   527    0.64   6.4    0.64     0.64  0.64   1.9
    want     1.2   0.39   238    0.78   2.7      2.7   2.3    0.78
    to       1.9   0.63   3.1    430    1.9      0.63  4.4    133
    eat      0.34  0.34   1      0.34   5.8      1     15     0.34
    chinese  0.2   0.098  0.098  0.098  0.098    8.2   0.2    0.098
    food     6.9   0.43   6.9    0.43   0.86     2.2   0.43   0.43
    lunch    0.57  0.19   0.19   0.19   0.19     0.38  0.19   0.19
    spend    0.32  0.16   0.32   0.16   0.16     0.16  0.16   0.16

    Veton Kpuska

  • Observation: Note that add-one smoothing has made a very big change to the counts: C(want to) changed from 608 to 238! We can see this in probability space as well: P(to|want) decreases from 0.66 in the unsmoothed case to 0.26 in the smoothed case. Looking at the discount d (the ratio between new and old counts) shows how strikingly the counts for each prefix word have been reduced: the discount for the bigram "want to" is 0.39, while the discount for "Chinese food" is 0.10, a factor of 10!

    Veton Kpuska

  • Problems with Add-One (Laplace) Smoothing: The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros. We could move a bit less mass by adding a fractional count rather than 1 (add-d smoothing; Lidstone, 1920; Jeffreys, 1948), but this method requires a way of choosing d dynamically, results in an inappropriate discount for many counts, and turns out to give counts with poor variances. For these and other reasons (Gale and Church, 1994), we'll need to use better smoothing methods for N-grams, like the ones presented in the next section.

    Veton Kpuska

  • Good-Turing Discounting: A number of much better algorithms have been developed that are only slightly more complex than add-one smoothing; Good-Turing is one of them. The idea behind these algorithms is to use the count of things you have seen once to help estimate the count of things you have never seen. Good described the algorithm in 1953, crediting Turing for the original idea. The basic idea is to re-estimate the amount of probability mass to assign to N-grams with zero counts by looking at the number of N-grams that occurred exactly once. A word or N-gram that occurs once is called a singleton. The Good-Turing algorithm uses the frequency of singletons as a re-estimate of the frequency of zero-count bigrams.

    Veton Kpuska

  • Good-Turing Discounting, definitions: N_c is the number of N-grams that occur c times (the frequency of frequency c). N_0 is the number of N-grams with count 0, N_1 the number of N-grams with count 1 (singletons), and so on.

    The MLE count for an N-gram in bin N_c is c. The Good-Turing estimate replaces this with a smoothed count c*, expressed as a function of N_{c+1}:
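
    A reconstruction of the Good-Turing smoothed count referred to above:

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}
```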

    Veton Kpuska

  • Good-Turing Discounting: The equation on the previous slide can be used to replace the MLE counts for all the bins N_1, N_2, and so on. Instead of using it directly to re-estimate the smoothed count c* for N_0, the probability of the total missing mass is defined as P*_GT(things with zero count) = N_1 / N,

    where N_1 is the count of items in bin 1 (those seen exactly once in training), and N is the total number of items we have seen in training.

    Veton Kpuska

  • Good-Turing Discounting Example: A lake has 8 species of fish (bass, carp, catfish, eel, perch, salmon, trout, whitefish). While fishing we have caught 6 species with the following counts: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, and 1 eel (no catfish and no bass). What is the probability that the next fish we catch will be a new species, i.e., one that had zero frequency in our training set (catfish or bass)?

    The MLE count c of an unseen species (bass or catfish) is 0. From the equation on the previous slide, the probability of a new fish being one of these unseen species is 3/18, since N_1 is 3 and N is 18.

    Veton Kpuska

  • Good-Turing Discounting Example: Let us now estimate the probability that the next fish will be another trout. The MLE count for trout is 1, so the MLE estimated probability is 1/18. However, the Good-Turing estimate must be lower, since we took 3/18 of our probability mass to use on unseen events! We must discount the MLE probabilities for the observed species.

    The revised count c* and Good-Turing smoothed probabilities for species with count 0 (like bass or catfish) or count 1 (like trout, salmon, or eel) are as follows:

    Veton Kpuska

  • Good-Turing Discounting Example: Note that the revised count c* for eel is likewise discounted from c = 1.0 to c* = 0.67, in order to account for the probability mass p*(unseen) = 3/18 = 0.17 reserved for the unseen species (catfish and bass). Since we know there are 2 unseen species, the probability of the next fish being specifically a catfish is p*(catfish) = (1/2) x (3/18) ≈ 0.083.

              unseen (bass or catfish)   trout
    c         0                          1
    MLE p     0/18 = 0                   1/18 = 0.056
    c* (GT)   -                          2 x N_2/N_1 = 0.67
    p* (GT)   N_1/N = 3/18 = 0.17        0.67/18 = 0.037
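
    A minimal sketch (not from the original slides) that carries out the Good-Turing computation for this fishing example:

```python
from collections import Counter

counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())           # 18 fish caught in total
Nc = Counter(counts.values())      # frequency of frequencies: N_1=3, N_2=1, N_3=1, N_10=1

# Probability mass reserved for the unseen species (bass, catfish): N_1 / N
p_unseen_total = Nc[1] / N         # 3/18 = 0.17
p_catfish = p_unseen_total / 2     # split over the 2 unseen species = 0.083

# Good-Turing revised count and probability for a species seen once (e.g. trout)
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # 2 * (1/3) = 0.67
p_trout = c_star_trout / N               # 0.67/18 = 0.037

print(p_unseen_total, p_catfish, c_star_trout, p_trout)
```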

    Veton Kpuska

  • Bi-gram Examples: the Berkeley Restaurant corpus of 9332 sentences, and an Associated Press (AP) newswire corpus.

              AP Newswire                    Berkeley Restaurant
    c(MLE)    N_c              c*(GT)        N_c          c*(GT)
    0         74,671,100,000   0.0000270     2,081,496    0.002553
    1         2,018,046        0.446         5315         0.533960
    2         449,721          1.26          1419         1.357294
    3         188,933          2.24          642          2.373832
    4         105,668          3.24          381          4.081365
    5         68,379           4.22          311          3.781350
    6         48,190           5.19          196          4.500000

    Veton Kpuska

  • Advanced Issues in Good-Turing Estimation. Assumptions of Good-Turing estimation: the distribution of each bigram is binomial, and the number N_0 of bigrams that have never been seen is known. This number is known if the vocabulary size V is known, since the total number of possible bigrams is V^2. The raw N_c values cannot be used directly, since the re-estimate c* for count c depends on N_{c+1}, and the re-estimation expression is undefined when N_{c+1} = 0. In the fish species example N_4 = 0, so how can one re-estimate the count for species seen 3 times? The Simple Good-Turing algorithm addresses this.

    Veton Kpuska

  • Simple Good-Turing: Compute N_c for all c, smooth the N_c counts to replace any zeros in the sequence, then compute the adjusted counts c* from the smoothed values.

    Smoothing approach: linear regression, fitting a map from N_c to c in log space:
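
    One common way to write the log-space fit described above (the coefficients a and b are fit by linear regression):

```latex
\log N_c = a + b \,\log c
```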

    In addition, the discounted c* is not used for all counts c. Large counts where c > k for some threshold k (e.g., k=5 in Katz 1987) are assumed to be reliable.

    Veton Kpuska

  • Simple Good-Turing: The correct equation for c* when a threshold k is introduced is (for 1 <= c <= k):
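
    One standard formulation of this corrected estimate, following Katz (1987):

```latex
c^{*} = \frac{(c+1)\dfrac{N_{c+1}}{N_c} \;-\; c\,\dfrac{(k+1)N_{k+1}}{N_1}}
             {1 \;-\; \dfrac{(k+1)N_{k+1}}{N_1}}, \qquad 1 \le c \le k
```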

    With Good-Turing discounting (as well as other algorithms), it is usual to treat N-grams with low count (especially counts of 1) as if the count were 0.

    Finally Good-Turing discounting (or any other algorithm) is not used directly on N-grams; it is used in combination with the backoff and interpolation algorithms that are described next.

    Veton Kpuska

  • Interpolation

    Veton Kpuska

  • Interpolation: Discounting algorithms can help solve the problem of zero-frequency N-grams, but they do not use some additional knowledge that is available: if we are trying to compute P(w_n|w_{n-2}w_{n-1}) but have no examples of the particular trigram w_{n-2}w_{n-1}w_n, we can estimate the trigram probability from the bigram probability P(w_n|w_{n-1}); and if there are no counts for computing the bigram probability P(w_n|w_{n-1}), we can use the unigram probability P(w_n).

    There are two ways to rely on this N-gram hierarchy: backoff and interpolation.

    Veton Kpuska

  • Backoff vs. Interpolation. Backoff: relies solely on the higher-order (e.g., trigram) counts; only when there is zero evidence for a trigram do we back off to a lower-order N-gram. Interpolation: probability estimates are always mixed from all the N-gram estimators, i.e., a weighted interpolation of trigram, bigram, and unigram counts. The simplest version is linear interpolation.

    Veton Kpuska

  • Linear Interpolation

    In simple linear interpolation the trigram, bigram, and unigram estimates are mixed with fixed weights; a slightly more sophisticated version of linear interpolation uses context-dependent weights:
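
    Reconstructions of both forms: simple linear interpolation with fixed weights summing to one, and the version with weights conditioned on the context:

```latex
\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1})
                                  + \lambda_2 P(w_n \mid w_{n-1})
                                  + \lambda_3 P(w_n),
\qquad \sum_i \lambda_i = 1

\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2}^{n-1})\, P(w_n \mid w_{n-2} w_{n-1})
                                  + \lambda_2(w_{n-2}^{n-1})\, P(w_n \mid w_{n-1})
                                  + \lambda_3(w_{n-2}^{n-1})\, P(w_n)
```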

    Veton Kpuska

  • Computing the Interpolation Weights λ: The weights are set from a held-out corpus. A held-out corpus is additional training data that is NOT used to set the N-gram counts, but to set other parameters, like the λ values in this case. We choose the λ values that maximize the probability of the held-out data under the interpolated model, for example with the EM algorithm (an iterative algorithm discussed in later chapters).

    Veton Kpuska

  • Backoff: Interpolation is simple to understand and implement, but there are better algorithms, such as backoff N-gram modeling. The version due to Katz, which uses Good-Turing discounting, is known as Katz backoff.
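
    A reconstruction of the Katz backoff recursion in the standard textbook notation (P* is the discounted probability and α the backoff weight, both defined in later slides):

```latex
P_{\mathrm{katz}}(w_n \mid w_{n-N+1}^{n-1}) =
\begin{cases}
P^{*}(w_n \mid w_{n-N+1}^{n-1}), & \text{if } C(w_{n-N+1}^{n}) > 0 \\[4pt]
\alpha(w_{n-N+1}^{n-1})\, P_{\mathrm{katz}}(w_n \mid w_{n-N+2}^{n-1}), & \text{otherwise}
\end{cases}
```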

    The equation above describes a recursive procedure. The computation of P*, the normalizing factor α, and other details are discussed in the next sections.

    Veton Kpuska

  • Trigram Discounting with Interpolation: For clarity, the words w_{i-2}, w_{i-1}, w_i are referred to as a sequence x, y, z. The Katz method incorporates discounting as an integral part of the algorithm.

    Veton Kpuska

  • Katz Backoff: The Good-Turing method assigns probability to unseen events on the assumption that they are all equally probable. Katz backoff gives us a better way to distribute the probability mass among unseen trigram events, by relying on information from unigrams and bigrams. We use discounting to tell us how much total probability mass to set aside for all the events we haven't seen, and backoff to tell us how to distribute this probability. The discounted probability P*(.) is needed, rather than the MLE P(.), in order to account for the missing probability mass. The α weights are necessary to ensure that when backoff occurs the resulting probabilities are true probabilities that sum to 1.

    Veton Kpuska

  • Discounted Probability Computation: P* is defined as the discounted (c*) estimate of the conditional probability of an N-gram: P*(w_n | w_{n-N+1}^{n-1}) = c*(w_{n-N+1}^{n}) / c(w_{n-N+1}^{n-1}).

    Because on average the discounted count c* will be less than c, this probability P* will be slightly less than the MLE estimate.

    Veton Kpuska

  • Discounted Probability Computation: The computation on the previous slide leaves some probability mass for the lower-order N-grams, which is then distributed by the α weights (described in the next section). The table in the next slide shows the resulting Katz backoff bigram probabilities for the previous 8 sample words, computed from the BeRP corpus using the SRILM toolkit.

    Veton Kpuska

  • Smoothed Bigram Probabilities computed with the SRILM toolkit:

             i         want      to        eat       chinese   food      lunch     spend
    i        0.0014    0.326     0.00248   0.00355   0.000205  0.0017    0.00073   0.000489
    want     0.00134   0.00152   0.656     0.000483  0.00455   0.00455   0.00073   0.000483
    to       0.000512  0.00152   0.00165   0.284     0.000512  0.0017    0.00175   0.0873
    eat      0.00101   0.00152   0.00166   0.00189   0.0214    0.00166   0.0563    0.000585
    chinese  0.0283    0.00152   0.00248   0.00189   0.000205  0.519     0.00283   0.000585
    food     0.0137    0.00152   0.0137    0.00189   0.000409  0.00366   0.00073   0.000585
    lunch    0.00363   0.00152   0.00248   0.00189   0.000205  0.00131   0.00073   0.000585
    spend    0.00161   0.00152   0.00161   0.00189   0.000205  0.0017    0.00073   0.000585

    Veton Kpuska

  • Advanced Details of Computing Katz Backoff α and P*: The remaining details of the computation of α and P* are presented in this section. β is the total amount of left-over probability mass for a given (N-1)-gram context. For a given (N-1)-gram context, the total left-over probability mass can be computed by subtracting from 1 the total discounted probability mass of all N-grams starting with that context.

    This gives us the total probability mass that we are ready to distribute to the (N-1)-gram estimates (e.g., bigrams if our original model was a trigram).

    Veton Kpuska

  • Advanced Details of Computing Katz Backoff α and P* (cont.): Each individual (N-1)-gram (bigram) will only get a fraction of this mass, so we need to normalize β by the total probability of all the (N-1)-grams (bigrams) that begin some N-gram (trigram) that has zero count. The final equation for computing how much probability mass to distribute from an N-gram to an (N-1)-gram is given by the function α:
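
    One standard way to write β and α for Katz backoff, reconstructed in the textbook notation:

```latex
\beta(w_{n-N+1}^{n-1}) = 1 - \sum_{w_n:\, C(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+1}^{n-1})

\alpha(w_{n-N+1}^{n-1}) = \frac{\beta(w_{n-N+1}^{n-1})}
      {\sum_{w_n:\, C(w_{n-N+1}^{n}) = 0} P_{\mathrm{katz}}(w_n \mid w_{n-N+2}^{n-1})}
```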

    Veton Kpuska

  • Advanced Details of Computing Katz Backoff α and P* (cont.): Note that α is a function of the preceding word string, that is, of the (N-1)-gram context; thus the amount by which we discount each trigram (d), and the mass that gets reassigned to lower-order N-grams (α), are recomputed for every (N-1)-gram that occurs in any N-gram. We only need to specify what to do when the count of an (N-1)-gram context is 0, in which case we back off directly to the lower-order estimate, and the definition is complete.

    Veton Kpuska

  • Practical Issues: Toolkits and Data Formats

    Veton Kpuska

  • Practical Issues: Toolkits and Data Formats. How are N-gram language models represented? Language model probabilities are represented and computed in log format to avoid underflow and to speed up computation. Probabilities are by definition at most 1, so multiplying enough N-gram probabilities together would result in numerical underflow; using log probabilities instead of raw probabilities keeps the numbers from getting too small. Adding in log space is equivalent to multiplying in linear space, so log probabilities are combined by addition, and addition is faster than multiplication on most general-purpose computers. If true probabilities are needed, they are recovered by exponentiation: p1 x p2 x p3 x p4 = exp(log p1 + log p2 + log p3 + log p4).

    Veton Kpuska

  • Practical Issues: Toolkits and Data Formats. Backoff N-gram language models are generally stored in ARPA format. An N-gram model in ARPA format is an ASCII file with a small header followed by a list of all the non-zero N-gram probabilities: all the unigrams, followed by bigrams, followed by trigrams, and so on. Each N-gram entry is stored with its discounted log probability (in log10 format) and its backoff weight α. Backoff weights are only necessary for N-grams that form a prefix of a longer N-gram, so no α values are stored for the highest-order N-grams (e.g., trigrams in a trigram model) or for N-grams ending in the end-of-sequence token </s>.

    Veton Kpuska

  • Practical Issues: Toolkits and Data Formats. In a trigram grammar, each N-gram entry is a line containing the log10 discounted probability, the words of the N-gram themselves, and (when needed) the log10 backoff weight α for that N-gram.

    Veton Kpuska

  • Example of an ARPA-formatted LM file from the BeRP corpus (Jurafsky & Martin).

    Veton Kpuska

  • Probability Computation: Given a sequence x, y, z, the trigram probability P(z|x,y) is computed from the model as follows:
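
    A minimal sketch (illustrative data structures and numbers, not an actual ARPA file parser) of the standard backoff lookup: use the stored trigram if it exists, otherwise back off to the bigram with the stored backoff weight, and finally to the unigram:

```python
# log10 probabilities and log10 backoff weights, keyed by word tuples (illustrative values)
logp = {("x", "y", "z"): -0.5, ("y", "z"): -1.2, ("z",): -2.3}
backoff = {("x", "y"): -0.3, ("y",): -0.1}

def log_p_trigram(x, y, z):
    """P(z|x,y): trigram if stored, else alpha(x,y)*P(z|y), else alpha(x,y)*alpha(y)*P(z)."""
    if (x, y, z) in logp:
        return logp[(x, y, z)]
    bow_xy = backoff.get((x, y), 0.0)   # log10 alpha(x,y); 0.0 means a weight of 1
    if (y, z) in logp:
        return bow_xy + logp[(y, z)]
    return bow_xy + backoff.get((y,), 0.0) + logp[(z,)]

print(10 ** log_p_trigram("x", "y", "z"))   # convert back from log10 if needed
```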

    Veton Kpuska

  • Toolkits (Publicly Available): SRILM (Stolcke, 2002)
    http://citeseer.ist.psu.edu/621361.html
    http://www.speech.sri.com/projects/srilm/

    Cambridge-CMU toolkit (Clarkson & Rosenfeld, 1997)
    http://www.cs.cmu.edu/~archan/sphinxInfo.html#cmulmtk
    http://www.cs.cmu.edu/~archan/s_info/CMULMTK/toolkit_documentation.html
    http://www.speech.cs.cmu.edu/SLM_info.html

    Veton Kpuska

  • Advanced Issues in Language Modeling. Advanced Smoothing Methods: Kneser-Ney Smoothing

    Veton Kpuska

  • Advanced Smoothing Methods: Kneser-Ney Smoothing. A brief introduction to the most commonly used modern N-gram smoothing method, the Interpolated Kneser-Ney algorithm. The algorithm is based on the absolute discounting method of computing a revised count c*, an alternative to the Good-Turing discount formula. We revisit the Good-Turing estimates of the bigram counts from the earlier "Bi-gram Examples" slide:

    c (MLE)     0           1      2     3     4     5     6     7     8     9
    c* (GT)     0.0000270   0.446  1.26  2.24  3.24  4.22  5.19  6.21  7.24  8.25
    D = c - c*  -0.0000270  0.554  0.74  0.76  0.76  0.78  0.81  0.79  0.76  0.75

    Veton Kpuska

  • Advanced Smoothing Methods: Kneser-Ney Smoothing. The re-estimated counts c* for counts greater than 1 can be estimated pretty well by just subtracting 0.75 from the MLE count c. The absolute discounting method formalizes this intuition by subtracting a fixed (absolute) discount d from each count. The rationale is that we already have good estimates for the high counts, and a small discount d won't affect them much; only the smaller counts, whose estimates we do not necessarily trust anyhow, are much affected. The equation for absolute discounting applied to bigrams (assuming a proper coefficient α on the backoff to make everything sum to one) is:
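
    One standard way to write absolute discounting for bigrams, reconstructed from the description above (D is the fixed discount):

```latex
P_{\mathrm{abs}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})}, & \text{if } C(w_{i-1} w_i) > 0 \\[6pt]
\alpha(w_{i-1})\, P(w_i), & \text{otherwise}
\end{cases}
```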

    Veton Kpuska

  • Advanced Smoothing Methods: Kneser-Ney Smoothing. In practice, distinct discount values d are computed for the 0 and 1 counts. Kneser-Ney discounting augments absolute discounting with a more sophisticated way of handling the backoff distribution. Consider the job of predicting the next word in the sentence "I can't see without my reading XXXXXX", assuming we are backing off to a unigram model. The word "glasses" seems much more likely to follow than the word "Francisco". But "Francisco" is in fact more common, and thus a unigram model will prefer it to "glasses". We would like to capture the fact that although "Francisco" is frequent, it is frequent only after the word "San", whereas "glasses" has a much wider distribution.

    Veton Kpuska

  • Advanced Smoothing Methods: Kneser-Ney Smoothing. Thus the idea is that instead of backing off to the unigram MLE count (the number of times the word w has been seen), we want to use a completely different backoff distribution! We want a heuristic that more accurately estimates the number of times we might expect to see word w in a new unseen context. The Kneser-Ney intuition is to base our estimate on the number of different contexts word w has appeared in. Words that have appeared in more contexts are more likely to appear in some new context as well. The new backoff probability can be expressed as the continuation probability presented in the following expression:

    Veton Kpuska

  • Advanced Smoothing Methods: Kneser-Ney Smoothing. Continuation probability:

    Kneser-Ney backoff is formalized as follows assuming proper coefficient a on the backoff to make everything sum to one:
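
    One standard way to write the continuation probability and the Kneser-Ney backoff, reconstructed from the description above:

```latex
P_{\mathrm{continuation}}(w_i) =
  \frac{\left| \{ w_{i-1} : C(w_{i-1} w_i) > 0 \} \right|}
       {\left| \{ (w_{j-1}, w_j) : C(w_{j-1} w_j) > 0 \} \right|}

P_{\mathrm{KN}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})}, & \text{if } C(w_{i-1} w_i) > 0 \\[6pt]
\alpha(w_{i-1})\, P_{\mathrm{continuation}}(w_i), & \text{otherwise}
\end{cases}
```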

    Veton Kpuska

  • Interpolated vs. Backoff Form of Kneser-Ney: The Kneser-Ney backoff algorithm has been shown to be inferior to its interpolated version. Interpolated Kneser-Ney discounting can be computed with an equation like the following (omitting the computation of β):
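
    One common way to write the interpolated form described above (with β left unspecified, as in the slide):

```latex
P_{\mathrm{KN}}(w_i \mid w_{i-1}) =
  \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} + \beta(w_{i-1})\, P_{\mathrm{continuation}}(w_i)
```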

    Practical note: it turns out that any interpolation model can be represented as a backoff model, and hence stored in ARPA backoff format. The interpolation is done when the model is built, so the "bigram" probability stored in the backoff format is really a bigram probability already interpolated with the unigram.

    Veton Kpuska

  • Class-Based N-Grams (also called Cluster N-Grams)

    Veton Kpuska

  • Class-Based N-Grams: A class-based (or cluster) N-gram is a variant of the N-gram that uses information about word classes or clusters. It is useful for dealing with sparsity in the training data. Example: suppose for a flight reservation system we want to compute the probability of the bigram "to Shanghai", but this bigram never occurs in the training set. Assume our training data has "to London", "to Beijing", and "to Denver". If we knew that these were all cities, and assuming "Shanghai" does appear in the training set in other contexts, we could predict the likelihood of a city following "to".

  • Class-based N-grams
    There are many variants of cluster N-grams. IBM clustering is a hard clustering: each word can belong to only one class. The model estimates the conditional probability of a word wi by multiplying two factors: the probability of the word's class ci given the preceding classes (based on an N-gram of classes), and the probability of wi given ci.
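    For bigrams this two-factor decomposition is commonly written as:

      P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)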

  • Class-based N-grams
    Assuming a training corpus in which we have a class label for each word, the MLE of the probability of the word given the class and of the class given the previous class can be computed as follows:
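    In the usual notation:

      P(w \mid c) = \dfrac{C(w)}{C(c)}, \qquad
      P(c_i \mid c_{i-1}) = \dfrac{C(c_{i-1} c_i)}{\sum_{c} C(c_{i-1} c)}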

  • Class-based N-grams
    Cluster N-grams are generally used in two ways: (1) with hand-designed, domain-specific word classes (in an airline information system we might use classes like CITYNAME, AIRLINE, DAYOFWEEK, MONTH, etc.), or (2) by automatically inducing the classes through clustering words in a corpus. Syntactic categories like part-of-speech tags don't seem to work well as classes. Whether automatically induced or hand-designed, cluster N-grams are generally mixed with regular word-based N-grams.

  • Language Model Adaptation and Using the WWW
    One of the more recent developments in language modeling is language model adaptation. It is relevant when we have only a small amount of in-domain training data but a large amount of data from some other domain: train on the larger out-of-domain dataset, and adapt the models to the small in-domain set. An obvious large data source for this kind of adaptation is the web. The simplest way to use the web to improve, say, a trigram language model is to use search engines to get counts for w1w2 and w1w2w3, and then compute:
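    With c_web denoting the count returned by the search engine (notation assumed here for illustration):

      \hat{P}(w_3 \mid w_1 w_2) = \dfrac{c_{\text{web}}(w_1 w_2 w_3)}{c_{\text{web}}(w_1 w_2)}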

  • Language Model Adaptation and Using the WWW
    One can mix these web-derived estimates with a conventional N-gram. More sophisticated methods combine techniques that exploit topic or class dependence to find domain-relevant data on the web. Problems: in practice it is impossible to download every page from the web in order to compute N-grams, so only the page counts returned by search engines are used. Page counts are only approximations to actual counts, for many reasons: a page may contain an N-gram multiple times, most search engines round off their counts, punctuation is deleted, and counts may be adjusted due to link and other information.

  • Language Model Adaptation and Using the WWW
    Despite the noise introduced by these inaccuracies, the resulting estimates are not hugely affected. It is also possible to perform specific adjustments, such as fitting a regression to predict actual word counts from page counts.

  • Using Longer-Distance Information: A Brief Summary
    There are methods for incorporating longer-distance context into N-gram modeling. While we have limited our discussion mainly to bigrams and trigrams, state-of-the-art speech recognition systems, for example, are based on longer-distance N-grams, especially 4-grams but also 5-grams.

    Goodman (2006) showed that with 284 million words of training data, 5-grams do improve perplexity scores over 4-grams, but not by much. Goodman checked contexts up to 20-grams and found that after 6-grams, longer contexts weren't useful, at least not with 284 million words of training data.

  • More Sophisticated Models
    People tend to repeat words they have used before; thus if a word is used once in a text, it will probably be used again. We can capture this fact with a cache language model (Kuhn and DeMori, 1990). To use a unigram cache model to predict word i of a test corpus, we create a unigram grammar from the preceding part of the test corpus (words 1 to i-1) and mix this with our conventional N-gram; we might use only a shorter window of the previous words rather than the entire set (a sketch of this interpolation appears below). Cache language models are very powerful in applications where we have perfect knowledge of the words. They work less well in domains where the previous words are not known exactly: in speech applications, for example, unless there is some way for users to correct errors, cache models tend to lock in to errors they made on earlier words.
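    A minimal sketch of such a cache interpolation in Python; the function names, the window size, and the mixing weight lam are illustrative assumptions, and background_prob stands in for a hypothetical conventional N-gram model (this is not code from any particular toolkit):

      from collections import Counter

      def cache_interp_prob(word, history, background_prob, lam=0.1, window=500):
          # Unigram cache built from the most recent `window` words of the history,
          # linearly interpolated with a background N-gram probability.
          recent = history[-window:]
          cache = Counter(recent)
          p_cache = cache[word] / len(recent) if recent else 0.0
          return lam * p_cache + (1.0 - lam) * background_prob(word, history)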

  • More Sophisticated Models
    Repetition of words in a text is a symptom of a more general fact about words: texts tend to be about things. Documents that are about particular topics tend to use similar words. This suggests that we could train separate language models for different topics. Topic-based language models take advantage of the fact that different topics will have different kinds of words: train a different language model for each topic t, and then mix them, weighted by how likely each topic is given the history h:
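    In the standard notation, with t ranging over topics:

      P(w \mid h) = \sum_{t} P(w \mid t) \, P(t \mid h)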

  • More Sophisticated Models
    Latent semantic indexing is based on the intuition that upcoming words are semantically similar to preceding words in the text. Word similarity is computed from a measure of semantic word association such as latent semantic indexing, or from dictionaries or thesauri, and is then mixed with a conventional N-gram.

  • More Sophisticated Models
    Trigger-based N-grams: the predictor word, called a trigger, is not adjacent to the word being predicted but is closely related to it, i.e. has high mutual information with the word we are trying to predict. Skip N-grams: the preceding context skips over some intermediate words, e.g. P(wi | wi-1, wi-3). Variable-length N-grams: the preceding context is extended where a longer phrase is particularly frequent.

    Using very large and rich contexts can result in very large language models. These models are often pruned, removing low-probability events. There is also a large body of research on integrating sophisticated linguistic structures into language modeling, as described in the following chapters of the textbook.

  • Advanced Topic: Information Theory Background

  • Information Theory Background
    In the previous section, perplexity was introduced as a way to evaluate N-gram models on a test set: a better N-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the probability of the test set.

    Another way to think about perplexity is through the information-theoretic concept of cross-entropy. This section introduces the fundamentals of information theory, including cross-entropy. Reference: Cover and Thomas, Elements of Information Theory, Wiley-Interscience, 1991.

  • Entropy
    Entropy is a measure of information content. Computing entropy requires establishing a random variable X that ranges over whatever we are predicting (words, letters, parts of speech) drawn from some set χ, together with a probability function p(x). The entropy of the random variable X is then defined as:
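    The standard definition, summing over the set χ of possible values:

      H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)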

    The log can in principle be computed in any base. However, if base 2 is used the resulting value is measured in bits.

  • Entropy
    The most intuitive way for a computer scientist to define entropy is to think of it as a lower bound on the number of bits it would take to encode a certain decision or piece of information in the optimal coding scheme.

    Cover and Thomas provide the following example: imagine that we want to place a bet on a horse race but it is too far to go all the way to Yonkers Racetrack, so we'd like to send a short message to the bookie to tell him which horse to bet on. Suppose there are eight horses in this particular race. One way to encode the message is to use the binary representation of the horse's number as the code: horse 1 would be 001, horse 2 would be 010, horse 3 would be 011, and so on, with horse 8 coded as 000. If we spend the whole day betting, and each horse is coded with 3 bits, on average we would be sending 3 bits per race. Can we do better?

  • Entropy
    Suppose that the spread is the actual distribution of the bets placed, and that we represent it as the prior probability of each horse as follows:

      Horse    1      2      3      4      5      6      7      8
      Prior    1/2    1/4    1/8    1/16   1/64   1/64   1/64   1/64
             = 32/64  16/64  8/64   4/64   1/64   1/64   1/64   1/64

    The entropy of the random variable X that ranges over horses gives us a lower bound on the number of bits, and it is:
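    Writing out the sum gives the standard Cover-and-Thomas result:

      H(X) = -\left[ \tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4}
             + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{16}\log_2\tfrac{1}{16}
             + 4 \cdot \tfrac{1}{64}\log_2\tfrac{1}{64} \right] = 2 \text{ bits}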

  • Entropy
    A variable-length encoding achieves this bound: 0 for the most likely horse, 10 for the next most likely, 110 and 1110 for the next two, and 111100, 111101, 111110, and 111111 for the last four equally likely horses. For comparison, the entropy of the equal-length binary code, which applies when the horses are equally likely, is:
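    Two quick checks in standard notation: the expected length of the variable-length code above matches the 2-bit entropy, while uniform priors give 3 bits:

      E[\ell] = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{16}(4) + \tfrac{4}{64}(6) = 2 \text{ bits}

      H(X) = -\sum_{i=1}^{8} \tfrac{1}{8}\log_2\tfrac{1}{8} = 3 \text{ bits}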

  • Entropy
    Practical applications in language processing involve sequences; for a grammar, one computes the entropy of some sequence of words W = {w0, w1, w2, ..., wn}. One way to compute the entropy of a sequence is to define a random variable that ranges over all finite sequences of words of length n in some language L, as follows:
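    In the usual notation, summing over all word sequences W_1^n of length n in L:

      H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)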

    The entropy rate (per-word entropy) can then be defined as the entropy of the sequence divided by the number of words:
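    That is, in the standard form:

      \dfrac{1}{n} H(W_1^n) = -\dfrac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)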

  • Entropy
    To compute the true entropy of a language, one needs to consider sequences of infinite length. If we view a language L as a stochastic process that produces a sequence of words, its entropy rate H(L) is defined as:
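    In standard notation:

      H(L) = \lim_{n \to \infty} \dfrac{1}{n} H(w_1, w_2, \ldots, w_n)
           = -\lim_{n \to \infty} \dfrac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log_2 p(w_1, \ldots, w_n)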

  • Entropy
    By the Shannon-McMillan-Breiman theorem, for a language that is regular in certain ways (specifically, if it is both stationary and ergodic), the following expression can be used:
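    In its usual form:

      H(L) = \lim_{n \to \infty} -\dfrac{1}{n} \log_2 p(w_1 w_2 \ldots w_n)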

    That is, we can take a single sequence that is long enough instead of summing over all possible sequences. The rationale of the Shannon-McMillan-Breiman theorem is that a long enough sequence of words will contain in it many other shorter sequences, and that each of these shorter sequences will reoccur in the longer sequence according to their probabilities.

  • Entropy
    A stochastic process is said to be stationary if the probabilities it assigns to a sequence are invariant with respect to shifts in the time index: the probability distribution for words at time t is the same as the probability distribution at time t+1. Markov models, and hence N-grams, are stationary; in a bigram, Pi depends only on Pi-1, so if we shift the time index by x, Pi+x still depends only on Pi+x-1. Natural language, however, is not stationary: as we will see later (Ch. 12 of the book), the probability of upcoming words can depend on events that were arbitrarily distant in time. Consequently, our statistical models give only an approximation to the correct distributions and entropies of natural language.

  • Entropy
    To summarize: by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability. In the next sections we talk about the why and how: why we would want to do this (i.e., for what kinds of problems the entropy tells us something useful), and how to compute the probability of a very long sequence.

  • Cross Entropy & Comparing Models
    Cross-entropy is useful when we do not know the actual probability distribution p that generated some data. It uses a model m, an approximation of p, as follows:
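    The standard definition:

      H(p, m) = \lim_{n \to \infty} -\dfrac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log_2 m(w_1, \ldots, w_n)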

    For a stationary ergodic process, this expression becomes:
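    That is, in the standard form:

      H(p, m) = \lim_{n \to \infty} -\dfrac{1}{n} \log_2 m(w_1 w_2 \ldots w_n)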

    The cross entropy H(p,m) is useful because it gives us an upper bound on the entropy H(p). For any model m:

    H(p) ≤ H(p, m)

  • Cross Entropy & Comparing Models
    This means that we can use some simplified model m to help estimate the true entropy of a sequence of symbols drawn according to probability p. The more accurate m is, the closer the cross-entropy H(p,m) will be to the true entropy H(p). Thus the difference between H(p,m) and H(p) is a measure of how accurate a model is. Between two models m1 and m2, the more accurate model will be the one with the lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so a model cannot err by underestimating the true entropy.)

  • Perplexity and Cross-Entropy
    Cross-entropy is defined in the limit, as the length of the observed word sequence goes to infinity. We therefore need an approximation to the cross-entropy, relying on a (sufficiently long) sequence of fixed length. This approximation to the cross-entropy of a model M = P(wi | wi-N+1 ... wi-1) on a sequence of words W is:
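    In the usual form:

      H(W) = -\dfrac{1}{N} \log_2 P(w_1 w_2 \ldots w_N)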

    The perplexity of a model P on a sequence of words W is defined as the exponential of this cross-entropy, presented next:

  • Perplexity and Cross-Entropy
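    In the standard form:

      \text{Perplexity}(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}

    A minimal Python sketch of the same computation, where total_log2_prob is a hypothetical function returning log2 P(w_1 ... w_N) under some model (illustrative only):

      def perplexity(total_log2_prob, words):
          # Per-word perplexity: 2 raised to the per-word cross-entropy of the sequence.
          return 2.0 ** (-total_log2_prob(words) / len(words))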

  • Advanced Topic: The Entropy of English and Entropy Rate Constancy

  • The Entropy of English and Entropy Rate Constancy
    As we suggested in the previous section, the cross-entropy of some model m can be used as an upper bound on the true entropy of some process. We can use this method to get an estimate of the true entropy of English. Why should we care about the entropy of English?

    Knowing the entropy of English would give us a solid lower bound for all of our future experiments on probabilistic grammars. We can also use the entropy of English to help understand what parts of a language provide the most information: is the predictability of English mainly based on word order, semantics, morphology, constituency, or pragmatic cues? Answering this question can help us immensely in knowing where to focus our language-modeling efforts.

  • The Entropy of English and Entropy Rate Constancy
    There are two common methods for computing the entropy of English. The first uses human subjects in a psychological experiment that requires them to guess strings of letters; by looking at how many guesses it takes them to guess letters correctly, we can estimate the probability of the letters and hence the entropy of the sequence. This method was used by Shannon in 1951 in his groundbreaking work defining the field of information theory. The second takes a very good stochastic model, trained on a very large corpus, and uses it to assign a log-probability to a very long sequence of English, applying the Shannon-McMillan-Breiman theorem introduced earlier and repeated here for clarity:
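    That is, the same limit as before, now with the trained model m supplying the probability:

      H(L) = \lim_{n \to \infty} -\dfrac{1}{n} \log_2 m(w_1 w_2 \ldots w_n)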

  • Shannon Experiment
    The actual experiment is designed as follows: we present a subject with some English text and ask the subject to guess the next letter. The subjects use their knowledge of the language to guess the most probable letter first, the next most probable letter second, and so on. We record the number of guesses it takes for the subject to guess correctly.

    Shannon's insight was that the entropy of the number-of-guesses sequence is the same as the entropy of English. (The intuition is that, given the number-of-guesses sequence, we could reconstruct the original text by choosing the nth most probable letter whenever the subject took n guesses.) This methodology requires letter guesses rather than word guesses (since the subject sometimes has to do an exhaustive search of all the possible letters!), so Shannon computed the per-letter entropy of English rather than the per-word entropy. He reported an entropy of 1.3 bits for a 27-character alphabet (26 letters plus space). Shannon's estimate is likely to be too low, since it is based on a single text (Jefferson the Virginian by Dumas Malone); Shannon notes that his subjects had worse guesses (hence higher entropies) on other texts (newspaper writing, scientific work, and poetry).

    More recently variations on the Shannon experiments include the use of a gambling paradigm where the subjects get to bet on the next letter (Cover and King, 1978; Cover and Thomas, 1991).

  • Computer-Based Computation of the Entropy of English
    The second method for computing the entropy of English avoids the single-text problem that confounds Shannon's results. For example, Brown et al. (1992a) trained a trigram language model on 583 million words of English (293,181 different types) and used it to compute the probability of the entire Brown corpus (1,014,312 tokens). The training data included newspapers, encyclopedias, novels, office correspondence, proceedings of the Canadian parliament, and other miscellaneous sources. They then computed the character entropy of the Brown corpus by using their word-trigram grammar to assign probabilities to the corpus, considered as a sequence of individual letters. They obtained an entropy of 1.75 bits per character (where the set of characters included all 95 printable ASCII characters).

  • Computer-Based Computation of the Entropy of English
    The average length of English written words (including space) has been reported as 5.5 letters (Nadas, 1984). If this is correct, the Shannon estimate of 1.3 bits per letter corresponds to a per-word perplexity of 142 for general English (see the calculation below). The numbers we reported earlier for the WSJ experiments are significantly lower than this, since the training and test sets came from the same sub-sample of English; that is, those experiments underestimate the complexity of English (the Wall Street Journal looks very little like Shakespeare, for example).
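    The arithmetic behind the 142 figure:

      2^{1.3 \times 5.5} = 2^{7.15} \approx 142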

  • Constant Information/Entropy Rate of Speech
    A number of scholars have independently made the intriguing suggestion that entropy rate plays a role in human communication in general (Lindblom, 1990; Van Son et al., 1998; Aylett, 1999; Genzel and Charniak, 2002; Van Son and Pols, 2003). The idea is that people speak so as to keep the rate of information being transmitted per second roughly constant, i.e. transmitting a constant number of bits per second, or maintaining a constant entropy rate. Since the most efficient way of transmitting information through a channel is at a constant rate, language may even have evolved for such communicative efficiency (Plotkin and Nowak, 2000). There is a wide variety of evidence for the constant entropy rate hypothesis. One class of evidence, for speech, shows that speakers shorten predictable words (i.e. they take less time to say predictable words) and lengthen unpredictable words (Aylett, 1999; Jurafsky et al., 2001; Aylett and Turk, 2004).

  • Constant Information/Entropy Rate of Speech
    In another line of research, Genzel and Charniak (2002, 2003) show that entropy rate constancy makes predictions about the entropy of individual sentences in a text. In particular, it predicts that local measures of sentence entropy that ignore previous discourse context (for example, the N-gram probability of a sentence) should increase with the sentence number, and they document this increase in corpora. Keller (2004) provides evidence that entropy rate plays a role for the addressee as well, showing a correlation between the entropy of a sentence and the processing effort it causes in comprehension, as measured by reading times in eye-tracking data.

  • END
