
Part 5: Language Model

CSE717, SPRING 2008

CUBS, Univ at Buffalo

Examples of Good & Bad Language Models

Excerpt from Herman, comic strips by Jim Unger


What’s a Language Model

A language model is a probability distribution over word sequences:

P("And nothing but the truth") ≈ 0.001

P("And nuts sing on the roof") ≈ 0

What’s a language model for?

Speech recognition
Handwriting recognition
Spelling correction
Optical character recognition
Machine translation

(and anyone doing statistical modeling)

The Equation

argmax_{word sequence} P(word sequence | observations)

  = argmax_{word sequence} P(observations | word sequence) · P(word sequence) / P(observations)

  = argmax_{word sequence} P(observations | word sequence) · P(word sequence)

The observation can be image features (handwriting recognition), acoustics (speech recognition), word sequence in another language (MT), etc.
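As a rough sketch of this decision rule (not from the original slides), the search can be written as a loop over candidate word sequences; acoustic_model and language_model are assumed, hypothetical callables returning log-probabilities:

```python
import math

def decode(observations, candidates, acoustic_model, language_model):
    """Pick the word sequence W maximizing P(observations | W) * P(W).

    acoustic_model(observations, words) and language_model(words) are
    hypothetical callables returning log-probabilities; P(observations)
    is ignored because it does not depend on the word sequence."""
    best_words, best_score = None, -math.inf
    for words in candidates:
        score = acoustic_model(observations, words) + language_model(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```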

How Language Models work

It is hard to estimate P("And nothing but the truth") directly.

Decompose the probability with the chain rule:
P("and nothing but the truth") = P("and") × P("nothing" | "and") × P("but" | "and nothing") × P("the" | "and nothing but") × P("truth" | "and nothing but the")

The Trigram Approximation

Assume each word depends only on the previous two words

P("the" | "and nothing but") ≈ P("the" | "nothing but")

P("truth" | "and nothing but the") ≈ P("truth" | "but the")
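A minimal sketch of how the trigram approximation makes the chain-rule product computable; trigram_prob(z, x, y) is an assumed callable returning P(z | x y), and the <s> padding token is an illustration, not part of the slides:

```python
import math

def sentence_logprob(words, trigram_prob, bos="<s>"):
    """Log-probability of a word sequence under the trigram approximation:
    P(w_1 ... w_n) is approximated by the product of P(w_i | w_{i-2} w_{i-1})."""
    context = (bos, bos)  # pad the start so every word has a two-word history
    logp = 0.0
    for w in words:
        logp += math.log(trigram_prob(w, *context))  # trigram_prob(z, x, y) = P(z | x y)
        context = (context[1], w)
    return logp
```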

How to find probabilities?

Count from real text

Pr("the" | "nothing but") ≈ c("nothing but the") / c("nothing but")
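A minimal sketch of these count-based estimates in Python (function and variable names are illustrative, not from the slides):

```python
from collections import Counter

def train_trigram_mle(tokens):
    """Maximum-likelihood trigram estimates: Pr(z | x y) = c(x y z) / c(x y)."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(z, x, y):
        if bigram_counts[(x, y)] == 0:
            return 0.0
        return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

    return prob

tokens = "the whole truth and nothing but the truth".split()
p = train_trigram_mle(tokens)
print(p("the", "nothing", "but"))  # c("nothing but the") / c("nothing but") = 1/1 = 1.0
```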

Evaluation

How can you tell a good language model from a bad one?

Run a speech recognizer (or your application of choice), calculate word error rate
Slow
Specific to your recognizer

Perplexity

An example

Data: "the whole truth and nothing but the truth"

Lexicon: L = {the, whole, truth, and, nothing, but}

Model 1: unigram, Pr(L1) = ... = Pr(L6) = 1/6

Model 2: unigram, Pr("the") = Pr("truth") = 1/4, Pr("whole") = Pr("and") = Pr("nothing") = Pr("but") = 1/8

Perplexity of a model on test text w_1, ..., w_T:

PP(w_1, ..., w_T) = [Pr(w_1, ..., w_T)]^(-1/T)

where w_1, ..., w_T is the test text and Pr(w_1, ..., w_T) is the probability that w_1, ..., w_T is generated by the given model.

Model 1: PP = [(1/6)^8]^(-1/8) = 6

Model 2: PP = [(1/4)^4 · (1/8)^4]^(-1/8) ≈ 5.657
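The worked example above can be checked with a few lines of Python (a sketch, using the data and models from the slide):

```python
# Perplexity of the two unigram models on the 8-word test text.
test = "the whole truth and nothing but the truth".split()

model1 = {w: 1 / 6 for w in ["the", "whole", "truth", "and", "nothing", "but"]}
model2 = {"the": 1 / 4, "truth": 1 / 4,
          "whole": 1 / 8, "and": 1 / 8, "nothing": 1 / 8, "but": 1 / 8}

def perplexity(model, words):
    prob = 1.0
    for w in words:
        prob *= model[w]
    return prob ** (-1.0 / len(words))

print(perplexity(model1, test))  # 6.0
print(perplexity(model2, test))  # ~5.657
```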

Perplexity: Is lower better?

Remarkable fact: the "true" model for the data has the lowest possible perplexity.

The lower the perplexity, the closer we are to the true model.

Perplexity correlates well with the error rate of the recognition task:
Correlates better when both models are trained on the same data
Doesn't correlate well when the training data changes

Smoothing

The count-based estimates are terrible on test data: if c(xyz) = 0, the estimated probability is 0.

P("sing" | "nuts") = 0 leads to infinite perplexity!

Pr(z | y) = c(yz) / Σ_w c(yw) = c(yz) / c(y)

Smoothing: Add One

Add-one smoothing:

Pr(z | y) = (c(yz) + 1) / (c(y) + |L|)

Add-delta smoothing:

Pr(z | y) = (c(yz) + δ) / (c(y) + δ·|L|)

where |L| is the size of the lexicon.

Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated.
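A minimal sketch of add-one / add-delta smoothing for bigrams (the function name and the choice of a bigram model are illustrative):

```python
from collections import Counter

def make_add_delta_bigram(tokens, lexicon, delta=1.0):
    """Add-delta bigram estimate: Pr(z | y) = (c(yz) + delta) / (c(y) + delta * |L|).
    delta = 1 gives add-one smoothing."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # how often y occurs as a context
    size_L = len(lexicon)

    def prob(z, y):
        return (bigram_counts[(y, z)] + delta) / (context_counts[y] + delta * size_L)

    return prob

tokens = "the whole truth and nothing but the truth".split()
lexicon = {"the", "whole", "truth", "and", "nothing", "but"}
p = make_add_delta_bigram(tokens, lexicon, delta=1.0)
print(p("whole", "but"))  # unseen bigram "but whole" now gets 1/7 instead of 0
```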

Smoothing: Simple Interpolation

Interpolate Trigram, Bigram, Unigram for best combination

Almost good enough

Pr(z | xy) = λ · c(xyz)/c(xy) + μ · c(yz)/c(y) + (1 - λ - μ) · c(z)/Σ_w c(w)
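A minimal sketch of simple interpolation for trigrams; the weights lam = 0.6 and mu = 0.3 are placeholder values, not from the slides:

```python
from collections import Counter

def make_interpolated_trigram(tokens, lam=0.6, mu=0.3):
    """Pr(z | x y) = lam*c(xyz)/c(xy) + mu*c(yz)/c(y) + (1 - lam - mu)*c(z)/N.
    The weights would normally be tuned on held-out data."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    total = len(tokens)

    def prob(z, x, y):
        p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
        p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
        p1 = uni[z] / total
        return lam * p3 + mu * p2 + (1 - lam - mu) * p1

    return prob

tokens = "the whole truth and nothing but the truth".split()
p = make_interpolated_trigram(tokens)
print(p("whole", "nothing", "but"))  # unseen trigram still gets unigram mass
```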

Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87]

Discounting: subtract a discount d(yz) from the count of each observed event:

Pr(z | y) = (c(yz) - d(yz)) / c(y),   if c(yz) > 0

Discounted probability mass:

β(y) = Σ_{z: c(yz) > 0} d(yz) / c(y)

Redistribution (backing off): if c(y_1 ... y_n z) = 0,

Pr(z | y_1 ... y_n) = k_y · Pr(z | y_2 ... y_n)

i.e., back off to the shorter history y_2 ... y_n (the (n-1)-gram), where k_y is selected so that Σ_z Pr(z | y_1 ... y_n) = 1.

This factor can be determined from the relative frequency of singletons, i.e., events observed exactly once in the data [Ney95].

Linear Discount

Pr(z | y) = (1 - α) · c(yz) / c(y),   if c(yz) > 0

i.e., the discount is proportional to the count: d(yz) = α · c(yz), with 0 < α < 1.

Generalization

α can be made a function of y, determined by cross-validation

Requires more data

Computation is expensive

More General Formulation

d(yz) = α(y) · c(yz),   with α(y) < 1

Drawback of linear discounting: the counts of frequently observed events are modified the most, which goes against the "law of large numbers".

Absolute Discounting

The discount is an absolute value δ:

Pr(z | y) = (c(yz) - δ) / c(y),   if c(yz) > 0

i.e., d(yz) = δ.

Works pretty well, and is easier than linear discounting.
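To make the discounting-and-redistribution idea concrete, here is a minimal sketch (not from the slides) of a bigram model with an absolute discount δ that backs off to the unigram model; δ = 0.5 and all helper names are illustrative:

```python
from collections import Counter, defaultdict

def make_backoff_bigram(tokens, lexicon, delta=0.5):
    """Bigram model with an absolute discount, backing off to the unigram:
      Pr(z | y) = (c(yz) - delta) / c(y)   if c(yz) > 0
                = k(y) * Pr_uni(z)         otherwise,
    where k(y) redistributes the discounted mass so probabilities sum to 1.
    delta = 0.5 is a placeholder; in practice it would be estimated,
    e.g. from singleton counts."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    unigram_counts = Counter(tokens)
    total = len(tokens)
    uni_prob = {w: unigram_counts[w] / total for w in lexicon}

    seen_after = defaultdict(set)  # words observed to follow each context
    for (y, z) in bigram_counts:
        seen_after[y].add(z)

    def prob(z, y):
        if context_counts[y] == 0:
            return uni_prob.get(z, 0.0)  # unseen context: fall back to the unigram
        if bigram_counts[(y, z)] > 0:
            return (bigram_counts[(y, z)] - delta) / context_counts[y]
        # probability mass freed by discounting in context y
        beta = delta * len(seen_after[y]) / context_counts[y]
        # share it among unseen words in proportion to the unigram model
        unseen_mass = sum(uni_prob[w] for w in lexicon if w not in seen_after[y])
        k = beta / unseen_mass if unseen_mass else 0.0
        return k * uni_prob.get(z, 0.0)

    return prob

tokens = "the whole truth and nothing but the truth".split()
lexicon = {"the", "whole", "truth", "and", "nothing", "but"}
p = make_backoff_bigram(tokens, lexicon)
print(sum(p(z, "the") for z in lexicon))  # ~1.0: discounted mass is fully redistributed
```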

References

[1] Katz S., "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987.

[2] Ney H., Essen U., Kneser R., "On the estimation of 'small' probabilities by leaving-one-out," IEEE Trans. on PAMI 17(12):1202-1212, 1995.

[3] Joshua Goodman, "A tutorial on language modeling: The State of the Art in Language Modeling," research.microsoft.com/~joshuago/lm-tutorial-public.ppt