>65536. Arthur Chan, May 4, 2006.
TRANSCRIPT
What is so special about 65536?
• 65536 = 2^16
• Did you know?
– Sphinx III did not support language models with more than 65536 (2^16) words
– The CMU-Cambridge LM Toolkit V2 is not happy about text with more than 65536 unique words either.
• Though a word could have a count of more than 65536 (with -four_byte_counts)
Why was 65536 the limit?
• Both Sphinx III and the CMU-Cambridge LM Toolkit V2 were written in 1995-99
– A time when having 64 MB of RAM was extravagant
• (Now 64 GB seems to be the number.)
• (At that time, a Pentium 166 or 200 was hot.)
– Programmers therefore designed clever structures to deal with memory issues
• In Sphinx, the DMP format was invented (word ID: 16 bits)
• In the CMU-Cambridge LM Toolkit, 16-bit data types were used for word IDs.
This Talk (30-35 pages)
• Describes our effort to break the 16-bit limit in
– Sphinx 3
– CMU-Cambridge LM Toolkit V2
• Half a talk
– Features not fully tested in real life
– But the talk itself is quite long.
• Technical detail of the changes
– Sphinx III – the easy part (9 pages)
– CMU-Cambridge LM Toolkit – the tough part (10 pages)
• The Root of the Evil (11 pages)
– Why does this problem exist? Why does it persist?
– What if similar problems appear? How do we solve and avoid them?
Disclaimer about the Speaker
• Notorious for being negative about language modeling techniques
– Symptom 1: Yells at others when his LM code has bugs.
– Symptom 2: Yells at others, period.
• He should be forgiven because
– His Master's thesis supervisor taught him that when he was young
– Prof. Ronald Rosenfeld taught him the same
– He also read Dr. Joshua Goodman's papers
Terminology
• "Probability" actually means
– The estimate of the probability
• Back-off weight means
– When some n-gram is unseen in the training data
• Back off to the (n-1)-gram "probability" times a weight
• According to Manning, "four-gram" should be "tetragram" and "bigram" should be "digram"
– Well, it's lucky it doesn't matter to us today
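For reference, the back-off scheme the bullet alludes to can be written in the usual Katz-style form (a standard textbook formulation, not copied from the toolkit):

```latex
P(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\hat{P}(w_i \mid w_{i-n+1}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) > 0 \\[4pt]
\mathrm{bow}(w_{i-n+1}^{i-1}) \, P(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}
```

Here $\hat{P}$ is the discounted estimate and $\mathrm{bow}$ is the back-off weight attached to the $(n-1)$-gram history.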
What Sphinx 3.6 RCI supports
• ARPA LM
• DMP LM
– A memory-efficient version of the ARPA LM
– Can also be run in disk mode
• Class-based LM
• Multiple LMs with dynamic LM switching
• lm_convert
– (New in 3.6!) Conversion tool for ARPA and DMP LMs
A note on the DMP format
• A tree-like format
– Bigrams are indexed by their prefix unigram
– Trigrams are indexed by their prefix bigram.
• Bigram and trigram probabilities and back-off weights
– Quantized to 4 decimal places, so you see the following statements in the code:
Funny C statements in the Code
/* HACK!! to quantize probs to 4 decimal digits */
p = p3 * 10000;
p3 = p * 0.0001;
If you delete this, the LM file will be larger because quantization is not done.
Reasons why Sphinx III supports fewer than 65536 words
• 16-bit data structures for
– Bigrams
– Trigrams
– The cache structure
The Bogus Reason Why Sphinx III Doesn't Support More Than 65536 Words
• A very bad misconception
– "The decoding is constrained by the dictionary"
– WRONG
• In both the flat and tree lexicon search, only LM words are traversed.
– RIGHT
• Generally, decoding is constrained by the intersection of the LM words and the dictionary words
Several Proposed Surgical Procedures
• 1. Rewrite the whole LM routine
– Oops! That takes too much time
– The old routine is very memory-efficient
• 2. Replace the old LM by just switching the type of the data structure
– Problem: all the binary LMs we have generated have the old layout.
– We would lose backward compatibility very badly
Final Solution
• lm now supports two data structures: 16-bit and 32-bit
• lm_convert and decode support two types of binary LMs
– DMP, which has a 16-bit layout
– DMP32, which has a 32-bit layout
– A magic version number decides which layout to use
– A regression test could ensure no bad code check-ins
• Which format is in use is hidden from
– Anyone calling the lm routines (with a few exceptions)
Partial Verification of the Code
• The 16-bit and 32-bit code produce exactly the same decoding results for
– decode
– decode_anytopo
– (allphone's trigram was probably left untested.)
• A fake LM with more than 65536 words can be used and run in decode
Current Practical Limits
• The lm data structure in lm.h
– Theoretically supports LMs with
• Fewer than 4 billion unigrams
• Fewer than 4 billion bigrams
• Fewer than 4 billion trigrams
– What if the n-gram count is larger than 4 billion?
• Answer: we are dead people
• Further answer: it is easily fixable
• Other data structures in Sphinx 3?
– hash.c doesn't return a prime number larger than 900001
– Further answer: it is easily fixable as well
Conclusion
• Technically
– Sphinx III's 32-bit mode is not that difficult to take care of.
– The problem was also confined to one data structure
• Thanks to the modular design of Sphinx III
• Pretty easy to solve.
• Sphinx III's decision to use a binary format
– If I were Ravi, I would do that as well
– Much faster loading times for large models.
CMU-Cambridge LM Toolkit Version 2
• LM support in the CMU-Cambridge LM Toolkit Version 2
– LM training
• Parameter estimation with back-off weight computation
• Supports both
– LM in ARPA format
– LM in BINLM format
» BINLM is not the same as the DMP format.
» binlm2arpa can translate BINLM to ARPA
Purpose of the toolkit
• Training LMs for
– Speech recognition
– Statistical machine translation
– Document classification
– Handwriting recognition
• A note:
– Occasionally, speech recognition really isn't everything
Standard Procedure of Training
• In V2's time,
– David Huggins-Daines wasn't at CMU
– Training was separated into 4 stages
• text2wfreq -> builds the word frequency table
• wfreq2vocab
– Finds the vocabulary we need (smaller than the frequency table)
• text2idngram
– Converts the text into a stream of n-grams and their counts (idngram)
– The n-gram word IDs are alphabetically sorted
• idngram2lm
– Gathers the counts, computes the discounted estimates and the back-off weights.
Reasons why V2 doesn't support more than 65536 words
• There is one single file that typedefs many data structures
– But those types are not used very often
• Most variables are not typedef'd.
– Many of them are declared as
• unsigned short
• int
Another Issue……
• What if we have more than 4 billion n-grams this time?
– e.g. if n > 5
• Not forgivable in LM training because
– The MT people already have this problem. (Their unigram size is 5 million.)
Strategy
• Spent 90% of the time making sure the data types were declared correctly
• Gave up handling both the 16-bit and 32-bit binary layouts together.
– A compile-time switch (THIRTYTWOBITS) is provided
– Reasonable because users seldom used BINLM anyway
• Users need to use the DMP format in Sphinx III
• The tool chain is now complete
– The number of n-grams is a 64-bit number
What we support now
• One can train an LM with more than 65536 words
– text2wfreq, wfreq2vocab, text2idngram, idngram2lm are fixed
• One can convert an LM
– binlm2arpa, ngram2mgram are fixed
• One can compute the perplexity of an LM and some statistics from the text
– evallm, idngram2stats are fixed
Other Evils of Detail
• V2's hash table uses a very bad hash function
– Many collisions
– A legacy from the pre-90s
– One could take a 4-hour nap while the word list loads when training a 500k-word model.
– After switching to Daniel J. Bernstein's hash function,
• the load time is acceptable (<1 min)
• The binary layout was one of the most time-consuming parts of development
Verification
• The 16-bit and 32-bit code produce exactly the same results
• The 32-bit code can train an LM from a fake corpus with 10M unique words.
• Note: we are talking about unique words.
• Both Dave's Perl tests and Arthur's tests all pass.
– So things like LM interpolation are actually working too.
Current Limitations
• Theoretically supports
– 1.84 × 10^19 n-grams.
• The 4-step procedure uses too much space
– Training on 100M words requires
• 10 GB of hard disk
• 1-2 GB of RAM
– Training on 1G words requires more
• 100 GB of hard disk and 20 GB of RAM?
So, we still have issues when……
• In ascending order of difficulty
– What if the MT people asked us to run their LM in our recognizer (1M-word limit)?
– What if we need to run decoding for 10 languages, each with 100k words?
– What if we need to train on an N-word corpus (N = 1 billion) and there are N*N*N trigrams?
– What if Prof. Jim Baker came back?
– What if there were aliens?
An Important Observation
• There is an implicit development deadlock between
– Sphinx III
– the CMU-Cambridge LM Toolkit
– SphinxTrain
General Pattern (part I)
• The decoder's developers think
– "Feature X is not implemented in the trainer"
– "That is to say, there is no use in implementing feature X."
=> Give up feature X
General Pattern (part II)
• The trainer's developers think
– "Feature X is not implemented in the decoder"
– "That is to say, there is no use in implementing feature X."
=> Give up feature X
Why was Feature X not implemented in the first place?
• Possible reason 1:
– In the past, someone analyzed some results and concluded that Feature X is not useful
• Possible reason 2:
– Because of theoretical reasons Y and Z, someone concluded that Feature X is not useful
• Possible reason 3:
– Past hardware limitations
In Reality……
• Feature X could turn out to be very useful,
– e.g.
• More than 65k words in the LM
• N-grams with N > 3
• Interpolation (instead of back-off) in N-grams
Another Important Observation
• Constantly giving up new features
– Eventually means giving up the whole software development
– Look at the CMU-Cambridge LM Toolkit V2
How should we deal with this problem?
• 1. Know that this is a problem
• (From anonymous self-help books.)
• 2. We need a joint understanding of both the decoder and the trainer(s)
– Question to ask: is it really correct to always develop the decoder first?
• 3. New features of the trainer can always be tested in cheap ways
– N-best and lattice rescoring
– Then the deadlock will be broken on one side
A Unified View of Our Software
[Diagram: "The Suite": SphinxTrain, the CMU-Cambridge LM Toolkit, and the Sphinx brothers ({2,3,4}, depending on where you live)]
Issue 1
• Q: “Do we have the right to change the LM Toolkit?”
• A: "Yes. According to the license, as long as we open the source for research purposes, we can change and distribute the code.
• Our changes are endorsed by Prof. Rosenfeld (CMU), Dr. Clarkson (Cambridge), and Prof. Robinson (Cambridge)."
Issue 2
• Q: "Do we have anything new in LM?"
• A: "That depends on the brilliance of our students and staff.
– Also, generally, the brilliance of the public
• They have the right to contribute
• Actually, in the past 10 years,
– A lot of new things were done in LM at CMU
– Just no one collects them and puts them together."
Issue 3
• Q: "Are you just getting yourself into a lot of trouble?"
• A: "The troubles are always there; we just never face them."
Digression: Project LL
• News: some folks are working on the LM toolkit now!
– Project code: LL
– Three key supporters
• A young professor (or Prof. ABAB)
– Hint: he is not exactly young
• A young student (or DHDH)
• A young staff member (or ACAC)
– Gathering code from around the world
• Thanks to Professor Yannick from LIUM in advance
• Thanks to contributor ATAT
Conclusion
• 32-bit data structures are now supported in both Sphinx III and the CMU-Cambridge LM Toolkit.
• This brings up a lot of development issues
– Maybe we should take the LM toolkit more seriously
• Maintenance (a must)
• New feature development (if we have time)