>65536. Arthur Chan, May 4, 2006.
TRANSCRIPT
What is so special about 65536?
• 65536 = 2^16
• Did you know?
– Sphinx III did not support language models with more than 65536 (2^16) words
– The CMU-Cambridge LM Toolkit V2 is not happy about text with more than 65536 unique words either.
• Though a word could have a count of more than 65536 (with -four_byte_counts)
Why was 65536 the limit?
• Both Sphinx III and the CMU-Cambridge LM Toolkit V2 were written in 1995-99
– A time when having 64 MB of RAM was extravagant
• (Now 64 GB seems to be the number.)
• (At that time, a Pentium 166 or 200 was hot.)
– Programmers therefore designed clever structures to deal with memory issues
• In Sphinx, the DMP format was invented (word ID: 16 bits)
• In the CMU-Cambridge LM Toolkit, 16-bit data types were used for word IDs.
This Talk (30-35 pages)
• Describes our effort to break the 16-bit limit in
– Sphinx 3
– CMU-Cambridge LM Toolkit V2
• Half a talk
– Features not fully tested in real life
– But the talk itself is quite long.
• Technical detail of the changes
– Sphinx III – the easy part (9 pages)
– CMU-Cambridge LM Toolkit – the tough part (10 pages)
• The Root of the Evil (11 pages)
– Why does this problem exist? Why does it persist?
– What if similar problems appear? How do we solve and avoid them?
Disclaimer about the Speaker
• Notorious for being negative about language modeling techniques
– Symptom 1: Yells at others when his LM code has bugs.
– Symptom 2: Yells at others, period.
• He should be forgiven because
– His Master's thesis supervisor taught him that when he was young
– Prof. Ronald Rosenfeld taught him the same
– He also read Dr. Joshua Goodman's papers
Terminology
• "Probability" actually means
– The estimate of the probability
• Back-off weight means
– When some n-gram is unseen in the training data
• Back off to the (n-1)-gram "probability" times a weight
• According to Manning, "four-gram" should be "tetragram" and "bigram" should be "digram"
– Well, it's lucky it doesn't matter to us today
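For reference, the back-off scheme the bullet alludes to can be written in the usual Katz-style form (a standard textbook formulation, not copied from the toolkit):

```latex
P(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\hat{P}(w_i \mid w_{i-n+1}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) > 0 \\[4pt]
\mathrm{bow}(w_{i-n+1}^{i-1}) \, P(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}
```

Here $\hat{P}$ is the discounted estimate and $\mathrm{bow}$ is the back-off weight attached to the $(n-1)$-gram history.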
What Sphinx 3.6 RCI supports
• ARPA LM
• DMP LM
– A memory-efficient version of the ARPA LM
– Can also be run in disk mode
• Class-based LM
• Multiple LMs with dynamic LM switching
• lm_convert
– (New in 3.6!) Conversion tool for ARPA and DMP LMs
A note on the DMP format
• A tree-like format
– Bigrams are indexed by their prefix unigram
– Trigrams are indexed by their prefix bigram.
• Bigram and trigram probabilities and back-off weights
– Quantized to 4 decimal places, so you see the following statements in the code:
Funny C statements in the Code
/* HACK!! to quantize probs to 4 decimal digits */
p = p3 * 10000;
p3 = p * 0.0001;
If you delete this, the LM file will be larger because quantization is not done.
Reasons why Sphinx III supports fewer than 65536 words
• 16-bit data structures for
– Bigrams
– Trigrams
– The cache structure
The Bogus Reason Why Sphinx III Doesn't Support More Than 65536 Words
• A very bad misconception
– "The decoding is constrained by the dictionary"
– WRONG
• In both the flat and tree lexicon search, only LM words are traversed.
– RIGHT
• Generally, decoding is constrained by the intersection of the LM words and the dictionary words
Several Proposed Surgical Procedures
• 1. Rewrite the whole LM routine
– Oops! That takes too much time
– The old routine is very memory-efficient
• 2. Replace the old LM by just switching the type of the data structure
– Problem: all the binary LMs we have generated have the old layout.
– We would lose backward compatibility very badly
Final Solution
• lm now supports two data structures: 16-bit and 32-bit
• lm_convert and decode support two types of binary LMs
– DMP, which has a 16-bit layout
– DMP32, which has a 32-bit layout
– A magic version number decides which layout to use
– A regression test could ensure no bad code check-ins
• Which format is in use is hidden from
– Anyone calling the lm routines (with a few exceptions)
Partial Verification of the Code
• The 16-bit and 32-bit code produce exactly the same decoding results for
– decode
– decode_anytopo
– (allphone's trigram was probably left untested.)
• A fake LM with more than 65536 words can be used and run in decode
Current Practical Limits
• The lm data structure in lm.h
– Theoretically supports LMs with
• Fewer than 4 billion unigrams
• Fewer than 4 billion bigrams
• Fewer than 4 billion trigrams
– What if the n-gram count is larger than 4 billion?
• Answer: we are dead people
• Further answer: it is easily fixable
• Other data structures in Sphinx 3?
– hash.c doesn't return a prime number larger than 900001
– Further answer: it is easily fixable as well
Conclusion
• Technically
– Sphinx III's 32-bit mode is not that difficult to take care of.
– The problem was also confined to one data structure
• Thanks to the modular design of Sphinx III
• Pretty easy to solve.
• Sphinx III's decision to use a binary format
– If I were Ravi, I would do that as well
– Much faster loading times for large models.
CMU-Cambridge LM Toolkit Version 2
• LM support in the CMU-Cambridge LM Toolkit Version 2
– LM training
• Parameter estimation with back-off weight computation
• Supports both
– LM in ARPA format
– LM in BINLM format
» BINLM is not the same as the DMP format.
» binlm2arpa can translate BINLM to ARPA
Purpose of the toolkit
• Training LMs for
– Speech recognition
– Statistical machine translation
– Document classification
– Handwriting recognition
• A note:
– Occasionally, speech recognition really isn't everything
Standard Procedure of Training
• In V2's time,
– David Huggins-Daines wasn't at CMU
– Training was separated into 4 stages
• text2wfreq -> builds the word frequency table
• wfreq2vocab
– Finds the vocabulary we need (smaller than the frequency table)
• text2idngram
– Converts the text into a stream of n-grams and their counts (idngram)
– The n-gram word IDs are alphabetically sorted
• idngram2lm
– Gathers the counts, computes the discounted estimates and the back-off weights.
Reasons why V2 doesn't support more than 65536 words
• There is one single file that typedefs many data structures
– But those types are not used very often
• Most variables are not typedef'd.
– Many of them are declared as
• unsigned short
• int
Another Issue……
• What if we have more than 4 billion n-grams this time?
– e.g. if n > 5
• Not forgivable in LM training because
– The MT people already have this problem. (Their unigram size is 5 million.)
Strategy
• Spent 90% of the time making sure the data types were declared correctly
• Gave up handling both the 16-bit and 32-bit binary layouts together.
– A compile-time switch (THIRTYTWOBITS) is provided
– Reasonable because users seldom used BINLM anyway
• Users need to use the DMP format in Sphinx III
• The tool chain is now complete
– The number of n-grams is a 64-bit number
What we support now
• One can train an LM with more than 65536 words
– text2wfreq, wfreq2vocab, text2idngram, idngram2lm are fixed
• One can convert an LM
– binlm2arpa, ngram2mgram are fixed
• One can compute the perplexity of an LM and some statistics from the text
– evallm, idngram2stats are fixed
Other Evils of Detail
• V2's hash table uses a very bad hash function
– Many collisions
– A legacy from the pre-90s
– One could take a 4-hour nap while the word list loads when training a 500k-word model.
– After switching to Daniel J. Bernstein's hash function,
• the load time is acceptable (<1 min)
• The binary layout was one of the most time-consuming parts of development
Verification
• The 16-bit and 32-bit code produce exactly the same results
• The 32-bit code can train an LM from a fake corpus with 10M unique words.
• Note: we are talking about unique words.
• Both Dave's Perl tests and Arthur's tests all pass.
– So things like LM interpolation are actually working too.
Current Limitations
• Theoretically supports
– 1.84 × 10^19 n-grams.
• The 4-step procedure uses too much space
– Training on 100M words requires
• 10 GB of hard disk
• 1-2 GB of RAM
– Training on 1G words requires more
• 100 GB of hard disk and 20 GB of RAM?
So, we still have issues when……
• In ascending order of difficulty
– What if the MT people asked us to run their LM in our recognizer (1M-word limit)?
– What if we need to run decoding for 10 languages, each with 100k words?
– What if we need to train on an N-word corpus (N = 1 billion) and there are N*N*N trigrams?
– What if Prof. Jim Baker came back?
– What if there were aliens?
An Important Observation
• There is an implicit development deadlock between
– Sphinx III
– the CMU-Cambridge LM Toolkit
– SphinxTrain
General Pattern (part I)
• The decoder's developers think
– "Feature X is not implemented in the trainer"
– "That is to say, there is no use in implementing feature X."
=> Give up feature X
General Pattern (part II)
• The trainer's developers think
– "Feature X is not implemented in the decoder"
– "That is to say, there is no use in implementing feature X."
=> Give up feature X
Why was Feature X not implemented in the first place?
• Possible reason 1:
– In the past, someone analyzed some results and concluded that Feature X is not useful
• Possible reason 2:
– Because of theoretical reasons Y and Z, someone concluded that Feature X is not useful
• Possible reason 3:
– Past hardware limitations
In Reality……
• Feature X could turn out to be very useful,
– e.g.
• More than 65k words in the LM
• N-grams with N > 3
• Interpolation (instead of back-off) in N-grams
Another Important Observation
• Constantly giving up new features
– Eventually means giving up the whole software development
– Look at the CMU-Cambridge LM Toolkit V2
How should we deal with this problem?
• 1. Know that this is a problem
• (From anonymous self-help books.)
• 2. We need a joint understanding of both the decoder and the trainer(s)
– Question to ask: is it really correct to always develop the decoder first?
• 3. New features of the trainer can always be tested in cheap ways
– N-best and lattice rescoring
– Then the deadlock will be broken on one side
A Unified View of Our Software
[Diagram: "The Suite": SphinxTrain, the CMU-Cambridge LM Toolkit, and the Sphinx brothers ({2,3,4}, depending on where you live)]
Issue 1
• Q: “Do we have the right to change the LM Toolkit?”
• A: "Yes. According to the license, as long as we open the source for research purposes, we can change and distribute the code.
• Our changes are endorsed by Prof. Rosenfeld (CMU), Dr. Clarkson (Cambridge), and Prof. Robinson (Cambridge)."
Issue 2
• Q: "Do we have anything new in LM?"
• A: "That depends on the brilliance of our students and staff.
– Also, generally, the brilliance of the public
• They have the right to contribute
• Actually, in the past 10 years,
– A lot of new things were done in LM at CMU
– Just no one collects them and puts them together."
Issue 3
• Q: "Are you just getting yourself into a lot of trouble?"
• A: "The troubles are always there; we just never face them."
Digression: Project LL
• News: some folks are working on the LM toolkit now!
– Project code: LL
– Three key supporters
• A young professor (or Prof. ABAB)
– Hint: he is not exactly young
• A young student (or DHDH)
• A young staff member (or ACAC)
– Gathering code from around the world
• Thanks to Professor Yannick from LIUM in advance
• Thanks to contributor ATAT
Conclusion
• 32-bit data structures are now supported in both Sphinx III and the CMU-Cambridge LM Toolkit.
• This brings up a lot of development issues
– Maybe we should take the LM toolkit more seriously
• Maintenance (a must)
• New feature development (if we have time)