elute
TRANSCRIPT
![Page 1: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/1.jpg)
ELUTEEssential Libraries and Utilities of
Text Engineering
Tian-Jian Jiang
![Page 2: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/2.jpg)
Why bother?Since we already have...
![Page 3: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/3.jpg)
(lib)TaBE•Traditional Chinese Word Segmentation
•with Big5 encoding
•Traditional Chinese Syllable-to-Word Conversion
•with Big5 encoding
•for bo-po-mo-fo transcription system
![Page 4: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/4.jpg)
1999 - 2001?1999 - 2001?1999 - 2001?1999 - 2001?
![Page 5: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/5.jpg)
How about...
![Page 6: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/6.jpg)
libchewing•Now hacking for
•UTF-8 encoding
•Pinyin transcription system
•and looking for
•an alternative algorithm
•a better dictionary
![Page 7: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/7.jpg)
We got a problem•23:43 < s*****> 到底 (3023) + 螫舌 (0) 麼(2536) 東西 (6024) = 11583
•23:43 < s*****> 到底是 (829) + 什麼東西 (337) = 1166
•23:43 < s*****> 到底螫舌麼東西 大勝 到底這什麼東西
•00:02 < s*****> k***: 「什麼」會被「什麼東西」排擠掉
•00:02 < s*****> k***: 結果是 20445 活生生的被 337 幹掉 :P
![Page 8: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/8.jpg)
Word Segmentation Review
![Page 9: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/9.jpg)
Heuristic Rules*
• Maximum matching -- Simple vs. Complex: 下雨天真正討厭
• 下雨 天真 正 討厭 vs. 下雨天 真正 討厭
• Maximum average word length
• 國際化
• Minimum variance of word lengths
• 研究 生命 起源
• Maximum degree of morphemic freedom of single-character word
• 主要 是 因為
* Refer to MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/
![Page 10: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/10.jpg)
Graphical Models• Markov chain family
• Statistical Language Model (SLM)
• Hidden Markov Model (HMM)
• Exponential models
• Maximum Entropy (ME)
• Conditional Random Fields (CRF)
• Applications
• Probabilistic Context-Free Grammar (PCFG) Parser
• Head-driven Phrase Structure Grammar (HPSG) Parser
• Link Grammar Parser
![Page 11: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/11.jpg)
What is a language model?
![Page 12: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/12.jpg)
A probability distributionover surface patterns of
texts.
![Page 13: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/13.jpg)
The Italian Who Went to Malta
•One day ima gonna Malta to bigga hotel.
•Ina morning I go down to eat breakfast.
•I tella waitress I wanna two pissis toasts.
•She brings me only one piss.
•I tella her I want two piss. She say go to the toilet.
•I say, you no understand, I wanna piss onna my plate.
•She say you better no piss onna plate, you sonna ma bitch.
•I don’t even know the lady and she call me sonna ma bitch!
![Page 14: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/14.jpg)
P(“I want to piss”) > P(“I want two pieces”)
For that Malta waitress,
![Page 15: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/15.jpg)
Do the Math• Conditional probability:
•
• Bayes’ theorem:
•
• Information theory:
• Noisy channel model
•
• Language model: P(i)
Noisy channelp(o|i)
DecoderI O Î
![Page 16: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/16.jpg)
Shannon’s Game
•Predict next word by history
•
•Maximum Likelihood Estimation
•
•C(w1…wn) : Frequency of n-gram w1…wn
![Page 17: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/17.jpg)
Once in a Blue Moon
• A cat has seen...
• 10 sparrows
• 4 barn swallows
• 1 Chinese Bulbul
• 1 Pacific Swallow
• How likely is it that next bird is unseen?
![Page 18: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/18.jpg)
(1+1) / (10 + 4 + 1 + 1)
![Page 19: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/19.jpg)
But I’ve seen a moonand I’m blue
• Simple linear interpolation
• PLi(wn|wn-2 , wn-1) = λ1P1(wn) + λ2P2(wn|wn-1 ) + λ3P2(wn|wn-1 , wn-2)
• 0 ≤λi ≤ 1, Σiλi = 1
• Katz’s backing-off
• Back-off through progressively shorter histories.
• Pbo(wi|wi-(n-1)…wi-1) =
•
•
![Page 20: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/20.jpg)
Good Luck!• Place a bet remotely on a
horse race within 8 horses by passing encoded messages.
• Past bet distribution
• horse 1: 1/2
• horse 2: 1/4
• horse 3: 1/8
• horse 4: 1/16
• the rest: 1/64
Foreversoul: http://flickr.com/photos/foreversouls/CC: BY-NC-ND
![Page 21: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/21.jpg)
3 bits? No, only 2!
0, 10, 110, 1110, 111100, 111101, 111110, 111111
![Page 22: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/22.jpg)
Alright, let’s ELUTE
![Page 23: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/23.jpg)
Bi-gram MLE Flow Chart
Permute candidates
right_gramIn LM?
hasleft_gram?
bi_gramIn LM?
left_gramIn LM?
temp_score =LogProb(right_gram)
temp_score =LogProb(bi_gram)
temp_score =LogProb(left_gram) +BackOff(right_gram)
temp_score =LogProb(Unknown) +BackOff(right_gram)
temp_score =LogProb(Unknown)
Update scores
temp_score +=previous_score
have 2 grams?
Yes
Yes
Yes
Yes Yes
No
No
No
No No
![Page 24: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/24.jpg)
INPUT input_syllables; len = Length(input_syllables); Load(language_model);scores[len + 1]; tracks[len + 1]; words[len + 1];FOR i = 0 TO len
scores[i] = 0.0; tracks[i] = -1; words[i] = "";FOR index = 1 TO len
best_score = 0.0; best_prefix = -1; best_word = "";FOR prefix = index - 1 TO 0
right_grams[] = Homophones(Substring(input_syllabes, prefix, index - prefix));FOR EACH right_gram IN right_grams[]
IF right_gram IN language_modelleft = tracks[prefix];IF left >= 0 AND left != prefix
left_grams[] = Homophones(Substring(input_syllables, left, prefix - left));FOR EACH left_gram IN left_grams[]
temp_score = 0.0;bigram = left_gram + " " + right_gram;IF bigram IN language_model
bigram_score = LogProb(bigram);temp_score += bigram_score;
ELSE IF left_gram IN language_modelbigram_backoff = LogProb(left_gram) + BackOff(right_gram);temp_score += bigram_backoff;
ELSEtemp_score += LogProb(Unknown) + BackOff(right_gram);
temp_score += scores[prefix];Scoring
ELSEtemp_score = LogProb(right_gram);Scoring
ELSEtemp_score = LogProb(Unknown) + scores[prefix];Scoring
scores[index] = best_score; tracks[index] = best_prefix_index; words[index] = best_prefix;IF tracks[index] == -1
tracks[index] = index - 1;boundary = len; output_words = "";WHILE boundary > 0
output_words = words[boundary] + output_words;boundary = tracks[boundary];
RETURN output_words;
SUBROUTINE ScoringIF best_score == 0.0 OR temp_score > best_score
best_score = temp_score;best_prefix = prefix;best_word = right_gram;
Bi-gram Syllable-to-Word
![Page 25: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/25.jpg)
Show me the…
![Page 26: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/26.jpg)
William’s Requests
![Page 27: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/27.jpg)
And My Suggestions
• Convenient API
• Plain text I/O (in UTF-8)
• More linguistic information
• Algorithm: CRF
• Corpus: we need YOU!
• Flexible to different applications
• Composite, Iterator, and Adapter Patterns
• IDL support
• SWIG
• Open Source
• Open Corpus, too
![Page 28: ELUTE](https://reader035.vdocuments.net/reader035/viewer/2022062320/55c52c80bb61eb7d588b4603/html5/thumbnails/28.jpg)
Thank YOU