Automated Building of Classic Chinese-English
Dictionary and Chinese-Hungarian Dictionary
MQP Final Presentation
Advisors: Gábor Sárközy, WPI
András Kornai, MTA-SZTAKI
April 28, 2015
Xiaosong Wen xwen2@wpi.edu
Hongbo Fang hfang@wpi.edu
Outline
● Introduction & Background
● Methodology
● Analysis
● Conclusion & Future Work
Introduction: Bilingual Dictionary
● Definition: a specialized dictionary used to translate words or phrases from one language to another
● Other uses: Cross-Language Information Retrieval and Cross-Language Plagiarism Detection
Classic Chinese and Modern Chinese:
● Ancient Chinese: articles and poems from the pre-Qin and Han periods (121 AC)
● Modern Chinese: after the founding of the Republic of China (after 1912)
Differences between Classic Chinese and Modern Chinese:
1. Ancient Chinese did not use punctuation marks
2. Different meanings of words:
modern Chinese: “咸” → salty
classic Chinese: “咸” → all
3. Different ways to form words:
In classic Chinese, every content character is a word.
Example:
modern Chinese: ‘妻子’ → wife
classic Chinese: ‘妻’ → wife; ‘子’ → son (two distinct words)
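The word-formation difference above suggests that classic Chinese text can be segmented one character at a time. A minimal sketch, assuming that every character other than punctuation and whitespace is treated as a word; the function name `segment_classic` is hypothetical and not taken from the project:

```python
import unicodedata


def segment_classic(text):
    """Split classic Chinese text into single-character words.

    Assumption: every character that is not punctuation or whitespace
    counts as one word.
    """
    return [
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]


print(segment_classic("妻子"))  # ['妻', '子']
```

A modern Chinese segmenter would instead need a lexicon to keep ‘妻子’ together as one word.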
Parallel Corpus:
Hunglish;
UM-Corpus;
Chinese Text Project
Sparse Matrix:
• A matrix in which most of the elements are zero
• Widely used in numerical linear algebra computations
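One common way to store such a matrix is to keep only the nonzero cells. The sketch below is an illustrative dictionary-of-keys layout, not the storage format used in the project; the class name `SparseMatrix` and its methods are hypothetical:

```python
class SparseMatrix:
    """Dictionary-of-keys sparse matrix: only nonzero cells are stored."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self.cells = {}  # (row, col) -> nonzero value

    def set(self, row, col, value):
        if value == 0:
            self.cells.pop((row, col), None)  # never store zeros
        else:
            self.cells[(row, col)] = value

    def get(self, row, col):
        return self.cells.get((row, col), 0)

    def row_sum(self, row):
        return sum(v for (r, _), v in self.cells.items() if r == row)


m = SparseMatrix(3, 1_000_000)
m.set(0, 1, 2)
m.set(0, 3, 3)
print(len(m.cells), m.get(0, 2), m.row_sum(0))  # 2 0 5
```

Memory use grows with the number of nonzero cells rather than with the full matrix size, which matters when the columns are millions of sentences.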
Pointwise mutual information (PMI):
Fano 1961: mutual information between particular events X and Y
\[ \mathrm{PMI}(X, Y) = \log \frac{p(X, Y)}{p(X)\,p(Y)} = \log \frac{p(X \mid Y)}{p(X)} = \log \frac{p(Y \mid X)}{p(Y)} \]
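The definition can be checked numerically. A minimal sketch, assuming the natural logarithm (the slides do not state the base); the function name `pmi` is illustrative:

```python
import math


def pmi(p_xy, p_x, p_y):
    """PMI(X, Y) = log( p(X, Y) / (p(X) * p(Y)) ), natural log."""
    return math.log(p_xy / (p_x * p_y))


# Independent events: p(X, Y) == p(X) * p(Y), so PMI is 0.
print(pmi(0.25, 0.5, 0.5))  # 0.0
# Events that co-occur more often than chance have positive PMI.
print(pmi(0.4, 0.5, 0.5) > 0)  # True
```

The conditional form gives the same value, because p(X | Y) = p(X, Y) / p(Y).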
Methodology: Classic Chinese
Download and extract a parallel corpus from ctext.org
• wget, Java, python: openCV
• HTML parsing
• 34,140 zh-en sentence pairs
• hundict
Evaluate: clzh-en dic
● join with the 100-word basic set:
o overall precision is 82%
o precision is 92% for the words with confidence above 0.2
Improve the quality:
• remove punctuation marks
• stem the English corpus (stemming algorithm: porter2)
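The two clean-up steps can be sketched as follows. This is an illustration under assumptions, not the project's script: punctuation is identified by Unicode category, and the stemmer is passed in as a callable (`stem`, defaulting to the identity) so that a Porter2 implementation, such as the English Snowball stemmer shipped with NLTK, could be plugged in:

```python
import unicodedata


def strip_punctuation(sentence):
    """Remove every Unicode punctuation character (categories P*)."""
    return "".join(
        ch for ch in sentence
        if not unicodedata.category(ch).startswith("P")
    )


def preprocess_english(sentence, stem=lambda word: word):
    """Lowercase, strip punctuation, then apply `stem` to each token.

    `stem` defaults to the identity; a Porter2 stemmer could be passed in.
    """
    return [stem(tok) for tok in strip_punctuation(sentence.lower()).split()]


print(strip_punctuation("学而时习之，不亦说乎？"))  # 学而时习之不亦说乎
print(preprocess_english("Is it not pleasant?"))  # ['is', 'it', 'not', 'pleasant']
```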
Results: improved clzh-en dic
[Figure: error distribution]
• precision is 94% for confidence above 0.25 (598 entries with confidence 0.25 and above)
• precision is 96% for confidence above 0.3 (508 entries with confidence 0.3 and above)
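Figures of this kind come from filtering dictionary entries by a confidence cutoff and measuring precision on what remains. A minimal sketch, assuming each entry carries a confidence score and a manually assigned correctness label; the function name and the toy data are hypothetical and are not the project's results:

```python
def precision_above(entries, threshold):
    """Return (precision, count) for entries with confidence >= threshold.

    `entries` is a list of (confidence, is_correct) pairs; the
    correctness labels are assumed to come from manual checking.
    """
    kept = [ok for conf, ok in entries if conf >= threshold]
    if not kept:
        return None, 0
    return sum(kept) / len(kept), len(kept)


# Hypothetical toy data, not the project's results.
toy = [(0.10, False), (0.22, True), (0.27, False), (0.31, True), (0.45, True)]
print(precision_above(toy, 0.25))  # (0.6666666666666666, 3)
print(precision_above(toy, 0.30))  # (1.0, 2)
```

Raising the cutoff trades dictionary size for precision, which is the pattern the entry counts above show.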
诸侯 @ feudal → feudal lord
寐 @ dawn → sleep
之 @ the → of
Recall against a selected 100-word dictionary is 42%.
The remaining translations in the 100-word basic dictionary were composed manually, except for the words: Wednesday, bread, game
Methodology: Modern Chinese (simplified)
● Purchased a Chinese-to-English dictionary
● Professor Andras Kornai provided an English-to-Hungarian dictionary
● Used the Linux “join” command to get a “raw” dictionary
[Figure: joined dictionary]
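The join step pairs every Chinese and Hungarian word that share an English translation. The sketch below mimics that equality join in Python rather than with the `join` command itself; the function name `pivot_join` is hypothetical, and the `English @ Chinese @ Hungarian` output order is an assumption based on the entries shown later in this presentation:

```python
from collections import defaultdict


def pivot_join(zh_en, en_hu):
    """Join two dictionaries on their shared English word.

    zh_en: iterable of (chinese, english) pairs
    en_hu: iterable of (english, hungarian) pairs
    Returns sorted (english, chinese, hungarian) triples.
    """
    hu_by_en = defaultdict(set)
    for en, hu in en_hu:
        hu_by_en[en].add(hu)
    triples = set()
    for zh, en in zh_en:
        for hu in hu_by_en.get(en, ()):
            triples.add((en, zh, hu))
    return sorted(triples)


# Toy entries taken from the error examples in this presentation.
zh_en = [("天分", "gift"), ("天分", "talent"), ("密", "silent")]
en_hu = [("gift", "ad"), ("silent", "csendes")]
for en, zh, hu in pivot_join(zh_en, en_hu):
    print(f"{en} @ {zh} @ {hu}")
```

Because the English pivot word can be ambiguous, a join like this also produces wrong pairs, which is why the raw dictionary needs a filtering step afterwards.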
Sparse Matrix:
                 sentence1  sentence2  sentence3  sentence4  ...
both(hu1, zh1)       0          2          0          3
only hu1             0          0          1          0
only zh1             0          0          0          1
PMI:
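\[ \mathrm{PMI}(zh, hu) = \log \frac{p(zh, hu)}{p(zh)\,p(hu)} = \log \frac{\frac{n_b}{N}}{\frac{n_z + n_b}{N} \cdot \frac{n_h + n_b}{N}} = \log \left( \frac{n_b}{(n_z + n_b)(n_h + n_b)} \cdot N \right) \]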
           hu1    not hu1
zh1        n_b    n_z
not zh1    n_h
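The three counts in the table can be gathered in one pass over aligned sentences. A minimal sketch, assuming an aligned list of tokenized Chinese and Hungarian sentences is available and that a word is counted at most once per sentence pair; the function name and the toy corpus are hypothetical:

```python
def contingency_counts(sentence_pairs, zh_word, hu_word):
    """Count aligned sentence pairs for one candidate (zh, hu) pair.

    Returns (n_b, n_z, n_h):
      n_b: pairs containing both words
      n_z: pairs containing only the Chinese word
      n_h: pairs containing only the Hungarian word
    Assumption: presence is counted once per sentence pair.
    """
    n_b = n_z = n_h = 0
    for zh_tokens, hu_tokens in sentence_pairs:
        has_zh = zh_word in zh_tokens
        has_hu = hu_word in hu_tokens
        if has_zh and has_hu:
            n_b += 1
        elif has_zh:
            n_z += 1
        elif has_hu:
            n_h += 1
    return n_b, n_z, n_h


# Hypothetical tokenized corpus; "x" and "y" stand in for other words.
corpus = [
    (["非洲人", "x"], ["afrikai", "y"]),
    (["非洲人"], ["y"]),
    (["x"], ["afrikai"]),
    (["x"], ["y"]),
]
print(contingency_counts(corpus, "非洲人", "afrikai"))  # (1, 1, 1)
```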
Example:
PMI(非洲人, afrikai) = log(3 / (6 × 45) × 10,365,171) = 14.0012952777
               非洲人    not 非洲人
afrikai         3        42
not afrikai     3
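The count-based form of PMI can be evaluated directly from the table entries. A minimal sketch using the counts of this example; the base of the logarithm is not stated on the slide, so the natural logarithm is assumed here and the absolute value returned depends on that choice. The function name `pmi_from_counts` is hypothetical:

```python
import math


def pmi_from_counts(n_b, n_z, n_h, n_total):
    """PMI = log( n_b * N / ((n_z + n_b) * (n_h + n_b)) ), natural log."""
    return math.log(n_b * n_total / ((n_z + n_b) * (n_h + n_b)))


# Counts from the 非洲人 / afrikai example: n_b = 3, n_z = 3, n_h = 42.
value = pmi_from_counts(3, 3, 42, 10_365_171)
print(value > 6)  # True: the pair passes a PMI > 6 cutoff
```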
Evaluate: spcl-hu dic
Pairs with PMI > 6 are considered to have high precision
[Figure: Number of Pairs (axis range 0 to 4500)]
Errors:
Error I:
gift @ 天分 @ ad
天分: talent, gift
ad: give
silent @ 密 @ csendes.
密: secret or dense
csendes: quiet
Error II: Inconsistent part of speech
发怒, 愤, 怒, 触怒, 激怒 and 生气 are verbs;
忿, 怒气, 怒火 are nouns;
气愤, 愤怒 are adjectives.
Error III: Slang expression
stupid @ 二 @ buta
bird @ 女人 @ madár
Error IV: Indirect translation
life @ 春 @ energia
bread @ 食物 @ élelmiszer
Future work:
Acknowledgement
● András Kornai, MTA-SZTAKI
● Gábor Sárközy, Worcester Polytechnic Institute
● Huba Bartos, MTA-SZTAKI
References:
• Fano, Robert M. 1961. Transmission of Information: A Statistical Theory of Communications. New York: MIT Press.
• Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Yi Lu, Shuo Li, Yiming Wang, Longyue Wang. “A Large English-Chinese Parallel Corpus for Statistical Machine Translation.” 2014.
• Grigg, Hugh. Past events in Mandarin Chinese grammar. 14 April 2013. Accessed 22 April 2015.
• Attila Balogh, Zsolt Both, András Farkas, Péter Halácsy. http://www.hunglish.hu/.
• Chinese Text Project. http://ctext.org/.
• Parallel Text. http://en.wikipedia.org/wiki/Parallel_text.
Question?
Thank you!