chenhui chu , toshiaki nakazawa , daisuke kawahara, sadao kurohashi
DESCRIPTION
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation. Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi Graduate School of Informatics, Kyoto University. EAMT2012 ( 2012/05/ 28). Outline. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/1.jpg)
1
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
EAMT2012 (2012/05/28)
![Page 2: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/2.jpg)
Outline
• Word Segmentation Problems• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
2
![Page 3: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/3.jpg)
Outline
• Word Segmentation Problems• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
3
![Page 4: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/4.jpg)
Word Segmentation for Chinese-Japanese MT
4
Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
小坂 / 先生 / は / 日本 / 臨床 / 麻酔 / 学会 / の / 創始 / 者 / である / 。小 / 坂 / 先生 / 是 / 日本 / 临床 / 麻醉 / 学会 /的 / 创始者 / 。Zh:
Ja:
Ref:
小坂先生は日本臨床麻酔学会の創始者である。小坂先生是日本临床麻醉学会的创始者。Zh:
Ja:
![Page 5: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/5.jpg)
Word Segmentation Problems in Chinese-Japanese MT
• Unknown words– Affect segmentation accuracy and consistency
• Word segmentation granularity– Affect word alignment
5
小坂 / 先生 / は / 日本 / 臨床 / 麻酔 / 学会 / の / 創始 / 者 / である / 。小 / 坂 / 先生 / 是 / 日本 / 临床 / 麻醉 / 学会 / 的 / 创始者 / 。Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
Zh:
Ja:
Ref:
![Page 6: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/6.jpg)
Outline
• Introduction• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
6
![Page 7: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/7.jpg)
Chinese Characters
• Chinese characters are used both in Chinese (Hanzi) and Japanese (Kanji)
• There are many common Chinese characters between Hanzi and Kanji
• We made a common Chinese characters mapping table for 6,355 JIS Kanji (Chu et al., 2012)
7
![Page 8: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/8.jpg)
Common Chinese Characters Related Studies
• Automatic sentence alignment task (Tan et al., 1995)
• Dictionary construction (Goh et al., 2005)• Word level semantic relations investigation
(Huang et al., 2008)• Phrase alignment (Chu et al., 2011)• This study exploits common Chinese characters
in Chinese word segmentation optimization for MT
8
![Page 9: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/9.jpg)
Outline
• Word Segmentation Problems• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
9
![Page 10: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/10.jpg)
Reason for Chinese Word Segmentation Optimization
• Segmentation for Japanese is easier than Chinese, because Japanese uses Kana other than Chinese characters
• F-score for Japanese segmentation is nearly 99% (Kudo et al., 2004), while that for Chinese is still about 95% (Wang et al., 2011)
• Therefore, we only do word segmentation optimization for Chinese, and keep the Japanese segmentation results
10
![Page 11: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/11.jpg)
11
Parallel Training Corpus
① Chinese Lexicons Extraction
Chinese Annotated Corpus for Chinese Segmenter
② Chinese Lexicons Incorporation
③ Short Unit Transformation
Optimized Chinese Segmenter
Training
Short Unit Chinese Corpus for Chinese Segmenter
Common Chinese Characters
Chinese Lexicons
System Dictionary of Chinese Segmenter
System Dictionary withChinese Lexicons
![Page 12: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/12.jpg)
12
Parallel Training Corpus
① Chinese Lexicons Extraction
Chinese Annotated Corpus for Chinese Segmenter
② Chinese Lexicons Incorporation
③ Short Unit Transformation
Optimized Chinese Segmenter
Training
Short Unit Chinese Corpus for Chinese Segmenter
Common Chinese Characters
Chinese Lexicons
System Dictionary of Chinese Segmenter
System Dictionary withChinese Lexicons
![Page 13: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/13.jpg)
① Chinese Lexicons Extraction
• Step 1: Segment Chinese and Japanese sentences in the parallel training corpus
• Step 2: Convert Japanese Kanji tokens into Chinese using the mapping table we made (Chu et al., 2012)
• Step 3: Extract the converted tokens as Chinese lexicons if they exist in the corresponding Chinese sentence
13
![Page 14: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/14.jpg)
Extraction Example
14
小坂 / 先生 / は / 日本 / 臨床 / 麻酔 / 学会 / の / 創始 / 者 / である / 。
小 / 坂 / 先生 / 是 / 日本 / 临床 / 麻醉 / 学会 / 的 / 创始 者 / 。
Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
Zh:
Ja:
Ref:
小坂 / 先生 / は / 日本 / 临床 / 麻醉 / 学会 / の / 创始 / 者/ である / 。Ja:
Kanji tokens conversion
Check
小坂 先生 日本 临床 麻醉 学会 创始 者 Chinese Lexicons :
Extraction
![Page 15: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/15.jpg)
15
Parallel Training Corpus
① Chinese Lexicons Extraction
Chinese Annotated Corpus for Chinese Segmenter
② Chinese Lexicons Incorporation
③ Short Unit Transformation
Optimized Chinese Segmenter
Training
Short Unit Chinese Corpus for Chinese Segmenter
Common Chinese Characters
Chinese Lexicons
System Dictionary of Chinese Segmenter
System Dictionary withChinese Lexicons
![Page 16: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/16.jpg)
② Chinese Lexicons Incorporation
• Using a system dictionary is helpful for Chinese word segmentation (Low et al., 2005; Wang et al., 2011)
• We incorporate the extracted lexicons into the system dictionary of a Chinese segmenter
• Assign POS tags by converting the POS tags assigned by the Japanese segmenter using POS tags mapping table between Chinese and Japanese
16
![Page 17: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/17.jpg)
17
CTB (Penn Chinese Treebank)
JUMAN (Kurohashi et al., 1994)
AD(adverb) 副詞 (adverb)CC(coordinating conjunction) 接続詞 (conjunction)CD(cardinal number) 名詞 (noun)[ 数詞 (numeral noun)]FW(foreign words) 未定義語 (undefined word)[ アルファベット (alphabet)] IJ(interjection) 感動詞 (interjection)M(measure word) 接尾辞 (suffix)[ 名詞性名詞助数辞 (measure word suffix)] NN(common noun) 名詞 (noun)[ 普通名詞 (common noun)/ サ変名詞 (sahen
noun)/形式名詞 (formal noun)/ 副詞的名詞 (adverbial noun)], 接尾辞 (suffix)[ 名詞性名詞接尾辞 (noun suffix)/名詞性特殊接尾辞 (special noun suffix)]
NR(proper noun) 名詞 (noun)[ 固有名詞 (proper noun)/ 地名 (place name)/人名 (person name)/ 組織名 (organization name)]
NT(temporal noun) 名詞 (noun)[ 時相名詞 (temporal noun)]PU(punctuation) 特殊 (special word) VA(predicative adjective) 形容詞 (adjective)VV(other verb) 動詞 (verb)/ 名詞 (noun)[ サ変名詞 (sahen noun)]
![Page 18: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/18.jpg)
18
Parallel Training Corpus
① Chinese Lexicons Extraction
Chinese Annotated Corpus for Chinese Segmenter
② Chinese Lexicons Incorporation
③ Short Unit Transformation
Optimized Chinese Segmenter
Training
Short Unit Chinese Corpus for Chinese Segmenter
Common Chinese Characters
Chinese Lexicons
System Dictionary of Chinese Segmenter
System Dictionary withChinese Lexicons
![Page 19: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/19.jpg)
③ Short Unit Transformation
• Adjusting Chinese word segmentation to make tokens 1-to-1 mapping as many as possible between a parallel sentences can improve alignment accuracy (Bai et al., 2008)
• Wang et al. (2010) proposed a short unit standard for Chinese word segmentation, which can reduce the number of 1-to-n alignments and improve MT performance
19
![Page 20: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/20.jpg)
Our Method
• We transform the annotated training data of Chinese segmenter utilizing the extracted lexicons
20
从 _P/ 有效性 _NN / 高 _VA/ 的 _DEC/ 格要素_NN /…
从 _P/ 有效 _NN/性 _NN /高 _VA/的 _DEC/ 格 _NN/要素_NN /…
Short:
CTB:
Lexicon: 有效 (effective) Lexicon : 要素 (element)
From case element with high effectiveness …Ref:
![Page 21: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/21.jpg)
Constraints
• We do not use the extracted lexicons that are composed of only one Chinese character
21
歌 (song)・・・歌颂(praise)
歌 (song)/颂 (praise)
long token extracted lexicons short unit tokens
![Page 22: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/22.jpg)
Outline
• Word Segmentation Problems• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
22
![Page 23: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/23.jpg)
Two Kinds of Experiments
• Experiments on Moses • Experiments on Kyoto example-based
machine translation (EBMT) system Nakazawa and Kurohashi, 2011)– A dependency tree-based decoder
23
![Page 24: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/24.jpg)
Experimental Settings on MOSES (1/2)
24
Parallel Training Corpus Zh-Ja paper abstract corpus (680k sentences)
Chinese Annotated Corpus NICT Chinese Treebank (same domain of parallel corpus, 9,792 sentences)CTB 7 (31,131 sentences)
Chinese Segmenter In-house Chinese segmenterJapanese Segmenter JUMAN (Kurohashi et al., 1994)MT system Moses with default options, except for
the distortion limit (6→20)
![Page 25: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/25.jpg)
Experimental Settings on MOSES (2/2)
• Baseline: Only using the lexicons extracted from Chinese annotated corpus
• Incorporation: Incorporate the extracted Chinese lexicons
• Short unit: Incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data
25
![Page 26: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/26.jpg)
Results of Chinese-to-Japanese Translation Experiments on MOSES
BLEU NICT Chinese Treebank CTB 7Baseline 34.92 36.64Incorporation 36.19 37.39Short unit 36.82 38.59
26
• CTB 7 shows better performance because the size is more than 3 times of NICT Chinese Treebank
• Lexicons extracted from a paper abstract domain also work well on other domains (i.e. CTB 7)
![Page 27: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/27.jpg)
Results of Japanese-to-Chinese Translation Experiments on MOSES
27
BLEU NICT Chinese Treebank CTB 7Baseline 31.42 31.83Incorporation 31.24 32.34Short unit 31.95 31.95
• Not significant compared to Zh-to-Ja, because our proposed approach does not change the segmentation results of input Japanese sentences
![Page 28: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/28.jpg)
Experimental Settings on EBMT (1/2)
28
Parallel Training Corpus Zh-Ja paper abstract corpus (680k sentences)
Chinese Annotated Corpus CTB 7 (31,131 sentences)Chinese Segmenter In-house Chinese segmenterJapanese Segmenter JUMAN (Kurohashi et al., 1994)MT system Kyoto example-based machine
translation (EBMT) system
![Page 29: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/29.jpg)
Experimental Settings on EBMT (2/2)
• Baseline: Only using the lexicons extracted from Chinese annotated corpus
• Short unit: Incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data
29
![Page 30: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/30.jpg)
Results of Translation Experiments on EBMT
30
BLEU Zh-to-Ja Ja-to-ZhBaseline 22.84 19.10Short unit 23.36 19.15
• Translation performance is worse than MOSES, because EBMT suffers from low accuracy of Chinese parser
• Improvement by short unit is not significant because the Chinese parser is not trained on short unit segmented training data
![Page 31: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/31.jpg)
Outline
• Word Segmentation Problems• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
31
![Page 32: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/32.jpg)
Short Unit Effectiveness on MOSES
32
Baseline (BLEU=49.38)Input: 本 / 论文 / 中 / , / 提议 / 考虑 / 现存 / 实现 / 方式 / 的 / 功能 / 适应性 / 决定 / 对策 / 目标 / 的 / 保密 / 基本 / 设计法 / 。Output: 本 / 論文 / で / は / , / 提案 / する / 適応 / 的 / 対策 / を / 決定 / する / セキュリティ / 基本 / 設計 / 法 / を / 考える / 現存 / の / 実現 / 方式 / の / 機能 / を / 目標 / と / して / いる / .
Short unit (BLEU=56.33)Input: 本 / 论文 / 中 / , / 提议 / 考虑 / 现存 / 实现 / 方式 / 的 / 功能 / 适应 / 性 / 决定 / 对策 / 目标 / 的 / 保密 / 基本 / 设计 / 法 / 。Output: 本 / 論文 / で / は / , / 提案 / する / 考え / 現存 / の / 実現 / 方式 / の / 機能 / 的 / 適応 / 性 / を / 決定 / する / 対策 / 目標 / の / セキュリティ / 基本 / 設計 / 法 / を / 提案 / する / .
Reference本 / 論文 / で / は / , / 対策 / 目標 / を / 現存 / の / 実現 / 方式 / の / 機能 / 的 / 適合 / 性 / も / 考慮 / して / 決定 / する / セキュリティ / 基本 / 設計 / 法 / を / 提案 / する / .(In this paper, we propose a basic security design method also consider functional suitability of the existing implementation method for determining countermeasures target.)
![Page 33: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/33.jpg)
Number of Extracted Lexicons
33
Source NICT Treebank CTB 7Chinese annotated corpus 13,471 26,202Short unit Chinese corpus 12,627 25,490
SourceParallel training corpus 18,584
• The number of extracted lexicons deceased after short unit transformation because duplicated lexicons increased
![Page 34: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/34.jpg)
Short Unit Transformation Percentage
• NICT Chinese Treebank– 6,623 tokens out of 257,825 been transformed to
13,469 short unit tokens, the percentage is 2.57% • CTB 7– 19,983 tokens out of 718,716 been transformed to
41,336 short unit tokens, the percentage is 2.78%
34
![Page 35: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/35.jpg)
Short Unit Transformation Problems (1/3)
• Improper transformation problem
35
好意 (favor)・・・不好意思(sorry)
不 (not)/好意 (favor)/思 (think)
long token extracted lexicons short unit tokens
![Page 36: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/36.jpg)
Short Unit Transformation Problems (2/3)
36
充电(charge)电器(electric
equipment)・・・
充电器 (charger) 充 (charge)/电器 (electric equipment)
充电 (charge)/器 (device)
long token extracted lexicons short unit tokens
• Transformation ambiguity problem
![Page 37: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/37.jpg)
Short Unit Transformation Problems (3/3)
37
实验 (test)・・・被实验者 _NN(test subject)
被 _NN(be)/实验 _NN(test)/者 _NN(person)
long token extracted lexicons short unit tokens
The correct POS tag for “ 被 (be)” should be LB (“ 被” in long bei-const)
• POS Tag assignment problem
![Page 38: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/38.jpg)
Outline
• Word Segmentation Problems• Common Chinese Characters• Chinese Word Segmentation Optimization• Experiments• Discussion• Related Work• Conclusion and Future Work
38
![Page 39: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/39.jpg)
Bai et al., 2008
• Proposed a method of learning affix rules from a aligned Chinese-English bilingual terminology bank to adjust Chinese word segmentation in the parallel corpus directly
39
![Page 40: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/40.jpg)
Wang et al., 2010
• Proposed a method based on transfer rules and a transfer database. – The transfer rules are extracted from alignment
results of annotated Chinese and segmented Japanese training data
– The transfer database is constructed using external lexicons, and is manually modified
40
![Page 41: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/41.jpg)
Conclusion
• We proposed an approach of exploiting common Chinese characters in Chinese word segmentation optimization for Chinese-Japanese MT
• Experimental results of Chinese-Japanese MT on a phrase-based SMT system indicated that our approach can improve MT performance significantly
41
![Page 42: Chenhui Chu , Toshiaki Nakazawa , Daisuke Kawahara, Sadao Kurohashi](https://reader033.vdocuments.net/reader033/viewer/2022050911/56816371550346895dd44f20/html5/thumbnails/42.jpg)
Future Work
• Solve the short unit transformation problems• Evaluate the proposed approach on parallel
corpus of other domains
42