Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1,2)
(1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo
(2) School of Computer Science, University of Manchester / National Centre for Text Mining
LREC 2008, 29 May 2008


Page 1: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1,2)

(1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo
(2) School of Computer Science, University of Manchester / National Centre for Text Mining

LREC 2008, 29 May 2008

Page 2: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Introduction: Building bilingual lexicons via pivot languages

[Figure: a Chinese term linked to Japanese terms through English terms, using a C-E lexicon and an E-J lexicon]

CHINESE: 计步器 (jìbùqì)
ENGLISH: odometer, pedometer
JAPANESE: オドメーター (odomētā), ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), 万歩計 (mampokei)

Page 3: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Introduction: Building bilingual lexicons via pivot languages

计步器 (jìbùqì)
(1) odometer: オドメーター (odomētā)
(2) pedometer: ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), 万歩計 (mampokei)

[Images: odometer and pedometer. Creative Commons Attribution ShareAlike 2.0 License, by skippy13]

Page 4: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Advantages of the pivotal approach

Constructing a Japanese-Chinese lexicon from Japanese-English and English-Chinese lexicons through English terms
• J-E and E-C lexicons are well supported for many terms and domains, compared to J-C lexicons
• Especially for technical terms, there are few J-C lexicons, because technical terms are first written in English in most cases
• The pivotal approach could help us to (semi-)automatically find J-C translation term pairs

Page 5: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Mismatch problem

Chinese terms | English terms | Japanese terms
全球变暖 (qúanqíu-bìannŭan) | global heating | (n/a)
(n/a) | global warming | 地球温暖化 (chikyū-ondanka)

Exact matching cannot find a Chinese-Japanese term pair unless both terms share an identical English translation.

Is it possible to generate the following lexical item?

Chinese terms | English terms | Japanese terms
全球变暖 | global heating, global warming | 地球温暖化

Page 6: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Merging Two Bilingual Lexicons

“Exact merging”
• cannot merge pairs that do not share an identical English translation (the mismatch problem)

Challenges to merge more terms
• “Word-based merging”
• “Alignment-based merging”
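Exact merging can be read as a join of the two lexicons on identical English terms. The following is a minimal sketch under that reading; the function name `exact_merge` and the toy pairs (taken from the examples on the earlier slides) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def exact_merge(ce_pairs, je_pairs):
    """Join (Chinese, English) and (Japanese, English) pairs on identical English terms."""
    english_to_japanese = defaultdict(set)
    for japanese, english in je_pairs:
        english_to_japanese[english].add(japanese)

    merged = set()
    for chinese, english in ce_pairs:
        # Only English terms that appear verbatim in both lexicons produce a pair.
        for japanese in english_to_japanese.get(english, ()):
            merged.add((chinese, japanese))
    return merged


# Toy data built from the examples on the slides (not the real lexicons).
ce_pairs = [("计步器", "pedometer"), ("全球变暖", "global heating")]
je_pairs = [("歩数計", "pedometer"), ("万歩計", "pedometer"),
            ("地球温暖化", "global warming")]
merged = exact_merge(ce_pairs, je_pairs)
```

In this toy run 全球变暖 gets no Japanese translation, because "global heating" and "global warming" are different strings; this is the mismatch problem.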

Page 7: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Word-based merging

Tokenize a term into word tokens, and translate each word by the bilingual lexicon.

Chinese terms | English terms | Japanese terms
全球变暖 | global heating | (n/a)
(n/a) | global warming | 地球温暖化
(n/a) | global | 地球
(n/a) | heating | 温暖化

Result: 全球变暖 (qúanqíu-bìannŭan) → global heating → 地球 温暖化 (chikyū - ondanka)
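As a rough sketch of word-based merging, the English pivot term can be split on whitespace and each word looked up in a word-level lexicon. The function name `word_based_translate`, the whitespace tokenizer, and the one-translation-per-word dictionary are simplifying assumptions made for illustration.

```python
def word_based_translate(english_term, word_lexicon):
    """Translate an English term word by word; return None if any word is missing."""
    translated = []
    for word in english_term.split():
        if word not in word_lexicon:
            return None  # one untranslatable word blocks the whole term
        translated.append(word_lexicon[word])
    return " ".join(translated)


# Word-level entries from the slide's example.
word_lexicon = {"global": "地球", "heating": "温暖化"}
result = word_based_translate("global heating", word_lexicon)
```

The term can only be translated when every one of its words has its own lexicon entry, which limits how many terms this strategy covers.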

Page 8: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Alignment-based merging: Overview

Align each word, calculate word translation probabilities, and translate each word by the probabilities.

Chinese terms | English terms | Japanese terms
全球 变暖 | global heating | (n/a)
(n/a) | global warming | 地球 温暖化
(n/a) | heating | 温暖化

[Figure: word alignments between 全球 变暖 and global heating, and between the English words global, heating, warming and the Japanese words 地球, 温暖化]

Page 9: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Alignment-based merging: Overview

[Figure: processing pipeline]
• Merging word pairs & re-calculating probabilities
• Word-by-word translation
• (Add term frequencies on Web)

Page 10: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Alignment-based merging

• Apply word alignment (GIZA++) (Och & Ney, 2003) to all term pairs
• Calculate word translation probabilities from co-occurrence frequencies

For both of the bilingual lexicons, source(f)-pivot(p) and pivot(p)-target(e):

$$p(w_f \mid w_p; a_{fp}) = \frac{C(w_f, w_p; a_{fp})}{C(w_p)}, \qquad p(w_p \mid w_e; a_{pe}) = \frac{C(w_p, w_e; a_{pe})}{C(w_e)}$$
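The relative-frequency estimate on this slide can be sketched as follows, assuming the word aligner has already produced a flat list of aligned (source word, pivot word) pairs. The function name, the toy pairs, and the choice to count C(w_p) over aligned occurrences only are assumptions for illustration.

```python
from collections import Counter


def translation_probs(aligned_word_pairs):
    """Estimate p(w_f | w_p) = C(w_f, w_p) / C(w_p) from aligned (w_f, w_p) pairs."""
    pair_counts = Counter(aligned_word_pairs)
    pivot_counts = Counter(w_p for _, w_p in aligned_word_pairs)
    return {
        (w_f, w_p): count / pivot_counts[w_p]
        for (w_f, w_p), count in pair_counts.items()
    }


# Invented toy alignments (Japanese word, English word); not real GIZA++ output.
aligned = [
    ("温暖化", "warming"),
    ("温暖化", "heating"),
    ("加熱", "heating"),
    ("加熱", "heating"),
]
probs = translation_probs(aligned)
```

The same function applies unchanged to the pivot-target lexicon by passing (pivot word, target word) pairs.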

Page 11: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Alignment-based merging

Calculate word translation probabilities from a target-language word to a source-language word (Utiyama & Isahara, 2007):

$$p(w_f \mid w_e) = p(w_f \mid w_e; a_{fp}, a_{pe}) = \sum_{w_p} p(w_f \mid w_p; a_{fp})\, p(w_p \mid w_e; a_{pe})$$
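Summing over pivot words amounts to composing the two probability tables. Below is a small sketch with dictionaries keyed by word pairs; the function name and all probability values are invented for illustration.

```python
from collections import defaultdict


def compose_via_pivot(p_f_given_p, p_p_given_e):
    """Compute p(w_f | w_e) as the sum over w_p of p(w_f | w_p) * p(w_p | w_e)."""
    by_pivot = defaultdict(list)
    for (w_f, w_p), prob in p_f_given_p.items():
        by_pivot[w_p].append((w_f, prob))

    p_f_given_e = defaultdict(float)
    for (w_p, w_e), prob_pe in p_p_given_e.items():
        for w_f, prob_fp in by_pivot.get(w_p, ()):
            p_f_given_e[(w_f, w_e)] += prob_fp * prob_pe
    return dict(p_f_given_e)


# Invented toy probabilities: Chinese source (f), English pivot (p), Japanese target (e).
p_f_given_p = {("变暖", "heating"): 0.5, ("变暖", "warming"): 0.5}
p_p_given_e = {("heating", "温暖化"): 0.4, ("warming", "温暖化"): 0.6}
p_f_given_e = compose_via_pivot(p_f_given_p, p_p_given_e)
```

Because both "heating" and "warming" contribute, 变暖 and 温暖化 become linked even though no single English term connects the original entries.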

Page 12: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Alignment-based merging

Calculate the translation probabilities (scores) based on the noisy channel model (Brown et al., 1990):

$$\Pr(w_e \mid w_f) = p(w_e)\, p(w_f \mid w_e) = p(w_e) \prod_i p(w_{f,i} \mid w_{e,i})$$

• The language model p(w_e) is calculated from the number of Web search results (Google) for the term w_e:
  p(w_e) ∝ (hit count of w_e)
• Generate the merged lexicon from the pairs whose translation probabilities are greater than zero in both directions:
  New_Lexicon = {(w_f, w_e) | Pr(w_e | w_f) > 0 and Pr(w_f | w_e) > 0}
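A candidate's score under this model can be sketched as the hit count times the product of word translation probabilities. The sketch assumes a one-to-one, same-order pairing of source and target words, which is a simplification; the function name and the numbers are invented.

```python
import math


def noisy_channel_score(target_words, source_words, p_f_given_e, hit_count):
    """Unnormalized Pr(w_e | w_f): hit count times the product of p(w_f,i | w_e,i)."""
    if len(target_words) != len(source_words):
        return 0.0  # assumption: words are paired one-to-one in order
    channel = math.prod(
        p_f_given_e.get((w_f, w_e), 0.0)
        for w_f, w_e in zip(source_words, target_words)
    )
    return hit_count * channel


# Invented toy values for illustration only.
p_f_given_e = {("全球", "地球"): 0.5, ("变暖", "温暖化"): 0.4}
score = noisy_channel_score(["地球", "温暖化"], ["全球", "变暖"], p_f_given_e, 1000)
```

A candidate with a zero hit count or a missing word pair scores zero and would be left out of the merged lexicon.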

Page 13: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Experimental settings

Used lexicons: bilingual lexicons that consist of technical terms
• C-E: Wanfang Data E-C & C-E Science and Technology Dictionary
• J-E: JST Machine Translation Dictionary

By “exact merging,” we can translate about 22% of Japanese (or Chinese) terms.

Lexicon | # of terms (J) | # of terms (E) | # of terms (C)
J-E | 465,563 | 416,578 |
C-E | | 429,766 | 439,795
# of distinct E terms | | 777,344 |
C-J by “exact merging” | 103,437 (22.2%) | 68,996 | 98,537 (22.4%)

Page 14: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Experimental results

Utilization ratio
• Alignment-based merging drastically improved the utilization ratio, and the size of the merged lexicon also increased

Method | # of terms (J) (utilization ratio of J) | # of terms (C) (utilization ratio of C)
Exact merging | 103,437 (22.2%) | 98,537 (22.4%)
Word-based merging | 124,945 (26.8%) | 167,929 (38.1%)
Alignment-based merging | 438,976 (94.2%) | 342,229 (77.8%)

Accuracy (by manual evaluation)
• MRR: Mean Reciprocal Rank (Voorhees, 1999), the mean of the reciprocal ranks over all source terms
• Prec1: precision of the highest-ranked terms
• Prec10: proportion of source terms whose 10-best outputs include the correct one

Source-Target | MRR | Prec1 | Prec10
Japanese-to-Chinese | 0.242 | 0.14 | 0.46
Chinese-to-Japanese | 0.258 | 0.20 | 0.40
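The three accuracy measures can be computed from ranked candidate lists and a set of accepted translations per source term. This is a sketch; the function name and the abstract toy data are assumptions, and a source term with no correct candidate contributes a reciprocal rank of zero.

```python
def evaluate(ranked_outputs, gold):
    """Return (MRR, Prec1, Prec10) over all source terms."""
    reciprocal_rank_sum = 0.0
    top1 = 0
    top10 = 0
    for source, candidates in ranked_outputs.items():
        # Rank (1-based) of the highest-ranked correct candidate, if any.
        rank = next(
            (i for i, cand in enumerate(candidates, 1) if cand in gold[source]),
            None,
        )
        if rank is not None:
            reciprocal_rank_sum += 1 / rank
            top1 += rank == 1
            top10 += rank <= 10
    n = len(ranked_outputs)
    return reciprocal_rank_sum / n, top1 / n, top10 / n


# Abstract toy data: three source terms with ranked candidates.
ranked = {"a": ["x", "y"], "b": ["p", "q", "r"], "c": ["m"]}
gold = {"a": {"x"}, "b": {"r"}, "c": {"z"}}
mrr, prec1, prec10 = evaluate(ranked, gold)
```

Here the correct answers sit at ranks 1 and 3 and one term has none, so MRR is (1 + 1/3 + 0) / 3.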

Page 15: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Experimental results: Examples (1/2)

A Chinese-to-Japanese example of “角膜 实质 炎” (jiăomó - shízhì - yán; keratitis parenchymatosa)

Japanese translation | J-to-E literal translation | Score | Log10 prob. | Hit count | Judgment
角膜 実質 炎 | kerato- parenchymatitis | 0.057 | -2.89 | 432 | OK
角膜 的 炎 | kerato- inflammation | 0.00457 | -3.34 | 10 |
角膜 物質 炎 | kerato- material inflammation | 0 | -2.24 | 0 |
角膜 物質 関節 | kerato- material joint | 0 | -2.49 | 0 |
角膜 実 炎 | kerato- real inflammation | 0 | -2.63 | 0 |
角膜 物質 性 | kerato- materiality | 0 | -2.66 | 0 |
角膜 材料 炎 | kerato- stuff inflammation | 0 | -2.66 | 0 |
角膜 物質 高安 | kerato- material high-low | 0 | -2.83 | 0 |
角膜 物質 胃腸 | kerato- material stomach | 0 | -2.87 | 0 |

Page 16: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Experimental results: Examples (2/2)

A J-to-C example of “発育 状態” (hatsuiku - jōtai; growth status)

Chinese translation | C-to-E literal translation | Score | Log10 prob. | Hit count | Judgment
的 状态 | state of | 7249 | -2.43 | 1960000 |
发展 状态 | development state | 6593 | -1.58 | 252000 |
发展 条件 | development condition | 6001 | -2.05 | 674000 |
的 条件 | condition of | 3159 | -2.90 | 2510000 |
发展 国家 | development country | 2715 | -2.57 | 998000 |
生长 状态 | growing state | 2688 | -1.51 | 87900 | OK
生长 条件 | growing condition | 2248 | -1.98 | 216000 |
增长 状态 | rising state | 1343 | -1.72 | 69800 | OK
开发 条件 | development condition | 1260 | -2.78 | 192000 |

Page 17: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Conclusion

• Alignment-based merging of two bilingual lexicons via a pivot language is proposed
• The alignment-based merging achieved a utilization ratio of at least 75% in our experiments
• The precision still remains at 0.14 (Japanese-to-Chinese) and 0.20 (Chinese-to-Japanese), which could be improved by a more sophisticated scoring method

Future directions
• To choose the correct translation by examining the context or semantic classes of source and target terms
• To evaluate a machine translation system with this lexicon integrated

Page 18: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Thank you for your attention

Acknowledgments
• MEXT, Japan
• Japan Science and Technology Agency (JST), Japan
• NICT, Japan
• Wanfang Data, China

Page 19: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Experimental Results

Our system could generate at least one Japanese translation for 73.4% (385,509/525,259) of the entries in the C-E lexicon.

Chinese input term: 传染性 肝炎 病毒
Japanese reference translation: (infectious hepatitis virus, 感染性肝炎ウイルス)

Japanese output | score
感染 性 肝炎 ウイルス | -8.29
感染 肝炎 ウィールス | -16.58
感染 肝炎 ウイルス | -16.60
感染 性 肝炎 ウイルス | -17.24
感染 性 肝炎 ウイルス | -17.42
伝染 性 肝炎 ウィールス | -17.63
伝染 性 肝炎 ウイルス | -17.65

Chinese input term: 大肠 杆菌 噬菌体
Japanese reference translation: (coliphage, 大腸菌ファージ)

Japanese output | score
大腸 菌 ファージ | -17.68
大腸 ファージ | -17.82
大腸 菌 型 ファージ | -18.48
大腸 菌 ファージ の | -18.88
大腸 菌 バクテリオファージ | -18.88
コリフォーム ファージ | -19.01
大腸 ファージ の | -19.02

Page 20: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Experimental Results

Chinese input term: 补码 形式
Japanese reference translation: (complement form, 補数形式)

Japanese output | score
補 形式 | -18.38
補 体 形式 | -18.47
補 形 | -18.63
補完 形式 | -18.68
補 体 形 | -18.72
追加 形式 | -18.81
補完 形 | -18.93
補助 形式 | -18.95
保健 形式 | -18.97
追加 形 | -19.05

Chinese input term: 声 延迟 线存 储器
Japanese reference translation: (acoustic delay line storage, 音響遅延線記憶装置)

Japanese output | score
音声 遅延 線 記憶 装置 | -17.15
音 遅延 線 記憶 装置 | -17.51
音声 遅延 記憶 装置 | -17.80
音響 遅延 線 記憶 装置 | -17.87
音 遅延 記憶 装置 | -18.16
音響 記憶 | -18.17
音響 遅延 線 記憶 装置 | -18.36
超 音波 遅延 線 記憶 装置 | -18.42
音響 貯蔵 | -18.50
音響 遅延 記憶 装置 | -18.52

Note: some outputs share the same characters as the input, but the meanings are not identical.

Page 21: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Manual evaluation

A human evaluator checked the translation results of 200 Chinese terms classified in the category of “Computer” by the C-E lexicon
• Terms that could be translated into Japanese: 181 (90.5%)
• Terms whose top-10 translations included the correct one: 135 (67.5%)
• Terms whose top translation was correct: 73 (36.5%)
• MRR (mean reciprocal rank) = 0.466
  The average of the inverse ranks of the highest-ranked correct translations

Terms whose top translation was correct
• 激光 存储器 电路 – laser memory circuit – レーザー メモリ 回路
• 虚拟 处理 – dummy treatment – 仮想 処理
• 综合 数字网 – integrated digital network – 総合 ディジタル 網

Terms whose top translation was incorrect / Terms that could not be translated
• 数 组 元素 – array element – 配列 元素
• 计算机 化 管理 学会 – ICM – 特 発 性 心筋 障害
• 信息量 – information content – 量
• 转镜 式激 光束 影像 记录 仪 – laser beam rotating mirror image recorder – (NO)

Page 26: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Conclusion

• We proposed a method that uses phrase-based SMT for constructing a J-C lexicon from J-E and C-E lexicons.
• We could obtain J translations for 73.4% of the items in the C-E lexicon, which outperformed “exact matching” (22.2%).
• 36.5% of the top J translations were correct, and 67.5% of the top-10 J translations included the correct one.
• We could apply this method to support the manual construction of bilingual dictionaries and use this lexicon for MT.

Future work
• Parameter optimization of SMT by using existing J-C lexicons
• Chinese character similarity that considers the similarity between individual characters
• More sophisticated reordering model (considering parts of speech)
• Other translation directions (EJ, JC, EC)

Page 27: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Acquisition of Translation Pairs of Technical Terms

• Large-scale translation dictionaries (lexicons) of technical terms are required for translating technical documents
• For constructing such dictionaries, we must ask experts who can deal with both languages
  • It requires huge costs
  • We must keep up with the rapid increase of new terms

Automatic acquisition of translation candidates of technical terms
• Support for constructing the dictionary
• Improvement of the performance of machine translation systems

Page 28: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

J-E bilingual lexicon

• 527,206 translation pairs
• Numbers of distinct terms: 465,565 J terms, 509,259 E terms

Japanese terms | English terms
“外装・内装”派 | "exterior ・ interior" fraction
(案) | (draft)
(案) | (plan)
(株) | Co.,Ltd.
(株) | Inc.
… | …
ころがり接触疲労 | rolling contact fatigue
ころがり損失 | rolling loss
ころがり対偶 | rolling pair
ころがり疲れ寿命 | rolling fatigue life

Page 29: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

C-E bilingual lexicon

Wanfang Data E-C & C-E Science and Technology Dictionary: 525,259 pairs

id | Chinese terms | English terms | Category
1 | ……的瞬时值 | Instantaneous… | 科技 (science and technology)
2 | Ⅰ-Ⅴ族化合物半导体 | group Ⅰ-Ⅴ compound semiconductor | 电子 (electronic)
3 | Ⅰ-Ⅵ族化合物半导体 | group I-VI compound semiconductor | 电子
4 | Ⅰ-Ⅶ族化合物半导体 | group Ⅰ-Ⅶ compound semiconductor | 电子
5 | ⅠA族化合物 | ⅠA compound | 无化 (inorganic chemistry)
525259 | 专利发明 | patent | 专利 (patent)

Page 30: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Construction of the C-J bilingual lexicon

Attach Japanese translations to each lexical item of the C-E lexicon.

Chinese terms | English terms | Japanese terms
……的瞬时值 | Instantaneous… | 瞬間…
Ⅰ-Ⅴ族化合物半导体 | group Ⅰ-Ⅴ compound semiconductor | Ⅰ-V族化合物半導体
Ⅰ-Ⅵ族化合物半导体 | group I-VI compound semiconductor | Ⅰ-Ⅵ族化合物半導体
Ⅰ-Ⅶ族化合物半导体 | group Ⅰ-Ⅶ compound semiconductor | Ⅰ-Ⅶ族化合物半導体
ⅠA族化合物 | ⅠA compound | ⅠA族化合物
专利发明 | patent | 特許

Page 31: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Overview of constructing J-C lexicon

• We regard the C-E and J-E lexicons as parallel corpora and use them as training data for constructing a J-C SMT system
  • Applying an SMT approach to the C-E and J-E lexicons makes word/phrase-level merging through English possible
• We apply C-J phrase-based SMT to the Chinese terms in the C-E lexicon
  • Statistical approaches seem to be effective because of the similarities in semantics and word order between C and J
  • It is easy to introduce other clues, such as Chinese character similarity

Page 32: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Collecting J-E & C-E translation phrase pairs

• Apply morphological analyzers, and obtain word alignments by GIZA++ (Och and Ney, 2003) for the J-E and C-E lexicons
• Collect phrase pairs by the “Grow-diag-final” method (using Moses, Koehn et al., 2007) and calculate the probabilities by relative frequencies

Example alignment: ころがり 疲れ 寿命 / rolling fatigue life

Japanese phrases | English phrases | p( e | j ) | p( j | e )
ころがり | rolling | 0.733 | 0.083
疲れ | fatigue | 0.973 | 0.503
寿命 | life | 0.565 | 0.210
ころがり 疲れ | rolling fatigue | 1 | 1
疲れ 寿命 | fatigue life | 1 | 0.545
ころがり 疲れ 寿命 | rolling fatigue life | 1 | 1

Page 33: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)

Japanese phrases | English phrases | p( e | j ) | p( j | e )
ころがり | rolling | 0.733 | 0.083
疲れ | fatigue | 0.973 | 0.503
寿命 | life | 0.565 | 0.210
ころがり 疲れ | rolling fatigue | 1 | 1
疲れ 寿命 | fatigue life | 1 | 0.545
ころがり 疲れ 寿命 | rolling fatigue life | 1 | 1

Chinese phrases | English phrases | p( e | c ) | p( c | e )
侧倾 | rolling | 0.182 | 0.029
横摇 | rolling | 0.5 | 0.014
… | … | … | …
疲乏 | fatigue | 1 | 0.011
… | … | … | …
疲劳 寿命 | fatigue life | 1 | 1

Page 34: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)

$$p(w_f \mid w_e) = \frac{1}{Z_e} \sum_{w_p} p(w_f \mid w_p)\, p(w_p \mid w_e), \qquad Z_e = \sum_{w_f} \sum_{w_p} p(w_f \mid w_p)\, p(w_p \mid w_e)$$

(Z_e is a normalization factor)

Japanese phrases | Chinese phrases | p( c | j ) | p( j | c )
ころがり | 侧倾 | … | 0.015
ころがり | 横摇 | … | 0.042
… | … | … | …
疲れ | 疲乏 | … | 0.297
… | … | … | …
疲れ 寿命 | 疲劳 寿命 | … | 0.545
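The merge with its normalization factor can be sketched as below, using the "rolling" rows of the J-E and C-E phrase tables as input. The function name and the optional `normalize` switch are assumptions. With normalization switched off, the products 0.083 × 0.182 and 0.083 × 0.5 come out at roughly 0.015 and 0.0415, close to the p( j | c ) values listed for ころがり.

```python
from collections import defaultdict


def merge_phrase_tables(p_f_given_p, p_p_given_e, normalize=True):
    """Sum p(w_f | w_p) * p(w_p | w_e) over pivots; optionally divide by Z_e."""
    raw = defaultdict(float)
    for (w_f, w_p), prob_fp in p_f_given_p.items():
        for (pivot, w_e), prob_pe in p_p_given_e.items():
            if pivot == w_p:
                raw[(w_f, w_e)] += prob_fp * prob_pe
    if not normalize:
        return dict(raw)

    # Z_e: total mass of all w_f reachable from a given w_e.
    z = defaultdict(float)
    for (_, w_e), value in raw.items():
        z[w_e] += value
    return {(w_f, w_e): value / z[w_e] for (w_f, w_e), value in raw.items()}


# Values copied from the phrase tables: p(j | e) and p(e | c) for "rolling".
p_j_given_e = {("ころがり", "rolling"): 0.083}
p_e_given_c = {("rolling", "侧倾"): 0.182, ("rolling", "横摇"): 0.5}

unnormalized = merge_phrase_tables(p_j_given_e, p_e_given_c, normalize=False)
normalized = merge_phrase_tables(p_j_given_e, p_e_given_c)
```

With only one Japanese phrase in this toy input, normalization trivially yields 1.0 for each Chinese phrase; on full phrase tables it rescales the sums so that they add up to one per target phrase.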

Page 35: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Features for learning of the log-linear model

$$\hat{w}_e = \operatorname*{arg\,max}_{w_e} \sum_{m=1}^{M} \lambda_m h_m(w_e, w_f)$$

We employ the following features h1-h4 for the log-linear model:

1. Phrase translation prob.

$$h_1(w_e, w_f) = \log \prod_i p\bigl(w_e^{(i)}, w_f^{(i)}\bigr)$$

   where $w_e^{(i)}, w_f^{(i)}$ are the i-th phrase pair for the translation

2. 3-gram language model of the target language

$$h_2(w_e, w_f) = \log p(w_e)$$

   where p(w_e) is a language model probability from other monolingual corpora

3. Phrase reordering penalty (Koehn et al., 2003)
4. Chinese character similarity (Zhang et al., 2005)
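Selecting the output under a log-linear model reduces to an argmax over a weighted feature sum. The sketch below uses two stand-in feature functions with invented values in place of h1 to h4; all names and weights are hypothetical.

```python
def best_candidate(candidates, source, features, weights):
    """Return the candidate maximizing sum_m lambda_m * h_m(candidate, source)."""
    def total(candidate):
        return sum(w * h(candidate, source) for w, h in zip(weights, features))
    return max(candidates, key=total)


# Hypothetical feature tables standing in for log-probability features.
TRANSLATION_LOGPROB = {"candidate_a": -1.0, "candidate_b": -2.0}
LANGUAGE_MODEL_LOGPROB = {"candidate_a": -5.0, "candidate_b": -1.0}


def h_translation(candidate, source):
    return TRANSLATION_LOGPROB[candidate]


def h_language_model(candidate, source):
    return LANGUAGE_MODEL_LOGPROB[candidate]


features = [h_translation, h_language_model]
candidates = ["candidate_a", "candidate_b"]
equal_weights_choice = best_candidate(candidates, "src", features, [1.0, 1.0])
low_lm_weight_choice = best_candidate(candidates, "src", features, [1.0, 0.1])
```

Changing the weights changes which candidate wins, which is why the weights are usually tuned on held-out data.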

Page 36: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Feature 3: Phrase reordering penalty (Koehn et al., 2003)

The feature value is the sum of penalties d defined by the following formula for the phrase pairs $w_e^{(i)}, w_f^{(i)}$:

$$d\bigl(w_e^{(i)}, w_f^{(i)}\bigr) = -\lvert a_i - b_{i-1} - 1 \rvert, \qquad h_3(w_e, w_f) = \sum_i d\bigl(w_e^{(i)}, w_f^{(i)}\bigr)$$

where $a_i$ is the position of the first word of $w_f^{(i)}$ and $b_{i-1}$ is the position of the last word of $w_f^{(i-1)}$, the source phrase translated in the previous step

[Figure: source words f1 … f8 aligned to the target phrases e1 e2, e3, e4, e5 e6]

d(e1 e2, f1 f2 f3) = 0
d(e3, f8) = -|8 - 3 - 1| = -4
d(e4, f6 f7) = -|6 - 8 - 1| = -3
d(e5 e6, f4 f5) = -|4 - 7 - 1| = -4
h3(e1…e6, f1…f8) = -11
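The worked example on this slide can be reproduced with a few lines of code. In this sketch each phrase is given as the (first, last) source positions it covers, listed in target order, with b_0 taken to be 0; the function name and the span representation are assumptions.

```python
def reordering_penalty(source_spans):
    """h3: sum of -|a_i - b_{i-1} - 1| over phrases listed in target order.

    source_spans holds (first, last) 1-based source positions for each phrase.
    """
    total = 0
    previous_last = 0  # b_0 = 0, so a phrase starting at position 1 costs nothing
    for first, last in source_spans:
        total -= abs(first - previous_last - 1)
        previous_last = last
    return total


# Source spans of f1 f2 f3, f8, f6 f7, f4 f5 in the order e1 e2, e3, e4, e5 e6.
spans = [(1, 3), (8, 8), (6, 7), (4, 5)]
h3 = reordering_penalty(spans)
```

The four terms are 0, -4, -3 and -4, which sum to the -11 given in the example.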

Page 37: Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Feature 4: Chinese character similarity

• Chinese and Japanese writing systems both have Chinese characters, and their similarity should be a powerful clue to derive the translation phrase pairs (Zhang et al., 2005)
• We define the feature value h4 between w_e and w_f as follows:

$$h_4(w_e, w_f) = 1 - \frac{\text{edit distance of Chinese characters between } w_e \text{ and } w_f}{\max(\text{number of characters in } w_e,\ \text{number of characters in } w_f)}$$

• Differences between the Chinese and Japanese forms of characters are ignored
• Example: h4(万歩計, 计步器) = h4(万歩計, 計歩器) = h4(ABC, CBD) = 1 - 2 / 3 = 0.333
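The similarity feature can be sketched with a standard Levenshtein edit distance over characters. The sketch assumes the Chinese and Japanese character forms have already been unified (the example passes 計歩器 instead of 计步器); the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_similarity(w_e, w_f):
    """h4 = 1 - edit distance / length of the longer string."""
    longest = max(len(w_e), len(w_f))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - edit_distance(w_e, w_f) / longest


# The slide's example after unifying character forms.
similarity = char_similarity("万歩計", "計歩器")
```

Two of the three characters differ, so the distance is 2 and the similarity is 1 - 2/3.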