
Page 1:

A Joint Source-Channel Model for Machine Transliteration

Li Haizhou, Zhang Min, Su Jian
Institute for Infocomm Research

21 Heng Mui Keng Terrace, Singapore 119613
{hli,sujian,mzhang}@i2r.a-star.edu.sg

ACL 2004

Page 2:

Abstract

This paper presents a new framework that allows direct orthographical mapping (DOM) between two different languages through a joint source-channel model, also called the n-gram transliteration model (n-gram TM).

An automated orthographic alignment process derives the aligned transliteration units from a bilingual dictionary, using a state-of-the-art machine learning algorithm.

Page 3:

Research Question

Out-of-vocabulary names: transliterating English names into Chinese is not straightforward. For example, /Lafontant/ is transliterated into 拉丰唐 (La-Feng-Tang) while /Constant/ becomes 康斯坦特 (Kang-Si-Tan-Te); the syllable /-tant/ in the two names is transliterated differently depending on each name's language of origin.

Page 4:

Research Question

For a given English name E as the observed channel output, one seeks a posteriori the most likely Chinese transliteration C that maximizes P(C|E).

– P(E|C): the probability of transliterating C into E through the noisy channel, also referred to as the transformation rules.
– P(C): the Chinese n-gram language model.

Likewise, in C2E transliteration:
– P(E): the English n-gram language model.

Decision rules:

E2C: C* = argmax_C P(C|E) = argmax_C P(E|C) P(C)
C2E: E* = argmax_E P(E|C) = argmax_E P(C|E) P(E)
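To make the decision rule concrete, here is a minimal Python sketch (not the paper's code) of noisy-channel E2C selection. The candidate set, channel scores and language-model scores are all invented for illustration.

```python
import math

# Minimal sketch: noisy-channel E2C selection.
# channel_logp[c] stands in for log P(E|C=c) for a fixed observed English name E;
# lm_logp[c] stands in for log P(C=c) from a Chinese n-gram language model.
# All numbers are made up for illustration.

def best_transliteration(channel_logp, lm_logp):
    """Return argmax_C [ log P(E|C) + log P(C) ]."""
    return max(channel_logp, key=lambda c: channel_logp[c] + lm_logp[c])

channel_logp = {"康斯坦特": math.log(0.020), "康斯坦": math.log(0.015)}
lm_logp      = {"康斯坦特": math.log(0.0008), "康斯坦": math.log(0.0002)}
print(best_transliteration(channel_logp, lm_logp))  # -> 康斯坦特
```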

Page 5:

Research Question

Phoneme-based approach: E -> intermediate phonemic representation P -> C
– P(C|E) = P(C|P) P(P|E) and P(E|C) = P(E|P) P(P|C), where P is the phonemic representation.
– For example: Chinese characters -> Pinyin -> IPA; English characters -> IPA.

Orthographic alignment
– For example: one-to-many mappings between transliteration units.
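As an illustration of how the phoneme-based chaining composes two models, the sketch below approximates P(C|E) by maximizing P(P|E)·P(C|P) over candidate phonemic strings. The phoneme strings and probability tables are toy values, not data from the paper.

```python
import math

# Illustration only: phoneme-based chaining P(C|E) = P(C|P) * P(P|E),
# approximated by a max over candidate phonemic strings P.

def phoneme_based_logp(e_name, c_name, p_given_e, c_given_p):
    """log P(C|E) ~= max over P of [ log P(P|E) + log P(C|P) ]."""
    scores = [
        math.log(pe) + math.log(c_given_p[p][c_name])
        for p, pe in p_given_e.get(e_name, {}).items()
        if c_name in c_given_p.get(p, {})
    ]
    return max(scores) if scores else float("-inf")

# Toy tables: English name -> phoneme string -> prob; phoneme string -> Chinese -> prob.
p_given_e = {"Constant": {"K AO N S T AH N T": 0.6, "K AA N S T AE N T": 0.3}}
c_given_p = {"K AO N S T AH N T": {"康斯坦特": 0.4},
             "K AA N S T AE N T": {"康斯坦特": 0.2}}
print(phoneme_based_logp("Constant", "康斯坦特", p_given_e, c_given_p))  # log(0.6 * 0.4)
```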

Page 6:

Research Question

[Figure: an English name and its Chinese name segmented into aligned transliteration units.]

Page 7:

Machine Transliteration - Review

Phoneme-based techniques
– Transformation-based learning algorithm (Meng et al., 2001; Jung et al., 2000; Virga & Khudanpur, 2003)
– Finite state transducers that implement transformation rules (Knight & Graehl, 1998)

Web-based approaches (SIGIR 2004)
– Using the Web for Automated Translation Extraction (Ying Zhang, Phil Vines)
– Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations (Wai Lam, Ruizhang Huang, Pik-Shan Cheung)
– Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien)

Page 8:

Joint Source-Channel Model

For K aligned transliteration units, we propose to estimate P(E, C) with a joint source-channel model.

– Suppose there exists an alignment γ with E = e_1, e_2, …, e_K and C = c_1, c_2, …, c_K, where each e_k is aligned to c_k. Then

  P(E, C) = P(E, C, γ) = ∏_{k=1..K} P(<e,c>_k | <e,c>_1, …, <e,c>_{k-1})
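The sketch below (my own illustration, not the paper's implementation) scores an aligned pair sequence with a bigram joint source-channel model; the segmentation and the bigram probabilities are invented, and the <s> start pair is an assumption of the sketch.

```python
import math

# Illustration only: log P(E, C) = sum_k log P(<e,c>_k | <e,c>_{k-1})
# for an aligned sequence of transliteration pairs, under a bigram model.

BOS = ("<s>", "<s>")  # assumed start-of-name pair

def joint_logprob(pairs, bigram, floor=1e-8):
    logp, prev = 0.0, BOS
    for pair in pairs:
        logp += math.log(bigram.get((prev, pair), floor))  # crude floor for unseen bigrams
        prev = pair
    return logp

aligned = [("la", "拉"), ("fon", "丰"), ("tant", "唐")]   # /Lafontant/ -> 拉丰唐
bigram = {(BOS, ("la", "拉")): 0.001,
          (("la", "拉"), ("fon", "丰")): 0.05,
          (("fon", "丰"), ("tant", "唐")): 0.02}
print(joint_logprob(aligned, bigram))
```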

Page 9:

Joint Source-Channel Model

A transliteration unit correspondence <e,c> is called a transliteration pair.

– E2C transliteration can be formulated as
    C* = argmax_C P(E, C)
– Similarly, the C2E back-transliteration as
    E* = argmax_E P(E, C)

An n-gram transliteration model is defined as the conditional probability, or transliteration probability, of a transliteration pair <e,c>_k depending on its immediate n predecessor pairs:
    P(<e,c>_k | <e,c>_{k-n+1}, …, <e,c>_{k-1})
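For concreteness, here is a minimal sketch of estimating a bigram transliteration model from an aligned corpus by maximum likelihood. The two aligned entries and the <s> start pair are assumptions of the sketch; real statistics come from the aligned bilingual dictionary.

```python
from collections import defaultdict

# Illustration only: MLE of a bigram transliteration model P(<e,c>_k | <e,c>_{k-1}).

BOS = ("<s>", "<s>")  # assumed start-of-name pair

def train_bigram_tm(aligned_corpus):
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for pairs in aligned_corpus:            # one aligned name per dictionary entry
        prev = BOS
        for pair in pairs:
            bigram_counts[(prev, pair)] += 1
            context_counts[prev] += 1
            prev = pair
    return {bg: n / context_counts[bg[0]] for bg, n in bigram_counts.items()}

corpus = [[("con", "康"), ("s", "斯"), ("tan", "坦"), ("t", "特")],   # /Constant/
          [("la", "拉"), ("fon", "丰"), ("tant", "唐")]]              # /Lafontant/
tm = train_bigram_tm(corpus)
print(tm[(("con", "康"), ("s", "斯"))])     # P(<s,斯> | <con,康>) = 1.0 in this tiny corpus
```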

Page 10:

Transliteration alignment

A bilingual dictionary contains entries mapping English names to their respective Chinese transliterations.

– To obtain these statistics, the bilingual dictionary needs to be aligned. Alignment follows a maximum likelihood approach, through the EM algorithm (see the sketch below).
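A rough sketch of the alignment idea follows. It simplifies the paper's procedure by using hard (Viterbi-style) counts and a unigram pair model; MAX_SPAN, the smoothing, and the toy entries are assumptions of this sketch.

```python
import math
from collections import defaultdict

# Rough sketch: each Chinese character aligns to an English substring of 1..MAX_SPAN
# letters; the best segmentation per entry is found by dynamic programming, then the
# pair counts are re-estimated and the process iterates.

MAX_SPAN = 4

def best_alignment(e, c, logp):
    """DP over states (i, j): first i letters of e aligned to first j characters of c."""
    best = {(0, 0): (0.0, None)}
    for j in range(1, len(c) + 1):
        for i in range(j, len(e) + 1):
            cands = []
            for span in range(1, MAX_SPAN + 1):
                prev = (i - span, j - 1)
                if prev in best:
                    pair = (e[i - span:i], c[j - 1])
                    cands.append((best[prev][0] + logp(pair), (prev, pair)))
            if cands:
                best[(i, j)] = max(cands)
    pairs, state = [], (len(e), len(c))          # backtrace the best path
    while state in best and best[state][1]:
        prev, pair = best[state][1]
        pairs.append(pair)
        state = prev
    return list(reversed(pairs))

def em_align(entries, iterations=8):             # the slides report convergence by round 8
    counts = defaultdict(float)
    for _ in range(iterations):
        total = sum(counts.values())
        def logp(pair):
            if total == 0:
                return 0.0                        # first pass: any valid segmentation
            return math.log((counts[pair] + 0.5) / (total + 0.5))  # light smoothing
        new_counts = defaultdict(float)
        for e, c in entries:                      # re-align every dictionary entry
            for pair in best_alignment(e, c, logp):
                new_counts[pair] += 1.0
        counts = new_counts
    return counts

entries = [("constant", "康斯坦特"), ("lafontant", "拉丰唐")]
print(sorted(em_align(entries).items(), key=lambda kv: -kv[1])[:6])
```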

Page 11:

Transliteration alignment

[Figure: an aligned dictionary entry illustrating the alignment γ between English and Chinese transliteration units.]

Page 12:

The experiments

Database: the bilingual dictionary “Chinese Transliteration of Foreign Personal Names”
– Xinhua News Agency; 37,694 unique English entries and their official Chinese transliterations.

The data are divided into 13 subsets for 13-fold open tests.
– Across the 13-fold open tests, error rates show less than 1% deviation.
– The EM iteration converges at the 8th round.

Page 13:

The experiments

For a test set W composed of V names:
– Each name has been aligned into a sequence of transliteration pair tokens.
– The cross-entropy H_p(W) of a model on the data W is defined as
    H_p(W) = -(1/W_T) log2 P(W), where P(W) = ∏_{i=1..V} P(E_i, C_i)
  and W_T is the total number of aligned transliteration pair tokens in the data W.
– The perplexity: PP_p(W) = 2^{H_p(W)}
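A minimal sketch of the two measures, assuming the per-token log2-probabilities of the aligned pairs under the model are already available; the probabilities below are invented.

```python
import math

# Illustration only: H_p(W) = -(1/W_T) * log2 P(W) and PP_p(W) = 2 ** H_p(W).
# pair_logps holds the log2-probability of every aligned pair token in W.

def cross_entropy_and_perplexity(pair_logps):
    w_t = len(pair_logps)                 # total number of aligned pair tokens in W
    h = -sum(pair_logps) / w_t            # log2 P(W) is the sum of per-token log-probs
    return h, 2.0 ** h

pair_logps = [math.log2(p) for p in (0.05, 0.2, 0.01, 0.1)]   # invented probabilities
h, pp = cross_entropy_and_perplexity(pair_logps)
print(f"cross-entropy = {h:.2f} bits, perplexity = {pp:.1f}")
```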

Page 14:

The experiments

Page 15:

The experiments

In the word error report, a word is counted as correct only if the transliteration exactly matches the reference.

The character error rate is the sum of deletion, insertion and substitution errors against the reference.
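A small sketch of counting character errors as Levenshtein edit operations between a hypothesis and the reference; whether and how to normalize the count into a rate is left to the evaluation setup.

```python
# Illustration only: deletions, insertions and substitutions via edit distance.

def character_errors(hyp, ref):
    """Levenshtein distance between hypothesis and reference strings."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(hyp)][len(ref)]

print(character_errors("康斯坦", "康斯坦特"))    # -> 1 (one missing character)
print(character_errors("康斯坦特", "康斯坦特"))  # -> 0 (exact match, also a correct word)
```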

Page 16:

The experiments

5,640 transliteration pairs are cross mappings between 3,683 English units and 374 Chinese units.
– Each English unit has, on average, 1.53 (= 5,640/3,683) Chinese correspondences.
– Each Chinese unit has, on average, 15.1 (= 5,640/374) English back-transliteration units.

Monolingual language perplexity is measured for Chinese and English independently, using n-gram language models.

Page 17:

The experiments

Page 18:

Discussions

Page 19:

Conclusions

In this paper, we propose a new framework, direct orthographical mapping (DOM), for transliteration. The n-gram TM is a successful realization of the DOM paradigm.

It generates probabilistic orthographic transformation rules using a data-driven approach. By skipping the intermediate phonemic interpretation, the transliteration error rate is reduced significantly.

Place and company names:
– /Victoria-Fall/ becomes 维多利亚瀑布 (Pinyin: Wei Duo Li Ya Pu Bu).
– The framework can also be easily extended to handle such name translation.

Page 20:

Ça va