
An Integrated Approach for Arabic-English Named Entity Translation

Hany Hassan, IBM Cairo Technology Development Center
Jeffrey Sorensen, IBM T.J. Watson Research Center

ACL 2005 Workshop (on Feature Engineering for Machine Learning in NLP)


Outline (1/2)

Named Entity (NE) translation is crucial for effective
– cross-language information retrieval
– machine translation

NEs (the focus is only on person names, location names and organization names)
– might be phonetically transliterated: person names
– might also be a mix of phonetic transliteration and semantic translation: location and organization names


Outline (2/2)

An integrated approach
– phrase-based translation
  advantage: frequently used NE phrases
  disadvantage: less frequent words
– word-based translation
  traditional statistical machine translation techniques such as IBM Model 1 (Brown et al., 1993)
  disadvantage: many-to-many phrase translations
– transliteration modules
  advantage: out-of-vocabulary, unknown words
  disadvantage: frequently used words

Aligning NEs across parallel corpora


Related Work (1/2)

Huang et al., 2003:
– used a bilingual dictionary to extract NE pairs and deployed it iteratively to extract more NEs.

Moore, 2003:
– relies on orthographic clues, so it is only suitable for language pairs with similar scripts and/or orthographic conventions


Related Work (2/2)

Arabic-related transliteration
– Arbabi et al., 1998: developed a hybrid neural network and knowledge-based system to generate multiple English spellings for Arabic person names.
– Stalls and Knight, 1998: Arabic-English back transliteration.
– Al-Onaizan and Knight, 2002: spelling-based model which directly maps English letter sequences into Arabic letter sequences.


Person names tend to be phonetically transliterated
– the idiomatic translation that has been established

For locations and organizations, the translation can be a mixture of translation and transliteration.


Data

A parallel corpus
– using NE identifiers similar to the systems described in (Florian et al., 2004) for NE detection.
– to separately acquire the phrases for the phrase-based system
– the translation matrix for the word-based system
– training data for the transliteration system


Translation and Transliteration Modules

Word Based NE Translation
– Basic multi-cost NE Alignment
– Multi-cost NE Alignment by Content Words Elimination

Phrase Based Named Entity Translation

Named Entity Transliteration

System Integration and Decoding


Basic multi-cost NE Alignment (1/3)

IBM Model 1 (Brown et al., 1993)

Huang et al., 2003
– multi-cost aligning approach
– The cost for aligning any source and target NE word is defined as:

Ed(we, wf): this phonetic-based edit distance employs an Editex-style (Zobel and Dart, 1996) distance measure



The Editex distance (d) between two letters a and b is: d(a,b) =
– 0 if both are identical
– 1 if they are in the same group
– 2 otherwise
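
The letter-level rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the letter groups are the commonly cited English Editex groups (the grouping used for romanized Arabic is not given on the slide), and the word-level function `phonetic_edit_distance` is a simplified edit distance with a fixed insertion/deletion cost of 2 rather than the full Editex recurrence.

```python
# Illustrative letter groups (assumption: classic English Editex groups,
# not necessarily the grouping used for romanized Arabic in the talk).
GROUPS = [set(g) for g in
          ("aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz")]


def letter_distance(a, b):
    """Return 0 for identical letters, 1 for same-group letters, 2 otherwise."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0
    if any(a in group and b in group for group in GROUPS):
        return 1
    return 2


def phonetic_edit_distance(s, t, indel_cost=2):
    """Simplified Editex-style distance between two words.

    Substitutions cost letter_distance(a, b); insertions and deletions
    cost a fixed indel_cost (a simplification assumed for this sketch).
    """
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [i * indel_cost]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + indel_cost,                  # delete cs
                cur[j - 1] + indel_cost,               # insert ct
                prev[j - 1] + letter_distance(cs, ct)  # substitute or match
            ))
        prev = cur
    return prev[-1]
```

With these groups, two spellings that differ only in a vowel choice (for example a final "y" versus "i") are penalized less than spellings that differ in unrelated consonants.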


Multi-cost NE Alignment by Content Words Elimination (1/2)

Content words might be aligned incorrectly to rare NE words

A two-phase alignment approach
– The first phase aligns the content words using a content-word-only translation matrix; the aligned content words are then removed
– The remaining words are aligned with the multi-cost alignment


Multi-cost NE Alignment by Content Words Elimination (2/2)

Example
– Wsi: the content words in the source sentence.
– NEsi: the Named Entity source words.
– Wti: the content words in the target sentence.
– NEti: the Named Entity target words.

Before content word elimination:
– Source: Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5
– Target: Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4

After the first phase:
– Source: NEs1 NEs2 Ws4 Ws5
– Target: Wt3 NEt1 NEt2 NEt3 NEt4
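
The first phase can be sketched as a greedy elimination step. This is only an illustration: the function name `eliminate_content_words`, the dictionary-based translation table, the threshold, and the specific pairs assumed to align (Ws1-Wt1, Ws2-Wt2, Ws3-Wt4) are all hypothetical choices made to reproduce the example above.

```python
def eliminate_content_words(source, target, content_table, threshold=0.5):
    """Remove source/target word pairs that the content-word table aligns.

    content_table maps (source_word, target_word) to a translation
    probability. Words left over are passed on to the multi-cost alignment.
    """
    remaining_target = list(target)
    remaining_source = []
    for s in source:
        best, best_p = None, threshold
        for t in remaining_target:
            p = content_table.get((s, t), 0.0)
            if p >= best_p:
                best, best_p = t, p
        if best is None:
            remaining_source.append(s)    # no confident content alignment
        else:
            remaining_target.remove(best)  # aligned pair is eliminated
    return remaining_source, remaining_target


# Hypothetical probabilities chosen to mirror the slide's example.
table = {("Ws1", "Wt1"): 0.9, ("Ws2", "Wt2"): 0.8, ("Ws3", "Wt4"): 0.7}
src = "Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5".split()
tgt = "Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4".split()
left_src, left_tgt = eliminate_content_words(src, tgt, table)
```

Content words without a confident counterpart (Ws4, Ws5, Wt3) stay in the reduced sentences, which matches the remaining words shown in the example.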


Phrase Based Named Entity Translation (1/2)

Uses the approach of Tillmann (2003) for block generation, with modifications suitable for NE phrase extraction.

A block is defined to be any pair of source and target phrases.

This approach starts from a word alignment generated by HMM Viterbi training (Vogel et al., 1996), which is done in both directions between source and target.
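
As a rough sketch of how blocks might be derived from two directional alignments: one common choice is to intersect the alignments and then collect source/target spans that are consistent with the remaining links. Whether the authors combine the directions exactly this way is not stated on the slide, so the function names, the intersection step, and the `max_len` limit are assumptions for illustration only.

```python
def intersect_alignments(src_to_tgt, tgt_to_src):
    """Keep only links found in both directions.

    src_to_tgt holds (source_index, target_index) pairs;
    tgt_to_src holds (target_index, source_index) pairs.
    """
    return src_to_tgt & {(i, j) for (j, i) in tgt_to_src}


def extract_blocks(alignment, src_len, max_len=3):
    """Return ((src_start, src_end), (tgt_start, tgt_end)) blocks.

    A block is kept when every link touching its target span also
    falls inside its source span.
    """
    blocks = []
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            tgt = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                blocks.append(((i1, i2), (j1, j2)))
    return blocks


# Toy alignments (hypothetical) for a two-word source phrase.
s2t = {(0, 0), (1, 1), (1, 2)}
t2s = {(0, 0), (1, 1), (2, 0)}
links = intersect_alignments(s2t, t2s)
blocks = extract_blocks(links, src_len=2)
```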


Phrase Based Named Entity Translation (2/2)


Named Entity Transliteration (1/3)

Transliteration handles Out Of Vocabulary (OOV) words that are not covered by the word-based or phrase-based models.

These source and target sequences form the blocks, which enables the modeling of vowel insertion.



Arabic name -> “Shoukry”

The system tries to model bi-grams from the source language to n-grams in the target language as follows:
– $k -> shouk
– kr -> kr
– ry -> ry
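
The mapping above can be illustrated with a small sketch. The block table is copied from the three pairs listed; the romanized source string "$kry" is inferred from those pairs, and the way overlapping target pieces are merged (dropping a repeated boundary letter) is an assumption made for this illustration, not a description of the actual decoder.

```python
def bigrams(s):
    """Split a romanized source name into overlapping letter bi-grams."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


# Block table taken from the example: source bi-gram -> target n-gram.
BLOCKS = {"$k": "shouk", "kr": "kr", "ry": "ry"}


def merge_overlapping(pieces):
    """Join target pieces, dropping a letter shared at each boundary.

    Assumption: consecutive source bi-grams overlap by one letter, so the
    matching target pieces are assumed to overlap by one letter as well.
    """
    out = pieces[0]
    for piece in pieces[1:]:
        out += piece[1:] if out[-1] == piece[0] else piece
    return out


source = "$kry"  # romanized form inferred from the bi-grams above
pieces = [BLOCKS[b] for b in bigrams(source)]
name = merge_overlapping(pieces)
```

The vowels "ou" appear only on the target side of the first block, which is how blocks of this kind can account for vowel insertion.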



Use the translation matrix from the word-based alignment models.
– Translations with probabilities below a certain threshold are filtered out.
– Pairs whose distance between the romanized Arabic and the English is greater than the threshold are also filtered out.
– The remaining highly confident name pairs are used to train a letter-to-letter translation matrix using HMM Viterbi training (Vogel et al., 1996).
– For a source block s and a target block t: P(t | s)
– A Weighted Finite State Transducer (WFST) translates any source sequence to a target sequence
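
The two filtering steps can be sketched as below. The thresholds, the sample pairs, and the function names are hypothetical, and a plain Levenshtein distance stands in for the phonetic distance, which the slides do not specify for this step.

```python
def levenshtein(a, b):
    """Plain edit distance (a stand-in for the phonetic distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def filter_name_pairs(pairs, min_prob=0.3, max_dist=3):
    """Keep (romanized_arabic, english) pairs that pass both filters."""
    return [(src, tgt) for (src, tgt, prob) in pairs
            if prob >= min_prob and levenshtein(src, tgt) <= max_dist]


# Hypothetical entries from a word-level translation matrix.
candidates = [
    ("shkry", "shoukry", 0.8),   # likely transliteration: kept
    ("shkry", "minister", 0.6),  # probable but too distant: dropped
    ("mhmd", "mohamed", 0.1),    # close but low probability: dropped
]
confident = filter_name_pairs(candidates)
```

Only the pairs that survive both filters would be used as training data for the letter-to-letter model.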


System Integration and Decoding

Uses a dynamic programming beam search decoder similar to the decoder described by Tillmann (2003).

Monolingual target data -> NE phrases
– The first language model is a trigram language model on NE phrases.
– The second language model is a class-based language model with a class for unknown NEs.
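
A trigram language model over NE phrases can be sketched with simple counts. This is a minimal maximum-likelihood illustration with no smoothing and no class-based component; the padding symbols, function names, and training phrases are assumptions.

```python
from collections import Counter


def train_trigram(phrases):
    """Count trigrams and their two-word contexts over NE phrases."""
    tri, ctx = Counter(), Counter()
    for phrase in phrases:
        tokens = ["<s>", "<s>"] + phrase.split() + ["</s>"]
        for i in range(2, len(tokens)):
            tri[tuple(tokens[i - 2:i + 1])] += 1
            ctx[tuple(tokens[i - 2:i])] += 1
    return tri, ctx


def trigram_prob(tri, ctx, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 if unseen."""
    seen = ctx[(w1, w2)]
    return tri[(w1, w2, w3)] / seen if seen else 0.0


# Hypothetical NE phrases taken from monolingual target data.
tri, ctx = train_trigram(["united nations", "united states"])
p_start = trigram_prob(tri, ctx, "<s>", "<s>", "united")
p_next = trigram_prob(tri, ctx, "<s>", "united", "nations")
```

A practical model would add smoothing, and a class-based model would map unseen names to a shared class token instead of assigning them zero probability.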


Experimental Setup (1/4)

Three NE categories: names of persons, organizations, and locations.

The system was trained on a news-domain parallel corpus containing 2.8 million Arabic words and 3.4 million English words.

Monolingual English data was annotated with NE types.


Experimental Setup (2/4)

A test set was manually constructed.

The BLEU score (Papineni et al., 2002) with a single reference translation was used for evaluation.

BLEU-3, which uses up to 3-grams, is used since a three-word phrase is a reasonable length for various NE types.
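
To make the metric concrete, here is a minimal sketch of BLEU-3 against a single reference: the geometric mean of modified 1- to 3-gram precisions multiplied by a brevity penalty. It scores one phrase pair at a time, whereas BLEU is normally accumulated over a whole test set, and it applies no smoothing, so candidates shorter than three words score 0. The function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu3(candidate, reference):
    """Unsmoothed BLEU with n-grams up to 3 and a single reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in (1, 2, 3):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(c.values())
        matched = sum(min(count, r[gram]) for gram, count in c.items())
        if total == 0 or matched == 0:
            return 0.0
        log_sum += math.log(matched / total) / 3
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)


exact = bleu3("united nations security council",
              "united nations security council")
partial = bleu3("a b c d", "a b c e")
```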


Conclusion and Future Work

We have presented an integrated system that can handle various NE categories and requires the regular parallel and monolingual corpora typically used to train any statistical machine translation system, along with an NE identifier.

We will evaluate the effect of the system on CLIR and MT tasks.
