An Integrated Approach for Arabic-English Named
Entity Translation
Hany Hassan, IBM Cairo Technology Development Center
Jeffrey Sorensen, IBM T.J. Watson Research Center
ACL 2005 Workshop (on Feature Engineering for Machine Learning in NLP)
Outline (1/2)
Named entity (NE) translation is crucial for effective:
– cross-language information retrieval
– machine translation
NEs (focusing only on person, location, and organization names):
– might be phonetically transliterated (person names)
– might also mix phonetic transliteration and semantic translation (location and organization names)
Outline (2/2)
An integrated approach:
– phrase-based translation
    advantage: handles frequently used NE phrases
    disadvantage: weak on less frequent words
– word-based translation
    traditional statistical machine translation techniques such as IBM Model 1 (Brown et al., 1993)
    disadvantage: cannot handle many-to-many phrase translations
– transliteration modules
    advantage: handles out-of-vocabulary, unknown words
    disadvantage: weak on frequently used words
Aligning NEs across parallel corpora
Related Work (1/2)
Huang et al., 2003:
– used a bilingual dictionary to extract NE pairs and deployed it iteratively to extract more NEs.
Moore, 2003:
– relies on orthographic clues, so it is only suitable for language pairs with similar scripts and/or orthographic conventions.
Related Work (2/2)
Arabic-related transliteration
– Arbabi et al., 1998: developed a hybrid neural network and knowledge-based system to generate multiple English spellings for Arabic person names.
– Stalls and Knight, 1998: Arabic-English back transliteration.
– Al-Onaizan and Knight, 2002: spelling-based model which directly maps English letter sequences into Arabic letter sequences.
Person names tend to be phonetically transliterated
– the idiomatic translation that has been established
For locations and organizations, the translation can be a mixture of translation and transliteration.
Data
A parallel corpus
– using NE identifiers similar to the systems described in (Florian et al., 2004) for NE detection
– to separately acquire the phrases for the phrase-based system
– the translation matrix for the word-based system
– training data for the transliteration system
Translation and Transliteration Modules
Word Based NE Translation
– Basic multi-cost NE Alignment
– Multi-cost NE Alignment by Content Words Elimination
Phrase Based Named Entity Translation
Named Entity Transliteration
System Integration and Decoding
Basic multi-cost NE Alignment (1/3)
IBM Model 1 (Brown et al., 1993)
Huang et al., 2003
– multi-cost aligning approach
– The cost for aligning any source and target NE word is defined as:
Ed(we, wf): this phonetic-based edit distance employs an Editex-style (Zobel and Dart, 1996) distance measure
Basic multi-cost NE Alignment (2/3)
The Editex distance d between two letters a and b is:
d(a, b) =
– 0 if both are identical
– 1 if they are in the same group
– 2 otherwise
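The letter-level cost above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the letter groups in `GROUPS` are an assumption loosely following Zobel and Dart (1996), and the flat insertion/deletion cost `INDEL_COST` is a simplification of the full Editex measure.

```python
# Assumed letter groups (illustrative; the groups used in the talk are not given here).
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]
INDEL_COST = 1  # assumption: flat cost for inserting or deleting a letter


def letter_distance(a, b):
    """d(a, b): 0 if identical, 1 if the letters share a group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in group and b in group for group in GROUPS):
        return 1
    return 2


def phonetic_edit_distance(x, y):
    """Edit distance between two words, using letter_distance as substitution cost."""
    # prev[j] holds the cost of turning the processed prefix of x into y[:j].
    prev = [j * INDEL_COST for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * INDEL_COST]
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + INDEL_COST,                 # delete a
                curr[j - 1] + INDEL_COST,             # insert b
                prev[j - 1] + letter_distance(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]
```

Under these assumed groups, `phonetic_edit_distance("bat", "pat")` costs 1 because b and p share a group, while `phonetic_edit_distance("bat", "kat")` costs 2.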
Basic multi-cost NE Alignment (3/3)
Multi-cost NE Alignment by Content Words Elimination (1/2)
Content words might be aligned incorrectly to rare NE words
A two-phase alignment approach
– The first phase aligns the content words using a content-word-only translation matrix; the aligned content words are then removed
– The remaining words are aligned using the multi-cost alignment
Multi-cost NE Alignment by Content Words Elimination (2/2)
Example
– Wsi: the content words in the source sentence.
– NEsi: the Named Entity source words.
– Wti: the content words in the target sentence.
– NEti: the Named Entity target words.
Before content-word elimination:
– Source: Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5
– Target: Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4
After content-word elimination:
– Source: NEs1 NEs2 Ws4 Ws5
– Target: Wt3 NEt1 NEt2 NEt3 NEt4
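The elimination phase in this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the `content_matrix` dictionary, its probabilities, and the `threshold` parameter are hypothetical stand-ins for the content-word-only translation matrix, and the greedy matching is a simplification.

```python
def eliminate_content_words(source, target, content_matrix, threshold=0.5):
    """First phase: drop content words that align confidently to each other.

    content_matrix maps (source_word, target_word) to a translation
    probability (hypothetical values; the real matrix comes from training).
    Returns the source and target words left for the multi-cost alignment.
    """
    remaining_target = list(target)
    remaining_source = []
    for s in source:
        best, best_p = None, threshold
        for t in remaining_target:
            p = content_matrix.get((s, t), 0.0)
            if p > best_p:
                best, best_p = t, p
        if best is None:
            remaining_source.append(s)     # no confident content-word match
        else:
            remaining_target.remove(best)  # aligned pair is eliminated
    return remaining_source, remaining_target


# The slide's example, with made-up probabilities for three content-word pairs.
source = "Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5".split()
target = "Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4".split()
content_matrix = {("Ws1", "Wt1"): 0.9, ("Ws2", "Wt2"): 0.8, ("Ws3", "Wt4"): 0.7}
left_source, left_target = eliminate_content_words(source, target, content_matrix)
```

With these assumed matrix entries, `left_source` and `left_target` match the reduced source and target sequences in the example.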
Phrase Based Named Entity Translation (1/2)
Uses the approach of Tillmann (2003) for block generation, with modifications suitable for NE phrase extraction.
A block is defined to be any pair of source and target phrases.
This approach starts from a word alignment generated by HMM Viterbi training (Vogel et al., 1996), which is done in both directions between source and target.
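As a rough sketch of how a block might be read off a bidirectional word alignment, the Python below intersects the two directions and projects a source span onto the target side. It is not Tillmann's block generation algorithm; the function names and the consistency check are assumptions made for illustration.

```python
def intersect_alignments(src_to_tgt, tgt_to_src):
    """Keep only links found in both alignment directions.

    src_to_tgt: set of (source_index, target_index) links.
    tgt_to_src: set of (target_index, source_index) links.
    """
    return src_to_tgt & {(i, j) for (j, i) in tgt_to_src}


def block_for_span(links, src_start, src_end):
    """Return ((src_start, src_end), (tgt_start, tgt_end)) or None.

    The block is rejected when a target word inside the projected span is
    linked to a source word outside the source span.
    """
    tgt_positions = [j for (i, j) in links if src_start <= i <= src_end]
    if not tgt_positions:
        return None
    lo, hi = min(tgt_positions), max(tgt_positions)
    for i, j in links:
        if lo <= j <= hi and not (src_start <= i <= src_end):
            return None
    return (src_start, src_end), (lo, hi)


# Toy alignments (hypothetical indices, not from the corpus in the talk).
s2t = {(0, 0), (1, 2), (2, 1), (3, 3)}
t2s = {(0, 0), (2, 1), (1, 2), (4, 3)}
links = intersect_alignments(s2t, t2s)
```

Here source words 1 and 2 map to target words 2 and 1, so `block_for_span(links, 1, 2)` yields a two-word block on each side.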
Phrase Based Named Entity Translation (2/2)
Named Entity Transliteration (1/3)
Handles out-of-vocabulary (OOV) words that are not covered by the word-based or phrase-based models.
Source and target letter sequences form the blocks, which enables the modeling of vowel insertion.
Named Entity Transliteration (2/3)
Arabic name -> “Shoukry”
The system tries to model bi-grams from the source language to n-grams in the target language as follows:
– $k -> shouk
– kr -> kr
– ry -> ry
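The bi-gram decomposition in this example can be sketched in Python. The romanized input `"$kry"` is an assumption inferred from the three bi-grams listed, and `BLOCKS` is a hypothetical table holding only the mappings from the example; how overlapping target blocks are merged into the final spelling is not covered here.

```python
def bigrams(romanized):
    """Split a romanized source name into overlapping letter bi-grams."""
    return [romanized[i:i + 2] for i in range(len(romanized) - 1)]


# Hypothetical block table containing only the mappings from the example.
BLOCKS = {"$k": ["shouk"], "kr": ["kr"], "ry": ["ry"]}


def candidate_blocks(romanized, blocks):
    """Pair each source bi-gram with its known target n-grams (empty if unseen)."""
    return [(bg, blocks.get(bg, [])) for bg in bigrams(romanized)]
```

For the assumed input, `bigrams("$kry")` gives the three source bi-grams listed in the example, and `candidate_blocks("$kry", BLOCKS)` pairs each with its target n-gram.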
Named Entity Transliteration (3/3)
Use the translation matrix from the word-based alignment models.
– Translations with probabilities less than a certain threshold are filtered out.
– Pairs whose distance between the romanized Arabic and the English is greater than the threshold are also filtered out.
– The remaining highly confident name pairs are used to train a letter-to-letter translation matrix using HMM Viterbi training (Vogel et al., 1996).
– For a source block s and a target block t, the probability P(t | s)
– A Weighted Finite State Transducer (WFST) for translating any source sequence to a target sequence
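The two filtering steps can be sketched in Python. This is a minimal sketch under assumptions: plain Levenshtein distance stands in for the distance between romanized Arabic and English, and `min_prob` and `max_distance` are hypothetical parameter names with no values taken from the talk.

```python
def levenshtein(x, y):
    """Plain edit distance with unit insertion, deletion, and substitution costs."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def filter_name_pairs(pairs, min_prob, max_distance):
    """Keep (romanized_arabic, english) pairs that pass both filters.

    pairs: iterable of (romanized_arabic, english, translation_probability).
    """
    kept = []
    for romanized_arabic, english, prob in pairs:
        if prob < min_prob:
            continue  # low translation probability
        if levenshtein(romanized_arabic, english.lower()) > max_distance:
            continue  # spellings too far apart to be a transliteration
        kept.append((romanized_arabic, english))
    return kept
```

The pairs that survive would then serve as training data for the letter-to-letter translation matrix.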
System Integration and Decoding
Uses a dynamic programming beam search decoder similar to the decoder described by Tillmann (2003).
Monolingual target data -> NE phrases
– The first language model is a trigram language model on NE phrases.
– The second language model is a class-based language model with a class for unknown NEs.
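To illustrate the general idea of beam search over translation candidates scored with a language model, here is a heavily simplified Python sketch. It is monotone (no reordering), uses a toy bigram scorer instead of the trigram and class-based models, and all names and probabilities are hypothetical; it does not reproduce the decoder described in the talk.

```python
import math


def beam_search(candidates, lm_logprob, beam_size=3):
    """Monotone beam search.

    candidates: one list per source position of (target_word, log_prob) options,
                e.g. pooled from phrase, word, and transliteration modules.
    lm_logprob: function (history_tuple, word) -> log probability.
    Returns the best (target_words, score) hypothesis.
    """
    beam = [((), 0.0)]
    for options in candidates:
        expanded = []
        for history, score in beam:
            for word, tm_score in options:
                new_score = score + tm_score + lm_logprob(history, word)
                expanded.append((history + (word,), new_score))
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = expanded[:beam_size]  # prune to the best hypotheses
    return beam[0]


# Toy bigram language model with made-up probabilities.
BIGRAMS = {("<s>", "united"): 0.5, ("united", "nations"): 0.6, ("united", "states"): 0.3}


def toy_lm(history, word):
    prev = history[-1] if history else "<s>"
    return math.log(BIGRAMS.get((prev, word), 0.01))


candidates = [
    [("united", math.log(0.6)), ("unified", math.log(0.4))],
    [("nations", math.log(0.5)), ("states", math.log(0.5))],
]
best_words, best_score = beam_search(candidates, toy_lm)
```

In this toy setup the language model breaks the tie between the two equally likely second words, so `best_words` is `("united", "nations")`.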
Experimental Setup (1/4)
Three NE categories: names of persons, organizations, and locations.
The system was trained on a news-domain parallel corpus containing 2.8 million Arabic words and 3.4 million English words.
Monolingual English data was annotated with NE types.
Experimental Setup (2/4)
A test set was constructed manually.
The BLEU score (Papineni et al., 2002) with a single reference translation was used for evaluation.
BLEU-3, which uses up to 3-grams, is used since a three-word phrase is a reasonable length for various NE types.
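A BLEU-3 computation against a single reference can be sketched in Python as below. This is a sentence-level simplification written for illustration (the function name `bleu3` and whitespace tokenization are assumptions); it is not the evaluation script used in the experiments.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu3(hypothesis, reference):
    """BLEU with n-grams up to 3, one reference, uniform weights."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in (1, 2, 3):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: each n-gram counts at most as often as in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for hypotheses no longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / 3)
```

Note that this sentence-level form scores any hypothesis shorter than three words as zero; aggregating n-gram counts over the whole test set before taking precisions would avoid that for short NEs.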
Experimental Setup (3/4)
Experimental Setup (4/4)
Conclusion and Future Work
We have presented an integrated system that can handle various NE categories. It requires the regular parallel and monolingual corpora that are typically used to train any statistical machine translation system, along with an NE identifier.
We will evaluate the effect of the system on CLIR and MT tasks.