rapid development of translation tools: application to persian and

5

Upload: vucong

Post on 11-Jan-2017

218 views

Category:

Documents


1 download

TRANSCRIPT

Rapid Development of Translation Tools:Application to Persian and Turkish

Jan W. Amtrup, Karine Megerdoomian and R�emi ZajacComputing Research Lab

New Mexico State Universityfjamtrup,karine,[email protected]

Abstract

The Computing Research laboratory (CRL) is devel-oping a machine translation toolkit that allows rapiddeployment of translation capabilities. This toolkithas been used to develop several machine translationsystems, including a Persian-English and a Turkish-English system, which will be demonstrated. Wepresent the architecture of these systems as well asthe development methodology.

1 Introduction

At CRL, one of the major research topics is the de-velopment and deployment of machine translationsystems for low-density languages in a short amountof time. As the availability of knowledge sourcessuitable for automatic processing in those languages(e.g. Persian, Turkish, Serbo-Croatian) is usuallyscarce, the systems developed have to assist the ac-quisition process in an incremental fashion, startingout from low-level translation on a word-for-wordbasis and gradually extending to the incorporationof syntactic and world knowledge.The tasks and requirements for a machine trans-

lation environment that supports linguists with thenecessary tools to develop and debug increasinglycomplex knowledge about a speci�c language in-clude:

� The development of a bilingual dictionary thatis used for initial basic translation and can fur-ther be utilized in the more complex translationsystem stages.

� Methods to describe and process morpholog-ically rich languages, either by integratingalready existing processors or by developinga morphological processor within the systemframework.

� Glossing a text to ensure the correctness of mor-phological analysis and the completeness of thedictionary for a given corpus.

� Processors and grammar development tools forthe syntactic analysis of the source language.

� In order to allow rapid development cycles, thetranslation system itself has to be reasonably

fast and provide the user with a rich envi-ronment for debugging of linguistic data andknowledge.

� The system used for development must be con-�gurable for a large variety of tasks that emergeduring the development process.

We have developed a component-based transla-tion system that meets all of the criteria men-tioned above. In the next section, we will describethe architecture of the system MEAT (MultilingualEnvironment for Advanced Translations) , which isused to translate between a number of languages(Persian, Turkish, Arabic, Japanese, Korean, Rus-sian, Serbo-Croatian and Spanish) and English. Inthe following sections, we will describe the generaldevelopment cycle for a new language, going intomore detail for two such languages, Persian andTurkish.

2 General architecture

MEAT is a publicly available environment1 that as-sists a linguist in rapidly developing a machine trans-lation system. In order to keep the overhead in-volved in learning and using the system as low aspossible, the linguist uses use simple yet powerful ba-sic data and control structures. These structures areoriented towards contemporary linguistic and com-putational linguistic theories.In MEAT, linguistic knowledge is entirely repre-

sented using Typed Feature Structures (TFS) (Car-penter, 1992; Zajac, 1992), the most widely used rep-resentational formalism today. We developed a fastimplementation of Typed Feature Structures withappropriateness, based on an abstract machine view(cf. Carpenter and Qu (1995), Wintner and Francez(1995). Bilingual dictionary entries as well as allkinds of rules (morphology, syntax, transfer, gener-ation) are expressed as feature structures of speci�ctypes, so only one description language has to bemastered. This usually leads to a rapid familiaritywith the system, yielding a high productivity almostfrom the start.

1http://crl.nmsu.edu/�jamtrup/Meat

Runtime linguistic objects (words, syntactic struc-tures etc.) are stored in a central data structure.We use an extension of the well-known concept ofa chart (Kay, 1973) to hold all temporary and �-nal results. As multiple components have to processdi�erent origins of data, the chart is equipped withseveral di�erent layers, each of which denotes a spe-ci�c aspect of processing. Thus, at every point dur-ing the runtime of the system, the contents of thechart re ect what operations have been performedso far (Amtrup, 1999). The complete chart is avail-able to the user using a graphical interface. Thischart browser can be used to exactly trace why aspeci�c solution was produced, a signi�cant aid indeveloping and debugging grammars.MEAT addresses the necessity of carrying out sev-

eral di�erent tasks by providing a component-basedarchitecture. The core of the system consists of theformalism and the chart data representation. Allprocessing components are implemented in the formof plug-ins, components that obey a small interfaceto communicate with the main application. Thechoice of which components to apply to an actualinput, the order in which the components are pro-cessed, and individual parameters for componentscan be speci�ed by the user, allowing for a highly exible way of con�guring a machine translation (orword-lookup, of glossing etc.) system (cf. Amtrupet al. (2000) for a more detailed description of thearchitecture).The MEAT system is completely implemented in

C++, resulting in a relatively fast mode of op-eration. The implementation of the TFS formal-ism supports between 3000 and 4500 uni�cationsper second, depending on the application it is usedin. Translating an average length sentence (20-25words) takes about 3.5 seconds on a Pentium PII400(in non-optimized debug mode). The system sup-ports Unix (tested on Solaris and Linux) and Win-dows95/98/NT.We use Unicode to represent charac-ter data, as we face translations of several di�erent,non-European languages with a variety of scripts.

3 Development cycle

One of the main requirement facilitating the deploy-ment of a new language with possibly scarce pre-existing resources is the ability to incrementally de-velop knowledge sources and translation capability(the incremental approach to MT development is de-scribed in (Zajac, 1999)). In the case of translationsystems at our laboratory, we mostly translate intoEnglish. Thus, a complete set of English resources isalready available (dictionary, generation grammarsand morphological generation) and does not need tobe developed.The �rst step in bootstrapping a running system

is to build a bilingual dictionary. The work on the

dictionary usually continues throughout the devel-opment process of higher level knowledge sources.We use dictionaries where entries are encoded as atfeature-value pairs, as shown in Figure 1.

$Headword @J|yb hftg|nh$Category Noun$Number Plural$Regular False$English Seven Wonder<nc plur>;$$

Figure 1: A Persian-English dictionary entry.

While this is already enough information to facil-itate a basic word-for-word translation, in generala morphological analyzer for the source language isneeded to translate real-world text. For MEAT, onecan either import the results of an existing morpho-logical analyzer, or use the native description lan-guage, based on a �nite-state transducer using char-acters as left projections and typed feature struc-tures as right projections (Zajac, 1998). After com-pleting the morphological analysis of the source lan-guage and specifying the mapping of lexical featuresto English, glossing is available. The Glosser is anMEAT application consisting of morphological anal-ysis of source language words, followed by dictionarylookup for single words and compounds, and thetranslation into English in ected word forms. Anexample of the interface for the glosser is shown inFigure 2.

Figure 2: The Glosser interface

The next step in developing a medium-quality,broad coverage translation system is to develop

Figure 3: Viewing a complex analysis

knowledge sources for the structural analysis of in-put sentences. MEAT supports the use of modu-lar uni�cation grammars, which facilitates develop-ment and debugging. Each grammar module can bedeveloped and tested in isolation, the �nal systemapplying each grammar in a linear fashion (Zajacand Amtrup, 2000). The main component used is abidirectional island-parser (cf. Stock et al. (1988))for uni�cation-based grammars. The grammar rulesare usually written in the style of context-free ruleswith associated uni�cation constraints as shown inFigure 4. The rules allow for the speci�cation ofthe right-hand side as a regular-expression of featurestructures. We plan to add more restricted types ofgrammars (e.g. based on �nite-state transducers) togive the linguist a richer choice of syntactic processesto choose from.

For the time being, the transfer capabilities of thesystem are restricted to lexical transfer, as we havenot �nished the implementation of a complex trans-fer module. Thus, the grammar developer eitherneeds to create structural descriptions that matchthe English generation, or the English generationgrammar has to be modi�ed for each language.

At each point during the development of a trans-

Adj' = tur.Rule[lhs: tur.AdjBar[

lexHead: #adj,spec: #adv],

rhs: <: #adv= tur.AdvEntry "?"#adj= tur.AdjEntry

:>];

Figure 4: A Turkish syntax rule

lation system, we consider it essential to be ablenot only to see the results, but also to monitorthe processing history of a result. Thus, the chartthat leads to the construction of an English outputcan be viewed in order to examine all intermediateconstructions. In the MEAT system, each modulerecords various steps of computations in the chartwhich can be inspected statically after processing. Auni�ed data interface for all modules in the systemallows both the inspection of recorded internal datastructures for each module (when it makes sense,

such as in a chart parser), and the inspection of theinput/output of all modules. The graphical interfaceused to view complex analyses is shown in Figure 3.

4 Applications

In this section, we give an overview of the capabili-ties of MEAT using two current examples from workat our laboratory. In the Shiraz project2 (Amtrupet al., 2000), we developed a machine translationsystem from Farsi to English, for which no previousknowledge sources were available. We mainly targetnews material and the translation of web pages. TheTurkish-English system has been developed with theExpedition project3, an enterprise for the rapid de-velopment of MT systems for low-density languages.Both systems use a common user interface for ac-

cess to MEAT, which is shown in Figure 5. TheMT systems are targeted to the translation of news-paper text and other sources available online (e.g.web pages). The emphasis is therefore put on exten-sive coverage rather than very-high quality transla-tion. Currently, we reach for both systems a levelof quality that allows to assess in detail the contentof source texts, at the expense of some unfelicitousEnglish.

Figure 5: The MEAT user interface

4.1 Persian-English MT

The input for the Persian-English system is usu-ally taken from web pages (on-line news articles),although plain text can be handled as well. The in-ternal encoding is Unicode, and various codeset con-verters are available; we also developed an ASCII-based transliteration to facilitate the easy acquisi-tion of dictionaries and grammars (see Figure 1).The dictionary consists of approximately 50,000 en-tries, single words as well as multi-word compounds.Additionally, we utilize a multi-lingual onomasticonmaintained locally to identify proper names.

2http://crl.nmsu.edu/shiraz3http://crl.nmsu.edu/expedition

We developed a morphological grammar for Per-sian using a uni�cation-based formalism (Zajac,1998). A sample rule is shown in Figure 6 (cf.Megerdoomian (2000) for a more thorough descrip-tion of the Persian morphological analyzer).

PresentStem = <RegularPresentStem< < <"|n" "y"?>

per.Verbal[infl.causative: True]> |< per.Verbal[infl.causative: False]>

>>;

Figure 6: A Persian morphological rule

The knowledge sources for syntax were (manually)developed using a corpus of 3,000 tagged and brack-eted sentences extracted from a 10MB corpus of Per-sian news articles. We use three grammars, respon-sible for the attachment of auxiliaries to main verbs,the recognition and processing of light verb phenom-ena, and phrasal and sentential syntax, respectively.The combined size is about 110 rules. The develop-ment of the Persian resources took several months,primarily due to the fact that the translation systemwas developed in parallel to the linguistic knowl-edge. The Persian resources were developed by ateam of one computational linguist (morphologicaland syntactic grammars, overall supervision for lan-guage resources), and 3 lexicographers (dictionaryand corpus annotation).

4.2 Turkish-English MT

Within yet another project (Expedition), we devel-oped a machine translation system for Turkish. Thisapplication functioned as a benchmark on how muche�ort the building of a medium-quality system re-quires, given that an appropriate framework is al-ready available.For Turkish, we use a pre-existing morphologi-

cal analyzer (O azer, 1994). Turkish shows a richderivational and in ectional morphology, which ac-counts for most of the system development workthat was necessary to build a wrapper for inte-grating the Turkish morphological analyzer in thesystem (approximatly 60 person-hours)4. The de-velopment of the Turkish syntactic grammars tookaround 100 person-hours, resulting in 85 uni�cation-based phrase structure rules describing the basics ofTurkish syntax. The development of the bilingualTurkish-English dictionary had been going on for

4The main problem during the adaptation was the treat-

ment of derivational information, where the morphologicallyanalyzed input does not conform exactly to the contents ofthe dictionary

some time prior to the application of MEAT, andcurrently contains approximately 43,000 headwords.

5 Conclusion

The development of automatic machine translationsystems for several languages is a complicated task,even more so if it has to be done in a short amountof time. We have shown how the availability of amachine translation environment based on contem-porary computational linguistic theories and usinga sound system design aids a linguist in buildinga complete system, starting out with relatively sim-ple tasks as word-for-word translation and incremen-tally increasing the complexity and capabilities ofthe system. Using the current library of modules, itis possible to achieve a level of quality on par withthe best transfer-based MT systems. Some of thestrong points of the system are: (1) The system canbe used by a linguist with reasonable knowledge ofcomputational linguistics and does not require spe-ci�c programming skills. (2) It can be easily con-�gured for building a variety of applications, includ-ing a complete MT system, by reusing a library ofgeneric modular components. (3) It supports an in-cremental development methodology which allows todevelop a system in a step-wise fashion and enablesto deliver running systems early in the developmentcycle. (4) Based on the experiences in building theMT systems mentioned in the paper, we estimatethat a team of one linguist and three lexicographerscan build a basic transfer-based MT system withmedium-size coverage (dictionary of 50,000 head-words, all most frequent syntactic constructions inthe language) in less than a year.

One major improvement of the system would bea more integrated test and debug cycle linked to acorpus and to a database of test items. Although ex-isting testing methodologies and tools could be used(Dauphin E., 1996; Judith Klein and Wegst, 1998),building test sets is a rather time-consuming taskand some new approach to testing supporting rapiddevelopment of MT systems with an emphasis onwide coverage would be needed.

References

Jan W. Amtrup, Hamid Mansouri Rad, KarineMegerdoomian, and R�emi Zajac. 2000. Persian-English Machine Translation: An Overview ofthe Shiraz Project. Memoranda in Computer andCognitive Science MCCS-00-319, Computing Re-search Laboratory, New Mexico State University,Las Cruces, NM, April.

Jan W. Amtrup. 1999. Incremental Speech Transla-tion. Number 1735 in Lecture Notes in Arti�cialIntelligence. Springer Verlag, Berlin, Heidelberg,New York.

Bob Carpenter and Yan Qu. 1995. An AbstractMachine for Attribute-Value Logics. In Proceed-ings of the 4th International Workshop on Pars-ing Technologies (IWPT95), pages 59{70, Prague.Charles University.

Bob Carpenter. 1992. The Logic of Typed FeatureStructures. Tracts in Theoretical Computer Sci-ence. Cambridge University Press, Cambridge.

Lux V. Dauphin E. 1996. Corpus-Based annotatedTest Set for Machine Translation Evaluation byan Industrial User. In Proceedings of the 16thInternational Conference on Computational Lin-guistics, COLING-96, pages 1061{1065, Centerfor Sprogteknologi, Copenhagen, Denmark, Au-gust 5-9.

Klaus Netter Judith Klein, Sabine Lehmann andTillmanWegst. 1998. DIET in the context of MTevaluation. In Proceedings of KONVENS-98.

Martin Kay. 1973. The MIND System. InR. Rustin, editor, Natural Language Processing,pages 155{188. Algorithmic Press, New York.

Karine Megerdoomian. 2000. Uni�cation-BasedPersian Morphology. In Proceedings of the CI-Cling 2000, Mexico City, Mexico, February.

Kemal O azer. 1994. Two-level Description ofTurkish Morphology. Literary and LinguisticComputing, 9(2).

Oliviero Stock, Rino Falcone, and Patrizia Insin-namo. 1988. Island Parsing and BidirectionalCharts. In Proc. of the 12 th COLING, pages 636{641, Budapest, Hungary, August.

Shuly Wintner and Nissim Francez. 1995. Pars-ing with Typed Feature Structures. In Proceed-ings of the 4th International Workshop on ParsingTechnologies (IWPT95), pages 273{287, Prague,September. Charles University.

R�emi Zajac and Jan W. Amtrup. 2000. ModularUni�cation-Based Parsers. In Proc. Sixth Interna-tional Workshop on Parsing Technologies, trento,Italy, February.

R�emi Zajac. 1992. Inheritance and Constraint-Based Grammar Formalisms. Computational Lin-guistics, 18(2):159{182.

R�emi Zajac. 1998. Feature Structures, Uni�cationand Finite-State Transducers. In FSMNLP'98,International Workshop on Finite State Methodsin Natural Language Processing, Ankara, Turkey,June.

Remi Zajac. 1999. A Multilevel Framework for In-cremental Development of MT Systems. In Ma-chine Translation Summit VII, pages 131{157,University of Singapore, September 13-17.