
FIRSTPROOF

a0005 Machine Translation, Interlingual Methods

B Dorr, UMIACS, College Park, MD, USA

E Hovy, University of Southern California, Los Angeles, CA, USA

L Levin, Carnegie Mellon University, Pittsburgh, PA, USA

© 2006 Elsevier Ltd. All rights reserved.

s0005 Introduction

p0005 As described in the article on Machine Translation, Overview (00936), machine translation (MT) methodologies are commonly categorized as direct, transfer, and interlingual. The methodologies differ in the depth of analysis of the source language and the extent to which they attempt to reach a language-independent representation of meaning or intent between the source and target languages. Interlingual MT typically involves the deepest analysis of the source language.

p0010 Figure 1, the Vauquois triangle (Vauquois, 1968), illustrates these levels of analysis. Starting with the shallowest level at the bottom, direct transfer is made at the word level. Moving upward through syntactic and semantic transfer approaches, the translation occurs on representations of the source sentence structure and meaning, respectively. Finally, at the interlingual level, the notion of transfer is replaced with a single underlying representation, the interlingua, that represents both the source and target texts simultaneously. Moving up the triangle reduces the amount of work required to traverse the gap between languages, at the cost of increasing the required amount of analysis (to convert the source input into a suitable pretransfer representation) and synthesis (to convert the post-transfer representation into the final target surface form). For example, at the base of the triangle, languages can differ significantly in word order, requiring many permutations to achieve a good translation. However, a syntactic dependency structure expressing the source text may be converted more easily into a dependency structure for the target equivalent because the grammatical relations (subject, object, modifier) may be shared despite word order differences. Going further, a semantic representation (interlingua) for the source language may totally abstract away from the syntax of the language, so that it can be used as the basis for the target language sentence without change.

f0005 Figure 1 The Vauquois Triangle for MT.

Article Number: LALI: 00939

p0015 Comparing the effort required to move up and down the sides of the triangle to the effort to perform transfer, interlingual MT may be more desirable in some situations than in others. Because in principle an interlingual representation of a sentence contains sufficient information to allow generation in any language, the more (and the more different) target languages there are, the more valuable an interlingua becomes. To translate from one source into N target languages, one needs (1 + N) steps using an interlingua compared to N steps of transfer (one to each target). But to translate pairwise among all the languages, one needs only 2N steps using an interlingua compared to about N² with transfer, a significant reduction for the former. In addition, since in theory it is not necessary to consider the properties of any other language during the analysis of the source language or generation of the target language, each analyzer and generator can be built independently by a monolingual development team. Each system developer only needs to be familiar with his/her language and the interlingua.
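The component counts in this comparison can be checked with a short calculation. The following is a minimal sketch in Python; the function names are illustrative, and the transfer count assumes one transfer component per ordered language pair, N(N − 1), which is the "about N²" figure.

```python
def interlingua_modules(n: int) -> int:
    """One analyzer plus one generator per language: 2N components."""
    return 2 * n


def transfer_modules(n: int) -> int:
    """One transfer component per ordered language pair: N(N - 1)."""
    return n * (n - 1)


# Compare the two counts for a few values of N.
for n in (2, 3, 5, 10):
    print(n, interlingua_modules(n), transfer_modules(n))
```

Under this count the two approaches break even at three languages, and the interlingua needs fewer components from four languages onward.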

p0020 Another advantage of the interlingua approach is that interlingual representations can be used by NLP systems for other multilingual applications, such as cross-lingual information retrieval, summarization, and question answering (see Figure 2). For example, it is a basic assumption of the semantic web that webpages will contain not only source text but also some interlingual representations thereof, against which queries issued in other languages and translated into the interlingua can be matched, and from which various target-language versions of the webpages can be generated. In all of these applications, there is a reduction in computation over approaches that tailor the underlying representation to the idiosyncrasies of each of the input/output languages. Without an interlingual representation, all these multilingual applications require the insertion of a translation step at least once and often in two different places.

p0025 Although interlinguas are a topic of recurring interest, only one interlingual MT system, KANT (Nyberg and Mitamura, 2000), has ever been made operational in a commercial setting, and only a handful have actually been taken beyond the stage of a research prototype. Notable research prototypes include Pangloss (Frederking et al., 1994), CICC, NESPOLE!, and ChinMT (Habash et al., 2003).

s0010 Interlingua Definition and Components

p0030 An interlingua is a system for representing the meanings and communicative intentions of language. It can be defined as a triple (S, N, L) where:

• S is a collection of representation symbols, often defined in an ontology, where each symbol denotes a particular aspect of meaning or intention (sometimes individually, and sometimes in concert with others according to specific rules of combination).

• N is a notation, within which symbols can be composed into meanings. The rules governing notational well-formedness reflect the compositional derivation of complex meaning out of “atomic” symbols, an operation that is basic to the theory of the interlingua.

• L is a lexicon, namely a collection of words of a human language such as English, in which each lexical element is associated directly or indirectly with one or more symbols from S. Interlingual MT systems typically include one lexicon for each language.

p0035 An interlingua instance is the representation of the meaning of a given fragment of text, such as a clause, sentence, or document. Such an instance is often written as a list of interconnected nested frame structures, where each proposition in the frame represents some atomic component of the total meaning.
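The triple (S, N, L) and the notion of an interlingua instance can be made concrete with a small data structure. The following is a minimal sketch in Python; the class name, the tuple encoding of frames, the relation name, and the sample symbols are all assumptions made for illustration and are not drawn from any of the systems discussed in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Interlingua:
    symbols: Set[str]                      # S: representation symbols
    relations: Set[str]                    # N: relation names the notation allows
    lexicons: Dict[str, Dict[str, Set[str]]] = field(default_factory=dict)  # L

    def well_formed(self, frame) -> bool:
        """Check a frame (head, {relation: nested frame}) against S and N."""
        head, propositions = frame
        if head not in self.symbols:
            return False
        for relation, value in propositions.items():
            if relation not in self.relations:
                return False
            if not self.well_formed(value):
                return False
        return True


toy = Interlingua(
    symbols={"PERSIST", "ERROR"},
    relations={"theme"},
    lexicons={"en": {"persist": {"PERSIST"}, "error": {"ERROR"}}},
)

# An instance for "the error persists": a PERSIST frame whose theme is ERROR.
instance = ("PERSIST", {"theme": ("ERROR", {})})
print(toy.well_formed(instance))
```

Keeping one lexicon per language under `lexicons` mirrors the observation that interlingual MT systems typically include one lexicon for each language.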

p0040 Details and examples of each of these components follow.

s0015 Representation Symbols

p0045 Typically, an interlingua comprises several kinds of symbols to represent meaning. The largest set can be thought of as the conceptual primitives; rather like the open-class words in a human language, these symbols stand for specific types of objects, events, relations, qualities, etc. Other, smaller sets of symbols are defined to represent specific fields of meaning, and usually derive from a logical theory about the nature of some phenomenon. For example, the linguistic system of tense can be related to a theory of time, and time can be represented in an interlingua according to a highly formalized subsystem; see Reichenbach (1947) and Allen (1984). Other typical subfields of meaning represent space, causality, and the epistemic status of events (actual, hypothetical, desired, etc.).

p0050 These symbols are often arranged as taxonomies in which each node stands for a symbol, and information stored at higher-level symbols is inherited downwards and shared by lower ones. The contents and structure of the taxonomy thereby embody, to some degree, the interlingua designer’s conceptualization of the world, making the taxonomy an ontology in the classical sense. Although ontologies are as old as Aristotle and are most commonly used in artificial intelligence systems to support complex reasoning, interlingua ontologies form a distinct type: they are generally large (comprising several thousand symbols), contain relatively little information per symbol, and what information they contain is primarily devoted to interlingua instance composition or linguistic behavior rather than to inference.
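Downward inheritance in such a taxonomy can be sketched in a few lines of Python. The concept names and the property `has_time` below are hypothetical and serve only to show how information stored at a higher-level symbol is shared by the symbols beneath it.

```python
class Concept:
    """A taxonomy node that inherits properties from its ancestors."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent
        self.properties = properties

    def lookup(self, key):
        # Walk upward until some ancestor defines the property.
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        raise KeyError(key)

    def is_a(self, other):
        # True if `other` is this node or one of its ancestors.
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


event = Concept("EVENT", has_time=True)
motion = Concept("MOTION-EVENT", parent=event)
kick = Concept("KICK", parent=motion, body_part="leg")

print(kick.lookup("has_time"), kick.is_a(event))
```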

f0010 Figure 2 Use of Interlingua in multiple applications.



p0055 It is not uncommon for an interlingual MT system to contain both an upper-level, very general, ontology and then one or more specific domain-oriented ones. The upper ontology contains notions that are shared over all domains in common language; the lower ones encode distinct subworlds, such as finances, sports, chemistry, etc. Usually, the higher-level symbols represent conceptual and linguistic abstractions for which there are no words, and the lower-level ones more concrete meanings for which words exist in the various languages’ lexicons. (For example, the Penman Upper Model contains the nodes Non-DecomposableObject and DecomposableObject to separate mass and count nouns.) One advantage of domain partitioning is ambiguity avoidance: the term “bond” in the financial domain has only one meaning, and in the chemistry domain another, enabling the MT system to proceed more expeditiously in each domain.

p0060 Ontologies developed for MT include ONTOS (ONTOS, 1989), SENSUS (Knight and Luk, 1994), and Mikrokosmos/OntoSem (Nirenburg and Raskin, 2004). Ontologies developed and used for language technology applications in general include WordNet (Fellbaum, 1998), the Penman Upper Model (Bateman et al., 1989), and Omega (Philpot et al., 2003). Omega can be browsed using the DINO browser.

s0020 Notation

p0065 The notation is the vehicle by which the symbols’ individual shades of meaning are assembled into a complex meaning. The notation is usually instantiated as a network of propositions represented as a set of nested frames, where each proposition employs the symbols of the interlingua, composed according to the specifications of the interlingua in general and of the symbols in particular.

p0070 Typically, a frame has a frame header, which may include a frame identifier, and one or more propositions, each being a relation–value pair that links the frame header to the value via the relation. Figure 3 provides an example from the KANT system, representing the meaning of “If the error persists, service is required.” The frame headers of the two clauses, each marked with an asterisk (*), are BE-PREDICATE and QUALIFYING-EVENT. BE-PREDICATE has two arguments, an attribute and a theme. Each of these is headed by another frame, REQUIRED and SERVICE, respectively. The QUALIFYING-EVENT has a PERSIST event whose theme is ERROR.
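The nesting described for this example can be sketched as nested Python dictionaries. This is an illustration built only from the prose description, not a reproduction of KANT's notation: the keys `head`, `qualification`, and `event` are assumed names, and attaching QUALIFYING-EVENT under BE-PREDICATE is an assumption about how the two clauses are linked.

```python
# Each frame is a dict with a "head" (frame header) plus relation-value pairs.
frame = {
    "head": "BE-PREDICATE",
    "attribute": {"head": "REQUIRED"},
    "theme": {"head": "SERVICE"},
    "qualification": {              # assumed link between the two clauses
        "head": "QUALIFYING-EVENT",
        "event": {                  # assumed relation name
            "head": "PERSIST",
            "theme": {"head": "ERROR"},
        },
    },
}


def frame_headers(node):
    """Collect all frame headers depth-first, in relation order."""
    found = [node["head"]]
    for relation, value in node.items():
        if relation != "head" and isinstance(value, dict):
            found.extend(frame_headers(value))
    return found


print(frame_headers(frame))
```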

p0075 In some sophisticated interlinguas, the notation contains separate zones for different kinds of meaning (Nirenburg and Raskin, 2004); typically a zone for world semantics (the conceptual content of the text), a zone for interpersonal semantics (information in the text reflecting the writer, reader, their relationship, etc., which often affects the style of the text rather than the content), and a zone for metatextual information (medium, such as spoken or written; genre, such as telegram, letter, report, article; situation, such as anonymous posting, personal delivery, etc.).

s0025 Lexicon

p0080 An interlingua lexicon includes information about the nature and behavior of each word in the language. For example, events and actions (usually expressed as verbs) include information about their preferred arguments (agents, patients, instruments, etc.). In some interlinguas, this information may reflect the verbal predilections of one language more than another; for example, “I swim across the river” is expressed in Spanish as “I cross the river swimmingly.” Should the interlingual representation be anchored on “swim” or “cross”? The choice rests with the interlingua symbol set designer. To the degree such asymmetries in the interlingua prefer one language over another, it is said to deviate from true language neutrality. A representation system reflecting one language closely is often called shallow semantics.

f0015 Figure 3 KANT representation of “If the error persists, service is required.”

p0085 Within a chosen representation system, the concepts on which events are anchored are called predicates and the participants in the event are called arguments, following the formalism of the logical representations used in artificial intelligence systems. Predicate-argument structure refers to the combination of an event concept and its participants. A given predicate is said to have a certain number of potential participants, its valency. For example, the verb load has a valency of 3: the person doing the loading, the item that is loaded, and the place where the item is loaded.
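Valency can be illustrated with a small lookup table. The sketch below is a toy in Python; the role labels chosen for load (agent, theme, goal) and the sample fillers are assumptions, since the text names the three participants only informally.

```python
# Predicate -> ordered argument roles; the length of the tuple is the valency.
PREDICATES = {
    "load": ("agent", "theme", "goal"),
}


def valency(predicate):
    return len(PREDICATES[predicate])


def bind(predicate, *fillers):
    """Pair each filler with its role, enforcing the predicate's valency."""
    roles = PREDICATES[predicate]
    if len(fillers) != len(roles):
        raise ValueError(f"{predicate} takes {len(roles)} arguments")
    return {"predicate": predicate, **dict(zip(roles, fillers))}


print(valency("load"))
print(bind("load", "John", "hay", "truck"))
```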

p0090 Semantic roles, often called thematic roles, are by far the most common approach to representing the arguments of a predicate semantically. However, the numerous variant theories display little agreement even on terminology. A small set of examples is shown in Table 1. The reader is referred to the section on Logical and Lexical Semantics for a more comprehensive set of examples.

p0095 A number of interlingua researchers have used semantic roles for interlingual MT (Dorr, 2001; Nyberg and Mitamura, 2000). More details are given in “Interlinguas in Machine Translation” below.

s0030 Issues in Interlingua

p0100 The notion of an interlingua appeals to many, but building one is a complex undertaking. In this section we examine the issues faced by designers of interlinguas and interlingual MT systems.

s0035 Problems with Representing Meaning

p0105 Probably the central problem of interlingua design is the complexity of meaning. A great deal has been written about interlinguas, but no clear methodology exists for determining exactly how one should build a true language-neutral meaning representation, if such a thing is possible at all (Hovy and Nirenburg, 1992). It is always possible to add more detail to a meaning representation, but in order to implement an MT system, the details must end at some point. To date no adequate criteria have been found for deciding when to stop refining the meaning representation, although some preliminary attempts have been made in the NESPOLE! project (Levin et al., 2003) and in the IAMTC project (see Interlingual Annotation of Multilingual Text Corpora (00000) below).

p0110 A basic design choice is granularity: the number of interlingual representation primitives. The parsimonious approach, exemplified by Conceptual Dependency (Schank and Abelson, 1977), declares that a small number of primitives are enough to compositionally represent all actions. This poses a daunting problem of meaning assembly that has never been seriously attempted. In contrast, the profligate approach, called ontological promiscuity (Hobbs, 1985), essentially allows a representation symbol for every shade of meaning (and certainly one for each lexical item). This poses a problem of representing the essential relatedness of notions such as buy and sell, come and go, etc. The ideal seems to have been to aim somewhere in between, seeking conceptual depth and coverage simultaneously. Many researchers (e.g., Nirenburg and Raskin, 2004) develop a deep semantic analysis that requires extensive world knowledge; the performance of deep semantic analysis (if required) depends on the (so far unproven) feasibility of representing, collecting, and efficiently storing large amounts of world and domain knowledge. This problem consumes extensive efforts in the broader field of artificial intelligence (Lenat, 1995).

p0115 We present an example. What, principally, are the primitive concepts of the meaning representation for eat? Do we also need more specific primitives like ‘eat-politely’ and ‘eat-like-a-pig’? This distinction is required to distinguish between the verbs essen and fressen in German. In general, two strategies are possible. One is to adopt arbitrarily the conceptualizations of one language, and specify the variations of all others in terms thereof; the other is to multiply out all the distinctions found in any language. In the latter case one will obtain two interlingual items representing eat (because of German) and two for the object fish (because of the distinction between pez and pescado in Spanish). The situation worsens: in Japanese, the translation of the verb wear depends on where the object is worn, e.g., head or hands.

t0005 Table 1 Examples of semantic roles

Role | Definition | Example
AGENT | An Agent should have the features of volition (able to make a conscious choice), sentience (having perception), causation (able to bring about an effect) and independent existence (existence not resulting from the action) | John broke the vase
THEME | The Theme is causally affected, or is in a state or changes state, or is in a location or changes location, or comes into or out of existence | John broke the vase
INSTR | The Instrument has causation but no volition. Typically, an instrument appears with an agent and can be paraphrased with “using” | John broke the vase with a hammer

p0120 Ontologies greatly support the profligate approach, because they allow one to concisely represent systematic relationships between groups of concepts. However, building an ontology remains a problem. For example, the WordNet-based component of the Omega ontology (Philpot et al., 2003) mentioned above contains 110,000 nodes and often provides too many indistinguishable alternatives, whereas the Mikrokosmos-based component of Omega contains only 6,000 concepts and does not offer all the concepts needed to represent the full meaning of a word. Thus the word extremely corresponds to four concepts in WordNet-based Omega, and each sense is hard to distinguish from the others: (1) to a high degree or extent, favorably or with much respect; (2) to an extreme degree; (3) to an extreme degree, super; (4) to an extreme degree or extent, exceedingly. On the other hand, the Mikrokosmos-based part of Omega does not contain even one concept for the word extremely.

p0125 Another issue raised with respect to interlinguas is that, because this representation is purportedly independent of the syntax of the source text, the target text generated reads more like a paraphrase than a strict translation. That is, the style and emphasis of the original text are lost. However, this is not so much a failure of the interlingua as its incompleteness, caused by a lack of understanding of the discourse and pragmatics required to recognize and appropriately reproduce style and emphasis. In fact, in some cases it may be an advantage to ignore the author’s style. Moreover, many have argued that, outside the field of artistic texts (poetry and fiction), preservation of the syntactic form of the source text in translation is completely superfluous (Goodman and Nirenburg, 1991; Whitelock, 1989). For example, the passive voice constructions in the two languages may not convey identical meanings. Taken overall, the current state of the art seems to confirm that it is possible to produce interlinguas that are reliably adequate between language groups (e.g., Japanese and Western European) for specialized domains only.

s0040 Divergences

p0130 An important problem addressed by interlingua approaches is that of structural differences between languages (language divergences), e.g., English ‘fear’ vs. Spanish tener miedo de. Some examples from Dorr (1993) are:

• Categorial divergence: the translation of words in one language into words that have different parts of speech in another language. For example, ‘to be jealous’: tener celos (‘to have jealousy’).

• Conflational divergence: the translation of two or more words in one language into one word in another language. Examples include ‘to kick’: dar una patada (‘give a kick’).

• Structural divergence: the realization of verb arguments in different syntactic configurations in different languages. For example, ‘to enter the house’: entrar en la casa (‘enter in the house’).

• Head swapping divergence: the inversion of a structural dominance relation between two semantically equivalent words when translating from one language to another. For example, ‘to run in’: entrar corriendo (‘enter running’).

• Thematic divergence: the realization of verb arguments in syntactic configurations that reflect different thematic to syntactic mapping orders. For example, ‘I like grapes’: me gustan uvas (‘to-me please grapes’).

p0135 Many divergences are caused by differences in language typology. For example, many verb-serializing languages express the benefactive (e.g., write a letter for me) in a serial verb construction (e.g., write letter give me). Some types of meaning are particularly susceptible to divergences. In English, sentences expressing the speech act of suggesting (How about going to the conference?, Why not go to the conference?) cannot be translated literally into most other languages. Divergences are also common in expressions of modality. For example, the expression of deontic modality in ‘you had better go’ in English can be expressed in Japanese roughly as Itta hoo ga ii, literally ‘go(past form) way/option/alternative subj-marker good’ or ‘(the) option (of) going (is) good.’ Some authors have argued that divergences may be the norm rather than the exception (Levin and Nirenburg, 1994).

p0140 Resolution of cross-language divergences is an area where the differences in MT architecture are most crucial. Many MT approaches resolve such divergences by means of construction-specific rules that map from the predicate-argument structure of one language into that of another. The interlingua approach to MT takes advantage of the compositionality of basic units of meaning to resolve divergences. For example, the conflational divergence above is resolved by mapping English “kick” into two components, the motional component (movement of the leg) and the manner (a kicking motion), before translating into a language like Spanish.

s0045 Interlinguas in Machine Translation

f0020 Figure 4 Interlingual MT system architecture.

p0145 A typical interlingual system is illustrated schematically in Figure 4. Each language requires an analyzer and a synthesizer. The analyzer takes as input a source language sentence and produces as output an interlingual representation of the meaning. The synthesizer takes an interlingual representation of meaning as input and produces one or more sentences with that meaning. In theory, it is not necessary to consider the properties of another language during the analysis of the source language or generation of the target language. To translate from language L1 to L2, L1’s analyzer produces an interlingual representation and L2’s synthesizer generates an L2 sentence with the same meaning.
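The analyzer and synthesizer pipeline can be sketched as follows. This is a toy in Python, assuming a two-entry lexicon per language and a meaning encoded as a (motion, manner) pair, following the decomposition of “kick” discussed under divergences; the symbol names and function names are hypothetical.

```python
# A toy interlingual meaning: (motion component, manner component).
KICK = ("LEG-MOVEMENT", "KICKING")

# One lexicon per language, each mapping a surface phrase to a meaning.
LEXICON = {
    "en": {"kick": KICK},
    "es": {"dar una patada": KICK},
}


def analyze(language, text):
    """Source-language analyzer: surface text -> interlingual meaning."""
    return LEXICON[language][text]


def synthesize(language, meaning):
    """Target-language synthesizer: interlingual meaning -> surface text."""
    for phrase, phrase_meaning in LEXICON[language].items():
        if phrase_meaning == meaning:
            return phrase
    raise LookupError(f"no {language} realization for {meaning}")


def translate(text, source, target):
    # Neither step consults the other language, only the interlingua.
    return synthesize(target, analyze(source, text))


print(translate("kick", "en", "es"))
```

Adding a language in this design means adding one lexicon entry set, with no changes to the components for the existing languages.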

p0150 Below we illustrate several representative examples of interlingual representations used by developers of interlingual MT systems.

s0050 Pangloss

p0155 The Pangloss project (Frederking et al., 1994) started as an ambitious attempt to build rich interlingual expressions using humans to augment system analysis. As shown in Figure 5, the representation includes a set of frames representing semantic components (each headed by a unique identifier such as %proposition_5) and a separate frame with aspectual information (see %aspect_5 at bottom) representing duration, telicity, etc. Some modifiers are treated as scalars and represented by numerical values; the phrase “active expansion” is represented in %expand_1 with an intensity of 0.75 (out of 1.0). Note also that all implicit arguments (for instance, the agent of %expand_1) are explicitly included.

f0025 Figure 5 Pangloss interlingual representation of “The Sezon Group will pursue an active overseas expansion policy by means of the tie-up with SAS.”

Mikrokosmos/OntoSem

p0160 The focus of the Mikrokosmos project, more recently dubbed OntoSem (Nirenburg and Raskin, 2004), is to produce semantically rich text-meaning representations (TMRs) of unrestricted text that can be used in a wide variety of applications, including as an interlingua for MT. These representations provide the basis for addressing some of the most difficult problems of NLP, such as disambiguation and all aspects of reference resolution, from reconstructing elliptical utterances to linking textual referents to their real-world “anchors” in a fact repository.

p0165 TMRs (OntoSem’s interlingua expressions) use a language-independent metalanguage compatible with that used to represent the underlying static knowledge resources: the ontology and ontologically linked lexicons. A sample TMR for the input “He asked the UN to authorize the war” is shown in Figure 6. (Small caps indicate ontological concepts; the indices represent numbered instances of ontological concepts in the world model built up during this run of the system.)

f0030 Figure 6 OntoSem interlingual representation of “He asked the UN to authorize the war.”

p0170 This says that the word ask instantiates the 69th instance of the concept request-action, whose agent is human-72 (the instantiation of he, which was resolved as Colin Powell using reference resolution procedures), whose beneficiary is organization-71 (the instantiation of UN, which was resolved to United-Nations using reference resolution procedures), and whose theme is accept-70 (the instantiation of ‘authorize’, whose theme is war-73, the semantic representation of the meaning of the word war). One goal of recent work in the OntoSem environment has been to create TMRs for large amounts of text, populate a fact repository using a subset of information from the TMRs, and then use the fact repository as a language-independent search space for applications such as question answering and knowledge extraction.
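The instance structure described for this TMR can be sketched as a dictionary of numbered concept instances. The instance names and relations below come from the prose description; the dictionary encoding and the two helper functions are assumptions for illustration and do not reproduce OntoSem's actual TMR syntax.

```python
# Numbered concept instances and their relations, as described in the text.
tmr = {
    "request-action-69": {
        "agent": "human-72",
        "beneficiary": "organization-71",
        "theme": "accept-70",
    },
    "accept-70": {"theme": "war-73"},
}


def concept_of(instance):
    """Strip the instance index: 'request-action-69' -> 'request-action'."""
    return instance.rsplit("-", 1)[0]


def theme_chain(tmr, start):
    """Follow theme links from one instance until none remains."""
    chain = [start]
    while "theme" in tmr.get(chain[-1], {}):
        chain.append(tmr[chain[-1]]["theme"])
    return chain


print(concept_of("request-action-69"))
print(theme_chain(tmr, "request-action-69"))
```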

s0055 JapanGloss

p0175 The interlingua notation developed for the Japangloss MT system and the Nitrogen generator (Knight and Langkilde, 2000) used symbols from the SENSUS ontology (Knight and Luk, 1994), one of the precursors of Omega. In this notation, frame identifiers are symbols like h1, and SENSUS symbols are delimited by bars. In contrast to many other interlinguas, modality predicates (e.g., likelihood and necessity) are represented as frame predicates, in the same way as ordinary actions and events. Thus in the example given in Figure 7, which represents “It is possible that you must eat chicken” (equivalently, “You might have to eat chicken”), e4 is the eating by you of the chicken, which by h2 is obligatory, which in turn by h1 is possible.
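The stacking of modality frames over an event frame can be sketched as follows. The frame identifiers h1, h2, and e4 come from the text; the relation names (`instance`, `domain`, `agent`, `patient`) and the bar-delimited symbol spellings are assumptions made for illustration and are not taken from the actual Japangloss notation.

```python
# Modality predicates are ordinary frames that point at the frame they modify.
frames = {
    "h1": {"instance": "|possible|", "domain": "h2"},
    "h2": {"instance": "|obligatory|", "domain": "e4"},
    "e4": {"instance": "|eat|", "agent": "you", "patient": "|chicken|"},
}


def modality_chain(frames, start):
    """Return the modalities stacked above the core event, outermost first,
    together with the identifier of the core event frame."""
    modalities = []
    identifier = start
    while "domain" in frames[identifier]:
        modalities.append(frames[identifier]["instance"])
        identifier = frames[identifier]["domain"]
    return modalities, identifier


print(modality_chain(frames, "h1"))
```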

s0060 KANT

p0180 KANT is the only interlingual MT system that has ever been made operational in a commercial setting. The KANT system (Nyberg and Mitamura, 2000) is a knowledge-based, interlingual machine translation system. KANT is designed for translation of technical documents written in Controlled English to multiple target languages. The KANT Analyzer produces an interlingua expression for each sentence in the input document; an example appeared earlier in Figure 3. This interlingua is mapped into an appropriate target sentence by the KANT generator. For each target language there is a separate lexicon and grammar.

p0185 The KANT system was integrated with the ClearCheck document checking interface (built by Carnegie Group) and deployed in the Caterpillar document workflow during the mid-1990s. The work for Caterpillar involved development of Caterpillar Technical English (CTE), a corresponding KANT Analyzer, and KANT Generators for French, Spanish, and German. The system delivered to Caterpillar represents the first large-scale deployment of controlled language checking integrated with machine translation. The interlingua used in the KANT system is based on research on the generation of additional target languages, such as Portuguese, Italian, Russian, Chinese, and Turkish.

s0065 Interlingual Systems for MT of Spoken Language

p0190 The interlingua approach to machine translation has been implemented in several demos and prototypes for translation of spoken language. MT for spoken language begins with speech recognition. The output of the speech recognizer is then passed to the source language analysis module of the MT system. In addition to the problems faced by MT for text, MT for spoken language must deal with disfluencies in speech and imperfect output from a speech recognizer. For this reason, most spoken language MT systems are restricted to task-oriented domains such as travel planning or doctor–patient interviews.

p0195 Interlinguas for spoken, task-oriented dialogue typically focus on the dialogue act that the speaker intends to accomplish with his/her utterance. Examples of dialogue acts are suggesting, accepting, and rejecting a time or price. In interlinguas for spoken language, less emphasis is placed on predicate-argument structure. The emphasis on speaker intent means that the same interlingual representation will be used for sentences that have very different syntactic structures. For example, the following sentences all carry out the dialogue act of giving information about the price of a room. The concept of costing is expressed by the verb (cost) in the first sentence, and the subject (price) in the second sentence. In the third sentence, the concept of costing is implicit in the predicate nominal (one hundred dollars).

The room costs one hundred dollars per night.
The price of the room is one hundred dollars per night.
The room is one hundred dollars per night.
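The way three different syntactic structures collapse into one dialogue-act representation can be sketched with a toy analyzer. The Python below is a sketch that assumes a single hand-written pattern covering only these three sentences; the field names and the act label `give-information` are hypothetical and do not reproduce the notation of any system named in this article.

```python
import re

# One pattern covering the verb, subject, and predicate-nominal variants.
PATTERN = re.compile(
    r"^The (?:price of the )?room (?:costs|is) (?P<price>.+) per night\.$"
)


def analyze(sentence):
    """Map a sentence to a dialogue-act frame, or None if it does not match."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    return {
        "dialogue_act": "give-information",
        "concept": "price",
        "object": "room",
        "price": match["price"],
        "per": "night",
    }


sentences = [
    "The room costs one hundred dollars per night.",
    "The price of the room is one hundred dollars per night.",
    "The room is one hundred dollars per night.",
]
frames = [analyze(s) for s in sentences]
print(frames[0] == frames[1] == frames[2])
```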

p0200 The JANUS system, developed in the early 1990s, was the earliest spoken language MT system to use an interlingua (Levin et al., 2000). JANUS was part of the C-STAR consortium (Consortium for Speech Translation Advanced Research), many of whose members adopted the interlingua approach for an international demo in 1999. Other interlingual speech translation systems include Enthusiast, CCLINC, NESPOLE!, Speechalator, Carnegie Mellon University’s Thai speech translation system (Schultz et al., 2004), and FAME.

p0205 Figure 8 provides an example from the NESPOLE! project, in which both sentences are represented by the given interlingua instance. The NESPOLE! interlingua is based on an annotated corpus of transcribed dialogues in English, German, Italian, and Japanese. It has also been applied to Chinese, Spanish, and French. Its precursor, the C-STAR interlingua, has also been applied to Korean.

f0035 Figure 7 Japangloss interlingual representation of “It is possible that you must eat chicken” or “You might have to eat chicken.”

s0070 Universal Networking Language

p0210 The Universal Networking Language (UNL) is a formal language designed for rendering automatic multilingual information exchange (Martins et al., 2000). It is intended to be a cross-linguistic semantic representation of sentence meaning consisting of concepts (e.g., “cat,” “sit,” “on,” or “mat”), concept relations (e.g., “agent,” “place,” or “object”), and concept predicates (e.g., “past” or “definite”). The UNL syntax supports the representation of a hypergraph whose nodes represent Universal Words and whose arcs represent Relation Labels. An example is shown in Figure 9 for the sentence “The cat sat on the mat.”
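The content of such a graph can be sketched with labeled arcs between concepts. The Python below is an illustration only: it uses the concept, relation, and predicate names quoted in the text, simplifies the hypergraph to binary arcs, and makes an assumption about which arcs connect which concepts. It does not reproduce UNL surface syntax.

```python
# Labeled arcs (relation, source concept, target concept).
relations = [
    ("agent", "sit", "cat"),
    ("place", "sit", "on"),
    ("object", "on", "mat"),
]

# Concept predicates attached to individual nodes.
attributes = {"sit": {"past"}, "cat": {"definite"}, "mat": {"definite"}}


def nodes(relations):
    """All concepts that appear at either end of an arc."""
    found = set()
    for _label, source, target in relations:
        found.update((source, target))
    return found


def outgoing(relations, node):
    """Relation label -> target for every arc leaving `node`."""
    return {label: target for label, source, target in relations if source == node}


print(sorted(nodes(relations)))
print(outgoing(relations, "sit"))
```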

p0215 Several semantic relationships hold between universal words (synonymy, antonymy, hyponymy, hypernymy, meronymy, etc.), which compose the UNL ontology.

s0075 Lexical Conceptual Structure

p0220 Lexical Conceptual Structure (LCS) is an interlingual representation used as part of a Chinese-English machine translation system called ChinMT (Habash et al., 2003). It has also been used for many other MT language pairs (e.g., Spanish and Arabic) and other natural language applications (e.g., cross-language information retrieval). The LCS-based approach focuses on the types of divergences described in “Divergences.” Consider, for example, the case of a conflational divergence between Arabic and English:

Arabic:
Gloss: ‘The-reporter sent email to Al-Jazeera’.
English: The reporter emailed Al-Jazeera.

p0225 The LCS representation for this example is shown in Figure 10, glossed as ‘The reporter caused the email to go to Al-Jazeera in a sending manner’. Here, the primary components of meaning are the top-level conceptual nodes cause and go. These are taken together with their arguments, each identified by a semantic role (agent, theme, and goal), and a modifier (manner) send+ingly.
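The composition of this meaning from cause, go, their role-labeled arguments, and the manner modifier can be sketched as a nested structure. The Python below is a sketch built from the prose gloss; the dictionary keys and the attachment of the manner modifier to the go node are assumptions, and the encoding is not the actual LCS notation.

```python
# Top-level cause node wrapping a go event, with role-labeled arguments.
lcs = {
    "primitive": "cause",
    "agent": "reporter",
    "event": {
        "primitive": "go",
        "theme": "email",
        "goal": "Al-Jazeera",
        "manner": "send+ingly",   # attachment point is an assumption
    },
}


def gloss(structure):
    """Render the structure as the English paraphrase used in the text."""
    event = structure["event"]
    manner_stem = event["manner"].split("+")[0]
    return (
        f"The {structure['agent']} caused the {event['theme']} "
        f"to go to {event['goal']} in a {manner_stem}ing manner"
    )


print(gloss(lcs))
```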

s0080 Approximate Interlingua

p0230 One response to the MT divergence problem (discussed in “Divergences”) is the use of an approximate interlingua (Habash et al., 2003). In this approach, the depth of knowledge-based systems is approximated by tapping into the richness of resources in one language (often English), and this information is used to map the source-language (SL) input to the target-language (TL) output.

p0235 The approximate-interlingua approach addresses the types of divergences covered by the LCS-based approach, but with fewer knowledge-intensive components. Thus, a key feature of an approximate interlingua is the coupling of basic argument-structure information with some, but not all, components of the LCS representation. Only the top-level primitives and semantic roles are retained. This new representation provides the basis for generating multiple candidate sentences, which are then pared down statistically so that the most likely sentence is generated according to the constraints of the TL.

p0240 Consider, for example, the conflational divergence between Arabic and English given above (“Lexical Conceptual Structure”). Figure 11 illustrates the approximate-interlingua approach to translation for this example. The top-level conceptual nodes are first checked for possible matches. Following this, unmatched thematic roles are checked for conflatability, i.e., cases where semantic roles are absorbed into other predicate positions. As long as there is a relation between the conflated argument (EMAIL_N) and the new predicate node (EMAIL_V), part of speech is disregarded.
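The two-step check described in this paragraph can be sketched as follows. This is a toy illustration under stated assumptions, not the published procedure: the dictionary layout, the lexicon entries, and the use of a shared root as a stand-in for "a relation between the conflated argument and the new predicate" are all hypothetical.

```python
def strip_pos(symbol):
    """Drop a trailing part-of-speech tag: 'email_N' -> 'email'."""
    return symbol.rsplit("_", 1)[0]


def conflatable(filler, predicate):
    """An argument can be absorbed if it is related to the predicate.

    Relatedness is approximated here by a shared root; part of speech is ignored.
    """
    return strip_pos(filler) == strip_pos(predicate)


def matches(source, predicate, entry):
    """Check top-level nodes first, then any unmatched thematic roles."""
    if source["primitives"] != entry["primitives"]:
        return False
    unmatched = [role for role in source["roles"] if role not in entry["roles"]]
    return all(conflatable(source["roles"][role], predicate) for role in unmatched)


# Source side: 'The-reporter sent email to Al-Jazeera' (assumed encoding).
source = {
    "primitives": ["cause", "go"],
    "roles": {"agent": "reporter", "theme": "email_N", "goal": "Al-Jazeera"},
}

# Hypothetical target-language verb entries.
lexicon = {
    "send_V": {"primitives": ["cause", "go"], "roles": ["agent", "theme", "goal"]},
    "email_V": {"primitives": ["cause", "go"], "roles": ["agent", "goal"]},
    "phone_V": {"primitives": ["cause", "go"], "roles": ["agent", "goal"]},
}

candidates = [verb for verb, entry in lexicon.items() if matches(source, verb, entry)]
print(candidates)
```

In this sketch both `send_V` and `email_V` survive as candidates, which is consistent with the idea that several sentences are generated and the choice among them is left to a later statistical step.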

f0040 Figure 8 Two sentences and corresponding interlingua instance from the NESPOLE! Project.

f0045 Figure 9 UNL representation of The cat sat on the mat.

f0050 Figure 10 LCS representation of The reporter emailed Al-Jazeera.




s0085 Annotating Text with Interlingual Information

p0245 The success of corpus-based language technology over the past decade has shown the value of systems that automatically learn their processing from large collections of annotated examples. Although no one has yet created an interlingua-annotated corpus to parallel the 1 million words plus syntax trees of the Penn Treebank (Kingsbury et al., 2002), several efforts to annotate important parts of an interlingua are underway. Principally, these efforts focus on verbs and their arguments. We list these and then describe one initiative, IAMTC, in more detail to illustrate the issues involved in annotation.

s0090 Semantic Annotation Initiatives

p0250 WordNet (Fellbaum, 1998) provides a terminology taxonomy for English containing over 100,000 terms. Several ontology-building efforts have used this resource as a starting point. Focusing on the creation of wordnets for other languages, the Global WordNet Association lists EuroWordNet, GermaNet, BalkaNet, and many others.

p0255 Term taxonomizing and ontologizing efforts include the Chinese HowNet and the Mimida multilingual semantic network.

p0260 Focusing on verbs alone, the FrameNet project (Baker et al., 1998) is classifying all verbs into groups according to the case roles (thematic roles) they support. The SALSA project parallels FrameNet, working on German verbs. Other FrameNet-related projects are available for Japanese and Spanish.

p0265 The PropBank project resembles FrameNet in that it focuses on verbs, but it does not employ a fixed set of case roles, preferring instead a more neutral set of labels with no overall semantics. VerbNet is an associated effort to assign FrameNet-like case roles to verbs. A list that combines VerbNet and FrameNet is also available. The NomBank Project closely parallels PropBank, but focuses on nouns (such as nominalized verbs and relational nouns) with argument structure.

p0270 The Interlingual Annotation of Multilingual Text Corpora (IAMTC) project is an ambitious attempt to investigate interlingual semantics by annotating and comparing semantic phenomena across six languages. Bilingual corpora have been prepared that pair English texts with corresponding texts in Japanese, Spanish, Arabic, Hindi, French, and Korean; annotators, assigned to each text in pairs, select semantic representation symbols from the Omega ontology (Philpot et al., 2003) for all nouns, verbs, adjectives, and adverbs. We describe this project in more detail below.

s0095 Interlingual Annotation of Multilingual Text Corpora

p0275 The IAMTC project has the following goals:

• Development of an interlingual representation framework based on a careful study of text corpora in six languages and their translations into English.

• Development of a methodology for accurately and consistently assigning such representations to texts across languages and across annotators.

• Annotation of a corpus of six multilingual parallel subcorpora, using the agreed-upon interlingual representation.

• Development of semantic annotation tools that serve to facilitate more rapid annotation of texts.

• Design of new metrics and evaluations for the interlingual representations, in order to evaluate the degree of annotator agreement and the granularity of meaning representation.

p0280 The IAMTC project is radically different from those annotation projects that have focused on morphology, syntax, or even certain types of semantic content (e.g., for word sense disambiguation). It is most similar to PropBank (Kingsbury et al., 2002) and FrameNet (Baker et al., 1998). However, IAMTC places an emphasis on:

1. a more abstract level of mark-up (interpretation);

2. the assignment of a well-defined meaning representation to concrete texts;

3. issues of a community-wide consistent and accurate annotation of meaning.

f0055 Figure 11 Approximate interlingua for English-Arabic example.

p0285 The data set consists of six bilingual parallel corpora. Each corpus is made up of 125 source language news articles along with three independently produced translations into English. (The source news articles for each individual language corpus are different from the source articles in the other language corpora.) The source languages are Japanese, Korean, Hindi, Arabic, French, and Spanish. Typically, each article contains between 300 and 400 words (or the equivalent) and thus each corpus has between 150,000 and 200,000 words. The Spanish, French, and Japanese corpora are based on DARPA’s 1994 MT evaluation data. The Arabic corpus is based on LDC’s Multiple Translation Arabic, Part 1.
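The per-corpus word totals follow from the article counts if each source article is counted together with its three English translations. The short calculation below shows one reading that reproduces the stated range; the assumption that every translation is about as long as its source, and the constant names, are ours.

```python
ARTICLES_PER_CORPUS = 125
TRANSLATIONS_PER_ARTICLE = 3
WORDS_PER_ARTICLE = (300, 400)  # typical lower and upper bounds

# Each article appears four times: the source text plus three translations.
texts_per_article = 1 + TRANSLATIONS_PER_ARTICLE

low, high = (
    ARTICLES_PER_CORPUS * texts_per_article * words for words in WORDS_PER_ARTICLE
)
print(low, high)  # 150000 200000
```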

p0290 The interlingual representation comprises three levels and incorporates knowledge sources such as the Omega ontology (Philpot et al., 2003) and thematic roles (Dorr, 2001). The three levels of representation are referred to as IL0, IL1, and IL2. The aim is to perform the annotation process incrementally, with each level of representation incorporating additional semantic features and removing existing syntactic ones. IL2 is intended as the interlingual level that abstracts away from (most) syntactic idiosyncrasies of the source language. IL0 and IL1 are intermediate representations that are useful stepping stones for annotating at the next level.

s0100 Issues in Interlingual Annotation

p0295 A preliminary investigation of intercoder agreement on multiple annotations shows that annotators improve as they learn the process, which raises intercoder agreement (Mitamura et al., 2004). Two assumptions may be made regarding the training of novice annotators in order to improve intercoder agreement. One is that novice annotators may make inconsistent annotations within the same text, but these may be reconciled through a process of intra-annotator consistency checking, in which annotators go over their results to find any inconsistencies within the text. The other is that, if two annotators at the same site discuss their annotation results after their annotation tasks are completed, their judgments may be reconciled through a process of inter-annotator checking, in which each annotator votes, the annotators discuss the differences, and then they vote again.
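To make the vote, discuss, vote-again cycle concrete, the sketch below measures agreement between two annotators before and after reconciliation. The text does not say which agreement measure was used; Cohen's kappa is swapped in here as one common choice, and the concept labels and function names are invented for illustration.

```python
from collections import Counter


def observed_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# First vote: hypothetical ontology symbols chosen for four words.
annotator_1 = ["send", "email", "person", "organization"]
annotator_2 = ["send", "message", "person", "organization"]
before = cohens_kappa(annotator_1, annotator_2)

# Second vote, after discussion: annotator 2 revises the disputed item.
annotator_2_revised = ["send", "email", "person", "organization"]
after = cohens_kappa(annotator_1, annotator_2_revised)

print(round(before, 3), after)
```

Comparing the two values gives a simple way to quantify how much the discussion step contributes, separately from what annotators gain through practice.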

p0300 From an MT perspective, issues include evaluating consistency in the use of the annotation language, given that any source text can result in multiple, different, legitimate translations (Farwell and Helmreich, 2003). Along these lines, there is the problem of annotating texts for translation without including in the annotations inferences resulting from the source text. The IAMTC effort described above is the only initiative, to date, that addresses issues of this type in large-scale annotation of data for use in interlingual MT.

See also: Machine Translation, Overview (00936); Interlingual Annotation of Multilingual Text Corpora (00000).

Bibliography

Allen J F (1984). ‘A general model of action and time.’ Artificial Intelligence 23(2).

Baker C, Fillmore C J & Lowe J B (1998). ‘The Berkeley FrameNet Project.’ In Proceedings of the ACL Conference.

Bateman J A, Kasper R T, Moore J D & Whitney R A (1989). ‘A general organization of knowledge for natural language processing: the Penman Upper Model.’ Unpublished research report, USC/Information Sciences Institute, Marina del Rey, CA. A version of this paper appears in 1990 as: Upper Modeling: A Level of Semantics for Natural Language Processing. In Proceedings of the 5th International Workshop on Language Generation. Pittsburgh, PA.

Dorr B J (1993). Machine translation: a view from the lexicon. Cambridge: MIT Press.

Dorr B J (2001). ‘LCS verb database, online software database of lexical conceptual structures and documentation.’ University of Maryland. http://www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html.

Farwell D & Helmreich S (2003). ‘Pragmatics-based translation and MT evaluation.’ In Proceedings of towards systematizing MT evaluation. Workshop at the International Machine Translation Summit IX, New Orleans, LA.

Fellbaum C (ed.) (1998). WordNet: an on-line lexical database and some of its applications. Cambridge: MIT Press.

Frederking R, Nirenburg S, Farwell D, Helmreich S, Hovy E H, Knight K, Beale S, Domashnev C, Attardo D, Grannes D & Brown R (1994). ‘The Pangloss Mark III Machine Translation System.’ In Proceedings of the 1st AMTA Conference. Columbia, MD.

Goodman K & Nirenburg S (eds.) (1991). The KBMT project: a case study in knowledge-based machine translation. San Mateo: Morgan Kaufmann.

Habash N, Dorr B J & Traum D (2003). ‘Hybrid natural language generation from lexical conceptual structures.’ Machine Translation 18(2), 81–128.

Hajic J, Vidova-Hladka B & Pajas P (2001). ‘The Prague Dependency Treebank: annotation structure and support.’ In Proceedings of the IRCS Workshop on Linguistic Databases. University of Pennsylvania, Philadelphia, PA. 105–114.

Hobbs J R (1985). ‘Ontological promiscuity.’ In Proceedings of the conference of the Association for Computational Linguistics (ACL). 61–69.

Hovy E H & Nirenburg S (1992). ‘Approximating an interlingua in a principled way.’ In Proceedings of the DARPA Speech and Natural Language Workshop. Arden House, NY.

Kingsbury P, Palmer M & Marcus M (2002). ‘Adding semantic annotation to the Penn TreeBank.’ In Proceedings of the Human Language Technology conference (HLT 2002).


Kipper K & Palmer M (2000). ‘Representation of actions as an interlingua.’ In Proceedings of the Third AMTA SIG-IL Workshop on Interlinguas and Interlingual Approaches. Seattle, WA.

Knight K & Langkilde I (2000). ‘Preserving ambiguities in generation via automata intersection.’ In Proceedings of the American Association for Artificial Intelligence conference (AAAI).

Knight K & Luk S K (1994). ‘Building a large-scale knowledge base for machine translation.’ In Proceedings of the AAAI conference. Seattle, WA.

Lenat D B (1995). ‘CYC: a large-scale investment in knowledge infrastructure.’ Communications of the ACM 38(11), 32–38.

Levin B (1993). English verb classes and alternations: a preliminary investigation. Chicago: University of Chicago Press.

Levin L & Nirenburg S (1994). ‘Construction-based MT lexicons.’ In Zampolli A, Calzolari N & Palmer M (eds.) Current issues in computational linguistics: in honour of Don Walker. Pisa: Giardini editori e stampatori and Kluwer publishers. 321–338.

Levin L, Lavie A, Woszczyna M, Gates D, Gavalda M, Koll D & Waibel A (2000). ‘The Janus III translation system.’ Machine Translation.

Levin L, Langley C, Lavie A, Gates D, Wallace D & Peterson K (2003). ‘Domain specific speech acts for spoken language translation.’ In Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue. Sapporo, Japan.

Martins T, Machado Rino L H, Volpe Nunes M G, Montilha G & Osvaldo Novais O (2000). ‘An interlingua aiming at communication on the web: how language-independent can it be?’ In Proceedings of workshop on applied interlinguas: practical applications of interlingual approaches to NLP. Workshop at ANLP-NAACL. Seattle, WA.

Nirenburg S & Raskin V (2004). Ontological Semantics. Cambridge: MIT Press.

Nyberg E & Mitamura T (2000). ‘The KANTOO machine translation environment.’ In White J S (ed.) Envisioning machine translation in the information future. 4th Conference of the Association for Machine Translation in the Americas (AMTA 2000). Lecture Notes in Artificial Intelligence, Vol. 1934. Berlin: Springer Verlag.

Philpot A, Fleischman M & Hovy E H (2003). ‘Semi-automatic construction of a general purpose ontology.’ In Proceedings of the International Lisp Conference. New York.

Reichenbach H (1947). ‘The tenses of verbs.’ In Elements of symbolic logic. London: Collier Macmillan.

Schultz T, Alexander D, Black A W, Peterson K, Suebvisai S & Waibel A (2004). ‘A Thai speech translation system for medical dialogs.’ In Proceedings of the conference on Human Language Technologies (HLT-NAACL). Boston, MA.

Vauquois B (1968). ‘A survey of formal grammars and algorithms for recognition and transformation in machine translation.’ In Proceedings of the IFIP Congress-6. 254–260.

Whitelock P (1989). ‘Why transfer and interlingua approaches to MT are both wrong: a position paper.’ In Proceedings of the MT Workshop: Into The 90’s. Manchester, England.

Relevant Websites

http://www.cicc.or.jp—CICC website.
http://nespole.itc.it—NESPOLE! website.
http://www.umiacs.umd.edu—UMIACS website.
http://www.isi.edu.
http://www-2.cs.cmu.edu.
http://www.lti.cs.cmu.edu.
http://blombos.isi.edu—DINO browser.
http://www-2.cs.cmu.edu—Enthusiast.
http://www.ll.mit.edu—CCLINC.
http://www-2.cs.cmu.edu—Speechalator.
http://isl.ira.uka.de—FAME.
http://www.cogsci.princeton.edu—WordNet.
http://www.globalwordnet.org—Global WordNet Association.
http://www.illc.uva.nl—EuroWordNet.
http://www.sfs.nphil.uni-tuebingen.de—GermaNet.
http://www.ceid.upatras.gr—BalkaNet.
http://www.keenage.com—Chinese HowNet.
http://www.gittens.nl—Mimida multilingual semantic network.
http://www.icsi.berkeley.edu—FrameNet project.
http://www.coli.uni-sb.de—SALSA project.
http://www.nak.ics.keio.ac.jp—FrameNet project for Japanese.
http://gemini.uab.es—FrameNet project for Spanish.
http://www.cis.upenn.edu—PropBank project.
http://www.cis.upenn.edu—VerbNet.
http://www.cis.upenn.edu—combination of VerbNet and FrameNet.
http://nlp.cs.nyu.edu—The NomBank Project.
http://aitc.aitcnet.org—IAMTC project.




Non-Print Items

Abstract:

An interlingua is a notation for representing the content of a text that abstracts away from the characteristics of the language itself and focuses on the meaning (semantics) alone. Interlinguas are typically used as pivot representations in machine translation, allowing the contents of a source text to be generated in many different target languages. Because of the complexities involved, few interlinguas are more than demonstration prototypes, and only one has been used in a commercial MT system. In this article, we define the components of an interlingua and the principal issues faced by designers and builders of interlinguas and interlingua MT systems, illustrating with examples from operational systems and research prototypes. We discuss current efforts to annotate texts with interlingua-based information.

Biography:

Bonnie Dorr is an associate professor at the University of Maryland, with a joint appointment in Computer Science, UMIACS, and Linguistics, and is co-director of the Computational Linguistics and Information Processing laboratory. She graduated from the Massachusetts Institute of Technology in 1990 with a Ph.D. in computer science. Her research focuses on several areas of broad-scale multilingual processing, e.g., machine translation, summarization, and cross-language information retrieval. She has investigated the problem of creating new statistical models that are linguistically informed, leading to higher quality output for a wide range of languages while still being practical to train and use. Dr. Dorr is the recipient of an NSF Presidential Faculty Fellowship Award, Maryland’s Distinguished Young Scientist Award, the Alfred P. Sloan Research Award, and an NSF Young Investigator Award. She has served on numerous editorial boards and executive committees and is the author of Machine translation: a view from the lexicon.

Biography:

Eduard Hovy leads the Natural Language Research Group at the Information Sciences Institute of the University of Southern California. He is also Deputy Director of the Intelligent Systems Division, as well as a research associate professor of the Computer Science Departments of USC and of the University of Waterloo in Canada. He completed a Ph.D. in Computer Science (artificial intelligence) at Yale University in 1987. His research focuses on automated text summarization, question answering, text planning and generation, the semi-automated construction of large lexicons and ontologies, and machine translation. Dr. Hovy is the author or co-editor of five books and over 140 technical articles. In 2001, Dr. Hovy served as President of the Association for Computational Linguistics (ACL) and in 2001–2003 as President of the International Association of Machine Translation (IAMT). Dr. Hovy regularly co-teaches a course in the Master’s Degree Program in Computational Linguistics at the University of Southern California, as well as occasional short courses on MT and other topics at universities and conferences. He has served on the Ph.D. & M.S. committees for students from USC, Carnegie Mellon University, the Universities of Toronto, Karlsruhe, Pennsylvania, Stockholm, Waterloo, Nijmegen, Pretoria, and Ho Chi Minh City.

Biography:

Lori Levin is an Associate Research Professor at Carnegie Mellon University’s Language Technologies Institute. She received a B.A. in Linguistics from the University of Pennsylvania in 1979 and a Ph.D. in Linguistics from M.I.T. in 1986. She has co-directed several machine translation projects covering both spoken and written language.

Keywords: approximate interlingua, compositionality, conceptual knowledge, cross-language divergences, interlingua, interlingual speech translation, language-independent representation, lexical-conceptual structure, machine translation, ontology, predicate-argument structure, semantic annotation, semantic frames, semantic zones, text-meaning representations, thematic roles

Author Contact Information:

Bonnie Dorr
Computer Science
UMIACS
A.V. Williams Bldg 3153
College Park, MD
[email protected]

