Nine Issues in Speech Translation


  • 8/8/2019 Nine Issues in Translation

    1/38

    Nine Issues in Speech Translation
    Author(s): Mark Seligman
    Source: Machine Translation, Vol. 15, No. 1/2, Spoken Language Translation (2000), pp. 149-185
    Published by: Springer
    Stable URL: http://www.jstor.org/stable/40007086

    Accessed: 23/09/2010 03:42


    Springer is collaborating with JSTOR to digitize, preserve and extend access to Machine Translation.

    http://www.jstor.org


    Nine Issues in Speech Translation

    MARK SELIGMAN

    Machine Translation 15: 149-185, 2000. © 2001 Kluwer Academic Publishers. Printed in the Netherlands.

    GETA, Université Joseph Fourier, 385 rue de la Bibliothèque, 38041 Grenoble Cedex 9, France (E-mail: [email protected])

    Abstract. This paper sketches research in nine areas related to spoken language translation: interactive disambiguation (two demonstrations of highly interactive, broad-coverage speech translation are reported); system architecture; data structures; the interface between speech recognition and analysis; the use of natural pauses for segmenting utterances; example-based machine translation; dialogue acts; the tracking of lexical co-occurrences; and the resolution of translation mismatches.

    Key words: dialogue acts, example-based machine translation, interactive disambiguation, pauses, speech recognition, spoken language translation

    1. Introduction

    This paper reviews some aspects of the author's research in spoken language translation (SLT) since 1992. Since the purpose is to prompt discussion, the treatment is informal, programmatic, and speculative. There is frequent reference to work in progress (in other words, work for which evaluation is incomplete). The paper sketches work in nine areas: interactive disambiguation; system architecture; data structures; the interface between speech recognition (SR) and analysis; the use of natural pauses for segmenting utterances; example-based machine translation; dialogue acts; the tracking of lexical co-occurrences; and the resolution of translation mismatches. There is no attempt to provide a balanced survey of the SLT scene. Instead, the hope is to provide a provocative and somewhat personal look at the field by spotlighting it from nine directions; in some respects, to offer an editorial rather than purely a report.

    One of the most significant and difficult aspects of the SLT problem is the need to integrate effectively many different sorts of knowledge: phonological, prosodic, morphological, syntactic, semantic, discourse, and domain knowledge should ideally work together to produce the most accurate and helpful translation. Thus a trend toward greater integration of knowledge sources is visible in current SLT research (e.g., in the Verbmobil project, Wahlster, 1993), and most of the work described below is in this integrative direction. Many of the issues to be discussed here could in fact be addressed by dedicated pieces of software playing parts in an integrated SLT system. The paper's conclusion will review the issues by sketching an idealized system of this sort: a kind of personal dream team in which the components are team members.


    However, the first topic to be discussed is a renegade, headed in exactly the opposite direction. This is because, while continuing my concern with integration of SLT system components, I have become interested in an alternative system design which, in sharp contrast, stresses a clean separation between SR and translation. The thrust of this alternative "low road" or "quick and dirty" approach is to substitute temporarily intensive user interaction for system integration, thereby attempting a radical design simplification in hopes of fielding practical, broad-coverage systems as soon as possible.

    To accommodate this renegade on the one hand and the team players on the other, the paper will be not only somewhat personal, but also two-faced. I will begin by advocating a "low road", non-integrated approach for the near term throughout Section 2. Two demonstrations of highly interactive, broad-coverage SLT will be reported and discussed. Then, performing an about-face, I will go on in the remaining sections to consider elements of a more satisfying integrated approach for the longer term.

    2. Interactive Disambiguation

    In the present state of the art, several stages of SLT leave ambiguities which current techniques cannot yet resolve correctly and automatically. Such residual ambiguity plagues SR, analysis, transfer, and generation alike.

    Since users can generally resolve these ambiguities, it seems reasonable to incorporate facilities for interactive disambiguation into SLT systems, especially those aiming for broad coverage. A good idea of the range of work in this area can be gained from Boitet (1996a).

    In fact, Seligman (1997) suggests that, by stressing such interactive disambiguation (for instance, by using highly interactive commercial dictation systems for input, and by adapting existing techniques for interactive disambiguation of text translation: Boitet, 1996b; Blanchon, 1996), practically usable SLT systems may be constructable in the near term. In such "quick and dirty" or "low road" SLT systems, user interaction is substituted for system integration. For example, the interface between SR and analysis can be supplied entirely by the user, who can correct SR results before passing them to translation components, thus bypassing any attempt at effective communication or feedback between SR and MT.

    The argument, however, is not that the "high road" toward integrated and maximally automatic systems should be abandoned. Rather, it is that the low road of forgoing integration and embracing interaction may offer the quickest route to widespread usability, and that experience with real use is vital for progress. Clearly, the high road is the most desirable for the longer term: integration of knowledge sources is a fundamental issue for both cognitive and computer science, and maximally automatic use is intrinsically desirable. The suggestion, then, is that the low and high roads be traveled in tandem; and that even systems aiming for full automaticity recognize the need for interactive resolution when automatic


    resolution is insufficient. As progress is made along the high road and increasing knowledge can be applied to automatic ambiguity resolution, interactive resolution should be necessary less often. When it is necessary, its quality should be improved: questions put to the user should become more sensible and more tightly focused.
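The trade-off just sketched, automatic resolution where the evidence suffices and a tightly focused question to the user otherwise, can be caricatured in a few lines. This is an illustrative toy only, not code from any system discussed here; the candidate scores, the margin threshold, and the `ask_user` callback are all invented:

```python
# Hedged sketch: resolve an SR ambiguity automatically when one candidate
# clearly wins, and fall back to a focused interactive question when the
# automatic evidence is too close to call. All numbers are invented.

def resolve(candidates, ask_user, margin=0.2):
    """candidates: list of (word, score); ask_user: fallback callback."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best[1] - runner_up[1] >= margin:
        return best[0]                      # automatic resolution suffices
    # Interactive resolution: a tightly focused two-way question,
    # rather than asking the user to retype the whole input.
    options = [word for word, _ in ranked[:2]]
    return ask_user(options)

# Clear case: resolved automatically, the user is never bothered.
print(resolve([("hello", 0.9), ("hollow", 0.1)], ask_user=lambda opts: opts[0]))
# Unclear case: the user picks between the two top candidates.
print(resolve([("hello", 0.51), ("hollow", 0.49)], ask_user=lambda opts: "hello"))
```

As the high road supplies better knowledge sources, the margin test would fire more often and the fallback less, matching the progression described above.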

    2.1. TWO INTERACTIVE DEMOS

    These design concepts have been informally and partly tested in two demos, first at the Machine Translation Summit in San Diego in October, 1997, and a second time at the meeting of C-Star II (Consortium for Speech Translation Advanced Research) in Grenoble, France, in January, 1998. Both demos were organized and conducted under the supervision of Mary Flanagan, and both demo systems were based upon a text-based chat translation system previously built by Flanagan's team at CompuServe, Inc. The company's proprietary on-line chat technology was used, as distinct from Internet Relay Chat, or IRC (Pyra, 1995).1

    In an on-line chat session, users most often converse as a group, though one-on-one conversations are also easy to arrange. Each conversant has a small window used for typing input. Once the input text is finished, the user sends it to the chat server by pressing Return. The text comes back to the sender after an imperceptible interval, and appears in a larger window, prefaced by a header indicating the author. Since this larger window receives input from all parties to the chat conversation, it soon comes to resemble the transcript of a cocktail party, often with several conversations interleaved.

    Each party normally sees the "same" transcript window. However, prior to the SLT demos, CompuServe had arranged to place at the chat server a commercial translation system of the direct variety, enabling several translation directions. Once the user of this experimental chat system had selected a direction (say English-French), all lines in the transcript window would appear in the source language (in this case, English), even if some of the contributions originated in the target language (here, French). Bilingual text conversations were thus enabled between English typists and writers of French, German, Spanish, or Italian.

    At the time of the demos, total delay from the pressing of Return until the arrival of translated text in the interlocutor's transcript window averaged about six seconds, well within tolerable limits for conversation.2

    At the author's suggestion and with his consultation, highly interactive SLT demos were created by adding SR front ends and speech-synthesis back ends to CompuServe's text-based chat-translation system. Two laptops were used, one running English input and output software (in addition to the CompuServe client, modified as explained below), and one running the comparable French programs. Commercial dictation software was employed for SR. For the first demo, both sides used discrete dictation, in which short pauses are required between words; for the second demo, English was dictated continuously, that is, without required pauses, while French continued to be dictated discretely.3
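The demo dataflow just described can be sketched as a minimal pipeline. Every function below is an invented stand-in (the real demos used commercial dictation engines, CompuServe's server-side MT, and vendor speech synthesis); the toy glossary exists only to make the example runnable:

```python
# Caricature of the demo pipeline: dictation -> user correction ->
# server-side translation -> speech synthesis. All components are
# hypothetical stand-ins for the commercial products described above.

def dictate(utterance_audio):
    return "hello how are you"             # stand-in recognizer output

def user_correct(text):
    return text                            # the user fixes SR errors here

def server_translate(text, direction="en-fr"):
    # Stand-in for the chat server's direct-transfer MT component.
    toy = {"hello": "bonjour", "how": "comment", "are": "allez", "you": "vous"}
    return " ".join(toy.get(word, word) for word in text.split())

def synthesize(text):
    return f"<speaking: {text}>"           # stand-in for the synthesis engine

def chat_turn(audio):
    # Note the strictly linear, non-integrated "low road" arrangement:
    # the only interface between SR and MT is the user's correction step.
    text = user_correct(dictate(audio))
    return synthesize(server_translate(text))

print(chat_turn(b"..."))
```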


    At the time of the demos, the discrete products allowed dictation directly into the chat input buffer, but the continuous products required dictation into their own dedicated window. Thus for continuous English input it became necessary to employ third-party software4 to create a macro which (a) transferred dictated text to the chat input buffer and (b) inserted a Return as a signal to send the chat.5

    Commercial speech synthesis programs packaged with the discrete dictation products were used for voice synthesis. Using development software sold separately by the dictation vendor, CompuServe's chat client software was customized so that, as each text string returning from the chat server was written to the transcript window, it was simultaneously sent to the speech synthesis engine to be pronounced in the appropriate language. The text read aloud in this way was either the user's own, transmitted without changes, or the translation of an interlocutor's input.

    The first demo took place in an auditorium before a quiet audience of perhaps 100, while the second was presented to numerous small groups in a booth in a noisy room of medium size. Each demo began with ten scripted and pre-tested utterances, and then continued with improvised utterances, sometimes solicited from the audience: perhaps six in the first demo, and 50 or more in the second. Some examples of improvised sentences are given in (1)-(2).

    (1) French: Qu'est-ce que vous étudiez? (What do you study?)
        English: Computer science. (L'informatique.)

    (2) French: Qu'est-ce que vous faites plus tard? (What are you doing later?)
        English: I'm going skiing. (Je vais faire du ski.)
        French: Vous n'avez pas besoin de travailler? (You don't need to work?)
        English: I'll take my computer with me. (Je prendrai mon ordinateur avec moi.)
        French: Où est-ce que vous mettrez l'ordinateur pendant que vous skiez? (Where will you put the computer while you ski?)
        English: In my pocket. (Dans ma poche.)

    As these examples suggest, the level of language remained basic, and sentences were purposely kept short, with standard grammar and punctuation.

    2.2. DISCUSSION OF DEMOS

    A primary purpose of the chat SLT demos was to show that SLT is both feasible and suitable for on-line chat users, at least at the proof-of-concept level.

    In my own view, the demos were successful in this respect. The basic feasibility of the approach appears in the fact that most demo utterances were translated



    comprehensibly and within tolerable time limits. It is true that the language, while mostly spontaneous, was consciously kept quite basic and standard. It is also true that there were occasional translation errors (discussed below). Nevertheless, the demos can plausibly be claimed to show that chatters making a reasonable effort could successfully socialize in this way. As preliminary evidence that many users could adjust to the system's limitations, we can remark that the dozen or so utterances suggested by the audience, once repeated verbatim by the demonstrators, were successfully recognized, translated, and pronounced in every case.

    In addition to the general demo goals just mentioned, the author also had his own, more specific axes to grind from the viewpoint of SLT research. I hoped the demos would be the first to show broad-coverage SLT of usable quality; and I hoped they would highlight the potential usefulness of interactive disambiguation in moving toward practical broad-coverage systems.6

    I believe that these goals, too, were reached. Coverage was indeed broad by contemporary standards. There was no restriction on conversational topic: no need, for instance, to remain within the area of airline reservations, appointment scheduling, or street directions. As long as the speakers stayed within the dictation and translation lexica (each in the tens of thousands of words), they were free to chat and banter as they liked.

    The usefulness of interaction in achieving this breadth was also clear: verbal corrections of dictation results were indeed necessary for perhaps 5% to 10% of the input words. To give only the most annoying example, Hello was once initially transcribed as Hollow. Here we see with painful clarity the limitations of an approach which substitutes interactive disambiguation for automatic knowledge-based disambiguation: even the most rudimentary discourse knowledge should have allowed the program to judge which word was more likely as a dialogue opener. On the other hand, the approach's capacity to compensate for lack of such knowledge was also clear: a verbal correction was quickly made, using facilities supplied by the dictation vendor.

    It should be stressed that the SLT system of the CompuServe demos was by no means the first or only system to permit interactive monitoring of SR output before translation. As far back as the C-Star I international SLT demonstrations of 1993,7 selection among SR candidates was essential for most participating systems. Similarly, selection among, or typed correction of, candidates is possible in most of the systems shown in the recent C-Star II demos of July 22, 1999.8

    The CompuServe experiments were, however, the first to demonstrate that a broad-coverage SLT system of usable quality, a system capable of extending coverage beyond specialized domains toward unrestricted discourse, could be constructed by enabling users to correct economically the output of a broad-coverage SR component before passing the results to a broad-coverage MT component.

    Ergonomic operation was an important element in the system's success. The SR correction facilities used in the experiments (the set of verbal revision commands


    supplied by the dictation product, including "scratch that", "correct (word)", etc.) were designed for general use in a competitive market, and thus of necessity show considerable attention to ergonomic issues. (By contrast, the SR components of other research systems continue to rely on typed correction or menu selection.) Of course, a smooth human interface between SR and MT cannot by itself yield broad coverage; what it can do is to permit the unexpected combination of SR and MT components developed separately, with broad coverage rather than SLT in mind.

    This reliance on interactive correction raises obvious questions: Is the current amount and type of dictation correction tolerable for practical use? Would additional interaction for guiding or correcting translation be useful? Even if potentially useful, would it be tolerated, or would it break the camel's back?

    2.2.1. Correction of Dictation

    The interaction required in the current demos for correcting dictation is just that currently required for correcting text dictation in general. All current dictation products require interactive correction. The question is, do the advantages of dictation over typing nevertheless justify the cost of these products, plus the trouble of acquiring them, training them, and learning to use them? Their steadily increasing user base indicates that many users think so. (For the record, portions of this paper were produced using continuous dictation software.) My own impression is that, during the demos, continuous dictation with spoken corrections supplied correct text at least twice as fast as my own reasonably skilled typing would have done.

    For readers who have never tried dictating, a description of the dictation correction process available in Seligman et al. (1998b) may help to realistically estimate the correction burden.

    While a strict hands-off policy was adopted for the demos, it is worth noting that typed text and commands can be freely interspersed with spoken text and commands. It is sometimes handy, for instance, to select an error using the mouse, and then verbally apply any of the above-mentioned correction commands. Similarly, when spelling becomes necessary, typing often turns out to be faster than spoken spelling. Thus verbal input becomes one option among several, to be chosen when (as often happens) it offers the easiest or fastest path to the desired text. The question, then, is no longer whether to type or dictate the discourse as a whole, but which mode is most convenient for the input task immediately at hand. As broad-coverage SLT systems in the near term are likely to remain multi-modal rather than exclusively telephonic, they can and should take advantage of this flexibility.

    2.2.2. Correction of Translation

    The current demos were not intended to demonstrate the full range of interactive possibilities. In particular, while dictation results were corrected on-line as just discussed, there was no comparable attempt at interactive disambiguation of translation. Thus, when ambiguities occurred, the speaker had no way to control or check the translation results.

    For example, when the English partner concluded one dialog by saying (3a), the French partner saw and heard (3b), which might be literally rendered as (3c).

    (3) a. It was a pleasure working with you.

        b. C'était un plaisir fonctionner avec vous.
        c. It was a pleasure functioning with you.

    The word work, in other words, had been translated as fonctionner, as would be appropriate for an input like (4).

    (4) This program is not working.

    Such translation errors were not disruptive during the demos: they were infrequent, and many of the errors which did appear might be more amusing than bothersome in the sort of informal socializing seen in most on-line chat today.

    However, errors arising from lexical and structural ambiguities might well be more numerous and more disruptive in future, more sensitive chat-translation applications. Further, it seems doubtful that they can be eliminated in near-term systems aiming for both broad coverage and high quality, even assuming effective use of multiple knowledge sources like those described below. Thus my own guess is that interactive resolution of ambiguities during chat translation would in fact prove valuable. Feedback concerning the translation, via some form of back-translation, would probably prove useful as well. Again, for discussion of possible techniques, see Boitet (1996a) and Blanchon (1996).

    But even granting that interactive correction could raise the quality of SLT, would users be willing to supply it? There is some indication that the degree of interaction now required in the demos to correct dictation may already be near the tolerable limit for chat as it is presently used (Flanagan, 1997). A healthy skepticism concerning the practicality of real-time translation correction is thus warranted. I suspect, however, that users' tolerance for interactive correction will turn out to depend on the application and the value of correct translation: to the extent that real-time MT can move beyond socializing into business, emergency, military, or other relatively crucial and sensitive applications, user tolerance for interaction can be expected to increase.

    Ultimately, though, questions about the trade-off between the burden of interaction and its worth should be treated as topics for research: using a specified system, what level of quality is required for given applications (specified in terms of tasks to be accomplished within specified time limits), and what types and amounts of interaction are required on average to achieve that quality level? Clearly, until SLT systems with translation correction capabilities are built, no such experiments will be possible.


    Having discussed the role of interactive disambiguation in SLT, and having described two experiments with highly interactive SLT, we now turn on our heels as forecast toward more integration-oriented studies. We begin with considerations of SLT system architecture.

    3. System Architecture

    An ideal architecture for "high road", or highly integrated, SLT systems would allow global coordination of, cooperation between, and feedback among, components (SR, analysis, transfer, etc.), thus moving away from linear or pipeline arrangements. For instance, SR, as it moves through an utterance, should be able to benefit from preliminary analysis results for segments earlier in the utterance. The architecture should also be modular, so that a variety of configurations can be tried: it should be possible, for instance, to exchange competing SR components; and it should be possible to combine components not explicitly intended to work together, even if these are written in different languages or running on different machines.

    Blackboard architectures have been proposed (Erman & Lesser, 1990) to permit cooperation among components. In such systems, all participating components read from and write to a central set of data structures, the blackboard. To share this common area, however, the components must all "speak a common (software) language". Modularity thus suffers, since it is difficult to assemble a system from components developed separately. Further, blackboard systems are widely seen as difficult to debug, since control is typically distributed, with each component determining independently when to act and what actions to take.

    In order to maintain the cooperative benefits of a blackboard system while enhancing modularity and facilitating central coordination or control of components, Seligman and Boitet (1993) and Boitet and Seligman (1994) proposed and demonstrated a "whiteboard" architecture for SLT. As in the blackboard architecture, a central data structure is maintained which contains selected results of all components. However, the components do not access this whiteboard directly. Instead, only a privileged program called the Coordinator can read from it and write to it. Each component communicates with the Coordinator and the whiteboard via a go-between program called a Manager, which handles messages to and from the Coordinator in a set of mailbox files. Because files are used as data-holding areas in this way, components (and their managers) can be freely distributed across many machines.9
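The Coordinator/Manager arrangement just described can be sketched schematically. This is a hedged toy under simplifying assumptions: in-memory queues stand in for the mailbox files, the message format is invented, and the stand-in components are trivial callables (the actual demo used Lisp and C components):

```python
# Minimal sketch of the whiteboard arrangement: components reach the
# central whiteboard only through Managers and the Coordinator.
from queue import Queue

class Manager:
    """Go-between: relays a component's output toward the Coordinator,
    'interpreting' it into the whiteboard's common message vocabulary."""
    def __init__(self, name, component, to_coordinator):
        self.name = name
        self.component = component     # any callable, any native "language"
        self.outbox = to_coordinator   # stands in for a mailbox file

    def run(self, data):
        native_result = self.component(data)
        self.outbox.put({"from": self.name, "result": native_result})

class Coordinator:
    """The only program allowed to read from and write to the whiteboard."""
    def __init__(self):
        self.whiteboard = {}           # central data structure
        self.inbox = Queue()

    def collect(self):
        while not self.inbox.empty():
            msg = self.inbox.get()
            self.whiteboard[msg["from"]] = msg["result"]

coord = Coordinator()
sr = Manager("SR", lambda audio: ["hello", "world"], coord.inbox)
analysis = Manager("analysis", lambda words: ("S", ("NP",), ("VP",)), coord.inbox)
sr.run(b"...")
analysis.run(["hello", "world"])
coord.collect()
print(sorted(coord.whiteboard))  # components never touch the whiteboard directly
```

Because each Manager only posts messages, exchanging one SR component for a competitor, or running components on different machines, changes nothing on the whiteboard side, which is the modularity argument made above.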

    Managers are not only mailmen, but interpreters: they translate between the reserved language of the whiteboard and the native languages of the components, which are thus free to differ. In our demo, the whiteboard was maintained in a commercial Lisp-based object-oriented language, while components included independently-developed SR, analysis, and word-lookup components written in C. Overall, the whiteboard architecture can be seen as an adaptation of blackboard


    architectures for client-server operations: the Coordinator becomes the main client for several components behaving as servers.

    Since the Coordinator surveys the whiteboard, in which are assembled the selected results of all components, all represented in a single software interlingua, it is indeed well situated to provide central or global coordination. However, any degree of distributed control can also be achieved by providing appropriate programs alongside the Coordinator which represent the components from the whiteboard side. That is, to dilute the Coordinator's omnipotence, a number of demi-gods can be created. In one possible partly distributed control structure, the Coordinator would oversee a set of agendas, one or more for each component.

    A closely related effort to create a modular "agent-based" (client-server-style) architecture with a central data structure, usable for many sorts of systems including SLT, is described by Julia et al. (1997). Lacking a central board but still aiming in a similar spirit for modularity in various sorts of translation applications is the project described by Zajac and Casper (1997). Further discussion of SLT architecture from the alternative viewpoint of the Verbmobil system appears in Görz et al. (1996). For discussion of a recent DARPA initiative stressing modular switching of components for experimentation, see Aberdeen et al. (1999).

    4. Data Structures

    We have argued the desirability for system coordination of a central data structure where selected results of various components are assembled. The question remains how that data structure should be arranged. The ideal structure should clarify all of the relevant relationships, in particular clearing up the matter of representational "levels", a confusing term with several competing interpretations.

    Boitet and Seligman (1994) presented several arguments for the use of interrelated lattices for maintaining components' results. Here I present one possible elaboration, suggesting a multi-dimensional set of structures in which three meanings of "level" are kept distinct (Figure 1).

    We first distinguish an arbitrary number of "Stages of Translation", with each stage viewable as a long scroll of paper extending across our view from left to right. Left-right is the time dimension, with earlier elements on the left. The Stage 0 scroll represents the raw input to the SLT system, including for example the unprocessed speech input from both speakers and the record of one speaker's mouse clicks on an on-screen map, such as might be used for a direction-finding task. In its full extent from left to right, Stage 0 would thus include the raw input for a translation session once complete, for example, for a dialogue to be translated.

    Stage 1 contains the results of the first stage of processing, whatever processes might be involved. This scroll, viewed as unrolling behind Stage 0, might for instance include twin sets of lattices representing the results of phoneme spotting within both speakers' raw input. Stages 2, 3, ..., n unroll in turn behind Stage 1, receding in depth. Stage 2 might include source-language syntactic trees; Stage 3



    Figure 1. Multi-dimensional data structures for speech translation.

    might include semantic structures derived from these trees; and so on, through MT transfer and generation to the final stage, a scroll behind all other scrolls, which might contain translated text annotated for speech synthesis. Pointers (diagonal light lines) would indicate relationships between elements in subsequent stages.

    Each stage can be subdivided both vertically and horizontally. Vertical boundaries (vertical dashed lines) represent appropriate time or segment divisions, probably including utterances. Horizontal divisions represent "Tracks", since at each stage several separable signal sources may be under consideration. As already pointed out, Stage 1 in the figure includes two raw speech tracks and a track indicating mouse clicks on a map. Stage 2 might contain, in addition to tracks for the phoneme lattices already mentioned, other tracks (hidden from view) containing F0 curves extracted from the respective speech signals. Different stages may have different numbers of tracks, depending on the processes which define them.

    Finally, within each track at a given stage, we can distinguish varying levels of "Height" on the page, that is, various values on the y-axis corresponding to given time values along the x-axis. These can be given various interpretations as appropriate for the type of track in question. When the track contains syntactic trees, height corresponds to syntactic rank, i.e., dominance, with dominant nodes usually covering longer time spans than dominated ones.

    Confusion regarding the meaning of "level" bedevils many discussions of MT: it sometimes means a stage of processing, sometimes a mode or type of information,


    NINE ISSUESIN SPEECHTRANSLATION 159and sometimes a gradationof dominance or span. The hope is that,by clearlydistinguishinghesemeaningsas stages,tracks,orheightwithintracks,we canhelpbothprogrammersndprogramskeeptheirbearingsamid a welter of information.The multi-dimensionalstructures ust described bear some resemblance tothe three-dimensional harts of Barnettet al. (1990), used to trackrelationshipsbetweensyntacticand semanticstructures uringanalysisof queries o CYC know-ledge bases (Lenat& Guha,1990).Theyweredeveloped ndependently, owever.Three-dimensional hartswere restricted o two depthsor stages (syntacticandsemantic), ackedtracks,and made no explicitreference o heightor rank.The whiteboarddemo reported n SeligmanandBoitet (1993) likewise madeonly partialuse of the multi-dimensional tructure: tages andheightwere expli-citly represented nd shown n thegraphicaluserinterface,withexplicit represent-ation of relationsbetween structuresn subsequent tages;but trackswere not yetincluded.

5. Interface between SR and MT Analysis

In a certain sense, SR and analysis for MT are comparable problems. Both require the recognition of the most probable sequences of elements. In SR, sequences of short speech segments must be recognized as phones, and sequences of phones must be recognized as words. In analysis, sequences of words must be recognized as phrases, sentences, and utterances.

Despite this similarity, current SLT systems use quite different techniques for phone, word, and syntactic recognition. Phone recognition is generally handled using Hidden Markov Models (HMMs); word recognition is often handled using Viterbi-style search for the best paths in phone lattices; and sentence recognition is handled through a variety of parsing techniques.

It can be argued that these differences are justified by differences of scale, perplexity, and meaningfulness. On the other hand, they introduce the need for interfaces between processing levels. The processors may thus become black boxes to each other, when seamless connection and easy communication might well be preferable. In particular, word recognition and syntactic analysis (of phrases, sentences, and utterances) should have a lot to say to each other: the probability of a word should depend on its place in the top-down context of surrounding words, just as the probability of a phrase or larger syntactic unit should depend on the bottom-up information of the words which it contains.

To integrate SR and analysis more tightly, it is possible to employ a single grammar for both processes, one whose terminals are phones and whose non-terminals are words, phrases, sentences, etc.10 This phone-grounded strategy was used to good effect for example in the HMM-LR SR component of the ASURA SLT system (Morimoto et al., 1993), in which an LR parser extended a parse phone by phone and left to right while building a full syntactic tree.11 The technique worked well for scripted examples. For spontaneous examples, however, performance was unsatisfactory, because of the gaps, repairs, and other noise common in spontaneous speech. To deal with such structural problems, an island-driven parsing style might well be preferable. An island-based chart parser, like that of Stock et al. (1989), would be a good candidate.

However, chart initialization presents some technical problems. There is no difficulty in computing a lattice from spotted phones, given information regarding the maximum gap and overlap of phones. But it is not trivial to convert that lattice into a "chart" (i.e., a multi-path finite-state automaton) without introducing spurious extra paths. The author has implemented a Common Lisp program which does so correctly, based on an algorithm by C. Boitet (Seligman et al., 1998a). The algorithm tracks, for each node of an automaton under construction, the lattice arcs which it reflects and the lattice nodes at their origins and extremities. An extension of the procedure permits the inclusion of null, or epsilon, arcs in the output automaton. The method has been successfully applied to lattices derived from dictionaries, i.e., very large corpora of strings. (Full source code and pseudocode are available from the author.) Experiments with bottom-up island-driven chart parsing from charts initialized with phones are anticipated.
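For illustration, the conversion problem can be sketched as follows. This is a deliberately naive version, not the Boitet algorithm cited above: arc endpoints are merged purely by time proximity, which is exactly how spurious extra paths can be introduced; the data shapes and tolerance are assumptions.

```python
# Naive sketch (assumed data shapes; NOT the Boitet algorithm): build a
# multi-path finite-state automaton from time-stamped phone hypotheses
# by snapping arc endpoints to shared time nodes.

def lattice_to_chart(arcs, tolerance=0.03):
    """arcs: list of (label, start_time, end_time) phone hypotheses.
    Returns (nodes, transitions): node times and (from, label, to) arcs."""
    times = sorted({t for _, s, e in arcs for t in (s, e)})
    # Merge time points closer than `tolerance` into single chart nodes.
    nodes = []
    for t in times:
        if not nodes or t - nodes[-1] > tolerance:
            nodes.append(t)

    def snap(t):
        # Index of the chart node nearest to time t.
        return min(range(len(nodes)), key=lambda i: abs(nodes[i] - t))

    transitions = [(snap(s), label, snap(e)) for label, s, e in arcs]
    return nodes, transitions

phones = [("k", 0.00, 0.08), ("y", 0.08, 0.15), ("o", 0.16, 0.30),
          ("g", 0.30, 0.38), ("a", 0.38, 0.52)]
nodes, trans = lattice_to_chart(phones)
```

Here the small gap between [y] (ending at 0.15) and [o] (starting at 0.16) is absorbed into one node; a correct algorithm must do such merging without creating paths that no original lattice path supports.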

6. Use of Pauses for Segmentation

It is widely believed that prosody can prove crucial for SR and analysis of spontaneous speech.12 Several aspects of prosody might be exploited: pitch contours, rhythm, volume modulation, etc. However, Seligman et al. (1997) proposed focusing on natural pauses as an aspect of prosody which is both important and relatively easy to detect automatically.13

Given the frequency of utterances in spontaneous speech which are not fully well-formed - which contain repairs, hesitations, and fragments - strategies for dividing and conquering utterances would be quite useful. The suggestion is that natural pauses can play a part in such a strategy: that "pause units", or segments within utterances bounded by natural pauses, can provide chunks which (a) are reliably shorter and less variable in length than entire utterances and (b) are relatively well-behaved internally from the syntactic viewpoint, though analysis of the relationships among them appears more problematic.

Our investigation began with transcriptions of four spontaneous Japanese dialogues concerning a simulated direction-finding task. The dialogues were carried out in the EMMI-ATR Environment for Multi-modal Interaction (Loken-Kim et al., 1993; Furukawa et al., 1993), two using telephone connections only, and two employing on-screen graphics and video as well. In each 3- to 7-minute dialogue, a caller pretending to be at Kyoto station received from a pre-trained "agent" directions to a conference center and/or hotel. In the multimedia setup, both the caller and agent could draw on on-screen maps and exchange typed information.

Morphologically tagged transcripts of the conversations were divided into turns by the transcriber, and included hesitation expressions and other natural speech



    Figure 2. Interface used by the pause tagger.

features. We then added to the transcripts information concerning the placement and length of significant pauses. For our purposes, a significant pause was either a juncture of any length where breathing was clearly indicated (sometimes a bit less than 300 milliseconds) or a silence lasting approximately 300 milliseconds or more.

To facilitate pause tagging, we prepared a customized configuration of the Xwaves speech display program14 so that it showed synchronized but separate speech tracks of both parties on screen (Figure 2). The pause tagger, referring to the transcript, could use the mouse to draw labeled lines through the tracks indicating the starts and ends of turns; the starts and ends of segments within turns; and the starts and ends of response syllables which occur during the other speaker's turn. Visual placement of labels was quite clear in most cases. As a secondary job, the tagger inserted a special character into a copy of the transcript text wherever pauses occurred within turns.

After tagging, labels bearing exact timing information were downloaded to separate files. Because there should be a one-to-one mapping between labeled pauses within turns and marked pause locations in the transcript, it was then possible to create augmented transcripts by substituting accurate pause-length information into the transcripts at marked pause points.

In studying the augmented transcripts, four specific questions were addressed:

1. Are pause units reliably shorter than whole utterances? If they were not, they could hardly be useful in simplifying analysis. It was found, however, that,


in the corpus investigated, pause units are in fact about 60% the length of entire utterances, on the average, when measured in Japanese morphemes. The average length of pause units was 5.89 morphemes, as compared with 9.39 for whole utterances. Further, pause units are less variable in length than entire utterances: the standard deviation is 5.79 as compared with 12.97.

2. Would hesitations give even shorter, and thus perhaps even more manageable, segments if used as alternate or additional boundaries? The answer seems to be that because hesitations so often coincide with pause boundaries, the segments they mark out are nearly the same as the segments marked by pauses alone. No combination of expressions was found which gave segments as much as one morpheme shorter than pause units on average.

3. Is the syntax within pause units relatively manageable? A manual survey showed that, once hesitation expressions are filtered from them, some 90% of the pause units studied can be parsed using standard Japanese grammars; a variety of special problems appear in the remaining 10%.

4. Is translation of isolated pause units a possibility? We found that a majority of the pause units in four dialogues gave understandable translations into English when translated by hand.

The study provided encouragement for a "divide and conquer" analysis strategy, in which parsing and perhaps translation of pause units is carried out before, or even without, attempts to create coherent analyses of entire utterances.

As mentioned, parsability of spontaneous utterances might be enhanced by filtering hesitation expressions from them in preprocessing. Research on spotting techniques for such expressions would thus seem to be worthwhile. Researchers can exploit a speaker's tendency to lengthen hesitations, and to use them just before or after natural pauses.

Use of pause information for "dividing utterances into meaningful chunks" during SLT of Japanese is described by Takezawa et al. (1999). Pauses are used as

segment boundaries in several commercial dictation products, but no descriptions are available.
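The segmentation criterion just described (silences of roughly 300 milliseconds or more) can be sketched as follows, assuming morpheme-level timings are available; the data and the morpheme-count statistics are invented for illustration.

```python
# Illustrative sketch (assumed data shapes, not the study's tooling):
# split a transcribed turn into pause units at silences of >= 300 ms,
# then report mean and standard deviation of unit lengths in morphemes.
import statistics

PAUSE_THRESHOLD = 0.300  # seconds; the "significant pause" criterion

def pause_units(morphs):
    """morphs: list of (morpheme, start_time, end_time), in order."""
    units, current = [], []
    for i, (m, start, end) in enumerate(morphs):
        if current and start - morphs[i - 1][2] >= PAUSE_THRESHOLD:
            units.append(current)   # a significant silence: close the unit
            current = []
        current.append(m)
    if current:
        units.append(current)
    return units

turn = [("kyoto", 0.0, 0.4), ("no", 0.4, 0.5), ("kaigi", 0.55, 0.9),
        ("wa", 0.9, 1.0), ("asu", 1.4, 1.7), ("desu", 1.7, 2.0)]
units = pause_units(turn)
lengths = [len(u) for u in units]
mean, sd = statistics.mean(lengths), statistics.pstdev(lengths)
```

The 0.4-second silence before "asu" yields two pause units; the short 0.05-second gap inside the first unit is ignored.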

7. Example-Based SLT

Example-based MT (EBMT) (Nagao, 1984; Sato, 1991) is translation by analogy. An EBMT system translates source-language sentences by reference to an "example base", or set of source-language utterances paired with their target-language equivalents. In developing such a system, the hope is to improve translation quality by reusing correct and idiomatic translations; to partly automate grammar development; and to gain insight into language learning.

Two EBMT systems are now being applied to SLT: the TDMT (Transfer-driven MT) system developed at ATR (Furuse & Iida, 1996; Iida et al., 1996; Sumita & Iida, 1992), used in the ATR-Matrix SLT system (Takezawa et al., 1999); and the PanEBMT system (Brown, 1996) of CMU, used along with transfer-based MT


    NINEISSUESIN SPEECHTRANSLATION 163within the multi-engineMT architecturen the DiplomatSLTsystem (Frederkingetal., 1997).Despitetheircommonaims,the two systemsdiffersubstantially.The ATRsys-tem aims to supplya completetranslationingle-handed, ndaccordinglyncludesa full parserfor utterancesand a hand-builtgrammar set of language patterns)to go with it. The CMU system, by contrast,operatesas a componentof a largersystem:in general, ts aim is to supply possiblepartial ranslations, r translationchunks,to be placedon a chartalong with chunkssuppliedby other translationengines.15Forthismission,thesystemrequiresneitherparsernorgrammar, elyinginsteadon heuristics to align sub-elementsof sentencesin the examplebase attrainingtime. Once it has put its chunksin place duringtranslation,a separateprocess,belongingto the Multi-EngineMT architecture,will employa statisticallanguagemodel to select the best paththrough he pre-stockedchart n ordertoassemblethe finaloutput.As the suggestions below relate to a tree-orientedand end-to-end view ofexample-basedprocessing,the primaryconcern will be with systemsof the ATRtype.Webeginwith a sketchof this methodology.Consider heJapanesenounphrase 5a).Its literal ranslationmightbe (5b),buta moregraceful ranslationwould be (5c) or (5d).(5)a. kyotono kaigi

    b. conference of Kyoto
    c. conference in Kyoto
    d. Kyoto conference

We could hope to provide such improved translations if we had an example base showing for instance that (6a) had been translated as (6b) or (6c), and that (7a) had been rendered as (7b) or (7c).

(6) a. tokyo no kaigi

    b. conference in Tokyo
    c. Tokyo conference

(7) a. nyuyoku no kaigi
    b. conference in New York
    c. New York conference

The strategy would be to recognize a close similarity between the new input (5a) and these previously translated noun phrases, based on the semantic similarity


between kyoto on one hand and tokyo and nyuyoku on the other. The same sort of pattern matching could be performed against a noun phrase in the example base differing from the input at more than one point, for example (8), where miitingu ('meeting') is semantically similar to kaigi ('conference').

(8) tokyo no miitingu

    'meeting in Tokyo'

At any number of such comparison points, semantic similarity of the relevant expressions can be assessed by reference to a semantic hierarchy - for example, a type hierarchy of semantic tags supplied by a thesaurus. A thesaurus associates a lexical item like kaigi with one or more semantic tags (e.g., CITY, SOCIAL-EVENT); and the similarity of two semantic tags can be defined as the distance one must rise in the relevant semantic hierarchy to reach a node which dominates both tags: the further, the more semantically distant. The four-level hierarchy of the Kadokawa Ruigo Shin-jiten (Ohno & Hamanishi, 1981) has been used in this way in several studies.

Different translations of the Japanese genitive no construction, for example as the English possessive (tanaka-san no kuruma, 'Tanaka's car'), would be distinguished by the distinct semantic types of their respective comparison points - in this case, for example, PERSON and VEHICLE.

By replacing each comparison point in an expression like (5a) with a variable, we can obtain a pattern like (9a). Such patterns can be embedded, giving (9b) or (9c).

(9) a. [?X no ?Y]

    b. [[?X no ?Y] no ?Z]
    c. [?X no [?Y no ?Z]]

If we then receive an input like (10) we can determine which bracketing is most sensible - that is, we can parse the input - by extending the techniques already discussed for gauging semantic similarity.

(10) kyoto no kaigi no ronbun

     Kyoto-GEN conference-GEN paper
     'paper at the conference in Kyoto'

One possibility is to designate a "head" for each pattern, and to posit that a pattern's overall semantic type is the type of its head. Then semantic similarity scores can be calculated between an input like (10) and an entire set of embedded patterns - that is, an entire pattern tree - by propagating similarity scores outward (upward). One


can calculate similarity scores for several possible bracketings (trees), and choose the bracketing most semantically similar to the input. In this way, the calculation of semantic similarity guides structural disambiguation during analysis.

Having outlined the essentials of example-based processing in the tree-oriented style, we are now ready to discuss possible elaborations. The first involves the degree of separation between stages of EBMT.
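The tag-distance measure just described can be sketched as follows. The toy hierarchy, tag names, and tag assignments are invented for illustration, not taken from the Kadokawa thesaurus.

```python
# Illustrative sketch: semantic similarity of two tags as the number of
# levels one must rise to reach a node dominating both - the further,
# the more semantically distant. Hierarchy and tags are assumptions.

PARENT = {                      # child -> parent in a tiny type hierarchy
    "CITY": "PLACE", "PLACE": "CONCRETE",
    "SOCIAL-EVENT": "EVENT", "EVENT": "ABSTRACT",
    "CONCRETE": "TOP", "ABSTRACT": "TOP",
}

def ancestors(tag):
    """The tag itself plus its chain of dominating nodes, bottom-up."""
    chain = [tag]
    while tag in PARENT:
        tag = PARENT[tag]
        chain.append(tag)
    return chain

def rise_distance(tag1, tag2):
    """Levels to climb from tag1 before reaching a node dominating both."""
    up2 = set(ancestors(tag2))
    for depth, node in enumerate(ancestors(tag1)):
        if node in up2:
            return depth
    return None  # no common dominating node

TAGS = {"kyoto": "CITY", "tokyo": "CITY", "kaigi": "SOCIAL-EVENT"}
d_city = rise_distance(TAGS["kyoto"], TAGS["tokyo"])  # same tag: distance 0
d_far = rise_distance(TAGS["kyoto"], TAGS["kaigi"])   # only TOP dominates both
```

With this measure, kyoto matches tokyo closely (distance 0 at the tag level) while kyoto and kaigi are maximally distant, which is what lets the pattern matcher prefer the city-slot analogy.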

7.1. SEPARATION OF EXAMPLE-BASED ANALYSIS, TRANSFER, AND GENERATION

Recall that semantic similarity calculation can be used to select an embedded set of patterns (a parse tree) from among several competitors. If each source-language pattern (i.e., subtree) is associated with a unique target-language pattern which provides its translation, then the selection of a complete source-language tree will simultaneously and automatically provide a corresponding target-language tree. In this way, an example-based analysis process can be made to provide automatically a transfer process as well - that is, a mapping of source-language structures onto target-language structures. TDMT intentionally combines analysis and transfer in this way. The combination is seen as an advantage: the same mechanism which handles structural disambiguation simultaneously selects the right translation from among several candidates. However, the combination of phases does raise issues concerning the role of transfer in handling translation ambiguity and structural mismatches.

First, some translation applications may require an explicit account of translation ambiguity - that is, of the possibility of translating a given subtree or node in more than one way. For such applications, transfer might be treated as a separate phase of translation from source-language parsing. That is, since considerations of semantic similarity can guide the selection of target structure - just as they can guide the choice of analysis tree - we can recognize the possibility of example-based transfer as separate from example-based analysis. Furthermore, depending on the depth of analysis, even once a target-language tree has been selected, ambiguity may arise in selecting target-language surface forms to express it. Thus a separate example-based generation phase also becomes a possibility.

A second issue relates to structural mismatches between source and target. Should they be handled in the transfer phase of translation? Consider the translations in (11)-(13), for example.

(11) Zo wa, hana ga nagai.
     ELEPHANT-topic NOSE-subj BE-LONG
     'Elephants have long noses.'


(12) Taeko wa, kaminoke ga nagai.

     (NAME)-topic HAIR-subj BE-LONG
     'Taeko's hair is long.' / 'Taeko has long hair.'

(13) Watashi wa, taeko ga suki desu.
     ME-topic (NAME)-subj IS-LOVED
     'I like/love Taeko.'

In these cases, language-internal considerations dictate non-flat analyses on both the source and target sides. However, in each case, the source tree is differently configured from the target tree. Thus, to represent the correspondences completely, it is insufficient simply to map one source node (one source pattern) onto one target node (one target pattern); rather, we need to intermap arbitrary subtree configurations (embedded pattern sets). In current implementations of tree-oriented EBMT, such general mappings between subtrees are not supported during transfer; rather, they are handled by special-purpose post-processing routines. It might prove easier to arrange a more general treatment for such intermappings if transfer were treated as a separate translation phase.

An experiment reported by Sobashima and Iida (1995) and Sobashima and Seligman (1994) takes a first step toward clear separation of EBMT phases: it presents an example-based treatment of analysis only. (Further information is given below.) A distinct example-based transfer phase including facilities for intermapping embedded patterns was envisaged, but has not yet been implemented.

7.2. MULTIPLE DIMENSIONS OF SIMILARITY

So far we have discussed the measurement of similarity along the semantic scale only. But utterances and structures can be compared along other dimensions as well. Thus for example, when assessing the similarity between a given pattern and the input pattern to be translated, we could ask not only how semantically similar its elements are to those of the input pattern, but how syntactically similar as well, or how graphologically or phonologically similar.

Sobashima and Iida (1995) and Sobashima and Seligman (1994) describe facilities for measuring and combining several sorts of similarity. Syntactic similarity, for instance, is measured with reference to a syntactic ontology, comparable to the thesaurus-based semantic hierarchy discussed above; and a score indicating overall similarity of respective variable elements in two patterns is calculated by combining syntactic and semantic similarity scores. The reported implementation also considered, as a factor in overall similarity, a score indicating graphological similarity: 1 for a complete match, and 0 in other cases. Future versions, however, might instead measure phonological similarity, for instance by means of a phone-type ontology indicating, for example, that [ʃ] and [tʃ] are similar sounds, while


[ʃ] and [k] are more different. Below, we briefly indicate how multiple similarity dimensions entered into the calculation of overall similarity.

Once we recognize the possibility of considering phonological similarity as a factor in overall similarity between patterns, we move EBMT beyond text translation into the area of SLT. We could, for instance, attempt to disambiguate the speech act of an utterance by comparing the prosodic contours of its elements with the contours of elements of labeled utterances in a database. Such prosodic comparisons might help, for example, to distinguish politely hesitant statements and yes-no questions in Japanese. These utterance types are syntactically marked by final particles ga and ka, which are phonologically quite difficult to distinguish; their prosodies, however, tend to be quite distinct.

In any case, use of similarity measurements along multiple dimensions as an aid to disambiguation would be very much in the spirit of the "high road", or integrative, approach to SLT discussed throughout.16

7.3. BOTH TOP-DOWN AND BOTTOM-UP

In most current example-based systems, the applicability of a pattern is judged by the semantic match of its sub-elements against those of the input. These are bottom-up similarity judgments: the sub-elements provide evidence for the presence of the pattern as a whole. Usually absent, however, are corresponding top-down similarity judgments whereby the patterns give evidence for the sub-elements. Sobashima and Iida (1995) and Sobashima and Seligman (1994), however, do demonstrate application of both bottom-up and top-down similarity constraints. Further, similarity is measured in both directions along several dimensions (syntactic, semantic, and others), as suggested above. We now briefly describe the method.

First, some necessary background. Consider a "linguistic expression", which may be either atomic or complex. Complex expressions are composed of variables and/or fixed lexical elements, as in (9a) above. We calculate the "elemental similarity", or E-Sim, of two expressions as a combined function of their syntactic, semantic, and phonological or graphological similarities.17

Now we are ready to consider top-down vs. bottom-up similarity measurement. We calculate the "structural similarity", or "bottom-up similarity", of two complex expressions by combining the elemental similarities of their respective elements. By contrast, the top-down factor in the similarity of two expressions A and B is a measure of the similarity of their respective contexts. We call this factor the "contextual similarity" C-Sim of expressions A and B, and calculate it as the sum of the elemental similarities of their respective left and right neighbor expressions L and R (14).

(14) C-Sim(A, B) = E-Sim(L_A, L_B) + E-Sim(R_A, R_B)


The final, or integrated, similarity score Sim for expressions S_1 and S_2, then, is the combination of their structural (bottom-up) similarity and their contextual (top-down) similarity (15).

(15) Sim(S_1, S_2) = S-Sim(S_1, S_2) * C-Sim(S_1, S_2)

We have seen that Sim incorporates multi-dimensional similarity measurements applied both top-down and bottom-up. The next question is how to apply this score for example-based analysis. We now outline the method proposed in the cited papers.
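Equations (14) and (15) can be sketched as follows. The elemental-similarity lookup table, and the use of a simple sum for structural similarity, are assumptions for illustration; in the cited papers E-Sim is a combined syntactic/semantic/phonological measure.

```python
# Minimal sketch of (14) and (15); E-Sim scores here are a stand-in
# lookup table, not the multi-dimensional measure of the cited papers.

E_SIM = {  # assumed elemental-similarity scores between expression pairs
    ("kyoto", "tokyo"): 0.9, ("no", "no"): 1.0, ("kaigi", "miitingu"): 0.8,
}

def e_sim(a, b):
    return E_SIM.get((a, b), E_SIM.get((b, a), 0.0))

def c_sim(left_a, right_a, left_b, right_b):
    """(14): contextual (top-down) similarity as the sum of the
    elemental similarities of left and right neighbor expressions."""
    return e_sim(left_a, left_b) + e_sim(right_a, right_b)

def s_sim(elems_a, elems_b):
    """Structural (bottom-up) similarity: combine elemental similarities
    of respective elements (here, simply their sum)."""
    return sum(e_sim(a, b) for a, b in zip(elems_a, elems_b))

def sim(elems_a, elems_b, ctx_a, ctx_b):
    """(15): integrated score = S-Sim * C-Sim."""
    return s_sim(elems_a, elems_b) * c_sim(*ctx_a, *ctx_b)

# Compare "no kaigi" with "no miitingu", each preceded by a city name:
score = sim(["no", "kaigi"], ["no", "miitingu"],
            ("kyoto", ""), ("tokyo", ""))
```

Note how the contextual term rewards the match even though it looks only at neighbors, which is the top-down contribution the text describes.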

7.4. ANALYSIS WITH MULTI-DIMENSIONAL, TOP-DOWN AND BOTTOM-UP SIMILARITY MEASUREMENTS

We can consider the training stage first. In this stage, an example base is prepared by bracketing and labeling the training corpus by hand. The labeling entry for a complex expression includes the number of elements it contains; the set of syntactic, semantic, and other classifying features of the complex structure as a whole; the classifying features of each sub-element; and the classifying features of the left and right contexts.

Now on to the analysis itself. After morphological processing, with access to a lexicon giving classifying feature sets (perhaps multiple sets) for each terminal, the main routine proceeds as follows:

1. Search the example base for the expression most similar to any contiguous subsequence in the input: find the longest similar matches from position 1 in the input, then from position 2, and so on, terminating if a perfect match is found.

2. Reduce, or rewrite, the covered subsequence, passing its similarity features to

the rewritten structure.

3. Go to 1. Continue the cycle until no further reduction is possible.

A preliminary experiment was conducted on 132 English and 129 Japanese sentences. This corpus was too small to permit meaningful statistical evaluation, but we can say that numerous sentences were successfully analyzed which might have yielded massive structural ambiguity, for example (16).

(16) However, we do have single rooms with a shower for eighty dollars a night and twin rooms with a bath for a hundred and forty dollars a night.

Here, many spurious combinations, e.g., shower for eighty dollars a night and twin rooms, were ignored in favor of the proper interpretations. Successful analysis of various uses of the article a was particularly notable. A full trace appears in the cited papers.
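The search-and-reduce cycle in steps 1-3 can be sketched in miniature as follows. Exact pattern matching stands in for the multi-dimensional Sim score, and the tiny example base and category names are invented.

```python
# Toy sketch of the search-and-reduce cycle above. The example base is
# invented; real matching would use the Sim score, not exact equality.

EXAMPLE_BASE = {            # pattern (tuple of tokens/categories) -> category
    ("kyoto",): "NP", ("kaigi",): "NP",
    ("NP", "no", "NP"): "NP",
}

def analyze(tokens):
    tokens = list(tokens)
    while True:
        best = None  # (start, length, category); prefer longest, leftmost
        for start in range(len(tokens)):
            for end in range(len(tokens), start, -1):
                pattern = tuple(tokens[start:end])
                if pattern in EXAMPLE_BASE:
                    if best is None or (end - start) > best[1]:
                        best = (start, end - start, EXAMPLE_BASE[pattern])
                    break  # longest match from this position found
        if best is None:
            return tokens  # step 3: no further reduction possible
        start, length, cat = best
        tokens[start:start + length] = [cat]  # step 2: reduce (rewrite)

result = analyze(["kyoto", "no", "kaigi"])
```

On this input the cycle reduces "kyoto" to NP, then matches the embedded pattern [NP no NP], leaving a single NP covering the whole phrase.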


7.5. SIMILARITY VS. FREQUENCY

We have been discussing the uses of similarity calculations for the resolution of various sorts of ambiguity. We conclude this section by contrasting similarity-based disambiguation and probability-based disambiguation, an approach which is more widely studied at present. Several current parsers (e.g., Black et al., 1993) are trained to resolve conflicts among competing analyses by using information about the relative frequencies, and thus probabilities, of the combinations of elements in question. At short range, n-gram statistics are used; at longer ranges, stochastic rules.

Several of the considerations raised above with respect to similarity-based disambiguation apply equally to probability-based disambiguation. For example, Jurafsky (1993) stresses the need for multidimensional processing: in his parser based upon the theory of grammatical constructions (Fillmore et al., 1988; Kay, 1990) and claimed to model several features of human parsing as observed in psycholinguistic experiments, semantic as well as syntactic frequencies and probabilities are brought to bear in selecting the proper parse. Also stressed is the need for both top-down and bottom-up statistics in evaluating parse trees as a whole.

Ideally, disambiguation approaches based upon similarity and approaches based upon occurrence probability should complement each other. However, I am aware of no attempts to combine the two.

8. Cue-based Speech Acts

Speech-act analysis (Searle, 1969) - analysis in terms of illocutionary acts like INFORM, WH-QUESTION, REQUEST, and so on - can be useful for SLT in numerous ways. Six uses, three related to translation and three to speech processing, will be mentioned here. Concerning translation, the following tasks must be performed:

1. Identify the speech acts of the current utterance. Speech-act analysis of the

current utterance is necessary for translation. For instance, the English pattern can you (VP, bare infinitive)? may express either an ACTION-REQUEST or a YN-QUESTION (yes-no question). Resolution of this ambiguity will be crucial for translation.

2. Identify related utterances. Utterances in dialogues are often closely related: for instance, one utterance may be a prompt and another utterance may be its response; and the proper translation of a response often depends on identification and analysis of its prompt. For example, Japanese hai can be translated as yes if it is the response to a YN-QUESTION, but as all right if it is the response to an ACTION-REQUEST. Further, the syntax of a prompt may become a factor in the final translation. Thus, in a responding utterance hai, so desu (lit. 'yes, that's right'), the segment so desu may be most naturally translated as he can, you will, she does, etc., depending on the structure and content of the prompting question. The recognition of such prompt-response relationships will require analysis of typical speech-act sequences.


3. Analyze relationships among segments and fragments. Early processing of utterances may yield fragments which must later be assembled to form the global interpretation for an utterance. Speech-act sequence analysis should help fit fragments together, since we hope to learn about typical act groupings.

Concerning speech processing, it is necessary to do the following:

4. Predict speech acts to aid SR. If we can predict the coming speech acts, we can partly predict their surface patterns. This prediction can be used to constrain SR. As already mentioned, for instance, Japanese utterances ending in ka and ga - respectively, YN-QUESTIONs and INFORMs - are difficult to distinguish phonologically. We earlier considered the use of prosodic information in resolving this uncertainty. Predictions as to the relative likelihood of these speech acts in a given context should further aid recognition.

5. Provide conventions for prosody recognition. Once spontaneous data is labeled, SR researchers can try to recognize prosodic cues to aid in speech-act recognition and disambiguation. For instance, they can try to distinguish segments expressing INFORMs and YN-QUESTIONs according to the F0 curves associated with them - a distinction which would be especially useful for recognizing YN-QUESTIONs with no morphological or syntactic markings.

6. Provide conventions for speech synthesis. Similarly, speech synthesis researchers can try to provide more natural prosody by exploiting speech-act information. Once relations between prosody and speech acts have been extracted from corpora labeled with speech-act information, researchers can attempt to supply natural prosody for synthesized utterances according to the specified speech acts. For instance, more natural pronunciations can be attempted for YN-QUESTIONs, or for CONFIRMATION-QUESTIONs (including tag questions in English, as in

(17) The train goes east, doesn't it?

While a well-founded set of speech-act labels would be useful, it has not been clear what the theoretical foundation should be. As a result, no speech-act set has yet become standard, despite considerable recent effort.18 Labels are still proposed intuitively, or by trial and error.

Speakers' goals can certainly be analyzed in many ways. However, Seligman et al. (1995) hypothesize that only a limited set of goals is conventionally expressed in a given language. For just these goals, relatively fixed expressive patterns are learned by speakers when they learn the language. In English, for instance, it is conventional to express certain suggestions or invitations using the patterns Let's

V or Shall we V? In Japanese, one conventionally expresses similar goals via the patterns V[combining stem] masho or V[combining stem] masen ka?

The proposal is to focus on discovery and exploitation of these conventionally expressible speech acts, or "Cue-based Speech Acts" (CBSAs).19 The relevant expressive patterns and the contexts within which they are found have the great virtue of being objectively observable; and assuming the use of these patterns is


common to all native speakers, it should be possible to reach a consensus classification of the patterns according to their contextualized meaning and use. This functional classification should yield a set of language-specific speech-act labels which can help to put speech-act analysis for SLT on a firmer foundation.

The first reason to analyze speech acts in terms of observable linguistic patterns, then, is the measure of objectivity thus gained: the discovery process is to some degree empirical, data-driven, or corpus-based. A second reason is that automated cue-based analysis, being shallow or surface-bound, should be relatively quick as opposed to plan-based analysis. Plan-based analysis may well prove necessary for certain purposes, but it is quite expensive. For applications like SLT which must be carried out in nearly real time, it seems wise to exploit shallow analysis as far as possible.

With these advantages of cue-based processing - empirical grounding and speed - come certain limitations. When analyzing in terms of CBSAs, we cannot expect to recognize all communicative goals. Instead, we restrict our attention to communicative goals which can be expressed using conventional linguistic cue patterns. Communicative goals which cannot be described as CBSAs include utterance goals which are expressed non-conventionally (compare the non-conventional warning (18a) to the conventional WARNING (18b)); or goals which are expressed only implicitly ((19) as an implicit request to shut the window); or goals which can only be defined in terms of relations between utterances.20

(18) a. May I call your attention to a potentially dangerous dog?
     b. Look out for the dog!

(19) It's cold outside.
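A cue-pattern recognizer of the kind proposed can be sketched as follows. The regular expressions and the SUGGESTION/WARNING labels are illustrative assumptions, not the CBSA inventory of the cited studies.

```python
# Hedged sketch: recognizing conventionally expressed acts from surface
# cue patterns. Patterns and labels are invented for illustration.
import re

CUE_PATTERNS = [
    (re.compile(r"^let's\b", re.I), "SUGGESTION"),
    (re.compile(r"^shall we\b.*\?$", re.I), "SUGGESTION"),
    (re.compile(r"^look out\b", re.I), "WARNING"),
]

def cue_based_acts(utterance):
    """Return CBSA labels whose cue patterns match the utterance."""
    u = utterance.strip()
    return [act for pattern, act in CUE_PATTERNS if pattern.search(u)]

acts1 = cue_based_acts("Let's meet at the conference center.")
acts2 = cue_based_acts("It's cold outside.")  # implicit goal: no cue fires
```

Note that the implicit request goes unrecognized by design: such shallow matching covers only conventionally expressed goals, which is exactly the limitation the text describes.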

Given that the aim is to classify expressive patterns according to their meaning and function, how should this be done? Seligman (1991) and Seligman et al. (1995) describe a paraphrase-based approach: native speakers are polled as to the essential equivalence of expressive patterns in specified discourse contexts. If by consensus several patterns can yield paraphrases which are judged equivalent in context, and if the resulting pattern set is not identical to any competing pattern set, then it can be considered to define a CBSA. (Knott and Dale (1995) and Knott (1996) describe a similar substitution-based approach to the discovery of discourse relations, as opposed to speech acts.)

CBSAs are defined in terms of monolingual conventions for expressing certain communicative goals using certain cue patterns. For translation purposes, however, it will be necessary to compare the conventions in one language with those in the other. With this goal in mind, the discovery procedure was applied to twin corpora of Japanese-Japanese and English-English spontaneous dialogues concerning transportation directions and hotel accommodation (Loken-Kim et al., 1993). CBSAs were first identified according to monolingual criteria. Then, by observing


translation relations among the English and Japanese cue patterns, the resulting English and Japanese CBSAs were compared. Interestingly, it was found that most of the proposed CBSAs seem valid for both English and Japanese: only two out of 27 seem to be monolingual for the corpus in question.

We have been outlining a cue-based approach to recognition of speech or discourse acts, with the assumption that some sort of parsing would be employed to recognize cue patterns. This methodology can be compared with statistical recognition approaches: speech- or discourse-act labels are posited in advance, and statistical models are subsequently built which attempt to identify the acts according to their sequence (Reithinger, 1995; Nagata & Morimoto, 1993) or according to the words they contain (Alexandersson et al., 1997; Reithinger & Klesen, 1997).

Certain speech-act sequences may indeed turn out to be typical; and certain words may indeed prove to be unusually common in, and thus symptomatic of, arbitrarily defined speech acts. Thus statistical techniques are indeed likely to be helpful for recognition of conventional speech acts when they are implied or expressed non-conventionally, or for recognition of speech acts which are not conventional but nevertheless appear useful for some applications. Further, even for conventional speech acts which are conventionally expressed, efficiency considerations may sometimes favor statistical recognition techniques over pattern recognition: once CBSAs were identified using our methods and a sufficiently large training corpus had been hand-labeled, statistical models might certainly be built to permit efficient identification in context. However, statistical recognition approaches alone cannot provide a principled way to discover (that is, posit or hypothesize) the labels in the first place, and this is what we seek.

The current CBSA set has been applied in three studies: Black (1997) attempted to associate speech acts, including CBSAs, with intonation contours in hopes of improving speech synthesis; Iwadera et al. (1995) employed the CBSA set in attempts to parse discourse structure; and Jokinen et al. (1998) used CBSAs in topic-tracking experiments.
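The word-based statistical identification mentioned above can be sketched as a naive Bayes-style scorer. The tiny hand-labeled corpus and the act labels here are invented for illustration, not taken from the cited systems.

```python
# Illustrative sketch of word-based statistical act identification:
# a naive Bayes scorer trained on an invented hand-labeled corpus.
from collections import Counter, defaultdict
import math

TRAIN = [
    ("shall we meet tomorrow", "SUGGESTION"),
    ("let's take the bus", "SUGGESTION"),
    ("the hotel is near the station", "INFORM"),
    ("the conference starts at nine", "INFORM"),
]

word_counts = defaultdict(Counter)
act_counts = Counter()
for sent, act in TRAIN:
    act_counts[act] += 1
    word_counts[act].update(sent.split())

def classify(sentence):
    vocab = {w for c in word_counts.values() for w in c}

    def log_score(act):
        total = sum(word_counts[act].values())
        score = math.log(act_counts[act] / len(TRAIN))
        for w in sentence.split():
            # add-one smoothing over the shared vocabulary
            score += math.log((word_counts[act][w] + 1) / (total + len(vocab)))
        return score

    return max(act_counts, key=log_score)

act = classify("shall we take the bus")
```

Such a model can identify an act from symptomatic words, but, as the text argues, it presupposes that the label set was posited in advance; it offers no principled way to discover the labels.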

9. Tracking Lexical Co-occurrences

In the processing of spontaneous language, the need for predictions at the morphological or lexical level is clear. For bottom-up parsing based on phones or syllables, the number of lexical candidates is explosive. It is crucial to predict which morphological or lexical items are likely so that candidates can be weighted appropriately. (Compare such lexical prediction with the predictions from CBSAs discussed above. In general, it is hoped that by predicting CBSAs we can in turn predict the structural elements of their cue patterns. We are now shifting the discussion to the prediction of open-class elements instead. The hope is that the two sorts of prediction will prove complementary.)

N-grams provide such predictions only at very short ranges. To support bottom-up parsing of noisy material containing gaps and fragments, longer-range


predictions are needed as well. Some researchers have proposed investigation of associations beyond the n-gram range, but the proposed associations remain relatively short-range (about five words). While stochastic grammars can provide somewhat longer-range predictions than n-grams, they predict only within utterances. Our interest, however, extends to predictions on the scale of several utterances.

Thus Seligman (1994a) and Seligman et al. (1999) propose to permit the definition of windows in a transcribed corpus within which co-occurrences of morphological or lexical elements can be examined. A flexible set of facilities (CO-OC) has been implemented in Common Lisp to aid collection of such discourse-range co-occurrence information and to provide quick access to the statistics for on-line use. A window is defined as a sequence of minimal segments, where a segment is typically a turn, but can also be a block delimited by suitable markers in the transcript.

Sparse data is somewhat less problematic for long-range than for short-range predictions, since it is in general easier to predict what is coming "soon" than what is coming next. Even so, there is never quite enough data; so smoothing will remain important. CO-OC can support various statistical smoothing measures. However, since these techniques are likely to remain insufficient, a new technique for semantic smoothing is proposed and supported: researchers can track co-occurrences of semantic tokens associated with words or morphs in addition to co-occurrences of the words or morphs themselves. The semantic tokens are obtained from standard on-line thesauri. The benefits of such semantic smoothing appear especially in the possibility of retrieving reasonable semantically-mediated associations for morphs which are rare or absent in a training corpus.

Sections 9.1-9.3 describe CO-OC's operations in somewhat greater detail. Section 9.4 sketches possible applications for the co-occurrence information harvested by the program.

9.1. WINDOWS AND CONDITIONAL PROBABILITIES

As mentioned, we first permit the investigator to define minimal segments within the corpus: these may be utterances, or sections bounded by pauses or significant morphemes such as conjunctions, hesitations, postpositions, and so on. Windows composed of several successive minimal segments can then be recognized. Let Si be the current segment and N be the number of additional segments in the window as it extends to the right. N=2 would, for instance, give a window three segments long with Si as its first segment. Then if a given word or morpheme M1 occurs (at least once) in the initial segment, Si, we attempt to predict the other words or morphemes which will co-occur (at least once) anywhere in the window. Specifically, a conditional probability Q can be defined as in (20),

(20) Q(M1, M2) = P(M2 ∈ Si ∪ Si+1 ∪ Si+2 ∪ ... ∪ Si+N | M1 ∈ Si)


where Mj are morphemes, Sj are minimal segments, and N is the width of the window in segments. Q is thus the conditional probability that M2 is an element of the union of segments Si, Si+1, Si+2, and so on up to Si+N, given that M1 is an element of Si. Both the segment definition and the number of segments in a window can be adjusted to vary the range over which co-occurrence predictions are attempted.

For initial experiments, we used a morphologically tagged corpus of 16 spontaneous Japanese dialogues concerning direction-finding and hotel arrangements (Loken-Kim et al., 1993). We collected common-noun/common-noun, common-noun/verb, verb/common-noun, and verb/verb conditional probabilities in a three-segment window (N=2). Conditional probability Q was computed among all morph pairs for these classes and stored in a database; pairs scoring below a threshold (0.1 for the initial experiments) were discarded. We also computed and stored the mutual information for each morph pair, using the standard definition as in Fano (1961).

Fast queries of the database are then enabled. A central function is GET-MORPH-WINDOW-MATES, which provides all the window mates for a specified morph which belong to a specified class and have scores above a specified threshold for the specified co-occurrence measure (conditional probability or mutual information). The intent is to use such queries in real time to support bottom-up, island-driven SR and analysis. To support the establishment of island centers for such parsing, we also collect information on each corpus morph in isolation: its hit count and the segments it appears in, its unigram probability and probability of appearance in a given segment, etc. Once island hypotheses have been established based on this foundation, co-occurrence predictions will come into play for island extension.
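The counting behind Q can be sketched in a few lines. The Python below is a hypothetical reconstruction (CO-OC itself was written in Common Lisp); it ignores part-of-speech filtering, mutual information, and database storage, treats each segment as a set of morphs, and counts each (morph, window) pair at most once, per the "at least once" reading of definition (20):

```python
from collections import defaultdict

def window_cooccurrence_q(segments, n=2, threshold=0.1):
    """Estimate Q(M1, M2): the probability that M2 occurs anywhere in
    the window Si .. Si+n, given that M1 occurs in its first segment Si."""
    m1_windows = defaultdict(int)    # windows whose first segment holds M1
    pair_windows = defaultdict(int)  # ... in which M2 also occurs somewhere
    for i in range(len(segments)):
        window = set().union(*segments[i:i + n + 1])  # Si .. Si+n
        for m1 in set(segments[i]):
            m1_windows[m1] += 1
            for m2 in window:
                pair_windows[(m1, m2)] += 1
    # Keep only pairs at or above the threshold, as in the experiments.
    return {(m1, m2): c / m1_windows[m1]
            for (m1, m2), c in pair_windows.items()
            if c / m1_windows[m1] >= threshold}
```

A GET-MORPH-WINDOW-MATES-style query then reduces to filtering the returned table for a given first morph and score threshold.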

9.2. SEMANTIC SMOOTHING

As mentioned, CO-OC supports the use of standard statistical techniques (Nadas, 1985) for smoothing both conditional probability and mutual information. In addition, however, we enable semantic smoothing in an innovative way. Thesaurus categories ("cats" for short) are sought for each corpus morph (and stored in a corpus-specific customized thesaurus for fast access). The common noun eki ('station'), for instance, has among others the cat label "725a", representing a semantic class of POSTS-OR-STATIONS in the standard Kadokawa Japanese thesaurus (Ohno & Hamanishi, 1981).

Equipped with such information, we can study the co-occurrence within windows of cats as well as morphs. For example, using N=2, GET-CAT-WINDOW-MATES finds 36 cats co-occurring with "725a", one of the cats associated with eki, with a conditional probability Q > 0.10, including "459a" (sewa, 'taking care of' or 'looking after'), "216a" (henko, 'transfer'), and "315b" (ori, 'getting off'). Since we have prepared an indexed reverse thesaurus for our corpus, we can quickly find the corpus morphs which have these cat labels, respectively miru 'look', mieru 'can see/be visible', magaru 'turn', and oriru 'get off'. The resulting morphs are related


to the input morph eki via semantic rather than morph-specific co-occurrence. They thus form a broader, smoothed group.

This semantic-smoothing procedure (morph to related cats, cats to co-occurring category window-mates, cats to related morphs) has been encapsulated in the function GET-MORPH-WINDOW-MATES-VIA-CATS. It permits filtering, so that morphs are output only if they belong to a desired morphological class and are mediated by cats whose co-occurrence likelihood is above a specified threshold.

Thesaurus categories are often arranged in a type hierarchy. In the Kadokawa thesaurus, there are four levels of specificity: "725a" (POSTS-OR-STATIONS), mentioned above, belongs to a more general category "725" (STATIONS-AND-HARBORS), which in turn belongs to "72" (INSTITUTIONS), which belongs to "7" (SOCIETY). Accordingly, we need not restrict co-occurrence investigation to cats at the level given by the thesaurus. Instead, knowing that "725a" occurred in a segment Si, we can infer that all of its ancestor cats occurred there as well; and we can seek and record semantic co-occurrences at every level of specificity. This has been done; and GET-MORPH-WINDOW-MATES-VIA-CATS has a parameter permitting specification of the desired level of semantic smoothing. The more abstract the level of smoothing, the broader the resulting group of semantically-mediated morpheme co-occurrences. The most desirable level for semantic smoothing is a matter for future experimentation.
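The three-step smoothing procedure can be made concrete with a small mock-up. This Python sketch of a GET-MORPH-WINDOW-MATES-VIA-CATS-style lookup is an assumption-laden illustration, not the original Lisp: the three lookup tables (morph-to-cats, cat-level co-occurrence scores, and the reverse thesaurus) are hypothetical stand-ins for CO-OC's databases, and class filtering and hierarchy levels are omitted:

```python
def morph_window_mates_via_cats(morph, morph_to_cats, cat_q, cat_to_morphs,
                                threshold=0.10):
    """Semantically smoothed window mates for `morph`:
    morph -> its thesaurus cats -> cats co-occurring above `threshold`
    -> corpus morphs bearing those cats (via the reverse thesaurus)."""
    mates = set()
    for cat in morph_to_cats.get(morph, ()):
        for mate_cat, q in cat_q.get(cat, {}).items():
            if q > threshold:
                mates.update(cat_to_morphs.get(mate_cat, ()))
    mates.discard(morph)   # the input morph itself is not a mate
    return mates
```

With toy tables built around the eki example, the call returns morphs such as oriru 'get off' that need never have co-occurred with eki directly, but whose cats co-occur with one of eki's cats.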

9.3. EVALUATION

We are presently reporting the implementation of facilities intended to enable many experiments concerning morphological and morpho-semantic co-occurrence; the experiments themselves remain for the future. Clearly, further testing is necessary to demonstrate the reliability and usefulness of the approach. A principal aim would be to determine how large the corpus must be before consistent co-occurrence predictions are obtained. Nevertheless, some indication of the basic usability of the data is in order.

Tools have been provided for comparing two corpora with respect to any of the fields in the records relating to morphs, morph co-occurrences, cats, or cat co-occurrences. Using these, we treated 15 of our dialogues as a training corpus, and the one remaining dialogue as a test corpus. We compared the two corpora in terms of conditional probabilities for morph co-occurrences. In both cases, statistically unsmoothed scores were used for simplicity of interpretation.

We found 5,162 co-occurrence pairs above a conditional probability threshold of 0.10 in the training corpus and 1,552 in the test. Since 509 pairs occurred in both corpora, the training corpus covered 509 out of 1,552, or 33%, of the test corpus. That is, one third of the morph co-occurrences with conditional probabilities above 0.10 in the test corpus were anticipated by the training corpus.

This coverage seems respectable, considering that the training corpus was small and that neither statistical nor semantic smoothing was used. More important


than coverage, however, is the presence of numerous pairs for which good co-occurrence predictions were obtained. Such predictions differ from those made using n-grams in that they need not be chained, and thus need not cover the input to be useful: if consistently good co-occurrence predictions can be recognized, they can be exploited selectively. The figures obtained for cats and cat co-occurrences are comparable.

9.4. POSSIBLE APPLICATIONS

A weighted co-occurrence between morphemes or lexemes can be viewed as an association between these items; so the set of co-occurrences which CO-OC discovers can be viewed as an associative or semantic network. Spreading activation within such networks is often proposed as a method of lexical disambiguation.21 Thus disambiguation becomes a second possible application of CO-OC's results, beyond the abovementioned primary use for constraining SR.22 A third possible use is in the discovery of topic transitions: we can hypothesize that a span within a dialogue where few co-occurrence predictions are fulfilled is a topic boundary.23 Once the new topic is determined, appropriate constraints can be exploited, for example by selecting a relevant subgrammar.

10. Translation Mismatches

During translation, when the source and target expressions contain differing amounts of information, a "translation mismatch" is said to occur. For example, the English sentence (21a) may be translated by Japanese (21b). In this case, because the explicit pronoun is suppressed, information concerning person and number is lost. Similarly, (22a) may be translated as (22b). Here, the pronoun is once again suppressed, and information about the object of the verb is lost as well: Japanese does not express either its number or its definiteness.

(21) a. He ate.
     b. Tabemashita.
        EAT-past

(22) a. He bought the books.
     b. Hon o kaimashita.
        BOOK-Obj BUY-past

Suppressing such information during translation is less difficult than arranging for its addition when translating in the opposite direction. When translating from Japanese to English, for instance, how is a program to determine whether an entity


is definite, or plural, or third-person? Of course, such problems are not unique to SLT: they are equally present in text translation. In spoken translation, though, there is the added difficulty of resolving them in real time.

The first observation we can make about mismatch resolution is that it is in some respects akin to ambiguity resolution. In both cases, information is missing which must be supplied somehow: in translation mismatches, missing information must be filled in; in ambiguity resolution, missing information must guide a choice. In light of this similarity, the interactive resolution techniques suggested above for ambiguity resolution can be suggested for mismatches as well. For example, it would be relatively straightforward to put up a menu offering a choice between singular and plural (or "one vs. many", etc.). Granted, other sorts of information, for example concerning definiteness, would be trickier to elicit in non-technical terms.24 Of course, handling many such requests would be tedious, so interface design would be crucial. And again, the hope is that the need for interaction will shrink as knowledge-source integration advances.

A second observation about mismatch resolution is that, when missing information cannot be accurately computed and is excessively burdensome for users to supply, it can simply be left missing. For example, for translating (21b) when the correct English would be (21a), the incomplete translation (23) could be produced. This broken English would at least allow the hearer to infer the correct meaning from context, more or less as a hearer of the original Japanese (or a hearer of "real" broken English) would have to do. Further, supplying insufficient information is usually better than supplying incorrect information: in the same situation, (23) would be far less confusing than, say, (24), which is wrong on several counts. Thus far, however, I am aware of no SLT programs which deliberately abstain when in doubt.

(23) *bought book

(24) I bought a book.
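The two strategies just described, menu-based elicitation of the missing feature and graceful degradation when it stays missing, might be sketched as follows. This is purely illustrative: the function names, menu wording, and naive article/plural rules are assumptions of mine, not part of any system described here:

```python
def resolve_number(menu_answer):
    """Map a user's menu response ("one vs. many") to a number feature.
    None means the user skipped, leaving the mismatch unresolved."""
    return {"1": "singular", "2": "plural"}.get(menu_answer)

def render_object_np(noun, number):
    """Render an English object NP from a Japanese bare noun, degrading
    to broken-but-not-wrong output (cf. (23)) when number is unknown."""
    if number == "singular":
        return "a " + noun    # definiteness would need its own query
    if number == "plural":
        return noun + "s"     # naive pluralization, for illustration only
    return noun               # abstain: e.g. '*bought book'
```

The point of the final branch is the abstention policy: an unresolved feature yields an incomplete rendering rather than a confident guess that may be wrong on several counts.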

Ideally, however, translation software will do its best to resolve mismatches before requesting help from the user or throwing in the towel. Researchers in this area have tended to create programs focusing on a specific sort of mismatch. For example, Murata and Nagao (1993) proposed an expert system for supplying number and definiteness information, and thus articles, during Japanese-English translation.

In a similar spirit, Seligman (1994b) describes a program for resolving the references of zero pronouns in the Asura SLT system (Morimoto et al., 1993), thus supplying the missing pronouns for translation. The program, based upon the theory of centering (Sidner, 1979; Grosz et al., 1983; Joshi & Weinstein, 1981; Takeda & Doi, 1994), follows unpublished work by Masaaki Nagata. It is invoked from within specially modified transfer rules for verbs, and can work alongside


other pronoun-resolution techniques, for example those making use of Japanese honorific information (Dohsaka, 1990). No evaluations have yet been made.

Other mismatch problems to be addressed are surveyed in Seligman et al. (1993) in the context of Japanese-English or Japanese-German transfer. These include the determination of tense (Japanese, for example, does not have an explicit future tense); aspect (Japanese lacks explicit cues which would license a choice between (25a, b)); intimacy (as required for a choice between German du and Sie: Japanese does supply a great deal of information concerning politeness, formality, relative status, etc., but none of these map cleanly into the German distinction); choice of possessive determiners (Japanese often uses only namae, 'name', where English would have your name); and several other sorts of mismatch.

(25) a. He has been studying.
     b. He is studying.

11. Conclusions

The first section of the paper described a "low road" or "quick and dirty" approach to SLT, in which interactive disambiguation of SR and translation is temporarily substituted for system integration. This approach, I believe, is likely to yield broad-coverage systems with usable quality sooner than approaches which aim for maximally automatic operation based upon tight integration of knowledge sources and components.

Two demonstrations of "quick and dirty" SLT over the Internet were reported. For the demos, an experimental chat translation system created by CompuServe, Inc. was provided with front and back ends, using commercial dictation products for speech input and commercial speech-synthesis engines for speech output. The dictation products' standard interfaces were used to debug dictation results interactively. While evaluation of these experiments remained informal, coverage was much broader than in most SLT experiments to date: in the tens of thousands of words. While interactive control of translation was lacking, output quality was probably sufficient for many social exchanges.

But while the "low road" may offer the fastest route to usable broad-coverage SLT systems, automatic operation based upon knowledge-source integration is certain to remain desirable in the longer run. Hence the balance of the paper has concentrated on aspects of integrated systems.

Taken together, the nine areas of research examined in the paper suggest a nine-item wish list for an experimental SLT system.

1. The system would include facilities for interactive disambiguation of both speech and translation candidates.
2. Its architecture would allow modular reconfiguration and global coordination of components.


3. It would employ a perspicuous set of data structures for tracking information from multiple processes: stages of translation, multiple tracks, and height, span, or dominance of nodes would be clearly distinguished.
4. The system would employ a grammar whose terminals were phones, recognizing both words and syntactic structures in a uniform and integrated manner, e.g., via island-driven chart parsing.
5. Natural pauses and other aspects of prosody would be used to segment utterances and otherwise aid analysis.
6. Similarity-based techniques for resolving ambiguities, comparable to those of EBMT, would be effectively used. Stages of translation yielding potential ambiguities would be kept distinct; similarity would be measured along several dimensions (e.g., syntactic and phonological in addition to semantic); top-down as well as bottom-up constraints would be exercised; and disambiguation using both probability-based and similarity-based techniques would be used in complementary fashion.
7. Speech or dialogue acts would be defined in terms of their cue patterns, and analyses based upon them would be exploited for SR and analysis.
8. Semantically smoothed tracking of lexical co-occurrences would provide a network of associations useful for SR, lexical disambiguation, and topic-boundary recognition.
9. A suite of specialized programs would help to resolve translation mismatches, for instance to supply referents for zero pronouns.

Acknowledgements

Warmest appreciation to CompuServe, Inc. for making the chat-based SLT demonstrations possible. In particular, thanks are due to Mary Flanagan, then Manager, Advanced Technologies, and to Sophie Toole, then Supervisor, Language Support. Ms. Flanagan authorized and oversaw both demos. Ms. Toole organized and conducted the Grenoble demo and played an active role in making the SR and speech-synthesis software operational. Thanks also to Phil Jensen and Doug Chinnock, translation system engineers. The demos made use of pre-existing proprietary software.

Work on all nine of the issues discussed here began at ATR Interpreting Telecommunications Laboratories in Kyoto, Japan. I am very grateful for the support and stimulation received there.

Thanks also to numerous colleagues at GETA (Groupe d'Etudes pour la Traduction Automatique) at the Université Joseph Fourier in Grenoble, France; and at DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz) in Saarbrücken, Germany.

The opinions expressed throughout are mine alone.


Notes

1 CompuServe's chat translation project was discontinued in early 1998. All trademarks are hereby acknowledged.
2 A later commercial chat translation service, that of Uni-verse, Inc. (now discontinued), gave a comparable throughput in 2-3 seconds.
3 Continuous French was released just before the second demo, but because little testing time was available, a decision was made to forego its use.

4 SpeechLinks software from SpeechOne, Inc.
5 By March 1998, upgrades of the continuous software had already made this macro less necessary. Direct dictation to the chat window would then have been possible without it, with some sacrifice of advanced features for voice-driven interactive correction of errors.
6 Kowalski et al. (1995) arranged the only previous demonstration known to the author of SLT using commercial dictation software for input (though at least one group (Miike et al., 1988) had previously simulated SLT after a fashion by automatically translating typed conversations). Since Kowalski's users (spectators at twin exposition displays in Boston, Massachusetts and Lyons, France) were untrained, little interactive correction of dictation was possible. For this and other reasons, translation quality was generally low (Burton Rosenberg, personal communication); but as the main purpose of the demo was to make an artistic and social statement concerning future hi-tech possibilities for cross-cultural communication, this was no great cause for concern. Text was transmitted via FTP, rather than via chat as in the experiments reported here. See Seligman (1997) for a fuller account.
7 www.itl.atr.co.jp/matrix/c-star/matrix.en.html
8 www.c-star.org

9 Mailbox files were extensively and successfully used in the French entry in the C-Star II SLT demo of July 22, 1999 (www.c-star.org).
10 Inclusion of other levels is also possible. At the lower limit, assuming the grammar were stochastic, one could even use sub-phone speech segments as grammar terminals, thus subsuming even HMM-based phone recognition in the parsing regime. At an intermediate level between phones and words, syllables could be used.
11 The parse tree was not used for analysis, however. Instead, it was discarded, and a unification-based parser began a new parse for MT purposes on a text string passed from speech recognition.
12 For one example of extensive related work in the framework of the Verbmobil system, see Kompe et al. (1997).
13 A related but distinct proposal appears in Hosaka et al. (1994).
14 Entropic Research Laboratory, Washington, DC, 1993.
15 PanEBMT operates solo only when the entire source expression can be rendered with a single memorized target expression.
16 The sort of generalization suggested here (from graded semantic similarity measurements to graded measurements of similarity along multiple dimensions) should not be confused with that of Generalized EBMT, the example-based technique proposed for CMU's PanEBMT engine. That engine utilizes no graded similarity measurements along any scale. Its generalization instead involves substitution of semantic tags for lexical items in examples and in input, so that for example John Hancock was in Washington becomes (PERSON) was in (CITY). In this calculation, fixed elements are treated differently from variable elements, and variable elements can be weighted to varying degrees: the heads of complex structures are differently weighted than non-heads.
17 See for example the website of the Discourse Resource Initiative, www.georgetown.edu/luper-foy/Discourse-Treebank/dri-home.html, with links to recent workshops, or browse Walker (1999), especially regarding attempted standardization of Japanese discourse labeling (Ichikawa et al., 1999).


19 Called "Communicative Acts" in Seligman et al. (1995) and "Situational Formulas" in Seligman (1991).
20 While speakers often repeat an interlocutor's utterance to confirm it, we do not use a REPEAT-TO-CONFIRM CBSA, since it is apparently signaled by no cue patterns, and thus could only be recognized by noting inter-utterance repetition.
21 For example, if the concept MONEY has been observed, then the lexical item bank has the meaning closest to MONEY in the network: 'savings institution' rather than 'edge of river', etc.
22 See Schütze (1998) or Veling and van der Werd (1999) concerning the use of co-occurrence networks for disambiguation, though without comparable segmentation or semantic smoothing.
23 Compare, for example, Morris and Hirst (1991), Hearst (1994), Nomoto and Nitta (1994), or Kozima and Furugori (1994).
24 One possible formulation: "Can the audience easily identify which one is meant?" See Boitet (1996a) for discussion.

References

Aberdeen, John, Sam Bayer, Sasha Caskey, Laurie Damianos, Alan Goldschen, Lynette Hirschman, Dan Loehr and Hugo Trappe: 1999, 'Implementing Practical Dialogue Systems with the DARPA Communicator Architecture', IJCAI-99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Stockholm, Sweden, pp. 81-86.

Alexandersson, Jan, Norbert Reithinger and Elisabeth Maier: 1997, 'Insights into the Dialogue Processing of Verbmobil', Fifth Conference on Applied Natural Language Processing, Washington, DC, pp. 33-40.

Barnett, Jim, Kevin Knight, Inderjeet Mani and Elaine Rich: 1990, 'Knowledge and Natural Language Processing', Communications of the ACM 33(8), 50-71.

Black, Alan: 1997, 'Predicting the Intonation of Discourse Segments from Examples in Dialogue Speech', in Y. Sagisaka, N. Campbell and N. Higuchi (eds), Computing Prosody, Springer Verlag, Berlin, pp. 117-128.

Black, Ezra, Roger Garside and Geoffrey Leech: 1993, Statistically-driven Computer Grammars of English: The IBM/Lancaster Approach, Rodopi, Amsterdam.

Blanchon, Hervé: 1996, 'A Customizable Interactive Disambiguation Methodology and Two Implementations to Disambiguate French and English Input', in C. Boitet (1996a), pp. 190-200.

Boitet, Christian (ed.): 1996a, Proceedings of MIDDIM-96 Post-COLING Seminar on Interactive Disambiguation, Le Col de Porte, France.

Boitet, Christian: 1996b, 'Dialogue-based Machine Translation for Monolinguals and Future Self-explaining Documents', in C. Boitet (1996a), pp. 75-85.

Boitet, Christian and Mark Seligman: 1994, 'The "Whiteboard" Architecture: A Way to Integrate Heterogeneous Components of NLP Systems', COLING 94, The 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 426-430.

Brown, Ralph D.: 1996, 'Example-based Machine Translation in the Pangloss System', COLING-96, The 16th I