Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT
Ch. Boitet, GETA, CLIPS ACIDCA ’2000, Monastir, 22-24/3/2000 1
ACIDCA ’2000, Monastir, 21-24/3/2000
Christian Boitet
GETA, CLIPS, IMAG, Grenoble
Handling texts and corpuses in Ariane-G5,
a complete environment for multilingual MT
Outline
Introduction
Multilingual MT-R (for revisors): linguistic methodology & basic software
• Goals and linguistic methodology
• Ariane-G5, an MT shell for building multilingual MT-R systems
• What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech
Representation of input documents
Structuration of corpuses
Functionalities during processing
MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY
Produce RAW translation GOOD ENOUGH to be revised
Specialize to SUBLANGUAGES and use MULTILEVEL TRANSFER (semantic + traces)
HEURISTIC PROGRAMMING
MULTILINGUAL MT-R: BASIC DIAGRAM
[Diagram. Nodes: Source Language Text; Target Language 1 Text; Target Language 2 Text; umc-, uma- and gma-structures. Processes: Morphological Analysis, Structural Analysis, Abstraction, paraphrase choice, Structural Generation, Syntactic Generation, Morphological Generation.]
Ariane-G5 (1978-99) : structure
Ariane-G5: 2 specialized DB
DB of lingware components
• Declaration of variables (= typed attributes), templates…
• Dictionaries
• Grammars (rules = transitions of abstract automata)
DB of texts
• Corpuses
• Source texts
• Intermediate results
• Translations (± revisions)
relative to “variants” =>
What has been and is done with Ariane-G5:
MT-R (for revisors)
• Large, operational systems: RU—>FR, FR—>EN
• Prototypes: EN—>MY, TH, FR
• Lots of mockups
MT-A (for authors)
• LIDIA mockups: FR—>DE, EN, RU (adding CH)
MT of speech (for task-oriented dialogues)
• CSTAR demo system (EN, DE, KR, IT, FR, JP)
MT-R examples of translation (1)
French-English in aeronautics (before human revision)
MT-R examples of translation (2)
MT-A example of a disambiguation dialogue
Le capitaine a rapporté des tasses et des assiettes bleues
—> The captain has brought back blue bowls and plates / bowls and blue plates
Question 1
OO des tasses bleues et des assiettes bleues
O des assiettes bleues et des tasses
Question 2
OO capitaine de marine
O capitaine d’aviation
O capitaine d’artillerie
O capitaine d’infanterie
O capitaine de cavalerie
O …
Interaction in source for the “quality MT for all”
Example scenario: multilingual e-mail (UNL)
[Diagram, with steps numbered 1 to 10. Nodes: e-mail tool (nicknames + language preferences); e-mail server; enconversion server; analysis server; interactive disambiguation server; decoding/deconversion servers; addressees’ e-mail servers.]
Other future possibility: production of multilingual “self-explaining documents”
[Diagram. Nodes: source language text; target language 1 text; target language 2 text; back-translation 1; user; MMC, UMC, UMA and GMA structures. Processes: interactive disambiguation, simulated "silent" disambiguation (DMS), back-translation, paraphrase choice. Annotations: ambiguity marks and dialogue (m.a.&d.).]
Speech Translation: advantages of an Interchange Format
• N target languages for the cost of one analysis
• Translating into one’s language from N source languages with one generation
• Using the same generation to “backgenerate”
[Diagram labels: Analysis into IF; IF; Backgeneration]
Interchange Format: example
la semaine du 12 nous avons des chambres simples et doubles disponibles
(the week of the 12th, we have single and double rooms available)
give-information+availability+room (room-type=(single ; double), time=(week, md12))
Dialogue act: give-information
Concepts: availability, room
Arguments: room-type=(single ; double), time=(week, md12)
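The three parts of the IF expression above (dialogue act, concepts, arguments) can be separated mechanically. This is only a minimal sketch, assuming the surface syntax is exactly that of the example; `parse_if` is a hypothetical helper and not part of the CSTAR software.

```python
def parse_if(expr):
    """Split an IF expression into (dialogue act, concepts, arguments).

    Assumed syntax (taken from the single example on the slide):
        act+concept+concept (name=value, name=value)
    """
    head, _, tail = expr.partition("(")
    act, *concepts = [part.strip() for part in head.split("+")]

    # Drop the parenthesis that closes the argument list.
    body = tail.rstrip()
    if body.endswith(")"):
        body = body[:-1]

    # Split on commas that are not nested inside parentheses.
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])

    arguments = {}
    for part in parts:
        if part.strip():
            name, _, value = part.partition("=")
            arguments[name.strip()] = value.strip()
    return act, concepts, arguments
```

Applied to the expression above, the sketch yields the dialogue act `give-information`, the concepts `availability` and `room`, and the two arguments `room-type` and `time` with their values kept as raw strings.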
Interface of CLIPS++ CSTAR-II demonstrator
[Screenshot labels: Recognition; IF; Backgeneration (to check the “understanding”); Generation]
Hardware architecture of the CLIPS++ CSTAR-II demonstrator
[Diagram labels: Montpellier; Grenoble; RNIS (ISDN); Ethernet; Reco; FIF; Control, IFF; Synthesis; VC; IU]
Steps in translating a text
Build its hierarchical structure
• Chapters, sections, paragraphs, [sentences]
Segment into translation units
• According to current length parameter [min..max]
Translate each segment
• Adding segment results to text results for desired phases
Revise (manually) the whole translations, keep the revisions
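The segmentation step can be illustrated as follows. The slide only says that units depend on a length parameter [min..max]; the greedy policy, the choice of counting words, and the name `segment_units` are assumptions made for this sketch, not the actual Ariane-G5 algorithm.

```python
def segment_units(sentences, min_len, max_len):
    """Group consecutive sentences into translation units.

    Assumption: length is counted in words, and a unit is closed as soon as
    it has reached min_len and the next sentence would push it past max_len.
    A single sentence longer than max_len still forms a unit of its own.
    """
    units, current, size = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and size >= min_len and size + n > max_len:
            units.append(current)
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        units.append(current)
    return units
```

Each unit would then be translated separately, its results being added to the results of the whole text.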
Representations of input documents
3 main questions:
• how to represent the writing system,
• separate formatting tags from the text or not,
• how to handle non-textual elements (figures, icons, or formulas) contained in utterances
Transliterations of textual elements
Keeping formatting tags in the texts
Non-textual elements
Transliterations of textual elements
Facilitate string-matching operations
Diminish the size of dictionaries
Represent diacritics
Make some processing easier for some tools
• kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua
Examples:
• lisp Lisp LISP —> LISP *LISP **LISP
• François va à ACIDCA’2000 —> *FRANC!4OIS VA A!2 **ACIDCA'2000
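The conventions visible in these examples can be sketched in a few lines. The diacritic table below contains only the two codes shown on the slide (the full Ariane-G5 table is not given), and the function names, the vowel set, and the strictly consonant-initial word shape are assumptions made for illustration.

```python
# Only the two codes that appear in the slide's example; the rest is unknown.
DIACRITICS = {"ç": "C!4", "à": "A!2"}


def transliterate_word(word):
    """Upper-case a word, marking its original case with a prefix:
    '**' for all capitals, '*' for an initial capital, nothing otherwise."""
    letters = [c for c in word if c.isalpha()]
    if len(letters) > 1 and all(c.isupper() for c in letters):
        prefix = "**"
    elif word[:1].isupper():
        prefix = "*"
    else:
        prefix = ""
    body = "".join(DIACRITICS.get(c.lower(), c.upper()) for c in word)
    # The example also shows the curly apostrophe becoming a straight one.
    return prefix + body.replace("’", "'")


def transliterate(text):
    return " ".join(transliterate_word(w) for w in text.split())


VOWELS = set("aiu")  # assumption: short vowels only


def root_and_pattern(word):
    """Separate consonantal root and vowel pattern: kataba -> ktb$aaa.
    Assumes the word starts with a consonant; '-' marks a missing vowel."""
    root, pattern, i = [], [], 0
    while i < len(word):
        root.append(word[i])
        if i + 1 < len(word) and word[i + 1] in VOWELS:
            pattern.append(word[i + 1])
            i += 2
        else:
            pattern.append("-")
            i += 1
    return "".join(root) + "$" + "".join(pattern)
```

With such a transliteration, a dictionary needs a single entry `LISP` for the three spellings, and the root `ktb` can be looked up independently of the vowel pattern.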
Transliterations of textual elements (2)
Represent writing systems using non-Roman characters
"мать" (mother) —> "MATQ" and not "MAT6"
а я е э и ы о ё у ю й
A YA E YE I YI O E!1 U YU J
з ж к ч с ш т щ ь ъ
Z ZH K KH S SH T TH Q W
今日は京都へ行きます。 (Today theme Kyoto dest go.) —>
KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.
Keeping formatting tags in the texts
If the translation units get larger, almost all tags become “inside tags”
Tags often have a linguistic role
For example, a sentence may contain
• a bullet list
• or a numbered list
which are normally linguistically homogeneous.
<P>For example, a sentence may contain</P>
<UL>
 <LI>a bullet list
 <LI>or a numbered list
</UL>
<P>which are normally linguistically homogeneous. </P>
Non-textual elements
Formulas, figures, icons, brand names, anchors, links… are often best replaced by tags or special occurrences
The situation may be recursive (text inside figures)
*IF x2+5y>3 , x+y IS CONVENIENT .
*IF <relation 1> , <entity 2> IS CONVENIENT .
*IF $$R-1 , $$E-2 IS CONVENIENT .
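The replacement shown in the three lines above can be sketched as a pair of functions. It is assumed here that the non-textual elements have already been located and typed (R for relation, E for entity); `protect` and `restore` are hypothetical names, and only the `$$R-1` / `$$E-2` notation comes from the slide.

```python
def protect(text, elements):
    """Replace each (literal, kind) element, in order of appearance, by a
    special occurrence such as $$R-1, and remember what it stood for."""
    pieces, table, pos = [], {}, 0
    for number, (literal, kind) in enumerate(elements, start=1):
        index = text.index(literal, pos)  # raises ValueError if absent
        key = f"$${kind}-{number}"
        pieces.append(text[pos:index])
        pieces.append(key)
        table[key] = literal
        pos = index + len(literal)
    pieces.append(text[pos:])
    return "".join(pieces), table


def restore(text, table):
    """Put the original elements back, e.g. into the translated text."""
    # Longest keys first so that $$R-10 is not mistaken for $$R-1.
    for key in sorted(table, key=len, reverse=True):
        text = text.replace(key, table[key])
    return text
```

Because the translation only sees the special occurrences, they can move freely inside the target sentence and still be restored afterwards.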
Structuration of corpuses
• Motivations for corpuses
• Segmentation and structuration
• Representation of texts, intermediate results, translations and revisions
Motivations for corpuses
Corpus = collection of texts sharing
some factual characteristics:
• natural language
• transliteration and method for handling formatting information and non-textual elements
• segmentation method
• structuration method
some management information:
• source (journal/volume, book/chapter…)
• usage destination (send back, postedit, tests…)
Segmentation and structuration
"segmentation" = input texts —> words, sentences… (best done by the morphological analyzer) & units of translation
"structuration" = segmentation —> higher level units (paragraphs, sections, etc.) + production of a corresponding tree structure
In Ariane-G5, up to 7 hierarchical separators for a given corpus
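Structuration with an ordered list of hierarchical separators can be pictured as a recursive split. The sketch below assumes separators are plain strings given from the highest level down; the limit of 7 comes from the slide, while `structure` and the sample separators are hypothetical.

```python
MAX_SEPARATORS = 7  # limit stated for a given corpus in Ariane-G5


def structure(text, separators):
    """Build a nested list (a tree) from text and ordered separators.

    separators[0] delimits the highest-level units, separators[1] the
    next level, and so on. Leaves are the remaining text pieces.
    """
    if len(separators) > MAX_SEPARATORS:
        raise ValueError("at most 7 hierarchical separators per corpus")
    if not separators:
        return text
    first, *rest = separators
    return [structure(part, rest) for part in text.split(first) if part.strip()]
```

For instance, with a blank line as paragraph separator and ". " as a (crude) sentence separator, a two-paragraph text becomes a list of paragraphs, each a list of sentences.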
Representation of texts, intermediate results, translations and revisions
Corpus = list of text files + descriptor
Text = (transliterated) text + descriptor (+ non-textual elements replaced by tags or special occurrences)
Intermediate result = list of decorated trees + descriptor (lingware variant + interval processed)
Translation = (transliterated) text + descriptor (transliterated form may reduce morph. gen. size)
Revision = (transliterated) text + descriptor (usually another, more natural transliteration)
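These definitions map naturally onto a few record types. The sketch below is one possible reading, not the actual Ariane-G5 file layout: all class and field names are assumptions, and descriptors are reduced to plain dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    content: str                      # (transliterated) text
    descriptor: dict = field(default_factory=dict)


@dataclass
class IntermediateResult:
    trees: list                       # decorated trees, one per unit
    lingware_variant: str             # which lingware produced them
    interval: tuple                   # (first unit, last unit) processed


@dataclass
class Translation(Text):
    pass                              # same shape as a text


@dataclass
class Revision(Text):
    pass                              # usually a more natural transliteration


@dataclass
class Corpus:
    texts: list                       # list of Text objects
    descriptor: dict = field(default_factory=dict)
```

Keeping the lingware variant and the processed interval next to each intermediate result is what later allows the system to decide whether a stored result can be reused.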
Functionalities during processing
Ensuring coherence between lingware and results
Stopping & restarting processing of a text
Reusing intermediate results
• recovery from interruptions
• debugging
• multitarget translation (analysis ≈ 2/3 of translation time)
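The benefit of reusing intermediate results for multitarget translation can be shown with a small store keyed by text and lingware variant. Everything here is a toy: `ResultStore`, `translate_multitarget`, and the stand-in analysis and generation functions are assumptions, not Ariane-G5 components.

```python
class ResultStore:
    """Keeps analysis results, keyed by (text id, lingware variant), so that
    a result is never reused with a lingware other than the one that made it."""

    def __init__(self):
        self._results = {}
        self.analysis_runs = 0

    def analysis(self, text_id, variant, analyze, source):
        key = (text_id, variant)
        if key not in self._results:
            self._results[key] = analyze(source)
            self.analysis_runs += 1
        return self._results[key]


def translate_multitarget(store, text_id, source, variant, analyze, generators):
    """Analyze once, then run one transfer/generation per target language."""
    return {
        language: generate(store.analysis(text_id, variant, analyze, source))
        for language, generate in generators.items()
    }
```

Since analysis accounts for roughly two thirds of translation time, each additional target language costs only the remaining third once the analysis result is stored; the same store also supports restarting after an interruption.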
Conclusion and perspectives (1)
Text & corpus handling in complete MT systems is quite complex and interesting…
• handling texts and corpuses is not a straightforward problem,
• it suggests many interesting technological and scientific issues
Conclusion and perspectives (2)
but more is coming:
Synergy MT systems <—> TA (Translation Aids)
unification of the representations of texts in both worlds:
• MT: revised texts structured as input texts, => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)
• TA: translation memories from "bags" to structured translation memories (keeping the sequential context)
both: multiple-layer translation memories
• lemmatized forms
• "concrete" syntactic trees & "abstract" logico-semantic trees
• formatting tags
Conclusion and perspectives (3)
Structuration may be used to « distribute the work » to MT and TA by segmenting according to the « best engine »
some sublanguages are good for MT, bad for TA
• weather bulletins
others are good for TA, bad for MT
• weather related warnings, slightly modified versions of already translated documents
and others are best kept for specialists
• fine-tuned legal sentences