Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT

Christian Boitet, GETA, CLIPS, IMAG, Grenoble
ACIDCA ’2000, Monastir, 21-24/3/2000


Page 1: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT

Ch. Boitet, GETA, CLIPS ACIDCA ’2000, Monastir, 22-24/3/2000 1

ACIDCA ’2000, Monastir, 21-24/3/2000

Christian Boitet, GETA, CLIPS, IMAG, Grenoble

Handling texts and corpuses in Ariane-G5,

a complete environment for multilingual MT

Page 2: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Outline
• Introduction
• Multilingual MT-R (for revisors): linguistic methodology & basic software
  • Goals and linguistic methodology
  • Ariane-G5, an MT shell for building multilingual MT-R systems
• What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech
• Representation of input documents
• Structuration of corpuses
• Functionalities during processing

Page 3: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY

Produce RAW translation GOOD ENOUGH to be revised

Specialize to SUBLANGUAGES and use MULTILEVEL TRANSFER (semantic + traces)

HEURISTIC PROGRAMMING

Page 4: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


MULTILINGUAL MT-R: BASIC DIAGRAM

[Diagram: from the source language text to the texts in target languages 1 and 2. Phases shown: Morphological Analysis, Structural Analysis, Abstraction, paraphrase choice, Structural Generation, Syntactic Generation, Morphological Generation. Intermediate structures shown: umc-structure, uma-structure, gma-structure.]

Page 5: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Ariane-G5 (1978-99) : structure

Page 6: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Ariane-G5: 2 specialized DB

• DB of lingware components
  • Declaration of variables (= typed attributes), templates…
  • Dictionaries
  • Grammars (rules = transitions of abstract automata)
• DB of texts
  • Corpuses
  • Source texts
  • Intermediate results
  • Translations (± revisions)

relative to “variants” =>

Page 7: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


What has been and is done with Ariane-G5:

• MT-R (for revisors)
  • Large, operational systems: RU—>FR, FR—>EN
  • Prototypes: EN—>MY, TH, FR
  • Lots of mockups
• MT-A (for authors)
  • LIDIA mockups: FR—>DE, EN, RU (adding CH)
• MT of speech (for task-oriented dialogues)
  • CSTAR demo system (EN, DE, KR, IT, FR, JP)

Page 8: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


MT-R examples of translation (1)
French-English in aeronautics (before human revision)

Page 9: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


MT-R examples of translation (2)

Page 10: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


MT-A example of a disambiguation dialogue

Le capitaine a rapporté des tasses et des assiettes bleues
—> The captain has brought back blue bowls and plates / bowls and blue plates

Question 1
OO des tasses bleues et des assiettes bleues (blue bowls and blue plates)
O des assiettes bleues et des tasses (blue plates and bowls)

Question 2
OO capitaine de marine (navy captain)
O capitaine d’aviation (air force captain)
O capitaine d’artillerie (artillery captain)
O capitaine d’infanterie (infantry captain)
O capitaine de cavalerie (cavalry captain)
O …

Page 11: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Interaction in source for the “quality MT for all”

Example scenario: multilingual e-mail (UNL)

[Diagram: numbered steps 1 to 10 linking an e-mail tool (nicknames + language preferences), an e-mail server, an enconversion server, an analysis server, an interactive disambiguation server, several deconversion (decoding) servers, and the addressees’ e-mail servers.]

Page 12: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Other future possibility: production of multilingual “self-explaining documents”

[Diagram. Labels, translated from French: source language text, target language text 1, target language text 2, user, interactive disambiguation, ambiguity marks and dialogue ("m.a.&d."), simulated "silent" disambiguation (DMS), paraphrase choice, back-translation (back-translation 1), and the MMC, UMC, UMA and GMA structures.]

Page 13: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Speech Translation: advantages of an Interchange Format

• N target languages for the cost of one analysis
• Translating into one’s language from N source languages with one generation
• Using the same generation to “backgenerate”

[Diagram labels: Analysis into IF, IF, Backgeneration]

Page 14: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Interchange Format: example

la semaine du 12 nous avons des chambres simples et doubles disponibles
(the week of the 12th we have single and double rooms available)

give-information+availability+room (room-type=(single ; double), time=(week, md12))

Dialogue act: give-information
Concepts: availability, room
Arguments: room-type=(single ; double), time=(week, md12)
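The three parts labelled on the slide (dialogue act, concepts, arguments) can be pulled apart mechanically. The following is a minimal sketch, not the CSTAR tooling: the function name `parse_if` and the returned dictionary layout are assumptions made for illustration, and the input is the slide's example with the doubled characters removed.

```python
import re

def parse_if(expr):
    """Split an IF string into dialogue act, concepts and arguments."""
    # Everything before the first "(" is the act+concept chain.
    head, _, arg_part = expr.partition("(")
    act, *concepts = [p.strip() for p in head.split("+")]
    args = {}
    # Each argument is assumed to look like name=(v1 ; v2) or name=(v1, v2).
    for name, values in re.findall(r"([\w-]+)=\(([^()]*)\)", arg_part):
        args[name] = [v.strip() for v in re.split(r"[;,]", values)]
    return {"act": act, "concepts": concepts, "arguments": args}

example = ("give-information+availability+room "
           "(room-type=(single ; double), time=(week, md12))")
parsed = parse_if(example)
print(parsed["act"], parsed["concepts"], parsed["arguments"])
```

The sketch only handles flat argument values, as in this example; nested arguments would need a real parser.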

Page 15: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Interface of CLIPS++ CSTAR-II demonstrator

[Screenshot labels, translated from French: Recognition, IF, Backgeneration (to check the “understanding”), Generation]

Page 16: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Hardware architecture of the CLIPS++ CSTAR-II demonstrator

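[Diagram labels, partly garbled in extraction: Montpellier, Grenoble, RNIS (ISDN), Ethernet, Reco (recognition), Contrôle (control), Synthèse (synthesis), and the tokens FIF, IFF, VC, IU]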

Page 17: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Steps in translating a text

• Build its hierarchical structure
  Chapters, sections, paragraphs, [sentences]
• Segment into translation units
  According to current length parameter [min..max]
• Translate each segment
  Adding segment results to text results for desired phases
• Revise (manually) the whole translations, keep the revisions
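The second step, segmenting into translation units according to a [min..max] length parameter, can be sketched as a greedy packing of sentences. This is an illustration only: measuring length in words, the closing policy and the name `segment` are assumptions, since the slide does not say how Ariane-G5 measures or closes a unit.

```python
def segment(sentences, min_len, max_len):
    """Greedily pack sentences into translation units of about min_len..max_len words."""
    units, current, size = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the unit first if this sentence would push it past max_len.
        if current and size + n > max_len:
            units.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
        # Close the unit as soon as it is long enough.
        if size >= min_len:
            units.append(" ".join(current))
            current, size = [], 0
    if current:  # leftover shorter than min_len
        units.append(" ".join(current))
    return units

sentences = ["Build the structure.", "Segment it.",
             "Translate each segment of the text now.", "Revise."]
units = segment(sentences, min_len=4, max_len=8)
for unit in units:
    print(unit)
```

A single sentence longer than `max_len` is kept whole in this sketch rather than split.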

Page 18: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Representations of input documents

3 main questions:
• how to represent the writing system,
• separate formatting tags from the text or not,
• how to handle non-textual elements (figures, icons, or formulas) contained in utterances

Transliterations of textual elements
Keeping formatting tags in the texts
Non-textual elements

Page 19: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Transliterations of textual elements

• Facilitate string-matching operations
• Diminish the size of dictionaries
• Represent diacritics
• Make some processing easier for some tools
  kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua

Examples:
lisp, Lisp, LISP —> LISP, *LISP, **LISP
François va à ACIDCA’2000 —> *FRANC!4OIS VA A!2 **ACIDCA'2000
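The case and diacritic conventions visible in the two examples (lisp / Lisp / LISP and the François sentence) can be reproduced with a short function. This is a sketch inferred from those examples alone: the rule "* for an initial capital, ** for an all-capital word" and the two codes C!4 and A!2 are read off the slide, the names `DIACRITICS` and `transliterate` are hypothetical, and no other diacritic codes are assumed.

```python
# Only the two diacritic codes attested on the slide.
DIACRITICS = {"ç": "C!4", "à": "A!2"}

def transliterate_word(word):
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return word
    if len(letters) > 1 and all(c.isupper() for c in letters):
        prefix = "**"   # word written entirely in capitals
    elif letters[0].isupper():
        prefix = "*"    # initial capital only
    else:
        prefix = ""
    # Upper-case every character, replacing known diacritics by their code.
    body = "".join(DIACRITICS.get(c.lower(), c.upper()) for c in word)
    return prefix + body.replace("’", "'")

def transliterate(text):
    return " ".join(transliterate_word(w) for w in text.split())

print(transliterate("lisp Lisp LISP"))
print(transliterate("François va à ACIDCA’2000"))
```

A one-letter capital such as "A" is treated here as an initial capital; the slide does not show how that case is handled.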

Page 20: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Transliterations of textual elements (2)

Represent writing systems using non-Roman characters

"мать" (mother) —> "MATQ" and not "MAT6"

а я е э и ы о ё у ю й —> A YA E YE I YI O E!1 U YU J
з ж к ч с ш т щ ь ъ —> Z ZH K KH S SH T TH Q W

今日は京都へ行きます。 (Today theme Kyoto dest go.) —>

KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.

Page 21: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Keeping formatting tags in the texts

If the translation units get larger, almost all tags become “inside tags”

Tags often have a linguistic role

For example, a sentence may contain
• a bullet list
• or a numbered list
which are normally linguistically homogeneous.

<P>For example, a sentence may contain</P>
<UL>
 <LI>a bullet list
 <LI>or a numbered list
</UL>
<P>which are normally linguistically homogeneous. </P>
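One way to keep tags inside a translation unit is to treat them as tokens of their own, interleaved with the text. The sketch below assumes simple angle-bracket tags as in the HTML example; the names `TAG` and `tokenize` are illustrative and not part of Ariane-G5.

```python
import re

TAG = re.compile(r"(<[^>]+>)")

def tokenize(unit):
    """Split a translation unit into ('tag', ...) and ('text', ...) tokens, in order."""
    tokens = []
    for piece in TAG.split(unit):
        if not piece.strip():
            continue  # skip empty strings and whitespace between tags
        kind = "tag" if TAG.fullmatch(piece) else "text"
        tokens.append((kind, piece.strip()))
    return tokens

unit = ("<P>For example, a sentence may contain</P><UL> <LI>a bullet list "
        "<LI>or a numbered list</UL><P>which are normally linguistically "
        "homogeneous. </P>")
tokens = tokenize(unit)
sentence = " ".join(value for kind, value in tokens if kind == "text")
print(sentence)
```

Joining the text tokens gives back one sentence, which is the point of the slide: the list items belong to the same linguistic unit even though tags separate them.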

Page 22: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Non-textual elements

Formulas, figures, icons, brand names, anchors, links… are often best replaced by tags or special occurrences

The situation may be recursive (text inside figures)

*IF x2+5y>3 , x+y IS CONVENIENT .

*IF <relation 1> , <entity 2> IS CONVENIENT .

*IF $$R-1 , $$E-2 IS CONVENIENT .
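The replacement by special occurrences shown in the last line can be sketched as a protect/restore pair around translation. Detection of the formulas is left to the caller here: the list of (fragment, kind) pairs, and the names `protect` and `restore`, are assumptions made for the illustration.

```python
def protect(text, elements):
    """Replace each (fragment, kind) by a special occurrence $$KIND-n."""
    table = {}
    for index, (fragment, kind) in enumerate(elements, start=1):
        if fragment not in text:
            raise ValueError(f"fragment not found: {fragment!r}")
        marker = f"$${kind}-{index}"
        text = text.replace(fragment, marker, 1)  # first occurrence only
        table[marker] = fragment
    return text, table

def restore(text, table):
    """Put the original fragments back in place of their markers."""
    # Longest markers first, so that $$E-1 never clobbers part of $$E-10.
    for marker in sorted(table, key=len, reverse=True):
        text = text.replace(marker, table[marker])
    return text

source = "*IF x2+5y>3 , x+y IS CONVENIENT ."
protected, table = protect(source, [("x2+5y>3", "R"), ("x+y", "E")])
print(protected)
print(restore(protected, table))
```

In a real pipeline `restore` would be applied to the translated text, where the markers have been carried through unchanged.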

Page 23: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Structuration of corpuses

• Motivations for corpuses
• Segmentation and structuration
• Representation of texts, intermediate results, translations and revisions

Page 24: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Motivations for corpuses

Corpus = collection of texts sharing

some factual characteristics:
• natural language
• transliteration and method for handling formatting information and non-textual elements
• segmentation method
• structuration method

some management information:
• source (journal/volume, book/chapter…)
• usage destination (send back, postedit, tests…)

Page 25: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Segmentation and structuration

• "segmentation" = input texts —> words, sentences… & units of translation
  (words and sentences: best done by the morphological analyzer)
• "structuration" = segmentation —> higher level units (paragraphs, sections, etc.) + production of a corresponding tree structure
• In Ariane-G5, up to 7 hierarchical separators for a given corpus
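Structuration with hierarchical separators amounts to splitting recursively, one separator per level, and keeping the result as a tree. The sketch below uses nested lists for the tree; the separator strings `/CH/`, `/SEC/` and `/PAR/` are invented for the example, and only the limit of 7 levels comes from the slide.

```python
MAX_LEVELS = 7  # limit stated on the slide for a given corpus

def structure(text, separators):
    """Build a nested list: one level of nesting per separator, outermost first."""
    if len(separators) > MAX_LEVELS:
        raise ValueError("at most 7 hierarchical separators per corpus")
    if not separators:
        return text.strip()  # leaf: a lowest-level unit
    first, rest = separators[0], separators[1:]
    return [structure(part, rest) for part in text.split(first) if part.strip()]

text = "Intro one. /PAR/ Intro two. /SEC/ Body. /CH/ Next chapter."
tree = structure(text, ["/CH/", "/SEC/", "/PAR/"])
print(tree)
```

Here the outer list holds chapters, each chapter holds sections, and each section holds paragraphs.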

Page 26: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Representation of texts, intermediate results, translations and revisions

• Corpus = list of text files + descriptor
• Text = (transliterated) text + descriptor
  (+ non-textual elements replaced by tags or spec. occs)
• Intermediate result = list of decorated trees + descriptor
  (lingware variant + interval processed)
• Translation = (transliterated) text + descriptor
  (transliterated form may reduce morph. gen. size)
• Revision = (transliterated) text + descriptor
  (usually another, more natural transliteration)
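The "object + descriptor" pattern of this slide maps naturally onto small record types. The following is a sketch of one possible layout; all class and field names are hypothetical, and the actual Ariane-G5 file formats are not described on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Text:
    content: str                                      # transliterated text
    descriptor: dict = field(default_factory=dict)    # management information

@dataclass
class IntermediateResult:
    trees: list                # decorated trees, one per unit processed
    lingware_variant: str      # variant of the lingware that produced them
    interval: tuple            # (first unit, last unit) processed

@dataclass
class Corpus:
    texts: list = field(default_factory=list)
    descriptor: dict = field(default_factory=dict)

corpus = Corpus(descriptor={"language": "FR", "usage": "tests"})
corpus.texts.append(Text("*FRANC!4OIS VA A!2 **ACIDCA'2000", {"source": "slide 19"}))
result = IntermediateResult(trees=["<tree 1>"], lingware_variant="v1", interval=(1, 1))
print(len(corpus.texts), result.lingware_variant, result.interval)
```

Translations and revisions would reuse the `Text` shape, each with its own descriptor and transliteration.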

Page 27: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Functionalities during processing

• Ensuring coherence between lingware and results
• Stopping & restarting processing of a text
• Reusing intermediate results
  • recovery from interruptions
  • debugging
  • multitarget translation (analysis ≈ 2/3 of translation time)
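Coherence and reuse can be combined in one mechanism: store each intermediate result together with the lingware variant that produced it, and reuse it only when the variant still matches. This is a minimal sketch under that assumption; `ResultStore` and its method are invented names, not an Ariane-G5 interface.

```python
class ResultStore:
    """Keeps intermediate results with the lingware variant that produced them."""

    def __init__(self):
        self._results = {}

    def get_or_compute(self, text_id, phase, variant, compute):
        key = (text_id, phase)
        stored = self._results.get(key)
        # Reuse only if the result was produced by the same lingware variant.
        if stored is not None and stored[0] == variant:
            return stored[1]
        result = compute()
        self._results[key] = (variant, result)
        return result

calls = []

def analyse():
    calls.append("analysis")  # count how often the costly phase really runs
    return "decorated tree"

store = ResultStore()
# Two target languages share one analysis of the same source text.
for target in ("EN", "DE"):
    tree = store.get_or_compute("text-1", "analysis", "v1", analyse)
print(len(calls))
```

With analysis taking about two thirds of translation time, as the slide states, sharing it across targets is where most of the saving would come from.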

Page 28: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Conclusion and perspectives (1)

Text & corpus handling in complete MT systems is quite complex and interesting…
• handling texts and corpuses is not a straightforward problem,
• it suggests many interesting technological and scientific issues

Page 29: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Conclusion and perspectives (2)

but more is coming:

Synergy MT systems <—> TA (Translation Aids)

unification of the representations of texts in both worlds: • MT: revised texts structured as input texts,

=> the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)

• TA: translation memories from "bags" to structured translation memories (keeping the sequential context)

both: multiple-layer translation memories
• lemmatized forms
• "concrete" syntactic trees & "abstract" logico-semantic trees
• formatting tags

Page 30: Handling texts and corpuses in Ariane-G5,  a complete environment for multilingual MT


Conclusion and perspectives (3)

Structuration may be used to « distribute the work » to MT and TA by segmenting according to the « best engine »

some sublanguages are good for MT, bad for TA
• weather bulletins

others are good for TA, bad for MT
• weather related warnings, slightly modified versions of already translated documents

and others are best kept for specialists
• Fine-tune legal sentences