Machine Translation Overview

Alon Lavie

Language Technologies Institute

Carnegie Mellon University

August 25, 2004

August 25, 2004 LTI Immigration Course 2

Machine Translation: History

• MT started in the 1940s, as one of the first conceived applications of computers

• Promising “toy” demonstrations in the 1950s failed miserably to scale up to “real” systems

• ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the 1960s

• MT revival started in earnest in the 1980s (US, Japan)

• Field dominated by rule-based approaches, requiring 100s of K-years of manual development

• Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)


Machine Translation: Where are we today?

• Age of Internet and globalization, with great demand for MT:
  – Multiple official languages of UN, EU, Canada, etc.
  – Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)

• Economic incentive is still primarily within a small number of language pairs

• Some fairly good commercial products in the market for these language pairs
  – Primarily a product of rule-based systems after many years of development

• Pervasive MT between most language pairs still non-existent and not on the immediate horizon


Best Current General-purpose MT

PAHO’s Spanam system:

• Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.

• Through petition received by the `Inter-American Commission on Human Rights` (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to // equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the `American Convention on Human Rights` (hereinafter …”), as a consequence of judgments initiated against it.


Core Challenges of MT

• Ambiguity:
  – Human languages are highly ambiguous, and differently in different languages
  – Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms

• Amount of required knowledge:
  – At least several 100k words, about as many phrases, plus syntactic knowledge (i.e. translation rules). How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?


How to Tackle the Core Challenges

• Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.

• Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.

• Learning Approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s XFER approach

• Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!


State-of-the-Art in MT

• What users want:
  – General purpose (any text)
  – High quality (human level)
  – Fully automatic (no user intervention)

• We can meet any 2 of these 3 goals today, but not all three at once:
  – FA HQ: Knowledge-Based MT (KBMT)
  – FA GP: Corpus-Based (Example-Based) MT
  – GP HQ: Human-in-the-loop (efficiency tool)


Types of MT Applications:

• Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)

• Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)

• Communication: Lower quality may be okay, but degraded input, real-time required.


Approaches to MT: Vauquois MT Triangle

[Figure: triangle with Analysis on the source side and Generation on the target side; Direct, Transfer, and Interlingua are its three levels]

• Direct: Mi chiamo Alon Lavie → My name is Alon Lavie

• Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]

• Interlingua: Give-information+personal-data (name=alon_lavie)


Knowledge-based Interlingual MT

• The “obvious” deep Artificial Intelligence approach:
  – Analyze the source language into a detailed symbolic representation of its meaning
  – Generate this meaning in the target language

• “Interlingua”: one single meaning representation for all languages
  – Nice in theory, but extremely difficult in practice


The Interlingua KBMT approach

• With interlingua, need only N parsers/generators instead of N² transfer systems

[Figure: two diagrams of six languages, L1 to L6; the second has an “interlingua” node at its center]
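The N versus N² comparison can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: one transfer system per ordered language pair, and one analyzer plus one generator per language for the interlingua design. The function names are made up for this example.

```python
def transfer_systems(n):
    """Directed language pairs, one transfer system each: n * (n - 1)."""
    return n * (n - 1)


def interlingua_components(n):
    """One analyzer and one generator per language: 2 * n."""
    return 2 * n


if __name__ == "__main__":
    # Six languages, as in the L1..L6 diagram.
    for n in (2, 6, 20):
        print(n, transfer_systems(n), interlingua_components(n))
```

Under these assumptions the transfer count grows quadratically and the interlingua count linearly, which is the point the slide makes with N² and N.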


Statistical MT (SMT)

• Proposed by IBM in early 1990s: a direct, purely statistical, model for MT

• Statistical translation models are trained on a sentence-aligned translation corpus

• Attractive: completely automatic, no manual rules, much reduced manual labor

• Main drawbacks:
  – Effective only with huge volumes (several mega-words) of parallel text
  – Very domain-sensitive
  – Still viable only for small number of language pairs!

• Impressive progress in last 3-4 years due to large DARPA funding program (TIDES)

EBMT Paradigm

New Sentence (Source):
Yesterday, 200 delegates met with President Clinton.

Matches to Source Found:
Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
Difficulties with President Clinton… / Schwierigkeiten mit Praesident Clinton…

Alignment (Sub-sentential):
Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…

Translated Sentence (Target):
Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
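The match-align-recombine steps in this example can be sketched in a few lines. This is a toy illustration, not the PANGLOSS or GEBMT implementation. The fragment table is hypothetical: it is written as if sub-sentential alignment of the two matched examples had already been done. The greedy longest-match strategy is likewise an assumption made for brevity.

```python
import re

# Hypothetical aligned fragments, as if extracted from the two matched examples.
FRAGMENTS = {
    ("yesterday", ",", "200", "delegates", "met"): "Gestern trafen sich 200 Abgeordnete",
    ("with", "president", "clinton"): "mit Praesident Clinton",
}
MAX_LEN = max(len(key) for key in FRAGMENTS)


def tokenize(sentence):
    """Lowercase and split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def translate(sentence):
    """Cover the input left to right with the longest known source fragments."""
    tokens = tokenize(sentence)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in FRAGMENTS:
                out.append(FRAGMENTS[key])
                i += n
                break
        else:
            # No example covers this token; pass it through unchanged.
            out.append(tokens[i])
            i += 1
    # Re-attach punctuation to the preceding word.
    return re.sub(r"\s+([.,!?])", r"\1", " ".join(out))


if __name__ == "__main__":
    print(translate("Yesterday, 200 delegates met with President Clinton."))
```

With the two fragments above, the input sentence from the slide is covered by one fragment from each example, and the recombined output matches the target sentence shown.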


GEBMT vs. Statistical MT

Generalized-EBMT (GEBMT) uses examples at run time, rather than training a parameterized model. Thus:
  – GEBMT can work with a smaller parallel corpus than Stat MT
  – Large target language corpus still useful for generating target language model
  – Much faster to “train” (index examples) than Stat MT; until recently was much faster at run time as well
  – Generalizes in a different way than Stat MT (whether this is better or worse depends on match between Statistical model and reality):
    • Stat MT can fail on a training sentence, while GEBMT never will
    • GEBMT generalizations based on linguistic knowledge, rather than statistical model design


Multi-Engine MT

• Apply several MT engines to each input; use statistical language modeller to select best combination of outputs.

• Goal is to combine strengths, and avoid weaknesses.

• Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.

• Used in various projects

MEMT chart example

Source: lideres politicos rusos firman pacto de paz civil

Chart entries (candidate translation: engine (score)):
Russian leaders signed: KBMT (0.8)
compact of peace: EBMT (0.65)
political leaders: EBMT (0.9)
compact of: EBMT (0.7)
civilian: GLOSS (1.0)
tactful: DICT (1.0)
pact: GLOSS (1.0)
of peace: EBMT (1.0)
civil: GLOSS (1.0)
expedients: DICT (1.0)
bargain: DICT (1.0)
for: DICT (1.0)
civil peace: EBMT (0.9)
political: DICT (1.0)
Russians: DICT (1.0)
subscribe: DICT (1.0)
pact: DICT (1.0)
of: GLOSS (1.0)
quiet: DICT (1.0)
civilian: DICT (1.0)
leaders: DICT (1.0)
politic: DICT (1.0)
Russian: DICT (1.0)
sign: DICT (1.0)
compact: DICT (1.0)
of: DICT (1.0)
peace: DICT (1.0)
civil: DICT (1.0)
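Selecting a combination of engine outputs from such a chart can be sketched as a search over edges that together cover the source sentence. The sketch below rests on several assumptions. The candidate texts, engines, and scores are taken from the chart, but the source spans are guesses, since the chart layout is not recoverable here. The tiny bigram set is a hypothetical stand-in for a real statistical language model. The scoring formula is invented for illustration.

```python
import math

# (start, end, text, engine, score). Spans over the 8 source words are ASSUMED.
CHART = [
    (0, 4, "Russian leaders signed", "KBMT", 0.8),
    (0, 2, "political leaders", "EBMT", 0.9),
    (2, 3, "Russians", "DICT", 1.0),
    (3, 4, "sign", "DICT", 1.0),
    (4, 5, "pact", "GLOSS", 1.0),
    (4, 7, "compact of peace", "EBMT", 0.65),
    (5, 6, "of", "DICT", 1.0),
    (5, 7, "of peace", "EBMT", 1.0),
    (6, 8, "civil peace", "EBMT", 0.9),
    (7, 8, "civil", "DICT", 1.0),
]

# Hypothetical "fluent" bigrams standing in for a statistical language model.
GOOD_BIGRAMS = {
    ("russian", "leaders"), ("leaders", "signed"), ("signed", "pact"),
    ("pact", "of"), ("civil", "peace"), ("political", "leaders"),
}


def paths(pos, end):
    """Yield every sequence of chart edges that covers pos..end without gaps."""
    if pos == end:
        yield []
        return
    for edge in CHART:
        if edge[0] == pos:
            for rest in paths(edge[1], end):
                yield [edge] + rest


def score(path):
    """Length-weighted log engine scores plus one point per known bigram."""
    engine = sum((e - s) * math.log(conf) for s, e, _, _, conf in path)
    words = " ".join(text for _, _, text, _, _ in path).lower().split()
    lm = sum(1.0 for bigram in zip(words, words[1:]) if bigram in GOOD_BIGRAMS)
    return engine + lm


def best_translation(n_source_words=8):
    best = max(paths(0, n_source_words), key=score)
    return " ".join(text for _, _, text, _, _ in best)


if __name__ == "__main__":
    print(best_translation())
```

Exhaustive enumeration is only workable because the toy chart is tiny. A real system would use dynamic programming or beam search over the chart.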


Speech-to-Speech MT

• Speech just makes MT (much) more difficult:
  – Spoken language is messier
    • False starts, filled pauses, repetitions, out-of-vocabulary words
    • Lack of punctuation and explicit sentence boundaries
  – Current speech technology is far from perfect

• Need for speech recognition and synthesis in foreign languages

• Robustness: MT quality degradation should be proportional to SR quality

• Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?


MT at the LTI

• LTI originated as the Center for Machine Translation (CMT) in 1985

• MT continues to be a prominent sub-discipline of research within the LTI
  – More MT faculty than any of the other areas
  – More MT faculty than anywhere else

• Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT

• Leader in the area of speech-to-speech MT


KBMT: KANT, KANTOO, CATALYST

• Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  – Syntactic and semantic analysis into an unambiguous detailed symbolic representation of meaning using unification grammars and transformation mappers
  – Generation into the target language using unification grammars and transformation mappers

• First large-scale multi-lingual interlingua-based MT system deployed commercially:
  – CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment

• Limited domains and controlled English input

• Minor amounts of post-editing

• Active follow-on projects

• Contact Faculty: Eric Nyberg and Teruko Mitamura


EBMT

• Developed originally for the PANGLOSS system in the early 1990s
  – Translation between English and Spanish

• Generalized EBMT under development for the past several years

• Currently one of the two MT approaches developed at CMU for the DARPA/TIDES program
  – Chinese-to-English, large and very large amounts of sentence-aligned parallel data

• Active research work on improving alignment and indexing, decoding from a lattice

• Contact Faculty: Ralf Brown and Jaime Carbonell


Statistical MT

• Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model

• Extracted and trained from very large amounts of sentence-aligned parallel text
  – Word alignment algorithms
  – Phrase detection algorithms
  – Translation model probability estimation

• Main approach pursued in CMU systems in the DARPA/TIDES program:
  – Chinese-to-English and Arabic-to-English

• Most active work is on phrase detection and on advanced lattice decoding

• Contact Faculty: Stephan Vogel and Alex Waibel
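To make “translation model probability estimation” concrete, here is a minimal sketch of one classic estimator, IBM Model 1 trained with EM. The slides do not say which models the CMU systems use, so this is a swapped-in textbook technique, not a description of those systems. The three-sentence German-English corpus is a made-up toy, and the NULL word is omitted for brevity.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target | source) with EM."""
    t = defaultdict(lambda: 1.0)  # uniform start; the first E-step normalizes it
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words in its sentence.
        for src, tgt in pairs:
            for w_t in tgt:
                norm = sum(t[(w_t, w_s)] for w_s in src)
                for w_s in src:
                    frac = t[(w_t, w_s)] / norm
                    count[(w_t, w_s)] += frac
                    total[w_s] += frac
        # M-step: renormalize the expected counts per source word.
        for (w_t, w_s), c in count.items():
            t[(w_t, w_s)] = c / total[w_s]
    return t


# Toy sentence-aligned corpus (hypothetical), tokenized and lowercased.
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

if __name__ == "__main__":
    probs = ibm_model1(CORPUS)
    for (w_t, w_s), p in sorted(probs.items()):
        print(f"P({w_t} | {w_s}) = {p:.3f}")
```

Even on three sentence pairs, co-occurrence statistics are enough for EM to push probability toward the intuitive word pairs, for example “book” given “buch” over “the” given “buch”.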


Speech-to-Speech MT

• Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON
  – Early 1990s: first prototype system that fully performed sp-to-sp (very limited domain)
  – Interlingua-based, but with shallow task-oriented representations: “we have single and double rooms available” [give-information+availability] (room-type={single, double})
  – Semantic Grammars for analysis and generation
  – Multiple languages: English, German, French, Italian, Japanese, Korean, and others
  – Most active work on portable speech translation on small devices: Arabic/English and Thai/English
  – Contact Faculty: Alan Black, Tanja Schultz and Alex Waibel (also Alon Lavie and Lori Levin)


AVENUE: Transfer-based MT

• Develop new approaches for automatically acquiring syntactic MT transfer rules from small amounts of elicited translated and word-aligned data
  – Specifically designed to bootstrap MT for languages for which only limited amounts of electronic resources are available (particularly indigenous minority languages)
  – Use machine learning techniques to generalize transfer rules from specific translated examples
  – Combine with decoding techniques from SMT for producing the best translation of new input from a lattice of translation segments

• Languages: Hebrew, Hindi, Mapudungun, Quechua

• Most active work on designing a typologically comprehensive elicitation corpus, advanced techniques for automatic rule learning, improved decoding, and rule refinement via user interaction

• Contact Faculty: Alon Lavie, Lori Levin and Jaime Carbonell


Transfer with Strong Decoding

[Figure: architecture diagram with the components Word-aligned elicited data, Learning Module, Transfer Rules, Translation Lexicon, Run Time Transfer System, Lattice Decoder, English Language Model, and Word-to-Word Translation Probabilities]

Example transfer rule:

{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
(X1::Y2))
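The example rule maps a source PP of the form [NP POSTP] to a target PP of the form [PREP NP], with the alignments X2::Y1 and X1::Y2 saying which source constituent fills each target slot. The sketch below shows only that reordering step. The dictionary layout, the function name, and the sample strings are assumptions for illustration, not the AVENUE transfer engine's actual rule format or API.

```python
# Rule fields transcribed from the example rule; the dict layout is an assumption.
PP_RULE = {
    "source_rhs": ["NP", "POSTP"],
    "target_rhs": ["PREP", "NP"],
    "alignments": [(2, 1), (1, 2)],  # (X2::Y1), (X1::Y2), 1-based as in the rule
}


def apply_rule(rule, translated_children):
    """Place already-translated source constituents into target order."""
    if len(translated_children) != len(rule["source_rhs"]):
        raise ValueError("rule does not match the number of source constituents")
    target = [None] * len(rule["target_rhs"])
    for x, y in rule["alignments"]:
        target[y - 1] = translated_children[x - 1]
    return target


if __name__ == "__main__":
    # Hypothetical input: translated NP first, translated postposition second.
    print(" ".join(apply_rule(PP_RULE, ["the house", "in"])))
```

Here the postposition, which follows the noun phrase on the source side, ends up in front of it as a preposition on the target side.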


MT for Minority and Indigenous Languages: Challenges

• Minimal amount of parallel text

• Possibly competing standards for orthography/spelling

• Often relatively few trained linguists

• Access to native informants possible

• Need to minimize development time and cost


Learning Transfer-Rules for Languages with Limited Resources

• Rationale:
  – Large bilingual corpora not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using elicitation tool
  – Elicitation corpus designed to be typologically comprehensive and compositional
  – Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data


English-Hindi Example


Questions…
