Machine Translation Overview
Alon Lavie
Language Technologies Institute
Carnegie Mellon University
August 25, 2004
August 25, 2004 LTI Immigration Course 2
Machine Translation: History
• MT started in the 1940s, one of the first conceived applications of computers
• Promising “toy” demonstrations in the 1950s failed miserably to scale up to “real” systems
• ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the early 1960s
• MT revival started in earnest in the 1980s (US, Japan)
• Field dominated by rule-based approaches, requiring 100s of K-years of manual development
• Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)
Machine Translation: Where are we today?
• Age of Internet and Globalization: great demand for MT
  – Multiple official languages of UN, EU, Canada, etc.
  – Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)
• Economic incentive is still primarily within a small number of language pairs
• Some fairly good commercial products in the market for these language pairs
  – Primarily a product of rule-based systems after many years of development
• Pervasive MT between most language pairs still non-existent and not on the immediate horizon
Best Current General-purpose MT
PAHO’s Spanam system:
• Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.
• Through petition received by the Inter-American Commission on Human Rights (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the American Convention on Human Rights (hereinafter …”), as a consequence of judgments initiated against it.
Core Challenges of MT
• Ambiguity:
  – Human languages are highly ambiguous, and differently in different languages
  – Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms
• Amount of required knowledge:
  – At least several 100k words, about as many phrases, plus syntactic knowledge (i.e. translation rules). How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?
How to Tackle the Core Challenges
• Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.
• Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.
• Learning Approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s XFER approach
• Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!
State-of-the-Art in MT
• What users want:
  – General purpose (GP): any text
  – High quality (HQ): human level
  – Fully automatic (FA): no user intervention
• We can meet any 2 of these 3 goals today, but not all three at once:
  – FA HQ: Knowledge-Based MT (KBMT)
  – FA GP: Corpus-Based (Example-Based) MT
  – GP HQ: Human-in-the-loop (efficiency tool)
Types of MT Applications:
• Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)
• Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)
• Communication: Lower quality may be okay, but degraded input, real-time required.
Approaches to MT: Vauquois MT Triangle
• Direct: Mi chiamo Alon Lavie → My name is Alon Lavie
• Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
• Interlingua: Give-information+personal-data (name=alon_lavie)
• Analysis on the source side, Generation on the target side
Knowledge-based Interlingual MT
• The “obvious” deep Artificial Intelligence approach:
  – Analyze the source language into a detailed symbolic representation of its meaning
  – Generate this meaning in the target language
• “Interlingua”: one single meaning representation for all languages
  – Nice in theory, but extremely difficult in practice
The Interlingua KBMT approach
• With interlingua, need only N parsers/generators instead of N² transfer systems
[Figure: languages L1 through L6 shown twice, once on their own and once arranged around a central “interlingua” node]
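The N versus N² comparison can be checked with a few lines of arithmetic. This is only a sketch under two counting assumptions that are not stated on the slide: a transfer system is needed for every ordered language pair, giving N(N-1), which grows like N², and the interlingua design needs one parser plus one generator per language, giving 2N components. The function names are made up for illustration.

```python
def transfer_systems(n: int) -> int:
    """Assumed count: one transfer system per ordered (source, target) pair."""
    return n * (n - 1)


def interlingua_components(n: int) -> int:
    """Assumed count: one parser and one generator per language."""
    return 2 * n


# Compare the two designs for a few language inventories.
for n in (3, 6, 20):
    print(f"{n} languages: {transfer_systems(n)} transfer systems "
          f"vs {interlingua_components(n)} interlingua components")
```

For the six languages in the figure this gives 30 transfer systems against 12 parsers and generators, and the gap widens quickly as languages are added.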
Statistical MT (SMT)
• Proposed by IBM in early 1990s: a direct, purely statistical, model for MT
• Statistical translation models are trained on a sentence-aligned translation corpus
• Attractive: completely automatic, no manual rules, much reduced manual labor
• Main drawbacks:
  – Effective only with huge volumes (several mega-words) of parallel text
  – Very domain-sensitive
  – Still viable only for small number of language pairs!
• Impressive progress in last 3-4 years due to large DARPA funding program (TIDES)
EBMT Paradigm
• New Sentence (Source): Yesterday, 200 delegates met with President Clinton.
• Matches to Source Found:
  – Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  – Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
• Alignment (Sub-sentential)
• Translated Sentence (Target): Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
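The match, align and recombine steps of the EBMT paradigm can be sketched as a longest-fragment lookup. This is a toy illustration, not the PANGLOSS or GEBMT implementation: the `FRAGMENTS` table stands in for the sub-sentential alignments a real system would extract from its example base, and the two entries are written by hand from the matched examples on the slide. The function names are hypothetical.

```python
import re

# ASSUMED sub-sentential alignments, hand-written from the slide's two matches.
FRAGMENTS = {
    ("yesterday", ",", "200", "delegates", "met"): "Gestern trafen sich 200 Abgeordnete",
    ("with", "president", "clinton"): "mit Praesident Clinton",
}


def tokenize(sentence):
    """Lowercase and split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def translate(sentence, fragments):
    """Greedily cover the input with the longest known source fragments."""
    tokens = tokenize(sentence)
    pieces = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span starting at position i first.
        for j in range(len(tokens), i, -1):
            target = fragments.get(tuple(tokens[i:j]))
            if target is not None:
                pieces.append(target)
                i = j
                break
        else:
            # No example covers this token: copy it through unchanged.
            pieces.append(tokens[i])
            i += 1
    # Re-attach punctuation to the preceding word.
    return re.sub(r"\s+([.,;!?])", r"\1", " ".join(pieces))


print(translate("Yesterday, 200 delegates met with President Clinton.", FRAGMENTS))
```

Greedy longest-match is the simplest possible recombination strategy; it is used here only to keep the sketch short.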
GEBMT vs. Statistical MT
Generalized-EBMT (GEBMT) uses examples at run time, rather than training a parameterized model. Thus:
  – GEBMT can work with a smaller parallel corpus than Stat MT
  – Large target language corpus still useful for generating target language model
  – Much faster to “train” (index examples) than Stat MT; until recently was much faster at run time as well
  – Generalizes in a different way than Stat MT (whether this is better or worse depends on match between Statistical model and reality):
    • Stat MT can fail on a training sentence, while GEBMT never will
    • GEBMT generalizations based on linguistic knowledge, rather than statistical model design
Multi-Engine MT
• Apply several MT engines to each input; use statistical language modeller to select best combination of outputs.
• Goal is to combine strengths, and avoid weaknesses.
• Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
• Used in various projects
MEMT chart example
Source: lideres politicos rusos firman pacto de paz civil
Candidate translations in the chart, as text: engine (score):
  – Russian leaders signed: KBMT (0.8)
  – compact of peace: EBMT (0.65)
  – political leaders: EBMT (0.9)
  – compact of: EBMT (0.7)
  – civilian: GLOSS (1.0)
  – tactful: DICT (1.0)
  – pact: GLOSS (1.0)
  – of peace: EBMT (1.0)
  – civil: GLOSS (1.0)
  – expedients: DICT (1.0)
  – bargain: DICT (1.0)
  – for: DICT (1.0)
  – civil peace: EBMT (0.9)
  – political: DICT (1.0)
  – Russians: DICT (1.0)
  – subscribe: DICT (1.0)
  – pact: DICT (1.0)
  – of: GLOSS (1.0)
  – quiet: DICT (1.0)
  – civilian: DICT (1.0)
  – leaders: DICT (1.0)
  – politic: DICT (1.0)
  – Russian: DICT (1.0)
  – sign: DICT (1.0)
  – compact: DICT (1.0)
  – of: DICT (1.0)
  – peace: DICT (1.0)
  – civil: DICT (1.0)
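Selecting one combination of engine outputs from such a chart is a best-path search. The sketch below makes two loud assumptions. First, the token spans of each chart entry were lost in this transcript, so the start and end positions in `EDGES` are guesses for illustration; only the texts, engines and scores come from the slide. Second, the scoring is a stand-in: the slides describe selection with a statistical language model, whereas this code simply adds log engine scores and charges a fixed penalty per segment so that longer segments are preferred.

```python
import math

SOURCE = "lideres politicos rusos firman pacto de paz civil".split()

# (start, end, translation, engine, score); start/end positions are ASSUMED.
EDGES = [
    (0, 4, "Russian leaders signed", "KBMT", 0.8),
    (0, 2, "political leaders", "EBMT", 0.9),
    (2, 3, "Russians", "DICT", 1.0),
    (3, 4, "sign", "DICT", 1.0),
    (4, 5, "pact", "DICT", 1.0),
    (4, 7, "compact of peace", "EBMT", 0.65),
    (5, 6, "of", "GLOSS", 1.0),
    (5, 7, "of peace", "EBMT", 1.0),
    (6, 8, "civil peace", "EBMT", 0.9),
    (7, 8, "civil", "DICT", 1.0),
]


def best_path(edges, n_tokens, edge_penalty=0.3):
    """Dynamic program over source positions; returns the best edge sequence."""
    best = [-math.inf] * (n_tokens + 1)
    back = [None] * (n_tokens + 1)
    best[0] = 0.0
    for end in range(1, n_tokens + 1):
        for edge in edges:
            start, stop, _text, _engine, score = edge
            if stop != end or best[start] == -math.inf:
                continue
            # Stand-in objective: log engine score minus a per-segment penalty.
            value = best[start] + math.log(score) - edge_penalty
            if value > best[end]:
                best[end], back[end] = value, edge
    if back[n_tokens] is None:
        raise ValueError("chart does not cover the whole source sentence")
    path, pos = [], n_tokens
    while pos > 0:
        path.append(back[pos])
        pos = back[pos][0]
    return path[::-1]


print(" ".join(edge[2] for edge in best_path(EDGES, len(SOURCE))))
```

With these assumed spans and this stand-in score, the sketch selects "Russian leaders signed", "pact", "of peace" and "civil". A language model score added to the objective is what would let a real system choose between word orders such as "civil peace" and "peace civil".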
Speech-to-Speech MT
• Speech just makes MT (much) more difficult:
  – Spoken language is messier
    • False starts, filled pauses, repetitions, out-of-vocabulary words
    • Lack of punctuation and explicit sentence boundaries
  – Current Speech technology is far from perfect
• Need for speech recognition and synthesis in foreign languages
• Robustness: MT quality degradation should be proportional to SR quality
• Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?
MT at the LTI
• LTI originated as the Center for Machine Translation (CMT) in 1985
• MT continues to be a prominent sub-discipline of research within the LTI
  – More MT faculty than any of the other areas
  – More MT faculty than anywhere else
• Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT
• Leader in the area of speech-to-speech MT
KBMT: KANT, KANTOO, CATALYST
• Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  – Syntactic and semantic analysis into an unambiguous detailed symbolic representation of meaning using unification grammars and transformation mappers
  – Generation into the target language using unification grammars and transformation mappers
• First large-scale multi-lingual interlingua-based MT system deployed commercially:
  – CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment
• Limited domains and controlled English input
• Minor amounts of post-editing
• Active follow-on projects
• Contact Faculty: Eric Nyberg and Teruko Mitamura
EBMT
• Developed originally for the PANGLOSS system in the early 1990s
  – Translation between English and Spanish
• Generalized EBMT under development for the past several years
• Currently one of the two MT approaches developed at CMU for the DARPA/TIDES program
  – Chinese-to-English, large and very large amounts of sentence-aligned parallel data
• Active research work on improving alignment and indexing, decoding from a lattice
• Contact Faculty: Ralf Brown and Jaime Carbonell
Statistical MT
• Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
• Extracted and trained from very large amounts of sentence-aligned parallel text
  – Word alignment algorithms
  – Phrase detection algorithms
  – Translation model probability estimation
• Main approach pursued in CMU systems in the DARPA/TIDES program:
  – Chinese-to-English and Arabic-to-English
• Most active work is on phrase detection and on advanced lattice decoding
• Contact Faculty: Stephan Vogel and Alex Waibel
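To make "assigned probabilities based on a statistical model" concrete, here is a minimal sketch of word translation probability estimation with expectation-maximization, in the style of IBM Model 1. It is a textbook simplification chosen for illustration, not the model used in the CMU systems: it omits the NULL word and alignment or phrase modeling, and the three-sentence corpus is invented toy data.

```python
from collections import defaultdict

# Invented toy corpus of (foreign tokens, English tokens) sentence pairs.
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_ibm1(corpus, iterations=10):
    """Estimate t[(f, e)] = P(f | e) by EM over sentence-aligned pairs."""
    foreign_vocab = {f for foreign, _ in corpus for f in foreign}
    # Start from a uniform distribution over foreign words.
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f aligned to e
        total = defaultdict(float)   # expected count of e being aligned
        for foreign, english in corpus:
            for f in foreign:
                # E-step: share each foreign word among the English words.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


model = train_ibm1(CORPUS)
for pair in [("das", "the"), ("haus", "the"), ("buch", "book"), ("ein", "a")]:
    print(pair, round(model[pair], 3))
```

Even on three sentences, the co-occurrence pattern is enough for EM to favor "das" over "haus" as the translation of "the", which is the intuition behind learning word correspondences from parallel text.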
Speech-to-Speech MT
• Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON
  – Early 1990s: first prototype system that fully performed speech-to-speech translation (very limited domain)
  – Interlingua-based, but with shallow task-oriented representations: “we have single and double rooms available” [give-information+availability] (room-type={single, double})
  – Semantic Grammars for analysis and generation
  – Multiple languages: English, German, French, Italian, Japanese, Korean, and others
  – Most active work on portable speech translation on small devices: Arabic/English and Thai/English
  – Contact Faculty: Alan Black, Tanja Schultz and Alex Waibel (also Alon Lavie and Lori Levin)
AVENUE: Transfer-based MT
• Develop new approaches for automatically acquiring syntactic MT transfer rules from small amounts of elicited translated and word-aligned data
  – Specifically designed to bootstrap MT for languages for which only limited amounts of electronic resources are available (particularly indigenous minority languages)
  – Use machine learning techniques to generalize transfer rules from specific translated examples
  – Combine with decoding techniques from SMT for producing the best translation of new input from a lattice of translation segments
• Languages: Hebrew, Hindi, Mapudungun, Quechua
• Most active work on designing a typologically comprehensive elicitation corpus, advanced techniques for automatic rule learning, improved decoding, and rule refinement via user interaction
• Contact Faculty: Alon Lavie, Lori Levin and Jaime Carbonell
Transfer with Strong Decoding
[Figure: system architecture. Components shown: Word-aligned elicited data, Learning Module, Transfer Rules, Translation Lexicon, Run Time Transfer System, Lattice Decoder, English Language Model, Word-to-Word Translation Probabilities]
Example transfer rule from the figure:
{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)(X1::Y2))
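The example rule says that a source PP made of an NP followed by a postposition becomes a target PP made of a preposition followed by an NP, with the second source constituent (X2) mapped to the first target slot (Y1) and the first (X1) to the second (Y2). A minimal sketch of applying such a rule follows. The `TransferRule` class, the `apply_rule` function and the two-entry lexicon are hypothetical illustrations, not the AVENUE transfer engine, and the romanized Hindi phrase is an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    source_rhs: tuple   # constituent labels on the source side
    target_rhs: tuple   # constituent labels on the target side
    alignments: tuple   # (source index, target index) pairs, 1-based like X2::Y1


# PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1)(X1::Y2))
PP_RULE = TransferRule(("NP", "POSTP"), ("PREP", "NP"), ((2, 1), (1, 2)))

# ASSUMED toy lexicon for a romanized Hindi postpositional phrase.
TOY_LEXICON = {"ghar": "the house", "mein": "in"}


def apply_rule(rule, children, lexicon):
    """Reorder source constituents into target slots and translate each one."""
    labels = tuple(label for label, _ in children)
    if labels != rule.source_rhs:
        raise ValueError(f"rule expects {rule.source_rhs}, got {labels}")
    target = [None] * len(rule.target_rhs)
    for x, y in rule.alignments:
        _, words = children[x - 1]
        target[y - 1] = (rule.target_rhs[y - 1], lexicon[words])
    # Drop any target slot that has no aligned source constituent.
    return [slot for slot in target if slot is not None]


result = apply_rule(PP_RULE, [("NP", "ghar"), ("POSTP", "mein")], TOY_LEXICON)
print(" ".join(text for _, text in result))
```

In the architecture on this slide, rules like this would produce many competing translation segments, and the lattice decoder would choose among them.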
MT for Minority and Indigenous Languages: Challenges
• Minimal amount of parallel text
• Possibly competing standards for orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time and cost
Learning Transfer-Rules for Languages with Limited Resources
• Rationale:
  – Large bilingual corpora not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using elicitation tool
  – Elicitation corpus designed to be typologically comprehensive and compositional
  – Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data
English-Hindi Example
Questions…