phrase-based statistical machine translation using finite

Phrase-Based Statistical Machine TranslationUsing Finite State Machines

with some links to ASR

Bill Byrne

Cambridge University Engineering Department

The Johns Hopkins University Center for Language and Speech Processing

[email protected]

27 May 2005

Work done with Shankar Kumar and Yonggang Deng

IntroductionPhrase Translation via WFSTs

Embedded Model EstimationConclusion

Can We Pretend MT is ASR ?

Of course. We just need :I Alignment models and estimation algorithmsI Training data: bitext, monolingual textI Search algorithms for translationI Some way to measure translation quality

Cambridge UniversityEngineering Department Current Research in SMT– 1/28



Five Easy Problems in Statistical Machine Translation

What’s currently needed to build a basic phrase-based SMT system:

I Bitext for SMT Training: Document and Sentence AlignmentI Models of Word and Phrase Alignment Within SentencesI Language ModelsI Translation – Model-Based Search AlgorithmsI Automatic Measurement of Translation Quality




Component TransducersAlignment and Translation Operations

Translation

Once the models are defined, translation is ‘trivial’ .

Its just search ... e = argmaxe P(e|f)

One approach is to construct specialized search algorithmsI Depending on the underlying models, search can be by Viterbi, A∗, or

other specialized search procedures

But decoder design and implementation is complexI Small model changes might require large changes to a decoderI Approximations in search imply inexact implementation of the models

I Can only implement what can be implementedI May as develop models which lead to exact realizations

I Decoder implementation takes effort away from ‘modeling’





Translation via Weighted Finite State Transducers

Translation with Finite State Devices, Knight & Al Onaizan, AMTA’98I Implements the IBM models as WFSTs

I word-to-word translation, word fertility, and permutation (reordering)

If the component models can be implemented as WFSTs which can becomposed, building a decoder is trivial

I Can be limiting, but avoids special-purpose decodersI The value of this modeling approach has been shown in ASR by the

systems developed at AT&TI Translation is performed using libraries of standard FSM operationsI Clear formulation of underlying model componentsI Easy to work on modules in isolation





Translation Template Model - TTM

Generative source-channel model of machine translationI Takes the best of

I Och&Ney’s Phrase-based translation modelsI Knight&Al Onaizan WFST description of translation via IBM models

I Bitext word alignment and translation under the model can be performedusing standard WFST operations

I Modular ImplementationI No need for a specialized decoder - “the model is the decoder”I Can easily generate translation lattices and N-best lists

Overall Goals:I Relatively good performance with models that are really quite simpleI Should be easy to use in translating ASR latticesI Easy to teach





TTM (2004) – Translation with Monotone Phrase Order

1

Source PhraseSegmentation

Source PhraseReordering

exports grain are_projected_to fall by_25_%

Placement of

Target Phrase

Insertion Markers

Target PhraseInsertion

PhraseTransduction

Target PhraseSegmentation

1

are_projected_tograin exports fall by_25_% Source Phrases

grain exports are projected to fall by 25 % Source LanguageSentence

Reordered SourcePhrases

grain are_projected_to by_25_%fall

Target Phrasesflechirdoivent de_25_%les exportations de grains

les exportations de grains doivent flechir de 25 %Target Language

Sentence

exports

I Transformations via stochastic models implemented as WFSTsI Implementation is direct using standard WFST operations





TTM (2005) – Translation with Moving Target Phrase Order

Target Phrase

0c

4c

5c

0d

1d

1v

2v

3v

4v

5v

6v

7v

2f

3f

4f

5f

6f

7f

8f

9f

2d

3d

4d

5d

2c

3c

1c

x1 x

2x

3x

4x

5

1e

5e

7e

2e

3e

4e

6e

9e

8e

u1 u

2u

3u

4u

5

y1 y

5y

4y

3y

2

1f

Source Phrase

Segmentation

Target Phrase

Segmentation

Reordering

Translation

Insertion

Target Phrase

doivent de_25_%exportationsgrains fléchir

exportations grains de_25_%doivent fléchir

1

les exportations de grains doivent fléchir de_25_%

1.exportations doiventgrains fléchir

Target Phrasesin Source Phrase Order

grain exports are_projected_to by_25_%

grain exports are projected to fall by 25 %Sentence

fall Source Phrases

Target Phrases

Target Phrases

Source Language

Target Language Sentence

Phrase Insertion Markers

in Target Phrase Order

de_25_%

les exportations de grains doivent fléchir de 25 %

Target

Phrase





Phrase-to-Phrase Translation Models

Alignment Template Models (Och et al. 1999)I derived from ‘good’ word-level alignments, typically from IBM-4

this bill places salve on a sore wound

ce bill met de le baume sur une blessure

this bill places salve on a sore woundce bill met de le baume sur une blessure

this bill places salve on a sore wound

ce bill met de le baume sur une blessureIBM−4 F

IBM−4 E

u1

v1 v2

u2

IBM−4 F U E

I Phrase Pairs are extracted to cover word alignment patternsI Probability distributions are defined over phrase pair sequences





The Phrase Pair Inventory

English Phrase French Phrase Phrase Transductionu v Probability P(v |u)

hear hear bravo 0.8bravo bravo 0.15

ordre 0.05terms of reference mandat 0.8

de son mandat 0.2

I Phrase Pair Inventory affects the performance of the TTMI Word Alignment Quality of underlying modelsI Coverage of phrases on the test set





Target Phrase Segmentation Transducer Ω

:ε voilà

: leε

: problèmeε

: problème

ε

εfon

damen

tal

:

ε fondamental

:

: voilàε

: εvoilà

ε:

le_problème_fondamental

ε:vo

ilà_l

e

voilà

:ε le

le problème fondamentalF

problème_fondamental ε:

Ω

Based on the French phrases inthe PPI

Assume F is the sentence to betranslated

Phrase sequences that couldhave generated F : Ω Fvoila le probleme fondamentalvoila le probleme fondamental





Phrase Translation Transducer Y

:

Y

fundamental_problem:problème_fondamental/1basic_problem:problème_fondamental/0.3

the_fundamental_problem_we_face:le_problème_fondamental/1

that_is:voilà/0.11those_are:voilà/0.25these_are:voilà/0.12

those_are_the:voilà_le/0.11that_these_are_the:voilà_le/1

that_is_the:voilà_le/0.10

le_problème_fondamental/0.4the_basic_problem

Map English phrases intoFrench phrases

Realizes the Phrase-PairInventory





Target Language Phrase Insertion

1

c1

c2

c4

c5

c3

c

0d

1d

1v

2v

3v

4v

5v

6v

7v

1f

2f

3f

4f

5f

6f

7f

8f

9f

2d

3d

4d

5d

1 de_25_%exportations.

fléchirdoivent de_25_%les exportations de grains

les exportations de grains doivent fléchir de 25 %

Target Language Sentence

Target Phrases

Source Phrases with Target Phrase Insertion Markers

flechirdoiventgrains

0

Different phrase segmentations lead to different translations :I Sequences cK

0 that could have generated F : Y Ω FH1 that these are the fundamental problem :

voila le probleme fondamental...

H16 that is the basic problem:voila le probleme fondamental





Procedures for Alignment and Translation (Monotone Phrase Order)

Given a French sentence f J1 to be translated into English, we build the

following transducers (in this order)I F to represent the French sentenceI Ω maps French phrases in our Phrase-Pair Inventory (PPI) to words in FI Y maps English phrases to French phrases in Ω with probabilities (PPI)I Φ inserts French phrase insertion markersI W maps English words to English phrases seen in Y

(restricted phrase segmentation transducer)

If f J1 is to be aligned with an English sentence eI

1:I Build an FSM E to represent eI

1

If f J1 is to be translated:I Build a n-gram backoff LM and compile it as a weighted acceptor G





Bitext Word Alignment and Translation Via WFSTs

TTM Generative Model: P(f J1 , vR

1 , dK0 , cK

0 , aK1 , uK

1 , K , eI1)

I MAP Alignment of a sentence pair f J1 ,eI

1

K , uK1 , aK

1 , cK0 , d K

0 , v R1 = argmax

K ,uK1 ,aK

1 ,cK0 ,dK

0 ,vR1

P(K , uK1 , aK

1 , cK0 , dK

0 , vR1 |eI

1, f J1 )

I MAP Translation of a French sentence f J1

eI1, K , uK

1 , aK1 , cK

0 , d K0 , v R

1 = argmaxeI

1,K ,uK1 ,aK

1 ,cK0 ,dK

0 ,vR1

P(K , uK1 , aK

1 , ck0 , dK

0 , vR1 |f J

1 )

WFST Operations with Monotone SearchI Alignment

1. Generate the alignment lattice: B = E W Φ Y Ω F2. MAP Alignment : least cost path in B

I Translation1. Generate the translation lattice: T = G U Φ Y Ω F2. MAP Translation : least cost path in T




Translation Performance and Analysis

Problems with Bitext Word Alignment under the TTM

Consider extraction of phrase pairs from word alignments

il demeure donc important dans un pays comme le nôtre

transportation is important to this country

I Extracted PPI is not rich enough to cover all the sentence-pairsI If we discard the word alignments, this pair has probability zero under the

modelI In fact, most sentences from the training bitext have probability zero

I Solution:- TTM already allows insertions of target phrases- In addition, we allow deletions of source phrases





Source Phrase Deletion in Bitext Word Alignmenttransportation is important to this country

transportation is important_to_this_country

I3 transportation is important_to_this_country I3

important_dans_un_pays comme_le_nôtreεil_demeure_donc ε

il demeure donc important dans un pays comme le nôtre

I Novel use of phrase-based translation models for alignmentI TTM word alignments are very accurate: ↑ precision, ↓ recall

Supports parameter estimation proceduresI Must be able to segment and align the bitext

EM requiresP( K , uK

1 , aK1 , cK

0 , dK0 , vR

1| z Hidden

Variables

| eI1, f J

1| z Training

Data

)





Phrase Swapping by WFSTsx 2 x 3 x 4 x 5x 1

y 2 y 3 y 4 y 5y 1

3b = 01b = +12b = −1 4b = 0 5b = 0

doivent de_25_%exportations fléchir


grains

Associate a jump sequence bK1 with each sequence yK

1P(yK

1 |xK1 , uK

1 , K , eI1) = P(bK

1 |xK1 , uK

1 , K , eI1) =

QKk=1 P(bk |xk , uk )

1x y2:

x2 1y

1x 1y:x2 y2: / 1−p / b=0

1 2

/ 1−p / b=0

p / b=+1

1 / b=−1

:

Inspired by work of Tillman

Input: x1, x2, b ∈ 0, +1,−1Output: y1, y2 with prob (1 − p)2

y2, y1 with prob p

MJ-1 : maximum jump of 1

Properly parameterized

Not degenerate





Phrase Swapping by WFSTsx 2 x 3 x 4 x 5x 1

y 2 y 3 y 4 y 5y 1

3b = 01b = +12b = −1 4b = 0 5b = 0

doivent de_25_%exportations fléchir


grains

Associate a jump sequence bK1 with each sequence yK

1P(yK

1 |xK1 , uK

1 , K , eI1) = P(bK

1 |xK1 , uK

1 , K , eI1) =

QKk=1 P(bk |xk , uk )

b=0

1

23

45

6

b=−1

b=+1b=−1

b=+2

b=0

b=−2

b=−1

b=+1 b=−2

Inspired by work of Tillman

Input: x1, x2, b ∈ 0, +1,−1Output: y1, y2 with prob (1 − p)2

y2, y1 with prob p

MJ-2 : maximum jump of 2

Properly parameterized

Not degenerate





Incorporating Reordering in Translation

I Reodering prior to translation:I English phrases in French phrase orderI Difficult to realize with WFSTs

I Tried the opposite approach:I First generate a French phrase sequence in English phrase orderI Next reorder this sequence into French phrase order under the Local Phrase

Reordering ModelI Possible to realize with WFSTs both in alignment and translation

I No English phrase reordering processI Reordering is done prior to Insertion of Target Phrases

I word alignments within phrases can span fairly long distances

Can perform embedded reestimation of reordering model parametersI Phrase-pair dependent reordering probability : P(bk |xk , uk )

I Estimated via Viterbi approximation to EMI Exact estimation: alignments are done under the translation model





Training and Translation Procedures

1. Bitext Chunking1.1 Monotone alignment into coarse chunks of documents1.2 Divisive clustering into subsentence chunks

2. MTTK word alignment – Model 4 replacement3. Construct component WFSTs for the TTM –

3.1 extract phrases from the alignments using the ‘usual’ heuristics3.2 use phrase-pair induction under MTTK to augment the PPI

4. Translation lattice generation with pruned trigram →5. Translation lattice rescoring with unpruned 4gram →6. Minimum Error Training

6.1 transducer weights optimized for BLEU on heldout data





TTM and Phrase Reordering

Good gains from reordering in both Arabic→English and Chinese→EnglishI Jump-1 might be as good as Jump-2 (in this formulation)I little gain in Chinese→English from parameter estimation

Translation under MJ-1 and MJ-2 reordering with a 4-gram LM.BLEU (%)

Reordering Arabic-English Chinese-EnglishModel 02 03 04 02 03 04None 37.5 40.3 36.8 24.2 23.7 26.0

MJ-1 flat 40.4 43.9 39.4 25.7 24.5 27.4MJ-1 VT 41.3 44.8 40.3 25.8 24.5 27.8MJ-2 flat 41.0 44.4 39.7 26.2 24.8 27.8MJ-2 VT 41.4 45.0 40.2 26.4 24.8 27.8





Influence of Language Model Order on Reordering in Translation

BLEU Over Merged Eval Sets (Eval02,Eval03,Eval04)BLEU (%)

A-E C-E2g 3g 4g 2g 3g 4g

None 21.0 36.8 37.8 16.1 24.8 25.0MJ-1 23.4 40.4 41.6 16.2 25.9 26.5MJ-2 23.3 40.4 41.6 16.0 26.0 26.8

These Simple Reordering Models Benefit from Better Language Models





Translation Performance with Reordering across Eval 04 Test Genres

Model BLEU (%)A-E C-E

News Eds Spchs News Eds SpchsNone 41.1 30.8 33.3 23.6 25.9 30.8

MJ-1 flat 44.7 31.6 35.3 24.3 27.6 32.9MJ-1 VT 45.6 32.6 35.7 24.8 27.8 33.3MJ-2 flat 45.2 31.9 35.2 24.8 27.8 33.4MJ-2 VT 45.8 32.4 35.2 24.8 27.7 33.6

Causes:I Training / Test Mismatch ?I Language model mismatch, in particular ?I Less movement in Speeches and Editorials ?I Different operating points ?




Unsolicited Opinions

Summary: Working Towards Integrated Modeling and DecodingModeling: Given an English Sentence e and a French sentence f , construct ajoint distribution over their alignments.

P(e, a, f ) = P(f |a, e)| z Translation

Model

P(a|e)| z Alignment

Model

P(e)| z Language

Model

Decoding (ideal) : Given f , find a translation be and an alignment ba as

( be, ba ) = argmaxe,a

P(f |a, e) P(a|e) P(e)

Decoding (really) : lacks integrated modeling and decodingI Models are trained & alignments are generated over the training setI The models are discarded, and the alignments are keptI PPI, etc. are extracted from the alignments and used in translation

Goal: same models in alignment and translation – ‘what works’ for ASRI needed for : MMI, clustering, context dependence, ...Cambridge UniversityEngineering Department Current Research in SMT– 24/28




New! – TTM Tutorial is Available

German→English translation based onI Europarl corpusI Giza++ alignmentsI AT&T FSM Toolkit

I any FSM toolkit should work ...

Tutorial steps through building and using all the transducers

Very much an alpha version, but available to anybody who’s interested





The Machine Translation PyramidCasts the problem in familiar terms- strengths and weaknesses of the formulation are obvious

Interlingua

English

Words

English

Syntax

English

Semantics

Foreign

Words

Foreign

Syntax

Foreign

Semantics





The Statistical Machine Translation Pyramid

Building good systems requires balancing competing concerns

Simple Models & Fast Algorithms

MachineLearning

LinguisticKnowledge

Evidence

Explication

Elegance





Thanks!


phrase-based statistical machine translation using finite

Documents