Monolingual data in NMT

Franck Burlot & François Yvon

LIMSI, CNRS, Université Paris-Saclay

NLP Meetup, Paris, November 28, 2018


Corpora in data-based machine translation

Corpora in Statistical Machine Translation: The Noisy Channel Model

Statistical MT translates French (f) into English (e) according to:

e* = argmax_e P(e | f; θ_TM)

θ_TM is trained using examples of "bio" (human-produced) translations ⇒ parallel data

f: Elle partit avec son père, le visage souriant ;
e: In the gayest and happiest spirits she set forward with her father ;

f: elle n'écoutait pas toujours, mais elle acquiesçait de confiance.
e: not always listening, but always agreeing to what he said ;

f: Ils arrivèrent.
e: They arrived.

f: – C'est Frank et Mlle Fairfax, dit aussitôt Mme Weston.
e: It is Frank and Miss Fairfax, said Mrs. Weston.

f: – J'allai justement vous faire part de l'agréable surprise que nous avons eue en le voyant arriver.
e: I was just going to tell you of our agreeable surprize in seeing him arrive this morning.

f: Il reste jusqu'à demain et Mlle Fairfax a bien voulu, sur notre demande, venir passer la journée.
e: He stays till tomorrow, and Miss Fairfax has been persuaded to spend the day with us.

From Emma, by J. Austen


Statistical MT actually translates French (f) into English (e) according to:

e* = argmax_e P(f | e; θ_TM) × P(e; θ_LM)

θ_TM trained using examples of existing translations ⇒ parallel data

θ_LM trained using examples of existing texts ⇒ monolingual data


[Diagram] The noisy channel pipeline:
  Parallel Corpus (French-English)  → statistical processing → translation model P(f|e)  (French as "broken English")
  Monolingual Corpus (English)      → statistical processing → language model P(e)       (English)

Decoding / Inference: e* = argmax_e P(f|e) P(e)

A search problem.


Beauty of the Noisy Channel Model

e* = argmax_e P(f | e; θ_TM) P(e; θ_LM)

Parallel corpora are costly, scarce, difficult to get, restricted to specific domains: counts in millions of sentences.

Monolingual corpora are "free", massive, easy to get, for all domains: counts in billions of sentences.

Corpora in Neural Machine Translation (NMT)

NMT translates French (f) into English (e) according to:

e* = argmax_e P(e | f; θ_NN)

θ_NN trained using translation examples ⇒ parallel data

Tonight's question: how to best leverage existing monolingual data?


NMT Primer in two slides

Recurrent Neural Networks

words:         w ∈ {0,1}^|V|   w_0, ..., w_t, w_{t+1}, ..., w_I
embeddings:    i ∈ R^d         i_0, ..., i_t, i_{t+1}, ..., i_I
hidden states: h ∈ R^p         h_0, ..., h_t, h_{t+1}, ..., h_I

i_t = W^i w_t
h_{t+1} = f(W^{ih} i_{t+1} + W^{rh} h_t + b^r)

Recurrent Neural Networks for Language Modeling

words:         w ∈ {0,1}^|V|   w_0, ..., w_t, w_{t+1}, ..., w_I
embeddings:    i ∈ R^d         i_0, ..., i_t, i_{t+1}, ..., i_I
hidden states: h ∈ R^p         h_0, ..., h_t, h_{t+1}, ..., h_I
output:        o ∈ R^|V|            ..., o_t, o_{t+1}, ..., o_I

P(w_{t+1} = k | w_{≤t}; θ_LM) = [softmax(o_t)]_k,  with  o_t = W^{ho} h_t + b^o
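To make the recurrence and the softmax above concrete, here is a minimal numpy sketch of one step of such an RNN language model; the toy dimensions and random weights are illustrative assumptions, not the setup used in the talk.

import numpy as np

# Minimal sketch of the RNN language model step on the slide.
V, d, p = 1000, 32, 64          # vocabulary size |V|, embedding dim, hidden dim
rng = np.random.default_rng(0)
W_i  = rng.normal(scale=0.1, size=(d, V))   # embedding matrix, i_t = W^i w_t
W_ih = rng.normal(scale=0.1, size=(p, d))   # input-to-hidden
W_rh = rng.normal(scale=0.1, size=(p, p))   # hidden-to-hidden (recurrent)
b_r  = np.zeros(p)
W_ho = rng.normal(scale=0.1, size=(V, p))   # hidden-to-output
b_o  = np.zeros(V)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def lm_step(h_prev, word_id):
    """One step: embed the current word, update h, return P(next word | history)."""
    w = np.zeros(V); w[word_id] = 1.0        # one-hot w_t
    i_t = W_i @ w                            # i_t = W^i w_t
    h_t = np.tanh(W_ih @ i_t + W_rh @ h_prev + b_r)   # recurrence f(W^{ih} i + W^{rh} h + b^r)
    o_t = W_ho @ h_t + b_o                   # output layer
    return h_t, softmax(o_t)                 # [softmax(o_t)]_k = P(w_{t+1} = k | ...)

h = np.zeros(p)
for wid in [3, 17, 42]:                      # a toy word-id sequence
    h, next_word_probs = lm_step(h, wid)
print(next_word_probs.shape)                 # (V,): a distribution over the vocabulary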

Recurrent Neural Networks for Machine Translation

source words:        f ∈ {0,1}^|V|   w_0, ..., w_t, w_{t+1}, ..., w_I
embeddings:          i ∈ R^d         i_0, ..., i_t, i_{t+1}, ..., i_I
encoder states:      h ∈ R^p         h_0, ..., h_t, h_{t+1}, ..., h_I
attention / context: c ∈ R^p         α_t, c_t  with  c_t = α_t^T h
decoder states:      s ∈ R^p         s_0, ..., s_t, s_{t+1}, ..., s_I
output:              o ∈ R^|V|            ..., o_t, o_{t+1}, ..., o_I

P(e_{t+1} = k | e_{≤t}, f; θ_NMT) = [softmax(o_t)]_k,  with  o_t = W^{so} s_t + b^o
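A small sketch of the attention step c_t = α_t^T h above, assuming a simple dot-product score between the decoder state and each encoder state; the slide does not fix the scoring function, so that part is an illustrative choice.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_context(encoder_states, decoder_state):
    """encoder_states: (I, p); decoder_state s_t: (p,) -> context vector c_t: (p,)."""
    scores = encoder_states @ decoder_state      # one score per source position (dot-product, assumed)
    alpha = softmax(scores)                      # attention weights alpha_t
    return alpha @ encoder_states                # c_t = sum_j alpha_{t,j} h_j

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 64))     # 7 source positions, p = 64
s_t = rng.normal(size=64)
c_t = attention_context(H, s_t)
print(c_t.shape)                 # (64,)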

To remember for now

NNLMs / NMTs predict one word at a time; inference ends with a softmax layer.

They have multiple subparts: encoder (embeddings + RNN), decoder (RNN + embeddings), attention layer.

Many architectural variants: GRUs / LSTMs, multiple layers, Transformers, CNNs.

Back to tonight's question: how to best leverage existing monolingual data?


Using Language Models

The old-timer way: combine NNLM and NMT

Ex post, aka shallow fusion:

P(e_{t+1} = k | e_{≤t}, f; θ_LM, θ_TM) = λ_1 P_TM(e_{t+1} = k | e_{≤t}, f; θ_TM) + P_LM(e_{t+1} = k | e_{≤t}; θ_LM)

Combines the output layers; train θ_LM and θ_TM separately [Gulcehre et al., 2017] or one after the other [Stahlberg et al., 2018].

Within decoder, aka deep fusion [Gulcehre et al., 2017, Burlot and Yvon, 2018]:

P(e_{t+1} = k | e_{≤t}, f; θ_LM, θ_TM) ∝ [W f(h_t^LM; s_t^TM; c_t; o_t)]_k

Combines the hidden layers h_t^LM and s_t^TM (+ a trained scaling factor σ on h_t^LM).
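As a rough illustration of shallow fusion, the sketch below interpolates the two output distributions at one decoding step. The renormalization and the weight value are illustrative assumptions; a log-linear variant (discussed in the findings below) is included for comparison.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def shallow_fusion_step(nmt_logits, lm_logits, lambda_1=0.8):
    """Combine output layers: lambda_1 * P_TM + P_LM, then renormalize (assumed)."""
    p_tm = softmax(nmt_logits)
    p_lm = softmax(lm_logits)
    combined = lambda_1 * p_tm + p_lm
    return combined / combined.sum()

def log_linear_fusion_step(nmt_logits, lm_logits, lambda_1=0.8, eps=1e-12):
    """Log-linear combination of the two distributions."""
    return softmax(lambda_1 * np.log(softmax(nmt_logits) + eps)
                   + np.log(softmax(lm_logits) + eps))

rng = np.random.default_rng(0)
p = shallow_fusion_step(rng.normal(size=100), rng.normal(size=100))
print(p.sum())   # 1.0: a proper distribution over the target vocabulary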


Findings on combining NNLM and NMT:

Log-linear shallow fusion is better than linear [Stahlberg et al., 2018].

Deep fusion is better than shallow fusion [Stahlberg et al., 2018].

No clear result for very large data.

"Back-translation" seems to be a much better recipe.


Back translation

The rich man's way: generate artificial parallel data

[Diagram] The back-translation pipeline:
  Parallel Corpus (French-English) → NN training → backwards MT engine (English→French)
  Monolingual Corpus (English) → backwards decoding, f* = argmax_f P(f | e) → Artificial Corpus (French-English)
  The artificial corpus then augments the training data of the forward model P(e | f).

A very, very, very old idea [Bertoldi and Federico, 2009, Bojar and Tamchyna, 2011].
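The pipeline above boils down to a simple data-generation loop. Here is a hedged sketch in which `backward_translate` is a hypothetical placeholder for whatever target→source engine (SMT or NMT) is available, not a specific library call.

from typing import Callable, Iterable, List, Tuple

def make_synthetic_corpus(
    mono_target: Iterable[str],
    backward_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn monolingual English sentences into (synthetic French, natural English) pairs."""
    corpus = []
    for e in mono_target:
        f_star = backward_translate(e)   # f* = argmax_f P(f | e) with the backward engine
        corpus.append((f_star, e))       # synthetic source, natural target
    return corpus

# The synthetic pairs are then mixed with the real parallel data, or used to
# fine-tune an existing model, when training the forward P(e | f) system.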


Design choices:
- back-translation engine (WBMT, NMMT, NMT)
- data selection and weighting
- training regime / data mix

Experimental validation: try approach X, evaluate MT quality.

Main findings to date:
- BT works very well [Sennrich et al., 2016] and many others
- BT quality matters; real data is even better [Burlot and Yvon, 2018]
- BT selection helps: choose monotonic sentences [Burlot and Yvon, 2018], difficult words/phrases [Fadaee and Monz, 2018]
- BT data are insufficiently diverse [Burlot and Yvon, 2018]; noising helps [Edunov et al., 2018]
- forward translation (FT) also helps [Crego and Senellart, 2016]
- the training regime matters [Poncelas et al., 2018]
- large-scale experiments yield large gains [Edunov et al., 2018]
- iterative BT: an effective [Lample et al., 2018] and sound [Cotterell and Kreutzer, 2018] idea


BT quality matters, real data is even better

Back-translation setup: 3 automatic BT systems
- backtrans-bad:  SMT (Moses) trained on 50k parallel sentences
- backtrans-good: SMT (Moses) trained on all WMT data
- backtrans-nmt:  backward NMT systems

                 French→English                      German→English
                 test-07  test-08  nt-14   unk       test-07  test-08  nt-14   unk
backtrans-bad    18.86    19.27    20.49   3.22%     14.66    14.62    15.07   1.45%
backtrans-good   29.71    29.51    32.10   0.24%     24.19    24.19    25.75   0.73%
backtrans-nmt    31.10    31.43    31.27   0.0%      26.02    26.03    26.98   0.0%

Fine-tuning: these systems are used to back-translate the target side of Europarl in order to fine-tune the baselines.

Assessing the effectiveness of BT

English→French
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            31.25  62.14  51.89     32.17  62.35  50.79     33.06  61.97  48.56
backtrans-bad       31.55  62.39  51.50     31.89  62.23  51.73     31.99  61.59  48.86
backtrans-good      32.99  63.43  49.58     33.25  63.08  49.29     33.52  62.62  47.23
backtrans-nmt       33.30  63.33  50.02     33.39  63.09  49.48     34.11  62.76  46.94
fwdtrans-nmt        31.93  62.55  50.84     32.62  62.66  49.83     33.56  62.44  47.65
backfwdtrans-nmt    33.09  63.19  50.08     33.70  63.25  48.83     34.00  62.76  47.22
natural             35.10  64.71  48.33     35.29  64.52  48.26     34.96  63.08  46.67

English→German
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            21.36  57.08  63.32     21.27  57.11  60.67     22.49  57.79  55.64
backtrans-bad       21.84  57.85  61.24     21.04  57.44  59.77     22.28  57.70  55.49
backtrans-good      23.33  59.03  58.84     23.11  57.14  57.14     22.87  58.09  54.91
backtrans-nmt       23.00  59.12  58.31     23.10  58.85  56.67     22.91  58.12  54.67
fwdtrans-nmt        21.97  57.46  61.99     21.89  57.53  59.71     22.52  57.93  55.13
backfwdtrans-nmt    22.99  58.37  60.45     22.82  58.14  58.80     23.04  58.17  54.96
natural             26.74  61.14  56.19     26.16  60.64  54.76     23.84  58.64  54.23

Takeaways:
⇒ Bad BT hardly helps.
⇒ BTs with PBMT and NMT are not so different.
⇒ Forward-translated source data can also help.
⇒ Human-translated sources are much better. Why this gap?


Properties of back-translated sentences (I)

[Figure: English→French and English→German]
⇒ Synthetic sources contain shorter sentences.

Properties of back-translated sentences (II)

[Figure: English→French and English→German]
⇒ Synthetic sources contain slightly simpler syntax.

Properties of back-translated sentences (III)

[Figure: English→French and English→German]
⇒ Synthetic sources use a smaller vocabulary.

Properties of back-translated sentences (IV)

Monotonic translations
Monotonicity measured by the average Kendall τ distance of source-target alignments [Birch and Osborne, 2010]:

                     en2fr                            en2de
                     natural    backtrans-nmt         natural    backtrans-nmt
Kendall τ distance   0.048      0.018                 0.068      0.053

Selected 10M words from natural data either randomly or according to Kendall τ distance, then fine-tuned on the result:

            test-07                 test-08                 newstest-14
            BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
random      32.08  62.98  50.78     32.66  62.86  49.99     23.05  55.38  58.51
monotonic   33.52  63.75  49.51     33.73  63.59  48.91     32.16  61.75  48.64

⇒ Monotonic BTs help NMT.
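For illustration, here is a minimal sketch of one way to compute a Kendall τ distance from a word alignment, counting the fraction of aligned pairs whose target order is swapped; the exact formulation used on the slide is that of [Birch and Osborne, 2010], which may differ in its details.

from itertools import combinations

def kendall_tau_distance(alignment):
    """alignment: list of (source_pos, target_pos) pairs, one per aligned word."""
    alignment = sorted(alignment)                 # order by source position
    targets = [t for _, t in alignment]
    pairs = list(combinations(range(len(targets)), 2))
    if not pairs:
        return 0.0
    discordant = sum(1 for i, j in pairs if targets[i] > targets[j])
    return discordant / len(pairs)                # 0 = fully monotonic, 1 = fully inverted

print(kendall_tau_distance([(0, 0), (1, 1), (2, 2)]))   # 0.0 -> monotonic translation
print(kendall_tau_distance([(0, 2), (1, 1), (2, 0)]))   # 1.0 -> fully reordered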


Pseudo-back translations

The poor man's way: simulate parallel data

BT assumes:
a. monolingual data
b. an MT engine translating "backwards" (from target to source)
c. (lots of) compute power

What can we do with fewer resources?

4 cheap ways to generate parallel data: stupid BT (a sketch follows below)

copy: recopy the target onto the source
  e (True English): How useful are fake translations?
  f (Fake French):  How useful are fake translations?

copy+mark: copies carry a language id
  f (Fake French):  @fr@How @fr@useful @fr@are @fr@fake @fr@translations?

copy+mark+noise: add noise (deletions, swaps, etc.)
  f (Fake French):  @fr@useful @fr@How @fr@fake @fr@are @fr@translations?

copy-dummy: replace everything with a dummy symbol
  f (Fake French):  dummy dummy dummy dummy dummy?
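A small sketch of the four pseudo-source generators above; the marker format, noise rates and whitespace tokenization are illustrative assumptions, not the exact recipe.

import random

def copy(target: str) -> str:
    """copy: the pseudo-source is the target itself."""
    return target

def copy_mark(target: str, lang: str = "fr") -> str:
    """copy+mark: every copied token carries a language id."""
    return " ".join(f"@{lang}@{tok}" for tok in target.split())

def copy_mark_noise(target: str, lang: str = "fr", p_drop: float = 0.1,
                    n_swaps: int = 1, seed: int = 0) -> str:
    """copy+mark+noise: marked copy with random deletions and adjacent swaps."""
    rng = random.Random(seed)
    toks = [f"@{lang}@{tok}" for tok in target.split()]
    toks = [t for t in toks if rng.random() > p_drop] or toks   # random deletions
    for _ in range(n_swaps):                                    # random adjacent swaps
        if len(toks) > 1:
            i = rng.randrange(len(toks) - 1)
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
    return " ".join(toks)

def copy_dummy(target: str) -> str:
    """copy-dummy: every token becomes a dummy symbol."""
    return " ".join("dummy" for _ in target.split())

e = "How useful are fake translations ?"
for fn in (copy, copy_mark, copy_mark_noise, copy_dummy):
    print(fn.__name__, "->", fn(e))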


Stupid Backtranslation not so stupid

English→French
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            31.25  62.14  51.89     32.17  62.35  50.79     33.06  61.97  48.56
copy-dummies        30.89  62.06  52.07     31.51  61.98  51.46     31.43  60.92  50.58
copy                31.65  62.45  52.09     32.23  62.37  52.20     32.80  61.99  49.05
copy+mark           32.01  62.66  51.57     32.31  62.52  51.46     32.33  61.55  49.44
copy+mark+noise     31.87  62.52  52.69     32.64  62.55  51.63     33.04  62.11  48.47

English→German
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            21.36  57.08  63.32     21.27  57.11  60.67     22.49  57.79  55.64
copy-dummies        21.73  57.84  61.35     21.38  57.38  60.10     21.12  56.81  57.21
copy                22.15  57.95  61.49     21.95  57.72  59.58     22.59  57.83  55.44
copy+mark           22.58  58.23  61.10     22.47  57.97  59.24     22.53  57.54  55.85
copy+mark+noise     22.92  58.62  60.27     22.83  58.36  58.48     22.34  57.47  55.72

Stupid BT is almost as good as smart BT (for German, where BT is bad).

Adversarial training

The smart poor man's way: simulate credible parallel data

Stupid BT is cheap and almost as good as using LMs: it mostly trains the decoder.

Can we do even better by also training the rest of the system?

Towards better fake sources

Fake sources in a GAN setup: copy-marked contains a fake source; let's make it look like a real source.

Two encoders: MT encoder E(x) and pseudo-source encoder G(x′). The discriminator D is optimized to distinguish the two kinds of sources:

J(D) = −(1/2) E_{x ∼ p_real} [log D(E(x))] − (1/2) E_{x′ ∼ p_pseudo} [log(1 − D(G(x′)))]

G is trained to fool the discriminator D:

J(G) = −E_{x′ ∼ p_pseudo} [log D(G(x′))]
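To make the two objectives concrete, here is a minimal numpy sketch computing J(D) and J(G) on already-encoded batches; the tiny logistic discriminator, the mean-pooling, and the shapes are assumptions made only for illustration, not the architecture used in the talk.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(states, w, b):
    """D(.): mean-pooled encoder states -> probability of being a real source (assumed form)."""
    return sigmoid(states.mean(axis=1) @ w + b)      # shape (batch,)

def j_discriminator(real_states, pseudo_states, w, b, eps=1e-8):
    """J(D) = -1/2 E[log D(E(x))] - 1/2 E[log(1 - D(G(x')))]."""
    d_real = discriminator(real_states, w, b)
    d_fake = discriminator(pseudo_states, w, b)
    return (-0.5 * np.mean(np.log(d_real + eps))
            - 0.5 * np.mean(np.log(1.0 - d_fake + eps)))

def j_generator(pseudo_states, w, b, eps=1e-8):
    """J(G) = -E[log D(G(x'))]: G tries to make D label pseudo-sources as real."""
    d_fake = discriminator(pseudo_states, w, b)
    return -np.mean(np.log(d_fake + eps))

rng = np.random.default_rng(0)
E_x  = rng.normal(size=(8, 20, 64))    # batch of real-source encodings E(x)
G_xp = rng.normal(size=(8, 20, 64))    # batch of pseudo-source encodings G(x')
w, b = rng.normal(size=64), 0.0
print(j_discriminator(E_x, G_xp, w, b), j_generator(G_xp, w, b))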


GANs: Results

English→French
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            31.25  62.14  51.89     32.17  62.35  50.79     33.06  61.97  48.56
copy-mark           32.01  62.66  51.57     32.31  62.52  51.46     32.33  61.55  49.44
+ GANs              31.95  62.55  52.87     32.24  62.47  52.16     32.86  61.90  48.97
copy-mark+noise     31.87  62.52  52.69     32.64  62.55  51.63     33.04  62.11  48.47
+ GANs              32.41  62.78  52.25     32.79  62.72  50.92     33.01  61.98  48.37
backtrans-nmt       33.30  63.33  50.02     33.39  63.09  49.48     34.11  62.76  46.94
+ GANs              32.91  63.08  51.17     33.24  62.93  50.82     33.77  62.42  47.80
natural             35.10  64.71  48.33     35.29  64.52  48.26     34.96  63.08  46.67

English→German
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            21.36  57.08  63.32     21.27  57.11  60.67     22.49  57.79  55.64
copy-mark           22.58  58.23  61.10     22.47  57.97  59.24     22.53  57.54  55.85
+ GANs              22.71  58.25  61.25     22.44  57.86  59.28     22.81  57.54  55.99
copy-mark+noise     22.92  58.62  60.27     22.83  58.36  58.48     22.34  57.47  55.72
+ GANs              23.01  58.66  60.22     22.53  58.16  58.65     22.64  57.70  55.48
backtrans-nmt       23.00  59.12  58.31     23.10  58.85  56.67     22.91  58.12  54.67
+ GANs              23.65  58.85  59.70     23.20  58.50  58.22     23.00  57.89  55.15
natural             26.74  61.14  56.19     26.16  60.64  54.76     23.84  58.64  54.23

GANs provide a small additional boost.

Conclusions

BT is a very efficient method to integrate monolingual data.

BT helps improve all the components of NMT simultaneously.

Artificial sources are lexically and syntactically simpler than natural sources; sampling brings diversity, and monotonicity is a facilitating factor.

The quality of BT matters for NMT, and BT is only worth its cost when high-quality back-translations can be generated.

GANs can help by making the pseudo-source sentences closer to natural ones.


Meta-Conclusion

NMT is improving; it yields useful translations in many language pairs.

NMT research is empirical and burns a lot of CPUs / GPUs.

Many conclusions are unstable: there are so many variables to control.

MT is not solved: many open avenues and problems remain (architectural, theoretical, data-based, and many more).


Thank you for your attention!

Franck Burlot, François Yvon

References

Nicola Bertoldi and Marcello Federico. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182-189, Athens, Greece, 2009. URL http://www.aclweb.org/anthology/W09-0432.

Alexandra Birch and Miles Osborne. LRscore for evaluating lexical and reordering quality in MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 327-332, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 978-1-932432-71-8. URL http://dl.acm.org/citation.cfm?id=1868850.1868899.

Ondrej Bojar and Aleš Tamchyna. Improving translation model by monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 330-336. Association for Computational Linguistics, 2011. URL http://dl.acm.org/citation.cfm?id=2132960.2133004.

Franck Burlot and François Yvon. Using monolingual data in neural machine translation: a systematic study. In Proceedings of the Third Conference on Machine Translation, pages 144-155, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-64015.

Ryan Cotterell and Julia Kreutzer. Explaining and generalizing back-translation through wake-sleep, 2018.

Josep Maria Crego and Jean Senellart. Neural machine translation from simplified translations. CoRR, abs/1612.06139, 2016. URL http://arxiv.org/abs/1612.06139.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489-500. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1045.

Marzieh Fadaee and Christof Monz. Back-translation sampling by targeting difficult words in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 436-446. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1040.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. On integrating a language model into neural machine translation. Comput. Speech Lang., 45(C):137-148, September 2017. ISSN 0885-2308. doi: 10.1016/j.csl.2017.01.014. URL https://doi.org/10.1016/j.csl.2017.01.014.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039-5049. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1549.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. Investigating backtranslation in neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, EAMT, Alicante, Spain, 28-30 May 2018.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1009.

Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model. In Proceedings of the Third Conference on Machine Translation, pages 204-211, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-64021.