Monolingual data in NMT

Franck Burlot & François Yvon

LIMSI, CNRS, Université Paris-Saclay

NLP Meetup, Paris, November 28, 2018


Corpora in data-based machine translation

Corpora in Statistical Machine Translation: The Noisy Channel Model

Statistical MT translates French (f) into English (e) according to:

e* = argmax_e P(e | f; θ_TM)

θ_TM is trained using examples of "bio" (human-produced) translations ⇒ parallel data

f: Elle partit avec son père, le visage souriant ;
e: In the gayest and happiest spirits she set forward with her father ;

f: elle n'écoutait pas toujours, mais elle acquiesçait de confiance.
e: not always listening, but always agreeing to what he said ;

f: Ils arrivèrent.
e: They arrived.

f: – C'est Frank et Mlle Fairfax, dit aussitôt Mme Weston.
e: It is Frank and Miss Fairfax, said Mrs. Weston.

f: – J'allai justement vous faire part de l'agréable surprise que nous avons eue en le voyant arriver.
e: I was just going to tell you of our agreeable surprize in seeing him arrive this morning.

f: Il reste jusqu'à demain et Mlle Fairfax a bien voulu, sur notre demande, venir passer la journée.
e: He stays till tomorrow, and Miss Fairfax has been persuaded to spend the day with us.

From Emma, by J. Austen


Statistical MT actually translates French (f) into English (e) according to:

e* = argmax_e P(f | e; θ_TM) × P(e; θ_LM)

θ_TM trained using examples of existing translations ⇒ parallel data

θ_LM trained using examples of existing texts ⇒ monolingual data


[Diagram] The noisy channel pipeline:
  Parallel Corpus (French-English)  → statistical processing → translation model P(f|e)  (French as "broken English")
  Monolingual Corpus (English)      → statistical processing → language model P(e)       (English)

Decoding / Inference: e* = argmax_e P(f|e) P(e)

A search problem.


Beauty of the Noisy Channel Model

e* = argmax_e P(f | e; θ_TM) P(e; θ_LM)

Parallel corpora are costly, scarce, difficult to get, restricted to specific domains: counts in millions of sentences.

Monolingual corpora are "free", massive, easy to get, for all domains: counts in billions of sentences.

Corpora in Neural Machine Translation (NMT)

NMT translates French (f) into English (e) according to:

e* = argmax_e P(e | f; θ_NN)

θ_NN trained using translation examples ⇒ parallel data

Tonight's question: how to best leverage existing monolingual data?


NMT Primer in two slides

Recurrent Neural Networks

words:         w ∈ {0,1}^|V|   w_0, ..., w_t, w_{t+1}, ..., w_I
embeddings:    i ∈ R^d         i_0, ..., i_t, i_{t+1}, ..., i_I
hidden states: h ∈ R^p         h_0, ..., h_t, h_{t+1}, ..., h_I

i_t = W^i w_t
h_{t+1} = f(W^{ih} i_{t+1} + W^{rh} h_t + b^r)

Recurrent Neural Networks for Language Modeling

words:         w ∈ {0,1}^|V|   w_0, ..., w_t, w_{t+1}, ..., w_I
embeddings:    i ∈ R^d         i_0, ..., i_t, i_{t+1}, ..., i_I
hidden states: h ∈ R^p         h_0, ..., h_t, h_{t+1}, ..., h_I
output:        o ∈ R^|V|            ..., o_t, o_{t+1}, ..., o_I

P(w_{t+1} = k | w_{≤t}; θ_LM) = [softmax(o_t)]_k,  with  o_t = W^{ho} h_t + b^o
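To make the recurrence and the softmax above concrete, here is a minimal numpy sketch of one step of such an RNN language model; the toy dimensions and random weights are illustrative assumptions, not the setup used in the talk.

import numpy as np

# Minimal sketch of the RNN language model step on the slide.
V, d, p = 1000, 32, 64          # vocabulary size |V|, embedding dim, hidden dim
rng = np.random.default_rng(0)
W_i  = rng.normal(scale=0.1, size=(d, V))   # embedding matrix, i_t = W^i w_t
W_ih = rng.normal(scale=0.1, size=(p, d))   # input-to-hidden
W_rh = rng.normal(scale=0.1, size=(p, p))   # hidden-to-hidden (recurrent)
b_r  = np.zeros(p)
W_ho = rng.normal(scale=0.1, size=(V, p))   # hidden-to-output
b_o  = np.zeros(V)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def lm_step(h_prev, word_id):
    """One step: embed the current word, update h, return P(next word | history)."""
    w = np.zeros(V); w[word_id] = 1.0        # one-hot w_t
    i_t = W_i @ w                            # i_t = W^i w_t
    h_t = np.tanh(W_ih @ i_t + W_rh @ h_prev + b_r)   # recurrence f(W^{ih} i + W^{rh} h + b^r)
    o_t = W_ho @ h_t + b_o                   # output layer
    return h_t, softmax(o_t)                 # [softmax(o_t)]_k = P(w_{t+1} = k | ...)

h = np.zeros(p)
for wid in [3, 17, 42]:                      # a toy word-id sequence
    h, next_word_probs = lm_step(h, wid)
print(next_word_probs.shape)                 # (V,): a distribution over the vocabulary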

Recurrent Neural Networks for Machine Translation

source words:        f ∈ {0,1}^|V|   w_0, ..., w_t, w_{t+1}, ..., w_I
embeddings:          i ∈ R^d         i_0, ..., i_t, i_{t+1}, ..., i_I
encoder states:      h ∈ R^p         h_0, ..., h_t, h_{t+1}, ..., h_I
attention / context: c ∈ R^p         α_t, c_t  with  c_t = α_t^T h
decoder states:      s ∈ R^p         s_0, ..., s_t, s_{t+1}, ..., s_I
output:              o ∈ R^|V|            ..., o_t, o_{t+1}, ..., o_I

P(e_{t+1} = k | e_{≤t}, f; θ_NMT) = [softmax(o_t)]_k,  with  o_t = W^{so} s_t + b^o
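A small sketch of the attention step c_t = α_t^T h above, assuming a simple dot-product score between the decoder state and each encoder state; the slide does not fix the scoring function, so that part is an illustrative choice.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_context(encoder_states, decoder_state):
    """encoder_states: (I, p); decoder_state s_t: (p,) -> context vector c_t: (p,)."""
    scores = encoder_states @ decoder_state      # one score per source position (dot-product, assumed)
    alpha = softmax(scores)                      # attention weights alpha_t
    return alpha @ encoder_states                # c_t = sum_j alpha_{t,j} h_j

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 64))     # 7 source positions, p = 64
s_t = rng.normal(size=64)
c_t = attention_context(H, s_t)
print(c_t.shape)                 # (64,)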

To remember for now

NNLMs / NMTs predict one word at a time; inference ends with a softmax layer.

They have multiple subparts: encoder (embeddings + RNN), decoder (RNN + embeddings), attention layer.

Many architectural variants: GRUs / LSTMs, multiple layers, Transformers, CNNs.

Back to tonight's question: how to best leverage existing monolingual data?


Using Language Models

The old-timer way: combine NNLM and NMT

Ex post, aka shallow fusion:

P(e_{t+1} = k | e_{≤t}, f; θ_LM, θ_TM) = λ_1 P_TM(e_{t+1} = k | e_{≤t}, f; θ_TM) + P_LM(e_{t+1} = k | e_{≤t}; θ_LM)

Combines the output layers; train θ_LM and θ_TM separately [Gulcehre et al., 2017] or one after the other [Stahlberg et al., 2018].

Within decoder, aka deep fusion [Gulcehre et al., 2017, Burlot and Yvon, 2018]:

P(e_{t+1} = k | e_{≤t}, f; θ_LM, θ_TM) ∝ [W f(h_t^LM; s_t^TM; c_t; o_t)]_k

Combines the hidden layers h_t^LM and s_t^TM (+ a trained scaling factor σ on h_t^LM).
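As a rough illustration of shallow fusion, the sketch below interpolates the two output distributions at one decoding step. The renormalization and the weight value are illustrative assumptions; a log-linear variant (discussed in the findings below) is included for comparison.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def shallow_fusion_step(nmt_logits, lm_logits, lambda_1=0.8):
    """Combine output layers: lambda_1 * P_TM + P_LM, then renormalize (assumed)."""
    p_tm = softmax(nmt_logits)
    p_lm = softmax(lm_logits)
    combined = lambda_1 * p_tm + p_lm
    return combined / combined.sum()

def log_linear_fusion_step(nmt_logits, lm_logits, lambda_1=0.8, eps=1e-12):
    """Log-linear combination of the two distributions."""
    return softmax(lambda_1 * np.log(softmax(nmt_logits) + eps)
                   + np.log(softmax(lm_logits) + eps))

rng = np.random.default_rng(0)
p = shallow_fusion_step(rng.normal(size=100), rng.normal(size=100))
print(p.sum())   # 1.0: a proper distribution over the target vocabulary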


Findings on combining NNLM and NMT:

Log-linear shallow fusion is better than linear [Stahlberg et al., 2018].

Deep fusion is better than shallow fusion [Stahlberg et al., 2018].

No clear result for very large data.

"Back-translation" seems to be a much better recipe.


Back translation

The rich man's way: generate artificial parallel data

[Diagram] The back-translation pipeline:
  Parallel Corpus (French-English) → NN training → backwards MT engine (English→French)
  Monolingual Corpus (English) → backwards decoding, f* = argmax_f P(f | e) → Artificial Corpus (French-English)
  The artificial corpus then augments the training data of the forward model P(e | f).

A very, very, very old idea [Bertoldi and Federico, 2009, Bojar and Tamchyna, 2011].
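The pipeline above boils down to a simple data-generation loop. Here is a hedged sketch in which `backward_translate` is a hypothetical placeholder for whatever target→source engine (SMT or NMT) is available, not a specific library call.

from typing import Callable, Iterable, List, Tuple

def make_synthetic_corpus(
    mono_target: Iterable[str],
    backward_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn monolingual English sentences into (synthetic French, natural English) pairs."""
    corpus = []
    for e in mono_target:
        f_star = backward_translate(e)   # f* = argmax_f P(f | e) with the backward engine
        corpus.append((f_star, e))       # synthetic source, natural target
    return corpus

# The synthetic pairs are then mixed with the real parallel data, or used to
# fine-tune an existing model, when training the forward P(e | f) system.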


Design choices:
- back-translation engine (WBMT, NMMT, NMT)
- data selection and weighting
- training regime / data mix

Experimental validation: try approach X, evaluate MT quality.

Main findings to date:
- BT works very well [Sennrich et al., 2016] and many others
- BT quality matters; real data is even better [Burlot and Yvon, 2018]
- BT selection helps: choose monotonic sentences [Burlot and Yvon, 2018], difficult words/phrases [Fadaee and Monz, 2018]
- BT data are insufficiently diverse [Burlot and Yvon, 2018]; noising helps [Edunov et al., 2018]
- forward translation (FT) also helps [Crego and Senellart, 2016]
- the training regime matters [Poncelas et al., 2018]
- large-scale experiments yield large gains [Edunov et al., 2018]
- iterative BT: an effective [Lample et al., 2018] and sound [Cotterell and Kreutzer, 2018] idea


BT quality matters, real data is even better

Back-translation setup: 3 automatic BT systems
- backtrans-bad:  SMT (Moses) trained on 50k parallel sentences
- backtrans-good: SMT (Moses) trained on all WMT data
- backtrans-nmt:  backward NMT systems

                 French→English                      German→English
                 test-07  test-08  nt-14   unk       test-07  test-08  nt-14   unk
backtrans-bad    18.86    19.27    20.49   3.22%     14.66    14.62    15.07   1.45%
backtrans-good   29.71    29.51    32.10   0.24%     24.19    24.19    25.75   0.73%
backtrans-nmt    31.10    31.43    31.27   0.0%      26.02    26.03    26.98   0.0%

Fine-tuning: these systems are used to back-translate the target side of Europarl in order to fine-tune the baselines.

Assessing the effectiveness of BT

English→French
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            31.25  62.14  51.89     32.17  62.35  50.79     33.06  61.97  48.56
backtrans-bad       31.55  62.39  51.50     31.89  62.23  51.73     31.99  61.59  48.86
backtrans-good      32.99  63.43  49.58     33.25  63.08  49.29     33.52  62.62  47.23
backtrans-nmt       33.30  63.33  50.02     33.39  63.09  49.48     34.11  62.76  46.94
fwdtrans-nmt        31.93  62.55  50.84     32.62  62.66  49.83     33.56  62.44  47.65
backfwdtrans-nmt    33.09  63.19  50.08     33.70  63.25  48.83     34.00  62.76  47.22
natural             35.10  64.71  48.33     35.29  64.52  48.26     34.96  63.08  46.67

English→German
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            21.36  57.08  63.32     21.27  57.11  60.67     22.49  57.79  55.64
backtrans-bad       21.84  57.85  61.24     21.04  57.44  59.77     22.28  57.70  55.49
backtrans-good      23.33  59.03  58.84     23.11  57.14  57.14     22.87  58.09  54.91
backtrans-nmt       23.00  59.12  58.31     23.10  58.85  56.67     22.91  58.12  54.67
fwdtrans-nmt        21.97  57.46  61.99     21.89  57.53  59.71     22.52  57.93  55.13
backfwdtrans-nmt    22.99  58.37  60.45     22.82  58.14  58.80     23.04  58.17  54.96
natural             26.74  61.14  56.19     26.16  60.64  54.76     23.84  58.64  54.23

Takeaways:
⇒ Bad BT hardly helps.
⇒ BTs with PBMT and NMT are not so different.
⇒ Forward-translated source data can also help.
⇒ Human-translated sources are much better. Why this gap?


Properties of back-translated sentences (I)

[Figure: English→French and English→German]
⇒ Synthetic sources contain shorter sentences.

Properties of back-translated sentences (II)

[Figure: English→French and English→German]
⇒ Synthetic sources contain slightly simpler syntax.

Properties of back-translated sentences (III)

[Figure: English→French and English→German]
⇒ Synthetic sources use a smaller vocabulary.

Properties of back-translated sentences (IV)

Monotonic translations
Monotonicity measured by the average Kendall τ distance of source-target alignments [Birch and Osborne, 2010]:

                     en2fr                            en2de
                     natural    backtrans-nmt         natural    backtrans-nmt
Kendall τ distance   0.048      0.018                 0.068      0.053

Selected 10M words from natural data either randomly or according to Kendall τ distance, then fine-tuned on the result:

            test-07                 test-08                 newstest-14
            BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
random      32.08  62.98  50.78     32.66  62.86  49.99     23.05  55.38  58.51
monotonic   33.52  63.75  49.51     33.73  63.59  48.91     32.16  61.75  48.64

⇒ Monotonic BTs help NMT.
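For illustration, here is a minimal sketch of one way to compute a Kendall τ distance from a word alignment, counting the fraction of aligned pairs whose target order is swapped; the exact formulation used on the slide is that of [Birch and Osborne, 2010], which may differ in its details.

from itertools import combinations

def kendall_tau_distance(alignment):
    """alignment: list of (source_pos, target_pos) pairs, one per aligned word."""
    alignment = sorted(alignment)                 # order by source position
    targets = [t for _, t in alignment]
    pairs = list(combinations(range(len(targets)), 2))
    if not pairs:
        return 0.0
    discordant = sum(1 for i, j in pairs if targets[i] > targets[j])
    return discordant / len(pairs)                # 0 = fully monotonic, 1 = fully inverted

print(kendall_tau_distance([(0, 0), (1, 1), (2, 2)]))   # 0.0 -> monotonic translation
print(kendall_tau_distance([(0, 2), (1, 1), (2, 0)]))   # 1.0 -> fully reordered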


Pseudo-back translations

The poor man's way: simulate parallel data

BT assumes:
a. monolingual data
b. an MT engine translating "backwards" (from target to source)
c. (lots of) compute power

What can we do with fewer resources?

4 cheap ways to generate parallel data: stupid BT (a sketch follows below)

copy: recopy the target onto the source
  e (True English): How useful are fake translations?
  f (Fake French):  How useful are fake translations?

copy+mark: copies carry a language id
  f (Fake French):  @fr@How @fr@useful @fr@are @fr@fake @fr@translations?

copy+mark+noise: add noise (deletions, swaps, etc.)
  f (Fake French):  @fr@useful @fr@How @fr@fake @fr@are @fr@translations?

copy-dummy: replace everything with a dummy symbol
  f (Fake French):  dummy dummy dummy dummy dummy?
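A small sketch of the four pseudo-source generators above; the marker format, noise rates and whitespace tokenization are illustrative assumptions, not the exact recipe.

import random

def copy(target: str) -> str:
    """copy: the pseudo-source is the target itself."""
    return target

def copy_mark(target: str, lang: str = "fr") -> str:
    """copy+mark: every copied token carries a language id."""
    return " ".join(f"@{lang}@{tok}" for tok in target.split())

def copy_mark_noise(target: str, lang: str = "fr", p_drop: float = 0.1,
                    n_swaps: int = 1, seed: int = 0) -> str:
    """copy+mark+noise: marked copy with random deletions and adjacent swaps."""
    rng = random.Random(seed)
    toks = [f"@{lang}@{tok}" for tok in target.split()]
    toks = [t for t in toks if rng.random() > p_drop] or toks   # random deletions
    for _ in range(n_swaps):                                    # random adjacent swaps
        if len(toks) > 1:
            i = rng.randrange(len(toks) - 1)
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
    return " ".join(toks)

def copy_dummy(target: str) -> str:
    """copy-dummy: every token becomes a dummy symbol."""
    return " ".join("dummy" for _ in target.split())

e = "How useful are fake translations ?"
for fn in (copy, copy_mark, copy_mark_noise, copy_dummy):
    print(fn.__name__, "->", fn(e))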


Stupid Backtranslation not so stupid

English→French
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            31.25  62.14  51.89     32.17  62.35  50.79     33.06  61.97  48.56
copy-dummies        30.89  62.06  52.07     31.51  61.98  51.46     31.43  60.92  50.58
copy                31.65  62.45  52.09     32.23  62.37  52.20     32.80  61.99  49.05
copy+mark           32.01  62.66  51.57     32.31  62.52  51.46     32.33  61.55  49.44
copy+mark+noise     31.87  62.52  52.69     32.64  62.55  51.63     33.04  62.11  48.47

English→German
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            21.36  57.08  63.32     21.27  57.11  60.67     22.49  57.79  55.64
copy-dummies        21.73  57.84  61.35     21.38  57.38  60.10     21.12  56.81  57.21
copy                22.15  57.95  61.49     21.95  57.72  59.58     22.59  57.83  55.44
copy+mark           22.58  58.23  61.10     22.47  57.97  59.24     22.53  57.54  55.85
copy+mark+noise     22.92  58.62  60.27     22.83  58.36  58.48     22.34  57.47  55.72

Stupid BT is almost as good as smart BT (for German, where BT is bad).

Adversarial training

The smart poor man's way: simulate credible parallel data

Stupid BT is cheap and almost as good as using LMs: it mostly trains the decoder.

Can we do even better by also training the rest of the system?

Towards better fake sources

Fake sources in a GAN setup: copy-marked contains a fake source; let's make it look like a real source.

Two encoders: MT encoder E(x) and pseudo-source encoder G(x′). The discriminator D is optimized to distinguish the two kinds of sources:

J(D) = −(1/2) E_{x ∼ p_real} [log D(E(x))] − (1/2) E_{x′ ∼ p_pseudo} [log(1 − D(G(x′)))]

G is trained to fool the discriminator D:

J(G) = −E_{x′ ∼ p_pseudo} [log D(G(x′))]
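To make the two objectives concrete, here is a minimal numpy sketch computing J(D) and J(G) on already-encoded batches; the tiny logistic discriminator, the mean-pooling, and the shapes are assumptions made only for illustration, not the architecture used in the talk.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(states, w, b):
    """D(.): mean-pooled encoder states -> probability of being a real source (assumed form)."""
    return sigmoid(states.mean(axis=1) @ w + b)      # shape (batch,)

def j_discriminator(real_states, pseudo_states, w, b, eps=1e-8):
    """J(D) = -1/2 E[log D(E(x))] - 1/2 E[log(1 - D(G(x')))]."""
    d_real = discriminator(real_states, w, b)
    d_fake = discriminator(pseudo_states, w, b)
    return (-0.5 * np.mean(np.log(d_real + eps))
            - 0.5 * np.mean(np.log(1.0 - d_fake + eps)))

def j_generator(pseudo_states, w, b, eps=1e-8):
    """J(G) = -E[log D(G(x'))]: G tries to make D label pseudo-sources as real."""
    d_fake = discriminator(pseudo_states, w, b)
    return -np.mean(np.log(d_fake + eps))

rng = np.random.default_rng(0)
E_x  = rng.normal(size=(8, 20, 64))    # batch of real-source encodings E(x)
G_xp = rng.normal(size=(8, 20, 64))    # batch of pseudo-source encodings G(x')
w, b = rng.normal(size=64), 0.0
print(j_discriminator(E_x, G_xp, w, b), j_generator(G_xp, w, b))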


GANs: Results

English→French
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            31.25  62.14  51.89     32.17  62.35  50.79     33.06  61.97  48.56
copy-mark           32.01  62.66  51.57     32.31  62.52  51.46     32.33  61.55  49.44
+ GANs              31.95  62.55  52.87     32.24  62.47  52.16     32.86  61.90  48.97
copy-mark+noise     31.87  62.52  52.69     32.64  62.55  51.63     33.04  62.11  48.47
+ GANs              32.41  62.78  52.25     32.79  62.72  50.92     33.01  61.98  48.37
backtrans-nmt       33.30  63.33  50.02     33.39  63.09  49.48     34.11  62.76  46.94
+ GANs              32.91  63.08  51.17     33.24  62.93  50.82     33.77  62.42  47.80
natural             35.10  64.71  48.33     35.29  64.52  48.26     34.96  63.08  46.67

English→German
                    test-07                 test-08                 newstest-14
                    BLEU   BEER   CTER      BLEU   BEER   CTER      BLEU   BEER   CTER
Baseline            21.36  57.08  63.32     21.27  57.11  60.67     22.49  57.79  55.64
copy-mark           22.58  58.23  61.10     22.47  57.97  59.24     22.53  57.54  55.85
+ GANs              22.71  58.25  61.25     22.44  57.86  59.28     22.81  57.54  55.99
copy-mark+noise     22.92  58.62  60.27     22.83  58.36  58.48     22.34  57.47  55.72
+ GANs              23.01  58.66  60.22     22.53  58.16  58.65     22.64  57.70  55.48
backtrans-nmt       23.00  59.12  58.31     23.10  58.85  56.67     22.91  58.12  54.67
+ GANs              23.65  58.85  59.70     23.20  58.50  58.22     23.00  57.89  55.15
natural             26.74  61.14  56.19     26.16  60.64  54.76     23.84  58.64  54.23

GANs provide a small additional boost.

Conclusions

BT is a very efficient method to integrate monolingual data.

BT helps improve all the components of NMT simultaneously.

Artificial sources are lexically and syntactically simpler than natural sources; sampling brings diversity, and monotonicity is a facilitating factor.

The quality of BT matters for NMT, and BT is only worth its cost when high-quality back-translations can be generated.

GANs can help by making the pseudo-source sentences closer to natural ones.


Meta-Conclusion

NMT is improving; it yields useful translations in many language pairs.

NMT research is empirical and burns a lot of CPUs / GPUs.

Many conclusions are unstable: there are so many variables to control.

MT is not solved: many open avenues and problems remain (architectural, theoretical, data-based, and many more).


Thank you for your attention!

Franck Burlot, François Yvon

References

Nicola Bertoldi and Marcello Federico. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182-189, Athens, Greece, 2009. URL http://www.aclweb.org/anthology/W09-0432.

Alexandra Birch and Miles Osborne. LRscore for evaluating lexical and reordering quality in MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 327-332, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 978-1-932432-71-8. URL http://dl.acm.org/citation.cfm?id=1868850.1868899.

Ondrej Bojar and Aleš Tamchyna. Improving translation model by monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 330-336. Association for Computational Linguistics, 2011. URL http://dl.acm.org/citation.cfm?id=2132960.2133004.

Franck Burlot and François Yvon. Using monolingual data in neural machine translation: a systematic study. In Proceedings of the Third Conference on Machine Translation, pages 144-155, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-64015.

Ryan Cotterell and Julia Kreutzer. Explaining and generalizing back-translation through wake-sleep, 2018.

Josep Maria Crego and Jean Senellart. Neural machine translation from simplified translations. CoRR, abs/1612.06139, 2016. URL http://arxiv.org/abs/1612.06139.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489-500. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1045.

Marzieh Fadaee and Christof Monz. Back-translation sampling by targeting difficult words in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 436-446. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1040.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. On integrating a language model into neural machine translation. Comput. Speech Lang., 45(C):137-148, September 2017. ISSN 0885-2308. doi: 10.1016/j.csl.2017.01.014. URL https://doi.org/10.1016/j.csl.2017.01.014.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039-5049. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1549.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. Investigating backtranslation in neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, EAMT, Alicante, Spain, 28-30 May 2018.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1009.

Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model. In Proceedings of the Third Conference on Machine Translation, pages 204-211, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-64021.