Dual Learning for Machine Translation (NIPS 2016)


Page 1: Dual Learning for Machine Translation (NIPS 2016)

“Dual Learning for Machine Translation”, Di He et al.

January 2017, Toru Fujino

The University of Tokyo, Graduate School of Frontier Sciences, Department of Human and Engineered Environmental Studies, Chen Lab, first-year doctoral student

Page 2: Dual Learning for Machine Translation (NIPS 2016)

Paper information

• Authors: Di He et al. (Microsoft Research Asia)
• Conference: NIPS 2016
• Date: November 1, 2016 (arXiv)
• Times cited: 1

Page 3: Dual Learning for Machine Translation (NIPS 2016)

Overview

• What
  • Introduce an autoencoder-like mechanism, “dual learning”, to utilize monolingual datasets 1)
• Results
  • Dual learning with 10% of the bilingual data ≈ baseline model with 100% of the data 1)

1) “Dual Learning: A New Learning Paradigm”, https://www.youtube.com/watch?v=HzokNo3g63E&feature=youtu.be

Page 4: Dual Learning for Machine Translation (NIPS 2016)

Neural machine translation

• Learn the conditional probability $P(y \mid x; \Theta)$ from an input $x = \{x_1, x_2, \dots, x_{T_x}\}$ to an output $y = \{y_1, y_2, \dots, y_{T_y}\}$

• Maximize the log probability

$$\Theta^* = \arg\max_{\Theta} \sum_{(x, y) \in D} \sum_{t=1}^{T_y} \log P(y_t \mid y_{<t}, x; \Theta)$$
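As a rough sketch of what this objective computes for a single sentence pair (illustrative only, not the authors' code; the per-step distributions stand in for a trained encoder-decoder):

```python
import math

def sentence_log_prob(step_distributions, target_ids):
    """Sum of log P(y_t | y_<t, x) over the target sentence.

    step_distributions: one dict per target position, mapping
        token id -> P(token | y_<t, x), as produced by a decoder.
    target_ids: the reference target sentence as token ids.
    """
    return sum(math.log(dist[tok])
               for dist, tok in zip(step_distributions, target_ids))

# Toy example: a two-token target sentence.
dists = [{7: 0.6, 3: 0.4}, {5: 0.9, 2: 0.1}]
print(sentence_log_prob(dists, [7, 5]))  # log 0.6 + log 0.9 ~ -0.616
```

Training maximizes this quantity summed over all sentence pairs in the bilingual corpus $D$.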

Page 5: Dual Learning for Machine Translation (NIPS 2016)

Difficulty in getting large bilingual data

• Solution: utilization of monolingual data
  • Train a language model of the target language, and then integrate it with the MT model 1)2)
    <- does not fundamentally address the shortage of parallel data
  • Generate pseudo-bilingual data from monolingual data 3)4)
    <- no guarantee on the quality of the pseudo-bilingual data

1) T. Brants et al., “Large language models in machine translation”, EMNLP 2007
2) C. Gulcehre et al., “On using monolingual corpora in neural machine translation”, arXiv 2015
3) R. Sennrich et al., “Improving neural machine translation models with monolingual data”, ACL 2016
4) N. Ueffing et al., “Semi-supervised model adaptation for statistical machine translation”, Machine Translation Journal 2008

Page 6: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

• Use monolingual datasets to train translation models through dual learning

• Things required:
  • $D_A$: corpus of language A
  • $D_B$: corpus of language B (not necessarily aligned with $D_A$)
  • $P(\cdot \mid s; \Theta_{AB})$: translation model from A to B
  • $P(\cdot \mid s; \Theta_{BA})$: translation model from B to A
  • $LM_A(\cdot)$: learned language model of A
  • $LM_B(\cdot)$: learned language model of B
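To make the ingredient list concrete, one possible interface sketch (all type and field names here are mine, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, List

Sentence = List[str]

@dataclass
class DualLearningIngredients:
    """The six ingredients listed above; names are illustrative."""
    corpus_A: List[Sentence]                                 # D_A (monolingual)
    corpus_B: List[Sentence]                                 # D_B (monolingual)
    translate_AB: Callable[[Sentence, int], List[Sentence]]  # top-K beam search, A -> B
    translate_BA: Callable[[Sentence, int], List[Sentence]]  # top-K beam search, B -> A
    lm_score_A: Callable[[Sentence], float]                  # LM_A(.), log-probability
    lm_score_B: Callable[[Sentence], float]                  # LM_B(.), log-probability
```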

Page 7: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

1. For a sentence $s$ from $D_A$, generate $K$ translated sentences $s_{mid,1}, s_{mid,2}, \dots, s_{mid,K}$ from $P(\cdot \mid s; \Theta_{AB})$ based on beam search

Page 8: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

1. For a sentence $s$ from $D_A$, generate $K$ translated sentences $s_{mid,1}, s_{mid,2}, \dots, s_{mid,K}$ from $P(\cdot \mid s; \Theta_{AB})$ based on beam search
2. Compute intermediate rewards $r_{1,1}, r_{1,2}, \dots, r_{1,K}$ from $LM_B(s_{mid,k})$ for each sentence as

$$r_{1,k} = LM_B(s_{mid,k})$$
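A toy illustration of this reward (the unigram table below is a made-up stand-in for a trained language model $LM_B$, and the candidate list for the beam-search output):

```python
import math

# Made-up unigram "language model" of language B.
unigram = {"the": 0.05, "cat": 0.01, "sat": 0.008, "<unk>": 1e-6}

def lm_score_B(sentence):
    """Stand-in for LM_B(.): log-probability of the sentence."""
    return sum(math.log(unigram.get(w, unigram["<unk>"])) for w in sentence)

candidates = [["the", "cat", "sat"], ["the", "cat", "moo"]]
r1 = [lm_score_B(s_mid) for s_mid in candidates]  # r_{1,k} = LM_B(s_mid,k)
print(r1)  # the fluent candidate scores higher (less negative)
```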

Page 9: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

3. Get communication rewards $r_{2,1}, r_{2,2}, \dots, r_{2,K}$ for each sentence as

$$r_{2,k} = \ln P(s \mid s_{mid,k}; \Theta_{BA})$$

Page 10: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

3. Get communication rewards $r_{2,1}, r_{2,2}, \dots, r_{2,K}$ for each sentence as

$$r_{2,k} = \ln P(s \mid s_{mid,k}; \Theta_{BA})$$

4. Set the total reward of the $k$-th sentence as

$$r_k = \alpha r_{1,k} + (1 - \alpha) r_{2,k}$$
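For concreteness, a toy numeric version of step 4 ($\alpha$ is a trade-off hyperparameter; the value below is arbitrary):

```python
alpha = 0.5  # trade-off hyperparameter (illustrative value)
r1_k = -8.2  # LM_B(s_mid,k): fluency of the translated sentence
r2_k = -5.1  # ln P(s | s_mid,k; Theta_BA): how well s can be reconstructed
r_k = alpha * r1_k + (1 - alpha) * r2_k
print(r_k)   # ~ -6.65
```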

Page 11: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

5. Compute the stochastic gradients with respect to $\Theta_{AB}$ and $\Theta_{BA}$:

$$\nabla_{\Theta_{AB}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ r_k \nabla_{\Theta_{AB}} \ln P(s_{mid,k} \mid s; \Theta_{AB}) \right]$$

$$\nabla_{\Theta_{BA}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - \alpha) \nabla_{\Theta_{BA}} \ln P(s \mid s_{mid,k}; \Theta_{BA}) \right]$$

(Only the communication reward $r_{2,k}$ depends on $\Theta_{BA}$, which is why the second gradient carries the bare factor $(1 - \alpha)$.)

Page 12: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

5. Compute the stochastic gradients with respect to $\Theta_{AB}$ and $\Theta_{BA}$:

$$\nabla_{\Theta_{AB}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ r_k \nabla_{\Theta_{AB}} \ln P(s_{mid,k} \mid s; \Theta_{AB}) \right]$$

$$\nabla_{\Theta_{BA}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - \alpha) \nabla_{\Theta_{BA}} \ln P(s \mid s_{mid,k}; \Theta_{BA}) \right]$$

6. Update the model parameters:

$$\Theta_{AB} \leftarrow \Theta_{AB} + \gamma_1 \nabla_{\Theta_{AB}} E[r]$$

$$\Theta_{BA} \leftarrow \Theta_{BA} + \gamma_2 \nabla_{\Theta_{BA}} E[r]$$
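Putting steps 1-6 together, a minimal self-contained sketch of one play of the game for a sentence from $D_A$ (a sketch under assumed interfaces, not the authors' code: the models, language model, and gradient computations are abstracted as callables so the REINFORCE-style update logic stays visible):

```python
def dual_learning_step(s, translate_AB, lm_score_B, log_p_BA,
                       grad_log_p_AB, grad_log_p_BA,
                       theta_AB, theta_BA,
                       K=2, alpha=0.5, gamma1=0.01, gamma2=0.01):
    """One dual-learning update for a sentence s from D_A.

    Assumed stand-ins for the real NMT/LM components:
      translate_AB(s, K)      -> K beam-search candidates s_mid,k
      lm_score_B(s_mid)       -> LM_B(s_mid)
      log_p_BA(s, s_mid)      -> ln P(s | s_mid; Theta_BA)
      grad_log_p_AB(s_mid, s) -> gradient of ln P(s_mid | s; Theta_AB)
      grad_log_p_BA(s, s_mid) -> gradient of ln P(s | s_mid; Theta_BA)
    """
    candidates = translate_AB(s, K)                             # step 1
    r1 = [lm_score_B(m) for m in candidates]                    # step 2
    r2 = [log_p_BA(s, m) for m in candidates]                   # step 3
    r = [alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]   # step 4

    # Step 5: stochastic gradients of the expected reward over the K samples.
    g_AB = sum(rk * grad_log_p_AB(m, s) for rk, m in zip(r, candidates)) / K
    g_BA = sum((1 - alpha) * grad_log_p_BA(s, m) for m in candidates) / K

    # Step 6: gradient ascent with learning rates gamma1, gamma2.
    return theta_AB + gamma1 * g_AB, theta_BA + gamma2 * g_BA

# Toy usage with scalar "parameters" and dummy components:
theta_AB, theta_BA = dual_learning_step(
    s=["le", "chat"],
    translate_AB=lambda s, K: [["the", "cat"], ["cat", "the"]],
    lm_score_B=lambda m: -1.0 if m[0] == "the" else -2.0,
    log_p_BA=lambda s, m: -0.5,
    grad_log_p_AB=lambda m, s: 0.1,
    grad_log_p_BA=lambda s, m: 0.2,
    theta_AB=0.0, theta_BA=0.0)
```

The paper plays the game symmetrically as well, sampling sentences from $D_B$ and running the loop B -> A -> B, so both models receive both kinds of gradient.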

Page 13: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

Page 14: Dual Learning for Machine Translation (NIPS 2016)

Experiment settings

• Baseline models
  • Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”
  • Sennrich et al., “Improving Neural Machine Translation Models with Monolingual Data”

Page 15: Dual Learning for Machine Translation (NIPS 2016)

Dataset

• WMTʼ14
  • 12M sentence pairs
  • English -> French, French -> English
• Data usage (for dual learning)
  • Small
    1. Train translation models with 10% of the bilingual data.
    2. Train translation models with 10% of the bilingual data and monolingual data through the dual learning algorithm.
    3. Train translation models only with monolingual data through the dual learning algorithm.
  • Large
    1. Train translation models with 100% of the bilingual data.
    2. Train translation models with 100% of the bilingual data and monolingual data through the dual learning algorithm.
    3. Train translation models only with monolingual data through the dual learning algorithm.

Page 16: Dual Learning for Machine Translation (NIPS 2016)

Evaluation

• BLEU: geometric mean of n-gram precisions (multiplied by a brevity penalty that punishes overly short outputs)
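A minimal single-sentence sketch of that definition (illustrative only, not the evaluation script used in the paper; real BLEU is computed at corpus level, usually with smoothing):

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Toy BLEU: geometric mean of 1..max_n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(matched / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # real implementations smooth instead of returning 0
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the cat sat on the mat".split(),
                    "the cat sat on a mat".split()))  # ~0.54
```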

Page 17: Dual Learning for Machine Translation (NIPS 2016)

Results

• Outperforms the baseline models
• In Fr->En, dual learning with 10% of the data ≈ baseline models with 100% of the data
• Dual learning is especially effective when the bilingual dataset is small

Page 18: Dual Learning for Machine Translation (NIPS 2016)

Results

• For different source sentence lengths
  • The improvement is significant for long sentences.

Page 19: Dual Learning for Machine Translation (NIPS 2016)

Results

• Reconstruction performance (BLEU)
  • Huge improvement over the baseline models, especially in En->Fr->En (Small)

Page 20: Dual Learning for Machine Translation (NIPS 2016)

Results

• Reconstruction examples

Page 21: Dual Learning for Machine Translation (NIPS 2016)

Future extensions & remarks

• Application in other domains (see the table below)
• Generalization of dual learning
  • Dual -> Triple -> ... -> n-loop
• Learn from scratch
  • only with monolingual data
  • maybe plus a lexical dictionary

Application         | Primal task        | Dual task
--------------------|--------------------|--------------------------
Speech processing   | Speech recognition | Text to speech
Image understanding | Image captioning   | Image generation
Conversation engine | Question           | Response
Search engine       | Search             | Query/keyword suggestion

Page 22: Dual Learning for Machine Translation (NIPS 2016)

Summary

• What
  • Introduce the “dual learning algorithm” to utilize monolingual data
• Results
  • With 100% of the data, the model outperforms the baseline models
  • With 10% of the data, the model achieves results comparable to baseline models trained on 100% of the data
• Future
  • The dual learning mechanism can be applied to other domains
  • Learning from scratch

Page 23: Dual Learning for Machine Translation (NIPS 2016)

Some notes

• Does dual learning not learn word-to-word correspondences?
• Is training on bilingual data a must?
• Or would a lexical dictionary suffice?

Page 24: Dual Learning for Machine Translation (NIPS 2016)

Appendix: Stochastic gradient of models