Dual Learning for Machine Translation (NIPS 2016)


Page 1: Dual Learning for Machine Translation (NIPS 2016)

“Dual Learning for Machine Translation”, Di He et al.

January 2017, Toru Fujino

The University of Tokyo, Graduate School of Frontier Sciences, Department of Human and Engineered Environmental Studies, Chen Lab, first-year doctoral student

Page 2: Dual Learning for Machine Translation (NIPS 2016)

Paper information

• Authors: Di He et al. (Microsoft Research Asia)
• Conference: NIPS 2016
• Date: November 1, 2016 (arXiv)
• Times cited: 1

Page 3: Dual Learning for Machine Translation (NIPS 2016)

Overview

• What
  • Introduce an autoencoder-like mechanism, “dual learning”, to utilize monolingual datasets 1)
• Results
  • Dual learning with 10% of the bilingual data ≈ baseline model with 100% of the data 1)

1) “Dual Learning: A New Learning Paradigm”, https://www.youtube.com/watch?v=HzokNo3g63E&feature=youtu.be

Page 4: Dual Learning for Machine Translation (NIPS 2016)

Neural machine translation

• Learn the conditional probability $P(y \mid x; \Theta)$ from an input $x = \{x_1, x_2, \dots, x_{T_x}\}$ to an output $y = \{y_1, y_2, \dots, y_{T_y}\}$

• Maximize the log probability

$$\Theta^* = \arg\max_{\Theta} \sum_{(x, y) \in D} \sum_{t=1}^{T_y} \log P(y_t \mid y_{<t}, x; \Theta)$$
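As a rough sketch of what this objective computes for a single sentence pair (illustrative only, not the authors' code; the per-step distributions stand in for a trained encoder-decoder):

```python
import math

def sentence_log_prob(step_distributions, target_ids):
    """Sum of log P(y_t | y_<t, x) over the target sentence.

    step_distributions: one dict per target position, mapping
        token id -> P(token | y_<t, x), as produced by a decoder.
    target_ids: the reference target sentence as token ids.
    """
    return sum(math.log(dist[tok])
               for dist, tok in zip(step_distributions, target_ids))

# Toy example: a two-token target sentence.
dists = [{7: 0.6, 3: 0.4}, {5: 0.9, 2: 0.1}]
print(sentence_log_prob(dists, [7, 5]))  # log 0.6 + log 0.9 ~ -0.616
```

Training maximizes this quantity summed over all sentence pairs in the bilingual corpus $D$.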

Page 5: Dual Learning for Machine Translation (NIPS 2016)

Difficulty in getting large bilingual data

• Solution: utilization of monolingual data
  • Train a language model of the target language, and then integrate it with the MT model 1)2)
    <- does not fundamentally address the shortage of parallel data
  • Generate pseudo-bilingual data from monolingual data 3)4)
    <- no guarantee on the quality of the pseudo-bilingual data

1) T. Brants et al., “Large language models in machine translation”, EMNLP 2007
2) C. Gulcehre et al., “On using monolingual corpora in neural machine translation”, arXiv 2015
3) R. Sennrich et al., “Improving neural machine translation models with monolingual data”, ACL 2016
4) N. Ueffing et al., “Semi-supervised model adaptation for statistical machine translation”, Machine Translation Journal 2008

Page 6: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

• Use monolingual datasets to train translation models through dual learning

• Things required:
  • $D_A$: corpus of language A
  • $D_B$: corpus of language B (not necessarily aligned with $D_A$)
  • $P(\cdot \mid s; \Theta_{AB})$: translation model from A to B
  • $P(\cdot \mid s; \Theta_{BA})$: translation model from B to A
  • $LM_A(\cdot)$: learned language model of A
  • $LM_B(\cdot)$: learned language model of B
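To make the ingredient list concrete, one possible interface sketch (all type and field names here are mine, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, List

Sentence = List[str]

@dataclass
class DualLearningIngredients:
    """The six ingredients listed above; names are illustrative."""
    corpus_A: List[Sentence]                                 # D_A (monolingual)
    corpus_B: List[Sentence]                                 # D_B (monolingual)
    translate_AB: Callable[[Sentence, int], List[Sentence]]  # top-K beam search, A -> B
    translate_BA: Callable[[Sentence, int], List[Sentence]]  # top-K beam search, B -> A
    lm_score_A: Callable[[Sentence], float]                  # LM_A(.), log-probability
    lm_score_B: Callable[[Sentence], float]                  # LM_B(.), log-probability
```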

Page 7: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

1. For a sentence $s$ from $D_A$, generate $K$ translated sentences $s_{mid,1}, s_{mid,2}, \dots, s_{mid,K}$ from $P(\cdot \mid s; \Theta_{AB})$ based on beam search

Page 8: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

1. For a sentence $s$ from $D_A$, generate $K$ translated sentences $s_{mid,1}, s_{mid,2}, \dots, s_{mid,K}$ from $P(\cdot \mid s; \Theta_{AB})$ based on beam search
2. Compute intermediate rewards $r_{1,1}, r_{1,2}, \dots, r_{1,K}$ from $LM_B(s_{mid,k})$ for each sentence as

$$r_{1,k} = LM_B(s_{mid,k})$$
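A toy illustration of this reward (the unigram table below is a made-up stand-in for a trained language model $LM_B$, and the candidate list for the beam-search output):

```python
import math

# Made-up unigram "language model" of language B.
unigram = {"the": 0.05, "cat": 0.01, "sat": 0.008, "<unk>": 1e-6}

def lm_score_B(sentence):
    """Stand-in for LM_B(.): log-probability of the sentence."""
    return sum(math.log(unigram.get(w, unigram["<unk>"])) for w in sentence)

candidates = [["the", "cat", "sat"], ["the", "cat", "moo"]]
r1 = [lm_score_B(s_mid) for s_mid in candidates]  # r_{1,k} = LM_B(s_mid,k)
print(r1)  # the fluent candidate scores higher (less negative)
```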

Page 9: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

3. Get communication rewards $r_{2,1}, r_{2,2}, \dots, r_{2,K}$ for each sentence as

$$r_{2,k} = \ln P(s \mid s_{mid,k}; \Theta_{BA})$$

Page 10: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

3. Get communication rewards $r_{2,1}, r_{2,2}, \dots, r_{2,K}$ for each sentence as

$$r_{2,k} = \ln P(s \mid s_{mid,k}; \Theta_{BA})$$

4. Set the total reward of the $k$-th sentence as

$$r_k = \alpha r_{1,k} + (1 - \alpha) r_{2,k}$$
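For concreteness, a toy numeric version of step 4 ($\alpha$ is a trade-off hyperparameter; the value below is arbitrary):

```python
alpha = 0.5  # trade-off hyperparameter (illustrative value)
r1_k = -8.2  # LM_B(s_mid,k): fluency of the translated sentence
r2_k = -5.1  # ln P(s | s_mid,k; Theta_BA): how well s can be reconstructed
r_k = alpha * r1_k + (1 - alpha) * r2_k
print(r_k)   # ~ -6.65
```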

Page 11: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

5. Compute the stochastic gradients with respect to $\Theta_{AB}$ and $\Theta_{BA}$:

$$\nabla_{\Theta_{AB}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ r_k \nabla_{\Theta_{AB}} \ln P(s_{mid,k} \mid s; \Theta_{AB}) \right]$$

$$\nabla_{\Theta_{BA}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - \alpha) \nabla_{\Theta_{BA}} \ln P(s \mid s_{mid,k}; \Theta_{BA}) \right]$$

(Only the communication reward $r_{2,k}$ depends on $\Theta_{BA}$, which is why the second gradient carries the bare factor $(1 - \alpha)$.)

Page 12: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

5. Compute the stochastic gradients with respect to $\Theta_{AB}$ and $\Theta_{BA}$:

$$\nabla_{\Theta_{AB}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ r_k \nabla_{\Theta_{AB}} \ln P(s_{mid,k} \mid s; \Theta_{AB}) \right]$$

$$\nabla_{\Theta_{BA}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - \alpha) \nabla_{\Theta_{BA}} \ln P(s \mid s_{mid,k}; \Theta_{BA}) \right]$$

6. Update the model parameters:

$$\Theta_{AB} \leftarrow \Theta_{AB} + \gamma_1 \nabla_{\Theta_{AB}} E[r]$$

$$\Theta_{BA} \leftarrow \Theta_{BA} + \gamma_2 \nabla_{\Theta_{BA}} E[r]$$
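Putting steps 1-6 together, a minimal self-contained sketch of one play of the game for a sentence from $D_A$ (a sketch under assumed interfaces, not the authors' code: the models, language model, and gradient computations are abstracted as callables so the REINFORCE-style update logic stays visible):

```python
def dual_learning_step(s, translate_AB, lm_score_B, log_p_BA,
                       grad_log_p_AB, grad_log_p_BA,
                       theta_AB, theta_BA,
                       K=2, alpha=0.5, gamma1=0.01, gamma2=0.01):
    """One dual-learning update for a sentence s from D_A.

    Assumed stand-ins for the real NMT/LM components:
      translate_AB(s, K)      -> K beam-search candidates s_mid,k
      lm_score_B(s_mid)       -> LM_B(s_mid)
      log_p_BA(s, s_mid)      -> ln P(s | s_mid; Theta_BA)
      grad_log_p_AB(s_mid, s) -> gradient of ln P(s_mid | s; Theta_AB)
      grad_log_p_BA(s, s_mid) -> gradient of ln P(s | s_mid; Theta_BA)
    """
    candidates = translate_AB(s, K)                             # step 1
    r1 = [lm_score_B(m) for m in candidates]                    # step 2
    r2 = [log_p_BA(s, m) for m in candidates]                   # step 3
    r = [alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]   # step 4

    # Step 5: stochastic gradients of the expected reward over the K samples.
    g_AB = sum(rk * grad_log_p_AB(m, s) for rk, m in zip(r, candidates)) / K
    g_BA = sum((1 - alpha) * grad_log_p_BA(s, m) for m in candidates) / K

    # Step 6: gradient ascent with learning rates gamma1, gamma2.
    return theta_AB + gamma1 * g_AB, theta_BA + gamma2 * g_BA

# Toy usage with scalar "parameters" and dummy components:
theta_AB, theta_BA = dual_learning_step(
    s=["le", "chat"],
    translate_AB=lambda s, K: [["the", "cat"], ["cat", "the"]],
    lm_score_B=lambda m: -1.0 if m[0] == "the" else -2.0,
    log_p_BA=lambda s, m: -0.5,
    grad_log_p_AB=lambda m, s: 0.1,
    grad_log_p_BA=lambda s, m: 0.2,
    theta_AB=0.0, theta_BA=0.0)
```

The paper plays the game symmetrically as well, sampling sentences from $D_B$ and running the loop B -> A -> B, so both models receive both kinds of gradient.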

Page 13: Dual Learning for Machine Translation (NIPS 2016)

Dual learning algorithm

Page 14: Dual Learning for Machine Translation (NIPS 2016)

Experiment settings

• Baseline models
  • Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”
  • Sennrich et al., “Improving Neural Machine Translation Models with Monolingual Data”

Page 15: Dual Learning for Machine Translation (NIPS 2016)

Dataset

• WMTʼ14
  • 12M sentence pairs
  • English -> French, French -> English
• Data usage (for dual learning)
  • Small
    1. Train translation models with 10% of the bilingual data.
    2. Train translation models with 10% of the bilingual data and monolingual data through the dual learning algorithm.
    3. Train translation models only with monolingual data through the dual learning algorithm.
  • Large
    1. Train translation models with 100% of the bilingual data.
    2. Train translation models with 100% of the bilingual data and monolingual data through the dual learning algorithm.
    3. Train translation models only with monolingual data through the dual learning algorithm.

Page 16: Dual Learning for Machine Translation (NIPS 2016)

Evaluation

• BLEU: geometric mean of n-gram precisions (multiplied by a brevity penalty that punishes overly short outputs)
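A minimal single-sentence sketch of that definition (illustrative only, not the evaluation script used in the paper; real BLEU is computed at corpus level, usually with smoothing):

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Toy BLEU: geometric mean of 1..max_n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(matched / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # real implementations smooth instead of returning 0
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the cat sat on the mat".split(),
                    "the cat sat on a mat".split()))  # ~0.54
```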

Page 17: Dual Learning for Machine Translation (NIPS 2016)

Results

• Outperforms the baseline models
• In Fr->En, dual learning with 10% of the data ≈ baseline models with 100% of the data
• Dual learning is especially effective when the bilingual dataset is small

Page 18: Dual Learning for Machine Translation (NIPS 2016)

Results

• For different source sentence lengths
  • The improvement is significant for long sentences.

Page 19: Dual Learning for Machine Translation (NIPS 2016)

Results

• Reconstruction performance (BLEU)
  • Huge improvement over the baseline models, especially in En->Fr->En (Small)

Page 20: Dual Learning for Machine Translation (NIPS 2016)

Results

• Reconstruction examples

Page 21: Dual Learning for Machine Translation (NIPS 2016)

Future extensions & remarks

• Application in other domains (see the table below)
• Generalization of dual learning
  • Dual -> Triple -> ... -> n-loop
• Learn from scratch
  • only with monolingual data
  • maybe plus a lexical dictionary

Application         | Primal task        | Dual task
--------------------|--------------------|--------------------------
Speech processing   | Speech recognition | Text to speech
Image understanding | Image captioning   | Image generation
Conversation engine | Question           | Response
Search engine       | Search             | Query/keyword suggestion

Page 22: Dual Learning for Machine Translation (NIPS 2016)

Summary

• What
  • Introduce the “dual learning algorithm” to utilize monolingual data
• Results
  • With 100% of the data, the model outperforms the baseline models
  • With 10% of the data, the model achieves results comparable to baseline models trained on 100% of the data
• Future
  • The dual learning mechanism can be applied to other domains
  • Learning from scratch

Page 23: Dual Learning for Machine Translation (NIPS 2016)

Some notes

• Does dual learning not learn word-to-word correspondences?
• Is training on bilingual data a must?
• Or would a lexical dictionary suffice?

Page 24: Dual Learning for Machine Translation (NIPS 2016)

Appendix: Stochastic gradient of models