


Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 436–446, Brussels, Belgium, October 31 – November 4, 2018. ©2018 Association for Computational Linguistics


Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation

Marzieh Fadaee and Christof Monz
Informatics Institute, University of Amsterdam

Science Park 904, 1098 XH Amsterdam, The Netherlands
{m.fadaee,c.monz}@uva.nl

Abstract

Neural Machine Translation has achieved state-of-the-art performance for several language pairs using a combination of parallel and synthetic data. Synthetic data is often generated by back-translating sentences randomly sampled from monolingual data using a reverse translation model. While back-translation has been shown to be very effective in many cases, it is not entirely clear why. In this work, we explore different aspects of back-translation, and show that words with high prediction loss during training benefit most from the addition of synthetic data. We introduce several variations of sampling strategies targeting difficult-to-predict words using prediction losses and frequencies of words. In addition, we also target the contexts of difficult words and sample sentences that are similar in context. Experimental results for the WMT news translation task show that our method improves translation quality by up to 1.7 and 1.2 BLEU points over back-translation using random sampling for German→English and English→German, respectively.

1 Introduction

Neural machine translation (NMT) using a sequence-to-sequence model has achieved state-of-the-art performance for several language pairs (Bahdanau et al., 2015; Sutskever et al., 2014; Cho et al., 2014). The availability of large-scale training data for these sequence-to-sequence models is essential for achieving good translation quality.

Previous approaches have focused on leveraging monolingual data, which is available in much larger quantities than parallel data (Lambert et al., 2011). Gulcehre et al. (2017) proposed two methods, shallow and deep fusion, for integrating a neural language model into the NMT system. They observe improvements by combining the scores of a neural language model trained on target monolingual data with the NMT system.

Sennrich et al. (2016a) proposed back-translation of monolingual target sentences to the source language and adding the synthetic sentences to the parallel data. In this approach, a reverse model trained on parallel data is used to translate sentences from target-side monolingual data into the source language. This synthetic parallel data is then used in combination with the actual parallel data to re-train the model. This approach yields state-of-the-art results even when large parallel data are available and has become common practice in NMT (Sennrich et al., 2017; García-Martínez et al., 2017; Ha et al., 2017).

While back-translation has been shown to be very effective in improving translation quality, it is not exactly clear why it helps. Generally speaking, it mitigates overfitting and improves fluency by exploiting additional data in the target language. An important question in this context is how to select the monolingual data in the target language that is to be back-translated into the source language to optimally benefit translation quality. Pham et al. (2017) experimented with using domain adaptation methods to select monolingual data based on the cross-entropy between the monolingual data and the in-domain corpus (Axelrod et al., 2015), but did not find any improvements over random sampling as originally proposed by Sennrich et al. (2016a). Earlier work has explored to what extent data selection of parallel corpora can benefit translation quality (Axelrod et al., 2011; van der Wees et al., 2017), but such selection techniques have not been investigated in the context of back-translation.

In this work, we explore different aspects of the back-translation method to gain a better understanding of its performance. Our analyses show that the quality of the synthetic data acquired with a reasonably good model has a small impact on the effectiveness of back-translation, but the ratio of synthetic to real training data plays a more important role. With a higher ratio, the model gets biased towards noise in the synthetic data and unlearns its parameters completely. Our findings show that it is mostly words that are difficult to predict in the target language that benefit from additional back-translated data. These are the words with high prediction loss during training when the translation model converges. We further investigate these difficult words and explore alternatives to random sampling of sentences with a focus on increasing occurrences of such words.

Our proposed approach is twofold: identifying difficult words and sampling with the objective of increasing occurrences of these words, and identifying contexts where these words are difficult to predict and sampling sentences similar to the difficult contexts. With targeted sampling of sentences for back-translation we achieve improvements of up to 1.7 BLEU points over back-translation using random sampling.

2 Back-Translation for NMT

In this section, we briefly review a sequence-to-sequence NMT system and describe our experimental settings. We then investigate different aspects and modeling challenges of integrating the back-translation method into the NMT pipeline.

2.1 Neural Machine Translation

The NMT system used for our experiments is an encoder-decoder network with recurrent architecture (Luong et al., 2015). For training the NMT system, two sequences of tokens, $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_m]$, are given in the source and target language, respectively.

The source sequence is the input to the encoder, which is a bidirectional long short-term memory network generating a representation $s_n$. Using an attention mechanism (Bahdanau et al., 2015), the attentional hidden state is:

$$\tilde{h}_t = \tanh(W_c [c_t; h_t])$$

where $h_t$ is the target hidden state at time step $t$ and $c_t$ is the context vector, which is a weighted average of $s_n$.

The decoder predicts each target token $y_t$ by computing the probability:

$$p(y_t \mid y_{<t}, s_n) = \mathrm{softmax}(W_o \tilde{h}_t)$$

For the token $y_t$, the conditional probability $p(y_t \mid y_{<t}, s_n)$ during training quantifies the difficulty of predicting that token in the context $y_{<t}$. The prediction loss of token $y_t$ is the negative log-likelihood of this probability.

During training on a parallel corpus $D$, the cross-entropy objective function is defined as:

$$\mathcal{L} = \sum_{(X, Y) \in D} \sum_{i=1}^{m} -\log p(y_i \mid y_{<i}, s_n)$$

The objective of this function is to improve the model's estimation of predicting target words given the source sentence and the target context. The model is trained end-to-end by minimizing the negative log-likelihood of the target words.
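As an illustration (not from the paper's implementation; the array names mirror the symbols above and are hypothetical), the following minimal numpy sketch computes the prediction loss $-\log p(y_t \mid y_{<t}, s_n)$ of a single target token from an attentional hidden state:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary dimension.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def token_prediction_loss(h_tilde, W_o, target_id):
    """Negative log-likelihood of one target token.

    h_tilde  : attentional hidden state, shape (d,)
    W_o      : output projection, shape (|V_t|, d)
    target_id: index of the reference token y_t in the target vocabulary
    """
    probs = softmax(W_o @ h_tilde)      # p(y_t | y_<t, s_n)
    return -np.log(probs[target_id])    # prediction loss of y_t

# Toy usage with random parameters (illustration only).
rng = np.random.default_rng(0)
d, vocab_size = 8, 20
loss = token_prediction_loss(rng.normal(size=d),
                             rng.normal(size=(vocab_size, d)),
                             target_id=3)
print(loss)
```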

2.2 Experimental Setup

For the translation experiments, we use English↔German WMT17 training data and report results on newstest 2014, 2015, 2016, and 2017 (Bojar et al., 2017).

As NMT system, we use a 2-layer attention-based encoder-decoder model implemented in OpenNMT (Klein et al., 2017), trained with embedding size 512, hidden dimension size 1024, and batch size 64. We pre-process the training data with Byte-Pair Encoding (BPE) using 32K merge operations (Sennrich et al., 2016b).

We compare our results to Sennrich et al. (2016a) by back-translating randomly selected sentences from the monolingual data and combining them with the parallel training data. We perform the random selection and re-training 3 times and report the averaged outcomes of the 3 models. In all experiments the sentence pairs are shuffled before each epoch.

We measure translation quality by single-reference case-sensitive BLEU (Papineni et al., 2002) computed with the multi-bleu.perl script from Moses.

2.3 Size of the Synthetic Data in Back-Translation

One selection criterion for using back-translation is the ratio of real to synthetic data. Sennrich et al. (2016a) showed that higher ratios of synthetic data lead to decreases in translation performance.

In order to investigate whether the improvements in translation performance increase with higher ratios of synthetic data, we perform three experiments with different sizes of synthetic data.


System              Size   2014  2015  2016  2017
Baseline            4.5M   26.7  27.6  32.5  28.1
+ synthetic (1:1)   9M     28.7  29.7  36.3  30.8
+ synthetic (1:4)   23M    29.1  30.0  36.9  31.1
+ synthetic (1:10)  50M    22.8  23.6  29.2  23.9

Table 1: German→English translation quality (BLEU) of systems with different ratios of real:synthetic data.

Figure 1 shows the perplexity as a function of training time for different sizes of synthetic data. One can see that all systems perform similarly in the beginning and converge after observing increasingly more training instances. However, the model with the ratio of (1:10) synthetic data gets increasingly biased towards the noisy data after 1M instances. The decrease in performance with more synthetic than real data is also in line with the findings of Poncelas et al. (2018).

Comparing the systems using ratios of (1:1) and (1:4), we see that the latter achieves lower perplexity during training. Table 1 presents the performance of these systems on the German→English translation task. The BLEU scores show that the translation quality does not improve linearly with the size of the synthetic data. The model trained on a (1:4) real to synthetic ratio of training data achieves the best results; however, the performance is close to the model trained on (1:1) training data.

2.4 Direction of Back-Translation

Adding monolingual data in the target language to the training data has the benefit of introducing new context and improving the fluency of the translation model. The automatically generated translations in the source language, while being erroneous, introduce new context for the source words and will not affect the translation model significantly.

Monolingual data is available in large quantities for many languages. The decision on the direction of back-translation is consequently not based on the monolingual data available, but on the advantage of having more fluent source or target sentences.

Lambert et al. (2011) show that adding synthetic source and real target data achieves improvements in traditional phrase-based machine translation (PBMT). Similarly, in previous work in NMT, back-translation is performed on monolingual data in the target language.

System           Size   2014  2015  2016  2017
Baseline         4.5M   21.2  23.3  28.0  22.4
+ synthetic tgt  9M     22.4  25.3  29.8  23.7
+ synthetic src  9M     24.0  26.0  30.7  24.8

Table 2: English→German translation quality (BLEU) of systems using forward and reverse models for generating synthetic data.

We perform a small experiment to measure whether back-translating from source to target is also beneficial for improving translation quality. Table 2 shows that in both directions the performance of the translation system improves over the baseline. This is in contrast to the findings of Lambert et al. (2011) for PBMT systems, where they show that using synthetic target data does not lead to improvements in translation quality.

Still, when adding monolingual data in the target language, the BLEU scores are slightly higher than when using monolingual data in the source language. This indicates the moderate importance of having fluent sentences in the target language.

[Figure 1: Training plots for systems with different ratios of (real : syn) training data, showing perplexity (log scale) on the development set as a function of training time (minibatches), for ratios (1:1), (1:4), and (1:10).]

2.5 Quality of the Synthetic Data in Back-Translation

One selection criterion for back-translation is the quality of the synthetic data. Khayrallah and Koehn (2018) studied the effects of noise in the training data on a translation model and discovered that NMT models are less robust to many types of noise than PBMT models. In order for the NMT model to learn from the parallel data, the data should be fluent and close to the manually generated translations. However, automatically generating sentences using back-translation is not as accurate as manual translation.

System          Size    2014  2015  2016  2017
Baseline        2.25M   24.3  24.9  29.5  25.6
+ synthetic     4.5M    26.0  26.9  32.2  27.5
+ ground truth  4.5M    26.7  27.6  32.5  28.1

Table 3: German→English translation quality (BLEU) when re-training with back-translated (synthetic) versus manually translated (ground truth) data.

To investigate the oracle gap between the performance of manually generated and back-translated sentences, we perform a simple experiment using the existing parallel training data. In this experiment, we divide the parallel data into two parts, train the reverse model on the first half of the data, and use this model to back-translate the second half. The manually translated sentences of the second half are considered as ground truth for the synthetic data.

Table 3 shows the BLEU scores of the experiments. As expected, re-training with additional parallel data yields higher performance than re-training with additional synthetic data. However, the differences between the BLEU scores of these two models are surprisingly small. This indicates that performing back-translation with a reasonably good reverse model already achieves results that are close to a system that uses additional manually translated data. This is in line with the findings of Sennrich et al. (2016a), who observed that the same monolingual data translated with three translation systems of different quality and used in re-training the translation model yields similar results.

3 Back-Translation and Token Prediction

In the previous section, we observed that the reverse model used to back-translate achieves results comparable to manually translated sentences. Also, there is a limit to learning from synthetic data, and with more synthetic data the model unlearns its parameters completely.

In this section, we investigate the influence of the sampled sentences on the learning model. Fadaee et al. (2017) showed that targeting specific words during data augmentation improves the generation of these words in the right context. Specifically, adding synthetic data to the training data has an impact on the prediction probabilities of individual words. Here, we further examine the effects of the back-translated synthetic data on the prediction of target tokens.

[Figure 2: Top: changes in mean token prediction loss after re-training with synthetic data, sorted by the mean prediction loss of the baseline system; decreases and increases in values are marked blue and red, respectively. Bottom: frequencies (log scale) of target tokens in the baseline training data. x-axis: target words sorted by baseline prediction loss.]

As mentioned in Section 2.1, the objective function of training an NMT system is to minimize $\mathcal{L}$ by minimizing the prediction loss $-\log p(y_t \mid y_{<t}, s_n)$ for each target token in the training data. The addition of monolingual data in the target language improves the estimation of the probability $p(Y)$ and consequently the model generates more fluent sentences.

Sennrich et al. (2016a) show that by using back-translation, the system with target-side monolingual data reaches a lower perplexity on the development set. This is expected since the domain of the monolingual data is similar to the domain of the development set. To investigate the model's accuracy independent of the domains of the data, we collect statistics of the target token prediction loss during training.

Figure 2 shows changes of token prediction loss when training converges and the weights are verging on being stable. The values are sorted by the mean token prediction loss of the system trained on real parallel data.


We observe an effect similar to distributional smoothing (Chen and Goodman, 1996): while prediction loss increases slightly for most tokens, the largest decrease in loss occurs for tokens with high prediction loss values. This indicates that by randomly sampling sentences for back-translation, the model improves its estimation of tokens that were originally more difficult to predict, i.e., tokens that had a high prediction loss. Note that we compute the token prediction loss in just one pass over the training corpus with the final model, and as a result it is not biased towards the order of the data.

This finding motivates us to further explore sampling criteria for back-translation that contribute considerably to the parameter estimation of the translation model. We propose that by oversampling sentences containing difficult-to-predict tokens we can maximize the impact of using the monolingual data. After translating sentences containing such tokens and including them in the training data, the model becomes more robust in predicting these tokens.

In the next two sections, we propose several methods of using the target token prediction loss to identify the most rewarding sentences for back-translating and re-training the translation model.

4 Targeted Sampling for Difficult Words

One of the main outcomes of using synthetic data is a better estimation of words that were originally difficult to predict, as measured by their high prediction losses during training. In this section, we propose three variations of how to identify these words and perform sampling to target these words.

Algorithm 1 Sampling for difficult words
Input: difficult tokens D = {y_i}, monolingual corpus M, number of required samples N
Output: sampled sentences S = {S_i}_{i=1}^{N}, where each sentence S_i is sampled from M
1: procedure DIFFSAMPLING(D, M, N)
2:   initialize S = {}
3:   repeat
4:     sample S_c from M
5:     for all tokens y in S_c do
6:       if y ∈ D then
7:         add S_c to S
8:   until |S| = N
9:   return S
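A minimal Python sketch of Algorithm 1 (illustrative only; it assumes `monolingual_corpus` is an iterable of tokenized sentences and `difficult_tokens` is a set built with one of the criteria described next):

```python
import random

def diff_sampling(difficult_tokens, monolingual_corpus, n_samples):
    """Sample sentences from the monolingual data that contain at least one difficult token."""
    corpus = list(monolingual_corpus)   # tokenized sentences (lists of tokens)
    sampled = []
    while len(sampled) < n_samples:
        candidate = random.choice(corpus)                      # sample S_c from M
        if any(tok in difficult_tokens for tok in candidate):  # some y in S_c with y in D
            sampled.append(candidate)                          # add S_c to S
    return sampled
```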

4.1 Token Frequency as a Feature of Difficulty

Figure 2 shows that the majority of tokens with high mean prediction losses have low frequencies in the training data. Additionally, the majority of decreases in prediction loss after adding synthetic sentence pairs to the training data occurs with less frequent tokens.

Note that these tokens are not necessarily rare in the traditional sense; in Figure 2, the 10K least frequent tokens in the target vocabulary benefit most from back-translated data.

Sampling new contexts from monolingual data provides context diversity proportional to the token frequencies, and less frequent tokens benefit most from new contexts. Algorithm 1 presents this approach where the list of difficult tokens is defined as:

$$D = \{y_i \in V_t : \mathrm{freq}(y_i) < \eta\}$$

where $V_t$ is the target vocabulary and $\eta$ is the frequency threshold for deciding on the difficulty of the token.
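A possible implementation of this frequency criterion, assuming `target_sentences` is the tokenized target side of the parallel data (a sketch, not the paper's code):

```python
from collections import Counter

def frequency_based_difficult_tokens(target_sentences, eta):
    """D = {y in V_t : freq(y) < eta}, computed over the target side of the bitext."""
    freq = Counter(tok for sent in target_sentences for tok in sent)
    return {tok for tok, count in freq.items() if count < eta}
```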

4.2 Difficult Words with High Mean Prediction Losses

In this approach, we use the mean losses to identify difficult-to-predict tokens. The mean prediction loss $\hat{\ell}(y)$ of token $y$ during training is defined as follows:

$$\hat{\ell}(y) = \frac{1}{n_y} \sum_{n=1}^{N} \sum_{t=1}^{|Y^n|} -\log p(y^n_t \mid y^n_{<t}, s_n)\, \delta(y^n_t, y)$$

where $n_y$ is the number of times token $y$ is observed during training, i.e., the token frequency of $y$, $N$ is the number of sentences in the training data, $|Y^n|$ is the length of target sentence $n$, and $\delta(y^n_t, y)$ is the Kronecker delta function, which is 1 if $y^n_t = y$ and 0 otherwise.

By specifically providing more sentences for difficult words, we improve the model's estimation and decrease the model's uncertainty in prediction. During sampling from the monolingual data, we select sentences that contain difficult words.

Algorithm 1 presents this approach where the list of difficult tokens is defined as:

$$D = \{y_i \in V_t : \hat{\ell}(y_i) > \mu\}$$

where $V_t$ is the vocabulary of the target language and $\mu$ is the threshold on the difficulty of the token.
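The required statistics can be collected in a single pass over the training data with the converged model, as noted in Section 3. The sketch below assumes a hypothetical `token_losses` dictionary mapping each target token to the list of its per-occurrence prediction losses:

```python
from statistics import mean

def mean_loss_difficult_tokens(token_losses, mu):
    """D = {y in V_t : mean prediction loss of y > mu}.

    token_losses: dict mapping token -> list of -log p(y_t | y_<t, s_n)
                  values observed for that token in one pass over the data.
    """
    return {tok for tok, losses in token_losses.items() if mean(losses) > mu}
```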


                                  De-En                                      En-De
System                            test2014 test2015 test2016 test2017       test2014 test2015 test2016 test2017
BASELINE*                         26.7     27.6     32.5     28.1           21.2     23.3     28.0     22.4
RANDOM*                           28.7     29.7     36.3     30.8           24.0     26.0     30.7     24.8
FREQ                              29.7     30.5     37.5     31.4           24.2     27.0     31.7     25.2
MEANPREDLOSS*                     29.9     30.9     37.8     32.1           24.7     26.8     31.5     25.5
MEANPREDLOSS + STDPREDLOSS        30.0     30.9     37.7     31.9           24.1     26.9     31.0     25.3
PRESERVE PREDLOSS RATIO           29.8     30.9     37.4     31.6           24.5     27.2     31.8     25.5

Table 4: German↔English translation quality (BLEU). Experiments marked * are averaged over 3 runs. MEANPREDLOSS and FREQ are difficulty criteria based on mean token prediction loss and token frequency, respectively. MEANPREDLOSS + STDPREDLOSS favors tokens with skewed prediction losses. PRESERVE PREDLOSS RATIO preserves the ratio of the distribution of difficult contexts.

4.3 Difficult Words with Skewed Prediction Losses

By using the mean loss for target tokens as defined above, we do not discriminate between differences in prediction loss for occurrences in different contexts. This lack of discrimination can be problematic for tokens with high loss variations. For instance, there can be a token with ten occurrences, out of which two have high and eight have low prediction loss values.

We hypothesize that if a particular token is easier to predict in some contexts and harder in others, the sampling strategy should be context sensitive, allowing us to target specific contexts in which a token has a high prediction loss. In order to distinguish between tokens with a skewed and tokens with a more uniform prediction loss distribution, we use both the mean and the standard deviation of token prediction losses to identify difficult tokens.

Algorithm 1 formalizes this approach where the list of difficult tokens is defined as:

$$D = \{y_i \in V_t : \hat{\ell}(y_i) > \mu \wedge \sigma(\ell(y_i)) > \rho\}$$

where $\hat{\ell}(y_i)$ is the mean and $\sigma(\ell(y_i))$ is the standard deviation of the prediction loss of token $y_i$, $V_t$ is the vocabulary of the target language, and $\mu$ and $\rho$ are the thresholds for deciding on the difficulty of the token.
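Under the same assumption of a per-occurrence loss dictionary as in the previous sketch, the skewed-loss criterion adds the standard-deviation threshold $\rho$:

```python
from statistics import mean, pstdev

def skewed_loss_difficult_tokens(token_losses, mu, rho):
    """D = {y in V_t : mean loss > mu and standard deviation of loss > rho}."""
    return {
        tok for tok, losses in token_losses.items()
        if len(losses) > 1 and mean(losses) > mu and pstdev(losses) > rho
    }
```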

4.4 Preserving Sampling Ratio of Difficult Occurrences

Above we examined the mean of the prediction loss for each token over all occurrences in order to identify difficult-to-predict tokens. However, the uncertainty of the model in predicting a difficult token varies for different occurrences of the token: one token can be easy to predict in one context, and hard in another. While the sampling step in the previous approaches targets these tokens, it does not ensure that the distribution of sampled sentences is similar to the distribution of problematic tokens in difficult contexts.

Algorithm 2 Sampling with ratio preservation
Input: difficult tokens and the corresponding sentences in the bitext D = {(y_t, Y_{y_t} = [y_1, ..., y_t, ..., y_m])}, monolingual corpus M, number of required samples N
Output: sampled sentences S = {S_i}_{i=1}^{N}, where each sentence S_i is sampled from M
1: procedure PREDLOSSRATIOSAMPLING(D, M, N)
2:   initialize S = {}
3:   H(y_t) ← N × |(y_t, ·) ∈ D| / |(y_·, ·) ∈ D|
4:   repeat
5:     sample S_c from M
6:     for all tokens y in S_c do
7:       if |y ∈ S| < H(y) then
8:         add S_c to S
9:   until |S| = N
10:  return S

To address this issue, we propose an approach where we target the number of times a token occurs in difficult-to-predict contexts and sample sentences accordingly, thereby ensuring the same ratio as the distribution of difficult contexts. If token $y_1$ is difficult to predict in two contexts and token $y_2$ is difficult to predict in four contexts, the number of sampled sentences containing $y_2$ is double the number of sampled sentences containing $y_1$. Algorithm 2 formalizes this approach.
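A sketch of Algorithm 2 (illustrative only), assuming `difficult_occurrences` is a list of (token, sentence) pairs from the bitext in which the token received a high prediction loss:

```python
import random
from collections import Counter

def predloss_ratio_sampling(difficult_occurrences, monolingual_corpus, n_samples):
    """Sample sentences so that each difficult token is sampled in proportion to the
    number of difficult-to-predict contexts it appears in (H(y_t) in Algorithm 2)."""
    corpus = list(monolingual_corpus)
    context_counts = Counter(tok for tok, _ in difficult_occurrences)
    total = sum(context_counts.values())
    quota = {tok: n_samples * c / total for tok, c in context_counts.items()}  # H(y_t)

    sampled, token_counts = [], Counter()
    attempts, max_attempts = 0, 1000 * n_samples   # guard for this toy sketch
    while len(sampled) < n_samples and attempts < max_attempts:
        attempts += 1
        candidate = random.choice(corpus)
        # add the sentence if some difficult token in it has not yet met its quota
        if any(token_counts[tok] < quota.get(tok, 0) for tok in candidate):
            sampled.append(candidate)
            token_counts.update(tok for tok in candidate if tok in quota)
    return sampled
```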

4.5 Results

We measure the translation quality of various models for the German→English and English→German translation tasks. The results of the translation experiments are presented in Table 4. As baseline we compare our approach to Sennrich et al. (2016a). For all experiments we sample and back-translate sentences from the monolingual data, keeping a one-to-one ratio of back-translated versus original data (1:1).

We set the hyperparameters $\mu$, $\rho$, and $\eta$ to 5, 10, and 5000, respectively. The values of the hyperparameters are chosen on a small sample of the parallel data based on the token loss distribution.

As expected, using random sampling for back-translation improves the translation quality over the baseline. However, all targeted sampling variants in turn outperform random sampling. Specifically, the best performing model for German→English, MEANPREDLOSS, uses the mean of the prediction loss for the target vocabulary to oversample sentences including these tokens.

For the English→German experiments we obtain the best translation performance when we preserve the prediction loss ratio during sampling.

We also observe that even though the model targeting tokens with skewed prediction loss distributions (MEANPREDLOSS + STDPREDLOSS) improves over random selection of sentences, it does not outperform the model using only mean prediction losses.

5 Context-Aware Targeted Sampling

In the previous section, we proposed methods for identifying difficult-to-predict tokens and performed targeted sampling from monolingual data. While the objective was to increase the occurrences of difficult tokens, we ignored the context of these tokens in the sampled sentences.

Arguably, if a word is difficult to predict in a given context, providing more examples of the same or similar context can aid the learning process. In this section, we focus on the context of difficult-to-predict words and aim to sample sentences that are similar to the corresponding difficult context.

The general algorithm is described in Algorithm 3. In the following sections, we discuss different definitions of the local context (line 7 and line 9) and similarity measures (line 10) in this algorithm, and report the results.

5.1 Definition of Local Context

Prediction loss is a function of the source sentence and the target context. One reason that a token has a high prediction loss is that the occurrence of the word is a deviation from what occurs more frequently in other occurrences of the context in the training data. This indicates an infrequent event, in particular a rare sense of the word, a domain that is different from other occurrences of the word, or an idiomatic expression.

source      wer glaube, dass das ende, sobald sie in Deutschland ank|ä|men, ir|re, erzählt B|ähr.
reference   if you think that this stops as soon as they arrive in Germany, you'd be wrong, says B|ähr.
NMT output  who believe that the end, as soon as they go to Germany, tells B|risk.

Table 5: An example from the synthetic data where the word B|ähr is incorrectly translated to B|risk. Subword unit boundaries are marked with '|'.

We identify pairs of tokens and sentences from parallel data where in each pair the NMT model assigns a high prediction loss to the token in the given context. Note that a token can occur several times in this list, since it can be considered as difficult-to-predict in different sentences.

We propose two approaches to define the local context of the difficult token:

Neighboring tokens  A straightforward way is to use positional context: tokens that precede and follow the target token, typically in a window of $w$ tokens to each side. For a sentence $S$ containing a difficult token at index $i$, the context function in Algorithm 3 is:

$$\mathrm{context}(S, i) = [S_{i-w}, \ldots, S_{i-1}, S_{i+1}, \ldots, S_{i+w}]$$

where $S_j$ is the token at index $j$ in sentence $S$.
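A sketch of this positional context function (list-based sentences and the window size are assumptions of the illustration):

```python
def window_context(sentence, i, w):
    """context(S, i) = [S_{i-w}, ..., S_{i-1}, S_{i+1}, ..., S_{i+w}]."""
    return sentence[max(0, i - w):i] + sentence[i + 1:i + 1 + w]
```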

Sentence from bitext containing difficult token:

He attended Stan|ford University, where he double maj|ored in Spanish and History.

Sampled sentences from monolingual data:

– The group is headed by Aar|on K|ush|ner, a Stan|ford University gradu|ate who formerly headed a gre|eting card company.
– Ford just opened a new R&D center near Stan|ford University, a hot|bed of such technological research.
– Joe Grund|fest, a professor and a colleague at Stan|ford Law School, outlines four reasons why the path to the IP|O has become so steep for asp|iring companies.

Table 6: Results of targeted sampling for the difficult subword unit 'Stan'.


Neighboring subword units  In our analysis of prediction loss during training, we observe that several tokens that are difficult to predict are indeed subword units. Current state-of-the-art NMT systems apply BPE to the training data to address large vocabulary challenges (Sennrich et al., 2016b).

By using BPE the model generalizes common subword units towards what is more frequent in the training data. This is inherently useful since it allows for better learning of less frequent words. However, a side effect of this approach is that at times the model generates subword units that are not linked to any words in the source sentence.

As an example, the German source and English reference translation in Table 5 show this problem. The word B|ähr, consisting of two subword units, is incorrectly translated into B|risk.

We address the insufficiency of the context for subword units with high prediction losses by targeting these tokens in sentence sampling.

Algorithm 3 formalizes this approach in sampling sentences from the monolingual data. For a sentence $S$ containing a difficult subword at index $i$, the context function is defined as:

$$\mathrm{context}(S, i) = [\ldots, S_{i-1}, S_{i+1}, \ldots]$$

where every token $S_j$ in the local context is a subword unit and part of the same word as $S_i$. Table 6 presents examples of sampled sentences for the difficult subword unit 'Stan'.

Sentence from bitext containing difficult word:

Bud|dy Hol|ly was part of the first group induc|ted into the Rock and R|oll Hall of F|ame on its formation in 1986.

Sampled sentences from monolingual data:

– A 2008 Rock and R|oll Hall of F|ame induc|t|ee, Mad|onna is ran|ked by the Gu|inn|ess Book of World Rec|ords as the top-selling recording artist of all time.
– The Rock and R|oll Hall of Fam|ers gave birth to the California rock sound.
– The winners were chosen by 500 voters, mostly musicians and other music industry veter|ans, who belong to the Rock and R|oll Hall of F|ame Foundation.

Table 7: Results of context-aware targeted sampling for the difficult token 'Rock'.

5.2 Similarity of the Local Contexts

In context-aware targeted sampling, we compare the context of a sampled sentence and the difficult context in the parallel data and select the sentence if they are similar. In the following, we propose two approaches for quantifying the similarities.

Algorithm 3 Sampling with context
Input: difficult tokens and the corresponding sentences in the bitext D = {(y_t, Y_{y_t} = [y_1, ..., y_t, ..., y_m])}, monolingual corpus M, context function context, number of required samples N, similarity threshold s
Output: sampled sentences S = {S_i}_{i=1}^{N}, where each sentence S_i is sampled from M
1: procedure CNTXTSAMPLING(D, M, context, N, s)
2:   initialize S = {}
3:   repeat
4:     sample S_c from M
5:     for all tokens y_t in S_c do
6:       if y_t ∈ D then
7:         C_m ← context(S_c, indexof(S_c, y_t))
8:         for all Y_{y_t} do
9:           C_p ← context(Y_{y_t}, indexof(Y_{y_t}, y_t))
10:          if Sim(C_m, C_p) > s then
11:            add S_c to S
12:  until |S| = N
13:  return S
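A Python sketch of Algorithm 3 (illustrative only), assuming `difficult_occurrences` maps each difficult token to the bitext target sentences in which it was hard to predict, and `similarity_fn` is one of the measures defined below:

```python
import random

def context_sampling(difficult_occurrences, monolingual_corpus, context_fn,
                     similarity_fn, n_samples, threshold):
    """Sample monolingual sentences whose local context around a difficult token is
    similar (above the threshold s) to a context in which the token was hard to predict."""
    corpus = list(monolingual_corpus)
    sampled = []
    while len(sampled) < n_samples:
        candidate = random.choice(corpus)                    # sample S_c from M
        if _has_similar_context(candidate, difficult_occurrences,
                                context_fn, similarity_fn, threshold):
            sampled.append(candidate)                        # add S_c to S
    return sampled

def _has_similar_context(sentence, difficult_occurrences, context_fn,
                         similarity_fn, threshold):
    for i, tok in enumerate(sentence):
        if tok not in difficult_occurrences:
            continue
        c_m = context_fn(sentence, i)                        # context in the monolingual sentence
        for bitext_sentence in difficult_occurrences[tok]:   # sentences where tok was difficult
            c_p = context_fn(bitext_sentence, bitext_sentence.index(tok))
            if similarity_fn(c_m, c_p) > threshold:          # Sim(C_m, C_p) > s
                return True
    return False
```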

Matching the local context  In this approach we aim to sample sentences containing the difficult token, matching the exact context to the problematic context. By sampling sentences that match the problematic context in a local window and differ in the rest of the sentence, we have more instances of the difficult token for training.

Algorithm 3 formalizes this approach where the similarity function is defined as:

$$\mathrm{Sim}(C_m, C_p) = \frac{1}{c} \sum_{i=1}^{c} \delta(C^i_m, C^i_p)$$

where $C_m$ and $C_p$ are the contexts of the sentences from the monolingual and parallel data, respectively, and $c$ is the number of tokens in the contexts.
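A sketch of this matching similarity over two context windows (illustrative only):

```python
def match_similarity(c_m, c_p):
    """Fraction of positions at which the two contexts contain the same token."""
    c = min(len(c_m), len(c_p))
    if c == 0:
        return 0.0
    return sum(a == b for a, b in zip(c_m, c_p)) / c
```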

Word representations  Another approach to sampling sentences that are similar to the problematic context is to weaken the matching assumption. Acquiring sentences that are similar in subject and do not match the exact context words allows for lexical diversity in the training data. We use embeddings obtained by training the Skipgram model (Mikolov et al., 2013) on monolingual data to calculate the similarity of the two contexts.

For this approach we define the similarity function in Algorithm 3 as:

$$\mathrm{Sim}(C_m, C_p) = \cos(v(C_m), v(C_p))$$

where $v(C_m)$ and $v(C_p)$ are the averaged embeddings of the tokens in the contexts.
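A sketch of the embedding-based similarity, assuming `embeddings` is a hypothetical dictionary of Skipgram vectors for target tokens:

```python
import numpy as np

def embedding_similarity(c_m, c_p, embeddings, dim=300):
    """Cosine similarity between the averaged embeddings of the two contexts."""
    def avg(context):
        vecs = [embeddings[tok] for tok in context if tok in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    v_m, v_p = avg(c_m), avg(c_p)
    denom = np.linalg.norm(v_m) * np.linalg.norm(v_p)
    return float(v_m @ v_p / denom) if denom else 0.0
```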


                                                    De-En                                      En-De
Difficulty criterion   Context     Similarity   test2014 test2015 test2016 test2017       test2014 test2015 test2016 test2017
BASELINE*                                       26.7     27.6     32.5     28.1           21.2     23.3     28.0     22.4
RANDOM*                                         28.7     29.7     36.3     30.8           24.0     26.0     30.7     24.8
FREQ                   TOKENS      EMB          30.0     30.8     37.6     31.7           24.4     26.3     31.5     25.6
PREDLOSS               SWORDS      MATCH        29.1     30.1     36.9     31.0           23.8     26.2     28.8     23.2
PREDLOSS               TOKENS      MATCH        29.7     30.6     37.6     31.8           24.3     27.4     31.6     25.5
PREDLOSS               TOKENS      EMB          29.9     30.8     37.7     31.9           24.5     27.5     31.7     25.6
PREDLOSS               SENTENCE    EMB          24.9     25.5     30.1     26.2           22.0     24.6     27.9     22.5
MEANPREDLOSS           TOKENS      EMB          30.2     31.4     37.9     32.2           24.4     27.2     31.8     25.6

Table 8: German↔English translation quality (BLEU). Experiments marked * are averaged over 3 runs. PREDLOSS is the contextual prediction loss and MEANPREDLOSS is the average loss. TOKENS and SWORDS are context definitions based on neighboring tokens and subword units, respectively; note that tokens include both subword units and full words. EMB computes context similarities with token embeddings and MATCH compares the context tokens directly.

Table 7 presents examples of sampled sentences for the difficult word rock. In this example, the context where the word 'Rock' has a high prediction loss is about the music genre and not the most prominent sense of the word, stone. Sampling sentences that contain this word in this particular context provides an additional signal for the translation model to improve parameter estimation.

5.3 Results

The results of the translation experiments are given in Table 8. In these experiments, we set the hyperparameters $s$ and $w$ to 0.75 and 4, respectively. Comparing the experiments with different similarity measures, MATCH and EMB, we observe that in all test sets we achieve the best results when using word embeddings. This indicates that for targeted sampling, it is more beneficial to have diversity in the context of difficult words as opposed to having the exact n-grams.

When using embeddings as the similarity measure, it is worth noting that with a context of size 4 the model performs very well, but fails when we increase the window size to include the whole sentence.

The experiments focusing on subword units (SWORDS) achieve improvements over the baselines; however, they perform slightly worse than the experiments targeting tokens (TOKENS).

The best BLEU scores are obtained with the mean of the prediction loss as difficulty criterion (MEANPREDLOSS) and using word representations to identify the most similar contexts. We observe that summarizing the distribution of the prediction losses by its mean is more beneficial than using individual losses. Our results motivate further exploration of using context for targeted sampling of sentences for back-translation.

6 Conclusion

In this paper we investigated the effective method of back-translation for NMT and explored alternatives for selecting the monolingual data in the target language that is to be back-translated into the source language to improve translation quality.

Our findings showed that words with high prediction losses in the target language benefit most from additional back-translated data.

As an alternative to random sampling, we proposed targeted sampling and specifically targeted words that are difficult to predict. Moreover, we used the contexts of the difficult words by incorporating context similarities as a feature to sample sentences for back-translation. We discovered that using the prediction loss to identify weaknesses of the translation model and providing additional synthetic data targeting these shortcomings improved the translation quality in German↔English by up to 1.7 BLEU points.

Acknowledgments

This research was funded in part by the Netherlands Organization for Scientific Research (NWO) under project numbers 639.022.213 and 612.001.218. We also thank NVIDIA for their hardware support and the anonymous reviewers for their helpful comments.


References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362. Association for Computational Linguistics.

Amittai Axelrod, Yogarshi Vyas, Marianna Martindale, Marine Carpuat, and Johns Hopkins. 2015. Class-based n-gram language difference models for data selection. In IWSLT (International Workshop on Spoken Language Translation), pages 180–187.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 310–318. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8).

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computational Linguistics.

Mercedes García-Martínez, Ozan Caglayan, Walid Aransa, Adrien Bardet, Fethi Bougares, and Loïc Barrault. 2017. LIUM machine translation systems for WMT17 news translation task. In Proceedings of the Second Conference on Machine Translation, pages 288–295, Copenhagen, Denmark. Association for Computational Linguistics.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. 2017. On integrating a language model into neural machine translation. Computer Speech & Language, 45:137–148.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2017. Effective strategies in zero-shot neural machine translation. ArXiv e-prints.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83. Association for Computational Linguistics.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72. Association for Computational Linguistics.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 284–293, Stroudsburg, PA, USA. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, Eunah Cho, Matthias Sperber, and Alexander Waibel. 2017. The Karlsruhe Institute of Technology systems for the news translation task in WMT 2017. In Proceedings of the Second Conference on Machine Translation, pages 366–373.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. ArXiv e-prints.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh's neural MT systems for WMT17. In Proceedings of the Second Conference on Machine Translation, pages 389–399, Copenhagen, Denmark. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2017. Dynamic data selection for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1400–1410, Copenhagen, Denmark. Association for Computational Linguistics.