Revisiting the Context Window for Cross-lingual Word Embeddings

Ryokan Ri and Yoshimasa Tsuruoka
The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
{li0123,tsuruoka}

Abstract

Existing approaches to mapping-based cross-lingual word embeddings are based on the assumption that the source and target embedding spaces are structurally similar. The structures of embedding spaces largely depend on the co-occurrence statistics of each word, which the choice of context window determines. Despite this obvious connection between the context window and mapping-based cross-lingual embeddings, their relationship has been underexplored in prior work. In this work, we provide a thorough evaluation, in various languages, domains, and tasks, of bilingual embeddings trained with different context windows. The highlight of our findings is that increasing both the source and target window sizes improves the performance of bilingual lexicon induction, especially on frequent nouns.

1 Introduction

Cross-lingual word embeddings can capture word semantics invariant among multiple languages, and facilitate cross-lingual transfer for low-resource languages (Ruder et al., 2019). Recent research has focused on mapping-based methods, which find a linear transformation from the source to target embedding spaces (Mikolov et al., 2013b; Artetxe et al., 2016; Lample et al., 2018). Learning a linear transformation is based on a strong assumption that the two embedding spaces are structurally similar or isometric.

The structure of word embeddings heavily depends on the co-occurrence information of words (Turney and Pantel, 2010; Baroni et al., 2014), i.e., word embeddings are computed by counting other words that appear in a specific context window of each word. The choice of context window changes the co-occurrence statistics of words and is thus crucial in determining the structure of an embedding space. For example, it is known that an embedding space trained with a smaller linear window captures functional similarities, while a larger window captures topical similarities (Levy and Goldberg, 2014a). Despite this important relationship between the choice of context window and the structure of the embedding space, how the choice of context window affects the structural similarity of two embedding spaces has not yet been fully explored.

In this paper, we attempt to deepen the understanding of cross-lingual word embeddings from the perspective of the choice of the context window through carefully designed experiments. We experiment with a variety of settings, with different domains and languages. We train monolingual word embeddings varying the context window sizes, align them with a mapping-based method, and then evaluate them with both intrinsic and downstream cross-lingual transfer tasks. Our research questions and the summary of the findings are as follows:

RQ1: What kind of context window produces a better alignment of two embedding spaces? Our results show that increasing the window sizes of both the source and target embeddings consistently improves the accuracy of bilingual dictionary induction, regardless of the domains of the source and target corpora. Our fine-grained analysis reveals that frequent nouns benefit the most from larger context sizes.

RQ2: In downstream cross-lingual transfer, do the context windows that perform well on the source language also perform well on the target languages? No. We find that even when a context window performs well on the source language task, it is often not the best choice for the target language. The general tendency is that broader context windows produce better performance for the target languages.








2 Background and Related Work

2.1 Context Window of Word Embeddings

Word embeddings are computed from the co-occurrence information of words, i.e., context words that appear around a given word. The embedding algorithm used in this work is the skip-gram with negative sampling (Mikolov et al., 2013c). In the skip-gram model, each word $w$ in the vocabulary $W$ is associated with a word vector $v_w$ and a context vector $c_w$.¹ The objective is to maximize the dot product $v_{w_t} \cdot c_{w_c}$ for the observed word-context pairs $(w_t, w_c)$, and to minimize the dot product for negative examples.
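For reference, the objective described above is commonly written as follows for a single observed pair, following the formulation of Mikolov et al. (2013c). Here $\sigma$ is the logistic sigmoid; $K$ (the number of negative samples) and $P_n$ (the noise distribution from which negative context words $w_n$ are drawn) are symbols introduced for this sketch and are not specified in the text above.

```latex
\ell(w_t, w_c) = \log \sigma\!\left(v_{w_t} \cdot c_{w_c}\right)
  + \sum_{j=1}^{K} \mathbb{E}_{w_n \sim P_n}
    \left[ \log \sigma\!\left(-\, v_{w_t} \cdot c_{w_n}\right) \right]
```

Maximizing the first term raises the dot product for observed word-context pairs, while the second term lowers it for sampled negative examples.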

The most common type of context is a linear window. When the window size is set to $k$, the context words of a target word $w_t$ in a sentence $[w_1, w_2, \ldots, w_t, \ldots, w_L]$ are $[w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}]$. The choice of context is crucial to the resulting embeddings as it will change the co-occurrence statistics associated with each target word. Table 1 demonstrates the effect of the context window size on the nearest neighbor structure of the embedding space: with a small window size, the resulting embeddings capture functional similarity, while with a larger window size, the embeddings capture topical similarities.
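As a minimal sketch of the linear window defined above, the following Python function enumerates the (target, context) pairs for a fixed window size k, clipping the window at sentence boundaries. The function name `extract_pairs` and the toy sentence are illustrative assumptions and are not taken from the experiments.

```python
from typing import List, Tuple


def extract_pairs(sentence: List[str], k: int) -> List[Tuple[str, str]]:
    """Return (target, context) pairs using a fixed linear window of size k."""
    pairs = []
    for t, target in enumerate(sentence):
        # Clip the window at the sentence boundaries.
        start = max(0, t - k)
        end = min(len(sentence), t + k + 1)
        for c in range(start, end):
            if c != t:  # the target is not its own context
                pairs.append((target, sentence[c]))
    return pairs


# Toy sentence (hypothetical example data).
sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(len(extract_pairs(sentence, 1)))
print(len(extract_pairs(sentence, 2)))
```

Enlarging k adds more distant context words for every target, which is what changes the co-occurrence statistics that the embeddings are trained on.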

Other types of context windows that have been explored include linear windows enriched with positional information (Levy and Goldberg, 2014b; Ling et al., 2015a; Li et al., 2017), syntactically informed context windows based on dependency trees (Levy and Goldberg, 2014a; Li et al., 2017), and windows that dynamically weight the surrounding words with an attention mechanism (Ling et al., 2015b). In this paper, we mainly discuss the most common linear window and investigate how the choice of the window size affects the isomorphism of two embedding spaces and the performance of cross-lingual transfer.

2.2 Cross-lingual Word Embeddings

Cross-lingual word embeddings aim to learn a shared semantic space in multiple languages. One promising solution is to jointly train the source and target embeddings, the so-called joint methods, by exploiting cross-lingual supervision signals in the form of word dictionaries (Duong et al., 2016), parallel corpora (Gouws et al., 2015; Luong et al., 2015), or document-aligned corpora (Vulic and Moens, 2016).

¹ Conceptually, the word and context vocabularies are regarded as separate, but for simplicity, we assume that they share the vocabulary.

| Query word | Window size 1 | Window size 10 |
|---|---|---|
| words | phrases | word |
| | loanwords | phrases |
| | morphemes | phrase |
| | verses | ungrammatical |
| | phonemes | homographs |
| typological | synchronic | totemism |
| | mechanistic | typology |
| | numerological | categorizations |
| | architectonic | dialectology |
| | dialectical | fusional |

Table 1: The top-5 nearest neighbors in English embedding spaces trained with different context windows in our experiment. The smaller window size captures functional similarities (-s, -cal, -ic), while the larger captures topical similarities.

Another line of research is off-line mapping-based approaches (Ruder et al., 2019), where monolingual embeddings are independently trained in multiple languages, and a post-hoc alignment matrix is learned to align the embedding spaces with a seed word dictionary (Mikolov et al., 2013b; Xing et al., 2015; Artetxe et al., 2016), with only a little supervision such as identical strings or numerals (Artetxe et al., 2017; Smith et al., 2017), or even in a completely unsupervised manner (Lample et al., 2018; Artetxe et al., 2018). Mapping-based approaches have recently become popular because of their lower computational cost compared to joint approaches, as they can make use of pre-trained monolingual word embeddings.

The assumption behind the mapping-based methods is the isomorphism of monolingual embedding spaces, i.e., the embedding spaces are structurally similar, or the nearest neighbor graphs from the different languages are approximately isomorphic (Søgaard et al., 2018). Considering that the structures of the monolingual embedding spaces are closely related to the choice of the context window, it is natural to expect that the context window has a considerable impact on the performance of mapping-based bilingual word embeddings.

However, most existing work has not provided empirical results on the effect of the context window on cross-lingual embeddings, as its focus is on how to learn a mapping between the two embedding spaces. In order to shed light on the effect of the context window on cross-lingual embeddings, we trained cross-lingual embeddings with different context windows, and carefully analyzed the implications of their varying performance on both intrinsic and extrinsic tasks.

3 Experimental Design

3.1 Training Monolingual Embeddings

The experiment is designed to cover multiple settings to fully understand the effect of the context window.

Languages. As the target language, we choose English (En) because of its richness of resources, and as the source languages, we choose French (Fr), German (De), Russian (Ru), and Japanese (Ja), taking into account typological variety and the availability of evaluation resources.

Note that the language pairs analyzed in this paper are limited to those including English, and some results may not generalize to other language pairs.

Corpus for Training Word Embeddings. To train the monolingual embeddings, we use the Wikipedia Comparable Corpora.² We choose comparable corpora for the main analysis in order to accentuate the effect of the context window by setting up an ideal situation for training cross-lingual embeddings.

We also experiment with different-domain settings, where we use corpora from the news domain³ for the source languages, because the isomorphism assumption has been shown to be very sensitive to the domains of the source and target corpora (Søgaard et al., 2018). We refer to those results when we are interested in whether the same trend with respect to the context window can be observed in the different-domain settings.

For the size of the data, to simulate the setting of transferring from a low-resource language to a high-resource language, we use 5M sentences for the target language (English), and 1M sentences for the source languages.⁴

⁴ We also experimented with very low-resource settings, where the source corpus size is set to 100K, but the results showed similar trends to the 1M setting, and thus we only include the results of the 1M setting in this paper.

Context Window. Since we want to measure the effect of the context window size, we vary the window size among 1, 2, 3, 4, 5, 7, 10, 15, and 20.

Besides the linear window, we also experimented with the unbound dependency context (Li et al., 2017), where we extract context words that are the head, modifiers, and siblings in a dependency tree. Our initial motivation was that, while the linear context is directly affected by different word orders, the dependency context can mitigate the effect of language differences, and thus may produce better cross-lingual embeddings. However, the performance of the dependency context always fell between that of the smaller and larger linear windows, and we found nothing notable. Therefore, the following analysis focuses only on the results of the linear context window.

Implementation of Word2Vec. Note that some common existing implementations of the skip-gram may obfuscate the effect of the window size. The original C implementation of word2vec and its Python implementation in Gensim⁵ adopt a dynamic window mechanism where the window size is uniformly sampled between 1 and the specified window size for each target word (Mikolov et al., 2013a). Also, those implementations remove frequent tokens by subsampling before extracting word-context pairs (so-called “dirty” subsampling) (Levy et al., 2015), which in effect enlarges the context size. Our experiment is based on word2vecf,⁶ which takes arbitrary word-context pairs as input. We extract word-context pairs from a fixed window size and afterward perform subsampling.

We train 300-dimensional embeddings. For details on the hyperparameters, we refer the readers to Appendix A.
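To make the two implementation effects discussed under "Implementation of Word2Vec" concrete, the sketch below contrasts a fixed window with a dynamic window, and subsampling before pair extraction ("dirty") with subsampling after it. All function names are hypothetical, and the explicit `keep` mask stands in for the frequency-based subsampling decision. The `clean_pairs` function is only one possible reading of "extract pairs from a fixed window and afterward perform subsampling"; the exact procedure used in the experiments is not specified here.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def window_pairs(tokens: Sequence[str], sizes: Sequence[int]) -> List[Pair]:
    """Extract (target, context) pairs; sizes[t] is the window size at position t."""
    pairs = []
    for t, target in enumerate(tokens):
        lo = max(0, t - sizes[t])
        hi = min(len(tokens), t + sizes[t] + 1)
        pairs.extend((target, tokens[c]) for c in range(lo, hi) if c != t)
    return pairs


def fixed_pairs(tokens: Sequence[str], k: int) -> List[Pair]:
    """Every target uses exactly the specified window size k."""
    return window_pairs(tokens, [k] * len(tokens))


def dynamic_pairs(tokens: Sequence[str], k: int, rng: random.Random) -> List[Pair]:
    """Dynamic window: the size is sampled uniformly from 1..k for each target."""
    return window_pairs(tokens, [rng.randint(1, k) for _ in tokens])


def dirty_pairs(tokens: Sequence[str], keep: Sequence[bool], k: int) -> List[Pair]:
    """Subsample first: removed tokens vanish, so farther words slide into the window."""
    kept = [w for w, flag in zip(tokens, keep) if flag]
    return fixed_pairs(kept, k)


def clean_pairs(tokens: Sequence[str], keep: Sequence[bool], k: int) -> List[Pair]:
    """Measure distances on the original sentence, then drop pairs touching a removed token."""
    pairs = []
    for t, target in enumerate(tokens):
        if not keep[t]:
            continue
        for c in range(max(0, t - k), min(len(tokens), t + k + 1)):
            if c != t and keep[c]:
                pairs.append((target, tokens[c]))
    return pairs


# Toy sentence and a hypothetical subsampling outcome ("of" and "the" removed).
tokens = ["cats", "of", "the", "world", "sleep"]
keep = [True, False, False, True, True]
print(dirty_pairs(tokens, keep, 1))
print(clean_pairs(tokens, keep, 1))
print(dynamic_pairs(tokens, 3, random.Random(0)))
```

With a window size of 1, the pair ("cats", "world") appears only under the "dirty" variant, even though the two words are three positions apart in the original sentence. This is the sense in which subsampling before extraction enlarges the effective context size.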

3.2 Aligning Monolingual Embeddings

Figure 1: BLI performance in the comparable setting. The target window size is fixed and the source window size is varied.

After training monolingual embeddings in the source and target languages, we align them with a mapping-based algorithm. To induce an alignment matrix $W$ for the source and target embeddings $x$, $y$, we use a simple supervised method of solving the Procrustes problem $\arg\min_{W} \sum_{i=1}^{m} \|W x_i - y_i\|^2$ with a training word dictionary $\{(x_i, y_i)\}_{i=1}^{m}$ (Mikolov et al., 2013b), with the orthogonality constraint on $W$, and with length normalization and mean-centering as preprocessing for the source and target embeddings (Artetxe et al., 2016).
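The alignment step described above has a well-known closed-form solution under the orthogonality constraint: if $U \Sigma V^\top$ is the singular value decomposition of $Y^\top X$, where the rows of $X$ and $Y$ are the dictionary pairs, then $W = U V^\top$. The following NumPy sketch illustrates this. The function names and the synthetic data are assumptions made for illustration, and the sketch is not the authors' implementation.

```python
import numpy as np


def preprocess(emb: np.ndarray) -> np.ndarray:
    """Length-normalize each row, then subtract the mean vector."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb - emb.mean(axis=0, keepdims=True)


def solve_procrustes(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing sum_i ||W x_i - y_i||^2.

    x and y hold the dictionary pairs as rows, each of shape (m, d).
    """
    # Closed form: with U S V^T = SVD(Y^T X), the minimizer is W = U V^T.
    u, _, vt = np.linalg.svd(y.T @ x)
    return u @ vt


# Synthetic check (hypothetical data): y is an exact rotation of x,
# so the recovered W should map x onto y.
rng = np.random.default_rng(0)
x = preprocess(rng.normal(size=(50, 5)))
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
y = x @ rotation.T  # row-wise y_i = rotation @ x_i
w = solve_procrustes(x, y)
print(np.allclose(x @ w.T, y))
```

Because $W$ is orthogonal, the mapping preserves dot products and vector norms within the source space, so it can only rotate or reflect the space; this is why the method relies on the two spaces being structurally similar.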

The word dictionaries are automatically created by using Google Translate.⁷ We translate all words in our English vocabulary into the source languages and filter out words that do not exist in the source vocabularies. We also perform this process in the opposite direction (translating from the source languages into English), and take the union of the two corresponding dictionaries. We then randomly select 5K tuples for training and 2K for testing. Although using word dictionaries automatically derived from a system is currently a common practice in this field, it should be acknowledged that this may sometimes pose problems: the generated dictionaries are noisy, and the definition of word translation is unclear (e.g., how do we handle polysemy?). This can hinder valid comparisons between systems or detailed analyses of them, and should be addressed in future research.

For each setting, we train three pairs of aligned embeddings with different random seeds in the monolingual embedding training, as training word embeddings is known to be unstable and different runs result in different nearest neighbors (Wendlandt et al., 2018). The following results are presented as averages with standard deviations.

4 Bilingual Lexicon Induction

We first evaluate the learned bilingual embeddings with bilingual lexicon induction (BLI). The task is to retrieve the target translations of source words by searching for nearest neighbors with cosine similarity in the bilingual embedding space.

The evaluation metric used in prior work is usually top-k precision, but here we use a more informative measure, mean reciprocal rank (MRR), as recommended by Glavas et al. (2019).
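The BLI evaluation described above can be sketched as follows: rank all target words by cosine similarity to each mapped source word, and average the reciprocal rank of the gold translation. This is a simplified illustration that assumes exactly one gold translation per query; the function name, argument layout, and toy vectors are assumptions.

```python
import numpy as np


def mean_reciprocal_rank(src_vecs, tgt_vecs, gold):
    """MRR of gold translations under cosine nearest-neighbor retrieval.

    src_vecs: (n, d) mapped source vectors of the query words.
    tgt_vecs: (v, d) target vectors of all candidate words.
    gold:     gold[i] is the row index in tgt_vecs of the translation of query i.
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T  # cosine similarities, shape (n, v)
    gold_sims = sims[np.arange(len(gold)), gold]
    # Rank = 1 + number of candidates scoring strictly higher than the gold word.
    ranks = 1 + (sims > gold_sims[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))


# Toy 2-d example (hypothetical vectors): the first query ranks its gold
# translation first, the second query ranks it second.
tgt_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
src_vecs = np.array([[1.0, 0.1], [0.2, 1.0]])
print(mean_reciprocal_rank(src_vecs, tgt_vecs, [0, 2]))
```

Unlike top-1 precision, which would score the second query as a plain miss, MRR still credits it with 1/2 for placing the gold translation second.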

Fixed Target Context Window Settings. First, we consider the settings where the target context size is fixed, and the source context size is configurable. This setting assumes the common situation where the embeddings of the target language are available in the form of pre-trained embeddings.

Figure 1 shows the results for the four languages. Firstly, we observe that overly small windows (1 to 3) for source embeddings do not yield good performance, probably because the model fails to train accurate word embeddings from the insufficient number of training word-context pairs that small windows capture.

At first glance, this result may seem to contradict the result of Søgaard et al. (2018). They trained English and Spanish embeddings with fasttext (Bojanowski et al., 2017) and a window size of 2, and then aligned them with an unsupervised mapping algorithm (Lample et al., 2018). When they changed the window size of the Spanish embedding to 10, they only observed a very slight drop in top-1 precision (from 81.89 to 81.28). We suspect that the discrepancy with our result is due to the different settings. First of all, fasttext adopts a dynamic window mechanism, which may obfuscate the difference in the context window. Also, they trained embeddings with full Wikipedia articles, which is an order of magnitude larger than our corpus; the fasttext algorithm, which takes into account the character n-gram information of words, can exploit a non-trivial amount of subword overlap between these quite similar languages.

Overall, we observe that the best context window size for the source embeddings increases as the target context size increases, and increasing the context sizes of both the source and target embeddings seems beneficial to the BLI performance.

Configurable Source/Target Context Window Settings. Hereafter, we present the results where both the source and target sizes are configurable and set to the same value. Figure 3 summarizes the results of the same-domain setting.

Figure 2: BLI performance for each PoS in the comparable setting.

Figure 3: BLI performance in the comparable setting.

As we expected from the observation of the settings where the target window size is fixed, the performance consistently improves as the source and target context sizes increase. Given that larger context windows tend to capture topical similarities of words, we hypothesize that the more topical the embeddings are, the easier they are to align. Topics are invariant across different languages to some extent as long as the corpora are comparable. It is natural to think that topic-oriented embeddings capture language-agnostic semantics of words and are thus easier to align across different languages.

This hypothesis is further supported by looking at the metrics for each part-of-speech (PoS). Intuitively, nouns tend to be more representative of topics than other PoS, and thus are expected to show a high correlation with the window size. Figure 2 shows the scores for each PoS.⁸ In all languages, nouns and adjectives show a stronger (almost perfect) correlation than verbs and adverbs.

⁸ We assigned to each word its most frequent PoS tag in the Brown Corpus (Kucera and Francis, 1967), following Wada et al. (2019).

Different-domain Settings. The results so far are obtained in the settings where the source and target corpora are comparable. When the corpora are comparable, it is natural that topical embeddings are easier to align, as comparable corpora share their topics. In order to see if the observations from the comparable settings hold true for different-domain settings, we also present the results from the different-domain (news) source corpora in Figure 4.

Figure 4: BLI performance in the different domain setting.

Figure 5: BLI performance for each PoS in the different domain setting.

Figure 6: BLI performance with the top 500 frequent and rare words in the comparable setting.

Firstly, compared to the same-domain settings (Figure 3), the scores are lower by around 0.1 to 0.2 points across the languages and context windows, even with the same amount of training data. This result confirms previous findings showing that domain consistency is important to the isomorphism assumption (Søgaard et al., 2018).

As to the relation between the BLI performance and the context window, we observe a similar trend to the comparable settings: increasing the context window size generally improves the performance. Figure 5 summarizes the results for each PoS. The performance on nouns and adjectives still accounts for much of the correlation with the window size. This suggests that even when the source and target domains are different, some domain-invariant topics are captured by larger-context embeddings for nouns and adjectives.

Figure 7: BLI performance on the top 500 frequent and rare words in the different domain setting.

Frequency Analysis. To gain further insight into what kind of words benefit from larger context windows, we analyze the effect of word frequency. We extract the 500 most frequent and the 500 least frequent words⁹ from the test vocabularies and evaluate the performance on each group.
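The frequency split used in this analysis can be sketched as below: sort the test words by corpus frequency and take the n most frequent and n least frequent ones. The function name and the toy counts are hypothetical; the actual analysis uses n = 500 with frequencies from the English Wikipedia subset.

```python
from typing import Dict, List, Tuple


def split_by_frequency(
    test_words: List[str], counts: Dict[str, int], n: int = 500
) -> Tuple[List[str], List[str]]:
    """Return the n most frequent and the n least frequent test words."""
    ranked = sorted(test_words, key=lambda w: counts.get(w, 0), reverse=True)
    # The two groups overlap if the test vocabulary has fewer than 2 * n words.
    return ranked[:n], ranked[-n:]


# Toy counts (hypothetical values, for illustration only).
counts = {"the": 100, "language": 40, "typology": 3, "fusional": 1}
frequent, rare = split_by_frequency(["fusional", "the", "typology", "language"], counts, n=2)
print(frequent, rare)
```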

The results of the comparable setting in eachlanguage are shown in Figure 6.

The scores for the frequent words (top 500) are notably higher than those for the rare words (bottom 500). This confirms previous empirical results that existing mapping-based methods perform significantly worse for rare words (Braune et al., 2018; Czarnowska et al., 2019).

With respect to the relation with the context size, both frequent and rare words benefit from larger window sizes, although the gain for the rare words is less obvious in some languages (Ja and Ru).

In the different-domain settings, as shown in Figure 7, the rare words, in turn, suffer from larger window sizes, especially for Fr and Ru, but the performance on frequent words still improves as the context window increases.

⁹ The frequencies were calculated from our subset of the English Wikipedia corpus.

Figure 8: Downstream evaluations in the comparable settings. SA: sentiment analysis; DC: document classification; DP: dependency parsing. The window sizes of both the source and target embeddings are varied.

We conjecture that when training a skip-gram model, frequent words observe many context words, which mitigates the effect of the irrelevant words (noise) introduced by a larger window size and results in high-quality topical embeddings; rare words, however, have to rely on a limited number of context words, and larger windows just amplify the noise and the domain difference, resulting in an inaccurate alignment.

5 Downstream Tasks

Although BLI is a common evaluation method for bilingual embeddings, good performance on BLI does not necessarily generalize to downstream tasks (Glavas et al., 2019). To gain further insight into the effect of the context size on bilingual embeddings, we evaluate the embeddings with three downstream tasks: 1) sentiment analysis; 2) document classification; 3) dependency parsing. Here, we briefly describe the dataset and model used for each task.

Sentiment Analysis (SA). We use the Webis-CLS-10 corpus¹⁰ (Prettenhofer and Stein, 2010), which comprises Amazon product reviews in four languages: English, German, French, and Japanese (no Russian data available). We cast sentiment analysis as a binary classification task, where we label reviews with scores of 1 or 2 as negative and reviews with scores of 4 or 5 as positive. For the model, we employ a simple CNN encoder followed by a multi-layer perceptron classifier.

Document Classification (DC). MLDoc¹¹ (Schwenk and Li, 2018) is compiled from the Reuters corpus for eight languages including all the languages used in this paper. The task is a four-way classification of the news article topics: Corporate/Industrial, Economics, Government/Social, and Markets. We use the same model architecture as for sentiment analysis.

Dependency Parsing (DP). We train deep biaffine parsers (Dozat and Manning, 2017) with the UD English EWT dataset¹² (Silveira et al., 2014). We use the PUD treebanks¹³ as test data.

The hyperparameters used in this experiment are shown in Appendix B.

Evaluation Setup. We evaluate, in a cross-lingual transfer setup, how well the bilingual embeddings trained with different context windows transfer lexical knowledge across languages. Here, we focus on the settings where both the source and target context sizes are varied.

For each task, we train models with our pre-trained English embeddings. We do not update the parameters of the embeddings during training. Then, we evaluate the model with the test data in the other languages available in the dataset. At test time, we feed the model with the word embeddings of the test language aligned to the training English embeddings.
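The transfer setup above amounts to keeping the task model fixed and swapping the embedding lookup table at test time. The sketch below illustrates this with a deliberately simplified model: a bag-of-embeddings encoder with a linear classifier stands in for the CNN and biaffine models used in the experiments, the classifier parameters are hand-set rather than trained, and all vectors and names are hypothetical.

```python
import numpy as np


def encode(tokens, table, dim):
    """Average the frozen vectors of in-vocabulary tokens (zeros if none are found)."""
    vectors = [table[t] for t in tokens if t in table]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


def predict(tokens, table, weights, bias):
    """Score classes with a linear layer; only weights and bias would be trained."""
    features = encode(tokens, table, weights.shape[1])
    return int(np.argmax(weights @ features + bias))


# Toy 2-d vectors (hypothetical). The French table is assumed to be already
# mapped into the English space by the alignment step.
english = {"good": np.array([1.0, 0.2]), "bad": np.array([-1.0, 0.1])}
french_aligned = {"bon": np.array([0.9, 0.3]), "mauvais": np.array([-0.8, 0.2])}

# Stand-in for parameters learned on English data: class 0 = negative, 1 = positive.
weights = np.array([[-1.0, 0.0], [1.0, 0.0]])
bias = np.zeros(2)

print(predict(["good"], english, weights, bias))
print(predict(["bon"], french_aligned, weights, bias))
print(predict(["mauvais"], french_aligned, weights, bias))
```

Because the classifier never sees the target language during training, its accuracy on the target side depends entirely on how well the aligned vectors land near their English counterparts.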

We train nine models in total for each setting with different random seeds and English embeddings, and we present their average scores and standard deviations.

Result and Discussion. The results from all three tasks are presented in Figure 8.

For sentiment analysis and document classification, we observe a similar trend where the best window size is around 3 to 5 for the source English task, but for the test languages, larger context windows achieve better results. The only deviation is Japanese document classification, where the score does not show a significant correlation with the window size. We attribute this to low-quality alignments due to the large typological difference between English and Japanese, which can be confirmed by the fact that the Japanese scores are the lowest across the board.

For dependency parsing, embeddings with smaller context windows perform better in the source English task, which is consistent with the observation that smaller context windows tend to produce syntax-oriented embeddings (Levy and Goldberg, 2014a). However, the performance of the small-window embeddings does not transfer to the test languages. The best context window for the English development data (size 1) performs the worst for all the test languages, and the transferred accuracy seems to benefit from larger context sizes, although it does not always correlate with the window size. This observation highlights the difficulty of transferring syntactic knowledge across languages. Word embeddings trained with small windows capture more grammatical aspects of words in each language; because different languages have different grammars, this makes the source and target embedding spaces so different that they are difficult to align.

In summary, a general trend we observe here is that good context windows in the source language task do not necessarily produce good transferable bilingual embeddings. In practice, it seems better to choose a context window that aligns the source and target well, rather than using the window size that just performs the best for the source language.

6 Conclusion and Future Work

Despite their obvious connection, the relation between the choice of context window and the structural similarity of two embedding spaces has not been fully investigated in prior work. In this study, we have offered the first thorough empirical results on the relation between the context window size and bilingual embeddings, and shed new light on the properties of bilingual embeddings. In summary, we have shown that:

• larger context windows for both the source and target facilitate the alignment of words, especially nouns.

• for cross-lingual transfer, the best context window for the source task is often not the best for the test languages. Especially for dependency parsing, the smallest context size produces the best result for the source task, but performs the worst for the test languages.

We hope that our study will provide insights into ways to improve cross-lingual embeddings through not only mapping methods but also the properties of monolingual embedding spaces.


