
Low-Resource Parsing with Crosslingual Contextualized Representations

Phoebe Mulcaire♥∗ Jungo Kasai♥∗ Noah A. Smith♥♦

♥Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
♦Allen Institute for Artificial Intelligence, Seattle, WA, USA
{pmulc,jkasai,nasmith}@cs.washington.edu

Abstract

Despite advances in dependency parsing, languages with small treebanks still present challenges. We assess recent approaches to multilingual contextual word representations (CWRs), and compare them for crosslingual transfer from a language with a large treebank to a language with a small or nonexistent treebank, by sharing parameters between languages in the parser itself. We experiment with a diverse selection of languages in both simulated and truly low-resource scenarios, and show that multilingual CWRs greatly facilitate low-resource dependency parsing even without crosslingual supervision such as dictionaries or parallel text. Furthermore, we examine the non-contextual part of the learned language models (which we call a "decontextual probe") to demonstrate that polyglot language models better encode crosslingual lexical correspondence compared to aligned monolingual language models. This analysis provides further evidence that polyglot training is an effective approach to crosslingual transfer.

1 Introduction

Dependency parsing has achieved new states of the art using distributed word representations in neural networks, trained with large amounts of annotated data (Dozat and Manning, 2017; Dozat et al., 2017; Ma et al., 2018; Che et al., 2018). However, many languages are low-resource, with small or no treebanks, which presents a severe challenge in developing accurate parsing systems in those languages. One way to address this problem is with a crosslingual solution that makes use of a language with a large treebank and raw text in both languages. The hypothesis behind this approach is that, although each language is unique, different languages manifest similar characteristics (e.g., morphological, lexical, syntactic) which can be exploited by training a single polyglot model with data from multiple languages (Ammar, 2016).

∗ Equal contribution. Random order.

Recent work has extended contextual word representations (CWRs) multilingually either by training a polyglot language model (LM) on a mixture of data from multiple languages (joint training approach; Mulcaire et al., 2019; Lample and Conneau, 2019) or by aligning multiple monolingual language models crosslingually (retrofitting approach; Schuster et al., 2019; Aldarmaki and Diab, 2019). These multilingual representations have been shown to facilitate crosslingual transfer on several tasks, including Universal Dependencies parsing and natural language inference. In this work, we assess these two types of methods by using them for low-resource dependency parsing, and discover that the joint training approach substantially outperforms the retrofitting approach. We further apply multilingual CWRs produced by the joint training approach to diverse languages, and show that it is still effective in transfer between distant languages, though we find that phylogenetically related source languages are generally more helpful.

We hypothesize that joint polyglot training is more successful than retrofitting because it induces a degree of lexical correspondence between languages that the linear transformation used in retrofitting methods cannot capture. To test this hypothesis, we design a decontextual probe. We decontextualize CWRs into non-contextual word vectors that retain much of CWRs' task-performance benefit, and evaluate the crosslingual transferability of language models via word translation. In our decontextualization framework, we use a single LSTM cell without recurrence to obtain a context-independent vector, thereby allowing for a direct probe into the LSTM networks independent of a particular corpus.


We show that decontextualized vectors from the joint training approach yield representations that score higher on a word translation task than the retrofitting approach or word type vectors such as fastText (Bojanowski et al., 2017). This finding provides evidence that polyglot language models encode crosslingual similarity, specifically crosslingual lexical correspondence, that a linear alignment between monolingual language models does not.

2 Models

We examine crosslingual solutions to low-resource dependency parsing, which make crucial use of multilingual CWRs. All models are implemented in AllenNLP, version 0.7.2 (Gardner et al., 2018), and the hyperparameters and training details are given in the appendix.

2.1 Multilingual CWRs

Prior methods to produce multilingual contextual word representations (CWRs) can be categorized into two major classes, which we call joint training and retrofitting.[1] The joint training approach trains a single polyglot language model (LM) on a mixture of texts in multiple languages (Mulcaire et al., 2019; Lample and Conneau, 2019; Devlin et al., 2019),[2] while the retrofitting approach trains separate LMs on each language and aligns the learned representations later (Schuster et al., 2019; Aldarmaki and Diab, 2019). We compare example approaches from these two classes using the same LM training data, and discover that the joint training approach generally yields better performance in low-resource dependency parsing, even without crosslingual supervision.

[1] This term was originally used by Faruqui et al. (2015) to describe updates to word vectors, after estimating them from corpora, using semantic lexicons. We generalize it to capture the notion of a separate update to fit something other than the original data, applied after conventional training.

[2] Multilingual BERT is documented in https://github.com/google-research/bert/blob/master/multilingual.md.

Retrofitting Approach  Following Schuster et al. (2019), we first train a bidirectional LM with two-layer LSTMs on top of character CNNs for each language (ELMo, Peters et al., 2018), and then align the monolingual LMs across languages. Denote the hidden state in the jth layer for word i in context c by h^(j)_{i,c}. We use a trainable weighted average of the three layers (character CNN and two LSTM layers) to compute the contextual representation e_{i,c} for the word: e_{i,c} = Σ_{j=0}^{2} λ_j h^(j)_{i,c} (Peters et al., 2018).[3] In the first step, we compute an "anchor" for each word by averaging h^(j)_{i,c} over all occurrences in an LM corpus. We then apply a standard dictionary-based technique[4] to create multilingual word embeddings (Mikolov et al., 2013; Conneau et al., 2018). In particular, suppose that we have a word-translation dictionary from source language s to target language t. Let H^(j)_s, H^(j)_t be matrices whose columns are the anchors in the jth layer for the source and corresponding target words in the dictionary. For each layer j, find the linear transformation

    W*^(j) = argmin_W ‖W H^(j)_s − H^(j)_t‖_F

The linear transformations are then used to map the LM hidden states for the source language to the target LM space. Specifically, contextual representations for the source and target languages are computed by Σ_{j=0}^{2} λ_j W*^(j) h^(j)_{i,c} and Σ_{j=0}^{2} λ_j h^(j)_{i,c} respectively. We use publicly available dictionaries from Conneau et al. (2018)[5] and align all languages to the English LM space, again following Schuster et al. (2019).
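To make the alignment step concrete, here is a minimal numpy sketch of the dictionary-based procedure above, assuming anchors have already been computed by averaging each word's contextual vectors over the LM corpus; one such transformation is fit per layer j. The function names and data layout are illustrative, not taken from the released code.

```python
import numpy as np

def compute_anchor(contextual_vectors):
    """Average a word's contextual vectors over all occurrences in the LM
    corpus to obtain a single context-independent anchor vector."""
    return np.mean(np.stack(contextual_vectors), axis=0)

def fit_alignment(src_anchors, tgt_anchors, dictionary):
    """Learn W minimizing ||W H_s - H_t||_F over dictionary word pairs.

    src_anchors, tgt_anchors: dict mapping word -> anchor vector (1D array)
    dictionary: list of (source_word, target_word) translation pairs
    Returns a (d x d) matrix W mapping source vectors into the target space.
    """
    pairs = [(s, t) for s, t in dictionary
             if s in src_anchors and t in tgt_anchors]
    H_s = np.stack([src_anchors[s] for s, _ in pairs], axis=1)  # d x n
    H_t = np.stack([tgt_anchors[t] for _, t in pairs], axis=1)  # d x n
    # Least-squares solution of W H_s ~= H_t, transposed into the standard
    # form H_s^T W^T ~= H_t^T for np.linalg.lstsq.
    W_T, *_ = np.linalg.lstsq(H_s.T, H_t.T, rcond=None)
    return W_T.T

def map_to_target_space(W, src_hidden_state):
    """Apply the learned alignment to a source-language LM hidden state."""
    return W @ src_hidden_state
```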

Joint Training Approach  Another approach to multilingual CWRs is to train a single LM on multiple languages (Tsvetkov et al., 2016; Ragni et al., 2016; Ostling and Tiedemann, 2017). We train a single bidirectional LM with character CNNs and two-layer LSTMs on multiple languages (Rosita, Mulcaire et al., 2019). We then use the polyglot LM to provide contextual representations. Similarly to the retrofitting approach, we represent word i in context c as a trainable weighted average of the hidden states in the trained polyglot LM: Σ_{j=0}^{2} λ_j h^(j)_{i,c}. In contrast to retrofitting, crosslinguality is learned implicitly by sharing all network parameters during LM training; no crosslingual dictionaries are used.

[3] Schuster et al. (2019) only used the first LSTM layer, but we found a performance benefit from using all layers in preliminary results.

[4] Conneau et al. (2018) developed an unsupervised alignment technique that does not require a dictionary. We found that their unsupervised alignment yielded substantially degraded performance in downstream parsing, in line with the findings of Schuster et al. (2019).

[5] https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries
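As an illustration of the trainable weighted average used in both approaches above, here is a minimal PyTorch sketch of a scalar mix over the three LM layers. The softmax normalization and the global scale follow the ELMo convention; the paper only states that the average is trainable, so treat those details as assumptions.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Trainable weighted average e_{i,c} = sum_j lambda_j h^(j)_{i,c}
    over the three LM layers (character CNN + two LSTM layers)."""

    def __init__(self, num_layers: int = 3):
        super().__init__()
        # One scalar per layer; normalized with a softmax at mix time
        # (assumption: ELMo-style normalization plus a global scale gamma).
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states):
        # layer_states: list of [batch, seq_len, dim] tensors, one per layer.
        lambdas = torch.softmax(self.weights, dim=0)
        mixed = sum(lam * h for lam, h in zip(lambdas, layer_states))
        return self.gamma * mixed
```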

Page 3: Low-Resource Parsing with Crosslingual Contextualized

Refinement after Joint Training  It is possible to combine the two approaches above; the alignment procedure used in the retrofitting approach can serve as a refinement step on top of an already-polyglot language model. We will see only a limited gain in parsing performance from this refinement in our experiments, suggesting that polyglot LMs are already producing high-quality multilingual CWRs even without crosslingual dictionary supervision.

FastText Baseline  We also compare the multilingual CWRs to a subword-based, non-contextual word embedding baseline. We train 300-dimensional word vectors on the same LM data using the fastText method (Bojanowski et al., 2017), and use the same bilingual dictionaries to align them (Conneau et al., 2018).

2.2 Dependency Parsers

We train polyglot parsers for multiple languages (Ammar et al., 2016) on top of multilingual CWRs. All parser parameters are shared between the source and target languages. Ammar et al. (2016) suggest that sharing parameters between languages can alleviate the low-resource problem in syntactic parsing, but their experiments are limited to (relatively similar) European languages. Mulcaire et al. (2019) also include experiments with dependency parsing using polyglot contextual representations between two language pairs (English/Chinese and English/Arabic), but focus on high-resource tasks. Here we explore a wider range of languages, and analyze the particular efficacy of a crosslingual approach to dependency parsing in a low-resource setting.

We use a strong graph-based dependency parser with BiLSTM and biaffine attention (Dozat and Manning, 2017), which is also used in related work (Schuster et al., 2019; Mulcaire et al., 2019). Crucially, our parser takes only word representations as input. Universal parts of speech have been shown useful for low-resource dependency parsing (Duong et al., 2015; Ammar et al., 2016; Ahmad et al., 2019), but many realistic low-resource scenarios lack reliable part-of-speech taggers; here, we do not use parts of speech as input, and thus avoid the error-prone part-of-speech tagging pipeline. For the fastText baseline, word embeddings are not updated during training, to preserve crosslingual alignment (Ammar et al., 2016).
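For readers unfamiliar with the parser, the following is a minimal PyTorch sketch of biaffine arc scoring in the style of Dozat and Manning (2017), not the exact AllenNLP implementation we use; the arc MLP size follows Table 8 in the appendix, and the other dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Score every (head, dependent) pair from BiLSTM states."""

    def __init__(self, lstm_dim: int = 800, arc_dim: int = 500):
        # lstm_dim = 800 assumes a bidirectional LSTM with 400 units per
        # direction (Table 8); arc_dim = 500 is the arc MLP size.
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(lstm_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(lstm_dim, arc_dim), nn.ReLU())
        # Biaffine term s_ij = h_i^T U d_j plus a bias on the head side.
        self.U = nn.Parameter(torch.randn(arc_dim, arc_dim) / arc_dim ** 0.5)
        self.b = nn.Parameter(torch.zeros(arc_dim))

    def forward(self, states):
        # states: [batch, seq_len, lstm_dim] BiLSTM output over the sentence.
        heads = self.head_mlp(states)          # [batch, n, arc_dim]
        deps = self.dep_mlp(states)            # [batch, n, arc_dim]
        bilinear = torch.einsum("bia,ac,bjc->bij", heads, self.U, deps)
        bias = torch.einsum("bia,a->bi", heads, self.b).unsqueeze(2)
        return bilinear + bias                 # [batch, n, n] arc scores
```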

Lang        Code  Genus       WALS 81A  Size
English     ENG   Germanic    SVO       –
Arabic      ARA   Semitic     VSO/SVO   5241
Hebrew      HEB   Semitic     SVO       5241
Croatian    HRV   Slavic      SVO       6983
Russian     RUS   Slavic      SVO       6983
Dutch       NLD   Germanic    SOV/SVO   12269
German      DEU   Germanic    SOV/SVO   12269
Spanish     SPA   Romance     SVO       12543
Italian     ITA   Romance     SVO       12543
Chinese     CMN   Chinese     SVO       3997
Japanese    JPN   Japanese    SOV       3997
Hungarian   HUN   Ugric       SOV/SVO   910
Finnish     FIN   Finnic      SVO       12217
Vietnamese  VIE   Viet-Muong  SVO       1400
Uyghur      UIG   Turkic      SOV       1656
Kazakh      KAZ   Turkic      SOV       31
Turkish     TUR   Turkic      SOV       3685

Table 1: List of the languages used in our UD v2.2 experiments. Languages are grouped into pairs of "related" languages; within each simulation pair, "Size" (the downsampled size in # of sentences used for source treebanks) is shared. WALS 81A denotes Feature 81A in WALS, Order of Subject, Object, and Verb (Dryer and Haspelmath, 2013). Hungarian, Vietnamese, Uyghur, and Kazakh are truly low-resource languages (< 2000 trees).

3 Experiments

We first conduct a set of experiments to assess the efficacy of multilingual CWRs for low-resource dependency parsing.

3.1 Zero-Target Dependency Parsing

Following prior work on low-resource dependency parsing and crosslingual transfer (Zhang and Barzilay, 2015; Guo et al., 2015; Ammar et al., 2016; Schuster et al., 2019), we conduct multi-source experiments on six languages (German, Spanish, French, Italian, Portuguese, and Swedish) from the Google universal dependency treebank version 2.0 (McDonald et al., 2013).[6] We train language models on the six languages and English to produce multilingual CWRs. For each tested language, we train a polyglot parser with the multilingual CWRs on the five other languages and English, and apply the parser to the test data for the target language. Importantly, the parsing annotation scheme is shared among the seven languages. Our results will show that the joint training approach for CWRs substantially outperforms the retrofitting approach.

[6] http://github.com/ryanmcd/uni-dep-tb


3.2 Diverse Low-Resource Parsing

The previous experiment compares the joint training and retrofitting approaches in low-resource dependency parsing only for relatively similar languages. In order to study the effectiveness more extensively, we apply it to a more typologically diverse set of languages. We use five pairs of languages for "low-resource simulations," in which we reduce the size of a large treebank, and four languages for "true low-resource experiments," where only small UD treebanks are available, allowing us to compare to other work in the low-resource condition (Table 1). Following de Lhoneux et al. (2018), we selected these language pairs to represent linguistic diversity. For each target language, we produce multilingual CWRs by training a polyglot language model with its related language (e.g., Arabic and Hebrew) as well as English (e.g., Arabic and English). We then train a polyglot dependency parser on each language pair and assess the crosslingual transfer in terms of target parsing accuracy.

Each pair of related languages shares features like word order, morphology, or script. For example, Arabic and Hebrew are similar in their rich transfixing morphology (de Lhoneux et al., 2018), and Dutch and German share most of their word order features. We chose Chinese and Japanese as an example of a language pair which does not share a language family but does share characters.

We chose Hungarian, Vietnamese, Uyghur, and Kazakh as true low-resource target languages because they had comparatively small amounts of annotated text in the UD corpus (Vietnamese: 1,400 sentences, 20,285 tokens; Hungarian: 910 sentences, 20,166 tokens; Uyghur: 1,656 sentences, 19,262 tokens; Kazakh: 31 sentences, 529 tokens), yet had convenient sources of text for LM pretraining (Zeman et al., 2018).[7] Other small treebanks exist, but in most cases another larger treebank exists for the same language, making domain adaptation a more likely option than crosslingual transfer. Also, recent work (Che et al., 2018) using contextual embeddings was top-ranked for most of these languages in the CoNLL 2018 shared task on UD parsing (Zeman et al., 2018).[8]

[7] The one exception is Uyghur, where we only have 3M words in the raw LM data from Zeman et al. (2018).

[8] In Kazakh, Che et al. (2018) did not use CWRs due to the extremely small treebank size.

We use the same Universal Dependencies (UD) treebanks (Nivre et al., 2018) and train/development/test splits as the CoNLL 2018 shared task (Zeman et al., 2018).[9] The annotation scheme is again shared across languages, which facilitates crosslingual transfer. For each triple of two related languages and English, we downsample training and development data to match the language with the smallest treebank size. This allows for fairer comparisons because within each triple, the source language for any parser will have the same amount of training data. We further downsample sentences from the target train/development data to simulate low-resource scenarios. The ratio of training and development data is kept 5:1 throughout the simulations, and we denote the number of sentences in training data by |Dτ|. For testing, we use the CoNLL 2018 script on the gold word segmentations. For the truly low-resource languages, we also present results with word segmentations from the system outputs of Che et al. (2018) (HUN, VIE, UIG) and Smith et al. (2018) (KAZ) for a direct comparison to those languages' best previously reported parsers.[10]
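A small sketch of this downsampling protocol follows, under our reading that the target train and development subsets are sampled at a 5:1 ratio; the helper name and the random seed are ours, not from the released code.

```python
import random

def downsample(train_sents, dev_sents, num_train, seed=0, dev_ratio=5):
    """Subsample a treebank, keeping the train:dev ratio at 5:1.

    train_sents, dev_sents: lists of parsed sentences
    num_train: |D_tau|, the simulated number of training sentences
    """
    rng = random.Random(seed)
    train = rng.sample(train_sents, min(num_train, len(train_sents)))
    num_dev = max(1, num_train // dev_ratio)
    dev = rng.sample(dev_sents, min(num_dev, len(dev_sents)))
    return train, dev

# Example: simulate the |D_tau| = 100 low-resource condition for a target
# treebank (placeholder variable names).
# target_train_100, target_dev_20 = downsample(hrv_train, hrv_dev, num_train=100)
```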

4 Results and Discussion

In this section we describe the results of the various parsing experiments.

4.1 Zero-Target Parsing

Table 2 shows results on zero-target dependency parsing. First, we see that all CWRs greatly improve upon the fastText baseline. The joint training approach (Rosita), which uses no dictionaries, consistently outperforms the dictionary-dependent retrofitting approach (ELMos + Alignment). As discussed in the previous section, we can apply the alignment method to refine the already-polyglot Rosita using dictionaries. However, we observe a relatively limited gain in overall performance (74.5 vs. 73.9 LAS points), suggesting that Rosita (the polyglot language model) is already developing useful multilingual CWRs for parsing without crosslingual supervision. Note that the degraded overall performance of our ELMos + Alignment compared to Schuster et al. (2019)'s reported results (71.2 vs. 73.1) is likely due to the significantly reduced amount of LM data we used in all of our experiments (50M words per language, an order of magnitude reduction from the full Wikipedia dumps used in Schuster et al. (2019)).

[9] See the Appendix for a list of UD treebanks used.

[10] System outputs for all shared task systems are available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2885


Model                                                   DEU   SPA   FRA   ITA   POR   SWE   AVG
Schuster et al. (2019) (retrofitting)                   61.4  77.5  77.0  77.6  73.9  71.0  73.1
Schuster et al. (2019) (retrofitting, no dictionaries)  61.7  76.6  76.3  77.1  69.1  54.2  69.2
fastText + Alignment                                    45.2  68.5  62.8  58.9  61.1  50.4  57.8
ELMos + Alignment (retrofitting)                        57.3  75.4  73.7  71.6  75.1  74.2  71.2
Rosita (joint training, no dictionaries)                58.0  81.8  75.6  74.8  77.1  76.2  73.9
Rosita + Refinement (joint training + retrofitting)     61.7  79.7  75.8  76.0  76.8  76.7  74.5

Table 2: Zero-target results in LAS. Results reported in prior work (the first two rows) use an unknown amount of LM training data; all other models are limited to approximately 50M words per language.

Model                                                   DEU   SPA   FRA   ITA   POR   SWE   AVG
Zhang and Barzilay (2015)                               54.1  68.3  68.8  69.4  72.5  62.5  65.9
Guo et al. (2016)                                       55.9  73.1  71.0  71.2  78.6  69.5  69.9
Ammar et al. (2016)                                     57.1  74.6  73.9  72.5  77.0  68.1  70.5
Schuster et al. (2019) (retrofitting)                   65.2  80.0  80.8  79.8  82.7  75.4  77.3
Schuster et al. (2019) (retrofitting, no dictionaries)  64.1  77.8  79.8  79.7  79.1  69.6  75.0
Rosita (joint training, no dictionaries)                63.6  83.4  78.9  77.8  83.0  79.6  77.7
Rosita + Refinement (joint training + retrofitting)     64.8  82.1  78.7  78.8  84.1  79.1  77.9

Table 3: Zero-target results in LAS with gold UPOS.

Schuster et al. (2019) (no dictionaries) is the same retrofitting approach as ELMos + Alignment except that the transformation matrices are learned in an unsupervised fashion without dictionaries (Conneau et al., 2018). The absence of a dictionary yields much worse performance (69.2 vs. 73.1), in contrast with the joint training approach of Rosita, which also does not use a dictionary (73.9).

We also present results using gold universal part of speech to compare to previous work in Table 3. We again see Rosita's effectiveness and a marginal benefit from refinement with dictionaries. It should also be noted that the reported results for French, Italian, and German in Schuster et al. (2019) outperform all results from our controlled comparison; this may be due to the use of abundant LM training data. Nevertheless, joint training, with or without refinement, performs best on average in both gold and predicted POS settings.

4.2 Diverse Low-Resource Parsing

Low-Resource Simulations  Figure 1 shows simulated low-resource results.[11] Of greatest interest are the significant improvements over monolingual parsers when adding English or related-language data. This improvement is consistent across languages and suggests that crosslingual transfer is a viable solution for a wide range of languages, even when (as in our case) language-specific tuning or annotated resources like parallel corpora or bilingual dictionaries are not available.

[11] A table with full details including different size simulations is provided in the appendix.

[Figure 1: LAS for UD parsing results in a simulated low-resource setting where the size of the target language treebank (|Dτ|) is set to 100 sentences. Bars compare mono., +ENG, and +rel. configurations for ARA, HEB, HRV, RUS, NLD, DEU, SPA, ITA, CMN, and JPN.]

See Figure 2 for a visualization of the differences in performance with varying training size. The polyglot advantage is minor when the target language treebank is large, but dramatic in the condition where the target language has only 100 sentences. The fastText approaches consistently underperform the language model approaches, but show the same pattern.

In addition, related-language polyglot ("+rel.") outperforms English polyglot in most cases in the low-resource condition. The exceptions to this pattern are Italian (whose treebank is of a different genre from the Spanish one), and Japanese and Chinese, which differ significantly in morphology and word order. The CMN/JPN result suggests that such typological features influence the degree of crosslingual transfer more than orthographic properties like shared characters.

Page 6: Low-Resource Parsing with Crosslingual Contextualized

[Figure 2: Plots of parsing performance (LAS) vs. target language treebank size |Dτ| for Hebrew, Russian, and German, comparing mono. (ELMo), +ENG (Rosita), +rel. (Rosita), fastText mono., fastText +ENG, and fastText +rel. The size-0 target treebank point indicates a parser trained only on the source language treebank but with polyglot representations, allowing transfer to the target test treebank using no target language training trees. See the Appendix for results with zero-target-treebank and intermediate size data (|Dτ| ∈ {0, 100, 500, 1000}) for all languages.]

This result in crosslingual transfer also mirrors the observation from prior work (Gerz et al., 2018) that typological features of the language are predictive of monolingual LM performance. The related-language improvement also vanishes in the full-data condition (Figure 2), implying that the importance of shared linguistic features can be overcome with sufficient annotated data. It is also noteworthy that variations in word order, such as the order of adjective and noun, do not affect performance: Italian, Arabic, and others use a noun-adjective order while English uses an adjective-noun order, but their +ENG and +rel. results are comparable.

The Croatian and Russian results are notable because of shared heritage but different scripts. Though Croatian uses the Latin alphabet and Russian uses Cyrillic, transfer between HRV+RUS is clearly more effective than HRV+ENG (82.00 vs. 79.21 LAS points when |Dτ| = 100). This suggests that character-based LMs can implicitly learn to transliterate between related languages with different scripts, even without parallel supervision.

Truly Low Resource Languages  Finally we present "true low-resource" experiments for four languages in which little UD data is available (see Section 3.2). Table 4 shows these results. Consistent with our simulations, our parsers on top of Rosita (multilingual CWRs from the joint training approach) substantially outperform the parsers with ELMos (monolingual CWRs) in all languages, and establish a new state of the art in Hungarian, Vietnamese, and Kazakh.

Model                                gold   pred.
Hungarian (HUN)
Che et al. (2018) (HUN, ensemble)      –    82.66
Che et al. (2018) (HUN)                –    80.96
ELMo (HUN)                           81.89  81.54
Rosita (HUN +ENG)                    85.34  84.89
Rosita (HUN +FIN)                    85.40  84.96
Vietnamese (VIE)
Che et al. (2018) (VIE, ensemble)      –    55.22
ELMo (VIE)                           62.67  55.72
Rosita (VIE +ENG)                    63.07  56.42
Uyghur (UIG)
Che et al. (2018) (UIG, ensemble)      –    67.05
Che et al. (2018) (UIG)                –    66.20
ELMo (UIG)                           66.64  63.98
Rosita (UIG +ENG)                    67.85  65.55
Rosita (UIG +TUR)                    68.08  65.73
Kazakh (KAZ)
Rosa and Marecek (2018) (KAZ +TUR)     –    26.31
Smith et al. (2018) (KAZ +TUR)         –    31.93
Schuster et al. (2019) (KAZ +TUR)      –    36.98
Rosita (KAZ +ENG)                    48.02  46.03
Rosita (KAZ +TUR)                    53.98  51.96

Table 4: LAS (F1) comparison for truly low-resource languages. The gold and pred. columns show results under gold segmentation and predicted segmentation. The languages in parentheses indicate the languages used in parser training.


Consistent with our simulations, we see that training parsers with the target's related language is more effective than with the more distant language, English. It is particularly noteworthy that the Rosita models, which do not use a parallel corpus or dictionary, dramatically improve over the best previously reported result from Schuster et al. (2019) when either the related language of Turkish (51.96 vs. 36.98) or even the more distant language of English (46.03 vs. 36.98) is used. Schuster et al. (2019) aligned the monolingual ELMos for Kazakh and Turkish using the KAZ-TUR dictionary that Rosa and Marecek (2018) derived from parallel text. This result further corroborates our finding that the joint training approach to multilingual CWRs is more effective than retrofitting monolingual LMs.

4.3 Comparison to Multilingual BERT Embeddings

We also evaluate the diverse low-resource language pairs using pretrained multilingual BERT (Devlin et al., 2019) as text embeddings (Figure 3). Here, the same language model (multilingual cased BERT,[12] covering 104 languages) is used for all parsers, with the only variation being in the training treebanks provided to each parser. Parsers are trained using the same hyperparameters and data as in Section 3.2.[13]

There are two critical differences from our previous experiments: multilingual BERT is trained on much larger amounts of Wikipedia data compared to other LMs used in this work, and the WordPiece vocabulary (Wu et al., 2016) used in the cased multilingual BERT model has been shown to have a distribution skewed toward Latin alphabets (Acs, 2019). These results are thus not directly comparable to those in Figure 1; nevertheless, it is interesting to see that the results obtained with ELMo-like LMs are comparable to and in some cases better than results using a BERT model trained on over a hundred languages. Our results broadly fit with those of Pires et al. (2019), who found that multilingual BERT was useful for zero-shot crosslingual syntactic transfer. In particular, we find nearly no performance benefit from cross-script transfer using BERT in a language pair (English-Japanese) for which they reported poor performance in zero-shot transfer, contrary to our results using Rosita (Section 4.2).

[12] Available at https://github.com/google-research/bert/

[13] AllenNLP version 0.9.0 was used for these experiments.

[Figure 3: LAS for UD parsing results in a simulated low-resource setting (|Dτ| = 100) using multilingual BERT embeddings in place of Rosita, comparing mono., +ENG, and +rel. configurations for ARA, HEB, HRV, RUS, NLD, DEU, SPA, ITA, CMN, and JPN. Cf. Figure 1.]

5 Decontextual Probe

We saw the success of joint polyglot training for multilingual CWRs over the retrofitting approach in the previous section. We hypothesize that CWRs from joint training provide useful representations for parsers by inducing nonlinear similarity in the vector spaces of different languages that we cannot retrieve with a simple alignment of monolingual pretrained language models. In order to test this hypothesis, we conduct a decontextual probe comprised of two steps. The decontextualization step effectively distills CWRs into word type vectors, where each unique word is mapped to exactly one embedding regardless of the context. We then conduct linear transformation-based word translation (Mikolov et al., 2013) on the decontextualized vectors to quantify the degree of crosslingual similarity in the multilingual CWRs.

5.1 Decontextualization

Recall from Section 2 that we produce CWRs from bidirectional LMs with character CNNs and two-layer LSTMs. We propose a method to remove the dependence on context c for the two LSTM layers (the CNN layer is already context-independent by design). During LM training, the hidden states h_t of each layer are computed by the standard LSTM equations:

    i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
    f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
    c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
    o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)


Representations       UD     SRL    NER
GloVe                 83.78  80.01  83.90
fastText              83.93  80.27  83.40
Decontextualization   86.88  81.41  87.72
ELMo                  88.71  82.12  88.65

Table 5: Context-independent vs. context-dependent performance in English. All embeddings are 512-dimensional and trained on the same English corpus of approximately 50M tokens for fair comparisons. We also concatenate 128-dimensional character LSTM representations with the word vectors in every configuration to ensure all models have character input. UD scores are LAS, and SRL and NER are F1.

We produce contextless vectors from pretrained LMs by removing recursion in the computation (i.e., setting h_{t−1} and c_{t−1} to 0):

    i_t = σ(W_i x_t + b_i)
    f_t = σ(W_f x_t + b_f)
    c̃_t = tanh(W_c x_t + b_c)
    o_t = σ(W_o x_t + b_o)
    c_t = i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)

This method is fast to compute, as it does not require recurrent computation and only needs to see each word once. This way, each word is associated with a set of exactly three vectors from the three layers.
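Concretely, the decontextualization of a single LSTM layer can be sketched as follows in numpy, with placeholder names for the pretrained gate parameters: with the recurrent inputs zeroed, each word's vector depends only on its context-independent input x_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decontextualize(x, params):
    """Run one LSTM layer on a single word with zeroed recurrence.

    x: context-independent input vector for the word (the character-CNN
       representation, or the decontextualized output of the previous layer).
    params: dict with the pretrained gate weights W_i, W_f, W_c, W_o and
            biases b_i, b_f, b_c, b_o (placeholder keys; the recurrent U_*
            terms vanish because h_{t-1} = 0).
    """
    i = sigmoid(params["W_i"] @ x + params["b_i"])        # input gate
    o = sigmoid(params["W_o"] @ x + params["b_o"])        # output gate
    c_tilde = np.tanh(params["W_c"] @ x + params["b_c"])  # candidate cell
    c = i * c_tilde                                        # c_{t-1} = 0 drops f ⊙ c_{t-1}
    return o * np.tanh(c)                                  # h_t

# Applying this to the character-CNN output and then to each of the two LSTM
# layers yields exactly three context-independent vectors per word type.
```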

Performance of decontextualized vectors  We perform a brief experiment to find what information is successfully retained by the decontextualized vectors, by using them as inputs to three tasks (in a monolingual English setting, for simplicity). For Universal Dependencies (UD) parsing, semantic role labeling (SRL), and named entity recognition (NER), we used the standard train/development/test splits from UD English EWT (Zeman et al., 2018) and OntoNotes (Pradhan et al., 2013). Following Mulcaire et al. (2019), we use strong existing neural models for each task: Dozat and Manning (2017) for UD parsing, He et al. (2017) for SRL, and Peters et al. (2017) for NER.

Table 5 compares the decontextualized vectors with the original CWRs (ELMo) and the conventional word type vectors, GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). In all three tasks, the decontextualized vectors substantially improve over fastText and GloVe vectors, and perform nearly on par with contextual ELMo.

Vector             DEU   SPA   FRA   ITA   POR   SWE
fastText           31.6  54.8  56.7  50.2  55.5  43.9
ELMos   Layer 0    19.7  41.5  41.1  36.9  44.6  27.5
        Layer 1    24.4  46.4  47.6  44.2  48.3  36.3
        Layer 2    19.9  40.5  41.9  38.1  42.5  30.9
Rosita  Layer 0    37.9  56.6  58.2  57.5  56.6  50.6
        Layer 1    40.3  56.3  57.2  58.1  56.5  53.7
        Layer 2    38.8  51.1  52.7  53.6  50.7  50.8

Table 6: Crosslingual alignment results (precision at 1) from the decontextual probe. Layers 0, 1, and 2 denote the character CNN, first LSTM, and second LSTM layers in the language models, respectively.

This suggests that while part of the advantage of CWRs is in the incorporation of context, they also benefit from rich context-independent representations present in deeper networks.

5.2 Word Translation Test

Given the decontextualized vectors from each layer of the bidirectional language models, we can measure the crosslingual lexical correspondence in the multilingual CWRs by performing word translation. Concretely, suppose that we have training and evaluation word translation pairs from the source to the target language. Using the same word alignment objective discussed in Section 2.1, we find a linear transform by aligning the decontextualized vectors for the training source-target word pairs. Then, we apply this linear transform to the decontextualized vector for each source word in the evaluation pairs. The closest target vector is found using the cross-domain similarity local scaling (CSLS) measure (Conneau et al., 2018), which is designed to remedy the hubness problem (where a few "hub" points are nearest neighbors to many other points) in word translation by normalizing the cosine similarity according to the degree of hubness.

We again take the dictionaries from Conneau et al. (2018) with the given train/test split, and always use English as the target language. For each language, we take all words that appear three times or more in our LM training data and compute decontextualized vectors for them. Word translation is evaluated by choosing the closest vector among the English decontextualized vectors.
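A numpy sketch of the retrieval step is given below, using the standard CSLS definition CSLS(Wx, y) = 2 cos(Wx, y) − r_T(Wx) − r_S(y), where r(·) is the mean cosine similarity to the k nearest neighbors in the other language (k = 10 in Conneau et al., 2018). Here the neighborhoods are computed over the given vocabularies, and the map W is assumed to come from the same least-squares alignment as in Section 2.1; this is an illustration, not the exact evaluation script.

```python
import numpy as np

def csls_translate(src_vecs, tgt_vecs, W, k=10):
    """Pick, for each source word, the target word with the highest CSLS score.

    src_vecs: [n_src, d] decontextualized source vectors (evaluation words)
    tgt_vecs: [n_tgt, d] decontextualized English vectors
    W:        [d, d] linear map from the source to the target space
    """
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    src = normalize(src_vecs @ W.T)      # map into the target space
    tgt = normalize(tgt_vecs)
    sims = src @ tgt.T                   # cosine similarities, [n_src, n_tgt]

    # r_T(Wx): mean similarity of each mapped source word to its k nearest
    # target neighbors; r_S(y): the symmetric quantity for each target word.
    r_src = np.mean(np.sort(sims, axis=1)[:, -k:], axis=1)   # [n_src]
    r_tgt = np.mean(np.sort(sims, axis=0)[-k:, :], axis=0)   # [n_tgt]

    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return np.argmax(csls, axis=1)       # index of the best translation per word
```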

5.3 Results

We present word translation results from our decontextual probe in Table 6.


We see that the first LSTM layer generally achieves the best crosslingual alignment both in ELMos and Rosita. This finding mirrors recent studies on layerwise transferability; representations from the first LSTM layer in a language model are most transferable across a range of tasks (Liu et al., 2019). Our decontextual probe demonstrates that the first LSTM layer learns the most generalizable representations not only across tasks but also across languages. In all six languages, Rosita (joint LM training approach) outperforms ELMos (retrofitting approach) and the fastText vectors. This shows that for the polyglot (jointly trained) LMs, there is a preexisting similarity between languages' vector spaces beyond what a linear transform provides. The resulting language-agnostic representations lead to polyglot training's success in low-resource dependency parsing.

6 Further Related Work

In addition to the work mentioned above, much previous work has proposed techniques to transfer knowledge from a high-resource to a low-resource language for dependency parsing. Many of these methods use an essentially (either lexicalized or delexicalized) joint polyglot training setup (e.g., McDonald et al., 2011; Cohen et al., 2011; Duong et al., 2015; Guo et al., 2016; Vilares et al., 2016; Falenska and Cetinoglu, 2017, as well as many of the CoNLL 2017/2018 shared task participants: Lim and Poibeau (2017); Vania et al. (2017); de Lhoneux et al. (2017); Che et al. (2018); Wan et al. (2018); Smith et al. (2018); Lim et al. (2018)). Some use typological information to facilitate crosslingual transfer (e.g., Naseem et al., 2012; Tackstrom et al., 2013; Zhang and Barzilay, 2015; Wang and Eisner, 2016; Rasooli and Collins, 2017; Ammar et al., 2016). Others use bitext (Zeman et al., 2018), manually-specified rules (Naseem et al., 2012), or surface statistics from gold universal part of speech (Wang and Eisner, 2018a,b) to map the source to target. The methods examined in this work to produce multilingual CWRs do not rely on such external information about the languages, and instead use relatively abundant LM data to learn crosslinguality that abstracts away from typological divergence.

Recent work has developed several probing methods for (monolingual) contextual representations (Liu et al., 2019; Hewitt and Manning, 2019; Tenney et al., 2019). Wada and Iwata (2018) showed that the (contextless) input and output word vectors in a polyglot word-based language model manifest a certain level of lexical correspondence between languages. Our decontextual probe demonstrated that the internal layers of polyglot language models capture crosslinguality and produce useful multilingual CWRs for downstream low-resource dependency parsing.

7 Conclusion

We assessed recent approaches to multilingual contextual word representations, and compared them in the context of low-resource dependency parsing. Our parsing results illustrate that a joint training approach for polyglot language models outperforms a retrofitting approach of aligning monolingual language models. Our decontextual probe showed that jointly trained LMs learn a better crosslingual lexical correspondence than the one produced by aligning monolingual language models or word type vectors. Our results provide a strong basis for multilingual representation learning and for further study of crosslingual transfer in a low-resource setting beyond dependency parsing.

Acknowledgments

The authors thank Nikolaos Pappas and Tal Schuster as well as the anonymous reviewers for their helpful feedback. This research was funded in part by NSF grant IIS-1562364, a Google research award to NAS, the Funai Overseas Scholarship to JK, and the NVIDIA Corporation through the donation of a GeForce GPU.

References

Judit Acs. 2019. Exploring BERT's vocabulary.

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proc. of NAACL-HLT.

Hanan Aldarmaki and Mona Diab. 2019. Context-aware cross-lingual mapping. In Proc. of NAACL-HLT.

Waleed Ammar. 2016. Towards a Universal Analyzer of Natural Languages. Ph.D. thesis, Carnegie Mellon University.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. TACL, 4.


Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv:1312.3005.

Shay B. Cohen, Dipanjan Das, and Noah A. Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In Proc. of EMNLP.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2018. Word translation without parallel data. In Proc. of ICLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT.

Timothy Dozat and Christopher Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proc. of ICLR.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proc. of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proc. of ACL-IJCNLP.

Agnieszka Falenska and Ozlem Cetinoglu. 2017. Lexicalized vs. delexicalized parsing in low-resource scenarios. In Proc. of IWPT.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proc. of NAACL-HLT.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proc. of NLP-OSS.

Daniela Gerz, Ivan Vulic, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proc. of EMNLP.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proc. of ACL-IJCNLP.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In Proc. of AAAI.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proc. of ACL.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proc. of NAACL-HLT.

Diederik P. Kingma and Jimmy Lei Ba. 2015. ADAM: A method for stochastic optimization. In Proc. of ICLR.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Proc. of NeurIPS.

Miryam de Lhoneux, Johannes Bjerva, Isabelle Augenstein, and Anders Søgaard. 2018. Parameter sharing between dependency parsers for related languages. In Proc. of EMNLP.

Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to universal dependencies - look, no tags! In Proc. of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

KyungTae Lim, Cheoneum Park, Changki Lee, and Thierry Poibeau. 2018. SEx BiST: A multi-source trainable parser with deep contextualized lexical representations. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

KyungTae Lim and Thierry Poibeau. 2017. A system for multilingual dependency parsing based on bidirectional LSTM feature representations. In Proc. of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proc. of NAACL-HLT.

Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In Proc. of ACL.


Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proc. of ACL.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proc. of EMNLP.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv:1309.4168.

Phoebe Mulcaire, Jungo Kasai, and Noah A. Smith. 2019. Polyglot contextual representations improve crosslingual transfer. In Proc. of NAACL-HLT.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proc. of ACL.

Joakim Nivre, Mitchell Abrams, Zeljko Agic, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (UFAL), Faculty of Mathematics and Physics, Charles University.

Robert Ostling and Jorg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proc. of EACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.


Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proc. of ACL.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL-HLT.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proc. of ACL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Bjorkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proc. of CoNLL.

Anton Ragni, Edgar Dakin, Xie Chen, Mark J. F. Gales, and Kate Knill. 2016. Multi-language neural network language models. In Proc. of INTERSPEECH.

Mohammad Sadegh Rasooli and Michael Collins. 2017. Cross-lingual syntactic transfer with limited resources. TACL, 5.

Rudolf Rosa and David Marecek. 2018. CUNI x-ling: Parsing under-resourced languages in CoNLL 2018 UD shared task. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proc. of NAACL-HLT.

Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, and Sara Stymne. 2018. 82 treebanks, 34 models: Universal dependency parsing with multi-treebank models. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Oscar Tackstrom, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proc. of NAACL-HLT.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. of ACL.

Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In Proc. of NAACL-HLT.

Clara Vania, Xingxing Zhang, and Adam Lopez. 2017. UParse: the Edinburgh system for the CoNLL 2017 UD shared task. In Proc. of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

David Vilares, Carlos Gomez-Rodriguez, and Miguel A. Alonso. 2016. One model, two languages: training bilingual parsers with harmonized treebanks. In Proc. of ACL.

Takashi Wada and Tomoharu Iwata. 2018. Unsupervised cross-lingual word embedding by multilingual neural language models. arXiv:1809.02306.

Hui Wan, Tahira Naseem, Young-Suk Lee, Vittorio Castelli, and Miguel Ballesteros. 2018. IBM Research at the CoNLL 2018 shared task on multilingual parsing. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. TACL, 4.

Dingquan Wang and Jason Eisner. 2018a. Surface statistics of an unknown language indicate how to parse it. TACL, 6.

Dingquan Wang and Jason Eisner. 2018b. Synthetic data made to order: The case of parsing. In Proc. of EMNLP.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv:1212.5701.

Daniel Zeman, Jan Hajic, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Yuan Zhang and Regina Barzilay. 2015. Hierarchical low-rank tensors for multilingual transfer parsing. In Proc. of EMNLP.


Character CNNs
Char embedding size                  16
(# Window Size, # Filters)           (1, 32), (2, 32), (3, 68), (4, 128), (5, 256), (6, 512), (7, 1024)
Activation                           Relu

Word-level LSTM
LSTM size                            2048
# LSTM layers                        2
LSTM projection size                 256
Use skip connections                 Yes
Inter-layer dropout rate             0.1

Training
Batch size                           128
Unroll steps (Window Size)           20
# Negative samples                   64
# Epochs                             10
Adagrad (Duchi et al., 2011) lrate   0.2
Adagrad initial accumulator value    1.0

Table 7: Language model hyperparameters.

A Training Parameters

In this section, we provide hyperparameters used in our models and training details for ease of replication.

A.1 Language Models

Table 7 lists the hyperparameters for our language models. We use the publicly available code of Peters et al. (2018) for training.[14]

Following Mulcaire et al. (2019), we reduce the LSTM and projection sizes to expedite training and to compensate for the greatly reduced training data: the hyperparameters used in Peters et al. (2018) were tuned for the One Billion Word Corpus (Chelba et al., 2013), while we used only 5% as much text (approximately 50M tokens) per language. Contextual representations from language models trained with even less text are still effective (Che et al., 2018; Schuster et al., 2019), suggesting that the method used in this work would apply to even lower-resource languages that have scarce text in addition to scarce or nonexistent annotation, though at the cost of some performance.

A.2 UD Parsing

For UD parsing, we generally follow the hyperparameters used for the dependency parsing demo in AllenNLP (Gardner et al., 2018). See the list of hyperparameters in Table 8. We use stratified sampling so that each training mini-batch has an equal number of sentences from the source and target languages (see the sketch after Table 8).

[14] github.com/allenai/bilm-tf

Input
Input dropout rate                   0.3

Word-level BiLSTM
LSTM size                            400
# LSTM layers                        3
Recurrent dropout rate               0.3
Inter-layer dropout rate             0.3
Use Highway Connection               Yes

Multilayer Perceptron, Attention
Arc MLP size                         500
Label MLP size                       100
# MLP layers                         1
Activation                           Relu

Training
Batch size                           80
# Epochs                             80
Early stopping                       50
Adam (Kingma and Ba, 2015) lrate     0.001
Adam β1                              0.9
Adam β2                              0.999

Table 8: UD parsing hyperparameters.
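A small sketch of the stratified batching described above: each mini-batch draws an equal number of sentences from the source and the target treebank. The batch size of 80 comes from Table 8; how the smaller side is resampled is not specified in the paper, so the oversampling with replacement here is an assumption.

```python
import random

def stratified_batches(source_sents, target_sents, batch_size=80, seed=0):
    """Yield mini-batches that are half source and half target sentences.

    The smaller treebank is oversampled (sampling with replacement) so that
    every batch keeps the equal split; this resampling choice is an
    assumption, not stated in the paper.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    n_batches = max(len(source_sents), len(target_sents)) // half
    for _ in range(n_batches):
        batch = rng.choices(source_sents, k=half) + rng.choices(target_sents, k=half)
        rng.shuffle(batch)
        yield batch
```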

A.3 Multilingual Word Vectors

We train the word type representations used for non-contextual baselines with fastText (Bojanowski et al., 2017). We use window size 5 and a minimum count of 5, with 300 dimensions.
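For illustration, the same configuration can be reproduced with the fasttext Python package roughly as follows; the corpus path is a placeholder, and the skipgram variant is the fastText default rather than something the paper specifies.

```python
import fasttext

# Train subword-based word type vectors on the same raw LM text used for the
# contextual models (the file path is a placeholder).
model = fasttext.train_unsupervised(
    "lm_corpus.txt",   # one tokenized sentence per line
    model="skipgram",  # fastText default; variant not stated in the paper
    dim=300,           # 300-dimensional vectors
    ws=5,              # window size 5
    minCount=5,        # ignore words seen fewer than 5 times
)
model.save_model("fasttext.bin")
```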

B Other Low-Resource Simulations

In addition to the 100-sentence condition, we simulated low-resource experiments with 500 and 1000 sentences of target language data, and zero-target-treebank experiments in which the parser was trained with only source language data, but with multilingual representations allowing crosslingual transfer. See Table 9 for these results. The additional low-resource results confirm our analysis in Section 4.2: polyglot training is more effective the less target-language data is available, with a slight advantage for related languages.

C UD Treebanks

Additional statistics about the languages and treebanks used are given in Table 10.

D Additional Experiments

Semantic Role Labeling  For SRL, we again follow the hyperparameters given in AllenNLP (Table 11). The one exception is that we used 4 layers of alternating BiLSTMs instead of 8 layers to expedite the training process.


          |Dτ| = 0        |Dτ| = 100             |Dτ| = 500             |Dτ| = 1000
target    +eng   +rel.    mono   +eng   +rel.    mono   +eng   +rel.    mono   +eng   +rel.
ARA       10.31  20.47    62.50  73.39  73.43    76.15  79.55  79.16    79.43  81.38  81.49
HEB       23.76  24.89    64.53  74.86  75.69    79.27  82.35  82.92    82.59  84.59  84.70
HRV       48.69  67.67    63.49  79.21  82.00    80.80  84.92  85.89    84.14  86.27  86.66
RUS       38.69  73.24    59.51  75.63  79.29    77.38  83.16  84.60    82.90  85.68  86.99
NLD       61.68  72.90    57.12  74.90  77.01    75.19  82.42  81.33    81.41  84.93  83.23
DEU       51.18  68.66    60.26  72.52  73.45    72.94  77.88  77.68    76.46  78.67  78.57
SPA       55.85  75.88    64.97  80.86  81.55    79.67  84.88  84.63    82.97  86.69  86.81
ITA       59.71  78.12    69.17  84.63  83.51    82.96  88.96  87.91    87.03  90.22  89.32
CMN        8.16   5.34    53.36  63.63  61.47    71.94  74.88  74.98    77.42  79.07  78.96
JPN        4.12  11.66    72.37  80.94  80.24    86.20  87.74  87.74    88.74  89.08  89.32

Table 9: LAS for UD parsing with additional simulated low-resource and zero-target-treebank settings.

Named Entity Recognition  We again use the hyperparameter configurations provided in AllenNLP. See Table 12 for details.


Lang        Code  WALS Genus  WALS 81A           Size (# sents.)  Treebank   Genre
English     eng   Germanic    SVO                –                EWT        blog, email, reviews, social

Simulation Pairs
Arabic      ara   Semitic     VSO/SVO            5241             PADT       news
Hebrew      heb   Semitic     SVO                5241             HTB        news
Croatian    hrv   Slavic      SVO                6983             SET        news, web, wiki
Russian     rus   Slavic      SVO                6983             SynTagRus  contemporary fiction, popular, science, newspaper, journal articles, online news
Dutch       nld   Germanic    SOV/SVO            12269            Alpino     news
German      deu   Germanic    SOV/SVO            12269            GSD        news, reviews, wiki
Spanish     spa   Romance     SVO                12543            GSD        blog, news, reviews, wiki
Italian     ita   Romance     SVO                12543            ISDT       legal, news, wiki
Chinese     cmn   Chinese     SVO                3997             GSD        wiki
Japanese    jpn   Japanese    SOV                3997             GSD        wiki

Truly Low Resource and Related Languages
Hungarian   hun   Ugric       SOV/SVO            910              Szeged     news
Finnish     fin   Finnic      SVO                12217            TDT        news, wiki, blog, legal, fiction, grammar-examples
Vietnamese  vie   Viet-Muong  SVO                1400             VTB        news
Uyghur      uig   Turkic      SOV                1656             UDT        fiction
Kazakh      kaz   Turkic      SOV (not in WALS)  31               KTB        wiki, fiction, news
Turkish     tur   Turkic      SOV                3685             IMST       nonfiction, news

Table 10: List of the languages and their UD treebanks used in our experiments. Languages are grouped with their related language; within each simulation pair, the size is shared. WALS 81A denotes Feature 81A in WALS, Order of Subject, Object, and Verb (Dryer and Haspelmath, 2013). Size represents the downsampled size in # of sentences used for source treebanks.

Input
Predicate indicator embedding size   100

Word-level Alternating BiLSTM
LSTM size                            300
# LSTM layers                        4
Recurrent dropout rate               0.1
Use Highway Connection               Yes

Training
Batch size                           80
# Epochs                             80
Early stopping                       20
Adadelta (Zeiler, 2012) lrate        0.1
Adadelta ρ                           0.95
Gradient clipping                    1.0

Table 11: SRL hyperparameters.

Char-level LSTM
Char embedding size                  25
Input dropout rate                   0.5
LSTM size                            128
# LSTM layers                        1

Word-level BiLSTM
LSTM size                            200
# LSTM layers                        3
Inter-layer dropout rate             0.5
Recurrent dropout rate               0.5
Use Highway Connection               Yes

Multilayer Perceptron
MLP size                             400
Activation                           tanh

Training
Batch size                           64
# Epochs                             50
Early stopping                       25
Adam (Kingma and Ba, 2015) lrate     0.001
Adam β1                              0.9
Adam β2                              0.999
L2 regularization coefficient        0.001

Table 12: NER hyperparameters.