arXiv:2005.00352v2 [cs.CL] 16 Apr 2021

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

Louis Martin¹,²  Angela Fan¹,³  Éric de la Clergerie²  Antoine Bordes¹  Benoît Sagot²
¹ Facebook AI Research, Paris, France  ² Inria, Paris, France  ³ LORIA, Nancy, France
{louismartin, angelafan, abordes}@fb.com  {eric.de_la_clergerie, benoit.sagot}@inria.fr

Abstract

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We further present a method to mine such paraphrase data in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.

1 Introduction

Sentence simplification is the task of making a sentence easier to read and understand by reducing its lexical and syntactic complexity, while retaining most of its original meaning. Simplification has a variety of important societal applications, for example increasing accessibility for those with cognitive disabilities such as aphasia (Carroll et al., 1998), dyslexia (Rello et al., 2013), and autism (Evans et al., 2014), or for non-native speakers (Paetzold and Specia, 2016). Research has mostly focused on English simplification, where source texts and associated simplified texts exist and can be automatically aligned, such as English Wikipedia and Simple English Wikipedia (Zhang and Lapata, 2017). However, such data is limited in terms of size and domain, and difficult to find in other languages. Additionally, a sentence can be simplified in multiple ways, depending on the target audience. Simplification guidelines are not uniquely defined, as evidenced by the stark differences between English simplification benchmarks (Alva-Manchego et al., 2020a). This highlights the need for more general models that can adjust to different simplification contexts and scenarios.

In this paper, we propose to train controllable models using sentence-level paraphrase data only, i.e., parallel sentences that have the same meaning but are phrased differently. In order to generate simplifications and not paraphrases, we use ACCESS (Martin et al., 2020) to control attributes such as length, lexical complexity, and syntactic complexity. Paraphrase data is more readily available, and opens the door to training flexible models that can adjust to more varied simplification scenarios. We propose to gather such paraphrase data in any language by mining sentences from Common Crawl using semantic sentence embeddings. We show that simplification models trained on mined paraphrase data perform as well as models trained on existing large paraphrase corpora (cf. Appendix D). Moreover, paraphrases are more straightforward to mine than simplifications, and we show that they lead to models with better performance than equivalent models trained on mined simplifications (cf. Section 5.4).

Our resulting Multilingual Unsupervised Sentence Simplification method, MUSS, is unsupervised because it can be trained without relying on labeled simplification data,¹ even though we mine using supervised sentence embeddings.² We additionally incorporate unsupervised pretraining (Liu et al., 2019, 2020) and apply MUSS on English, French, and Spanish to closely match or outperform the supervised state of the art in all languages. MUSS further improves the state of the art on all English datasets by incorporating additional labeled simplification data.

¹ We use the term labeled simplifications to refer to parallel datasets where texts were manually simplified by humans.

² Previous works have also used the term unsupervised simplification to describe works that do not use any labeled parallel simplification data while leveraging supervised components such as constituency parsers and knowledge bases (Kumar et al., 2020), external synonymy lexicons (Surya et al., 2019), and databases of simplified synonyms (Zhao et al., 2020). We shall come back to these works in Section 2.



To sum up, our contributions are as follows:

• We introduce a novel approach to training simplification models with paraphrase data only and propose a mining procedure to create large paraphrase corpora for any language.

• Our approach obtains strong performance. Without any labeled simplification data, we match or outperform the supervised state of the art in English, French, and Spanish. We further improve the English state of the art by incorporating labeled simplification data.

• We release MUSS pretrained models, paraphrase data, and code for mining and training.³

2 Related work

Data-driven methods have been predominant in English sentence simplification in recent years (Alva-Manchego et al., 2020b), requiring large supervised training corpora of complex-simple aligned sentences (Wubben et al., 2012; Xu et al., 2016; Zhang and Lapata, 2017; Zhao et al., 2018; Martin et al., 2020). Methods have relied on English and Simple English Wikipedia with automatic sentence alignment from similar articles (Zhu et al., 2010; Coster and Kauchak, 2011; Woodsend and Lapata, 2011; Kauchak, 2013; Zhang and Lapata, 2017). Higher quality datasets have been proposed, such as the Newsela corpus (Xu et al., 2015), but they are rare and come with restrictive licenses that hinder reproducibility and widespread usage.

Simplification in other languages has been explored in Brazilian Portuguese (Aluísio et al., 2008), Spanish (Saggion et al., 2015; Štajner et al., 2015), Italian (Brunato et al., 2015; Tonelli et al., 2016), Japanese (Goto et al., 2015; Kajiwara and Komachi, 2018; Katsuta and Yamamoto, 2019), and French (Gala et al., 2020), but the lack of large labeled parallel corpora has slowed research down. In this work, we show that a method trained on automatically mined corpora can reach state-of-the-art results in each language.

³ https://github.com/facebookresearch/muss

When labeled parallel simplification data is unavailable, systems rely on unsupervised simplification techniques, often inspired by machine translation. The prevailing approach is to split a monolingual corpus into disjoint sets of complex and simple sentences using readability metrics. Simplification models can then be trained using automatic sentence alignments (Kajiwara and Komachi, 2016, 2018), auto-encoders (Surya et al., 2019; Zhao et al., 2020), unsupervised statistical machine translation (Katsuta and Yamamoto, 2019), or back-translation (Aprosio et al., 2019). Other unsupervised simplification approaches iteratively edit the sentence until a certain simplicity criterion is reached (Kumar et al., 2020). The performance of unsupervised methods is often below that of their supervised counterparts. MUSS bridges the gap with supervised methods and removes the need to decide in advance how complex and simple sentences should be separated; instead, it trains directly on paraphrases mined from the raw corpora.

Previous work on parallel dataset mining has mostly targeted machine translation, using document retrieval (Munteanu and Marcu, 2005), language models (Koehn et al., 2018, 2019), and embedding space alignment (Artetxe and Schwenk, 2019b) to create large corpora (Tiedemann, 2012; Schwenk et al., 2019). We focus on paraphrasing for sentence simplification, which presents new challenges. Unlike machine translation, where the same sentence should be identified in two languages, we develop a method to identify varied paraphrases of sentences that have a wider array of surface forms, including different lengths, multiple sentences, different vocabulary usage, and removal of content from more complex sentences.

Previous unsupervised paraphrasing research has aligned sentences from various parallel corpora (Barzilay and Lee, 2003) with multiple objective functions (Liu et al., 2019). Bilingual pivoting relied on MT datasets to create large databases of word-level paraphrases (Pavlick et al., 2015), lexical simplifications (Pavlick and Callison-Burch, 2016; Kriz et al., 2018), or sentence-level paraphrase corpora (Wieting and Gimpel, 2018). This has not been applied to multiple languages or to sentence-level simplification. Additionally, we use raw monolingual data to create our paraphrase corpora instead of relying on parallel MT datasets.


Dataset               Type                            # Sequence Pairs   Avg. Tokens per Sequence
WikiLarge (English)   Labeled Parallel Simplifications  296,402          original: 21.7 / simple: 16.0
Newsela (English)     Labeled Parallel Simplifications  94,206           original: 23.4 / simple: 14.2
English               Mined                             1,194,945        22.3
French                Mined                             1,360,422        18.7
Spanish               Mined                             996,609          22.8

Table 1: Statistics on our mined paraphrase training corpora compared to standard simplification datasets (see Section 4.3 for more details).

3 Method

In this section we describe MUSS, our approach to mining paraphrase data and training controllable simplification models on paraphrases.

3.1 Mining Paraphrases in Many Languages

Sequence Extraction
Simplification consists of multiple rewriting operations, some of which span multiple sentences (e.g., sentence splitting or fusion). To allow such operations to be represented in our paraphrase data, we extract chunks of text composed of multiple sentences; we refer to these small pieces of text as sequences.

We extract such sequences by first tokenizing a document into individual sentences {s1, s2, ..., sn} using the NLTK sentence tokenizer (Bird and Loper, 2004). We then extract sequences of adjacent sentences with a maximum length of 300 characters, e.g., {[s1], [s1, s2], [s1, ..., sk], [s2], [s2, s3], ...}. We can thus align two sequences that contain a different number of sentences, and represent sentence splitting or sentence fusion operations.
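This extraction step can be sketched as follows; the function name, the whitespace-joining convention, and the use of nltk.sent_tokenize are our own illustrative choices:

```python
import nltk  # requires a one-time nltk.download("punkt")

MAX_CHARS = 300  # maximum sequence length used in the paper

def extract_sequences(document: str) -> list:
    """Enumerate all windows of adjacent sentences up to MAX_CHARS characters."""
    sentences = nltk.sent_tokenize(document)
    sequences = []
    for start in range(len(sentences)):
        window = []
        for sentence in sentences[start:]:
            window.append(sentence)
            candidate = " ".join(window)
            if len(candidate) > MAX_CHARS:
                break
            sequences.append(candidate)
    return sequences
```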

These sequences are further filtered to remove noisy sequences with more than 10% punctuation characters and sequences with low language model probability according to a 3-gram Kneser-Ney language model trained with kenlm (Heafield, 2011) on Wikipedia.
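A sketch of these two filters; the 10% punctuation threshold is from the paper, while the model path and the per-word log-probability cutoff below are assumed values for illustration:

```python
import string
import kenlm  # Python bindings for the kenlm library

LM = kenlm.Model("wikipedia.3gram.arpa")  # 3-gram Kneser-Ney model (hypothetical path)
MAX_PUNCT_RATIO = 0.10
MIN_AVG_LOGPROB = -4.0  # assumed per-word log10 probability cutoff

def keep_sequence(seq: str) -> bool:
    # Reject sequences with more than 10% punctuation characters.
    punct_ratio = sum(c in string.punctuation for c in seq) / max(len(seq), 1)
    if punct_ratio > MAX_PUNCT_RATIO:
        return False
    # Reject sequences the language model finds unlikely.
    n_words = max(len(seq.split()), 1)
    avg_logprob = LM.score(seq) / n_words  # length-normalized log10 probability
    return avg_logprob >= MIN_AVG_LOGPROB
```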

We extract these sequences from CCNet (Wenzek et al., 2019), an extraction of Common Crawl (an open-source snapshot of the web) that has been split into different languages using fastText language identification (Joulin et al., 2017) and various language-model filtering techniques to identify high-quality, clean sentences. For English and French, we extract 1 billion sequences from CCNet. For Spanish, we extract 650 million sequences, the maximum for this language in CCNet after filtering out noisy text.

Creating a Sequence Index Using Embeddings
To automatically mine our paraphrase corpora, we first compute n-dimensional embeddings for each extracted sequence using LASER (Artetxe and Schwenk, 2019b). LASER provides joint multilingual sentence embeddings in 93 languages that have been successfully applied to the task of bilingual bitext mining (Schwenk et al., 2019). In this work, we show that LASER can also be used to mine monolingual paraphrase datasets.

Mining Paraphrases
After computing the embeddings for each language, we index them for fast nearest neighbor search using faiss.

Each of these sequences is then used as a query qi against the billion-scale index, which returns the top-8 nearest neighbor sequences according to L2 distance in the semantic LASER embedding space, resulting in a set of candidate paraphrases {ci,1, ..., ci,8}.
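For illustration, a minimal sketch of the indexing and querying step with faiss; we show a flat L2 index for clarity, whereas the actual billion-scale index is compressed and inverted (see Appendix A.1):

```python
import faiss
import numpy as np

# embeddings: float32 array of shape (n_sequences, dim) produced by LASER.
def build_index(embeddings: np.ndarray) -> faiss.Index:
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search
    index.add(embeddings)
    return index

def top8_candidates(index: faiss.Index, queries: np.ndarray):
    distances, neighbor_ids = index.search(queries, 8)  # both of shape (n, 8)
    return distances, neighbor_ids
```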

We then use an upper bound on L2 distance and a margin criterion following Artetxe and Schwenk (2019a) to filter out sequences with low similarity. We refer the reader to Appendix Section A.1 for technical details.

The resulting nearest neighbors constitute a set of aligned paraphrases of the query sequence: {(qi, ci,1), ..., (qi, ci,j)}. Finally, we filter out poor alignments. We remove pairs of sequences that are almost identical (character-level Levenshtein distance ≤ 20%), that are contained in one another, or that were extracted from two overlapping sliding windows of the same original document.
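A sketch of these alignment filters, assuming the python-Levenshtein package; the overlapping-window check requires document offsets and is omitted here:

```python
import Levenshtein  # from the python-Levenshtein package

def is_poor_alignment(query: str, candidate: str) -> bool:
    """Filters from Section 3.1 (overlapping-window check omitted)."""
    max_len = max(len(query), len(candidate), 1)
    # Near-duplicates: case-insensitive Levenshtein distance <= 20% of length.
    if Levenshtein.distance(query.lower(), candidate.lower()) / max_len <= 0.20:
        return True
    # One sequence contained in the other.
    return query in candidate or candidate in query
```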

We report statistics of the mined corpora in English, French, and Spanish in Table 1, and qualitative examples of the resulting mined paraphrases in Appendix Table 6. Models trained on the resulting mined paraphrases obtain similar performance to models trained on existing paraphrase datasets (cf. Appendix Section D).

3.2 Simplifying with ACCESS

In this section we describe how we adapt ACCESS (Martin et al., 2020) to train controllable models on mined paraphrases, instead of labeled parallel simplifications.

ACCESS is a method to make any seq2seq model controllable by conditioning on simplification-specific control tokens. We apply it on our seq2seq pretrained transformer models based on the BART (Lewis et al., 2019) architecture (see next subsection).

Training with Control Tokens
At training time, the model is provided with control tokens that give oracle information on the target sequence, such as the amount of compression of the target sequence relative to the source sequence (length control). For example, when the target sequence is 80% of the length of the original sequence, we provide the <NumChars_80%> control token. At inference time we can then control the generation by selecting a given target control value. We adapt the original Levenshtein similarity control to only consider replace operations but otherwise use the same controls as Martin et al. (2020). The controls used are therefore character length ratio, replace-only Levenshtein similarity, aggregated word frequency ratio, and dependency tree depth ratio. For instance, we prepend to every source in the training set the following 4 control tokens with sample-specific values, so that the model learns to rely on them: <NumChars_XX%> <LevSim_YY%> <WordFreq_ZZ%> <DepTreeDepth_TT%>. We refer the reader to the original paper (Martin et al., 2020) and Appendix A.2 for details on ACCESS and how those control tokens are computed.
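As an illustration, here is how the character-length control token could be computed for one training pair; the 5% bucketing and the helper names are our assumptions, and the three other controls follow the same pattern:

```python
def bucket(ratio: float) -> int:
    """Round a ratio to the nearest 5% (assumed discretization)."""
    return int(round(ratio * 20)) * 5

def prepend_length_control(source: str, target: str) -> str:
    """Prepend the oracle character-length control token to a training source.

    The replace-only Levenshtein similarity, word-frequency, and dependency
    tree depth controls are computed analogously (Martin et al., 2020).
    """
    num_chars = bucket(len(target) / max(len(source), 1))
    return f"<NumChars_{num_chars}%> {source}"
```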

Selecting Control Values at Inference
Once the model has been trained with oracle controls, we can adjust the control tokens to obtain the desired type of simplifications. Indeed, sentence simplification often depends on the context and target audience (Martin et al., 2020). For instance, shorter sentences are more adapted to people with cognitive disabilities, while more frequent words are useful to second-language learners. It is therefore important that supervised and unsupervised simplification systems can be adapted to different conditions: Kumar et al. (2020) do so by choosing a set of operation-specific weights of their unsupervised simplification model for each evaluation dataset, and Surya et al. (2019) select different models using SARI on each validation set. Similarly, we set the 4 control hyper-parameters of ACCESS using SARI on each validation set and keep them fixed for all samples in the test set.⁴ These 4 control hyper-parameters are intuitive and easy to interpret: when no validation set is available, they can also be set using prior knowledge on the task and still lead to solid performance (cf. Appendix C).

⁴ Details in Appendix A.2.
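A hypothetical sketch of this selection step as a plain grid search; the paper actually tunes the controls with zero-order optimization (see Appendix A.2), and simplify_fn and sari_fn are placeholders for the decoder and the SARI metric:

```python
import itertools

def select_control_values(simplify_fn, sources, references, sari_fn):
    """Pick the 4 control ratios that maximize SARI on a validation set."""
    grid = [0.25, 0.50, 0.75, 1.00, 1.25]
    best_score, best_controls = float("-inf"), None
    for controls in itertools.product(grid, repeat=4):
        outputs = simplify_fn(sources, controls)  # decode with fixed controls
        score = sari_fn(sources, outputs, references)
        if score > best_score:
            best_score, best_controls = score, controls
    return best_controls
```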

3.3 Leveraging Unsupervised Pretraining

We combine our controllable models with unsupervised pretraining to further extend our approach to text simplification. For English, we finetune the pretrained generative model BART (Lewis et al., 2019) on our newly created training corpora. BART is a pretrained sequence-to-sequence model that can be seen as a generalization of other recent pretrained models such as BERT (Devlin et al., 2018). For non-English languages, we use its multilingual generalization MBART (Liu et al., 2020), which was pretrained on 25 languages.

4 Experimental Setting

We assess the performance of our approach on three languages: English, French, and Spanish. We detail our experimental procedure for mining and training in Appendix Section A. In all our experiments, we report scores on the test sets averaged over 5 random seeds with 95% confidence intervals.

4.1 Baselines

In addition to comparisons with previous works, we implement and hereafter describe multiple baselines to assess the performance of our models, especially for French and Spanish where no previous simplification systems are available.

Identity
The entire original sequence is kept unchanged and used as the simplification.

Truncation
The original sequence is truncated to the first 80% of its words. This is a strong baseline according to standard simplification metrics.

Pivot
We use machine translation to provide a baseline for languages for which no simplification corpus is available. The source non-English sentence is translated to English, simplified with our best supervised English simplification system, and then translated back into the source language. For French and Spanish translation, we use CCMatrix (Schwenk et al., 2019) to train Transformer models with LayerDrop (Fan et al., 2019). We use the BART+ACCESS supervised model trained on MINED + WikiLarge as the English simplification model. While pivoting introduces potential errors, recent improvements in translation systems for high-resource languages make this a strong baseline.


                                      ASSET                  TurkCorpus             Newsela
                                      SARI ↑      FKGL ↓     SARI ↑      FKGL ↓     SARI ↑      FKGL ↓

Baselines and Gold Reference
Identity Baseline                     20.73       10.02      26.29       10.02      12.24       8.82
Truncate Baseline                     29.85       7.91       33.10       7.91       25.49       6.68
Gold Reference                        44.87±0.36  6.49±0.15  40.04±0.30  8.77±0.08  —           —

Unsupervised Systems
BTRLTS (Zhao et al., 2020)            33.95       7.59       33.09       8.39       37.22       3.80
UNTS (Surya et al., 2019)             35.19       7.60       36.29       7.60       —           —
RM+EX+LS+RO (Kumar et al., 2020)      36.67       7.33       37.27       7.33       38.33       2.98
MUSS                                  42.65±0.23  8.23±0.62  40.85±0.15  8.79±0.30  38.09±0.59  5.12±0.47

Supervised Systems
EditNTS (Dong et al., 2019)           34.95       8.38       37.66       8.38       39.30       3.90
Dress-LS (Zhang and Lapata, 2017)     36.59       7.66       36.97       7.66       38.00       4.90
DMASS-DCSS (Zhao et al., 2018)        38.67       7.73       39.92       7.73       —           —
ACCESS (Martin et al., 2020)          40.13       7.29       41.38       7.29       —           —
MUSS                                  43.63±0.71  6.25±0.42  42.62±0.27  6.98±0.95  42.59±1.00  2.74±0.98
MUSS (+ mined data)                   44.15±0.56  6.05±0.51  42.53±0.36  7.60±1.06  41.17±0.95  2.70±1.00

Table 2: Unsupervised and Supervised Sentence Simplification for English. We display SARI and FKGL on the ASSET, TurkCorpus, and Newsela test sets for English. Supervised models are trained on WikiLarge for the first two test sets, and on Newsela for the last. Best SARI scores within confidence intervals are in bold.

Gold Reference
We report gold reference scores for ASSET and TurkCorpus as multiple references are available. We compute scores in a leave-one-out scenario where each reference is evaluated against all others. The scores are then averaged over all references.

4.2 Evaluation Metrics

We evaluate with the standard metrics SARI and FKGL. We report BLEU (Papineni et al., 2002) only in Appendix Table 10 due to its dubious suitability for sentence simplification (Sulem et al., 2018).

SARI
Sentence simplification is commonly evaluated with SARI (Xu et al., 2016), which compares model-generated simplifications with the source sequence and gold references. It averages F1 scores for addition, keep, and deletion operations. We compute SARI with the EASSE simplification evaluation suite (Alva-Manchego et al., 2019).⁵
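A toy example of computing SARI, assuming the corpus_sari entry point of the EASSE package; the sentences are illustrative only:

```python
from easse.sari import corpus_sari

orig = ["About 95 species are currently accepted."]
sys_out = ["About 95 species are now accepted."]
refs = [["About 95 species are currently known."],   # reference set 1
        ["About 95 species are now accepted."]]      # reference set 2

# Each inner list in refs is one full set of references, parallel to orig.
print(corpus_sari(orig_sents=orig, sys_sents=sys_out, refs_sents=refs))
```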

FKGL
We report readability scores using the Flesch-Kincaid Grade Level (FKGL) (Kincaid et al., 1975), a linear combination of sentence lengths and word lengths. Since FKGL was designed to be used on English texts only, we do not report it for French and Spanish.

⁵ We use the latest version of SARI implemented in EASSE (Alva-Manchego et al., 2019), which fixes bugs and inconsistencies from the traditional implementation. We thus recompute scores from previous systems that we compare to, using the system predictions provided by the respective authors and available in EASSE.

Baselines     French SARI ↑   Spanish SARI ↑
Identity      26.16           16.99
Truncate      33.44           27.34
Pivot         33.48±0.37      36.19±0.34
MUSS†         41.73±0.67      35.67±0.46

Table 3: Unsupervised Sentence Simplification in French and Spanish. We display SARI scores in French (ALECTOR) and Spanish (Newsela). Best SARI scores within confidence intervals are in bold. † MBART+ACCESS model.
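For reference, FKGL is a fixed linear formula; a minimal sketch, where the syllable counter is our own rough heuristic:

```python
def count_syllables(word: str) -> int:
    """Rough vowel-group syllable counter (illustrative heuristic)."""
    vowels, count, prev = "aeiouy", 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            count += 1
        prev = is_vowel
    return max(count, 1)

def fkgl(sentences):
    """Flesch-Kincaid Grade Level (Kincaid et al., 1975)."""
    words = [w for s in sentences for w in s.split()]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```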

4.3 Training Data

For all languages we use the mined data described in Table 1 as training data. We show that training with additional labeled simplification data leads to even better performance for English. We use the labeled datasets WikiLarge (Zhang and Lapata, 2017) and Newsela (Xu et al., 2015). WikiLarge is composed of 296k simplification pairs automatically aligned from English Wikipedia and Simple English Wikipedia. Newsela is a collection of news articles with professional simplifications, aligned into 94k simplification pairs by Zhang and Lapata (2017).⁶

4.4 Evaluation Data

English
We evaluate our English models on ASSET (Alva-Manchego et al., 2020a), TurkCorpus (Xu et al., 2016), and Newsela (Xu et al., 2015). TurkCorpus and ASSET were created using the same 2000 validation and 359 test source sentences. TurkCorpus contains 8 reference simplifications per source sentence and ASSET contains 10 references per source. ASSET is a generalization of TurkCorpus with a more varied set of rewriting operations, and is considered simpler by human judges (Alva-Manchego et al., 2020a). For Newsela, we evaluate on the split from Zhang and Lapata (2017), which includes 1129 validation and 1077 test sentence pairs.

French
For French, we use the ALECTOR dataset (Gala et al., 2020) for evaluation. ALECTOR is a collection of literary (tales, stories) and scientific (documentary) texts along with their manual document-level simplified versions. These documents were extracted from material available to French primary school pupils. We split the dataset into 450 validation and 416 test sentence pairs (see Appendix A.3 for details).

Spanish
For Spanish we use the Spanish part of Newsela (Xu et al., 2015). We use the alignments from Aprosio et al. (2019), composed of 2794 validation and 2795 test sentence pairs. Even though sentences were aligned using the CATS simplification alignment tool (Štajner et al., 2018), some alignment errors remain and automatic scores should be taken with a pinch of salt.

5 Results

5.1 English Simplification

We compare models trained on our mined corpus (see Table 1) with models trained on labeled simplification datasets (WikiLarge and Newsela). We also compare to other state-of-the-art supervised models: Dress-LS (Zhang and Lapata, 2017), DMASS-DCSS (Zhao et al., 2018), EditNTS (Dong et al., 2019), and ACCESS (Martin et al., 2020); and the unsupervised models: UNTS (Surya et al., 2019), BTRLTS (Zhao et al., 2020), and RM+EX+LS+RO (Kumar et al., 2020). Results are shown in Table 2.

⁶ We experimented with other alignments (wiki-auto and newsela-auto (Jiang et al., 2020)) but obtained lower performance.

MUSS Unsupervised Results
On the ASSET benchmark, with no labeled simplification data, MUSS obtains a +5.98 SARI improvement with respect to previous unsupervised methods, and a +2.52 SARI improvement over the state-of-the-art supervised methods. For the TurkCorpus and Newsela datasets, the unsupervised MUSS approach achieves strong results, either outperforming or closely matching both unsupervised and supervised previous works.

When incorporating labeled data from WikiLarge and Newsela, MUSS obtains state-of-the-art results on all datasets. Using labeled data along with mined data does not always help compared to training only with labeled data, especially with the Newsela training set. Newsela is already a high-quality dataset focused on the specific domain of news articles; it might not benefit from additional lower-quality mined data.

Examples of Simplifications
Various examples from our unsupervised system are shown in Table 4. Examining the simplifications, we see reduced sentence length, sentence splitting, and simpler vocabulary usage. For example, the phrase in the town's western outskirts is changed into near the town and aerial nests is simplified into nests in the air. We also witnessed errors related to factual consistency, especially named entity hallucination or disappearance, which would be an interesting area of improvement for future work.

5.2 French and Spanish Simplification

Our unsupervised approach to simplification can be applied to any language. Similar to English, we first create a corpus of paraphrases composed of 1.4 million sequence pairs in French and 1.0 million sequence pairs in Spanish (cf. Table 1). To incorporate multilingual pretraining, we replace the monolingual BART with MBART, which was trained on 25 languages.

We report the performance of models trained on the mined corpus in Table 3. Unlike English, where labeled parallel training data has been created using Simple English Wikipedia, no such datasets exist for French or Spanish. Similarly, no other simplification systems are available in these languages. We thus compare to several baselines, namely the identity, truncation, and the strong pivot baseline.

Results
MUSS outperforms our strongest baseline by +8.25 SARI for French, while matching the pivot baseline performance for Spanish.


Original:   History Landsberg prison, which is in the town's western outskirts, was completed in 1910.
Simplified: The Landsberg prison, which is near the town, was built in 1910.

Original:   The name "hornet" is used for this and related species primarily because of their habit of making aerial nests (similar to the true hornets) rather than subterranean nests.
Simplified: The name "hornet" is used for this and related species because they make nests in the air (like the true hornets) rather than in the ground.

Original:   Nocturnes is an orchestral composition in three movements by the French composer Claude Debussy.
Simplified: Nocturnes is a piece of music for orchestra by the French composer Claude Debussy.

Table 4: Examples of Generated Simplifications. We show simplifications generated by our best unsupervised model: MUSS trained on mined data only. Bold highlights differences between original and simplified.

[Figure 1: three charts of SARI on the English ASSET test set; panels (a) Simplifications vs. Paraphrases (Seq2Seq and BART+ACCESS), (b) Large-Scale Mining (1K to the full 1.2M training samples), and (c) BART and ACCESS (Seq2Seq, ACCESS, BART, BART+ACCESS).]

Figure 1: Ablations. We display averaged SARI scores on the English ASSET test set with 95% confidence intervals (5 runs). (a) Models trained on mined simplifications or mined paraphrases, (b) MUSS trained on varying amounts of mined data, (c) models trained with or without BART and/or ACCESS.

Besides using state-of-the-art machine translation models, the pivot baseline relies on a strong backbone simplification model that has two advantages over the French and Spanish simplification models. First, the simplification model of the pivot baseline was trained on labeled simplification data from WikiLarge, which yields +1.5 SARI in English compared to training only on mined data. Second, it uses the stronger monolingual BART model instead of MBART. In Appendix Table 10, we can see that MBART has a small loss in performance of 1.54 SARI compared to its monolingual counterpart BART, due to the fact that it handles 25 languages instead of one. Further improvements could be achieved by using monolingual BART models trained for French or Spanish, possibly outperforming the pivot baseline.

5.3 Human Evaluation

To further validate the quality of our models, we conduct a human evaluation in all languages according to adequacy, fluency, and simplicity, and report the results in Table 5.

Human Ratings Collection
For human evaluation, we recruit volunteer native speakers for each language (5 in English, 2 in French, and 2 in Spanish). We evaluate three linguistic aspects on a 5-point Likert scale (0-4): adequacy (is the meaning preserved?), fluency (is the simplification fluent?), and simplicity (is the simplification actually simpler?). For each system and each language, 50 simplifications are annotated and each simplification is rated once only by a single annotator. The simplifications are taken from ASSET (English), ALECTOR (French), and Newsela (Spanish).

Discussion
Table 5 displays the average ratings along with 95% confidence intervals. Human judgments confirm that our unsupervised and supervised MUSS models are more fluent and produce simpler outputs than the previous state of the art (Martin et al., 2020). They are rated as more fluent and simpler than the human simplifications from the ASSET test set, which indicates our model is able to reach a high level of simplicity thanks to the control mechanism. In French and Spanish, our unsupervised model performs better than or similar to the strong supervised pivot baseline, which was trained on labeled English simplifications, in all aspects. In Spanish, the gold reference surprisingly obtains poor human ratings, which we found to be caused by errors in the automatic alignments of source sentences with simplified sentences of the same article, as previously highlighted in Subsection 4.4.


                              English                                 French                                  Spanish
                              Adequacy   Fluency    Simplicity       Adequacy   Fluency    Simplicity        Adequacy    Fluency    Simplicity
ACCESS (Martin et al., 2020)  3.10±0.32  3.46±0.28  1.40±0.29        —          —          —                 —           —          —
Pivot baseline                —          —          —                1.78±0.40  2.10±0.47  1.16±0.31         2.02±0.28   3.48±0.22  2.20±0.29
Gold Reference                3.44±0.28  3.78±0.17  1.80±0.28        3.46±0.25  3.92±0.10  1.66±0.31         2.18†±0.43  3.38±0.24  1.26†±0.36
MUSS (unsup.)                 3.20±0.28  3.84±0.14  1.88±0.33        2.88±0.34  3.50±0.32  1.22±0.25         2.26±0.29   3.48±0.25  2.56±0.29
MUSS (sup.)                   3.12±0.34  3.90±0.14  2.22±0.36        —          —          —                 —           —          —

Table 5: Human Evaluation. We display human ratings of adequacy, fluency, and simplicity for previous work ACCESS, the pivot baseline, reference human simplifications, our best unsupervised systems (BART+ACCESS for English, MBART+ACCESS for other languages), and our best supervised model for English. Scores are averaged over 50 ratings per system with 95% confidence intervals. † Low ratings of the gold reference in Spanish Newsela are due to automatic alignment errors.

5.4 Ablations

Mining Simplifications vs. Paraphrases
In this work, we mined paraphrases to train simplification models. This has the advantage of making fewer assumptions early on, by keeping the mining and models as general as possible, so that they are able to adapt to more simplification scenarios.

We also compared to directly mining simplifications, using simplification heuristics to make sure that the target side is simpler than the source, following previous work (Kajiwara and Komachi, 2016; Surya et al., 2019). To mine a simplification dataset, we followed the same paraphrase mining procedure of querying 1 billion sequences on an index of 1 billion sequences. Out of the resulting paraphrases, we kept only pairs that either contained sentence splits, reduced sequence length, or simpler vocabulary (similar to how previous work enforces an FKGL difference). We removed the paraphrase constraint that enforced sentences to be different enough. We tuned these heuristics to optimize SARI on the validation set. The resulting dataset has 2.7 million simplification pairs. In Figure 1a, we show that seq2seq models trained on mined paraphrases achieve better performance. A similar trend exists with BART and ACCESS, thus confirming that mining paraphrases can obtain better performance than mining simplifications.

How Much Mined Data Do You Need?
We investigate the importance of a scalable mining approach that can create million-sized training corpora for sentence simplification. In Figure 1b, we analyze the performance of training our best English model on different amounts of mined data. By increasing the number of mined pairs, SARI drastically improves, indicating that efficient mining at scale is critical to performance. Unlike human-created training sets, unsupervised mining allows for large datasets in multiple languages.

Improvements from Pretraining and Control
We compare the respective influence of BART pretraining and ACCESS controllable generation in Figure 1c. While both BART and ACCESS bring improvements over standard sequence-to-sequence training, they work best in combination. Unlike previous approaches to text simplification, we use pretraining to train our simplification systems. We find that the main qualitative improvement from pretraining is increased fluency and meaning preservation. For example, in Appendix Table 9, the model trained only with ACCESS substituted culturally akin with culturally much like, but when using BART, it is simplified to the more fluent closely related. While models trained on mined data see several million sentences, pretraining methods are typically trained on billions. Combining pretraining with controllable simplification enhances simplification performance by flexibly adjusting the type of simplification.

6 Conclusion

We propose a sentence simplification approach that does not rely on labeled parallel simplification data, thanks to controllable generation, pretraining, and large-scale mining of paraphrases from the web. This approach is language-agnostic and matches or outperforms previous state-of-the-art results, even from supervised systems that use labeled simplification data, on three languages: English, French, and Spanish. In future work, we plan to investigate how to scale this approach to more languages and types of simplification, and to apply this method to paraphrase generation. Another interesting direction for future work would be to examine and improve factual consistency, especially related to named entity hallucination or disappearance.


References

Sandra M Aluísio, Lucia Specia, Thiago AS Pardo, Erick G Maziero, and Renata PM Fortes. 2008. Towards Brazilian Portuguese automatic text simplification systems. In Proceedings of the Eighth ACM Symposium on Document Engineering, pages 240–248.
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020a. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of ACL 2020, pages 4668–4679.
Fernando Alva-Manchego, Louis Martin, Carolina Scarton, and Lucia Specia. 2019. EASSE: Easier automatic sentence simplification evaluation. In EMNLP-IJCNLP: System Demonstrations.
Fernando Alva-Manchego, Carolina Scarton, and Lucia Specia. 2020b. Data-driven sentence simplification: Survey and benchmark. Computational Linguistics, pages 1–87.
Alessio Palmero Aprosio, Sara Tonelli, Marco Turchi, Matteo Negri, and Mattia A Di Gangi. 2019. Neural text simplification in low-resource conditions using weak supervision. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 37–44.
Mikel Artetxe and Holger Schwenk. 2019a. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3197–3203, Florence, Italy.
Mikel Artetxe and Holger Schwenk. 2019b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of NAACL-HLT 2003, pages 16–23.
Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain.
Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2015. Design and annotation of the first Italian corpus for text simplification. In Proceedings of the 9th Linguistic Annotation Workshop, pages 31–41.
John Carroll, Guido Minnen, Yvonne Canning, Siobhan Devlin, and John Tait. 1998. Practical simplification of English newspaper text to assist aphasic readers. In Proceedings of the AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, pages 7–10.
William Coster and David Kauchak. 2011. Learning to simplify sentences using Wikipedia. In Proceedings of the Workshop on Monolingual Text-to-Text Generation, pages 1–9.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Yue Dong, Zichao Li, Mehdi Rezagholizadeh, and Jackie Chi Kit Cheung. 2019. EditNTS: An neural programmer-interpreter model for sentence simplification through explicit editing. In Proceedings of ACL 2019, pages 3393–3402.
Richard Evans, Constantin Orasan, and Iustin Dornescu. 2014. An evaluation of syntactic simplification rules for people with autism. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 131–140.
Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.
Núria Gala, Anaïs Tack, Ludivine Javourey-Drevet, Thomas François, and Johannes C Ziegler. 2020. Alector: A parallel corpus of simplified French texts with alignments of misreadings by poor and dyslexic readers. In Proceedings of LREC 2020.
Isao Goto, Hideki Tanaka, and Tadashi Kumano. 2015. Japanese news simplification: Task design, data set construction, and analysis of simplified text. In Proceedings of MT Summit XV.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197.
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, and Wei Xu. 2020. Neural CRF model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of EACL 2017, pages 427–431.
Tomoyuki Kajiwara and Mamoru Komachi. 2018. Text simplification without simplified corpora. The Journal of Natural Language Processing, 25:223–249.
Tomoyuki Kajiwara and Mamoru Komachi. 2016. Building a monolingual parallel corpus for text simplification using sentence similarity based on alignment between word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1147–1158, Osaka, Japan.
Akihiro Katsuta and Kazuhide Yamamoto. 2019. Improving text simplification by corpus expansion with unsupervised learning. In 2019 International Conference on Asian Language Processing (IALP), pages 216–221. IEEE.
David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of ACL 2013, pages 1537–1546.
J. Peter Kincaid, Robert P Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel.
Philipp Koehn, Francisco Guzmán, Vishrav Chaudhary, and Juan Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation, pages 54–72.
Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739.
Reno Kriz, Eleni Miltsakaki, Marianna Apidianaki, and Chris Callison-Burch. 2018. Simplification using paraphrases and context-based lexical substitution. In Proceedings of NAACL 2018, pages 207–217.
Dhruv Kumar, Lili Mou, Lukasz Golab, and Olga Vechtomova. 2020. Iterative edit-based unsupervised sentence simplification. In Proceedings of ACL 2020, pages 7918–7928, Online.
Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Xianggen Liu, Lili Mou, Fandong Meng, Hao Zhou, Jie Zhou, and Sen Song. 2019. Unsupervised paraphrasing by simulated annealing. arXiv preprint arXiv:1909.03588.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.
Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Éric de La Clergerie, Antoine Bordes, and Benoît Sagot. 2018. Reference-less quality estimation of text simplification systems. In Proceedings of the 1st Workshop on Automatic Text Adaptation, pages 29–38.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a tasty French language model. In Proceedings of ACL 2020.
Louis Martin, Benoît Sagot, Éric de la Clergerie, and Antoine Bordes. 2020. Controllable sentence simplification. In Proceedings of LREC 2020.
Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL 2019 (Demonstrations), pages 48–53, Minneapolis, Minnesota.
Gustavo H Paetzold and Lucia Specia. 2016. Unsupervised lexical simplification for non-native speakers. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311–318.
Ellie Pavlick and Chris Callison-Burch. 2016. Simple PPDB: A paraphrase database for simplification. In Proceedings of ACL 2016, pages 143–148.
Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of ACL-IJCNLP 2015.
J. Rapin and O. Teytaud. 2018. Nevergrad - A gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad.
Luz Rello, Ricardo Baeza-Yates, Stefan Bott, and Horacio Saggion. 2013. Simplify or help?: Text simplification strategies for people with dyslexia. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility, page 15. ACM.
Horacio Saggion, Sanja Štajner, Stefan Bott, Simon Mille, Luz Rello, and Biljana Drndarevic. 2015. Making it Simplext: Implementation and evaluation of a text simplification system for Spanish. ACM Transactions on Accessible Computing, 6(4):1–36.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944.
Sanja Štajner, Iacer Calixto, and Horacio Saggion. 2015. Automatic text simplification for Spanish: Comparative evaluation of various simplification strategies. In Proceedings of the International Conference Recent Advances in Natural Language Processing.
Sanja Štajner, Marc Franco-Salvador, Paolo Rosso, and Simone Paolo Ponzetto. 2018. CATS: A tool for customized alignment of text simplification corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
Elior Sulem, Omri Abend, and Ari Rappoport. 2018. Semantic structural evaluation for text simplification. In Proceedings of NAACL-HLT 2018, pages 685–696.
Sai Surya, Abhijit Mishra, Anirban Laha, Parag Jain, and Karthik Sankaranarayanan. 2019. Unsupervised neural text simplification. In Proceedings of ACL 2019, pages 2058–2068.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of LREC 2012.
Sara Tonelli, Alessio Palmero Aprosio, and Francesca Saltori. 2016. SIMPITIKI: A simplification corpus for Italian. In Proceedings of the Third Italian Conference on Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzman, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of ACL 2018, pages 451–462, Melbourne, Australia.
Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of EMNLP 2011, pages 409–420, Edinburgh, Scotland, UK.
Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of ACL 2012, pages 1015–1024.
Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of EMNLP 2017, pages 584–594, Copenhagen, Denmark.
Sanqiang Zhao, Rui Meng, Daqing He, Andi Saptono, and Bambang Parmanto. 2018. Integrating transformer and paraphrase rules for sentence simplification. In Proceedings of EMNLP 2018, pages 3164–3173, Brussels, Belgium.
Yanbin Zhao, Lu Chen, Zhi Chen, and Kai Yu. 2020. Semi-supervised text simplification with back-translation and asymmetric denoising autoencoders. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence.
Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1353–1361.


Appendices

A Experimental details

In this section we describe specific details of our experimental procedure. Figure 2 is an overall reminder of our method presented in the main paper.

Figure 2: Sentence Simplification Models for Any Language without Simplification Data. Sentences from the web are used to create a large-scale index that allows mining millions of paraphrases. Subsequently, we finetune pretrained models augmented with controllable mechanisms on the paraphrase corpora to achieve sentence simplification models in any language.

A.1 Mining Details

Sequence Extraction
We only consider documents from the HEAD split in CCNet; this represents the third of the data with the best perplexity according to a language model.

Paraphrase Mining
We compute LASER embeddings of dimension 1024 and reduce dimensionality with a PCA to 512 dimensions followed by a random rotation. We further compress them using 8-bit scalar quantization. The compressed embeddings are then stored in a faiss inverted file index with 32,768 cells (nprobe=16). These embeddings are used to mine pairs of paraphrases. We return the top-8 nearest neighbors, and keep those with L2 distance lower than 0.05 and relative distance compared to other top-8 nearest neighbors lower than 0.6.
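A sketch of this index configuration via the faiss index factory; the factory string is our reading of the description above (PCA with random rotation to 512 dimensions, 32,768 IVF cells, 8-bit scalar quantization):

```python
import faiss

DIM = 1024  # LASER embedding dimension

# "PCAR512": PCA to 512 dimensions with a random rotation;
# "IVF32768": inverted file index with 32,768 cells;
# "SQ8": 8-bit scalar quantization.
index = faiss.index_factory(DIM, "PCAR512,IVF32768,SQ8", faiss.METRIC_L2)
faiss.extract_index_ivf(index).nprobe = 16  # cells visited per query

# Training and population (embeddings are float32 (n, 1024) arrays):
# index.train(training_sample); index.add(all_embeddings)
```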

Paraphrase Filtering
The resulting paraphrases are filtered to remove almost identical paraphrases by enforcing a case-insensitive character-level Levenshtein distance (Levenshtein, 1966) greater than or equal to 20%. We remove paraphrases that come from the same document to avoid aligning sequences that overlapped each other in the text. We also remove paraphrases where one of the sequences is contained in the other. We further filter out any sequence that is present in our evaluation datasets.

A.2 Training Details

Seq2Seq Training
We implement our models with fairseq (Ott et al., 2019). All our models are Transformers (Vaswani et al., 2017) based on the BART Large architecture (388M parameters), keeping the optimization procedure and hyper-parameters fixed to those used in the original implementation (Lewis et al., 2019).⁷ We either randomly initialize weights for the standard sequence-to-sequence experiments or initialize with pretrained BART for the BART experiments. When initializing the weights randomly, we use a learning rate of 3e-4 versus the original 3e-5 when finetuning BART. For a given seed, the model is trained on 8 Nvidia V100 GPUs for approximately 10 hours.

Controllable Generation
For controllable generation, we use the open-source ACCESS implementation (Martin et al., 2020). We use the same control parameters as the original paper, namely length, Levenshtein similarity, lexical complexity, and syntactic complexity.⁸

As mentioned in Section 3.2 (Simplifying with ACCESS), we select the 4 ACCESS hyper-parameters using SARI on the validation set. We use zero-order optimization with the NEVERGRAD library (Rapin and Teytaud, 2018). We use the OnePlusOne optimizer with a budget of 64 evaluations (approximately 1 hour of optimization on a single GPU). The hyper-parameters are constrained to the [0.2, 1.5] interval. The 4 hyper-parameter values are then kept fixed for all sentences in the associated test set.
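A minimal sketch of this tuning loop with NEVERGRAD, assuming a placeholder evaluate_sari function that decodes the validation set with the given 4 control ratios and returns SARI (higher is better):

```python
import nevergrad as ng

def tune_controls(evaluate_sari):
    """Zero-order search over the 4 ACCESS control ratios in [0.2, 1.5]."""
    param = ng.p.Array(init=[0.75, 0.75, 0.75, 0.75]).set_bounds(0.2, 1.5)
    optimizer = ng.optimizers.OnePlusOne(parametrization=param, budget=64)
    # NEVERGRAD minimizes, so negate SARI to maximize it.
    recommendation = optimizer.minimize(lambda x: -evaluate_sari(x))
    return recommendation.value  # best [NumChars, LevSim, WordFreq, DepTreeDepth]
```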

Translation Model for Pivot Baseline
For the pivot baseline we train models on CCMatrix (Schwenk et al., 2019). Our models use the Transformer architecture with 240 million parameters with LayerDrop (Fan et al., 2019). We train for 36 hours on 8 GPUs following the suggested parameters in Ott et al. (2019).

Gold Reference Baseline
To avoid creating a discrepancy in the number of references between the gold reference scores, where we leave one reference out, and the model evaluations, which use all references, we compensate by duplicating one of the other references at random so that the total number of references is unchanged.

⁷ All hyper-parameters and training commands for fairseq can be found at https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md

⁸ We modify the Levenshtein similarity parameter to only consider replace operations, by assigning a 0 weight to insertions and deletions. This change helps decorrelate the Levenshtein similarity control token from the length control token and produced better results in preliminary experiments.


tween the gold reference scores, where we leaveone reference out, and when we evaluate the mod-els with all references, we compensate by duplicat-ing one of the other references at random so thatthe total number of references is unchanged.

A.3 Evaluation Details

SARI Score Computation We use the latest version of SARI implemented in EASSE (Alva-Manchego et al., 2019), which fixes bugs and inconsistencies of the traditional implementation of SARI. As a consequence, we also recompute scores from previous systems that we compare to. We do so by using the system predictions provided by the respective authors and available in EASSE.
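For reference, computing SARI with EASSE's Python API looks like the following sketch (the sentences are placeholders; refs_sents holds one list of sentences per reference):

from easse.sari import corpus_sari

orig_sents = ["About 95 species are currently accepted."]
sys_sents = ["About 95 species are accepted."]
refs_sents = [["About 95 species are currently known."],
              ["About 95 species are accepted."]]

score = corpus_sari(orig_sents=orig_sents,
                    sys_sents=sys_sents,
                    refs_sents=refs_sents)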

ALECTOR Sentence-Level Alignment The ALECTOR corpus comes as source documents and their manual simplifications, but no sentence-level alignment is provided. Fortunately, most of these documents were simplified line by line, each line consisting of a few sentences. For each source document, we therefore align each line, provided it is not too long (fewer than 6 sentences), with the most appropriate line in the simplified document, using the LASER embedding space. The resulting alignments are split into validation and test sets by randomly sampling documents for validation (450 sentence pairs) and keeping the rest for test (416 sentence pairs).
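A sketch of this alignment procedure, assuming a hypothetical embed() helper that returns one L2-normalized LASER embedding per line and a crude punctuation-based sentence count:

import numpy as np

def align_documents(src_lines, simp_lines, embed, max_sentences=6):
    """Align each short-enough source line to its nearest simplified line
    in the LASER embedding space (cosine similarity on normalized rows)."""
    src_emb = embed(src_lines)      # (n_src, 1024), rows L2-normalized
    simp_emb = embed(simp_lines)    # (n_simp, 1024)
    sims = src_emb @ simp_emb.T
    pairs = []
    for i, line in enumerate(src_lines):
        # Crude sentence count; a proper sentence splitter could be used.
        n_sents = line.count(".") + line.count("!") + line.count("?")
        if n_sents >= max_sentences:
            continue
        j = int(np.argmax(sims[i]))
        pairs.append((line, simp_lines[j]))
    return pairs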

A.4 Links to Datasets

The datasets we used are available at the following addresses:

• CCNet: https://github.com/facebookresearch/cc_net

• WikiLarge: https://github.com/XingxingZhang/dress

• ASSET: https://github.com/facebookresearch/asset or https://github.com/feralvam/easse

• TurkCorpus: https://github.com/cocoxu/simplification/ or https://github.com/feralvam/easse

• Newsela: This dataset has to be requested at https://newsela.com/data.

• ALECTOR: This dataset has to be requested from the authors (Gala et al., 2020).

B Characteristics of the Mined Data

We show in Figure 3 the distribution of different surface features of our mined data versus those of WikiLarge. Some examples of mined paraphrases are shown in Table 6.

C Setting ACCESS Control Parameters Without Parallel Data

In our experiments we adjusted our model to the different dataset conditions by selecting our ACCESS control tokens with SARI on each validation set. When no such parallel validation set exists, we show that strong performance can still be obtained by using prior knowledge for the given downstream application. This can be done by setting all 4 ACCESS control hyper-parameters to an intuitive guess of the desired compression ratio.

To illustrate this for the considered evaluation datasets, we first independently sample 50 source sentences and 50 random unaligned simple sentences from each validation set. These two groups of non-parallel sentences are used to approximate the character-level compression ratio between complex and simplified sentences: we divide the average length of the 50 simple sentences by the average length of the 50 source sentences. We finally use this approximated compression ratio as the value of all 4 ACCESS hyper-parameters. In practice, we obtain the following approximations (rounded to the nearest 0.05): ASSET = 0.8, TurkCorpus = 0.95, and Newsela = 0.4. Results in Table 7 show that the resulting model performs very close to the one whose ACCESS hyper-parameters are adjusted using SARI on the complete validation set.
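A sketch of this approximation; the function name and sampling seed are illustrative:

import random

def approximate_control_value(complex_sents, simple_sents, n=50, seed=0):
    """Estimate the character-level compression ratio from n unaligned
    sentences per side, rounded to the nearest 0.05."""
    rng = random.Random(seed)
    src = rng.sample(complex_sents, n)
    simp = rng.sample(simple_sents, n)
    ratio = (sum(map(len, simp)) / n) / (sum(map(len, src)) / n)
    return round(ratio / 0.05) * 0.05

# The same value is then used for all four ACCESS control tokens.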

D Comparing to Existing Paraphrase Datasets

We compare using our mined paraphrase data with existing large-scale paraphrase datasets in Table 8. We use PARANMT (Wieting and Gimpel, 2018), a large paraphrase dataset created using back-translation on an existing labeled parallel machine translation dataset. We use the same 5 million top-scoring sentences that the authors used to train their sentence embeddings. Training MUSS on the mined data or on PARANMT obtains similar results for text simplification, confirming that mining paraphrase data is a viable alternative to using existing paraphrase datasets that rely on labeled parallel machine translation corpora.


[Figure 3: four density plots comparing WikiLarge and the mined data along compression levels, WordRank ratio, target length, and replace-only Levenshtein similarity.]

Figure 3: Density of several text features in WikiLarge and our mined data. The WordRank ratio is a measure of lexical complexity reduction (Martin et al., 2020). Replace-only Levenshtein similarity only considers replace operations in the traditional Levenshtein similarity and assigns 0 weight to insertions and deletions.
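One plausible way to compute such a replace-only similarity is to count only characters touched by "replace" opcodes and ignore pure insertions and deletions; the sketch below approximates this with difflib's opcode alignment rather than a true weighted Levenshtein alignment, and the normalization by the longer string is an assumption:

import difflib

def replace_only_similarity(a: str, b: str) -> float:
    """Similarity that penalizes only replaced characters."""
    ops = difflib.SequenceMatcher(a=a, b=b).get_opcodes()
    replaced = sum(max(i2 - i1, j2 - j1)
                   for tag, i1, i2, j1, j2 in ops if tag == "replace")
    return 1.0 - replaced / max(len(a), len(b), 1)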

Query: For insulation, it uses foam-injected polyurethane which helps ensure the quality of the ice produced by the machine. It comes with an easy to clean air filter.
Mined: It has polyurethane for insulation which is foam-injected. This helps to maintain the quality of the ice it produces. The unit has an easy to clean air filter.

Query: Here are some useful tips and tricks to identify and manage your stress.
Mined: Here are some tips and remedies you can follow to manage and control your anxiety.

Query: As cancer cells break apart, their contents are released into the blood.
Mined: When brain cells die, their contents are partially spilled back into the blood in the form of debris.

Query: The trail is ideal for taking a short hike with small children or a longer, more rugged overnight trip.
Mined: It is the ideal location for a short stroll, a nature walk or a longer walk.

Query: Thank you for joining us, and please check out the site.
Mined: Thank you for calling us. Please check the website.

Table 6: Examples of Mined Paraphrases. Paraphrases, although sometimes not preserving the entire meaning, display various rewriting operations, such as lexical substitution, compression, or sentence splitting.

Method           ASSET SARI ↑   TurkCorpus SARI ↑   Newsela SARI ↑
SARI on valid    42.65±0.23     40.85±0.15          38.09±0.59
Approx. value    42.49±0.34     39.57±0.40          36.16±0.35

Table 7: Setting ACCESS Controls Without Parallel Data. Setting the ACCESS parameters of the MUSS + MINED model either using SARI on the validation set or using only 50 unaligned sentences from the validation set. All ACCESS parameters are set to the same approximated value (ASSET = 0.8, TurkCorpus = 0.95, and Newsela = 0.4).


Data        ASSET SARI ↑   TurkCorpus SARI ↑   Newsela SARI ↑
MINED       42.65±0.23     40.85±0.15          38.09±0.59
PARANMT     42.50±0.33     40.50±0.16          39.11±0.88

Table 8: Mined Data vs. ParaNMT. We compare SARI scores of MUSS trained either on our mined data or on PARANMT (Wieting and Gimpel, 2018) on the test sets of ASSET, TurkCorpus, and Newsela.

E Influence of BART on Fluency

In Table 9, we present some selected samples that highlight the improved fluency of simplifications when using BART.

F Additional Scores

BLEU We report additional BLEU scores for completeness. These results are displayed along with SARI and FKGL for English. These BLEU scores should be interpreted carefully: they have been found to correlate poorly with human judgments of simplicity (Sulem et al., 2018). Furthermore, the identity baseline achieves very high BLEU scores on some datasets (e.g. 92.81 on ASSET or 99.36 on TurkCorpus), which underlines the weaknesses of this metric.

Validation Scores We report English validation scores in Table 11 to foster reproducibility.

Seq2Seq Models on Mined Data When training a Transformer sequence-to-sequence model (Seq2Seq) on either WikiLarge or the mined corpus, the model trained on the mined data performs better. It is surprising that a model trained solely on paraphrases achieves such good results on simplification benchmarks. Previous works have shown that simplification models suffer from not making enough modifications to the source sentence and found that forcing models to rewrite the input was beneficial (Wubben et al., 2012; Martin et al., 2020). This is confirmed when investigating the F1 deletion component of SARI, which is 20 points higher for the model trained on paraphrases.


Original: They are culturally akin to the coastal peoples of Papua New Guinea.
ACCESS: They're culturally much like the Papua New Guinea coastal peoples.
BART+ACCESS: They are closely related to coastal people of Papua New Guinea

Original: Orton and his wife welcomed Alanna Marie Orton on July 12, 2008.
ACCESS: Orton and his wife had been called Alanna Marie Orton on July 12.
BART+ACCESS: Orton and his wife gave birth to Alanna Marie Orton on July 12, 2008.

Original: He settled in London, devoting himself chiefly to practical teaching.
ACCESS: He set up in London and made himself mainly for teaching.
BART+ACCESS: He settled in London and devoted himself to teaching.

Table 9: Influence of BART on Simplifications. We display some examples of generations that illustrate how BART improves the fluency and meaning preservation of generated simplifications.



Method              Data                ASSET                               TurkCorpus                          Newsela
                                        SARI ↑      BLEU ↑      FKGL ↓      SARI ↑      BLEU ↑      FKGL ↓      SARI ↑      BLEU ↑      FKGL ↓

Baselines and Gold Reference
Identity Baseline   —                   20.73       92.81       10.02       26.29       99.36       10.02       —           —           —
Truncate Baseline   —                   29.85       84.94       7.91        33.10       88.82       7.91        —           —           —
Reference           —                   44.87±0.36  68.95±1.33  6.49±0.15   40.04±0.30  73.56±1.18  8.77±0.08   —           —           —

Supervised Systems (This Work)
Seq2Seq             WikiLarge           32.71±1.55  88.56±1.06  8.62±0.34   35.79±0.89  90.24±2.52  8.63±0.34   22.23±1.99  21.75±0.45  8.00±0.26
MUSS                WikiLarge           43.63±0.71  76.28±4.30  6.25±0.42   42.62±0.27  78.28±3.95  6.98±0.95   40.00±0.63  14.42±6.85  3.51±0.53
MUSS                WikiLarge + MINED   44.15±0.56  72.98±4.27  6.05±0.51   42.53±0.36  78.17±2.20  7.60±1.06   39.50±0.42  15.52±0.99  3.19±0.49
MUSS                Newsela             42.91±0.58  71.40±6.38  6.91±0.42   41.53±0.36  74.29±4.67  7.39±0.42   42.59±1.00  18.61±4.49  2.74±0.98
MUSS                Newsela + MINED     41.36±0.48  78.35±2.83  6.96±0.26   40.01±0.51  83.77±1.00  8.26±0.36   41.17±0.95  16.87±4.55  2.70±1.00

Unsupervised Systems (This Work)
Seq2Seq             MINED               38.03±0.63  61.76±2.19  9.41±0.07   38.06±0.47  63.70±2.43  9.43±0.07   30.36±0.71  12.98±0.32  8.85±0.13
MUSS (MBART)        MINED               41.11±0.70  77.22±2.12  7.18±0.21   39.40±0.54  77.05±3.02  8.65±0.40   34.76±0.96  19.06±1.15  5.44±0.25
MUSS (BART)         MINED               42.65±0.23  66.23±4.31  8.23±0.62   40.85±0.15  63.76±4.26  8.79±0.30   38.09±0.59  14.91±1.39  5.12±0.47

Table 10: Detailed English Results. We display SARI, BLEU, and FKGL on the ASSET, TurkCorpus, and Newsela English evaluation datasets (test sets).

Method              Data                ASSET                               TurkCorpus                          Newsela
                                        SARI ↑      BLEU ↑      FKGL ↓      SARI ↑      BLEU ↑      FKGL ↓      SARI ↑      BLEU ↑      FKGL ↓

Baselines and Gold Reference
Identity Baseline   —                   22.53       94.44       9.49        26.96       99.27       9.49        12.00       20.69       8.77
Truncate Baseline   —                   29.95       86.67       7.39        32.90       89.10       7.40        24.64       18.97       6.90
Reference           —                   45.22±0.94  72.67±2.83  6.13±0.56   40.66±0.11  77.21±0.45  8.31±0.04   —           —           —

Supervised Systems (This Work)
Seq2Seq             WikiLarge           33.87±1.90  90.21±1.14  8.31±0.34   35.87±1.09  91.06±2.24  8.31±0.34   20.89±4.08  20.97±0.53  8.27±0.46
MUSS                WikiLarge           45.58±0.28  78.85±4.44  5.61±0.31   43.26±0.42  78.39±3.08  6.73±0.38   39.66±1.80  14.82±7.17  4.64±1.85
MUSS                WikiLarge + MINED   45.50±0.69  73.16±4.41  5.83±0.51   43.17±0.19  77.52±3.01  7.19±1.02   40.50±0.56  16.30±0.97  3.57±0.60
MUSS                Newsela             43.91±0.10  70.06±10.05 6.47±0.29   41.94±0.21  74.03±7.51  6.99±0.53   42.36±1.32  19.18±6.03  3.20±1.01
MUSS                Newsela + MINED     42.48±0.41  77.86±3.13  6.41±0.13   40.77±0.52  83.04±1.16  7.68±0.30   41.68±1.60  17.23±5.28  2.97±0.91

Unsupervised Systems (This Work)
Seq2Seq             MINED               38.88±0.22  61.80±0.94  8.63±0.13   37.51±0.10  62.04±0.91  8.64±0.13   30.35±0.23  13.04±0.45  8.87±0.12
MUSS (MBART)        MINED               41.68±0.72  77.11±2.02  6.56±0.21   39.60±0.44  75.64±2.85  8.04±0.40   34.59±0.59  18.19±1.26  5.76±0.22
MUSS (BART)         MINED               43.01±0.23  67.65±4.32  7.75±0.53   40.61±0.18  63.56±4.30  8.28±0.18   38.07±0.22  14.43±0.97  5.40±0.41

Table 11: English Results on Validation Sets. We display SARI, BLEU, and FKGL on the ASSET, TurkCorpus, and Newsela English datasets (validation sets).