
Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

Paloma Jeretič¹, Alex Warstadt¹, Suvrat Bhooshan², Adina Williams²
¹Department of Linguistics, New York University

²Facebook AI Research
paloma@nyu.edu, warstadt@nyu.edu, sbh@fb.com, adinawilliams@fb.com

Abstract

Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether one sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of 32K semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, BOW, and InferSent NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI contains vanishingly few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences: it reliably treats implicatures triggered by “some” as entailments. For some presupposition triggers like only, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment-canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences.

1 Introduction

One of the most foundational semantic discoveries is that systematic rules govern the inferential relationships between pairs of natural language sentences (Aristotle, De Interpretatione, Ch. 6). In natural language processing, Natural Language Inference (NLI)—a task whereby a system determines whether a pair of sentences instantiates an entailment, a contradiction, or a neutral relation—has been useful for training and evaluating models on sentential reasoning. However, linguists and philosophers now recognize that there are separate semantic and pragmatic modes of reasoning (Grice, 1975; Clark, 1996; Beaver, 1997; Horn and Ward, 2004; Potts, 2015), and it is not clear which of these modes, if either, NLI models learn. We investigate two pragmatic inference types that are known to differ from classical entailment: scalar implicatures and presuppositions. As shown in Figure 1, implicatures differ from entailments in that they can be denied, and presuppositions differ from entailments in that they are not canceled when placed in entailment-canceling environments (e.g., negation, questions).

Figure 1: Illustration of key properties of classical entailments, implicatures, and presuppositions. Solid arrows indicate valid commonsense entailments, and arrows with X’s indicate lack of entailment. Dashed arrows indicate follow-up statements with the addition of in fact, which can either be acceptable (marked with ‘✓’) or unacceptable (marked with ‘✗’).

To enable research into the relationship between NLI and pragmatic reasoning, we introduce IMPPRES, a fine-grained NLI-style diagnostic test dataset for probing how well NLI models perform implicature and presupposition. Containing 32K sentence pairs illustrating key properties of these pragmatic inference types, IMPPRES is automatically generated according to expert-crafted grammars, allowing us to create a large and lexically varied, but nonetheless controlled, dataset isolating specific instances of both types.

To test whether NLI models learn to do well on IMPPRES, we first need to determine if presuppositions and implicatures are present in our training data. We take MultiNLI (Williams et al., 2018) as a case study, and find it has very few instances of pragmatic inference (see §4). Given this, we ask whether training on MultiNLI is sufficient for models to generalize about these largely absent commonsense reasoning types. We find that generalization is possible: the BERT NLI model shows evidence of pragmatic reasoning when tested on the implicature from some to not all, as well as on several presupposition subdatasets (only, cleft existence, possessive existence, questions). We also obtain some negative results, which are likely due to the models lacking a sufficiently sophisticated understanding of the meanings of the lexical triggers for implicature and presupposition (e.g., BERT treats several word pairs as synonyms, most notably or and and).

Our contributions are: (i) we provide a new diagnostic test set to probe for pragmatic inferences, complete with linguistic controls, (ii) to our knowledge, we present the first work evaluating deep NLI models on specific pragmatic inferences, and (iii) we show that BERT models can perform some types of pragmatic reasoning very well, even when trained on NLI data containing very few explicit examples of pragmatic reasoning.

2 Background: Pragmatic Inference

We take pragmatic inference to be a relation between two sentences relying on the utterance context and the conversational goals of interlocutors. Pragmatic inference contrasts with semantic entailment, which instead captures the logical relationship between isolated sentence meanings (Grice, 1975; Stalnaker, 1977). We present implicature and presupposition inferences below.

2.1 Implicature

Broadly speaking, implicatures contrast with entailments in that they are inferences suggested by the speaker, but not included in the literal meaning of the sentence (Grice, 1975). Although there are many types of implicatures, we focus here on scalar implicatures. Scalar implicatures are inferences, often optional,¹ which can be drawn when one member of a memorized lexical scale (e.g., 〈some, all〉) is uttered (see §6.1). For example, when someone utters Jo ate some of the cake, they suggest that Jo didn’t eat all of the cake (see Figure 1 for more examples). According to Neo-Gricean pragmatic theory (Horn, 1989; Levinson, 2000), the inference Jo didn’t eat all of the cake arises because some has a more informative lexical alternative all that could have been uttered instead. We expect the speaker to make the most informative true statement:² as a result, the listener should infer that a stronger statement, where some is replaced by all, is false.

¹Implicature computation can depend on the cooperativity of the speakers, or on any aspect of the context of utterance (lexical, syntactic, semantic/pragmatic, discourse). See Degen (2015a) for a study of the high variability of implicature computation, and the factors responsible for it.

Type               Example

Trigger            Jo’s cat yawned.
Presupposition     Jo has a cat.

Negated Trigger    Jo’s cat didn’t yawn.
Modal Trigger      It’s possible that Jo’s cat yawned.
Interrog. Trigger  Did Jo’s cat yawn?
Cond. Trigger      If Jo’s cat yawned, it’s OK.

Negated Prsp.      Jo doesn’t have a cat.
Neutral Prsp.      Amy has a cat.

Table 1: Sample generated presupposition paradigm. Examples adapted from the ‘change-of-state’ dataset.

Implicatures differ from entailments (and, as we will see, presuppositions; see Figure 1) in that they are deniable, i.e., they can be explicitly negated without resulting in a contradiction. For example, someone can utter Jo ate some of the cake, followed by In fact, Jo ate all of it. In this case, the implicature (i.e., Jo didn’t eat all the cake from above) has been denied. We thus distinguish implicated meaning from literal, or logical, meaning.

2.2 Presupposition

Presuppositions of a sentence are facts that the speaker takes for granted when uttering a sentence (Stalnaker, 1977; Beaver, 1997). Presuppositions are generally associated with the presence of certain expressions, known as presupposition triggers. For example, in Figure 1, the definite description the cake triggers the presupposition that there is a cake (Russell, 1905). Other examples of presupposition triggers are shown in Table 1.

Presuppositions differ from other inference types in that they often project out of operators like questions and negation, meaning that they remain valid inferences even when embedded under these operators (Karttunen, 1973). The inference that there is a cake survives even when the presupposition trigger is in a question (Did Jordan eat some of the cake?), as shown in Figure 1. However, in questions, classical entailments and implicatures disappear. Table 1 provides examples of triggers projecting out of several entailment-canceling operators: negation, modals, interrogatives, and conditionals.

²This follows from Grice’s (1975) cooperative principle and the assumption that speakers are opinionated (Gazdar, 1979).

It is necessary to clarify in what sense presupposition is a pragmatic inference. There is no consensus on whether presuppositions should be considered part of the semantic content of expressions (see Stalnaker, 1977; Heim, 1983, for opposing views). However, presuppositions may come to be inferred via accommodation, a pragmatic process by which a listener infers the truth of some new fact based on its being presupposed by the speaker (Lewis, 1979). For instance, if Jordan tells Harper that the King of Sweden wears glasses, and Harper did not previously know that Sweden has a king, they would learn this fact by accommodation. With respect to NLI, any presupposition in the premise (short of world knowledge) will be new information, and therefore accommodation should be necessary to recognize it as entailed.

3 Related Work

NLI has been framed as a commonsense reasoning task (Dagan et al., 2006; Manning, 2006). One early formulation of NLI defines “entailment” as holding for sentences p and h whenever, “typically, a human reading p would infer that h is most likely true. . . [given] common human understanding of language [and] common background knowledge” (Dagan et al., 2006). Although this sparked lively debate regarding the terms inference and entailment—and whether a coherent notion of “inference” could (and should) be defined (Zaenen et al., 2005; Manning, 2006; Crouch et al., 2006)—in recent work on NLI, common sense formulations of “inference” have been widely adopted (Bowman et al., 2015; Williams et al., 2018), largely for practical reasons, i.e., they enable untrained annotators to easily participate in dataset creation.

NLI itself has been steadily gaining in popularity; many datasets for training and/or testing systems are now available, including: FraCaS (Cooper et al., 1994), RTE (Dagan et al., 2006; Mirkin et al., 2009; Dagan et al., 2013), Sentences Involving Compositional Knowledge (Marelli et al., 2014, SICK), large-scale image captioning as NLI (Bowman et al., 2015, SNLI), recasting other datasets into NLI (Reisinger et al., 2015; White et al., 2017; Poliak et al., 2018), ordinal common sense inference (Zhang et al., 2017, JOCI), Multi-Premise Entailment (Lai et al., 2017, MPE), NLI over multiple genres of written and spoken English (Williams et al., 2018, MultiNLI), adversarially filtered common sense reasoning sentences (Zellers et al., 2018, 2019, (Hella)SWAG), explainable annotations for SNLI (Camburu et al., 2018, e-SNLI), cross-lingual NLI (Conneau et al., 2018, XNLI), science question answering as NLI (Khot et al., 2018, SciTail), question answering recast as NLI (part of Wang et al. 2019, GLUE), and NLI for dialog (Welleck et al., 2019). Other NLI datasets are created specifically to identify where models fail (Glockner et al., 2018; Naik et al., 2018; McCoy et al., 2019; Schmitt and Schütze, 2019), many of which are also automatically generated (Geiger et al., 2018; Yanaka et al., 2019a,b; Kim et al., 2019; Nie et al., 2019).

As datasets for NLI become increasingly numerous, one might wonder: do we need yet another NLI dataset? In this case, the answer is clearly yes: despite NLI’s formulation as a common sense reasoning task, no extant NLI dataset has been created to explicitly probe pragmatic inferences in contrast with entailment. IMPPRES allows us to ask whether NLI models trained on common sense reasoning perform well on specific types of pragmatic inference, without explicit training.

Beyond NLI, several recent works evaluate sentence understanding models for knowledge of pragmatic inferences. de Marneffe et al. (2019) introduce the CommitmentBank, a dataset of human judgments measuring how strongly presuppositions of clause-embedding verbs project, and Jiang and de Marneffe (2019) find that an LSTM has limited ability to predict the strength of the inference. Degen (2015b) introduces a dataset measuring the strength of the implicature from some to not all with crowd-sourced judgments. Schuster et al. (2019) find that an LSTM with supervision on this dataset can predict human judgments well. Outside the topic of sentential inference, Cianflone et al. (2018) create sentence-level adverbial presupposition datasets and train a binary classifier to detect contexts in which presupposition triggers (e.g., too, again) can be used.


Premise   Hypothesis  Relation type                     Logical label   Pragmatic label  Item type

some      not all     implicature (+ to −)              neutral         entailment       target
not all   some        implicature (− to +)              neutral         entailment       target

some      all         negated implicature (+)           neutral         contradiction    target
all       some        reverse negated implicature (+)   entailment      contradiction    target
not all   none        negated implicature (−)           neutral         contradiction    target
none      not all     reverse negated implicature (−)   entailment      contradiction    target

all       none        opposite                          contradiction   contradiction    control
none      all         opposite                          contradiction   contradiction    control
some      none        negation                          contradiction   contradiction    control
none      some        negation                          contradiction   contradiction    control
all       not all     negation                          contradiction   contradiction    control
not all   all         negation                          contradiction   contradiction    control

Table 2: Paradigm for the scalar implicature datasets, with 〈some, all〉 as an example.

4 Annotating MultiNLI for Pragmatics

Although Williams et al. (2018) report that 22% of the MultiNLI development set sentence pairs contain lexical triggers (such as regret or stopped) in the premise and/or hypothesis, the mere presence of presupposition-triggering lexical items in the data is insufficient to answer the question, since the sentential inference may focus on other types of information. To address this, we randomly selected 200 sentence pairs from the MultiNLI matched development set and presented them to three expert annotators with a combined total of 17 years of training in formal semantics and pragmatics. Annotators answered the following questions for each pair: (1) are the sentences P and H related by a presupposition/implicature relation (entails/is entailed by, negated or not); (2) what subtype of inference is it (e.g., existence presupposition, 〈some, all〉 implicature); (3) is the presupposition trigger embedded under an entailment-canceling operator?

Very few MultiNLI pairs relied on pragmatic inferences. We find only eight presupposition pairs and three implicature pairs on which at least two annotators agreed. Moreover, of the inference subtypes tested in IMPPRES, we only found one possession presupposition. All others were tagged as existence presuppositions and conversational implicatures. We conclude that any positive pragmatic results from models trained on MultiNLI and tested on IMPPRES can reasonably be assumed to derive from generalization from basic semantic entailment relations and independent semantic knowledge acquired in the original training data seen by these models.

5 Methods

Data Generation. IMPPRES consists of semi-automatically generated pairs of sentences with NLI labels illustrating key properties of implicatures and presuppositions. Generating data lets us control the lexical and syntactic content so that we can guarantee that the sentence pairs in IMPPRES evaluate the desired phenomenon (see Ettinger et al., 2016, for related discussion). We generate IMPPRES according to expert-crafted grammars using a codebase developed by Warstadt et al. (2019). The codebase includes a vocabulary of over 3000 lexical items annotated with grammatical features needed to ensure morphological, syntactic, and semantic well-formedness. The same codebase is used for both the implicature and presupposition parts of IMPPRES (see §6.1 and §7.1 for detailed descriptions of the implicature and presupposition data). We will release our generation code upon acceptance.
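The generation codebase itself is not reproduced here, but the following minimal sketch illustrates the general template-filling approach: a small, hypothetical vocabulary is combined by a toy grammar into premise–hypothesis pairs carrying their gold labels. All names, templates, and lexical items below are illustrative assumptions, not the actual Warstadt et al. (2019) codebase.

```python
import random

# Toy vocabulary; the real codebase annotates over 3000 lexical items with
# grammatical features that ensure morphological and semantic well-formedness.
PLURAL_NOUNS = ["skateboards", "bananas", "cats"]
PAST_VERBS = ["tipped over", "fell", "appeared"]

def generate_some_all_pair(rng: random.Random) -> dict:
    """Build one premise/hypothesis pair for the <some, all> scale.

    Mirrors the 'implicature (+ to -)' row of Table 2: a premise with
    'some' implicates a hypothesis with 'not all'.
    """
    noun, verb = rng.choice(PLURAL_NOUNS), rng.choice(PAST_VERBS)
    return {
        "premise": f"Some {noun} {verb}.",
        "hypothesis": f"Not all {noun} {verb}.",
        "relation_type": "implicature (+ to -)",
        "logical_label": "neutral",
        "pragmatic_label": "entailment",
        "item_type": "target",
    }

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(generate_some_all_pair(rng))
```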

Models. We create IMPPRES for the purpose of evaluating what NLI models learn from standard datasets like MultiNLI. We intend IMPPRES to be used only for evaluation, since supervised classifiers trained on automatically generated data may learn to exploit simple regularities in the data. Accordingly, our experiments evaluate NLI models trained on MultiNLI and built on top of three sentence encoding models: a Bag of Words (BOW) model, InferSent (Conneau et al., 2017), and BERT-Large (Devlin et al., 2018). The BOW and InferSent models use 300D GloVe embeddings as word representations (Pennington et al., 2014). For the BOW baseline, word embeddings for the premise and hypothesis are separately summed to create sentence representations, which are concatenated to form a single sentence-pair representation that is fed to a logistic regression softmax classifier. For the InferSent model, GloVe embeddings for the words in the premise and hypothesis are respectively fed into a bidirectional LSTM, after which we concatenate the representations for premise and hypothesis, their difference, and their element-wise product (Mou et al., 2015). BERT is a multilayer bidirectional transformer pretrained with the masked language modeling and next sentence prediction objectives, and fine-tuned on the MultiNLI dataset. We concatenate the premise and hypothesis after a special [CLS] token and separate them with the [SEP] token; the BERT representation of the [CLS] token is fed into a classifier. For the BERT model, we use pretrained weights from a transformer trained on the Toronto Books Corpus and Wikipedia, via its HuggingFace PyTorch implementation.³

Figure 2: Results on Controls (Implicatures)
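As a concrete illustration of how the sentence-pair representation is assembled for the classifier, here is a minimal PyTorch sketch of the two combination schemes described above: concatenation of summed word embeddings for the BOW baseline, and the concatenation of the two encoder outputs, their absolute difference, and their element-wise product for InferSent. Dimensions and function names are illustrative assumptions, not the actual training code.

```python
import torch

def bow_pair_features(premise_emb: torch.Tensor, hypothesis_emb: torch.Tensor) -> torch.Tensor:
    """BOW baseline: sum GloVe vectors per sentence, then concatenate.

    premise_emb, hypothesis_emb: (num_words, 300) GloVe embeddings.
    Returns a (600,) sentence-pair feature for a softmax classifier.
    """
    u = premise_emb.sum(dim=0)
    v = hypothesis_emb.sum(dim=0)
    return torch.cat([u, v], dim=-1)

def infersent_pair_features(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """InferSent-style combination (Mou et al., 2015; Conneau et al., 2017).

    u, v: sentence vectors from the BiLSTM encoder for premise and
    hypothesis. Returns [u; v; |u - v|; u * v].
    """
    return torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)

# Example with random stand-ins for real embeddings / encoder outputs.
p, h = torch.randn(7, 300), torch.randn(5, 300)
print(bow_pair_features(p, h).shape)        # torch.Size([600])
u, v = torch.randn(4096), torch.randn(4096)
print(infersent_pair_features(u, v).shape)  # torch.Size([16384])
```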

The BOW and InferSent models have MultiNLI development set accuracies of 49.6% and 67.6%, respectively. The development set accuracy for the BERT-Large model on MultiNLI is 86.6%, similar to the results achieved by Devlin et al. (2018), but somewhat lower than the state of the art (currently 90.8% on the test set, from an ensembled RoBERTa model with longer, optimized pretraining; Liu et al. 2019).
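For readers who want to probe an NLI model on IMPPRES-style pairs themselves, the sketch below shows the general recipe using the current HuggingFace transformers API rather than the older pytorch-pretrained-BERT package cited above. The checkpoint name is a placeholder for any BERT-style model fine-tuned on MultiNLI, and the label-index mapping should be read from the checkpoint's config rather than assumed.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute any BERT-style checkpoint fine-tuned on MultiNLI.
CHECKPOINT = "path/to/bert-large-finetuned-multinli"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

premise = "Jo ate some of the cake."
hypothesis = "Jo ate all of the cake."

# The tokenizer builds the [CLS] premise [SEP] hypothesis [SEP] input described above.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Label order differs across checkpoints; consult model.config.id2label.
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```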

6 Experiment 1: Scalar Implicatures

6.1 Scalar Implicature Datasets

The scalar implicature portion of IMPPRES includes six datasets, each isolating a different scalar implicature trigger from six types of lexical scales (of the type described in §2): determiners 〈some, all〉, connectives 〈or, and〉, modals 〈can, have to〉, numerals 〈2, 3〉 and 〈10, 100〉, and scalar adjectives and verbs, e.g., 〈good, excellent〉, 〈run, sprint〉. Example pairs for each implicature trigger can be found in the Appendix. For each type, we generate 100 paradigms, each consisting of 12 unique sentence pairs, as shown in Table 2.

³github.com/huggingface/pytorch-pretrained-BERT/

Figure 3: Results on Target Conditions (Implicatures)

The six target sentence pairs comprise two main relation types: ‘implicature’ and ‘negated implicature’. Pairs tagged as ‘implicature’ have a premise that implicates the hypothesis (e.g., some and not all). For ‘negated implicature’, the premise implicates the negation of the hypothesis (e.g., some and all), or vice versa (e.g., all and some). Six control pairs are logical contradictions, representing either scalar ‘opposites’ (e.g., all and none) or ‘negations’ (e.g., not all and all; some and none), probing the models’ basic grasp of negation.

Since implicature computation is dependent on the context of utterance (see §2.1), we anticipate two possible behaviors: the model may (a) be pragmatic, i.e., compute the implicature and conclude that the premise and hypothesis are in an ‘entailment’ relation, or (b) be logical, i.e., consider only the literal content, not compute the implicature, and conclude that they are in a ‘neutral’ relation. Thus, we measure both possible conclusions by tagging sentence pairs for scalar implicature with two sets of NLI labels, reflecting the behavior expected under “logical” and “pragmatic” modes of inference, as shown in Table 2.
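One way to operationalize this dual tagging is to attach both gold labels to every generated pair, as in the sketch below, which transcribes the label scheme of Table 2; the dictionary format is an assumption for illustration, not the released file format.

```python
# Mapping from relation type to (logical label, pragmatic label),
# following Table 2 for the <some, all> scale.
LABELS_BY_RELATION = {
    "implicature (+ to -)":            ("neutral", "entailment"),
    "implicature (- to +)":            ("neutral", "entailment"),
    "negated implicature (+)":         ("neutral", "contradiction"),
    "negated implicature (-)":         ("neutral", "contradiction"),
    "reverse negated implicature (+)": ("entailment", "contradiction"),
    "reverse negated implicature (-)": ("entailment", "contradiction"),
    "opposite":                        ("contradiction", "contradiction"),
    "negation":                        ("contradiction", "contradiction"),
}

def tag_pair(premise: str, hypothesis: str, relation: str) -> dict:
    """Attach both gold labels to a generated sentence pair."""
    logical, pragmatic = LABELS_BY_RELATION[relation]
    return {
        "premise": premise,
        "hypothesis": hypothesis,
        "relation_type": relation,
        "logical_label": logical,
        "pragmatic_label": pragmatic,
    }

print(tag_pair("Some skateboards tipped over.",
               "Not all skateboards tipped over.",
               "implicature (+ to -)"))
```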


Figure 4: BERT results for scalar implicatures triggered by determiners 〈some, all〉, by target condition.

6.2 Implicatures Results & Discussion

We first evaluate model performance on the controls, shown in Figure 2. If models do not give correct outputs on the controls, they either do not understand the basic function of the negative items used (not, none, neither), or do not correctly diagnose the scalar relationship between terms like some and all. We find that the BERT NLI model performs at ceiling on the control conditions for all implicature types, in contrast with InferSent and BOW, whose performance is very variable. Since BERT is the only model that passes all controls, its results on the target items are the most interpretable. Full results for all models and target conditions by implicature trigger are in the Appendix.

For connectives, scalar adjectives, and verbs, the BERT model results show neither the hypothesized pragmatic nor the hypothesized logical behavior (see the Appendix for a breakdown of results). In fact, it evaluates sentence pairs as entailments in both directions, meaning it treats scalemates (e.g., and and or; good and excellent) as synonyms, revealing a coarse-grained knowledge of these meanings that lacks information about asymmetric informativity relations between scalemates. Results for modals (can and have to) are split between the three labels, showing neither the predicted logical nor the predicted pragmatic pattern. We conclude that the models do not have sufficient knowledge of the meaning of these words.

In addition to pragmatic and logical interpretations, numerals can also be interpreted as exact cardinalities. We thus predict three different behaviors: logical “at least n”, pragmatic “at least n”, and “exactly n”. We observe that the results are inconsistent: neither the “exact” nor the “at least” interpretations hold across the board.

For the determiner dataset (some-all), Figure 4 breaks down the results by condition and shows that BERT clearly behaves as though it performs pragmatic and logical reasoning in different conditions. It has a preference for pragmatic over logical interpretations overall (55% vs. 36%), and only 9% of results are consistent with neither mode of reasoning. Furthermore, we observe that the proportion of pragmatic reasoning shows consistent effects of sentence order (i.e., whether the implicature trigger is in the premise or the hypothesis) and of the presence of negation in one or both sentences.

Premise             Hypothesis   Label          Item type

*Trigger            Prsp         entailment     target
*Trigger            Neg. Prsp    contradiction  target
*Trigger            Neut. Prsp   neutral        target

Neg. Trigger        Trigger      contradiction  control
Modal Trigger       Trigger      neutral        control
Interrog. Trigger   Trigger      neutral        control
Cond. Trigger       Trigger      neutral        control

Table 3: Paradigm for the presupposition target (top) and control (bottom) datasets. For space, *Trigger refers to either plain, Negated, Modal, Interrogative, or Conditional Triggers as per Table 1.

Pragmatic reasoning is consistently higher when the implicature trigger is in the premise, which we can see in the results for negated implicatures. We observe more pragmatic reasoning in the some and all condition compared to the all and some condition (a similar behavior is observed with the not all and none conditions).

Generally, the presence of negation lowers rates of pragmatic reasoning. First, the negated implicature conditions can be subdivided into pairs with and without negation; among the negated ones, pragmatic reasoning is lower than for the non-negated ones. Second, having negation in the premise rather than the hypothesis lowers pragmatic reasoning: among pairs tagged as direct implicatures (some vs. not all), there is more pragmatic reasoning with non-negated some in the premise than with negated not all. Finally, we observe that pragmatic rates are lower for some vs. not all than for some vs. all. In this final case, pragmatic reasoning could be facilitated by the explicit presentation of the two items on the scale.
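The proportions reported above can be computed by checking each model prediction against the two gold labels from Table 2, as in this small sketch; the prediction list and field layout are hypothetical stand-ins for actual model outputs.

```python
from collections import Counter

def reading_of(prediction: str, logical: str, pragmatic: str) -> str:
    """Classify one target-item prediction as pragmatic, logical, or neither."""
    if prediction == pragmatic:
        return "pragmatic"
    if prediction == logical:
        return "logical"
    return "neither"

# Hypothetical model outputs on <some, all> target items:
# (prediction, logical label, pragmatic label)
predictions = [
    ("entailment", "neutral", "entailment"),       # pragmatic
    ("contradiction", "neutral", "contradiction"), # pragmatic
    ("neutral", "neutral", "entailment"),          # logical
    ("entailment", "neutral", "contradiction"),    # neither
]

counts = Counter(reading_of(*p) for p in predictions)
total = sum(counts.values())
for mode in ("pragmatic", "logical", "neither"):
    print(f"{mode}: {counts[mode] / total:.0%}")
```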

To conclude, only the determiner dataset was informative in showing the extent of the NLI BERT model’s pragmatic reasoning, since it alone revealed a fine-grained enough understanding of the semantic relationship of the scalemates, like some and all. For the other datasets, scalemates were mostly treated as synonymous (except for modals, where knowledge of any semantic relationship between can and need to seemed absent). Within the determiner dataset, however, BERT returned impressive results, showing a high proportion of pragmatic reasoning compared to logical reasoning, which was affected by sentence order and the presence of negation in a predictable way.

7 Experiment 2: Presuppositions

7.1 Presupposition Datasets

The presupposition portion of IMPPRES includes eight datasets, each isolating a different kind of presupposition trigger. The full set of triggers is shown in the Appendix. For each type, we generate 100 paradigms, with each paradigm consisting of 19 unique sentence pairs (examples of the sentence types are in Table 1).

Of the 19 sentence pairs, 15 contain target items. The first target item tests whether the model correctly determines that the presupposition trigger entails its presupposition. The next two items alter the presupposition, either negating it or replacing a constituent, leading to contradiction and neutrality, respectively. The remaining 12 target items show that the relation between the trigger and the (altered) presupposition is not affected by embedding the trigger under various entailment-canceling operators. Four control items are designed to test the basic effect of the entailment-canceling operators: negation, modals, interrogatives, and conditionals. In each control, the premise is a presupposition trigger embedded under an entailment-canceling operator, and the hypothesis is an unembedded sentence containing the trigger. These controls are necessary to establish whether models learn that presuppositions behave differently under these operators than classical semantic entailments.
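To make the paradigm structure concrete, here is a minimal sketch that assembles the target and control pairs for a single trigger along the lines of Tables 1 and 3; the sentence strings and embedding templates are illustrative assumptions rather than the released generation code.

```python
# One presupposition paradigm, built from a trigger sentence, its
# presupposition, and negated/neutral variants of the presupposition.
trigger = "Jo's cat yawned."
presupposition = "Jo has a cat."
negated_prsp = "Jo doesn't have a cat."
neutral_prsp = "Amy has a cat."

# Entailment-canceling embeddings of the trigger (cf. Table 1).
embedded_triggers = {
    "negated": "Jo's cat didn't yawn.",
    "modal": "It's possible that Jo's cat yawned.",
    "interrogative": "Did Jo's cat yawn?",
    "conditional": "If Jo's cat yawned, it's OK.",
}

pairs = []
# Target items: plain and embedded triggers paired with each presupposition variant.
for premise in [trigger, *embedded_triggers.values()]:
    pairs.append((premise, presupposition, "entailment", "target"))
    pairs.append((premise, negated_prsp, "contradiction", "target"))
    pairs.append((premise, neutral_prsp, "neutral", "target"))

# Control items: embedded trigger vs. unembedded trigger.
pairs.append((embedded_triggers["negated"], trigger, "contradiction", "control"))
for op in ("modal", "interrogative", "conditional"):
    pairs.append((embedded_triggers[op], trigger, "neutral", "control"))

print(len(pairs))  # 15 target + 4 control = 19 pairs per paradigm
```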

7.2 Presupposition Results & Discussion

The results from presupposition controls are in Figure 5. We find that the BERT NLI model performs well above chance on each control (accuracy > 0.33), whereas the BOW and InferSent models perform at or below chance. In the “negated” condition, the BERT NLI model correctly identifies that the trigger is contradicted by its negation 100% of the time, e.g., Bill’s handyman didn’t win contradicts Bill’s handyman won. In the other conditions, it correctly identifies the neutral relation the majority of the time, e.g., Did Bill’s handyman win? is neutral with respect to Bill’s handyman won. This indicates that the BERT model mostly learns that negation, modals, interrogatives, and conditionals cancel classical entailments. On the other hand, the BOW and InferSent models do not grasp the ordinary behavior of these common operators.

Figure 5: Results on Controls (Presuppositions).

Next, we test whether models identify presuppositions of the premise as entailments, e.g., that Bill’s handyman won entails that Bill has a handyman. Recall from §2.2 that this would be akin to what a listener does when they accommodate a presupposition. The results in Figure 6 show that each of the three models accommodates some presuppositions, but this depends on both the nature of the presupposition and the model. For instance, the BOW and InferSent models accommodate presuppositions of nearly all trigger types at well above chance rates (accuracy ≫ 0.33). In the case of the uniqueness presupposition of cleft sentences, these models almost always correctly predict an entailment (i.e., at over 90% accuracy), but for most triggers, performance is much less reliable. By contrast, the BERT model’s behavior is much more bimodal. It always accommodates the existence presuppositions of clefts and possessed definites, as well as the presupposition of only, but almost never accommodates any presupposition involving numeracy, e.g., Both flowers that bloomed died entails There are exactly two flowers that bloomed.

Figure 6: Results for the unembedded trigger paired with the positive presupposition.

Finally, we evaluate whether models predict that presuppositions project out of entailment-canceling operators (e.g., that Did Bill’s handyman win? entails that Bill has a handyman). We can only consider such a prediction as evidence of projection if two conditions hold: (a) the model correctly identifies that the relevant operator cancels entailments in the control from the same paradigm (e.g., Did Bill’s handyman win? is neutral with respect to Bill’s handyman won), and (b) the model identifies the presupposition as an entailment when the trigger is unembedded in the same paradigm (e.g., Bill’s handyman won entails Bill has a handyman). Otherwise, a model might correctly predict entailment essentially by accident if, for instance, it systematically ignores interrogatives. For this reason, we filter out results for the target conditions that do not meet these criteria.
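The two filtering criteria can be implemented as a simple per-paradigm check before scoring projection items, as sketched below; the dictionary layout of `paradigm` is a hypothetical convenience, not the dataset's actual schema.

```python
def projection_evidence_valid(paradigm: dict, operator: str) -> bool:
    """Return True only if a projection prediction is interpretable as evidence.

    (a) The control shows the model knows `operator` cancels classical
        entailment (predicted neutral; contradiction for negation).
    (b) The model treats the unembedded trigger as entailing its
        presupposition (i.e., it accommodates the presupposition).
    """
    expected_control = "contradiction" if operator == "negated" else "neutral"
    control_ok = paradigm["control_pred"][operator] == expected_control
    accommodation_ok = paradigm["unembedded_pred"] == "entailment"
    return control_ok and accommodation_ok

# Example: keep only target predictions from paradigms passing both checks.
paradigm = {
    "control_pred": {"interrogative": "neutral", "negated": "contradiction"},
    "unembedded_pred": "entailment",
}
print(projection_evidence_valid(paradigm, "interrogative"))  # True
```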

Figure 7 shows results for the target conditions after filtering. While InferSent almost never understands that presuppositions can project, we find strong evidence that the BERT and BOW models do. Specifically, they correctly identify that the premise entails the presupposition (over 80% of the time for BERT, and over 90% of the time for BOW). Furthermore, BERT is the only model to reliably identify (i.e., over 90% of the time) that the negation of the presupposition is contradicted. These results hold irrespective of the entailment-canceling operator. No model reliably performs above chance when the presupposition is altered to be neutral (e.g., Did Bill’s handyman win? is neutral with respect to John has a handyman).

It is somewhat counterintuitive that the simple BOW model can learn some of the projective behavior of presuppositions. One explanation for this finding is that projection is essentially a phenomenon insensitive to word order. In other words, if a presupposition trigger is present at all in a sentence, a presupposition will generally arise irrespective of its position in the sentence.⁴

⁴There is an exception to this, which we expect to be incredibly rare, in which the presupposition fails to project due to “local satisfaction” (Heim, 1983).

Figure 7: Results for presupposition target conditions involving projection.

To conclude, our results show that training on NLI is sufficient for BERT and even a BOW model to learn the characteristic projective behavior of certain presuppositions. This is especially striking considering our findings in §4 that presuppositions are virtually absent from MultiNLI, indicating that this behavior is learned through generalization.

8 Conclusion

This work is an initial step towards rigorously investigating the extent to which NLI models learn semantic versus pragmatic inference types. We have introduced a new dataset, IMPPRES, for probing this question, which can be reused to evaluate the pragmatic performance of any given NLI model.

We find strong evidence that BERT learns scalar implicatures associated with determiners. Pragmatic or logical reasoning was not diagnosable for the other implicature trigger pairs, whose meaning was not fully understood by our models (as most scalar pairs were treated as synonymous). In the case of presuppositions, the BERT NLI model, and BOW to some extent, perform well on a number of our subdatasets (only, cleft existence, possessive existence, questions). For the other datasets, the models did not perform as expected on the basic unembedded triggers, again suggesting the models’ lack of knowledge of the basic meaning of these words. Though their behavior is far from systematic, this is suggestive evidence that some NLI models have the capacity to perform in ways that correlate with human-like pragmatic behavior, generalizing from data that does not have explicit information on the tested inference types.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Thanks to Sam Bowman, Hagen Blix, Emmanuel Chemla, He He, Phu Mon Htut, Katharina Kann, Haokun Liu, Ethan Perez, Richard Pang, Clara Vania, and the other members of ML2 for initial discussion on the topic, and/or feedback on an earlier draft.

References

Aristotle. De Interpretatione.

David Ian Beaver. 1997. Presupposition. In Handbook of Logic and Language, pages 939–1008. Elsevier.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.

Andre Cianflone, Yulan Feng, Jad Kabbara, and Jackie Chi Kit Cheung. 2018. Let’s do it “again”: A first computational approach to detecting adverbial presupposition triggers. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2747–2755, Melbourne, Australia. Association for Computational Linguistics.

Herbert H. Clark. 1996. Using Language. Cambridge University Press.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485. Association for Computational Linguistics.

Robin Cooper, Richard Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspers, Hans Kamp, Manfred Pinkal, Massimo Poesio, Stephen Pulman, et al. 1994. FraCaS: A framework for computational semantics. Deliverable D6.

Richard Crouch, Lauri Karttunen, and Annie Zaenen. 2006. Circumscribing is not excluding: A reply to Manning. Ms., Palo Alto Research Center.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190. Springer.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.

Judith Degen. 2015a. Investigating the distribution of some (but not all) implicatures using corpora and web-based methods. Semantics and Pragmatics, 8:11–1.

Judith Degen. 2015b. Investigating the distribution of some (but not all) implicatures using corpora and web-based methods. Semantics and Pragmatics, 8:11–1.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139.

Gerald Gazdar. 1979. Pragmatics, Implicature, Presupposition and Logical Form. Academic Press, NY.

Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2018. Stress-testing neural models of natural language inference with multiply-quantified sentences.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655. Association for Computational Linguistics.

H. Paul Grice. 1975. Logic and conversation. 1975, pages 41–58.

Irene Heim. 1983. On the projection problem for presuppositions. Formal Semantics: The Essential Readings, pages 249–260.

Laurence Horn. 1989. A Natural History of Negation. University of Chicago Press.

Laurence R. Horn and Gregory L. Ward. 2004. The Handbook of Pragmatics. Wiley Online Library.

Nanjiang Jiang and Marie-Catherine de Marneffe. 2019. Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4208–4213.

Lauri Karttunen. 1973. Presuppositions of compound sentences. Linguistic Inquiry, 4(2):169–193.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of AAAI.

Najoung Kim, Roma Patel, Adam Poliak, Alex Wang, Patrick Xia, R. Thomas McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, et al. 2019. Probing what different NLP tasks teach machines about function word comprehension. arXiv preprint arXiv:1904.11544.

Alice Lai, Yonatan Bisk, and Julia Hockenmaier. 2017. Natural language inference from multiple premises. Pages 100–109.

Stephen C. Levinson. 2000. Presumptive Meanings: The Theory of Generalized Conversational Implicature. MIT Press.

David Lewis. 1979. Scorekeeping in a language game. In Semantics from Different Points of View, pages 172–187. Springer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christopher D. Manning. 2006. Local textual inference: It’s hard to circumscribe, but you know it when you see it – and NLP needs it. Ms., Stanford University.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1–8.

Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124.

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. Proceedings of the Association for Computational Linguistics.

Shachar Mirkin, Roy Bar-Haim, Ido Dagan, Eyal Shnarch, Asher Stern, Idan Szpektor, and Jonathan Berant. 2009. Addressing discourse and document structure in the RTE search task. In Text Analysis Conference.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2015. Natural language inference by tree-based convolution and heuristic matching. arXiv preprint arXiv:1512.08422.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rosé, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6867–6874.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81.

Christopher Potts. 2015. Presupposition and implicature. The Handbook of Contemporary Semantic Theory, 2:168–202.

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.

Bertrand Russell. 1905. On denoting. Mind, 14(56):479–493.

Martin Schmitt and Hinrich Schütze. 2019. SherLIiC: A typed event-focused lexical inference benchmark for evaluating natural language inference. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 902–914, Florence, Italy. Association for Computational Linguistics.

Sebastian Schuster, Yuxing Chen, and Judith Degen. 2019. Harnessing the richness of the linguistic signal in predicting pragmatic inferences. arXiv preprint arXiv:1910.14254.

Robert Stalnaker. 1977. Pragmatic presuppositions. In Milton K. Munitz and Peter K. Unger, editors, Semantics and Philosophy, pages 135–148. New York University Press.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, et al. 2019. Investigating BERT’s knowledge of language: Five analysis methods with NPIs. arXiv preprint arXiv:1909.02597.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue natural language inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996–1005.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019a. Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 31–40, Florence, Italy. Association for Computational Linguistics.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019b. HELP: A dataset for identifying shortcomings of neural models in monotonicity reasoning. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 250–255, Minneapolis, Minnesota. Association for Computational Linguistics.

Annie Zaenen, Lauri Karttunen, and Richard Crouch. 2005. Local textual inference: Can it be defined or circumscribed? In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 31–36, Ann Arbor, MI. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395.


Type               Premise                            Hypothesis

Connectives        These cats or those fish appear.   These cats and those fish don’t both appear.
Determiners        Some skateboards tipped over.      Not all skateboards tipped over.
Numerals           Ten bananas were scorching.        One hundred bananas weren’t scorching.
Modals             Jerry could wake up.               Jerry didn’t need to wake up.
Scalar adjectives  Banks are fine.                    Banks are not great.
Scalar verbs       Dawn went towards the hills.       Dawn did not get to the hills.

Table 4: Examples of automatically generated sentence pairs from each of the six datasets for the scalar implicatures experiment. The pairs belong to the “Implicature (+ to −)” condition.

Type                 Premise (Trigger)                    Hypothesis (Presupposition)

All N                All six roses that bloomed died.     Exactly six roses bloomed.
Both                 Both flowers that bloomed died.      Exactly two flowers bloomed.
Change of State      The cat escaped.                     The cat used to be captive.
Cleft Existence      It is Sandra who disliked Veronica.  Someone disliked Veronica.
Cleft Uniqueness     It is Sandra who disliked Veronica.  Exactly one person disliked Veronica.
Only                 Only Lucille went to Spain.          Lucille went to Spain.
Possessed Definites  Bill’s handyman won.                 Bill has a handyman.
Question             Sue learned why Candice testified.   Candice testified.

Table 5: Examples of automatically generated sentence pairs from each of the eight datasets for the presupposition experiment. The pairs belong to the “Plain Trigger / Presupposition” condition.

Figure 8: Results for the scalar implicatures triggered by adjectives, by target condition.


Figure 9: Results for the scalar implicatures triggered by connectives, by target condition.


Figure 10: Results for the scalar implicatures triggered by determiners, by target condition.


Figure 11: Results for the scalar implicatures triggered by modals, by target condition.


Figure 12: Results for the scalar implicatures triggered by numerals, by target condition.


Figure 13: Results for the scalar implicatures triggered by verbs, by target condition.