



COGS: A Compositional Generalization Challenge Based on Semantic Interpretation

Najoung Kim
Johns Hopkins University
[email protected]

Tal Linzen
New York University
[email protected]

Abstract

Natural language is characterized by compositionality: the meaning of a complex expression is constructed from the meanings of its constituent parts. To facilitate the evaluation of the compositional abilities of language processing architectures, we introduce COGS, a semantic parsing dataset based on a fragment of English. The evaluation portion of COGS contains multiple systematic gaps that can only be addressed by compositional generalization; these include new combinations of familiar syntactic structures, or new combinations of familiar words and familiar structures. In experiments with Transformers and LSTMs, we found that in-distribution accuracy on the COGS test set was near-perfect (96–99%), but generalization accuracy was substantially lower (16–35%) and showed high sensitivity to random seed (±6–8%). These findings indicate that contemporary standard NLP models are limited in their compositional generalization capacity, and position COGS as a good way to measure progress.

1 Introduction

Humans can produce and understand linguistic expressions that they have not encountered before, by systematically combining atomic building blocks (Montague, 1974). For instance, a speaker that knows the meaning of John loves Mary is necessarily able to understand Mary loves John, even if the speaker has not heard or uttered this sentence before (Fodor and Pylyshyn, 1988). The discipline of formal semantics concerns itself with characterizing these building blocks, or “primitives”, and the ways in which they combine to construct the meaning of a complex expression (e.g., Figure 1a).

To assess the abilities of computational models of language to generalize compositionally, we propose COGS, a COmpositional Generalization Challenge based on Semantic Interpretation, in which a model of language is expected to construct a semantic representation of a given English sentence (semantic parsing). The key component of this challenge is that the training and evaluation sets systematically differ in their properties, such that success on the evaluation set requires out-of-distribution generalization. Of the many possible ways that a model could systematically fill such gaps, we expect it to do so in a way that is consistent with the compositional principles that guide human linguistic generalization. Figure 1b illustrates how the meaning of the unseen expression The boy loves the hedgehog could be compositionally inferred from known parts. In this case, the noun phrase (NP) the hedgehog, which has only been observed as a subject, needs to be interpreted in the direct object position. The generalizations tested by COGS, described in detail in Section 3, include interpreting novel combinations of primitives and grammatical roles, interpreting novel combinations of modified phrases and grammatical roles, generalizing phrase nesting to unseen depths, verb argument structure alternation, and sensitivity to verb class.

Figure 1: (a) The meaning of a sentence (right) is compositionally built up from the meanings of its parts, in accordance with its structure (left). (b) Interpreting a familiar word in a structure it has not appeared in before. In colors: expressions providing the primitive meanings; in bold: expressions providing evidence that definite NPs may appear in both argument positions of a transitive verb. [[x]] denotes the meaning of x.

Rule-based semantic parsing systems such as Boxer (Bos, 2008) are able to generalize compositionally by design. By contrast, this ability does not constitute a part of the design of the neural network models of language that are standard in NLP; it could only arise in such models through learning, inductive biases, or a combination of the two. To test whether standard NLP models are equipped with the ability to generalize compositionally, we used COGS to evaluate three architectures: Transformer, Bidirectional LSTM, and Unidirectional LSTM (Section 5). We found that the out-of-distribution generalization set was significantly more challenging (16–35% mean accuracy) than an in-distribution test set (96–99% mean accuracy). Furthermore, generalization accuracy varied greatly across runs of the same architecture that differed only in random seed (6–8% standard deviation). Further analysis revealed that structural generalization (to novel combinations of familiar syntactic structures) poses greater difficulties than lexical generalization (to novel combinations of a familiar primitive and a familiar structure). These results suggest that higher accuracy on COGS would require a stronger structural bias than that of Transformers and LSTMs.

2 Compositional Generalization

Fodor and Pylyshyn (1988) highlighted the intrinsic connection between the ability to produce and understand different sentences that are made up of the same building blocks, such as John loves Mary and Mary loves John. This connection, which they refer to as systematicity, derives from a combinatorial mechanism that constructs the meaning of a complex expression from its parts: understanding John loves Mary and Mary loves John involves combining the same primitives using the same rules. The question of whether neural networks can display human-like systematicity has a long history. In a review of early work, Hadley (1994) argued that none of the connectionist models he examined displayed the degree of systematicity that humans do. Recently Lake and Baroni (2018) revisited this question using contemporary neural architectures—sequence-to-sequence models with LSTM and GRU units—and came to the same conclusion as Hadley.

Lake and Baroni based their study on the SCAN task, a novel task in which word sequences in a synthetic language need to be mapped to navigation command sequences (e.g., jump twice → JUMP JUMP). Crucially, their training/evaluation split required compositional generalization. A number of models have been developed that have improved performance on SCAN (Li et al., 2019; Gordon et al., 2020). However, since the semantic representation used by SCAN only covers a small subset of English grammar, SCAN does not enable testing various systematic linguistic abstractions that humans are known to make (e.g., verb argument structure alternation). Thus, it is unclear whether progress on SCAN would generalize to natural language. To bring the evaluation of compositional generalization a step closer to natural language, COGS includes a wide range of syntactic constructions, and uses semantic representations based on lambda calculus, inspired by the formalisms employed in formal semantics (Parsons, 1990) and semantic parsing (Palmer et al., 2005; Reddy et al., 2017). Following Dong and Lapata (2016) and Daza and Frank (2018), we cast semantic parsing as a sequence-to-sequence problem.

3 Overview of COGS

In a semantic parsing task such as COGS, the goal is to map a sentence to a logical form. Following recent works such as Marvin and Linzen (2018) and Keysers et al. (2020), we generate the dataset using a rule-based approach; this allows us to maintain full control over the distribution of inputs that the learners are exposed to, and to ensure coverage of rare constructions that are not guaranteed to appear in natural corpora. COGS is not inherently grounded but could potentially be linked to a knowledge base or a visual world. The COGS dataset1 is split into a training set and a generalization set. The training set includes systematic gaps that, in the generalization set, must be filled via compositional generalization. Success on the generalization set relies on several types of linguistic generalizations that humans are able to make. Instead of providing individual splits for each of the targeted generalizations, we expect the learner to make all of the target generalizations at once.

1 https://github.com/najoungkim/COGS


Case | Training | Generalization

§3.1. Novel Combination of Familiar Primitives and Grammatical Roles

Subject → Object (common noun) | A hedgehog ate the cake. | The baby liked the hedgehog.
Subject → Object (proper noun) | Lina gave the cake to Olivia. | A hero shortened Lina.
Object → Subject (common noun) | Henry liked a cockroach. | The cockroach ate the bat.
Object → Subject (proper noun) | The creature grew Charlie. | Charlie worshipped the cake.
Primitive noun → Subject (common noun) | shark | A shark examined the child.
Primitive noun → Subject (proper noun) | Paula | Paula sketched William.
Primitive noun → Object (common noun) | shark | A chief heard the shark.
Primitive noun → Object (proper noun) | Paula | The child helped Paula.
Primitive verb → Infinitival argument | crawl | A baby planned to crawl.

§3.2. Novel Combination of Modified Phrases and Grammatical Roles

Object modification → Subject modification | Noah ate the cake on the plate. | The cake on the table burned.

§3.3. Deeper Recursion

Depth generalization: Sentential complements | Emma said that Noah knew that the cat danced. | Emma said that Noah knew that Lucas saw that the cat danced.
Depth generalization: PP modifiers | Ava saw the ball in the bottle on the table. | Ava saw the ball in the bottle on the table on the floor.

§3.4. Verb Argument Structure Alternation

Active → Passive | The crocodile blessed William. | A muffin was blessed.
Passive → Active | The book was squeezed. | The girl squeezed the strawberry.
Object-omitted transitive → Transitive | Emily baked. | The giraffe baked a cake.
Unaccusative → Transitive | The glass shattered. | Liam shattered the jigsaw.
Double object dative → PP dative | The girl teleported Liam the cookie. | Benjamin teleported the cake to Isabella.
PP dative → Double object dative | Jane shipped the cake to John. | Jane shipped John the cake.

§3.5. Verb Class

Agent NP → Unaccusative subject | The cobra helped a dog. | The cobra froze.
Theme NP → Object-omitted transitive subject | The hippo decomposed. | The hippo painted.
Theme NP → Unergative subject | The hippo decomposed. | The hippo giggled.

Table 1: A full list of generalization cases. Each sentence in the table represents a (sentence, logical form) pair. For instance, the sentence A hedgehog ate the cake represents the following input-output mapping:

A hedgehog ate the cake → *cake(x4) ; hedgehog(x1) AND eat.agent(x2,x1) AND eat.theme(x2,x4)

“Subject” and “Object” include subjects and objects of both simple and embedded sentences. Due to space constraints, some sentences are simplified or rephrased versions of the sentences included in the dataset.
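To make the output format concrete, the short sketch below parses a logical form such as the one in the caption into its atomic terms. It is a minimal illustration based only on the conventions visible in the example above (definite descriptions prefixed with '*' and separated from the main conjunction by ' ; ', conjuncts joined by ' AND '); the released dataset may use additional conventions.

```python
import re

def parse_cogs_lf(lf):
    """Split a COGS-style logical form into its atomic terms.

    Assumes the format shown in the Table 1 caption: definite-description
    terms (prefixed with '*') separated from the main conjunction by ' ; ',
    and conjuncts joined by ' AND '.
    """
    *definites, conjunction = lf.split(" ; ")
    terms = []
    for term in definites + conjunction.split(" AND "):
        m = re.match(r"(\*?)([\w.]+)\(([^)]*)\)", term.strip())
        if m is None:
            raise ValueError(f"Unrecognized term: {term}")
        star, predicate, args = m.groups()
        terms.append((predicate, [a.strip() for a in args.split(",")], bool(star)))
    return terms

lf = "*cake(x4) ; hedgehog(x1) AND eat.agent(x2,x1) AND eat.theme(x2,x4)"
for predicate, args, definite in parse_cogs_lf(lf):
    print(predicate, args, "definite" if definite else "")
```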


We describe below the five categories of generalizations targeted by COGS (see Table 1 for a full list). For a discussion of our design decisions from the perspective of formal semantics, see Appendix H.

3.1 Novel Combination of Familiar Primitives and Grammatical Roles

English speakers can easily interpret an open-class primitive (e.g., a noun) in a grammatical role that is different from the one in which it was first observed. For example, a noun that was only observed as a subject can easily be interpreted as an object. This generalization capacity has been attested in children as young as 20 months old (Tomasello and Olguin, 1993). We ensured that in the training set some lexical items only appear in subject position, and some only in object position. In the generalization set, these lexical items appear in the opposite grammatical role. We test for generalization to the targeted grammatical roles not only in simple sentences, but also in embedded clauses; this form of generalization is a defining criterion of strong systematicity (Hadley, 1994). For instance, a noun that only occurred as a subject of a simple sentence in training may occur as an object of an embedded clause in the generalization set:

(1) a. TRAINING: A hedgehog ate the cake.
    b. GENERALIZATION: A girl said that Emma called the hedgehog.

While some primitives appear in the training set in the context of a sentence, others only occur in isolation. We express common noun meanings as unary predicates (shark → λx.shark(x)), proper noun meanings as constants (Emma → Emma), and verb meanings as n-ary predicates with thematic role specifications (like → λx.λy.λe.like.agent(e, y) AND like.theme(e, x)) (see Appendix H for more details). The training set contains these primitives as isolated words, but not as a part of a sentence; by contrast, the generalization set includes examples that require interpreting these primitives in context (e.g., The shark smiled).
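As a purely illustrative picture of these three primitive types, the dictionary below writes out one entry of each kind using the notation from the paragraph above; the exact serialization used in the released dataset files may differ.

```python
# Illustrative primitive lexicon, following the notation in the text above.
# The serialization in the released COGS files may differ from these strings.
PRIMITIVES = {
    "shark": "λx.shark(x)",                                    # common noun -> unary predicate
    "Emma": "Emma",                                            # proper noun -> constant
    "like": "λx.λy.λe.like.agent(e, y) AND like.theme(e, x)",  # verb -> n-ary predicate
}

for word, meaning in PRIMITIVES.items():
    print(f"{word}\t{meaning}")
```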

3.2 Novel Combination of Modified Phrases and Grammatical Roles

Phrases with a modifier, such as an NP modified by a prepositional phrase (PP), can occupy the same grammatical roles as unmodified phrases. For example, just like [the cat]NP, the phrase [[the cat]NP [on the mat]PP]NP is an NP, and can occupy the same syntactic positions. Children acquiring language are most likely not exposed to modifiers in every possible syntactic position in which the modified element may occur, yet they learn a context-free phrasal modification rule (e.g., NP → NP PP) rather than a rule localized to a specific grammatical role (e.g., NPobj → NP PP). To test for generalization to modifiers in an unseen grammatical role, our training set includes only examples with PP modifiers within object NPs, and the generalization set contains PP modifiers within subject NPs. We note that this is a simplification of the generalization problem that humans may encounter; see Appendix H for a further discussion.

3.3 Deeper Recursion

The ability to derive an infinite number of expressions from a finite set of building blocks is a defining characteristic of human linguistic competence (Hauser et al., 2002). Human language achieves this property by allowing certain phrase types to be nested within a phrase of the same type. In [Mary knows that [John knows [that Emma cooks]CP]CP]CP, clauses (CP) are nested inside other clauses. Our dataset includes two types of recursive constructions that allow arbitrary depths of nesting: sentential complements (nested CPs) and nominal PP modifiers (nested PPs). The training set contains nestings of depth 0–2, where depth 0 is a phrase without nesting. The generalization set contains nestings of strictly greater depths (3–12).
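The following toy sketch illustrates the idea of depth-controlled CP nesting. It is not the actual COGS grammar, and the vocabulary is invented for the example: depth 0 is a simple clause, and each additional level wraps the sentence in another sentential complement.

```python
import random

SUBJECTS = ["Emma", "Noah", "Lucas", "the girl", "a boy"]  # illustrative only

def nested_cp(depth):
    """Build a sentence with `depth` levels of CP embedding;
    depth 0 is a clause with no embedding."""
    sentence = f"{random.choice(SUBJECTS)} danced"
    for _ in range(depth):
        sentence = f"{random.choice(SUBJECTS)} said that {sentence}"
    return sentence

print(nested_cp(0))  # training-style example (depths 0-2)
print(nested_cp(3))  # generalization-style example (depths 3-12)
```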

3.4 Verb Argument Structure Alternation

Many English verbs participate in argument structure alternations (Levin, 1993). For instance, break can be used both as a transitive verb (John broke the window), and as an unaccusative verb, with its theme in the subject position (The window broke). Likewise, agent-patient verbs can passivize; John broke the window can be passivized to The window was broken, or with an optional agent by-phrase, The window was broken by John. These alternation patterns are not restricted to particular lexical items, and humans can often apply such alternations to verbs that have only been observed in one of the forms. To illustrate, a person told that I floosed the cat means “I fed the cat twice” would immediately be able to interpret The cat was floosed (though see Section 7 for a caveat).

COGS contains alternation patterns that humans have been shown in experiments to generalize to nonce verbs: active-passive (Brooks and Tomasello, 1999), transitive-intransitive (unaccusative and object-omitted transitives; Ono and Budwig 2006; Hu et al. 2007; Kline and Demuth 2014), and the alternation between double-object and prepositional-phrase datives (Conwell and Demuth, 2007). For several verbs, we include only one of the alternating forms (e.g., active) in the training set, and only the other form (e.g., passive) in the generalization set.

3.5 Verb Class

In English, the semantic role of the argument of a verb with a single argument depends on the identity of the verb; the surface syntax of the sentence is not enough to determine its interpretation. For instance, froze in the sentence The lake froze is an unaccusative verb, which takes a theme (or patient) as its grammatical subject, whereas in The dog smiled, smiled is an unergative verb that takes an agent as its grammatical subject. Inspired by this property, we include in our generalization set combinations of verbs and NPs, which all occur separately in the training set, but such that the NPs never appear in the thematic role specified by the verb in the training set. For instance, the training set contains a sentence with cobra as an agent subject (2a), and sentences with unaccusative verbs (2b), and the generalization set contains examples in which cobra and freeze appear together (3). Correctly interpreting cobra as the theme, even though it only appears in the training set as an agent, requires sensitivity to the argument structure of freeze.

(2) TRAINING
    a. A cobra helped a dog. →
       cobra(x1) AND help.agent(x2,x1) AND help.theme(x2,x4) AND dog(x4)
    b. The drink froze. →
       *drink(x1) AND freeze.theme(x2,x1)

(3) GENERALIZATION
    The cobra froze. →
    *cobra(x1) AND freeze.theme(x2,x1)

4 Dataset Generation

Grammar and logical forms. We generated the constructions described in Section 3 using a Probabilistic Context-Free Grammar (PCFG; Appendix A). The types of sentences covered by this PCFG accounted for 70–80% of naturally-occurring English sentences, according to the analysis of five English corpora conducted by Roland et al. (2007). The semantic interpretation of a sentence follows deterministically from the PCFG rules, which were annotated with semantic class information needed to disambiguate ambiguous syntactic structures (Section 3.5).

Sentences were first mapped to the simplified logical formalism proposed by Reddy et al. (2017) using their codebase,2 and then passed through several postprocessing steps (see Appendix C). The logical forms use indexed constants that express the existence of an entity or an event denoted by the predicate. For example, in (4), x1 expresses the existence of an entity that is both a cat and an agent of a smiling event; x2 expresses the existence of an event that is a smiling event.

(4) A cat smiled →
    cat(x1) AND smile.agent(x2, x1)

Our constants are named after indices of the phrasal head in the original sentence; in (4), the noun cat is in position 1, so the corresponding constant is x1. This indexing scheme was adopted to avoid the need to select arbitrary constant names (e.g., x, y, z, ...) as the number of entities and events in the expression grows.
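A minimal sketch of this naming scheme, assuming simple whitespace tokenization and a given set of content words (both assumptions made only for the example), is:

```python
# Sketch of index-based constant naming: each content word receives a
# constant named after its 0-based token position in the sentence.
def index_constants(sentence, content_words):
    constants = {}
    for i, token in enumerate(sentence.split()):
        if token.lower() in content_words:
            constants[token] = f"x{i}"
    return constants

print(index_constants("A cat smiled", {"cat", "smiled"}))
# {'cat': 'x1', 'smiled': 'x2'}
```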

Primitive exposure examples. Many generalization cases crucially rely on particular training examples. For instance, to apply the Subject → Object generalization to hedgehog, at least one example with hedgehog as subject must be included in the training set. Human learners only need to observe an item in a small number of distinct contexts before they can generalize to new contexts. For example, children of age 2 years and 11 months were able to produce in a passive construction a nonce verb they had only heard in an active transitive construction, after being exposed to 8 distinct usages of the construction (Brooks and Tomasello, 1999). Borovsky et al. (2010, 2012) further suggest that humans are even capable of single-shot learning of word meaning in context. We include in our training set a single example to generalize from (“primitive exposure example”) per generalization case that requires it. In Appendix E.2 we report results on a version of COGS with 100 primitive exposure examples.

Training and generalization sets. We sampled 30,000 distinct sentences from our PCFG, excluding ones with duplicate nominals (e.g., The cat saw a cat). These sentences were divided into training (80%; n = 24,000), development (10%; n = 3,000), and test (10%; n = 3,000) sets. We then added to the training set examples that specify the primitive meanings of 80 verbs and 60 nouns (including common and proper nouns). Separately, we generated primitive exposure examples (n = 15, see previous paragraph) to add to the training set. The resulting training set consists of 24,155 examples.

2 https://github.com/sivareddyg/udeplambda

The out-of-distribution generalization set was constructed from separate PCFGs, each of which generates examples pertaining to a particular generalization case. For the Subject → Object generalization, for example, we generated sentences with hedgehog in the object position. We sampled 1000 examples of each of the 21 cases, for a total of 21,000 examples.
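For readers who want to work with the released data (footnote 1), the sketch below shows one way the splits could be loaded. It assumes the files are tab-separated (sentence, logical form, case) triples and that the file names follow the split names used here; both are assumptions to verify against the actual release.

```python
import csv

def load_cogs_split(path):
    """Load one COGS split, assuming tab-separated
    (sentence, logical form, case) triples, one example per line."""
    examples = []
    with open(path, newline="", encoding="utf-8") as f:
        for sentence, logical_form, case in csv.reader(f, delimiter="\t"):
            examples.append({"src": sentence, "tgt": logical_form, "case": case})
    return examples

# Hypothetical paths; adjust to the actual repository layout.
# train = load_cogs_split("cogs/train.tsv")  # 24,155 examples
# gen = load_cogs_split("cogs/gen.tsv")      # 21,000 examples across 21 cases
```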

5 Experiments

We next analyze the performance on COGS of two widely-used models for language tasks: Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber 1997) and Transformer (Vaswani et al., 2017), both in an encoder-decoder setup (Sutskever et al., 2014). Transformers have been quickly adopted in practical NLP systems (Storks et al., 2019), but the literature has reported mixed results on the benefit of Transformers over LSTMs in terms of linguistic generalization (Hupkes et al., 2020; van Schijndel et al., 2019). Our goals in these experiments are, first, to test whether strong NLP models are equipped with the compositional generalization abilities required by COGS, and second, to determine whether there exist substantial differences across the models we test, when the number of trainable parameters is controlled for.

5.1 Training Details

We trained LSTM and Transformer models on COGS only, without any pretraining. We used cross-entropy loss, a batch size of 128, and early stopping when validation loss did not improve for five validation steps (step size = 500). All experiments were run five times with different random seeds, which determined the initial weights and the order of the training examples. Models were implemented using OpenNMT-py3 (Klein et al., 2017).
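The early-stopping rule amounts to the following logic (a schematic sketch, not the OpenNMT-py implementation; train_step and validation_loss are placeholders for the actual training and evaluation code):

```python
def train_with_early_stopping(train_step, validation_loss,
                              validate_every=500, patience=5, max_steps=100_000):
    """Validate every `validate_every` steps; stop once validation loss
    has not improved for `patience` consecutive validations."""
    best_loss = float("inf")
    bad_validations = 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % validate_every == 0:
            loss = validation_loss()
            if loss < best_loss:
                best_loss, bad_validations = loss, 0
            else:
                bad_validations += 1
                if bad_validations >= patience:
                    print(f"Stopping at step {step}: no improvement "
                          f"for {patience} validations")
                    return
```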

For the LSTM, we used a 2-layer encoder-decoder with global attention and a dot-product score function. The decoder followed an input-feeding approach (Luong et al., 2015). We tested both unidirectional and bidirectional LSTM encoders. The Transformer had a comparable number of parameters to the LSTMs (Transformer: 9.5M; BiLSTM: 10M; LSTM: 11M). It had 2 encoder and decoder layers, 4 attention heads, and a feedforward dimension of 512. See Appendix D for additional training details.

3 https://github.com/OpenNMT/OpenNMT-py
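As a rough way to sanity-check the comparable-size claim, one can instantiate similarly configured core modules in PyTorch and count their parameters. The sketch below uses the layer counts and feedforward dimension stated above; the embedding dimension of 512 comes from Appendix D, and the counts will not match the paper exactly because the full OpenNMT-py models also include embeddings, attention, and output layers.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Core modules only; not the full OpenNMT-py models used in the paper.
transformer = nn.Transformer(d_model=512, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             dim_feedforward=512)
lstm_encoder = nn.LSTM(input_size=512, hidden_size=512, num_layers=2)

print(f"Transformer core: {count_params(transformer) / 1e6:.1f}M parameters")
print(f"LSTM encoder core: {count_params(lstm_encoder) / 1e6:.1f}M parameters")
```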

5.2 Results

All architectures performed well on the development and test sets (Table 2), with little variability across runs (Figure 2a, green dots). By contrast, generalization accuracy was low across the board, and was characterized by much higher variance (blue dots). Transformers and unidirectional LSTMs of a comparable size did not substantially differ in their average accuracy, whereas bidirectional LSTMs performed comparatively worse.

Model | Dev. | Test | Gen.
Transformer | 0.96 | 0.96 | 0.35 (± 0.06)
LSTM (Bi) | 0.99 | 0.99 | 0.16 (± 0.08)
LSTM (Uni) | 0.99 | 0.99 | 0.32 (± 0.06)

Table 2: Average accuracy of the tested models. Only standard deviations greater than 0.01 are shown.

Figure 2: (a) Accuracy on COGS. An output sequence is considered correct only if it exactly matches the gold sequence. Each dot represents a model trained with a different random seed. (b) Accuracy by generalization type (lexical or structural).
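The evaluation metric is plain sequence-level exact match, which can be computed as in the sketch below (the example strings use the notation from the running text, not necessarily the tokenization of the released data):

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of predictions that are identical to the gold logical form."""
    assert len(predictions) == len(golds)
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return correct / len(golds)

print(exact_match_accuracy(
    ["cat(x1) AND smile.agent(x2, x1)"],
    ["cat(x1) AND smile.agent(x2, x1)"]))  # 1.0
```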

Accuracy on each generalization case greatly fluctuated across different runs of the same model, except for the cases where accuracy was close to zero (see examples in Table 3, and see Appendix F for full results). The only exception to the trend was the Active → Passive case (but not vice versa) in the Transformer model, where all runs of the model achieved close to 100% accuracy. The majority of the LSTMs' predictions were structurally correct even when they did not exactly match the expected output, suggesting that Active → Passive is one of the least challenging cases in our generalization set (see Appendix G.1 for an error analysis).

Case | Training | Generalization
Subject → Object (common noun) | Subject: A hedgehog ate the cake. | Object: The baby liked the hedgehog.
Object → Subject (common noun) | Object: Henry liked a cockroach. | Subject: The cockroach ate the bat.
Object → Subject (proper noun) | Object: Mary saw Charlie. | Subject: Charlie ate a donut.
Primitive → Object (proper noun) | Primitive: Paula | Object: The child helped Paula.
Depth generalization: PP modifiers | Depth 2: Ava saw the ball in the bottle on the table. | Depth 3: Ava saw the ball in the bottle on the table on the floor.
Active → Passive | Active: Emma blessed William. | Passive: A child was blessed.

Table 3: Accuracy on COGS by generalization case. Each dot represents a single run of the model. [Per-case accuracy distribution plots are not reproduced here.]

5.2.1 Lexical vs. Structural Generalization

Some of the COGS generalization cases require lexical generalization: a primitive needs to be interpreted in a structure which, while not itself novel, did not occur with that primitive in training. This is the case for Object → Subject: the training set does contain examples of the structure [NP [V NP]VP] (Figure 3a), and the generalization concerns the particular NP that has never been observed in the first NP position. This contrasts with cases requiring structural generalization, where the structure of the sentence is itself novel. This is the case, for instance, for the structure [[NP PP]NP [V NP]VP]—a PP modifier on the subject—which appears in the generalization set but not in training (Figure 3b).

The depth generalizations and the generalization of modifiers across grammatical roles require structural generalization; all such cases had zero or near-zero accuracies, whereas models performed better on lexical generalization (Figure 2b). This discrepancy suggests that composition of structures is more challenging to both Transformers and LSTMs.

Figure 3: (a) Lexical generalization: a novel combination of a familiar primitive and a familiar structure. (b) Structural generalization: a novel combination of two familiar structures.

Successful depth generalization cases. Depth generalization with PP modifiers was the only case of structural generalization on which some models achieved nonzero accuracy. All of the successful examples were cases of depth 3, the smallest unseen depth tested. The success cases also had shorter output lengths, with a maximum length of 120 tokens. This was within the range of output lengths seen during training (the longest training example included 153 tokens), which may account for the somewhat higher accuracy on these cases.

Failure to generalize structurally or failure to produce novel labels? It is known that neural models find it challenging to produce labels they have not seen during training (Gandhi and Lake, 2019). Handling this problem is a necessary part of solving depth generalization, since each of the outputs of the depth generalization cases, such as (5b) below, contains more constants than the training outputs, such as the output of (5a):

(5) a. Depth 1: The cat liked that the dog saw the mouse. (5 index-taking items)
    b. Depth 3: The cat liked that the dog liked that the mouse liked that the girl saw the rat. (9 index-taking items)

As discussed in Section 3, we used index-based labels for constants precisely to help models with this issue of producing novel elements, by grounding the labels to the indices. Specifically, the 5 index-taking items in (5a) are labeled x1, x2, x5, x6 and x8 instead of being assigned arbitrary labels such as x, y, z, .... However, even with such non-arbitrary labels, the model still needs to learn that a word at index i relates to the output string 'i'.

While this problem of novel symbols is indeed an issue that the models need to handle during depth generalization, the pattern of errors suggests that the low accuracy is not purely due to this issue. In fact, only 0.5% of all depth generalization errors were cases where the structural form of the output was correct with only the indices being incorrect. More frequently, the models produced an end-of-sentence token too early (90.3% of all depth generalization errors), or produced sequences that were superfluously long (3% of errors contained more than 1000 tokens—more than twice as long as the maximum gold output length of 480). This implies that models struggle with handling longer and deeper sequences than those observed during training, independently of their inability to produce novel labels. While output length likely contributed to the difficulty of our depth generalization cases—even in the in-domain test set, the average length of correct answers was 43 tokens, compared to 83 for incorrect answers—deeply nested structures imposed additional challenges. On the test set examples with output length greater than 95, LSTM models and Transformer models had 68% and 13% accuracy, respectively. Their PP modifier depth generalization accuracy was much lower (LSTM: 2%; BiLSTM and Transformer: near 0%).

5.2.2 Levels of Embedding

Our depth generalization set contains examples with embedding depths 3–12. However, it is likely that humans would find deeply embedded structures difficult to interpret. Given this potential difficulty for humans, is our depth generalization a fair challenge to pose? Comprehensibility of 3–5 degrees of embedding is attested in the literature; Blaubergs and Braine (1974) showed that humans can understand 3–5 levels of right-branching CP embedding, and Karlsson (2010) observed that 3–5 levels of right-branching PP and CP embeddings do occur in corpora. In the case of the models we tested, they almost completely failed on generalization to any level of embedding, including depths 3–5 that humans should be able to understand (Table 4). We discuss the issue of generalization to depths greater than 5 in Appendix H.

Model | All | 3–5 | 6–12
Transformer | 0.00 | 0.00 | 0.00
LSTM (Bi) | 0.00 | 0.01 | 0.00
LSTM (Uni) | 0.01 | 0.03 | 0.00

Table 4: Accuracy on depths 3–5 and depths 6–12.

5.2.3 Model Size / Number of Exposure Examples

In follow-up experiments, we found that increasing the number of parameters of the Transformer model fivefold did not improve performance. If anything, variability was higher and mean accuracy was lower (see Appendix E.1). By contrast, increasing the number of exposure examples per primitive from one to 100 led to a significant improvement in generalization for all three models, though this increase was only applicable to lexical generalization cases (see Appendix E.2).

6 Comparison to Related Work

Our aggregate results in Table 2 are in line with recent work that has documented a significant discrepancy between neural models' excellent performance within distribution and their degraded performance out of distribution (Johnson et al., 2017; Lake and Baroni, 2018; Hupkes et al., 2020).

Our finding of poor generalization to deeper nested structures aligns with the results of Hupkes et al. (2020). Given that deeper structures also tend to be longer than shallower ones, this finding also relates to the difficulty of generalization to longer sequences. One illustrative example is the poor performance of LSTMs on a SCAN split that requires generalizing from shorter to longer sequences. While several models have made significant improvements on other SCAN splits, progress on the length split remains minimal (Li et al., 2019; Lake, 2019; Gordon et al., 2020).

The most similar work to ours is Compositional Freebase Questions (CFQ; Keysers et al. 2020), a synthetic dataset designed to test for compositional generalization in SPARQL parsing. COGS differs from CFQ in two main ways. First, compared to sentences with a SPARQL mapping, which are limited to questions and imperatives, the semantic representation used in COGS significantly extends the variety of expressions that can be assigned an interpretation. Second, in CFQ, challenging splits are defined by a similar primitive distribution but different distributions of the composed forms (“compound divergence”). This can lead to a training and test split that is not characterized by any principled linguistic difference. Following a stronger definition of compositionality, the generalization set in COGS includes combinations of primitives and syntactic roles that are novel (occurred zero times in training), without concern for matching the distribution of primitives across training and testing.

Our work is related to but distinct from work that tests language models for systematic syntactic generalization (Gulordava et al., 2018; Marvin and Linzen, 2018, i.a.). Unlike our work, the language modeling setup does not directly evaluate the meaning that the model assigns to a sentence.

7 Constraints on Generalization

To reach full adult linguistic competence, human learners not only need to be able to make abstraction-based generalizations, but also need to learn how to constrain them. For example, the verb donate takes a recipient to-PP (Emma donated the book to the museum) but does not allow double-object alternation (*Emma donated the museum the book). How such constraints could be learned has been discussed in linguistics under the banner of the projection problem (Baker, 1979). COGS focuses on evaluating computational models' ability to make systematic generalizations, but not on evaluating the ability to constrain them. For this reason, COGS only includes examples to which generalizations are applicable (e.g., dative verbs that alternate). This is a simplification; in natural language, generalizations are not applicable across the board, and are modulated by a multitude of morphophonological, syntactic and semantic factors. In the case of the dative alternation, properties such as animacy and definiteness are involved (Bresnan and Ford, 2010). Thus, evaluating constraints on generalization requires a detailed characterization of factors that govern individual generalization cases, as well as a formalism capable of expressing these factors, which we leave to future work.

8 Conclusion

We have proposed COGS, a challenge set for compositional generalization, which uses a synthetic sentence-to-logical-form mapping task that approximates meaning interpretation in English. When tested on COGS, both Transformers and LSTMs performed poorly on the generalization set, with high variability across runs, while their performance on the in-domain test set was consistently near-perfect. Furthermore, the models found structural generalization much more challenging compared to lexical generalization. Our results suggest that achieving high generalization accuracy on COGS is beyond the capacity of the models that we tested, and COGS can therefore motivate the development of new computational models.

What architecture would be needed to solve COGS? For structural generalization cases, the results of Bowman et al. (2015), Evans et al. (2018) and McCoy et al. (2019) suggest that tree-structured models may provide a better inductive bias. In particular, Bowman et al. (2015) showed that tree-structured neural networks generalized to longer sequences. For lexical generalization cases, the RNN-based model from Gordon et al. (2020) that implements permutation equivariance may help, considering that it was able to solve all primitive generalizations in SCAN.

Acknowledgments

We thank Sadhwi Srinivas and Kyle Rawlins for discussions about the logical form. We also thank Paul Smolensky, Benjamin Van Durme, and members of the JHU Neurosymbolic Computation Lab and the JHU Computation and Psycholinguistics Lab for their helpful feedback. TL was supported by National Science Foundation grant BCS-1920924.

References

Carl L. Baker. 1979. Syntactic theory and the projection problem. Linguistic Inquiry, 10(4):533–581.

Maija S. Blaubergs and Martin D. Braine. 1974. Short-term memory limitations on decoding self-embedded sentences. Journal of Experimental Psychology, 102(4):745–748.

Arielle Borovsky, Jeffrey L. Elman, and Marta Kutas. 2012. Once is enough: N400 indexes semantic integration of novel word meanings from a single exposure in context. Language Learning and Development, 8(3):278–302.

Arielle Borovsky, Marta Kutas, and Jeffrey L. Elman. 2010. Learning to use words: Event-related potentials index single-shot contextual word learning. Cognition, 116(2):289–296.

Johan Bos. 2008. Wide-coverage semantic analysis with Boxer. In Semantics in Text Processing. STEP 2008 Conference Proceedings, pages 277–286. College Publications.

Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015. Recursive neural networks can learn logical semantics. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 12–21, Beijing, China. Association for Computational Linguistics.

Joan Bresnan and Marilyn Ford. 2010. Predicting syntax: Processing dative constructions in American and Australian varieties of English. Language, 86(1):168–213.

Patricia J. Brooks and Michael Tomasello. 1999. Young children learn to produce passives with nonce verbs. Developmental Psychology, 35(1):29–44.

Morten H. Christiansen. 1992. The (non) necessity of recursion in natural language processing. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, pages 665–670.

Morten H. Christiansen and Nick Chater. 1999. Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2):157–205.

Erin Conwell and Katherine Demuth. 2007. Early syntactic productivity: Evidence from dative shift. Cognition, 103(2):163–179.

Angel Daza and Anette Frank. 2018. A sequence-to-sequence model for semantic role labeling. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 207–216, Melbourne, Australia. Association for Computational Linguistics.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics.

Richard Evans, David Saxton, David Amos, Pushmeet Kohli, and Edward Grefenstette. 2018. Can neural networks understand logical entailment? In International Conference on Learning Representations.

Larry Fenson, Virginia A. Marchman, Donna J. Thal, Phillip S. Dale, J. Steven Reznick, and Elizabeth Bates. 2007. MacArthur-Bates communicative development inventories. Paul H. Brookes Publishing Company, Baltimore, MD.

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.

Kanishk Gandhi and Brenden M. Lake. 2019. Mutual exclusivity as a challenge for neural networks. arXiv:1906.10197.

Jonathan Gordon, David Lopez-Paz, Marco Baroni, and Diane Bouchacourt. 2020. Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

Robert F. Hadley. 1994. Systematicity in connectionist language learning. Mind & Language, 9(3):247–272.

Marc D. Hauser, Noam Chomsky, and W. Tecumseh Fitch. 2002. The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Juan Hu, Nancy Budwig, Kaya Ono, and Hang Zhang. 2007. Individual differences in preschoolers' ability to generalize unaccusative intransitive constructions in novel verb experiments: Evidence from their familiar verb usage in naturalistic play contexts. In The 31st Boston University Conference on Language Development.

Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2020. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

David Kaplan. 1989. Demonstratives. In Joseph Almog, John Perry, and Howard Wettstein, editors, Themes from Kaplan, pages 481–563. Oxford University Press.

Fred Karlsson. 2010. Syntactic recursion and iteration. In Harry van der Hulst, editor, Recursion and human language, pages 43–67. De Gruyter Mouton.

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.

Karin Kipper-Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Melissa Kline and Katherine Demuth. 2014. Syntactic generalization with novel intransitive verbs. Journal of Child Language, 41(3):543–574.

Angelika Kratzer. 1996. Severing the external argument from its verb. In Phrase structure and the lexicon, pages 109–137. Springer.

Brenden M. Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems 32, pages 9791–9801. Curran Associates, Inc.

Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2873–2882, Stockholmsmässan, Stockholm, Sweden. PMLR.

Geoffrey Leech, Paul Rayson, and Andrew Wilson. 2001. Word frequencies in written and spoken English: based on the British National Corpus. Longman.

Beth Levin. 1993. English verb classes and alternations: A preliminary investigation. University of Chicago Press.

Yuanpeng Li, Liang Zhao, Jianyu Wang, and Joel Hestness. 2019. Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4293–4302, Hong Kong, China. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Richard Montague. 1974. English as a formal language. In Formal Philosophy: Selected papers of Richard Montague. Yale University Press.

Kaya Ono and Nancy Budwig. 2006. Young children's use of unaccusative intransitives in novel verb experiments. In The 30th Boston University Conference on Language Development.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Terence Parsons. 1990. Events in the Semantics of English, volume 334. MIT Press, Cambridge, MA.

Andrew Perfors, Joshua B. Tenenbaum, and Terry Regier. 2011. The learnability of abstract syntactic principles. Cognition, 118(3):306–338.

Steven T. Piantadosi. 2014. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5):1112–1130.

Siva Reddy, Oscar Täckström, Slav Petrov, Mark Steedman, and Mirella Lapata. 2017. Universal semantic parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 89–101, Copenhagen, Denmark. Association for Computational Linguistics.

Douglas Roland, Frederic Dick, and Jeffrey L. Elman. 2007. Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language, 57(3):348–379.

Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn't buy quality syntax with neural language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831–5837, Hong Kong, China. Association for Computational Linguistics.

Shane Storks, Qiaozi Gao, and Joyce Y. Chai. 2019. Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. arXiv:1904.01172.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Michael Tomasello and Raquel Olguin. 1993. Twenty-three-month-old children have a grammatical category of noun. Cognitive Development, 8(4):451–464.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Lynsey Wolter. 2019. Situation variables and licensing by modification in opaque demonstratives. Proceedings of Sinn und Bedeutung, 11:612–626.


A PCFG

Our PCFG assigns uniform probability (about 5%) to each frame (e.g., transitive verb with both subject and object, transitive verb with only subject, passivized transitive with subject only, passivized transitive with subject and agent by-phrase...) except for CP embedding constructions, whose probability was increased to about 8% to match their distribution in natural corpora.4 Syntactically ambiguous verb subcategories are distinguishable by distributional information; for instance, unaccusative verbs appear with both animate and inanimate subjects, whereas unergatives and object-omitted transitives only appear with animate subjects. Object-omitted transitives always have a transitive counterpart, whereas unergatives do not alternate. The verb subtypes also have distinct primitive logical forms, and primitive logical forms of some verbs were provided as part of the training set. The grammar assigns a Zipfian probability distribution (inverse rank-frequency distribution) over lexical items in each noun and verb subcategory.5 This was done in order to ensure that all possible grammatical patterns that a lexical item could appear in were sampled by the PCFG and included in our dataset, for at least the most frequent items in the class (e.g., both forms of the object omission alternation are sampled for the most frequent verb).
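The inverse rank-frequency weighting can be sketched as follows (an illustrative toy with an invented noun list, not the actual grammar code):

```python
import random

def zipfian_weights(items):
    """Inverse rank-frequency weights: the k-th item in the rank-ordered
    list gets weight proportional to 1/k; weights are then normalized."""
    weights = [1.0 / rank for rank in range(1, len(items) + 1)]
    total = sum(weights)
    return [w / total for w in weights]

nouns = ["cat", "dog", "hedgehog", "cockroach", "shark"]  # illustrative only
probs = zipfian_weights(nouns)
print([(n, round(p, 3)) for n, p in zip(nouns, probs)])
print(random.choices(nouns, weights=probs, k=10))
```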

The types of sentences generated by our PCFG are as follows. Sentence type names are taken from Roland et al. (2007).

• Simple Intransitive

• To Infinitive Verb Phrase

• Sentential Complement

• Simple Transitive

• Ditransitive

• Passive

When calculating the % covered by our grammar in Section 4, we collapsed Sentential Complement with Complementizer and Sentential Complement without Complementizer.

4 The assigned probabilities did not necessarily translate into the proportion in the generated dataset, since there were post-generation filtering mechanisms such as removing duplicate entries.

5 This is a simplification, since not all syntactic categories or category subtypes are expected to follow a Zipfian frequency distribution (Piantadosi, 2014).

B Selection of Lexical Items

We selected the 403 common nouns in our lexical inventory from the MacArthur-Bates Communicative Development Inventories (Fenson et al., 2007) and the British National Corpus (Leech et al., 2001). 100 proper nouns were selected from the top baby names of 2019 in the United States according to the United States Social Security Administration. In selecting the verbs, we referred to Levin (1993) and Kipper-Schuler (2005). There were 113 unique verbs and 6 verb types, with some overlapping verbs across verb types (e.g., like with NP and CP arguments). The verb types are as follows:

• Verbs that take NP arguments that allow direct object omission (e.g., eat)

• Verbs that take NP arguments that do not allow direct object omission (e.g., find)

• Subject control verbs that take infinitival arguments (e.g., try)

• Verbs that take CP arguments (e.g., say)

• Unaccusative verbs (e.g., freeze)

• Unergative verbs (e.g., sleep)

• Dative verbs (e.g., give)

5 common nouns, 3 proper nouns and 7 verbs used as primitive exposure examples were selected at random.

C Logical Form Postprocessing

We applied several postprocessing steps to the simplified logical forms of Reddy et al. (2017). The changes induced by our postprocessing steps are as follows:

• Skolem constants are named xi instead of i,where i is the 0-based index of the head of thephrase denoted by the constant.

• Event predicates triggered by nominals areremoved for simplicity.

• The final form is conjunctive, where the conjuncts are sorted by the subscript of the Skolem constants (i.e., the order of the conjuncts is deterministic).

• Definite and indefinite descriptions are formally distinguished. Refer to Appendix H for the exact distinction and linguistic implications.

See Table 5 for a comparison between logical forms.

Expression:           John ate the cookie.
Neo-Davidsonian:      ∃e.eat′(e) ∧ (Agent(e) = john′) ∧ (Theme(e) = ιx.cookie′(x))
Reddy et al. (2017):  [‘arg0(3:e, 3:cookie)’, ‘eat.arg1(1:e, 0:m.John)’, ‘eat.arg2(1:e, 3:cookie)’]
Ours:                 *cookie(x3) ; eat.agent(x1, John) AND eat.theme(x1, x3)

Table 5: Comparison of logical forms for the expression John ate the cookie.
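For illustration only (a sketch of the target format rather than the conversion script we used), the following assembles a logical form in our style from a set of conjuncts, placing the asterisk-marked definite descriptions first and ordering the remaining conjuncts by the subscripts of their Skolem constants:

```python
import re

def skolem_indices(conjunct):
    """Sort key: the subscripts of all Skolem constants (x<i>) in the conjunct."""
    return tuple(int(i) for i in re.findall(r"x(\d+)", conjunct))

def assemble_logical_form(definites, conjuncts):
    """definites: asterisk-marked definite descriptions; conjuncts: remaining predications."""
    body = " AND ".join(sorted(conjuncts, key=skolem_indices))
    return " ; ".join(list(definites) + [body]) if definites else body

print(assemble_logical_form(["*cookie(x3)"],
                            ["eat.theme(x1,x3)", "eat.agent(x1,John)"]))
# *cookie(x3) ; eat.agent(x1,John) AND eat.theme(x1,x3)
```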

D Training Details

LSTM. We used a 2-layer LSTM encoder-decoder with global attention and a dot-product score function. The decoder followed an input-feeding approach (Luong et al., 2015). We tested both unidirectional and bidirectional encoders. We used inputs of dimension 512 and two hidden layers of dimension 512 (256 for the model with bidirectional encoders, so that the input dimension of the decoder stays constant across models after concatenating forward and backward states, and the number of parameters in each model remains comparable). A dropout of 0.1 was applied after the embedding layer and after each hidden layer except for the last. Following Lake and Baroni (2018), we used the Adam optimizer, and clipped gradients with a norm larger than 5.0. The training time for each model was around 3 to 4 hours on a single NVIDIA K80 GPU.
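For concreteness, the following is an approximate PyTorch sketch of the encoder side of this configuration; the attentional, input-feeding decoder is omitted and vocab_size is a hypothetical placeholder, so this should be read as an illustration rather than the exact training code:

```python
import torch.nn as nn

def build_encoder(bidirectional: bool, vocab_size: int = 10000):
    # Hidden size is halved in the bidirectional case so that the concatenated
    # forward and backward states still have dimension 512.
    hidden = 256 if bidirectional else 512
    embed = nn.Sequential(nn.Embedding(vocab_size, 512), nn.Dropout(0.1))
    # Inter-layer dropout corresponds to "after each hidden layer except for the last".
    lstm = nn.LSTM(input_size=512, hidden_size=hidden, num_layers=2,
                   dropout=0.1, bidirectional=bidirectional)
    return embed, lstm
```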

Transformer. Our Transformer model had 2 encoder and decoder layers, 4 attention heads, and a feedforward dimension of 512. Other hyperparameter settings not discussed here followed Vaswani et al. (2017) as closely as possible. The training time for each model was around 1 to 2 hours on a single NVIDIA K80 GPU.
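Again as an approximation rather than the exact training code, the stated hyperparameters correspond roughly to the following PyTorch module; the model dimension of 512 is our assumption, carried over from Vaswani et al. (2017):

```python
import torch.nn as nn

# 2 encoder and 2 decoder layers, 4 attention heads, feedforward dimension 512.
# d_model=512 is assumed here rather than stated in the text above.
transformer = nn.Transformer(d_model=512, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             dim_feedforward=512)
```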

E Additional Experiments

E.1 Effect of Transformer Model Size

The results we report in the body of the paper are from a Transformer with 9.5M parameters. How does the number of parameters affect the Transformer’s success on COGS? Figure 4 compares the performance of three Transformer models of varying size (large: 45M, small: 9.5M, smaller: 5.3M). The number of parameters did not have a large impact on test set accuracy; all runs of all models achieved higher than 90% accuracy. On the other hand, model size did affect generalization. Perhaps surprisingly, the average across 5 runs of the large model was lower than those of the smaller models; however, this average result is hard to interpret given the very high variance in accuracy across runs of the largest Transformer.

[Figure 4 (plot omitted): accuracy (y-axis) against the number of model parameters (x-axis) for the full (45M), small (9.5M), and smaller (5.3M) Transformers, with separate markers for Generalization and Test accuracy.]

# Params.   Dev.   Test   Gen.
45M         0.95   0.95   0.20 (± 0.26)
9.5M        0.96   0.96   0.35 (± 0.06)
5.3M        0.95   0.95   0.37 (± 0.14)

Figure 4: The effect of Transformer model size on generalization and test set accuracy.

E.2 Effect of Number of Distinct Exposure Examples per Primitive

COGS includes a single exposure example for each primitive (one-shot generalization). To test whether a larger number of exposure examples helps generalization, we repeated our experiments with a version of the COGS training set in which the number of exposure examples was increased to 100. All models benefited from the greater number of exposure examples (Table 6). Note that some of the cases, such as Object-Modifying PP → Subject-Modifying PP, did not require primitive exposure examples, and are therefore identical across the 1-shot and 100-shot settings (for the detailed breakdown by case, see Table 7).

Model        # Exposure examples  Dev.  Test  Gen.
Transformer  1                    0.96  0.96  0.35
Transformer  100                  0.94  0.94  0.63
LSTM (Bi)    1                    0.99  0.99  0.16
LSTM (Bi)    100                  0.99  0.99  0.50
LSTM (Uni)   1                    0.99  0.99  0.32
LSTM (Uni)   100                  1.00  1.00  0.54

Table 6: Effect of number of exposure examples per primitive on accuracy.

F Results by Case

Table 7 lists the full results on each generalization case.

G Detailed Error Analysis

G.1 Active → Passive: Systematicity of Errors in LSTMs vs. Transformers

As discussed in Section 5.2, the Active → Passive generalization was a case in which Transformers performed near-perfectly, whereas LSTMs did not. However, an error analysis revealed that the errors made by LSTMs were more systematic than those of Transformers.

The majority of LSTMs’ errors were structurally correct; only 0.3% (7/2591) of the unidirectional LSTM errors and 0.5% (14/2773) of the bidirectional LSTM errors had a different structure from the gold output. LSTMs often replaced the target passive verb with a different one (6), misused a thematic role (7), or misused an index (8). These types of errors have equivalent structure to the correct output, and have the same number of tokens as the correct output.

(6) A balloon was blessed. →
GOLD: balloon(x1) AND bless.theme(x3,x1)
LSTM: balloon(x1) AND inflate.theme(x3,x1)

(7) The book was blessed by a girl. →
GOLD: *book(x1) AND bless.theme(x3,x1) AND bless.agent(x3,x6) AND girl(x6)
LSTM: *book(x1) AND bless.theme(x3,x1) AND send.recipient(x3,x6) AND girl(x6)

(8) A rose was blessed by the baby. →
GOLD: *baby(x6) ; rose(x1) AND bless.theme(x3,x1) AND bless.agent(x3,x6)
LSTM: *baby(x5) ; rose(x1) AND bless.theme(x3,x1) AND bless.agent(x3,x6)

By contrast, the Transformer’s errors in the Active → Passive generalization, despite being much fewer in number, had incorrect structure (79.6% of all errors; 39/49). The patterns in the total of 49 errors made by Transformer models in aggregate included omission of a whole conjunct, spurious indices, not producing an output, using a numbered constant in place of a proper noun, etc. The following example shows a Transformer output with multiple errors, where the model misinterpreted tool as a binary predicate and misindexed the theme argument:

(9) The tool was blessed by the girl. →
GOLD: *tool(x1) ; *girl(x6) ; bless.theme(x3,x1) AND bless.agent(x3,x6)
TRANSFORMER: *tool(x1) ; *girl(x6) ; tool(x3,x1) AND bless.theme(x3,x6)

Some Transformer runs produced more systematic errors than others, despite having similar accuracy on the Active → Passive generalization. For example, some runs mostly made the error of using the wrong verb, as in (6). Others made more idiosyncratic errors with mixed patterns.

One possible reason for the high performance on the Active → Passive case is that our training data included both passive constructions with and without the agent by-phrase (e.g., both The book was seen and The book was seen by Emma). In these two constructions, the logical form of the former is a prefix of the logical form of the latter:

(10) The book was seen (by Emma). →
NO BY: *book(x1) AND see.theme(x3,x1)
WITH BY: *book(x1) AND see.theme(x3,x1) AND see.agent(x3,Emma)

Since these two types of passive constructions were sampled with equal probability, performance on the Active → Passive case may have benefited from more exposures to examples relevant to forming the passive construction.

[Figure 5 (plot omitted): Accuracy on COGS with a different number of exposure examples. Panels: 1 Primitive Exposure Example vs. 100 Primitive Exposure Examples; x-axis: Transformer, LSTM (Bi), LSTM (Uni); y-axis: Accuracy, shown separately for the Generalization and Test sets. Each dot represents a model trained with a different random weight initialization.]

G.2 More General Error Patterns

The LSTMs’ erroneous outputs were more systematic, and closer to the correct outputs, in other generalization cases as well. The average token-level edit distance between errors and correct answers across all generalization cases, only considering error cases, was 11 and 14 tokens for bidirectional and unidirectional LSTMs, compared to 42 tokens for Transformers. Furthermore, Transformers frequently produced ill-formed logical forms; for example, they often failed to close the final parenthesis (11). In fact, ending the logical form with anything other than a right parenthesis is ill-formed (12). This type of error accounted for 12% of all Transformer errors, while only 0.5% of bidirectional and unidirectional LSTM errors were ill-formed in this way.

(11) Paula packed. →
GOLD: pack.agent(x1, Paula)
TRANSFORMER: pack.agent(x1, Paula

(12) Emma appreciated the hedgehog. →
GOLD: *hedgehog(x3) ; appreciate.agent(x1,Emma) AND appreciate.theme(x1,x3)
TRANSFORMER: *
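For reference, the token-level edit distance used above can be computed with a standard Levenshtein routine over tokens; the sketch below assumes whitespace-separated tokens and is not the exact evaluation script:

```python
def token_edit_distance(predicted: str, gold: str) -> int:
    """Levenshtein distance over whitespace-separated tokens."""
    p, g = predicted.split(), gold.split()
    prev = list(range(len(g) + 1))           # distances for the empty prefix of p
    for i, p_tok in enumerate(p, start=1):
        curr = [i] + [0] * len(g)
        for j, g_tok in enumerate(g, start=1):
            cost = 0 if p_tok == g_tok else 1
            curr[j] = min(prev[j] + 1,         # delete p_tok
                          curr[j - 1] + 1,     # insert g_tok
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[-1]

# One substituted token (a wrong verb, as in example (6)) gives distance 1.
print(token_edit_distance("balloon ( x 1 ) AND inflate . theme ( x 3 , x 1 )",
                          "balloon ( x 1 ) AND bless . theme ( x 3 , x 1 )"))
```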

G.3 Common vs. Proper Nouns

Table 3 shows that even for the same type of targeted generalization (e.g., Object → Subject, Primitive → Object), the variant that used proper nouns (13) was more challenging than the variant using common nouns (14).

(13) Training: The creature grew Charlie. → *creature(x1) AND grow.agent(x2, x1) AND grow.theme(x2, Charlie)
Generalization: Charlie ate a cookie. → eat.agent(x1,Charlie) AND eat.theme(x1,x3) AND cookie(x3)

(14) Training: Henry liked a cockroach. → like.agent(x1, Henry) AND like.theme(x1,x3) AND cockroach(x3)
Generalization: The cockroach ate the bat. → *cockroach(x1) AND *bat(x4) AND eat.agent(x2,x1) AND eat.theme(x2,x4)

What is the source of this discrepancy? As can be seen from the above examples, common and proper nouns are formally distinct in both the source sentence and the target logical form. Translating a common noun requires conjoining a unary predicate (cockroach(xn)), and placing the predicated constant (xn) in appropriate event predicates. On the other hand, translating a proper noun requires placing the nominal constant (Charlie) inside appropriate event predicates. Given the lower complexity of (symbolic) steps required for translating proper nouns, the lower accuracy is surprising. While we do not have a definite explanation for this discrepancy, one possibility is that it is due to a frequency effect; our dataset overall contained more common nouns than proper nouns, in terms of both type and token frequency.

The discrepancy in accuracy between common and proper nouns indicates that performance is sensitive to seemingly minor formal differences in cases that require the same type of generalization, echoing the discrepancy between the jump and turn left primitive splits of SCAN that was originally observed by Lake and Baroni (2018).

H Linguistic Commentary

Semantic representation. Our semantic representation is based on a Neo-Davidsonian view of verbal arguments (Parsons, 1990), in which verbs specify an event argument, and thematic roles link non-event arguments to the event. Definite descriptions that are not proper names are marked with an asterisk, standing in place of the standard ι notation. The asterisk expressions appear at the leftmost position of the logical form to avoid nesting of predicated expressions. They are not conjoined to the logical form but separated with a ;, because ι expressions are of type e rather than t. The logical form with the asterisk expression (e.g., The cat ran: *cat(x1) ; run.agent(x2, x1)) should be semantically equivalent to one that contains a nested ι expression (∃e. run.agent(e, ιx.cat(x))), if ι is scopally inert. This may not necessarily be the case for definite descriptions in intensional semantics, for instance under modals. See the discussion of Kaplan (1989) in Wolter (2019) for more details.

Representation of primitive meanings. Primitives in our dataset take the following form:

• Common noun: shark → λa.shark(a)

• Proper noun: Emma → Emma

• Verb: like → λa.λb.λe.like.agent(e, a) ∧ like.theme(e, b)

where λ is written as ‘LAMBDA’ and ∧ is written as ‘AND’. Primitive meanings are not skolemized because they are not existentially quantified. We used the letters e, a, b to distinguish variables from Skolem constants (xn). Verbs that are compatible with agents specify an agent as an argument in their primitive meanings for simplicity, rather than following the external argument analysis of Kratzer (1996).
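As a purely illustrative worked step (not itself part of the dataset), applying the primitive meaning of like to an agent argument by β-reduction yields:

```latex
% Illustrative beta-reduction over the primitive meaning of "like".
\begin{align*}
  \textit{like} &\;=\; \lambda a.\,\lambda b.\,\lambda e.\ \textit{like.agent}(e, a) \wedge \textit{like.theme}(e, b)\\
  \textit{like}(\textit{Emma}) &\;=\; \lambda b.\,\lambda e.\ \textit{like.agent}(e, \textit{Emma}) \wedge \textit{like.theme}(e, b)
\end{align*}
```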

Recursive structures tested. Whether unbounded recursion should be considered as part of the machinery that governs language is a debated issue; the main evidence against it is the significantly degraded human parsing performance on multiply-nested structures (Christiansen and Chater, 1999). In our dataset, we only included structures that are traditionally thought of as recursive but do not necessitate recursion as an intrinsic mechanism, because they can be implemented by a finite state machine (Christiansen, 1992).

Testing generalization to arbitrary depths. Our depth generalization sets test generalization to 3-12 degrees of embedding in right-branching structures. However, human processing of embedded structures degrades over levels of embedding (Blaubergs and Braine, 1974), and attestation of embeddings greater than depth 5 is rare (Karlsson, 2010). Given this limitation in humans, should the inability to handle generalization to our generalization set, and furthermore to arbitrary depths of embedding, be viewed as a flaw of the system? Our position is that it should. According to Chomsky's notion of competence versus performance, there is no reason to view English sentences with embedding depths greater than 5 as ungrammatical, even if human memory limitations make such sentences difficult to understand. The computational models that we tested are not restricted by the same memory limitations and therefore should not fail to process such sentences on the same grounds. Any such failure would be diagnostic of a discrepancy between what the model has learned and the correct way to perform the task, as defined by English grammar. A detailed comparison of computational models' and human subjects' performance on this subset of COGS would be an interesting follow-up that would shed light on both human and machine generalization. We predict that models' behavior will differ from that of humans, since the models' accuracy at depth 3 was already close to zero, whereas we expect that humans will display degraded but still reasonable understanding of depth 3 PP/CP embeddings.

PP attachment ambiguity. Our grammar does not generate VP-modifying PPs (the only PP verbal dependents are recipient to-phrases, which are always arguments rather than modifiers). Therefore, all PP modifiers in our dataset should strictly have an NP-attachment reading, although for human readers VP-attachment readings could sometimes be more prominent based on the lexical content of the sentences. All modifications are nested rather than sequential: The cat ate [the cookie [on the mat [beside the table]]] rather than The cat ate [the cookie [on the mat] [beside the table]].

Selectional preference. Words have selectional preference, a tendency to semantically constrain other words that they appear with. For instance, verbs such as sing and walk are likely to take animate subjects. Our grammar only implements a simplified version of selectional preference: namely, the animacy of the NP arguments based on verb type (e.g., subjects of unergatives are animate). In reality, selectional preference is much more complex and highly verb-specific; for instance, the theme of eat should be something that is edible. The simplification of selectional preference results in semantic infelicity in some of the generated sentences. This should not create any difficulty in constructing a valid form-meaning mapping if models are trained from scratch, but may cause problems if models pretrained on real language data are tested.

Generalization of PP modification. Our PP modifier generalization set (Section 3.2) requires generalizing PPs that modify NPs in the object position to NPs in the subject position, without having seen any subject modification. We note that this may be a stronger generalization problem than what humans may actually encounter, based on the following two observations. First, it is true that PP modifiers in the subject position are much less frequent than PP modifiers in the object position in child-directed speech, but subject-modifying PPs are not absent from it: according to our analysis of the Epochs corpus of Perfors et al. (2011), PP modification on the subject of a declarative sentence occurred only 13 times whereas PP modification on the object occurred over 100 times. Second, there exist many [NP PP] fragments that are not full sentences (e.g., a disk from a game) in the corpus. It is still likely that PP modification does not occur in all possible syntactic positions that can be occupied by an NP—for instance, in the subject position of a depth 2 embedded CP—and to interpret such sentences structural generalization would be required. Nevertheless, whether humans would be able to generalize modifiers in one syntactic position in the total absence of observing modifiers in other syntactic positions (or as fragments) remains to be tested, and is part of our future work.


# Exposure Contexts  Case                                         Transformer  LSTM (Bi)  LSTM (Uni)

1    Subject → Object (common noun)                    0.31  0.05  0.18
     Subject → Object (proper noun)                    0.30  0.00  0.06
     Object → Subject (common noun)                    0.87  0.28  0.51
     Object → Subject (proper noun)                    0.45  0.02  0.04
     Primitive noun → Subject (common noun)            0.17  0.02  0.03
     Primitive noun → Subject (proper noun)            0.00  0.00  0.17
     Primitive noun → Object (common noun)             0.06  0.05  0.01
     Primitive noun → Object (proper noun)             0.00  0.00  0.00
     Primitive verb → Infinitival argument             0.00  0.23  0.07
     Object-modifying PP → Subject-modifying PP        0.00  0.00  0.00
     Depth generalization: Sentential complements      0.00  0.00  0.00
     Depth generalization: PP modifiers                0.00  0.00  0.02
     Active → Passive                                  0.99  0.45  0.48
     Passive → Active                                  0.61  0.19  0.49
     Object-omitted transitive → Transitive            0.61  0.05  0.60
     Unaccusative → Transitive                         0.38  0.03  0.26
     Double object dative → PP dative                  0.45  0.16  0.75
     PP dative → Double object dative                  0.58  0.07  0.79
     Agent NP → Unaccusative Subject                   0.69  0.31  0.56
     Theme NP → Object-omitted transitive Subject      0.45  0.74  0.87
     Theme NP → Unergative subject                     0.50  0.74  0.87

100  Subject → Object (common noun)                    0.86  0.93  0.91
     Subject → Object NP (proper noun)                 0.54  0.60  0.54
     Object → Subject (common noun)                    0.86  0.98  0.97
     Object → Subject (proper noun)                    0.81  0.30  0.32
     Primitive noun → Subject (common noun)            0.83  0.00  0.00
     Primitive noun → Subject (proper noun)            0.24  0.00  0.00
     Primitive noun → Object (common noun)             0.82  0.05  0.01
     Primitive noun → Object (proper noun)             0.23  0.00  0.00
     Primitive verb → Infinitival argument             0.89  0.18  0.21
     Object-modifying PP → Subject-modifying PP        0.00  0.00  0.00
     Depth generalization: Sentential complements      0.00  0.00  0.00
     Depth generalization: PP modifiers                0.00  0.01  0.02
     Active → Passive                                  0.99  1.00  1.00
     Passive → Active                                  0.89  0.45  0.79
     Object-omitted transitive → Transitive            0.73  0.63  0.98
     Unaccusative → Transitive                         0.47  0.75  0.94
     Double object dative → PP dative                  0.83  0.85  0.99
     PP dative → Double object dative                  0.82  0.94  0.96
     Agent NP → Unaccusative Subject                   0.84  0.99  0.99
     Theme NP → Object-omitted transitive Subject      0.53  0.86  0.81
     Theme NP → Unergative subject                     0.96  0.96  0.98

Table 7: Full model accuracy by generalization case, with primitive exposure in 1 context (default) and 100 (increased) distinct contexts. Each result is an average over 5 random seeds.