Connectionism and the Problem of
Systematicity
Steven Andrew Phillips B.Sc., B.A.(Hons.)
A thesis submitted for the degree of Doctor of Philosophy
Department of Computer Science
The University of Queensland
January 25, 1995
Statement of originality
The work presented in this thesis is, to the best of my knowledge, original, except
as acknowledged in the text, and the material has not been submitted, either in
whole or in part, for a degree at this or any other university.
Steven Phillips
January 25, 1995
Abstract
Systematicity is a pervasive property of cognitive behaviour (e.g., language and
reasoning) whereby the ability to represent (systematicity of representation) and
infer from (systematicity of inference) some instances of a structured object ex-
tends to other instances conforming to the same structure (Fodor & Pylyshyn,
1988; Fodor & McLaughlin, 1990). The problem of systematicity for a Connection-
ist approach to cognition is: (1) Can Connectionist models exhibit systematicity
without implementing a Classical cognitive architecture (i.e., without resorting to
symbolic representations and processes)?; and (2) Can Connectionist models exhibit
the necessary acquisition of systematic behaviour (i.e., above chance level)?
In regard to the first question, after a review of the Classicist/Connectionist
debate over alternative explanations for systematicity, I conclude that Connection-
ist models cannot exhibit systematicity without resorting to some form of Classi-
cal compositionality (i.e., tokening of component representations wherever complex
representations are tokened). Essentially, Smolensky's (1991) weak (microfeatures)
and strong (tensors) compositionality, and van Gelder's (1990) functional compo-
sitionality either: (1) cannot support systematicity because component representa-
tions are not (uniquely) accessible by other processes; or, (2) can support system-
aticity, but only by tokening component representations relative to the processes
that access them (i.e., by implementing some form of Classical compositionality).
However, I also argue that Connectionism potentially offers an explanation for
the acquisition of systematic behaviour, which is an issue not addressed in the
Classical paradigm since systematicity is built into a Classical architecture. Conse-
quently, the primary concern of this thesis is the second question, which I address
by considering two criteria for the necessary acquisition of systematic behaviour,
and then evaluating Connectionist models with respect to these criteria on several
learning tasks. Thus, systematicity is treated as a problem of generalization rather
than representation.
The main results of this thesis concern Hadley's (1993) strong systematicity cri-
terion (i.e., generalization to novel component positions) on inference tasks where
a network must learn to infer, on request, the components of binary and ternary
relations. Through an analysis of internal representations and network learning
dynamics I show that, in general, three-layer first-order networks (including the
feedforward network, Elman's (1990) simple recurrent network, and Pollack's (1990)
recursive auto-associative memory) cannot exhibit strong systematicity on the bi-
nary relations inference task. I attribute the lack of strong systematicity to an
independence between weights that implement component mappings in each of
their possible positions.
In response to the lack of strong systematicity with existing networks, I then
developed the tensor-recurrent network, which has a dependency between such
weights, by incorporating the representational capacity of Smolensky's (1987b)
tensor network with the learning capacity of the simple recurrent network. Con-
structing and manipulating tensor representations of complex objects, through the
inner and outer product operators, assumes appropriate component, role and cue
vectors (for representing and extracting component objects and their roles within
a complex object). In the tensor-recurrent network, these vectors are learnt by
backpropagating an error signal along weighted connections and units implement-
ing the inner and outer product operators. I show that this architecture exhibits
the necessary acquisition of systematic behaviour, as defined by Hadley's strong
systematicity criterion, on a ternary relations inference task.
In summary, I conclude that:
1. Connectionist models cannot demonstrate systematicity without implement-
ing some form of symbolic representation and process.
2. Connectionism can provide models, as exemplified by the tensor-recurrent
network, with architectural properties that are sufficient for the necessary
acquisition of systematic behaviour from, in part, non-symbolic processes.
The separation of internal representations into component and role vector spaces
in the tensor-recurrent network is analogous to a type/token distinction character-
istic of symbolic computation. Thus, in terms of systematic behaviour, Connection-
ism offers a "neural-like" implementation of symbolic processing. However, where
Connectionism goes beyond Classicism is in an explanation for the acquisition of
systematic behaviour, which is an issue not addressed in the Classical paradigm
since systematicity is built into a Classical architecture, not acquired.
List of publications
1. Phillips, S. (1994a). Strong Systematicity within Connectionism: The
Tensor-Recurrent Network. To appear in Proceedings of the Sixteenth An-
nual Conference of the Cognitive Science Society, Atlanta, GA.
2. Phillips, S. (1994b). Connectionism and Systematicity. In A.C. Tsoi & T.
Downs (Eds.), Proceedings of the Fifth Australian Conference on Neural Net-
works, pp. 53-55, Brisbane, Australia.
3. Phillips, S. (1994c). Understanding as generalization not just representa-
tion. In J. Wiles, C. Latimer & C. Stevens (Eds.), Collected Papers from
a Symposium on Connectionist Models and Psychology, pp. 110-111. Tech-
nical Report No. 289, Department of Computer Science, The University of
Queensland, Australia. A comment on Halford & Wilson's paper: "How far
do neural network models account for human reasoning?".
4. Phillips, S. and Wiles, J., (1993). Exponential Generalizations from a Polyno-
mial Number of Examples in a Combinatorial Domain. In Proceedings of the
International Conference on Neural Networks, pp. 505-508, Nagoya, Japan.
5. Phillips, S., (1993). The Effect of Representation on Error Surface. In P.
Leong & M. Jabri (Eds.), Proceedings of the Fourth Australian Conference
on Neural Networks, pp. 86-89. Melbourne, Australia: Sydney University
Electrical Engineering.
6. Phillips, S., (1992). Making a Simple Recurrent Network a Self-Oscillator by
Incremental Training. In P. Leong and M. Jabri (Eds.), Proceedings of the
Third Australian Conference on Neural Networks, pp. 244-247. Canberra,
Australia: Sydney University Electrical Engineering.
7. Phillips, S., Wiles, J. and Schwartz, S., (1991). A Comparison of Three
Classification Algorithms on the Diagnosis of Abdominal Pains. In M. Jabri
(Ed.), Proceedings of the Second Australian Conference on Neural Networks,
pp. 283-287. Sydney, Australia: Sydney University Electrical Engineering.
8. Bakker, P., Phillips, S. and Wiles, J. (in press). The 1000-2-1000 encoder: A
matter of Representation. Neural Network World, 4(5), 527-534.
9. Schwartz, S., Wiles, J. and Phillips, S., (in press). Connectionist, rule-based
and Bayesian decision aids: an empirical comparison. In P. Slezak, T. Caelli
& R. Clark (Eds.), Perspectives on Cognitive Science: Theories, Methods and
Foundations, pp. 167-180. Norwood, NJ: Ablex.
10. Schwartz, S., Wiles, J., Gough, I. and Phillips, S., (1993). Connectionist, rule-
based and Bayesian decision aids: an empirical comparison. In D. J. Hand
(Ed.) Artificial Intelligence Frontiers in Statistics (pp. 264-277). London:
Chapman & Hall. (NB. An earlier version of Schwartz et al 1994).
11. Bakker, P., Phillips, S. and Wiles, J., (1993). The N-2-N encoder: A matter
of Representation. In S. Gielen & B. Kappen (Eds.), Proceedings of the In-
ternational Conference on Neural Networks, pp. 554-557. London: Springer-
Verlag.
12. Dennis, S. and Phillips, S., (1991). Analysis Tools for Neural Networks.
Tech. Report No. 207, Department of Computer Science, The University of
Queensland, Australia.
Contents

Statement of originality
Abstract
List of publications
Contents
List of figures
List of tables
Preface
  Relationship to previous work
  Acknowledgements

1 Introduction
  1.1 Statement of thesis

2 Systematicity: What is the problem?
  2.1 Introduction
  2.2 Systematicity and its Classical explanation
  2.3 Is there a Connectionist alternative?
    2.3.1 The horns of Fodor
    2.3.2 Weak and strong compositionality
    2.3.3 The concatenative/functional distinction
    2.3.4 A "potential" contribution of Connectionism
  2.4 Thesis question
    2.4.1 Method
  2.5 Summary

3 Generalization across domain
  3.1 Introduction
  3.2 Brousse and Smolensky's massive generalization
    3.2.1 Task: Auto-association of N-tuples
    3.2.2 Method
    3.2.3 Exponential number of generalizations
    3.2.4 Exponential decrease in percentage of generalizations
  3.3 Probably approximately correct learnability
  3.4 Systematicity of representation
    3.4.1 Task: Auto-association of N-tuples
    3.4.2 Model: Feedforward network
    3.4.3 Theoretical result
    3.4.4 Empirical results
  3.5 General Discussion
  3.6 Summary and conclusion

4 Generalization across position
  4.1 Introduction
  4.2 Hadley's strong systematicity
    4.2.1 Strong systematicity defined
    4.2.2 A review of Connectionist models
    4.2.3 Summary of review
  4.3 Strong systematicity of representation
    4.3.1 Task: Auto-association of 2-tuples
    4.3.2 Feedforward network
    4.3.3 Simple recurrent network
    4.3.4 Summary of strong systematicity of representation
  4.4 Strong systematicity of inference
    4.4.1 Task: Querying 2-tuples
    4.4.2 Simple recurrent network
    4.4.3 Other first-order three-layer networks
    4.4.4 Architectural issues
  4.5 Summary and conclusion

5 The tensor-recurrent network
  5.1 Introduction
  5.2 Tensor representations
  5.3 Representational issues
    5.3.1 Component representations
    5.3.2 Role representations
    5.3.3 Cue representations
  5.4 The tensor-recurrent network
    5.4.1 Architecture
    5.4.2 Learning
  5.5 Evaluating the network
    5.5.1 Querying of 3-tuples task
    5.5.2 Simulations
    5.5.3 Results
    5.5.4 Discussion and analysis
  5.6 Summary and conclusions

6 Discussion
  6.1 Introduction
  6.2 Implications
    6.2.1 Same training and testing distribution assumption
    6.2.2 Structured networks
  6.3 Limitation and possible extension
  6.4 Relationship to the Classical paradigm
  6.5 Summary

7 Concluding remarks
  7.1 Summary of contribution
  7.2 Conclusions
  7.3 Further work

A Understanding as generalization

Bibliography
List of Figures

2.1 Systematicity: a computational level property.
2.2 Beam scattering analogy.
2.3 Sum to 5000 example.
2.4 Connectionist implementation of a look-up table.
3.1 A feedforward network for learning to represent 2-tuples.
3.2 A log-linear plot of the number of generalized and generalized plus virtual memories as a function of tuple order (N).
3.3 Feedforward network for the auto-association of N-tuples.
3.4 A log-log plot of the number of training examples (solid line) and total possible examples (dashed line) as a function of tuple order (N).
3.5 A log-log plot of the mean number of total weight updates (over 5 trials) as a function of tuple order (N).
3.6 Hyperplane orientation for an 8-2-8 encoder.
4.1 The feedforward network architecture for auto-association of 2-tuples.
4.2 Orientation of the Mary-agent hyperplane in hidden unit activation (internal representation) space.
4.3 Orientation of the Mary-patient hyperplane in hidden unit activation (internal representation) space.
4.4 Feedforward network architecture for auto-association task without weight tying (a) and with weight tying (b).
4.5 The simple recurrent network and the temporal version of the auto-association of 2-tuples task.
4.6 Generalization to second position as a function of the number of training patterns.
4.7 Principal components analysis of points in hidden unit activation space generated from training sequences at the first and second time steps.
4.8 Principal components analysis of points in hidden unit activation space generated from training sequences at the second time step.
4.9 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the second time step grouped on the basis of the first input object.
4.10 Construction of ordered pair representations by buffering input through the same set of dimensions.
4.11 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the first and second time steps grouped on the basis of the current input/output object.
4.12 Principal components analysis of points in hidden unit activation space generated from training sequences at the third time step.
4.13 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step grouped on the basis of the first input object.
4.14 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step grouped on the basis of the second input object.
4.15 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third and fourth time steps grouped on the basis of the current target output object.
4.16 Idealization of the buffer solution for the recurrent network in demonstrating strong systematicity of representation.
4.17 The simple recurrent network and the querying of 2-tuples task.
4.18 Error profile for the 8 hidden unit network over 10000 epochs of training for each trial.
4.19 First-order network.
4.20 Orientation of the hidden unit hyperplanes to extract the Mary component.
4.21 Orientation of the output unit hyperplane in the hidden unit activation space.
4.22 Weight and representation configuration resulting in weak systematicity of inference.
4.23 Two alternative recurrent network architectures: Jordan's recurrent network (a) and Pollack's recursive auto-associative memory (b).
4.24 Characterization of the representational organization sufficient for strong systematicity of inference.
5.1 A Connectionist implementation of the outer product (T) of two vector representations (V and W).
5.2 A Connectionist implementation of the inner product (Vout) of a tensor representation (T) and a vector representation (W).
5.3 The tensor-recurrent network architecture.
5.4 Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2).
5.5 Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2).
5.6 Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3).
5.7 Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3).
5.8 Mean training times as a function of overlap where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2).
5.9 Mean training times as a function of overlap where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3).
6.1 Characterization of the extraction of component representations from positions 1 and 2 for the three-layer networks in chapter 4 (i.e., with one hidden layer).
6.2 Characterization of the extraction of component representations from positions 1 and 2 for the tensor-recurrent network.
6.3 Characterization of the extraction of component representations from positions 1 and 2 for a network with two hidden layers linked by modifiable-weighted connections.
6.4 An example of the internal organization of a network, with two units per subspace in the first internal layer and two units for the common subspace in the second internal layer, learning the querying of 2-tuples task.
6.5 Number of components in the second position required in the training set to demonstrate strong systematicity.
6.6 Process of redescribing internal representations.
List of Tables

4.1 Summary of the performance of the simple recurrent network on the systematicity of inference task for networks with 20, 10 and 8 hidden units.
4.2 Summary of the degrees of systematicity of networks with respect to the systematicity of representation task (auto-association of 2-tuples), and the systematicity of inference task (querying of 2-tuples).
5.1 Order of activation for the tensor-recurrent network when given the sequence John, Mary, first.
5.2 Training and testing example distributions where nouns were omitted from the patient position in the case where there was between 1 and 4 nouns appearing in both positions.
5.3 Training and testing example distributions for the 1, 2 and 3 overlap cases, where nouns were omitted from both positions.
5.4 Orthogonality between agent, action and patient role vectors measured as one minus the magnitude of the normalized dot product.
5.5 Collinearity between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions measured as the magnitude of the normalized dot product.
5.6 Collinearity before training between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions measured as the magnitude of the normalized dot product.
Preface
Relationship to previous work
Some of the work presented in this thesis has appeared, in part, in a number of
published papers. The work presented in Phillips and Wiles (1993) forms the ba-
sis of chapter 3. The arguments, results and analysis presented in chapter 4 are a
significant expansion of the work that appeared in Phillips (1994b), and chapter 5
provides further analysis and a more detailed explanation of the network architec-
ture that appeared in Phillips (1994a). Also, the work presented in Phillips (1994c)
appears in Appendix A.
Two of the techniques used in this thesis have also appeared in previous work.
The block encoding technique, for faster training of auto-associators (chapter 3),
appeared in Bakker, Phillips and Wiles (1993, 1994), and an explanation of its
improvement in training time was based on Phillips (1993). The network analysis
tools used in chapter 4 appeared in Dennis and Phillips (1991).
Lastly, there are several works which helped provide an understanding of some
of the learning properties of feedforward and recurrent networks. They include, the
work on generalization with feedforward networks in Phillips, Wiles and Schwartz
(1991), Schwartz, Wiles, Gough and Phillips (1993), and Schwartz, Wiles and
Phillips (1994); and the learning capabilities on the simple recurrent network, in
Phillips (1992).
Acknowledgements
Throughout the years I have been supported by many people who have helped
either directly, or indirectly to produce this thesis. In particular, I thank my
supervisor Janet Wiles for all her support prior to and throughout my candidature;
for providing valuable discussion and criticism of the work in, and relating to, this
thesis, and for her constant and critical attention to the writing process.
I would also like to thank my colleagues: Simon Dennis, for many interesting
and critical discussions on this work; and Paul Bakker, not only for his help and
comments on this work, but also in regard to frequent flyer miles.
During my candidature, I have had many useful discussions with researchers
who have shared their time and energy to comment and discuss some of the issues
relating to this thesis. For their help and support, I would like to thank: Glenda
Andrews, Peter Bartlett, Anthony Bloesch, Kerry Chalmers, Bob Colomb, Graeme
Halford, Mike Humphreys, Danny Latimer, Ray Lister, David Lovell, Hanna Ma-
jewski, Michael Norris, Helen Purchase, Gordon Rose, Guy Smith, Kate Stevens,
Julie Stewart and Steven Young.
I would also like to thank Jeff Elman and his research group for providing the
Tlearn simulator, which was used for the simulations in chapters 3 and 4; and
Geoffrey Hinton and his research group for providing and maintaining the Xerion
simulator, which was used for the simulations in chapter 5.
I am also extremely grateful for the financial support provided by the University
of Queensland and the Department of Computer Science. This work was supported
by a University of Queensland Postgraduate Research Scholarship, and a Depart-
ment of Computer Science Postgraduate Research Scholarship. I am also extremely
grateful for the excellent facilities provided by the Department of Computer Sci-
ence in the form of equipment and support staff, and for financial assistance that
has allowed me to attend a number of national and international conferences.
Of course, much help and support was given before I even started this work. I
would like to thank my parents for all their support through the years.
Finally, I thank Michiko and Mizuki, for their undying patience and love.
Chapter 1
Introduction
In 1986, a small group of researchers put together two volumes on parallel dis-
tributed processing models of cognition (Rumelhart, McClelland, & the PDP re-
search group, 1986; McClelland, Rumelhart, & the PDP research group, 1986)
which, as it turned out, repopularized a movement in the study of human and
machine intelligence called Connectionism.
Connectionist models are typically characterized as collections of interconnected
processing units, each specified by some local level process or function (e.g., sig-
moid, see Rumelhart, Hinton, & Williams, 1986), which together exhibit some
global level behaviour (e.g., hand-written character recognition; LeCun, Boser, &
Denker, 1989). The analogy, of course, is taken from the human brain, which is
similarly organized, hence the term Artificial Neural Networks.[1]
[1] The term Connectionism as used here refers to neural network research activities directed towards cognitive behaviour (e.g., perception, language, analogy), rather than the more widely used term Artificial Neural Networks (or simply, Neural Networks), referring to research activities directed at a wider range of engineering applications (including, for example, plant control, stock market prediction, chemical analysis, automatic piloting).
Since its resurgence in popularity many advocates have proclaimed Connec-
tionism as the new paradigm for cognitive modeling, promising not only models
of cognitive behaviour, but also models of cognitive development. For example,
Rumelhart and McClelland (1986) have presented a Connectionist network that
not only performs the past-tense mapping of English verbs, but also learns this
behaviour in a qualitatively similar manner as children.[2]
Of central importance to a paradigm or framework for cognition is the ability
to provide models that demonstrate such properties as systematicity. Briefly, sys-
tematicity is the property of human cognition whereby the ability to represent and
process structured objects (e.g., John loves Mary) is accompanied by an ability to
represent and process other objects conforming to the same structure[3] (e.g., Mary
loves John). That is, cognitive capacities are grouped on the basis of the structural
similarity of the objects on which they operate.
It is generally accepted that human cognition is systematic, at least to some
significant degree. In general, one does not, for example, observe people capable of
inferring that John went to the store given that John and Mary went to the store,
yet unable to infer the structurally similar case of Mary went to the store given that
Mary and John went to the store. What has not been accepted is the implication
that the property of systematicity holds for cognitive architecture (i.e., the set of
basic data structures and processes employed in cognitive behaviour).
Fodor and Pylyshyn (1988) have argued that systematicity necessitates struc-
tured representations (symbol structures) and processes that are sensitive to the
structure of those representations. In short, the so-called Classical cognitive archi-
tecture is a symbol system. A characteristic property of cognition that is posited
in a Classical cognitive architecture is that the tokening (inscription) of represen-
tations of compositional objects necessarily entails the tokening of representations
of all component objects.
[2] Although the accuracy of this work as a model of the acquisition of the past tense has been strongly criticized (Pinker & Prince, 1988; Marcus, Brinkmann, Clahsen, Wiese, Woest, & Pinker, 1993).
[3] The term structure is taken to mean some invariant property of the relationship between components of complex objects. For example, in chemistry, two types of molecules, n-Pentane and Neopentane, are constructed from the same type and number of atoms (i.e., five carbon atoms and twelve hydrogen atoms), yet they differ structurally due to the difference in the spatial relationship between their component atoms. The structure of an n-Pentane molecule, the spatial relationship between its component atoms, is CH3-CH2-CH2-CH2-CH3. In contrast, the structure of a Neopentane molecule is a central carbon atom bonded to four CH3 groups, C(CH3)4 (Brown & LeMay, 1981). In the context of representations, structure is regarded as some invariant property of the relationship between representational components.
Smolensky (1987b, 1987a, 1990, 1991) and van Gelder (1990) have argued for
a more general Connectionist notion of compositionality[4] based on distributed
(vector) representations and processes. In a Connectionist cognitive architecture,
representations of compositional objects are vectors distributed over a number of
processing units. Connectionist representations are non-Classical in that the rep-
resentation of a compositional object does not necessarily entail representations
of the object's components. Thus, in this respect Connectionism, it is claimed, is
an alternative to the Classical conception of cognitive architecture. For example,
Elman's (1990) simple recurrent network is capable of constructing a single vector
representation of a sentence where the representations of individual words do not
\appear"5 in the representation of the sentence in which they are contained. Yet,
the network is able to operate, often termed wholistically, on the vector represen-
tation of the sentence to extract word-level information such as plurality.
[4] That is, the method by which a representation of a compositional object is constructed from the representations of its components.
[5] In the sense that no portion of the sentence vector equals any vector representation of its components. For example, the vector A = (3, 2) does not appear in the vector B = (9, 1, 7).
Fodor and McLaughlin (1990) have rejected the so-called Connectionist alterna-
tive concept of compositionality on the basis that either: (1) vector representations
are simply points in space with no internal structure, and therefore cannot account
for systematicity since component representations of complex objects are not ac-
cessible to other processes; or (2) component representations are accessible by to-
kening (i.e., inscribing) a vector representation of each component object whenever
the complex representation is inscribed, in which case, the Connectionist architec-
ture implements a Classical architecture. Thus, Fodor and McLaughlin concluded,
Connectionism is at best an implementation framework for Classical cognitive ar-
chitectures, and consequently, has nothing to offer in the way of alternative theories
of cognition.
This conclusion is, however, far from unanimously accepted. Furthermore, it
is not clear as to the grounds on which Fodor and McLaughlin rejected the Con-
nectionist proposals. On the one hand they accept the possibility of Connectionist
architectures exhibiting systematicity without resorting to Classical composition-
ality (p. 202), yet on the other hand they reject such possibilities on the grounds
that it is not sufficient to show how systematicity is possible given the assumptions
of a (Connectionist) architecture, one must also show how systematicity is neces-
sary given those assumptions (p. 202). The confusion arises because they do not
specify a criterion by which a model is said to necessarily exhibit systematicity.
Given the lack of consensus over the conclusion that Connectionism is at best
an implementation framework, an appropriate question to ask is:
• Can Connectionism provide non-Classical models (i.e., ones that do not rely
on symbolic representations and processes) that exhibit systematicity?
After reviewing, in detail, the Connectionist/Classicist debate with respect to
this question, in chapter 2, it is concluded that: Connectionist architectures can-
not exhibit systematicity without implementing Classical compositionality.
However, it is also argued that the limitation of Fodor and Pylyshyn's thesis,
and subsequently, Fodor and McLaughlin's rebuttal is that their criterion for sys-
tematicity is grounded solely in terms of computational capacity (i.e., in terms
of the functions computable by an architecture). Consequently, systematicity is
realizable by many computationally sufficient architectures, including the possible
Connectionist alternatives. Connectionism, however, is also concerned with the
modeling of cognitive development. Thus, it is argued, a \potential" contribution
of Connectionism to cognitive theory is in an explanation for the necessary acquisi-
tion of systematic behaviour. Therefore, the problem that systematicity poses for
Connectionism, and the question of primary concern in this thesis is:
• Can Connectionism provide models that exhibit the necessary acquisition of
systematic behaviour?
The approach taken in an attempt to address this question is to: first, specify
a suitable criterion by which a model is said to necessarily acquire systematic
behaviour; and second, evaluate Connectionist models with respect to this criterion
on learning tasks designed to test their capacity to acquire systematic behaviour.
The first criterion, taken from computational learning theory (Valiant, 1984),
is based on the amount of computational resource required to learn a systematic
behaviour to a high degree of accuracy. In chapter 3, a Connectionist architecture
called the feedforward network is shown, by analysis and simulation, to meet this
criterion on a task designed to test systematicity.
However, it is then suggested that this criterion may be too weak in that the
amount of resource required is more than that required by people. Consequently,
a second criterion is used, called strong systematicity (Hadley, 1993), which is
based on linguistic evidence of generalization over structurally similar sentences.
In chapter 4, it is shown, by analysis, that the same network cannot exhibit the
acquisition of systematicity as de�ned by this criterion on the same task. Strong
systematicity is then evaluated on a number of other models based on a recurrent
network architecture. Briefly, none of the models examined demonstrate strong
systematicity with respect to an inference task whereby the model is required to
infer components from complex objects.
The lack of strong systematicity in the models examined prompted the design
of a further architecture, called the tensor-recurrent network, specifically to ad-
dress the issue of the necessary acquisition of systematicity. In chapter 5, the
tensor-recurrent network is presented and shown, by simulation, to exhibit strong
systematicity with respect to the inference task.
A discussion of these results and their relationship to the Classical paradigm is
given in chapter 6. Finally, conclusions are drawn, and further work is suggested
in chapter 7.
1.1 Statement of thesis
Having reviewed the Classicist and Connectionist arguments relating to system-
aticity, and having tested and analyzed a number of Connectionist architectures
for their capacity to acquire systematic behaviour, the two main conclusions of this
thesis are:
1. Connectionist models cannot demonstrate systematicity without implement-
ing some form of symbolic representations and processes.
2. Connectionism can provide models, as exempli�ed by the tensor-recurrent
network, with architectural properties that are sufficient for the necessary
acquisition of systematic behaviour from, in part, non-symbolic processes.
Chapter 2
Systematicity: What is the
problem?
2.1 Introduction
As mentioned in chapter 1, Fodor and Pylyshyn have used the systematicity prop-
erty to argue against the Connectionist approach to cognitive modeling. Yet, as
also pointed out, it is not clear that systematicity cannot be addressed within
the Connectionist framework. The purpose of this chapter is to establish the cen-
tral problem that systematicity poses for the Connectionist approach to cognition;
and subsequently, the question that this thesis attempts to address. Therefore,
an appropriate place to begin is with a review of the Classicist/Connectionist de-
bate over the implication of systematicity to cognitive modeling with the following
question in mind: Can Connectionism provide non-Classical models that exhibit
systematicity?
In section 2, the Classical explanation for systematicity is presented. The Clas-
sical view is that to exhibit systematicity a cognitive architecture must have two
properties: (1) structured representations (symbol structures); and (2) processes
that are sensitive to the structure of those representations. The claim that Con-
nectionism provides an alternative view of cognition turns at one point on an
alternative explanation for systematicity; one that does not rely on the Classical
notion of compositionality (i.e., the tokening or inscription of component represen-
tations whenever complex representations are tokened). In section 3, I argue that
Connectionist models cannot exhibit systematicity without implementing Classical
compositionality. However, I also argue that the direction along which Connection-
ism \potentially" extends the Classical theory is in the acquisition of systematic
behaviour. The question then arises: Can Connectionism provide models that ex-
hibit the necessary acquisition (learning) of systematic behaviour? This question,
which is the central concern of this thesis, and the method used to address it are
given in section 4. Finally, a summary is provided in section 5.
2.2 Systematicity and its Classical explanation
Systematicity is a property of cognitive beings in that their capacity for certain
cognitive behaviours is grouped on the basis of the structural similarity of these
behaviours (Fodor & Pylyshyn, 1988). Fodor and Pylyshyn distinguish two aspects
of systematicity: systematicity of representation and systematicity of inference.
Systematicity of representation is the property that, if one can represent a
structured object (e.g., John loves Mary[1]), then one also has the ability to represent
the structurally similar object Mary loves John.
[1] The italicized word is used to refer to the object, not its representation.
Similarly, systematicity of inference is the property that, if one can infer, for
example, that Tom went to the store from the sentence Sue and Tom went to the
store, then one also has the ability to infer that Sue went to the store from the
sentence Tom and Sue went to the store.
The ability to make such generalizations is based on a similarity of structure
between compositional objects (i.e., objects that are composed of other simpler
objects). Thus, systematicity is a property at the level of compositional objects.
For example, in the domain of scene description, it is reasonable to expect
that people are able to verbalize the word "cat" on seeing the object cat, but
without further experience are unable to verbalize the word "platypus" on seeing
the object platypus. This is because, in this case, the relationship between one
object and its label (name) is independent of the relationship between the other
object and its associated label. In other words, the two objects are atomic (no
further structure) with respect to their labels.[2] However, what one does not
find, in general, are people capable of verbalizing the sentence "the cat chases
the dog" on seeing the compositional object cat chases dog, but unable to
verbalize the sentence "the dog chases the cat" on seeing the compositional
object dog chases cat. This is because, in the compositional case, there is a
relationship between the two objects and their corresponding labels (i.e., the
structure agent-action-patient). This point has
an important consequence when deciding on appropriate external representations
for atomic objects and will be returned to when examining some Connectionist
attempts to demonstrate systematicity.
At this point it is important to identify one misconception regarding system-
aticity. I use an example[3] from the reading-aloud domain to make my point. In
English, the \w" component of the word \sweet" is vocalized, however, in the word
\sword" the \w" component is silent. This inconsistency is not a case for the lack
of systematicity of human cognition. It is a case for the lack of systematicity in one
human cognitive domain (namely, correct pronunciation of English words). The
fact that people can, upon request, produce the \w" sound in the word \sword"
(i.e., by producing the \s" sound followed by the \word" sound) is an example of
the systematicity of human cognitive capacities. Another example concerns the
interchangeability of nouns in the agent and patient positions. For example, peo-
ple capable of saying the sentence \John ate the cheese" are unlikely to be heard
saying, \the cheese ate John". Again, the lack of apparent systematicity in this
domain does not preclude people from understanding the sentence in the sense that
they could answer questions like: \What did the cheese eat?"; or \Who ate John?".
Although the extent to which cognition is systematic is arguable, at least in the
linguistic domain, there is ample evidence of systematicity in adults (see Hadley,
1993, for a review) and for that matter perhaps even children (Pinker, 1984).
[2] Note that the term atomic is a relative one. The same objects would be considered structured when labeled as "quadrupeds", in which case one may expect generalization from one object to the other based on their limb component objects.
[3] Presented in a seminar on word recognition by Sally Andrews.
That systematicity is a significant property of cognitive behaviour is, in general,
not disputed by Connectionists. What is contended is the implication of system-
aticity in terms of a cognitive architecture (i.e., the basic algorithms and data
structures of a cognitive system utilized in the performance of cognitive tasks).
Or in other words, it is agreed that cognitive behaviour is significantly systematic
at the computational level, but disputed as to what systematicity implies at the
algorithmic level.[4]
The distinction between the computational and algorithmic levels is illustrated
in the following example. Consider a function called minimum that returns the
smallest number from a set of numbers. The function can be described at the
computational level as: minimum(S) = e, where e ≤ e_i for all e_i ∈ S. The function
minimum takes a set of numbers S and returns the smallest number e from the set.
At the algorithmic level, an algorithm to compute the function could be de-
scribed as: removing an element (e) from the set (S), leaving a new set (S′), and
returning either the element (e), if the new set (S′) is empty, or the smaller of e
and the minimum of S′.
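As a concrete illustration (my sketch; the thesis gives no code, and Python is an arbitrary choice of language here), the docstring below restates the computational-level specification, while the body is one of many possible algorithms that satisfies it:

```python
def minimum(s):
    """Computational level: minimum(S) = e, where e <= e_i for all e_i in S."""
    s = set(s)
    e = s.pop()                   # remove an element e, leaving a new set S'
    if not s:                     # if S' is empty, return e
        return e
    return min(e, minimum(s))     # otherwise, the smaller of e and minimum(S')

assert minimum({7, 3, 9}) == 3
```

Any other algorithm with the same input-output profile (say, sorting the set and taking the head) would compute the same computational-level function.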
The value of a computational level description or specification of some behaviour
is that it constrains the possible algorithms that can implement such behaviour.
The value of systematicity as a computational level property is that it constrains
the class of possible cognitive architectures to those that exhibit this property. For
example, a cognitive architecture should be capable of exhibiting the behaviour
illustrated in Figure 2.1. The contention is over what characterizes architectures
that exhibit such systematic behaviour. The predominant view in cognitive science,
what Fodor and Pylyshyn (1988) and Fodor and McLaughlin (1990) term Classi-
cism, is that to support such properties as systematicity a cognitive architecture
must make use of:
1. a combinatorial syntax (or structured representations); and
2. processes that are sensitive to the structure of those representations (Fodor
& Pylyshyn, 1988; Fodor & McLaughlin, 1990).
[4] Referring to Marr's (1982) levels of analysis where the computational level is concerned with what is being computed, and the algorithmic level is concerned with how it is being computed. There is also an implementation level concerned with how the algorithmic level is physically realized.
[Figure 2.1: Systematicity: a computational level property. The inputs "1. John chased Mary. Who chased?" and "2. Mary chased John. Who chased?" pass through the cognitive architecture to produce the outputs "1. John" and "2. Mary", respectively.]
A combinatorial syntax is a symbol-level description (i.e., independent of the
physical realization of the symbol) of the relationship between representations of
compositional objects (called complex representations) and the representations of
their component objects (called component representations, or constituents[5]). For
example, in the list of syntactic relations:
S  → N Vt N
N  → John
N  → Mary
Vt → chases,
\S" denotes the complex representation of a compositional object consisting of
three components: an action, an agent (performer of the action), and a patient
(recipient of the action); \N" denotes the constituent representation of the agent
and the patient; and \Vt" denotes the constituent representation of the action.
The symbol \N" has as a possible component the symbol \John" or \Mary", and
the symbol \Vt" has as a component the symbol \chases".
In this example world, all four compositional objects can be represented by
making all four possible expansions as defined by the syntax rules. Thus, system-
aticity of representation is captured by the combinatorial nature of the rules of
syntax. Since each symbol is replaceable by any one of its possible constituent
symbols, all possible representations are realizable.
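As a sketch of this combinatorial property (mine, not the thesis's), expanding every nonterminal by each of its alternatives mechanically yields all four complex representations:

```python
from itertools import product

# The syntactic relations above: S -> N Vt N; N -> John | Mary; Vt -> chases.
rules = {"N": ["John", "Mary"], "Vt": ["chases"]}
s_rule = ["N", "Vt", "N"]

# One expansion of S per combination of constituent symbols.
expansions = [" ".join(choice) for choice in product(*(rules[sym] for sym in s_rule))]
print(expansions)
# ['John chases John', 'John chases Mary', 'Mary chases John', 'Mary chases Mary']
```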
The need for a combinatorial syntax constrains the way in which a cognitive
system puts these representations together. It says that the processes for con-
structing representations are such that: firstly, there is a distinction between struc-
turally atomic (e.g., "N", "John") and structurally molecular (e.g., "S") repre-
sentations; secondly, a representation of a compositional object has, as its parts,
representations of the object's components; and thirdly, the relationship between
an object's components is captured by the relationship between their correspond-
ing constituents. For example, the agent component is identifiable as it is always
the first constituent symbol. The particular symbol order is not important, only
that the order is used consistently. By mirroring the structural relations in ob-
jects with analogous structural relations in their corresponding representations, a
combinatorial semantics is captured by a combinatorial syntax.
In addition to a combinatorial syntax, the Classical view is that a cognitive
architecture must also have processes that are sensitive to the way in which these
representations are structured. Structure sensitive processes manipulate complex
representations on the basis of their constituency relations, not on the basis of par-
ticular constituents. In the John-chases-Mary example, the agent of the action is
always the first constituent symbol of the compositional object's complex represen-
tation. Consequently, regardless of whether the agent is John or Mary, a process
that can determine the first constituent symbol from a complex representation can
do so for all compositional objects in this domain.
Together, structured representations and processes that are sensitive to the
structure of those representations are two constraints on cognitive architecture
that explain the systematicity property as exemplified in Figure 2.1.
Before presenting arguments for and against the Connectionist alternatives, it
should be pointed out that the Classicist view is not a statement about the phys-
ical realization of these representations and processes; it is a statement about the
relationships between their physical realizations. The physical structural relation-
ships of the brain correspond to symbolic structural relationships of the mind. For
example, the symbols John, Mary, and chases may be physically realized as John,
Mary, and chases respectively; or alternatively, 1 0 1 1, 0 1 1 0, and 0 1 0 1
respectively. Additionally, the process for combining these atomic representations
may be concatenation. In which case, the complex symbol "John chases Mary"
may be physically instantiated as John chases Mary, or as 1 0 1 1 0 1 0 1 0 1 1 0.
As well, the associated structure sensitive process for extracting the doer of the
action would be one that scans for the first space, in the first instance; and one
that counts the first four digits in the second instance.
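To make the two physical instantiations concrete, here is a sketch (mine) of one structure-sensitive process, "extract the agent", realized once for each encoding. It works for every complex representation in the domain because it depends only on the constituency relation (first position), never on the particular constituent:

```python
# Realization 1: symbols as words, complex symbols formed by concatenation
# with spaces; the agent is everything before the first space.
def agent_from_words(rep: str) -> str:
    return rep[:rep.index(" ")]

# Realization 2: symbols as 4-digit codes (John = 1011, Mary = 0110,
# chases = 0101), concatenated; the agent is the first four digits.
def agent_from_digits(rep: str) -> str:
    return rep[:4]

assert agent_from_words("John chases Mary") == "John"
assert agent_from_words("Mary chases John") == "Mary"
assert agent_from_digits("101101010110") == "1011"   # "John chases Mary"
```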
Structured representations and structure sensitive processes make up what is
called a symbol system, and its physical realization is called the physical sym-
bol system (Newell, 1980). It is the symbol system and its distinction from its
physical implementation that characterizes Classical cognitivism (Pylyshyn, 1980).
The Classicist's approach to understanding cognitive behaviour is determining the
nature of these symbol structures and processes.
2.3 Is there a Connectionist alternative?
Connectionism, at least from a cursory glance, appears to offer a radically different
explanation of cognitive behaviour: An explanation where cognitive behaviour is
seen as an emergent property of the interaction of simple units governed by numeric
equations. However, for Connectionism to be a successful approach to cognition
it must ultimately explain systematicity. It is because people are systematic that
their capacities extend to such a vast (combinatorial) number of examples. A frame-
work that cannot address systematicity is unlikely to scale from small experimental
domains to real-world human domains.
The focus of architectural design on local-level interactions between simple pro-
cessing units is not unique to Connectionism. A similar approach has been used by
Brooks (1991a, 1991b), in what is called a subsumption architecture, in the design
of \intelligent" robots. Like Connectionism, Brooks's approach has also been crit-
icized for its scalability (Kirsh, 1991a). Thus, systematicity is not only important
for cognitive modeling, but it is also of practical significance.
To be an interesting approach, Connectionism must offer something that Clas-
sicism does not. Consequently, the first concern is whether or not Connectionism
can offer an explanation for systematicity that does not rely on structured repre-
sentations and structure sensitive processes.
2.3.1 The horns of Fodor
Fodor and Pylyshyn's position is that to explain systematicity it is necessary to
posit structured representations, and processes that are sensitive to the structure
of those representations. Furthermore, they argued that Connectionism fails to
provide an alternative theory of cognition because either
• connectionist models do not use structured representations and structure sen-
sitive processes, in which case, they cannot demonstrate systematicity (sys-
tematicity horn); or,
• connectionist models do use structured representations and structure sensitive
processes to implement systematicity, in which case, Connectionism is an
implementation theory of Classical cognitivism (implementation horn) (Fodor
& Pylyshyn, 1988; Fodor & McLaughlin, 1990).
At this point it should be emphasized that Fodor and Pylyshyn fully endorse the
possibility of Connectionist networks that demonstrate systematicity. And, clearly,
given the result that there exists a Connectionist network with a finite number of
processing units that implements a universal Turing machine (Siegelman & Son-
tag, 1991), it must be possible to construct a network that exhibits systematicity.
However, this result alone is not a refutation of Fodor and Pylyshyn's thesis (as has
been argued by Chalmers (1990b, 1993)), because one must also avoid the imple-
mentation horn of their argument. (See, also, Butler, 1993, for a similar comment
on Chalmers.)
2.3.2 Weak and strong compositionality
The principal defendant of the Connectionist position in this debate over the import
of Connectionism has been Smolensky (1987b, 1987a, 1990, 1991). Smolensky's
position is to provide two alternative ways in which Connectionist models make use
of compositional representations. They are microfeatures and the tensor product,
which Smolensky (1991) labels as weak and strong compositionality, respectively.
Weak compositionality is the representation of objects as a collection of
microfeatures, where each microfeature is a unit that becomes activated only in
the presence of some feature (component) of an object. For example, the complex
object John chased Bill can be represented as the set of active units {John,
chased, Bill}. The structurally similar object Bill chased John would be represented
by the set {Bill, chased, John}. However, as sets do not preserve order, both objects
have the same representation. Consequently, it is not possible to uniquely determine,
for example, who is the agent.
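The order-blindness of a pure microfeature encoding can be seen in a two-line sketch (my illustration):

```python
# Weak compositionality: a complex object as the set of its active microfeature units.
john_chased_bill = frozenset({"John", "chased", "Bill"})
bill_chased_john = frozenset({"Bill", "chased", "John"})

# Sets do not preserve order, so the two structurally distinct objects
# receive identical representations; the agent cannot be recovered.
assert john_chased_bill == bill_chased_john
```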
A system that represents complex objects by representing only the components
without regard for their structural relationship is inadequate on the basis that it
fails to differentiate objects with the same components but different structural rela-
tionships. This example illustrates why Fodor and Pylyshyn talk of two necessary
dimensions in a representational space: one to record the presence or absence of a
component; the other to record its relationship to other components.
Alternatively, one could augment the class of feature detectors with specific
John-agent, John-patient, Bill-agent and Bill-patient units. In this case, the two
objects can be represented as {John-agent, chased, Bill-patient} and {Bill-agent,
chased, John-patient}, respectively. Now, suppose another complex object, Mary
chased Mark, is represented at the same time, resulting in the set of activated
units {Bill-agent, chased, John-patient, Mary-agent, Mark-patient}. At this point
it is not possible to de-
termine whether \Bill chased Mark", or \Mary chased Mark"; or whether \Bill
chased John", or \Mary chased John". Again, the class of feature detectors could
be augmented to include, for example, the Bill-chased-John unit. However, the
problem with this scheme, as Fodor and Pylyshyn have correctly pointed out, is
that the number of feature-detecting units grows rapidly with the complexity of
the object. In fact, considering n-ary relations, the number of units required to
uniquely represent all possible relations between k possible objects is: kn (i.e., the
number of units grows exponentially with the arity of the relation).
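To see how quickly this scheme becomes infeasible, a quick calculation suffices (a hypothetical illustration of mine, not from the thesis):

```python
# Number of dedicated feature-detecting units, k**n, for k = 26 objects.
k = 26
for n in (1, 2, 3, 4):
    print(n, k**n)   # 26, 676, 17576, 456976: exponential in the arity n
```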
Clark (1991) suggested that the number of microfeature units could be reduced by utilizing existing microfeatures from the component objects. For example, in the complex object John hit Bill an existing microfeature could be "tears"[6], from which the system could deduce that Bill was the recipient of the action. However, such features must still be bound to Bill, and so there needs to be available one unit for every possible component object. Furthermore, it is not clear that such information will always be available to differentiate possible constituents, particularly if complex objects are communicated verbally rather than visually. I conclude that such a scheme either does not avoid the explosion of microfeature units, or does not exhibit systematicity.
The alternative to using sets of purely atomic features is to allow individual features to themselves be sets. For example, the object John chases Bill could be represented as the set {{John, agent}, chases, {Bill, patient}}, which is distinct from the set {{Bill, agent}, chases, {John, patient}} representing the object Bill chases John. However, this scheme is characteristically Classical in that there is a distinction between atomic and molecular representations, and the constituents are tokened wherever the complex representation is tokened. Consequently, as Fodor
[6] Extra visual information arising from the fact that being hit by John caused Bill to cry.
and Pylyshyn pointed out, this scheme does not avoid the implementation horn of
the argument.
Weak compositionality, I conclude, fails to avoid the two horns of the argument. Either all complex objects of a particular structure cannot be (uniquely) represented (systematicity horn); or the representation of all complex objects requires Classical compositionality, the tokening of constituents with the tokening of complex representations (implementation horn).
Strong compositionality is the representation of complex objects by tensors. In Smolensky's case, the tensor representation is constructed as the sum of the outer products of each component-role representation pair. For example, the object John chases Bill could be represented by the tensor construction:

$\vec{T}_{John\ chases\ Bill} = \vec{V}_{John} \otimes \vec{R}_{agent} + \vec{V}_{chases} \otimes \vec{R}_{action} + \vec{V}_{Bill} \otimes \vec{R}_{patient}$

where $\vec{T}_{John\ chases\ Bill}$ is the tensor representation of the object John chases Bill; $\vec{V}_{John}$, $\vec{V}_{chases}$ and $\vec{V}_{Bill}$ are the vector representations of the component objects John, chases and Bill, respectively; and $\vec{R}_{agent}$, $\vec{R}_{action}$ and $\vec{R}_{patient}$ are the vector representations of each component object's respective role.
Alternatively, complex objects can be represented as the running outer product of each component object representation (Halford, Wilson, Guo, Gayler, Wiles, & Stewart, 1994). In this case, the object John chases Bill is constructed as:

$\vec{T}_{John\ chases\ Bill} = \vec{V}_{John} \otimes \vec{V}_{chases} \otimes \vec{V}_{Bill}$.
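Both constructions are straightforward to realize. The sketch below is my illustration, not the thesis's; the component and role vectors are arbitrary choices.

```python
# A minimal sketch of the two tensor constructions, with made-up vectors.
import numpy as np

V = {"John": np.array([3.0, 1.0, 4.0]),
     "chases": np.array([2.0, 3.0, 1.0]),
     "Bill": np.array([1.0, 5.0, 9.0])}
R = {"agent": np.array([-2.0, 3.0]),
     "action": np.array([1.0, 1.0]),
     "patient": np.array([3.0, 2.0])}

# Smolensky: sum of component-role outer products (a rank-2 tensor).
T_smolensky = (np.outer(V["John"], R["agent"])
               + np.outer(V["chases"], R["action"])
               + np.outer(V["Bill"], R["patient"]))

# Halford et al.: running outer product of the components (a rank-3 tensor).
T_halford = np.einsum("i,j,k->ijk", V["John"], V["chases"], V["Bill"])

print(T_smolensky.shape)  # (3, 2)
print(T_halford.shape)    # (3, 3, 3)
```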
Tensor representations were intended by Smolensky as an example of Connectionist representations that are non-Classical, and therefore non-implementational. Tensors "appear" to offer a form of non-Classical compositionality in the sense that constituents do not occur within the complex representation. For example, given the constituent-role pairs $\vec{V}_1 = (3\ 1\ 4)$, $\vec{R}_1 = (-2\ 3)$ and $\vec{V}_2 = (2\ 3\ 1)$, $\vec{R}_2 = (3\ 2)$, the resulting complex representation:

$\vec{V}_1 \otimes \vec{R}_1 + \vec{V}_2 \otimes \vec{R}_2 = \begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix} (-2\ \ 3) + \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix} (3\ \ 2) \qquad (2.1)$

$= \begin{pmatrix} 0 & 13 \\ 7 & 9 \\ -5 & 14 \end{pmatrix} \qquad (2.3)$

does not contain in any one of its rows or columns a vector that equals any one of the constituent vectors from which it was constructed. However, does this compositionality scheme account for systematicity?
The problem with tensor constructions, as Fodor and McLaughlin (1990) have pointed out, is that there is potentially an infinite number of constituent combinations that result in the same tensor representation, and therefore it is not possible to determine which combination was used to construct the tensor representation. To see the force of this point, suppose a system has represented as a tensor a complex object consisting of two component objects, and the system is required to extract one of its constituents for further processing. That is,

$\vec{a} \otimes \vec{b} = \vec{T}_{a,b}$

where $\vec{a}$ and $\vec{b}$ are the vector representations of the first and second component objects, respectively. Now, suppose two different objects have vector representations $\vec{a}' = \frac{1}{k}\vec{a}$ and $\vec{b}' = k\vec{b}$, where $k \neq 0, 1$. The corresponding representation of the complex object that is composed of these two objects is

$\vec{a}' \otimes \vec{b}' = \frac{1}{k}\,k\;\vec{a} \otimes \vec{b} = \vec{T}_{a,b}.$
Since one complex representation stands for two different complex objects, it is not possible to uniquely determine the component objects. Furthermore, in general, there are an infinite number of possible decompositions of the same complex representation. Thus, the value of the vector (tensor), in itself, provides insufficient information to determine which of the possibly infinite combinations of components were the actual components. In this case, the tensor does not allow for systematicity, because it does not give unambiguous access to its components. Therefore it fails to avoid the systematicity horn.
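This non-uniqueness is easy to verify numerically; the following check (mine, not the thesis's) rescales the constituents of the earlier example and recovers the identical tensor.

```python
# Rescaled constituents a' = a/k, b' = k*b yield the same outer product.
import numpy as np

a = np.array([3.0, 1.0, 4.0])
b = np.array([-2.0, 3.0])
k = 2.5  # any k != 0, 1 works

T = np.outer(a, b)
T_prime = np.outer(a / k, k * b)

print(np.allclose(T, T_prime))  # True: two decompositions, one tensor
```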
The alternative is to provide one of the constituent vectors to a process that
performs the inner product of the constituent with the tensor to extract the other
constituent vector. However, in doing so, one must maintain the representation of
the constituent whenever one is processing the complex representation (i.e., one
must token the constituent every time one tokens the complex representation).
Therefore, this alternative fails to avoid the implementation horn of the argument.
In both weak and strong compositionality, Smolensky has presented compo-
sitionality schemes that are non-Classical in that the representations of complex
objects are tokened without having to token representations of their component ob-
jects. However, both schemes fail to explain systematicity because the components
are no longer uniquely determinable.
2.3.3 The concatenative/functional distinction
The tensor solution is an example of what is generally regarded as structured representations without the necessity for them to be symbol structures. Van Gelder (1990) characterizes this distinction as the difference between concatenative and functional modes of compositionality, that is, modes of combining representations of component objects to form a new representation standing for the corresponding compositional object. Furthermore, he argues that, because Connectionism is committed to a functional compositionality scheme, it is an alternative theoretical framework to Classical cognitivism. Before examining this claim, the two modes of compositionality are described.
Concatenative compositionality is the spatial or temporal juxtaposition of representational components to form complex representations. For example, consider the complex object Tom chased Sue, which is composed of the atomic objects Tom, chased, and Sue. Now, suppose these atomic objects have the representations Tom, chased, and Sue, respectively. Then a corresponding concatenative compositional representation may be Tom chased Sue. The point to note is that the
constituent (e.g., Tom) remains unchanged within the complex representation.
Functional compositionality is a more general form of compositionality whereby the representation of a complex object is the result of some function that takes the constituent representations as its arguments. Consider again the Tom chased Sue example. This time suppose the objects Tom, chased, and Sue are represented by the numbers 1, 3, and 5, respectively. A possible functional composition of these numbers is 267, under the function $f(x, y, z) = 7^0 x + 7^1 y + 7^2 z$, where x, y and z are the numeric representations of the three components. Constituents can be uniquely extracted by the functions:

$f_1(R) = R \bmod 7 = x$
$f_2(R) = (R\ \mathrm{div}\ 7) \bmod 7 = y$
$f_3(R) = R\ \mathrm{div}\ 7^2 = z$
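A minimal sketch of this scheme (my illustration; the thesis gives only the equations) shows composition and extraction working as advertised, provided each component code is less than 7:

```python
# Base-7 functional composition and its extraction functions.
def f(x, y, z):
    """Compose three component codes (each < 7) into one number."""
    return 7**0 * x + 7**1 * y + 7**2 * z

def f1(r): return r % 7          # first constituent
def f2(r): return (r // 7) % 7   # second constituent
def f3(r): return r // 7**2      # third constituent

tom, chased, sue = 1, 3, 5
r = f(tom, chased, sue)
print(r)                    # 267
print(f1(r), f2(r), f3(r))  # 1 3 5
```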
Although functional compositionality contains concatenative compositionality
as a special case, and is therefore a more general conception of compositionality,
it is tangential to the Classicist/Connectionist debate for two reasons. Firstly,
Classicism was never committed to a concatenative style of compositionality.
\A suitable de�nition [of a physical instantiation mapping (F)] might
contain the statement that for any expressions P and Q, F[P&Q] =
B(F[P ],F[Q]), where the function B speci�es the physical relation that
holds between physical states F[P ] and F[Q]. Here the property B serves
to physically encode, (or `instantiate') the relation that holds between
the expressions P and Q on the one hand, and the expressions P&Q
on the other."
(Fodor and Pylyshyn, 1988: p14)
Thus, the function B is exactly what van Gelder calls a functional mode of compositionality, taking the component representations F[P] and F[Q] as arguments and returning a representation of the compositional object P&Q. The point is that the Classical picture leaves the function B unspecified and is therefore neutral as to the mode of compositionality used by the brain. Function B is unspecified as it
is irrelevant to the cognitive level, and only relevant to the implementation level.
The independence of these two levels is foundational to a Computational Theory
of Mind (Pylyshyn, 1980). That Classicism is not committed to a concatenative
compositionality is re-emphasized in two further places:
\... the relation that is physically realized is functional adjacency, there
is no necessity that physical instantiations of adjacent symbols be spa-
tially adjacent."
(Fodor and Pylyshyn, 1988: p57; emphasis in the original)
\Though we shall generally consider examples where complex sym-
bols literally contain their Classical constituents, the present condition
means to leave it open that symbols may have Classical constituents
that are not among their (spatio-temporal) parts. (For example, so
far as this condition is concerned, it might be that the Classical con-
stituents of a symbol include the values of a `fetch' operation that takes
the symbol as an argument.)"
(Fodor and McLaughlin, 1990: p. 186; emphasis in the original)
The contention over Classical constituents (that is, when a component is tokened within the tokening of its whole) is analogous to Kirsh's (1991b) computational definition of explicitness of information. Kirsh defines explicitness of information within a larger context as information that is accessible within a reasonable (constant, in Kirsh's case) amount of time. The important point that Kirsh makes is that
that
the accessibility of information is dependent on, or relative to,
the processes that access that information.
The point that has not been previously made by either the Classicists or the Con-
nectionists is that tokening of a constituent, like Kirsh's explicitness of information,
is relative to the process extracting that constituent.
In the tensor example from equation 2.3, let an access function $f_{R_1}$ be defined as:

$f_{R_1}(\vec{T}) = \frac{1}{13}\left(\vec{T} \cdot (-2\ \ 3)^{\mathsf{T}}\right).$

Then, constituent $\vec{V}_1$ is explicitly tokened relative to $f_{R_1}(\vec{T})$, since

$f_{R_1}\begin{pmatrix} 0 & 13 \\ 7 & 9 \\ -5 & 14 \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix}.$

The function $f_{R_1}$ specifies a particular projection or view of the complex representation. When the complex representation is viewed along this projection, the constituent becomes explicit (i.e., tokened).
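The following check (mine, not the thesis's) confirms the computation, and makes visible why it works: $\vec{R}_1$ happens to be orthogonal to $\vec{R}_2$, and $\vec{R}_1 \cdot \vec{R}_1 = 13$.

```python
# The access function f_R1 recovers V1 from the tensor of equation 2.3.
import numpy as np

V1, R1 = np.array([3.0, 1.0, 4.0]), np.array([-2.0, 3.0])
V2, R2 = np.array([2.0, 3.0, 1.0]), np.array([3.0, 2.0])

T = np.outer(V1, R1) + np.outer(V2, R2)

def f_R1(T):
    """View T along the R1 role: an inner product followed by rescaling.
    T @ R1 = V1*(R1.R1) + V2*(R2.R1) = 13*V1, since R2.R1 = 0."""
    return (T @ R1) / (R1 @ R1)

print(f_R1(T))  # [3. 1. 4.] -- V1 becomes explicit relative to this process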
The point is that the mode of compositionality is irrelevant beyond there being
explicit representations of components within complex representations relative to
the processes that operate on those representations.
The second reason why the distinction between concatenative and functional
modes of compositionality is irrelevant is that the modes of compositionality, in
themselves, do not guarantee systematicity. Or, in other words, a model that
makes use of a concatenative compositionality scheme can be just as unsystematic
as a model under some other functional compositionality scheme. An example is
given to illustrate this point.
Suppose that the objects John, loves and Mary are encoded as the binary-valued vectors 1 0 1, 0 1 0 and 1 0 1 0 1, respectively. Then, under a concatenative compositionality scheme, a possible encoding of the object John loves Mary is 1 0 1 0 1 0 1 0 1 0 1. In order to extract the constituents, a process must know at which points to truncate the vector. However, it is not possible to determine these points with certainty without additional information or assumptions as to the way in which the constituents were concatenatively constructed. For example, under the same compositionality scheme, the compositional object Mary loves John has the same complex representation (i.e., 1 0 1 0 1 0 1 0 1 0 1). Additional information could be provided, for example, in the form of separation markers (say, a component with value 0.5). A similar problem arises with functional compositionality schemes. The point is that the processes operating on compositional representations must have some knowledge of the way in which constituents are brought together in the first place.
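The ambiguity is easy to demonstrate (a toy check of my own, not from the thesis):

```python
# Naive concatenation of variable-length codes is ambiguous:
# two different sentences, one string.
codes = {"John": "101", "loves": "010", "Mary": "10101"}

s1 = codes["John"] + codes["loves"] + codes["Mary"]   # John loves Mary
s2 = codes["Mary"] + codes["loves"] + codes["John"]   # Mary loves John

print(s1)        # 10101010101
print(s1 == s2)  # True: the truncation points cannot be recovered
```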
It could also be argued that a commitment to concatenative compositionality is simultaneously a commitment to discrete representations and processes; and since the brain is not a discrete device, the mode of compositionality cannot be concatenative, and therefore cannot be symbolic (i.e., Classical). That the discrete/continuous dimension is orthogonal to the issue of whether cognitive architecture is symbolic is illustrated here by what I term the beam scattering analogy. In a beam of light there are photons with wavelengths that may vary over some continuous (though not necessarily uniform) distribution. When passed through a prism, the photons are scattered by an angle that varies monotonically with wavelength. Consequently, if one were to record the scattering with some photoelectric device that measures the incident light, one would observe intensity as a continuous function of angle. Now, suppose that constituents are encoded at points of zero intensity gradient. The magnitude of light intensity at the maximum, minimum and inflection points of the angle-intensity curve encodes the constituent. In this way, constituents may be concatenated as a continuous waveform without the necessity for the constituents to be confined to discrete fixed-width spatial locations (see Figure 2.2). The point is that concatenative compositionality is not limited to, and therefore does not equate to, discrete constituents. So the argument that cognition could not be symbolic because it could not operate over discrete representations does not follow.
The conclusion taken from a re-examination of this debate is that the im-
plementation horn of the argument leaves no room for an alternative
theory. Either constituents are tokened (relative to their access processes) with
their complex representations, in which case one is implementing a Classical architecture; or, constituents are not tokened, in which case one cannot account for systematicity simply because the constituents are not uniquely determinable. So the question becomes: just what (if anything) can Connectionism contribute in terms of cognitive theory?

Figure 2.2: Beam scattering analogy. An example of continuous concatenative compositionality where constituents are encoded as intensity magnitudes at the zero-gradient points of an angle-intensity curve.
2.3.4 A \potential" contribution of Connectionism
Classical theory of cognition is based on the assertion that the cognitive level is supported by, but independent of, the implementation level, where the cognitive
level consists of structured representations (standing for properties of the exter-
nal world), and processes that operate over these representations. Furthermore, a
cognitive state is the result of some process on some internal (structured) represen-
tation. From a Classicist's perspective, the proper concerns at the cognitive level
of theorizing are determining these processes and representations.
There are, arguably, two ways in which the Classicist picture is (potentially) limited. The first limitation is that it is not at all certain that cognitive behaviour is only a consequence of cognitive-level representations and processes. Take for
instance Fodor's (1975) own example:
\A man wishes to be reminded, sometime during the day, to send a
message to a friend. He therefore puts his watch on upside down,
knowing that he will glance at it eventually and that, when he does,
he will think to send the message. What we have here is, presumably,
a straight forward causal connection between two states ..., but not
a kind of connection cognitive psychology has anything to say about.
Roughly, although the mental states are causally connected, they aren't
connected by virtue of their content; compare the case of the man who
is reminded to send a message to his friend when he (a) hears and (b)
understands an utterance token of the type `send a message to your
friend'."
(Fodor, 1975: p203, emphasis in the original)
A second example[7] is illustrated in the form of the following addition task. Present the sequence of numbers listed in Figure 2.3(a), one at a time, to a subject, and request the subject to keep a running additive total as each number is presented. After the last number is presented, ask the subject to respond with the total. Interestingly, some subjects (though not necessarily all) will incorrectly respond with 5000 instead of 4100. Possibly, some perceptual stimulus, perhaps the alternating rhythm of the 1000s, or the predominance of 0s, interacts with the symbolic process of addition. However, subjects are more likely to respond correctly given the same numbers in a different order (Figure 2.3(b)).
[7] Danny Sumina, personal communication.
[8] That is, $(P \to Q) \land P \vdash Q$: if P implies Q, and P is true, then Q is also true.
[9] That is, $(P \to Q) \land \lnot Q \vdash \lnot P$: if P implies Q, and Q is false, then P is also false.

Figure 2.3: A subject asked to keep a running additive total of the numbers listed in (a) may respond with 5000; when asked to perform the same task on the same numbers in a different order (b), subjects are likely to respond with the correct answer, 4100. The sequences are (a) 1000, 10, 1000, 20, 1000, 30, 1000, 40 and (b) 10, 20, 30, 40, 1000, 1000, 1000, 1000.

A third example is due to van Gelder and Niklasson (1994). They used the results of a study by Kern, Mirels, and Hinshaw (1983), which showed that scientists do not consistently apply the rules of logic (specifically, modus ponens[8] and modus tollens[9]) in their reasoning, to argue that people are not as systematic as Fodor and Pylyshyn claim. Most dramatically, of the 98 people examined only 41% correctly
recognized the logical validity of modus tollens when reasoning about questions presented to them. Consequently, van Gelder and Niklasson argued that a Classical architecture is inadequate, as it fails to account for the degree of systematicity evident in the empirical studies. One could argue that the lack of perfect, or near-perfect, accuracy is a competence/performance issue (i.e., the difference between what can be processed as distinct from what is processed in practice). However, the study also shows a statistically significant difference, in the percentage of scientists that recognized the logical validity of both modus ponens and modus tollens, between concrete and abstract examples. A Classical model would predict no significant difference, as concrete and abstract instances of, for example, modus tollens are structurally identical.
These three examples illustrate how some cognitive processes may operate purely on the basis of an associative or statistical relationship between two cognitive states, not on the basis of the structural content of those states. It is these forms of associatively and statistically based behaviours that the Classical framework does not address, and presumably the sorts of behaviours that Connectionism would be well placed to explain. The extent to which cognitive behaviour can be explained in this way, however, is an empirical question[10], and not one addressed here.

[10] Another empirical question is to what extent the implementational level impacts on the cognitive level. Chater and Oaksford (1990) and Butler (1993) have argued that the implementation level constrains the class of possible algorithms that can compute some cognitive function. Therefore, Connectionism at the implementational level may limit the choice of algorithms at the cognitive level.
The second limitation comes from the process-relative notion of tokening, which precluded Connectionist architectures from being anything other than Classical implementations. The process-relative notion of tokening leaves open a vast class of architectures able to meet this condition. In an extreme case, tokening can be realized by a look-up table. For example, in the following systematicity of inference task, the first or second component is extracted from a binary relation. That is:

$F((x_i, x_j), q_k) = \begin{cases} x_i & \text{if } k = 1 \\ x_j & \text{if } k = 2 \end{cases}$
where $x_i$ and $x_j$ are chosen from a set of N objects, and $q_k$ is a query input requesting the first or second component. All $2N^2$ unique complex representations are generated by a hashing function:

$f_H(x_i, x_j, q_k) = \#R \qquad (2.4)$

where $\#R$ is a unique number for every unique triple $(x_i, x_j, q_k)$. Now, the first and second constituents can be extracted by providing indexed functions:

$f^1_{i,j}(\#R) = \begin{cases} x_i & \text{if } \#R = f_H(x_i, x_j, q_1) \\ 0 & \text{otherwise} \end{cases} \qquad (2.5)$

$f^2_{i,j}(\#R) = \begin{cases} x_j & \text{if } \#R = f_H(x_i, x_j, q_2) \\ 0 & \text{otherwise} \end{cases} \qquad (2.6)$

where $f^1_{i,j}$ and $f^2_{i,j}$ are the classes of functions dedicated to recovering the first and second components, respectively. The results of the $2N^2$ access functions are then summed. Since each function $f^k_{i,j}$ is sensitive to only one unique constituent, only one constituent will be returned.
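The look-up-table scheme can be made concrete in a few lines. The sketch below is my illustration, not from the thesis; the particular hashing function f_H is an arbitrary choice that merely guarantees a unique #R for every triple.

```python
# Look-up-table extraction in the style of equations 2.4-2.6.
N = 3  # objects are 1..N

def f_H(xi, xj, qk):
    """Hashing function (eq. 2.4): a unique #R for every triple."""
    return ((xi - 1) * N + (xj - 1)) * 2 + (qk - 1)

def f1(i, j, R):
    """Access function (eq. 2.5), dedicated to pair (i, j), query 1."""
    return i if R == f_H(i, j, 1) else 0

def f2(i, j, R):
    """Access function (eq. 2.6), dedicated to pair (i, j), query 2."""
    return j if R == f_H(i, j, 2) else 0

def F(xi, xj, qk):
    """Extract a component by summing all 2*N**2 access functions."""
    R = f_H(xi, xj, qk)
    return sum(f1(i, j, R) + f2(i, j, R)
               for i in range(1, N + 1) for j in range(1, N + 1))

print(F(2, 3, 1))  # 2 (first component)
print(F(2, 3, 2))  # 3 (second component)
```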
In terms of network architecture, the internal representation can be physically realized as the activation of a single hidden unit, and each access function as a connection from the hidden unit to another processing unit. The output of each processing unit is then summed at a single output unit (see Figure 2.4). The problem with this architecture is that it requires $2N^2$ processing units, and it should be rejected on the same grounds as the microfeature solution was rejected; that is, on the basis that too many processing units are required.
However, it is not necessary to dedicate one unit per object. Units can share the representation and processing responsibilities so as to minimize the total amount of resources. Auto-encoder networks of sigmoidal units[11], for example, require at most two hidden units to represent N objects (Kruglyak, 1990). By combining two auto-encoders, all $N^2$ binary relations can be represented with just four hidden units and 2N access functions. In this case, the number of processable objects grows faster than the amount of resource. Similarly, as suggested above, a network implementing Smolensky's tensor representation scheme could be provided with access functions at limited cost.

[11] That is, the activation function for the hidden and output units is $f(net_i) = \frac{1}{1 + e^{-net_i}}$, where $net_i$ is the net input activation to unit i.

Figure 2.4: Connectionist implementation of a look-up table. Input units receive external representations $x_i$, $x_j$ and $q_k$ of the first component, second component and question, respectively. The activation function ($f_H$) for the first internal unit, whose activation is $\#R$, is given by equation 2.4. The activation functions for the second layer of internal units are given by equations 2.5 and 2.6. The activation from these units is summed by the output unit labeled $\Sigma$. Arrows indicate connections with a weight value of one.
However, Fodor and McLaughlin rejected the possibility of tensor (or other Connectionist) solutions on the grounds that it is not sufficient to show how systematicity is possible given the assumptions of a Connectionist architecture; one must also show how systematicity could be necessary given those assumptions, since, in most cases, people exhibit systematicity despite their varying physical and social backgrounds. Fodor and McLaughlin argued that although it may be possible to hand-craft a Connectionist model that could represent the structured object John loves Mary given that it could represent the structurally related object Mary loves John, it is also possible to hand-craft a model that supports a representation of the structured object Bill chased Sue given that it supports the structurally unrelated object John put the train in the box. In short, even though Connectionist models could demonstrate systematicity by implementing Classical compositionality, Fodor and McLaughlin argued that they are too general to necessarily account for systematicity.
The weakness of Fodor and McLaughlin's argument, however, is that they do not specify criteria by which a (Connectionist) model is too general to necessarily account for systematicity. For example, a Connectionist architecture necessarily exhibits systematicity if the assumption is made that it be wired so as to exhibit systematicity. I attribute the lack of such criteria to Fodor and Pylyshyn's definition of systematicity being grounded solely in terms of computational capacity, that is, in terms of the desired functional behaviour of an architecture.
At this point, it is not clear whether the look-up-table and tensor networks would be regarded by Fodor and McLaughlin as examples of tokening. It may be that the process-relative notion of tokening that I have described is too general for Fodor and McLaughlin, for on the one hand they entertained the possibility of non-Classical Connectionist models exhibiting systematicity by careful hand-crafting of connections, but subsequently rejected such solutions on the basis that they do not necessitate systematicity. This lack of clarity is precisely the problem one gets into when definitions are grounded in terms of computational capacity.
What is required is a tightening of the definition of systematicity along a dimension orthogonal to computational capacity. One such dimension, often used in psychology, is reaction time. Two computationally equivalent algorithms can be discriminated on the basis of the amount of time required to perform a task. Butler (1993) has used this dimension to argue for the importance of the implementational level.
Another dimension is learnability. In the domain of analogical reasoning, Halford and Wilson (1994) have attempted to reject a three-layer feedforward network model of the balance-scale task, also on the basis of computational capacity alone. As I have argued in their case (see Appendix A), the learnability dimension can be used to discriminate two architectures on the basis of the amount of resource required to learn or acquire a function or behaviour. In the Classical framework, systematicity is built into a symbol system. However, Connectionism is concerned not only with the realization of cognitive behaviours, but also with their acquisition. Thus, a "potential" contribution of Connectionism is in refining the characteristic properties of cognitive architecture on the basis of the learnability of systematic behaviour.
2.4 Thesis question
Connectionism is a general framework whereby global behaviours are a consequence of the interaction of simple units. Connectionist models are sufficiently general to represent any spatial function (Lapedes & Farber, 1988; Hornik, Stinchcombe, & White, 1989; Funahashi, 1989) and a general class of dynamic functions (Sato, Murakami, & Joe, 1990).
However, despite their capacity to realize any function, there does not exist a general-purpose learning algorithm, except for restricted models. For example,
the perceptron convergence rule (Hertz, Krogh, & Palmer, 1991) guarantees that the perceptron will learn to discriminate between two classes of stimuli, provided that the stimuli (in their input representational space) are linearly separable (i.e., all points can be correctly partitioned by a single hyperplane). The limitation of the perceptron is that it cannot represent the function for discriminating points that are not linearly separable (Minsky & Papert, 1990). In fact, the general learning problem is known to be computationally difficult, in the sense that it is unlikely that there exists a polynomial-time algorithm for configuring the weights of a network with three layers (Judd, 1987, 1990), even in the case where there are only one output and two hidden threshold units (Blum & Rivest, 1990, 1992).
This trade-off between the realizability of a range of functions/behaviours and their learnability is well known from statistics as the bias/variance dilemma (Geman, Bienenstock, & Doursat, 1992). So, although Connectionist models have sufficient variance to approximate most behaviours, the important question is: do these models have sufficient bias to account for systematicity?
Bias (in statistics), or innateness (in psychology), is a term that refers to the tendency for certain behaviours over others as a consequence of the architectural properties of the model. Architectural biases are both a boon and a burden, for they may allow the tractable acquisition of one class of behaviours, but at the expense of being unable to exhibit another class. In the case of the perceptron, for example, restricting the unit activation function to be linear guarantees learnability of any linearly separable class of stimuli, but at the expense of not being able to discriminate non-linearly separable classes. Thus, use of a linear transfer function is an architectural bias. Conversely, use of a non-linear transfer function allows the discrimination of arbitrary classes, but does not guarantee learnability. A non-linear transfer function is also an architectural bias.
In a learning framework, where behaviour is (partly) acquired through experience, systematicity is a form of generalization over structurally related objects. Having learnt how to represent and process some instances of a complex object, a system can represent and process instances of previously unseen, but structurally similar, objects. It is the combinatorial nature of structured domains that makes the acquisition of systematic behaviours difficult for general-purpose learning models like Connectionist networks. Even in the simple N-Vt-N example, if there are n agents/patients and m actions, then there are $n^2 m$ N-Vt-N complex objects. Simply allocating a new representation for each complex object would be infeasible on the basis that one is never likely to have previously seen all instances. Thus, an important question, and the primary question that this thesis attempts to address, is: Can Connectionist models exhibit the necessary acquisition of systematic behaviour?
2.4.1 Method
The acquisition of behaviour requires the necessary, but judicious, use of architectural biases. The question is: what are those biases in regard to the necessary acquisition of systematic behaviour? At the same time, a particular bias should not preclude other desirable behaviours.
At this point the only property under consideration is systematicity. Considering only one behaviour is one-sided, in that there is no cost to a bias. For example, systematicity could be trivially accounted for by direct implementation of a symbol system. However, this approach would be at the expense of the associative/statistical properties that potentially may explain some cognitive behaviours (i.e., the first potential contribution of Connectionism, above). Consequently, the approach taken here is to start from what Dyer (1991) calls "radical Connectionism"; that is, from relatively unstructured general-purpose network architectures (e.g., feedforward and recurrent networks). This approach makes the fewest assumptions regarding a Connectionist architecture, and therefore precludes the fewest behaviours.
The other consideration is the criterion by which a network is regarded to have "necessarily" acquired systematicity. Since systematicity is regarded as a computational-level property, such a criterion must be at this level and with respect to a particular task. In this thesis, two tasks are considered: (1) auto-association of N-tuples (as a means of capturing a model's degree of systematicity of representation); and (2) extraction of a component from an N-tuple upon request (as a means of capturing a model's degree of systematicity of inference). In conjunction with these two tasks are two degrees of generalization by which a model is considered systematic with respect to a task. They are: (1) systematicity as generalization over a domain of complex objects related by a common structure; and (2) systematicity as generalization across position: what Hadley (1993) calls strong systematicity. In contrast to Smolensky's and van Gelder's approaches, where systematicity was regarded primarily[12] as a representation problem, the approach taken here is to regard systematicity as a generalization problem.

[12] Although Smolensky (1987b) also considers the possibility of learning an "optimal" combination of role vectors, which is discussed in chapter 4.
At each investigation, the property that permits/prevents systematicity is analyzed. So, although this thesis is concerned with what properties give systematicity, it is also concerned with why these properties do or do not give systematicity. The reason why a property of one architecture prevents systematicity is used as justification for moves to more structured architectures.
In summary, the approach taken in this thesis is as follows:

- specify a task for which it is possible to exhibit systematicity at the computational level;
- define a criterion for the necessary acquisition of systematic behaviour;
- evaluate a Connectionist model with respect to this task and criterion; and
- analyze the model in terms of the properties that permit/prevent systematicity.
2.5 Summary
Systematicity is a computational-level property of cognitive systems whereby the ability to represent and process objects having a particular structure extends to other objects having the same structure. The importance of systematicity is that it places constraints on cognitive architectures.
Those constraints as argued for by Fodor and Pylyshyn are that cognitive archi-
tecture must consist of: (1) structured representations, or symbol structures; and
(2) processes that are sensitive to the structure of those representations. These
two constraints characterize a Classical cognitive architecture. The distinguishing
feature of a Classical architecture is the tokening of constituents whenever complex
representations are tokened.
Connectionists have proposed three alternative schemes of compositionality: Smolensky's weak (microfeatures) and strong (tensors) compositionality; and van Gelder's more general functional compositionality. After reviewing these alternatives, I have concluded that

- the Connectionist notions of compositionality either cannot explain systematicity, because the constituents are not uniquely accessible; or implement Classical compositionality, because the constituents are tokened (relative to their access processes) whenever complex representations are tokened.

Consequently, they fail to provide an alternative explanation of systematicity.
However, I have also argued that the limitation of the Classical position is that the requirement of systematicity is grounded solely in terms of computational capacity (i.e., what can be computed by an architecture). Consequently, systematicity can be realized by many computationally sufficient architectures, including Smolensky's tensors, which Fodor and McLaughlin wish to reject. I have suggested that the requirement of systematicity can be made stronger by considering learnability as an additional criterion. Thus, a "potential" contribution of Connectionism is an explanation for the necessary acquisition of systematic behaviour. Therefore, the problem that systematicity poses for Connectionism, which is the central concern of this thesis, is:

- Can Connectionist models exhibit the necessary acquisition of systematic behaviour?
The approach taken in the subsequent chapters to addressing this question is: first, to propose a suitable definition of the necessary acquisition of systematicity; second, to evaluate Connectionist models with respect to this definition; and, in doing so, to ascertain the properties that permit/prevent the necessary acquisition of systematicity.
Chapter 3
Systematicity as generalization
across domain
3.1 Introduction
In the last chapter I suggested that a "potential" import of Connectionism is in an explanation of the necessary acquisition of systematicity. This chapter is a first attempt at addressing the learnability of systematicity, rather than its computability. That is, where the debate reviewed in the previous chapter was concerned with algorithms/architectures that can compute systematic functions/behaviours, this chapter is concerned with algorithms/architectures that produce algorithms that exhibit systematic functions/behaviours.
The definition of systematicity from the previous chapter is a property whereby the ability to represent and process some instances of a structured object implies the ability to represent and process other instances with the same structure. In a learning framework, then, systematicity is the property whereby the ability to learn to represent and process some instances of a structured object implies the ability to represent and process other instances with the same structure. In other words, systematicity is a form of generalization over structurally related objects. Some previous work has been done in this area by Brousse and Smolensky (1989) and Brousse (1991), and their work is re-examined in section 2.
In section 3, the framework of probably approximately correct (PAC)-
learnability (Valiant, 1984) is presented, and used to provide the criterion by which
a model is said to necessarily acquire systematic behaviour.
The classic criterion in computer science by which a problem is regarded as tractable (in terms of the amount of resource required), or by which an algorithm is regarded as a tractable solution to a problem, is polynomial resource[1] complexity (Garey & Johnson, 1979). Algorithms, or more generally problems, which require more (e.g., exponential) resource are generally considered impractical, as there are not enough resources to solve even moderately sized problems. Valiant extended the notion of tractability (feasibility) to the domain of learning. Briefly, a function (or, more generally, a function class) is considered learnable by an algorithm if the target function is acquired, to within some degree of accuracy, on most occasions, with only polynomial resources in the size of the target function. The PAC-learning framework has proven very useful in the formal analysis of learning and has led to a field within machine learning called computational learning theory (see Anthony, 1992, for an introduction to this field). In this chapter, the following definition is introduced: A Connectionist model is said to necessarily acquire systematic behaviour if

the architecture acquires the target behaviour to within a desired degree of accuracy, with a desired degree of confidence, with at most polynomial resources in the size of the behaviour.

[1] Typically, resource means time; however, time can be traded for space, or for the number of processors in the case of parallel machines.
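Stated symbolically (my paraphrase of Valiant's criterion, not a formula from the thesis): for accuracy parameter $\epsilon$, confidence parameter $\delta$, and a hypothesis h produced from m training examples,

$\Pr[\,\mathrm{error}(h) \le \epsilon\,] \ge 1 - \delta, \qquad m \le \mathrm{poly}(1/\epsilon,\ 1/\delta,\ \mathrm{size}(f)),$

where size(f) measures the target function (for networks, typically the number of weights), and the learner's running time is similarly polynomially bounded.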
Feedforward networks are the first and perhaps the most widely used networks in Connectionist modeling. Given the work of Brousse and Smolensky, they are an obvious starting point for addressing the issue of systematicity within Connectionism. In section 4, PAC-learnability is used as the criterion for the necessary acquisition of systematic behaviour with respect to the task studied by Brousse and Smolensky (i.e., auto-association of N-tuples, or relations). Auto-association provides a way of evaluating whether a model (in this case, a feedforward network) exhibits systematicity of representation.
General discussion regarding these results is given in section 5, and finally, a summary and conclusion are provided in section 6.
3.2 Brousse and Smolensky's massive generalization
In Connectionist models that learn from examples, cognitive behaviour is seen to
emerge from the dynamics of the network interacting with some environment. One
of the criticisms of the Connectionist approach has always been that Connectionist
models require too many examples to acquire a competent level of performance in
such complex domains as language. Combinatorial domains are interesting because
such a vast amount of behaviour can be described with relatively few rules. In fact,
they are the perfect domains for symbol systems, and the sorts of domains where
a Connectionist model is expected to fail. It is this expectation of failure that
motivated the studies of Brousse and Smolensky.
3.2.1 Task: Auto-association of N-tuples

One of the simplest combinatorial domains is the auto-association of N-tuples, where, say, an object represented as N instantiated variables must be encoded and later recovered. The set of N-tuples S is defined (using Z-notation; Spivey, 1989) as $S = S_1 \times \ldots \times S_N = \{x_1 : S_1; \ldots; x_N : S_N \bullet (x_1, \ldots, x_N)\}$, where $S_i$ is the set of possible values at position i of the tuple. The case considered here is where $S_1 = S_2 = \ldots = S_N$. For example, $(x_3, x_4)$ and $(x_4, x_3)$ are distinct objects in a domain of 2-tuples. The auto-association of N-tuples is the identity function that maps every element in S to itself.
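As a concrete illustration (mine, not the thesis's), the domain and target function for this task can be enumerated directly; the alphabet of atoms and the tuple order are free parameters.

```python
# The N-tuple domain S and the auto-association (identity) task,
# for an alphabet of k atomic objects.
from itertools import product

k, N = 26, 2
atoms = [chr(ord("A") + i) for i in range(k)]

domain = list(product(atoms, repeat=N))   # S = S_1 x ... x S_N
print(len(domain))                        # 26**2 = 676 distinct 2-tuples

def auto_associate(t):
    """The target function: the identity map on S."""
    return t

assert auto_associate(("C", "A")) == ("C", "A")
```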
3.2.2 Method
Brousse and Smolensky (1989) and Brousse (1991) studied the auto-association of
N -tuples (strings of length N) by a feedforward network. In their simulations, the
number of atomic objects was 26 (letters of the alphabet), and the tuple order was
varied from 2 to 6 (i.e., the number of components in each tuple). Therefore, the
size of the domain grew exponentially with the tuple order from 262 = 676 (N = 2)
to 266 � 3 � 108 (N = 6).
Each atomic object (letter) was represented externally (i.e., at the input/output
layer) by a random binary vector of length 8. The input/output representation of
an N -tuple (string) is the in-order concatenation of the vector representations of its
components. Thus, the representation of an N -tuple is a 8N dimensional vector.
The feedforward network they examined consisted of 8N input units (i.e., one
unit for each vector component) completely connected to all 5N hidden units, which
are in turn completely connected to all 8N output units. The task of the network
is to learn to construct internal (hidden unit vector) representations of the input,
and subsequently, recover the input representation at the output (e.g., Figure 3.1).
Figure 3.1: A feedforward network for learning to represent 2-tuples. Arrows indicate complete connections between groups of units, and numerals indicate the number of units in a group. In this example, an input representation of the object (C, A) is mapped to an internal representation R[(C, A)], from which the input representation is recovered at the output.
A subset of the domain is randomly selected from a uniform distribution, on which the network is trained. Training consists of presenting the input and target vector pairs. The network learns by changing the connection weights so as to minimize the error between the network's output and the target output for each pattern, using the standard backpropagation learning algorithm (Rumelhart et al., 1986), which is defined as:

$\Delta W_{ij} = -k \frac{\partial E}{\partial W_{ij}} \qquad (3.1)$

where $\Delta W_{ij}$ is the change in the weight of the connection from unit i to unit j; k is the magnitude of change; and E is the network error, which, in this case, is the sum of the squared differences between the target vector and the network output vector, given by:

$E = \sum_{p=1}^{n} E_p = \sum_{p=1}^{n} |\vec{t}_p - \vec{o}_p|^2 = \sum_{p=1}^{n} \sum_{i=1}^{m} |t_{pi} - o_{pi}|^2 \qquad (3.2)$

where $E_p$ is the error for each example pattern p; n is the number of examples; $\vec{t}_p$ and $\vec{o}_p$ are the target and network output vectors, respectively; and $t_{pi}$ and $o_{pi}$ are the corresponding vector components at output unit i, for m output units. The activation of any non-input unit is given by:

$o_j = \frac{1}{1 + e^{-\left(\sum_{i=1}^{k} o_i w_{ij} + b_j\right)}} \qquad (3.3)$

where $o_j$ is the activation of unit j; $o_i$ is the activation of unit i, whose activity is propagated to unit j via a connection with weight $w_{ij}$; k is the number of units that have an incoming connection to unit j; and $b_j$ is the bias.

Training was started from a random setting of the weights and terminated when all output units were within 0.4 of their targets for all patterns.

Testing the performance of the network consists of randomly resampling the domain and determining whether the network's output is sufficiently close to the target output. In this case, the network's response was considered correct when all output units were within 0.4 of their corresponding target values.
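For readers who want to reproduce the setup, the following is a minimal sketch of such an auto-associator (my reconstruction under the stated training and testing tolerances, not the authors' code; the learning rate, initialization and epoch limit are arbitrary choices).

```python
# A one-hidden-layer sigmoid network trained by backpropagation
# (equations 3.1-3.3) on randomly encoded 2-tuples.
import numpy as np

rng = np.random.default_rng(0)
k_atoms, N, bits, hidden = 26, 2, 8, 5 * 2   # 8N inputs/outputs, 5N hidden

codes = rng.integers(0, 2, size=(k_atoms, bits)).astype(float)

def encode(tup):
    return np.concatenate([codes[i] for i in tup])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in = n_out = bits * N
W1 = rng.normal(0, 0.5, (n_in, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.5, (hidden, n_out)); b2 = np.zeros(n_out)

train = [encode(rng.integers(0, k_atoms, N)) for _ in range(50)]
lr = 0.5
for epoch in range(5000):
    done = True
    for x in train:
        h = sigmoid(x @ W1 + b1)          # hidden activations (eq. 3.3)
        o = sigmoid(h @ W2 + b2)          # output activations
        err = o - x                       # proportional to dE_p/do (eq. 3.2)
        if np.max(np.abs(err)) > 0.4:
            done = False
        d_o = err * o * (1 - o)           # backpropagated deltas
        d_h = (d_o @ W2.T) * h * (1 - h)
        W2 -= lr * np.outer(h, d_o); b2 -= lr * d_o    # eq. 3.1 updates
        W1 -= lr * np.outer(x, d_h); b1 -= lr * d_h
    if done:
        break

test = encode(rng.integers(0, k_atoms, N))
out = sigmoid(sigmoid(test @ W1 + b1) @ W2 + b2)
print("test pattern correct:", bool(np.all(np.abs(out - test) <= 0.4)))
```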
3.2.3 Exponential number of generalizations
From this training scheme, Brousse and Smolensky showed that, from a fixed number of training examples (50), the number of generalizations (future examples correctly processed) and virtual memories (patterns learnt within five additional learning trials) grew exponentially with the order of the tuple (see Brousse & Smolensky, 1989, Figures 5 & 7, respectively).
3.2.4 Exponential decrease in percentage of generalizations

The exponential growth in the number of generalizations exhibited by the feedforward network appears impressive. However, the growth in the total number of examples is also exponential. Two measures of the network's degree of systematicity for the representation of tuples can be obtained by calculating (1) the number of generalizations, and (2) the number of generalizations plus virtual memories (from Brousse & Smolensky, 1989, Figures 5 & 7, respectively), as a percentage of the total number of examples in the domain.
A plot of tuple order versus degree of systematicity is shown in Figure 3.2. The linear relationship between the x and y axes of the log-linear plot indicates a decreasing exponential relationship between tuple order and degree of systematicity. So, for example, despite the network being able to represent tuple instances of order 6, on average only 1% were correctly represented, counting the number of generalizations plus virtual memories.
It is this lack of generalization over such structured domains (i.e., lack of systematic behaviour) that has been the target of the strong criticism of the Connectionist approach to cognitive modeling (Fodor & Pylyshyn, 1988). Brousse (1991) used a weight decay term[2] to improve generalization to the point where the network performed perfectly for small N. However, the percentage of correct examples still decreased as an exponential[3] function of N. Clearly, 50 training examples is too few to achieve a high degree of generalization over such a large domain. However, if the network requires some constant (even if small) fraction of the domain, then learning will be intractable, as the training set size will grow exponentially with tuple order.

[2] Weights are eliminated so as to reduce the possibility of overparameterization by the network.
[3] Brousse (page 190) reports the number of generalizations to be $15^N$. However, the total number of examples is $26^N$. Consequently, the fraction of the example space correctly processed is $(15/26)^N$ (i.e., an exponentially decreasing function of N).
Figure 3.2: A log-linear plot of the percentage of generalizations, and of generalizations plus virtual memories, as a function of tuple order (N). The linear relationship indicates an exponential decrease in the percentage of the space correct. Reconstructed from Brousse and Smolensky (1989), Figures 5 & 7.
If systematic behaviour is regarded as a degree of generalization over a particular domain, then the problem can be restated in terms of PAC-learnability by asking two questions. First, how many training examples are required to maintain a fixed degree of accuracy over the entire space? (In other words, what is the sample complexity?) Second, how long will it take to learn these examples? (In other words, what is the time complexity?)
Before attempting to answer these two questions, the concept of PAC-learnability is explained, with emphasis on its applicability to the systematicity problem.
3.3 Probably approximately correct learnability
Probably approximately correct (PAC)-learnability is a theoretical concept introduced by Valiant (1984) as a criterion against which one may formally prove the learnability of a particular class of problems by a particular learning machine. The concept of PAC-learnability is analogous to, and extends, the concept of tractability. Roughly, an algorithm is said to tractably compute some function if it requires no more than a polynomial amount of resource (e.g., time, space, processors) in some parameter measuring the size of the function. Similarly, a function (or function class) is said to be PAC-learnable by an algorithm if that algorithm, on most occasions, acquires the target function to within a desired degree of accuracy with no more than a polynomial amount of resource in the size of the target function.
PAC-learnability has been used as the framework for theoretical results on the learning capabilities of neural networks (see Anthony, 1992, for a review of the import of computational learning theory for Connectionism). Theoretical approaches to learnability problems using neural networks generally divide the problem into two parts: sample complexity, the number of training examples required to acquire the target function; and time complexity, the time required to load (correctly map) these training examples.
In determining the sample complexity, one wants to establish how many training examples are required before the network correctly processes a new example with probability greater than $1 - \epsilon$. Furthermore, the probability that the network will behave with this degree of accuracy on future examples, having been given this many training examples, must be greater than $1 - \delta$. A network is considered to tractably learn a function when the number of training examples required to satisfy this relationship is polynomial in the size of the target function, which is usually measured as the number of weights required to implement that function. A number of results on the sample complexity of neural networks have been established (e.g., Baum & Wilczek, 1987; Shawe-Taylor & Anthony, 1991; Haussler, 1992), and it is one of the results of Shawe-Taylor and Anthony that is used in the next section.
Establishing time complexity results generally involves mapping the task of finding a set of weights that satisfies the function specified by the training set, for a given network architecture, onto a problem of known complexity in computability theory (e.g., Judd, 1987, 1988; Lin & Vitter, 1989; Blum & Rivest, 1990; Judd, 1990; Blum & Rivest, 1992). These results, although general, are also negative, in the sense that even for relatively simple feedforward architectures there does not exist a polynomial-time algorithm for finding a set of weights that will implement an arbitrary function specified by the training set.
However, these results are worst-case scenarios, in that they consider the time required to learn any function from very general function classes. Time complexity can be reduced by restricting the function class, or by incorporating knowledge about the functions to be learnt into the learning algorithm. Since learning is essentially a search through a function space for a function that is sufficiently close to the target function, restricting the function class effectively reduces the search space and therefore the search time. Incorporating knowledge about the functions to be learnt structures the search space into regions where target functions are likely to be found. Consequently, search time can be reduced by ignoring regions of the search space, just as is the case, for example, when performing a binary search.
Incorporating domain-specific features into an algorithm or architecture (e.g., a non-linear activation function) can, however, greatly complicate a theoretical approach. Consequently, in the next section, the time complexity is determined empirically.
PAC-learnability provides a tool for determining the learning capabilities of such potential cognitive models as Connectionist networks. If one links the desired degree of generalization with the computational resources required to achieve that degree of generalization, then one has a criterion for determining whether a Connectionist model is systematic that does not rely on the computational capacity of the model. The rationale for using polynomial resources as a criterion for systematicity is that, if a machine or model requires a greater-than-polynomial (e.g., exponential) amount of resources, then there simply isn't enough time or resource available to achieve the desired level of generalization for even moderately sized functions. Thus, this criterion allows potential candidates for cognitive architecture to be discounted on the basis that they are too computationally expensive.
In the next section, this criterion is applied to the task studied by Brousse and Smolensky to examine the capacity of feedforward networks to acquire systematicity of representation.
3.4 Systematicity of representation with feedforward networks
Systematicity of representation is the ability to construct internal representations of structurally related objects. For these internal representations to be useful, the objects they represent must be subsequently recoverable.
For example, a model that constructs internal (vector) representations of complex objects by the addition of vector representations of the objects' components would not be satisfactory, since some objects would not have unique internal representations. Suppose, for example, internal representations of the two distinct objects John loves Mary and Mary loves John were composed by vector addition of their components John, Mary and loves. Then:

$\vec{R}_{John\ loves\ Mary} = \vec{R}_{John} + \vec{R}_{loves} + \vec{R}_{Mary}$  (by definition)
$= \vec{R}_{Mary} + \vec{R}_{loves} + \vec{R}_{John}$  (by the law of commutativity)
$= \vec{R}_{Mary\ loves\ John}$  (by definition)

In fact, any commutative operator is an inadequate mechanism for the construction of complex representations.
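A quick numerical check of this point (mine, not the thesis's): addition collapses the two orderings, whereas a non-commutative composition such as the outer product does not.

```python
# Commutative composition (addition) collapses distinct structures;
# a non-commutative one (the outer product) keeps them apart.
import numpy as np

rng = np.random.default_rng(1)
john, loves, mary = (rng.normal(size=4) for _ in range(3))

by_sum = lambda a, b, c: a + b + c
by_outer = lambda a, b, c: np.einsum("i,j,k->ijk", a, b, c)

print(np.allclose(by_sum(john, loves, mary), by_sum(mary, loves, john)))      # True
print(np.allclose(by_outer(john, loves, mary), by_outer(mary, loves, john)))  # False
```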
The usefulness of the internal representations constructed by a network can
be tested by requiring the network to subsequently output a representation of the
complex object it was presented. In other words, by requiring the network to
auto-associate the input representations of complex objects.
3.4.1 Task: Auto-association of N -tuples
The family of tasks on which the network is evaluated is the same as used in
the studies by Brousse and Smolensky as described in subsection 3.2.1. That is,
auto-association of N -tuples where an N -tuple is an ordered relation of N atomic
(unstructured) objects selected from some set.
3.4.2 Model: Feedforward network
The feedforward network (Figure 3.3) used in these simulations is the same as used
by Brousse and Smolensky except that the number of input/output units per tuple
order was 10 (i.e., k = 10) and tuple order (N) was varied from 2 to 10, whereas
in the work of Brousse and Smolensky, k = 8 and N was varied from 2 to 6. Also,
a different input and output encoding scheme was used. Each component was
encoded by a local encoding scheme (i.e., one unit on and the rest off) at the input
layer and a block encoding scheme (i.e., half of the units on and the rest off;
Bakker, Phillips & Wiles, 1993; in press) at the output layer (see section 3.4.4), whereas
Brousse and Smolensky encoded each component as a random binary vector. With
local and block encoding schemes, the number of possible components is equal to
the number of units available for representing each component, which in this case
is 10.
Figure 3.3: Feedforward network for the auto-association of N-tuples (input and
output layers of k × N units, hidden layer of h × N units). In this architecture,
k refers to the number of possible atomic objects in any one position, N refers to
the number of components within a complex object (tuple order), and h refers to
the number of hidden units per component. Every input unit was connected to
every hidden unit, which was connected to every output unit.
In this chapter, systematicity has been framed in terms of PAC-learnability.
That is, a network is considered systematic if it can attain a fixed degree of gen-
eralization (i.e., percentage of example space) by requiring, at most, polynomial
resources in the size of the function. For the task used here, the size of the function
is characterized as the tuple order (N).
Here, both theoretical and empirical approaches are taken to address this issue
of computational resource. A theoretical result places bounds on the difficulty of
the problem. By using a result due to Shawe-Taylor and Anthony (1991), an upper
bound on the number of training examples needed to attain a desired degree of
generalization can be obtained. The advantage of this approach is that one can
obtain limits purely on the information contained in the data and the represen-
tational power of the model which are independent of the learning algorithm the
model may use. The limitation of this approach, however, is that there are no
general results for determining time complexity. Therefore, the time complexity
results were established empirically.
3.4.3 Theoretical result
Firstly, a theoretical upper bound on sample complexity is given for this task.
Then, the sample and time complexities are determined empirically.
Polynomial sample complexity
The sample complexity result of Baum and Haussler (Baum & Haussler, 1989) is
not applicable in this case as multiple output units are involved. However, there is
a corollary by Shawe-Taylor and Anthony (Shawe-Taylor & Anthony, 1991) that,
interestingly enough, provides an upper bound on the number of training examples
independent of the number of output units. It applies to feedforward networks of
threshold units with one hidden layer. It says:
"Given an accuracy parameter ε and a confidence parameter δ, for
a feedforward network with W variable weights and n computational
nodes, with probability greater than 1 − δ the network will give correct
output with probability greater than 1 − ε on inputs drawn according
to some distribution, provided it correctly computes the function on a
sample (drawn from the same distribution) of size at least

m_0 = m_0(\epsilon, \delta) = \frac{1}{\epsilon(1 - \sqrt{\epsilon})} \left[ \ln\left(\frac{1.3}{\delta}\right) + 4(W + n)\log_2(en)\,\ln\left(\frac{6}{\epsilon}\right) \right]

(Shawe-Taylor and Anthony, 1991, p. 116)"
where W is the total number of variable weights (including biases) of connections
from input units to hidden units and from hidden units to a single output unit4;
and n is the number of hidden units. In a network of N (k−h−k) encoders,
n = N × h and W = khN^2 + 2hN + 1.

4 Since their result is independent of the number of output units, W is specified so as to
consider only weights associated with one of the output units.
If the accuracy and confidence parameters are fixed, as well as the number of
possible values at each tuple position (k), then an upper bound on sample com-
plexity in terms of N, the tuple order, can be determined. Given that the number
of input and output units is N × k, and the number of hidden units5 is N × ⌈log₂ k⌉,
then the upper bound on the maximum number of training examples required is

m_0 = O\left( \frac{N^2 k \log_2 k}{\epsilon} \log_2\left( \frac{N \log_2 k}{\epsilon} \right) \right) = O(N^2 \log_2 N)
Thus, the network only requires a polynomial number of training examples.
Provided the network can load (acquire the correct behaviour on) these examples
it will, with high probability, correctly generalize to an exponential number of
future examples.
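To see how the bound scales, it can be computed directly from the expressions for
W and n above. The following is a sketch only (the helper name is ours, and
h = ⌈log₂ k⌉ hidden units per component is the binary-encoding assumption stated
in the footnote):

    import math

    def sample_bound(N, k, eps=0.05, delta=0.01):
        # Shawe-Taylor and Anthony (1991) upper bound m0 on training set size.
        h = math.ceil(math.log2(k))         # hidden units per component
        n = N * h                           # total number of hidden units
        W = k * h * N**2 + 2 * h * N + 1    # weights feeding a single output unit
        return (1 / (eps * (1 - math.sqrt(eps)))) * (
            math.log(1.3 / delta)
            + 4 * (W + n) * math.log2(math.e * n) * math.log(6 / eps)
        )

    for N in (2, 4, 6, 8, 10):
        print(N, round(sample_bound(N, k=10)))

The printed bounds grow roughly as N^2 log N; their absolute size depends on the
choices of ε and δ.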
3.4.4 Empirical results
The theoretical result is quite general in that it applies to any function that is
representable by such a network (not only auto-association of N -tuples), and it
is independent of the distribution from which the training and testing examples
are selected (it only requires that the training and testing distributions be the
same). However, the bound put on the number of training examples is quite high
(in the order of tens of thousands of training examples). Furthermore, there is
an assumption that there exists an algorithm that can load these examples in
polynomial time.
With respect to the auto-association of N -tuples task, the question is: Can
the standard backpropagation algorithm be used to load enough training examples
in polynomial time so that the feedforward network exhibits systematicity with
respect to this task?
Method
In an empirical study, it is not necessary to separate complexity into sample and
time components. One could simply record the total processing time required by
the network to acquire the desired degree of generalization and subsume sample
complexity within time complexity.
5 The minimum number of threshold units required by a single encoder of k items is ⌈log₂ k⌉,
using a binary encoding scheme.
However, keeping the two components separate allows one to determine where
the demand for computational resource is greatest, and therefore it allows one to
identify the weaknesses of the architecture. For example, it is possible that the
sample complexity is low relative to the time complexity to load the examples.
Such a result suggests that the learning algorithm used by the network is not
making efficient use of the information represented in the training set. It would
suggest that improvements are to be found in the learning algorithm rather than
the connectivity of the network. This point will be returned to when discussing the
empirical results of the feedforward network on the auto-association of N -tuples
task.
For the empirical approach, what is considered is the number of training examples
and the time taken to obtain at least 95% accuracy (ε = 0.05) with at least
99% confidence (δ = 0.01) on future examples for tuples of order N = 2 to N = 10
(in increments of 2). The number of possible values at each tuple position was set
at 10, as mentioned in subsection 3.4.2 (i.e., |S_i| = k = 10).
Each network was trained using the standard backpropagation algorithm6
(Rumelhart et al., 1986) on training sets ranging from 10 to 1500 examples ran-
domly generated from a uniform distribution. A train-test trial consisted of: ran-
dom initialization of network weights; random generation of training examples;
training the network until all output units were on the right side of 0.5 for all
training patterns; testing network performance on a new test set of 1000 randomly
generated examples from the same distribution; and recording the number of test
patterns for which all output unit activations were on the right side of 0.5.
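A response therefore counted as correct only if every output unit fell on the same
side of 0.5 as its target. A minimal sketch of that acceptance test (the function
name is ours):

    import numpy as np

    def all_on_right_side(outputs, targets):
        # Correct iff every output activation is on the same side of 0.5 as its target.
        return bool(np.all((outputs > 0.5) == (targets > 0.5)))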
For each value of N and each training set size, five train-test trials were con-
ducted, providing five measurements of network performance on the 1000 test ex-
amples, from which the 99% confidence intervals on network performance on future
examples could be calculated. For each N, the training set size was increased until
the lower bound of the 99% confidence interval was at least 95%. In other words, the
training set size was increased until any one of the randomly initialized networks,
with 99% confidence, would correctly process at least 95% of the 1000 test exam-
ples it was given. The training set size which met this criterion was recorded. The
time taken to train each network, measured in total number of weight updates, was
also recorded.
6 Which is the same algorithm as used by Brousse and Smolensky (see subsection 3.2.2).
In this simulation, sigmoidal units (as defined by equation 3.3) rather than
threshold units were used, as this reduces the number of necessary hidden units7
to 2N. The learning rate was set to 0.1. No momentum term was used. Weights
were updated after each training pattern was presented. Weights were randomly
initialized from a uniform distribution in the range [−1, 1]. The learning algorithm
used was gradient descent with the sum of squares error function (as defined by
equations 3.1 and 3.2, respectively).

7 Two hidden units being the most required for a single encoder of sigmoidal units (Kruglyak,
1990).
The representation of each component at each group of k units at the input layer
was a local encoding (i.e., one unit on and the rest off), and at the output layer
was a block encoding (i.e., half of the units are on consecutively, with wraparound,
and the rest off). That is,

            Input (local)            Output (block)
    x1  =   1 0 0 0 0 0 0 0 0 0      1 1 1 1 1 0 0 0 0 0
    x2  =   0 1 0 0 0 0 0 0 0 0      0 1 1 1 1 1 0 0 0 0
    ...
    x10 =   0 0 0 0 0 0 0 0 0 1      1 1 1 1 0 0 0 0 0 1
The block encoding scheme for the output layer was chosen as it dramatically
improves the learning time of the k−2−k encoder problem (Bakker, Phillips &
Wiles, 1993; in press), which has previously been shown to be particularly difficult
for backpropagation-style networks (Lister, 1993).
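The two encodings can be generated as follows (a sketch; the function names are
ours, and the block code reproduces the wraparound pattern shown above):

    import numpy as np

    def local_code(i, k=10):
        # One unit on, the rest off.
        v = np.zeros(k)
        v[i] = 1.0
        return v

    def block_code(i, k=10):
        # Half of the units on consecutively, starting at unit i, with wraparound.
        v = np.zeros(k)
        v[(i + np.arange(k // 2)) % k] = 1.0
        return v

    print(block_code(9))  # [1. 1. 1. 1. 0. 0. 0. 0. 0. 1.], matching x10 above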
Result(1): Polynomial sample complexity
The number of training examples sufficient to meet the 99% confidence level (i.e.,
δ = 0.01) and 95% accuracy level (i.e., ε = 0.05) for tuple orders ranging from 2 to
10 is shown in Figure 3.4.
On a log-log plot, the linear relationship between the x and y coordinates (solid
line) indicates that training set size is a polynomial function of tuple order, whereas
the total number of examples is an exponential function of tuple order and appears
as a non-linear relationship between the x and y coordinates (dotted line).
A linear regression shows that the relationship between the number of training
examples (m) and tuple order (N) is

m = 130 N^{1.01}

with significance F(1, 3) = 44, p < 0.01.
Thus, the sample complexity of the feedforward network on the auto-association
of N -tuples task is polynomial in the tuple order.
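The exponent in such a fit is recovered by ordinary least squares in log-log coor-
dinates. A sketch of the estimation procedure (the sample sizes here are synthesized
from the reported fit itself, purely to illustrate the method):

    import numpy as np

    N = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    m = 130 * N**1.01                   # synthetic sizes from the reported fit

    slope, intercept = np.polyfit(np.log(N), np.log(m), 1)
    print(round(slope, 2), round(np.exp(intercept)))  # 1.01 130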
Figure 3.4: A log-log plot of the number of training examples (solid line) and
total possible examples (dashed line) as a function of tuple order (N). The linear
relationship indicates that the required number of training examples is a polynomial
function of tuple order (i.e., sample complexity is polynomial).
Result(2): Polynomial time complexity
The mean training time, measured as the total number of weight updates (over 5
trials), sufficient to load the training examples for tuple orders ranging from 2 to
10 is shown in Figure 3.5.
On a log-log plot, the linear relationship between the x and y coordinates indi-
cates that training time is a polynomial function of tuple order.
A linear regression shows that the relationship between training time (t) and
tuple order (N) is

t = 2.9 \times 10^8 N^{5.1}

with significance F(1, 3) = 1020, p < 0.01.
Thus, the time complexity of the feedforward network on the auto-association
of N -tuples task is polynomial in the tuple order.
Discussion
The theoretical result assumed the existence of an algorithm that could load the
training examples in polynomial time. The question was whether the standard
backpropagation algorithm could be used to load enough training examples in
polynomial time so that the feedforward network would demonstrate the neces-
sary acquisition of systematicity of representation with respect to the domain of
N -tuples. For the empirical studies, the feedforward network is said to neces-
sarily acquire systematicity of representation if, on 99% of occasions, it acquires
the capacity to represent N -tuples with accuracy greater than or equal to 95%,
requiring computational time that is at most a polynomial of N . The empirical
results showed that the feedforward network exhibited the necessary acquisition of
systematicity as defined by this criterion.
The number of training examples required was almost linear (O(N^{1.01})) in the
tuple order. The total time required to load these examples was roughly quintic
(O(N^{5.1})) in the tuple order. The time complexity can be reduced to O(N^3)
if one considers that there are O(N^2) weights that can be updated in parallel.
Figure 3.5: A log-log plot of the mean number of total weight updates (over 5
trials) as a function of tuple order (N). Error bars indicate one standard deviation.
The linear relationship indicates that training time is a polynomial function of
tuple order (i.e., time complexity is polynomial).
By separating resource requirements into sample and time complexity components
one can see that the network did not require many examples to learn the function.
The expense was in learning these examples. Consequently, the greatest reductions
in resource requirements would be found with improvements to learning. One
possibility is the use of an exponential error function (see, for example, Baum
& Wilczek, 1987; Lister, Bakker, & Wiles, 1993), instead of the quadratic error
function used here (i.e., equation 3.2).
The number of generalizations exhibited by the network was very high. For ex-
ample, at N = 10, the network's degree of generalization was estimated to be 95%
of 10^10 test patterns8, from approximately 1500 training patterns. The high degree
of generalization was due to the small number of weights needed to represent the
function. Sample complexity was proportional to the number of weights. Restrict-
ing the number of weights (by restricting the number of hidden units) reduces the
class of functions that can be represented by the network. Less information (in
the form of training examples) is needed to distinguish the target function from
the class of representable functions as there are fewer representable functions. In
the feedforward network, the number of weights was a polynomial function of tuple
order. If the number of weights were to increase as an exponential function of tuple
order then, by Shawe-Taylor and Anthony's result, the upper bound on the number
of training examples would also increase as an exponential function, making
systematicity, as defined in terms of PAC-learnability, unachievable.
Thus, an architectural assumption required in the feedforward network is that the
number of weights grows as a polynomial function of tuple order.
The use of a block encoding scheme was important in establishing the poly-
nomial time complexity result. Bakker et al. (1994) showed that, for a single
auto-associator (i.e., N = 1), training time as a function of the number of possible
components (k) grew rapidly when outputs were encoded locally, yet slowly when
using block encoding. For example, at k = 9 (N = 1), the number of pattern
8 Since the 1000 test patterns were randomly generated from a uniform distribution for each
train-test trial.
presentations9 was approximately 7 × 10^5 for local encoding compared to approx-
imately 4000 for block encoding. For the simulations performed here, it was not
possible to get the network to learn to correctly map all training examples under
a local encoding scheme, and so block encoding of the outputs was used.
Use of block encoding at the outputs changes the organization of internal rep-
resentations and hyperplanes at the hidden layer of the network. Lister (1993)
and Bakker et al. (1994) showed that, after training a network on 1-tuple auto-
associations where the output patterns were encoded locally, the hidden unit vector
representations were arranged in a circle with the hyperplanes (i.e., the decision
boundaries of the output units) separating one point from all other points. Fig-
ure 3.6(a) characterizes this organization for local encoding. Bakker et al. (1994)
showed that in the case of block encoding hyperplanes were oriented so as to bisect
the organization of hidden unit vector representations. Figure 3.6(b) characterizes
this organization for block encoding. The change in internal organization a�ects
the error surface, which determines learning time. Effectively, each hyperplane is
performing a classification. In the case of local encoding, the distribution of class
members is skewed into two classes of 1 and N − 1 points, respectively. In the
case of block encoding, it is relatively balanced, with about N/2 points in each class.
The effect of these two distributions on the error surface was examined and discussed
in Phillips (1993). Essentially, a block encoding changes the error surface so as to
provide more information as to the location, in weight space, of the target function
(Bakker et al., 1994). More information is provided in the sense that more of the
error surface points (in terms of the direction of steepest descent) to the region con-
taining the target function. Thus, the search time is effectively reduced by ignoring
more of the search space.
Given the architectural assumptions of: complete connectivity between layers;
polynomial growth in the number of variably weighted connections; block encoding
of objects at the output layer; and a set of training and testing examples that were
selected from the same uniform distribution, the feedforward network demonstrated
the necessary acquisition of systematicity with respect to the auto-association of
N-tuples task. The criterion for the necessary acquisition of systematicity was that
the time required to acquire the target function to within 95% accuracy with 99%
confidence was no more than a polynomial function of tuple order.

9 One pattern presentation involves a forward propagation of activity followed by a backprop-
agation of error, after which weights are updated.

Figure 3.6: Hyperplane orientation for an 8-2-8 encoder. Characteristic orientation
of hyperplanes (dark solid lines) and hidden unit vector representations (solid
squares) for local output encoding (a) and block output encoding (b) for 1-tuple
auto-association of 8 possible components (i.e., N = 1, k = 8 and h = 2). In both
panels, the axes are the activations of the two hidden units.
3.5 General Discussion
Having presented a case for a Connectionist model that necessarily exhibits sys-
tematicity, the next question to ask is whether such a demonstration addresses Fodor
and McLaughlin's concerns regarding the necessity for systematicity given the as-
sumptions of the architecture. There are three areas in which this demonstration
could be criticized.
- The assumptions made in the architecture were too strong. These assumptions were:
  - polynomial growth in the number of variably weighted connections;
  - complete connectivity between layers;
  - block encoding of representations at the output layer.
- The criterion for the necessary acquisition of systematicity is too weak.
- The demonstration of the feedforward network in meeting this criterion was too narrow.
In this section, each of these points is discussed in turn.
A polynomial growth in the number of connections would appear to be a reasonable
assumption. An architecture requiring more resources (e.g., exponential) soon be-
comes unrealistic with even small-sized functions. It is this criterion that computer
scientists use to separate feasible and infeasible algorithms. However, Halford et al.
(1994) have argued that an assumption of exponential resource is reasonable if one
considers that the order of relations that humans typically can process is limited
to 4 or 5. Furthermore, an architecture that required exponential resources would
explain these processing capacity limitations.
The assumption of complete connectivity is probably unrealistic given that it
is generally known that real neurons are not completely connected. However, the
theoretical result is a statement about the number of weights, not the connectivity
of the network, and so still applies. In the empirical case, the network relied on
complete connectivity between layers since the number of hidden units to represent
every tuple instance was minimal10. This restriction could be relaxed by using more
hidden units. An increase in hidden units would increase the training time, but it
is not expected to be more than polynomial provided the number of hidden units
is not more than polynomial.
The strongest assumption is block encoding. Block encoding imposes, a priori,
an ordering, or similarity, on component representations. For example, the vector
1 1 1 0 0 0 is closer to the vector 0 1 1 1 0 0 than to the vector 0 0 1 1 1 0 under
the Euclidean distance metric. However, in the previous chapter it was mentioned
that systematicity is a property at the level of compositional objects, not at the level
of atomic objects. Thus, there should be no a priori similarity between components.
10 The minimum number of hidden units for 1-tuple auto-association is two (Phatak, 1993).
In the subsequent chapters, the assumption of external representation is weakened
by using local encoding for output representations.
In the debate over the import of systematicity, no quantification of empirical
evidence was given to support or refute Connectionist models. Thus, there are no
absolute levels of generalization and confidence with which to test models. The
importance of the theoretical result is that for any speci�ed level of accuracy and
con�dence, the sample complexity is a polynomial function of the tuple order.
The accuracy and confidence levels used in the empirical studies, although
somewhat arbitrary, were chosen to show that the network could attain a high degree of
generalization with a high degree of confidence. It is possible that people acquire
systematic behaviour, to whatever absolute levels, from fewer than polynomially
many examples. In this case, a tighter criterion for systematicity is required. In the next
chapter, the feedforward network is re-examined on a different criterion based on
linguistic evidence.
A cognitive architecture must not only be able to represent complex objects, but
also make inferences from them. For example, a cognitive architecture must be able
to infer Sue from the sentence-question pair Sue went to the store. Who went to the
store? (i.e., the architecture must be capable of extracting, on request, components
from complex objects). In the next chapter, the feedforward network is re-examined
on a systematicity of inference task, as well as a systematicity of representation
task. The domain of N -tuples was chosen because the number of complex objects
grows exponentially with N . Given the work of Halford et al. (1994) it may be
more appropriate to examine growth in terms of number of possible components
with tuple order �xed (i.e., training time as a function of k), as was done in (Wiles,
Phillips, & Norris, 1993). With a PAC-learnability criterion, however, this would
not be an interesting case as the total number of objects in the domain grows only
as a polynomial of k. Thus, perfect systematicity can be demonstrated by simply
memorizing all objects.
3.6 Summary and conclusion
In this chapter, the necessary acquisition of systematicity has been defined in terms
of PAC-learnability. That is, a network is said to necessarily acquire a systematic
behaviour if, with a high degree of confidence, the time required to learn that
behaviour to a high degree of accuracy is no worse than a polynomial function of
a parameter that measures the size or complexity of that behaviour.
Previous work by Brousse and Smolensky (1989) and Brousse (1991), which
showed an exponential number of generalizations with respect to tuple order (N) in
an auto-association of N -tuples task, failed to meet this criterion as the percentage
of future examples correctly represented decreased as an exponential function of
tuple order. Clearly, the 50 training examples they used for all values of N were
too few. The questions then asked were: "How many training examples would be
required?" and "How much time is needed to load these examples?" in order to meet
the necessary acquisition of systematicity criterion used in this chapter.
The theoretical result showed that at most a polynomial number of examples
was required to attain a high degree of generalization. The empirical results
showed that not only was the number of training examples polynomial, but the
time taken to load these examples was also a polynomial function of N. From
these results it was concluded that, assuming a restricted architecture (polynomial
number of weights), the feedforward network exhibits the necessary acquisition of
systematicity, defined in terms of PAC-learnability, with respect to the
auto-association of N-tuples task.
It was suggested that the PAC-learnability criterion for the necessary acquisition
of systematicity was too weak. It may be the case that people acquire systematic
behaviours from fewer examples. In the next chapter, a second criterion, based
on linguistic evidence, is considered. The feedforward network is then re-examined
with respect to this criterion.
Chapter 4
Systematicity as generalization
across position
4.1 Introduction
In the previous chapter PAC-learnability was used as the criterion for the necessary
acquisition of systematic behaviour against which the feedforward network was
examined. This criterion comes from the computer science notion of a feasible
algorithm and allows one to reject potential cognitive architectures on the basis
that they require too much resource (e.g., time) to acquire even modestly sized
behaviours.
In this chapter, an alternative criterion called strong systematicity (Hadley,
1993) is considered. This criterion, based on linguistic evidence of generalization,
is presented in section 2. Feedforward network and recurrent network architectures
are examined, by simulation and analysis, on two tasks in which it is possible to
exhibit strong systematicity. Those tasks are: the auto-association of 2-tuples, on
which it is possible to exhibit strong systematicity of representations (section 3);
and querying of 2-tuples (i.e., after presenting an ordered pair, query the network
for the first or second component), on which it is possible to exhibit strong sys-
tematicity of inference (section 4). A summary of these results and conclusion are
given in section 5.
4.2 Hadley's strong systematicity
In addition to the work concerned with the generalization capabilities of feed-
forward networks in the previous chapter, Connectionists have provided a num-
ber of other models that have also demonstrated generalization over structured
domains. For example, Elman's (1990) simple recurrent network and Pollack's
(1990) recursive auto-associative memory correctly processed sentences not present
in the training set. Hadley (1993) questioned whether the degrees of generalization
demonstrated by these and other Connectionist models constitute what Fodor and
Pylyshyn intended as systematicity.
Hadley addressed this question by, first, reformulating the definition of system-
aticity based on linguistic evidence and, second, evaluating a number of Connec-
tionist models with respect to this definition.
4.2.1 Strong systematicity defined
Strong systematicity is the above chance level capability of correctly processing
sentences containing words in novel syntactic positions. A model or person is said
to be strongly systematic if they can, for example, correctly process the sentence
Mary loves John having never before seen John in the patient position. In other
words, strong systematicity is generalization across syntactic position. (Note that
it is not sufficient for a model to correctly process a novel sentence on only one
occasion. Say, for example, there are five possible words from which to choose.
A network that simply outputs words at random has a 20% chance of a correct
response. A model's performance must be significantly above this level for it to
have exhibited strong systematicity.)
There is ample linguistic evidence of people, and even young children, exhibiting
at least this degree of systematicity. For example, children are able to coin new
verbs from nouns, as in the example provided by Hadley, It was bandaided, and
subsequently use the new verbs in their passive form. More dramatically, adults
given a sentence containing a nonsensical noun (which they are highly unlikely to
have seen before) can correctly process sentences with the new noun in other novel
syntactic positions.
Strong systematicity is another criterion against which potential cognitive mod-
els can be evaluated. In the previous chapter, PAC-learnability was used as the
criterion for rejecting potential models on the basis that they require more com-
putational resource than could possibly be available. In this chapter, strong sys-
tematicity is used as the criterion for rejecting potential models on the basis that
they require more resource (i.e., training examples) than is necessary for people to
acquire systematic behaviour.
Given Hadley's strong systematicity criterion then, the question of concern in
this chapter is whether Connectionist models, including the feedforward network
architecture of the previous chapter, can exhibit this degree of systematicity.
4.2.2 A review of Connectionist models
The starting point for this discussion is with Hadley's conclusion: After reviewing
a number of Connectionist models, Hadley (1993) concluded that Connectionist
models have not exhibited strong systematicity. His conclusion was based on a
closer examination of the training scheme applied to the models of McClelland and
Kawamoto (1986), Chalmers (1990a), Elman (1989, 1990, 1991), Niklasson and
Sharkey (1992), Pollack (1990), Smolensky (1990), and St. John and McClelland
(1990). These models, which have all demonstrated generalization over structured
domains, could be claimed as refutations of Fodor and Pylyshyn's thesis of compo-
sitionality and systematicity without the need for symbol structures. Yet, in each
one of these models, there has not been a clear demonstration of strong system-
aticity. Essentially, a statistical analysis of the training sets revealed that in all
probability every word appeared in every one of its allowable syntactic positions in
the training set.
Hadley termed the degree of generalization exhibited by these models as being
either weak or quasi-systematicity. Weak systematicity is characterized as gener-
alization to novel sentences in which every word has previously appeared in that
syntactic position in the training set. For example, a weakly systematic model is
one that could only represent and process the sentence John loves Mary having al-
ready been trained on sentences where John appeared in the agent position, Mary
appeared in the patient position, and loves in the action position, although not
necessarily in that combination. Mary, for example, could have appeared in the
sentence Tim loves Mary.
Quasi-systematicity is characterized as generalization to novel sentences con-
taining embedded clauses in which every word in the embedded or surrounding
clause can be found in the same position of some simple sentence in the training
set. For example, a model that could only represent and process the sentence Tim
knew John loves Mary having already been trained on the sentences Tim knew Sue
and John loves Mary demonstrates quasi-systematicity, not strong systematicity,
as the words Tim and knew appearing in the agent and action positions (respec-
tively) in the surrounding clause appeared in the same positions in the simple
sentence Tim knew Sue in the training set. Similarly, the words John, loves and
Mary appearing in the agent, action and patient positions (respectively) relative
to the embedded clause, also appeared in the same positions in the training set.
Hadley found that the models of McClelland and Kawamoto (1986), Chalmers
(1990a), Elman (1989, 1990, 1991), Smolensky (1990) and St. John and McClelland
(1990) have not demonstrated anything more than weak systematicity, and that
Pollack (1990), and Niklasson and Sharkey (1992) have not demonstrated any-
thing more than quasi-systematicity. Furthermore, the demonstrations of quasi-
systematicity make an unrealistic assumption by preprocessing the input into tree
structures.
Subsequent work by Niklasson (1993) claims to have demonstrated strong sys-
tematicity on a transformation of logical expressions task. However, component
representations presupposed a similarity based on their position, or role within a
complex expression. For example, the proposition symbols (p, q, r, s) all shared
a common feature in their vector representation. Furthermore, the so-called novel
component, s, on which the network was tested had as its representation the com-
4.2. HADLEY'S STRONG SYSTEMATICITY 67
mon feature with no other features (vector components) active. Consequently, the
generalization the network demonstrated was due to the common feature on which
the network had already been trained in the components of p, q, r, and so cannot
be considered a demonstration of strong systematicity. See also Hadley (1993) for
a similar comment on McClelland and Kawamoto (1986).
Component representations
The use of similar component representations also occurs in Chalmers (1990a),
where words of a common class (e.g., verbs, nouns) share a common feature in their
vector representations; and, in Niklasson and van Gelder (1994), where compo-
nent representations of objects belonging to a common category (e.g., proposition,
conjunctive) are constructed with vectors representing these categories. However,
this information is, in general, not explicitly available within the external repre-
sentations of component objects1. The information identifying which category a
component belongs to is available implicitly in terms of the word's relative position
in a sentence. These two models assume a similarity that Connectionist networks
are expected to learn. Such an assumption begs the question that the criterion of
strong systematicity poses: Can Connectionist models acquire position-based simi-
larities and consequently demonstrate generalization across position given category
information implicit within the relative positioning of component objects?
The point is that systematicity is a property at the level of compositional ob-
jects, not at the level of external component objects. In language, systematicity is
a property at the level of sentences, not at the level of words (Fodor & Pylyshyn,
1988). That is, the ability to represent and process sentences of a particular struc-
ture relates to other sentences conforming to the same structure.
1 In the sense that there is some feature of the component object that identifies its category.
Information can also come in the form of semantics: knowledge of (past experience with) a word
can identify its category. For example, experience with the word Mary and its associated object
provides one with the information that this word is a noun. However, semantics alone does not
necessarily identify a category. For example, dog is both a noun and a verb (to chase relentlessly).
Furthermore, novel words in isolation have no semantic content.

Orthogonal external representations of components. To appreciate the
significance of component representations, suppose a network is required to learn
to auto-associate 2-tuples. Suppose, also, that this network is composed of two
unconnected subnetworks, where each subnetwork consists of a single input unit
connected to a single hidden unit, which is connected to a single output unit.
Each component is represented, externally to the network, by a single scalar value
between zero and one. For example,
- Ann: 0.1;
- Bill: 0.3;
- John: 0.5;
- Karen: 0.7;
- Vivian: 0.9.
Auto-associating the tuple (John, Bill) requires performing the mapping
(0.5, 0.3) → (0.5, 0.3). Since there is a linear dependency built into the in-
put/output representations, having only seen two of the atomic items, the network
would generalize to the others, therefore demonstrating generalization across posi-
tion. Yet, the two subnetworks are independent of each other; since they are not
connected, no information can be conveyed between them. Thus, strong systematicity
in this case is a consequence of similarity at the component level.
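A minimal sketch of this point (the training loop and learning rate are illustrative):
a single linear weight trained on only two of the scalar codes recovers the identity
map, and so handles the unseen codes for free.

    # One scalar subnetwork: a single linear weight w trained to auto-associate.
    w = 0.0
    for _ in range(200):                 # plain gradient descent on squared error
        for x in (0.5, 0.3):             # only John (0.5) and Bill (0.3) are ever seen
            w += 0.5 * (x - w * x) * x   # delta-rule update, learning rate 0.5
    print(round(w * 0.9, 3))             # ~0.9: the unseen code for Vivian maps correctly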
Of course, there may be a priori similarities between component objects. For
example, in the set of possible phonemic components of English words, "dee" is
closer to "tee" than to the "a" sound in "cat". It is an empirical question as to the
degree of similarity there is between component objects and its impact on exhibiting
systematicity. However, given that systematicity is such a pervasive property across
languages and modalities, it is assumed that a priori similarity at the component
level does not play a significant role in systematicity, which is a property at the
level of complex objects. Consequently, when examining Connectionist models for
systematicity there must be no a priori similarity between the external represen-
tations of component objects. In subsequent simulations and analyses, orthogonal
(actually local2) representations are used.
4.2.3 Summary of review
In all cases, with the exception of Niklasson (1993) and Niklasson and van Gelder
(1994), Connectionist modelers did not set out to demonstrate strong systematicity
per se. Their main concern was to demonstrate some degree of generalization over
a structured domain. With that point in mind it is perhaps not surprising that
none of the models demonstrated strong systematicity.
However, a counterpoint to this argument is that it seems a remarkable coinci-
dence that, given that generalization is what all modelers seek, none of the models
has shown a greater degree of generalization. This coincidence raises the question
of whether there is not some deeper, more fundamental reason why Connectionist
models have not demonstrated strong systematicity. The next two sections are di-
rected at determining architectural properties that permit/prevent Connectionist
models from exhibiting strong systematicity.
4.3 Strong systematicity of representation
The purpose of this section is to elucidate the architectural properties of particular
neural network models that permit or prevent the model from displaying strong
systematicity of representation. Before analyzing particular models, there first must
be a task on which it is possible to exhibit strong systematicity.
4.3.1 Task: Auto-association of 2-tuples
In the previous chapter, auto-association of N -tuples was the task used to evaluate
the systematicity of representation of feedforward networks. In language, simple
sentences representing binary relations of the form Noun transitive-verb noun are
instances of 3-tuples. For example, the simple sentence Mary loves John is an in-
stance of the 3-tuple (Mary,loves,John). In the case of binary relations, a model
2 One vector component on, the rest off.
exhibits strong systematicity when, for example, having only ever been given ex-
amples of tuples with Mary in the first position, the model generalizes to examples
with Mary in the third position. With respect to exhibiting strong systematicity
over any one binary relator (e.g., loves), the second position component is com-
mon to both the trained and generalized examples, and so carries no information
regarding generalization. Therefore, assuming that the common component does
not incur a signi�cant learning overhead, the domain can be simpli�ed to a set of
2-tuples, where the agent component is in the first position and the patient component
is in the second position.
The advantage of making this assumption is that it aids in analysis of Connec-
tionist models. Furthermore, a model that cannot demonstrate strong systematicity
on this task will not demonstrate strong systematicity when the common compo-
nent is included since, as already mentioned, the common component carries no
additional information.
As in the previous chapter, systematicity of representation can be cap-
tured by a network's ability to auto-associate the input. In the domain
of 2-tuples, ordered pairs are constructed from a set of five atomic objects
{Ann, Bill, John, Karen, Vivian}. A network or model is considered to have
demonstrated strong systematicity of representation when it can correctly auto-
associate (above chance level on repeated trials), for example, (Mary, John), having
only ever seen Mary in the patient (second) position.
4.3.2 Feedforward network
The feedforward network used to examine generalization over a structured domain
in (Brousse & Smolensky, 1989; Brousse, 1991) and in the previous chapter had
at the input and output layers one unit per component per position. For the
task described in this section, this means that there are 10 input units completely
connected to a number of hidden units that are, in turn, completely connected to
10 output units (Figure 4.1).
Figure 4.1: The feedforward network architecture for auto-association of 2-tuples.
Dashed lines identify individual units within a group of units. The left and right
shaded boxes indicate the nodes that extract, from internal representations, the
Mary component in the agent and patient positions, respectively.

Furthermore, to demonstrate strong systematicity there must be at least one
component that does not appear in one of the positions in the training set. Suppose,
without loss of generality, that the component Mary appears only in the agent
position in the training set. The network is considered to have demonstrated
strong systematicity if, on testing, it correctly auto-associates tuples with Mary
in the patient position.
Independent weights
With this architecture, each output unit is dedicated to detecting a particular com-
ponent in a particular position from the network's hidden unit (internal) represen-
tation of the complex object. For example, one output unit detects the presence
or absence of Mary in the agent position. The correct performance of this task
depends on orientating the Mary-agent hyperplane so as to position all points in
the training set into two classes: Mary in the agent position; and, Mary not in the
agent position (Figure 4.2). If there are enough training points then the network
will also correctly classify test cases.
However, for the output unit that detects Mary in the patient position there is
only one class of points in the training set (i.e.,Mary not in the patient position), by
the requirement of demonstrating strong systematicity (Figure 4.3).

Figure 4.2: Orientation of the Mary-agent hyperplane in hidden unit activation
(internal representation) space. The empty circles are points (or vectors) repre-
senting structured objects that have Mary in the agent role as one of their compo-
nents. The solid squares are points representing structured objects that do not.

With only one class of points, the training set provides no information regarding the orientation
of the hyperplane to correctly classify points with Mary in the patient position,
and so the network cannot be expected to generalize from components appearing
in the agent position to components appearing in the patient position. That is, the
network cannot demonstrate strong systematicity.
In any one trial, it is possible that in the event of correctly classifying examples
from the training set, the Mary-patient hyperplane is fortuitously positioned so
that the network also correctly classifies the test examples where Mary appears in
the patient position. However, considering the information required to correctly
position this hyperplane, this event is highly unlikely. Since the training set con-
tains only one class of point with respect to the hyperplane, there are two choices
per dimension of the representation space for correct classification of all points
in the training set (i.e., position the hyperplane along a dimension so that all
points lie to its left or to its right). In an N-dimensional space, an estimate of the
likelihood of positioning the hyperplane so as to also correctly classify the test
examples is 1/2^N. For the positioning of hyperplanes in
the hidden units space, N is the number of hidden units. Thus, the likelihood
of generalization across position in any one trial decreases exponentially with N,
asymptotically approaching zero. Since the likelihood of generalization across
position is below chance level, this feedforward network architecture cannot exhibit
strong systematicity.
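The point can be illustrated with a one-class training run (the dimensions, data,
and learning rate here are arbitrary): an output unit whose training targets are
always "absent" simply learns to answer "absent" everywhere, including on a
genuinely novel Mary-patient input.

    import numpy as np

    rng = np.random.default_rng(1)
    H = rng.normal(size=(100, 20))   # hidden representations of training tuples
    w, b = np.zeros(20), 0.0         # the Mary-patient output unit

    for _ in range(500):             # delta-rule training; every training target is 0
        y = 1 / (1 + np.exp(-(H @ w + b)))
        w -= 0.1 * H.T @ y / len(H)  # cross-entropy gradient with all-zero targets
        b -= 0.1 * y.mean()

    test = rng.normal(size=20)       # a hypothetical novel Mary-patient representation
    print(1 / (1 + np.exp(-(test @ w + b))))  # near 0: the unit never detects Mary-patient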
A similar argument can be made for the input-to-hidden weights. Again, with
respect to the input space, the hyperplanes associated with the hidden units are
trained on only one type of point along the Mary-patient input dimension (i.e.,
Mary not in the patient position). Therefore, along this dimension (which is the
only dimension relevant to the encoding of the Mary-patient component) there is
no information provided in the training set, and therefore the network cannot be
expected to demonstrate strong systematicity.
Essentially, the problem with the feedforward network is that there is a sepa-
rate set of weights that implement the functions that encode/decode the agent and
patient components (Figure 4.4(a)).

Figure 4.3: Orientation of the Mary-patient hyperplane in hidden unit activation
(internal representation) space. The solid squares are points representing struc-
tured objects that do not have Mary in the patient role as one of their components;
there are no points in this space representing structured objects with Mary in the
patient position.

Furthermore, the domains of the two input-to-
hidden layer functions and the co-domains of the two hidden-to-output layer func-
tions are independent. Consequently, learning these functions necessitates training
on every input-output pair from their respective domains and co-domains. The
reason these component functions must be trained on every point is that, a priori,
there is no similarity between the external representations of components.
This result also applies to Smolensky's (1987a) recirculation algorithm, which is
just a special case of the backpropagation algorithm. The recirculation algorithm
applies to a feedforward network where the input to hidden and hidden to output
weights are identical. In this case, the three-layer network can be considered as
a two-layer network (one input/output layer and one hidden layer). To modify
weights, activity at the input/output layer is propagated to the hidden layer (re-
sulting in a vector ~p0), then recirculated back to the input/output layer and then
back up to the hidden layer (resulting in a vector ~p1). The weights can be updated
as a function of the difference vector ~p0 − ~p1 using the delta rule3 (Rumelhart et al.,
1986). Again, as the algorithm is only trained on one type of point (with respect
to Mary), the training set does not contain any information regarding the correct
weight values for generalization across position.
3 The delta rule is backpropagation for two layers of units (i.e., one set of connecting weights).
Backpropagation is applicable to arbitrarily many layers of units, and hence is also called the
generalized delta rule.
Weight tying
One method of introducing generalization across position is with the use of weight
tying. Weight tying links two or more network weights so that the update of one
weighted connection automatically updates the weights of the linked connections.
Weight tying has been used successfully by LeCun et al. (1989) as a way of intro-
ducing translation invariance for optical character recognition.
Similarly, weight tying could be introduced in the auto-association task to
demonstrate strong systematicity by linking the weights that implement the first
position mapping (i.e., from input to hidden and from hidden to output) to the
weights that implement the second position mapping. Effectively, weight tying
makes available multiple copies of the functions that map component representa-
tions (Figure 4.4(b)).

Figure 4.4: Standard feedforward network architecture for the auto-association task
without weight tying (a) and with weight tying (b). With weight tying, the previ-
ously independent functions (f1 and f2, g1 and g2) are now linked by the same set of
weights (i.e., f1 = f2 = f, g1 = g2 = g). A recurrent network architecture (c) incor-
porates partial weight tying by making the input-to-hidden and hidden-to-output
layer weights for each component the same weights. In each case, arrows indicate
complete connectivity from units in the source layer to units in the destination
layer.
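The tying scheme of Figure 4.4(b) can be sketched as follows (the shapes and names
are illustrative). Because both positions share F and G, an update driven by a
component in either position changes the mapping applied to both, which is the
intended source of cross-position generalization:

    import numpy as np

    k, h = 5, 3
    rng = np.random.default_rng(0)
    F = rng.normal(size=(h, k))  # shared input-to-hidden map f (tied across positions)
    G = rng.normal(size=(k, h))  # shared hidden-to-output map g (tied across positions)

    def forward(x1, x2):
        # Each component passes through the SAME weights, one hidden code per position.
        # With a single shared hidden vector, one g could not return both x1 and x2,
        # which is the extraction problem discussed below.
        h1, h2 = np.tanh(F @ x1), np.tanh(F @ x2)
        return G @ h1, G @ h2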
The problem with weight tying is that it presupposes knowledge of the very
weights that are to be learnt. Furthermore, with complete weight tying it is not
possible to extract the two component representations from the internal hidden
unit representation because the same hidden to output unit function g is required
to perform two di�erent mappings for the same hidden unit vector. There does
not exist a set of weights to implement the function g, such that, g(~h) = ~x1 and
g(~h) = ~x2, where ~h is the internal hidden unit representation, of the 2-tuple (x1; x2);
and ~x1 and ~x2 are the output representations of the �rst and second components,
respectively.
Recurrency
Another method of incorporating a dependency between component mapping func-
tions is by making use of the same set of input (output) units for presenting (ex-
tracting) component representations. In this way, dependency between the com-
ponent mapping functions is introduced as the functions are implemented over the
same set of weights.
Of course, presenting all components over the same set of units means that
representations constructed at the hidden layer will be lost on presentation of sub-
sequent components, thereby preventing the network from constructing complex
representations. However, previously constructed representations can be main-
tained by providing feedback connections that retain representations for future
processing (e.g., Figure 4.4(c)). Architectures with feedback connections are called
recurrent networks. In the next subsection, a recurrent network is investigated
for its ability to demonstrate strong systematicity of representation on the auto-
association task.
4.3.3 Simple recurrent network
In the previous subsection, it was shown that the standard feedforward network
(i.e., without weight tying) could not demonstrate strong systematicity of repre-
sentation. Essentially, the lack of strong systematicity was a consequence of the
independence of the weights that implement the functions which map components
to and from internal representations.
This result suggested a more structured network where components are pre-
sented at the same set of input units and consequently, mapped internally by the
same set of weights. A recurrent network architecture has such a structure. In
general, a recurrent network has a set of input units, a set of hidden units, a set of
context units (which hold the hidden unit activations from the previous time step)
and a set of output units.
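One time step of such a network can be sketched as follows (the unit counts follow
Figure 4.5, and the function name is ours):

    import numpy as np

    def srn_step(x, context, W_in, W_ctx, W_out, b_h, b_o):
        # New hidden state from the current input and the previous hidden state.
        h = 1 / (1 + np.exp(-(W_in @ x + W_ctx @ context + b_h)))
        y = 1 / (1 + np.exp(-(W_out @ h + b_o)))
        return y, h  # h is copied back to the context units for the next time step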
There are a number of ways of connecting each set of units. The architecture
used in this subsection is Elman's (1989) simple recurrent network (Figure 4.5).
The performance of the simple recurrent network on a temporal version of the auto-
association of the 2-tuple task is tested and analyzed for its ability to demonstrate
strong systematicity.
Simulation
The task of the recurrent network is essentially the same as for the feedforward
network described in the previous subsection except that instead of presenting both
components simultaneously, components are presented one per time step. After
presenting both components, at which time the network should have constructed
some internal representation of the ordered pair, the network is required to output
the ordered pair in the order that it was presented. Figure 4.5 shows an example
of the network and the input-output mapping it is required to perform.
Each trial of the network consisted of:
- Random generation of training examples from a distribution where all five
  atomic objects can appear in the first position, but only four atomic objects
  can appear in the second position. The training set was large so that in all
  likelihood all 20 combinations occurred. Weights were updated at the end of
  each sequence (i.e., every four patterns). Context units were reset to zero at
  the beginning of each sequence.
- Random initialization of network weights from a uniform distribution between
  -1 and 1.
- Training of the network using the standard error backpropagation algorithm
  with a 0.1 learning rate, no momentum term, and the sum of squares error
  function (Rumelhart et al., 1986) until performance on the training set was
  within 0.5 of the target for all output units for all patterns in the training
  set.
- Testing on all remaining combinations (i.e., every atomic object in the first
  position combined with the atomic object left out of the second position in
  the training set).
� Recording of network response to all output patterns in the test set. Two
testing criteria were used. A network response was considered correct when:
(1) the maximally activated output unit had a target activation of one -
Figure 4.5: The simple recurrent network and the temporal version of the auto-association of 2-tuples task. Parenthesized values indicate the number of units used in simulations (Input(5), Hidden(20), Context(20), Output(5)). Numbers indicate specific time periods. Dashes (-) indicate zero input. Starred (*) output indicates the auto-association of the current input to assist in the formation of internal representations. However, for the purpose of evaluating strong systematicity, only output at time steps 3 and 4 (i.e., auto-association of the ordered pair) was considered.
In addition, the number of training examples (ordered pairs) was varied from
10 to 200. Since each network was initialized from a random set of weights, each
train and test trial was repeated five times for each training set size.
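As a rough sketch of the trial procedure just described (illustrative Python written for this edition, not the original simulation code; the object names and helper functions are assumptions), the training distribution and the two correctness criteria can be written as:

    import numpy as np

    rng = np.random.default_rng()

    OBJECTS = ["John", "Mary", "Bill", "Ann", "Vivian"]  # illustrative names
    HELD_OUT = "Vivian"       # object withheld from the second position

    def one_hot(obj):
        v = np.zeros(len(OBJECTS))
        v[OBJECTS.index(obj)] = 1.0
        return v

    def sample_training_pair():
        # All five objects may appear first; only four may appear second.
        first = rng.choice(OBJECTS)
        second = rng.choice([o for o in OBJECTS if o != HELD_OUT])
        return one_hot(first), one_hot(second)

    # Test set: every object first, combined with the held-out object second.
    test_pairs = [(one_hot(o), one_hot(HELD_OUT)) for o in OBJECTS]

    def correct_maximum(output, target):
        # Maximum criterion: the most active output unit has target one.
        return target[np.argmax(output)] == 1.0

    def correct_half(output, target):
        # 0.5 criterion: every output unit is within 0.5 of its target.
        return bool(np.all(np.abs(output - target) < 0.5))

Weights would then be initialized uniformly in [−1, 1] and trained by backpropagation (learning rate 0.1, no momentum) until the 0.5 training criterion is met, as itemized above.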
Result
The recurrent network showed perfect performance (for both maximum and 0.5
testing criteria) on the unseen object in the second position in the test set when
trained on 200 ordered pairs. When trained on 50 ordered pairs, performance
dropped to a mean of 76% (maximum criterion) and 36% (0.5 criterion) over 5
trials (Figure 4.6).
Discussion
The purpose of this simulation was the demonstration of strong systematicity with
respect to the representation of 2-tuples.
Generalization across position occurred in all trials when only one of the items
from the second position was left out of the training set and the network was
trained on all other combinations.⁴
There is, however, a question over the robustness of the demonstration of strong
systematicity with this recurrent network. When there were 50 pairs in the training
set the network was less likely to have seen all combinations, yet likely to have seen
all five objects in the first position and all four objects in the second position.
However, generalization to the remaining fifth object in the second position dropped
dramatically to 76% (maximum criterion) and 36% (0.5 criterion) averaged over five
trials. When only 10 pairs were present in the training set, generalization dropped
to below chance level performance.
⁴ Since there are only 20 remaining pairs (i.e., 5 × 4), it is likely that every pair appeared somewhere in a training set of 200.
Figure 4.6: Generalization to second position as a function of the number of training patterns. The number of correct responses on the test set using the maximum (solid line) and 0.5 (dashed line) test response criteria was averaged over 5 trials. Error bars indicate 95% confidence levels. The horizontal dotted line indicates chance level performance. The number of training patterns is plotted on a log scale.
The problem that systematicity poses for Connectionist models is not only the
demonstration of such properties as generalization across position, but an explanation
as to how such properties necessarily occur as a consequence of the architecture.
In other words, regardless of the initial conditions (e.g., initial weight
values, specific training sets), how is it that the network consistently demonstrates
generalization across position? People, despite their varying biological and environmental
backgrounds, demonstrate generalization across position in language. It is
this consistency that Fodor and McLaughlin (1990) are referring to when they talk
of systematicity being a law of cognitive architecture. Therefore it is important
to understand how generalization across position was demonstrated by the simple
recurrent network in the case where it occurred in all five trials. In the next subsection,
the internal representational structure of the simple recurrent network is
analyzed to determine how generalization across position was consistently demonstrated
in this case.
Analysis of internal representations
Due to the high dimensionality of a network's internal representational space (whose
dimension is the number of hidden units) it is, in general, extremely difficult to determine the
nature of the internal representations constructed by a network to solve a particular
task. However, there are two analysis techniques that can be used to aid an
understanding of internal representations. They are: principal components analysis
(PCA), first used by Elman (1989) in his analysis of the internal representations
constructed by the simple recurrent network on his word prediction task; and
canonical discriminants analysis (CDA), first used by Wiles and Bloesch (1992) in
their temporal logic task.
A principal components analysis identifies the dimensions of greatest variance
of points in some space. The intuition behind its application to networks is that to
reduce error the network must learn to separate points corresponding to different
target outputs in its hidden unit activation space. Thus, dimensions along which
points vary contain important information about the problem, and will be identified
by a principal components analysis of points in the hidden unit activation space.⁵
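As an illustration of this use of PCA (a sketch in Python/NumPy written for this edition, not the analysis package cited below; the array names are assumptions), the principal components of saved hidden unit activations can be obtained from the eigenvectors of their covariance matrix:

    import numpy as np

    def principal_components(hidden_points):
        """Project points in hidden unit activation space onto their
        directions of greatest variance (the principal components)."""
        centered = hidden_points - hidden_points.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]        # largest variance first
        return centered @ eigvecs[:, order]      # scores on PC1, PC2, ...

Plotting the first two columns of the returned scores, labeled by input object, gives plots of the kind shown later in Figures 4.7 and 4.8.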
Canonical discriminants analysis identifies dimensions in some space that maximize
the ratio of the between-groups variance to the within-groups variance of
points in that space. In other words, it finds directions in the space along which
points belonging to the same group are tightly clustered, yet separated from points
belonging to other groups. This technique is particularly useful in combinatorial
domains, where one can hypothesize network representations in which points are
aligned by groups implicit in the training data, but not made explicit in any error
signal⁶ (Wiles & Ollila, 1993).
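A corresponding sketch for CDA (again illustrative Python, assuming NumPy and SciPy rather than the package used in the thesis) solves the generalized eigenproblem for the between-groups and within-groups scatter matrices:

    import numpy as np
    from scipy.linalg import eigh

    def canonical_discriminants(points, labels):
        """Directions maximizing between-groups variance relative to
        within-groups variance."""
        mean = points.mean(axis=0)
        d = points.shape[1]
        S_b = np.zeros((d, d))  # between-groups scatter
        S_w = np.zeros((d, d))  # within-groups scatter
        for g in np.unique(labels):
            grp = points[labels == g]
            diff = (grp.mean(axis=0) - mean)[:, None]
            S_b += len(grp) * (diff @ diff.T)
            S_w += (grp - grp.mean(axis=0)).T @ (grp - grp.mean(axis=0))
        # Solve S_b v = lambda S_w v; a small ridge keeps S_w invertible.
        eigvals, eigvecs = eigh(S_b, S_w + 1e-9 * np.eye(d))
        order = np.argsort(eigvals)[::-1]
        return points @ eigvecs[:, order]

Grouping the second-time-step activations by the first input object, for example, yields the kind of projection shown later in Figure 4.9.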
The internal representations learnt by the simple recurrent network were analyzed
in the case where the network demonstrated generalization across position in
all five trials (i.e., when the training set consisted of 200 training examples). After
the network was trained to the 0.5 training criterion, all 200 training examples were
presented to the network and the resulting 800 hidden unit activation patterns were
saved. Principal components and canonical discriminants analyses were performed
on these points using the package developed by Dennis and Phillips (1991).
Input to internal subspace   The question to be answered is: how are the internal
representations organized so as to demonstrate strong systematicity? The first
step towards answering this question was to perform a PCA of the hidden unit
activations at the first and second time steps (Figure 4.7). The diagram shows
the first and second principal components of these points in hidden unit activation
space (i.e., the first and second directions of greatest variance, and presumably,
greatest information about representational organization). The points are labeled
on the basis of the current input, which for the first two time steps is also the
target output. The analysis shows that all points can be grouped into five clusters⁷
depending on the output requirement of the network at each point.
⁵ Thanks to Simon Dennis for this explanation.
⁶ In a combinatorial domain, two groups of points could have the same target output, which may make it difficult to find using principal components analysis, since separation of internal representations is driven, in part, by the error signal. (Of course, separation may also be driven by differences in input representations.)
⁷ The cluster labeled "Vivian" only contains one type of point (i.e., Vivian at the first time step). Since the network was tested for strong systematicity on this object, Vivian did not appear in the second position in the training set.
For example, the input pair (Bill, Ann) resulted in the point labeled "1Bill"⁸ at the
first time step, and the point labeled "Bill:Ann" at the second time step.
The activation space is organized in this manner due to the target outputs,
which at these two time steps were the current inputs. Consequently, error is
reduced when points that map to different outputs are separated, and a PCA has
identified this separation.
The PCA suggests that the major information contained in the representation
is the current object. However, the internal representation at the second time step
must still maintain information about the first item, otherwise the network could
not possibly recover the first item at time step three. The question is: how is this
information maintained?
The PCA of points at the first and second time steps shows some organization
within the five clusters. To identify this organization, a PCA of points at the
second time step only was performed and these points were plotted onto the first
and second principal components (Figure 4.8). The plot shows two levels of spatial
organization. At one level, points are grouped into four clusters based on their
output mapping⁹, just as in the previous PCA. At another level, there appears to
be the same within-cluster organization of points (labeled by their first object).
This consistency is suggested by the same rotational ordering of points within each
cluster.
Position independent internal subspaces   The consistency of within-cluster
organization suggests that first object information is being encoded along some
other dimensions, independent (by virtue of the apparent regularity) of second
object information. If this suggestion is correct then there may exist directions in
hidden unit activation space along which all four clusters can be superimposed.
⁸ Since the context units were reset to zero at the beginning of each sequence, the hidden unit activations at the first time step were not affected by the previous sequence. Consequently, all sequences with the same first component resulted in the same hidden unit activation vector at the first time step.
⁹ Again, there are only four clusters since one object (Vivian) was left out of the second position in order to test strong systematicity.
Figure 4.7: Principal components analysis of points in hidden unit activation space generated from training sequences at the first and second time steps. Points are labeled with either the first and second input objects (e.g., Ann:Bill indicates the point resulting from receiving Ann as the first object and Bill as the second object), or with a single word prefixed by 1, indicating a point resulting from only receiving the first object. Dashed lines group points with the same output mapping.

Figure 4.8: Principal components analysis of points in hidden unit activation space generated from training sequences at the second time step. Points are labeled with the first input object. Points that are a result of the same second object are linked by a solid line. Each link is labeled with the object that was presented at the second time step (e.g., points in the top right-hand section of the graph are a result of Ann as the second input object).
Or, in other words, there may be a projection of points in hidden unit activation
space such that, for example, the point representing the pair (Karen, John) and
the point representing the pair (Karen, Ann) become the same.
If such a direction exists, it can be found using CDA by grouping all points on
the basis of the first object. CDA will identify dimensions in hidden unit activation
space such that points in the same group (e.g., (Karen, Ann), (Karen, Bill), (Karen,
Karen), and (Karen, John)) are close together, while points in different groups are
far apart.
A CDA was performed on all hidden unit activations at the second time step,
grouped by the object presented at the first time step. The points were then plotted
onto the first and second canonical discriminants (Figure 4.9). The plot shows that
the four clusters can be superimposed, and supports the suggestion that first object
information was encoded independent of second object information. In each case,
all points belonging to the same cluster differed in location by at most 1 × 10⁻⁶.
Another way of conceptualizing the representational organization is to say that the
network has buffered the input onto independent dimensions.
A CDA was performed on all hidden unit activations constructed at the first
and second time steps. The points were grouped on the basis of the input component
(which was also the target output component) presented at that time step. For
example, the hidden unit activations resulting from the object Karen having been
presented at the first time step or the second time step are identified, for the
purposes of CDA, as belonging to the same group. The points were then plotted
onto the first and second canonical discriminants (Figure 4.11). The plot shows
that the current input object was encoded on the same dimension independent of
its position, thus suggesting an organization of internal representations of ordered
pairs as characterized in Figure 4.10.
Internal subspace to output   A PCA of hidden unit activations at time step
three was performed and the points were plotted onto the first and second principal
components (Figure 4.12). The plot shows two levels of organization similar to time
step two.
Figure 4.9: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the second time step, grouped on the basis of the first input object. Points are labeled with the first input object. The fact that only five labels appear means that all points with the same label have been projected onto the same location.

Figure 4.10: Construction of ordered pair representations by buffering input through the same set of dimensions. Dashed lines indicate weights and hidden units at time step one, whereas solid lines indicate the same sets of weights and hidden units at time step two.
A CDA of these points grouped on the target output at time step three was
performed and the points plotted onto the first and second canonical discriminants
(Figure 4.13). The plot shows the first object has been maintained on two
dimensions independent of the second object.
A CDA of these points grouped on the second input object was also performed.
The points were plotted onto the first and second canonical discriminants
(Figure 4.14); the plot shows that second object information has been encoded on
two further dimensions independent of the first object.
A CDA of points at the third and fourth time steps grouped on the current
target output object was performed. The points were plotted onto the first and
second canonical discriminants (Figure 4.15). The plot suggests that the current
object was extracted along the same dimensions, independent of its position.
First-in first-out buffer   This series of analyses suggests that the organizational
structure of the simple recurrent network's internal representations is analogous
to a first-in-first-out buffer (Figure 4.16). By encoding and decoding component
representations via the same input and output sets of weights (respectively), the
network only needs to see each unique object in one of the positions, but not
necessarily both, as the simulations demonstrated.
Figure 4.11: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the first and second time steps, grouped on the basis of the current input/output object. Points are labeled with the input/output object at that time step.

Figure 4.12: Principal components analysis of points in hidden unit activation space generated from training sequences at the third time step. Points are labeled with the first and second input objects (e.g., Ann:Bill indicates the point resulting from receiving Ann as the first object and Bill as the second object). The solid lines link points that are a result of the same first input object.

Figure 4.13: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step, grouped on the basis of the first input object. Points are labeled with the first input object. The fact that only five labels appear means that all points with the same label have been projected onto the same location.

Figure 4.14: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step, grouped on the basis of the second input object. Points are labeled with the second input object. The fact that only four labels appear means that all points with the same label have been projected onto the same location.

Figure 4.15: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third and fourth time steps, grouped on the basis of the current target output object. Points are labeled with the current target output object.
When the first component is
presented to the network it is mapped to some subspace of the hidden unit activation
space. On the next time step, the network maps the internal representation
of the first component to a second independent subspace (set of independent dimensions)
to make the first subspace available for the second component. At the
third time step, the first component is mapped to a third subspace, at which point
it is extracted at the output layer, and the second component is mapped to the
second subspace. At the fourth time step, the second component is mapped to
the third subspace, at which point it is extracted at the output layer. (Note that
these subspaces were not localized to specific groups of hidden units. Furthermore,
the internal representations were not encoded locally. Figure 4.16 characterizes the
structure of hidden unit activation space as being composed of three subspaces. It
does not characterize any relationship between subspaces and hidden units.)
The analysis, however, only suggests a possible internal organization for the
simple recurrent network, since the two techniques do not consider the hidden-to-output
weights.¹⁰ Nevertheless, PCA and CDA have identified the existence of
dimensions in the hidden unit activation space along which component objects were
encoded independent of their position. Therefore, learning to extract components
along these dimensions will generalize to the other position. Thus, it is not necessary,
as the simulation results have shown, for the network to be trained on all
components in both positions. It was only necessary that each component appeared
in one of the two positions.
Importantly, though, the simple recurrent network only demonstrated perfect
generalization in all trials when all other ordered pairs appeared in the training set.
When there were fewer ordered pairs in the training set, there were solutions other
than the buffer solution which satisfied the requirements of the training set, but
which did not demonstrate generalization across position. In these cases, the simple
recurrent network required more information than was available in the training set
to demonstrate perfect generalization in all trials.
¹⁰ It is, in general, difficult to visualize the relationship between hyperplanes and internal representations for high dimensional spaces.
Figure 4.16: Idealization of the buffer solution by the recurrent network in demonstrating strong systematicity of representation. Shaded regions and solid arrows indicate modifiable weights. Dashed arrows indicate fixed weights. Starred (*) outputs at time steps 1 and 2 indicate auto-association of the current input to assist in the formation of internal representations. Dashed (-) inputs at time steps 3 and 4 indicate zero input. Parenthesized numerals indicate subspaces of the hidden unit activation space.
4.3.4 Summary of strong systematicity of representation
With respect to systematicity of representation it was shown that the standard
feedforward network could not demonstrate strong systematicity. The property of
the feedforward network that prevents it from exhibiting strong systematicity of
representation is the independence of weights that implement component mappings.
This independence means that there are too many positions in weight space that
implement the function that is specified by the training set. Consequently, there
is no information in the training set to determine, with confidence greater than
chance level, the function that will exhibit generalization across position on future
examples.
This result suggested an architectural bias whereby there is a dependency between
the weights that implement component mappings. The simple recurrent
network incorporates such a bias. Since component representations are presented
to the network at the same set of input units, they are mapped, in part, by the
same set of weights. Simulations showed that the simple recurrent network could
demonstrate strong systematicity of representation with respect to the 2-tuple task.
The organization of internal representations learnt by the network was characteristic
of a first-in-first-out buffer, where components were mapped from the input
to a common subspace (input phase), and from a common subspace to the output
(output phase). Because these mappings were implemented by the same set of
weights it was not necessary for the network to be trained on every component in
both positions. It was only necessary that components appear in at least one of the
positions. Consequently, the network was able to demonstrate strong systematicity
of representation on this task.
4.4 Strong systematicity of inference
Systematicity of inference is the ability to extract components from structurally related
objects. For example, one does not find people who are able to infer that John
went to the store from the statement John and Mary went to the store, but cannot
infer that Mary went to the store from the statement Mary and John went to the
store. As in the previous section, Hadley's strong systematicity is used as the cri-
terion for determining whether a Connectionist model demonstrates systematicity
of inference.
4.4.1 Task: Querying 2-tuples
In this systematicity of inference task, a network is presented with an ordered pair
and a query/question, which requests the first or second component of the ordered
pair. For example, given the pair (John, Bill) and the question that requests the
first component, the network should respond with John. As in the auto-association
of 2-tuples task, there are five possible atomic objects that may appear in either
the first or the second position. In addition, there are two query objects that
request either the first or second component of the ordered pair.
A network is said to have demonstrated strong systematicity of inference with
respect to this task if on the test set it can correctly infer components in component-
position combinations that did not occur in the training set. In the next subsection,
the simple recurrent network is examined for its capacity to exhibit strong system-
aticity of inference with respect to this task.
4.4.2 Simple recurrent network
The simple recurrent network, which demonstrated strong systematicity of repre-
sentation, is examined here on this systematicity of inference task.
Simulation
The simulation conditions for the querying of 2-tuples task are as follows:
• Local encoding of input and output vectors. As there are seven possible input
vectors in this task (i.e., 5 possible components plus 2 possible queries), there
are 7 input units to encode the input vectors, and 5 output units to encode
the 5 possible components.
• Generation of 40 training sequences (each sequence consisting of three patterns),
consisting of all 5 components in the first position combined with 4
possible components in the second position combined with the two possible
queries. The fifth component was used to test strong systematicity. Thus, the
test set consisted of 10 sequences (i.e., all 5 components in the first position
combined with the fifth component in the second position combined with the
two queries). A sketch of this construction appears after this list.
• Random initialization of weights from a uniform distribution in the range −1 to 1.
• Training of the network using the standard error backpropagation algorithm
with a 0.1 learning rate, no momentum term, and the sum of squares error
function (Rumelhart et al., 1986) until performance on the training set reached
criterion. Two training criteria were used: (1) all output units within 0.5,
and (2) all output units within 0.4, of the target output for every unit on
every training pattern. If the network did not reach criterion by the 10000th
epoch (where one epoch is the presentation of every training pattern) then
training was terminated. Weights were updated at the end of each sequence
(i.e., every three patterns). Context units were reset to zero at the beginning
of each sequence.
• Testing on all remaining patterns using two criteria for correctness: (1) the
maximally activated output unit corresponds to the target activation of 1
(the maximum criterion); (2) all output units are within 0.5 of their target
activation (the 0.5 criterion).
• Each train-test trial was repeated 10 times for each training criterion, with
weights being randomly initialized at the beginning of each trial.
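The construction of the training and test sets just itemized can be sketched as follows (illustrative Python written for this edition; the object and query names are assumptions, and the encoding follows the local scheme above):

    import numpy as np

    OBJECTS = ["John", "Mary", "Bill", "Ann", "Vivian"]  # illustrative names
    QUERIES = ["first", "second"]
    HELD_OUT = "Vivian"   # never appears in the second position in training

    def encode(symbol):
        """Local (one-of-seven) encoding over 5 objects plus 2 queries."""
        symbols = OBJECTS + QUERIES
        v = np.zeros(len(symbols))
        v[symbols.index(symbol)] = 1.0
        return v

    def make_sequence(first, second, query):
        # Three input patterns per sequence: component, component, query.
        # The target at time step 3 is the queried component.
        answer = first if query == "first" else second
        return [encode(first), encode(second), encode(query)], answer

    train = [make_sequence(a, b, q)
             for a in OBJECTS
             for b in OBJECTS if b != HELD_OUT
             for q in QUERIES]        # 5 x 4 x 2 = 40 training sequences

    test = [make_sequence(a, HELD_OUT, q)
            for a in OBJECTS for q in QUERIES]    # 10 test sequences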
The simple recurrent network was examined with 20, 10 and 8 hidden units.
In the case of the 8 hidden unit network, 0.5 was used as the training criterion.
Figure 4.17 shows the network and an example sequence of input-output pattern
pairs.
Figure 4.17: The simple recurrent network and the querying of 2-tuples task. Numbers indicate specific time periods. Parenthesized values indicate the number of units used in simulations (Input(7), Hidden(8, 10, 20), Context(8, 10, 20), Output(5)). Starred (*) output indicates the auto-association of the current input to assist in the formation of internal representations. However, for the purpose of evaluating strong systematicity, only performance on time step 3 was considered.
Results
In the case of 20 hidden units, the simple recurrent network learnt perfectly on
the training set on each of the 10 trials for both 0.5 and 0.4 training criteria (i.e.,
all output units were within 0.4 of their target outputs for all patterns in the
training set). However, in all trials performance on the test set was zero for both
maximum and 0.5 testing criteria where the network was required to extract the
second component (i.e., the component which did not appear in the second position
in the training set). The network did not correctly respond, in any trial, to any of
the 5 sequences where the network was required to extract the second component.
In the case where 10 hidden units were used, training reached the 0.4 criterion
in 7 of the 10 trials. In all trials, regardless of whether the network reached the
training criterion, the performance on the test set was zero, for both testing criteria,
with regard to the extraction of the second component.
When only 8 hidden units were used, only 1 of the 10 trials converged to the
0.5 training criterion. In all but two trials performance was zero, with 0.5 and
maximum testing criteria, on the extraction of the second component in the test
set. For the other two trials, test set performance was 1 out of 5 for maximum
testing criterion, 0 out of 5 for 0.5 testing criterion. Since there are 5 possible
components to choose from this level of performance is no better than chance.
Furthermore, in these two trials the network did not acquire perfect performance
on the training set within 10000 epochs. The results are summarized in Table 4.1.
Discussion
The purpose of this simulation was to determine whether the simple recurrent net-
work could exhibit strong systematicity of inference. Although simulation results
cannot show that the network is incapable of demonstrating strong systematicity
on this task, they do suggest some inherent problem with the architecture. As
with the systematicity of representation task, the training sets were constructed
so as to maximize the possibility of exhibiting strong systematicity by leaving out
only one of the five components in the second position and training on all other
Table 4.1: Summary of the performance of the simple recurrent network on the systematicity of inference task for networks with 20, 10 and 8 hidden units. The networks were trained to 0.5 and 0.4 training criteria, or until 10000 epochs of training. In the case of 8 hidden units, only the 0.5 training criterion was considered. The table shows that for a network with 20 hidden units, all 10 trials converged to the 0.5 and 0.4 training criteria within 10000 epochs. For each training criterion, the network was tested on the 0.5 and maximum testing criteria for the 5 test cases. The table shows that for a training criterion of 0.5 and a testing criterion of 0.5, 0 out of 5 test cases were correctly inferred in all 10 trials. In the case of the 0.5 training criterion and maximum testing criterion, in 2 of the 10 trials the network with 8 hidden units correctly inferred 1 of the 5 test cases.
Hidden units                         20       10      8*
Convergent trials (0.5/0.4)          10/10    10/7    1 (10 trials)
Correct test cases (of 5)
  Train: 0.4; Test: (0.5/max)        0/0      0/0     -
  Train: 0.5; Test: (0.5/max)        0/0      0/0     0/0 (8 trials), 0/1 (2 trials)
combinations. Thus, with respect to this task, the training set provides the maximum
amount of information¹¹ available for a demonstration of generalization across
position. Yet, despite being given such information the networks did not demonstrate
strong systematicity. Having considered two different training criteria, two
different testing criteria, and various numbers of hidden units, in all but two trials,
the simple recurrent network performance on the test set was zero for the correct
extraction of second position components in the test set. In the other two trials,
a network with 8 hidden units, trained to the 0.5 training criterion, and tested on
the maximum testing criterion, correctly extracted second position components in
only 1 of 5 test cases. In the case of the maximum testing criterion, where output
units are randomly activated, there is a 1 in 5 chance (since there are 5 possible
components) that the maximally activated output unit will correspond to the target
activation of 1. Since performance on these two trials was not greater than
chance level performance, the network did not exhibit strong systematicity.
¹¹ Providing more information in the way of additional training patterns would mean that strong systematicity could not be tested, as all components would have appeared in both positions.
The network with 20 hidden units was the same as the network which demonstrated
strong systematicity of representation. Although this network did not
demonstrate strong systematicity of inference, it did perform perfectly on all first
queries in the test set. Thus, although not demonstrating generalization across
position, the network did demonstrate generalization to novel combinations. The
lack of generalization across position suggests that the 20 hidden unit network was
not sufficiently constrained in its weight space. Reducing the number of hidden
units decreases the region of weight space within which the network can implement
the target function. Therefore, given that the network learns the function specified
by the training set, there is a greater likelihood of exhibiting strong systematicity.
However, decreasing the number of hidden units made learning the training set
more difficult, to the point where only 1 of the 10 trials terminated within 10000
epochs in the case of the 8 hidden unit network. The error profiles for the 8 hidden
unit network in each of the 10 trials are given in Figure 4.18. They suggest that
the network was unlikely to converge with further training.
The error profiles show that the error on the training set was generally lowest
around the 500th epoch of training. Further training resulted in an increase in
training error to a relatively stable point by the 10000th epoch. It is possible
that the network at this point could have exhibited generalization to the novel
component-position combination, but at the expense of being unable to correctly
infer all cases in the training set.
To test this possibility, the simulations were rerun under the same conditions
except that training was terminated by the 500th epoch, at which point the network
was tested using the maximum testing criterion.
In none of the 10 trials conducted did the network with 8 hidden units correctly
infer the component object in any of the 5 test cases. This suggests that the
network was not generalizing across position, above chance level, at any stage
during the 10000 epochs of training in the previous simulations.
Figure 4.18: Error profile for the 8 hidden unit network over 10000 epochs of training for each trial. Lines indicate trials that did not converge to the 0.5 training criterion within 10000 epochs. The line with points labeled as squares indicates convergence to the 0.5 training criterion within 10000 epochs.
Analysis of internal representations
The lack of strong systematicity of inference from a network that demonstrated
strong systematicity of representation raises the question of whether the simple
recurrent network is sufficiently biased for this task. In this subsection, it is shown
by analysis of internal representations that this simple recurrent network cannot
demonstrate strong systematicity of inference with respect to this task. The explanation
proceeds by first showing the minimum network configuration needed to solve
this task; and second, showing that with this minimum configuration the training
set does not provide enough information to make generalization across position a
likely property.
The simple recurrent network is an instance of a first-order network, characterized
by a unit activation function of the form:

f_j \Bigl( \sum_{i=1}^{n} w_{ij} x_i + b_j \Bigr)

where f_j is the activation function for unit j; x_i is the activation from a previous
unit i along a connection with weight w_{ij}; and b_j is a bias on the unit. In a first-order
network there are no multiplicative activation terms. Consequently, input points
resulting in the same unit activation belong to a hyperplane whose orientation is
determined by the unit's incoming weights and bias. A hyperplane divides the
unit's input space into two half-spaces so that all points in one half-space result
in a unit activity (output) greater than some threshold value (θ), and all points in
the other half-space result in a unit activity less than θ.
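To make the contrast concrete, the following sketch (illustrative Python written for this edition, not from the thesis) shows a first-order unit, whose net input is purely additive, next to a second-order unit, which includes the multiplicative terms that a first-order network lacks:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def first_order_unit(x, w, b):
        # Additive net input only: the points with equal net input
        # w.x + b form a hyperplane in the unit's input space.
        return sigmoid(w @ x + b)

    def second_order_unit(x, W2, w, b):
        # Multiplicative x_i * x_j terms: equal-activation sets are no
        # longer hyperplanes, so the argument below would not apply.
        return sigmoid(x @ W2 @ x + w @ x + b)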
In the querying of 2-tuples task, the network is presented with an ordered
pair and a question requesting the first or second component of the pair. Now
suppose, without loss of generality, that Mary is the component on which the network
is to demonstrate strong systematicity. Then, to solve the task the network must
implement the following function:
f(Q_1, R[(Mary, X)]) \rightarrow Mary
f(Q_1, R[(X, Mary)]) \rightarrow \neg Mary
f(Q_2, R[(Mary, X)]) \rightarrow \neg Mary
f(Q_2, R[(X, Mary)]) \rightarrow Mary
where Q_1 and Q_2 are the question vectors; R[(Mary, X)] is a representation of
an ordered pair with Mary in the first position and some other object (X) in the
second position; R[(X, Mary)] is a representation of an ordered pair with Mary
in the second position and some other object (X) in the first position; and f is
the function that implements the mapping. The R[·] notation is used to refer to
the network's internal representation of an object, which may be different from the
object's external representation.
A first-order network that implements the mapping is given in Figure 4.19. For
a simple recurrent network, the representation of each pair at the context units is
a result of cycling back the representations of components at the hidden units from
the previous time step. For a feedforward network, the representations at the context
units can be regarded as just input representations, and the context units can be
regarded as just other input units.
As in the previous section, by using a local encoding for each component the
assumption is made that there is no a priori similarity between components. Therefore,
at the output layer there is a single unit that discriminates on the basis of
whether or not a representation contains the Mary component. That is, there is
an output unit that detects Mary. It is straightforward to show that this discrimination
cannot be performed by positioning a single hyperplane.
For a single hyperplane to correctly discriminate the Mary component, it must
be positioned so as to satisfy the following inequalities:
\vec{W}_I \cdot \vec{Q}_1 + \vec{W}_C \cdot \vec{R}^C_{Mary,X} + B > \theta \quad (4.1)
\vec{W}_I \cdot \vec{Q}_1 + \vec{W}_C \cdot \vec{R}^C_{X,Mary} + B < \theta \quad (4.2)
\vec{W}_I \cdot \vec{Q}_2 + \vec{W}_C \cdot \vec{R}^C_{Mary,X} + B < \theta \quad (4.3)
\vec{W}_I \cdot \vec{Q}_2 + \vec{W}_C \cdot \vec{R}^C_{X,Mary} + B > \theta \quad (4.4)
where \vec{W}_I and \vec{W}_C are the input-to-hidden and context-to-hidden weight vectors
(respectively); B is the bias (weight) to the unit; and \theta is the threshold above
which the component Mary is considered to be the correct response.
Figure 4.19: Conceptualization of a first-order network where the hidden layer is composed of two layers: the first performs a linear transformation of activation from the preceding layer; and, the second performs a non-linear transformation of the linear layer.
Subtracting equation 4.3 from equation 4.1, and equation 4.4 from equation 4.2,
cancels the context representation and bias terms and leaves:

\vec{W}_I \cdot (\vec{Q}_1 - \vec{Q}_2) > 0 \quad (4.5)
\vec{W}_I \cdot (\vec{Q}_2 - \vec{Q}_1) > 0 \quad (4.6)
Since there does not exist a weight vector \vec{W}_I that satisfies both inequalities
4.5 and 4.6, the network cannot represent the solution with a single hyperplane for
each component. Consequently, correctly extracting the Mary component requires
at least two hyperplanes positioned in the input space (by two hidden units at the
hidden layer) and a hyperplane positioned in the hidden space (by the output unit
detecting the Mary component).
Essentially, this reasoning uses the same analysis that was used to explain why
perceptrons cannot solve the exclusive-or (XOR) task (Minsky & Papert, 1990;
Hertz et al., 1991).
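The infeasibility can also be checked numerically. The following sketch (illustrative Python written for this edition; the particular query encodings and context representations are arbitrary assumptions) samples a large number of single-hyperplane parameter settings and confirms that none satisfies inequalities 4.1-4.4 simultaneously:

    import numpy as np

    rng = np.random.default_rng(0)

    Q1, Q2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # local query encodings
    R_mary_x = rng.normal(size=4)  # context representation of (Mary, X)
    R_x_mary = rng.normal(size=4)  # context representation of (X, Mary)

    def satisfies_all(w_i, w_c, b, theta=0.0):
        """Check inequalities 4.1-4.4 for one hyperplane (w_i, w_c, b)."""
        net = lambda q, r: w_i @ q + w_c @ r + b
        return (net(Q1, R_mary_x) > theta and net(Q1, R_x_mary) < theta and
                net(Q2, R_mary_x) < theta and net(Q2, R_x_mary) > theta)

    hits = sum(satisfies_all(rng.normal(size=2), rng.normal(size=4),
                             rng.normal())
               for _ in range(100_000))
    print(hits)  # 0: inequalities 4.5 and 4.6 demand both
                 # W_I.(Q1 - Q2) > 0 and W_I.(Q2 - Q1) > 0

As with XOR, no single hyperplane can realize the required partition, whatever the actual representations happen to be.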
The positioning of hyperplanes in the input space is depicted in Figure 4.20.
The necessity of two hyperplanes is a consequence of a first-order architecture and is
independent of the representations of the question and pair vectors. For, although
the values of these vectors are not known, the relationship between these vectors
is known. The difference in the internal representations at the linear hidden layer
(see Figure 4.19) between the (Q1, (Mary, X)) case and the (Q2, (Mary, X)) case is
just a constant vector \vec{W}_I(\vec{Q}_1 - \vec{Q}_2) = \vec{C}. Similarly, the difference between the
(Q1, (X, Mary)) case and the (Q2, (X, Mary)) case is also the vector \vec{C}.
Assuming that these two hyperplanes are in the correct position, the resulting
representation at the hidden layer for these four types of points, depending on whether
they are on the high or low side (relative to some threshold) of these two hyperplanes,
is shown in Figure 4.21. In the hidden unit activation space, these four
point types must be partitioned by a single hyperplane into two groups: Mary
(circles) and not-Mary (squares). Clearly, such a partitioning is possible. However,
in addition, the network is required to demonstrate strong systematicity on the
Mary component.
Figure 4.20: Orientation of the hidden unit hyperplanes to extract the Mary component. To implement the function there must be at least two hyperplanes in correct orientation in the input space to extract the Mary component. Circles indicate Mary is the target output, and squares indicate Mary is not the target output. Numerals indicate a request for the first or second component.

Figure 4.21: Orientation of the output unit hyperplane in the hidden unit activation space. Circles indicate Mary is the target output, and squares indicate Mary is not the target output. Solid circles and squares indicate training points, and empty circles and squares indicate test points.
Therefore, in the training set, one of the positions (e.g., Mary in
the second position) does not occur (empty circle and square). Consequently, the
training set does not provide any information regarding the partitioning of points
along the vertical dimension, and therefore the network cannot be expected to
generalize to these cases.
Thus, as with the buffer solution to the auto-association task, there are two sets
of dimensions: one encoding components in the first position, and one encoding
components in the second position. The critical difference in the inference task is
that at a single time step the network is required to recover either the first or the second
component depending on the question vector, whereas in the auto-association task
at one time step only the first component was required, with the second component
recovered in the next time step. Consequently, the network solution to the task
requires two output subspaces (in contrast with the solution to the representation
task, where only one output subspace is necessary) implemented by two sets of
weights (Figure 4.22). Since the two sets of weights are independent, the network,
in general, must see components in both positions and therefore cannot demonstrate
strong systematicity.
Figure 4.22: Weight and representation configuration resulting in weak systematicity of inference. Shaded regions and solid arrows indicate modified weights. Dashed arrows indicate fixed weights. Starred output indicates auto-association of input to assist in the formation of internal representations.
4.4.3 Other first-order three-layer networks
The weak systematicity result of the simple recurrent network does not make any
assumptions about the nature of the internal representations of the ordered pairs.
The assumptions made were that the external representations of components are
encoded locally, and there is an intermediate layer of first-order units between the inputs and the outputs (which are also first-order units). Therefore, the result can
also be applied to other architectures conforming to these restrictions.
Figure 4.23: Two alternative recurrent network architectures: Jordan's recurrent network (a) and Pollack's recursive auto-associative memory (b).
Feedforward network
As mentioned previously, the context units in the simple recurrent network can be
regarded as other input units holding a representation of the ordered pair. The
two sets of input units map representations to hidden units which in turn map
representations to output units. Since no assumptions are made regarding the nature of the ordered-pair representation, three-layer feedforward networks are weakly systematic of inference with respect to this task.
Jordan's recurrent network
Jordan's (1990) recurrent network (see Figure 4.23(a)) differs from the simple re-
current network in that context is copied from the output unit activations rather
than the hidden unit activations of the previous time step. Since the result makes
no assumptions regarding the nature of the context representations the weak sys-
tematicity of inference result applies to the Jordan recurrent network.
Pollack's recursive auto-associative memory
Pollack's (1990) recursive auto-associative memory (see Figure 4.23(b)) is essen-
tially the same as the simple recurrent network except there is an additional set
of output units whose targets are the current context values (i.e., the network not
only auto-associates the current input, but also the current context). However,
these units (and their associated weights) play no part in the extraction of com-
ponents. Their purpose is to encourage the formation of useful representations at
the context layer. However, as mentioned above, the result of the simple recurrent
network is independent of the actual representations at the context layer. There-
fore, this architecture cannot demonstrate strong systematicity of inference on this
task.
Forced simple recurrent network
The forced simple recurrent network (Maskara & Noetzel, 1992) is a variation of
the recursive auto-associative memory in that the targets for the output units are
the previous context, the current input and the next input. Additional information
(hints) regarding the target function can improve learnability in networks (Abu-
Mostafa, 1990; Suddarth & Kergos, 1990; Al-Mashouq & Reed, 1991; Suddarth &
Holden, 1991). For example, use of hints allowed the simple recurrent network to
discriminate input values appearing further back in time (Phillips & Wiles, 1991).
In fact, auto-association of the inputs at the first and second time steps in the auto-association of 2-tuples task is an example of hint information, except that the hints were on the same output units but were not considered for network evaluation. However, as with the recursive auto-associative memory, the additional
output units play no part in the extraction of components in the forced simple re-
current network. Therefore, the network cannot demonstrate strong systematicity
of inference on this task.
4.4.4 Architectural issues
The problem with the simple recurrent network and the other first-order three-layer
networks is that the components must be extracted from one of two independent
subspaces, depending on whether the query vector requests the first or second component. As in the case of the feedforward network, because there is an independence between the weights that implement these two mappings, the network must be trained on all components in both positions.
The simple recurrent network was able to demonstrate strong systematicity of
representation on the auto-association task because the task only requires the network to extract the first component at the third time step, with the second component being extracted at the fourth time step. At either time step the same subspace can be used to map internal component representations to outputs. Therefore, the same set of weights can be used to implement the mapping. In this case, dependency was incorporated into the architecture as the component mappings were implemented by the same set of weights.
Dependency between component-access weights
The problem that strong systematicity of inference poses for an architecture is that all components must be simultaneously represented on different subspaces, which means there must be a different set of weights to implement the mappings from those subspaces to the output units. Therefore, to demonstrate strong systematicity there must be a dependency between these sets of weights. The representation task permitted all components to be accessed via a common subspace in the case of the simple recurrent network. Thus, dependency was incorporated as the network could use the same set of weights to access all components. However, the inference task does not permit all components to be accessed from a common subspace in the case of these networks.
One way to incorporate a weight dependency is to collapse the various com-
ponent subspaces onto a single common subspace from which all components, in-
dependent of their position within the complex object, can be extracted to the
output. Thus, strong systematicity of inference is achieved as only one set of
weights implements a position-independent mapping from a common internal sub-
space to the output, just as was suggested with the simple recurrent network in the
auto-association task. The problem then is to incorporate a mechanism into a Con-
nectionist architecture that, when given a query vector, returns the corresponding
component subspace to the common subspace (Figure 4.24).
Figure 4.24: Characterization of the representational organization sufficient for strong systematicity of inference.
Tensors
One possibility is with the use of a tensor, rather than the more common vector representation space. Briefly, an $M \times N$-dimensional tensor space (constructed from two vector spaces of dimensionality $M$ and $N$, respectively) can be conceptualized as $M$ $N$-dimensional subspaces (or, similarly, $N$ $M$-dimensional subspaces). The inner product operator, provided in tensor calculus, effectively collapses the multiple subspaces down to a single subspace. The particular subspace selected depends on the vector with which the inner product was performed. Thus, tensors
provide multiple representational subspaces, which can be indexed (or accessed) by
other vectors.
The use of a tensor representational space in a Connectionist learning archi-
tecture is investigated in the next chapter with the purpose of addressing strong
systematicity of inference in the querying of 2-tuples task.
4.5 Summary and conclusion
In this chapter, Hadley's strong systematicity was used as the criterion by which
a model is regarded as exhibiting systematicity. The purpose of this chapter was:
(1) to determine which models, if any, meet this criterion; and (2) to elucidate
the properties which give rise to the model's degree of systematicity. Two tasks
were designed to test whether Connectionist models could meet Hadley's strong
systematicity criterion. They were: (1) auto-association of 2-tuples, designed to
test strong systematicity of representation; and (2) querying of 2-tuples, designed
to test strong systematicity of inference. A summary of the conclusions drawn from
an evaluation of networks and their degrees of systematicity with respect to these
two tasks is given in Table 4.2.
The first model examined was the feedforward network. It was shown not to
be capable of exhibiting strong systematicity of representation. Essentially, the
lack of strong systematicity is because of the independent relationship between the
weights which implement the various component mappings. The simple recurrent
Table 4.2: Summary of the degrees of systematicity of networks with respect to the systematicity of representation task (auto-association of 2-tuples), and the systematicity of inference task (querying of 2-tuples). The networks examined included the feedforward network (FFN); simple recurrent network (SRN); and several first-order recurrent networks (RN*), including the simple recurrent network; Jordan's recurrent network; Pollack's recursive auto-associative memory; and Maskara's forced simple recurrent network.
Network  Task        Degree  Property
FFN      Represent.  Weak    Independent weights
SRN      Represent.  Strong  Dependent weights
FFN      Inference   Weak    Independent access weights
RN*      Inference   Weak    Independent access weights
network incorporates a weight dependence as all components are presented (and
extracted) at the same set of input (and output) units. Therefore, the function
which performs this mapping is implemented, in part, over the same set of weights.
Simulation results confirmed that the simple recurrent network could exhibit strong
systematicity of representation. Analysis of internal representations for the sim-
ple recurrent network suggested that the network learnt to organize its internal
representations so that components were mapped from the input to a common
internal subspace, and mapped from another common internal subspace to the
output. Thus, dependency between component mapping weights was achieved as
the network learnt to use the same set of weights.
The simple recurrent network was then examined on the second task designed
to test strong systematicity of inference. Simulations showed that although the
network generalized to novel combinations, it did not demonstrate any degree of
generalization across position on this task. The lack of strong systematicity sug-
gested that the network did not have the right architectural bias to exhibit strong
systematicity. By analysis of internal representations it was shown that the sim-
ple recurrent network cannot demonstrate strong systematicity on this task. The
inference task demands that the components be extracted from two component
subspaces. The independence of the weights that implement these two mappings
meant the network must be trained on each component in both positions. It was
also shown that a number of other first-order three-layer networks (i.e., one hidden
layer) could not exhibit strong systematicity of inference on this task. Those net-
works included: the three-layer feedforward network; Jordan's recurrent network;
Pollack's recursive auto-associative memory; and Maskara's forced simple recurrent
network. The architectural bias lacking in these models was a dependency between
component access weights.
Finally, it was suggested that dependency between component access weights
could be achieved by collapsing the various component subspaces onto a single com-
mon subspace from which a position-independent mapping could be implemented
by a single set of weights. Thus, weight dependency is incorporated by using the
same set of weights. It was suggested that this bias could be achieved in a tensor
representation scheme using the inner product operator. In the next chapter, details of a Connectionist network incorporating these ideas are presented and tested by simulation on a querying of 3-tuples task.
Chapter 5
Strong systematicity of inference:
The Tensor-Recurrent Network
5.1 Introduction
In the previous chapter, a number of Connectionist models were analyzed for their
capacity to exhibit strong systematicity. The importance of Hadley's work was
to identify and specify a degree of generalization that is clearly evident in people,
but has yet to be demonstrated in Connectionist models. The main results of the
previous chapter were to show that the three-layer feedforward network and several first-order three-layer recurrent networks cannot demonstrate strong systematicity of inference with respect to the querying of 2-tuples task.
The characteristic property that prevented these models from exhibiting strong
systematicity of inference was an independence between the weights that access
internal component representations from the two component subspaces. It was
then suggested that a mechanism that incorporated such a dependence would be
sufficient to demonstrate strong systematicity of inference. The use of tensor rep-
resentations was suggested as one possible mechanism. In this chapter, tensor
representations are used as the basis for a Connectionist architecture, called the
tensor-recurrent network, for the purpose of exhibiting strong systematicity of in-
ference with respect to a querying of 3-tuples task.
Before presenting the tensor-recurrent network, the concept of tensor represen-
tations, and their Connectionist implementation (introduced by Smolensky, 1987b)
is reviewed in section 2. The use of tensors for representing complex objects as-
sumes the existence of a number of representational components. This assumption
raises an issue, which is discussed in section 3, regarding the origins of these rep-
resentations. In section 4, it is shown how these representational components are
learnable in the tensor-recurrent network by combining a tensor representation ar-
chitecture with a recurrent network learning architecture. The tensor-recurrent
network is then evaluated for strong systematicity of inference on a querying of
3-tuples task in section 5. Finally, a summary and conclusion are given in section
6.
5.2 Tensor representations
Tensor representations were introduced by Smolensky (1987b) as a means of representing complex objects in a Connectionist architecture. As already shown in
chapter 2, complex objects can be represented as the sum of the outer products
of component-role vector representation pairs. The outer product ($\vec{T}$) of two vectors ($\vec{V}$ and $\vec{W}$) is defined as:

$$\vec{T} = \vec{V} \otimes \vec{W} = \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix} \begin{pmatrix} w_1 & \cdots & w_n \end{pmatrix} = \begin{pmatrix} v_1 w_1 & \cdots & v_1 w_n \\ \vdots & & \vdots \\ v_m w_1 & \cdots & v_m w_n \end{pmatrix} \qquad (5.1)$$
A Connectionist network that implements the outer product is given in Figure 5.1. In this network, connections have a fixed weight of value one. The activation of tensor unit $T_{ij}$ at time step $t$ is defined by the equation:

$$T^t_{ij} = v_i w_j + T^{t-1}_{ij}, \qquad (5.2)$$

where $v_i$ is the activation of the $i$th unit in the vector(V) group of units; $w_j$ is the activation of the $j$th unit in the vector(W) group of units; and $T^{t-1}_{ij}$ is the activation of the $ij$th tensor unit at the previous time step ($t-1$). The multiplicative activation term makes the tensor an example of a higher-order network.
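To make the construction concrete, the following is a minimal sketch of the binding step in Python with NumPy (an illustration added for this discussion, not code from the original simulations; the dimensionalities and vector values are arbitrary assumptions):

import numpy as np

def bind(tensor, v, w):
    """Equation 5.2: add the outer product of component vector v and
    role vector w to the accumulated tensor representation."""
    return tensor + np.outer(v, w)

T = np.zeros((2, 3))           # T^0: an m x n tensor, initially zero
v = np.array([1.0, 0.0])       # activation at the vector(V) units
w = np.array([0.0, 1.0, 0.0])  # activation at the vector(W) units
T = bind(T, v, w)              # T^1 = T^0 + (v outer product w)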
Figure 5.1: A Connectionist implementation of the outer product (T) of two vector representations (V and W). All connections are of fixed weight 1. Solid circles indicate the product of input activity, and empty circles indicate the sum of input activity. The activation of a tensor unit is the product of the activity of previous units plus the tensor unit's activation from the previous time step.
In addition to constructing complex representations via the outer product operator, component representations can be accessed via the inner product operator. The inner product of a tensor ($\vec{T}$) and a cue vector ($\vec{W}$) is defined as:

$$\vec{T} \cdot \vec{W} = \vec{V} = \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix}, \qquad (5.3)$$

where $v_i = \sum_{j=1}^{n} T_{ij} w_j$; and $n$ is the dimensionality of vector $\vec{W}$.
The Connectionist network that implements the inner product is given in Figure 5.2. In this network, connections have a fixed weight of value one. The activation of product unit $p_{ij}$ is defined as:

$$p_{ij} = T_{ij} w_j, \qquad (5.4)$$

where $T_{ij}$ is the activation of the $ij$th tensor unit; and $w_j$ is the activation of the $j$th unit in the vector(W) group of units. The activation of vector(Vout) unit $v^{out}_i$
Figure 5.2: A Connectionist implementation of the inner product (Vout) of a tensor representation (T) and a vector representation (W). All connections are of fixed weight 1. Solid circles indicate the product of input activity, and empty circles indicate the sum of input activity. The activation of a product unit is the product of the activity of previous units. The activation of a vector(Vout) unit is the sum of the activity of previous units.
is defined as:

$$v^{out}_i = \sum_{j=1}^{n} p_{ij}, \qquad (5.5)$$

where $p_{ij}$ is the activation of product unit $ij$. The activation is summed over all $n$ incoming connections1.
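The corresponding access operation is equally small; a sketch of equations 5.3 to 5.5 under the same assumptions, continuing the NumPy illustration above:

def unbind(tensor, w):
    """Equations 5.3-5.5: sum the product units p_ij = T_ij * w_j over j,
    giving v_i = sum_j T_ij w_j."""
    return tensor @ w

v_out = unbind(T, w)  # recovers v when w is unit length and orthogonal
                      # to every other role vector bound into T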
The following is an example of how these two networks, together, can process
examples from the querying of 2-tuples task, which was given in the previous chap-
ter. Suppose the network is presented with the 2-tuple (John, Mary) and a query
1The number of incoming connections n is the dimensionality of the cue vector, which in this network implementation is the number of vector(W) units.
requesting the second component. Then the sequence of inputs and targets is:

Time  Input   Output
1.    John    -
2.    Mary    -
3.    Second  Mary

The activation of the network units throughout this sequence is described by the following sequence of equations:

$$\begin{aligned}
0 &: \vec{T}^0 = \vec{0} \\
1 &: \vec{T}^1 = \vec{T}^0 + \vec{V}_{John} \otimes \vec{W}_{first} \\
2 &: \vec{T}^2 = \vec{T}^1 + \vec{V}_{Mary} \otimes \vec{W}_{second} \\
3 &: \vec{V}_{out} = \vec{T}^2 \cdot \vec{Q}_{second} \\
  &\;\; = \vec{V}_{John} (\vec{W}_{first} \cdot \vec{Q}_{second}) + \vec{V}_{Mary} (\vec{W}_{second} \cdot \vec{Q}_{second}) \\
  &\;\; = \vec{V}_{Mary} \quad (\text{when } \vec{W}_{first} \perp \vec{Q}_{second}, \text{ and } \vec{W}_{second} \cdot \vec{Q}_{second} = 1).
\end{aligned}$$
The tensor units are initialized with zero activation. At time step 1, the two groups of units vector(V) and vector(W) (see Figure 5.1) are presented with representations of the first component ($\vec{V}_{John}$) and the first component's role ($\vec{W}_{first}$), respectively. The activation of the tensor units becomes their activation at the previous time step plus the outer product of the component and role vectors. At time step 2, vector(V) and vector(W) units are presented with representations of the second component ($\vec{V}_{Mary}$) and the second component's role ($\vec{W}_{second}$), respectively. The tensor units are updated as in the previous time step. At time step 3, a representation of the query ($\vec{Q}_{second}$), or cue vector, is presented to the vector(W) units. The activation of the vector(Vout) units is the result of the inner product of the activations of the tensor units and the vector(W) units. If the representation of the first component's role ($\vec{W}_{first}$) is orthogonal to the representation of the query ($\vec{Q}_{second}$), and if the inner product of the second component's role ($\vec{W}_{second}$) and the query vector is one2, then the representation of the second component will be correctly extracted.
2This situation occurs, for example, when the two vectors are equal and of length one.
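The whole example can be traced numerically with the two helpers above; the one-hot component and role vectors below are assumptions chosen so that the orthogonality conditions hold exactly (an added illustration, not the thesis' simulation code):

import numpy as np

V_john, V_mary = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W_first, W_second = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Time steps 1 and 2: accumulate the component-role bindings.
T2 = np.outer(V_john, W_first) + np.outer(V_mary, W_second)

# Time step 3: cue with Q_second = W_second.
Q_second = W_second
print(T2 @ Q_second)  # [0. 1.], i.e., V_Mary is extracted

Since $\vec{W}_{first} \cdot \vec{Q}_{second} = 0$ and $\vec{W}_{second} \cdot \vec{Q}_{second} = 1$, the John term vanishes and the Mary component is recovered exactly.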
5.3 Representational issues
The use of tensors to represent and process complex objects assumes the existence
of external agents for generating appropriate component (e.g., $\vec{V}_{John}$), role (e.g., $\vec{W}_{first}$) and cue (e.g., $\vec{Q}_{first}$) representations. In this section, the assumptions
regarding each of these three aspects of tensor representations are discussed.
5.3.1 Component representations
In the tensor scheme, component representations are bound (by an outer prod-
uct) to a representation of their associated roles. If the appropriate cue vector is
applied (by an inner product) to the tensor representation then the original com-
ponent representation is extracted. Thus, in a tensor scheme the input and output
representations of a component object are the same. However, the input repre-
sentation of the object John, for example, could be a string of letters (in the case
where one is receiving visual input), whereas the output representation of the same
object could be a sequence of phonemes (in the case where one is sending auditory
output). Therefore, it must be assumed that there exist additional agents that
perform the mappings from an external input representation of an object to an
internal representation (which is used in the tensor construction), and from the
internal representation to an external output representation of the same object
(which may di�er from the external input representation).
5.3.2 Role representations
In a tensor representation of complex objects, each component representation is
bound to a representation of the component's role within the complex object. In
using tensor representations it is assumed that there exists some other agent that
generates role representations for each component, since these representations are
not explicitly represented in the input3. That is to say, there is not an input rep-
3Although Smolensky (1987a) showed how "optimal" role representations could be learnt with his recirculation algorithm, these roles are learnt in isolation (i.e., separate from any binding to
resentation denoting patient, for example, in the sentence John loves Mary. Roles
must be deduced from other information such as relative positioning of compo-
nents. In addition, role representations must be orthogonal to each other for the
component representations to be subsequently extracted, via the inner product op-
erator, without interference. For example, suppose vectors $\vec{V}$ and $\vec{W}$ are bound to roles $\vec{R}_1$ and $\vec{R}_2$, respectively. Then, the extraction of the first component $\vec{V}$ using $\vec{R}_1$ as the cue vector results in:

$$(\vec{V} \otimes \vec{R}_1 + \vec{W} \otimes \vec{R}_2) \cdot \vec{R}_1 = \vec{V} + \alpha \vec{W}, \quad \text{where } \vec{R}_1 \cdot \vec{R}_2 = \alpha.$$

In the case where the role vectors are of unit length, $\alpha$ is the component of $\vec{R}_1$ in the direction of $\vec{R}_2$. The choice of role vectors is important for the successful extraction of component representations. Thus, a tensor scheme assumes the existence of agents that generate appropriate role vectors.
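The interference term $\alpha \vec{W}$ is easy to exhibit numerically (an added NumPy illustration; the role vectors here are unit length but deliberately non-orthogonal):

import numpy as np

V, W = np.array([1.0, 0.0]), np.array([0.0, 1.0])
R1 = np.array([1.0, 0.0])
R2 = np.array([0.6, 0.8])   # unit length, but R1 . R2 = 0.6, not 0

T = np.outer(V, R1) + np.outer(W, R2)
print(T @ R1)               # [1.  0.6], i.e., V + 0.6 W: the W component leaks in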
5.3.3 Cue representations
Having constructed tensor representations of complex objects, component objects
can be extracted via the inner product operator. Typically, the cue vector is
chosen, by some external agent, to be the same as the role vector to which the
desired component was composed. However, since the role vectors were generated
internally there is no reason to expect that an external representation of a query will
be the same as the internal representation of the role. For example, in the complex
object John chased Mary, the component John is bound to a representation of the
agent role. It is assumed that the query Who chased? gets mapped to the same
representation of agent, at which point the John component can be extracted from
the tensor representation using the inner product operator.
component vectors). No demonstration was given as to how the recirculation algorithm could be used in any task. The recirculation algorithm could have been used to train a three-layer feedforward network to auto-associate N-tuples if one assumes that the activation function for the hidden units is the identity function, and that the input-to-hidden weight matrix equals the transpose of the hidden-to-output weight matrix (i.e., $W_{I,H} = W_{H,O}^T$). However, even with these assumptions there is still an independence between the weights that access components from different positions, and therefore the network will not demonstrate strong systematicity with respect to this task.
Tensor representational schemes were introduced by Smolensky as a way of
implementing the capacity to represent complex objects in a Connectionist archi-
tecture. In chapter 2, it was argued that a "potential" contribution of Connec-
tionism is in an explanation of the acquisition of systematic behaviour. If tensor
representations are to prove useful in providing such an explanation then this issue
regarding the origins of component, role and cue representations must be addressed.
In the next section, it is shown how these representations can be learnt with the
tensor-recurrent network.
5.4 The tensor-recurrent network
In the previous chapter, a number of Connectionist networks were analyzed for
their capacity to exhibit strong systematicity. Characteristic of these networks was
the use of a vector space to represent complex objects. Although these networks
can represent most functions, it was shown that there was not sufficient architectural bias to account for strong systematicity. Since tensors have been shown capable of representing complex objects, a network incorporating tensor representations was suggested as a way of addressing strong systematicity. However, to this point, there has not been an effective procedure for learning tensor repre-
sentations. Consequently, assumptions must be made regarding the nature and
origins of component, role and cue representations. The motivation behind the
tensor-recurrent network was to incorporate a learning mechanism with a tensor
representation scheme so that component, role and cue representations are learnt
as a consequence of the demands of the task, rather than provided by external
agents. The following is a description of the tensor-recurrent network and how it
addresses the issue of acquiring component, role and cue representations.
The tensor-recurrent network is depicted in Figure 5.3. The network has many
of the components that are characteristic of the recurrent networks of the previous
chapter. That is, there are input and output units for presenting and extracting
component representations, and there are units for storing activations from the
previous time step. The main di�erence with the tensor-recurrent network is in the
[Figure 5.3 shows the following layers and unit counts: input(8), question(3), hidden(8), cue(4), context(10), bottleneck(2), state(10), role(4), tensor(32), inner product(8), output(8), together with the functions f, g, h, r, s, b1, and b2 labelling the modifiable mappings, and the copy-back connection from state to context.]
Figure 5.3: The tensor-recurrent network architecture. Dashed arrows indicate completely connected modifiable weights, and their destination units (which are the cue, hidden, bottleneck, state, role, and output units) have tanh as their activation function. Solid arrows indicate fixed connections of weight 1, and the connectivity between their source and destination units is such as to implement the operator as shown (i.e., inner product, outer product, and copy). Parenthesized values indicate the number of units used in simulations.
separation of the responsibility for representing component and role information into two sets of units: (1) hidden units, which construct internal representations of component objects; and (2) role units, which construct internal representations of a component's role within a complex object. This organization contrasts with
the recurrent networks where a single set of hidden units is responsible for the
representation of both component and role information. Before discussing the
implications of this di�erence with respect to demonstrating strong systematicity,
the internal behaviour of the network and its capacity to learn component, role
and cue representations is presented in detail.
5.4.1 Architecture
The description of the tensor-recurrent network has two parts: (1) an outline of
the functionality of each group of units; and (2) the activation functions and con-
nectivity of weights that implement the functionality of each group of units.
Unit functionality
The functionality of each group of units is organized into three major categories: (1) external representation units; (2) tensor-related units; and (3) role-related units. Their functionality is as follows:
1. External representation units. The purpose of these units is to hold the input and output unit activations. The three types of external units are:
• Input: external input representation of component objects.
• Question: external representation of a question.
• Output: external output representation of component objects.
2. Tensor-related units. These units are concerned with implementing the inner and outer product operators. The five types of units associated with these two operators are:
• Tensor: tensor representations of complex objects.
• Hidden: internal representations of component objects.
• Role: internal representations of a component's role within a complex object.
• Cue: internal representations of vectors for accessing component representations within the tensor representations.
• Inner product: extracted component representations.
3. Role-related units. These units are concerned with creating new role representations at each time step. The three types of units are:
• State: holds the current state vector.
• Context: holds the activation of the state units from the previous time step.
• Bottleneck: an intermediate representation between the hidden units and the state units.
Implementation
The activation of each context unit is defined as:

$$c^t_i = s^{t-1}_i, \qquad (5.6)$$

where $c^t_i$ is the activation of context unit $i$ at time step $t$; and $s^{t-1}_i$ is the activation of state unit $i$ at the previous time step $t-1$.
The activation of each tensor unit is defined as:

$$T^t_{ij} = h_i r_j + T^{t-1}_{ij}, \qquad (5.7)$$

where $T^t_{ij}$ and $T^{t-1}_{ij}$ are the activations of tensor unit $ij$ at time steps $t$ and $t-1$, respectively; $h_i$ is the activation of hidden unit $i$; and $r_j$ is the activation of role unit $j$.
The activation of each inner product unit is defined as:

$$ip_i = \sum_{j=1}^{n} T_{ij}(q_j + r_j), \qquad (5.8)$$
where $ip_i$ is the activation of inner product unit $i$; $q_j$ is the activation of cue unit $j$; $r_j$ is the activation of role unit $j$; and $n$ is the number of cue/role units.
The activation of all other units, which includes the cue, hidden, bottleneck, state, role, and output units, is defined as:

$$o_j = \tanh\left(\sum_{i=1}^{k} x_i w_{ij} + b_j\right), \qquad (5.9)$$

where $o_j$ is the activation of unit $j$; $x_i$ is the activation of a preceding unit $i$; $w_{ij}$ is the connection weight from unit $i$ to unit $j$; $b_j$ is the bias term on unit $j$; and $k$ is the number of incoming connections.
Each dashed arrow indicates complete connectivity between source and desti-
nation layers (i.e., every unit in the source layer is connected to every unit in the
destination layer). The weights associated with these connections are modifiable through a learning algorithm. The solid lines indicate unmodifiable weighted con-
nections. The connectivity between layers is such as to implement the following
functions:
• Outer product. The connectivity from the hidden and the role units to the tensor units implements the outer product $\vec{h} \otimes \vec{r}$, where $\vec{h}$ is a vector at the hidden units providing an internal representation of some component, and $\vec{r}$ is a vector at the role units providing an internal representation of the component's role. An example of the connectivity that implements the outer product was given in Figure 5.1.
• Inner product. The connectivity from the tensor units and the role units to the inner product units implements the inner product $\vec{T} \cdot \vec{r}$, where $\vec{T}$ is a tensor representation at the tensor units of some complex object and $\vec{r}$ is a vector representation at the role units. An example of the connectivity that implements the inner product was given in Figure 5.2. The connectivity between cue and inner product units implements the same function, except that the inner product is $\vec{T} \cdot \vec{q}$, where $\vec{q}$ is a vector representation at the cue units.
Table 5.1: Order of activation for the tensor-recurrent network when given the sequence John, Mary, first, where $\vec{R}'_i = \vec{R}_i + \vec{C}_{bias}$ and $h(\vec{Q}_{first})' = h(\vec{Q}_{first}) + \vec{R}_3$.

Units       T = 1                         T = 2                         T = 3
Input       V_John                        V_Mary                        0
Question    0                             0                             Q_first
Hidden      f(V_John)                     f(V_Mary)                     H_bias
Cue         C_bias                        C_bias                        h(Q_first)
Context     S_0                           S_1                           S_2
Bottleneck  B_1                           B_2                           B_3
State       S_1                           S_2                           S_3
Role        R_1                           R_2                           R_3
Tensor      T_1 = T_0 + f(V_John) ⊗ R_1   T_2 = T_1 + f(V_Mary) ⊗ R_2   T_3 = T_2 + T_noise
Inner       T_1 · R'_1                    T_2 · R'_2                    T_3 · h(Q_first)'
Output      g(T_1 · R'_1)                 g(T_2 · R'_2)                 g(T_3 · h(Q_first)')
• Copy. The connectivity between state and context units implements a copy-back function, which is the same as in Elman's (1990) simple recurrent network (i.e., one-to-one connections, with a fixed weight of one).
Those units that have modifiable weights also have a modifiable bias, which can be implemented as an incoming weighted connection whose activity is always one. Other units do not have a bias.
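The data flow just described can be summarized in a short forward-pass sketch (Python/NumPy; an added illustration under simplifying assumptions: bias terms and all learning machinery are omitted, and the weight shapes merely follow the unit counts reported in Figure 5.3):

import numpy as np

def step(x, question, context, T, W):
    """One time step of the tensor-recurrent network (cf. eqs. 5.6-5.9)."""
    h   = np.tanh(W['f'] @ x)                      # hidden: component representation
    cue = np.tanh(W['h'] @ question)               # cue generated from the question
    b   = np.tanh(W['b1'] @ h)                     # bottleneck attenuates the input
    s   = np.tanh(W['b2'] @ b + W['s'] @ context)  # new state vector
    r   = np.tanh(W['r'] @ s)                      # role generated from the state
    T   = T + np.outer(h, r)                       # tensor update (eq. 5.7)
    ip  = T @ (cue + r)                            # inner product units (eq. 5.8)
    out = np.tanh(W['g'] @ ip)                     # output units
    return out, s, T                               # s is copied back as context (eq. 5.6)

rng = np.random.default_rng(0)
W = {'f':  rng.uniform(-1, 1, (8, 8)),    # input(8)         -> hidden(8)
     'h':  rng.uniform(-1, 1, (4, 3)),    # question(3)      -> cue(4)
     'b1': rng.uniform(-1, 1, (2, 8)),    # hidden(8)        -> bottleneck(2)
     'b2': rng.uniform(-1, 1, (10, 2)),   # bottleneck(2)    -> state(10)
     's':  rng.uniform(-1, 1, (10, 10)),  # context(10)      -> state(10)
     'r':  rng.uniform(-1, 1, (4, 10)),   # state(10)        -> role(4)
     'g':  rng.uniform(-1, 1, (8, 8))}    # inner product(8) -> output(8)

T, context = np.zeros((8, 4)), np.zeros(10)  # reset at the start of each sequence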
An example of activity propagation
An example sequence from the querying of 2-tuples task is used to show the pro-
cessing of activation in the tensor-recurrent network. The sequence John, Mary,
first will result in the following sequence of activation, which is summarized in
Table 5.1.
At the first time step (T = 1), the tensor-recurrent network is presented with an external input representation ($\vec{V}_{John}$) of the component John. At this point, the network has not been queried, so the activation at the question units is zero. This representation is mapped (via function $f$) to the hidden units to form the internal representation $f(\vec{V}_{John})$. The hidden unit representation is mapped (via
function $b_1$) to the bottleneck units, resulting in vector $\vec{B}_1$, and from the bottleneck units to the state units via function $b_2$. At the same time, state vector $\vec{S}_0$ at the context units is mapped via function $s$ to the state units. The state units take the results of these two functions and produce a new state vector $\vec{S}_1$. The new state vector is mapped (via function $r$) to the role vector $\vec{R}_1$, which is a representation of the role of the first component object. The outer product of vectors $f(\vec{V}_{John})$ and $\vec{R}_1$ is added to the previous tensor ($\vec{T}_0$) to give a new tensor representation $\vec{T}_1$ at the tensor units. The inner product of the tensor and the role plus cue vectors is then performed. Although the question units have zero activity at the first time step, and therefore do not pass on any activity to the cue units, the bias terms associated with the cue units result in an activity vector $\vec{C}_{bias}$. Consequently, the activity vector at the inner product units is $\vec{T}_1 \cdot \vec{R}'_1$, where $\vec{R}'_1 = \vec{R}_1 + \vec{C}_{bias}$. The activity vector at the inner product units is subsequently mapped (via function $g$) to the output units, resulting in the output vector $g(\vec{T}_1 \cdot \vec{R}'_1)$. In addition, the state vector is propagated back to the context units in preparation for the next time step.
Activity vectors are propagated, in a similar manner, during the second time step. At the third time step, the network is presented with the question requesting the first component, which is represented as the vector $\vec{Q}_{first}$. The question vector is mapped (via function $h$) to form an internal query vector $h(\vec{Q}_{first})$ at the cue units. Although the activation at the input units is zero at the third time step, the bias terms at the hidden units (i.e., vector $\vec{H}_{bias}$) and the state vector $\vec{S}_2$ are propagated forward to generate a new state vector $\vec{S}_3$, and subsequently a new role vector $\vec{R}_3$. The outer product of the hidden and role unit vectors results in a noise term $\vec{T}_{noise}$ (i.e., a term that contains no information regarding the complex object (John, Mary)). Again, the inner product of the tensor vector and the role plus cue vectors is performed. The inner product results in the vector $\vec{T}_3 \cdot h(\vec{Q}_{first})'$, where $h(\vec{Q}_{first})' = h(\vec{Q}_{first}) + \vec{R}_3$. Finally, this vector is mapped to the output units, resulting in the output vector $g(\vec{T}_3 \cdot h(\vec{Q}_{first})')$.
Having traced through the propagation of activity for an example sequence,
the next step is to show how this activity can be used to train the tensor-recurrent
network to construct appropriate internal component, role and cue representations.
5.4.2 Learning
The output vector produced by the network as a result of the sequence of input vectors can then be compared to a target vector. In the case of the previous example, the target output at the third time step is $\vec{V}^T_{John}$. Just as with the networks of
the previous chapter, by incorporating an error function within the network, per-
formance can be improved by backpropagating an error signal, which is a function
of the network and target output vectors. Weights are adjusted as a function of
the error signal so as to reduce network error. Again, it is the principle of error
backpropagation (Rumelhart et al., 1986) that drives the construction of internal
representations. The same principle is used in the tensor-recurrent network. The
interesting feature of the tensor-recurrent network is that the error signal is back-
propagated along the fixed weights and units that implement the tensor operators to the modifiable weights that generate component, role and cue representations.
In this way, these representations can be learnt in response to the demands of the
task, rather than pre-speci�ed by some external agent. The following is an ex-
planation of how the appropriate component, role and cue representations can be
learnt given the principle of error backpropagation and the internal organization
of the tensor-recurrent network.
Component representations
It was mentioned in section 5.3 that a tensor representation scheme assumes the ex-
istence of external agents that perform the mappings between external and internal
component representations. This assumption is avoided in the tensor-recurrent net-
work by providing a layer of modifiable weights before and after the tensor-related units (i.e., the weights associated with functions $f$ and $g$, respectively).
Suppose that, in the (John, Mary, first) example, the network is also required
to output the current object for the first and second time steps (as was done for
the networks of the previous chapter to encourage the formation of useful internal representations). Then, using the sum of squares error function, the error term at the first time step is:

$$E = \|\vec{V}^T_{John} - g((f(\vec{V}_{John}) \otimes \vec{R}_1) \cdot (\vec{R}_1 + \vec{C}_{bias}))\|,$$

where $\vec{V}_{John}$ and $\vec{V}^T_{John}$ are the input and target output representations of John, respectively; $\vec{R}_1$ is the associated role vector; and $\vec{C}_{bias}$ is a bias vector. Thus, error $E$ goes to zero ($E \to 0$) when, for example, the following three conditions occur:

1. $\vec{C}_{bias} \to 0$;

2. $\|\vec{R}_1\| \to 1$; and

3. $g(f(\vec{V}_{John})) \to \vec{V}^T_{John}$.
That is, when the modifiable weights that implement the various mapping functions are adjusted so that these three conditions occur.

In the case where $\vec{R}_1 \cdot (\vec{R}_1 + \vec{C}_{bias}) = 1$, which occurs, for example, when $\vec{C}_{bias} = 0$ and $\|\vec{R}_1\| = 1$, the vector representation at the hidden units is the same as the vector representation at the inner product units. That is,

$$f(\vec{V}_{John}) = (f(\vec{V}_{John}) \otimes \vec{R}_1) \cdot (\vec{R}_1 + \vec{C}_{bias}).$$

Consequently, the network reduces to a three-layer feedforward network, where any input-to-output mapping is representable (given sufficient hidden units), and therefore "potentially"4 learnable.
Role representations
Tensor representations also assume the existence of an external agent that generates
orthogonal role vectors for each component. Referring to the (John, Mary, first)
4As with the feedforward network, there is no guarantee that a multiple-layer network with non-linear units will learn an arbitrary function.
example, at the second time step, the error term is:

$$E = \|\vec{V}^T_{Mary} - g((f(\vec{V}_{John}) \otimes \vec{R}_1 + f(\vec{V}_{Mary}) \otimes \vec{R}_2) \cdot (\vec{R}_2 + \vec{C}_{bias}))\|,$$

where $\vec{V}^T_{Mary}$ is the target output for the object Mary; and $\vec{R}_2$ is its associated role representation. Thus, error goes to zero when, for example, the following four conditions occur:

1. $\vec{C}_{bias} \to 0$;

2. $\|\vec{R}_2\| \to 1$;

3. $\vec{R}_1 \perp \vec{R}_2$; and

4. $g(f(\vec{V}_{Mary})) \to \vec{V}^T_{Mary}$.
Thus, the network learns appropriate role vectors by adjusting weights which min-
imize an error function that goes to zero when the role vectors are orthogonal.
Cue representations
At the third time step, the error term is:

$$E = \|\vec{V}^T_{John} - g((f(\vec{V}_{John}) \otimes \vec{R}_1 + f(\vec{V}_{Mary}) \otimes \vec{R}_2 + f(\vec{H}_{bias}) \otimes \vec{R}_3) \cdot (h(\vec{Q}_{first}) + \vec{R}_3))\|,$$

where $\vec{H}_{bias}$ is the bias vector at the hidden units; and $\vec{Q}_{first}$ is the representation of the query requesting the first component. At this time step, error goes to zero when:

1. $\vec{R}_3 \to 0$;

2. $\vec{R}_1 \perp \vec{R}_2 \perp \vec{R}_3$;

3. $\|\vec{R}_1\| \to 1$;

4. $\|h(\vec{Q}_{first}) - \vec{R}_1\| \to 0$; and

5. $g(f(\vec{V}_{John})) \to \vec{V}^T_{John}$.
The error function drives the network to learn an appropriate cue vector that, when applied to the tensor representation by the inner product operator, results in an output representation of the first component John.
5.5 Evaluating the network
The lack of strong systematicity in the networks of the previous chapter was at-
tributed to the independence between the weights that access components from
different representational subspaces. A tensor representation scheme was suggested
as it has the property that multiple subspaces can be collapsed down to a single
subspace under the inner product operator. Thus, a weight dependency is intro-
duced across component subspaces as all components, independent of their role,
are mapped from the common internal subspace to the output. Therefore, learning
to extract a component representation in one position (role) generalizes to compo-
nent representations in other positions (roles). However, extraction of the correct
component requires the network to learn the appropriate component, role and cue
representations. In the tensor-recurrent network, this requirement means learning
the weights that will generate these vectors. The solutions to the error functions
(above) showed that these weights exist, however, they do not indicate whether
they will necessarily be learnt. For example, the error surface could contain local
minima preventing the network from learning a set of weights that generates strong
systematic behaviour. The purpose of this section is to test, through simulation,
the tensor-recurrent network's capacity to learn a set of weights that results in
strong systematicity on an inference task.
5.5.1 Querying of 3-tuples task
The task used in this section is based on simple sentences conforming to the struc-
ture agent-action-patient. Each component is presented to the network one per time step, after which the network is given one of three possible questions: Who
performed the action?; What was the action?; and Who was affected by the action?,
from which the network should respond with the agent, action, and patient com-
ponents, respectively. For example, the sentence-question pair "John loves Mary. Who performed the action?" would be represented by a sequence of four vectors
(representing John, loves, Mary, Who performed the action?, respectively) pre-
sented in that order to the network. At the fourth time step the correct response is
a vector representing John at the output layer. This task is similar to the querying
of 2-tuples task of the previous chapter, except that the name of the binary relation
(e.g., loves) is included in the input.
The agent and patient components are drawn from the set {Bill, John, Karen, Mark, Vivian}, and the action component is drawn from the set {calls, chases, loves}. That is, there are 5 (agents) × 3 (actions) × 5 (patients) × 3 (questions) = 225 possible sentence-question pairs. All components are encoded locally (i.e., by orthogonal vectors where one dimension has a value of 1 and the rest 0). The
network is said to have demonstrated strong systematicity if, having only seen Mary
in the agent position in the training set, for example, the network consistently
(above chance level) generalizes to cases where Mary is in the patient position in
the test set.
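For concreteness, the task corpus and its local encoding might be generated as follows (an added Python sketch; assigning the nouns and verbs to the 8 input dimensions in this particular order is an assumption, since only the unit counts are fixed by the simulations):

from itertools import product

import numpy as np

NOUNS = ['Bill', 'John', 'Karen', 'Mark', 'Vivian']
VERBS = ['calls', 'chases', 'loves']
QUESTIONS = ['agent?', 'action?', 'patient?']

WORDS = NOUNS + VERBS  # 8 input/output dimensions, one per word

def encode(word):
    """Local (one-hot) encoding: one dimension has value 1, the rest 0."""
    v = np.zeros(len(WORDS))
    v[WORDS.index(word)] = 1.0
    return v

# All 5 x 3 x 5 x 3 = 225 sentence-question pairs.
pairs = list(product(product(NOUNS, VERBS, NOUNS), QUESTIONS))
print(len(pairs))  # 225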
In the next subsection, simulations are run on the tensor-recurrent network to
test whether the network will exhibit strong systematicity of inference on this task.
5.5.2 Simulations
The number of units used in these simulations was given in Figure 5.3. The simu-
lation conditions for the tensor-recurrent network were as follows:
• Local encoding of input and output vectors. As there are 5 possible agent or
patient components, 3 possible action components and 3 possible questions,
8 input, 8 output and 3 question units were used.
• Random generation of 20 training examples from a training example distribution. A number of different training distributions were used (see Table 5.2),
where the overlap between objects that appeared in the agent and patient positions varied from 0 (no noun appeared in both positions) to 5 (every noun appeared in both positions). In each case, with the exception of 0 and 5 noun overlap, noun(s) were left out of the patient position only. It is possible that the effect on generalization, by omitting patient position nouns only, is due to the limited number of patient components in the training set, rather than the limited degree of overlap between agent and patient positions. Thus, in the case of one, two and three noun overlap, a second training distribution was used (see Table 5.3), where nouns were left out of both positions, therefore ensuring that at least 3 different nouns appeared in the agent and patient positions.
• Random initialization of variable network weights from a uniform distribution between -1 and 1.
• Training of the network using the standard error backpropagation algorithm with 0.1 learning rate, no momentum term, and the sum of squares error function (Rumelhart et al., 1986) until outputs were within 0.4 of their target activation for all output units on all training patterns. Training was terminated after 50000 epochs if this criterion was not met. Each word or question was presented to the network one per time step, with the context and tensor units being reset to zero at the beginning of each sequence. Weights were updated at the end of each sentence-question pair (i.e., every four patterns).
• Testing of the network on 100 test sequences randomly generated from a test example distribution. The testing distribution associated with each training distribution is given in Tables 5.2 and 5.3. Importantly, since generalization across position is being tested, the training and testing distributions were not the same5 (i.e., some nouns should not appear in some positions in the training set, but should appear in those positions in the test set). Two testing
5Except in the case where the training example distribution had an overlap of 5. This case, although not considered in the evaluation of strong systematicity, was included for completeness.
Table 5.2: Training and testing example distributions where nouns were omitted from the patient position in the case where there was between 1 and 4 nouns appearing in both positions. The 0 (no noun appearing in both positions) and 5 (every noun appearing in both positions) overlap cases were included for completeness. In the 0 overlap case, nouns were omitted from both positions. The table shows only agent and patient objects (i.e., B - Bill; J - John; K - Karen; M - Mark; V - Vivian). For the training sets, agent and patient objects were combined with every action and question. For the testing set, the superscript q indicates the set of objects that were queried, and therefore tested. The symbol × is the cartesian product operator generating all pair-wise combinations from elements in the respective sets.

Overlap  Train (agent, patient)       Test (agent, patient)
0        {B,J,K} × {M,V}              {M,V}^q × {B,J,K}^q
1        {B,J,K,M,V} × {B}            {B,J,K,M,V} × {J,K,M,V}^q
2        {B,J,K,M,V} × {B,J}          {B,J,K,M,V} × {K,M,V}^q
3        {B,J,K,M,V} × {B,J,K}        {B,J,K,M,V} × {M,V}^q
4        {B,J,K,M,V} × {B,J,K,M}      {B,J,K,M,V} × {V}^q
5        {B,J,K,M,V} × {B,J,K,M,V}    {B,J,K,M,V}^q × {B,J,K,M,V}^q
criteria were used. A response to a test sequence was considered correct when: (1) the maximally activated output unit had a target activation of one - maximum criterion; or (2) all output units were within 0.5 of their target activation - 0.5 criterion. During the test phase, performance on the auto-association of input was not considered. Its purpose was to encourage the formation of internal representations during training, as discussed in the previous section.
• Each train-test trial was repeated 5 times, with modifiable weights randomly initialized at the beginning of each trial.
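The stopping rule and the two testing criteria translate directly into code; a brief sketch (Python/NumPy; the function names are mine, added for clarity):

import numpy as np

def correct_maximum(output, target):
    """Maximum criterion: the most active output unit has target one."""
    return target[int(np.argmax(output))] == 1.0

def correct_half(output, target):
    """0.5 criterion: every output unit within 0.5 of its target."""
    return bool(np.all(np.abs(output - target) < 0.5))

def training_criterion_met(outputs, targets):
    """Training stops when all outputs are within 0.4 of their targets
    on all training patterns (or after 50000 epochs)."""
    return all(np.all(np.abs(o - t) < 0.4) for o, t in zip(outputs, targets))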
5.5.3 Results
In the case of the training and testing distributions from Table 5.2, the percentage
of correct responses (maximum criterion) to the question vector on the test set
averaged over 5 trials is given in Figure 5.4, where the bottom dashed line indicates
Table 5.3: Training and testing example distributions for the 1, 2 and 3 overlap cases, where nouns were omitted from both positions. The table shows only agent and patient objects (i.e., B - Bill; J - John; K - Karen; M - Mark; V - Vivian). For the training sets, agent and patient objects were combined with every action and question. For the testing set, the superscript q indicates the objects that were queried, and therefore tested. The symbol × is the cartesian product operator generating all pair-wise combinations from elements in the respective sets.

Overlap  Train (agent, patient)    Test (agent, patient)
1        {B,J,K} × {K,M,V}         {B,J,K,M^q,V^q} × {B^q,J^q,K,M,V}
2        {B,J,K,M} × {K,M,V}       {B,J,K,M,V^q} × {B^q,J^q,K,M,V}
3        {B,J,K,M} × {J,K,M,V}     {B,J,K,M,V^q} × {B^q,J,K,M,V}
chance level response. The graph shows results only for noun-position combinations
that did not occur in the training set. For example, when no noun appeared in
both agent and patient positions in the training set, the network achieved a mean
accuracy of 74%. When three or more of the possible five nouns appeared in both positions in the training set, the network was 100% accurate on the test set.
For the same distributions, the percentage of correct responses using the 0.5 testing criterion is given in Figure 5.5. Note that chance level performance for this testing criterion is $\frac{1}{2^5} \times 100\%$ ($\approx 3\%$), as there is a 50% chance of the activation of each output unit being on the "right" side of 0.5. For this criterion the network's mean accuracy was 64% with no overlap and 100% for three or more overlapping objects.
In the case of the training and testing distributions from Table 5.3, the percentage of correct responses using the maximum testing criterion and the 0.5 testing criterion are given in Figures 5.6 and 5.7, respectively.
All trials except one converged to the training criterion within 50000 epochs. The single non-convergent case occurred with the zero overlap training distribution. Training times for distributions given in Tables 5.2 and 5.3 are shown in
Figures 5.8 and 5.9, respectively.
Figure 5.4: Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.5: Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.6: Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.7: Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.8: Mean training times as a function of overlap where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2). The zero overlap case does not include the single non-convergent trial. Error bars indicate one standard deviation.
Figure 5.9: Mean training times as a function of overlap where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3). The zero overlap case does not include the single non-convergent trial. Error bars indicate one standard deviation.
5.5.4 Discussion and analysis
The point of these simulations was to test whether the tensor-recurrent network
exhibits generalization across position above chance level with respect to this task
(i.e., strong systematicity of inference). For the training distributions used in Table
5.2, the network showed perfect generalization to novel object-positions on all trials
where there was an overlap of three or more nouns in both positions. Perfect
generalization occurred for maximum and 0.5 testing criteria. Since the degree
of generalization was consistently above chance level, the network is said to have
exhibited strong systematicity of inference with respect to this task.
The strong systematicity exhibited by the network was a consequence of three
factors: (1) architecture; (2) learning; and (3) environment (i.e., training set dis-
tribution). Each of these factors is discussed in turn.
Architecture
Strong systematicity was a consequence of separating the responsibility for repre-
senting component and position information. Partly, this separation was due to
the architecture whereby component information was represented (independent of
its position) at the hidden units, and position information was represented (inde-
pendent of its component value) at the role units. The component-independence
of position information was encouraged by the bottleneck units, which attenuated the effect of the current input on the state.
Learning
The separation of component and position information, however, was also a result
of the network's learning dynamics. It was assumed that backpropagation of error
would orthogonalize the role vectors (which is crucial for the correct extraction
of components). During one trial the activation vectors for the role units were
recorded while training on a data set generated from a distribution where three
nouns appeared in both agent and patient positions (see Table 5.2). Table 5.4
Table 5.4: Orthogonality between agent, action and patient role vectors, measured as one minus the magnitude of the normalized dot product. Role vector $\vec{R}'_i = \vec{R}_i + \vec{C}_{bias}$, where $\vec{R}_i$ is the activity vector at the role units, and $\vec{C}_{bias}$ is the activity vector at the cue units resulting from the bias terms.

Weight updates   (R'_agent, R'_action)   (R'_agent, R'_patient)   (R'_action, R'_patient)
0                0.1                     0.1                      0.0
1000             0.8                     0.7                      0.0
4900             0.8                     0.8                      0.9
shows how the role vectors6 become progressively more orthogonal with training.
(The measure being one minus the magnitude of the normalized dot product so that
zero implies collinear and one implies orthogonal for non-zero vectors.) Before any
training (i.e., at zero weight updates), the role vectors associated with the agent
and action components were almost collinear, as indicated by the orthogonality
measure of 0.1. However, after 1000 weight updates, the two role vectors became
more orthogonal, as indicated by an orthogonality measure of 0.8. At this stage
of training, the action and patient role vectors were still collinear (as indicated by
the orthogonality measure of 0.0). However, after 4900 weight updates, the two
role vectors orthogonalized (with an orthogonality measure of 0.9). It is not nec-
essary that the role vectors be perfectly orthogonal since the non-linear activation
function at the output layer is capable of masking out residual vectors. This table
demonstrates that before training the network was not systematic. The collinearity
of the action and patient role vectors, for example, means that the network could
not have extracted the associated verb without extracting the noun in the patient
position. Therefore, the strong systematicity exhibited by the network was also a
consequence of learning.
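For concreteness, the orthogonality measure used in Table 5.4 can be computed as follows (a minimal sketch in Python, not part of the original simulations; the vectors shown are illustrative, not the recorded role unit activations):

    import numpy as np

    def orthogonality(u, v):
        # One minus the magnitude of the normalized dot product:
        # 0 implies collinear, 1 implies orthogonal (for non-zero vectors).
        return 1.0 - abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

    r_action  = np.array([0.2, 0.9, 0.1])   # hypothetical role vectors
    r_patient = np.array([0.3, 0.8, 0.2])
    print(round(orthogonality(r_action, r_patient), 2))   # near 0: almost collinear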
The network was also required to learn the appropriate cue vectors so that com-
ponent representations could be extracted from the tensor representation. During
the same trial, the activation vectors at the cue units were also recorded. Table 5.5
6. Where the role vector R'_i = R_i + C_bias.
Table 5.5: Collinearity between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions, measured as the magnitude of the normalized dot product. Role vector R'_i = R_i + C_bias, where R_i is the activity vector at the role units and C_bias is the activity vector at the cue units resulting from the bias terms. At time step 4, cue vector C'_i = C_i + R_4, where C_i is the activity vector at the cue units, and R_4 is the activity vector at the role units at time step 4.

Weight updates   (C'_agent, R'_agent)   (C'_action, R'_action)   (C'_patient, R'_patient)
0                0.8                     0.9                      0.9
1000             0.1                     1.0                      1.0
4900             0.8                     0.9                      1.0
Table 5.6: Collinearity before training between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions, measured as the magnitude of the normalized dot product. Role vector R'_i = R_i + C_bias, where R_i is the activity vector at the role units and C_bias is the activity vector at the cue units resulting from the bias terms. At time step 4, cue vector C'_i = C_i + R_4, where C_i is the activity vector at the cue units, and R_4 is the activity vector at the role units at time step 4.

              R'_agent   R'_action   R'_patient
C'_agent      0.8        0.9         1.0
C'_action     0.8        0.9         0.8
C'_patient    0.6        0.8         0.9
shows how the cue vectors7 line up with the appropriate role vectors with training.
(The measure is the magnitude of the normalized dot product.) For example,
after 4900 weight updates, the cue vector for extracting the patient component was
collinear with the role to which the patient component was composed (indicated
by the collinearity measure of 1.0). Before training, cue vectors were collinear with
their associated role vectors. However, the cue vectors were also collinear with
the other role vectors as indicated in Table 5.6. Thus, before training it was not
possible to extract the target component without extracting other components.
7. Where, at time step 4, the cue vector C'_i = C_i + R_4.
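The consequence of non-orthogonal role vectors for extraction can also be seen directly in the tensor scheme (a sketch with made-up vectors; in the network all of these vectors are learnt activations). A cue collinear with two role vectors extracts a blend of both components, whereas an orthogonal role gives a clean extraction:

    import numpy as np

    john = np.array([1.0, 0.0])                  # component (token) vectors
    mary = np.array([0.0, 1.0])
    r1 = np.array([1.0, 0.0, 0.0])               # role vector for position 1
    r2_collinear  = np.array([0.9, 0.1, 0.0])    # nearly collinear with r1
    r2_orthogonal = np.array([0.0, 1.0, 0.0])

    for r2 in (r2_collinear, r2_orthogonal):
        # Bind components to roles (outer product) and superpose the bindings.
        T = np.outer(john, r1) + np.outer(mary, r2)
        # Extract position 1 by taking the inner product with the cue r1.
        print(T @ r1)   # [1. 0.9]: blended; [1. 0.]: clean extraction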
Environment
The third factor influencing the demonstration of strong systematicity of infer-
ence was the environment in which the training examples were generated (i.e., the
training example distribution). With the training distributions given in Table 5.2,
perfect generalization across position was exhibited on all trials when there was an
overlap of three or more nouns in the agent and patient positions. However, when
the number of overlapping items was reduced to two nouns, generalization across
position dropped sharply. In the two item overlap case, the network no longer ex-
hibited above chance level performance with greater than 95% confidence for both
maximum (Figure 5.4) and 0.5 (Figure 5.5) testing criteria. Performance in the
one item overlap case was similar for the maximum testing criterion (Figure 5.4),
and worse for the 0.5 testing criterion (Figure 5.5).
In the one and two item overlap cases, there were only one and two objects
(respectively) appearing in the patient position in the training set. Correct ex-
traction of components requires the network to learn to generate a common role
vector for the patient8 position. With only one or two objects in the patient po-
sition, the network did not generalize to the case whereby a common role vector
was generated for the remaining objects in the patient position. Since only one
cue vector is generated per question, it becomes increasingly difficult to maintain
similarity between the single cue vector and multiple role vectors generated for the
same position. Consequently, performance decreased in these cases.
In the one, two and three item overlap cases, training distributions were changed
so that at least three objects appeared in both agent and patient positions while
the same degree of overlap was maintained (see Table 5.3). For example, in the
one item overlap case, Bill, John and Karen appeared in the agent position and
Karen, Mark and Vivian appeared in the patient position. With these training
distributions, test performance improved to the point where, with at least 95%
confidence, the network generalized to novel object-positions with an accuracy of
8. The network must also learn to generate a common role vector for the agent and action positions.
at least 74% (maximum testing criterion, see Figure 5.6), and at least 64% (0.5
testing criterion, see Figure 5.7). Performance was also higher in the two overlap
case.
The number of network weights also influences the capacity to generate common
role vectors for each position. The bottleneck units were introduced to attenuate the
effect of the input on the generation of state, and therefore role vectors. Increas-
ing the number of bottleneck units increases the degrees of freedom, and therefore
the sensitivity of the role vectors to the current input. In other words, it is more
likely that the generated role vectors are specific to component, rather than posi-
tion information. Therefore, an increase in these weights decreases the degree of
generalization across position.
The network also demonstrated above chance level performance, with greater
than 95% confidence, for both maximum and 0.5 testing criteria when there was
no overlap across agent and patient positions. This result means that the network
will also demonstrate generalization from agent to action, and from action to agent
or patient positions. Thus, it raises the question of whether the network has too
strong an architectural bias. Should generalization across position occur when there
is no overlap? For example, Hadley cites the case where children generated new
verbs from previously learnt nouns (e.g., It was bandaided). Clearly, the degree
to which people generalize across position is an empirical question, and one not
addressed in Hadley's definition of systematicity. Hadley's concern, and the concern
of this thesis, has been to establish some above chance level degree of generalization
across position. However, in the next chapter, architectural properties that permit
generalization across position only in the case of overlap are discussed.
5.6 Summary and conclusions
In the previous chapter, it was shown that a common property preventing a demon-
stration of strong systematicity of inference in the Connectionist models examined
was an independence between access weights (i.e., the weights that implement the
functions that access component representations from the various positions within
a complex representation). It was suggested that a dependency between these
weights would be sufficient to demonstrate strong systematicity of inference. The
purpose of this chapter was to implement and test such a dependency in the form
of the tensor-recurrent network.
Weight dependency was implemented in the tensor-recurrent network by ac-
cessing component representations via the dot product operator in tensor calculus.
In this way, the multiple subspaces provided by tensor units, which hold repre-
sentations of complex objects, can be collapsed down to a single subspace from
which component representations may be extracted independent of their position.
The dependency across access weights was achieved as access was, ultimately, via
the same set of weights. The use of tensors assumes appropriate component and
role representations (for the construction of tensor representations), and cue rep-
resentations (for the extraction of component representations). Simulations showed
how these representations were learnt in the tensor-recurrent network by back-
propagating error signals through the tensor units to the weights that generate the
component, role and cue vectors. Successful extraction of component representa-
tions also relies on the network generating common (i.e., item independent) role
vectors for each position within a complex object. Simulation results showed that
item independent role vectors were learnt when sufficient objects appeared in each
position in the training set. The number of weights implementing the function that
generates role vectors influences the degree of generalization across position. Thus,
from the simulations it was concluded that the architectural biases assumed in the
tensor-recurrent network, which were, together, sufficient for strong systematicity
of inference on the querying of 3-tuples task, were:
- tensor units, which construct representations of complex objects as the outer
product of internal component and role representations;
- inner product units, which collapse the multiple representational subspaces
down to a single subspace from which component representations were ex-
tracted;
- backpropagation learning dynamic, which allows the appropriate internal
component, role and cue representations to be learnt; and
- component-independent, position-dependent generation of role vectors.
The tensor-recurrent network was designed specifically to address the issue of
strong systematicity of inference. It may be that the network is too systematic, in
that it generalizes to cases not supported by empirical evidence. This possibility
was suggested when the network demonstrated some degree of generalization across
position in the case where there was no overlap across agent and patient positions
in the training set. In the next chapter, the extent to which the assumptions
made in the tensor-recurrent network can be relaxed is discussed. In addition,
the significance of the work in this thesis is discussed in terms of approaches to
cognitive modeling and the relationship of this work to the Classical paradigm.
Chapter 6
Discussion
6.1 Introduction
The purpose of this chapter is to reflect on the work presented in the previous
chapters. Specifically, a discussion is given in relation to: the implications of these
results for Connectionist (and other) approaches to cognitive modeling (section 2); a
limitation of the tensor-recurrent network solution, and a possible extension which
may address the problem (section 3); and the relationship of the Connectionist
approach presented here to the Classical paradigm (section 4). A summary is
provided in section 5.
6.2 Implications
The problem that systematicity posed for the Connectionist approach to cogni-
tion was an explanation as to how the acquisition of systematic behaviour could
be a necessary property of a Connectionist architecture. That is, in terms of a
learning framework, how can learning to represent and process some instances of a
structured object necessarily generalize to other instances conforming to the same
structure? In this section, the implications arising from the results obtained in
addressing this question are discussed.
6.2.1 Same training and testing distribution assumption
In chapter 3, the problem of the necessary acquisition of systematic behaviour was
framed in terms of probably approximately correct (PAC)-learnability. That is, a
Connectionist model was considered to have necessarily acquired systematic be-
haviour over some suitably structured domain when a high degree of generalization
over that domain was obtained, with a high degree of confidence (i.e., over repeated
trials) with at most a polynomial amount of computational resource in some pa-
rameter that measures the size of the behaviour being learnt. Using this criterion,
the feedforward network was shown to demonstrate systematicity with respect to
the auto-association of N -tuples task.
Although the feedforward network demonstrated systematicity as defined in
terms of PAC-learnability, it was suggested that this definition was too weak for
the purposes of cognitive models. For although this de�nition is suitable in terms
of computational feasibility (i.e., any model requiring more than polynomial re-
source is impractical as there is rarely enough resource available to learn even
moderately sized behaviours), it did not consider the amount of resource actually
required. Consequently, a model may be computationally feasible, but still require
more resource than is ever needed by people. Thus, an alternative de�nition of
systematicity was considered, which was Hadley's strong systematicity, whereby a
model is said to exhibit the acquisition of systematicity if it shows above chance
level generalization to component-position combinations on which it had never been
trained (i.e., generalization across position). In chapter 4, the same feedforward
network was examined on the same task for strong systematicity. It was found that
the feedforward network could not exhibit strong systematicity. The implication
of this result is concerned with an assumption regarding the training and testing
distributions.
The PAC-learnability framework generally assumes that the examples on which
the network was trained are drawn from the same distribution as the examples used
for testing generalization1. However, a test for strong systematicity, by definition,
1. Although Bartlett (1992) has considered the case where the training-testing distribution changes by a small constant.
requires the training and testing distributions to be different. More specifically, if a
model is to demonstrate strong systematicity then there must be objects from the
domain that are drawn from a training distribution Ptrain, such that Ptrain(Ri =
x) = 0 (i.e., the probability of object x appearing in role/position Ri is zero),
but are drawn from a test distribution Ptest, such that Ptest(Ri = x) 6= 0 (i.e.,
the probability of the same object (x) appearing in same role/position Ri is not
zero). Thus, the implication for Connectionist (and other learning) approaches
to cognitive modeling is that criteria derived from the PAC-learning framework,
where there is an assumption that the training and testing distributions are the
same, are potentially weak criteria in that they may admit cognitive models that
cannot account for particular behaviours, as was the case, for example, with the
three-layer feedforward network with respect to strong systematicity2.
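The condition can be made concrete with a small sketch (the item names below are hypothetical; the point is the construction of two distributions with Ptrain(patient = mary) = 0 but Ptest(patient = mary) ≠ 0):

    # Training set: 'mary' never occurs in the patient (second) position.
    agents   = ['john', 'mary', 'bill', 'karen']
    patients = ['john', 'bill', 'karen']          # 'mary' withheld from training

    train = [(a, p) for a in agents for p in patients if a != p]
    test  = [(a, 'mary') for a in agents if a != 'mary']

    assert all(p != 'mary' for _, p in train)     # Ptrain(patient = mary) = 0
    assert all(p == 'mary' for _, p in test)      # Ptest(patient = mary) = 1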
6.2.2 Structured networks
Traditionally, the approach taken in Connectionism has been to specify general-
purpose, computationally-sufficient3 architectures; and then provide a learning dy-
namic, which allows for the acquisition of some interesting behaviour. Numerous re-
sults already exist regarding the capacity of Connectionist networks to compute ar-
bitrary functions under a variety of basis functions. For example, it has been shown
that feedforward networks with sufficient sigmoidal (Funahashi, 1989), squashing
(Hornik et al., 1989), and gaussian (Hartman, Keeler, & Kowalski, 1990) units are
universal approximators. Furthermore, Sato et al. (1990) presented a recurrent
network as a universal approximator to a general class of dynamical systems, and
Siegelman and Sontag (1991) proved that a recurrent network with finite sigmoidal
units is computationally equivalent to a universal Turing machine. The implica-
tion of the negative results of the three-layer feedforward and recurrent networks of
chapter 4 is that it is not sufficient to specify computationally-general mechanisms,
2. Amsterdam (1988) has criticized the formal learning approach, from a philosophical perspective, on the basis that it only considers learning from naturally occurring examples, and therefore does not consider the possibility of acquiring concepts that do not occur naturally.
3. In the sense that, with sufficient internal units, a network can compute an arbitrary function.
one must also consider behaviour-specific biases. For example, although the feed-
forward network in chapters 3 and 4 has sufficient resources to represent (and even
learn) behaviours over structured domains, it requires too many examples to be an
adequate model in regard to the necessary acquisition of systematic behaviour.
Of course, it has generally been recognized that Connectionist learning models
require some form of architectural bias for the acquisition of complex behaviours.
The important question has been, and still is: what sort of biases are necessary, or
even sufficient for the acquisition of these behaviours? In chapter 4, it was shown
that the property preventing the networks from exhibiting strong systematicity
was the independence between the weights that access component representations
from the various roles within a complex representation. In chapter 5, weight de-
pendence was introduced into the tensor-recurrent network. The characteristic
di�erence between the tensor-recurrent network and the networks in chapter 4 was
that components were accessed via a common representational subspace. Thus,
learning to access a component object representation from one position general-
ized to other positions. This characteristic property was also evident in the simple
recurrent network in a demonstration of strong systematicity of representation.
The architectural organization of the tensor-recurrent network was sufficient to
demonstrate strong systematicity on an inference task. However, it was suggested
that the network was too strongly biased in that it exhibited generalization not
apparent in the training data (i.e., when there was no overlap). The implication
of this result is that the network may not exhibit appropriate behaviour where
generalization across position is not required. In the next section, this limitation
is discussed along with a possible extension to address this limitation.
6.3 Limitation and possible extension
In the three-layer networks analyzed in chapter 4, components were extracted from
complex representations, held at the hidden units, via a single layer of weighted
connections linking the hidden and output units. Figure 6.1 characterizes the ex-
traction process for such networks, where the dashed arrows indicate modifiable
Figure 6.1: Characterization of the extraction of component representations from positions 1 and 2 for the three-layer networks in chapter 4 (i.e., with one hidden layer). Dashed arrows indicate modifiable-weighted connections. The diagram shows only the hidden and output layers.
connections. In the tensor-recurrent network, component representations were ex-
tracted via an intermediate layer (the inner product units), which extracted the
desired component subspace from the tensor units. Figure 6.2 characterizes the
extraction process for the tensor-recurrent network, where solid arrows indicate
fixed-weighted connections. The tensor-recurrent network assumed, as one of its
architectural biases, unit connectivity for implementing the inner product. Thus,
the process of collapsing subspaces down to a single subspace was already provided,
regardless of the degree of overlap evident in the training set. Therefore, the net-
work was able to exhibit some degree of generalization across position in the case
where there was no overlap.
One alternative is to make the weighted connections between the two internal
layers of the tensor-recurrent network learnable. In this way, there is the potential
for the network to learn weights which collapse subspaces only in the case where
there was a su�cient degree of overlap in the training set.
The first point to consider is whether a network with modifiable weights between
two hidden layers can demonstrate strong systematicity of inference. Suppose a
network has a similar internal organization to that of the tensor-recurrent network,
except that the fixed weights between the two internal layers are replaced with
modifiable weights (Figure 6.3). Suppose also that the resources of this network
are such that there are two units per position in the first internal layer and two units
in the second layer, and that activation functions of the second internal layer and
Figure 6.2: Characterization of the extraction of component representations from positions 1 and 2 for the tensor-recurrent network. Dashed arrows indicate modifiable-weighted connections. Solid arrows indicate fixed-weighted connections. The diagram shows only the two hidden layers (i.e., tensor and inner product) and the output layer.
output units are the identity and sigmoidal functions, respectively. Thus, two units
are sufficient to represent all possible components in a given position (Kruglyak,
1990). The network must learn the weights between the three layers of units to
demonstrate strong systematicity of inference on the querying of 2-tuples task.
Assume also that the network has learnt to buffer the input onto the two sub-
spaces so that the relative positions within each of the subspaces are the same. Then
the question is: Can the network learn the extraction part of the inference task
without having to be trained on all components in both positions?
Suppose, for example, there are six possible components in each position, and
the network is trained on examples where all components appeared in the first
position, but not in the second position. After training, the network has learnt the
mapping depicted in Figure 6.4. The diagram assumes that the network has also
learnt to select, depending on the query, between the two subspaces by the time
they are represented at the first hidden layer (i.e., only one of the two components
is active at the first hidden layer). The question is: how many components in the
second position are required to learn the same mapping from the second subspace
as was learnt from the first subspace?
Figure 6.3: Characterization of the extraction of component representations from positions 1 and 2 for a network with two hidden layers linked by modifiable-weighted connections. Dashed arrows indicate modifiable-weighted connections. The diagram shows only the two hidden layers and the output layer.
The answer is at most three components. The weights from the first to the
second hidden layer perform a linear transformation. In the case where the network
has been trained on no more than two components in the second position there
are an in�nite number of weight vectors that satisfy the mapping speci�ed by
the training set (Figure 6.5(a)). In the case where there were three components,
there is only one solution (Figure 6.5(a)), and therefore the network generalizes to
the other three positions. By sharing a common representational subspace at the
second layer, the network has already learnt the mapping from the second hidden
layer to the output layer for all components since all components appeared in the
�rst position and were mapped through the same subspace. Thus, the network
has the possibility of exhibiting strong systematicity, given also the buffering and
querying assumptions. The important condition is that the dimensionality of the
positional and common subspaces is less than the number of possible components.
If there are as many dimensions as components then some dimensions would only be
trained on one type of point, and therefore there would not be enough information
in the training set to uniquely specify the mapping. For this reason, the three-
layer networks in chapter 4 could not demonstrate strong systematicity, since the
Figure 6.4: An example of the internal organization of a network, with two units per subspace in the first internal layer and two units for the common subspace in the second internal layer, learning the querying of 2-tuples task. Dashed arrows indicate modifiable-weighted connections. Solid squares indicate training points and empty squares indicate test points. Lines through the common subspace indicate characteristic positioning of hyperplanes for the correct extraction of component representations by the output units. Dashed circle indicates the characteristic positioning of training points for an arbitrary number of components per position.
number of output units was the number of components (which was a condition
on the external representations of objects since there should not be any a priori
similarity between external object representations).
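This counting argument can be checked numerically (a sketch; it assumes, as in the text, that the first-to-second layer mapping is linear with a bias term, i.e. an affine map from two position units to two common-subspace units, giving six free parameters):

    import numpy as np

    def design_matrix(points):
        # Each input point (x1, x2) of the affine map y = Wx + b contributes
        # two equations, one per output unit; the six unknowns are
        # W11, W12, W21, W22, b1, b2.
        rows = []
        for x1, x2 in points:
            rows.append([x1, x2, 0, 0, 1, 0])
            rows.append([0, 0, x1, x2, 0, 1])
        return np.array(rows, dtype=float)

    two_components   = [(0.0, 0.0), (1.0, 0.0)]
    three_components = two_components + [(0.0, 1.0)]   # third, non-collinear

    print(np.linalg.matrix_rank(design_matrix(two_components)))    # 4 < 6: underdetermined
    print(np.linalg.matrix_rank(design_matrix(three_components)))  # 6: unique affine map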
There are a number of ways in which the dimensionality of component subspaces
could be enforced, or at least encouraged. One way is to limit the number of units
in the second hidden layer. The difficulty with this approach is that it is generally
more difficult to learn a function with fewer weights as the solution space is smaller
and there are fewer trajectories which lead to a solution. Alternatively, excessive
weights can be provided initially, and subsequently removed during the course of
training where they have been shown to be irrelevant. Weights can be effectively removed
Figure 6.5: Number of components in the second position required in the training set to demonstrate strong systematicity. With only two components (a) there is not enough information to constrain the possible mappings. Two possible solutions which satisfy the mapping specified by the training points (solid squares) are shown. The correct mapping can be uniquely described with three components (b). Empty squares indicate test points. Dashed arrows indicate modifiable-weighted connections which implement the mapping. Dashed circle indicates the characteristic positioning of training points for an arbitrary number of components per position.
with the use of an additional error term that penalizes unnecessary weights, as was
shown, for example, in the work of Nowlan and Hinton (1991) and Zemel (1993).
The error function not only encourages the network to find the target function, but
to do so with the fewest necessary weights.
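As a much simplified illustration, such a penalty can be written as an extra term added to the task error (a sketch; the particular ratio penalty used here is one simple choice, not Nowlan and Hinton's scheme, which places a mixture-of-Gaussians prior over the weights):

    import numpy as np

    def penalized_error(output, target, weights, lam=0.01):
        # Task error plus a complexity term that behaves like weight decay
        # for small weights but saturates for large ones, so that weights
        # the task does not need are pushed towards zero (and can be pruned).
        task_error = 0.5 * np.sum((output - target) ** 2)
        complexity = np.sum(weights ** 2 / (1.0 + weights ** 2))
        return task_error + lam * complexity

    w = np.array([0.01, -3.0, 0.5])
    print(penalized_error(np.array([0.2]), np.array([0.0]), w))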
Thus, with respect to the querying of 2-tuples example, it is possible for the
network with two hidden layers to be given an excessive number of units (i.e.,
greater than two in the second hidden layer), but make use of only a small portion
of those units. In doing so, it is possible for the network to learn to collapse the
subspaces from the previous layer to a single common subspace where the dimen-
sionality is less than the number of possible components, and therefore demonstrate
generalization across position.
Of course, it would require simulations to verify whether such additional terms
in the error function would result in a demonstration of strong systematicity. How-
ever, error terms that reduce the number of weights could still result in solutions
where all subspaces are collapsed to a common subspace, since this solution has
the fewest number of weights (i.e., two hidden units in the second internal layer are
sufficient to encode arbitrarily many components independent of their position).
What is required is to collapse component subspaces only in the cases where
there is overlap evident in the training set. In the case where there is no overlap,
the subspaces should remain separate. Figure 6.6 characterizes this situation for
the agent-action-patient domain. In the early stages of training (Figure 6.6(a)),
when there is no apparent or significant4 degree of overlap between positions, com-
ponent subspaces remain separate. This situation is likely to occur, for example,
when the network is given excessive numbers of units so that there is no necessity
for the dimensionality of each subspace to be less than the number of possible
components. In the later stages of training, when overlap has been detected, the
network reorganizes the weights associated with the overlapping subspaces so as
to map them onto a common subspace (Figure 6.6(b)). In this way, generalization
across position only occurs with components in the overlapping positions. Gen-
4. One could consider a threshold that the degree of overlap must exceed before subspaces are collapsed.
Figure 6.6: Process of redescribing internal representations. With three subspaces at the second internal layer (a), the network can perform the task, but without demonstrating strong systematicity. By collapsing the two overlapping subspaces (i.e., agent and patient) the network demonstrates strong systematicity, but only with respect to these two positions, not with respect to the action position (b). Dashed arrows indicate modifiable-weighted connections.
eralization to novel components in the action position would not occur since the
network has never been trained to map the new component in the action position
(in the second internal layer) to the output. Recall that each output unit is
dedicated to a single component (by the requirement that there is no a priori simi-
larity between the external representations of components). Therefore, the weights
from the action subspace (second layer) to the output unit have only been trained
on one type of point (i.e., points without the new component).
The question, of course, is how could such training behaviour be a property of a
Connectionist architecture? One possibility is to incorporate an objective function
that encourages this behaviour. In designing an appropriate objective function, one
needs to consider the cases in which representational subspaces should be brought
together, and when they should not. Two subspaces, or internal representations
within those subspaces, should be brought together when they map to the same out-
put. For example, suppose vector representations x_1 and x_2 map to the same (or
similar) output vector y. Then, the weights which generated x_1 and x_2 should
be changed so as to reduce the distance between those two vectors (i.e., collapsing
their representations down to a single vector representation). Note that in the
standard objective functions, which only consider target and network output, the
two vectors will not necessarily come together as a result of error backpropagation,
because in an underconstrained network there are many points in the domain that
map to the same point in the co-domain. So long as the common point (y) in the
co-domain is near the target, the two vectors (x_1 and x_2) will not necessarily be
brought together through training. In the other case, where x_1 and x_2 do not map
to the same output, training would proceed as usual (i.e., by reducing the difference
between output and target vectors independent of each other).
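A minimal sketch of such an additional term (hypothetical; as noted above, no such term was simulated in this thesis) penalizes the distance between internal vectors whose targets match:

    import numpy as np

    def redescription_penalty(hidden_vectors, targets, lam=0.1):
        # Add a penalty on the distance between internal representations
        # x1, x2 that map to the same (or a sufficiently similar) target y,
        # encouraging their subspaces to collapse together.
        penalty = 0.0
        for i in range(len(hidden_vectors)):
            for j in range(i + 1, len(hidden_vectors)):
                if np.allclose(targets[i], targets[j], atol=0.1):
                    penalty += np.sum((hidden_vectors[i] - hidden_vectors[j]) ** 2)
        return lam * penalty

    xs = [np.array([1.0, 0.0]), np.array([0.8, 0.1])]
    ys = [np.array([1.0]), np.array([1.0])]      # same target output
    print(redescription_penalty(xs, ys))          # non-zero: pull x_1, x_2 together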
The interesting point about such a process is its potential relationship to
Karmiloff-Smith's (1992) representation redescription hypothesis. Karmiloff-Smith
hypothesizes that cognitive development undergoes a number of phases (possibly
concurrent) of which behavioural mastery is just one. During the acquisition of a
competent level of performance on some task, additional processes are redescribing
the behaviour into representations that are explicit relative to some other processes.
This process of redescription allows for the transfer of ability on one task to other
structurally related tasks. The transfer of ability is analogous to the way proce-
dures and functions are reusable by other procedures and functions in programming
languages. The process described above has a similar quality. The network in Fig-
ure 6.6(a) has the capacity to learn the target behaviour, but not in a strongly
systematic way with its current organization of internal representations. However,
if the representations are reorganized (redescribed) as in Figure 6.6(b), then the
network has the capacity to demonstrate strong systematicity with respect to the
agent and patient positions.
6.4 Relationship to the Classical paradigm
Having demonstrated a network that is strongly systematic, what then is its re-
lationship to the Classical paradigm? In the Classical explanation of cognition,
cognitive representations are symbols5, or symbol structures (i.e., symbols that
5. The symbols, by themselves, are abstract entities. The physical realization of a symbol may be, for example, a 1011 state of a register, or a 0.4 mV activation potential of a neuron.
are composed of other more primitive symbols, for example aRb), and cognitive
processes are sensitive to the structure of those representations. Since processing
capacity is defined with respect to symbol structures, all objects conforming to that
structure are processable. For example, a symbolic process called first, defined as
first: (XY) → X, takes the structured symbol XY, composed of two symbols X
and Y, and returns the symbol X. The symbol XY stands in some correspondence
relationship to a complex object, which is external to the cognitive system, such
that its component objects correspond to the more primitive symbols X and Y .
The structural relationship between component symbols mirrors a structural rela-
tionship between component objects. Thus, the symbol XY may correspond to any
external complex object with the same relationship between its component objects.
The important point is that the computational processes of Classical architectures
are defined at the level of symbols, and therefore the capacities of these architectures
come in clumps defined by the symbols on which they operate. So, systematicity
is built into a Classical architecture.
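The clumping of capacities is easy to see in code (a trivial sketch): once first is defined over the structure, it applies to every instance of that structure, whatever the component symbols happen to be.

    def first(pair):
        # Defined over the structure (X, Y); applies to all of its instances.
        x, _ = pair
        return x

    print(first(('a', 'b')))          # 'a'
    print(first((3, 4)))              # 3
    print(first((('a', 'b'), 'c')))   # ('a', 'b'): components may themselves be complex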
In the tensor-recurrent network, external representations are mapped via
weighted connections to internal subspaces. After the network has been trained
to the point where it behaves systematically on the querying of 3-tuples task, the
internal representations generated by the network are examples of symbolic repre-
sentations. For instance, at the first time step, a similar role vector is generated
regardless of the object that is presented to the network. The first role vector (R_1)
is a symbol denoting the first component object, and its physical realization is its
role unit activations (e.g., 0.1 0.9 0.5 0.7). At the same time, the external object is
also mapped to a hidden unit vector, for example H_John, which denotes in this case
the object John. The physical realization of this symbol is its corresponding hidden
unit activations. The role symbol is an example of a type (i.e., a symbol denoting
a set of more primitive symbols), and the symbol (realized at the hidden units)
is an example of a token (i.e., element from a set of symbols, or a type instance).
The binding of the token to its type is performed by the outer product operator
and represented at the tensor units. The construction of more complex symbols
is achieved by adding the resulting tensors. Thus, the physical realization of the
complex symbol "John loves Mary" is its corresponding tensor unit activations.
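The construction just described can be written out directly (a sketch with hand-chosen orthogonal vectors; in the network both the token and role vectors are learnt activations):

    import numpy as np

    john, loves, mary = np.eye(3)    # token vectors (hidden unit symbols)
    r1, r2, r3 = np.eye(3)           # role vectors for positions 1, 2, 3

    # Bind each token to its role with the outer product, and superpose the
    # bindings: this tensor is the physical realization of "John loves Mary".
    T = np.outer(john, r1) + np.outer(loves, r2) + np.outer(mary, r3)

    # A cue collinear with a role extracts the bound token (inner product).
    print(T @ r3)   # -> mary's vector, whatever object happens to fill role 3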
The similarity between a Classical architecture and the tensor-recurrent net-
work would be of descriptive value only except that the symbols themselves are
accessible by other processes within the network. Given an appropriate cue vector,
a symbol is selected from a more complex symbol at the tensor units by the inner
product operator, implemented by appropriate connectivity between cue, tensor
and inner product units. It is because the constituent symbol can be extracted
independently of the object to which it currently corresponds that the network
exhibits systematicity.
It would appear that the tensor-recurrent network solution o�ers no more than
a Connectionist implementation of a Classical architecture. However, there is one
important difference, and that difference is concerned with the acquisition of sys-
tematic behaviour. Classical architectures, by definition, operate at the level of
symbols and symbol structures. The grounding of symbols is assumed to occur
by some perceptual level process that maps external stimuli to their corresponding
symbols. Beyond that, cognitive behaviour is a process of symbol manipulation.
Therefore, learning in a Classical architecture, like any other behaviour is a pro-
cess of symbol manipulation. For example, learning the structure sensitive process
first (see above) can only occur as a consequence of generating and testing possible
symbolic mappings. Possible mappings are XY → X; XY → Y; and XY → YX,
which the architecture could either accept or reject on the basis of environmental
evidence. The problem with this form of learning is that the architecture can only
progress from one systematic behaviour to another. At no point is there a transi-
tion from unsystematic to systematic behaviour. The lack of progression is because
the perceptual system on which a Classical architecture is grounded is
not accessible to a Classical learning system.
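A caricature of this generate-and-test learning (a sketch; the hypothesis names are for illustration only) makes the point plain: every hypothesis in the space is itself already systematic, so the learner only ever moves between systematic behaviours.

    # Candidate structure-sensitive mappings over pairs (X, Y).
    hypotheses = {
        'XY -> X':  lambda x, y: x,
        'XY -> Y':  lambda x, y: y,
        'XY -> YX': lambda x, y: (y, x),
    }
    evidence = [(('a', 'b'), 'a'), (('c', 'd'), 'c')]   # observed input/output pairs

    accepted = [name for name, f in hypotheses.items()
                if all(f(*xy) == out for xy, out in evidence)]
    print(accepted)   # ['XY -> X']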
By contrast, in the tensor-recurrent network, the perceptual system (i.e., the
weights that map external representations to symbols) is part of the learning pro-
cess. Accessibility is granted in the tensor-recurrent network by linking external
stimuli to their symbols via continuously differentiable functions. In this way,
the relationship between external representations (stimuli) and their corresponding
symbols can be learnt by a non-symbolic hill-climbing strategy, such as backprop-
agation. Therefore, it allows the tensor-recurrent network, as was demonstrated in
chapter 5, to exhibit a transition from unsystematic to systematic behaviour.
Thus, the contribution of Connectionism is not in an alternative explanation
for systematicity (i.e., the processing is still symbolic where it needs to be), but
in an explanation of how the acquisition of systematic behaviour can be a neces-
sary consequence of a network's internal organization and a non-symbolic learning
process.
6.5 Summary
In this chapter, the implications of the results from this thesis were discussed. Two
general lessons to be taken from this work in regard to cognitive modeling are:
(1) The assumption that the training and testing distributions are the same may
lead to criteria for cognitive architecture that are too weak. Therefore, models
designed with this assumption are potentially incomplete, since for cognitive prop-
erties, at least with respect to strong systematicity, this assumption does not hold.
(2) The lack of strong systematicity in the more "standard" Connectionist architec-
tures points to the need for greater consideration of behaviour-specific architectural biases.
The architectural biases built into the tensor-recurrent network were sufficient to
demonstrate strong systematicity on an inference task.
Also discussed was the case where the tensor-recurrent network was too sys-
tematic (i.e., generalization across position where it was not evident in the training
set). It was suggested that a possible extension to the tensor-recurrent network
was in the use of modifiable weights between the tensor and inner product layers,
and the use of an additional objective function which encourages the collapsing of
the subspaces only in the case where there is significant evidence of overlap in the
training set.
Lastly, comments were given on the relationship between the Connectionist
tensor-recurrent network solution and the Classical paradigm. The characteristic
components of a Classical architecture (i.e., symbols and symbol-level processes)
appeared in the tensor-recurrent network as characteristic components of a Connec-
tionist architecture (i.e., vectors and vector operators). Thus, in terms of system-
atic behaviour, Connectionism offers a "neural-like" implementation of symbolic
processing. However, where Connectionism goes beyond Classicism is in an expla-
nation for the acquisition of systematic behaviour, which is an issue not addressed
in the Classical paradigm since systematicity is built into a Classical architecture,
not acquired.
Chapter 7
Concluding remarks
In chapter 1, I introduced two major paradigms for models of cognitive behaviour
(i.e., Classicism and Connectionism), and the background to an ongoing debate over
the capacity of the Connectionist approach to explain one cognitive-level property
called systematicity. Briefly, Fodor and Pylyshyn (1988) and Fodor and McLaugh-
lin (1990) have argued that Connectionist models cannot exhibit properties such
as systematicity without implementing a Classical architecture (i.e., without re-
sorting to symbolic representations and processes). Therefore, they have argued
that Connectionism is at best an implementation framework for Classical cognitive
architectures.
However, as also mentioned in chapter 1, this conclusion is still far from unan-
imously accepted with argument and counter-argument being put forth in an at-
tempt to resolve the issue. Given the lack of consensus, it was decided that a review
of the debate over the possibility of a Connectionist (non-classical) explanation of
systematicity was an appropriate place to start. Thus, the �rst question of concern
in this thesis was: Can Connectionist models exhibit systematicity without reliance
on symbolic representations and processes?
7.1 Summary of contribution
The arguments for an alternative Connectionist explanation of systematicity have
focused on alternative notions of compositionality (i.e., the way in which archi-
tectures construct internal representations which allow them to reason about the
external world). The Classical notion of compositionality is that the tokening
(i.e., physical inscription) of complex representations is accompanied by a tokening
of component representations (i.e., representations of component objects). After
reviewing three Connectionist notions of compositionality: Smolensky's weak (mi-
crofeatures) and strong (tensors) compositionality; and van Gelder's more general
notion of functional compositionality, I concluded, in chapter 2, that they fail
to provide an alternative explanation of systematicity for one of two reasons, ei-
ther: (1) component representations are tokened, relative to the processes that
access them, wherever complex representations are tokened, in which case one has
an implementation of Classical compositionality; or (2) component representations
are not tokened, in which case, systematicity of inference cannot be demonstrated
since the component representations are not available to other processes. There-
fore, I have concluded that Connectionist models cannot provide an alternative
explanation for systematicity.
However, I have also argued that the limitation of Fodor and Pylyshyn's thesis
is that their requirement for systematicity is grounded solely in terms of compu-
tational capacity (i.e., in terms of the functions computable by an architecture).
Thus, systematicity can be realized by many computationally sufficient architec-
tures, one of which (i.e., tensors) Fodor and McLaughlin (1990) later reject on the
grounds that one must also explain how systematicity is a necessary consequence of
the architectural assumptions. However, the limitation of Fodor and McLaughlin's
position is that they do not specify a criterion by which a model (Connectionist or
otherwise) is said to necessarily exhibit systematicity. Since Connectionism is also
concerned with the acquisition of cognitive behaviour, I have suggested that a \po-
tential" contribution of Connectionism to cognitive theory is in an explanation for
the necessary acquisition of systematic behaviour. Thus, the second and primary
question of concern in this thesis was: What Connectionist architectural properties
are su�cient for the necessary acquisition of systematic behaviour?
In chapter 3, the necessary acquisition of systematic behaviour was defined
in terms of probably approximately correct (PAC)-learnability. PAC-learnability
is the approximation of a function to within some fixed degree of error with some
fixed degree of confidence over repeated trials with resources (e.g., time, processors)
polynomial in the size of the function. This criterion was applied to a feedforward
network on the auto-association of N -tuples task, designed to test systematicity
of representation. The auto-association of N -tuples by a feedforward network has
been examined in Brousse and Smolensky (1989) and Brousse (1991). Although
Brousse and Smolensky demonstrated an exponential increase in the number of
generalizations with respect to N (tuple order), I showed that the percentage of
correct to total number of patterns decreased exponentially in N . Therefore, their
demonstration of generalization with the feedforward network does not meet the
PAC-learnability criterion for the acquisition of systematicity. However, I also
showed that with a polynomial (with respect to N) increase in the number of
training patterns, the feedforward network met the PAC-learnability criterion. The
conditions sufficient for the necessary acquisition of systematicity were access to a
polynomial number of training examples, and no more than a polynomial number
of weights in the network.
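In the usual notation (a standard formulation given here for orientation, not a verbatim restatement of the chapter 3 definitions), the criterion requires that, for error bound ε and confidence parameter δ, the learnt hypothesis h satisfies

    Pr[ error(h) ≤ ε ] ≥ 1 − δ

with the number of training examples, the number of weights, and the training time each bounded by a polynomial in N, 1/ε and 1/δ, where N measures the size of the behaviour being learnt (here, the tuple order).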
PAC-learnability is a useful criterion in that it differentiates possible cogni-
tive architectures on the basis of resource feasibility. Essentially, any architec-
ture requiring more than polynomial resource is impractical as there is rarely
enough resource available for even moderately sized behaviours. However, the
PAC-learnability criterion did not consider the amount of resource used by people.
It is possible for an architecture to acquire systematic behaviour with respect to
the PAC-learnability criterion, yet still require more resource than is ever needed
by people. Thus, a second criterion based on linguistic evidence, called strong
systematicity (Hadley, 1993), was considered in the evaluation of Connectionist
models.
Strong systematicity is the ability, above chance level, to correctly represent
and process complex objects with components in novel positions (i.e., generaliza-
tion across position). Based on statistical analysis of training sets, Hadley found
no evidence of strong systematicity in the six Connectionist models he examined.
It is possible that none of these models exhibited strong systematicity because such
a criterion had not been defined at that time, and that the purpose of the Con-
nectionist work in question was just a demonstration of generalization over some
structured domain. However, it is also possible that there was some common and
fundamental inadequacy with these models given that none of them demonstrated
generalization across position.
In chapter 4, the feedforward network used in the previous chapter was ex-
amined for its capacity to exhibit strong systematicity of representation on the
auto-association of 2-tuples task. I showed, by an analysis of the available in-
formation and network learning properties, that this feedforward network cannot
exhibit strong systematicity of representation. The lack of strong systematicity
was attributed to the independence between the weights that implement the map-
ping of objects in the two component positions. This result suggested that a
recurrent network is likely to exhibit strong systematicity of representation, since
component objects are presented (extracted) via the same set of input (output)
units. Therefore, component mappings are implemented, in part, by the same set
of weights. Simulation results showed that a three-layer simple recurrent network
can demonstrate strong systematicity of representation on the temporal version of
the auto-association of 2-tuples task. Analysis of the internal representations of a
trained network suggested that components were mapped to (and from) common
internal subspaces. Consequently, generalization across position was achieved as
the same set of weights were used to map component representations independent
of their position.
The simple recurrent network was then examined for strong systematicity of
inference on the querying of 2-tuples task. Simulation results did not show strong
systematicity for this network. An analysis of the available information and the
learning requirements of the network showed that the network cannot exhibit strong
systematicity of inference on this task. The lack of strong systematicity of inference
was attributed to the independence between the weights that access components
from the two positions of a complex internal representation. It was also argued
that strong systematicity of inference cannot be demonstrated in a number of other
three-layer networks.
It was suggested that access-weight dependence can be achieved by incorporat-
ing an architectural bias that forces or encourages the access of components via a
common internal subspace. In chapter 5, such a bias was incorporated into a Con-
nectionist architecture called the tensor-recurrent network. In the tensor-recurrent
network, tensors were used to represent complex objects. Using the inner prod-
uct operator, subspaces of the tensor space can be collapsed down to a common
internal subspace from which the components can be extracted via a single set of
learnable weights. Correct extraction of components, however, requires an appro-
priate choice of internal component, role and cue vectors, which are assumed to
exist in a tensor representational scheme. The novel feature of the tensor-recurrent
network is that, by backpropagating an error signal through the units that imple-
ment tensor representations, these vectors are learnt by the network. Simulation
results showed that this network exhibited strong systematicity of inference on the
querying of 3-tuples task.
My design of the tensor-recurrent network was motivated specifically to ad-
dress the issue of strong systematicity. I have suggested that the network was too
strongly biased in that it exhibited some degree of generalization across position
when there was no overlap in the training set. In chapter 6, I then suggested in-
corporating an additional error term designed to encourage the collapse of internal
subspaces only in the case where there was significant overlap in the training set.
In this chapter, I also discussed the implications of the results of this thesis, and the
relationship to the Classical paradigm. Specifically, the points made were: (1) The
PAC-learning framework, where it relies on the assumption that the training and
testing distributions are the same, may be inappropriate for developing cognitive
modeling in that it can admit architectures (e.g., the three-layer feedforward network)
that are inadequate when this assumption does not hold (as was the case with
respect to strong systematicity). (2) Strong systematicity necessitates more struc-
tured (architecturally biased) networks than the three-layer networks examined
(e.g., tensor-recurrent networks). (3) The tensor-recurrent network is an example
of the contribution of Connectionism as providing an explanation for the necessary
acquisition of systematic behaviour from, in part, non-symbolic learning processes.
7.2 Conclusions
The initial question posed in this thesis was:
- Can Connectionist models exhibit systematicity without reliance on symbolic
representations and processes?
After reviewing the systematicity debate in chapter 2, I have concluded that:
- Connectionist models cannot demonstrate systematicity without implementing
some form of symbolic representations and processes.
From this conclusion it was argued that the problem of systematicity for Con-
nectionism is not in an alternative explanation for systematicity, but in an expla-
nation for the necessary acquisition of systematic behaviour. Thus, the second,
and primary question of concern in this thesis was:
- Can Connectionism provide models that exhibit the necessary acquisition of
systematic behaviour?
After testing and analyzing a number of Connectionist models I have concluded
that:
- Connectionism can provide models, as exemplified by the tensor-recurrent net-
work, with architectural properties that are sufficient for the necessary ac-
quisition of systematic behaviour (i.e., strong systematicity) from, in part,
non-symbolic processes.
Of the Connectionist architectures analyzed, only the tensor-recurrent network
demonstrated strong systematicity of inference. From the simulation results and
analysis of the tensor-recurrent network in chapter 5, I have also concluded that:
- the architectural properties of the tensor-recurrent network that were, to-
gether, sufficient to exhibit strong systematicity on a task designed to test
systematicity of inference were:
  - tensor units, which construct representations of complex objects as the
outer product of internal component and role representations;
  - inner product units, which collapse multiple representational subspaces
down to a single subspace from which component representations are
extracted;
  - backpropagation learning dynamics, which allow the appropriate inter-
nal component, role and cue representations to be learnt; and
  - component-independent, position-dependent generation of role vectors.
In addressing the two questions of central concern to this thesis, two more gen-
eral conclusions are drawn. Firstly, in the evaluation of the necessary acquisition
of systematicity, two criteria were used: (1) PAC-learnability (chapter 3); and
(2) Hadley's strong systematicity (chapters 4 and 5). The feedforward network
demonstrated the acquisition of systematic behaviour as defined by the first crite-
rion, but not the second. Therefore, I conclude that:
- the PAC-learnability framework, insofar as it assumes that the training and
testing distributions are the same, is inappropriate for developing cognitive
models that exhibit strong systematicity.
Secondly, the tensor-recurrent network in chapter 5 is an instance of a Connec-
tionist architecture where systematic behaviour emerged from initially unsystem-
atic behaviour, partly, as a consequence of a non-symbolic learning process (i.e.,
backpropagation). Therefore, I conclude that:
- Connectionism can contribute to cognitive theory in the form of an explana-
tion of the necessary acquisition of symbolic properties such as systematicity
from, in part, non-symbolic processes like backpropagation.
7.3 Further work
The systematicity problem is a watershed for cognitive research in that it segregates
potential cognitive architectures on the basis of whether or not such architectures
have the capacity to represent and generalize in structured domains.
Systematicity is an instance of a more general problem facing the Connectionist
approach to cognitive modeling. That is, learning high-level structured represen-
tations from low-level information.
The tensor-recurrent network approach to demonstrating strong systematicity
of inference suggests rethinking the notion of the type/token distinction. The
advantage of a type/token distinction is that if a function is learnt at the type level
then learning generalizes to all tokens of those types. Another way of characterizing
the lack of strong systematicity of inference in the first-order networks studied in
chapter 4 is that there was no explicit distinction (i.e., one that is readily accessible
by other processes within the network) in the network's internal representation
space between the tokening of a component and the component's type (i.e., role
within the complex object).
The general problem that a type/token distinction introduces into a Connec-
tionist learning framework (which is essentially hill-climbing) is how a hill-
climbing strategy can be applied to a domain where identifiers have only logical, not
spatial significance. For, in general, the hill-climbing strategy requires a contin-
uously differentiable surface, yet symbol names serve only to identify one symbol
from another, and so are, in general, not defined with respect to any similarity
measure. That is to say, a comparison between two identifiers is, in general, a
two-valued function: either the two identifiers are the same, or they are not the
same.
The solution to this problem has been hinted at in the form of the tensor-recurrent network, where type identifiers were vectors in the role unit activation space linked (via a continuously differentiable function) to their associated tokens, which were vectors in the hidden unit activation space. Since the token and type identifier spaces, and the functions linking the two (i.e., inner and outer products), are all continuously differentiable, type-level mappings can be learnt by a hill-climbing strategy through the principle of error (information) propagation through structured architectures.
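To make the differentiability claim concrete, the following is a sketch in my own notation (the symbols T, c_i, r_i, q and t, and the squared-error form, are assumptions rather than definitions from the thesis). With tokens c_i bound to type identifiers r_i, a cue q, readout o and target t:

    T = \sum_i c_i r_i^{\mathsf T}, \qquad o = Tq, \qquad E = \tfrac{1}{2}\lVert o - t \rVert^2

    \frac{\partial E}{\partial c_i} = (r_i^{\mathsf T} q)\,(o - t), \qquad
    \frac{\partial E}{\partial r_i} = \bigl(c_i^{\mathsf T}(o - t)\bigr)\, q, \qquad
    \frac{\partial E}{\partial q} = T^{\mathsf T}(o - t).

Every factor in the chain is continuous in the vectors being learnt, so error at the output assigns credit to tokens, type identifiers and cues alike.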
There are, however, a number of limitations with the tensor-recurrent network, which suggest directions for further work. The first limitation, which was discussed in chapter 6, is the "over-generalization" of the network. The tensor-recurrent network demonstrated some degree of generalization across position even when there was no evidence of overlap in the training set. One direction of further work is to incorporate an additional error term so that subspaces are collapsed only in the case where there is significant overlap in the training set. The suggestion given was an error term that reduces the difference between internal representations (vectors) that map to the same output vector (a sketch follows). This approach differs from the standard approach, which only considers the difference between network and target output.
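The following is a hedged sketch of such a term (the pairing scheme, the squared-distance form, and the weighting lam are my assumptions, not a worked-out proposal from chapter 6):

import numpy as np

def total_error(hidden, outputs, targets, lam=0.1):
    """hidden: (n, h) internal vectors; outputs, targets: (n, d) arrays."""
    # Standard term: difference between network and target output.
    output_err = 0.5 * np.sum((outputs - targets) ** 2)

    # Additional term: penalize the distance between internal
    # representations of examples with the same target, so that
    # subspaces are drawn together only where the training set
    # actually overlaps.
    rep_err = 0.0
    n = len(targets)
    for i in range(n):
        for j in range(i + 1, n):
            if np.array_equal(targets[i], targets[j]):
                rep_err += 0.5 * np.sum((hidden[i] - hidden[j]) ** 2)

    return output_err + lam * rep_err

Minimizing the second term pulls together only those internal vectors that the task itself treats as equivalent, which is the intended restriction on collapsing subspaces.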
The second limitation concerns the generation of role vectors. In the tasks
considered, the type/token distinction was artificially clean in that each component
was associated with one position at each time step. Thus, the point at which to
make the binding of a type identifier and its associated token was built into the network; only the values of those identifiers and tokens were learnt. In reality, there
are multiple component levels (e.g., sentences are composed of words, and words are
composed of phonemes), and the point at which to make/break bindings depends
entirely on the structure of the domain. Thus, one possibility for further work is
to consider the same tasks, but with input encodings at the phoneme level rather
than the word level. A network would still be required to demonstrate generalization across
word position, but also to learn the points at which to bind component words to
their roles.
The inference tasks considered in this thesis only required the networks to
extract one of the possible components. More complex behaviours involve the
restructuring of input, for example, in active to passive transformations, where
complex objects of the form Agent-Action(active)-Patient are mapped to objects
with the form Patient-Action(passive)-Agent. One possibility for addressing this
task is with modifiable recurrent connections on the cue units, which would allow the network to generate a sequence of cue vectors for extraction of a sequence of components. Furthermore, the tasks considered had a flat structure, where each
component was an atomic object (i.e., no further components). Natural language
processing, however, operates over recursive structures where, for example, a role
may be bound to an entire phrase containing its own component-role bindings.
Recent work by Hadley (1994), Hadley and Hayward (1994), and Niklasson and van Gelder (1994) considers generalization to different levels of embedding in recursively structured objects as an even tighter criterion for the demonstration of
systematicity. In this case, one could include connectivity which allows the binding
of role vectors to tensor representations of complex objects, rather than just the
hidden unit representations of atomic objects.
More generally, though, people have the capacity to generalize from behaviours
learnt in one domain to structurally similar behaviours in other domains. That
is, people have the capacity to reason analogically. This degree of generalization
goes beyond Hadley's definition of strong systematicity, since in the new domain,
people can generalize to complex objects where components have not been seen in
any position. For example, given the question, "Grapes are to wine, as rice is to ?", one is likely to respond with "sake", based on knowledge of the relation (i.e.,
is-used-in) in both domains, given very little experience with this type of question
before, and almost certainly no experience with the four components in the context
of this question. Work has already begun in an attempt to address this form of
generalization with Connectionist models (see Halford, Wilson, Guo, Gayler, Wiles,
& Stewart, 1994).
In this work, it was assumed that systematicity was not due to any a priori sim-
ilarity between component objects. However, it is possible that existing similarities
between component objects (e.g., run and walk) play some part in the acquisition of
systematic behaviour. Another possible extension of this work would be to examine
networks trained on complex objects whose component representations share some
degree of similarity, and then test these networks on objects with completely dis-
similar component representations. With respect to the tensor-recurrent network,
for example, it may be that initial similarities facilitate the acquisition of stable
role vectors for particular component positions.
Connectionist networks have demonstrated generalization on a wide variety of
tasks, yet their scalability to the size and complexity of human cognitive domains
is still in question. One could argue that the lack of scalability is a limitation of current technology: processing capacity, in terms of numbers of processors, is not yet available on the same order of magnitude as that of the brain. The value of
addressing structure-related properties like systematicity is that they also apply to
small-scale tasks. Architectures that cannot exhibit systematicity on small tasks
will not scale to larger tasks. Thus, research that considers these properties is not
only of theoretical significance, but of practical importance as well.
Appendix A
Understanding as generalization
not just representation*
Halford and Wilson (1994) define understanding of a concept in terms of repre-
senting the relation to which that concept is equivalent. A model is said to have
represented a relation when it performs the correct input-output mappings for all
functions implicated by that relation. For example, in the balance scale domain,
the concept of balance is a relation between two weight and two distance variables
and a state variable (which has the possible values: balance, tip-left and tip-right).
The concept of balance implicates a number of questions which can be used to
evaluate a child's understanding of the concept.
For example, the balance state question is a function from the two weight and two distance variables to the state variable. The missing weight question is a function from two distance variables, a weight variable, and a state variable to a weight variable. The missing distance question is a function from two weight variables, a distance variable, and a state variable to a distance variable. More generally, from any combination of four variables, the fifth variable can be predicted (a toy illustration follows).
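As a toy illustration (mine, not Halford and Wilson's model; the torque rule weight × distance and the function names are assumptions), two of the implicated functions might be written:

def balance_state(wl, dl, wr, dr):
    # Balance state question: two weights and two distances -> state.
    left, right = wl * dl, wr * dr
    if left == right:
        return "balance"
    return "tip-left" if left > right else "tip-right"

def missing_weight(dl, wr, dr, state):
    # Missing weight question: recover the left weight consistent with
    # the given state (only the balanced case has a unique answer).
    assert state == "balance"
    return (wr * dr) / dl

# e.g. balance_state(2, 3, 3, 2) == "balance" and
#      missing_weight(3, 3, 2, "balance") == 2.0

On this definition, a model understands the concept only if it performs correctly on all such implicated functions, not just the one it was trained on.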
Halford and Wilson argue that just as a child's understanding of the concept
balance is evaluated on a battery of implicated questions, so too a model of that
concept can be evaluated by its performance on a number of implicated functions.
*This work appears in Phillips (1994).
That is, a model can be evaluated on the basis of whether or not it performs the same input-output mappings. Their main point is that such a definition provides an adequate criterion against which candidate Connectionist models can be evaluated. With this definition the authors reject McClelland's (in press) feedforward network model on the basis that it cannot demonstrate adequate performance on all three questions (i.e., the model is incomplete).
The first point I want to make is that implicit in their definition is a notion of generalization (i.e., the capacity to take existing representations and apply them correctly to previously unencountered situations). I suggest that by confounding these two issues (representation and generalization) some respondents to Halford's presentation have been led to question the authors' justification for rejecting McClelland's feedforward network model.
The authors claim that McClelland's model cannot perform the additional miss-
ing weight and missing distance tasks. However, they do not state the basis for
this claim. Clearly, McClelland's model cannot be rejected on the basis of the
representational capacity of the feedforward network alone. Since multi-layer feed-
forward networks are universal function approximators (Hornik et al., 1989), there is every reason to expect that, given sufficient information and time, the network could perform correctly on these additional tasks.
Presumably, however, the authors have some additional criterion in mind. I
suspect that although they would agree that McClelland's model (suitably ex-
tended) can represent the required functions, to do so would require an extensive
and unaccountable amount of (re)training. If it can be shown that such training
is not available to (nor required by) children, then there is a basis for rejecting
the feedforward network model. That is, on the basis of the model's generalization
characteristics.
The question is, of course, how much training do children receive? Based on
empirical evidence, Hadley (1993, 1994) presents an example of how generaliza-
tion may be characterized in the linguistic domain. He identifies generalization
across syntactic position as a criterion against which Connectionist models can be
evaluated. Connectionist models can then be differentiated on the basis of this generalization criterion. For example, as was shown in chapter 4, three-layer feedforward networks without the assumption of weight tying cannot demonstrate
generalization across syntactic position, yet recurrent networks in a limited case
can demonstrate this form of generalization.
Essentially, the issue is that although some models have the capacity to represent any function of interest, they simply require too many training examples to
be psychologically plausible. I suspect that it is on grounds of generalization that
the authors want to reject McClelland's model. The point I want to make is that
the degree of generalization considered as psychologically plausible must be made
explicit; and in doing so, the authors would then have much stronger grounds for
rejecting particular Connectionist models. That is, on the basis that these models
make use of information either not available to, or not required by, children.
The issue of generalization brings me to my second point: if generalization is
used as a criterion on which to accept or reject models, then the tensor model that
Halford and Wilson are proposing is incomplete. For, although it demonstrates
how relational concepts may be represented, there is no story as to how these
representations might arise from experience.
In Halford and Wilson's formulation, an appropriate arrangement of weights and connections is in place to represent relational concepts as tensor products. Presumably, however, these weights and connections were not always there; otherwise, this model could not account for the empirical fact that children below a certain developmental stage do not understand (in Halford and Wilson's sense) the
concept of balance.
The two issues of representation and generalization place the authors in a
dilemma. Explicitly, they want to distinguish between two models on the basis of
representational capacity, yet both models (suitably extended) are capable of rep-
resenting relations. Implicitly, I suspect they want to reject McClelland's model on
grounds of generalization. However, if generalization is the distinguishing criterion
then they are required to tell a story of how the right arrangement of weights and
connections of a tensor model comes into being; and that is a story not yet told.
Bibliography
Abu-Mostafa, Y. S. (1990). Learning from hints in neural networks. Journal of Complexity, 6, 192–198.

Al-Mashouq, K. A., & Reed, I. S. (1991). Including hints in training neural nets. Neural Computation, 3, 418–427.

Amsterdam, J. (1988). Some philosophical problems with formal learning theory. In The seventh national conference on artificial intelligence, pp. 580–584. Saint Paul, MN: Morgan Kaufmann.

Anthony, M. (1992). An introduction to computational learning theory. Cambridge tracts in theoretical computer science. Cambridge, England: Cambridge University Press.

Bakker, P., Phillips, S., & Wiles, J. (1993). The N-2-N encoder: A matter of representation. In S. Gielen & B. Kappen (Eds.), ICANN'93: Proceedings of the International Conference on Artificial Neural Networks, pp. 554–557. London: Springer-Verlag.

Bakker, P., Phillips, S., & Wiles, J. (1994). The 1000-2-1000 encoder: A matter of representation. Neural Network World, 4 (5), 527–534.

Bartlett, P. L. (1992). Computational learning theory and neural network learning. Ph.D. thesis, The University of Queensland, Dept. of Electrical and Computer Engineering.

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Baum, E. B., & Wilczek, F. (1987). Supervised learning of probability distributions by neural networks. In D. Z. Anderson (Ed.), Neural Information Processing Systems, pp. 52–61. New York, NY: American Institute of Physics.

Blum, A. L., & Rivest, R. L. (1990). Training a 3-node neural network is NP-complete. Tech. Rep., MIT, Cambridge, MA.

Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5 (1), 117–127.

Brooks, R. A. (1991a). Intelligence without reason. In 12th International Joint Conference on Artificial Intelligence, pp. 569–595.

Brooks, R. A. (1991b). Intelligence without representation. Artificial Intelligence, 47, 139–160.

Brousse, O. J. (1991). Generativity and systematicity in neural network combinatorial learning. Ph.D. thesis, University of Colorado, Boulder, CO.

Brousse, O. J., & Smolensky, P. (1989). Virtual memories and massive generalization in connectionist combinatorial learning. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pp. 380–387. Hillsdale, NJ: Lawrence Erlbaum.

Brown, T. L., & LeMay, H. E. (1981). Chemistry: The central science (2nd edition). Englewood Cliffs, NJ: Prentice Hall.

Butler, K. (1993). Connectionism, classical cognitivism and the relation between cognitive and implementational levels of analysis. Philosophical Psychology, 6 (3), 321–333.

Chalmers, D. J. (1990a). Syntactic transformations on distributed representations. Connection Science, 2, 53–62.

Chalmers, D. J. (1990b). Why Fodor and Pylyshyn were wrong: The simplest refutation. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, pp. 340–347. Hillsdale, NJ: Lawrence Erlbaum.
Chalmers, D. J. (1993). Connectionism and compositionality: Why Fodor and Pylyshyn were wrong. Philosophical Psychology, 6, 305–319.

Chater, N., & Oaksford, M. (1990). Autonomy, implementation and cognitive architecture: A reply to Fodor and Pylyshyn. Cognition, 34, 93–107.

Clark, A. (1991). Systematicity, structured representations and cognitive architecture: A reply to Fodor and Pylyshyn. In T. Horgan & J. Tienson (Eds.), Connectionism and the Philosophy of Mind, pp. 198–218. Dordrecht, The Netherlands: Kluwer Academic.

Dennis, S., & Phillips, S. (1991). Analysis tools for neural networks. Tech. Rep. 207, University of Queensland, Brisbane, Australia.

Dyer, M. G. (1991). Connectionism versus symbolism in higher-level cognition. In T. Horgan (Ed.), Connectionism and the Philosophy of Mind, pp. 382–416. Dordrecht, The Netherlands: Kluwer Academic.

Elman, J. L. (1989). Representation and structure in connectionist models. Tech. Rep. 8903, University of California, San Diego, CA.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Elman, J. L. (1991). Incremental learning, or the importance of starting small. Tech. Rep. 9101, University of California, San Diego, CA.

Fodor, J. A. (1975). The Language of Thought. Language and Thought. New York, NY: Crowell.

Fodor, J. A., & McLaughlin, B. P. (1990). Connectionism and the problem of systematicity: Why Smolensky's solution doesn't work. Cognition, 35, 183–204.

Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.

Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability: A guide to the theory of NP-completeness. San Francisco, CA: W. H. Freeman.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4 (1), 1–58.

Hadley, R. F. (1993). Compositionality and systematicity in connectionist language learning. Tech. Rep. CSS-IS TR 93-01, Simon Fraser University, Burnaby, BC.

Hadley, R. F. (1994). Systematicity in connectionist language learning. Mind and Language, 9 (3), 247–272.

Hadley, R. F., & Hayward, M. (1994). Strong semantic systematicity from unsupervised connectionist learning. Tech. Rep. CSS-IS TR94-02, Simon Fraser University.

Halford, G. S., & Wilson, W. H. (1994). How far do neural networks account for human reasoning? In J. Wiles, C. Latimer, & C. Stevens (Eds.), Collected Papers from a symposium on connectionist models and psychology, pp. 50–62. Department of Computer Science, The University of Queensland, Brisbane, Australia.

Halford, G. S., Wilson, W. H., Guo, J., Gayler, R. W., Wiles, J., & Stewart, J. E. M. (1994). Connectionist implications for processing capacity limitations in analogies. In K. J. Holyoak & J. Barnden (Eds.), Advances in Connectionist and Neural Network theory, chap. 7, pp. 363–415. Norwood, NJ: Ablex.

Hartman, E., Keeler, J. D., & Kowalski, J. M. (1990). Layered neural networks with Gaussian hidden units as universal approximators. Neural Computation, 2 (2), 210–215.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Tech. Rep. UCSC-CRL-91-02, University of California, Santa Cruz, CA.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Jordan, M. I. (1990). Serial order: A parallel distributed processing approach. Tech. Rep., MIT, Cambridge, MA.

Judd, S. (1987). Learning in networks is hard. In M. Caudill & C. Butler (Eds.), IEEE International Conference on Neural Networks, pp. 685–692. Piscataway, NJ: IEEE Service Center.

Judd, S. (1988). On the complexity of loading shallow neural networks. Journal of Complexity, 4, 177–192.

Judd, S. (1990). Neural Network Design and the Complexity of Learning. Cambridge, MA: MIT Press.

Karmiloff-Smith, A. (1992). Beyond Modularity: A developmental perspective on cognitive science. Cambridge, MA: MIT Press/Bradford Books.

Kern, L. H., Mirels, H. L., & Hinshaw, V. G. (1983). Scientists' understanding of propositional logic: An experimental investigation. Social Studies of Science, 13, 131–146.

Kirsh, D. (1991a). Today the earwig, tomorrow man. Artificial Intelligence, 47 (1), 161–184.

Kirsh, D. (1991b). When is information explicitly represented? In P. Hanson (Ed.), Information, Language and Cognition: Vancouver Studies in Cognitive Science, pp. 340–365. Vancouver, BC: UBC Press.
Kruglyak, L. (1990). How to solve the N bit encoder problem with just two hidden units. Neural Computation, 2, 399–401.

Lapedes, A., & Farber, R. (1988). How neural nets work. In D. Z. Anderson (Ed.), Neural Information Processing Systems, pp. 442–456. New York, NY: American Institute of Physics.

LeCun, Y., Boser, B., & Denker, J. S. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1 (4), 541–551.

Lin, J. H., & Vitter, J. S. (1989). Complexity issues in learning by neural nets. In R. Rivest, D. Haussler, & M. K. Warmuth (Eds.), Proceedings of the Second Annual Workshop on Computational Learning Theory, pp. 118–133. San Mateo, CA: Morgan Kaufmann.

Lister, R. (1993). Visualizing weight dynamics in the N-2-N encoder. In Proceedings of the IEEE International Conference on Neural Networks. Piscataway, NJ: IEEE Service Center.

Lister, R., Bakker, P., & Wiles, J. (1993). Error signal, exceptions and back propagation. In Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 573–576. Piscataway, NJ: IEEE Service Center.

Marcus, G. F., Brinkmann, U., Clahsen, H., Wiese, R., Woest, A., & Pinker, S. (1993). German inflection: The exception that proves the rule. Tech. Rep. 47, MIT, Cambridge, MA.

Marr, D. (1982). Vision: a computational investigation into the human representation and processing of visual information. San Francisco, CA: W. H. Freeman.

Maskara, A., & Noetzel, A. (1992). Forcing simple recurrent networks to encode context. Tech. Rep., New Jersey Institute of Technology, NJ. Also in the Proceedings of the 1992 Long Island Conference on Artificial Intelligence and Computer Graphics.
McClelland, J. L. (in press). A connectionist perspective on knowledge and development. In T. Simon & G. S. Halford (Eds.), Developing Cognitive Competence: New Approaches to Cognitive Modeling. Hillsdale, NJ: Erlbaum.

McClelland, J. L., & Kawamoto, A. H. (1986). Mechanisms of Sentence Processing: Assigning Roles to Constituents of Sentences, Vol. 2 of Computational models of cognition and perception, chap. 19, pp. 272–331. Cambridge, MA: MIT Press.

McClelland, J. L., Rumelhart, D. E., & the PDP research group (Eds.). (1986). Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol. 2 of Computational models of cognition and perception. Cambridge, MA: MIT Press.

Minsky, M. L., & Papert, S. A. (1990). Perceptrons (4th edition). Cambridge, MA: MIT Press.

Newell, A. (1980). Physical symbol systems. Cognitive Science, 4, 135–183.

Niklasson, L., & Sharkey, N. E. (1992). Connectionism and the issues of compositionality and systematicity. In R. Trappl (Ed.), Proceedings of the Cybernetics and Systems Research, pp. 1367–1374.

Niklasson, L., & van Gelder, T. (1994). Can connectionist models exhibit non-classical structure sensitivity? In A. Ram & K. Eiselt (Eds.), Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pp. 644–669. Lawrence Erlbaum.

Niklasson, L. F. (1993). Structure sensitivity in connectionist models. In Proceedings of the 1993 Connectionist Summer School, pp. 162–169. Hillsdale, NJ: Lawrence Erlbaum.

Nowlan, S. J., & Hinton, G. E. (1991). Simplifying neural networks by soft-weight sharing. Tech. Rep., The Salk Institute, San Diego, CA.
Phatak, D. S. (1993). Construction of minimal n-2-n encoders for any n. Neural Computation, 5 (5), 783–794.

Phillips, S. (1993). The effect of representation on error surface. In P. Leong & M. Jabri (Eds.), Proceedings of the Fourth Australian Conference on Neural Networks, pp. 86–89. Sydney, Australia: Sydney University Electrical Engineering.

Phillips, S. (1994). Understanding as generalization not just representation. In J. Wiles, C. Latimer, & C. Stevens (Eds.), Collected Papers from a symposium on connectionist models and psychology, pp. 110–111. The University of Queensland, Brisbane, Australia: Department of Computer Science. Technical report No. 289. A comment on Halford and Wilson's 'How far do neural networks account for human reasoning?'.

Phillips, S., & Wiles, J. (1991). On learning long-term contingencies. Unpublished manuscript.

Pinker, S. (1984). Language Learnability and Language Development. Cognitive Science Series. Cambridge, MA: Harvard University Press.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 77–105.

Pylyshyn, Z. W. (1980). Computation and cognition: Issues in the foundations of cognitive science. Behavioral and Brain Sciences, 3, 111–169.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Internal Representations by Error Propagation, Vol. 1 of Computational models of cognition and perception, chap. 8, pp. 319–362. Cambridge, MA: MIT Press.
Rumelhart, D. E., & McClelland, J. L. (1986). On Learning Past Tenses of English Verbs, chap. 18, pp. 216–271. Computational models of cognition and perception. Cambridge, MA: MIT Press.

Rumelhart, D. E., McClelland, J. L., & the PDP research group (Eds.). (1986). Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol. 1 of Computational models of cognition and perception. Cambridge, MA: MIT Press.

Sato, M., Murakami, Y., & Joe, K. (1990). Learning chaotic dynamics by recurrent neural networks. In Proceedings of the International Conference on Fuzzy Logic and Neural Networks.

Shawe-Taylor, J., & Anthony, M. (1991). Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, 2 (1), 107–117.

Siegelmann, H. T., & Sontag, E. D. (1991). On the computational power of neural nets. Tech. Rep., Rutgers University, New Brunswick, NJ.

Smolensky, P. (1987a). The constituent structure of connectionist mental states: A reply to Fodor and Pylyshyn. Southern Journal of Philosophy, 26, 137–161.

Smolensky, P. (1987b). On variable binding and the representation of symbolic structures in connectionist systems. Tech. Rep. CU-CS-355-87, University of Colorado, Boulder, CO.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216.

Smolensky, P. (1991). Connectionism, constituency, and the language of thought. In B. Loewer & G. Rey (Eds.), Meaning in Mind: Fodor and his critics, chap. 12, pp. 201–227. Cambridge, MA: Blackwell.

Spivey, J. M. (1989). The Z Notation: A reference manual. Computer Science. New York, NY: Prentice Hall.
St. John, M. F., & McClelland, J. L. (1990). Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46, 217–257.

Suddarth, S. C., & Holden, A. D. C. (1991). Symbolic-neural systems and the use of hints in developing complex systems. International Journal of Man-Machine Studies, 35, 291–311.

Suddarth, S. C., & Kergos, Y. L. (1990). Rule injection hints as a means of improving network performance and learning time. In L. B. Almeida & C. J. Wellekens (Eds.), Neural Networks, Vol. 412 of Lecture Notes in Computer Science, pp. 120–129. New York, NY: Springer-Verlag.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27 (11), 1134–1142.

van Gelder, T. (1990). Compositionality: A connectionist variation on a classical theme. Cognitive Science, 14, 355–384.

van Gelder, T., & Niklasson, L. (1994). Classicism and cognitive architecture. In A. Ram & K. Eiselt (Eds.), Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pp. 905–909. Lawrence Erlbaum.

Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In S. J. Hanson & R. P. Lippman (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.

Wiles, J., & Ollila, M. (1993). Intersecting regions: the key to combinatorial structure in hidden unit space. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5, pp. 27–33. San Mateo, CA: Morgan Kaufmann.

Wiles, J., Phillips, S., & Norris, M. (1993). Generalization and training complexity in a combinatorial domain. Unpublished manuscript.
Zemel, R. S. (1993). A minimum description length framework for unsupervised
learning. Ph.D. thesis, University of Toronto, Toronto, Canada.