Connectionism and the Problem of
Systematicity
Steven Andrew Phillips B.Sc., B.A.(Hons.)
A thesis submitted for the degree of Doctor of Philosophy
Department of Computer Science
The University of Queensland
January 25, 1995
Statement of originality
The work presented in this thesis is, to the best of my knowledge, original, except
as acknowledged in the text, and the material has not been submitted, either in
whole or in part, for a degree at this or any other university.
Steven Phillips
January 25, 1995
Abstract
Systematicity is a pervasive property of cognitive behaviour (e.g., language and
reasoning) whereby the ability to represent (systematicity of representation) and
infer from (systematicity of inference) some instances of a structured object ex-
tends to other instances conforming to the same structure (Fodor & Pylyshyn,
1988; Fodor & McLaughlin, 1990). The problem of systematicity for a Connection-
ist approach to cognition is: (1) Can Connectionist models exhibit systematicity
without implementing a Classical cognitive architecture (i.e., without resorting to
symbolic representations and processes)?; and (2) Can Connectionist models exhibit
the necessary acquisition of systematic behaviour (i.e., above chance level)?
In regard to the first question, after a review of the Classicist/Connectionist
debate over alternative explanations for systematicity, I conclude that Connection-
ist models cannot exhibit systematicity without resorting to some form of Classi-
cal compositionality (i.e., tokening of component representations wherever complex
representations are tokened). Essentially, Smolensky's (1991) weak (microfeatures)
and strong (tensors) compositionality, and van Gelder's (1990) functional compo-
sitionality either: (1) cannot support systematicity because component representa-
tions are not (uniquely) accessible by other processes; or, (2) can support system-
aticity, but only by tokening component representations relative to the processes
that access them (i.e., by implementing some form of Classical compositionality).
However, I also argue that Connectionism potentially offers an explanation for
the acquisition of systematic behaviour, which is an issue not addressed in the
Classical paradigm since systematicity is built into a Classical architecture. Conse-
quently, the primary concern of this thesis is the second question, which I address
by considering two criteria for the necessary acquisition of systematic behaviour,
and then evaluating Connectionist models with respect to these criteria on several
learning tasks. Thus, systematicity is treated as a problem of generalization rather
than representation.
The main results of this thesis concern Hadley's (1993) strong systematicity cri-
terion (i.e., generalization to novel component positions) on inference tasks where
a network must learn to infer, on request, the components of binary and ternary
relations. Through an analysis of internal representations and network learning
dynamics I show that, in general, three-layer first-order networks (including the
feedforward network, Elman's (1990) simple recurrent network, and Pollack's (1990)
recursive auto-associative memory) cannot exhibit strong systematicity on the bi-
nary relations inference task. I attribute the lack of strong systematicity to an
independence between weights that implement component mappings in each of
their possible positions.
In response to the lack of strong systematicity with existing networks, I then
developed the tensor-recurrent network, which has a dependency between such
weights, by incorporating the representational capacity of Smolensky's (1987b)
tensor network with the learning capacity of the simple recurrent network. Con-
structing and manipulating tensor representations of complex objects, through the
inner and outer product operators, assumes appropriate component, role and cue
vectors (for representing and extracting component objects and their roles within
a complex object). In the tensor-recurrent network, these vectors are learnt by
backpropagating an error signal along weighted connections and units implement-
ing the inner and outer product operators. I show that this architecture exhibits
the necessary acquisition of systematic behaviour, as defined by Hadley's strong
systematicity criterion, on a ternary relations inference task.
In summary, I conclude that:
1. Connectionist models cannot demonstrate systematicity without implement-
ing some form of symbolic representation and process.
2. Connectionism can provide models, as exemplified by the tensor-recurrent
network, with architectural properties that are sufficient for the necessary
acquisition of systematic behaviour from, in part, non-symbolic processes.
The separation of internal representations into component and role vector spaces
in the tensor-recurrent network is analogous to a type/token distinction character-
istic of symbolic computation. Thus, in terms of systematic behaviour, Connection-
ism offers a "neural-like" implementation of symbolic processing. However, where
Connectionism goes beyond Classicism is in an explanation for the acquisition of
systematic behaviour, which is an issue not addressed in the Classical paradigm
since systematicity is built into a Classical architecture, not acquired.
List of publications
1. Phillips, S. (1994a). Strong Systematicity within Connectionism: The
Tensor-Recurrent Network. To appear in Proceedings of the Sixteenth An-
nual Conference of the Cognitive Science Society, Atlanta, GA.
2. Phillips, S. (1994b). Connectionism and Systematicity. In A.C. Tsoi & T.
Downs (Eds.), Proceedings of the Fifth Australian Conference on Neural Net-
works, pp. 53-55, Brisbane, Australia.
3. Phillips, S. (1994c). Understanding as generalization not just representa-
tion. In J. Wiles, C. Latimer & C. Stevens (Eds.), Collected Papers from
a Symposium on Connectionist Models and Psychology, pp. 110-111. Tech-
nical Report No. 289, Department of Computer Science, The University of
Queensland, Australia. A comment on Halford & Wilson's paper: "How far
do neural network models account for human reasoning?".
4. Phillips, S. and Wiles, J., (1993). Exponential Generalizations from a Polyno-
mial Number of Examples in a Combinatorial Domain. In Proceedings of the
International Conference on Neural Networks, pp. 505-508, Nagoya, Japan.
5. Phillips, S., (1993). The Effect of Representation on Error Surface. In P.
Leong & M. Jabri (Eds.), Proceedings of the Fourth Australian Conference
on Neural Networks, pp. 86-89. Melbourne, Australia: Sydney University
Electrical Engineering.
6. Phillips, S., (1992). Making a Simple Recurrent Network a Self-Oscillator by
Incremental Training. In P. Leong and M. Jabri (Eds.), Proceedings of the
Third Australian Conference on Neural Networks, pp. 244-247. Canberra,
Australia: Sydney University Electrical Engineering.
7. Phillips, S., Wiles, J. and Schwartz, S., (1991). A Comparison of Three
Classification Algorithms on the Diagnosis of Abdominal Pains. In M. Jabri
(Ed.), Proceedings of the Second Australian Conference on Neural Networks,
pp. 283-287. Sydney, Australia: Sydney University Electrical Engineering.
8. Bakker, P., Phillips, S. and Wiles, J. (in press). The 1000-2-1000 encoder: A
matter of Representation. Neural Network World, 4(5), 527-534.
9. Schwartz, S., Wiles, J. and Phillips, S., (in press). Connectionist, rule-based
and Bayesian decision aids: an empirical comparison. In P. Slezak, T. Caelli
& R. Clark (Eds.), Perspectives on Cognitive Science: Theories, Methods and
Foundations, pp. 167-180. Norwood, NJ: Ablex.
10. Schwartz, S., Wiles, J., Gough, I. and Phillips, S., (1993). Connectionist, rule-
based and Bayesian decision aids: an empirical comparison. In D. J. Hand
(Ed.) Artificial Intelligence Frontiers in Statistics (pp. 264-277). London:
Chapman & Hall. (NB. An earlier version of Schwartz et al 1994).
11. Bakker, P., Phillips, S. and Wiles, J., (1993). The N-2-N encoder: A matter
of Representation. In S. Gielen & B. Kappen (Eds.), Proceedings of the In-
ternational Conference on Neural Networks, pp. 554-557. London: Springer-
Verlag.
12. Dennis, S. and Phillips, S., (1991). Analysis Tools for Neural Networks.
Tech. Report No. 207, Department of Computer Science, The University of
Queensland, Australia.
Contents

Statement of originality
Abstract
List of publications
Contents
List of figures
List of tables
Preface
  Relationship to previous work
  Acknowledgements

1 Introduction
  1.1 Statement of thesis

2 Systematicity: What is the problem?
  2.1 Introduction
  2.2 Systematicity and its Classical explanation
  2.3 Is there a Connectionist alternative?
    2.3.1 The horns of Fodor
    2.3.2 Weak and strong compositionality
    2.3.3 The concatenative/functional distinction
    2.3.4 A "potential" contribution of Connectionism
  2.4 Thesis question
    2.4.1 Method
  2.5 Summary

3 Generalization across domain
  3.1 Introduction
  3.2 Brousse and Smolensky's massive generalization
    3.2.1 Task: Auto-association of N-tuples
    3.2.2 Method
    3.2.3 Exponential number of generalizations
    3.2.4 Exponential decrease in percentage of generalizations
  3.3 Probably approximately correct learnability
  3.4 Systematicity of representation
    3.4.1 Task: Auto-association of N-tuples
    3.4.2 Model: Feedforward network
    3.4.3 Theoretical result
    3.4.4 Empirical results
  3.5 General Discussion
  3.6 Summary and conclusion

4 Generalization across position
  4.1 Introduction
  4.2 Hadley's strong systematicity
    4.2.1 Strong systematicity defined
    4.2.2 A review of Connectionist models
    4.2.3 Summary of review
  4.3 Strong systematicity of representation
    4.3.1 Task: Auto-association of 2-tuples
    4.3.2 Feedforward network
    4.3.3 Simple recurrent network
    4.3.4 Summary of strong systematicity of representation
  4.4 Strong systematicity of inference
    4.4.1 Task: Querying 2-tuples
    4.4.2 Simple recurrent network
    4.4.3 Other first-order three-layer networks
    4.4.4 Architectural issues
  4.5 Summary and conclusion

5 The tensor-recurrent network
  5.1 Introduction
  5.2 Tensor representations
  5.3 Representational issues
    5.3.1 Component representations
    5.3.2 Role representations
    5.3.3 Cue representations
  5.4 The tensor-recurrent network
    5.4.1 Architecture
    5.4.2 Learning
  5.5 Evaluating the network
    5.5.1 Querying of 3-tuples task
    5.5.2 Simulations
    5.5.3 Results
    5.5.4 Discussion and analysis
  5.6 Summary and conclusions

6 Discussion
  6.1 Introduction
  6.2 Implications
    6.2.1 Same training and testing distribution assumption
    6.2.2 Structured networks
  6.3 Limitation and possible extension
  6.4 Relationship to the Classical paradigm
  6.5 Summary

7 Concluding remarks
  7.1 Summary of contribution
  7.2 Conclusions
  7.3 Further work

A Understanding as generalization

Bibliography
List of Figures

2.1 Systematicity: a computational level property.
2.2 Beam scattering analogy.
2.3 Sum to 5000 example.
2.4 Connectionist implementation of a look-up table.
3.1 A feedforward network for learning to represent 2-tuples.
3.2 A log-linear plot of the number of generalized and generalized plus virtual memories as a function of tuple order (N).
3.3 Feedforward network for the auto-association of N-tuples.
3.4 A log-log plot of the number of training examples (solid line) and total possible examples (dashed line) as a function of tuple order (N).
3.5 A log-log plot of the mean number of total weight updates (over 5 trials) as a function of tuple order (N).
3.6 Hyperplane orientation for an 8-2-8 encoder.
4.1 The feedforward network architecture for auto-association of 2-tuples.
4.2 Orientation of the Mary-agent hyperplane in hidden unit activation (internal representation) space.
4.3 Orientation of the Mary-patient hyperplane in hidden unit activation (internal representation) space.
4.4 Feedforward network architecture for auto-association task without weight tying (a) and with weight tying (b).
4.5 The simple recurrent network and the temporal version of the auto-association of 2-tuples task.
4.6 Generalization to second position as a function of the number of training patterns.
4.7 Principal components analysis of points in hidden unit activation space generated from training sequences at the first and second time steps.
4.8 Principal components analysis of points in hidden unit activation space generated from training sequences at the second time step.
4.9 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the second time step grouped on the basis of the first input object.
4.10 Construction of ordered pair representations by buffering input through the same set of dimensions.
4.11 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the first and second time steps grouped on the basis of the current input/output object.
4.12 Principal components analysis of points in hidden unit activation space generated from training sequences at the third time step.
4.13 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step grouped on the basis of the first input object.
4.14 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step grouped on the basis of the second input object.
4.15 Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third and fourth time steps grouped on the basis of the current target output object.
4.16 Idealization of the buffer solution for the recurrent network in demonstrating strong systematicity of representation.
4.17 The simple recurrent network and the querying of 2-tuples task.
4.18 Error profile for the 8 hidden unit network over 10000 epochs of training for each trial.
4.19 First-order network.
4.20 Orientation of the hidden unit hyperplanes to extract the Mary component.
4.21 Orientation of the output unit hyperplane in the hidden unit activation space.
4.22 Weight and representation configuration resulting in weak systematicity of inference.
4.23 Two alternative recurrent network architectures: Jordan's recurrent network (a) and Pollack's recursive auto-associative memory (b).
4.24 Characterization of the representational organization sufficient for strong systematicity of inference.
5.1 A Connectionist implementation of the outer product (T) of two vector representations (V and W).
5.2 A Connectionist implementation of the inner product (Vout) of a tensor representation (T) and a vector representation (W).
5.3 The tensor-recurrent network architecture.
5.4 Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2).
5.5 Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2).
5.6 Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3).
5.7 Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3).
5.8 Mean training times as a function of overlap where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2).
5.9 Mean training times as a function of overlap where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3).
6.1 Characterization of the extraction of component representations from positions 1 and 2 for the three-layer networks in chapter 4 (i.e., with one hidden layer).
6.2 Characterization of the extraction of component representations from positions 1 and 2 for the tensor-recurrent network.
6.3 Characterization of the extraction of component representations from positions 1 and 2 for a network with two hidden layers linked by modifiable-weighted connections.
6.4 An example of the internal organization of a network, with two units per subspace in the first internal layer and two units for the common subspace in the second internal layer, learning the querying of 2-tuples task.
6.5 Number of components in the second position required in the training set to demonstrate strong systematicity.
6.6 Process of redescribing internal representations.
List of Tables

4.1 Summary of the performance of the simple recurrent network on the systematicity of inference task for networks with 20, 10 and 8 hidden units.
4.2 Summary of the degrees of systematicity of networks with respect to the systematicity of representation task (auto-association of 2-tuples), and the systematicity of inference task (querying of 2-tuples).
5.1 Order of activation for the tensor-recurrent network when given the sequence John, Mary, first.
5.2 Training and testing example distributions where nouns were omitted from the patient position in the case where there was between 1 and 4 nouns appearing in both positions.
5.3 Training and testing example distributions for the 1, 2 and 3 overlap cases, where nouns were omitted from both positions.
5.4 Orthogonality between agent, action and patient role vectors measured as one minus the magnitude of the normalized dot product.
5.5 Collinearity between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions measured as the magnitude of the normalized dot product.
5.6 Collinearity before training between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions measured as the magnitude of the normalized dot product.
Preface
Relationship to previous work
Some of the work presented in this thesis has appeared, in part, in a number of
published papers. The work presented in Phillips and Wiles (1993) forms the ba-
sis of chapter 3. The arguments, results and analysis presented in chapter 4 are a
significant expansion of the work that appeared in Phillips (1994b), and chapter 5
provides further analysis and a more detailed explanation of the network architec-
ture that appeared in Phillips (1994a). Also, the work presented in Phillips (1994c)
appears in Appendix A.
Two of the techniques used in this thesis have also appeared in previous work.
The block encoding technique, for faster training of auto-associators (chapter 3),
appeared in Bakker, Phillips and Wiles (1993, 1994), and an explanation of its
improvement in training time was based on Phillips (1993). The network analysis
tools used in chapter 4 appeared in Dennis and Phillips (1991).
Lastly, there are several works which helped provide an understanding of some
of the learning properties of feedforward and recurrent networks. They include, the
work on generalization with feedforward networks in Phillips, Wiles and Schwartz
(1991), Schwartz, Wiles, Gough and Phillips (1993), and Schwartz, Wiles and
Phillips (1994); and the learning capabilities on the simple recurrent network, in
Phillips (1992).
Acknowledgements
Throughout the years I have been supported by many people who have helped
either directly, or indirectly to produce this thesis. In particular, I thank my
supervisor Janet Wiles for all her support prior to and throughout my candidature;
for providing valuable discussion and criticism of the work in, and relating to, this
thesis, and for her constant and critical attention to the writing process.
I would also like to thank my colleagues: Simon Dennis, for many interesting
and critical discussions on this work; and Paul Bakker, not only for his help and
comments on this work, but also in regard to frequent flyer miles.
During my candidature, I have had many useful discussions with researchers
who have shared their time and energy to comment and discuss some of the issues
relating to this thesis. For their help and support, I would like to thank: Glenda
Andrews, Peter Bartlett, Anthony Bloesch, Kerry Chalmers, Bob Colomb, Graeme
Halford, Mike Humphreys, Danny Latimer, Ray Lister, David Lovell, Hanna Ma-
jewski, Michael Norris, Helen Purchase, Gordon Rose, Guy Smith, Kate Stevens,
Julie Stewart and Steven Young.
I would also like to thank Jeff Elman and his research group for providing the
Tlearn simulator, which was used for the simulations in chapters 3 and 4; and
Geoffrey Hinton and his research group for providing and maintaining the Xerion
simulator, which was used for the simulations in chapter 5.
I am also extremely grateful for the financial support provided by the University
of Queensland and the Department of Computer Science. This work was supported
by a University of Queensland Postgraduate Research Scholarship, and a Depart-
ment of Computer Science Postgraduate Research Scholarship. I am also extremely
grateful for the excellent facilities provided by the Department of Computer Sci-
ence in the form of equipment and support staff, and for financial assistance that
has allowed me to attend a number of national and international conferences.
Of course, much help and support was given before I even started this work. I
would like to thank my parents for all their support through the years.
Finally, I thank Michiko and Mizuki, for their undying patience and love.
Chapter 1
Introduction
In 1986, a small group of researchers put together two volumes on parallel dis-
tributed processing models of cognition (Rumelhart, McClelland, & the PDP re-
search group, 1986; McClelland, Rumelhart, & the PDP research group, 1986)
which, as it turned out, repopularized a movement in the study of human and
machine intelligence called Connectionism.
Connectionist models are typically characterized as collections of interconnected
processing units, each specified by some local level process or function (e.g., sig-
moid, see Rumelhart, Hinton, & Williams, 1986), which together exhibit some
global level behaviour (e.g., hand-written character recognition; LeCun, Boser, &
Denker, 1989). The analogy, of course, is taken from the human brain, which is
similarly organized, hence the term Artificial Neural Networks.[1]
[1] The term Connectionism as used here refers to neural network research activities directed towards cognitive behaviour (e.g., perception, language, analogy), rather than the more widely used term Artificial Neural Networks (or simply, Neural Networks), referring to research activities directed at a wider range of engineering applications (including, for example, plant control, stock market prediction, chemical analysis, automatic piloting).
Since its resurgence in popularity many advocates have proclaimed Connec-
tionism as the new paradigm for cognitive modeling, promising not only models
of cognitive behaviour, but also models of cognitive development. For example,
Rumelhart and McClelland (1986) have presented a Connectionist network that
not only performs the past-tense mapping of English verbs, but also learns this
behaviour in a qualitatively similar manner as children.[2]
Of central importance to a paradigm or framework for cognition is the ability
to provide models that demonstrate such properties as systematicity. Briefly, sys-
tematicity is the property of human cognition whereby the ability to represent and
process structured objects (e.g., John loves Mary) is accompanied by an ability to
represent and process other objects conforming to the same structure[3] (e.g., Mary
loves John). That is, cognitive capacities are grouped on the basis of the structural
similarity of the objects on which they operate.
It is generally accepted that human cognition is systematic, at least to some
significant degree. In general, one does not, for example, observe people capable of
inferring that John went to the store given that John and Mary went to the store,
yet unable to infer the structurally similar case of Mary went to the store given that
Mary and John went to the store. What has not been accepted is the implication
that the property of systematicity holds for cognitive architecture (i.e., the set of
basic data structures and processes employed in cognitive behaviour).
Fodor and Pylyshyn (1988) have argued that systematicity necessitates struc-
tured representations (symbol structures) and processes that are sensitive to the
structure of those representations. In short, the so-called Classical cognitive archi-
tecture is a symbol system. A characteristic property of cognition that is posited
in a Classical cognitive architecture is that the tokening (inscription) of represen-
tations of compositional objects necessarily entails the tokening of representations
of all component objects.
[2] Although the accuracy of this work as a model of the acquisition of the past tense has been strongly criticized (Pinker & Prince, 1988; Marcus, Brinkmann, Clahsen, Wiese, Woest, & Pinker, 1993).
[3] The term structure is taken to mean some invariant property of the relationship between components of complex objects. For example, in chemistry, two types of molecules, n-Pentane and Neopentane, are constructed from the same type and number of atoms (i.e., five carbon atoms and twelve hydrogen atoms), yet they differ structurally due to the difference in the spatial relationship between their component atoms. The structure of an n-Pentane molecule, the spatial relationship between its component atoms, is CH3-CH2-CH2-CH2-CH3. In contrast, the structure of a Neopentane molecule is a central carbon atom bonded to four CH3 groups, C(CH3)4 (Brown & LeMay, 1981). In the context of representations, structure is regarded as some invariant property of the relationship between representational components.
Smolensky (1987b, 1987a, 1990, 1991) and van Gelder (1990) have argued for
a more general Connectionist notion of compositionality[4] based on distributed
(vector) representations and processes. In a Connectionist cognitive architecture,
representations of compositional objects are vectors distributed over a number of
processing units. Connectionist representations are non-Classical in that the rep-
resentation of a compositional object does not necessarily entail representations
of the object's components. Thus, in this respect Connectionism, it is claimed, is
an alternative to the Classical conception of cognitive architecture. For example,
Elman's (1990) simple recurrent network is capable of constructing a single vector
representation of a sentence where the representations of individual words do not
\appear"5 in the representation of the sentence in which they are contained. Yet,
the network is able to operate, often termed wholistically, on the vector represen-
tation of the sentence to extract word-level information such as plurality.
[4] That is, the method by which a representation of a compositional object is constructed from the representations of its components.
[5] In the sense that no portion of the sentence vector equals any vector representation of its components. For example, the vector A = (3, 2) does not appear in the vector B = (9, 1, 7).
Fodor and McLaughlin (1990) have rejected the so-called Connectionist alterna-
tive concept of compositionality on the basis that either: (1) vector representations
are simply points in space with no internal structure, and therefore cannot account
for systematicity since component representations of complex objects are not ac-
cessible to other processes; or (2) component representations are accessible by to-
kening (i.e., inscribing) a vector representation of each component object whenever
the complex representation is inscribed, in which case, the Connectionist architec-
ture implements a Classical architecture. Thus, Fodor and McLaughlin concluded,
Connectionism is at best an implementation framework for Classical cognitive ar-
chitectures, and consequently, has nothing to offer in the way of alternative theories
of cognition.
This conclusion is, however, far from unanimously accepted. Furthermore, it
is not clear as to the grounds on which Fodor and McLaughlin rejected the Con-
nectionist proposals. On the one hand they accept the possibility of Connectionist
architectures exhibiting systematicity without resorting to Classical composition-
ality (p. 202), yet on the other hand they reject such possibilities on the grounds
that it is not sufficient to show how systematicity is possible given the assumptions
of a (Connectionist) architecture, one must also show how systematicity is neces-
sary given those assumptions (p. 202). The confusion arises because they do not
specify a criterion by which a model is said to necessarily exhibit systematicity.
Given the lack of consensus over the conclusion that Connectionism is at best
an implementation framework, an appropriate question to ask is:
• Can Connectionism provide non-Classical models (i.e., ones that do not rely
on symbolic representations and processes) that exhibit systematicity?
After reviewing, in detail, the Connectionist/Classicist debate with respect to
this question, in chapter 2, it is concluded that: Connectionist architectures can-
not exhibit systematicity without implementing Classical compositionality.
However, it is also argued that the limitation of Fodor and Pylyshyn's thesis,
and subsequently, Fodor and McLaughlin's rebuttal is that their criterion for sys-
tematicity is grounded solely in terms of computational capacity (i.e., in terms
of the functions computable by an architecture). Consequently, systematicity is
realizable by many computationally sufficient architectures, including the possible
Connectionist alternatives. Connectionism, however, is also concerned with the
modeling of cognitive development. Thus, it is argued, a \potential" contribution
of Connectionism to cognitive theory is in an explanation for the necessary acquisi-
tion of systematic behaviour. Therefore, the problem that systematicity poses for
Connectionism, and the question of primary concern in this thesis is:
• Can Connectionism provide models that exhibit the necessary acquisition of
systematic behaviour?
The approach taken in an attempt to address this question is to: first, specify
a suitable criterion by which a model is said to necessarily acquire systematic
behaviour; and second, evaluate Connectionist models with respect to this criterion
on learning tasks designed to test their capacity to acquire systematic behaviour.
The first criterion, taken from computational learning theory (Valiant, 1984),
is based on the amount of computational resource required to learn a systematic
behaviour to a high degree of accuracy. In chapter 3, a Connectionist architecture
called the feedforward network is shown, by analysis and simulation, to meet this
criterion on a task designed to test systematicity.
However, it is then suggested that this criterion may be too weak in that the
amount of resource required is more than that required by people. Consequently,
a second criterion is used, called strong systematicity (Hadley, 1993), which is
based on linguistic evidence of generalization over structurally similar sentences.
In chapter 4, it is shown, by analysis, that the same network cannot exhibit the
acquisition of systematicity as de�ned by this criterion on the same task. Strong
systematicity is then evaluated on a number of other models based on a recurrent
network architecture. Briefly, none of the models examined demonstrate strong
systematicity with respect to an inference task whereby the model is required to
infer components from complex objects.
The lack of strong systematicity in the models examined prompted the design
of a further architecture, called the tensor-recurrent network, specifically to ad-
dress the issue of the necessary acquisition of systematicity. In chapter 5, the
tensor-recurrent network is presented and shown, by simulation, to exhibit strong
systematicity with respect to the inference task.
A discussion of these results and their relationship to the Classical paradigm is
given in chapter 6. Finally, conclusions are drawn, and further work is suggested
in chapter 7.
1.1 Statement of thesis
Having reviewed the Classicist and Connectionist arguments relating to system-
aticity, and having tested and analyzed a number of Connectionist architectures
for their capacity to acquire systematic behaviour, the two main conclusions of this
thesis are:
1. Connectionist models cannot demonstrate systematicity without implement-
ing some form of symbolic representations and processes.
2. Connectionism can provide models, as exempli�ed by the tensor-recurrent
network, with architectural properties that are sufficient for the necessary
acquisition of systematic behaviour from, in part, non-symbolic processes.
Chapter 2
Systematicity: What is the
problem?
2.1 Introduction
As mentioned in chapter 1, Fodor and Pylyshyn have used the systematicity prop-
erty to argue against the Connectionist approach to cognitive modeling. Yet, as
also pointed out, it is not clear that systematicity cannot be addressed within
the Connectionist framework. The purpose of this chapter is to establish the cen-
tral problem that systematicity poses for the Connectionist approach to cognition;
and subsequently, the question that this thesis attempts to address. Therefore,
an appropriate place to begin is with a review of the Classicist/Connectionist de-
bate over the implication of systematicity to cognitive modeling with the following
question in mind: Can Connectionism provide non-Classical models that exhibit
systematicity?
In section 2, the Classical explanation for systematicity is presented. The Clas-
sical view is that to exhibit systematicity a cognitive architecture must have two
properties: (1) structured representations (symbol structures); and (2) processes
that are sensitive to the structure of those representations. The claim that Con-
nectionism provides an alternative view of cognition turns at one point on an
alternative explanation for systematicity; one that does not rely on the Classical
notion of compositionality (i.e., the tokening or inscription of component represen-
tations whenever complex representations are tokened). In section 3, I argue that
Connectionist models cannot exhibit systematicity without implementing Classical
compositionality. However, I also argue that the direction along which Connection-
ism \potentially" extends the Classical theory is in the acquisition of systematic
behaviour. The question then arises: Can Connectionism provide models that ex-
hibit the necessary acquisition (learning) of systematic behaviour? This question,
which is the central concern of this thesis, and the method used to address it are
given in section 4. Finally, a summary is provided in section 5.
2.2 Systematicity and its Classical explanation
Systematicity is a property of cognitive beings in that their capacity for certain
cognitive behaviours is grouped on the basis of the structural similarity of these
behaviours (Fodor & Pylyshyn, 1988). Fodor and Pylyshyn distinguish two aspects
of systematicity: systematicity of representation and systematicity of inference.
Systematicity of representation is the property that, if one can represent a
structured object (e.g., John loves Mary[1]), then one also has the ability to represent
the structurally similar object Mary loves John.
[1] The italicized word is used to refer to the object, not its representation.
Similarly, systematicity of inference is the property that, if one can infer, for
example, that Tom went to the store from the sentence Sue and Tom went to the
store, then one also has the ability to infer that Sue went to the store from the
sentence Tom and Sue went to the store.
The ability to make such generalizations is based on a similarity of structure
between compositional objects (i.e., objects that are composed of other simpler
objects). Thus, systematicity is a property at the level of compositional objects.
For example, in the domain of scene description, it is reasonable to expect
that people are able to verbalize the word "cat" on seeing the object cat, but
without further experience are unable to verbalize the word "platypus" on seeing
the object platypus. This is because, in this case, the relationship between one
object and its label (name) is independent of the relationship between the other
object and its associated label. In other words, the two objects are atomic (no
further structure) with respect to their labels.[2] However, what one does not
find, in general, are people capable of verbalizing the sentence "the cat chases
the dog" on seeing the compositional object cat chases dog, but unable to
verbalize the sentence "the dog chases the cat" on seeing the compositional
object dog chases cat. This is because, in the compositional case, there is a
relationship between the two objects and their corresponding labels (i.e., the
structure agent-action-patient). This point has
an important consequence when deciding on appropriate external representations
for atomic objects and will be returned to when examining some Connectionist
attempts to demonstrate systematicity.
At this point it is important to identify one misconception regarding system-
aticity. I use an example[3] from the reading-aloud domain to make my point. In
English, the \w" component of the word \sweet" is vocalized, however, in the word
\sword" the \w" component is silent. This inconsistency is not a case for the lack
of systematicity of human cognition. It is a case for the lack of systematicity in one
human cognitive domain (namely, correct pronunciation of English words). The
fact that people can, upon request, produce the \w" sound in the word \sword"
(i.e., by producing the \s" sound followed by the \word" sound) is an example of
the systematicity of human cognitive capacities. Another example concerns the
interchangeability of nouns in the agent and patient positions. For example, peo-
ple capable of saying the sentence \John ate the cheese" are unlikely to be heard
saying, \the cheese ate John". Again, the lack of apparent systematicity in this
domain does not preclude people from understanding the sentence in the sense that
they could answer questions like: \What did the cheese eat?"; or \Who ate John?".
Although the extent to which cognition is systematic is arguable, at least in the
linguistic domain, there is ample evidence of systematicity in adults (see Hadley,
1993, for a review) and for that matter perhaps even children (Pinker, 1984).
[2] Note that the term atomic is a relative one. The same objects would be considered structured when labeled as "quadrupeds", in which case one may expect generalization from one object to the other based on their limb component objects.
[3] Presented in a seminar on word recognition by Sally Andrews.
That systematicity is a significant property of cognitive behaviour is, in general,
not disputed by Connectionists. What is contended is the implication of system-
aticity in terms of a cognitive architecture (i.e., the basic algorithms and data
structures of a cognitive system utilized in the performance of cognitive tasks).
Or in other words, it is agreed that cognitive behaviour is significantly systematic
at the computational level, but disputed as to what systematicity implies at the
algorithmic level.[4]
The distinction between the computational and algorithmic levels is illustrated
in the following example. Consider a function called minimum that returns the
smallest number from a set of numbers. The function can be described at the
computational level as: minimum(S) = e, where e ≤ e_i for all e_i ∈ S. The function
minimum takes a set of numbers S and returns the smallest number e from the set.
At the algorithmic level, an algorithm to compute the function could be de-
scribed as: removing an element (e) from the set (S), leaving a new set (S′), and
returning either the element (e), if the new set (S′) is empty, or the smaller of e
and the minimum of S′.
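As a concrete illustration (my sketch; the thesis gives no code, and Python is an arbitrary choice of language here), the docstring below restates the computational-level specification, while the body is one of many possible algorithms that satisfies it:

```python
def minimum(s):
    """Computational level: minimum(S) = e, where e <= e_i for all e_i in S."""
    s = set(s)
    e = s.pop()                   # remove an element e, leaving a new set S'
    if not s:                     # if S' is empty, return e
        return e
    return min(e, minimum(s))     # otherwise, the smaller of e and minimum(S')

assert minimum({7, 3, 9}) == 3
```

Any other algorithm with the same input-output profile (say, sorting the set and taking the head) would compute the same computational-level function.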
The value of a computational level description or specification of some behaviour
is that it constrains the possible algorithms that can implement such behaviour.
The value of systematicity as a computational level property is that it constrains
the class of possible cognitive architectures to those that exhibit this property. For
example, a cognitive architecture should be capable of exhibiting the behaviour
illustrated in Figure 2.1. The contention is over what characterizes architectures
that exhibit such systematic behaviour. The predominant view in cognitive science,
what Fodor and Pylyshyn (1988) and Fodor and McLaughlin (1990) term Classi-
cism, is that to support such properties as systematicity a cognitive architecture
must make use of:
1. a combinatorial syntax (or structured representations); and
2. processes that are sensitive to the structure of those representations (Fodor
& Pylyshyn, 1988; Fodor & McLaughlin, 1990).
[4] Referring to Marr's (1982) levels of analysis where the computational level is concerned with what is being computed, and the algorithmic level is concerned with how it is being computed. There is also an implementation level concerned with how the algorithmic level is physically realized.
[Figure 2.1: Systematicity: a computational level property. The inputs "1. John chased Mary. Who chased?" and "2. Mary chased John. Who chased?" pass through the cognitive architecture to produce the outputs "1. John" and "2. Mary", respectively.]
A combinatorial syntax is a symbol-level description (i.e., independent of the
physical realization of the symbol) of the relationship between representations of
compositional objects (called complex representations) and the representations of
their component objects (called component representations, or constituents[5]). For
example, in the list of syntactic relations:
S  → N Vt N
N  → John
N  → Mary
Vt → chases,
\S" denotes the complex representation of a compositional object consisting of
three components: an action, an agent (performer of the action), and a patient
(recipient of the action); \N" denotes the constituent representation of the agent
and the patient; and \Vt" denotes the constituent representation of the action.
The symbol \N" has as a possible component the symbol \John" or \Mary", and
the symbol \Vt" has as a component the symbol \chases".
In this example world, all four compositional objects can be represented by
making all four possible expansions as defined by the syntax rules. Thus, system-
aticity of representation is captured by the combinatorial nature of the rules of
syntax. Since each symbol is replaceable by any one of its possible constituent
symbols, all possible representations are realizable.
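As a sketch of this combinatorial property (mine, not the thesis's), expanding every nonterminal by each of its alternatives mechanically yields all four complex representations:

```python
from itertools import product

# The syntactic relations above: S -> N Vt N; N -> John | Mary; Vt -> chases.
rules = {"N": ["John", "Mary"], "Vt": ["chases"]}
s_rule = ["N", "Vt", "N"]

# One expansion of S per combination of constituent symbols.
expansions = [" ".join(choice) for choice in product(*(rules[sym] for sym in s_rule))]
print(expansions)
# ['John chases John', 'John chases Mary', 'Mary chases John', 'Mary chases Mary']
```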
The need for a combinatorial syntax constrains the way in which a cognitive
system puts these representations together. It says that the processes for con-
structing representations are such that: firstly, there is a distinction between struc-
turally atomic (e.g., "N", "John") and structurally molecular (e.g., "S") repre-
sentations; secondly, a representation of a compositional object has, as its parts,
representations of the object's components; and thirdly, the relationship between
an object's components is captured by the relationship between their correspond-
ing constituents. For example, the agent component is identifiable as it is always
the first constituent symbol. The particular symbol order is not important, only
that the order is used consistently. By mirroring the structural relations in ob-
jects with analogous structural relations in their corresponding representations, a
combinatorial semantics is captured by a combinatorial syntax.
In addition to a combinatorial syntax, the Classical view is that a cognitive
architecture must also have processes that are sensitive to the way in which these
representations are structured. Structure sensitive processes manipulate complex
representations on the basis of their constituency relations, not on the basis of par-
ticular constituents. In the John-chases-Mary example, the agent of the action is
always the first constituent symbol of the compositional object's complex represen-
tation. Consequently, regardless of whether the agent is John or Mary, a process
that can determine the first constituent symbol from a complex representation can
do so for all compositional objects in this domain.
Together, structured representations and processes that are sensitive to the
structure of those representations are two constraints on cognitive architecture
that explain the systematicity property as exemplified in Figure 2.1.
Before presenting arguments for and against the Connectionist alternatives, it
should be pointed out that the Classicist view is not a statement about the phys-
ical realization of these representations and processes; it is a statement about the
relationships between their physical realizations. The physical structural relation-
ships of the brain correspond to symbolic structural relationships of the mind. For
example, the symbols John, Mary, and chases may be physically realized as John,
Mary, and chases respectively; or alternatively, 1 0 1 1, 0 1 1 0, and 0 1 0 1
respectively. Additionally, the process for combining these atomic representations
may be concatenation. In which case, the complex symbol "John chases Mary"
may be physically instantiated as John chases Mary, or as 1 0 1 1 0 1 0 1 0 1 1 0.
As well, the associated structure sensitive process for extracting the doer of the
action would be one that scans for the first space, in the first instance; and one
that counts the first four digits in the second instance.
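To make the two physical instantiations concrete, here is a sketch (mine) of one structure-sensitive process, "extract the agent", realized once for each encoding. It works for every complex representation in the domain because it depends only on the constituency relation (first position), never on the particular constituent:

```python
# Realization 1: symbols as words, complex symbols formed by concatenation
# with spaces; the agent is everything before the first space.
def agent_from_words(rep: str) -> str:
    return rep[:rep.index(" ")]

# Realization 2: symbols as 4-digit codes (John = 1011, Mary = 0110,
# chases = 0101), concatenated; the agent is the first four digits.
def agent_from_digits(rep: str) -> str:
    return rep[:4]

assert agent_from_words("John chases Mary") == "John"
assert agent_from_words("Mary chases John") == "Mary"
assert agent_from_digits("101101010110") == "1011"   # "John chases Mary"
```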
Structured representations and structure sensitive processes make up what is
called a symbol system, and its physical realization is called the physical sym-
bol system (Newell, 1980). It is the symbol system and its distinction from its
physical implementation that characterizes Classical cognitivism (Pylyshyn, 1980).
The Classicist's approach to understanding cognitive behaviour is determining the
nature of these symbol structures and processes.
2.3 Is there a Connectionist alternative?
Connectionism, at least from a cursory glance, appears to offer a radically different
explanation of cognitive behaviour: An explanation where cognitive behaviour is
seen as an emergent property of the interaction of simple units governed by numeric
equations. However, for Connectionism to be a successful approach to cognition
it must ultimately explain systematicity. It is because people are systematic that
their capacities extend to such a vast (combinatorial) number of examples. A frame-
work that cannot address systematicity is unlikely to scale from small experimental
domains to real-world human domains.
The focus of architectural design on local-level interactions between simple pro-
cessing units is not unique to Connectionism. A similar approach has been used by
Brooks (1991a, 1991b), in what is called a subsumption architecture, in the design
of \intelligent" robots. Like Connectionism, Brooks's approach has also been crit-
icized for its scalability (Kirsh, 1991a). Thus, systematicity is not only important
for cognitive modeling, but it is also of practical significance.
To be an interesting approach, Connectionism must offer something that Clas-
sicism does not. Consequently, the first concern is whether or not Connectionism
can offer an explanation for systematicity that does not rely on structured repre-
sentations and structure sensitive processes.
2.3.1 The horns of Fodor
Fodor and Pylyshyn's position is that to explain systematicity it is necessary to
posit structured representations, and processes that are sensitive to the structure
of those representations. Furthermore, they argued that Connectionism fails to
provide an alternative theory of cognition because either
• connectionist models do not use structured representations and structure sen-
sitive processes, in which case, they cannot demonstrate systematicity (sys-
tematicity horn); or,
• connectionist models do use structured representations and structure sensitive
processes to implement systematicity, in which case, Connectionism is an
implementation theory of Classical cognitivism (implementation horn) (Fodor
& Pylyshyn, 1988; Fodor & McLaughlin, 1990).
At this point it should be emphasized that Fodor and Pylyshyn fully endorse the
possibility of Connectionist networks that demonstrate systematicity. And, clearly,
given the result that there exists a Connectionist network with a finite number of
processing units that implements a universal Turing machine (Siegelman & Son-
tag, 1991), it must be possible to construct a network that exhibits systematicity.
However, this result alone is not a refutation of Fodor and Pylyshyn's thesis (as has
been argued by Chalmers (1990b, 1993)), because one must also avoid the imple-
mentation horn of their argument. (See, also, Butler, 1993, for a similar comment
on Chalmers.)
2.3.2 Weak and strong compositionality
The principal defendant of the Connectionist position in this debate over the import
of Connectionism has been Smolensky (1987b, 1987a, 1990, 1991). Smolensky's
position is to provide two alternative ways in which Connectionist models make use
of compositional representations. They are microfeatures and the tensor product,
which Smolensky (1991) labels as weak and strong compositionality, respectively.
Weak compositionality is the representation of objects as a collection of
microfeatures, where each microfeature is a unit that becomes activated only in
the presence of some feature (component) of an object. For example, the complex
object John chased Bill can be represented as the set of active units {John,
chased, Bill}. The structurally similar object Bill chased John would be represented
by the set {Bill, chased, John}. However, as sets do not preserve order, both objects
have the same representation. Consequently, it is not possible to uniquely determine,
for example, who is the agent.
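The order-blindness of a pure microfeature encoding can be seen in a two-line sketch (my illustration):

```python
# Weak compositionality: a complex object as the set of its active microfeature units.
john_chased_bill = frozenset({"John", "chased", "Bill"})
bill_chased_john = frozenset({"Bill", "chased", "John"})

# Sets do not preserve order, so the two structurally distinct objects
# receive identical representations; the agent cannot be recovered.
assert john_chased_bill == bill_chased_john
```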
A system that represents complex objects by representing only the components
without regard for their structural relationship is inadequate on the basis that it
fails to differentiate objects with the same components but different structural rela-
tionships. This example illustrates why Fodor and Pylyshyn talk of two necessary
dimensions in a representational space: one to record the presence or absence of a
component; the other to record its relationship to other components.
Alternatively, one could augment the class of feature detectors with specific
John-agent, John-patient, Bill-agent and Bill-patient units. In this case, the two
objects can be represented as {John-agent, chased, Bill-patient} and {Bill-agent,
chased, John-patient}, respectively. Now, suppose another complex object, Mary
chased Mark, is represented at the same time, resulting in the set of activated
units {Bill-agent, chased, John-patient, Mary-agent, Mark-patient}. At this point
it is not possible to de-
termine whether \Bill chased Mark", or \Mary chased Mark"; or whether \Bill
chased John", or \Mary chased John". Again, the class of feature detectors could
be augmented to include, for example, the Bill-chased-John unit. However, the
problem with this scheme, as Fodor and Pylyshyn have correctly pointed out, is
that the number of feature-detecting units grows rapidly with the complexity of
the object. In fact, considering n-ary relations, the number of units required to
uniquely represent all possible relations between k possible objects is: kn (i.e., the
number of units grows exponentially with the arity of the relation).
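To see how quickly this scheme becomes infeasible, a quick calculation suffices (a hypothetical illustration of mine, not from the thesis):

```python
# Number of dedicated feature-detecting units, k**n, for k = 26 objects.
k = 26
for n in (1, 2, 3, 4):
    print(n, k**n)   # 26, 676, 17576, 456976: exponential in the arity n
```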
Clark (1991) suggested that the number of microfeature units could be reduced by utilizing existing microfeatures from the component objects. For example, in the complex object John hit Bill an existing microfeature could be "tears"[6], from which the system could deduce that Bill was the recipient of the action. However, such features must still be bound to Bill, and so there needs to be available one unit for every possible component object. Furthermore, it is not clear that such information will always be available to differentiate possible constituents, particularly if complex objects are communicated verbally rather than visually. I conclude that such a scheme either does not avoid the explosion of microfeature units, or does not exhibit systematicity.
The alternative to using sets of purely atomic features is to allow individual features to themselves be sets. For example, the object John chases Bill could be represented as the set {{John, agent}, chases, {Bill, patient}}, which is distinct from the set {{Bill, agent}, chases, {John, patient}} representing the object Bill chases John. However, this scheme is characteristically Classical in that there is a distinction between atomic and molecular representations, and the constituents are tokened wherever the complex representation is tokened. Consequently, as Fodor
[6] Extra visual information arising from the fact that being hit by John caused Bill to cry.
and Pylyshyn pointed out, this scheme does not avoid the implementation horn of
the argument.
Weak compositionality, I conclude, fails to avoid the two horns of the argument. Either all complex objects of a particular structure cannot be (uniquely) represented (systematicity horn); or the representation of all complex objects requires Classical compositionality, the tokening of constituents with the tokening of complex representations (implementation horn).
Strong compositionality is the representation of complex objects by tensors. In Smolensky's case, the tensor representation is constructed as the sum of the outer products of each component-role representation pair. For example, the object John chases Bill could be represented by the tensor construction:

$\vec{T}_{John\ chases\ Bill} = \vec{V}_{John} \otimes \vec{R}_{agent} + \vec{V}_{chases} \otimes \vec{R}_{action} + \vec{V}_{Bill} \otimes \vec{R}_{patient}$

where $\vec{T}_{John\ chases\ Bill}$ is the tensor representation of the object John chases Bill; $\vec{V}_{John}$, $\vec{V}_{chases}$ and $\vec{V}_{Bill}$ are the vector representations of the component objects John, chases and Bill, respectively; and $\vec{R}_{agent}$, $\vec{R}_{action}$ and $\vec{R}_{patient}$ are the vector representations of each component object's respective role.
Alternatively, complex objects can be represented as the running outer product of each component object representation (Halford, Wilson, Guo, Gayler, Wiles, & Stewart, 1994). In this case, the object John chases Bill is constructed as:

$\vec{T}_{John\ chases\ Bill} = \vec{V}_{John} \otimes \vec{V}_{chases} \otimes \vec{V}_{Bill}$.
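Both constructions are straightforward to realize. The sketch below is my illustration, not the thesis's; the component and role vectors are arbitrary choices.

```python
# A minimal sketch of the two tensor constructions, with made-up vectors.
import numpy as np

V = {"John": np.array([3.0, 1.0, 4.0]),
     "chases": np.array([2.0, 3.0, 1.0]),
     "Bill": np.array([1.0, 5.0, 9.0])}
R = {"agent": np.array([-2.0, 3.0]),
     "action": np.array([1.0, 1.0]),
     "patient": np.array([3.0, 2.0])}

# Smolensky: sum of component-role outer products (a rank-2 tensor).
T_smolensky = (np.outer(V["John"], R["agent"])
               + np.outer(V["chases"], R["action"])
               + np.outer(V["Bill"], R["patient"]))

# Halford et al.: running outer product of the components (a rank-3 tensor).
T_halford = np.einsum("i,j,k->ijk", V["John"], V["chases"], V["Bill"])

print(T_smolensky.shape)  # (3, 2)
print(T_halford.shape)    # (3, 3, 3)
```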
Tensor representations were intended by Smolensky as an example of Connectionist representations that are non-Classical, and therefore non-implementational. Tensors "appear" to offer a form of non-Classical compositionality in the sense that constituents do not occur within the complex representation. For example, given the constituent-role pairs $\vec{V}_1 = (3\ 1\ 4)$, $\vec{R}_1 = (-2\ 3)$ and $\vec{V}_2 = (2\ 3\ 1)$, $\vec{R}_2 = (3\ 2)$, the resulting complex representation:

$\vec{V}_1 \otimes \vec{R}_1 + \vec{V}_2 \otimes \vec{R}_2 = \begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix} (-2\ \ 3) + \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix} (3\ \ 2) \qquad (2.1)$

$= \begin{pmatrix} 0 & 13 \\ 7 & 9 \\ -5 & 14 \end{pmatrix} \qquad (2.3)$

does not contain in any one of its rows or columns a vector that equals any one of the constituent vectors from which it was constructed. However, does this compositionality scheme account for systematicity?
The problem with tensor constructions, as Fodor and McLaughlin (1990) have pointed out, is that there is potentially an infinite number of constituent combinations that result in the same tensor representation, and therefore it is not possible to determine which combination was used to construct the tensor representation. To see the force of this point, suppose a system has represented as a tensor a complex object consisting of two component objects, and the system is required to extract one of its constituents for further processing. That is,

$\vec{a} \otimes \vec{b} = \vec{T}_{a,b}$

where $\vec{a}$ and $\vec{b}$ are the vector representations of the first and second component objects, respectively. Now, suppose two different objects have vector representations $\vec{a}' = \frac{1}{k}\vec{a}$ and $\vec{b}' = k\vec{b}$, where $k \neq 0, 1$. The corresponding representation of the complex object that is composed of these two objects is

$\vec{a}' \otimes \vec{b}' = \frac{1}{k}\,k\;\vec{a} \otimes \vec{b} = \vec{T}_{a,b}.$
Since one complex representation stands for two different complex objects, it is not possible to uniquely determine the component objects. Furthermore, in general, there are an infinite number of possible decompositions of the same complex representation. Thus, the value of the vector (tensor), in itself, provides insufficient information to determine which of the possibly infinite combinations of components were the actual components. In this case, the tensor does not allow for systematicity, because it does not give unambiguous access to its components. Therefore it fails to avoid the systematicity horn.
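This non-uniqueness is easy to verify numerically; the following check (mine, not the thesis's) rescales the constituents of the earlier example and recovers the identical tensor.

```python
# Rescaled constituents a' = a/k, b' = k*b yield the same outer product.
import numpy as np

a = np.array([3.0, 1.0, 4.0])
b = np.array([-2.0, 3.0])
k = 2.5  # any k != 0, 1 works

T = np.outer(a, b)
T_prime = np.outer(a / k, k * b)

print(np.allclose(T, T_prime))  # True: two decompositions, one tensor
```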
The alternative is to provide one of the constituent vectors to a process that
performs the inner product of the constituent with the tensor to extract the other
constituent vector. However, in doing so, one must maintain the representation of
the constituent whenever one is processing the complex representation (i.e., one
must token the constituent every time one tokens the complex representation).
Therefore, this alternative fails to avoid the implementation horn of the argument.
In both weak and strong compositionality, Smolensky has presented compo-
sitionality schemes that are non-Classical in that the representations of complex
objects are tokened without having to token representations of their component ob-
jects. However, both schemes fail to explain systematicity because the components
are no longer uniquely determinable.
2.3.3 The concatenative/functional distinction
The tensor solution is an example of what is generally regarded as structured representations without the necessity for them to be symbol structures. Van Gelder (1990) characterizes this distinction as the difference between concatenative and functional modes of compositionality, that is, modes of combining representations of component objects to form a new representation standing for the corresponding compositional object. Furthermore, he argues that, because Connectionism is committed to a functional compositionality scheme, it is an alternative theoretical framework to Classical cognitivism. Before examining this claim, the two modes of compositionality are described.
Concatenative compositionality is the spatial or temporal juxtaposition of representational components to form complex representations. For example, consider the complex object Tom chased Sue, which is composed of the atomic objects Tom, chased, and Sue. Now, suppose these atomic objects have the representations Tom, chased, and Sue, respectively. Then a corresponding concatenative compositional representation may be Tom chased Sue. The point to note is that the
constituent (e.g., Tom) remains unchanged within the complex representation.
Functional compositionality is a more general form of compositionality whereby the representation of a complex object is the result of some function that takes the constituent representations as its arguments. Consider again the Tom chased Sue example. This time suppose the objects Tom, chased, and Sue are represented by the numbers 1, 3, and 5, respectively. A possible functional composition of these numbers is 267, under the function $f(x, y, z) = 7^0 x + 7^1 y + 7^2 z$, where x, y and z are the numeric representations of the three components. Constituents can be uniquely extracted by the functions:

$f_1(R) = R \bmod 7 = x$
$f_2(R) = (R\ \mathrm{div}\ 7) \bmod 7 = y$
$f_3(R) = R\ \mathrm{div}\ 7^2 = z$
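A minimal sketch of this scheme (my illustration; the thesis gives only the equations) shows composition and extraction working as advertised, provided each component code is less than 7:

```python
# Base-7 functional composition and its extraction functions.
def f(x, y, z):
    """Compose three component codes (each < 7) into one number."""
    return 7**0 * x + 7**1 * y + 7**2 * z

def f1(r): return r % 7          # first constituent
def f2(r): return (r // 7) % 7   # second constituent
def f3(r): return r // 7**2      # third constituent

tom, chased, sue = 1, 3, 5
r = f(tom, chased, sue)
print(r)                    # 267
print(f1(r), f2(r), f3(r))  # 1 3 5
```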
Although functional compositionality contains concatenative compositionality
as a special case, and is therefore a more general conception of compositionality,
it is tangential to the Classicist/Connectionist debate for two reasons. Firstly,
Classicism was never committed to a concatenative style of compositionality.
\A suitable de�nition [of a physical instantiation mapping (F)] might
contain the statement that for any expressions P and Q, F[P&Q] =
B(F[P ],F[Q]), where the function B speci�es the physical relation that
holds between physical states F[P ] and F[Q]. Here the property B serves
to physically encode, (or `instantiate') the relation that holds between
the expressions P and Q on the one hand, and the expressions P&Q
on the other."
(Fodor and Pylyshyn, 1988: p14)
Thus, the function B is exactly what van Gelder calls a functional mode of compositionality, taking the component representations F[P] and F[Q] as arguments and returning a representation of the compositional object P&Q. The point is that the Classical picture leaves the function B unspecified and is therefore neutral as to the mode of compositionality used by the brain. Function B is unspecified as it
is irrelevant to the cognitive level, and only relevant to the implementation level.
The independence of these two levels is foundational to a Computational Theory
of Mind (Pylyshyn, 1980). That Classicism is not committed to a concatenative
compositionality is re-emphasized in two further places:
\... the relation that is physically realized is functional adjacency, there
is no necessity that physical instantiations of adjacent symbols be spa-
tially adjacent."
(Fodor and Pylyshyn, 1988: p57; emphasis in the original)
\Though we shall generally consider examples where complex sym-
bols literally contain their Classical constituents, the present condition
means to leave it open that symbols may have Classical constituents
that are not among their (spatio-temporal) parts. (For example, so
far as this condition is concerned, it might be that the Classical con-
stituents of a symbol include the values of a `fetch' operation that takes
the symbol as an argument.)"
(Fodor and McLaughlin, 1990: p. 186; emphasis in the original)
The contention over Classical constituents (that is, when a component is tokened within the tokening of its whole) is analogous to Kirsh's (1991b) computational definition of explicitness of information. Kirsh defines explicitness of information within a larger context as information that is accessible within a reasonable (constant, in Kirsh's case) amount of time. The important point that Kirsh makes is that
that
the accessibility of information is dependent on, or relative to,
the processes that access that information.
The point that has not been previously made by either the Classicists or the Con-
nectionists is that tokening of a constituent, like Kirsh's explicitness of information,
is relative to the process extracting that constituent.
In the tensor example from equation 2.3, let an access function $f_{R_1}$ be defined as:

$f_{R_1}(\vec{T}) = \frac{1}{13}\left(\vec{T} \cdot (-2\ \ 3)^{\mathsf{T}}\right).$

Then, constituent $\vec{V}_1$ is explicitly tokened relative to $f_{R_1}(\vec{T})$, since

$f_{R_1}\begin{pmatrix} 0 & 13 \\ 7 & 9 \\ -5 & 14 \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix}.$

The function $f_{R_1}$ specifies a particular projection or view of the complex representation. When the complex representation is viewed along this projection, the constituent becomes explicit (i.e., tokened).
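The following check (mine, not the thesis's) confirms the computation, and makes visible why it works: $\vec{R}_1$ happens to be orthogonal to $\vec{R}_2$, and $\vec{R}_1 \cdot \vec{R}_1 = 13$.

```python
# The access function f_R1 recovers V1 from the tensor of equation 2.3.
import numpy as np

V1, R1 = np.array([3.0, 1.0, 4.0]), np.array([-2.0, 3.0])
V2, R2 = np.array([2.0, 3.0, 1.0]), np.array([3.0, 2.0])

T = np.outer(V1, R1) + np.outer(V2, R2)

def f_R1(T):
    """View T along the R1 role: an inner product followed by rescaling.
    T @ R1 = V1*(R1.R1) + V2*(R2.R1) = 13*V1, since R2.R1 = 0."""
    return (T @ R1) / (R1 @ R1)

print(f_R1(T))  # [3. 1. 4.] -- V1 becomes explicit relative to this process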
The point is that the mode of compositionality is irrelevant beyond there being
explicit representations of components within complex representations relative to
the processes that operate on those representations.
The second reason why the distinction between concatenative and functional
modes of compositionality is irrelevant is that the modes of compositionality, in
themselves, do not guarantee systematicity. Or, in other words, a model that
makes use of a concatenative compositionality scheme can be just as unsystematic
as a model under some other functional compositionality scheme. An example is
given to illustrate this point.
Suppose that the objects John, loves and Mary are encoded as the binary-valued vectors 1 0 1, 0 1 0 and 1 0 1 0 1, respectively. Then, under a concatenative compositionality scheme, a possible encoding of the object John loves Mary is 1 0 1 0 1 0 1 0 1 0 1. In order to extract the constituents, a process must know at which points to truncate the vector. However, it is not possible to determine these points with certainty without additional information or assumptions as to the way in which the constituents were concatenatively constructed. For example, under the same compositionality scheme, the compositional object Mary loves John has the same complex representation (i.e., 1 0 1 0 1 0 1 0 1 0 1). Additional information could be provided, for example, in the form of separation markers (say, a component with value 0.5). A similar problem arises with functional compositionality schemes. The point is that the processes operating on compositional representations must have some knowledge of the way in which constituents are brought together in the first place.
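The ambiguity is easy to demonstrate (a toy check of my own, not from the thesis):

```python
# Naive concatenation of variable-length codes is ambiguous:
# two different sentences, one string.
codes = {"John": "101", "loves": "010", "Mary": "10101"}

s1 = codes["John"] + codes["loves"] + codes["Mary"]   # John loves Mary
s2 = codes["Mary"] + codes["loves"] + codes["John"]   # Mary loves John

print(s1)        # 10101010101
print(s1 == s2)  # True: the truncation points cannot be recovered
```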
It could also be argued that a commitment to concatenative compositionality is simultaneously a commitment to discrete representations and processes; and since the brain is not a discrete device, the mode of compositionality cannot be concatenative, and therefore cannot be symbolic (i.e., Classical). That the discrete/continuous dimension is orthogonal to the issue of whether cognitive architecture is symbolic is illustrated here by what I term the beam scattering analogy. In a beam of light there are photons with wavelengths that may vary over some continuous (though not necessarily uniform) distribution. When passed through a prism, the photons are scattered by an angle that varies monotonically with wavelength. Consequently, if one were to record the scattering with some photoelectric device that measures the incident light, one would observe intensity as a continuous function of angle. Now, suppose that constituents are encoded at points of zero intensity gradient. The magnitude of light intensity at the maximum, minimum and inflection points of the angle-intensity curve encodes the constituent. In this way, constituents may be concatenated as a continuous waveform without the necessity for the constituents to be confined to discrete fixed-width spatial locations (see Figure 2.2). The point is that concatenative compositionality is not limited to, and therefore does not equate to, discrete constituents. So the argument that cognition could not be symbolic because it could not operate over discrete representations does not follow.
The conclusion taken from a re-examination of this debate is that the im-
plementation horn of the argument leaves no room for an alternative
theory. Either constituents are tokened (relative to their access processes) with
their complex representations, in which case one is implementing a Classical architecture; or, constituents are not tokened, in which case one cannot account for systematicity simply because the constituents are not uniquely determinable. So the question becomes: just what (if anything) can Connectionism contribute in terms of cognitive theory?

Figure 2.2: Beam scattering analogy. An example of continuous concatenative compositionality where constituents are encoded as intensity magnitudes at the zero-gradient points of an angle-intensity curve.
2.3.4 A \potential" contribution of Connectionism
Classical theory of cognition is based on the assertion that the cognitive level is supported by, but independent of, the implementation level, where the cognitive
level consists of structured representations (standing for properties of the exter-
nal world), and processes that operate over these representations. Furthermore, a
cognitive state is the result of some process on some internal (structured) represen-
tation. From a Classicist's perspective, the proper concerns at the cognitive level
of theorizing are determining these processes and representations.
There are, arguably, two ways in which the Classicist picture is (potentially) limited. The first limitation is that it is not at all certain that cognitive behaviour is only a consequence of cognitive-level representations and processes. Take for
instance Fodor's (1975) own example:
\A man wishes to be reminded, sometime during the day, to send a
message to a friend. He therefore puts his watch on upside down,
knowing that he will glance at it eventually and that, when he does,
he will think to send the message. What we have here is, presumably,
a straight forward causal connection between two states ..., but not
a kind of connection cognitive psychology has anything to say about.
Roughly, although the mental states are causally connected, they aren't
connected by virtue of their content; compare the case of the man who
is reminded to send a message to his friend when he (a) hears and (b)
understands an utterance token of the type `send a message to your
friend'."
(Fodor, 1975: p203, emphasis in the original)
A second example[7] is illustrated in the form of the following addition task. Present the sequence of numbers listed in Figure 2.3(a), one at a time, to a subject, and request the subject to keep a running additive total as each number is presented. After the last number is presented, ask the subject to respond with the total. Interestingly, some subjects (though not necessarily all) will incorrectly respond with 5000 instead of 4100. Possibly, some perceptual stimulus, perhaps the alternating rhythm of the 1000s, or the predominance of 0s, interacts with the symbolic process of addition. However, subjects are more likely to respond correctly given the same numbers in a different order (Figure 2.3(b)).
[7] Danny Sumina, personal communication.
[8] That is, $(P \to Q) \land P \vdash Q$: if P implies Q, and P is true, then Q is also true.
[9] That is, $(P \to Q) \land \lnot Q \vdash \lnot P$: if P implies Q, and Q is false, then P is also false.

Figure 2.3: A subject asked to keep a running additive total of the numbers listed in (a) may respond with 5000; when asked to perform the same task on the same numbers in a different order (b), subjects are likely to respond with the correct answer, 4100. The sequences are (a) 1000, 10, 1000, 20, 1000, 30, 1000, 40 and (b) 10, 20, 30, 40, 1000, 1000, 1000, 1000.

A third example is due to van Gelder and Niklasson (1994). They used the results of a study by Kern, Mirels, and Hinshaw (1983), which showed that scientists do not consistently apply the rules of logic (specifically, modus ponens[8] and modus tollens[9]) in their reasoning, to argue that people are not as systematic as Fodor and Pylyshyn claim. Most dramatically, of the 98 people examined only 41% correctly
recognized the logical validity of modus tollens when reasoning about questions presented to them. Consequently, van Gelder and Niklasson argued that a Classical architecture is inadequate, as it fails to account for the degree of systematicity evident in the empirical studies. One could argue that the lack of perfect, or near-perfect, accuracy is a competence/performance issue (i.e., the difference between what can be processed as distinct from what is processed in practice). However, the study also shows a statistically significant difference, in the percentage of scientists that recognized the logical validity of both modus ponens and modus tollens, between concrete and abstract examples. A Classical model would predict no significant difference, as concrete and abstract instances of, for example, modus tollens are structurally identical.
These three examples illustrate how some cognitive processes may operate purely on the basis of an associative or statistical relationship between two cognitive states, not on the basis of the structural content of those states. It is these forms of associatively and statistically based behaviours that the Classical framework does not address, and presumably the sorts of behaviours that Connectionism would be well placed to explain. The extent to which cognitive behaviour can be explained in this way, however, is an empirical question[10], and not one addressed here.

[10] Another empirical question is to what extent the implementational level impacts on the cognitive level. Chater and Oaksford (1990) and Butler (1993) have argued that the implementation level constrains the class of possible algorithms that can compute some cognitive function. Therefore, Connectionism at the implementational level may limit the choice of algorithms at the cognitive level.
The second limitation comes from the process-relative notion of tokening, which precluded Connectionist architectures from being anything other than Classical implementations. The process-relative notion of tokening leaves open a vast class of architectures able to meet this condition. In an extreme case, tokening can be realized by a look-up table. For example, in the following systematicity of inference task, the first or second component is extracted from a binary relation. That is:

$F((x_i, x_j), q_k) = \begin{cases} x_i & \text{if } k = 1 \\ x_j & \text{if } k = 2 \end{cases}$
where $x_i$ and $x_j$ are chosen from a set of N objects, and $q_k$ is a query input requesting the first or second component. All $2N^2$ unique complex representations are generated by a hashing function:

$f_H(x_i, x_j, q_k) = \#R \qquad (2.4)$

where $\#R$ is a unique number for every unique triple $(x_i, x_j, q_k)$. Now, the first and second constituents can be extracted by providing indexed functions:

$f^1_{i,j}(\#R) = \begin{cases} x_i & \text{if } \#R = f_H(x_i, x_j, q_1) \\ 0 & \text{otherwise} \end{cases} \qquad (2.5)$

$f^2_{i,j}(\#R) = \begin{cases} x_j & \text{if } \#R = f_H(x_i, x_j, q_2) \\ 0 & \text{otherwise} \end{cases} \qquad (2.6)$

where $f^1_{i,j}$ and $f^2_{i,j}$ are the classes of functions dedicated to recovering the first and second components, respectively. The results of the $2N^2$ access functions are then summed. Since each function $f^k_{i,j}$ is sensitive to only one unique constituent, only one constituent will be returned.
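The look-up-table scheme can be made concrete in a few lines. The sketch below is my illustration, not from the thesis; the particular hashing function f_H is an arbitrary choice that merely guarantees a unique #R for every triple.

```python
# Look-up-table extraction in the style of equations 2.4-2.6.
N = 3  # objects are 1..N

def f_H(xi, xj, qk):
    """Hashing function (eq. 2.4): a unique #R for every triple."""
    return ((xi - 1) * N + (xj - 1)) * 2 + (qk - 1)

def f1(i, j, R):
    """Access function (eq. 2.5), dedicated to pair (i, j), query 1."""
    return i if R == f_H(i, j, 1) else 0

def f2(i, j, R):
    """Access function (eq. 2.6), dedicated to pair (i, j), query 2."""
    return j if R == f_H(i, j, 2) else 0

def F(xi, xj, qk):
    """Extract a component by summing all 2*N**2 access functions."""
    R = f_H(xi, xj, qk)
    return sum(f1(i, j, R) + f2(i, j, R)
               for i in range(1, N + 1) for j in range(1, N + 1))

print(F(2, 3, 1))  # 2 (first component)
print(F(2, 3, 2))  # 3 (second component)
```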
In terms of network architecture, the internal representation can be physically realized as the activation of a single hidden unit, and each access function as a connection from the hidden unit to another processing unit. The output of each processing unit is then summed at a single output unit (see Figure 2.4). The problem with this architecture is that it requires $2N^2$ processing units, and it should be rejected on the same grounds as the microfeature solution was rejected; that is, on the basis that too many processing units are required.
However, it is not necessary to dedicate one unit per object. Units can share the representation and processing responsibilities so as to minimize the total amount of resources. Auto-encoder networks of sigmoidal units[11], for example, require at most two hidden units to represent N objects (Kruglyak, 1990). By combining two auto-encoders, all $N^2$ binary relations can be represented with just four hidden units and 2N access functions. In this case, the number of processable objects grows faster than the amount of resource. Similarly, as suggested above, a network implementing Smolensky's tensor representation scheme could be provided with access functions at limited cost.

[11] That is, the activation function for the hidden and output units is $f(net_i) = \frac{1}{1 + e^{-net_i}}$, where $net_i$ is the net input activation to unit i.

Figure 2.4: Connectionist implementation of a look-up table. Input units receive external representations $x_i$, $x_j$ and $q_k$ of the first component, second component and question, respectively. The activation function ($f_H$) for the first internal unit, whose activation is $\#R$, is given by equation 2.4. The activation functions for the second layer of internal units are given by equations 2.5 and 2.6. The activation from these units is summed by the output unit labeled $\Sigma$. Arrows indicate connections with a weight value of one.
However, Fodor and McLaughlin rejected the possibility of tensor (or other Connectionist) solutions on the grounds that it is not sufficient to show how systematicity is possible given the assumptions of a Connectionist architecture; one must also show how systematicity could be necessary given those assumptions, since, in most cases, people exhibit systematicity despite their varying physical and social backgrounds. Fodor and McLaughlin argued that although it may be possible to hand-craft a Connectionist model that could represent the structured object John loves Mary given that it could represent the structurally related object Mary loves John, it is also possible to hand-craft a model that supports a representation of the structured object Bill chased Sue given that it supports the structurally unrelated object John put the train in the box. In short, even though Connectionist models could demonstrate systematicity by implementing Classical compositionality, Fodor and McLaughlin argued that they are too general to necessarily account for systematicity.
The weakness of Fodor and McLaughlin's argument, however, is that they do not specify criteria by which a (Connectionist) model is too general to necessarily account for systematicity. For example, a Connectionist architecture necessarily exhibits systematicity if the assumption is made that it be wired so as to exhibit systematicity. I attribute the lack of such criteria to Fodor and Pylyshyn's definition of systematicity being grounded solely in terms of computational capacity, that is, in terms of the desired functional behaviour of an architecture.
At this point, it is not clear whether the look-up-table and tensor networks would be regarded by Fodor and McLaughlin as examples of tokening. It may be that the process-relative notion of tokening that I have described is too general for Fodor and McLaughlin, for on the one hand they entertained the possibility of non-Classical Connectionist models exhibiting systematicity by careful hand-crafting of connections, but subsequently rejected such solutions on the basis that they do not necessitate systematicity. This lack of clarity is precisely the problem one gets into when definitions are grounded in terms of computational capacity.
What is required is a tightening of the definition of systematicity along a dimension orthogonal to computational capacity. One such dimension, often used in psychology, is reaction time. Two computationally equivalent algorithms can be discriminated on the basis of the amount of time required to perform a task. Butler (1993) has used this dimension to argue for the importance of the implementational level.
Another dimension is learnability. In the domain of analogical reasoning, Halford and Wilson (1994) have attempted to reject a three-layer feedforward network model of the balance-scale task, also on the basis of computational capacity alone. As I have argued in their case (see Appendix A), the learnability dimension can be used to discriminate two architectures on the basis of the amount of resource required to learn or acquire a function or behaviour. In the Classical framework, systematicity is built into a symbol system. However, Connectionism is concerned not only with the realization of cognitive behaviours, but also with their acquisition. Thus, a "potential" contribution of Connectionism is in refining the characteristic properties of cognitive architecture on the basis of the learnability of systematic behaviour.
2.4 Thesis question
Connectionism is a general framework whereby global behaviours are a consequence of the interaction of simple units. Connectionist models are sufficiently general to represent any spatial function (Lapedes & Farber, 1988; Hornik, Stinchcombe, & White, 1989; Funahashi, 1989) and a general class of dynamic functions (Sato, Murakami, & Joe, 1990).
However, despite their capacity to realize any function, there does not exist a general-purpose learning algorithm, except for restricted models. For example,
the perceptron convergence rule (Hertz, Krogh, & Palmer, 1991) guarantees that the perceptron will learn to discriminate between two classes of stimuli, provided that the stimuli (in their input representational space) are linearly separable (i.e., all points can be correctly partitioned by a single hyperplane). The limitation of the perceptron is that it cannot represent the function for discriminating points that are not linearly separable (Minsky & Papert, 1990). In fact, the general learning problem is known to be computationally difficult, in the sense that it is unlikely that there exists a polynomial-time algorithm for configuring the weights of a network with three layers (Judd, 1987, 1990), even in the case where there are only one output and two hidden threshold units (Blum & Rivest, 1990, 1992).
This trade-off between the realizability of a range of functions/behaviours and their learnability is well known from statistics as the bias/variance dilemma (Geman, Bienenstock, & Doursat, 1992). So, although Connectionist models have sufficient variance to approximate most behaviours, the important question is: do these models have sufficient bias to account for systematicity?
Bias (in statistics), or innateness (in psychology), is a term that refers to the tendency for certain behaviours over others as a consequence of the architectural properties of the model. Architectural biases are both a boon and a burden, for they may allow the tractable acquisition of one class of behaviours, but at the expense of being unable to exhibit another class. In the case of the perceptron, for example, restricting the unit activation function to be linear guarantees learnability of any linearly separable class of stimuli, but at the expense of not being able to discriminate non-linearly separable classes. Thus, use of a linear transfer function is an architectural bias. Conversely, use of a non-linear transfer function allows the discrimination of arbitrary classes, but does not guarantee learnability. A non-linear transfer function is also an architectural bias.
In a learning framework, where behaviour is (partly) acquired through experience, systematicity is a form of generalization over structurally related objects. Having learnt how to represent and process some instances of a complex object, a system can represent and process instances of previously unseen, but structurally similar, objects. It is the combinatorial nature of structured domains that makes the acquisition of systematic behaviours difficult for general-purpose learning models like Connectionist networks. Even in the simple N-Vt-N example, if there are n agents/patients and m actions, then there are $n^2 m$ N-Vt-N complex objects. Simply allocating a new representation for each complex object would be infeasible on the basis that one is never likely to have previously seen all instances. Thus, an important question, and the primary question that this thesis attempts to address, is: Can Connectionist models exhibit the necessary acquisition of systematic behaviour?
2.4.1 Method
The acquisition of behaviour requires the necessary, but judicious, use of architectural biases. The question is: what are those biases in regard to the necessary acquisition of systematic behaviour? At the same time, a particular bias should not preclude other desirable behaviours.
At this point the only property under consideration is systematicity. Considering only one behaviour is one-sided, in that there is no cost to a bias. For example, systematicity could be trivially accounted for by direct implementation of a symbol system. However, this approach would be at the expense of the associative/statistical properties that potentially may explain some cognitive behaviours (i.e., the first potential contribution of Connectionism, above). Consequently, the approach taken here is to start from what Dyer (1991) calls "radical Connectionism"; that is, from relatively unstructured general-purpose network architectures (e.g., feedforward and recurrent networks). This approach makes the fewest assumptions regarding a Connectionist architecture, and therefore precludes the fewest behaviours.
The other consideration is the criterion by which a network is regarded to have "necessarily" acquired systematicity. Since systematicity is regarded as a computational-level property, such a criterion must be at this level and with respect to a particular task. In this thesis, two tasks are considered: (1) auto-association of N-tuples (as a means of capturing a model's degree of systematicity of representation); and (2) extraction of a component from an N-tuple upon request (as a means of capturing a model's degree of systematicity of inference). In conjunction with these two tasks are two degrees of generalization by which a model is considered systematic with respect to a task. They are: (1) systematicity as generalization over a domain of complex objects related by a common structure; and (2) systematicity as generalization across position: what Hadley (1993) calls strong systematicity. In contrast to Smolensky's and van Gelder's approaches, where systematicity was regarded primarily[12] as a representation problem, the approach taken here is to regard systematicity as a generalization problem.

[12] Although Smolensky (1987b) also considers the possibility of learning an "optimal" combination of role vectors, which is discussed in chapter 4.
At each investigation, the property that permits/prevents systematicity is analyzed. So, although this thesis is concerned with what properties give systematicity, it is also concerned with why these properties do or do not give systematicity. The reason why a property of one architecture prevents systematicity is used as justification for moves to more structured architectures.
In summary, the approach taken in this thesis is as follows:

- specify a task for which it is possible to exhibit systematicity at the computational level;
- define a criterion for the necessary acquisition of systematic behaviour;
- evaluate a Connectionist model with respect to this task and criterion; and
- analyze the model in terms of the properties that permit/prevent systematicity.
2.5 Summary
Systematicity is a computational-level property of cognitive systems whereby the ability to represent and process objects having a particular structure extends to other objects having the same structure. The importance of systematicity is that it places constraints on cognitive architectures.
Those constraints as argued for by Fodor and Pylyshyn are that cognitive archi-
tecture must consist of: (1) structured representations, or symbol structures; and
(2) processes that are sensitive to the structure of those representations. These
two constraints characterize a Classical cognitive architecture. The distinguishing
feature of a Classical architecture is the tokening of constituents whenever complex
representations are tokened.
Connectionists have proposed three alternative schemes of compositionality: Smolensky's weak (microfeatures) and strong (tensors) compositionality; and van Gelder's more general functional compositionality. After reviewing these alternatives, I have concluded that

- the Connectionist notions of compositionality either cannot explain systematicity, because the constituents are not uniquely accessible; or implement Classical compositionality, because the constituents are tokened (relative to their access processes) whenever complex representations are tokened.

Consequently, they fail to provide an alternative explanation of systematicity.
However, I have also argued that the limitation of the Classical position is that the requirement of systematicity is grounded solely in terms of computational capacity (i.e., what can be computed by an architecture). Consequently, systematicity can be realized by many computationally sufficient architectures, including Smolensky's tensors, which Fodor and McLaughlin wish to reject. I have suggested that the requirement of systematicity can be made stronger by considering learnability as an additional criterion. Thus, a "potential" contribution of Connectionism is an explanation for the necessary acquisition of systematic behaviour. Therefore, the problem that systematicity poses for Connectionism, which is the central concern of this thesis, is:

- Can Connectionist models exhibit the necessary acquisition of systematic behaviour?
The approach taken in the subsequent chapters to addressing this question is: first, to propose a suitable definition of the necessary acquisition of systematicity; second, to evaluate Connectionist models with respect to this definition; and, in doing so, to ascertain the properties that permit/prevent the necessary acquisition of systematicity.
Chapter 3
Systematicity as generalization
across domain
3.1 Introduction
In the last chapter I suggested that a "potential" import of Connectionism is in an explanation of the necessary acquisition of systematicity. This chapter is a first attempt at addressing the learnability of systematicity, rather than its computability. That is, where the debate reviewed in the previous chapter was concerned with algorithms/architectures that can compute systematic functions/behaviours, this chapter is concerned with algorithms/architectures that produce algorithms that exhibit systematic functions/behaviours.
The definition of systematicity from the previous chapter is a property whereby the ability to represent and process some instances of a structured object implies the ability to represent and process other instances with the same structure. In a learning framework, then, systematicity is the property whereby the ability to learn to represent and process some instances of a structured object implies the ability to represent and process other instances with the same structure. In other words, systematicity is a form of generalization over structurally related objects. Some previous work has been done in this area by Brousse and Smolensky (1989) and Brousse (1991), and their work is re-examined in section 2.
In section 3, the framework of probably approximately correct (PAC)-
learnability (Valiant, 1984) is presented, and used to provide the criterion by which
a model is said to necessarily acquire systematic behaviour.
The classic criterion in computer science by which a problem is regarded as tractable (in terms of the amount of resource required), or by which an algorithm is regarded as a tractable solution to a problem, is polynomial resource[1] complexity (Garey & Johnson, 1979). Algorithms, or more generally problems, which require more (e.g., exponential) resource are generally considered impractical, as there are not enough resources to solve even moderately sized problems. Valiant extended the notion of tractability (feasibility) to the domain of learning. Briefly, a function (or, more generally, a function class) is considered learnable by an algorithm if the target function is acquired, to within some degree of accuracy, on most occasions, with only polynomial resources in the size of the target function. The PAC-learning framework has proven very useful in the formal analysis of learning and has led to a field within machine learning called computational learning theory (see Anthony, 1992, for an introduction to this field). In this chapter, the following definition is introduced: A Connectionist model is said to necessarily acquire systematic behaviour if

the architecture acquires the target behaviour to within a desired degree of accuracy, with a desired degree of confidence, with at most polynomial resources in the size of the behaviour.

[1] Typically, resource means time; however, time can be traded for space, or for the number of processors in the case of parallel machines.
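Stated symbolically (my paraphrase of Valiant's criterion, not a formula from the thesis): for accuracy parameter $\epsilon$, confidence parameter $\delta$, and a hypothesis h produced from m training examples,

$\Pr[\,\mathrm{error}(h) \le \epsilon\,] \ge 1 - \delta, \qquad m \le \mathrm{poly}(1/\epsilon,\ 1/\delta,\ \mathrm{size}(f)),$

where size(f) measures the target function (for networks, typically the number of weights), and the learner's running time is similarly polynomially bounded.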
Feedforward networks are the first and perhaps the most widely used networks in Connectionist modeling. Given the work of Brousse and Smolensky, they are an obvious starting point for addressing the issue of systematicity within Connectionism. In section 4, PAC-learnability is used as the criterion for the necessary acquisition of systematic behaviour with respect to the task studied by Brousse and Smolensky (i.e., auto-association of N-tuples, or relations). Auto-association provides a way of evaluating whether a model (in this case, a feedforward network) exhibits systematicity of representation.
General discussion regarding these results is given in section 5, and finally, a summary and conclusion are provided in section 6.
3.2 Brousse and Smolensky's massive generalization
In Connectionist models that learn from examples, cognitive behaviour is seen to
emerge from the dynamics of the network interacting with some environment. One
of the criticisms of the Connectionist approach has always been that Connectionist
models require too many examples to acquire a competent level of performance in
such complex domains as language. Combinatorial domains are interesting because
such a vast amount of behaviour can be described with relatively few rules. In fact,
they are the perfect domains for symbol systems, and the sorts of domains where
a Connectionist model is expected to fail. It is this expectation of failure that
motivated the studies of Brousse and Smolensky.
3.2.1 Task: Auto-association of N-tuples

One of the simplest combinatorial domains is the auto-association of N-tuples, where, say, an object represented as N instantiated variables must be encoded and later recovered. The set of N-tuples S is defined (using Z-notation; Spivey, 1989) as $S = S_1 \times \ldots \times S_N = \{x_1 : S_1; \ldots; x_N : S_N \bullet (x_1, \ldots, x_N)\}$, where $S_i$ is the set of possible values at position i of the tuple. The case considered here is where $S_1 = S_2 = \ldots = S_N$. For example, $(x_3, x_4)$ and $(x_4, x_3)$ are distinct objects in a domain of 2-tuples. The auto-association of N-tuples is the identity function that maps every element in S to itself.
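As a concrete illustration (mine, not the thesis's), the domain and target function for this task can be enumerated directly; the alphabet of atoms and the tuple order are free parameters.

```python
# The N-tuple domain S and the auto-association (identity) task,
# for an alphabet of k atomic objects.
from itertools import product

k, N = 26, 2
atoms = [chr(ord("A") + i) for i in range(k)]

domain = list(product(atoms, repeat=N))   # S = S_1 x ... x S_N
print(len(domain))                        # 26**2 = 676 distinct 2-tuples

def auto_associate(t):
    """The target function: the identity map on S."""
    return t

assert auto_associate(("C", "A")) == ("C", "A")
```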
3.2.2 Method
Brousse and Smolensky (1989) and Brousse (1991) studied the auto-association of
N -tuples (strings of length N) by a feedforward network. In their simulations, the
number of atomic objects was 26 (letters of the alphabet), and the tuple order was
varied from 2 to 6 (i.e., the number of components in each tuple). Therefore, the
size of the domain grew exponentially with the tuple order from 262 = 676 (N = 2)
to 266 � 3 � 108 (N = 6).
Each atomic object (letter) was represented externally (i.e., at the input/output
layer) by a random binary vector of length 8. The input/output representation of
an N -tuple (string) is the in-order concatenation of the vector representations of its
components. Thus, the representation of an N -tuple is a 8N dimensional vector.
The feedforward network they examined consisted of 8N input units (i.e., one
unit for each vector component) completely connected to all 5N hidden units, which
are in turn completely connected to all 8N output units. The task of the network
is to learn to construct internal (hidden unit vector) representations of the input,
and subsequently, recover the input representation at the output (e.g., Figure 3.1).
Figure 3.1: A feedforward network for learning to represent 2-tuples. Arrows indicate complete connections between groups of units, and numerals indicate the number of units in a group. In this example, an input representation of the object (C, A) is mapped to an internal representation R[(C, A)], from which the input representation is recovered at the output.
A subset of the domain is randomly selected from a uniform distribution, on which the network is trained. Training consists of presenting the input and target vector pairs. The network learns by changing the connection weights so as to minimize the error between the network's output and the target output for each pattern, using the standard backpropagation learning algorithm (Rumelhart et al., 1986), which is defined as:

$\Delta W_{ij} = -k \frac{\partial E}{\partial W_{ij}} \qquad (3.1)$

where $\Delta W_{ij}$ is the change in the weight of the connection from unit i to unit j; k is the magnitude of change; and E is the network error, which, in this case, is the sum of the squared differences between the target vector and the network output vector, given by:

$E = \sum_{p=1}^{n} E_p = \sum_{p=1}^{n} |\vec{t}_p - \vec{o}_p|^2 = \sum_{p=1}^{n} \sum_{i=1}^{m} |t_{pi} - o_{pi}|^2 \qquad (3.2)$

where $E_p$ is the error for each example pattern p; n is the number of examples; $\vec{t}_p$ and $\vec{o}_p$ are the target and network output vectors, respectively; and $t_{pi}$ and $o_{pi}$ are the corresponding vector components at output unit i, for m output units. The activation of any non-input unit is given by:

$o_j = \frac{1}{1 + e^{-\left(\sum_{i=1}^{k} o_i w_{ij} + b_j\right)}} \qquad (3.3)$

where $o_j$ is the activation of unit j; $o_i$ is the activation of unit i, whose activity is propagated to unit j via a connection with weight $w_{ij}$; k is the number of units that have an incoming connection to unit j; and $b_j$ is the bias.

Training was started from a random setting of the weights and terminated when all output units were within 0.4 of their targets for all patterns.

Testing the performance of the network consists of randomly resampling the domain and determining whether the network's output is sufficiently close to the target output. In this case, the network's response was considered correct when all output units were within 0.4 of their corresponding target values.
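For readers who want to reproduce the setup, the following is a minimal sketch of such an auto-associator (my reconstruction under the stated training and testing tolerances, not the authors' code; the learning rate, initialization and epoch limit are arbitrary choices).

```python
# A one-hidden-layer sigmoid network trained by backpropagation
# (equations 3.1-3.3) on randomly encoded 2-tuples.
import numpy as np

rng = np.random.default_rng(0)
k_atoms, N, bits, hidden = 26, 2, 8, 5 * 2   # 8N inputs/outputs, 5N hidden

codes = rng.integers(0, 2, size=(k_atoms, bits)).astype(float)

def encode(tup):
    return np.concatenate([codes[i] for i in tup])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in = n_out = bits * N
W1 = rng.normal(0, 0.5, (n_in, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.5, (hidden, n_out)); b2 = np.zeros(n_out)

train = [encode(rng.integers(0, k_atoms, N)) for _ in range(50)]
lr = 0.5
for epoch in range(5000):
    done = True
    for x in train:
        h = sigmoid(x @ W1 + b1)          # hidden activations (eq. 3.3)
        o = sigmoid(h @ W2 + b2)          # output activations
        err = o - x                       # proportional to dE_p/do (eq. 3.2)
        if np.max(np.abs(err)) > 0.4:
            done = False
        d_o = err * o * (1 - o)           # backpropagated deltas
        d_h = (d_o @ W2.T) * h * (1 - h)
        W2 -= lr * np.outer(h, d_o); b2 -= lr * d_o    # eq. 3.1 updates
        W1 -= lr * np.outer(x, d_h); b1 -= lr * d_h
    if done:
        break

test = encode(rng.integers(0, k_atoms, N))
out = sigmoid(sigmoid(test @ W1 + b1) @ W2 + b2)
print("test pattern correct:", bool(np.all(np.abs(out - test) <= 0.4)))
```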
3.2.3 Exponential number of generalizations
From this training scheme, Brousse and Smolensky showed that, from a fixed number of training examples (50), the number of generalizations (future examples correctly processed) and virtual memories (patterns learnt within five additional learning trials) grew exponentially with the order of the tuple (see Brousse & Smolensky, 1989, Figures 5 & 7, respectively).
3.2.4 Exponential decrease in percentage of generalizations

The exponential growth in the number of generalizations exhibited by the feedforward network appears impressive. However, the growth in the total number of examples is also exponential. Two measures of the network's degree of systematicity for the representation of tuples can be obtained by calculating (1) the number of generalizations, and (2) the number of generalizations plus virtual memories (from Brousse & Smolensky, 1989, Figures 5 & 7, respectively), as a percentage of the total number of examples in the domain.
A plot of tuple order versus degree of systematicity is shown in Figure 3.2. The linear relationship between the x and y axes of the log-linear plot indicates a decreasing exponential relationship between tuple order and degree of systematicity. So, for example, despite the network being able to represent tuple instances of order 6, on average only 1% were correctly represented, counting the number of generalizations plus virtual memories.
It is this lack of generalization over such structured domains (i.e., lack of systematic behaviour) that has been the target of the strong criticism of the Connectionist approach to cognitive modeling (Fodor & Pylyshyn, 1988). Brousse (1991) used a weight decay term[2] to improve generalization to the point where the network performed perfectly for small N. However, the percentage of correct examples still decreased as an exponential[3] function of N. Clearly, 50 training examples is too few to achieve a high degree of generalization over such a large domain. However, if the network requires some constant (even if small) fraction of the domain, then learning will be intractable, as the training set size will grow exponentially with tuple order.

[2] Weights are eliminated so as to reduce the possibility of overparameterization by the network.
[3] Brousse (page 190) reports the number of generalizations to be $15^N$. However, the total number of examples is $26^N$. Consequently, the fraction of the example space correctly processed is $(15/26)^N$ (i.e., an exponentially decreasing function of N).
Figure 3.2: A log-linear plot of the percentage of generalizations, and of generalizations plus virtual memories, as a function of tuple order (N). The linear relationship indicates an exponential decrease in the percentage of the space correct. Reconstructed from Brousse and Smolensky (1989), Figures 5 & 7.
If systematic behaviour is regarded as a degree of generalization over a particular domain, then the problem can be restated in terms of PAC-learnability by asking two questions. First, how many training examples are required to maintain a fixed degree of accuracy over the entire space? (In other words, what is the sample complexity?) Second, how long will it take to learn these examples? (In other words, what is the time complexity?)
Before attempting to answer these two questions, the concept of PAC-learnability is explained, with emphasis on its applicability to the systematicity problem.
3.3 Probably approximately correct learnability
Probably approximately correct (PAC)-learnability is a theoretical concept introduced by Valiant (1984) as a criterion against which one may formally prove the learnability of a particular class of problems by a particular learning machine. The concept of PAC-learnability is analogous to, and extends, the concept of tractability. Roughly, an algorithm is said to tractably compute some function if it requires no more than a polynomial amount of resource (e.g., time, space, processors) in some parameter measuring the size of the function. Similarly, a function (or function class) is said to be PAC-learnable by an algorithm if that algorithm, on most occasions, acquires the target function to within a desired degree of accuracy with no more than a polynomial amount of resource in the size of the target function.
PAC-learnability has been used as the framework for theoretical results on the learning capabilities of neural networks (see Anthony, 1992, for a review of the import of computational learning theory for Connectionism). Theoretical approaches to learnability problems using neural networks generally divide the problem into two parts: sample complexity, the number of training examples required to acquire the target function; and time complexity, the time required to load (correctly map) these training examples.
In determining the sample complexity, one wants to establish how many training examples are required before the network correctly processes a new example with probability greater than $1 - \epsilon$. Furthermore, the probability that the network will behave with this degree of accuracy on future examples, having been given this many training examples, must be greater than $1 - \delta$. A network is considered to tractably learn a function when the number of training examples required to satisfy this relationship is polynomial in the size of the target function, which is usually measured as the number of weights required to implement that function. A number of results on the sample complexity of neural networks have been established (e.g., Baum & Wilczek, 1987; Shawe-Taylor & Anthony, 1991; Haussler, 1992), and it is one of the results of Shawe-Taylor and Anthony that is used in the next section.
Establishing time complexity results generally involves mapping the task of finding a set of weights that satisfies the function specified by the training set, for a given network architecture, onto a problem of known complexity in computability theory (e.g., Judd, 1987, 1988; Lin & Vitter, 1989; Blum & Rivest, 1990; Judd, 1990; Blum & Rivest, 1992). These results, although general, are also negative, in the sense that even for relatively simple feedforward architectures there does not exist a polynomial-time algorithm for finding a set of weights that will implement an arbitrary function specified by the training set.
However, these results are worst-case scenarios, in that they consider the time required to learn any function from very general function classes. Time complexity can be reduced by restricting the function class, or by incorporating knowledge about the functions to be learnt into the learning algorithm. Since learning is essentially a search through a function space for a function that is sufficiently close to the target function, restricting the function class effectively reduces the search space and therefore the search time. Incorporating knowledge about the functions to be learnt structures the search space into regions where target functions are likely to be found. Consequently, search time can be reduced by ignoring regions of the search space, just as is the case, for example, when performing a binary search.
Incorporating domain-specific features into an algorithm or architecture (e.g., a non-linear activation function) can, however, greatly complicate a theoretical approach. Consequently, in the next section, the time complexity is determined empirically.
PAC-learnability provides a tool for determining the learning capabilities of such potential cognitive models as Connectionist networks. If one links the desired degree of generalization with the computational resources required to achieve that degree of generalization, then one has a criterion for determining whether a Connectionist model is systematic that does not rely on the computational capacity of the model. The rationale for using polynomial resources as a criterion for systematicity is that, if a machine or model requires a greater-than-polynomial (e.g., exponential) amount of resources, then there simply isn't enough time or resource available to achieve the desired level of generalization for even moderately sized functions. Thus, this criterion allows potential candidates for cognitive architecture to be discounted on the basis that they are too computationally expensive.
In the next section, this criterion is applied to the task studied by Brousse and Smolensky to examine the capacity of feedforward networks to acquire systematicity of representation.
3.4 Systematicity of representation with feedforward networks
Systematicity of representation is the ability to construct internal representations of structurally related objects. For these internal representations to be useful, the objects they represent must be subsequently recoverable.
For example, a model that constructs internal (vector) representations of complex objects by the addition of vector representations of the objects' components would not be satisfactory, since some objects would not have unique internal representations. Suppose, for example, internal representations of the two distinct objects John loves Mary and Mary loves John were composed by vector addition of their components John, Mary and loves. Then:

$\vec{R}_{John\ loves\ Mary} = \vec{R}_{John} + \vec{R}_{loves} + \vec{R}_{Mary}$  (by definition)
$= \vec{R}_{Mary} + \vec{R}_{loves} + \vec{R}_{John}$  (by the law of commutativity)
$= \vec{R}_{Mary\ loves\ John}$  (by definition)

In fact, any commutative operator is an inadequate mechanism for the construction of complex representations.
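A quick numerical check of this point (mine, not the thesis's): addition collapses the two orderings, whereas a non-commutative composition such as the outer product does not.

```python
# Commutative composition (addition) collapses distinct structures;
# a non-commutative one (the outer product) keeps them apart.
import numpy as np

rng = np.random.default_rng(1)
john, loves, mary = (rng.normal(size=4) for _ in range(3))

by_sum = lambda a, b, c: a + b + c
by_outer = lambda a, b, c: np.einsum("i,j,k->ijk", a, b, c)

print(np.allclose(by_sum(john, loves, mary), by_sum(mary, loves, john)))      # True
print(np.allclose(by_outer(john, loves, mary), by_outer(mary, loves, john)))  # False
```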
The usefulness of the internal representations constructed by a network can
be tested by requiring the network to subsequently output a representation of the
complex object it was presented. In other words, by requiring the network to
auto-associate the input representations of complex objects.
3.4.1 Task: Auto-association of N -tuples
The family of tasks on which the network is evaluated is the same as used in
the studies by Brousse and Smolensky as described in subsection 3.2.1. That is,
auto-association of N -tuples where an N -tuple is an ordered relation of N atomic
(unstructured) objects selected from some set.
3.4.2 Model: Feedforward network
The feedforward network (Figure 3.3) used in these simulations is the same as used
by Brousse and Smolensky except that the number of input/output units per tuple
order was 10 (i.e., k = 10) and tuple order (N) was varied from 2 to 10, whereas
in the work of Brousse and Smolensky, k = 8 and N was varied from 2 to 6. Also,
a different input and output encoding scheme was used. Each component was
encoded by a local encoding scheme (i.e., one unit on and the rest off) at the input
layer and a block encoding scheme (i.e., half of the units on and the rest off;
Bakker, Phillips & Wiles, 1993; in press) at the output layer (see section 3.4.4), whereas
Brousse and Smolensky encoded each component as a random binary vector. With
local and block encoding schemes, the number of possible components is equal to
the number of units available for representing each component, which in this case
is 10.
Figure 3.3: Feedforward network for the auto-association of N-tuples (input and
output layers of k × N units, hidden layer of h × N units). In this architecture,
k refers to the number of possible atomic objects in any one position, N refers to
the number of components within a complex object (tuple order), and h refers to
the number of hidden units per component. Every input unit was connected to
every hidden unit, which was connected to every output unit.
In this chapter, systematicity has been framed in terms of PAC-learnability.
That is, a network is considered systematic if it can attain a fixed degree of gen-
eralization (i.e., percentage of example space) by requiring, at most, polynomial
resources in the size of the function. For the task used here, the size of the function
is characterized as the tuple order (N).
Here, both theoretical and empirical approaches are taken to address this issue
of computational resource. A theoretical result places bounds on the difficulty of
the problem. By using a result due to Shawe-Taylor and Anthony (1991), an upper
bound on the number of training examples needed to attain a desired degree of
generalization can be obtained. The advantage of this approach is that one can
obtain limits purely on the information contained in the data and the represen-
tational power of the model which are independent of the learning algorithm the
model may use. The limitation of this approach, however, is that there are no
general results for determining time complexity. Therefore, the time complexity
results were established empirically.
3.4.3 Theoretical result
Firstly, a theoretical upper bound on sample complexity is given for this task.
Then, the sample and time complexities are determined empirically.
Polynomial sample complexity
The sample complexity result of Baum and Haussler (Baum & Haussler, 1989) is
not applicable in this case as multiple output units are involved. However, there is
a corollary by Shawe-Taylor and Anthony (Shawe-Taylor & Anthony, 1991) that,
interestingly enough, provides an upper bound on the number of training examples
independent of the number of output units. It applies to feedforward networks of
threshold units with one hidden layer. It says:
"Given an accuracy parameter ε and a confidence parameter δ, for
a feedforward network with W variable weights and n computational
nodes, with probability greater than 1 − δ the network will give correct
output with probability greater than 1 − ε on inputs drawn according
to some distribution, provided it correctly computes the function on a
sample (drawn from the same distribution) of size at least

m_0 = m_0(\epsilon, \delta) = \frac{1}{\epsilon(1 - \sqrt{\epsilon})} \left[ \ln\left(\frac{1.3}{\delta}\right) + 4(W + n)\log_2(en)\,\ln\left(\frac{6}{\epsilon}\right) \right]

(Shawe-Taylor and Anthony, 1991, p. 116)"
where W is the total number of variable weights (including biases) of connections
from input units to hidden units and from hidden units to a single output unit4;
and n is the number of hidden units. In a network of N (k−h−k) encoders,
n = N × h and W = khN^2 + 2hN + 1.

4 Since their result is independent of the number of output units, W is specified so as to
consider only weights associated with one of the output units.
If the accuracy and confidence parameters are fixed, as well as the number of
possible values at each tuple position (k), then an upper bound on sample com-
plexity in terms of N, the tuple order, can be determined. Given that the number
of input and output units is N × k, and the number of hidden units5 is N × ⌈log₂ k⌉,
then the upper bound on the maximum number of training examples required is

m_0 = O\left( \frac{N^2 k \log_2 k}{\epsilon} \log_2\left( \frac{N \log_2 k}{\epsilon} \right) \right) = O(N^2 \log_2 N)
Thus, the network only requires a polynomial number of training examples.
Provided the network can load (acquire the correct behaviour on) these examples
it will, with high probability, correctly generalize to an exponential number of
future examples.
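To see how the bound scales, it can be computed directly from the expressions for
W and n above. The following is a sketch only (the helper name is ours, and
h = ⌈log₂ k⌉ hidden units per component is the binary-encoding assumption stated
in the footnote):

    import math

    def sample_bound(N, k, eps=0.05, delta=0.01):
        # Shawe-Taylor and Anthony (1991) upper bound m0 on training set size.
        h = math.ceil(math.log2(k))         # hidden units per component
        n = N * h                           # total number of hidden units
        W = k * h * N**2 + 2 * h * N + 1    # weights feeding a single output unit
        return (1 / (eps * (1 - math.sqrt(eps)))) * (
            math.log(1.3 / delta)
            + 4 * (W + n) * math.log2(math.e * n) * math.log(6 / eps)
        )

    for N in (2, 4, 6, 8, 10):
        print(N, round(sample_bound(N, k=10)))

The printed bounds grow roughly as N^2 log N; their absolute size depends on the
choices of ε and δ.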
3.4.4 Empirical results
The theoretical result is quite general in that it applies to any function that is
representable by such a network (not only auto-association of N -tuples), and it
is independent of the distribution from which the training and testing examples
are selected (it only requires that the training and testing distributions be the
same). However, the bound put on the number of training examples is quite high
(in the order of tens of thousands of training examples). Furthermore, there is
an assumption that there exists an algorithm that can load these examples in
polynomial time.
With respect to the auto-association of N -tuples task, the question is: Can
the standard backpropagation algorithm be used to load enough training examples
in polynomial time so that the feedforward network exhibits systematicity with
respect to this task?
Method
In an empirical study, it is not necessary to separate complexity into sample and
time components. One could simply record the total processing time required by
the network to acquire the desired degree of generalization and subsume sample
complexity within time complexity.
5 The minimum number of threshold units required by a single encoder of k items is ⌈log₂ k⌉,
using a binary encoding scheme.
However, keeping the two components separate allows one to determine where
the demand for computational resource is greatest, and therefore it allows one to
identify the weaknesses of the architecture. For example, it is possible that the
sample complexity is low relative to the time complexity to load the examples.
Such a result suggests that the learning algorithm used by the network is not
making efficient use of the information represented in the training set. It would
suggest that improvements are to be found in the learning algorithm rather than
the connectivity of the network. This point will be returned to when discussing the
empirical results of the feedforward network on the auto-association of N -tuples
task.
For the empirical approach, what is considered is the number of training examples
and the time taken to obtain at least 95% accuracy (ε = 0.05) with at least
99% confidence (δ = 0.01) on future examples for tuples of order N = 2 to N = 10
(in increments of 2). The number of possible values at each tuple position was set
at 10, as mentioned in subsection 3.4.2 (i.e., |S_i| = k = 10).
Each network was trained using the standard backpropagation algorithm6
(Rumelhart et al., 1986) on training sets ranging from 10 to 1500 examples ran-
domly generated from a uniform distribution. A train-test trial consisted of: ran-
dom initialization of network weights; random generation of training examples;
training the network until all output units were on the right side of 0.5 for all
training patterns; testing network performance on a new test set of 1000 randomly
generated examples from the same distribution; and recording the number of test
patterns for which all output unit activations were on the right side of 0.5.
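A response therefore counted as correct only if every output unit fell on the same
side of 0.5 as its target. A minimal sketch of that acceptance test (the function
name is ours):

    import numpy as np

    def all_on_right_side(outputs, targets):
        # Correct iff every output activation is on the same side of 0.5 as its target.
        return bool(np.all((outputs > 0.5) == (targets > 0.5)))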
For each value of N and each training set size, five train-test trials were con-
ducted, providing five measurements of network performance on the 1000 test ex-
amples, from which the 99% confidence intervals on network performance on future
examples could be calculated. For each N, the training set size was increased until
the lower bound of the 99% confidence interval was at least 95%. In other words, the
training set size was increased until any one of the randomly initialized networks,
with 99% confidence, would correctly process at least 95% of the 1000 test exam-
ples it was given. The training set size which met this criterion was recorded. The
time taken to train each network, measured in total number of weight updates, was
also recorded.
6 Which is the same algorithm as used by Brousse and Smolensky (see subsection 3.2.2).
In this simulation, sigmoidal units (as defined by equation 3.3) rather than
threshold units were used, as this reduces the number of necessary hidden units7
to 2N. The learning rate was set to 0.1. No momentum term was used. Weights
were updated after each training pattern was presented. Weights were randomly
initialized from a uniform distribution in the range [−1, 1]. The learning algorithm
used was gradient descent with the sum of squares error function (as defined by
equations 3.1 and 3.2, respectively).

7 Two hidden units being the most required for a single encoder of sigmoidal units (Kruglyak,
1990).
The representation of each component at each group of k units at the input layer
was a local encoding (i.e., one unit on and the rest off), and at the output layer
was a block encoding (i.e., half of the units are on consecutively, with wraparound,
and the rest off). That is,

            Input (local)            Output (block)
    x1  =   1 0 0 0 0 0 0 0 0 0      1 1 1 1 1 0 0 0 0 0
    x2  =   0 1 0 0 0 0 0 0 0 0      0 1 1 1 1 1 0 0 0 0
    ...
    x10 =   0 0 0 0 0 0 0 0 0 1      1 1 1 1 0 0 0 0 0 1
The block encoding scheme for the output layer was chosen as it dramatically
improves the learning time of the k−2−k encoder problem (Bakker, Phillips &
Wiles, 1993; in press), which has previously been shown to be particularly difficult
for backpropagation-style networks (Lister, 1993).
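The two encodings can be generated as follows (a sketch; the function names are
ours, and the block code reproduces the wraparound pattern shown above):

    import numpy as np

    def local_code(i, k=10):
        # One unit on, the rest off.
        v = np.zeros(k)
        v[i] = 1.0
        return v

    def block_code(i, k=10):
        # Half of the units on consecutively, starting at unit i, with wraparound.
        v = np.zeros(k)
        v[(i + np.arange(k // 2)) % k] = 1.0
        return v

    print(block_code(9))  # [1. 1. 1. 1. 0. 0. 0. 0. 0. 1.], matching x10 above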
Result(1): Polynomial sample complexity
The number of training examples sufficient to meet the 99% confidence level (i.e.,
δ = 0.01) and 95% accuracy level (i.e., ε = 0.05) for tuple orders ranging from 2 to
10 is shown in Figure 3.4.
On a log-log plot, the linear relationship between the x and y coordinates (solid
line) indicates that training set size is a polynomial function of tuple order, whereas
the total number of examples is an exponential function of tuple order and appears
as a non-linear relationship between the x and y coordinates (dotted line).
A linear regression shows that the relationship between the number of training
examples (m) and tuple order (N) is

m = 130 N^{1.01}

with significance F(1, 3) = 44, p < 0.01.
Thus, the sample complexity of the feedforward network on the auto-association
of N -tuples task is polynomial in the tuple order.
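The exponent in such a fit is recovered by ordinary least squares in log-log coor-
dinates. A sketch of the estimation procedure (the sample sizes here are synthesized
from the reported fit itself, purely to illustrate the method):

    import numpy as np

    N = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    m = 130 * N**1.01                   # synthetic sizes from the reported fit

    slope, intercept = np.polyfit(np.log(N), np.log(m), 1)
    print(round(slope, 2), round(np.exp(intercept)))  # 1.01 130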
Figure 3.4: A log-log plot of the number of training examples (solid line) and
total possible examples (dashed line) as a function of tuple order (N). The linear
relationship indicates that the required number of training examples is a polynomial
function of tuple order (i.e., sample complexity is polynomial).
Result(2): Polynomial time complexity
The mean training time, measured as the total number of weight updates (over 5
trials), sufficient to load the training examples for tuple orders ranging from 2 to
10 is shown in Figure 3.5.
On a log-log plot, the linear relationship between the x and y coordinates indi-
cates that training time is a polynomial function of tuple order.
A linear regression shows that the relationship between training time (t) and
tuple order (N) is

t = 2.9 \times 10^8 N^{5.1}

with significance F(1, 3) = 1020, p < 0.01.
Thus, the time complexity of the feedforward network on the auto-association
of N -tuples task is polynomial in the tuple order.
Discussion
The theoretical result assumed the existence of an algorithm that could load the
training examples in polynomial time. The question was whether the standard
backpropagation algorithm could be used to load enough training examples in
polynomial time so that the feedforward network would demonstrate the neces-
sary acquisition of systematicity of representation with respect to the domain of
N -tuples. For the empirical studies, the feedforward network is said to neces-
sarily acquire systematicity of representation if, on 99% of occasions, it acquires
the capacity to represent N -tuples with accuracy greater than or equal to 95%,
requiring computational time that is at most a polynomial of N . The empirical
results showed that the feedforward network exhibited the necessary acquisition of
systematicity as defined by this criterion.
The number of training examples required was almost linear (O(N^{1.01})) in the
tuple order. The total time required to load these examples was roughly quintic
(O(N^{5.1})) in the tuple order. The time complexity can be reduced to O(N^3)
if one considers that there are O(N^2) weights that can be updated in parallel.
Figure 3.5: A log-log plot of the mean number of total weight updates (over 5
trials) as a function of tuple order (N). Error bars indicate one standard deviation.
The linear relationship indicates that training time is a polynomial function of
tuple order (i.e., time complexity is polynomial).
By separating resource requirements into sample and time complexity components
one can see that the network did not require many examples to learn the function.
The expense was in learning these examples. Consequently, the greatest reductions
in resource requirements would be found with improvements to learning. One
possibility is the use of an exponential error function (see, for example, Baum
& Wilczek, 1987; Lister, Bakker, & Wiles, 1993), instead of the quadratic error
function used here (i.e., equation 3.2).
The number of generalizations exhibited by the network was very high. For ex-
ample, at N = 10, the network's degree of generalization was estimated to be 95%
of 10^10 test patterns8, from approximately 1500 training patterns. The high degree
of generalization was due to the small number of weights needed to represent the
function. Sample complexity was proportional to the number of weights. Restrict-
ing the number of weights (by restricting the number of hidden units) reduces the
class of functions that can be represented by the network. Less information (in
the form of training examples) is needed to distinguish the target function from
the class of representable functions as there are fewer representable functions. In
the feedforward network, the number of weights was a polynomial function of tuple
order. If the number of weights were to increase as an exponential function of tuple
order then, by Shawe-Taylor and Anthony's result, the upper bound on the number
of training examples would also increase as an exponential function, making
systematicity, as defined in terms of PAC-learnability, unachievable.
Thus, an architectural assumption required in the feedforward network is that the
number of weights grows as a polynomial function of tuple order.
The use of a block encoding scheme was important in establishing the poly-
nomial time complexity result. Bakker et al. (1994) showed that, for a single
auto-associator (i.e., N = 1), training time as a function of the number of possible
components (k) grew rapidly when outputs were encoded locally, yet slowly when
using block encoding. For example, at k = 9 (N = 1), the number of pattern
8 Since the 1000 test patterns were randomly generated from a uniform distribution for each
train-test trial.
presentations9 was approximately 7 × 10^5 for local encoding compared to approx-
imately 4000 for block encoding. For the simulations performed here, it was not
possible to get the network to learn to correctly map all training examples under
a local encoding scheme, and so block encoding of the outputs was used.
Use of block encoding at the outputs changes the organization of internal rep-
resentations and hyperplanes at the hidden layer of the network. Lister (1993)
and Bakker et al. (1994) showed that, after training a network on 1-tuple auto-
associations where the output patterns were encoded locally, the hidden unit vector
representations were arranged in a circle with the hyperplanes (i.e., the decision
boundaries of the output units) separating one point from all other points. Fig-
ure 3.6(a) characterizes this organization for local encoding. Bakker et al. (1994)
showed that in the case of block encoding hyperplanes were oriented so as to bisect
the organization of hidden unit vector representations. Figure 3.6(b) characterizes
this organization for block encoding. The change in internal organization a�ects
the error surface, which determines learning time. Effectively, each hyperplane is
performing a classification. In the case of local encoding, the distribution of class
members is skewed into two classes of 1 and N − 1 points, respectively. In the
case of block encoding, it is relatively balanced, with about N/2 points in each class.
The effect of these two distributions on the error surface was examined and discussed
in Phillips (1993). Essentially, a block encoding changes the error surface so as to
provide more information as to the location, in weight space, of the target function
(Bakker et al., 1994). More information is provided in the sense that more of the
error surface points (in terms of the direction of steepest descent) to the region con-
taining the target function. Thus, the search time is effectively reduced by ignoring
more of the search space.
Given the architectural assumptions of: complete connectivity between layers;
polynomial growth in the number of variably weighted connections; block encoding
of objects at the output layer; and a set of training and testing examples that were
selected from the same uniform distribution, the feedforward network demonstrated
the necessary acquisition of systematicity with respect to the auto-association of
N-tuples task. The criterion for the necessary acquisition of systematicity was that
the time required to acquire the target function to within 95% accuracy with 99%
confidence was no more than a polynomial function of tuple order.

9 One pattern presentation involves a forward propagation of activity followed by a backprop-
agation of error, after which weights are updated.

Figure 3.6: Hyperplane orientation for an 8-2-8 encoder. Characteristic orientation
of hyperplanes (dark solid lines) and hidden unit vector representations (solid
squares) for local output encoding (a) and block output encoding (b) for 1-tuple
auto-association of 8 possible components (i.e., N = 1, k = 8 and h = 2). In both
panels, the axes are the activations of the two hidden units.
3.5 General Discussion
Having presented a case for a Connectionist model that necessarily exhibits sys-
tematicity, the next question to ask is whether such a demonstration addresses Fodor
and McLaughlin's concerns regarding the necessity for systematicity given the as-
sumptions of the architecture. There are three areas in which this demonstration
could be criticized.
- The assumptions made in the architecture were too strong. These assumptions were:
  - polynomial growth in the number of variably weighted connections;
  - complete connectivity between layers;
  - block encoding of representations at the output layer.
- The criterion for the necessary acquisition of systematicity is too weak.
- The demonstration of the feedforward network in meeting this criterion was too narrow.
In this section, each of these points is discussed in turn.
A polynomial growth in the number of connections would appear to be a reasonable
assumption. An architecture requiring more resources (e.g., exponential) soon be-
comes unrealistic with even small-sized functions. It is this criterion that computer
scientists use to separate feasible and infeasible algorithms. However, Halford et al.
(1994) have argued that an assumption of exponential resource is reasonable if one
considers that the order of relations that humans typically can process is limited
to 4 or 5. Furthermore, an architecture that required exponential resources would
explain these processing capacity limitations.
The assumption of complete connectivity is probably unrealistic given that it
is generally known that real neurons are not completely connected. However, the
theoretical result is a statement about the number of weights, not the connectivity
of the network, and so still applies. In the empirical case, the network relied on
complete connectivity between layers since the number of hidden units to represent
every tuple instance was minimal10. This restriction could be relaxed by using more
hidden units. An increase in hidden units would increase the training time, but it
is not expected to be more than polynomial provided the number of hidden units
is not more than polynomial.
The strongest assumption is block encoding. Block encoding imposes, a priori,
an ordering, or similarity, on component representations. For example, the vector
1 1 1 0 0 0 is closer to the vector 0 1 1 1 0 0 than to the vector 0 0 1 1 1 0 under
the Euclidean distance metric. However, in the previous chapter it was mentioned
that systematicity is a property at the level of compositional objects, not at the level
of atomic objects. Thus, there should be no a priori similarity between components.
10 The minimum number of hidden units for 1-tuple auto-association is two (Phatak, 1993).
In the subsequent chapters, the assumption of external representation is weakened
by using local encoding for output representations.
In the debate over the import of systematicity, no quantification of empirical
evidence was given to support or refute Connectionist models. Thus, there are no
absolute levels of generalization and confidence with which to test models. The
importance of the theoretical result is that for any speci�ed level of accuracy and
con�dence, the sample complexity is a polynomial function of the tuple order.
The accuracy and confidence levels used in the empirical studies, although
somewhat arbitrary, were chosen to show that the network could attain a high degree of
generalization with a high degree of confidence. It is possible that people acquire
systematic behaviour, to whatever absolute levels, from fewer than polynomially
many examples. In this case, a tighter criterion for systematicity is required. In the next
chapter, the feedforward network is re-examined on a different criterion based on
linguistic evidence.
A cognitive architecture must not only be able to represent complex objects, but
also make inferences from them. For example, a cognitive architecture must be able
to infer Sue from the sentence-question pair Sue went to the store. Who went to the
store? (i.e., the architecture must be capable of extracting, on request, components
from complex objects). In the next chapter, the feedforward network is re-examined
on a systematicity of inference task, as well as a systematicity of representation
task. The domain of N -tuples was chosen because the number of complex objects
grows exponentially with N . Given the work of Halford et al. (1994) it may be
more appropriate to examine growth in terms of number of possible components
with tuple order �xed (i.e., training time as a function of k), as was done in (Wiles,
Phillips, & Norris, 1993). With a PAC-learnability criterion, however, this would
not be an interesting case as the total number of objects in the domain grows only
as a polynomial of k. Thus, perfect systematicity can be demonstrated by simply
memorizing all objects.
3.6 Summary and conclusion
In this chapter, the necessary acquisition of systematicity has been defined in terms
of PAC-learnability. That is, a network is said to necessarily acquire a systematic
behaviour if, with a high degree of confidence, the time required to learn that
behaviour to a high degree of accuracy is no worse than a polynomial function of
a parameter that measures the size or complexity of that behaviour.
Previous work by Brousse and Smolensky (1989) and Brousse (1991), which
showed an exponential number of generalizations with respect to tuple order (N) in
an auto-association of N -tuples task, failed to meet this criterion as the percentage
of future examples correctly represented decreased as an exponential function of
tuple order. Clearly, the 50 training examples they used for all values of N were
too few. The questions then asked were: "How many training examples would be
required?" and "How much time is needed to load these examples?" in order to meet
the necessary acquisition of systematicity criterion used in this chapter.
The theoretical result showed that at most a polynomial number of examples
was required to attain a high degree of generalization. The empirical results
showed that not only was the number of training examples polynomial, but the
time taken to load these examples was also a polynomial function of N. From
these results it was concluded that, assuming a restricted architecture (polynomial
number of weights), the feedforward network exhibits the necessary acquisition of
systematicity, defined in terms of PAC-learnability, with respect to the
auto-association of N-tuples task.
It was suggested that the PAC-learnability criterion for the necessary acquisition
of systematicity was too weak. It may be the case that people acquire systematic
behaviours from fewer examples. In the next chapter, a second criterion, based
on linguistic evidence, is considered. The feedforward network is then re-examined
with respect to this criterion.
Chapter 4
Systematicity as generalization
across position
4.1 Introduction
In the previous chapter PAC-learnability was used as the criterion for the necessary
acquisition of systematic behaviour against which the feedforward network was
examined. This criterion comes from the computer science notion of a feasible
algorithm and allows one to reject potential cognitive architectures on the basis
that they require too much resource (e.g., time) to acquire even modestly sized
behaviours.
In this chapter, an alternative criterion called strong systematicity (Hadley,
1993) is considered. This criterion, based on linguistic evidence of generalization,
is presented in section 2. Feedforward network and recurrent network architectures
are examined, by simulation and analysis, on two tasks in which it is possible to
exhibit strong systematicity. Those tasks are: the auto-association of 2-tuples, on
which it is possible to exhibit strong systematicity of representations (section 3);
and querying of 2-tuples (i.e., after presenting an ordered pair, query the network
for the first or second component), on which it is possible to exhibit strong sys-
tematicity of inference (section 4). A summary of these results and conclusion are
given in section 5.
4.2 Hadley's strong systematicity
In addition to the work concerned with the generalization capabilities of feed-
forward networks in the previous chapter, Connectionists have provided a num-
ber of other models that have also demonstrated generalization over structured
domains. For example, Elman's (1990) simple recurrent network and Pollack's
(1990) recursive auto-associative memory correctly processed sentences not present
in the training set. Hadley (1993) questioned whether the degrees of generalization
demonstrated by these and other Connectionist models constitute what Fodor and
Pylyshyn intended as systematicity.
Hadley addressed this question by, first, reformulating the definition of system-
aticity based on linguistic evidence and, second, evaluating a number of Connec-
tionist models with respect to this definition.
4.2.1 Strong systematicity defined
Strong systematicity is the above chance level capability of correctly processing
sentences containing words in novel syntactic positions. A model or person is said
to be strongly systematic if they can, for example, correctly process the sentence
Mary loves John having never before seen John in the patient position. In other
words, strong systematicity is generalization across syntactic position. (Note that
it is not sufficient for a model to correctly process a novel sentence on only one
occasion. Say, for example, there are five possible words from which to choose.
A network that simply outputs words at random has a 20% chance of a correct
response. A model's performance must be significantly above this level for it to
have exhibited strong systematicity.)
There is ample linguistic evidence of people, and even young children, exhibiting
at least this degree of systematicity. For example, children are able to coin new
verbs from nouns, as in the example provided by Hadley, It was bandaided, and
subsequently use the new verbs in their passive form. More dramatically, adults
given a sentence containing a nonsensical noun (which they are highly unlikely to
have seen before) can correctly process sentences with the new noun in other novel
syntactic positions.
Strong systematicity is another criterion against which potential cognitive mod-
els can be evaluated. In the previous chapter, PAC-learnability was used as the
criterion for rejecting potential models on the basis that they require more com-
putational resource than could possibly be available. In this chapter, strong sys-
tematicity is used as the criterion for rejecting potential models on the basis that
they require more resource (i.e., training examples) than is necessary for people to
acquire systematic behaviour.
Given Hadley's strong systematicity criterion then, the question of concern in
this chapter is whether Connectionist models, including the feedforward network
architecture of the previous chapter, can exhibit this degree of systematicity.
4.2.2 A review of Connectionist models
The starting point for this discussion is with Hadley's conclusion: After reviewing
a number of Connectionist models, Hadley (1993) concluded that Connectionist
models have not exhibited strong systematicity. His conclusion was based on a
closer examination of the training scheme applied to the models of McClelland and
Kawamoto (1986), Chalmers (1990a), Elman (1989, 1990, 1991), Niklasson and
Sharkey (1992), Pollack (1990), Smolensky (1990), and St. John and McClelland
(1990). These models, which have all demonstrated generalization over structured
domains, could be claimed as refutations of Fodor and Pylyshyn's thesis of compo-
sitionality and systematicity without the need for symbol structures. Yet, in each
one of these models, there has not been a clear demonstration of strong system-
aticity. Essentially, a statistical analysis of the training sets revealed that in all
probability every word appeared in every one of its allowable syntactic positions in
the training set.
Hadley termed the degree of generalization exhibited by these models as being
either weak or quasi-systematicity. Weak systematicity is characterized as gener-
alization to novel sentences in which every word has previously appeared in that
syntactic position in the training set. For example, a weakly systematic model is
one that could only represent and process the sentence John loves Mary having al-
ready been trained on sentences where John appeared in the agent position, Mary
appeared in the patient position, and loves in the action position, although not
necessarily in that combination. Mary, for example, could have appeared in the
sentence Tim loves Mary.
Quasi-systematicity is characterized as generalization to novel sentences con-
taining embedded clauses in which every word in the embedded or surrounding
clause can be found in the same position of some simple sentence in the training
set. For example, a model that could only represent and process the sentence Tim
knew John loves Mary having already been trained on the sentences Tim knew Sue
and John loves Mary demonstrates quasi-systematicity, not strong systematicity,
as the words Tim and knew appearing in the agent and action positions (respec-
tively) in the surrounding clause appeared in the same positions in the simple
sentence Tim knew Sue in the training set. Similarly, the words John, loves and
Mary appearing in the agent, action and patient positions (respectively) relative
to the embedded clause, also appeared in the same positions in the training set.
Hadley found that the models of McClelland and Kawamoto (1986), Chalmers
(1990a), Elman (1989, 1990, 1991), Smolensky (1990) and St. John and McClelland
(1990) have not demonstrated anything more than weak systematicity, and that
Pollack (1990), and Niklasson and Sharkey (1992) have not demonstrated any-
thing more than quasi-systematicity. Furthermore, the demonstrations of quasi-
systematicity make an unrealistic assumption by preprocessing the input into tree
structures.
Subsequent work by Niklasson (1993) claims to have demonstrated strong sys-
tematicity on a transformation of logical expressions task. However, component
representations presupposed a similarity based on their position, or role within a
complex expression. For example, the proposition symbols (p, q, r, s) all shared
a common feature in their vector representation. Furthermore, the so-called novel
component, s, on which the network was tested had as its representation the com-
4.2. HADLEY'S STRONG SYSTEMATICITY 67
mon feature with no other features (vector components) active. Consequently, the
generalization the network demonstrated was due to the common feature on which
the network had already been trained in the components of p, q, r, and so cannot
be considered a demonstration of strong systematicity. See also Hadley (1993) for
a similar comment on McClelland and Kawamoto (1986).
Component representations
The use of similar component representations also occurs in Chalmers (1990a),
where words of a common class (e.g., verbs, nouns) share a common feature in their
vector representations; and, in Niklasson and van Gelder (1994), where compo-
nent representations of objects belonging to a common category (e.g., proposition,
conjunctive) are constructed with vectors representing these categories. However,
this information is, in general, not explicitly available within the external repre-
sentations of component objects1. The information identifying which category a
component belongs to is available implicitly in terms of the word's relative position
in a sentence. These two models assume a similarity that Connectionist networks
are expected to learn. Such an assumption begs the question that the criterion of
strong systematicity poses: Can Connectionist models acquire position-based simi-
larities and consequently demonstrate generalization across position given category
information implicit within the relative positioning of component objects?
The point is that systematicity is a property at the level of compositional ob-
jects, not at the level of external component objects. In language, systematicity is
a property at the level of sentences, not at the level of words (Fodor & Pylyshyn,
1988). That is, the ability to represent and process sentences of a particular struc-
ture relates to other sentences conforming to the same structure.
1 In the sense that there is some feature of the component object that identifies its category.
Information can also come in the form of semantics: knowledge of (past experience with) a word
can identify its category. For example, experience with the word Mary and its associated object
provides one with the information that this word is a noun. However, semantics alone does not
necessarily identify a category. For example, dog is both a noun and a verb (to chase relentlessly).
Furthermore, novel words in isolation have no semantic content.

Orthogonal external representations of components. To appreciate the
significance of component representations, suppose a network is required to learn
to auto-associate 2-tuples. Suppose, also, that this network is composed of two
unconnected subnetworks, where each subnetwork consists of a single input unit
connected to a single hidden unit, which is connected to a single output unit.
Each component is represented, externally to the network, by a single scalar value
between zero and one. For example,
- Ann: 0.1;
- Bill: 0.3;
- John: 0.5;
- Karen: 0.7;
- Vivian: 0.9.
Auto-associating the tuple (John, Bill) requires performing the mapping
(0.5, 0.3) → (0.5, 0.3). Since there is a linear dependency built into the in-
put/output representations, having only seen two of the atomic items, the network
would generalize to the others, therefore demonstrating generalization across posi-
tion. Yet, the two subnetworks are independent of each other; since they are not
connected, no information can be conveyed between them. Thus, strong systematicity
in this case is a consequence of similarity at the component level.
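A minimal sketch of this point (the training loop and learning rate are illustrative):
a single linear weight trained on only two of the scalar codes recovers the identity
map, and so handles the unseen codes for free.

    # One scalar subnetwork: a single linear weight w trained to auto-associate.
    w = 0.0
    for _ in range(200):                 # plain gradient descent on squared error
        for x in (0.5, 0.3):             # only John (0.5) and Bill (0.3) are ever seen
            w += 0.5 * (x - w * x) * x   # delta-rule update, learning rate 0.5
    print(round(w * 0.9, 3))             # ~0.9: the unseen code for Vivian maps correctly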
Of course, there may be a priori similarities between component objects. For
example, in the set of possible phonemic components of English words, "dee" is
closer to "tee" than to the "a" sound in "cat". It is an empirical question as to the
degree of similarity there is between component objects and its impact on exhibiting
systematicity. However, given that systematicity is such a pervasive property across
languages and modalities, it is assumed that a priori similarity at the component
level does not play a significant role in systematicity, which is a property at the
level of complex objects. Consequently, when examining Connectionist models for
systematicity there must be no a priori similarity between the external represen-
tations of component objects. In subsequent simulations and analyses, orthogonal
(actually local2) representations are used.
4.2.3 Summary of review
In all cases, with the exception of Niklasson (1993) and Niklasson and van Gelder
(1994), Connectionist modelers did not set out to demonstrate strong systematicity
per se. Their main concern was to demonstrate some degree of generalization over
a structured domain. With that point in mind it is perhaps not surprising that
none of the models demonstrated strong systematicity.
However, a counterpoint to this argument is that it seems a remarkable coinci-
dence that, given that generalization is what all modelers seek, none of the models
has shown a greater degree of generalization. This coincidence raises the question
of whether there is not some deeper, more fundamental reason why Connectionist
models have not demonstrated strong systematicity. The next two sections are di-
rected at determining architectural properties that permit/prevent Connectionist
models from exhibiting strong systematicity.
4.3 Strong systematicity of representation
The purpose of this section is to elucidate the architectural properties of particular
neural network models that permit or prevent the model from displaying strong
systematicity of representation. Before analyzing particular models, there first must
be a task on which it is possible to exhibit strong systematicity.
4.3.1 Task: Auto-association of 2-tuples
In the previous chapter, auto-association of N -tuples was the task used to evaluate
the systematicity of representation of feedforward networks. In language, simple
sentences representing binary relations of the form Noun transitive-verb noun are
instances of 3-tuples. For example, the simple sentence Mary loves John is an in-
stance of the 3-tuple (Mary,loves,John). In the case of binary relations, a model
2 One vector component on, the rest off.
exhibits strong systematicity when, for example, having only ever been given ex-
amples of tuples with Mary in the first position, the model generalizes to examples
with Mary in the third position. With respect to exhibiting strong systematicity
over any one binary relator (e.g., loves), the second position component is com-
mon to both the trained and generalized examples, and so carries no information
regarding generalization. Therefore, assuming that the common component does
not incur a signi�cant learning overhead, the domain can be simpli�ed to a set of
2-tuples, where the agent component is in the first position and the patient component
is in the second position.
The advantage of making this assumption is that it aids in analysis of Connec-
tionist models. Furthermore, a model that cannot demonstrate strong systematicity
on this task will not demonstrate strong systematicity when the common compo-
nent is included since, as already mentioned, the common component carries no
additional information.
As in the previous chapter, systematicity of representation can be cap-
tured by a network's ability to auto-associate the input. In the domain
of 2-tuples, ordered pairs are constructed from a set of five atomic objects
{Ann, Bill, John, Karen, Vivian}. A network or model is considered to have
demonstrated strong systematicity of representation when it can correctly auto-
associate (above chance level on repeated trials), for example, (Mary, John), having
only ever seen Mary in the patient (second) position.
4.3.2 Feedforward network
The feedforward network used to examine generalization over a structured domain
in (Brousse & Smolensky, 1989; Brousse, 1991) and in the previous chapter had
at the input and output layers one unit per component per position. For the
task described in this section, this means that there are 10 input units completely
connected to a number of hidden units that are, in turn, completely connected to
10 output units (Figure 4.1).
Figure 4.1: The feedforward network architecture for auto-association of 2-tuples.
Dashed lines identify individual units within a group of units. The left and right
shaded boxes indicate the nodes that extract, from internal representations, the
Mary component in the agent and patient positions, respectively.

Furthermore, to demonstrate strong systematicity there must be at least one
component that does not appear in one of the positions in the training set. Suppose,
without loss of generality, that the component Mary appears only in the agent
position in the training set. The network is considered to have demonstrated
strong systematicity if, on testing, it correctly auto-associates tuples with Mary
in the patient position.
Independent weights
With this architecture, each output unit is dedicated to detecting a particular com-
ponent in a particular position from the network's hidden unit (internal) represen-
tation of the complex object. For example, one output unit detects the presence
or absence of Mary in the agent position. The correct performance of this task
depends on orientating the Mary-agent hyperplane so as to position all points in
the training set into two classes: Mary in the agent position; and, Mary not in the
agent position (Figure 4.2). If there are enough training points then the network
will also correctly classify test cases.
However, for the output unit that detects Mary in the patient position there is
only one class of points in the training set (i.e.,Mary not in the patient position), by
the requirement of demonstrating strong systematicity (Figure 4.3).

Figure 4.2: Orientation of the Mary-agent hyperplane in hidden unit activation
(internal representation) space. The empty circles are points (or vectors) repre-
senting structured objects that have Mary in the agent role as one of their compo-
nents. The solid squares are points representing structured objects that do not.

With only one class of points, the training set provides no information regarding the orientation
of the hyperplane to correctly classify points with Mary in the patient position,
and so the network cannot be expected to generalize from components appearing
in the agent position to components appearing in the patient position. That is, the
network cannot demonstrate strong systematicity.
In any one trial, it is possible that in the event of correctly classifying examples
from the training set, the Mary-patient hyperplane is fortuitously positioned so
that the network also correctly classifies the test examples where Mary appears in
the patient position. However, considering the information required to correctly
position this hyperplane, this event is highly unlikely. Since the training set con-
tains only one class of point with respect to the hyperplane, there are two choices
per dimension of the representation space for correct classification of all points
in the training set (i.e., position the hyperplane along a dimension so that all
points lie to its left or to its right). In an N-dimensional space, an estimate of the
likelihood of positioning the hyperplane so as to also correctly classify the test
examples is 1/2^N. For the positioning of hyperplanes in
the hidden units space, N is the number of hidden units. Thus, the likelihood
of generalization across position in any one trial decreases exponentially with N,
asymptotically approaching zero. Since the likelihood of generalization across
position is below chance level, this feedforward network architecture cannot exhibit
strong systematicity.
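The point can be illustrated with a one-class training run (the dimensions, data,
and learning rate here are arbitrary): an output unit whose training targets are
always "absent" simply learns to answer "absent" everywhere, including on a
genuinely novel Mary-patient input.

    import numpy as np

    rng = np.random.default_rng(1)
    H = rng.normal(size=(100, 20))   # hidden representations of training tuples
    w, b = np.zeros(20), 0.0         # the Mary-patient output unit

    for _ in range(500):             # delta-rule training; every training target is 0
        y = 1 / (1 + np.exp(-(H @ w + b)))
        w -= 0.1 * H.T @ y / len(H)  # cross-entropy gradient with all-zero targets
        b -= 0.1 * y.mean()

    test = rng.normal(size=20)       # a hypothetical novel Mary-patient representation
    print(1 / (1 + np.exp(-(test @ w + b))))  # near 0: the unit never detects Mary-patient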
A similar argument can be made for the input-to-hidden weights. Again, with
respect to the input space, the hyperplanes associated with the hidden units are
trained on only one type of point along the Mary-patient input dimension (i.e.,
Mary not in the patient position). Therefore, along this dimension (which is the
only dimension relevant to the encoding of the Mary-patient component) there is
no information provided in the training set, and therefore the network cannot be
expected to demonstrate strong systematicity.
Essentially, the problem with the feedforward network is that there is a sepa-
rate set of weights that implement the functions that encode/decode the agent and
patient components (Figure 4.4(a)).

Figure 4.3: Orientation of the Mary-patient hyperplane in hidden unit activation
(internal representation) space. The solid squares are points representing struc-
tured objects that do not have Mary in the patient role as one of their components;
there are no points in this space representing structured objects with Mary in the
patient position.

Furthermore, the domains of the two input-to-
hidden layer functions and the co-domains of the two hidden-to-output layer func-
tions are independent. Consequently, learning these functions necessitates training
on every input-output pair from their respective domains and co-domains. The
reason these component functions must be trained on every point is that, a priori,
there is no similarity between the external representations of components.
This result also applies to Smolensky's (1987a) recirculation algorithm, which is
just a special case of the backpropagation algorithm. The recirculation algorithm
applies to a feedforward network where the input to hidden and hidden to output
weights are identical. In this case, the three-layer network can be considered as
a two-layer network (one input/output layer and one hidden layer). To modify
weights, activity at the input/output layer is propagated to the hidden layer (re-
sulting in a vector ~p0), then recirculated back to the input/output layer and then
back up to the hidden layer (resulting in a vector ~p1). The weights can be updated
as a function of the difference vector ~p0 − ~p1 using the delta rule3 (Rumelhart et al.,
1986). Again, as the algorithm is only trained on one type of point (with respect
to Mary), the training set does not contain any information regarding the correct
weight values for generalization across position.
3 The delta rule is backpropagation for two layers of units (i.e., one set of connecting weights).
Backpropagation is applicable to arbitrarily many layers of units, and hence is also called the
generalized delta rule.
Weight tying
One method of introducing generalization across position is with the use of weight
tying. Weight tying links two or more network weights so that the update of one
weighted connection automatically updates the weights of the linked connections.
Weight tying has been used successfully by LeCun et al. (1989) as a way of intro-
ducing translation invariance for optical character recognition.
Similarly, weight tying could be introduced in the auto-association task to
demonstrate strong systematicity by linking the weights that implement the first
position mapping (i.e., from input to hidden and from hidden to output) to the
weights that implement the second position mapping. Effectively, weight tying
makes available multiple copies of the functions that map component representa-
tions (Figure 4.4(b)).

Figure 4.4: Standard feedforward network architecture for the auto-association task
without weight tying (a) and with weight tying (b). With weight tying, the previ-
ously independent functions (f1 and f2, g1 and g2) are now linked by the same set of
weights (i.e., f1 = f2 = f, g1 = g2 = g). A recurrent network architecture (c) incor-
porates partial weight tying by making the input-to-hidden and hidden-to-output
layer weights for each component the same weights. In each case, arrows indicate
complete connectivity from units in the source layer to units in the destination
layer.
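The tying scheme of Figure 4.4(b) can be sketched as follows (the shapes and names
are illustrative). Because both positions share F and G, an update driven by a
component in either position changes the mapping applied to both, which is the
intended source of cross-position generalization:

    import numpy as np

    k, h = 5, 3
    rng = np.random.default_rng(0)
    F = rng.normal(size=(h, k))  # shared input-to-hidden map f (tied across positions)
    G = rng.normal(size=(k, h))  # shared hidden-to-output map g (tied across positions)

    def forward(x1, x2):
        # Each component passes through the SAME weights, one hidden code per position.
        # With a single shared hidden vector, one g could not return both x1 and x2,
        # which is the extraction problem discussed below.
        h1, h2 = np.tanh(F @ x1), np.tanh(F @ x2)
        return G @ h1, G @ h2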
The problem with weight tying is that it presupposes knowledge of the very
weights that are to be learnt. Furthermore, with complete weight tying it is not
possible to extract the two component representations from the internal hidden
unit representation because the same hidden to output unit function g is required
to perform two di�erent mappings for the same hidden unit vector. There does
not exist a set of weights to implement the function g, such that, g(~h) = ~x1 and
g(~h) = ~x2, where ~h is the internal hidden unit representation, of the 2-tuple (x1; x2);
and ~x1 and ~x2 are the output representations of the �rst and second components,
respectively.
Recurrency
Another method of incorporating a dependency between component mapping func-
tions is by making use of the same set of input (output) units for presenting (ex-
tracting) component representations. In this way, dependency between the com-
ponent mapping functions is introduced as the functions are implemented over the
same set of weights.
Of course, presenting all components over the same set of units means that
representations constructed at the hidden layer will be lost on presentation of sub-
sequent components, thereby preventing the network from constructing complex
representations. However, previously constructed representations can be main-
tained by providing feedback connections that retain representations for future
processing (e.g., Figure 4.4(c)). Architectures with feedback connections are called
recurrent networks. In the next subsection, a recurrent network is investigated
for its ability to demonstrate strong systematicity of representation on the auto-
association task.
4.3.3 Simple recurrent network
In the previous subsection, it was shown that the standard feedforward network
(i.e., without weight tying) could not demonstrate strong systematicity of repre-
sentation. Essentially, the lack of strong systematicity was a consequence of the
independence of the weights that implement the functions which map components
to and from internal representations.
This result suggested a more structured network where components are pre-
sented at the same set of input units and consequently, mapped internally by the
same set of weights. A recurrent network architecture has such a structure. In
general, a recurrent network has a set of input units, a set of hidden units, a set of
context units (which hold the hidden unit activations from the previous time step)
and a set of output units.
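One time step of such a network can be sketched as follows (the unit counts follow
Figure 4.5, and the function name is ours):

    import numpy as np

    def srn_step(x, context, W_in, W_ctx, W_out, b_h, b_o):
        # New hidden state from the current input and the previous hidden state.
        h = 1 / (1 + np.exp(-(W_in @ x + W_ctx @ context + b_h)))
        y = 1 / (1 + np.exp(-(W_out @ h + b_o)))
        return y, h  # h is copied back to the context units for the next time step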
There are a number of ways of connecting each set of units. The architecture
used in this subsection is Elman's (1989) simple recurrent network (Figure 4.5).
The performance of the simple recurrent network on a temporal version of the auto-
association of the 2-tuple task is tested and analyzed for its ability to demonstrate
strong systematicity.
Simulation
The task of the recurrent network is essentially the same as for the feedforward
network described in the previous subsection except that instead of presenting both
components simultaneously, components are presented one per time step. After
presenting both components, at which time the network should have constructed
some internal representation of the ordered pair, the network is required to output
the ordered pair in the order that it was presented. Figure 4.5 shows an example
of the network and the input-output mapping it is required to perform.
Each trial of the network consisted of:
- Random generation of training examples from a distribution where all five
  atomic objects can appear in the first position, but only four atomic objects
  can appear in the second position. The training set was large so that in all
  likelihood all 20 combinations occurred. Weights were updated at the end of
  each sequence (i.e., every four patterns). Context units were reset to zero at
  the beginning of each sequence.
- Random initialization of network weights from a uniform distribution between
  -1 and 1.
- Training of the network using the standard error backpropagation algorithm
  with a 0.1 learning rate, no momentum term, and the sum of squares error
  function (Rumelhart et al., 1986) until performance on the training set was
  within 0.5 of the target for all output units for all patterns in the training
  set.
- Testing on all remaining combinations (i.e., every atomic object in the first
  position combined with the atomic object left out of the second position in
  the training set).
� Recording of network response to all output patterns in the test set. Two
testing criteria were used. A network response was considered correct when:
(1) the maximally activated output unit had a target activation of one -
Figure 4.5: The simple recurrent network and the temporal version of the auto-association of 2-tuples task. Parenthesized values indicate the number of units used in simulations (Input(5), Hidden(20), Context(20), Output(5)). Numbers indicate specific time periods. Dashes (-) indicate zero input. Starred (*) output indicates the auto-association of the current input to assist in the formation of internal representations. However, for the purpose of evaluating strong systematicity, only output at time steps 3 and 4 (i.e., auto-association of the ordered pair) was considered.
In addition, the number of training examples (ordered pairs) was varied from
10 to 200. Since each network was initialized from a random set of weights, each
train and test trial was repeated five times for each training set size.
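As a rough sketch of the trial procedure just described (illustrative Python written for this edition, not the original simulation code; the object names and helper functions are assumptions), the training distribution and the two correctness criteria can be written as:

    import numpy as np

    rng = np.random.default_rng()

    OBJECTS = ["John", "Mary", "Bill", "Ann", "Vivian"]  # illustrative names
    HELD_OUT = "Vivian"       # object withheld from the second position

    def one_hot(obj):
        v = np.zeros(len(OBJECTS))
        v[OBJECTS.index(obj)] = 1.0
        return v

    def sample_training_pair():
        # All five objects may appear first; only four may appear second.
        first = rng.choice(OBJECTS)
        second = rng.choice([o for o in OBJECTS if o != HELD_OUT])
        return one_hot(first), one_hot(second)

    # Test set: every object first, combined with the held-out object second.
    test_pairs = [(one_hot(o), one_hot(HELD_OUT)) for o in OBJECTS]

    def correct_maximum(output, target):
        # Maximum criterion: the most active output unit has target one.
        return target[np.argmax(output)] == 1.0

    def correct_half(output, target):
        # 0.5 criterion: every output unit is within 0.5 of its target.
        return bool(np.all(np.abs(output - target) < 0.5))

Weights would then be initialized uniformly in [−1, 1] and trained by backpropagation (learning rate 0.1, no momentum) until the 0.5 training criterion is met, as itemized above.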
Result
The recurrent network showed perfect performance (for both maximum and 0.5
testing criteria) on the unseen object in the second position in the test set when
trained on 200 ordered pairs. When trained on 50 ordered pairs, performance
dropped to a mean of 76% (maximum criterion) and 36% (0.5 criterion) over 5
trials (Figure 4.6).
Discussion
The purpose of this simulation was the demonstration of strong systematicity with
respect to the representation of 2-tuples.
Generalization across position occurred in all trials when only one of the items
from the second position was left out of the training set and the network was
trained on all other combinations.⁴
There is, however, a question over the robustness of the demonstration of strong
systematicity with this recurrent network. When there were 50 pairs in the training
set the network was less likely to have seen all combinations, yet likely to have seen
all five objects in the first position and all four objects in the second position.
However, generalization to the remaining fifth object in the second position dropped
dramatically to 76% (maximum criterion) and 36% (0.5 criterion) averaged over five
trials. When only 10 pairs were present in the training set, generalization dropped
to below chance level performance.
⁴ Since there are only 20 remaining pairs (i.e., 5 × 4), it is likely that every pair appeared somewhere in a training set of 200.
Figure 4.6: Generalization to second position as a function of the number of training patterns. The number of correct responses on the test set using the maximum (solid line) and 0.5 (dashed line) test response criteria was averaged over 5 trials. Error bars indicate 95% confidence levels. The horizontal dotted line indicates chance level performance. The number of training patterns is plotted on a log scale.
The problem that systematicity poses for Connectionist models is not only the
demonstration of such properties as generalization across position, but an explanation
as to how such properties necessarily occur as a consequence of the architecture.
In other words, regardless of the initial conditions (e.g., initial weight
values, specific training sets), how is it that the network consistently demonstrates
generalization across position? People, despite their varying biological and environmental
backgrounds, demonstrate generalization across position in language. It is
this consistency that Fodor and McLaughlin (1990) are referring to when they talk
of systematicity being a law of cognitive architecture. Therefore it is important
to understand how generalization across position was demonstrated by the simple
recurrent network in the case where it occurred in all five trials. In the next subsection,
the internal representational structure of the simple recurrent network is
analyzed to determine how generalization across position was consistently demonstrated
in this case.
Analysis of internal representations
Due to the high dimensionality of a network's internal representational space (whose
dimension is the number of hidden units) it is, in general, extremely difficult to determine the
nature of the internal representations constructed by a network to solve a particular
task. However, there are two analysis techniques that can be used to aid an
understanding of internal representations. They are: principal components analysis
(PCA), first used by Elman (1989) in his analysis of the internal representations
constructed by the simple recurrent network on his word prediction task; and
canonical discriminants analysis (CDA), first used by Wiles and Bloesch (1992) in
their temporal logic task.
A principal components analysis identifies the dimensions of greatest variance
of points in some space. The intuition behind its application to networks is that to
reduce error the network must learn to separate points corresponding to different
target outputs in its hidden unit activation space. Thus, dimensions along which
points vary contain important information about the problem, and will be identified
by a principal components analysis of points in the hidden unit activation space.⁵
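As an illustration of this use of PCA (a sketch in Python/NumPy written for this edition, not the analysis package cited below; the array names are assumptions), the principal components of saved hidden unit activations can be obtained from the eigenvectors of their covariance matrix:

    import numpy as np

    def principal_components(hidden_points):
        """Project points in hidden unit activation space onto their
        directions of greatest variance (the principal components)."""
        centered = hidden_points - hidden_points.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]        # largest variance first
        return centered @ eigvecs[:, order]      # scores on PC1, PC2, ...

Plotting the first two columns of the returned scores, labeled by input object, gives plots of the kind shown later in Figures 4.7 and 4.8.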
Canonical discriminants analysis identifies dimensions in some space that maximize
the ratio of the between-groups variance to the within-groups variance of
points in that space. In other words, it finds directions in the space along which
points belonging to the same group are tightly clustered, yet separated from points
belonging to other groups. This technique is particularly useful in combinatorial
domains, where one can hypothesize network representations in which points are
aligned by groups implicit in the training data, but not made explicit in any error
signal⁶ (Wiles & Ollila, 1993).
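A corresponding sketch for CDA (again illustrative Python, assuming NumPy and SciPy rather than the package used in the thesis) solves the generalized eigenproblem for the between-groups and within-groups scatter matrices:

    import numpy as np
    from scipy.linalg import eigh

    def canonical_discriminants(points, labels):
        """Directions maximizing between-groups variance relative to
        within-groups variance."""
        mean = points.mean(axis=0)
        d = points.shape[1]
        S_b = np.zeros((d, d))  # between-groups scatter
        S_w = np.zeros((d, d))  # within-groups scatter
        for g in np.unique(labels):
            grp = points[labels == g]
            diff = (grp.mean(axis=0) - mean)[:, None]
            S_b += len(grp) * (diff @ diff.T)
            S_w += (grp - grp.mean(axis=0)).T @ (grp - grp.mean(axis=0))
        # Solve S_b v = lambda S_w v; a small ridge keeps S_w invertible.
        eigvals, eigvecs = eigh(S_b, S_w + 1e-9 * np.eye(d))
        order = np.argsort(eigvals)[::-1]
        return points @ eigvecs[:, order]

Grouping the second-time-step activations by the first input object, for example, yields the kind of projection shown later in Figure 4.9.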
The internal representations learnt by the simple recurrent network were analyzed
in the case where the network demonstrated generalization across position in
all five trials (i.e., when the training set consisted of 200 training examples). After
the network was trained to the 0.5 training criterion, all 200 training examples were
presented to the network and the resulting 800 hidden unit activation patterns were
saved. Principal components and canonical discriminants analyses were performed
on these points using the package developed by Dennis and Phillips (1991).
Input to internal subspace   The question to be answered is: how are the internal
representations organized so as to demonstrate strong systematicity? The first
step towards answering this question was to perform a PCA of the hidden unit
activations at the first and second time steps (Figure 4.7). The diagram shows
the first and second principal components of these points in hidden unit activation
space (i.e., the first and second directions of greatest variance, and presumably,
greatest information about representational organization). The points are labeled
on the basis of the current input, which for the first two time steps is also the
target output. The analysis shows that all points can be grouped into five clusters⁷
depending on the output requirement of the network at each point.
⁵ Thanks to Simon Dennis for this explanation.
⁶ In a combinatorial domain, two groups of points could have the same target output, which may make it difficult to find using principal components analysis, since separation of internal representations is driven, in part, by the error signal. (Of course, separation may also be driven by differences in input representations.)
⁷ The cluster labeled "Vivian" only contains one type of point (i.e., Vivian at the first time step). Since the network was tested for strong systematicity on this object, Vivian did not appear in the second position in the training set.
For example, the input pair (Bill, Ann) resulted in the point labeled "1Bill"⁸ at the
first time step, and the point labeled "Bill:Ann" at the second time step.
The activation space is organized in this manner due to the target outputs,
which at these two time steps were the current inputs. Consequently, error is
reduced when points that map to different outputs are separated, and a PCA has
identified this separation.
The PCA suggests that the major information contained in the representation
is the current object. However, the internal representation at the second time step
must still maintain information about the first item, otherwise the network could
not possibly recover the first item at time step three. The question is: how is this
information maintained?
The PCA of points at the first and second time steps shows some organization
within the five clusters. To identify this organization, a PCA of points at the
second time step only was performed and these points were plotted onto the first
and second principal components (Figure 4.8). The plot shows two levels of spatial
organization. At one level, points are grouped into four clusters based on their
output mapping⁹, just as in the previous PCA. At another level, there appears to
be the same within-cluster organization of points (labeled by their first object).
This consistency is suggested by the same rotational ordering of points within each
cluster.
Position independent internal subspaces   The consistency of within-cluster
organization suggests that first object information is being encoded along some
other dimensions, independent (by virtue of the apparent regularity) of second
object information. If this suggestion is correct then there may exist directions in
hidden unit activation space along which all four clusters can be superimposed.
⁸ Since the context units were reset to zero at the beginning of each sequence, the hidden unit activations at the first time step were not affected by the previous sequence. Consequently, all sequences with the same first component resulted in the same hidden unit activation vector at the first time step.
⁹ Again, there are only four clusters since one object (Vivian) was left out of the second position in order to test strong systematicity.
Figure 4.7: Principal components analysis of points in hidden unit activation space generated from training sequences at the first and second time steps. Points are labeled with either the first and second input objects (e.g., Ann:Bill indicates the point resulting from receiving Ann as the first object and Bill as the second object), or with a single word prefixed by 1, indicating a point resulting from only receiving the first object. Dashed lines group points with the same output mapping.

Figure 4.8: Principal components analysis of points in hidden unit activation space generated from training sequences at the second time step. Points are labeled with the first input object. Points that are a result of the same second object are linked by a solid line. Each link is labeled with the object that was presented at the second time step (e.g., points in the top right-hand section of the graph are a result of Ann as the second input object).
Or, in other words, there may be a projection of points in hidden unit activation
space such that, for example, the point representing the pair (Karen, John) and
the point representing the pair (Karen, Ann) become the same.
If such a direction exists, it can be found using CDA by grouping all points on
the basis of the first object. CDA will identify dimensions in hidden unit activation
space such that points in the same group (e.g., (Karen, Ann), (Karen, Bill), (Karen,
Karen), and (Karen, John)) are close together, while points in different groups are
far apart.
A CDA was performed on all hidden unit activations at the second time step,
grouped by the object presented at the first time step. The points were then plotted
onto the first and second canonical discriminants (Figure 4.9). The plot shows that
the four clusters can be superimposed, and supports the suggestion that first object
information was encoded independent of second object information. In each case,
all points belonging to the same cluster differed in location by at most 1 × 10⁻⁶.
Another way of conceptualizing the representational organization is to say that the
network has buffered the input onto independent dimensions.
A CDA was performed on all hidden unit activations constructed at the first
and second time steps. The points were grouped on the basis of the input component
(which was also the target output component) presented at that time step. For
example, the hidden unit activations resulting from the object Karen having been
presented at the first time step or the second time step are identified, for the
purposes of CDA, as belonging to the same group. The points were then plotted
onto the first and second canonical discriminants (Figure 4.11). The plot shows
that the current input object was encoded on the same dimension independent of
its position, thus suggesting an organization of internal representations of ordered
pairs as characterized in Figure 4.10.
Internal subspace to output   A PCA of hidden unit activations at time step
three was performed and the points were plotted onto the first and second principal
components (Figure 4.12). The plot shows two levels of organization similar to time
step two.
Figure 4.9: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the second time step, grouped on the basis of the first input object. Points are labeled with the first input object. The fact that only five labels appear means that all points with the same label have been projected onto the same location.

Figure 4.10: Construction of ordered pair representations by buffering input through the same set of dimensions. Dashed lines indicate weights and hidden units at time step one, whereas solid lines indicate the same sets of weights and hidden units at time step two.
A CDA of these points grouped on the target output at time step three was
performed and the points plotted onto the first and second canonical discriminants
(Figure 4.13). The plot shows the first object has been maintained on two
dimensions independent of the second object.
A CDA of these points grouped on the second input object was also performed.
The points were plotted onto the first and second canonical discriminants
(Figure 4.14); the plot shows that second object information has been encoded on
two further dimensions independent of the first object.
A CDA of points at the third and fourth time steps grouped on the current
target output object was performed. The points were plotted onto the first and
second canonical discriminants (Figure 4.15). The plot suggests that the current
object was extracted along the same dimensions, independent of its position.
First-in first-out buffer   This series of analyses suggests that the organizational
structure of the simple recurrent network's internal representations is analogous
to a first-in-first-out buffer (Figure 4.16). By encoding and decoding component
representations via the same input and output sets of weights (respectively), the
network only needs to see each unique object in one of the positions, but not
necessarily both, as the simulations demonstrated.
Figure 4.11: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the first and second time steps, grouped on the basis of the current input/output object. Points are labeled with the input/output object at that time step.

Figure 4.12: Principal components analysis of points in hidden unit activation space generated from training sequences at the third time step. Points are labeled with the first and second input objects (e.g., Ann:Bill indicates the point resulting from receiving Ann as the first object and Bill as the second object). The solid lines link points that are a result of the same first input object.

Figure 4.13: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step, grouped on the basis of the first input object. Points are labeled with the first input object. The fact that only five labels appear means that all points with the same label have been projected onto the same location.

Figure 4.14: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third time step, grouped on the basis of the second input object. Points are labeled with the second input object. The fact that only four labels appear means that all points with the same label have been projected onto the same location.

Figure 4.15: Canonical discriminants analysis of points in hidden unit activation space generated from training sequences at the third and fourth time steps, grouped on the basis of the current target output object. Points are labeled with the current target output object.
When the first component is
presented to the network it is mapped to some subspace of the hidden unit activation
space. On the next time step, the network maps the internal representation
of the first component to a second independent subspace (set of independent dimensions)
to make the first subspace available for the second component. At the
third time step, the first component is mapped to a third subspace, at which point
it is extracted at the output layer, and the second component is mapped to the
second subspace. At the fourth time step, the second component is mapped to
the third subspace, at which point it is extracted at the output layer. (Note that
these subspaces were not localized to specific groups of hidden units. Furthermore,
the internal representations were not encoded locally. Figure 4.16 characterizes the
structure of hidden unit activation space as being composed of three subspaces. It
does not characterize any relationship between subspaces and hidden units.)
The analysis, however, only suggests a possible internal organization for the
simple recurrent network, since the two techniques do not consider the hidden-to-output
weights.¹⁰ Nevertheless, PCA and CDA have identified the existence of
dimensions in the hidden unit activation space along which component objects were
encoded independent of their position. Therefore, learning to extract components
along these dimensions will generalize to the other position. Thus, it is not necessary,
as the simulation results have shown, for the network to be trained on all
components in both positions. It was only necessary that each component appeared
in one of the two positions.
Importantly, though, the simple recurrent network only demonstrated perfect
generalization in all trials when all other ordered pairs appeared in the training set.
When there were fewer ordered pairs in the training set, there were solutions other
than the buffer solution which satisfied the requirements of the training set, but
which did not demonstrate generalization across position. In these cases, the simple
recurrent network required more information than was available in the training set
to demonstrate perfect generalization in all trials.
¹⁰ It is, in general, difficult to visualize the relationship between hyperplanes and internal representations for high dimensional spaces.
Figure 4.16: Idealization of the buffer solution by the recurrent network in demonstrating strong systematicity of representation. Shaded regions and solid arrows indicate modifiable weights. Dashed arrows indicate fixed weights. Starred (*) outputs at time steps 1 and 2 indicate auto-association of the current input to assist in the formation of internal representations. Dashed (-) inputs at time steps 3 and 4 indicate zero input. Parenthesized numerals indicate subspaces of the hidden unit activation space.
4.3.4 Summary of strong systematicity of representation
With respect to systematicity of representation it was shown that the standard
feedforward network could not demonstrate strong systematicity. The property of
the feedforward network that prevents it from exhibiting strong systematicity of
representation is the independence of weights that implement component mappings.
This independence means that there are too many positions in weight space that
implement the function that is specified by the training set. Consequently, there
is no information in the training set to determine, with confidence greater than
chance level, the function that will exhibit generalization across position on future
examples.
This result suggested an architectural bias whereby there is a dependency between
the weights that implement component mappings. The simple recurrent
network incorporates such a bias. Since component representations are presented
to the network at the same set of input units, they are mapped, in part, by the
same set of weights. Simulations showed that the simple recurrent network could
demonstrate strong systematicity of representation with respect to the 2-tuple task.
The organization of internal representations learnt by the network was characteristic
of a first-in-first-out buffer, where components were mapped from the input
to a common subspace (input phase), and from a common subspace to the output
(output phase). Because these mappings were implemented by the same set of
weights it was not necessary for the network to be trained on every component in
both positions. It was only necessary that components appear in at least one of the
positions. Consequently, the network was able to demonstrate strong systematicity
of representation on this task.
4.4 Strong systematicity of inference
Systematicity of inference is the ability to extract components from structurally related
objects. For example, one does not find people who are able to infer that John
went to the store from the statement John and Mary went to the store, but cannot
infer that Mary went to the store from the statement Mary and John went to the
store. As in the previous section, Hadley's strong systematicity is used as the cri-
terion for determining whether a Connectionist model demonstrates systematicity
of inference.
4.4.1 Task: Querying 2-tuples
In this systematicity of inference task, a network is presented with an ordered pair
and a query/question, which requests the first or second component of the ordered
pair. For example, given the pair (John, Bill) and the question that requests the
first component, the network should respond with John. As in the auto-association
of 2-tuples task, there are five possible atomic objects that may appear in either
the first or the second position. In addition, there are two query objects that
request either the first or second component of the ordered pair.
A network is said to have demonstrated strong systematicity of inference with
respect to this task if on the test set it can correctly infer components in component-
position combinations that did not occur in the training set. In the next subsection,
the simple recurrent network is examined for its capacity to exhibit strong system-
aticity of inference with respect to this task.
4.4.2 Simple recurrent network
The simple recurrent network, which demonstrated strong systematicity of repre-
sentation, is examined here on this systematicity of inference task.
Simulation
The simulation conditions for the querying of 2-tuples task are as follows:
• Local encoding of input and output vectors. As there are seven possible input
vectors in this task (i.e., 5 possible components plus 2 possible queries), there
are 7 input units to encode the input vectors, and 5 output units to encode
the 5 possible components.
• Generation of 40 training sequences (each sequence consisting of three patterns),
consisting of all 5 components in the first position combined with 4
possible components in the second position combined with the two possible
queries. The fifth component was used to test strong systematicity. Thus, the
test set consisted of 10 sequences (i.e., all 5 components in the first position
combined with the fifth component in the second position combined with the
two queries). A sketch of this construction appears after this list.
• Random initialization of weights from a uniform distribution in the range −1 to 1.
• Training of the network using the standard error backpropagation algorithm
with a 0.1 learning rate, no momentum term, and the sum of squares error
function (Rumelhart et al., 1986) until performance on the training set reached
criterion. Two training criteria were used: (1) all output units within 0.5,
and (2) all output units within 0.4, of the target output for every unit on
every training pattern. If the network did not reach criterion by the 10000th
epoch (where one epoch is the presentation of every training pattern) then
training was terminated. Weights were updated at the end of each sequence
(i.e., every three patterns). Context units were reset to zero at the beginning
of each sequence.
• Testing on all remaining patterns using two criteria for correctness: (1) the
maximally activated output unit corresponds to the target activation of 1
(the maximum criterion); (2) all output units are within 0.5 of their target
activation (the 0.5 criterion).
• Each train-test trial was repeated 10 times for each training criterion, with
weights being randomly initialized at the beginning of each trial.
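The construction of the training and test sets just itemized can be sketched as follows (illustrative Python written for this edition; the object and query names are assumptions, and the encoding follows the local scheme above):

    import numpy as np

    OBJECTS = ["John", "Mary", "Bill", "Ann", "Vivian"]  # illustrative names
    QUERIES = ["first", "second"]
    HELD_OUT = "Vivian"   # never appears in the second position in training

    def encode(symbol):
        """Local (one-of-seven) encoding over 5 objects plus 2 queries."""
        symbols = OBJECTS + QUERIES
        v = np.zeros(len(symbols))
        v[symbols.index(symbol)] = 1.0
        return v

    def make_sequence(first, second, query):
        # Three input patterns per sequence: component, component, query.
        # The target at time step 3 is the queried component.
        answer = first if query == "first" else second
        return [encode(first), encode(second), encode(query)], answer

    train = [make_sequence(a, b, q)
             for a in OBJECTS
             for b in OBJECTS if b != HELD_OUT
             for q in QUERIES]        # 5 x 4 x 2 = 40 training sequences

    test = [make_sequence(a, HELD_OUT, q)
            for a in OBJECTS for q in QUERIES]    # 10 test sequences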
The simple recurrent network was examined with 20, 10 and 8 hidden units.
In the case of the 8 hidden unit network, 0.5 was used as the training criterion.
Figure 4.17 shows the network and an example sequence of input-output pattern
pairs.
Figure 4.17: The simple recurrent network and the querying of 2-tuples task. Numbers indicate specific time periods. Parenthesized values indicate the number of units used in simulations (Input(7), Hidden(8, 10, 20), Context(8, 10, 20), Output(5)). Starred (*) output indicates the auto-association of the current input to assist in the formation of internal representations. However, for the purpose of evaluating strong systematicity, only performance on time step 3 was considered.
Results
In the case of 20 hidden units, the simple recurrent network learnt perfectly on
the training set on each of the 10 trials for both 0.5 and 0.4 training criteria (i.e.,
all output units were within 0.4 of their target outputs for all patterns in the
training set). However, in all trials performance on the test set was zero for both
maximum and 0.5 testing criteria where the network was required to extract the
second component (i.e., the component which did not appear in the second position
in the training set). The network did not correctly respond, in any trial, to any of
the 5 sequences where the network was required to extract the second component.
In the case where 10 hidden units were used, training reached the 0.4 criterion
in 7 of the 10 trials. In all trials, regardless of whether the network reached the
training criterion, the performance on the test set was zero, for both testing criteria,
with regard to the extraction of the second component.
When only 8 hidden units were used, only 1 of the 10 trials converged to the
0.5 training criterion. In all but two trials performance was zero, with 0.5 and
maximum testing criteria, on the extraction of the second component in the test
set. For the other two trials, test set performance was 1 out of 5 for maximum
testing criterion, 0 out of 5 for 0.5 testing criterion. Since there are 5 possible
components to choose from this level of performance is no better than chance.
Furthermore, in these two trials the network did not acquire perfect performance
on the training set within 10000 epochs. The results are summarized in Table 4.1.
Discussion
The purpose of this simulation was to determine whether the simple recurrent net-
work could exhibit strong systematicity of inference. Although simulation results
cannot show that the network is incapable of demonstrating strong systematicity
on this task, they do suggest some inherent problem with the architecture. As
with the systematicity of representation task, the training sets were constructed
so as to maximize the possibility of exhibiting strong systematicity by leaving out
only one of the five components in the second position and training on all other
Table 4.1: Summary of the performance of the simple recurrent network on the systematicity of inference task for networks with 20, 10 and 8 hidden units. The networks were trained to 0.5 and 0.4 training criteria, or until 10000 epochs of training. In the case of 8 hidden units, only the 0.5 training criterion was considered. The table shows that for a network with 20 hidden units, all 10 trials converged to the 0.5 and 0.4 training criteria within 10000 epochs. For each training criterion, the network was tested on the 0.5 and maximum testing criteria for the 5 test cases. The table shows that for a training criterion of 0.5 and a testing criterion of 0.5, 0 out of 5 test cases were correctly inferred in all 10 trials. In the case of the 0.5 training criterion and maximum testing criterion, in 2 of the 10 trials the network with 8 hidden units correctly inferred 1 of the 5 test cases.
Hidden units                         20       10      8*
Convergent trials (0.5/0.4)          10/10    10/7    1 (10 trials)
Correct test cases (of 5)
  Train: 0.4; Test: (0.5/max)        0/0      0/0     -
  Train: 0.5; Test: (0.5/max)        0/0      0/0     0/0 (8 trials), 0/1 (2 trials)
combinations. Thus, with respect to this task, the training set provides the maximum
amount of information¹¹ available for a demonstration of generalization across
position. Yet, despite being given such information the networks did not demonstrate
strong systematicity. Having considered two different training criteria, two
different testing criteria, and various numbers of hidden units, in all but two trials,
the simple recurrent network performance on the test set was zero for the correct
extraction of second position components in the test set. In the other two trials,
a network with 8 hidden units, trained to the 0.5 training criterion, and tested on
the maximum testing criterion, correctly extracted second position components in
only 1 of 5 test cases. In the case of the maximum testing criterion, where output
units are randomly activated, there is a 1 in 5 chance (since there are 5 possible
components) that the maximally activated output unit will correspond to the target
activation of 1. Since performance on these two trials was not greater than
chance level performance, the network did not exhibit strong systematicity.
¹¹ Providing more information in the way of additional training patterns would mean that strong systematicity could not be tested, as all components would have appeared in both positions.
The network with 20 hidden units was the same as the network which demonstrated
strong systematicity of representation. Although this network did not
demonstrate strong systematicity of inference, it did perform perfectly on all first
queries in the test set. Thus, although not demonstrating generalization across
position, the network did demonstrate generalization to novel combinations. The
lack of generalization across position suggests that the 20 hidden unit network was
not sufficiently constrained in its weight space. Reducing the number of hidden
units decreases the region of weight space within which the network can implement
the target function. Therefore, given that the network learns the function specified
by the training set, there is a greater likelihood of exhibiting strong systematicity.
However, decreasing the number of hidden units made learning the training set
more difficult, to the point where only 1 of the 10 trials terminated within 10000
epochs in the case of the 8 hidden unit network. The error profiles for the 8 hidden
unit network in each of the 10 trials are given in Figure 4.18. They suggest that
the network was unlikely to converge with further training.
The error profiles show that the error on the training set was generally lowest
around the 500th epoch of training. Further training resulted in an increase in
training error to a relatively stable point by the 10000th epoch. It is possible
that the network at this point could have exhibited generalization to the novel
component-position combination, but at the expense of being unable to correctly
infer all cases in the training set.
To test this possibility, the simulations were rerun under the same conditions
except that training was terminated by the 500th epoch, at which point the network
was tested using the maximum testing criterion.
In none of the 10 trials conducted did the network with 8 hidden units correctly
infer the component object in any of the 5 test cases. This suggests that the
network was not generalizing across position, above chance level, at any stage
during the 10000 epochs of training in the previous simulations.
Figure 4.18: Error profile for the 8 hidden unit network over 10000 epochs of training for each trial. Lines indicate trials that did not converge to the 0.5 training criterion within 10000 epochs. The line with points labeled as squares indicates convergence to the 0.5 training criterion within 10000 epochs.
Analysis of internal representations
The lack of strong systematicity of inference from a network that demonstrated
strong systematicity of representation raises the question of whether the simple
recurrent network is sufficiently biased for this task. In this subsection, it is shown
by analysis of internal representations that this simple recurrent network cannot
demonstrate strong systematicity of inference with respect to this task. The explanation
proceeds by first showing the minimum network configuration needed to solve
this task; and second, showing that with this minimum configuration the training
set does not provide enough information to make generalization across position a
likely property.
The simple recurrent network is an instance of a first-order network, characterized
by a unit activation function of the form:

f_j \Bigl( \sum_{i=1}^{n} w_{ij} x_i + b_j \Bigr)

where f_j is the activation function for unit j; x_i is the activation from a previous
unit i along a connection with weight w_{ij}; and b_j is a bias on the unit. In a first-order
network there are no multiplicative activation terms. Consequently, input points
resulting in the same unit activation belong to a hyperplane whose orientation is
determined by the unit's incoming weights and bias. A hyperplane divides the
unit's input space into two half-spaces so that all points in one half-space result
in a unit activity (output) greater than some threshold value (θ), and all points in
the other half-space result in a unit activity less than θ.
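To make the contrast concrete, the following sketch (illustrative Python written for this edition, not from the thesis) shows a first-order unit, whose net input is purely additive, next to a second-order unit, which includes the multiplicative terms that a first-order network lacks:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def first_order_unit(x, w, b):
        # Additive net input only: the points with equal net input
        # w.x + b form a hyperplane in the unit's input space.
        return sigmoid(w @ x + b)

    def second_order_unit(x, W2, w, b):
        # Multiplicative x_i * x_j terms: equal-activation sets are no
        # longer hyperplanes, so the argument below would not apply.
        return sigmoid(x @ W2 @ x + w @ x + b)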
In the querying of 2-tuples task, the network is presented with an ordered
pair and a question requesting the first or second component of the pair. Now
suppose, without loss of generality, that Mary is the component on which the network
is to demonstrate strong systematicity. Then, to solve the task the network must
implement the following function:
f(Q_1, R[(Mary, X)]) \rightarrow Mary
f(Q_1, R[(X, Mary)]) \rightarrow \neg Mary
f(Q_2, R[(Mary, X)]) \rightarrow \neg Mary
f(Q_2, R[(X, Mary)]) \rightarrow Mary
where Q_1 and Q_2 are the question vectors; R[(Mary, X)] is a representation of
an ordered pair with Mary in the first position and some other object (X) in the
second position; R[(X, Mary)] is a representation of an ordered pair with Mary
in the second position and some other object (X) in the first position; and f is
the function that implements the mapping. The R[·] notation is used to refer to
the network's internal representation of an object, which may be different from the
object's external representation.
A first-order network that implements the mapping is given in Figure 4.19. For
a simple recurrent network, the representation of each pair at the context units is
a result of cycling back the representations of components at the hidden units from
the previous time step. For a feedforward network, the representations at the context
units can be regarded as just input representations, and the context units can be
regarded as just other input units.
As in the previous section, by using a local encoding for each component the
assumption is made that there is no a priori similarity between components. Therefore,
at the output layer there is a single unit that discriminates on the basis of
whether or not a representation contains the Mary component. That is, there is
an output unit that detects Mary. It is straightforward to show that this discrimination
cannot be performed by positioning a single hyperplane.
For a single hyperplane to correctly discriminate the Mary component, it must
be positioned so as to satisfy the following inequalities:
\vec{W}_I \cdot \vec{Q}_1 + \vec{W}_C \cdot \vec{R}^C_{Mary,X} + B > \theta \quad (4.1)
\vec{W}_I \cdot \vec{Q}_1 + \vec{W}_C \cdot \vec{R}^C_{X,Mary} + B < \theta \quad (4.2)
\vec{W}_I \cdot \vec{Q}_2 + \vec{W}_C \cdot \vec{R}^C_{Mary,X} + B < \theta \quad (4.3)
\vec{W}_I \cdot \vec{Q}_2 + \vec{W}_C \cdot \vec{R}^C_{X,Mary} + B > \theta \quad (4.4)
where \vec{W}_I and \vec{W}_C are the input-to-hidden and context-to-hidden weight vectors
(respectively); B is the bias (weight) to the unit; and \theta is the threshold above
which the component Mary is considered to be the correct response.
Figure 4.19: Conceptualization of a first-order network where the hidden layer is composed of two layers: the first performs a linear transformation of activation from the preceding layer; and, the second performs a non-linear transformation of the linear layer.
Subtracting equation 4.3 from equation 4.1, and equation 4.4 from equation 4.2,
cancels the context representation and bias terms and leaves:

\vec{W}_I \cdot (\vec{Q}_1 - \vec{Q}_2) > 0 \quad (4.5)
\vec{W}_I \cdot (\vec{Q}_2 - \vec{Q}_1) > 0 \quad (4.6)
Since there does not exist a weight vector \vec{W}_I that satisfies both inequalities
4.5 and 4.6, the network cannot represent the solution with a single hyperplane for
each component. Consequently, correctly extracting the Mary component requires
at least two hyperplanes positioned in the input space (by two hidden units at the
hidden layer) and a hyperplane positioned in the hidden space (by the output unit
detecting the Mary component).
Essentially, this reasoning uses the same analysis that was used to explain why
perceptrons cannot solve the exclusive-or (XOR) task (Minsky & Papert, 1990;
Hertz et al., 1991).
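The infeasibility can also be checked numerically. The following sketch (illustrative Python written for this edition; the particular query encodings and context representations are arbitrary assumptions) samples a large number of single-hyperplane parameter settings and confirms that none satisfies inequalities 4.1-4.4 simultaneously:

    import numpy as np

    rng = np.random.default_rng(0)

    Q1, Q2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # local query encodings
    R_mary_x = rng.normal(size=4)  # context representation of (Mary, X)
    R_x_mary = rng.normal(size=4)  # context representation of (X, Mary)

    def satisfies_all(w_i, w_c, b, theta=0.0):
        """Check inequalities 4.1-4.4 for one hyperplane (w_i, w_c, b)."""
        net = lambda q, r: w_i @ q + w_c @ r + b
        return (net(Q1, R_mary_x) > theta and net(Q1, R_x_mary) < theta and
                net(Q2, R_mary_x) < theta and net(Q2, R_x_mary) > theta)

    hits = sum(satisfies_all(rng.normal(size=2), rng.normal(size=4),
                             rng.normal())
               for _ in range(100_000))
    print(hits)  # 0: inequalities 4.5 and 4.6 demand both
                 # W_I.(Q1 - Q2) > 0 and W_I.(Q2 - Q1) > 0

As with XOR, no single hyperplane can realize the required partition, whatever the actual representations happen to be.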
The positioning of hyperplanes in the input space is depicted in Figure 4.20.
The necessity of two hyperplanes is a consequence of a first-order architecture and is
independent of the representations of the question and pair vectors. For, although
the values of these vectors are not known, the relationship between these vectors
is known. The difference in the internal representations at the linear hidden layer
(see Figure 4.19) between the (Q1, (Mary, X)) case and the (Q2, (Mary, X)) case is
just a constant vector \vec{W}_I(\vec{Q}_1 - \vec{Q}_2) = \vec{C}. Similarly, the difference between the
(Q1, (X, Mary)) case and the (Q2, (X, Mary)) case is also the vector \vec{C}.
Assuming that these two hyperplanes are in the correct position, the resulting
representation at the hidden layer for these four types of points, depending on whether
they are on the high or low side (relative to some threshold) of these two hyperplanes,
is shown in Figure 4.21. In the hidden unit activation space, these four
point types must be partitioned by a single hyperplane into two groups: Mary
(circles) and not-Mary (squares). Clearly, such a partitioning is possible. However,
in addition, the network is required to demonstrate strong systematicity on the
Mary component.
Figure 4.20: Orientation of the hidden unit hyperplanes to extract the Mary component. To implement the function there must be at least two hyperplanes in correct orientation in the input space to extract the Mary component. Circles indicate Mary is the target output, and squares indicate Mary is not the target output. Numerals indicate a request for the first or second component.

Figure 4.21: Orientation of the output unit hyperplane in the hidden unit activation space. Circles indicate Mary is the target output, and squares indicate Mary is not the target output. Solid circles and squares indicate training points, and empty circles and squares indicate test points.
Therefore, in the training set, one of the positions (e.g., Mary in
the second position) does not occur (empty circle and square). Consequently, the
training set does not provide any information regarding the partitioning of points
along the vertical dimension, and therefore the network cannot be expected to
generalize to these cases.
Thus, as with the buffer solution to the auto-association task, there are two sets
of dimensions: one encoding components in the first position, and one encoding
components in the second position. The critical difference in the inference task is
that at a single time step the network is required to recover either the first or the second
component depending on the question vector, whereas in the auto-association task
at one time step only the first component was required, with the second component
recovered in the next time step. Consequently, the network solution to the task
requires two output subspaces (in contrast with the solution to the representation
task, where only one output subspace is necessary) implemented by two sets of
weights (Figure 4.22). Since the two sets of weights are independent, the network,
in general, must see components in both positions and therefore cannot demonstrate
strong systematicity.
Figure 4.22: Weight and representation configuration resulting in weak systematicity of inference. Shaded regions and solid arrows indicate modified weights. Dashed arrows indicate fixed weights. Starred output indicates auto-association of input to assist in the formation of internal representations.
4.4.3 Other first-order three-layer networks
The weak systematicity result of the simple recurrent network does not make any
assumptions about the nature of the internal representations of the ordered pairs.
The assumptions made were that the external representations of components are
encoded locally, and there is an intermediate layer of first-order units between the inputs and the outputs (which are also first-order units). Therefore, the result can
also be applied to other architectures conforming to these restrictions.
Figure 4.23: Two alternative recurrent network architectures: Jordan's recurrent network (a) and Pollack's recursive auto-associative memory (b).
Feedforward network
As mentioned previously, the context units in the simple recurrent network can be
regarded as other input units holding a representation of the ordered pair. The
two sets of input units map representations to hidden units which in turn map
representations to output units. Since no assumptions are made regarding the nature of the ordered-pair representation, three-layer feedforward networks are weakly systematic of inference with respect to this task.
Jordan's recurrent network
Jordan's (1990) recurrent network (see Figure 4.23(a)) differs from the simple re-
current network in that context is copied from the output unit activations rather
than the hidden unit activations of the previous time step. Since the result makes
no assumptions regarding the nature of the context representations the weak sys-
tematicity of inference result applies to the Jordan recurrent network.
Pollack's recursive auto-associative memory
Pollack's (1990) recursive auto-associative memory (see Figure 4.23(b)) is essen-
tially the same as the simple recurrent network except there is an additional set
of output units whose targets are the current context values (i.e., the network not
only auto-associates the current input, but also the current context). However,
these units (and their associated weights) play no part in the extraction of com-
ponents. Their purpose is to encourage the formation of useful representations at
the context layer. However, as mentioned above, the result of the simple recurrent
network is independent of the actual representations at the context layer. There-
fore, this architecture cannot demonstrate strong systematicity of inference on this
task.
Forced simple recurrent network
The forced simple recurrent network (Maskara & Noetzel, 1992) is a variation of
the recursive auto-associative memory in that the targets for the output units are
the previous context, the current input and the next input. Additional information
(hints) regarding the target function can improve learnability in networks (Abu-
Mostafa, 1990; Suddarth & Kergos, 1990; Al-Mashouq & Reed, 1991; Suddarth &
Holden, 1991). For example, use of hints allowed the simple recurrent network to
discriminate input values appearing further back in time (Phillips & Wiles, 1991).
In fact, auto-association of the inputs at the first and second time steps in the auto-association of 2-tuples task is an example of hint information, except that the hints were on the same output units but were not considered for network evaluation. However, as with the recursive auto-associative memory, the additional
output units play no part in the extraction of components in the forced simple re-
current network. Therefore, the network cannot demonstrate strong systematicity
of inference on this task.
4.4.4 Architectural issues
The problem with the simple recurrent network and the other first-order three-layer
networks is that the components must be extracted from one of two independent
subspaces, depending on whether the query vector requests the first or second component. As in the case of the feedforward network, because there is an independence between the weights that implement these two mappings, the network must be trained on all components in both positions.
The simple recurrent network was able to demonstrate strong systematicity of
representation on the auto-association task because the task only requires the network to extract the first component at the third time step, with the second component being extracted at the fourth time step. At either time step the same subspace can be used to map internal component representations to outputs. Therefore, the same set of weights can be used to implement the mapping. In this case, dependency was incorporated into the architecture as the component mappings were implemented by the same set of weights.
Dependency between component-access weights
The problem that strong systematicity of inference poses for an architecture is that all components must be simultaneously represented on different subspaces, which means there must be a different set of weights to implement the mappings from those subspaces to the output units. Therefore, to demonstrate strong systematicity there must be a dependency between these sets of weights. The representation task permitted all components to be accessed via a common subspace in the case of the simple recurrent network. Thus, dependency was incorporated as the network could use the same set of weights to access all components. However, the inference task does not permit all components to be accessed from a common subspace in the case of these networks.
One way to incorporate a weight dependency is to collapse the various com-
ponent subspaces onto a single common subspace from which all components, in-
dependent of their position within the complex object, can be extracted to the
output. Thus, strong systematicity of inference is achieved as only one set of
weights implements a position-independent mapping from a common internal sub-
space to the output, just as was suggested with the simple recurrent network in the
auto-association task. The problem then is to incorporate a mechanism into a Con-
nectionist architecture that, when given a query vector, returns the corresponding
component subspace to the common subspace (Figure 4.24).
Figure 4.24: Characterization of the representational organization sufficient for strong systematicity of inference.
Tensors
One possibility is with the use of a tensor, rather than the more common vector representation space. Briefly, an $M \times N$-dimensional tensor space (constructed from two vector spaces of dimensionality $M$ and $N$, respectively) can be conceptualized as $M$ $N$-dimensional subspaces (or, similarly, $N$ $M$-dimensional subspaces). The inner product operator, provided in tensor calculus, effectively collapses the multiple subspaces down to a single subspace. The particular subspace selected depends on the vector with which the inner product was performed. Thus, tensors
provide multiple representational subspaces, which can be indexed (or accessed) by
other vectors.
The use of a tensor representational space in a Connectionist learning archi-
tecture is investigated in the next chapter with the purpose of addressing strong
systematicity of inference in the querying of 2-tuples task.
4.5 Summary and conclusion
In this chapter, Hadley's strong systematicity was used as the criterion by which
a model is regarded as exhibiting systematicity. The purpose of this chapter was:
(1) to determine which models, if any, meet this criterion; and (2) to elucidate
the properties which give rise to the model's degree of systematicity. Two tasks
were designed to test whether Connectionist models could meet Hadley's strong
systematicity criterion. They were: (1) auto-association of 2-tuples, designed to
test strong systematicity of representation; and (2) querying of 2-tuples, designed
to test strong systematicity of inference. A summary of the conclusions drawn from
an evaluation of networks and their degrees of systematicity with respect to these
two tasks is given in Table 4.2.
The first model examined was the feedforward network. It was shown not to
be capable of exhibiting strong systematicity of representation. Essentially, the
lack of strong systematicity is because of the independent relationship between the
weights which implement the various component mappings. The simple recurrent
Table 4.2: Summary of the degrees of systematicity of networks with respect to the systematicity of representation task (auto-association of 2-tuples), and the systematicity of inference task (querying of 2-tuples). The networks examined included the feedforward network (FFN); simple recurrent network (SRN); and several first-order recurrent networks (RN*), including the simple recurrent network; Jordan's recurrent network; Pollack's recursive auto-associative memory; and Maskara's forced simple recurrent network.
Network  Task        Degree  Property
FFN      Represent.  Weak    Independent weights
SRN      Represent.  Strong  Dependent weights
FFN      Inference   Weak    Independent access weights
RN*      Inference   Weak    Independent access weights
network incorporates a weight dependence as all components are presented (and
extracted) at the same set of input (and output) units. Therefore, the function
which performs this mapping is implemented, in part, over the same set of weights.
Simulation results confirmed that the simple recurrent network could exhibit strong
systematicity of representation. Analysis of internal representations for the sim-
ple recurrent network suggested that the network learnt to organize its internal
representations so that components were mapped from the input to a common
internal subspace, and mapped from another common internal subspace to the
output. Thus, dependency between component mapping weights was achieved as
the network learnt to use the same set of weights.
The simple recurrent network was then examined on the second task designed
to test strong systematicity of inference. Simulations showed that although the
network generalized to novel combinations, it did not demonstrate any degree of
generalization across position on this task. The lack of strong systematicity sug-
gested that the network did not have the right architectural bias to exhibit strong
systematicity. By analysis of internal representations it was shown that the sim-
ple recurrent network cannot demonstrate strong systematicity on this task. The
inference task demands that the components be extracted from two component
subspaces. The independence of the weights that implement these two mappings
meant the network must be trained on each component in both positions. It was
also shown that a number of other first-order three-layer networks (i.e., one hidden
layer) could not exhibit strong systematicity of inference on this task. Those net-
works included: the three-layer feedforward network; Jordan's recurrent network;
Pollack's recursive auto-associative memory; and Maskara's forced simple recurrent
network. The architectural bias lacking in these models was a dependency between
component access weights.
Finally, it was suggested that dependency between component access weights
could be achieved by collapsing the various component subspaces onto a single com-
mon subspace from which a position-independent mapping could be implemented
by a single set of weights. Thus, weight dependency is incorporated by using the
same set of weights. It was suggested that this bias could be achieved in a tensor
representation scheme using the inner product operator. In the next chapter, details of a Connectionist network incorporating these ideas are presented and tested by simulation on a querying of 3-tuples task.
Chapter 5
Strong systematicity of inference:
The Tensor-Recurrent Network
5.1 Introduction
In the previous chapter, a number of Connectionist models were analyzed for their
capacity to exhibit strong systematicity. The importance of Hadley's work was
to identify and specify a degree of generalization that is clearly evident in people,
but has yet to be demonstrated in Connectionist models. The main results of the
previous chapter were to show that the three-layer feedforward network and several first-order three-layer recurrent networks cannot demonstrate strong systematicity of inference with respect to the querying of 2-tuples task.
The characteristic property that prevented these models from exhibiting strong
systematicity of inference was an independence between the weights that access
internal component representations from the two component subspaces. It was
then suggested that a mechanism that incorporated such a dependence would be
sufficient to demonstrate strong systematicity of inference. The use of tensor rep-
resentations was suggested as one possible mechanism. In this chapter, tensor
representations are used as the basis for a Connectionist architecture, called the
tensor-recurrent network, for the purpose of exhibiting strong systematicity of in-
ference with respect to a querying of 3-tuples task.
Before presenting the tensor-recurrent network, the concept of tensor represen-
tations, and their Connectionist implementation (introduced by Smolensky, 1987b)
is reviewed in section 2. The use of tensors for representing complex objects as-
sumes the existence of a number of representational components. This assumption
raises an issue, which is discussed in section 3, regarding the origins of these rep-
resentations. In section 4, it is shown how these representational components are
learnable in the tensor-recurrent network by combining a tensor representation ar-
chitecture with a recurrent network learning architecture. The tensor-recurrent
network is then evaluated for strong systematicity of inference on a querying of
3-tuples task in section 5. Finally, a summary and conclusion are given in section
6.
5.2 Tensor representations
Tensor representations were introduced by Smolensky (1987b) as a means of representing complex objects in a Connectionist architecture. As already shown in
chapter 2, complex objects can be represented as the sum of the outer products
of component-role vector representation pairs. The outer product ($\vec{T}$) of two vectors ($\vec{V}$ and $\vec{W}$) is defined as:

$$\vec{T} = \vec{V} \otimes \vec{W} = \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix} \begin{pmatrix} w_1 & \cdots & w_n \end{pmatrix} = \begin{pmatrix} v_1 w_1 & \cdots & v_1 w_n \\ \vdots & & \vdots \\ v_m w_1 & \cdots & v_m w_n \end{pmatrix} \qquad (5.1)$$
A Connectionist network that implements the outer product is given in Figure 5.1. In this network, connections have a fixed weight of value one. The activation of tensor unit $T_{ij}$ at time step $t$ is defined by the equation:

$$T^t_{ij} = v_i w_j + T^{t-1}_{ij}, \qquad (5.2)$$

where $v_i$ is the activation of the $i$th unit in the vector(V) group of units; $w_j$ is the activation of the $j$th unit in the vector(W) group of units; and $T^{t-1}_{ij}$ is the activation of the $ij$th tensor unit at the previous time step ($t-1$). The multiplicative activation term makes the tensor an example of a higher-order network.
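To make the construction concrete, the following is a minimal sketch of the binding step in Python with NumPy (an illustration added for this discussion, not code from the original simulations; the dimensionalities and vector values are arbitrary assumptions):

import numpy as np

def bind(tensor, v, w):
    """Equation 5.2: add the outer product of component vector v and
    role vector w to the accumulated tensor representation."""
    return tensor + np.outer(v, w)

T = np.zeros((2, 3))           # T^0: an m x n tensor, initially zero
v = np.array([1.0, 0.0])       # activation at the vector(V) units
w = np.array([0.0, 1.0, 0.0])  # activation at the vector(W) units
T = bind(T, v, w)              # T^1 = T^0 + (v outer product w)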
Figure 5.1: A Connectionist implementation of the outer product (T) of two vector representations (V and W). All connections are of fixed weight 1. Solid circles indicate the product of input activity, and empty circles indicate the sum of input activity. The activation of a tensor unit is the product of the activity of previous units plus the tensor unit's activation from the previous time step.
In addition to constructing complex representations via the outer product operator, component representations can be accessed via the inner product operator. The inner product of a tensor ($\vec{T}$) and a cue vector ($\vec{W}$) is defined as:

$$\vec{T} \cdot \vec{W} = \vec{V} = \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix}, \qquad (5.3)$$

where $v_i = \sum_{j=1}^{n} T_{ij} w_j$; and $n$ is the dimensionality of vector $\vec{W}$.
The Connectionist network that implements the inner product is given in Figure 5.2. In this network, connections have a fixed weight of value one. The activation of product unit $p_{ij}$ is defined as:

$$p_{ij} = T_{ij} w_j, \qquad (5.4)$$

where $T_{ij}$ is the activation of the $ij$th tensor unit; and $w_j$ is the activation of the $j$th unit in the vector(W) group of units. The activation of vector(Vout) unit $v^{out}_i$
Figure 5.2: A Connectionist implementation of the inner product (Vout) of a tensor representation (T) and a vector representation (W). All connections are of fixed weight 1. Solid circles indicate the product of input activity, and empty circles indicate the sum of input activity. The activation of a product unit is the product of the activity of previous units. The activation of a vector(Vout) unit is the sum of the activity of previous units.
is defined as:

$$v^{out}_i = \sum_{j=1}^{n} p_{ij}, \qquad (5.5)$$

where $p_{ij}$ is the activation of product unit $ij$. The activation is summed over all $n$ incoming connections1.
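The corresponding access operation is equally small; a sketch of equations 5.3 to 5.5 under the same assumptions, continuing the NumPy illustration above:

def unbind(tensor, w):
    """Equations 5.3-5.5: sum the product units p_ij = T_ij * w_j over j,
    giving v_i = sum_j T_ij w_j."""
    return tensor @ w

v_out = unbind(T, w)  # recovers v when w is unit length and orthogonal
                      # to every other role vector bound into T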
The following is an example of how these two networks, together, can process
examples from the querying of 2-tuples task, which was given in the previous chap-
ter. Suppose the network is presented with the 2-tuple (John, Mary) and a query
1The number of incoming connections n is the dimensionality of the cue vector, which in this network implementation is the number of vector(W) units.
requesting the second component. Then the sequence of inputs and targets is:

Time  Input   Output
1.    John    -
2.    Mary    -
3.    Second  Mary

The activation of the network units throughout this sequence is described by the following sequence of equations:

$$\begin{aligned}
0 &: \vec{T}^0 = \vec{0} \\
1 &: \vec{T}^1 = \vec{T}^0 + \vec{V}_{John} \otimes \vec{W}_{first} \\
2 &: \vec{T}^2 = \vec{T}^1 + \vec{V}_{Mary} \otimes \vec{W}_{second} \\
3 &: \vec{V}_{out} = \vec{T}^2 \cdot \vec{Q}_{second} \\
  &\;\; = \vec{V}_{John} (\vec{W}_{first} \cdot \vec{Q}_{second}) + \vec{V}_{Mary} (\vec{W}_{second} \cdot \vec{Q}_{second}) \\
  &\;\; = \vec{V}_{Mary} \quad (\text{when } \vec{W}_{first} \perp \vec{Q}_{second}, \text{ and } \vec{W}_{second} \cdot \vec{Q}_{second} = 1).
\end{aligned}$$
The tensor units are initialized with zero activation. At time step 1, the two groups of units vector(V) and vector(W) (see Figure 5.1) are presented with representations of the first component ($\vec{V}_{John}$) and the first component's role ($\vec{W}_{first}$), respectively. The activation of the tensor units becomes their activation at the previous time step plus the outer product of the component and role vectors. At time step 2, vector(V) and vector(W) units are presented with representations of the second component ($\vec{V}_{Mary}$) and the second component's role ($\vec{W}_{second}$), respectively. The tensor units are updated as in the previous time step. At time step 3, a representation of the query ($\vec{Q}_{second}$), or cue vector, is presented to the vector(W) units. The activation of the vector(Vout) units is the result of the inner product of the activations of the tensor units and the vector(W) units. If the representation of the first component's role ($\vec{W}_{first}$) is orthogonal to the representation of the query ($\vec{Q}_{second}$), and if the inner product of the second component's role ($\vec{W}_{second}$) and the query vector is one2, then the representation of the second component will be correctly extracted.
2This situation occurs, for example, when the two vectors are equal and of length one.
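The whole example can be traced numerically with the two helpers above; the one-hot component and role vectors below are assumptions chosen so that the orthogonality conditions hold exactly (an added illustration, not the thesis' simulation code):

import numpy as np

V_john, V_mary = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W_first, W_second = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Time steps 1 and 2: accumulate the component-role bindings.
T2 = np.outer(V_john, W_first) + np.outer(V_mary, W_second)

# Time step 3: cue with Q_second = W_second.
Q_second = W_second
print(T2 @ Q_second)  # [0. 1.], i.e., V_Mary is extracted

Since $\vec{W}_{first} \cdot \vec{Q}_{second} = 0$ and $\vec{W}_{second} \cdot \vec{Q}_{second} = 1$, the John term vanishes and the Mary component is recovered exactly.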
5.3 Representational issues
The use of tensors to represent and process complex objects assumes the existence
of external agents for generating appropriate component (e.g., $\vec{V}_{John}$), role (e.g., $\vec{W}_{first}$) and cue (e.g., $\vec{Q}_{first}$) representations. In this section, the assumptions
regarding each of these three aspects of tensor representations are discussed.
5.3.1 Component representations
In the tensor scheme, component representations are bound (by an outer prod-
uct) to a representation of their associated roles. If the appropriate cue vector is
applied (by an inner product) to the tensor representation then the original com-
ponent representation is extracted. Thus, in a tensor scheme the input and output
representations of a component object are the same. However, the input repre-
sentation of the object John, for example, could be a string of letters (in the case
where one is receiving visual input), whereas the output representation of the same
object could be a sequence of phonemes (in the case where one is sending auditory
output). Therefore, it must be assumed that there exist additional agents that
perform the mappings from an external input representation of an object to an
internal representation (which is used in the tensor construction), and from the
internal representation to an external output representation of the same object
(which may di�er from the external input representation).
5.3.2 Role representations
In a tensor representation of complex objects, each component representation is
bound to a representation of the component's role within the complex object. In
using tensor representations it is assumed that there exists some other agent that
generates role representations for each component, since these representations are
not explicitly represented in the input3. That is to say, there is not an input rep-
3Although Smolensky (1987a) showed how "optimal" role representations could be learnt with his recirculation algorithm, these roles are learnt in isolation (i.e., separate from any binding to
resentation denoting patient, for example, in the sentence John loves Mary. Roles
must be deduced from other information such as relative positioning of compo-
nents. In addition, role representations must be orthogonal to each other for the
component representations to be subsequently extracted, via the inner product op-
erator, without interference. For example, suppose vectors $\vec{V}$ and $\vec{W}$ are bound to roles $\vec{R}_1$ and $\vec{R}_2$, respectively. Then, the extraction of the first component $\vec{V}$ using $\vec{R}_1$ as the cue vector results in:

$$(\vec{V} \otimes \vec{R}_1 + \vec{W} \otimes \vec{R}_2) \cdot \vec{R}_1 = \vec{V} + \alpha \vec{W}, \quad \text{where } \vec{R}_1 \cdot \vec{R}_2 = \alpha.$$

In the case where the role vectors are of unit length, $\alpha$ is the component of $\vec{R}_1$ in the direction of $\vec{R}_2$. The choice of role vectors is important for the successful extraction of component representations. Thus, a tensor scheme assumes the existence of agents that generate appropriate role vectors.
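The interference term $\alpha \vec{W}$ is easy to exhibit numerically (an added NumPy illustration; the role vectors here are unit length but deliberately non-orthogonal):

import numpy as np

V, W = np.array([1.0, 0.0]), np.array([0.0, 1.0])
R1 = np.array([1.0, 0.0])
R2 = np.array([0.6, 0.8])   # unit length, but R1 . R2 = 0.6, not 0

T = np.outer(V, R1) + np.outer(W, R2)
print(T @ R1)               # [1.  0.6], i.e., V + 0.6 W: the W component leaks in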
5.3.3 Cue representations
Having constructed tensor representations of complex objects, component objects
can be extracted via the inner product operator. Typically, the cue vector is
chosen, by some external agent, to be the same as the role vector to which the
desired component was composed. However, since the role vectors were generated
internally there is no reason to expect that an external representation of a query will
be the same as the internal representation of the role. For example, in the complex
object John chased Mary, the component John is bound to a representation of the
agent role. It is assumed that the query Who chased? gets mapped to the same
representation of agent, at which point the John component can be extracted from
the tensor representation using the inner product operator.
component vectors). No demonstration was given as to how the recirculation algorithm could be used in any task. The recirculation algorithm could have been used to train a three-layer feedforward network to auto-associate N-tuples if one assumes that the activation function for the hidden units is the identity function, and that the input-to-hidden weight matrix equals the transpose of the hidden-to-output weight matrix (i.e., $W_{I,H} = W_{H,O}^T$). However, even with these assumptions there is still an independence between the weights that access components from different positions, and therefore the network will not demonstrate strong systematicity with respect to this task.
Tensor representational schemes were introduced by Smolensky as a way of
implementing the capacity to represent complex objects in a Connectionist archi-
tecture. In chapter 2, it was argued that a "potential" contribution of Connec-
tionism is in an explanation of the acquisition of systematic behaviour. If tensor
representations are to prove useful in providing such an explanation then this issue
regarding the origins of component, role and cue representations must be addressed.
In the next section, it is shown how these representations can be learnt with the
tensor-recurrent network.
5.4 The tensor-recurrent network
In the previous chapter, a number of Connectionist networks were analyzed for
their capacity to exhibit strong systematicity. Characteristic of these networks was
the use of a vector space to represent complex objects. Although these networks
can represent most functions, it was shown that there was not sufficient architectural bias to account for strong systematicity. Since tensors have been shown capable of representing complex objects, a network incorporating tensor representations was suggested as a way of addressing strong systematicity. However, to this point, there has not been an effective procedure for learning tensor repre-
sentations. Consequently, assumptions must be made regarding the nature and
origins of component, role and cue representations. The motivation behind the
tensor-recurrent network was to incorporate a learning mechanism with a tensor
representation scheme so that component, role and cue representations are learnt
as a consequence of the demands of the task, rather than provided by external
agents. The following is a description of the tensor-recurrent network and how it
addresses the issue of acquiring component, role and cue representations.
The tensor-recurrent network is depicted in Figure 5.3. The network has many
of the components that are characteristic of the recurrent networks of the previous
chapter. That is, there are input and output units for presenting and extracting
component representations, and there are units for storing activations from the
previous time step. The main di�erence with the tensor-recurrent network is in the
[Figure 5.3 shows the following layers and unit counts: input(8), question(3), hidden(8), cue(4), context(10), bottleneck(2), state(10), role(4), tensor(32), inner product(8), output(8), together with the functions f, g, h, r, s, b1, and b2 labelling the modifiable mappings, and the copy-back connection from state to context.]
Figure 5.3: The tensor-recurrent network architecture. Dashed arrows indicate completely connected modifiable weights, and their destination units (which are the cue, hidden, bottleneck, state, role, and output units) have tanh as their activation function. Solid arrows indicate fixed connections of weight 1, and the connectivity between their source and destination units is such as to implement the operator as shown (i.e., inner product, outer product, and copy). Parenthesized values indicate the number of units used in simulations.
separation of the responsibility for representing component and role information into two sets of units: (1) hidden units, which construct internal representations of component objects; and (2) role units, which construct internal representations of a component's role within a complex object. This organization contrasts with
the recurrent networks where a single set of hidden units is responsible for the
representation of both component and role information. Before discussing the
implications of this di�erence with respect to demonstrating strong systematicity,
the internal behaviour of the network and its capacity to learn component, role
and cue representations is presented in detail.
5.4.1 Architecture
The description of the tensor-recurrent network has two parts: (1) an outline of
the functionality of each group of units; and (2) the activation functions and con-
nectivity of weights that implement the functionality of each group of units.
Unit functionality
The functionality of each group of units is organized into three major categories: (1) external representation units; (2) tensor-related units; and (3) role-related units. Their functionality is as follows:
1. External representation units. The purpose of these units is to hold the input and output unit activations. The three types of external units are:
• Input: external input representation of component objects.
• Question: external representation of a question.
• Output: external output representation of component objects.
2. Tensor-related units. These units are concerned with implementing the inner and outer product operators. The five types of units associated with these two operators are:
• Tensor: tensor representations of complex objects.
• Hidden: internal representations of component objects.
• Role: internal representations of a component's role within a complex object.
• Cue: internal representations of vectors for accessing component representations within the tensor representations.
• Inner product: extracted component representations.
3. Role-related units. These units are concerned with creating new role representations at each time step. The three types of units are:
• State: holds the current state vector.
• Context: holds the activation of the state units from the previous time step.
• Bottleneck: an intermediate representation between the hidden units and the state units.
Implementation
The activation of each context unit is defined as:

$$c^t_i = s^{t-1}_i, \qquad (5.6)$$

where $c^t_i$ is the activation of context unit $i$ at time step $t$; and $s^{t-1}_i$ is the activation of state unit $i$ at the previous time step $t-1$.
The activation of each tensor unit is defined as:

$$T^t_{ij} = h_i r_j + T^{t-1}_{ij}, \qquad (5.7)$$

where $T^t_{ij}$ and $T^{t-1}_{ij}$ are the activations of tensor unit $ij$ at time steps $t$ and $t-1$, respectively; $h_i$ is the activation of hidden unit $i$; and $r_j$ is the activation of role unit $j$.
The activation of each inner product unit is defined as:

$$ip_i = \sum_{j=1}^{n} T_{ij}(q_j + r_j), \qquad (5.8)$$
where $ip_i$ is the activation of inner product unit $i$; $q_j$ is the activation of cue unit $j$; $r_j$ is the activation of role unit $j$; and $n$ is the number of cue/role units.
The activation of all other units, which includes the cue, hidden, bottleneck, state, role, and output units, is defined as:

$$o_j = \tanh\left(\sum_{i=1}^{k} x_i w_{ij} + b_j\right), \qquad (5.9)$$

where $o_j$ is the activation of unit $j$; $x_i$ is the activation of a preceding unit $i$; $w_{ij}$ is the connection weight from unit $i$ to unit $j$; $b_j$ is the bias term on unit $j$; and $k$ is the number of incoming connections.
Each dashed arrow indicates complete connectivity between source and desti-
nation layers (i.e., every unit in the source layer is connected to every unit in the
destination layer). The weights associated with these connections are modifiable through a learning algorithm. The solid lines indicate unmodifiable weighted con-
nections. The connectivity between layers is such as to implement the following
functions:
• Outer product. The connectivity from the hidden and the role units to the tensor units implements the outer product $\vec{h} \otimes \vec{r}$, where $\vec{h}$ is a vector at the hidden units providing an internal representation of some component, and $\vec{r}$ is a vector at the role units providing an internal representation of the component's role. An example of the connectivity that implements the outer product was given in Figure 5.1.
• Inner product. The connectivity from the tensor units and the role units to the inner product units implements the inner product $\vec{T} \cdot \vec{r}$, where $\vec{T}$ is a tensor representation at the tensor units of some complex object and $\vec{r}$ is a vector representation at the role units. An example of the connectivity that implements the inner product was given in Figure 5.2. The connectivity between cue and inner product units implements the same function, except that the inner product is $\vec{T} \cdot \vec{q}$, where $\vec{q}$ is a vector representation at the cue units.
Table 5.1: Order of activation for the tensor-recurrent network when given the sequence John, Mary, first, where $\vec{R}'_i = \vec{R}_i + \vec{C}_{bias}$ and $h(\vec{Q}_{first})' = h(\vec{Q}_{first}) + \vec{R}_3$.

Units       T = 1                         T = 2                         T = 3
Input       V_John                        V_Mary                        0
Question    0                             0                             Q_first
Hidden      f(V_John)                     f(V_Mary)                     H_bias
Cue         C_bias                        C_bias                        h(Q_first)
Context     S_0                           S_1                           S_2
Bottleneck  B_1                           B_2                           B_3
State       S_1                           S_2                           S_3
Role        R_1                           R_2                           R_3
Tensor      T_1 = T_0 + f(V_John) ⊗ R_1   T_2 = T_1 + f(V_Mary) ⊗ R_2   T_3 = T_2 + T_noise
Inner       T_1 · R'_1                    T_2 · R'_2                    T_3 · h(Q_first)'
Output      g(T_1 · R'_1)                 g(T_2 · R'_2)                 g(T_3 · h(Q_first)')
• Copy. The connectivity between state and context units implements a copy-back function, which is the same as in Elman's (1990) simple recurrent network (i.e., one-to-one connections, with a fixed weight of one).
Those units that have modifiable weights also have a modifiable bias, which can be implemented as an incoming weighted connection whose activity is always one. Other units do not have a bias.
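The data flow just described can be summarized in a short forward-pass sketch (Python/NumPy; an added illustration under simplifying assumptions: bias terms and all learning machinery are omitted, and the weight shapes merely follow the unit counts reported in Figure 5.3):

import numpy as np

def step(x, question, context, T, W):
    """One time step of the tensor-recurrent network (cf. eqs. 5.6-5.9)."""
    h   = np.tanh(W['f'] @ x)                      # hidden: component representation
    cue = np.tanh(W['h'] @ question)               # cue generated from the question
    b   = np.tanh(W['b1'] @ h)                     # bottleneck attenuates the input
    s   = np.tanh(W['b2'] @ b + W['s'] @ context)  # new state vector
    r   = np.tanh(W['r'] @ s)                      # role generated from the state
    T   = T + np.outer(h, r)                       # tensor update (eq. 5.7)
    ip  = T @ (cue + r)                            # inner product units (eq. 5.8)
    out = np.tanh(W['g'] @ ip)                     # output units
    return out, s, T                               # s is copied back as context (eq. 5.6)

rng = np.random.default_rng(0)
W = {'f':  rng.uniform(-1, 1, (8, 8)),    # input(8)         -> hidden(8)
     'h':  rng.uniform(-1, 1, (4, 3)),    # question(3)      -> cue(4)
     'b1': rng.uniform(-1, 1, (2, 8)),    # hidden(8)        -> bottleneck(2)
     'b2': rng.uniform(-1, 1, (10, 2)),   # bottleneck(2)    -> state(10)
     's':  rng.uniform(-1, 1, (10, 10)),  # context(10)      -> state(10)
     'r':  rng.uniform(-1, 1, (4, 10)),   # state(10)        -> role(4)
     'g':  rng.uniform(-1, 1, (8, 8))}    # inner product(8) -> output(8)

T, context = np.zeros((8, 4)), np.zeros(10)  # reset at the start of each sequence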
An example of activity propagation
An example sequence from the querying of 2-tuples task is used to show the pro-
cessing of activation in the tensor-recurrent network. The sequence John, Mary,
first will result in the following sequence of activation, which is summarized in
Table 5.1.
At the first time step (T = 1), the tensor-recurrent network is presented with an external input representation ($\vec{V}_{John}$) of the component John. At this point, the network has not been queried, so the activation at the question units is zero. This representation is mapped (via function $f$) to the hidden units to form the internal representation $f(\vec{V}_{John})$. The hidden unit representation is mapped (via
function $b_1$) to the bottleneck units, resulting in vector $\vec{B}_1$, and from the bottleneck units to the state units via function $b_2$. At the same time, state vector $\vec{S}_0$ at the context units is mapped via function $s$ to the state units. The state units take the results of these two functions and produce a new state vector $\vec{S}_1$. The new state vector is mapped (via function $r$) to the role vector $\vec{R}_1$, which is a representation of the role of the first component object. The outer product of vectors $f(\vec{V}_{John})$ and $\vec{R}_1$ is added to the previous tensor ($\vec{T}_0$) to give a new tensor representation $\vec{T}_1$ at the tensor units. The inner product of the tensor and the role plus cue vectors is then performed. Although the question units have zero activity at the first time step, and therefore do not pass on any activity to the cue units, the bias terms associated with the cue units result in an activity vector $\vec{C}_{bias}$. Consequently, the activity vector at the inner product units is $\vec{T}_1 \cdot \vec{R}'_1$, where $\vec{R}'_1 = \vec{R}_1 + \vec{C}_{bias}$. The activity vector at the inner product units is subsequently mapped (via function $g$) to the output units, resulting in the output vector $g(\vec{T}_1 \cdot \vec{R}'_1)$. In addition, the state vector is propagated back to the context units in preparation for the next time step.
Activity vectors are propagated, in a similar manner, during the second time step. At the third time step, the network is presented with the question requesting the first component, which is represented as the vector $\vec{Q}_{first}$. The question vector is mapped (via function $h$) to form an internal query vector $h(\vec{Q}_{first})$ at the cue units. Although the activation at the input units is zero at the third time step, the bias terms at the hidden units (i.e., vector $\vec{H}_{bias}$) and the state vector $\vec{S}_2$ are propagated forward to generate a new state vector $\vec{S}_3$, and subsequently a new role vector $\vec{R}_3$. The outer product of the hidden and role unit vectors results in a noise term $\vec{T}_{noise}$ (i.e., a term that contains no information regarding the complex object (John, Mary)). Again, the inner product of the tensor vector and the role plus cue vectors is performed. The inner product results in the vector $\vec{T}_3 \cdot h(\vec{Q}_{first})'$, where $h(\vec{Q}_{first})' = h(\vec{Q}_{first}) + \vec{R}_3$. Finally, this vector is mapped to the output units, resulting in the output vector $g(\vec{T}_3 \cdot h(\vec{Q}_{first})')$.
Having traced through the propagation of activity for an example sequence,
the next step is to show how this activity can be used to train the tensor-recurrent
network to construct appropriate internal component, role and cue representations.
5.4.2 Learning
The output vector produced by the network as a result of the sequence of input vectors can then be compared to a target vector. In the case of the previous example, the target output at the third time step is $\vec{V}^T_{John}$. Just as with the networks of
the previous chapter, by incorporating an error function within the network, per-
formance can be improved by backpropagating an error signal, which is a function
of the network and target output vectors. Weights are adjusted as a function of
the error signal so as to reduce network error. Again, it is the principle of error
backpropagation (Rumelhart et al., 1986) that drives the construction of internal
representations. The same principle is used in the tensor-recurrent network. The
interesting feature of the tensor-recurrent network is that the error signal is back-
propagated along the fixed weights and units that implement the tensor operators to the modifiable weights that generate component, role and cue representations.
In this way, these representations can be learnt in response to the demands of the
task, rather than pre-speci�ed by some external agent. The following is an ex-
planation of how the appropriate component, role and cue representations can be
learnt given the principle of error backpropagation and the internal organization
of the tensor-recurrent network.
Component representations
It was mentioned in section 5.3 that a tensor representation scheme assumes the ex-
istence of external agents that perform the mappings between external and internal
component representations. This assumption is avoided in the tensor-recurrent net-
work by providing a layer of modifiable weights before and after the tensor-related units (i.e., the weights associated with functions $f$ and $g$, respectively).
Suppose that, in the (John, Mary, first) example, the network is also required
to output the current object for the first and second time steps (as was done for
the networks of the previous chapter to encourage the formation of useful internal representations). Then, using the sum of squares error function, the error term at the first time step is:

$$E = \|\vec{V}^T_{John} - g((f(\vec{V}_{John}) \otimes \vec{R}_1) \cdot (\vec{R}_1 + \vec{C}_{bias}))\|,$$

where $\vec{V}_{John}$ and $\vec{V}^T_{John}$ are the input and target output representations of John, respectively; $\vec{R}_1$ is the associated role vector; and $\vec{C}_{bias}$ is a bias vector. Thus, error $E$ goes to zero ($E \to 0$) when, for example, the following three conditions occur:

1. $\vec{C}_{bias} \to 0$;

2. $\|\vec{R}_1\| \to 1$; and

3. $g(f(\vec{V}_{John})) \to \vec{V}^T_{John}$.
That is, when the modifiable weights that implement the various mapping functions are adjusted so that these three conditions occur.

In the case where $\vec{R}_1 \cdot (\vec{R}_1 + \vec{C}_{bias}) = 1$, which occurs, for example, when $\vec{C}_{bias} = 0$ and $\|\vec{R}_1\| = 1$, the vector representation at the hidden units is the same as the vector representation at the inner product units. That is,

$$f(\vec{V}_{John}) = (f(\vec{V}_{John}) \otimes \vec{R}_1) \cdot (\vec{R}_1 + \vec{C}_{bias}).$$

Consequently, the network reduces to a three-layer feedforward network, where any input-to-output mapping is representable (given sufficient hidden units), and therefore "potentially"4 learnable.
Role representations
Tensor representations also assume the existence of an external agent that generates
orthogonal role vectors for each component. Referring to the (John, Mary, first)
4As with the feedforward network, there is no guarantee that a multiple-layer network with non-linear units will learn an arbitrary function.
example, at the second time step, the error term is:

$$E = \|\vec{V}^T_{Mary} - g((f(\vec{V}_{John}) \otimes \vec{R}_1 + f(\vec{V}_{Mary}) \otimes \vec{R}_2) \cdot (\vec{R}_2 + \vec{C}_{bias}))\|,$$

where $\vec{V}^T_{Mary}$ is the target output for the object Mary; and $\vec{R}_2$ is its associated role representation. Thus, error goes to zero when, for example, the following four conditions occur:

1. $\vec{C}_{bias} \to 0$;

2. $\|\vec{R}_2\| \to 1$;

3. $\vec{R}_1 \perp \vec{R}_2$; and

4. $g(f(\vec{V}_{Mary})) \to \vec{V}^T_{Mary}$.
Thus, the network learns appropriate role vectors by adjusting weights which min-
imize an error function that goes to zero when the role vectors are orthogonal.
Cue representations
At the third time step, the error term is:

$$E = \|\vec{V}^T_{John} - g((f(\vec{V}_{John}) \otimes \vec{R}_1 + f(\vec{V}_{Mary}) \otimes \vec{R}_2 + f(\vec{H}_{bias}) \otimes \vec{R}_3) \cdot (h(\vec{Q}_{first}) + \vec{R}_3))\|,$$

where $\vec{H}_{bias}$ is the bias vector at the hidden units; and $\vec{Q}_{first}$ is the representation of the query requesting the first component. At this time step, error goes to zero when:

1. $\vec{R}_3 \to 0$;

2. $\vec{R}_1 \perp \vec{R}_2 \perp \vec{R}_3$;

3. $\|\vec{R}_1\| \to 1$;

4. $\|h(\vec{Q}_{first}) - \vec{R}_1\| \to 0$; and

5. $g(f(\vec{V}_{John})) \to \vec{V}^T_{John}$.
The error function drives the network to learn an appropriate cue vector that, when applied to the tensor representation by the inner product operator, results in an output representation of the first component John.
5.5 Evaluating the network
The lack of strong systematicity in the networks of the previous chapter was at-
tributed to the independence between the weights that access components from
different representational subspaces. A tensor representation scheme was suggested
as it has the property that multiple subspaces can be collapsed down to a single
subspace under the inner product operator. Thus, a weight dependency is intro-
duced across component subspaces as all components, independent of their role,
are mapped from the common internal subspace to the output. Therefore, learning
to extract a component representation in one position (role) generalizes to compo-
nent representations in other positions (roles). However, extraction of the correct
component requires the network to learn the appropriate component, role and cue
representations. In the tensor-recurrent network, this requirement means learning
the weights that will generate these vectors. The solutions to the error functions
(above) showed that these weights exist, however, they do not indicate whether
they will necessarily be learnt. For example, the error surface could contain local
minima preventing the network from learning a set of weights that generates strong
systematic behaviour. The purpose of this section is to test, through simulation,
the tensor-recurrent network's capacity to learn a set of weights that results in
strong systematicity on an inference task.
5.5.1 Querying of 3-tuples task
The task used in this section is based on simple sentences conforming to the struc-
ture agent-action-patient. Each component is presented to the network one per time step, after which the network is given one of three possible questions: Who
performed the action?; What was the action?; and Who was affected by the action?,
from which the network should respond with the agent, action, and patient com-
ponents, respectively. For example, the sentence-question pair "John loves Mary. Who performed the action?" would be represented by a sequence of four vectors
(representing John, loves, Mary, Who performed the action?, respectively) pre-
sented in that order to the network. At the fourth time step the correct response is
a vector representing John at the output layer. This task is similar to the querying
of 2-tuples task of the previous chapter, except that the name of the binary relation
(e.g., loves) is included in the input.
The agent and patient components are drawn from the set {Bill, John, Karen, Mark, Vivian}, and the action component is drawn from the set {calls, chases, loves}. That is, there are 5 (agents) × 3 (actions) × 5 (patients) × 3 (questions) = 225 possible sentence-question pairs. All components are encoded locally (i.e., by orthogonal vectors where one dimension has a value of 1 and the rest 0). The
network is said to have demonstrated strong systematicity if, having only seen Mary
in the agent position in the training set, for example, the network consistently
(above chance level) generalizes to cases where Mary is in the patient position in
the test set.
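For concreteness, the task corpus and its local encoding might be generated as follows (an added Python sketch; assigning the nouns and verbs to the 8 input dimensions in this particular order is an assumption, since only the unit counts are fixed by the simulations):

from itertools import product

import numpy as np

NOUNS = ['Bill', 'John', 'Karen', 'Mark', 'Vivian']
VERBS = ['calls', 'chases', 'loves']
QUESTIONS = ['agent?', 'action?', 'patient?']

WORDS = NOUNS + VERBS  # 8 input/output dimensions, one per word

def encode(word):
    """Local (one-hot) encoding: one dimension has value 1, the rest 0."""
    v = np.zeros(len(WORDS))
    v[WORDS.index(word)] = 1.0
    return v

# All 5 x 3 x 5 x 3 = 225 sentence-question pairs.
pairs = list(product(product(NOUNS, VERBS, NOUNS), QUESTIONS))
print(len(pairs))  # 225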
In the next subsection, simulations are run on the tensor-recurrent network to
test whether the network will exhibit strong systematicity of inference on this task.
5.5.2 Simulations
The number of units used in these simulations was given in Figure 5.3. The simu-
lation conditions for the tensor-recurrent network were as follows:
• Local encoding of input and output vectors. As there are 5 possible agent or
patient components, 3 possible action components and 3 possible questions,
8 input, 8 output and 3 question units were used.
• Random generation of 20 training examples from a training example distribution. A number of different training distributions were used (see Table 5.2),
where the overlap between objects that appeared in the agent and patient positions varied from 0 (no noun appeared in both positions) to 5 (every noun appeared in both positions). In each case, with the exception of 0 and 5 noun overlap, noun(s) were left out of the patient position only. It is possible that the effect on generalization, by omitting patient position nouns only, is due to the limited number of patient components in the training set, rather than the limited degree of overlap between agent and patient positions. Thus, in the case of one, two and three noun overlap, a second training distribution was used (see Table 5.3), where nouns were left out of both positions, therefore ensuring that at least 3 different nouns appeared in the agent and patient positions.
• Random initialization of variable network weights from a uniform distribution between -1 and 1.
• Training of the network using the standard error backpropagation algorithm with 0.1 learning rate, no momentum term, and the sum of squares error function (Rumelhart et al., 1986) until outputs were within 0.4 of their target activation for all output units on all training patterns. Training was terminated after 50000 epochs if this criterion was not met. Each word or question was presented to the network one per time step, with the context and tensor units being reset to zero at the beginning of each sequence. Weights were updated at the end of each sentence-question pair (i.e., every four patterns).
• Testing of the network on 100 test sequences randomly generated from a test example distribution. The testing distribution associated with each training distribution is given in Tables 5.2 and 5.3. Importantly, since generalization across position is being tested, the training and testing distributions were not the same5 (i.e., some nouns should not appear in some positions in the training set, but should appear in those positions in the test set). Two testing
5Except in the case where the training example distribution had an overlap of 5. This case, although not considered in the evaluation of strong systematicity, was included for completeness.
Table 5.2: Training and testing example distributions where nouns were omitted from the patient position in the case where there was between 1 and 4 nouns appearing in both positions. The 0 (no noun appearing in both positions) and 5 (every noun appearing in both positions) overlap cases were included for completeness. In the 0 overlap case, nouns were omitted from both positions. The table shows only agent and patient objects (i.e., B - Bill; J - John; K - Karen; M - Mark; V - Vivian). For the training sets, agent and patient objects were combined with every action and question. For the testing set, the superscript q indicates the set of objects that were queried, and therefore tested. The symbol × is the cartesian product operator generating all pair-wise combinations from elements in the respective sets.

Overlap  Train (agent, patient)       Test (agent, patient)
0        {B,J,K} × {M,V}              {M,V}^q × {B,J,K}^q
1        {B,J,K,M,V} × {B}            {B,J,K,M,V} × {J,K,M,V}^q
2        {B,J,K,M,V} × {B,J}          {B,J,K,M,V} × {K,M,V}^q
3        {B,J,K,M,V} × {B,J,K}        {B,J,K,M,V} × {M,V}^q
4        {B,J,K,M,V} × {B,J,K,M}      {B,J,K,M,V} × {V}^q
5        {B,J,K,M,V} × {B,J,K,M,V}    {B,J,K,M,V}^q × {B,J,K,M,V}^q
criteria were used. A response to a test sequence was considered correct when: (1) the maximally activated output unit had a target activation of one - maximum criterion; or (2) all output units were within 0.5 of their target activation - 0.5 criterion. During the test phase, performance on the auto-association of input was not considered. Its purpose was to encourage the formation of internal representations during training, as discussed in the previous section.
• Each train-test trial was repeated 5 times, with modifiable weights randomly initialized at the beginning of each trial.
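The stopping rule and the two testing criteria translate directly into code; a brief sketch (Python/NumPy; the function names are mine, added for clarity):

import numpy as np

def correct_maximum(output, target):
    """Maximum criterion: the most active output unit has target one."""
    return target[int(np.argmax(output))] == 1.0

def correct_half(output, target):
    """0.5 criterion: every output unit within 0.5 of its target."""
    return bool(np.all(np.abs(output - target) < 0.5))

def training_criterion_met(outputs, targets):
    """Training stops when all outputs are within 0.4 of their targets
    on all training patterns (or after 50000 epochs)."""
    return all(np.all(np.abs(o - t) < 0.4) for o, t in zip(outputs, targets))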
5.5.3 Results
In the case of the training and testing distributions from Table 5.2, the percentage
of correct responses (maximum criterion) to the question vector on the test set
averaged over 5 trials is given in Figure 5.4, where the bottom dashed line indicates
Table 5.3: Training and testing example distributions for the 1, 2 and 3 overlap cases, where nouns were omitted from both positions. The table shows only agent and patient objects (i.e., B - Bill; J - John; K - Karen; M - Mark; V - Vivian). For the training sets, agent and patient objects were combined with every action and question. For the testing set, the superscript q indicates the objects that were queried, and therefore tested. The symbol × is the cartesian product operator generating all pair-wise combinations from elements in the respective sets.

Overlap  Train (agent, patient)    Test (agent, patient)
1        {B,J,K} × {K,M,V}         {B,J,K,M^q,V^q} × {B^q,J^q,K,M,V}
2        {B,J,K,M} × {K,M,V}       {B,J,K,M,V^q} × {B^q,J^q,K,M,V}
3        {B,J,K,M} × {J,K,M,V}     {B,J,K,M,V^q} × {B^q,J,K,M,V}
chance level response. The graph shows results only for noun-position combinations
that did not occur in the training set. For example, when no noun appeared in
both agent and patient positions in the training set, the network achieved a mean
accuracy of 74%. When three or more of the possible five nouns appeared in both positions in the training set, the network was 100% accurate on the test set.
For the same distributions, the percentage of correct responses using the 0.5 testing criterion is given in Figure 5.5. Note that chance level performance for this testing criterion is $\frac{1}{2^5} \times 100\%$ ($\approx 3\%$), as there is a 50% chance of the activation of each output unit being on the "right" side of 0.5. For this criterion the network's mean accuracy was 64% with no overlap and 100% for three or more overlapping objects.
In the case of the training and testing distributions from Table 5.3, the percentage of correct responses using the maximum testing criterion and the 0.5 testing criterion are given in Figures 5.6 and 5.7, respectively.
All trials except one converged to the training criterion within 50000 epochs. The single non-convergent case occurred with the zero overlap training distribution. Training times for distributions given in Tables 5.2 and 5.3 are shown in
Figures 5.8 and 5.9, respectively.
Figure 5.4: Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.5: Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.6: Generalization (maximum criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.7: Generalization (0.5 criterion) as a function of the number of overlapping items where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3). The percentage correct on the test set was averaged over 5 trials. The dashed line indicates chance level performance. Error bars indicate 95% confidence intervals.
Figure 5.8: Mean training times as a function of overlap where items were omitted from the patient position in the 1 to 4 overlap cases (see Table 5.2). The zero overlap case does not include the single non-convergent trial. Error bars indicate one standard deviation.
Figure 5.9: Mean training times as a function of overlap where items were omitted from both positions in the 1 to 3 overlap cases (see Table 5.3). The zero overlap case does not include the single non-convergent trial. Error bars indicate one standard deviation.
5.5.4 Discussion and analysis
The point of these simulations was to test whether the tensor-recurrent network
exhibits generalization across position above chance level with respect to this task
(i.e., strong systematicity of inference). For the training distributions used in Table
5.2, the network showed perfect generalization to novel object-positions on all trials
where there was an overlap of three or more nouns in both positions. Perfect
generalization occurred for maximum and 0.5 testing criteria. Since the degree
of generalization was consistently above chance level, the network is said to have
exhibited strong systematicity of inference with respect to this task.
The strong systematicity exhibited by the network was a consequence of three
factors: (1) architecture; (2) learning; and (3) environment (i.e., training set dis-
tribution). Each of these factors is discussed in turn.
Architecture
Strong systematicity was a consequence of separating the responsibility for repre-
senting component and position information. Partly, this separation was due to
the architecture whereby component information was represented (independent of
its position) at the hidden units, and position information was represented (inde-
pendent of its component value) at the role units. The component-independence
of position information was encouraged by the bottleneck units, which attenuated the effect of the current input on the state.
Learning
The separation of component and position information, however, was also a result
of the network's learning dynamics. It was assumed that backpropagation of error
would orthogonalize the role vectors (which is crucial for the correct extraction
of components). During one trial the activation vectors for the role units were
recorded while training on a data set generated from a distribution where three
nouns appeared in both agent and patient positions (see Table 5.2). Table 5.4
Table 5.4: Orthogonality between agent, action and patient role vectors, measured as one minus the magnitude of the normalized dot product. Role vector $\vec{R}'_i = \vec{R}_i + \vec{C}_{bias}$, where $\vec{R}_i$ is the activity vector at the role units, and $\vec{C}_{bias}$ is the activity vector at the cue units resulting from the bias terms.

Weight updates   (R'_agent, R'_action)   (R'_agent, R'_patient)   (R'_action, R'_patient)
0                0.1                     0.1                      0.0
1000             0.8                     0.7                      0.0
4900             0.8                     0.8                      0.9
shows how the role vectors6 become progressively more orthogonal with training.
(The measure being one minus the magnitude of the normalized dot product so that
zero implies collinear and one implies orthogonal for non-zero vectors.) Before any
training (i.e., at zero weight updates), the role vectors associated with the agent
and action components were almost collinear, as indicated by the orthogonality
measure of 0.1. However, after 1000 weight updates, the two role vectors became
more orthogonal, as indicated by an orthogonality measure of 0.8. At this stage
of training, the action and patient role vectors were still collinear (as indicated by
the orthogonality measure of 0.0). However, after 4900 weight updates, the two
role vectors orthogonalized (with an orthogonality measure of 0.9). It is not nec-
essary that the role vectors be perfectly orthogonal since the non-linear activation
function at the output layer is capable of masking out residual vectors. This table
demonstrates that before training the network was not systematic. The collinearity
of the action and patient role vectors, for example, means that the network could
not have extracted the associated verb without extracting the noun in the patient
position. Therefore, the strong systematicity exhibited by the network was also a
consequence of learning.
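For concreteness, the orthogonality measure used in Table 5.4 can be computed as follows (a minimal sketch in Python, not part of the original simulations; the vectors shown are illustrative, not the recorded role unit activations):

    import numpy as np

    def orthogonality(u, v):
        # One minus the magnitude of the normalized dot product:
        # 0 implies collinear, 1 implies orthogonal (for non-zero vectors).
        return 1.0 - abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

    r_action  = np.array([0.2, 0.9, 0.1])   # hypothetical role vectors
    r_patient = np.array([0.3, 0.8, 0.2])
    print(round(orthogonality(r_action, r_patient), 2))   # near 0: almost collinear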
The network was also required to learn the appropriate cue vectors so that com-
ponent representations could be extracted from the tensor representation. During
the same trial, the activation vectors at the cue units were also recorded. Table 5.5
6. Where the role vector R'_i = R_i + C_bias.
Table 5.5: Collinearity between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions, measured as the magnitude of the normalized dot product. Role vector R'_i = R_i + C_bias, where R_i is the activity vector at the role units and C_bias is the activity vector at the cue units resulting from the bias terms. At time step 4, cue vector C'_i = C_i + R_4, where C_i is the activity vector at the cue units, and R_4 is the activity vector at the role units at time step 4.

Weight updates   (C'_agent, R'_agent)   (C'_action, R'_action)   (C'_patient, R'_patient)
0                0.8                     0.9                      0.9
1000             0.1                     1.0                      1.0
4900             0.8                     0.9                      1.0
Table 5.6: Collinearity before training between cue (at time step 4) and role vectors concerned with the extraction and construction (respectively) of the agent, action and patient positions, measured as the magnitude of the normalized dot product. Role vector R'_i = R_i + C_bias, where R_i is the activity vector at the role units and C_bias is the activity vector at the cue units resulting from the bias terms. At time step 4, cue vector C'_i = C_i + R_4, where C_i is the activity vector at the cue units, and R_4 is the activity vector at the role units at time step 4.

              R'_agent   R'_action   R'_patient
C'_agent      0.8        0.9         1.0
C'_action     0.8        0.9         0.8
C'_patient    0.6        0.8         0.9
shows how the cue vectors7 line up with the appropriate role vectors with training.
(The measure is the magnitude of the normalized dot product.) For example,
after 4900 weight updates, the cue vector for extracting the patient component was
collinear with the role to which the patient component was composed (indicated
by the collinearity measure of 1.0). Before training, cue vectors were collinear with
their associated role vectors. However, the cue vectors were also collinear with
the other role vectors as indicated in Table 5.6. Thus, before training it was not
possible to extract the target component without extracting other components.
7. Where, at time step 4, the cue vector C'_i = C_i + R_4.
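The consequence of non-orthogonal role vectors for extraction can also be seen directly in the tensor scheme (a sketch with made-up vectors; in the network all of these vectors are learnt activations). A cue collinear with two role vectors extracts a blend of both components, whereas an orthogonal role gives a clean extraction:

    import numpy as np

    john = np.array([1.0, 0.0])                  # component (token) vectors
    mary = np.array([0.0, 1.0])
    r1 = np.array([1.0, 0.0, 0.0])               # role vector for position 1
    r2_collinear  = np.array([0.9, 0.1, 0.0])    # nearly collinear with r1
    r2_orthogonal = np.array([0.0, 1.0, 0.0])

    for r2 in (r2_collinear, r2_orthogonal):
        # Bind components to roles (outer product) and superpose the bindings.
        T = np.outer(john, r1) + np.outer(mary, r2)
        # Extract position 1 by taking the inner product with the cue r1.
        print(T @ r1)   # [1. 0.9]: blended; [1. 0.]: clean extraction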
Environment
The third factor influencing the demonstration of strong systematicity of infer-
ence was the environment in which the training examples were generated (i.e., the
training example distribution). With the training distributions given in Table 5.2,
perfect generalization across position was exhibited on all trials when there was an
overlap of three or more nouns in the agent and patient positions. However, when
the number of overlapping items was reduced to two nouns, generalization across
position dropped sharply. In the two item overlap case, the network no longer ex-
hibited above chance level performance with greater than 95% confidence for both
maximum (Figure 5.4) and 0.5 (Figure 5.5) testing criteria. Performance in the
one item overlap case was similar for the maximum testing criterion (Figure 5.4),
and worse for the 0.5 testing criterion (Figure 5.5).
In the one and two item overlap cases, there were only one and two objects
(respectively) appearing in the patient position in the training set. Correct ex-
traction of components requires the network to learn to generate a common role
vector for the patient8 position. With only one or two objects in the patient po-
sition, the network did not generalize to the case whereby a common role vector
was generated for the remaining objects in the patient position. Since only one
cue vector is generated per question, it becomes increasingly difficult to maintain
similarity between the single cue vector and multiple role vectors generated for the
same position. Consequently, performance decreased in these cases.
In the one, two and three item overlap cases, training distributions were changed
so that at least three objects appeared in both agent and patient positions while
the same degree of overlap was maintained (see Table 5.3). For example, in the
one item overlap case, Bill, John and Karen appeared in the agent position and
Karen, Mark and Vivian appeared in the patient position. With these training
distributions, test performance improved to the point where, with at least 95%
confidence, the network generalized to novel object-positions with an accuracy of
8. The network must also learn to generate a common role vector for the agent and action positions.
at least 74% (maximum testing criterion, see Figure 5.6), and at least 64% (0.5
testing criterion, see Figure 5.7). Performance was also higher in the two overlap
case.
The number of network weights also influences the capacity to generate common
role vectors for each position. The bottleneck units were introduced to attenuate the
effect of the input on the generation of state, and therefore role vectors. Increas-
ing the number of bottleneck units increases the degrees of freedom, and therefore
the sensitivity of the role vectors to the current input. In other words, it is more
likely that the generated role vectors are specific to component, rather than posi-
tion information. Therefore, an increase in these weights decreases the degree of
generalization across position.
The network also demonstrated above chance level performance, with greater
than 95% confidence, for both maximum and 0.5 testing criteria when there was
no overlap across agent and patient positions. This result means that the network
will also demonstrate generalization from agent to action, and from action to agent
or patient positions. Thus, it raises the question of whether the network has too
strong an architectural bias. Should generalization across position occur when there
is no overlap? For example, Hadley cites the case where children generated new
verbs from previously learnt nouns (e.g., It was bandaided). Clearly, the degree
to which people generalize across position is an empirical question, and one not
addressed in Hadley's definition of systematicity. Hadley's concern, and the concern
of this thesis, has been to establish some above chance level degree of generalization
across position. However, in the next chapter, architectural properties that permit
generalization across position only in the case of overlap are discussed.
5.6 Summary and conclusions
In the previous chapter, it was shown that a common property preventing a demon-
stration of strong systematicity of inference in the Connectionist models examined
was an independence between access weights (i.e., the weights that implement the
functions that access component representations from the various positions within
a complex representation). It was suggested that a dependency between these
weights would be sufficient to demonstrate strong systematicity of inference. The
purpose of this chapter was to implement and test such a dependency in the form
of the tensor-recurrent network.
Weight dependency was implemented in the tensor-recurrent network by ac-
cessing component representations via the dot product operator in tensor calculus.
In this way, the multiple subspaces provided by tensor units, which hold repre-
sentations of complex objects, can be collapsed down to a single subspace from
which component representations may be extracted independent of their position.
The dependency across access weights was achieved as access was, ultimately, via
the same set of weights. The use of tensors assumes appropriate component and
role representations (for the construction of tensor representations), and cue rep-
resentations (for the extraction of component representations). Simulations showed
how these representations were learnt in the tensor-recurrent network by back-
propagating error signals through the tensor units to the weights that generate the
component, role and cue vectors. Successful extraction of component representa-
tions also relies on the network generating common (i.e., item independent) role
vectors for each position within a complex object. Simulation results showed that
item independent role vectors were learnt when sufficient objects appeared in each
position in the training set. The number of weights implementing the function that
generates role vectors influences the degree of generalization across position. Thus,
from the simulations it was concluded that the architectural biases assumed in the
tensor-recurrent network, which were, together, sufficient for strong systematicity
of inference on the querying of 3-tuples task, were:
- tensor units, which construct representations of complex objects as the outer
product of internal component and role representations;
- inner product units, which collapse the multiple representational subspaces
down to a single subspace from which component representations were ex-
tracted;
- backpropagation learning dynamic, which allows the appropriate internal
component, role and cue representations to be learnt; and
- component-independent, position-dependent generation of role vectors.
The tensor-recurrent network was designed specifically to address the issue of
strong systematicity of inference. It may be that the network is too systematic, in
that it generalizes to cases not supported by empirical evidence. This possibility
was suggested when the network demonstrated some degree of generalization across
position in the case where there was no overlap across agent and patient positions
in the training set. In the next chapter, the extent to which the assumptions
made in the tensor-recurrent network can be relaxed is discussed. In addition,
the significance of the work in this thesis is discussed in terms of approaches to
cognitive modeling and the relationship of this work to the Classical paradigm.
Chapter 6
Discussion
6.1 Introduction
The purpose of this chapter is to reflect on the work presented in the previous
chapters. Specifically, a discussion is given in relation to: the implications of these
results for Connectionist (and other) approaches to cognitive modeling (section 2); a
limitation of the tensor-recurrent network solution, and a possible extension which
may address the problem (section 3); and the relationship of the Connectionist
approach presented here to the Classical paradigm (section 4). A summary is
provided in section 5.
6.2 Implications
The problem that systematicity posed for the Connectionist approach to cogni-
tion was an explanation as to how the acquisition of systematic behaviour could
be a necessary property of a Connectionist architecture. That is, in terms of a
learning framework, how can learning to represent and process some instances of a
structured object necessarily generalize to other instances conforming to the same
structure? In this section, the implications arising from the results obtained in
addressing this question are discussed.
6.2.1 Same training and testing distribution assumption
In chapter 3, the problem of the necessary acquisition of systematic behaviour was
framed in terms of probably approximately correct (PAC)-learnability. That is, a
Connectionist model was considered to have necessarily acquired systematic be-
haviour over some suitably structured domain when a high degree of generalization
over that domain was obtained, with a high degree of confidence (i.e., over repeated
trials) with at most a polynomial amount of computational resource in some pa-
rameter that measures the size of the behaviour being learnt. Using this criterion,
the feedforward network was shown to demonstrate systematicity with respect to
the auto-association of N -tuples task.
Although the feedforward network demonstrated systematicity as defined in
terms of PAC-learnability, it was suggested that this definition was too weak for
the purposes of cognitive models. For although this de�nition is suitable in terms
of computational feasibility (i.e., any model requiring more than polynomial re-
source is impractical as there is rarely enough resource available to learn even
moderately sized behaviours), it did not consider the amount of resource actually
required. Consequently, a model may be computationally feasible, but still require
more resource than is ever needed by people. Thus, an alternative de�nition of
systematicity was considered, which was Hadley's strong systematicity, whereby a
model is said to exhibit the acquisition of systematicity if it shows above chance
level generalization to component-position combinations on which it had never been
trained (i.e., generalization across position). In chapter 4, the same feedforward
network was examined on the same task for strong systematicity. It was found that
the feedforward network could not exhibit strong systematicity. The implication
of this result is concerned with an assumption regarding the training and testing
distributions.
The PAC-learnability framework generally assumes that the examples on which
the network was trained are drawn from the same distribution as the examples used
for testing generalization1. However, a test for strong systematicity, by definition,
1. Although Bartlett (1992) has considered the case where the training-testing distribution changes by a small constant.
requires the training and testing distributions to be different. More specifically, if a
model is to demonstrate strong systematicity then there must be objects from the
domain that are drawn from a training distribution Ptrain, such that Ptrain(Ri =
x) = 0 (i.e., the probability of object x appearing in role/position Ri is zero),
but are drawn from a test distribution Ptest, such that Ptest(Ri = x) 6= 0 (i.e.,
the probability of the same object (x) appearing in same role/position Ri is not
zero). Thus, the implication for Connectionist (and other learning) approaches
to cognitive modeling is that criteria derived from the PAC-learning framework,
where there is an assumption that the training and testing distributions are the
same, are potentially weak criteria in that they may admit cognitive models that
cannot account for particular behaviours, as was the case, for example, with the
three-layer feedforward network with respect to strong systematicity2.
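The condition can be made concrete with a small sketch (the item names below are hypothetical; the point is the construction of two distributions with Ptrain(patient = mary) = 0 but Ptest(patient = mary) ≠ 0):

    # Training set: 'mary' never occurs in the patient (second) position.
    agents   = ['john', 'mary', 'bill', 'karen']
    patients = ['john', 'bill', 'karen']          # 'mary' withheld from training

    train = [(a, p) for a in agents for p in patients if a != p]
    test  = [(a, 'mary') for a in agents if a != 'mary']

    assert all(p != 'mary' for _, p in train)     # Ptrain(patient = mary) = 0
    assert all(p == 'mary' for _, p in test)      # Ptest(patient = mary) = 1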
6.2.2 Structured networks
Traditionally, the approach taken in Connectionism has been to specify general-
purpose, computationally-sufficient3 architectures; and then provide a learning dy-
namic, which allows for the acquisition of some interesting behaviour. Numerous re-
sults already exist regarding the capacity of Connectionist networks to compute ar-
bitrary functions under a variety of basis functions. For example, it has been shown
that feedforward networks with sufficient sigmoidal (Funahashi, 1989), squashing
(Hornik et al., 1989), and gaussian (Hartman, Keeler, & Kowalski, 1990) units are
universal approximators. Furthermore, Sato et al. (1990) presented a recurrent
network as a universal approximator to a general class of dynamical systems, and
Siegelman and Sontag (1991) proved that a recurrent network with finite sigmoidal
units is computationally equivalent to a universal Turing machine. The implica-
tion of the negative results of the three-layer feedforward and recurrent networks of
chapter 4 is that it is not sufficient to specify computationally-general mechanisms,
2. Amsterdam (1988) has criticized the formal learning approach, from a philosophical perspective, on the basis that it only considers learning from naturally occurring examples, and therefore does not consider the possibility of acquiring concepts that do not occur naturally.
3. In the sense that, with sufficient internal units, a network can compute an arbitrary function.
one must also consider behaviour-specific biases. For example, although the feed-
forward network in chapters 3 and 4 has sufficient resources to represent (and even
learn) behaviours over structured domains, it requires too many examples to be an
adequate model in regard to the necessary acquisition of systematic behaviour.
Of course, it has generally been recognized that Connectionist learning models
require some form of architectural bias for the acquisition of complex behaviours.
The important question has been, and still is: what sort of biases are necessary, or
even sufficient for the acquisition of these behaviours? In chapter 4, it was shown
that the property preventing the networks from exhibiting strong systematicity
was the independence between the weights that access component representations
from the various roles within a complex representation. In chapter 5, weight de-
pendence was introduced into the tensor-recurrent network. The characteristic
di�erence between the tensor-recurrent network and the networks in chapter 4 was
that components were accessed via a common representational subspace. Thus,
learning to access a component object representation from one position general-
ized to other positions. This characteristic property was also evident in the simple
recurrent network in a demonstration of strong systematicity of representation.
The architectural organization of the tensor-recurrent network was sufficient to
demonstrate strong systematicity on an inference task. However, it was suggested
that the network was too strongly biased in that it exhibited generalization not
apparent in the training data (i.e., when there was no overlap). The implication
of this result is that the network may not exhibit appropriate behaviour where
generalization across position is not required. In the next section, this limitation
is discussed along with a possible extension to address this limitation.
6.3 Limitation and possible extension
In the three-layer networks analyzed in chapter 4, components were extracted from
complex representations, held at the hidden units, via a single layer of weighted
connections linking the hidden and output units. Figure 6.1 characterizes the ex-
traction process for such networks, where the dashed arrows indicate modifiable
Figure 6.1: Characterization of the extraction of component representations from positions 1 and 2 for the three-layer networks in chapter 4 (i.e., with one hidden layer). Dashed arrows indicate modifiable-weighted connections. The diagram shows only the hidden and output layers.
connections. In the tensor-recurrent network, component representations were ex-
tracted via an intermediate layer (the inner product units), which extracted the
desired component subspace from the tensor units. Figure 6.2 characterizes the
extraction process for the tensor-recurrent network, where solid arrows indicate
fixed-weighted connections. The tensor-recurrent network assumed, as one of its
architectural biases, unit connectivity for implementing the inner product. Thus,
the process of collapsing subspaces down to a single subspace was already provided,
regardless of the degree of overlap evident in the training set. Therefore, the net-
work was able to exhibit some degree of generalization across position in the case
where there was no overlap.
One alternative is to make the weighted connections between the two internal
layers of the tensor-recurrent network learnable. In this way, there is the potential
for the network to learn weights which collapse subspaces only in the case where
there was a su�cient degree of overlap in the training set.
The first point to consider is whether a network with modifiable weights between
two hidden layers can demonstrate strong systematicity of inference. Suppose a
network has a similar internal organization to that of the tensor-recurrent network,
except that the fixed weights between the two internal layers are replaced with
modifiable weights (Figure 6.3). Suppose also that the resources of this network
are such that there are two units per position in the first internal layer and two units
in the second layer, and that activation functions of the second internal layer and
Figure 6.2: Characterization of the extraction of component representations from positions 1 and 2 for the tensor-recurrent network. Dashed arrows indicate modifiable-weighted connections. Solid arrows indicate fixed-weighted connections. The diagram shows only the two hidden layers (i.e., tensor and inner product) and the output layer.
output units are the identity and sigmoidal functions, respectively. Thus, two units
are sufficient to represent all possible components in a given position (Kruglyak,
1990). The network must learn the weights between the three layers of units to
demonstrate strong systematicity of inference on the querying of 2-tuples task.
Assume also that the network has learnt to buffer the input onto the two sub-
spaces so that the relative positions within each of the subspaces are the same. Then
the question is: Can the network learn the extraction part of the inference task
without having to be trained on all components in both positions?
Suppose, for example, there are six possible components in each position, and
the network is trained on examples where all components appeared in the first
position, but not in the second position. After training, the network has learnt the
mapping depicted in Figure 6.4. The diagram assumes that the network has also
learnt to select, depending on the query, between the two subspaces by the time
they are represented at the first hidden layer (i.e., only one of the two components
is active at the first hidden layer). The question is: how many components in the
second position are required to learn the same mapping from the second subspace
as was learnt from the first subspace?
Figure 6.3: Characterization of the extraction of component representations from positions 1 and 2 for a network with two hidden layers linked by modifiable-weighted connections. Dashed arrows indicate modifiable-weighted connections. The diagram shows only the two hidden layers and the output layer.
The answer is at most three components. The weights from the first to the
second hidden layer perform a linear transformation. In the case where the network
has been trained on no more than two components in the second position there
are an in�nite number of weight vectors that satisfy the mapping speci�ed by
the training set (Figure 6.5(a)). In the case where there were three components,
there is only one solution (Figure 6.5(a)), and therefore the network generalizes to
the other three positions. By sharing a common representational subspace at the
second layer, the network has already learnt the mapping from the second hidden
layer to the output layer for all components since all components appeared in the
�rst position and were mapped through the same subspace. Thus, the network
has the possibility of exhibiting strong systematicity, given also the buffering and
querying assumptions. The important condition is that the dimensionality of the
positional and common subspaces is less than the number of possible components.
If there are as many dimensions as components then some dimensions would only be
trained on one type of point, and therefore there would not be enough information
in the training set to uniquely specify the mapping. For this reason, the three-
layer networks in chapter 4 could not demonstrate strong systematicity, since the
Figure 6.4: An example of the internal organization of a network, with two units per subspace in the first internal layer and two units for the common subspace in the second internal layer, learning the querying of 2-tuples task. Dashed arrows indicate modifiable-weighted connections. Solid squares indicate training points and empty squares indicate test points. Lines through the common subspace indicate characteristic positioning of hyperplanes for the correct extraction of component representations by the output units. Dashed circle indicates the characteristic positioning of training points for an arbitrary number of components per position.
number of output units was the number of components (which was a condition
on the external representations of objects since there should not be any a priori
similarity between external object representations).
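This counting argument can be checked numerically (a sketch; it assumes, as in the text, that the first-to-second layer mapping is linear with a bias term, i.e. an affine map from two position units to two common-subspace units, giving six free parameters):

    import numpy as np

    def design_matrix(points):
        # Each input point (x1, x2) of the affine map y = Wx + b contributes
        # two equations, one per output unit; the six unknowns are
        # W11, W12, W21, W22, b1, b2.
        rows = []
        for x1, x2 in points:
            rows.append([x1, x2, 0, 0, 1, 0])
            rows.append([0, 0, x1, x2, 0, 1])
        return np.array(rows, dtype=float)

    two_components   = [(0.0, 0.0), (1.0, 0.0)]
    three_components = two_components + [(0.0, 1.0)]   # third, non-collinear

    print(np.linalg.matrix_rank(design_matrix(two_components)))    # 4 < 6: underdetermined
    print(np.linalg.matrix_rank(design_matrix(three_components)))  # 6: unique affine map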
There are a number of ways in which the dimensionality of component subspaces
could be enforced, or at least encouraged. One way is to limit the number of units
in the second hidden layer. The difficulty with this approach is that it is generally
more difficult to learn a function with fewer weights as the solution space is smaller
and there are fewer trajectories which lead to a solution. Alternatively, excessive
weights can be provided initially, and subsequently removed during the course of
training where they have been shown to be irrelevant. Weights can be effectively removed
Figure 6.5: Number of components in the second position required in the training set to demonstrate strong systematicity. With only two components (a) there is not enough information to constrain the possible mappings. Two possible solutions which satisfy the mapping specified by the training points (solid squares) are shown. The correct mapping can be uniquely described with three components (b). Empty squares indicate test points. Dashed arrows indicate modifiable-weighted connections which implement the mapping. Dashed circle indicates the characteristic positioning of training points for an arbitrary number of components per position.
with the use of an additional error term that penalizes unnecessary weights, as was
shown, for example, in the work of Nowlan and Hinton (1991) and Zemel (1993).
The error function not only encourages the network to find the target function, but
to do so with the fewest necessary weights.
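As a much simplified illustration, such a penalty can be written as an extra term added to the task error (a sketch; the particular ratio penalty used here is one simple choice, not Nowlan and Hinton's scheme, which places a mixture-of-Gaussians prior over the weights):

    import numpy as np

    def penalized_error(output, target, weights, lam=0.01):
        # Task error plus a complexity term that behaves like weight decay
        # for small weights but saturates for large ones, so that weights
        # the task does not need are pushed towards zero (and can be pruned).
        task_error = 0.5 * np.sum((output - target) ** 2)
        complexity = np.sum(weights ** 2 / (1.0 + weights ** 2))
        return task_error + lam * complexity

    w = np.array([0.01, -3.0, 0.5])
    print(penalized_error(np.array([0.2]), np.array([0.0]), w))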
Thus, with respect to the querying of 2-tuples example, it is possible for the
network with two hidden layers to be given an excessive number of units (i.e.,
greater than two in the second hidden layer), but make use of only a small portion
of those units. In doing so, it is possible for the network to learn to collapse the
subspaces from the previous layer to a single common subspace where the dimen-
sionality is less than the number of possible components, and therefore demonstrate
generalization across position.
Of course, it would require simulations to verify whether such additional terms
in the error function would result in a demonstration of strong systematicity. How-
ever, error terms that reduce the number of weights could still result in solutions
where all subspaces are collapsed to a common subspace, since this solution has
the fewest number of weights (i.e., two hidden units in the second internal layer are
sufficient to encode arbitrarily many components independent of their position).
What is required is to collapse component subspaces only in the cases where
there is overlap evident in the training set. In the case where there is no overlap,
the subspaces should remain separate. Figure 6.6 characterizes this situation for
the agent-action-patient domain. In the early stages of training (Figure 6.6(a)),
when there is no apparent or significant4 degree of overlap between positions, com-
ponent subspaces remain separate. This situation is likely to occur, for example,
when the network is given excessive numbers of units so that there is no necessity
for the dimensionality of each subspace to be less than the number of possible
components. In the later stages of training, when overlap has been detected, the
network reorganizes the weights associated with the overlapping subspaces so as
to map them onto a common subspace (Figure 6.6(b)). In this way, generalization
across position only occurs with components in the overlapping positions. Gen-
4. One could consider a threshold that the degree of overlap must exceed before subspaces are collapsed.
Figure 6.6: Process of redescribing internal representations. With three subspaces at the second internal layer (a), the network can perform the task, but without demonstrating strong systematicity. By collapsing the two overlapping subspaces (i.e., agent and patient) the network demonstrates strong systematicity, but only with respect to these two positions, not with respect to the action position (b). Dashed arrows indicate modifiable-weighted connections.
eralization to novel components in the action position would not occur since the
network has never been trained to map the new component in the action position
(in the second internal layer) to the output. Recall that each output unit is
dedicated to a single component (by the requirement that there is no a priori simi-
larity between the external representations of components). Therefore, the weights
from the action subspace (second layer) to the output unit have only been trained
on one type of point (i.e., points without the new component).
The question, of course, is how could such training behaviour be a property of a
Connectionist architecture? One possibility is to incorporate an objective function
that encourages this behaviour. In designing an appropriate objective function, one
needs to consider the cases in which representational subspaces should be brought
together, and when they should not. Two subspaces, or internal representations
within those subspaces, should be brought together when they map to the same out-
put. For example, suppose vector representations x_1 and x_2 map to the same (or
similar) output vector y. Then, the weights which generated x_1 and x_2 should
be changed so as to reduce the distance between those two vectors (i.e., collapsing
their representations down to a single vector representation). Note that in the
standard objective functions, which only consider target and network output, the
two vectors will not necessarily come together as a result of error backpropagation,
because in an underconstrained network there are many points in the domain that
map to the same point in the co-domain. So long as the common point (y) in the
co-domain is near the target, the two vectors (x_1 and x_2) will not necessarily be
brought together through training. In the other case, where x_1 and x_2 do not map
to the same output, training would proceed as usual (i.e., by reducing the difference
between output and target vectors independent of each other).
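A minimal sketch of such an additional term (hypothetical; as noted above, no such term was simulated in this thesis) penalizes the distance between internal vectors whose targets match:

    import numpy as np

    def redescription_penalty(hidden_vectors, targets, lam=0.1):
        # Add a penalty on the distance between internal representations
        # x1, x2 that map to the same (or a sufficiently similar) target y,
        # encouraging their subspaces to collapse together.
        penalty = 0.0
        for i in range(len(hidden_vectors)):
            for j in range(i + 1, len(hidden_vectors)):
                if np.allclose(targets[i], targets[j], atol=0.1):
                    penalty += np.sum((hidden_vectors[i] - hidden_vectors[j]) ** 2)
        return lam * penalty

    xs = [np.array([1.0, 0.0]), np.array([0.8, 0.1])]
    ys = [np.array([1.0]), np.array([1.0])]      # same target output
    print(redescription_penalty(xs, ys))          # non-zero: pull x_1, x_2 together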
The interesting point about such a process is its potential relationship to
Karmiloff-Smith's (1992) representation redescription hypothesis. Karmiloff-Smith
hypothesizes that cognitive development undergoes a number of phases (possibly
concurrent) of which behavioural mastery is just one. During the acquisition of a
competent level of performance on some task, additional processes are redescribing
the behaviour into representations that are explicit relative to some other processes.
This process of redescription allows for the transfer of ability on one task to other
structurally related tasks. The transfer of ability is analogous to the way proce-
dures and functions are reusable by other procedures and functions in programming
languages. The process described above has a similar quality. The network in Fig-
ure 6.6(a) has the capacity to learn the target behaviour, but not in a strongly
systematic way with its current organization of internal representations. However,
if the representations are reorganized (redescribed) as in Figure 6.6(b), then the
network has the capacity to demonstrate strong systematicity with respect to the
agent and patient positions.
6.4 Relationship to the Classical paradigm
Having demonstrated a network that is strongly systematic, what then is its re-
lationship to the Classical paradigm? In the Classical explanation of cognition,
cognitive representations are symbols5, or symbol structures (i.e., symbols that
5. The symbols, by themselves, are abstract entities. The physical realization of a symbol may be, for example, a 1011 state of a register, or a 0.4 mV activation potential of a neuron.
are composed of other more primitive symbols, for example aRb), and cognitive
processes are sensitive to the structure of those representations. Since processing
capacity is defined with respect to symbol structures, all objects conforming to that
structure are processable. For example, a symbolic process called first, defined as
first: (XY) → X, takes the structured symbol XY, composed of two symbols X
and Y, and returns the symbol X. The symbol XY stands in some correspondence
relationship to a complex object, which is external to the cognitive system, such
that its component objects correspond to the more primitive symbols X and Y .
The structural relationship between component symbols mirrors a structural rela-
tionship between component objects. Thus, the symbol XY may correspond to any
external complex object with the same relationship between its component objects.
The important point is that the computational processes of Classical architectures
are defined at the level of symbols, and therefore the capacities of these architectures
come in clumps defined by the symbols on which they operate. So, systematicity
is built into a Classical architecture.
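The clumping of capacities is easy to see in code (a trivial sketch): once first is defined over the structure, it applies to every instance of that structure, whatever the component symbols happen to be.

    def first(pair):
        # Defined over the structure (X, Y); applies to all of its instances.
        x, _ = pair
        return x

    print(first(('a', 'b')))          # 'a'
    print(first((3, 4)))              # 3
    print(first((('a', 'b'), 'c')))   # ('a', 'b'): components may themselves be complex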
In the tensor-recurrent network, external representations are mapped via
weighted connections to internal subspaces. After the network has been trained
to the point where it behaves systematically on the querying of 3-tuples task, the
internal representations generated by the network are examples of symbolic repre-
sentations. For instance, at the first time step, a similar role vector is generated
regardless of the object that is presented to the network. The first role vector (R_1)
is a symbol denoting the first component object, and its physical realization is its
role unit activations (e.g., 0.1 0.9 0.5 0.7). At the same time, the external object is
also mapped to a hidden unit vector, for example H_John, which denotes in this case
the object John. The physical realization of this symbol is its corresponding hidden
unit activations. The role symbol is an example of a type (i.e., a symbol denoting
a set of more primitive symbols), and the symbol (realized at the hidden units)
is an example of a token (i.e., element from a set of symbols, or a type instance).
The binding of the token to its type is performed by the outer product operator
and represented at the tensor units. The construction of more complex symbols
is achieved by adding the resulting tensors. Thus, the physical realization of the
complex symbol "John loves Mary" is its corresponding tensor unit activations.
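The construction just described can be written out directly (a sketch with hand-chosen orthogonal vectors; in the network both the token and role vectors are learnt activations):

    import numpy as np

    john, loves, mary = np.eye(3)    # token vectors (hidden unit symbols)
    r1, r2, r3 = np.eye(3)           # role vectors for positions 1, 2, 3

    # Bind each token to its role with the outer product, and superpose the
    # bindings: this tensor is the physical realization of "John loves Mary".
    T = np.outer(john, r1) + np.outer(loves, r2) + np.outer(mary, r3)

    # A cue collinear with a role extracts the bound token (inner product).
    print(T @ r3)   # -> mary's vector, whatever object happens to fill role 3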
The similarity between a Classical architecture and the tensor-recurrent net-
work would be of descriptive value only except that the symbols themselves are
accessible by other processes within the network. Given an appropriate cue vector,
a symbol is selected from a more complex symbol at the tensor units by the inner
product operator, implemented by appropriate connectivity between cue, tensor
and inner product units. It is because the constituent symbol can be extracted
independently of the object to which it currently corresponds that the network
exhibits systematicity.
It would appear that the tensor-recurrent network solution o�ers no more than
a Connectionist implementation of a Classical architecture. However, there is one
important difference, and that difference is concerned with the acquisition of sys-
tematic behaviour. Classical architectures, by definition, operate at the level of
symbols and symbol structures. The grounding of symbols is assumed to occur
by some perceptual level process that maps external stimuli to their corresponding
symbols. Beyond that, cognitive behaviour is a process of symbol manipulation.
Therefore, learning in a Classical architecture, like any other behaviour is a pro-
cess of symbol manipulation. For example, learning the structure sensitive process
first (see above) can only occur as a consequence of generating and testing possible
symbolic mappings. Possible mappings are XY → X; XY → Y; and XY → YX,
which the architecture could either accept or reject on the basis of environmental
evidence. The problem with this form of learning is that the architecture can only
progress from one systematic behaviour to another. At no point is there a transi-
tion from unsystematic to systematic behaviour. The lack of progression is because
the perceptual system on which a Classical architecture is grounded is
not accessible to a Classical learning system.
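A caricature of this generate-and-test learning (a sketch; the hypothesis names are for illustration only) makes the point plain: every hypothesis in the space is itself already systematic, so the learner only ever moves between systematic behaviours.

    # Candidate structure-sensitive mappings over pairs (X, Y).
    hypotheses = {
        'XY -> X':  lambda x, y: x,
        'XY -> Y':  lambda x, y: y,
        'XY -> YX': lambda x, y: (y, x),
    }
    evidence = [(('a', 'b'), 'a'), (('c', 'd'), 'c')]   # observed input/output pairs

    accepted = [name for name, f in hypotheses.items()
                if all(f(*xy) == out for xy, out in evidence)]
    print(accepted)   # ['XY -> X']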
By contrast, in the tensor-recurrent network, the perceptual system (i.e., the
weights that map external representations to symbols) is part of the learning pro-
cess. Accessibility is granted in the tensor-recurrent network by linking external
stimuli to their symbols via continuously differentiable functions. In this way,
the relationship between external representations (stimuli) and their corresponding
symbols can be learnt by a non-symbolic hill-climbing strategy, such as backprop-
agation. Therefore, it allows the tensor-recurrent network, as was demonstrated in
chapter 5, to exhibit a transition from unsystematic to systematic behaviour.
Thus, the contribution of Connectionism is not in an alternative explanation
for systematicity (i.e., the processing is still symbolic where it needs to be), but
in an explanation of how the acquisition of systematic behaviour can be a neces-
sary consequence of a network's internal organization and a non-symbolic learning
process.
6.5 Summary
In this chapter, the implications of the results from this thesis were discussed. Two
general lessons to be taken from this work in regard to cognitive modeling are:
(1) The assumption that the training and testing distributions are the same may
lead to criteria for cognitive architecture that are too weak. Therefore, models
designed with this assumption are potentially incomplete, since for cognitive prop-
erties, at least with respect to strong systematicity, this assumption does not hold.
(2) The lack of strong systematicity in the more "standard" Connectionist architec-
tures points to the need for greater consideration of behaviour-specific architectural biases.
The architectural biases built into the tensor-recurrent network were sufficient to
demonstrate strong systematicity on an inference task.
Also discussed was the case where the tensor-recurrent network was too sys-
tematic (i.e., generalization across position where it was not evident in the training
set). It was suggested that a possible extension to the tensor-recurrent network
was in the use of modifiable weights between the tensor and inner product layers,
and the use of an additional objective function which encourages the collapsing of
the subspaces only in the case where there is significant evidence of overlap in the
training set.
Lastly, comments were given on the relationship between the Connectionist
tensor-recurrent network solution and the Classical paradigm. The characteristic
components of a Classical architecture (i.e., symbols and symbol-level processes)
appeared in the tensor-recurrent network as characteristic components of a Connec-
tionist architecture (i.e., vectors and vector operators). Thus, in terms of system-
atic behaviour, Connectionism offers a "neural-like" implementation of symbolic
processing. However, where Connectionism goes beyond Classicism is in an expla-
nation for the acquisition of systematic behaviour, which is an issue not addressed
in the Classical paradigm since systematicity is built into a Classical architecture,
not acquired.
Chapter 7
Concluding remarks
In chapter 1, I introduced two major paradigms for models of cognitive behaviour
(i.e., Classicism and Connectionism), and the background to an ongoing debate over
the capacity of the Connectionist approach to explain one cognitive-level property
called systematicity. Briefly, Fodor and Pylyshyn (1988) and Fodor and McLaugh-
lin (1990) have argued that Connectionist models cannot exhibit properties such
as systematicity without implementing a Classical architecture (i.e., without re-
sorting to symbolic representations and processes). Therefore, they have argued
that Connectionism is at best an implementation framework for Classical cognitive
architectures.
However, as also mentioned in chapter 1, this conclusion is still far from unan-
imously accepted with argument and counter-argument being put forth in an at-
tempt to resolve the issue. Given the lack of consensus, it was decided that a review
of the debate over the possibility of a Connectionist (non-classical) explanation of
systematicity was an appropriate place to start. Thus, the �rst question of concern
in this thesis was: Can Connectionist models exhibit systematicity without reliance
on symbolic representations and processes?
7.1 Summary of contribution
The arguments for an alternative Connectionist explanation of systematicity have
focused on alternative notions of compositionality (i.e., the way in which archi-
tectures construct internal representations which allow them to reason about the
external world). The Classical notion of compositionality is that the tokening
(i.e., physical inscription) of complex representations is accompanied by a tokening
of component representations (i.e., representations of component objects). After
reviewing three Connectionist notions of compositionality: Smolensky's weak (mi-
crofeatures) and strong (tensors) compositionality; and van Gelder's more general
notion of functional compositionality, I concluded, in chapter 2, that they fail
to provide an alternative explanation of systematicity for one of two reasons, ei-
ther: (1) component representations are tokened, relative to the processes that
access them, wherever complex representations are tokened, in which case one has
an implementation of Classical compositionality; or (2) component representations
are not tokened, in which case, systematicity of inference cannot be demonstrated
since the component representations are not available to other processes. There-
fore, I have concluded that Connectionist models cannot provide an alternative
explanation for systematicity.
However, I have also argued that the limitation of Fodor and Pylyshyn's thesis
is that their requirement for systematicity is grounded solely in terms of compu-
tational capacity (i.e., in terms of the functions computable by an architecture).
Thus, systematicity can be realized by many computationally sufficient architec-
tures, one of which (i.e., tensors) Fodor and McLaughlin (1990) later reject on the
grounds that one must also explain how systematicity is a necessary consequence of
the architectural assumptions. However, the limitation of Fodor and McLaughlin's
position is that they do not specify a criterion by which a model (Connectionist or
otherwise) is said to necessarily exhibit systematicity. Since Connectionism is also
concerned with the acquisition of cognitive behaviour, I have suggested that a \po-
tential" contribution of Connectionism to cognitive theory is in an explanation for
the necessary acquisition of systematic behaviour. Thus, the second and primary
question of concern in this thesis was: What Connectionist architectural properties
are su�cient for the necessary acquisition of systematic behaviour?
In chapter 3, the necessary acquisition of systematic behaviour was defined
in terms of probably approximately correct (PAC)-learnability. PAC-learnability
is the approximation of a function to within some fixed degree of error with some
fixed degree of confidence over repeated trials with resources (e.g., time, processors)
polynomial in the size of the function. This criterion was applied to a feedforward
network on the auto-association of N -tuples task, designed to test systematicity
of representation. The auto-association of N -tuples by a feedforward network has
been examined in Brousse and Smolensky (1989) and Brousse (1991). Although
Brousse and Smolensky demonstrated an exponential increase in the number of
generalizations with respect to N (tuple order), I showed that the percentage of
correct to total number of patterns decreased exponentially in N . Therefore, their
demonstration of generalization with the feedforward network does not meet the
PAC-learnability criterion for the acquisition of systematicity. However, I also
showed that with a polynomial (with respect to N) increase in the number of
training patterns, the feedforward network met the PAC-learnability criterion. The
conditions sufficient for the necessary acquisition of systematicity were access to a
polynomial number of training examples, and no more than a polynomial number
of weights in the network.
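In the usual notation (a standard formulation given here for orientation, not a verbatim restatement of the chapter 3 definitions), the criterion requires that, for error bound ε and confidence parameter δ, the learnt hypothesis h satisfies

    Pr[ error(h) ≤ ε ] ≥ 1 − δ

with the number of training examples, the number of weights, and the training time each bounded by a polynomial in N, 1/ε and 1/δ, where N measures the size of the behaviour being learnt (here, the tuple order).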
PAC-learnability is a useful criterion in that it differentiates possible cogni-
tive architectures on the basis of resource feasibility. Essentially, any architec-
ture requiring more than polynomial resource is impractical as there is rarely
enough resource available for even moderately sized behaviours. However, the
PAC-learnability criterion did not consider the amount of resource used by people.
It is possible for an architecture to acquire systematic behaviour with respect to
the PAC-learnability criterion, yet still require more resource than is ever needed
by people. Thus, a second criterion based on linguistic evidence, called strong
systematicity (Hadley, 1993), was considered in the evaluation of Connectionist
models.
Strong systematicity is the ability, above chance level, to correctly represent
and process complex objects with components in novel positions (i.e., generaliza-
tion across position). Based on statistical analysis of training sets, Hadley found
no evidence of strong systematicity in the six Connectionist models he examined.
It is possible that none of these models exhibited strong systematicity because such
a criterion had not been defined at that time, and that the purpose of the Con-
nectionist work in question was just a demonstration of generalization over some
structured domain. However, it is also possible that there was some common and
fundamental inadequacy with these models given that none of them demonstrated
generalization across position.
In chapter 4, the feedforward network used in the previous chapter was ex-
amined for its capacity to exhibit strong systematicity of representation on the
auto-association of 2-tuples task. I showed, by an analysis of the available in-
formation and network learning properties, that this feedforward network cannot
exhibit strong systematicity of representation. The lack of strong systematicity
was attributed to the independence between the weights that implement the map-
ping of objects in the two component positions. This result suggested that a
recurrent network is likely to exhibit strong systematicity of representation, since
component objects are presented (extracted) via the same set of input (output)
units. Therefore, component mappings are implemented, in part, by the same set
of weights. Simulation results showed that a three-layer simple recurrent network
can demonstrate strong systematicity of representation on the temporal version of
the auto-association of 2-tuples task. Analysis of the internal representations of a
trained network suggested that components were mapped to (and from) common
internal subspaces. Consequently, generalization across position was achieved as
the same set of weights were used to map component representations independent
of their position.
The simple recurrent network was then examined for strong systematicity of
inference on the querying of 2-tuples task. Simulation results did not show strong
systematicity for this network. An analysis of the available information and the
learning requirements of the network showed that the network cannot exhibit strong
systematicity of inference on this task. The lack of strong systematicity of inference
was attributed to the independence between the weights that access components
from the two positions of a complex internal representation. It was also argued
that strong systematicity of inference cannot be demonstrated in a number of other
three-layer networks.
It was suggested that access-weight dependence can be achieved by incorporat-
ing an architectural bias that forces or encourages the access of components via a
common internal subspace. In chapter 5, such a bias was incorporated into a Con-
nectionist architecture called the tensor-recurrent network. In the tensor-recurrent
network, tensors were used to represent complex objects. Using the inner prod-
uct operator, subspaces of the tensor space can be collapsed down to a common
internal subspace from which the components can be extracted via a single set of
learnable weights. Correct extraction of components, however, requires an appro-
priate choice of internal component, role and cue vectors, which are assumed to
exist in a tensor representational scheme. The novel feature of the tensor-recurrent
network is that, by backpropagating an error signal through the units that imple-
ment tensor representations, these vectors are learnt by the network. Simulation
results showed that this network exhibited strong systematicity of inference on the
querying of 3-tuples task.
My design of the tensor-recurrent network was motivated specifically to ad-
dress the issue of strong systematicity. I have suggested that the network was too
strongly biased in that it exhibited some degree of generalization across position
when there was no overlap in the training set. In chapter 6, I then suggested in-
corporating an additional error term designed to encourage the collapse of internal
subspaces only in the case where there was significant overlap in the training set.
In this chapter, I also discussed the implications of the results of this thesis, and the
relationship to the Classical paradigm. Specifically, the points made were: (1) The
PAC-learning framework, where it relies on the assumption that the training and
testing distributions are the same, may be inappropriate for developing cognitive
modeling in that it can admit architectures (e.g., the three-layer feedforward network)
that are inadequate when this assumption does not hold (as was the case with
respect to strong systematicity). (2) Strong systematicity necessitates more struc-
tured (architecturally biased) networks than the three-layer networks examined
(e.g., tensor-recurrent networks). (3) The tensor-recurrent network is an example
of the contribution of Connectionism as providing an explanation for the necessary
acquisition of systematic behaviour from, in part, non-symbolic learning processes.
7.2 Conclusions
The initial question posed in this thesis was:
- Can Connectionist models exhibit systematicity without reliance on symbolic
representations and processes?
After reviewing the systematicity debate in chapter 2, I have concluded that:
- Connectionist models cannot demonstrate systematicity without implementing
some form of symbolic representations and processes.
From this conclusion it was argued that the problem of systematicity for Con-
nectionism is not in an alternative explanation for systematicity, but in an expla-
nation for the necessary acquisition of systematic behaviour. Thus, the second,
and primary question of concern in this thesis was:
- Can Connectionism provide models that exhibit the necessary acquisition of
systematic behaviour?
After testing and analyzing a number of Connectionist models I have concluded
that:
- Connectionism can provide models, as exemplified by the tensor-recurrent net-
work, with architectural properties that are sufficient for the necessary ac-
quisition of systematic behaviour (i.e., strong systematicity) from, in part,
non-symbolic processes.
Of the Connectionist architectures analyzed, only the tensor-recurrent network
demonstrated strong systematicity of inference. From the simulation results and
analysis of the tensor-recurrent network in chapter 5, I have also concluded that:
- the architectural properties of the tensor-recurrent network that were, to-
gether, sufficient to exhibit strong systematicity on a task designed to test
systematicity of inference were:
  - tensor units, which construct representations of complex objects as the
outer product of internal component and role representations;
  - inner product units, which collapse multiple representational subspaces
down to a single subspace from which component representations are
extracted;
  - backpropagation learning dynamics, which allow the appropriate inter-
nal component, role and cue representations to be learnt; and
  - component-independent, position-dependent generation of role vectors.
In addressing the two questions of central concern to this thesis, two more gen-
eral conclusions are drawn. Firstly, in the evaluation of the necessary acquisition
of systematicity, two criteria were used: (1) PAC-learnability (chapter 3); and
(2) Hadley's strong systematicity (chapters 4 and 5). The feedforward network
demonstrated the acquisition of systematic behaviour as defined by the first crite-
rion, but not the second. Therefore, I conclude that:
- the PAC-learnability framework, insofar as it assumes that the training and
testing distributions are the same, is inappropriate for developing cognitive
models that exhibit strong systematicity.
Secondly, the tensor-recurrent network in chapter 5 is an instance of a Connec-
tionist architecture where systematic behaviour emerged from initially unsystem-
atic behaviour, partly, as a consequence of a non-symbolic learning process (i.e.,
backpropagation). Therefore, I conclude that:
- Connectionism can contribute to cognitive theory in the form of an explana-
tion of the necessary acquisition of symbolic properties such as systematicity
from, in part, non-symbolic processes like backpropagation.
7.3 Further work
The systematicity problem is a watershed for cognitive research in that it segregates
potential cognitive architectures on the basis of whether or not such architectures
have the capacity to represent and generalize in structured domains.
Systematicity is an instance of a more general problem facing the Connectionist
approach to cognitive modeling. That is, learning high-level structured represen-
tations from low-level information.
The tensor-recurrent network approach to demonstrating strong systematicity
of inference suggests rethinking the notion of the type/token distinction. The
advantage of a type/token distinction is that if a function is learnt at the type level
then learning generalizes to all tokens of those types. Another way of characterizing
the lack of strong systematicity of inference in the first-order networks studied in
chapter 4 is that there was no explicit distinction (i.e., one that is readily accessible
by other processes within the network) in the network's internal representation
space between the tokening of a component and the component's type (i.e., role
within the complex object).
The general problem that a type/token distinction introduces into a Connec-
tionist learning framework (which is essentially hill-climbing) is how a hill-
climbing strategy can be applied to a domain where identifiers have only logical, not
spatial significance. For, in general, the hill-climbing strategy requires a contin-
uously differentiable surface, yet symbol names serve only to identify one symbol
from another, and so are, in general, not defined with respect to any similarity
measure. That is to say, a comparison between two identifiers is, in general, a
two-valued function: either the two identifiers are the same, or they are not the
same.
The solution to this problem has been hinted at in the form of the tensor-recurrent network, where type identifiers were vectors in the role unit activation space linked (via a continuously differentiable function) to their associated tokens, which were vectors in the hidden unit activation space. Since the token and type identifier spaces, and the functions linking the two (i.e., inner and outer products), are all continuously differentiable, type-level mappings can be learnt by a hill-climbing strategy through the principle of error (information) propagation through structured architectures.
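To make the differentiability claim concrete, the following is a sketch in my own notation (the symbols T, c_i, r_i, q and t, and the squared-error form, are assumptions rather than definitions from the thesis). With tokens c_i bound to type identifiers r_i, a cue q, readout o and target t:

    T = \sum_i c_i r_i^{\mathsf T}, \qquad o = Tq, \qquad E = \tfrac{1}{2}\lVert o - t \rVert^2

    \frac{\partial E}{\partial c_i} = (r_i^{\mathsf T} q)\,(o - t), \qquad
    \frac{\partial E}{\partial r_i} = \bigl(c_i^{\mathsf T}(o - t)\bigr)\, q, \qquad
    \frac{\partial E}{\partial q} = T^{\mathsf T}(o - t).

Every factor in the chain is continuous in the vectors being learnt, so error at the output assigns credit to tokens, type identifiers and cues alike.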
There are, however, a number of limitations with the tensor-recurrent network, which suggest directions for further work. The first limitation, which was discussed in chapter 6, is the "over-generalization" of the network. The tensor-recurrent network demonstrated some degree of generalization across position even when there was no evidence of overlap in the training set. One direction of further work is to incorporate an additional error term so that subspaces are collapsed only in the case where there is significant overlap in the training set. The suggestion given was an error term that reduces the difference between internal representations (vectors) that map to the same output vector (a sketch follows). This approach differs from the standard approach, which only considers the difference between network and target output.
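The following is a hedged sketch of such a term (the pairing scheme, the squared-distance form, and the weighting lam are my assumptions, not a worked-out proposal from chapter 6):

import numpy as np

def total_error(hidden, outputs, targets, lam=0.1):
    """hidden: (n, h) internal vectors; outputs, targets: (n, d) arrays."""
    # Standard term: difference between network and target output.
    output_err = 0.5 * np.sum((outputs - targets) ** 2)

    # Additional term: penalize the distance between internal
    # representations of examples with the same target, so that
    # subspaces are drawn together only where the training set
    # actually overlaps.
    rep_err = 0.0
    n = len(targets)
    for i in range(n):
        for j in range(i + 1, n):
            if np.array_equal(targets[i], targets[j]):
                rep_err += 0.5 * np.sum((hidden[i] - hidden[j]) ** 2)

    return output_err + lam * rep_err

Minimizing the second term pulls together only those internal vectors that the task itself treats as equivalent, which is the intended restriction on collapsing subspaces.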
The second limitation concerns the generation of role vectors. In the tasks
considered, the type/token distinction was artificially clean in that each component
was associated with one position at each time step. Thus, the point at which to
make the binding of a type identifier and its associated token was built into the network; only the values of those identifiers and tokens were learnt. In reality, there
are multiple component levels (e.g., sentences are composed of words, and words are
composed of phonemes), and the point at which to make/break bindings depends
entirely on the structure of the domain. Thus, one possibility for further work is
to consider the same tasks, but with input encodings at the phoneme level rather
than the word level. A network would still be required to demonstrate generalization across
word position, but also to learn the points at which to bind component words to
their roles.
The inference tasks considered in this thesis only required the networks to
extract one of the possible components. More complex behaviours involve the
restructuring of input, for example, in active to passive transformations, where
complex objects of the form Agent-Action(active)-Patient are mapped to objects
with the form Patient-Action(passive)-Agent. One possibility for addressing this
task is with modifiable recurrent connections on the cue units, which would allow the network to generate a sequence of cue vectors for extraction of a sequence of components. Furthermore, the tasks considered had a flat structure, where each
component was an atomic object (i.e., no further components). Natural language
processing, however, operates over recursive structures where, for example, a role
may be bound to an entire phrase containing its own component-role bindings.
Recent work by Hadley (1994), Hadley and Hayward (1994), and Niklasson and van Gelder (1994) considers generalization to different levels of embedding in recursively structured objects as an even tighter criterion for the demonstration of
systematicity. In this case, one could include connectivity which allows the binding
of role vectors to tensor representations of complex objects, rather than just the
hidden unit representations of atomic objects.
More generally, though, people have the capacity to generalize from behaviours
learnt in one domain to structurally similar behaviours in other domains. That
is, people have the capacity to reason analogically. This degree of generalization
goes beyond Hadley's definition of strong systematicity, since in the new domain,
people can generalize to complex objects where components have not been seen in
any position. For example, given the question, "Grapes are to wine, as rice is to ?", one is likely to respond with "sake", based on knowledge of the relation (i.e.,
is-used-in) in both domains, given very little experience with this type of question
before, and almost certainly no experience with the four components in the context
of this question. Work has already begun in an attempt to address this form of
generalization with Connectionist models (see Halford, Wilson, Guo, Gayler, Wiles,
& Stewart, 1994).
In this work, it was assumed that systematicity was not due to any a priori sim-
ilarity between component objects. However, it is possible that existing similarities
between component objects (e.g., run and walk) play some part in the acquisition of
systematic behaviour. Another possible extension of this work would be to examine
networks trained on complex objects whose component representations share some
degree of similarity, and then test these networks on objects with completely dis-
similar component representations. With respect to the tensor-recurrent network,
for example, it may be that initial similarities facilitate the acquisition of stable
role vectors for particular component positions.
Connectionist networks have demonstrated generalization on a wide variety of
tasks, yet their scalability to the size and complexity of human cognitive domains
is still in question. One could argue that the lack of scalability is a limitation of current technology: processing capacity, in terms of numbers of processors, is not yet available on the same order of magnitude as that of the brain. The value of
addressing structure-related properties like systematicity is that they also apply to
small-scale tasks. Architectures that cannot exhibit systematicity on small tasks
will not scale to larger tasks. Thus, research that considers these properties is not
only of theoretical significance, but of practical importance as well.
Appendix A
Understanding as generalization
not just representation*
Halford and Wilson (1994) define understanding of a concept in terms of repre-
senting the relation to which that concept is equivalent. A model is said to have
represented a relation when it performs the correct input-output mappings for all
functions implicated by that relation. For example, in the balance scale domain,
the concept of balance is a relation between two weight and two distance variables
and a state variable (which has the possible values: balance, tip-left and tip-right).
The concept of balance implicates a number of questions which can be used to
evaluate a child's understanding of the concept.
For example, the balance state question is a function from the two weight and two distance variables to the state variable. The missing weight question is a function from two distance variables, a weight variable, and a state variable to a weight variable. The missing distance question is a function from two weight variables, a distance variable, and a state variable to a distance variable. More generally, from any combination of four variables, the fifth variable can be predicted (a toy illustration follows).
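As a toy illustration (mine, not Halford and Wilson's model; the torque rule weight × distance and the function names are assumptions), two of the implicated functions might be written:

def balance_state(wl, dl, wr, dr):
    # Balance state question: two weights and two distances -> state.
    left, right = wl * dl, wr * dr
    if left == right:
        return "balance"
    return "tip-left" if left > right else "tip-right"

def missing_weight(dl, wr, dr, state):
    # Missing weight question: recover the left weight consistent with
    # the given state (only the balanced case has a unique answer).
    assert state == "balance"
    return (wr * dr) / dl

# e.g. balance_state(2, 3, 3, 2) == "balance" and
#      missing_weight(3, 3, 2, "balance") == 2.0

On this definition, a model understands the concept only if it performs correctly on all such implicated functions, not just the one it was trained on.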
Halford and Wilson argue that just as a child's understanding of the concept
balance is evaluated on a battery of implicated questions, so too a model of that
concept can be evaluated by its performance on a number of implicated functions.
*This work appears in Phillips (1994).
That is, a model can be evaluated on the basis of whether or not it performs the same input-output mappings. Their main point is that such a definition provides an adequate criterion against which candidate Connectionist models can be evaluated. With this definition the authors reject McClelland's (in press) feedforward network model on the basis that it cannot demonstrate adequate performance on all three questions (i.e., the model is incomplete).
The first point I want to make is that implicit in their definition is a notion of generalization (i.e., the capacity to take existing representations and apply them correctly to previously unencountered situations). I suggest that by confounding these two issues (representation and generalization) some respondents to Halford's presentation have been led to question the authors' justification for rejecting McClelland's feedforward network model.
The authors claim that McClelland's model cannot perform the additional miss-
ing weight and missing distance tasks. However, they do not state the basis for
this claim. Clearly, McClelland's model cannot be rejected on the basis of the
representational capacity of the feedforward network alone. Since multi-layer feed-
forward networks are universal function approximators (Hornik et al., 1989), there is every reason to expect that, given sufficient information and time, the network could perform correctly on these additional tasks.
Presumably, however, the authors have some additional criterion in mind. I
suspect that although they would agree that McClelland's model (suitably ex-
tended) can represent the required functions, to do so would require an extensive
and unaccountable amount of (re)training. If it can be shown that such training
is not available to (nor required by) children, then there is a basis for rejecting
the feedforward network model. That is, on the basis of the model's generalization
characteristics.
The question is, of course, how much training do children receive? Based on
empirical evidence, Hadley (1993, 1994) presents an example of how generaliza-
tion may be characterized in the linguistic domain. He identifies generalization
across syntactic position as a criterion against which Connectionist models can be
evaluated. Connectionist models can then be differentiated on the basis of this generalization criterion. For example, as was shown in chapter 4, three-layer feedforward networks without the assumption of weight tying cannot demonstrate
generalization across syntactic position, yet recurrent networks in a limited case
can demonstrate this form of generalization.
Essentially, the issue is that although some models have the capacity to represent any function of interest, they simply require too many training examples to
be psychologically plausible. I suspect that it is on grounds of generalization that
the authors want to reject McClelland's model. The point I want to make is that
the degree of generalization considered as psychologically plausible must be made
explicit; and in doing so, the authors would then have much stronger grounds for
rejecting particular Connectionist models. That is, on the basis that these models
make use of information either not available to, or not required by, children.
The issue of generalization brings me to my second point: if generalization is
used as a criterion on which to accept or reject models, then the tensor model that
Halford and Wilson are proposing is incomplete. For, although it demonstrates
how relational concepts may be represented, there is no story as to how these
representations might arise from experience.
In Halford and Wilson's formulation, an appropriate arrangement of weights and connections is in place to represent relational concepts as tensor products. Presumably, however, these weights and connections were not always there; otherwise, this model could not account for the empirical fact that children below a certain developmental stage do not understand (in Halford and Wilson's sense) the
concept of balance.
The two issues of representation and generalization place the authors in a
dilemma. Explicitly, they want to distinguish between two models on the basis of
representational capacity, yet both models (suitably extended) are capable of rep-
resenting relations. Implicitly, I suspect they want to reject McClelland's model on
grounds of generalization. However, if generalization is the distinguishing criterion
then they are required to tell a story of how the right arrangement of weights and
connections of a tensor model comes into being; and that is a story not yet told.
Bibliography
Abu-Mostafa, Y. S. (1990). Learning from hints in neural networks. Journal of Complexity, 6, 192–198.

Al-Mashouq, K. A., & Reed, I. S. (1991). Including hints in training neural nets. Neural Computation, 3, 418–427.

Amsterdam, J. (1988). Some philosophical problems with formal learning theory. In The seventh national conference on artificial intelligence, pp. 580–584. Saint Paul, MN: Morgan Kaufmann.

Anthony, M. (1992). An introduction to computational learning theory. Cambridge tracts in theoretical computer science. Cambridge, England: Cambridge University Press.

Bakker, P., Phillips, S., & Wiles, J. (1993). The N-2-N encoder: A matter of representation. In S. Gielen & B. Kappen (Eds.), ICANN'93: Proceedings of the International Conference on Artificial Neural Networks, pp. 554–557. London: Springer-Verlag.

Bakker, P., Phillips, S., & Wiles, J. (1994). The 1000-2-1000 encoder: A matter of representation. Neural Network World, 4 (5), 527–534.

Bartlett, P. L. (1992). Computational learning theory and neural network learning. Ph.D. thesis, The University of Queensland, Dept. of Electrical and Computer Engineering.

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Baum, E. B., & Wilczek, F. (1987). Supervised learning of probability distributions by neural networks. In D. Z. Anderson (Ed.), Neural Information Processing Systems, pp. 52–61. New York, NY: American Institute of Physics.

Blum, A. L., & Rivest, R. L. (1990). Training a 3-node neural network is NP-complete. Tech. Rep., MIT, Cambridge, MA.

Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5 (1), 117–127.

Brooks, R. A. (1991a). Intelligence without reason. In 12th International Joint Conference on Artificial Intelligence, pp. 569–595.

Brooks, R. A. (1991b). Intelligence without representation. Artificial Intelligence, 47, 139–160.

Brousse, O. J. (1991). Generativity and systematicity in neural network combinatorial learning. Ph.D. thesis, University of Colorado, Boulder, CO.

Brousse, O. J., & Smolensky, P. (1989). Virtual memories and massive generalization in connectionist combinatorial learning. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pp. 380–387. Hillsdale, NJ: Lawrence Erlbaum.

Brown, T. L., & LeMay, H. E. (1981). Chemistry: The central science (2nd edition). Englewood Cliffs, NJ: Prentice Hall.

Butler, K. (1993). Connectionism, classical cognitivism and the relation between cognitive and implementational levels of analysis. Philosophical Psychology, 6 (3), 321–333.

Chalmers, D. J. (1990a). Syntactic transformations on distributed representations. Connection Science, 2, 53–62.

Chalmers, D. J. (1990b). Why Fodor and Pylyshyn were wrong: The simplest refutation. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, pp. 340–347. Hillsdale, NJ: Lawrence Erlbaum.
Chalmers, D. J. (1993). Connectionism and compositionality: Why Fodor and Pylyshyn were wrong. Philosophical Psychology, 6, 305–319.

Chater, N., & Oaksford, M. (1990). Autonomy, implementation and cognitive architecture: A reply to Fodor and Pylyshyn. Cognition, 34, 93–107.

Clark, A. (1991). Systematicity, structured representations and cognitive architecture: A reply to Fodor and Pylyshyn. In T. Horgan & J. Tienson (Eds.), Connectionism and the Philosophy of Mind, pp. 198–218. Dordrecht, The Netherlands: Kluwer Academic.

Dennis, S., & Phillips, S. (1991). Analysis tools for neural networks. Tech. Rep. 207, University of Queensland, Brisbane, Australia.

Dyer, M. G. (1991). Connectionism versus symbolism in higher-level cognition. In T. Horgan (Ed.), Connectionism and the Philosophy of Mind, pp. 382–416. Dordrecht, The Netherlands: Kluwer Academic.

Elman, J. L. (1989). Representation and structure in connectionist models. Tech. Rep. 8903, University of California, San Diego, CA.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Elman, J. L. (1991). Incremental learning, or the importance of starting small. Tech. Rep. 9101, University of California, San Diego, CA.

Fodor, J. A. (1975). The Language of Thought. Language and Thought. New York, NY: Crowell.

Fodor, J. A., & McLaughlin, B. P. (1990). Connectionism and the problem of systematicity: Why Smolensky's solution doesn't work. Cognition, 35, 183–204.

Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.

Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability: A guide to the theory of NP-completeness. San Francisco, CA: W. H. Freeman.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4 (1), 1–58.

Hadley, R. F. (1993). Compositionality and systematicity in connectionist language learning. Tech. Rep. CSS-IS TR 93-01, Simon Fraser University, Burnaby, BC.

Hadley, R. F. (1994). Systematicity in connectionist language learning. Mind and Language, 9 (3), 247–272.

Hadley, R. F., & Hayward, M. (1994). Strong semantic systematicity from unsupervised connectionist learning. Tech. Rep. CSS-IS TR94-02, Simon Fraser University.

Halford, G. S., & Wilson, W. H. (1994). How far do neural networks account for human reasoning? In J. Wiles, C. Latimer, & C. Stevens (Eds.), Collected Papers from a symposium on connectionist models and psychology, pp. 50–62. Department of Computer Science, The University of Queensland, Brisbane, Australia.

Halford, G. S., Wilson, W. H., Guo, J., Gayler, R. W., Wiles, J., & Stewart, J. E. M. (1994). Connectionist implications for processing capacity limitations in analogies. In K. J. Holyoak & J. Barnden (Eds.), Advances in Connectionist and Neural Network theory, chap. 7, pp. 363–415. Norwood, NJ: Ablex.

Hartman, E., Keeler, J. D., & Kowalski, J. M. (1990). Layered neural networks with Gaussian hidden units as universal approximators. Neural Computation, 2 (2), 210–215.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Tech. Rep. UCSC-CRL-91-02, University of California, Santa Cruz, CA.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Jordan, M. I. (1990). Serial order: A parallel distributed processing approach. Tech. Rep., MIT, Cambridge, MA.

Judd, S. (1987). Learning in networks is hard. In M. Caudill & C. Butler (Eds.), IEEE International Conference on Neural Networks, pp. 685–692. Piscataway, NJ: IEEE Service Center.

Judd, S. (1988). On the complexity of loading shallow neural networks. Journal of Complexity, 4, 177–192.

Judd, S. (1990). Neural Network Design and the Complexity of Learning. Cambridge, MA: MIT Press.

Karmiloff-Smith, A. (1992). Beyond Modularity: A developmental perspective on cognitive science. Cambridge, MA: MIT Press/Bradford Books.

Kern, L. H., Mirels, H. L., & Hinshaw, V. G. (1983). Scientists' understanding of propositional logic: An experimental investigation. Social Studies of Science, 13, 131–146.

Kirsh, D. (1991a). Today the earwig, tomorrow man. Artificial Intelligence, 47 (1), 161–184.

Kirsh, D. (1991b). When is information explicitly represented? In P. Hanson (Ed.), Information, Language and Cognition: Vancouver Studies in Cognitive Science, pp. 340–365. Vancouver, BC: UBC Press.
Kruglyak, L. (1990). How to solve the N bit encoder problem with just two hidden units. Neural Computation, 2, 399–401.

Lapedes, A., & Farber, R. (1988). How neural nets work. In D. Z. Anderson (Ed.), Neural Information Processing Systems, pp. 442–456. New York, NY: American Institute of Physics.

LeCun, Y., Boser, B., & Denker, J. S. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1 (4), 541–551.

Lin, J. H., & Vitter, J. S. (1989). Complexity issues in learning by neural nets. In R. Rivest, D. Haussler, & M. K. Warmuth (Eds.), Proceedings of the Second Annual Workshop on Computational Learning Theory, pp. 118–133. San Mateo, CA: Morgan Kaufmann.

Lister, R. (1993). Visualizing weight dynamics in the N-2-N encoder. In Proceedings of the IEEE International Conference on Neural Networks. Piscataway, NJ: IEEE Service Center.

Lister, R., Bakker, P., & Wiles, J. (1993). Error signal, exceptions and back propagation. In Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 573–576. Piscataway, NJ: IEEE Service Center.

Marcus, G. F., Brinkmann, U., Clahsen, H., Wiese, R., Woest, A., & Pinker, S. (1993). German inflection: The exception that proves the rule. Tech. Rep. 47, MIT, Cambridge, MA.

Marr, D. (1982). Vision: a computational investigation into the human representation and processing of visual information. San Francisco, CA: W. H. Freeman.

Maskara, A., & Noetzel, A. (1992). Forcing simple recurrent networks to encode context. Tech. Rep., New Jersey Institute of Technology, NJ. Also in the Proceedings of the 1992 Long Island Conference on Artificial Intelligence and Computer Graphics.
McClelland, J. L. (in press). A connectionist perspective on knowledge and development. In T. Simon & G. S. Halford (Eds.), Developing Cognitive Competence: New Approaches to Cognitive Modeling. Hillsdale, NJ: Erlbaum.

McClelland, J. L., & Kawamoto, A. H. (1986). Mechanisms of Sentence Processing: Assigning Roles to Constituents of Sentences, Vol. 2 of Computational models of cognition and perception, chap. 19, pp. 272–331. Cambridge, MA: MIT Press.

McClelland, J. L., Rumelhart, D. E., & the PDP research group (Eds.). (1986). Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol. 2 of Computational models of cognition and perception. Cambridge, MA: MIT Press.

Minsky, M. L., & Papert, S. A. (1990). Perceptrons (4th edition). Cambridge, MA: MIT Press.

Newell, A. (1980). Physical symbol systems. Cognitive Science, 4, 135–183.

Niklasson, L., & Sharkey, N. E. (1992). Connectionism and the issues of compositionality and systematicity. In R. Trappl (Ed.), Proceedings of the Cybernetics and Systems Research, pp. 1367–1374.

Niklasson, L., & van Gelder, T. (1994). Can connectionist models exhibit non-classical structure sensitivity? In A. Ram & K. Eiselt (Eds.), Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pp. 644–669. Lawrence Erlbaum.

Niklasson, L. F. (1993). Structure sensitivity in connectionist models. In Proceedings of the 1993 Connectionist Summer School, pp. 162–169. Hillsdale, NJ: Lawrence Erlbaum.

Nowlan, S. J., & Hinton, G. E. (1991). Simplifying neural networks by soft-weight sharing. Tech. Rep., The Salk Institute, San Diego, CA.
Phatak, D. S. (1993). Construction of minimal n-2-n encoders for any n. Neural Computation, 5 (5), 783–794.

Phillips, S. (1993). The effect of representation on error surface. In P. Leong & M. Jabri (Eds.), Proceedings of the Fourth Australian Conference on Neural Networks, pp. 86–89. Sydney, Australia: Sydney University Electrical Engineering.

Phillips, S. (1994). Understanding as generalization not just representation. In J. Wiles, C. Latimer, & C. Stevens (Eds.), Collected Papers from a symposium on connectionist models and psychology, pp. 110–111. The University of Queensland, Brisbane, Australia: Department of Computer Science. Technical report No. 289. A comment on Halford and Wilson's 'How far do neural networks account for human reasoning?'.

Phillips, S., & Wiles, J. (1991). On learning long-term contingencies. Unpublished manuscript.

Pinker, S. (1984). Language Learnability and Language Development. Cognitive Science Series. Cambridge, MA: Harvard University Press.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 77–105.

Pylyshyn, Z. W. (1980). Computation and cognition: Issues in the foundations of cognitive science. Behavioral and Brain Sciences, 3, 111–169.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Internal Representations by Error Propagation, Vol. 1 of Computational models of cognition and perception, chap. 8, pp. 319–362. Cambridge, MA: MIT Press.
Rumelhart, D. E., & McClelland, J. L. (1986). On Learning Past Tenses of English Verbs, chap. 18, pp. 216–271. Computational models of cognition and perception. Cambridge, MA: MIT Press.

Rumelhart, D. E., McClelland, J. L., & the PDP research group (Eds.). (1986). Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol. 1 of Computational models of cognition and perception. Cambridge, MA: MIT Press.

Sato, M., Murakami, Y., & Joe, K. (1990). Learning chaotic dynamics by recurrent neural networks. In Proceedings of the International Conference on Fuzzy Logic and Neural Networks.

Shawe-Taylor, J., & Anthony, M. (1991). Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, 2 (1), 107–117.

Siegelmann, H. T., & Sontag, E. D. (1991). On the computational power of neural nets. Tech. Rep., Rutgers University, New Brunswick, NJ.

Smolensky, P. (1987a). The constituent structure of connectionist mental states: A reply to Fodor and Pylyshyn. Southern Journal of Philosophy, 26, 137–161.

Smolensky, P. (1987b). On variable binding and the representation of symbolic structures in connectionist systems. Tech. Rep. CU-CS-355-87, University of Colorado, Boulder, CO.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216.

Smolensky, P. (1991). Connectionism, constituency, and the language of thought. In B. Loewer & G. Rey (Eds.), Meaning in Mind: Fodor and his critics, chap. 12, pp. 201–227. Cambridge, MA: Blackwell.

Spivey, J. M. (1989). The Z Notation: A reference manual. Computer Science. New York, NY: Prentice Hall.
St. John, M. F., & McClelland, J. L. (1990). Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46, 217–257.

Suddarth, S. C., & Holden, A. D. C. (1991). Symbolic-neural systems and the use of hints in developing complex systems. International Journal of Man-Machine Studies, 35, 291–311.

Suddarth, S. C., & Kergos, Y. L. (1990). Rule injection hints as a means of improving network performance and learning time. In L. B. Almeida & C. J. Wellekens (Eds.), Neural Networks, Vol. 412 of Lecture Notes in Computer Science, pp. 120–129. New York, NY: Springer-Verlag.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27 (11), 1134–1142.

van Gelder, T. (1990). Compositionality: A connectionist variation on a classical theme. Cognitive Science, 14, 355–384.

van Gelder, T., & Niklasson, L. (1994). Classicism and cognitive architecture. In A. Ram & K. Eiselt (Eds.), Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pp. 905–909. Lawrence Erlbaum.

Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In S. J. Hanson & R. P. Lippman (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.

Wiles, J., & Ollila, M. (1993). Intersecting regions: the key to combinatorial structure in hidden unit space. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5, pp. 27–33. San Mateo, CA: Morgan Kaufmann.

Wiles, J., Phillips, S., & Norris, M. (1993). Generalization and training complexity in a combinatorial domain. Unpublished manuscript.
Zemel, R. S. (1993). A minimum description length framework for unsupervised
learning. Ph.D. thesis, University of Toronto, Toronto, Canada.