
P1: JZP

CUFX212-02 CUFX212-Sun 978 0 521 85741 3 November 15, 2007 13:46

Part II

COGNITIVE MODELING PARADIGMS

The chapters in Part II introduce the reader to broadly influential and foundational approaches to computational cognitive modeling. Each of these chapters describes in detail one particular approach and provides examples of its use in computational cognitive modeling.


CHAPTER 2

Connectionist Models of Cognition

Michael S. C. Thomas and James L. McClelland

1. Introduction

In this chapter, computer models of cognition that have focused on the use of neural networks are reviewed. These architectures were inspired by research into how computation works in the brain, and subsequent work has produced models of cognition with a distinctive flavor. Processing is characterized by patterns of activation across simple processing units connected together into complex networks. Knowledge is stored in the strength of the connections between units. It is for this reason that this approach to understanding cognition has gained the name of connectionism.

2. Background

Over the last twenty years, connectionist modeling has formed an influential approach to the computational study of cognition. It is distinguished by its appeal to principles of neural computation to inspire the primitives that are included in its cognitive level models. Also known as artificial neural network (ANN) or parallel distributed processing (PDP) models, connectionism has been applied to a diverse range of cognitive abilities, including models of memory, attention, perception, action, language, concept formation, and reasoning (see, e.g., Houghton, 2005). Although many of these models seek to capture adult function, connectionism places an emphasis on learning internal representations. This has led to an increasing focus on developmental phenomena and the origins of knowledge. Although, at its heart, connectionism comprises a set of computational formalisms, it has spurred vigorous theoretical debate regarding the nature of cognition. Some theorists have reacted by dismissing connectionism as mere implementation of preexisting verbal theories of cognition, whereas others have viewed it as a candidate to replace the Classical Computational Theory of Mind and as carrying profound implications for the way human knowledge is acquired and represented; still others have viewed connectionism as a subclass of statistical models involved in universal function approximation and data clustering.


This chapter begins by placing connectionism in its historical context, leading up to its formalization in Rumelhart and McClelland's two-volume Parallel Distributed Processing (1986), written in combination with members of the Parallel Distributed Processing Research Group. Then, three important early models that illustrate some of the key properties of connectionist systems are discussed, as well as how the novel theoretical contributions of these models arose from their key computational properties. These three models are the Interactive Activation model of letter recognition (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982), Rumelhart and McClelland's (1986) model of the acquisition of the English past tense, and Elman's (1991) simple recurrent network for finding structure in time. Finally, the chapter considers how twenty-five years of connectionist modeling has influenced wider theories of cognition.

2.1. Historical Context

Connectionist models draw inspiration from the notion that the information-processing properties of neural systems should influence our theories of cognition. The possible role of neurons in generating the mind was first considered not long after the existence of the nerve cell was accepted in the latter half of the nineteenth century (Aizawa, 2004). Early neural network theorizing can therefore be found in some of the associationist theories of mental processes prevalent at the time (e.g., Freud, 1895; James, 1890; Meynert, 1884; Spencer, 1872). However, this line of theorizing was quelled when Lashley presented data appearing to show that the performance of the brain degraded gracefully depending only on the quantity of damage. This argued against the specific involvement of neurons in particular cognitive processes (see, e.g., Lashley, 1929).

In the 1930s and 1940s, there was a resurgence of interest in using mathematical techniques to characterize the behavior of networks of nerve cells (e.g., Rashevsky, 1935). This culminated in the work of McCulloch and Pitts (1943), who characterized the function of simple networks of binary threshold neurons in terms of logical operations. In his 1949 book The Organization of Behavior, Donald Hebb proposed a cell assembly theory of cognition, including the idea that specific synaptic changes might underlie psychological principles of learning. A decade later, Rosenblatt (1958, 1962) formulated a learning rule for two-layered neural networks, demonstrating mathematically that the perceptron convergence rule could adjust the weights connecting an input layer and an output layer of simple neurons to allow the network to associate arbitrary binary patterns. With this rule, learning converged on the set of connection values necessary to acquire any two-layer-computable function relating a set of input-output patterns. Unfortunately, Minsky and Papert (1969, 1988) demonstrated that the set of two-layer-computable functions was somewhat limited – that is, these simple artificial neural networks were not particularly powerful devices. While more computationally powerful networks could be described, there was no algorithm to learn the connection weights of these systems. Such networks required the postulation of additional internal, or "hidden," processing units, which could adopt intermediate representational states in the mapping between input and output patterns. An algorithm (backpropagation) able to learn these states was discovered independently several times. A key paper by Rumelhart, Hinton, and Williams (1986) demonstrated the usefulness of networks trained using backpropagation for addressing key computational and cognitive challenges facing neural networks.

In the 1970s, serial processing and the von Neumann computer metaphor dominated cognitive psychology. Nevertheless, a number of researchers continued to work on the computational properties of neural systems. Some of the key themes identified by these researchers included the role of competition in processing and learning (e.g., Grossberg, 1976; Kohonen, 1984), the properties of distributed representations (e.g., Anderson, 1977; Hinton & Anderson, 1981), and the possibility of


Figure 2.1. A simplified schematic showing the historical evolution of neural network architectures. Simple binary networks (McCulloch & Pitts, 1943) are followed by two-layer feedforward networks (perceptrons; Rosenblatt, 1958). Three subtypes then emerge: three-layer feedforward networks (Rumelhart & McClelland, 1986), competitive or self-organizing networks (e.g., Grossberg, 1976; Kohonen, 1984), and interactive networks (Hopfield, 1982; Hinton & Sejnowski, 1986). Adaptive interactive networks have precursors in detector theories of perception (Logogen: Morton, 1969; Pandemonium: Selfridge, 1959) and in hand-wired interactive models (interactive activation: McClelland & Rumelhart, 1981; interactive activation and competition: McClelland, 1981; stereopsis: Marr & Poggio, 1976; Necker cube: Feldman, 1981, Rumelhart et al., 1986). Feedforward pattern associators have produced multiple subtypes: for capturing temporally extended activation states, cascade networks in which states monotonically asymptote (e.g., Cohen, Dunbar, & McClelland, 1990) and attractor networks in which states cycle into stable configurations (e.g., Plaut & McClelland, 1993); for processing sequential information, recurrent networks (Jordan, 1986; Elman, 1991); and for systems that alter their structure as part of learning, constructivist networks (e.g., cascade correlation: Fahlman & Lebiere, 1990; Shultz, 2003). SRN = simple recurrent network.

content-addressable memory in networks with attractor states, formalized using the mathematics of statistical physics (Hopfield, 1982). A fuller characterization of the many historical influences in the development of connectionism can be found in Rumelhart and McClelland (1986, Chapter 1), Bechtel and Abrahamsen (1991), McLeod, Plunkett, and Rolls (1998), and O'Reilly and Munakata (2000). Figure 2.1 depicts a


selective schematic of this history and demonstrates the multiple types of neural network systems that have latterly come to be used in building models of cognition. Although diverse, they are unified on the one hand by the proposal that cognition comprises processes of constraint satisfaction, energy minimization, and pattern recognition, and on the other hand by the proposal that adaptive processes construct the microstructure of these systems, primarily by adjusting the strengths of connections among the neuron-like processing units involved in a computation.

2.2. Key Properties of Connectionist Models

Connectionism starts with the following inspiration from neural systems: computations will be carried out by a set of simple processing units operating in parallel and affecting each other's activation states via a network of weighted connections. Rumelhart, Hinton, and McClelland (1986) identified seven key features that would define a general framework for connectionist processing.

The first feature is the set of processing units u_i. In a cognitive model, these may be intended to represent individual concepts (such as letters or words), or they may simply be abstract elements over which meaningful patterns can be defined. Processing units are often distinguished into input, output, and hidden units. In associative networks, input and output units have states that are defined by the task being modeled (at least during training), whereas hidden units are free parameters whose states may be determined as necessary by the learning algorithm.

The second feature is a state of activation (a) at a given time (t). The state of a set of units is usually represented by a vector of real numbers a(t). These may be binary or continuous numbers, bounded or unbounded. A frequent assumption is that the activation level of simple processing units will vary continuously between the values 0 and 1.

The third feature is a pattern of connectivity. The strength of the connection between any two units determines the extent to which the activation state of one unit can affect the activation state of another unit at a subsequent time point. The strength of the connection between unit i and unit j can be represented by a matrix W of weight values w_ij. Multiple matrices may be specified for a given network if there are connections of different types. For example, one matrix may specify excitatory connections between units and a second may specify inhibitory connections. Potentially, the weight matrix allows every unit to be connected to every other unit in the network. Typically, units are arranged into layers (e.g., input, hidden, output), and layers of units are fully connected to each other. For example, in a three-layer feedforward architecture where activation passes in a single direction from input to output, the input layer would be fully connected to the hidden layer and the hidden layer would be fully connected to the output layer.

The fourth feature is a rule for propagating activation states throughout the network. This rule takes the vector a(t) of output values for the processing units sending activation and combines it with the connectivity matrix W to produce a summed or net input into each receiving unit. The net input to a receiving unit is produced by multiplying the vector and matrix together, so that

net_i = [W a(t)]_i = Σ_j w_ij a_j.     (2.1)

The fifth feature is an activation rule to specify how the net inputs to a given unit are combined to produce its new activation state. The function F derives the new activation state

a_i(t + 1) = F(net_i(t)).     (2.2)

For example, F might be a threshold so that the unit becomes active only if the net input exceeds a given value. Other possibilities include linear, Gaussian, and sigmoid


functions, depending on the network type. The sigmoid is perhaps the most common, operating as a smoothed threshold function that is also differentiable. It is often important that the activation function be differentiable because learning seeks to improve a performance metric that is assessed via the activation state, whereas learning itself can only operate on the connection weights. The effect of weight changes on the performance metric therefore depends to some extent on the activation function, and the learning algorithm encodes this fact by including the derivative of that function (see the following discussion).
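Taken together, the propagation rule (equation 2.1) and the activation rule (equation 2.2) amount to a matrix-vector product followed by an elementwise nonlinearity. A minimal sketch in Python follows; the layer sizes, random weights, and input pattern are illustrative assumptions, not values from the chapter:

```python
import numpy as np

def sigmoid(x):
    # Smoothed threshold function; differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-x))

def propagate(W, a):
    # Equation 2.1: net_i = sum_j w_ij * a_j, computed for all receiving units at once.
    return W @ a

def activate(net):
    # Equation 2.2, with F chosen to be the sigmoid.
    return sigmoid(net)

# Illustrative three-layer feedforward pass: 4 input, 3 hidden, 2 output units.
rng = np.random.default_rng(0)
W_hi = rng.normal(size=(3, 4))   # input -> hidden weights
W_oh = rng.normal(size=(2, 3))   # hidden -> output weights
a_in = np.array([1.0, 0.0, 1.0, 0.0])
a_hid = activate(propagate(W_hi, a_in))
a_out = activate(propagate(W_oh, a_hid))
print(a_out)  # two output activations, each strictly between 0 and 1
```

Because the sigmoid squashes any finite net input into (0, 1), the output activations respect the bounded-activation assumption mentioned above regardless of the weight values.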

The sixth key feature of connectionist models is the algorithm for modifying the patterns of connectivity as a function of experience. Virtually all learning rules for PDP models can be considered a variant of the Hebbian learning rule (Hebb, 1949). The essential idea is that a weight between two units should be altered in proportion to the units' correlated activity. For example, if a unit u_i receives input from another unit u_j, then if both are highly active, the weight w_ij from u_j to u_i should be strengthened. In its simplest version, the rule is

Δw_ij = η a_i a_j     (2.3)

where η is the constant of proportionality known as the learning rate. Where an external target activation t_i(t) is available for a unit i at time t, this algorithm is modified by replacing a_i with a term depicting the disparity of unit u_i's current activation state a_i(t) from its desired activation state t_i(t) at time t, so forming the delta rule:

Δw_ij = η (t_i(t) − a_i(t)) a_j.     (2.4)
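Both update rules take only a line or two of code. The sketch below (the variable names and example activations are my own, chosen for illustration) applies equations 2.3 and 2.4 to a whole weight matrix at once via an outer product:

```python
import numpy as np

def hebbian_update(W, a_pre, a_post, lr=0.1):
    # Equation 2.3: strengthen w_ij in proportion to correlated activity a_i * a_j.
    return W + lr * np.outer(a_post, a_pre)

def delta_update(W, a_pre, a_post, target, lr=0.1):
    # Equation 2.4: change w_ij in proportion to the error (t_i - a_i) times a_j.
    return W + lr * np.outer(target - a_post, a_pre)

W = np.zeros((2, 3))                 # 2 receiving units, 3 sending units
a_pre = np.array([1.0, 0.0, 1.0])    # sending-unit activations
a_post = np.array([0.2, 0.8])        # receiving-unit activations
target = np.array([1.0, 0.0])        # externally provided targets

W_hebb = hebbian_update(W, a_pre, a_post)
W_delta = delta_update(W, a_pre, a_post, target)
```

Note that the delta rule leaves a weight unchanged whenever the sending unit is silent (a_j = 0), exactly as the equation requires.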

However, when hidden units are included in networks, no target activation is available for these internal parameters. The weights to such units may be modified by variants of the Hebbian learning algorithm (e.g., Contrastive Hebbian; Hinton, 1989; see Xie & Seung, 2003) or by the backpropagation of error signals from the output layer.

Backpropagation makes it possible to determine, for each connection weight in the network, what effect a change in its value would have on the overall network error. The policy for changing the strengths of connections is simply to adjust each weight in the direction (up or down) that would tend to reduce the error, and to change it by an amount proportional to the size of the effect the adjustment will have. If there are multiple layers of hidden units remote from the output layer, this process can be followed iteratively: first, error derivatives are computed for the hidden layer nearest the output layer; from these, derivatives are computed for the next deepest layer into the network, and so forth. On this basis, the backpropagation algorithm serves to modify the pattern of weights in powerful multilayer networks. It alters the weights to each deeper layer of units in such a way as to reduce the error on the output units (see Rumelhart, Hinton, et al., 1986, for the derivation). We can formulate the weight change algorithm by analogy to the delta rule shown in equation 2.4. For each deeper layer in the network, we modify the central term that represents the disparity between the actual and target activation of the units. Assuming u_i, u_h, and u_o are input, hidden, and output units in a three-layer feedforward network, the algorithm for changing the weight from hidden to output unit is:

Δw_oh = η (t_o − a_o) F′(net_o) a_h     (2.5)

where F′(net) is the derivative of the activation function of the units (e.g., for the sigmoid activation function, F′(net_o) = a_o(1 − a_o)). The term (t_o − a_o) is proportional to the negative of the partial derivative of the network's overall error with respect to the activation of the output unit, where the error E is given by E = Σ_o (t_o − a_o)².

The derived error term for a unit at the hidden layer is a product of three components: the derivative of the hidden unit's activation function times the sum, across all the connections from that hidden unit to the output layer, of the error


term on each output unit weighted by the derivative of the output unit's activation function, (t_o − a_o) F′(net_o), times the weight connecting the hidden unit to the output unit:

F′(net_h) Σ_o (t_o − a_o) F′(net_o) w_oh.     (2.6)

The algorithm for changing the weights from the input to the hidden layer is therefore:

Δw_hi = η F′(net_h) Σ_o (t_o − a_o) F′(net_o) w_oh a_i.     (2.7)

It is interesting that the previous computation can be construed as a backward pass through the network, similar in spirit to the forward pass that computes activations in that it involves propagation of signals across weighted connections, this time from the output layer back toward the input. The backward pass, however, involves the propagation of error derivatives rather than activations.
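Equations 2.5 through 2.7 translate almost line for line into code. The sketch below trains a three-layer sigmoid network on the XOR mapping, a classic case that no two-layer network can learn; the layer sizes, learning rate, random seed, and epoch count are illustrative choices of my own, not values from the chapter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W_hi, W_oh, a_i, t_o, lr=0.5):
    # Forward pass (equations 2.1 and 2.2).
    a_h = sigmoid(W_hi @ a_i)
    a_o = sigmoid(W_oh @ a_h)
    # Output error term: (t_o - a_o) F'(net_o), with F'(net) = a(1 - a) for the sigmoid.
    delta_o = (t_o - a_o) * a_o * (1.0 - a_o)
    # Hidden error term (equation 2.6): F'(net_h) times the back-propagated output errors.
    delta_h = a_h * (1.0 - a_h) * (W_oh.T @ delta_o)
    # Weight changes (equations 2.5 and 2.7), applied in place.
    W_oh += lr * np.outer(delta_o, a_h)
    W_hi += lr * np.outer(delta_h, a_i)

def total_error(W_hi, W_oh, patterns):
    # E = sum over output units of (t_o - a_o)^2, summed across training patterns.
    err = 0.0
    for x, t in patterns:
        a_o = sigmoid(W_oh @ sigmoid(W_hi @ x))
        err += float(np.sum((t - a_o) ** 2))
    return err

patterns = [(np.array(x, float), np.array(t, float))
            for x, t in [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]]
rng = np.random.default_rng(1)
W_hi = rng.normal(size=(4, 2))  # input -> hidden weights
W_oh = rng.normal(size=(1, 4))  # hidden -> output weights

err_before = total_error(W_hi, W_oh, patterns)
for epoch in range(2000):
    for x, t in patterns:
        train_step(W_hi, W_oh, x, t)
err_after = total_error(W_hi, W_oh, patterns)
```

The hidden-layer error term is computed before the output weights are updated, so each step uses a consistent snapshot of the network, matching the derivation above.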

It should be emphasized that a very wide range of variants and extensions of Hebbian and error-correcting algorithms have been introduced in the connectionist learning literature. Most importantly, several variants of backpropagation have been developed for training recurrent networks (Williams & Zipser, 1995), and several algorithms (including the Contrastive Hebbian Learning algorithm and O'Reilly's 1998 LEABRA algorithm) have addressed some of the concerns that have been raised regarding the biological plausibility of backpropagation construed in its most literal form (O'Reilly & Munakata, 2000).

The last general feature of connectionist networks is a representation of the environment with respect to the system. This is assumed to consist of a set of externally provided events or a function for generating such events. An event may be a single pattern, such as a visual input; an ensemble of related patterns, such as the spelling of a word and its corresponding sound or meaning; or a sequence of inputs, such as the words in a sentence. A range of policies have been used for specifying the order of presentation of the patterns, ranging from sweeping through the full set to random sampling with replacement. The selection of patterns to present may vary over the course of training but is often fixed. Where a target output is linked to each input, this is usually assumed to be simultaneously available. Two points are of note in the translation between PDP network and cognitive model. First, a representational scheme must be defined to map between the cognitive domain of interest and a set of vectors depicting the relevant informational states or mappings for that domain. Second, in many cases, connectionist models are addressed to aspects of higher-level cognition, where it is assumed that the information of relevance is more abstract than sensory or motor codes. This has meant that the models often leave out details of the transduction of sensory and motor signals, using input and output representations that are already somewhat abstract. We hold the view that the same principles at work in higher-level cognition are also at work in perceptual and motor systems, and indeed there is considerable connectionist work addressing issues of perception and action, although these will not be the focus of the present chapter.

2.3. Neural Plausibility

It is a historical fact that most connectionist modelers have drawn their inspiration from the computational properties of neural systems. However, it has become a point of controversy whether these "brain-like" systems are indeed neurally plausible. If they are not, should they instead be viewed as a class of statistical function approximators? And if so, shouldn't the ability of these models to simulate patterns of human behavior be assessed in the context of the large number of free parameters they contain (e.g., in the weight matrix; Green, 1998)?

Neural plausibility should not be the primary focus for a consideration of connectionism. The advantage of connectionism, according to its proponents, is that it provides better theories of cognition. Nevertheless, we will deal briefly with this issue because it pertains to the origins of connectionist cognitive theory. In this area, two sorts of criticism have been leveled at connectionist models. The first is to maintain that many connectionist models either include properties that are not neurally plausible or omit other properties that neural systems appear to have. Some connectionist researchers have responded to this first criticism by endeavoring to show how features of connectionist systems might in fact be realized in the neural machinery of the brain. For example, the backward propagation of error across the same connections that carry activation signals is generally viewed as biologically implausible. However, a number of authors have shown that the difference between activations computed using standard feedforward connections and those computed using standard return connections can be used to derive the crucial error derivatives required by backpropagation (Hinton & McClelland, 1988; O'Reilly, 1996). It is widely held that connections run bidirectionally in the brain, as required for this scheme to work. Under this view, backpropagation may be shorthand for a Hebbian-based algorithm that uses bidirectional connections to spread error signals throughout a network (Xie & Seung, 2003).

Other connectionist researchers have responded to the first criticism by stressing the cognitive nature of current connectionist models. Most of the work in developmental neuroscience addresses behavior at levels no higher than cellular and local networks, whereas cognitive models must make contact with the human behavior studied in psychology. Some simplification is therefore warranted, with neural plausibility compromised under the working assumption that the simplified models share the same flavor of computation as actual neural systems. Connectionist models have succeeded in stimulating a great deal of progress in cognitive theory – and have sometimes generated radically different proposals to the previously prevailing symbolic theory – just given the set of basic computational features outlined in the preceding section.

The second type of criticism leveled at connectionism questions why, as Davies (2005) puts it, connectionist models should be reckoned any more plausible as putative descriptions of cognitive processes just because they are "brain-like." Under this view, there is independence between levels of description because a given cognitive-level theory might be implemented in multiple ways in different hardware. Therefore, the details of the hardware (in this case, the brain) need not concern the cognitive theory. This functionalist approach, most clearly stated in Marr's three levels of description (computational, algorithmic, and implementational; see Marr, 1982), has been repeatedly challenged (see, e.g., Rumelhart & McClelland, 1985; Mareschal et al., 2007). The challenge to Marr goes as follows. Although, according to computational theory, there may be a principled independence between a computer program and the particular substrate on which it is implemented, in practical terms, different sorts of computation are easier or harder to implement on a given substrate. Because computations have to be delivered in real time as the individual reacts with his or her environment, in the first instance, cognitive-level theories should be constrained by the computational primitives that are most easily implemented on the available hardware; human cognition should be shaped by the processes that work best in the brain.

The relation of connectionist models to symbolic models has also proved controversial. A full consideration of this issue is beyond the scope of the current chapter. Suffice it to say that because the connectionist approach now includes a diverse family of models, there is no single answer to this question. Smolensky (1988) argued that connectionist models exist at a lower (but still cognitive) level of description than symbolic cognitive theories, a level that he called the subsymbolic. Connectionist models have sometimes been put forward as a way to implement symbolic production systems on neural architectures (e.g., Touretzky & Hinton, 1988). At other times, connectionist researchers have argued that their models represent a qualitatively different form of computation: whereas under certain circumstances, connectionist models might produce behavior approximating symbolic processes, it is held that human behavior, too, only approximates the characteristics of symbolic systems rather than directly implementing them. Furthermore, connectionist systems incorporate additional properties characteristic of human cognition, such as content-addressable memory, context-sensitive processing, and graceful degradation under damage or noise. Under this view, symbolic theories are approximate descriptions rather than actual characterizations of human cognition. Connectionist theories should replace them both because they capture subtle differences between human behavior and symbolic characterizations, and because they provide a specification of the underlying causal mechanisms (van Gelder, 1991).

This strong position has prompted criticisms that in their current form, connectionist models are insufficiently powerful to account for certain aspects of human cognition – in particular, those areas best characterized by symbolic, syntactically driven computations (Fodor & Pylyshyn, 1988; Marcus, 2001). Again, however, the characterization of human cognition in such terms is highly controversial; close scrutiny of relevant aspects of language – the ground on which the dispute has largely been focused – lends support to the view that the systematicity assumed by proponents of symbolic approaches is overstated and that the actual characteristics of language are well matched to the characteristics of connectionist systems (Bybee & McClelland, 2005; McClelland et al., 2003). In the end, it may be difficult to make principled distinctions between symbolic and connectionist models. At a fine scale, one might argue that two units in a network represent variables, and the connection between them specifies a symbolic rule linking these variables. One might also argue that a production system in which rules are allowed to fire probabilistically and in parallel begins to approximate a connectionist system.

2.4. The Relationship between Connectionist Models and Bayesian Inference

Since the early 1980s, it has been apparent that there are strong links between the calculations carried out in connectionist models and key elements of Bayesian calculations (see Chapter 3 in this volume on Bayesian models of cognition). The state of the early literature on this point was reviewed in McClelland (1998). There it was noted, first of all, that units can be viewed as playing the role of probabilistic hypotheses; that weights and biases play the role of conditional probability relations between hypotheses and prior probabilities, respectively; and that if connection weights and biases have the correct values, the logistic activation function sets the activation of a unit to its posterior probability given the evidence represented on its inputs.
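This correspondence is easy to verify in a toy case. Under a naive-Bayes generative model with conditionally independent binary evidence features (an illustrative setup of my own; the priors and likelihoods below are made-up numbers), a logistic unit whose weights are log likelihood ratios and whose bias encodes the log prior odds reproduces the Bayesian posterior exactly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothesis H with prior 0.3, and three conditionally independent binary
# evidence features e_j.
prior_h = 0.3
p = np.array([0.9, 0.2, 0.7])  # P(e_j = 1 | H)
q = np.array([0.4, 0.5, 0.1])  # P(e_j = 1 | not H)

# Weights are log likelihood ratios; the bias absorbs the log prior odds
# plus the contribution of the "feature absent" likelihoods.
w = np.log(p / q) - np.log((1 - p) / (1 - q))
bias = np.log(prior_h / (1 - prior_h)) + np.sum(np.log((1 - p) / (1 - q)))

def unit_activation(e):
    # Logistic unit: activation = sigmoid(net input plus bias).
    return sigmoid(w @ e + bias)

def bayes_posterior(e):
    # Direct application of Bayes' rule, for comparison.
    like_h = np.prod(np.where(e == 1, p, 1 - p))
    like_not = np.prod(np.where(e == 1, q, 1 - q))
    return like_h * prior_h / (like_h * prior_h + like_not * (1 - prior_h))

e = np.array([1.0, 0.0, 1.0])
print(unit_activation(e), bayes_posterior(e))  # the two values agree
```

The agreement holds for every evidence pattern, because the net input is precisely the log posterior odds and the sigmoid converts log odds back to a probability.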

A second and more important observation is that, in stochastic neural networks (Boltzmann machines and Continuous Diffusion Networks; Hinton & Sejnowski, 1986; Movellan & McClelland, 1993), a network's state over all of its units can represent a constellation of hypotheses about an input, and (if the weights and the biases are set correctly) the probability of finding the network in a particular state is monotonically related to the probability that the state is the correct interpretation of the input. The exact nature of the relation depends on a parameter called temperature; if it is set to one, the probability that the network will be found in a particular state exactly matches its posterior probability. When temperature is gradually reduced to zero, the network will end up in the most probable state, thus performing optimal perceptual inference (Hinton & Sejnowski, 1983). It is also known that backpropagation can learn weights that allow Bayes-optimal estimation of outputs given inputs (MacKay, 1992) and that the Boltzmann machine learning algorithm (Ackley, Hinton, & Sejnowski, 1985; Movellan & McClelland, 1993) can learn to produce correct conditional distributions of outputs given inputs. The algorithm is slow, but there has been recent progress producing substantial speedups that achieve outstanding performance on benchmark data sets (Hinton & Salakhutdinov, 2006).

3. Three Illustrative Models

In this section, we outline three of the landmark models in the emergence of connectionist theories of cognition. The models serve to illustrate the key principles of connectionism and demonstrate how these principles are relevant to explaining behavior in ways that were different from other prior approaches. The contribution of these models was twofold: they were better suited than alternative approaches to capturing the actual characteristics of human cognition, usually on the basis of their context-sensitive processing properties; and compared to existing accounts, they offered a sharper set of tools to drive theoretical progress and to stimulate empirical data collection. Each of these models significantly advanced its field.

3.1. An Interactive Activation Model of Context Effects in Letter Perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982)

The interactive activation model of letter perception illustrates two interrelated ideas. The first is that connectionist models naturally capture a graded constraint satisfaction process in which the influences of many different types of information are simultaneously integrated in determining, for example, the identity of a letter in a word. The second idea is that the computation of a perceptual representation of the current input (in this case, a word) involves the simultaneous and mutual influence of representations at multiple levels of abstraction – this is a core idea of parallel distributed processing.

The interactive activation model addressed a puzzle in word recognition. By the late 1970s, it had long been known that people were better at recognizing letters presented in words than letters presented in random letter sequences. Reicher (1969) demonstrated that this was not the result of tending to guess letters that would make letter strings into words. He presented target letters in words, in unpronounceable nonwords, or on their own. The stimuli were then followed by a pattern mask, after which participants were presented with a forced choice between two letters in a given position. Importantly, both alternatives were equally plausible. Thus, the participant might be presented with WOOD and asked whether the third letter was O or R. As expected, forced-choice performance was more accurate for letters in words than for letters in nonwords or presented on their own. Moreover, the benefit of surrounding context was also conferred by pronounceable pseudowords (e.g., recognizing the P in SPET) compared with random letter strings, suggesting that subjects were able to bring to bear rules regarding the orthographic legality of letter strings during recognition.

Rumelhart and McClelland (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) took the contextual advantage of words and pseudowords on letter recognition to indicate the operation of top-down processing. Previous theories had put forward the idea that letter and word recognition might be construed in terms of detectors that collect evidence consistent with the presence of their assigned letter or word in the input (Morton, 1969; Selfridge, 1959). Influenced by these theories, Rumelhart and McClelland built a computational simulation in which the perception of letters resulted from excitatory and inhibitory interactions of detectors for visual features. Importantly, the detectors were organized into different layers for letter features, letters, and words, and detectors could influence each other both in a bottom-up and a top-down manner.

Figure 2.2 illustrates the structure of the Interactive Activation (IA) model, both at the macro level (left) and for a small section of the model at a finer level (right). The explicit motivation for the structure of the IA was neural: “[We] have adopted the approach of formulating the model in terms similar to the way in which such a process might actually be carried out in a neural or neural-like system” (McClelland & Rumelhart, 1981, p. 387). There were three main assumptions of the IA model: (1) Perceptual processing takes place in a system in which there are several levels of processing, each of which forms a representation of the input at a different level of abstraction; (2) visual perception involves parallel processing, both of the four letters in each word and of all levels of abstraction simultaneously; and (3) perception is an interactive process in which conceptually driven and data-driven processing provide multiple, simultaneously acting constraints that combine to determine what is perceived.

Figure 2.2. Interactive Activation model of context effects in letter recognition (McClelland & Rumelhart, 1981, 1982). Pointed arrows are excitatory connections; circular-headed arrows are inhibitory connections. Left: macro view (connections in gray were set to zero in the implemented model). Right: micro view for the connections from the feature level to the first letter position for the letters S, W, and F (only excitatory connections shown) and from the first letter position to the word units SEED, WEED, and FEED (all connections shown).

The activation states of the system were simulated by a sequence of discrete time steps. Each unit combined its activation on the previous time step, its excitatory influences, its inhibitory influences, and a decay factor to determine its activation on the next time step. Connectivity was set at unitary values and along the following principles. In each layer, mutually exclusive alternatives should inhibit each other. Each unit in a layer excited all units with which it was consistent and inhibited all those with which it was inconsistent in the layer immediately above. Thus, in Figure 2.2, the first-position W letter unit has an excitatory connection to the WEED word unit but an inhibitory connection to the SEED and FEED word units. Similarly, a unit excited all units with which it was consistent and inhibited all those with which it was inconsistent in the layer immediately below. However, in the final implementation, top-down word-to-letter inhibition and within-layer letter-to-letter inhibition were set to zero (gray arrows, Figure 2.2).
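The verbal description of the unit update can be made concrete. The sketch below uses simplified dynamics in the spirit of an IA-style discrete-time update; the parameter names and values are illustrative, not the published ones. Net input (excitation minus inhibition) drives the unit toward its ceiling or floor, while decay pulls it back toward its resting level.

```python
def update_activation(a, excitation, inhibition, rest=-0.1,
                      decay=0.1, a_min=-0.2, a_max=1.0):
    """One discrete-time update for an IA-style unit (simplified sketch)."""
    net = excitation - inhibition
    if net > 0:
        delta = net * (a_max - a)   # positive input drives toward ceiling
    else:
        delta = net * (a - a_min)   # negative input drives toward floor
    delta -= decay * (a - rest)     # passive decay toward resting level
    return max(a_min, min(a_max, a + delta))

a = -0.1                            # start at the resting level
for _ in range(20):                 # sustained bottom-up support
    a = update_activation(a, excitation=0.3, inhibition=0.0)
# activation climbs toward an equilibrium below the ceiling of 1.0,
# where excitatory drive and decay balance
```
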

The model was constructed to recognize letters in four-letter strings. The full set of possible letters was duplicated for each letter position, and a set of 1,179 word units was created to represent the corpus of four-letter words. Word units were given base rate activation states at the beginning of processing to reflect their different frequencies. A trial began by clamping the feature units to the appropriate states to represent a letter string and then observing the dynamic change in activation through the network. Conditions were included to allow the simulation of stimulus masking and degraded stimulus quality. Finally, a probabilistic response mechanism was added to generate responses from the letter level, based on the relative activation states of the letter pool in each position.
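A probabilistic response mechanism of this kind can be sketched with a Luce-choice style rule over a letter pool, in which the probability of reporting a letter grows exponentially with its activation. This is a hedged illustration: the gain parameter `mu` is invented, and the published model also time-averages activations before applying its response rule.

```python
import math

def response_probabilities(activations, mu=10.0):
    # Exponentiated activations, normalized over the pool (Luce choice rule)
    strengths = [math.exp(mu * a) for a in activations]
    total = sum(strengths)
    return [s / total for s in strengths]

# Hypothetical activations for the alternatives O and R in the third
# letter position after presenting WOOD (values are made up).
p_o, p_r = response_probabilities([0.8, 0.2])
# the more strongly activated letter dominates the forced choice
```
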

The model successfully captured the greater accuracy of letter detection for letters appearing in words and pseudowords compared with random strings or in isolation. Moreover, it simulated a variety of empirical findings on the effect of masking and stimulus quality, and of changing the timing of the availability of context. The results on the contextual effects of pseudowords are particularly interesting, because the model only contains word units and letter units and has no explicit representation of orthographic rules. Let us say on a given trial, the subject is required to recognize the second letter in the string SPET. In this case, the string will produce bottom-up excitation of the word units for SPAT, SPIT, and SPOT, which each share three letters. In turn, the word units will propagate top-down activation, reinforcing activation of the letter P and so facilitating its recognition. Were this letter to be presented in the string XPQJ, no word units could offer similar top-down activation, hence the relative facilitation of the pseudoword. Interestingly, although these top-down “gang” effects produced facilitation of letters contained in orthographically legal nonword strings, the model demonstrated that they also produced facilitation in orthographically illegal, unpronounceable letter strings, such as SPCT. Here, the same gang of SPAT, SPIT, and SPOT produce top-down support. Rumelhart and McClelland (1982) reported empirical support for this novel prediction. Therefore, although the model behaved as if it contained orthographic rules influencing recognition, it did not in fact do so, because continued contextual facilitation could be demonstrated for strings that had gang support but violated the orthographic rules.

There are two specific points to note regarding the IA model. First, this early connectionist model was not adaptive – connectivity was set by hand. Although the model’s behavior was shaped by the statistical properties of the language it processed, these properties were built into the structure of the system in terms of the frequency of occurrence of letters and letter combinations in the words. Second, the idea of bottom-up excitation followed by competition among mutually exclusive possibilities is a strategy familiar in Bayesian approaches to cognition. In that sense, the IA bears similarity to more recent probability-theory-based approaches to perception (see chapter 3 in this volume).

3.1.1. what happened next?

Subsequent work saw the principles of the IA model extended to the recognition of spoken words (the TRACE model: McClelland & Elman, 1986) and more recently to bilingual speakers, where two languages must be incorporated in a single representational system (see Thomas & van Heuven, 2005, for review). The architecture was applied to other domains where multiple constraints were thought to operate during perception, for example, in face recognition (Burton, Bruce, & Johnston, 1990). Within language, more complex architectures have tried to recast the principles of the IA model in developmental settings, such as Plaut and Kello’s (1999) model of the emergence of phonology from the interplay of speech comprehension and production.

The more general lesson to draw from the interactive activation model is the demonstration of multiple influences (feature-, letter-, and word-level knowledge) working simultaneously and in parallel to shape the response of the system, as well as the somewhat surprising finding that a massively parallel constraint satisfaction process of this form can appear to behave as if it contains rules (in this case, orthographic) when no such rules are included in the processing structure. At the time, the model brought into question whether it was necessary to postulate rules as processing structures to explain regularities in human behavior. This skepticism was brought into sharper focus by our next example.

3.2. On Learning the Past Tense of English Verbs (Rumelhart & McClelland, 1986)

Rumelhart and McClelland’s (1986) model of English past tense formation marked the real emergence of the PDP framework. Where the IA model used localist coding, the past tense model employed distributed coding. Where the IA model had handwired connection weights, the past tense model learned its weights via repeated exposure to a problem domain. However, the models share two common themes. Once more, the behavior of the past tense model will be driven by the statistics of the problem domain, albeit these will be carved into the model by training rather than sculpted by the modelers. Perhaps more importantly, we see a return to the idea that a connectionist system can exhibit rule-following behavior without containing rules as causal processing structures; but in this case, the rule-following behavior will be the product of learning and will accommodate a proportion of exception patterns that do not follow the general rule. The key point that the past tense model illustrates is how (approximate) conformity to the regularities of language – and even a tendency to produce new regular forms (e.g., regularizations like “thinked” or past tenses for novel verbs like “wugged”) – can arise in a connectionist network without an explicit representation of a linguistic rule.

The English past tense is characterized by a predominant regularity in which the majority of verbs form their past tenses by the addition of one of three allomorphs of the “-ed” suffix to the base stem (walk/walked, end/ended, chase/chased). However, there is a small but significant group of verbs that form their past tense in different ways, including changing internal vowels (swim/swam), changing word-final consonants (build/built), changing both internal vowels and final consonants (think/thought), and an arbitrary relation of stem to past tense (go/went), as well as verbs that have a past tense form identical to the stem (hit/hit). These so-called irregular verbs often come in small groups sharing a family resemblance (sleep/slept, creep/crept, leap/leapt) and usually have high token frequencies (see Pinker, 1999, for further details).

During the acquisition of the English past tense, children show a characteristic U-shaped developmental profile at different times for individual irregular verbs. Initially, they use the correct past tense of a small number of high-frequency regular and irregular verbs. Later, they sometimes produce “overregularized” past tense forms for a small fraction of their irregular verbs (e.g., thinked; Marcus et al., 1992), along with other, less frequent errors (Xu & Pinker, 1995). They are also able to extend the past tense “rule” to novel verbs (e.g., wug-wugged). Finally, in older children, performance approaches ceiling on both regular and irregular verbs (Berko, 1958; Ervin, 1964; Kuczaj, 1977).

In the early 1980s, it was held that this pattern of behavior represented the operation of two developmental mechanisms (Pinker, 1984). One of these was symbolic and served to learn the regular past tense rule, whereas the other was associative and served to learn the exceptions to the rule. The extended phase of overregularization errors corresponded to difficulties in integrating the two mechanisms, specifically, a failure of the associative mechanism to block the function of the symbolic mechanism. That the child comes to the language acquisition situation armed with these two mechanisms (one of them full of blank rules) was an a priori commitment of the developmental theory.

By contrast, Rumelhart and McClelland (1986) proposed that a single network that does not distinguish between regular and irregular past tenses is sufficient to learn past tense formation. The architecture of their model is shown in Figure 2.3. A phoneme-based representation of the verb root was recoded into a more distributed, coarser (more blurred) format, which they called “Wickelfeatures.” The stated aim of this recoding was to produce a representation that (a) permitted differentiation of all of the root forms of English and their past tenses, and (b) provided a natural basis for generalizations to emerge about what aspects of a present tense correspond to what aspects of a past tense. This format involved representing verbs over 460 processing units. A two-layer network was used to associate the Wickelfeature representations of the verb root and past tense form. A final decoding network was then used to derive the closest phoneme-based rendition of the past tense form and reveal the model’s response (the decoding part of the model was somewhat restricted by computer processing limitations of the machines available at the time).

Figure 2.3. Two-layer network for learning the mapping between the verb roots and past tense forms of English verbs (Rumelhart & McClelland, 1986). Phonological representations of verbs are initially encoded into a coarse, distributed “Wickelfeature” representation. Past tenses are decoded from the Wickelfeature representation back to the phonological form. Later connectionist models replaced the dotted area with a three-layer feedforward backpropagation network (e.g., Plunkett & Marchman, 1991, 1993).

The connection weights in the two-layer network were initially randomized. The model was then trained in three phases, in each case using the delta rule to update the connection weights after each verb root/past tense pair was presented (see Section 2.2, and Equation 4). In Phase 1, the network was trained on ten high-frequency verbs, two regular and eight irregular, in line with the greater proportion of irregular verbs among the most frequent verbs in English. Phase 1 lasted for ten presentations of the full training set (or “epochs”). In Phase 2, the network was trained on 410 medium-frequency verbs, 334 regular and 76 irregular, for a further 190 epochs. In Phase 3, no further training took place, but 86 lower-frequency verbs were presented to the network to test its ability to generalize its knowledge of the past tense domain to novel verbs.
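The delta rule referred to here nudges each weight in proportion to the output error times the input activation (Δw = η·(target − output)·input). A minimal sketch for a single threshold output unit, learning a toy linearly separable mapping rather than the model's actual 460-unit Wickelfeature task:

```python
def delta_rule_epoch(weights, bias, patterns, lr=0.1):
    """One pass of delta-rule learning for a single binary-threshold
    output unit in a simple two-layer network. `patterns` is a list of
    (input_vector, target) pairs. Illustrative sketch only."""
    for x, target in patterns:
        net = sum(w * xi for w, xi in zip(weights, x)) + bias
        out = 1 if net > 0 else 0
        err = target - out                     # delta = target - output
        weights = [w + lr * err * xi for w, xi in zip(weights, x)]
        bias += lr * err
    return weights, bias

# Learn logical OR, a linearly separable toy mapping.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = [0.0, 0.0], 0.0
for _ in range(25):          # repeated epochs, as in the training phases
    w, b = delta_rule_epoch(w, b, data)
```

After training, the weights classify all four patterns correctly; the same error-correcting update, applied to distributed representations, is what drives learning in the past tense network.
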

There were four key results for this model. First, it succeeded in learning both regular and irregular past tense mappings in a single network that made no reference to the distinction between regular and irregular verbs. Second, it captured the overall pattern of faster acquisition for regular verbs than irregular verbs, a predominant feature of children’s past tense acquisition. Third, the model captured the U-shaped profile of development: an early phase of accurate performance on a small set of regular and irregular verbs, followed by a phase of overregularization of the irregular forms, and finally recovery for the irregular verbs and performance approaching ceiling on both verb types. Fourth, when the model was presented with the low-frequency verbs on which it had not been trained, it was able to generalize the past tense rule to a substantial proportion of them, as if it had indeed learned a rule. Additionally, the model captured more fine-grained developmental patterns for subsets of regular and irregular verbs, and generated several novel predictions.

Rumelhart and McClelland explained the generalization abilities of the network in terms of the superpositional memory of the two-layer network. All the associations between the distributed encodings of verb root and past tense forms must be stored across the single matrix of connection weights. As a result, similar patterns blend into one another and reinforce each other. Generalization is contingent on the similarity of verbs at input. Were the verbs to be presented using an orthogonal, localist scheme (e.g., 420 units, 1 per verb), then there would be no similarity between the verbs, no blending of mappings, no generalization, and therefore no regularization of novel verbs. As the authors state, “It is the statistical relationships among the base forms themselves that determine the pattern of responding. The network merely reflects the statistics of the featural representations of the verb forms” (Rumelhart & McClelland, 1986, p. 267). Based on the model’s successful simulation of the profile of language development in this domain, and, compared with the dual mechanism model, its more parsimonious a priori commitments, Rumelhart and McClelland viewed their work on past tense morphology as a step toward a revised understanding of language knowledge, language acquisition, and linguistic information processing in general.

The past tense model stimulated a great deal of subsequent debate, not least because of its profound implications for theories of language development (no rules!). The model was initially subjected to concentrated criticism. Some of this was overstated – for instance, the use of domain-general learning principles (such as distributed representation, parallel processing, and the delta rule) to acquire the past tense in a single network was interpreted as a claim that all of language acquisition could be captured by the operation of a single domain-general learning mechanism. Such an absurd claim could be summarily dismissed. As it stood, the model made no such claim: its generality was in the processing principles. The model itself represented a domain-specific system dedicated to learning a small part of language. Nevertheless, a number of the criticisms were more telling: the Wickelfeature representational format was not psycholinguistically realistic; the generalization performance of the model was relatively poor; the U-shaped developmental profile appeared to be a result of abrupt changes in the composition of the training set; and the actual response of the model was hard to discern because of problems in decoding the Wickelfeature output into a phoneme string (Pinker & Prince, 1988).

The criticisms and following rejoinders were interesting in a number of ways. First, there was a stark contrast between the precise, computationally implemented connectionist model of past tense formation and the verbally specified dual-mechanism theory (e.g., Marcus et al., 1992). The implementation made simplifications but was readily evaluated against quantitative behavioral evidence; it made predictions and it could be falsified. The verbal theory by contrast was vague – it was hard to know how or whether it would work or exactly what behaviors it predicted (see Thomas, Forrester, & Richardson, 2006, for discussion). Therefore, it could only be evaluated on loose qualitative grounds. Second, the model stimulated a great deal of new multidisciplinary research in the area. Today, inflectional morphology (of which past tense is a part) is one of the most studied aspects of language processing in children, in adults, in second language learners, in adults with acquired brain damage, in children and adults with neurogenetic disorders, and in children with language impairments, using psycholinguistic methods, event-related potential measures of brain activity, functional magnetic resonance imaging, and behavioral genetics. This rush of science illustrates the essential role of computational modeling in driving forward theories of human cognition. Third, further modifications and improvements to the past tense model have highlighted how researchers go about the difficult task of understanding which parts of their model represent the key theoretical claims and which are implementational details. Simplification is inherent in modeling, but successful modeling relies on making the right simplifications to focus on the process of interest. For example, in subsequent models, the Wickelfeature representation was replaced by more plausible phonemic representations based on articulatory features; the recoding/two-layer network/decoding component of the network (the dotted rectangle in Figure 2.3) that was trained with the delta rule was replaced by a three-layer feedforward network trained with the backpropagation algorithm; and the U-shaped developmental profile was demonstrated in connectionist networks trained with a smoothly growing training set of verbs or even with a fixed set of verbs (see, e.g., Plunkett & Marchman, 1991, 1993, 1996).

3.2.1. what happened next?

The English past tense model prompted further work within inflectional morphology in other languages (e.g., pluralization in German: Goebel & Indefrey, 2000; pluralization in Arabic: Plunkett & Nakisa, 1997), as well as models that explored the possible causes of deficits in acquired and developmental disorders, such as aphasia, Specific Language Impairment, and Williams syndrome (e.g., Hoeffner & McClelland, 1993; Joanisse & Seidenberg, 1999; Thomas & Karmiloff-Smith, 2003b; Thomas, 2005). The idea that rule-following behavior could emerge in a developing system that also had to accommodate exceptions to the rules was also successfully pursued via connectionist modeling in the domain of reading (e.g., Plaut et al., 1996). This led to work that also considered various forms of acquired and developmental dyslexia.

For the past tense itself, there remains much interest in the topic as a crucible to test theories of language development. However, in some senses the debate between connectionist and dual-mechanism accounts has ground to a halt. There is much evidence from child development, adult cognitive neuropsychology, developmental neuropsychology, and functional brain imaging to suggest partial dissociations between performance on regular and irregular inflection under various conditions. Both connectionist and dual-mechanism models have been modified: the connectionist model to include the influence of lexical-semantics as well as verb root phonology in driving the production of the past tense form (Joanisse & Seidenberg, 1999; Thomas & Karmiloff-Smith, 2003b); the dual-mechanism model to suppose that regular verbs might also be stored in the associative mechanism, thereby introducing partial redundancy of function (Pinker, 1999). Both approaches now accept that performance on regular and irregular past tenses partly indexes different things – in the connectionist account, different underlying knowledge; in the dual-mechanism account, different underlying processes. In the connectionist theory, performance on regular verbs indexes reliance on knowledge about phonological regularities, whereas performance on irregular verbs indexes reliance on lexical-semantic knowledge. In the dual-mechanism theory, performance on regular verbs indexes a dedicated symbolic processing mechanism implementing the regular rule, whereas performance on irregular verbs indexes an associative memory device storing information about the past tense forms of specific verbs. Both approaches claim to account for the available empirical evidence. However, to date, the dual-mechanism account remains unimplemented, so its claim is weaker.


How does one distinguish between two theories that (a) both claim to explain the data but (b) contain different representational assumptions? Putting aside the different level of detail of the two theories, the answer is that it depends on one’s preference for consistency with other disciplines. The dual-mechanism theory declares consistency with linguistics – if rules are required to characterize other aspects of language performance (such as syntax), then one might as well include them in a model of past tense formation. The connectionist theory declares consistency with neuroscience – if the language system is going to be implemented in the brain, then one might as well employ a computational formalism based on how neural networks function.

Finally, we return to the more general connectionist principle illustrated by the past tense model. So long as there are regularities in the statistical structure of a problem domain, a massively parallel constraint satisfaction system can learn these regularities and extend them to novel situations. Moreover, as with humans, the behavior of the system is flexible and context sensitive – it can accommodate regularities and exceptions within a single processing structure.

3.3. Finding Structure in Time (Elman, 1990)

In this section, the notion of the simple recurrent network and its application to language are introduced. As with past tense, the key point of the model will be to show how conformity to regularities of language can arise without an explicit representation of a linguistic rule. Moreover, the following simulations will demonstrate how learning can lead to the discovery of useful internal representations that capture conceptual and linguistic structure on the basis of the co-occurrences of words in sentences.

The IA model exemplified connectionism’s commitment to parallelism: All of the letters of the word presented to the network were recognized in parallel, and processing occurred simultaneously at different levels of abstraction. But not all processing can be carried out in this way. Some human behaviors intrinsically revolve around temporal sequences. Language, action planning, goal-directed behavior, and reasoning about causality are examples of domains that rely on events occurring in sequences. How has connectionism addressed the processing of temporally unfolding events? One solution was offered in the TRACE model of spoken word recognition (McClelland & Elman, 1986), where a word was specified as a sequence of phonemes. In that case, the architecture of the system was duplicated for each time-slice and the duplicates wired together. This allowed constraints to operate over items in the sequence to influence recognition. In other models, a related approach was used to convert a temporally extended representation into a spatially extended one. For example, in the past tense model, all the phonemes of a verb were presented across the input layer. This could be viewed as a sequence if one assumed that the representation of the first phoneme represents time slice t, the representation of the second phoneme represents time slice t + 1, and so on. As part of a comprehension system, this approach assumes a buffer that can take sequences and convert them to a spatial vector. However, this solution is fairly limited, as it necessarily pre-commits to the size of the sequences that can be processed at once (i.e., the size of the input layer).

Elman (1990, 1991) offered an alternative and more flexible approach to processing sequences, proposing an architecture that has been extremely influential and much used since. Elman drew on the work of Jordan (1986), who had proposed a model that could learn to associate a “plan” (i.e., a single input vector) with a series of “actions” (i.e., a sequence of output vectors). Jordan’s model contained recurrent connections permitting the hidden units to “see” the network’s previous output (via a set of “state” input units that are given a copy of the previous output). The facility for the network to shape its next output according to its previous response constitutes a kind of memory. Elman’s innovation was to build a recurrent facility into the internal units of the network, allowing it to compute statistical relationships across sequences of inputs and outputs. To achieve this, first time is discretized into a number of slices. On time step t, an input is presented to the network and causes a pattern of activation on hidden and output layers. On time step t + 1, the next input in the sequence of events is presented to the network. However, crucially, a copy of the activation of the hidden units on time step t is transmitted to a set of internal “context” units. This activation vector is also fed to the hidden units on time step t + 1. Figure 2.4 shows the architecture, known as the simple recurrent network (SRN). It is usually trained with the backpropagation algorithm (see Section 2.3) as a multilayer feedforward network, ignoring the origin of the information on the context layer.

Figure 2.4. Elman’s simple recurrent network architecture for finding structure in time (Elman, 1991, 1993). Connections between input and hidden, context and hidden, and hidden and output layers are trainable. Sequences are applied to the network element by element in discrete time steps; the context layer contains a copy of the hidden unit (HU) activations on the previous time step transmitted by fixed, 1-to-1 connections.

Each input to the SRN is therefore processed in the context of what came before, but in a way subtly more powerful than the Jordan network. The input at t + 1 is processed in the context of the activity produced on the hidden units by the input at time t. Now consider the next time step. The input at time t + 2 will be processed along with activity from the context layer that is shaped by two influences:

(the input at t + 1 (shaped by the input at t)).

The input at time t + 3 will be processed along with activity from the context layer that is shaped by three influences:

(the input at t + 2 (shaped by the input at t + 1 (shaped by the input at t))).

The recursive flavor of the information contained in the context layer means that each new input is processed in the context of the full history of previous inputs. This permits the network to learn statistical relationships across sequences of inputs or, in other words, to find structure in time.
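A forward pass of the SRN can be sketched directly from this description. The layer sizes, random weight initialization, and the neutral 0.5 starting context below are illustrative choices, and training by backpropagation is omitted:

```python
import math
import random

def srn_forward(seq, w_ih, w_ch, w_ho, n_hidden):
    """Forward pass of a simple recurrent network (Elman-style sketch).
    Each input is processed together with a copy of the previous hidden
    state held in the context layer. Weights are plain lists of lists."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    context = [0.5] * n_hidden          # context starts at a neutral value
    outputs = []
    for x in seq:
        hidden = [sigmoid(sum(w_ih[j][i] * x[i] for i in range(len(x))) +
                          sum(w_ch[j][k] * context[k] for k in range(n_hidden)))
                  for j in range(n_hidden)]
        out = [sigmoid(sum(w_ho[o][j] * hidden[j] for j in range(n_hidden)))
               for o in range(len(w_ho))]
        outputs.append(out)
        context = hidden[:]             # fixed 1-to-1 copy for the next step
    return outputs

random.seed(0)
rand_w = lambda rows, cols: [[random.uniform(-1, 1) for _ in range(cols)]
                             for _ in range(rows)]
w_ih, w_ch, w_ho = rand_w(3, 2), rand_w(3, 3), rand_w(2, 3)

# The same input vector presented twice yields different outputs, because
# the second presentation is processed in the context of the first.
outs = srn_forward([[1, 0], [1, 0]], w_ih, w_ch, w_ho, n_hidden=3)
```
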

In his original 1990 article, Elman demonstrated the powerful properties of the SRN with two examples. In the first, the network was presented with a sequence of letters made up of concatenated words, for example:

MANYYEARSAGOABOYANDGIRLLIVEDBYTHESEATHEYPLAYEDHAPPIL

Each letter was represented by a distributed binary code over five input units. The network was trained to predict the next letter in the sentence for 200 sentences constructed from a lexicon of fifteen words. There were 1,270 words and 4,963 letters. Because each word appeared in many sentences, the network was not particularly successful at predicting the next letter when it got to the end of each word, but within a word it was able to predict the sequences of letters. Using the accuracy of prediction as a measure, one could therefore identify which sequences in the letter string were words: They were the sequences of good prediction bounded by high prediction errors. The ability to extract words was of course subject to the ambiguities inherent in the training set (e.g., for the and they, there is ambiguity after the third letter). Elman suggested that if the letter strings are taken to be analogous to the speech sounds available to the infant, the SRN demonstrates a possible mechanism to extract words from the continuous stream of sound that is present in infant-directed speech. Elman’s work has contributed to the increasing interest in the statistical learning abilities of young children in language and cognitive development (see, e.g., Saffran, Newport, & Aslin, 1996).
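The segmentation logic, that word boundaries show up as peaks in prediction error, can be illustrated with a toy function. The error profile below is invented, standing in for the per-letter errors a trained SRN would actually produce.

```python
import numpy as np

def segment_by_error(letters, errors, threshold):
    """Mark a word boundary wherever the prediction error for a letter
    exceeds the threshold: a crude stand-in for the 'high error at
    word onsets' pattern observed in the trained network."""
    words, current = [], letters[0]
    for letter, err in zip(letters[1:], errors[1:]):
        if err > threshold:          # poorly predicted -> new word begins
            words.append(current)
            current = letter
        else:
            current += letter
    words.append(current)
    return words

# Invented error profile: high at the onset of each word
letters = list("MANYYEARSAGO")
errors = [0.9, 0.2, 0.1, 0.1,           # MANY
          0.8, 0.3, 0.2, 0.1, 0.1,      # YEARS
          0.9, 0.2, 0.1]                # AGO
print(segment_by_error(letters, errors, 0.5))  # → ['MANY', 'YEARS', 'AGO']
```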

In the second example, Elman (1990) created a set of 10,000 sentences by combining a lexicon of 29 words and a set of short sentence frames (noun + [transitive] verb + noun; noun + [intransitive] verb). There was a separate input and output unit for each word, and the SRN was trained to predict the next word in the sentence. During training, the network’s output came to approximate the transitional probabilities between the words in the sentences; that is, it could predict the next word in the sentences as much as this was possible. Following the first noun, the verb units would be more active as the possible next word, and verbs that tended to be associated with this particular noun would be more active than those that did not. At this point, Elman examined the similarity structure of the internal representations to discover how the network was achieving its prediction ability. He found that the internal representations were sensitive to the difference between nouns and verbs, and within verbs, to the difference between transitive and intransitive verbs. Moreover, the network was also sensitive to a range of semantic distinctions:

Not only were the internal states induced by nouns split into animate and inanimate, but the pattern for “woman” was most similar to “girl,” and that for “man” was most similar to “boy.” The network had learned to structure its internal representations according to a mix of syntactic and semantic information because these information states were the best way to predict how sentences would unfold. Elman concluded that the representations induced by connectionist networks need not be flat but could include hierarchical encodings of category structure.

Based on his finding, Elman also argued that the SRN was able to induce representations of entities that varied according to their context of use. This contrasts with classical symbolic representations, which retain their identity regardless of the combinations into which they are put, a property called “compositionality.” This claim is perhaps better illustrated by a second article Elman (1993) published two years later, called “Learning and Development in Neural Networks: The Importance of Starting Small.” In this later article, Elman explored whether rule-based mechanisms are required to explain certain aspects of language performance, such as syntax. He focused on “long-range dependencies,” which are links between words that depend only on their syntactic relationship in the sentence and, importantly, not on their separation in a sequence of words. For example, in English, the subject and main verb of a sentence must agree in number. If the noun is singular, so must be the verb; if the noun is plural, so must be the verb. Thus, in the sentence “The boy chases the cat,” boy and chases must both be singular. But this is also true in the sentence “The boy whom the boys chase chases the cat.” In the second sentence, the subject and verb are further apart in the sequence of words, but their relationship is the same; moreover, the words are now separated by plural tokens of the same lexical items. Rule-based representations of syntax were thought to be necessary to encode these long-distance relationships because, through the recursive nature of syntax, the words that have to agree in a sentence can be arbitrarily far apart.


Figure 2.5. Trajectory of internal activation states as the simple recurrent network (SRN) processes sentences (Elman, 1993). The data show positions according to the dimensions of a principal components analysis (PCA) carried out on hidden unit activations for the whole training set. Words are indexed by their position in the sequence but represent activation of the same input unit for each word. (a) PCA values for the second principal component as the SRN processes two sentences, “Boy who boys chase chases boy” or “Boys who boys chase chase boy”; (b) PCA values for the first and eleventh principal components as the SRN processes “Boy chases boy who chases boy who chases boy.”

Using an SRN trained on the same prediction task as that previously outlined, but now with more complex sentences, Elman (1993) demonstrated that the network was able to learn these long-range dependencies even across the separation of multiple phrases. If boy was the subject of the sentence, when the network came to predict the main verb chase as the next word, it predicted that it should be in the singular. The method by which the network achieved this ability is of particular interest. Once more, Elman explored the similarity structure in the hidden unit representations, using principal component analyses to identify the salient dimensions of similarity across which activation states were varying. This enabled him to reduce the high dimensionality of the internal states (150 hidden units were used) to a manageable number to visualize processing. Elman was then able to plot the trajectories of activation as the network altered its internal state in response to each subsequent input. Figure 2.5 depicts these trajectories as the network processes different multi-phrase sentences, plotted with reference to particular dimensions of principal component space. This figure demonstrates that the network adopted similar states in response to particular lexical items (e.g., tokens of boy, who, chases), but that it modified the pattern slightly according to the grammatical status of the word. In Figure 2.5a, the second principal component appears to encode singularity/plurality. Figure 2.5b traces the network’s state as it processes two embedded relative clauses containing iterations of the same words. Each clause exhibits a related but slightly shifted triangular trajectory to encode its role in the syntactic structure.
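The dimensionality-reduction step can be sketched as follows. This is a hypothetical example: the activation matrix is random, standing in for the 150-dimensional hidden states recorded from Elman’s trained network, and PCA is computed via the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in hidden-unit activations: one 150-dimensional state per
# word as a sentence is processed (values invented for illustration).
states = rng.random((9, 150))    # 9 time steps, 150 hidden units

# PCA via SVD of the mean-centred state matrix; rows of Vt are the
# principal component directions, ordered by explained variance.
centred = states - states.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

# Project each state onto the first two principal components, giving
# the kind of 2-D trajectory plotted in figures like 2.5.
trajectory = centred @ Vt[:2].T   # shape (9, 2)
```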

The importance of this model is that it prompts a different way to understand the processing of sentences. Previously, one would view symbols as possessing fixed identities and as being bound into particular grammatical roles via a syntactic construction. In the connectionist system, sentences are represented by trajectories through activation space in which the activation pattern for each word is subtly shifted according to the context of its usage. The implication is that the property of compositionality at the heart of the classical symbolic computational approach may not be necessary to process language.

Elman (1993) also used this model to investigate a possible advantage to learning that could be gained by initially restricting the complexity of the training set. At the start of training, the network had its memory reset (its context layer wiped) after every third or fourth word. This window was then increased in stages up to six to seven words across training. The manipulation was intended to capture maturational changes in working memory in children. Elman (1993) reported that starting small enhanced learning by allowing the network to build simpler internal representations that were later useful for unpacking the structure of more complex sentences (see Rohde & Plaut, 1999, for discussion and further simulations). This idea resonated with developmental psychologists in its demonstration of the way in which learning and maturation might interact in constructing cognition. It is an idea that could turn out to be a key principle in the organization of cognitive development (Elman et al., 1996).

3.3.1. what happened next?

Elman’s simulations with the SRN and the prediction task produced striking results. The ability of the network to induce structured representations containing grammatical and semantic information from word sequences prompted the view that associative statistical learning mechanisms might play a much more central role in language acquisition. This innovation was especially welcome given that symbolic theories of sentence processing do not offer a ready account of language development. Indeed, they are largely identified with the nativist view that little in syntax develops. However, one limitation of the prior simulations is that the prediction task does not learn any categorizations over the input set. Although the simulations demonstrate that information important for language comprehension and production can be induced from word sequences, neither task is actually performed. The learned distinction between nouns and verbs apparent in the hidden unit representations is tied up with carrying out the prediction task. But to perform comprehension, for example, the SRN would need to learn categorizations from the word sequences, such as deciding which noun was the agent and which noun was the patient in a sentence, regardless of whether the sentence was presented in the active (“the dog chases the cat”) or passive voice (“the cat is chased by the dog”). These types of computations are more complex, and the network’s solutions typically more impenetrable. Although SRNs have borne the promise of an inherently developmental connectionist theory of parsing, progress on a full model has been slow (see Christiansen & Chater, 2001; and chapter 17 of this volume). Parsing is a complex problem – it is not even clear what the output should be for a model of sentence comprehension. Should it be some intermediate depiction of agent–patient role assignments, some compound representation of roles and semantics, or a constantly updating mental model that processes each sentence in the context of the emerging discourse? Connectionist models of parsing await greater constraints from psycholinguistic evidence.

Nevertheless, some interesting preliminary findings have emerged. For example, some of the grammatical sentences that the SRN finds the hardest to predict are also the sentences that humans find the hardest to understand (e.g., center-embedded structures like “the mouse the cat the dog bit chased ate the cheese”) (Weckerly & Elman, 1992). These are sequences that place maximal load on encoding information in the network’s internal recurrent loop, suggesting that recurrence may be a key computational primitive in language processing. Moreover, when the prediction task is replaced by a comprehension task (such as predicting the agent/patient status of the nouns in the sentence), the results are again suggestive. Rather than building a syntactic structure for the whole sentence as a symbolic parser might, the network focuses on the predictability of lexical cues for identifying various syntactic structures (consistent with Bates and MacWhinney’s [1989] Competition model of language development). The salience of the lexical cues that each syntactic structure exploits and the processing load that each structure places on the recurrent loop make the structures differentially vulnerable under damage. Here, neuropsychological findings from language breakdown and developmental language disorders have tended to follow the predictions of the connectionist account in the relative impairments that each syntactic construction should show (Dick et al., 2001, 2004; Thomas & Redington, 2004).

For more recent work and discussion of the use of SRNs in syntax processing, see Mayberry, Crocker, and Knoeferle (2005), Miikkulainen and Mayberry (1999), Morris, Cottrell, and Elman (2000), Rohde (2002), and Sharkey, Sharkey, and Jackson (2000). Lastly, the impact of SRNs has not been restricted to language. These models have been usefully applied to other areas of cognition where sequential information is important. For example, Botvinick and Plaut (2004) have shown how this architecture can capture the control of routine sequences of actions without the need for schema hierarchies, and Cleeremans and Dienes (chapter 14 in this volume) show how SRNs have been applied to implicit learning.

In sum, then, Elman’s work demonstrates how simple connectionist architectures can learn statistical regularities over temporal sequences. These systems may indeed be sufficient to produce many of the behaviors that linguists have described with grammatical rules. However, in the connectionist system, the underlying primitives are context-sensitive representations of words and trajectories of activation through recurrent circuits.

4. Related Models

Before considering the wider impact of connectionism on theories of cognition, we should note a number of other related approaches.

4.1. Cascade-Correlation and Incremental Neural Network Algorithms

Backpropagation networks specify input and output representations, whereas in self-organizing networks, only the inputs are specified. These networks therefore include some number of internal processing units whose activation states are determined by the learning algorithm. The number of internal units and their organization (e.g., into layers) plays an important role in determining the complexity of the problems or categories that the network can learn. In pattern associator networks, too few units and the network will fail to learn; in self-organizing networks, too few output units and the network will fail to provide good discrimination between the categories in the training set. How does the modeler select in advance the appropriate number of internal units? Indeed, for a cognitive model, should this be a decision that the modeler gets to make?

For pattern associator networks, the cascade-correlation algorithm (Fahlman & Lebiere, 1990) addresses this problem by starting with a network that has no hidden units and then adding in these resources during learning as they become necessary to carry on improving on the task. New hidden units are added with weights from the input layer tailored so that the unit’s activation correlates with network error; that is, the new unit responds to parts of the problem on which the network is currently doing poorly. New hidden units can also take input from existing hidden units, thereby creating detectors for higher-order features in the problem space.
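The recruitment criterion can be sketched as follows. This is a simplified illustration: the real algorithm trains each candidate unit’s incoming weights to maximize this correlation before recruiting the best one, whereas here the candidates’ activations are simply given, and the data are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

def pick_candidate(candidate_acts, residual_error):
    """Recruit the candidate unit whose activation correlates most
    strongly (in magnitude) with the network's remaining error
    over the training patterns."""
    scores = []
    for acts in candidate_acts:                    # one row per candidate
        c = np.corrcoef(acts, residual_error)[0, 1]
        scores.append(abs(c))
    return int(np.argmax(scores))

# Invented data: 3 candidate units observed over 20 training patterns;
# the residual error is constructed to track candidate unit 1.
candidates = rng.random((3, 20))
error = candidates[1] * 0.9 + rng.normal(scale=0.05, size=20)
print(pick_candidate(candidates, error))   # → 1
```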

The cascade-correlation algorithm has been widely used for studying cognitive development (Mareschal & Shultz, 1996; Shultz, 2003; Westermann, 1998), for example, in simulating children’s performance in Piagetian reasoning tasks (see Section 5.2). The algorithm makes links with the constructivist approach to development (Quartz, 1993; Quartz & Sejnowski, 1997), which argues that increases in the complexity of children’s cognitive abilities are best explained by the recruitment of additional neurocomputational resources with age and experience. Related models that also use this “incremental” approach to building network architectures can be found in the work of Carpenter and Grossberg (Adaptive Resonance Theory; e.g., Carpenter & Grossberg, 1987a, 1987b) and in the work of Love and colleagues (e.g., Love, Medin, & Gureckis, 2004).

4.2. Mixture-of-Experts Models

The preceding sections assume that only a single architecture is available to learn each problem. However, it may be that multiple architectures are available to learn a given problem, each with different computational properties. Which architecture will end up learning the problem? Moreover, what if a cognitive domain can be broken down into different parts, for example, in the way that the English past tense problem comprises regular and irregular verbs – could different computational components end up learning the different parts of the problem? The mixture-of-experts approach considers ways in which learning could take place in just such a system with multiple components available (Jacobs et al., 1991). In these models, functionally specialized structures can emerge as a result of learning, in the circumstance where the computational properties of the different components happen to line up with the demands presented by different parts of the problem domain (so-called structure-function correspondences).

During learning, mixture-of-experts algorithms typically permit the multiple components to compete with each other to deliver the correct output for each input pattern. The best performer is then assigned the pattern and allowed to learn it. The involvement of each component during functioning is controlled by a gating mechanism. Mixture-of-experts models are one of several approaches that seek to explain the origin of functionally specialized processing components in the cognitive system (see Elman et al., 1996; Jacobs, 1999; Thomas & Richardson, 2006, for discussion). An example of the application of mixture of experts can be found in a developmental model of face and object recognition, where different “expert” mechanisms come to specialize in processing visual inputs that correspond to faces and those that correspond to objects (Dailey & Cottrell, 1999). The emergence of this functional specialization can be demonstrated by damaging each expert in turn and showing a double dissociation between face and object recognition in the two components of the model (see Section 5.3). Similarly, Thomas and Karmiloff-Smith (2002a) showed how a mixture-of-experts model of the English past tense could produce emergent specialization of separate mechanisms for regular and irregular verbs, respectively (see also Westermann, 1998, for related work with a constructivist network).
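The competitive scheme can be sketched with two linear “experts.” This is a minimal illustration under our own assumptions: Jacobs et al.’s full architecture uses a trainable gating network to apportion credit, whereas here the competition is reduced to hard winner-take-all with a delta-rule update for the winner only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two linear experts mapping 4 inputs to 2 outputs (weights invented)
experts = [rng.normal(scale=0.1, size=(2, 4)) for _ in range(2)]

def train_pattern(x, target, experts, lr=0.1):
    """Competitive mixture-of-experts step: the expert whose output is
    already closest to the target 'wins' the pattern and is the only
    one whose weights are updated."""
    outputs = [W @ x for W in experts]
    losses = [np.sum((target - y) ** 2) for y in outputs]
    winner = int(np.argmin(losses))
    # Delta-rule update for the winning expert only
    experts[winner] += lr * np.outer(target - outputs[winner], x)
    return winner

x, target = rng.random(4), rng.random(2)
loss_before = [np.sum((target - W @ x) ** 2) for W in experts]
winner = train_pattern(x, target, experts)
loss_after = np.sum((target - experts[winner] @ x) ** 2)
```

Because only the winner learns, repeated training tends to amplify small initial differences between the experts, which is how structure-function correspondences can emerge.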

4.3. Hybrid Models

The success of mixture-of-experts models suggests that when two or more components are combined within a model, it can be advantageous for the computational properties of the components to differ. Where the properties of the components are radically different, for example, involving the combination of symbolic (rule-based) and connectionist (associative, similarity-based) architectures, the models are sometimes referred to as “hybrid.” The use of hybrid models is inspired by the observation that some aspects of human cognition seem better described by rules (e.g., syntax, reasoning), whereas some seem better described by similarity (e.g., perception, memory). We have previously encountered the debate between symbolic and connectionist approaches (see Section 2.3) and the proposal that connectionist architectures may serve to implement symbolic processes (e.g., Touretzky & Hinton, 1988). The hybrid systems approach takes the alternative view that connectionist and symbolic processing principles should be combined within the same model, taking advantage of the strengths of each computational formalism. A discussion of this approach can be found in Sun (2002a, 2002b). Example models include CONSYDERR (Sun, 1995), CLARION (Sun & Peterson, 1998), and ACT-R (Anderson & Lebiere, 1998).

An alternative to a truly hybrid approach is to develop a multi-part connectionist architecture that has components that employ different representational formats. Such a system may be described as having “heterogeneous” computational components. For example, in a purely connectionist system, one component might employ distributed representations that permit different degrees of similarity between activation patterns, whereas a second component employs localist representations in which there is no similarity between different representations. Behavior is then driven by the interplay between two associative components that employ different similarity structures. One example of a heterogeneous, multiple-component architecture is the complementary learning systems model of McClelland, McNaughton, and O’Reilly (1995), which employs localist representations to encode individual episodic memories but distributed representations to encode general semantic memories. Heterogeneous developmental models may offer new ways to conceive of the acquisition of concepts. For example, the cognitive domain of number may be viewed as heterogeneous in the sense that it combines three systems: the similarity-based representations of quantity, the localist representations of number facts (such as the order of number labels in counting), and a system for object individuation. Carey and Sarnecka (2006) argue that a heterogeneous multiple-component system of this nature could acquire the concept of positive integers even though such a concept could not be acquired by any single component of the system on its own.

4.4. Bayesian Graphical Models

The use of Bayesian methods of inference in graphical models, including causal graphical models, has recently been embraced by a number of cognitive scientists (Chater, Tenenbaum, & Yuille, 2006; Gopnik et al., 2004; see chapter 3 in this volume). This approach stresses how it may be possible to combine prior knowledge, in the form of a set of explicit alternative graph structures and constraints on the complexity of such structures, with Bayesian methods of inference to select the best type of representation of a particular data set (e.g., lists of facts about many different animals); and, within that, to select the best specific instantiation of a representation of that type (Tenenbaum, Griffiths, & Kemp, 2006). These models are useful contributions to our understanding, particularly because they allow explicit exploration of the role of prior knowledge in the selection of a representation of the structure present in each data set. It should be recognized, however, that such models are offered as characterizations of learning at Marr’s (1982) “Computational Level,” and as such they do not specify the representations and processes that are actually employed when people learn. These models do raise questions for connectionist research that does address such questions, however. Specifically, the work provides a benchmark against which connectionist approaches might be tested for their success in learning to represent the structure from a data set and in using such a structure to make inferences consistent with optimal performance according to a Bayesian approach within a graphical model framework. More substantively, the work raises questions about whether or not optimization depends on the explicit representation of alternative structured representations or whether an approximation to such structured representations can arise without their prespecification. For an initial examination of these issues as they arise in the context of causal inference, see McClelland and Thompson (2007).

5. Connectionist Influences on Cognitive Theory

Connectionism offers an explanation of human cognition because instances of behavior in particular cognitive domains can be explained with respect to a set of general principles (parallel distributed processing) and the conditions of the specific domains. However, from the accumulation of successful models, it is also possible to discern a wider influence of connectionism on the nature of theorizing about cognition, and this is perhaps a truer reflection of its impact. How has connectionism made us think differently about cognition?

5.1. Knowledge Versus Processing

One area where connectionism has changed the basic nature of theorizing is memory. According to the old model of memory based on the classical computational metaphor, the information in long-term memory (e.g., on the hard disk) has to be moved into working memory (the central processing unit, or CPU) for it to be operated on, and the long-term memories are laid down via a domain-general buffer of short-term memory (random access memory, or RAM). In this type of system, it is relatively easy to shift informational content between different systems, back and forth between central processing and short- and long-term stores. Computation is predicated on variables: The same binary string can readily be instantiated in different memory registers or encoded onto a permanent medium.

By contrast, knowledge is hard to move about in connectionist networks because it is encoded in the weights. For example, in the past-tense model, knowledge of the past tense rule “add -ed” is distributed across the weight matrix of the connections between input and output layers. The difficulty in portability of knowledge is inherent in the principles of connectionism – Hebbian learning alters connection strengths to reinforce desirable activation states in connected units, tying knowledge to structure. If we start from the premise that knowledge will be very difficult to move about in our information processing system, what kind of cognitive architecture do we end up with? There are four main themes.

First, we need to distinguish between two different ways in which knowledge can be encoded: active and latent representations (Munakata & McClelland, 2003). Latent knowledge corresponds to the information stored in the connection weights from accumulated experience. By contrast, active knowledge is information contained in the current activation states of the system. Clearly, the two are related, because the activation states are constrained by the connection weights. But, particularly in recurrent networks, there can be subtle differences. Active states contain a trace of recent events (how things are at the moment), whereas latent knowledge represents a history of experience (how things tend to be). Differences in the ability to maintain the active states (e.g., in the strength of recurrent circuits) can produce errors in behavior where the system lapses into more typical ways of behaving (Munakata, 1998; Morton & Munakata, 2002).

Second, if information does need to be moved around the system, for example, from a more instance-based (episodic) system to a more general (semantic) system, this will require special structures and special (potentially time-consuming) processes. Thus, McClelland, McNaughton, and O’Reilly (1995) proposed a dialogue between separate stores in the hippocampus and neocortex to gradually transfer knowledge from episodic to semantic memory. French, Ans, and Rousset (2001) proposed a special method to transfer knowledge between two memory systems: Internally generated noise produces “pseudopatterns” from one system that contain the central tendencies of its knowledge; the second memory system is then trained with this extracted knowledge to effect the transfer. Chapters 7 and 8 in the current volume offer a wider consideration of models of episodic and semantic memory, respectively.
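The pseudopattern transfer idea can be sketched with linear “networks.” This is a toy under our own assumptions: the sizes, the learning rate, and the use of a fixed linear map to stand in for the trained first system are all invented, and the delta rule replaces whatever learning procedure the real memory systems would use.

```python
import numpy as np

rng = np.random.default_rng(4)

# "Memory system A" is a fixed linear map standing in for a trained
# network; system B starts with no knowledge at all.
A = rng.normal(size=(3, 5))
B = np.zeros((3, 5))

# Pseudopatterns: internally generated random inputs paired with A's
# responses to them. They carry the central tendencies of A's
# knowledge without any access to A's original training data.
pseudo_in = rng.normal(size=(200, 5))
pseudo_out = pseudo_in @ A.T

# Train B on the pseudopatterns with the delta rule (two passes).
lr = 0.05
for _ in range(2):
    for x, t in zip(pseudo_in, pseudo_out):
        B += lr * np.outer(t - B @ x, x)

probe = rng.normal(size=5)   # B should now mimic A on novel inputs
```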

Third, information will be processed in the same substrate where it is stored. Therefore, long-term memories will be active structures and will perform computations on content. An external strategic control system plays the role of differentially activating the knowledge in this long-term system that is relevant to the current context. In anatomical terms, this distinction broadly corresponds to frontal/anterior (strategic control) and posterior (long-term) cortex. The design means, somewhat counterintuitively, that the control system has no content. Rather, the control system contains placeholders that serve to activate different regions of the long-term system. The control system may contain plans (sequences of placeholders), and it may be involved in learning abstract concepts (using a placeholder to temporarily coactivate previously unrelated portions of long-term knowledge while Hebbian learning builds an association between them), but it does not contain content in the sense of a domain-general working memory. The study of frontal systems then becomes an exploration of the activation dynamics of these placeholders and their involvement in learning (see, e.g., work by Davelaar & Usher, 2002; Haarmann & Usher, 2001; O’Reilly, Braver, & Cohen, 1999; Usher & McClelland, 2001).

Similarly, connectionist research has explored how activity in the control system can be used to modulate the efficiency of processing elsewhere in the system, for instance, to implement selective attention. For example, Cohen et al. (1990) demonstrated how task units could be used to differentially modulate word-naming and color-naming processing channels in a model of the color-word Stroop task. In this model, latent knowledge interacted with the operation of task control, so that it was harder to selectively attend to color naming and ignore information from the more practiced word-naming channel than vice versa. This work was later extended to demonstrate how deficits in the strategic control system (prefrontal cortex) could lead to problems in selective attention in disorders such as schizophrenia (Cohen & Servan-Schreiber, 1992). Chapter 15 in this volume contains a wider consideration of computational models of attention and cognitive control.

Lastly, the connectionist perspective on memory alters how we conceive of domain generality in processing systems. It is unlikely that there are any domain-general processing systems that serve as a “Jack-of-all-trades,” that is, that can move between representing the content of multiple domains. However, there may be domain-general systems that are involved in modulating many disparate processes without taking on the content of those systems – what we might call a system with “a finger in every pie.” Meanwhile, short-term or working memory (as exemplified by the active representations contained in the recurrent loop of a network) is likely to exist as a devolved panoply of discrete systems, each with its own content-specific loop. For example, research in the neuropsychology of language now tends to support the existence of separate working memories for phonological, semantic, and syntactic information (see MacDonald & Christiansen, 2002, for a discussion of these arguments).

5.2. Cognitive Development

A key feature of PDP models is the use of a learning algorithm for modifying the patterns of connectivity as a function of experience. Compared with symbolic, rule-based computational models, this has made them a more sympathetic formalism for studying cognitive development (Elman et al., 1996). The combination of domain-general processing principles, domain-specific architectural constraints, and structured training environments has enabled connectionist models to give accounts of a range of developmental phenomena. These include infant category development, language acquisition, and reasoning in children (see Mareschal & Thomas, 2007, for a recent review, and chapter 16 in this volume).

Connectionism has become aligned with a resurgence of interest in statistical learning and a more careful consideration of the information available in the child's environment that may feed his or her cognitive development. One central debate revolves around how children can become "cleverer" as they get older, appearing to progress through qualitatively different stages of reasoning. Connectionist modeling of the development of children's reasoning was able to demonstrate that continuous incremental changes in the weight matrix driven by algorithms such as backpropagation can result in nonlinear changes in surface behavior, suggesting that the stages apparent in behavior may not necessarily be reflected in changes in the underlying mechanism (e.g., McClelland, 1989). Other connectionists have argued that algorithms able to supplement the computational resources of the network as part of learning may also provide an explanation for the emergence of more complex forms of behavior with age (e.g., cascade correlation; see Shultz, 2003).
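The claim that continuous weight change can yield stage-like behavior can be demonstrated with a single sigmoid unit. This is a toy illustration under assumed values (the bias of -6 and the fixed weight increment are hypothetical), not a model from the literature:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A single sigmoid unit whose incoming weight grows by a small constant
# amount per step, standing in for gradual, backpropagation-style change.
bias = -6.0                          # hypothetical task threshold
weights = np.arange(0.0, 12.0, 0.1)  # weight value after each step
outputs = sigmoid(weights + bias)

# The weight change per step is constant, yet almost all of the change in
# output is concentrated in a narrow band of steps around the point where
# the net input crosses zero: stage-like surface behavior.
jump = outputs[65] - outputs[55]     # change across 10 mid-range steps
early = outputs[10] - outputs[0]     # change across 10 early steps
```

A learner whose weights change at a constant rate can thus appear to acquire the behavior "suddenly," even though nothing discontinuous happens in the mechanism.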

The key contribution of connectionist models in the area of developmental psychology has been to specify detailed, implemented models of transition mechanisms that demonstrate how the child can move between producing different patterns of behavior. This was a crucial addition to a field that has accumulated vast amounts of empirical data cataloging what children are able to do at different ages. The specification of mechanism is also important to counter some strongly empiricist views that simply identifying statistical information in the environment suffices as an explanation of development; instead, it is necessary to show how a mechanism could use this statistical information to acquire some cognitive capacity. Moreover, when connectionist models are applied to development, it often becomes apparent that passive statistical structure is not the key factor; rather, the relevant statistics lie in the transformation of the statistical structure of the environment into the output or behavior that is relevant to the child, thereby appealing to notions like the regularity, consistency, and frequency of input-output mappings.

Recent connectionist approaches to development have begun to explore how the computational formalisms may change our understanding of the nature of the knowledge that children acquire. For example, Mareschal et al. (2007) argue that many mental representations of knowledge are partial (i.e., capture only some task-relevant dimensions); the existence of explicit language may blind us to the fact that there could be a limited role for truly abstract knowledge in the normal operation of the cognitive system (see Westermann et al., 2007). Current work also explores the computational basis of critical or sensitive periods in development, uncovering the mechanisms by which the ability to learn appears to reduce with age (e.g., McClelland et al., 1999; Thomas & Johnson, 2006).
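One mechanism proposed for such reductions in plasticity can be shown directly: as earlier learning drives a unit's weights into the saturated region of the sigmoid, the error gradient available for later learning shrinks. The weight values below are hypothetical; this is an illustration of the principle, not a model from the cited papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_gradient(w, x=1.0, target=0.0):
    """Magnitude of the error gradient for a single sigmoid unit."""
    p = sigmoid(w * x)
    return abs((target - p) * p * (1.0 - p) * x)

# A "young" network with small weights versus an "old" one whose weights
# have been entrenched by prior learning (hypothetical values).
young = delta_rule_gradient(w=0.5)
old = delta_rule_gradient(w=6.0)

# The entrenched unit's output error is larger, but its gradient is far
# smaller, because the p * (1 - p) term vanishes as the sigmoid saturates.
```

On this view, the reduced ability to learn with age need not reflect a dedicated shut-off mechanism; it can fall out of the learning rule itself once prior experience has committed the weights.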

5.3. The Study of Acquired Disorders in Cognitive Neuropsychology

Traditional cognitive neuropsychology of the 1980s was predicated on the assumption of underlying modular structure, that is, that the cognitive system comprises a set of independently functioning components. Patterns of selective cognitive impairment after acquired brain damage could then be used to construct models of normal cognitive function. The traditional models comprised box-and-arrow diagrams that sketched out rough versions of cognitive architecture, informed both by the patterns of possible selective deficit (which bits can fail independently) and by a task analysis of what the cognitive system probably has to do.

In the initial formulation of cognitive neuropsychology, caution was advised in attempting to infer cognitive architecture from behavioral deficits, since a given pattern of deficits might be consistent with a number of underlying architectures (Shallice, 1988). It is in this capacity that connectionist models have been extremely useful. They have both forced more detailed specification of proposed cognitive models via implementation and also permitted assessment of the range of deficits that can be generated by damaging these models in various ways. For example, models of reading have demonstrated that the ability to decode written words into spoken words and recover their meanings can be learned in a connectionist network; and when this network is damaged by, say, lesioning connection weights or removing hidden units, various patterns of acquired dyslexia can be simulated (e.g., Plaut et al., 1996; Plaut & Shallice, 1993). Connectionist models of acquired deficits have grown to be an influential aspect of cognitive neuropsychology and have been applied to domains such as language, memory, semantics, and vision (see Cohen, Johnstone, & Plunkett, 2000, for examples).
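Lesioning of this kind is easy to express computationally. The sketch below is not the Plaut et al. reading model; it uses an arbitrary one-layer mapping (all sizes and weights are hypothetical) simply to show the mechanics of removing connection weights and measuring the behavioral consequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A "trained" one-layer mapping: hypothetical weights standing in for a
# network that has already learned some input-output task.
n_in, n_out, n_patterns = 20, 10, 50
inputs = rng.standard_normal((n_patterns, n_in))
W = rng.standard_normal((n_in, n_out))
targets = sigmoid(inputs @ W) > 0.5       # the intact network's behavior

def accuracy(weights):
    preds = sigmoid(inputs @ weights) > 0.5
    return (preds == targets).mean()

def lesion(weights, proportion):
    """Zero a random proportion of the connection weights."""
    damaged = weights.copy()
    mask = rng.random(weights.shape) < proportion
    damaged[mask] = 0.0
    return damaged

acc_intact = accuracy(W)                  # 1.0 by construction
acc_mild = accuracy(lesion(W, 0.2))       # most responses survive
acc_severe = accuracy(lesion(W, 0.8))     # widespread errors
```

Because knowledge is distributed across many weights, mild lesions typically produce graceful degradation rather than all-or-nothing loss, which is one reason such models map naturally onto graded patient data.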

Several ideas have gained their first or clearest grounding via connectionist modeling. One of these ideas is that patterns of breakdown can arise from the statistics of the problem space (i.e., the mapping between input and output) rather than from structural distinctions in the processing system. In particular, connectionist models have shed light on a principal inferential tool of cognitive neuropsychology, the double dissociation. The line of reasoning argues that if in one patient, ability A can be lost while ability B is intact, and in a second patient, ability B can be lost while ability A is intact, then the two abilities might be generated by independent underlying mechanisms. In a connectionist model of category-specific impairments of semantic memory, Devlin et al. (1997) demonstrated that a single undifferentiated network trained to produce two behaviors could show a double dissociation between them simply as a consequence of different levels of damage. This can arise because the mappings associated with the two behaviors lead them to have different sensitivity to damage. For a small level of damage, performance on A may fall off quickly, whereas performance on B declines more slowly; for a high level of damage, A may be more robust than B. The reverse pattern of relative deficits implies nothing about structure.
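The damage-level argument can be made concrete with two performance curves. The functional forms below are invented for illustration (they are not from Devlin et al., 1997); what matters is only that the two tasks degrade with different shapes.

```python
import numpy as np

# Performance on two behaviors of a single undifferentiated network as a
# function of damage level in [0, 1]. Hypothetical curve shapes: task A is
# sensitive to mild damage but retains partial competence; task B resists
# mild damage but then collapses completely.
damage = np.linspace(0.0, 1.0, 101)
perf_A = 0.4 + 0.6 * np.exp(-8.0 * damage)
perf_B = 1.0 / (1.0 + np.exp(20.0 * (damage - 0.5)))

mild, severe = 10, 80   # indices for damage levels 0.1 and 0.8

# "Patient 1" (mild damage): A impaired relative to B.
# "Patient 2" (severe damage): B impaired relative to A.
# Together they form a double dissociation, with no structural
# difference between the two tasks.
```

The crossing of the two curves is the whole trick: sampling two patients at different damage levels yields the classic dissociation pattern from a single shared mechanism.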

Connectionist researchers have often set out to demonstrate that, more generally, double dissociation methodology is a flawed form of inference, on the grounds that such dissociations arise relatively easily from parallel distributed architectures where function is spread across the whole mechanism (e.g., Plunkett & Bandelow, 2006; Juola & Plunkett, 2000). However, on the whole, when connectionist models show robust double dissociations between two behaviors (for equivalent levels of damage applied to various parts of the network and over many replications), it does tend to be because different internal processing structures (units or layers or weights) or different parts of the input layer or different parts of the output layer are differentially important for driving the two behaviors – that is, there is specialization of function. Connectionist models of breakdown have, therefore, tended to support the traditional inferences. Crucially, however, connectionist models have greatly improved our understanding of what modularity might look like in a neurocomputational system: a partial rather than an absolute property; a property that is the consequence of a developmental process where emergent specialization is driven by structure-function correspondences (the ability of certain parts of a computational structure to learn certain kinds of computation better than other kinds; see Section 4.2); and a property that must now be complemented by concepts such as division of labor, degeneracy, interactivity, and redundancy (see Thomas & Karmiloff-Smith, 2002a; Thomas et al., 2006, for discussion).

5.4. The Origins of Individual Variability and Developmental Disorders

In addition to their role in studying acquired disorders, the fact that many connectionist models learn their cognitive abilities makes them an ideal framework within which to study developmental disorders, such as autism, dyslexia, and specific language impairment (Joanisse & Seidenberg, 2003; Mareschal et al., 2007; Thomas & Karmiloff-Smith, 2002b, 2003a, 2005). Where models of normal cognitive development seek to study the "average" child, models of atypical development explore how developmental profiles may be disrupted. Connectionist models contain a number of constraints (architecture, activation dynamics, input and output representations, learning algorithm, training regimen) that determine the efficiency and outcome of learning. Manipulations to these constraints produce candidate explanations for impairments found in developmental disorders or for the impairments caused by exposure to atypical environments, such as in cases of deprivation.
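As a toy illustration of how varying one such constraint changes a learning outcome, the sketch below trains the same one-weight network under two learning rates. The task (learning y = 2x by gradient descent) and both rate values are hypothetical choices for the demonstration, not parameters from any published disorder model.

```python
import numpy as np

# The same simple network trained under two parameter settings: after the
# same amount of experience, the outcome differs because a constraint on
# learning (here, the learning rate) differs.
x = np.linspace(-1.0, 1.0, 100)
y = 2.0 * x   # the mapping to be learned

def final_error(lr, epochs=50):
    w = 0.0
    for _ in range(epochs):
        w += lr * np.mean(2.0 * x * (y - w * x))   # gradient step on MSE
    return np.mean((y - w * x) ** 2)

typical = final_error(lr=0.3)
atypical = final_error(lr=0.005)   # hypothetical reduced-plasticity setting
```

After identical training experience, the atypical setting leaves a large residual error, a minimal analogue of how a low-level computational constraint can surface as a behavioral impairment at the end of development.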

In the 1980s and 1990s, many theories of developmental deficits employed the same explanatory framework as adult cognitive neuropsychology. There was a search for specific developmental deficits or dissociations, which were then explained in terms of the failure of individual modules to develop. However, as Karmiloff-Smith (1998) and Bishop (1997) pointed out, most of the developmental deficits were actually being explained with reference to nondevelopmental, static, and sometimes adult models of normal cognitive structure. Karmiloff-Smith (1998) argued that the causes of developmental deficits of a genetic origin will lie in changes to low-level neurocomputational properties that only exert their influence on cognition via an extended atypical developmental process (see also Elman et al., 1996). Connectionist models provide the ideal forum to explore the thesis that an understanding of the constraints on the developmental process is essential for generating accounts of developmental deficits.

The study of atypical variability also prompts a consideration of what causes variability within the normal range, otherwise known as individual differences or intelligence. Are differences in intelligence caused by variation in the same computational parameters that can cause disorders? Are some developmental disorders just the extreme lower end of the normal distribution, or are they qualitatively different conditions? What computational parameter settings are able to produce above-average performance or giftedness? Connectionism has begun to take advantage of the accumulated body of models of normal development to consider the wider question of cognitive variation in parameterized computational models (Thomas & Karmiloff-Smith, 2003a).

5.5. Future Directions

The preceding sections indicate the range and depth of influence of connectionism on contemporary theories of cognition. Where will connectionism go next? Necessarily, connectionism began with simple models of individual cognitive processes, focusing on those domains of particular theoretical interest. This piecemeal approach generated explanations of individual cognitive abilities using bespoke networks, each containing its own predetermined representations and architecture. In the future, one avenue to pursue is how these models fit together in the larger cognitive system – for example, to explain how the past tense network described in Section 3.2 might link up with the sentence processing model described in Section 3.3 to process past tenses as they arise in sentences. A further issue is to address the developmental origin of the architectures that are postulated. What processes specify the parts of the cognitive system to perform the various functions, and how do these subsystems talk to each other, both across development and in the adult state? Improvements in computational power will aid more complex modeling endeavors. Nevertheless, it is worth bearing in mind that increasing complexity creates a tension with the primary goals of modeling – simplification and understanding. It is essential that we understand why more complicated models function as they do, or they will merely become interesting artifacts (see Elman, 2005; Thomas, 2004, for further discussion).

In terms of its relation with other disciplines, a number of future influences on connectionism are discernible. Connectionism will be affected by the increasing appeal to Bayesian probability theory in human reasoning. In Bayesian theory, new data are used to update existing estimates of the most likely model of the world. Work has already begun to relate connectionist and Bayesian accounts, for example, in the domain of causal reasoning in children (McClelland & Thompson, 2007). In some cases, connectionism may offer alternative explanations of the same behavior; in others, it may be viewed as an implementation of a Bayesian account (see Section 3.1). Connectionism will continue to have a close relation to neuroscience, perhaps seeking to build more neural constraints into its computational assumptions (O'Reilly & Munakata, 2000). Many of the new findings in cognitive neuroscience are influenced by functional brain imaging techniques. It will be important, therefore, for connectionism to make contact with these data, either via systems-level modeling of the interaction between subnetworks in task performance or in exploring the implications of the subtraction methodology as a tool for assessing the behavior of distributed interactive systems. The increasing influence of brain imaging foregrounds the relation of cognition to the neural substrate; whether an increased focus on the substrate will have particular implications for connectionism, over and above any other theory of cognition, depends on how seriously one takes the neural plausibility of connectionist models.

Connectionist approaches to individual differences and developmental disorders suggest that this modeling approach has more to offer in considering the computational causes of variability. Research in behavioral genetics argues that a significant proportion of behavioral variability is genetic in origin (Bishop, 2006; Plomin, Owen, & McGuffin, 1994). However, the neurodevelopmental mechanisms by which genes produce such variation are largely unknown. Although connectionist cognitive models are not neural, the fact that they incorporate neurally inspired properties may allow them to build links between behavior (where variability is measured) and the substrate on which genetic effects act. In the future, connectionism may therefore help to rectify a major shortcoming in our attempts to understand the relation of the human genome to human behavior – the omission of cognition from current explanations.

6. Conclusions

In this chapter, we have considered the contribution of connectionist modeling to our understanding of cognition. Connectionism was placed in the historical context of nineteenth-century associative theories of mental processes and twentieth-century attempts to understand the computations carried out by networks of neurons. The key properties of connectionist networks were then reviewed, and particular emphasis was placed on the use of learning to build the microstructure of these models. The core connectionist themes include the following: (1) that processing is simultaneously influenced by multiple sources of information at different levels of abstraction, operating via soft constraint satisfaction; (2) that representations are spread across multiple simple processing units operating in parallel; (3) that representations are graded, context-sensitive, and the emergent product of adaptive processes; and (4) that computation is similarity-based and driven by the statistical structure of problem domains, but it can nevertheless produce rule-following behavior. We illustrated the connectionist approach via three landmark models: the Interactive Activation model of letter perception (McClelland & Rumelhart, 1981), the past tense model (Rumelhart & McClelland, 1986), and simple recurrent networks for finding structure in time (Elman, 1990). Apart from its body of successful individual models, connectionist theory has had a widespread influence on cognitive theorizing, and this influence was illustrated by considering connectionist contributions to our understanding of memory, cognitive development, acquired cognitive impairments, and developmental deficits. Finally, we peeked into the future of connectionism, arguing that its relationships with other fields in the cognitive sciences are likely to guide its future contribution to understanding the mechanistic basis of thought.

7. Acknowledgements

This work was supported by British Academy Grant SG–40400 and UK Medical Research Council Grant G0300188 to Michael Thomas, and National Institute of Mental Health Centre Grant P50 MH64445, James L. McClelland, Director.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Aizawa, K. (2004). History of connectionism. In C. Eliasmith (Ed.), Dictionary of philosophy of mind. Retrieved October 4, 2006, from http://philosophy.uwaterloo.ca/MindDict/connectionismhistory.html.

Anderson, J. A. (1977). Neural models with cognitive implications. In D. LaBerge & S. J. Samuels (Eds.), Basic processes in reading perception and comprehension (pp. 27–90). Hillsdale, NJ: Lawrence Erlbaum.

Anderson, J., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., & MacWhinney, B. (1989). Functionalism and the competition model. In B. MacWhinney & E. Bates (Eds.), The crosslinguistic study of language processing (pp. 3–37). New York: Cambridge University Press.

Bechtel, W., & Abrahamsen, A. (1991). Connectionism and the mind. Oxford, UK: Blackwell.

Berko, J. (1958). The child's learning of English morphology. Word, 14, 150–177.

Bishop, D. V. M. (1997). Cognitive neuropsychology and developmental disorders: Uncomfortable bedfellows. Quarterly Journal of Experimental Psychology, 50A, 899–923.

Bishop, D. V. M. (2006). Developmental cognitive genetics: How psychology can inform genetics and vice versa. Quarterly Journal of Experimental Psychology, 59(7), 1153–1168.

Botvinick, M., & Plaut, D. C. (2004). Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111, 395–429.

Burton, A. M., Bruce, V., & Johnston, R. A. (1990). Understanding face recognition with an interactive activation model. British Journal of Psychology, 81, 361–380.

Bybee, J., & McClelland, J. L. (2005). Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition. The Linguistic Review, 22(2–4), 381–410.

Carey, S., & Sarnecka, B. W. (2006). The development of human conceptual representations: A case study. In Y. Munakata & M. H. Johnson (Eds.), Processes of change in brain and cognitive development: Attention and performance XXI (pp. 473–496). Oxford, UK: Oxford University Press.

Carpenter, G. A., & Grossberg, S. (1987a). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930.

Carpenter, G. A., & Grossberg, S. (1987b). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing, 37, 54–115.

Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 10(7), 287–291.

Christiansen, M. H., & Chater, N. (2001). Connectionist psycholinguistics. Westport, CT: Ablex.

Cohen, G., Johnstone, R. A., & Plunkett, K. (2000). Exploring cognition: Damaged brains and neural networks. Hove, Sussex, UK: Psychology Press.

Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332–361.

Cohen, J. D., & Servan-Schreiber, D. (1992). Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychological Review, 99, 45–77.

Dailey, M. N., & Cottrell, G. W. (1999). Organization of face and object recognition in modular neural networks. Neural Networks, 12, 1053–1074.

Davelaar, E. J., & Usher, M. (2003). An activation-based theory of immediate item memory. In J. A. Bullinaria & W. Lowe (Eds.), Proceedings of the Seventh Neural Computation and Psychology Workshop: Connectionist models of cognition and perception (pp. 118–130). Singapore: World Scientific.

Davies, M. (2005). Cognitive science. In F. Jackson & M. Smith (Eds.), The Oxford handbook of contemporary philosophy (pp. 358–394). Oxford, UK: Oxford University Press.

Devlin, J., Gonnerman, L., Andersen, E., & Seidenberg, M. S. (1997). Category specific semantic deficits in focal and widespread brain damage: A computational account. Journal of Cognitive Neuroscience, 10, 77–94.

Dick, F., Bates, E., Wulfeck, B., Aydelott, J., Dronkers, N., & Gernsbacher, M. A. (2001). Language deficits, localization, and grammar: Evidence for a distributive model of language breakdown in aphasic patients and neurologically intact individuals. Psychological Review, 108(3), 759–788.

Dick, F., Wulfeck, B., Krupa-Kwiatkowski, M., & Bates, E. (2004). The development of complex sentence interpretation in typically developing children compared with children with specific language impairments or early unilateral focal lesions. Developmental Science, 7(3), 360–377.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–224.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99.

Elman, J. L. (2005). Connectionist models of cognitive development: Where next? Trends in Cognitive Sciences, 9, 111–117.

Elman, J. L., Bates, E. A., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.

Ervin, S. M. (1964). Imitation and structural change in children's language. In E. H. Lenneberg (Ed.), New directions in the study of language (pp. 163–189). Cambridge, MA: MIT Press.

Fahlman, S., & Lebiere, C. (1990). The cascade correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing 2 (pp. 524–532). Los Altos, CA: Morgan Kaufmann.

Feldman, J. A. (1981). A connectionist model of visual memory. In G. E. Hinton & J. A. Anderson (Eds.), Parallel models of associative memory (pp. 49–81). Hillsdale, NJ: Lawrence Erlbaum.

Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.

French, R. M., Ans, B., & Rousset, S. (2001). Pseudopatterns and dual-network memory models: Advantages and shortcomings. In R. French & J. Sougne (Eds.), Connectionist models of learning, development and evolution (pp. 13–22). London: Springer.

Freud, S. (1895). Project for a scientific psychology. In J. Strachey (Ed.), The standard edition of the complete psychological works of Sigmund Freud (pp. 283–360). London: The Hogarth Press and the Institute of Psycho-Analysis.

Goebel, R., & Indefrey, P. (2000). A recurrent network with short-term memory capacity learning the German –s plural. In P. Broeder & J. Murre (Eds.), Models of language acquisition: Inductive and deductive approaches (pp. 177–200). Oxford, UK: Oxford University Press.

Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111(1), 3–32.

Green, D. C. (1998). Are connectionist models theories of cognition? Psycoloquy, 9(4).

Grossberg, S. (1976). Adaptive pattern classification and universal recoding: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.

Haarmann, H., & Usher, M. (2001). Maintenance of semantic information in capacity limited item short-term memory. Psychonomic Bulletin & Review, 8, 568–578.

Hebb, D. O. (1949). The organization of behavior: A neuropsychological approach. New York: John Wiley & Sons.

Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.

Hinton, G. E., & Anderson, J. A. (1981). Parallel models of associative memory. Hillsdale, NJ: Lawrence Erlbaum.

Hinton, G. E., & McClelland, J. L. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural information processing systems, 1987 (pp. 358–366). New York: American Institute of Physics.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). Washington, DC.

Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing, Vol. 1 (pp. 282–317). Cambridge, MA: MIT Press.

Hoeffner, J. H., & McClelland, J. L. (1993). Can a perceptual processing deficit explain the impairment of inflectional morphology in developmental dysphasia? A computational investigation. In E. V. Clark (Ed.), Proceedings of the 25th Child Language Research Forum (pp. 38–49). Stanford, CA: Center for the Study of Language and Information.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558.

Houghton, G. (2005). Connectionist models in cognitive psychology. Hove, UK: Psychology Press.

Jacobs, R. A. (1999). Computational studies of the development of functionally specialized neural modules. Trends in Cognitive Sciences, 3, 31–38.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

James, W. (1890). Principles of psychology. New York, NY: Holt.

Joanisse, M. F., & Seidenberg, M. S. (1999). Impairments in verb morphology following brain injury: A connectionist model. Proceedings of the National Academy of Sciences USA, 96, 7592–7597.

Joanisse, M. F., & Seidenberg, M. S. (2003). Phonology and syntax in specific language impairment: Evidence from a connectionist model. Brain and Language, 86, 40–56.

Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531–546). Hillsdale, NJ: Lawrence Erlbaum.

Juola, P., & Plunkett, K. (2000). Why double dissociations don't mean much. In G. Cohen, R. A. Johnston, & K. Plunkett (Eds.), Exploring cognition: Damaged brains and neural networks: Readings in cognitive neuropsychology and connectionist modelling (pp. 319–327). Hove, UK: Psychology Press.

Karmiloff-Smith, A. (1998). Development itself is the key to understanding developmental disorders. Trends in Cognitive Sciences, 2, 389–398.

Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.

Kuczaj, S. A. (1977). The acquisition of regular and irregular past tense forms. Journal of Verbal Learning and Verbal Behavior, 16, 589–600.

Lashley, K. S. (1929). Brain mechanisms and intelligence: A quantitative study of injuries to the brain. New York: Dover.

Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111, 309–332.

MacDonald, M. C., & Christiansen, M. H. (2002). Reassessing working memory: A comment on Just & Carpenter (1992) and Waters & Caplan (1996). Psychological Review, 109, 35–54.

MacKay, D. J. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.

Marcus, G. F. (2001). The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press.

Marcus, G., Pinker, S., Ullman, M., Hollander, J., Rosen, T., & Xu, F. (1992). Overregularisation in language acquisition. Monographs of the Society for Research in Child Development, 57 (Serial No. 228).

Mareschal, D., Johnson, M., Sirois, S., Spratling, M., Thomas, M. S. C., & Westermann, G. (2007). Neuroconstructivism: How the brain constructs cognition. Oxford, UK: Oxford University Press.

Mareschal, D., & Shultz, T. R. (1996). Generative connectionist architectures and constructivist cognitive development. Cognitive Development, 11, 571–605.

Mareschal, D., & Thomas, M. S. C. (2007). Computational modeling in developmental psychology. IEEE Transactions on Evolutionary Computation, 11(2), 137–150.

Marr, D. (1982). Vision. San Francisco: W. H. Freeman.

Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.

Mayberry, M. R., Crocker, M., & Knoeferle, P. (2005). A connectionist model of sentence comprehension in visual worlds. In Proceedings of the 27th Annual Conference of the Cognitive Science Society (COGSCI-05, Stresa, Italy). Mahwah, NJ: Lawrence Erlbaum.

McClelland, J. L. (1981). Retrieving general and specific information from stored knowledge of specifics. In Proceedings of the Third Annual Meeting of the Cognitive Science Society (pp. 170–172). Hillsdale, NJ: Lawrence Erlbaum Associates.

McClelland, J. L. (1989). Parallel distributed processing: Implications for cognition and development. In M. G. M. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology (pp. 8–45). Oxford, UK: Clarendon Press.

McClelland, J. L. (1998). Connectionist models and Bayesian inference. In M. Oaksford & N. Chater (Eds.), Rational models of cognition (pp. 21–53). Oxford, UK: Oxford University Press.

McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.

McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.

McClelland, J. L., Plaut, D. C., Gotts, S. J., & Maia, T. V. (2003). Developing a domain-general framework for cognition: What is the best approach? Commentary on a target article by Anderson and Lebiere. Behavioral and Brain Sciences, 22, 611–614.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88(5), 375–405.

McClelland, J. L., Rumelhart, D. E., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2: Psychological and biological models. Cambridge, MA: MIT Press.

McClelland, J. L., Thomas, A. G., McCandliss, B. D., & Fiez, J. A. (1999). Understanding failures of learning: Hebbian learning, competition for representation space, and some preliminary data. In J. A. Reggia, E. Ruppin, & D. Glanzman (Eds.), Disorders of brain, behavior, and cognition: The neurocomputational perspective (pp. 75–80). Oxford, UK: Elsevier.

McClelland, J. L., & Thompson, R. M. (2007). Using domain-general principles to explain children's causal reasoning abilities. Developmental Science, 10, 333–356.

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. (Reprinted in Anderson, J., & Rosenfield, E. (1988). Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.)

McLeod, P., Plunkett, K., & Rolls, E. T. (1998). Introduction to connectionist modelling of cognitive processes. Oxford, UK: Oxford University Press.

Meynert, T. (1884). Psychiatry: A clinical treatise on diseases of the forebrain. Part I. The anatomy, physiology and chemistry of the brain (B. Sachs, Trans.). New York: G. P. Putnam's Sons.

Miikkulainen, R., & Mayberry, M. R. (1999). Disambiguation and grammar as emergent soft constraints. In B. MacWhinney (Ed.), Emergence of language (pp. 153–176). Hillsdale, NJ: Lawrence Erlbaum.

Minsky, M., & Papert, S. (1969). Perceptrons: Anintroduction to computational geometry. Cam-bridge, MA: MIT Press.

Minsky, M. L., & Papert, S. (1988). Percep-trons: An introduction to computational ge-omety. Cambridge, MA: MIT Press.

Morris, W., Cottrell, G., & Elman, J. (2000).A connectionist simulation of the empiricalacquisition of grammatical relations. In S.Wermter & R. Sun (Eds.), Hybrid neural sys-tems (pp. 175–193). Springer Verlag, Heidel-berg.

Morton, J. (1969). Interaction of informationin word recognition. Psychological Review, 76,165–178.

Morton, J. B., & Munakata, Y. (2002). Active ver-sus latent representations: A neural networkmodel of perseveration, dissociation, and de-calage in childhood. Developmental Psychobi-ology, 40, 255–265.

Movellan, J. R., & McClelland, J. L. (1993).Learning continuous probability distributionswith symmetric diffusion networks. CognitiveScience, 17, 463–496.

Munakata, Y. (1998). Infant perseveration andimplications for object permanence theories:A PDP model of the AB task. DevelopmentalScience, 1, 161–184.

Munakata, Y., & McClelland, J. L. (2003). Con-nectionist models of development. Develop-mental Science, 6, 413–429.

O’Reilly, R. C. (1996). Biologically plausibleerror-driven learning using local activation dif-ferences: The generalized recirculation algo-rithm. Neural Compuation, 8(5), 895–938.

O’Reilly, R. C. (1998). Six principles forbiologically-based computational models ofcortical cognition. Trends in Cognitive Sciences,2, 455–462.

O’Reilly, R. C., Braver, T. S., & Cohen, J. D.(1999). A biologically based computationalmodel of working memory. In A. Miyake &P. Shah (Eds.), Models of working memory:Mechanisms of active maintenance and execu-tive control (pp. 375–411). New York: Cam-bridge University Press.

O’Reilly, R. C., & Munakata, Y. (2000). Compu-tational explorations in cognitive neuroscience:Understanding the mind by simulating the brain.Cambridge, MA: MIT Press.

Pinker, S. (1984). Language learnability and lan-guage development. Cambridge, MA: HarvardUniversity Press.

Pinker, S. (1999). Words and rules. London: Wei-denfeld & Nicolson

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.

Plaut, D. C., & Kello, C. T. (1999). The emergence of phonology from the interplay of speech comprehension and production: A distributed connectionist approach. In B. MacWhinney (Ed.), The emergence of language (pp. 381–415). Mahwah, NJ: Lawrence Erlbaum.

Plaut, D. C., & McClelland, J. L. (1993). Generalization with componential attractors: Word and nonword reading in an attractor network. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society (pp. 824–829). Hillsdale, NJ: Lawrence Erlbaum.

Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. E. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.

Plaut, D. C., & Shallice, T. (1993). Deep dyslexia: A case study of connectionist neuropsychology. Cognitive Neuropsychology, 10, 377–500.

Plomin, R., Owen, M. J., & McGuffin, P. (1994). The genetic basis of complex human behaviors. Science, 264, 1733–1739.

Plunkett, K., & Bandelow, S. (2006). Stochastic approaches to understanding dissociations in inflectional morphology. Brain and Language, 98, 194–209.

Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 1–60.

Plunkett, K., & Marchman, V. (1993). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48, 21–69.

Plunkett, K., & Marchman, V. (1996). Learning from a connectionist model of the English past tense. Cognition, 61, 299–308.

Plunkett, K., & Nakisa, R. (1997). A connectionist model of the Arabic plural system. Language and Cognitive Processes, 12, 807–836.

Quartz, S. R. (1993). Neural networks, nativism, and the plausibility of constructivism. Cognition, 48, 223–242.

Quartz, S. R., & Sejnowski, T. J. (1997). The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences, 20, 537–596.

Rashevsky, N. (1935). Outline of a physico-mathematical theory of the brain. Journal of General Psychology, 13, 82–112.

Reicher, G. M. (1969). Perceptual recognition as a function of meaningfulness of stimulus material. Journal of Experimental Psychology, 81, 274–280.

Rohde, D. L. T. (2002). A connectionist model of sentence comprehension and production. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.

Rohde, D. L. T., & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.

Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Washington, DC: Spartan Books.

Rumelhart, D. E., & McClelland, J. L. (1982). An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review, 89, 60–94.

Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 45–76). Cambridge, MA: MIT Press.

Rumelhart, D. E., & McClelland, J. L. (1985). Levels indeed! Journal of Experimental Psychology: General, 114(2), 193–197.

Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tense of English verbs. In J. L. McClelland, D. E. Rumelhart, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2: Psychological and biological models (pp. 216–271). Cambridge, MA: MIT Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.

Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 2–4). Cambridge, MA: MIT Press.

Rumelhart, D. E., Smolensky, P., McClelland, J. L., & Hinton, G. E. (1986). Schemata and sequential thought processes in PDP models. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing, Vol. 2 (pp. 7–57). Cambridge, MA: MIT Press.

Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.

Selfridge, O. G. (1959). Pandemonium: A paradigm for learning. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on Mechanisation of Thought Processes (pp. 511–529). London: HMSO.

Shallice, T. (1988). From neuropsychology to mental structure. Cambridge, UK: Cambridge University Press.

Sharkey, N., Sharkey, A., & Jackson, S. (2000). Are SRNs sufficient for modelling language acquisition? In P. Broeder & J. Murre (Eds.), Models of language acquisition: Inductive and deductive approaches (pp. 33–54). Oxford, UK: Oxford University Press.

Shultz, T. R. (2003). Computational developmental psychology. Cambridge, MA: MIT Press.

Smolensky, P. (1988). On the proper treatment of connectionism. Behavioral and Brain Sciences, 11, 1–74.

Spencer, H. (1872). Principles of psychology (3rd ed.). London: Longman, Brown, Green, & Longmans.

Sun, R. (1995). Robust reasoning: Integrating rule-based and similarity-based reasoning. Artificial Intelligence, 75, 241–295.

Sun, R. (2002a). Hybrid connectionist symbolic systems. In M. Arbib (Ed.), Handbook of brain theories and neural networks (2nd ed., pp. 543–547). Cambridge, MA: MIT Press.

Sun, R. (2002b). Hybrid systems and connectionist implementationalism. In L. Nadel, D. Chalmers, P. Culicover, R. Goldstone, & B. French (Eds.), Encyclopedia of cognitive science (pp. 697–703). London: Macmillan.

Sun, R., & Peterson, T. (1998). Autonomous learning of sequential tasks: Experiments and analyses. IEEE Transactions on Neural Networks, 9(6), 1217–1234.

Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7), 309–318.

Thomas, M. S. C. (2004). The state of connectionism in 2004. Parallaxis, 8, 43–61.

Thomas, M. S. C. (2005). Characterising compensation. Cortex, 41(3), 434–442.

Thomas, M. S. C., Forrester, N. A., & Richardson, F. M. (2006). What is modularity good for? In R. Sun & N. Miyake (Eds.), Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 2240–2245). Vancouver, Canada: Cognitive Science Society.

Thomas, M. S. C., & Johnson, M. H. (2006). The computational modelling of sensitive periods. Developmental Psychobiology, 48(4), 337–344.

Thomas, M. S. C., & Karmiloff-Smith, A. (2002a). Are developmental disorders like cases of adult brain damage? Implications from connectionist modelling. Behavioral and Brain Sciences, 25(6), 727–788.

Thomas, M. S. C., & Karmiloff-Smith, A. (2002b). Modelling typical and atypical cognitive development. In U. Goswami (Ed.), Handbook of childhood development (pp. 575–599). Oxford, UK: Blackwells.

Thomas, M. S. C., & Karmiloff-Smith, A. (2003a). Connectionist models of development, developmental disorders and individual differences. In R. J. Sternberg, J. Lautrey, & T. Lubart (Eds.), Models of intelligence: International perspectives (pp. 133–150). Washington, DC: American Psychological Association.

Thomas, M. S. C., & Karmiloff-Smith, A. (2003b). Modeling language acquisition in atypical phenotypes. Psychological Review, 110(4), 647–682.

Thomas, M. S. C., & Karmiloff-Smith, A. (2005). Can developmental disorders reveal the component parts of the human language faculty? Language Learning and Development, 1(1), 65–92.

Thomas, M. S. C., & Redington, M. (2004). Modelling atypical syntax processing. In W. Sakas (Ed.), Proceedings of the first workshop on psycho-computational models of human language acquisition at the 20th International Conference on Computational Linguistics (pp. 85–92).

Thomas, M. S. C., & Richardson, F. (2006). Atypical representational change: Conditions for the emergence of atypical modularity. In Y. Munakata & M. H. Johnson (Eds.), Processes of change in brain and cognitive development: Attention and Performance XXI (pp. 315–347). Oxford, UK: Oxford University Press.

Thomas, M. S. C., & van Heuven, W. (2005). Computational models of bilingual comprehension. In J. F. Kroll & A. M. B. De Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 202–225). Oxford, UK: Oxford University Press.

Touretzky, D. S., & Hinton, G. E. (1988). A distributed connectionist production system. Cognitive Science, 12, 423–466.

Usher, M., & McClelland, J. L. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592.

van Gelder, T. (1991). Classical questions, radical answers: Connectionism and the structure of mental representations. In T. Horgan & J. Tienson (Eds.), Connectionism and the philosophy of mind (pp. 355–381). Dordrecht: Kluwer Academic.

Weckerly, J., & Elman, J. L. (1992). A PDP approach to processing center-embedded sentences. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum.

Westermann, G. (1998). Emergent modularity and U-shaped learning in a constructivist neural network learning the English past tense. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 1130–1135). Hillsdale, NJ: Lawrence Erlbaum.

Westermann, G., Mareschal, D., Johnson, M. H., Sirois, S., Spratling, M. W., & Thomas, M. S. C. (2007). Neuroconstructivism. Developmental Science, 10, 75–83.

Williams, R. J., & Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications. Hillsdale, NJ: Lawrence Erlbaum.

Xie, X., & Seung, H. S. (2003). Equivalence of backpropagation and contrastive Hebbian learning in a layered network. Neural Computation, 15, 441–454.

Xu, F., & Pinker, S. (1995). Weird past tense forms. Journal of Child Language, 22, 531–556.