
The Journal of Neuroscience, November 1993, 13(11): 4700-4719

A Neurobiological Model of Visual Attention and Invariant Pattern Recognition Based on Dynamic Routing of Information

Bruno A. Olshausen,1,3 Charles H. Anderson,1,2,3 and David C. Van Essen1,3

1Computation and Neural Systems Program, California Institute of Technology, Pasadena, California 91125, 2Jet Propulsion Laboratory, Pasadena, California 91109, and 3Department of Anatomy and Neurobiology, Washington University School of Medicine, St. Louis, Missouri 63110

We present a biologically plausible model of an attentional mechanism for forming position- and scale-invariant representations of objects in the visual world. The model relies on a set of control neurons to dynamically modify the synaptic strengths of intracortical connections so that information from a windowed region of primary visual cortex (V1) is selectively routed to higher cortical areas. Local spatial relationships (i.e., topography) within the attentional window are preserved as information is routed through the cortex. This enables attended objects to be represented in higher cortical areas within an object-centered reference frame that is position and scale invariant. We hypothesize that the pulvinar may provide the control signals for routing information through the cortex. The dynamics of the control neurons are governed by simple differential equations that could be realized by neurobiologically plausible circuits. In preattentive mode, the control neurons receive their input from a low-level "saliency map" representing potentially interesting regions of a scene. During the pattern recognition phase, control neurons are driven by the interaction between top-down (memory) and bottom-up (retinal input) sources. The model respects key neurophysiological, neuroanatomical, and psychophysical data relating to attention, and it makes a variety of experimentally testable predictions.

[Key words: visual attention, recognition, model, gating, visual cortex, pulvinar, control]

Of all the visual tasks humans can perform, pattern recognition is arguably the most computationally difficult. This can be attributed primarily to two major factors. The first is that in order to recognize a particular object, the brain must go through a matching process to determine which of the countless objects it has seen before best matches a particular object under scrutiny. The second factor is that any particular object can appear at different positions, sizes, and orientations on the retina, thus giving rise to very different neural representations at early stages of the visual system.

Received Dec. 3, 1992; revised Apr. 29, 1993; accepted May 6, 1993.

We thank Christof Koch, John Tsotsos, Andreas Hers, Harold Hawkins, Ed Connor, and Bill Press for critical commentary and helpful suggestions. This work was supported by ONR Grant N00014-89-J-1192 and U.S. PHS Grant MH19138-02 (T32). Portions of this work were described in CNS Memo 18, California Institute of Technology, August 1992.

Present address of all authors: Department of Anatomy and Neurobiology, Box 8108, Washington University School of Medicine, 660 South Euclid Avenue, St. Louis, MO 63110.

Copyright © 1993 Society for Neuroscience 0270-6474/93/134700-20$05.00/0

Research on associative memories has provided some insight as to how the problem of pattern matching can be solved by neural networks (e.g., Hopfield, 1982; Kanerva, 1988). However, it is far less clear how the brain solves the second problem to produce object representations that are invariant with respect to the dramatic fluctuations that occur on the sensory inputs. Our goal here is to propose a neurobiological solution to this problem that is detailed enough in its structure to generate useful experimental predictions.

Our basic proposal is similar to a psychological theory put forth by Palmer (1983), in which it was proposed that the process of attending to an object places it into a canonical, or object-based, reference frame. It was suggested that the position and size of the reference frame could be set by the position and size of the object in the scene (assuming it was roughly segmented), and that the orientation of the reference frame could be estimated from relatively low-level cues, such as elongation or axis of symmetry (see also Marr, 1982). The computational advantage of such a system is obvious: only one or a few versions of an object need to be stored in order for the object to be recognized later under different viewing conditions. The disadvantage, of course, is that a scene containing multiple objects requires a serial process to attend to one object at a time. However, psychophysical evidence suggests that the brain indeed employs such a sequential strategy for pattern recognition (Bergen and Julesz, 1983; Treisman, 1988).

Palmer made no attempt to describe a neural mechanism for transforming an object's representation from one reference frame to another, because his was primarily a psychological model. Various other models have been proposed for transforming reference frames using neural circuitry (Pitts and McCulloch, 1947; Hinton, 198 a; Hinton and Lang, 1985; von der Malsburg and Bienenstock, 1986). Of these, only the proposal of Pitts and McCulloch can be viewed truly as a neurobiological model. However, their proposal, that the brain averages over all possible transformations of an object via a scanning process, cannot be reconciled with our current understanding of the visual cortex.

In this article we propose a neurobiological mechanism for routing retinal information so that an object becomes represented within an object-based reference frame in higher cortical areas. The mechanism is modified and expanded from an earlier proposal (Anderson and Van Essen, 1987) for dynamically shifting the alignment of neural input and output arrays without loss of spatial relationships. The model presented here allows both shifting and scaling between input and output arrays, and it also provides a solution for controlling the shift and scale in an


[Figure 1 labels: object-centered reference frame (position and scale invariant); feature vector; early/intermediate]

Figure 1. Shifting and rescaling the window of attention. The image within the window of attention in the retina is remapped onto an array of sample nodes in an object-centered reference frame. a, In the simplest scheme, each "pixel" in the object-centered reference frame represents image luminance. b, More realistically, each pixel should presumably correspond to a feature vector that integrates over a somewhat larger spatial region and represents orientation, depth, texture, and so on.

autonomous fashion. While the model is clearly an oversimplification in some respects, it respects key neuroanatomical constraints and is consistent with neurophysiological and psychophysical data relating to directed visual attention.

We begin with a description of the basic model (the dynamic routing circuit) and its autonomous control. Subsequent sections describe the proposed neurobiological substrates and mechanisms, predictions of the model, and a comparison with other models that have been proposed for visual attention and recognition.

The Model

The goal of our model is to provide a neurobiologically plausible mechanism for shifting and rescaling the representation of an object from its retinal reference frame into an object-centered reference frame. Information in the retinal reference frame is represented on a neural map (the topographic representation in V1), and we hypothesize that information in the object-based reference frame is also represented on a neural map, as illustrated in Figure 1. This does not necessarily imply that only "pixels" can be routed into the high-level areas, as drawn in Figure 1a; each sample node in the high-level map could be expanded into a feature vector representing various local image properties, such as orientation, texture, and depth, that are made explicit along the way (Fig. 1b).

In order to map topographically an arbitrary section of the input onto the output, the neurons in the output stage need to have dynamic access to neurons in the input stage. In the brain, this access must necessarily be obtained via the physical hardware of axons and dendrites. Since these pathways are physically fixed for the time scale of interest to us (< 1 sec), there needs to be a way of dynamically modifying their strengths. We propose that the efficacy of transmission along these pathways is modulated by the activity of control neurons whose primary responsibility is to dynamically route information through successive stages of the cortical hierarchy.

A dynamic routing circuit

Figure 2a shows a simplified, one-dimensional dynamic routing circuit (the next section discusses how this circuit can be scaled up as a model of the visual cortex). It consists of an input layer of 33 nodes, an output layer of five nodes, and two layers in between. Additionally, a set of control units make multiplicative contacts onto the feedforward pathways in order to change connection strengths. This network has been constructed so that

1. the fan-in (number of inputs) on any node is the same (in this case 5),
2. the spacing between inputs doubles at each successive stage, and
3. the number of nodes within a layer is such that the spread of its total input field just covers the layer below.

This connection scheme has the attractive property of keeping the fan-in on any node fixed to a relatively low number while allowing the nodes in the output layer access to any part of the input layer. This property will be important in scaling up the model.

An example of how the weights might be set for different positions and sizes of the window of attention is shown in Figure 2, b and c. When the window is at its smallest size (same resolution as the input stage, Fig. 2b), the weights are set so as to establish a one-to-one correspondence between nodes in the output and the attended nodes in the input. When the window is at a larger size, the weights must be set so that multiple inputs converge onto a single output node, resulting in a lower-resolution representation of the contents of the window of attention on the output nodes. If the input representation were to contain nodes tuned for different spatial frequencies, then the low-frequency nodes would be primarily used when the window of attention is large, whereas the high-frequency nodes would be


    4702 Olshausen et al. - Model of Visual Attention and Recognition

[Figure 2: panels a, b, and c; labels: input, window of attention]

Figure 2. A simple, one-dimensional dynamic routing circuit. a, Connections are shown for the leftmost node in each layer. The connections for the other nodes are the same, but merely shifted. N denotes the number of nodes within each layer, and l denotes the layer number. A set of control units (not explicitly shown) provide the necessary signals for modulating connection strengths so that the image within the window of attention in the input is mapped onto the output nodes. b and c, Some examples of how connection strengths would be set for different positions and sizes of the window of attention. The gray level of each connection denotes its strength. Each node, $I_i^{l+1}$, essentially interpolates from the nodes below by forming a linear weighted sum of its inputs:

$$I_i^{l+1} = \sum_j w_{ij}^{l} I_j^{l},$$

where $w_{ij}^{l}$ denotes the strength of the connection from node j in level l to node i in level l + 1. If a gaussian is used as the interpolation function, then $w_{ij}^{l}$ is given by

$$w_{ij}^{l} = \exp\left[-\frac{(j - \alpha_l i - d_l)^2}{2\sigma_l^2}\right],$$

where the parameters $d_l$, $\alpha_l$, and $\sigma_l$ denote the amount of translation, scaling, and blurring, respectively, in the transformation from level l to level l + 1. The overall translation, scaling, and blurring of the entire circuit ($d$, $\alpha$, and $\sigma$) is then given by $d = d_0 + \alpha_0(d_1 + \alpha_1 d_2)$, $\alpha = \alpha_0 \alpha_1 \alpha_2$, and $\sigma^2 = \sigma_0^2 + \alpha_0^2(\sigma_1^2 + \alpha_1^2 \sigma_2^2)$. Note that the lowest layers are best suited for small, fine-scale adjustments to the position and size of the attentional window, while the upper layers are better suited for large, coarse-scale adjustments.

used when the window is small. Thus, much of the image smoothing could be accomplished by using a set of hardwired filters, and then switching between these filters depending on the size of the attentional window.

The challenge in controlling the routing circuit lies in properly setting the synaptic weights to yield the desired position and size of the window of attention. Low levels of the circuit are well suited for making fine adjustments in the position and scale of the window of attention, whereas higher levels are best suited for coarse control. In general, though, there are an infinite number of possible solutions in terms of the combinations of weights that could achieve any particular input-output transformation.
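The three construction rules for the routing circuit, and the composition rule stated in the Figure 2 caption, can be checked with a short sketch. This is an illustration only: the function names are hypothetical, and the intermediate layer sizes follow from one reading of rule 3 (each layer is just wide enough that the input fields of the layer above tile it), which the text does not spell out numerically.

```python
def layer_sizes(n_output=5, fan_in=5, n_stages=3):
    """Node counts from the input layer to the output layer.

    Assumes input spacing 1, 2, 4, ... from the lowest stage upward.
    """
    sizes = [n_output]
    # Work downward from the output: the top stage has the widest spacing.
    for stage in reversed(range(n_stages)):
        spacing = 2 ** stage
        sizes.append(sizes[-1] + (fan_in - 1) * spacing)
    return sizes[::-1]


def compose(stages):
    """Combine per-stage (d, alpha, sigma) triples, listed from level 0 upward.

    Returns the overall translation, scaling, and blur (d, alpha, sigma).
    """
    d, alpha, var = 0.0, 1.0, 0.0
    for d_l, a_l, s_l in reversed(stages):
        # A lower stage maps a position x in the level above to a_l * x + d_l.
        d = d_l + a_l * d
        var = s_l ** 2 + a_l ** 2 * var
        alpha = a_l * alpha
    return d, alpha, var ** 0.5


sizes = layer_sizes()
d, alpha, sigma = compose([(1.0, 2.0, 0.5), (0.5, 1.0, 1.0), (2.0, 1.5, 0.0)])
```

With the defaults, `layer_sizes()` reproduces the 33-node input and five-node output described in the text, and `compose` unrolls to the same nested expressions for $d$, $\alpha$, and $\sigma^2$ given in the caption.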

Control

Our analysis of how information flow can be controlled is aided by visualizing the routing circuit in "connection space," as shown in Figure 3a. This diagram shows the connection matrix for a simple one-dimensional routing circuit composed of two layers: an input layer and an output layer. The horizontal axis represents the nodes constituting the input layer of the network, and the vertical axis represents the nodes constituting the output layer. An x at coordinate (j, i) in connection space denotes that a physical connection exists from node j in the input to node i in the output; the lack of an x at (j, i) implies that


[Figure 3: connection-space diagrams, panels a-d; horizontal axis: input; vertical axis: output; panels b and c labeled "window of attention"; panel d labeled "window of attention (aliased)"]

Figure 3. An illustration of connection space. The input contains 17 sample nodes and the output contains nine sample nodes. a, Each x denotes a physical connection from an input node to an output node. We shall denote the effective strength of the connection from node j in the input to node i in the output as $w_{ij}$. b and c, The stippled region indicates those connections that need to be enabled ($w_{ij} > 0$) in order to map the region within the window of attention onto the output nodes. d, If the width of the enabled region is too small, then aliasing will result; an exaggerated case is illustrated here (i.e., some output nodes will be lacking any input, leading to spurious patterns in the output).

no connection pathway exists between those nodes. We denote the strength of the connection at (j, i) as $w_{ij}$. Note that for a two-dimensional routing circuit the connection matrix would require four dimensions to display. We will use the one-dimensional routing circuit for ease of illustration, but the concepts developed here are readily extendible to two dimensions.

If the window of attention is to be of a certain position and size, then the strength of each connection, $w_{ij}$, needs to be set appropriately. Figure 3b shows how this would look in connection space for an attentional window centered within the input array with a scale factor of one (i.e., no magnification). The stippled area represents those connections that are enabled; the remaining connections are effectively disabled by mechanisms discussed below. If the window of attention is to shift to the left or right, then the band of enabled connections must translate across the connection matrix. Changing the size of the window of attention corresponds to tilting the band of open connections, as shown in Figure 3c. Note that the band of open connections must also be widened as it is tilted (corresponding to blur); otherwise, aliasing would occur, leading to spurious patterns in the output representation (Fig. 3d).

By viewing the routing circuit in this way, it can be seen that the problem of setting the position, size, and blur of the window of attention amounts to one of generating the proper patterns of active synapses in connection space. How this might be accomplished by the control units depends on how they are connected to the feedforward synapses of the routing circuit. One possible scenario is for each control unit to modulate the strength of a single physical connection (j, i), as illustrated in Figure 4a. If a given control unit were on, then its corresponding connection would be enabled, and if it were off then the connection would be disabled. Nearly any remapping could then be accomplished by simply activating the control units corresponding to the connections we wish to enable. However, this scheme would require an enormous number of control units for a scaled-up system. Since the set of remappings we wish to accomplish (translations and scalings) is but a minute fraction of all possible remappings, this scheme would arguably constitute a waste of computational resources. Another possibility would be for the


control units to gate connections globally so that each unit is responsible for effecting a single position and scale of the window of attention, as shown in Figure 4b. However, this scheme would require a large fan-out for each control unit in a scaled-up system. This could cause implementation difficulties and render the circuit neurobiologically implausible.

Our proposed solution to the control problem minimizes both the number of control units and the fan-out required by having each control unit modulate a local group of synapses, or a control block in connection space (Fig. 4c). The problem of forming the desired patterns in connection space then becomes an approximation problem, in which the control blocks form the basis functions and the activations of the corresponding control units form the coefficients. That is, the connection strengths $w_{ij}$ would be determined according to

$$w_{ij} = \sum_k c_k \Psi_k(j, i), \qquad (1)$$

where $c_k$ denotes the activity of the kth control unit, and the function $\Psi_k(j, i)$ specifies the shape of the kth control block in connection space. In order to facilitate their ability to approximate patterns in connection space, the control blocks should not have sharp boundaries; rather, they should have a gaussian-like taper and overlap one another somewhat. Shaping the control blocks as in Figure 4c would be optimal for realizing translations, but they could also be used to approximate scalings, as shown in Figure 4d. It may well be possible to optimize the shape of the control blocks using appropriate learning algorithms, but the strategy illustrated here will suffice for our immediate purposes.

An alternative way of expressing Equation 1 that will be useful later is

$$w_{ij} = \sum_k c_k \Gamma_{ijk}, \qquad (2)$$

where $\Gamma_{ijk} = \Psi_k(j, i)$. In this sense, $\Gamma_{ijk}$ denotes the weight with which $c_k$ modulates the strength of synapse (j, i). Note that $\Gamma_{ijk} = 0$ for most combinations of i, j, and k, since each control neuron modulates only a small fraction of the many possible synapses (j, i).
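As a rough illustration of Equation 1, the sketch below builds connection strengths from gaussian-tapered control blocks for a 17-input, nine-output circuit like the one in Figure 3. The block layout, the block width, the wiring rule in `connected`, and all function names are assumptions made for this example; the paper does not specify them.

```python
import math

N_IN, N_OUT = 17, 9  # sizes used in the Figure 3 illustration


def connected(j, i):
    """Assumed wiring: output node i sees input nodes i .. i + (N_IN - N_OUT)."""
    return 0 <= j - i <= N_IN - N_OUT


def psi(center, j, i, width=1.5):
    """Gaussian-tapered control block centered at (j_k, i_k) in connection space."""
    j_k, i_k = center
    if not connected(j, i):
        return 0.0  # no physical pathway, so nothing to modulate
    return math.exp(-((j - j_k) ** 2 + (i - i_k) ** 2) / (2.0 * width ** 2))


def weights(c, centers):
    """Equation 1: w[i][j] = sum_k c[k] * psi_k(j, i)."""
    return [[sum(c_k * psi(ctr, j, i) for c_k, ctr in zip(c, centers))
             for j in range(N_IN)]
            for i in range(N_OUT)]


# Hypothetical layout: overlapping blocks on a coarse grid in connection space.
centers = [(j, i) for i in range(0, N_OUT, 2) for j in range(0, N_IN, 2)]
# Activate the blocks along the diagonal j = i + 4 (centered window, scale 1).
c = [1.0 if j_k - i_k == 4 else 0.0 for j_k, i_k in centers]
w = weights(c, centers)
```

Five active control units are enough here to open a tapered band along the diagonal, rather than one unit per connection as in the Figure 4a scenario. Tilting the set of active blocks would approximate a change of scale in the same way.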


Figure 4. Some possible control scenarios. a, Each control unit modulates the strength of a single connection. b, Each control unit modulates the strength of a large number of connections in order to effect a global position and scale of the window of attention. c, Each control unit modulates a local group of connections, or a control block. d, Approximating a desired position and scale of the window of attention using control blocks.

Figure 5. A simple attentional strategy for an autonomous visual system. Objects are preattentively segmented via low-pass filtering. Once an object has been localized, the contents of the window of attention are fed to an associative memory for recognition. This process is then repeated ad infinitum, or until all interesting locations have been attended.

[Figure 4: connection-space diagrams, panels a-d; axes: input, output; label: control]

[Figure 5 panels: 1. Blur objects into blobs. 2. Focus the window of attention on a blob. 3. Feed the high-resolution contents within the window of attention to an associative memory. 4. Note the location, size, and identity of the object (example: "large A, lower left"). 5. Move on to the next blob and repeat.]


Autonomous control

Up to now we have described an essentially "open loop" model of visual attention. That is, given a desired position and size for the window of attention, one could manually set the activity of the control units of the network so that the image within the window is remapped onto the output units of the network. We now describe how the network may be autonomously controlled when provided only with visual input and no external commands beyond the initial task specification.

System objective. The purpose of attention is to focus the neural resources for recognition on a specific region within a scene. Thus, it would make sense for the attentional window to be automatically guided to salient, or potentially informative, areas of the visual input. Salient areas can often be defined on the basis of relatively low-level cues, such as pop-out due to motion, depth, texture, or color (e.g., Koch and Ullman, 1985; Anderson et al., 1985). Here, we utilize a very simple measure of salience based on luminance pop-out, in which attention is attracted to "blobs" in a low-pass-filtered version of a scene. (A blob may be defined simply as a contiguous cluster of activity within an image.) In reality, attention can also be directed via voluntary or cognitive influences, but these are not incorporated into our present model.

We propose the following simple but useful strategy for an autonomous visual system (see Fig. 5).

1. Form a low-pass-filtered version of the scene so that objects are blurred into blobs.
2. Select one of the blobs from the low-pass image (whichever is brightest or largest) and set the position and size of the window of attention to match the position and size of the blob.
3. Feed the high-resolution contents of the window of attention to an associative memory for recognition.
4. If a match with one of the memories is close enough (by some as yet unspecified criterion), then consider the object to have been recognized; note its identity, location, and size in the scene. If there is not a good match, then consider the object to be unknown; either learn it or disregard it.
5. Now inhibit this part of the scene and go to step 2 (find the next most salient blob).

The following three subsections describe the details for carrying out steps 2, 3, and 5. Step 1 is trivial, whereas step 4 is a high-level problem beyond the scope of this article (cf. Carpenter and Grossberg, 1987; Mumford, 1992).

Focusing attention on a blob. We begin by formulating a solution for a simple one-dimensional routing circuit with one or more gaussian blobs presented to the input units, as shown in Figure 6a. The values on the output units, $I_i^{out}$, are computed from the input units, $I_j^{in}$, via

$$I_i^{out} = \sum_j w_{ij} I_j^{in} \qquad (3)$$

$$\phantom{I_i^{out}} = \sum_j \sum_k c_k \Gamma_{ijk} I_j^{in}. \qquad (4)$$
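To make Equations 3 and 4 concrete, here is a minimal sketch of a routing circuit in the spirit of Figure 6a, with three control units that each select one position of the window. The sizes, the non-overlapping window layout, and the function names are assumptions chosen to keep the example small, not values taken from the paper.

```python
N_IN, N_OUT = 9, 3  # hypothetical sizes, smaller than the circuits in the figures


def gamma(i, j, k):
    """Gamma_ijk = 1 when control unit k routes input j to output i.

    Assumed layout: unit k selects the window that starts at input 3 * k.
    """
    return 1.0 if j == 3 * k + i else 0.0


def route(c, inputs):
    """Equation 4: out_i = sum_j sum_k c_k * Gamma_ijk * in_j."""
    return [sum(c[k] * gamma(i, j, k) * inputs[j]
                for j in range(N_IN) for k in range(len(c)))
            for i in range(N_OUT)]


inputs = [0, 0, 0, 1, 2, 1, 0, 0, 0]          # a blob in the middle of the input
center_only = route([0.0, 1.0, 0.0], inputs)  # window placed on the blob
left_only = route([1.0, 0.0, 0.0], inputs)    # window misses the blob
```

Activating only the center unit copies the blob onto the output nodes with its shape intact, while activating only the left unit yields an empty output. Choosing which unit to activate is the control problem addressed next.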

Note that Equation 4 is obtained by substituting Equation 2 into Equation 3. In this simple circuit the $\Gamma_{ijk}$ are set so that each control unit $c_k$ corresponds to a global position of the window of attention, but in general this need not be the case.

In order to focus the window of attention on a blob in the input, the network's goal is to fill the output units with a blob while maintaining a topographic correspondence between the input and output (Fig. 5, step 2). Since the dynamic variables

[Figure 6: panels a and b; label: blob map]

Figure 6. a, A simple one-dimensional routing circuit with a gaussian blob presented to the input units. Each control unit corresponds to a different position of the window of attention: left, center, or right. For example, in order to accomplish the remapping shown, the control unit for that position should be set to 1 and the other two to 0. b, The same circuit with control circuitry added to autonomously focus the window of attention on a blob in the input. Each control unit essentially has a gaussian receptive field in the input layer. The control units then compete among each other, via negatively weighted interconnections, such that only the control unit corresponding to the strongest blob in the input prevails. The combined leaky integrator and squashing function (Eqs. 7, 8) are denoted by the amplifier symbol.

in this network are the $c_k$, we need to formulate an equation governing the dynamics of $c_k$ that accomplishes this objective. We can accomplish the first part of the objective by letting $c_k$ follow the gradient of an objective function, $E_{blob}$, that provides a measure of how well a blob is focused on the output units. One possible choice for $E_{blob}$ is the correlation between the actual values on the output units, $I_i^{out}$, and the desired blob shape, $G_i$. That is,

$$E_{blob} = -\sum_i I_i^{out} G_i, \qquad G_i = \exp[-(i - P)^2/\sigma^2]. \qquad (5)$$

The second part of the objective (maintaining topography) can be accomplished by letting $c_k$ follow the gradient of a constraint function, $E_{constraint}$, that favors valid control states, that is, those


corresponding to translations or scalings of the input-output transformation. One possible choice for $E_{constraint}$ is

$$E_{constraint} = -\frac{1}{2} \sum_k \sum_l c_k T^c_{kl} c_l, \qquad (6)$$

where the constraint matrix $T^c$ is chosen so as to couple the control neurons appropriately. For the simple circuit of Figure 6a, each control neuron corresponds to a different position of the window of attention, so we could define $T^c$ as

$$T^c_{kl} = \begin{cases} -1 & k \neq l \\ 0 & k = l. \end{cases}$$

This has the effect of punishing any state in which two or more control units are active simultaneously, and thus forces a winner-take-all solution. (The more general case using control blocks is described below.)

A dynamical equation for $c_k$ that simultaneously minimizes both $E_{blob}$ and $E_{constraint}$ is given by

$$c_k = \sigma(u_k), \qquad (7)$$

where the constants $\tau$ and $\eta$ determine the rate of convergence of the system, and the constant $\beta$ determines the contribution of $E_{constraint}$ relative to $E_{blob}$. A sigmoidal squashing function ($\sigma$) is used to limit $c_k$ to the interval [0, 1]. (A derivation of Eq. 8 is given in the Appendix.)

A neural circuit for computing Equations 7 and 8 is shown in Figure 6b. The first term on the right of Equation 8 is computed by correlating the gaussian, $G$, with a shifted version of the input (the amount of shift depends on the index k). The second term is computed by forming a weighted sum of the activities on the other control units. These two results are then summed together and passed through a leaky integrator and squashing function to form the output of the control unit, $c_k$. Thus, the $c_k$ essentially derive their inputs directly from a "blob map," and then compete among each other so that the $c_k$ corresponding to the strongest blob prevails.

The circuit of Figure 6 could easily be modified to allow for different sizes of the window of attention by adding another set of control units for each desired size of the window of attention. The control units corresponding to a large window of attention would then derive their inputs from a coarse-grained (low-resolution) blob map, while control units corresponding to a small window of attention would derive their inputs from a fine-grained (high-resolution) blob map. All of these units would then compete with one another so that the window of attention is constrained to a single position and scale. (See example in the next section.)

In a more biologically plausible scenario, the control units would be configured into control blocks, like those shown in Figure 4c. In this case, Equation 8 states that the input to each $c_k$ would be computed by correlating the gaussian values, $G_i$, and the input values, $I_j^{in}$, that are connected via that control unit (specified by $\Gamma_{ijk}$). Note that since the $G_i$ are fixed, the term $\sum_i G_i \Gamma_{ijk}$ (Eq. 8) can essentially be considered a fixed weight. Also, the constraint matrix, $T^c$, would need to be modified in this case so that those control units corresponding to a common translation or scale reinforce each other ($T^c_{kl} > 0$), while control units that are not part of the same transformation inhibit each other ($T^c_{kl} < 0$), as illustrated in Figure 7. This scheme has the effect of introducing many local minima, however, and so the control neurons need to be more tightly constrained in order to converge on states that preserve local spatial relationships. We have accomplished this by utilizing a coarse-to-fine control architecture (B. Olshausen, unpublished observations). In this scheme, routing is at first performed by a small number of control neurons on a low-pass-filtered version of the image, and this smaller set of control neurons is then used to constrain the activities of the fine-grained control neurons routing the high-resolution information.
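The winner-take-all behavior described for the blob-finding circuit can be sketched numerically. Equation 8 is not legible in this copy, so the update below is only one leaky-integrator form consistent with the verbal description (a blob-correlation drive plus mutual inhibition through $T^c_{kl} = -1$ for $k \neq l$); the exact equation, the sigmoid gain, the parameter values, and the three-window layout are all assumptions.

```python
import math


def sigmoid(u, gain=4.0):
    """Squashing function limiting control activity to [0, 1] (gain is assumed)."""
    return 1.0 / (1.0 + math.exp(-gain * u))


def focus_on_blob(inputs, n_out=3, beta=3.0, tau=1.0, eta=1.0,
                  dt=0.01, steps=2000):
    """Euler-integrate an assumed form of the control dynamics.

    du_k/dt = -u_k / tau + eta * (drive_k - beta * sum_{l != k} c_l)
    c_k = sigmoid(u_k)
    """
    n_ctrl = len(inputs) // n_out
    mid = (n_out - 1) / 2.0
    G = [math.exp(-(i - mid) ** 2) for i in range(n_out)]  # desired blob shape
    # Correlate G with the input window that each control unit would route.
    drive = [sum(G[i] * inputs[n_out * k + i] for i in range(n_out))
             for k in range(n_ctrl)]
    u = [0.0] * n_ctrl
    for _ in range(steps):
        c = [sigmoid(x) for x in u]
        total = sum(c)
        u = [u_k + dt * (-u_k / tau + eta * (drive[k] - beta * (total - c[k])))
             for k, u_k in enumerate(u)]
    return [sigmoid(x) for x in u]


# A weak blob on the left and a strong blob in the center.
c = focus_on_blob([0, 0.5, 0, 1, 2, 1, 0, 0, 0])
```

Under these assumptions the control unit whose window covers the strongest blob saturates near 1 while the others are suppressed toward 0, which is the competition the text attributes to the circuit of Figure 6b.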

Recognition. Once the window of attention has been focused on a blob, the underlying high-resolution information can also be fed through the routing circuit and into the input of an associative memory for recognition. However, it is likely that the initial estimation of position and size made by routing the blob would be only approximately correct, and this may cause problems for matching the high-resolution information. Thus, it would be desirable to have the associative memory help adjust the position and scale of the attentional window while it converges. How, then, shall the associative memory be incorporated into the control of the routing circuit?

If a Hopfield associative memory (Hopfield, 1984) is used for recognition, then we can replace $E_{blob}$ with the associative memory's energy function, $E_{mem}$, which is defined as

$$E_{mem} = -\frac{1}{2} \sum_i \sum_j T_{ij} V_i V_j + \sum_i \frac{1}{R_i} \int_0^{V_i} g_i^{-1}(V)\, dV - \sum_i V_i I_i^{mem}. \qquad (9)$$

In this equation the $V_i$ denote the output voltages on the associative memory neurons, $T_{ij}$ denotes the connection strength between neurons i and j, $I_i^{mem}$ denotes the inputs to the memory, and $g_i$ is a squashing function such as tanh(x). Normally, the only dynamic variables are the $V_i$, which evolve by following a monotonically increasing function, $g_i$, of the gradient of the energy. That is,

$$V_i = g_i(u_i), \qquad (10)$$

$$C_i \frac{du_i}{dt} = -\frac{\partial E_{mem}}{\partial V_i} = \sum_j T_{ij} V_j - \frac{u_i}{R_i} + I_i^{mem}, \qquad (11)$$

where $C_i$ and $R_i$ are constants that determine the integration time constant of each neuron. The dynamics of Equations 10 and 11 can be implemented in simple, neural-like circuitry. Note that the effect of minimizing $E_{mem}$ is to simultaneously maximize (1) the similarity between the neuron voltages, $V_i$, and one of the stored patterns superimposed in the $T_{ij}$ matrix (first term of $E_{mem}$), and (2) the similarity between the $V_i$ and the inputs $I_i^{mem}$ (last term of $E_{mem}$). (The second term of $E_{mem}$ is the leaky integrator term, which is unimportant for now. See Appendix.)

Since the inputs of the associative memory are to be obtained directly from the outputs of the routing circuit ($I_i^{mem} = I_i^{out}$), the control neurons, $c_k$, become additional dynamic variables "hidden" in the last term of $E_{mem}$. By letting the $c_k$ follow the gradient of $E_{mem}$, along with the $V_i$, the combined associative memory/routing circuit should relax to the closest stored pattern and to the correct position and size of the window of attention simultaneously.
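The memory dynamics alone (the standard continuous Hopfield equations, with the control neurons held fixed) can be illustrated with a small sketch. The Hebbian outer-product rule used to store the pattern, the four-neuron size, the choice $C_i = R_i = 1$, and the function name are assumptions made for illustration; the text does not say how $T_{ij}$ is set.

```python
import math


def recall(pattern, mem_input, dt=0.01, steps=2000):
    """Euler-integrate du_i/dt = sum_j T_ij V_j - u_i + I_i, with V_i = tanh(u_i).

    Takes C_i = R_i = 1 and stores one pattern with an assumed Hebbian rule.
    """
    n = len(pattern)
    # Outer-product storage with zero self-connections.
    T = [[0.0 if i == j else float(pattern[i] * pattern[j]) for j in range(n)]
         for i in range(n)]
    u = [0.0] * n
    for _ in range(steps):
        V = [math.tanh(x) for x in u]
        u = [u[i] + dt * (sum(T[i][j] * V[j] for j in range(n))
                          - u[i] + mem_input[i])
             for i in range(n)]
    return [math.tanh(x) for x in u]


stored = [1, -1, 1, -1]
# Routed input that resembles the stored pattern except for its last element.
V = recall(stored, [0.5, -0.2, 0.1, 0.3])
recovered = [1 if v > 0 else -1 for v in V]
```

Even though the last input element has the wrong sign, the network settles on the stored pattern. In the full model the $c_k$ would relax at the same time, so that the routed input and the recalled pattern converge together.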


    Figure 7. Control unit interactions when configured into control blocks. The control unit corresponding to the block shown (stippled region) should have excitatory connections ($T_{kl} > 0$) to other control units whose blocks form a consistent position and size of the window of attention, that is, those blocks lying along the + directions. Inhibitory connections ($T_{kl} < 0$) should be formed with control units whose blocks are inconsistent with this one, that is, those along the - directions. This scheme is somewhat analogous to the way constraints are imposed in the Marr/Poggio stereo algorithm (Marr and Poggio, 1976).

    A dynamical equation for $c_k$ that simultaneously minimizes both $E_{\mathrm{mem}}$ and $E_{\mathrm{constraint}}$ is given by Equations 12 and 13.

    (A derivation is given in the Appendix.) A neural circuit for computing Equations 12 and 13 is shown in Figure 8. The first term on the right of Equation 13 is computed by correlating the inputs, $I_j^{\mathrm{in}}$, and outputs, $V_i$, whose connection pathways are influenced by control unit $c_k$ (specified by $\Gamma_{ijk}$). The other terms are computed as before. Thus, the main qualitative difference between this circuit and the blob finder (Fig. 6) is that the control is guided by the interaction between top-down and bottom-up signals rather than purely bottom-up sources.

    In order to avoid local minima, it would be advantageous to perform the combined process of pattern matching, shifting, and scaling in a coarse-to-fine manner by utilizing information at multiple scales (e.g., Witkin et al., 1987; Buhmann et al., 1990). In this way, the low-pass information can be used initially to send the memory into the right part of its search space; the initial output of the associative memory can then be used to better refine the position and scale of the window of attention before allowing in higher-resolution information. A crude form of such a coarse-to-fine strategy has been utilized in the computer simulation below.

    Shifting attention. Once an object has been recognized, the window of attention should move on to another interesting part of the scene. One way this could be accomplished would be for the control units to be self-inhibited through a delay. Thus, when a group of control units has been active for some time (long enough for recognition to take place), the units should begin to shut off. This will then allow other blobs or interesting items to compete successfully for control of the window of attention (see also Koch and Ullman, 1985).

    Computer simulation

    Figure 9 shows the results of a computer simulation of a simple attentional system for recognizing objects, based on the principles elucidated above. The network begins in blob search mode,


    Figure 8. An autonomous routing circuit for recognition. Each node of the associative memory receives its external input from an output node of the routing circuit. Hence, each node of the associative memory has dynamic connections to many input nodes. The outputs of the associative memory are then fed back and correlated with the inputs to drive the control units.

    attempting to fill the output of the routing circuit with something interesting. In Figure 9a, the network has settled on the "A," since it has the greatest overall brightness in the input. (Since the shapes used in this example are so compact and simple, we have bypassed the step of prefiltering them into blobs. Thus, during blob search, an object is low-pass filtered by the routing circuit itself.) After settling on a potentially interesting object, the network is switched into recognition mode and the output of the routing circuit is fed to an associative memory. Two patterns, "A" and "C," have been previously stored in the associative memory. The blurred version of the object initially drives the inputs of the associative memory to begin the pattern search. If the position of the window of attention is slightly off, the blurred version of the object is not affected much and still sends the memory searching in the correct direction. As the associative memory converges, control units compute the correlation between memory outputs and retinal inputs and set their activation correspondingly. This tends to maximize the similarity between the outputs of the memory and the outputs of the routing circuit, which will also refine the position of the attentional window so that the high-resolution components can be properly matched (Fig. 9b). After allowing a fixed amount of time for the associative memory to converge (another time constant or two), the simulation states the position and presumed identity of the object. The current control state is then self-inhibited and the network switches back into blob search mode. This then puts the next interesting object at a competitive advantage in attracting the window of attention so that it may also be recognized (Fig. 9c,d).

    Summary of the model

    By using control neurons to modulate connection strengths dynamically, we have derived simple, neural-like circuits for shifting and rescaling the information from an input array into a higher level, object-centered reference frame. We assumed that


    Figure 9. Computer simulation of a simple attentional system for recognizing objects. The input to the routing circuit consists of a 22 x 22 array of sample nodes and the output of the routing circuit is an 8 x 8 array of sample nodes. There are three sets of control units, each one corresponding to a different size of the window of attention [small (8 x 8), medium (11 x 11), and large (16 x 16)]. Each control neuron within a set corresponds to a particular position of the window of attention. The Hopfield associative memory network (Mem output; see Fig. 8) is composed of 64 units, fully interconnected and arranged into an 8 x 8 grid (i.e., one node for each output of the routing circuit). The dashed outline within the input array denotes the position and size of the window of attention. a, The network begins in blob search mode, attempting to fill the output of the routing circuit with something interesting. The blurring function of the routing circuit has been facilitated in this case by setting the constraint matrix so that neighboring positions of the window of attention only weakly inhibit each other. The network has settled on the "A" since it has the greatest overall brightness. b, The network is then switched into recognition mode and settles on the identification of the object. The position and size of the object are encoded in the activities of the control neurons. After a fixed amount of time, the current control state is self-inhibited and the network is switched back into blob search mode. c and d, The "C" is now at a competitive advantage in attracting the window of attention (c) and is subsequently recognized by the associative memory (d).

    a useful strategy for an autonomous visual system would be to focus its attention on interesting regions within a scene and attempt to recognize whatever is there. From this basic assumption, we derived equations for governing the dynamics of the control neurons in both preattentive (blob search) and attentive (recognition) modes. Although these circuits have been greatly oversimplified for the purpose of illustration, the basic principles can be extended to larger, scaled-up routing circuits composed of multiple stages. We now turn to the issue of how such circuits may possibly be implemented in the brain.

    Neurobiological Substrates and Mechanisms

    Figure 10a shows the major visual processing centers of the primate brain. Information from the retino-geniculo-striate pathway enters the visual cortex through area V1 in the occipital lobe and proceeds through a hierarchy of visual areas that can be subdivided into two major functional streams (Ungerleider and Mishkin, 1982). The so-called "form" pathway leads ventrally through V4 and inferotemporal cortex (IT) and is mainly concerned with object identification, regardless of position or size. The so-called "where" pathway leads dorsally into the posterior parietal complex (PP), and seems to be concerned with the locations and spatial relationships among objects, regardless of their identity. The pulvinar, a subcortical nucleus of the thalamus, makes reciprocal connections with all of these cortical areas (cf. Robinson and Petersen, 1992). The following sections describe how we envision the dynamic routing circuit mapping onto this collection of neural hardware.

    Cortical areas

    The "form" pathway. Figure 10b shows the scaled-up routing circuit that we propose as a model of attentional processing in visual cortex. The different stages of the network correspond to the major cortical areas in the "form" pathway. There are two stages for V1: V1a corresponding to layer 4C, and V1b corresponding to superficial layers, since V1 has about twice the density of neurons per unit surface area as the rest of neocortex (O'Kusky and Colonnier, 1982). The remaining areas (V2, V4, and IT) occupy one stage apiece. Each node within a stage represents, in the simplest sense, a sample of image luminance. More realistically, each node would correspond to a feature vector that is represented by the activity profile on a large group (hundreds or thousands) of neurons in each visual area. For example, in V1, each group would include cells selective for various orientations and spatial frequencies in a small region of visual space. It is impractical at this stage to include these characteristics explicitly in our model, but we contend that these details can safely be neglected for now without losing the predictive value of the model.

    The input layer of the network (V1, layer 4C) contains approximately 300,000 samples of the retinal image (~550 nodes across in one dimension). This corresponds roughly with the number of complete spatial samples delivered by the $10^6$ optic nerve fibers when one takes into account the fact that information is divided into on- and off-channels, magno and parvo streams, and different spectral bands (Van Essen and Anderson, 1990). The number of nodes in the other layers is dictated by the rules specified in the previous section, given a fan-in of 1000 inputs per node (~30 inputs in one dimension). The sizes of the first four layers scale roughly with the relative sizes of each corresponding cortical area (V1 = 1120 mm², V2 = 1190 mm², V4 = 540 mm²; Felleman and Van Essen, 1991). IT is disproportionately large, perhaps because it includes a complex of multiple areas, some of which may be devoted to specialized aspects of pattern recognition. Only a relatively small portion of IT would be required to represent the actual contents of the window of attention.

    The fan-in for each node is about 1000 inputs, which is reasonable for cortical neurons (Cherniak, 1990; Douglas and Martin, 1990a). Note that without the multistage hierarchy, a fan-in of nearly $10^6$ would be required for the neurons in IT, which is several orders of magnitude beyond what is neuroanatomically plausible. Also, the resulting receptive field sizes (in the all-connections-open state) are consistent with the observed increase in the size of classical receptive fields as one proceeds upward through the "form" pathway (Gattass et al., 1985).

    The output of the network, which represents the contents of the window of attention, contains approximately 1000 sample nodes, or a window size of about 30 x 30 nodes. This then corresponds to the spatial resolution of the window of attention in our model. This estimate is roughly consistent with several lines of psychophysical evidence, including studies of spatial acuity, contrast sensitivity to gratings, and recognition (Campbell, 1985; Van Essen et al., 1991). While we certainly allow for some give and take on all of these numbers, we believe this circuit contains the essential components to explain how information can be routed from a shiftable and scalable window of attention in V1 into IT while preserving spatial relationships.

    In order to better visualize the operation of this circuit, we have created a computer simulation of an "open loop" version of the model (i.e., manually controlled). Given a user-specified position and size for the window of attention, the program appropriately gates the feedforward connections at each stage in the routing circuit so that only the contents of the window of attention are routed to IT. Figure 11 shows some example outputs of the simulation when attention is focused on different items within a scene. Note that regions outside the window of attention in each cortical area are blurred, because there is no need to gate the inputs selectively to a neuron if it is not being attended to. The specific predictions generated by this circuit will be discussed in the next section.

    Figure 10. a, Major visual processing pathways of the primate brain. To avoid clutter, many known connection pathways (e.g., VII-PP) are not shown. b, Proposed neuroanatomical substrates for dynamic routing. The label beside each layer indicates the corresponding cortical area and the number of sample nodes in one dimension. The number of sample nodes in two dimensions is approximately the square of this number. At the bottom is shown a scale of the approximate eccentricity of the input nodes to the circuit. Connections are shown for the center node in each layer. (Individual nodes are indistinguishable here because of their density.) Control signals originate from the pulvinar to effectively gate the feedforward synapses.

    The "where" pathway. The posterior parietal cortex (PP) is known to play an important role in attentional processes. Some studies have reported that neurons in this area show an enhanced response to attended targets within their receptive fields, even when no eye movements are made (Bushnell et al., 1981). Others have reported a threefold enhancement for unattended targets when the animal is in an attentive state (Mountcastle et al., 1981), or even a relative suppression for attended targets as opposed to unattended targets (Robinson et al., 1991; Steinmetz et al., 1992). These latter results suggest that PP may be representing the locations of potential attentional targets, as opposed to targets already being attended. This is also supported by lesion studies that show that damage to the parietal lobe in humans hinders the ability of other objects in the field of view to attract the attentional window away from the currently attended location (Posner et al., 1984). Thus, we propose that PP may act as a "saliency map" (e.g., Koch and Ullman, 1985) analogous to the blob map utilized in the simple attentional system described previously. These neurons would then drive the control neurons that compete for control of the window of attention.
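The saliency-map proposal above, combined with the delayed self-inhibition suggested earlier for shifting attention, can be caricatured in a few lines of Python. This is only a sketch under strong assumptions: a hard `argmax` stands in for the competitive dynamics of the control neurons, suppression of a square neighborhood stands in for delayed self-inhibition, and the function name and parameters (`attention_sequence`, `n_shifts`, `inhibit_radius`) are hypothetical rather than part of the model.

```python
import numpy as np

def attention_sequence(saliency, n_shifts, inhibit_radius=1):
    """Return successive attended (row, col) locations on a saliency map.

    The most salient location wins the competition. It is then suppressed
    (a stand-in for self-inhibition of the winning control units) so that
    the next most salient location can win on the following shift.
    """
    s = np.array(saliency, dtype=float)  # work on a copy
    visited = []
    for _ in range(n_shifts):
        r, c = np.unravel_index(np.argmax(s), s.shape)
        visited.append((int(r), int(c)))
        r0 = max(r - inhibit_radius, 0)
        c0 = max(c - inhibit_radius, 0)
        s[r0:r + inhibit_radius + 1, c0:c + inhibit_radius + 1] = -np.inf
    return visited

# Two salient items of different strength on an otherwise empty map.
saliency = np.zeros((5, 5))
saliency[1, 1] = 3.0
saliency[3, 4] = 2.0
order = attention_sequence(saliency, n_shifts=2)
```

The stronger item is visited first and the weaker one second, which mirrors the behavior described for the simple simulation, where the brighter object attracts the window of attention before the other.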


    Figure 11. Computer simulation of a scaled-up, cortical dynamic routing circuit (no autonomous control). In both a and b, the bottom left image shows a hypothetical retinal image, and the dashed outline within this image indicates the position and size of the window of attention. The image above this shows the output of the routing circuit: the contents of the 30 x 30 window of attention. The four images to the right show four stages of the routing circuit: V1 (essentially a copy of the retina), V2, V4, and IT (the output). a, Attention is focused on the letter "T" at highest resolution (i.e., connections between input and output are 1:1). b, Attention is focused on a larger region of the scene, and so resolution is sacrificed within the window of attention. In each case, the receptive field of a hypothetical IT cell is shown (small in a and large in b); in a, the receptive field of a V4 cell outside the window of attention is also shown. A more realistic simulation utilizing a log-polar lattice has also been constructed, but the essential predictions of the model are more easily conveyed with this simpler version of the circuit. [The image used in this example was obtained from Anstis (1974).]
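To make the gating idea concrete, here is a minimal single-stage Python sketch of routing a window of attention onto a fixed-size output array. It is not the simulation shown in Figure 11: the multistage hierarchy is collapsed into one step, block averaging is an assumed stand-in for the loss of resolution in larger windows, and the function `route_window` and its parameters are hypothetical. The array sizes are borrowed from the small example of Figure 9 (22 x 22 input, 8 x 8 output).

```python
import numpy as np

def route_window(image, top, left, size, out_size=8):
    """Route a size x size window of `image` onto an out_size x out_size output.

    Each output node averages the input nodes that fall within its share of
    the window, so larger windows are passed through at lower resolution.
    """
    image = np.asarray(image, dtype=float)
    window = image[top:top + size, left:left + size]
    if window.shape != (size, size):
        raise ValueError("window extends beyond the input array")
    # Boundaries of the input span feeding each output node.
    edges = np.linspace(0, size, out_size + 1)
    out = np.empty((out_size, out_size))
    for r in range(out_size):
        r0, r1 = int(np.floor(edges[r])), int(np.ceil(edges[r + 1]))
        for c in range(out_size):
            c0, c1 = int(np.floor(edges[c])), int(np.ceil(edges[c + 1]))
            out[r, c] = window[r0:r1, c0:c1].mean()
    return out

retina = np.arange(22 * 22, dtype=float).reshape(22, 22)
fine = route_window(retina, top=3, left=5, size=8)      # 1:1 connections
medium = route_window(retina, top=2, left=2, size=11)   # overlapping input spans
coarse = route_window(retina, top=0, left=0, size=16)   # each output averages a 2 x 2 block
```

With the smallest window the output is an exact copy of the attended region, whereas the larger windows trade resolution for coverage, as in panels a and b of the figure.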

    The proposal that PP acts as a saliency map contains at least two potential weaknesses, however. One possible drawback is that PP neurons typically have relatively long latencies, ~100 msec (Robinson et al., 1978; Duhamel et al., 1992), which is hard to reconcile with psychophysical data that imply that attention takes ~50 msec to move to a new location in the visual field (Nakayama and Mackeben, 1989; Saarinen and Julesz, 1991). A possible solution to this dilemma is that the superior colliculus may supplement PP by acting as a crude saliency map, but with a quicker response time due to its direct retinal input (the latency of neurons in the superficial layers of the superior colliculus is in the range of 40-50 msec; Goldberg and Wurtz, 1972). The other drawback of this proposal is that currently available anatomical data seem to offer relatively few direct pathways by which PP could influence the control neurons for modulating information flow in the "form" pathway. However, there do exist indirect pathways, such as through the superior colliculus, that may provide viable alternatives (see below).

    Subcortical areas

    We hypothesize that the pulvinar complex plays an important role in providing the control signals required for the routing circuit. The pulvinar is reciprocally connected to all areas in the

    "form" pathway, thus making it a plausible candidate for modulating information flow from V1 to IT. The pulvinar also receives a massive projection from the superior colliculus, which is known to encode the direction of saccade targets and may also be involved in setting up attentional targets (Posner and Petersen, 1990; Gattass and Desimone, 1991, 1992). In addition, neurophysiological studies (Petersen et al., 1985, 1987), lesion studies (Rafal and Posner, 1987; Bender, 1988; Desimone et al., 1990), and positron emission tomography studies (LaBerge and Buchsbaum, 1990; Corbetta et al., 1991) of the pulvinar suggest that it plays a role in engaging visual attention, or filtering out unattended stimuli.

    A subcortical nucleus such as the pulvinar also has the important property of being spatially localized while at the same time being able to communicate with vast areas of the visual cortex. The relative proximity of pulvinar neurons to each other would facilitate the competitive and cooperative interactions among the control neurons, which are necessary to enforce the constraint of maintaining spatial relationships within the attentional window. Although it is not known whether such interactions exist among pulvinar neurons, Ogren and Hendrickson (1979) have reported the existence of interneurons with elaborate dendritic trees approaching 600 µm in diameter, which


    could mediate communication among pulvinar neurons. In addition, neuropharmacological experiments by Petersen et al. (1987) have shown that enhancing or depressing inhibition within the pulvinar can respectively slow down or speed up attentional shifts, which is suggestive of lateral inhibitory connections within the pulvinar. An analogous function might also be served by the reticular nucleus of the thalamus, which is an inhibitory structure through which pulvinar neurons project on their way to the cortex. One study in Galago (Conley and Diamond, 1990) has shown that the pulvinar projects quite diffusely into the reticular nucleus, which would be desirable for a winner-take-all type circuit.

    To first order, it would make sense for each stage of the routing circuit to have its own set of control neurons. The anatomical subdivisions of the pulvinar correspond roughly with this scheme, insofar as the inferior pulvinar projects mainly to lower areas (V1, V2) and the lateral and medial pulvinar to higher areas (V4, IT). The control neurons for the lower stages would need to compete only locally, since these stages would be more concerned with making local adjustments in the position and scale of the window of attention. Control neurons at the highest stage would need to compete globally, since this stage sets the position and scale of the window of attention for the entire scene.

    The number of control neurons that would be required for the routing circuit depends on how many cortical synapses are modified by each control neuron. Theoretically, the minimal number of control neurons is given by

    $$\text{\# of control neurons} = \frac{(\text{\# of output nodes}) \times (\text{fan-in per node})}{\text{\# of synapses per control block}}.$$

    Assuming that the control blocks comprise approximately 1000 synapses each, then the number of control neurons required for each stage of the routing circuit would be about the same as the number of output nodes of each stage (since the fan-in per node is also about 1000). Thus, ~250,000 control neurons would be required for the first stage, ~175,000 for the second stage, and so on, which is well within the estimated number of neurons in the pulvinar. (The pulvinar has somewhat lower neuronal density than the LGN, but also is several times larger. Since the LGN contains ~$10^6$ projection neurons, this would constitute a reasonable lower bound for the number of neurons in the pulvinar.) However, each output node in the circuit actually corresponds to a multitude of neurons representing various features, such as orientation, spatial frequency, and so on. Thus, each pulvinar control neuron would require an additional fan-out for controlling the inputs to all the neurons corresponding to an output node. Since there may be hundreds of neurons for each node, the pulvinar neurons would need to amplify their fan-out via other neurons (a fan-out of 100,000 for pulvinar neurons is probably too large to be plausible). This could possibly be subserved by neurons residing in the deeper layers (5 and 6) of the cortex (see Van Essen and Anderson, 1990). Control would then be implemented in a hierarchical fashion, with each pulvinar neuron specifying how information is routed between nodes, and cortical control neurons specifying how information is routed between the neurons belonging to each node.

    The simple autonomous routing circuits of Figures 6 and 8 suggest an interesting role for the projections to the pulvinar from the parietal and temporal lobes and the superior colliculus.

    During blob search, the pulvinar might be influenced primarily by a saliency map of targets in the parietal lobe or superior colliculus. During recognition, top-down influences from IT might then take over to refine the position and size of the attentional window for object matching. The pulvinar would then alternate between these two modes of input as attention moves from one object to the next. A potential weakness of this proposal, however, is that the anatomical evidence suggests that PP and IT project mostly to segregated portions of the pulvinar (Baleydier and Morel, 1992). On the other hand, there is some overlap near the border between the lateral and medial portions of the pulvinar where these two streams intermingle. As noted already, parietal cortex may also communicate with IT-recipient pulvinar indirectly through the superior colliculus.

    An alternative means by which IT could supply top-down guidance to the control neurons would be via corticocortical feedback pathways. Under this scenario, control neurons within the cortex would be driven by feedback signals emanating from IT once the pulvinar neurons have roughly set the position and size of the window of attention. The pulvinar's role would thus be analogous to that of a general in an army, coarsely specifying a plan of action, which the cortical control neurons refine into a concise remapping under top-down, or object-based, guidance from IT.

    Gating mechanisms

    Neural gating mechanisms are believed to play an important role in many aspects of nervous system function. For example, the extent to which a noxious stimulus is perceived as painful varies greatly as a function of one's emotional state and other external factors. This is subserved at least in part by gating mechanisms in the spinal cord, where descending fibers from the raphe nuclei form part of a control system that modulates pain transmission via presynaptic inhibition in the dorsal horn (Fields and Basbaum, 1978). Gating mechanisms are also thought to play an important role in sensorimotor coordination; for example, there are many instances in which spinal cord central pattern generators gate sensory inputs according to the phase of the movement cycle in which the input occurs (Sillar, 1991). A somewhat different form of gating seems to take place in the LGN, where thalamic relay cells exhibit two distinct response modes: a relay mode, in which cells tend to replicate retinal input more or less faithfully, and a nonrelay burst mode, in which cells burst in a rhythmic pattern that bears little resemblance to the retinal input (Sherman and Koch, 1986). In this instance, the reticular nucleus of the thalamus is thought to be the source of the signal that switches the LGN into the nonrelay burst mode.

    Although there is as yet no explicit evidence for gating mechanisms in the visual cortex, there are several possible biophysical mechanisms that would allow control neurons to gate synapses along the V1-IT pathway. Presynaptic inhibition, as in the spinal cord, would probably provide the most localized gating effect. However, to date there exists no morphological evidence for this type of synapse in the visual cortex (Berman et al., 1992). Postsynaptically, a control neuron could decrease or possibly nullify the efficacy of a corticocortical synapse via shunting inhibition. Evidence for this type of mechanism playing a role in orientation or direction tuning is mixed, with some for (Pei et al., 1992; Volgushev et al., 1992) and some against (Douglas et al., 1988). Another possible postsynaptic gating mechanism could be realized via the combined voltage- and ligand-gated NMDA


    receptor channel, which has been shown to play an important role in normal visual function (Miller et al., 1989; Nelson and Sur, 1992). In this case, a control neuron could effectively boost the gain of a corticocortical synapse by locally depolarizing the membrane in the vicinity of the synapse. Also, there exist voltage-gated Ca2+ channels in dendrites (Llinas, 1988) that could provide nonlinear coupling between inputs. Evidence for nonlinear interactions of this type has been reported for synaptic inputs into layer 1 of neocortex (Cauller and Connors, 1992). All of these mechanisms, and possibly others, offer a multiplicative-type effect that is suitable for gating information flow through the cortex (see also Koch and Poggio, 1992).

    Under an inhibitory gating scheme, such as shunting or presynaptic inhibition, the control neurons would need to become active only when attention is actively engaged on an object. The finer the resolution desired within the window of attention, the more the control neurons would need to be engaged. The absence of any activity on the control neurons would correspond to all connections being open (the inattentive state), in which case neurons in IT would exhibit the very large receptive fields observed in anesthetized or inattentive animals (Gross et al., 1972; Desimone et al., 1984).

    Under an excitatory gating scheme, such as via NMDA receptors, one would need to hypothesize the existence of a gain control mechanism working in concert with the control neurons. When no control signals are provided, cortical input would be rather weak, and the firing threshold of pyramidal cells should be lowered to let all information through. When control signals are present to boost the gain of individual synapses, however, the threshold should be raised. This way, the unboosted synapses will be essentially suppressed to a relatively low strength. Threshold adjustment could perhaps be subserved by chandelier cells, which make strong inhibitory connections exclusively onto the axon initial segment of pyramidal cells (Douglas and Martin, 1990b). Evidence that gain control mechanisms indeed exist in visual cortex has been established in previous physiological studies (Ohzawa et al., 1982; Pettet and Gilbert, 1992).

    From a computational viewpoint, gating of inputs within individual dendrites provides a much higher degree of flexibility than would merely gating the outputs of pyramidal cells. Since the output of a pyramidal cell may branch to several cortical areas and make synaptic connections to a multitude of neurons, any modulation of the cell's output will simply be duplicated at all these subsequent input points. Gating inputs within the dendrites, on the other hand, allows the nonlinear computation of many intermediate results ($\sum_k c_k \Gamma_{ijk} I_j$) within the postsynaptic membrane, which can then be summed together within a single cell. This results in a computational structure that is orders of magnitude richer (Mel, 1992) and provides a higher degree of flexibility in sculpting patterns in connection space (see Fig. 4). We believe the demonstrable computational advantage of dendritic gating mechanisms for visual processing motivates the need to specifically look for such mechanisms experimentally. (See also Desimone, 1992, for a discussion of output vs input gating mechanisms.)

    Discussion

    Because of its detailed neurobiological correlates, the routing circuit model makes a number of interesting predictions that can be tested experimentally. In this section we discuss these predictions, as well as the differences between our model and other network models that have been proposed for visual attention and invariant pattern recognition. We also describe some generalizations of the model, and briefly outline the unresolved issues that remain as topics for future research.

    Predictions

    Neurophysiology. The most obvious prediction of the dynamic routing circuit model is that the receptive fields of cortical neurons should change their position or size as attention is shifted or rescaled. This effect should be especially pronounced in higher cortical areas. Some support for this prediction comes from the neurophysiological findings of Moran and Desimone (1985) in areas V4 and IT of primate visual cortex. As schematized in Figure 12, they found that if two bar-shaped stimuli were placed within the classical receptive field (CRF) of a V4 cell, and the animal was trained to attend to only one of them, then the cell's response to the unattended stimulus was substantially attenuated. This is what one would expect from our routing circuit, since the pathways between the cell and the unattended stimulus would be effectively disabled in this case (Fig. 12c). They also found that the V4 cell responded to an unattended stimulus anywhere within its CRF when the animal attended a stimulus outside the CRF. This effect is also predicted by the model, because once a V4 cell lies outside the region of interest in V4 it no longer needs to restrict its inputs (Fig. 12d). Indeed, other targets of V4, such as those in PP, would presumably be interested in the information from regions lying outside of the attentional beam.

    While Moran and Desimone's findings offer some support for attentional modulation effects predicted by the model, they did not attempt to map receptive fields under different attentional conditions with any precision; thus, their results do not address the more specific effects predicted by the model. One would expect a cortical receptive field to shift as the attentional window is translated, and to expand or shrink as the attentional window is made larger or smaller, respectively. We predict that the optimal spatial frequency for the cell should change as well, shifting to high spatial frequency for a small window of attention, and to low spatial frequency for a large window of attention. These predictions can be tested by giving the animal a task that forces it to attend to a region of a specific size and location, and then probing the receptive field with a neutral (behaviorally irrelevant) stimulus to measure its extent. Preliminary results using such a paradigm suggest that the receptive fields of V4 cells do indeed translate toward attentional foci in or near the classical receptive field (Connor et al., 1993). In its present simple form, our model predicts that V4 receptive fields could become up to 100-fold smaller than the CRF (in one dimension) when attention is at highest resolution. While this extreme is unlikely, given the evidence for complex receptive fields in V4 (Desimone and Schein, 1987; Gallant et al., 1993), there remains a pressing need to resolve empirically the extent to which cortical receptive fields can dynamically change position and size.

    Another physiological prediction of the model is that lesions to the pulvinar, the hypothesized control center, should dramatically degrade attention and pattern recognition abilities. While there is substantial evidence linking pulvinar lesions to attentional defects (Rafal and Posner, 1987; Bender, 1988; Desimone et al., 1990), some pattern recognition abilities appear to be relatively unimpaired by pulvinar lesions (Mishkin, 1972; Chalupa et al., 1976; Nagel-Leiby et al., 1984; Bender and Butter, 1987). One possible reason for the apparent sparing of pattern recognition is that the tasks used in these studies generally


    [Figure 12 panels: a, No attention (all connections open); b, Attend to effective stimulus; c, Attend to ineffective stimulus; d, Attend outside. Labels: classical receptive field, effective stimulus, ineffective stimulus.]

    Figure 12. The dynamic routing circuit interpretation of the Moran and Desimone (1985) experiment. The node in layer V4 indicates the cell under scrutiny. The hatched region indicates those connections to the cell that are enabled; the others are disabled. The bounds of the window of attention in each area are shown by the stippled lines. a, In the nonattentive state, all connections will be open and the effective stimulus can excite the cell anywhere within its CRF. b, When attending to the effective stimulus, the cell's response should be unaltered since the neural pathways to the stimulus are still open. c, When attending to the ineffective stimulus, the cell's response should decrease substantially since the neural pathways to the effective stimulus are gated out. d, When attending outside the cell's CRF, there is no need to gate the cell's inputs since it is no longer taking part in the process of routing information within the window of attention.

    were very simple, such as distinguishing a large N from a Z (Chalupa et al., 1976). It is conceivable that such a task could be carried out even when the fidelity of the remapping process has been compromised. A more rigorous test using stimuli that demand the full spatial resolution capacity of the window of attention would be better suited to test the effect of pulvinar lesions on recognition abilities. Pulvinar lesions would also be expected to diminish the result found by Moran and Desimone (1985), and it would be interesting to repeat this experiment while reversibly deactivating the pulvinar.

    The physiological responses to be expected from pulvinar neurons depend on how they are configured to gate information flow in the cortex. In an inhibitory gating scheme, one would expect enhanced responses from pulvinar neurons projecting to areas of the cortex within and immediately surrounding the attentional beam, and little or no response from pulvinar neurons projecting to those areas of the cortex substantially outside the attentional beam. In an excitatory gating scheme, one would expect to find enhanced responses from pulvinar neurons projecting to areas of the cortex within the attentional beam only. Petersen et al. (1985) have reported such an enhancement effect for neurons in the dorsomedial portion of the pulvinar (which is connected with PP), but not in the inferior or lateral portion (which is connected to V1-IT). The lack of enhancement in these latter areas may be due to the fact that the task used in this experiment was very simple (detecting the dimming of a spot of light). Again, a more appropriate task would be one that fully taxes the capacity of the attentional window, as this would require the greatest participation from the control neurons in gating out irrelevant information.

    Neuroanatomy. The routing circuit model predicts that the size of the cortical region from which a cell receives its input should increase by about a factor of 2 at each stage in the hierarchy of visual areas in the form pathway. While there is some evidence in support of this prediction (for example, connections between V4 and IT are more diffuse than connections between V1 and V2; Van Essen et al., 1986, 1990; DeYoe and Sisola, 1991), more accurate and higher resolution data are needed in order to confirm or contradict this prediction. Also, since the distribution of connections in the routing circuit becomes more patchy at higher levels (see Fig. 10b), one would expect a retrograde injection in V4 or IT to result in a patchy distribution in the lower level, which indeed has been reported (Felleman and McClendon, 1991; Felleman et al., 1992).
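The factor-of-2 prediction can be turned into a small piece of arithmetic. The sketch below is illustrative only: the stage labels follow the areas named in the text, while `base_extent` (an arbitrary unit for the extent of the input region at the first stage) and the function name are assumptions introduced here, not quantities taken from the model.

```python
def fan_in_extents(stages, base_extent=1.0, growth=2.0):
    """Return {stage: extent}, multiplying the extent by `growth` at each stage."""
    return {name: base_extent * growth ** i for i, name in enumerate(stages)}

# Successive feedforward stages of the form pathway named in the text.
stages = ["V1->V2", "V2->V4", "V4->IT"]
extents = fan_in_extents(stages)

for name, extent in extents.items():
    print(f"{name}: input region about {extent:g}x the V1->V2 extent")
```

Under these assumptions, an anatomical test would amount to comparing measured input-region sizes at successive stages against this geometric progression.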


    Another anatomical prediction of the model is that the terminations of pulvinar-cortical projections should be suitably positioned for effective modulation of intercortical synaptic strengths. The pulvinar is known to project to the output layers (2, 3) of V1 and to both the input and output layers (3, 4) of extrastriate areas V2, V4, and IT (Benevento and Rezak, 1976; Ogren and Hendrickson, 1977; Rezak and Benevento, 1979). These synapses are suspected to be excitatory since they are of the asymmetric type (in layers 1 and 2; Rezak and Benevento, 1979). However, it is not known whether the pulvinar afferents make synapses with inhibitory interneurons or directly onto the dendrites of pyramidal cells.

    Finally, the model predicts that there should exist lateral inhibitory and excitatory connections within the pulvinar in order to enforce the constraint of preserving spatial relationships within the window of attention. This prediction is partially supported by the existence of interneurons within the pulvinar (Ogren and Hendrickson, 1979), but it remains to be seen if the axons of projection neurons have collaterals that spread horizontally within the pulvinar, or to what extent the reticular nucleus of the thalamus might subserve this role.

    Psychophysics. The number of sample nodes in the top layer of our routing circuit is predicated on the notion that the spatial resolution of the window of attention is limited to the equivalent of about 30 x 30 pixels. This prediction shares a basic similarity to Nakayama's (1991) iconic bottleneck theory, although his estimate (~100 pixels total) is somewhat lower than ours. The 30 x 30 estimate is roughly consistent with several lines of psychophysical evidence, including studies of spatial acuity, contrast sensitivity to gratings, and pattern recognition (Campbell, 1985; Van Essen et al., 1991).
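A 30 x 30 sample window implies a spatial-frequency limit through the sampling theorem: N samples across an object can represent at most N/2 cycles. The following is a minimal sketch of that arithmetic; treating the window of attention as a uniform sampling grid is an assumption made here for illustration, and the function name is hypothetical.

```python
def max_cycles_per_object(samples_per_side):
    """Nyquist limit: at least two samples are needed to represent one cycle."""
    return samples_per_side / 2

limit = max_cycles_per_object(30)
print(f"30 samples per side -> at most {limit:g} cycles per object along each axis")
```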
However, one problem with this psychophysical analysis is that the critical data were derived from experiments in which visual attention was not explicitly controlled. In particular, most of the experiments had display times long enough to permit multiple shifts of attention (although we doubt that this would have been a major contaminating factor in most cases).

On the other hand, those experiments that have been directed at studying the amount of resources allocated during visual attention have largely ignored the issue of spatial resolution. For example, various studies have reported evidence for a zoom lens model of attention in which the density of processing resources decreases as the size of the attentional window increases (Eriksen and St. James, 1986; Shulman and Wilson, 1987). However, these experiments were not designed to measure spatial resolution explicitly. Also, Verghese and Pelli (1992) have attempted to measure the information capacity of the window of attention, which they conclude to have an upper bound of about 50 bits. However, they studied only two tasks (detecting a nonmoving target among moving distracters, or detecting a nonflashing square among flashing squares), neither of which is well suited for measuring spatial resolution. A more appropriate experiment might be one that tested pattern discrimination ability as a function of the position, size, and resolution of an object. In this case, our present model predicts that performance would drop off sharply once the spatial frequency content of the stimulus exceeded approximately 15 x 15 cycles per object.

The model also makes some interesting predictions with regard to the dynamics of visual attention. For example, once a location has been attended to in the visual field it should be difficult to stay there or immediately revisit the site, because

    the control neurons corresponding to that part of the visual field would be transiently inhibited from firing. There is some evidence for such a mechanism, in that involuntary attentional fixations tend to be transient (Nakayama and Mackeben, 1989) and appear to be inhibited from return (Posner and Cohen, 1984). The amount of time that it takes the attentional window to shift from one location to another would be expected to be roughly independent of the distance between locations. Unlike eye saccades, there is no obvious reason why the control neurons should sequence through all intervening positions of the attentional window. Rather, moving the locus of attention would require merely inhibiting the current control state and activating a new one. This prediction is most consistent with Remington and Pierce's (1984) study showing time-invariant shifts of visual attention, although other studies (e.g., Tsal, 1983) are in disagreement (but see Eriksen and Murphy, 1987, for a critical commentary on these and other studies). On the other hand, if attention were actually to track a stimulus, then one would indeed expect a smooth transition of activity across the control neurons. It is interesting to note that Cavanagh (1992) has discovered some forms of visual stimuli that produce a motion percept only when tracked with attention. We speculate that the progression of activity across the control neurons is what underlies one's perception of motion in such cases.

    Comparison with other models

    Control versus synchronicity. A number of other models of visual attention and pattern recognition have been proposed that rely on the synchronous firing of neurons in order to change connection strengths (e.g., Crick, 1984; von der Malsburg and Bienenstock, 1986; Crick and Koch, 1990). We contend that a key disadvantage of such approaches is that information about the effective connection state at any one point in time is not explicitly encoded anywhere in the system. In our model, this information is encoded explicitly in the activities of the control neurons, which then allows it to be utilized advantageously in a number of ways.

    One way that information about connectivity can be utilized is in constraining the active connections between retinal- and object-based reference frames to be in accordance with a global shift and scale transformation. This constraint is incorporated in our model via the competitive and cooperative interactions among the control neurons (Eq. 6). During object recognition, this constraint drastically reduces the number of degrees of freedom in matching points between the retinal and object-centered reference frames, because once a few point-to-point correspondences have been established, the number of potential matches between other pairs of points is greatly reduced. In machine vision, this is known as the viewpoint consistency constraint, and it has proved to be a powerful computational strategy for object recognition systems (Hinton, 1981b; Lowe, 1987).

    Another advantage of having knowledge of the active connection state readily available is that the ensemble of control neurons together form a neural code for the current position and size of the window of attention. Therefore, information about the position and size of an object can be obtained by simply reading out the state of the control neurons. In addition, it would also be possible for the control neurons to warp the reference frame transformation in order to form object representations that are invariant to distortion (e.g., handwritten digits), in which case information about the particular shape of the object (e.g., its slant or style) could also be preserved. Note


    that such information is typically lost in networks that utilize feature hierarchies of complex cells (Fukushima, 1980, 1987; LeCun et al., 1990) or Fourier transforms (e.g., Pollen et al., 1971; Cavanagh, 1978, 1985) for forming position-, scale-, and/or distortion-invariant representations.

    Our model can also explain how attention may be directed at will, or by other modalities, to the extent that those areas of the brain having access to the control neurons (such as parietal cortex) can directly influence where attention is directed. This also provides a convenient format for mediating the access to control among various competing demands. While such forms of top-down control are not impossible to incorporate in models based on synchronicity-gated connections, its implementation would seem to be less straightforward.

    Control-based network models. A number of other network models of attention and recognition have also utilized the concept of control neurons for directing information flow. Niebur et al. (1993), Desimone (1992), LaBerge (1990, 1992), Ahmad (1992), and Posner et al. (1988), among others, have proposed models that involve the pulvinar as a control site for routing information from a select portion of the visual scene. In addition, Tsotsos (1991) and Mozer and Behrmann (1992) have proposed somewhat more abstract connectionist models that utilize gating units to control attention. However, none of these models preserve spatial relationships within the window of attention, which we consider to be a critical component of the routing process.

    Hinton and Lang (1985) and Sandon (1990, 1988) have proposed control-based models that do preserve spatial relationships within the window of attention and share the same basic principle as the model presented here: that is, remapping object representations from retinal into object-centered reference frames via a third set of units (equivalent to control in our model). Although these models attempt to explain various psychophysical data, they do not contain the necessary level of neurobiological detail to give them strongly predictive value in biology.

    Postma et al. (1992) have proposed a neural model based upon the original shifter circuit proposal (Anderson and Van Essen, 1987) to account for translational invariance in visual object priming (Biederman and Cooper, 1992). This model shares many similarities to the model presented here, including top-down (template-driven) control, but it differs in the specifics of the control structure. Most notably, Postma et al. have proposed an interesting solution to controlling a hierarchical shifter circuit based on a series of stages of local, winner-take-all circuits.

    Control as a general computational strategy

    Besides being advantageous for the control of visual attention, we believe that the strategy of utilizing explicit control neurons may be a useful computational principle employed by the brain in other domains as well. A different perspective of dynamic control is illustrated in Figure 13. In most neural network models, the output of a neuron is computed by forming the inner product of a weight vector, w, with the inputs to the neuron, and then passing the result through a nonlinearity. The weight
The weightvector may change on a slow time scale in order to optimizethe network for performing a certain task, but typically in re-mains f ixed over the relatively short time in which the task isactua lly performed (e.g., < 1 set). By having control neuronsavailab le to modify ++on a short time scale, the computation

    Controlw,

    vFigure 13. A more general way of viewing control. A weightvectorwith two components, w, and w,, is shown. Control neurons c, and c,modulate each of these components, respect ively, to change the weightvector dynamically. Thus, he weightvector maybeable o occupyanyregion within the circularoutlinen order to optimize the network forthe particular input and task at hand.being carried out by the network can be dynamically reconfig-ured and optimized for-the particular task at hand. This addeddegree of flexibility reduces the neural resources equired forsolving a complicated task, since t is no longer necessary ohave dedicated, specializednetworks with fixed connections odeal with each variation of a task (Van Essenet al., 1993).Unresolved ssuesThe dynamic routing circuit as described n this article is in-tended as a zero-th order model, and as such many detailshave been neglected or oversimplified. Here we outline someof the more important unresolved issues hat remain as topicsfor future research.Features nstead of pixels. As already noted, one key neuro-biological characteristic neglected n the present model is theknown preponderance of feature-selective cells in the visualcortex. Vl, for example, is known to contain cells tuned forvarious orientations and spatial frequencies, and V2 and V4contain cells that seem o be tuned for more complex stimuli(von der Heydt and Peterhans, 1989;Gallant et al., 1993).Howdoes this affect the routing process?One possiblestrategy, asmentioned earlier, would be to route information primarily fromlow-spatial-frequencycellswhen the window of attention is arge,and from high-spatial-frequency cells when the window of at-tention is small. More generally, dynamic routing need not nec-essarilybe restricted to the space omain, but could work acrossfeature domains as well.Feedbackpathways. 
We have describedhow information canbe routed in the feedforward pathways, but we have more orless gnored the feedback pathways that are known to exist inabundance n the visual cortex. Mumford (1992) has sketcheda theory proposing that the role of these feedback pathways isto relay the interpretations of higher cortical areas to lowercortical areas n order to verify the high-level interpretation of


    a scene. Such a mechanism would obviously be of use for step 4 of our proposed strategy for an autonomous visual system. Under this scenario, it would be necessary to route information flow within the feedback pathways as well to ensure that the high-level interpretation is matched against the appropriate region within the cortical area below (i.e., within the window of attention). Another possible role for information flow in the feedback pathways may be to refine the tuning characteristics of lower-level cortical cells based upon the interpretations made in higher cortical areas (see, e.g., Tsotsos, 1991).

    Pop-out in multiple dimensions. In the simple autonomous visual system we have proposed, blobs were the only salient features used to attract the window of attention. How might other salient features, such as pop-out due to motion or texture gradients, be incorporated into the preattentive system? How would the demands among these different saliencies be mediated?

    Integration across multiple attentional shifts. How are the various snapshots obtained by the window of attention incorporated to form an overall percept of a scene? One possibility, as outlined by Hinton (1981b), is that a compact representation of each object is maintained in the form of the activities on a set of neurons within a scene buffer. Each attentional fixation would then write its contents into a different part of the buffer, depending on the position and size of the attentional window as well as the orientation of the eyes, head, and body with respect to the environment (see also Baron, 1987).

    Rotation and warp. Our model accounts for how reference frames can be shifted and rescaled, but it does not address rotation and other distortions (e.g., handwritten characters). The ability to rotate or warp reference frames could probably be included in the model without much difficulty, since this would just involve another form of routing.
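One way to see why rotation might reduce to "another form of routing" is to look at log-polar coordinates, in which a rotation about the fixation point becomes a shift along the angle axis and a change of scale becomes a shift along the log-radius axis. The sketch below assumes an idealized complex-logarithm map centered on the fovea; the function names are ours and are not part of the model.

```python
import cmath
import math

def log_polar(x, y):
    """Map a retinal point (x, y), not at the origin, to (log radius, angle)."""
    w = cmath.log(complex(x, y))
    return w.real, w.imag

def rotate_and_scale(x, y, angle, scale):
    """Rotate the point about the origin by `angle` radians and scale it."""
    z = complex(x, y) * scale * cmath.exp(1j * angle)
    return z.real, z.imag

x, y = 3.0, 1.0
angle, scale = math.radians(20), 1.5

u0, v0 = log_polar(x, y)
u1, v1 = log_polar(*rotate_and_scale(x, y, angle, scale))

# Both differences are constant offsets, independent of the starting point.
print(f"shift along log radius: {u1 - u0:.4f}  (log of scale: {math.log(scale):.4f})")
print(f"shift along angle:      {v1 - v0:.4f}  (rotation:     {angle:.4f})")
```

The angle axis wraps around at plus or minus pi, so the shift is only linear away from that branch cut, and the equivalence holds only for rotations about the center of the map.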
Moreover, for foveated objects the log-polar representation in V1 would convert rotations into approximate linear shifts on the cortex (Schwartz, 1980), which may facilitate the routing.

Three-dimensional objects. How are three-dimensional objects represented neurally, and how is information in the retinal reference frame transformed to match this representation? One possibility, as advanced by Poggio and Edelman (1990), is that three-dimensional objects are actually represented by a few characteristic two-dimensional views, and that a match to the retinal representation is achieved by interpolating among these views. In this case, the routing circuit would be required to reposition and rescale the object properly so that the interpolation could take place.

Learning. Although the model we have presented here is neurobiologically plausible in terms of the number of neurons, connectivity, and computational mechanisms required, it remains to be seen whether such a system can self-organize or fine-tune itself with experience, beginning with only roughly appropriate connections. A hint as to how this may be accomplished has been described by Foldiak (1991), who has demonstrated how a complex cell can learn translation invariance using the objective function of perceptual stability. In our model, perceptual stability would be desired in IT, and the control neurons would need to learn how to configure themselves to maintain a stable percept as an attended object moves or changes size on the retina.

Concluding remarks

In order for us to make sense of the visual world, the brain must be capable of forming object representations that are invariant with respect to the dramatic fluctuations occurring on the retina. We have demonstrated how this feat may be accomplished by model neural circuits that are largely consistent with our current knowledge of neurophysiology and neuroanatomy. The model suggests several experiments (such as measuring attentional modulation of receptive field position and size, or measuring the spatial resolution of the window of attention) that may not have been obvious otherwise. As these experiments are carried out, the results will either help to increase our confidence in the model, or will suggest where it is wrong and how it might be revised. It is through this combined process of computational modeling and experimentation that we hope to understand how visual attention and recognition are actually implemented in the brain.

Appendix: Derivation of Autonomous Control Dynamics

Blob search

The total energy functional we wish to minimize is

$$E_{\text{total}} = E_{\text{blob}} + \beta E_{\text{constraint}}, \tag{A1}$$

where $E_{\text{blob}}$ and $E_{\text{constraint}}$ are defined in Equations 5 and 6, and $\beta$ is a constant determining the relative contribution of the constraint term. Letting $c_k$ follow the gradient of this functional, we obtain

$$\frac{dc_k}{dt} = -\eta\,\frac{\partial E_{\text{blob}}}{\partial c_k} - \eta\beta\,\frac{\partial E_{\text{constraint}}}{\partial c_k}, \tag{A2}$$

where $\eta$ is a constant determining the rate of gradient descent. As it stands, $c_k$ is unbounded; hence $E_{\text{blob}}$ and $E_{\text{constraint}}$ will also be unbounded and the network will not be guaranteed to converge. We can ameliorate this problem by letting $c_k$ be a monotonically increasing function of another analog variable, $u_k$, that actually follows the gradient. That is,

$$c_k = \sigma(u_k), \tag{A3}$$

$$\frac{du_k}{dt} = -\eta\,\frac{\partial E_{\text{total}}}{\partial c_k}. \tag{A4}$$
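The bounded gradient dynamics of Equations A3 and A4 can be sketched numerically. The example below is a toy illustration rather than the model's actual energy: it substitutes a hypothetical quadratic energy for the blob and constraint terms, uses a logistic sigmoid as the squashing function (the text only requires that it be monotonically increasing), and integrates with a simple Euler step. All function names and parameter values are assumptions.

```python
import math

def sigmoid(u):
    """Monotonically increasing squashing function with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def grad_toy_energy(c, target):
    """Gradient of the toy energy E(c) = 0.5 * sum((c_k - target_k)**2)."""
    return [ck - tk for ck, tk in zip(c, target)]

def run_control_dynamics(target, eta=1.0, dt=0.1, steps=2000):
    u = [0.0] * len(target)              # internal variables u_k
    for _ in range(steps):
        c = [sigmoid(uk) for uk in u]    # Eq. A3: c_k is a squashed version of u_k
        g = grad_toy_energy(c, target)   # dE/dc_k for the toy energy
        # Euler step of Eq. A4: du_k/dt = -eta * dE/dc_k
        u = [uk - dt * eta * gk for uk, gk in zip(u, g)]
    return [sigmoid(uk) for uk in u]

c_final = run_control_dynamics(target=[0.9, 0.1, 0.5])
print([round(ck, 3) for ck in c_final])
```

Because the squashing function has a positive slope everywhere, the chain rule gives dE/dt = -eta * sum over k of sigma'(u_k) * (dE/dc_k)^2, which is never positive, so the energy cannot increase along the trajectory while each c_k stays bounded.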