Deep learning in the visual cortex

Deep learning in the visual cortex
I. Fundamentals of primate vision
II. Computational mechanisms of rapid recognition and feedforward processing
III. Beyond feedforward processing: attentional mechanisms and cortical feedback
Thomas Serre, Brown University
1


Page 2: Deep learning in the visual cortex

Invariant recognition in natural images

2

Page 3: Deep learning in the visual cortex

Computer vision successes Face detection

3


Page 5: Deep learning in the visual cortex

Mobileye system: already available on the Volvo S60 and soon from most car manufacturers

4

Page 6: Deep learning in the visual cortex

Machines: millions of labeled examples for real-world applications, e.g., the Mobileye pedestrian detection system

5

Page 7: Deep learning in the visual cortex

What’s wrong with this picture?

6

Page 8: Deep learning in the visual cortex

What’s wrong with this picture?

• ~30,000 object categories (Biederman, 1987)

• Approach unlikely to scale up ...

6

Page 9: Deep learning in the visual cortex

Humans: learning from very few examples

By age 6, a child knows 10-30K categories

bcs news

Source: Tenenbaum

7

Page 10: Deep learning in the visual cortex

• Subjects get the gist of the scene at 7 images/s from an unpredictable, random sequence of images

- No time for eye movements
- No top-down influences / expectations

Invariant recognition in natural images

Potter 1971, 1975; see also Biederman 1972; Thorpe 1996

movie courtesy of Jim DiCarlo

8


Page 12: Deep learning in the visual cortex

• Subjects get the gist of the scene at 7 images/s from an unpredictable, random sequence of images

- No time for eye movements
- No top-down influences / expectations

Invariant recognition in natural images

9 Potter 1971, 1975; see also Biederman 1972; Thorpe 1996


Page 14: Deep learning in the visual cortex

• Subjects get the gist of the scene at 7 images/s from an unpredictable, random sequence of images

- No time for eye movements
- No top-down influences / expectations

Invariant recognition in natural images

• Coarse initial base representation

- Enables rapid object detection/recognition (‘what is there?’)

- Insufficient for object localization
- Sensitive to the presence of clutter

9 Potter 1971, 1975; see also Biederman 1972; Thorpe 1996

Page 15: Deep learning in the visual cortex

Two classes of models

Marr ‘82; Biederman ’87

10


Page 17: Deep learning in the visual cortex

Two classes of models

Marr ‘82; Biederman ’87
11

I. Fundamentals of primate vision

Page 18: Deep learning in the visual cortex

Two classes of models

Marr ‘82; Biederman ’87

II. Rapid recognition and feedforward processing

11

I. Fundamentals of primate vision

Page 19: Deep learning in the visual cortex

Two classes of models

Marr ‘82; Biederman ’87

II. Rapid recognition and feedforward processing
III. Attentional mechanisms and cortical feedback

11

I. Fundamentals of primate vision

Page 20: Deep learning in the visual cortex

Streams of processing
12

Ventral visual stream

Dorsal visual stream

Page 21: Deep learning in the visual cortex

Simple eye vs. camera

Source: unknown
13

Page 22: Deep learning in the visual cortex

Simple eye vs. camera

• ~10M pixels

Source: unknown
13

Page 23: Deep learning in the visual cortex

Simple eye vs. camera

• ~100M photoreceptors

- ~5M cones
- ~90M rods

• ~10M pixels

Source: unknown
13

Page 24: Deep learning in the visual cortex

Eye stimulated by stuff in the world

Source: webvision
14

Page 25: Deep learning in the visual cortex

Eye stimulated by stuff in the world

photoreceptors

Retinal output (ganglion cells)

Modified from http://thalamus.wustl.edu/course/eyeret.html
15

Page 26: Deep learning in the visual cortex

Eye stimulated by stuff in the world

photoreceptors

Retinal output (ganglion cells)

Increase in firing rate driven by external stimulus

Modified from http://thalamus.wustl.edu/course/eyeret.html
15

Page 27: Deep learning in the visual cortex

Eye stimulated by stuff in the world

photoreceptors

Retinal output (ganglion cells)

+- -

x1 x2 x3

y

w1 w2 w3

y = f(∑ᵢ wᵢ xᵢ)

Modified from http://thalamus.wustl.edu/course/eyeret.html
16

source: webvision

Page 28: Deep learning in the visual cortex

Eye stimulated by stuff in the world

photoreceptors

Retinal output (ganglion cells)

+- -

x1 x2 x3

y

w1 w2 w3

Afferent activity from neuron i

y = f(∑ᵢ wᵢ xᵢ)

Modified from http://thalamus.wustl.edu/course/eyeret.html
16

source: webvision

Page 29: Deep learning in the visual cortex

Eye stimulated by stuff in the world

photoreceptors

Retinal output (ganglion cells)

+- -

x1 x2 x3

y

w1 w2 w3

Afferent activity from neuron i

Synaptic efficacies

y = f(∑ᵢ wᵢ xᵢ)

Modified from http://thalamus.wustl.edu/course/eyeret.html
16

source: webvision

Page 30: Deep learning in the visual cortex

Eye stimulated by stuff in the world

photoreceptors

Retinal output (ganglion cells)

+- -

x1 x2 x3

y

w1 w2 w3

Afferent activity from neuron i

Synaptic efficacies

Summation at the soma

y = f(∑ᵢ wᵢ xᵢ)

Modified from http://thalamus.wustl.edu/course/eyeret.html
16

source: webvision
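
The diagram above reduces a retinal ganglion cell to a textbook linear-nonlinear unit: afferent activities xᵢ are weighted by synaptic efficacies wᵢ, summed at the soma, and passed through an output nonlinearity f. A minimal NumPy sketch of that unit; the rectifying nonlinearity and the example numbers are illustrative assumptions, not values from the slides:

```python
import numpy as np

def ln_neuron(x, w, f=lambda s: np.maximum(s, 0.0)):
    """Linear-nonlinear unit: y = f(sum_i w_i * x_i).

    x : afferent activity from each presynaptic neuron i
    w : synaptic efficacies (positive = excitatory, negative = inhibitory)
    f : output nonlinearity (a rectifier here, as an illustrative choice)
    """
    return f(np.dot(w, x))

# Hypothetical center-surround example (+ - -): excitatory center, inhibitory surround.
x = np.array([1.0, 0.2, 0.3])    # photoreceptor-driven inputs x1, x2, x3
w = np.array([1.0, -0.5, -0.5])  # synaptic weights w1, w2, w3
print(ln_neuron(x, w))           # summed, rectified response of the ganglion cell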


Page 32: Deep learning in the visual cortex

Eye stimulated by stuff in the world

x1 x2 x3

y

w1 w2 w3

y = f(∑ᵢ wᵢ xᵢ) / (pool)

photoreceptors

Retinal output (ganglion cells)

+- -

17

source: webvision

Page 33: Deep learning in the visual cortex

Eye stimulated by stuff in the world

x1 x2 x3

y

w1 w2 w3

Normalization via pool of neighboring units (e.g., tuned to different orientations)

y = f(∑ᵢ wᵢ xᵢ) / (pool)

photoreceptors

Retinal output (ganglion cells)

+- -

17

source: webvision
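
The extended formula divides the driven response by a pool of neighboring units (e.g., units tuned to different orientations). A hedged sketch of that divisive normalization; since the slide leaves the pooling function unspecified, it is implemented here as an L2 norm over the neighbors' responses plus a small stabilizing constant:

```python
import numpy as np

def normalized_response(x, w, pool_activity, sigma=0.1,
                        f=lambda s: np.maximum(s, 0.0)):
    """Divisive normalization: y = f(sum_i w_i x_i) / (pooling term).

    pool_activity : responses of neighboring units (e.g., other orientations)
    sigma         : small constant keeping the denominator well behaved (assumption)
    """
    drive = f(np.dot(w, x))
    pool = sigma + np.sqrt(np.sum(np.square(pool_activity)))  # assumed L2 pooling
    return drive / pool

x = np.array([1.0, 0.2, 0.3])
w = np.array([1.0, -0.5, -0.5])
neighbors = np.array([0.4, 0.9, 0.1])   # hypothetical activity of the normalization pool
print(normalized_response(x, w, neighbors))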

Page 34: Deep learning in the visual cortex

18

RF organization in V1

Hubel & Wiesel ’59 ’62 ’68

Hubel & Wiesel

Simple cell

Page 35: Deep learning in the visual cortex

19

RF organization in V1

Hubel & Wiesel ’59 ’62 ’68

Hubel & Wiesel

Simple cell

Page 36: Deep learning in the visual cortex

20

RF organization in V1

Hubel & Wiesel ’59 ’62 ’68

Complex cell

Hubel & Wiesel

Page 37: Deep learning in the visual cortex

21

RF organization in V1

Hubel & Wiesel ’59 ’62 ’68

Complex cell

Hubel & Wiesel

Page 38: Deep learning in the visual cortex

22

RF organization in V1

Hubel & Wiesel ’59 ’62 ’68

Hubel & Wiesel
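
Hubel & Wiesel's simple/complex distinction is often modeled computationally (e.g., in HMAX-style accounts) as an oriented linear filter plus rectification for the simple cell, and a max over simple cells with the same preferred orientation at nearby positions for the complex cell. A minimal sketch under those modeling assumptions; the Gabor parameters and the random test image are illustrative:

```python
import numpy as np

def gabor(size=11, theta=0.0, wavelength=6.0, sigma=3.0):
    """Oriented Gabor patch used as a simple-cell receptive field."""
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def simple_cell(patch, rf):
    """Simple cell: half-rectified dot product with an oriented receptive field."""
    return max(np.sum(patch * rf), 0.0)

def complex_cell(image, rf, stride=2):
    """Complex cell: max over simple cells of the same orientation at nearby positions."""
    size = rf.shape[0]
    responses = []
    for i in range(0, image.shape[0] - size + 1, stride):
        for j in range(0, image.shape[1] - size + 1, stride):
            responses.append(simple_cell(image[i:i + size, j:j + size], rf))
    return max(responses)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))   # stand-in for an image patch
rf = gabor(theta=np.pi / 4)             # 45-degree-preferring receptive field
print(complex_cell(image, rf))          # position-tolerant oriented response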

Page 39: Deep learning in the visual cortex

Hierarchical architecture: Anatomy Rockland & Pandya ’79;

Maunsell & Van Essen ‘83; Felleman & Van Essen ’91 23


Page 41: Deep learning in the visual cortex

Hierarchical architecture: Latencies

Nowak & Bullier ’97; Schmolesky et al ’98

source: Thorpe & Fabre-Thorpe ‘01

24

Page 42: Deep learning in the visual cortex

Hierarchical architecture: Function

source: Kobatake & Tanaka ’94; Freiwald & Tsao ’10 see also Oram & Perrett 1993; Sheinberg & Logothetis ‘96; Gallant et al ‘96; Riesenhuber & Poggio ’99

gradual increase in complexity of preferred stimulus

25


Page 44: Deep learning in the visual cortex

Parallel increase in invariance properties (position and scale) of neurons

Hierarchical architecture: Function

source: Kobatake & Tanaka 1994 see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999

26
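
The two trends summarized here, growing complexity of the preferred stimulus and growing position/scale invariance, are commonly captured by stacking alternating stages: a template-matching (tuning) stage that builds selectivity for more complex features, and a max-pooling stage that builds tolerance, as in Riesenhuber & Poggio-style models. A rough sketch of that alternation; the random templates and layer sizes are arbitrary assumptions standing in for learned features:

```python
import numpy as np

rng = np.random.default_rng(0)

def tuning_layer(features, templates):
    """Selectivity stage: each unit responds to the match between its template and
    the incoming feature vector; preferred stimuli grow more complex with depth
    because templates combine features from the previous stage."""
    return np.maximum(templates @ features, 0.0)

def pooling_layer(responses, pool_size=2):
    """Invariance stage: max over small groups of units (the same feature at
    different positions/scales), discarding exactly where/how large it appeared."""
    trimmed = responses[: responses.size // pool_size * pool_size]
    return trimmed.reshape(-1, pool_size).max(axis=1)

x = rng.standard_normal(64)            # stand-in for low-level (V1-like) responses
for stage in range(1, 4):              # e.g., V1 -> V2/V4 -> IT-like stages (schematic)
    templates = rng.standard_normal((32, x.size))
    x = pooling_layer(tuning_layer(x, templates))
    print(f"stage {stage}: {x.size} pooled units")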

Page 45: Deep learning in the visual cortex

Columnar organization

Columnar organization in V1

Source: unknown

27

Page 46: Deep learning in the visual cortex

Columnar organization
Kobatake & Tanaka 1994; see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999

Dictionary of visual features of intermediate visual areas (V4/PIT)

28

Columnar organization in V1

Page 47: Deep learning in the visual cortex

Invariant image representation in IT

Invariant object category information can be decoded from small populations of cells in IT

29

Page 48: Deep learning in the visual cortex

Invariant representation in IT Hung et al ’05

8 object categories, 77 unique stimuli

[Embedded excerpt from Hung et al., Science, 4 November 2005: a linear classifier trained on IT multi-unit spike counts (50-ms bins, 100-300 ms after stimulus onset) accurately reads out object category (chance 12.5%) and identity (chance 1.3%) on held-out trials, and generalizes to shifted and scaled versions of objects that the classifier saw at only one position and scale during training.]

Decoding possible from around 100 ms

30
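
The readout in Hung et al. amounts to training a linear classifier on population spike counts and testing it on held-out trials (and on shifted/scaled stimuli). A toy sketch of that style of population decoding; the "spike counts" below are simulated Poisson data, not the recorded responses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sites, n_categories, trials_per_cat = 64, 8, 30

# Simulated spike counts: each category drives a different mean rate at each site.
means = rng.uniform(1.0, 10.0, size=(n_categories, n_sites))
X = np.vstack([rng.poisson(means[c], size=(trials_per_cat, n_sites))
               for c in range(n_categories)])
y = np.repeat(np.arange(n_categories), trials_per_cat)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"categorization accuracy: {clf.score(X_test, y_test):.2f} (chance = 1/8)")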


Page 50: Deep learning in the visual cortex

Feedforward processing
Data collected by Maxime Cauchoix and Denis Fize (CNRS, France)

- Head-free monkeys
- Multiple electrodes implanted along the ventral stream (V4 + PIT)
- Button release and touch-screen response on targets
- Human subjects: MIT undergrads :-)

[Figure residue: accuracy (% correct) for familiar (Fam) vs. new images in monkeys M1 and M2 and in humans; monkey vs. human 'animalness' scores.]

31

Page 51: Deep learning in the visual cortex

Feedforward processing
Data collected by Maxime Cauchoix and Denis Fize (CNRS, France)

[Figure residue: decoding accuracy (%) as a function of time (ms), computed from activity before 140 ms, for monkeys M1 and M2; number of responses per time bin also shown.]

32
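
The decoding-accuracy-versus-time curves on this slide come from fitting a classifier separately on the activity available in each time bin. A schematic sketch of such a sliding-window analysis on simulated data; the onset latency, firing rates, and bin size are made-up values for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_trials, n_sites, bin_ms = 200, 32, 20
labels = rng.integers(0, 2, n_trials)          # e.g., target vs. non-target trials
bins = np.arange(0, 200, bin_ms)               # 0-200 ms after stimulus onset

for t in bins:
    informative = t >= 80                      # assume category info arrives ~80 ms here
    rates = 2.0 + (labels[:, None] * 1.5 if informative else 0.0)
    counts = rng.poisson(rates, size=(n_trials, n_sites))
    acc = cross_val_score(LogisticRegression(max_iter=1000), counts, labels, cv=5).mean()
    print(f"{t:3d}-{t + bin_ms:3d} ms: decoding accuracy = {acc:.2f}")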

Page 52: Deep learning in the visual cortex

Learning and plasticity Li & DiCarlo ’08
33

• We know very little about the learning rules at work in the visual cortex (essentially just STDP)

Plasticity: very fast learning in IT
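
STDP, the one rule the bullet does name, adjusts a synapse according to the relative timing of pre- and postsynaptic spikes: potentiation when the presynaptic spike precedes the postsynaptic one, depression otherwise. A toy sketch of the classic exponential STDP window; the amplitudes and time constant are illustrative, not taken from the slides:

```python
import numpy as np

def stdp_dw(delta_t, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Spike-timing-dependent plasticity window (illustrative parameters).

    delta_t : t_post - t_pre in ms; positive means the presynaptic spike preceded
              the postsynaptic spike (potentiation), negative the reverse (depression).
    """
    delta_t = np.asarray(delta_t, dtype=float)
    return np.where(delta_t >= 0,
                    a_plus * np.exp(-delta_t / tau),
                    -a_minus * np.exp(delta_t / tau))

print(stdp_dw([+5.0, -5.0]))   # small potentiation, small depression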

Page 53: Deep learning in the visual cortex

Learning and plasticity Li & DiCarlo ’08
34

• We know very little about the learning rules at work in the visual cortex (essentially just STDP)

Supervised category learning in PFC

[Embedded excerpt from a cat/dog morph categorization study (Science, 12 January 2001): single neurons respond categorically to dogs vs. cats, with a sharp transition at the category boundary, similar responses within a category regardless of morph level, and category information appearing about 100 ms after sample onset.]


Page 55: Deep learning in the visual cortex

Matching object representations in man and monkey Kriegeskorte et al ’08

[Embedded excerpt from Kriegeskorte et al., Neuron 2008: representational dissimilarity matrices (1 - Pearson correlation between response patterns) computed from 674 monkey IT single cells and 316 human IT fMRI voxels show a striking match, with the animate/inanimate distinction dominating both representations.]

35
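
The representational dissimilarity matrices in Kriegeskorte et al. are built from 1 minus the Pearson correlation between the response patterns (across neurons or voxels) elicited by each pair of stimuli, which is what lets monkey single-cell data and human fMRI data be compared in a common space. A small sketch of that computation on a made-up response matrix; the sizes are illustrative:

```python
import numpy as np

def rdm(response_patterns):
    """Representational dissimilarity matrix: 1 - Pearson r between the response
    patterns of each pair of stimuli (rows = stimuli, columns = units or voxels)."""
    return 1.0 - np.corrcoef(response_patterns)

rng = np.random.default_rng(0)
patterns = rng.standard_normal((96, 300))  # illustrative: 96 stimuli x 300 voxels/cells
d = rdm(patterns)
print(d.shape)   # (96, 96), symmetric, zeros on the diagonal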

Page 56: Deep learning in the visual cortex

Summary (source: Felleman & Van Essen ’90)

Hierarchies are ubiquitous: anatomy, function & latencies

36

Ventral visual stream

Page 57: Deep learning in the visual cortex

Ventral visual stream

Summary: two modes of processing, bottom-up vs. recurrent

37


Page 60: Deep learning in the visual cortex

Vision is complex but the solution might be simple...

Cavanagh ’95

39

What we think we see

What we really see