A biologically-motivated approach to computer vision

Thomas Serre

McGovern Institute for Brain Research

Department of Brain & Cognitive Sciences

Massachusetts Institute of Technology

• Object recognition is hard!

• Our visual capabilities are computationally amazing

• Reverse-engineer the visual system and build machines that see and interpret the visual world as well as we do

The problem: invariant recognition in natural scenes

Computer vision successes

Face detection: Schneiderman & Kanade ’99; Viola & Jones ’01

The recipe

• Lots of simple features

• Fancy classifier

• Lots of training examples

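• Given example images $(x_1, y_1), \dots, (x_n, y_n)$ where $y_i = 0, 1$ for negative and positive examples respectively.

• Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = 0, 1$ respectively, where $m$ and $l$ are the number of negatives and positives respectively.

• For $t = 1, \dots, T$:

1. Normalize the weights, $w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$, so that $w_t$ is a probability distribution.

2. For each feature, $j$, train a classifier $h_j$ which is restricted to using a single feature. The error is evaluated with respect to $w_t$: $\epsilon_j = \sum_i w_{t,i} \, |h_j(x_i) - y_i|$.

3. Choose the classifier, $h_t$, with the lowest error $\epsilon_t$.

4. Update the weights: $w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i}$, where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.

• The final strong classifier is: $h(x) = 1$ if $\sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t$, and $h(x) = 0$ otherwise, where $\alpha_t = \log \frac{1}{\beta_t}$.

Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.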
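The boosting loop of Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the threshold "stump" used as the single-feature weak classifier, the exhaustive threshold search, the clamping of the error away from 0 and 1, and all function names are assumptions made for the example. It also assumes both classes are present in the training set.

```python
import math

def stump_predict(value, theta, polarity):
    """Weak classifier on one feature: 1 if polarity * value < polarity * theta, else 0."""
    return 1 if polarity * value < polarity * theta else 0

def train_stump(values, labels, weights):
    """Exhaustively pick the threshold and polarity with the lowest weighted error."""
    best = (float("inf"), 0.0, 1)
    candidates = sorted(set(values)) + [max(values) + 1.0]
    for theta in candidates:
        for polarity in (1, -1):
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if stump_predict(v, theta, polarity) != y)
            if err < best[0]:
                best = (err, theta, polarity)
    return best

def adaboost(X, y, rounds):
    """X: list of feature vectors, y: list of 0/1 labels. Returns the selected weak classifiers."""
    m, l = y.count(0), y.count(1)                     # negatives, positives
    w = [1.0 / (2 * m) if yi == 0 else 1.0 / (2 * l) for yi in y]
    strong = []                                       # (alpha, feature index, theta, polarity)
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]                  # step 1: normalize
        best = None
        for j in range(len(X[0])):                    # step 2: one stump per feature
            err, theta, pol = train_stump([x[j] for x in X], y, w)
            if best is None or err < best[0]:
                best = (err, j, theta, pol)           # step 3: keep the lowest error
        err, j, theta, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)         # assumption: keep beta finite
        beta = err / (1.0 - err)
        for i, (xi, yi) in enumerate(zip(X, y)):      # step 4: down-weight correct examples
            if stump_predict(xi[j], theta, pol) == yi:
                w[i] *= beta
        strong.append((math.log(1.0 / beta), j, theta, pol))
    return strong

def classify(strong, x):
    """Final strong classifier: weighted vote against half the total alpha."""
    score = sum(a * stump_predict(x[j], theta, pol) for a, j, theta, pol in strong)
    return 1 if score >= 0.5 * sum(a for a, _, _, _ in strong) else 0

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 1], [3, 4], [6, 2], [7, 6], [8, 3]]
y = [0, 0, 0, 1, 1, 1]
model = adaboost(X, y, rounds=3)
print([classify(model, x) for x in X])
```

Because each round keeps exactly one single-feature classifier, the number of rounds is also the number of features used by the final classifier.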

... number of features are retained (perhaps a few hundred or thousand).

3.2. Learning Results

While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.

For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.

Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlaid on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.

4. The Attentional Cascade

This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

The overall form of the detection process is that of a degenerate decision tree, what we call a “cascade” (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
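The early-rejection logic of the cascade can be illustrated with a short Python sketch. The stage scoring functions and thresholds below are hypothetical stand-ins (a mean-brightness test followed by a contrast test), chosen only to make the control flow concrete; in the detector described here each stage would be a boosted classifier.

```python
def cascade_classify(stages, window):
    """Run the stages in order and reject at the first stage whose score is below its threshold.

    stages: list of (score_fn, threshold) pairs, cheapest first.
    Returns (accepted, number_of_stages_evaluated).
    """
    for index, (score_fn, threshold) in enumerate(stages):
        if score_fn(window) < threshold:
            return False, index + 1      # immediate rejection, later stages are skipped
    return True, len(stages)

# Hypothetical two-stage cascade on a flattened sub-window of pixel values.
stages = [
    (lambda w: sum(w) / len(w), 0.2),    # cheap first stage: mean brightness
    (lambda w: max(w) - min(w), 0.5),    # second stage: contrast
]

print(cascade_classify(stages, [0.0, 0.1, 0.0, 0.1]))   # rejected by stage 1
print(cascade_classify(stages, [0.4, 0.4, 0.5, 0.5]))   # rejected by stage 2
print(cascade_classify(stages, [0.1, 0.9, 0.2, 0.8]))   # passes both stages
```

The second return value makes the cost saving visible: most sub-windows leave after one or two stages, so the expensive later stages run on only a small fraction of the image.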

Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detec-


... is very valuable, in their implementation it is necessary to first evaluate some feature detector at every location. These features are then grouped to find unusual co-occurrences. In practice, since the form of our detector and the features that it uses are extremely efficient, the amortized cost of evaluating our detector at every scale and location is much faster than finding and grouping edges throughout the image.

In recent work Fleuret and Geman have presented a face detection technique which relies on a “chain” of tests in order to signify the presence of a face at a particular scale and location [4]. The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different than rectangle features which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman’s learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally the false positive rate of Fleuret and Geman’s approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind. The included example images each have between 2 and 10 false positives.

5 Results

A 38 layer cascaded classifier was trained to detect frontal upright faces. To train the detector, a set of face and non-face training images were used. The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the world wide web. Some typical face examples are shown in Figure 5. The non-face subwindows used to train the detector come from 9544 images which were manually inspected and found to not contain any faces. There are about 350 million subwindows within these non-face images.

The number of features in the first five layers of the detector is 1, 10, 25, 25 and 50 features respectively. The remaining layers have increasingly more features. The total number of features in all layers is 6061.

Each classifier in the cascade was trained with the 4916 training faces (plus their vertical mirror images for a total of 9832 training faces) and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the AdaBoost training procedure. For the initial one feature classifier, the non-face training examples were collected by selecting random sub-windows from a set of 9544 images which did not contain faces. The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across the non-face images and collecting false positives. A maximum of 10000 such non-face sub-windows were collected for each layer.

Speed of the Final Detector

The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. Evaluated on the MIT+CMU test set [12], an average of 10 features out of a total of 6061 are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (using a starting scale of 1.25 and a step size of 1.5 described below). This is roughly 15 times faster than the Rowley-Baluja-Kanade detector [12] and about 600 times faster than the Schneiderman-Kanade detector [15].

Figure 5: Example of frontal upright face images used for training.

Image Processing

All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions. Normalization is therefore necessary during detection as well. The variance of an image sub-window can be computed quickly using a pair of integral images. Recall that $\sigma^2 = \frac{1}{N} \sum x^2 - m^2$, where $\sigma$ is the standard deviation, $m$ is the mean, $x$ is the pixel value within the sub-window, and $N$ is the number of pixels in the sub-window. The mean of a sub-window can be computed using the integral image. The sum of squared pixels is computed using an integral image of the image squared (i.e. two integral images are used in the scanning process). During scanning the effect of image normalization can be achieved by post-multiplying the feature values rather than pre-multiplying the pixels.
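The two-integral-image trick for sub-window variance can be sketched as follows. This is an illustrative pure-Python version under simple assumptions (a grayscale image stored as a list of rows, a zero-padded integral image); the function names are made up for the example and are not taken from the paper.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over all rows < r and columns < c (zero-padded border)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four lookups in the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def window_variance(ii, ii_sq, top, left, height, width):
    """Variance of a sub-window: mean of squares minus square of the mean."""
    n = height * width
    mean = box_sum(ii, top, left, height, width) / n
    return box_sum(ii_sq, top, left, height, width) / n - mean * mean

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
ii_sq = integral_image([[p * p for p in row] for row in img])

print(box_sum(ii, 1, 1, 2, 2))                  # 5 + 6 + 8 + 9 = 28
print(window_variance(ii, ii_sq, 0, 0, 2, 2))   # pixels 1, 2, 4, 5 -> 2.5
```

Once the two integral images are built, the variance of any sub-window costs eight table lookups regardless of its size, which is what makes per-window normalization affordable during scanning.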

Scanning the Detector

The final detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any ...

• Face detection (Schneiderman & Kanade ’99; Viola & Jones ’01): 10K-1M training examples

• Car detection (Schneiderman & Kanade ’99): over 100K training examples

• Pedestrian detection (Dalal & Triggs ’05): over 1K training examples

What’s wrong with this picture?


• Tens of thousands of manually annotated training examples

• ~30,000 object categories (Biederman, 1987)

• Approach unlikely to scale up ...

One-shot learning in humans

By age 6, a child knows 10-30K categories

What are the computational mechanisms underlying this amazing feat?

source: cerebral cortex



1. Organization of the visual system




2. Computational model of the visual cortex





3. Application to computer vision

Hierarchical architecture: Anatomy

Rockland & Pandya ’79; Maunsell & Van Essen ‘83; Felleman & Van Essen ’91

Hierarchical architecture: Latencies

Nowak & Bullier ’97; Schmolesky et al ’98

source: Thorpe & Fabre-Thorpe ‘01

Hierarchical architecture: Function


ventral visual stream


Hubel & Wiesel 1959, 1962, 1965, 1968



Nobel prize 1981

simple cells

complex cells


Kobatake & Tanaka 1994 see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999

Gradual increase in complexity of preferred stimulus

Parallel increase in invariance properties (position and scale) of neurons


Hung* Kreiman* Poggio & DiCarlo 2005



• Invariant object recognition in IT:

• Robust invariant readout of category information from small population of neurons

• Single spikes after response onset carry most of the information

Hierarchical architecture: Feedforward processing

Thorpe Fize & Marlot ‘96


What are the computational mechanisms used by brains to achieve this amazing feat?





Feedforward hierarchical model of object recognition

• Qualitative neurobiological models (Hubel & Wiesel ‘58; Perrett & Oram ‘93)

• Biologically-inspired (Fukushima ‘80; Mel ‘97; LeCun et al ‘98; Thorpe ‘02; Ullman et al ‘02; Wersing & Koerner ‘03)

• Quantitative neurobiological models (Wallis & Rolls ‘97; Riesenhuber & Poggio ‘99; Amit & Mascaro ‘03; Deco & Rolls ‘06)

Feedforward hierarchical model

[Figure: diagram of the feedforward hierarchical model. Model layers S1, C1, S2, C2, S2b, C2b, S3, C3, S4 and classification units (animal vs. non-animal) are shown alongside the cortical areas of the visual system, from V1 through V2, V4, PIT and AIT to prefrontal cortex, annotated with RF sizes (ranging from 0.2° to 7°) and the number of units per layer. Legend: simple cells (tuning), complex cells (MAX); main routes, bypass routes; dorsal stream ('where' pathway), ventral stream ('what' pathway). Side labels: increase in complexity (number of subunits), RF size and invariance; unsupervised task-independent learning; supervised task-dependent learning.]

• Large-scale (10^8 units), spans several areas of the visual cortex

• Combination of forward and reverse engineering

• Shown to be consistent with many experimental data across areas of visual cortex

Selective pooling mechanisms

• Simple units: template matching, Gaussian-like tuning (~ “AND”)

• Complex units: invariance, max-like operation (~ “OR”)

Riesenhuber & Poggio 1999 (building on Fukushima ‘80 and Hubel & Wiesel ‘62)
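The two pooling operations can be made concrete with a small Python sketch. It is only an illustration of the idea: the Gaussian form, the `sigma` parameter, the toy template and the function names are assumptions for the example, not the model's actual parameters.

```python
import math

def simple_unit(inputs, template, sigma=1.0):
    """Gaussian-like tuning: the response peaks at 1.0 when the inputs match the stored template."""
    d2 = sum((x - t) ** 2 for x, t in zip(inputs, template))
    return math.exp(-d2 / (2 * sigma ** 2))

def complex_unit(afferent_responses):
    """Max-like pooling over simple units tuned to the same feature at different positions or scales."""
    return max(afferent_responses)

template = [1.0, 0.0, 1.0]

# The same preferred pattern appears at one of three positions in two different "images".
image_a = [[1.0, 0.0, 1.0], [0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]
image_b = [[0.0, 0.3, 0.1], [0.2, 0.1, 0.0], [1.0, 0.0, 1.0]]

response_a = complex_unit([simple_unit(patch, template) for patch in image_a])
response_b = complex_unit([simple_unit(patch, template) for patch in image_b])
print(response_a, response_b)   # identical: the max makes the response position-tolerant
```

The tuning step behaves like an "AND" over the inputs (all of them must match the template for a strong response), while the max step behaves like an "OR" over positions (a match anywhere is enough).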



Basic circuit for the two operations

Both operations can be approximated by gain control circuits using shunting inhibition

Kouh & Poggio 2007; Knoblich Bouvrie Poggio 2007
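One way to see how a single gain-control circuit could yield both operations is a divisive-normalization expression of the form y = Σ w_j x_j^p / (k + (Σ x_j^q)^r). The sketch below uses that form as an assumption for illustration; the exponent settings, the constant `k` and the function name are choices made for the example and should be checked against the cited papers before being relied on.

```python
def gain_control(x, w, p, q, r, k=1e-9):
    """Weighted sum of powered inputs, divided by a pooled (normalizing) signal."""
    numerator = sum(wj * xj ** p for wj, xj in zip(w, x))
    denominator = k + sum(xj ** q for xj in x) ** r
    return numerator / denominator

# Max-like regime: p = q + 1, r = 1, uniform weights. Large q approaches max(x).
x = [0.1, 0.9, 0.3]
soft_max = gain_control(x, w=[1.0, 1.0, 1.0], p=11, q=10, r=1)
print(soft_max, max(x))

# Tuning-like regime: p = 1, q = 2, r = 1/2 gives a normalized dot product,
# which is largest when the input points in the same direction as the weights.
w = [0.6, 0.8]                      # unit-length weight vector (the preferred pattern)
print(gain_control([3.0, 4.0], w, p=1, q=2, r=0.5))   # aligned with w
print(gain_control([4.0, 3.0], w, p=1, q=2, r=0.5))   # not aligned, lower response
```

Inputs are assumed to be non-negative, as firing rates would be. Only the exponents change between the two regimes, which is the sense in which one circuit motif could serve both roles.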

Learning and plasticity

Evidence for adult plasticity

• PFC, IT: very likely

• V4: likely

• V1/V2: limited evidence


Unsupervised developmental-like learning stage: frequent image features


Learned V2/V4 units

[Figure: examples of learned units; legend: stronger facilitation, stronger suppression]


Beyond V4: combinations of those...


Supervised learning from a handful of training examples ~ linear perceptron
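The supervised, task-dependent stage can be pictured as a linear perceptron trained on top of fixed feature responses. The sketch below is a generic perceptron in plain Python, not the classifier used in the model; the four two-dimensional "feature vectors", the learning rate and the number of epochs are hypothetical values chosen for the example.

```python
def train_perceptron(examples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: update the weights only on misclassified examples.

    examples: list of feature vectors, labels: list of +1 / -1.
    """
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                       # wrong side (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# A handful of labeled examples: two "animal" (+1) and two "non-animal" (-1) feature vectors.
examples = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]
labels = [1, 1, -1, -1]

w, b = train_perceptron(examples, labels)
print([predict(w, b, x) for x in examples])
```

The point of the illustration is the division of labor: if the earlier layers already provide selective and invariant features, a very simple linear readout trained on few labeled examples may suffice for the task-specific decision.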

Learning and sample complexity



• V1 | Tuning properties of simple and complex cells (Schiller et al 1976; Hubel & Wiesel 1965; De Valois et al 1982)

• IT | Tuning and invariance properties (Logothetis et al 1995)





• V4 | Tuning for two-bar stimuli (Reynolds Chelazzi & Desimone 1999)

• V4 | MAX operation (Gawne et al 2002)

• V4 | Two-spot interaction (Freiwald et al 2005)

• V4 | Tuning for boundary conformation (Pasupathy & Connor 2001)

• V4 | Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)



• V1 | MAX operation in subset of complex cells (Lampl et al 2004)

• IT | Differential role of IT and PFC in categorization (Freedman et al 2001 2002 2003)

• IT | Read out data (Hung Kreiman Poggio & DiCarlo 2005)

• IT | Average effect in IT (Zoccolan Cox & DiCarlo 2005; Zoccolan Kouh Poggio & DiCarlo in press)

• Human psychophysics | Rapid animal categorization (Serre Oliva Poggio 2007)

Invariance in IT


[Figure: classification performance (axis 0 to 1) for IT recordings and for the model. TRAIN: size 3.4°, position center. TEST (size, position): 3.4°, center; 1.7°, center; 6.8°, center; 3.4°, 2° horz.; 3.4°, 4° horz.]

Model data: Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005

Experimental data: Hung* Kreiman* Poggio & DiCarlo 2005

Explaining human performance in rapid categorization tasks

Serre Oliva & Poggio 2007

[Figure: example stimuli (animals, natural distractors, artificial distractors) in four conditions: head, close-body, medium-body, far-body. Plot of performance (d') per condition for the model (82% correct) and human observers (80% correct).]
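The d' sensitivity measure used to report performance is the difference between the z-transformed hit rate and false-alarm rate. A short Python sketch of that standard formula is given below; the hit and false-alarm rates in the example are hypothetical, since the slides report only overall percent correct and d' values.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical unbiased observer who is right 80% of the time on both
# animal and non-animal trials: hit rate 0.8, false-alarm rate 0.2.
print(round(d_prime(0.8, 0.2), 2))

# An observer at chance has d' = 0.
print(d_prime(0.5, 0.5))
```

Unlike percent correct, d' separates sensitivity from response bias, which is one reason it is commonly used when comparing a model with human observers.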



Bio-motivated computer vision

Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007

Scene parsing and object recognition

Computer vision system based on the response properties of neurons in the ventral stream of the visual cortex


GPU acceleration

[Figure: plot of performance in Gflops]

• GPU can run certain classes of algorithms 50-100x faster than a CPU

• Designed to run same program (“kernel”) for each element of a large 2D grid

• 240 parallel processors! (512 by 2010 Q1)

• 97 times speedup over our best CPU implementation

• 0.291 sec/image for a 256x256 pixel image

• Currently downloading and processing about 300K images per day from the internet

Mutch, 2009
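The per-image time and the daily volume quoted above can be cross-checked with simple arithmetic. The sketch below assumes a single GPU running continuously and ignores download time; both are assumptions made for the estimate, not statements from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def images_per_day(seconds_per_image, devices=1):
    """Upper bound on throughput if every device processes images back to back."""
    return devices * SECONDS_PER_DAY / seconds_per_image

gpu_time = 0.291                 # sec/image on the GPU (from the slide)
speedup = 97                     # GPU vs. best CPU implementation (from the slide)

print(round(images_per_day(gpu_time)))   # close to the ~300K images/day quoted
print(round(gpu_time * speedup, 1))      # implied CPU time per image, in seconds
```

Under these assumptions the quoted figure of about 300K images per day corresponds to roughly one GPU kept busy around the clock.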

Recognition in videos

Source: Wikipedia, “ventral stream”

Ungerleider & Mishkin ‘84


ventral stream: “shape pathway”

dorsal stream: “motion pathway”



Action recognition in video sequences

Jhuang Serre Wolf & Poggio 2007

[Figure: example frames for the actions wave 2, bend, jack, jump, run, walk, side, wave 1, jump 2; motion-sensitive MT-like units]

Dataset | Dollar et al ‘05 | Model | Chance
KTH Human | 81.3% | 91.6% | 16.7%
Weiz. Human | 86.7% | 96.3% | 11.1%
UCSD Mice | 75.6% | 79.0% | 20.0%

★ Cross-validation: 2/3 training, 1/3 testing, 10 repeats
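The evaluation protocol (2/3 training, 1/3 testing, 10 repeats) can be sketched as repeated random hold-out splits. The Python below is a generic illustration; the shuffling scheme, the fixed seed and the `train_and_score` callback are assumptions for the example and do not describe how the splits were actually drawn in the cited work.

```python
import random

def repeated_holdout(samples, labels, train_and_score,
                     repeats=10, train_fraction=2 / 3, seed=0):
    """Average the score of `train_and_score` over several random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train, test = indices[:cut], indices[cut:]
        scores.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test]))
    return sum(scores) / len(scores)

def majority_baseline(train_x, train_y, test_x, test_y):
    """Hypothetical scorer: always predict the most frequent training label."""
    guess = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == guess) / len(test_y)

samples = list(range(9))
labels = ["walk"] * 6 + ["run"] * 3
print(repeated_holdout(samples, labels, majority_baseline))
```

Averaging over several random splits reduces the dependence of the reported accuracy on any single partition of the data.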

Automatic recognition of rodent behavior

• Limit subjectivity of human intervention and stress on the animal (compared to standardized tests)

• 24 hr surveillance towards assessing well-being of animals

• Help validate models of mental and neurodegenerative diseases (Huntington's, schizophrenia, autism, etc)

• Help assess efficacy of drugs

Behaviors of interest

Data Set

• Manually annotated two sets:

  • Frame-accurate action clips:

    • ~50 man-hr/hr of video

    • 4000 clips

    • Fine-tuning system parameters (speed tuning, spatial resolution, feature learning, etc)

  • Fully (continuously) annotated videos (less accurate):

    • ~20 man-hr/hr of video

    • Learning temporal statistics


Serre* Jhuang* Garrote Poggio Steele in prep

• Proof of concept with 8 primitive behaviors (groom, eat, hang, drink, walk, jump, micro-move, rest)

• System is trainable and could be trained for additional behaviors

Demo available at http://techtv.mit.edu/videos/1838







Performance

• Human agreement: 72%

• Proposed system: 71%

• Commercial system: 56%

• Chance: 12%









Computer system vs. human

Behavioral comparison between 4 strains


• 24 hour monitoring of 4 different strains (n=8):

• CAST/EiJ (wild-like strain)

• C57Bl/6J (popular inbred mouse strain)

• DBA/2J (popular inbred mouse strain)

• BTBR2 (potential model of autism)

• Corresponds to about 7 yr of work for manual scoring


Predicting strain based on behavior

Summary

• Feedforward hierarchical model of visual perception seems consistent with “immediate recognition”, i.e. during passive viewing or when the visual system is forced to operate without top-down cortical feedback

• Application to automated behavior recognition in home-cage mice

• Beyond feedforward processing:

• Cortical feedback

• Shifts of attention

Neuroscience of attention and Bayesian inference, in collaboration with the Desimone lab (monkey electrophysiology)

Integrated model of attention and recognition

[Figure: model diagram with areas V2, V4/PIT, IT, PFC and LIP/FEF; labels: feature-based attention, spatial attention]

Model performance improves with attention

Chikkerur Serre & Poggio in prep

[Figure: bar plots of performance (d', axis 0 to 3) for the model and for humans, comparing no attention with one shift of attention, and mask with no mask.]

Acknowledgments

Other: Narcisse Bichot, Stan Bileschi, Charles Cadieu, Robert Desimone, Jim DiCarlo, Michelle Fabre-Thorpe, Winrich Freiwald, Estibaliz Garrote, Hueihan Jhuang, Ulf Knoblich, Christof Koch, Minjoon Kouh, Gabriel Kreiman, Timothee Masquelier, Leila Reddy, David Sheinberg, Jed Singer, Andrew Steele, Simon Thorpe, Nao Tsuchyia, Lior Wolf, Ying Zhang

Andrew Steele, Hueihan Jhuang, Estibaliz Garrote, Tomaso Poggio
