A biologically-motivated approach to computer vision
Presentation given at Yale University in August 2009.
Thomas Serre
McGovern Institute for Brain Research, Department of Brain & Cognitive Sciences, Massachusetts Institute of Technology
• Object recognition is hard!
• Our visual capabilities are computationally amazing
• Reverse-engineer the visual system and build machines that see and interpret the visual world as well as we do
The problem: invariant recognition in natural scenes
Computer vision successes
Face detection
Schneiderman & Kanade ’99; Viola & Jones ’01
The recipe
Lots of simple features + fancy classifier + lots of training examples
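Given example images $(x_1, y_1), \ldots, (x_n, y_n)$ where $y_i = 0, 1$ for negative and positive examples respectively.
Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = 0, 1$ respectively, where $m$ and $l$ are the number of negatives and positives respectively.
For $t = 1, \ldots, T$:
1. Normalize the weights, $w_{t,i} \leftarrow w_{t,i} / \sum_{j=1}^{n} w_{t,j}$, so that $w_t$ is a probability distribution.
2. For each feature, $j$, train a classifier $h_j$ which is restricted to using a single feature. The error is evaluated with respect to $w_t$, $\epsilon_j = \sum_i w_{t,i} |h_j(x_i) - y_i|$.
3. Choose the classifier, $h_t$, with the lowest error $\epsilon_t$.
4. Update the weights: $w_{t+1,i} = w_{t,i} \beta_t^{1 - e_i}$, where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
The final strong classifier is: $h(x) = 1$ if $\sum_{t=1}^{T} \alpha_t h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \alpha_t$, and $h(x) = 0$ otherwise, where $\alpha_t = \log \frac{1}{\beta_t}$.
Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.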
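The boosting loop in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `train_adaboost` and `predict`, the exhaustive threshold search, and the clipping of the error away from 0 and 1 are all assumptions made here for the sake of a runnable example.

```python
import numpy as np

def train_adaboost(X, y, rounds):
    """X: (n_examples, n_features) feature values; y: labels in {0, 1}."""
    m, l = np.sum(y == 0), np.sum(y == 1)
    # Initial weights: 1/(2m) for negatives, 1/(2l) for positives.
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))
    chosen = []
    for _ in range(rounds):
        w = w / w.sum()  # step 1: normalize to a probability distribution
        best = None
        # Step 2: one threshold classifier per feature (exhaustive search,
        # an assumption of this sketch).
        for j in range(X.shape[1]):
            for theta in np.unique(X[:, j]):
                for parity in (1, -1):
                    pred = (parity * X[:, j] < parity * theta).astype(int)
                    err = np.sum(w * np.abs(pred - y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, pred)
        # Step 3: keep the classifier with the lowest weighted error.
        err, j, theta, parity, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)  # avoid division by zero and log(0)
        beta = err / (1 - err)
        # Step 4: shrink the weights of correctly classified examples.
        w = w * beta ** (pred == y).astype(float)
        chosen.append((j, theta, parity, np.log(1 / beta)))
    return chosen

def predict(chosen, X):
    """Strong classifier: weighted vote compared against half the total weight."""
    total = sum(alpha for _, _, _, alpha in chosen)
    score = sum(alpha * (parity * X[:, j] < parity * theta)
                for j, theta, parity, alpha in chosen)
    return (score >= 0.5 * total).astype(int)
```

Each entry of `chosen` is one selected feature with its threshold, parity, and vote weight, so the number of boosting rounds directly sets the number of features evaluated at test time.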
number of features are retained (perhaps a few hundred or thousand).
3.2. Learning Results
While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.
For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.
Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlaid on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
4. The Attentional Cascade
This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.
The overall form of the detection process is that of a degenerate decision tree, what we call a “cascade” (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
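The early-rejection logic of the cascade can be sketched as a short loop. This is an illustrative sketch only: `cascade_accepts` and the two toy stages below are hypothetical stand-ins, not the boosted classifiers of the paper.

```python
def cascade_accepts(stages, window):
    """stages: list of (score_function, threshold) pairs, cheapest first."""
    for score, threshold in stages:
        if score(window) < threshold:
            return False  # negative outcome: reject immediately, skip later stages
    return True  # every stage gave a positive result

# Toy stages (hypothetical): a cheap test first, a second test afterwards.
stages = [
    (lambda w: max(w), 0.5),            # stage 1
    (lambda w: sum(w) / len(w), 0.4),   # stage 2, only reached if stage 1 passes
]
```

Because most sub-windows fail an early stage, the average cost per sub-window is dominated by the first one or two stages rather than by the full cascade.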
Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detec-
is very valuable, in their implementation it is necessary to first evaluate some feature detector at every location. These features are then grouped to find unusual co-occurrences. In practice, since the form of our detector and the features that it uses are extremely efficient, the amortized cost of evaluating our detector at every scale and location is much faster than finding and grouping edges throughout the image.
In recent work Fleuret and Geman have presented a face detection technique which relies on a “chain” of tests in order to signify the presence of a face at a particular scale and location [4]. The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different than rectangle features which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman’s learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally the false positive rate of Fleuret and Geman’s approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind. The included example images each have between 2 and 10 false positives.
5 Results
A 38 layer cascaded classifier was trained to detect frontal upright faces. To train the detector, a set of face and non-face training images was used. The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the world wide web. Some typical face examples are shown in Figure 5. The non-face subwindows used to train the detector come from 9544 images which were manually inspected and found to not contain any faces. There are about 350 million subwindows within these non-face images.
Figure 5: Example of frontal upright face images used for training.
The number of features in the first five layers of the detector is 1, 10, 25, 25 and 50 features respectively. The remaining layers have increasingly more features. The total number of features in all layers is 6061.
Each classifier in the cascade was trained with the 4916 training faces (plus their vertical mirror images for a total of 9832 training faces) and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the AdaBoost training procedure. For the initial one feature classifier, the non-face training examples were collected by selecting random sub-windows from a set of 9544 images which did not contain faces. The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across the non-face images and collecting false positives. A maximum of 10,000 such non-face sub-windows were collected for each layer.
Speed of the Final Detector
The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. Evaluated on the MIT+CMU test set [12], an average of 10 features out of a total of 6061 are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (using a starting scale of 1.25 and a step size of 1.5 described below). This is roughly 15 times faster than the Rowley-Baluja-Kanade detector [12] and about 600 times faster than the Schneiderman-Kanade detector [15].
Image Processing
All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions. Normalization is therefore necessary during detection as well. The variance of an image sub-window can be computed quickly using a pair of integral images. Recall that $\sigma^2 = \frac{1}{N} \sum x^2 - m^2$, where $\sigma$ is the standard deviation, $m$ is the mean, $x$ is the pixel value within the sub-window, and $N$ is the number of pixels in the sub-window. The mean of a sub-window can be computed using the integral image. The sum of squared pixels is computed using an integral image of the image squared (i.e. two integral images are used in the scanning process). During scanning the effect of image normalization can be achieved by post-multiplying the feature values rather than pre-multiplying the pixels.
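The two-integral-image trick can be sketched with NumPy as follows. The function names (`integral_image`, `box_sum`, `window_variance`) and the zero-padded layout are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0, dtype=np.float64), axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle from four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def window_variance(ii, ii_sq, top, left, height, width):
    """Variance of a sub-window: mean of squares minus squared mean."""
    n = height * width
    mean = box_sum(ii, top, left, height, width) / n
    mean_sq = box_sum(ii_sq, top, left, height, width) / n
    return mean_sq - mean ** 2
```

Here `ii` is built from the image and `ii_sq` from the squared image, so the variance of any sub-window costs eight lookups regardless of its size.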
Scanning the Detector
The final detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any
Face detection Schneiderman & Kanade ’99; Viola & Jones ’01
10K-1M training examples
Car detection Schneiderman & Kanade ’99
over 100K training examples
Pedestrian detection Dalal & Triggs ’05
over 1K training examples
What’s wrong with this picture?
• Tens of thousands of manually annotated training examples
• ~30,000 object categories (Biederman, 1987)
• Approach unlikely to scale up ...
One-shot learning in humans
By age 6, a child knows 10-30K categories
What are the computational mechanisms underlying this amazing feat?
source: cerebral cortex
1. Organization of the visual system
2. Computational model of the visual cortex
3. Application to computer vision
Hierarchical architecture: Anatomy
Rockland & Pandya ’79; Maunsell & Van Essen ’83; Felleman & Van Essen ’91
Hierarchical architecture: Latencies
Nowak & Bullier ’97; Schmolesky et al ’98
source: Thorpe & Fabre-Thorpe ‘01
Hierarchical architecture: Function
ventral visual stream
Hubel & Wiesel 1959, 1962, 1965, 1968
Nobel prize 1981
simple cells
complex cells
Hierarchical architecture: Function
Kobatake & Tanaka 1994; see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
Gradual increase in complexity of preferred stimulus
Parallel increase in invariance properties (position and scale) of neurons
Hierarchical architecture: Function
Hung* Kreiman* Poggio & DiCarlo 2005
• Invariant object recognition in IT:
• Robust invariant readout of category information from small population of neurons
• Single spikes after response onset carry most of the information
Hierarchical architecture: Feedforward processing
Thorpe Fize & Marlot ‘96
What are the computational mechanisms used by brains to achieve this amazing feat?
source: cerebral cortex
1. Organization of the visual system
2. Computational model of the visual cortex
3. Application to computer vision
Feedforward hierarchical model of object recognition
• Qualitative neurobiological models (Hubel & Wiesel ‘58; Perrett & Oram ‘93)
• Biologically-inspired (Fukushima ‘80; Mel ‘97; LeCun et al ‘98; Thorpe ‘02; Ullman et al ‘02; Wersing & Koerner ‘03)
• Quantitative neurobiological models (Wallis & Rolls ‘97; Riesenhuber & Poggio ‘99; Amit & Mascaro ‘03; Deco & Rolls ‘06)
Feedforward hierarchical model
[Figure: model layers (S1, C1, S2, C2, S2b, C2b, S3, C3, S4, classification units) mapped onto ventral-stream areas V1, V2, V4, PIT, AIT and prefrontal cortex, with main routes and bypass routes. The ventral stream ('what' pathway) is shown alongside the dorsal stream ('where' pathway). Simple cells perform tuning; complex cells perform a MAX operation. Receptive field sizes grow from 0.2-1.1° at the lowest layer to about 7° at the top layers (S4, C2b, C3). Complexity (number of subunits), RF size and invariance increase along the hierarchy. Lower stages use unsupervised, task-independent learning; the top stage uses supervised, task-dependent learning (animal vs. non-animal).]
• Large-scale (10^8 units), spans several areas of the visual cortex
• Combination of forward and reverse engineering
• Shown to be consistent with many experimental data across areas of visual cortex
Selective pooling mechanisms
Riesenhuber & Poggio 1999 (building on Fukushima ‘80 and Hubel & Wiesel ‘62)
Simple units: template matching, Gaussian-like tuning (~ “AND”)
Complex units: invariance, max-like operation (~ “OR”)
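The two operations can be illustrated with a toy one-dimensional example. This is a minimal sketch under stated assumptions: the names `simple_unit`, `s_layer` and `complex_unit`, the 1D signal, and the value of `sigma` are all hypothetical and chosen only to show how tuning followed by a max yields position tolerance.

```python
import numpy as np

def simple_unit(patch, template, sigma=1.0):
    """Gaussian-like tuning: response is 1 when the patch equals the template."""
    diff = np.asarray(patch, dtype=float) - np.asarray(template, dtype=float)
    return float(np.exp(-np.sum(diff ** 2) / (2 * sigma ** 2)))

def s_layer(signal, template, sigma=1.0):
    """Apply the same simple unit at every position of a 1D signal."""
    k = len(template)
    return [simple_unit(signal[i:i + k], template, sigma)
            for i in range(len(signal) - k + 1)]

def complex_unit(simple_responses):
    """Max-like pooling over simple units that share a template."""
    return float(np.max(simple_responses))
```

Shifting the preferred pattern within the signal changes which simple unit fires most strongly, but leaves the pooled complex-unit response unchanged.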
Basic circuit for the two operations
Both operations can be approximated by gain control circuits using shunting inhibition
Kouh & Poggio 2007; Knoblich Bouvrie Poggio 2007
Learning and plasticity
[Figure: the feedforward hierarchical model (V1, V2, V4, PIT, AIT, PFC), annotated with unsupervised task-independent learning at the lower stages and supervised task-dependent learning at the top stage.]
Evidence for adult plasticity
PFC, IT very likely
V4 likely
V1/V2 limited evidence
Unsupervised developmental-like learning stage: frequent image features
[Figure: learned V2/V4 units, with subunits marked as stronger facilitation or stronger suppression.]
Beyond V4: combinations of those...
Supervised learning from a handful of training examples ~ linear perceptron
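As a rough illustration of the supervised stage, a linear perceptron can be trained on top of fixed feature responses. This is only a sketch: `train_perceptron` and `classify` are hypothetical names, and the feature vectors are assumed to stand in for responses of the model's top unsupervised layer, which are not reproduced here.

```python
import numpy as np

def train_perceptron(features, labels, epochs=20):
    """features: (n, d) fixed feature responses; labels: +1 or -1."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            if y * (w @ x + b) <= 0:  # misclassified: move the boundary
                w += y * x
                b += y
    return w, b

def classify(w, b, features):
    """Sign of the linear readout, returned as +1 or -1."""
    return np.where(features @ w + b > 0, 1, -1)
```

Only the readout weights `w` and `b` are learned from labels in this sketch; the features themselves stay fixed, which is why a handful of labeled examples can suffice when the classes are linearly separable in feature space.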
Learning and sample complexity
Feedforward hierarchical model
• V1 | Simple and complex cells tuning properties (Schiller et al 1976; Hubel & Wiesel 1965; De Valois et al 1982)
• IT | Tuning and invariance properties (Logothetis et al 1995)
• V4 | Tuning for two-bar stimuli (Reynolds Chelazzi & Desimone 1999)
• V4 | MAX operation (Gawne et al 2002)
• V4 | Two-spot interaction (Freiwald et al 2005)
• V4 | Tuning for boundary conformation (Pasupathy & Connor 2001)
• V4 | Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
• V1 | MAX operation in subset of complex cells (Lampl et al 2004)
• IT | Differential role of IT and PFC in categorization (Freedman et al 2001 2002 2003)
• IT | Read out data (Hung Kreiman Poggio & DiCarlo 2005)
• IT | Average effect in IT (Zoccolan Cox & DiCarlo 2005; Zoccolan Kouh Poggio & DiCarlo in press)
• Human psychophysics | Rapid animal categorization (Serre Oliva Poggio 2007)
Invariance in IT
[Figure: classification performance (0 to 1) for IT and the model. TRAIN: size 3.4°, position center. TEST (size, position): 3.4° center; 1.7° center; 6.8° center; 3.4° at 2° horizontal; 3.4° at 4° horizontal.]
Model data: Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005. Experimental data: Hung* Kreiman* Poggio & DiCarlo 2005
Explaining human performance in rapid categorization tasks
Serre Oliva & Poggio 2007
[Figure: example images by viewing distance (head, close-body, medium-body, far-body) for animals, natural distractors and artificial distractors.]
[Figure: performance (d') for the head, close-body, medium-body and far-body conditions, with axis ticks between 1.0 and 2.6. Model: 82% correct. Human observers: 80% correct.]
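The d' sensitivity index used for the performance plots is conventionally computed from hit and false-alarm rates as the difference of their z-scores. A minimal sketch follows; the function name `d_prime` is an assumption, and any rates passed to it are illustrative rather than values from the study.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit rate) - z(false-alarm rate); rates must lie strictly in (0, 1)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)
```

Unlike percent correct, d' separates sensitivity from response bias, which is why it is a common measure for target versus distractor tasks.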
3. Application to computer vision
Bio-motivated computer vision
Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
Scene parsing and object recognition
Computer vision system based on the response properties of neurons in the ventral stream of the visual cortex
Bio-motivated computer vision
GPU acceleration
Mutch, 2009
[Figure: Gflops chart]
• GPU can run certain classes of algorithms 50-100x faster than a CPU
• Designed to run the same program (“kernel”) for each element of a large 2D grid
• 240 parallel processors! (512 by 2010 Q1)
• 97x speedup over our best CPU implementation
• 0.291 sec/image for a 256x256 pixel image
• Currently downloading and processing about 300K images per day from the internet
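To make the "same kernel for each element of a 2D grid" idea concrete, here is a minimal sequential sketch in plain Python. It is not the GPU code from the talk. The names `local_max_kernel` and `launch_over_grid` are hypothetical, and the local MAX was chosen as the per-element work only because a MAX operation is listed among the model's predictions. A real GPU would run the per-element calls in parallel instead of in nested loops.

```python
def local_max_kernel(image, row, col, radius=1):
    """Work done for ONE grid element: max over a small neighborhood.

    The neighborhood is clipped at the image borders.
    """
    n_rows, n_cols = len(image), len(image[0])
    r0, r1 = max(0, row - radius), min(n_rows, row + radius + 1)
    c0, c1 = max(0, col - radius), min(n_cols, col + radius + 1)
    return max(image[r][c] for r in range(r0, r1) for c in range(c0, c1))


def launch_over_grid(kernel, image, **kwargs):
    """Apply the same kernel independently at every (row, col).

    Each call depends only on the read-only input, so the calls could be
    executed in any order, or all at once on parallel hardware.
    """
    return [
        [kernel(image, r, c, **kwargs) for c in range(len(image[0]))]
        for r in range(len(image))
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
pooled = launch_over_grid(local_max_kernel, image, radius=1)
# pooled == [[5, 6, 6], [8, 9, 9], [8, 9, 9]]
```

As a sanity check on the quoted throughput: at 0.291 sec/image, a single device running continuously would process about 86,400 / 0.291 ≈ 297,000 images per day. This is in line with the 300K figure, assuming processing rather than downloading is the bottleneck.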
Recognition in videos
Source: Wikipedia, “ventral stream”
ventral stream: “shape pathway”
dorsal stream: “motion pathway”
Ungerleider & Mishkin ‘84
Bio-motivated computer vision
Jhuang Serre Wolf & Poggio 2007
Action recognition in video sequences
[Figure: example actions: bend, jack, jump, jump 2, run, side, walk, wave 1, wave 2]
motion-sensitive MT-like units
Action recognition in video sequences
Jhuang Serre Wolf & Poggio 2007
Dataset | Dollar et al ‘05 | model | chance
KTH Human | 81.3% | 91.6% | 16.7%
Weiz. Human | 86.7% | 96.3% | 11.1%
UCSD Mice | 75.6% | 79.0% | 20.0%
★ Cross-validation: 2/3 training, 1/3 testing, 10 repeats
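The evaluation protocol (2/3 training, 1/3 testing, 10 repeats) can be sketched as repeated random hold-out splits. This is an illustration under assumptions: the talk does not say whether splits were drawn at random or grouped by subject or clip, and `repeated_holdout`, `evaluate` and `chance_level` are hypothetical names.

```python
import random


def repeated_holdout(n_items, evaluate, repeats=10, train_fraction=2 / 3, seed=0):
    """Average a score over several random train/test splits.

    `evaluate(train_idx, test_idx)` is a caller-supplied function that
    trains on the first index list and returns a score on the second.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    n_train = round(n_items * train_fraction)
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)  # new random partition for each repeat
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores), scores


def chance_level(n_classes):
    """Expected accuracy of uniform random guessing over equally likely classes."""
    return 1.0 / n_classes
```

The chance column is consistent with uniform guessing over 6, 9 and 5 classes (1/6 ≈ 16.7%, 1/9 ≈ 11.1%, 1/5 = 20.0%). Those class counts are inferred from the percentages, not stated on the slide.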
Automatic recognition of rodent behavior
• Limit subjectivity of human intervention and stress on the animal (compared to standardized tests)
• 24 hr surveillance towards assessing well-being of animals
• Help validate models of mental and neurodegenerative diseases (Huntington's, schizophrenia, autism, etc)
• Help assess efficacy of drugs
Behaviors of interest
Data Set
• Manually annotated two sets:
• Frame-accurate action clips:
  • ~50 man-hr/hr of video
  • 4000 clips
  • Fine-tuning system parameters (speed tuning, spatial resolution, feature learning, etc)
• Fully (continuously) annotated videos (less accurate):
  • ~20 man-hr/hr of video
  • Learning temporal statistics
Automatic recognition of rodent behavior
Serre* Jhuang* Garrote Poggio Steele in prep
• Proof of concept with 8 primitive behaviors (groom, eat, hang, drink, walk, jump, micro-move, rest)
• System is trainable and could be trained for additional behaviors
Demo available at http://techtv.mit.edu/videos/1838
Performance
human agreement | 72%
proposed system | 71%
commercial system | 56%
chance | 12%
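One plausible way to obtain agreement figures like these is the fraction of video frames on which two label sequences (two annotators, or the system and an annotator) name the same behavior. The slide does not say how agreement was measured, so the frame-wise definition, the function name and the toy labels below are all assumptions.

```python
def frame_agreement(labels_a, labels_b):
    """Fraction of frames on which two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must cover the same frames")
    if not labels_a:
        raise ValueError("need at least one frame")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy example with made-up labels drawn from the 8 behaviors.
annotator = ["rest", "rest", "walk", "eat", "eat", "groom", "groom", "hang"]
system = ["rest", "micro-move", "walk", "eat", "drink", "groom", "groom", "hang"]
agreement = frame_agreement(annotator, system)  # 6 of 8 frames match: 0.75
```

With 8 behaviors, uniform guessing would agree on 1/8 = 12.5% of frames, which is close to the 12% chance level quoted.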
Computer system vs. human
Behavioral comparison between 4 strains
Serre* Jhuang* Garrote Poggio Steele in prep
• 24 hour monitoring of 4 different strains (n=8):
• CAST/EiJ (wild-like strain)
• C57BL/6J (popular inbred mouse strain)
• DBA/2J (popular inbred mouse strain)
• BTBR2 (potential model of autism)
• Corresponds to about 7 yr of work for manual scoring
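The "about 7 yr" figure can be roughly reproduced from numbers given elsewhere in the talk. The back-of-the-envelope sketch below assumes that n=8 means 8 animals per strain, that manual scoring costs the ~20 man-hr per hour of video quoted for continuous annotation, and that a working year is 2080 hours (40 hr × 52 weeks). The last value is an assumption that does not come from the slides.

```python
def manual_scoring_years(
    n_strains=4,
    animals_per_strain=8,               # assumed reading of "n=8"
    hours_per_animal=24,
    annotation_hours_per_video_hour=20,  # "~20 man-hr/hr of video"
    work_hours_per_year=2080,            # assumed: 40 hr/week * 52 weeks
):
    """Estimate person-years needed to score all the video by hand."""
    video_hours = n_strains * animals_per_strain * hours_per_animal
    annotation_hours = video_hours * annotation_hours_per_video_hour
    return annotation_hours / work_hours_per_year


years = manual_scoring_years()  # 768 video hr -> 15360 man-hr -> about 7.4 yr
```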
Predicting strain based on behavior
Summary
• Feedforward hierarchical model of visual perception seems consistent with “immediate recognition”, i.e. during passive viewing or when the visual system is forced to operate without top-down cortical feedback
• Application to automated behavior recognition in home-cage mice
• Beyond feedforward processing:
• Cortical feedback
• Shifts of attention
Neuroscience of attention and Bayesian inference, in collaboration with the Desimone lab (monkey electrophysiology)
[Diagram: integrated model of attention and recognition. Labels: PFC, IT, V4/PIT, V2, LIP/FEF, spatial attention, feature-based attention]
Model performance improves with attention
Chikkerur Serre & Poggio in prep
[Figure: performance (d’, axis 0 to 3) for Model and Humans, with no attention vs. one shift of attention; mask and no-mask conditions]
Acknowledgments
Other:
• Narcisse Bichot
• Stan Bileschi
• Charles Cadieu
• Robert Desimone
• Jim DiCarlo
• Michelle Fabre-Thorpe
• Winrich Freiwald
• Estibaliz Garrote
• Hueihan Jhuang
• Ulf Knoblich
• Christof Koch
• Minjoon Kouh
• Gabriel Kreiman
• Timothee Masquelier
• Leila Reddy
• David Sheinberg
• Jed Singer
• Andrew Steele
• Simon Thorpe
• Nao Tsuchyia
• Lior Wolf
• Ying Zhang
Andrew Steele Hueihan Jhuang Estibaliz Garrote
Tomaso Poggio