image-grain comparison of core object recognition behavior...

Image-grain comparison of core object recognition behavior in humans, monkeys and machinesRishi Rajalingham*, Elias B. Issa*, Kohitij Kar, Kailyn Schmidt, Pouya Bashivan, James J. DiCarloDept. of Brain and Cognitive Sciences1, McGovern Institute for Brain Research2, MIT

Methods

ResultsIntroduction

Summary

Humans can rapidly and accurately recognize objects in spite of variation in viewing pa-rameters (e.g. position, pose, and size) and background conditions. To uncover the algo-rithms underlying this ability, quantitative benchmarks of human behavior can be used to directly compare animal, neural and computational vision systems. Previously, we showed that on one such benchmark, the pattern of object-level confu-sions, monkeys behavior was indistinguishable from human behavior. This pattern was also shared with high-performing object recognition models (convolution neural networks, CNNs) but not models of lower-level visual representations. Here, we extended on our previous work by systematically comparing the core object recognition behavior of humans against that of monkeys and machines at the image-level using the pattern of image-by-image di�culties.

Fixation(200ms)

Test Image Presentation(100ms)

Choice Image Presentation (<1500ms) & Hold (700ms)

Behavioral paradigm: We tested object recognition performance on a two alter-native forced choice (2AFC) match-to-sample task using brief (100ms) presentations of naturalistic syn-thetic images of basic-level objects.

High-throughput psychophsyics: Over a million behavioral trials across hundreds of humans and �ve monkeys were ag-gregated to characterize each species on each image. To col-lect such large behavioral datasets, we used Amazon Me-chanical Turk for humans, and a novel home-cage touch-screen behavioral system (”MonkeyTurk”) for monkeys.

MO

NKE

Y

V1 H

MA

X

ALE

XNET

(fc6

)

ALE

XNET

(fc7

)

ALE

XNET

(fc8

)

VG

G(f

c6)

VG

G(f

c7)

VG

G(f

c8)

GO

OG

LEN

ET

RESN

ET

GO

OG

LEN

ET (v

3)

GO

OG

LEN

ET (v

3)sy

nthe

tic

trai

ned

GO

OG

LEN

ET (v

3)re

tina

trai

ned

High-resolution, image-grain behavior is a stronger constraint than object-level be-havior for distinguishing between alternative models of human object recognition.

The models’ gap in consistency at the image level, though small, could not be easily explained by image viewing parameters (position, pose and size).

Comparing Humans and Monkeys:

Comparing Humans and Machines (AlexNet):

Behavioral Benchmark:

Our results show that monkeys are highly consistent with the pooled human in their pattern of image di�culties. In contrast, all tested CNN models were signi�cantly less consistent with the pooled human.

• Our results show that monkeys are highly consistent with humans in their pat-tern of image di�culties, suggesting that they are a good animal model for studying human object recognition abilities.

• In contrast, all tested CNN models were signi�cantly less consistent with humans at the image grain even though they passed at the object grain. This small gap in image-level consistency to humans could not be easily attributed to the viewing parameters (position, size, pose) of the object, classi�er choice, images used in model training, or retinal sampling of images.

• Therefore, high-resolution, image-grain behavioral measures serve as stron-ger constraints than object-level measures for distinguishing between alterna-tive mechanistic models of human visual object recognition.

Humans Monkeys GoogLeNet(v3)

GoogLeNet (v3)synthetic trained

Monkey

V1HMAX

V1HMAX

Monkey GoogLeNet (v3)synthetic trained

Object grain Image grainObject grain Image grain

200 trainin

g im

ages/o

bject

100 testing

imag

es/ob

ject

Wrench Rhino LegCamelTankBirdShortsElephantHouseHangerPenHammer

ZebraForkGuitarBearWatchDogSpiderTruckTableCalculatorGunKnife

Stimuli: To enforce true object recognition behavior, we generated several thousand nat-uralistic images, each with one foreground object (by rendering a 3D model of each object with random viewing parameters) on a random natural image background.

Each object can produce myriad images under variations in viewing parameters, with some images being more challenging than others.

HARD EASYrh

inoca

lculat

or

hous

edo

gbir

d

gun

hang

er

eleph

ant

truck leg

table

watchbearfork

knife

cam

el

guita

r

tank

wren

ch

spide

r

ham

mer

zebr

a

shor

ts

pen

image-grain comparison of core object recognition behavior...

Documents