20141003.journal club

Journal Club paper introduction: Performance-optimized hierarchical models predict neural responses in higher visual cortex. Yamins, D. K. et al., June 10, 2014, PNAS 111(23), 8619–8624. www.pnas.org/cgi/doi/10.1073/pnas.1403112111. 2014/10/03, [email protected]


Page 1: 20141003.journal club

Journal Club paper introduction: Performance-optimized hierarchical models predict neural responses in higher visual cortex

Yamins, D. K. et al.

June 10, 2014, PNAS 111(23), 8619–8624. www.pnas.org/cgi/doi/10.1073/pnas.1403112111

2014/10/03

[email protected]

Page 2: 20141003.journal club

Goals / What we did

Performance-optimized hierarchical models predict neural responses in higher visual cortex
Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo

Department of Brain and Cognitive Sciences and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139; and Harvard-MIT Division of Health Sciences and Technology, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139

Edited by Terrence J. Sejnowski, Salk Institute for Biological Studies, La Jolla, CA, and approved April 8, 2014 (received for review March 3, 2014)

The ventral visual stream underlies key human visual object recognition abilities. However, neural encoding in the higher areas of the ventral stream remains poorly understood. Here, we describe a modeling approach that yields a quantitatively accurate model of inferior temporal (IT) cortex, the highest ventral cortical area. Using high-throughput computational techniques, we discovered that, within a class of biologically plausible hierarchical neural network models, there is a strong correlation between a model's categorization performance and its ability to predict individual IT neural unit response data. To pursue this idea, we then identified a high-performing neural network that matches human performance on a range of recognition tasks. Critically, even though we did not constrain this model to match neural data, its top output layer turns out to be highly predictive of IT spiking responses to complex naturalistic images at both the single site and population levels. Moreover, the model's intermediate layers are highly predictive of neural responses in the V4 cortex, a midlevel visual area that provides the dominant cortical input to IT. These results show that performance optimization, applied in a biologically appropriate model class, can be used to build quantitative predictive models of neural processing.

computational neuroscience | computer vision | array electrophysiology

Retinal images of real-world objects vary drastically due to changes in object pose, size, position, lighting, nonrigid deformation, occlusion, and many other sources of noise and variation. Humans effortlessly recognize objects rapidly and accurately despite this enormous variation, an impressive computational feat (1). This ability is supported by a set of interconnected brain areas collectively called the ventral visual stream (2, 3), with homologous areas in nonhuman primates (4). The ventral stream is thought to function as a series of hierarchical processing stages (5–7) that encode image content (e.g., object identity and category) increasingly explicitly in successive cortical areas (1, 8, 9). For example, neurons in the lowest area, V1, are well described by Gabor-like edge detectors that extract rough object outlines (10), although the V1 population does not show robust tolerance to complex image transformations (9). Conversely, rapidly evoked population activity in top-level inferior temporal (IT) cortex can directly support real-time, invariant object categorization over a wide range of tasks (11, 12). Midlevel ventral areas, such as V4 (the dominant cortical input to IT), exhibit intermediate levels of object selectivity and variation tolerance (12–14).

Significant progress has been made in understanding lower ventral areas such as V1, where conceptually compelling models have been discovered (10). These models are also quantitatively accurate and can predict response magnitudes of individual neuronal units to novel image stimuli. Higher ventral cortical areas, especially V4 and IT, have been much more difficult to understand. Although first principles-based models of higher ventral cortex have been proposed (15–20), these models fail to match important features of the higher ventral visual neural representation in both humans and macaques (4, 21). Moreover, attempts to fit V4 and IT neural tuning curves on general image stimuli have shown only limited predictive success (22, 23). Explaining the neural encoding in these higher ventral areas thus remains a fundamental open question in systems neuroscience.

As with V1, models of higher ventral areas should be neurally predictive. However, because the higher ventral stream is also believed to underlie sophisticated behavioral object recognition capacities, models must also match IT on performance metrics, equalling (or exceeding) the decoding capacity of IT neurons on object recognition tasks. A model with perfect neural predictivity in IT will necessarily exhibit high performance, because IT itself does. Here we demonstrate that the converse is also true, within a biologically appropriate model class. Combining high-throughput computational and electrophysiology techniques, we explore a wide range of biologically plausible hierarchical neural network models and then assess them against measured IT and V4 neural response data. We show that there is a strong correlation between a model's performance on a challenging high-variation object recognition task and its ability to predict individual IT neural unit responses.

Extending this idea, we used optimization methods to identify a neural network model that matches human performance on a range of recognition tasks. We then show that even though this model was never explicitly constrained to match neural data, its output layer is highly predictive of neural responses in IT cortex, providing a first quantitatively accurate model of this highest ventral cortex area. Moreover, the middle layers of the model are highly predictive of V4 neural responses, suggesting top-down performance constraints directly shape intermediate visual representations.

Significance

Humans and monkeys easily recognize objects in scenes. This ability is known to be supported by a network of hierarchically interconnected brain areas. However, understanding neurons in higher levels of this hierarchy has long remained a major challenge in visual systems neuroscience. We use computational techniques to identify a neural network model that matches human performance on challenging object categorization tasks. Although not explicitly constrained to match neural data, this model turns out to be highly predictive of neural responses in both the V4 and inferior temporal cortex, the top two layers of the ventral visual hierarchy. In addition to yielding greatly improved models of visual cortex, these results suggest that a process of biological performance optimization directly shaped neural mechanisms.

Author contributions: D.L.K.Y., H.H., and J.J.D. designed research; D.L.K.Y., H.H., and E.A.S. performed research; D.L.K.Y. contributed new reagents/analytic tools; D.L.K.Y., H.H., C.F.C., and D.S. analyzed data; and D.L.K.Y., H.H., and J.J.D. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

See Commentary on page 8327.
1D.L.K.Y. and H.H. contributed equally to this work.
2To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1403112111/-/DCSupplemental.


Page 3: 20141003.journal club

Goals / What we did

•  Can a deep net (CNN) be used to predict the activity of neurons in the brain's higher visual areas (IT, V4)?
   – Even a CNN specialized purely for image-classification performance can explain the activity of IT neurons
   – The representations in the CNN's intermediate layers can explain the activity of V4 neurons

•  Tools
   – Deep Convolutional net + AdaBoost = HMO (a minimal sketch follows below)
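Since the slide reduces HMO to "Deep Convolutional net + AdaBoost", a toy Python sketch of that interleaving may help. Everything here is an illustrative assumption: the random-projection "networks" and the invented helper make_random_net stand in for candidate CNN architectures, and the hyperparameter-optimization half of the real HMO procedure is omitted.

# Hypothetical sketch of the HMO idea (not the authors' code): AdaBoost-style
# rounds pick the candidate network whose features best serve the currently
# hard examples, then the selected networks' features are concatenated.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def make_random_net(in_dim, out_dim, seed):
    # Stand-in for one candidate hierarchical network: random filter + threshold.
    w = np.random.default_rng(seed).standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: np.maximum(x @ w, 0.0)

# Toy "screening" task: 400 images as 64-d vectors, 4 object categories.
X = rng.standard_normal((400, 64))
y = rng.integers(0, 4, size=400)

pool = [make_random_net(64, 32, seed=s) for s in range(10)]
weights = np.full(len(X), 1.0 / len(X))      # per-example boosting weights
selected = []

for _ in range(3):                           # a few boosting rounds
    best = None
    for net in pool:
        feats = net(X)
        clf = LinearSVC(max_iter=5000).fit(feats, y, sample_weight=weights)
        miss = clf.predict(feats) != y
        err = weights[miss].sum()
        if best is None or err < best[0]:
            best = (err, net, miss)
    _, net, miss = best
    selected.append(net)
    weights *= np.exp(miss.astype(float))    # emphasize misclassified examples
    weights /= weights.sum()

hmo_features = np.hstack([net(X) for net in selected])   # combined representation
print(hmo_features.shape)                                # (400, 96)

Each boosting round reweights the screening images so the next selected network specializes on what the current mixture gets wrong, which matches the intuition of "functionally specialized subregions" combined hierarchically in the paper text quoted later.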

Page 4: 20141003.journal club

Summary of visual input stimulus

•  8 classes, 8 objects/class
•  Image: Objects + Natural Backgrounds
•  Object Variation: Low/Medium/High

1. Felleman DJ, Van Essen DC (1991) Distributed hierarchical processing in the primate cerebral cortex. Cereb Cortex 1(1):1–47.

2. DiCarlo JJ, Maunsell JHR (2000) Inferotemporal Representations Underlying Object Recognition in the Free Viewing Monkey (Society for Neuroscience, New Orleans).

3. Majaj N, Hong H, Solomon E, DiCarlo J (2012) A unified neuronal population code fully explains human object recognition. Computational and Systems Neuroscience (COSYNE). Available at http://cosyne.org/cosyne12/Cosyne2012_program_book.pdf. Accessed March 3, 2014.

4. Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976.

5. Quiroga RQ, Nadasdy Z, Ben-Shaul Y (2004) Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Comput 16(8):1661–1687.

6. Hung CP, Kreiman G, Poggio T, DiCarlo JJ (2005) Fast readout of object identity from macaque inferior temporal cortex. Science 310(5749):863–866.

7. Rust NC, DiCarlo JJ (2010) Selectivity and tolerance ("invariance") both increase as visual information propagates from cortical area V4 to IT. J Neurosci 30(39):12978–12995.

8. Rosch E, Mervis CB, Gray WD, Johnson DM (1976) Basic objects in natural categories. Cognit Psychol 8(3):382–439.

9. Plachetka T (1998) POV Ray: Persistence of vision parallel raytracer. Proceedings of the Spring Conference on Computer Graphics, ed Szirmay-Kalos L (SCCG, Budmerice, Slovakia), pp 123–129.

10. Connor CE, Brincat SL, Pasupathy A (2007) Transformation of shape information in the ventral pathway. Curr Opin Neurobiol 17(2):140–147.

11. Brincat SL, Connor CE (2004) Underlying principles of visual shape selectivity in posterior inferotemporal cortex. Nat Neurosci 7(8):880–886.

12. Carandini M, et al. (2005) Do we know what the early visual system does? J Neurosci 25(46):10577–10597.

13. Cadieu C, et al. (2007) A model of V4 shape selectivity and invariance. J Neurophysiol 98(3):1733–1750.

14. Helland I (2006) Partial least squares regression. Encyclopedia of Statistical Sciences (John Wiley & Sons, Hoboken, NJ).

15. Pedregosa F, et al. (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830.

16. Kriegeskorte N (2009) Relating population-code representations between man, monkey, and computational models. Front Neurosci 3(3):363–373.

17. Sharpee TO, Kouh M, Reynolds JH (2013) Trade-off between curvature tuning and position invariance in visual area V4. Proc Natl Acad Sci USA 110(28):11618–11623.

18. Rust NC, Mante V, Simoncelli EP, Movshon JA (2006) How MT cells analyze the motion of visual patterns. Nat Neurosci 9(11):1421–1431.

19. Mutch J, Lowe DG (2008) Object class recognition and localization using sparse features with limited receptive fields. International Journal of Computer Vision 80(1):45–57.

20. LeCun Y, Bengio Y (2003) Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, ed Arbib MA (MIT Press, Cambridge, MA), pp 255–258.

21. Kriegeskorte N, et al. (2008) Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60(6):1126–1141.

22. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110.

23. Pinto N, Cox DD, DiCarlo JJ (2008) Why is real-world visual object recognition hard? PLOS Comput Biol 4(1):e27.

24. Freeman J, Simoncelli EP (2011) Metamers of the ventral stream. Nat Neurosci 14(9):1195–1201.

25. Serre T, Oliva A, Poggio T (2007) A feedforward architecture accounts for rapid categorization. Proc Natl Acad Sci USA 104(15):6424–6429.

26. Pinto N, Doukhan D, DiCarlo JJ, Cox DD (2009) A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLOS Comput Biol 5(11):e1000579.

27. Geisler WS (2003) Ideal observer analysis. The Visual Neurosciences, eds Chalupa L, Werner J (MIT Press, Boston), pp 825–837.

28. DiCarlo JJ, Zoccolan D, Rust NC (2012) How does the brain solve visual object recognition? Neuron 73(3):415–434.

29. Martin KA, Schröder S (2013) Functional heterogeneity in neighboring neurons of cat primary visual cortex in response to both artificial and natural stimuli. J Neurosci 33(17):7325–7344.

30. Nakamura H, Gattass R, Desimone R, Ungerleider L (2011) The modular organization of projections from areas V1 and V2 to areas V4 and TEO in macaques. J Neurosci 14:1195–1201.

31. Yamins D, Hong H, Cadieu C, DiCarlo J (2013) Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. Adv Neural Inf Process Syst 26:3093–3101.

32. Schapire RE (1999) Theoretical Views of Boosting and Applications, Lecture Notes in Computer Science (Springer, Berlin), Vol 1720, pp 13–25.

33. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25:1097–1105.

34. Buciu I, Pitas I (2003) ICA and Gabor representation for facial expression recognition. Proceedings of the 2003 International Conference on Image Processing (Institute of Electrical and Electronics Engineers, Piscataway, NJ), Vol 2, pp 855–858.

35. Bergstra J, Yamins D, Cox D (2013) Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. Proceedings of the 30th International Conference on Machine Learning, eds Dasgupta S, McAllester D (MIT Press, Cambridge, MA).

36. DiCarlo JJ, Cox DD (2007) Untangling invariant object recognition. Trends Cogn Sci 11(8):333–341.

37. Goslin M, Mine MR (2004) The Panda3D graphics engine. Computer 37(10):112–114.

38. Downing PE, Chan AW, Peelen MV, Dodds CM, Kanwisher N (2006) Domain specificity in visual cortex. Cereb Cortex 16(10):1453–1461.

39. Kanwisher N, McDermott J, Chun MM (1997) The fusiform face area: A module in human extrastriate cortex specialized for face perception. J Neurosci 17(11):4302–4311.

40. Freiwald WA, Tsao DY (2010) Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330(6005):845–851.

41. Riesenhuber M, Poggio T (2000) Models of object recognition. Nat Neurosci 3(Suppl):1199–1204.

42. Chelaru MI, Dragoi V (2008) Efficient coding in heterogeneous neuronal populations. Proc Natl Acad Sci USA 105(42):16344–16349.

[Fig. S1 panel labels: (A) Testing image set, 8 categories (Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, Tables), 8 objects per category; low variation (640 images), medium variation (2,560 images), high variation (2,560 images). (B) Screening image set, 9 categories (Trees, Buildings, Flowers, Bodies, Instruments, Jewelry, Shoes, Tools, Guns), 4 objects per category.]

Fig. S1. (A) The neural representation benchmark (1) testing image set on which we collected neural data and evaluated models contained 5,760 images of 64 objects in eight categories. The image set contained three subsets, with low, medium, and high levels of object view variation. Images were placed on realistic background scenes, which were chosen randomly to be uncorrelated with object category identity. (B) The screening image set used to discover the HMO model contained 4,500 images of 36 objects in nine categories. As with any two uncorrelated samples of images from the world (such as those images seen during development vs. those seen in adult life), the overall natural statistics of the screening set images were intended to be roughly similar to those of the testing set, but the specific content was quite different. Thus, the objects, semantic categories, and background scenes used in screening were totally nonoverlapping with those used in the testing set. Moreover, different camera, lighting and noise conditions, and a different rendering software package, were used.

1. Cadieu C, et al. (2013) The neural representation benchmark and its evaluation on brain and machine. International Conference on Learning Representations. arXiv:1301.3530.


Page 5: 20141003.journal club

Methods

•  CNN
   – 4 layers, 57 parameters
   – Classification is evaluated with a linear SVM

•  Classification performance of IT and V4 neurons is measured with a linear classifier (why?)

•  Comparison with IT neurons
   – 296 neural sites, ~6,000 complex images
   – Coefficient of determination of an IT ~ CNN linear regression (probably); see the sketch below
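A minimal sketch of what that regression step might look like: a cross-validated linear fit from model features to one site's responses, scored as percent explained variance on held-out images. The PLS regressor and all array sizes are assumptions for illustration (the paper cites both scikit-learn and partial least squares regression, but its exact pipeline is in the SI):

# Hedged sketch of the "neural predictivity" measure: fit a linear map from model
# features to one IT site's responses on training images, then score the percentage
# of response variance explained on held-out images.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.standard_normal((600, 200))                  # model-layer outputs per image (toy)
true_w = rng.standard_normal(200)
site = features @ true_w + 5.0 * rng.standard_normal(600)   # one noisy neural site (toy)

X_tr, X_te, y_tr, y_te = train_test_split(features, site, test_size=0.25, random_state=0)
pls = PLSRegression(n_components=25).fit(X_tr, y_tr)
pred = pls.predict(X_te).ravel()

explained = 100.0 * (1.0 - np.var(y_te - pred) / np.var(y_te))
print(f"single-site explained variance: {explained:.1f}%")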

classifiers on the IT neural population (Fig. 2B, green bars) and the V4 neural population (n = 128, hatched green bars). To expose a key axis of recognition difficulty, we computed performance results at three levels of object view variation, from low (fixed orientation, size, and position) to high (180° rotations on all axes, 2.5× dilation, and full-frame translations; Fig. S1A). As a behavioral reference point, we also measured human performance on these tasks using web-based crowdsourcing methods (black bars). A crucial observation is that at all levels of variation, the IT population tracks human performance levels, consistent with known results about IT's high category decoding abilities (11, 12). The V4 population matches IT and human performance at low levels of variation, but performance drops quickly at higher variation levels. (This V4-to-IT performance gap remains nearly as large even for images with no object translation variation, showing that the performance gap is not due just to IT's larger receptive fields.)

As a computational reference, we used the same procedure to evaluate a variety of published ventral stream models targeting several levels of the ventral hierarchy. To control for low-level confounds, we tested the (trivial) pixel model, as well as SIFT, a simple baseline computer vision model (30). We also evaluated a V1-like Gabor-based model (25), a V2-like conjunction-of-Gabors model (31), and HMAX (17, 28), a model targeted at explaining higher ventral cortex and that has receptive field sizes similar to those observed in IT. The HMAX model can be trained in a domain-specific fashion, and to give it the best chance of success, we performed this training using the benchmark images themselves (see SI Text for more information on the comparison models). Like V4, the control models that we tested approach IT and human performance levels in the low-variation condition, but in the high-variation condition, all of them fail to match the performance of IT units by a large margin. It is not surprising that V1 and V2 models are not nearly as effective as IT, but it is instructive to note that the task is sufficiently difficult that the HMAX model performs less well than the V4 population sample, even when pretrained directly on the test dataset.

Constructing a High-Performing Model. Although simple three-layer hierarchical CNNs can be effective at low-variation object recognition tasks, recent work has shown that they may be limited in their performance capacity for higher-variation tasks (9). For this reason, we extended our model class to contain combinations (e.g., mixtures) of deeper CNN networks (Fig. S2B), which correspond intuitively to architecturally specialized subregions like those observed in the ventral visual stream (13, 32). To address the significant computational challenge of finding especially high-performing architectures within this large space of possible networks, we used hierarchical modular optimization (HMO). The HMO procedure embodies a conceptually simple hypothesis for how high-performing combinations of functionally specialized hierarchical architectures can be efficiently discovered and hierarchically combined, without needing to prespecify the subtasks ahead of time. Algorithmically, HMO is analogous to an adaptive boosting procedure (33) interleaved with hyperparameter optimization (see SI Text and Fig. S2C).

As a pretraining step, we applied the HMO selection procedure on a screening task (Fig. S1B). Like the testing set, the screening set contained images of objects placed on randomly selected backgrounds, but used entirely different objects in totally nonoverlapping semantic categories, with none of the same backgrounds and widely divergent lighting conditions and noise levels. Like any two samples of naturalistic images, the screening and testing images have high-level commonalities but quite different semantic content. For this reason, performance increases that transfer between them are likely to also transfer to other naturalistic image sets. Via this pretraining, the HMO procedure identified a four-layer CNN with 1,250 top-level outputs (Figs. S2B and S5), which we will refer to as the HMO model.

Using the same classifier training protocol as with the neural data and control models, we then tested the HMO model to determine whether its performance transferred from the screening to the testing image set. In fact, the HMO model matched the object recognition performance of the IT neural sample (Fig. 2B, red bars), even when faced with large amounts of variation, a hallmark of human object recognition ability (1). These performance results are robust to the number of training examples and number of sampled model neurons, across a variety of distinct recognition tasks (Figs. S6 and S7).

Predicting Neural Responses in Individual IT Neural Sites. Given that the HMO model had plausible performance characteristics, we then measured its IT predictivity, both for the top layer and each of the three intermediate layers (Fig. 3, red lines/bars). We found that each successive layer predicted IT units increasingly well, demonstrating that the trend identified in Fig. 1A continues to hold in higher performance regimes and across a wide range of model complexities (Fig. 1B). Qualitatively examining the specific predictions for individual images, the model layers show that category selectivity and tolerance to more drastic image transformations emerges gradually along the hierarchy (Fig. 3A, top four rows). At lower layers, model units predict IT responses only at a limited range of object poses and positions. At higher layers, variation tolerance grows while category selectivity develops, suggesting that as more explicit "untangled" object recognition

[Fig. 2 schematic labels: (A) layers 1–4 of stacked linear-nonlinear (LN) units; operations in an LN layer: Filter (Φ1 … Φk), Threshold, Pool, Normalize; spatial convolution over image input; behavioral tasks (e.g., Trees vs. non-Trees); 100 ms visual presentation; neural recordings from V1, V2, V4, IT; steps: 1. Optimize model for task performance, 2. Test per-site neural predictions. (B) Performance (axis 20–100) on low-, medium-, and high-variation tasks for Pixels, SIFT, V1-like, V2-like, HMAX, PLOS09, HMO, V4 Population, IT Population, and Humans; the high-variation V4-to-IT gap is highlighted.]

Fig. 2. Neural-like models via performance optimization. (A) We (1) used high-throughput computational methods to optimize the parameters of a hierarchical CNN with linear-nonlinear (LN) layers for performance on a challenging invariant object recognition task. Using new test images distinct from those used to optimize the model, we then (2) compared output of each of the model's layers to IT neural responses and the output of intermediate layers to V4 neural responses. To obtain neural data for comparison, we used chronically implanted multielectrode arrays to record the responses of multiunit sites in IT and V4, obtaining the mean visually evoked response of each of 296 neural sites to ∼6,000 complex images. (B) Object categorization performance results on the test images for eight-way object categorization at three increasing levels of object view variation (y axis units are 8-way categorization percent-correct; chance is 12.5%). IT (green bars) and V4 (hatched green bars) neural responses, and computational models (gray and red bars) were collected on the same image set and used to train support vector machine (SVM) linear classifiers from which population performance accuracy was evaluated. Error bars are computed over train/test image splits. Human subject responses on the same tasks were collected via psychophysics experiments (black bars); error bars are due to intersubject variation.


Page 6: 20141003.journal club

Performance/IT predictivity

•  x axis: classification performance; y axis: ability to explain IT neurons
•  Green: random parameter selection; orange: IT optimization; blue: classification optimization
•  CNNs with high classification performance also have high IT explanatory power
   → So instead of tailoring a model to IT neurons, why not simply raise classification performance?
•  They incorporate a boosting method (hierarchical modular optimization: HMO)

Results

Invariant Object Recognition Performance Strongly Correlates with IT Neural Predictivity. We first measured IT neural responses on a benchmark testing image set that exposes key performance characteristics of visual representations (24). This image set consists of 5,760 images of photorealistic 3D objects drawn from eight natural categories (animals, boats, cars, chairs, faces, fruits, planes, and tables) and contains high levels of the object position, scale, and pose variation that make recognition difficult for artificial vision systems, but to which humans are robustly tolerant (1, 25). The objects are placed on cluttered natural scenes that are randomly selected to ensure background content is uncorrelated with object identity (Fig. S1A).

Using multiple electrode arrays, we collected responses from 168 IT neurons to each image. We then used high-throughput computational methods to evaluate thousands of candidate neural network models on these same images, measuring object categorization performance as well as IT neural predictivity for each model (Fig. 1A; each point represents a distinct model). To measure categorization performance, we trained support vector machine (SVM) linear classifiers on model output layer units (11) and computed cross-validated testing accuracy for these trained classifiers. To assess models' neural predictivity, we used a standard linear regression methodology (10, 26, 27): for each target IT neural site, we identified a synthetic neuron composed of a linear weighting of model outputs that would best match that site on fixed sample images and then tested response predictions against the actual neural site's output on novel images (Materials and Methods and SI Text).

Models were drawn from a large parameter space of convolutional neural networks (CNNs) expressing an inclusive version of the hierarchical processing concept (17, 18, 20, 28). CNNs approximate the general retinotopic organization of the ventral stream via spatial convolution, with computations in any one region of the visual field identical to those elsewhere. Each convolutional layer is composed of simple and neuronally plausible basic operations, including linear filtering, thresholding, pooling, and normalization (Fig. S2A). These layers are stacked hierarchically to construct deep neural networks.
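As a concrete reading of that list of operations, here is a toy NumPy sketch of one linear-nonlinear layer; the filter shapes, threshold, pooling exponent, and normalization choice are illustrative assumptions, not the paper's actual settings:

# Minimal sketch (toy sizes, assumed details) of one linear-nonlinear (LN)
# convolutional layer as described: filter -> threshold -> pool -> normalize.
import numpy as np

def ln_layer(x, filters, thresh=0.0, pool=2, p=2.0, eps=1e-6):
    """x: (H, W, C_in) input map; filters: (k, k, C_in, C_out)."""
    k = filters.shape[0]
    H, W = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((H, W, filters.shape[-1]))
    for i in range(H):                         # linear filtering (valid convolution)
        for j in range(W):
            patch = x[i:i+k, j:j+k, :]
            out[i, j] = np.tensordot(patch, filters, axes=([0, 1, 2], [0, 1, 2]))
    out = np.maximum(out - thresh, 0.0)        # thresholding
    H2, W2 = H // pool, W // pool              # pooling with exponent p:
    out = out[:H2*pool, :W2*pool].reshape(H2, pool, W2, pool, -1)
    out = (out**p).mean(axis=(1, 3)) ** (1.0 / p)   # p = 1: average-pool, large p: max-pool
    return out / (np.linalg.norm(out, axis=-1, keepdims=True) + eps)  # normalization

rng = np.random.default_rng(0)
y = ln_layer(rng.standard_normal((16, 16, 3)), rng.standard_normal((3, 3, 3, 8)))
print(y.shape)  # (7, 7, 8)

Note how the single exponent p interpolates between average pooling and max pooling; the Discussion later singles this parameter out as one of the most sensitively tuned in the learned model.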

Each model is specified by a set of 57 parameters controlling the number of layers and parameters at each layer, fan-in and fan-out, activation thresholds, pooling exponents, and local receptive field sizes at each layer. Network depth ranged from one to three layers, and filter weights for each layer were chosen randomly from bounded uniform distributions whose bounds were model parameters (SI Text). These models are consistent with the Hierarchical Linear-Nonlinear (HLN) hypothesis that higher level neurons (e.g., IT) output a linear weighting of inputs from intermediate-level (e.g., V4) neurons followed by simple additional nonlinearities (14, 16, 29).

Models were selected for evaluation by one of three procedures: (i) random sampling of the uniform distribution over parameter space (Fig. 1A; n = 2,016, green dots); (ii) optimization for performance on the high-variation eight-way categorization task (n = 2,043, blue dots); and (iii) optimization directly for IT neural predictivity (n = 1,876, orange dots; also see SI Text and Fig. S3). In each case, we observed significant variation in both performance and IT predictivity across the parameter range. Thus, although the HLN hypothesis is consistent with a broad spectrum of particular neural network architectures, specific parameter choices have a large effect on a given model's recognition performance and neural predictivity.

Performance was significantly correlated with neural predictivity in all three selection regimes. Models that performed better on the categorization task were also more likely to produce outputs more closely aligned to IT neural responses. Although the class of HLN-consistent architectures contains many neurally inconsistent architectures with low IT predictivity, performance provides a meaningful way to a priori rule out many of those inconsistent models. No individual model parameters correlated nearly as strongly with IT predictivity as performance (Fig. S4), indicating that the performance/IT predictivity correlation cannot be explained by simpler mechanistic considerations (e.g., receptive field size of the top layer).

Critically, directed optimization for performance significantly increased the correlation with IT predictivity compared with the random selection regime (r = 0.78 vs. r = 0.55), even though neural data were not used in the optimization. Moreover, when optimizing for performance, the best-performing models predicted neural output as well as those models directly selected for neural predictivity, although the reverse is not true. Together, these results imply that, although the IT predictivity metric is a complex function of the model parameter landscape, performance optimization is an efficient means to identify regions in parameter space containing IT-like models.

IT Cortex as a Neural Performance Target. Fig. 1A suggests a next step toward improved encoding models of higher ventral cortex: drive models further to the right along the x axis; if the correlation holds, the models will also climb on the y axis. Ideally, this would involve identifying hierarchical neural networks that perform at or near human object recognition performance levels and validating them using rigorous tests against neural data (Fig. 2A). However, the difficulty of meeting the performance challenge itself can be seen in Fig. 2B. To obtain neural reference points on categorization performance, we trained linear classifiers on the IT and V4 neural populations (Fig. 2B).
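The categorization-performance half of that methodology (a linear SVM readout, scored by cross-validated accuracy) might be sketched as follows; the feature matrix and labels are toy stand-ins for model output units or recorded neural sites:

# Hedged sketch of the performance metric: cross-validated accuracy of a linear
# SVM reading out object category from a representation's features.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_images, n_categories = 800, 8
labels = rng.integers(0, n_categories, n_images)
# Toy features that weakly encode category, plus noise.
features = (np.eye(n_categories)[labels] @ rng.standard_normal((n_categories, 128))
            + 3.0 * rng.standard_normal((n_images, 128)))

scores = cross_val_score(LinearSVC(max_iter=5000), features, labels, cv=5)
print(f"8-way accuracy: {scores.mean():.2f} (chance = {1/n_categories:.3f})")

The paper reports balanced accuracy (chance 0.5) in Fig. 1A and 8-way percent-correct (chance 12.5%) in Fig. 2B; plain accuracy is used here only to keep the sketch short.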

[Fig. 1 plot labels: (A) x axis: Categorization Performance (balanced accuracy), 0.5–0.7; y axis: IT Explained Variance (%), −15 to 30; Random Selection (r = 0.55), Categorization Performance Optimization (r = 0.78), IT Fitting Optimization (r = 0.80). (B) x axis 0.6–1.0; y axis 0–50; points for Pixels, V1-like, SIFT, PLOS09, V2-like, HMAX, HMO, and the Category Ideal Observer; r = 0.87 ± 0.15.]

Fig. 1. Performance/IT-predictivity correlation. (A) Object categorization performance vs. IT neural explained variance percentage (IT-predictivity) for CNN models in three independent high-throughput computational experiments (each point is a distinct neural network architecture). The x axis shows performance (balanced accuracy; chance is 0.5) of the model output features on a high-variation categorization task; the y axis shows the median single site IT explained variance percentage (n = 168 sites) of that model. Each dot corresponds to a distinct model selected from a large family of convolutional neural network architectures. Models were selected by random draws from parameter space (green dots), object categorization performance optimization (blue dots), or explicit IT predictivity optimization (orange dots). (B) Pursuing the correlation identified in A, a high-performing neural network was identified that matches human performance on a range of recognition tasks, the HMO model. The object categorization performance vs. IT neural predictivity correlation extends across a variety of models exhibiting a wide range of performance levels. Black circles include controls and published models; red squares are models produced during the HMO optimization procedure. The category ideal observer (purple square) lies significantly off the main trend, but is not an actual image-computable model. The r value is computed over red and black points. For reference, light blue circles indicate performance optimized models (blue dots) from A.


Page 7: 20141003.journal club

Comparative evaluation: V4, IT, HMO, Human

•  At low variation every model performs well, but at medium and high variation the only model that performs on par with Humans and IT is the HMO

•  Also, the V4 neurons show a large gap on the high-variation task


Page 8: 20141003.journal club

Comparison with IT Neural Activity

•  Black trace: IT activity
•  Red trace: HMO
•  Others: control models
•  The HMO's higher layers have high explanatory power, even though no optimization was performed on the IT neurons themselves

features are generated at each stage, the representations become increasingly IT-like (9).

Critically, we found that the top layer of the high-performing HMO model achieves high predictivity for individual IT neural sites, predicting 48.5 ± 1.3% of the explainable IT neuronal variance (Fig. 3 B and C). This represents a nearly 100% improvement over the best comparison models and is comparable to the prediction accuracy of state-of-the-art models of lower-level ventral areas such as V1 on complex stimuli (10). In comparison, although the HMAX model was better at predicting IT responses than baseline V1 or SIFT, it was not significantly different from the V2-like model.

To control for how much neural predictivity should be expected from any algorithm with high categorization performance, we assessed semantic ideal observers (34), including a hypothetical model that has perfect access to all category labels. The ideal observers do predict IT units above chance level (Fig. 3C, left two bars), consistent with the observation that IT neurons are partially categorical. However, the ideal observers are significantly less predictive than the HMO model, showing that high IT predictivity does not automatically follow from category selectivity and that there is significant noncategorical structure in IT responses attributable to intrinsic aspects of hierarchical network structure (Fig. 3A, last row). These results suggest that high categorization performance and the hierarchical model architecture class work in concert to produce IT-like populations, and neither of these constraints is sufficient on its own to do so.

Population Representation Similarity. Characterizing the IT neural representation at the population level may be equally important for understanding object visual representation as individual IT neural sites. The representation dissimilarity matrix (RDM) is a convenient tool comparing two representations on a common stimulus set in a task-independent manner (4, 35). Each entry in the RDM corresponds to one stimulus pair, with high/low values indicating that the population as a whole treats the pair stimuli as very different/similar. Taken over the whole stimulus set, the RDM characterizes the layout of the images in the high-dimensional neural population space. When images are ordered by category, the RDM for the measured IT neural population (Fig. 4A) exhibits clear block-diagonal structure, associated with IT's exceptionally high categorization performance, as well as off-diagonal structure that characterizes the IT neural representation more finely than any single performance metric (Fig. 4A and Fig. S8). We found that the neural population predicted by the output layer of the HMO model had very high similarity to the actual IT population structure, close to the split-half noise ceiling of the IT population (Fig. 4B). This implies that much of the residual variance unexplained at the single-site level may not be relevant for object recognition in the IT population level code.

We also performed two stronger tests of generalization: (i) object-level generalization, in which the regressor training set contained images of only 32 object exemplars (four in each of eight categories), with RDMs assessed only on the remaining 32 objects, and (ii) category-level generalization, in which the regressor sample set contained images of only half the categories but RDMs were assessed only on images of the other categories (see Figs. S8 and S9). We found that the prediction generalizes robustly, capturing the IT population's layout for images of completely novel objects and categories (Fig. 4 B and C and Fig. S8).

Predicting Responses in V4 from Intermediate Model Layers. Cortical area V4 is the dominant cortical input to IT, and the neural representation in V4 is known to be significantly less categorical

[Fig. 3 plot labels: example sites IT Site 150, IT Site 56, IT Site 42; x axis category blocks Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, Tables; axes: Response Magnitude, Single Site Explained Variance (%) 0–100, Binned Site Counts (n = 168), IT Explained Variance (%) 0–50. Rows/bars with median explained variance: HMO Layer 1 (4%), HMO Layer 2 (21%), HMO Layer 3 (36%), HMO Top Layer (48%); controls: Category Ideal Observer (15%), All Variables, Pixels, SIFT, V1-Like Model (16%), PLOS09, HMAX Model (25%), V2-Like Model (26%).]

Fig. 3. IT neural predictions. (A) Actual neural response (black trace) vs. model predictions (colored trace) for three individual IT neural sites. The x axis in each plot shows 1,600 test images sorted first by category identity and then by variation amount, with more drastic image transformations toward the right within each category block. The y axis represents the prediction/response magnitude of the neural site for each test image (those not used to fit the model). Two of the units show selectivity for specific classes of objects, namely chairs (Left) and faces (Center), whereas the third (Right) exhibits a wider variety of image preferences. The four top rows show neural predictions using the visual feature set (i.e., units sampled) from each of the four layers of the HMO model, whereas the lower rows show those of the control models. (B) Distributions of model explained variance percentage, over the population of all measured IT sites (n = 168). Yellow dotted line indicates distribution median. (C) Comparison of IT neural explained variance percentage for various models. Bar height shows median explained variance, taken over all predicted IT units. Error bars are computed over image splits. Colored bars are those shown in A and B, whereas gray bars are additional comparisons.


Page 9: 20141003.journal club

Comparison with V4 Neural Activity

•  Black trace: V4 activity
•  Red trace: HMO
•  Others: control models
•  The HMO's intermediate layers have high explanatory power (really?); see the layer-wise sketch below

than that of IT (12). Comparing a performance-optimized model to these data would provide a strong test both of its ability to predict the internal structure of the ventral stream, as well as to go beyond the direct consequences of category selectivity. We thus measured the HMO model's neural predictivity for the V4 neural population (Fig. 5). We found that the HMO model's penultimate layer is highly predictive of V4 neural responses (51.7 ± 2.3% explained V4 variance), providing a significantly better match to V4 than either the model's top or bottom layers. These results are strong evidence for the hypothesis that V4 corresponds to an intermediate layer in a hierarchical model whose top layer is an effective model of IT. Of the control models that we tested, the V2-like model predicts the most V4 variation (34.1 ± 2.4%). Unlike the case of IT, semantic models explain effectively no variance in V4, consistent with V4's lack of category selectivity. Together these results suggest that performance optimization not only drives top-level output model layers to resemble IT, but also imposes biologically consistent constraints on the intermediate feature representations that can support downstream performance.

Discussion

Here, we demonstrate a principled method for achieving greatly improved predictive models of neural responses in higher ventral cortex. Our approach operationalizes a hypothesis for how two biological constraints together shaped visual cortex: (i) the functional constraint of recognition performance and (ii) the structural constraint imposed by the hierarchical network architecture.

Generative Basis for Higher Visual Cortical Areas. Our modeling approach has common ground with existing work on neural response prediction (27), e.g., the HLN hypothesis. However, in a departure from that line of work, we do not tune model parameters (the nonlinearities or the model filters) separately for each neural unit to be predicted. In fact, with the exception of the final linear weighting, we do not tune parameters using neural data at all. Instead, the parameters of our model were independently selected to optimize functional performance at the top level, and these choices create fixed bases from which any individual IT or V4 unit can be composed. This yields a generative model that allows the sampling of an arbitrary number of neurally consistent units. As a result, the size of the model does not scale with the number of neural sites to be predicted, and because the prediction results were assessed for a random sample of IT and V4 units, they are likely to generalize with similar levels of predictivity to any new sites that are measured.

What Features Do Good Models Share? Although the highest-performing models had certain commonalities (e.g., more hierarchical layers), many poor models also exhibited these features, and no one architectural parameter dominated performance variability (Fig. S3). To gain further insight, we performed an exploratory analysis of the parameters of the learned HMO model, evaluating each parameter both for how sensitively it was tuned and how diverse it was between model mixture components. Two classes of model parameters were especially sensitive and diverse (SI Text and Figs. S10 and S11): (i) filter statistics, including filter mean and spread, and (ii) the exponent trading off between max-pooling and average-pooling (16). This observation hints at a computationally rigorous explanation for experimentally
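The layer-assignment analysis behind Figs. 3 and 5 (each layer's features regressed onto each site, scored by held-out explained variance, best layer reported per area) can be sketched as below; all features and sites here are synthetic stand-ins, and explained_variance is an invented helper:

# Toy sketch of the layer-assignment analysis: score each model layer's features
# against V4 and IT sites with the same cross-validated regression, and report
# which layer best matches each area.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def explained_variance(features, responses):
    X_tr, X_te, y_tr, y_te = train_test_split(features, responses,
                                              test_size=0.25, random_state=0)
    pred = RidgeCV().fit(X_tr, y_tr).predict(X_te)
    return 100.0 * (1.0 - np.var(y_te - pred) / np.var(y_te))

n_images = 600
layers = {f"layer{i}": rng.standard_normal((n_images, 100)) for i in range(1, 5)}
# Synthetic sites constructed so V4 prefers an intermediate layer, IT the top layer.
v4_site = layers["layer3"] @ rng.standard_normal(100) + 4.0 * rng.standard_normal(n_images)
it_site = layers["layer4"] @ rng.standard_normal(100) + 4.0 * rng.standard_normal(n_images)

for area, site in [("V4", v4_site), ("IT", it_site)]:
    scores = {name: explained_variance(f, site) for name, f in layers.items()}
    print(area, {k: round(v, 1) for k, v in scores.items()},
          "best:", max(scores, key=scores.get))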

[Fig. 4 plot labels: RDMs for V4 neuronal units, IT neuronal units, and the HMO model; row/column blocks Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, Tables (8 objects each; 4 each in the object-generalization panels); panels for image, object, and category generalization; bar plot: population similarity to IT, axis 0.0–0.9, for Pixels, V1-like, SIFT, HMAX, V2-like, HMO, V4 sites, and IT sites split-half.]

Fig. 4. Population-level similarity. (A) Object-level representation dissimilarity matrices (RDMs) visualized via rank-normalized color plots (blue = 0th distance percentile, red = 100th percentile). (B) IT population and the HMO-based IT model population, for image, object, and category generalizations (SI Text). (C) Quantification of model population representation similarity to IT. Bar height indicates the Spearman correlation value of a given model's RDM to the RDM for the IT neural population. The IT bar represents the Spearman-Brown corrected consistency of the IT RDM for split-halves over the IT units, establishing a noise-limited upper bound. Error bars are taken over cross-validated regression splits in the case of models and over image and unit splits in the case of neural data.

[Fig. 5 plot labels: example site V4 Site 60; axes: Response Magnitude, Single Site Explained Variance (%) 0–100, Binned Site Counts (n = 128), V4 Explained Variance (%) 0–50. Bars with median explained variance: HMO Layer 1 (8%), HMO Layer 2 (48%), HMO Layer 3 (52%), HMO Top Layer (33%); controls: Category Ideal Observer (8%), All Variables, Pixels, SIFT, V1-Like Model (11%), PLOS09, HMAX Model (24%), V2-Like Model (34%).]

Fig. 5. V4 neural predictions. (A) Actual vs. predicted response magnitudes for a typical V4 site. V4 sites are highly visually driven, but unlike IT sites show very little categorical preference, manifesting in more abrupt changes in the image-by-image plots shown here. Red highlight indicates the best-matching model (viz., HMO layer 3). (B) Distributions of explained variance percentage for each model, over the population of all measured V4 sites (n = 128). (C) Comparison of V4 neural explained variance percentage for various models. Conventions follow those used in Fig. 3.


Page 10: 20141003.journal club

Comparison with V4 Neural Activity (continued)

•  The HMO's intermediate layers have high explanatory power
   – It is starting to look fairly similar after all

than that of IT (12). Comparing a performance-optimized modelto these data would provide a strong test both of its ability topredict the internal structure of the ventral stream, as well as togo beyond the direct consequences of category selectivity. Wethus measured the HMO model’s neural predictivity for the V4neural population (Fig. 5). We found that the HMO model’spenultimate layer is highly predictive of V4 neural responses(51:7± 2:3% explained V4 variance), providing a significantlybetter match to V4 than either the model’s top or bottom layers.These results are strong evidence for the hypothesis that V4corresponds to an intermediate layer in a hierarchical modelwhose top layer is an effective model of IT. Of the controlmodels that we tested, the V2-like model predicts the most V4variation ð34:1± 2:4%Þ. Unlike the case of IT, semantic modelsexplain effectively no variance in V4, consistent with V4’s lack ofcategory selectivity. Together these results suggest that perfor-mance optimization not only drives top-level output model layersto resemble IT, but also imposes biologically consistent con-straints on the intermediate feature representations that cansupport downstream performance.

DiscussionHere, we demonstrate a principled method for achieving greatlyimproved predictive models of neural responses in higherventral cortex. Our approach operationalizes a hypothesis forhow two biological constraints together shaped visual cortex:(i) the functional constraint of recognition performance and(ii) the structural constraint imposed by the hierarchical net-work architecture.

Generative Basis for Higher Visual Cortical Areas. Our modelingapproach has common ground with existing work on neural re-sponse prediction (27), e.g., the HLN hypothesis. However, ina departure from that line of work, we do not tune modelparameters (the nonlinearities or the model filters) separatelyfor each neural unit to be predicted. In fact, with the exceptionof the final linear weighting, we do not tune parameters usingneural data at all. Instead, the parameters of our model wereindependently selected to optimize functional performance atthe top level, and these choices create fixed bases from which anyindividual IT or V4 unit can be composed. This yields a genera-tive model that allows the sampling of an arbitrary number of

neurally consistent units. As a result, the size of the model doesnot scale with the number of neural sites to be predicted—andbecause the prediction results were assessed for a randomsample of IT and V4 units, they are likely to generalize withsimilar levels of predictivity to any new sites that are measured.

What Features Do Good Models Share? Although the highest-per-forming models had certain commonalities (e.g., more hierar-chical layers), many poor models also exhibited these features,and no one architectural parameter dominated performancevariability (Fig. S3). To gain further insight, we performed anexploratory analysis of the parameters of the learned HMOmodel, evaluating each parameter both for how sensitively it wastuned and how diverse it was between model mixture components.Two classes of model parameters were especially sensitive anddiverse (SI Text and Figs. S10 and S11): (i) filter statistics, in-cluding filter mean and spread, and (ii) the exponent trading offbetween max-pooling and average-pooling (16). This observationhints at a computationally rigorous explanation for experimentally

[Fig. 4 graphic: (A) RDMs for IT neuronal units, V4 neuronal units, and the HMO model over eight categories (Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, Tables); (B) image, object, and category generalization; (C) bar chart of population similarity to IT (0.0–0.9) for Pixels, V1-like, SIFT, HMAX, V2-like, HMO, V4 sites, and IT sites split-half.]

Fig. 4. Population-level similarity. (A) Object-level representation dissimilarity matrices (RDMs) visualized via rank-normalized color plots (blue = 0th distance percentile, red = 100th percentile). (B) IT population and the HMO-based IT model population, for image, object, and category generalizations (SI Text). (C) Quantification of model population representation similarity to IT. Bar height indicates the Spearman correlation value of a given model's RDM to the RDM for the IT neural population. The IT bar represents the Spearman–Brown corrected consistency of the IT RDM for split-halves over the IT units, establishing a noise-limited upper bound. Error bars are taken over cross-validated regression splits in the case of models and over image and unit splits in the case of neural data.
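For concreteness, a minimal sketch of the kind of RDM comparison quantified in panel C, assuming correlation distance for the RDM and Spearman correlation between upper triangles; the response matrices below are synthetic stand-ins, not the paper's data.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(responses):
    """responses: (n_objects, n_units) -> (n_objects, n_objects) dissimilarity matrix."""
    return squareform(pdist(responses, metric="correlation"))

def rdm_similarity(rdm_a, rdm_b):
    iu = np.triu_indices_from(rdm_a, k=1)  # compare off-diagonal entries only
    return spearmanr(rdm_a[iu], rdm_b[iu]).correlation

rng = np.random.default_rng(0)
it = rng.random((64, 168))                 # e.g., 64 objects x 168 "IT sites"
model = it + 0.5 * rng.random((64, 168))   # a correlated "model" population
print(rdm_similarity(rdm(it), rdm(model)))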

[Fig. 5 graphic: (A) response magnitudes for V4 site 60, category vs. all-variables regressors; (B) binned site counts of single-site explained variance (0–100%, n = 128) for Pixels, V1-like, PLOS09, SIFT, HMAX, V2-like, and HMO layers L1–L3 and Top; (C) explained V4 variance (%): HMO Layer 1 (8%), Layer 2 (48%), Layer 3 (52%), Top Layer (33%); control models V1-like (11%), HMAX (24%), V2-like (34%); category ideal observer (8%).]

Fig. 5. V4 neural predictions. (A) Actual vs. predicted response magnitudes for a typical V4 site. V4 sites are highly visually driven, but unlike IT sites show very little categorical preference, manifesting in more abrupt changes in the image-by-image plots shown here. Red highlight indicates the best-matching model (viz., HMO layer 3). (B) Distributions of explained variance percentage for each model, over the population of all measured V4 sites (n = 128). (C) Comparison of V4 neural explained variance percentage for various models. Conventions follow those used in Fig. 3.


Page 11: 20141003.journal club

Population-level similarity (1)

• Take RDMs over Classes (8) × Objects (8) → blue: similar, red: dissimilar

• For IT and the HMO model, the representational similarity structures are themselves similar (a confusing thing to say)


Page 12: 20141003.journal club

Population-level similarity (2)

• We also compared IT and HMO representations averaged per object and per category; HMO still comes out best

• Viewed as RDMs, the two look fairly similar


[Slide graphic: IT RDM (left) and HMO RDM (right)]

Page 13: 20141003.journal club

Summary

• We built a CNN-based model, and it turns out to have fairly good explanatory power even at the level of individual neurons

• In particular, even though it was optimized only for the classification task, its explanatory power for IT and V4 neurons at the population level is high

• Compared with the traditional neuron-by-neuron research style (bottom-up approach), the model-based style (top-down approach) may give a clearer overall picture (so please don't dismiss it with "that's a worn-out model" or "nobody knows what the intermediate layers are doing")

Page 14: 20141003.journal club

Bonus: the HMO algorithm

[Fig. S2 graphic: (a) basic neural-like operations — filter (⊗), threshold & saturate, pool, normalize — parameterized as Θ = (θ_filter, θ_thr, θ_sat, θ_pool, θ_norm) and stacked convolutionally into L1/L2/L3 networks; (b) hierarchical stacking of heterogeneous mixtures of components Θ_k^(1), Θ_k^(2), Θ_k^(3); (c) AdaBoost-style error-based reweighting over a screening image set, combining intermediate- and output-level components by performance.]

Fig. S2. In this work, we use CNN models. CNNs consist of a series of hierarchical layers, with bottom layers accepting inputs directly from image pixels, and with units from the top and intermediate layers used to support training linear classifiers for performance evaluation and linear regressors for predicting neural tuning curves. (A) Following a line of existing work, we limited the constituent operations in each layer of the hierarchy to linear-nonlinear (LN) compositions including (i) a filtering operation, implementing AND-like template matching; (ii) a simple nonlinearity, e.g., a threshold; (iii) a local pooling/aggregation operation, such as softmax; and (iv) a local competitive normalization. These layers are combined to produce low complexity (L1), intermediate complexity (L2), and high complexity (L3) networks. All operations are repeated convolutionally at each spatial position, corresponding to the general retinotopic organization in the ventral stream. (B) In creating the HMO model, we allow a mixture of several of these elements to model heterogeneous neural populations, each acting convolutionally on the input image. The networks are structured in a manner consistent with known features of the ventral stream, as a series of areas of roughly equal complexity, but which permit bypass projections. (C) HMO is a procedure for searching the space of CNN mixtures to maximize object recognition performance. With several rounds of optimization, HMO creates mixtures of component modules that specialize in subtasks, without needing to prespecify what these subtasks should be. Errors from earlier rounds of optimization are analyzed and used to reweight subsequent optimization toward unsolved portions of the problem. The complementary component modules that emerge via this process are then combined and used as input to repeat the procedure hierarchically (Materials and Methods and SI Text).
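A very schematic sketch of the selection loop in panel C, under stated assumptions: candidates stands in for hyperparameter-sampled CNN components, evaluate() stands in for training a linear classifier on a component's output features, and only the AdaBoost-style error-based reweighting is taken from the caption.

import numpy as np

def hmo_round(candidates, evaluate, labels, n_components=3):
    """evaluate(candidate) -> per-image 0/1 correctness array on the screening set."""
    n = len(labels)
    weights = np.full(n, 1.0 / n)  # uniform image weights to start
    chosen = []
    for _ in range(n_components):
        # pick the component with the lowest weighted classification error
        errors = [(weights * (1 - evaluate(c))).sum() for c in candidates]
        best = int(np.argmin(errors))
        chosen.append(candidates[best])
        # downweight images the chosen component already solves,
        # steering the next pick toward the unsolved portion of the task
        correct = evaluate(candidates[best])
        eps = max(errors[best], 1e-12)
        beta = eps / (1 - eps + 1e-12)
        weights *= np.where(correct == 1, beta, 1.0)
        weights /= weights.sum()
    return chosen  # their outputs are combined and fed to the next hierarchical round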

[Fig. S3 graphic: categorization performance (random and performance-optimized runs) and explained IT variance (predictivity-optimized run) plotted against optimization timestep.]

Fig. S3. Optimization time traces for the high-throughput experiments shown in Fig. 1. In the performance-optimized and predictivity-optimized cases, the y axis shows the optimization criterion; in the random-selection case (Left), no optimization was done, and the performance data were ignored.


Page 15: 20141003.journal club

Bonus 2: correlation coefficients of the HMO parameters

[Fig. S4 graphic: horizontal bar chart of the Spearman correlation coefficient (roughly −0.4 to 0.8) between IT predictivity and each model parameter — performance, number of layers, per-layer pooling order, min/max activation, filter mean/norm/kernel size, normalization threshold/stretch/kernel size, number of filters, top-layer receptive field size, etc. — for Random Selection, Performance Optimized, and IT-Predictivity Optimized conditions.]

Fig. S4. Correlation of model parameters with IT predictivity for the three high-throughput experiments shown in Fig. 1. Parameters for which the correlation is significantly different from 0 are shown. Also included are several additional metrics that are not direct model parameters but that represent measurable quantities of interest for each model, e.g., model object recognition performance. The x axis is Spearman r correlation of the given parameter with IT predictivity for the indicated model selection procedure, including random (Left, green bars), performance-optimized (Center, blue bars), and IT-predictivity-optimized (Right, red bars). Parameters are ordered by correlation value for the random condition. Performance strongly correlates with IT predictivity in all selection regimes. Number of layers (model depth) consistently correlates as well, but much more weakly. Interestingly, one obvious metric—receptive field size at the top model layer—is only very weakly associated with predictivity, because, although the best models tended to have larger receptive field sizes, a large number of poor models also shared this characteristic.
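The analysis itself is a per-parameter rank correlation; a small sketch with hypothetical data, assuming one row per sampled model:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
models = {                                    # 1,000 hypothetical sampled models
    "num_layers":    rng.integers(1, 4, 1000),
    "pooling_order": rng.uniform(1, 10, 1000),
    "filter_mean":   rng.normal(0, 1, 1000),
}
it_predictivity = rng.uniform(0, 0.5, 1000)   # stand-in for measured predictivity

for name, values in models.items():
    r, p = spearmanr(values, it_predictivity)  # rank correlation per parameter
    print(f"{name:15s} r = {r:+.3f} (p = {p:.3g})")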
