PROPERTY OF THE MIT PRESS: FOR PROOFREADING, INDEXING, AND PROMOTIONAL PURPOSES ONLY

8522_017.indd 513 6/10/2016 7:47:59 PM

17 Object and Scene Perception

Owen Lewis and Tomaso Poggio

17.1 Introduction

The collection of questions humans can answer about the visual world is vast and diverse. Some can be answered at a glance: "Is this a picture of a car?" "Was this picture taken outdoors?" Others require more cognitive computation, more viewing time, and possibly links to nonvisual cognitive systems: "Is everyone in this picture looking at the camera?" "Are the people in this image friends?" "What will happen next in this scene?"

These two classes of questions are hypothesized to be handled by two different processing modes in the brain. As we will see, the brain's visual areas are organized into a hierarchy or chain, with each area passing information to its immediate neighbors. Fast visual processing is thought to occur in a single feedforward pass through this chain, with information moving exclusively from lower areas to higher ones. More extended processing, by contrast, brings feedback connections into play, allowing information to circulate among the visual areas in loops. After reviewing the anatomy of visual cortex in section 17.2, the organization of this chapter follows the feedforward/feedback distinction, with sections 17.3 and 17.4 focusing on feedforward processing, and section 17.5 on feedback.

The particular focus of our feedforward discussion is classification problems. In object classification, a model assigns to an image the category label of its principal object, for example, "dog" or "car." In scene classification, the label identifies the picture's larger-scale gist: Is it an indoor or outdoor scene, a natural scene or an urban one? Classification is an important building block for complex visual tasks and is a rich and difficult problem in its own right: only very recently have models emerged that can begin to account for human recognition performance, and even today, important open problems remain.

Classification has also been a major focus of recent computer vision research, and the interaction between this field and computational neuroscience has been fruitful for both. Indeed, the currently dominant class of computer vision models, convolutional neural networks (CNNs), is directly inspired by the architecture of visual cortex. Partly due to this computer vision connection, many neural models of feedforward visual processing can handle difficult classification problems in large image data sets, an important engineering step that at the same time has the scientific advantage of facilitating direct comparison with humans' recognition performance.

The feedback models we present aim to go beyond classification, allowing all visual areas to interact recurrently, blending the different sorts of information extracted by each to form the kind of complete scene interpretation necessary to support a full range of visual tasks. This full scene interpretation is a more daunting problem than classification, and our understanding of how it is solved in the brain is not yet as fully developed. While full scene interpretation in natural images remains beyond the grasp of the neural models we present, these models can tackle important subproblems, and they account for a range of physiological and imaging experimental results.

17.2 Visual Cortex

Humans' and animals' ability to process objects and scenes is supported by a large swathe of specialized neural machinery located in visual cortex, near the back of the brain (see figure 12.11 in chapter 12 of this volume). In this section, we lay out the functional and anatomical properties of visual cortex, giving a set of desiderata and structural constraints that will shape the models we develop in the rest of the chapter. This section's core message is that the visual cortex forms a hierarchy of interconnected areas, and that useful representations are built up sequentially, with low-level areas contributing the building blocks to the more high-level and semantically meaningful representations found further upstream.

Visual cortex is traditionally divided into two processing pathways: the dorsal, or "where," pathway, thought to represent the location and motion of objects and to support interaction with visual input via eye and body movements, and the ventral, or "what," stream, thought to support our core tasks of object and scene recognition. Our focus will be on the ventral stream, which encompasses four areas: V1, V2, V4, and inferotemporal cortex (IT).

Within each area, individual cells are responsive to stimuli presented in a small region of space, called the cell's receptive field. V1, V2, V4, and parts of IT show retinotopic organization, meaning that cells whose receptive fields are close to each other are themselves physically close together in the brain.

Across areas, the key organizational hypothesis is that the ventral stream forms a hierarchy; although all four areas are richly interconnected (Kravitz et al., 2013), they are commonly pictured as comprising a chain,

    V1 → V2 → V4 → IT,

in which each area interacts primarily with its immediate neighbors. V1 takes input from the lateral geniculate nucleus (LGN), the structure that processes information immediately after it leaves the retina, and IT passes its output to nonvisual areas in prefrontal cortex. Within the chain, projections from earlier to later areas are called feedforward, and those from later to earlier areas are called feedback. For example, V2 sends feedforward projections to V4 and feedback projections to V1.

Justification of the hierarchical picture comes partly from anatomical studies tracing the projections that connect areas, and partly from functional considerations. For instance, the activation induced by a stimulus spreads sequentially up the hierarchy, emerging in V1 34–97 ms after stimulus onset, in V2 after an average of 82 ms, and in V4 after 104 ± 23.4 ms in anaesthetized macaques (Schmolesky et al., 1998). A number of other quantities vary systematically along the ventral stream: receptive field sizes get larger from V1 to IT, and the number of neurons in successive areas becomes smaller. Most importantly, though, the hierarchical structure of the ventral stream is reflected in the representations it produces. As information passes through the ventral stream, each area computes a new representation, encoded in the activities of that area's cells. The higher the area, the better able its representation is to support tasks like classification.

A number of experimental techniques are available to study the brain's activity in general, and the representations computed along the ventral stream in particular. At the single-cell level, the tool of choice, electrophysiology, involves implanting electrodes in the brain of an animal (often a macaque monkey) and recording the electrical activity induced by presentation of different stimuli. Electrophysiological recordings have unparalleled temporal and spatial precision, but they are difficult and time-consuming to perform, meaning that most studies can examine at most a few hundred cells. Functional magnetic resonance imaging (fMRI) is a noninvasive neuroimaging technique that monitors the oxygen consumed by blocks of tissue a few millimeters on a side; oxygen intake is a correlate of neural activity. Compared to electrophysiology, fMRI sacrifices spatial and temporal resolution (fMRI scanners obtain a series of images every 2 seconds) but is quicker to perform and safe to use on awake human subjects as they perform behavioral tasks. Finally, magnetoencephalography (MEG) and electroencephalography (EEG) are other imaging techniques, also suitable for human subjects, that give millisecond temporal resolution. However, these methods rely on electrical or magnetic fields recorded at the scalp, making their spatial resolution poor.

Once one of these methods has been used to capture neural responses within an area, it remains to characterize the information that this area contains. For instance, we are often interested in determining whether an area's representation is sufficient to classify a stimulus based on its object or scene category. One relatively new analysis technique, neural decoding, underlies many of the studies reported in this chapter. The idea is that if a given representation contains category information, it should be possible to read this information out directly using a classifier. Concretely, suppose an experiment showed images I1, …, In, from categories c1, …, cn, and that image Ii elicited neural response Ri. The classification problem is then to learn a mapping directly from neural response to category label. As usual in machine learning, the set of pairs (Ri, ci) is divided into a training set used to train a classifier and a test set used to validate its performance. If good classification is possible, one may conclude that the representations Ri contain category information. Generally, decoding studies use linear classifiers, extracting category information in a way similar to how downstream neural populations may do so.

Several notes are in order. First, the fact that it is possible for a classifier to extract category information from an area's activity pattern does not mean that the brain itself extracts this information. Relatedly, the presence of category information does not rule out the possibility of other sorts of information being housed in a given region as well. Last, a negative decoding result, a failure to read out category from activity, does not necessarily imply an absence of information: using finer-grained information or different algorithms, the brain may access information our classifiers cannot.
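The decoding procedure described above can be sketched end to end. Everything below is illustrative: the "neural responses" are synthetic, and the nearest-centroid readout (which is a linear classifier) stands in for the regularized linear readouts decoding studies typically use.

```python
import random

def train_decoder(train_pairs):
    """Fit a minimal linear decoder: one centroid per category.
    Nearest-centroid classification is linear, standing in for the
    regularized linear readouts decoding studies typically use."""
    sums, counts = {}, {}
    for response, label in train_pairs:
        acc = sums.setdefault(label, [0.0] * len(response))
        for k, v in enumerate(response):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def decode(centroids, response):
    """Read out the category whose centroid is closest to the response."""
    return min(centroids, key=lambda lab: sum(
        (r - c) ** 2 for r, c in zip(response, centroids[lab])))

# Synthetic "neural responses" R_i: category A drives units 0-1,
# category B drives units 2-3, plus Gaussian noise.
random.seed(0)
def fake_response(category):
    base = [1.0, 1.0, 0.0, 0.0] if category == "A" else [0.0, 0.0, 1.0, 1.0]
    return [b + random.gauss(0.0, 0.2) for b in base]

pairs = [(fake_response(c), c) for c in ("A", "B") * 20]  # the pairs (R_i, c_i)
train, test = pairs[:30], pairs[30:]                      # train/test split
centroids = train_decoder(train)
accuracy = sum(decode(centroids, r) == c for r, c in test) / len(test)
print(accuracy)  # high accuracy: the representation carries category information
```

High held-out accuracy here licenses only the weak conclusion discussed above: the representation contains linearly readable category information, not that the brain reads it out this way.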
17.2.1 Representation by Area  This collection of experimental and analytical techniques allows us to explore how information evolves as it passes along the ventral stream. Our particular focus in this section will be on the ability of each area's representation to support classification, though one may hypothesize that classification-supporting representations also tend to carry semantic information important for fuller scene interpretation.

We begin with V1. The properties of the V1 representation are described in detail in chapter 12, but to briefly summarize, V1 cells are often modeled as Gabor filters, selective for bar-like stimuli at particular positions, orientations, and scales. The V1 representation is already more abstract than one using pixels, supporting edge detection and object segmentation, for instance, but it lacks the high-level semantic content and robustness needed for effective classification.

At the opposite end of the ventral stream, the IT representation contains a remarkable amount of category information. fMRI studies, for instance, have found subareas of IT specifically responsive to certain object classes, such as faces, places, and bodies. In a decoding study, Hung et al. (2005) classified objects into one of eight categories using a representation consisting of electrophysiological recordings from approximately 250 IT neurons, achieving a high mean accuracy of 94% (with chance being 1/8 = 12.5%).

In a combined MEG and fMRI study, Cichy et al. (2014) explored the intermediate processing between V1 and IT. Taking advantage of MEG's millisecond temporal resolution, the authors were able to pose a large ensemble of decoding problems, one for each time point after stimulus onset. By examining the errors made in each of these problems, they conclude that representations at later times are better able to make fine classification judgments. For instance, decoding performance for the superordinate judgment of whether or not a stimulus was animate was highest at 157 ms, while the peak accuracy for the finer judgment of whether or not an animate stimulus contained a body occurred later, at 170 ms.

Exactly how these population-level decoding results arise from the response properties of individual cells in the various brain areas is not completely clear. A simplified picture has neurons in each successive area becoming responsive to more and more complex features, formed as combinations of the features represented at the layer below. V2, for instance, contains cells responsive to angles and junctions formed by combinations of the bar-like features preferred by V1 cells (Ito and Komatsu, 2004). Farther up, a number of studies have argued that V4 represents still more complex curvature features (Carlson et al., 2011). IT contains cells responsive specifically to certain high-level object classes, for instance faces. At each stage, though, most neurons are responsive to a variety of stimuli, arguing against a one-to-one correspondence between cells and high-level features. Overall, response properties of single cells are unlikely to tell the whole story: the full discriminative power shown by the higher-level ventral stream areas emerges from the conjunctive activity of the area's whole population.

17.2.2 Invariance  To begin to get a more fine-grained picture of how neural representations come to support classification, we turn our attention to invariance to nuisance transformations such as translation and scaling. These transformations can radically change the pixel content of an image while preserving its class label, giving rise to difficult situations in which two images with very different pixel-level content should be assigned to the same category.

To isolate the effect of invariance on classification performance, Anselmi et al. (2013) looked at several classification problems in which all images were rectified to a canonical position, scale, and orientation before classification. In this regime, even the simplest pixel representations were able to achieve good classification performance. This suggests that, at least for the classes studied, much of the performance difference between lower and higher ventral stream representations may be due to the latter's invariance properties.

A particular benefit of invariant representations is that they reduce the sample complexity of the learning process, the number of labeled images a learner has to see before being able to categorize new examples. Intuitively, if a learner requires a separate example for each pose in which an object could appear, the total number of examples needed becomes extremely large. Anselmi et al. (2013) quantify this intuition in the case of translation. Consider a collection of objects of p × p pixels, which may appear anywhere in an image of size rp × rp. Given an invariant representation, it requires O(p²) examples to learn a linear classifier, as compared to O(p²r²) in the noninvariant case.

Like category selectivity, invariance builds along the ventral stream. Decoding is again an important tool. In using decoding to test for invariance to position, say, the classifier is trained on representations elicited by images shown at positions P1, …, P(n−1) and tested on images shown at position Pn. Thus, the classifier is never shown examples of the position in which it is tested; good classification performance can only be attributed to information being preserved under positional shifts.
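The held-out-position test just described can be simulated directly. The two population codes below are toy constructions (an invariant code writes category into fixed units; a noninvariant code writes it into position-dependent units), not models of real neural data:

```python
import random

def nearest_centroid(train, probe):
    """Tiny linear readout: classify probe by the closest class mean."""
    sums, counts = {}, {}
    for vec, lab in train:
        s = sums.setdefault(lab, [0.0] * len(vec))
        for k, v in enumerate(vec):
            s[k] += v
        counts[lab] = counts.get(lab, 0) + 1
    def dist(lab):
        centroid = [v / counts[lab] for v in sums[lab]]
        return sum((p - c) ** 2 for p, c in zip(probe, centroid))
    return min(sums, key=dist)

random.seed(1)
POSITIONS = [0, 4, 8, 12]  # hypothetical stimulus positions P1..Pn

def response(category, position, invariant):
    # 16-unit synthetic population. The invariant code writes category to
    # fixed units; the noninvariant code writes it to units that depend on
    # where the stimulus appeared.
    r = [random.gauss(0.0, 0.1) for _ in range(16)]
    base = 0 if invariant else position
    r[base + (0 if category == "A" else 1)] += 1.0
    return r

accs = {}
for invariant in (True, False):
    # Train on positions P1..P(n-1); test only on the held-out position Pn.
    train = [(response(c, p, invariant), c)
             for p in POSITIONS[:-1] for c in "AB" for _ in range(10)]
    test = [(response(c, POSITIONS[-1], invariant), c)
            for c in "AB" for _ in range(20)]
    accs[invariant] = sum(nearest_centroid(train, r) == c
                          for r, c in test) / len(test)
print(accs)  # the invariant code transfers to the unseen position; the other does not
```

Only the invariant code supports decoding at the untrained position, which is exactly the logic of the experimental design: transfer to an unseen position can only come from information preserved under positional shifts.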
Using physiological data from IT, Hung et al. (2005) find decoding performance decreased by less than 10% in the presence of changes in translation and scale. In an MEG study, Isik et al. (2014) show that invariance emerges in stages over the course of visual processing, with invariance to size arriving first, followed by invariance to position. In addition, invariance to smaller transformations precedes invariance to larger transformations.

Invariance also increases at the single-cell level, with responses in V1 being invariant to at most small transformations while some IT cells show invariance to fairly large degrees of rotation, position, and scale change (Ito et al., 1995; Hung et al., 2005). Additionally, Rust and DiCarlo (2010) show an increase in invariance from V4 to IT.

17.2.3 Summing Up: Lessons for Models  The experimental results summarized above pose a clear set of desiderata for the computational models we will study in the remainder of this chapter. At a minimum, the models should mimic cortex's hierarchical structure, arriving at the top level with a representation sufficient for classification. Experimental findings about invariance give an important clue about how these representations can be built. Sections 17.3 and 17.4 encode these insights in models of the visual system's initial feedforward processing stage, arriving at algorithms that mimic humans' ability to classify objects in scenes.

17.3 Feedforward Models

The past five years have seen dramatic progress in feedforward models of visual classification, yielding algorithms that can compete with humans on challenging tasks. Importantly, the class of models responsible for this success, convolutional neural networks (CNNs), is inspired by the architecture of visual cortex: even in no-holds-barred computer vision, where performance, rather than biological fidelity, is the measure of success, neuroscientific insights have played an important developmental role.

Progress in CNNs, and the history of their rise to prominence, is encapsulated in the results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual contest in which models compete on a data set notable for its size, containing 1,000 image categories, each with between a few hundred and more than a thousand example images. In ILSVRC, models are judged on their ability to assign to an image a correct object-category label; example categories include "Afghan hound," "electric fan," and "yurt." The competition uses natural images harvested from the Internet and not cropped to isolate the target item: to correct for the fact that an image may contain multiple object classes, models are allowed to produce five candidate labels and are considered to achieve a correct classification if any one of these five matches the target. All statistics reported here are from the survey paper by Russakovsky et al. (2014).

In 2010, the first year the contest was held, the winning model attained 28.2% error, an impressive result given the difficulty of the task, but significantly short of the 5.1% error achieved by a human labeler. The following year saw a modest performance improvement, with the winning model's classification error dropping to 25.8%. Both this model and the 2010 winner are representative of the computer vision techniques popular at the time, representing an image in terms of the orientation statistics of its spatial gradients.

In the following year, 2012, CNNs entered the scene, cutting error rates dramatically, with Krizhevsky et al.'s (2012) AlexNet attaining 16.4% error. In subsequent years, CNNs have continued to dominate the competition, becoming by far the most popular model class and achieving steadily decreasing error rates: 11.7% in 2013 and 6.4% in 2014. He et al. (2015) claimed 4.94% on the 2012 data set, beating reported human performance.

The success of CNNs is not limited to ILSVRC; it is now safe to say that they are the tool of choice for the field of object and scene recognition as a whole. They have also made an impact in related fields, for instance, being used to compute visual features for game playing (Mnih et al., 2015), and have attracted considerable investment from industry. As described below, CNNs also account well for humans' recognition behavior and provide unprecedented matches with electrophysiological data.

17.3.1 Convolutional Neural Networks  CNNs have a long history prior to their recent climb to computer vision preeminence. The first CNN was Fukushima's 1980 model, the Neocognitron (Fukushima, 1980, 1988). In the 1990s, Yann LeCun and collaborators developed the LeNet family of CNNs (LeCun et al., 1998) and used them successfully for tasks like handwritten digit recognition. In 1999, Riesenhuber and Poggio proposed HMAX, a CNN specifically designed to model mammalian visual cortex (Riesenhuber and Poggio, 1999). HMAX continued to be developed in the 2000s (Serre et al., 2007b). While the different CNNs developed over the years are distinguished by various design choices, most share a core architecture: this will be the focus of this section. Section 17.3.3 presents HMAX, whose more explicitly biological orientation motivates a few important points of difference.

A series of classic experiments done by David Hubel and Torsten Wiesel in the 1960s provides the principal neuroscientific inspiration for the CNN architecture. Hubel and Wiesel recorded from V1 cells in anesthetized cats as the animals viewed patterns of light displayed on a screen. They found two main types of cells: simple cells responded to bar-like patterns at a particular orientation and position on the screen. Complex cells also responded to bars and also had a preferred orientation, but had a small degree of position invariance: their responses were preserved under small translations of the stimulus.

Mimicking the hierarchical organization of visual cortex, a CNN is composed of a succession of layers. Extrapolating Hubel and Wiesel's findings, each unit in the network is of either simple or complex type: simple cells build selectivity for complicated features whereas complex cells build translation invariance. Classically, the network is composed of a series of layers, each with a simple and complex sublayer; complex cells take input from their own layer's simple cells, which, in turn, take input from the complex cells in the layer before. More recently (Krizhevsky et al., 2012; Szegedy et al., 2014), many models modify this schema by postfixing or interspersing some number of layers that contain only simple cells; we stick to the classical structure here.

Just as category information is exposed as an image passes through the areas of the ventral stream, a CNN's layers enact a similar series of transformations, culminating in a top-layer representation suitable for input to a classifier. This architecture is shown in figure 17.1. We now explore the complex and simple cell types in turn.

Figure 17.1  The convolutional neural network (CNN) architecture. Inspired by Hubel and Wiesel's findings in V1, a CNN is composed of a series of layers, each with a simple (S) and complex (C) sublayer. Simple cells build selectivity for complicated features, and complex cells build translation invariance. As it passes through the network, an image undergoes a series of transformations, the end result of which is a representation that can be fed into a classifier. This applies more or less directly to HMAX and LeNet; more recent models include simple-only layers.

Figure 17.2  A segment from the early layers of a convolutional neural network. Each complex (C) and simple (S) sublayer consists of a collection of feature maps, shown in the figure as sheets, consisting of cells responsive to the same feature at different positions. Complex cells build position invariance by pooling over inputs from a single feature map in the previous sublayer. Simple cells build selectivity for complicated features by combining inputs from multiple feature maps. As shown in the figure, receptive field sizes increase farther up the network.
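The ILSVRC top-five scoring rule described in this section can be stated in a few lines of Python; the candidate labels and the three-image "data set" below are invented for illustration.

```python
def top5_correct(candidate_labels, target):
    """ILSVRC scoring: a prediction counts as correct if any of the
    model's five candidate labels matches the ground-truth label."""
    assert len(candidate_labels) == 5
    return target in candidate_labels

# Hypothetical predictions (five candidates each) for three images.
predictions = [
    (["Afghan hound", "beagle", "basenji", "whippet", "saluki"], "Afghan hound"),
    (["electric fan", "space heater", "radiator", "projector", "spotlight"], "yurt"),
    (["yurt", "tent", "dome", "umbrella", "parachute"], "yurt"),
]
top5_error = 1 - sum(top5_correct(p, t) for p, t in predictions) / len(predictions)
print(top5_error)  # 2 of 3 correct -> top-5 error of 1/3
```

This five-guess criterion is what makes the headline error rates quoted above comparable across years: it forgives a model for ranking a correct but non-principal object second through fifth.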
Consider first building a V1 complex cell like the ones Hubel and Wiesel observed; we will denote such a cell by C_p^1(θ_i), where the superscript "1" tells us that the cell lives in the first layer, and the subscript p indexes the cell's position within the layer. Because of the network's retinotopic organization, p also picks out a position in the input image where the cell's receptive field is centered. The argument θ_i is the orientation for which the cell is selective. To denote the activation of a cell, rather than its label, we use boldface: C_p^1(θ_i).

To make our task explicit, we need to wire up C_p^1(θ_i) in such a way that it activates to a bar at orientation θ_i appearing anywhere in its receptive field, which we will denote R(p). Hubel and Wiesel proposed a simple way in which this could be done. They hypothesized that C_p^1(θ_i) receives input from a collection of simple cells, all also with orientation selectivity for θ_i, whose smaller receptive fields cover R(p). Thus, if any one of these afferent cells becomes active, it indicates the presence of a θ_i-oriented bar somewhere in R(p). All C_p^1(θ_i) has to do, then, is perform an OR-like operation over its input signals, becoming active if any one of them is active. In CNNs, the role of the OR operation is often played by the max function, yielding the so-called max-pooling activation function for V1 complex cells: C_p^1(θ_i) = max_{q ∈ R(p)} S_q^1(θ_i).

Complex cells in higher layers of the network work similarly. The complex cells in an arbitrary layer l perform max pooling over a localized collection of layer l's simple cells, all tuned to the same feature φ_i. In higher layers, these features are more complicated than the orientations we have considered so far:

    C_p^l(φ_i) = max_{q ∈ R(p)} S_q^l(φ_i).   (17.1)

This expression requires that, for each position p, there does in fact exist a family of identically tuned simple cells at a range of positions in R(p). This motivates a key architectural assumption of CNNs: simple cells are organized into "sheets," called feature maps, which tile the input plane with cells with the same tuning. In figure 17.2, feature maps are shown as stacked arrays. A single layer is composed of multiple feature maps. As also shown in figure 17.2, the fact that complex cells pool over multiple simple cells means that the complex cells' receptive fields are larger.

We now consider simple cells, again using an example of an early network layer, this time the second layer. Layer-2 simple cells are responsive to patterns such as corners and T-junctions, which can be constructed as combinations of the oriented bars selected for by their layer-1 complex cell afferents. Simple cells are similar to complex cells in that their afferents are spatially localized but differ in the fact that they take input from all of the previous layer's feature maps rather than just one.

In sum, a layer-2 simple cell's inputs come from complex cells detecting line segments at a full range of orientations and a small range of positions. To become selective to a feature like a corner, the cell has to activate strongly only if a certain combination of these segments, for instance a 90° one and a 0° one, is present in the image. Whereas complex cells played an OR-like role, simple cells have to behave more like an AND over their inputs. To allow only particular inputs to activate our simple cell, we introduce connection weights modulating the influence these inputs have. Inputs corresponding to features included in our target pattern get high weights; other features get low ones. In real networks, as we will see, simple cell weights and the features they define are learned automatically rather than being set by hand.

Considering a simple cell S_p^l(φ_i) in an arbitrary layer l, suppose that the previous layer has n feature maps, corresponding to features φ_1, …, φ_n. The afferents of S_p^l(φ_i) are then a subset of layer l − 1's complex cells, localized in space to R(p), but spanning the full range of feature maps. Denote this set

    A(S_p^l(φ_i)) = { C_q^{l−1}(φ_j) | q ∈ R(p), j = 1, …, n }.

One of these afferent cells, C_q^{l−1}(φ_j), is connected to S_p^l(φ_i) via the weight W_{(q,φ_j),(p,φ_i)}. With this (somewhat cumbersome) notation in place, we can present the simple cell activation function, the counterpart of equation 17.1:

    S_p^l(φ_i) = η( Σ_{C_q^{l−1}(φ_j) ∈ A(S_p^l(φ_i))} W_{(q,φ_j),(p,φ_i)} C_q^{l−1}(φ_j) + b_{φ_i} ).   (17.2)

Here, η is a nonlinear function, such as a sigmoid or hyperbolic tangent, and b_{φ_i} is a bias term. Overall, then, a simple cell sums the activations of its afferents, weighting each activation by the corresponding connection weight, adds a bias term, and applies a nonlinearity. As with complex cells, a simple cell's receptive field spans the receptive fields of its afferents and is therefore larger.

To take its place in the larger network architecture, S_p^l(φ_i) must itself be part of a feature map. In other words, there must exist simple cells S_q^l(φ_i) for all other positions q. Since φ_i-specificity is defined by a pattern of weights, a feature map is nothing other than an array of simple cells with the same pattern of input weights, placed at different positions in the layer; the weights are often said to be tied. Thus, computing the activations of all the cells in a feature map is mathematically equivalent to a convolution.
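Equations 17.1 and 17.2 can be written out in a few lines of Python. The feature map, pooling window, weights, and bias below are toy values set by hand, standing in for the learned quantities in a real network:

```python
import math

def complex_cell(simple_map, rf):
    """Equation 17.1: a complex cell max-pools the responses of
    identically tuned simple cells at the positions rf = R(p)."""
    return max(simple_map[q] for q in rf)

def simple_cell(afferents, weights, bias):
    """Equation 17.2: a simple cell takes a weighted sum of its afferent
    complex-cell activations, adds a bias, and applies a nonlinearity
    (here the hyperbolic tangent)."""
    total = sum(w * c for w, c in zip(weights, afferents))
    return math.tanh(total + bias)

# Toy layer 1: one feature map of simple-cell responses to a 0-degree bar.
s1_map = [0.0, 0.1, 0.9, 0.2, 0.0]        # bar detected near position 2
c1 = complex_cell(s1_map, rf=[1, 2, 3])   # OR-like pooling over positions 1-3
print(c1)                                 # 0.9: the max over the pool

# Toy layer 2: an AND-like "corner" cell that needs both a 0-degree and a
# 90-degree segment; its two afferents come from different feature maps.
corner = simple_cell(afferents=[0.9, 0.8],  # C cells for 0 and 90 degrees
                     weights=[2.0, 2.0], bias=-2.5)
print(round(corner, 3))                     # strongly positive only if both present
```

With only one afferent active (for instance `afferents=[0.9, 0.0]`), the biased sum goes negative and the cell stays silent, which is the AND-like behavior described above; max pooling, by contrast, fires if any single input does.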
we define an error function \(E(w)\) measuring the difference between \(f(X_i; w)\) and \(\delta_{y_i}\). A simple L2 distance illustrates the point, though more complex measures such as cross-entropy are generally used in practice:

\[ E(w) = \big\| f(X_i; w) - \delta_{y_i} \big\|_2^2. \qquad (17.3) \]
Given a choice of error function, optimization proceeds by gradient descent. A given weight \(w_j\) is updated by

\[ w_j^{t+1} = w_j^t - \alpha \frac{\partial E}{\partial w_j} \qquad (17.4) \]
where \(\alpha\) is a learning rate. The actual computation of the gradient is usually done by an algorithm called backpropagation. While it works very well in practice, backpropagation is generally considered neurally implausible, and we do not present it here. Nevertheless, even backpropagation-trained networks can be neuroscientifically interesting, as they make it possible to test hypotheses about general network structure and choice of learning objective, network properties that can be chosen independently of any specific learning process.
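The error function of equation 17.3 and the weight update of equation 17.4 can be illustrated on a toy stand-in for f. The linear two-class "network" and the finite-difference gradient below are hypothetical simplifications introduced for this sketch; real CNNs compute the gradient with backpropagation, which is not presented here:

```python
import numpy as np

def l2_error(w, X, y_onehot, f):
    """Equation 17.3: squared L2 distance between the network output
    f(X; w) and the one-hot target delta_y."""
    return np.sum((f(X, w) - y_onehot) ** 2)

def gradient_step(w, X, y_onehot, f, alpha=0.1, eps=1e-6):
    """Equation 17.4: w <- w - alpha * dE/dw.  The gradient is estimated
    by central finite differences here, purely for illustration."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        w_plus = w.copy(); w_plus.flat[j] += eps
        w_minus = w.copy(); w_minus.flat[j] -= eps
        grad.flat[j] = (l2_error(w_plus, X, y_onehot, f)
                        - l2_error(w_minus, X, y_onehot, f)) / (2 * eps)
    return w - alpha * grad

# Toy stand-in for f: a linear map from a 3-pixel "image" to 2 class scores.
f = lambda X, w: w.reshape(2, 3) @ X
X = np.array([1.0, 0.5, -0.2])
target = np.array([1.0, 0.0])      # delta_y for class 0
w = np.zeros(6)
for _ in range(100):
    w = gradient_step(w, X, target, f)
print(l2_error(w, X, target, f))   # error shrinks toward 0
```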
17.3.3 HMAX While the CNNs used for tasks like the ILSVRC are inspired by neuroscience, their main objective is classification performance. HMAX, by contrast, is explicitly designed as a cortical model, drawing design choices from the experimental literature wherever possible. HMAX differs from the CNN template presented so far in several ways. First, HMAX complex cells pool over scales as well as positions, building invariance to both changes in size and translations. Second, while other models learn their simple cells' weights at every layer, HMAX layer-one simple cells are hand set to have the Gabor-like tuning found in V1. The first layer is composed of Gabor filters at 16 different scales, from 7 × 7 pixels to 37 × 37, and at four different orientations: 0°, 45°, 90°, and 135°.
In addition, HMAX's simple cell activation function is somewhat different from equation 17.2. To avoid the extra notation associated with different scales, we will drop some of our earlier indexing, considering an arbitrary simple cell \(S_i\). As before, \(S_i\) receives inputs from a set \(A(i)\) of complex cells from the previous layer. Its activation is given by

\[ S_i = \exp\Big( -\frac{1}{2\sigma^2} \sum_{j \in A(i)} (w_{ji} - C_j)^2 \Big). \qquad (17.5) \]
The interpretation here is that the synaptic weights \(w_{ji}\) define a "template": \(S_i\)'s excitation is maximal when the
equivalent to convolving the layer's input with the linear filter defined by its feature's weight pattern, adding the bias b to the result, and applying \(\eta\). This is the origin of the name "convolutional neural network." Simple cell weights and biases are free parameters in a CNN and must be learned during a training period; the procedure is outlined below.
With simple and complex cell types in hand, a modeler can assemble an entire network by specifying a number of parameters: the number and sizes of its layers, the number of its feature maps, the sizes of its receptive fields, and so on. Classical CNNs such as HMAX, LeNet 5, and the Neocognitron make choices roughly similar to visual cortex, having around four layers and sticking to the alternating simple and complex sublayer scheme that we have used in this section. More recently, architectures have grown more exotic; GoogLeNet, winner of the 2014 ILSVRC, for instance, has 22 layers, and Simonyan and Zisserman (2014) present another high-performing model with 16–19 layers. As described above, many modern models also include some number of simple-cell-only layers. The question of why such apparently nonbiological choices, particularly as regards layer numbers, yield state-of-the-art recognition performance is an interesting one that will require further investigation.
17.3.2 Learning in CNNs Having specified the structure of a CNN, it remains to learn the simple cell connection weights. A number of learning techniques are possible. HMAX, for instance, uses a simple unsupervised learning scheme, described below. However, by far the most common approach in performance-optimized CNNs is to optimize a supervised learning objective function using gradient descent. These methods define an error function that measures the model's performance on a training data set of labeled images, compute the gradient of this error function with respect to each of the model's weights, and perform standard gradient descent.
The output layer of a CNN has one unit for each possible class label an input image could be assigned. The activation in the yth output unit in response to an image X represents the model's belief that y is X's correct label. We denote by \(f(X; w)\) the output activations induced by an image X fed through a network parameterized by weights w. Given an image and correct-label pair \((X_i, y_i)\) from the training set, a perfectly correct network would produce \(f(X_i; w) = \delta_{y_i}\), a vector uniformly zero aside from a one in the \(y_i\)th position, indicating a complete concentration of belief on the correct label. A model accrues error to the extent to which it deviates from this desired output. Specifically,
ventral stream. In an attempt to level the playing field by limiting humans to feedforward processing as well, Serre et al. (2007a) used a paradigm called masking, in which each target image is only shown for 20 ms and is followed by a mask, a noise image modified to share the target's low-level statistics. The motivation for this procedure is that lower visual areas will be occupied with feedforward processing of the mask images during the time period when feedback processing of the target could ordinarily occur, thereby preventing this feedback processing from taking place.
Under these conditions, human subjects achieved 80% accuracy on the animal/no-animal classification task. HMAX achieves a similar 82% accuracy. Moreover, the pattern of errors made by the model was qualitatively similar to that of humans: humans and the model tended to find the same images difficult.
To get a finer-grained picture of the properties of a trained CNN that support recognition performance, Zeiler and Fergus (2014) use a clever algorithm to visualize the features preferred by units at different layers. Applying this algorithm to a CNN similar to the first ILSVRC winner (Krizhevsky et al., 2012), they find features qualitatively similar to those detected by human and macaque visual cortex (figure 17.3). Unlike HMAX, in which the layer-1 Gabor filters are hand chosen, the weights at all layers of this CNN are learned during training. Even so, edge-like tuning still emerges in the network's first layer. The second layer selects for features involving multiple edge transitions, such as corners, similar to hypothesized V2 features. The highest layers of the network, 4 and 5, are selective for recognizable objects and object parts, such as animal faces in layer 4 and whole animals in layer 5, suggesting a similarity to mammalian IT. Roughly, then, the features a given CNN layer selects for are qualitatively similar to those selected for by a biological layer at a similar point in the hierarchy.
The feature-preference correspondence between biological neurons and artificial CNN units suggests a more direct comparison: How well can CNN activations predict actual physiological data? Yamins et al. (2014) recorded the responses of macaque IT and V4 units and compared the results to CNN activations induced by the same images. To make the comparison, Yamins et al. compare each biological neuron to the linear combination of artificial units that best matches its responses. Using this procedure, Yamins et al. find that the CNN's top layer explains 48.5 ± 1.3% of the explainable variance in the biological IT cells' firing patterns, and the model layer below explains 51.7 ± 2.3% of the variance of biological V4 firing rates. To put these numbers in context, they are comparable to the matches other encoding
activations in A(i) match this template exactly, and it falls off in a Gaussian way as they deviate from it.
As measured by performance, probably the most important distinguishing feature of HMAX is its learning rule. Proceeding with the activation function interpretation above, learning in HMAX consists of choosing an afferent template for each of its simple cells. HMAX accomplishes this with a simple, unsupervised sampling procedure. For each simple cell \(S_i\), the model picks an (unlabeled) image from a provided data set, which, when presented to the network, induces a particular pattern of activation in the complex cells in \(S_i\)'s afferent set, \(A(i)\), with \(C_j\) being the induced activation in the jth complex cell. This activation pattern is stored as the template: \(w_{ji} := C_j\). In other words, each simple cell becomes tuned to the neural image of a particular image patch arising in the training set.
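A minimal sketch of HMAX's imprinting rule and the Gaussian template match of equation 17.5; the four-element activation vector below is an arbitrary illustrative stand-in for the responses of a real afferent set A(i):

```python
import numpy as np

def imprint_template(afferent_activations):
    """HMAX's unsupervised learning rule: the weight vector is simply a
    stored copy of the complex-cell activation pattern C_j induced by a
    chosen training patch (w_ji := C_j)."""
    return afferent_activations.copy()

def hmax_simple_activation(w, C, sigma=1.0):
    """Equation 17.5: S_i = exp(-(1/(2 sigma^2)) * sum_j (w_ji - C_j)^2).
    The response is 1 for an exact template match and falls off as a
    Gaussian with distance from the template."""
    return np.exp(-np.sum((w - C) ** 2) / (2 * sigma ** 2))

# Imprint on one "neural image," then probe with a match and a mismatch.
training_patch = np.array([0.9, 0.1, 0.4, 0.7])
w = imprint_template(training_patch)
print(hmax_simple_activation(w, training_patch))        # 1.0 (exact match)
print(hmax_simple_activation(w, training_patch + 0.5))  # smaller response
```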
This learning scheme has two advantages over the supervised one presented above. First, it avoids the biological implausibility associated with backpropagation. Second, the fact that it is unsupervised means that it does not require a large database of labeled natural images, likely better matching the conditions faced by a young human or animal during development. These advantages, though, come at the price of reduced performance. As we will see below, HMAX achieves respectable recognition performance, competing with humans' on some tasks. However, it lags significantly behind performance-optimized CNNs in most classification problems. The difference between HMAX's unsupervised learning rule and the supervised backpropagation used by most other models is likely the main reason for this deficit.
17.3.4 Assessing CNNs A number of experimental methods are available to help us to study the performance and properties of CNNs and to assess the similarity of CNNs to the mammalian visual cortex.
A first question is behavioral: How well can CNNs account for human recognition performance? As we have already seen, CNN performance compares favorably to humans' in the difficult ILSVRC recognition task. A more controlled experiment compared HMAX to humans on a simpler visual task: determining whether or not an image contains an animal. In difficult cases, humans can deploy a number of strategies to make this determination, for instance, moving their eyes for detailed examination of particular image regions or using prior knowledge about where in a natural scene an animal might hide. However, strategies like this require time and feedback processing and are outside the portion of visual processing that CNNs attempt to model, namely, the initial feedforward pass through the
Figure 17.3 Zeiler and Fergus (2014) developed a deconvolution algorithm to visualize the features that best activate the units in a CNN. Here, the algorithm is applied to AlexNet, the best-performing CNN in ILSVRC 2012. Each 3 × 3 grid corresponds to one feature map. The images on the left of each panel show algorithmic reconstructions of the image
[Figure panels: Layer 2, Layer 3, Layer 4, Layer 5]
features that cause high activations, and the color images on the right show patches in which these patterns occur. As in primate visual cortex, units in higher layers are responsive to complex stimuli. Reproduced with permission and modified from Zeiler and Fergus (2014).
models have achieved to the much better understood lower visual areas.
Another physiological question one can ask about CNNs concerns the biological reality of their basic operations. The simple cell activation function in equation 17.2 is a standard and widely accepted model of neural activation throughout the brain, but a CNN's complex cells' max-pooling operation is more exotic. Does it actually occur in cortex? Lampl et al. (2004) recorded from V1 complex cells and found that for 80% of them, a max-like activation function fit their responses better than a linear model. Knoblich et al. (2007) present a circuit-level model of how max pooling might occur.
Given these impressive results, it is natural to wonder if CNNs now offer a full account of feedforward object classification. While they undoubtedly make substantial progress toward this goal, a few caveats remain. First, while CNNs achieve human-like total error on tasks like the ILSVRC, the patterns of errors they make deviate somewhat from those made by humans (Russakovsky et al., 2014). Russakovsky compared human performance to GoogLeNet, the most recent ILSVRC winner. More than half of human errors are attributable to knowledge deficits as opposed to visual errors: humans may not realize that the correct label is an option and may struggle with fine-grained and obscure classes, specific species of dogs, for example. CNNs, by contrast, have more purely visual problems, finding it difficult to label objects that take up a small percentage of the image as a whole, and struggling with images distorted by Instagram-style filters. CNNs also struggle with abstract representations of objects such as drawings or statues. Taking these last two sources of error together, one might hypothesize that, compared to humans, a CNN is more likely to classify based on textural rather than structural features of an image. Relatedly, Szegedy et al. (2013) demonstrate that CNNs can be fooled by adversarial examples, images constructed by taking an image the model originally classified correctly and adding a specially designed perturbation. While the resulting image is indistinguishable to human eyes from the original, the model nevertheless labels it incorrectly.
Second, achieving high classification accuracy with biological training mechanisms remains an open challenge. As mentioned, the popular backpropagation algorithm relies on non-neural mechanisms. More generally, commonly used supervised learning paradigms require very large amounts of labeled training data, arguably significantly more than human learners need. See Durbin and Rumelhart (1989), O'Reilly (1996), and Balduzzi et al. (2014) for proposed biologically plausible backpropagation alternatives and section 17.6.1 for further discussion of small-sample learning.
So feedforward classification is not without its loose ends. Nevertheless, CNN modeling has been a significant advance. Indeed, it is arguably the most successful effort to date in solving a difficult biological problem in an at least broadly biological way.
17.4 Feedforward Scene Recognition
Models like HMAX and other CNNs are generally used for object recognition, as in the animal-detection task described above, but in principle, nothing prevents them from categorizing scenes as well. Indeed, a natural conjecture is that scene recognition is nothing other than repeated object recognition: we might recognize a street scene, for instance, by realizing that it contains cars, pedestrians, and buildings in particular arrangements. But recent work gives compelling experimental evidence that this object-based view of scene perception is incorrect, or at least incomplete. Rather, the human object recognition apparatus is supplemented by a distinct set of processes housed in a distinct set of brain areas that quickly extract the global geometric properties of a scene, its "gist," that support fast recognition of a scene's semantic category and its functional properties. This section reviews this experimental work on gist extraction and then shows that its findings have been captured in computational models.
If the properties of a scene that support classification are not the identities of its constituent objects, what are they? The evidence suggests that scene classification relies on a collection of global geometric properties, such as a scene's openness, the extent to which the horizon is visible; its depth; and its navigability. In addition to their supporting role in classification, these properties have clear ecological relevance in their own right.
Greene and Oliva (2009) showed subjects masked images of natural scenes for variable amounts of time and examined which properties subjects were able to extract. They found that a scene's global properties became available earlier (mean = 34 ms) than did its basic-level semantic category ("mountain," "ocean," etc.; mean = 50 ms), consistent with a picture in which global properties underlie category judgments. To show the sufficiency of these properties for scene classification, Greene and Oliva (2006) had human subjects evaluate the prevalence of each of seven global properties in each image of a database of natural scene images and trained a classifier to predict the category of a scene given only its properties, as coded by humans. They found a good correlation between human performance and that of the classifier, and a similar distribution of mistaken classifications.
Other timing studies examine when object information enters the mix, and they argue for the following approximate ordering: global properties, then scene category, then individual object categories. In a study by Fei-Fei et al. (2007), subjects generated free descriptions of scenes they had viewed for various durations. While the portions of their descriptions involving objects changed and became more detailed for longer viewing times, subjects' high-level scene classifications remained relatively constant. This is consistent with an interpretation in which the scene category is extracted quickly while object identities require longer viewing to emerge in full.
In addition to limiting the processing time available to viewers, other studies have restricted the information available in the frequency domain. Schyns and Oliva (1994) showed subjects images from which all frequencies above two cycles per degree had been eliminated. After this modification, objects generally appear only as unidentifiable blobs, but subjects' ability to identify the scene category was largely unimpaired. A similar low-pass-filtered image is shown in figure 17.4A.
Further evidence for scene processing as distinct from object processing comes from neuroimaging studies showing a distinct network of brain areas underlying scene perception. Imaging work has found at least three areas that preferentially respond to scenes or places over objects: the parahippocampal place area (PPA), retrosplenial cortex, and the occipital place area. In a particularly striking finding, Epstein and Kanwisher (1998) showed that PPA activation to an image of a room was more or less unchanged by removing all the moveable objects (furniture, etc.), further evidence of a geometric rather than object-based encoding. Decoding studies with fMRI also support this view. For instance, Park et al. (2011) were able to decode scene information from all three of the scene-specific areas mentioned, as well as from the object-selective area lateral occipital cortex (LOC), but the pattern of errors the classifier made showed a sensitivity to spatial properties (open vs. closed) in the scene areas, as compared to content in LOC.
17.4.1 Feedforward Models of Scene Perception These experimental findings suggest some ways that a representation designed to support scene classification should differ from one for object classification. First, as indicated by humans' ability to classify scenes using only low spatial frequencies, a scene representation should need to be only weakly spatially localized. Second, and relatedly, the scene representation should be able to be relatively low dimensional, as shown by the good performance of the model in Greene
Figure 17.4 (A) Despite the fact that low-pass filtering has removed object information, it is still easy to recognize that the image shows a street scene. (B) A global feature template, one principal component of a multiscale filter-bank representation of a database of natural images. The image is divided into 16 subregions, and an output value is computed for each. The polar plot within subregions shows how the output is computed for different orientations and spatial frequencies: angle in the plot represents orientation, and distance along the radius represents spatial frequency, with low frequencies near the center. White regions correspond to positive outputs, black, to negative. (C) A noise image (right) coerced to share global features with a target image (left). The comparison shows that global features preserve the scene's large-scale geometry even while losing its finer details. Reproduced with permission and modified from Oliva and Torralba (2006).
and Oliva (2006) using only seven human-identified features.
Most models of scene processing stem from Oliva and Torralba (2001). Here we present a slightly more neural variant by the same authors, Oliva and Torralba (2006). The model begins with a V1-like representation given by the output of a bank of Gabor filters spanning a range of orientations and scales and positioned densely across the image. A first reduction step coarsens the spatial resolution of this representation by dividing the image into N × N windows and averaging the output of each filter within each window. This reduces the dimensionality of the representation to N × N × S × R, where S is the number of spatial scales in the filter bank, and R is the number of orientations.
The second reduction step computes this representation for each of a large number of images in a data set and compresses the resulting set of vectors with principal components analysis. Each principal component is a weighted combination of filter outputs at each of the N × N windows. These weighted combinations are then "fused" into what the authors call global feature templates (GFTs), which span the whole image; an example is shown in figure 17.4B. The first M of these GFTs can then be applied to a new scene image, by applying the original filters and weighting the results, and the output fed to a classifier to extract global properties like openness, depth, and so forth. To get a feel for the representation that the GFTs compute, figure 17.4C shows a noise image modified in such a way that it induces the same GFT responses as a target image of a natural scene. While the modified noise image loses the fine details present in the target, it maintains enough of its global gist to support easy classification.
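The two reduction steps can be sketched as follows, with the Gabor filtering stage replaced by a precomputed array of filter responses and the learned PCA basis by a random stand-in (both assumptions made purely for illustration):

```python
import numpy as np

def windowed_average(filter_outputs, N=4):
    """First reduction step of the gist model: average each filter's
    output map over an N x N grid of windows, giving an N x N x (S*R)
    representation.  `filter_outputs` is assumed to hold the responses
    of S*R Gabor filters, shape (S*R, H, W)."""
    f, H, W = filter_outputs.shape
    hs, ws = H // N, W // N
    gist = np.empty((N, N, f))
    for r in range(N):
        for c in range(N):
            window = filter_outputs[:, r * hs:(r + 1) * hs,
                                    c * ws:(c + 1) * ws]
            gist[r, c] = window.mean(axis=(1, 2))
    return gist

def project_onto_gfts(gist, components):
    """Second reduction: express the windowed representation in terms of
    its first M principal components (the global feature templates)."""
    return components @ gist.ravel()               # shape (M,)

# Toy example: 6 filter maps over a 32 x 32 image, reduced to 3 GFT values.
rng = np.random.default_rng(0)
maps = rng.random((6, 32, 32))
gist = windowed_average(maps, N=4)                 # 4 x 4 x 6
components = rng.standard_normal((3, gist.size))   # stand-in PCA basis
print(project_onto_gfts(gist, components).shape)   # (3,)
```

The resulting M-dimensional vector, not any object-level description, is what gets fed to the classifier for global properties like openness and depth.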
Since Oliva and Torralba (2001), a number of authors have proposed different models of scene-level processing. See, for example, Renninger and Malik (2004) and Siagian and Itti (2007) and references therein.
Scene classification is a worthy end in itself, but scene information can also be useful in the context of object detection. An image's scene category contains a lot of information about which objects it is likely to contain and where they are likely to appear. A street scene, for instance, is likely to contain multiple cars, while a mountain or forest scene probably contains none. Furthermore, cars appear at predictable (vertical) locations in street scenes; a car is unlikely to appear in the sky, for instance. Torralba et al. (2010), building on Murphy et al. (2003), use scene-level information to adjust the outputs of local object detectors, ensuring that the final detections contain a plausible number of objects of each class and that these objects appear in plausible locations.
Similar intuitions underlie the model in Torralba et al. (2006), which uses scene features to predict human eye movements during visual search. Subjects were told to find an instance of a target object class, "car," for example, and their eye movements were tracked as they looked for it. If humans use their knowledge of the target image's class, together with quickly extracted scene-gist features from the given image, then a model incorporating gist features should better predict their fixations than one without.
The tendency of a human to fixate a location X is modeled as

\[ S(X) = L(X)^{\gamma}\, P(X \mid G, O = 1) \qquad (17.6) \]
where L(X) is the local saliency of the location X. Heuristically, L measures the extent to which X is unexpected given the rest of the image; salient regions stand out. This local contribution is traded off via the exponent \(\gamma\) with a global term, \(P(X \mid G, O = 1)\), denoting the probability that an instance of the target class is present at the location X, given the image's global scene-gist features G, and assuming that at least one target-class instance is present in the image as a whole (O = 1). As expected, the combined global–local model significantly better predicted which image regions subjects were likely to fixate than a local-only one.
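Equation 17.6 amounts to an elementwise product between a saliency map raised to a small power and a gist-conditioned prior. The toy one-dimensional maps and the value of \(\gamma\) below are assumed purely for illustration:

```python
import numpy as np

def fixation_map(local_saliency, target_prior, gamma=0.1):
    """Equation 17.6: S(X) = L(X)^gamma * P(X | G, O=1).  Local saliency
    is traded off (via the exponent gamma) against a scene-gist prior on
    where the target class tends to appear."""
    return local_saliency ** gamma * target_prior

# Toy example on a 1-D strip of 5 locations: a car-search prior (assumed
# values) concentrates probability at mid-image heights, downweighting a
# highly salient but implausible "sky" location.
L = np.array([0.9, 0.2, 0.3, 0.8, 0.1])       # local saliency L(X)
P = np.array([0.01, 0.30, 0.40, 0.25, 0.04])  # P(X | G, O=1), sums to 1
S = fixation_map(L, P)
print(S.argmax())  # 2: the prior dominates, not the salient location 0
```

With a small \(\gamma\), the gist prior dominates; as \(\gamma\) grows toward 1, raw saliency regains influence, which is the trade-off the exponent controls.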
17.5 Feedback Connections
While the foundations of the feedforward models presented in section 17.3 were already in place by the end of the 1960s, an understanding of visual cortex's feedback connections has been more elusive. It is generally agreed that feedback connections mediate attentional processing, carrying signals about stimulus salience and task relevance from frontal and parietal areas (see chapter 19, "Saccades and Smooth Pursuit Eye Movements," and chapter 4, "Neural Rhythms"), but there has been less consensus about the role of feedback connections in this chapter's core themes of object and scene recognition. Here, we will focus on a class of models that has come to the fore over the last fifteen years or so, which views feedback connections as primarily generative in nature, enabling the brain to synthesize as well as consume images.
That the brain can accomplish this generation is intuitively clear from everyday experiences like dreaming and visual imagination and from clinical observations of visual hallucinations. Sufferers of Charles Bonnet syndrome, for instance, report extremely vivid hallucinations which they have trouble distinguishing from reality, and which imaging studies have shown to have very similar neural signatures to normal visual processing (Ffytche, 2005). But even granting the visual cortex's generative capacity, it still remains to explain how this capacity could be used to support recognition.
17.5.1 Generative Models for Recognition The spiritual father, arguably, of the generative approach to vision is Hermann von Helmholtz, who in 1867 advanced the theory of "perception as unconscious inference." On this view, the input the visual system receives is the result of a causal process, the imaging process mapping world states (particular configurations of objects, illumination, etc.) to two-dimensional images, or collections of LGN cell responses. The goal of visual cortex
is then to invert this process, inferring from an image the world state that caused it. Note the ambiguity inherent in this problem: any image is, in general, consistent with many high-level causes. For instance, a percept of a given size might be caused by a large, distant object or by a small object close at hand.
The purely feedforward models we have studied so far approximate a solution to this inverse problem by learning a direct mapping from percept to causes, a purely bottom-up approach. Given a model of the imaging process, a purely top-down process is also possible: one could guess high-level causes and check them by running the imaging process and seeing if the result matched the observed percept. This approach has the advantage of being robust to low-level ambiguity, but in practice, of course, pure trial and error is hopelessly inefficient. Practical generative vision models use a mixed approach in which bottom-up and top-down information is progressively integrated.
Reflecting the ambiguity inherent in the inverse inference problem, generative models in vision are generally probabilistic. Sticking with our chain view of visual cortex, in which areas interact primarily with their immediate neighbors, allows us to factor the joint probability of the activities in all cortical areas, and of the input data in the LGN, as follows:
\[ P(LGN, V1, V2, V4, IT) = P(LGN \mid V1)\, P(V1 \mid V2)\, P(V2 \mid V4)\, P(V4 \mid IT)\, P(IT). \qquad (17.7) \]
The conditional probabilities \(P(V_i \mid V_j)\) encode a model of the statistics of natural images given their content. For instance, \(P(V4 \mid IT)\) indicates which V4 responses the model expects given the presence of a particular configuration of objects, encoded as IT responses. These probabilities are acquired through experience; this is the learning problem. Once the relevant distributions are in place, we can ask Helmholtz's inversion question: which higher-level causes were likely to have generated a given LGN input? In the probabilistic context, this is called the inference problem, and it amounts to approximating or maximizing the following posterior distribution:
\[ P(V1, V2, V4, IT \mid LGN) = \frac{P(V1, V2, V4, IT, LGN)}{P(LGN)} \qquad (17.8) \]
\[ = \frac{1}{Z}\, P(LGN \mid V1)\, P(V1 \mid V2)\, P(V2 \mid V4)\, P(V4 \mid IT)\, P(IT). \]
Here, we have treated the probability of the LGN input as a normalizing constant and used equation 17.7 to perform the factorization in the last step.
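Up to the constant \(\log Z\), the log of the posterior in equation 17.8 is just a sum of terms along the cortical chain. A sketch, using scalar "activities" and hypothetical Gaussian-shaped conditionals in place of learned distributions:

```python
def log_posterior(states, log_cond, log_prior):
    """Unnormalized log posterior for the cortical chain of equation 17.8:
    log P(V1,...,IT | LGN) = sum of log P(V_k | V_{k+1}) + log P(IT) - log Z.
    `states` lists the activities [LGN, V1, V2, V4, IT]; `log_cond(a, b)`
    returns log P(a | b) for adjacent areas; log Z is dropped, since it
    does not affect which interpretation is most probable."""
    lp = log_prior(states[-1])
    for lower, higher in zip(states[:-1], states[1:]):
        lp += log_cond(lower, higher)
    return lp

# Toy stand-ins: each "area" is a scalar, each conditional is a Gaussian
# log density (up to a constant) centered on the area above.  All values
# here are assumed, purely for illustration.
log_cond = lambda a, b: -0.5 * (a - b) ** 2
log_prior = lambda it: -0.5 * it ** 2

# Two candidate interpretations of the same LGN input 1.0: the internally
# consistent one scores higher.
consistent = [1.0, 1.0, 1.0, 1.0, 1.0]
inconsistent = [1.0, 0.0, 2.0, 0.0, 2.0]
print(log_posterior(consistent, log_cond, log_prior) >
      log_posterior(inconsistent, log_cond, log_prior))  # True
```

Inference then means searching over the non-LGN states for assignments that make this sum large, which is exactly what the gradient schemes discussed below do.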
In general, generative approaches to recognition share this basic structure and problem definition but differ in the way in which the optimization is performed and in how the distributions are parametrized.
17.5.2 Predictive Coding When you tell a friend about your day, you do not catalog routine minutiae. Rather, you only mention the few unusual events that distinguished this particular day from any other, relying on your friend's prior knowledge of your daily habits to fill in the details you left out. This is the intuition underlying predictive coding, a technique originally developed for speech processing in the 1960s: a sender need only transmit those parts of a signal that are unexpected given a predictive model possessed by the receiver.
Predictive coding underlies several models of visual cortex and other brain areas (Mumford, 1992; Friston, 2005). Our exposition most closely follows the model developed in Rao and Ballard (1999). In Rao and Ballard's model, higher visual areas play the role of the receiver in the sketch above, and lower areas, the sender. Higher areas use feedback connections to transmit their generative predictions of the next lower level's activity, and lower areas send up the errors in these predictions via feedforward connections.
In the simplest version of the model, the prediction made by the ith area is a linear function of the area's own activities:
\[
W_i V_i \tag{17.9}
\]
where W_i is a matrix of synaptic weights. The W_i are structured so that each cell in area V_i sends connections to a spatially localized subset of cells in the layer below, the size of this subset increasing with i, corresponding to the fact that cells in higher areas have larger receptive fields.
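This localized weight structure is easy to sketch. The area sizes, window widths, and random weights below are illustrative assumptions; the only structural claims taken from the text are that each higher-area cell projects to a local window in the area below and that the window widens as we ascend the hierarchy.

```python
import numpy as np

# Sketch of the localized weight matrices of equation 17.9: column j of
# W is nonzero only in a window of cells around unit j's (scaled)
# position in the lower area.  All sizes and widths are invented.
rng = np.random.default_rng(1)

def localized_weights(n_lower, n_upper, width, rng):
    """Weight matrix mapping upper-area activity to predictions of the
    lower area, with spatially localized connectivity."""
    W = np.zeros((n_lower, n_upper))
    centers = np.linspace(0, n_lower - 1, n_upper)
    for j, c in enumerate(centers):
        lo = max(0, int(c) - width // 2)
        hi = min(n_lower, int(c) + width // 2 + 1)
        W[lo:hi, j] = rng.standard_normal(hi - lo)
    return W

# Receptive-field width grows with the level, as in the text.
W2 = localized_weights(n_lower=64, n_upper=32, width=5, rng=rng)
W3 = localized_weights(n_lower=32, n_upper=16, width=9, rng=rng)

V2 = rng.standard_normal(32)
prediction_of_V1 = W2 @ V2     # the linear prediction W_i V_i
```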
With a Gaussian noise model, equation 17.9 gives the following conditional probabilities:
\[
P(V_i \mid V_{i+1}) = \mathcal{N}\bigl(W_{i+1} V_{i+1},\, \operatorname{diag}(\sigma_{i+1}^2)\bigr) \tag{17.10}
\]
where σ_i^2 is a layer-dependent noise variance. With this formulation, we can solve both the learning problem (finding W) and the inference problem with gradient descent; we focus on the inference problem here. From equation 17.8, passing to log space and differentiating gives the following:
\[
\frac{\partial}{\partial V_i} \log P(\{V_k\} \mid LGN) = \frac{\partial}{\partial V_i} \bigl[ \log P(V_{i-1} \mid V_i) + \log P(V_i \mid V_{i+1}) \bigr] \tag{17.11}
\]
\[
= \frac{\partial \log P(V_{i-1} \mid V_i)}{\partial V_i} + \frac{\partial \log P(V_i \mid V_{i+1})}{\partial V_i}.
\]
After plugging in equation 17.10, we obtain the following update rule:
\[
V_i \leftarrow V_i + \alpha \left[ \frac{1}{\sigma_i^2} W_i^T \bigl( V_{i-1} - W_i V_i \bigr) + \frac{1}{\sigma_{i+1}^2} \bigl( W_{i+1} V_{i+1} - V_i \bigr) \right]. \tag{17.12}
\]
This is a quantitative expression of the predictive coding intuition explained above. The second term in the sum is the error in the predictions coming from the higher area V_{i+1} and is available locally to V_i. The first term, the error in V_i's own predictions, has to be passed back up from V_{i-1}, where it is computed.
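A minimal sketch of this inference scheme, assuming a three-level linear chain with unit noise variances and an invented step size (none of these values come from Rao and Ballard). The middle-area update contains the two error terms of equation 17.12; the top area, having no area above it, keeps only the bottom-up term.

```python
import numpy as np

# Predictive-coding inference on a chain V0 (input) <- V1 <- V2,
# following the gradient update of equation 17.12.  Sizes, step size,
# and noise variances are illustrative choices.
rng = np.random.default_rng(0)
sizes = [16, 8, 4]
W1 = rng.standard_normal((sizes[0], sizes[1])) / np.sqrt(sizes[1])
W2 = rng.standard_normal((sizes[1], sizes[2])) / np.sqrt(sizes[2])

V0 = rng.standard_normal(sizes[0])        # fixed "LGN" input
V1 = np.zeros(sizes[1])                   # inferred activities
V2 = np.zeros(sizes[2])

alpha, s1, s2 = 0.05, 1.0, 1.0            # step size, noise variances

def energy():
    """Negative log posterior up to constants: the summed squared
    prediction errors at the two levels."""
    e1 = V0 - W1 @ V1                     # error in V1's predictions
    e2 = V1 - W2 @ V2                     # error in V2's predictions
    return (e1 @ e1) / (2 * s1) + (e2 @ e2) / (2 * s2)

E_start = energy()
for _ in range(200):
    # Middle area: bottom-up error passed up through W1^T, plus the
    # locally available error in the prediction arriving from V2.
    V1 = V1 + alpha * ((W1.T @ (V0 - W1 @ V1)) / s1 + (W2 @ V2 - V1) / s2)
    # Top area: only the bottom-up error term.
    V2 = V2 + alpha * (W2.T @ (V1 - W2 @ V2)) / s2
E_end = energy()
```

Each iteration reduces the total prediction error, which is what gradient descent on the log posterior amounts to in this Gaussian setting.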
These considerations suggest a neural implementation like the one shown in figure 17.5, adapted from Friston (2005). Each visual area is composed of two subpopulations: the prediction neurons P_i send predictions via feedback connections, and the error-detecting neurons E_i compute the error in predictions coming down from the area above. For a more detailed attempt to map the predictive coding hardware onto neural machinery, see Bastos et al. (2012).
17.5.3 Predictive Coding: Simulations and Predictions  In simulations with a truncated version of the model above, encompassing LGN, V1, and V2, Rao and Ballard (1999) derive a number of properties of V1 cell responses, particularly the so-called extra-classical receptive field effects. As described in section 17.2, a cell's classical receptive field is the region of space in which a presented stimulus can evoke a response. Many V1 cells have in addition an extra-classical receptive field (eCRF), which is a larger area outside the classical receptive field where stimuli can modulate the cell's firing. Two examples are shown in figure 17.6. End-stopped cells are responsive to bar-like stimuli and show enhanced responses when a bar terminates in their eCRF. Another effect, surround suppression, occurs when a cell's firing decreases when the dominant orientation in its classical receptive field extends through its eCRF. In general, extra-classical effects occur at
Figure 17.5  A possible neural architecture for predictive coding. Each area contains two populations of neurons: P_i predicts the activity of the downstream area, sending the predictions via the synaptic weight matrix W_i, and E_i computes the error made by the predictions coming from the next higher area.
around 80–100 ms after stimulus onset, as compared to 66 ms for classical V1 responses, suggesting a role for feedback, or at least lateral, processing.
After training their network on a database of natural images, Rao and Ballard found that a large proportion of their error-correcting neurons displayed end-stopping effects. An analysis of natural image statistics reveals why this should be so. Edges in natural images generally continue smoothly; short, isolated bar segments are rare. During training, the simulated V2 neurons adapt to this regularity and shape their predictions of V1 responses accordingly. Thus, short bars contained within a V1 cell's eCRF trigger a prediction error and cause increased activity in the V1 error-detecting cells. Recall that the receptive field of each simulated V2 neuron contains several V1 neurons; this group constitutes its central cell's eCRF.
The analysis for surround suppression is similar. By and large, the dominant orientation in a natural image changes slowly and smoothly over space, so this is what V2 cells come to predict. When this prediction is upheld, error-detecting neurons fire less, resulting in the observed suppression.
Subsequent studies have used fMRI to investigate predictive coding. Murray et al. (2004) used stimuli of a variety of types, each containing a collection of elements. In one class of stimuli, the elements could be interpreted as forming a coherent group of some kind, while, in the other class, they appeared essentially random. A collection of line segments forming a coherent two-dimensional shape, for instance, would be a member of the first class, while a collection of scattered segments would be a member of the second. In fMRI experiments, the groupable stimuli were found to induce increased responses in higher visual areas, such as the object-responsive area LOC, and reduced responses in V1. This finding is consistent with predictive coding: groupable stimuli admit a higher-level explanation that is able to "predict away" lower-level responses. The authors point out, however, that the result is also consistent with an alternative "sharpening" explanation, in which lower-level responses that are consistent with higher-level predictions are enhanced, and others reduced, perhaps as a way to produce a sparser representation overall.
More directly related to our concern with object recognition, Egner et al. (2010) showed subjects images of either faces or houses. Before each image, subjects saw a colored frame that was stochastically predictive of the image category to follow: a face followed a green frame with probability 0.25, a yellow frame with probability 0.5, and a blue frame with probability 0.75. Egner et al. recorded the activation in the FFA, a fusiform area selective for faces. Consistent with other studies, they found some FFA activity in response to house stimuli, but this activation was much higher, almost as much as for faces, when these stimuli were unexpected. The predictive coding explanation of this result is that the FFA activation is an error-correcting response induced by the violated face expectation. The authors rule out an alternative explanation that face expectation uniformly enhances FFA activation, perhaps as a result of anticipatory neural activity. On this explanation, activation on face-expected, face-present trials should be at least as high as on face-unexpected, face-present ones, the opposite of the actual finding that face-present activation decreased as the face expectation increased.
For a recent review of predictive coding, including proposals about how it could function in cognitive domains other than vision, see Clark (2013).
17.5.4 Other Approaches to Inference  Schemes like Rao and Ballard's are perhaps the best-developed approaches to inference in generative vision, but there is no shortage of other candidates. We focus in particular on one algorithm, known as belief propagation or message passing, in part because it has been proposed as a general algorithm for neural probabilistic inference, applicable to areas beyond visual cortex. As the name suggests, the message-passing algorithm is based on communication between probabilistically related variables, making it particularly amenable to neural interpretations. The specific approach we sketch here was developed in Lee and Mumford (2003). For another approach, see Rao (2007).
Lee and Mumford move to a slightly different formalism from the one presented above, replacing the conditional probabilities with potential functions ψ(V_i, V_j), which measure the undirected compatibility of the representations in a pair of adjacent areas.
Figure 17.6  Extra-classical receptive fields (eCRFs). In both figures, the cell's classical receptive field (RF), the region in which a presented stimulus can directly induce firing, is shown in red, and its eCRF, a larger area in which stimuli can modulate firing, is shown in black. End-stopped cells show a decrease in firing when a bar extends beyond their classical RF. Surround suppression occurs when a cell fires less strongly when the orientation present in its classical RF continues in its eCRF.
Lee and Mumford propose an inference mechanism that combines belief propagation with another algorithm popular in machine learning and computer vision called particle filtering. Particle filtering replaces the continuous distributions over V_i with a discrete approximation, taking the form of weighted samples:
\[
\sum_{k=1}^{n} w_k\, \delta_{\tilde{V}_i^k},
\]
where the Ṽ_i^k, called particles, are a chosen
set of possible activation values. Belief propagation then allows neighboring areas to update these distributions by passing messages. Specifically, each area passes up this message:
\[
M_{V_i \to V_{i+1}}\bigl(\tilde{V}_{i+1}^k\bigr) = \sum_j \psi\bigl(\tilde{V}_i^j, \tilde{V}_{i+1}^k\bigr)\, M_{V_{i-1} \to V_i}\bigl(\tilde{V}_i^j\bigr). \tag{17.13}
\]
Similarly, the downward messages take the form
\[
M_{V_i \to V_{i-1}}\bigl(\tilde{V}_{i-1}^k\bigr) = \sum_j \psi\bigl(\tilde{V}_{i-1}^k, \tilde{V}_i^j\bigr)\, M_{V_{i+1} \to V_i}\bigl(\tilde{V}_i^j\bigr). \tag{17.14}
\]
Focusing on the upward case, the kth component of the message M_{V_i → V_{i+1}}(Ṽ_{i+1}^k) can be understood as the expected compatibility of the particle Ṽ_{i+1}^k with the activity in area V_i, where the expectation is taken with respect to incoming messages from the other direction. We have presented the standard version of the algorithm, called the sum-product algorithm; Lee and Mumford use a variant called the max-sum algorithm, in which the sum is replaced with a max.
After one round of message passing, each particle Ṽ_i^k has received "scores" from the messages coming from above and below. These are multiplied to give the particle's weight:
\[
w_i^k = \frac{1}{Z}\, M_{V_{i+1} \to V_i}\bigl(\tilde{V}_i^k\bigr)\, M_{V_{i-1} \to V_i}\bigl(\tilde{V}_i^k\bigr). \tag{17.15}
\]
As the model's beliefs evolve, many of the original particles become overwhelmingly unlikely, as manifested in low weights. At each iteration, a new set of particles is formed by resampling particles with replacement according to their weights. This tends to reinforce particles that are doing well while killing off ones that are doing poorly.
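The whole loop (particles, the two message types of equations 17.13 and 17.14, the weights of equation 17.15, and resampling) can be sketched on a toy chain of three scalar-valued areas. The Gaussian compatibility function, the particle counts, the flat prior at the top, and the observed value are all invented for illustration, not taken from Lee and Mumford.

```python
import numpy as np

# Toy particle-based belief propagation on a three-area chain.
rng = np.random.default_rng(0)
n_particles = 50

def psi(a, b):
    """Undirected compatibility of activations in adjacent areas:
    high when they are similar (a stand-in for learned potentials)."""
    return np.exp(-0.5 * (a - b) ** 2)

# Particles: candidate activation values for each of the three areas.
particles = [rng.normal(0.0, 2.0, n_particles) for _ in range(3)]

def upward_message(parts_lo, parts_hi, incoming_lo):
    """Equation 17.13: compatibility of each higher-area particle with
    the area below, weighted by the messages arriving from below."""
    K = psi(parts_lo[:, None], parts_hi[None, :])   # (n_lo, n_hi)
    return incoming_lo @ K

def downward_message(parts_hi, parts_lo, incoming_hi):
    """Equation 17.14: the symmetric downward message."""
    K = psi(parts_hi[:, None], parts_lo[None, :])   # (n_hi, n_lo)
    return incoming_hi @ K

# Evidence enters at the bottom: area 1 is tied to an observed value.
observed = 1.5
m_up_1 = psi(observed, particles[0])                          # into area 1
m_up_2 = upward_message(particles[0], particles[1], m_up_1)   # 1 -> 2
m_up_3 = upward_message(particles[1], particles[2], m_up_2)   # 2 -> 3

flat_prior = np.ones(n_particles)                             # at the top
m_dn_2 = downward_message(particles[2], particles[1], flat_prior)  # 3 -> 2
m_dn_1 = downward_message(particles[1], particles[0], m_dn_2)      # 2 -> 1

# Equation 17.15: a particle's weight is the normalized product of the
# messages arriving from above and below.
w1 = m_up_1 * m_dn_1
w1 /= w1.sum()

# Resampling with replacement reinforces high-weight particles.
resampled = rng.choice(particles[0], size=n_particles, p=w1)
```

After resampling, the bottom area's particle set concentrates around values compatible with both the observation below and the messages from above, which is the discrete analogue of updating a continuous posterior.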
The basic form of the message-passing algorithm, propagation of locally computed information, makes it attractive from a neural point of view. One sticking point, though, is how the particles are represented. Since the algorithm requires each brain area to maintain multiple independent hypotheses (particles) simultaneously, the question arises how they can be kept internally coherent and mutually distinct. Lee and Mumford tentatively suggest a few options, such as synchronous firing, but at present no option constitutes a fully fleshed-out proposal.
Comparing this picture with Rao and Ballard's model, presented above, we see that the top-down messages can still be interpreted as predictions, but the bottom-up messages are no longer directly interpretable as error signals. In addition, this formalism allows more complex interactions between areas than Rao and Ballard's. This makes the Lee and Mumford model more expressive, but at the cost of less tractable inference.
17.5.5 Feedback Processing and Priors  In addition to providing an efficient representation, integrating generative top-down signals with bottom-up ones is an effective way of dealing with the ambiguities inherent in low-level cues. Yuille and Kersten (2006) argue that low-level determinations are difficult to make using only local low-level information but are relatively easy given a high-level analysis. McDermott (2004) asked human subjects to determine whether a junction of edges was present in a given image region. This is easy when the whole image is visible; an object's location determines the locations of its edges. However, when McDermott restricted his subjects' information to low-level cues by making them view the image through a small, 13 × 13 pixel window, their junction-recognition performance was poor.
A prediction, then, is that feedback processing should be especially prevalent when low-level cues are ambiguous or absent. The Kanizsa square (figure 17.7) is a well-known visual illusion, which admits two interpretations: a white square partially obscuring four black circles, or four mutilated circles. Lee and Mumford
Figure 17.7  The Kanizsa square is an ambiguous stimulus, which can either be interpreted as a white square partially occluding four circles, or as four circles with missing pieces. Feedback processing induces V1 cells to respond to illusory contours defining the square.
recorded from V1 as monkeys viewed this stimulus, finding equally strong responses to the illusory contours between the circles as to the real ones within the circles, suggesting an interpretation in which higher-level processes settled on the "occluding square" interpretation and used feedback connections to fill in the "missing" contour information in V1. Importantly, V1 cells responded to illusory contours after V2 cells, consistent with the feedback story.
In a study using more natural stimuli, Tang et al. (2014) used subdural electrodes to record from the brains of human epilepsy patients undergoing surgery as they viewed ordinary and heavily occluded natural images. Object recognition was possible in the occluded images (occlusion was calibrated so that subjects achieved approximately 80% accuracy in a behavioral identification task), but most low-level information (all but 18%, on average) was obliterated by the occluders. However, activity in category-selective IT cells was significantly delayed in the occluded condition, suggesting a reliance on preceding recurrent processing.
17.5.6 Feedforward and Feedback Processing: Summing Up  Feedforward and feedback models present substantially different pictures of visual processing. How do the two fit together? The mainstream view is of a first, purely feedforward pass sufficient for relatively straightforward classification tasks, followed by more sustained feedback processing supporting more difficult or ambiguous determinations.
A first line of evidence for this view comes from the timing of human recognition. In an EEG study, Thorpe et al. (1996) found category-selective information arising in the prefrontal cortex (after visual cortex) in only 150 ms. Historically, this timing has been a challenge for feedback models, given their more extensive processing demands. Lee and Mumford (2003) point out, however, that if recurrent processing occurs continuously in local feedback loops, rather than requiring multiple complete top-to-bottom passes, then recognition at the 150-ms timescale is not unrealistic. More recent timing studies pose challenges for this argument, though. A recent behavioral study (Potter et al., 2014) used a rapid serial visual presentation (RSVP) paradigm in which a sequence of images was shown for only 13 ms each, after which subjects were asked whether an image matching a description (e.g., "bear catching fish") was present in the sequence. Even under these extremely demanding time constraints, subjects' performance was significantly better than chance. Short presentation times alone are consistent with feedback models, but the RSVP paradigm poses a challenge: in the time after presentation of the first image, early visual areas are occupied with analysis of subsequent images and are therefore unavailable for feedback processing of the first one.
Other evidence comes from physiology. In a decoding study, Hung et al. (2005), for instance, found that category information was present in the first IT spikes appearing in response to a stimulus, arguing against a feedback picture in which initial IT activity is fed back to and processed by earlier areas before a final interpretation is reached.
As argued above, though, fast, purely feedforward processing is insufficient for many tasks, particularly those with ambiguity, like occlusion. Precise characterization of the tasks supported by the two processing regimes is an important question that awaits future work.
17.6 Conclusion
As we have seen, several well-established frameworks exist for understanding the various processing stages in the ventral visual pathway. However, several important questions remain unanswered. Here we focus on two areas we think will be important for future research.
17.6.1 Digging Deep: Learning from Few Examples  While the remarkable developments in computer vision over the last few years owe much to new algorithms and representation schemes, much credit must also go to the availability of huge databases of labeled images, and the computing power required to process them. This state of affairs contrasts markedly with human vision: we can often learn to recognize a new object (say an iPhone circa 2005) from only one example, not hundreds or thousands. Even as children, we hear our parents explicitly identify only a relatively modest number of objects in the world. It is an important challenge, therefore, both for the development of the next generation of computer vision systems and for genuine understanding and replication of the animal visual system, to come up with computational systems that can learn from similarly compact collections of data.
As we have seen, building in invariance to nuisance transformations is one route to data-efficient models. HMAX and other CNNs can build in invariance to changes in position and scale, but we can hope for more gains by accounting for other transformation types. Anselmi et al. (2013) develop a more general theory of invariant representations.
Transformations like translation, scaling, and rotation are class general; they apply to all object classes in
the same way, a property that does not hold for transformations in general. Changes in expression, for instance, are a class of transformations that apply to faces alone. Less obviously, three-dimensional rotation in depth followed by projection onto the two-dimensional image plane is also class-specific: a long, thin object, for instance, rotates differently from an approximately spherical one. Anselmi et al. present a two-stage model to tackle class-specific transformations. The first stage, corresponding to areas V1–V4, is an HMAX-style CNN. The second stage, corresponding to anterior IT, consists of a collection of modules, each of which handles class-specific transformations for one important object class. The face patches in primate brains are hypothesized to be examples of such modules. The resulting architecture achieves competitive performance in face-processing tasks and accounts for a number of experimental findings. It predicts, for instance, the existence of the mirror-symmetric cells found in macaque face areas that respond equally to a face and to its axis-flipped version.
Another approach to improving data efficiency (Salakhutdinov et al., 2011) uses hierarchical probabilistic models. A nonvisual parable conveys the main idea. Suppose you are given a number of opaque bags of colored marbles. You empty out the first one, seeing that it contains only red marbles. Further exploration shows that all the marbles in the second are blue, and all in the third are yellow. You then draw a single green marble from the fourth bag. Even without seeing any other bag-four marbles, you can be reasonably sure that they are green: after all, all of the other bags were monochromatic. Examples like this are formulated with hierarchical probability models: the individual bags are probability distributions, in this case distributions over marble colors, whose parameters are themselves drawn from a higher-level distribution. Thus, learning something about a few of the bags (for instance, that they are monochromatic) tells you something about the generative process for bags in general and allows you to make hypotheses about new examples. Salakhutdinov et al. transfer this intuition to vision, replacing bags with object categories and marbles with image features. Thus, information about the feature distributions for a few classes is information about how features are distributed for classes in general. As with the single green marble in the example, this higher-level information, combined with a single image from a new class, constrains the new class's structure, making it easier to recognize further examples.
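A minimal quantitative version of the parable, assuming each bag's color distribution is a symmetric Dirichlet with a concentration parameter shared across bags. The bag contents, the candidate concentrations, and the Dirichlet-multinomial form are illustrative choices, not details from Salakhutdinov et al.

```python
from math import lgamma

# The marble parable as a hierarchical model: learning the shared
# concentration from three fully observed bags, then predicting the
# contents of bag four from a single marble.  All numbers are made up.
colors = ["red", "blue", "yellow", "green"]
bags = [["red"] * 20, ["blue"] * 20, ["yellow"] * 20]   # observed bags

def log_evidence(bags, alpha):
    """Dirichlet-multinomial log evidence of the observed bags under a
    symmetric concentration parameter alpha."""
    total = 0.0
    for bag in bags:
        counts = [bag.count(c) for c in colors]
        total += (lgamma(len(colors) * alpha)
                  - lgamma(len(colors) * alpha + len(bag))
                  + sum(lgamma(k + alpha) - lgamma(alpha) for k in counts))
    return total

# Learning the higher-level parameter: a small alpha (bags tend to be
# monochromatic) explains the observed bags far better than a large one.
candidate_alphas = [0.05, 1.0, 10.0]
best_alpha = max(candidate_alphas, key=lambda a: log_evidence(bags, a))

# One green marble drawn from bag four: the posterior predictive
# probability that the next bag-four marble is also green.
a = best_alpha
p_next_green = (1 + a) / (1 + len(colors) * a)
```

Because the learned concentration is small, a single green draw already makes "mostly green" the dominant hypothesis for bag four, which is the one-shot effect the parable illustrates.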
17.6.2 Thinking Wide: Full Scene Interpretation in Natural Images  We began this chapter with an overview of the range of questions humans can answer about the visual world but have focused the discussion so far on the areas best understood by neuroscience: feedforward object and scene classification and preliminary steps toward full scene interpretation made by feedback models. Here we discuss the prospects for modeling a wider range of visual tasks, ones requiring the analysis of scenes with multiple objects interacting in complex ways. We briefly survey promising computer vision approaches to these problems, approaches that may develop into fruitful sources of neuroscientific hypotheses.
A prerequisite, arguably, for human-level flexibility in visual scene analysis is the ability to recognize all the objects present in an image and to represent their spatial relations. One version of this task is called image parsing (Yao et al., 2010; Zhu and Mumford, 2007; Tu et al., 2005; Jin and Geman, 2006; Zhu et al., 2010; Socher et al., 2011). In language processing, one parses a sentence by recursively breaking it up into meaningful parts: a sentence is composed of phrases, which are themselves composed of smaller phrases, and so on. Images have similar hierarchical structure: an image depicts a scene, which is composed of objects, which are composed of parts, which are composed of primitive components like corners and edges. Given an input image, an image-parsing algorithm seeks to recover this hierarchical structure. Whereas the classification models we have seen so far assign an image a single label, image-parsing algorithms assign an image a whole parse tree.
Several models (Yao et al., 2010; Zhu and Mumford, 2007) define image grammars. Just as a string grammar is a generative model for the well-formed strings in a language, an image grammar is a rich, structured generative model for images. As in the feedback models surveyed above, image parsing is the task of inverting this generative model, going from an image to the sequence of choices that could have formed it. There are two main algorithmic challenges in this approach. In the learning problem, the model must infer the appropriate grammar to describe a collection of input images. In the inference problem, the model must find a parse to explain an image with respect to an already fixed grammar. Given the large number of possible grammars and parse trees, both of these problems are challenging.
Partly in response to these difficulties, other image-parsing approaches such as Zhu et al. (2010) and Jin and Geman (2006) eschew formal grammars, opting for more tractable recursive models amenable to efficient discriminative and dynamic programming algorithms. Socher et al. (2011) use a neural network approach,
though not one with direct correspondence to the brain. This model takes as input an image oversegmented into regions, from each of which it extracts a feature vector. It then proceeds by merging regions pairwise, continuing until only one region remains in the image. The binary tree resulting from this process is the image's parse.
The heart of the model is a so-called recursive neural network (RNN) that takes as input two feature vectors and outputs a single merged feature vector of the same length. The RNN is used to evaluate candidate region merges. Given neighboring regions R1 and R2 with feature vectors F1 and F2, the model computes F3 = RNN(F1, F2). A linear regression model then assigns a score to F3, which measures the quality of the merge. Given a current segmentation of the image into regions, the model considers each possible merge and chooses the one with the highest score, continuing in this way recursively. Each time a new merged region is created, it comes with its feature vector, F3 above. In addition to being necessary for calculating further merges, this feature vector can be fed into a classifier assigning a category label to the region. Thus, each node in the parse tree can be assigned a category.
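The greedy merge loop can be sketched with untrained stand-ins for the learned components. The random weights, the feature size, and the simple left-to-right adjacency below are invented for illustration and do not reproduce Socher et al.'s trained model; only the overall control flow (score every neighboring pair, apply the best merge, recurse) follows the description above.

```python
import numpy as np

# Greedy region merging with an untrained RNN-style merge function.
rng = np.random.default_rng(0)
d = 8                                       # feature-vector length
W_merge = rng.standard_normal((d, 2 * d)) / np.sqrt(2 * d)
w_score = rng.standard_normal(d) / np.sqrt(d)

def merge(f1, f2):
    """RNN step: two region features in, one merged feature out."""
    return np.tanh(W_merge @ np.concatenate([f1, f2]))

def score(f):
    """Linear score measuring the quality of a candidate merge."""
    return float(w_score @ f)

# An oversegmented "image": five regions in a row, each carrying a
# parse-tree node (initially its index) and a feature vector.
regions = [(i, rng.standard_normal(d)) for i in range(5)]

while len(regions) > 1:
    # Score every neighboring pair and apply the best merge.
    cand = [(score(merge(regions[i][1], regions[i + 1][1])), i)
            for i in range(len(regions) - 1)]
    _, i = max(cand)
    t1, f1 = regions[i]
    t2, f2 = regions[i + 1]
    regions[i:i + 2] = [((t1, t2), merge(f1, f2))]

parse_tree, root_features = regions[0]
```

The nested tuples in `parse_tree` record the binary merge history, which is exactly the parse described in the text; in the full model, each node's feature vector would additionally be fed to a classifier to label the corresponding region.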
In the context of generative models like image grammars, one often hears the slogan "vision as inverse graphics," an expression of the Helmholtzian view of perception as unconscious inference. Some promising recent models (Mansinghka et al., 2013; Kulkarni et al., 2015) take this slogan literally, viewing images as arising not from a grammar but from a probabilistic graphics program, in which parameter choices such as object positions, camera angle, and lighting conditions are stochastic. Given a generative model of this kind, an inference algorithm can be used to find the collection of parameter choices that best explain a given input image. While this inference problem is difficult, advantages of the inverse graphics approach include the fact that it can model a scene's three-dimensional structure and can exploit the expressivity and flexibility of a full programming language, potentially obtaining richer scene interpretations than those possible with grammar-like formalisms.
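A toy version of the inverse-graphics idea, assuming a one-parameter "graphics program" (a bright bar at an unknown position in a one-dimensional image) and exhaustive scoring in place of a serious inference algorithm. The renderer, the noise model, and all numbers are invented for illustration.

```python
import numpy as np

# "Vision as inverse graphics" in miniature: invert a tiny stochastic
# graphics program by scoring every parameter choice against the image.
rng = np.random.default_rng(0)
WIDTH, OBJ = 32, 6

def render(pos):
    """Deterministic graphics program: a bright bar of length OBJ at
    `pos` on a dark one-dimensional background."""
    img = np.zeros(WIDTH)
    img[pos:pos + OBJ] = 1.0
    return img

# Generate an observed image from a hidden position, with pixel noise.
true_pos = 11
observed = render(true_pos) + rng.normal(0.0, 0.2, WIDTH)

def log_likelihood(pos, sigma=0.2):
    """Gaussian pixel-noise likelihood of the observation given `pos`."""
    r = observed - render(pos)
    return -0.5 * np.sum(r ** 2) / sigma ** 2

# Inference: with a uniform prior over positions, the best explanation
# of the image is the maximum-likelihood parameter choice.
positions = np.arange(WIDTH - OBJ + 1)
inferred_pos = int(max(positions, key=log_likelihood))
```

Real inverse-graphics models replace this exhaustive scan with sampling-based inference over many parameters (object poses, camera, lighting), but the structure is the same: a generative program forward, a search over its parameter choices backward.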
Mapping structured scene interpretation models to neural computation is an important outstanding challenge. In addition to the purely visual aspects of this mapping, a full solution will have to address fundamental questions about how the brain represents structured information. In vision, these questions arise in even the simplest visual scenes. Consider, for instance, a ball inside a cup. We know how to represent the ball and the cup by feature vectors, and we may be able to extend the principle to find the vector for "inside" as well. The question, then, is how to combine all of these representations in a way that reflects the scene's structure, preserving, for instance, its distinctness from "cup inside ball." Considering neural representations of more complex objects, such as parse trees, only magnifies the problem. The question of neural representations of structured and relational data, sometimes called the "connectionist variable binding problem," has a long history, and a number of solutions have been proposed (Smolensky and Legendre, 2006; Shastri and Ajjanagadde, 1993; van der Velde and de Kamps, 2006). To date, though, none has emerged as a consensus solution. Open problems abound in this area, both for computational neuroscience in general and for fully structured vision in particular.
Acknowledgments  This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
References
Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., & Poggio, T. (2013). Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158.
Balduzzi, D., Vanchinathan, H., & Buhmann, J. (2014). Kickback cuts Backprop's red-tape: Biologically plausible credit assignment in neural networks. arXiv preprint arXiv:1411.6191.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76(4), 695–711.
Carlson, E. T., Rasquinha, R. J., Zhang, K., & Connor, C. E. (2011). A sparse object coding scheme in area V4. Current Biology, 21(4), 288–293.
Cichy, R. M., Pantazis, D., & Oliva, A. (2014). Resolving human object recognition in space and time. Nature Neuroscience, 17(3), 455–462.
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204.
Durbin, R., & Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1(1), 133–142.
Egner, T., Monti, J. M., & Summerfield, C. (2010). Expectation and surprise determine neural population responses in the ventral visual stream. Journal of Neuroscience, 30(49), 16601–16608.
Epstein, R., & Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature, 392(6676), 598–601.
Fei-Fei, L., Iyer, A., Koch, C., & Perona, P. (2007). What do we perceive in a glance of a real-world scene? Journal of Vision, 7(1), 10.
Ffytche, D. H. (2005). Visual hallucinations and the Charles Bonnet syndrome. Current Psychiatry Reports, 7(3), 168–179.
Friston, K. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 360(1456), 815–836.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2), 119–130.
Greene, M. R., & Oliva, A. (2006). Natural scene categorization from conjunctions of ecological global properties. In Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 291–296). Mahwah, NJ: Erlbaum.
Greene, M. R., & Oliva, A. (2009). The briefest of glances: The time course of natural scene understanding. Psychological Science, 20(4), 464–472.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852.
Hung, C. P., Kreiman, G., Poggio, T., & DiCarlo, J. J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749), 863–866.
Isik, L., Meyers, E. M., Leibo, J. Z., & Poggio, T. (2014). The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology, 111(1), 91–102.
Ito, M., & Komatsu, H. (2004). Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. Journal of Neuroscience, 24(13), 3313–3324.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. Journal of Neurophysiology, 73(1), 218–226.
Jin, Y., & Geman, S. (2006). Context and hierarchy in a probabilistic image model. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 2145–2152). Los Alamitos, CA: IEEE Computer Society.
Knoblich, U., Bouvrie, J., & Poggio, T. (2007). Biophysical models of neural computation: Max and tuning circuits (pp. 164–189). Berlin: Springer.
Kravitz, D. J., Peng, C. S., & Baker, C. I. (2011). Real-world scene representations in high-level visual cortex: It's the spaces more than the places. Journal of Neuroscience, 31(20), 7322–7333.
Kravitz, D. J., Saleem, K. S., Baker, C. I., Ungerleider, L. G., & Mishkin, M. (2013). The ventral visual pathway: An expanded neural framework for the processing of object quality. Trends in Cognitive Sciences, 17(1), 26–49.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In B. Schölkopf, J. Platt, & T. Hofmann (Eds.), Advances in Neural Information Processing Systems (pp. 1097–1105). San Mateo, CA: Morgan Kaufmann; Cambridge, MA: MIT Press.
Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., & Mansinghka, V. (2015). Picture: A probabilistic programming language for scene perception. In 2015 IEEE Conference on Computer Vision and Pattern Recognition.
Lampl, I., Ferster, D., Poggio, T., & Riesenhuber, M. (2004). Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. Journal of Neurophysiology, 92(5), 2704–2713.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. JOSA A, 20(7), 1434–1448.
Mansinghka, V., Kulkarni, T. D., Perov, Y. N., & Tenenbaum, J. (2013). Approximate Bayesian image interpretation using generative probabilistic graphics programs. In B. Schölkopf, J. Platt, & T. Hofmann (Eds.), Advances in Neural Information Processing Systems (pp. 1520–1528). San Mateo, CA: Morgan Kaufmann; Cambridge, MA: MIT Press.
McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33(9), 1101–1128.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Mumford, D. (1992). On the computational architecture of the neocortex. Biological Cybernetics, 66(3), 241–251.
Murphy, K., Torralba, A., & Freeman, W. T. (2003). Using the forest to see the trees: A graphical model relating features, objects, and scenes. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems (pp. 1499–1506). Cambridge, MA: MIT Press.
Murray, S. O., Schrater, P., & Kersten, D. (2004). Perceptual grouping and the interactions between visual cortical areas. Neural Networks, 17(5), 695–705.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Oliva, A., & Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155, 23–36.
O'Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5), 895–938.
Park, S., & Chun, M. M. (2009). Different roles of the parahippocampal place area (PPA) and retrosplenial cortex (RSC) in panoramic scene perception. NeuroImage, 47, 1747–1756.
Park, S., Brady, T. F., Greene, M. R., & Oliva, A. (2011). Disentangling scene content from its spatial boundary: Complementary roles for the PPA and LOC in representing real-world scenes. Journal of Neuroscience, 31(4), 1333–1340.
Potter, M. C., Wyble, B., Hagmann, C. E., & McCourt, E. S. (2014). Detecting meaning in RSVP at 13 ms per picture. Attention, Perception & Psychophysics, 76(2), 270–279.
Rao, R. P. (2007). Neural models of Bayesian belief propagation. In K. Doya, S. Ishii, A. Pouget, & R. P. N. Rao (Eds.), Bayesian Brain: Probabilistic Approaches to Neural Coding (p. 239). Cambridge, MA: MIT Press.
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87.
Renninger, L. W., & Malik, J. (2004). When is scene identification just texture recognition? Vision Research, 44(19), 2301–2311.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2014). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575.
Rust, N. C., & DiCarlo, J. J. (2010). Selectivity and tolerance ("invariance") both increase as visual information propagates from cortical area V4 to IT. Journal of Neuroscience, 30(39), 12978–12995.
Salakhutdinov, R., Torralba, A., & Tenenbaum, J. (2011). Learning to share visual appearance for multiclass object detection. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1481–1488). IEEE.
Schmolesky, M. T., Wang, Y., Hanes, D. P., Thompson, K. G., Leutgeb, S., Schall, J. D., et al. (1998). Signal timing across the macaque visual system. Journal of Neurophysiology, 79(6), 3272–3278.
Schyns, P. G., & Oliva, A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5(4), 195–200.
Serre, T., Oliva, A., & Poggio, T. (2007a). A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences of the United States of America, 104(15), 6424–6429.
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., & Poggio, T. (2007b). Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 411–426.
Shastri, L., & Ajjanagadde, V. (1993). From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16(3), 417–451.
Siagian, C., & Itti, L. (2007). Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2), 300–312.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Smolensky, P., & Legendre, G. (2006). The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar: Vol. 1. Cognitive Architecture. Cambridge, MA: MIT Press.
Socher, R., Lin, C. C., Manning, C., & Ng, A. Y. (2011). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 129–136).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2014). Going deeper with convolutions. arXiv preprint arXiv:1409.4842.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., et al. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Tang, H., Buia, C., Madhavan, R., Crone, N. E., Madsen, J. R., Anderson, W. S., et al. (2014). Spatiotemporal dynamics underlying object completion in human ventral visual cortex. Neuron, 83(3), 736–748.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381(6582), 520–522.
Torralba, A., Murphy, K. P., & Freeman, W. T. (2010). Using the forest to see the trees: Exploiting context for visual object detection and localization. Communications of the ACM, 53(3), 107–114.
Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4), 766–786.
Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. C. (2005). Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2), 113–140.
van der Velde, F., & de Kamps, M. (2006). Neural blackboard architectures of combinatorial structures in cognition. Behavioral and Brain Sciences, 29(1), 37–70.
Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 111(23), 8619–8624.
Yao, B. Z., Yang, X., Lin, L., Lee, M. W., & Zhu, S. C. (2010). I2T: Image parsing to text description. Proceedings of the IEEE, 98(8), 1485–1508.
Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818–833). Springer.
Zhu, L., Chen, Y., Yuille, A., & Freeman, W. (2010). Latent hierarchical structural learning for object detection. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1062–1069). New York: IEEE.
Zhu, S. C., & Mumford, D. (2007). A Stochastic Grammar of Images. Hanover, MA: Now.