
Detecting natural occlusion boundaries using local cues

Christopher DiMattina, Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA

Sean A. Fox, Natural Perception Laboratory, Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH, USA

Michael S. Lewicki, Natural Perception Laboratory, Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH, USA

Occlusion boundaries and junctions provide important cues for inferring three-dimensional scene organization from two-dimensional images. Although several investigators in machine vision have developed algorithms for detecting occlusions and other edges in natural images, relatively few psychophysics or neurophysiology studies have investigated what features are used by the visual system to detect natural occlusions. In this study, we addressed this question using a psychophysical experiment where subjects discriminated image patches containing occlusions from patches containing surfaces. Image patches were drawn from a novel occlusion database containing labeled occlusion boundaries and textured surfaces in a variety of natural scenes. Consistent with related previous work, we found that relatively large image patches were needed to attain reliable performance, suggesting that human subjects integrate complex information over a large spatial region to detect natural occlusions. By defining machine observers using a set of previously studied features measured from natural occlusions and surfaces, we demonstrate that simple features defined at the spatial scale of the image patch are insufficient to account for human performance in the task. To define machine observers using a more biologically plausible multiscale feature set, we trained standard linear and neural network classifiers on the rectified outputs of a Gabor filter bank applied to the image patches. We found that simple linear classifiers could not match human performance, while a neural network classifier combining filter information across location and spatial scale compared well. These results demonstrate the importance of combining a variety of cues defined at multiple spatial scales for detecting natural occlusions.

Keywords: occlusions, natural images, psychophysics, edge detection, neural networks

Citation: DiMattina, C., Fox, S. A., & Lewicki, M. S. (2012). Detecting natural occlusion boundaries using local cues. Journal of Vision, 12(13):15, 1–21, http://www.journalofvision.org/content/12/13/15, doi:10.1167/12.13.15. Received May 10, 2012; published December 18, 2012.

Introduction

One useful set of two-dimensional cues for inferring three-dimensional scene organization are the boundaries and junctions formed by the occlusions of distinct surfaces (Guzman, 1969; Nakayama, He, & Shimojo, 1995; Todd, 2004), as illustrated in Figure 1. In natural images, occlusion boundaries are defined by multiple cues, including local texture, color, and luminance differences, all of which are integrated perceptually (McGraw, Whitaker, Badcock, & Skillen, 2003; Rivest & Cavanagh, 1996). Although numerous machine vision studies have developed algorithms for detecting occlusions and junctions in natural images (Hoiem, Efros, & Hebert, 2011; Konishi, Yuille, Coughlin, & Zhu, 2003; Martin, Fowlkes, & Malik, 2004; Perona, 1992), relatively little work in visual psychophysics has directly studied natural occlusion detection (McDermott, 2004) or used natural occlusions in perceptual tasks (Fowlkes, Martin, & Malik, 2007).

In this study, we investigate the question of what locally available cues are used by human subjects to detect occlusion boundaries in natural scenes. We approach this problem by developing a novel database of natural occlusion boundaries taken from a set of uncompressed, calibrated images used in previous research (Arsenault, Yoonessi, & Baker, 2011; Kingdom, Field, & Olmos, 2007; Olmos & Kingdom, 2004). We demonstrate that our database exhibits strong intersubject agreement in the locations of the labeled occlusions, particularly when compared with edges derived from image segmentation databases. In addition, we find that a variety of simple visual features characterized in previous studies (Balboa & Grzywacz, 2000; Fine, MacLeod, & Boynton, 2003; Geisler, Perry, Super, & Gallogly, 2001; Ing, Wilson, & Geisler, 2010; Rajashekar, van der Linde, Bovik, & Cormack, 2007) can be used to distinguish occlusion and surface patches.

Using a simple two-alternative forced choice experiment, we test the ability of human subjects to discriminate local image regions containing either occlusions or single surfaces. In agreement with related work on junction detection (McDermott, 2004) and image classification (Torralba, 2009), we find that subjects require a fairly large image region (32 × 32 pixels) in order to make reliable judgments. Using a quadratic classifier analysis, we find that simple visual features defined on the scale of the whole image patch (i.e., luminance gradients) are insufficient to account for human performance, suggesting that human subjects integrate complex spatial information existing at multiple scales.

We investigated this possibility further by training standard linear and neural network classifiers on the rectified outputs of a set of Gabor filters applied to the occlusion and surface patches. We found that a linear classifier cannot fully account for subject performance, since this classifier simply detects low spatial frequency luminance edges. However, a neural network having a moderate number of hidden units compared much better to human performance by combining information from filters across multiple locations and spatial scales. Our analysis demonstrates that only one layer of processing beyond the initial filtering and rectification is needed for reliably detecting natural occlusions. Interpreting the hidden units as implementing "second-order" filters, our results are consistent with previous demonstrations that filter-rectify-filter (FRF) models can detect edges defined by cues other than luminance differences (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham, 1991; Landy, 1991).

This study complements and extends previous work by quantitatively demonstrating the importance of integrating complex multiscale spatial information when detecting natural occlusion edges (McDermott, 2004). Furthermore, this work provides the larger vision science community with a novel database of occlusion edges as well as a benchmark dataset of human performance on a standard edge-detection problem studied in machine vision (Konishi et al., 2003; Martin et al., 2004; Zhou & Mel, 2008). Finally, we discuss possible mechanisms for natural occlusion detection and suggest directions for future research.

Methods

Image databases

A set of 100 images containing few or no manmade objects was selected from a set of over 850 calibrated, uncompressed color images in the McGill Calibrated Color Image Database (Olmos & Kingdom, 2004). Images were selected to have a clear figure-ground organization and plenty of discernible occlusion boundaries. Some representative images are shown in Figure 2 (left column). A group of five paid undergraduate research assistants was instructed to label all of the clearly discernible, continuous occlusion boundaries using Adobe Photoshop layers. They were given the following instructions:

"Your task is to label the occlusion contours in the given set of 100 images. An occlusion contour is an edge or boundary where one object occludes or blocks the view of another object or region behind it. Label as many contours as you can, but you do not need to label contours that you are unsure of. Make each distinct contour a unique color to help with future analysis. Each contour must be continuous (i.e., one connected piece). Start by labeling contours on the largest and most prominent objects and work your way down to smaller and less prominent objects. Do not label extremely small contours like blades of grass."

Students worked independently, so their labeling reflected their independent judgment. The lead author (CD) hand-labeled all images as well, so there were six subjects in total.

In order to compare the statistics of occlusions with image regions not containing occlusions, a database of "surface" image patches was selected from the same images by the same subjects. "Surfaces" in this context were broadly defined as uniform image regions which do not contain any occlusions, and subjects were not given any explicit guidelines beyond the constraint that the regions they select should be relatively uniform and could not contain any occlusions (which was prevented by our custom-authored software). No constraints were imposed with respect to lighting, curvature, material, shadows, or luminance gradients. Therefore, some surface patches contained substantial luminance gradients, for instance patches of zebra skin (Figure 3). Each subject selected 10 surface regions (60 × 60) from each of the 100 images, and for our analyses we extracted image patches of various sizes (8 × 8, 16 × 16, 32 × 32) at random locations from these larger 60 × 60 regions. Example 32 × 32 surface patches are shown in Figure 2 (middle panel), and examples of both surface and occlusion patches are shown in Figure 3.

Figure 1. Occlusion of one surface by another in depth gives rise to image patches containing occlusion edges (magenta circle) and junctions (cyan and purple circles).

Figure 2. Representative images from the occlusion boundary database together with subject occlusion labelings. Left: Original color images. Middle: Grayscale images with overlaid pixels labeled as occlusions (white lines) and examples of surface regions (magenta squares). Right: Overlaid plots of subject occlusion labelings taken from an 87 × 115 pixel central region of the images (indicated by cyan squares in the middle column). Darker pixels were labeled by more subjects, lighter pixels by fewer subjects.

Figure 3. Examples of 32 × 32 occlusions (left), surfaces (right), and shadow edges not defined by occlusions (bottom).


Quantifying subject consistency

In order to quantify intersubject consistency in the locations of the pixels labeled as occlusions, we applied the precision-recall analysis commonly used in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). In addition, we also developed a novel analysis method, which we call the most-conservative subject (MCS) analysis, which controls for the fact that disagreement between subjects in the location of labeled occlusions often arises simply because some subjects are more exhaustive in their labeling than others.

Precision-recall analysis is often used in machine vision studies of edge detection in order to quantify the trade-off between correctly detecting all edges (recall) and not incorrectly labeling non-edges as edges (precision). Mathematically, precision (P) and recall (R) are defined as

\[ P = \frac{tp}{tp + fp} \qquad (1) \]

\[ R = \frac{tp}{tp + fn} \qquad (2) \]

where tp, fp, tn, fn are the true and false positives and true and false negatives, respectively. Typically these quantities are determined by comparing a machine-generated test edgemap E to a ground-truth reference edgemap G derived from hand-annotated images (Martin et al., 2004). Since all of our edgemaps were human generated, we performed a "leave-one-out" analysis where we compared a test edgemap from each subject to a reference "ground truth" edgemap defined by combining edgemaps from all of the other remaining subjects. Since our goal for this analysis was simply to compare human performance on two different tasks (occlusion labeling and region labeling), we did not make use of the sophisticated boundary-matching procedures (Goldberg & Kennedy, 1995) used in previous studies to optimize the comparisons between human data and machine performance (Martin et al., 2004). We quantify the overall agreement between the test and reference edgemaps using the weighted harmonic mean of P and R, defined by

\[ F = \frac{PR}{\alpha P + (1 - \alpha)R} \qquad (3) \]

This quantity F is known as an F-measure and was originally developed to quantify the accuracy of document retrieval methods (Rijsbergen, 1979), but it is also applied in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). The parameter α determines the relative weight of precision and recall, and we used α = 0.5 in our analysis.
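As a concrete illustration, here is a minimal sketch of this computation in Python (our own illustration, not the authors' code; exact pixel-wise matching is assumed, with no spatial tolerance for small localization offsets):

```python
import numpy as np

def precision_recall_f(test, ref, alpha=0.5):
    """P, R (Equations 1-2) and F-measure (Equation 3) for two
    binary edgemaps of the same shape (True = labeled edge pixel)."""
    tp = np.sum(test & ref)    # edge pixels labeled in both maps
    fp = np.sum(test & ~ref)   # labeled in test but not in reference
    fn = np.sum(~test & ref)   # in reference but missed by test
    P = tp / (tp + fp)
    R = tp / (tp + fn)
    F = P * R / (alpha * P + (1 - alpha) * R)
    return P, R, F
```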

In addition to the precision-recall analysis, we developed a novel method for quantifying intersubject consistency which minimizes problems of intersubject disagreement arising from the fact that certain subjects are simply more exhaustive in labeling all possible occlusions than other subjects. We defined the most conservative subject (MCS) for a given image as the subject who had labeled the fewest pixels. Using the MCS labeling, we generate a binary image mask F which is 1 for any pixel within R pixels of an occlusion labeled by the MCS and 0 for all other pixels. Applying this mask to the labeling of each subject yields a "reduced" labeling which is valid for intersubject comparison, since it only includes the most prominent occlusions labeled by all of the subjects. To calculate the comparison between two subjects, we randomly assigned one binary edgemap as the "reference" (I_ref) and the other binary edgemap as the "test" (I_test). We used the reference map to define a weighting function f_c(r), applied to all of the pixels in the test map, which quantified how close each pixel in the test map was to a pixel in the reference map. Mathematically, our index is given by

\[ f = \frac{1}{N_t} \sum_{(x,y) \in I_{\mathrm{test}}} f_c\left( r\left( (x,y), I_{\mathrm{ref}} \right) \right) I_{\mathrm{test}}(x,y) \qquad (4) \]

where r((x,y), I_ref) is the distance between (x,y) and the closest pixel in I_ref, N_t is the number of pixels in the test set, and the function f_c(r) is defined for 0 ≤ c < ∞ by

\[ f_c(r) = \begin{cases} e^{-cr} & 0 \le r \le R \\ 0 & r > R \end{cases} \qquad (5) \]

and for c = ∞ by

\[ f_\infty(r) = \begin{cases} 1 & r = 0 \\ 0 & r > 0 \end{cases} \qquad (6) \]

where R is the radius of the mask, which we set to R = 10 in our analysis. The parameter c in our weighting function sets the sensitivity of f_c to the distance between the reference labeling and the test labeling. Setting c = 0 counts the fraction of pixels in the test edgemap which lie inside the mask generated by the reference edgemap, and setting c = ∞ measures the fraction of pixels in complete agreement between the two edgemaps.
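A sketch of how Equations 4 through 6 might be computed using a Euclidean distance transform (our own illustration; the exact distance metric of the original analysis is an assumption):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def similarity_index(test, ref, c=1.0, R=10):
    """Intersubject similarity index f (Equation 4) between two binary
    edgemaps; test and ref are boolean arrays, True = edge pixel."""
    # r((x,y), I_ref): distance from each pixel to the nearest ref pixel
    r = distance_transform_edt(~ref)
    # f_c(r) (Equation 5): exponential falloff inside the mask radius R;
    # for the c = infinity case (Equation 6), use w = (r == 0) instead
    w = np.where(r <= R, np.exp(-c * r), 0.0)
    # average the weights over the N_t labeled pixels of the test map
    return w[test].mean()
```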

Statistical measurements

Patch extraction and region labeling

Occlusion patches of varying sizes centered on an occlusion boundary were extracted automatically from the database by choosing random pixels labeled as occlusions by a single subject, cycling through the six subjects. Since we only wanted patches containing a single occlusion separating two regions (figure and ground) of roughly equal size, we only accepted a candidate patch when:


1. The composite occlusion edge map from all subjects consisted of a single connected piece in the analysis window.

2. The occlusion contacted the sides of the window at two distinct points and divided the patch into two regions.

3. Each region comprised at least 35% of the pixels.

Each occlusion patch consisted of two regions of roughly equal size separated by a single boundary, with the central pixel (w/2, w/2) of a patch of size w always being an occlusion (Figure 4a). Note that this procedure actually yields a subset of all possible occlusions, since it excludes T-junctions or occlusions formed by highly convex boundaries. Since the selection of occlusion patches was automated, there were no constraints on the properties of the surfaces on either side of the occlusion with respect to factors like lighting, shadows, reflectance, or material. In addition to the occlusion patches, we extracted surface patches at random locations from the set of 60 × 60 surface regions chosen by the subjects. We used 8 × 8, 16 × 16, and 32 × 32 patches for our analyses and psychophysical experiments.
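For criteria 1 and 3 above, connected-component labeling gives a direct test; the following sketch assumes the composite edge map and the two region masks are already available (criterion 2, contact with the window sides at two points, is omitted for brevity):

```python
import numpy as np
from scipy.ndimage import label

def accept_candidate(edge_win, region1, region2, min_frac=0.35):
    """Screen a candidate occlusion patch (criteria 1 and 3).
    edge_win: boolean composite edge map in the analysis window;
    region1, region2: boolean masks of the two surface regions."""
    # Criterion 1: edge pixels form a single connected piece
    _, n_components = label(edge_win)
    if n_components != 1:
        return False
    # Criterion 3: each region covers at least 35% of the pixels
    return min(region1.mean(), region2.mean()) >= min_frac
```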

Grayscale scalar measurements

To obtain grayscale images, we converted the raw images into gamma-corrected RGB images using software available online at http://tabby.vision.mcgill.ca (Olmos & Kingdom, 2004). We then mapped the RGB color space to the NTSC color space, obtaining the grayscale luminance I = 0.2989R + 0.5870G + 0.1140B (Acharya & Ray, 2005). From these patches we measured a variety of visual features which can be used to distinguish occlusion from surface patches. Some of these features (for instance, luminance difference) depend on there being a region labeling of the image patch which separates it into regions corresponding to two different surfaces (Figure 4a). However, measuring these same features from surface patches is impossible, since surface patches only contain a single surface. Therefore, in order to measure region labeling-dependent features from the uniform surface patches, we assigned to each surface patch a set of 25 "dummy" region labelings from our database (spanning all possible orientations). The measured value of the feature was then taken as the maximum value over all 25 dummy labelings, which is sensible since all of the visual features were on average larger for occlusions than uniform surface patches.

Given a grayscale image patch and a region labeling R = {R_1, R_2, B} partitioning the patch into regions corresponding to the two surfaces (R_1, R_2) as well as the set of boundary (B) pixels (Figure 4a), we measured the following visual features, taken from the computational vision literature:

G1. Luminance difference Δl:

\[ \Delta l = |l_1 - l_2| \qquad (7) \]

where l_1, l_2 are the mean luminances in regions R_1, R_2, respectively.

G2. Contrast difference Δσ:

\[ \Delta\sigma = |\sigma_1 - \sigma_2| \qquad (8) \]

where σ_1, σ_2 are the contrasts (standard deviations) in regions R_1, R_2, respectively. Features G1 and G2 were both measured in previous studies on surface segmentation (Fine et al., 2003; Ing et al., 2010).

G3. Boundary luminance gradient G_B:

\[ G_B = \frac{\|\nabla I\|}{\bar{I}} \qquad (9) \]

where \nabla I(x, y) = [\partial I(x,y)/\partial x, \partial I(x,y)/\partial y]^T is the gradient of the image patch evaluated at the central pixel and \bar{I} is the average intensity of the image patch (Balboa & Grzywacz, 2000).

G4. Oriented energy E_θ:

\[ E_\theta = \sum_{i=1}^{N_\theta} \left( \mathbf{w}_i^{\mathrm{even}} \cdot \mathbf{x} \right)^2 + \left( \mathbf{w}_i^{\mathrm{odd}} \cdot \mathbf{x} \right)^2 \qquad (10) \]

where x is the image patch in vector form and w_i^even, w_i^odd are a quadrature-phase pair of Gabor filters at N_θ = 8 evenly spaced orientations θ_i. For patch size w, the filters had means w/2 and standard deviations of w/4. Oriented energy has been used in several previous studies as a means of detecting edges (Geisler et al., 2001; Lee & Choe, 2003; Sigman, Cecchi, Gilbert, & Magnasco, 2001).

Figure 4. Illustration of stimuli used in the psychophysical experiments. (a) Original color and grayscale occlusion edge (top, middle) and its region label (bottom). The region label divides the patch into regions corresponding to the two surfaces (white, gray) as well as the boundary (black). (b) Top: Grayscale image patch with texture information removed. Middle: Occlusion patch with boundary removed. Bottom: Boundary and luminance information removed.

G5. Global patch contrast ρ:

\[ \rho = \mathrm{std}(I) \qquad (11) \]

This quantity has been measured in previous studies quantifying statistical differences between fixated image regions and random image regions for subjects free-viewing natural images while their eyes are being tracked (Rajashekar et al., 2007; Reinagel & Zador, 1999). Several studies of this kind have suggested that subjects may be preferentially looking at edges (Baddeley & Tatler, 2006).

Note that features G3–G5 are measured globally from the entire patch, whereas features G1 and G2 are differences between statistics measured from different regions of the patch.

Color scalar measurements

Images were converted from RGB to LMS color space using a MATLAB program which accompanies the images in the McGill database (Olmos & Kingdom, 2004). We converted the logarithmically transformed LMS images into an Lab color space by performing principal components analysis (PCA) on the set of LMS pixel intensities (Fine et al., 2003; Ing et al., 2010). Projections onto the axes of the Lab basis represent a color pixel in terms of its overall luminance (L), blue-yellow opponency (a), and red-green opponency (b).

We measured two additional properties from the LMS color image patches represented in the Lab basis:

C1. Blue-yellow difference Δa:

\[ \Delta a = |a_1 - a_2| \qquad (12) \]

where a_1, a_2 are the mean values of the B-Y opponency component a in regions R_1, R_2, respectively.

C2. Red-green difference Δb:

\[ \Delta b = |b_1 - b_2| \qquad (13) \]

where b_1, b_2 are the mean values of the R-G opponency component b in regions R_1, R_2, respectively.

These color scalar statistics were motivated by previous work studying human perceptual discrimination of different surfaces (Fine et al., 2003; Ing et al., 2010).
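A sketch of the color measurements, assuming the 3 × 3 PCA basis has already been learned from log-LMS pixel values across the image set (our own illustration of the procedure described above):

```python
import numpy as np

def color_features(lms_patch, r1, r2, pca_basis):
    """C1 and C2 (Equations 12-13) for one LMS color patch.
    lms_patch: (h, w, 3) array of LMS values; pca_basis: 3 x 3 matrix
    whose rows are the L, a, b principal axes of log-LMS pixels."""
    h, w, _ = lms_patch.shape
    lab = np.log(lms_patch.reshape(-1, 3)) @ pca_basis.T
    a = lab[:, 1].reshape(h, w)   # blue-yellow opponency
    b = lab[:, 2].reshape(h, w)   # red-green opponency
    da = abs(a[r1].mean() - a[r2].mean())   # Equation 12
    db = abs(b[r1].mean() - b[r2].mean())   # Equation 13
    return da, db
```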

Machine classifiers

Quadratic classifier analysis

In order to study how well various feature subsets measured from the image patches could predict human performance, we made use of a quadratic classifier analysis. The quadratic classifier is a natural choice for quantifying the discriminability of two categories defined by features having multivariate Gaussian distributions (Duda, Hart, & Stork, 2000) and has been used in previous work studying the perceptual discriminability of surfaces (Ing et al., 2010). Assume that we have two categories C_1, C_2 of stimuli from which we can measure n features u = (u_1, u_2, ..., u_n)^T, and that features measured from each category are Gaussian distributed:

\[ p(\mathbf{u}|C_i) = \mathcal{N}(\mathbf{u}|\boldsymbol{\mu}_i, \Sigma_i) \qquad (14) \]

where μ_i and Σ_i are the mean and covariance of each category. Given a novel observation with feature vector u, assuming that the two categories are equally likely a priori, we evaluate the log-likelihood ratio

\[ L_{12}(\mathbf{u}) = \ln p(\mathbf{u}|C_1) - \ln p(\mathbf{u}|C_2) \qquad (15) \]

choosing C_1 when L_12 ≥ 0 and C_2 when L_12 < 0. In the case of Gaussian distributions for each category, as in Equation 14, Equation 15 can be rewritten as

\[ L_{12}(\mathbf{u}) = \frac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left( Q_2(\mathbf{u}) - Q_1(\mathbf{u}) \right) \qquad (16) \]

where

\[ Q_i(\mathbf{u}) = (\mathbf{u} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{u} - \boldsymbol{\mu}_i) \qquad (17) \]

The task for applying this formalism is to define a set of n features to measure from the set of image patches and then use these measurements to define the means and covariances of each category in a supervised manner. New image patches that are unlabeled can then be classified using this quadratic classifier.

In our analyses, category C_1 was occlusion patches and category C_2 was surface patches. We estimated the parameters of the classifiers for each category by taking the means and covariances of the statistics measured from a set of 2000 image patches (training set), and we applied these classifiers to different 400-patch subsets of the 1000 image patches which were presented to the subjects (test set). This analysis was performed for multiple classifiers defined by different subsets of parameters and for image patches of all sizes (8 × 8, 16 × 16, 32 × 32).
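A sketch of this classifier (Equations 14 through 17), fit by plugging in sample means and covariances of the log-transformed features (our own illustration):

```python
import numpy as np

def fit_gaussian(X):
    """Fit one class-conditional Gaussian (Equation 14).
    Rows of X are feature vectors from training patches."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_likelihood_ratio(u, mu1, S1, mu2, S2):
    """L_12(u) (Equation 16); classify as occlusion (C1) when >= 0."""
    def Q(u, mu, S):  # quadratic form, Equation 17
        d = u - mu
        return d @ np.linalg.solve(S, d)
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1) + 0.5 * (Q(u, mu2, S2) - Q(u, mu1, S1))
```

For example, one would call fit_gaussian once on the occlusion training features and once on the surface training features, then threshold log_likelihood_ratio at zero for each test patch.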

SVM classifier analysis

As an additional control, in addition to the quadratic classifier analysis we trained a Support Vector Machine (SVM) classifier (Cristianini & Shawe-Taylor, 2000) to discriminate occlusions and surfaces using our grayscale visual feature set (G1–G5). The SVM classifier is a standard and well-studied method in machine learning which achieves good classification results by learning the separating hyperplane that maximizes the margin between two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.
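The original analysis used the MATLAB functions named above; an equivalent sketch in Python using scikit-learn (an assumption on our part, not the authors' implementation; the linear kernel matches the MATLAB default):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: rows are log-feature vectors (G1-G5) per patch,
# labels are 1 = occlusion, 0 = surface.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 5)), rng.integers(0, 2, 2000)
X_test, y_test = rng.normal(size=(400, 5)), rng.integers(0, 2, 400)

svm = make_pipeline(StandardScaler(), SVC(kernel='linear'))
svm.fit(X_train, y_train)
print("proportion correct:", svm.score(X_test, y_test))
```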

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales and that appropriate scale selection and integration across scale is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multiscale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983), and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using Independent Component Analysis (ICA). These filters closely resemble Gabor filters but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.

Experimental paradigm

In the psychophysical experiments, image patches were displayed on a 24-inch Macintosh Cinema Display (Apple, Inc., Cupertino, CA). Since the RGB images were linearized to correct for camera gamma (Olmos & Kingdom, 2004), we set the display monitor to have a gamma of approximately 1 using the Mac Calibration Assistant, so that the displayed images would be as natural looking as possible. Subjects were seated in a dim room in front of the monitor, and image patches were presented at the center of the screen, scaled to subtend 1.5 degrees of visual angle at the approximately 12-inch (30.5 cm) viewing distance. All stimuli subtended the same visual angle to eliminate confounds between the size of the image on the retina and its pixel dimensions. Previous studies have demonstrated that human performance on a similar task depends only on the number of pixels (McDermott, 2004), so by holding the retinal size constant, the only variable is the number of pixels.

Our experimental paradigm is illustrated in Figure 6. Surrounding the image patch was a set of flanking crosshairs whose imaginary intersection defines the "center" of the image patch. In the standard task, the subject decides whether an occlusion boundary passes through the image patch center with a binary (1/0) key-press. Half of the patches were taken from the occlusion boundary database (where all occlusions pass through the patch center), and the other half were taken from the surface database. Therefore, guessing on the task would yield performance of 50 percent correct. The design of this experiment is similar to a previous study on detecting T-junctions in natural images (McDermott, 2004). To optimize performance, subjects were given as much time as they needed for each patch. Furthermore, since perceptual learning can improve performance (Ing et al., 2010), we provided positive and negative feedback.

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2–3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on surface segmentation where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) was not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches, as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side, without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information, we also created a "boundary removed" condition which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For proportion estimate p, we compute the 1 − α confidence interval as

\[ p \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}} \qquad (18) \]

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
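A sketch of the confidence interval computation of Equation 18 (our own illustration):

```python
import numpy as np
from scipy.stats import norm

def proportion_ci(p_hat, n=400, alpha=0.05):
    """1 - alpha confidence interval for proportion correct (Eq. 18)."""
    z = norm.ppf(1 - alpha / 2)            # z_{1 - alpha/2}, about 1.96
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width
```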

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may arise from other cues like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.

In other annotated edge databases like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels is labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue) for 98 BSD images segmented by six subjects. We find that, averaged across all images and subjects, about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps from the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^-121, Wilcoxon rank-sum).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image databases). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 1.0 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^-202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image databases). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that, for this analysis, there is significantly better agreement between edgemaps in the COB database than those obtained indirectly from the BSD segmentations (p < 10^-28).

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 1.0. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion, which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches are comprised of a single region, and therefore it is unclear how we can measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions than for surfaces, we took the maximum over all 25 dummy labelings as the measured value of the feature. A full description of measured features is given in Methods (Statistical measurements).

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy at low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, obtained by fitting a line to the power spectrum of each individual image, found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
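For reference, a minimal sketch of this kind of exponent estimate (our own illustration; the paper's radial averaging and fitting details are our assumptions):

```python
import numpy as np

def spectrum_exponent(patch):
    """Exponent of the power spectrum (power ~ 1/f^a), estimated by a
    line fit to log power vs. log spatial frequency."""
    n = patch.shape[0]
    f = np.fft.fftshift(np.fft.fft2(patch - patch.mean()))
    power = np.abs(f) ** 2
    yy, xx = np.indices((n, n)) - n // 2
    r = np.hypot(xx, yy).astype(int)            # integer radial frequency
    keep = (r > 0) & (r <= n // 2)
    sums = np.bincount(r[keep], power[keep])[1:]
    counts = np.bincount(r[keep])[1:]
    radial = sums / counts                      # radially averaged power
    freqs = np.arange(1, len(radial) + 1)
    slope = np.polyfit(np.log(freqs), np.log(radial), 1)[0]
    return -slope    # about 2 for textures, about 2.6 for occlusions
```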

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches. Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log ρ and contrast difference log Δσ). These results are consistent with previous work which demonstrates that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we measured color features (C1–C2; Methods: Statistical measurements) as well by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1000 occlusions and 1000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch for as long as they needed (1–2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed by six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance differs significantly across the patch sizes we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line) together with the grayscale data from these same subjects (black dashed line). We see from this plot that, at all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects make use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We then were able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture cues or luminance cues, does not affect subject performance.

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (which were not used to train the classifier), using 200 Monte Carlo resamplings.

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions aremost important for the quadratic classifier detecting

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) Classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) Classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.


occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and overall patch variance (ρ). We also see that contrast difference (Δσ) by itself is a fairly weak parameter for all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one in order of decreasing d′ for 16 × 16 and 32 × 32 patches, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δl, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with ρ and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent of luminance cues.
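To make the ranking procedure concrete, the Python sketch below computes a single-feature d′ and orders features from highest to lowest; the pooled-variance formula for d′ is our assumption (the text does not spell out its exact definition), and the array names F_occ, F_surf are hypothetical.

```python
import numpy as np

def dprime(x_occ, x_surf):
    """Discriminability of one feature dimension; the pooled-variance
    definition of d' used here is an assumption, not stated in the text."""
    return abs(x_occ.mean() - x_surf.mean()) / np.sqrt(
        (x_occ.var() + x_surf.var()) / 2.0)

def rank_features(F_occ, F_surf):
    """Order feature columns from highest to lowest d', the order in which
    dimensions are added to the quadratic classifier in Figure 15.
    F_occ, F_surf: (n_patches, n_features) arrays of measured features."""
    d = np.array([dprime(F_occ[:, j], F_surf[:, j])
                  for j in range(F_occ.shape[1])])
    return np.argsort(d)[::-1]
```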

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian-distributed class-conditional densities (Duda

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) Classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) Classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.


et al., 2000) and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield superior performance to the quadratic classifier and hence provide a better match with our perceptual data.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the reason for the poor match of the quadratic classifier with human performance is the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classification method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist on multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features that quantify texture differences. This is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Component Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b where, instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training to evaluate their performance.

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit 'receptive fields' do not seem to resemble

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.


the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection, and a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models that could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot determine from our simple analysis the detailed form this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting

Figure 17. Schematic illustration of weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.


performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). This study found that rather large image regions were needed for subjects to attain reliable performance on junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: that a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and that many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has been mostly considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as 'Filter-Rectify-Filter' (FRF) models, sometimes also referred to as 'back-pocket' models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. This is important, since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.



characterized in previous studies (Balboa & Grzywacz, 2000; Fine, MacLeod, & Boynton, 2003; Geisler, Perry, Super, & Gallogly, 2001; Ing, Wilson, & Geisler, 2010; Rajashekar, van der Linde, Bovik, & Cormack, 2007) can be used to distinguish occlusion and surface patches.

Using a simple two-alternative forced choice experiment, we test the ability of human subjects to discriminate local image regions containing either occlusions or single surfaces. In agreement with related work on junction detection (McDermott, 2004) and image classification (Torralba, 2009), we find that subjects require a fairly large image region (32 × 32 pixels) in order to make reliable judgments. Using a quadratic classifier analysis, we find that simple visual features defined on the scale of the whole image patch (i.e., luminance gradients) are insufficient to account for human performance, suggesting that human subjects integrate complex spatial information existing at multiple scales.

We investigated this possibility further by training standard linear and neural network classifiers on the rectified outputs of a set of Gabor filters applied to the occlusion and surface patches. We found that a linear classifier cannot fully account for subject performance, since this classifier simply detects low spatial frequency luminance edges. However, a neural network having a moderate number of hidden units compared much better to human performance by combining information from filters across multiple locations and spatial scales. Our analysis demonstrates that only one layer of processing beyond the initial filtering and rectification is needed for reliably detecting natural occlusions. Interpreting the hidden units as implementing 'second-order' filters, our results are consistent with previous demonstrations that filter-rectify-filter (FRF) models can detect edges defined by cues other than luminance differences (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham, 1991; Landy, 1991).

This study complements and extends previous work by quantitatively demonstrating the importance of integrating complex multiscale spatial information when detecting natural occlusion edges (McDermott, 2004). Furthermore, this work provides the larger vision science community with a novel database of occlusion edges, as well as a benchmark dataset of human performance on a standard edge-detection problem studied in machine vision (Konishi et al., 2003; Martin et al., 2004; Zhou & Mel, 2008). Finally, we discuss possible mechanisms for natural occlusion detection and suggest directions for future research.

Methods

Image databases

A set of 100 images containing few or no manmade objects was selected from a set of over 850 calibrated, uncompressed color images in the McGill Calibrated Color Image Database (Olmos & Kingdom, 2004). Images were selected to have a clear figure-ground organization and plenty of discernible occlusion boundaries. Some representative images are shown in Figure 2 (left column). A group of five paid undergraduate research assistants was instructed to label all of the clearly discernible, continuous occlusion boundaries using Adobe Photoshop layers. They were given the following instructions:

"Your task is to label the occlusion contours in the given set of 100 images. An occlusion contour is an edge or boundary where one object occludes or blocks the view of another object or region behind it. Label as many contours as you can, but you do not need to label contours that you are unsure of. Make each distinct contour a unique color to help with future analysis. Each contour must be continuous (i.e., one connected piece). Start by labeling contours on the largest and most prominent objects and work your way down to smaller and less prominent objects. Do not label extremely small contours like blades of grass."

Students worked independently, so their labeling reflected their independent judgment. The lead author (CD) hand-labeled all images as well, so there were six subjects total.

In order to compare the statistics of occlusions with image regions not containing occlusions, a database of 'surface' image patches was selected from the same images by the same subjects. 'Surfaces' in this context were broadly defined as uniform image regions which do not contain any occlusions, and subjects were not given any explicit guidelines beyond the constraint that the regions they select should be relatively uniform and could not contain any occlusions (which was prevented by our custom-authored software). No constraints were imposed with respect to lighting, curvature, material,

Figure 1. Occlusion of one surface by another in depth gives rise to image patches containing occlusion edges (magenta circle) and junctions (cyan and purple circles).


shadows, or luminance gradients. Therefore, some surface patches contained substantial luminance gradients, for instance patches of zebra skin (Figure 3). Each subject selected 10 surface regions (60 × 60) from each of the 100 images, and for our analyses we extracted image patches of various sizes (8 × 8, 16 × 16, 32 × 32) at random locations from these larger 60 × 60 regions. Example 32 × 32 surface patches are shown in Figure 2 (middle panel), and examples of both surface and occlusion patches are shown in Figure 3.

Figure 2. Representative images from the occlusion boundary database, together with subject occlusion labelings. Left: Original color images. Middle: Grayscale images with overlaid pixels labeled as occlusions (white lines) and examples of surface regions (magenta squares). Right: Overlaid plots of subject occlusion labelings taken from an 87 × 115 pixel central region of the images (indicated by cyan squares in the middle column). Darker pixels were labeled by more subjects, lighter pixels by fewer subjects.

Figure 3. Examples of 32 × 32 occlusions (left), surfaces (right), and shadow edges not defined by occlusions (bottom).


Quantifying subject consistency

In order to quantify intersubject consistency in the locations of the pixels labeled as occlusions, we applied precision-recall analysis, commonly used in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). In addition, we also developed a novel analysis method, which we call the most-conservative subject (MCS) analysis, which controls for the fact that disagreement between subjects in the location of labeled occlusions often arises simply because some subjects are more exhaustive in their labeling than others.

Precision-recall analysis is often used in machine vision studies of edge detection in order to quantify the trade-off between correctly detecting all edges (recall) and not incorrectly labeling non-edges as edges (precision). Mathematically, precision (P) and recall (R) are defined as

\[ P = \frac{tp}{tp + fp} \tag{1} \]

\[ R = \frac{tp}{tp + fn} \tag{2} \]

where tp, fp, tn, fn are the true and false positives and true and false negatives, respectively. Typically these quantities are determined by comparing a machine-generated test edgemap E to a ground-truth reference edgemap G derived from hand-annotated images (Martin et al., 2004). Since all of our edgemaps were human generated, we performed a 'leave-one-out' analysis where we compared a test edgemap from each subject to a reference 'ground truth' edgemap defined by combining edgemaps from all of the other remaining subjects. Since our goal for this analysis was simply to compare human performance on two different tasks (occlusion labeling and region labeling), we did not make use of sophisticated boundary-matching procedures (Goldberg & Kennedy, 1995) used in previous studies to optimize the comparisons between human data and machine performance (Martin et al., 2004). We quantify the overall agreement between the test and reference edgemaps using the weighted harmonic mean of P and R, defined by

\[ F = \frac{PR}{\alpha P + (1 - \alpha)R} \tag{3} \]

This quantity F is known as an F-measure and was originally developed to quantify the accuracy of document retrieval methods (Rijsbergen, 1979), but it is also applied in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). The parameter α determines the relative weight of precision and recall, and we used α = 0.5 in our analysis.
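As a concrete illustration of Equations 1–3, here is a minimal Python sketch for a pair of binary edgemaps; it scores exact pixel-wise agreement, a simplification of the leave-one-out comparison described above.

```python
import numpy as np

def precision_recall_f(test_map, ref_map, alpha=0.5):
    """Precision, recall, and F-measure (Equations 1-3) for binary
    edgemaps, using exact pixel-wise agreement (a simplification;
    no tolerance is given for small localization errors)."""
    test, ref = test_map.astype(bool), ref_map.astype(bool)
    tp = np.sum(test & ref)    # edge pixels labeled in both maps
    fp = np.sum(test & ~ref)   # labeled in the test map only
    fn = np.sum(~test & ref)   # labeled in the reference map only
    P = tp / (tp + fp)         # Equation 1
    R = tp / (tp + fn)         # Equation 2
    F = P * R / (alpha * P + (1 - alpha) * R)  # Equation 3
    return P, R, F
```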

In addition to the precision-recall analysis, we developed a novel method for quantifying intersubject consistency which minimizes problems of intersubject disagreement arising from the fact that certain subjects are simply more exhaustive in labeling all possible occlusions than other subjects. We defined the most conservative subject (MCS) for a given image as the subject who had labeled the fewest pixels. Using the MCS labeling, we generate a binary image mask F which is 1 for any pixel within R pixels of an occlusion labeled by the MCS and 0 for all other pixels. Applying this mask to the labeling of each subject yields a 'reduced' labeling which is valid for intersubject comparison, since it only includes the most prominent occlusions labeled by all of the subjects. To calculate the comparison between two subjects, we randomly assigned one binary edgemap as the 'reference' (I_ref) and the other binary edgemap as the 'test' (I_test). We used the reference map to define a weighting function f_c(r), applied to all of the pixels in the test map, that quantified how close each pixel in the test map was to a pixel in the reference map. Mathematically, our index is given by

\[ f = \frac{1}{N_t} \sum_{(x,y) \in I_{test}} f_c\!\left( r\left( (x,y), I_{ref} \right) \right) I_{test}(x,y) \tag{4} \]

where r((x,y), I_ref) is the distance between (x,y) and the closest pixel in I_ref, N_t is the number of pixels in the test set, and the function f_c(r) is defined for 0 ≤ c < ∞ by

\[ f_c(r) = \begin{cases} e^{-cr} & 0 \le r \le R \\ 0 & r > R \end{cases} \tag{5} \]

and for c = ∞ by

\[ f_\infty(r) = \begin{cases} 1 & r = 0 \\ 0 & r > 0 \end{cases} \tag{6} \]

where R is the radius of the mask, which we set to R = 10 in our analysis. The parameter c in our weighting function sets the sensitivity of f_c to the distance between the reference labeling and the test labeling. Setting c = 0 counts the fraction of pixels in the test edgemap which lie inside the mask generated by the reference edgemap, and setting c = ∞ measures the fraction of pixels in complete agreement between the two edgemaps.
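The index of Equations 4–6 can be sketched in a few lines of Python; we assume SciPy's Euclidean distance transform for the distance r((x,y), I_ref), and the original implementation may differ in detail.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def mcs_index(test_map, ref_map, c=0.0, R=10):
    """Consistency index f of Equation 4. Each edge pixel in the test
    map is weighted by f_c(r), where r is the distance to the nearest
    edge pixel in the reference map (Equations 5 and 6)."""
    # Distance from every pixel to the nearest reference edge pixel
    r = distance_transform_edt(~ref_map.astype(bool))
    if np.isinf(c):
        w = (r == 0).astype(float)                 # Equation 6
    else:
        w = np.where(r <= R, np.exp(-c * r), 0.0)  # Equation 5
    return w[test_map.astype(bool)].mean()         # Equation 4
```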

Statistical measurements

Patch extraction and region labeling

Occlusion patches of varying sizes centered on an occlusion boundary were extracted automatically from the database by choosing random pixels labeled as occlusions by a single subject, cycling through the six subjects. Since we only wanted patches containing a single occlusion separating two regions (figure and ground) of roughly equal size, we only accepted a candidate patch when:


1. The composite occlusion edge map from all subjects consisted of a single connected piece in the analysis window.

2. The occlusion contacted the sides of the window at two distinct points and divided the patch into two regions.

3. Each region comprised at least 35% of the pixels.

Each occlusion patch consisted of two regions of roughly equal size separated by a single boundary, with the central pixel (w/2, w/2) of a patch of size w always being an occlusion (Figure 4a). Note that this procedure actually yields a subset of all possible occlusions, since it excludes T-junctions or occlusions formed by highly convex boundaries. Since the selection of occlusion patches was automated, there were no constraints on the properties of the surfaces on either side of the occlusion with respect to factors like lighting, shadows, reflectance, or material. In addition to the occlusion patches, we extracted surface patches at random locations from the set of 60 × 60 surface regions chosen by the subjects. We used 8 × 8, 16 × 16, and 32 × 32 patches for our analyses and psychophysical experiments.

Grayscale scalar measurements

To obtain grayscale images, we converted the raw images into gamma-corrected RGB images using software available online at http://tabby.vision.mcgill.ca (Olmos & Kingdom, 2004). We then mapped the RGB color space to the NTSC color space, obtaining the grayscale luminance I = 0.2989R + 0.5870G + 0.1140B (Acharya & Ray, 2005). From these patches, we measured a variety of visual features which can be used to distinguish occlusion from surface patches. Some of these features (for instance, luminance difference) depended on there being a region labeling of the image patch which separates it into regions corresponding to two different surfaces (Figure 4a). However, measuring these same features from surface patches is impossible, since surface patches only contain a single surface. Therefore, in order to measure region labeling-dependent features from the uniform surface patches, we assigned to each surface patch a set of 25 'dummy' region labelings from our database (spanning all possible orientations). The measured value of the feature was then taken as the maximum value over all 25 dummy labelings, which is sensible since all of the visual features were on average larger for occlusions than for uniform surface patches.

Given a grayscale image patch and a region labeling R = {R1, R2, B} partitioning the patch into regions corresponding to the two surfaces (R1, R2) as well as the set of boundary (B) pixels (Figure 4a), we measured the following visual features taken from the computational vision literature:

G1: Luminance difference Δl:

\[ \Delta l = | l_1 - l_2 | \tag{7} \]

where l_1, l_2 are the mean luminances in regions R_1, R_2, respectively.

G2: Contrast difference Δσ:

\[ \Delta\sigma = | \sigma_1 - \sigma_2 | \tag{8} \]

where σ_1, σ_2 are the contrasts (standard deviations) in regions R_1, R_2, respectively. Features G1 and G2 were both measured in previous studies on surface segmentation (Fine et al., 2003; Ing et al., 2010).

G3: Boundary luminance gradient G_B:

\[ G_B = \frac{\| \nabla I \|}{\bar{I}} \tag{9} \]

where ∇I(x, y) = [∂I(x, y)/∂x, ∂I(x, y)/∂y]^T is the gradient of the image patch evaluated at the central pixel and Ī is the average intensity of the image patch (Balboa & Grzywacz, 2000).

G4: Oriented energy E_θ:

\[ E_\theta = \sum_{i=1}^{N_\theta} \left( \mathbf{w}_i^{even} \cdot \mathbf{x} \right)^2 + \left( \mathbf{w}_i^{odd} \cdot \mathbf{x} \right)^2 \tag{10} \]

where x is the image patch in vector form and w_i^even, w_i^odd are a quadrature-phase pair of Gabor filters at N_θ = 8 evenly spaced orientations θ_i. For patch size w, the filters had means w/2 and standard deviations of w/4. Oriented energy has been used in several previous studies as a means of detecting edges (Geisler et al., 2001; Lee & Choe, 2003; Sigman, Cecchi, Gilbert, & Magnasco, 2001).

Figure 4. Illustration of stimuli used in the psychophysical experiments. (a) Original color and grayscale occlusion edge (top, middle) and its region label (bottom). The region label divides the patch into regions corresponding to the two surfaces (white, gray) as well as the boundary (black). (b) Top: Grayscale image patch with texture information removed. Middle: Occlusion patch with boundary removed. Bottom: Boundary and luminance information removed.


G5: Global patch contrast ρ:

\[ \rho = \mathrm{std}(I) \tag{11} \]

This quantity has been measured in previous studies quantifying statistical differences between fixated image regions and random image regions for subjects free-viewing natural images while their eyes are being tracked (Rajashekar et al., 2007; Reinagel & Zador, 1999). Several studies of this kind have suggested that subjects may be preferentially looking at edges (Baddeley & Tatler, 2006).

Note that features G3–G5 are measured globally from the entire patch, whereas features G1 and G2 are differences between statistics measured from different regions of the patch.
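For concreteness, the scalar features G1, G2, G3, and G5 can be computed as in the Python sketch below (G4 is omitted since it requires the quadrature Gabor pair defined above). All variable names are ours; for surface patches, the same function would be applied over the 25 dummy labelings and the maximum taken, as described above.

```python
import numpy as np

def grayscale_scalars(patch, mask1, mask2):
    """Features G1, G2, G3, G5 (Equations 7, 8, 9, 11) for a grayscale
    patch; mask1, mask2 are boolean region masks from the labeling.
    G4 (oriented energy, Equation 10) needs the Gabor bank and is omitted."""
    dl = abs(patch[mask1].mean() - patch[mask2].mean())    # G1
    dsigma = abs(patch[mask1].std() - patch[mask2].std())  # G2
    gy, gx = np.gradient(patch.astype(float))
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    gb = np.hypot(gx[cy, cx], gy[cy, cx]) / patch.mean()   # G3
    rho = patch.std()                                      # G5
    return dl, dsigma, gb, rho
```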

Color scalar measurements

Images were converted from RGB to LMS color space using a MATLAB program which accompanies the images in the McGill database (Olmos & Kingdom, 2004). We converted the logarithmically transformed LMS images into an Lab color space by performing principal components analysis (PCA) on the set of LMS pixel intensities (Fine et al., 2003; Ing et al., 2010). Projections onto the axes of the Lab basis represent a color pixel in terms of its overall luminance (L), blue-yellow opponency (a), and red-green opponency (b).

We measured two additional properties from the LMS color image patches represented in the Lab basis:

C1: Blue-yellow difference Δa:

\[ \Delta a = | a_1 - a_2 | \tag{12} \]

where a_1, a_2 are the mean values of the B-Y opponency component a in regions R_1, R_2, respectively.

C2: Red-green difference Δb:

\[ \Delta b = | b_1 - b_2 | \tag{13} \]

where b_1, b_2 are the mean values of the R-G opponency component b in regions R_1, R_2, respectively.

These color scalar statistics were motivated by previous work studying human perceptual discrimination of different surfaces (Fine et al., 2003; Ing et al., 2010).

Machine classifiers

Quadratic classifier analysis

In order to study how well various feature subsets measured from the image patches could predict human performance, we made use of a quadratic classifier analysis. The quadratic classifier is a natural choice for quantifying the discriminability of two categories defined by features having multivariate Gaussian distributions (Duda, Hart, & Stork, 2000), and it has been used in previous work studying the perceptual discriminability of surfaces (Ing et al., 2010). Assume that we have two categories C_1, C_2 of stimuli from which we can measure n features u = (u_1, u_2, ..., u_n)^T, and that the features measured from each category are Gaussian distributed:

\[ p(\mathbf{u} \mid C_i) = \mathcal{N}(\mathbf{u} \mid \boldsymbol{\mu}_i, \Sigma_i) \tag{14} \]

where μ_i and Σ_i are the means and covariances of each category. Given a novel observation with feature vector u, and assuming that the two categories are equally likely a priori, we evaluate the log-likelihood ratio

\[ L_{12}(\mathbf{u}) = \ln p(\mathbf{u} \mid C_1) - \ln p(\mathbf{u} \mid C_2) \tag{15} \]

choosing C_1 when L_{12} > 0 and C_2 when L_{12} < 0. In the case of Gaussian distributions for each category, as in Equation 14, Equation 15 can be rewritten as

\[ L_{12}(\mathbf{u}) = \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left[ Q_2(\mathbf{u}) - Q_1(\mathbf{u}) \right] \tag{16} \]

where

\[ Q_i(\mathbf{u}) = (\mathbf{u} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{u} - \boldsymbol{\mu}_i) \tag{17} \]

The task for applying this formalism is to define a set of n features to measure from the set of image patches, and then use these measurements to define the means and covariances of each category in a supervised manner. New image patches that are unlabeled can then be classified using this quadratic classifier.

In our analyses, category C_1 was occlusion patches and category C_2 was surface patches. We estimated the parameters of the classifiers for each category by taking the means and covariances of the statistics measured from a set of 2,000 image patches (training set), and we applied these classifiers to different 400-patch subsets of the 1,000 image patches which were presented to the subjects (test set). This analysis was performed for multiple classifiers defined by different subsets of parameters and for image patches of all sizes (8 × 8, 16 × 16, 32 × 32).
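A minimal NumPy sketch of Equations 14–17 follows; the variable names are ours, and the surrounding Monte Carlo resampling described above is omitted.

```python
import numpy as np

def fit_category(U):
    """Estimate the Gaussian parameters of Equation 14 from a
    (n_patches, n_features) array of training features."""
    return U.mean(axis=0), np.cov(U, rowvar=False)

def classify(u, mu1, S1, mu2, S2):
    """Log-likelihood ratio L_12(u) of Equations 15-17; returns True
    (category C1, occlusion) when L_12 > 0."""
    Q1 = (u - mu1) @ np.linalg.solve(S1, u - mu1)      # Equation 17
    Q2 = (u - mu2) @ np.linalg.solve(S2, u - mu2)
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    L12 = 0.5 * (logdet2 - logdet1) + 0.5 * (Q2 - Q1)  # Equation 16
    return L12 > 0
```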

SVM classifier analysis

As an additional control, in addition to the quadratic classifier analysis we trained a Support Vector Machine (SVM) classifier (Cristianini & Shawe-Taylor, 2000) to discriminate occlusions and surfaces using our grayscale visual feature set (G1–G5). The SVM classifier is a standard and well-studied method in machine learning which achieves good classification results by learning the separating hyperplane that maximizes the margin between two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.
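The original analysis used the MATLAB routines named above; an equivalent sketch in Python using scikit-learn (our substitution, with the kernel choice assumed rather than stated in the text) would be:

```python
from sklearn.svm import SVC

def svm_accuracy(F_train, y_train, F_test, y_test):
    """Train an SVM on the grayscale features G1-G5 and score it on
    held-out patches (y: 1 = occlusion, 0 = surface). scikit-learn
    stands in for the MATLAB routines; the linear kernel is assumed."""
    clf = SVC(kernel="linear")
    clf.fit(F_train, y_train)
    return clf.score(F_test, y_test)
```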

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales and that appropriate scale selection and integration across scale is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multiscale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983), and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using Independent Component Analysis (ICA). These filters closely resemble Gabor filters, but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.
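The two classifiers of Figure 5b can be sketched as follows; scikit-learn stands in for the authors' gradient-descent implementation, full-wave rectification (absolute value) is our assumption, and W denotes the ICA-derived filter bank (rows = filters).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def rectified_filter_outputs(patches, W):
    """Rectified responses of the ICA-derived Gabor filter bank applied
    to vectorized patches; full-wave rectification is an assumption."""
    return np.abs(patches @ W.T)

def train_multiscale_classifiers(X, y, n_hidden=16):
    """Linear logistic classifier and three-layer neural network of
    Figure 5b; hyperparameters here are illustrative, not the authors'."""
    linear = LogisticRegression(max_iter=1000).fit(X, y)
    network = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            max_iter=2000).fit(X, y)
    return linear, network
```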

Experimental paradigm

In the psychophysical experiments image patcheswere displayed on a 24-inch Macintosh cinema display(Apple Inc Cupertino CA) Since the RGB imageswere linearized to correct for camera gamma (Olmos ampKingdom 2004) we set the display monitor to havegamma of approximately 1 using the Mac CalibrationAssistant so that the displayed images would be asnatural looking as possible Subjects were seated in adim room in front of the monitor and image patcheswere presented to the center of the screen scaled tosubtend 15 degrees of visual angle at the approximate-ly 12-inch (305 cm) viewing distance All stimulisubtended the same visual angle to eliminate confoundsbetween the size of the image on the retina and its pixeldimensions Previous studies have demonstrated thathuman performance on a similar task is dependent onlyon the number of pixels (McDermott 2004) so byholding the retinal size constant the only variable is thenumber of pixels

Our experimental paradigm is illustrated in Figure 6Surrounding the image patch was a set of flankingcrosshairs whose imaginary intersection defines thelsquolsquocenterrsquorsquo of the image patch In the standard task thesubject decides whether an occlusion boundary passesthrough the image patch center with a binary (10) key-press Half of the patches were taken from theocclusion boundary database (where all occlusions passthrough the patch center) and the other half of thepatches were taken from the surface database There-fore guessing on the task would yield performance of50 percent correct The design of this experiment issimilar to a previous study on detecting T-junctions innatural images (McDermott 2004) To optimizeperformance subjects were given as much time as theyneeded for each patch Furthermore since perceptuallearning can improve performance (Ing et al 2010) weprovided positive and negative feedback

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2–3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on surface segmentation where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) was not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.



We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact, which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side, without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information, we also created a "boundary removed" condition, which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For proportion estimate p, we compute the 1 − α confidence interval as

$$p \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}} \qquad (18)$$

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
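A short sketch of Equation 18, assuming scipy's normal quantile function; the example values are illustrative.

```python
from scipy.stats import norm

def binomial_ci(p_hat, n, alpha=0.05):
    """1 - alpha confidence interval for a binomial proportion (Equation 18)."""
    z = norm.ppf(1 - alpha / 2)  # z_{1 - alpha/2}
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width

print(binomial_ci(0.75, 400))  # e.g., 75% correct over 400 trials
```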

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.
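A minimal sketch of this Monte Carlo procedure, assuming a scikit-learn-style classifier object with a `.score` method; the variable names are illustrative.

```python
import numpy as np

def monte_carlo_accuracy(clf, X_val, y_val, n_reps=200, n_patches=400, seed=0):
    """Score a trained classifier on repeated random draws of validation patches."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_reps):
        idx = rng.choice(len(X_val), size=n_patches, replace=False)
        scores.append(clf.score(X_val[idx], y_val[idx]))
    scores = np.array(scores)
    lo, hi = np.percentile(scores, [2.5, 97.5])  # 95% interval around the mean
    return scores.mean(), (lo, hi)
```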

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for use in our perceptual experiments.

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.


Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images, with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may arise from other cues like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

In other annotated edge databases like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue) for 98 BSD images segmented by six subjects. We find that, averaged across all images and subjects, about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps using the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10⁻¹²¹, Wilcoxon rank-sum).
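The database comparison can be sketched as follows, assuming one binary edgemap array per (image, subject) pair and scipy's Wilcoxon rank-sum implementation; the function names are illustrative.

```python
import numpy as np
from scipy.stats import ranksums

def edge_fractions(edgemaps):
    """Fraction of pixels marked as edges in each binary edgemap."""
    return np.array([m.mean() for m in edgemaps])

def compare_databases(cob_maps, bsd_maps):
    f_cob, f_bsd = edge_fractions(cob_maps), edge_fractions(bsd_maps)
    stat, p = ranksums(f_cob, f_bsd)  # two-sided Wilcoxon rank-sum test
    return np.median(f_cob), np.median(f_bsd), p
```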

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed, as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database).

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 10. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).


Figure 7a shows an image from the COB database (top left), together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f, defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 10 (COB median = 0.2890, N = 3,500; BSD median = 0.1882, N = 3,430; p < 10⁻²⁰², Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations (p < 10⁻²⁸).

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion, which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions.

By definition, surface patches are composed of a single region, and therefore it is unclear how we can measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of measured features is given in Methods (Statistical measurements).
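Below is a sketch of the dummy-labeling procedure for a single illustrative feature (the absolute mean-luminance difference); the straight-line labeling function is a simplified stand-in for the labelings actually used.

```python
import numpy as np

def dummy_labeling(shape, theta):
    """Binary mask splitting a patch along a line through its center at angle theta."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    return ((x - w / 2) * np.cos(theta) + (y - h / 2) * np.sin(theta)) > 0

def max_luminance_difference(patch, n_orientations=25):
    """Measure the feature under each dummy labeling and keep the maximum."""
    feats = []
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        mask = dummy_labeling(patch.shape, theta)
        feats.append(abs(patch[mask].mean() - patch[~mask].mean()))
    return max(feats)  # surfaces are assigned their most occlusion-like value
```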

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy in low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent by fitting a line to the power spectra of individual images found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
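The exponent analysis can be sketched as follows, using a radially averaged spectrum and a least-squares line fit in log-log coordinates; the binning details are illustrative rather than those of our implementation.

```python
import numpy as np

def spectrum_exponent(patch):
    """Fit log power vs. log frequency and return the negated slope."""
    f = np.fft.fftshift(np.fft.fft2(patch - patch.mean()))
    power = np.abs(f) ** 2
    h, w = patch.shape
    y, x = np.mgrid[0:h, 0:w]
    r = np.hypot(y - h // 2, x - w // 2).astype(int)  # radial frequency bins
    radial = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    freqs = np.arange(1, min(h, w) // 2)  # skip DC, stay below Nyquist
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[freqs]), 1)
    return -slope  # about 2 for textures, about 2.6 for occlusions
```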

Figure 9 shows that all of the grayscale features (G1-G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches.

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).


Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between the luminance difference log Δl and the boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, the global contrast log ρ and the contrast difference log Δσ). These results are consistent with previous work demonstrating that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we measured color features (C1–C2; Methods: Statistical measurements) as well, by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010).

Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 15 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces, with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch as long as they needed (1–2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed on six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different sizes of image patches which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line), together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1-G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b).

Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We then were able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture cues or luminance cues, does not affect subject performance.



We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010), we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes is exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (which were not used to train the classifier), using 200 Monte Carlo resamplings.
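A minimal sketch of such a quadratic classifier: one multivariate Gaussian is fit per class, and each patch is assigned to the class with the higher log-likelihood (equal priors assumed).

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_quadratic(X_occ, X_surf):
    """Fit class-conditional Gaussians to log-transformed feature vectors."""
    return [multivariate_normal(X.mean(axis=0), np.cov(X, rowvar=False))
            for X in (X_occ, X_surf)]

def classify(models, X):
    """Return True where a patch is classified as an occlusion."""
    ll = np.column_stack([m.logpdf(X) for m in models])
    return ll[:, 0] > ll[:, 1]
```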

In Figure 13, we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1-G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier in detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4.

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) Classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) Classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.


We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and the overall patch variance (ρ). We also see that the contrast difference (Δσ) by itself is a fairly weak parameter for all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δl, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with ρ and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent from luminance cues.
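The ranking step can be sketched as follows, assuming the standard two-class d′ = |μ₁ − μ₂| / √((σ₁² + σ₂²)/2); classifiers are then retrained on the top k ranked dimensions for increasing k.

```python
import numpy as np

def d_prime(x1, x2):
    """Standard two-class discriminability index for one feature dimension."""
    return abs(x1.mean() - x2.mean()) / np.sqrt((x1.var() + x2.var()) / 2)

def rank_features(X_occ, X_surf):
    """Rank feature columns by d' between occlusion and surface patches."""
    scores = [d_prime(X_occ[:, j], X_surf[:, j]) for j in range(X_occ.shape[1])]
    return np.argsort(scores)[::-1]  # feature indices, best first
```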

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes-optimal for Gaussian distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield performance superior to that of the quadratic classifier and hence provide a better match with our perceptual data.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) Classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) Classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.



In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the reason for the poor match of the quadratic classifier with human performance is the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classifier method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist on multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences, which is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Components Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b where, instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set, not used for training, to evaluate their performance.
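A minimal sketch of the two classifiers, using scikit-learn stand-ins for our gradient-descent implementations; `R_train` (rectified filter responses) and `y_train` (binary occlusion labels) are hypothetical names.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def train_classifiers(R_train, y_train, n_hidden=16, seed=0):
    """Train the linear and hidden-layer classifiers on rectified responses."""
    linear = LogisticRegression(max_iter=2000).fit(R_train, y_train)
    network = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                            random_state=seed).fit(R_train, y_train)
    return linear, network
```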

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to spatially pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture-edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. We see that a simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.




Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13).

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.


The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). This study found that rather large image regions were needed for subjects to attain reliable performance for junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: that a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and that many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
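To make the FRF idea concrete, here is a toy sketch (not the specific models cited): oriented filtering, pointwise rectification, then a coarse second-stage filter, with texture-edge strength read out as the gradient of the smoothed energy map.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor(theta, freq=0.25, size=11, sigma=2.5):
    """Simple cosine-phase Gabor kernel at orientation theta."""
    y, x = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def frf_response(image, theta, second_stage_sigma=4.0):
    first = convolve(image, gabor(theta))                   # Filter
    energy = np.abs(first)                                  # Rectify
    smoothed = gaussian_filter(energy, second_stage_sigma)  # Filter again
    gy, gx = np.gradient(smoothed)
    return np.hypot(gx, gy)  # large values mark texture-defined edges
```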

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges, rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Grasselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

Page 3: Detecting natural occlusion boundaries using local cuesDetecting natural occlusion boundaries using local cues Christopher DiMattina # $ Department of Psychology & Neuroscience Concentration,

shadows or luminance gradients. Therefore, some surface patches contained substantial luminance gradients, for instance patches of zebra skin (Figure 3). Each subject selected 10 surface regions (60 × 60) from each of the 100 images, and for our analyses we extracted image patches of various sizes (8 × 8, 16 × 16, 32 × 32) at random locations from these larger 60 × 60 regions. Example 32 × 32 surface patches are shown in Figure 2 (middle panel), and examples of both surface and occlusion patches are shown in Figure 3.

Figure 2. Representative images from the occlusion boundary database together with subject occlusion labelings. Left: Original color images. Middle: Grayscale images with overlaid pixels labeled as occlusions (white lines) and examples of surface regions (magenta squares). Right: Overlaid plots of subject occlusion labelings taken from an 87 × 115 pixel central region of the images (indicated by cyan squares in the middle column). Darker pixels were labeled by more subjects, lighter pixels by fewer subjects.

Figure 3. Examples of 32 × 32 occlusions (left), surfaces (right), and shadow edges not defined by occlusions (bottom).


Quantifying subject consistency

In order to quantify intersubject consistency in the locations of the pixels labeled as occlusions, we applied a precision-recall analysis commonly used in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). In addition, we also developed a novel analysis method which we call the most-conservative subject (MCS) analysis, which controls for the fact that disagreement between subjects in the location of labeled occlusions often arises simply because some subjects are more exhaustive in their labeling than others.

Precision-recall analysis is often used in machine vision studies of edge detection in order to quantify the trade-off between correctly detecting all edges (recall) and not incorrectly labeling non-edges as edges (precision). Mathematically, precision (P) and recall (R) are defined as

P = \frac{tp}{tp + fp} \qquad (1)

R = \frac{tp}{tp + fn} \qquad (2)

where tp, fp, tn, fn are the true and false positives and true and false negatives, respectively. Typically these quantities are determined by comparing a machine-generated test edgemap E to a ground-truth reference edgemap G derived from hand-annotated images (Martin et al., 2004). Since all of our edgemaps were human generated, we performed a "leave-one-out" analysis where we compared a test edgemap from each subject to a reference "ground truth" edgemap defined by combining the edgemaps from all of the other remaining subjects. Since our goal for this analysis was simply to compare human performance on two different tasks (occlusion labeling and region labeling), we did not make use of the sophisticated boundary-matching procedures (Goldberg & Kennedy, 1995) used in previous studies to optimize the comparisons between human data and machine performance (Martin et al., 2004). We quantify the overall agreement between the test and reference edgemaps using the weighted harmonic mean of P and R, defined by

F = \frac{PR}{\alpha P + (1 - \alpha)R} \qquad (3)

This quantity F is known as an F-measure and was originally developed to quantify the accuracy of document retrieval methods (Rijsbergen, 1979), but it is also applied in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). The parameter α determines the relative weight of precision and recall, and we used α = 0.5 in our analysis.
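As a concrete illustration, the following MATLAB sketch computes P, R, and F for one test/reference pair under the pixel-wise comparison described above (the logical edgemap matrices E and G and all variable names are assumptions, not the authors' code):

tp = nnz(E & G);           % test edge pixels that match reference edges
fp = nnz(E & ~G);          % non-edges incorrectly labeled as edges
fn = nnz(~E & G);          % reference edges missed by the test map
P = tp / (tp + fp);        % precision, Equation 1
R = tp / (tp + fn);        % recall, Equation 2
alpha = 0.5;               % equal weighting of precision and recall
F = (P * R) / (alpha * P + (1 - alpha) * R);   % F-measure, Equation 3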

In addition to the precision-recall analysis, we developed a novel method for quantifying intersubject consistency which minimizes problems of intersubject disagreement arising from the fact that certain subjects are simply more exhaustive in labeling all possible occlusions than other subjects. We defined the most conservative subject (MCS) for a given image as the subject who had labeled the fewest pixels. Using the MCS labeling, we generate a binary image mask F which is 1 for any pixel within R pixels of an occlusion labeled by the MCS, and 0 for all other pixels. Applying this mask to the labeling of each subject yields a "reduced" labeling which is valid for intersubject comparison, since it only includes the most prominent occlusions labeled by all of the subjects. To calculate the comparison between two subjects, we randomly assigned one binary edgemap as the "reference" (Iref) and the other binary edgemap as the "test" (Itest). We used the reference map to define a weighting function fc(r), applied to all of the pixels in the test map, which quantified how close each pixel in the test map was to a pixel in the reference map. Mathematically, our index is given by

f = \frac{1}{N_t} \sum_{(x,y) \in I_{test}} f_c\big( r((x,y), I_{ref}) \big) \, I_{test}(x,y) \qquad (4)

where r((x,y), Iref) is the distance between (x,y) and the closest pixel in Iref, Nt is the number of pixels in the test set, and the function fc(r) is defined for 0 ≤ c < ∞ by

f_c(r) = \begin{cases} e^{-cr}, & 0 \le r \le R \\ 0, & r > R \end{cases} \qquad (5)

and for c = ∞ by

f_\infty(r) = \begin{cases} 1, & r = 0 \\ 0, & r > 0 \end{cases} \qquad (6)

where R is the radius of the mask, which we set to R = 10 in our analysis. The parameter c in our weighting function sets the sensitivity of fc to the distance between the reference labeling and the test labeling. Setting c = 0 counts the fraction of pixels in the test edgemap which lie inside the mask generated by the reference edgemap, and setting c = ∞ measures the fraction of pixels in complete agreement between the two edgemaps.
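A minimal sketch of this consistency index (Equations 4–6), assuming Itest and Iref are logical edgemaps and using bwdist from the Image Processing Toolbox to obtain each pixel's distance to the nearest reference pixel (variable names are illustrative):

Rmask = 10;  c = 1.0;              % mask radius and sensitivity parameter
r = bwdist(Iref);                  % r((x,y), Iref) at every pixel
fc = exp(-c * r) .* (r <= Rmask);  % weighting function, Equation 5
f = sum(fc(Itest)) / nnz(Itest);   % index of Equation 4; nnz(Itest) = Nt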

Statistical measurements

Patch extraction and region labeling

Occlusion patches of varying sizes centered on an occlusion boundary were extracted automatically from the database by choosing random pixels labeled as occlusions by a single subject, cycling through the six subjects. Since we only wanted patches containing a single occlusion separating two regions (figure and ground) of roughly equal size, we only accepted a candidate patch when:


1. The composite occlusion edge map from all subjects consisted of a single connected piece in the analysis window.

2. The occlusion contacted the sides of the window at two distinct points and divided the patch into two regions.

3. Each region comprised at least 35% of the pixels.

Each occlusion patch consisted of two regions of roughly equal size separated by a single boundary, with the central pixel (w/2, w/2) of a patch of size w always being an occlusion (Figure 4a). Note that this procedure actually yields a subset of all possible occlusions, since it excludes T-junctions and occlusions formed by highly convex boundaries. Since the selection of occlusion patches was automated, there were no constraints on the properties of the surfaces on either side of the occlusion with respect to factors like lighting, shadows, reflectance, or material. In addition to the occlusion patches, we extracted surface patches at random locations from the set of 60 × 60 surface regions chosen by the subjects. We used 8 × 8, 16 × 16, and 32 × 32 patches for our analyses and psychophysical experiments.

Grayscale scalar measurements

To obtain grayscale images, we converted the raw images into gamma-corrected RGB images using software available online at http://tabby.vision.mcgill.ca (Olmos & Kingdom, 2004). We then mapped the RGB color space to the NTSC color space, obtaining the grayscale luminance I = 0.2989R + 0.5870G + 0.1140B (Acharya & Ray, 2005). From these patches we measured a variety of visual features which can be used to distinguish occlusion from surface patches. Some of these features (for instance, luminance difference) depended on there being a region labeling of the image patch which separates it into regions corresponding to two different surfaces (Figure 4a). However, measuring these same features from surface patches is impossible, since surface patches contain only a single surface. Therefore, in order to measure region labeling-dependent features from the uniform surface patches, we assigned to each surface patch a set of 25 "dummy" region labelings from our database (spanning all possible orientations). The measured value of the feature was then taken as the maximum value over all 25 dummy labelings, which is sensible since all of the visual features were on average larger for occlusions than for uniform surface patches.

Given a grayscale image patch and a region labeling R = {R1, R2, B} partitioning the patch into regions corresponding to the two surfaces (R1, R2) as well as the set of boundary (B) pixels (Figure 4a), we measured the following visual features taken from the computational vision literature:

G1: Luminance difference Δl:

\Delta l = |l_1 - l_2| \qquad (7)

where l1, l2 are the mean luminances in regions R1, R2, respectively.

G2: Contrast difference Δσ:

\Delta\sigma = |\sigma_1 - \sigma_2| \qquad (8)

where σ1, σ2 are the contrasts (standard deviations) in regions R1, R2, respectively. Features G1 and G2 were both measured in previous studies on surface segmentation (Fine et al., 2003; Ing et al., 2010).
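For concreteness, a sketch of how features G1 and G2 can be computed from a grayscale patch I given logical region masks R1 and R2 (all names are assumptions):

l1 = mean(I(R1));  l2 = mean(I(R2));   % mean luminance in each region
dl = abs(l1 - l2);                     % G1: luminance difference, Equation 7
s1 = std(I(R1));   s2 = std(I(R2));    % contrast = standard deviation
ds = abs(s1 - s2);                     % G2: contrast difference, Equation 8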

G3: Boundary luminance gradient GB:

G_B = \frac{\|\nabla I\|}{\bar{I}} \qquad (9)

where \nabla I(x, y) = [\partial I(x, y)/\partial x, \partial I(x, y)/\partial y]^T is the gradient of the image patch evaluated at the central pixel, and \bar{I} is the average intensity of the image patch (Balboa & Grzywacz, 2000).

G4: Oriented energy Eθ:

E_\theta = \sum_{i=1}^{N_\theta} (w_i^{even} \cdot x)^2 + (w_i^{odd} \cdot x)^2 \qquad (10)

where x is the image patch in vector form and w_i^{even}, w_i^{odd} are a quadrature-phase pair of Gabor filters at N_θ = 8 evenly spaced orientations θi. For patch size w, the filters had means w/2 and standard deviations of w/4. Oriented energy has been used in several previous studies as a means of detecting edges (Geisler et al., 2001; Lee & Choe, 2003; Sigman, Cecchi, Gilbert, & Magnasco, 2001).

Figure 4. Illustration of stimuli used in the psychophysical experiments. (a) Original color and grayscale occlusion edge (top, middle) and its region label (bottom). The region label divides the patch into regions corresponding to the two surfaces (white, gray) as well as the boundary (black). (b) Top: Grayscale image patch with texture information removed. Middle: Occlusion patch with boundary removed. Bottom: Boundary and luminance information removed.
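A one-line sketch of the oriented-energy measurement in Equation 10, assuming Weven and Wodd are Nθ × Npixels matrices whose i-th rows hold the quadrature-phase Gabor pair at orientation θi, and x is the patch as a column vector:

E = sum((Weven * x).^2 + (Wodd * x).^2);   % pools quadrature energy over all orientations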

G5: Global patch contrast ρ:

\rho = \mathrm{std}(I) \qquad (11)

This quantity has been measured in previous studies which quantify statistical differences between fixated image regions and random image regions for subjects freely viewing natural images while their eyes are being tracked (Rajashekar et al., 2007; Reinagel & Zador, 1999). Several studies of this kind have suggested that subjects may be preferentially looking at edges (Baddeley & Tatler, 2006).

Note that features G3–G5 are measured globally from the entire patch, whereas features G1 and G2 are differences between statistics measured from different regions of the patch.

Color scalar measurements

Images were converted from RGB to LMS color space using a MATLAB program which accompanies the images in the McGill database (Olmos & Kingdom, 2004). We converted the logarithmically transformed LMS images into an Lab color space by performing principal components analysis (PCA) on the set of LMS pixel intensities (Fine et al., 2003; Ing et al., 2010). Projections onto the axes of the Lab basis represent a color pixel in terms of its overall luminance (L), blue-yellow opponency (a), and red-green opponency (b).

We measured two additional properties from the LMS color image patches represented in the Lab basis:

C1: Blue-yellow difference Δa:

\Delta a = |a_1 - a_2| \qquad (12)

where a1, a2 are the mean values of the B-Y opponency component a in regions R1, R2, respectively.

C2: Red-green difference Δb:

\Delta b = |b_1 - b_2| \qquad (13)

where b1, b2 are the mean values of the R-G opponency component b in regions R1, R2, respectively.

These color scalar statistics were motivated by previous work studying human perceptual discrimination of different surfaces (Fine et al., 2003; Ing et al., 2010).

Machine classifiers

Quadratic classifier analysis

In order to study how well various feature subsets measured from the image patches could predict human performance, we made use of a quadratic classifier analysis. The quadratic classifier is a natural choice for quantifying the discriminability of two categories defined by features having multivariate Gaussian distributions (Duda, Hart, & Stork, 2000), and it has been used in previous work studying the perceptual discriminability of surfaces (Ing et al., 2010). Assume that we have two categories C1, C2 of stimuli from which we can measure n features u = (u1, u2, ..., un)^T, and that the features measured from each category are Gaussian distributed:

p(u | C_i) = \mathcal{N}(u; \mu_i, \Sigma_i) \qquad (14)

where μi and Σi are the mean and covariance of each category. Given a novel observation with feature vector u, and assuming that the two categories are equally likely a priori, we evaluate the log-likelihood ratio

L_{12}(u) = \ln p(u | C_1) - \ln p(u | C_2) \qquad (15)

choosing C1 when L12 > 0 and C2 when L12 < 0. In the case of Gaussian distributions for each category, as in Equation 14, Equation 15 can be rewritten as

L_{12}(u) = \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2} \big( Q_2(u) - Q_1(u) \big) \qquad (16)

where

Q_i(u) = (u - \mu_i)^T \Sigma_i^{-1} (u - \mu_i) \qquad (17)

The task for applying this formalism is to define a set of n features to measure from the set of image patches, and then to use these measurements to define the means and covariances of each category in a supervised manner. New, unlabeled image patches can then be classified using this quadratic classifier.

In our analyses, category C1 was occlusion patches and category C2 was surface patches. We estimated the parameters of the classifiers for each category by taking the means and covariances of the statistics measured from a set of 2,000 image patches (training set), and we applied these classifiers to different 400-patch subsets of the 1,000 image patches which were presented to the subjects (test set). This analysis was performed for multiple classifiers defined by different subsets of parameters, and for image patches of all sizes (8 × 8, 16 × 16, 32 × 32).
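The following MATLAB sketch illustrates Equations 14–17, assuming U1 and U2 are training matrices of log-transformed features (one patch per row) for occlusions and surfaces, and u is a feature vector from a test patch (all names are assumptions):

mu1 = mean(U1)';  S1 = cov(U1);        % Gaussian fit to occlusion features
mu2 = mean(U2)';  S2 = cov(U2);        % Gaussian fit to surface features
Q1 = (u - mu1)' * (S1 \ (u - mu1));    % Mahalanobis terms, Equation 17
Q2 = (u - mu2)' * (S2 \ (u - mu2));
L12 = 0.5 * log(det(S2) / det(S1)) + 0.5 * (Q2 - Q1);   % Equation 16
isOcclusion = (L12 > 0);               % choose C1 (occlusion) when L12 > 0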

SVM classifier analysis

As an additional control, we also trained a Support Vector Machine (SVM) classifier (Cristianini & Shawe-Taylor, 2000) to discriminate occlusions and surfaces using our grayscale visual feature set (G1–G5). The SVM classifier is a standard and well-studied method in machine learning which achieves good classification results by learning the separating hyperplane that maximizes the margin between two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.
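A minimal usage sketch for these Bioinformatics Toolbox functions, assuming X and Xtest are matrices of the G1–G5 features (one patch per row) and labels is a vector of class labels:

model = svmtrain(X, labels);        % learn the maximum-margin hyperplane
pred = svmclassify(model, Xtest);   % classify held-out patches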

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales, and that appropriate scale selection and integration across scale is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multiscale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983), and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using Independent Component Analysis (ICA). These filters closely resemble Gabor filters, but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier, and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.
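A sketch of both classifiers applied to a single patch, assuming W is the ICA filter matrix (one Gabor-like filter per row), x is the patch as a column vector, and the weights w, b, V, c, wo, bo have already been learned by gradient descent (all names are illustrative):

a = max(W * x, 0);                        % rectified multiscale filter responses
p_lin = 1 / (1 + exp(-(w' * a + b)));     % (a) linear logistic classifier
h = 1 ./ (1 + exp(-(V * a + c)));         % (b) hidden layer (e.g., 16 units)
p_net = 1 / (1 + exp(-(wo' * h + bo)));   %     output pools across scale and position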

Experimental paradigm

In the psychophysical experiments, image patches were displayed on a 24-inch Macintosh cinema display (Apple, Inc., Cupertino, CA). Since the RGB images were linearized to correct for camera gamma (Olmos & Kingdom, 2004), we set the display monitor to a gamma of approximately 1 using the Mac Calibration Assistant, so that the displayed images would be as natural looking as possible. Subjects were seated in a dim room in front of the monitor, and image patches were presented at the center of the screen, scaled to subtend 1.5 degrees of visual angle at the approximately 12-inch (30.5 cm) viewing distance. All stimuli subtended the same visual angle to eliminate confounds between the size of the image on the retina and its pixel dimensions. Previous studies have demonstrated that human performance on a similar task depends only on the number of pixels (McDermott, 2004), so by holding the retinal size constant the only variable is the number of pixels.

Our experimental paradigm is illustrated in Figure 6. Surrounding the image patch was a set of flanking crosshairs whose imaginary intersection defines the "center" of the image patch. In the standard task, the subject decides whether an occlusion boundary passes through the image patch center with a binary (1/0) key-press. Half of the patches were taken from the occlusion boundary database (where all occlusions pass through the patch center), and the other half were taken from the surface database; therefore, guessing on the task would yield performance of 50 percent correct. The design of this experiment is similar to a previous study on detecting T-junctions in natural images (McDermott, 2004). To optimize performance, subjects were given as much time as they needed for each patch. Furthermore, since perceptual learning can improve performance (Ing et al., 2010), we provided positive and negative feedback.

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2–3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on surface segmentation where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) was not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches, as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information, we also created a "boundary removed" condition, which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For proportion estimate p, we compute the 1 − α confidence interval as

p \pm z_{1-\alpha/2} \sqrt{\frac{p(1 - p)}{n}} \qquad (18)

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
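For example, the confidence interval of Equation 18 can be computed as follows (p is the observed proportion correct; norminv requires the Statistics Toolbox):

alpha = 0.05;  n = 400;                        % significance level and trial count
z = norminv(1 - alpha/2);                      % approximately 1.96
ci = p + [-1, 1] * z * sqrt(p * (1 - p) / n);  % Equation 18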

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images, with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may also arise from other cues like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.

In other annotated edge databases like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly, by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels is labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and in the BSD database (blue) for 98 BSD images segmented by six subjects. We find, averaged across all images and subjects, that about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps from the BSD segmentations (COB: median = 0.0084, N = 600; BSD: median = 0.0193, N = 588; p < 10^-121, Wilcoxon rank-sum).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed, as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f, defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 1.0 (COB: median = 0.2890, N = 3,500; BSD: median = 0.1882, N = 3,430; p < 10^-202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure, where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling non-edges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that for this analysis there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations (p < 10^-28).

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations, and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of the MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases, across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 1.0. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches are comprised of a single region, and therefore it is unclear how we can measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions than for surfaces, we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of the measured features is given in Methods (Statistical measurements).

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy at low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, by fitting a line to the power spectra of individual images, found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
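A sketch of the slope-fitting procedure for a single 32 × 32 patch I, estimating the exponent from a line fit to log power versus log radial spatial frequency (the binning details here are assumptions):

S = fftshift(abs(fft2(I - mean(I(:)))).^2);   % centered 2-D power spectrum
w = size(I, 1);
[fx, fy] = meshgrid(-w/2 : w/2 - 1);          % frequency coordinates
fr = round(sqrt(fx.^2 + fy.^2));              % radial frequency of each bin
pw = accumarray(fr(:) + 1, S(:), [], @mean);  % radially averaged power
f = (1 : w/2)';                               % exclude DC and corner frequencies
b = polyfit(log(f), log(pw(f + 1)), 1);       % fitted exponent is -b(1)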

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches.

Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log GB), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log ρ and contrast difference log Δσ). These results are consistent with previous work demonstrating that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 occlusions and surfaces.

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

In addition to grayscale features, we measured color features (C1–C2; Methods: Statistical measurements) as well, by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch for as long as they needed (1–2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed by six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different image patch sizes which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2) we also tested color image patches, and the results are shown in Figure 10b (red line), together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We then were able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture cues or luminance cues, does not affect subject performance.

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (which were not used to train the classifier), using 200 Monte Carlo resamplings.

In Figure 13, we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and the overall patch variance (ρ). We also see that contrast difference (Δσ) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) A classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) A classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.
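The d′ value for a single feature dimension can be computed as below; we assume here the standard equal-variance convention, since the text does not spell out its exact formula (x1 and x2 are vectors of the log feature measured from occlusions and surfaces):

dprime = abs(mean(x1) - mean(x2)) / sqrt((var(x1) + var(x2)) / 2);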

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features are correlated with each other, so it was of interest to determine which features contribute the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δl, no performance increase is seen when one adds the features GB (boundary gradient) and Eθ (oriented energy), most likely because these features are strongly correlated with ρ and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent from luminance cues.

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian-distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield performance superior to the quadratic classifier and hence provide a better match with our perceptual data.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

In order to control for this possibility, we compared the performance of the quadratic classifier to that of a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the reason for the poor match of the quadratic classifier with human performance is the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classifier method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch, and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist at multiple scales, and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences, which is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Component Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, and 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b, where instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training, in order to evaluate their performance.

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size, and it seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. We see that a simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

In contrast, the neural network classifier learns to spatially pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, units tuned for texture-edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.
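As an illustration of the filter-rectify-filter idea behind Figure 18a, a unit of this kind might compare oriented energy across the two halves of a patch; in this sketch IL and IR are the two half-patches and gab is a Gabor kernel at the preferred orientation (all names hypothetical, not the authors' model):

eL = sum(sum(imfilter(IL, gab).^2));   % oriented energy in the left half
eR = sum(sum(imfilter(IR, gab).^2));   % oriented energy in the right half
response = abs(eL - eR);               % large when texture orientation differs across the boundary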

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues, but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983), and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). This study found that rather large image regions were needed for subjects to attain reliable performance for junction detection, consistent with our results for occlusion detection. In this work it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has been mostly considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, in which the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these second-stage filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
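To make the FRF scheme concrete, the following MATLAB fragment is a minimal sketch of ours (not the authors' implementation) of one filter-rectify-filter pass; the wavelengths, orientations, and the free parameter threshold are illustrative assumptions, and gabor/imgaborfilt require the Image Processing Toolbox (R2015b or later).

% Minimal FRF sketch for a grayscale patch I (double, 2-D).
g1 = gabor(4, 0:45:135);                 % first-stage bank: wavelength 4, four orientations
mag1 = imgaborfilt(I, g1);               % filter + rectify: magnitude responses, one plane per filter
g2 = gabor(16, 90);                      % coarse second-stage filter (assumed scale)
mag2 = imgaborfilt(mean(mag1, 3), g2);   % second-stage filtering of the pooled rectified energy
threshold = 1.0;                         % illustrative free parameter, not from the paper
edgePresent = max(mag2(:)) > threshold;  % crude decision: texture edge present?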

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003) and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.
Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.
Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14.
Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.
Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.
Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.
Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.
Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.
Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image Processing and its Applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.
Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.
Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.
Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.
Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2.
Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.
Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.
Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.
Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.
Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.
Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).
Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.
Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.
Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10.
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.
Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.
Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.
Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.
Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11.
Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.
Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.
Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.
Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.
Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.
Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.
Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.
Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.
Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.
McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.
McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2.
Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.
Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.
Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.
Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.
Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.
Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.
Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.
Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.
Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.
Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.
Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.
Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.
Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.
Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.
Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.
Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4.
Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.
Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.



Quantifying subject consistency

In order to quantify intersubject consistency in the locations of the pixels labeled as occlusions, we applied a precision-recall analysis commonly used in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). In addition, we also developed a novel analysis method, which we call the most-conservative subject (MCS) analysis, that controls for the fact that disagreement between subjects in the location of labeled occlusions often arises simply because some subjects are more exhaustive in their labeling than others.

Precision-recall analysis is often used in machine vision studies of edge detection in order to quantify the trade-off between correctly detecting all edges (recall) and not incorrectly labeling non-edges as edges (precision). Mathematically, precision (P) and recall (R) are defined as

P = \frac{tp}{tp + fp},  (1)

R = \frac{tp}{tp + fn},  (2)

where tp, fp, tn, fn are the true and false positives and true and false negatives, respectively. Typically these quantities are determined by comparing a machine-generated test edgemap E to a ground-truth reference edgemap G derived from hand-annotated images (Martin et al., 2004). Since all of our edgemaps were human generated, we performed a "leave-one-out" analysis in which we compared a test edgemap from each subject to a reference "ground truth" edgemap defined by combining the edgemaps from all of the other remaining subjects. Since our goal for this analysis was simply to compare human performance on two different tasks (occlusion labeling and region labeling), we did not make use of sophisticated boundary-matching procedures (Goldberg & Kennedy, 1995) used in previous studies to optimize the comparisons between human data and machine performance (Martin et al., 2004). We quantify the overall agreement between the test and reference edgemaps using the weighted harmonic mean of P and R, defined by

F = \frac{PR}{\alpha P + (1 - \alpha)R}.  (3)

This quantity F is known as an F-measure and was originally developed to quantify the accuracy of document retrieval methods (Rijsbergen, 1979), but it is also applied in machine vision (Abdou & Pratt, 1979; Martin et al., 2004). The parameter α determines the relative weight of precision and recall, and we used α = 0.5 in our analysis.
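For concreteness, the following is a minimal MATLAB sketch of Equations 1-3, assuming the two edgemaps are given as logical matrices with exact pixel-wise correspondence; the function and variable names are ours, not from the original analysis code.

% Precision, recall, and F-measure (Equations 1-3) for a binary test
% edgemap E against a binary reference edgemap G.
function F = edgemap_fmeasure(E, G, alpha)
    tp = sum(E(:) & G(:));     % edge pixels labeled in both maps
    fp = sum(E(:) & ~G(:));    % test edges absent from the reference
    fn = sum(~E(:) & G(:));    % reference edges missed by the test map
    P = tp / (tp + fp);        % precision, Equation 1
    R = tp / (tp + fn);        % recall, Equation 2
    F = (P * R) / (alpha * P + (1 - alpha) * R);  % Equation 3
end

For the leave-one-out analysis described above, G would be the combined edgemap of the five remaining subjects and alpha = 0.5.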

In addition to the precision-recall analysis, we developed a novel method for quantifying intersubject consistency which minimizes problems of intersubject disagreement arising from the fact that certain subjects are simply more exhaustive in labeling all possible occlusions than other subjects. We defined the most conservative subject (MCS) for a given image as the subject who had labeled the fewest pixels. Using the MCS labeling, we generate a binary image mask F which is 1 for any pixel within R pixels of an occlusion labeled by the MCS and 0 for all other pixels. Applying this mask to the labeling of each subject yields a "reduced" labeling which is valid for intersubject comparison, since it only includes the most prominent occlusions labeled by all of the subjects. To calculate the comparison between two subjects, we randomly assigned one binary edgemap as the "reference" (I_ref) and the other binary edgemap as the "test" (I_test). We used the reference map to define a weighting function f_c(r), applied to all of the pixels in the test map, which quantified how close each pixel in the test map was to a pixel in the reference map. Mathematically, our index is given by

f = \frac{1}{N_t} \sum_{(x,y) \in I_{test}} f_c\!\left(r\left((x,y), I_{ref}\right)\right) I_{test}(x,y),  (4)

where r((x,y), I_ref) is the distance between (x,y) and the closest pixel in I_ref, N_t is the number of pixels in the test set, and the function f_c(r) is defined for 0 ≤ c < ∞ by

f_c(r) = \begin{cases} e^{-cr}, & 0 \le r \le R \\ 0, & r > R \end{cases}  (5)

and for c = ∞ by

f_\infty(r) = \begin{cases} 1, & r = 0 \\ 0, & r > 0 \end{cases}  (6)

where R is the radius of the mask, which we set to R = 10 in our analysis. The parameter c in our weighting function sets the sensitivity of f_c to the distance between the reference labeling and the test labeling. Setting c = 0 counts the fraction of pixels in the test edgemap which lie inside the mask generated by the reference edgemap, and setting c = ∞ measures the fraction of pixels in complete agreement between the two edgemaps.
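A compact MATLAB sketch of Equation 4, under our assumption that both edgemaps are logical matrices; bwdist (Image Processing Toolbox) supplies the distance from every pixel to the nearest nonzero pixel of the reference map.

% Intersubject similarity index f (Equation 4) between a test and a
% reference binary edgemap, with weighting function f_c (Equations 5-6).
function f = similarity_index(Itest, Iref, c, R)
    r = bwdist(Iref);            % r((x,y), Iref) at every pixel
    rt = r(Itest > 0);           % distances at the N_t test edge pixels
    if isinf(c)
        w = double(rt == 0);     % Equation 6: exact agreement only
    else
        w = exp(-c * rt) .* (rt <= R);  % Equation 5: decay inside mask radius R
    end
    f = mean(w);                 % average over the N_t test pixels
end

A call matching the analysis above would be similarity_index(testMap, refMap, 1.0, 10).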

Statistical measurements

Patch extraction and region labeling

Occlusion patches of varying sizes centered on an occlusion boundary were extracted automatically from the database by choosing random pixels labeled as occlusions by a single subject, cycling through the six subjects. Since we only wanted patches containing a single occlusion separating two regions (figure and ground) of roughly equal size, we only accepted a candidate patch when:


1. The composite occlusion edge map from all subjects consisted of a single connected piece in the analysis window.
2. The occlusion contacted the sides of the window at two distinct points and divided the patch into two regions.
3. Each region comprised at least 35% of the pixels.

Each occlusion patch consisted of two regions of roughly equal size separated by a single boundary, with the central pixel (w/2, w/2) of a patch of size w always being an occlusion (Figure 4a). Note that this procedure actually yields a subset of all possible occlusions, since it excludes T-junctions and occlusions formed by highly convex boundaries. Since the selection of occlusion patches was automated, there were no constraints on the properties of the surfaces on either side of the occlusion with respect to factors like lighting, shadows, reflectance, or material. In addition to the occlusion patches, we extracted surface patches at random locations from the set of 60 × 60 surface regions chosen by the subjects. We used 8 × 8, 16 × 16, and 32 × 32 patches for our analyses and psychophysical experiments.
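A minimal MATLAB sketch of the acceptance test implied by the criteria above, assuming the composite edge map is a logical matrix; the side-contact part of criterion 2 is approximated here by requiring exactly two off-edge connected regions, and bwconncomp requires the Image Processing Toolbox.

% Accept/reject a candidate w-by-w window given its composite binary
% occlusion edge map `edges`.
function ok = accept_patch(edges)
    w = size(edges, 1);
    ccEdge = bwconncomp(edges, 8);   % criterion 1: one connected edge piece
    ccReg = bwconncomp(~edges, 4);   % criterion 2 (approx.): edge splits window in two
    ok = ccEdge.NumObjects == 1 && ccReg.NumObjects == 2 && ...
         all(cellfun(@numel, ccReg.PixelIdxList) >= 0.35 * w^2) && ... % criterion 3
         edges(round(w/2), round(w/2));  % center pixel lies on the occlusion
end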

Grayscale scalar measurements

To obtain grayscale images, we converted the raw images into gamma-corrected RGB images using software available online at http://tabby.vision.mcgill.ca (Olmos & Kingdom, 2004). We then mapped the RGB color space to the NTSC color space, obtaining the grayscale luminance I = 0.2989R + 0.5870G + 0.1140B (Acharya & Ray, 2005). From these patches we measured a variety of visual features which can be used to distinguish occlusion from surface patches. Some of these features (for instance, luminance difference) depended on there being a region labeling of the image patch which separates it into regions corresponding to two different surfaces (Figure 4a). However, measuring these same features from surface patches is impossible, since surface patches only contain a single surface. Therefore, in order to measure region labeling-dependent features from the uniform surface patches, we assigned to each surface patch a set of 25 "dummy" region labelings from our database (spanning all possible orientations). The measured value of the feature was then taken as the maximum value over all 25 dummy labelings, which is sensible since all of the visual features were on average larger for occlusions than for uniform surface patches.

Given a grayscale image patch and a region labeling R = {R1, R2, B} partitioning the patch into regions corresponding to the two surfaces (R1, R2) as well as the set of boundary (B) pixels (Figure 4a), we measured the following visual features taken from the computational vision literature:

G1: Luminance difference Δl:

\Delta l = |l_1 - l_2|,  (7)

where l_1, l_2 are the mean luminances in regions R1, R2, respectively.

G2: Contrast difference Δσ:

\Delta\sigma = |\sigma_1 - \sigma_2|,  (8)

where σ_1, σ_2 are the contrasts (standard deviations) in regions R1, R2, respectively. Features G1 and G2 were both measured in previous studies on surface segmentation (Fine et al., 2003; Ing et al., 2010).

G3: Boundary luminance gradient G_B:

G_B = \frac{\|\nabla I\|}{\bar{I}},  (9)

where ∇I(x, y) = [∂I(x, y)/∂x, ∂I(x, y)/∂y]^T is the gradient of the image patch evaluated at the central pixel and Ī is the average intensity of the image patch (Balboa & Grzywacz, 2000).

G4: Oriented energy E_θ:

E_\theta = \sum_{i=1}^{N_\theta} \left[ (\mathbf{w}_i^{even} \cdot \mathbf{x})^2 + (\mathbf{w}_i^{odd} \cdot \mathbf{x})^2 \right],  (10)

where x is the image patch in vector form and w_i^even, w_i^odd are a quadrature-phase pair of Gabor filters at N_θ = 8 evenly spaced orientations θ_i. For patch size w, the filters had means w/2 and standard deviations of w/4. Oriented energy has been used in several previous studies as a means of detecting edges (Geisler et al., 2001; Lee & Choe, 2003; Sigman, Cecchi, Gilbert, & Magnasco, 2001).

Figure 4. Illustration of stimuli used in the psychophysical experiments. (a) Original color and grayscale occlusion edge (top, middle) and its region label (bottom). The region label divides the patch into regions corresponding to the two surfaces (white, gray) as well as the boundary (black). (b) Top: Grayscale image patch with texture information removed. Middle: Occlusion patch with boundary removed. Bottom: Boundary and luminance information removed.

G5: Global patch contrast ρ:

\rho = \mathrm{std}(I).  (11)

This quantity has been measured in previous studies quantifying statistical differences between fixated image regions and random image regions for subjects freely viewing natural images while their eyes are being tracked (Rajashekar et al., 2007; Reinagel & Zador, 1999). Several studies of this kind have suggested that subjects may be preferentially looking at edges (Baddeley & Tatler, 2006).

Note that features G3-G5 are measured globally from the entire patch, whereas features G1 and G2 are differences between statistics measured from different regions of the patch.
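As a concrete illustration, here is a MATLAB sketch of the five grayscale measurements for one patch, assuming region masks R1, R2 (logical) from the region labeling and precomputed Gabor quadrature pairs wEven, wOdd (each column one filter, eight orientations); the function and field names are ours.

% Grayscale scalar features G1-G5 (Equations 7-11) for patch I.
function feats = grayscale_features(I, R1, R2, wEven, wOdd)
    feats.dL = abs(mean(I(R1)) - mean(I(R2)));     % G1, Equation 7
    feats.dSigma = abs(std(I(R1)) - std(I(R2)));   % G2, Equation 8
    [gx, gy] = gradient(I);                        % image gradient field
    c = round(size(I, 1) / 2);                     % central pixel index
    feats.GB = hypot(gx(c, c), gy(c, c)) / mean(I(:));  % G3, Equation 9
    x = I(:);                                      % patch in vector form
    feats.E = sum((wEven' * x).^2 + (wOdd' * x).^2);    % G4, Equation 10
    feats.rho = std(I(:));                         % G5, Equation 11
end

For a surface patch, this function would be called once per dummy labeling and the maximum of each field retained, as described above.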

Color scalar measurements

Images were converted from RGB to LMS color space using a MATLAB program which accompanies the images in the McGill database (Olmos & Kingdom, 2004). We converted the logarithmically transformed LMS images into an Lab color space by performing principal components analysis (PCA) on the set of LMS pixel intensities (Fine et al., 2003; Ing et al., 2010). Projections onto the axes of the Lab basis represent a color pixel in terms of its overall luminance (L), blue-yellow opponency (a), and red-green opponency (b).

We measured two additional properties from the LMS color image patches represented in the Lab basis:

C1: Blue-yellow difference Δa:

\Delta a = |a_1 - a_2|,  (12)

where a_1, a_2 are the mean values of the B-Y opponency component a in regions R1, R2, respectively.

C2: Red-green difference Δb:

\Delta b = |b_1 - b_2|,  (13)

where b_1, b_2 are the mean values of the R-G opponency component b in regions R1, R2, respectively.

These color scalar statistics were motivated by previous work studying human perceptual discrimination of different surfaces (Fine et al., 2003; Ing et al., 2010).
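A short MATLAB sketch of C1-C2 under stated assumptions: lms holds the log-LMS pixels of a patch as rows, pca is from the Statistics Toolbox, and the assignment of the second and third principal components to the a and b opponent channels is our assumption about the ordering, not something fixed by the text.

% Lab-like opponent basis via PCA on log-LMS pixels, then Equations 12-13.
[~, score] = pca(log(lms));           % columns assumed: luminance, a, b
a = score(:, 2); b = score(:, 3);     % blue-yellow and red-green channels
dA = abs(mean(a(R1)) - mean(a(R2)));  % C1, Equation 12 (R1, R2: vectorized masks)
dB = abs(mean(b(R1)) - mean(b(R2)));  % C2, Equation 13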

Machine classifiers

Quadratic classifier analysis

In order to study how well various feature subsets measured from the image patches could predict human performance, we made use of a quadratic classifier analysis. The quadratic classifier is a natural choice for quantifying the discriminability of two categories defined by features having multivariate Gaussian distributions (Duda, Hart, & Stork, 2000) and has been used in previous work studying the perceptual discriminability of surfaces (Ing et al., 2010). Assume that we have two categories C1, C2 of stimuli from which we can measure n features u = (u_1, u_2, ..., u_n)^T, and that the features measured from each category are Gaussian distributed:

p(\mathbf{u} \mid C_i) = \mathcal{N}(\mathbf{u} \mid \boldsymbol{\mu}_i, \mathbf{R}_i),  (14)

where μ_i and R_i are the means and covariances of each category. Given a novel observation with feature vector u, assuming that the two categories are equally likely a priori, we evaluate the log-likelihood ratio

L_{12}(\mathbf{u}) = \ln p(\mathbf{u} \mid C_1) - \ln p(\mathbf{u} \mid C_2),  (15)

choosing C1 when L_{12} > 0 and C2 when L_{12} < 0. In the case of Gaussian distributions for each category as in Equation 14, Equation 15 can be rewritten as

L_{12}(\mathbf{u}) = \frac{1}{2} \ln\frac{|\mathbf{R}_2|}{|\mathbf{R}_1|} + \frac{1}{2}\left[ Q_2(\mathbf{u}) - Q_1(\mathbf{u}) \right],  (16)

where

Q_i(\mathbf{u}) = (\mathbf{u} - \boldsymbol{\mu}_i)^T \mathbf{R}_i^{-1} (\mathbf{u} - \boldsymbol{\mu}_i).  (17)

The task for applying this formalism is to define a set of n features to measure from the set of image patches and then use these measurements to define the means and covariances of each category in a supervised manner. New image patches that are unlabeled can then be classified using this quadratic classifier.

In our analyses, category C1 was occlusion patches and category C2 was surface patches. We estimated the parameters of the classifiers for each category by taking the means and covariances of the statistics measured from a set of 2,000 image patches (training set), and we applied these classifiers to different 400-patch subsets of the 1,000 image patches which were presented to the subjects (test set). This analysis was performed for multiple classifiers defined by different subsets of parameters and for image patches of all sizes (8 × 8, 16 × 16, 32 × 32).
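A minimal MATLAB sketch of Equations 14-17, assuming Uocc and Usurf are training feature matrices (rows = patches, columns = log features) and U holds the test patches; implicit expansion requires R2016b or later (bsxfun would otherwise be needed), and the variable names are ours.

% Fit one Gaussian per class, then classify by log-likelihood ratio.
mu1 = mean(Uocc, 1);  S1 = cov(Uocc);    % occlusion class C1
mu2 = mean(Usurf, 1); S2 = cov(Usurf);   % surface class C2

% Quadratic forms Q_i(u) (Equation 17) for every row of U
Q1 = sum(((U - mu1) / S1) .* (U - mu1), 2);
Q2 = sum(((U - mu2) / S2) .* (U - mu2), 2);

% Log-likelihood ratio (Equation 16); choose C1 (occlusion) when positive
L12 = 0.5 * (log(det(S2)) - log(det(S1))) + 0.5 * (Q2 - Q1);
isOcclusion = (L12 > 0);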

SVM classifier analysis

As an additional control, in addition to the quadratic classifier analysis we trained a Support Vector Machine (SVM) classifier (Cristianini & Shawe-Taylor, 2000) to discriminate occlusions and surfaces using our grayscale visual feature set (G1-G5). The SVM classifier is a standard and well-studied method in machine learning which achieves good classification results by learning the separating hyperplane that maximizes the margin between two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.
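A usage sketch with the (now legacy) toolbox functions named in the text, on hypothetical training and test matrices of ours:

% Train and apply the SVM control classifier (default linear kernel).
svmStruct = svmtrain(Xtrain, yTrain);   % learn the max-margin hyperplane
yHat = svmclassify(svmStruct, Xtest);   % classify held-out feature vectors
pc = mean(yHat == yTest);               % proportion correct on the test set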

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales and that appropriate scale selection and integration across scale is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multiscale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983) and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using Independent Component Analysis (ICA). These filters closely resemble Gabor filters but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.
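A sketch of this pipeline under stated assumptions: W holds the ICA-derived filters as rows, X holds vectorized patches as columns, y is a 1/0 row vector of labels, rectification is taken here as full-wave (abs), glmfit/glmval are from the Statistics Toolbox, and patternnet is from the Neural Network Toolbox (whose default optimizer stands in for plain gradient descent).

% Rectified multiscale filter-bank responses as classifier inputs.
Z = abs(W * X);                       % rectified filter outputs (filters x patches)

% (a) Linear logistic regression classifier
b = glmfit(Z', y', 'binomial');       % y: 1 = occlusion, 0 = surface
pOcc = glmval(b, Ztest', 'logit');    % predicted occlusion probability

% (b) Three-layer network with one hidden layer (here 16 units)
net = patternnet(16);
net = train(net, Z, [y; 1 - y]);      % two-class target coding, columns = samples
pHat = net(Ztest);                    % outputs for the validation patches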

Experimental paradigm

In the psychophysical experiments, image patches were displayed on a 24-inch Macintosh Cinema Display (Apple, Inc., Cupertino, CA). Since the RGB images were linearized to correct for camera gamma (Olmos & Kingdom, 2004), we set the display monitor to have a gamma of approximately 1 using the Mac Calibration Assistant, so that the displayed images would be as natural looking as possible. Subjects were seated in a dim room in front of the monitor, and image patches were presented at the center of the screen, scaled to subtend 1.5 degrees of visual angle at the approximately 12-inch (30.5 cm) viewing distance. All stimuli subtended the same visual angle to eliminate confounds between the size of the image on the retina and its pixel dimensions. Previous studies have demonstrated that human performance on a similar task depends only on the number of pixels (McDermott, 2004), so by holding the retinal size constant the only variable is the number of pixels.

Our experimental paradigm is illustrated in Figure 6. Surrounding the image patch was a set of flanking crosshairs whose imaginary intersection defines the "center" of the image patch. In the standard task, the subject decides with a binary (1/0) keypress whether an occlusion boundary passes through the image patch center. Half of the patches were taken from the occlusion boundary database (where all occlusions pass through the patch center), and the other half of the patches were taken from the surface database. Therefore, guessing on the task would yield performance of 50 percent correct. The design of this experiment is similar to a previous study on detecting T-junctions in natural images (McDermott, 2004). To optimize performance, subjects were given as much time as they needed for each patch. Furthermore, since perceptual learning can improve performance (Ing et al., 2010), we provided positive and negative feedback.

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2-3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on surface segmentation where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) was not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches, as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information we also created a "boundary removed" condition which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For a proportion estimate p, we compute the 1 - α confidence interval as

p \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}.  (18)

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
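In MATLAB, Equation 18 is a two-liner; this sketch assumes nCorrect correct responses out of n trials, and norminv is from the Statistics Toolbox.

% Binomial proportion confidence interval (Equation 18).
p = nCorrect / n;                        % proportion correct
z = norminv(1 - alpha/2);                % Gaussian quantile, alpha = 0.05
ci = p + [-1, 1] * z * sqrt(p * (1 - p) / n);  % [lower, upper] bound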

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis in which the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may arise from other cues like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.

In other annotated edge databases like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue) for 98 BSD images segmented by six subjects. We find, averaged across all images and subjects, that about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps using the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^-121, Wilcoxon rank-sum).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 1.0 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^-202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling non-edges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that for this analysis there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations (p < 10^-28).

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of the MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 1.0. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions) as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches are comprised of a single region, and therefore it is unclear how we can measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of measured features is given in Methods (Statistical measurements).

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy at low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, obtained by fitting a line to the power spectrum of each individual image, found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (approximately 2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
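An illustrative MATLAB sketch (ours) of the slope estimate for one w-by-w patch I: radially average the 2-D power spectrum and fit a line in log-log coordinates, taking the negative slope as the spectral exponent.

% Radially averaged power spectrum and log-log slope fit.
F2 = abs(fftshift(fft2(I - mean(I(:))))).^2;       % 2-D power spectrum
w = size(I, 1);
[X, Y] = meshgrid(1:w, 1:w);
r = round(hypot(X - (w/2 + 1), Y - (w/2 + 1)));    % distance from the DC bin
maxf = floor(w/2) - 1;
keep = r >= 1 & r <= maxf;
pr = accumarray(r(keep), F2(keep), [maxf 1], @mean);  % radial average per frequency
b = polyfit(log((1:maxf)'), log(pr), 1);
exponent = -b(1);                                  % reported spectral exponent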

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

Figure 9 shows that all of the grayscale features (G1-G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches. Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log ρ and contrast difference log Δσ). These results are consistent with previous work which demonstrates that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we measured color features (C1-C2; Methods: Statistical measurements) as well by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch as long as they needed (1-2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed by six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different sizes of image patches which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2) we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line), together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1-G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cuesintact is slightly more tricky since simply equalizing theluminance in the two image regions can lead toproblems of boundary artifact (a high spatial frequencyedge along the boundary) as has been noted by others(Arsenault et al 2011) In order to circumvent thisproblem we removed the boundary artifact by coveringa 3-pixel wide region including the boundary by settingall pixels in this strip to a single uniform value (Figure4b bottom) We then were able to equalize theluminance on each side of the patch without creatingboundary artifact (luminance thorn boundary removed)Since the boundary pixels may could potentiallycontain long-range spatial correlation information(Hess amp Field 1999) useful for identifying occlusions

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture cues or luminance cues, does not affect subject performance.


as an additional control, we also considered performance on patches where we only removed the boundary ("boundary removed").

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. Subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier detecting

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) Classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) Classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.


occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and overall patch variance (ρ). We also see that the contrast difference (Δσ) by itself is a fairly weak parameter for all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δl, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with ρ and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent of luminance cues.
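To make this feature-ranking procedure concrete, the sketch below (ours, not the authors' code) shows one way to compute a d′ value per feature and evaluate classifiers on feature sets grown in decreasing order of d′. `X_occ` and `X_surf` are assumed feature matrices (patches × features), and `classifier_accuracy` is a hypothetical helper standing in for training and testing the quadratic classifier on a column subset.

```python
import numpy as np

def dprime(occ, surf):
    """d' for one feature: separation of class means in units of pooled SD."""
    return abs(occ.mean() - surf.mean()) / np.sqrt(0.5 * (occ.var() + surf.var()))

def rank_and_add(X_occ, X_surf, classifier_accuracy):
    """Rank features by d', then evaluate classifiers on growing feature sets.
    classifier_accuracy(idx) is a hypothetical helper that trains and tests
    a quadratic classifier restricted to the feature columns in idx."""
    d = np.array([dprime(X_occ[:, j], X_surf[:, j])
                  for j in range(X_occ.shape[1])])
    order = np.argsort(-d)                     # decreasing d'
    return [classifier_accuracy(order[:k + 1]) for k in range(len(order))]
```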

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes-optimal for Gaussian-distributed class-conditional densities (Duda

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) Classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) Classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.


et al., 2000) and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield superior performance to the quadratic classifier and hence provide a better match with our perceptual data.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the poor match between the quadratic classifier and human performance reflects the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classification method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist at multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences. This is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying independent component analysis (ICA; Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The ICA algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b where, instead of learning the best hidden-layer representation for solving the classification task, the hidden-layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training in order to evaluate their performance.

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to spatially pool information across scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. We see that a simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.


the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot infer from our simple analysis the detailed form this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is

certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a), suggesting that luminance differences are a local cue, equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.


performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983), and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). This study found that rather large image regions were needed for subjects to attain reliable performance on junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that

studied by McDermott (2004), since we exclude T-junctions and only consider non-junction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models of how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "filter-rectify-filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although finding region boundaries does, de facto, end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. This is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.


Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.


Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.


Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.



Occlusion patches were selected automatically according to the following criteria:

1. The composite occlusion edge map from all subjects consisted of a single connected piece in the analysis window.

2. The occlusion contacted the sides of the window at two distinct points and divided the patch into two regions.

3. Each region comprised at least 35% of the pixels.

Each occlusion patch consisted of two regions of roughly equal size separated by a single boundary, with the central pixel (w/2, w/2) of a patch of size w always being an occlusion (Figure 4a). Note that this procedure actually yields a subset of all possible occlusions, since it excludes T-junctions and occlusions formed by highly convex boundaries. Since the selection of occlusion patches was automated, there were no constraints on the properties of the surfaces on either side of the occlusion with respect to factors like lighting, shadows, reflectance, or material. In addition to the occlusion patches, we extracted surface patches at random locations from the set of 60 × 60 surface regions chosen by the subjects. We used 8 × 8, 16 × 16, and 32 × 32 patches for our analyses and psychophysical experiments.
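A minimal sketch of how these selection criteria could be checked automatically is shown below. It assumes a boolean edge map for the analysis window, uses 4-connectivity for the connected-component checks, and treats the 35% criterion as a fraction of all window pixels; the authors' exact implementation is not specified in the text.

```python
import numpy as np
from scipy import ndimage

def is_valid_occlusion_window(edge_map, min_region_frac=0.35):
    """edge_map: boolean (w, w) array, True on composite labeled edge pixels."""
    w = edge_map.shape[0]
    # Criterion 1: edge pixels form a single connected piece.
    if ndimage.label(edge_map)[1] != 1:
        return False
    # Criterion 2: the edge touches the window border (two contact points).
    border = np.concatenate([edge_map[0], edge_map[-1],
                             edge_map[:, 0], edge_map[:, -1]])
    if border.sum() < 2:
        return False
    # Criteria 2-3: removing the edge leaves exactly two regions,
    # each covering at least 35% of the window pixels.
    regions, n_regions = ndimage.label(~edge_map)
    if n_regions != 2:
        return False
    fracs = np.bincount(regions.ravel())[1:] / float(w * w)
    return bool((fracs >= min_region_frac).all())
```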

Grayscale scalar measurements

To obtain grayscale images we converted the rawimages into gamma-corrected RGB images usingsoftware available online at httptabbyvisionmcgillca (Olmos amp Kingdom 2004) We then mapped theRGB color space to the NTSC color space obtainingthe grayscale luminance I frac14 02989 R thorn 05870 G thorn01140 B (Acharya amp Ray 2005) From these patches

we measured a variety of visual features which can beused to distinguish occlusion from surface patchesSome of these features (for instance luminance differ-ence) depended on there being a region labeling of theimage patch which separates it into regions corre-sponding to two different surfaces (Figure 4a) How-ever measuring these same features from surfacepatches is impossible since surface patches only containa single surface Therefore in order to measure regionlabeling-dependent features from the uniform surfacepatches we assigned to each surface patch a set of 25lsquolsquodummyrsquorsquo region labelings from our database (span-ning all possible orientations) The measured value ofthe feature was then taken as the maximum value overall 25 dummy labelings which is sensible since all of thevisual features were on average larger for occlusionsthan uniform surface patches
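In code, the luminance conversion and the dummy-labeling maximum might look like the following sketch; `measure_feature` is a hypothetical stand-in for any of the region-dependent features defined next.

```python
import numpy as np

def ntsc_luminance(rgb):
    """Grayscale luminance from a gamma-corrected RGB patch (H x W x 3)."""
    return 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]

def surface_feature(patch, dummy_labelings, measure_feature):
    """Region-dependent feature for a uniform surface patch: take the
    maximum over the 25 'dummy' region labelings, as described above."""
    return max(measure_feature(patch, labeling) for labeling in dummy_labelings)
```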

Given a grayscale image patch and a region labeling $R = \{R_1, R_2, B\}$ partitioning the patch into regions corresponding to the two surfaces ($R_1, R_2$) as well as the set of boundary ($B$) pixels (Figure 4a), we measured the following visual features taken from the computational vision literature:

G1: Luminance difference $\Delta l$:
$$\Delta l = |l_1 - l_2| \quad (7)$$
where $l_1, l_2$ are the mean luminances in regions $R_1, R_2$, respectively.

G2: Contrast difference $\Delta\sigma$:
$$\Delta\sigma = |\sigma_1 - \sigma_2| \quad (8)$$
where $\sigma_1, \sigma_2$ are the contrasts (standard deviations) in regions $R_1, R_2$, respectively. Features G1 and G2 were both measured in previous studies on surface segmentation (Fine et al., 2003; Ing et al., 2010).

G3: Boundary luminance gradient $G_B$:
$$G_B = \frac{\|\nabla I\|}{\bar{I}} \quad (9)$$
where $\nabla I(x, y) = [\partial I(x,y)/\partial x, \; \partial I(x,y)/\partial y]^T$ is the gradient of the image patch evaluated at the central pixel, and $\bar{I}$ is the average intensity of the image patch (Balboa & Grzywacz, 2000).

G4: Oriented energy $E_\theta$:
$$E_\theta = \sum_{i=1}^{N_\theta} (w_i^{\mathrm{even}} \cdot x)^2 + (w_i^{\mathrm{odd}} \cdot x)^2 \quad (10)$$
where $x$ is the image patch in vector form and $w_i^{\mathrm{even}}, w_i^{\mathrm{odd}}$ are a quadrature-phase pair of Gabor filters at $N_\theta = 8$ evenly spaced orientations $\theta_i$. For patch size $w$, the filters had means $w/2$ and standard deviations of $w/4$. Oriented energy has been used in several previous studies as a means of detecting edges (Geisler et al., 2001; Lee & Choe, 2003; Sigman, Cecchi, Gilbert, & Magnasco, 2001).

Figure 4. Illustration of stimuli used in the psychophysical experiments. (a) Original color and grayscale occlusion edge (top, middle) and its region label (bottom). The region label divides the patch into regions corresponding to the two surfaces (white, gray) as well as the boundary (black). (b) Top: Grayscale image patch with texture information removed. Middle: Occlusion patch with boundary removed. Bottom: Boundary and luminance information removed.


G5: Global patch contrast $\rho$:
$$\rho = \mathrm{std}(I) \quad (11)$$
This quantity has been measured in previous studies which quantify statistical differences between fixated image regions and random image regions for subjects free-viewing natural images while their eyes are being tracked (Rajashekar et al., 2007; Reinagel & Zador, 1999). Several studies of this kind have suggested that subjects may be preferentially looking at edges (Baddeley & Tatler, 2006).

Note that features G3–G5 are measured globally from the entire patch, whereas features G1 and G2 are differences between statistics measured from different regions of the patch.
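The grayscale features translate directly into code; the sketch below (ours, under the definitions above) computes G1–G3 and G5 from a patch and boolean region masks. G4 additionally requires the quadrature-pair Gabor bank of Equation 10 and is omitted here for brevity; the variable names are assumptions.

```python
import numpy as np

def grayscale_features(patch, r1, r2):
    """patch: (w, w) grayscale array; r1, r2: boolean masks for the two regions."""
    c = patch.shape[0] // 2                       # central pixel index
    gy, gx = np.gradient(patch)                   # partial derivatives of I
    return {
        'delta_l':     abs(patch[r1].mean() - patch[r2].mean()),    # G1, Eq. 7
        'delta_sigma': abs(patch[r1].std() - patch[r2].std()),      # G2, Eq. 8
        'G_B':         np.hypot(gx[c, c], gy[c, c]) / patch.mean(), # G3, Eq. 9
        'rho':         patch.std(),                                  # G5, Eq. 11
    }
```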

Color scalar measurements

Images were converted from RGB to LMS color space using a MATLAB program which accompanies the images in the McGill database (Olmos & Kingdom, 2004). We converted the logarithmically transformed LMS images into an Lab color space by performing principal component analysis (PCA) on the set of LMS pixel intensities (Fine et al., 2003; Ing et al., 2010). Projections onto the axes of the Lab basis represent a color pixel in terms of its overall luminance (L), blue-yellow opponency (a), and red-green opponency (b).

We measured two additional properties from the LMS color image patches represented in the Lab basis:

C1: Blue-yellow difference $\Delta a$:
$$\Delta a = |a_1 - a_2| \quad (12)$$
where $a_1, a_2$ are the mean values of the B-Y opponency component $a$ in regions $R_1, R_2$, respectively.

C2: Red-green difference $\Delta b$:
$$\Delta b = |b_1 - b_2| \quad (13)$$
where $b_1, b_2$ are the mean values of the R-G opponency component $b$ in regions $R_1, R_2$, respectively.

These color scalar statistics were motivated by previous work studying human perceptual discrimination of different surfaces (Fine et al., 2003; Ing et al., 2010).
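A corresponding sketch for the color features: PCA on the log-LMS pixels yields the Lab-style axes, and Δa, Δb are mean differences along the two opponent axes. The ordering of the PCA axes as (L, a, b) by decreasing variance is an assumption here, not stated in the text.

```python
import numpy as np

def lab_coordinates(log_lms):
    """log_lms: (n_pixels, 3) log-transformed LMS values.
    Returns PCA coordinates, assumed ordered (L, a, b) by variance."""
    X = log_lms - log_lms.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt.T

def color_differences(lab, r1, r2):
    """C1 (Eq. 12) and C2 (Eq. 13), given flattened boolean region masks."""
    delta_a = abs(lab[r1, 1].mean() - lab[r2, 1].mean())
    delta_b = abs(lab[r1, 2].mean() - lab[r2, 2].mean())
    return delta_a, delta_b
```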

Machine classifiers

Quadratic classifier analysis

In order to study how well various feature subsets measured from the image patches could predict human performance, we made use of a quadratic classifier analysis. The quadratic classifier is a natural choice for quantifying the discriminability of two categories defined by features having multivariate Gaussian distributions (Duda, Hart, & Stork, 2000), and it has been used in previous work studying the perceptual discriminability of surfaces (Ing et al., 2010). Assume that we have two categories $C_1, C_2$ of stimuli from which we can measure $n$ features $u = (u_1, u_2, \ldots, u_n)^T$, and that the features measured from each category are Gaussian distributed:
$$p(u|C_i) = \mathcal{N}(u|\mu_i, \Sigma_i) \quad (14)$$
where $\mu_i$ and $\Sigma_i$ are the mean and covariance of each category. Given a novel observation with feature vector $u$, and assuming that the two categories are equally likely a priori, we evaluate the log-likelihood ratio
$$L_{12}(u) = \ln p(u|C_1) - \ln p(u|C_2) \quad (15)$$
choosing $C_1$ when $L_{12} > 0$ and $C_2$ when $L_{12} < 0$. In the case of Gaussian distributions for each category, as in Equation 14, Equation 15 can be rewritten as
$$L_{12}(u) = \frac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left[Q_2(u) - Q_1(u)\right] \quad (16)$$
where
$$Q_i(u) = (u - \mu_i)^T \Sigma_i^{-1} (u - \mu_i). \quad (17)$$
The task for applying this formalism is to define a set of $n$ features to measure from the set of image patches, and then use these measurements to define the means and covariances of each category in a supervised manner. New, unlabeled image patches can then be classified using this quadratic classifier.
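Equations 14–17 translate into a compact classifier; the following is a minimal sketch in Python (ours, not the authors' code).

```python
import numpy as np

class QuadraticClassifier:
    """Gaussian class-conditional classifier implementing Eqs. 14-17."""

    def fit(self, U1, U2):
        # Supervised estimation: means and covariances per category (Eq. 14).
        covs = [np.cov(U, rowvar=False) for U in (U1, U2)]
        self.mu = [U.mean(axis=0) for U in (U1, U2)]
        self.Sinv = [np.linalg.inv(S) for S in covs]
        self.logdet = [np.linalg.slogdet(S)[1] for S in covs]
        return self

    def loglik_ratio(self, u):
        # Eq. 17: quadratic forms for each category.
        Q = [float((u - m) @ Si @ (u - m))
             for m, Si in zip(self.mu, self.Sinv)]
        # Eq. 16: L12(u) = 1/2 ln(|S2|/|S1|) + 1/2 (Q2 - Q1).
        return 0.5 * (self.logdet[1] - self.logdet[0]) + 0.5 * (Q[1] - Q[0])

    def predict(self, u):
        # Eq. 15: choose C1 (occlusion) when L12(u) > 0, else C2 (surface).
        return 1 if self.loglik_ratio(u) > 0 else 2
```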

In our analyses, category $C_1$ was occlusion patches and category $C_2$ was surface patches. We estimated the parameters of the classifiers for each category by taking the means and covariances of the statistics measured from a set of 2,000 image patches (training set), and we applied these classifiers to 400-patch subsets of the 1,000 image patches which were presented to the subjects (test set). This analysis was performed for multiple classifiers defined by different subsets of parameters and for image patches of all sizes (8 × 8, 16 × 16, 32 × 32).

SVM classifier analysis

As an additional control, in addition to the quadratic classifier analysis we trained a Support Vector Machine (SVM) classifier (Cristianini & Shawe-Taylor, 2000) to discriminate occlusions and surfaces using our grayscale visual feature set (G1–G5). The SVM classifier is a standard and well-studied method in machine learning which achieves good classification results by learning the separating hyperplane that maximizes the margin


between the two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.
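An equivalent analysis is easy to reproduce outside MATLAB; the sketch below uses scikit-learn in Python on stand-in data. The kernel and other settings of the original svmtrain.m call are not specified in the text, so the linear kernel here is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for the G1-G5 feature matrices: 2,000 training and 400 test patches.
X_train, y_train = rng.normal(size=(2000, 5)), rng.integers(0, 2, 2000)
X_test, y_test = rng.normal(size=(400, 5)), rng.integers(0, 2, 400)

clf = SVC(kernel='linear').fit(X_train, y_train)   # kernel choice is assumed
print('validation accuracy:', clf.score(X_test, y_test))
```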

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales, and that appropriate scale selection and integration across scale is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multiscale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983), and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using independent component analysis (ICA). These filters closely resemble Gabor filters but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier, and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.
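As a sketch of this pipeline (assuming a precomputed ICA filter matrix and full-wave rectification, neither of which is fully specified in the text; scikit-learn's default optimizers stand in for the gradient descent training described):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def rectified_outputs(patches, filters):
    """patches: (N, w*w) flattened patches; filters: (K, w*w) ICA/Gabor filters.
    Full-wave rectification of the filter responses is assumed here."""
    return np.abs(patches @ filters.T)

def train_classifiers(X_train, y_train, n_hidden=16):
    linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # (a)
    mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,),                # (b)
                        max_iter=2000).fit(X_train, y_train)
    return linear, mlp
```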

Experimental paradigm

In the psychophysical experiments, image patches were displayed on a 24-inch Macintosh Cinema Display (Apple, Inc., Cupertino, CA). Since the RGB images were linearized to correct for camera gamma (Olmos & Kingdom, 2004), we set the display monitor to have a gamma of approximately 1 using the Mac Calibration Assistant, so that the displayed images would be as natural looking as possible. Subjects were seated in a dim room in front of the monitor, and image patches were presented at the center of the screen, scaled to subtend 1.5 degrees of visual angle at the approximately 12-inch (30.5 cm) viewing distance. All stimuli subtended the same visual angle to eliminate confounds between the size of the image on the retina and its pixel dimensions. Previous studies have demonstrated that human performance on a similar task depends only on the number of pixels (McDermott, 2004), so by holding the retinal size constant, the only variable is the number of pixels.

Our experimental paradigm is illustrated in Figure 6. Surrounding the image patch was a set of flanking crosshairs whose imaginary intersection defines the "center" of the image patch. In the standard task, the subject decides whether an occlusion boundary passes through the image patch center, responding with a binary (1/0) key-press. Half of the patches were taken from the occlusion boundary database (where all occlusions pass through the patch center), and the other half were taken from the surface database; therefore, guessing on the task would yield performance of 50 percent correct. The design of this experiment is similar to a previous study on detecting T-junctions in natural images (McDermott, 2004). To optimize performance, subjects were given as much time as they needed for each patch. Furthermore, since perceptual learning can improve performance (Ing et al., 2010), we provided positive and negative feedback.

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as on the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2–3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.


surface segmentation, where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) was not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches, as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that two cues are now missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information, we also created a "boundary removed" condition which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.
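The three cue-removal manipulations can be summarized in a short sketch (ours, following the description above); the boolean region masks r1, r2 and a precomputed 3-pixel-thick boundary mask b are assumed given.

```python
import numpy as np

def texture_removed(patch, r1, r2):
    """'Texture removed': average the pixels within each region,
    leaving luminance (or color) contrast as the only cue."""
    out = patch.copy()
    out[r1], out[r2] = patch[r1].mean(), patch[r2].mean()
    return out

def boundary_removed(patch, b):
    """'Boundary removed': cover the 3-pixel boundary strip with the
    mean luminance of the entire patch."""
    out = patch.copy()
    out[b] = patch.mean()
    return out

def boundary_luminance_removed(patch, r1, r2, b):
    """'Boundary + luminance removed': mask the boundary, then equalize
    the mean luminance of the two regions without a boundary artifact."""
    out = boundary_removed(patch, b)
    for r in (r1, r2):
        out[r] += patch.mean() - out[r].mean()
    return out
```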

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For a proportion estimate $p$, we compute the $1 - \alpha$ confidence interval as
$$p \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}. \quad (18)$$
We use a significance level of $\alpha = 0.05$ in all of our tests and calculations of confidence intervals.
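Equation 18 corresponds to a few lines of code; for instance (a sketch using SciPy for the Gaussian quantile):

```python
import numpy as np
from scipy import stats

def binomial_ci(p_hat, n, alpha=0.05):
    """Gaussian-approximation confidence interval for a proportion (Eq. 18)."""
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# e.g., 75% correct over 400 trials gives roughly (0.708, 0.792)
print(binomial_ci(0.75, 400))
```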

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.


use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may also arise from other cues, like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

In other annotated edge databases, like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue), for 98 BSD images segmented by six subjects. We find, averaged across all images and subjects, that about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges when computing edgemaps from the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^{-121}, Wilcoxon rank-sum).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database).

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 10. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 10 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^−202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a 'leave-one-out' procedure where we compared the edges labeled by one subject to a 'ground truth' labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that there is significantly better agreement between edgemaps in the COB database than in those obtained indirectly from the BSD segmentations (p < 10^−28).
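A minimal sketch of the leave-one-out F-measure computation (our illustration; the actual analysis in Methods may additionally tolerate small spatial offsets between labelings, which this sketch ignores):

```python
import numpy as np

def f_measure(pred, truth):
    """Weighted (harmonic) mean of precision and recall for binary edgemaps."""
    true_pos = np.logical_and(pred, truth).sum()
    precision = true_pos / max(pred.sum(), 1)
    recall = true_pos / max(truth.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def leave_one_out(edgemaps):
    """Score each subject against the union of the other subjects' labels."""
    scores = []
    for i, edgemap in enumerate(edgemaps):
        others = [m for j, m in enumerate(edgemaps) if j != i]
        ground_truth = np.logical_or.reduce(others)
        scores.append(f_measure(edgemap, ground_truth))
    return scores
```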

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion, which can be divided into two regions of roughly equal size, we obtained a region labeling illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions) as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches consist of a single region, so it is unclear how to measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of a feature its maximum over all 25 dummy labelings. A full description of measured features is given in Methods (Statistical measurements).
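As a sketch of this procedure (our illustration; the exact geometry of the 25 labelings is described in Methods):

```python
import numpy as np

def dummy_labelings(n, n_orientations=25):
    """Binary masks splitting an n-by-n patch along lines through its
    center at evenly spaced orientations."""
    ys, xs = np.mgrid[0:n, 0:n] - (n - 1) / 2.0
    masks = []
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        masks.append(xs * np.cos(theta) + ys * np.sin(theta) >= 0)
    return masks

def surface_feature(patch, feature_fn, masks):
    """Measure a region-dependent feature under each dummy labeling,
    keeping the maximum (features are on average larger for occlusions)."""
    return max(feature_fn(patch[m], patch[~m]) for m in masks)

# Example: luminance difference between the two dummy regions
luminance_diff = lambda r1, r2: abs(r1.mean() - r2.mean())
```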

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy in low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, obtained by fitting a line to the power spectrum of each individual image on log-log axes, revealed a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
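The slope fit can be sketched as follows (a simplified rotational treatment without windowing; the function name is ours):

```python
import numpy as np

def spectrum_exponent(patch):
    """Estimate alpha in a 1/f^alpha power spectrum by regressing log power
    against log spatial frequency over all frequency bins."""
    n = patch.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(patch - patch.mean()))) ** 2
    ys, xs = np.mgrid[0:n, 0:n] - n // 2
    freq = np.hypot(xs, ys)
    valid = (freq >= 1) & (freq <= n // 2)
    slope, _ = np.polyfit(np.log(freq[valid]), np.log(power[valid] + 1e-12), 1)
    return -slope
```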

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches.

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δμ and boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log ρ and contrast difference log Δσ). These results are consistent with previous work demonstrating that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we also measured color features (C1–C2; Methods: Statistical measurements) by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch as long as they needed (1–2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed on six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see that performance is significantly different for the different image patch sizes which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results are shown in Figure 10b (red line) together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.

One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1, S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects make use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ('texture removed'), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly 'local' cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We were then able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture or luminance cues, does not affect subject performance.

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. Subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When the data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that, in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier when detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δμ) and overall patch variance (ρ). We also see that contrast difference (Δσ) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) A classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) A classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features are correlated with each other, so it was of interest to determine which features contribute the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δμ, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with ρ and Δμ (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δμ (Supplementary Table 2). For our analysis of the color classifier, we see substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent of luminance cues.
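The ranking step can be sketched as follows (assuming the standard pooled-variance definition of d′; the paper's exact formula is given in its Methods):

```python
import numpy as np

def d_prime(x_occlusion, x_surface):
    """Sensitivity index for one log-transformed feature dimension."""
    pooled_sd = np.sqrt(0.5 * (x_occlusion.var() + x_surface.var()))
    return abs(x_occlusion.mean() - x_surface.mean()) / pooled_sd

def rank_features(F_occ, F_surf):
    """Order feature columns by d', highest first."""
    scores = np.array([d_prime(F_occ[:, j], F_surf[:, j])
                       for j in range(F_occ.shape[1])])
    order = np.argsort(scores)[::-1]
    return order, scores[order]
```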

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield superior performance to the quadratic classifier and hence provide a better match with our perceptual data.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the poor match between the quadratic classifier and human performance reflects the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classification method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images exist at multiple scales, and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences. This is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Components Analysis (ICA; Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The ICA algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b where, instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set, not used for training, to evaluate their performance.
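A minimal sketch of this pipeline (our illustration; the ICA filters could come from any standard ICA implementation, only the logistic classifier is shown, and full-wave rectification is assumed):

```python
import numpy as np

def gabor_features(patches, ica_filters):
    """Rectified responses of an ICA-derived Gabor filter bank.
    patches: (N, D) flattened grayscale patches; ica_filters: (K, D)."""
    return np.abs(patches @ ica_filters.T)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, learning_rate=0.1, n_iterations=1000):
    """Batch gradient descent on the cross-entropy loss; y in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iterations):
        error = sigmoid(X @ w + b) - y
        w -= learning_rate * (X.T @ error) / len(y)
        b -= learning_rate * error.mean()
    return w, b
```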

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size, and it performs similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit 'receptive fields' do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a), suggesting that luminance differences are a local cue, equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983), and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that this multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of it was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance in junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for that particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, we did not consider quantitative models of how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as 'filter-rectify-filter' (FRF) models, sometimes also referred to as 'back-pocket' models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities, which complicates feature measurements (Zhou & Mel, 2008) and causes artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. This is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 29, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.



2001; Lee & Choe, 2003; Sigman, Cecchi, Gilbert, & Magnasco, 2001).

G5: Global patch contrast ρ

$$\rho = \mathrm{std}(I) \qquad (11)$$

This quantity has been measured in previous studies which quantify statistical differences between fixated image regions and random image regions for subjects free-viewing natural images while their eyes are being tracked (Rajashekar et al., 2007; Reinagel & Zador, 1999). Several studies of this kind have suggested that subjects may be preferentially looking at edges (Baddeley & Tatler, 2006).

Note that features G3–G5 are measured globally from the entire patch, whereas features G1, G2 are differences between statistics measured from different regions of the patch.

Color scalar measurements

Images were converted from RGB to LMS color space using a MATLAB program which accompanies the images in the McGill database (Olmos & Kingdom, 2004). We converted the logarithmically transformed LMS images into an Lab color space by performing principal components analysis (PCA) on the set of LMS pixel intensities (Fine et al., 2003; Ing et al., 2010). Projections onto the axes of the Lab basis represent a color pixel in terms of its overall luminance (L), blue-yellow opponency (a), and red-green opponency (b).
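A sketch of this conversion (our illustration; axis ordering and signs from PCA are determined only up to permutation by explained variance and sign flips):

```python
import numpy as np

def lab_basis(log_lms_pixels):
    """PCA on centered log-LMS pixel intensities; the three principal axes
    correspond to luminance (L), blue-yellow (a), and red-green (b)."""
    X = log_lms_pixels - log_lms_pixels.mean(axis=0)  # shape (N, 3)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt  # rows are the L, a, b directions (up to sign)

def to_lab(log_lms_pixels, basis):
    centered = log_lms_pixels - log_lms_pixels.mean(axis=0)
    return centered @ basis.T
```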

We measured two additional properties from the LMS color image patches represented in the Lab basis:

C1: Blue-yellow difference Δa

$$\Delta a = |a_1 - a_2| \qquad (12)$$

where a₁, a₂ are the mean values of the B-Y opponency component a in regions R₁, R₂, respectively.

C2: Red-green difference Δb

$$\Delta b = |b_1 - b_2| \qquad (13)$$

where b₁, b₂ are the mean values of the R-G opponency component b in regions R₁, R₂, respectively.

These color scalar statistics were motivated by previous work studying human perceptual discrimination of different surfaces (Fine et al., 2003; Ing et al., 2010).

Machine classifiers

Quadratic classifier analysis

In order to study how well various feature subsets measured from the image patches could predict human performance, we made use of a quadratic classifier analysis. The quadratic classifier is a natural choice for quantifying the discriminability of two categories defined by features having multivariate Gaussian distributions (Duda, Hart, & Stork, 2000), and it has been used in previous work studying the perceptual discriminability of surfaces (Ing et al., 2010). Assume that we have two categories C₁, C₂ of stimuli from which we can measure n features u = (u₁, u₂, ..., uₙ)ᵀ, and that features measured from each category are Gaussian distributed:

$$p(\mathbf{u}|C_i) = \mathcal{N}(\mathbf{u}\,|\,\mu_i, \Sigma_i) \qquad (14)$$

where μᵢ and Σᵢ are the means and covariances of each category. Given a novel observation with feature vector u, assuming that the two categories are equally likely a priori, we evaluate the log-likelihood ratio

$$L_{12}(\mathbf{u}) = \ln p(\mathbf{u}|C_1) - \ln p(\mathbf{u}|C_2) \qquad (15)$$

choosing C₁ when L₁₂ > 0 and C₂ when L₁₂ < 0. In the case of Gaussian distributions for each category, as in Equation 14, Equation 15 can be rewritten as

$$L_{12}(\mathbf{u}) = \frac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2}\left[Q_2(\mathbf{u}) - Q_1(\mathbf{u})\right] \qquad (16)$$

where

$$Q_i(\mathbf{u}) = (\mathbf{u} - \mu_i)^T \Sigma_i^{-1} (\mathbf{u} - \mu_i) \qquad (17)$$

The task for applying this formalism is to define a set of n features to measure from the set of image patches, and then use these measurements to define the means and covariances of each category in a supervised manner. New image patches that are unlabeled can then be classified using this quadratic classifier.

In our analyses, category C₁ was occlusion patches and category C₂ was surface patches. We estimated the parameters of the classifiers for each category by taking the means and covariances of the statistics measured from a set of 2,000 image patches (training set), and we applied these classifiers to different 400-patch subsets of the 1,000 image patches which were presented to the subjects (test set). This analysis was performed for multiple classifiers defined by different subsets of parameters and for image patches of all sizes (8 × 8, 16 × 16, 32 × 32).
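Equations 14–17 can be realized directly; a minimal sketch (ours), with class 1 = occlusions and class 2 = surfaces:

```python
import numpy as np

class QuadraticClassifier:
    """Fit one multivariate Gaussian per class; classify by the
    log-likelihood ratio of Equations 15-17 (equal priors assumed)."""
    def fit(self, X1, X2):
        self.mu = [X1.mean(axis=0), X2.mean(axis=0)]
        self.cov = [np.cov(X1.T), np.cov(X2.T)]
        return self

    def _quadratic_form(self, X, i):
        # Q_i(u) = (u - mu_i)^T Sigma_i^{-1} (u - mu_i), per row of X
        d = X - self.mu[i]
        return np.einsum('ij,jk,ik->i', d, np.linalg.inv(self.cov[i]), d)

    def predict(self, X):
        # L12 = 0.5*ln(|S2|/|S1|) + 0.5*(Q2 - Q1); choose class 1 when L12 > 0
        log_dets = [np.linalg.slogdet(c)[1] for c in self.cov]
        L12 = (0.5 * (log_dets[1] - log_dets[0])
               + 0.5 * (self._quadratic_form(X, 1) - self._quadratic_form(X, 0)))
        return L12 > 0  # True => occlusion
```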

SVM classifier analysis

As an additional control, in addition to the quadratic classifier analysis we trained a Support Vector Machine (SVM) classifier (Cristianini & Shawe-Taylor, 2000) to discriminate occlusions and surfaces using our grayscale visual feature set (G1–G5). The SVM classifier is a standard and well-studied method in machine learning which achieves good classification results by learning the separating hyperplane that maximizes the margin between the two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales, and that appropriate scale selection and integration across scales is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multiscale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983), and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using Independent Component Analysis (ICA). These filters closely resemble Gabor filters but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.
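The pipeline can be summarized by the MATLAB sketch below; W is the ICA filter matrix (one Gabor-like filter per row), and half-wave rectification and the Neural Network Toolbox's patternnet are shown as one plausible implementation rather than our exact code:

    % Multiscale features: project patches onto ICA-derived Gabor filters
    % and rectify. X: nPixels x nPatches; W: nFilters x nPixels.
    F = max(W * X, 0);                            % rectified filter outputs

    % (a) Linear logistic regression classifier on the rectified outputs.
    b = glmfit(F', labels, 'binomial');           % labels: 1 = occlusion, 0 = surface
    pOcc = glmval(b, Ftest', 'logit');            % P(occlusion) for validation patches

    % (b) Three-layer neural network classifier with 16 hidden units.
    net = patternnet(16);
    net = train(net, F, [labels'; 1 - labels']);  % 1-of-2 target coding
    out = net(Ftest);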

Experimental paradigm

In the psychophysical experiments, image patches were displayed on a 24-inch Macintosh Cinema Display (Apple, Inc., Cupertino, CA). Since the RGB images were linearized to correct for camera gamma (Olmos & Kingdom, 2004), we set the display monitor to have a gamma of approximately 1 using the Mac Calibration Assistant, so that the displayed images would be as natural looking as possible. Subjects were seated in a dim room in front of the monitor, and image patches were presented to the center of the screen, scaled to subtend 1.5 degrees of visual angle at the approximately 12-inch (30.5 cm) viewing distance. All stimuli subtended the same visual angle to eliminate confounds between the size of the image on the retina and its pixel dimensions. Previous studies have demonstrated that human performance on a similar task depends only on the number of pixels (McDermott, 2004), so by holding the retinal size constant, the only variable is the number of pixels.

Our experimental paradigm is illustrated in Figure 6. Surrounding the image patch was a set of flanking crosshairs whose imaginary intersection defines the "center" of the image patch. In the standard task, the subject decides with a binary (1/0) key-press whether an occlusion boundary passes through the image patch center. Half of the patches were taken from the occlusion boundary database (where all occlusions pass through the patch center), and the other half of the patches were taken from the surface database; therefore, guessing on the task would yield performance of 50 percent correct. The design of this experiment is similar to a previous study on detecting T-junctions in natural images (McDermott, 2004). To optimize performance, subjects were given as much time as they needed for each patch. Furthermore, since perceptual learning can improve performance (Ing et al., 2010), we provided positive and negative feedback.

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2–3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on surface segmentation, where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) was not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches, as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact, which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information, we also created a "boundary removed" condition, which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.
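As an illustration, a minimal MATLAB sketch of constructing a "boundary + luminance removed" patch from a grayscale patch and its region labeling is given below; img, regionMask, and boundaryMask are hypothetical variables, with boundaryMask marking the 3-pixel strip covering the boundary:

    % Occlude the boundary with the mean luminance of the whole patch,
    % then equalize the mean luminance of the two remaining regions.
    out = img;
    out(boundaryMask) = mean(img(:));
    r1 = out(regionMask & ~boundaryMask);          % region 1 pixels
    r2 = out(~regionMask & ~boundaryMask);         % region 2 pixels
    shift = (mean(r1) - mean(r2)) / 2;
    out(regionMask & ~boundaryMask)  = r1 - shift; % equal means on both sides,
    out(~regionMask & ~boundaryMask) = r2 + shift; % with no visible boundary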

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For a proportion estimate p, we compute the 1 − α confidence interval as

$$p \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}. \quad (18)$$

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
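For example, a subject scoring 75% correct on 400 trials yields the following interval in MATLAB (norminv is from the Statistics Toolbox; the numbers are purely illustrative):

    pHat = 0.75;  n = 400;  alpha = 0.05;          % example values
    z = norminv(1 - alpha/2);                      % z_{1-alpha/2}, approximately 1.96
    halfWidth = z * sqrt(pHat * (1 - pHat) / n);
    ci = [pHat - halfWidth, pHat + halfWidth];     % roughly [0.71, 0.79]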

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images, with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may also arise from other cues, like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.

In other annotated edge databases, like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue), for 98 BSD images segmented by six subjects. We find, averaged across all images and subjects, that about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps from the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^−121, Wilcoxon rank-sum).

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left), together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of the MCS mask with the labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 10. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed, as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods, Image database). Figure 7a shows an image from the COB database (top left), together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f, defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 10 (COB median = 0.2890, N = 3,500; BSD median = 0.1882, N = 3,430; p < 10^−202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure, where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods, Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that for this analysis there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations (p < 10^−28).
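In simplified form (pixel-exact matching, ignoring any spatial tolerance when corresponding edge pixels), the leave-one-out F-measure for one subject can be computed from binary edgemaps as in this sketch; subjEdges and gtEdges are our own hypothetical variable names:

    % Precision, recall, and F-measure for one subject's binary edgemap
    % against "ground truth" combined from the other five subjects.
    tp = sum(subjEdges(:) & gtEdges(:));       % true positive edge pixels
    precision = tp / sum(subjEdges(:));        % labeled pixels that are true edges
    recall    = tp / sum(gtEdges(:));          % true edges that were detected
    F = 2 * precision * recall / (precision + recall);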

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion, which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches consist of a single region, and therefore it is unclear how we can measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of a feature its maximum over all 25 dummy labelings, as sketched below. A full description of the measured features is given in Methods (Statistical measurements).
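In pseudocode form, the surface-patch measurement procedure reduces to the following MATLAB sketch, where makeDummyLabeling and measureFeature are hypothetical helpers standing in for the actual labeling and feature-measurement steps:

    % Measure a region-based feature from a surface patch using 25 dummy
    % labelings spanning all orientations, then take the maximum.
    thetas = (0:24) * pi / 25;                           % 25 orientations
    vals = zeros(1, 25);
    for k = 1:25
        mask = makeDummyLabeling(size(img), thetas(k));  % split patch in two
        vals(k) = measureFeature(img, mask);             % e.g., luminance difference
    end
    featVal = max(vals);                                 % value assigned to the surface patch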

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy at low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analyzing the exponent by fitting a line to the power spectrum of each individual image revealed a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
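The slope analysis amounts to fitting a line to the radially averaged power spectrum in log-log coordinates, roughly as in the following sketch (our own illustration of the procedure, not the exact analysis code):

    % Estimate the power-spectrum exponent of a 32 x 32 grayscale patch img.
    P = abs(fftshift(fft2(img - mean(img(:))))).^2;    % 2-D power spectrum
    [x, y] = meshgrid(-16:15, -16:15);                 % frequency coordinates
    r = round(sqrt(x.^2 + y.^2));                      % radial frequency index
    f = 1:15;
    radial = arrayfun(@(k) mean(P(r == k)), f);        % radially averaged spectrum
    c = polyfit(log(f), log(radial), 1);               % line fit in log-log space
    slope = -c(1);                                     % exponent: P(f) ~ 1/f^slope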

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

Figure 9 shows that all of the grayscale features (G1–G5; Methods, Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches. Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between the luminance difference log Δμ and the boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, the global contrast log ρ and the contrast difference log Δσ). These results are consistent with previous work demonstrating that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we measured color features (C1–C2; Methods, Statistical measurements) as well, by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces, with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch for as long as they needed (1–2 seconds typically), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed on six subjects having varying degrees of exposure to the image database (Methods, Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different sizes of image patches which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line), together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods, Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods, Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We then were able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture or luminance cues, does not affect subject performance.

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could potentially account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010), we defined a quadratic classifier on the class-conditional distributions, which we modeled as multivariate Gaussians (Methods, Machine classifiers). When the data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods, Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) A classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) A classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.

In order to understand which feature dimensions are most important for the quadratic classifier detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δμ) and overall patch variance (ρ). We also see that the contrast difference (Δσ) by itself is a fairly weak parameter for all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.
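For a single log-transformed feature dimension, d′ can be computed from the class-conditional samples; the sketch below assumes the usual pooled-variance convention, with occFeat and surfFeat as hypothetical vectors of feature values for occlusion and surface patches:

    % d-prime for one feature dimension.
    dprime = abs(mean(occFeat) - mean(surfFeat)) / ...
             sqrt(0.5 * (var(occFeat) + var(surfFeat)));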

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features are correlated with each other, so it was of interest to determine which features contribute the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δμ, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with ρ and Δμ (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δμ (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent of luminance cues.

Comparison with SVM classifier

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield superior performance to the quadratic classifier and hence provide a better match with our perceptual data.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the reason for the poor match of the quadratic classifier with human performance is the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classification method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist on multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences, which is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Components Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier and (b) a neural network classifier (Bishop, 2006) having 4, 16, and 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b where, instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set, not used for training, to evaluate their performance.

Classifier comparison

Figure 16 shows the performance of these two classifiers together with the subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, units tuned for texture-edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, a set which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) A hypothetical unit which detects texture edges defined by differences in orientation energy. (b) A hypothetical unit that detects texture edges defined by spatial frequency differences.

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance at junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models of how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to specify fully (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has been mostly considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, in which the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these second-stage filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.
Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.
Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]
Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.
Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.
Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.
Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.
Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.
Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.
Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.
Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.
Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.
Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]
Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.
Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.
Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.
Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.
Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.
Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).
Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.
Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.
Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.
Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.
Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.
Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.
Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]
Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.
Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.
Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.
Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.
Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.
Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.
Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.
Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.
Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.
McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.
McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]
Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.
Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.
Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.
Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.
Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.
Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.
Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.
Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.
Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.
Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and Gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.
Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.
Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.
Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.
Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.
Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.
Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]
Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.
Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 21


between two categories (Bishop, 2006). We implemented this analysis using the functions svmclassify.m and svmtrain.m in the MATLAB Bioinformatics Toolbox.
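For readers working outside MATLAB, the following is a minimal sketch of the same two-category discrimination, assuming scikit-learn's SVC as a stand-in for svmtrain.m/svmclassify.m; the feature arrays and the RBF kernel choice are illustrative assumptions, not the settings used in the paper.

```python
# Sketch of the SVM analysis, assuming scikit-learn as a stand-in for the
# MATLAB svmtrain.m / svmclassify.m functions named in the text.
# All variable names and data below are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 5))        # hypothetical log-feature vectors
y_train = rng.integers(0, 2, size=800)     # 1 = occlusion, 0 = surface
X_test = rng.normal(size=(400, 5))

svm = SVC(kernel="rbf")                    # kernel choice is an assumption
svm.fit(X_train, y_train)
labels = svm.predict(X_test)
```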

Multiscale classifier analyses on Gabor filter outputs

One weakness of defining machine classifiers using our set of visual features is that these features are defined on the scale of the entire image patch. This is problematic because it is well known that occlusion edges exist at multiple scales and that appropriate scale selection and integration across scale is an essential computation for accurate edge detection (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, it is well known that the neural code at the earliest stages of cortical processing is reasonably well described by a bank of multi-scale filters resembling Gabor functions (Daugman, 1985; Pollen & Ronner, 1983), and that such a basis forms an efficient code for natural scenes (Olshausen & Field, 1996).

In order to define a multiscale feature set resembling the early visual code, we utilized the rectified outputs of a bank of filters learned using Independent Component Analysis (ICA). These filters closely resemble Gabor filters, but have the additional useful property of constituting a maximally independent set of feature dimensions for encoding natural images (Bell & Sejnowski, 1997). Example filters for 16 × 16 image patches are shown in Figure 5a. The outputs of our filter bank were used as inputs to two different standard classifiers: (a) a linear logistic regression classifier, and (b) a three-layer neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. These classifiers were trained using standard gradient descent methods, and their performance was evaluated on a separate set of validation data not used for training. A schematic illustration of these classifiers is shown in Figure 5b.
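A minimal sketch of this pipeline follows, assuming scikit-learn's FastICA, LogisticRegression, and MLPClassifier as stand-ins for our implementation; the random patch array, the 64-filter dictionary size, and full-wave rectification are illustrative assumptions.

```python
# Sketch of the multiscale classifier analysis: learn Gabor-like filters with
# ICA, rectify the filter responses, and train two supervised classifiers.
# The patch data here are random placeholders, not the database patches.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
patches = rng.normal(size=(2000, 16 * 16))      # flattened 16 x 16 patches
labels = rng.integers(0, 2, size=2000)          # 1 = occlusion, 0 = surface

ica = FastICA(n_components=64, random_state=0, max_iter=500)
responses = np.abs(ica.fit_transform(patches))  # full-wave rectification (assumed)

# (a) linear logistic classifier; (b) neural network with 16 hidden units.
linear = LogisticRegression(max_iter=1000).fit(responses, labels)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000).fit(responses, labels)
```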

Experimental paradigm

In the psychophysical experiments, image patches were displayed on a 24-inch Macintosh cinema display (Apple, Inc., Cupertino, CA). Since the RGB images were linearized to correct for camera gamma (Olmos & Kingdom, 2004), we set the display monitor to have a gamma of approximately 1 using the Mac Calibration Assistant, so that the displayed images would be as natural looking as possible. Subjects were seated in a dim room in front of the monitor, and image patches were presented at the center of the screen, scaled to subtend 1.5 degrees of visual angle at the approximately 12-inch (30.5 cm) viewing distance. All stimuli subtended the same visual angle to eliminate confounds between the size of the image on the retina and its pixel dimensions. Previous studies have demonstrated that human performance on a similar task depends only on the number of pixels (McDermott, 2004), so by holding the retinal size constant, the only variable is the number of pixels.

Our experimental paradigm is illustrated in Figure 6. Surrounding the image patch was a set of flanking crosshairs whose imaginary intersection defines the "center" of the image patch. In the standard task, the subject decides with a binary (1/0) keypress whether an occlusion boundary passes through the image patch center. Half of the patches were taken from the occlusion boundary database (where all occlusions pass through the patch center), and the other half of the patches were taken from the surface database. Therefore, guessing on the task would yield performance of 50 percent correct. The design of this experiment is similar to a previous study on detecting T-junctions in natural images (McDermott, 2004). To optimize performance, subjects were given as much time as they needed for each patch. Furthermore, since perceptual learning can improve performance (Ing et al., 2010), we provided positive and negative feedback.

We performed the experiment on two sets of naive subjects having no previous exposure to the images or knowledge of the scientific aims, as well as the lead author (S0), for a total of six subjects (three males, three females). One set of two naive subjects (S1, S2) was allowed to browse grayscale versions of the full-scale images (576 × 768 pixels) prior to the experiment. After 2–3 seconds of viewing, they were shown, superimposed on the images, the union of all subject labelings. This was meant to give the subjects an intuition for what is meant by an occlusion boundary in the context of the full-scale image, and was inspired by previous work on

Figure 5. Illustration of the multiscale classifier analysis using the outputs of rectified Gabor filters. (a) Gabor functions learned using ICA form an efficient code for natural images, which maximizes statistical independence of filter responses. (b) Grayscale image patches are decomposed by a bank of multiscale Gabor filters resembling simple cells, and their rectified outputs are transformed by an intermediate layer of representation whose outputs are passed to a linear classifier.


surface segmentation, where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) were not shown the full-scale images beforehand, in order to control for the possibility that this pre-exposure may have helped to improve task performance.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches, as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact, which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side, without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed." One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information, we also created a "boundary removed" condition, which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.
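The following sketch illustrates these three cue-removal manipulations, under the assumption that each occlusion patch comes with a two-region labeling and a mask marking the 3-pixel boundary strip; all function and variable names are hypothetical, not the code used to generate the stimuli.

```python
# Sketch of the cue-removal manipulations. `region` assigns each pixel the
# value 1 or 2; `strip` is a boolean mask over the 3-pixel boundary band.
import numpy as np

def texture_removed(patch, region):
    """Replace each region by its mean, leaving only the luminance-step cue."""
    out = patch.copy()
    for r in (1, 2):
        out[region == r] = patch[region == r].mean()
    return out

def boundary_removed(patch, strip):
    """Mask the boundary strip with the patch mean; texture/luminance intact."""
    out = patch.copy()
    out[strip] = patch.mean()
    return out

def boundary_luminance_removed(patch, region, strip):
    """Mask the boundary, then equalize mean luminance across the two regions."""
    out = boundary_removed(patch, strip)
    target = patch.mean()
    for r in (1, 2):
        sel = (region == r) & ~strip
        out[sel] += target - out[sel].mean()
    return out
```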

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For proportion estimate p, we compute the 1 − α confidence interval as

$$p \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}} \qquad (18)$$

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
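As a numerical illustration of Equation 18, the following sketch computes the binomial proportion confidence interval, assuming SciPy for the Gaussian quantile.

```python
# Numerical check of Equation 18 (binomial proportion confidence interval).
import numpy as np
from scipy.stats import norm

def binomial_ci(p_hat, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)          # z_{1-alpha/2}, ~1.96 for alpha = 0.05
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# e.g., 300/400 correct -> roughly (0.71, 0.79)
print(binomial_ci(0.75, 400))
```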

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.


use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may arise from other cues, like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

In other annotated edge databases like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue), for 98 BSD images segmented by six subjects. We find that, averaged across all images and subjects, about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps using the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^−121, Wilcoxon rank-sum).
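A sketch of this comparison follows, assuming SciPy's ranksums implements the Wilcoxon rank-sum test used here; the lognormal samples are placeholders standing in for the per-image, per-subject edge-pixel fractions.

```python
# Sketch of the COB-vs-BSD edge-density comparison via a rank-sum test.
# The two samples below are synthetic placeholders, not the database values.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
cob_fraction = rng.lognormal(mean=np.log(0.0084), sigma=0.5, size=600)
bsd_fraction = rng.lognormal(mean=np.log(0.0193), sigma=0.5, size=588)

stat, p = ranksums(cob_fraction, bsd_fraction)
print(f"rank-sum statistic = {stat:.2f}, p = {p:.3g}")
```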

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10-pixel radius). Bottom: Product of the MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases, across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 10. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).


quantified using a novel analysis we developed, as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f, defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 10 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^−202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure, where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that for this analysis there is significantly better agreement between edgemaps in the COB database than those obtained indirectly from the BSD segmentations (p < 10^−28).
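A minimal sketch of the leave-one-out agreement analysis on binary edgemaps follows; it ignores the spatial tolerance used in the full analysis, and the random edgemaps are placeholders.

```python
# Sketch of the leave-one-out F-measure analysis on binary edge maps.
# `maps` is a hypothetical (subjects x H x W) boolean array of labeled pixels.
import numpy as np

def f_measure(edge_map, ground_truth):
    tp = np.logical_and(edge_map, ground_truth).sum()
    precision = tp / max(edge_map.sum(), 1)      # nonedges not labeled as edges
    recall = tp / max(ground_truth.sum(), 1)     # all true edges detected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
maps = rng.random((6, 128, 128)) < 0.01          # six subjects' edge maps

# Compare each subject against the union of the other five subjects' labels.
scores = [f_measure(maps[s], np.any(np.delete(maps, s, axis=0), axis=0))
          for s in range(6)]
```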

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion, which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches are comprised of a single region, and therefore it is unclear how we can measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations), and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of measured features is given in Methods (Statistical measurements).
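The following sketch illustrates the dummy-labeling procedure for one feature (luminance difference), assuming 25 bisecting labelings spanning all orientations; the implementation details are illustrative rather than the exact measurement code.

```python
# Sketch of the dummy-labeling procedure for surface patches: measure a
# region-difference feature under 25 oriented bisections, keep the maximum.
import numpy as np

def oriented_labeling(size, theta):
    """Split a size x size patch into two regions by a line at angle theta."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2.0
    return (x * np.cos(theta) + y * np.sin(theta)) > 0

def max_luminance_difference(patch, n_orient=25):
    size = patch.shape[0]
    diffs = []
    for theta in np.linspace(0, np.pi, n_orient, endpoint=False):
        region = oriented_labeling(size, theta)
        diffs.append(abs(patch[region].mean() - patch[~region].mean()))
    return max(diffs)
```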

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy in low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, obtained by fitting a line to the power spectra of individual images, found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (approximately 2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
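A sketch of the exponent estimate follows, assuming the standard procedure of radially averaging the 2-D power spectrum and regressing log power on log spatial frequency; the binning details are illustrative.

```python
# Sketch of the power-spectrum slope fit for a square grayscale patch:
# radially average the 2-D power spectrum, then fit a line in log-log space.
import numpy as np

def spectrum_slope(patch):
    f = np.fft.fftshift(np.fft.fft2(patch - patch.mean()))
    power = np.abs(f) ** 2
    n = patch.shape[0]
    y, x = np.mgrid[:n, :n] - n // 2
    radius = np.hypot(x, y).astype(int)
    freqs = np.arange(1, n // 2)                 # radial frequency bins
    radial = np.array([power[radius == r].mean() for r in freqs])
    slope, _ = np.polyfit(np.log(freqs), np.log(radial), 1)
    return -slope    # exponent alpha, assuming power falls off as 1/f^alpha
```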

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch,

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).


which will tend to be larger for occlusion patches. Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log GB), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log q and contrast difference log Δr). These results are consistent with previous work which demonstrates that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we measured color features (C1–C2; Methods: Statistical measurements) as well, by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).
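A sketch of the color-feature measurement follows, assuming scikit-image's rgb2lab conversion and a two-region labeling as above; taking absolute mean differences along the a* and b* axes is an illustrative simplification.

```python
# Sketch of the color-difference measurement in Lab space. The patch and
# region labeling below are placeholders for the database patches.
import numpy as np
from skimage.color import rgb2lab

def color_differences(patch_rgb, region):
    """Return (|delta a*|, |delta b*|): chromatic differences across regions."""
    lab = rgb2lab(patch_rgb)              # expects float RGB in [0, 1]
    a, b = lab[..., 1], lab[..., 2]
    da = abs(a[region == 1].mean() - a[region == 2].mean())
    db = abs(b[region == 1].mean() - b[region == 2].mean())
    return da, db

rng = np.random.default_rng(0)
patch = rng.random((32, 32, 3))           # placeholder RGB patch
region = np.ones((32, 32), dtype=int)
region[16:, :] = 2                        # top half = region 1, bottom = 2
da, db = color_differences(patch, region)
```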

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch for as long as they needed (1–2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed on six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different size image patches which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line), together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We then were able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions,

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture cues or luminance cues, does not affect subject performance.


as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although, in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.
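A minimal sketch of this classifier follows: fit one multivariate Gaussian per class to the log-transformed features and compare class-conditional log-likelihoods (equal priors, matching the 50/50 task design); the feature arrays are synthetic placeholders.

```python
# Sketch of the quadratic classifier on Gaussian class-conditional densities.
# The log-feature arrays below are synthetic stand-ins for the measured data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
occl = rng.normal(1.0, 1.0, size=(1000, 5))   # log-features, occlusion patches
surf = rng.normal(0.0, 1.0, size=(1000, 5))   # log-features, surface patches

g_occl = multivariate_normal(occl.mean(0), np.cov(occl.T))
g_surf = multivariate_normal(surf.mean(0), np.cov(surf.T))

def classify(x):
    """Return 1 (occlusion) when the occlusion Gaussian is more likely."""
    return (g_occl.logpdf(x) > g_surf.logpdf(x)).astype(int)

x_test = rng.normal(0.5, 1.0, size=(400, 5))
predictions = classify(x_test)
```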

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that, in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier in detecting

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) Classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) Classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.


occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and overall patch variance (q). We also see that contrast difference (Δr) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing q and Δl, no performance increase is seen when one adds the features GB (boundary gradient) and Eh (oriented energy), most likely because these features are strongly correlated with q and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δr is added, most likely because this feature dimension is only weakly correlated with q and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent from luminance cues.
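A sketch of the d′ ranking follows, assuming the standard definition d′ = |μ₁ − μ₂| / √((σ₁² + σ₂²)/2) applied independently to each log-feature dimension; the feature arrays are synthetic placeholders.

```python
# Sketch of the d-prime feature ranking across feature dimensions.
import numpy as np

def d_prime(occl_feature, surf_feature):
    """Discriminability of one feature: |mu1 - mu2| / sqrt((v1 + v2) / 2)."""
    mu1, mu2 = occl_feature.mean(), surf_feature.mean()
    v1, v2 = occl_feature.var(), surf_feature.var()
    return abs(mu1 - mu2) / np.sqrt((v1 + v2) / 2)

rng = np.random.default_rng(0)
occl = rng.normal(1.0, 1.0, size=(1000, 5))   # hypothetical log-features
surf = rng.normal(0.0, 1.0, size=(1000, 5))

# Order feature columns from highest to lowest d'.
ranking = sorted(range(5), key=lambda j: -d_prime(occl[:, j], surf[:, j]))
```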

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian distributed class-conditional densities (Duda

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) Classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) Classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.


et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield performance superior to the quadratic classifier, and hence provide a better match with our perceptual data.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the poor match of the quadratic classifier with human performance reflects the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classifier method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch, and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist on multiple scales, and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences, which is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Components Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, and 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b, where instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation, and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training, to evaluate their performance.

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size, and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. We see that a simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.


the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture-edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.


performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues, but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983), and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). This study found that rather large image regions were needed for subjects to attain reliable performance for junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at the very similar conclusion for our task that a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and that many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities, which complicates feature measurements (Zhou & Mel, 2008) and causes artifact when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges, rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and Gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

Nakayama K He Z J amp Shimojo S (1995) Visualsurface representation A critical link betweenlower-level and higher-level vision In D NOsherson S M Kosslyn amp L R Gleitman(Eds) An invitation to cognitive science Visualcognition pp 1ndash70 Cambridge MA MIT Press

Oliva A amp Torralba A (2001) Modeling the shapeof the scene A holistic representation of the spatialenvelope International Journal of Computer Vision42 145ndash175

Olmos A amp Kingdom F A A (2004) McGillcalibrated colour image database Internet sitehttptabbyvisionmcgillca (Accessed November20 2012)

Olshausen B A amp Field D J (1996) Emergence ofsimple-cell receptive-field properties by learning asparse code for natural images Nature 381 607ndash609

Ott R L (1993) An introduction to statistical methodsand data analysis (4th ed) Belmont CA Wads-worth Inc

Perona P (1992) Steerable-scalable kernels for edgedetection and junction analysis Image and VisionComputing 10 663ndash672

Pollen D A amp Ronner S F (1983) Visual corticalneurons as localized spatial frequency filters IEEETransactions on Systems Man and Cybernetics 13907ndash916

Portilla J amp Simoncelli E P (2000) A parametrictexture model based on joint statistics of complexwavelet coefficients International Journal of Com-puter Vision 40 4971

Rajashekar U van der Linde I Bovik A C ampCormack L K (2007) Foveated analysis of imagefeatures at fixations Vision Research 47 3160ndash3172

Reinagel P amp Zador A M (1999) Natural scenestatistics at the centre of gaze Network 10 341ndash350

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 20

Rijsbergen C V (1979) Information retrieval LondonButterworths

Rivest J amp Cavanagh P (1996) Localizing contoursdefined by more than one attribute Vision Re-search 36 53ndash66

Schwartz O amp Simoncelli E P (2001) Natural signalstatistics and sensory gain control Nature Neuro-science 4 819ndash825

Sigman M Cecchi G A Gilbert C D ampMagnasco M O (2001) On a common circleNatural scenes and gestalt rules Proceedings of theNational Academy of Sciences 98 1935ndash1940

Todd J T (2004) The visual perception of 3D shapeTrends in Cognitive Science 8 115ndash121

Torralba A (2009) How many pixels make an imageVisual Neuroscience 26 123ndash131

Torralba A amp Oliva A (2003) Statistics of naturalimage categories Network Computation in NeuralSystems 14 391ndash412

Voorhees H amp Poggio T (1988) Computing textureboundaries from images Nature 333 364ndash367

Zetzsche C amp Rohrbein F (2001) Nonlinear andextra-classical receptive field properties and thestatistics of natural scenes Network 12 331ndash350

Zhou C amp Mel B W (2008) Cue combination andcolor edge detection in natural scenes Journal ofVision 8(4)4 1ndash25 httpwwwjournalofvisionorgcontent844 doi101167844 [PubMed][Article]

Zhu S C Wu Y N amp Mumford D (1997)Minimax entropy principle and its application totexture modeling Neural Computation 9 1627ndash1660

Ziegaus C amp Lang E W (2003) Independentcomponent analysis of natural and urban imageensembles Neural Information Processing Letters1 89ndash95

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 21


surface segmentation, where subjects were allowed to preview large full-scale images of foliage from which surface patches were drawn (Ing et al., 2010). A second set of three naive subjects (S3, S4, S5) were not shown the full-scale images beforehand in order to control for the possibility that this pre-exposure may have helped to improve task performance.

We used both color and grayscale patches of sizes 8 × 8, 16 × 16, and 32 × 32. In addition to the raw image patches, a set of "texture-removed" image patches was created by averaging the pixels in each of the two regions, thus creating synthetic edges where the only cue was luminance or color contrast. For the grayscale patches, we also considered the effects of removing luminance difference cues. However, it is a much harder problem to create luminance-removed patches as was done in previous studies on surface segmentation (Ing et al., 2010), since simply subtracting the mean luminance in each region of an image patch containing an occlusion often yields a high spatial frequency boundary artifact, which provides a strong edge cue (Arsenault et al., 2011). Therefore, we circumvented this problem by setting a 3-pixel thick region around the boundary to the mean luminance of the entire patch, in effect covering up the boundary. Then we could remove luminance cues, equalizing the mean luminance on each side without creating a boundary artifact, since there was no boundary visible. We called this condition "boundary + luminance removed". One issue, however, with comparing these boundary + luminance removed patches to the normal patches is that now two cues are missing (the boundary and the luminance difference), so in order to better assess the combination of texture and luminance information we also created a "boundary removed" condition, which blocks the boundary but does not modify the mean luminance on each side. Illustrative examples of these stimuli are shown in Figure 4b.
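To make the stimulus construction concrete, the following is a minimal sketch of these three manipulations, assuming NumPy/SciPy and assuming each occlusion patch comes with two boolean region masks and a boolean boundary mask from the database labeling (the function names and the strip construction are illustrative, not the exact code used to generate the stimuli):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def texture_removed(patch, region1, region2):
    # Average the pixels within each region, leaving only a
    # luminance (or color) step as the edge cue.
    out = patch.astype(float).copy()
    out[region1] = patch[region1].mean()
    out[region2] = patch[region2].mean()
    return out

def boundary_removed(patch, boundary):
    # Cover a 3-pixel thick strip around the boundary with the mean
    # luminance of the entire patch, hiding the boundary without
    # altering the mean luminance on either side.
    strip = binary_dilation(boundary)  # ~3 px thick for a 1-px contour
    out = patch.astype(float).copy()
    out[strip] = patch.mean()
    return out, strip

def boundary_luminance_removed(patch, region1, region2, boundary):
    # Block the boundary first, then equalize the mean luminance of
    # the two sides so that no boundary artifact is created.
    out, strip = boundary_removed(patch, boundary)
    for region in (region1, region2):
        side = region & ~strip
        out[side] += patch.mean() - out[side].mean()
    return out
```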

All subjects were shown different sets of 400 image patches sampled randomly from a set of 1,000 patches in every condition, with the exception of two subjects in the luminance-only grayscale condition (S0, S2), who were shown the same set of patches in this condition only. Informed consent was obtained from all subjects, and all experimental procedures were approved beforehand by the Case Western Reserve University IRB (Protocol 20101216).

Tests of significance

In order to determine whether or not the performance of human subjects was significantly different in different conditions of the task, we utilized the standard binomial proportion test (Ott, 1993), which relies on a Gaussian approximation to the binomial distribution. This test is well justified in our case because of the large number of stimuli (N = 400) presented in each experimental condition. For proportion estimate p, we compute the 1 − α confidence interval as

$$ p \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}} \qquad (18) $$

We use a significance level of α = 0.05 in all of our tests and calculations of confidence intervals.
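As a concrete illustration, here is a minimal sketch of Equation 18 (the norm.ppf call supplies z_{1−α/2}; the example counts are made up):

```python
import numpy as np
from scipy.stats import norm

def binomial_proportion_ci(n_correct, n_trials, alpha=0.05):
    # Gaussian-approximation confidence interval for a binomial
    # proportion (Equation 18).
    p = n_correct / n_trials
    z = norm.ppf(1 - alpha / 2)  # z_{1-alpha/2}, about 1.96 for alpha = 0.05
    half_width = z * np.sqrt(p * (1 - p) / n_trials)
    return p - half_width, p + half_width

# e.g., 340 correct responses out of N = 400 trials in one condition
print(binomial_proportion_ci(340, 400))  # approximately (0.815, 0.885)
```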

In order to evaluate the performance of the machine classifiers, we performed a Monte Carlo analysis where the classifiers were evaluated on 200 different sets of 400 image patches randomly chosen from a validation set of image patches. This validation set was distinct from the set of patches used to train the classifier, allowing us to study classifier generalization. We plot 95% confidence intervals around the mean performance of our classifiers.
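A sketch of this resampling procedure, assuming a classifier object with a predict method and arrays of held-out validation patches and labels (all names here are illustrative):

```python
import numpy as np

def monte_carlo_accuracy(classifier, X_val, y_val,
                         n_sets=200, set_size=400, seed=0):
    # Evaluate on many random draws from the validation set and
    # summarize with the mean and a 95% percentile interval.
    rng = np.random.default_rng(seed)
    accs = np.empty(n_sets)
    for i in range(n_sets):
        idx = rng.choice(len(X_val), size=set_size, replace=False)
        accs[i] = np.mean(classifier.predict(X_val[idx]) == y_val[idx])
    low, high = np.percentile(accs, [2.5, 97.5])
    return accs.mean(), (low, high)
```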

Results

Case occlusion boundary (COB) database

We developed a novel database of occlusion edges and surface regions not containing any occlusions for use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may also arise from other cues like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

Figure 6. Schematic of the two-alternative forced choice experiment. A patch was presented to the subject, who decided whether an occlusion passes through the imaginary intersection of the crosshairs. After the decision, the subject was given positive or negative feedback.

In other annotated edge databases like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue) for 98 BSD images segmented by six subjects. We find that, averaged across all images and subjects, about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps using the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^−121, Wilcoxon rank-sum).
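This comparison is straightforward to reproduce in outline; the following is a sketch assuming one binary edgemap per (image, subject) pair in each database:

```python
import numpy as np
from scipy.stats import ranksums

def edge_pixel_fraction(edge_map):
    # Fraction of pixels marked as edge in one binary edgemap.
    return np.mean(np.asarray(edge_map) > 0)

def compare_edge_fractions(cob_maps, bsd_maps):
    # Two-sample Wilcoxon rank-sum test on the per-labeling
    # edge-pixel fractions from the two databases.
    cob = [edge_pixel_fraction(m) for m in cob_maps]
    bsd = [edge_pixel_fraction(m) for m in bsd_maps]
    _, p = ranksums(cob, bsd)
    return np.median(cob), np.median(bsd), p
```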

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of the MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 10. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 10 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^−202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations using this analysis (p < 10^−28).
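For reference, a minimal per-pixel version of this precision-recall agreement measure, taking one subject's edgemap against the pooled "ground truth" of the others; this simplified sketch ignores any spatial tolerance for matching nearby edge pixels:

```python
import numpy as np

def precision_recall_f(predicted, ground_truth):
    # Precision: fraction of labeled edge pixels that are true edges.
    # Recall: fraction of true edge pixels that were labeled.
    # F-measure: their harmonic mean (Rijsbergen, 1979).
    predicted = np.asarray(predicted, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    tp = np.sum(predicted & ground_truth)
    precision = tp / max(predicted.sum(), 1)
    recall = tp / max(ground_truth.sum(), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f
```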

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches are comprised of a single region, and therefore it is unclear how to measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of measured features is given in Methods (Statistical measurements).
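A sketch of this dummy-labeling procedure, assuming a feature function measure(patch, mask1, mask2) that returns a scalar for any two-region labeling; the straight-line split through the patch center is our reading of "spanning all orientations":

```python
import numpy as np

def surface_feature(patch, measure, n_labelings=25):
    # Split the patch along straight lines through its center at
    # evenly spaced orientations, measure the region-based feature
    # for each split, and keep the maximum value over all splits.
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    values = []
    for theta in np.linspace(0.0, np.pi, n_labelings, endpoint=False):
        side = (xs - cx) * np.sin(theta) - (ys - cy) * np.cos(theta) > 0
        values.append(measure(patch, side, ~side))
    return max(values)
```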

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy in low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent by fitting a line to the power spectra of individual images found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent value is slightly higher (approximately 2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
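A sketch of this slope estimate, assuming the exponent is fit by regressing log radially-averaged power on log spatial frequency:

```python
import numpy as np

def spectral_slope(patch):
    # Radially average the 2-D power spectrum, then fit a line to
    # log power vs. log frequency; returns alpha in power ~ 1/f^alpha.
    patch = np.asarray(patch, dtype=float)
    power = np.abs(np.fft.fftshift(np.fft.fft2(patch - patch.mean()))) ** 2
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(ys - h // 2, xs - w // 2).astype(int)
    radial = (np.bincount(r.ravel(), weights=power.ravel())
              / np.bincount(r.ravel()))
    freqs = np.arange(1, min(h, w) // 2)  # skip the DC component
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[freqs]), 1)
    return -slope
```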

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches. Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log ρ and contrast difference log Δσ). These results are consistent with previous work which demonstrates that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we measured color features (C1–C2; Methods: Statistical measurements) as well by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the image patch contains an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch as long as they needed (1–2 seconds typical), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed on six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different size image patches which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line) together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We then were able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture or luminance cues, does not affect subject performance.


Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes is exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (which were not used to train the classifier), using 200 Monte Carlo resamplings.
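A minimal sketch of such a classifier over the log-transformed feature vectors follows (a standard Gaussian quadratic discriminant; the paper's exact implementation details are given in its Methods):

```python
import numpy as np

class GaussianQuadraticClassifier:
    # Fit one multivariate Gaussian per class to feature vectors;
    # classify new vectors by the maximum class log posterior.
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = []
        for c in self.classes:
            Xc = X[y == c]
            self.params.append((Xc.mean(axis=0),
                                np.cov(Xc, rowvar=False),
                                len(Xc) / len(X)))
        return self

    def predict(self, X):
        scores = []
        for mu, cov, prior in self.params:
            diff = X - mu
            inv = np.linalg.inv(cov)
            logdet = np.linalg.slogdet(cov)[1]
            # Gaussian log likelihood (up to a constant) plus log prior
            ll = (-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
                  - 0.5 * logdet + np.log(prior))
            scores.append(ll)
        return self.classes[np.argmax(scores, axis=0)]
```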

In Figure 13, we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and overall patch variance (ρ). We also see that contrast difference (Δσ) by itself is a fairly weak parameter for all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) Classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) Classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.
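A sketch of this ranking computation, assuming the common pooled-variance form of d′ applied to the log-transformed features (the paper's exact formula is given in its Methods):

```python
import numpy as np

def d_prime(x_occlusion, x_surface):
    # Separation between two univariate feature distributions.
    mu1, mu2 = np.mean(x_occlusion), np.mean(x_surface)
    v1, v2 = np.var(x_occlusion), np.var(x_surface)
    return abs(mu1 - mu2) / np.sqrt(0.5 * (v1 + v2))

def rank_features(occlusion_features, surface_features):
    # Both arguments: dicts mapping feature name -> per-patch values.
    scores = {k: d_prime(occlusion_features[k], surface_features[k])
              for k in occlusion_features}
    return sorted(scores, key=scores.get, reverse=True)
```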

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one in order of decreasing d′ for 16 × 16 and 32 × 32 patches and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δl, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with ρ and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent from luminance cues.

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield superior performance to the quadratic classifier and hence provide a better match with our perceptual data.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) Classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) Classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the reason for the poor match of the quadratic classifier with human performance is the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classifier method.

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist on multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences, which is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Components Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, and 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b, where instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training to evaluate their performance.
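In outline, this pipeline can be sketched as follows, using scikit-learn stand-ins for the two classifiers; the filter bank, the use of full-wave rectification, and the training settings here are illustrative assumptions rather than the study's exact configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def rectified_gabor_responses(patches, filter_bank):
    # patches: (n, h*w) flattened image patches;
    # filter_bank: (k, h*w) ICA-derived Gabor filters.
    # Full-wave rectify the linear filter outputs.
    return np.abs(patches @ filter_bank.T)

def train_classifiers(X_train, y_train, filter_bank):
    # Linear (logistic) readout of the rectified responses vs. a
    # network with one hidden layer that can pool those responses
    # across spatial scale and position.
    F = rectified_gabor_responses(X_train, filter_bank)
    linear = LogisticRegression(max_iter=1000).fit(F, y_train)
    network = MLPClassifier(hidden_layer_sizes=(16,),
                            max_iter=2000).fit(F, y_train)
    return linear, network
```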

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to spatially pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture-edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection, and a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. We see that a simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). This study found that rather large image regions were needed for subjects to attain reliable performance for junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at the very similar conclusion for our task that a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10) and that many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has been mostly considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities, which complicates feature measurements (Zhou & Mel, 2008) and causes artifact when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although de facto finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003) and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

Nakayama K He Z J amp Shimojo S (1995) Visualsurface representation A critical link betweenlower-level and higher-level vision In D NOsherson S M Kosslyn amp L R Gleitman(Eds) An invitation to cognitive science Visualcognition pp 1ndash70 Cambridge MA MIT Press

Oliva A amp Torralba A (2001) Modeling the shapeof the scene A holistic representation of the spatialenvelope International Journal of Computer Vision42 145ndash175

Olmos A amp Kingdom F A A (2004) McGillcalibrated colour image database Internet sitehttptabbyvisionmcgillca (Accessed November20 2012)

Olshausen B A amp Field D J (1996) Emergence ofsimple-cell receptive-field properties by learning asparse code for natural images Nature 381 607ndash609

Ott R L (1993) An introduction to statistical methodsand data analysis (4th ed) Belmont CA Wads-worth Inc

Perona P (1992) Steerable-scalable kernels for edgedetection and junction analysis Image and VisionComputing 10 663ndash672

Pollen D A amp Ronner S F (1983) Visual corticalneurons as localized spatial frequency filters IEEETransactions on Systems Man and Cybernetics 13907ndash916

Portilla J amp Simoncelli E P (2000) A parametrictexture model based on joint statistics of complexwavelet coefficients International Journal of Com-puter Vision 40 4971

Rajashekar U van der Linde I Bovik A C ampCormack L K (2007) Foveated analysis of imagefeatures at fixations Vision Research 47 3160ndash3172

Reinagel P amp Zador A M (1999) Natural scenestatistics at the centre of gaze Network 10 341ndash350

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 20

Rijsbergen C V (1979) Information retrieval LondonButterworths

Rivest J amp Cavanagh P (1996) Localizing contoursdefined by more than one attribute Vision Re-search 36 53ndash66

Schwartz O amp Simoncelli E P (2001) Natural signalstatistics and sensory gain control Nature Neuro-science 4 819ndash825

Sigman M Cecchi G A Gilbert C D ampMagnasco M O (2001) On a common circleNatural scenes and gestalt rules Proceedings of theNational Academy of Sciences 98 1935ndash1940

Todd J T (2004) The visual perception of 3D shapeTrends in Cognitive Science 8 115ndash121

Torralba A (2009) How many pixels make an imageVisual Neuroscience 26 123ndash131

Torralba A amp Oliva A (2003) Statistics of naturalimage categories Network Computation in NeuralSystems 14 391ndash412

Voorhees H amp Poggio T (1988) Computing textureboundaries from images Nature 333 364ndash367

Zetzsche C amp Rohrbein F (2001) Nonlinear andextra-classical receptive field properties and thestatistics of natural scenes Network 12 331ndash350

Zhou C amp Mel B W (2008) Cue combination andcolor edge detection in natural scenes Journal ofVision 8(4)4 1ndash25 httpwwwjournalofvisionorgcontent844 doi101167844 [PubMed][Article]

Zhu S C Wu Y N amp Mumford D (1997)Minimax entropy principle and its application totexture modeling Neural Computation 9 1627ndash1660

Ziegaus C amp Lang E W (2003) Independentcomponent analysis of natural and urban imageensembles Neural Information Processing Letters1 89ndash95

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 21


use in our perceptual experiments. Representative images from our database are illustrated in Figure 2 (left column). Note the clear figural objects and many occlusions in these images. In the middle column of Figure 2, we see grayscale versions of these images with the set of all pixels labeled by any subject (logical OR) overlaid in white. The magenta squares show examples of surface regions labeled by the subjects. Finally, the right column shows an overlay plot of the occlusions marked by all subjects, with darker pixels being labeled by more subjects and the lightest gray pixels being labeled by only a single subject. Note the high degree of consistency between the labelings of the multiple subjects. Figure 3 shows some representative examples of occlusion (left) and surface (right) patches from our database. It is important to note that while occlusions may be a major source of edges in natural images, edges may arise from other cues as well, like cast shadows or changes in material properties (Kersten, 2000). The bottom panel of Figure 3 shows examples of patches containing edges defined by shadows rather than by occlusions.

In other annotated edge databases, like the Berkeley Segmentation Dataset (BSD) (Martin, Fowlkes, Tal, & Malik, 2001), occlusion edges were labeled indirectly, by segmenting the image into regions and then denoting the boundaries of these regions to be edges. We observed that when occlusions are labeled directly, instead of being inferred indirectly from region segmentations, a smaller number of pixels are labeled, as shown in Figure 7b, which plots the distribution of the percentage of edge pixels for all images and subjects in our database (red) and the BSD database (blue), for 98 BSD images segmented by six subjects. We find that, averaged across all images and subjects, about 1% of the pixels were labeled as edges in our database, whereas about twice as many pixels were labeled as edges by computing edgemaps from the BSD segmentations (COB median = 0.0084, N = 600; BSD median = 0.0193, N = 588; p < 10^-121, Wilcoxon rank-sum).
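To make the comparison concrete, the statistic above can be computed from binary edgemaps in a few lines. The following Python sketch is our illustration (the function and variable names are hypothetical, and the original analysis code may differ); it uses the two-sided Wilcoxon rank-sum test from SciPy:

```python
import numpy as np
from scipy.stats import ranksums

def edge_fraction(edgemap):
    """Fraction of pixels marked as edges in a binary edgemap."""
    return np.asarray(edgemap, dtype=bool).mean()

def compare_databases(cob_maps, bsd_maps):
    """cob_maps, bsd_maps: lists of binary edgemaps, one per
    image/subject labeling. Returns medians and the rank-sum p-value."""
    cob_fracs = np.array([edge_fraction(m) for m in cob_maps])
    bsd_fracs = np.array([edge_fraction(m) for m in bsd_maps])
    stat, p = ranksums(cob_fracs, bsd_fracs)  # two-sided Wilcoxon rank-sum
    return np.median(cob_fracs), np.median(bsd_fracs), p
```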

Figure 7. Labeling occlusions directly marks fewer pixels than inferring occlusions from image segmentations and yields greater agreement between subjects. (a) Top: An image from our database (left) together with the labeling (middle) by the most conservative subject (MCS). The right panel shows a binary mask of all pixels near the MCS labeling (10 pixel radius). Bottom: Product of the MCS mask with labelings from three other subjects. (b) Histogram of the fraction of pixels labeled as edges in the COB (red) and BSD (blue) databases across all images and all subjects. (c) Histogram of the subject consistency index for edgemaps obtained from the COB (red) and BSD (blue) databases for c = 10. (d) Precision-recall analysis also demonstrates better consistency (F-measure) for COB (red) than BSD (blue).

We observed a higher level of intersubject consistency in the edgemaps obtained from the COB than in those obtained from the BSD segmentations, which we quantified using a novel analysis we developed, as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), that is, the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemaps of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f, defined in Equation 4; in Figure 7c we see that on average f is larger for our dataset than for the BSD, where we plot the histogram for c = 10 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^-202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure, where we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations (p < 10^-28).
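The F-measure computation is straightforward given binary edgemaps. Below is a minimal Python sketch assuming pixel-exact correspondence between edgemaps; the actual analysis matches edge pixels with some spatial tolerance, so treat this as illustrative only:

```python
import numpy as np

def f_measure(test_map, truth_map, alpha=0.5):
    """Precision, recall, and weighted F-measure (Rijsbergen, 1979)
    between two binary edgemaps; alpha = 0.5 gives the usual F1."""
    test = np.asarray(test_map, dtype=bool)
    truth = np.asarray(truth_map, dtype=bool)
    tp = np.logical_and(test, truth).sum()
    precision = tp / max(test.sum(), 1)  # labeled edges that are true edges
    recall = tp / max(truth.sum(), 1)    # true edges that were detected
    if precision + recall == 0:
        return 0.0
    return precision * recall / (alpha * recall + (1 - alpha) * precision)

def loo_f_measures(edgemaps):
    """Leave-one-out: score subject i against the union (logical OR)
    of the edgemaps of all other subjects."""
    scores = []
    for i, test in enumerate(edgemaps):
        truth = np.any([m for j, m in enumerate(edgemaps) if j != i], axis=0)
        scores.append(f_measure(test, truth))
    return scores
```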

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion, which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of the sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches comprise a single region, so it is unclear how to measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions, for surfaces we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of the measured features is given in Methods (Statistical measurements).
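As an illustration of this dummy-labeling procedure, the sketch below builds 25 half-plane labelings spanning all orientations and takes the maximum feature value across them. The helper names and the example luminance-difference feature are our own, not the original measurement code:

```python
import numpy as np

def dummy_labelings(patch_size, n_orientations=25):
    """Binary half-plane region labelings through the patch center,
    spanning n_orientations evenly spaced orientations."""
    ys, xs = np.mgrid[:patch_size, :patch_size]
    c = (patch_size - 1) / 2.0
    labels = []
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        # Pixels on one side of a line through the center at angle theta.
        side = (xs - c) * np.cos(theta) + (ys - c) * np.sin(theta) > 0
        labels.append(side)
    return labels

def surface_feature(patch, feature_fn):
    """Region-based feature for a surface patch: evaluate under each
    dummy labeling and keep the maximum."""
    return max(feature_fn(patch, lab) for lab in dummy_labelings(patch.shape[0]))

def luminance_difference(patch, region):
    """Example region-based feature: absolute mean-luminance difference."""
    return abs(patch[region].mean() - patch[~region].mean())
```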

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy at low spatial frequencies for textures than for occlusions, which is intuitive, since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, obtained by fitting a line to the log power spectrum of each individual image, revealed different distributions of exponents for the occlusions and the textures (Figure 8, right). For textures, the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
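One common way to estimate the power-spectrum exponent is to fit a line to log power versus log radial frequency; the binning details below are our assumptions and may differ from the procedure used here:

```python
import numpy as np

def spectral_slope(patch):
    """Estimate the exponent alpha in power ~ 1/f^alpha for a square
    grayscale patch, via a line fit in log-log coordinates."""
    n = patch.shape[0]
    f = np.fft.fftshift(np.fft.fft2(patch - patch.mean()))
    power = np.abs(f) ** 2
    ys, xs = np.mgrid[:n, :n]
    r = np.hypot(xs - n // 2, ys - n // 2).astype(int)  # radial freq bins
    radii = np.arange(1, n // 2)  # exclude DC and corners beyond Nyquist
    spectrum = np.array([power[r == k].mean() for k in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(spectrum), 1)
    return -slope  # fitted slope is negative; alpha is its magnitude
```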

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches.

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).

Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log GB), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log q and contrast difference log Δr). These results are consistent with previous work demonstrating that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithm of the grayscale features for 32 × 32 occlusions and surfaces.

In addition to grayscale features, we also measured color features (C1–C2; Methods: Statistical measurements) by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).
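A minimal sketch of the color-difference measurement, assuming scikit-image's rgb2lab conversion and a binary region labeling (the exact color pipeline used for the calibrated images may differ):

```python
import numpy as np
from skimage.color import rgb2lab

def color_differences(patch_rgb, region):
    """log Δa and log Δb across the labeled boundary of an occlusion
    patch (assumes the region means differ, so the logs are finite)."""
    lab = rgb2lab(patch_rgb)          # CIE Lab: channels are L, a, b
    a, b = lab[..., 1], lab[..., 2]
    da = abs(a[region].mean() - a[~region].mean())
    db = abs(b[region].mean() - b[~region].mean())
    return np.log(da), np.log(db)
```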

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the patch contained an occlusion. Patches were chosen from a pre-extracted set of 1000 occlusions and 1000 surfaces, with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch for as long as they needed (1–2 seconds typically), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed by six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on task performance for grayscale image patches, for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see that performance differs significantly across the patch sizes we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches; the results for color patches are shown in Figure 10b (red line), together with the grayscale data from these same subjects (black dashed line). We see from this plot that at all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.

One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects make use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). To determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly more tricky, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel-wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We were then able to equalize the luminance on each side of the patch without creating boundary artifact (luminance + boundary removed).

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture or luminance cues, does not affect subject performance.

Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).
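The sketch below illustrates one way to implement the luminance + boundary manipulation: cover the boundary strip with a uniform value and shift each region to a common mean. Whether the original stimuli used an additive shift or some other equalization is an assumption here, as are the function and argument names:

```python
import numpy as np

def remove_luminance_cue(patch, region, boundary_mask):
    """Cover a 3-pixel-wide boundary strip with a uniform value and
    shift each region to the grand mean, leaving texture intact."""
    out = patch.astype(float).copy()
    grand_mean = out.mean()
    for side in (region, ~region):
        side = side & ~boundary_mask                 # exclude covered strip
        out[side] += grand_mean - out[side].mean()   # equalize region mean
    out[boundary_mask] = grand_mean                  # uniform strip
    return out
```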

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. Subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When the data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). The classifier was tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.
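A quadratic classifier of this kind reduces to Bayes' rule with two fitted multivariate Gaussians. Below is a minimal sketch, assuming equal priors (which matches the 50/50 stimulus presentation); the class organization and names are ours:

```python
import numpy as np
from scipy.stats import multivariate_normal

class QuadraticClassifier:
    """Bayes rule for two multivariate-Gaussian class-conditional
    densities; with unequal covariances the decision boundary is
    quadratic in the features."""

    def fit(self, X_surf, X_occ):
        # Fit a Gaussian to the (log-transformed) features of each class.
        self.g0 = multivariate_normal(X_surf.mean(0),
                                      np.cov(X_surf, rowvar=False))
        self.g1 = multivariate_normal(X_occ.mean(0),
                                      np.cov(X_occ, rowvar=False))
        return self

    def predict(self, X):
        # Equal priors: choose the class with the higher log-likelihood.
        return (self.g1.logpdf(X) > self.g0.logpdf(X)).astype(int)
```

Training on one random split of the patches and scoring on a held-out split, repeated many times, gives the Monte Carlo performance estimates described above.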

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only cue available for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions


Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) A classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) A classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.

In order to understand which feature dimensions are most important for the quadratic classifier detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and the overall patch variance (q). We also see that the contrast difference (Δr) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.
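We take d′ here to be the standard equal-variance sensitivity index; a one-line sketch follows (pooling the two class variances is the usual convention, assumed rather than taken from the paper):

```python
import numpy as np

def d_prime(x_occ, x_surf):
    """Sensitivity index for one log-transformed feature: difference of
    class means over the RMS of the two class standard deviations."""
    x_occ, x_surf = np.asarray(x_occ), np.asarray(x_surf)
    pooled_sd = np.sqrt(0.5 * (x_occ.var(ddof=1) + x_surf.var(ddof=1)))
    return (x_occ.mean() - x_surf.mean()) / pooled_sd
```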

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features are correlated with each other, so it was of interest to determine which features contribute the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing q and Δl, no performance increase is seen when one adds the features GB (boundary gradient) and Eh (oriented energy), most likely because these features are strongly correlated with q and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δr is added, most likely because this feature dimension is only weakly correlated with q and Δl (Supplementary Table 2). For our analysis of the color classifier, we see substantial improvements in performance when the color parameters Δa and Δb are added to the classifier, which makes sense because color cues are largely independent from luminance cues.

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

Although the quadratic classifier is Bayes optimal for Gaussian-distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield performance superior to the quadratic classifier and hence provide a better match with our perceptual data.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the poor match of the quadratic classifier with human performance reflects the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classification method.
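For reference, a generic SVM baseline of this kind can be set up as below; the RBF kernel and hyperparameters are our assumptions, since they are not specified here:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svm_accuracy(X_train, y_train, X_test, y_test):
    """Train an RBF-kernel SVM on the feature vectors and report
    held-out classification accuracy."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```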

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined at the scale of the entire image patch and therefore do not integrate information across scales. It is well known that occlusions in natural images exist at multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences, which is problematic, because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Components Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent components analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b, where instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation, and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training in order to evaluate their performance.
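A rough modern equivalent of this pipeline uses FastICA to learn the filters and off-the-shelf logistic and multilayer-perceptron classifiers. Full-wave rectification via absolute value is our assumption, and the specific solvers differ from the original implementation:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def rectified_ica_features(patches, n_filters=256, seed=0):
    """Learn ICA (Gabor-like) filters from flattened grayscale patches
    and return rectified filter outputs as feature vectors."""
    X = patches.reshape(len(patches), -1)
    ica = FastICA(n_components=n_filters, random_state=seed)
    F = np.abs(ica.fit_transform(X))  # rectified filter responses
    return F, ica

def fit_classifiers(F_train, y_train):
    """Linear (logistic) baseline and a one-hidden-layer network."""
    linear = LogisticRegression(max_iter=2000).fit(F_train, y_train)
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)
    mlp.fit(F_train, y_train)
    return linear, mlp
```

The logistic model plays the role of the linear classifier described above, while the 16-hidden-unit network corresponds to the second classifier; both operate on identical rectified multiscale inputs.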

Classifier comparison

Figure 16 shows the performance of these two classifiers, together with subject data, for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that this classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates strong positive weights for filters at low spatial frequencies located in the center of the image patch, thereby detecting luminance differences at the scale of the entire patch.

In contrast, the neural network classifier learns to spatially pool information across scale and location (Supplementary Figure S6).

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

However, the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, units tuned for texture edges have been learned in studies seeking a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot determine from our simple analysis the detailed form this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a), suggesting that luminance differences are a local cue, equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, a set which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13).

Figure 17. Schematic illustration of the weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) A hypothetical unit which detects texture edges defined by differences in orientation energy. (b) A hypothetical unit that detects texture edges defined by spatial frequency differences.

The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

One limitation of the visual feature set used to define the quadratic classifier is that the features are defined at the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983), and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that this multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of it was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since it only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance in junction detection, consistent with our results for occlusion detection. In that work, it was found that many T-junctions easily visible at the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, we did not consider quantitative models of how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these second-stage filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
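A minimal FRF sketch makes the three stages explicit; the specific filters, and the use of a Gaussian as the second-stage filter, are illustrative assumptions rather than the model fitted in this study:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import gabor

def frf_channel(image, theta, frequency=0.2, sigma2=8.0):
    """One Filter-Rectify-Filter channel: first-stage Gabor filtering,
    pointwise rectification (local energy), then a coarse second-stage
    filter acting on the rectified output."""
    real, imag = gabor(image, frequency=frequency, theta=theta)  # filter
    energy = real ** 2 + imag ** 2                               # rectify
    return gaussian_filter(energy, sigma=sigma2)                 # filter

def texture_edge_map(image, frequency=0.2):
    """Difference of pooled energy across orthogonal first-stage
    orientations signals an orientation-defined texture edge
    (cf. the hypothetical unit in Figure 18a)."""
    e0 = frf_channel(image, theta=0.0, frequency=frequency)
    e90 = frf_channel(image, theta=np.pi / 2, frequency=frequency)
    return np.abs(e0 - e90)
```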

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although, de facto, finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases consider only a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales, containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important, since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and Gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.


quantified using a novel analysis we developed, as well as a more standard precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979; Methods: Image database). Figure 7a shows an image from the COB database (top left) together with the edgemap and derived mask for the most conservative subject (MCS), which is the subject who labeled the fewest pixels (top middle, right). When the MCS mask is multiplied by the edgemap of the other subjects, we see reasonably good agreement between all subjects (bottom row). In order to quantify this over all images, we computed our novel intersubject similarity index f defined in Equation 4, and in Figure 7c we see that on average f is larger for our dataset than for the BSD, where here we plot the histogram for c = 10 (COB median = 0.2890, N = 3500; BSD median = 0.1882, N = 3430; p < 10^-202, Wilcoxon rank-sum). Similar results were obtained over a wide range of values of c (Supplementary Figure S1). In addition to our novel analysis, we also implemented a precision-recall analysis (Abdou & Pratt, 1979; Rijsbergen, 1979) using a "leave-one-out" procedure, in which we compared the edges labeled by one subject to a "ground truth" labeling defined by combining the edgemaps of the five other subjects (Methods: Image database). Agreement was quantified using the F-measure (Rijsbergen, 1979), which provides a weighted mean of precision (not labeling nonedges as edges) and recall (detecting all edges in the image). We observe in Figure 7d that for this analysis there is significantly better agreement between edgemaps in the COB database than between those obtained indirectly from the BSD segmentations (p < 10^-28).
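To make the agreement computation concrete, here is a minimal Python sketch (ours, not from the original analysis code) of the leave-one-out F-measure, assuming each subject's edgemap is a binary NumPy array. For simplicity it scores exact pixel overlap, whereas the analysis in Methods tolerates small spatial offsets between labelers.

```python
import numpy as np

def f_measure(edgemap_test, edgemap_truth):
    """Precision-recall agreement between two binary edgemaps.
    Simplified: pixels must match exactly, with no spatial tolerance."""
    test = edgemap_test.astype(bool)
    truth = edgemap_truth.astype(bool)
    tp = np.sum(test & truth)                 # labeled edges matching ground truth
    precision = tp / max(np.sum(test), 1)     # guard against empty edgemaps
    recall = tp / max(np.sum(truth), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def leave_one_out_f(edgemaps):
    """Compare each subject against the combined edgemap of the others."""
    scores = []
    for i, test in enumerate(edgemaps):
        others = [m for j, m in enumerate(edgemaps) if j != i]
        truth = np.any(np.stack(others), axis=0)   # combined "ground truth"
        scores.append(f_measure(test, truth))
    return scores
```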

Visual features measured from occlusion and surface patches

We were interested in determining which kinds of locally available visual features could possibly be utilized by human subjects to distinguish image patches containing occlusions from those containing surfaces. Toward this end, we measured a variety of visual features taken from previous studies of natural image statistics (Balboa & Grzywacz, 2000; Field, 1987; Ing et al., 2010; Lee & Choe, 2003; Rajashekar et al., 2007; Reinagel & Zador, 1999) from both occlusion and surface patches. For patches containing only a single occlusion which can be divided into two regions of roughly equal size, we obtained a region labeling, illustrated in Figure 4a (bottom). The region labeling consists of sets of pixels corresponding to the two surfaces divided by the boundary (white and gray regions), as well as the boundary itself (black line). Using this region labeling, we can measure properties of the image on either side of the boundary and compute differences in these properties between regions. By definition, surface patches are comprised of a single region, so it is unclear how to measure quantities (like luminance differences) which depend on a region labeling. Therefore, in order to measure the same set of features from the surface patches, we assigned to each patch a set of 25 dummy region labelings (spanning all orientations) and performed the measurements for each dummy labeling. Since all features were on average larger for occlusions than for surfaces, we took as the measured value of the feature the maximum over all 25 dummy labelings. A full description of the measured features is given in Methods (Statistical measurements).
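The dummy-labeling procedure can be sketched as follows; the helper names and the exact geometry of the 25 labelings are our own illustrative assumptions, with luminance difference standing in for any of the region-based features.

```python
import numpy as np

def dummy_labelings(patch_size, n_orientations=25):
    """Yield binary region masks bisecting the patch at evenly spaced
    orientations (an assumed stand-in for the 25 dummy labelings)."""
    half = (patch_size - 1) / 2.0
    y, x = np.mgrid[0:patch_size, 0:patch_size]
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        # pixels on one side of a line through the patch center
        yield (np.cos(theta) * (x - half) + np.sin(theta) * (y - half)) >= 0

def luminance_difference(patch, mask):
    """Absolute difference in mean luminance between the two regions."""
    return abs(patch[mask].mean() - patch[~mask].mean())

def surface_feature(patch):
    # For surfaces, take the maximum over all dummy labelings, since
    # features were on average larger for occlusions than for surfaces.
    return max(luminance_difference(patch, m)
               for m in dummy_labelings(patch.shape[0]))
```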

We compared the power spectra of 32 × 32 occlusion and texture patches (Figure 8, left). On average, there is significantly less energy at low spatial frequencies for textures than for occlusions, which is intuitive since many occlusions contain a low spatial frequency luminance edge (Figure 3, left panel). Analysis of the exponent, obtained by fitting a line to the power spectrum of each individual image, found a different distribution of exponents for the occlusions and the textures (Figure 8, right). For textures the median exponent is close to 2, consistent with previous observations (Field, 1987), whereas for occlusions the median exponent is slightly higher (≈2.6). This is consistent with previous work analyzing the spectral content of different scene categories, which shows that landscape scenes with a prominent horizon (like an ocean view) tend to have lower spatial frequency content (Oliva & Torralba, 2001).
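A minimal sketch of the slope analysis, assuming square grayscale patches: the radially averaged power spectrum is fit with a line in log-log coordinates, and the negated slope estimates the exponent α in the 1/f^α falloff. The details (windowing, frequency range) are our assumptions.

```python
import numpy as np

def spectral_slope(patch):
    """Estimate alpha in power ~ 1/f^alpha from a square grayscale patch."""
    n = patch.shape[0]
    # Power spectrum with the DC component removed.
    ps = np.abs(np.fft.fftshift(np.fft.fft2(patch - patch.mean()))) ** 2
    y, x = np.mgrid[0:n, 0:n] - n // 2
    r = np.round(np.hypot(x, y)).astype(int)
    # Radially averaged power at each integer frequency.
    radial = np.bincount(r.ravel(), weights=ps.ravel()) / np.bincount(r.ravel())
    f = np.arange(1, n // 2)          # skip DC; stay below Nyquist
    slope, _ = np.polyfit(np.log(f), np.log(radial[f]), 1)
    return -slope                     # ~2 for textures, ~2.6 for occlusions
```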

Figure 9 shows that all of the grayscale features (G1–G5; Methods: Statistical measurements) measured from surfaces (green) and occlusions (blue) are well approximated by Gaussians when plotted in logarithmic coordinates. Supplementary Table 1 lists the means and standard deviations of each of these features. We see that on average these features are larger for occlusions than for textures, as one might expect intuitively, since these features explicitly or implicitly measure differences or variability in an image patch, which will tend to be larger for occlusion patches.

Figure 8. Power spectra of 32 × 32 patches. Left: Median power spectrum of occlusions (blue) and surfaces (green). Thin dashed lines show the 25th and 75th percentiles. Right: Power-spectrum slopes for occlusions (blue) and surfaces (green).


Analyzing the correlation structure of the grayscale features reveals that they are all significantly positively correlated (Supplementary Table 2). Although some of these correlations are unsurprising (like that between luminance difference log Δl and boundary gradient log G_B), many pairs of these positively correlated features can be manipulated independently of each other in artificial images (for instance, global contrast log q and contrast difference log Δr). These results are consistent with previous work demonstrating that in natural images there are often strong conditional dependencies between supposedly independent visual feature dimensions (Fine et al., 2003; Karklin & Lewicki, 2003; Schwartz & Simoncelli, 2001; Zetzsche & Rohrbein, 2001). Supplementary Figure S2 plots all possible bivariate distributions of the logarithms of the grayscale features for 32 × 32 textures and surfaces.

In addition to grayscale features, we also measured color features (C1–C2; Methods: Statistical measurements) by transforming the images into the Lab color space (Fine et al., 2003; Ing et al., 2010). Using our region labelings (Figure 4a, bottom), we measured the color parameters log Δa and log Δb from our occlusions. Supplementary Table 3 lists the means and standard deviations of each of these features, and we observe positive correlations (r = 0.43) between their values, similar to previous observations of positive correlations between these same features for two nearby patches taken from the same surface (Fine et al., 2003). This finding is interesting because occlusion boundaries by definition separate two or more different surfaces, and the color features we measure constitute a set of independent feature dimensions. From the two-dimensional scatterplots in Supplementary Figure S3, we see that there is separation between the multivariate distributions defined by these color parameters, suggesting that color contrast provides a potential cue for occlusion edge detection, much as it does for surface segmentation (Fine et al., 2003; Ing et al., 2010).

Human performance on occlusion boundary detection task

Effects of patch size, color, and pre-exposure

In order to determine the local visual features used to discriminate occlusions from surfaces, we designed a simple two-alternative forced choice task, illustrated schematically in Figure 6. An image patch subtending roughly 1.5 degrees of visual angle (30.5 cm viewing distance) was presented to the subject, who had to decide with a binary (1/0) keypress whether the patch contained an occlusion. Patches were chosen from a pre-extracted set of 1,000 occlusions and 1,000 surfaces with equal probability of each type, so guessing would yield 50% correct. Subjects were allowed to view each patch for as long as they needed (1–2 seconds typically), and following previous work, auditory feedback was provided to optimize performance (Ing et al., 2010). The task was performed by six subjects having varying degrees of exposure to the image database (Methods: Experimental paradigm).

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different image patch sizes which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches; the results for color image patches are shown in Figure 10b (red line) together with the grayscale data from these same subjects (black dashed line). We see from this plot that for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation for occlusions and textures.


One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b). Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects make use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.

Removing luminance cues while keeping texture cues intact is slightly trickier, since simply equalizing the luminance in the two image regions can create a boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel-wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We were then able to equalize the luminance on each side of the patch without creating a boundary artifact (luminance + boundary removed). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions, as an additional control we also considered performance on patches where we only removed the boundary (boundary removed).
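The following sketch illustrates the "luminance + boundary removed" manipulation under our own assumptions about the mask representation (boolean arrays for the boundary and the two regions); the exact implementation used in the study may differ.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def remove_luminance_and_boundary(patch, boundary_mask, region1, region2):
    """Sketch of the 'luminance + boundary removed' stimulus manipulation.
    boundary_mask, region1, region2: boolean masks from the region labeling
    (assumed inputs; names are ours)."""
    out = patch.astype(float).copy()
    # Cover a 3-pixel-wide strip around the 1-pixel boundary with one
    # uniform value, avoiding a spurious high-frequency edge.
    strip = binary_dilation(boundary_mask, iterations=1)
    out[strip] = out.mean()
    # Equalize mean luminance on the two sides without altering texture.
    target = (out[region1 & ~strip].mean() + out[region2 & ~strip].mean()) / 2
    out[region1 & ~strip] += target - out[region1 & ~strip].mean()
    out[region2 & ~strip] += target - out[region2 & ~strip].mean()
    return out
```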

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture or luminance cues, does not affect subject performance.



We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. Subject performance is impaired at every patch size, although in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on the class-conditional distributions, which we modeled as multivariate Gaussians (Methods: Machine classifiers). When the data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). The classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.
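A minimal sketch of such a classifier, assuming rows of log-transformed feature vectors for each class; with equal priors, as in the experiment, classification reduces to comparing the Gaussian log-likelihoods of the two classes.

```python
import numpy as np

class QuadraticClassifier:
    """Gaussian class-conditional (quadratic discriminant) sketch.
    Bayes-optimal when both classes are exactly Gaussian."""

    def fit(self, X_occ, X_surf):
        # Fit a multivariate Gaussian to each class.
        self.params = [(np.mean(X, axis=0), np.cov(X, rowvar=False))
                       for X in (X_occ, X_surf)]
        return self

    def log_likelihoods(self, X):
        lls = []
        for mu, C in self.params:
            d = X - mu
            Cinv = np.linalg.inv(C)
            _, logdet = np.linalg.slogdet(C)
            # Mahalanobis term per row, plus the normalization term.
            lls.append(-0.5 * (np.einsum('ij,jk,ik->i', d, Cinv, d) + logdet))
        return np.column_stack(lls)

    def predict(self, X):
        # True = occlusion, False = surface (equal priors assumed).
        return np.argmax(self.log_likelihoods(X), axis=1) == 0
```

Validation would then repeatedly draw held-out patches, score the classifier's proportion correct, and average over resamplings.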

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5; Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only cue available for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important for the quadratic classifier detecting occlusions, we computed the d′ value for each of the feature dimensions, with the results summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and the overall patch variance (q). We also see that the contrast difference (Δr) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.
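The per-feature d′ computation is standard; a sketch, assuming one array of values per class for each (log-transformed) feature:

```python
import numpy as np

def d_prime(feature_occ, feature_surf):
    """d' separation of one feature between the two classes:
    |mean difference| divided by the pooled standard deviation."""
    mu1, mu2 = feature_occ.mean(), feature_surf.mean()
    var1, var2 = feature_occ.var(ddof=1), feature_surf.var(ddof=1)
    return abs(mu1 - mu2) / np.sqrt(0.5 * (var1 + var2))

# Features would then be ranked from highest to lowest d' before being
# added to the classifier one at a time, e.g. (hypothetical dicts):
# ranking = sorted(names, key=lambda n: d_prime(occ[n], surf[n]), reverse=True)
```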

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) A classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) A classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.



Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features are correlated with each other, so it was of interest to determine which features contribute the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one, in order of decreasing d′, for 16 × 16 and 32 × 32 patches, applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing q and Δl, no performance increase is seen when one adds the features G_B (boundary gradient) and E_θ (oriented energy), most likely because these features are strongly correlated with q and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δr is added, most likely because this feature dimension is only weakly correlated with q and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δa and Δb are added, which makes sense because color cues are largely independent of luminance cues.

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes-optimal for Gaussian-distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier might yield performance superior to the quadratic classifier and hence provide a better match with our perceptual data.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.



In order to control for this possibility, we compared the performance of the quadratic classifier to that of a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the poor match between the quadratic classifier and human performance reflects the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classifier method.
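For reference, a hedged sketch of such a comparison using scikit-learn; the text does not state the SVM's kernel or parameters, so the RBF kernel and feature standardization below are our assumptions, and the array names are placeholders.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train, X_test: rows of log-transformed grayscale features (G1-G5);
# y_train, y_test: 1 = occlusion, 0 = surface (hypothetical arrays).
def svm_accuracy(X_train, y_train, X_test, y_test):
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)  # proportion correct on held-out patches
```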

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images exist at multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences. This is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying Independent Component Analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b, where instead of learning the best hidden-layer representation for solving the classification task, the hidden-layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training in order to evaluate their performance.
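A rough sketch of this pipeline follows. Since the ICA-derived filters are not reproduced here, a parametric Gabor bank stands in for them; each filter is applied only at the patch center rather than tiling locations as in the actual feature set; and the scikit-learn classifiers are stand-ins for the study's logistic and neural network models, whose training details are not assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def gabor(size, freq, theta, sigma):
    """Odd-phase Gabor filter (parametric stand-in for the ICA-derived bank)."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size] - half
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.sin(2 * np.pi * freq * xr)

def gabor_features(patch, freqs=(0.0625, 0.125, 0.25), n_orient=4):
    """Rectified responses of a small multiscale Gabor bank (sketch)."""
    feats = []
    size = patch.shape[0]
    for f in freqs:
        for theta in np.linspace(0, np.pi, n_orient, endpoint=False):
            k = gabor(size, f, theta, sigma=1.0 / (4 * f))
            feats.append(abs(np.sum(k * patch)))   # rectified filter output
    return np.array(feats)

# Linear vs. hidden-layer classifier on the same rectified Gabor features.
# X = np.array([gabor_features(p) for p in patches]); y = labels (hypothetical).
linear = LogisticRegression(max_iter=1000)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)
# linear.fit(X, y); mlp.fit(X, y)  # then score both on a held-out split
```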

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably with the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that the linear classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequencies located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6), although the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, units tuned for texture edges have been learned in studies seeking a sparse code for the patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot infer from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.



Discussion

Summary of results

In this study we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, a set which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of the weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) A hypothetical unit which detects texture edges defined by differences in orientation energy. (b) A hypothetical unit which detects texture edges defined by spatial frequency differences.



One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that this multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of it was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since it only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance at junction detection, consistent with our results for occlusion detection. In this work it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to specify fully (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although finding region boundaries does, de facto, end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases consider only a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. This is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and Gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 21

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • e01
  • e02
  • e03
  • e04
  • e05
  • e06
  • e07
  • e08
  • e09
  • e10
  • f04
  • e11
  • e12
  • e13
  • e14
  • e15
  • e16
  • e17
  • f05
  • e18
  • Results
  • f06
  • f07
  • f08
  • f09
  • f10
  • f11
  • f12
  • f13
  • f14
  • f15
  • f16
  • Discussion
  • f17
  • f18
  • Abdou1
  • Acharya1
  • Arsenault1
  • Baddeley1
  • Baker1
  • Balboa1
  • Bell1
  • Bergen1
  • Bishop1
  • Cadieu1
  • Caywood1
  • Collins1
  • Cristianini1
  • Daugman1
  • Doi1
  • Duda1
  • Elder1
  • Field1
  • Fine1
  • Fowlkes1
  • Geisler1
  • Goldberg1
  • Graham1
  • Graham2
  • Guzman1
  • Heeger1
  • Hess1
  • Hoiem1
  • Ing1
  • Jones1
  • Karklin1
  • Karklin2
  • Kersten1
  • Kingdom1
  • Knill1
  • Konishi1
  • Landy1
  • Landy2
  • Landy3
  • Lee1
  • Lee2
  • Malik1
  • Marr1
  • Martin1
  • Martin2
  • McDermott1
  • McGraw1
  • Nakayama1
  • Oliva1
  • Olmos1
  • Olshausen1
  • Ott1
  • Perona1
  • Pollen1
  • Portilla1
  • Rajashekar1
  • Reinagel1
  • Rijsbergen1
  • Rivest1
  • Schwartz1
  • Sigman1
  • Todd1
  • Torralba1
  • Torralba2
  • Voorhees1
  • Zetzsche1
  • Zhou1
  • Zhu1
  • Ziegaus1
Page 11: Detecting natural occlusion boundaries using local cuesDetecting natural occlusion boundaries using local cues Christopher DiMattina # $ Department of Psychology & Neuroscience Concentration,

which will tend to be larger for occlusion patchesAnalyzing the correlation structure of the grayscalefeatures reveals that they are all significantly positivelycorrelated (Supplementary Table 2) Although some ofthese correlations are unsurprising (like that betweenluminance difference log Dl and boundary gradient logGB) many pairs of these positively correlated featurescan be manipulated independently of each other inartificial images (for instance global contrast log q andcontrast difference log Dr) These results are consistentwith previous work which demonstrates that in naturalimages there are often strong conditional dependenciesbetween supposedly independent visual feature dimen-sions (Fine et al 2003 Karklin amp Lewicki 2003Schwartz amp Simoncelli 2001 Zetzsche amp Rohrbein2001) Supplementary Figure S2 plots all possiblebivariate distributions of the logarithm of the grayscalefeatures for 32 middot 32 textures and surfaces

In addition to grayscale features we measured colorfeatures (C1ndashC2 Methods Statistical measurements)as well by transforming the images into the Lab colorspace Fine et al (2003) Ing et al (2010) Using our

region labelings (Figure 4a bottom) we measured thecolor parameters log Da and log Db from ourocclusions Supplementary Table 3 lists the meansand standard deviations of each of these features andwe observe positive correlations (rfrac14043) between theirvalues similar to previous observations of positivecorrelations between these same features for twonearby patches taken from the same surface (Fine etal 2003) This finding is interesting because occlusionboundaries by definition separate two or more differentsurfaces and the color features we measure constitute aset of independent feature dimensions From the two-dimensional scatterplots in Supplementary Figure S3we see that there is separation between the multivariatedistributions defined by these color parameters sug-gesting that color contrast provides a potential cue forocclusion edge detection much as it does for surfacesegmentation (Fine et al 2003 Ing et al 2010)

Human performance on occlusion boundarydetection task

Effects of patch size color and pre-exposure

In order to determine the local visual features used todiscriminate occlusions from surfaces we designed asimple two-alternative forced choice task illustratedschematically in Figure 6 An image patch subtendingroughly 15 degrees of visual angle (305 cm viewingdistance) was presented to the subject who had todecide with a binary (10) keypress whether an imagepatch contains an occlusion Patches were chosen froma pre-extracted set of 1000 occlusions and 1000surfaces with equal probability of each type soguessing would yield 50 correct Subjects wereallowed to view each patch as long as they needed (1ndash2 seconds typical) and following previous workauditory feedback was provided to optimize perfor-mance (Ing et al 2010) The task was performed on sixsubjects having varying degrees of exposure to theimage database (Methods Experimental paradigm)

Figure 10a illustrates the effect of patch size on the task for grayscale image patches for all subjects. Thick lines indicate the mean subject performance, and thin dashed lines denote 95% confidence intervals. We see from this that performance is significantly different for the different patch sizes which we tested (8 × 8, 16 × 16, 32 × 32). For a subset of subjects (S0, S1, S2), we also tested color image patches, and the results for color image patches are shown in Figure 10b (red line) together with the grayscale data from these same subjects (black dashed line). We see from this plot that, for all patch sizes tested, performance is significantly better for color patches, which is sensible because color is a potentially informative cue for distinguishing different surfaces, as has been shown in previous work (Fine et al., 2003; Ing et al., 2010).

Figure 9. Univariate distributions of grayscale visual features G1–G5 (see Methods) for occlusions (blue) and textures (green). When plotted on a log scale, these distributions are well described by Gaussians and exhibit separation between occlusions and textures.

One concern with the interpretation of our results is that the brief pre-exposure (3 seconds) of subjects S1 and S2 to the full-scale (576 × 768) images prior to the first task session (Methods: Experimental paradigm) may have unfairly improved their performance. In order to control for this possibility, we ran three additional subjects (S3, S4, S5) who did not have this pre-exposure to the full-scale images, and we found no significant difference in the performance of the two groups of subjects (Supplementary Figure S4a). We also found that the lead author (S0) was not significantly better at the task than the two subjects (S1, S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b).

Therefore, we conclude that pre-exposure to the full-scale images makes little if any difference in our task.

Effects of luminance, boundary, and texture cues

In order to better understand what cues subjects are making use of in the task, we tested subjects on modified image patches with various cues removed (Methods: Experimental paradigm). In order to determine the importance of texture, we removed all texture cues from the patches by averaging all pixels in each region ("texture removed"), as illustrated in Figure 4b (top). We see from Figure 11a that subject performance in the task is substantially impaired without texture cues for the 16 × 16 and 32 × 32 patch sizes. Note how performance is roughly flat with increasing patch size when all cues but luminance are removed, suggesting that luminance gradients are a fairly "local" cue which is very useful even at very small scales. Similar results were obtained for color patches (Figure 12), where the only cues available were luminance and color contrast.
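A minimal sketch of the "texture removed" manipulation, assuming a per-pixel region labeling like that illustrated in Figure 4 (the function name is ours):

```python
import numpy as np

def remove_texture(patch, labels):
    """'Texture removed' manipulation: replace every pixel in each labeled
    region with that region's mean, leaving only luminance (and boundary) cues.

    patch  : (H, W) grayscale image patch
    labels : (H, W) integer region assignment from the hand-labeled database
    """
    out = patch.astype(float).copy()
    for r in np.unique(labels):
        out[labels == r] = patch[labels == r].mean()
    return out
```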

Removing luminance cues while keeping texture cues intact is somewhat trickier, since simply equalizing the luminance in the two image regions can lead to problems of boundary artifact (a high spatial frequency edge along the boundary), as has been noted by others (Arsenault et al., 2011). In order to circumvent this problem, we removed the boundary artifact by covering a 3-pixel wide region including the boundary, setting all pixels in this strip to a single uniform value (Figure 4b, bottom). We were then able to equalize the luminance on each side of the patch without creating boundary artifact ("luminance + boundary removed"). Since the boundary pixels could potentially contain long-range spatial correlation information (Hess & Field, 1999) useful for identifying occlusions,

Figure 10. Performance of human subjects at the occlusion detection task for 8 × 8, 16 × 16, and 32 × 32 image patches. (a) Subject performance for grayscale image patches. Thin dashed lines denote 95% confidence intervals. Note how performance significantly improves with increasing patch size. (b) Subject performance for color image patches is significantly better than for grayscale at all patch sizes.

Figure 11. Subject performance for grayscale image patches with various cues removed. Dashed lines indicate the average performance for unaltered image patches; solid lines, performance in the cue-removed case. (a) Removal of texture cues significantly impairs subject performance for larger image patches. (b) Removal of the boundary and luminance cues substantially impairs subject performance at all patch sizes. (c) Removal of the boundary cue alone, without altering texture or luminance cues, does not affect subject performance.

as an additional control we also considered performance on patches where we only removed the boundary ("boundary removed").
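Both luminance-related manipulations can be sketched together. The function below is our reading of the described procedure (mask a 3-pixel strip with a single uniform value, then shift each region to a common mean luminance), not the authors' code:

```python
import numpy as np

def equalize_luminance(patch, labels, boundary_mask):
    """'Luminance + boundary removed' manipulation (a sketch).

    boundary_mask : (H, W) boolean array marking the 3-pixel-wide strip
                    covering the region boundary
    """
    out = patch.astype(float).copy()
    regions = list(np.unique(labels))
    common_mean = np.mean([patch[labels == r].mean() for r in regions])
    for r in regions:                       # equalize mean luminance across regions
        out[labels == r] += common_mean - patch[labels == r].mean()
    out[boundary_mask] = common_mean        # uniform strip hides boundary artifact
    return out
```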

We see from Figure 11c that simply removing the boundary pixels has no effect on subject performance, suggesting that subjects do not need to see the boundary in order to perform the task, so long as other cues like luminance and texture differences are present. However, we see from Figure 11b that there are profound effects of removing the luminance cues while keeping the texture cues intact. We see that subject performance is impaired at every patch size, although, in contrast to removing texture cues, we see a substantial improvement in performance as patch size is increased. This suggests that texture cues are complementary to luminance cues, in the sense that texture information becomes more useful for larger patches, while luminance is useful even for very small patches.

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from the database suggests that relatively simple visual features measured in previous related studies can potentially be used to distinguish occlusions and surfaces (Figure 9). Therefore, it was of interest to determine whether machine classifiers defined using these simple visual features could account for human performance on the task. Since the logarithmically transformed features are approximately Gaussian distributed (Figure 9 and Supplementary Figures S2, S3), following previous work (Ing et al., 2010) we defined a quadratic classifier on class-conditional distributions which we modeled as multivariate Gaussians (Methods: Machine classifiers). When data from both classes are exactly Gaussian, this classifier yields Bayes-optimal performance (Duda et al., 2000). This classifier was then tested on validation sets of 400 image patches drawn from the same set shown to human subjects (but not used to train the classifier), using 200 Monte Carlo resamplings.
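The classifier itself is simple enough to sketch in a few lines of Python; this is a generic Gaussian class-conditional (quadratic) classifier of the kind described, not the authors' implementation:

```python
import numpy as np

class QuadraticClassifier:
    """Fit a multivariate Gaussian to the log-features of each class and
    label a patch by comparing class log-likelihoods."""

    def fit(self, X_occ, X_surf):
        # One (mean, covariance) pair per class, estimated from training patches
        self.params = [(X.mean(axis=0), np.atleast_2d(np.cov(X, rowvar=False)))
                       for X in (X_occ, X_surf)]
        return self

    def _loglik(self, X, mean, cov):
        d = X - mean
        icov = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        # Gaussian log-likelihood up to a constant shared by both classes
        return -0.5 * (np.einsum('ij,jk,ik->i', d, icov, d) + logdet)

    def predict(self, X):
        # 1 = occlusion, 0 = surface
        return (self._loglik(X, *self.params[0]) >
                self._loglik(X, *self.params[1])).astype(int)
```

Validation would then mirror the text: score the trained classifier on 400 held-out patches and repeat over 200 Monte Carlo resamplings.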

In Figure 13 we compare the performance of human subjects (black lines) on the grayscale version of the task with that of quadratic classifiers (blue lines). Figure 13a shows task performance with the full grayscale image patches. We see that the quadratic classifier defined on all grayscale parameters (G1–G5, Methods: Statistical measurements) does not capture the experimentally observed improvement in subject performance with increasing patch size. This suggests that, in addition to the scalar grayscale cues used to define the classifier, subjects must be making use of more complex spatial cues, like differences in texture (Bergen & Landy, 1991) or long-range spatial correlations (Hess & Field, 1999), which are not available at low resolutions but become available as the image patch gets larger. We see from Figure 13b that a quadratic classifier defined using only the luminance difference is in much better agreement with human performance when this is the only available cue for performing the task. Similar results were obtained for color image patches (Figure 14).

Ranking of feature dimensions

In order to understand which feature dimensions are most important to the quadratic classifier for detecting occlusions, we computed the d′ value for each of the feature dimensions; the results of this analysis are summarized in Supplementary Table 4.

Figure 12. Subject performance for unaltered color image patches (dashed line) and with texture cues removed (solid line).

Figure 13. Comparison of subject and quadratic classifier performance for grayscale image patches. (a) A classifier defined using all grayscale features (thick blue line) does not accurately model human performance on the task with unmodified patches (black line). (b) A classifier defined using only luminance cues accurately models human performance in the texture-removed condition, where luminance differences are the only available cue.

We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δμ) and the overall patch variance (ρ). We also see that the contrast difference (Δσ) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.
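For a single log-feature dimension that is approximately Gaussian within each class, discriminability can be summarized by the standard d′ statistic. The formula below uses the common equal-weight variance pooling, which we assume here; the paper's exact convention is given in its Methods:

```latex
d' = \frac{\left| \mu_{\mathrm{occ}} - \mu_{\mathrm{surf}} \right|}
          {\sqrt{\tfrac{1}{2}\left( \sigma_{\mathrm{occ}}^{2} + \sigma_{\mathrm{surf}}^{2} \right)}}
```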

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one in order of decreasing d′ for 16 × 16 and 32 × 32 patches, applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that, for grayscale classifiers containing ρ and Δμ, no performance increase is seen when one adds the features GB (boundary gradient) and Eθ (oriented energy), most likely because these features are strongly correlated with ρ and Δμ (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δσ is added, most likely because this feature dimension is only weakly correlated with ρ and Δμ (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δα and Δβ are added to the classifier, which makes sense because color cues are largely independent of luminance cues.
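The feature-addition analysis is easy to express given the quadratic classifier sketched earlier; again, this is an illustrative reconstruction with our own function names:

```python
import numpy as np

def forward_addition_curve(X_occ, X_surf, X_test, y_test, dprime):
    """Grow the quadratic classifier one feature at a time, in order of
    decreasing single-dimension d'. Reuses QuadraticClassifier from above.
    dprime : length-k array of single-dimension d' values."""
    order = np.argsort(dprime)[::-1]          # best single features first
    accuracies = []
    for m in range(1, len(order) + 1):
        dims = order[:m]
        clf = QuadraticClassifier().fit(X_occ[:, dims], X_surf[:, dims])
        accuracies.append((clf.predict(X_test[:, dims]) == y_test).mean())
    return accuracies                          # one accuracy per feature count
```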

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

Although the quadratic classifier is Bayes optimal for Gaussian-distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield performance superior to the quadratic classifier and hence provide a better match with our perceptual data.

In order to control for this possibility, we compared the performance of the quadratic classifier to a support vector machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the poor match of the quadratic classifier with human performance reflects the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classification method.
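A hedged sketch of this control, reusing the quadratic classifier from above; the SVM kernel and hyperparameters shown are illustrative assumptions, since the settings actually used are described in the paper's Methods:

```python
from sklearn.svm import SVC

def compare_svm_to_quadratic(X_train, y_train, X_val, y_val):
    """Inputs are log-feature arrays; y = 1 for occlusions, 0 for surfaces."""
    svm_acc = (SVC(kernel="rbf", C=1.0, gamma="scale")
               .fit(X_train, y_train).score(X_val, y_val))
    quad = QuadraticClassifier().fit(X_train[y_train == 1], X_train[y_train == 0])
    quad_acc = (quad.predict(X_val) == y_val).mean()
    return svm_acc, quad_acc
```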

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images can exist on multiple scales and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features which quantify texture differences. This is problematic because our psychophysical results suggest that texture is a very important cue, and previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying independent components analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b where, instead of learning the best hidden layer representation for solving the classification task, the hidden layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set not used for training to evaluate their performance.
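The full pipeline can be sketched with standard library components (scikit-learn's FastICA, LogisticRegression, and MLPClassifier); the component count and iteration limits below are our assumptions, not the reported settings:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def multiscale_classifiers(unlabeled_patches, X_train, y_train, X_val, y_val):
    """Learn Gabor-like filters with ICA from flattened natural image patches,
    represent each patch by its rectified filter outputs, then compare a
    linear (logistic) classifier with a 16-hidden-unit neural network."""
    ica = FastICA(n_components=256, max_iter=1000).fit(unlabeled_patches)
    filters = ica.components_                       # rows are Gabor-like filters

    def feats(patches):
        return np.abs(patches @ filters.T)          # rectified filter outputs

    linear = LogisticRegression(max_iter=1000).fit(feats(X_train), y_train)
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(
        feats(X_train), y_train)                    # 16 hidden units, cf. Figure 16
    return linear.score(feats(X_val), y_val), mlp.score(feats(X_val), y_val)
```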

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size and seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that the linear classifier may simply be learning to detect low spatial frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6).

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

However, the hidden unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture-edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot determine from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task.

Figure 17. Schematic illustration of weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) A hypothetical unit which detects texture edges defined by differences in orientation energy. (b) A hypothetical unit that detects texture edges defined by spatial frequency differences.

In particular, the classifier overpredicted performance for small (8 × 8) patches and underpredicted performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance at junction detection, consistent with our results for occlusion detection. In this work it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and, unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "filter-rectify-filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
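For readers unfamiliar with this model class, a minimal FRF sketch follows; the kernels and the second-stage scale are arbitrary assumptions chosen only to exhibit the filter, rectify, filter structure, not parameters of any model in the paper:

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def frf_response(image, first_stage_kernels, second_stage_sigma=8.0):
    """Minimal filter-rectify-filter sketch: Gabor-like first-stage filtering,
    pointwise rectification, then coarse second-stage smoothing of the
    rectified 'texture energy'."""
    energy_maps = []
    for k in first_stage_kernels:              # k: small 2-D filter kernel
        rectified = np.abs(convolve(image, k))             # filter + rectify
        energy_maps.append(gaussian_filter(rectified, second_stage_sigma))
    # A texture edge appears as a spatial gradient in these pooled energy
    # maps; a final decision stage (e.g., a classifier) would operate on them.
    return np.stack(energy_maps)
```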

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although finding region boundaries does, de facto, end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. This is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14.

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2.

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10.

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11.

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2.

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4.

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

Page 12: Detecting natural occlusion boundaries using local cuesDetecting natural occlusion boundaries using local cues Christopher DiMattina # $ Department of Psychology & Neuroscience Concentration,

One concern with the interpretation of our results isthat the brief pre-exposure (3 seconds) of subjects S1 S2to the full-scale (576 middot 768) images prior to the first tasksession (Methods Experimental paradigm) may haveunfairly improved their performance In order to controlfor this possibility we ran three additional subjects (S3S4 S5) who did not have this pre-exposure to the full-scale images and we found no significant difference inthe performance of the two groups of subjects(Supplementary Figure S4a) We also found that thelead author (S0) was not significantly better at the taskthan the two subjects (S1 S2) who were only briefly pre-exposed to the images (Supplementary Figure S4b)

Therefore we conclude that pre-exposure to the full-scale images makes little if any difference in our task

Effects of luminance boundary and texture cues

In order to better understand what cues subjects aremaking use of in the task we tested subjects onmodified image patches with various cues removed(Methods Experimental paradigm) In order to deter-mine the importance of texture we removed all texturecues from the patches by averaging all pixels in eachregion (lsquolsquotexture removedrsquorsquo) as illustrated in Figure 4b(top) We see from Figure 11a that subject performancein the task is substantially impaired without texturecues for the 16 middot 16 and 32 middot 32 patch sizes Note howperformance is roughly flat with increasing patch sizewhen all cues but luminance are removed suggestingthat luminance gradients are a fairly lsquolsquolocalrsquorsquo cue whichis very useful even at very small scales Similar resultswere obtained for color patches (Figure 12) where theonly cues available were luminance and color contrast

Removing luminance cues while keeping texture cuesintact is slightly more tricky since simply equalizing theluminance in the two image regions can lead toproblems of boundary artifact (a high spatial frequencyedge along the boundary) as has been noted by others(Arsenault et al 2011) In order to circumvent thisproblem we removed the boundary artifact by coveringa 3-pixel wide region including the boundary by settingall pixels in this strip to a single uniform value (Figure4b bottom) We then were able to equalize theluminance on each side of the patch without creatingboundary artifact (luminance thorn boundary removed)Since the boundary pixels may could potentiallycontain long-range spatial correlation information(Hess amp Field 1999) useful for identifying occlusions

Figure 10 Performance of human subjects at the occlusion

detection task for 8 middot 8 16 middot 16 and 32 middot 32 image patches (a)

Subject performance for grayscale image patches Thin dashed

lines denote 95 confidence intervals Note how performance

significantly improves with increasing patch size (b) Subject

performance for color image patches is significantly better than for

grayscale at all patch sizes

Figure 11 Subject performance for grayscale image patches with various cues removed Dashed lines indicate the average performance

for unaltered image patches solid lines performance in the cue-removed case (a) Removal of texture cues significantly impairs subject

performance for larger image patches (b) Removal of the boundary and luminance cues substantially impairs subject performance at all

patch sizes (c) Removal of the boundary cue alone without altering texture cues or luminance cues does not affect subject performance

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 12

as an additional control we also considered perfor-mance on patches where we only removed theboundary (boundary removed)

We see from Figure 11c that simply removing theboundary pixels has no effect on subject performancesuggesting that subjects do not need to see theboundary in order to perform the task so long asother cues like luminance and texture differences arepresent However we see from Figure 11b that thereare profound effects of removing the luminance cueswhile keeping the texture cues intact We see thatsubject performance is impaired at every patch sizealthough in contrast to removing texture cues we see asubstantial improvement in performance as patch sizeis increased This suggests that texture cues arecomplementary to luminance cues in the sense thattexture information becomes more useful for largerpatches while luminance is useful even for very smallpatches

Quadratic classifier analysis

Comparison with human performance

Our analysis of image patches taken from thedatabase suggests that relatively simple visual featuresmeasured in previous related studies can potentially beused to distinguish occlusions and surfaces (Figure 9)Therefore it was of interest to determine whethermachine classifiers defined using these simple visualfeatures could potentially account for human perfor-mance on the task Since the logarithmically trans-formed features are approximately Gaussian distributed(Figure 9 and Supplementary Figures S2 S3) followingprevious work (Ing et al 2010) we defined a quadraticclassifier on class-conditional distributions which wemodeled as multivariate Gaussians (Methods Machineclassifiers) When data from both classes is exactly

Gaussian this classifier yields Bayes-optimal perfor-mance (Duda et al 2000) This classifier was thentested on validation sets of 400 image patches drawnfrom the same set shown to humans subjects (whichwere not used to train the classifier) using 200 MonteCarlo resamplings

In Figure 13 we compare the performance of humansubjects (black lines) on the grayscale version of thetask with that of quadratic classifiers (blue lines)Figure 13a shows task performance with the fullgrayscale image patches We see that the quadraticclassifier defined on all grayscale parameters (G1-G5Methods Statistical measurements) does not capturethe experimentally observed improvement in subjectperformance with increasing patch size This suggeststhat in addition to the scalar grayscale cues used todefine the classifier subjects must be making use ofmore complex spatial cues like differences in texture(Bergen amp Landy 1991) or long-range spatial correla-tions (Hess amp Field 1999) which are not available atlow resolutions but become available as the imagepatch gets larger We see from Figure 13b that aquadratic classifier defined using only the luminancedifference is in much better agreement with humanperformance when this is the only available cue forperforming the task Similar results were obtained forcolor image patches (Figure 14)

Ranking of feature dimensions

In order to understand which feature dimensions aremost important for the quadratic classifier detecting

Figure 12 Subject performance for unaltered color image patches

(dashed line) and with texture cues removed (solid line) Figure 13 Comparison of subject and quadratic classifier

performance for grayscale image patches (a) Classifier defined

using all grayscale features (thick blue line) does not accurately

model human performance on the task with unmodified patches

(black line) (b) Classifier defined using only luminance cues

accurately models human performance in the texture-removed

condition where luminance differences are the only available cue

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 13

occlusions we computed the d0 value for each of thefeature dimensions with the results of this analysisbeing summarized in Supplementary Table 4 We seefrom this table that the best single grayscale parametersfor discriminating occlusion and surface patches are theluminance difference (Dl) and overall patch variance(q) We also see that contrast difference (Dr) by itself isa fairly weak parameter for all patch sizes while for thelarger patch sizes (16 middot 16 32 middot 32) color cues are veryinformative

Our correlation analyses presented in SupplementaryTable 2 demonstrated that many of these features were

correlated with each other so therefore it was ofinterest to determine which features contributed themost independent information to the classifier We didthis by adding feature dimensions to the classifier oneby one in order of decreasing d0 for 16 middot 16 and 32 middot 32patches and applying each classifier to the task for 200Monte Carlo simulations The results of this analysisperformed on 16 middot 16 and 32 middot 32 image patches areshown in Figure 15 We see from Figure 15 that forgrayscale classifiers containing q and Dl that noperformance increase is seen when one adds thefeatures GB (boundary gradient) and Eh (orientedenergy) most likely because these features are stronglycorrelated with q Dl (Supplementary Table 2)However we do see a jump in performance when thecontrast difference Dr is added most likely because thisfeature dimension is only weakly correlated with q andDl (Supplementary Table 2) For our analysis of thecolor classifier we see that there are substantialimprovements in performance when the color param-eters Da and Db are added to the classifier which makessense because color cues are largely independent fromluminance cues

Comparison with SVM classifier

There are two logical possibilities for the poor matchbetween human performance in our task and theperformance of the quadratic classifier The firstpossibility is that the feature sets used to define theclassifier do not adequately describe the featuresactually used by human subjects to perform the taskA second possibility is that the feature set is adequatebut the quadratic classifier is somehow suboptimalAlthough the quadratic classifier is Bayes optimal forGaussian distributed class-conditional densities (Duda

Figure 14 Comparison of subject and quadratic classifier

performance for color image patches (a) Classifier defined using

all color features (thick blue line) does not accurately model

human performance (black line) on the task with unmodified

patches (b) Classifier defined using only luminance and color

contrast more accurately models human performance in the

texture-removed condition

Figure 15 Quadratic classifier performance on 32 middot 32 patches as a function of number of features Feature sets of varying lengths are

defined by ranking individual dimensions by their d0 measures and adding them in order from highest to lowest

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 14

et al 2000) and our feature sets are reasonably welldescribed by Gaussians (see Figure 9 andSupplementary Figures S2 S3) this description is farfrom exact Therefore it is possible that some othermore sophisticated classifier may yield superior perfor-mance to the quadratic classifier and hence provide abetter match with our perceptual data

In order to control for this possibility we comparedthe performance of the quadratic classifier to a SupportVector Machine (SVM) classifier (Bishop 2006 Cris-tianini amp Shawe-Taylor 2000) on the task of discrim-inating grayscale occlusion patches from surfaces Wefound that the SVM classifier performed worse than thequadratic classifier (Supplementary Figure S5) Thisfailure of the SVM classifier strengthens the argumentthat the reason for the poor match of the quadraticclassifier with human performance is the inadequacy ofthe measured feature set used to define the classifierrather than the inadequacy our classifier method

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual featuresquantified in this study is that they are defined on thescale of the entire image patch and therefore do notintegrate information across scale It is well known thatocclusions in natural images can exist on multiplescales and that accurate edge detection requires anappropriate choice of scale (Elder amp Zucker 1998Marr amp Hildreth 1980) Furthermore our feature setdoes not include any features which quantify texturedifferences which is problematic because our psycho-physical results suggests that texture is a very importantcue and previous work in machine vision hasdemonstrated that texture cues are highly informativefor detecting edges and segmenting image regions(Bergen amp Landy 1991 Malik amp Perona 1990 Martinet al 2004 Voorhees amp Poggio 1988)

In order to choose a set of biologically motivatedmultiscale features to measure from the image patcheswe obtained a Gabor filter bank by applying theIndependent Components Analysis (Bell amp Sejnowski1997) to natural image patches taken from our dataset(Figure 5a) and this filter bank was applied to theimage patches The independent component analysis(ICA) algorithm learns a set of linear features that aremaximally independent which is ideal for classifieranalyses since correlated features add less informationthan independent features The rectified outputs of thisGabor filter bank were then used as inputs to twosupervised classifiers (a) A linear binary logisticclassifier and (b) a neural network classifier (Bishop2006) having 4 16 and 64 hidden units The logisticclassifier can be considered as a special case of theneural network classifier illustrated in Figure 5b where

instead of learning the best hidden layer representationfor solving the classification task the hidden layerrepresentation is always identical to the input repre-sentation and only the output weights are modifiedAfter training both classifiers were tested on a separatevalidation set not used for training to evaluate theirperformance

Classifier comparison

Figure 16 shows the performance of these twoclassifiers together with subject data for the grayscaleocclusion detection task We see that the linearclassifier (blue line) does not accurately describe humanperformance (black line) while the neural network (redline) having 16 hidden units compares much morefavorably to the human subjects Similar results wereobtained for a neural network with 64 hidden units butworse performance was seen with four hidden units(not shown) We can see from Figure 16 that the linearclassifier does not increase its performance withincreasing patch size and seems to perform similarlyto the quadratic classifier defined using only luminancedifferences This suggests that this classifier may simplybe learning to detecting low spatial frequency edges atvarious orientations Indeed plotting the connectionstrengths of the output unit with each of the Gaborfilters as in Figure 17 demonstrates that there are strongpositive weights for filters at low spatial frequencylocated in the center of the image patch therebydetecting luminance differences on the scale of theentire image patch

In contrast the neural network classifier learns tospatially pool information across spatial scale andlocation (Supplementary Figure S6) although thehidden unit lsquolsquoreceptive fieldsrsquorsquo do not seem to resemble

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. We see that a simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

Interestingly, such units tuned for texture edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot determine from our simple analysis the detailed form which this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13).

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.

The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).
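For readers who want the comparison model in concrete terms, the following is a minimal sketch of a quadratic classifier of this kind: class-conditional feature distributions fit as multivariate Gaussians, with a log-likelihood-ratio decision. SciPy is assumed, and the synthetic Gaussian samples are placeholders for the actual log-transformed feature measurements.

```python
# Minimal quadratic-classifier sketch (assumption: synthetic 2-D Gaussian
# features in place of the measured, log-transformed patch features).
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(F):
    """Fit a multivariate Gaussian (mean, covariance) to one class."""
    return F.mean(axis=0), np.cov(F, rowvar=False)

def classify(x, occ_params, surf_params):
    """Return 1 (occlusion) wherever the occlusion class is more likely."""
    return (multivariate_normal.logpdf(x, *occ_params)
            > multivariate_normal.logpdf(x, *surf_params)).astype(int)

rng = np.random.default_rng(1)
occ = rng.multivariate_normal([1.0, 0.8], [[1.0, 0.3], [0.3, 1.0]], 500)
surf = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.1], [0.1, 1.0]], 500)

preds = classify(np.vstack([occ, surf]), fit_gaussian(occ), fit_gaussian(surf))
truth = np.r_[np.ones(500), np.zeros(500)]
print("accuracy:", (preds == truth).mean())
```

Because the two fitted covariances differ, the decision boundary is quadratic in the features, and the rule is Bayes optimal (with equal priors) exactly when both class-conditional densities are Gaussian.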

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance for junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for this discrepancy is that Ing et al. (2010) only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture, making texture an uninformative cue for that particular choice of stimuli. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, in our task we did not consider quantitative models for how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has been mostly considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.
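For context, the ideal-observer rule examined in the cue-combination studies cited above weights each cue's estimate by its reliability (inverse variance). A minimal statement of that standard rule, not a model we fit to our data, is

\[
\hat{s} \;=\; \frac{\sum_i s_i / \sigma_i^2}{\sum_i 1/\sigma_i^2},
\qquad
\sigma_{\hat{s}}^{2} \;=\; \left( \sum_i \frac{1}{\sigma_i^2} \right)^{-1},
\]

where \(s_i\) is the location estimate from cue \(i\) (e.g., luminance, color, or texture) and \(\sigma_i^2\) is its variance; the combined estimate is more reliable than any single cue. The difficulty noted above is that \(\sigma_i^2\) for natural texture is not readily characterized by a low-dimensional model.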

Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "filter-rectify-filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
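As a concrete illustration of the FRF scheme (again a sketch under assumptions, not the model fit in this paper), the fragment below builds a texture edge defined purely by orientation, with no luminance step, and recovers it by filtering, rectifying, and filtering again at a coarser scale; all kernel sizes and frequencies are illustrative.

```python
# Filter-rectify-filter (FRF) sketch: oriented Gabor filtering, half-wave
# rectification, then coarse second-stage (Gaussian) filtering; SciPy assumed.
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor(size, freq, theta):
    """Odd-symmetric Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * (size / 4) ** 2))
    return envelope * np.sin(2 * np.pi * freq * xr)

def frf(image, theta, freq=0.125, size=15, second_scale=6.0):
    first = convolve(image, gabor(size, freq, theta))  # filter
    rectified = np.maximum(first, 0.0)                 # rectify
    return gaussian_filter(rectified, second_scale)    # second filter

# Orientation-defined texture edge: vertical grating on the left half,
# horizontal grating of equal contrast on the right (no luminance cue).
phase = np.linspace(0, 32 * np.pi, 128)
vertical = np.tile(np.sin(phase), (128, 1))
img = np.where(np.arange(128)[None, :] < 64, vertical, vertical.T)

# Opponent combination of two oriented FRF channels: the response difference
# changes sign at the texture boundary, signaling the edge.
edge_map = frf(img, 0.0) - frf(img, np.pi / 2)
```

In the network of Figure 5, the hidden units play the role of these second-stage filters, with the rectified Gabor outputs as their inputs.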

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities which complicate feature measurements (Zhou & Mel, 2008) and cause artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although finding region boundaries does, de facto, end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. This is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.


Zhou C amp Mel B W (2008) Cue combination andcolor edge detection in natural scenes Journal ofVision 8(4)4 1ndash25 httpwwwjournalofvisionorgcontent844 doi101167844 [PubMed][Article]

Zhu S C Wu Y N amp Mumford D (1997)Minimax entropy principle and its application totexture modeling Neural Computation 9 1627ndash1660

Ziegaus C amp Lang E W (2003) Independentcomponent analysis of natural and urban imageensembles Neural Information Processing Letters1 89ndash95

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 21

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • e01
  • e02
  • e03
  • e04
  • e05
  • e06
  • e07
  • e08
  • e09
  • e10
  • f04
  • e11
  • e12
  • e13
  • e14
  • e15
  • e16
  • e17
  • f05
  • e18
  • Results
  • f06
  • f07
  • f08
  • f09
  • f10
  • f11
  • f12
  • f13
  • f14
  • f15
  • f16
  • Discussion
  • f17
  • f18
  • Abdou1
  • Acharya1
  • Arsenault1
  • Baddeley1
  • Baker1
  • Balboa1
  • Bell1
  • Bergen1
  • Bishop1
  • Cadieu1
  • Caywood1
  • Collins1
  • Cristianini1
  • Daugman1
  • Doi1
  • Duda1
  • Elder1
  • Field1
  • Fine1
  • Fowlkes1
  • Geisler1
  • Goldberg1
  • Graham1
  • Graham2
  • Guzman1
  • Heeger1
  • Hess1
  • Hoiem1
  • Ing1
  • Jones1
  • Karklin1
  • Karklin2
  • Kersten1
  • Kingdom1
  • Knill1
  • Konishi1
  • Landy1
  • Landy2
  • Landy3
  • Lee1
  • Lee2
  • Malik1
  • Marr1
  • Martin1
  • Martin2
  • McDermott1
  • McGraw1
  • Nakayama1
  • Oliva1
  • Olmos1
  • Olshausen1
  • Ott1
  • Perona1
  • Pollen1
  • Portilla1
  • Rajashekar1
  • Reinagel1
  • Rijsbergen1
  • Rivest1
  • Schwartz1
  • Sigman1
  • Todd1
  • Torralba1
  • Torralba2
  • Voorhees1
  • Zetzsche1
  • Zhou1
  • Zhu1
  • Ziegaus1
Page 14: Detecting natural occlusion boundaries using local cuesDetecting natural occlusion boundaries using local cues Christopher DiMattina # $ Department of Psychology & Neuroscience Concentration,

occlusions, we computed the d′ value for each of the feature dimensions, with the results of this analysis summarized in Supplementary Table 4. We see from this table that the best single grayscale parameters for discriminating occlusion and surface patches are the luminance difference (Δl) and the overall patch variance (ρ). We also see that the contrast difference (Δρ) by itself is a fairly weak parameter at all patch sizes, while for the larger patch sizes (16 × 16, 32 × 32) color cues are very informative.
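In symbols, the single-feature discriminability index takes the standard equal-variance form (a reconstruction of the usual textbook definition, supplied here for reference; the paper's own estimator may differ in detail):

\[
d' = \frac{\left| \mu_{\mathrm{occ}} - \mu_{\mathrm{surf}} \right|}{\sqrt{\tfrac{1}{2}\left( \sigma_{\mathrm{occ}}^{2} + \sigma_{\mathrm{surf}}^{2} \right)}}
\]

where \(\mu\) and \(\sigma^{2}\) denote the mean and variance of a given feature over the occlusion and surface patch ensembles, respectively.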

Our correlation analyses presented in Supplementary Table 2 demonstrated that many of these features were correlated with each other, so it was of interest to determine which features contributed the most independent information to the classifier. We did this by adding feature dimensions to the classifier one by one in order of decreasing d′ for 16 × 16 and 32 × 32 patches, and applying each classifier to the task for 200 Monte Carlo simulations. The results of this analysis performed on 16 × 16 and 32 × 32 image patches are shown in Figure 15. We see from Figure 15 that for grayscale classifiers containing ρ and Δl, no performance increase is seen when one adds the features G_B (boundary gradient) and E_η (oriented energy), most likely because these features are strongly correlated with ρ and Δl (Supplementary Table 2). However, we do see a jump in performance when the contrast difference Δρ is added, most likely because this feature dimension is only weakly correlated with ρ and Δl (Supplementary Table 2). For our analysis of the color classifier, we see that there are substantial improvements in performance when the color parameters Δα and Δβ are added to the classifier, which makes sense because color cues are largely independent of luminance cues.
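To make this procedure concrete, the following minimal sketch ranks features by single-feature d′ and adds them one at a time to a quadratic classifier, averaging accuracy over 200 Monte Carlo train/test splits. The synthetic Gaussian data and the five feature columns are stand-ins for the real measured feature vectors; none of the numbers below come from the study itself.

```python
# Sketch of the d'-ordered feature-addition analysis (synthetic stand-in data).
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical feature matrix; columns play the role of [dl, rho, d_rho, gb, e].
X_occ = rng.normal(loc=[1.0, 1.2, 0.5, 0.8, 0.6], scale=1.0, size=(n, 5))
X_sur = rng.normal(loc=0.0, scale=1.0, size=(n, 5))
X = np.vstack([X_occ, X_sur])
y = np.r_[np.ones(n), np.zeros(n)]

# Rank features by single-feature d-prime (equal-variance form).
mu1, mu0 = X[y == 1].mean(0), X[y == 0].mean(0)
v1, v0 = X[y == 1].var(0), X[y == 0].var(0)
dprime = np.abs(mu1 - mu0) / np.sqrt(0.5 * (v1 + v0))
order = np.argsort(dprime)[::-1]

# Add features one by one in order of decreasing d-prime; average test
# accuracy over 200 Monte Carlo resamplings of the train/test split.
for k in range(1, len(order) + 1):
    cols = order[:k]
    acc = []
    for trial in range(200):
        Xtr, Xte, ytr, yte = train_test_split(X[:, cols], y, test_size=0.5,
                                              random_state=trial)
        acc.append(QuadraticDiscriminantAnalysis().fit(Xtr, ytr).score(Xte, yte))
    print(f"top {k} features: mean accuracy = {np.mean(acc):.3f}")
```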

Comparison with SVM classifier

There are two logical possibilities for the poor match between human performance in our task and the performance of the quadratic classifier. The first possibility is that the feature sets used to define the classifier do not adequately describe the features actually used by human subjects to perform the task. A second possibility is that the feature set is adequate, but the quadratic classifier is somehow suboptimal. Although the quadratic classifier is Bayes optimal for Gaussian-distributed class-conditional densities (Duda et al., 2000), and our feature sets are reasonably well described by Gaussians (see Figure 9 and Supplementary Figures S2, S3), this description is far from exact. Therefore, it is possible that some other, more sophisticated classifier may yield performance superior to the quadratic classifier and hence provide a better match with our perceptual data.

Figure 14. Comparison of subject and quadratic classifier performance for color image patches. (a) A classifier defined using all color features (thick blue line) does not accurately model human performance (black line) on the task with unmodified patches. (b) A classifier defined using only luminance and color contrast more accurately models human performance in the texture-removed condition.

Figure 15. Quadratic classifier performance on 32 × 32 patches as a function of the number of features. Feature sets of varying lengths are defined by ranking individual dimensions by their d′ measures and adding them in order from highest to lowest.

In order to control for this possibility, we compared the performance of the quadratic classifier to a Support Vector Machine (SVM) classifier (Bishop, 2006; Cristianini & Shawe-Taylor, 2000) on the task of discriminating grayscale occlusion patches from surfaces. We found that the SVM classifier performed worse than the quadratic classifier (Supplementary Figure S5). This failure of the SVM classifier strengthens the argument that the reason for the poor match of the quadratic classifier with human performance is the inadequacy of the measured feature set used to define the classifier, rather than the inadequacy of our classifier method.
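A minimal sketch of this control analysis is given below, assuming synthetic stand-in features and an RBF kernel for the SVM (the kernel choice is an assumption, not stated here in the text):

```python
# Compare a quadratic classifier with an SVM on the same feature vectors.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 1000
X = np.vstack([rng.normal(0.8, 1.0, size=(n, 5)),   # stand-in "occlusion" class
               rng.normal(0.0, 1.0, size=(n, 5))])  # stand-in "surface" class
y = np.r_[np.ones(n), np.zeros(n)]

for name, clf in [("quadratic", QuadraticDiscriminantAnalysis()),
                  ("SVM (RBF)", SVC(kernel="rbf", C=1.0, gamma="scale"))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```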

Multiscale classifier analysis

Gabor feature set and classifiers

One weakness of the set of simple visual features quantified in this study is that they are defined on the scale of the entire image patch and therefore do not integrate information across scale. It is well known that occlusions in natural images exist at multiple scales, and that accurate edge detection requires an appropriate choice of scale (Elder & Zucker, 1998; Marr & Hildreth, 1980). Furthermore, our feature set does not include any features that quantify texture differences, which is problematic because our psychophysical results suggest that texture is a very important cue, and because previous work in machine vision has demonstrated that texture cues are highly informative for detecting edges and segmenting image regions (Bergen & Landy, 1991; Malik & Perona, 1990; Martin et al., 2004; Voorhees & Poggio, 1988).

In order to choose a set of biologically motivated multiscale features to measure from the image patches, we obtained a Gabor filter bank by applying independent component analysis (Bell & Sejnowski, 1997) to natural image patches taken from our dataset (Figure 5a), and this filter bank was applied to the image patches. The independent component analysis (ICA) algorithm learns a set of linear features that are maximally independent, which is ideal for classifier analyses, since correlated features add less information than independent features. The rectified outputs of this Gabor filter bank were then used as inputs to two supervised classifiers: (a) a linear binary logistic classifier, and (b) a neural network classifier (Bishop, 2006) having 4, 16, or 64 hidden units. The logistic classifier can be considered a special case of the neural network classifier illustrated in Figure 5b in which, instead of learning the best hidden-layer representation for solving the classification task, the hidden-layer representation is always identical to the input representation and only the output weights are modified. After training, both classifiers were tested on a separate validation set, not used during training, to evaluate their performance.
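The following sketch illustrates the structure of this pipeline under several stated assumptions: 16 × 16 stand-in patches, 64 ICA components, full-wave rectification, and randomly generated labels (so the printed accuracies will sit near chance). It shows the shape of the analysis, not a reproduction of it.

```python
# Sketch: ICA-derived Gabor-like filter bank -> rectified outputs ->
# (a) logistic classifier and (b) small neural network, per the text above.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
patches = rng.normal(size=(5000, 16 * 16))       # stand-in "natural" patches
ica = FastICA(n_components=64, random_state=0).fit(patches)
W = ica.components_                               # rows ~ learned linear filters

def features(imgs):
    # Full-wave rectified responses of the learned filter bank
    # (half- vs. full-wave rectification is an assumption here).
    return np.abs(imgs @ W.T)

# Stand-in "occlusion" vs. "surface" patches with synthetic labels.
X = features(rng.normal(size=(2000, 16 * 16)))
y = rng.integers(0, 2, size=2000)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

linear = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(Xtr, ytr)
print("logistic:", linear.score(Xte, yte), " mlp-16:", mlp.score(Xte, yte))
```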

Classifier comparison

Figure 16 shows the performance of these two classifiers together with subject data for the grayscale occlusion detection task. We see that the linear classifier (blue line) does not accurately describe human performance (black line), while the neural network (red line) having 16 hidden units compares much more favorably to the human subjects. Similar results were obtained for a neural network with 64 hidden units, but worse performance was seen with four hidden units (not shown). We can see from Figure 16 that the linear classifier does not increase its performance with increasing patch size, and it seems to perform similarly to the quadratic classifier defined using only luminance differences. This suggests that the linear classifier may simply be learning to detect low-spatial-frequency edges at various orientations. Indeed, plotting the connection strengths of the output unit with each of the Gabor filters, as in Figure 17, demonstrates that there are strong positive weights for filters at low spatial frequency located in the center of the image patch, thereby detecting luminance differences on the scale of the entire image patch.
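Continuing the sketch above, the analogous weight inspection is straightforward: each logistic-regression coefficient corresponds to one learned filter, so sorting the coefficients reveals which filters dominate the linear classifier's decision (the variables linear and W are carried over from the previous sketch).

```python
# Inspect which filters receive the strongest output weights (cf. Figure 17).
import numpy as np

coefs = linear.coef_.ravel()           # one weight per rectified filter output
top = np.argsort(coefs)[::-1][:5]      # most strongly positive filters
for i in top:
    # W[i] is the i-th filter; reshape to (16, 16) to visualize it as a patch.
    print(f"filter {i}: weight {coefs[i]:+.3f}, norm {np.linalg.norm(W[i]):.2f}")
```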

In contrast, the neural network classifier learns to pool information across spatial scale and location (Supplementary Figure S6), although the hidden-unit "receptive fields" do not seem to resemble the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, units tuned for texture edges have been learned in studies seeking a sparse code for the patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation that pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection; a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models that could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form this integration takes, or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.

Figure 16. Performance of humans and classifiers defined on the multiscale Gabor feature set. A simple linear classifier (blue line) does not account for human performance (black dashed line) in the task; a neural network classifier having an additional hidden layer of processing (red line) compares well with human performance.

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, being equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, a set which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, cues which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of weights learned by the linear classifier. The linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit that detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic, since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983), and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since it only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to detect occlusions nearly as reliably as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance in junction detection, consistent with our results for occlusion detection. It also found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that these multiple sources of information are combined in our task as well. However, we did not consider quantitative models of how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge in studying cue combination in our task is that one of the major cues used by the subjects is texture differences, and unlike luminance or color, natural textures require very high-dimensional models to specify fully (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.
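For context, the cue-combination models referenced above (e.g., Landy & Kojima, 2001) standardly take the form of reliability-weighted averaging; the formula below is the textbook version, supplied for illustration rather than taken from the present study:

\[
\hat{s} = \sum_{i} w_{i} \, \hat{s}_{i}, \qquad w_{i} = \frac{1/\sigma_{i}^{2}}{\sum_{j} 1/\sigma_{j}^{2}}
\]

where \(\hat{s}_{i}\) is the location estimate from cue \(i\) alone and \(\sigma_{i}^{2}\) is its variance, so that more reliable cues receive proportionally greater weight.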


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "filter-rectify-filter" (FRF) models, sometimes also referred to as "back-pocket" models, in which the rectified outputs of a Gabor filter bank are subjected to a second stage of filtering, and the outputs of these second-stage filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
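A minimal one-dimensional FRF cascade is sketched below to make the scheme concrete; the specific filter shapes, frequencies, and scales are illustrative assumptions, not fitted model parameters.

```python
# Minimal filter-rectify-filter (FRF) cascade on a 1-D texture-edge signal.
import numpy as np

def gabor(n, freq, sigma):
    # Odd-symmetric Gabor: sine carrier under a Gaussian envelope.
    x = np.arange(n) - n // 2
    return np.sin(2 * np.pi * freq * x) * np.exp(-x**2 / (2 * sigma**2))

rng = np.random.default_rng(3)
# Texture edge: fine texture on the left, coarse texture on the right,
# with the same mean luminance (so there is no first-order cue).
left = np.sin(2 * np.pi * 0.25 * np.arange(128))
right = np.sin(2 * np.pi * 0.05 * np.arange(128))
signal = np.concatenate([left, right]) + 0.1 * rng.normal(size=256)

# Stage 1: filter with a fine-scale Gabor matched to the left texture,
# then rectify; this yields high energy on the left, low on the right.
stage1 = np.abs(np.convolve(signal, gabor(31, 0.25, 6.0), mode="same"))

# Stage 2: filter the rectified energy with a coarse odd-symmetric filter;
# its response peaks where the first-stage energy changes, i.e., the boundary.
stage2 = np.convolve(stage1, gabor(101, 0.01, 30.0), mode="same")
print("estimated edge location:", np.argmax(np.abs(stage2)), "(true: 128)")
```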

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities, which complicates feature measurements (Zhou & Mel, 2008) and causes artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although, de facto, finding region boundaries does end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales, containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Science, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.


Graham N amp Sutter A (1998) Spatial summation insimple (Fourier) and complex (non-Fourier) texturechannels Vision Research 38 231ndash257

Guzman A (1969) Decomposition of a visual sceneinto three-dimensional bodies In A Griselli (Ed)Automatic interpretation and classification of imag-es pp 291ndash304 New York Academic Press

Heeger D J amp Bergen J R (1995) Pyramid-basedtexture analysissynthesis In Proceedings of ACMSIGGRAPH 229ndash238

Hess R F amp Field D J (1999) Integration ofcontours New insights Trends in Cognitive Scienc-es 3 480ndash486

Hoiem D Efros A A amp Hebert M (2011)Recovering occlusion boundaries from an imageInternational Journal of Computer Vision 91(3)328ndash346

Ing A D Wilson J A amp Geisler W S (2010)Region grouping in natural foliage scenes Imagestatistics and human performance Journal ofVision 10(4)10 1ndash19 httpwwwjournalofvisionorgcontent10410 doi 10116710410[PubMed] [Article]

Jones J P amp Palmer L A (1987) The two-dimensional spatial structure of simple receptivefields in cat striate cortex Journal of Neurophysi-ology 58 1187ndash1211

Karklin Y amp Lewicki M S (2003) Learning higher-

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 19

order structures in natural images Network 14483ndash499

Karklin Y amp Lewicki M S (2009) Emergence ofcomplex cell properties by learning to generalize innatural scenes Nature 457 83ndash86

Kersten D (2000) High-level vision as statisticalinference In M S Gazzaniga (Ed) The newcognitive neurosciences pp 353ndash363 CambridgeMA MIT Press

Kingdom F A A Field D J amp Olmos A (2007)Does spatial invariance result from insensitivity tochange Journal of Vision 7(14)11 1ndash13 httpwwwjournalofvisionorgcontent71411 doi10116771411 [PubMed] [Article]

Knill D C amp Saunders J A (2003) Do humansoptimally integrate stereo and texture informationfor judgments of slant Vision Research 43 2539ndash2558

Konishi S Yuille A L Coughlin J M amp Zhu S C(2003) Statistical edge detection Learning andevaluating edge cues IEEE Transactions on PatternAnalysis and Machine Intelligence 25 57ndash74

Landy M S (1991) Texture segregation and orienta-tion gradient Vision Research 31 679ndash691

Landy M S amp Graham N (2003) Visual perceptionof texture In L M Chalupa amp J S Werner (Eds)The visual neurosciences pp 1106ndash1118 Cam-bridge MA MIT Press

Landy M S amp Kojima H (2001) Ideal cuecombination for localizing texture-defined edgesJournal of the Optical Society of America A 182307ndash2320

Lee H-C amp Choe Y (2003) Detecting salientcontours using orientation energy distribution InInternational Joint Conference on Neural Networks(pp 206ndash211) IEEE Conference Publications

Lee T S amp Mumford D (2003) Hierarchicalbayesian inference in the visual cortex Journal ofthe Optical Society of America A 20 1434ndash1448

Malik J amp Perona P (1990) Preattentive texturediscrimination with early vision mechanisms Jour-nal of the Optical Society of America A 7 923ndash932

Marr D amp Hildreth E (1980) Theory of edgedetection Proceedings of the Royal Society ofLondon Series B 29 187ndash217

Martin D Fowlkes C C Tal D amp Malik J (2001)A database of human segmented natural imagesand its application to evaluating segmentationalgorithms and measuring ecological statisticsProceedings of the 8th International Conference onComputer Vision 2 416ndash423

Martin D R Fowlkes C C amp Malik J (2004)Learning to detect natural image boundaries usinglocal brightness color and texture cues IEEETransactions on Pattern Analysis and MachineIntelligence 26 530ndash549

McDermott J (2004) Psychophysics with junctions inreal images Perception 33 1101ndash1127

McGraw P V Whitaker D Badcock D R ampSkillen J (2003) Neither here nor there Localizingconflicting visual attributes Journal of Vision 3(4)2 265ndash273 httpwwwjournalofvisionorgcontent342 doi101167342 [PubMed] [Article]

Nakayama K He Z J amp Shimojo S (1995) Visualsurface representation A critical link betweenlower-level and higher-level vision In D NOsherson S M Kosslyn amp L R Gleitman(Eds) An invitation to cognitive science Visualcognition pp 1ndash70 Cambridge MA MIT Press

Oliva A amp Torralba A (2001) Modeling the shapeof the scene A holistic representation of the spatialenvelope International Journal of Computer Vision42 145ndash175

Olmos A amp Kingdom F A A (2004) McGillcalibrated colour image database Internet sitehttptabbyvisionmcgillca (Accessed November20 2012)

Olshausen B A amp Field D J (1996) Emergence ofsimple-cell receptive-field properties by learning asparse code for natural images Nature 381 607ndash609

Ott R L (1993) An introduction to statistical methodsand data analysis (4th ed) Belmont CA Wads-worth Inc

Perona P (1992) Steerable-scalable kernels for edgedetection and junction analysis Image and VisionComputing 10 663ndash672

Pollen D A amp Ronner S F (1983) Visual corticalneurons as localized spatial frequency filters IEEETransactions on Systems Man and Cybernetics 13907ndash916

Portilla J amp Simoncelli E P (2000) A parametrictexture model based on joint statistics of complexwavelet coefficients International Journal of Com-puter Vision 40 4971

Rajashekar U van der Linde I Bovik A C ampCormack L K (2007) Foveated analysis of imagefeatures at fixations Vision Research 47 3160ndash3172

Reinagel P amp Zador A M (1999) Natural scenestatistics at the centre of gaze Network 10 341ndash350

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 20

Rijsbergen C V (1979) Information retrieval LondonButterworths

Rivest J amp Cavanagh P (1996) Localizing contoursdefined by more than one attribute Vision Re-search 36 53ndash66

Schwartz O amp Simoncelli E P (2001) Natural signalstatistics and sensory gain control Nature Neuro-science 4 819ndash825

Sigman M Cecchi G A Gilbert C D ampMagnasco M O (2001) On a common circleNatural scenes and gestalt rules Proceedings of theNational Academy of Sciences 98 1935ndash1940

Todd J T (2004) The visual perception of 3D shapeTrends in Cognitive Science 8 115ndash121

Torralba A (2009) How many pixels make an imageVisual Neuroscience 26 123ndash131

Torralba A amp Oliva A (2003) Statistics of naturalimage categories Network Computation in NeuralSystems 14 391ndash412

Voorhees H amp Poggio T (1988) Computing textureboundaries from images Nature 333 364ndash367

Zetzsche C amp Rohrbein F (2001) Nonlinear andextra-classical receptive field properties and thestatistics of natural scenes Network 12 331ndash350

Zhou C amp Mel B W (2008) Cue combination andcolor edge detection in natural scenes Journal ofVision 8(4)4 1ndash25 httpwwwjournalofvisionorgcontent844 doi101167844 [PubMed][Article]

Zhu S C Wu Y N amp Mumford D (1997)Minimax entropy principle and its application totexture modeling Neural Computation 9 1627ndash1660

Ziegaus C amp Lang E W (2003) Independentcomponent analysis of natural and urban imageensembles Neural Information Processing Letters1 89ndash95

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 21

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • e01
  • e02
  • e03
  • e04
  • e05
  • e06
  • e07
  • e08
  • e09
  • e10
  • f04
  • e11
  • e12
  • e13
  • e14
  • e15
  • e16
  • e17
  • f05
  • e18
  • Results
  • f06
  • f07
  • f08
  • f09
  • f10
  • f11
  • f12
  • f13
  • f14
  • f15
  • f16
  • Discussion
  • f17
  • f18
  • Abdou1
  • Acharya1
  • Arsenault1
  • Baddeley1
  • Baker1
  • Balboa1
  • Bell1
  • Bergen1
  • Bishop1
  • Cadieu1
  • Caywood1
  • Collins1
  • Cristianini1
  • Daugman1
  • Doi1
  • Duda1
  • Elder1
  • Field1
  • Fine1
  • Fowlkes1
  • Geisler1
  • Goldberg1
  • Graham1
  • Graham2
  • Guzman1
  • Heeger1
  • Hess1
  • Hoiem1
  • Ing1
  • Jones1
  • Karklin1
  • Karklin2
  • Kersten1
  • Kingdom1
  • Knill1
  • Konishi1
  • Landy1
  • Landy2
  • Landy3
  • Lee1
  • Lee2
  • Malik1
  • Marr1
  • Martin1
  • Martin2
  • McDermott1
  • McGraw1
  • Nakayama1
  • Oliva1
  • Olmos1
  • Olshausen1
  • Ott1
  • Perona1
  • Pollen1
  • Portilla1
  • Rajashekar1
  • Reinagel1
  • Rijsbergen1
  • Rivest1
  • Schwartz1
  • Sigman1
  • Todd1
  • Torralba1
  • Torralba2
  • Voorhees1
  • Zetzsche1
  • Zhou1
  • Zhu1
  • Ziegaus1
Page 16: Detecting natural occlusion boundaries using local cuesDetecting natural occlusion boundaries using local cues Christopher DiMattina # $ Department of Psychology & Neuroscience Concentration,

the hypothetical texture-edge detectors illustrated in Figure 18, which could in principle encode edges defined by second-order cues like differences in orientation or spatial frequency. Interestingly, such units tuned for texture edges have been learned in studies seeking to find a sparse code for patterns of correlation of Gabor filter activities in natural images (Cadieu & Olshausen, 2012; Karklin & Lewicki, 2003, 2009), and it is of interest for future work to examine whether a population of such units can be decoded in order to reliably detect occlusion edges. Nevertheless, this analysis provides a simple existence proof that a multiscale feature set, coupled with a second layer of representation which pools across this set of multiscale features, may be adequate to explain natural occlusion edge detection, and a similar computational scheme has been proposed by various filter-rectify-filter models of texture edge detection (Baker & Mareschal, 2001; Landy & Graham, 2003). Interestingly, training neural networks directly on pixel representations did not yield models which could match human performance (Supplementary Figure S7), demonstrating the importance of the multiscale Gabor decomposition as a preprocessing stage. Of course, we cannot suggest from our simple analysis the detailed form which this integration takes or what the receptive fields of neurons performing this operation may be like, but this is certainly a question of great interest for future work in neurophysiology and computational modeling.
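To make the idea behind Figure 18a concrete, the following minimal numpy sketch implements such a hypothetical unit: it pools full-wave rectified responses of one oriented Gabor filter on either side of a candidate boundary and signals the difference in orientation energy. All function names and parameter values (kernel size, wavelength, bandwidth) are illustrative assumptions, not the filters used in our analysis.

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def gabor_kernel(size, wavelength, theta, sigma):
        """Odd-symmetric Gabor filter at orientation theta (radians)."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
        carrier = np.sin(2.0 * np.pi * xr / wavelength)
        return envelope * carrier

    def orientation_energy(region, theta, size=7, wavelength=4.0, sigma=2.0):
        """Mean full-wave-rectified response of one oriented Gabor filter."""
        k = gabor_kernel(size, wavelength, theta, sigma)
        windows = sliding_window_view(region, k.shape)    # all filter positions
        responses = np.einsum('ijkl,kl->ij', windows, k)  # correlate
        return np.abs(responses).mean()                   # rectify, then pool

    def texture_edge_signal(patch, theta=0.0):
        """Figure-18a-style unit: oriented energy difference across the
        vertical midline of a grayscale, even-width image patch."""
        left, right = np.hsplit(patch, 2)
        return orientation_energy(left, theta) - orientation_energy(right, theta)

A large absolute value of texture_edge_signal indicates that oriented energy at theta differs across the midline, which is the signature such a unit would use to flag a texture-defined boundary; a spatial-frequency version (Figure 18b) would instead compare energy across wavelengths at a fixed orientation.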

Discussion

Summary of results

In this study, we performed psychophysical experiments to better understand what locally available cues the human visual system utilizes when detecting natural occlusion edges. In agreement with previous work (McDermott, 2004), we found that fairly large regions (32 × 32 pixels) were required for subjects to attain reliable performance on our occlusion detection task (Figure 10). Studying the grayscale version of the task in detail, we found that texture and luminance cues were both utilized, although the relative importance of these cues varied with the patch size. In particular, we found that luminance cues were equally useful at all patch sizes tested (Figure 11a). This suggests that luminance differences are a local cue, equally useful for small as well as large patches. In contrast, texture cues were only useful for larger image patches (Figure 11b). This is sensible because texture is defined by information occurring at a variety of scales (Portilla & Simoncelli, 2000; Zhu, Wu, & Mumford, 1997), and making patches larger makes more spatial scales visible.

The importance of textural cues was further underscored by comparing human performance on the task to the performance of a quadratic classifier defined using a set of simple visual features taken from previous work, which did not include texture cues. Although many of the visual features provided independent sources of information (Figure 15) and permitted some discrimination of the stimulus categories, this feature set could not account for human performance in the task, overpredicting performance for small (8 × 8) patches and underpredicting performance for large (32 × 32) patches (Figure 13). The overprediction of human performance for small image patches is particularly interesting, since it demonstrates that subjects are not simply relying on luminance cues but are integrating texture cues as well, which we found to be quite poor for small image patches (Figure 11b).

Figure 17. Schematic illustration of weights learned by the linear classifier. We see that the linear classifier learns strong positive weights (hot colors) with Gabor filters having low spatial frequencies (long lines) located in the center of the image patch.

Figure 18. Hypothetical neural models for detecting occlusion edges defined by textural differences. (a) Hypothetical unit which detects texture edges defined by differences in orientation energy. (b) Hypothetical unit that detects texture edges defined by spatial frequency differences.

One limitation of the visual feature set used to define the quadratic classifier analysis is that the features are defined on the scale of the entire image patch. This is problematic since it is well known that the earliest stages of human visual processing perform a multiscale analysis using receptive fields resembling Gabor filters (Daugman, 1985; Jones & Palmer, 1987; Pollen & Ronner, 1983) and that such a representation forms an efficient code for natural images (Field, 1987; Olshausen & Field, 1996). Furthermore, many computational methods for texture segmentation and edge detection make use of a multiscale Gabor filter decomposition (Baker & Mareschal, 2001; Landy & Graham, 2003). Therefore, we defined a new set of classifiers taking as inputs the rectified responses of a set of Gabor filters applied to the image patches (Figure 5). We found that the multiscale feature set was adequate to explain human performance, but only provided that the classifier making use of this feature set was sufficiently powerful. In particular, a linear classifier was unable to account for human performance, since this classifier only learned to detect luminance edges at low spatial frequencies (Figure 17). However, a neural network classifier having an additional hidden layer of processing was able to learn to reliably detect occlusions nearly as well as human subjects (Figure 16). These results suggest that a multiscale feature decomposition, together with relatively simple computations performed in an additional layer of processing, may be adequate to explain human performance in an important natural vision task.
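This pipeline can be sketched as follows, reusing gabor_kernel from the earlier sketch. The filter bank parameters, the coarse spatial sampling, and the scikit-learn classifiers are stand-ins chosen for illustration; they are assumptions, not our actual training configuration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    def gabor_features(patch, thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4),
                       wavelengths=(4, 8, 16)):
        """Rectified responses of a multiscale, multi-orientation Gabor bank,
        coarsely sampled over space so location information is retained."""
        feats = []
        for wl in wavelengths:                    # spatial scales
            for th in thetas:                     # orientations
                k = gabor_kernel(15, wl, th, sigma=wl / 2.0)
                # circular convolution via FFT; adequate for a sketch
                r = np.fft.ifft2(np.fft.fft2(patch) *
                                 np.fft.fft2(k, patch.shape)).real
                feats.extend(np.abs(r)[::8, ::8].ravel())  # rectify, sample
        return np.asarray(feats)

    def fit_classifiers(patches, labels):
        """Linear read-out vs. a network with one hidden layer, both trained
        on the same rectified Gabor features (1 = occlusion, 0 = surface)."""
        X = np.stack([gabor_features(p) for p in patches])
        linear = LogisticRegression(max_iter=2000).fit(X, labels)
        hidden = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, labels)
        return linear, hidden

On this kind of representation, the contrast of interest is whether the hidden layer, which can combine filter outputs nonlinearly across location and spatial scale, outperforms the purely linear read-out.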

Comparison with previous work

Relatively few studies in visual psychophysics have directly considered the problem of what cues are useful for detecting natural occlusion edges, with the closest study to our own focusing on the more specialized problem of detecting T-junctions in natural grayscale images (McDermott, 2004). That study found that rather large image regions were needed for subjects to attain reliable performance for junction detection, consistent with our results for occlusion detection. In this work, it was found that many T-junctions easily visible on the scale of the whole image were not detectable using local cues, suggesting that they may only be detectable by mechanisms making use of feedback of global scene information (Lee & Mumford, 2003). Our task is somewhat complementary to that studied by McDermott (2004), since we exclude T-junctions and only consider nonjunction occlusions. This gave us a much larger stimulus set and permitted quantitative comparisons of human task performance with a variety of models based on features measured from occlusions and non-occlusions. However, we arrive at a very similar conclusion for our task: a substantial amount of spatial context is needed for accurate occlusion detection (Figure 10), and many occlusions cannot be reliably detected using local cues alone.

A related set of studies has considered the problem of judging when two image patches belong to the same or different surfaces, comparing the performance of human subjects with the predictions of Bayesian models defined using sets of measured visual features similar to those used to define our quadratic classifiers. In contrast to one such surface segmentation study (Ing et al., 2010), we found texture cues to be very important for our task. One possible reason for the discrepancy between our results and those of Ing et al. (2010) is that their study only considered the problem of discriminating a very limited class of images, in particular the surfaces of different leaves. Since the leaves typically belonged to the same kind of plant, they were fairly smooth and quite similar in terms of visual texture. Therefore, for this particular choice of stimuli, texture is not a particularly informative cue, since leaves of the same kind tend to have very similar textures. For our task, however, where very different surfaces were juxtaposed to define occlusions, we found removal of texture cues to be highly detrimental to subject performance (Figure 11).

Previous work on edge localization reveals that luminance, texture, and color cues are all combined to determine the location of edges (McGraw et al., 2003; Rivest & Cavanagh, 1996), and similarly we find that in our task these multiple sources of information are combined as well. However, we did not consider quantitative models of how disparate cues may be combined optimally, as has been done in a variety of perceptual tasks (Knill & Saunders, 2003; Landy & Kojima, 2001). One challenge with studying cue combination in our task is that one of the major cues used by the subjects is texture difference, and unlike luminance or color, natural textures require very high-dimensional models to fully specify (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000; Zhu et al., 1997). The problem of optimal cue combination for natural occlusion detection has mostly been considered in the computer vision literature (Konishi et al., 2003; Zhou & Mel, 2008), but models of cue combination for occlusion detection have not been directly compared with human data in a controlled psychophysical experiment.
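For concreteness, the standard reliability-weighted combination rule studied in the cited cue-combination literature (Knill & Saunders, 2003; Landy & Kojima, 2001) takes the form

    \hat{s} = \sum_i w_i \, \hat{s}_i,
    \qquad
    w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2},

where \hat{s}_i is the estimate provided by cue i (e.g., of edge location) and \sigma_i^2 is its variance. The difficulty noted above is that texture, unlike luminance or color, does not reduce to a scalar estimate with a simple variance, so it is unclear what \hat{s}_i should be for the texture cue.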


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these filters are then processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
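As a complement to the trained classifier, the FRF cascade itself can be written down directly. The following sketch (again reusing gabor_kernel from the earlier sketch) makes the two filtering stages and the intervening rectification explicit; the particular scales and the maximum-response decision variable are our illustrative assumptions, not a published model.

    import numpy as np

    def frf_response(image, theta1, theta2, wl1=4.0, wl2=32.0):
        """Filter-Rectify-Filter cascade: a fine first-stage Gabor, full-wave
        rectification, then a much coarser second-stage Gabor whose output
        serves as a decision variable for 'texture edge present'. The image
        is assumed larger than the second-stage kernel (63 x 63 here)."""
        f1 = gabor_kernel(15, wl1, theta1, sigma=wl1 / 2.0)  # first-stage filter
        f2 = gabor_kernel(63, wl2, theta2, sigma=wl2 / 2.0)  # second-stage filter

        def conv(img, k):  # circular convolution via FFT (sketch-level accuracy)
            return np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(k, img.shape)).real

        stage1 = np.abs(conv(image, f1))      # rectified texture-energy map
        stage2 = conv(stage1, f2)             # second-order (coarse) filtering
        return np.max(np.abs(stage2))         # simple scalar decision variable

In the three-layer network of Figure 5, each hidden unit plays the role of one such second-stage filter applied to the rectified first-stage responses, with the output unit implementing the decision.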

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities, which complicates feature measurements (Zhou & Mel, 2008) and causes artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although finding region boundaries does, de facto, end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database only represents a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample, which is important since different categories of natural images have different statistics (Torralba & Oliva, 2003) and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • e01
  • e02
  • e03
  • e04
  • e05
  • e06
  • e07
  • e08
  • e09
  • e10
  • f04
  • e11
  • e12
  • e13
  • e14
  • e15
  • e16
  • e17
  • f05
  • e18
  • Results
  • f06
  • f07
  • f08
  • f09
  • f10
  • f11
  • f12
  • f13
  • f14
  • f15
  • f16
  • Discussion
  • f17
  • f18
  • Abdou1
  • Acharya1
  • Arsenault1
  • Baddeley1
  • Baker1
  • Balboa1
  • Bell1
  • Bergen1
  • Bishop1
  • Cadieu1
  • Caywood1
  • Collins1
  • Cristianini1
  • Daugman1
  • Doi1
  • Duda1
  • Elder1
  • Field1
  • Fine1
  • Fowlkes1
  • Geisler1
  • Goldberg1
  • Graham1
  • Graham2
  • Guzman1
  • Heeger1
  • Hess1
  • Hoiem1
  • Ing1
  • Jones1
  • Karklin1
  • Karklin2
  • Kersten1
  • Kingdom1
  • Knill1
  • Konishi1
  • Landy1
  • Landy2
  • Landy3
  • Lee1
  • Lee2
  • Malik1
  • Marr1
  • Martin1
  • Martin2
  • McDermott1
  • McGraw1
  • Nakayama1
  • Oliva1
  • Olmos1
  • Olshausen1
  • Ott1
  • Perona1
  • Pollen1
  • Portilla1
  • Rajashekar1
  • Reinagel1
  • Rijsbergen1
  • Rivest1
  • Schwartz1
  • Sigman1
  • Todd1
  • Torralba1
  • Torralba2
  • Voorhees1
  • Zetzsche1
  • Zhou1
  • Zhu1
  • Ziegaus1
Page 17: Detecting natural occlusion boundaries using local cuesDetecting natural occlusion boundaries using local cues Christopher DiMattina # $ Department of Psychology & Neuroscience Concentration,

performance for large (32 middot 32) patches (Figure 13)The overprediction of human performance for smallimage patches is particularly interesting since itdemonstrates that subjects are not simply relying onluminance cues but are integrating texture cues as wellwhich we found to be quite poor for small imagepatches (Figure 11b)

One limitation of the visual feature set used to definethe quadratic classifier analysis is that the features aredefined on the scale of the entire image patch This isproblematic since it is well-known that the earlieststages of human visual processing perform a multiscaleanalysis using receptive fields resembling Gabor filters(Daugman 1985 Jones amp Palmer 1987 Pollen ampRonner 1983) and that such a representation forms anefficient code for natural images (Field 1987 Olshau-sen amp Field 1996) Furthermore many computationalmethods for texture segmentation and edge detectionuse of a multiscale Gabor filter decomposition (Bakeramp Mareschal 2001 Landy amp Graham 2003) There-fore we defined a new set of classifiers taking as inputsthe rectified responses of a set of Gabor filters appliedto the image patches (Figure 5) We found that themultiscale feature set was adequate to explain humanperformance but only provided that the classifiermaking use of this feature set was sufficiently powerfulIn particular a linear classifier was unable to accountfor human performance since this classifier only learnedto detect luminance-edges at low spatial frequencies(Figure 17) However a neural network classifierhaving an additional hidden layer of processing wasable to learn to reliably detect occlusions nearly as wellas human subjects (Figure 16) These results suggest amultiscale feature decomposition together with rela-tively simple computations performed in an additionallayer of processing may be adequate to explain humanperformance in an important natural vision task

Comparison with previous work

Relatively few studies in visual psychophysics havedirectly considered the problem of what cues are usefulfor detecting natural occlusion edges with the closeststudy to our own focusing on the more specializedproblem of detecting T-junctions in natural grayscaleimages (McDermott 2004) This study found thatrather large image regions were needed for subjects toattain reliable performance for junction detectionconsistent with our results for occlusion detection Inthis work it was found that many T-junctions easilyvisible on the scale of the whole image were notdetectable using local cues suggesting that they mayonly be detectable by mechanisms making use offeedback of global scene information (Lee amp Mumford2003) Our task is somewhat complementary to that

studied by McDermott (2004) since we exclude T-junctions and only consider nonjunction occlusionsThis gave us a much larger stimulus set and permittedquantitative comparisons of human task performancewith a variety of models based on features measuredfrom occlusions and non-occlusions However wearrive at the very similar conclusion for our task thata substantial amount of spatial context is needed foraccurate occlusion detection (Figure 10) and thatmany occlusions cannot be reliably detected using localcues alone

A related set of studies has considered the problemof judging when two image patches belong to the sameor different surfaces and comparing the performanceof human subjects with the predictions of Bayesianmodels defined using sets of measured visual featuressimilar to those used to define our quadratic classifiersIn contrast to one such surface segmentation study (Inget al 2010) we found texture cues to be very importantfor our task One possible reason for the discrepancybetween our results and those of Ing et al (2010) is thattheir study only considered the problem of discrimi-nating a very limited class of images in particular thesurfaces of different leaves Since the leaves typicallybelonged to the same kind of plant they were fairlysmooth and quite similar in terms of visual textureTherefore for this particular choice of stimuli texture isnot a particularly informative cue since leaves of thesame kind tend to have very similar textures For ourtask however where very different surfaces werejuxtaposed to define occlusions we found removal oftexture cues to be highly detrimental to subjectperformance (Figure 11)

Previous work on edge localization reveals thatluminance texture and color cues are all combined todetermine the location of edges (McGraw et al 2003Rivest amp Cavanagh 1996) and similarly we find that inour task these multiple sources of information arecombined as well However in our task we did notconsider quantitative models for how disparate cuesmay be combined optimally as has been done in avariety of perceptual tasks (Knill amp Saunders 2003Landy amp Kojima 2001) One challenge with studyingcue combination in our task is that one of the majorcues used by the subjects are texture differences andunlike luminance or color natural textures require veryhigh-dimensional models to fully specify (Heeger ampBergen 1995 Portilla amp Simoncelli 2000 Zhu et al1997) The problem of optimal cue combination fornatural occlusion detection has been mostly consideredin the computer vision literature (Konishi et al 2003Zhou amp Mel 2008) but models of cue combination forocclusion detection have not been directly comparedwith human data in a controlled psychophysicalexperiment

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 17

Our computational analysis demonstrates the neces-sity of multiscale image analysis for modeling humanperformance in our occlusion detection task Further-more our work and that of others demonstrates theimportance of textural cues for occlusion detection as awide variety of computational models which explainhuman performance at detecting texture-defined edgesutilize a multiscale Gabor decomposition (Landy ampGraham 2003) One broad class of computationalmodels of texture segmentation and texture edgedetection can be described lsquolsquoFilter-Rectify-Filterrsquorsquo(FRF) models sometimes also referred to as lsquolsquoback-pocketrsquorsquo models where the rectified outputs of a Gaborfilter bank are then subject to a second stage of filteringand the outputs of these filters are then processedfurther to obtain a decision about whether or not atexture edge is present (Baker amp Mareschal 2001Bergen amp Landy 1991 Graham amp Sutter 1998 Landy1991 Landy amp Graham 2003) Our simple three-layerneural network model (Figure 5) constitutes a verysimple model of the FRF type with the hidden unitsperforming the role of the second-order filters Theability of our simple FRF model to solve this task aswell as human subjects provides additional evidence infavor of the FRF computational framework

Resources for future studies

Our database of hand-labeled occlusions (httpcaseedu) avoids many limitations of previously collecteddatasets of hand-labeled edges Perhaps the greatestimprovement is that unlike many related datasets(Collins Wright amp Greenway 1999 Martin et al2001) we make use of uncompressed calibrated colorimages taken from a larger image set used in previouspsychophysical research (Arsenault et al 2011 King-dom et al 2007 Olmos amp Kingdom 2004) JPEGcompression creates artificial statistical regularitieswhich complicates feature measurements (Zhou ampMel 2008) and causes artifact when training statisticalimage models (Caywood Willmore amp Tollhurst 2004)

Another improvement of this database over previouswork is that we explicitly label occlusion edges ratherthan inferring them indirectly from labels of imageregions or segments Although de facto finding regionboundaries does end up mainly labeling occlusions thenotion of an image region is much vaguer than that ofan occlusion However due to the specificity of ourtask our database most likely neglects quite a fewedges which are not occlusions since many edges aredefined by properties unrelated to ordinal depth likeshadows specularities and material properties (Ker-sten 2000) Therefore our database only represents asubset (albeit an important one) of possible naturaledges

Some databases only consider a restricted class ofimages like close-up foliage (Ing et al 2010) or lackclear figural objects (Doi Inui Lee Wachtler ampSejnowski 2003) Our database contains a wide varietyof images at various spatial scales containing a largenumber of occlusions We do not claim that our imagesare fully representative of natural images but representa modest effort to obtain a diverse sample which isimportant since different categories of natural imageshave different statistics (Torralba amp Oliva 2003) andtraining statistical image models on different imagedatabases often yields different linear filter properties(Ziegaus amp Lang 2003)

Acknowledgments

This work was supported by NIH grant 0705677 toMSL We thank the five Case Western ReserveUniversity undergraduate research assistants whohelped to collect the databases Parts of this work havebeen presented previously in poster form at the 2011Society for Neuroscience Meeting

Commercial relationships noneCorresponding author Christopher DiMattinaEmail dimattinagrinnelleduAddress Department of Psychology amp NeuroscienceConcentration Grinnell College Grinnell IA USA

References

Abdou I amp Pratt W (1979) Quantitative design andevaluation of enhancementthresholding edge de-tectors Proceedings of the IEEE 67 753ndash763

Acharya T amp Ray A K (2005) Image processingPrinciples and applications Hoboken NJ Wiley-Interscience

Arsenault E Yoonessi A amp Baker C L (2011)Higher-order texture statistics impair contrastboundary segmentation Journal of Vision 11(10)14 1ndash15 httpwwwjournalofvisionorgcontent111014 doi101167111014 [PubMed] [Article]

Baddeley R J amp Tatler B W (2006) High frequencyedges (but not contrast) predict where we fixate ABayesian system identification analysis VisionResearch 46 2824ndash2833

Baker C L amp Mareschal I (2001) Processing ofsecond-order stimuli in the visual cortex Progressin Brain Research 134 1ndash21

Balboa R M amp Grzywacz N M (2000) Occlusions

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 18

and their relationship with the distribution ofcontrasts in natural images Vision Research 402661ndash2669

Bell A J amp Sejnowski T J (1997) The lsquoindependentcomponentsrsquo of natural scenes are edge filtersVision Research 37 3327ndash3338

Bergen J R amp Landy M S (1991) Computationalmodeling of visual texture segregation In M SLandy amp J A Movshon (Eds) Computationalmodels of visual processing pp 253ndash271 Cam-bridge MA MIT Press

Bishop C M (2006) Pattern recognition and machinelearning New York Springer

Cadieu C amp Olshausen B A (2012) Learningintermediate-level representations of form andmotion from natural movies Neural Computation24 827ndash866

Caywood M S Willmore B amp Tolhurst D J(2004) Independent components of color naturalscenes resemble v1 neurons in their spatial andcolor tuning Journal of Neurophysiology 91 2859ndash2873

Collins D Wright W A amp Greenway P (1999) TheSowerby image database In Image processing andits applications (Seventh International Conference)(pp 306ndash310) London Institution of ElectricalEngineers

Cristianini N amp Shawe-Taylor J (2000) An intro-duction to support vector machines and other kernelbased learning methods Cambridge UK Cam-bridge University Press

Daugman J G (1985) Uncertainty relation forresolution in space spatial frequency and orienta-tion optimized by two-dimensional visual corticalfilters Journal of the Optical Society of America A2 1160ndash1169

Doi E Inui T Lee T W Wachtler T ampSejnowski T J (2003) Spatiochromatic receptivefield properties derived from information-theoreticanalyses of cone mosaic responses to naturalscenes Neural Computation 15 397ndash417

Duda R O Hart P E amp Stork D G (2000)Pattern classification (2nd ed) New York Wiley-Interscience

Elder J H amp Zucker S W (1998) Local scalecontrol for edge detection and blur estimationIEEE Transactions on Pattern Analysis and Ma-chine Intelligence 20 699ndash716

Field D J (1987) Relations between the statistics ofnatural images and the response properties of

cortical cells Journal of the Optical Society ofAmerica A 4 2379ndash2394

Fine I MacLeod D I A amp Boynton G M (2003)Surface segmentation based on the luminance andcolor statistics of natural scenes Journal of theOptical Society of America A 20 1283ndash1291

Fowlkes C C Martin D R amp Malik J (2007) Localfigure-ground cues are valid for natural imagesJournal of Vision 7(8)2 1ndash9 httpwwwjournalofvisionorgcontent782 doi101167782[PubMed] [Article]

Geisler W S Perry J S Super B J amp Gallogly DP (2001) Edge co-occurrence in natural imagespredicts contour grouping performance VisionResearch 41 711ndash724

Goldberg A V amp Kennedy R (1995) An efficientcost scaling algorithm for the assignment problemMathematical Programming 71 153ndash177

Graham N (1991) Complex channels early localnonlinearities and normalization in texture segre-gation In M S Landy amp J A Movshon (Eds)Computational models of visual processing pp 274ndash290 Cambridge MA MIT Press

Graham N amp Sutter A (1998) Spatial summation insimple (Fourier) and complex (non-Fourier) texturechannels Vision Research 38 231ndash257

Guzman A (1969) Decomposition of a visual sceneinto three-dimensional bodies In A Griselli (Ed)Automatic interpretation and classification of imag-es pp 291ndash304 New York Academic Press

Heeger D J amp Bergen J R (1995) Pyramid-basedtexture analysissynthesis In Proceedings of ACMSIGGRAPH 229ndash238

Hess R F amp Field D J (1999) Integration ofcontours New insights Trends in Cognitive Scienc-es 3 480ndash486

Hoiem D Efros A A amp Hebert M (2011)Recovering occlusion boundaries from an imageInternational Journal of Computer Vision 91(3)328ndash346

Ing A D Wilson J A amp Geisler W S (2010)Region grouping in natural foliage scenes Imagestatistics and human performance Journal ofVision 10(4)10 1ndash19 httpwwwjournalofvisionorgcontent10410 doi 10116710410[PubMed] [Article]

Jones J P amp Palmer L A (1987) The two-dimensional spatial structure of simple receptivefields in cat striate cortex Journal of Neurophysi-ology 58 1187ndash1211

Karklin Y amp Lewicki M S (2003) Learning higher-

Journal of Vision (2012) 12(13)15 1ndash21 DiMattina Fox amp Lewicki 19

order structures in natural images Network 14483ndash499

Karklin Y amp Lewicki M S (2009) Emergence ofcomplex cell properties by learning to generalize innatural scenes Nature 457 83ndash86

Kersten D (2000) High-level vision as statisticalinference In M S Gazzaniga (Ed) The newcognitive neurosciences pp 353ndash363 CambridgeMA MIT Press

Kingdom F A A Field D J amp Olmos A (2007)Does spatial invariance result from insensitivity tochange Journal of Vision 7(14)11 1ndash13 httpwwwjournalofvisionorgcontent71411 doi10116771411 [PubMed] [Article]

Knill D C amp Saunders J A (2003) Do humansoptimally integrate stereo and texture informationfor judgments of slant Vision Research 43 2539ndash2558

Konishi S Yuille A L Coughlin J M amp Zhu S C(2003) Statistical edge detection Learning andevaluating edge cues IEEE Transactions on PatternAnalysis and Machine Intelligence 25 57ndash74

Landy M S (1991) Texture segregation and orienta-tion gradient Vision Research 31 679ndash691

Landy M S amp Graham N (2003) Visual perceptionof texture In L M Chalupa amp J S Werner (Eds)The visual neurosciences pp 1106ndash1118 Cam-bridge MA MIT Press

Landy M S amp Kojima H (2001) Ideal cuecombination for localizing texture-defined edgesJournal of the Optical Society of America A 182307ndash2320


Our computational analysis demonstrates the necessity of multiscale image analysis for modeling human performance in our occlusion detection task. Furthermore, our work and that of others demonstrates the importance of textural cues for occlusion detection, as a wide variety of computational models which explain human performance at detecting texture-defined edges utilize a multiscale Gabor decomposition (Landy & Graham, 2003). One broad class of computational models of texture segmentation and texture edge detection can be described as "Filter-Rectify-Filter" (FRF) models, sometimes also referred to as "back-pocket" models, where the rectified outputs of a Gabor filter bank are subject to a second stage of filtering, and the outputs of these second-stage filters are processed further to obtain a decision about whether or not a texture edge is present (Baker & Mareschal, 2001; Bergen & Landy, 1991; Graham & Sutter, 1998; Landy, 1991; Landy & Graham, 2003). Our simple three-layer neural network model (Figure 5) constitutes a very simple model of the FRF type, with the hidden units performing the role of the second-order filters. The ability of our simple FRF model to solve this task as well as human subjects provides additional evidence in favor of the FRF computational framework.
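To make the FRF cascade concrete, the sketch below implements its generic form in Python: a multiscale, multi-orientation Gabor filter bank, full-wave rectification, and a coarse second-stage (here Gaussian) filter whose pooled outputs could feed any decision stage. This is a minimal sketch of the general framework under assumed parameter values (kernel sizes, scales, pooling width), not the specific network reported here; all function names are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import gaussian_filter

def gabor_kernel(size, wavelength, theta, sigma):
    """Odd-symmetric Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)              # carrier axis
    env = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))  # Gaussian envelope
    return env * np.sin(2.0 * np.pi * xr / wavelength)

def frf_features(patch, n_orient=4, wavelengths=(4.0, 8.0, 16.0), pool_sigma=8.0):
    """Generic Filter-Rectify-Filter cascade:
    (1) filter with a multiscale Gabor bank, (2) full-wave rectify,
    (3) filter the rectified maps again at a coarser spatial scale."""
    maps = []
    for wl in wavelengths:                       # spatial scales
        for k in range(n_orient):                # orientations
            theta = k * np.pi / n_orient
            size = int(4 * wl) | 1               # odd kernel, ~4 wavelengths wide
            g = gabor_kernel(size, wl, theta, sigma=wl / 2.0)
            r = fftconvolve(patch, g, mode='same')        # first-stage filter
            r = np.abs(r)                                 # rectification
            maps.append(gaussian_filter(r, pool_sigma))   # second-stage filter
    return np.stack(maps)  # one pooled map per (scale, orientation) channel

# A decision stage (e.g., a linear readout, or the hidden layer of a
# three-layer network) would operate on frf_features(patch).ravel().
patch = np.random.rand(32, 32)
features = frf_features(patch)
```

In the three-layer network of Figure 5, the roles of the second-stage filter and the decision stage are played by learned hidden and output units rather than the fixed Gaussian pooling used in this sketch.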

Resources for future studies

Our database of hand-labeled occlusions (http://case.edu) avoids many limitations of previously collected datasets of hand-labeled edges. Perhaps the greatest improvement is that, unlike many related datasets (Collins, Wright, & Greenway, 1999; Martin et al., 2001), we make use of uncompressed, calibrated color images taken from a larger image set used in previous psychophysical research (Arsenault et al., 2011; Kingdom et al., 2007; Olmos & Kingdom, 2004). JPEG compression creates artificial statistical regularities, which complicates feature measurements (Zhou & Mel, 2008) and causes artifacts when training statistical image models (Caywood, Willmore, & Tolhurst, 2004).
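The block-coding regularities that JPEG introduces are easy to expose directly. The following sketch (an assumed illustration, not an analysis from this paper) compresses a noise patch in memory and compares horizontal gradient energy by column; the 8 × 8 block coding leaves periodic structure at block boundaries that the uncompressed patch lacks.

```python
import io
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
img = (rng.random((256, 256)) * 255).astype(np.uint8)  # texture-like noise patch

# Round-trip through JPEG compression entirely in memory.
buf = io.BytesIO()
Image.fromarray(img).save(buf, format='JPEG', quality=75)
buf.seek(0)
jpeg = np.asarray(Image.open(buf), dtype=np.uint8)

def grad_energy_by_column(a):
    """Mean absolute horizontal gradient at each column transition."""
    return np.abs(np.diff(a.astype(float), axis=1)).mean(axis=0)

raw_prof = grad_energy_by_column(img)
jpg_prof = grad_energy_by_column(jpeg)
# In jpg_prof, transitions at multiples of 8 deviate systematically from
# their neighbors -- an artificial regularity absent from raw_prof.
```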

Another improvement of this database over previous work is that we explicitly label occlusion edges rather than inferring them indirectly from labels of image regions or segments. Although finding region boundaries does, de facto, end up mainly labeling occlusions, the notion of an image region is much vaguer than that of an occlusion. However, due to the specificity of our task, our database most likely neglects quite a few edges which are not occlusions, since many edges are defined by properties unrelated to ordinal depth, like shadows, specularities, and material properties (Kersten, 2000). Therefore, our database represents only a subset (albeit an important one) of possible natural edges.

Some databases only consider a restricted class of images, like close-up foliage (Ing et al., 2010), or lack clear figural objects (Doi, Inui, Lee, Wachtler, & Sejnowski, 2003). Our database contains a wide variety of images at various spatial scales containing a large number of occlusions. We do not claim that our images are fully representative of natural images, but they represent a modest effort to obtain a diverse sample. Such diversity is important, since different categories of natural images have different statistics (Torralba & Oliva, 2003), and training statistical image models on different image databases often yields different linear filter properties (Ziegaus & Lang, 2003).

Acknowledgments

This work was supported by NIH grant 0705677 to M.S.L. We thank the five Case Western Reserve University undergraduate research assistants who helped to collect the databases. Parts of this work have been presented previously in poster form at the 2011 Society for Neuroscience Meeting.

Commercial relationships: none.
Corresponding author: Christopher DiMattina.
Email: dimattina@grinnell.edu.
Address: Department of Psychology & Neuroscience Concentration, Grinnell College, Grinnell, IA, USA.

References

Abdou, I., & Pratt, W. (1979). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, 67, 753–763.

Acharya, T., & Ray, A. K. (2005). Image processing: Principles and applications. Hoboken, NJ: Wiley-Interscience.

Arsenault, E., Yoonessi, A., & Baker, C. L. (2011). Higher-order texture statistics impair contrast boundary segmentation. Journal of Vision, 11(10):14, 1–15, http://www.journalofvision.org/content/11/10/14, doi:10.1167/11.10.14. [PubMed] [Article]

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.

Baker, C. L., & Mareschal, I. (2001). Processing of second-order stimuli in the visual cortex. Progress in Brain Research, 134, 1–21.

Balboa, R. M., & Grzywacz, N. M. (2000). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Research, 40, 2661–2669.

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bergen, J. R., & Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 253–271). Cambridge, MA: MIT Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Cadieu, C., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.

Caywood, M. S., Willmore, B., & Tolhurst, D. J. (2004). Independent components of color natural scenes resemble V1 neurons in their spatial and color tuning. Journal of Neurophysiology, 91, 2859–2873.

Collins, D., Wright, W. A., & Greenway, P. (1999). The Sowerby image database. In Image processing and its applications (Seventh International Conference) (pp. 306–310). London: Institution of Electrical Engineers.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel based learning methods. Cambridge, UK: Cambridge University Press.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Doi, E., Inui, T., Lee, T. W., Wachtler, T., & Sejnowski, T. J. (2003). Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15, 397–417.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: Wiley-Interscience.

Elder, J. H., & Zucker, S. W. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 699–716.

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.

Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, 20, 1283–1291.

Fowlkes, C. C., Martin, D. R., & Malik, J. (2007). Local figure-ground cues are valid for natural images. Journal of Vision, 7(8):2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article]

Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724.

Goldberg, A. V., & Kennedy, R. (1995). An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71, 153–177.

Graham, N. (1991). Complex channels, early local nonlinearities, and normalization in texture segregation. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 274–290). Cambridge, MA: MIT Press.

Graham, N., & Sutter, A. (1998). Spatial summation in simple (Fourier) and complex (non-Fourier) texture channels. Vision Research, 38, 231–257.

Guzman, A. (1969). Decomposition of a visual scene into three-dimensional bodies. In A. Griselli (Ed.), Automatic interpretation and classification of images (pp. 291–304). New York: Academic Press.

Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of ACM SIGGRAPH (pp. 229–238).

Hess, R. F., & Field, D. J. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3, 480–486.

Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

Ing, A. D., Wilson, J. A., & Geisler, W. S. (2010). Region grouping in natural foliage scenes: Image statistics and human performance. Journal of Vision, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10. [PubMed] [Article]

Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1187–1211.

Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.

Karklin, Y., & Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.

Kersten, D. (2000). High-level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.

Kingdom, F. A. A., Field, D. J., & Olmos, A. (2007). Does spatial invariance result from insensitivity to change? Journal of Vision, 7(14):11, 1–13, http://www.journalofvision.org/content/7/14/11, doi:10.1167/7.14.11. [PubMed] [Article]

Knill, D. C., & Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of slant? Vision Research, 43, 2539–2558.

Konishi, S., Yuille, A. L., Coughlin, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.

Landy, M. S. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.

Landy, M. S., & Graham, N. (2003). Visual perception of texture. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1106–1118). Cambridge, MA: MIT Press.

Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, 18, 2307–2320.

Lee, H.-C., & Choe, Y. (2003). Detecting salient contours using orientation energy distribution. In International Joint Conference on Neural Networks (pp. 206–211). IEEE Conference Publications.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7, 923–932.

Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.

Martin, D., Fowlkes, C. C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2, 416–423.

Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 530–549.

McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127.

McGraw, P. V., Whitaker, D., Badcock, D. R., & Skillen, J. (2003). Neither here nor there: Localizing conflicting visual attributes. Journal of Vision, 3(4):2, 265–273, http://www.journalofvision.org/content/3/4/2, doi:10.1167/3.4.2. [PubMed] [Article]

Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In D. N. Osherson, S. M. Kosslyn, & L. R. Gleitman (Eds.), An invitation to cognitive science: Visual cognition (pp. 1–70). Cambridge, MA: MIT Press.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Olmos, A., & Kingdom, F. A. A. (2004). McGill calibrated colour image database. Internet site: http://tabby.vision.mcgill.ca (Accessed November 20, 2012).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Ott, R. L. (1993). An introduction to statistical methods and data analysis (4th ed.). Belmont, CA: Wadsworth, Inc.

Perona, P. (1992). Steerable-scalable kernels for edge detection and junction analysis. Image and Vision Computing, 10, 663–672.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13, 907–916.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Rajashekar, U., van der Linde, I., Bovik, A. C., & Cormack, L. K. (2007). Foveated analysis of image features at fixations. Vision Research, 47, 3160–3172.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.

Rijsbergen, C. V. (1979). Information retrieval. London: Butterworths.

Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.

Sigman, M., Cecchi, G. A., Gilbert, C. D., & Magnasco, M. O. (2001). On a common circle: Natural scenes and gestalt rules. Proceedings of the National Academy of Sciences, 98, 1935–1940.

Todd, J. T. (2004). The visual perception of 3D shape. Trends in Cognitive Sciences, 8, 115–121.

Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, 26, 123–131.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412.

Voorhees, H., & Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.

Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.

Zhou, C., & Mel, B. W. (2008). Cue combination and color edge detection in natural scenes. Journal of Vision, 8(4):4, 1–25, http://www.journalofvision.org/content/8/4/4, doi:10.1167/8.4.4. [PubMed] [Article]

Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Ziegaus, C., & Lang, E. W. (2003). Independent component analysis of natural and urban image ensembles. Neural Information Processing Letters, 1, 89–95.
