
Beyond Pixels: A Comprehensive Survey from Bottom-up to Semantic Image Segmentation and Cosegmentation

Hongyuan Zhu, Fanman Meng, Jianfei Cai and Shijian Lu

Abstract—Image segmentation refers to the process of dividing an image into non-overlapping meaningful regions according to human perception, and it has been a classic topic since the early days of computer vision. A lot of research has been conducted and has resulted in many applications. However, while many segmentation algorithms exist, only a few sparse and outdated summarizations are available; an overview of the recent achievements and issues is lacking. We aim to provide a comprehensive review of the recent progress in this field. Covering around 180 publications, we give an overview of broad segmentation topics, including not only the classic bottom-up approaches but also the recent development in superpixels, interactive methods, object proposals, semantic image parsing and image cosegmentation. In addition, we also review the existing influential datasets and evaluation metrics. Finally, we suggest some design flavors and research directions for future research in image segmentation.

Index Terms—Image segmentation, superpixel, interactive image segmentation, object proposal, semantic image parsing, image cosegmentation.

I. INTRODUCTION

Humans can quickly localize many patterns and automatically group them into meaningful parts. Perceptual grouping refers to the human visual ability to abstract high-level information from low-level image primitives without any specific knowledge of the image content. Discovering the working mechanisms behind this ability has been studied by cognitive scientists since the 1920s [1]. Early gestalt psychologists observed that the human visual system tends to perceive the configurational whole, with rules governing the psychological grouping. Gestalt psychologists proposed a hierarchical grouping from low-level features to high-level structures, which embodies the concept of grouping by proximity, similarity, continuation, closure and symmetry. The highly compact representation of images produced by perceptual grouping can greatly facilitate subsequent indexing, retrieval and processing.

With the development of modern computers, computer scientists have ambitiously sought to equip the computer with this perceptual grouping ability, given its many promising applications. This lays the foundation for image segmentation, which has been a classic topic since the early years of computer vision.

H. Zhu and S. Lu are with the Institute for Infocomm Research, A*STAR, Singapore, e-mail: {zhuh,slu}@i2r.a-star.edu.sg. F. Meng is with the School of Electronic Engineering, Univ. of Electronic Science and Technology of China, e-mail: [email protected]. J. Cai is with the School of Computer Engineering, Nanyang Technological University, Singapore, e-mail: [email protected].

Image segmentation, as a basic operation in computer vision, refers to the process of dividing a natural image into K non-overlapping meaningful entities (e.g., objects or parts). The segmentation operation has proved quite useful in many image processing and computer vision tasks. For example, image segmentation has been applied in image annotation [2] by decomposing an image into several blobs corresponding to objects. Superpixel segmentation, which transforms millions of pixels into hundreds or thousands of homogeneous regions [3], [4], has been applied to reduce model complexity and improve the speed and accuracy of some complex vision tasks, such as estimating dense correspondence fields [5], scene parsing [6] and body model estimation [7]. Segmented regions have also been used to facilitate object recognition [8], [9], [10], where they provide better localization than sliding windows. The techniques developed in image segmentation, such as Mean Shift [4] and Normalized Cut [11], have also been widely used in other areas such as data clustering and density estimation.

One would expect a segmentation algorithm to decompose an image into "objects" or semantic/meaningful parts. However, what makes an "object" or a "meaningful" part can be ambiguous. An "object" can refer to a "thing" (a cup, a cow, etc.), a kind of texture (wood, rock) or even "stuff" (a building or a forest). Sometimes, an "object" can also be part of other "objects". The lack of a clear definition of "object" makes bottom-up segmentation a challenging and ill-posed problem. Fig. 1 gives an example, where different human subjects interpret objects in different ways. In this sense, what makes a 'good' segmentation needs to be properly defined.

Research in human perception has provided some useful guidelines for developing segmentation algorithms. For example, a cognition study [13] shows that human vision places part boundaries at negative minima of curvature, and that part salience depends on three factors: the relative size, the boundary strength and the degree of protrusion. Gestalt theory and other psychological studies have also developed various principles reflecting human perception, including: (1) humans tend to group elements which have similarities in color, shape or other properties; (2) humans favor linking contours whenever the elements of the pattern establish an implied direction.

Another challenge which makes "object" segmentation difficult is how to effectively represent the "object". When a human perceives an image, its elements are perceived in the brain as a whole, but most images in computers are currently represented by low-level features such as color, texture, curvature, convexity, etc. Such low-level features reflect local properties, which makes it difficult to capture global object information. Fully supervised methods can learn higher-level and global cues, but they can only handle a limited number of object classes and require per-class labeling.

Fig. 1. An image from the Berkeley segmentation dataset [12] with hand labels from different human subjects, which demonstrates the variety of human perception.

A lot of research has been conducted on image segmentation. Unsupervised segmentation, a classic topic in computer vision, has been studied since the 1970s. Early techniques focus on local region merging and splitting [14], [15]. Recent techniques, on the other hand, seek to optimize some global criteria [4], [16], [11], [17], [18], [19], [20]. Interactive segmentation methods [21], [22], [23], which utilize user input, have been applied in some commercial products such as Microsoft Office and Adobe Photoshop. The substantial development of image classification [24], object detection [25], superpixel segmentation [3] and 3D scene recovery [26] in the past few years has boosted the research in supervised scene parsing [27], [28], [29], [30], [31], [32]. With the emergence of large-scale image databases, such as ImageNet [33] and personal photo streams on Flickr, cosegmentation methods [34], [35], [36], [37], [38], [39], [40], [41], which can extract recurring objects from a set of images, have attracted increasing attention in recent years.

Although image segmentation as a community has been evolving for a long time, challenges ranging from feature representation to model design and optimization are still not fully resolved, which hinders further performance improvement towards human perception. Thus, it is necessary to periodically give a thorough and systematic review of segmentation algorithms, especially the recent ones, to summarize what has been achieved, where we are now, what knowledge and lessons can be shared and transferred between different communities, and what the directions and opportunities for future research are. To our surprise, there are only some sparse reviews of the segmentation literature; there is no comprehensive review which covers broad segmentation topics including not only the classic bottom-up approaches but also the recent development in superpixels, interactive methods, object proposals, semantic image parsing and image cosegmentation, which will be critically and exhaustively reviewed in this paper in Sections II-VII. In addition, we will also review the existing influential datasets and evaluation metrics in Section VIII. Finally, we discuss some popular design flavors and some potential future directions in Section IX, and conclude the paper in Section X.

II. BOTTOM-UP METHODS

The bottom-up methods usually do not take into account the explicit notion of the object; their goal is mainly to group nearby pixels according to some local homogeneity criteria in the feature space, e.g. color, texture or curvature, by clustering those features based on fitting mixture models, mode shifting [4] or graph partitioning [16], [11], [17], [18]. In addition, variational [19] and level set [20] techniques have also been used to segment images into regions. Below we give a brief summary of some popular bottom-up methods, selected for their reasonable performance and publicly available implementations. We divide the bottom-up methods into two major categories: discrete methods and continuous methods, where the former considers an image as a fixed discrete grid while the latter treats an image as a continuous surface.

A. Discrete bottom-up methods

K-Means: K-means is among the simplest and most efficient methods. Given k initial centers, which can be randomly selected, K-means first assigns each sample to one of the centers based on their distance in feature space, and then the centers are updated. These two steps iterate until a termination condition is met. When the cluster number is known and the distributions of the clusters are spherically symmetric, K-means works efficiently. However, these assumptions often do not hold in general cases, and thus K-means can have problems when dealing with complex clusters.
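To make the two-step iteration concrete, below is a minimal sketch of K-means color clustering over pixels, assuming a NumPy image array; the function name and the RGB-only feature space are illustrative choices, not taken from any particular reference implementation.

```python
import numpy as np

def kmeans_segment(image, k=5, n_iter=20, seed=0):
    """Cluster pixels by color; returns a label map of shape (H, W)."""
    h, w, c = image.shape
    x = image.reshape(-1, c).astype(np.float64)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: nearest center in feature-space distance.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels.reshape(h, w)
```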

Mixture of Gaussians: This method is similar to K-means, except that each cluster center is augmented with a covariance matrix. Assume a set of d-dimensional feature vectors x_1, x_2, ..., x_n drawn from a Gaussian mixture:

$$p(x \mid \{\pi_k, \mu_k, \Sigma_k\}) = \sum_k \pi_k \, N(x \mid \mu_k, \Sigma_k) \quad (1)$$

where π_k are the mixing weights, µ_k, Σ_k are the means and covariances, and

$$N(x \mid \mu_k, \Sigma_k) = \frac{\exp\{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\}}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \quad (2)$$

is the normal density [42]. The parameters π_k, µ_k, Σ_k can be estimated with the expectation maximization (EM) algorithm as follows [42].

1) The E-step estimates how likely it is that a sample x_i was generated by the k-th Gaussian cluster under the current parameters:

$$z_{ik} = \frac{1}{Z} \pi_k N(x_i \mid \mu_k, \Sigma_k) \quad (3)$$

with $\sum_k z_{ik} = 1$ and $Z = \sum_k \pi_k N(x_i \mid \mu_k, \Sigma_k)$.

2) The M-step updates the parameters:

$$\mu_k = \frac{1}{n_k}\sum_i z_{ik} x_i, \qquad \Sigma_k = \frac{1}{n_k}\sum_i z_{ik}(x_i-\mu_k)(x_i-\mu_k)^T, \qquad \pi_k = \frac{n_k}{n} \quad (4)$$

where $n_k = \sum_i z_{ik}$ estimates the number of sample points in each cluster.

After the parameters are estimated, the segmentation can be formed by assigning each pixel to its most probable cluster. Recently, Rao et al. [43] extended Gaussian mixture models by encoding texture and boundary information using minimum description length theory, and achieved better performance.
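The E- and M-steps of Eqs. (3)-(4) translate almost directly into code. Below is a minimal sketch, assuming pixel features stacked in an (n, d) array and parameters initialized elsewhere (e.g. from a K-means labeling); all names are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Eq. (2): multivariate normal density for each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(expo) / norm

def em_step(x, pi, mu, cov):
    n, k = x.shape[0], len(pi)
    # E-step: responsibilities z_ik of Eq. (3), normalized over clusters.
    z = np.stack([pi[j] * gaussian_pdf(x, mu[j], cov[j])
                  for j in range(k)], axis=1)
    z /= z.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters, Eq. (4).
    nk = z.sum(axis=0)
    mu = (z.T @ x) / nk[:, None]
    cov = [np.einsum('n,nd,ne->de', z[:, j], x - mu[j], x - mu[j]) / nk[j]
           for j in range(k)]
    pi = nk / n
    return pi, mu, cov, z  # after convergence, label pixel i by argmax_k z[i, k]
```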

Mean Shift: Different from parametric methods such as K-means and Mixture of Gaussians, which make assumptions about the cluster number and feature distributions, Mean Shift [4] is a non-parametric method which can automatically decide the number of clusters and the modes in the feature space.

Assume data points x_i are drawn from some probability density, which can be estimated by convolving the data with a fixed kernel of width h:

$$f(x) = \sum_i K(x - x_i) = \sum_i k\left(\frac{\|x - x_i\|^2}{h^2}\right) \quad (5)$$

where x_i is a nearby sample and k(·) is the kernel function [42]. After the density function is estimated, mean shift uses a multiple-restart gradient ascent method which starts at some initial guess y_k; the gradient direction of f(x) is estimated at y_k and an uphill step is taken in that direction [42]. In particular, the gradient of f(x) is given by

$$\nabla f(x) = \sum_i (x_i - x)\, G(x - x_i) = \sum_i (x_i - x)\, g\left(\frac{\|x - x_i\|^2}{h^2}\right) \quad (6)$$

where

$$g(r) = -k'(r) \quad (7)$$

and k'(·) is the first-order derivative of k(·). ∇f(x) can be rewritten as

$$\nabla f(x) = \left[\sum_i G(x - x_i)\right] m(x) \quad (8)$$

where the vector

$$m(x) = \frac{\sum_i x_i\, G(x - x_i)}{\sum_i G(x - x_i)} - x \quad (9)$$

is called the mean shift vector; it is the difference between the weighted mean of the neighbors around x and the current value of x. During the mean-shift procedure, the current mode y_k is replaced by its locally weighted mean:

$$y_{k+1} = y_k + m(y_k). \quad (10)$$

The final segmentation is formed by grouping pixels whose convergence points are closer than h_s in the spatial domain and h_r in the range domain; these two parameters are tuned according to the requirements of different applications.
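As an illustration of Eqs. (9)-(10), the sketch below iterates each sample toward its mode using a flat kernel, for which the locally weighted mean reduces to the plain mean of the samples within bandwidth h. This is a simplified stand-in for the joint spatial-range procedure of [4]; names and defaults are illustrative.

```python
import numpy as np

def mean_shift_modes(x, h=0.1, n_iter=50, tol=1e-5):
    """x: (n, d) feature vectors; returns the converged mode per sample."""
    modes = x.copy()
    for _ in range(n_iter):
        shifted = modes.copy()
        for i, y in enumerate(modes):
            # Support of G(x - x_i) for a flat kernel: samples within h of y.
            nbrs = x[((x - y) ** 2).sum(axis=1) < h * h]
            if len(nbrs):
                shifted[i] = nbrs.mean(axis=0)  # y + m(y), Eqs. (9)-(10)
        if np.abs(shifted - modes).max() < tol:
            break
        modes = shifted
    return modes  # pixels whose modes coincide form one segment
```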

Watershed: This method [17] segments an image into catchment basins by flooding the morphological surface from its local minima and constructing ridges where different components meet. As watershed associates each region with a local minimum, it can lead to serious over-segmentation. To mitigate this problem, some methods [18] allow the user to provide initial seed positions, which helps improve the result.

Graph Based Region Merging: Unlike edge based region merging methods which use fixed merging rules, Felzenszwalb and Huttenlocher [16] advocate a method which uses a relative dissimilarity measure to produce a segmentation that optimizes a global grouping metric. The method maps an image to a graph with a 4-neighbor or 8-neighbor structure. The pixels denote the nodes, while the edge weights reflect the color dissimilarity between nodes. Initially, each node forms its own component. The internal difference Int(C) is defined as the largest weight in the minimum spanning tree of a component C. The edge weights are then sorted in ascending order. Two regions C1 and C2 are merged if the in-between edge weight is less than min(Int(C1) + τ(C1), Int(C2) + τ(C2)), where τ(C) = k/|C| and k is a coefficient used to control the component size. Merging stops when the difference between components exceeds the internal difference.
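The merging rule above amounts to a Kruskal-style pass over the sorted edges with a union-find structure. A minimal sketch follows, assuming `edges` is a list of (weight, i, j) tuples built from a 4-neighbor pixel grid; the names and the default k are illustrative.

```python
def graph_merge(n_nodes, edges, k=300.0):
    """Felzenszwalb-Huttenlocher-style merging [16]; returns a component id per node."""
    parent = list(range(n_nodes))
    size = [1] * n_nodes
    internal = [0.0] * n_nodes          # Int(C): max MST edge weight so far

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for w, i, j in sorted(edges):        # ascending edge weights
        a, b = find(i), find(j)
        if a == b:
            continue
        # Merge if the edge is within both relaxed internal differences:
        # w <= min(Int(C1) + k/|C1|, Int(C2) + k/|C2|).
        if w <= min(internal[a] + k / size[a], internal[b] + k / size[b]):
            parent[b] = a
            size[a] += size[b]
            internal[a] = max(internal[a], internal[b], w)
    return [find(i) for i in range(n_nodes)]
```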

Normalized Cut: Many methods generate the segmentation based on local image statistics only, and thus they can produce trivial regions because low-level features are sensitive to lighting and perspective changes. In contrast, Normalized Cut [44] finds a segmentation by splitting an affinity graph which encodes global image information, i.e. minimizing the Ncut value between different clusters:

$$\mathrm{Ncut}(S_1, S_2, ..., S_k) := \frac{1}{2}\sum_{i=1}^{k} \frac{W(S_i, \bar{S}_i)}{\mathrm{vol}(S_i)} \quad (11)$$

where S_1, S_2, ..., S_k form a k-partition of the graph, $\bar{S}_i$ is the complement of S_i, $W(S_i, \bar{S}_i)$ is the sum of boundary edge weights of S_i, and vol(S_i) is the sum of weights of all edges attached to vertices in S_i. The basic idea here is that big clusters have large vol(S_i), and minimizing Ncut encourages all vol(S_i) to be about the same, thus achieving a "balanced" clustering.

Finding the normalized cut is an NP-hard problem. Usually, an approximate solution is sought by computing the eigenvectors v of the generalized eigenvalue system (D − W)v = λDv, where W = [w_ij] is the affinity matrix of the image graph, with w_ij describing the pairwise affinity of two pixels, and D = [d_ij] is the diagonal matrix with $d_{ii} = \sum_j w_{ij}$. In the seminal work of Shi and Malik [44], the pairwise affinity is chosen as a Gaussian kernel of the spatial and feature difference for pixels within a radius ||x_i − x_j|| < r:

$$W_{ij} = \exp\left(-\frac{\|F_i - F_j\|^2}{\sigma_F^2} - \frac{\|x_i - x_j\|^2}{\sigma_s^2}\right) \quad (12)$$

where F is a feature vector consisting of intensity, color and Gabor features, x_i is the spatial coordinate of pixel i, and σ_F and σ_s are the variances of the feature and spatial position, respectively. In the later work of Malik et al. [45], they define a new affinity matrix using an intervening contour method. They measure the difference between two pixels i and j by inspecting the probability of an obvious edge along the line connecting the two pixels:

$$W_{ij} = \exp\left(-\max_{p \in \overline{ij}} mPb(p)/\sigma\right) \quad (13)$$

where $\overline{ij}$ is the line segment connecting i and j and σ is a constant. Here mPb(p) is the boundary strength defined at pixel p by maximizing the oriented contour signal mPb(p, θ) over orientations θ:

$$mPb(p) = \max_{\theta} mPb(p, \theta) \quad (14)$$

The oriented contour signal mPb(p, θ) is defined as a linear combination of multiple local cues at orientation θ:

$$mPb(p, \theta) = \sum_s \sum_i \alpha_{i,s}\, G_{i,\sigma(i,s)}(p, \theta) \quad (15)$$

where G_{i,σ(i,s)}(p, θ) measures the χ² distance at feature channel i (brightness, color a, color b, texture) between the histograms of the two halves of a disc of radius σ(i, s) divided at angle θ, and the α_{i,s} are combination weights learned by gradient ascent on the F-measure using training images and the corresponding ground truth. Moreover, the affinity matrix can also be learned by using the recent multi-task low-rank representation algorithm presented in [46].

The segmentation is achieved by recursively bi-partitioning the graph using the eigenvector of the first nonzero eigenvalue [11] or by spectral clustering of a set of eigenvectors [47]. For computational efficiency, spectral clustering requires the affinity matrix to be sparse, which limits its applications. The recent work of Cour et al. [48] overcomes this limitation by defining the affinity matrix at multiple scales and then setting up cross-scale constraints, which achieves better results. In addition, Arbelaez et al. [49] convolve the eigenvectors with Gaussian directional derivatives at multiple orientations θ to obtain oriented spectral contour responses at each pixel p:

$$sPb(p, \theta) = \sum_{k=1}^{n} \frac{1}{\sqrt{\lambda_k}} \cdot \nabla_\theta v_k(p) \quad (16)$$

Since the signals mPb and sPb carry different contour information, Arbelaez et al. [49] proposed to combine them into a globalized contour signal gPb:

$$gPb(p, \theta) = \beta \cdot mPb(p, \theta) + \gamma \cdot sPb(p, \theta) \quad (17)$$

where the combination weights β and γ are also learned by gradient ascent on the F-measure using ground truth, which achieved the state-of-the-art contour detection result.
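A minimal sketch of the spectral relaxation follows: it builds the Gaussian affinity of Eq. (12), solves (D − W)v = λDv, and thresholds the second-smallest eigenvector (the Fiedler vector) for a two-way split. The dense solver shown is only practical for small graphs, e.g. over superpixels; parameter values are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(features, coords, sigma_f=0.1, sigma_s=10.0, r=15.0):
    """features: (n, d) per-node features; coords: (n, 2) positions."""
    df = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    ds = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    w = np.exp(-df / sigma_f**2 - ds / sigma_s**2)   # Eq. (12)
    w[np.sqrt(ds) > r] = 0.0                         # sparsify by radius r
    d = np.diag(w.sum(axis=1))
    vals, vecs = eigh(d - w, d)                      # (D - W)v = lambda D v
    fiedler = vecs[:, 1]                             # first nonzero eigenvalue
    return fiedler > np.median(fiedler)              # balanced two-way split
```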

Edge Based Region Merging: Such methods [15], [50] start from pixels or superpixels, and then two adjacent regions are merged based on metrics which reflect their similarities/dissimilarities, such as boundary length, edge strength or color difference. Recently, Arbelaez et al. [51] proposed the gPb-OWT-UCM method to transform a set of contours into a nested partition of the image. The method first generates a probabilistic edge map gPb (see Eq. 17) which delineates the salient contours. Then, it performs a watershed over the topological space defined by gPb to form the finest-level segmentation. Finally, the edges between regions are sorted and merged in ascending order, which forms the ultrametric contour map (UCM). Thresholding the UCM at a scale λ forms the final segmentation.

B. Continuous methods

Variational techniques [20], [19], [52], [53], [54] have also been used to segment images into regions; they treat an image as a continuous surface instead of a fixed discrete grid and can produce visually more pleasing results.

Mumford-Shah Model: The Mumford-Shah (MS) model partitions an image by minimizing a functional which encourages homogeneity within each region as well as sharp, piecewise-regular boundaries. The MS functional is defined as

$$F^{MS}(I, C) = \int_{\Omega} |I - I_0|^2\, dx + \mu \int_{\Omega \setminus C} |\nabla I|^2\, dx + \nu\, \mathcal{H}^{N-1}(C) \quad (18)$$

for any observed image I_0 and any positive parameters µ, ν, where I corresponds to a piecewise smooth approximation of I_0, C represents the boundary contours of I, and its length is given by the Hausdorff measure H^{N−1}(C). The first term of (18) is a fidelity term with respect to the given data I_0, the second term regularizes the function I to be smooth inside the region Ω\C, and the last term imposes a regularization constraint on the discontinuity set C to be smooth.

Since minimizing the Mumford-Shah model is not easy, many variants have been proposed to approximate the functional [55], [56], [57], [58]. Among them, Vese and Chan [59] proposed to approximate the term H^{N−1}(C) by the lengths of region contours, which yields the model of active contours without edges. By assuming each region is piecewise constant, the model is further simplified to the continuous Potts model, which has convexified solvers [53], [54].

Active Contour / Snake Model [60]: This type of model detects objects by deforming a snake/contour curve C towards sharp image edges. The evolution of the parametric curve C(p) = (x(p), y(p)), p ∈ [0, 1], is driven by minimizing the functional:

$$F(C) = \alpha \int_0^1 \left|\frac{\partial C(p)}{\partial p}\right|^2 dp + \beta \int_0^1 \left|\frac{\partial^2 C}{\partial p^2}\right|^2 dp + \lambda \int_0^1 f^2(I_0(C))\, dp \quad (19)$$

where the first two terms enforce smoothness constraints by making the snake act as a membrane and a thin plate respectively, and the sum of the first two terms forms the internal energy. The third term, called the external energy, attracts the curve toward object boundaries by using the edge-detecting function

$$f(I_0) = \frac{1}{1 + \gamma\, |\nabla (I_0 * G_\sigma)|^2} \quad (20)$$

where γ is an arbitrary positive constant and I_0 * G_σ is the Gaussian-smoothed version of I_0. The energy function is non-convex and sensitive to initialization. To overcome this limitation, Osher et al. [20] proposed the level set method, which implicitly represents the curve C by a higher-dimensional function ψ, called the level set function. Moreover, Bresson et al. [52] proposed a convex relaxed active contour model which can achieve the desired global optimum.
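As a small illustration, the edge-detecting function of Eq. (20), which drives the external energy toward boundaries, can be computed as follows, assuming a grayscale float image, with SciPy's Gaussian filtering standing in for I_0 * G_σ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_stopping(image, sigma=2.0, gamma=1.0):
    """Eq. (20): close to 1 in flat regions, close to 0 on strong edges."""
    smooth = gaussian_filter(image, sigma)   # I_0 * G_sigma
    gy, gx = np.gradient(smooth)             # image gradient
    return 1.0 / (1.0 + gamma * (gx**2 + gy**2))
```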

The bottom-up methods can also be divided into two other categories: those which attempt to produce regions likely belonging to objects [16], [11], [51], and those which tend to produce over-segmentation [17], [4] (to be introduced in Section III). For methods of the first category, obtaining object regions is extremely challenging as the bottom-up methods only use low-level features. Recently, Zhu et al. [61] proposed to combine hand-crafted low-level features that reflect global image statistics (such as Gaussian Mixture Models, geodesic distance and eigenvectors) with the convexified continuous Potts model to capture high-level structures, which achieves some promising results.

III. SUPERPIXEL

Superpixel methods aim to over-segment an image into homogeneous regions which are smaller than objects or parts. In the seminal work of Ren and Malik [62], they argue and justify that the superpixel is a more natural and efficient representation than the pixel, because local cues extracted at the pixel level are ambiguous and sensitive to noise. Superpixels have a few desirable properties:

• A superpixel is perceptually more meaningful than a pixel. The superpixels produced by state-of-the-art algorithms are nearly perceptually consistent in terms of color, texture, etc., and most structures can be well preserved. Besides, superpixels also convey shape cues which are difficult to capture at the pixel level.

• Superpixels help reduce model complexity and improve efficiency and accuracy. Existing pixel-based methods need to deal with millions of pixels and their parameters, and training and inference in such a big system pose great challenges to current solvers. On the contrary, using superpixels to represent the image can greatly reduce the number of parameters and alleviate the computation cost. Meanwhile, by exploiting the large spatial support of superpixels, more discriminative features such as color or texture histograms can be extracted. Last but not least, superpixels make longer-range information propagation possible, which allows existing solvers to exploit richer information than when using pixels.

Fig. 2. Examples of interactive image segmentation using bounding box (a, b) [21] or scribbles (c, d) [22].

There are different paradigms to produce superpixels:

• Some existing bottom-up methods can be directly adapted to the over-segmentation scenario by tuning their parameters, e.g. Watershed, Normalized Cut (by increasing the cluster number), Graph Based Merging (by controlling the region size) and Mean-Shift/Quick-Shift (by tuning the kernel size or changing the mode drifting style).

• Some recent methods produce much faster superpixel segmentation by changing the optimization scope from the whole image to local non-overlapping initial regions, and then adjusting the region boundaries to snap to salient object contours. TurboPixel [63] deforms an initial spatial grid into compact and regular regions by using a geometric flow directed by local gradients. Wang et al. [64] also adapted geodesic flows by computing geodesic distances among pixels to produce adaptive superpixels, which have higher density in regions of high intensity or color variation and larger superpixels in structure-less regions. Veksler et al. [65] proposed to place overlapping patches on the image, and then assign each pixel by inferring the MAP solution using graph cuts. Zhang et al. [66] further studied this direction by using a pseudo-boolean optimization which achieves faster speed. Achanta et al. [3] introduced the SLIC algorithm, which greatly improves superpixel efficiency. SLIC starts from an initial regular grid of superpixels, grows superpixels by estimating each pixel's distance to its nearby localized cluster centers, and then updates the cluster centers; it is essentially a localized K-means (see the sketch after this list). SLIC can produce superpixels at 5 Hz without GPU optimization.

• There are also some new formulations for over-segmentation. Liu et al. [67] recently proposed a new graph based method which maximizes the entropy rate of the cuts in the graph with a balance term for compact representation. Although it outperforms many methods in terms of the boundary recall measure, it takes about 2.5 s to segment an image of size 480x320. Van den Bergh et al. [68] proposed the fastest superpixel method, SEEDS, which can run at 30 Hz. SEEDS uses multi-resolution image grids as initial regions. For each image grid, it defines a color-histogram-based entropy term and an optional boundary term. Instead of using EM as in SLIC, which needs to repeatedly compute distances, SEEDS uses hill climbing to move coarser-resolution grids, and then refines region boundaries using finer-resolution grids. In this way, SEEDS achieves real-time superpixel segmentation at 30 Hz.
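Below is a minimal sketch of the SLIC-style localized K-means referenced above, assuming a float color image (ideally in Lab space). Restricting each center's search to a 2S x 2S window is what makes the method fast. Parameter names follow common usage, not the exact reference implementation of [3].

```python
import numpy as np

def slic(image, n_segments=200, compactness=10.0, n_iter=10):
    """image: (H, W, C) float array; returns an (H, W) superpixel label map."""
    h, w, _ = image.shape
    s = int(np.sqrt(h * w / n_segments))                # grid interval S
    ys, xs = np.mgrid[s // 2:h:s, s // 2:w:s]
    centers = np.array([[y, x, *image[y, x]] for y, x in
                        zip(ys.ravel(), xs.ravel())], dtype=np.float64)
    for _ in range(n_iter):
        labels = -np.ones((h, w), dtype=int)
        dist = np.full((h, w), np.inf)
        for ci in range(len(centers)):                  # assignment step
            cy, cx, col = centers[ci, 0], centers[ci, 1], centers[ci, 2:]
            y0, y1 = int(max(cy - s, 0)), int(min(cy + s + 1, h))
            x0, x1 = int(max(cx - s, 0)), int(min(cx + s + 1, w))
            dc = ((image[y0:y1, x0:x1] - col) ** 2).sum(-1)  # color term
            yy, xx = np.mgrid[y0:y1, x0:x1]
            ds = (yy - cy) ** 2 + (xx - cx) ** 2             # spatial term
            d = dc + (compactness / s) ** 2 * ds  # squared SLIC distance
            closer = d < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][closer] = d[closer]
            labels[y0:y1, x0:x1][closer] = ci
        for ci in range(len(centers)):                  # update step
            yy, xx = np.nonzero(labels == ci)
            if len(yy):
                centers[ci] = [yy.mean(), xx.mean(), *image[yy, xx].mean(0)]
    return labels
```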

IV. INTERACTIVE METHODS

Image segmentation is expected to produce regions matching human perception. Without any prior assumption, it is difficult for bottom-up methods to produce object regions. For some specific areas such as image editing and medical image analysis, which require precise object localization and segmentation for subsequent applications (e.g. changing the background or organ reconstruction), prior knowledge or constraints (e.g. color distribution, contour enclosure or texture distribution) directly obtained from a small amount of user input can be of great help in producing an accurate object segmentation. Such segmentation methods with user inputs are termed interactive image segmentation methods. There are already some surveys [69], [70], [71] on interactive segmentation techniques, and thus this paper acts as a supplement to them, discussing recent influential literature not covered by the previous surveys. In addition, to make the manuscript self-contained, we also give a brief review of some classical techniques.

In general, an interactive segmentation method has the following pipeline: 1) the user provides initial input; 2) the segmentation algorithm produces segmentation results; 3) based on the results, the user provides further constraints, and the process goes back to step 2. The process repeats until the results are satisfactory. A good interactive segmentation method should meet the following criteria: 1) offering an intuitive interface for the user to express various constraints; 2) producing fast and accurate results with as little user manipulation as possible; 3) allowing the user to make additional adjustments to further refine the result.

Existing interactive methods can be classified according to their user interfaces, which are the bridge for users to convey their prior knowledge (see Fig. 2). Existing popular interfaces include the bounding box [21], polygon (the object of interest is within the provided regions), contour [72], [73] (the object boundary should follow the provided direction), and scribbles [74], [22] (the object should follow similar color distributions). According to the methodology, the existing interactive methods can be roughly classified into two groups: (i) contour based methods [20], [73], [75], [23], [76] and (ii) label propagation based methods [71], [74], [77].

A. Contour based interactive methods

Contour based methods are among the earliest interactive segmentation methods. In a typical contour based interactive method, the user first places a contour close to the object boundary, and then the contour evolves to snap to the nearby salient object boundary. One of the key components of contour based methods is the edge detection function, which can be based on first-order image statistics (such as the Sobel, Canny and Prewitt operators) or more robust second-order statistics (such as the Laplacian and LoG operators).

For instance, the Live-Wire / Intelligent Scissors method [73], [75] starts contour evolution by building a weighted graph on the image, where each node in the graph corresponds to a pixel, and directed edges are formed between pixels and their closest four or eight neighbors. The local cost of each directed edge is the weighted sum of the Laplacian zero-crossing, gradient magnitude and gradient direction. Then, given the seed locations, the shortest path from a seed point p to a certain seed point s is found using Dijkstra's algorithm. Essentially, Live-Wire minimizes a local energy function. On the other hand, the active contour method introduced in Section II-B deforms the contour by using a global energy function, which consists of a regional term and a boundary term, to overcome some ambiguous local optima. Nguyen et al. [23] recently employed the convex active contour model to refine the object boundaries produced by other interactive segmentation methods and achieved satisfactory segmentation results. Liu et al. [76] proposed to use the level set function [20] to track the zero level set of a posterior probabilistic mask learned from a user-provided bounding box, which can capture objects with complex topology and fragmented appearance, such as tree leaves.

B. Label propagation based methods

Label propagation methods are more popular in the literature. The basic idea of label propagation is to start from user-provided initial input marks, and then propagate the labels using either global optimization (such as GraphCut [74] or RandomWalk [77]) or local optimization, where the global optimization methods are more widely used due to the existence of fast solvers.

GraphCut and its descendants [74], [21], [78] model the pixel labeling problem as a Markov Random Field. Their energy functions can be generally expressed as

$$E(X) = \sum_{x_i \in X} E_u(x_i) + \sum_{x_i, x_j \in N} E_p(x_i, x_j) \quad (21)$$

where X = {x_1, x_2, ..., x_n} is the set of random variables defined at each pixel, each of which can take either foreground label 1 or background label 0, and N defines a neighborhood system, which is typically 4-neighbor or 8-neighbor. The first term in (21) is the unary potential, and the second term is the pairwise potential. In [74], the unary potential is evaluated by using an intensity histogram. Later, in GrabCut [21], the unary potential is derived from two Gaussian Mixture Models (GMMs) for the background and foreground regions respectively, and a hard constraint is imposed on the regions outside the bounding box such that the labels in the background region remain constant while the regions within the box are updated to capture the object. Li et al. [78] further proposed the Lazy Snapping method, which extends GrabCut by using superpixels and including an interface that allows the user to adjust the result at low-contrast and weak object boundaries. One limitation of GrabCut is that it favors short boundaries due to its energy function design. Recently, Kohli et al. [79] proposed to use a conditional random field with multiple-layered hidden units to encode a boundary-preserving higher-order potential, which has efficient solvers and can therefore capture thin details that are often neglected by the classical MRF model.
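A minimal sketch of a binary MRF of the form of Eq. (21) follows, assuming the third-party PyMaxflow package and a per-pixel foreground probability map (e.g. from the GrabCut GMMs); the weights and the output segment convention are assumptions that may need adapting.

```python
import numpy as np
import maxflow  # PyMaxflow package

def mrf_segment(p_fg, pairwise_weight=2.0, eps=1e-6):
    """p_fg: (H, W) foreground probabilities in (0, 1)."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(p_fg.shape)
    g.add_grid_edges(nodes, pairwise_weight)   # Potts pairwise term E_p
    # Unary terms E_u as negative log-likelihood terminal capacities.
    g.add_grid_tedges(nodes, -np.log(1.0 - p_fg + eps), -np.log(p_fg + eps))
    g.maxflow()                                # global min-cut
    seg = g.get_grid_segments(nodes)           # boolean per pixel
    return seg  # one side of the cut is foreground; flip if the convention differs
```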

Since the MRF/CRF model provides a unified framework to combine multiple sources of information, various methods have been proposed to incorporate prior knowledge:

• Geodesic prior: One typical assumption about objects is that they are compactly clustered in the spatial domain, instead of being scattered around. Such a spatial constraint can be constructed by exploiting the geodesic distance to foreground and background seeds. Unlike the Euclidean distance, which directly measures the distance between two points in the spatial domain, the geodesic distance measures the lowest-cost path between them, where the weight is set to the gradient of the likelihood of pixels belonging to the foreground (see the sketch after this list). The geodesic distance can then be incorporated as a data term in the energy function, as in [80]. Later, Bai et al. [81] further extended this solution to the soft matting problem. Zhu et al. [61] also incorporated Bai's geodesic distance as one type of object potential to produce bottom-up object segmentation, which achieves improved results.

• Convex prior: Another common assumption is that most objects are convex. Such a prior can be expressed as a labeling constraint. For example, a star shape is defined with respect to a center point c: an object has a star shape if, for any point p inside the object, all points on the straight line between the center c and p also lie inside the object. Veksler [82] formulated this constraint by penalizing different labelings on the same line, but the formulation can only work with a single convex center. Gulshan et al. [83] proposed to use the geodesic distance transform [84] to compute the geodesic convexity from each pixel to the star centers, which works on objects with multiple star centers. Other similar connectivity constraints have also been studied in [85].
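The geodesic distance transform used by these priors can be sketched as a Dijkstra pass over the pixel grid, as below; the per-pixel `cost` map (e.g. the gradient of the foreground likelihood) and the 4-neighbor connectivity are illustrative assumptions.

```python
import heapq
import numpy as np

def geodesic_distance(cost, seeds):
    """cost: (H, W) nonnegative per-pixel costs; seeds: (H, W) boolean mask."""
    h, w = cost.shape
    dist = np.full((h, w), np.inf)
    heap = [(0.0, y, x) for y, x in zip(*np.nonzero(seeds))]
    for _, y, x in heap:
        dist[y, x] = 0.0
    heapq.heapify(heap)
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                      # stale queue entry
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + 0.5 * (cost[y, x] + cost[ny, nx])  # edge cost
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, ny, nx))
    return dist  # lowest-cost path length from the seed set to each pixel
```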

RandomWalk: The RandomWalk model [77] provides another pixel labeling framework, which has been applied to many computer vision problems, such as segmentation, denoising and image matching. With notations similar to those for GraphCut in (21), RandomWalk starts from building an undirected graph G = (V, E), where V is the set of vertices defined at each pixel and E is the set of edges, which carry pairwise weights W_ij reflecting the probability of a random walker jumping between two nodes i and j. The degree of a vertex is defined as $d_i = \sum_j W_{ij}$.

Given the weighted graph, a set of user-scribbled nodes V_m and a set of unmarked nodes V_u, such that V_m ∪ V_u = V and V_m ∩ V_u = ∅, the RandomWalk approach assigns each node i ∈ V_u the probability x_i that a random walker starting from that node first reaches a marked node. The final segmentation is formed by thresholding x_i.

The entire set of node probabilities x can be obtained by minimizing

$$E(x) = x^T L x \quad (22)$$

where L is the combinatorial Laplacian matrix defined as

$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } i \neq j \text{ and } i \text{ and } j \text{ are adjacent nodes} \\ 0 & \text{otherwise.} \end{cases} \quad (23)$$

By partitioning the matrix L into marked and unmarked blocks as

$$L = \begin{bmatrix} L_m & B \\ B^T & L_u \end{bmatrix} \quad (24)$$

and defining a |V_m| × 1 indicator vector f as

$$f_j = \begin{cases} 1 & \text{if } j \text{ is labeled as foreground} \\ 0 & \text{otherwise,} \end{cases} \quad (25)$$

the minimization of (22) with respect to x_u results in

$$L_u x_u = -B^T f \quad (26)$$

where only a sparse linear system needs to be solved. Yang et al. [22] further proposed a constrained RandomWalk that is able to incorporate different types of user inputs as additional hard and soft constraints, providing more flexibility for users.
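Solving Eq. (26) is a single sparse linear solve. A minimal sketch with SciPy follows, assuming the Laplacian L of Eq. (23) has already been assembled and that `marked`/`unmarked` are integer index arrays over the graph nodes:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def random_walk_probs(L, marked, unmarked, f):
    """L: (n, n) graph Laplacian; f: 0/1 seed labels over `marked` nodes."""
    L = csr_matrix(L)
    Lu = L[unmarked][:, unmarked]        # unmarked block of Eq. (24)
    B = L[marked][:, unmarked]           # coupling block B
    xu = spsolve(Lu.tocsc(), -B.T @ f)   # Eq. (26): L_u x_u = -B^T f
    return xu                            # threshold (e.g. at 0.5) to segment
```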

Local optimization based methods: Many label propagation methods use specific solvers such as graph cut and belief propagation to get the most likely solution. However, such global optimization solvers do not scale well with image size. Hosni et al. [86] showed that by filtering the cost volume using fast edge-preserving filters (such as the Joint Bilateral Filter [87], Guided Filter [88], Cross-map Filter [89], etc.) and then using Winner-Takes-All label selection to take the most likely labels, one can achieve comparable or better results than globally optimized models. More importantly, such cost-volume filtering approaches can achieve real-time performance, e.g. 2.85 ms to filter a 1 Mpix image on a Core 2 Quad 2.4 GHz desktop. Recently, Criminisi et al. [84] proposed to use the geodesic distance transform to filter the cost volume, which produces results comparable to the global optimization methods and better captures edges at weak boundaries.
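A minimal sketch of cost-volume filtering with Winner-Takes-All selection is given below; a simple box filter stands in for the edge-preserving (guided or joint-bilateral) filters used by the methods above, so this only illustrates the overall structure, not their edge-aware behavior.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def filter_and_label(cost_volume, radius=7):
    """cost_volume: (n_labels, H, W) per-pixel label costs."""
    smoothed = np.stack([uniform_filter(c, size=2 * radius + 1)
                         for c in cost_volume])
    return smoothed.argmin(axis=0)   # Winner-Takes-All per pixel
```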

There are also some simple region-growing based methods [90], [91], which start from user-drawn seed regions and then iteratively merge the remaining groups according to similarities. One limitation of such methods is that different merging orders can produce different results.

V. OBJECT PROPOSALS

Automatically and precisely segmenting objects out of an image is still an unsolved problem. Instead of searching for a deterministic object segmentation, recent research on object proposals relaxes the object segmentation problem by looking for a pool of regions such that, with high probability, some of the proposals cover the objects. This type of method leverages the high-level concept of the "object" or "thing" to separate object regions from "stuff", where an "object" tends to have a clear size and shape (e.g. pedestrian, car), as opposed to "stuff" (e.g. sky or grass), which tends to be homogeneous or to exhibit recurring patterns of fine-scale structure.

Class-specific object proposals: One approach to incorporate the object notion is through the use of class-specific object detectors. Such object detectors can be any bounding box detector, e.g. the famous Deformable Part Model (DPM) [25] or Poselets [92]. A few works have been proposed to combine object detection with segmentation. For example, Larlus and Jurie [93] obtained the object segmentation by refining the bounding box using a CRF. Gu et al. [8] proposed to use hierarchical regions for object detection, instead of bounding boxes. However, class-specific object segmentation methods can only be applied to a limited number of object classes, and cannot handle datasets with a large number of object classes, e.g. ImageNet [33], which has thousands of classes.

Fig. 3. An example of class-independent object segmentation [9].

Class-independent object proposals: Inspired by the objectness window work of Alexe et al. [94], class-independent region proposal methods directly attempt to produce general object regions. The underlying rationale for class-independent proposals is that the object of interest is typically distinct from the background in certain appearance or geometry cues. Hence, in some sense the general object proposal problem is correlated with the popular salient object detection problem. More complete surveys of state-of-the-art objectness window methods and salient object detection methods can be found in [95] and [96], respectively. Here we focus on the region proposal methods.

One group of class-independent object proposal methods first uses bottom-up methods to generate region proposals, and then applies a pre-trained classifier to rank the proposals according to region features. Representative works include CPMC [9] and Category Independent Object Proposals [10], which extend GrabCut to this scenario by using different seed sampling techniques (see Fig. 3 for an example). In particular, CPMC applies sequential GrabCuts, using a regular grid as foreground seeds and the image frame boundary as background seeds to train the Gaussian mixture models. Category Independent Object Proposals samples foreground seeds from occlusion boundaries inferred by [97]. The generated region proposals are then fed to a random forest classifier or regressor, trained with features (such as convexity, color contrast, etc.) of the ground-truth object regions, for re-ranking. One bottleneck of such appearance based object proposal methods is that re-computing GrabCut's appearance model implicitly incorporates the re-computation of some distances, which makes such methods hard to speed up. A recent work [98] pre-computes a graph which can be used for parametric min-cuts over different seeds, and then dynamic graph cut [99] is applied to accelerate the computation. Instead of computing expensive GrabCuts, another recent work called Geodesic Object Proposals (GOP) [100] exploits the fast geodesic distance transform for object proposals. GOP uses superpixels as its atomic units, chooses the first seed location as the one whose geodesic distance to all other superpixels is the smallest, and then places the next seeds far from the existing seeds. The authors also proposed to use RankSVM to learn to place seeds, but the performance improvement is not significant.

Another bottom-up way to generate region proposals is to use the edge based region merging method gPb-OWT-UCM described in Section II, after which the region hierarchies are filtered using the classifiers of CPMC. Due to its reliance on a high-accuracy contour map, this approach can achieve higher accuracy than GrabCut based methods. However, it is limited by the performance of contour detectors, whose shortcomings in speed and accuracy have been greatly alleviated by some recently introduced learning based methods such as Structured Forests [101] and multi-resolution eigensolvers. The latter has been applied in an improved version of gPb-OWT-UCM (MCG) [102], which simultaneously considers the region combinatorial space.

Different from the above class-independent object proposal methods, which use a single strategy to generate object regions, another group of methods applies multiple hand-crafted strategies to produce diversified solutions during atomic unit generation and region proposal generation; we call these diversified region proposal methods. Methods of this type typically just produce diversified region proposals, without training a classifier to rank them. For example, SelectiveSearch [103] generates region trees from superpixels to capture objects at multiple scales by using different merging similarity metrics (such as RGB, intensity, texture, etc.). To increase the degree of diversification, SelectiveSearch also applies different parameters to generate the initial atomic superpixels. After the different segmentation trees are generated, detection starts from the regions in the higher hierarchies. Manen et al. [104] proposed a similar method which exploits merging randomized trees with learned similarity metrics.

VI. SEMANTIC IMAGE PARSING

Semantic image parsing aims to break an image into non-overlapping regions which correspond to predefined semantic classes (e.g. car, grass, sheep, etc.), as shown in Fig. 4. The popularity of semantic parsing since the early 2000s is deeply rooted in the success of some specific computer vision tasks such as face/object detection and tracking, camera pose estimation, multiple-view 3D reconstruction and fast conditional random field solvers. The ultimate goal of semantic parsing is to equip the computer with the holistic ability to understand the visual world around us. Although it also depends on given information, its reliance on high-level learned representations makes it different from the interactive methods. This type of approach is also different from object region proposals in the sense that it aims to parse an image as a whole into "thing" and "stuff" classes, instead of just producing possible "thing" candidates.

Fig. 4. An example of semantic segmentation [31], which is trained using pictures with human-labeled ground truth such as (b) to segment the test image in (c) and produce the final segmentation in (d).

Most state-of-the-art image parsing systems are formulated as the problem of finding the most probable labeling of a Markov random field (MRF) or conditional random field (CRF). The CRF provides a principled probabilistic framework to model complex interactions between output variables and observed features. Thanks to its ability to factorize the probability distribution over different labelings of the random variables, the CRF allows for compact representations and efficient inference. The CRF model defines a Gibbs distribution of the output labeling y conditioned on the observed features x via an energy function E(y; x):

$$P(y \mid x) = \frac{1}{Z(x)} \exp\{-E(y; x)\}, \quad (27)$$

where y = (y_1, y_2, ..., y_n) is a vector of random variables y_i (n is the number of pixels or regions) defined on each node i (pixel or superpixel), which takes a label from a predefined label set L given the observed features x. Z(x) is the partition function, which ensures the distribution is properly normalized and sums to one. Computing the partition function is intractable due to the sum of exponential functions. On the other hand, such computation is not necessary when the task is to infer the most likely labeling: maximizing the posterior (27) is equivalent to minimizing the energy function E(y; x). A common model for pixel labeling involves a unary potential ψ_u(y_i; x) associated with each pixel, and a pairwise potential ψ_p(y_i, y_j; x) associated with each pair of neighboring pixels:

$$E(y; x) = \sum_{i=1}^{n} \psi_u(y_i; x) + \sum_{v_i, v_j \in N} \psi_p(y_i, y_j; x). \quad (28)$$

Given the energy function, semantic image parsing usuallyfollows the following pipelines: 1) Extract features from apatch centered on each pixel; 2) With the extracted featuresand the ground truth labels, an appearance model is trainedto produce a compatible score for each training sample; 3)The trained classifier is applied on the test image’s pixel-wise features, and the output is used as the unary term; 4)The pairwise term of the CRF is defined over a 4 or 8-connected neighborhood for each pixel; 5) Perform maximuma posterior (MAP) inference on the graph. Following thiscommon pipeline, there are different variants in differentaspects:• Features: The commonly used features are bottom-up

pixel-level features such as color or texton. He et al. [105]proposed to incorporate the region and image level fea-tures. Shotton et al. [106] proposed to use spatial layoutfilters to represent the local information corresponding todifferent classes, which was later adapted to random for-est framework for real-time parsing [24]. Recently, deepconvolutional neural network learned features [107] havealso been applied to replace the hand-crafted features,which achieves promising performance.

• Spatial support: The spatial support in step 1 can beadapted to superpixels which conform to image internalstructures and make feature extraction less susceptible tonoise. Also by exploiting superpixels, the complexity ofthe model is greatly reduces from millions of variablesto only hundreds or thousands. Hoiem et al. [108] usedmultiple segmentations to find out the most feasible con-figuration. Tighe et al. [109] used superpixels to retrievesimilar superpixels in the training set to generate unaryterm. To handle multiple superpixel hypotheses, Ladickyet al. [110] proposed the robust higher order potential,which enforces the labeling consistency between thesuperpixels and their underlying pixels.

• Context: The context (such as boat in the water, caron the road) has emerged as another important factorbeyond the basic smoothness assumption of the CRFmodel. Basic context model is implicitly captured by theunary potential, e.g. the pixels with green colors are morelikely to be grass class. Recently, more sophisticated classco-occurrence information has been incorporated in themodel. Rabinovich et al. [111] learned label co-occurencestatistic in the training set and then incorporated it intoCRF as additional potential. Later the systems usingmultiple forms of context based on co-occurence, spatialadjacency and appearance have been proposed in [112],[113], [109]. Ladicky et al. [32] proposed an efficientmethod to incorporate global context, which penalizesunlikely pairs of labels to be assigned anywhere inthe image by introducing one additional variable in theGraphCut model.

• Top-down and bottom-up combination: The combina-tion of top-down and bottom-up information has alsobeen considered in recent scene parsing works. Thebottom-up information is better at capturing stuff classwhich is homogeneous. On the other hand, object detec-tors are good at capturing thing class. Therefore, theircombination helps develop a more holistic scene under-standing system. There are some recent studies incorpo-rating the sliding window detectors such as DeformablePart Model [114] or Poselet [92]. Specifically, Ladicky etal. [115] proposed the higher order robust potential basedon detectors which use GrabCut to generate the shapemask to compete with bottom up cues. Floros et al. [116]instead infered the shape mask from the Implicit ShapeModel top-down segmentation system [117]. Arbelaez etal. [118] used the Poselet detector to segment articu-lated objects. Guo and Hoiem [119] chose to use Auto-Context [120] to incorporate the detector response in theirsystem. Xia et al [121] applied CPMC in Section. Von object detection’s bounding boxes to generate objectproposals and then the proposals are subsequently fusedand refined to form the final segmentation. More recently,Tighe et al. [30] proposed to transfer the mask of trainingimages to test images as the shape potential by usingtrained exemplar SVM model which greatly improve theperformance of the non-parametric scene parsing system.

• Inference: To optimize the energy function, various techniques can be applied, such as GraphCut, Belief Propagation or Primal-Dual methods; a complete review of recent inference methods can be found in [122]. Original CRF or MRF models are usually limited to 4-neighbor or 8-neighbor connectivity. Recently, the fully connected graphical model, which connects all pairs of pixels, has become popular due to the availability of an efficient approximation of the time-costly message-passing step via fast image filtering [31], with the requirement that the pairwise term be a mixture of Gaussian kernels. Vineet et al. [123] introduced higher order terms into the fully connected CRF framework, which generalizes its application.
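The sketch below illustrates the mean-field update behind [31] on a tiny problem; it uses a naive O(N²) message-passing sum with a single Gaussian appearance kernel and a Potts compatibility, whereas real implementations replace that sum with fast high-dimensional filtering.

```python
# Illustrative (naive, O(N^2)) mean-field inference for a fully connected
# CRF with one Gaussian kernel, usable only on very small images.
import numpy as np

def dense_crf_mean_field(unary, feats, w=1.0, sigma=1.0, iters=5):
    """unary: (N, L) negative log-probs; feats: (N, D) pixel features."""
    # Pairwise Gaussian kernel k(fi, fj); zero the diagonal (no self-message).
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = w * np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(K, 0.0)
    Q = np.exp(-unary)
    Q /= Q.sum(1, keepdims=True)
    for _ in range(iters):
        msg = K @ Q                       # message passing
        # Potts compatibility: penalize mass placed on *other* labels.
        pairwise = msg.sum(1, keepdims=True) - msg
        Q = np.exp(-unary - pairwise)
        Q /= Q.sum(1, keepdims=True)
    return Q.argmax(1)
```

On real images the kernel matrix K is never formed explicitly; the filtering of [31] evaluates the same messages in approximately linear time.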

• Data Driven: The pipeline above needs pre-trained classifiers, which is quite restrictive when new classes are added to the database. Recently, some researchers have advocated non-parametric, data-driven approaches for open-universe datasets. Such approaches avoid training by retrieving similar training images from the database to segment a new image. Liu et al. [29] proposed to use SIFT flow [124] to transfer masks from the training images. On the other hand, Tighe et al. [109] proposed to retrieve the nearest superpixel neighbors in the training images and achieved performance comparable to [29].
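A minimal sketch of the retrieval-and-transfer idea follows: nearest-neighbor matching of superpixel descriptors against the training set with a majority vote over the retrieved labels, a simplified stand-in for the retrieval step of [109].

```python
# Nonparametric label transfer: for each query superpixel descriptor,
# retrieve the k nearest training superpixels and vote with their labels.
import numpy as np

def transfer_labels(query_feats, train_feats, train_labels, k=5):
    """query_feats: (Q, D); train_feats: (T, D); train_labels: (T,) ints."""
    out = np.empty(len(query_feats), dtype=train_labels.dtype)
    for i, q in enumerate(query_feats):
        d = ((train_feats - q) ** 2).sum(1)
        nn = np.argsort(d)[:k]
        out[i] = np.bincount(train_labels[nn]).argmax()   # majority vote
    return out
```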

VII. IMAGE COSEGMENTATION

Cosegmentation aims at extracting common objects from a set of images (see Fig. 5 for examples). It is essentially a multiple-image segmentation problem, where the very weak prior that the images contain the same objects is used for automatic object segmentation. Since it does not need any pixel-level or image-level object information, it is suitable for large-scale image dataset segmentation and many practical applications, and has thus attracted much attention recently. Meanwhile, cosegmentation is challenging and faces several major issues:

• The classical segmentation models are designed for a single image, while cosegmentation deals with multiple images. How to design the cosegmentation model and how to minimize it are critical questions for cosegmentation research.

• The extraction of common objects depends on the foreground similarity measurement. But the foreground usually varies, which makes the foreground similarity measurement difficult. Moreover, the model minimization is highly related to the choice of similarity measurement, which may result in an extremely difficult optimization. Thus, how to measure the foreground similarities is another important issue.

• There are many practical applications with different requirements, such as large-scale image cosegmentation, video cosegmentation and web image cosegmentation. These individual needs require a cosegmentation model to be extendable to different scenarios, which is challenging.

Cosegmentation was first introduced by Rother et al. [125] in 2006. Since then, many cosegmentation methods have been proposed [34], [35], [36], [37], [38], [39], [40], [41]. The existing methods can be roughly classified into three categories.

The first category extends existing single-image segmentation models to the cosegmentation problem, such as MRF cosegmentation, heat diffusion cosegmentation, Random Walk based cosegmentation and active contour based cosegmentation. The second designs new cosegmentation models, e.g., formulating cosegmentation as a clustering problem, a graph theory based proposal selection problem, or a metric rank based representation. The last addresses newly emerging cosegmentation needs, such as multiple class/foreground object cosegmentation, large-scale image cosegmentation and web image cosegmentation.

A. Cosegmentation by extending single-image segmentation models

A straightforward way to perform cosegmentation is to extend classical single-image segmentation models. In general, the extended models can be represented as

$$E = E_s + E_g \qquad (29)$$

where $E_s$ is the single-image segmentation term, which guarantees the smoothness and the distinction between foreground and background in each image, and $E_g$ is the cosegmentation term, which evaluates the consistency between the foregrounds among the images. Only the segmentation of common objects results in small values of both $E_s$ and $E_g$. Thus, cosegmentation is formulated as minimizing the energy in (29).

Many classical segmentation models have been used to form $E_s$, e.g., MRF segmentation models [125], [41]:

$$E_s^{MRF} = E_u^{MRF} + E_p^{MRF} \qquad (30)$$

where $E_u^{MRF}$ and $E_p^{MRF}$ are the conventional unary and pairwise potentials, respectively. The GraphCut algorithm is widely used to minimize the energy in (30). The cosegmentation term $E_g$ evaluates the consistency among the multiple foregrounds, which guarantees the segmentation of the common object but makes the minimization of (29) difficult. Various cosegmentation terms and their minimization methods have been carefully designed in MRF based cosegmentation models.

In particular, Rother et al. [125] evaluated the consistency by the $\ell_1$ norm, i.e., $E_g = \sum_z |h_1(z) - h_2(z)|$, where $h_1$ and $h_2$ are the feature histograms of the two foregrounds and $z$ indexes the feature dimension. Adding the $\ell_1$ evaluation makes the minimization quite challenging; an approximation method called the submodular-supermodular procedure has been proposed to minimize the model with the max-flow algorithm. Mukherjee et al. [41] replaced $\ell_1$ with the squared $\ell_2$ evaluation, i.e., $E_g = \sum_z (h_1(z) - h_2(z))^2$. The $\ell_2$ form has several advantages, such as relaxing the minimization to an LP problem and allowing Pseudo-Boolean optimization, but it still yields an approximate solution. In order to simplify the model minimization, Hochbaum et al. [126] used a reward strategy rather than a penalty strategy: given any foreground pixel $p$ in the first image, a similar pixel $q$ in the second image is rewarded for being labeled foreground. The global term is formulated as $E_g = \sum_z h_1(z) \cdot h_2(z)$, which is similar to the histogram-based segment similarity measurement in [39] that forces each foreground to be similar to the other foregrounds and different from its background. The resulting MRF energy is proved to be submodular and can be efficiently minimized by the GraphCut algorithm.
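For concreteness, the three cosegmentation terms discussed above can be written out for two foreground histograms as in the sketch below; note that the dot-product term of [126] is a reward to be maximized, so its negative would enter the energy.

```python
# The three cosegmentation terms above, for two foreground color
# histograms h1 and h2 (NumPy arrays of equal length): the L1 penalty of
# [125], the squared-L2 penalty of [41], and the dot-product reward of [126].
import numpy as np

def eg_l1(h1, h2):
    return np.abs(h1 - h2).sum()

def eg_l2_squared(h1, h2):
    return ((h1 - h2) ** 2).sum()

def eg_reward(h1, h2):
    return (h1 * h2).sum()   # a reward: maximized, or negated in E
```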


Fig. 5. (a) An example of simultaneously segmenting one common foreground object from a set of images [39]. (b) An example of multi-class cosegmentation [35].

Rubio et al. [37] evaluated the foreground similarities by high-order graph matching, which is introduced into the MRF model to form the global term. Batra et al. [127] proposed the first interactive cosegmentation method, where an automatic recommendation system guides the user to scribble on the uncertain regions for cosegmentation refinement. Several cues, such as uncertainty-based cues, scribble-based cues and image-level cues, are combined to form a recommendation map, where the regions with larger values are suggested to the user for labeling.

Apart from the MRF segmentation model, Collins et al. [128] extended the Random Walk model to cosegmentation, which results in a convex minimization problem with box constraints and admits a GPU implementation. In [129], active contour based segmentation is extended to cosegmentation, combining the foreground consistencies across images with the background consistencies within each image. Due to the linear similarity measurement, the minimization can be solved by the level set technique.

B. New cosegmentation models

The second category tries to solve the cosegmentation problem with new strategies rather than by extending existing single-image segmentation models. For example, by treating common object extraction as a common region clustering problem, cosegmentation can be solved with a clustering strategy. Joulin et al. [38] treated the cosegmentation labeling as training data and trained a supervised classifier on the labeling to check whether the given labels maximally separate the foreground and background classes; cosegmentation is then formulated as searching for the labels that lead to the best classification. A spectral method (Normalized Cuts) is adopted for the bottom-up clustering, and discriminative clustering is used to share the bottom-up clustering among images. The two clusterings are finally combined to form the cosegmentation model, which is solved by convex relaxation and efficient low-rank optimization. Kim et al. [130] solved cosegmentation with a clustering strategy that divides the images into hierarchical superpixel layers and describes the relationships between superpixels with a graph. An affinity matrix considering intra-image and inter-image edge affinities is constructed, and cosegmentation is then solved by spectral clustering.

By representing region similarity relationships as the edge weights of a graph, graph theory has also been used to solve cosegmentation. In [131], each image is represented as a set of object proposals, and a random forest regression based model is learned to select the objects from the backgrounds; the relationships between the foregrounds are modeled with a fully connected graph, and cosegmentation is achieved by loopy belief propagation. Meng et al. [132] constructed a directed graph structure to describe the foreground relationships by considering only neighbouring images. Object cosegmentation is then formulated as a shortest path problem, which can be solved by dynamic programming.
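The shortest-path formulation can be sketched as follows: each image contributes a layer of candidate foreground proposals, neighbouring layers are connected with a dissimilarity cost, and dynamic programming recovers the cheapest chain of proposals. The descriptor arrays and the `cost` function are assumptions of this sketch, not the exact construction of [132].

```python
# Dynamic-programming shortest path over per-image proposal layers.
import numpy as np

def shortest_path_coseg(layers, cost):
    """layers: list of (n_i, D) proposal descriptor arrays, one per image."""
    best = np.zeros(len(layers[0]))            # cost of best path so far
    back = []                                  # backpointers per transition
    for prev, cur in zip(layers, layers[1:]):
        trans = np.array([[cost(a, b) for a in prev] for b in cur])
        scores = best[None, :] + trans         # (n_cur, n_prev)
        back.append(scores.argmin(1))
        best = scores.min(1)
    # Trace back the selected proposal index in every image.
    idx = [int(best.argmin())]
    for bp in reversed(back):
        idx.append(int(bp[idx[-1]]))
    return idx[::-1]
```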

Some methods learn a prior of the common objects, which is then used to segment them. Sun et al. [133] solved cosegmentation by learning discriminative part detectors of the object. By forming positive and negative training parts from the given images and other images, the discriminative parts are learned based on the fact that a part detector of the common objects should fire more frequently on positive samples than on negative ones. The problem is finally formulated as latent SVM learning with group sparsity regularization. Dai et al. [134] proposed the cosketch model, which extends the active basis model to cosegmentation. In cosketch, a deformable shape template represented by a codebook is generated to align and extract the common object. The template is introduced into an unsupervised learning process that iteratively sketches the images based on the shape and segment cues and re-learns the shape and segment templates with the model parameters. With cosketch, objects of similar shape can be well segmented.

There are also other strategies. Faktor et al. [135] solved cosegmentation based on the similarity of composition, where the likelihood of a co-occurring region is high if it is non-trivial and matches well with other compositions. The co-occurring regions between images are first detected; then consensus scoring, performed by transferring information among the regions, yields cosegments from the co-occurring regions. Mukherjee et al. [40] evaluated the foreground similarity by forcing low entropy of the matrix formed by the foreground features, which handles scale variation very well.

C. New cosegmentation problems

Many applications require cosegmentation on a large-scale set of images, which is extremely time consuming. Kim et al. [35] solved large-scale cosegmentation by temperature maximization on anisotropic heat diffusion (called CoSand), which starts with finite heat sources and performs diffusion among images based on superpixels; the model is submodular and admits a fast solver. Wang et al. [136] proposed a semi-supervised learning based method for large-scale image cosegmentation. With very limited training data, the cosegmentation model is built from an inter-image distance term that measures the foreground consistencies among images, an intra-image distance term that evaluates the segmentation smoothness within each image, and a balance term that avoids assigning the same label to all superpixels. The model is converted and approximated to a convex QP problem, which can be solved in polynomial time using an active set method. Zhu et al. [137] proposed the first method that uses a search engine to retrieve images similar to the input image, analyzes the object-of-interest information, and then uses this information to cut out the object-of-interest. Rubinstein et al. [138] observed that web image datasets always contain noise images (which do not contain the common objects) and proposed a cosegmentation model that avoids them; the main idea is to match the foregrounds using SIFT flow to decide which images do not contain the common objects.

Applying cosegmentation to improve image classification is an important application of cosegmentation. Chai et al. [139] proposed a bi-level cosegmentation method and used it for image classification. It consists of a bottom level, obtained by the GrabCut algorithm to initialize the foregrounds, and a top level with a discriminative classifier to propagate the information. Later, a TriCos model [140] containing three levels (image level (GrabCut), dataset level and category level) was further proposed, which outperforms the bi-level model. Kuttel et al. [141] proposed to propagate segmentation masks in ImageNet by exploiting the hierarchical structure of the database. Recently, Meng et al. [142] proposed to conduct cosegmentation on multiple image groups instead of a single image group, motivated by the observation that applications usually involve multiple groups of images, and that the imbalance of cosegmentation difficulty across groups (e.g., it is easy to extract common objects from some image groups while others remain difficult) can be exploited to improve the holistic cosegmentation results.

Personal albums on the internet, which depict events over short periods, are a popular source of large image data that needs further analysis. The typical scenario is that an album contains multiple objects of different categories and each image contains a subset of them. This differs from the classical cosegmentation models introduced above, which usually assume that each image contains the same common foreground with strong variations in the backgrounds. The problem of extracting multiple objects from personal albums is called "Multiple Foreground Cosegmentation" (MFC) (see Fig. 6). Kim and Xing [34] proposed the first method to handle the MFC problem. Their method starts by building appearance models (GMM & SPM) from user-provided bounding boxes which enclose the objects-of-interest, and over-segments the images into superpixels by [35]. Beam search is then used to find proposal candidates for each foreground. Finally, the candidates are stitched into non-overlapping regions using dynamic programming. Ma et al. [143] formulated multiple foreground cosegmentation as semi-supervised learning (graph transduction) and introduced connectivity constraints to enforce the extraction of connected regions. A cutting-plane algorithm was designed to efficiently minimize the model in polynomial time.

Kim's and Ma's methods impose only an implicit constraint on objects using low-level cues, so they might assign labels of "stuff" (grass, sky) to "things" (people or other objects). Zhu et al. [144] proposed a principled CRF framework which explicitly expresses object constraints from object detectors and solves an even more challenging problem: multiple foreground recognition and cosegmentation (MFRC). They proposed an extended multiple color-line based object detector which can be trained online from user-provided bounding boxes to detect objects in unlabeled images. Finally, all the cues, from bottom-up pixels, middle-level contours and high-level object detectors, are integrated in a robust high-order CRF model, which enforces label consistency among pixels, superpixels and object detections simultaneously, produces higher accuracy in object regions and achieves state-of-the-art performance on the MFRC problem. Later, Zhu et al. [145] further studied another challenging problem, multiple human identification and cosegmentation, and proposed a novel shape cue which uses geodesic filters [84] and joint-bilateral filters to transform the blurry response maps from multiple color-line object detectors and poselet models into an edge-aligned shape prior, leading to promising human identification and cosegmentation performance.

VIII. DATASET AND EVALUATION METRICS

A. Datasets

To inspire new methods and objectively evaluate their performance for certain applications, different datasets and evaluation metrics have been proposed. Initially, the huge labeling cost limited the size of the datasets [146], [21] (typically hundreds of images). Recently, with the popularity of crowd-sourcing platforms such as Amazon Mechanical Turk (AMT) and LabelMe [147], the labeling cost is shared over internet users, which makes large datasets with millions of images and labels possible. Below we summarize the most influential datasets widely used in the existing segmentation literature, ranging from bottom-up image segmentation to holistic scene understanding:


Fig. 6. Given a user's photo stream about a certain event, which consists of a finite set of objects (e.g., red-girl, blue-girl, blue-baby and apple basket), each image contains an unknown subset of them; this is called the "Multiple Foreground Cosegmentation" problem. Kim and Xing [34] proposed the first method to extract the irregularly occurring objects from the photo stream (b) by using a few bounding boxes provided by the user in (a).


1) Single image segmentation datasets: The Berkeley segmentation benchmark dataset (BSDS) [146] is one of the earliest and largest datasets for contour detection and single-image, object-agnostic segmentation with human annotations. The latest BSDS dataset contains 200 images for training, 100 for validation and 200 for testing. Each image is annotated by at least 3 subjects. Though small, it remains one of the most difficult segmentation datasets, as it contains various object classes with great pose variation, background clutter and other challenges. It has also been used to evaluate superpixel segmentation methods. Recently, Li et al. [148] proposed a new benchmark based on BSDS which can evaluate semantic segmentation at the object or part level.

The MSRC interactive segmentation dataset [21] includes 50 images with a single binary ground truth for evaluating interactive segmentation accuracy. This dataset also provides imitated inputs, such as labeling-lasso and rectangle annotations, with labels for background, foreground and unknown areas.

2) Cosegmentation datasets: The MSRC cosegmentation dataset [125] has been used to evaluate image-pair binary cosegmentation. It contains 25 image pairs with similar foreground objects but heterogeneous backgrounds, which matches the assumptions of early cosegmentation methods [125], [126], [41]. Some image pairs are picked to contain some camouflage, to balance database bias; these form the baseline cosegmentation dataset.

The iCoseg dataset [127] is a large binary-class image cosegmentation dataset for more realistic scenarios. It contains 38 groups with a total of 643 images, ranging from wild animals and popular landmarks to sports teams and other groups containing similar foregrounds. Each group contains images of similar object instances from different poses with some variation in the background. iCoseg is challenging because the objects deform considerably in terms of viewpoint and illumination, and in some cases only a part of the object is visible. This contrasts significantly with the restrictive scenario of the MSRC cosegmentation dataset.

The FlickrMFC dataset [34] is the only dataset for multiple foreground cosegmentation; it consists of 14 groups of images with manually labeled ground truth. Each group includes 10~20 images sampled from a Flickr photostream. The image content covers daily scenarios such as children playing, fishing and sports. This dataset is perhaps the most challenging cosegmentation dataset, as it contains a number of recurring subjects that are not necessarily present in every image, together with strong occlusions, lighting variations, and scale or pose changes. Meanwhile, serious background clutter and variation often make even state-of-the-art object detectors fail in these realistic scenarios.

3) Video segmentation/cosegmentation datasets: The SegTrack dataset [149] is a large binary-class video segmentation dataset with pixel-level annotations of the primary objects. It contains six videos (bird, birdfall, girl, monkey-dog, parachute and penguin) with challenging cases, including foreground/background occlusion, large shape deformation and camera motion.

The CVC binary-class video cosegmentation dataset [150] contains 4 synthetic videos, which paste the same foreground onto different backgrounds, and 2 videos sampled from SegTrack. It forms a restrictive dataset for early video cosegmentation methods.

The MPI multi-class video cosegmentation dataset [151] was proposed to evaluate video cosegmentation approaches in more challenging, multi-class scenarios. It contains 4 video sets sampled from Youtube, comprising 11 videos with around 520 ground-truth frames. Each video set has a different number of object classes appearing in it. Moreover, the dataset includes challenging lighting, motion blur and image condition variations.

The Cambridge-driving video dataset (CamVid) [152] is a collection of videos labeled with 32 semantic classes (e.g., building, tree, sidewalk, traffic light), captured by a fixed-position CCTV-style camera on a driving automobile, with over 10 minutes of footage at 30 Hz. The dataset contains 4 video sequences with more than 700 annotated images at a resolution of 960×720; three were captured in daylight and the remaining one at dusk. The number and heterogeneity of object classes vary across the video sequences.

4) Static scene parsing datasets: The MSRC 23-class dataset [24] consists of 23 classes and 591 images. Due to the rarity of the 'horse' and 'mountain' classes, these two classes are often ignored for training and evaluation. The remaining 21 classes contain diverse objects, but the annotated ground truth is quite rough.

The PASCAL VOC dataset [153] provides a large-scale benchmark for evaluating object detection and semantic segmentation. Starting from the initial 4 object classes in 2005, the PASCAL dataset now includes 20 object classes under four major categories (animal, person, vehicle and indoor). The latest train/val dataset has 11,530 images and 6,929 segmentations.

LabelMe + SUN datasets: LabelMe [147] was initiated by MIT CSAIL and provides a dataset of annotated images. The dataset is still growing; it contains copyright-free images and is open to public contribution. As of October 31, 2010, LabelMe had 187,240 images, 62,197 annotated images, and 658,992 labeled objects. SUN [154] is a dataset subsampled from LabelMe. It contains 45,676 images (21,182 indoor and 24,494 outdoor) with 515 object categories in total. One noteworthy point is that the number of objects per class is uneven, which can cause unsatisfactory segmentation accuracy for rare classes.

SIFT Flow dataset: The popularity of nonparametric scene parsing requires a large labeled dataset. The SIFT Flow dataset [124] is composed of 2,688 images thoroughly labeled by LabelMe users. Liu et al. [124] split this dataset into 2,488 training images and 200 test images and used synonym correction to obtain 33 semantic labels.

The Stanford background dataset [28] consists of around 720 images sampled from existing datasets such as LabelMe, MSRC and PASCAL VOC, with content covering rural, urban and harbor scenes. Each image pixel is given two labels: one for its semantic class (sky, tree, road, grass, water, building, mountain and foreground) and one for its geometric property (sky, vertical and horizontal).

NYU dataset [155]: The NYU-depth V2 dataset is comprised of video sequences from a variety of indoor scenes recorded by both the RGB and depth cameras of the Microsoft Kinect. It features 1,449 densely labeled pairs of aligned RGB and depth images, 464 scenes taken from 3 cities, and 407,024 unlabeled frames. Each object is labeled with a class and an instance number (cup1, cup2, cup3, etc.).

Microsoft COCO [156] is a recent dataset for holistic scene understanding, which provides 328K images with 91 object classes. One substantial difference from other large datasets, such as PASCAL VOC and SUN, is that Microsoft COCO contains many more labeled instances, on the order of millions. The authors argue that this can facilitate training object detectors with better localization and learning contextual information.

B. Evaluation metrics

As segmentation is an ill-defined problem, how to evaluate an algorithm's goodness remains an open question. In the past, evaluation was mainly conducted through subjective human inspection or by evaluating the performance of a subsequent vision system that uses the segmentation. To objectively evaluate a method, it is desirable to associate the segmentation with perceptual grouping. The current trend is to develop a benchmark [146] consisting of human-labeled segmentations and to compare the algorithm's output against the human-labeled results using some metric of segmentation quality. Various evaluation metrics have been proposed:

• Boundary Matching: This method matches the algorithm-generated boundaries with the human-labeled boundaries and then computes some metric of the matching quality. The precision-recall framework proposed by Martin et al. [157] is among the most widely used evaluation metrics: Precision measures the proportion of machine-generated boundaries that can be found among the human-labeled boundaries and is sensitive to over-segmentation, while Recall measures the proportion of human-labeled boundaries that can be found among the machine-generated boundaries and is sensitive to under-segmentation. In general, this method is sensitive to the granularity of the human labeling.
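A minimal sketch of boundary precision and recall with a distance tolerance follows; benchmarks such as [146] use bipartite matching of boundary pixels, whereas this simpler distance-transform approximation only illustrates the idea.

```python
# Boundary precision/recall: a predicted boundary pixel counts as correct
# if a ground-truth boundary pixel lies within `tol` pixels, and vice versa.
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_pr(pred, gt, tol=2):
    """pred, gt: boolean (H, W) boundary maps."""
    d_gt = distance_transform_edt(~gt)      # distance to nearest GT boundary
    d_pred = distance_transform_edt(~pred)  # distance to nearest prediction
    precision = (d_gt[pred] <= tol).mean() if pred.any() else 0.0
    recall = (d_pred[gt] <= tol).mean() if gt.any() else 0.0
    return precision, recall
```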

• Region Covering: This method [157] operates by checking the overlap between the machine-generated regions and the human-labeled regions. Let $S_{seg}$ and $S_{hum}$ denote the machine segmentation and the human segmentation, respectively, and let $C(S_{seg}, p_i)$ and $C(S_{hum}, p_i)$ denote the corresponding segment regions for pixel $p_i$ from the pixel set $P = \{p_1, p_2, ..., p_n\}$. The relative region covering error at $p_i$ is

$$O(S_{seg}, S_{hum}, p_i) = \frac{|C(S_{seg}, p_i) \setminus C(S_{hum}, p_i)|}{|C(S_{seg}, p_i)|} \qquad (31)$$

where $\setminus$ is the set difference operator. The global region covering error is defined as

$$GCE(S_{seg}, S_{hum}) = \frac{1}{n} \min\Big\{ \sum_i O(S_{seg}, S_{hum}, p_i), \; \sum_i O(S_{hum}, S_{seg}, p_i) \Big\}. \qquad (32)$$

However, when each pixel is its own segment or the whole image is a single segment, the GCE becomes zero, which is undesirable. To alleviate this problem, the authors proposed to replace the min operation with a max operation, but such a change does not encourage segmentation at finer detail. Another commonly used region-based criterion is the Intersection-over-Union, which checks the overlap between $S_{seg}$ and $S_{hum}$:

$$IoU(S_{seg}, S_{hum}) = \frac{|S_{seg} \cap S_{hum}|}{|S_{seg} \cup S_{hum}|} \qquad (33)$$
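The sketch below computes the region-covering measures above from label maps: the local error of Eq. (31) accumulated as in Eq. (32), and the binary IoU of Eq. (33).

```python
import numpy as np

def _covering_error(a, b):
    """Sum over pixels of |C(a,p) \\ C(b,p)| / |C(a,p)| (Eq. 31)."""
    nb = b.max() + 1
    pair = a.ravel().astype(np.int64) * nb + b.ravel()  # joint region ids
    n_ij = np.bincount(pair)              # intersection sizes
    n_i = np.bincount(a.ravel())          # region sizes in a
    err = 0.0
    for p, nij in enumerate(n_ij):
        if nij:
            ni = n_i[p // nb]
            err += nij * (ni - nij) / ni
    return err

def gce(seg, hum):
    """Global region covering error between two integer label maps (Eq. 32)."""
    n = seg.size
    return min(_covering_error(seg, hum), _covering_error(hum, seg)) / n

def iou(seg_mask, hum_mask):
    """Binary IoU (Eq. 33) for boolean foreground masks."""
    return (seg_mask & hum_mask).sum() / (seg_mask | hum_mask).sum()
```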

• Variation of Information (VI): This metric [158] measures the distance between two segmentations $S_{seg}$ and $S_{hum}$ using average conditional entropy. Assume that $S_{seg}$ and $S_{hum}$ have clusters $C_{seg} = \{C_{seg}^1, C_{seg}^2, ..., C_{seg}^K\}$ and $C_{hum} = \{C_{hum}^1, C_{hum}^2, ..., C_{hum}^{K'}\}$, respectively. The variation of information is defined as:

$$VI(S_{seg}, S_{hum}) = H(S_{seg}) + H(S_{hum}) - 2 I(S_{seg}, S_{hum}) \qquad (34)$$


where $H(S_{seg})$ is the entropy associated with the clustering $S_{seg}$:

$$H(S_{seg}) = -\sum_{k=1}^{K} P(k) \log P(k) \qquad (35)$$

with $P(k) = n_k / n$, where $n_k$ is the number of elements in cluster $k$ and $n$ is the total number of elements in $S_{seg}$, and $I(S_{seg}, S_{hum})$ is the mutual information between $S_{seg}$ and $S_{hum}$:

$$I(S_{seg}, S_{hum}) = \sum_{k=1}^{K} \sum_{k'=1}^{K'} P(k, k') \log \frac{P(k, k')}{P(k) P(k')} \qquad (36)$$

with $P(k, k') = |C_{seg}^k \cap C_{hum}^{k'}| / n$.

Although VI possesses some interesting properties, its perceptual meaning and its potential for evaluation against more than one ground truth are unclear [49].
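For reference, the sketch below computes VI (Eqs. 34-36) from two integer label maps via their joint label histogram, using the natural logarithm.

```python
import numpy as np

def variation_of_information(seg, hum):
    """VI between two integer label maps of the same shape."""
    n = seg.size
    joint = np.zeros((seg.max() + 1, hum.max() + 1))
    np.add.at(joint, (seg.ravel(), hum.ravel()), 1.0)
    p_kk = joint / n                       # P(k, k')
    p_k, p_q = p_kk.sum(1), p_kk.sum(0)    # marginals P(k), P(k')

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    nz = p_kk > 0
    mi = (p_kk[nz] * np.log(p_kk[nz] / np.outer(p_k, p_q)[nz])).sum()
    return entropy(p_k) + entropy(p_q) - 2 * mi
```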

• Probabilistic Rand Index (PRI): PRI was introduced to measure the compatibility of assignments between pairs of elements in $S_{seg}$ and a set of human segmentations $\{S_{hum}^1, S_{hum}^2, ..., S_{hum}^n\}$. It is defined to deal with multiple ground-truth segmentations [159]:

$$PRI(S_{seg}, \{S_{hum}^k\}) = \frac{1}{\binom{N}{2}} \sum_{i<j} \big[ c_{ij} p_{ij} + (1 - c_{ij})(1 - p_{ij}) \big] \qquad (37)$$

where $S_{hum}^k$ is the $k$-th human-labeled ground truth, $c_{ij}$ is the event that pixels $i$ and $j$ have the same label, and $p_{ij}$ is the corresponding probability estimated from the ground truths. As reported in [51], PRI has a small dynamic range, and its values across images and algorithms are often similar, which makes differentiation difficult.
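A direct (O(N²) over pixel pairs, so only practical for tiny maps) sketch of Eq. (37), estimating $p_{ij}$ as the fraction of ground-truth segmentations in which pixels $i$ and $j$ share a label:

```python
import numpy as np
from itertools import combinations

def pri(seg, gts):
    """seg: (N,) flattened labels; gts: list of (N,) ground-truth labelings."""
    n = len(seg)
    total = 0.0
    for i, j in combinations(range(n), 2):
        c = float(seg[i] == seg[j])
        p = np.mean([g[i] == g[j] for g in gts])   # estimated p_ij
        total += c * p + (1 - c) * (1 - p)
    return total / (n * (n - 1) / 2)
```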

IX. DISCUSSIONS AND FUTURE DIRECTIONS

A. Design flavors

When designing a new segmentation algorithm, it is often difficult to choose among various design flavors, e.g., whether to use superpixels, or whether to use more images; it all depends on the application. Our literature review casts some light on the pros and cons of some commonly used design flavors, which are worth considering before committing to a specific setup.

Patch vs. region vs. object proposal: Careful readers might notice a significant trend of migrating from patch-based analysis to region-based (or superpixel-based) analysis. Continuous improvements in boundary recall and execution time make superpixels a fast preprocessing technique. The advantages of using superpixels lie not only in reduced training and inference time, but also in the more complex and discriminative features that can be exploited. On the other hand, superpixels are not perfect and can introduce new structural errors. For users who care more about visual quality along segment boundaries, pixel-based approaches or hybrid approaches combining pixels and superpixels are better options. The structural errors of superpixels can also be alleviated by producing multiple over-segmentations with different methods and parameters, or by using fast edge-aware filtering to refine the boundary. For users who care more about localization accuracy, the region-based way is preferable given its various advantages, while the boundary loss can be neglected. Another rising trend worth mentioning is the application of object region proposals [131], [132], [160]. Due to the larger support provided by object-like regions compared with over-segmentations or pixels, more complex classifiers and region-level features can be applied. However, the recall of object proposals is still not satisfactory (around 60%), so more careful designs are needed when accuracy is a major concern.

Intrinsic cues vs. extrinsic cues: Although intrinsic cues (the features and prior knowledge from a single image) still play the dominant role in existing computer vision applications, extrinsic cues coming from multiple images (such as multi-view images, video sequences, or a very large image dataset of similar products) are attracting more and more attention. An intuitive explanation of why extrinsic cues convey more semantics is statistical: when many signals are available, the signals which repeatedly appear form patterns of interest, while the errors are averaged out. Therefore, if multiple images contain redundant but diverse information, incorporating extrinsic cues should bring some improvement. When using extrinsic cues, the source of information needs to be considered in the algorithm design: more robust constraints, such as cues from multi-view geometry or spatial-temporal relationships, should be exploited first, and when working with external information such as a large dataset of heterogeneous data, a mechanism that can handle noisy information should be developed.

Hand-crafted features vs. learned features: Hand-crafted features, such as intensity, color, SIFT and Bag-of-Words, have played important roles in computer vision. These simple and training-free features have been applied to many applications, and their effectiveness has been widely examined. However, the generalization of these hand-crafted features from one task to another depends on heuristic feature design and combination, which can compromise performance, and the development of new low-level features has become more and more challenging. On the other hand, features learned from labeled databases have recently demonstrated advantages in applications such as scene understanding and object detection. The effectiveness of learned features comes from the context information captured over longer spatial arrangements and higher order co-occurrences; with labeled data, some structured noise is eliminated, which helps highlight the salient structures. However, learned features only detect patterns for the tasks they were trained on; migrating to other tasks requires newly labeled data, which is time consuming and expensive to obtain. Therefore, it is suggested to start with hand-crafted features; if labeled data is available, learned features usually boost the performance.

B. Promising future directions

Based on the reviewed literature and the recent developments in the segmentation community, we suggest some future directions worth exploring:


Beyond single image: Although single image segmentation has played a dominant role in the past decades, the recent trend is to use the internal structure of multiple images to facilitate segmentation. When a human perceives a scene, motion cues and depth cues provide extra information, which should not be neglected. For example, when the image data comes from a video, the 3D geometry estimated with structure-from-motion techniques can help image understanding [161], [162]. The depth information captured by commodity depth cameras such as Kinect also benefits indoor scene understanding [163], [164]. Beyond 3D information, useful information from other 2D images in a large database, either organized (like ImageNet) or unorganized (like a product or animal category), or from multiple heterogeneous datasets of the same category, can also be exploited [141], [142].

Toward multiple instances: Most segmentation methods still aim to produce a single most likely solution, e.g., all pixels of the same category are assigned the same label. On the other hand, people are also interested in knowing "how many" objects of the same category there are, which is a limitation of existing works. Recently, some works have made progress in this direction by combining state-of-the-art detectors with CRF modeling [115], [165], [166]. In addition, the recently released Microsoft COCO dataset [156], which contains many images with labeled instances, can be expected to boost development in this direction.

Become more holistic: Most existing works consider the image labeling task alone [105], [24]. On the other hand, studying a related task can improve the scene analysis problem at hand. For example, by combining object detection with image classification, Yao et al. [167] proposed a system more robust and accurate than those performing a single task. Such a holistic understanding trend can also be seen in other works combining context [111], [168], geometry [28], [169], [108], [170], attributes [171], [172], [173], [174], [175] or even language [176], [177]. It is therefore expected that a more holistic system should outperform those performing monotone analysis, though at an increased inference cost.

Go from shallow to deep: Features play an important role in many vision applications, e.g., stereo matching, detection and segmentation. Good features, which are highly discriminative for differentiating "objects" from "stuff", can help object segmentation significantly. However, most features today are hand-crafted for specific tasks and might not generalize well when applied to other tasks. Recently, features learned with multi-layer neural networks [178] have been applied to many vision tasks, including object classification [179], face detection [180], pose estimation [181], object detection [182] and semantic segmentation [107], achieving state-of-the-art performance. It is therefore interesting to explore whether such learned features can benefit the aforementioned more complex tasks.

X. CONCLUSION

In this paper, we have conducted a thorough review of the recent development of image segmentation methods, including classic bottom-up methods, interactive methods, object region proposals, semantic parsing and image cosegmentation. We have discussed the pros and cons of different methods, and suggested some design choices and future directions worth exploring.

Segmentation as a community has achieved substantial progress in the past decades. Although some technologies such as interactive segmentation have been commercialized in recent Adobe and Microsoft products, there is still a long way to go before segmentation becomes a general and reliable tool widely applied in practical applications. This is mostly due to the existing methods' limitations in robustness and efficiency: for example, given images under various degradations (strong illumination change, noise corruption or rain), the performance of existing methods can drop significantly [183]. With the rapid improvement of computing hardware, more effective and robust methods will be developed, and the breakthrough is likely to come from collaborations with other engineering and science communities, such as physics, neurology and mathematics.

REFERENCES

[1] M. Wertheimer, "Laws of organization in perceptual forms," in Psychologische Forschung, 1929, pp. 301–350.
[2] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image segmentation using expectation-maximization and its application to image querying," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1026–1038, 2002.
[3] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, 2012.
[4] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, 2002.
[5] J. Lu, H. Yang, D. Min, and M. N. Do, "PatchMatch filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation," in CVPR, 2013.
[6] P. Kohli, M. P. Kumar, and P. H. S. Torr, "P3 & beyond: Move making algorithms for solving higher order functions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, 2009.
[7] G. Mori, "Guiding model search using segmentation," in ICCV, 2005.
[8] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik, "Recognition using regions," in CVPR, 2009.
[9] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, 2012.
[10] I. Endres and D. Hoiem, "Category-independent object proposals with diverse ranking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, 2014.
[11] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
[12] D. R. Martin, C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color, and texture cues," IEEE Trans. Pattern Anal. Mach. Intell., 2004.
[13] D. D. Hoffman and M. Singh, "Salience of visual parts," Cognition, vol. 63, no. 1, pp. 29–78, 1997.
[14] R. Ohlander, K. Price, and D. R. Reddy, "Picture segmentation using a recursive region splitting method," Computer Graphics and Image Processing, vol. 8, pp. 313–333, 1978.
[15] C. R. Brice and C. L. Fennema, "Scene analysis using regions," Artif. Intell., vol. 1, no. 3, pp. 205–226, 1970.
[16] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[17] L. Vincent and P. Soille, "Watersheds in digital spaces: An efficient algorithm based on immersion simulations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp. 583–598, 1991.


[18] R. Beare, "A locally constrained watershed transform," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1063–1074, 2006.
[19] T. F. Chan and L. A. Vese, "Active contours without edges," IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266–277, 2001.
[20] S. Osher and J. A. Sethian, "Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations," Journal of Computational Physics, vol. 79, no. 1, pp. 12–49, 1988.
[21] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," ACM Trans. Graph., vol. 23, no. 3, 2004.
[22] W. Yang, J. Cai, J. Zheng, and J. Luo, "User-friendly interactive image segmentation through unified combinatorial user inputs," IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2470–2479, 2010.
[23] T. N. A. Nguyen, J. Cai, J. Zhang, and J. Zheng, "Robust interactive image segmentation using convex active contours," IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3734–3743, 2012.
[24] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," International Journal of Computer Vision, vol. 81, no. 1, 2009.
[25] P. F. Felzenszwalb, R. B. Girshick, and D. A. McAllester, "Cascade object detection with deformable part models," in CVPR, 2010.
[26] D. Hoiem, A. A. Efros, and M. Hebert, "Putting objects in perspective," International Journal of Computer Vision, vol. 80, no. 1, pp. 3–15, 2008.
[27] P. Kohli, L. Ladicky, and P. H. S. Torr, "Robust higher order potentials for enforcing label consistency," International Journal of Computer Vision, vol. 82, no. 3, 2009.
[28] S. Gould, R. Fulton, and D. Koller, "Decomposing a scene into geometric and semantically consistent regions," in ICCV, 2009.
[29] C. Liu, J. Yuen, and A. Torralba, "Nonparametric scene parsing via label transfer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, 2011.
[30] J. Tighe and S. Lazebnik, "Finding things: Image parsing with regions and per-exemplar detectors," in CVPR, 2013.
[31] P. Krahenbuhl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in NIPS, 2011, pp. 109–117.
[32] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr, "Graph cut based inference with co-occurrence statistics," in ECCV, 2010.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[34] G. Kim and E. P. Xing, "On multiple foreground cosegmentation," in CVPR, 2012, pp. 837–844.
[35] G. Kim, E. P. Xing, F.-F. Li, and T. Kanade, "Distributed cosegmentation via submodular optimization on anisotropic diffusion," in ICCV, 2011, pp. 169–176.
[36] A. Joulin, F. Bach, and J. Ponce, "Multi-class cosegmentation," in CVPR, 2012.
[37] J. C. Rubio, J. Serrat, A. M. Lopez, and N. Paragios, "Unsupervised co-segmentation through region matching," in CVPR, 2012.
[38] A. Joulin, F. R. Bach, and J. Ponce, "Discriminative clustering for image co-segmentation," in CVPR, 2010, pp. 1943–1950.
[39] K.-Y. Chang, T.-L. Liu, and S.-H. Lai, "From co-saliency to co-segmentation: An efficient and fully unsupervised energy minimization model," in CVPR, 2011, pp. 2129–2136.
[40] L. Mukherjee, V. Singh, and J. Peng, "Scale invariant cosegmentation for image groups," in CVPR, 2011.
[41] L. Mukherjee, V. Singh, and C. R. Dyer, "Half-integrality based algorithms for cosegmentation of images," in CVPR, 2009.
[42] R. Szeliski, Computer Vision: Algorithms and Applications, ser. Texts in Computer Science. Springer, 2011.
[43] S. Rao, H. Mobahi, A. Y. Yang, S. Sastry, and Y. Ma, "Natural image segmentation with adaptive texture and boundary encoding," in ACCV, 2009.
[44] J. Shi and J. Malik, "Normalized cuts and image segmentation," in CVPR, 1997.
[45] J. Malik, S. Belongie, J. Shi, and T. K. Leung, "Textons, contours and regions: Cue integration in image segmentation," in ICCV, 1999.
[46] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, "Multi-task low-rank affinity pursuit for image segmentation," in ICCV, 2011.
[47] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in NIPS, 2001, pp. 849–856.
[48] T. Cour, F. Benezit, and J. Shi, "Spectral segmentation with multiscale graph decomposition," in CVPR, 2005, pp. 1124–1131.
[49] P. Arbelaez, M. Maire, C. C. Fowlkes, and J. Malik, "From contours to regions: An empirical evaluation," in CVPR, 2009.
[50] A. K. Jain, A. P. Topchy, M. H. C. Law, and J. M. Buhmann, "Landscape of clustering algorithms," in ICPR (1), 2004, pp. 260–263.
[51] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, 2011.
[52] X. Bresson, S. Esedoglu, P. Vandergheynst, J.-P. Thiran, and S. Osher, "Fast global minimization of the active contour/snake model," Journal of Mathematical Imaging and Vision, vol. 28, no. 2, pp. 151–167, 2007.
[53] T. Pock, A. Chambolle, D. Cremers, and H. Bischof, "A convex relaxation approach for computing minimal partitions," in CVPR, 2009, pp. 810–817.
[54] J. Yuan, E. Bae, X.-C. Tai, and Y. Boykov, "A continuous max-flow approach to Potts model," in ECCV, 2010, pp. 379–392.
[55] L. Ambrosio and V. Tortorelli, "Approximation of functionals depending on jumps by elliptic functionals via Γ-convergence," Communications on Pure and Applied Mathematics, vol. XLIII, pp. 999–1036, 1990.
[56] A. Braides and G. Dal Maso, "Non-local approximation of the Mumford-Shah functional," Calculus of Variations and Partial Differential Equations, vol. 5, no. 4, pp. 293–322, 1997.
[57] A. Braides, Approximation of Free-Discontinuity Problems, ser. Lecture Notes in Mathematics. Springer, 1998, vol. 1694.
[58] A. Chambolle, "Image segmentation by variational methods: Mumford and Shah functional and the discrete approximations," SIAM J. Appl. Math., vol. 55, no. 3, pp. 827–863, Jun. 1995.
[59] L. A. Vese and T. F. Chan, "A multiphase level set framework for image segmentation using the Mumford and Shah model," International Journal of Computer Vision, vol. 50, no. 3, pp. 271–293, Dec. 2002.
[60] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, no. 4, pp. 321–331, 1988.
[61] H. Zhu, J. Zheng, J. Cai, and N. Magnenat-Thalmann, "Object-level image segmentation using low level cues," IEEE Transactions on Image Processing, vol. 22, no. 10, pp. 4019–4027, 2013.
[62] X. Ren and J. Malik, "Learning a classification model for segmentation," in ICCV, 2003, pp. 10–17.
[63] A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi, "TurboPixels: Fast superpixels using geometric flows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 12, pp. 2290–2297, 2009.
[64] P. Wang, G. Zeng, R. Gan, J. Wang, and H. Zha, "Structure-sensitive superpixels via geodesic distance," International Journal of Computer Vision, vol. 103, no. 1, pp. 1–21, 2013.
[65] O. Veksler, Y. Boykov, and P. Mehrani, "Superpixels and supervoxels in an energy optimization framework," in ECCV, 2010, pp. 211–224.
[66] Y. Zhang, R. I. Hartley, J. Mashford, and S. Burn, "Superpixels via pseudo-boolean optimization," in ICCV, 2011, pp. 1387–1394.
[67] M. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa, "Entropy-rate clustering: Cluster analysis via maximizing a submodular function subject to a matroid constraint," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 99–112, 2014.
[68] M. V. den Bergh, X. Boix, G. Roig, B. de Capitani, and L. J. V. Gool, "SEEDS: Superpixels extracted via energy-driven sampling," in ECCV, 2012, pp. 13–26.
[69] K. McGuinness and N. E. O'Connor, "A comparative evaluation of interactive segmentation algorithms," Pattern Recognition, vol. 43, no. 2, pp. 434–444, 2010.
[70] F. Yi and I. Moon, "Image segmentation: A survey of graph-cut methods," in ICSAI, 2012.
[71] J. He, C.-S. Kim, and C.-C. J. Kuo, Interactive Segmentation Techniques: Algorithms and Performance Evaluation. Springer, 2013.
[72] M. Kass, A. P. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, no. 4, pp. 321–331, 1988.
[73] E. Mortensen, B. Morse, W. Barrett, and J. Udupa, "Adaptive boundary detection using 'live-wire' two-dimensional dynamic programming," in Computers in Cardiology, 1992, pp. 635–638.
[74] Y. Boykov and M.-P. Jolly, "Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images," in ICCV, 2001, pp. 105–112.
[75] E. N. Mortensen and W. A. Barrett, "Intelligent scissors for image composition," in SIGGRAPH, 1995, pp. 191–198.


[76] Y. Liu and Y. Yu, “Interactive image segmentation based on level setsof probabilities,” IEEE Trans. Vis. Comput. Graph., vol. 18, no. 2, pp.202–213, 2012.

[77] L. Grady, “Random walks for image segmentation,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1768–1783, 2006.

[78] Y. Li, J. Sun, C. Tang, and H. Shum, “Lazy snapping,” ACM Trans.Graph., vol. 23, no. 3, pp. 303–308, 2004.

[79] P. Kohli, A. Osokin, and S. Jegelka, “A principled deep random fieldmodel for image segmentation,” in CVPR, 2013, pp. 1971–1978.

[80] B. L. Price, B. S. Morse, and S. Cohen, “Geodesic graph cut forinteractive image segmentation,” in CVPR, 2010, pp. 3161–3168.

[81] X. Bai and G. Sapiro, “A geodesic framework for fast interactive imageand video segmentation and matting,” in ICCV, 2007, pp. 1–8.

[82] O. Veksler, “Star shape prior for graph-cut image segmentation,” inComputer Vision - ECCV 2008, 10th European Conference on Com-puter Vision, Marseille, France, October 12-18, 2008, Proceedings,Part III, 2008, pp. 454–467.

[83] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman,“Geodesic star convexity for interactive image segmentation,” in TheTwenty-Third IEEE Conference on Computer Vision and Pattern Recog-nition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, 2010,pp. 3129–3136.

[84] A. Criminisi, T. Sharp, and A. Blake, “Geos: Geodesic image segmen-tation,” in Computer Vision - ECCV 2008, 10th European Conferenceon Computer Vision, Marseille, France, October 12-18, 2008, Proceed-ings, Part I, 2008, pp. 99–112.

[85] S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based imagesegmentation with connectivity priors,” in CVPR, 2008.

[86] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz, “Fastcost-volume filtering for visual correspondence and beyond,” IEEETrans. Pattern Anal. Mach. Intell., vol. 35, no. 2, pp. 504–511, 2013.

[87] F. Durand and J. Dorsey, “Fast bilateral filtering for the display ofhigh-dynamic-range images,” ACM Trans. Graph., vol. 21, no. 3, pp.257–266, 2002.

[88] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 35, no. 6, pp. 1397–1409, 2013.

[89] J. Lu, K. Shi, D. Min, L. Lin, and M. N. Do, “Cross-based localmultipoint filtering,” in CVPR, 2012, pp. 430–437.

[90] R. Adams and L. Bischof, “Seeded region growing,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 16, no. 6, pp. 641–647, Jun 1994.

[91] J. Ning, L. Zhang, D. Zhang, and C. Wu, “Interactive image segmenta-tion by maximal similarity based region merging,” Pattern Recognition,vol. 43, no. 2, pp. 445–456, 2010.

[92] L. D. Bourdev, S. Maji, T. Brox, and J. Malik, “Detecting people usingmutually consistent poselet activations,” in ECCV, 2010.

[93] D. Larlus and F. Jurie, “Combining appearance models and markovrandom fields for category level object segmentation,” in CVPR, 2008.

[94] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness ofimage windows,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34,no. 11, pp. 2189–2202, 2012.

[95] J. Hosang, R. Benenson, and B. Schiele, “How good are detection pro-posals, really?” in British Machine Vision Conference (BMVC), BritishMachine Vision Association. British Machine Vision Association,September 2014.

[96] A. Borji, M. Cheng, H. Jiang, and J. Li, “Salient object detection: Asurvey,” CoRR, vol. abs/1411.5878, 2014.

[97] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering occlusionboundaries from an image,” International Journal of Computer Vision,vol. 91, no. 3, pp. 328–346, 2011.

[98] A. Humayun, F. Li, and J. M. Rehg, “RIGOR: reusing inference ingraph cuts for generating object regions,” in CVPR, 2014, pp. 336–343.

[99] P. Kohli, J. Rihan, M. Bray, and P. H. S. Torr, “Simultaneous seg-mentation and pose estimation of humans using dynamic graph cuts,”International Journal of Computer Vision, vol. 79, no. 3, pp. 285–298,2008.

[100] P. Krahenbuhl and V. Koltun, “Geodesic object proposals,” in ComputerVision - ECCV 2014 - 13th European Conference, Zurich, Switzerland,September 6-12, 2014, Proceedings, Part V, 2014, pp. 725–739.

[101] P. Dollar and C. L. Zitnick, “Structured forests for fast edge detection,”in ICCV, 2013, pp. 1841–1848.

[102] P. A. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik,“Multiscale combinatorial grouping,” in CVPR, 2014, pp. 328–335.

[103] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M.Smeulders, “Segmentation as selective search for object recognition,”in ICCV, 2011, pp. 1879–1886.

[104] S. Manen, M. Guillaumin, and L. J. V. Gool, “Prime object proposalswith randomized prim’s algorithm.” in ICCV, 2013.

[105] X. He, R. S. Zemel, and M. . Carreira-Perpin, “Multiscale conditionalrandom fields for image labeling.” in CVPR, 2004, pp. 695–702.

[106] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests forimage categorization and segmentation.” in CVPR, 2008.

[107] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierar-chical features for scene labeling,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 35, no. 8, pp. 1915–1929, Aug 2013.

[108] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layoutfrom an image,” International Journal of Computer Vision, vol. 75,no. 1, pp. 151–172, 2007.

[109] J. Tighe and S. Lazebnik, “Superparsing - scalable nonparametricimage parsing with superpixels.” International Journal of ComputerVision, vol. 101, no. 2, pp. 329–349, 2013.

[110] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr, “Associativehierarchical crfs for object class image segmentation,” in IEEE 12thInternational Conference on Computer Vision, ICCV 2009, Kyoto,Japan, September 27 - October 4, 2009, 2009, pp. 739–746.

[111] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, "Objects in context," in ICCV, 2007, pp. 1–8.

[112] C. Galleguillos, A. Rabinovich, and S. Belongie, "Object categorization using co-occurrence, location and appearance," in CVPR, 2008.

[113] C. Galleguillos, B. McFee, S. J. Belongie, and G. R. G. Lanckriet, "Multi-class object localization by combining local contextual interactions," in CVPR, 2010, pp. 113–120.

[114] P. F. Felzenszwalb, D. A. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in CVPR, 2008.

[115] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr, "What, where and how many? Combining object detectors and CRFs," in ECCV, 2010.

[116] G. Floros, K. Rematas, and B. Leibe, "Multi-class image labeling with top-down segmentation and generalized robust $P^N$ potentials," in BMVC, 2011.

[117] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 259–289, 2008.

[118] P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. D. Bourdev, and J. Malik, "Semantic segmentation using regions and parts," in CVPR, 2012, pp. 3378–3385.

[119] R. Guo and D. Hoiem, "Beyond the line of sight: Labeling the underlying surfaces," in ECCV, 2012, pp. 761–774.

[120] Z. Tu, "Auto-context and its application to high-level vision tasks," in CVPR, 2008.

[121] W. Xia, C. Domokos, J. Dong, L.-F. Cheong, and S. Yan, "Semantic segmentation without annotating segments," in ICCV, 2013.

[122] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis, and C. Rother, "A comparative study of modern inference techniques for discrete energy minimization problems," in CVPR, 2013, pp. 1328–1335.

[123] V. Vineet, J. Warrell, and P. H. S. Torr, "Filter-based mean-field inference for random fields with higher-order terms and product label-spaces," International Journal of Computer Vision, vol. 110, no. 3, pp. 290–307, 2014.

[124] C. Liu, J. Yuen, and A. Torralba, "SIFT flow: Dense correspondence across scenes and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, 2011.

[125] C. Rother, T. P. Minka, A. Blake, and V. Kolmogorov, "Cosegmentation of image pairs by histogram matching: Incorporating a global constraint into MRFs," in CVPR, 2006, pp. 993–1000.

[126] D. S. Hochbaum and V. Singh, “An efficient algorithm for co-segmentation,” in ICCV, 2009, pp. 269–276.

[127] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive co-segmentation with intelligent scribble guidance," in CVPR, 2010.

[128] M. D. Collins, J. Xu, L. Grady, and V. Singh, "Random walks based multi-image segmentation: Quasiconvexity results and GPU-based solutions," in CVPR, 2012.

[129] F. Meng, H. Li, and G. Liu, "Image co-segmentation via active contours," in ISCAS, 2012, pp. 2773–2776.

[130] E. Kim, H. Li, and X. Huang, "A hierarchical image clustering cosegmentation framework," in CVPR, 2012, pp. 686–693.

[131] S. Vicente, C. Rother, and V. Kolmogorov, "Object cosegmentation," in CVPR, 2011, pp. 2217–2224.

[132] F. Meng, H. Li, G. Liu, and K. N. Ngan, "Object co-segmentation based on shortest path algorithm and saliency model," IEEE Trans. Multimedia, vol. 14, no. 5, 2012.

[133] J. Sun and J. Ponce, "Learning discriminative part detectors for image classification and cosegmentation," in ICCV, 2013, pp. 3400–3407.

[134] J. Dai, Y. N. Wu, J. Zhou, and S. Zhu, "Cosegmentation and cosketch by unsupervised learning," in ICCV, 2013, pp. 1305–1312.

[135] A. Faktor and M. Irani, "Co-segmentation by composition," in ICCV, 2013, pp. 1297–1304.

[136] Z. Wang and R. Liu, "Semi-supervised learning for large scale image cosegmentation," in ICCV, 2013.

[137] H. Zhu, J. Cai, J. Zheng, J. Wu, and N. Magnenat-Thalmann, "Salient object cutout using Google images," in ISCAS, 2013, pp. 905–908.

[138] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in CVPR, 2013, pp. 1939–1946.

[139] Y. Chai, V. S. Lempitsky, and A. Zisserman, "BiCoS: A bi-level co-segmentation method for image classification," in ICCV, 2011, pp. 2579–2586.

[140] Y. Chai, E. Rahtu, V. S. Lempitsky, L. J. V. Gool, and A. Zisserman, "TriCoS: A tri-level class-discriminative co-segmentation method for image classification," in ECCV, 2012, pp. 794–807.

[141] D. Kuttel, M. Guillaumin, and V. Ferrari, "Segmentation propagation in ImageNet," in ECCV, 2012.

[142] F. Meng, J. Cai, and H. Li, "On multiple image group cosegmentation," in ACCV, 2014.

[143] T. Ma and L. J. Latecki, "Graph transduction learning with connectivity constraints with application to multiple foreground cosegmentation," in CVPR, 2013.

[144] H. Zhu, J. Lu, J. Cai, J. Zheng, and N. Thalmann, "Multiple foreground recognition and cosegmentation: An object-oriented CRF model with robust higher-order potentials," in WACV, 2014.

[145] H. Zhu, J. Lu, J. Cai, J. Zheng, and N. Magnenat-Thalmann, "Poselet-based multiple human identification and cosegmentation," in ICIP, 2014.

[146] D. R. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in ICCV, 2001, pp. 416–425.

[147] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157–173, 2008.

[148] H. Li, J. Cai, N. T. N. Anh, and J. Zheng, "A benchmark for semantic image segmentation," in ICME, 2013, pp. 1–6.

[149] D. Tsai, M. Flagg, and J. M. Rehg, "Motion coherent tracking with multi-label MRF optimization," in BMVC, 2010.

[150] J. C. Rubio, J. Serrat, and A. M. Lopez, "Video co-segmentation," in ACCV, 2012, pp. 13–24.

[151] W.-C. Chiu and M. Fritz, "Multi-class video co-segmentation with a generative multi-video model," in CVPR, 2013, pp. 321–328.

[152] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, "Segmentation and recognition using structure from motion point clouds," in ECCV, 2008, pp. 44–57.

[153] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[154] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in CVPR, 2010, pp. 3485–3492.

[155] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.

[156] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014, pp. 740–755.

[157] D. R. Martin, "An empirical approach to grouping and segmentation," Ph.D. dissertation, EECS Department, University of California, Berkeley, Aug 2003.

[158] M. Meila, "Comparing clusterings by the variation of information," in COLT, 2003, pp. 173–187.

[159] R. Unnikrishnan, C. Pantofaru, and M. Hebert, "Toward objective evaluation of image segmentation algorithms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 929–944, 2007.

[160] J. Dong, Q. Chen, S. Yan, and A. Yuille, "Towards unified object detection and semantic segmentation," in ECCV, 2014.

[161] P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr, "Combining appearance and structure from motion features for road scene understanding," in BMVC, 2009.

[162] C. Zhang, L. Wang, and R. Yang, "Semantic segmentation of urban scenes using dense depth maps," in ECCV, 2010, pp. 708–721.

[163] N. Silberman and R. Fergus, "Indoor scene segmentation using a structured light sensor," in ICCV Workshops, 2011, pp. 601–608.

[164] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012, pp. 746–760.

[165] X. He and S. Gould, "An exemplar-based CRF for multi-instance object segmentation," in CVPR, 2014, pp. 296–303.

[166] J. Tighe, M. Niethammer, and S. Lazebnik, "Scene parsing with object instances and occlusion ordering," in CVPR, 2014, pp. 3748–3755.

[167] J. Yao, S. Fidler, and R. Urtasun, "Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation," in CVPR, 2012, pp. 702–709.

[168] S. K. Divvala, D. Hoiem, J. Hays, A. A. Efros, and M. Hebert, "An empirical study of context in object detection," in CVPR, 2009, pp. 1271–1278.

[169] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert, "From 3D scene geometry to human workspace," in CVPR, 2011, pp. 1961–1968.

[170] V. Vineet, J. Warrell, and P. H. S. Torr, "Filter-based mean-field inference for random fields with higher-order terms and product label-spaces," in ECCV, 2012, pp. 31–44.

[171] A. Farhadi, I. Endres, and D. Hoiem, "Attribute-centric recognition for cross-category generalization," in CVPR, 2010.

[172] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, "Describable visual attributes for face verification and image search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 1962–1977, 2011.

[173] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in CVPR, 2009, pp. 951–958.

[174] J. Tighe and S. Lazebnik, "Understanding scenes on many levels," in ICCV, 2011, pp. 335–342.

[175] S. Zheng, M. Cheng, J. Warrell, P. Sturgess, V. Vineet, C. Rother, and P. H. S. Torr, "Dense semantic image segmentation with objects and attributes," in CVPR, 2014, pp. 3214–3221.

[176] A. Gupta and L. S. Davis, "Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers," in ECCV, 2008, pp. 16–29.

[177] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "BabyTalk: Understanding and generating simple image descriptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2891–2903, 2013.

[178] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.

[179] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," CoRR, vol. abs/1312.2249, 2013.

[180] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," CoRR, vol. abs/1312.4659, 2013.

[181] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in NIPS, 2013.

[182] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014, pp. 580–587.

[183] W. T. Freeman, "Where computer vision needs help from computer science," in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011.