Graphcut-based Interactive Segmentation using Colour and Depth cues

Hu He, University of Queensland, Australia

[email protected]

David McKinnon, Queensland University of Technology, Australia

[email protected]

Michael Warren, University of Queensland, Australia

[email protected]

Ben Upcroft, University of Queensland, Australia

[email protected]

Abstract

Segmentation of novel or dynamic objects in a scene, often referred to as background subtraction or foreground segmentation, is critical for robust high level computer vision applications such as object tracking, object classification and recognition. However, automatic real-time segmentation for robotics still poses challenges including global illumination changes, shadows, inter-reflections, colour similarity of foreground to background, and cluttered backgrounds. This paper introduces depth cues provided by structure from motion (SFM) for interactive segmentation to alleviate some of these challenges. In this paper, two prevailing interactive segmentation algorithms are compared: Lazysnapping [Li et al., 2004] and Grabcut [Rother et al., 2004], both based on graphcut optimisation [Boykov and Jolly, 2001]. The algorithms are extended to include depth cues rather than colour only as in the original papers. Results show interactive segmentation based on colour and depth cues enhances the performance of segmentation with a lower error with respect to ground truth.

1 Introduction

Object recognition is an integral component for building robots capable of interacting in human environments. However, real-time object recognition in real scenes remains one of the most challenging problems in computer vision. Many issues must be overcome such as tracking, robustly identifying objects in complex cluttered backgrounds, viewpoint changes, etc. [Schulz et al., 2001; Batavia and Singh, 2001; Salembier et al., 1997]. Furthermore, computational complexity must be reduced to a minimum for mobile robotic applications as object information can rapidly become obsolete in a dynamic world. This paper moves towards robust object recognition using reliable object segmentation based on graphcut optimisation which combines both colour and depth cues. Here, we only consider interactive segmentation requiring input from a human operator. However, we propose that depth combined with colour cues can provide rich information enabling reliable automatic segmentation and eventual object recognition for mobile robotics.

This paper is motivated by the Lazysnapping [Li et al., 2004] and Grabcut [Rother et al., 2004] algorithms, which are based on graphcut optimisation [Boykov and Jolly, 2001]. Both algorithms require user input for initialisation. For Lazysnapping [Li et al., 2004], users need to draw strokes on the original colour image indicating foreground and background (Figure 3). Pixels in the strokes are collected to model energy terms in an energy function framework. The graphcut algorithm is then employed to minimise the energy function, resulting in the segmented foreground pixels stored as a label vector L.

Grabcut [Rother et al., 2004] initialisation requires users to draw a rectangle around the objects of interest. The pixels internal and external to the rectangular boundary are used to model the energy terms. The graphcut algorithm is run iteratively until the resulting label vector L remains constant. The common attribute for both algorithms is the use of colour as the only cue to model the energy terms. Thus both algorithms are prone to failure when foreground and background colours are similar.

In this paper, we use both colour and depth cues to model the energy terms, while using graphcuts to minimise the new energy function. Depth cues are acquired using 3D reconstruction from structure from motion, an increasingly common vision technology in mobile robotics.

If there are considerable depth differences between foreground and background, even though they have similar colour, segmentation should be improved. Our experimental results show that segmentation based on both colour and depth cues outperforms segmentation relying purely on colour, based on an error evaluation with respect to ground truth.


Figure 1: Statue dataset. The colour image and the corresponding depth image. White pixels indicate closer range, while black pixels indicate more distant range.

2 Colour-Based Segmentation

Currently, graphcut based interactive segmentation uses region (colour or intensity similarity) [Agarwala and Dontcheva, 2004; Barrett and Cheney, 2002; Reese and Barrett, 2002] and/or boundary (colour or intensity contrast) [Gleicher, 1995; Mortensen and Barrett, 1995; 1999] information to construct an objective function, often referred to as an energy function. Hard constraints are then introduced by a user, e.g., by selecting foreground and background pixels. Soft constraints refer to inherent properties, e.g., an assumption of smoothness or piecewise smoothness across pixels.

For completeness we briefly summarise the graphcut optimisation algorithm as follows. The original image is represented as the corresponding graph G = <V, E>, which is defined as a set of nodes (V) and a set of unordered edges (E) that connect these nodes. The nodes relate to the pixels in the original image, while the edges relate to the relationship between the adjacent pixels.

Figure 2: Camel dataset. The colour image and the corresponding depth image. White pixels indicate closer range, while black pixels indicate more distant range.

Let L = (L_1, \ldots, L_i, \ldots, L_{|V|}) be the binary vector whose elements L_i specify assignments to pixels i in V. Each L_i can be either background (L_i = 0) or foreground (L_i = 1). Here the optimised vector L defines the final segmentation. The energy function used in this paper is similar to the Gibbs energy function described in [Geman and Geman, 1984]:

E(L) = \sum_{i \in V} E_R(L_i) + \lambda \sum_{(i,j) \in E} E_B(L_i, L_j) \quad (1)

where E_R(L_i) is the region-based energy encoding the cost when the label of node i is L_i, E_B(L_i, L_j) is the boundary-based energy representing the cost when the labels of adjacent nodes (pixels) i and j are L_i and L_j respectively, and \lambda \in [0, 1] indicates the relative importance of the region-based energy versus the boundary-based energy.

Note that by taking the negative logarithm of both sides of the basic Bayesian relation (2), we obtain Eq. (3) as follows:

P(L|D) \propto P(D|L)\, P(L) \quad (2)

-\ln P(L|D) \propto -\ln P(D|L) - \ln P(L) \quad (3)

where L denotes the label for all pixels, D encodes the observed data, i.e., the pixels with pre-defined labels, and P(L|D) is a conditional probability.

Comparing (1) and (3), the term E_R(L_i) in the energy function can be seen as the likelihood probability, while the term E_B(L_i, L_j) can be seen as the prior probability. So we also refer to E_R(L_i) and E_B(L_i, L_j) as the likelihood energy and prior energy respectively.


Figure 3: Lazysnapping GUI. Red strokes select the foreground seed, and blue strokes select the background seed.

Using the colour or intensity similarity and contrast, we can obtain the weights between each node in the graph corresponding to the original image. Then the combinatorial optimisation algorithm max-flow/min-cut [Boykov and Kolmogorov, 2004] is employed to minimise the energy function. This results in nodes grouped into different classes (e.g., source and sink in binary graph theory), which is equivalent to assigning the corresponding pixels into different classes (e.g., foreground and background in bi-layer segmentation).
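As an illustration only (not the authors' implementation), the following sketch shows how an energy of the form (1) can be minimised with a max-flow/min-cut solver. It assumes the PyMaxflow package, precomputed per-pixel region costs cost_fg and cost_bg, and a single per-pixel boundary weight shared with the 4-neighbourhood, which is a simplification of the full contrast term E_B.

```python
import numpy as np
import maxflow  # PyMaxflow (assumed available: pip install PyMaxflow)

def graphcut_segment(cost_fg, cost_bg, pairwise_weight, lam=0.5):
    """Minimise E(L) = sum_i ER(Li) + lam * sum_(i,j) EB(Li,Lj) for a
    binary labelling via max-flow/min-cut.

    cost_fg, cost_bg : HxW arrays, region cost of labelling each pixel
                       foreground / background.
    pairwise_weight  : HxW array, boundary cost applied to 4-neighbour
                       edges (a simplified, Potts-style EB).
    """
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(cost_fg.shape)
    # n-links: boundary (prior) term between 4-connected neighbours.
    g.add_grid_edges(nodes, weights=lam * pairwise_weight, symmetric=True)
    # t-links: region (likelihood) term. With this assignment, a pixel on
    # the source side of the cut pays cost_fg and a pixel on the sink side
    # pays cost_bg, so the source side corresponds to the foreground label.
    g.add_grid_tedges(nodes, cost_bg, cost_fg)
    g.maxflow()
    # get_grid_segments() is True on the sink (background) side; invert it
    # to obtain the foreground mask.
    return (~g.get_grid_segments(nodes)).astype(np.uint8)
```

The same routine can be reused once the colour and depth terms are blended as in Section 3.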

3 Depth and Colour-Based Segmentation

In contrast to the above formulation, we use both colour and depth cues to construct the energy function (1), which can then be written as follows:

E(L) = \left[ \theta \sum_{i \in V} E^c_R(L_i) + (1 - \theta) \sum_{i \in V} E^d_R(L_i) \right] + \lambda \left[ \theta \sum_{(i,j) \in E} E^c_B(L_i, L_j) + (1 - \theta) \sum_{(i,j) \in E} E^d_B(L_i, L_j) \right] \quad (4)

where \theta denotes the relative importance of the colour terms versus the depth terms, the superscripts c and d of the energy terms denote the colour term and depth term respectively, and the other variables have the same meaning as in (1).

The depth terms are constructed with the same method as the colour terms within framework (4), and the tuning parameters other than \theta are set to the same values as described in [Li et al., 2004; Rother et al., 2004].
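A minimal sketch of how the blending in (4) could be realised in code is given below; the helper name and array shapes are illustrative assumptions, not the authors' code. The blended terms could then be passed to a solver such as the graphcut_segment sketch above.

```python
import numpy as np

def combined_energy_terms(colour_fg, colour_bg, depth_fg, depth_bg,
                          colour_pairwise, depth_pairwise, theta):
    """Blend colour and depth terms as in Eq. (4): every region and
    boundary term becomes theta * colour + (1 - theta) * depth.
    All inputs are HxW cost arrays; theta lies in [0, 1]."""
    unary_fg = theta * colour_fg + (1.0 - theta) * depth_fg
    unary_bg = theta * colour_bg + (1.0 - theta) * depth_bg
    pairwise = theta * colour_pairwise + (1.0 - theta) * depth_pairwise
    return unary_fg, unary_bg, pairwise
```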

3.1 Colour Data and Depth Data Modelling

Figure 4: GrabCut GUI. The rectangle is drawn by the user, while the internal thin red line indicates the graph cut between foreground and background generated by Grabcut.

In the implementation of Lazysnapping, we select some pixels as foreground and others as background using pen strokes (Figure 3). The corresponding pixels in the depth image are also selected. This results in two sets: foreground set F and background set B. All the known labelled pixels in F and B are used to construct the colour and depth energy terms respectively.

With respect to the region-based energy term, we use the distance between each unlabelled pixel and the centroids of the selected foreground and background pixels precomputed using K-means [Kwatra et al., 2003]. Specifically, the selected pixels of the foreground and background are used to construct models of the foreground and background by K-means based on colour distribution.
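As a rough sketch of this step (assuming a generic K-means implementation such as scikit-learn's, and hypothetical helper and array names), the Lazysnapping-style region energies could be computed as follows:

```python
import numpy as np
from sklearn.cluster import KMeans  # any K-means implementation would do

def region_energy_kmeans(image, fg_seed_mask, bg_seed_mask, k=8):
    """Lazysnapping-style region term (a sketch): cluster the seed-pixel
    colours with K-means and use the distance to the nearest cluster
    centre as the likelihood energy of each pixel."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    fg_centres = KMeans(n_clusters=k, n_init=5).fit(
        pixels[fg_seed_mask.ravel()]).cluster_centers_
    bg_centres = KMeans(n_clusters=k, n_init=5).fit(
        pixels[bg_seed_mask.ravel()]).cluster_centers_

    # Distance from every pixel to its nearest foreground / background centre
    # (for large images this should be computed in chunks).
    d_fg = np.min(np.linalg.norm(pixels[:, None] - fg_centres[None], axis=2), axis=1)
    d_bg = np.min(np.linalg.norm(pixels[:, None] - bg_centres[None], axis=2), axis=1)

    # Normalised costs: a pixel close to the foreground clusters is cheap
    # to label foreground, and vice versa.
    e_fg = d_fg / (d_fg + d_bg + 1e-12)
    e_bg = d_bg / (d_fg + d_bg + 1e-12)
    return e_fg.reshape(image.shape[:2]), e_bg.reshape(image.shape[:2])
```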

For the boundary-based energy term, we use the gradient, or contrast, between two adjacent pixels.

In the implementation of Grabcut, we use a rectangle to select the internal and external pixels of the colour image, denoting the uncertain set U (used to model the foreground) and background set B respectively (Figure 4). The corresponding pixels in the depth image are chosen as well. Here we construct Gaussian Mixture Models (GMMs) based on the selected sets to represent the region-based energy term, while the boundary-based energy term is still constructed using the gradient or contrast between two adjacent pixels.
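A hedged sketch of this GrabCut-style region term, using scikit-learn's GaussianMixture as a stand-in for the GMMs of [Rother et al., 2004] (function and parameter names are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def region_energy_gmm(image, rect_mask, n_components=5):
    """GrabCut-style region term (a sketch): fit one GMM to the pixels
    inside the user rectangle (uncertain set U, used for the foreground
    model) and one to the pixels outside it (background set B); the
    negative log-likelihood under each model is the region energy."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    gmm_fg = GaussianMixture(n_components=n_components).fit(pixels[rect_mask.ravel()])
    gmm_bg = GaussianMixture(n_components=n_components).fit(pixels[~rect_mask.ravel()])
    e_fg = -gmm_fg.score_samples(pixels)   # cost of labelling a pixel foreground
    e_bg = -gmm_bg.score_samples(pixels)   # cost of labelling a pixel background
    return e_fg.reshape(image.shape[:2]), e_bg.reshape(image.shape[:2])
```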

Further details on user input and modelling are shown in [Li et al., 2004; Rother et al., 2004].

3.2 Segmentation using Lazysnapping and GrabCut

Image resolution highly affects computational efficiency. To improve efficiency, we use the Watershed algorithm [Vincent and Soille, 1991] to divide the entire image into several new patches which still locate boundaries well and preserve small differences between each patch. This process is referred to as pre-segmentation in [Li et al., 2004].
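As an illustration of this pre-segmentation step (a sketch only, assuming scikit-image's watershed rather than the exact Vincent-Soille implementation, and using regularly spaced seed markers as a simplification):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.segmentation import watershed

def presegment(image, marker_spacing=10):
    """Watershed pre-segmentation (a sketch): over-segment the RGB image
    into small patches so that graph nodes become patches rather than
    pixels, greatly reducing the size of the graph."""
    gradient = sobel(rgb2gray(image))
    # Regularly spaced seed markers, one label per seed.
    markers = np.zeros_like(gradient, dtype=int)
    ys, xs = np.mgrid[0:gradient.shape[0]:marker_spacing,
                      0:gradient.shape[1]:marker_spacing]
    markers[ys, xs] = np.arange(1, ys.size + 1).reshape(ys.shape)
    # Flood the gradient image from the markers to obtain patch labels.
    return watershed(gradient, markers)
```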


(a) Weight θ = 0 (b) Weight θ = 0.9 (c) Weight θ = 1

Figure 5: Lazysnapping results with varying weights between the colour and depth terms. Note that as θ increases, meaning the weight of the colour cue increases, the segmentation performance improves due to the depth ambiguity in this dataset.

Lazysnapping runs the graphcut algorithm once to acquire the final segmentation result, while Grabcut runs iteratively until the final segmentation result remains constant.

4 Results

This section presents the results from an implementation of the Lazysnapping and Grabcut algorithms using depth and colour cues on two datasets of a statue and a toy camel. Modifications to freely available code provided by Gupta were used to produce the following results [Gupta, 2005]. Each dataset consists of colour images and corresponding depth map images. The depth map for the colour images is obtained from SFM. For a pair of colour images, the 3D positions of a dense set of visual features are estimated using SFM algorithms [Ullman, 1979; Zhang et al., 2006]. Here we use a GPU to extract SURF [Bay et al., 2006] features and then match them to make the features trackable across multiple views. RANSAC [Fischler and Bolles, 1981] is used to discard outliers, i.e., mismatched feature pairs. Finally, the depth map is generated by triangulation and interpolation over the multiple views.
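The following is a minimal two-view, CPU-only sketch of this depth-from-SFM step, assuming OpenCV with the contrib (xfeatures2d) module for SURF and a known camera intrinsic matrix K; the paper's pipeline is GPU-based and multi-view, and a dense depth map would additionally require interpolation of the sparse points.

```python
import cv2
import numpy as np

def sparse_depth_from_pair(img1, img2, K):
    """SURF matching, RANSAC outlier rejection and two-view triangulation
    (a sketch of the depth-cue generation, not the authors' pipeline)."""
    surf = cv2.xfeatures2d.SURF_create()          # requires opencv-contrib / nonfree
    kp1, des1 = surf.detectAndCompute(img1, None)
    kp2, des2 = surf.detectAndCompute(img2, None)

    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC on the essential matrix removes mismatched feature pairs.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    pts1, pts2 = pts1[inliers.ravel() == 1], pts2[inliers.ravel() == 1]
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Triangulate in the first camera frame; the depth is the Z coordinate.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    X /= X[3]
    return pts1, X[2]   # feature locations and their (sparse, up-to-scale) depths
```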

The tuning parameter θ in (4) is varied from 0 to 1. If θ = 0, segmentation depends only on depth information, while θ = 1 forces a dependence on colour information only, identical to the original Lazysnapping and Grabcut algorithms. Both colour and depth cues contribute to the segmentation when θ ∈ (0, 1).

The images used in this paper are shown in Figure 1 and Figure 2. In Figure 1, there is high contrast in the depth information, which can compensate for the weakness of colour alone. In Figure 2, however, the dataset has similar depth information, i.e., the objects lie in an approximate plane, so the colour information should be more useful in this scenario.

4.1 Lazysnapping based on Colour and Depth

Here the pixels intersecting with the red strokes are the foreground set F, and the ones intersecting with the blue strokes are the background set B. Figure 3 illustrates the GUI for Lazysnapping. The final segmentation result is shown in Figure 5.

4.2 GrabCut based on Colour and Depth

With Grabcut, we use a red rectangle to divide the pixels into the background set B (external to the rectangle) and the uncertain set U (internal to the rectangle). The Grabcut GUI is shown in Figure 4. Figure 6 illustrates the final segmentation result with different weights for the colour term. Note that the Grabcut algorithm performs an iterative energy minimisation. The energy should decrease with the number of iterations, as shown in Figure 7.
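The iteration can be pictured with the sketch below, which re-estimates the models from the current labelling and re-runs the graph cut until the label vector L stops changing; it reuses the hypothetical helpers sketched earlier (region_energy_gmm, combined_energy_terms, graphcut_segment) and a placeholder boundary term, so it is illustrative only.

```python
import numpy as np

def iterate_grabcut(image, depth, rect_mask, theta, max_iters=20):
    """GrabCut-style loop (a sketch): alternate between model estimation
    and graph-cut labelling until the label vector L converges."""
    labels = rect_mask.astype(np.uint8)       # initial guess: rectangle interior = foreground
    depth3 = np.dstack([depth] * 3)           # treat depth as a 3-channel image for the GMM helper
    for _ in range(max_iters):
        c_fg, c_bg = region_energy_gmm(image, labels.astype(bool))
        d_fg, d_bg = region_energy_gmm(depth3, labels.astype(bool))
        contrast = np.ones(labels.shape)      # placeholder for the real contrast-based boundary term
        u_fg, u_bg, pw = combined_energy_terms(c_fg, c_bg, d_fg, d_bg,
                                               contrast, contrast, theta)
        new_labels = graphcut_segment(u_fg, u_bg, pw)
        if np.array_equal(new_labels, labels):    # L remains constant: converged
            break
        labels = new_labels
    return labels
```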

Figure 7: The energy E for the segmentation converges over 13 iterations.


(a) Weight θ = 0 (b) Weight θ = 0.55 (c) Weight θ = 1

Figure 6: GrabCut results with varying weights between the colour and depth terms. Note that as θ increases, meaning the weight of the colour cue increases, the segmentation performance gets worse due to the colour ambiguity in this dataset.

4.3 Evaluation

To evaluate the segmentation performance, we manually segmented both images to form a ground truth (Figure 8).

Results were then compared using two methods of evaluation: 1) the L2 distance between the segmentation result from the proposed method and the ground truth segmentation, referred to as ε1 in (5); 2) the number of mislabelled pixels compared to the ground truth, expressed as a ratio of the total number of pixels in the original image, denoted by ε2 in (6).

\varepsilon_1 = \| P - P_G \|_2 \quad (5)

\varepsilon_2 = \frac{N_{error}}{N_{total}} \quad (6)

where P is the intensity of pixels in the segmentation derived by the proposed method, while P_G is the intensity of pixels in the ground truth segmentation. In equation (6), N_{error} is the number of misclassified pixels compared to the ground truth and N_{total} is the total number of pixels in the original image.
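Both measures are straightforward to compute; a small sketch (with assumed array inputs of equal shape) is:

```python
import numpy as np

def evaluate(seg, gt):
    """Evaluation metrics of Eqs. (5) and (6): the L2 distance between the
    segmented intensities and ground truth, and the ratio of mislabelled
    pixels to the total number of pixels."""
    eps1 = np.linalg.norm(seg.astype(np.float64) - gt.astype(np.float64))
    eps2 = np.count_nonzero(seg != gt) / seg.size
    return eps1, eps2
```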

4.4 Result Discussion

In Figure 7, the energy converges after approximately 10 iterations. In particular, it decreases dramatically in the first couple of iterations. This indicates that the Grabcut algorithm still converges quickly even after introducing the depth cue.

Figure 9 indicates that the best results are obtained when the weight θ is set to 0.55 for the statue dataset using the Grabcut algorithm. For the camel dataset using the Lazysnapping algorithm, the weight θ is set to 0.9, meaning colour information is more important than depth information in this case.

The quantitative evaluations show that jointly using colour and depth cues in general achieves better accuracy than using colour alone. Furthermore, the performance of segmentation can be refined by determining the weight θ based on the discriminative capabilities of the colour and depth cues. Further investigation of Figure 9 shows that the colour information is quite ambiguous for the statue segmentation. The depth information seems to be a reliable cue and demonstrates good performance, especially when the weight for the depth term reaches 0.55. However, the accuracy of the statue segmentation decreases slightly as the weight for the depth term continues to increase. In this case, the discriminative capability of the depth cue achieves a maximum at 0.55, and the increased importance of the depth information causes background objects to be mislabelled as foreground. Depth information is more ambiguous for the camel dataset, however, as all the small objects lie on the table and the depth distributions of the toy camel and the books are quite similar. In this case, increasing the weight of the colour cue refines the segmentation results.

5 Conclusion and Future Work

We have introduced the depth cue into the basic energy function framework for interactive segmentation in a static image. As discussed above, better segmentation results are obtained using both colour and depth cues than using only one of them. With an appropriate weight θ, the proposed method outperforms the Lazysnapping and Grabcut methods based only on the colour cue.

In the future, we would like to explore several directions in which the accuracy of segmentation can be further improved.

• Develop a novel model to automatically construct the foreground and background data currently selected by the user

• Develop an adaptive way to adjust the weight θ

• Extend the current method to video sequences

• Use both depth and colour to design an automatic segmentation method with reasonable accuracy for object tracking for mobile robotics [Wang et al., 2006; Zhao et al., 2008]


(a) Groundtruth for camel segmentation (b) Groundtruth for statue segmentation

Figure 8: Groundtruth dataset.

[Figure 9 panels: (a) ε1 over different depth weights θ for the statue dataset; (b) ε2 over different depth weights θ for the statue dataset; (c) ε1 over different depth weights θ for the camel dataset; (d) ε2 over different depth weights θ for the camel dataset. Horizontal axes: weight for the depth term in the energy function (%). Vertical axes: the L2 distance between the segmentation result and ground truth in (a) and (c); the ratio of mislabelled pixels to total pixels in (b) and (d).]

Figure 9: Segmentation error versus varying weights θ between colour and depth. (a, b) show the error for the statue segmentation, while (c, d) show the error for the camel segmentation. As can be seen, the error decreases to some extent as the weight of the depth term increases for the statue dataset, whereas it increases for the camel dataset.


Acknowledgement

This research is supported by CRCMining and a CSC scholarship. The authors acknowledge helpful email discussions with Mohit Gupta.

References

[Agarwala and Dontcheva, 2004] A. Agarwala, M. Dontcheva, et al. Interactive digital photomontage. ACM Transactions on Graphics (TOG), 23(3):294–302, 2004.

[Barrett and Cheney, 2002] W.A. Barrett and A.S. Cheney. Object-based image editing. ACM Transactions on Graphics, 21(3):777–784, 2002.

[Batavia and Singh, 2001] P.H. Batavia and S. Singh. Obstacle detection using adaptive color segmentation and color stereo homography. In IEEE International Conference on Robotics and Automation, volume 1, pages 705–710, 2001.

[Bay et al., 2006] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Computer Vision – ECCV 2006, pages 404–417, 2006.

[Boykov and Jolly, 2001] Y. Boykov and M.P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In International Conference on Computer Vision, volume 1, pages 105–112, 2001.

[Boykov and Kolmogorov, 2004] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1124–1137, 2004.

[Fischler and Bolles, 1981] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[Geman and Geman, 1984] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.

[Gleicher, 1995] M. Gleicher. Image snapping. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, page 190. ACM, 1995.

[Gupta, 2005] M. Gupta. Interactive segmentation toolbox. http://www.cs.cmu.edu/~mohitg/Research/segmentation.htm, 2005.

[Kwatra et al., 2003] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. ACM Transactions on Graphics, 22(3):277–286, 2003.

[Li et al., 2004] Y. Li, J. Sun, C.K. Tang, and H.Y. Shum. Lazy snapping. ACM Transactions on Graphics (TOG), 23(3):303–308, 2004.

[Mortensen and Barrett, 1995] E.N. Mortensen and W.A. Barrett. Intelligent scissors for image composition. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 191–198. ACM, 1995.

[Mortensen and Barrett, 1999] E.N. Mortensen and W.A. Barrett. Toboggan-based intelligent scissors with a four-parameter edge model. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 1999.

[Reese and Barrett, 2002] L.J. Reese and W.A. Barrett. Image editing with intelligent paint. In Proceedings of Eurographics, volume 21, pages 714–724, 2002.

[Rother et al., 2004] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH 2004 Papers, page 314. ACM, 2004.

[Salembier et al., 1997] P. Salembier, F. Marques, M. Pardas, J.R. Morros, I. Corset, S. Jeannin, L. Bouchard, F. Meyer, and B. Marcotegui. Segmentation-based video coding system allowing the manipulation of objects. IEEE Transactions on Circuits and Systems for Video Technology, 7(1):60–74, 1997.

[Schulz et al., 2001] D. Schulz, W. Burgard, D. Fox, and A.B. Cremers. Tracking multiple moving targets with a mobile robot using particle filters and statistical data association. In IEEE International Conference on Robotics and Automation (ICRA 2001), volume 2, 2001.

[Ullman, 1979] S. Ullman. The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences, 203(1153):405, 1979.

[Vincent and Soille, 1991] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.

[Wang et al., 2006] L. Wang, M. Liao, M. Gong, R. Yang, and D. Nister. High-quality real-time stereo using adaptive cost aggregation and dynamic programming. In 3D Data Processing, Visualization, and Transmission, Third International Symposium on, pages 798–805, 2006.

[Zhang et al., 2006] G. Zhang, J. Jia, W. Xiong, T.T. Wong, P.A. Heng, and H. Bao. Moving object extraction with a hand-held camera. In Proc. International Conference on Computer Vision, pages 1–8, 2006.

[Zhao et al., 2008] T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in crowded environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1198–1211, 2008.