

Graphical Models 68 (2006) 282–293

www.elsevier.com/locate/gmod

Visual reconstruction of ground plane obstacles in a sparse view robot environment

R. Laganière a,*, H. Hajjdiab a, A. Mitiche b

a VIVA Research Laboratory, School of Information Technology and Engineering, University of Ottawa, Ottawa, Ont., Canada K1N 6N5
b Institut National de la Recherche Scientifique, INRS-EMT, Montreal, Que., Canada H5A 1K6

Received 20 July 2004; received in revised form 17 October 2005; accepted 2 February 2006; available online 31 March 2006

doi:10.1016/j.gmod.2006.02.001

* Corresponding author. Fax: +1 613 562 5664. E-mail addresses: [email protected] (R. Laganière), [email protected] (A. Mitiche).

Communicated by Dimitris Metaxas

Abstract

The purpose of this study is to investigate a geometric/level set method to locate ground plane objects in a robot environment and reconstruct their structure from a collection of sparse views. In a first step, a model of the ground plane surface, on which the robot is operating, is obtained through the matching of the available views. This wide-baseline matching of the ground plane views also allows the computation of camera pose information associated with each of these views. Based on the information obtained, reconstruction of the obstacles proceeds by minimizing an energy functional containing three terms: a term of shape-from-silhouettes consistency to characterize the ground plane objects' structure and to account for possibly non-textured object surfaces; a term of visual information consistency to measure the conformity of the objects' surface visual information to the acquired images; and, finally, a term of regularization to bias the solution toward smooth object surfaces. The functional is minimized following the associated Euler–Lagrange surface evolution descent equations, implemented via level set PDEs to allow changes in topology while ensuring numerical stability. We provide examples of verification of the scheme on real data.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Ground plane obstacle; 3D reconstruction; Widely separated view matching; Shape-from-silhouette; Photo-consistency; Level sets

1. Introduction

Building three-dimensional (3D) representations from a collection of two-dimensional (2D) images constitutes one of the most challenging tasks in computer vision [5,21]. The objective is to obtain a 3D interpretation of a scene that complies with the available observations while obeying certain constraints, the latter representing prior knowledge about the observed world. A large variety of applications can benefit from solving this visual reconstruction problem, ranging from accurate object model building to the development of autonomous navigation systems. In this study, we propose a novel approach to determine the position and structure of ground plane objects in a robot environment from a collection of sparse views of this environment.

When an autonomous robot is operating in a work area, it must have the ability to detect, locate, and identify the obstacles and other such objects for the purpose of avoidance, manipulation, or recognition. Most often, robots move on relatively flat terrain. This fact affords a substantial simplification of the geometry of the problem and of its representation through projective relations. Reconstruction will proceed here by considering a collection of sparse views of the scene. These views may have been obtained, for example, by initially driving a camera-equipped robot along a peripheral path to acquire images of the environment, or by a team of robots collaborating to obtain a representation of the environment. This is a very challenging situation, since it requires the registration of images taken from very different points of view. But this approach offers the advantage of considerably limiting the amount of data to be processed and/or transmitted; since many real-world robotic applications have to cope with limited bandwidth, this asset can be key. In addition, 3D localization from widely separated views is generally more accurate and less sensitive to imperfect feature localization and matching due to pixelization and image noise. Following the proposed procedure, robot positions are determined, ground plane objects are detected, and, finally, their structure is reconstructed by an off-line process in charge of building and maintaining the 3D scene representation.

In general, solutions to ground plane object detection are based on the homographic relation induced by the observed planar surface. Using this relation, image points corresponding to the plane can be transferred from one view to another. In the case of a stereoscopic system of cameras, a perspective warping of the left image to the right image can be computed; differencing the warped and the original images leads to a rough obstacle/plane segmentation from which the obstacles can be localized [1,4,11,20]. The approach described in [7] detects and matches image points in a stereo head; robot locations are then obtained through a ground plane transformation in a way similar to the one presented in this paper. The computation of the residual disparity can also provide information about the amount of deviation of a point with respect to the reference plane [37]. When only one camera is used, information about camera motion and the 3D structure of the imaged scene is generally obtained through the estimation of the optical flow field of the image sequence resulting from the robot motion [6,12,26].

The proposed reconstruction procedure is twofold. First, a model of the ground plane surface, on which the robot is operating, is obtained through the matching of the available views. This wide-baseline matching of the ground plane views also allows the computation of the robot (or camera) position associated with each of these views. View matching and robot localization are achieved here without using any special landmarks [30,32] or a pre-established environmental map [17]. Second, based on the information obtained, reconstruction of the obstacles proceeds by minimizing an energy functional containing three terms: a term of shape-from-silhouettes consistency to characterize the ground plane objects' structure and to account for possibly non-textured object surfaces; a term of visual information consistency to measure the conformity of the objects' surface visual information to the acquired images; and, finally, a term of regularization to bias the solution toward smooth object surfaces. This functional is minimized following the associated Euler–Lagrange surface evolution descent equations, implemented via level set PDEs for numerical stability and to allow changes in the topology of the surface during evolution (thus handling the case of multiple obstacles).

The level set formalism for 3D reconstruction from 2D images has also been used in [8,9,36,27]. In [27], the study was concerned with temporal image sequences and short-term motion, rather than with wide-baseline image sets as in this paper. In [36], 3D reconstruction of a single object and estimation of camera poses are sought under the following two assumptions: (a) the object surface is smooth and projects onto images as piecewise smooth irradiance segments with brightness discontinuities corresponding to occlusion boundaries, and (b) the background and the object surface support two significantly distinct radiance functions. In [9], the method, based on perspective-invariant intensity correlation, required (a) prior accurate knowledge of camera positions and (b) images pre-segmented in terms of object and background. Their approach combines the correspondence problem and the reconstruction problem using a function which measures the conformity of structure to visual data [10]. This ability to combine different constraints is one of the attractive aspects of the level set formalism; we exploit it in this study.


The methods in both [36,9] assume both a non-applicable model of brightness variations and a non-intervening background. These assumptions cannot be retained here, because we are dealing with real-world scenes, with objects that may not have sufficient brightness variations to inform on shape, and with backgrounds that cannot be abstracted out of the problem. The lack of sufficient object brightness variations is handled in our formulation by the shape-from-silhouette-consistency component in the energy functional, and this is a major difference with the functionals in [36,9]. The inclusion of this shape-from-silhouette term in the functional is important because of the assumptions it relaxes, assumptions that must be relaxed with images of real scenes. Also, in our method, the background is not abstracted out of the reconstruction process by assuming, for instance, a known uniform-brightness background. Abstraction of the background avoids the problem where it is most delicate to solve: at object boundaries. Another major difference resides in our use of robust visual information, namely color invariants rather than the raw image data, and of a geometry that accounts for the case of objects on a ground floor.

Fig. 1. Different views of a ground plane with obstacles.

The remainder of this paper is organized as follows: Section 2 describes the basic models of shape-from-silhouettes consistency and of visual information consistency. Section 3 gives the formulation, the Euler–Lagrange equations, and their level set expression. Section 4 gives the details for building an overhead view mosaic of the ground plane. Section 5 considers the recovery of the camera pose information. Section 6 gives experimental results of reconstruction on real scenes, and Section 7 contains a conclusion.

2. Basic models

We consider an environment consisting of an arbitrary and unknown number of objects standing at arbitrary positions on a ground plane. The workplace of a robot is such an environment, the objects being obstacles the robot must identify in its navigation. Our goal is to write a variational formulation to recover the position and structure of these ground plane obstacles. To achieve this goal, we have at our disposal k distinct color images of the environment, $I_j$, $j = 1, \ldots, k$ (as in Fig. 1). However,


because we are dealing with views taken from widely separated points of view, rather than the raw values of the acquired images $I_j$, $j = 1, \ldots, k$, we will use normalized, Gaussian-filtered color invariants as visual information. We selected Hilbert's color invariants, which have the advantage of requiring only first-order derivatives. These derivatives are computed using Gaussian filters. Montesinos et al. [23] propose the use of an invariant vector of eight such components. In our case, we simply use the Gaussian-filtered color intensities $R_\sigma$, $G_\sigma$, $B_\sigma$ and their corresponding gradient magnitudes $|\nabla_\sigma R|$, $|\nabla_\sigma G|$, $|\nabla_\sigma B|$, with $\sigma^2$ being the variance of the Gaussian filter.

To compute the similarity between two such vectors, the components must be rescaled, since color intensities and gradient magnitudes have different ranges of values. Moreover, since the stability of the gradient magnitude over viewpoint variation is not very good, we normalize these values using a sigmoidal function of the form:

$$ s(t) = \frac{1}{1 + e^{-\mu (t - t_0)}}. \qquad (1) $$

This non-linear normalization leads to a more qualitative characterization of the gradient, where pixels are classified (in a fuzzy way) as points with low or high color gradient magnitude. The value $t_0$ is the threshold that defines the point of transition between low and high magnitudes, and $\mu$ controls the fuzziness of the classification. The gradient magnitude thus transformed is similar to the measure of edgeness defined in [35]. This leads us to the following normalized invariant vector:

$$ I^m(X) = \left[ \frac{R_\sigma}{255},\; \frac{G_\sigma}{255},\; \frac{B_\sigma}{255},\; s(|\nabla_\sigma R|),\; s(|\nabla_\sigma G|),\; s(|\nabla_\sigma B|) \right]^{\mathrm{T}}. \qquad (2) $$

Euclidean distance can then be used to measure the similarity between two color image points. The use of these invariants is particularly useful at the matching step, as explained in Section 4.1.
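As an illustration, the invariant vector of Eqs. (1) and (2) can be computed per pixel with standard filtering tools. The following is a minimal sketch; the filter scale sigma and the sigmoid parameters mu and t0 below are illustrative values, not values prescribed by the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def invariant_vector(img, sigma=1.5, mu=4.0, t0=10.0):
    """Per-pixel normalized invariant vector of Eq. (2).

    img: H x W x 3 float array of RGB values in [0, 255].
    Returns an H x W x 6 array of invariant components."""
    smoothed = np.stack(
        [gaussian_filter(img[..., c], sigma) for c in range(3)], axis=-1)
    components = [smoothed[..., c] / 255.0 for c in range(3)]
    for c in range(3):
        gy, gx = np.gradient(smoothed[..., c])   # first-order derivatives
        mag = np.hypot(gx, gy)                   # gradient magnitude
        # Sigmoidal normalization of Eq. (1): fuzzy low/high edgeness.
        components.append(1.0 / (1.0 + np.exp(-mu * (mag - t0))))
    return np.stack(components, axis=-1)
```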

Let $I^m_j$, $j = 1, \ldots, k$ be the k (vectorial) images of visual information. Based on projective geometry relations, we can combine these images to construct a mosaic image $I^m$ of the ground plane (the procedure is explained in Section 4.2; an example of such a mosaic is shown in Fig. 4). We can also compute the projective transformations $P_j$, $j = 1, \ldots, k$ of the k original views. Finally, we can compute the homographic transformations $H_{mj}$, $j = 1, \ldots, k$ between the mosaic view and each of the original views. The details of how the mosaic view and the transformations are obtained are given in Sections 4 and 5. But assuming that all this information has been obtained, we first show here how obstacle reconstruction can proceed.

Now, given $I^m_j$, $j = 1, \ldots, k$, $I^m$, $P_j$, $j = 1, \ldots, k$, and $H_{mj}$, $j = 1, \ldots, k$, we can formulate the problem. The formulation incorporates two basic models: a silhouette-consistency model and a photo-consistency model.

2.1. Silhouette-consistency model

According to the shape-from-silhouette strategy [3,38], an object's visual cones from several views are intersected to obtain a visual hull within which the 3D object must lie. Each of these cones is the generalized cone defined by the union of the visual rays emanating from the view and passing through the object silhouette. With a large number of informative views, the object surface structure can be well approximated by the visual hull [18]. For the problem we are addressing, we cannot assume the availability of a large number of views, and the visual hull one would obtain by direct application of the shape-from-silhouette strategy would be a gross estimate of the object surface structure. However, even with a small number of views, visual cones contain valuable information about object surface structure. We integrate this information into the formulation via an estimate of the probability that a given 3D point belongs to an object visual hull. The estimation of this probability is done as follows.

Let X be a 3D point. If, when viewed from camera j, the ground plane is not occluded along the visual ray through X, this visual ray does not intersect an object surface. Therefore, X is necessarily outside the visual cone of the view (when the scene contains several objects, the union of the visual cones of a view is simply called the visual cone of the view). This can be verified by comparing the visual information at the pixel of view j onto which X projects (given by $x = P_j X$) to the visual information at the pixel of the overhead mosaic which is the projection of the intersection between the visual ray and the ground plane (given by $x_m = H_{mj}^{-1} P_j X$), that is, by evaluating the residual:

$$ r_j(X) = \left\| I^m_j(P_j X) - I^m_m(H_{mj}^{-1} P_j X) \right\|. \qquad (3) $$

Image $I^m_m$ is the composed mosaic image of the ground plane that is obtained from all available views. Indeed, when a plane is observed from a known viewpoint, the corresponding image can be projectively transformed into an overhead view representation; the procedure is explained in Section 4.2. The residual $r_j$ is expected to be low for points outside the visual cone and high otherwise. Therefore, under the assumption that residuals are bounded, we express the probability that a point X is within the visual cone of view j as an increasing function of the residual, for instance:

$$ P\left(X \in R_{VC_j} \mid r_j(X)\right) \propto 1 - e^{-r_j^2(X)/\beta^2}, \qquad (4) $$

where $R_{VC_j}$ is the region inside the visual cone of view j, and $\propto$ is the proportionality symbol. Assuming that the events $X \in R_{VC_i}$ and $X \in R_{VC_j}$ are independent for $i \neq j$ for all X, and the visual hull being the intersection of the visual cones, the probability that a point X is within the visual hull can be expressed as:

$$ P\left(X \in R_{VH} \mid \{r_j\}_{j=1}^{k}\right) \propto \prod_{j=1}^{k} \left( 1 - e^{-r_j^2(X)/\beta^2} \right), \qquad (5) $$

where $R_{VH}$ is the region inside the objects' visual hull. This latter equation is the basis of a shape-from-silhouette approach that does not require explicit silhouette extraction; rather, it constitutes a fuzzy representation of the silhouette set to be integrated into the reconstruction process. This is in contrast with the usual schemes that rely on the availability of accurate silhouette images (often obtained by placing the object in front of a black curtain). By not explicitly imposing silhouette segmentation, the introduction of errors that cannot be recovered at later stages is avoided.
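To make Eqs. (3)–(5) concrete, the sketch below accumulates the per-view residuals of a 3D point into its (unnormalized) visual-hull probability. The interface — projection matrices, view-to-mosaic transfers, lookup functions, and the value of beta — is assumed for the example rather than specified by the paper.

```python
import numpy as np

def hull_probability(X, views, mosaic_lookup, beta=0.2):
    """Unnormalized probability (Eq. (5)) that 3D point X lies in the
    visual hull.

    views: iterable of (P, H_mj_inv, view_lookup) triples, where P is a
    3x4 projection matrix, H_mj_inv maps view pixels to mosaic pixels,
    and view_lookup(x) / mosaic_lookup(x) return invariant vectors."""
    Xh = np.append(np.asarray(X, dtype=float), 1.0)   # homogeneous point
    prob = 1.0
    for P, H_inv, view_lookup in views:
        xh = P @ Xh                      # pixel of view j (x = P_j X)
        x = xh[:2] / xh[2]
        xmh = H_inv @ xh                 # ground plane pixel in the mosaic
        xm = xmh[:2] / xmh[2]
        r = np.linalg.norm(view_lookup(x) - mosaic_lookup(xm))  # Eq. (3)
        prob *= 1.0 - np.exp(-r**2 / beta**2)  # Eq. (4); independence assumed
    return prob
```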

2.2. Photo-consistency model

To further improve the 3D reconstruction of the observed objects, the photometric information contained in each view can be used. This can be done using a voxel-coloring strategy, which consists of dividing the scene space into small volumetric elements (the voxels) [28]. The scene is then visited, and each voxel is projected onto the input images and examined to determine if it belongs to an object surface. In summary, while silhouette-consistency refers to the region within the objects' visual hull, photo-consistency concerns the objects' surface.

The photo-consistency of a given 3D point can be measured by computing the standard deviation of the pixel colors to which that 3D point projects [25]. Let X be a point, and $\bar{I}^m(X)$ the visual information at X averaged over the k views; then $d_j(X)$ represents the deviation from this average of the visual information at the projection of X on view j, i.e.:

$$ d_j(X) = \left\| I^m_j(P_j X) - \bar{I}^m(X) \right\|. \qquad (6) $$

This deviation will be low for physical points located on Lambertian surfaces, because they project the same color information onto all cameras. On the other hand, cameras looking at a point not located on an object surface will see different elements of the scene, thus resulting in a large visual deviation. In estimating this deviation, a more robust approach would consist of considering instead the median deviation from the pixel average color value; this would prevent a very dissimilar pixel from contributing excessively to the consistency measure. Instead, we use a continuously differentiable function, based on the idea of Geman and Reynolds [13] of using concave functions to implicitly address the problem of outliers. This is important in our application since, because of occlusion, not all cameras can be expected to see all obstacle points. Our goal is simply to maximize the consensus concerning the color of an obstacle point as seen by the different views. The conformity of a given point X to the observation is then measured by:

$$ h(X) = \sum_{j=1}^{k} -e^{-\gamma\, d_j(X)}. \qquad (7) $$

Such a photo-consistency term will be evaluated for all voxels above the ground plane. Because consistency in photometry is observed only for those voxels lying on an object surface, this function is expected to be low for points located on an obstacle. A probabilistic formulation of photo-consistency is also proposed in [14] to iteratively identify the voxels which lie on objects' surfaces. The probability that a voxel exists (i.e., lies on a surface) is also estimated by Broadhurst et al. [15] in their space carving approach. The two Eqs. (5) and (7) can now be combined to obtain an energy functional representing our characterization of the ground plane objects reconstruction problem.
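For illustration, Eqs. (6) and (7) reduce to a few lines once the invariant vectors seen by the k views at point X have been gathered; the value of gamma below is an illustrative choice, not one taken from the paper.

```python
import numpy as np

def photo_consistency(samples, gamma=5.0):
    """h(X) of Eq. (7) from the k invariant vectors observed for X.

    samples: k x 6 array whose row j is I^m_j(P_j X). The result is
    more negative (lower energy) when the views agree on the color."""
    mean = samples.mean(axis=0)                  # averaged information
    d = np.linalg.norm(samples - mean, axis=1)   # deviations, Eq. (6)
    # Concave penalty in the spirit of Geman-Reynolds: outliers saturate.
    return float(np.sum(-np.exp(-gamma * d)))
```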

3. Formulation

An energy functional to be minimized can now be written. This functional will contain terms which follow from our basic models of shape-from-silhouettes consistency and visual information consistency (Eqs. (5) and (7)), and a regularization term to obtain smooth object surfaces.


3.1. Functional and Euler–Lagrange equation

Let S be a closed surface and $R_S$ the region enclosed by S. Since the functional to be defined will be minimized, let us simply negate Eq. (5), i.e., let:

$$ f(X) = \prod_{i} \left( e^{-r_i^2(X)/\beta^2} - 1 \right). \qquad (8) $$

The energy functional to minimize over all closed surfaces S is:

$$ E(S) = \int_{R_S} f(X)\, dV + a \int_{S} h(X)\, dS + b \int_{S} dS, \qquad (9) $$

where $a$ and $b$ are positive constants to weigh the relative contribution of the terms in the functional. Generic derivations of the Euler–Lagrange equations corresponding to each integral in (9) (a volume integral of a scalar function and surface integrals of scalar functions) can be found in [22]. Let

$$ g(X) = a\, h(X) + b. \qquad (10) $$

Then, the Euler–Lagrange descent equation to minimize (9) is:

$$ \frac{d\phi}{dt} = -\left( f + \nabla g \cdot \mathbf{n} - 2 g \kappa \right) \mathbf{n}, \qquad (11) $$

where $\phi$ is a parametrization of S, $\mathbf{n}$ is the outward unit normal on S, and $\kappa$ is the mean curvature function. The right-hand side of (11) is independent of surface parametrization, as it should be, because we are interested in the image S of $\phi$ and not in $\phi$ itself.

3.2. Level set evolution equation

Execution of the descent equation by explicit representation of S as a set of points does not accommodate changes in the topology of S. An alternative execution is via level sets, where S is represented implicitly as the zero level set of a one-parameter family of functions u:

$$ (\forall s) \quad u\left(x(s), y(s), z(s), s\right) = 0, \qquad (12) $$

where x, y, and z are the spatial variables and $(x(s), y(s), z(s)) = \phi(s)$. Differentiation of (12) with respect to s yields the level set equation that drives the evolution of u:

$$ \nabla u \cdot \frac{d\phi}{ds} + \frac{\partial u}{\partial s} = 0. \qquad (13) $$

Referring to (11) and taking u to be positive inside $R_S$ and negative outside ($u = 0$ on S), so that $\mathbf{n} = -\nabla u / \|\nabla u\|$, the evolution equation of u is, in our case:

$$ \frac{\partial u}{\partial s} = \left( \nabla g \cdot \frac{\nabla u}{\|\nabla u\|} + 2 g \kappa - f \right) \|\nabla u\|. \qquad (14) $$

Because it is a gradient descent, the algorithm always converges to a local minimum. Mean curvature is expressed in terms of u as in [29]:

$$ \kappa = \operatorname{div} \left( \frac{\nabla u}{\|\nabla u\|} \right). \qquad (15) $$

By construction, S can be recovered at any instant as the 0-level surface of the function u, regardless of the changes in topology during its evolution. The basic models and the problem formulation of Sections 2 and 3 referred to a mosaic view of the ground plane and to camera pose, in particular to the projection matrices of the different views of the environment. The details of how these were obtained were not necessary to formulate the problem. We now turn our attention to these details: building a mosaic overhead view of the ground plane (Section 4) and recovery of camera pose (Section 5).

4. Building a mosaic overhead view of the ground plane

The set of images collected by the moving robot must be assembled in order to produce a model of the ground plane over which the robot is operating. These images show the ground plane and the obstacles to be reconstructed, taken from different arbitrary and unknown viewpoints (see, for example, the images of Fig. 1). From these images, the objective is first to build an overhead view mosaic of the ground plane. Since the images have been captured by a tilted camera of known height and tilt angle, a rough estimate of the homographic transformation between the world plane (i.e., the ground plane) and each image plane can be obtained.


Fig. 2 shows the geometry of the camera system. The ground plane lies on the XZ plane, and the optical axis of the camera is aligned with the Z axis. The camera, at a height h, is rotated around the X axis by an angle $\phi$. When the geometry of the system is as shown, the $3 \times 3$ homography matrix $H_B$ of the projective relation $[x, y, 1]^{\mathrm{T}} = H_B [X_W, Z_W, 1]^{\mathrm{T}}$ between a world plane point and the corresponding image point can be written as follows:

$$ H_B = \begin{bmatrix} f & u_0 \cos\phi & u_0 h \cos\phi \\ 0 & f \sin\phi + v_0 \cos\phi & v_0 h \cos\phi - f h \sin\phi \\ 0 & \cos\phi & h \cos\phi \end{bmatrix}, \qquad (16) $$

where f represents the focal length of the camera, and $u_0$ and $v_0$ are the principal point coordinates. This transformation is invertible, such that the overhead view can be generated from a perspective view, or vice versa. Fig. 3 shows an image on which this kind of transformation has been applied.

Fig. 2. The geometry of the camera system.

Fig. 3. (A) One additional view of the scene. (B) The generated overhead view.
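A direct transcription of Eq. (16) is given below, together with an indicative warping call; the focal length, principal point, camera height, tilt angle, and output size are placeholder values, and the scaling from world units to mosaic pixels is omitted for brevity.

```python
import numpy as np

def overhead_homography(f, u0, v0, h, phi):
    """H_B of Eq. (16): maps a ground plane point [X_W, Z_W, 1] to an
    image pixel [x, y, 1] for a camera at height h tilted by phi."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([
        [f, u0 * c,         u0 * h * c],
        [0, f * s + v0 * c, v0 * h * c - f * h * s],
        [0, c,              h * c],
    ])

# Since H_B is invertible, an overhead view can be generated from a
# perspective image, e.g. with OpenCV (placeholder parameters):
#   H_B = overhead_homography(f=800.0, u0=320.0, v0=240.0,
#                             h=1.2, phi=np.radians(30.0))
#   top = cv2.warpPerspective(img, np.linalg.inv(H_B), (512, 512))
```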

4.1. Sparse view matching

To match two views of a scene, feature points must be detected. To this end, we used the Harris corner detector [16]. The detected feature points are then mapped onto the overhead views, on which matching will be performed. Indeed, working in the overhead view space has the virtue of eliminating the perspective distortion that deforms the visual patterns in each view. Consequently, to match points in these overhead views, only a rotationally invariant measure is required. Assuming that the intensity variation of images due to the changes in viewpoint is not significant, the invariant vector of Eq. (2) can give a robust characterization of the points of interest. However, because of their limited discrimination power, matching with invariants leads to several false matches. Therefore, we need to introduce an additional matching measure. The choice we made is based on the observation that while the transformation between two images of a plane is a general homography, the transformation between the two generated overhead views is an isometric transformation, i.e., it is composed of a rotation and a translation. The transformation has three degrees of freedom, two for translation and one for rotation; it can therefore be computed from two point correspondences. This isometric transformation $H_S$ can be written as follows:

$$ H_S = \begin{bmatrix} \cos\theta & -\sin\theta & T_x \\ \sin\theta & \cos\theta & T_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (17) $$

with $\theta$ being the angle between the two cameras. An invariant of this transformation is the Euclidean distance between two points; indeed, line lengths are preserved between the two overhead views. Following a RANSAC-like scheme, we randomly select two match pairs, and the lengths of the line segments that join the two selected points in each image are compared. If the difference in length is sufficiently low, then the points are considered to be good candidate matches. These candidate matches are then further validated by considering the intensity profiles of the candidate segments. Tell and Carlsson [31] also used a comparison between the affinely invariant Fourier features of the intensity profiles between randomly selected pairs of image interest points. We use a similar approach; however, in our algorithm, and because an isometry separates the two views, each line segment is simply divided into k + 1 points. Cross correlation between the intensity profiles of the left and right lines is then computed as follows:


$$ \mu_c = \frac{\sum_{i=0}^{k} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\left[\sum_{i=0}^{k} (A_i - \bar{A})^2\right] \left[\sum_{i=0}^{k} (B_i - \bar{B})^2\right]}}, \qquad (18) $$

where arrays A and B contain the pixel intensity values of the k + 1 points of the two segments (we used k = 8). If the correlation coefficient $\mu_c$ exceeds a given value, then the two segments, and therefore their corresponding end points, are assumed to match. The transformation $H_{Sij}$ between the two overhead view images can then be calculated using the resulting set of matches; a best fit finds the best isometric transformation. Using this result, the homography between the two views can then be calculated by:

$$ H_{ij} = H_{Bi}^{-1} H_{Sij} H_{Bj}, \qquad (19) $$

where $H_{Bi}$ is the overhead homography of view i and $H_{Sij}$ is the overhead isometric transformation between views i and j.
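The two-point estimation of the overhead isometry that drives this RANSAC-like scheme can be sketched as follows; the segment-length tolerance is an illustrative threshold, not a value given in the paper.

```python
import numpy as np

def isometry_from_two_matches(p, q):
    """H_S of Eq. (17) from two correspondences p[i] <-> q[i] between
    the two overhead views (a rotation theta plus a translation).

    p, q: 2 x 2 float arrays of matched point coordinates."""
    dp, dq = p[1] - p[0], q[1] - q[0]
    theta = np.arctan2(dq[1], dq[0]) - np.arctan2(dp[1], dp[0])
    c, s = np.cos(theta), np.sin(theta)
    H = np.eye(3)
    H[:2, :2] = [[c, -s], [s, c]]
    H[:2, 2] = q[0] - H[:2, :2] @ p[0]
    return H

def lengths_agree(p, q, tol=2.0):
    """Pre-test of the scheme: an isometry preserves segment length
    (tol is an illustrative tolerance, in pixels)."""
    return abs(np.linalg.norm(p[1] - p[0]) -
               np.linalg.norm(q[1] - q[0])) < tol
```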

To obtain a more accurate estimate of the inter-image homographies $H_{ij}$, the corners detected in one image are mapped using $H_{ij}$ to the corresponding locations in the other image. The neighborhood of each transferred point is searched for a corner such that the correlation difference between the two candidate corners is below a certain threshold. This updated set of matched corners is then used to refine the homography $H_{ij}$.
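A possible implementation of this guided refinement, assuming arrays of detected corner positions for the two views; the search radius and the RANSAC reprojection threshold are illustrative, and the correlation test on candidate corners is only indicated by a comment.

```python
import numpy as np
import cv2

def refine_homography(H, corners_i, corners_j, radius=5.0):
    """Transfer the corners of view i through H, pair each with the
    nearest corner of view j within `radius` pixels, and re-fit H."""
    pts = cv2.perspectiveTransform(
        corners_i.reshape(-1, 1, 2).astype(np.float64), H)
    src, dst = [], []
    for ci, p in zip(corners_i, pts.reshape(-1, 2)):
        d = np.linalg.norm(corners_j - p, axis=1)
        k = int(np.argmin(d))
        if d[k] < radius:      # ...and correlation difference small enough
            src.append(ci)
            dst.append(corners_j[k])
    # Re-estimate the homography from the enlarged, cleaner match set.
    H_refined, _ = cv2.findHomography(np.float64(src), np.float64(dst),
                                      cv2.RANSAC, 3.0)
    return H_refined
```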

4.2. Overhead view composition

By combining the computed inter-image homographies with the overhead transformations, it is possible to build the global overhead view mosaic of the ground plane. However, when assembling the different transformed views, only the image points showing the ground plane must be used; that is, images of the obstacles must be discarded.

To do so, we use the following procedure. For each point on the ground plane, the appropriate transformation is applied to obtain the corresponding image point in each view. The mean RGB value is then computed, and the image point that deviates the most from this mean value is discarded. This procedure is repeated until half of the image points have been discarded. The mean RGB value of the remaining points is then used in the mosaic composition. In our experiments, where a sufficient number of sparsely distributed views were used, this simple algorithm was able to eliminate the images of the obstacles from the ground model. Fig. 4 shows the mosaic obtained from the images of Figs. 1 and 3. Note, however, that our goal here was not to obtain a mosaic of high visual quality but rather to build a ground plane model suitable for the reconstruction of the obstacles (Section 2). This is the reason why we did not apply any image enhancement algorithm (such as pixel blending).

Fig. 4. The overhead view mosaic.
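The per-point rejection rule just described amounts to a few lines; this sketch assumes the view-to-mosaic transfers have already produced the RGB samples that a given ground point receives from the n views.

```python
import numpy as np

def mosaic_color(samples):
    """Robust ground plane color at one mosaic point (Section 4.2).

    samples: n x 3 array of RGB values from the n views. The sample
    farthest from the running mean is dropped until half remain, so
    that views seeing an obstacle at this point are discarded."""
    samples = np.asarray(samples, dtype=float)
    keep = max(1, len(samples) // 2)
    while len(samples) > keep:
        mean = samples.mean(axis=0)
        worst = int(np.argmax(np.linalg.norm(samples - mean, axis=1)))
        samples = np.delete(samples, worst, axis=0)
    return samples.mean(axis=0)
```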

5. Recovery of camera pose

The computed inter-image homographies have been used to generate an overhead view mosaic representing the ground plane model. These matrices can also be used to obtain the pose of each camera. For a two-camera system, Tsai et al. [34] showed that the planar homography can be decomposed as follows:

$$ H = K \left( R - \frac{T n^{\mathrm{T}}}{d} \right) K^{-1}, \qquad (20) $$

where K is the matrix containing the intrinsic parameters of the cameras, R is the rotation between the two cameras, n is the normal to the plane under consideration, and T is the translation. Finally, d is the distance from the camera to the ground, which can be arbitrarily set to 1. Based on this equation it is possible, as shown in [33], to extract the camera parameters through singular value decomposition of the inter-view homography. In general, the approach leads to two distinct solutions. Since, in our case, the normal vector to the ground plane is approximately known, the correct solution can be easily identified.
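In practice this decomposition is available off the shelf; the sketch below uses OpenCV's homography decomposition and, as described above, selects among the candidate solutions using the roughly known ground plane normal. The Y-up world normal is an assumption of the example.

```python
import numpy as np
import cv2

def pose_from_homography(H, K, n_ground=(0.0, 1.0, 0.0)):
    """Recover (R, T) from an inter-view planar homography, Eq. (20).

    cv2.decomposeHomographyMat returns up to four (R, T, n) candidates;
    the one whose plane normal best matches the expected ground normal
    is kept."""
    n_ground = np.asarray(n_ground, dtype=float)
    _, Rs, Ts, normals = cv2.decomposeHomographyMat(H, K)
    best = max(range(len(Rs)),
               key=lambda i: float(normals[i].ravel() @ n_ground))
    return Rs[best], Ts[best]
```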

6. Reconstruction experiments

Our numerical implementation is based on the level set methods proposed by Osher and Sethian in [24,29]. The first step in the implementation is to discretize the scene into 3D grid points (voxels). This is achieved using our knowledge of the ground plane equation with respect to a selected reference camera. The 3D grid points are divided into layers parallel to the ground plane; each layer has 200 × 200 grid points, for a total of 100 layers. The obtained 3D structures are represented in Fig. 5, where the reprojected set of voxels is shown under a virtual light source. This result has been obtained using the 10 available views of the scene (cf. Fig. 1) and the ground plane model shown in Fig. 4. The values of the various parameters were set experimentally; however, it has been observed that a wide range of values for these parameters gave results similar to the ones shown. A total of 1500 level set iterations were required to obtain the result shown. Considering that the camera parameters (the projection matrices) have been obtained directly from the image data set (as explained in Section 5), the resulting solution succeeded well in capturing the 3D structure of the observed 'ground plane obstacle' scene.

Fig. 5. The reconstructed obstacles of the scene shown in Fig. 1.

A second result is shown in Fig. 6. It has been obtained from 12 images, some of which are shown in Fig. 7. This type of scene, i.e., a non-uniform-brightness background combined with both textured and non-textured objects, is challenging for most 3D reconstruction algorithms. The evolution of the model during the level set iterative procedure is illustrated in Fig. 8, where one horizontal slice of the reconstructed model is shown at different iterations. The figure shows how, when starting with a cubic interface, the evolution can lead the level set function to break into several distinct volumes, some of them eventually vanishing if not supported by a sufficient level of energy. It should also be noted that level set evolution is a time-consuming process. However, this is not critical in the present application, because reconstruction would be performed off-line, most probably by a base station, upon reception of the data collected by the robot.

Fig. 6. The reconstructed obstacles of the scene shown in Fig. 7.

Fig. 7. A few images of a second scene.

Fig. 8. Evolution of one slice of the reconstructed scene shown in Fig. 7 after 100, 500, 900, 1100, 1200, and 1500 iterations.

Finally, we also tested our algorithm on a natural outdoor scene. A total of 14 images were used in this case. Fig. 9(A) shows one of these images, and the resulting 3D structure is represented in Fig. 9(B).

Fig. 9. One image of an outdoor scene and the reconstructed obstacle.

7. Conclusion

This study addressed the problem of reconstructing the structure of objects on the ground plane of a robot environment. The proposed solution is based on a general, variational statement of the problem, solved via level set PDEs. The objective energy functional to be minimized includes a novel term (silhouette-consistency) in addition to the terms of conformity to data (expressed, here, as a voxel coloring problem) and regularization. This extends the problem statement to (an arbitrary number of) real-world objects which may lack the brightness variations necessary to drive a shape-from-brightness process.

Different factors can affect the accuracy of the resulting 3D model. The precision of the estimated camera positions is of prime importance. This positional information is obtained here from the matching of natural features. When coupled with refining techniques, such as bundle adjustment, very accurate projection matrices can be obtained, as long as sufficient features are available in the scene. The set of available views must also ensure a complete coverage of the scene. The formulation also requires a good level of redundancy between the views; this is generally achieved by a 20–30° view separation, i.e., a loop of 10–20 views. The accuracy of the produced 3D model is ultimately determined by the resolution of the 3D grid used for level set evolution. The number of voxels in the grid also has an important impact on the required computational effort. The use of multi-resolution strategies (e.g., octree representations) to accelerate level set evolution can improve the efficiency of the algorithm. Faster implementations of level set methods are the object of active research.

Finally, the proposed solution also accounts for real-world robot environment image data. This context required the use of image invariants for sparse view matching, the estimation of camera pose from ground plane geometry, and the construction of a ground plane mosaic model. The result is a 3D representation of the environment explored by a mobile robot that is inferred strictly from the visual data and that has an accuracy largely sufficient for the robot to perform tasks such as avoidance, localization, recognition, and manipulation.

References

[1] P. Batavia, S. Singh, Obstacle detection using color segmentation and color stereo homography, in: IEEE Conference on Robotics and Automation, Seoul, Korea, 2001.
[3] C. Chian, J. Aggarwal, Model reconstruction and shape recognition from occluded contours, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 372–389.
[4] Y. Chow, R. Chung, Obstacle avoidance of legged robot without 3D reconstruction of the surroundings, in: Proc. of the IEEE Conf. on Robotics and Automation, 2000, pp. 2316–2321.
[5] U.R. Dhond, J.K. Aggarwal, Structure from stereo—a review, IEEE Trans. Syst. Man Cybern. 19 (1989) 1489–1510.
[6] W. Enkelmann, Obstacle detection by evaluation of optical flow fields from image sequences, Image Vision Comput. 9 (3) (1991) 160–168.
[7] P.A. Beardsley, I.D. Reid, A. Zisserman, D.W. Murray, Active visual navigation using non-metric structure, in: Int. Conf. on Computer Vision, 1995, pp. 58–64.
[8] V. Caselles, R. Kimmel, G. Sapiro, C. Sbert, 3D active contours, in: Int. Conf. Anal. Opt. Syst., 1996, pp. 43–49.
[9] O. Faugeras, R. Keriven, Variational principles, surface evolution, PDE's, level set methods and the stereo problem, IEEE Trans. Image Process. 7 (3) (1998) 336–344.
[10] O. Faugeras, J. Gomes, R. Keriven, Computational stereo: a variational method, in: S. Osher, N. Paragios (Eds.), Geometric Level Set Methods in Imaging, Vision and Graphics, Springer-Verlag, Berlin, 2003, p. 532.
[11] F. Ferrari, E. Grosso, G. Sandini, M. Magassi, A stereo vision system for real time obstacle avoidance in unknown environment, in: IEEE Int. Workshop on Intelligent Robots and Systems, 1990, pp. 703–708.
[12] P. Fornland, Direct obstacle detection and motion from spatio-temporal derivatives, in: CAIP, Prague, Czech Republic, September 1995.
[13] D. Geman, G. Reynolds, Constrained restoration and the recovery of discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 14 (3) (1992) 367–383.

[14] M. Agrawal, L.S. Davis, A probabilistic framework for surface reconstruction from multiple images, IEEE Conf. Comput. Vision Pattern Recogn. 2 (2001) 470–476.

[15] A. Broadhurst, T.W. Drummond, R. Cipolla, A probabilistic framework for space carving, Int. Conf. Comput. Vision 1 (2001) 7–14.
[16] C. Harris, M. Stephens, A combined corner and edge detector, Alvey Vision Conf. (1988) 147–151.
[17] E. Krotkov, Mobile robot localization using a single image, in: Proc. IEEE Int. Conf. on Robotics and Automation, 1989, pp. 978–983.
[18] A. Laurentini, The visual hull concept for silhouette-based image understanding, IEEE Trans. Pattern Anal. Mach. Intell. 16 (2) (1994) 150–162.
[20] M. Lourakis, S. Orphanoudakis, Visual detection of obstacles assuming a locally planar ground, Proc. 3rd Asian Conf. on Computer Vision 2 (1998) 527–534.
[21] A. Mitiche, Computational Analysis of Visual Motion, Plenum Press, 1994.
[22] A. Mitiche, R. Feghali, A. Mansouri, Motion tracking as spatio-temporal motion boundary detection, Robot. Auton. Syst. 43 (2003) 39–50.
[23] P. Montesinos, V. Gouet, R. Deriche, D. Pelé, Matching color uncalibrated images using differential invariants, Image Vision Comput. 18 (9) (2000) 659–672.
[24] S. Osher, J. Sethian, Fronts propagating with curvature-dependent speed: algorithms based on Hamilton–Jacobi formulations, J. Comput. Phys. 79 (1988) 12–49.
[25] A.C. Prock, C.R. Dyer, Towards real-time voxel coloring, in: Proc. Image Understanding Workshop, 1998, pp. 315–321.
[26] J. Santos-Victor, G. Sandini, Uncalibrated obstacle detection using normal flow, Mach. Vision Appl. (1996) 130–137.
[27] H. Sekkati, A. Mitiche, Dense 3D interpretation of image sequences: a variational approach using anisotropic diffusion, in: IAPR International Conference on Image Analysis and Processing, Mantova, Italy, 2003.
[28] S. Seitz, C. Dyer, Photorealistic scene reconstruction by voxel coloring, in: IEEE Conf. on Computer Vision and Pattern Recognition, 1997, pp. 1067–1073.
[29] J. Sethian, Level Set Methods and Fast Marching Methods, Cambridge University Press, Cambridge, 1996.
[30] R. Sim, G. Dudek, Mobile robot localization from learned landmarks, in: Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems (IROS), 1998.
[31] D. Tell, S. Carlsson, Wide baseline point matching using affine invariants computed from intensity profiles, in: European Conf. on Computer Vision, 2000, pp. 814–828.
[32] S. Thrun, Y. Liu, Multi-robot SLAM with sparse extended information filters, in: Proceedings of the 11th International Symposium of Robotics Research (ISRR'03), 2003.
[33] B. Triggs, Auto-calibration from planar scenes, in: European Conf. on Computer Vision, 1998, pp. 89–105.
[34] R.Y. Tsai, T. Huang, W. Zhu, Estimating three-dimensional motion parameters of a rigid planar patch, IEEE Trans. Acoust. Speech Signal Process. 30 (4) (1982) 525–534.
[35] J. Weng, N. Ahuja, T. Huang, Two-view matching, in: Int. Conf. on Computer Vision, 1988, pp. 65–73.
[36] A. Yezzi, S. Soatto, Structure from motion for scenes without features, in: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2003.
[37] Z. Zhang, R. Weiss, A. Hanson, Qualitative obstacle detection, in: IEEE Conf. on Computer Vision and Pattern Recognition, 1994, pp. 554–559.
[38] J. Zheng, Acquiring 3D models from sequences of contours, IEEE Trans. Pattern Anal. Mach. Intell. 16 (2) (1994) 163–178.