
Page 1: Wall Estimation from Stereo Vision in Urban Street Canyons - KIT - MRT · 2013-09-26 · Wall Estimation from Stereo Vision in Urban Street Canyons Tobias Schwarze and Martin Lauer

Wall Estimation from Stereo Vision in Urban Street Canyons

Tobias Schwarze and Martin Lauer
Institute of Measurement and Control Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany

Keywords: Environment Perception, Geometry Estimation, Robust Plane Fitting.

Abstract: Geometric context has been recognised as important high-level knowledge towards the goal of scene understanding. In this work we present two approaches to estimate the local geometric structure of urban street canyons captured from a head-mounted stereo camera. A dense disparity estimation is the only input for both approaches. First, we show how the left and right building facade can be obtained by planar segmentation based on random sampling. In a second approach we transform the disparity into an elevation map from which we extract the main building orientation. We evaluate both approaches on a set of challenging inner city scenes and demonstrate how visual odometry can be incorporated to keep track of the estimated geometry.

1 INTRODUCTION

Robotic systems aiming at autonomously navigating public spaces need to be able to understand their surrounding environment. Many approaches towards visual scene understanding have been made, covering different aspects such as object detection, semantic labeling, or scene recognition. The extraction of geometric knowledge has also been recognised as an important high-level cue to support scene interpretation from a more holistic viewpoint. Recent work, for instance, demonstrates its applicability in top-down reasoning (Geiger et al., 2011a; Cornelis et al., 2008).

Extracting geometric knowledge is a hard task, especially in populated outdoor scenarios, because it requires telling large amounts of unstructured clutter apart from the basic elements that make up the geometry. This problem can be approached from many sides, depending on the input data. In recent years the extraction of geometric knowledge from single images has attracted a lot of attention and has been approached in different ways, e.g. as a recognition problem (Hoiem et al., 2007), as a joint optimization problem (Barinova et al., 2010), or by geometric reasoning, e.g. on line segments (Lee et al., 2009). In contrast to the extensive work in this field, here we investigate the problem based on range data acquired from a stereo camera setup as the only input, which is in principle replaceable by any range sensor such as LIDAR systems or TOF cameras. We aim at extracting a local geometric description of urban street scenarios with building facades to the left and right

("street canyon"). Rather than trying to explain the environment as accurately as possible, our focus is a simplified and thus very compact representation that highlights the coarse scene geometry and provides a starting point for subsequent reasoning steps. To this end, our goal is a representation based on geometric planes: in the given street canyon scenario, one plane for each building facade, each aligned vertically to the groundplane.

Such a representation can basically be found in two ways. The 3D input data can be segmented by growing regions using similarity and local consistency criteria between adjacent data points, leading to planar surface patches, or surfaces can be expressed as parametric models and fitted directly to the data. Both ways have attracted much attention. Studies on range image segmentation have been conducted, but they usually evaluate range data that differs strongly from outdoor range data obtained by a stereo camera in terms of the size of planar patches and the level of accuracy (Hoover et al., 1996). Variants of region growing can be found in e.g. (Gutmann et al., 2008; Poppinga et al., 2008).

The combination of short-baseline stereo, large distances in urban scenarios, and difficult light conditions due to a freely moving and unconstrained platform poses challenging conditions. Additionally, we cannot assume an unobstructed view of the walls, since traffic participants and static infrastructure in particular often occlude large parts of the images; a key requirement is hence the robustness of the fitting methods.

Region growing alone does not guarantee to result

Schwarze T. and Lauer M. (2013). Wall Estimation from Stereo Vision in Urban Street Canyons. In Proceedings of the 10th International Conference on Informatics in Control, Automation and Robotics, pages 83-90. DOI: 10.5220/0004484600830090. Copyright © SciTePress


Figure 1: Plane estimation based on point-to-plane distance thresholds using (a) a fixed distance in XYZ coordinates, (b) the Mahalanobis distance in XYZ coordinates, and (c) a fixed distance in uvd coordinates. The top row visualizes with gray values the number of support points for a vertical plane swept through the XYZ- resp. uvd-points in distance and angle steps of 0.5 m resp. 1°. The true plane parameters are marked in red. The bottom row shows the support points for these plane parameters as an overlay.

in connected surfaces when occlusions visually split the data; a subsequent merging step would be necessary. This does not occur when parametric models are fitted directly. The most popular methods here are random sampling and 3D Hough transformations (Iocchi et al., 2000).

A large body of literature focuses specifically on the task of groundplane estimation. In the case of vision systems, planes have been extracted using v-disparity representations (Labayrade et al., 2002) and robust fitting methods (Se and Brady, 2002), often assuming a fixed sensor orientation (Chumerin and Van Hulle, 2008).

We start by estimating the groundplane using random sampling. Based on the groundplane parameters, we constrain the search space to fit two planes to the left and right building facade. In Sections 2.2 and 2.3 we present two robust methods to fulfil this task. In Section 3 we evaluate both methods using a dataset of inner city scenes and show how visual odometry data can be integrated to keep track of the estimated geometry.

2 PLANE FITTING

Estimating planar structures from an egocentric viewpoint in urban environments has to deal with a huge amount of occlusion. Especially the groundplane is often visible in only a very small part of the image, since buildings, cars, or even pedestrians normally occlude the free view onto the ground. Hence, robustness of the methods is a key requirement. Therefore, we developed an approach based on the RANSAC scheme (Fischler and Bolles, 1981), which is known to produce convincing results on model fitting problems even with far more than 50% outliers. In a scenario with a fairly free view and cameras pointing towards the horizon with little tilt, a good heuristic for finding an initial estimate of the groundplane is to constrain the search space to the lower half of the camera image.

A plane described through the equation

aX + bY + cZ + d = 0

can be estimated with the RANSAC scheme by repeatedly selecting 3 random points and evaluating the support of the plane fitting these points. A fit is evaluated by counting the 3D points with a point-to-plane distance of less than a certain threshold. In our case, we had to extend the RANSAC scheme by an adaptive threshold to cope with the varying inaccuracy of 3D points determined from a stereo camera. To account for the uncertainty, the covariance matrices of the XYZ points can be incorporated into the distance threshold. In the case of reconstruction from stereo vision, one obtains the 3D coordinates (X, Y, Z)¹ through:

¹Our Z-axis equals the optical axis of the camera, with the X-axis pointing right and the Y-axis towards the ground. Compare Figure 2.


Figure 2: Stereo covariances.

F(u, v, d) = (X, Y, Z)^T = ( B(u - c_x)/d,  B(v - c_y)/d,  B·f/d )^T        (1)

where B is the baseline, f the focal length, d the disparity measurement at image point (u, v), and (c_x, c_y) the principal point. The covariance matrix C can be calculated by C = J·M·J^T (also found in (Murray and Little, 2004)), with J the Jacobian of F:

        | ∂X/∂u  ∂X/∂v  ∂X/∂d |   | B/d   0     -B(u - c_x)/d² |
    J = | ∂Y/∂u  ∂Y/∂v  ∂Y/∂d | = | 0     B/d   -B(v - c_y)/d² |
        | ∂Z/∂u  ∂Z/∂v  ∂Z/∂d |   | 0     0     -B·f/d²        |

Assuming a measurement standard deviation of 1 px for the u and v coordinates and a disparity matching error of 0.05 px, we obtain the measurement matrix M = diag(1, 1, 0.05). A world point on the optical axis of the camera in 15 m distance is subject to a standard deviation of ±1 m (focal length 400 px, baseline 12 cm). While the Z uncertainty of reconstructed points grows quadratically with increasing distance, the uncertainty of the reconstructed X and Y components remains reasonably small (see Figure 2).
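To make the error propagation concrete, the covariance of a reconstructed point can be sketched as below. The baseline (12 cm) and focal length (400 px) match the example values in the text; the principal point is an arbitrary placeholder.

```python
import numpy as np

def stereo_point_covariance(u, v, d, B=0.12, f=400.0, cx=320.0, cy=240.0):
    """Propagate pixel/disparity uncertainty through the stereo
    back-projection F(u, v, d) = (X, Y, Z) via C = J * M * J^T."""
    # Jacobian of F, evaluated at (u, v, d)
    J = np.array([
        [B / d, 0.0,   -B * (u - cx) / d**2],
        [0.0,   B / d, -B * (v - cy) / d**2],
        [0.0,   0.0,   -B * f / d**2],
    ])
    M = np.diag([1.0, 1.0, 0.05])  # u, v, and disparity measurement noise
    return J @ M @ J.T

# Point on the optical axis at Z = 15 m, i.e. disparity d = B*f/Z = 3.2 px:
C = stereo_point_covariance(u=320.0, v=240.0, d=0.12 * 400.0 / 15.0)
sigma_z = np.sqrt(C[2, 2])  # roughly the 1 m standard deviation quoted above
```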

With the covariance matrices we can determine the point-to-plane Mahalanobis distance and use it instead of a fixed distance threshold to count plane support points. This way the plane margin grows with increasing camera distance according to the uncertainty of the reconstruction. Calculating the point-to-plane Mahalanobis distance essentially means transforming the covariance uncertainty ellipsoids into spheres. A way to do so is shown in (Schindler and Bischof, 2003).

However, calculating the covariance matrix for every XYZ-point is computationally expensive. We can avoid this by fitting planes in the uvd-space and transforming the plane parameters into XYZ-space afterwards.

Fitting planes in uvd-space can be done in the same way as described above for the XYZ-space. We obtain the plane model satisfying the equation

    α·u + β·v + γ - d = 0.

Expressing u, v, d through equations (1) yields the uvd- to XYZ-plane transformation

    a = α,   b = β,   c = (α·c_x + β·c_y + γ)/f,   d = -B,

and vice versa

    α = a·B/d,   β = b·B/d,   γ = B·(c·f - a·c_x - b·c_y)/d.
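A small sketch can serve as a consistency check of the uvd-to-XYZ direction: a pixel lying on a uvd-plane must back-project to a 3D point on the corresponding XYZ-plane. The calibration values and plane parameters below are illustrative, not from a real camera.

```python
import numpy as np

def uvd_plane_to_xyz(alpha, beta, gamma, B=0.12, f=400.0, cx=320.0, cy=240.0):
    """Map a disparity-space plane alpha*u + beta*v + gamma - d = 0 to
    XYZ-space parameters (a, b, c, d) of a*X + b*Y + c*Z + d = 0."""
    return alpha, beta, (alpha * cx + beta * cy + gamma) / f, -B

# Illustrative plane and calibration:
B, f, cx, cy = 0.12, 400.0, 320.0, 240.0
alpha, beta, gamma = 0.01, 0.02, 5.0
# A pixel on the uvd-plane, with its disparity...
u, v = 400.0, 300.0
d_disp = alpha * u + beta * v + gamma
# ...back-projects onto the transformed XYZ-plane:
X, Y, Z = B * (u - cx) / d_disp, B * (v - cy) / d_disp, B * f / d_disp
a, b, c, d = uvd_plane_to_xyz(alpha, beta, gamma, B, f, cx, cy)
residual = a * X + b * Y + c * Z + d  # vanishes for a consistent mapping
```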

In the following section we demonstrate the importance of considering the reconstruction uncertainty when setting the plane distance threshold.

2.1 Sweep Planes

For the purpose of estimating facades of houses, we construct a plane vertical to the groundplane and sweep it through the XYZ- respectively uvd-points. For most urban scenarios it is a valid assumption that man-made structures, and even many weak and cluttered structures like forest edges or fences, are strongly aligned vertically to the floor. Knowledge of the groundplane parameters {n_gp, d_gp}, with groundplane normal n_gp and distance d_gp from the camera origin, allows us to construct arbitrary planes perpendicular to the ground.

We construct a sweep plane vector perpendicular to n_gp and the Z-axis (compare Figure 2) by

    n_sweep(α) = R_{n_gp}(α) · (n_gp × [0, 0, 1]^T),

where the rotation matrix R_{n_gp}(α) rotates by α degrees around the axis given by n_gp.
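The construction of the sweep normal can be sketched with a Rodrigues rotation. The groundplane normal (0, -1, 0) follows the axis convention of the footnote (Y pointing towards the ground) and is an idealized value; the function names are ours.

```python
import numpy as np

def rotation_about_axis(axis, angle_deg):
    """Rodrigues rotation matrix about a (unit) axis."""
    k = np.asarray(axis, float)
    k = k / np.linalg.norm(k)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    t = np.radians(angle_deg)
    return np.eye(3) + np.sin(t) * K + (1 - np.cos(t)) * (K @ K)

def sweep_normal(n_gp, alpha_deg):
    """n_sweep(alpha) = R_{n_gp}(alpha) (n_gp x [0, 0, 1]^T): a normal
    perpendicular to the groundplane normal and, for alpha = 0, to Z."""
    base = np.cross(n_gp, np.array([0.0, 0.0, 1.0]))
    return rotation_about_axis(n_gp, alpha_deg) @ base

# With Y pointing towards the ground (see the footnote), n_gp = (0, -1, 0):
n0 = sweep_normal(np.array([0.0, -1.0, 0.0]), 0.0)    # points along -X
n37 = sweep_normal(np.array([0.0, -1.0, 0.0]), 37.0)  # still horizontal
```

Because the rotation axis is n_gp itself, every sampled normal stays perpendicular to the ground, which is exactly the verticality constraint the sweep exploits.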

We sweep the plane {n_sweep(α), d_sweep} through the XYZ- respectively uvd-points by uniformly sampling α and d_sweep and store the number of support points for every sample plane in the parameter space (α, d_sweep). The result for a sampling between -10° < α < 10° and -3 m < d_sweep < 15 m with a stepsize of 1° resp. 0.5 m is shown in Figure 1. Peaks in the parameter space correspond to good plane support.

A fixed threshold in XYZ-space (Figure 1(a)) overrates planes close to the camera (left facade), while planes further away are underrated (right wall). The Mahalanobis threshold (Figure 1(b)) can compensate for this. Fitting planes in uvd-space leads to the same result (Figure 1(c)), without the computational expense.

The method can be used to extract planar surfaces from the points, regardless of whether the input data is present in XYZ- or uvd-space. It is easy to incorporate prior scene knowledge like geometric cues to select


planes and suppress non-maxima. Parallel planes are mapped to the same rows in parameter space, while a minimal distance between selected planes can be enforced by the column gap.

In practice, when knowledge about the rough orientation of the scene is unavailable, the computational cost of building the parameter space is very high. In our strongly constrained example scene with a ±10° heading angle and an 18 m distance range, 777 planes already had to be evaluated. The required size and subsampling of the parameter space is hard to anticipate and would have to be chosen much bigger, which makes the approach less attractive in this simple form.

Since plane sweeping is not a data-driven approach, many planes are evaluated that are far off the real plane parameters, and simplifying the data does not reduce the number of evaluated planes. In its data-driven form this algorithm resembles the Hough transform (Hough, 1962), which has also been studied and extended for model-fitting problems in 3D data, e.g. (Borrmann et al., 2011).

In the following sections we present two data-driven approaches for wall estimation. First, we show how RANSAC can be used to achieve a planar scene segmentation, from which we extract the street geometry. In Section 2.3 we use a 2D Hough transform to find the left and right facade.

Figure 3: RANSAC based planar segmentation. The two most parallel planes (bottom) are selected from five hypotheses (top).

2.2 Planar RANSAC Segmentation

RANSAC based plane fitting can obviously be used to fit not only the groundplane but any other planar structure in the scene. Assuming the groundplane parameters are known, we first remove the groundplane points from uvd-space. In the remaining points we iteratively estimate the best plane using RANSAC. Since we are interested in vertical structures, we can reject plane hypotheses that intersect the groundplane at angles smaller than 70° by comparing their normal vectors. After every iteration we remove the plane support points from uvd-space. This way we generate five unique plane hypotheses, out of which we select by pairwise comparison the two most parallel planes with distance d = |n1·d1 + n2·d2| > 5 m to obtain the left and right building facade. Figure 3 shows an example scene with five plane hypotheses (top), out of which the red and blue ones are selected, since they are the most parallel and also exceed the minimal distance threshold.
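The segmentation loop might be sketched as follows. This is our illustrative reconstruction, not the authors' code: the five-plane budget and the 70° intersection angle are from the text, the inlier threshold is an assumption, and the wall distance is computed here as the plane separation |d1 - d2| after aligning the two normals, a slight rephrasing of the pairwise criterion above.

```python
import numpy as np

def ransac_plane(points, n_hypotheses=50, thresh=0.05, rng=None):
    """Fit one plane (n, d) of n.x + d = 0 by random sampling."""
    rng = rng or np.random.default_rng(0)
    best = (None, 0.0, np.array([], dtype=int))
    for _ in range(n_hypotheses):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n @ p0
        inliers = np.flatnonzero(np.abs(points @ n + d) < thresh)
        if len(inliers) > len(best[2]):
            best = (n, d, inliers)
    return best

def segment_walls(points, n_gp, n_planes=5, min_gap=5.0):
    """Extract up to n_planes near-vertical planes, removing the support
    points after each iteration, then return the two most parallel
    planes that are at least min_gap apart."""
    rng = np.random.default_rng(0)
    hypos, pts = [], points
    while len(hypos) < n_planes and len(pts) > 50:
        n, d, inl = ransac_plane(pts, rng=rng)
        if n is None:
            break
        # keep only planes intersecting the ground at >= 70 degrees
        if abs(n @ n_gp) <= np.cos(np.radians(70.0)):
            hypos.append((n, d))
        pts = np.delete(pts, inl, axis=0)  # remove the support points
    best_pair, best_par = None, -1.0
    for i in range(len(hypos)):
        for j in range(i + 1, len(hypos)):
            (ni, di), (nj, dj) = hypos[i], hypos[j]
            if ni @ nj < 0:            # align normals before comparing
                nj, dj = -nj, -dj
            if abs(di - dj) > min_gap and ni @ nj > best_par:
                best_pair, best_par = (hypos[i], hypos[j]), ni @ nj
    return hypos, best_pair
```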

The top row in Figure 4 presents the output for some challenging scenarios, some of which feature considerable occlusions. In every iteration we evaluate 50 planes with a valid groundplane intersection angle, the 5 iterations hence summing to 250 evaluated plane hypotheses.

2.3 Elevation Maps

The strong vertical alignment of man-made environments can be exploited by transforming the 3D point data into an elevation map. We do this by discretizing the groundplane into small cells (e.g. 10x10 cm) and projecting the 3D points along the groundplane normal onto this grid. The number of 3D points projecting onto a cell provides a hint about the elevation over ground for that cell: grid cells underneath high vertical structures will count more points than grid cells underneath free space. The grid can then be analysed using 2D image processing tools. In the case of a street scenario with expected walls to the left and right, we apply a Hough transform to discover two long connected building facades. For geometric plausibility we again enforce a minimal distance of 5 m between walls when selecting the Hough peaks. Two examples along with their elevation maps are shown in Figure 5.
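The accumulation step might look as follows; the 10x10 cm cell size is from the text, while the grid extents and the in-plane basis construction are our assumptions. The subsequent line search would run a 2D Hough transform over this grid.

```python
import numpy as np

def elevation_map(points, n_gp, d_gp, cell=0.1,
                  x_range=(-10.0, 10.0), z_range=(0.0, 20.0)):
    """Project 3D points along the groundplane normal onto a grid of
    cell x cell metres and count the hits per cell; high counts hint
    at tall vertical structure. Grid extents are illustrative."""
    n = np.asarray(n_gp, float)
    n = n / np.linalg.norm(n)
    # orthonormal in-plane basis (assumes n is not parallel to the X-axis)
    e2 = np.cross(n, [1.0, 0.0, 0.0])
    e2 = e2 / np.linalg.norm(e2)
    e1 = np.cross(e2, n)
    proj = points - np.outer(points @ n + d_gp, n)  # drop height component
    xs, zs = proj @ e1, proj @ e2
    nx = int(round((x_range[1] - x_range[0]) / cell))
    nz = int(round((z_range[1] - z_range[0]) / cell))
    grid = np.zeros((nz, nx), dtype=np.int32)
    ix = np.floor((xs - x_range[0]) / cell).astype(int)
    iz = np.floor((zs - z_range[0]) / cell).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iz >= 0) & (iz < nz)
    np.add.at(grid, (iz[ok], ix[ok]), 1)  # unbuffered accumulation
    return grid

# A wall at X = -4 m shows up as one densely filled grid column:
wall = np.c_[np.full(500, -4.0),
             np.linspace(-2.0, 1.4, 500),
             np.linspace(2.0, 18.0, 500)]
grid = elevation_map(wall, n_gp=[0.0, -1.0, 0.0], d_gp=1.5)
```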

The approach also works with weak structures like forest edges (bottom image), though overhanging trees obviously cause a deviation from the real forest bottom here, and the assumed street model with two walls does not hold in this view. The approach will benefit strongly from integrating elevation maps over multiple frames, which is a topic of future work.


Figure 4: Results of RANSAC based planar segmentation (top row) and estimation of facade orientations using elevation maps (bottom row), which can be used in a subsequent step to generate a corresponding surface. Scenes are labeled (a)-(h).

2.4 Iterative Least-squares

The random nature of the RANSAC segmentation and the assumption of perfectly vertical buildings in the case of the elevation maps prevent either approach from producing perfectly robust results. Nevertheless, both approaches normally yield an approximation of the main orientation of the buildings which is accurate enough to optimize the estimated surface with a least squares estimator. To deal with the remaining outliers we optimize the plane hypotheses iteratively: by shrinking the plane-to-point distance threshold in every iteration, the optimization converges within a few iterations.
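A minimal sketch of the refinement, assuming a total least squares re-fit per iteration; the threshold schedule is an illustrative assumption, as the paper does not state its values:

```python
import numpy as np

def refine_plane(points, n0, d0, thresholds=(0.5, 0.25, 0.1)):
    """Iterative least-squares refinement of a plane n.x + d = 0: in
    each iteration, re-fit to the inliers of a shrinking distance
    threshold (in metres)."""
    n = np.asarray(n0, float) / np.linalg.norm(n0)
    d = float(d0)
    for t in thresholds:
        inl = points[np.abs(points @ n + d) < t]
        if len(inl) < 3:
            break
        centroid = inl.mean(axis=0)
        # total least squares: normal = smallest right singular vector
        _, _, Vt = np.linalg.svd(inl - centroid)
        n_new = Vt[-1]
        if n_new @ n < 0:
            n_new = -n_new  # keep a consistent orientation
        n, d = n_new, -(n_new @ centroid)
    return n, d
```

The verification step described next would then compare the angle between the initial normal n0 and the refined normal and reject the fit if it deviates too strongly.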

We verify the plane by comparing the normal angle deviation between the initial fit and the optimized fit. A false initial fit will lead to large deviations and can be rejected in this way.

3 EXPERIMENTS

In a set of experiments we compare the different approaches for plane estimation. Our experimental setup consists of a calibrated stereo rig with a short baseline of around 10 cm and a video resolution of 640x480 px. To enlarge the field of view we deploy a wide-angle lens of 12 mm focal length. We obtain the dense disparity estimation using an off-the-shelf semi-global matching approach (OpenCV).

3.1 Evaluation

We ran tests on a dataset consisting of inner city scenes captured from an ego-view perspective to evaluate the applicability of the proposed approaches in challenging scenarios. Processing video data from this perspective especially has to be robust against occlusion and the high degree of clutter caused by parked cars or bikes, trees, or other dynamic traffic participants. Another issue in real-life scenarios is the challenging light conditions, which often lead to over- and underexposed parts within the same image. Figure 4 shows 8 scenes with the result of the RANSAC segmentation in the top row and the resulting facade orientation drawn from elevation maps in the bottom row.

Figure 5: Left column: elevation maps. Connected elements are found by Hough transformation and projected into the camera image (right column).

Planar RANSAC Segmentation

The random plane selection in the RANSAC segmentation approach makes it difficult to draw quantitative conclusions about the performance. To give some numbers, we picked five of the more difficult scenarios and evaluated the repeatability of the output. We ran the algorithm 50 times on each scenario and evaluated the number of correct wall estimations by manual supervision. We consider a wall as missed when the estimated orientation deviates so strongly that a subsequent optimization step will not converge close to the optimum.

Figure 6: Evaluation of the repeatability of planar RANSAC segmentation. Shown is the percentage of correctly determined walls in 50 repetitions; scenes correspond to Figure 4.

Figure 6 shows the results for some scenarios taken from Figure 4. Scenario (c) is challenging in that the left building wall is hardly visible due to occlusion, and the visible part is overexposed; the right wall is found very robustly. The large number of errors in detecting the right wall in scenario (d) can be explained by clutter, which often leads to planes fitted to the sides of the parked cars. Increasing the number of sampled planes per iteration would probably prevent this. The substantial gap on the right-hand side in (h) explains the often missed right wall. In scenarios with a mostly free view of the walls, a rate of around 90% for both walls is realistic.

Elevation Maps

To rate the stability of the wall estimation based on elevation maps, we investigate two sequences of 200 and 500 frames, taken while travelling down a street canyon. In each frame we estimate the orientation of both walls, independently of the previous frame.

The first sequence consists of 200 frames and is mostly free of wall occlusions. The algorithm finds the correct wall orientation in all but 8 frames, which

Figure 7: Failures in orientation estimation using elevation maps ((a)-(c)).


Figure 8: Plane parameters tracked over a sequence of 450 frames. The top diagram shows the groundplane angle, the bottom diagram the normal plane distance. Parts predicted due to walls being out of view are shaded.

were taken while passing a street sign (see Figure 7(a)). The second sequence consists of 500 frames and is more cluttered. The algorithm fails to estimate one of the walls correctly in around 10% of all frames. The reasons are always related to obstacles that were not filtered because they exceed the groundplane cut-off height, or obstacles that occlude the wall; see Figures 7(b) and 7(c) for two examples.

3.2 Geometry Tracking

The wall estimation is embedded as part of a real-time system, which also contains a module to estimate the camera motion between consecutive frames by running the visual odometry estimation taken from the LIBVISO2 library (Geiger et al., 2011b). Visual odometry provides the egomotion in all six degrees of freedom, such that the camera poses of frames n-1 and n are related via a translation vector t and a rotation matrix R.

Knowledge of the camera transformation allows us to predict the current groundplane and wall parameters from the previous frame. The XYZ-plane p_{t-1} = {n, d}, with surface normal n and distance d from the camera origin, transforms into the current frame via

    p_t = ([R|t]^{-1})^T · p_{t-1}.
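The prediction step can be sketched as below, assuming the convention that the 4x4 matrix built from [R|t] maps points of the previous frame into the current one; the numeric example is ours.

```python
import numpy as np

def predict_plane(n, d, R, t):
    """Predict plane parameters in the current frame from the previous
    one via p_t = ([R|t]^{-1})^T p_{t-1}, with the plane written in
    homogeneous form (n, d) of n.X + d = 0."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    p = np.linalg.inv(T).T @ np.append(n, d)
    return p[:3], p[3]

# Moving 0.2 m towards the ground (points shift by t = (0, -0.2, 0) in
# the Y-down convention) brings the groundplane n = (0, -1, 0), d = 1.5
# (i.e. Y = 1.5) 0.2 m closer:
n_new, d_new = predict_plane(np.array([0.0, -1.0, 0.0]), 1.5,
                             np.eye(3), np.array([0.0, -0.2, 0.0]))
```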

In a street canyon scenario we proceed as follows. We initialize the groundplane and the planes for the left and right wall with the methods described above. For the following frames we use the prediction as the starting point for the iterative least squares optimization, to compensate for the inaccuracy of the egomotion estimation. To stabilize the process over time, we store the best fitting support points in a plane history ring buffer and incorporate them with a small weighting factor into the subsequent least squares optimization. We reject the optimization when the plane normal angles of prediction and optimization deviate by more than 5° in either direction, or when the groundplane angle becomes smaller than 80°. The reasons for this to happen are normally related to a limited view onto the wall, which occurs either when the wall is occluded by some close object (e.g. a truck) or when the cameras temporarily point away from the wall. If the optimization was rejected, we carry over the prediction as the current estimate and continue like that until the wall is in proper view again.

The estimated plane parameters for both walls in a sequence of 450 frames are plotted in Figure 8. The upper diagram shows the angle between the groundplane and the walls; the bottom diagram shows the plane distance parameters. The distances add up to the street width, with a mean of 9.3 m for this sequence.

The sequence begins on the left sidewalk and ends on the right sidewalk after crossing the street. It contains several parts in which the walls are out of view due to the camera heading; some are shown in the screenshots. As explained earlier, these parts are bridged by predicting the parameters using the egomotion and are shaded in the diagram.

4 CONCLUSIONS

We have demonstrated two approaches towards estimating the local geometric structure in the scenario of urban street canyons. We model the right and left building walls as planar surfaces and estimate the underlying plane parameters from 3D data points obtained from a passive stereo camera system, which is replaceable by any kind of range sensor as long as the uncertainties of the reconstructed 3D points are known and can be considered.

The presented approaches are not intended as standalone solutions. Their purpose is rather to separate a set of inlier points fitting the plane model in order to initialize optimization procedures such as the iterative least squares we applied. By taking visual odometry in combination with a prediction and update step into the loop, we are able to present a stable approach to keep track of the groundplane and both walls.

Future work includes integrating the rich information offered by the depth-registered image intensity values and relaxing the assumptions implied by the street canyon scenario.

ACKNOWLEDGEMENTS

The work was supported by the German Federal Ministry of Education and Research within the project OIWOB. The authors would like to thank the "Karlsruhe School of Optics and Photonics" for supporting this work.

REFERENCES

Barinova, O., Lempitsky, V., Tretiak, E., and Kohli, P. (2010). Geometric image parsing in man-made environments. ECCV'10, pages 57-70, Berlin, Heidelberg. Springer-Verlag.

Borrmann, D., Elseberg, J., Lingemann, K., and Nüchter, A. (2011). The 3D Hough transform for plane detection in point clouds: A review and a new accumulator design. 3D Res., 2(2):32:1-32:13.

Chumerin, N. and Van Hulle, M. M. (2008). Ground plane estimation based on dense stereo disparity. ICNNAI'08, pages 209-213, Minsk, Belarus.

Cornelis, N., Leibe, B., Cornelis, K., and Van Gool, L. (2008). 3D urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision, 78(2-3):121-141.

Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395.

Geiger, A., Lauer, M., and Urtasun, R. (2011a). A generative model for 3D urban scene understanding from movable platforms. In CVPR'11, Colorado Springs, USA.

Geiger, A., Ziegler, J., and Stiller, C. (2011b). StereoScan: Dense 3D reconstruction in real-time. In IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany.

Gutmann, J.-S., Fukuchi, M., and Fujita, M. (2008). 3D perception and environment map generation for humanoid robot navigation. I. J. Robotic Res., 27(10):1117-1134.

Hoiem, D., Efros, A. A., and Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75:151-172.

Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P. J., Bunke, H., Goldgof, D., Bowyer, K., Eggert, D., Fitzgibbon, A., and Fisher, R. (1996). An experimental comparison of range image segmentation algorithms.

Hough, P. (1962). Method and Means for Recognizing Complex Patterns. U.S. Patent 3,069,654.

Iocchi, L., Konolige, K., and Bajracharya, M. (2000). Visually realistic mapping of a planar environment with stereo. In ISER, volume 271, pages 521-532.

Labayrade, R., Aubert, D., and Tarel, J.-P. (2002). Real time obstacle detection in stereovision on non flat road geometry through "v-disparity" representation. In Intelligent Vehicle Symposium, 2002. IEEE, volume 2, pages 646-651.

Lee, D. C., Hebert, M., and Kanade, T. (2009). Geometric reasoning for single image structure recovery. In CVPR'09.

Murray, D. R. and Little, J. J. (2004). Environment modeling with stereo vision. IROS'04.

Poppinga, J., Vaskevicius, N., Birk, A., and Pathak, K. (2008). Fast plane detection and polygonalization in noisy 3D range images. IROS'08.

Schindler, K. and Bischof, H. (2003). On robust regression in photogrammetric point clouds. In Michaelis, B. and Krell, G., editors, DAGM-Symposium, volume 2781 of Lecture Notes in Computer Science, pages 172-178. Springer.

Se, S. and Brady, M. (2002). Ground plane estimation, error analysis and applications. Robotics and Autonomous Systems, 39(2):59-71.
