
Learning Depth from Single Monocular Images Using Stereo Supervisory Input

João Paquim, Guido de Croon

Abstract—Stereo vision systems are often employed in robotics as a means for obstacle avoidance and navigation. These systems have inherent depth-sensing limitations, with significant problems in occluded and untextured regions, leading to sparse depth maps. We propose using a monocular depth estimation algorithm to tackle these problems, in a Self-Supervised Learning (SSL) framework. The algorithm learns online from the sparse depth map generated by a stereo vision system, producing a dense depth map. The algorithm is designed to be computationally efficient, for implementation onboard resource-constrained mobile robots and unmanned aerial vehicles. Within that context, it can be used to provide both reliability against a stereo camera failure, as well as more accurate depth perception, by filling in missing depth information in occluded and low-texture regions. This in turn allows the use of more efficient sparse stereo vision algorithms. We test the algorithm offline on a new, high-resolution stereo dataset, of scenes shot in indoor environments, and processed using both sparse and dense stereo matching algorithms. It is shown that the algorithm's performance doesn't deteriorate, and in fact sometimes improves, when learning only from sparse, high-confidence regions rather than from the computationally expensive, occlusion-filled, and highly post-processed dense depth maps. This makes the approach very promising for self-supervised learning on autonomous robots.

Index Terms—Monocular depth estimation, stereo vision, robotics, self-supervised learning.

I. INTRODUCTION

Depth sensors have become ubiquitous in the field of robotics, due to their multitude of applications, ranging from obstacle avoidance and navigation to localization and environment mapping. For this purpose, active depth sensors, like the Microsoft Kinect [1] and the Intel RealSense [2], are often employed. However, they are very susceptible to sunlight, and thus impractical for outdoor use, and are also typically power-hungry and heavy. This makes them particularly unsuitable for small robotics applications, such as micro aerial vehicles, which are power-constrained and have very limited weight-carrying capacities.

For these applications, passive stereo cameras are usually used [3], since they're quite energy-efficient, and can be made very small and light. The working principle of a stereo vision system is simple: find corresponding points between the left and right camera images, and use the distance between them (the binocular disparity) to infer the point's depth. The stereo matching process itself is not trivial, and implies a trade-off between computational complexity and the density of the resulting depth map. Systems low on computational resources, or with a need for high-frequency processing, usually have to settle for estimating only low-density, sparse depth maps. In any case, low-texture regions are hard to match accurately, due to a lack of features with which to find correspondences.

There are also physical limits to a stereo system's accuracy. Matching algorithms fail in occluded regions, visible from one of the cameras but not the other: there is no possibility of finding correspondences, given that the region is hidden in one of the frames. Additionally, the maximum range measurable by the camera is proportional to the distance between the two lenses, called the baseline. Consequently, reasonably-sized cameras have limited maximum ranges, of around 15 to 20 m. On the other end of the spectrum, stereo matching becomes impossible at small distances, due to excessive occlusions, and the fact that objects start to look too different in the left and right lenses' perspectives.
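For intuition, these range limits follow from the standard pinhole stereo relation; this is a textbook formula, stated here for reference rather than taken from the paper:

```latex
% Depth from binocular disparity (pinhole stereo model):
%   Z : depth, f : focal length in pixels, B : baseline, d : disparity in pixels
Z = \frac{f \, B}{d}
% The smallest resolvable disparity d_{min} caps the measurable range:
% Z_{max} = f B / d_{min}, which grows linearly with the baseline B.
```

This is why small baselines translate directly into short maximum ranges.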

Fig. 1: Examples of dense and sparse depth maps.

Some of the important limitations of stereo vision could be overcome by complementing it with monocular depth estimation from appearance features. Monocular depth estimation is based on exploiting both local properties of texture, gradients, and color, as well as global geometric relations, like relative object placement and perspective cues, having no a priori constraints on minimum and maximum ranges, nor any problems with stereo occlusions. We argue that monocular depth estimation can be used to enhance a stereo vision algorithm: by using complementary information, it should provide more accurate depth estimates in regions of occlusion and low-confidence stereo matching.

We approach the problem of monocular depth estimation using a Self-Supervised Learning (SSL) framework. SSL is a learning methodology where the supervisory input is itself obtained in an automatic fashion [4], unlike traditional supervised learning, which typically starts with humans laboriously collecting, and making manual corrections to, training data. This allows for massive amounts of training data to be collected, making it a very suitable methodology for the effective training of data-hungry algorithms, such as deep neural networks.

The main challenge with the SSL approach is making sure that the data used for training is correct. Whereas traditional methods make use of human intervention to manually tweak the data, the data collected in an online context is in general raw and imperfect. Consequently, training is performed on partially incorrect data, which can be compensated for by the large amount of data collected and by the online learning process itself.

Online learning in SSL allows the learned model to evolve over time and adapt to changes in the statistics of the underlying data. On the other hand, traditional methods learn only a fixed statistical model of the data, which is then used for offline testing on unseen data, or online use onboard some particular application. If the data used during training isn't sampled from the same distribution as the test data, there can be a strong statistical misadjustment between the two, leading to poor test performance [5]. In SSL, the robot learns in the environment in which it operates, which greatly reduces the difference in distribution between the training and test sets.

In this work, we present a strategy for enhancing a stereo vision system through the use of a monocular depth estimation algorithm. The algorithm is itself trained using possibly sparse ground-truth data from the stereo camera, and used to infer dense depth maps, filling in the occluded and low-texture regions. It is shown that even when trained only on sparse depth maps, the algorithm exhibits performance similar to when it is trained on dense, occlusion-filled, and highly post-processed depth maps.

The article is structured as follows: in section II, we examine the most relevant contributions in the literature to the areas of monocular depth estimation, self-supervised learning, and depth datasets. We then show the overall methodology of our experiments, including a detailed description of the learning setup, the features used, and the learning algorithm, in section III. In section IV, we zoom in on the offline experiments and their results, in terms of datasets, test cases, and the employed error metrics, briefly describing the progress of the online implementation. Finally, in section V, we analyse and discuss the obtained results, and give recommendations for future work on the subject.

II. RELATED WORK

In this section, the most significant contributions in the literature are reviewed, in the fields of monocular depth estimation in general, as well as applied to robot obstacle avoidance, and SSL.

A. Monocular depth estimation

Monocular depth estimation is a research topic in computer vision that has been tackled by multiple research groups over the past decades, with varying degrees of success. Saxena et al. [6, 7] engineered features to capture absolute depth, used by many works ever since, including ours, namely those of texture energy, texture gradients, and haze, calculated from square image patches and their neighbors, at multiple size scales. They then model the depth estimation problem as a Markov Random Field (MRF), and use multi-conditional learning (MCL) for approximate learning and inference.

Karsch et al. [8] presented a non-parametric framework for the extraction of depth maps from single images, and also temporally consistent depth from video sequences, robust to camera movement, changes in focal length, and dynamic scenery. Their approach is based on the transfer of depth from similar input images in an existing RGBD database, by matching and warping the most similar candidate's depth map, and then interpolating and smoothing the depth map via an optimization procedure to guarantee spatial consistency.

Recent years have seen the proliferation of deep neural networks in computer vision research and literature, including several applications to monocular depth estimation. These models are attractive because they can be very effectively trained on GPUs, and don't require the use of hand-engineered features. However, they typically require very large amounts of data to be effectively trained.

Eigen et al. [9] employed an architecture of two deep networks, one of which makes a coarse global prediction, while the other locally refines it. They augment the training data by applying scaling, rotation, translation, color variation, and horizontal flips to existing data. In further work [10], they develop a more powerful network, with three scales of refinement, which is then applied to the tasks of depth estimation, surface normal estimation, and semantic labeling.

Liu et al. [11] train a deep neural network architecture based on learning the unary and pairwise potentials of a Continuous Random Field (CRF) model. Their model is computationally very efficient, significantly outperforming Eigen's networks in both inference and learning time, while also requiring less training data.

Chen et al. [12] follow up on research by Zoran et al. [13] on learning to estimate metric depth from relative, rather than metric, depth training data. Both works learn from simple ordinal depth relations between pairs of points in the image. By training a deep neural network on a large crowd-sourced dataset, they achieve metric depth prediction performance on par with algorithms trained on dense metric depth maps.

In general, the previously presented methods are computationally expensive and/or require specialized hardware, and are thus unsuitable for real-time applications on constrained hardware platforms.

B. Monocular depth estimation for robot navigation

More recently, monocular depth learning has been applied to micro aerial vehicle navigation and obstacle avoidance, replacing heavier stereo cameras and active depth sensors. Bipin et al. [5] approach the depth estimation part of their autonomous navigation pipeline as a multiclass classification problem, by quantizing the continuous depths into discrete labels, from "near" to "far". They use a multiclass classifier based on the linear support vector machine (SVM), in a one-vs-the-rest configuration, using features very similar to [6], trained offline on the Make3D dataset and on additional training data collected using a Kinect sensor.

Dey et al. [14] use a calibrated least squares algorithm, first presented by Agarwal et al. [15], to achieve fast nonlinear prediction. The depth estimation is done over large patches, using features similar to [16], plus additional features based on Histograms of Oriented Gradients (HOG) and tree detector features, at multiple size scales. The training data is collected by a rover using a stereo camera system, and the training is done offline. An additional cost-sensitive greedy feature selection algorithm, by Hu et al. [17], is used to evaluate the most informative features for a given time-budget.

Although multiple studies have investigated the use of monocular depth estimation for robot navigation, none have focused on how it can be used to complement stereo vision, in the context of SSL.

C. Self-supervised learning

SSL has been the focus of some recent research in robotics, since, in contrast to traditional offline learning methodologies, it requires less human intervention, and offers the possibility of adaptation to new circumstances.

Dahlkamp et al. [4, 18] used SSL to train a vision-based terrain analyser for Stanley's performance in the DARPA Grand Challenge. A laser scanner was used for obstacle detection and terrain classification at close range, and as supervisory input to train the vision-based classifier. The vision-based classifier achieved a much greater obstacle detection range, which in turn made it possible to increase Stanley's maximum speed, and eventually win the challenge.

Hadsell et al. [19] developed an SSL methodology with the similar purpose of enabling long-range obstacle detection in a vehicle equipped with a stereo camera. For this purpose, they train a real-time classifier, using labels from the stereo camera system as supervisory input, and perform inference using the learned classifier. This process repeats every frame, but keeps a small buffer of previous training examples for successive time steps, allowing for a short-term memory of previous obstacles. The features to be extracted are themselves learned offline, using both supervised and unsupervised methods, rather than hand-engineered.

SSL is also applied in Ho et al. [20] to the problem of detecting obstacles using a downward-facing camera, in the context of micro aerial vehicle landing. In contrast to previous approaches, optical flow is used to estimate a measure of surface roughness, given by the fitting error between the observed optical flow and that of a perfect planar surface. The surface roughness is then used as supervisory input to a linear regression algorithm, using texton distributions as features. Learning wasn't performed for every frame, but rather when the uncertainty of the estimates increased, due to previously unseen inputs. The resulting appearance-based obstacle detector demonstrated good performance, even in situations where the optical flow is negligible, due to a lack of lateral motion.

Recently, van Hecke et al. [21] successfully applied SSL to the similar problem of estimating a single average depth from a monocular image, for obstacle avoidance purposes, using supervisory input from a stereo camera system. They focused on the behavioral aspects of SSL, and its relation to learning from demonstration, by looking at how the learning process should be organized in order to maximize performance when the supervisory input becomes unavailable. The best strategy is determined to be, after an initial period of learning, to use the supervisory input only as "training wheels": that is, using stereo vision only when the vehicle gets too close to an obstacle. The depth estimation algorithm uses texton distributions as features, and kNN as the learning algorithm.

III. METHODOLOGY OVERVIEW

In this section, we describe the learning methodology we used, namely the SSL setup, the features, the learning algorithm, and its hyperparameters.

A. Learning setup

The setup is similar to previous stereo-based SSL approaches, such as Hadsell's [19] and van Hecke's [21]. The basic principle is to use the output from a stereo vision system as the supervisory input to an appearance-based depth estimation learning algorithm. Unlike their work, however, our main goal is to obtain an accurate depth map over the whole image, rather than performing terrain classification or estimating a single average depth value. The camera's output is processed using both sparse and dense stereo matching algorithms, and we study the consequences of learning only on sparse depth maps, by observing and evaluating the algorithm's behavior on the dense depth data. A schematic diagram of the setup is presented in Figure 2.

Fig. 2: Diagram of the SSL setup. The left and right camera images are processed by both a sparse and a dense stereo matching algorithm; the sparse depth map provides the supervisory input for the monocular estimation algorithm, while the dense depth map serves as ground truth for testing the estimated depth maps.

For our experiments, we used a Stereolabs ZED stereo camera. It features wide-angle lenses with a 110° field of view, spaced at a baseline of 120 mm, allowing for accurate depth estimation in the range of 0.7 to 20 m. The camera's f/2.0 aperture and relatively large 1/3″ sensor enable good exposure performance, even under low-light conditions. Its output is highly configurable in terms of both resolution and frame rate, with 15 frames per second possible at 2.2K resolution, in terms of both photographic and depth data. One problem with the hardware, however, is its use of a rolling shutter, causing undesired effects such as stretch, shear, and wobble, in the presence of either camera motion or very dynamic environments. We experienced some of these problems while shooting scenes with lateral camera movement, so for actual robotics applications we would instead use a camera system with a global shutter, where these effects would be absent.

The ZED SDK is designed around the OpenCV and CUDA libraries, with its calibration, distortion correction, and depth estimation routines taking advantage of the CUDA platform's massively parallel GPU computing capabilities. The SDK additionally provides optional post-processing of the depth maps, including occlusion filling, edge sharpening, and advanced post-filtering, and a map of stereo matching confidence is also available. Additional capabilities of positional tracking and real-time 3D reconstruction are also offered, although not used in this work.

We performed both offline and online experiments using the ZED SDK, to provide dense depth maps, by employing the full post-processing and occlusion filling, as well as sparse depth maps, by using only the basic stereo matching algorithm and filtering out low-confidence regions. The latter method gives depth maps similar to what would be obtained using a simple block-based stereo matching algorithm, commonly used in resource-constrained or high-frequency applications [3].
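As an illustration of this sparsification step, the following is a minimal sketch of confidence-based filtering with NumPy; the threshold value and the 0-100 confidence scale are illustrative assumptions, not the ZED SDK's actual API:

```python
import numpy as np

def sparsify_depth(depth, confidence, threshold=80.0):
    """Keep only high-confidence depth pixels, invalidating the rest.

    depth      -- HxW float array of stereo depths (meters)
    confidence -- HxW array of matching confidence (assumed 0-100 scale)
    threshold  -- hypothetical cut-off below which pixels are discarded
    """
    sparse = depth.copy()
    sparse[confidence < threshold] = np.nan  # mark low-confidence pixels invalid
    return sparse
```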

In our offline experiments, data from the stereo vision system is recorded, and a posteriori used to train and test the learning algorithm under varying conditions, in batch mode. We used a modular MATLAB program for rapid prototyping of different feature combinations and learning algorithms, as well as to determine good values for their hyperparameters.

When operating online, depth and image data is streamed directly from the stereo camera into an online learning algorithm, and afterwards monocular depth inference is performed. The resulting depth maps are recorded for posterior evaluation. We used an architecture based on C++ and OpenCV, for faster performance and easy interaction with C-based embedded robotics platforms.

In both situations, the images and ground-truth depth maps are resized to standard sizes before learning takes place. This is done for performance reasons, due to the very high resolution of the input data, and the short time available for feature computation and learning.

B. Features

The features used for learning are in general similar to those recently used in the literature [14, 16]. However, we exclude optical flow features, and add features based on a texton similarity measure, to be discussed below. Features are calculated over square patches, directly corresponding to pixels in the matching depth maps.

1) Filter-based features: These features are implementations of the texture energy, texture gradients, and haze features engineered and popularized by Saxena's research group [6, 7], and used in multiple robotics applications ever since [5, 14, 16, 22]. The features are constructed by first converting the image patch into YCbCr color space, applying various filters to the specified channel, and then taking the sums of absolute and squared values over the patch. This procedure is repeated at three increasing size scales, to capture both local and global information. The filters used are:

• Laws' masks, as per Davies [23], constructed by convolving the basic $1 \times 3$ masks together (see the sketch after this list):

$L_3 = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}$, $E_3 = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$, $S_3 = \begin{bmatrix} -1 & 2 & -1 \end{bmatrix}$

In total, 9 Laws' masks are obtained from the pairwise convolutions of the basic masks, namely $L_3^T L_3$, $L_3^T E_3$, ..., $S_3^T E_3$, and $S_3^T S_3$. These are applied to the Y channel, to capture texture energy.

• A local averaging filter, applied to the Cb and Cr channels, to capture haze in the low-frequency color information. The first Laws' mask ($L_3^T L_3$) was used.

• Nevatia-Babu [24] oriented edge filters, applied to the Y channel, to capture texture gradients. The 6 filters are spaced at 30° intervals.
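The Laws' mask construction lends itself to a compact implementation. Below is a minimal sketch (NumPy/SciPy; the patch handling and the absence of normalization are our own simplifications):

```python
import numpy as np
from scipy.ndimage import convolve

# Basic 1x3 masks: level, edge, spot
L3 = np.array([[1, 2, 1]])
E3 = np.array([[-1, 0, 1]])
S3 = np.array([[-1, 2, -1]])

# The 9 Laws' masks are the pairwise products, e.g. L3^T L3 is a 3x3 mask
basic = [L3, E3, S3]
laws_masks = [a.T @ b for a in basic for b in basic]

def laws_features(y_patch):
    """Sums of absolute and squared filter responses over a Y-channel patch."""
    feats = []
    for mask in laws_masks:
        response = convolve(y_patch.astype(float), mask.astype(float))
        feats.append(np.sum(np.abs(response)))  # sum of absolute values
        feats.append(np.sum(response ** 2))     # sum of squared values
    return feats
```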

2) Texton-based features: Textons are small image patches, representative of particular texture classes, learned by clustering patch samples from training images. They were first recognized as a valuable tool for computer vision in the work of Varma et al. [25], which showed their performance in the task of texture classification, when compared to traditional filter bank methods. The distribution of textons over the image has since been used to generate computationally efficient features for various works in robotics [20, 21, 26], for obstacle detection.

Previous visual bag-of-words approaches represented an image using a histogram, constructed by sampling image patches and determining, for each patch, its closest texton in terms of Euclidean distance; the corresponding histogram bin is then incremented. Since we desire to capture local information for a given patch, we instead use its squared Euclidean distance to each texton as features. The texton dictionary is learned from the training dataset using Kohonen clustering [27], similarly to previous works.
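Since the texton features are just squared distances to the dictionary entries, they reduce to a few lines; a minimal sketch, assuming a dictionary already learned by clustering:

```python
import numpy as np

def texton_features(patch, textons):
    """Squared Euclidean distance from a patch to each texton.

    patch   -- 5x5 grayscale image patch
    textons -- (n_textons, 5, 5) array, e.g. learned via Kohonen clustering
    """
    diff = textons - patch[None, :, :]
    return np.sum(diff ** 2, axis=(1, 2))  # one feature per texton
```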

3) Histogram of Oriented Gradients: Histogram of Oriented Gradients (HOG) features have been successfully used for object and human detection [28], as well as by Dey et al. for depth estimation [14]. The image is divided into cells, over which the pixel-wise gradients are determined, and their directions binned into a histogram. Adjacent cells are grouped into 2×2 blocks, and the histograms are normalized with respect to all the cells in the block, to correct for contrast differences and improve accuracy.
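For reference, an equivalent HOG configuration can be obtained from scikit-image; the cell size here is an assumption, since the text only fixes the number of orientation bins and the block size:

```python
from skimage.feature import hog

def hog_features(image):
    # 9 orientation bins, 2x2 blocks of cells with block-wise normalization
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')
```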

4) Radon transform: Michels et al. [22] introduced a feature to capture texture gradient and the direction of strong edges, based on the Radon transform [29], also subsequently used by other works [14, 16]. The Radon transform is an integral, continuous version of the Hough transform, commonly used in computer vision for edge detection, and maps an image from $(x, y)$ into $(\theta, \rho)$ coordinates. For each value of $\theta$, the two highest values of the transform are recorded and used as features.
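A minimal sketch of this feature, using scikit-image's Radon transform and our 15-angle discretization (the padding and normalization conventions are assumptions):

```python
import numpy as np
from skimage.transform import radon

def radon_features(patch, n_angles=15):
    """Two largest Radon-transform values per angle, following [22]."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(patch.astype(float), theta=theta, circle=False)
    top_two = np.sort(sinogram, axis=0)[-2:, :]  # two highest values per angle
    return top_two.ravel()
```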

C. Learning algorithm

To choose a learning algorithm, we looked at previous approaches in the literature. Bipin et al. [5] had success with approaching the problem as a multiclass classification problem, and using a linear SVM for learning. Dey et al. [14] used a nonlinear regression algorithm based on the Calibrated Least Squares (CLS) algorithm by Agarwal et al. [15]. In most of the literature, the algorithms are used to estimate the logarithms of depths, rather than the depths themselves, and after testing both approaches, we also found better performance when estimating log depths.

We have approached the depth estimation problem both as a classification problem and as a regression problem. For classification, we have tried out methods such as an SVM, using both linear [30] and radial basis function kernels [31], in both cases using one-vs-the-rest for multiclass classification. We experimented with using a decision tree, the generalized least squares algorithm [15] with a multinomial regression link function, and the classification version of the CLS algorithm [15]. For regression, we have employed linear least squares, a regression tree, and a modified version of the CLS algorithm.

Our evaluation ultimately led us to the conclusion that regression consistently outperforms classification in this task, because multiclass classification loss functions penalize every misclassification in the same way, while regression attributes larger penalties to larger deviations. Additionally, we observed that the modified CLS regression algorithm exhibits better performance than linear least squares or regression trees, while still being computationally very efficient. For this reason, we decided to use it for the rest of our testing.

The CLS algorithm is based on the minimization of a calibrated loss function, in the context of a generalized linear model with an unknown link function. The link function is itself approximated as a linear combination of basis functions of the target variable, typically low-degree polynomials. The CLS algorithm consists of simultaneously solving the problems of link function approximation and loss function minimization, by iteratively solving two least squares problems.

We make slight modifications to the algorithm shown by Agarwal et al. [15], namely removing the simplex clipping step, since we're performing regression rather than multiclass classification, and using Tikhonov regularization in the least squares problems. From a computational point of view, we use Cholesky decompositions to efficiently solve the inner least squares problems, and define the convergence criterion using the norm of the difference in $\hat{y}$ between successive iterations. The resulting algorithm is described in Figure 3.

The algorithm is iterative by design, but it can be adapted for online inference, by storing the weight matrices and using them a posteriori on new test data, repeating steps 3 and 5 of the algorithm with the stored weight matrices. Additionally, it can be adapted to online training, by using batch-wise stochastic gradient descent to update the weights as new data samples come in.

Input: feature vectors $x_i$; vector of target values $y$
Output: predicted target values $\hat{y}$; sequences of weight matrices $W$ and $\tilde{W}$

1: while $t < t_{\max}$ and $\|\hat{y}^{(t)} - \hat{y}^{(t-1)}\| > \text{threshold}$ do
   (Iteratively minimize the calibrated loss function:)
2:   $W_t = \arg\min_{W} \sum_{i=1}^{n} \|y_i - \hat{y}_i^{(t-1)} - W x_i\|_2^2 + \lambda_1 \|W\|_2^2$
3:   $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + W_t x_i$
   (Find the optimal linear combination of basis functions:)
4:   $\tilde{W}_t = \arg\min_{\tilde{W}} \sum_{i=1}^{n} \|y_i - \tilde{W} G(\hat{y}_i^{(t)})\|_2^2 + \lambda_2 \|\tilde{W}\|_2^2$
5:   $\hat{y}_i^{(t)} = \tilde{W}_t G(\hat{y}_i^{(t)})$
6: end while

Fig. 3: Description of the modified CLS regression algorithm.
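To make Fig. 3 concrete, here is a minimal batch implementation of our reading of the modified CLS algorithm; the polynomial basis degree, regularization constants, and stopping constants are illustrative assumptions:

```python
import numpy as np

def basis(y, degree=2):
    """G(y): low-degree polynomial basis of the current predictions."""
    return np.hstack([y ** k for k in range(degree + 1)])  # columns [1, y, y^2]

def ridge_solve(A, B, lam):
    """Solve min_W ||B - A W||_2^2 + lam ||W||_2^2 via a Cholesky factorization."""
    n = A.shape[1]
    L = np.linalg.cholesky(A.T @ A + lam * np.eye(n))
    return np.linalg.solve(L.T, np.linalg.solve(L, A.T @ B))

def cls_regression(X, y, lam1=1e-3, lam2=1e-3, t_max=50, tol=1e-6):
    """Modified CLS: alternate the calibrated-loss and link-function fits."""
    y = y.reshape(-1, 1)                  # targets, e.g. log depths
    y_hat = np.zeros_like(y)
    Ws, Ws_tilde = [], []
    for _ in range(t_max):
        y_prev = y_hat
        W = ridge_solve(X, y - y_hat, lam1)       # step 2: fit the residual
        y_hat = y_hat + X @ W                     # step 3
        G = basis(y_hat)
        W_tilde = ridge_solve(G, y, lam2)         # step 4: fit the link function
        y_hat = G @ W_tilde                       # step 5
        Ws.append(W)
        Ws_tilde.append(W_tilde)
        if np.linalg.norm(y_hat - y_prev) < tol:  # convergence criterion
            break
    return y_hat, Ws, Ws_tilde
```

At test time, the stored sequences of weight matrices are replayed on new feature vectors, repeating steps 3 and 5 as described above.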

D. Hyperparameters

The algorithm has numerous hyperparameters, including the regularization constants, the base size and number of size scales of the image patches over which the features are computed, the size and number of textons, the cell size and number of bins for HOG feature calculation, etc. One could try to determine their optimal values using a genetic algorithm, but due to the size of the parameter space, and the time required for end-to-end simulations, we instead opted to choose some of the parameters based on their typical values in the literature.

We used a base patch size of 11×11, and 2 additional size scales, 33×33 and 99×99. We learned a dictionary of 30 black and white textons, 5×5 in size, shown in Figure 4. We used 9 bins for the HOG features, and discretized the angles for the Radon transform into 15 values.

Fig. 4: Texton dictionary learned from training data.

IV. EXPERIMENTAL RESULTS

In this section, the offline experiments are described in detail. These were performed in order to evaluate the performance of the proposed learning algorithm on existing datasets, and to determine optimal values for its hyperparameters. We also tested our hypothesis that it should be possible to estimate dense depth maps despite learning only on sparse training data, by testing on a new indoor stereo dataset, with both sparse and dense depth maps.

A. Error metrics

To measure the algorithm's accuracy, error metrics commonly found in the literature [9] were employed, namely:

• The mean logarithmic error: $\frac{1}{N} \sum |\log d_{est} - \log d_{gt}|$

• The mean relative error: $\frac{1}{N} \sum |d_{est} - d_{gt}| / d_{gt}$

• The mean relative squared error: $\frac{1}{N} \sum (d_{est} - d_{gt})^2 / d_{gt}$

• The root mean squared (RMS) error: $\sqrt{\frac{1}{N} \sum (d_{est} - d_{gt})^2}$

• The RMS logarithmic error: $\sqrt{\frac{1}{N} \sum (\log d_{est} - \log d_{gt})^2}$

• The scale-invariant error: $\frac{1}{N} \sum (\log d_{est} - \log d_{gt})^2 - \frac{1}{N^2} \left( \sum (\log d_{est} - \log d_{gt}) \right)^2$
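These metrics translate directly into code; a minimal sketch, with d_est and d_gt as flat arrays of valid (positive) depths:

```python
import numpy as np

def depth_error_metrics(d_est, d_gt):
    """Compute the error metrics listed above."""
    log_diff = np.log(d_est) - np.log(d_gt)
    n = log_diff.size
    return {
        "mean_log":        np.mean(np.abs(log_diff)),
        "relative_abs":    np.mean(np.abs(d_est - d_gt) / d_gt),
        "relative_square": np.mean((d_est - d_gt) ** 2 / d_gt),
        "linear_rms":      np.sqrt(np.mean((d_est - d_gt) ** 2)),
        "log_rms":         np.sqrt(np.mean(log_diff ** 2)),
        "scale_invariant": np.mean(log_diff ** 2) - np.sum(log_diff) ** 2 / n ** 2,
    }
```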

B. Standard datasets

As a first step, the algorithm was tested on existing depth datasets collected using active laser scanners, namely Make3D, KITTI 2012, and KITTI 2015. For Make3D, we used the standard division of 400 training and 134 test samples. Since the KITTI datasets' standard test data consists solely of camera images, lacking ground-truth depth maps, we instead randomly distributed the standard training data among two sets, with 70% of the data being allocated for training and 30% for testing.

The results we obtained for Make3D are shown qualitatively in Figure 5, and quantitatively in Table I, along with results found in the literature [7, 8, 11]. It can be seen that we obtain slightly worse performance than current state-of-the-art approaches. However, we do surpass Bipin et al.'s [5] results using a linear SVM, while using a much more efficient learning algorithm (in our tests, training a linear SVM with 10 depth classes took around 10 times longer than the CLS algorithm).

Upon visual inspection of the image samples, we observe that our algorithm manages to successfully capture most of the global depth variations, and even some local details.

TABLE I: Comparison of results on the Make3D dataset.

Make3D           Saxena  Karsch  Liu    Bipin  Our method
mean log         0.430   0.292   0.251  0.985  0.493
relative abs     0.370   0.355   0.287  0.815  0.543
relative square  -       -       -      -      10.717
linear RMS       -       9.20    7.36   -      20.116
log RMS          -       -       -      -      0.683
scale-invariant  -       -       -      -      0.462

For the KITTI datasets, we show the results in Figure 6, and compare them quantitatively in Tables II and III to results found in the literature [9, 11, 32], although we note that, to our knowledge, no previous work on monocular depth estimation has yet shown results on the more recent KITTI 2015 dataset.

TABLE II: Comparison of results on the KITTI 2012 dataset.

KITTI 2012       Saxena  Eigen  Liu    Mancini  Our method
mean log         -       -      0.211  -        0.372
relative abs     0.280   0.190  0.217  -        0.525
relative square  3.012   1.515  -      -        2.311
linear RMS       8.734   7.156  7.046  7.508    13.093
log RMS          0.361   0.270  -      0.524    0.590
scale-invariant  0.327   0.246  -      0.196    0.347

TABLE III: Results on the KITTI 2015 dataset.

KITTI 2015       Our method
mean log         0.291
relative abs     0.322
relative square  1.921
linear RMS       12.498
log RMS          0.452
scale-invariant  0.204

We observe that our algorithm performs comparatively better on these datasets, due to the smaller variations in the depicted environments when compared to Make3D, which is inherently more diverse. This once again leads us to conclude that, in order to achieve optimal performance, the training and test data should be made as similar as possible, and in the context of robotics this is enabled through the use of SSL.

C. Stereo dataset

In order to further test our hypotheses, we developed a new dataset of images shot in the same environment, to simulate the SSL approach, with both dense and sparse depth maps, to see how the algorithm performed on the dense data while being trained only on the sparse maps.

We shot several videos with the ZED around the TU Delft Aerospace faculty. The camera was handheld, with some fast rotations and no particular care for stabilization, similar to footage that would be captured by a UAV. The resulting images naturally have imperfections, namely defocus, motion blur, and stretch and shear artifacts from the rolling shutter, which are not reflected in the standard RGBD datasets, but nevertheless encountered in real-life situations.

The stereo videos were then processed offline with the ZED SDK, using both the STANDARD (structure conservative, no occlusion filling) and FILL (occlusion filling, edge sharpening, and advanced post-filtering) settings. The provided confidence map was then used on the STANDARD data to filter out low-confidence regions, leading to very sparse depth maps, as shown in Figure 1, similar to depth maps obtained using traditional block-based stereo matching algorithms. The full dataset consists of 12 video sequences, and will be made available for public use in the near future.

For our tests, we split the video sequences into two parts, learning on the first 70% and testing on the final 30%, in order to simulate the conditions of a robot undergoing SSL.

Fig. 5: Qualitative results on the Make3D dataset. From left to right: the monocular image, the ground-truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.

Fig. 6: Qualitative results on the KITTI 2012 and 2015 datasets. From left to right: the monocular image, the ground-truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.

Fig. 7: Qualitative results on the stereo ZED dataset. From left to right: the monocular image, the ground-truth depth map, the algorithm trained on the dense depth, and the algorithm trained on the sparse depth. The depth scales are the same for every image.

TABLE IV: Comparison of results on the new stereo dataset collected with the ZED, part 1.

Video sequence   1 dense  1 sparse  2 dense  2 sparse  3 dense  3 sparse  4 dense  4 sparse  5 dense  5 sparse  6 dense  6 sparse
mean log         0.615    0.746     0.581    0.609     0.169    0.594     0.572    0.547     0.425    0.533     0.525    0.479
relative abs     1.608    1.064     1.391    0.823     0.288    0.420     0.576    0.557     0.446    0.456     0.956    0.526
relative square  26.799   8.325     17.901   6.369     2.151    3.280     4.225    3.728     2.951    3.312     12.411   3.320
linear RMS       8.311    7.574     7.101    6.897     2.933    7.134     6.488    6.026     5.692    6.331     7.239    5.599
log RMS          0.967    0.923     0.913    0.770     0.359    0.770     0.748    0.716     0.546    0.662     0.822    0.613
scale-invariant  0.931    0.778     0.826    0.524     0.127    0.277     0.440    0.412     0.275    0.247     0.669    0.314

TABLE V: Comparison of results on the new stereo dataset collected with the ZED, part 2.

Video sequence   7 dense  7 sparse  8 dense  8 sparse  9 dense  9 sparse  10 dense  10 sparse  11 dense  11 sparse  12 dense  12 sparse
mean log         0.532    0.514     0.654    0.539     0.579    0.470     0.411     0.460      0.640     0.704      0.362     0.494
relative abs     0.732    0.484     0.583    0.770     0.773    0.534     0.576     0.472      0.838     0.609      0.517     0.572
relative square  7.194    3.016     6.480    7.472     15.659   3.082     5.583     3.666      56.000    4.792      51.092    4.367
linear RMS       6.132    5.577     8.383    6.284     14.324   5.319     5.966     6.127      11.435    7.832      30.676    6.095
log RMS          0.789    0.669     1.006    0.770     0.823    0.610     0.681     0.633      0.898     0.835      0.580     0.642
scale-invariant  0.601    0.332     0.825    0.493     0.648    0.338     0.442     0.287      0.729     0.378      0.323     0.347

We tested our depth estimation algorithm under two different conditions: training on the fully processed and occlusion-corrected dense maps, and training directly on the raw sparse outputs from the stereo matching algorithm. The dense depth maps were used as ground truth during testing in both cases. The results obtained are shown in Figure 7, and in Tables IV and V.

Observing the results, we see that the performance is in general better than on the existing laser datasets, especially from a qualitative point of view. The algorithm is capable of leveraging the fact that the data used for training is not particularly diverse, and additionally very similar to the test data, getting around the generalization problems shown by traditional supervised learning methodologies.

It can also be observed that, contrary to our expectations, the algorithm trained on the sparse data doesn't fall short of, and actually surpasses, the algorithm trained on the dense data, in many of the video sequences and error metrics. Looking at the estimated depth maps, this is mostly a consequence of the algorithm's failure to correctly extrapolate. The largest contributions to the error metrics come from the depth extrapolations that fall significantly outside the maximum range of the camera, and are qualitatively completely incorrect. When the algorithm is trained only on the sparse data, it is exposed to a smaller range of target values, and is consequently much more conservative in its estimates, leading to lower error figures.

Another influencing factor is the fact that, in the sparse, high-confidence regions, there is a stronger correlation between the depth obtained by the stereo camera and the true depth, when compared to the untextured and occluded areas. Since we assume there is a strong correlation between the extracted monocular image features and the true depth, this implies that there is a stronger correlation between the image features and the stereo camera depths. The simplicity of the machine learning model used means that this strong correlation can be effectively exploited and learned, while the low correlation present in the dense depth data can easily lead the algorithm to learn a wrong model.

From a qualitative point of view, however, both the sparse- and dense-trained algorithms behave similarly. Looking at the sample images, we can see examples where both correctly estimate the relative depths between objects in a scene (rows 1 and 2), examples where sparse is better than dense (rows 3 and 4), and where dense is better than sparse (row 5). Naturally, there are also many cases where the estimated depth maps are mostly incorrect (row 6).

In particular, we see that the algorithm trained only on sparse data also behaves well at estimating the depth in occluded and untextured areas, for instance the white wall in row 3. This leads us to the conclusion that the algorithm's performance is perfectly adequate to complement a sparse stereo matching algorithm, correctly filling in the missing depth information.

D. Online experiments

We've started preliminary work on a framework for onboard SSL applied to monocular depth estimation, using a stereo camera. We've rewritten some of our feature extraction functions in C++, and have tested the setup using a linear least squares regression algorithm, achieving promising results. We intend to further develop the framework in order to fully support an online version of our learning pipeline, and integrate it with existing autopilot platforms, such as Paparazzi [33].

V. CONCLUSION

This work focused on the application of SSL to the problem of estimating depth from single monocular images, with the intent of complementing sparse stereo vision algorithms. We have shown that our algorithm exhibits competitive performance on existing RGBD datasets, while being computationally more efficient to train than previous approaches [5]. We also trained the algorithm on a new stereo dataset, and showed that it remains accurate even when trained only on sparse, rather than dense, stereo maps. It can consequently be used to efficiently produce dense depth maps from sparse input. Our preliminary work on its online implementation has revealed promising results, obtaining good performance with a very simple linear least squares algorithm.

In future work, we plan to extend our methodology and further explore the complementarity of the information present in monocular and stereo cues. The use of a learning algorithm such as a Mondrian forest [34], or other ensemble-based methods, would enable the estimation of the uncertainty in its own predictions. A sensor fusion algorithm can then be used to merge information from both the stereo vision system and the monocular depth estimation algorithm, based on their local confidence in the estimated depth. This would lead to an overall more accurate depth map.

We have avoided the use of optical flow features, since they're expensive to compute, and are not usable when estimating depths from isolated images, rather than video data. However, future work could explore computationally efficient ways of using optical flow to guarantee the temporal consistency, and consequently increase the accuracy, of the sequence of estimated depth maps.

Current state-of-the-art depth estimation methods [10-12, 32] are all based on deep convolutional neural networks, of varying complexities. The advent of massively parallel, GPU-based embedded hardware, such as the Jetson TX1 and its eventual successors, means that online training of deep neural networks is close to becoming reality. These models would greatly benefit from the large amounts of training data made possible by the SSL framework, and lead to state-of-the-art depth estimation results onboard micro aerial vehicles.

REFERENCES

[1] R. A. El-laithy, J. Huang, and M. Yeh, "Study on the use of Microsoft Kinect for robotics applications," in Position Location and Navigation Symposium (PLANS), 2012 IEEE/ION. IEEE, 2012, pp. 1280-1288.

[2] M. Draelos, Q. Qiu, A. Bronstein, and G. Sapiro, "Intel RealSense = real low cost gaze," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2520-2524.

[3] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4982-4987.

[4] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, "Self-supervised monocular road detection in desert terrain," in Robotics: Science and Systems, vol. 38, Philadelphia, 2006.

[5] K. Bipin, V. Duggal, and K. M. Krishna, "Autonomous navigation of generic monocular quadcopter in natural environment," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1063-1070.

[6] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2005, pp. 1161-1168.

[7] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, 2009.

[8] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," in ECCV, pp. 1-14, 2013.

[9] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," NIPS, pp. 1-9, 2014. [Online]. Available: http://arxiv.org/abs/1406.2283

[10] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650-2658.

[11] F. Liu, C. Shen, G. Lin, and I. D. Reid, "Learning depth from single monocular images using deep convolutional neural fields," PAMI, p. 15, 2015. [Online]. Available: http://arxiv.org/abs/1502.07411

[12] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-image depth perception in the wild," arXiv preprint arXiv:1604.03901, 2016.

[13] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, "Learning ordinal relationships for mid-level vision," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 388-396.

[14] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight," in Field and Service Robotics. Springer, 2016, pp. 391-409.

[15] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, "Least squares revisited: Scalable approaches for multi-class prediction," arXiv preprint arXiv:1310.1949, 2013. Software available at https://github.com/n17s/secondorderdemos

[16] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1765-1772.

[17] H. Hu, A. Grubb, J. A. Bagnell, and M. Hebert, "Efficient feature group sequencing for anytime linear prediction," arXiv preprint arXiv:1409.5495, 2014.

[18] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661-692, 2006.

[19] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120-144, 2009.

[20] H. W. Ho, C. De Wagter, B. D. W. Remes, and G. C. H. E. de Croon, "Optical-flow based self-supervised learning of obstacle appearance applied to MAV landing," pp. 1-10, 2015.

[21] K. van Hecke, G. de Croon, L. van der Maaten, D. Hennes, and D. Izzo, "Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance," arXiv preprint arXiv:1603.08047, 2016.

[22] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 593-600.

[23] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities. Academic Press, 2012.

[24] R. Nevatia and K. R. Babu, "Linear feature extraction and description," Computer Graphics and Image Processing, vol. 13, no. 3, pp. 257-269, 1980.

[25] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" CVPR, vol. 2, pp. II-691-8, 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211534

[26] G. De Croon, E. De Weerdt, C. De Wagter, B. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529-534, 2012.

[27] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886-893.

[29] S. R. Deans, The Radon Transform and Some of Its Applications. Courier Corporation, 2007.

[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear

[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[32] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, "Fast robust monocular depth estimation for obstacle detection with fully convolutional networks," arXiv preprint arXiv:1607.06349, 2016.

[33] G. Hattenberger, M. Bronz, and M. Gorraz, "Using the Paparazzi UAV system for scientific research," in IMAV 2014, International Micro Air Vehicle Conference and Competition 2014, 2014.

[34] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, "Mondrian forests for large-scale regression when uncertainty matters," arXiv preprint arXiv:1506.03805, 2015.

Page 2: Learning Depth from Single Monocular Images Using Stereo ... · obstacle avoidance and navigation, to localization and envi- ... data collected using a Kinect sensor. Dey et al. [14]

2

making it a very suitable methodology for the effective trainingof data-hungry algorithms such as deep neural networks

The main challenge with the SSL approach is making surethat the data used for training is correct Whereas traditionalmethods make use of human intervention to manually tweakthe data the data collected in an online context is in generalraw and imperfect Consequently training is performed onpartially incorrect data which can be compensated by the largeamount of data collected and the online learning process itself

Online learning in SSL allows the learned model to evolveover time and adapt to changes in the statistics of theunderlying data On the other hand traditional methods learnonly a fixed statistical model of the data which is then usedfor offline testing on unseen data or online use onboard someparticular application If the data used during training isnrsquotsampled from the same distribution as the test data there canbe a strong statistical misadjustment between the two leadingto poor test performance [5] In SSL the robot learns in theenvironment in which it operates which greatly reduces thedifference in distribution between the training and test set

In this work we present a strategy for enhancing a stereovision system through the use of a monocular depth estimationalgorithm The algorithm is itself trained using possiblysparse ground truth data from the stereo camera and usedto infer dense depth maps filling in the occluded and lowtexture regions It is shown that even when trained only onsparse depth maps the algorithm exhibits performance similarto when trained on dense occlusion-filled and highly post-processed dense depth maps

The article is structured as follows in section II we examinethe most relevant contributions in the literature to the areasof monocular depth estimation self-supervised learning anddepth datasets We then show the overall methodology of ourexperiments including a detailed description of the learningsetup features used and learning algorithm in section IIIIn section IV we zoom in on the offline experiments andtheir results in terms of datasets test cases and the employederror metrics briefly describing the progress in the onlineimplementation Finally in section V we analyse and discussthe obtained results and give recommendations for futurework on the subject

II RELATED WORK

In this section the most significant contributions in theliterature are reviewed in the fields of monocular depthestimation in general as well as applied to robot obstacleavoidance and SSL

A Monocular depth estimation

Monocular depth estimation is a research topic in computervision that has been tackled by multiple research groups overthe past decades with varying degrees of success Saxena etal [6 7] engineered features to capture absolute depth used bymany works ever since including ours namely those of textureenergy texture gradients and haze calculated from squareimage patches and their neighbors at multiple size scales Theythen model the depth estimation problem as a Markov Random

Field (MRF) and use multi-conditional learning (MCL) forapproximate learning and inference

Karsch et al [8] presented a non-parametric frameworkfor the extraction of depth maps from single images andalso temporally consistent depth from video sequences robustto camera movement changes in focal length and dynamicscenery Their approach is based on the transfer of depthfrom similar input images in an existing RGBD database bymatching and warping the most similar candidatersquos depth mapand then interpolating and smoothing the depth map via anoptimization procedure to guarantee spacial consistency

Recent years have seen the proliferation of deep neuralnetworks in computer vision research and literature includingseveral applications to monocular depth estimation Thesemodels are attractive because they can be very effectivelytrained on GPUs and donrsquot require the use of hand-engineeredfeatures However they typically require very large amountsof data to be effectively trained

Eigen et al [9] employed an architecture of two deepnetworks one of which makes a coarse global predictionand the other one which locally refines it They augment thetraining data by applying scaling rotation translation colorvariation and horizontal flips to existing data In further work[10] they develop a more powerful network with three scalesof refinement which is then applied to the tasks of depthestimation surface normal estimation and semantic labeling

Liu et al. [11] train a deep neural network architecture based on learning the unary and pairwise potentials of a continuous Conditional Random Field (CRF) model. Their model is computationally very efficient, significantly outperforming Eigen's networks in both inference and learning time, while also requiring less training data.

Chen et al. [12] follow up on research by Zoran et al. [13] on learning to estimate metric depth from relative, rather than metric, depth training data. Both works learn from simple ordinal depth relations between pairs of points in the image. By training a deep neural network on a large crowd-sourced dataset, they achieve metric depth prediction performance on par with algorithms trained on dense metric depth maps.

In general, the previously presented methods are computationally expensive and/or require specialized hardware, and are thus unsuitable for real-time applications on constrained hardware platforms.

B. Monocular depth estimation for robot navigation

More recently, monocular depth learning has been applied to micro aerial vehicle navigation and obstacle avoidance, replacing heavier stereo cameras and active depth sensors. Bipin et al. [5] approach the depth estimation part of their autonomous navigation pipeline as a multiclass classification problem, by quantizing the continuous depths into discrete labels, from "near" to "far". They use a multiclass classifier based on the linear support vector machine (SVM), in a one-vs-the-rest configuration, using features very similar to [6], trained offline on the Make3D dataset and on additional training data collected using a Kinect sensor.

Dey et al. [14] use a calibrated least squares algorithm, first presented by Agarwal et al. [15], to achieve fast nonlinear prediction. The depth estimation is done over large patches, using features similar to [16] and additional features based on Histogram of Oriented Gradients (HOG) and tree detector features, at multiple size scales. The training data is collected by a rover using a stereo camera system, and the training is done offline. An additional cost-sensitive greedy feature selection algorithm, by Hu et al. [17], is used to evaluate the most informative features for a given time-budget.

Although multiple studies have investigated the use of monocular depth estimation for robot navigation, none have focused on how it can be used to complement stereo vision, in the context of SSL.

C. Self-supervised learning

SSL has been the focus of some recent research in robotics since, in contrast to traditional offline learning methodologies, it requires less human intervention and offers the possibility of adaptation to new circumstances.

Dahlkamp et al. [4, 18] used SSL to train a vision-based terrain analyser for Stanley's DARPA Grand Challenge performance. The onboard laser scanner was used for obstacle detection and terrain classification at close ranges, and as supervisory input to train the vision-based classifier. The vision-based classifier achieved a much greater obstacle detection range, which in turn made it possible to increase Stanley's maximum speed and eventually win the challenge.

Hadsell et al. [19] developed an SSL methodology with the similar purpose of enabling long-range obstacle detection in a vehicle equipped with a stereo camera. For this purpose, they train a real-time classifier, using labels from the stereo camera system as supervisory input, and perform inference using the learned classifier. This process repeats every frame, but a small buffer of previous training examples is kept for successive time steps, allowing for a short-term memory of previous obstacles. The features to be extracted are themselves learned offline, using both supervised and unsupervised methods, rather than hand-engineered.

SSL is also applied in Ho et al. [20] to the problem of detecting obstacles using a downward facing camera, in the context of micro aerial vehicle landing. In contrast to previous approaches, optical flow is used to estimate a measure of surface roughness, given by the fitting error between the observed optical flow and that of a perfect planar surface. The surface roughness is then used as supervisory input to a linear regression algorithm, using texton distributions as features. Learning wasn't performed for every frame, but rather when the uncertainty of the estimates increased, due to previously unseen inputs. The resulting appearance-based obstacle detector demonstrated good performance, even in situations where the optical flow is negligible due to lack of lateral motion.

Recently, van Hecke et al. [21] successfully applied SSL to the similar problem of estimating a single average depth from a monocular image, for obstacle avoidance purposes, using supervisory input from a stereo camera system. They focused on the behavioral aspects of SSL and its relation with learning from demonstration, by looking at how the learning process should be organized in order to maximize performance when the supervisory input becomes unavailable. The best strategy is determined to be, after an initial period of learning, to use the supervisory input only as "training wheels": that is, using stereo vision only when the vehicle gets too close to an obstacle. The depth estimation algorithm uses texton distributions as features, and kNN as the learning algorithm.

III. METHODOLOGY OVERVIEW

In this section we describe the learning methodology we used, namely the SSL setup, the features, the learning algorithm, and its hyperparameters.

A. Learning setup

The setup is similar to previous stereo-based SSL approaches, such as Hadsell's [19] and van Hecke's [21]. The basic principle is to use the output from a stereo vision system as the supervisory input to an appearance-based depth estimation learning algorithm. Unlike their work, however, our main goal is to obtain an accurate depth map over the whole image, rather than performing terrain classification or estimating a single average depth value. The camera's output is processed using both sparse and dense stereo matching algorithms, and we study the consequences of learning only on sparse depth maps, by observing and evaluating the algorithm's behavior on the dense depth data. A schematic diagram of the setup is presented in Figure 2.

Fig. 2: Diagram of the SSL setup. The left and right camera images are processed by both sparse and dense stereo matching; the sparse depth map serves as supervisory input to the monocular estimation algorithm, while the dense depth map is used as ground truth when testing the estimated depth map.

For our experiments, we used a Stereolabs ZED stereo camera. It features wide-angle lenses with a 110° field of view, spaced at a baseline of 120 mm, allowing for accurate depth estimation in the range of 0.7 to 20 m. The camera's f/2.0 aperture and relatively large 1/3″ sensor enable good exposure performance, even under low light conditions. Its output is highly configurable in terms of both resolution and frame rate, with 15 frames per second possible at 2.2K resolution, in terms of both photographic and depth data.

One problem with the hardware, however, is its use of a rolling shutter, causing undesired effects such as stretch, shear and wobble in the presence of either camera motion or very dynamic environments. We experienced some of these problems while shooting scenes with lateral camera movement, so for actual robotics applications we would instead use a camera system with a global shutter, where these effects would be absent.

The ZED SDK is designed around the OpenCV and CUDA libraries, with its calibration, distortion correction and depth estimation routines taking advantage of the CUDA platform's massively parallel GPU computing capabilities. The SDK additionally provides optional post-processing of the depth maps, including occlusion filling, edge sharpening and advanced post-filtering, and a map of stereo matching confidence is also available. Additional capabilities of positional tracking and real-time 3D reconstruction are also offered, although not used in this work.

We performed both offline and online experiments, using the ZED SDK to provide dense depth maps, by employing the full post-processing and occlusion filling, as well as sparse depth maps, by using only the basic stereo matching algorithm and filtering out low confidence regions. The latter method gives depth maps similar to what would be obtained using a simple block-based stereo matching algorithm, commonly used in resource-constrained or high-frequency applications [3].

In our offline experiments, data from the stereo vision system is recorded, and used a posteriori to train and test the learning algorithm under varying conditions, in batch mode. We used a modular MATLAB program for rapid prototyping of different feature combinations and learning algorithms, as well as to determine good values for their hyperparameters.

When operating online, depth and image data is streamed directly from the stereo camera into an online learning algorithm, and afterwards monocular depth inference is performed. The resulting depth maps are recorded for later evaluation. We used an architecture based on C++ and OpenCV, for faster performance and easy interaction with C-based embedded robotics platforms.

In both situations, the images and ground truth depth maps are resized to standard sizes before learning takes place. This is done for performance reasons, due to the very high resolution of the input data and the short time available for feature computation and learning.
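As a concrete illustration of this setup, the following minimal Python sketch shows how a patch-wise training set could be assembled from one recorded frame, using the stereo depth map as the supervisory signal. Our actual implementations use MATLAB offline and C++/OpenCV online; here, extract_features is a hypothetical stand-in for the feature computation of section III-B, and the patch and image sizes are illustrative only.

```python
import cv2
import numpy as np

def build_training_set(left_image, stereo_depth, patch=11, stride=11):
    # Resize to a standard working size, as done before learning
    img = cv2.resize(left_image, (320, 240))
    depth = cv2.resize(stereo_depth, (320, 240),
                       interpolation=cv2.INTER_NEAREST)

    X, y = [], []
    for r in range(0, img.shape[0] - patch, stride):
        for c in range(0, img.shape[1] - patch, stride):
            d = depth[r + patch // 2, c + patch // 2]
            if not np.isfinite(d) or d <= 0:
                continue  # sparse maps: skip pixels with no stereo estimate
            # extract_features is hypothetical: the filter/texton/HOG/Radon
            # features of section III-B, computed over the image patch
            X.append(extract_features(img[r:r + patch, c:c + patch]))
            y.append(np.log(d))  # log-depth targets, as used in the paper
    return np.array(X), np.array(y)
```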

B. Features

The features used for learning are in general similar to those recently used in the literature [14, 16]. However, we exclude optical flow features, and add features based on a texton similarity measure, to be discussed below. Features are calculated over square patches, directly corresponding to pixels in the matching depth maps.

1) Filter-based features: These features are implementations of the texture energy, texture gradients and haze features engineered and popularized by Saxena's research group [6, 7], and used in multiple robotics applications ever since [5, 14, 16, 22]. The features are constructed by first converting the image patch into the YCbCr color space, applying various filters to the specified channel, and then taking the sum of absolute and squared values over the patch. This procedure is repeated at three increasing size scales, to capture both local and global information. The filters used are the following (an illustrative code sketch follows the list):

• Laws' masks, as per Davies [23], constructed by convolving the L3, E3 and S3 basic $1 \times 3$ masks together:

$$L3 = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}, \quad E3 = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}, \quad S3 = \begin{bmatrix} -1 & 2 & -1 \end{bmatrix}$$

In total, 9 Laws' masks are obtained from the pairwise convolutions of the basic masks, namely $L3^\top L3$, $L3^\top E3$, $\ldots$, $S3^\top E3$ and $S3^\top S3$. These are applied to the Y channel, to capture texture energy.

• A local averaging filter, applied to the Cb and Cr channels, to capture haze in the low frequency color information. The first Laws' mask ($L3^\top L3$) was used.

• Nevatia-Babu [24] oriented edge filters, applied to the Y channel, to capture texture gradients. The 6 filters are spaced at 30° intervals.
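The sketch below (Python with NumPy/SciPy, purely illustrative) computes the Laws'-mask texture-energy part of these features on the Y channel of a patch; the local averaging and Nevatia-Babu filters, and the repetition over three size scales, follow the same pattern.

```python
import numpy as np
from scipy.signal import convolve2d

L3 = np.array([[1, 2, 1]])
E3 = np.array([[-1, 0, 1]])
S3 = np.array([[-1, 2, -1]])
basis = [L3, E3, S3]

# 9 separable 3x3 masks from the pairwise (outer) products of the 1x3 bases
masks = [a.T @ b for a in basis for b in basis]

def laws_features(y_patch):
    feats = []
    for m in masks:
        resp = convolve2d(y_patch, m, mode='same', boundary='symm')
        feats.append(np.abs(resp).sum())   # sum of absolute responses
        feats.append((resp ** 2).sum())    # sum of squared responses
    return np.array(feats)
```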

2) Texton-based features: Textons are small image patches representative of particular texture classes, learned by clustering patch samples from training images. They were first recognized as a valuable tool for computer vision in the work of Varma et al. [25], which demonstrated their performance in the task of texture classification, when compared to traditional filter bank methods. The distribution of textons over the image has since been used to generate computationally efficient features for obstacle detection in various robotics works [20, 21, 26].

Previous visual bag of words approaches represented an image using a histogram, constructed by sampling image patches and determining, for each patch, its closest texton in terms of Euclidean distance; the corresponding histogram bin is then incremented. Since we desire to capture local information for a given patch, we instead use its squared Euclidean distance to each texton as features. The texton dictionary is learned from the training dataset using Kohonen clustering [27], similarly to previous works.
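A minimal sketch of this feature computation, assuming textons holds the learned dictionary of section III-D (30 patches of 5 × 5 pixels):

```python
import numpy as np

def texton_features(patch5x5, textons):
    # One feature per texton: the squared Euclidean distance
    # between the (flattened) patch and that texton
    v = patch5x5.astype(np.float64).ravel()
    return np.array([np.sum((v - t.ravel()) ** 2) for t in textons])
```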

3) Histogram of Oriented Gradients: Histogram of Oriented Gradients (HOG) features have been successfully used for object and human detection [28], as well as by Dey et al. for depth estimation [14]. The image is divided into cells, over which the pixel-wise gradients are determined and their directions binned into a histogram. Adjacent cells are grouped into 2 × 2 blocks, and the histograms are normalized with respect to all the cells in the block, to correct for contrast differences and improve accuracy.
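As a stand-in for our own implementation, scikit-image's hog function computes an equivalent descriptor; the 9 orientation bins and 2 × 2 block normalization match the text, while the 8 × 8 pixel cell size is an illustrative assumption (the cell size is one of our hyperparameters).

```python
from skimage.feature import hog

def hog_features(gray_patch):
    # 9 orientation bins, 2x2 block normalization, as described above
    return hog(gray_patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')
```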

4) Radon transform: Michels et al. [22] introduced a feature to capture texture gradient and the direction of strong edges, based on the Radon transform [29], also subsequently used by other works [14, 16]. The Radon transform is an integral, continuous version of the Hough transform, commonly used in computer vision for edge detection, and maps an image from $(x, y)$ into $(\theta, \rho)$ coordinates. For each value of $\theta$, the two highest values of the transform are recorded and used as features.
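A sketch of this feature, using scikit-image's radon as a stand-in and the 15 discretized angles of section III-D:

```python
import numpy as np
from skimage.transform import radon

def radon_features(gray_patch, n_angles=15):
    theta = np.linspace(0., 180., n_angles, endpoint=False)
    sinogram = radon(gray_patch, theta=theta)   # shape: (n_rho, n_angles)
    top2 = np.sort(sinogram, axis=0)[-2:, :]    # two highest values per angle
    return top2.ravel()
```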

C. Learning algorithm

To choose a learning algorithm, we looked at previous approaches in the literature. Bipin et al. [5] had success with approaching the problem as a multiclass classification problem, using a linear SVM for learning. Dey et al. [14] used a nonlinear regression algorithm based on the Calibrated Least Squares (CLS) algorithm by Agarwal et al. [15]. In most of the literature, the algorithms are used to estimate the logarithms of depths, rather than the depths themselves, and after testing both approaches we also found better performance when estimating log depths.

We have approached the depth estimation problem both as a classification problem and as a regression problem. For classification, we have tried out methods such as an SVM, using both linear [30] and radial basis function kernels [31], in both cases using one-vs-the-rest for multiclass. We experimented with using a decision tree, the generalized least squares algorithm [15] with a multinomial regression link function, and the classification version of the CLS algorithm [15]. For regression, we have employed linear least squares, a regression tree, and a modified version of the CLS algorithm.

Our evaluation ultimately led us to the conclusion that regression consistently outperforms classification in this task, because multiclass classification loss functions penalize every misclassification in the same way, while regression attributes larger penalties to larger deviations. Additionally, we observed that the modified CLS regression algorithm exhibits better performance than linear least squares or regression trees, while still being computationally very efficient. For this reason, we decided to use it for the rest of our testing.

The CLS algorithm is based on the minimization of a calibrated loss function, in the context of a generalized linear model with an unknown link function. The link function is itself approximated as a linear combination of basis functions of the target variable, typically low degree polynomials. The CLS algorithm consists of simultaneously solving the problems of link function approximation and loss function minimization, by iteratively solving two least squares problems.

We make slight modifications to the algorithm shown by Agarwal et al. [15], namely removing the simplex clipping step, since we're performing regression rather than multiclass classification, and using Tikhonov regularization in the least squares problems. From a computational point of view, we use Cholesky decompositions to efficiently solve the inner least squares problems, and define the convergence criterion using the norm of the difference in $\hat{y}$ between successive iterations. The straightforward algorithm is described in Figure 3.

The algorithm is iterative by design, but it can be adapted for online inference by storing the weight matrices and using them a posteriori on new test data, repeating steps 3 and 5 of the algorithm with the stored weight matrices. Additionally, it can be adapted to online training by using batch-wise stochastic gradient descent to update the weights as new data samples come in.

input: feature vectors $x_i$; vector of target values $y$
output: predicted target values $\hat{y}$; sequences of weight matrices $W_t$ and $\tilde{W}_t$

1: while $t < t_{\max}$ and $\left\| \hat{y}^{(t)} - \hat{y}^{(t-1)} \right\| > \text{threshold}$ do
       (Iteratively minimize the calibrated loss function)
2:     $W_t = \arg\min_{W} \sum_{i=1}^{n} \left\| y_i - \hat{y}_i^{(t-1)} - W x_i \right\|_2^2 + \lambda_1 \left\| W \right\|_2^2$
3:     $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + W_t x_i$
       (Find the optimal linear combination of basis functions)
4:     $\tilde{W}_t = \arg\min_{\tilde{W}} \sum_{i=1}^{n} \left\| y_i - \tilde{W} G(\hat{y}_i^{(t)}) \right\|_2^2 + \lambda_2 \left\| \tilde{W} \right\|_2^2$
5:     $\hat{y}_i^{(t)} = \tilde{W}_t G(\hat{y}_i^{(t)})$
6: end while

Fig. 3: Description of the modified CLS regression algorithm.

D. Hyperparameters

The algorithm has numerous hyperparameters, including the regularization constants, the base size and number of size scales of the image patches over which the features are computed, the size and number of textons, the cell size and number of bins for the HOG feature calculation, etc. One could try to determine their optimal values using a genetic algorithm, but due to the size of the parameter space and the time required for end-to-end simulations, we instead opted to choose some of the parameters based on their typical values in the literature.

We used a base patch size of 11 × 11, and 2 additional size scales, 33 × 33 and 99 × 99. We learned a dictionary of 30 black and white textons, 5 × 5 in size, shown in Figure 4. We used 9 bins for the HOG features, and discretized the angles for the Radon transform into 15 values.

Fig. 4: Texton dictionary learned from training data.


IV. EXPERIMENTAL RESULTS

In this section, the offline experiments are described in detail. These were performed in order to evaluate the performance of the proposed learning algorithm on existing datasets, and to determine good values for its hyperparameters. We also tested our hypothesis that it should be possible to estimate dense depth maps despite learning only on sparse training data, by testing on a new indoor stereo dataset with both sparse and dense depth maps.

A. Error metrics

To measure the algorithm's accuracy, error metrics commonly found in the literature [9] were employed, namely:

• The mean logarithmic error: $\frac{1}{N} \sum \left| \log d_{est} - \log d_{gt} \right|$

• The mean relative error: $\frac{1}{N} \sum \left| d_{est} - d_{gt} \right| / d_{gt}$

• The mean relative squared error: $\frac{1}{N} \sum \left( d_{est} - d_{gt} \right)^2 / d_{gt}$

• The root mean squared (RMS) error: $\sqrt{\frac{1}{N} \sum \left( d_{est} - d_{gt} \right)^2}$

• The root mean squared (RMS) logarithmic error: $\sqrt{\frac{1}{N} \sum \left( \log d_{est} - \log d_{gt} \right)^2}$

• The scale-invariant error: $\frac{1}{N} \sum \left( \log d_{est} - \log d_{gt} \right)^2 - \frac{1}{N^2} \left( \sum \left( \log d_{est} - \log d_{gt} \right) \right)^2$
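Stated as code, these metrics are straightforward; a minimal NumPy transcription, assuming d_est and d_gt are flat arrays of valid, positive depths:

```python
import numpy as np

def error_metrics(d_est, d_gt):
    N = len(d_gt)
    log_diff = np.log(d_est) - np.log(d_gt)
    return {
        'mean log':        np.mean(np.abs(log_diff)),
        'relative abs':    np.mean(np.abs(d_est - d_gt) / d_gt),
        'relative square': np.mean((d_est - d_gt) ** 2 / d_gt),
        'linear RMS':      np.sqrt(np.mean((d_est - d_gt) ** 2)),
        'log RMS':         np.sqrt(np.mean(log_diff ** 2)),
        'scale-invariant': np.mean(log_diff ** 2)
                           - np.sum(log_diff) ** 2 / N ** 2,
    }
```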

B. Standard datasets

As a first step, the algorithm was tested on existing depth datasets collected using active laser scanners, namely Make3D, KITTI 2012 and KITTI 2015. For Make3D, we used the standard division of 400 training and 134 test samples. Since the KITTI datasets' standard test data consists solely of camera images, lacking ground truth depth maps, we instead randomly distributed the standard training data among two sets, with 70% of the data being allocated for training and 30% for testing.

The results we obtained for Make3D are shown qualitatively in Figure 5, and quantitatively in Table I, along with results found in the literature [7, 8, 11]. It can be seen that we obtain slightly worse performance than current state of the art approaches. However, we do surpass Bipin et al.'s [5] results using a linear SVM, while using a much more efficient learning algorithm (in our tests, training a linear SVM with 10 depth classes took around 10 times longer than the CLS algorithm).

Upon visual inspection of the image samples, we observe that our algorithm manages to successfully capture most of the global depth variations, and even some local details.

TABLE I: Comparison of results on the Make3D dataset

Make3D            Saxena   Karsch   Liu     Bipin   Our method
mean log          0.430    0.292    0.251   0.985   0.493
relative abs      0.370    0.355    0.287   0.815   0.543
relative square   -        -        -       -       10.717
linear RMS        -        9.20     7.36    -       20.116
log RMS           -        -        -       -       0.683
scale-invariant   -        -        -       -       0.462

For the KITTI datasets, we show the results in Figure 6, and compare them quantitatively in Tables II and III to results found in the literature [9, 11, 32], although we note that, to our knowledge, no previous work on monocular depth estimation has yet shown results on the more recent KITTI 2015 dataset.

TABLE II: Comparison of results on the KITTI 2012 dataset

KITTI 2012        Saxena   Eigen   Liu     Mancini   Our method
mean log          -        -       0.211   -         0.372
relative abs      0.280    0.190   0.217   -         0.525
relative square   3.012    1.515   -       -         2.311
linear RMS        8.734    7.156   7.046   7.508     13.093
log RMS           0.361    0.270   -       0.524     0.590
scale-invariant   0.327    0.246   -       0.196     0.347

TABLE III: Results on the KITTI 2015 dataset

KITTI 2015        Our method
mean log          0.291
relative abs      0.322
relative square   1.921
linear RMS        12.498
log RMS           0.452
scale-invariant   0.204

We observe that our algorithm performs comparatively better on these datasets, due to the smaller variations in the depicted environments when compared to Make3D, which is inherently more diverse. This once again leads us to conclude that, in order to achieve optimal performance, the training and test data should be made as similar as possible, and in the context of robotics this is enabled through the use of SSL.

C. Stereo dataset

In order to further test our hypotheses, we developed a new dataset of images shot in the same environment, to simulate the SSL approach, with both dense and sparse depth maps, to see how the algorithm performs on the dense data while being trained only on the sparse maps.

We shot several videos with the ZED around the TU Delft Aerospace faculty. The camera was handheld, with some fast rotations and no particular care for stabilization, similar to footage that would be captured by a UAV. The resulting images naturally have imperfections, namely defocus, motion blur, and stretch and shear artifacts from the rolling shutter, which are not reflected in the standard RGBD datasets, but are nevertheless encountered in real life situations.

The stereo videos were then processed offline with the ZED SDK, using both the STANDARD (structure conservative, no occlusion filling) and FILL (occlusion filling, edge sharpening and advanced post-filtering) settings. The provided confidence map was then used on the STANDARD data to filter out low confidence regions, leading to very sparse depth maps, as shown in Figure 1, similar to depth maps obtained using traditional block-based stereo matching algorithms. The full dataset consists of 12 video sequences, and will be made available for public use in the near future.
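Conceptually, this sparsification step reduces to masking the depth map with the confidence map; a minimal sketch, assuming depth and confidence are aligned float arrays as produced by the stereo pipeline (the threshold value, and whether high values mean high confidence, depend on the SDK's confidence scale and are illustrative here):

```python
import numpy as np

def sparsify(depth, confidence, threshold=0.5):
    sparse = depth.copy()
    # Mark low-confidence pixels as invalid, yielding a sparse depth map
    sparse[confidence < threshold] = np.nan
    return sparse
```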

For our tests, we split the video sequences into two parts, learning on the first 70% and testing on the final 30%, in order to simulate the conditions of a robot undergoing SSL.

Fig. 5: Qualitative results on the Make3D dataset. From left to right: the monocular image, the ground truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.

Fig. 6: Qualitative results on the KITTI 2012 and 2015 datasets. From left to right: the monocular image, the ground truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.


Fig. 7: Qualitative results on the stereo ZED dataset. From left to right: the monocular image, the ground truth depth map, the algorithm trained on the dense depth, and the algorithm trained on the sparse depth. The depth scales are the same for every image.


TABLE IV: Comparison of results on the new stereo dataset collected with the ZED, part 1

Video sequence    1 dense  1 sparse  2 dense  2 sparse  3 dense  3 sparse  4 dense  4 sparse  5 dense  5 sparse  6 dense  6 sparse
mean log          0.615    0.746     0.5812   0.609     0.169    0.594     0.572    0.547     0.425    0.533     0.525    0.479
relative abs      1.608    1.064     1.391    0.823     0.288    0.420     0.576    0.557     0.446    0.456     0.956    0.526
relative square   26.799   8.325     17.901   6.369     2.151    3.280     4.225    3.728     2.951    3.312     12.411   3.320
linear RMS        8.311    7.574     7.101    6.897     2.933    7.134     6.488    6.026     5.692    6.331     7.239    5.599
log RMS           0.967    0.923     0.913    0.770     0.359    0.770     0.748    0.716     0.546    0.662     0.822    0.613
scale-invariant   0.931    0.778     0.826    0.524     0.127    0.277     0.440    0.412     0.275    0.247     0.669    0.314

TABLE V: Comparison of results on the new stereo dataset collected with the ZED, part 2

Video sequence    7 dense  7 sparse  8 dense  8 sparse  9 dense  9 sparse  10 dense  10 sparse  11 dense  11 sparse  12 dense  12 sparse
mean log          0.532    0.514     0.654    0.539     0.579    0.470     0.411     0.460      0.640     0.704      0.362     0.494
relative abs      0.732    0.484     0.583    0.770     0.773    0.534     0.576     0.472      0.838     0.609      0.517     0.572
relative square   7.194    3.016     6.480    7.472     15.659   3.082     5.583     3.666      56.000    4.792      51.092    4.367
linear RMS        6.132    5.577     8.383    6.284     14.324   5.319     5.966     6.127      11.435    7.832      30.676    6.095
log RMS           0.789    0.669     1.006    0.770     0.823    0.610     0.681     0.633      0.898     0.835      0.580     0.642
scale-invariant   0.601    0.332     0.825    0.493     0.648    0.338     0.442     0.287      0.729     0.378      0.323     0.347

We tested our depth estimation algorithm under two different conditions: training on the fully processed and occlusion-corrected dense maps, and training directly on the raw sparse outputs from the stereo matching algorithm. The dense depth maps were used as ground truth during testing in both cases. The results obtained are shown in Figure 7, and in Tables IV and V.

Observing the results, we see that the performance is in general better than on the existing laser datasets, especially from a qualitative point of view. The algorithm is capable of leveraging the fact that the data used for training is not particularly diverse, and additionally very similar to the test data, sidestepping the generalization problems exhibited by traditional supervised learning methodologies.

It can also be observed that, contrary to our expectations, the algorithm trained on the sparse data doesn't fall short, and actually surpasses the algorithm trained on the dense data in many of the video sequences and error metrics. Looking at the estimated depth maps, this is mostly a consequence of the algorithm's failure to correctly extrapolate. The largest contributions to the error metrics come from the depth extrapolations that fall significantly outside the maximum range of the camera, and are qualitatively completely incorrect. When the algorithm is trained only on the sparse data, it is exposed to a smaller range of target values, and is consequently much more conservative in its estimates, leading to lower error figures.

Another influencing factor is that, in the sparse high confidence regions, there is a stronger correlation between the depth obtained by the stereo camera and the true depth, when compared to the untextured and occluded areas. Since we assume there is a strong correlation between the extracted monocular image features and the true depth, this implies that there is a stronger correlation between the image features and the stereo camera depths. The simplicity of the machine learning model used means that this strong correlation can be effectively exploited and learned, while the low correlation present in the dense depth data can easily lead the algorithm to learn a wrong model.

From a qualitative point of view, however, both sparse and dense behave similarly. Looking at the sample images, we can see examples where both correctly estimate the relative depths between objects in a scene (rows 1 and 2), examples where sparse is better than dense (rows 3 and 4), and where dense is better than sparse (row 5). Naturally, there are also many cases where the estimated depth maps are mostly incorrect (row 6).

In particular, we see that the algorithm trained only on sparse data also behaves well at estimating the depth in occluded and untextured areas, for instance the white wall in row 3. This leads us to the conclusion that the algorithm's performance is perfectly adequate to serve as a complement to a sparse stereo matching algorithm, correctly filling in the missing depth information.

D. Online Experiments

We've started preliminary work on a framework for onboard SSL applied to monocular depth estimation, using a stereo camera. We've rewritten some of our feature extraction functions in C++, and have tested the setup using a linear least squares regression algorithm, achieving promising results. We intend to further develop the framework in order to fully support an online version of our learning pipeline, and to integrate it with existing autopilot platforms, such as Paparazzi [33].

V. CONCLUSION

This work focused on the application of SSL to the problem of estimating depth from single monocular images, with the intent of complementing sparse stereo vision algorithms. We have shown that our algorithm exhibits competitive performance on existing RGBD datasets, while being computationally more efficient to train than previous approaches [5]. We also train the algorithm on a new stereo dataset, and show that it remains accurate even when trained only on sparse, rather than dense, stereo maps. It can consequently be used to efficiently produce dense depth maps from sparse input. Our preliminary work on its online implementation has revealed promising results, obtaining good performance with a very simple linear least squares algorithm.

In future work, we plan to extend our methodology and further explore the complementarity of the information present in monocular and stereo cues. The use of a learning algorithm such as a Mondrian forest [34], or other ensemble-based methods, would enable the estimation of the uncertainty in its own predictions. A sensor fusion algorithm can then be used to merge the information from both the stereo vision system and the monocular depth estimation algorithm, based on their local confidence in the estimated depth. This would lead to an overall more accurate depth map.

We have avoided the use of optical flow features, since they're expensive to compute, and are not usable when estimating depths from isolated images rather than video data. However, future work could explore computationally efficient ways of using optical flow to guarantee the temporal consistency, and consequently increase the accuracy, of the sequence of estimated depth maps.

Current state of the art depth estimation methods [10–12, 32] are all based on deep convolutional neural networks of varying complexities. The advent of massively parallel GPU-based embedded hardware, such as the Jetson TX1 and its eventual successors, means that online training of deep neural networks is close to becoming reality. These models would greatly benefit from the large amounts of training data made possible by the SSL framework, and lead to state of the art depth estimation results onboard micro aerial vehicles.

REFERENCES

[1] R. A. El-laithy, J. Huang, and M. Yeh, "Study on the use of Microsoft Kinect for robotics applications," in Position Location and Navigation Symposium (PLANS), 2012 IEEE/ION. IEEE, 2012, pp. 1280–1288.

[2] M. Draelos, Q. Qiu, A. Bronstein, and G. Sapiro, "Intel RealSense = real low cost gaze," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2520–2524.

[3] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4982–4987.

[4] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, "Self-supervised monocular road detection in desert terrain," in Robotics: Science and Systems, vol. 38, Philadelphia, 2006.

[5] K. Bipin, V. Duggal, and K. M. Krishna, "Autonomous navigation of generic monocular quadcopter in natural environment," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1063–1070.

[6] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2005, pp. 1161–1168.

[7] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.

[8] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," pp. 1–14, 2013.

[9] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," NIPS, pp. 1–9, 2014. [Online]. Available: http://arxiv.org/abs/1406.2283

[10] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.

[11] F. Liu, C. Shen, G. Lin, and I. D. Reid, "Learning depth from single monocular images using deep convolutional neural fields," PAMI, p. 15, 2015. [Online]. Available: http://arxiv.org/abs/1502.07411

[12] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-image depth perception in the wild," arXiv preprint arXiv:1604.03901, 2016.

[13] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, "Learning ordinal relationships for mid-level vision," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 388–396.

[14] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight," in Field and Service Robotics. Springer, 2016, pp. 391–409.

[15] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, "Least squares revisited: Scalable approaches for multi-class prediction," arXiv preprint arXiv:1310.1949, 2013. Software available at https://github.com/n17s/secondorderdemos

[16] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772.

[17] H. Hu, A. Grubb, J. A. Bagnell, and M. Hebert, "Efficient feature group sequencing for anytime linear prediction," arXiv preprint arXiv:1409.5495, 2014.

[18] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.

[19] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[20] H. W. Ho, C. De Wagter, B. D. W. Remes, and G. C. H. E. de Croon, "Optical-flow based self-supervised learning of obstacle appearance applied to MAV landing," pp. 1–10, 2015.

[21] K. van Hecke, G. de Croon, L. van der Maaten, D. Hennes, and D. Izzo, "Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance," arXiv preprint arXiv:1603.08047, 2016.

[22] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 593–600.

[23] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities. Academic Press, 2012.

[24] R. Nevatia and K. R. Babu, "Linear feature extraction and description," Computer Graphics and Image Processing, vol. 13, no. 3, pp. 257–269, 1980.

[25] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" in CVPR, vol. 2, 2003, pp. II-691–8.

[26] G. De Croon, E. De Weerdt, C. De Wagter, B. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[27] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.

[29] S. R. Deans, The Radon Transform and Some of Its Applications. Courier Corporation, 2007.

[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear

[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[32] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, "Fast robust monocular depth estimation for obstacle detection with fully convolutional networks," arXiv preprint arXiv:1607.06349, 2016.

[33] G. Hattenberger, M. Bronz, and M. Gorraz, "Using the Paparazzi UAV system for scientific research," in IMAV 2014, International Micro Air Vehicle Conference and Competition 2014, 2014.

[34] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, "Mondrian forests for large-scale regression when uncertainty matters," arXiv preprint arXiv:1506.03805, 2015.

Page 3: Learning Depth from Single Monocular Images Using Stereo ... · obstacle avoidance and navigation, to localization and envi- ... data collected using a Kinect sensor. Dey et al. [14]

3

prediction The depth estimation is done over large patchesusing features similar to [16] and additional features basedon Histogram of Oriented Gradients (HOG) and tree detectorfeatures at multiple size scales The training data is collectedby a rover using a stereo camera system and the training doneoffline An additional cost-sensitive greedy feature selectionalgorithm by Hu et al [17] is used to evaluate the mostinformative features for a given time-budget

Although multiple studies have investigated the use ofmonocular depth estimation for robot navigation none havefocused on how it can be used to complement stereo visionin the context of SSL

C Self-supervised learning

SSL has been the focus of some recent research in roboticssince in contrast to traditional offline learning methodologiesit requires less human intervention and offers the possibilityof adaptation to new circumstances

Dahlkamp et al [4 18] used SSL to train a vision-basedterrain analyser for Stanleyrsquos DARPA Grand Challenge per-formance The scanner was used for obstacle detection andterrain classification at close ranges and as supervisory inputto train the vision-based classifier The vision-based classifierachieved a much greater obstacle detection range which inturn made it possible to increase Stanleyrsquos maximum speedand eventually win the challenge

Hadsell et al [19] developed a SSL methodology with thesimilar purpose of enabling long-range obstacle detection ina vehicle equipped with a stereo camera For this purposethey train a real-time classifier using labels from the stereocamera system as supervisory input and perform inferenceusing the learned classifier This process repeats every framebut keeping a small buffer of previous training examples forsuccessive time steps allowing for short term memory ofprevious obstacles The features to be extracted are themselveslearned offline using both supervised and unsupervised meth-ods not hand engineered

SSL is also applied in Ho et al [20] to the problem ofdetecting obstacles using a downward facing camera in thecontext of micro aerial vehicle landing In contrast to previousapproaches optical flow is used to estimate a measure ofsurface roughness given by the fitting error between theobserved optical flow and that of a perfect planar surfaceThe surface roughness is then used as supervisory input toa linear regression algorithm using texton distributions asfeatures Learning wasnrsquot performed for every frame butrather when the uncertainty of the estimates increased dueto previously unseen inputs The resulting appearance-basedobstacle detector demonstrated good performance even insituations where the optical flow is negligible due to lack oflateral motion

Recently van Hecke et al [21] successfully applied SSLto the similar problem of estimating a single average depthfrom a monocular image for obstacle avoidance purposesusing supervisory input from a stereo camera system Theyfocused on the behavioral aspects of SSL and its relation withlearning from demonstration by looking at how the learning

process should be organized in order to maximize performancewhen the supervisory input becomes unavailable The beststrategy is determined to be after an initial period of learningto use the supervisory input only as rdquotraining wheelsrdquo thatis using stereo vision only when the vehicle gets too closeto an obstacle The depth estimation algorithm uses textondistributions as features and kNN as the learning algorithm

III METHODOLOGY OVERVIEW

In this section we describe the learning methodology weused namely the SSL setup the features the learning algo-rithm and its hyperparameters

A Learning setup

The setup is similar to previous stereo-based SSL ap-proaches such as Hadsellrsquos [19] and van Heckersquos [21] Thebasic principle is to use the output from a stereo vision systemas the supervisory input to an appearance-based depth estima-tion learning algorithm Unlike their work however our maingoal is to obtain an accurate depth map over the whole imagerather than performing terrain classification or estimating asingle average depth value The camerarsquos output is processedusing both sparse and dense stereo matching algorithms andwe study the consequences of learning only on sparse depthmaps by observing and evaluating the algorithmrsquos behavioron the dense depth data A schematic diagram of the setup ispresented in Figure 2

Left camera Right camera

Sparse stereo

matching

Dense stereo

matching

Monocular

estimation supervisory

input

Testingground truth

depth map

estimated

depth map

Fig 2 Diagram of SSL setup

For our experiments we used a Stereolabs ZED stereocamera It features wide-angle lenses with a 110 field ofview spaced at a baseline of 120 mm allowing for accuratedepth estimation in the range of 07 to 20 m The camerarsquosf20 aperture and relatively large 13primeprime sensor enables goodexposure performance even under low light conditions Its out-put is highly configurable in terms of both resolution and framerate with 15 frames per second possible at 22K resolution

4

in terms of both photographic and depth data One problemwith the hardware however is its use of a rolling shuttercausing undesired effects such as stretch shear and wobblein the presence of either camera motion or very dynamicenvironments We experienced some of these problems whileshooting scenes with lateral camera movement so for actualrobotics applications we would instead use a camera systemwith a global shutter where these effects would be absent

The ZED SDK is designed around the OpenCV and CUDAlibraries with its calibration distortion correction and depthestimation routines taking advantage of the CUDA platformrsquosmassively parallel GPU computing capabilities The SDK ad-ditionally provides optional post processing of the depth mapsincluding occlusion filling edge sharpening and advancedpost-filtering and a map of stereo matching confidence isalso available Additional capabilities of positional trackingand real-time 3D reconstruction are also offered although notused in this work

We performed both offline and online experiments usingthe ZED SDK to provide dense depth maps by employingthe full post-processing and occlusion filling as well as sparsedepth maps by using only the basic stereo matching algorithmand filtering out low confidence regions The latter methodgives depth maps similar to what would be obtained usinga simple block-based stereo matching algorithm commonlyused in resource-constrained or high-frequency applications[3]

In our offline experiments data from the stereo visionsystem is recorded and a posteriori used to train and test thelearning algorithm under varying conditions in batch modeWe used a modular MATLAB program for rapid prototypingof different feature combinations and learning algorithms aswell as to determine good values for their hyperparameters

When operating online depth and image data is streameddirectly from the stereo camera into an online learning algo-rithm and afterwards monocular depth inference is performedThe resulting depth maps are recorded for posterior evaluationWe used an architecture based on C++ and OpenCV for fasterperformance and easy interaction with C-based embeddedrobotics platforms

In both situations the images and ground truth depth mapsare resized to standard sizes before learning takes place This isdone for performance reasons due to the very high resolutionof the input data and the short time available for featurecomputation and learning

B Features

The features used for learning are in general similar tothose recently used in the literature [14 16] However weexclude optical flow features and add features based on atexton similarity measure to be discussed below Features arecalculated over square patches directly corresponding to pixelsin the matching depth maps

1) Filter-based features These features are implementa-tions of the texture energy texture gradients and haze fea-tures engineered and popularized by Saxenarsquos research group[6 7] and used in multiple robotics applications ever since

[5 14 16 22] The features are constructed by first convertingthe image patch into YCbCr color space applying variousfilters to the specified channel and then taking the sum ofabsolute and squared values over the patch This procedure isrepeated at three increasing size scales to capture both localand global information The filters used arebull Lawsrsquo masks as per Davies [23] constructed by convolv-

ing the L3 E3 and S3 basic 1times 3 masks together

L3 =[1 2 1

]E3 =

[minus1 0 1

]S3 =

[minus1 2 minus1

]In total 9 Lawsrsquo masks are obtained from the pairwiseconvolutions of the basic masks namely L3TL3 L3TE3 S3TE3 and S3TS3 These are applied to the Ychannel to capture texture energy

bull A local averaging filter applied to the Cb and Crchannels to capture haze in the low frequency colorinformation The first Lawsrsquo mask (L3TL3) was used

bull Nevatia-Babu [24] oriented edge filters applied to the Ychannel to capture texture gradients The 6 filters arespaced at 30 intervals

2) Texton-based features Textons are small image patchesrepresentative of particular texture classes learned by clus-tering patch samples from training images They were firstrecognized as a valuable tool for computer vision in the workof Varma et al [25] which showed their performance in thetask of texture classification when compared to traditionalfilter bank methods The distribution of textons over the imagehas since been used to generate computationally efficientfeatures for various works in robotics [20 21 26] for obstacledetection

Previous visual bag of words approaches represented an im-age using a histogram constructed by sampling image patchesand determining for each patch its closest texton in terms ofEuclidean distance Then the corresponding histogram bin isincremented Since we desire to capture local information fora given patch we use its square Euclidean distance to eachtexton as features The texton dictionary is learned from thetraining dataset using Kohonen clustering [27] similarly toprevious works

3) Histogram of Oriented Gradients Histogram of Ori-ented Gradients (HOG) features have been successfully usedfor object and human detection [28] as well by Dey et alfor depth estimation [14] The image is divided into cellsover which the pixel-wise gradients are determined and theirdirections binned into a histogram Adjacent cells are groupedinto 2 times 2 blocks and the histograms are normalized withrespect to all the cells in the block to correct for contrastdifferences and improve accuracy

4) Radon transform Michels et al [22] introduced a fea-ture to capture texture gradient and the direction of strongedges based on the Radon transform [29] also subsequentlyused by other works [14 16] The Radon transform is an in-tegral continuous version of the Hough transform commonlyused in computer vision for edge detection and maps an imagefrom (x y) into (θ ρ) coordinates For each value of θ the

5

two highest values of the transform are recorded and used asfeatures

C Learning algorithm

To choose a learning algorithm we looked at previousapproaches in the literature Bipin et al [5] had success withapproaching the problem as a multiclass classification prob-lem and using a linear SVM for learning Dey et al [14] useda nonlinear regression algorithm based on the Calibrated LeastSquares (CLS) algorithm by Agarwal et al [15] In most of theliterature the algorithms are used to estimate the logarithms ofdepths rather than the depths themselves and after testing bothapproaches we also found better performance when estimatinglog depths

We have approached the depth estimation problem bothas a classification problem and as a regression problem Forclassification we have tried out methods such as a SVM usingboth linear [30] and radial basis function kernels [31] in bothcases using one-vs-the-rest for multiclass We experimentedwith using a decision tree the generalized least squaresalgorithm [15] with a multinomial regression link functionand the classification version of the CLS algorithm [15] Forregression we have employed linear least squares a regressiontree and a modified version of the CLS algorithm

Our evaluation ultimately lead us to the conclusion thatregression consistently outperforms classification in this taskbecause multiclass classification loss functions penalize everymisclassification in the same way while regression attributeslarger penalties to larger deviations Additionally we observedthat the modified CLS regression algorithm exhibits betterperformance than linear least squares or regression trees whilestill being computationally very efficient For this reason wedecided to use it for the rest of our testing

The CLS algorithm is based on the minimization of acalibrated loss function in the context of a generalized linearmodel with an unknown link function The link function is it-self approximated as a linear combination of basis functions ofthe target variable typically low degree polynomials The CLSalgorithm consists of simultaneously solving the problems oflink function approximation and loss function minimizationby iteratively solving two least squares problems

We make slight modifications to the algorithm shown byAgarwal et al [15] namely removing the simplex clippingstep since wersquore performing regression rather than multiclassclassification and using Tikhonov regularization in the leastsquares problems From a computational point of view we useCholesky decompositions to efficiently solve the inner leastsquares problems and define the convergence criterion usingthe norm of the difference in y in successive iterations Thestraightforward algorithm is described in Figure 3

The algorithm is iterative by design but it can be adaptedfor online inference by storing the weight matrices and usingthem a posteriori on new test data repeating steps 3 and 5 ofthe algorithm with the stored weight matrices Additionallyit can be adapted to online training by using batch-wisestochastic gradient descent to update the weights as new datasamples come in

input feature vectors xi vector of target values youtput predicted target values y sequence of weight

matrices W and W

1 while t lt tmax and∥∥∥y(t) minus y(tminus1)∥∥∥ gt threshold do

Iteratively minimize thecalibrated loss function

2 Wt = argminnsum

i=1

∥∥∥yi minus y(tminus1)i minusWxi

∥∥∥22+λ1W22

3 y(t)i = y

(tminus1)i +Wtxi

Find the optimal linearcombination of basis functions

4 Wt = argminnsum

i=1

∥∥∥yi minus WG(y(t)i )∥∥∥22+ λ2

∥∥∥W∥∥∥22

5 y(t)i = WtG(y

(t)i )

6 end

Fig 3 Description of the modified CLS regression algorithm

D Hyperparameters

The algorithm has numerous hyperparameters includingthe regularization constants the base size and number ofsize scales of the image patches over which the featuresare computed the size and number of textons the cell sizeand number of bins for HOG feature calculation etc Onecould try to determine their optimal values using a geneticalgorithm but due to the size of the parameter space and thetime required for end-to-end simulations we instead opted tochoose some of the parameters based on their typical valuesin the literature

We used a base patch size of 11times 11 and 2 additional sizescales 33 times 33 and 99 times 99 We learned a dictionary of 30black and white textons 5times 5 in size shown in Figure 4 Weused 9 bins for the HOG features and discretized the anglesfor the Radon transform into 15 values

Fig 4 Texton dictionary learned from training data

6

IV EXPERIMENTAL RESULTS

In this section the offline experiments are described indetail These were performed in order to evaluate the perfor-mance of the proposed learning algorithm on existing datasetsand determining optimal values for its hyperparameters Wealso tested our hypothesis that it should be possible to estimatedense depth maps despite learning only on sparse trainingdata by testing on a new indoors stereo dataset with bothsparse and dense depth maps

A Error metrics

To measure the algorithmrsquos accuracy error metrics com-monly found in the literature [9] were employed namelybull The mean logarithmic error 1

N

sum∣∣log dest minus log dgt∣∣

bull The mean relative error 1N

sum∣∣dest minus dgt∣∣ dgt

bull The mean relative squared error 1N

sum(dest minus dgt)

2dgt

bull The root mean squared (RMS) errorradic

1N

sum(dest minus dgt)

2

bull The root mean squared (RMS) logarithmic errorradic1N

sum(log dest minus log dgt)

2

bull The scale invariant error 1N

sum(log dest minus log dgt)

2 minus1n2

(sumlog dest minus log dgt

)2B Standard datasets

As a first step the algorithm was tested on existingdepth datasets collected using active laser scanners namelyMake3D KITTI 2012 and KITTI 2015 For Make3D we usedthe standard division of 400 training and 134 test samplesSince the KITTI datasetsrsquo standard test data consists solely ofcamera images lacking ground truth depth maps we insteadrandomly distributed the standard training data among twosets with 70 of the data being allocated for training and30 for testing

The results we obtained for Make3D are shown qualitativelyin Figure 5 and quantitatively in Table I along with resultsfound in the literature [7 8 11] It can be seen that weobtain slightly worse performance than current state of theart approaches However we do surpass Bipin et alrsquos [5]results using a linear SVM while using a much more efficientlearning algorithm (in our tests training a linear SVM with10 depth classes took around 10 times longer than the CLSalgorithm)

Upon visual inspection of the image samples we observethat our algorithm manages to successfully capture most ofthe global depth variations and even some local details

TABLE I Comparison of results on the Make3D dataset

Make3D Saxena Karsch Liu Bipin Our method

mean log 0430 0292 0251 0985 0493relative abs 0370 0355 0287 0815 0543

relative square - - - - 10717linear RMS - 920 736 - 20116log RMS - - - - 0683

scale-invariant - - - - 0462

For the KITTI datasets we show the results in Figure 6and compare them quantitatively in Tables II and III to results

found in the literature [9 11 32] although we note that to ourknowledge no previous work on monocular depth estimationhas yet shown results on the more recent KITTI 2015 dataset

TABLE II Comparison of results on the KITTI 2012 dataset

KITTI 2012 Saxena Eigen Liu Mancini Our method

mean log - - 0211 - 0372relative abs 0280 0190 0217 - 0525

relative square 3012 1515 - - 2311linear RMS 8734 7156 7046 7508 13093log RMS 0361 0270 - 0524 0590

scale-invariant 0327 0246 - 0196 0347

TABLE III Results on the KITTI 2015 dataset

KITTI 2015 Our method

mean log 0291relative abs 0322

relative square 1921linear RMS 12498log RMS 0452

scale-invariant 0204

We observe that our algorithm performs comparativelybetter on these datasets due to the smaller variations in thedepicted environments when compared to Make3D which isinherently more diverse This once again leads us to concludethat in order to achieve the optimal performance the trainingand test data should be made as similar as possible and in thecontext of robotics this is enabled through the use of SSL

C Stereo dataset

In order to further test our hypotheses we developed a newdataset of images shot in the same environment to simulatethe SSL approach and with both dense and sparse depth mapsto see how the algorithm performed on the dense data whilebeing trained only on the sparse maps

We shot several videos with the ZED around the TU Delft Aerospace faculty. The videos were shot handheld, with some fast rotations and no particular care for stabilization, similar to footage that would be captured by a UAV. The resulting images naturally have imperfections, namely defocus, motion blur, and stretch and shear artifacts from the rolling shutter, which are not reflected in the standard RGBD datasets, but are nevertheless encountered in real-life situations.

The stereo videos were then processed offline with the ZED SDK, using both the STANDARD (structure-conservative, no occlusion filling) and FILL (occlusion filling, edge sharpening, and advanced post-filtering) settings. The provided confidence map was then used on the STANDARD data to filter out low-confidence regions, leading to very sparse depth maps, as shown in Figure 1, similar to depth maps obtained using traditional block-based stereo matching algorithms. The full dataset consists of 12 video sequences, and will be made available for public use in the near future.
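The sketch below illustrates this confidence-based sparsification with plain OpenCV. The threshold value, and the convention that higher confidence denotes a more reliable match, are assumptions made for illustration; they do not reproduce the ZED SDK's exact confidence semantics.

```cpp
#include <limits>
#include <opencv2/core.hpp>

// Mask out low-confidence stereo matches to obtain a sparse depth map.
// depth and confidence are assumed to be CV_32F images of equal size.
// Rejected pixels are set to NaN so they are skipped during training.
cv::Mat sparsifyDepth(const cv::Mat& depth, const cv::Mat& confidence,
                      float minConfidence = 80.0f) {
    cv::Mat sparse = depth.clone();
    cv::Mat rejected = confidence < minConfidence;  // 8-bit rejection mask
    sparse.setTo(std::numeric_limits<float>::quiet_NaN(), rejected);
    return sparse;
}
```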



Fig. 5: Qualitative results on the Make3D dataset. From left to right: the monocular image, the ground truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.

Fig. 6: Qualitative results on the KITTI 2012 and 2015 datasets. From left to right: the monocular image, the ground truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.


Fig. 7: Qualitative results on the stereo ZED dataset. From left to right: the monocular image, the ground truth depth map, the algorithm trained on the dense depth, and the algorithm trained on the sparse depth. The depth scales are the same for every image.


TABLE IV: Comparison of results on the new stereo dataset collected with the ZED, part 1

Video sequence    1 dense  1 sparse  2 dense  2 sparse  3 dense  3 sparse  4 dense  4 sparse  5 dense  5 sparse  6 dense  6 sparse
mean log          0.615    0.746     0.5812   0.609     0.169    0.594     0.572    0.547     0.425    0.533     0.525    0.479
relative abs      1.608    1.064     1.391    0.823     0.288    0.420     0.576    0.557     0.446    0.456     0.956    0.526
relative square   26.799   8.325     17.901   6.369     2.151    3.280     4.225    3.728     2.951    3.312     12.411   3.320
linear RMS        8.311    7.574     7.101    6.897     2.933    7.134     6.488    6.026     5.692    6.331     7.239    5.599
log RMS           0.967    0.923     0.913    0.770     0.359    0.770     0.748    0.716     0.546    0.662     0.822    0.613
scale-invariant   0.931    0.778     0.826    0.524     0.127    0.277     0.440    0.412     0.275    0.247     0.669    0.314

TABLE V: Comparison of results on the new stereo dataset collected with the ZED, part 2

Video sequence    7 dense  7 sparse  8 dense  8 sparse  9 dense  9 sparse  10 dense  10 sparse  11 dense  11 sparse  12 dense  12 sparse
mean log          0.532    0.514     0.654    0.539     0.579    0.470     0.411     0.460      0.640     0.704      0.362     0.494
relative abs      0.732    0.484     0.583    0.770     0.773    0.534     0.576     0.472      0.838     0.609      0.517     0.572
relative square   7.194    3.016     6.480    7.472     15.659   3.082     5.583     3.666      56.000    4.792      51.092    4.367
linear RMS        6.132    5.577     8.383    6.284     14.324   5.319     5.966     6.127      11.435    7.832      30.676    6.095
log RMS           0.789    0.669     1.006    0.770     0.823    0.610     0.681     0.633      0.898     0.835      0.580     0.642
scale-invariant   0.601    0.332     0.825    0.493     0.648    0.338     0.442     0.287      0.729     0.378      0.323     0.347

For our tests, we split the video sequences into two parts, learning on the first 70% and testing on the final 30%, in order to simulate the conditions of a robot undergoing SSL. We tested our depth estimation algorithm under two different conditions: training on the fully processed and occlusion-corrected dense maps, and training directly on the raw sparse outputs from the stereo matching algorithm. The dense depth maps were used as ground truths during testing in both cases. The results obtained are shown in Figure 7 and in Tables IV and V.
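To make the two training conditions concrete, the sketch below gathers (feature, log-depth) training pairs from a single frame. The patch-mean intensity stands in for our full feature pipeline; under this scheme, the only difference between sparse and dense training is how many pixels carry a valid depth target.

```cpp
#include <cmath>
#include <vector>
#include <opencv2/core.hpp>

// Gather (feature, log-depth) pairs from one grayscale frame and its
// depth map. The single patch-mean feature is a stand-in for the real
// filter/texton/HOG/Radon features; pixels with missing depth (NaN
// after confidence filtering) contribute no training pair.
void gatherTrainingPairs(const cv::Mat& gray, const cv::Mat& depth,
                         std::vector<double>& features,
                         std::vector<double>& targets) {
    const int half = 5;  // 11x11 base patches, as in our feature setup
    for (int r = half; r < depth.rows - half; ++r) {
        for (int c = half; c < depth.cols - half; ++c) {
            float d = depth.at<float>(r, c);
            if (!std::isfinite(d) || d <= 0) continue;  // no valid target
            cv::Rect patch(c - half, r - half, 2 * half + 1, 2 * half + 1);
            features.push_back(cv::mean(gray(patch))[0]);
            targets.push_back(std::log(d));  // we learn log depths
        }
    }
}
```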

Observing the results, we see that the performance is in general better than on the existing laser datasets, especially from a qualitative point of view. The algorithm is capable of leveraging the fact that the training data is not particularly diverse and, additionally, very similar to the test data, sidestepping the generalization problems exhibited by traditional supervised learning methodologies.

It can also be observed that, contrary to our expectations, the algorithm trained on the sparse data does not fall short of, and actually surpasses, the algorithm trained on the dense data in many of the video sequences and error metrics. Looking at the estimated depth maps, this is mostly a consequence of the algorithm's failure to correctly extrapolate: the largest contributions to the error metrics come from depth extrapolations that fall significantly outside the maximum range of the camera, and are qualitatively completely incorrect. When the algorithm is trained only on the sparse data, it is exposed to a smaller range of target values, and is consequently much more conservative in its estimates, leading to lower error figures.

Another influencing factor is that, in the sparse, high-confidence regions, there is a stronger correlation between the depth obtained by the stereo camera and the true depth, when compared to the untextured and occluded areas. Since we assume there is a strong correlation between the extracted monocular image features and the true depth, this implies that there is also a stronger correlation between the image features and the stereo camera depths. The simplicity of the machine learning model used means that this strong correlation can be effectively exploited and learned, while the low correlation present in the dense depth data can easily lead the algorithm to learn a wrong model.

From a qualitative point of view, however, both sparse and dense behave similarly. Looking at the sample images, we can see examples where both correctly estimate the relative depths between objects in a scene (rows 1 and 2), examples where sparse is better than dense (rows 3 and 4), and where dense is better than sparse (row 5). Naturally, there are also many cases where the estimated depth maps are mostly incorrect (row 6).

In particular, we see that the algorithm trained only on sparse data also behaves well at estimating the depth in occluded and untextured areas, for instance the white wall in row 3. This leads us to the conclusion that the algorithm's performance is perfectly adequate to serve as a complement to a sparse stereo matching algorithm, correctly filling in the missing depth information.

D. Online Experiments

We have started preliminary work on a framework for onboard SSL applied to monocular depth estimation, using a stereo camera. We have rewritten some of our feature extraction functions in C++, and have tested the setup using a linear least squares regression algorithm, achieving promising results. We intend to further develop the framework so that it fully supports an online version of our learning pipeline, and to integrate it with existing autopilot platforms, such as Paparazzi [33].
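As a minimal sketch of that kind of learner, the class below accumulates ridge-regularized normal equations sample by sample and solves them with a Cholesky decomposition. The regularization constant is an illustrative choice, and this is not the exact onboard implementation.

```cpp
#include <opencv2/core.hpp>

// Online ridge-regularized linear least squares: accumulates the normal
// equations A = sum(x x^T) + lambda*I and b = sum(y x), and re-solves
// for the weights on demand. A is symmetric positive definite by
// construction, so a Cholesky-based solve is applicable.
class OnlineLeastSquares {
public:
    explicit OnlineLeastSquares(int dim, double lambda = 1e-3)
        : A_(cv::Mat::eye(dim, dim, CV_64F) * lambda),
          b_(cv::Mat::zeros(dim, 1, CV_64F)) {}

    // x: dim x 1 feature vector (CV_64F); y: target value, e.g. a log
    // depth provided by the stereo system.
    void update(const cv::Mat& x, double y) {
        A_ += x * x.t();
        b_ += x * y;
    }

    double predict(const cv::Mat& x) const {
        cv::Mat w;
        cv::solve(A_, b_, w, cv::DECOMP_CHOLESKY);
        return w.dot(x);
    }

private:
    cv::Mat A_, b_;
};
```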

V. CONCLUSION

This work focused on the application of SSL to the problem of estimating depth from single monocular images, with the intent of complementing sparse stereo vision algorithms. We have shown that our algorithm exhibits competitive performance on existing RGBD datasets, while being computationally more efficient to train than previous approaches [5]. We also trained the algorithm on a new stereo dataset, and showed that it remains accurate even when trained only on sparse, rather than dense, stereo maps. It can consequently be used to efficiently produce dense depth maps from sparse input. Our preliminary work on its online implementation has revealed promising results, obtaining good performance with a very simple linear least squares algorithm.

In future work, we plan to extend our methodology and further explore the complementarity of the information present in monocular and stereo cues. The use of a learning algorithm such as a Mondrian forest [34], or other ensemble-based methods, would enable the estimation of the uncertainty in its own predictions. A sensor fusion algorithm could then be used to merge information from both the stereo vision system and the monocular depth estimation algorithm, based on their local confidence in the estimated depth. This would lead to an overall more accurate depth map.
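As one possible instance of such a fusion step, the sketch below applies per-pixel inverse-variance weighting, assuming each estimator supplies a variance-like map expressing its local uncertainty; this is an illustrative design rather than a tested component.

```cpp
#include <opencv2/core.hpp>

// Per-pixel inverse-variance fusion of stereo and monocular depth
// estimates. All inputs are assumed CV_32F, with strictly positive
// variance maps; lower variance means higher local confidence.
cv::Mat fuseDepth(const cv::Mat& dStereo, const cv::Mat& varStereo,
                  const cv::Mat& dMono,   const cv::Mat& varMono) {
    cv::Mat wStereo, wMono;
    cv::divide(1.0, varStereo, wStereo);  // weight = 1 / variance
    cv::divide(1.0, varMono, wMono);
    return (dStereo.mul(wStereo) + dMono.mul(wMono)) / (wStereo + wMono);
}
```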

We have avoided the use of optical flow features, since they are expensive to compute and are not usable when estimating depths from isolated images rather than video data. However, future work could explore computationally efficient ways of using optical flow to guarantee the temporal consistency, and consequently increase the accuracy, of the sequence of estimated depth maps.

Current state-of-the-art depth estimation methods [10–12, 32] are all based on deep convolutional neural networks of varying complexities. The advent of massively parallel GPU-based embedded hardware, such as the Jetson TX1 and its eventual successors, means that online training of deep neural networks is close to becoming reality. These models would greatly benefit from the large amounts of training data made possible by the SSL framework, and lead to state-of-the-art depth estimation results onboard micro aerial vehicles.

REFERENCES

[1] R. A. El-laithy, J. Huang, and M. Yeh, "Study on the use of Microsoft Kinect for robotics applications," in Position Location and Navigation Symposium (PLANS), 2012 IEEE/ION. IEEE, 2012, pp. 1280–1288.
[2] M. Draelos, Q. Qiu, A. Bronstein, and G. Sapiro, "Intel RealSense = real low cost gaze," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2520–2524.
[3] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4982–4987.
[4] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, "Self-supervised monocular road detection in desert terrain," in Robotics: Science and Systems, vol. 38, Philadelphia, 2006.
[5] K. Bipin, V. Duggal, and K. M. Krishna, "Autonomous navigation of generic monocular quadcopter in natural environment," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1063–1070.
[6] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2005, pp. 1161–1168.
[7] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.
[8] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," pp. 1–14, 2013.
[9] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," NIPS, pp. 1–9, 2014. [Online]. Available: http://arxiv.org/abs/1406.2283
[10] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
[11] F. Liu, C. Shen, G. Lin, and I. D. Reid, "Learning depth from single monocular images using deep convolutional neural fields," PAMI, p. 15, 2015. [Online]. Available: http://arxiv.org/abs/1502.07411
[12] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-image depth perception in the wild," arXiv preprint arXiv:1604.03901, 2016.
[13] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, "Learning ordinal relationships for mid-level vision," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 388–396.
[14] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight," in Field and Service Robotics. Springer, 2016, pp. 391–409.
[15] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, "Least squares revisited: Scalable approaches for multi-class prediction," arXiv preprint arXiv:1310.1949, 2013. Software available at https://github.com/n17s/secondorderdemos
[16] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772.
[17] H. Hu, A. Grubb, J. A. Bagnell, and M. Hebert, "Efficient feature group sequencing for anytime linear prediction," arXiv preprint arXiv:1409.5495, 2014.
[18] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.
[19] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.
[20] H. W. Ho, C. De Wagter, B. D. W. Remes, and G. C. H. E. de Croon, "Optical-flow based self-supervised learning of obstacle appearance applied to MAV landing," no. IROS 15, pp. 1–10, 2015.
[21] K. van Hecke, G. de Croon, L. van der Maaten, D. Hennes, and D. Izzo, "Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance," arXiv preprint arXiv:1603.08047, 2016.
[22] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 593–600.
[23] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities. Academic Press, 2012.
[24] R. Nevatia and K. R. Babu, "Linear feature extraction and description," Computer Graphics and Image Processing, vol. 13, no. 3, pp. 257–269, 1980.
[25] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" CVPR, vol. 2, pp. II-691–8, 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211534
[26] G. De Croon, E. De Weerdt, C. De Wagter, B. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.
[27] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[29] S. R. Deans, The Radon Transform and Some of Its Applications. Courier Corporation, 2007.
[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear
[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[32] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, "Fast robust monocular depth estimation for obstacle detection with fully convolutional networks," arXiv preprint arXiv:1607.06349, 2016.
[33] G. Hattenberger, M. Bronz, and M. Gorraz, "Using the Paparazzi UAV system for scientific research," in IMAV 2014, International Micro Air Vehicle Conference and Competition, 2014.
[34] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, "Mondrian forests for large-scale regression when uncertainty matters," arXiv preprint arXiv:1506.03805, 2015.




to learn a wrong modelFrom a qualitative point of view however both sparse and

dense behave similarly Looking at the sample images we cansee examples where both correctly estimate the relative depthsbetween objects in a scene (rows 1 and 2) examples wheresparse is better than dense (rows 3 and 4) and where dense isbetter than sparse (row 5) Naturally there are also many caseswhere the estimated depth maps are mostly incorrect (row 6)

In particular we see that the algorithm trained only onsparse data also behaves well at estimating the depth inoccluded and untextured areas for instance the white wallin row 3 This leads us to the conclusion that the algorithmrsquosperformance is perfectly adequate to serve as complementaryto a sparse stereo matching algorithm correctly filling in themissing depth information

D Online Experiments

Wersquove started preliminary work on a framework for onboardSSL applied to monocular depth estimation using a stereocamera Wersquove rewritten some of our feature extraction func-tions in C++ and have tested the setup using a linear leastsquares regression algorithm achieving promising results Weintend to further develop the framework in order to fully sup-port an online version of our learning pipeline and integrateit with existing autopilot platforms such as Paparazzi [33]

V CONCLUSION

This work focused on the application of SSL to the problemof estimating depth from single monocular images with theintent of complementing sparse stereo vision algorithms Wehave shown that our algorithm exhibits competitive perfor-mance on existing RGBD datasets while being computation-ally more efficient to train than previous approaches [5] Wealso train the algorithm on a new stereo dataset and showthat it remains accurate even when trained only on sparserather than dense stereo maps It can consequently be used toefficiently produce dense depth maps from sparse input Ourpreliminary work on its online implementation has revealed

10

promising results obtaining good performance with a verysimple linear least squares algorithm

In future work we plan to extend our methodology andfurther explore the complementarity of the information presentin monocular and stereo cues The use of a learning algorithmsuch as a Mondrian forest [34] or other ensemble-basedmethods would enable the estimation of the uncertainty inits own predictions A sensor fusion algorithm can then beused to merge information from both the stereo vision systemand the monocular depth estimation algorithm based on theirlocal confidence in the estimated depth This would lead to anoverall more accurate depth map

We have avoided the use of optical flow features sincetheyrsquore expensive to compute and are not usable when es-timating depths from isolated images rather than video dataHowever future work could explore computationally efficientways of using optical flow to guarantee the temporal consis-tency and consequently increase the accuracy of the sequenceof estimated depth maps

Current state of the art depth estimation methods [10ndash12 32] are all based on deep convolutional neural networks ofvarying complexities The advent of massively parallel GPU-based embedded hardware such as the Jetson TX1 and itseventual successors means that online training of deep neuralnetworks is close to becoming reality These models wouldgreatly benefit from the large amounts of training data madepossible by the SSL framework and lead to state of the artdepth estimation results onboard micro aerial vehicles

REFERENCES

[1] R A El-laithy J Huang and M Yeh ldquoStudy on the use of microsoftkinect for robotics applicationsrdquo in Position Location and NavigationSymposium (PLANS) 2012 IEEEION IEEE 2012 pp 1280ndash1288

[2] M Draelos Q Qiu A Bronstein and G Sapiro ldquoIntel realsense= reallow cost gazerdquo in Image Processing (ICIP) 2015 IEEE InternationalConference on IEEE 2015 pp 2520ndash2524

[3] C De Wagter S Tijmons B D Remes and G C de CroonldquoAutonomous flight of a 20-gram flapping wing mav with a 4-gramonboard stereo vision systemrdquo in 2014 IEEE International Conferenceon Robotics and Automation (ICRA) IEEE 2014 pp 4982ndash4987

[4] H Dahlkamp A Kaehler D Stavens S Thrun and G R Brad-ski ldquoSelf-supervised monocular road detection in desert terrainrdquo inRobotics science and systems vol 38 Philadelphia 2006

[5] K Bipin V Duggal and K M Krishna ldquoAutonomous navigation ofgeneric monocular quadcopter in natural environmentrdquo in 2015 IEEEInternational Conference on Robotics and Automation (ICRA) IEEE2015 pp 1063ndash1070

[6] A Saxena S H Chung and A Y Ng ldquoLearning depth from singlemonocular imagesrdquo in Advances in Neural Information ProcessingSystems 2005 pp 1161ndash1168

[7] A Saxena M Sun and A Y Ng ldquoMake3d Learning 3d scene structurefrom a single still imagerdquo IEEE transactions on pattern analysis andmachine intelligence vol 31 no 5 pp 824ndash840 2009

[8] K Karsch C Liu S Bing and K Eccv ldquoDepth Extraction from VideoUsing Non-parametric Sampling Problem Motivation amp Backgroundrdquono Sec 5 pp 1ndash14 2013

[9] D Eigen C Puhrsch and R Fergus ldquoDepth map prediction from asingle image using a multi-scale deep networkrdquo Nips pp 1ndash9 2014[Online] Available httparxivorgabs14062283

[10] D Eigen and R Fergus ldquoPredicting depth surface normals and se-mantic labels with a common multi-scale convolutional architecturerdquo inProceedings of the IEEE International Conference on Computer Vision2015 pp 2650ndash2658

[11] F Liu C Shen G Lin and I D Reid ldquoLearning Depth from SingleMonocular Images Using Deep Convolutional Neural Fieldsrdquo Pamip 15 2015 [Online] Available httparxivorgabs15027411

[12] W Chen Z Fu D Yang and J Deng ldquoSingle-image depth perceptionin the wildrdquo arXiv preprint arXiv160403901 2016

[13] D Zoran P Isola D Krishnan and W T Freeman ldquoLearning ordinalrelationships for mid-level visionrdquo in Proceedings of the IEEE Interna-tional Conference on Computer Vision 2015 pp 388ndash396

[14] D Dey K S Shankar S Zeng R Mehta M T Agcayazi C EriksenS Daftry M Hebert and J A Bagnell ldquoVision and learning fordeliberative monocular cluttered flightrdquo in Field and Service RoboticsSpringer 2016 pp 391ndash409

[15] A Agarwal S M Kakade N Karampatziakis L Song and G ValiantldquoLeast squares revisited Scalable approaches for multi-class predictionrdquoarXiv preprint arXiv13101949 2013 software available at httpsgithubcomn17ssecondorderdemos

[16] S Ross N Melik-Barkhudarov K S Shankar A Wendel D DeyJ A Bagnell and M Hebert ldquoLearning monocular reactive uav controlin cluttered natural environmentsrdquo in Robotics and Automation (ICRA)2013 IEEE International Conference on IEEE 2013 pp 1765ndash1772

[17] H Hu A Grubb J A Bagnell and M Hebert ldquoEfficient fea-ture group sequencing for anytime linear predictionrdquo arXiv preprintarXiv14095495 2014

[18] S Thrun M Montemerlo H Dahlkamp D Stavens A Aron J DiebelP Fong J Gale M Halpenny G Hoffmann et al ldquoStanley The robotthat won the darpa grand challengerdquo Journal of field Robotics vol 23no 9 pp 661ndash692 2006

[19] R Hadsell P Sermanet J Ben A Erkan M Scoffier K KavukcuogluU Muller and Y LeCun ldquoLearning long-range vision for autonomousoff-road drivingrdquo Journal of Field Robotics vol 26 no 2 pp 120ndash1442009

[20] H W Ho C De Wagter B D W Remes and G C H E de CroonldquoOptical-Flow based Self-Supervised Learning of Obstacle Appearanceapplied to MAV Landingrdquo no Iros 15 pp 1ndash10 2015

[21] K van Hecke G de Croon L van der Maaten D Hennes and D IzzoldquoPersistent self-supervised learning principle from stereo to monocularvision for obstacle avoidancerdquo arXiv preprint arXiv160308047 2016

[22] J Michels A Saxena and A Y Ng ldquoHigh speed obstacle avoidanceusing monocular vision and reinforcement learningrdquo in Proceedings ofthe 22nd international conference on Machine learning ACM 2005pp 593ndash600

[23] E R Davies Computer and machine vision theory algorithms prac-ticalities Academic Press 2012

[24] R Nevatia and K R Babu ldquoLinear feature extraction and descriptionrdquoComputer Graphics and Image Processing vol 13 no 3 pp 257ndash2691980

[25] M Varma and A Zisserman ldquoTexture classification arefilter banks necessaryrdquo Cvpr vol 2 pp IIndash691ndash8 vol2 2003 [Online] Available httpieeexploreieeeorgxplsabs alljsparnumber=1211534$delimiterrdquo026E30F$npapers2publicationdoi101109CVPR20031211534

[26] G De Croon E De Weerdt C De Wagter B Remes and R RuijsinkldquoThe appearance variation cue for obstacle avoidancerdquo IEEE Transac-tions on Robotics vol 28 no 2 pp 529ndash534 2012

[27] T Kohonen ldquoThe self-organizing maprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[28] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[29] S R Deans The Radon transform and some of its applications CourierCorporation 2007

[30] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLiblinear A library for large linear classificationrdquo Journal of machinelearning research vol 9 no Aug pp 1871ndash1874 2008 softwareavailable at httpwwwcsientuedutwsimcjlinliblinear

[31] C-C Chang and C-J Lin ldquoLIBSVM A library for support vectormachinesrdquo ACM Transactions on Intelligent Systems and Technologyvol 2 pp 271ndash2727 2011 software available at httpwwwcsientuedutwsimcjlinlibsvm

[32] M Mancini G Costante P Valigi and T A Ciarfuglia ldquoFast robustmonocular depth estimation for obstacle detection with fully convolu-tional networksrdquo arXiv preprint arXiv160706349 2016

[33] G Hattenberger M Bronz and M Gorraz ldquoUsing the paparazzi uavsystem for scientific researchrdquo in IMAV 2014 International Micro AirVehicle Conference and Competition 2014 2014 pp ppndash247

[34] B Lakshminarayanan D M Roy and Y W Teh ldquoMondrian forestsfor large-scale regression when uncertainty mattersrdquo arXiv preprintarXiv150603805 2015

  • Introduction
  • Related Work
    • Monocular depth estimation
    • Monocular depth estimation for robot navigation
    • Self-supervised learning
      • Methodology Overview
        • Learning setup
        • Features
          • Filter-based features
          • Texton-based features
          • Histogram of Oriented Gradients
          • Radon transform
            • Learning algorithm
            • Hyperparameters
              • Experimental Results
                • Error metrics
                • Standard datasets
                • Stereo dataset
                • Online Experiments
                  • Conclusion
                  • References
Page 6: Learning Depth from Single Monocular Images Using Stereo ... · obstacle avoidance and navigation, to localization and envi- ... data collected using a Kinect sensor. Dey et al. [14]

6

IV. EXPERIMENTAL RESULTS

In this section, the offline experiments are described in detail. These were performed in order to evaluate the performance of the proposed learning algorithm on existing datasets, and to determine optimal values for its hyperparameters. We also tested our hypothesis that it should be possible to estimate dense depth maps despite learning only on sparse training data, by testing on a new indoor stereo dataset with both sparse and dense depth maps.

A. Error metrics

To measure the algorithm's accuracy, error metrics commonly found in the literature [9] were employed, namely:

• The mean logarithmic error: $\frac{1}{N} \sum \left| \log d_{est} - \log d_{gt} \right|$
• The mean relative error: $\frac{1}{N} \sum \left| d_{est} - d_{gt} \right| / d_{gt}$
• The mean relative squared error: $\frac{1}{N} \sum (d_{est} - d_{gt})^2 / d_{gt}$
• The root mean squared (RMS) error: $\sqrt{\frac{1}{N} \sum (d_{est} - d_{gt})^2}$
• The root mean squared (RMS) logarithmic error: $\sqrt{\frac{1}{N} \sum (\log d_{est} - \log d_{gt})^2}$
• The scale-invariant error: $\frac{1}{N} \sum (\log d_{est} - \log d_{gt})^2 - \frac{1}{N^2} \left( \sum (\log d_{est} - \log d_{gt}) \right)^2$
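These metrics are straightforward to compute directly; the following is a minimal NumPy sketch (our illustration, not the paper's code), where d_est and d_gt are assumed to be flattened arrays of strictly positive estimated and ground-truth depths:

```python
import numpy as np

def depth_error_metrics(d_est, d_gt):
    """Compute the six error metrics over flattened, strictly positive depth arrays."""
    log_diff = np.log(d_est) - np.log(d_gt)
    n = d_est.size
    return {
        "mean log":        np.mean(np.abs(log_diff)),
        "relative abs":    np.mean(np.abs(d_est - d_gt) / d_gt),
        "relative square": np.mean((d_est - d_gt) ** 2 / d_gt),
        "linear RMS":      np.sqrt(np.mean((d_est - d_gt) ** 2)),
        "log RMS":         np.sqrt(np.mean(log_diff ** 2)),
        # scale-invariant error: second term discounts a global scale offset
        "scale-invariant": np.mean(log_diff ** 2) - (np.sum(log_diff) ** 2) / n ** 2,
    }
```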

B. Standard datasets

As a first step, the algorithm was tested on existing depth datasets collected using active laser scanners, namely Make3D, KITTI 2012, and KITTI 2015. For Make3D, we used the standard division of 400 training and 134 test samples. Since the KITTI datasets' standard test data consists solely of camera images, lacking ground truth depth maps, we instead randomly distributed the standard training data among two sets, with 70% of the data being allocated for training and 30% for testing.
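The random 70/30 split can be sketched as follows (a minimal illustration, not the paper's code; the fixed seed is our addition, for reproducibility):

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=0):
    """Randomly split a list of (image, depth) samples into train and test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]              # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```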

The results we obtained for Make3D are shown qualitatively in Figure 5, and quantitatively in Table I, along with results found in the literature [7, 8, 11]. It can be seen that we obtain slightly worse performance than current state-of-the-art approaches. However, we do surpass Bipin et al.'s [5] results using a linear SVM, while using a much more efficient learning algorithm (in our tests, training a linear SVM with 10 depth classes took around 10 times longer than the CLS algorithm).

Upon visual inspection of the image samples, we observe that our algorithm manages to successfully capture most of the global depth variations, and even some local details.

TABLE I: Comparison of results on the Make3D dataset

Make3D            Saxena   Karsch   Liu     Bipin   Our method
mean log          0.430    0.292    0.251   0.985   0.493
relative abs      0.370    0.355    0.287   0.815   0.543
relative square   -        -        -       -       10.717
linear RMS        -        9.20     7.36    -       20.116
log RMS           -        -        -       -       0.683
scale-invariant   -        -        -       -       0.462

For the KITTI datasets, we show the results in Figure 6, and compare them quantitatively in Tables II and III to results found in the literature [9, 11, 32], although we note that, to our knowledge, no previous work on monocular depth estimation has yet shown results on the more recent KITTI 2015 dataset.

TABLE II: Comparison of results on the KITTI 2012 dataset

KITTI 2012        Saxena   Eigen    Liu     Mancini  Our method
mean log          -        -        0.211   -        0.372
relative abs      0.280    0.190    0.217   -        0.525
relative square   3.012    1.515    -       -        2.311
linear RMS        8.734    7.156    7.046   7.508    13.093
log RMS           0.361    0.270    -       0.524    0.590
scale-invariant   0.327    0.246    -       0.196    0.347

TABLE III: Results on the KITTI 2015 dataset

KITTI 2015        Our method
mean log          0.291
relative abs      0.322
relative square   1.921
linear RMS        12.498
log RMS           0.452
scale-invariant   0.204

We observe that our algorithm performs comparatively better on these datasets, due to the smaller variations in the depicted environments when compared to Make3D, which is inherently more diverse. This once again leads us to conclude that, in order to achieve optimal performance, the training and test data should be made as similar as possible, and in the context of robotics this is enabled through the use of SSL.

C. Stereo dataset

In order to further test our hypotheses, we developed a new dataset of images shot in the same environment, simulating the SSL approach, with both dense and sparse depth maps, so as to see how the algorithm performs on the dense data while being trained only on the sparse maps.

We shot several videos with the ZED around the TU Delft Aerospace faculty. The camera was handheld, with some fast rotations and no particular care for stabilization, similar to footage that would be captured by a UAV. The resulting images naturally have imperfections, namely defocus, motion blur, and stretch and shear artifacts from the rolling shutter, which are not reflected in the standard RGBD datasets, but are nevertheless encountered in real-life situations.

The stereo videos were then processed offline with the ZED SDK, using both the STANDARD (structure conservative, no occlusion filling) and FILL (occlusion filling, edge sharpening and advanced post-filtering) settings. The provided confidence map was then used on the STANDARD data to filter out low-confidence regions, leading to very sparse depth maps, as shown in Figure 1, similar to depth maps obtained using traditional block-based stereo matching algorithms. The full dataset consists of 12 video sequences, and will be made available for public use in the near future.
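The sparsification step can be sketched as below, assuming the STANDARD depth map and its confidence map have been exported as arrays; the threshold value and the convention that larger values mean higher confidence are our assumptions (the ZED SDK's own convention may differ):

```python
import numpy as np

def sparsify_depth(depth, confidence, conf_threshold=80.0):
    """Invalidate (set to NaN) depth pixels whose stereo confidence is too low."""
    sparse = depth.copy()
    sparse[confidence < conf_threshold] = np.nan   # low-confidence pixels become holes
    return sparse
```

Training then simply skips the NaN pixels, so only high-confidence stereo depths act as supervisory targets.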

For our tests, we split the video sequences into two parts, learning on the first 70% and testing on the final 30%, in order to simulate the conditions of a robot undergoing SSL.


Fig. 5: Qualitative results on the Make3D dataset. From left to right: the monocular image, the ground truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.

Fig. 6: Qualitative results on the KITTI 2012 and 2015 datasets. From left to right: the monocular image, the ground truth depth map, and the depth estimated by our algorithm. The depth scales are the same for every image.


Fig. 7: Qualitative results on the stereo ZED dataset. From left to right: the monocular image, the ground truth depth map, the algorithm trained on the dense depth, and the algorithm trained on the sparse depth. The depth scales are the same for every image.


TABLE IV: Comparison of results on the new stereo dataset collected with the ZED, part 1

Video sequence    1 dense  1 sparse  2 dense  2 sparse  3 dense  3 sparse  4 dense  4 sparse  5 dense  5 sparse  6 dense  6 sparse
mean log          0.615    0.746     0.5812   0.609     0.169    0.594     0.572    0.547     0.425    0.533     0.525    0.479
relative abs      1.608    1.064     1.391    0.823     0.288    0.420     0.576    0.557     0.446    0.456     0.956    0.526
relative square   26.799   8.325     17.901   6.369     2.151    3.280     4.225    3.728     2.951    3.312     12.411   3.320
linear RMS        8.311    7.574     7.101    6.897     2.933    7.134     6.488    6.026     5.692    6.331     7.239    5.599
log RMS           0.967    0.923     0.913    0.770     0.359    0.770     0.748    0.716     0.546    0.662     0.822    0.613
scale-invariant   0.931    0.778     0.826    0.524     0.127    0.277     0.440    0.412     0.275    0.247     0.669    0.314

TABLE V: Comparison of results on the new stereo dataset collected with the ZED, part 2

Video sequence    7 dense  7 sparse  8 dense  8 sparse  9 dense  9 sparse  10 dense  10 sparse  11 dense  11 sparse  12 dense  12 sparse
mean log          0.532    0.514     0.654    0.539     0.579    0.470     0.411     0.460      0.640     0.704      0.362     0.494
relative abs      0.732    0.484     0.583    0.770     0.773    0.534     0.576     0.472      0.838     0.609      0.517     0.572
relative square   7.194    3.016     6.480    7.472     15.659   3.082     5.583     3.666      56.000    4.792      51.092    4.367
linear RMS        6.132    5.577     8.383    6.284     14.324   5.319     5.966     6.127      11.435    7.832      30.676    6.095
log RMS           0.789    0.669     1.006    0.770     0.823    0.610     0.681     0.633      0.898     0.835      0.580     0.642
scale-invariant   0.601    0.332     0.825    0.493     0.648    0.338     0.442     0.287      0.729     0.378      0.323     0.347

We tested our depth estimation algorithm under two different conditions: training on the fully processed and occlusion-corrected dense maps, and training directly on the raw sparse outputs from the stereo matching algorithm. The dense depth maps were used as ground truth during testing in both cases. The results obtained are shown in Figure 7, and in Tables IV and V.

Observing the results, we see that the performance is in general better than on the existing laser datasets, especially from a qualitative point of view. The algorithm is capable of leveraging the fact that the data used for training is not particularly diverse, and additionally very similar to the test data, thereby sidestepping the generalization problems exhibited by traditional supervised learning methodologies.

It can also be observed that, contrary to our expectations, the algorithm trained on the sparse data doesn't fall short, and actually surpasses the algorithm trained on the dense data in many of the video sequences and error metrics. Looking at the estimated depth maps, this is mostly a consequence of the algorithm's failure to correctly extrapolate. The largest contributions to the error metrics come from the depth extrapolations that fall significantly outside the maximum range of the camera, and are qualitatively completely incorrect. When the algorithm is trained only on the sparse data, it is exposed to a smaller range of target values, and is consequently much more conservative in its estimates, leading to lower error figures.
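To make this concrete: the worst of these outliers could in principle be removed by clamping predictions to the camera's measurable range, a post-processing step that was not applied in these experiments (the range limits below are placeholders):

```python
import numpy as np

def clamp_depth(d_est, d_min=0.5, d_max=20.0):
    """Clamp estimated depths to the stereo camera's measurable range (placeholder limits)."""
    return np.clip(d_est, d_min, d_max)
```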

Another influencing factor is that, in the sparse high-confidence regions, there is a stronger correlation between the depth obtained by the stereo camera and the true depth, when compared to the untextured and occluded areas. Since we assume there is a strong correlation between the extracted monocular image features and the true depth, this implies that there is also a stronger correlation between the image features and the stereo camera depths. The simplicity of the machine learning model used means that this strong correlation can be effectively exploited and learned, while the low correlation present in the dense depth data can easily lead the algorithm to learn a wrong model.

From a qualitative point of view, however, both sparse and dense behave similarly. Looking at the sample images, we can see examples where both correctly estimate the relative depths between objects in a scene (rows 1 and 2), examples where sparse is better than dense (rows 3 and 4), and where dense is better than sparse (row 5). Naturally, there are also many cases where the estimated depth maps are mostly incorrect (row 6).

In particular, we see that the algorithm trained only on sparse data also behaves well at estimating the depth in occluded and untextured areas, for instance the white wall in row 3. This leads us to the conclusion that the algorithm's performance is perfectly adequate to serve as a complement to a sparse stereo matching algorithm, correctly filling in the missing depth information.

D. Online Experiments

We've started preliminary work on a framework for onboard SSL, applied to monocular depth estimation using a stereo camera. We've rewritten some of our feature extraction functions in C++, and have tested the setup using a linear least squares regression algorithm, achieving promising results. We intend to further develop the framework in order to fully support an online version of our learning pipeline, and to integrate it with existing autopilot platforms, such as Paparazzi [33].
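A minimal sketch of such an online linear least-squares learner (our illustration, not the actual onboard code) accumulates the normal equations, so that each new stereo-supervised sample refines the weights:

```python
import numpy as np

class OnlineLeastSquares:
    """Incrementally solve min_w ||X w - y||^2 by accumulating the normal equations."""
    def __init__(self, n_features, reg=1e-3):
        self.A = reg * np.eye(n_features)   # regularized Gram matrix, X^T X
        self.b = np.zeros(n_features)       # X^T y
        self.w = np.zeros(n_features)

    def update(self, x, y):
        """Fold in one (feature vector, stereo depth) training pair."""
        self.A += np.outer(x, x)
        self.b += x * y
        # re-solving every update is O(n^3); fine for a sketch, a recursive
        # least-squares update would be cheaper onboard
        self.w = np.linalg.solve(self.A, self.b)

    def predict(self, x):
        return x @ self.w
```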

V. CONCLUSION

This work focused on the application of SSL to the problem of estimating depth from single monocular images, with the intent of complementing sparse stereo vision algorithms. We have shown that our algorithm exhibits competitive performance on existing RGBD datasets, while being computationally more efficient to train than previous approaches [5]. We also trained the algorithm on a new stereo dataset, and showed that it remains accurate even when trained only on sparse, rather than dense, stereo maps. It can consequently be used to efficiently produce dense depth maps from sparse input. Our preliminary work on its online implementation has revealed promising results, obtaining good performance with a very simple linear least squares algorithm.

In future work, we plan to extend our methodology and further explore the complementarity of the information present in monocular and stereo cues. The use of a learning algorithm such as a Mondrian forest [34], or other ensemble-based methods, would enable the estimation of the uncertainty in its own predictions. A sensor fusion algorithm could then be used to merge information from both the stereo vision system and the monocular depth estimation algorithm, based on their local confidence in the estimated depth. This would lead to an overall more accurate depth map.
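For instance, if both systems provided per-pixel uncertainty estimates, a standard inverse-variance fusion, sketched below under that assumption, would weight each source by its local confidence:

```python
import numpy as np

def fuse_depths(d_stereo, var_stereo, d_mono, var_mono):
    """Per-pixel inverse-variance weighted fusion of two depth maps."""
    w_s = 1.0 / var_stereo   # stereo weight: large where stereo is confident
    w_m = 1.0 / var_mono     # monocular weight: large where the learner is confident
    return (w_s * d_stereo + w_m * d_mono) / (w_s + w_m)
```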

We have avoided the use of optical flow features, since they're expensive to compute, and are not usable when estimating depths from isolated images rather than video data. However, future work could explore computationally efficient ways of using optical flow to guarantee the temporal consistency, and consequently increase the accuracy, of the sequence of estimated depth maps.

Current state-of-the-art depth estimation methods [10-12, 32] are all based on deep convolutional neural networks of varying complexities. The advent of massively parallel GPU-based embedded hardware, such as the Jetson TX1 and its eventual successors, means that online training of deep neural networks is close to becoming a reality. These models would greatly benefit from the large amounts of training data made possible by the SSL framework, and could lead to state-of-the-art depth estimation results onboard micro aerial vehicles.

REFERENCES

[1] R. A. El-laithy, J. Huang, and M. Yeh, "Study on the use of Microsoft Kinect for robotics applications," in Position Location and Navigation Symposium (PLANS), 2012 IEEE/ION. IEEE, 2012, pp. 1280-1288.
[2] M. Draelos, Q. Qiu, A. Bronstein, and G. Sapiro, "Intel RealSense = real low cost gaze," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2520-2524.
[3] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4982-4987.
[4] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, "Self-supervised monocular road detection in desert terrain," in Robotics: Science and Systems, vol. 38, Philadelphia, 2006.
[5] K. Bipin, V. Duggal, and K. M. Krishna, "Autonomous navigation of generic monocular quadcopter in natural environment," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1063-1070.
[6] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2005, pp. 1161-1168.
[7] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, 2009.
[8] K. Karsch, C. Liu, S. Bing, and K. Eccv, "Depth extraction from video using non-parametric sampling: Problem motivation & background," no. Sec. 5, pp. 1-14, 2013.
[9] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," NIPS, pp. 1-9, 2014. [Online]. Available: http://arxiv.org/abs/1406.2283
[10] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650-2658.
[11] F. Liu, C. Shen, G. Lin, and I. D. Reid, "Learning depth from single monocular images using deep convolutional neural fields," PAMI, p. 15, 2015. [Online]. Available: http://arxiv.org/abs/1502.07411
[12] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-image depth perception in the wild," arXiv preprint arXiv:1604.03901, 2016.
[13] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, "Learning ordinal relationships for mid-level vision," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 388-396.
[14] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight," in Field and Service Robotics. Springer, 2016, pp. 391-409.
[15] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, "Least squares revisited: Scalable approaches for multi-class prediction," arXiv preprint arXiv:1310.1949, 2013. Software available at https://github.com/n17s/secondorderdemos
[16] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1765-1772.
[17] H. Hu, A. Grubb, J. A. Bagnell, and M. Hebert, "Efficient feature group sequencing for anytime linear prediction," arXiv preprint arXiv:1409.5495, 2014.
[18] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661-692, 2006.
[19] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120-144, 2009.
[20] H. W. Ho, C. De Wagter, B. D. W. Remes, and G. C. H. E. de Croon, "Optical-flow based self-supervised learning of obstacle appearance applied to MAV landing," no. IROS 15, pp. 1-10, 2015.
[21] K. van Hecke, G. de Croon, L. van der Maaten, D. Hennes, and D. Izzo, "Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance," arXiv preprint arXiv:1603.08047, 2016.
[22] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 593-600.
[23] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities. Academic Press, 2012.
[24] R. Nevatia and K. R. Babu, "Linear feature extraction and description," Computer Graphics and Image Processing, vol. 13, no. 3, pp. 257-269, 1980.
[25] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" CVPR, vol. 2, pp. II-691-8, 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211534
[26] G. De Croon, E. De Weerdt, C. De Wagter, B. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529-534, 2012.
[27] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886-893.
[29] S. R. Deans, The Radon Transform and Some of Its Applications. Courier Corporation, 2007.
[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, no. Aug, pp. 1871-1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear
[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[32] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, "Fast robust monocular depth estimation for obstacle detection with fully convolutional networks," arXiv preprint arXiv:1607.06349, 2016.
[33] G. Hattenberger, M. Bronz, and M. Gorraz, "Using the Paparazzi UAV system for scientific research," in IMAV 2014: International Micro Air Vehicle Conference and Competition 2014, 2014.
[34] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, "Mondrian forests for large-scale regression when uncertainty matters," arXiv preprint arXiv:1506.03805, 2015.


[19] R Hadsell P Sermanet J Ben A Erkan M Scoffier K KavukcuogluU Muller and Y LeCun ldquoLearning long-range vision for autonomousoff-road drivingrdquo Journal of Field Robotics vol 26 no 2 pp 120ndash1442009

[20] H W Ho C De Wagter B D W Remes and G C H E de CroonldquoOptical-Flow based Self-Supervised Learning of Obstacle Appearanceapplied to MAV Landingrdquo no Iros 15 pp 1ndash10 2015

[21] K van Hecke G de Croon L van der Maaten D Hennes and D IzzoldquoPersistent self-supervised learning principle from stereo to monocularvision for obstacle avoidancerdquo arXiv preprint arXiv160308047 2016

[22] J Michels A Saxena and A Y Ng ldquoHigh speed obstacle avoidanceusing monocular vision and reinforcement learningrdquo in Proceedings ofthe 22nd international conference on Machine learning ACM 2005pp 593ndash600

[23] E R Davies Computer and machine vision theory algorithms prac-ticalities Academic Press 2012

[24] R Nevatia and K R Babu ldquoLinear feature extraction and descriptionrdquoComputer Graphics and Image Processing vol 13 no 3 pp 257ndash2691980

[25] M Varma and A Zisserman ldquoTexture classification arefilter banks necessaryrdquo Cvpr vol 2 pp IIndash691ndash8 vol2 2003 [Online] Available httpieeexploreieeeorgxplsabs alljsparnumber=1211534$delimiterrdquo026E30F$npapers2publicationdoi101109CVPR20031211534

[26] G De Croon E De Weerdt C De Wagter B Remes and R RuijsinkldquoThe appearance variation cue for obstacle avoidancerdquo IEEE Transac-tions on Robotics vol 28 no 2 pp 529ndash534 2012

[27] T Kohonen ldquoThe self-organizing maprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[28] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[29] S R Deans The Radon transform and some of its applications CourierCorporation 2007

[30] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLiblinear A library for large linear classificationrdquo Journal of machinelearning research vol 9 no Aug pp 1871ndash1874 2008 softwareavailable at httpwwwcsientuedutwsimcjlinliblinear

[31] C-C Chang and C-J Lin ldquoLIBSVM A library for support vectormachinesrdquo ACM Transactions on Intelligent Systems and Technologyvol 2 pp 271ndash2727 2011 software available at httpwwwcsientuedutwsimcjlinlibsvm

[32] M Mancini G Costante P Valigi and T A Ciarfuglia ldquoFast robustmonocular depth estimation for obstacle detection with fully convolu-tional networksrdquo arXiv preprint arXiv160706349 2016

[33] G Hattenberger M Bronz and M Gorraz ldquoUsing the paparazzi uavsystem for scientific researchrdquo in IMAV 2014 International Micro AirVehicle Conference and Competition 2014 2014 pp ppndash247

[34] B Lakshminarayanan D M Roy and Y W Teh ldquoMondrian forestsfor large-scale regression when uncertainty mattersrdquo arXiv preprintarXiv150603805 2015

  • Introduction
  • Related Work
    • Monocular depth estimation
    • Monocular depth estimation for robot navigation
    • Self-supervised learning
      • Methodology Overview
        • Learning setup
        • Features
          • Filter-based features
          • Texton-based features
          • Histogram of Oriented Gradients
          • Radon transform
            • Learning algorithm
            • Hyperparameters
              • Experimental Results
                • Error metrics
                • Standard datasets
                • Stereo dataset
                • Online Experiments
                  • Conclusion
                  • References
Page 9: Learning Depth from Single Monocular Images Using Stereo ... · obstacle avoidance and navigation, to localization and envi- ... data collected using a Kinect sensor. Dey et al. [14]

9

TABLE IV: Comparison of results on the new stereo dataset collected with the ZED (part 1).

Video sequence    1 dense  1 sparse  2 dense  2 sparse  3 dense  3 sparse  4 dense  4 sparse  5 dense  5 sparse  6 dense  6 sparse
mean log           0.615    0.746     0.581    0.609     0.169    0.594     0.572    0.547     0.425    0.533     0.525    0.479
relative abs       1.608    1.064     1.391    0.823     0.288    0.420     0.576    0.557     0.446    0.456     0.956    0.526
relative square   26.799    8.325    17.901    6.369     2.151    3.280     4.225    3.728     2.951    3.312    12.411    3.320
linear RMS         8.311    7.574     7.101    6.897     2.933    7.134     6.488    6.026     5.692    6.331     7.239    5.599
log RMS            0.967    0.923     0.913    0.770     0.359    0.770     0.748    0.716     0.546    0.662     0.822    0.613
scale-invariant    0.931    0.778     0.826    0.524     0.127    0.277     0.440    0.412     0.275    0.247     0.669    0.314

TABLE V: Comparison of results on the new stereo dataset collected with the ZED (part 2).

Video sequence    7 dense  7 sparse  8 dense  8 sparse  9 dense  9 sparse  10 dense  10 sparse  11 dense  11 sparse  12 dense  12 sparse
mean log           0.532    0.514     0.654    0.539     0.579    0.470     0.411     0.460      0.640     0.704      0.362     0.494
relative abs       0.732    0.484     0.583    0.770     0.773    0.534     0.576     0.472      0.838     0.609      0.517     0.572
relative square    7.194    3.016     6.480    7.472    15.659    3.082     5.583     3.666     56.000     4.792     51.092     4.367
linear RMS         6.132    5.577     8.383    6.284    14.324    5.319     5.966     6.127     11.435     7.832     30.676     6.095
log RMS            0.789    0.669     1.006    0.770     0.823    0.610     0.681     0.633      0.898     0.835      0.580     0.642
scale-invariant    0.601    0.332     0.825    0.493     0.648    0.338     0.442     0.287      0.729     0.378      0.323     0.347

order to simulate the conditions of a robot undergoing SSL. We tested our depth estimation algorithm under two different conditions: training on the fully processed, occlusion-corrected dense maps, and training directly on the raw sparse outputs of the stereo matching algorithm. The dense depth maps were used as ground truth during testing in both cases. The results obtained are shown in Figure 7 and in Tables IV and V.

Observing the results, we see that the performance is in general better than on the existing laser datasets, especially from a qualitative point of view. The algorithm is able to exploit the fact that the training data is not particularly diverse and is, additionally, very similar to the test data, thereby sidestepping the generalization problems that affect traditional supervised learning methodologies.

It can also be observed that, contrary to our expectations, the algorithm trained on the sparse data does not fall short, and actually surpasses the algorithm trained on the dense data in many of the video sequences and error metrics. Looking at the estimated depth maps, this is mostly a consequence of the algorithm's failure to extrapolate correctly: the largest contributions to the error metrics come from depth extrapolations that fall significantly outside the maximum range of the camera and are qualitatively completely incorrect. When the algorithm is trained only on the sparse data, it is exposed to a smaller range of target values and is consequently much more conservative in its estimates, leading to lower error figures.

Another influencing factor is that, in the sparse high-confidence regions, the depth obtained by the stereo camera correlates more strongly with the true depth than it does in the untextured and occluded areas. Since we assume a strong correlation between the extracted monocular image features and the true depth, this implies a stronger correlation between the image features and the stereo camera depths. The simplicity of the machine learning model means that this strong correlation can be effectively exploited and learned, while the weaker correlation present in the dense depth data can easily lead the algorithm to learn a wrong model.

From a qualitative point of view, however, both sparse and dense training behave similarly. Looking at the sample images, we can see examples where both correctly estimate the relative depths between objects in a scene (rows 1 and 2), examples where sparse is better than dense (rows 3 and 4), and examples where dense is better than sparse (row 5). Naturally, there are also many cases where the estimated depth maps are mostly incorrect (row 6).

In particular, we see that the algorithm trained only on sparse data also behaves well when estimating depth in occluded and untextured areas, for instance the white wall in row 3. This leads us to conclude that the algorithm's performance is fully adequate for complementing a sparse stereo matching algorithm, correctly filling in the missing depth information.

D. Online Experiments

We have started preliminary work on a framework for onboard SSL applied to monocular depth estimation using a stereo camera. We have rewritten some of our feature extraction functions in C++, and have tested the setup using a linear least squares regression algorithm, achieving promising results. We intend to further develop the framework in order to fully support an online version of our learning pipeline, and to integrate it with existing autopilot platforms, such as Paparazzi [33].

V. CONCLUSION

This work focused on the application of SSL to the problem of estimating depth from single monocular images, with the intent of complementing sparse stereo vision algorithms. We have shown that our algorithm exhibits competitive performance on existing RGBD datasets, while being computationally more efficient to train than previous approaches [5]. We also trained the algorithm on a new stereo dataset, and showed that it remains accurate even when trained only on sparse, rather than dense, stereo maps. It can consequently be used to efficiently produce dense depth maps from sparse input. Our preliminary work on an online implementation has yielded promising results, obtaining good performance with a very simple linear least squares algorithm.

In future work, we plan to extend our methodology and further explore the complementarity of the information present in monocular and stereo cues. A learning algorithm such as a Mondrian forest [34], or other ensemble-based methods, would enable the estimation of the uncertainty of its own predictions. A sensor fusion algorithm could then merge the information from the stereo vision system and the monocular depth estimation algorithm, based on their local confidence in the estimated depth, leading to an overall more accurate depth map.

We have avoided the use of optical flow features, since they are expensive to compute and are not usable when estimating depth from isolated images rather than video data. However, future work could explore computationally efficient ways of using optical flow to guarantee the temporal consistency, and consequently increase the accuracy, of the sequence of estimated depth maps.

Current state-of-the-art depth estimation methods [10–12, 32] are all based on deep convolutional neural networks of varying complexity. The advent of massively parallel, GPU-based embedded hardware, such as the Jetson TX1 and its eventual successors, means that online training of deep neural networks is close to becoming a reality. These models would greatly benefit from the large amounts of training data made possible by the SSL framework, and could deliver state-of-the-art depth estimation onboard micro aerial vehicles.

REFERENCES

[1] R. A. El-laithy, J. Huang, and M. Yeh, "Study on the use of Microsoft Kinect for robotics applications," in Position Location and Navigation Symposium (PLANS), 2012 IEEE/ION. IEEE, 2012, pp. 1280–1288.

[2] M. Draelos, Q. Qiu, A. Bronstein, and G. Sapiro, "Intel RealSense = real low cost gaze," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2520–2524.

[3] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4982–4987.

[4] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, "Self-supervised monocular road detection in desert terrain," in Robotics: Science and Systems, vol. 38, Philadelphia, 2006.

[5] K. Bipin, V. Duggal, and K. M. Krishna, "Autonomous navigation of generic monocular quadcopter in natural environment," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1063–1070.

[6] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2005, pp. 1161–1168.

[7] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.

[8] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," in ECCV, 2013, pp. 1–14.

[9] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," NIPS, pp. 1–9, 2014. [Online]. Available: http://arxiv.org/abs/1406.2283

[10] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.

[11] F. Liu, C. Shen, G. Lin, and I. D. Reid, "Learning depth from single monocular images using deep convolutional neural fields," PAMI, p. 15, 2015. [Online]. Available: http://arxiv.org/abs/1502.07411

[12] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-image depth perception in the wild," arXiv preprint arXiv:1604.03901, 2016.

[13] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, "Learning ordinal relationships for mid-level vision," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 388–396.

[14] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight," in Field and Service Robotics. Springer, 2016, pp. 391–409.

[15] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, "Least squares revisited: Scalable approaches for multi-class prediction," arXiv preprint arXiv:1310.1949, 2013. Software available at https://github.com/n17s/secondorderdemos

[16] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772.

[17] H. Hu, A. Grubb, J. A. Bagnell, and M. Hebert, "Efficient feature group sequencing for anytime linear prediction," arXiv preprint arXiv:1409.5495, 2014.

[18] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.

[19] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[20] H. W. Ho, C. De Wagter, B. D. W. Remes, and G. C. H. E. de Croon, "Optical-flow based self-supervised learning of obstacle appearance applied to MAV landing," no. IROS 15, pp. 1–10, 2015.

[21] K. van Hecke, G. de Croon, L. van der Maaten, D. Hennes, and D. Izzo, "Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance," arXiv preprint arXiv:1603.08047, 2016.

[22] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 593–600.

[23] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities. Academic Press, 2012.

[24] R. Nevatia and K. R. Babu, "Linear feature extraction and description," Computer Graphics and Image Processing, vol. 13, no. 3, pp. 257–269, 1980.

[25] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" CVPR, vol. 2, pp. II-691–8, 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211534

[26] G. De Croon, E. De Weerdt, C. De Wagter, B. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[27] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.

[29] S. R. Deans, The Radon Transform and Some of Its Applications. Courier Corporation, 2007.

[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, no. Aug, pp. 1871–1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear

[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[32] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, "Fast robust monocular depth estimation for obstacle detection with fully convolutional networks," arXiv preprint arXiv:1607.06349, 2016.

[33] G. Hattenberger, M. Bronz, and M. Gorraz, "Using the Paparazzi UAV system for scientific research," in IMAV 2014, International Micro Air Vehicle Conference and Competition 2014, 2014, p. 247.

[34] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, "Mondrian forests for large-scale regression when uncertainty matters," arXiv preprint arXiv:1506.03805, 2015.
