

Bilateral Cyclic Constraint and Adaptive Regularization for Learning a Monocular Depth Prior

Technical Report #180004
May 18, 2018

Alex Wong
Department of Computer Science
University of California, Los Angeles
[email protected]

Stefano Soatto
Department of Computer Science
University of California, Los Angeles
[email protected]

Abstract

Current methods to infer (hypothesize) depth of a scene from a single image require costly per-pixel ground-truth. We propose an unsupervised method that exploits abundant stereo imagery. Although we train a network with stereo pairs, we only require a single image at test time to hypothesize disparity or depth. We propose a novel objective function that exploits the bilateral cyclic relationship between the left and right disparities and we introduce an adaptive regularization scheme that allows the network to handle both the co-visible and occluded regions in a stereo pair. This process ultimately produces a prior for the 3-dimensional structure of the scene as viewed in a single image. Our method outperforms state-of-the-art unsupervised monocular depth estimation methods on the KITTI benchmarks. We also demonstrate its generalizability on Make3d with models trained on KITTI.

1 Introduction

Estimating the 3-dimensional geometry of a scene is a fundamental problem in machine perception with a wide range of applications, including autonomous driving [1], robotics [2, 3], pose estimation [4] and scene object composition [5]. It is well-known that 3-d scene geometry can be recovered from multiple images of a scene taken from different viewpoints, including stereo, under suitable conditions. Under no conditions, on the other hand, is a single image sufficient to recover 3-d scene structure, unless strong priors are available on the shape of objects populating the scene. Even in these cases, metric information is lost in the projection, so at best we can use a single image to generate hypotheses, as opposed to estimates, of scene geometry.

A number of works in the recent literature have sought to exploit such strong scene priors to train a deep network to “infer” depth from a single image [6, 7, 8, 9]. To do so, they use pixel-level depth annotation captured with a range sensor (e.g. depth camera, lidar) to perform supervised regression of depth from the RGB image. Cognizant of the intrinsic limitations of this endeavor, we propose an unsupervised method to generate depth hypotheses, to be used as a reference to construct prior models for later use in 3-d reconstruction. While our method learns a prior distribution over the depth of possible scenes that may generate a given image, we predict the most-probable scene structure that produces the image by selecting the mode of the distribution, in order to enable comparison with other methods on public benchmarks. We evaluate our method against ground-truth depths via two benchmarks from the KITTI dataset [10] and generalize models trained on KITTI to Make3d [11].

Rather than attempting to learn a prior by associating raw pixel values with depth, we recast depth estimation as an image reconstruction problem [12, 13] and exploit the epipolar geometry between images in a rectified stereo pair to train a deep fully convolutional network.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.


Our network learns to predict the dense pixel correspondences (disparity field) between the stereo pair, despite only having seen one of them. Hence, our network implicitly learns the relative pose of the cameras used in training and hallucinates the existence of a second image taken from the same relative pose when given a single image during testing. From the disparity predictions, we can synthesize depth using the known focal length and baseline of the cameras used in training.

While existing methods with a similar training scheme [12, 13, 14] are able to obtain realistic depth estimations, they do not scale to high resolution [14] or rely on non-differentiable objectives [12]. More recently, [13] proposed using two uni-directional edge-aware disparity gradients and left-right disparity consistency as regularizers. However, edge-awareness should inform bi-directionally while allowing disparity gradients to vary independently. Moreover, we argue that regularity should not only be data-driven, but also model-driven, and that left-right consistency should not be enforced in a relative spatial domain. Our contributions are two-fold: 1) a model-driven adaptive weighting scheme that is both space and time varying and can be applied generically to regularizers; 2) a bilateral consistency constraint that enforces the cyclic application of the left and right disparities to be the identity. We incorporate our contributions into a single objective function that, when realized by a generic encoder-decoder architecture, achieves state-of-the-art performance on two KITTI [10] benchmarks and exhibits generalizability to Make3d [11].

2 Related Works

2.1 Supervised Monocular Depth Estimation

In an early work, [15] learned depth from monocular cues in 2-d images by proposing a patch-based model that estimated a set of planes to explain each patch. The local estimations were then combined with Markov random fields (MRF) to estimate the global depth. Similarly, [16, 11, 17, 18] exploited local monocular features to make global predictions. However, local methods using hand-tuned unary and pairwise terms generally lack the global context needed to generate accurate depth estimations. [8] instead used a convolutional neural network (CNN) to learn the parameters for these terms. [19] further improved monocular methods by incorporating semantic cues into their model.

More recently, [6, 20] introduced a two-scale network trained on images with ground-truth depths. These works differ from earlier attempts to solve the monocular problem as they leverage deep networks to learn a representation from the images and regress depth directly. Recently, [9] proposed using a residual network combined with a fully convolutional architecture and up-sampling modules to produce higher-resolution depth maps. [21] explored a patch-based approach using neural forests and [22] used conditional random fields (CRF) jointly with a CNN to estimate depth.

2.2 Unsupervised Monocular Depth Estimation

Recently, [23] introduced novel view synthesis using a deep network by predicting pixel values based on interpolation of the pixels from nearby images. This launched an unsupervised approach to solving the monocular depth estimation problem. [14] proposed using novel view synthesis to hallucinate the existence of a right view of a stereo pair given an input image. Using an image reconstruction loss, their model produced a distribution of disparities for each pixel and used the weighted combination of the disparity probability and its corresponding pixels to reconstruct the right view.

Building off of similar intuitions, [12] trained a network for monocular depth estimation by reconstructing the right image of a stereo pair and synthesizing disparity as an intermediate. Yet, their image formation model is not fully differentiable, making their objective function difficult to optimize. To overcome this, [13, 24, 25, 26] utilized a bilinear sampler modeled after the Spatial Transformer Network [27] to allow for a fully differentiable loss and end-to-end training of their respective networks. Specifically, Godard et al. [13] introduced the structural similarity index (SSIM) [28] as a loss in addition to the image reconstruction loss. They also predicted both left and right disparities and used them for regularization via a left-right consistency check along with an edge-aware smoothness term. More recently, [29] and [30] learned both ego-motion and depth from monocular videos. [31] trained multiple networks using stereo video streams and proposed a feature reconstruction loss. [29] and [31] borrowed the objective function of [13] in their formulations.


2.3 Adaptive Regularization

A number of computer vision problems can be formulated as energy minimization in a variational framework with a data fidelity term and a regularizer weighted by a fixed scalar. The solution found at the minimal energy involves a trade-off between data fidelity and regularization. Finding the optimal parameter for regularity is a long-studied problem: [32] explored methods to determine the regularization parameter in image de-noising, while [33] used cross-validation as a selection criterion for the weight. Others [13, 34, 35] used image gradients as cues for a data-driven weighting scheme. Recent works [36, 37] proposed that regularity should not only be data-driven, but also model-driven: the amount of regularity imposed should adapt to the fitness of the model in relation to the data rather than being constant throughout the training process.

We draw upon intuitions from unsupervised monocular methods and adaptive regularization to propose a novel objective function using a bilateral cyclic consistency constraint along with a spatially and temporally varying regularization modulator. We show that despite using the same network as [13], our objective function contributes to superior performance over state-of-the-art unsupervised monocular methods as well as others trained on ground-truth depth and lidar data. We detail our loss function with adaptive regularization in Sec. 3 and specify the hyper-parameters and data augmentation procedures used in Sec. 4. We measure the performance of our model on both the KITTI 2015 and KITTI Eigen split benchmarks in Sec. 5. Lastly, we end with a discussion of our work in Sec. 6.

3 Method Formulation

We learn a prior on scene depth without ground-truth supervision by exploiting the availability of stereo pairs (I0, I1) during the training process. Once a prior is learned, we can produce hypotheses, or “estimates”, of the disparity field d compatible with the single image and synthesize the depth z = FB/d of the scene using the focal length F and baseline B at test time. Given a single image I0, our goal is to estimate a function d ∈ R+ that represents the disparity of I0, which we formulate as a loss function L (Eqn. 1), a linear combination of data terms and adaptive regularizers.

Multiple hypotheses can be obtained by adding stochasticity to the inference pipeline; since the benchmarks only evaluate a single hypothesis, we restrict ourselves to one. Our network, parameterized by ω, takes a single image I0 as input and estimates a function d = f(I0; ω), where d represents the disparity (which is monotonically related to inverse depth) corresponding to I0. We drive the training process with I1, which is only used in the loss function, by a surrogate loss that minimizes the reprojection error of I0 to I1 and vice versa. We will refer to the estimated disparities as d0 and d1 for I0 and I1, respectively. Interested readers may refer to the supplementary materials for more details on our formulation.

L = \underbrace{w_{ph} l_{ph} + w_{st} l_{st}}_{\text{data fidelity}} + \underbrace{w_{sm} l_{sm} + w_{bc} l_{bc}}_{\text{regularization}} \quad (1)

where each individual term l will be described in the next sections and their weights w in Sec. 4.
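To make Eqn. 1 concrete, the following is a minimal NumPy sketch of the overall objective and of the depth synthesis z = FB/d above; the function names and the small guard against zero disparity are ours, not from the paper, and each loss term is assumed to be a scalar already summed over pixels (Eqns. 3, 4, 7, 10):

```python
import numpy as np

def total_loss(l_ph, l_st, l_sm, l_bc,
               w_ph=0.15, w_st=0.85, w_sm=0.10, w_bc=1.05):
    # Eqn. (1): weighted sum of data fidelity and regularization terms.
    # Default weights are the values reported in Sec. 4.
    return (w_ph * l_ph + w_st * l_st) + (w_sm * l_sm + w_bc * l_bc)

def disparity_to_depth(disparity, focal_length, baseline):
    # z = FB / d with focal length F (pixels) and baseline B (meters);
    # the epsilon guard against zero disparity is our addition.
    return focal_length * baseline / np.maximum(disparity, 1e-8)
```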

3.1 Data Fidelity

Our data fidelity terms seek to minimize the discrepancy between the observed stereo pair (I0, I1) and their reconstructions (Î0, Î1). We generate each Î by applying a 1-d horizontal disparity shift to I at each pixel position (x, y), where

\hat{I}^0_{xy} = I^1_{xy - d^0_{xy}} \quad \text{and} \quad \hat{I}^1_{xy} = I^0_{xy + d^1_{xy}} \quad (2)

We do so by using a 1-d horizontal bilinear sampler modeled after the image sampler from the Spatial Transformer Network [27]. Our sampler is locally fully differentiable and each output pixel is the weighted sum of two (left and right) pixels. We propose to minimize the reprojection residuals as a two-part loss, which measures the standard color constancy (photometric) and the difference in illumination, contrast and image quality (structural).

Photometric loss. We model the image formation process via a photometric loss l_{ph}, which measures the L1 penalty of the reprojection residual between each I and Î on each channel at every (x, y) position:

l_{ph} = \sum_{x,y} |I^0_{xy} - \hat{I}^0_{xy}| + |I^1_{xy} - \hat{I}^1_{xy}| \quad (3)
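For concreteness, here is a minimal NumPy sketch of the horizontal sampler and of Eqn. 3; the function names, the border clipping, and the sign-convention argument are our own simplifications of the sampler in [27] (the actual model is built from TensorFlow ops):

```python
import numpy as np

def warp_horizontal(image, disparity, sign):
    # Sample `image` (H x W x C) at x + sign * disparity with 1-d
    # bilinear interpolation; each output pixel is the weighted sum
    # of its two horizontal neighbors.
    H, W = disparity.shape
    xs = np.arange(W, dtype=np.float32)[None, :] + sign * disparity
    xs = np.clip(xs, 0.0, W - 1.0)          # clamp at image borders
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = (xs - x0)[..., None]               # weight of right neighbor
    rows = np.arange(H)[:, None]
    return (1.0 - w1) * image[rows, x0] + w1 * image[rows, x1]

def photometric_loss(I0, I1, d0, d1):
    # Eqn. (3): L1 reprojection residual of both views.
    I0_hat = warp_horizontal(I1, d0, sign=-1)  # Î0_xy = I1_{xy - d0_xy}
    I1_hat = warp_horizontal(I0, d1, sign=+1)  # Î1_xy = I0_{xy + d1_xy}
    return np.sum(np.abs(I0 - I0_hat)) + np.sum(np.abs(I1 - I1_hat))
```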


Structural loss. In order to make inference invariant to local illumination changes, we use a perceptual metric (SSIM) that discounts such variability. We apply SSIM to image patches of size 3 × 3 at corresponding (x, y) positions in I and Î. Since two similar images give an SSIM score close to 1, we subtract the score from 1 to represent a distance:

l_{st} = \sum_{x,y} \frac{2 - \left( \mathrm{SSIM}(I^0_{xy}, \hat{I}^0_{xy}) + \mathrm{SSIM}(I^1_{xy}, \hat{I}^1_{xy}) \right)}{2} \quad (4)
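A sketch of Eqn. 4 under the assumption of single-channel images in [0, 1]; the C1, C2 stabilizers are the common defaults from [28], and using scipy's uniform_filter for the 3 × 3 patch statistics is our choice:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim(x, y, size=3, C1=0.01 ** 2, C2=0.03 ** 2):
    # Per-pixel SSIM over size x size patches (Eqn. 4 uses 3 x 3).
    mu_x, mu_y = uniform_filter(x, size), uniform_filter(y, size)
    var_x = uniform_filter(x * x, size) - mu_x ** 2
    var_y = uniform_filter(y * y, size) - mu_y ** 2
    cov_xy = uniform_filter(x * y, size) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

def structural_loss(I0, I0_hat, I1, I1_hat):
    # Eqn. (4): turn the similarity scores into a distance and sum.
    return np.sum((2.0 - (ssim(I0, I0_hat) + ssim(I1, I1_hat))) / 2.0)
```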

3.2 Residual-Based Adaptive Regularization Weighting Scheme

A point estimate d can be obtained by maximizing the Bayesian criterion with a corresponding data fidelity term (energy) D(d) and a Bayesian or Tikhonov regularizer R(d) in the form:

D(d) + \alpha R(d) \quad (5)

where the weight α is a pre-defined positive scalar parameter that controls the amount of regularizationto impose on the model, leading to a trade-off between data fidelity and regularization.

The weight α serves as a modulator between data fidelity and regularization, constraining the solution space based on assumptions about how the desired output should behave. Yet, subjecting the entire solution, a dense disparity field, to the same regularity fails to address cases where the assumptions do not hold. Suppose one enforces a smoothness constraint on the output disparity field by simply taking the disparity gradient ∇d. This constraint would incorrectly penalize object boundaries (regions of high image gradients), and hence [13, 38] apply an edge-aware term to reduce the effects of regularization on edge regions. Although the edge-awareness term gives a data-driven approach to regularization, it is still static (the same image will always have the same weights) and independent of the performance of the model. Instead, we propose a space and time varying weighting scheme based on the performance of our model, measured by reprojection residuals.

Model-driven adaptive weight. We propose an adaptive weight α_{xy} that is both spatially and temporally varying for every position (x, y) of the solution, based on the local residual ρ_{xy} = |I_{xy} - \hat{I}_{xy}| and the global residual, represented by the average per-pixel residual σ = \frac{1}{N} \sum_{x,y} |I_{xy} - \hat{I}_{xy}| over N pixels:

\alpha_{xy} = \exp\left( -\frac{\rho_{xy}}{\sigma} \right) \quad (6)

α is controlled by the local residual between an image I and its reprojection Î at each position while taking into account the global residual σ, which correlates with the training time step and decreases over time. α is naturally small when residuals are large and tends to 1 as training converges.

Local adaptation. Consider a pair of poorly matched pixels (I_{xy}, Î_{xy}), where the residual |I_{xy} − Î_{xy}| is large. By reducing the regularity on the solution d_{xy}, we effectively allow for exploration in the solution space to find a better match and hence a better d_{xy}. Alternatively, consider a pair of perfectly matched pixels (I_{xy}, Î_{xy}), where |I_{xy} − Î_{xy}| = 0. We should apply regularization to decrease the scope of the solution space such that we can allow for convergence. Hence, a spatially adaptive α_{xy} must vary inversely with the local residual ρ_{xy}, such that we impose regularization when the residual is small and reduce it when the residual is large.

Global adaptation. Consider a solution d_{xy} proposed at the first time step t = 1. Imposing regularity effectively reduces the solution space based on an assumption about d_{xy} and biases the final solution. We propose a weighting scheme where α_{xy} → 1 as t → ∞. However, if α_{xy} is directly dependent on t, then α_{xy} will change if we continue to train even after convergence, causing the model to be unstable. Instead, let α_{xy} be inversely proportional to the global residual σ, such that α_{xy} is small when σ is large (generally corresponding to early time steps) and α_{xy} → 1 as σ → 0 throughout training. When training converges (i.e. the global residual has stabilized), α_{xy} likewise will be stable. This naturally lends itself to an annealing schedule where α_{xy} → 1 as training progresses.
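The weight of Eqn. 6 reduces to a few lines; a NumPy sketch, where averaging the residual over color channels and the epsilon guard are our assumptions:

```python
import numpy as np

def adaptive_weight(I, I_hat):
    # Eqn. (6): alpha_xy = exp(-rho_xy / sigma), small where the local
    # residual is large relative to the global average residual.
    rho = np.abs(I - I_hat)
    if rho.ndim == 3:
        rho = rho.mean(axis=-1)       # average over color channels
    sigma = rho.mean()                # global per-pixel residual
    return np.exp(-rho / max(sigma, 1e-8))
```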

3.3 Adaptive Regularization

Our regularizers assume local smoothness and consistency between the estimated left and right disparities. We propose to minimize the disparity gradient (smoothness) and the disparity reprojection error (bilateral cyclic consistency) while adaptively weighting both with α (Sec. 3.2).


Figure 1: Qualitative results on the KITTI Eigen split. From top to bottom: input images, ground-truth disparities, results of Godard et al. [13], and our results. Our approach not only produces more accurate results (i.e. biker in left-most, building in second left-most and car in right-most), but also respects object boundaries (i.e. pedestrians in middle and parked cars in second right-most).

Smoothness loss. We encourage the predicted disparities to be locally smooth by applying an L1 penalty to the disparity gradients in the x (∂_X) and y (∂_Y) directions. However, such an assumption does not hold at object boundaries, which generally correspond to regions of high changes in pixel intensities; hence, we include an edge-aware term λ to allow for discontinuities in the disparity gradient. We also weigh this term adaptively with α:

l_{sm} = \sum_{x,y} \alpha^0_{xy} \left( \lambda^0_{xy} |\partial_X d^0_{xy}| + \lambda^0_{xy} |\partial_Y d^0_{xy}| \right) + \alpha^1_{xy} \left( \lambda^1_{xy} |\partial_X d^1_{xy}| + \lambda^1_{xy} |\partial_Y d^1_{xy}| \right) \quad (7)

where λ_{xy} = e^{-|\nabla^2 I_{xy}|} and the ∇² operator denotes the image Laplacian. We choose to use the image Laplacian over the first-order image gradients because it allows the disparity gradients to be aware of intensity changes in both directions. However, we regularize the disparity field using the disparity gradient so that we can allow for independent movement in each direction. Prior to computing the image Laplacian for λ, we smooth the image with a Gaussian kernel to reduce noise.
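A sketch of Eqn. 7 for one view (the full loss sums both views); gaussian_laplace fuses the Gaussian pre-smoothing and the Laplacian described above, and sigma_blur is an assumed value, not taken from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def smoothness_loss(d, I_gray, alpha, sigma_blur=1.0):
    # Edge-aware weight: lambda_xy = exp(-|Laplacian of smoothed I|).
    lam = np.exp(-np.abs(gaussian_laplace(I_gray, sigma_blur)))
    # Forward-difference disparity gradients in x and y.
    dx = np.zeros_like(d); dx[:, :-1] = d[:, 1:] - d[:, :-1]
    dy = np.zeros_like(d); dy[:-1, :] = d[1:, :] - d[:-1, :]
    # Eqn. (7), one view: adaptively weighted edge-aware L1 penalty.
    return np.sum(alpha * lam * (np.abs(dx) + np.abs(dy)))
```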

Bilateral cyclic consistency loss. A common regularization technique in stereo vision is to maintain the consistency between the left (d0) and right (d1) disparities by reconstructing each disparity through projecting its counterpart with its disparity shifts:

d^{0\prime}_{xy} = d^1_{xy - d^0_{xy}} \quad \text{and} \quad d^{1\prime}_{xy} = d^0_{xy + d^1_{xy}} \quad (8)

However, in doing so, the projected disparities suffer from the unresolved correspondences of both disparity ramps and occlusions (i.e. attempting to reconstruct a region that does not exist can never yield a correct solution). We instead propose a bilateral cyclic consistency check that is designed to specifically reason about occlusions while removing the effects of stereo dis-occlusions. We follow the simple intuition that each set of disparities d should yield the identity when projected to its counterpart and back-projected to itself as a reconstruction d̂:

\hat{d}^0_{xy} = d^0_{xy + d^1_{xy - d^0_{xy}}} \quad \text{and} \quad \hat{d}^1_{xy} = d^1_{xy - d^0_{xy + d^1_{xy}}} \quad (9)

By applying an L1 penalty between the disparity field and its reconstruction, we constrain the cyclic transformation to be the identity, which keeps d0 and d1 consistent with each other in co-visible regions. If there exists an occluded region, that region in the reconstruction will be inconsistent with the original, yielding reprojection error. To avoid penalizing the model for an unresolvable correspondence due to the nature of the data, we propose to adaptively regularize the bilateral cyclic constraint using our residual-based weighting scheme (Eqn. 6). Unsurprisingly, local regions of high reprojection residual often correspond to occluded regions.

l_{bc} = \sum_{x,y} \alpha^0_{xy} |d^0_{xy} - \hat{d}^0_{xy}| + \alpha^1_{xy} |d^1_{xy} - \hat{d}^1_{xy}| \quad (10)
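A sketch of Eqns. (8)-(10), reusing a horizontal sampler with the interface of the warp_horizontal sketch from Sec. 3.1 (passed in as an argument); treating disparity maps as single-channel images to warp is our simplification:

```python
import numpy as np

def bilateral_cyclic_loss(d0, d1, alpha0, alpha1, warp):
    # `warp(field, disparity, sign)` samples `field` (H x W x C) at
    # x + sign * disparity, e.g. the warp_horizontal sketch above.
    # Project each disparity into the other view (Eqn. 8)...
    d0p = warp(d1[..., None], d0, sign=-1)[..., 0]  # d0'_xy = d1_{xy-d0_xy}
    d1p = warp(d0[..., None], d1, sign=+1)[..., 0]  # d1'_xy = d0_{xy+d1_xy}
    # ...then back-project to reconstruct the original field (Eqn. 9).
    d0_hat = warp(d0[..., None], d0p, sign=+1)[..., 0]
    d1_hat = warp(d1[..., None], d1p, sign=-1)[..., 0]
    # Eqn. (10): adaptively weighted L1 cyclic consistency penalty.
    return np.sum(alpha0 * np.abs(d0 - d0_hat) + alpha1 * np.abs(d1 - d1_hat))
```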

4 Implementation Details

Our approach was implemented using TensorFlow [39]. There are ≈ 31 million trainable parameters (more details on the network architecture can be found in Supp. Mat. Table 3).


| Method | Supervised | Cap | Abs Rel | Sq Rel | RMS | logRMS | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| Train set mean | No | 80m | 0.361 | 4.826 | 8.102 | 0.377 | 0.638 | 0.804 | 0.894 |
| Eigen et al. [6] Coarse | Yes | 80m | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957 |
| Eigen et al. [6] Fine | Yes | 80m | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |
| Liu et al. [8] | Yes | 80m | 0.201 | 1.584 | 6.471 | 0.273 | 0.680 | 0.898 | 0.967 |
| Zhou et al. [30] | No | 80m | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
| Godard et al. [13] | No | 80m | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 |
| Zhan et al. [31] (w/ video) | No | 80m | 0.144 | 1.391 | 5.869 | 0.241 | 0.803 | 0.928 | 0.969 |
| Ours | No | 80m | 0.142 | 1.296 | 5.832 | 0.240 | 0.808 | 0.926 | 0.967 |
| Zhou et al. [30] | No | 50m | 0.201 | 1.391 | 5.181 | 0.264 | 0.696 | 0.900 | 0.966 |
| Garg et al. [12] | No | 50m | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |
| Godard et al. [13] | No | 50m | 0.140 | 0.976 | 4.471 | 0.232 | 0.818 | 0.931 | 0.969 |
| Zhan et al. [31] (w/ video) | No | 50m | 0.135 | 0.905 | 4.366 | 0.225 | 0.818 | 0.937 | 0.973 |
| Ours | No | 50m | 0.135 | 0.913 | 4.357 | 0.225 | 0.823 | 0.935 | 0.971 |

Table 1: Quantitative results¹ on the KITTI [10] Eigen split [6] benchmark. Abs Rel, Sq Rel, RMS and logRMS are error metrics (lower is better); the δ columns are accuracy metrics (higher is better). The maximum depths are capped at 50 and 80 meters (any point whose estimated depth exceeds the cap is set to the cap). We apply the crop proposed by [12] for a fair comparison with other methods. Our method outperforms the state-of-the-art [13], a number of supervised methods [8, 6], as well as [31], which trains multiple networks on stereo video streams instead of stereo pairs.

The training process took ≈ 18 hours using an Nvidia GTX 1080Ti. Inference takes ≈ 32 ms per image. We used Adam [40] to optimize our network end-to-end with a base learning rate of 1 × 10⁻⁶, β1 = 0.9, β2 = 0.999. We decrease the learning rate by half after 30 epochs and by a quarter after 40 epochs, for a total of 50 epochs. We train with a batch size of 8 at a 512 × 256 resolution with 4 levels in our loss pyramid. We achieve our results using the following set of weights for each term in our loss function: w_{ph} = 0.15, w_{st} = 0.85, w_{sm} = 0.10 and w_{bc} = 1.05. For our smoothness term, we decrease the weight by a factor of 2^r for each r-th resolution in the loss pyramid, where r = 0 refers to our highest resolution at 512 × 256 and r = 3 the lowest.
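One reading of the schedule above, halving the base rate after epoch 30 and quartering it after epoch 40, as a sketch (the step-function interpretation is our assumption):

```python
def learning_rate(epoch, base_rate=1e-6):
    # Base rate and breakpoints from Sec. 4; 50 epochs total.
    if epoch >= 40:
        return 0.25 * base_rate
    if epoch >= 30:
        return 0.5 * base_rate
    return base_rate

def smoothness_weight(r, w_sm=0.10):
    # Per-scale decay: divide the smoothness weight by 2**r at
    # pyramid level r (r = 0 is the full 512 x 256 resolution).
    return w_sm / (2 ** r)
```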

Data augmentation is performed online during training. We perform a horizontal flip on the stereo pairs with 50% probability (keeping in mind to swap the stereo pair such that the images are in the correct relative positions). Color augmentations on brightness, gamma and color shifts of each channel also occur with 50% chance. We sample from uniform distributions with ranges [0.5, 1.5] for brightness and [0.8, 1.2] for gamma shifts as well as for each color channel separately.
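A sketch of this pipeline, assuming H x W x 3 float images in [0, 1]; sharing the sampled shifts across both views and the final clipping are our assumptions, while the probabilities and ranges are the ones stated above:

```python
import numpy as np

def augment(left, right, rng=np.random.default_rng()):
    if rng.random() < 0.5:  # horizontal flip; swap the views so the
        left, right = right[:, ::-1], left[:, ::-1]  # pair stays rectified
    if rng.random() < 0.5:  # brightness, gamma, per-channel color shifts
        gamma = rng.uniform(0.8, 1.2)
        brightness = rng.uniform(0.5, 1.5)
        colors = rng.uniform(0.8, 1.2, size=3)
        left = np.clip((left ** gamma) * brightness * colors, 0.0, 1.0)
        right = np.clip((right ** gamma) * brightness * colors, 0.0, 1.0)
    return left, right
```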

5 Experiments and Results

We present our results on the KITTI dataset [10] under two different training and testing schemes, the KITTI 2015 split [13] and the KITTI Eigen split [6, 12]. The KITTI dataset contains 42,382 rectified stereo pairs from 61 scenes with approximate resolutions of 1242 × 375. We evaluate our method on the monocular depth estimation task on the KITTI Eigen split and compare our approach with similar variants on a disparity error metric as an ablation study using the KITTI 2015 split. We show that our method outperforms state-of-the-art unsupervised monocular approaches and even supervised approaches on the KITTI benchmarks, while generalizing to Make3d [11].

5.1 KITTI Eigen Split

We evaluate our method using the KITTI Eigen split [6], which has 697 test images from 29 scenes. The remaining 32 scenes contain 23,488 stereo pairs, of which 22,600 pairs are used for training and the rest for validation, following [12]. We project the velodyne points into the left input color camera frame to generate ground-truth depths.

¹ AbsRel = \frac{1}{N} \sum_{x,y} \frac{|d_{xy} - d^{gt}_{xy}|}{d^{gt}_{xy}}, SqRel = \frac{1}{N} \sum_{x,y} \frac{\|d_{xy} - d^{gt}_{xy}\|^2}{d^{gt}_{xy}}, RMS = \sqrt{\frac{1}{N} \sum_{x,y} \|d_{xy} - d^{gt}_{xy}\|^2}, logRMS = \sqrt{\frac{1}{N} \sum_{x,y} \|\log(d_{xy}) - \log(d^{gt}_{xy})\|^2}, accuracy is the % of δ < threshold, where δ = \max\left(\frac{d_{xy}}{d^{gt}_{xy}}, \frac{d^{gt}_{xy}}{d_{xy}}\right), and log10 = \frac{1}{N} \sum_{x,y} |\log_{10}(d_{xy}) - \log_{10}(d^{gt}_{xy})|.
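The footnote's definitions translate directly to NumPy; a sketch assuming flattened depth arrays already masked to valid ground truth and capped:

```python
import numpy as np

def eigen_metrics(d, d_gt):
    # Error and accuracy metrics of footnote 1, over valid pixels.
    abs_rel = np.mean(np.abs(d - d_gt) / d_gt)
    sq_rel = np.mean((d - d_gt) ** 2 / d_gt)
    rms = np.sqrt(np.mean((d - d_gt) ** 2))
    log_rms = np.sqrt(np.mean((np.log(d) - np.log(d_gt)) ** 2))
    delta = np.maximum(d / d_gt, d_gt / d)
    acc = [np.mean(delta < 1.25 ** k) for k in (1, 2, 3)]
    log10 = np.mean(np.abs(np.log10(d) - np.log10(d_gt)))
    return abs_rel, sq_rel, rms, log_rms, acc, log10
```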


| Method | Abs Rel | Sq Rel | RMS | logRMS | D1-all | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Godard et al. [13] w/ Deep3D [14] | 0.412 | 16.37 | 13.693 | 0.512 | 66.850 | 0.690 | 0.833 | 0.891 |
| Godard et al. [13] w/ Deep3Ds [14] | 0.151 | 1.312 | 6.344 | 0.239 | 59.640 | 0.781 | 0.931 | 0.976 |
| Godard et al. [13] w/o LR | 0.123 | 1.417 | 6.315 | 0.220 | 30.318 | 0.841 | 0.937 | 0.973 |
| Godard et al. [13] | 0.124 | 1.388 | 6.125 | 0.217 | 30.272 | 0.841 | 0.936 | 0.975 |
| Godard et al. [13] w/ AR (Ours) | 0.123 | 1.372 | 6.108 | 0.215 | 30.292 | 0.843 | 0.938 | 0.974 |
| Ours w/o AR | 0.122 | 1.337 | 6.078 | 0.214 | 29.991 | 0.839 | 0.938 | 0.975 |
| Ours w/o BC | 0.120 | 1.320 | 6.139 | 0.213 | 30.248 | 0.844 | 0.938 | 0.975 |
| Ours w/o EL | 0.120 | 1.314 | 6.029 | 0.213 | 30.326 | 0.844 | 0.938 | 0.976 |
| Ours | 0.119 | 1.227 | 5.953 | 0.211 | 29.915 | 0.844 | 0.938 | 0.976 |

Table 2: Quantitative comparison¹ amongst variants of our model on the KITTI 2015 split proposed by [13]. LR denotes the left-right consistency term [13], AR refers to adaptive regularization, EL refers to edge-awareness with the image Laplacian and BC to bilateral cyclic consistency. We show the effectiveness of our adaptive regularization (Sec. 3.3) by applying it to [13] and improving their model. Our final model outperforms all variants on every metric.

Figure 2: Qualitative results on the KITTI 2015 split. From top to bottom: input images, results of Godard et al. [13], and our results. Our approach generates more consistent depths (i.e. street sign in the right-most) and recovers thin structures (i.e. traffic light in the left-most).

The ground-truth depth maps are sparse (≈ 5% of the entire image) and prone to errors from the rotation of the velodyne and the motion of the vehicle and surrounding objects, along with occlusions. As a result, we use the cropping scheme proposed by [12], which covers approximately 58% of the height and 93% of the width of the image dimensions.

We compare the performance of our approach with recent monocular depth estimation methods at two different caps, 80 and 50 meters, in Table 1. The results of the other methods listed are taken directly from their papers. Fig. 1 provides a qualitative comparison between our method and the baseline. We note that [31] trained two networks using stereo video streams (as opposed to a single network with stereo pairs like ours and [13]), which allows their networks to learn a depth prior in both spatial and temporal domains. Despite having less knowledge about the scene, our approach performs better under the 80 meter cap for most metrics, with the exception of δ < 1.25² and δ < 1.25³, and is comparable under the 50 meter cap. Also, note that our network was trained from scratch with random initialization, whereas [6] was initialized with weights from AlexNet [41]. In comparison with the baseline [13], we do better on every metric, but more importantly we score significantly lower on the squared error metrics (RMS, Sq Rel), suggesting that our adaptive regularization and bilateral cyclic consistency terms (Sec. 3.3) contribute to a better solution with fewer outliers. We also score significantly higher on the δ < 1.25 metric, which suggests that our model produces more correct and realistically detailed depths than all competing methods.

5.2 KITTI 2015 Split

We evaluate our method on the 200 high quality disparity maps provided as part of the official KITTI training set [10]. These 200 stereo pairs cover 28 of the total 61 scenes. From the 30,159 stereo pairs covering the remaining 33 scenes, we choose 29,000 for training and the rest for validation. While typical training and evaluation schemes project velodyne laser values to depth, we choose to use the provided disparity maps as they are less erroneous than the velodyne data points. In addition, we also use the official KITTI disparity metric of end-point error (D1-all) to measure our performance, as it is a more appropriate metric for our class of approach, which outputs disparity and synthesizes depth from the output using the camera focal length and baseline.


(b)

| Method | Supervised | AbsRel | Sq Rel | RMS | log10 |
|---|---|---|---|---|---|
| Train set mean | No | 0.893 | 15.517 | 11.542 | 0.223 |
| Karsch et al. [18] | Yes | 0.417 | 4.894 | 8.172 | 0.144 |
| Liu et al. [8] | Yes | 0.462 | 6.625 | 9.972 | 0.161 |
| Laina et al. [9] | Yes | 0.198 | 1.665 | 5.461 | 0.082 |
| Godard et al. [13] | No | 0.468 | 9.236 | 12.525 | 0.165 |
| Ours | No | 0.467 | 8.920 | 12.403 | 0.167 |

Figure 3: Qualitative (a) and quantitative (b) results¹ on Make3d [11] with a maximum depth of 70 meters. In (a), top to bottom: input images, ground-truth disparities, our results. In (b), the unsupervised methods listed are all trained on the KITTI Eigen split. Despite being trained on KITTI, we perform comparably to a number of supervised methods trained on Make3d.


Fig. 2 shows qualitative comparisons between our method and that of [13], and Table 2 shows the quantitative results of our approach along with variants using different image formation models and regularization terms on the KITTI 2015 split benchmark as an ablation study. We show that by simply applying our adaptive regularization to the baseline [13], we achieve an improvement over their model. We also show results of replacing our bilateral cyclic consistency with the left-right consistency regularizer from [13]. That experiment illustrates the benefit of using the image Laplacian over image gradients as edge-aware weights. In addition, we find that adaptive regularization and bilateral cyclic consistency contribute similarly to the improvements of the models. However, when combined they achieve significant improvements over the baseline method (and all variants) on every metric.

5.3 Generalizing to Different Datasets: Make3d

To show that our model generalizes, we present our qualitative and quantitative results in Fig. 3 on the Make3d dataset [11], containing 134 test images at 2272 × 1707 resolution. Make3d provides range maps (at a resolution of 305 × 55) for ground-truth depths, which must be rescaled and interpolated. We use the central cropping proposed by [13], where we generate an 852 × 1707 crop centered on the image. We use the standard C1 evaluation metrics¹ proposed for Make3d and limit the maximum depth to 70 meters. The results of the supervised methods are taken from [13]. Because Make3d does not provide stereo pairs, we are unable to train on it. However, we find that despite having trained our model on the KITTI Eigen split, our performance is comparable to that of supervised methods trained on Make3d and is better than the baseline on most metrics.

6 Discussion

In this work, we presented a novel set of regularizers that produce state-of-the-art results on the unsupervised single image depth estimation problem. While the previous state-of-the-art [13] proved to be a strong baseline, we believe that their static regularizers drove their model into a local minimum. Hence, we proposed an adaptive weighting scheme (Sec. 3.3) that is both spatially and temporally varying, allowing for a not only data-driven, but also model-driven approach to regularization. Moreover, we introduced a bilateral cyclic consistency constraint that not only enforces consistency between the left and right disparities, but also enforces the identity when the shift is applied in a cyclic manner. We presented results on two KITTI benchmarks using the network architecture of [13] and showed that our approach is superior to [13] trained in a similar fashion, to a recent work [31] trained with stereo videos, and to supervised methods. Based on the metrics (Table 1 and Table 2), our approach produces depth maps with more details while maintaining global correctness.

Although our work produces state-of-the-art results, there are still limitations to our approach, which we plan to address in future work. We plan to improve robustness to specular and transparent surfaces, as these regions tend to produce inconsistent depths. We are also exploring more sophisticated regularizers in place of the simple disparity gradient. Finally, we believe that the task should drive the network architecture. Rather than using a generic network, finding a better architectural fit could prove to be ground-breaking and push the state-of-the-art in monocular depth estimation.


Appendix A Formulation of Adaptive Regularization

Given a single image I0, our goal is to estimate a function d ∈ R+ that represents the disparity of I0 with a posterior distribution:

p(d \mid I^0) \propto p(I^0 \mid d) \cdot p(d) \quad (11)

We assume that both the likelihood p(I0|d) and the disparity prior p(d) follow a Laplacian distribution. The likelihood can be approximated by a data fidelity term g(I0, I1, Î0, Î1) and the prior by a disparity regularization term h(I0, I1, d). g(I0, I1, Î0, Î1) measures the data fidelity between each image I and its reconstruction Î. h(I0, I1, d) asserts the assumptions of local smoothness and bilateral cyclic constraints on the disparity fields. We denote by a the variance of the reprojection residual of each image pixel and by b the variance over the disparity, which directly relates to the global reprojection residual. We show that these two quantities form an adaptive weight for our regularizer. Our likelihood and prior have the following form:

p(I^0 \mid d) \approx \exp\left( -\frac{g(I^0, I^1, \hat{I}^0, \hat{I}^1)}{a} \right) \quad (12)

p(d) \approx \exp\left( -\frac{h(I^0, I^1, d)}{b} \right) \quad (13)

We then take the negative log to form our data and regularization terms:

-\log p(d \mid I^0, I^1) \propto -\log \left[ \exp\left( -\frac{g(I^0, I^1, \hat{I}^0, \hat{I}^1)}{a} \right) \cdot \exp\left( -\frac{h(I^0, I^1, d)}{b} \right) \right]

\propto \frac{1}{a} g(I^0, I^1, \hat{I}^0, \hat{I}^1) + \frac{1}{b} h(I^0, I^1, d)

\propto \underbrace{g(I^0, I^1, \hat{I}^0, \hat{I}^1)}_{\text{data fidelity}} + \alpha \underbrace{h(I^0, I^1, d)}_{\text{regularization}} \quad (14)

where α = a/b represents an adaptive weight on the regularizer, dependent on the local and global changes in the reprojection residual. We apply this as a weighting scheme to our regularization terms, l_{sm} and l_{bc} (Eqns. 7 and 10 from the main text).


Appendix B Bilateral Cyclic Consistency

Our model enforces bilateral cyclic consistency (Sec. 3.3 from the main text) by projecting the set of disparities of a given image in a stereo pair to its counterpart and back-projecting it to itself as a reconstruction (Eqn. 9 from the main text). We illustrate the cyclic application of the left and right disparities in Fig. 4. Although we are never given the right image of the stereo pair, our network successfully hallucinates its existence and predicts the corresponding disparities to enable our bilateral cyclic consistency check. The areas of inconsistency between the initial disparities and their reconstructions (last row of Fig. 4) exist near the occlusion boundaries, which effectively guides our model to regularize these regions to improve depth consistency.

Figure 4: An illustration of our bilateral cyclic constraint. Top to bottom: input images, initial left and right disparities, projected left and right disparities, reconstructed left and right disparities through back-projection, and absolute difference between the initial disparities and their reconstructions. Each pair of columns is associated with the left and right disparities predicted for the left image. The right image (whitened) is not given to the network and is only shown to give a frame of reference for the right disparities predicted.

Appendix C Additional Examples on KITTI

Figure 5: Additional results on the KITTI Eigen split. From top to bottom: input images, ground-truth disparities, results of Godard et al. [13], and our results. Our method produces overall sharper depth maps than those of [13]. We are able to recover the boundaries of objects better (i.e. the street sign on the right-most and the pedestrians in the second right-most). We are also able to recover thinner structures such as tree trunks.


Appendix D Network Architecture

Encoder Architecture

| layer | kernel size | stride | channels in | channels out | downscale in | downscale out | input |
|---|---|---|---|---|---|---|---|
| conv1 | 7 | 2 | 3 | 32 | 1 | 2 | left |
| conv1b | 7 | 1 | 32 | 32 | 2 | 2 | conv1 |
| conv2 | 5 | 2 | 32 | 64 | 2 | 4 | conv1b |
| conv2b | 5 | 1 | 64 | 64 | 4 | 4 | conv2 |
| conv3 | 3 | 2 | 64 | 128 | 4 | 8 | conv2b |
| conv3b | 3 | 1 | 128 | 128 | 8 | 8 | conv3 |
| conv4 | 3 | 2 | 128 | 256 | 8 | 16 | conv3b |
| conv4b | 3 | 1 | 256 | 256 | 16 | 16 | conv4 |
| conv5 | 3 | 2 | 256 | 512 | 16 | 32 | conv4b |
| conv5b | 3 | 1 | 512 | 512 | 32 | 32 | conv5 |
| conv6 | 3 | 2 | 512 | 512 | 32 | 64 | conv5b |
| conv6b | 3 | 1 | 512 | 512 | 64 | 64 | conv6 |
| conv7 | 3 | 2 | 512 | 512 | 64 | 128 | conv6b |
| conv7b | 3 | 1 | 512 | 512 | 128 | 128 | conv7 |

Decoder Architecture

| layer | kernel size | stride | channels in | channels out | downscale in | downscale out | input |
|---|---|---|---|---|---|---|---|
| upconv7 | 3 | 2 | 512 | 512 | 128 | 64 | conv7b |
| iconv7 | 3 | 1 | 1024 | 512 | 64 | 64 | upconv7+conv6b |
| upconv6 | 3 | 2 | 512 | 512 | 64 | 32 | iconv7 |
| iconv6 | 3 | 1 | 1024 | 512 | 32 | 32 | upconv6+conv5b |
| upconv5 | 3 | 2 | 512 | 256 | 32 | 16 | iconv6 |
| iconv5 | 3 | 1 | 512 | 256 | 16 | 16 | upconv5+conv4b |
| upconv4 | 3 | 2 | 256 | 128 | 16 | 8 | iconv5 |
| iconv4 | 3 | 1 | 128 | 128 | 8 | 8 | upconv4+conv3b |
| disp4 | 3 | 1 | 128 | 2 | 8 | 8 | iconv4 |
| upconv3 | 3 | 2 | 128 | 64 | 8 | 4 | iconv4 |
| iconv3 | 3 | 1 | 130 | 64 | 4 | 4 | upconv3+conv2b+disp4* |
| disp3 | 3 | 1 | 64 | 2 | 4 | 4 | iconv3 |
| upconv2 | 3 | 2 | 64 | 32 | 4 | 2 | iconv3 |
| iconv2 | 3 | 1 | 66 | 32 | 2 | 2 | upconv2+conv1b+disp3* |
| disp2 | 3 | 1 | 32 | 2 | 2 | 2 | iconv2 |
| upconv1 | 3 | 2 | 32 | 16 | 2 | 1 | iconv2 |
| iconv1 | 3 | 1 | 18 | 16 | 1 | 1 | upconv1+disp2* |
| disp1 | 3 | 1 | 16 | 2 | 1 | 1 | iconv1 |

Table 3: Our network architecture follows that of [13] and [42], and we achieve better performance over the baseline [13]. "in" and "out" refer to the input and output channels and the downscale factor due to striding for each layer. + refers to the concatenation of multiple layers. * refers to up-sampling of the disparity predictions at a given resolution. We do not use batch norm layers.


References

[1] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519, 2017.

[2] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.

[3] Danail Stoyanov, Marco Visentini Scarzanella, Philip Pratt, and Guang-Zhong Yang. Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 275–282. Springer, 2010.

[4] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.

[5] Kevin Karsch, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Hailin Jin, Rafael Fonte, Michael Sittig, and David Forsyth. Automatic scene inference for 3d object compositing. ACM Transactions on Graphics (TOG), 33(3):32, 2014.

[6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.

[7] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015.

[8] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2024–2039, 2016.

[9] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.

[10] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.

[11] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.

[12] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.

[13] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.

[14] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In European Conference on Computer Vision, pages 842–857. Springer, 2016.

[15] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Advances in Neural Information Processing Systems, pages 1161–1168, 2006.

[16] Derek Hoiem, Alexei A Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007.

[17] Janusz Konrad, Meng Wang, Prakash Ishwar, Chen Wu, and Debargha Mukherjee. Learning-based, automatic 2d-to-3d image and video conversion. IEEE Transactions on Image Processing, 22(9):3485–3496, 2013.

[18] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth extraction from video using non-parametric sampling. In European Conference on Computer Vision, pages 775–788. Springer, 2012.

[19] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–96, 2014.


[20] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[21] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5506–5514, 2016.

[22] Seungryong Kim, Kihong Park, Kwanghoon Sohn, and Stephen Lin. Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In European Conference on Computer Vision, pages 143–159. Springer, 2016.

[23] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5515–5524, 2016.

[24] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015.

[25] Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 117–126, 2016.

[26] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286–301. Springer, 2016.

[27] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[28] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[29] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. arXiv preprint arXiv:1802.05522, 2018.

[30] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.

[31] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. arXiv preprint arXiv:1803.03893, 2018.

[32] Nikolas P Galatsanos and Aggelos K Katsaggelos. Methods for choosing the regularization parameter and estimating the noise variance in image restoration and their relation. IEEE Transactions on Image Processing, 1(3):322–336, 1992.

[33] Nhat Nguyen, Peyman Milanfar, and Gene Golub. Efficient generalized cross-validation with applications to parametric image restoration and resolution enhancement. IEEE Transactions on Image Processing, 10(9):1299–1308, 2001.

[34] Andreas Wedel, Daniel Cremers, Thomas Pock, and Horst Bischof. Structure- and motion-adaptive regularization for high accuracy optic flow. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1663–1668. IEEE, 2009.

[35] Manuel Werlberger, Werner Trobin, Thomas Pock, Andreas Wedel, Daniel Cremers, and Horst Bischof. Anisotropic Huber-L1 optical flow. In BMVC, volume 1, page 3, 2009.

[36] Byung-Woo Hong, Ja-Keoung Koo, Hendrik Dirks, and Martin Burger. Adaptive regularization in convex composite optimization for variational imaging problems. In German Conference on Pattern Recognition, pages 268–280. Springer, 2017.

[37] Byung-Woo Hong, Ja-Keoung Koo, Martin Burger, and Stefano Soatto. Adaptive regularization of some inverse problems in image analysis. arXiv preprint arXiv:1705.03350, 2017.

[38] Philipp Heise, Sebastian Klose, Brian Jensen, and Alois Knoll. PM-Huber: PatchMatch with Huber regularization for stereo matching. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2360–2367. IEEE, 2013.


[39] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[40] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[41] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[42] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.