
SDNet: Semantically Guided Depth Estimation Network

Matthias Ochs¹, Adrian Kretz¹, and Rudolf Mester¹,²

¹ VSI Lab, Goethe University, Frankfurt am Main, Germany
² Norwegian Open AI Lab, CS Dept. (IDI), NTNU Trondheim, Norway

Abstract. Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically show that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.

1 Introduction

Many computer vision applications rely on, or can at least increase their performance with, pixel-wise depth information in addition to the input image. Based on this depth information, geometric relationships in the environment can be explained better, supporting, for instance, scene understanding in robotics, autonomous driving, image editing, or 3D modeling/reconstruction.

Fig. 1: Our proposed SDNet estimates pixel-wise depth and semantic labels from a single monocular input image based on a modified DeepLabv3+ architecture (RGB image, encoder-decoder with ASPP, semantic segmentation and depth map outputs).

arXiv:1907.10659v1 [cs.CV] 24 Jul 2019


Unfortunately, estimating depth from a single monocular image is an ill-posed problem, which cannot be solved without further knowledge. In this paper, we propose a deep learning-based approach in which this additional information is learned in a supervised manner from the geometric cues of many training samples by a CNN. The network architecture of this CNN is inspired by the DeepLabv3+ model from [2] and extended by two subnets to simultaneously predict the depth and semantic labels. Due to this combined prediction, our net learns semantically richer and more stable features in the Atrous Spatial Pyramid Pooling (ASPP) module, which improves both estimates. Following the insights of [6], and keeping in mind that depths follow a natural order, we regard depth estimation as an ordinal classification problem rather than a regression problem. This lets the neural network converge faster, and the resulting depth maps are more precise and retrieve more details of fine structures and small objects. In the evaluation section, we show that SDNet yields state-of-the-art results on two datasets while still generalizing well to other datasets.

2 Related Work

In this section, we review classical as well as deep learning-based methods for solving the depth estimation problem. Most classical approaches [24,25,20] extract hand-crafted features from the monocular images, which are then used to predict the depth map by optimizing a probabilistic model such as a Markov random field (MRF) or conditional random field (CRF). Ladický et al. [15] estimate the depth based on different canonical views and show that semantic knowledge helps to improve the prediction. This has been verified for stereo matching methods by [26,31], too. In this paper, we empirically show that this concept also holds for depth estimation from a single monocular image with a CNN.

Besides these classical techniques, many recent approaches make use of Con-volutional Neural Networks (CNNs). These techniques can be divided into twocategories: supervised and unsupervised learning methods. Supervised learningtechniques include models which are trained on stereo images, but which caninfer depth maps on monocular images. [12] propose an approach which followsthis paradigm. They use the correlations of CNN feature maps of stereo imagesand derive the unary matching costs. The depth maps are then computed byusing a CRF. But there are also CNNs [28,21,23,8], which try to find stereocorrespondences to estimate the disparity/depth.

The previously mentioned approaches are trained in a supervised manner. [14] introduce a semi-supervised approach, where the loss function includes an additional unsupervised term. This term captures the stereo alignment error of two images given an approximate depth map. The approaches of [7,30] and [9] use modifications of this alignment error to train networks in a completely unsupervised way. Since ground truth depth maps are not necessary with these approaches, it is possible to make use of much larger training databases. These unsupervised techniques tend to produce much smoother and thus more inaccurate depth maps than supervised techniques, though.

Fu et al. [6] introduced DORN, a CNN trained in a supervised way that achieves current state-of-the-art results. Instead of using regression to estimate the depth of each pixel, the depths are discretized and ordinal classification is used. We use a similar idea for our depth estimation.

3 Approach

In this section, we introduce our proposed model, which classifies the depth based on ordinal depth classes while simultaneously inferring the semantic labels for each pixel. This has the advantage that more meaningful features are learned in the final layers of the encoder, from which both the semantic segmentation and the depth estimation benefit. It allows the CNN to describe, detect, and classify objects more accurately.

3.1 Architecture

The network architecture of our model is depicted in figure 2. It is similar to the DeepLabv3+ model from [2], which represents the current state of the art for semantic segmentation.

[Figure 2 diagram: Input (3 channels, full resolution) → Conv1 (64 channels, 1/4 resolution) → ResBlock2 (256 ch, 1/4, dilation 1) → ResBlock3 (512 ch, 1/8, dilation 1) → ResBlock4 (1024 ch, 1/8, dilation 2) → ResBlock5 (2048 ch, 1/8, dilation 4) → ASPP (1280 ch, 1/8) → upsampling (256 ch, 1/4) → concatenation with the 64-channel Conv1 features (320 ch, 1/4) → per-subnet upsampling to full resolution → semantics output (M_Sem channels, softmax) and depth output (M_Depth channels, sigmoid). Legend: n × n convolutions with stride s, each followed by group normalization and ReLU; dropout with rate 0.5; 3 × 3 max-pooling with stride 2.]

Fig. 2: Encoder-decoder architecture of SDNet with ASPP module. The decoder consists of two subnets, which predict the pixel-wise depths and semantic labels.

Theoretically, every encoder, e.g. ResNets or InceptionNets, can be used to extract the feature maps which serve as input for the ASPP module. Due to limited computational capacity and real-time requirements, we use a ResNet-50 in this work. The ResNet is configured as proposed by [2] (dilation rate, multi-grid, output stride of 8, atrous convolution, etc.). With the combination of this encoder and the ASPP module, SDNet is capable of extracting features on multiple scales, which improves the results significantly. The ASPP module is also adopted from DeepLabv3+, except that we use group normalization [29] instead of batch normalization; this applies to all normalization layers in the CNN. Furthermore, we add dropout layers, as shown in figure 2. Compared to DORN [6], our ASPP module is fully convolutional, and thus our CNN does not depend on a predefined input resolution.
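As a concrete illustration, the following minimal PyTorch sketch builds a ResNet-50 encoder with output stride 8 and group normalization in place of batch normalization, and freezes the first two residual blocks as described in section 4.1. The helper name and the group count of 32 are our assumptions; the paper does not specify them.

```python
# Sketch: SDNet-style encoder with GroupNorm and output stride 8.
# Assumes PyTorch/torchvision; details not given in the paper are guesses.
import torch.nn as nn
from torchvision.models import resnet50

def make_encoder(num_groups: int = 32):
    # GroupNorm is independent of the batch statistics, which matters
    # for the small batch size of 4 used in training (Sec. 4.1).
    norm = lambda channels: nn.GroupNorm(num_groups, channels)
    encoder = resnet50(
        norm_layer=norm,
        # Dilate the last two stages instead of striding -> output stride 8,
        # matching the dilation rates 2 and 4 of ResBlock4/5 in Fig. 2.
        replace_stride_with_dilation=[False, True, True],
    )
    # Freeze the stem and the first two residual blocks (Sec. 4.1).
    for module in (encoder.conv1, encoder.bn1, encoder.layer1, encoder.layer2):
        for p in module.parameters():
            p.requires_grad = False
    return encoder
```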

Unlike DeepLabv3+, our decoder consists of two subnets. First, the decoder interpolates the semantically meaningful feature maps from the ASPP module, which are then concatenated with feature maps from the second ResBlock. These feature maps provide additional structural information and help to recognize fine structures in the image. The resulting feature maps serve as input for our two subnets: the first estimates the semantic labels, and the second determines the depth. Both output layers compute class probabilities, via a softmax function for the semantics and a sigmoid layer for the depths.
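The two decoder heads might look as follows. This is a sketch under our own assumptions about layer counts; only the 320-channel input, the 256-channel intermediate width, and the softmax/sigmoid outputs are taken from figure 2, the rest is illustrative, not the authors' exact implementation.

```python
# Sketch of the two-subnet decoder head (layer counts are our guesses).
import torch.nn as nn
import torch.nn.functional as F

class SDNetDecoder(nn.Module):
    def __init__(self, num_sem_classes: int, num_depth_classes: int = 128):
        super().__init__()
        def subnet(out_ch):
            return nn.Sequential(
                nn.Conv2d(320, 256, 3, padding=1), nn.GroupNorm(32, 256), nn.ReLU(),
                nn.Dropout2d(0.5),
                nn.Conv2d(256, out_ch, 1),
            )
        self.sem_head = subnet(num_sem_classes)      # softmax over classes
        self.depth_head = subnet(num_depth_classes)  # sigmoid per ordinal class

    def forward(self, fused, out_size):
        # 'fused' is the 320-channel concatenation of upsampled ASPP features
        # and low-level features (Fig. 2); upsample both heads to full size.
        sem_logits = F.interpolate(self.sem_head(fused), out_size,
                                   mode='bilinear', align_corners=False)
        depth_logits = F.interpolate(self.depth_head(fused), out_size,
                                     mode='bilinear', align_corners=False)
        return sem_logits, depth_logits  # apply softmax / sigmoid downstream
```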

3.2 Discretization

For ordinal regression, the continuous depth must be discretized so that a particular class c can be assigned to each depth d. Normally, the depths d ∈ [d_min, d_max] are divided linearly into classes between the minimum and maximum depth. However, this partitioning has the disadvantage that an erroneous estimate of a depth class at small depths causes a larger relative error than a faulty estimate at a large depth.

Fig. 3: The left plot shows the exponential and linear discretization of the depths with M_depth = 128 in the interval [2 m, 80 m] (depth [m] over depth class). The right histogram shows the relative frequency of the exponentially and linearly discretized depth classes over all depths in the KITTI depth prediction training dataset.

This relative error can be reduced either by defining the classes by their inverse depths ρ = 1/d, or by choosing a more complex mapping function; e.g., [6] chose a logarithmic mapping function for discretization. In this work, we split the inverse depth into discrete classes c_i using an exponential function

c_i = \rho_{\min} \cdot \exp\!\left( \log\!\left( \frac{\rho_{\max}}{\rho_{\min}} \right) \cdot \frac{i-1}{M_d - 1} \right), \quad i \in \{1, \dots, M_d\} \qquad (1)

where M_d corresponds to the number of depth classes. The left plot of figure 3 shows both the linear and the exponential mapping functions. One can see that, due to the exponential character, more discretization levels are available for small and medium depths than for larger ones; the relative prediction errors of an estimator should be correspondingly smaller. The adjacent histogram shows the relative frequencies of the two discretizations over the depths found in the KITTI depth prediction training data. A uniform distribution of these classes would be ideal for training a CNN; with a suitable choice of mapping function, such as the proposed exponential one, the depth distribution is brought close to uniform.
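A minimal NumPy sketch of equation (1); the rule for assigning a measured depth to its nearest class is our own choice, since the paper does not spell this detail out.

```python
# Sketch of the exponential inverse-depth discretization of Eq. (1).
import numpy as np

def make_classes(d_min=2.0, d_max=80.0, m=128):
    # Inverse depths; rho_min corresponds to the largest depth.
    rho_max, rho_min = 1.0 / d_min, 1.0 / d_max
    i = np.arange(1, m + 1)
    return rho_min * np.exp(np.log(rho_max / rho_min) * (i - 1) / (m - 1))

def depth_to_class(depth, classes):
    # Map each depth to the nearest inverse-depth class (0-based index);
    # the exact assignment rule is an assumption on our part.
    rho = 1.0 / np.asarray(depth)
    return np.abs(rho[..., None] - classes).argmin(axis=-1)
```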

In contrast to the semantic classes, the depth classes follow an ordinal relation. This means that a larger depth class is always included in the smaller ones: c_1 ⪯ c_2 ⪯ … ⪯ c_{M_d}. By this concept, an inverse depth ρ which actually belongs to the class c_i also belongs to the classes c_1, …, c_{i−1}. This ordered classification was introduced for neural networks by [3] and [22]. Like [6], we use this idea and show that ordinal classification is advantageous over a standard classification or regression-based estimation of depth values.
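Under this ordering, the training target for a pixel of class c_i is a multi-hot vector with ones in the first i entries (formalized in section 3.3). A small illustrative helper, assuming 0-based class indices:

```python
# Sketch: ordinal multi-hot target for a pixel of class c_i (0-based index).
import numpy as np

def ordinal_target(class_index: int, m: int = 128) -> np.ndarray:
    t = np.zeros(m, dtype=np.float32)
    t[: class_index + 1] = 1.0  # classes up to and including c_i are "on"
    return t

# e.g. ordinal_target(3, 8) -> [1, 1, 1, 1, 0, 0, 0, 0]
```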

3.3 Output & Loss

As described in Section 3.1, SDNet returns the results of the semantic segmentation and depth estimation in two separate subnets. The output and the loss of the semantic subnet follow the standard multi-class classification for each pixel k ∈ {1, …, K}, where K is the number of pixels in the image. Hence, the semantic subnet outputs logits y^s_k ∈ ℝ^{M_s}, where M_s is the number of semantic classes. The logits y^s_k are then converted into probabilities p^s_k by the softmax function. As training loss, we use the multi-class cross entropy with t^s_k ∈ {0, 1}^{M_s} as the one-hot encoded vector representing the ground truth:

\mathcal{L}_{k,\mathrm{sem}}(\theta; p^s_k, t^s_k) = -\frac{1}{M_s} \sum_i t^s_{k,i} \log p^s_{k,i}, \qquad (2)

where θ are the trainable parameters in the CNN.

This vanilla multi-class classification cannot be used for the ordered depth classes, because a single pixel can belong to multiple depth classes at once (a multi-label classification); each class is therefore treated as an independent binary decision. Thus, it is regarded as a binary classification problem for each class c_i and pixel k. The depth logits y^d_k ∈ ℝ^{M_d} are transformed by the sigmoid function to probabilities p^d_k, where p^d_{k,i} ∈ [0, 1] indicates whether pixel k belongs to class c_i or not.

Page 6: VSI Lab, Goethe University, Frankfurt am Main, Germany ... · SDNet: Semantically Guided Depth Estimation Network Matthias Ochs 1, Adrian Kretz , and Rudolf Mester;2 1 VSI Lab, Goethe

6 M. Ochs, A. Kretz, and R. Mester

Then, SDNet calculates the following multi-hot encoded vector for each pixel:

z^d_k = \left[ \sigma(p^d_{k,1}), \dots, \sigma(p^d_{k,M_d}) \right]^T, \quad \text{with} \quad \sigma(x) = \begin{cases} 1, & \text{if } x \geq 0.5 \\ 0, & \text{otherwise.} \end{cases} \qquad (3)

To preserve the ordered relation of the classes, the components of z^d_k are accumulated until the first entry is 0. This sum then corresponds to the class of the pixel:

c_k = \sum_i \eta(z^d_k, i), \quad \text{with} \quad \eta(z^d_k, i) = \begin{cases} 1, & \text{if } i = 1 \text{ and } z^d_{k,i} = 1 \\ 1, & \text{if } \eta(z^d_k, i-1) = 1 \text{ and } z^d_{k,i} = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (4)

To train the depth part of SDNet, we choose the following binary cross entropy loss, where the ground truth vector t^d_k is the multi-hot encoded vector for class c_k:

\mathcal{L}_{k,\mathrm{depth}}(\theta; p^d_k, t^d_k) = -\frac{1}{M_d} \sum_i \left[ t^d_{k,i} \log p^d_{k,i} + (1 - t^d_{k,i}) \log(1 - p^d_{k,i}) \right]. \qquad (5)

Thus, the total loss is the mean of the two individual losses over all pixels in the batch, with λ acting as a coupling constant:

\mathcal{L}(\theta; p_k, t_k) = \frac{1}{K} \sum_k \left[ \mathcal{L}_{k,\mathrm{depth}}(\theta; p^d_k, t^d_k) + \lambda \cdot \mathcal{L}_{k,\mathrm{sem}}(\theta; p^s_k, t^s_k) \right]. \qquad (6)

To weight both loss terms similarly, λ = 10 was determined empirically. We also analyzed additional loss terms, such as regularizing the predicted depth map by its gradient as proposed by [14] and [9], or the absolute sum of the depth values from [32]. Other possible modifications are class balancing factors or the focal loss from [18]. However, none of these produced any additional positive effect, so we discarded them.
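A hedged PyTorch sketch of the combined objective of equations (2), (5), and (6); the tensor layout and the use of mean reductions (which absorb the 1/M_s, 1/M_d, and 1/K factors) are our assumptions.

```python
# Sketch of the total loss, Eq. (6), built from Eqs. (2) and (5).
import torch.nn.functional as F

def sdnet_loss(sem_logits, depth_logits, sem_target, depth_multihot, lam=10.0):
    """sem_logits: (B, M_s, H, W); sem_target: (B, H, W) class indices;
    depth_logits: (B, M_d, H, W); depth_multihot: (B, M_d, H, W) in {0, 1}."""
    m_s = sem_logits.shape[1]
    # Eq. (2): multi-class cross entropy; the mean gives the 1/K factor,
    # and we scale by 1/M_s as in the paper.
    l_sem = F.cross_entropy(sem_logits, sem_target) / m_s
    # Eq. (5): binary cross entropy over the ordinal multi-hot targets;
    # the mean over all elements already supplies the 1/M_d and 1/K factors.
    l_depth = F.binary_cross_entropy_with_logits(depth_logits, depth_multihot)
    return l_depth + lam * l_sem  # Eq. (6) with coupling constant lambda
```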

4 Evaluation & Experiments

This section describes the training of SDNet in more detail. Both the predicted depth and the semantic segmentation are evaluated and achieve state-of-the-art results in the KITTI depth prediction (KD) benchmark [27] and the KITTI pixel-level semantic segmentation (KS) benchmark [1]. In addition, we demonstrate the generalization of SDNet to other data on Cityscapes (CS) [4] and compare our semantic segmentation results with a modified DeepLabv3+ model. We use the same evaluation metrics as in [5] and the official benchmarks.


Both the evaluation and the ablation studies show that a joint estimation of depths and semantic labels is preferable to an independent estimation, because it yields superior results. This observation indicates that the CNN learns semantically more meaningful features. Using these features, the CNN can detect and recognize objects better, which in turn improves both predictions. Additionally, the experiments show that depths can be estimated more precisely using classes rather than a regression-based approach.

4.1 Training Protocol

To train SDNet, we used the three datasets mentioned above in the following order. First, we train on the 3000 samples of fine-annotated training data from CS, which contain both semantic information and depth maps from SGBM [11]. These depth maps are relatively dense compared to LIDAR measurements from KITTI, but they are also less accurate and exhibit the typical stereo artifacts. Thus, SDNet was also trained with the 85898 training samples from KD, in which LIDAR measurements were accumulated over several frames to obtain denser depth maps. There is no semantic information available for this data, so only the depth part of SDNet could be trained with KD. However, KITTI published a training dataset (KS) with 200 images containing semantic annotations. LIDAR data also exists for this dataset, so all necessary data is available for a complete training of SDNet.

Due to the different resolutions and the limited GPU memory, we randomly cropped the images to 768 × 352 pixels at each epoch. Furthermore, we used random flipping and standard color jittering as data augmentation techniques. We initialized the encoder with pre-trained weights from ImageNet and froze the parameters of the first and second ResNet blocks during training. We used Adam as optimizer and a batch size of 4. The weight-decay factor and the dropout rate were set to 0.0005 and 0.5, respectively. The best results were achieved when using 128 depth classes to split the depth interval [2 m, 80 m].
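A sketch of this augmentation pipeline, assuming torchvision; the jitter strengths and the joint cropping/flipping of image and targets are our assumptions, as the paper does not specify them.

```python
# Sketch: random 768x352 crop, horizontal flip, and color jitter (Sec. 4.1).
import random
import torchvision.transforms.functional as TF
from torchvision import transforms

def augment(image, depth, labels):
    # Crop image and both targets with the same window (h=352, w=768).
    i, j, h, w = transforms.RandomCrop.get_params(image, (352, 768))
    image, depth, labels = (TF.crop(t, i, j, h, w) for t in (image, depth, labels))
    if random.random() < 0.5:  # flip image and targets consistently
        image, depth, labels = (TF.hflip(t) for t in (image, depth, labels))
    # Photometric jitter on the image only; strengths are placeholders.
    image = transforms.ColorJitter(0.2, 0.2, 0.2, 0.05)(image)
    return image, depth, labels
```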

In the first training phase, SDNet was trained on CS and KD with 50,000 iterations each and an initial learning rate of 0.0001. For training on KD, λ was set to 0, because there is no semantic data and thus only the depth subnet should be trained. In all other cases, λ = 10 was chosen. The second training phase consists of fine-tuning on KS, where both subnets are optimized with a learning rate of 0.00001 for 3000 iterations.

During deployment, there is no need to crop the input images, because SDNet is fully convolutional; the output can thus be estimated on input images of arbitrary size. As suggested by [9], SDNet additionally predicts the output of the flipped input image. The final prediction is then computed by merging the two outputs, which makes the result more robust to occlusions and outliers.
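A minimal sketch of this test-time scheme; averaging the two outputs is one plausible merging rule, since the paper does not specify the exact operation.

```python
# Sketch: flip-merged inference, following the idea of [9].
import torch

@torch.no_grad()
def predict_merged(model, image):
    out = model(image)                                # (B, C, H, W)
    out_flipped = model(torch.flip(image, dims=[-1])) # predict on the mirror
    # Flip the second prediction back and merge (here: average).
    return 0.5 * (out + torch.flip(out_flipped, dims=[-1]))
```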

4.2 KITTI Dataset

The depth estimation and the semantic segmentation were evaluated in several experiments on KD and KS and compared to state-of-the-art methods. Before KITTI published its benchmark for monocular depth estimation in 2018, it was common to evaluate results on an unofficial split suggested by [5].

Evaluated in the range [0 m, 80 m]:

Method                  ARD ↓   SRD ↓   RMSE ↓   RMSELog ↓   δ<1.25 ↑   δ<1.25² ↑   δ<1.25³ ↑
Fu et al. [6]           0.072   0.307   2.727    0.120       0.932      0.985       0.995
SDNet                   0.079   0.504   3.700    0.167       0.921      0.968       0.983
Yang et al. [32]        0.097   0.734   4.442    0.187       0.888      0.958       0.980
Kuznietsov et al. [14]  0.113   0.741   4.621    0.189       0.862      0.960       0.986
Godard et al. [9]       0.114   0.898   4.935    0.206       0.861      0.949       0.976
Eigen et al. [5]        0.190   1.515   7.156    0.270       0.692      0.899       0.967
Liu et al. [19]         0.217   1.841   6.986    0.289       0.647      0.882       0.961
Saxena et al. [25]      0.280   3.012   8.734    0.361       0.601      0.820       0.926

Evaluated with the maximum depth reduced to 50 m:

Fu et al. [6]           0.071   0.268   2.271    0.116       0.936      0.985       0.995
SDNet                   0.075   0.430   3.199    0.163       0.926      0.970       0.984
Yang et al. [32]        0.092   0.547   3.390    0.177       0.898      0.962       0.982
Kuznietsov et al. [14]  0.108   0.595   3.518    0.179       0.875      0.964       0.988
Godard et al. [9]       0.108   0.657   3.729    0.194       0.873      0.954       0.979
Garg et al. [7]         0.169   1.080   5.104    0.273       0.740      0.904       0.962

Table 1: Evaluation and comparison with other approaches on the test split by Eigen et al. [5]. In the upper part of the table, the depth is evaluated in the range [0 m, 80 m]; in the lower part, the maximum depth is reduced to 50 m.
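For reference, the error metrics in table 1 follow the standard definitions of Eigen et al. [5]; a NumPy sketch over the valid ground-truth pixels:

```python
# Sketch of the standard monocular depth metrics of Eigen et al. [5].
import numpy as np

def eigen_metrics(pred, gt):
    """pred, gt: arrays of valid depths in metres (same shape)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        'ARD':     np.mean(np.abs(pred - gt) / gt),   # absolute relative diff.
        'SRD':     np.mean((pred - gt) ** 2 / gt),    # squared relative diff.
        'RMSE':    np.sqrt(np.mean((pred - gt) ** 2)),
        'RMSELog': np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        'delta1':  np.mean(ratio < 1.25),
        'delta2':  np.mean(ratio < 1.25 ** 2),
        'delta3':  np.mean(ratio < 1.25 ** 3),
    }
```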

The Eigen test dataset consists of 697 images selected from 29 different sequences of the KITTI raw dataset. The evaluation in table 1 shows that SDNet achieves very good results in all metrics and is only outperformed by DORN. Moreover, the qualitative comparison of the predicted depth maps in figure 4 shows that SDNet reproduces the depths more accurately and with more detail than other methods, whose depth maps usually look more blurred. Further examples of depth estimation on KS can be found in figure 6.

Besides this unofficial analysis, the estimated depths of SDNet were also evaluated on the official KD benchmark, which consists of 500 test images. The images depict different driving situations, so the network has to generalize in order to perform well in this benchmark. The results are shown in table 2, where only previously published methods are listed; again, SDNet achieves state-of-the-art results. The D1 error images in figure 5, which show the deviation from the ground truth for some test images, reveal that the predicted depth of SDNet is erroneous especially at large depths. One reason could be that the maximum depth class was set to 80 m. This also explains the relatively small gap to DORN in the Eigen evaluation, in which the maximum depth is limited to 50 m or 80 m.

Additionally, the predicted semantic segmentation of SDNet was evaluated on the KS benchmark, which consists of 200 images. The results obtained in this benchmark can be found in table 3.


Fig. 4: Qualitative comparison with other state-of-the-art methods on the KITTI Eigen test dataset (panels: RGB, GT, Garg et al., Liu et al., Godard et al., Kuznietsov et al., Yang et al., SDNet). The ground truth depth maps are interpolated from the sparse LIDAR measurements for better visualization. Our approach retrieves the depths of small objects and fine structures better than the other methods.

Although the focus of SDNet is the estimation of depth, and the semantic segmentation has consequently not been purposefully optimized but only used to learn more meaningful features, SDNet also achieves state-of-the-art results in this benchmark.

In figure 6, sample results are shown on this test dataset. It can be seen that the results of semantic segmentation and depth estimation correlate directly with each other. For example, if a pixel is assigned to a wrong semantic class, then the depth prediction is mostly incorrect as well. If, on the other hand, the depth or the semantic class has been classified correctly, then in most cases the other output is correct, too.

4.3 Cityscapes Dataset

Both outputs of SDNet were also evaluated on CS to show the generalization capability of SDNet. Unfortunately, there is no benchmark for depth estimation, which is why the results could only be assessed qualitatively. In figure 7, the estimated depth maps of SDNet and [14] are shown for CS. One can clearly see that the SDNet depths are more detailed than those of Kuznietsov et al. For example, the depths of the persons in front of the Brandenburg Gate and of the gate itself are retrieved well by SDNet, whereas the depths of [14] are noticeably more blurry. Aside from the depths, the semantic segmentation of SDNet is also shown in this figure. Qualitatively, the predicted semantics are correct and good, with a few exceptions; this is also confirmed by the quantitative results on CS (see supplementary material).


Method            SILog [100·log(m)]   SRD [%]   ARD [%]   iRMSE [1/km]
DORN [6]          11.67                2.21      9.04      12.23
VGG16-UNet [10]   13.41                2.86      10.06     15.06
DABC [17]         14.49                4.08      12.72     15.53
SDNet             14.68                3.90      12.31     15.96
APMoE [13]        14.74                3.88      11.74     15.63
CSWE [16]         14.85                3.48      11.84     16.38
DHGRL [33]        15.47                4.04      12.52     15.72

Table 2: Official results of the KITTI depth prediction benchmark [27] (as of May 1, 2019). This table includes only already published methods. If several results of one method are ranked, only the best is listed.

Fig. 5: D1 error of the SDNet depth estimates with respect to the ground truth on three exemplary images from the KITTI depth prediction test dataset. The errors were computed by KITTI; blue represents a small deviation, red a large one.


4.4 Ablation Studies

We also perform ablation studies on the KD validation data to empirically show that our contributions result in verifiable improvements. These experiments cover three aspects: the loss function, the discretization levels, and the architecture. The results are shown in table 4. As the baseline configuration of SDNet, a ResNet-50 with group normalization layers is chosen as encoder, and the proposed ordinal BCE loss from equation 5 is used; the CNN is thus trained only on the depths, and the semantic loss is ignored completely. The depth interval [2 m, 80 m] is divided into M_d = 128 classes.

The performance of the network increases if the semantic loss term is added. The network can then learn superior features that do not depend solely on the depth but also on the semantics which, as shown here empirically, leads to better performance. In addition, the ordered classification of the depths is superior to a regression approach with an MAE loss.

The depth estimation achieves its best results when the depth interval is divided into 128 classes. Although the classification accuracy is higher with fewer classes, the prediction results get worse, because the discretization error increases as the number of classes decreases. On this basis, 128 classes proved to be the best trade-off.


Method           IoU class   iIoU class   IoU cat.   iIoU cat.
SegStereo [31]   59.10       28.00        81.31      60.26
SDNet            51.14       17.74        79.62      50.45
APMoE [13]       47.96       17.86        78.11      49.17

Table 3: Official results of the KITTI pixel-level semantic segmentation benchmark [1] (as of May 1, 2019). This list includes only already published methods.

Fig. 6: Qualitative results of semantic segmentation and depth estimation on examples from the KS test dataset.


Replacing the ResNet-50 encoder by a ResNet-101 was expected to increase performance, because even better features could be extracted. However, the evaluation shows exactly the opposite. A likely reason is that larger networks are generally harder to train; adjustments to the training protocol would have been required (e.g., an additional warm-up phase or a tuned learning rate), which we did not make. Additionally, the choice of normalization method was evaluated. By default, the ResNets use batch normalization layers, but these proved disadvantageous for small batch sizes. Replacing them with group normalization layers had a positive effect on all metrics.

5 Summary & Conclusion

In this paper, we propose the novel SDNet for the simultaneous prediction of pixel-wise depths and semantic labels from a single monocular image. The architecture of SDNet is based on the DeepLabv3+ model, which we have extended. The depths are determined by ordered classes rather than the classic regression-based approach. In contrast to other methods, the input image is also segmented semantically.


Fig. 7: Qualitative results of the semantic segmentation and the depth estimation on examples from the CS test dataset. The first column depicts the input image and the second the predicted semantics. The last two columns show the depth maps of [14] (third column) and SDNet (last column).

Method               SILog [100·log(m)]   ARD [%]   RMSE [m]   iRMSE [1/km]   Accuracy [%]
sem. loss (λ = 10)   14.83                9.39      3.11       12.17          20.10
MAE                  18.80                12.11     4.03       15.30          -
M_depth = 96         16.04                10.88     3.30       13.28          22.50
M_depth = 160        15.57                10.06     3.23       12.87          15.50
BatchNorm            15.84                10.31     3.42       12.77          18.95
ResNet-101           16.25                10.89     3.44       13.12          18.24
baseline             15.38                9.92      3.30       12.55          20.50

Table 4: Ablation studies on the KD validation dataset for different configurations. As baseline, we use SDNet without the semantic loss (λ = 0).

These two modifications turn out to be beneficial, as the CNN learns semantically richer features, resulting in significantly better results.

In a further stage, a new error or uncertainty measure could be developed on the basis of the two outputs, because prediction errors mostly occur in both predictions at once; this correlation could feed such a measure. Another improvement would be to use a stronger encoder than ResNet-50, although this also implies tuning the training hyper-parameters, which by itself could bring improvements. Nevertheless, SDNet yields state-of-the-art results for both semantic segmentation and depth estimation on various datasets.

Acknowledgement: This project (HA project no. 626/18-49) is financed with funds of LOEWE Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz, Förderlinie 3: KMU-Verbundvorhaben (State Offensive for the Development of Scientific and Economic Excellence).

We also gratefully acknowledge the support of NVIDIA Corporation with thedonation of the Titan Xp GPU used for this research.


References

1. Alhaija, H., Mustikovela, S., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)
2. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: European Conference on Computer Vision (ECCV). pp. 833–851 (2018)
3. Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: International Joint Conference on Neural Networks. pp. 1279–1284 (2008)
4. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3213–3223 (2016)
5. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS). pp. 2366–2374 (2014)
6. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2002–2011 (2018)
7. Garg, R., Kumar BG, V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV). pp. 740–756 (2016)
8. Gidaris, S., Komodakis, N.: Detect, replace, refine: Deep structured prediction for pixel wise labeling. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7187–7196 (2017)
9. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6602–6611 (2017)
10. Guo, X., Li, H., Yi, S., Ren, J., Wang, X.: Learning monocular depth by distilling cross-domain stereo networks. In: European Conference on Computer Vision (ECCV). pp. 484–500 (2018)
11. Hirschmüller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)
12. Knöbelreiter, P., Reinbacher, C., Shekhovtsov, A., Pock, T.: End-to-end training of hybrid CNN-CRF models for stereo. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1456–1465 (2017)
13. Kong, S., Fowlkes, C.: Pixel-wise attentional gating for parsimonious pixel labeling. In: Winter Conference on Applications of Computer Vision (WACV) (2019)
14. Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2215–2223 (2017)
15. Ladický, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 89–96 (2014)
16. Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition 83, 328–339 (2018)
17. Li, R., Xian, K., Shen, C., Cao, Z., Lu, H., Hang, L.: Deep attention-based classification network for robust depth prediction. CoRR arXiv, 1807.03959 [cs.CV] (2018)
18. Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)
19. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)
20. Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 716–723 (2014)
21. Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4040–4048 (2016)
22. Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Ordinal regression with multiple output CNN for age estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4920–4928 (2016)
23. Pang, J., Sun, W., Ren, J.S., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: International Conference on Computer Vision (ICCV) - Workshop. pp. 887–895 (2017)
24. Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (NIPS). pp. 1161–1168 (2005)
25. Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)
26. Schneider, L., Cordts, M., Rehfeld, T., Pfeiffer, D., Enzweiler, M., Franke, U., Pollefeys, M., Roth, S.: Semantic Stixels: Depth is not enough. In: Intelligent Vehicles Symposium (IV). pp. 110–117 (2016)
27. Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: International Conference on 3D Vision (3DV) (2017)
28. Žbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)
29. Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vision (ECCV). pp. 3–19 (2018)
30. Xie, J., Girshick, R., Farhadi, A.: Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In: European Conference on Computer Vision (ECCV). pp. 842–857 (2016)
31. Yang, G., Zhao, H., Shi, J., Deng, Z., Jia, J.: SegStereo: Exploiting semantic information for disparity estimation. In: European Conference on Computer Vision (ECCV). pp. 660–676 (2018)
32. Yang, N., Wang, R., Stückler, J., Cremers, D.: Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In: European Conference on Computer Vision (ECCV). pp. 835–852 (2018)
33. Zhang, Z., Xu, C., Yang, J., Tai, Y., Chen, L.: Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition 83, 430–442 (2018)