learning-based filter selection scheme for depth image super resolution

1051-8215 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Seehttp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI10.1109/TCSVT.2014.2317873, IEEE Transactions on Circuits and Systems for Video Technology

IEEE 1

Learning-based Filter Selection Scheme for DepthImage Super Resolution

Seung-Won Jung, Member, IEEE, Ouk Choi*, Member, IEEE

Abstract—Depth images that have the same spatial resolutionas color images are required in many applications such as multi-view rendering and three-dimensional texture modeling. Since adepth sensor usually has poorer spatial resolution compared to acolor image sensor, many depth image super-resolution methodshave been investigated in the literature. With an assumption thatno one super-resolution method can universally outperform theother methods, in this paper, we introduce a learning-based selec-tion scheme for different super-resolution methods. In our casestudy, three distinctive mean-type, max-type, and median-typefiltering methods are selected as candidate methods. In addition,a new frequency domain feature vector is designed to enhancethe discriminability of the methods. Given the candidate methodsand feature vectors, a classifier is trained such that the bestmethod can be selected for each depth pixel. The effectivenessof the proposed scheme is first demonstrated using the syntheticdataset. The noise-free and noisy low-resolution depth imagesare constructed, and the quantitative performance evaluation isperformed by measuring the difference between the ground-truthhigh-resolution depth images and the resultant depth images. Theproposed algorithm is then applied to real color and time-of-flight depth cameras. The experimental results demonstrate thatthe proposed algorithm outperforms the conventional algorithmsboth quantitatively and qualitatively.

Index Terms—depth image, feature vector, machine learning,super resolution, time-of-flight.

I. INTRODUCTION

ACQUISITION or estimation of a high quality depthimage has been considered as one of the most important

issues for enabling three-dimensional (3-D) image and videoapplications. Recent stereo matching techniques [1], [2] canproduce high-quality depth images in near real-time usinggraphics hardware, and commodity active depth sensors suchas time-of-flight (ToF)-based [3] and structured-light-baseddepth cameras [4] can capture depth images in real-time withsufficient quality for hand gesture and body pose estimation.Since many vision applications, such as 3-D reconstruction andmulti-view rendering, require color images as well as depth

Manuscript received , 2014; revised.Copyright (c) 2012 IEEE. Personal use of this material is permitted.

However, permission to use this material for any other purposes must beobtained from the IEEE by sending a request to [email protected].

S.-W. Jung is with the Department of Multimedia Engineering, DonggukUniversity, 30, Pildong-ro 1-gil, Jung-gu, Seoul, Republic of Korea. Tel: +8210 2231 4853, E-mail: [email protected]

Corresponding author: O. Choi is with the Multimedia Processing Lab,Samsung Advanced Institute of Technology (SAIT), 130, Samsung-ro,Yeongtong-gu, Suwon-si, Gyeonggi-do, 443-803, Republic of Korea. Tel: +8210 3668 4589, E-mail: [email protected]

This research was supported by the MSIP(Ministry of Science, ICT andFuture Planning), Korea, under the ITRC(Information Technology ResearchCenter) support program (NIPA-2014-H0301-14-1021) supervised by theNIPA(National IT Industry Promotion Agency).

images, several hybrid camera structures including color-plus-depth [5]–[7], stereo-plus-depth [8], [9], and multi-color-plus-multi-depth [10] have been presented. In particular, the color-plus-depth structure is found to be desirable for practicalimplementation due to its compactness [4], [11].

One typical problem of the color-plus-depth structure is thata depth sensor has poorer spatial resolution compared to animage sensor. Thus, a technique of increasing spatial resolutionof a depth image, called depth image super resolution, isrequired. To this end, various depth image super-resolutionalgorithms have been proposed in the literature [12]–[17]. In[12], upsampling filter coefficients are obtained in a bilateral-filtering manner using a color image. Although sharp depthedges aligned with color edges can be obtained, an annoyingtexture copying problem can occur in color-textured regions.In [13], an adaptive median filter and a post-bilateral filter areused to up-sample the depth image. This method can producesharp depth edges without the texture copying problem, butobtained depth edges are not accurately aligned with coloredges since color information is not used in depth imagesuper resolution. In [14], a local histogram is constructedfor each pixel to-be-interpolated using neighboring color anddepth pixels, and the histogram is treated as a probabilitydensity function (pdf). The position maximizing the pdf is thenchosen as a target depth value. However, we found that such achoice does not always guarantee the best super-resolution per-formance, and thus artifacts are sometimes produced aroundinterpolated depth edges.

The aforementioned and many other depth image super-resolution techniques attempt to find one effective filteringscheme [12], [14]–[16] or combine several filtering schemes ina heuristic manner [13], [17]. We claim that no one filter canuniversally outperform the other filters and heuristic filter com-bination methods always leave room for improvement. Fromthis viewpoint, we approach depth image super resolutionusing a machine learning framework. Due to the availabilityof color-plus-depth image database [18], training samplesconsisting of low resolution (LR) depth images and theircorresponding high resolution (HR) color and depth imagepairs can be obtained. Given training samples and multiplecandidate filters, we can observe which filter performs bestand which factor causes a certain filter performs best for eachpixel. To this end, a histogram is generated in a similar mannerto [14], and a frequency-domain feature vector, which welldescribes a pattern of filter selection, is then extracted fromthe histogram. The support vector machine (SVM) classifieris finally built using the training samples and feature vectors.When an LR depth image and an HR color image are



IEEE 2

given as an input, an HR depth image is reconstructed byselecting a filter for each pixel using the trained classifier.The effectiveness of the proposed scheme is first demonstratedusing the synthetic dataset. Given noise-free and noisy low-resolution depth images, the proposed algorithm is performedand the super-resolution results are compared with the ground-truth high-resolution depth images. The experimental resultsshow that the proposed algorithm quantitatively outperformsthe conventional methods [14], [24]. The proposed algorithmis then applied to actual color and ToF depth cameras, andthe experimental results exhibit that the proposed algorithmproduces sharp depth edges without annoying artifacts.

Note that learning-based frameworks have been extensivelystudied for color image [19]–[21], color video [22], [23],and depth image super resolution [25]. The most widelyused learning-based super-resolution framework is based ongeneration of a database that consists of LR patches and theircorresponding HR patches [19]–[22], [25]. For each input LRpatch, one or more closest LR patches are found from thedatabase, and their corresponding HR patches are then usedto reconstruct an HR image. In particular, the learning-baseddepth image super-resolution technique [25] builds a Markovrandom field (MRF) framework such that HR patches aresmoothly connected with each other and their correspondingLR patches are similar to input LR patches. However, theacquired performance of [25] is not as good as the performanceof the methods that use both color and depth images [12], [14],[24]. The classification-based color video super-resolutiontechnique [23] is most closely related to our proposed frame-work. In [23], super-resolution results obtained using spatial-only and temporal-only information are selectively combinedfor each color pixel using a trained classifier. On the contrary,we train a classifier to select an effective filter for each pixelusing color and depth images. Moreover, the conventionalmethod [23] simply concatenates available information aroundthe pixel to-be-interpolated, whereas we design a frequencydomain feature vector that can well discriminate differentcharacteristics of filters.

The rest of the paper is organized as follows. In SectionII, the proposed depth image super-resolution algorithm isdescribed. Experimental results are presented in Section III,followed by the conclusion in Section IV.

II. PROPOSED ALGORITHM

In our work, it is assumed that color and depth imagesare aligned. When two separate color and depth sensors areused, a typical solution is to transform the pixel coordinate ofthe depth image to that of the color image [5], [7], [14]. Aproblem of depth image super resolution is then changed to aproblem of finding depth values at the color pixel coordinates.Recent sensor architectures [26], [27] enable acquisition ofcolor and depth images through a single lens, and thus thecoordinate transform is not required in such architectures. It isalso assumed that depth values are represented by 256 discretelevels.

This section first illustrates the necessity of the use ofdifferent filters for depth image super resolution. Our feature

extraction method for filter classification is then detailed. Last,a training and testing framework of the proposed algorithm ispresented.

A. Filter Candidates

Let C and DL denote an HR color image and its corre-sponding LR depth image, respectively. By projecting DL tothe color pixel coordinate [5], [7], [14], an HR depth imagewith partially filled depth pixels is obtained, which is denotedas D. Our objective for depth image super resolution is toapply a spatially varying filter to D such that all pixels in Dhave accurate depth values.

We note that many conventional depth upsampling algo-rithms [12], [14]–[16] can be considered as a problem offinding a depth value d at a pixel p of D from the followingweight distribution,

H(p, d) =∑

q∈N(p)

GS (p−q)GC (C(p)−C(q))GD (d−D(q)),

(1)where N(p) represents a set of neighboring pixels of p thathave depth values. GS , GC , and GD denote spatial, color, anddepth Gaussian functions, where means are 0 and standarddeviations are σS , σC , and σD, respectively. In particular,the Euclidean distance between pixel coordinates is used tomeasure the spatial distance, and the mean absolute difference(MAD) is used to measure the distance between two RGBcolor vectors. The weight distribution H(p, d) is obtained forall possible d values, and it is then normalized such that∑d

H(p, d) = 1.

In the conventional joint bilateral average filter (JBAF) [12],the depth value is obtained as

fJBAF (p) =∑d

H(p, d)d , (2)

where the depth Gaussian function is reduced to the deltafunction by setting σD = 0. Since the filter coefficients ofthe JBAF are guided by the color difference between pixels,the depth edges of D can be well aligned to the color edgesof C. However, unnecessary blur appears in D due to a non-negligible effect of small weights in summation of (1) [14].

To remedy the problem of the JBAF, the weighted modefilter (WMF) [14] is defined as

fWMF (p) = arg maxd

H(p, d). (3)

In other words, the WMF selects an output depth valuethat maximizes the weight distribution. Due to the choiceof non-linear operation in filtering, unnecessary blur can besignificantly reduced. However, the depth edges obtained bythe WMF tend not to be accurately aligned to the color edges.It is found in [14] that the JBAF and WMF are effective inL2 norm and L1 norm minimization, respectively.

Meanwhile, median filters have also been used in color [28]and depth image super resolution [13]. For non-integer valuedweights, a median filtered value is obtained as

fJBMFG (p) = argmax

dm

{dm−1∑d=0

H(p, d) <1

2

}. (4)



IEEE 3

Fig. 1. Example of local patches with different characteristics. (a) HRcolor image Art and its corresponding (b)-(d) depth patches (left) and theirweight distributions (right). For each weight distribution, the max, median,and mean positions are marked by ‘o’ with annotations and the ground-truthdepth position is marked by ‘x’.

Please refer to [29] for the derivation of the above weightedmedian filter. Since the color difference is used in depth imagefiltering, we call this filter the joint bilateral median filter(JBMF).

Fig. 1 illustrates a situation in which different filters performdifferently. Fig. 1(a) shows the HR color image Art [18] andFigs. 1(b)-(d) show the three examples of LR depth patches.In this example, the upsampling ratio is set as 2, and thusthree-fourths of pixels in D do not have a depth value. Whendotted center pixels of Figs. 1(b)-(d) are to be filled, the weightdistributions are computed as (1). Here, the standard deviationvalues of (1) are set according to [14]. Given HR color anddepth images as a training input, we can examine which filterfinds the closest depth value to the ground-truth depth value

and when a certain filter performs better than other filters.Figs. 1(b)-(d) correspond to the cases when the JBAF, WMF,and JBMF perform best among the three filters, respectively.When there exist noisy but concentrated peaks in the weightdistribution as shown in Fig. 1(b), a mean-type filter such asthe JBAF is found to be effective. The WMF outperforms theother filters when there are two separated peaks but one peakdominates the weight distribution as shown in Fig. 1(c). Whenthere are multiple non-negligible peaks as shown in Fig. 1(d),the JBMF tends to outperform the other filters.

We have observed that filter selection depends on thecharacteristic of the weight distribution. However, it is nottractable to determine a filter selection rule from few ob-servations. We thus adopt a learning-based framework suchthat a filter classifier is trained using a large database. Itshould be noted that we consider the JBAF, WMF, andJBMF as filter candidates because they have rather distinctivecharacteristics. The proposed learning-based framework caninclude any additional filters if necessary. In addition, the sameweight distribution of (1) is used for the three filters. Sincethe depth Gaussian function can take the reliability of depthsignals into account [14], the JBAF in our experiment alsouses the depth Gaussian function with σD 6= 0. A new designof weight distribution is out of concern for our work.

B. Feature Extraction

Feature extraction plays an important role in a learning-based framework. The objective here is to define a featurevector that can well discriminate different characteristics ofthe filters. In Section II-A, we observed that variation of theweight distribution is related to filter selection. To this end, thefeature vector is extracted from the weight distribution withconsidering two aspects.

First, absolute position on the weight distribution doesnot make a difference in filter selection. For instance, non-zero-valued positions range from 163 to 177 as shown inFig. 1(b). In this case, the selection of the JBAF as the bestfilter is not changed whether the range is shifted to left orright direction. Second, the existence of multiple separatedmodes in the weight distribution makes a difference in filterselection as exemplified in Figs. 1(b)-(d). Thus, intervals ofnon-zero-valued position carry significant information for filterselection.

By considering the above two aspects, two vectors vw andvs are defined as

vw = {H(d)|H(d) 6= 0} , (5)

vs =

{de − ds

∣∣∣∣ H(ds) 6= 0, H(de) 6= 0,H(d) = 0, ds < d < de

}, (6)

where the pixel coordinate p is omitted from H(p, d) for thesake of brevity. vw and vs consist of the weight values corre-sponding to non-zero-valued position of the weight distributionand the intervals between two consecutive non-zero-valuedpositions, respectively. Let N and N − 1 be the length of vw

and vs, respectively. Due to the definition of vw and vs, Nvaries for each pixel depending on the sparsity of the weight



IEEE 4

Fig. 2. Distribution of (a) spatial domain and (b) frequency domain featurevectors. The red circles, green crosses, and blue dots correspond to the featurevectors of the JBAF, WMF, and JBMF, respectively. 100 feature vectors arerandomly chosen from each training set. The x-axis and y-axis are the firsttwo principal components obtained by the LDA.

distribution. We thus define a fixed length L and modify thevector as follows:

vw=

[vw(1), ···,vw(N−1), (L−N+1)♦vw(N)] , if N<L,vw, if N=L,[vw(i)] , R (vw(i))≤L, if N > L,

(7)where ♦ is a repetition operator defined as

k♦X = [ XX · · ·X︸︷︷︸k times

], (8)

and R(vw(i)) computes a ranking of vw(i) out of all elementsin vw in descending order, e.g., R(vw(i)) = 1 when vw(i) isthe largest value. More specifically, if N < L, the very endvalue vw(N) is repeated to make the length of vw equal toL. Otherwise, if N > L, L highest values are extracted fromvw. vs of length L is defined in a similar manner. Due toa slow-varying nature of depth signals, only few bins in theweight distribution have non-negligible values, and thus weempirically set L as 16.

Frequency domain features are effective in representingvariation of signals [30]–[32]. In order to make feature vectorsmore discriminative to a filter classifier, we also adopt afrequency domain feature extraction method. Instead of usingvw and vs directly, the discrete cosine transform (DCT) isapplied to vw and vs as follows:

Vw = DCT(vw) , Vs = DCT(vs) , (9)

where Vw and Vs represent the DCT domain vectors corre-sponding to vw and vs, respectively. In order to reduce theeffect of extrapolated values in (7), DC components of Vw

and Vs are then removed. The resultant feature vector V isobtained as

V =[Vw(2), ···,Vw(L),Vs(2), ···,Vs(L)

]. (10)

The feature vector V is used in our learning-based depthimage super resolution. Fig. 2 shows the effectiveness ofour frequency domain feature extraction. From the lineardiscriminant analysis (LDA) [33], we can observe from thefirst two principal components that the feature vectors are sep-arated better in the frequency domain than the spatial domain.The classification accuracy is also measured by applying 10-fold cross validation with the SVM to the acquired training

Fig. 3. Block diagram of the training phase.

Fig. 4. Block diagram of the test phase.

samples. The correct classification rates are obtained as 72.4%and 80.0% when the spatial and frequency domain featurevectors are used, respectively.

C. Overall Framework

The proposed learning-based framework consists of trainingand test phases. Fig. 3 shows a training phase for obtaininga filter classifier. Given HR color and depth image pairsas a training database, LR depth images are obtained bysubsampling the pixels in HR depth images. The JBAF, WMF,and JBMF are then applied to all the pixels to-be-interpolatedusing a pair of HR color and LR depth images. Since wehave ground-truth HR depth images in the training phase, wecan examine for each pixel which filter acquires a depth valueclosest to the known depth value. To pool the feature vectors,three empty sets are first generated for the JBAF, WMF, andJBMF. If one filter outperforms the other filters for a certainpixel, its feature vector V is included to the correspondingset. However, if more than one filter results in the same depthvalue, the feature vector is neglected to exclude ambiguoussamples. After collecting feature vectors for the three filtersusing all training color and depth image pairs, the SVM [34]is applied to obtain the filter classifier.

Fig. 4 shows a test phase for obtaining an HR depth image.When LR depth and HR color images are given as an input, theLR depth image is first projected to the color pixel coordinate.



IEEE 5

For each to-be-interpolated pixel, the weight distribution isobtained as (1). The feature vector is then extracted from theweight distribution as described in Section II-B. The trainedfilter classifier finally determines which filter needs to be usedfor each pixel. By applying the above procedure for all pixelsthat do not have depth values, the HR depth image can bereconstructed.

For training and classification of the filters, the LIBSVM isused with default parameter settings [34]. The training colorand depth images are collected from the Middlebury 2005 and2006 datasets [18]. HR depth images are downsampled by afactor of 2, and the feature vectors are obtained from HR colorand downsampled depth images. When the downsamplingfactor is larger than 2, we adopt a multiscale color measure(MCM) [14] which interpolates the depth image in a coarse-to-fine manner. Thus, the depth images downsampled by a factorof 2 are enough for collecting training samples.

III. EXPERIMENTAL RESULTS

We first quantitatively evaluated the performance of theproposed algorithm using the Middlebury 2001 and 2003dataset [18]. We then applied the proposed algorithm to a real-istic environment and qualitatively evaluated the performance.The standard deviations σS , σc, and σD of (1) were chosen as7, 6, and 2.90, respectively. The detailed descriptions on theseparameter settings can be found in [14].

Five methods were compared with the proposed algorithm.The first three methods were the separate use of the WMF,JBAF, and JBMF, respectively. The fourth method was basedon selecting the best filter from the WMF, JBAF, and JBMF.More specifically, the best filter that resulted in the closestvalue to the ground-truth HR depth value was selected foreach pixel. The upper limit performance of the proposedalgorithm can be measured by this manner, and we hereaftercall the fourth method the ideal method. Last, the performanceof the conventional depth super-resolution algorithm, calledanisotropic total generalized variation (ATGV) algorithm [24],was also compared with the above methods, where the author-provided software was used to obtain the results.

A. Synthetic data

Four widely-used test images shown in Fig. 5, which werenot included in the training images, were used for evaluatingthe performance of the algorithms. Several missing pixelsin the ground-truth depth images as shown in Figs. 5(f)and (h) were excluded at the depth image super-resolutionstage. Note that such pixels were not used at the quantitativeevaluation [18]. HR depth images were then downsampledby a factor of 4 and 8, respectively. In addition, noisy LRdepth images were also used as input depth images in orderto take realistic environments into account. In particular, wesimulated the noise model of the time-of-flight (ToF) depthcamera, which suffers from the low-resolution problem moreseverely than the structured-light-based depth cameras. It iswell-known that the standard deviation of the ToF depth noiseis inversely proportional to the strength of the light returnedto the image sensor (typically, infra-red light is emitted and

Fig. 5. Original color and depth image pairs. (a)-(b) Tsukuba (384×288),(c)-(d) Venus (434×383), (e)-(f) Teddy (450×375), (g)-(h) Cones (450×375).

Fig. 6. Synthetic noisy images. From (a) to (d), the left and right imagescorrespond to the downsampling factor of 4 (left) and 8 (right), respectively.



IEEE 6

TABLE ITHE PROPORTION (%) OF BAD MATCHING PIXELS FOR VARIOUS DEPTH IMAGE SUPER-RESOLUTION METHODS, WHERE NOISE-FREE DEPTH IMAGES

WERE USED AS INPUTS.

Ratio Algorithm Tsukuba Venus Teddy Conesall disc all disc all disc all disc

4

ATGV [24] 2.75 12.9 0.49 5.48 3.97 10.2 3.28 6.81JBAF 2.34 10.5 0.58 5.84 2.84 7.36 3.02 6.97JBMF 1.14 5.40 0.21 2.05 2.43 5.95 1.83 3.93WMF 1.13 5.35 0.20 2.00 3.38 8.44 1.69 3.53

Proposed 1.11 5.32 0.20 1.99 2.39 5.79 1.62 3.35Ideal 1.06 4.98 0.18 1.82 1.74 4.29 1.49 3.16

8


Proposed 2.11 9.89 0.32 3.38 5.27 13.8 3.15 7.19Ideal 2.02 9.46 0.31 3.24 4.54 11.6 2.71 6.37

TABLE IITHE PROPORTION (%) OF BAD MATCHING PIXELS FOR VARIOUS DEPTH IMAGE SUPER-RESOLUTION METHODS, WHERE NOISY DEPTH IMAGES WERE

USED AS INPUTS.

Ratio Algorithm Tsukuba Venus Teddy Coneall disc all disc all disc all disc

4


Proposed 1.49 6.12 1.84 3.21 7.04 13.3 7.45 9.74Ideal 1.32 5.31 1.52 2.49 6.42 12.5 6.42 8.27

8


Proposed 3.46 11.2 1.81 5.11 9.51 19.8 9.54 15.1Ideal 2.83 10.8 1.49 4.38 8.55 18.2 8.08 12.3

TABLE IIITHE PROPORTION (%) OF SELECTED FILTERS FOR THE PROPOSED AND IDEAL METHODS.

Type Ratio Algorithm Tsukuba Venus Teddy Cone(JBAF, JBMF, WMF) (JBAF, JBMF, WMF) (JBAF, JBMF, WMF) (JBAF, JBMF, WMF)

Noise-free4 Proposed (23.1, 5.6, 71.3) (21.0, 9.7, 69.3) (14.0, 7.5, 78.5) (21.9, 5.3, 72.8)

Ideal (19.5, 5.5, 75.0) (9.0, 9.8, 81.2) (19.7, 10.2, 70.1) (14.3, 8.8, 76.9)

8 Proposed (12.0, 10.1, 77.9) (15.3, 4.7, 80.0) (10.8, 9.1, 80.1) (25.5, 12.3, 62.2)Ideal (20.8, 5.8, 73.4) (14.6, 6.9, 78.5) (17.6, 8.1, 74.3) (15.3, 8.7, 76.0)

Noisy4 Proposed (27.6, 12.3, 60.1) (21.1, 15.6, 63.3) (33.3, 14.1, 52.6) (33.4, 13.8, 52.8)

Ideal (21.1, 13.3, 65.6) (23.3, 10.6, 66.1) (23.6, 14.9, 61.5) (24.5, 13.1, 62.4)

8 Proposed (35.1, 9.9, 55.0) (20.2, 11.9, 67.9) (32.4, 8.8, 58.8) (31.1, 10.3, 58.6)Ideal (27.8, 11.2, 61.0) (22.1, 11.8, 66.1) (27.2, 10.9, 61.9) (23.6, 11.2, 65.2)

then returned) [35]. By considering the intensity componentof the color image in the HSI color space as the amount ofthe returned light, the intensity-dependent noise was added tothe noise-free depth images. More specifically, the Gaussiannoise with the standard deviation of σN (x, y) was added toeach pixel at (x,y), where σN (x, y) is defined as

σN (x, y) =k

CI(x, y). (11)

CI (x,y) represents the intensity value at the pixel coordinate(x,y) of the color image in the HSI color space. For simulatingthe ToF depth camera noise, given disparity values were firstconverted to depth values and the noise was added in thedepth-space [36]. The noisy depth values were then convertedback to the noisy disparity values. The constant k in (11)was fine-tuned such that the RMSE value between the noise-free and noisy disparity images became approximately 5

(within the approximation tolerance of 0.01). Fig. 6 showsour synthetic noisy images.

The performance evaluation results are given in Tables Iand II. The scores were obtained by counting the percentof bad matching pixels, where the absolute disparity errorwas greater than 1 pixel. The pixels in all and disc (neardepth discontinuities) regions were used in the evaluation,respectively. The detailed information about quantificationcan be found in [18]. The proposed algorithm quantitativelyoutperformed the ATGV algorithm and the separate use of theWMF, JBAF, and JBMF. In particular, the proposed algorithmstill outperformed the separate use of the filters when thenoisy depth images were used, which indicates that our filterclassifier is robust to the depth noise. However, there were stillnon-negligible performance gaps between the proposed andideal methods. We empirically found that the three filters usedin our experiment were not perfectly separable. In particular,



IEEE 7

Fig. 7. Magnified subregions for HR depth images, where noise-free depth images downsampled with a factor of 4 were used as inputs. (a) ATGV, (b)JBAF, (c) JBMF, (d) WMF, (e) proposed method, and (f) original depth.

Fig. 8. Magnified subregions for HR depth images, where noisy depth images downsampled with a factor of 4 were used as inputs. (a) input (upsampledusing the nearest neightborhood interpolation for visualization) (b) ATGV, (c) JBAF, (d) JBMF, (e) WMF, and (f) proposed method.



IEEE 8

Fig. 9. The prototype color/depth camera [26]. The depth and colorimages are captured at a single sensor with the resolution of 480×270 and1920×1080, respectively.

since the mean and median outputs of the JBAF and JBMFwere often similar as shown in Fig. 2, perfect separationbetween the two filters was hardly attainable at the trainingphase.

Table III shows the proportion of the selected filters for theproposed and ideal methods. For the proposed method, theproportion was calculated only if the depth value obtained bythe selected filter was different from the depth values obtainedby the other two filters. Similarly, for the ideal method, weonly considered the case when one filter outperformed theother two filters. Note that two or even three filters can resultin the same depth value, typically in the region with smoothlyvarying color/depth values. It can be seen from Table III thatthe WMF was selected dominantly but the other two filtershad non-negligible importance. This result is consistent withTables I and II, since the WMF outperformed the JBAF andJBMF, but our selective use of the filters outperformed theseparate use of the filters. The ratio of filter selection of theproposed algorithm was not largely deviated from that of theideal method.

Fig. 7 shows the resultant HR depth images for the noise-free LR depth images. For the sake of visualization, onlysubregions in depth discontinuities were magnified. We canobserve that the proposed algorithm produced HR depthimages with high accuracy. The JBAF produced unnecessaryblur, whereas the WMF often resulted in artifacts near depthedges since it always chose a depth value that maximized theweight distribution. The sharpness of depth edges obtainedby the JBMF appeared in-between that obtained by the JBAFand WMF. The ATGV algorithm also exhibited non-negligibleartifacts around depth edges.

Fig. 8 shows the results on the noisy LR depth images.Compared to the noisy input subregions as shown in Fig. 8(a),all the algorithms improved the depth quality significantly. Inparticular, the ATGV algorithm performed well in represent-ing the planar surfaces in V enus but resulted in inaccuratedepth edges in Tsukuba and Teddy. In overall, the proposedalgorithm produced less artifacts at the depth edges comparedto the other algorithms.

Fig. 10. Test color (1920×1080) and depth (480×270) images captured bythe camera shown in Fig. 9.

B. Real data

We applied the proposed algorithm to real-world color anddepth images acquired by our previously developed prototypecamera as shown in Fig. 9 [26]. The camera can captureboth color and depth images using a single lens and a singlesensor via a time-division multiplexing scheme. Specifically,every odd frame obtains a depth image via a ToF principle,whereas every even frame acquires a color image. Referto [26] for the details about the prototype camera. Fig. 10shows the input color and depth image pairs. It should benoted that no calibration between color and depth imageswas required since the two images were inherently aligned.Other camera configurations which consist of separate colorand depth sensors may require a careful geometric calibrationscheme for misaligned color and depth images.

Fig. 11 shows the resultant depth images. For the concisesubjective quality evaluation, we magnified subregions of thedepth images obtained by the ATGV, WMF, and proposedmethod. The dynamic range of the subregions was extended toenhance the visibility. In overall, the three methods producedHR depth images with comparable visual quality, but the WMFoften resulted in artifacts around depth edges. The ATGValgorithm performed well in reducing the depth noise butyielded uneven depth edges. By selectively using the upsam-pling methods via filter classification, the proposed methodcould produce less noisy depth edges. In the proposed method,the proportions of the JBAF, JBMF, and WMF were 36.9%,9.4%, and 53.7%, respectively, for Fig. 11(d) and 47.4%, 9.2%,and 43.4%, respectively, for Fig. 11(l).

IV. CONCLUSIONS

In this paper, we proposed a learning-based filter selectionframework for depth image super resolution. With an as-



IEEE 9

Fig. 11. The HR depth images obtained by (a), (e) the nearest neighborhood interpolation, (b), (j) ATGV, (c), (k) WMF, and (d), (l) proposed algorithm.(e)-(h) and (m)-(p) are the magnified subregions of (a)-(d) and (i)-(l), respectively. The results are best viewed in the electronic version.

sumption that no one super-resolution method can universallyoutperform the other methods, a classifier was trained suchthat an effective method could be selected for each pixel. Inaddition, a new frequency-domain feature vector was designedto enhance the discriminability of the different methods. Theeffectiveness of the proposed framework was demonstratedusing both synthetic and real-world color and depth images.

Several future works are being considered. First, we adoptedthe three representative filtering schemes, the JBAF, JBMF,and WMF, as candidate depth image super-resolution methods.Other super-resolution methods can be included to improvethe performance. Second, we considered depth image su-per resolution not depth video super resolution. In videoapplications, temporal consistency of depth super-resolutionresults would be an important issue. Therefore, a technique ofenforcing temporal consistency can be included for applyingthe proposed framework to depth video super resolution.

REFERENCES

[1] X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang, “On buildingan accurate stereo matching system on graphics hardware,” in Proc. ICCVWorkshops, 2011, pp. 467-474.

[2] J. Kowalczuk, E. T. Psota, and L. C. Perez, “Real-time stereo matching onCUDA using an iterative refinement method for adaptive support-weightcorrespondences,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no.1, pp. 94-104, Jan. 2013.

[3] R. Larsen , E. Barth and A. Kolb “Special issue on time-of-flight camerabased computer vision,” Comput. Vis. Image Understand., vol. 114, no.12, pp. 1317, 2010

[4] Z. Zhang, “Microsoft Kinect sensor and its effect,” IEEE Multimedia, vol.19, no. 2, pp. 4-10, 2012.

[5] F. Garcia, D. Aouada, B. Mirbach, B. Ottersten, “Real-time distance-dependent mapping for a hybrid ToF multi-camera rig,” IEEE J. Sel.Topics Signal Process., vol. 6, no. 5, pp. 425-436, Sep. 2012.

[6] S.-W. Jung, “Enhancement of image and depth map using adaptive jointtrilateral filter,” IEEE Trans. Circuis Syst. Video Technol., vol. 23, no. 2,pp. 258-269, Feb. 2013.

[7] C. Richardt, C. Stoll, N. A. Dodgson, H.-P. Seidel, and C. Theobalt,“Coherent spatiotemporal filtering, upsampling and rendering of RGBZvideos,” Computer Graphics Forum (Proceedings of Eurographics), vol.31, no. 2, May 2012.

[8] J. Zhu, L. Wang, R. Yang, J. E. Davis, and Z. Pan, “Reliability fusionof time-of-flight depth and stereo geometry for high quality depth maps,”IEEE Trans. Pattern Anal. Mach. Intell. vol. 33, no. 7, pp. 1400-1414,Jul. 2011

[9] V. Gandhi, J. Cech, and R. Horaud, “High-resolution depth maps based onTOF-stereo fusion,” in Proc. IEEE International Conference on Roboticsand Automation, 2012, pp. 4742-4749.

[10] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, andS. Thrun, “Multi-view image and ToF sensor fusion for dense 3Dreconstruction,” in Proc. ICCV Workshops, 2009, pp. 1542-1549.

[11] J. Russell and R. Cohn, SoftKinetic, Book on Demand, 2012.[12] J. Kopf, M. F. Chen, D. Lischinski, and M. Uyttendaele, “Joint bilateral

upsampling,” it ACM. Trans. Graph., vol. 26, no. 3, Jul. 2007.[13] K.-J. Oh, S. Yea, A. Vetro, and Y.-S. Ho, “Depth reconstruction filter and

down/up sampling for depth coding in 3-D video,” IEEE Signal Process.Lett., vol. 16, no. 9, pp. 747-750, Sep. 2009.

[14] D. Min, J. Lu, and M. N. Do, “Depth video enhancement based onweighted mode filtering,” IEEE Trans. Image Process., vol. 21, no. 3, pp.1176-1190, Mar. 2012.

[15] F. Garcia, B. Mirbach, B. Ottersten, F. Grandidier, and A. Cuesta, “Pixelweighted average strategy for depth sensor data fusion,” in Proc. IEEEInt. Conf. Image Process., 2010, pp. 2805-2808.

[16] Q. Yang, Spatial-depth super resolution for range images, in Proc. IEEEConf. Computer Vision and Pattern Recognition, 2007, pp. 1-8.

[17] D. Chan, H. Buisman, C. Theobalt, and S. Thrun, “A noise-aware filterfor real-time depth upsampling,” in Proc. ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications,pp. 1-12, 2008.

[18] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of densetwo-frame stereo correspondence algorithms,” Int. J. Comput. Vis., vol.47, no. 1, pp. 7-42, Apr.-Jun. 2002.

[19] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-



IEEE 10

resolution,” IEEE Comput. Graph. Appl., vol. 22, no. 2, pp. 56-65,Mar./Apr. 2002.

[20] X. Li, K. M. Lam, G. Qiu, L. Shen, and S. Wang, “Example-basedimage super-resolution with class-specific predictors,” J. Vis. Commun.Image R., vol. 20, pp. 312-322, Apr. 2009.

[21] P. P. Gajjar and M. V. Joshi, “New learning based super-resolution: useof DWT and IGMRF prior,” IEEE Trans. Image Process., vol. 19, no. 5,pp. 1201-1213, May 2010.

[22] D. Kong, M. Han, W. Xu, H. Tao, and Y. H. Gong, “Video super-resolution with scene-specific priors,” in Proc. British Machine VisionConference, 2006, pp. 549-558.

[23] K. Simonyan, S. Grishin, D. Vatolin, and D. Popov, “Fast video super-resolution via classification,” in Proc. IEEE Int. Conf. Image Process.,2008, pp. 349-352.

[24] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruther, and H. Bischof, “Imageguided depth upsampling using anisotropic total generalized variation,”in Proc. Int. Conf. Computer Vision, 2013, pp. 1-8.

[25] O. M. Aodha, N. D. F. Campbell, A. Nair, and G. J. Brostow, “Patchbased synthesis for single depth image super-resolution,” in Proc. Euro-pean Conference on Computer Vision, 2012, pp. 71-84. Software [Online].Available: http://visual.cs.ucl.ac.uk/pubs/depthSuperRes/

[26] S.-J. Kim, J. D. K. Kim, B. Kang, and K. Lee, “A CMOS imagesensor based on unified pixel architecture with time-division multiplexingscheme for color and depth image acquisition,” IEEE J. Solid-StateCircuits, vol. 47, no. 11, pp. 2834-2845, Nov. 2012.

[27] W. Kim, W. Yibing, I. Ovsiannikov, S. Lee, Y. Park, C. Chung, andE. Fossum, “A 1.5Mpixel RGBZ CMOS image sensor for simultaneouscolor and range image capture,” in Proc. IEEE International Solid-StateCircuits Conference Digest of Technical Papers, 2012, pp. 392-394.

[28] A. V. Nasonov and A. S. Krylov, “Fast super-resolution using weightedmedian filtering,” in Proc. Int. Conf. Pattern Recogn., 2010, pp. 169-172.

[29] L. Yin, R. Yang, M. Gabbouj, Y. Neuvo, “Weighted median filters: atutorial,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol.43, no. 3, pp. 157-192, Mar. 1996.

[30] C. Kreucher and S. Lakshmanan, “LANA: a lane extraction algorithmthat uses frequency domain features,” IEEE Trans. Robot. Autom., vol.15, no. 2, pp. 343-350, Apr. 1999.

[31] H. Imtiaz and S. A. Fattah, “A spectral domain feature extractionalgorithm for face recognition,” in Proc. IEEE Region 10 ConferenceTENCON, 2010, pp. 169-172.

[32] S. O. Shahdi and S. A. R. Abu-Bakar, “Frequency domain feature-based face recognition technique for different poses and low-resolutionconditions,” in Proc. IEEE International Conference on Imaging Systemsand Techniques, 2011, pp. 322-326.

[33] L. J. P. van der Maaten, E.O. Postma, and H. J. van den Herik,“Dimensionality reduction: a comparative review,” Tilburg UniversityTechnical Report, TiCC-TR 2009-005, 2009.

[34] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vectormachines”, [Online]. Available: http://www.csie.ntu.edu.tw/ cjlin/libsvm

[35] Y. S. Kim, B. Kang, H. Lim, O. Choi, K. Lee, J. D. K. Kim, and C.Kim, “Parametric model-based noise reduction for ToF depth sensors,”in Proc. SPIE 8290, 2012, pp. 1-8.

[36] O. Choi and B. Kang, “Denoising of time-of-flight depth data viaiteratively reweighted least squares minimization,” in Proc. Int. Conf.Image Process., 2013, pp. 1-4.

Seung-Won Jung (M’11) received the B.S. andPh.D. degrees in electrical engineering from Ko-rea University, Seoul, Korea, in 2005 and 2011,respectively. He was a Research Professor with theResearch Institute of Information and Communica-tion Technology, Korea University, from 2011 to2012. He was a Research Scientist with the SamsungAdvanced Institute of Technology, Yongin-si, Korea,from 2012 to 2014. He is currently an Assistant Pro-fessor at the Department of Multimedia Engineering,Dongguk University, Seoul, Korea. He has published

over 30 peer-reviewed articles in international journals. His current researchinterests include image enhancement, image restoration, video compression,and computer vision.

Ouk Choi (M’09) is a research scientist at SAIT,Samsung Electronics, since 2009. He received hisPh.D. and M.Sc. degrees in Electrical Engineeringfrom KAIST, Republic of Korea, in 2009 and 2003,respectively. During his studies in KAIST, he madecontributions in computer vision and robotics prob-lems such as image matching, segmentation and reg-istration, which were applied to object recognitionand visual navigation. After joining SAIT, he hascarried out extensive research projects ranging fromToF sensing architecture to depth image processing,

and especially has devoted on phase unwrapping, depth map upsampling andnoise reduction. His research interests also include medical image processingand machine learning.

learning-based filter selection scheme for depth image super resolution

Documents