
MOBILE3DTV

Project No. 216503

3D Video Processing Algorithms – Part I

Sergey Smirnov, Atanas Gotchev, Sumeet Sen, Gerhard Tech, Heribert Brust

Abstract: This report describes algorithms developed to enhance the quality of 3D video. At the pre-processing side, we address the following scenarios: stereo video of higher resolution to be downscaled to meet the resolution of a mobile 3D display; stereo video captured in noisy conditions (e.g. user-created content) to be denoised; and depth maps in the 'view+depth' format to be further refined. At the post-processing side, we address the problem of depth maps impaired by blocky artifacts resulting from block-transform-based encoders such as H.264. For all these cases, we investigate advanced algorithms and present experimental results illustrating their performance.

Keywords: mobile 3D video resolution, mixed resolution coding, down-sampling, up-sampling, denoising, 3D grouping and transform-domain collaborative filtering, local polynomial approximation, bilateral filtering, hypothesis filtering, time consistency, depth map filtering


Executive Summary

We present the first part of pre- and post-processing methods for 3D video represented in different formats. In this report we concentrate on sampling rate conversion for stereo video, stereo-video denoising, and refinement of depth maps in the 'view plus depth' representation.

Sampling rate conversion is required when higher-definition video is to be downscaled to mobile resolution. It also appears in mixed-resolution stereo representation schemes, where one of the channels is deliberately downscaled for the sake of more effective compression and then up-scaled back for visualization. For this, standard up- and down-sampling methods, as well as an alternative simple FIR filter for down-sampling with variable cutoff frequency, have been presented and evaluated. Coding experiments demonstrate that the simple FIR filter with a cutoff frequency of approximately 0.6 outperforms the standard methods: PSNR gains of up to 1 dB at constant bit rate, or bit-rate savings of up to 30% at constant PSNR, can be achieved.

Denoising of stereo video might be needed when the content to be delivered to the mobile device has been created under low-light conditions. Noisy channels are problematic not only for creating a pleasant stereo perception but also for compression, depth estimation and view synthesis. One of the most competitive video denoising methods, abbreviated VBM3D (video block-matching in 3D), has been evaluated for its applicability and performance on stereo video. Experiments demonstrate that the denoised left and right video channels are of very high quality, with all 3D visual cues well preserved and in fact even enhanced. From an implementation point of view, the results show equal performance whether the algorithm is applied to the two channels independently or jointly; a marginal improvement can be expected only for content with a high amount of motion.

Deblocking of depth maps is perhaps one of the most important pre- and post-processing tasks for the representation format 'view plus depth', since practitioners tend to employ standard, i.e. block-transform-based, compression methods. A set of five filtering approaches has been tested, varying from simple Gaussian smoothing through standard H.264 deblocking to more sophisticated methods utilizing structural and color constraints from the accompanying color video channel. The methods have been optimized with respect to the quantization parameter of the H.264 compression used, and the experiments have ranked the methods by their performance. For the best-performing method, we suggest practical modifications leading to a faster and more memory-efficient implementation. We have also extended the same method to video and to more general types of depth impairments (e.g. those resulting from fast depth estimation or noise). Our approach yields highly time-consistent depth sequences, adequately restoring the depth properties of the 3D scenes.


Table of Contents

1 Introduction
2 Evaluation of down-sampling methods for Mixed Resolution Coding
2.1 Sampling Methods
2.1.1 Standard anti-aliasing filters
2.1.2 Standard interpolation filters
2.1.3 FIR anti-aliasing filter with variable cutoff frequency (VCF)
2.2 Coding Experiments
2.2.1 Setup
2.2.2 Results
3 Filtering of color stereo video sequences
3.1 Introduction
3.2 Denoising of stereo video by VBM3D
3.3 Experiments
4 Restoration of block transform compressed depth maps
4.1 Introduction
4.2 Problem Formulation
4.3 Depth map filtering approaches
4.3.1 Gaussian Filtering
4.3.2 Adaptive H.264 Loop-Filtering
4.3.3 Local Polynomial Approximation approach
4.3.4 Bilateral Filter
4.3.5 Hypothesis filtering approach
4.4 Quality measures
4.5 Experimental results
5 Temporally-consistent filtering of depth map sequences
5.1 Introduction
5.2 Problem formulation
5.2.1 Extending the filtering approach to video
5.3 Experiments
5.4 Results
6 Conclusions


1 Introduction

This deliverable consists of four parts. The first part deals with down-sampling and up-sampling of stereo video in the mixed-resolution stereo representation. The second part deals with color channel filtering, particularly with denoising in order to increase the quality of subsequent depth estimation and view synthesis. The third part describes methods for deblocking of depth maps impaired by compression artifacts. In the fourth part we extend the most effective filtering approach from the previous part to depth map sequences and to more general types of depth map distortions. We especially target better time-consistency in order to avoid flickering and other 3D artifacts in the synthesized views [37].

The first part is authored by Gerhard Tech and Heribert Brust from Fraunhofer HHI, the second part is authored by Sumeet Sen and Atanas Gotchev and the third and fourth parts are authored by Sergey Smirnov and Atanas Gotchev from TTY.


2 Evaluation of down-sampling methods for Mixed Resolution Coding

The mixed resolution approach is based on the transmission of one full-resolution and one down-sampled view. In a pre-processing step, one view of a stereoscopic sequence is decimated. The decimated and the full view are coded and transmitted. At the receiver side the decimated view is up-sampled again ([1], [2]).

Although decimation and interpolation are theoretically solved problems, in practice a great variety of up- and down-sampling methods exists, differing in the design of their anti-aliasing and interpolation filters. In our scope, additional factors that affect the performance of up- and down-sampling have to be considered: the distortions introduced by coding, and the low resolution of content suitable for display on mobile devices.

To achieve the best overall quality with the mixed resolution approach, two standard methods previously used by the VCEG/MPEG Joint Video Team (JVT) are analyzed and evaluated in this section. Moreover, an approach using a down-sampling filter with variable cutoff frequency is optimized and evaluated.

2.1 Sampling Methods

The standard sampling methods discussed in sections 2.1.1 and 2.1.2 are implemented in the resample tool downconvert provided with the JSVM Reference Software for Scalable Video Coding [3]. An implementation of the filter with variable cutoff frequency presented in section 2.1.3 is part of the Mathworks Matlab software [4]. All filters are applied separately in the vertical and horizontal directions.

2.1.1 Standard anti-aliasing filters

2.1.1.1 Sine windowed sinc (SWS)

For down-sampling the filter given in [5] is used. Its coefficients follow the sine-windowed sinc function specified in [5]. For a decimation factor of 2 the design yields a 14-tap filter; in [5] this filter is collapsed to a 12-tap filter, whereas the software implementation clips it to an 8-tap filter. The magnitude and impulse responses of this filter are shown in Figure 2.1.


Figure 2.1 Impulse and Magnitude Response of sine windowed sinc down-sample filter

2.1.1.2 Dyadic down-sampling filter (DDS)

For down-sampling, the dyadic filter presented in [6] is used. For the mixed resolution approach, decimation by a factor of two is sufficient; therefore the filter has to be applied only once. Impulse and magnitude responses of the filter are shown in Figure 2.2.

Figure 2.2: Impulse and Magnitude Response of dyadic down-sample filter

2.1.2 Standard interpolation filters

2.1.2.1 SVC normative up-sampling (SNU)

Interpolation is based on a set of integer-based 4-tap filters, originally derived from the Lanczos-3 filter. For a detailed description of the interpolation process please refer to [7].

2.1.2.2 Dyadic up-sampling filter (DUS)

After doubling the sampling rate, the AVC 6-tap half-pel filter presented in [6] is applied for interpolation. The impulse and magnitude responses are shown in Figure 2.3.


Figure 2.3: Impulse and Magnitude Response of dyadic interpolation filter

2.1.3 FIR anti-aliasing filter with variable cutoff frequency (VCF)

In addition to the down-sampling filter provided with the JSVM reference software, a Hamming-windowed FIR filter with varying cutoff frequency has been evaluated. For a detailed description of the filter design please refer to [8]. Figure 2.4 and Figure 2.5 show the impulse and magnitude responses for cutoff frequencies of 0.4 and 0.6. The order of the filter has been set to 10.
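As an illustration of this kind of design, the following sketch (Python/SciPy standing in for the MATLAB fir1-style design the report refers to) builds an order-10 Hamming-windowed low-pass and applies it separably before decimation by 2. The function name and the application details are illustrative assumptions, not the project implementation.

```python
import numpy as np
from scipy.signal import firwin

def vcf_downsample(frame: np.ndarray, cutoff: float, order: int = 10) -> np.ndarray:
    """Low-pass filter `frame` with a Hamming-windowed FIR of the given
    normalized cutoff (1.0 = Nyquist), then decimate by 2 in both directions."""
    taps = firwin(order + 1, cutoff, window="hamming")  # 11-tap symmetric filter
    # Separable filtering: rows first, then columns.
    tmp = np.apply_along_axis(lambda r: np.convolve(r, taps, mode="same"), 1, frame)
    out = np.apply_along_axis(lambda c: np.convolve(c, taps, mode="same"), 0, tmp)
    return out[::2, ::2]

# e.g. a half-resolution view for mixed-resolution coding, cutoff 0.6 as in Sec. 2.2.2:
# low = vcf_downsample(view.astype(np.float64), cutoff=0.6)
```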

Figure 2.4 Impulse and Magnitude Response of VCF filter with normalized cutoff frequency 0.4

Figure 2.5 Impulse and Magnitude Response of VCF filter with normalized cutoff frequency 0.6


2.2 Coding Experiments

2.2.1 Setup

For the evaluation of the down-sampling filters one view of each sequence has been down-sampled, encoded and up-sampled again. Codec parameters are given in Table 2.1.

Table 2.1: codec settings

Profile Baseline

GOP Size 1 (IPPP)

Symbol Mode CAVLC

8x8 Transform Disabled

Search Range 48

Intra Period 16

Quantization Parameter 24, 28, 32, 36, 40

The filter combinations shown in Table 2.2 have been examined. The first two combinations are the standard filters provided with JSVM Software. Note that the SWS and SNU filters introduce and remove a shift of a half pel, hence a combination with the other filters is not possible. The last nine combinations utilize the VCF filter with cutoff frequencies from 0.1 to 0.9 for anti-aliasing and the DUS filter for interpolation.

Table 2.2: combinations of evaluated up and down-sampling methods and cutoff frequencies

Down:   DDS   SWS   VCF   VCF   VCF   VCF   VCF   VCF   VCF   VCF   VCF
Cutoff: ~0.4  ~0.4  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Up:     DUS   SNU   DUS   DUS   DUS   DUS   DUS   DUS   DUS   DUS   DUS

The six sequences from the coding test set of the stereo video database [9] have been used for evaluation. This leads to a total of 6 (sequences) x 11 (up/down combinations) x 5 (QPs) = 330 coded sequences.


2.2.2 Results

Results of coding experiments are presented in Figure 2.6. The curves depict the PSNR vs. the bit rate of the down-sampled, coded and re-up-sampled right view. The uncoded full right view has been used for reference.

The solid curves show results for the VCF down-sampling filter in combination with the dyadic up-sampling filter. For each QP the cutoff frequency was varied from 0.1 to 0.9 with a step size of 0.1; in Figure 2.6 the corresponding nine points are marked with crosses for each QP. With an increased cutoff frequency more detail is retained in the smoothed picture, hence coding leads to an increased bit rate. Therefore the leftmost rate-distortion point of each QP curve corresponds to a cutoff frequency of 0.1 and the rightmost point to a frequency of 0.9.

The envelope of the solid curves is depicted as a yellow dashed line in Figure 2.6 and gives the rate-distortion points with optimal cutoff frequency. Regarding the PSNR measure, for most sequences and rate points the optimal cutoff frequency is around 0.6. Lower frequencies lead to over-smoothing and strongly reduced image quality. Higher frequencies result not only in a further increased bit rate but also in a reduced PSNR through the introduction of aliasing artifacts.

Results obtained using the standard methods provided with the JSVM software are presented as black and magenta dashed lines in Figure 2.6. Since the cutoff frequency is fixed, only the QP was varied here. It can be seen that the dyadic up- and down-sampling approach performs slightly better than the combination of the SWS down-sampling filter and the SVC normative up-sampling filter. A comparison with the optimized VCF filter shows that both methods are outperformed: the VCF filter yields PSNR gains of up to 1 dB at constant bit rate, or bit-rate savings of up to 30% at constant PSNR.


Figure 2.6: The solid lines show PSNR vs. bit rate for the down-sampled, coded and re-up-sampled view using the VCF filter; each curve represents a fixed QP, with the cutoff frequency increasing from left (0.1) to right (0.9). The dashed yellow curve is the envelope of the solid lines and shows the optimal cutoff frequencies; the dashed magenta and black curves show the results for varying QPs obtained with the JSVM tools.


3 Filtering of color stereo video sequences

3.1 Introduction

In recent years, denoising of still images and video has received considerable interest due to the availability of mobile imaging platforms and the trend toward user-created content. Capturing images and video has become very popular with consumer and compact cameras, and content created with non-professional equipment spreads through content-sharing platforms. In many cases such content is created in low illumination conditions and is quite noisy. This has driven the research interest in developing high-performance denoising methods.

State-of-the-art denoising approaches seek similarities between non-local patches within images or video frames and use them to obtain highly overcomplete and sparse representations, usually in a transform domain, where the noise can be effectively separated from the information signal and subsequently suppressed [11], [12], [13], [14]. Methods based on non-local means [11] and collaborative non-local transform-domain filtering [13] are considered the most powerful denoising approaches. We refer to the review paper [12] for a good overview of the topic.

In our development, we consider a scenario where the input stereo video is impaired by noise. We evaluate the importance of having more information, as in stereo, for achieving better denoising results. Similar problems have been addressed in [15], [16], [17], where non-local means have been applied to multiple frames or along with a given depth map in a noisy multi-view setting. In our setting, we adopt the collaborative transform-domain filtering approach known as 3D Block-Matching (BM3D) [13] and its video version VBM3D [14], as they have shown superior performance for conventional 2D video. We aim at quantifying the performance of this algorithm for stereo video and at investigating the advantage stereo video would bring to sparse 3D transform-domain collaborative filtering.

3.2 Denoising of stereo video by VBM3D

We have applied the VBM3D algorithm as in [14]. The algorithm operates by identifying similar blocks in the spatial and temporal neighborhood of a reference block. Similarity is measured by Euclidean distance, and the similar blocks are collected in a stack (a 3D block). This step is called grouping. The advantage of grouping is that highly similar signal fragments are processed together. The noise is then suppressed by collaborative filtering in the DCT domain, which takes advantage of the increased correlation between the grouped blocks. For video, the denoising is performed in two steps: predictive-search block matching is combined with collaborative hard thresholding in the first step and with collaborative Wiener filtering in the second step. Figure 7 shows a pictorial representation of the algorithm. The predictive-search block matching is performed for successive video frames: once the intra-frame search has identified blocks similar to the reference one, these blocks are used to find new similar blocks at positions close to their spatial positions (predictive search). Thus, the similarity search is extended along the temporal dimension with no need for explicit motion estimation.
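To make the grouping and collaborative-filtering idea concrete, the following Python sketch denoises a single reference block of a grayscale frame: it collects the blocks most similar under L2 distance, stacks them into a 3D group, hard-thresholds the 3D DCT spectrum, and inverts. Block size, search grid, and the threshold rule are simplified placeholders, not the tuned VBM3D parameters, and the second (Wiener) step is omitted.

```python
import numpy as np
from scipy.fft import dctn, idctn

def denoise_block(frame, y, x, sigma, bsize=8, search=16, n_match=8):
    ref = frame[y:y+bsize, x:x+bsize]
    # Grouping: collect blocks in the search neighborhood most similar to the
    # reference block under Euclidean (L2) distance.
    cands = []
    for dy in range(-search, search + 1, 4):
        for dx in range(-search, search + 1, 4):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= frame.shape[0] - bsize and 0 <= xx <= frame.shape[1] - bsize:
                blk = frame[yy:yy+bsize, xx:xx+bsize]
                cands.append((np.sum((blk - ref) ** 2), blk))
    cands.sort(key=lambda t: t[0])
    stack = np.stack([blk for _, blk in cands[:n_match]])  # the 3D group
    # Collaborative filtering: 3D DCT, hard threshold, inverse 3D DCT.
    spec = dctn(stack, norm="ortho")
    spec[np.abs(spec) < 2.7 * sigma] = 0.0   # hard threshold ~ lambda * sigma
    filt = idctn(spec, norm="ortho")
    return filt[0]  # denoised estimate of the reference block
```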

The algorithm essentially depends on two search ranges: the intra-frame search range within the current video frame, around the reference block, and the search range for each similar block along the temporal axis. For stereo, it is straightforward to extend the algorithm to also search for similar blocks in the other view. In practice this requires some knowledge of the disparity range, so as to adjust the inter-view search range.

In this study we are interested in two cases: in the first case the two noisy video channels are denoised independently using VBM3D and in the second case they are denoised jointly using the modified approach.


Figure 7. Video 3D block-matching denoising approach

3.3 Experiments

In the first experiment, we add white Gaussian noise to ground-truth stereo video sequences, then denoise them either jointly or individually, and measure the denoising performance in terms of frame-wise PSNR between the ground-truth and denoised channels. The test sequences 'Horse' and 'Car' of resolution 640x360 were used in the experiments. Figure 8 illustrates the experimental setting; a minimal code sketch of the measurement loop follows below.
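A minimal sketch of this measurement loop, assuming 8-bit luminance frames and additive white Gaussian noise; the function names are illustrative:

```python
import numpy as np

def add_awgn(frame: np.ndarray, sigma: float, rng=np.random.default_rng(0)):
    """Add white Gaussian noise of std `sigma` to an 8-bit frame."""
    noisy = frame.astype(np.float64) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255)

def psnr(ref: np.ndarray, test: np.ndarray) -> float:
    mse = np.mean((ref.astype(np.float64) - test) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Frame-wise PSNR between ground-truth and denoised channels:
# curve = [psnr(gt, den) for gt, den in zip(gt_frames, denoised_frames)]
```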

Figure 8. Experimental setting for denoising of stereo video

The results for the 'Horse' sequence are given in Figure 9 and the results for the 'Car' sequence are given in Figure 10.

[Figure 8 shows two processing chains: in the first, the left- and right-channel videos are denoised independently by separate VBM3D blocks; in the second, the two channels are interleaved into a single video, denoised by one VBM3D block, and de-interleaved back into left and right channels. In both chains the frame-wise PSNR against the original sequences is plotted.]


Figure 9. Denoising results for 'Horse' for the left and right channels. Red: noisy vs. noise-free; light blue: jointly denoised vs. noise-free; blue: individually denoised vs. noise-free.

Figure 10. Denoising results for 'Car' for the left and right channels. Red: noisy vs. noise-free; light blue: jointly denoised vs. noise-free; blue: individually denoised vs. noise-free.

As can be seen in the figures, stereo adds little to the denoising performance. The jointly and individually denoised channels track each other closely, with a small preference for individual denoising.

In the second experiment, we subject the noisy and denoised sequences to tasks such as depth estimation and view synthesis. The test involved the same noise-free and noisy sequences. Using FhG's depth estimator [18], the depth was estimated for the following stereo sequences: a) noise-free (i.e. ground-truth) sequences; b) noisy sequences; c) individually denoised sequences (denoised data 1); d) jointly denoised sequences (denoised data 2). The obtained depths were used to render the right channel using the corresponding left (noise-free, noisy, or denoised) channel. The resulting right-channel video sequences were compared with the original ones in terms of PSNR. The results are presented in Figure 11 and Figure 12.


Figure 11. PSNR of ground-truth vs. right channels synthesized from different depth maps (see legend)

[Plots for the 'horse' and 'car' sequences: PSNR (dB) vs. frame index, with curves for noise-free data, noisy data, denoised data 1 (individually denoised) and denoised data 2 (jointly denoised).]


Figure 12. Zoomed version of Figure 11

Denoising plays a substantial role in improving the quality of the synthesized view. The views synthesized from denoised data are even better than those rendered using depth estimated from the 'noise-free' data. This suggests that, besides the simulated added noise, the original videos contained a small amount of inherent noise which impeded the depth estimation but was suppressed by the denoising technique. Improving the depth estimation quality, and subsequently the quality of the synthesized view, is a valuable property of the VBM3D algorithm.

In terms of depth estimation and view rendering, the approach of denoising the left and right channels individually showed better performance for the 'horse' test sequence, while for the 'car' sequence the joint denoising approach was superior. This is caused by differences in the content. The 'horse' data contains little motion, a static background dominates the scene, and there are luminance differences between the left and right channels; correspondingly, VBM3D profits more from finding similar blocks along the temporal dimension than between views, and individual processing of the views turns out to be more successful. By contrast, 'car' contains more motion, i.e. more changes along the temporal axis, so the similarity search finds more similar blocks between the views and filters them collaboratively with success.



4 Restoration of block transform compressed depth maps

4.1 Introduction

One of the 3D video formats studied within the Mobile3DTV project is informally called 'video plus depth', where the 2D video frames are augmented with per-pixel depth information. The 2D color video is represented in its ordinary form (e.g. in a luminance-chrominance space) while the associated depth is represented as a quantized (gray-scale) map ranging from the minimum to the maximum distance with respect to the assumed camera position. Figure 13 illustrates the concept of the view-plus-depth 3D representation for the popular test sequence 'Ballet dancer'.

Figure 13. Illustration of the 'view+depth' format concept

Such a representation has a number of benefits: it ensures backward compatibility for legacy devices and offers easy rendering of virtual views for 3DTV and free-viewpoint TV applications, while also being compression-friendly. The latter feature is based on the observation that the depth channel tolerates stronger compression than the color video channels while delivering the same 3D scene geometry information. We refer to Deliverables D2.2, D2.3, and D2.5 for more details about the compressibility of depth maps.

The depth image has two noticeable peculiarities. First, it is never seen by the viewer: it is used only for rendering new views (so-called depth-image-based rendering, DIBR). Second, being a range map, it exhibits smooth regions, representing objects at the same distance, delineated by sharp transitions (object boundaries). Thus, it is quite different from the color texture images for which block-transform-based compression methods were designed. This peculiarity has been addressed in the design of compression schemes especially tailored for depth images [19], [20]. Nevertheless, block-transform-based video coding schemes have been favored in rate-allocation studies because of the existing standardized encoders, such as H.264 and MPEG [21], [22]. In these studies two rate-allocation approaches have been adopted. In the first approach, the bit allocation is optimized jointly for the video and depth to minimize the rendering distortion of the desired virtual view [21]. In the second approach, the video quality is maximized for the sake of backward compatibility, while the depth is encoded with a small fraction (10-15%) of the total bit rate [22]. The H.264 coding scheme has also been adopted within the project, where the total bit budget between color video and depth has been carefully jointly optimized [23] (see also D2.2 and D2.5).

In the above rate-allocation approaches, especially at low bit rates, depth is compressed by enforcing strong quantization of the DCT coefficients. This creates the well-known blocking artifacts that are generic to block-transform-based compression schemes. For depth images, blocking leads to distorted depth discontinuities and therefore to distorted geometrical properties and object boundaries in the rendered view. The problem is illustrated in Figure 14. It can be partially addressed by simple (e.g. Gaussian) smoothing, an approach also used for mitigating occlusion effects. While simple, this approach is weak, as it destroys true sharp boundaries and impedes faithful virtual-view rendering.

We study the problem of restoring compressed depth maps affected by blocking artifacts from two points of view. Our first aim is to adapt and compare state-of-the-art methods originally designed to handle similar problems. We are interested in two groups of methods: methods from the first group treat the depth image 'as is', i.e. they process it independently of the available color video; methods from the second group utilize structural information from the video channel in order to improve the depth map restoration. Our second aim is to identify appropriate quality measures to quantify the distortions in the depth image and their effect on the rendered virtual view.

Figure 14. 'Teddy' dataset. (a) ground-truth depth; (b) rendered view using (a) (without occlusion filling); (c) ground-truth depth compressed as an H.264 I-frame with QP=51; (d) rendered view using (c)

4.2 Problem Formulation

Consider an individual colour video frame in some colour space. For the sake of clarity we consider the YUV colour space; however, most of the development can be carried out in RGB as well. We denote the colour frame as the three-component vector $C(x) = [Y(x), U(x), V(x)]^T$, where $x \in X$ is a spatial variable, $X \subset \mathbb{Z}^2$ being the image domain. Along with the video frame, we consider the associated per-pixel depth $z(x)$.

A new, virtual view can be synthesized out of the given (reference) color frame and depth by applying projective geometry and knowledge about the reference view camera [24]. The synthesized view is composed of two parts, $v = \{v_v, v_o\}$, where $v_v$ denotes the pixels visible from the position of the virtual view camera and $v_o$ denotes the pixels of occluded areas. The corresponding domains are denoted by $X_v$ and $X_o$ correspondingly, $X = X_v \cup X_o$.

We consider the case where both $C$ and $z$ are to be coded as H.264 intra frames with some QPs, leading to their quantized versions $\tilde{C}$ and $\tilde{z}$. We model the effect of quantization as quantization noise added to the uncompressed signal. Namely,

$$\tilde{C}(x) = C(x) + \eta_C(x), \qquad (2)$$

$$\tilde{z}(x) = z(x) + \eta_z(x). \qquad (3)$$

The quantization noise terms added to the color channels and the depth channel are considered independent white Gaussian processes: $\eta_C \sim \mathcal{N}(0, \sigma_C^2)$, $\eta_z \sim \mathcal{N}(0, \sigma_z^2)$. While this modeling is simple, it has proven quite effective for mitigating the blocking artifacts arising from quantization of transform coefficients. In particular, it allows for establishing a direct link between the quantization parameter (QP) and the quantization noise variance to be used for tuning deblocking filtering algorithms [25].

Let us denote by $\tilde{v}$ the virtual view synthesized out of the quantized depth and the quantized reference view. Unnatural discontinuities at the boundaries of the transform blocks (the blocking artifacts) in the quantized depth image cause geometrical distortions and distorted object boundaries in the rendered view. The goal of the restoration of compressed depth maps is to mitigate the blocking effects in the depth image domain, i.e. to obtain a deblocked depth image estimate $\hat{z}$ which is closer to the original, uncompressed depth and improves the quality of the rendered view.

4.3 Depth map filtering approaches

We have implemented and compared five methods, which fall into two groups. The first two methods work directly on the depth image, making no use of the given reference color video frame. These methods are simple; by choosing them we wanted to check the effect of simple or adaptive smoothing of the depth image on the rendered view. The second group contains methods which essentially utilize structural information from the video channel(s). The assumption here is that the video channel is coded with better quality and as such can provide reliable information about objects at different depths, to be used for restoring the true depth discontinuities. We aim at utilizing structural information, such as pixel neighborhoods or color (dis-)similarity from the given video frame, to infer the true depth values.

4.3.1 Gaussian Filtering

Gaussian smoothing is a popular technique for removing predominantly high-frequency contaminations. The method consists of convolving the noisy image with a 2D discrete smoothing kernel of the form

$$g(x) = \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{\|x\|^2}{2\sigma^2} \right). \qquad (4)$$

The standard deviation $\sigma$ is a free parameter which can be used to control the imposed smoothness. For our experiments we have tuned it as a function of the H.264 quantization parameter: $\sigma = \sigma(QP)$. The main drawback of Gaussian filtering is that it applies a fixed-size rectangular window across true object boundaries and thus smoothes out true image features together with the noise.
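A minimal sketch, assuming SciPy; sigma_of_qp is a hypothetical monotone tuning curve standing in for the report's (unspecified) optimized sigma(QP):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sigma_of_qp(qp: int) -> float:
    return max(0.5, 0.05 * qp)          # illustrative monotone mapping only

def gaussian_deblock(depth: np.ndarray, qp: int) -> np.ndarray:
    """Smooth a compressed depth map with a QP-tuned Gaussian kernel."""
    return gaussian_filter(depth.astype(np.float64), sigma=sigma_of_qp(qp))
```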

4.3.2 Adaptive H.264 Loop-Filtering

The H.264 video compression standard has a built-in deblocking algorithm addressing the problem of adaptive smoothing. It works adaptively on block boundaries, trying to avoid smoothing real signal discontinuities. To achieve this, two adaptive threshold functions have been experimentally defined to determine whether or not to apply smoothing across block boundaries. The functions depend on the QP as well as on two encoder-selectable offsets included and transmitted in the slice header. These two offsets are the only user-tunable parameters allowing some adjustment of the smoothing for a specific application. For more details on the H.264 deblocking we refer to [26].

4.3.3 Local Polynomial Approximation approach

The anisotropic local polynomial approximation (LPA) is a point-wise method for adaptive estimation in noisy conditions [27]. For every point of the image, local polynomial sectorial-neighborhood estimates are fitted for different directions. In the simpler case, instead of sectors, 1D directional estimates along four (every 90 degrees) or eight (every 45 degrees) directions are used. The length of each estimate, denoted as scale, is adjusted to meet the compromise between an exact polynomial fit (low bias) and sufficient smoothing (low variance).

A statistical criterion known as the Intersection of Confidence Intervals (ICI) rule is used to find this compromise, i.e. the optimal scale for each direction [28], [29]. These optimal scales in each direction determine an anisotropic polygonal neighborhood for every point of the image, well adapted to the structure of the image. This neighborhood has been successfully utilized for shape-adaptive transform-based color image denoising and deblurring [25].

In the spirit of [25], we use the quantized luminance channel $\tilde{Y}$ as the source of structural information. The image is convolved with a set of 1D directional polynomial kernels $\{g_{h,\theta}\}$, where $h \in H$ ranges over a set of different lengths (scales) and $\theta$ over the directions, thus obtaining the estimates $\hat{y}_{h,\theta}(x)$. In order to find the optimal scale for each direction (hereafter the notation of direction is omitted), so-called confidence intervals are formed first: $D_h = \big[ \hat{y}_h(x) - \Gamma \sigma_{\hat{y}_h},\; \hat{y}_h(x) + \Gamma \sigma_{\hat{y}_h} \big]$ [28], [29]. The optimal scale $h^{+}$ is the largest scale (in number of pixels) which ensures a non-empty intersection of the confidence intervals, $\bigcap_{h \le h^{+}} D_h \neq \emptyset$. Figure 15a illustrates the optimal scale for each pixel (encoded with a different gray value) for a particular direction.

The optimal scales for all directions form an adaptive polygonal neighborhood with the current pixel in the centre, as illustrated in Figure 15b. After finding the optimal neighborhood in the luminance image domain, the same neighborhood is used for smoothing the depth image (cf. Figure 15c). The smoothing is done by fitting a plane within the neighborhood. Since LPA is a point-wise procedure, the neighborhoods of adjacent pixels overlap; correspondingly, depth pixels get estimated multiple times, depending on how many neighborhoods they fall into. The final estimate for each depth pixel is obtained by averaging the aggregated planar estimates for that pixel. Figure 15e illustrates the result of LPA-ICI filtering.

Note that the scheme depends on two parameters: the noise variance of the luminance channel and the positive threshold parameter $\Gamma$. The former depends on the quantization of the color video; we assume low quantization noise. The latter can be adjusted to favor a higher amount of smoothing. We have optimized it with respect to the quantization parameter of the depth channel: $\Gamma = \Gamma(QP)$.
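The following Python sketch illustrates the ICI rule for a single direction in 1D, using zero-order (moving-average) estimates in place of the full polynomial fits; the noise std sigma and threshold gamma are assumed known, and all names are illustrative:

```python
import numpy as np

def ici_optimal_scale(y: np.ndarray, i: int, scales=(1, 2, 4, 8),
                      sigma=5.0, gamma=2.0) -> int:
    """Largest scale whose confidence interval still intersects all smaller ones."""
    lo, hi = -np.inf, np.inf
    best = scales[0]
    for h in scales:
        seg = y[max(0, i - h):i + 1]            # one-sided directional window
        est = seg.mean()                         # zero-order LPA estimate
        std = sigma / np.sqrt(len(seg))          # std of the averaged estimate
        lo, hi = max(lo, est - gamma * std), min(hi, est + gamma * std)
        if lo > hi:                              # intersection became empty: stop
            break
        best = h
    return best
```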

Figure 15. LPA-ICI filtering of depth maps. a) optimal scales for one of the directions; b) luminance channel with some of the found optimal neighborhoods; c) compressed depth with the same neighborhoods overlaid; d) input (compressed) depth; e) depth filtered by LPA-ICI.

4.3.4 Bilateral Filter

The goal of bilateral filtering is to smooth the image while preserving edges [30]. It utilizes information from all color channels to derive suitable weights for local (non-linear) neighborhood filtering. For grayscale images, the local weights of neighbors are calculated from both their spatial distance and their photometric similarity, favoring nearby and similar values over distant ones in both the spatial domain and the intensity range. For color images, bilateral filtering uses a color distance to measure photometric similarity between pixels, thus reducing phantom colors in the filtered image. Figure 16 a-e illustrates how the filtering window is formed.

Such a collaboratively-weighted neighborhood defined by the color image is applicable also to the depth channel. The approach is similar to the one used in depth estimation, where contour color information has been used for finding correspondences [31]. In our setting, we have adopted a version of the bilateral filter as in [32]:

$$\hat{z}(x) = \frac{\sum_{y \in N(x)} w_s(x,y)\, w_r(x,y)\, \tilde{z}(y)}{\sum_{y \in N(x)} w_s(x,y)\, w_r(x,y)}, \qquad (5)$$

where $w_s(x,y) = \exp\!\big( -\|x-y\|^2 / 2\sigma_s^2 \big)$, $w_r(x,y) = \exp\!\big( -\|C(x)-C(y)\|^2 / 2\sigma_r^2 \big)$, and $N(x)$ is a spatial neighborhood of $x$. The two parameters $\sigma_s$ and $\sigma_r$ determine the spatial extent and the range extent of the weighting functions correspondingly. We have optimized them with respect to the QP: $\sigma_s = \sigma_s(QP)$, $\sigma_r = \sigma_r(QP)$. A result of bilateral filtering is given in Figure 16 f,g.
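A minimal sketch of Eq. (5) evaluated at a single pixel, assuming float arrays depth (H x W) and color (H x W x 3); the window radius and sigmas are illustrative:

```python
import numpy as np

def joint_bilateral_pixel(depth, color, y, x, radius=7, sigma_s=3.0, sigma_r=10.0):
    """Depth estimate at (y, x), weighted by spatial and color proximity."""
    y0, y1 = max(0, y - radius), min(depth.shape[0], y + radius + 1)
    x0, x1 = max(0, x - radius), min(depth.shape[1], x + radius + 1)
    yy, xx = np.mgrid[y0:y1, x0:x1]
    w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
    cdiff = color[y0:y1, x0:x1] - color[y, x]
    w_r = np.exp(-np.sum(cdiff ** 2, axis=-1) / (2 * sigma_r ** 2))
    w = w_s * w_r
    return np.sum(w * depth[y0:y1, x0:x1]) / np.sum(w)
```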

Figure 16. Bilateral filtering of depth maps. a) color frame with reference pixel (in red); b) spatial proximity; c) colour similarity; d) colour window; e) combined spatial-colour window; f) blocky depth; g) bilaterally filtered depth.


4.3.5 Hypothesis filtering approach

Originally, the considered method was developed for increasing the resolution of low-resolution depth images by utilizing information from a high-resolution color image [32]. The method is perfectly applicable to our problem of suppressing compression artefacts and restoring real discontinuities in the depth map. In the original approach, a 3D cost volume is constructed frame-wise out of several depth hypotheses, and the hypothesis with the lowest cost is selected as the refined depth value at the current iteration. More specifically, the cost volume at the i-th iteration is formed as the truncated quadratic difference

$$c^{(i)}(x, d) = \min\!\big( (d - z^{(i)}(x))^2,\; \eta L \big), \qquad (6)$$

where $d$ is the potential depth candidate, $z^{(i)}(x)$ is the current depth estimate at coordinate $x$, and $L$ is the search range, controlled by a constant $\eta$. The obtained slices of the cost volume for different values of $d$ largely retain the degraded pattern of $z$, as illustrated in Figure 17 (left). Therefore, each slice of the cost volume undergoes joint bilateral filtering: each pixel of the cost slice is obtained as a weighted average of neighboring pixels, where the weights are also modified by the color similarity, measured as the $l_1$ distance between the corresponding pixel of the color video frame and its neighbors:

$$\tilde{c}^{(i)}(x, d) = \frac{\sum_{y \in N(x)} w_s(x,y)\, w_c(x,y)\, c^{(i)}(y, d)}{\sum_{y \in N(x)} w_s(x,y)\, w_c(x,y)}, \qquad (7)$$

where $w_s(x,y) = \exp\!\big( -\|x-y\|^2 / 2\sigma_s^2 \big)$, $w_c(x,y) = \exp\!\big( -\|C(x)-C(y)\|_1 / 2\sigma_r^2 \big)$, and $N(x)$ is the neighborhood of coordinate $x$. The motivation for applying bilateral filtering is two-fold: it assumes that the depth reflects the piecewise smoothness of the surfaces in the given 3D scene, and that the depth is correlated with the local scene color (the same local color corresponds to constant depth). Our experimental tests demonstrated that filtering the cost volume (6) is more effective than directly filtering the noisy depth.

After bilateral filtering, the slices are smoothed (Figure 17, right) and the depth for the next iteration is obtained as

$$z^{(i+1)}(x) = \arg\min_{d} \tilde{c}^{(i)}(x, d). \qquad (8)$$

Figure 17 Result of filtering of cost volume. Left: unfiltered cost function; right: bilaterally-filtered cost function.

The hypothesis filtering approach is illustrated in Figure 18. It methodologically comprises three steps: (1) form a cost volume, (2) filter the cost volume, (3) pick the minimum-cost hypothesis. In the original approach [32], a further refinement of the depth is suggested: instead of selecting the depth giving the minimum cost, as in Eq. (8), a quadratic function is fitted around that minimum and the minimizer of that function is selected instead.


Figure 18 Block diagram of hypothesis filtering

We suggest several modifications to the original approach to make it more memory-efficient and faster. It is straightforward to see that there is no need to form the full cost volume in order to obtain the depth estimate for a given coordinate $x$ at the i-th iteration. Instead, the cost function is formed for the required neighbourhood only, and then the filtering applies, i.e.

$$z^{(i+1)}(x) = \arg\min_{d} \sum_{y \in N(x)} w_s(x,y)\, w_c(x,y)\, \min\!\big( (d - z^{(i)}(y))^2,\; \eta L \big). \qquad (9)$$

Furthermore, the computational cost is reduced by assuming that not all depth hypotheses are applicable to the current pixel. A safe assumption is that only depths within the range $d \in \big[ \min_{y \in N(x)} z^{(i)}(y),\; \max_{y \in N(x)} z^{(i)}(y) \big]$ have to be checked.

Figure 19 Histogram of non-compressed and compressed depth map

Additionally, the depth range is rescaled with the purpose of further reducing the number of hypotheses. This step is especially efficient against compression (blocky) artifacts: for compressed depth maps the depth range appears sparse due to the quantization effect. Figure 19 shows histograms of depth values before and after compression, confirming the use of a rescaled search range of depth hypotheses. This modification speeds up the procedure and relies on the subsequent quadratic interpolation to find the true minimum. A pseudo-code of the suggested procedure of Eq. (9) is given in Table 1.



Table 1. Pseudo-code of modified hypothesis filtering

Rescale the range of the noisy depth image
For every (x,y) in the noisy depth image
    D = window of the depth frame around (x,y)
    C = window of the color frame around (x,y)
    W = bilateral weights calculated from C
    Cost_min = +Inf
    For d = min(D) to max(D)
        Cost = sum(W .* min((D - d)^2, threshold)) / sum(W)
        If Cost < Cost_min
            Depth_new(x,y) = d
            Cost_min = Cost
        End
    End
End
Rescale the range of the filtered depth
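For reference, a NumPy transcription of the per-pixel procedure of Eq. (9) might look as follows; the window size, sigmas, and truncation threshold are illustrative placeholders, and the hypothesis set is reduced to the depth values actually present in the local window, in the spirit of the range-reduction steps above:

```python
import numpy as np

def hypothesis_filter_pixel(depth, color, y, x, radius=5,
                            sigma_s=3.0, sigma_r=10.0, trunc=100.0):
    """Pick the depth hypothesis with minimum bilateral-weighted truncated cost."""
    y0, y1 = max(0, y - radius), min(depth.shape[0], y + radius + 1)
    x0, x1 = max(0, x - radius), min(depth.shape[1], x + radius + 1)
    D = depth[y0:y1, x0:x1]
    yy, xx = np.mgrid[y0:y1, x0:x1]
    w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
    w_c = np.exp(-np.sum(np.abs(color[y0:y1, x0:x1] - color[y, x]), axis=-1)
                 / (2 * sigma_r ** 2))                  # l1 colour distance
    w = w_s * w_c
    best_d, best_cost = depth[y, x], np.inf
    for d in np.unique(D):                              # reduced hypothesis set
        cost = np.sum(w * np.minimum((D - d) ** 2, trunc))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```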

Figure 20 Execution time of different implementations of filtering approach

Figure 20 illustrates the gains in speed. The figure shows experiments with depth filtering of a scene using different implementations of the filtering procedure. All implementations were written in C and compiled into MEX files to be run from the Matlab environment. The vertical axis shows the execution time in seconds, and the horizontal axis shows the number of slices employed (and thus the dynamic range assumed). The dotted curve shows single-pass bilateral filtering applied directly to the depth; it does not depend on the dynamic range but on the window size and is therefore constant in the figure. The red curve shows the computation time of the original approach implemented as a three-step procedure over the full dynamic range; naturally, it is a linear function of the number of slices to be filtered. Our implementation (blue curve), applying the reduced dynamic range, also depends linearly on the number of slices but with a dramatically reduced slope.

4.4 Quality measures

We have considered two groups of quality measures: the first group operates directly on the depth images (true and processed) and the second group on the rendered views (true and restored). While the measures in the first group are simpler and faster to calculate, the measures in the second group correspond more closely to subjective perception.

PSNR of Restored Depth compares the compressed or restored depth against the ground-truth depth:

$$PSNR = 10 \log_{10} \frac{255^2}{\frac{1}{|X|} \sum_{x \in X} \big( \hat{z}(x) - z(x) \big)^2}, \qquad (10)$$

where $|X|$ is the number of pixels of the depth image.

Percentage of bad pixels is a measure originally used to compare depths estimated from stereo [34]. It counts the percentage of pixels differing by more than a pre-specified threshold $\delta$:

$$BAD = \frac{100\%}{|X|} \sum_{x \in X} \mathbf{1}\big( |\hat{z}(x) - z(x)| > \delta \big). \qquad (11)$$

Consider the gradient $\nabla(z - \hat{z})$ of the difference between the true depth $z$ and the approximated depth $\hat{z}$. By Depth Consistency we denote the percentage of pixels having a magnitude of that gradient higher than a pre-specified threshold $\delta_g$:

$$CONSIST = \frac{100\%}{|X|} \sum_{x \in X} \mathbf{1}\big( \|\nabla(z - \hat{z})(x)\| > \delta_g \big). \qquad (12)$$

The measure penalizes non-smooth areas in the restored depth, considered the main source of geometrical distortion, as illustrated in Figure 21.

Figure 21. Results of thresholding the difference gradient

PSNR of Rendered View is defined analogously to formula (10), but taken over the rendered view.

Gradient-normalized RMSE has been suggested in [36] as a performance metric for optical-flow estimation algorithms, making the error measure more robust to local intensity variations in textured areas. In our implementation we calculate it over the luminance channel of the rendered image, excluding the true occluded areas:

$$NRMSE = \sqrt{ \frac{1}{|X_v|} \sum_{x \in X_v} \frac{ \big( \tilde{v}(x) - v(x) \big)^2 }{ \|\nabla v(x)\|^2 + 1 } }. \qquad (13)$$

Discontinuity Falses accounts for the percentage of wrong occlusions in the rendered channel. These are either new occlusions of initially non-occluded pixels or falsely disoccluded pixels:

$$DF = \frac{ |X_o \setminus \tilde{X}_o| + |\tilde{X}_o \setminus X_o| }{|X|} \cdot 100\%, \qquad (14)$$

where $|\cdot|$ denotes the cardinality (number of elements) of a domain and $\tilde{X}_o$ is the occluded domain of the view rendered with the processed depth.
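Minimal sketches of the depth-domain measures, Eqs. (10)-(12), assuming 8-bit depth arrays; the thresholds are illustrative:

```python
import numpy as np

def psnr_depth(z, z_hat):
    """Eq. (10): PSNR of restored depth against ground truth."""
    mse = np.mean((z.astype(np.float64) - z_hat) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)

def bad_pixels(z, z_hat, delta=1.0):
    """Eq. (11): percentage of pixels off by more than `delta`."""
    return 100.0 * np.mean(np.abs(z.astype(np.float64) - z_hat) > delta)

def depth_consistency(z, z_hat, delta=2.0):
    """Eq. (12): percentage of pixels with large difference-gradient magnitude."""
    gy, gx = np.gradient(z.astype(np.float64) - z_hat)
    return 100.0 * np.mean(np.hypot(gy, gx) > delta)
```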

4.5 Experimental results

We present two experiments. In the first experiment, we compare the performance of all depth restoration algorithms assuming the true color channel is given (it has been also used in the optimization of the tunable parameters). In the second experiment we compare the effect of depth restoration in the case of mild quantization of the color channel.

Figure 22 illustrates the performance of some of the filtering techniques. Rendering of the right channel has been accomplished using the original left channel and either compressed or filtered depth. No occlusion filling has been applied.

Results of the first experiment are presented in Figure 23. Along the x-axis of all plots the H.264 QPs are given; the area of interest is between 30 and 50. All measures but BAD distinguish the methods in a consistent way. The group of structurally-constrained methods clearly outperforms the simple methods working on the depth image only. The two PSNR-based measures seem to be less reliable in characterizing the performance of the methods. The three remaining measures, namely Depth Consistency, Discontinuity Falses and Gradient-normalized RMSE, behave in a consistent manner. While NRMSE is perhaps the measure closest to subjective perception, we also favor the other two measures of this group, as they are relatively simple and do not require calculation of the warped (rendered) image.

To characterize the consistency of our optimized parameters, Figure 23g shows the trend of CONSIST calculated for the algorithms with parameters optimized for NRMSE. One can see that the trend is quite consistent with that of Figure 23e (where the methods are both optimized and compared with respect to CONSIST). The same can be seen when comparing Figure 23h with Figure 23f: in the former, the NRMSE is calculated over the test set while the algorithm parameters are optimized over the training set with respect to CONSIST. The measure shows the same trend as in the case when the algorithms are optimized with respect to the same measure.

So far we have worked with an uncompressed color channel, which was involved in the optimizations and comparisons; our aim was to characterize the pure influence of the depth restoration only. In the second experiment we use a quantized color channel. We assume mild quantization of the color image, e.g. QP=35, and two QPs, 35 and 45, for the depth. For our test imagery, the first depth QP corresponds to about 10% of the total bitrate. The NRMSE of the rendered channel is calculated with respect to the channel rendered from uncompressed color and depth. The results are given in Figure 24. One can see that the depth post-processing clearly makes a difference, allowing stronger quantization of the depth channel while still achieving good quality.


Figure 22. Filtering of compressed depth maps. a) decompressed depth map; b) right channel rendered using the original left channel and the depth from a); c) depth filtered by the bilateral filter; d) right channel rendered using c); e) depth filtered by the hypothesis filter; f) right channel rendered using e)


Figure 23. Experiment 1. Horizontal axes show the H.264 QP. (a)-(f) Performance of the selected algorithms, optimized for and compared by the same measure: (a) PSNR of restored depth (dB); (b) PSNR of rendered channel (dB); (c) bad pixels percentage (%); (d) discontinuity falses (%); (e) depth consistency (%); (f) normalized RMSE (dB). (g) Performance measured by CONSIST of algorithms optimized for NRMSE. (h) Performance measured by NRMSE for algorithms optimized for CONSIST. Each plot compares: No Filtering, H.264 Loop Filter, Gaussian Smooth, LPA-ICI Filtering, Bilateral Filtering, Super Resolution (hypothesis filtering).


Figure 24. Experiment 2. Effect of compressed color, and of compressed and filtered depth, on the quality of the rendered view. Panels: True Color, True Depth; Color QP=35, True Depth (NRMSE=10); True Color, Depth QP=35 (NRMSE=23); Color QP=35, Depth QP=35 (NRMSE=24); True Color, Depth QP=45 (NRMSE=31); Color QP=35, Depth QP=45 (NRMSE=32); True Color, Filtered Depth from QP=45 (NRMSE=21); Color QP=35, Filtered Depth from QP=45 (NRMSE=22).


5 Temporally-consistent filtering of depth map sequences

5.1 Introduction

In the previous section we addressed the problem of refinement of depth maps impaired by compression artefacts. The quality of depth maps also depends on the way they have been generated: either through 'depth-from-stereo' or 'depth-from-multiview' types of algorithms, or using special depth sensors based on time-of-flight (ToF) principles, laser scanners, or structured light. When depth maps accompany video sequences, the consistency of successive depth maps in the sequence becomes an issue. Time-inconsistent depth sequences might cause flickering in the synthesized views as well as other 3D-specific artifacts [37].

The time-consistency issue has been addressed mainly at the depth estimation stage, either by adding a smoothing constraint along the temporal dimension to the global optimization procedure of the depth estimation, or by simple median filtering across successive depth frames [38], [39].
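As a simple baseline of this kind, temporal median filtering replaces each depth pixel by the median over a window of successive frames. A minimal sketch (Python/NumPy, assuming the depth sequence is stored as a T x H x W array; the window length is an illustrative parameter):

import numpy as np

def temporal_median(depth_seq, window=5):
    """Median-filter a depth sequence along time only.
    depth_seq: array of shape (T, H, W); window: odd number of frames."""
    half = window // 2
    T = depth_seq.shape[0]
    out = np.empty_like(depth_seq)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)  # clip at sequence ends
        out[t] = np.median(depth_seq[lo:hi], axis=0)
    return out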

In this section, we address the problem of filtering depth map sequences which are impaired by inaccurate depth estimation, noise, or compression artifacts. We extend the approach from Section 4 to video in order to tackle the time-consistency issue.

5.2 Problem formulation

We extend the formulation in Sub-section 4.2 to add the temporal dimension. Consider a color video sequence in YUV color space, $C(x,t) = \{Y(x,t), U(x,t), V(x,t)\}$, accompanied by the associated per-pixel depth $Z(x,t)$, where $x = (x_1, x_2) \in X$ is a spatial variable, $X$ being the image domain, and $t$ is the frame index. The virtual view to be synthesized out of the given (reference) color frame and depth at time $t$ is denoted by $C_v(x,t)$. It is composed of two parts, $C_v = C_v^{vis} \cup C_v^{occ}$, where $C_v^{vis}$ denotes the pixels visible from the position of the virtual-view camera and $C_v^{occ}$ denotes the pixels of occluded areas. The corresponding domains are denoted by $X_{vis}$ and $X_{occ}$ correspondingly, $X = X_{vis} \cup X_{occ}$. We consider the case where the depth sequence has been degraded by some impairment added to the true depth: $\tilde{Z}(x,t) = Z(x,t) + \varepsilon(x,t)$. Finally, we denote by $\tilde{C}_v(x,t)$ the virtual view synthesized out of the degraded depth, and by $\hat{C}_v(x,t)$ the virtual view synthesized out of the processed depth $\hat{Z}(x,t)$ and the given reference view. The goal of the depth filtering is to obtain an estimate $\hat{Z}$ of the depth sequence that is closer to the ground-truth depth sequence and provides a synthesized virtual view with improved quality.
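For completeness, a minimal sketch of the view synthesis step assumed above, as a simple horizontal forward warp with z-buffer visibility (Python/NumPy). This is an illustrative stand-in for the actual rendering software, and the disparity convention is an assumption:

import numpy as np

def render_virtual_view(color, disparity):
    """Forward-warp a reference view by per-pixel horizontal disparity.
    Returns the synthesized view and a mask of occluded/disoccluded pixels
    (the domain X_occ of the formulation above).
    color: (H, W, 3); disparity: (H, W) in pixels (illustrative convention)."""
    H, W = disparity.shape
    view = np.zeros_like(color)
    zbuf = np.full((H, W), -np.inf)     # keep the closest (largest-disparity) pixel
    filled = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            xv = int(round(x - disparity[y, x]))
            if 0 <= xv < W and disparity[y, x] > zbuf[y, xv]:
                zbuf[y, xv] = disparity[y, x]
                view[y, xv] = color[y, x]
                filled[y, xv] = True
    occluded = ~filled                  # holes to be inpainted or left unfilled
    return view, occluded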

5.2.1 Extending the filtering approach to video

In Section 4, we found that the hypothesis filter gives superior performance when applied to individual depth frames impaired by compression artifacts. Here, we extend the same approach to video and to more general types of depth artifacts.

Eq. (8) is extended to video sequences as follows:

$$\hat{Z}(x,t) = \arg\max_{h \in \mathcal{H}} \sum_{(y,s) \in \Pi(x,t)} w_c\big(C(x,t), C(y,s)\big)\, w_s(\|x - y\|)\, w_t(|t - s|)\, K\big(\tilde{Z}(y,s) - h\big), \qquad (15)$$

where $\Pi(x,t)$ is a spatio-temporal neighbourhood around the voxel $(x,t)$, $\mathcal{H}$ is the set of checked depth hypotheses, $w_c$ is the color-similarity weight, $w_s$ and $w_t$ are the spatial and temporal penalty kernels, and $K$ is a kernel measuring the agreement of the observed depth $\tilde{Z}(y,s)$ with the hypothesis $h$.

Page 33: 3D Video Processing Algorithms Part Isp.cs.tut.fi/mobile3dtv/results/tech/D5.4_Mobile3DTV_v1.0.pdfMOBILE3DTV D5.4 3D Video Processing Algorithms – Part I 9 2.2.2 Results Results

MOBILE3DTV D5.4 3D Video Processing Algorithms – Part I

32

This essentially means that the depth hypotheses are checked within a spatio-temporal parallelepiped around the current depth voxel with coordinates $(x,t)$. While the neighbouring voxels are weighted by their color similarities to the central one, the temporal distance is penalized separately from the spatial one, enabling better flexibility in tuning the filter parameters. Note that the video filtering uses no explicit motion information: no motion estimation or compensation is applied. We rely on the color (dis-)similarity weights to sufficiently suppress depth voxels that have changed considerably due to motion. The hypothesis filtering procedure for video is illustrated in Figure 25.

Figure 25. Extension of hypothesis filtering to video
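To make the procedure concrete, the following sketch realizes the weighted hypothesis voting of Eq. (15) over a spatio-temporal neighbourhood (Python/NumPy). The Gaussian kernel shapes, parameter names and the choice of the hypothesis set (the depth levels present in the neighbourhood) are illustrative assumptions, not the exact implementation:

import numpy as np

def video_hypothesis_filter(depth, color, r_s=5, r_t=2,
                            sigma_c=10.0, sigma_s=4.0, sigma_t=1.0):
    """Spatio-temporal hypothesis filtering (illustrative sketch).
    depth: (T, H, W) degraded depth; color: (T, H, W) luminance.
    Candidate depth hypotheses from a space-time neighbourhood are
    weighted by colour similarity to the central pixel and by separate
    spatial and temporal penalties; the best-supported hypothesis wins."""
    T, H, W = depth.shape
    out = np.empty_like(depth)
    for t in range(T):
        for y in range(H):
            for x in range(W):
                t0, t1 = max(0, t - r_t), min(T, t + r_t + 1)
                y0, y1 = max(0, y - r_s), min(H, y + r_s + 1)
                x0, x1 = max(0, x - r_s), min(W, x + r_s + 1)
                nb_c = color[t0:t1, y0:y1, x0:x1]
                nb_d = depth[t0:t1, y0:y1, x0:x1]
                tt, yy, xx = np.meshgrid(np.arange(t0, t1) - t,
                                         np.arange(y0, y1) - y,
                                         np.arange(x0, x1) - x, indexing='ij')
                w = (np.exp(-(nb_c - color[t, y, x]) ** 2 / (2 * sigma_c ** 2))
                     * np.exp(-(yy ** 2 + xx ** 2) / (2 * sigma_s ** 2))
                     * np.exp(-tt ** 2 / (2 * sigma_t ** 2)))
                # accumulate weights per depth hypothesis and pick the best
                hyps, inv = np.unique(nb_d, return_inverse=True)
                votes = np.bincount(inv.ravel(), weights=w.ravel())
                out[t, y, x] = hyps[np.argmax(votes)]
    return out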

5.3 Experiments

We present two experiments. In the first experiment, we consider a depth sequence estimated from noisy stereo sequences. Namely, a given stereo sequence $C_L(x,t)$, $C_R(x,t)$ is used to estimate the depth sequence $Z(x,t)$. Then, white noise is added to the stereo video to obtain the noisy stereo video $\tilde{C}_L(x,t)$, $\tilde{C}_R(x,t)$, which is used to estimate the impaired depth sequence $\tilde{Z}(x,t)$. The latter is filtered by the suggested video hypothesis filtering. For comparison, median filtering is applied to the noisy depth sequence and to the per-frame hypothesis-filtered data. In our practical setting, we have used a stereo pair of the 'Cones' test data from the Middlebury evaluation test bench [40]. For that given stereo pair we have the ground-truth depth, and we also estimated the depth by the method in [41]. To simulate a stereo video, we repeated the stereo pair 40 times to form 40 successive video frames, then added a different amount of noise to each frame and estimated the depth from each so-obtained noisy stereo frame. The results of the different filtering techniques applied to the noisy depth sequence are given in Figure 26. The results are consistent over all measures and show considerable improvement along the temporal dimension when the video extension of the hypothesis filtering is applied. The video hypothesis filtering not only manages to equalize the quality along the time axis but also improves the depth estimates compared to the ones obtained from noise-free data by the method from [41].
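The noisy test material can be reproduced along the following lines (Python/NumPy sketch; the function name and the per-frame noise levels are illustrative assumptions):

import numpy as np

def make_noisy_stereo_sequence(left, right, sigmas, seed=0):
    """Simulate a noisy stereo sequence as in Experiment 1: the same
    stereo pair is repeated once per entry of sigmas, with independent
    white Gaussian noise of that standard deviation (in 8-bit levels)
    added to each frame, e.g. sigmas = np.linspace(5, 25, 40)."""
    rng = np.random.default_rng(seed)
    noisy = lambda img, s: np.clip(img.astype(np.float64)
                                   + rng.normal(0.0, s, img.shape), 0, 255)
    seq_l = np.stack([noisy(left, s) for s in sigmas])
    seq_r = np.stack([noisy(right, s) for s in sigmas])
    return seq_l, seq_r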

In the second experiment we simulate blocky artifacts in the depth channel. To create a ground-truth video-plus-depth sequence, we circularly shifted the same 'Cones' data with a radius of 10 pixels, also adding some noise to the shifting vectors, and then cropped the central parts of the so-obtained frames. Thus, we got a sequence simulating circular motion of the camera plus a small amount of shaking. The sequence was compressed by an H.264 encoder in IPIPIP mode, slightly varying the quantization parameter (QP) per frame to simulate different amounts of blockiness in successive frames. The filtering results are presented in Figure 27. We compared the following filters: the single-frame hypothesis filter, the same followed by median filtering along time, and the video hypothesis filtering. As can be seen in the figure, the video version of hypothesis filtering has the most consistent performance. It performs especially well around edges. The rendered frames are of similar quality, thus providing a smooth and flicker-free experience. The only exception is the BAD metric, where the compressed depth seems to be the 'best'. This metric, originally introduced to measure the performance of depth estimation algorithms, simply counts the differences between ground-truth and processed pixels, no matter how big or small (but above a threshold) the differences are. While all filtering algorithms introduce small changes over the whole image, these small changes affect a larger percentage of pixels than the blocky artifacts of the quantized depth image do. However, what really matters are the bigger differences appearing around edges. These are well tackled by the filtering, as seen in the other metrics. Especially informative is the NRMSE, which measures the quality of the rendered channel and is thus closest to human perception. There, the new filtering approach truly excels.
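For reference, the BAD measure reduces to a thresholded pixel count. A minimal sketch (the threshold of one level is an assumed default, not a value taken from our evaluation):

import numpy as np

def bad_pixels_percentage(depth, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute deviation from the ground-truth
    depth/disparity exceeds the threshold; how far above the threshold a
    deviation lies does not influence the score."""
    diff = np.abs(depth.astype(np.float64) - ground_truth.astype(np.float64))
    return 100.0 * (diff > threshold).mean()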

Finally, we provide some visual illustrations of the performance of the algorithm. We use the 'Book arrival' sequence provided by Fraunhofer HHI, where the depth is estimated by the MPEG depth estimation software [42]. While that software incorporates rather powerful techniques and yields high-quality and time-consistent depth maps, our technique still adds some improvements. Figure 28 shows the result of filtering for frame 20. From left to right, the figure shows the originally-estimated depth, the depth obtained after median filtering along time, and the depth resulting from the proposed method. The depth estimation has failed around the face of the person entering the room and in the floor area. Median filtering manages to correct the depth of the floor but fails to correct the face of the person. The proposed method restores both the floor and the face. The same sequence has been compressed and decompressed with H.264 intra-frame coding and then filtered. The result of decompression and filtering is shown in Figure 29. Again, despite the substantial blocking artefacts, details such as human faces have been successfully restored.


5.4 Results

Figure 26. Comparative results of filtering approaches as in Experiment 1. Per-frame plots for the 'Cones' sequence: BAD, BAD near discontinuities, CONSIST, Normalized RMSE, PSNR of depth, and PSNR of the virtual channel. Compared: Noise-Free Estimate; Noisy Estimate; Noisy Estimate + Median (5 frm); Noisy + Hypothesis + Median (5 frm); Noisy + Hypothesis; Noisy + Video Hypothesis (3 frm); Noisy + Video Hypothesis (5 frm).


Figure 27. Comparative results of filtering approaches as in Experiment 2. Per-frame plots for the 'Cones' sequence: BAD, BAD near discontinuities, CONSIST, Normalized RMSE, PSNR of depth, and PSNR of the virtual channel. Compared: Noisy Estimate; Noisy + Hypothesis; Noisy + Hypothesis + Median (7 frm); Noisy + Video Hypothesis (7 frm).


Figure 28. Results of filtering of the 'Book arrival' depth sequence. From left to right: originally-estimated depth; median-filtered; filtered by the proposed approach

Figure 29. Filtering of a compressed depth sequence. From left to right: decompressed depth map; decompressed depth map filtered by the proposed approach


6 Conclusions

Existing standard up- and down-sampling methods, as well as an alternative simple FIR filter with variable cutoff frequency for mixed-resolution stereo, have been presented and evaluated. Coding experiments demonstrate that the simple FIR filter with a cutoff frequency of approximately 0.6 outperforms the standard methods. PSNR gains of up to 1 dB at a constant bit rate, or bit rate savings of up to 30% at a constant PSNR, can be achieved.

The standard methods have the drawback of a fixed cutoff frequency of approximately 0.4. This rather low frequency might be suitable for high-resolution video material with a low spectral density at high frequencies. In contrast, content for mobile applications has a low resolution; hence a high spectral power density can be expected at high frequencies. The strong attenuation of these spectral bands caused by applying a filter with a low cutoff frequency leads to significant image distortion. The second effect not taken into account by the standard anti-aliasing filters is the additional filtering characteristic of the codec: quantization of transform coefficients leads to further damping of high-frequency aliasing artifacts.
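For illustration, such a filter can be obtained with a standard windowed FIR design, analogous to the Matlab fir1 routine [4]. A sketch in Python using scipy.signal.firwin, where the filter order and window are assumptions and the cutoff is given relative to the Nyquist frequency:

import numpy as np
from scipy.signal import firwin
from scipy.ndimage import convolve1d

# Low-pass FIR for factor-2 down-sampling: a cutoff of 0.6 (of Nyquist)
# retains more high-frequency detail than the standard filters' ~0.4.
taps = firwin(numtaps=13, cutoff=0.6, window='hamming')

def downsample2(image):
    """Filter separably in both directions, then drop every second sample."""
    f = convolve1d(image.astype(np.float64), taps, axis=0, mode='reflect')
    f = convolve1d(f, taps, axis=1, mode='reflect')
    return f[::2, ::2]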

Although high gains have already been achieved with the FIR filter with variable cutoff frequency, further improvements are conceivable. Filters of higher or variable order might perform better, but may be more computationally expensive due to the higher number of coefficients. Moreover, filtering in the frequency domain or a joint optimization of down- and up-sampling filters [10] could lead to higher gains. Nevertheless, the maximal possible gain is limited, and it is questionable whether advanced methods would result in a significant improvement.

Our experiments with noisy stereo video demonstrated the power of the VBM3D technique to denoise such data effectively. The denoised channels are much better suited for visualization and for tasks such as disparity and depth estimation and view synthesis. In addition, the denoising algorithm tends to improve the views in a way that yields even better-quality depth maps and synthesized views than the originals. This suggests that the approach has the potential to correct and improve visual cues in such content.

Our initial hypothesis was that the correlation between the two views could be successfully utilized to find more similar patches for the collaborative transform-domain filtering. This would give an extension of VBM3D to stereo and multi-view. The experiments, however, showed no benefit in searching for and using similar patches between the two views, although there is high correlation between them. Apparently, the slightly different lighting conditions, the multi-view geometry, and the fact that the videos are captured by two different sensors leave too little similarity to contribute to the denoising performance.

For the task of deblocking compressed depth maps, the method based on probabilistic assumptions (Subsection 4.5) showed superior results, however at the price of a very high computational cost. Therefore, we have suggested practical modifications leading to a faster and more memory-efficient version suitable for implementation on a mobile platform. The competing methods, i.e. LPA-ICI and bilateral filtering, should not be discarded, however, as fast implementations of those exist as well. They demonstrated competitive performance and thus form a scalable set of algorithms. Practitioners can choose between the algorithms in the set depending on the requirements of their applications and the available computational resources.

The deblocking tests demonstrated that it is possible to allocate a very small fraction of the total bit budget to compressing the depth, thus allowing for high-quality backward compatibility and channel fidelity. The price for this is some additional post-processing at the receiver side.

We have extended the depth filtering approaches to the case of video sequences and to more general types of depth distortions. Again, we have suggested a fast, memory-efficient and high-quality filtering approach which utilizes colour information from the associated video channel and


also adapts to the true depth range and its structure. Thanks to the efficient data structure used for processing, our technique delivers highly time-consistent depth sequences. In the case of depth sequences impaired by blocky artifacts resulting from block-transform-based compression, it is possible to tune the filtering parameters depending on the quantization parameter of the compression engine. The technique is also applicable in depth estimation scenarios where the depth quality is compromised by noisy data or by requirements for quick processing. The approach does not require knowledge of motion or optical flow, as it relies on the colour weighting to discard unsuitable pixels from adjacent video frames in the filtering domain.

Acknowledgements

The authors would like to thank the providers of 3D image and video content: Middlebury [40], KUK Filmproduktion, and Fraunhofer HHI (FhG) [9]. We also acknowledge the availability of the MPEG depth estimation software [42], as well as the depth estimation [18] and view rendering software developed by FhG. We especially thank the authors of the original VBM3D denoising algorithm and software [14].

References

[1] Heribert Brust, Gerhard Tech and Karsten Müller, Mobile3DTV: Report on generation of mixed spatial resolution stereo data base, June 2009

[2] Gerhard Tech, Heribert Brust, Karsten Müller, Döne Buğdaycı, Mobile3DTV: Development and optimization of coding algorithms for mobile 3DTV, November 2009.

[3] JSVM Software Manual, Version 9.16, December 2008.

[4] Mathworks Matlab helpdesk, http://www.mathworks.com/access/helpdesk/help/toolbox/signal/fir1.html

[5] Shijun Sun, Julien Reichel, "AHG Report on Spatial Scalability Resampling", Joint Video Team, Doc. JVT-R006, Bangkok, Thailand, January 2006.

[6] Gary Sullivan, Shijun Sun, "AHG Report on Spatial Scalability Filters", Joint Video Team, Doc. JVT-P007, Poznan, Poland, July 2005.

[7] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services", Annex G.8.6.2.3, November 2007.

[8] Programs for Digital Signal Processing, IEEE Press, New York, 1979. Algorithm 5.2.

[9] Aljoscha Smolic and Gerhard Tech, Report on generation of stereo video data base, July 2009.

[10] Y. Tsaig, M. Elad, P. Milanfar, and G. H. Golub, "Variable Projection for Near-Optimal Filtering in Low Bit-Rate Block Coders", IEEE Trans. Circuits and Systems for Video Technology, no. 1, January 2005, pp. 154-160.

[11] A. Buades, B. Coll, and J.-M. Morel, "A review of image denoising algorithms, with a new one", Multiscale Modeling and Simulation (SIAM interdisciplinary journal), vol. 4, no. 2, pp. 490-530, 2005.

[12] Katkovnik, V., A. Foi, K. Egiazarian, and J. Astola, “From local kernel to nonlocal multiple-model image denoising”, Int. J. Computer Vision, vol. 86, no. 1, pp. 1-32, January 2010. doi:10.1007/s11263-009-0272-7.

[13] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3D transform-domain collaborative filtering," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080-2095, August 2007.

[14] K. Dabov, A. Foi, and K. Egiazarian, "Video denoising by sparse 3D transform-domain collaborative filtering," in Proc. 15th European Signal Processing Conference, EUSIPCO 2007, Poznan, Poland, September 2007.

Page 40: 3D Video Processing Algorithms Part Isp.cs.tut.fi/mobile3dtv/results/tech/D5.4_Mobile3DTV_v1.0.pdfMOBILE3DTV D5.4 3D Video Processing Algorithms – Part I 9 2.2.2 Results Results

MOBILE3DTV D5.4 3D Video Processing Algorithms – Part I

39

[15] A. Buades, Y. Lou, J.-M. Morel, and Z. Tang, "A Note on Multi-Image Denoising," in Proc. International Workshop on Local and Non-Local Approximation (LNLA) in Image Processing, 2009.

[16] L. Zhang, S. Vaddadi, H. Jin, and S. Nayar, "Multiple View Image Denoising," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009.

[17] Yong Seok Heo, Kyoung Mu Lee, and Sang Uk Lee, "Simultaneous Depth Reconstruction and Restoration of Noisy Stereo Images Using Non-local Pixel Distribution," Proc. Computer Vision and Pattern Recognition (CVPR), 2007.

[18] N. Atzpadin, P. Kauff, and O. Schreer. “Stereo Analysis by Hybrid Recursive Matching for Real-Time Immersive Video Conferencing”, IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Immersive Telecommunications, 14(3):321-334, March 2004.

[19] Y. Morvan, D. Farin, and P. H. N. de With, "Depth-Image Compression based on an R-D Optimized Quadtree Decomposition for the Transmission of Multiview Images," in IEEE International Conference on Image Processing, San Antonio, TX, USA, 2007.

[20] P. Merkle et al., "The Effect of Depth Compression on Multiview Rendering Quality," in 2008 3DTV-Conference: The True Vision – Capture, Transmission and Display of 3D Video, Istanbul, 2008, pp. 245-248.

[21] Y. Morvan, D. Farin, and P. H. N. de With, "Joint Depth/Texture Bit-Allocation For Multi-View Video Compression," in Picture Coding Symposium, Lisboa, 2007.

[22] Antti Tikanmäki, Aljoscha Smolic, Karsten Mueller, and Atanas Gotchev, "Quality Assessment of 3D Video in Rate Allocation Experiments," in IEEE International Symposium on Consumer Electronics ISCE 2008, Algarve, Portugal, 2008.

[23] P. Merkle, Y. Wang, K. Müller, A. Smolic, and T. Wiegand, "Video Plus Depth Compression For Mobile 3D Services", 3DTV-Conference, Potsdam, Germany, May 2009.

[24] C. Fehn, "Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV," in Proc. SPIE Stereoscopic Displays and Virtual Reality Systems XI, 2004, p. 93.

[25] A. Foi, V. Katkovnik, and K. Egiazarian, "Pointwise Shape-Adaptive DCT for High-Quality Denoising and Deblocking of Grayscale and Color Images," IEEE Trans. Image Process., vol. 16, no. 5, pp. 1395-1411, 2007.

[26] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz, "Adaptive Deblocking Filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 614-619, July 2003.

[27] V. Katkovnik, K. Egiazarian, and J. Astola, Local Approximation Techniques in Signal and Image Processing. SPIE Publications, 2006.

[28] A. Goldenshluger and A. Nemirovski, "On spatial adaptive estimation of nonparametric regression," Math. Meth. Statistics, vol. 6, pp. 135-170, 1997.

[29] V. Katkovnik, "A new method for varying adaptive bandwidth selection," IEEE Transactions on Signal Processing, vol. 47, no. 9, pp. 2567-2571, September 1999.

[30] C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images," in IEEE International Conference on Computer Vision, Bombay, 1998.

[31] Kuk-Jin Yoon and In-So Kweon, "Locally Adaptive Support-Weight Approach for Visual Correspondence Search," in Conference on Computer Vision and Pattern Recognition, 2005, pp. 924-931.

[32] Q. Yang, R. Yang, J. Davis, and D. Nister, "Spatial-Depth Super Resolution for Range Images," in CVPR, 2007.

[33] D. Scharstein and R. Szeliski. Middlebury Stereo Vision Page. [Online]. http://vision.middlebury.edu/stereo/


[34] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision, vol. 47, pp. 7-42, April-June 2002.

[35] D. Scharstein and R. Szeliski, "High-accuracy stereo depth maps using structured light," in Computer Vision and Pattern Recognition, Madison, 2003.

[36] S. Baker et al., "A database and evaluation methodology for optical flow," in Proc. IEEE Int'l Conf. on Computer Vision, Crete, Greece, 2007, pp. 243-246.

[37] A. Boev, D. Hollosi, A. Gotchev, and K. Egiazarian, "Classification and simulation of stereoscopic artifacts in mobile 3DTV content," in Proc. SPIE vol. 7237, Stereoscopic Displays and Applications XX, A. J. Woods, N. S. Holliman, J. O. Merritt (Eds.), 2009, p. 72371F.

[38] G. Zhang, J. Jia, T. Wong, H. Bao, Consistent Depth Maps Recovery from a Video Sequence. IEEE Trans. Pattern Anal. Mach. Intell. Vol. 31, No.6, pp. 974-988 (2009).

[39] C. Cigla, and A. A. Alatan, Temporally consistent dense depth map estimation via Belief Propagation, in Proceedings of 3DTV-CON 2009, 4-6 May 2009, Potsdam, Germany.

[40] D. Scharstein and R. Szeliski. Middlebury Stereo Vision Page. [Online]. http://vision.middlebury.edu/stereo/

[41] Kuk-Jin Yoon and In-So Kweon, "Locally Adaptive Support-Weight Approach for Visual Correspondence Search," in Conference on Computer Vision and Pattern Recognition, 2005, pp. 924-931.

[42] O. Stankiewicz and K. Wegner, "Depth Map Estimation Software version 3", ISO/IEC JTC1/SC29/WG11 MPEG/M15540, July 2008, Hannover, Germany.


Mobile 3DTV Content Delivery Optimization over DVB-H System

MOBILE3DTV - Mobile 3DTV Content Delivery Optimization over DVB-H System - is a three-year project which started in January 2008. The project is partly funded by the European Union 7th RTD Framework Programme in the context of the Information & Communication Technology (ICT) Cooperation Theme.

The main objective of MOBILE3DTV is to demonstrate the viability of the new technology of mobile 3DTV. The project develops a technology demonstration system for the creation and coding of 3D video content, its delivery over DVB-H and display on a mobile device, equipped with an auto-stereoscopic display.

The MOBILE3DTV consortium is formed by three universities, a public research institute and two SMEs from Finland, Germany, Turkey, and Bulgaria. Partners span diverse yet complementary expertise in the areas of 3D content creation and coding, error resilient transmission, user studies, visual quality enhancement and project management.

For further information about the project, please visit www.mobile3dtv.eu.

Tuotekehitys Oy Tamlink (FINLAND): Project coordinator
Tampereen Teknillinen Yliopisto (FINLAND): Visual quality enhancement, scientific coordinator
Fraunhofer Gesellschaft zur Förderung der angewandten Forschung e.V. (GERMANY): Stereo video content creation and coding
Middle East Technical University (TURKEY): Error resilient transmission
Technische Universität Ilmenau (GERMANY): Design and execution of subjective tests
MM Solutions Ltd. (BULGARIA): Design of prototype terminal device

MOBILE3DTV project has received funding from the European Community's ICT programme in the context of the Seventh Framework Programme (FP7/2007-2011) under grant agreement n° 216503. This document reflects only the authors' views and the Community or other project partners are not liable for any use that may be made of the information contained therein.