

EVALUATION OF DEPTH COMPRESSION AND VIEW SYNTHESIS DISTORTIONS IN MULTIVIEW-VIDEO-PLUS-DEPTH CODING SYSTEMS

Noha A. El-Yamany(1,2), Kemal Ugur(3), Miska M. Hannuksela(3) and Moncef Gabbouj(2)

(1)Department of Signal Processing, Tampere University of Technology, Tampere, Finland

(2)Department of Electrical Engineering, Southern Methodist University, Dallas, Texas, USA (3)Nokia Research Center, Tampere, Finland

[email protected], {kemal.ugur, miska.hannuksela}@nokia.com, [email protected]

ABSTRACT

Several quality evaluation studies have been performed for video-plus-depth coding systems. In these studies, however, the distortions in the synthesized views have been quantified in experimental setups where both the texture and depth videos are compressed. Nevertheless, there are several factors that affect the quality of the synthesized view. Incorporating more than one source of distortion in the study could be misleading; one source of distortion could mask (or be masked by) the effect of other sources of distortion. In this paper, we conduct a quality evaluation study that aims to assess the distortions introduced by the view synthesis procedure and depth map compression in multiview-video-plus-depth coding systems. We report important findings that many of the existing studies have overlooked, yet are essential to the reliability of quality evaluation. In particular, we show that the view synthesis reference software yields high distortions that mask those due to depth map compression, when the distortion is measured by average luma peak signal-to-noise ratio. In addition, we show what quality metric to use in order to reliably quantify the effect of depth map compression on view synthesis quality. Experimental results that support these findings are provided for both synthetic and real multiview-video-plus-depth sequences.

Index Terms — Depth map compression, multi-view video coding, video plus depth, view synthesis

1. INTRODUCTION

Multiview video processing and encoding have recently attracted considerable attention in both academia and industry. 3D video applications, such as 3D television (3DTV), have driven substantial research and development activities worldwide, ranging from the design of new technologies to the development of new processing pipelines and coding standards. Multiview video, however, involves a huge amount of data that needs to be encoded and transmitted. Consequently, efficient 3D content representation and compression techniques are essential to enable prospective 3D services and technologies.

Two commonly used formats for 3D video representation are the multiview video and the multiview-video-plus-depth (MVD) [1] formats. The multiview video representation consists of two or more views, each represented by a sequence of pictures. A subset of the multiview video format is the stereoscopic video format, which includes two views. The multiview video format does not include any information about scene geometry such as depth. Selected views of the multiview video can be directly ported to stereoscopic displays. Due to the lack of scene geometry information, however, the multiview video format does not enable the adjustment of depth perception to accommodate different displays. Despite its limitations, the multiview video format is robust, because it does not involve any error-prone processing such as depth estimation and view synthesis. Nevertheless, encoding of multiview video data requires that all views be compressed and transmitted, which presents challenges in terms of memory, computational power and bandwidth.

The MVD format consists of multiview video and an associated per-pixel depth map for each view. MVD extends the single-view video-plus-depth (VPD) representation, which is also referred to as 2D+Z [2]. The depth maps can then be used to render novel (virtual) views in which the objects in the video have been shifted to the positions where they would have been seen by a virtual camera parallel to the real one. As view synthesis from the MVD representation can use more than one texture and depth view, the quality of the synthesized view is generally better than that resulting from the VPD representation. The MVD format allows the adjustment of depth perception to accommodate different displays, as well as stereo rendering. Furthermore, this format enables the development of new and flexible 3DV applications and services; MVD systems could send a limited number of views and their corresponding depth maps, and virtual views could be synthesized at the receivers. In MVD systems, both color (texture) and depth need to be compressed prior to transmission. However, the depth map can be treated as monochromatic video, converted to YUV 4:0:0 format and compressed at significantly lower bit rates than the color video. This, of course, comes at the expense of increased complexity due to the need for view synthesis at the receiver. In addition, efficient depth estimation and view synthesis techniques are needed in order to render virtual views of satisfactory quality.
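The parallel-camera rendering described above can be illustrated with a minimal depth-image-based warping sketch. This is a toy, single-view forward warp under assumed parameters (focal length f, baseline, and an 8-bit inverse-depth map spanning z_near to z_far); an actual renderer such as the reference software additionally blends two reference views, resolves occlusions by depth ordering and fills disocclusion holes.

```python
import numpy as np

def warp_view(texture, depth, f, baseline, z_near, z_far):
    """Forward-warp one texture view to a parallel virtual camera.

    depth: 8-bit map where 255 = nearest (z_near) and 0 = farthest (z_far).
    Returns the warped view and a mask of pixels that received a value.
    Occlusion ordering and hole filling are deliberately omitted.
    """
    h, w = depth.shape
    # Recover metric depth Z from the common 8-bit inverse-depth convention.
    z = 1.0 / (depth / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
    # Disparity (horizontal pixel shift) for a parallel camera pair.
    disparity = np.round(f * baseline / z).astype(int)
    virtual = np.zeros_like(texture)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xs = x - disparity[y, x]
            if 0 <= xs < w:
                virtual[y, xs] = texture[y, x]
                filled[y, xs] = True
    return virtual, filled
```

Pixels left unset in `filled` mark disocclusions; this is precisely where a second reference view (as in MVD) improves the synthesis over single-view VPD.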

MVD data can be compressed by various means. For example, the Advanced Video Coding (H.264/AVC) standard [3] can be used to encode each texture view and each depth view independently, which is commonly referred to as H.264/AVC Simulcast coding of the MVD representation. Alternatively, the Multiview Video Coding (MVC) extension [4] of the H.264/AVC standard can be used to code the texture views as one bitstream and the depth views as another bitstream. The quality of the synthesized virtual views in MVD coding systems depends on 1) the compression method, 2) the bit rate budgets for the depth and color information, and 3) the accuracy of the depth estimation and view synthesis procedures.

A number of quality evaluation studies have been introduced in the literature for MVD and VPD compression [1-2, 5-9]. These studies either addressed the rate-distortion (RD) performance of different coding schemes, attempting to optimize or examine the bit rate budgets for both the color and depth information, or proposed new objective quality metrics and evaluated their correlation with subjective evaluation tests. In these studies, however, the distortions in the synthesized views were quantified in experimental setups where both the texture and depth videos were compressed. As mentioned earlier, there are several factors that affect the quality of the synthesized views. Hence, incorporating more than one source of distortion in the study could be misleading; one source of distortion could mask (or be masked by) the effect of other sources of distortion. Therefore, it is essential to quantify the effect of each source separately, and then evaluate the combined effects. Only then would quality evaluation studies be effective and meaningful, and yield findings that help to develop efficient RD optimization criteria.

In this paper, we conduct a quality evaluation study that aims to assess the distortions introduced by the view synthesis procedure and depth map compression in MVD coding systems. We report important findings that many of the existing studies have overlooked, yet are essential to the reliability of quality evaluation. In particular, we show that the view synthesis reference software yields high distortions in terms of average luma peak signal-to-noise ratio (PSNR) that mask those due to depth map compression. In addition, we show what quality metric to use in order to reliably quantify the effect of depth map compression on view synthesis quality. Experimental results that support these findings are provided for both synthetic and real MVD sequences.

978-1-4244-6378-7/10/$26.00 ©2010 IEEE

The rest of the paper is organized as follows. Section 2 introduces the experimental setup used in the proposed quality evaluation study. Experimental results are provided and analyzed in Section 3. Finally, the paper concludes in Section 4.

2. PROPOSED STUDY FRAMEWORK

Figure 1 depicts a schematic diagram of the MVD compression scheme used in our study. As shown in the figure, there are two camera views, A and C, with their corresponding texture and depth videos available. A viewpoint B, for which the original texture video is also available, lies between A and C. The distance from B to either A or C is the same and is fixed to a one-camera distance, considering a linear camera arrangement; increasing the distance leads to synthesized views of lower quality, which would bias the results of the study, and hence we use a one-camera distance in our experiments. Since we aim to evaluate the distortions introduced by depth map compression, only the depth maps corresponding to A and C are encoded and decoded independently with H.264/AVC, and the original texture videos are used. A reference virtual view video, BRV, is synthesized for viewpoint B from the uncompressed texture and depth videos of A and C, and a test virtual view video, BTV, is rendered from the uncompressed texture and compressed depth of A and C. We use the View Synthesis Reference Software (VSRS) 3.0 [10] to generate BRV and BTV.

The quality of the synthesized views is assessed by means of three PSNR metrics, which are listed below.

(A) PSNR1, defined as the average luma PSNR of BRV compared to the original middle view video, BO:

\mathrm{PSNR}_{1} = \frac{1}{M} \sum_{i=1}^{M} 10 \log_{10}\left[ \frac{Y_{\max}^{2}}{\mathrm{MSE}\left( Y_{B_{O},i},\, Y_{B_{RV},i} \right)} \right] \ \text{(dB)} \qquad (1)

where M is the number of frames, i = 1, 2, …, M, Y is the luma component, Ymax is the maximum luma value, and MSE is the mean square error. This metric is used to quantify the distortions introduced by the view synthesis algorithm.

(B) PSNR2, defined as the average luma PSNR of BTV compared to BRV:

\mathrm{PSNR}_{2} = \frac{1}{M} \sum_{i=1}^{M} 10 \log_{10}\left[ \frac{Y_{\max}^{2}}{\mathrm{MSE}\left( Y_{B_{RV},i},\, Y_{B_{TV},i} \right)} \right] \ \text{(dB)} \qquad (2)

This metric quantifies the distortions due to depth map compression.

(C) PSNR3, defined as the average luma PSNR of BTV compared to BO:

\mathrm{PSNR}_{3} = \frac{1}{M} \sum_{i=1}^{M} 10 \log_{10}\left[ \frac{Y_{\max}^{2}}{\mathrm{MSE}\left( Y_{B_{O},i},\, Y_{B_{TV},i} \right)} \right] \ \text{(dB)} \qquad (3)

This metric is used to evaluate the combined distortions due to view synthesis and depth map compression.
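Under the frame-wise definitions in Eqs. (1)-(3), all three metrics are one averaging routine applied to different frame pairs. The following is a minimal sketch, assuming the frames are 8-bit luma planes held as NumPy arrays (the helper name and variable names are ours, not part of any reference software):

```python
import numpy as np

def avg_luma_psnr(ref_frames, test_frames, y_max=255.0):
    """Average per-frame luma PSNR: (1/M) * sum_i 10*log10(Ymax^2 / MSE_i).

    Note: identical frames give MSE = 0 (infinite PSNR); real tools
    typically cap or skip such frames.
    """
    psnrs = []
    for ref, tst in zip(ref_frames, test_frames):
        mse = np.mean((ref.astype(float) - tst.astype(float)) ** 2)
        psnrs.append(10.0 * np.log10(y_max ** 2 / mse))
    return sum(psnrs) / len(psnrs)

# PSNR1 = avg_luma_psnr(B_O_frames,  B_RV_frames)  # view-synthesis distortion
# PSNR2 = avg_luma_psnr(B_RV_frames, B_TV_frames)  # depth-compression distortion
# PSNR3 = avg_luma_psnr(B_O_frames,  B_TV_frames)  # combined distortion
```

Note that averaging per-frame PSNR, as Eqs. (1)-(3) specify, is not the same as computing one PSNR from the pooled MSE of all frames; the two differ whenever frame quality varies over time.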

3. EXPERIMENTAL RESULTS

Table I lists the sequences used in the proposed evaluation study, which cover different types of scene content complexity and temporal variation. The first sequence, Undo Dancer, is a synthetic one having views L4, …, L1, C, R1, …, R4, from left to right. The depth maps for Undo Dancer are available as ground truth. Figure 2 depicts the first frame from the texture and depth videos for view C of the Undo Dancer sequence. The depth maps for the other four sequences¹ were estimated by the Depth Estimation Reference Software (DERS) 4.0 [10]. The synthetic Undo Dancer sequence was used to factor out the inherent inaccuracies of the depth estimation procedure from the quality evaluation; the real sequences were then used to confirm the findings obtained with the synthetic one.

¹ The Champagne Tower, Book Arrival, Lovebird-1 and Newspaper sequences are courtesy of Nagoya University, FHG-HHI, ETRI and GIST, respectively.

A Nokia H.264/AVC baseline encoder implementation was used for speed purposes and configured as follows: IPPP prediction structure with only the first frame encoded as Intra, context-based adaptive variable-length coding for symbol coding, rate-distortion-optimized mode decision, four reference frames, and full-search motion estimation with a search range of 32. Four QP values were considered (24, 30, 36 and 42), and the same QP was used for the I and P frames.

Plots of PSNR2 versus the total depth bit rate (the sum of the bit rates of the left and right depth videos) for the test sequences are depicted in Figure 3, and plots of PSNR1 and PSNR3 are depicted in Figure 4. From these results, several observations are worth mentioning:

- The values of PSNR1 are generally low, which indicates a high distortion level introduced by the view synthesis procedure. This also implies that the view synthesis algorithm yields relatively poor estimates of the middle view, BO. Indeed, examining BRV for the test sequences reveals that the estimates suffer from various artifacts such as blur, false edges, false coloring and ghosting. These artifacts impair the quality of the virtual view, increasing the mismatch between BRV and BO and hence lowering the PSNR value.

- The PSNR2 values decrease as QP increases and more distortions are introduced by depth map compression. However, these PSNR values are observed to be relatively high, which implies that the depth map can be compressed at low bit rates while maintaining relatively high synthesis quality compared to synthesis with uncompressed depth. Examining BTV and BRV for the test sequences also confirms this result.

- The values of PSNR3, on the other hand, are quite intriguing, as we note the following:

  a. PSNR1 and PSNR3 are very close in value, i.e.,

     \mathrm{MSE}\left( Y_{B_{O}},\, Y_{B_{RV}} \right) \approx \mathrm{MSE}\left( Y_{B_{O}},\, Y_{B_{TV}} \right) \qquad (4)

  b. PSNR3 seems to be independent of the QP value, changing only slightly as QP changes.
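The closeness of PSNR1 and PSNR3 is what one would expect when a large synthesis error masks a small compression error: for roughly independent errors the MSEs approximately add, so the combined term is dominated by the larger one. The following synthetic numerical sketch illustrates this (the error magnitudes are assumptions chosen purely for illustration, not measured values from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b_o = rng.uniform(0.0, 255.0, n)        # stand-in for original luma samples
b_rv = b_o + rng.normal(0.0, 12.0, n)   # large, synthesis-like error
b_tv = b_rv + rng.normal(0.0, 1.5, n)   # small, compression-like error on top

def psnr(a, b, y_max=255.0):
    return 10.0 * np.log10(y_max ** 2 / np.mean((a - b) ** 2))

psnr1, psnr2, psnr3 = psnr(b_o, b_rv), psnr(b_rv, b_tv), psnr(b_o, b_tv)
# psnr1 and psnr3 come out nearly equal (the synthesis error masks the
# compression error), while psnr2 is far higher than either.
```

With these magnitudes, MSE(b_o, b_tv) ≈ 144 + 2.25, so PSNR3 differs from PSNR1 by only about 0.07 dB, mirroring relation (4) in the observations above.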

Table I. The test sequences, corresponding views (A – B – C) and number of frames (M)

Sequence Name (Resolution)        A – B – C         M
Undo Dancer (1920×1072)           L4 – L3 – L2      300
                                  L1 – C – R1
                                  R2 – R3 – R4
Champagne Tower (1280×960)        39 – 40 – 41      100
Book Arrival (1024×768)           10 – 9 – 8        99
Lovebird 1 (1024×768)             6 – 7 – 8         100
Newspaper (1024×768)              4 – 5 – 6         100

The above observations bear important findings pertinent to our quality evaluation study, and are listed below:

1. From (4) and the observations above, it is evident that the distortions due to view synthesis (reference software [10]) dominate those due to depth map compression when the distortions are measured by average luma PSNR. This finding is confirmed by the relatively low PSNR1 and the relatively high PSNR2 values.

2. The use of the PSNR3 metric to evaluate the effect of depth map compression on view synthesis quality, as conducted by some of the studies introduced in the literature, should be treated with caution. Such evaluation could be misleading, since view synthesis distortions could mask those due to compression.

3. To evaluate the effect of depth map compression on view synthesis quality, the use of the PSNR2 metric is therefore recommended.

Figure 3. Plots of the peak signal-to-noise ratio metric PSNR2 for the test sequences used in the proposed study: (a) Undo Dancer L4 – L3 – L2, L1 – C – R1 and R2 – R3 – R4; (b) Champagne Tower, Book Arrival, Newspaper and Lovebird-1

4. CONCLUSIONS

In this paper, we introduced a quality evaluation study that aimed to assess the distortions due to the view synthesis procedure and depth compression in MVD coding systems.

We reported important findings that many of the existing studies have overlooked, yet are essential to the reliability of quality evaluation. In particular, we found that the view synthesis reference software yielded high distortions that masked those due to depth map compression when the distortion was measured by average luma PSNR. Based on this finding, we recommend using the average luma PSNR of the view synthesized from uncompressed texture and compressed depth, relative to the view synthesized from uncompressed texture and depth, as a measure of the quality impact of depth compression.

It should be noted that a study on subjective evaluation of the view synthesis quality is needed to better understand the correlation of this metric with perceived 3D video quality. Similar studies could also be performed to evaluate the effect of other sources of distortion in MVD coding, such as texture compression.

5. REFERENCES

[1] P. Merkle, A. Smolic, K. Müller and T. Wiegand, “Multi-view video plus depth representation and coding,” in Proceedings of the IEEE International Conference on Image Processing, vol. I, pp. 201-204, 2007.

[2] P. Merkle, Y. Wang, K. Müller, A. Smolic and T. Wiegand, “Video plus depth compression for mobile 3D services,” in Proceedings of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2009.

[3] ITU-T Recommendation H.264, “Advanced video coding for generic audiovisual services,” Mar. 2009.

[4] Y. Chen, Y.-K. Wang, K. Ugur, M. M. Hannuksela, J. Lainema, and M. Gabbouj, “The emerging MVC standard for 3D video services,” EURASIP Journal on Advances in Signal Processing, vol. 2009, Article ID 786015, 2009. doi:10.1155/2009/786015.

[5] G. Tech, A. Smolic, H. Brust, P. Merkle, K. Dix, Y. Wang, K. Müller and T. Wiegand, “Optimization and comparison of coding algorithms for mobile 3DTV,” in Proceedings of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2009.

[6] A. Tikanmaki, A. Gotchev, A. Smolic and K. Müller, “Quality assessment of 3D video in rate allocation experiments,” in Proceedings of the IEEE International Symposium on Consumer Electronics, 2008.

[7] G. Leon, H. Kalva, and B. Furht, “3D video quality evaluation with depth quality variations,” in Proceedings of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, pp. 301-304, 2008.

Figure 1. A schematic diagram for the VPD coding scheme used in the proposed study. (In the diagram, for each of views A and C, the original depth is encoded and decoded with H.264/AVC; VSRS 3.0 synthesizes the reference virtual view BRV for view B from the original color and original depth, and the test virtual view BTV from the original color and compressed depth.)

Figure 2. The first frame from the texture and depth videos for the Undo Dancer sequence, view C

[8] P. Merkle, Y. Morvan, A. Smolic, D. Farin, K. Müller, P. H. N. de With and T. Wiegand, “The effects of multiview depth video compression on multiview rendering,” Signal Processing: Image Communication, vol. 24, pp. 73–88, 2009.

[9] K. Klimaszewski, K. Wegner and M. Domański, “Distortions of synthesized views caused by compression of views and depth maps,” in Proceedings of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2009.

[10] M. Tanimoto, T. Fujii, K. Suzuki, N. Fukushima and Y. Mori, “Reference Softwares for Depth Estimation and View Synthesis,” ISO/IEC JTC1/SC29/WG11, MPEG2008/M15377, Archamps, France, April 2008.

Figure 4. Plots of the peak signal-to-noise ratio metrics PSNR1 and PSNR3 for the test sequences used in the proposed study: (a) Champagne Tower, (b) Book Arrival, (c) Newspaper, (d) Lovebird-1 and (e) Undo Dancer R2 – R3 – R4