
Pixel-Based Inter Prediction in Coded Texture Assisted Depth Coding

Shuai Li, Jianjun Lei, Member, IEEE, Ce Zhu, Senior Member, IEEE, Lu Yu, and Chunping Hou

Abstract—This letter presents a pixel-based motion estimation scheme assisted with the coded texture video for depth inter-prediction, in view of the motion similarity between depth and texture video. The proposed scheme can achieve higher inter-prediction gain without transmitting any motion vector in the pixel-based motion estimation. Coupled with depth-texture structure similarity, the inter prediction method is further extended to an integrated prediction approach by making use of both intra and inter information. Experimental results show that our proposed method achieves superior rate-distortion performance.

Index Terms—Depth coding, inter prediction, pixel-based motion estimation, 3D video.

I. INTRODUCTION

In the past two decades, 3D video and related applications, including 3DTV [1] and free viewpoint TV [2], have been attracting more and more research and development effort worldwide. Multiview video plus depth (MVD) [3], [4] has been considered a promising candidate 3D video format, since it can synthesize virtual views at any viewpoint using the depth image-based rendering (DIBR) [5] technique. To facilitate view rendering at the receiver side, depth video needs to be transmitted together with the 2D texture video.

A depth map is a gray-scale image representing the distance between the camera and the 3D points in the scene. It is composed of large smooth regions and distinct edges, and thus exhibits characteristics different from conventional texture video. To take advantage of these characteristics of the depth video, many coding methods have been proposed [6]–[8]. Most of these works either employ an edge/structure map to depict the structure/shape of the depth blocks or use additional prediction directions to predict the block. While such methods reduce the bits spent on coding the residuals, the overhead bits used for representing the structure or the additional prediction directions lower the coding efficiency as well.

Manuscript received September 16, 2013; revised November 07, 2013; accepted November 14, 2013. Date of current version November 22, 2013. This work was supported in part by the National Natural Science Foundation of China under Grants 61228102, 61271324, 60932007, and 61002029, and by the Natural Science Foundation of Tianjin under Grant 12JCYBJC10400. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiaokang Yang.

S. Li, J. Lei, and C. Hou are with the School of Electronic Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]).

C. Zhu is with the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]).

L. Yu is with the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LSP.2013.2291941

Since the texture and depth images capture the same scene, motion and structure similarities exist between the depth video and the corresponding texture video: the movement of the objects in the texture video conforms to that in the depth video, and collocated edges in the texture video exist for most depth edges. By properly exploiting these two kinds of similarities with the aid of the coded texture video, the depth coding efficiency can be enhanced significantly. In [9], [10], motion vectors of the corresponding texture video are re-used for the coding of the depth maps. Although the same motion is expected between the texture and depth video, the motion vectors obtained from block-based motion estimation for each video may not be the same, since block-based motion estimation finds the block that minimizes a certain matching criterion for the whole block instead of locating the real motion of each pixel in the object. Due to the difference in characteristics between texture and depth video, directly re-using the texture motion vectors for the depth video may therefore produce large residuals, especially around object edges. In addition to the above methods based on motion similarity, some intra coding methods [11], [12] based on structure similarity have been proposed. These approaches attempt to exploit the texture-depth structure similarity, and their performance heavily depends on the quality of the depth map, i.e., the alignment between depth edges and texture edges.

While the above depth coding techniques focus on the faithful coding of the depth map, some other depth coding methods take into consideration the inaccuracy of the depth maps produced by state-of-the-art depth generation approaches. Given that the depth values of smooth regions are usually inaccurate due to a lack of features to perform stereo matching, Kim et al. [13] considered enforcing the skip mode of the texture video onto the depth video to reduce the flickering artifact according to local characteristics of the texture video. In [15] and [16], a fast coding scheme to skip selected blocks of the depth map and a fast inter mode selection method were proposed based on the temporal and inter-view correlations of the corresponding texture images. In our previous work [14], a depth-texture cooperative clustering based intra prediction method was proposed, followed by a depth-texture alignment procedure to exploit the structure similarity while coping with the misalignment between depth and texture boundaries. Simulation results show that the efficiency of intra coding is significantly improved. However, this type of intra coding method, including the methods in [11], [12], only exploits the neighborhood information of the to-be-coded block in the current frame, which implies that it works well only when the information of the current block can be inferred well from the adjacent coded blocks.


This letter presents a pixel-based motion estimation approach to exploit the inter information for the to-be-coded depth block. The proposed approach first develops a pixelwise relationship between the current frame and the reference frame of the coded texture video, by making use of the block-based motion vectors. Based on the motion similarity between the depth and texture video, the developed pixelwise relationship is then mapped to the depth video, which enables pixel-based motion estimation for the depth video without sending any motion vector. Furthermore, a more robust prediction and an integrated prediction using both intra and inter information are proposed to further improve the prediction accuracy, and large residuals due to texture-depth misalignment are properly addressed. Experimental results show that the proposed scheme achieves substantially superior rate-distortion performance.

The proposed depth coding approach is described in Section II. Section III presents the experimental results, and Section IV concludes this letter.

II. PROPOSED PIXEL-BASED INTER PREDICTION

A. Pixel-Based Motion Estimation

Inter prediction is one of the most important components in video coding. It uses coded information from reference pictures to predict the current picture. A widely used inter prediction method is block-based motion compensation, which compensates for the movement of blocks of the current frame. The displacement between the current block and its reference block is coded into the bit stream as a motion vector. However, real objects in the scene rarely have neat edges that match the rectangular boundary of a block, which results in large residuals for the unmatched pixels. For the depth video especially, those large residuals due to the sharp transitions between different objects greatly degrade the coding performance. A more efficient inter prediction method is thus highly desired to accurately predict each pixel of the current block/frame.

For the texture video, it is possible to estimate the trajectory of each pixel between successive video frames, producing a field of pixel trajectories known as optical flow [17]. With the optical flow field known, accurate predictions can be developed for most of the pixels in the current frame by using the pixels in the reference frame located by their optical flow vectors. On the other hand, the large amount of data required to transmit the optical flow vectors makes this impractical for texture video coding. However, considering that the texture video is generally coded before the depth video, the estimated pixel-based motion vectors (optical flow vectors) from the coded texture video can be utilized for depth coding.
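For illustration, dense optical flow can be estimated with off-the-shelf tools. The following minimal sketch uses OpenCV's Farneback estimator as a stand-in for the generic optical flow of [17]; the choice of estimator and the nearest-neighbor sampling of the reference frame are our assumptions, not something the letter prescribes.

```python
# A minimal sketch of dense optical-flow-based pixel prediction, assuming
# OpenCV's Farneback estimator (the letter does not fix a flow algorithm).
import cv2
import numpy as np

def flow_based_prediction(cur, ref):
    """Predict every pixel of `cur` from `ref` via a dense flow field."""
    # Flow maps each pixel of `cur` to its estimated location in `ref`:
    # cur[y, x] ~ ref[y + flow[y, x, 1], x + flow[y, x, 0]].
    flow = cv2.calcOpticalFlowFarneback(cur, ref, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = cur.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ry = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    rx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    return ref[ry, rx]   # per-pixel prediction, no motion vectors transmitted
```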

As both a depth map and its corresponding texture image represent the same scene, the motion vectors of the objects in the texture video highly correlate with those in the depth video. Here we perform pixel-based motion estimation based on the coded texture video. In this way, the motion vector for each pixel does not need to be transmitted, since it can be obtained at the decoder side from the decoded texture video. Fig. 1 illustrates the process of the proposed pixel-based motion estimation scheme.

Fig. 1. An illustration of the pixel-based motion estimation approach with the mapping of motion vectors from the texture to the depth field.

In Fig. 1, the figures in the bottom row represent the depth video frames, while the upper row shows the corresponding texture video frames, for the current frame and the reference frame. The black dot in the current frame represents the current pixel. In the reference frame, the dots filled with lines represent the search points for the current pixel, and the dot filled with vertical lines is the best matched pixel for the current pixel. First, the motion vector for the corresponding texture pixel is estimated by locating the best matched pixel in the reference texture video frame; it is then mapped to the depth video to identify the reference pixel that serves as the predictor for the current depth pixel.

To effectively locate the reference pixel and reduce the search range, the block motion vectors from the coded texture video are utilized to determine the starting center point for the pixel-based motion estimation within a small search window. For blocks coded with intra mode in the texture video, a motion vector is first estimated by block-based motion estimation and then used as the starting center point. Since the values in non-edge depth blocks are smooth, the block-based motion vector can directly serve as the pixel-based one, thus skipping the pixel-based motion estimation for those blocks. Only the depth edge blocks (the Canny edge detector is used to derive the edges of the depth map, and blocks containing edges are regarded as edge blocks) are considered for the pixel-based motion estimation. At the decoder side, the same process is employed to obtain the motion vector for each pixel.
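A hedged sketch of such a seeded pixel-based search follows. The window size, the 3x3 SAD matching criterion, and the function names are assumptions made for illustration; the letter does not fix these details.

```python
# Sketch of pixel-based motion estimation (Section II-A): for each pixel of
# a depth edge block, a small window search is run on the CODED TEXTURE
# frames, seeded by the texture block motion vector, and the winning
# displacement is reused to fetch the depth predictor.
import numpy as np

def pixel_mv(tex_cur, tex_ref, y, x, seed_mv, win=2, patch=1):
    """Refine the seed MV for pixel (y, x) by full search in a small window."""
    h, w = tex_cur.shape
    if not (patch <= y < h - patch and patch <= x < w - patch):
        return seed_mv                      # border pixels keep the block MV
    best_cost, best_mv = np.inf, seed_mv
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            ry, rx = y + seed_mv[0] + dy, x + seed_mv[1] + dx
            if not (patch <= ry < h - patch and patch <= rx < w - patch):
                continue
            # SAD over a small patch around the pixel (an assumed criterion)
            cost = np.abs(
                tex_cur[y-patch:y+patch+1, x-patch:x+patch+1].astype(int) -
                tex_ref[ry-patch:ry+patch+1, rx-patch:rx+patch+1].astype(int)
            ).sum()
            if cost < best_cost:
                best_cost, best_mv = cost, (seed_mv[0] + dy, seed_mv[1] + dx)
    return best_mv

def predict_depth_pixel(dep_ref, y, x, mv):
    """Map the texture MV to the depth field to fetch the predictor."""
    return dep_ref[y + mv[0], x + mv[1]]
```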

B. Robust Prediction and Correction

Lossy coding is generally used in video compression, which introduces distortions in the coded depth and texture video. As the quantization step size increases, the difference between the original value and the decoded value becomes larger, especially for the pixels around sharp edges, which normally suffer larger losses in the coding process. In this sense, the proposed pixel-based motion estimation may not be very robust for heavily distorted video, as the motion similarity is significantly compromised. To obtain a more robust prediction for a depth pixel, its adjacent pixels within the same object region are also taken into account, in view of the motion consistency within an object region. To realize this efficiently, a threshold is first applied in the texture video to determine which pixels in the reference frame are in the same object region as the current pixel, and then the median of the corresponding depth pixels is employed as the predictor for the current pixel:

\hat{d} = \operatorname{median} \{\, d_i : |t_i - t| < \tau \,\} \qquad (1)

where \hat{d} is the prediction value for the current depth pixel and t is the texture value of the current pixel; d_i and t_i represent the depth and texture values of candidate pixel i, respectively; and \tau (empirically set to 10 in our simulations) is a threshold that determines whether a pixel is considered to be in the same region as the current pixel, based on the difference of their texture values.
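A minimal sketch of the median predictor of Eq. (1) follows, under the assumption that the candidate set consists of the search-window pixels around the motion-compensated location in the reference frame; the letter leaves the exact candidate set to the search procedure of Section II-A, and the fallback rule is ours.

```python
# Sketch of the robust median predictor of Eq. (1); tau = 10 is the value
# reported in the letter, the candidate set and fallback are assumptions.
import numpy as np

def robust_depth_predictor(tex_cur, tex_ref, dep_ref, y, x, mv,
                           tau=10, win=2):
    """Median of reference depth pixels whose texture is within tau of the
    current pixel's texture value."""
    t = int(tex_cur[y, x])
    h, w = tex_ref.shape
    cands = []
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            ry, rx = y + mv[0] + dy, x + mv[1] + dx
            if 0 <= ry < h and 0 <= rx < w and \
               abs(int(tex_ref[ry, rx]) - t) < tau:
                cands.append(dep_ref[ry, rx])
    if not cands:   # fall back to the motion-compensated pixel (assumption)
        return dep_ref[y + mv[0], x + mv[1]]
    return int(np.median(cands))
```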

In view of the inaccurate depth values along object boundaries in the depth map, known as misalignment [18], [19] between the depth edges and the texture edges, large residuals may occur after the texture-assisted depth prediction. To deal with these large residuals, the detection and rectification technique proposed in our previous work [14] is incorporated in the depth coding scheme. The approach detects the misalignment directly in the residual domain by taking advantage of the neighboring reliable pixels. First, a three-pixel-wide region around the depth edge is considered an error-prone region within which misalignment is most likely to be present. A large residual in the error-prone region after prediction could be due to either a poor prediction or a misalignment with an originally wrong depth value. To distinguish misalignment from poor prediction, residuals adjacent to the large residual but outside the error-prone region are further examined. If any large residual is also present outside the error-prone region, it implies that the depth values may not be predicted well by the texture-assisted prediction approach, and the large residual is considered to be caused by a poor prediction. On the other hand, if there is no large residual outside the error-prone region in a predefined neighborhood (e.g., a block), the current large residual is most likely a misaligned pixel. Such a misaligned pixel is then corrected based on the neighboring reliable pixels outside the error-prone region. With this detection and rectification procedure, the coding efficiency of the texture-assisted depth coding can be further enhanced.
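The misalignment-versus-poor-prediction test can be sketched as follows. The residual threshold and neighborhood radius are assumptions, and the rectification step itself (based on the neighboring reliable pixels) follows [14] and is omitted.

```python
# Sketch of the test described above: a large residual inside the
# error-prone region is attributed to misalignment only if no large
# residual exists OUTSIDE the error-prone region in its neighborhood.
import numpy as np

def classify_large_residual(residual, error_prone, y, x, big=8, radius=4):
    """Return 'misalignment' or 'poor_prediction' for a large residual at
    (y, x) inside the error-prone mask (True near depth edges)."""
    h, w = residual.shape
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    nbhd_large = np.abs(residual[y0:y1, x0:x1]) >= big
    outside = nbhd_large & ~error_prone[y0:y1, x0:x1]
    # Large residuals outside the error-prone region indicate that the
    # prediction itself failed; otherwise the pixel is likely misaligned.
    return 'poor_prediction' if outside.any() else 'misalignment'
```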

C. Extension to Integrated Prediction

As explained in the previous subsection, pixels in the same region as the current pixel can be used to approximate it, since the depth map is composed of large smooth regions. The coded pixels in both the reference frame and the current frame can be used to form the prediction of the current pixel. Therefore, the proposed pixel-based motion estimation can be extended to an integrated prediction that uses both the intra and inter information. Fig. 2 illustrates the estimation process of the proposed integrated prediction approach. The approach identifies all the pixels considered to be in the same region as the current pixel (based on their texture value differences) among the available neighboring pixels in the current frame and in the reference frame, as shown in the top row of Fig. 2 and sketched in the code below. Through a direct mapping of the locations of the identified pixels from the texture to the depth field, the depth predictor of the current pixel is obtained as the median of the depth values at those mapped locations.
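A minimal sketch of this integrated predictor, assuming a causal four-neighbor intra pattern and the same search window and threshold as before (both assumptions):

```python
# Sketch of the integrated prediction (Section II-C): same-region candidates
# are gathered from BOTH the already-coded neighbors in the current frame
# (intra) and the search window in the reference frame (inter); the median
# of their depth values forms the predictor.
import numpy as np

def integrated_predictor(tex_cur, dep_cur, tex_ref, dep_ref,
                         y, x, mv, tau=10, win=2):
    t = int(tex_cur[y, x])
    cands = []
    # Intra part: causal (already coded) neighbors in the current frame
    for dy, dx in [(-1, -1), (-1, 0), (-1, 1), (0, -1)]:
        ny, nx = y + dy, x + dx
        if 0 <= ny < tex_cur.shape[0] and 0 <= nx < tex_cur.shape[1] and \
           abs(int(tex_cur[ny, nx]) - t) < tau:
            cands.append(dep_cur[ny, nx])
    # Inter part: search-window pixels in the reference frame
    h, w = tex_ref.shape
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            ry, rx = y + mv[0] + dy, x + mv[1] + dx
            if 0 <= ry < h and 0 <= rx < w and \
               abs(int(tex_ref[ry, rx]) - t) < tau:
                cands.append(dep_ref[ry, rx])
    if not cands:   # fall back to motion compensation (assumption)
        return dep_ref[y + mv[0], x + mv[1]]
    return int(np.median(cands))
```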

Fig. 2. An illustration of the integrated prediction approach with the mapping from the texture to the depth field.

TABLE I
INFORMATION OF TEST SEQUENCES AND VIEW INDEXES FOR ENCODING

III. EXPERIMENTAL RESULTS

Simulations were performed using the H.264/AVC reference software JM version 18.2, with the proposed depth coding approach incorporated as an additional inter-prediction option in the encoder. Five standard sequences, Ballet, Breakdancers, Lovebird1, Kendo, and UndoDancer [20], [21], are used to test the proposed depth coding scheme, where 50 frames are coded for each sequence with one I-frame followed by all P-frames (IPPP…). Table I lists the information of the test sequences with the view indexes used for encoding. Intermediate views are rendered from the coded depth maps and texture videos using VSRS 3.5 (View Synthesis Reference Software) [22]. Four different QPs (22, 27, 32, and 37) are applied to the inter frames of the depth map sequences, while the first frame of the depth map and all frames of the texture videos are coded with a fixed QP of 22. Full search is employed for the block-based motion estimation, with a much smaller search range for the pixel-based motion estimation.

As depth maps are used for view synthesis at the decoder side, the depth quality is measured by the quality of the rendered views synthesized from the coded depth maps. Therefore, the rate used for coding the depth maps and the quality of the corresponding rendered view are used to indicate the rate-distortion performance of the depth coding scheme. Specifically, the Bjontegaard delta bit rate (BDBR) [23] is used to measure the bit rate of the coded depth maps, and the Bjontegaard delta PSNR (BDPSNR) [23] to measure the quality of the rendered views generated from the coded depth maps.
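For reference, the Bjontegaard metrics fit a cubic polynomial to PSNR versus log-rate for each codec and average the gap between the two fits; a compact sketch of BDPSNR follows (the standard algorithm [23], not code from the letter).

```python
# Sketch of BD-PSNR: PSNR is modeled as a cubic in log10(rate) for each
# codec, and BD-PSNR is the mean vertical gap between the two fitted curves
# over the overlapping rate range.
import numpy as np

def bd_psnr(rates_ref, psnr_ref, rates_test, psnr_test):
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)     # cubic fit, PSNR vs log-rate
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())       # overlapping range only
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), [lo, hi])
    int_test = np.polyval(np.polyint(p_test), [lo, hi])
    # Average difference of the two curves over [lo, hi]
    return ((int_test[1] - int_test[0]) - (int_ref[1] - int_ref[0])) / (hi - lo)
```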


Fig. 3. Comparison of rate-distortion curves with different coding methods: (a) Ballet; (b) UndoDancer.

Table II shows the coding results of our proposed depth coding scheme, compared with JM18.2 and the forced skip method in [13]. It can be seen that the proposed scheme achieves up to 1.70 dB gain in BDPSNR, which translates to a bit rate saving of 33.55% compared with JM18.2, and 0.69 dB gain in BDPSNR, which translates to a bit rate saving of 17.11% compared with the forced skip method [13]. The rate-distortion curves of the proposed coding approach for Ballet and UndoDancer are shown in Fig. 3. "Intra method [14]" denotes the intra coding method proposed in our previous work [14], and "Proposed+Intra" denotes combining the intra method in [14] with the approach proposed in this letter. It can be seen that the proposed inter-prediction approach contributes substantially to the coding gain, and the best performance is obtained when both the intra and inter approaches are combined. Snapshots of the reconstructed depth maps and rendered images using JM18.2 and the proposed depth coding scheme are shown in Fig. 4. It can be seen that the subjective quality of the synthesized view is significantly better with our proposed depth coding scheme.

Table III shows that the average encoding time of the proposed coding method increases by about 6.26%. For UndoDancer, the encoding time is even reduced with the proposed method incorporated in the JM codec, which may be due to more occurrences of the skip mode with the better reference. At the decoder side, the pixel-based motion vectors need to be re-estimated from the decoded texture video, which also increases the decoder complexity. Since the proposed option only applies to the depth edge blocks, the decoding complexity is restricted to an acceptable level.

Fig. 4. Snapshots of the decoded depth maps and synthesized images: (a) & (b) reconstructed depth images by JM18.2 and the proposed method, respectively; (c) & (d) synthesized images corresponding to the depth maps in (a) and (b), respectively.

TABLE II
IMPROVEMENT OF RATE-DISTORTION PERFORMANCE OF THE PROPOSED SCHEME

TABLE III
COMPLEXITY ANALYSIS BASED ON THE ENCODING TIME

IV. CONCLUSION

In this letter, a high-performance inter prediction scheme is proposed for depth coding with the aid of the coded texture video. The proposed scheme performs pixel-based motion estimation to obtain a better predictor in depth coding by exploiting the depth-texture motion and structure similarities as well as the coded texture video available at both the encoder and decoder sides, thus avoiding the transmission of the pixel-based motion vectors. Specific techniques for reducing the complexity and enhancing the robustness of the proposed inter prediction are also presented. Moreover, an integrated prediction method making use of both the intra and inter information is developed for better pixel-based depth prediction. Experimental results substantiate the effectiveness of the proposed depth coding scheme.


REFERENCES

[1] N. S. Holliman, N. A. Dodgson, G. E. Favalora, and L. Pockett, "Three-dimensional displays: A review and applications analysis," IEEE Trans. Broadcasting, vol. 57, no. 2, pp. 362–371, Jun. 2011.

[2] M. Tanimoto, M. P. Tehrani, T. Fujii, and T. Yendo, "FTV for 3-D spatial communication," Proc. IEEE, vol. 100, no. 4, pp. 905–917, Apr. 2012.

[3] K. Muller, P. Merkle, and T. Wiegand, "3-D video representation using depth maps," Proc. IEEE, vol. 99, no. 4, pp. 643–656, Apr. 2011.

[4] M. M. Hannuksela, D. Rusanovskyy, W. Su, L. Chen, R. Li, P. Aflaki, D. Lan, M. Joachimiak, H. Li, and M. Gabbouj, "Multiview-video-plus-depth coding based on the advanced video coding standard," IEEE Trans. Image Process., vol. 22, no. 9, Sep. 2013.

[5] C. Fehn, "Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV," in Proc. SPIE Stereoscopic Displays and Virtual Reality Systems XI, San Francisco, CA, USA, Jan. 2004, vol. 5291, pp. 93–104.

[6] P. Merkle, Y. Morvan, A. Smolic, D. Farin, K. Muller, P. H. N. de With, and T. Wiegand, "The effects of multiview depth video compression on multiview rendering," Signal Process.: Image Commun., vol. 24, no. 1–2, pp. 73–88, Jan. 2009.

[7] M.-K. Kang, J. Lee, J.-Y. Lee, and Y.-S. Ho, "Geometry-based block partitioning for efficient intra prediction in depth video coding," in Proc. SPIE Visual Information Processing and Communication, San Jose, CA, USA, Jan. 2010, pp. 75430A-1–75430A-11.

[8] G. Shen, W.-S. Kim, A. Ortega, J. Lee, and H. Wey, "Edge-aware intra prediction for depth-map coding," in Proc. IEEE Int. Conf. Image Processing, Hong Kong, Sep. 2010, pp. 3393–3396.

[9] S. Grewatsch and E. Müller, "Sharing of motion vectors in 3D video coding," in Proc. IEEE Int. Conf. Image Processing, Singapore, Oct. 2004, pp. 3271–3274.

[10] H. Oh and Y.-S. Ho, "H.264-based depth map sequence coding using motion information of corresponding texture video," in Proc. First Pacific Rim Conf. Advances in Image and Video Technology, Hsinchu, Taiwan, Dec. 2006, pp. 898–907.

[11] S. Milani, P. Zanuttigh, M. Zamarin, and S. Forchhammer, "Efficient depth compression exploiting segmented color data," in Proc. IEEE Int. Conf. Multimedia and Expo, Barcelona, Spain, Jul. 2011, pp. 1–6.

[12] S. Liu, P. Lai, D. Tian, and C. W. Chen, "New depth coding techniques with utilization of corresponding video," IEEE Trans. Broadcasting, vol. 57, no. 2, pp. 551–561, Jun. 2011.

[13] W.-S. Kim, A. Ortega, P. Lin, D. Tian, and C. Gomila, "Depth map distortion analysis for view rendering and depth coding," in Proc. IEEE Int. Conf. Image Processing, Cairo, Egypt, Nov. 2009, pp. 721–724.

[14] S. Li, C. Zhu, and J. Lei, "Depth-texture cooperative clustering and alignment for high efficiency depth intra coding," in Proc. 2013 IEEE China Summit & Int. Conf. Signal and Information Processing, Beijing, China, Jul. 2013, pp. 240–244.

[15] J. Y. Lee, H.-C. Wey, and D.-S. Park, "A fast and efficient multi-view depth image coding method based on temporal and interview correlations of texture images," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 12, pp. 1859–1868, Dec. 2011.

[16] L. Shen, Z. Zhang, and Z. Liu, "Inter mode selection for depth map coding in 3D video," IEEE Trans. Consumer Electron., vol. 58, no. 3, pp. 926–931, Aug. 2012.

[17] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence Lab., Massachusetts Institute of Technology, Tech. Rep. A.I. Memo No. 572, 1980.

[18] C. Zhu, Y. Zhao, L. Yu, and M. Tanimoto, 3D-TV System With Depth-Image-Based Rendering: Architectures, Techniques and Challenges. Berlin, Germany: Springer, 2013.

[19] Y. Zhao, C. Zhu, Z. Z. Chen, D. Tian, and L. Yu, "Boundary artifact reduction in view synthesis of 3D video: From perspective of texture-depth alignment," IEEE Trans. Broadcasting, vol. 57, no. 2, pp. 510–522, Jun. 2011.

[20] C. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High quality video view interpolation using a layered representation," ACM Trans. Graph. (Proc. ACM SIGGRAPH), vol. 23, no. 3, pp. 600–608, Aug. 2004.

[21] Moving multiview camera test sequences for MPEG-FTV, ISO/IEC JTC1/SC29/WG11, Doc. M16922, Oct. 2009.

[22] 3DV depth estimation and view synthesis software package, ISO/IEC JTC1/SC29/WG11, Doc. N12188, Turin, Italy, Jul. 2011.

[23] An excel add-in for computing Bjontegaard metric and its evolution, ITU-T SG16 Q.6, Doc. VCEG-AE07, Jan. 2007.