
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 28, NO. 5, MAY 2019

One-for-All: Grouped Variation Network-Based Fractional Interpolation in Video Coding

Jiaying Liu, Senior Member, IEEE, Sifeng Xia, Wenhan Yang, Member, IEEE, Mading Li, and Dong Liu, Member, IEEE

Abstract— Fractional interpolation is used to provide sub-pixel level references for motion compensation in the inter prediction of video coding, which attempts to remove temporal redundancy in video sequences. Traditional handcrafted fractional interpolation filters face the challenge of modeling discontinuous regions in videos, while existing deep learning-based methods are either designed for a single quantization parameter (QP), only generate half-pixel samples, or need to train a model for each sub-pixel position. In this paper, we present a one-for-all fractional interpolation method based on a grouped variation convolutional neural network (GVCNN). Our method can deal with video frames coded using different QPs and is capable of generating all sub-pixel positions at one sub-pixel level. Also, by predicting variations between integer-position pixels and sub-pixels, our network offers more expressive power. Moreover, we take specific measures in training data generation to simulate practical situations in video coding, including blurring the down-sampled sub-pixel samples to avoid aliasing effects and coding the integer pixels to simulate reconstruction errors. In addition, we analyze the impact of the size of the blur kernels theoretically. Experimental results verify the efficiency of GVCNN: compared with HEVC, our method achieves 2.2% bit saving on average and up to 5.2% under the low-delay P configuration.

Index Terms— High Efficiency Video Coding (HEVC), fractional interpolation, convolutional neural network (CNN), grouped variation network.

I. INTRODUCTION

ONE of the most important features in video coding is to remove temporal redundancy in video sequences, which can be achieved by motion compensation. Specifically, during the inter prediction, reference blocks are searched from preceding coded frames. With these blocks, only the corresponding motion vectors and the residuals between the reference blocks and the current block need to be coded. Generally, encoding motion vectors and residuals is more efficient than encoding the original block, which leads to distinct bit saving in video coding.

Manuscript received April 10, 2018; revised September 2, 2018; accepted November 15, 2018. Date of publication November 22, 2018; date of current version January 16, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61772043 and Grant 61390512, in part by the Peking University-Tencent Rhino Bird Innovation Fund, and in part by the NVIDIA Corporation with the GPU. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xin Li. (Corresponding author: Wenhan Yang.)

J. Liu, S. Xia, W. Yang, and M. Li are with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

D. Liu is with the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application Systems, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2018.2882923

However, due to the spatial sampling of digital video, the locations of adjacent pixels in a video frame are not continuous. The reference blocks at integer positions may have sub-pixel motion shifts relative to the current block. In order to produce better reference blocks, video coding standards, e.g. High Efficiency Video Coding (HEVC), generate sub-pixel reference blocks by fractional interpolation over the retrieved integer-position blocks.

Video coding standards adopt fixed interpolation filters to perform fractional interpolation. For example, MPEG-4/H.264 AVC [1] uses a 6-tap filter for half-pixel interpolation and a simple average filter for quarter-pixel interpolation of the luma component. HEVC adopts a DCT-based interpolation filter (DCTIF) [2] for fractional interpolation. These simple fixed interpolation filters are effective for motion compensation. However, the quality of the interpolation results generated by fixed filters may be limited, since fixed filters cannot fit natural and artificial video signals with various kinds of structures and contents.

In the last decade, researchers have been dedicated to improving the design of traditional sub-pixel interpolation filters [1]–[5]. However, these handcrafted filters have limited expressiveness when facing the various and abundant local contexts in the large amount of video generated every day. Moreover, these filters always assume videos to be locally smooth and perform linear interpolation. In fact, videos contain much discontinuous content such as sharp edges and local details, which is difficult to model using handcrafted filters.

Recently, many deep learning based methods have been proposed for image processing problems, e.g. denoising [6], [7], inpainting [8], and super-resolution [9], [10], demonstrating promising abilities and generating impressive results. Considering the great performance brought by deep learning based methods in image processing problems and their high implementation efficiency enabled by GPU acceleration, there is a new opportunity to utilize deep learning based interpolation methods in motion compensation for video coding. Yan et al. [11] first proposed a CNN-based interpolation filter to replace the half-pixel interpolation part of HEVC. Following that work, Zhang et al. [12] adopted VDSR [10], a successful network for image super-resolution, to improve the half-pixel interpolation in HEVC. Although a performance gain can be observed for both methods, they have some drawbacks. First of all, these methods are QP-dependent, which means they have to train one model for each QP. Moreover, the method in [11] needs to train one model for each half-pixel position (i.e. three models for the half-pixel positions). Furthermore, they do not support fractional interpolation for quarter-pixel positions. These factors lead to higher storage costs and limit the practicability of the methods.

In this paper, we propose a deep learning based fractional interpolation method to infer sub-pixels for motion compensation in HEVC. A grouped variation convolutional neural network (GVCNN) is designed to predict sub-pixels at different sub-pixel positions under different QPs at once. The network first extracts feature maps from integer-position samples, and then infers variations for different sub-pixels using the same feature maps. In order to make our network more adaptive to practical video coding, we elaborately design several measures to generate training data that simulates practical situations. Moreover, we provide a theoretical analysis to generate the most appropriate and effective training data. Experimental results show the superiority of our GVCNN in fractional interpolation, which further benefits video coding. Our contributions are summarized as follows:

• We propose a grouped variation based neural network to predict variations between half-/quarter-pixels and integer-position pixels. Compared with learning the mapping between pixel values, learning the variations makes the network easier to train and more expressive.

• Instead of training one model for each QP/sub-pixel position, we present a one-for-all model to support different QPs and generate all sub-pixels at one sub-pixel level (three half-pixel positions or twelve quarter-pixel positions) using a single model.

• In order to simulate practical situations in video coding, we present two operations to generate training data for our network. Specifically, we apply blurring operations to sub-pixels to prevent aliasing effects and encode integer pixels by HEVC to simulate reconstruction errors. Furthermore, a theoretical analysis is provided to determine the best size of the blur kernels.

The rest of the paper is organized as follows. Sec. II reviews related work on sub-pixel interpolation and deep learning based video coding. In Sec. III, details of fractional interpolation in HEVC are first introduced and analyzed, and then the architecture of the network is illustrated and the grouped variation is described. The generation of the training data is presented in Sec. IV. Experimental results are shown in Sec. V and concluding remarks are given in Sec. VI.

II. RELATED WORKS

A. Traditional Sub-Pixel Interpolation

Sub-pixel interpolation is performed by fixed filters in most video coding standards. MPEG-4 AVC/H.264 [1] adopts a 6-tap filter to interpolate pixels at half-pixel positions, and simply averages interpolated pixels at integer and half-pixel positions to calculate samples at quarter-pixel positions. The Audio and Video Coding Standard (AVS) [3] employs a Two-Step Four-Tap (TSFT) interpolation to perform quarter-pixel interpolation, which utilizes a 4-tap cubic convolution filter and a 4-tap cubic B-spline filter. In HEVC [2], half-pixels and quarter-pixels are calculated by an 8-tap filter and a 7-tap filter respectively, with coefficients derived from DCT basis function equations.

These fixed interpolation filters are easy to implement and are based on the assumption that the video signal is spatially and temporally continuous. However, this assumption does not always hold in practice. Therefore, further research has been conducted to perform better sub-pixel interpolation. Lakshman et al. [4] proposed a generalized interpolation by combining Infinite Impulse Response (IIR) and Finite Impulse Response (FIR) filters, which provides a greater degree of freedom for selecting basis functions. Kokhazadeh et al. [5] attempted to interpolate considering the edges of video objects: to interpolate pixels around edges, they first extrapolated pixels on the dissident side based on pixels on the accordant side and then interpolated half-pixel samples using the extrapolated pixels and the pixels on the accordant side. Still, these hand-crafted filters cannot provide sufficient capability when facing various video contents.

B. Deep Learning Based Video Coding

Deep learning based methods have shown significant performance in both computer vision applications and image/video processing tasks. Deep learning is one of the efficient ways to solve the problems encountered in developing cutting-edge video coding standards. For instance, deep learning models are able to learn complex patterns in various videos automatically, without the need for manually elaborated designs; multiple layers and activation functions are capable of modeling non-linear transformations, which are very common in video coding.

In recent years, many deep learning based methods have emerged from the video coding community. Intuitively, CNNs can be used in the post-processing of decoded video sequences. Lin et al. [13] presented an end-to-end mapping from decompressed frames to enhanced versions, attempting to approximate the reverse function of video compression. Dai et al. [14] proposed a Variable-filter-size Residue-learning CNN (VRCNN) that learns the residue between the decoded frame and the original one. Wang et al. [15] further presented a deep CNN-based auto decoder (DCAD) to automatically remove artifacts and enhance the details of HEVC-compressed videos.

Besides utilizing CNNs as a post-processing procedure, some researchers have investigated the practicability of applying CNNs during the video coding process. Park and Kim [16] replaced the Sample Adaptive Offset (SAO) filter in HEVC by their CNN architecture. In [17], Laude and Ostermann proposed to replace the conventional Rate-Distortion Optimization (RDO) for the intra prediction mode with a CNN. Coding unit (CU) partition mode decisions can also be predicted by a CNN [18] or a Long Short-Term Memory (LSTM) network [19]. As mentioned in the introduction, Yan et al. [11] proposed to replace the half-pixel interpolation of HEVC by the network architecture of SRCNN [9], achieving better results compared with the original HEVC. Similarly, Zhang et al. [12] replaced the half-pixel interpolation by the network architecture of VDSR [10]. In our work, we also focus on improving the accuracy of sub-pixel interpolation using a CNN. We propose a one-for-all fractional interpolation method based on a grouped variation convolutional neural network, which generates interpolations for all half-pixel (or quarter-pixel) positions in a single inference pass and only needs to be trained once for each sub-pixel level.

Fig. 1. Positions of different fractional pixels. Blue, green and pink blocks indicate respectively the integer- (A_{i,j}), half- (h^1_{i,j}, h^2_{i,j}, h^3_{i,j}) and quarter- (q^1_{i,j}, q^2_{i,j}, ..., q^{12}_{i,j}) pixel positions for luma interpolation.

III. GROUPED VARIATION NEURAL NETWORK BASED FRACTIONAL INTERPOLATION

A. Fractional Interpolation in HEVC

During the motion compensation in HEVC, blocks from preceding coded frames are used as reference blocks. Fractional interpolation aims to generate sub-pixel position samples based on these reference blocks. Fig. 1 illustrates the positions of integer pixels and sub-pixels. To be more specific, positions indicated by A_{i,j} represent integer samples; h^k_{i,j} (k ∈ {1, 2, 3}) and q^k_{i,j} (k ∈ {1, 2, ..., 12}) denote half-pixel positions and quarter-pixel positions, respectively. Given a reference block I^A, whose pixels are regarded as integer samples (A_{i,j}), the half-pixel blocks I^{h_k} and quarter-pixel blocks I^{q_k} are interpolated from I^A. With these sub-pixels interpolated, the most similar reference sample is finally selected among the integer and sub-pixel position samples to facilitate coding the current block.

HEVC adopts a uniform 8-tap filter for half-pixel interpolation and a 7-tap filter for quarter-pixel interpolation. These fixed interpolation filters may not be flexible enough to accomplish the interpolation of videos that contain diversified scenes and contents. Moreover, these interpolations only consider information from a very small neighborhood, and cannot make an accurate prediction when facing signals with complex structures.
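For concreteness, the short sketch below applies an 8-tap half-sample filter of the DCTIF type to one row of integer pixels. The tap values are the commonly cited HEVC luma half-sample coefficients; the edge padding and rounding choices are our own simplifications for illustration, not code taken from the standard or from the paper.

import numpy as np

HALF_PEL_TAPS = np.array([-1, 4, -11, 40, 40, -11, 4, -1], dtype=np.int64)

def dctif_half_pel_row(row):
    # Interpolate the half-sample position between each pair of integer pixels in one row.
    padded = np.pad(row.astype(np.int64), (3, 4), mode="edge")
    out = np.empty(len(row), dtype=np.int64)
    for i in range(len(row)):
        out[i] = int(np.dot(padded[i:i + 8], HALF_PEL_TAPS))
    # Normalise by 64 (with rounding) and clip to the 8-bit sample range.
    return np.clip((out + 32) >> 6, 0, 255).astype(np.uint8)

print(dctif_half_pel_row(np.array([10, 20, 30, 40, 50], dtype=np.uint8)))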

B. Learning Grouped Variations

Learning-based interpolation methods [11], [20], [21] aim to estimate the mapping f(·) between an input low-resolution (LR) image y and its corresponding high-resolution (HR) image x:

x = f(y). (1)

These approaches that directly model the mapping between y and x usually suffer from several drawbacks: 1) the learned model over-fits to the regularity of the low-frequency parts of image signals, while high-frequency details are neglected; 2) the priors are imposed on the whole x, so it is hard for the method to learn useful guidance for recovering the high-frequency image signal; 3) useful structural correspondences in the high-frequency domain may be neglected when directly modeling x.

Our method also attempts to estimate a mapping function, except that the mapping estimated is between the integer pixels I^A and the sub-pixel samples I^{h_k} (or I^{q_k}). On the other hand, we propose a novel image model, i.e. the grouped variation image representation, to overcome the aforementioned drawbacks.

Motivated by wavelet transformation [22], [23] and variation image statistics [24]–[26], we make use of the local redundancy to remove the auto-correlation of I^A and I^{h_k} (or I^{q_k}) as well as their correlation, and to construct more useful structural correspondences between adjacent pixels. Specifically, a pixel can be represented by the summation of one of its nearest neighbors and a very small difference value, i.e. the variation. This operation removes much of the auto-correlation within adjacent pixels. For a reference block I^A, the variations between its pixels and the sub-pixels are defined as:

ΔI^{h_k} = I^{h_k} − I^A, k ∈ {1, 2, 3}, (2)

where ΔI^{h_k} denotes the variations of half-pixels. Similarly, the variations of quarter-pixels are constructed as:

ΔI^{q_k} = I^{q_k} − I^A, k ∈ {1, 2, ..., 12}. (3)

Thus, in our work, we focus on estimating the variations ΔI^{h_k} and ΔI^{q_k}, which naturally leads to a modified mapping:

I^{h_k} − I^A = f_h(I^A), k ∈ {1, 2, 3},
I^{q_k} − I^A = f_q(I^A), k ∈ {1, 2, ..., 12}, (4)

where f_h(·) and f_q(·) represent the learned mappings between integer pixels and the grouped variations of half- and quarter-pixel positions, respectively.
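A minimal sketch of how the grouped-variation targets in Eqs. (2)-(4) can be formed from a high-resolution patch: the top-left pixel of every 2x2 block plays the role of the integer-position sample I^A and the other three pixels play the half-pel samples. The variable names are ours, not the paper's, and the blur/coding steps of Sec. IV are omitted here.

import numpy as np

def grouped_variation_targets(hr_patch):
    hr = hr_patch.astype(np.float32)
    i_a = hr[0::2, 0::2]                      # integer-position sample I^A
    halves = {
        "h1": hr[0::2, 1::2],                 # horizontal half-pel sample
        "h2": hr[1::2, 0::2],                 # vertical half-pel sample
        "h3": hr[1::2, 1::2],                 # diagonal half-pel sample
    }
    variations = {k: v - i_a for k, v in halves.items()}   # Delta I^{h_k} in Eq. (2)
    return i_a, variations

i_a, deltas = grouped_variation_targets(np.arange(64, dtype=np.float32).reshape(8, 8))
print({k: float(np.abs(v).mean()) for k, v in deltas.items()})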

This image representation (4) has three major advantages compared with (1):

• Fitting to high-frequency image signals. Owing to the local dependency, the low-frequency part is removed, and the regularity of the high-frequency part is focused on.

• Direct priors. The priors are imposed on the missing part of the reference block I^A, which benefits learning useful guidance for recovering the high-frequency part of the image signal.

• Plenty of structural correspondences. Compared with an image signal, the variation signal significantly increases the patch repetitiveness, which benefits the inference of missing details in the restoration process.

Fig. 2. Framework of the proposed GVCNN. The network first extracts feature maps from the integer-position sample. Then the grouped variations that identify the differences between different sub-pixel position samples and the integer-position sample are inferred using the same feature maps. Final results of sub-pixel position samples are naturally obtained by adding the variations back to the integer-position sample.

In summary, for fractional interpolation, compared with predicting sub-pixel values directly, it is simpler to predict sub-pixel variations based on the integer pixels. Thus, model training for learning-based methods becomes easier than direct end-to-end learning from integer pixels to sub-pixels, and the effective modeling capacity of the model becomes larger.

C. Grouped Variation Neural Network

As discussed in Sec. II, deep convolutional neural network based methods have obtained much success in different kinds of low-level image processing problems. With the help of large-scale training data, deep CNN based methods attempt to learn a mapping from the input signal x to the target result y′ as follows:

y′ = f(x, Θ), (5)

where Θ represents the set of parameters of the convolutional neural network, learnt from the training data with back-propagation.

In this subsection, we introduce the proposed one-for-all fractional interpolation method based on the grouped variation convolutional neural network (GVCNN). It first extracts a feature map from the integer-position sample I^A and then infers residual maps of different sub-pixel positions sharing the same feature map.

In the shared feature map extraction component, a feature map with 48 channels is initially generated from the integer-position sample, followed by 8 convolution layers with 10 channels. The 10-th layer then derives a 48-channel feature map. This shrink-in-the-middle design makes the network lightweight and less costly to train. The residual learning technique is utilized in the shared feature map extraction to accelerate the convergence of the network: we add the output of the 1-st layer to that of the 10-th layer and then activate their sum with the parametric rectified linear unit (PReLU) [27] function to obtain the shared feature map. After 9 convolutional layers with 3 × 3 kernel size, the receptive field of each pixel in the shared feature map is 19 × 19 in the input space, which means that a large nearby area in the integer-position sample has been considered for the feature extraction of each pixel.

Compared with the method in [11], which investigates several networks to predict each sub-pixel position sample separately, our GVCNN is more compact and effective. We argue that there is no need to separately extract different feature maps to predict sub-pixel position samples, once the spatial correlation and continuity of the sub-pixels are fully considered. In our network, a shared feature map is generated and then used to infer sub-pixel samples at different locations. The shared feature map is followed by a specific convolutional layer, which generates a residual map for each sub-pixel position. The final results of the sub-pixels can be obtained by adding the integer-position sample to the residual maps.
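The following is a rough PyTorch sketch of the structure just described, written from the text alone (the layer counts, channel widths and shared PReLU parameter follow the description; any detail not stated in the paper, such as padding, is an assumption; the paper itself uses Caffe).

import torch
import torch.nn as nn

class GVCNNSketch(nn.Module):
    def __init__(self, num_positions=3):          # 3 half-pel or 12 quarter-pel maps
        super().__init__()
        # Layer 1: integer-position sample -> 48-channel feature map.
        self.head = nn.Sequential(nn.Conv2d(1, 48, 3, padding=1), nn.PReLU(init=0.25))
        # Layers 2-9: eight narrow 10-channel layers (the shrink-in-the-middle design).
        mid = []
        for i in range(8):
            mid += [nn.Conv2d(48 if i == 0 else 10, 10, 3, padding=1), nn.PReLU(init=0.25)]
        self.mid = nn.Sequential(*mid)
        # Layer 10 expands back to 48 channels; its output is added to layer 1
        # (residual learning) and activated to form the shared feature map.
        self.expand = nn.Conv2d(10, 48, 3, padding=1)
        self.fuse = nn.PReLU(init=0.25)
        # One extra convolution per sub-pixel position predicts its variation map.
        self.variation_heads = nn.ModuleList(
            nn.Conv2d(48, 1, 3, padding=1) for _ in range(num_positions))

    def forward(self, integer_sample):
        f1 = self.head(integer_sample)
        shared = self.fuse(self.expand(self.mid(f1)) + f1)
        # Add each predicted variation back to the integer-position sample.
        return [integer_sample + head(shared) for head in self.variation_heads]

outputs = GVCNNSketch()(torch.rand(1, 1, 32, 32))
print(len(outputs), outputs[0].shape)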

Fig. 2 shows the architecture of GVCNN. The network takes the integer-position sample I^A as its input. PReLU is utilized for nonlinearity between the convolutional layers. Specifically, we define f_k^{out} to be the output of the k-th convolutional layer. f_k^{out} is obtained by:

f_k^{out} = P_k(W_k * f_{k-1}^{out} + B_k), \quad (6)

where f_{k-1}^{out} is the output of the previous layer, W_k is the convolutional filter kernel of the k-th layer and B_k is its bias. f_0^{out} is the input integer-position sample. The function P_k(·) is the PReLU function of the k-th layer:

P_k(x) = \begin{cases} x, & x > 0, \\ a_k \cdot x, & x \le 0, \end{cases} \quad (7)

where x is the input signal and a_k is the parameter to be learned for the k-th layer. a_k is initially set to 0.25 and all channels of the k-th layer share the same parameter a_k.

Fig. 3. Flow chart of the training data generation for GVCNN. For the integer-position sample generation, pixels at integer positions are first sampled and then coded for simulation. Sub-pixel position samples are sampled from the raw image blurred by a Gaussian kernel to prevent aliasing effects.

During the training process, the mean square error is used as the loss function. Let F(·) represent the learnt network that infers sub-pixel position samples from the integer-position sample, and let Θ denote the set of all the learnt parameters, including the convolutional filter kernels, the biases and the a_k of the PReLU layers. The loss function is formulated as follows:

L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \| F(x_i, \Theta) - y_i \|^2, \quad (8)

where the pairs {x_i, y_i}_{i=1}^{n} are ground-truth pairs of integer-position and sub-pixel position samples.
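A minimal training-step sketch matching Eq. (8): mean squared error between the predicted sub-pixel samples and the ground-truth samples, optimised with back-propagation. Summing the per-position MSE terms is our reading of how the grouped outputs are combined; the optimizer and learning rate follow Sec. V-B, and GVCNNSketch refers to the architecture sketch given in Sec. III-C above.

import torch

model = GVCNNSketch()                       # architecture sketch from Sec. III-C
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(integer_batch, subpixel_targets):
    # integer_batch: (N,1,H,W); subpixel_targets: list of (N,1,H,W) ground-truth maps.
    optimizer.zero_grad()
    predictions = model(integer_batch)
    loss = sum(torch.mean((p - y) ** 2) for p, y in zip(predictions, subpixel_targets))
    loss.backward()
    optimizer.step()
    return loss.item()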

IV. TRAINING DATA GENERATION FOR MOTION COMPENSATION

In this section, we focus on the generation of the training data that is fed to the proposed GVCNN. We present a framework that simulates the practical process to generate ground truth for fractional interpolation. Specifically, we encode the integer-position pixels with HEVC and perform blurring operations before down-sampling. In addition, we provide a theoretical analysis of the performance bound for motion estimation with quantization to choose the appropriate size of the blur kernels.

A. Training Data Generation

In most video coding standards, motion compensation is used to compress redundant information between adjacent frames. Intuitively, corresponding pixels projected by predicted motion vectors between frames are likely to be located at sub-pixel positions, and this is where fractional interpolation is used: it generates reference samples at sub-pixel positions for motion vectors to project onto. The effectiveness of fractional interpolation methods can only be quantitatively evaluated by the final coding performance, which is affected by a great many factors in the complex video coding system. This means there is no exact ground truth for fractional interpolation.

Thus, in our work, we attempt to generate a simulated ground truth for our GVCNN. Similar to previous works, we also obtain sub-pixel position samples by down-sampling from high-resolution pixel grids. Nevertheless, our training data generation framework has two key considerations:

1) Since fractional interpolation is performed on reconstructed coded frames, which contain reconstruction errors especially under high quantization parameters, the input of our network is coded and reconstructed by the video coding system to simulate practical situations. As shown in the lower part of Fig. 3, the input integer pixels are coded and reconstructed by HEVC after being down-sampled from the raw image.

2) High-frequency components in the original high-resolution signal alias after down-sampling, which leads to a different appearance than the original signal. To deal with this problem, we perform blurring operations to alleviate the aliasing effect caused by down-sampling. The upper part of Fig. 3 illustrates the process: the high-resolution signal is first blurred and then down-sampled to generate the simulated ground-truth sub-pixels. The size of the blur kernel is essential. Generally, a large blur kernel effectively reduces the influence of aliasing; however, it also introduces more noise into the reconstruction. In the next subsection, we discuss the optimal size of the blur kernels theoretically. A simplified sketch of this data generation pipeline is given below.
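The sketch below follows the two paths of Fig. 3 for the half-pel case, assuming SciPy is available for the Gaussian blur. Encoding and decoding the integer-position sample with the HM reference software cannot be shown in a short snippet, so hevc_reconstruct is only a labeled placeholder; the blur strength roughly follows the range reported in Sec. V-A.

import numpy as np
from scipy.ndimage import gaussian_filter

def hevc_reconstruct(image, qp):
    # Placeholder: in the real pipeline this would encode and decode `image`
    # with the HM reference software at the given QP; here it is an identity.
    return image

def make_half_pel_pair(raw_image, qp, blur_sigma=0.45):
    raw = raw_image.astype(np.float32)
    # Integer-position sample: top-left pixel of every 2x2 block, then coded and
    # reconstructed to simulate reconstruction error (lower path of Fig. 3).
    integer_sample = hevc_reconstruct(raw[0::2, 0::2], qp)
    # Sub-pixel samples: blur first to suppress aliasing, then down-sample at the
    # other three positions of each 2x2 block (upper path of Fig. 3).
    blurred = gaussian_filter(raw, sigma=blur_sigma)
    half_pel = [blurred[0::2, 1::2], blurred[1::2, 0::2], blurred[1::2, 1::2]]
    return integer_sample, half_pel

integer_s, half_pels = make_half_pel_pair(np.random.rand(64, 64) * 255, qp=37)
print(integer_s.shape, [h.shape for h in half_pels])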

B. Performance Bound for Motion Estimation With Quantization

We analyze the performance bound for motion estimation with quantization and give suggestions for choosing the best blur kernel size in training data generation. Specifically, we consider the Cramer-Rao bound (CRB) for motion estimation, which provides the minimum mean square error that an unbiased estimator can achieve. The CRB can be obtained by constructing the maximum likelihood estimator and the Fisher information matrix.

Fig. 4. Illustration of the degradations caused by quantization, up-sampling and Gaussian blurring during sub-pixel interpolation.

The following derivation refers to [28] and [29]. Our analysis has two distinct properties: 1) our analysis considers quantization errors; 2) in the analysis of [28], the original and shifted signals go through the same degradation, whereas in our analysis the shifted signal goes through an additional quantization degradation process. For simplicity, we only consider 1-D signals.

1) Frequency Signals With Motion Shift: Consider two signals: a low-frequency signal ω_1 and a high-frequency aliasing signal ω_2, where ω_2 = ω_1 + N_L. The original signal with these two frequency components in the time domain is presented as follows,

I_{t_1}(n) = \frac{A_1}{N_L} e^{-\frac{i 2\pi \omega_1 n}{N_H}} + \frac{A_2}{N_L} e^{-\frac{i 2\pi \omega_2 n}{N_H}}, \quad (9)

where N_L and N_H are the lengths of the down-sampled and original signals, respectively, and N_H serves as a normalization constant. n is an integer that indexes a certain frequency band. A_1 and A_2 are the magnitudes of the low-frequency and high-frequency signals, respectively. Note that the magnitudes of signals are assumed to follow a power law [30], thus the magnitude of ω_2 is much larger than those of the other high-frequency components. That is why we can assume that there is only one high-frequency component in the given signal. Assuming a motion u_2 happens on the original grid at time t_2, the shifted signal can be presented as follows,

I_{t_2}(n) = \frac{A_1}{N_L} e^{-\frac{i 2\pi \omega_1 (n - u_2)}{N_H}} + \frac{A_2}{N_L} e^{-\frac{i 2\pi \omega_2 (n - u_2)}{N_H}} = \frac{A_1}{N_L} W^{-\omega_1 (n - u_2)} + \frac{A_2}{N_L} W^{-\omega_2 (n - u_2)}, \quad (10)

where W = \exp(\frac{i 2\pi}{N_H}).

We then conduct the further analysis in the DFT domain, where shifts in time naturally become changes in the phase of the signal. The DFTs of the original signals are

I_{t_1}(\omega) = Z_0 [A_1 \delta(\omega - \omega_1) + A_2 \delta(\omega - \omega_2)],
I_{t_2}(\omega) = Z_0 [A_1 \delta(\omega - \omega_1) W^{u_2 \omega_1} + A_2 \delta(\omega - \omega_2) W^{u_2 \omega_2}], \quad (11)

where Z_0 = N_H / N_L.

2) Up-Sampled Quantized Signals: We then deduce the degraded version of Eqs. (10) and (11). In our sub-pixel interpolation for motion estimation, the current block is given in the high-resolution space, and the reference block is assumed to be first quantized and then up-sampled by our algorithm. The whole process is shown in Fig. 4. To suppress the negative effect of the potential aliasing and quantization noises on motion compensation, blurring operations can be conducted on the current block and the reference blocks. There are two kinds of blurring operations: the first is implicitly conducted by our interpolation method, and the second is set in the training data synthesis. Thus, in the formulation, the two signals are degraded by a rescaling operation (modeled as Gaussian blur) and quantization (modeled as random noise), and are contaminated by random noises. The DFTs of these two degraded signals are presented as follows,

J_{t_1}(\omega) = [G_{\sigma_p}(\omega_1) A_1 + G_{\sigma_p}(\omega_2) A_2] \delta(\omega - \omega_1) + n_1(\omega),
J_{t_2}(\omega) = \{[G_{\sigma_q}(\omega_1) G_{\sigma_k}(\omega_1) A_1 W^{u_2 \omega_1} + G_{\sigma_q}(\omega_1) n_1^{q_1}(\omega_1)] + G_{\sigma_q}(\omega_2) G_{\sigma_k}(\omega_2) A_2 W^{u_2 \omega_2}\} \delta(\omega - \omega_1) + n_2(\omega), \quad (12)

where n_1 and n_2 are assumed to be additive Gaussian noise with variances \sigma_{n_1}^2 and \sigma_{n_2}^2, respectively. The DFTs of the Gaussian blur kernels that are used to suppress aliasing and quantization are denoted by G_{\sigma_p}(\omega) = \exp(-\frac{\omega^2 \sigma_p^2}{2}) and G_{\sigma_q}(\omega) = \exp(-\frac{\omega^2 \sigma_q^2}{2}), where \sigma_p and \sigma_q are the standard deviations. The DFT of the Gaussian blur kernel caused by up-sampling is denoted by G_{\sigma_k}(\omega) = \exp(-\frac{\omega^2 \sigma_k^2}{2}), where \sigma_k is the standard deviation. n_1^{q_1} is the quantization error, which follows a uniform distribution over [-\frac{q}{2}, \frac{q}{2}], where q is assumed to be linearly correlated with A_1. The expectation and variance of n_1^{q_1} are:

E(n_1^{q_1}) = 0, \quad var(n_1^{q_1}) = C_1 A_1^2, \quad (13)

where C_1 is a constant correlated to the peak and range of the signal, as well as the quantization step (the variance of a uniform distribution over [-\frac{q}{2}, \frac{q}{2}] is \frac{q^2}{12}, and q is proportional to A_1). Since A_1 and A_2 follow a power law [30], the magnitude of the quantization error of ω_1 is much larger than that of ω_2. Thus, we here neglect the quantization error of ω_2.

Similar to [28], the aliasing signals can be regarded as additive white Gaussian noise. Hence, the problem becomes

J_{t_1}(\omega) = [G_{\sigma_p}(\omega_1) A_1 + G_{\sigma_p}(\omega_2) A_2] \delta(\omega - \omega_1) + n_1(\omega) = G_{\sigma_p}(\omega_1) A_1 \delta(\omega - \omega_1) + n'_1(\omega),
J_{t_2}(\omega) = G_{\sigma_q}(\omega_1) G_{\sigma_k}(\omega_1) A_1 \delta(\omega - \omega_1) W^{u_2 \omega_1} + n'_2(\omega), \quad (14)


where n'_1 = n_1 + G_{\sigma_p}(\omega_2) A_2 has the variance

\sigma_{n'_1}^2 = G_{\sigma_p}^2(\omega_2) A_2^2 + \sigma_{n_1}^2, \quad (15)

and n'_2 = n_2 + G_{\sigma_q}(\omega_2) G_{\sigma_k}(\omega_2) A_2 + G_{\sigma_q}(\omega_1) n_1^{q_1} has the variance

\sigma_{n'_2}^2 = G_{\sigma_k}^2(\omega_2) G_{\sigma_q}^2(\omega_2) A_2^2 + \sigma_{n_2}^2 + G_{\sigma_q}^2(\omega_2) C_1 A_1^2. \quad (16)

3) Cramer-Rao Bounds for Motion Estimation: The Cramer-Rao bound [31] provides a bound for unbiased estimators. It is obtained by calculating the inverse of the Fisher information matrix, which is in turn calculated by taking derivatives of a log-likelihood with respect to the unknown parameters.

The negative log-likelihood function for the input down-sampled image is

-\log p(J_{t_1}, J_{t_2} \mid A_1, u_2) = \frac{1}{2\sigma_{n'_2}^2} \| J_{t_2}(\omega_1) - G_{\sigma_q}(\omega_1) G_{\sigma_k}(\omega_1) A_1 W^{u_2 \omega_1} \|_2^2 + \frac{1}{2\sigma_{n'_1}^2} \| J_{t_1}(\omega_1) - G_{\sigma_p}(\omega_1) A_1 \|_2^2, \quad (17)

where \| \cdot \|_2^2 evaluates the squared \ell_2 norm for complex signals. Then, we can get the Fisher information matrix for the unknown parameters \theta = \{Re\{A_1\}, Im\{A_1\}, u_2\} as follows,

I_\theta = \begin{pmatrix} t_1 + t_2 & 0 & s t_2 \\ 0 & t_1 + t_2 & s t_2 \\ s t_2 & s t_2 & s^2 t_2 \end{pmatrix}, \quad (18)

where t_1 = G_{\sigma_k}^2(\omega_1)/\sigma_{n'_1}^2, t_2 = G_{\sigma_q}^2(\omega_1) G_{\sigma_k}^2(\omega_1)/\sigma_{n'_2}^2, and s = 2\pi A_1 \omega_1 / N_H. Then, its inverse is

I_\theta^{-1} = \begin{pmatrix} \frac{t_1}{t_1^2 - t_2^2} & \frac{t_2}{t_1^2 - t_2^2} & \frac{-1}{s t_1 - s t_2} \\ \frac{t_2}{t_1^2 - t_2^2} & \frac{t_1}{t_1^2 - t_2^2} & \frac{-1}{s t_1 - s t_2} \\ \frac{-1}{s t_1 - s t_2} & \frac{-1}{s t_1 - s t_2} & \frac{t_1 + t_2}{s^2 (t_1 - t_2) t_2} \end{pmatrix}. \quad (19)

Thus, we obtain the following Cramer-Rao bound for estimating the motion u_2 as

var[u_2] \ge I_\theta^{-1}(3, 3) = \frac{t_1 + t_2}{s^2 (t_1 - t_2) t_2} = \frac{1}{s^2 t_2} H_1, \quad (20)

where

H_1 = \frac{t_1 + t_2}{t_1 - t_2} = \frac{G_{\sigma_k}^2(\omega_1) \sigma_{n'_2}^2 + G_{\sigma_q}^2(\omega_1) G_{\sigma_k}^2(\omega_1) \sigma_{n'_1}^2}{G_{\sigma_k}^2(\omega_1) \sigma_{n'_2}^2 - G_{\sigma_q}^2(\omega_1) G_{\sigma_k}^2(\omega_1) \sigma_{n'_1}^2} = \frac{\sigma_{n'_2}^2 + G_{\sigma_q}^2(\omega_1) \sigma_{n'_1}^2}{\sigma_{n'_2}^2 - G_{\sigma_q}^2(\omega_1) \sigma_{n'_1}^2}.

Since the power spectrum of natural images follows a power law, i.e. |A_1| ≫ |A_2|, the quantization error term related to |A_1| in \sigma_{n'_2}^2 makes \sigma_{n'_2}^2 dominate \sigma_{n'_2}^2 + G_{\sigma_q}^2(\omega_1) \sigma_{n'_1}^2 in magnitude. Then, 1 \le H_1 \le 1 + \alpha, and 1 + \alpha is the upper bound of H_1. Therefore, we can replace H_1 with a constant h_1. After that, we have

var[u_2] \ge I_\theta^{-1}(3, 3) = \frac{1}{s^2 t_2} \cdot \frac{t_1 + t_2}{t_1 - t_2} \ge \frac{h_1}{s^2 t_2} = \frac{N_H^2 h_1}{4\pi^2 A_1^2 \omega_1^2} \cdot \frac{\sigma_{n'_2}^2}{G_{\sigma_q}^2(\omega_1) G_{\sigma_k}^2(\omega_1)} = \frac{N_H^2 h_1}{4\pi^2 A_1^2 \omega_1^2} Z_1(\omega_1, \omega_2) = \frac{N_H^2 h_1}{4\pi^2 A_1^2 \omega_1^2} Z_2(\omega_1, \omega_2) = \frac{N_H^2 h_1}{4\pi^2 A_1^2 \omega_1^2} (H_2 + H_3), \quad (21)

where

Z_1(\omega_1, \omega_2) = \frac{G_{\sigma_k}^2(\omega_2) G_{\sigma_q}^2(\omega_2) A_2^2 + \sigma_{n_2}^2 + G_{\sigma_q}^2(\omega_2) C_1 A_1^2}{G_{\sigma_q}^2(\omega_1) G_{\sigma_k}^2(\omega_1)},

Z_2(\omega_1, \omega_2) = e^{(\omega_1^2 - \omega_2^2)\sigma_q^2} e^{(\omega_1^2 - \omega_2^2)\sigma_k^2} A_2^2 + e^{(\omega_1^2 - \omega_2^2)\sigma_q^2} e^{\omega_1^2 \sigma_k^2} C_1 A_1^2 + e^{\omega_1^2 \sigma_q^2} e^{\omega_1^2 \sigma_k^2} \sigma_{n_2}^2,

and

H_2 = e^{(\omega_1^2 - \omega_2^2)\sigma_q^2} e^{(\omega_1^2 - \omega_2^2)\sigma_k^2} A_2^2 + e^{(\omega_1^2 - \omega_2^2)\sigma_q^2} e^{\omega_1^2 \sigma_k^2} C_1 A_1^2,
H_3 = e^{\omega_1^2 \sigma_q^2} e^{\omega_1^2 \sigma_k^2} \sigma_{n_2}^2.

We can now observe the effects of the blur kernel (σ_k) on motion estimation, and the effects are two-fold. A small kernel preserves the low-frequency component and amplifies less noise (in H_3), but suppresses fewer aliasing signals and quantization errors (in H_2). A large kernel loses the effective low-frequency component and amplifies more noise (in H_3), but suppresses more aliasing signals and quantization errors (in H_2). A blur kernel of intermediate size therefore achieves the best performance.
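As a toy numerical illustration of this trade-off (not an experiment from the paper), the snippet below evaluates H_2 + H_3, using the form reconstructed in Eq. (21) above, over a range of σ_k values; all other quantities are arbitrary illustrative numbers. Under these assumptions the factor is minimised at an intermediate kernel size, consistent with the statement above.

import numpy as np

def bound_factor(sigma_k, sigma_q=0.5, w1=0.3, w2=2.0,
                 A1=1.0, A2=0.05, C1=1e-3, var_n2=1e-4):
    # H2: aliasing and quantization terms; H3: amplified noise term.
    H2 = (np.exp((w1**2 - w2**2) * sigma_q**2) *
          np.exp((w1**2 - w2**2) * sigma_k**2) * A2**2 +
          np.exp((w1**2 - w2**2) * sigma_q**2) *
          np.exp(w1**2 * sigma_k**2) * C1 * A1**2)
    H3 = np.exp(w1**2 * sigma_q**2) * np.exp(w1**2 * sigma_k**2) * var_n2
    return H2 + H3

sigmas = np.linspace(0.1, 3.0, 59)
values = [bound_factor(s) for s in sigmas]
print("bound factor minimised near sigma_k =", round(float(sigmas[int(np.argmin(values))]), 2))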

V. EXPERIMENTAL RESULTS

A. Implementation Details

200 training images and 200 testing images in BSDS500 [32], of size 481 × 321 or 321 × 481, are used for training data generation. Different from [11], which codes and reconstructs the input integer samples with one single QP value to obtain a specific model for each QP, we utilize multiple QPs from 0 to 51 for training data generation to acquire a model that is applicable to all QPs.

Besides, due to the different sub-pixel levels of half-pixel samples and quarter-pixel samples, we adopt different settings of training data generation for half-pixel and quarter-pixel position samples, and we train two models for these two sub-pixel levels (GVCNN-H for half-pixel positions and GVCNN-Q for quarter-pixel positions). The specific settings are described below.

For GVCNN-H, 3 × 3 Gaussian kernels with random standard deviations in the range [0.4, 0.5] are used for blurring. Dividing the images into 2 × 2 patches without overlapping, pixels at the top-left of the patches in the raw images are sampled to obtain the integer-position sample, and pixels at the other three positions of the patches are separately sampled from the blurred image to derive the sub-pixel position samples.

Fig. 5. Three example R-D curves of the sequences (a) BQTerrace, (b) Kimono and (c) BasketballPass under the LDP configuration.

As for GVCNN-Q, the inferred samples are at a finer sub-pixel level. 3 × 3 Gaussian kernels with random standard deviations in the range [0.5, 0.6] are utilized. The sampling is performed based on 4 × 4 patches, where 12 samples are extracted from pixels at 1/4 or 3/4 positions vertically or horizontally in the patch.
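The following sketch shows one way the twelve quarter-pel training samples can be pulled from the blurred image on a 4 × 4 grid, following the description above; the exact ordering of the twelve maps is our assumption.

import numpy as np

def quarter_pel_samples(blurred_image):
    # Use the top-left pixel of each non-overlapping 4x4 patch as the integer
    # position; every offset with at least one odd coordinate (a 1/4 or 3/4
    # offset vertically or horizontally) is a quarter-pel position, giving 12 maps.
    b = blurred_image.astype(np.float32)
    samples = []
    for dy in range(4):
        for dx in range(4):
            if dy % 2 == 1 or dx % 2 == 1:
                samples.append(b[dy::4, dx::4])
    return samples

print(len(quarter_pel_samples(np.zeros((64, 64)))))   # -> 12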

During the integration into HEVC, we interpolate an entire coded reference frame with our trained GVCNN-H and GVCNN-Q in advance. Specifically, we take the reference frame as the integer-position sample and input it into the trained networks to acquire the 3 half-pixel position samples (I^{h_1}, I^{h_2}, I^{h_3}) and the 12 quarter-pixel position samples (I^{q_1}, I^{q_2}, ..., I^{q_12}). The interpolated sub-pixel position samples are cached in the system and we can later obtain the needed blocks during inter prediction. For the sake of better performance, a CU-level RDO is used for the interpolation method selection. Two passes of encoding that respectively interpolate the sub-pixel position samples with and without the deep-based method are performed at the encoder side. A flag is set based on the rate-distortion costs of the two passes to indicate whether to use the deep-based method for sub-pixel position sample interpolation. The flag is coded with one bit and integrated at the CU level. All the prediction units (PUs) in a CU share the same flag.
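A schematic of this CU-level decision is given below: the encoder measures the rate-distortion cost of both interpolation paths and signals the cheaper one with a single flag bit. The distortion/bit pairs and the Lagrange multiplier are stand-ins for quantities produced inside the HM encoder, not an actual HM interface.

def choose_interpolation(rd_gvcnn, rd_dctif, lagrange_lambda):
    # rd_*: (distortion, bits) measured by the two encoding passes of the CU.
    # One flag bit is added to each candidate; all PUs in the CU share the flag.
    cost_gvcnn = rd_gvcnn[0] + lagrange_lambda * (rd_gvcnn[1] + 1)
    cost_dctif = rd_dctif[0] + lagrange_lambda * (rd_dctif[1] + 1)
    return "GVCNN" if cost_gvcnn <= cost_dctif else "DCTIF"

print(choose_interpolation((1000.0, 120), (1100.0, 118), 4.0))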

B. Experimental Settings

During the training process, the training images are decomposed into 32 × 32 sub-images with a stride of 16. The GVCNN is trained on the Caffe platform [33] via Adam [34] with standard back-propagation. The learning rate is initially set to a fixed value of 0.0001 and the batch size is set to 128. Models after 100,000 iterations are used for testing. The network is trained on a single NVIDIA GeForce GTX 1080.

The proposed method is tested on the HEVC reference software HM 16.4 under the low-delay P (LDP) configuration. BD-rate is used to measure the rate-distortion performance. The QP values are set to 22, 27, 32 and 37. As mentioned above, we train one GVCNN-H and one GVCNN-Q for all QPs based on the corresponding training data. A CU-level rate-distortion optimization is also integrated to decide whether to use our deep-based interpolation method or the interpolation method of HEVC.

C. Experimental Results and Analysis

1) Overall Performance: The performance of the proposed method for classes A, B, C, D, E and F is shown in Table I. For the luma component, our method achieves on average 2.2%, 1.2% and 0.9% BD-rate saving under the LDP, LDB and RA conditions, respectively. For the sequence BQTerrace, the BD-rate saving under LDP is as high as 5.2%. For further verification, some example rate-distortion (R-D) curves are shown in Fig. 5. It can be seen that under most QPs our method is superior to HEVC.

Besides, the JVET ultra high definition (UHD) sequences (3840 × 2160) are also tested under the RA configuration. Results are shown in Table II. The BD-rate reduction is less significant on the high-resolution test sequences compared with that on the other sequences. Our explanation is that the sampling precision of the high-resolution test sequences is high enough and the signals of adjacent pixels are more continuous. Hence, it is not very beneficial to generate sub-pixel position samples for motion compensation.

2) Comparison With Existing Methods: We also compare our method with two existing deep-based fractional interpolation methods (respectively called CNNIF [11] and VDIF [12]), which only replace the half-pixel interpolation without rate-distortion optimization in HEVC and are tested on HM 16.4. For fair comparison, we also integrate our method into HM 16.7 by only replacing the half-pixel interpolation algorithm with GVCNN-H, without rate-distortion optimization. The three methods are all tested under the HEVC common test conditions. The BD-rate reduction comparison for the Y component between the three methods is shown in Table III. Our method obtains gains over CNNIF and VDIF in most cases, and our performance is further improved after adding GVCNN-Q and the rate-distortion optimization.

TABLE I. BD-RATE REDUCTION OF THE PROPOSED METHOD COMPARED TO HEVC

TABLE II. BD-RATE REDUCTION OF THE PROPOSED METHOD FOR UHD SEQUENCES UNDER RA CONFIGURATION COMPARED TO HEVC

3) Verification of the Grouped Variation Network: With the grouped variation network, only two networks, corresponding to half-pixel and quarter-pixel position samples, need to be trained, which is convenient and meaningful for real applications. In order to further verify the effectiveness of our grouped variation method, we additionally train 15 networks, one for each sub-pixel position, to confirm that this convenience is achieved without loss of coding performance. We refer to the method that trains networks separately as GVCNN-Separate and the proposed uniform method as GVCNN-Uniform. The average BD-rate reduction comparison for some testing classes is shown in Table IV. Note that in this comparison only the quarter-pixel interpolation method is selected by RDO. As can be observed, the results of training the networks uniformly are comparable to those obtained by training the networks separately.

Besides the coding performance, we additionally compare the implementation time of the two methods. The overall encoding time complexity in CPU mode is 1495% for GVCNN-Separate and 629% for GVCNN-Uniform. It can be seen that the time complexity of GVCNN-Separate is much larger than that of GVCNN-Uniform, since GVCNN-Separate needs more passes of network forwarding.

4) Analysis of Blurring in Training Data Generation: The effect of the blurring process in training data generation is also further tested and analyzed. Table V shows the results obtained with the models trained without blurring. The proposed model obtains an obvious gain over the model without blurring.

TABLE III. BD-RATE REDUCTION COMPARISON FOR CLASSES C, D AND E UNDER LDP CONFIGURATION

TABLE IV. AVERAGE BD-RATE REDUCTION ACHIEVED BY TRAINING THE NETWORKS SEPARATELY AND UNIFORMLY

TABLE V. BD-RATE REDUCTION OF GVCNN WITH AND WITHOUT BLURRING FOR CLASS D

We define the half-pixel interpolation model trained with Gaussian blur kernels whose standard deviations are in the range [x, x + 0.1] as GVCNN-H(x), and similarly the quarter-pixel interpolation model as GVCNN-Q(x). For further analysis, we have trained models with different blurring levels and tested them on class D. By fixing the blurring level of one model (GVCNN-H or GVCNN-Q) and changing the other's, two line charts are derived, as shown in Fig. 6 and Fig. 7. It is obvious that the models with medium blurring levels obtain the largest gain, which accords with the theory presented in Sec. IV-B. Note that, in this analysis, only the quarter-pixel interpolation method is selected by RDO.

Fig. 6. BD-rate reduction obtained with a fixed GVCNN-H model (x = 0.4) and various GVCNN-Q models (x = 0.3, 0.4, 0.5, 0.6).

Fig. 7. BD-rate reduction obtained with a fixed GVCNN-Q model (x = 0.5) and various GVCNN-H models (x = 0.3, 0.4, 0.5, 0.6).

5) Implementation Time: The experiments shown in this paper are all run in CPU mode. We additionally test class D in GPU mode, using an Intel i7 7700 CPU and an NVIDIA GeForce GTX 1070 GPU. In GPU mode, the GPU is used for the forward operation of the CNN and the other operations are still performed by the CPU. The overall encoding time complexity in CPU mode is 629% and in GPU mode is 282%. The decoding time complexity in CPU mode is 154858% and in GPU mode is 11922%. It can be seen that the coding efficiency is well improved by GPU acceleration. There are advanced techniques to further accelerate our method, such as a fixed-point implementation of our model. Based on recent progress in deep model quantization [35], all 32-bit floating-point multiplications can be replaced by 8-bit fixed-point operations. After this optimization, the running time of our neural network model can in theory be reduced to 1/4 of that before optimization, and the encoding time of our method in GPU mode would then drop from about 280% to about 145%. Moreover, the complexity of our work can be further reduced by hardware acceleration when it is integrated into a chip in real applications.
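The 145% figure follows from simple arithmetic, under the assumption that the whole overhead above the anchor encoder is network forwarding; a quick check:

baseline, gpu_total = 100.0, 282.0           # percentages relative to the HM anchor
cnn_part = gpu_total - baseline              # overhead attributed to network forwarding
print(baseline + cnn_part / 4)               # -> 145.5, i.e. roughly 145%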

6) Rate-Distortion Optimization Results: During the integration, a CU-level RDO is performed to select the method for interpolating sub-pixel position samples. PUs with sub-pixel level motion vectors in one CU choose the same interpolation method. Some example RDO results for the sequences in class D are shown for verification.

Fig. 8 shows the visualization results of the interpolation method selection for a frame in RaceHorses. In the figure, the marked blocks are the CUs of inter mode. Red indicates that the PUs in the CU choose GVCNN for interpolation and blue indicates that the PUs choose the interpolation method DCTIF of HEVC.


Fig. 8. Visualization of RDO results of a frame in RaceHorses. The test condition is LDP, QP 32, POC 8. Red blocks indicate CUs that choose GVCNN for interpolation and blue blocks represent CUs that choose DCTIF.

TABLE VI. RDO RESULTS FOR SUB-PIXEL POSITION SAMPLE INTERPOLATION

We also count the total numbers of the two kinds of CUs for the sequences in class D. We test the sequences for 2 s with QPs 22, 27, 32 and 37. The numbers of CUs that choose GVCNN and DCTIF for interpolation during the encoding process are calculated and shown in Table VI. The choosing ratio is calculated for each sequence to indicate the proportion of CUs that choose GVCNN for interpolation. It can be seen that the choosing ratio is highly correlated with the coding gain of our method on the sequence.

VI. CONCLUSION

In this paper, we propose a one-for-all grouped variation convolutional neural network for fractional interpolation in the motion compensation of video coding. The network first uniformly extracts a feature map from the input integer-position sample, and then variations at different sub-pixel positions are inferred sharing the same feature map. The training data generation for the proposed network is also carefully analyzed and designed. Experimental results show that our method obtains on average a 2.2% BD-rate saving on the test sequences compared with HEVC. The effectiveness of the grouped variation adopted in our network is also verified by experiments.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] R. Wang, C. Huang, J. Li, and Y. Shen, "Sub-pixel motion compensation interpolation filter in AVS [video coding]," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), vol. 1, Jun. 2004, pp. 93–96.
[4] H. Lakshman, B. Bross, H. Schwarz, and T. Wiegand, "Fractional-sample motion compensation using generalized interpolation," in Proc. Picture Coding Symp., Dec. 2010, pp. 530–533.
[5] A. Kokhazadeh, A. Aminlou, and M. R. Hashemi, "Edge-oriented interpolation for fractional motion estimation in hybrid video coding," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2011, pp. 1–6.
[6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, Jul. 2017.
[7] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 2808–2817.
[8] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 2536–2544.
[9] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199.
[10] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 1646–1654.
[11] N. Yan, D. Liu, H. Li, and F. Wu, "A convolutional neural network approach for half-pel interpolation in video coding," in Proc. IEEE Int. Symp. Circuits Syst., May 2017, pp. 1–4.
[12] H. Zhang, L. Song, Z. Luo, and X. Yang, "Learning a convolutional neural network for fractional interpolation in HEVC inter coding," in Proc. IEEE Vis. Commun. Image Process., Dec. 2017, pp. 1–4.
[13] R. Lin, Y. Zhang, H. Wang, X. Wang, and Q. Dai, "Deep convolutional neural network for decompressed video enhancement," in Proc. Data Compress. Conf., Mar./Apr. 2016, p. 617.
[14] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in Proc. Int. Conf. Multimedia Modeling, 2017, pp. 28–39.
[15] T. Wang, M. Chen, and H. Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in Proc. Data Compress. Conf., Apr. 2017, pp. 410–419.
[16] W.-S. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in Proc. IEEE Image, Video, Multidimensional Signal Process. Workshop, Jul. 2016, pp. 1–5.
[17] T. Laude and J. Ostermann, "Deep learning-based intra prediction mode decision for HEVC," in Proc. Picture Coding Symp., Dec. 2016, pp. 1–5.
[18] Z. Liu, X. Yu, Y. Gao, S. Chen, X. Ji, and D. Wang, "CU partition mode decision for HEVC hardwired intra encoder using convolution neural network," IEEE Trans. Image Process., vol. 25, no. 11, pp. 5088–5103, Nov. 2016.
[19] M. Xu, T. Li, Z. Wang, X. Deng, R. Yang, and Z. Guan. (2017). "Reducing complexity of HEVC: A deep learning approach." [Online]. Available: https://arxiv.org/abs/1710.01218
[20] W. Dong, L. Zhang, R. Lukac, and G. Shi, "Sparse representation based image interpolation with nonlocal autoregressive modeling," IEEE Trans. Image Process., vol. 22, no. 4, pp. 1382–1394, Apr. 2013.
[21] J.-J. Huang, W.-C. Siu, and T.-R. Liu, "Fast image interpolation via random forests," IEEE Trans. Image Process., vol. 24, no. 10, pp. 3232–3245, Oct. 2015.
[22] H. Huang, R. He, Z. Sun, and T. Tan, "Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Oct. 2017, pp. 1689–1697.
[23] T. Guo, H. S. Mousavi, T. H. Vu, and V. Monga, "Deep wavelet prediction for image super-resolution," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, Jul. 2017, pp. 1100–1109.
[24] F. Guichard and F. Malgouyres, "Total variation based interpolation," in Proc. Eur. Signal Process. Conf., 1998, pp. 1–4.
[25] S. D. Babacan, R. Molina, and A. K. Katsaggelos, "Total variation super resolution using a variational approach," in Proc. IEEE Int. Conf. Image Process., Oct. 2008, pp. 641–644.
[26] Q. Yuan, L. Zhang, and H. Shen, "Regional spatially adaptive total variation super-resolution with spatial information filtering and clustering," IEEE Trans. Image Process., vol. 22, no. 6, pp. 2327–2342, Jun. 2013.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1026–1034.
[28] C. Liu and D. Sun, "On Bayesian adaptive video super resolution," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 346–360, Feb. 2014.
[29] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 1999.
[30] S. Roth, "High-order Markov random fields for low-level vision," Ph.D. dissertation, Dept. Comput. Sci., Brown Univ., Providence, RI, USA, 2007.
[31] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Upper Saddle River, NJ, USA: Prentice-Hall, 1993.
[32] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, Jul. 2001, pp. 416–423.
[33] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[34] D. P. Kingma and L. J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
[35] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, "Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5784–5789, Nov. 2018.

Jiaying Liu (S'08–M'10–SM'17) received the Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2010. She was a Visiting Scholar with the University of Southern California, Los Angeles, from 2007 to 2008. She was a Visiting Researcher with Microsoft Research Asia in 2015, supported by the Star Track Young Faculties Award. She is currently an Associate Professor with the Institute of Computer Science and Technology, Peking University. She has authored over 100 technical articles in refereed journals and proceedings and holds 28 granted patents. Her current research interests include multimedia signal processing, compression, and computer vision.

Dr. Liu is a Senior Member of CCF. She served as a member for the Multimedia Systems and Applications Technical Committee and the Education and Outreach Technical Committee in the IEEE Circuits and Systems Society, and the Image, Video, and Multimedia Technical Committee in APSIPA. She has also served as the Publicity Chair for IEEE ICIP-2019/VCIP-2018, the Grand Challenge Chair for IEEE ICME-2019, and an Area Chair for ICCV-2019. She was an APSIPA Distinguished Lecturer from 2016 to 2017.

Sifeng Xia received the B.S. degree in computer science from Peking University, Beijing, China, in 2017, where he is currently pursuing the master's degree with the Institute of Computer Science and Technology. His current research interests include deep learning-based image processing and video coding.

Wenhan Yang (M'18) received the B.S. and Ph.D. degrees (Hons.) in computer science from Peking University, Beijing, China, in 2012 and 2018, respectively. He was a Visiting Scholar with the National University of Singapore from 2015 to 2016. His current research interests include deep-learning based image processing, bad weather restoration, and related applications and theories.

Mading Li received the B.S. and Ph.D. degrees in computer science from Peking University, Beijing, China, in 2013 and 2018, respectively. He was a Visiting Scholar with McMaster University in 2016. His current research interests include image and video processing, image interpolation, image restoration, and low-light image enhancement.

Dong Liu (M'13) received the B.S. and Ph.D. degrees in electrical engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004 and 2009, respectively.

He was a member of the Research Staff with the Nokia Research Center, Beijing, China, from 2009 to 2012. He joined USTC as an Associate Professor in 2012. He has authored or co-authored more than 80 papers in international journals and conferences and has 14 granted patents. His research interests include image and video coding, multimedia signal processing, and data mining.

Dr. Liu received the 2009 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY Best Paper Award and the Best 10% Paper Award at VCIP 2016. He and his team won four challenges held in ACM MM 2018, CVPR 2018, ECCV 2018, and ICME 2016, respectively.