Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors
Ngai-Man Cheung, Oscar C. Au, Senior Member, IEEE, Man-Cheung Kung, Peter H.W. Wong, Senior Member, IEEE, and Chun Hung Liu
CSVT, November 2009
Outline
• Introduction
• Intra-Prediction
• Parallel RD Optimized Intra-Mode Decision
• Experiments
• Conclusion
Introduction
• Multicore Graphics Processors
▫ Graphics Processing Units (GPUs)
▫ Coprocessing units for CPUs that accelerate numerical and signal processing applications, thanks to their high-performance multicore and pipelined architectures
• Investigate the use of GPUs to perform RD optimized intra-mode selection in AVS and H.264
Difficulties
• Intra-Mode Decision
▫ Dependency between the current block and its adjacent blocks
▫ Determining the encoding bit-rate for each of the candidate modes may require conditional branching
Contributions
• Analyze the dependency constraints in intra-mode decision
• Propose a strategy to determine the mode decisions of video blocks in parallel
▫ Encode the blocks in novel orders
• Extend a bit-rate approximation method to estimate the rate in RD cost computation
Intra-prediction in H.264 (4x4)
M A B C D E F G H
I a b c d
J e f g h
K i j k l
L m n o p
(a) 4 × 4 current block and its neighboring reconstructed pixels.
(b) Prediction directions and their corresponding modes (direction labels in the figure: 0, 1, 3, 4, 5, 6, 7, 8; mode 2 is the DC mode).
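The pixel layout above can be made concrete with a small Python sketch of three of the nine 4x4 modes: vertical (0), horizontal (1), and DC (2). It assumes all eight neighboring pixels A-D and I-L are available; the function names `predict_4x4` and `sad`, and the sample pixel values, are illustrative and not taken from the slides.

```python
def predict_4x4(mode, top, left):
    """Return a 4x4 predicted block (list of rows) for modes 0, 1 and 2.

    top  -- reconstructed pixels A, B, C, D above the block
    left -- reconstructed pixels I, J, K, L to the left of the block
    """
    if mode == 0:   # vertical: every row copies the pixels above
        return [list(top) for _ in range(4)]
    if mode == 1:   # horizontal: every pixel in a row copies the left neighbor
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:   # DC: rounded mean of the eight neighboring pixels
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")


def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(b - p) for rb, rp in zip(block, pred) for b, p in zip(rb, rp))


# Illustrative data: a block with purely vertical structure.
top = [10, 20, 30, 40]
left = [12, 14, 16, 18]
block = [[10, 20, 30, 40] for _ in range(4)]
costs = {m: sad(block, predict_4x4(m, top, left)) for m in (0, 1, 2)}
best_mode = min(costs, key=costs.get)
```

With this sample block the vertical mode reproduces the block exactly, so it has the lowest distortion. A full RD optimized decision would also account for the rate of each mode.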
Intra-prediction in AVS 1.0 (8x8)
• Vertical mode
• Horizontal mode
• DC mode
• Down-right mode
• Down-left mode: bidirectional prediction
Dependency Analysis
• Dependency constraints on block encoding order
▫ Prediction direction: the RD costs of the current block are hard to determine before all the candidate reference blocks have been encoded and reconstructed
▫ Pixel filtering (AVS): filtering may be applied to the reconstructed pixels of the adjacent blocks before they are used in prediction, and this filtering may involve pixels from several blocks, leading to additional block dependency
Dependency Analysis
• Dependency between the four 8 × 8 blocks (K1-K4) in the current macroblock and their spatially adjacent neighbor blocks (T1-T4, L1, L2) in AVS intra-prediction
Dependency Analysis
• Dependency between the four 4 × 4 blocks (K1-K4) in the current 8 × 8 block and their spatially adjacent neighbor blocks (T1-T4, L1, L2) in H.264 intra-prediction
Dependency Analysis
• The dependency relationships form directed acyclic graphs.
▫ Parallelize the RD cost computation of the four constituent blocks of the same 16x16 MB
▫ Compute in parallel the RD costs of blocks from different 16x16 MBs
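As a rough illustration of the directed acyclic graph view, the Python sketch below builds a simplified dependency model on a grid of blocks and measures its longest dependency chain. The model is an assumption made for illustration: every block depends on its left, top-left, top, and top-right neighbors wherever they exist. It ignores the per-standard availability and filtering rules analyzed in the slides, and the names `parents` and `longest_path_length` are hypothetical.

```python
def parents(r, c, rows, cols):
    """Blocks that block (r, c) is ASSUMED to depend on:
    left, top-left, top and top-right neighbors inside the grid."""
    candidates = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
    return [(pr, pc) for pr, pc in candidates if 0 <= pr < rows and 0 <= pc < cols]


def longest_path_length(rows, cols):
    """Number of blocks on a longest dependency chain of the grid."""
    depth = {}
    # Raster order is a topological order here: every parent precedes its child.
    for r in range(rows):
        for c in range(cols):
            deps = parents(r, c, rows, cols)
            depth[(r, c)] = 1 + max((depth[p] for p in deps), default=0)
    return max(depth.values())


n_star = longest_path_length(3, 4)   # 3 rows and 4 columns of blocks
```

Under this simplified model the longest chain has 2 × (rows − 1) + cols blocks. The exact lengths for H.264 and AVS depend on their actual dependency rules, so this figure is only indicative.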
Greedy-Based Block Encoding Order
• In each iteration, encode all the blocks whose reference reconstructed pixels are all available.
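The greedy rule can be sketched in Python as a simple simulation: in every iteration, all blocks whose parent blocks are already done are processed together. The dependency model (left, top-left, top, top-right neighbors) is an illustrative assumption rather than the exact AVS or H.264 rule set, and `parents` and `greedy_order` are hypothetical names.

```python
def parents(r, c, rows, cols):
    """ASSUMED dependency model: left, top-left, top and top-right neighbors."""
    candidates = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
    return [(pr, pc) for pr, pc in candidates if 0 <= pr < rows and 0 <= pc < cols]


def greedy_order(rows, cols):
    """Return a list of iterations; each lists the blocks encoded in parallel."""
    done = set()
    remaining = {(r, c) for r in range(rows) for c in range(cols)}
    iterations = []
    while remaining:
        # A block is ready once every block it depends on has been reconstructed.
        ready = sorted(
            b for b in remaining
            if all(p in done for p in parents(b[0], b[1], rows, cols))
        )
        iterations.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return iterations


order = greedy_order(3, 4)
for i, blocks in enumerate(order, start=1):
    print(i, blocks)
```

For this 3 × 4 grid the simulation needs 8 iterations, and from the third iteration on several blocks become ready at once. Those blocks, together with their candidate modes, are the work that could be distributed across GPU threads.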
Greedy-Based Block Encoding Order
• AVS Example
Greedy-Based Block Encoding Order
• AVS Example
• AVS Example, modified version
• Postpone the encoding of several blocks along the left frame boundary
• All four constituent blocks of any MB can then be encoded consecutively
• Does not incur any execution time penalty
Greedy-Based Block Encoding Order
• AVS Example
Optimality
• Lemma 1: The proposed greedy-based encoding order can process all bottleneck path(s) P* in exactly n* iterations
• Proof of Lemma 1:
▫ Suppose the greedy-based order requires more than n* iterations to process P*
▫ Then there is at least one processing gap of length w, during which P* is not being processed, between Ki and Ki+1
▫ There would exist an immediate parent block Bm of Ki+1
▫ Continuing with backtracking, one would eventually reach some block Kj in P*
▫ P1 has no processing gap, so P1 is longer than P0
▫ Replacing P0 with P1 in P* would give a path longer than P*, a contradiction
(Figure: bottleneck path P* with the sub-paths P0 and P1.)
Optimality
• Theorem 1: The proposed greedy-based order can process all the video blocks in a frame in n* iterations
• Let P* = {K1, K2, ..., Kn*}. By Lemma 1, Kn* would be processed in the n*-th iteration. Since all the paths also end in Kn*, all the blocks can be processed within n* iterations
Performance estimation
• One of the longest paths in H.264 4 × 4 intra-prediction
• The length can be found to be
• n* = ((V/4)/2) × 2 + H/4 - 2 = V/4 + H/4 - 2
• n* = (V/4) × 2 + (H/4)/2 - 2 = (V/4) × 2 + H/8 - 2
Bit-Rate Estimation
• Lagrangian cost function
• Entropy coding may involve many branching instructions, which are hard to implement efficiently on a pipelined architecture
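The equation on this slide did not survive extraction. The Python sketch below assumes the commonly used Lagrangian form J = D + λ·R, where D is the distortion, R the rate in bits, and λ the Lagrange multiplier. The function names and the candidate values are made up for illustration.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian cost J = D + lambda * R (assumed standard form)."""
    return distortion + lam * rate_bits


def best_mode(candidates, lam):
    """candidates maps mode -> (distortion, rate_bits); return the mode with minimum J."""
    return min(candidates, key=lambda m: rd_cost(candidates[m][0], candidates[m][1], lam))


# Illustrative numbers only: mode -> (distortion, rate in bits).
candidates = {0: (120, 30), 1: (100, 60), 2: (150, 20)}
choice_balanced = best_mode(candidates, lam=1.0)
choice_low_lambda = best_mode(candidates, lam=0.1)
choice_high_lambda = best_mode(candidates, lam=5.0)
```

A small λ favors the mode with the lowest distortion, while a large λ favors the cheapest mode in bits. The rate term R is the part that normally requires entropy coding, which motivates the rate approximation.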
Fast Bit Rate Estimation for Mode Decision
• Tc: number of nonzero coefficients
• Tz: number of zeros before the last nonzero coefficient
• |Lk|: the absolute value of the kth nonzero coefficient
• Fk: the frequency of the kth nonzero coefficient
[33] M. G. Sarwer and L.-M. Po, “Fast bit rate estimation for mode decision of H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 10, pp. 1402–1407, Oct. 2007.
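The estimation formula itself is not in the transcript, so the Python sketch below only shows how the four listed quantities could be extracted from a zigzag-ordered list of quantized coefficients. Two points are assumptions: Fk is taken to be the zigzag index of the kth nonzero coefficient, and the linear combination in `estimate_rate` uses placeholder weights that are not the values from [33].

```python
def rate_features(zigzag_coeffs):
    """Return (Tc, Tz, sum of |Lk|, sum of Fk) for zigzag-ordered quantized coefficients."""
    nonzero_positions = [i for i, v in enumerate(zigzag_coeffs) if v != 0]
    tc = len(nonzero_positions)
    # Zeros that appear before the last nonzero coefficient.
    tz = (nonzero_positions[-1] + 1 - tc) if nonzero_positions else 0
    sum_levels = sum(abs(zigzag_coeffs[i]) for i in nonzero_positions)
    # ASSUMPTION: the "frequency" Fk is the zigzag index of the kth nonzero coefficient.
    sum_freqs = sum(nonzero_positions)
    return tc, tz, sum_levels, sum_freqs


def estimate_rate(zigzag_coeffs, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical linear rate model; the weights are placeholders, not those of [33]."""
    return sum(w * f for w, f in zip(weights, rate_features(zigzag_coeffs)))


coeffs = [5, 0, -2, 0, 0, 1] + [0] * 10   # illustrative 4x4 block in zigzag order
features = rate_features(coeffs)
approx_bits = estimate_rate(coeffs)
```

All four features are plain counts and sums over the coefficients. This is one reason such an estimate may map onto a GPU more easily than a full entropy coder.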
Experiments
• PC equipped with one GeForce 8800 GTS PCIe graphics card with 96 stream processors
• Intel Pentium 4 3.2 GHz processor with 1 GB DDR2 memory
• H.264 JM 14.0
• AVS RM 6.2 reference software
Encoding Bit-Rate Estimation
Parallel RD Optimized Intra-Mode Decision
• H.264
▫ More than 80 times reduction
▫ QP has no significant effect
Parallel RD Optimized Intra-Mode Decision
• H.264
▫ Parallelism within a MB
Parallel RD Optimized Intra-Mode Decision
• H.264
▫ Similar speedups when RDO is disabled
Parallel RD Optimized Intra-Mode Decision
• H.264
Parallel RD Optimized Intra-Mode Decision
• H.264
Parallel RD Optimized Intra-Mode Decision
• AVS
Parallel RD Optimized Intra-Mode Decision
• AVS
Parallel RD Optimized Intra-Mode Decision
39
96 (processors) × 2 (threads) / 5 (modes) = 38.4
Conclusion
• Based on the dependency analysis of intra-mode decision, the video blocks are encoded following the greedy orders, leading to highly parallel RD cost computations.
• More than 80 times speedup for GPU-based intra-prediction; the GPU can be used to offload intra-prediction from the CPU.
• To facilitate implementation on the GPU, a bit-rate approximation method is used to estimate the rate in RD cost computation.
• The approximation errors have only a small impact on the coding performance: no more than 0.12 dB loss in PSNR and 0.98% bit-rate increase.