Optimization of H.264 High Profile Decoder for Pentium 4 Processor
Tarun Bhatia
University of Texas at Arlington
[email protected]
H.264 Decoder
[Block diagram. Blocks: Bitstream Input, Entropy Decoding, Inverse Transform and Dequantization, Intra/Inter Mode Selection, Intra Prediction, Motion Compensation, adder (+), Deblocking, Picture Buffering, Video Output]
Optimization: Need
H.264/AVC video coding introduces substantially more coding tools and coding options than earlier standards. Achieving the highest possible coding gain therefore requires considerably more computation.
Aggressive optimization is typically required for H.264 implementations to meet cost and power targets and to provide real-time performance.
Sequences Used
Plane.264
Shore.264
Golf.264
Girl.264
Karate.264
H.264 Profiles
[Diagram of overlapping profiles: Baseline Profile, Main Profile, Extended Profile, High Profile. Coding tools shown: I slice, P slice, CAVLC, Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices, B slices, Weighted Prediction, CABAC, Data Partition, SP Slice, SI Slice, Adaptive Block Size Transform, Perceptual Quantization Matrices]
H.264 High Profiles - features
Main Profile + additional features
  8x8 Integer DCT
  HVS matrices
  8x8 Intra Prediction modes
Optimization: Levels
Algorithm Level
  e.g. DCT implementation
Compiler Level
  (Microsoft Visual Studio .NET 2003 / Intel C++ compiler v 8.0)
Implementation Level
  e.g. Elimination of Loops, Conditions
  Using SIMD for implementation
  Multithreading
Target Platform: Pentium 4 Processor
Intel SIMD Architecture
[Diagram. Registers: 8 XMM registers (128-bit), MXCSR (32-bit), 8 MMX registers (64-bit), 8 GPRs (32-bit), EFLAGS (32-bit), x87 FP register file. Execution units: FP, MMX, SSE/SSE2/SSE3, FP MOVE. L1 data cache (8 KB, 4-way)]
Intel HT (Hyper Threading) Technology
Purpose: Simultaneous Execution of Threads
[Diagram: two Architectural States, each with its own Local APIC, sharing one Execution Engine and one Bus Interface to the System Bus]
Optimization: Steps
Optimization during code development
Optimization after code development
  1) Searching for “hotspots” in the code
  2) Analysis of each “hotspot”, e.g. a high number of calls, cache misses, a slow implementation
  3) Optimization of hotspots
Performance Profiling
Intel VTune(TM) Performance Analyzer
Intel VTune Performance Analysis - Results (FastVDO H.264 HD High Profile Decoder)

% Time Consumed     Girl.264    Golf.264    Karate.264   Plane.264   Shore.264
IDCT 8x8            4.957879    0.956737    3.2735859    1.894126    1.63884
CABAC               11.026452   5.293592    10.335945    10.01807    6.407274
Memcpy + Memset     13.33369    16.86905    11.849307    11.01987    14.59611
IDCT 4x4            17.02527    20.39636    15.568315    12.89757    17.79446
MC                  29.53265    38.00137    40.045078    50.16149    41.66314
Others              24.12405    18.48289    18.927766    14.00887    17.90019
Distribution of Decoder Time Consumption
[Bar chart: % of decoding time (0 to 60) for IDCT8x8, CABAC, Memory, IDCT4x4, MC, Others; series: Golf, Plane]
SIMD
Single Instruction Multiple Data Instructions
Intel Pentium 4
  MMX (Multimedia Extension) from Pentium MMX onwards
  SSE (Streaming SIMD Extension) from Pentium III onwards
  SSE2 (Streaming SIMD Extension 2) from Pentium 4 onwards
AMD Athlon 64
  3DNow!
SIMD Data Types
[Diagram: a 128-bit register viewed as 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, 2 x 64-bit, or 1 x 128-bit elements. Labels: "Available in XMM registers in SSE Technology"; "Available in MMX and XMM registers"]
SIMD Instructions: Types
Packed Arithmetic (e.g. padd, pmul)
Packed Logical (e.g. pand, por)
Data Movement and Memory Access (mov)
General Support (pack, unpack)
Packed Shift ( >> , << )
Packed Comparison (<=, ==)
Case Study
// pixel_data is taken to be an 8-bit unsigned sample type; result is supplied by the caller.
void interpolation4x4 (pixel_data* forward_block, pixel_data* backward_block, pixel_data* result)
{
    // Rounded average of the forward and backward predictions for all 16 pixels
    for (int i=0 ; i<=15 ; i++)
    {
        result [i] = (forward_block[i] + backward_block[i] + 1)/2;
    }
}
MMX Code
void interpolation (pixel_data* forward_block, pixel_data* backward_block, pixel_data* result)
{
  __asm {
    pxor      mm7, mm7             // set mm7 to 0
    mov       EDX, 0x01010101      // EDX = 01 01 01 01
    mov       EAX, forward_block   // Store forward block starting address
    movd      mm3, EDX             // mm3: 00 00 00 00 01 01 01 01
    mov       EBX, backward_block  // Store backward block starting address
    punpcklbw mm3, mm7             // mm3: 00 01 00 01 00 01 00 01
    mov       ECX, result          // Store the address of result
    movd      mm0, [EAX]           // mm0: fb[1:4]
    movd      mm1, [EBX]           // mm1: bb[1:4]
    movd      mm4, [EAX+4]         // mm4: fb[5:8]
    movd      mm5, [EBX+4]         // mm5: bb[5:8]
    punpcklbw mm0, mm7             // widen fb[1:4] from bytes to words
    punpcklbw mm1, mm7             // widen bb[1:4] from bytes to words
    punpcklbw mm4, mm7             // widen fb[5:8] from bytes to words
    punpcklbw mm5, mm7             // widen bb[5:8] from bytes to words
    paddw     mm0, mm1             // mm0: fb[1:4]+bb[1:4]
    paddw     mm4, mm5             // mm4: fb[5:8]+bb[5:8]
    paddw     mm0, mm3             // mm0: fb[1:4]+bb[1:4]+1
    paddw     mm4, mm3             // mm4: fb[5:8]+bb[5:8]+1
    psrlw     mm0, 1               // mm0: (fb[1:4]+bb[1:4]+1) >> 1
    psrlw     mm4, 1               // mm4: (fb[5:8]+bb[5:8]+1) >> 1
    packuswb  mm0, mm0             // low dword of mm0: r4 r3 r2 r1
    packuswb  mm4, mm4             // low dword of mm4: r8 r7 r6 r5
    movd      [ECX], mm0           // result[1:4] = low dword of mm0
    movd      [ECX+4], mm4         // result[5:8] = low dword of mm4
    // Repeat the same process for fb[9:16] and bb[9:16]
    emms                           // Empty MMX state
  }
}
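The rounded average in the case study also maps onto a single packed-average instruction on SSE2 hardware, which the Pentium 4 supports. The following is a minimal sketch, not the decoder's actual code: the function name interpolation4x4_avg, the use of uint8_t for pixel_data, and the assumption that each 4x4 block is stored as 16 contiguous bytes are all illustrative choices. A scalar fallback is included so the file still builds where SSE2 is unavailable.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: result[i] = (fwd[i] + bwd[i] + 1) >> 1 for 16 pixels.
   Assumes each block is 16 contiguous 8-bit samples. */
void interpolation4x4_avg(const uint8_t *fwd, const uint8_t *bwd, uint8_t *result)
{
#if HAVE_SSE2
    /* Unaligned loads, so the caller does not need 16-byte alignment. */
    __m128i f = _mm_loadu_si128((const __m128i *)fwd);
    __m128i b = _mm_loadu_si128((const __m128i *)bwd);
    /* pavgb: per-byte rounded average without intermediate overflow. */
    _mm_storeu_si128((__m128i *)result, _mm_avg_epu8(f, b));
#else
    /* Scalar fallback with the same rounding. */
    for (int i = 0; i < 16; i++)
        result[i] = (uint8_t)((fwd[i] + bwd[i] + 1) >> 1);
#endif
}
```

Because the packed average works directly on bytes, the unpack, add, shift, and pack steps of the MMX version are not needed, and all 16 pixels are handled in one pass.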
SIMD Application Results
Amdahl’s Law: the Overall Speedup (O.S.) obtained by optimizing a portion p of the program by a factor s is

$$\text{O.S.} = \left( \frac{1}{1 - p + p/s} - 1 \right) \times 100\%$$

p = fraction of the execution time spent in the code being optimized
s = speedup factor for that fraction of code
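As a worked check of the formula, the sketch below evaluates it in C. The function name overall_speedup_percent is a hypothetical helper introduced here; p is passed as a fraction between 0 and 1, not as a percentage.

```c
/* Overall speedup in percent from Amdahl's Law:
   O.S. = (1 / (1 - p + p/s) - 1) * 100
   p: fraction of run time spent in the optimized code (0..1)
   s: speedup factor of that code (> 0) */
double overall_speedup_percent(double p, double s)
{
    return (1.0 / (1.0 - p + p / s) - 1.0) * 100.0;
}
```

For example, with p = 0.1702527 and s = 1.932186 (the IDCT 4x4 figures reported for Girl.264), the function returns about 8.95, in line with the 8.9489 % overall speedup listed for that sequence.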
Application to IDCT 4x4

%                     Girl.264    Golf.264    Karate.264   Plane.264   Shore.264
NO SIMD (%)           17.02527    20.39636    15.56831     12.89757    17.79446
SIMD (%)              8.811405    11.61393    9.16206      7.050434    9.89151
Speedup Factor        1.932186    1.756198    1.69921      1.82993     1.79896
Overall Speedup (%)   8.9489      9.628       6.8447       6.21        8.581
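For reference, the operation being accelerated here can be written as a scalar routine. The sketch below follows the 4x4 inverse core transform as defined in the H.264 standard (row pass, column pass, then rounding by (x + 32) >> 6); it is not FastVDO's implementation. The names inverse_1d and idct4x4 are hypothetical, the input is assumed to be already dequantized, and right shifts of negative values are assumed to be arithmetic, as they are on common compilers.

```c
#include <stdint.h>

/* One 1-D pass of the 4x4 inverse core transform over four values
   spaced `stride` elements apart (stride 1 = row, stride 4 = column). */
static void inverse_1d(int32_t *d, int stride)
{
    int32_t e0 = d[0] + d[2 * stride];
    int32_t e1 = d[0] - d[2 * stride];
    int32_t e2 = (d[stride] >> 1) - d[3 * stride];
    int32_t e3 = d[stride] + (d[3 * stride] >> 1);
    d[0]          = e0 + e3;
    d[stride]     = e1 + e2;
    d[2 * stride] = e1 - e2;
    d[3 * stride] = e0 - e3;
}

/* In-place inverse transform of a dequantized 4x4 block in raster order.
   On return, block[] holds the residual samples. */
void idct4x4(int32_t block[16])
{
    for (int r = 0; r < 4; r++)
        inverse_1d(block + 4 * r, 1);      /* horizontal pass */
    for (int c = 0; c < 4; c++)
        inverse_1d(block + c, 4);          /* vertical pass */
    for (int i = 0; i < 16; i++)
        block[i] = (block[i] + 32) >> 6;   /* final rounding */
}
```

Each pass uses only additions, subtractions, and shifts on four independent rows or columns, which is why the transform lends itself to packed 16-bit SIMD arithmetic.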
IDCT 4x4: Comparison of % Time Consumed of the Total Decoding Time
[Bar chart: % time (0 to 25) for Girl, Golf, Karate, Plane, Shore; series: No SIMD, SIMD]
% Overall Speed up in Decoding Time with SIMD IDCT4x4
[Bar chart: % overall speedup (0 to 10) for Girl, Golf, Karate, Plane, Shore]
Application to Motion Compensation
The implementation of Motion Compensation can be divided into:
Data Manipulation (SIMD not used)
Interpolation (SIMD used)
  Half Pel Interpolation
  Quarter Pel Interpolation
  Linear Interpolation for B frames
Motion Compensation - % Time consumption (without MMX)
[Bar chart: % time (0 to 35) for Girl, Golf, Karate, Plane, Shore; series: Manipulation, Interpolation]
SIMD Application to Motion Compensation - Results

%                     Girl        Golf        Karate      Plane       Shore
NO SIMD (%)           15.96824    13.02634    23.25319    32.54503    19.7399
SIMD (%)              9.51832     7.53874     14.32317    19.7414     11.608
Speedup Factor        1.68        1.73        1.62        1.65        1.7
Overall Speedup (%)   6.89        5.8         9.8         14.68       8.85
Motion Compensation - Results: Comparison of % Time Consumed
[Bar chart: % time (0 to 35) for Girl, Golf, Karate, Plane, Shore; series: No SIMD, SIMD]
% Overall Speed up in Decoding Time with SIMD MC
[Bar chart: % overall speedup (0 to 16) for Girl, Golf, Karate, Plane, Shore]
Multithreading
Definition: Multithreading is the ability of a program to multitask within itself. The program can split itself into separate “threads” of execution that seem to run concurrently.
Waits are used to block a thread until a particular event hands over control.
Release is used to unblock the thread.
Semaphores: locking mechanisms / counters that control access to shared resources used by multiple threads or processes.
Producer-Consumer Problem (Diagram)
[Diagram: Producer Thread and Consumer Thread coordinated through Semaphores with Wait and Release; each thread executes serially between hand-offs]
Producer-Consumer Problem (Algorithm)
  1) The Producer thread starts and initializes data.
  2) It waits for the Consumer thread.
  3) If the Consumer thread is ready, it releases control to the Consumer thread.
  4) The Producer thread completes one execution cycle in the meantime and waits for the Consumer thread again.
  5) When control is passed back to the Producer thread, the process is repeated until the end condition is met.
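The hand-off described in these steps can be sketched with two counting semaphores. This is an illustration only: the slides target Windows tools and do not show their threading code, so the use of POSIX threads and POSIX semaphores, the single-slot buffer, and every name below (channel_t, run_producer_consumer, and so on) are assumptions made for the example.

```c
#include <stddef.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical single-slot channel shared by the two threads. */
typedef struct {
    int   slot;          /* the shared resource */
    int   n_items;       /* end condition: number of items to pass */
    long  consumed_sum;  /* written only by the consumer */
    sem_t slot_empty;    /* released by the consumer: slot may be written */
    sem_t slot_full;     /* released by the producer: slot holds new data */
} channel_t;

static void *producer(void *arg)
{
    channel_t *ch = arg;
    for (int i = 0; i < ch->n_items; i++) {
        sem_wait(&ch->slot_empty);   /* Wait for the consumer */
        ch->slot = i;                /* one production cycle */
        sem_post(&ch->slot_full);    /* Release control to the consumer */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    channel_t *ch = arg;
    for (int i = 0; i < ch->n_items; i++) {
        sem_wait(&ch->slot_full);    /* Wait for the producer */
        ch->consumed_sum += ch->slot;
        sem_post(&ch->slot_empty);   /* Release control to the producer */
    }
    return NULL;
}

/* Passes the values 0..n_items-1 through the slot and returns their sum. */
long run_producer_consumer(int n_items)
{
    channel_t ch = { .n_items = n_items };
    pthread_t p, c;
    sem_init(&ch.slot_empty, 0, 1);  /* slot starts empty */
    sem_init(&ch.slot_full, 0, 0);
    pthread_create(&p, NULL, producer, &ch);
    pthread_create(&c, NULL, consumer, &ch);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&ch.slot_empty);
    sem_destroy(&ch.slot_full);
    return ch.consumed_sum;
}
```

Calling run_producer_consumer(5) passes the values 0 to 4 through the slot and returns their sum, 10. On many systems the program must be linked with the threads library, for example with -pthread.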
Multithreading in Video Coding
The codec can be multithreaded in two ways:
Block Level
  Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames
GOP Level
  Closed GOP: group of frames which will not use any reference frames except from their own GOP
  Open GOP: group of frames which can use reference frames from outside their GOP
Proposed Multithreading Architecture - features
GOP Level (Closed GOP)
30 frames per GOP
IPPPPPPP…P
Each GOP begins with an I frame and contains P frames only (i.e. 1 I frame and 29 P frames in each)
B frames are not used in the design, to maintain the closed GOP structure
Proposed Multithreading Architecture
[Diagram: Main Thread, Get IDR Position thread, and decoder threads Decoder 0, Decoder 1, ..., Decoder N]
Multithreaded Decoder - Threads
Main Thread
  Creates all threads and semaphores
  Gets SPS and PPS NALUs from the bitstream
  Initializes multiple decoders with SPS and PPS NALUs
Get IDR Frame Position Thread
  Searches for IDR NALU position in the bitstream
  Manages Waits and Releases of Semaphores
Decoder Threads
  Decode H.264 GOPs

SPS = Sequence Parameter Set
PPS = Picture Parameter Set
NALU = Network Abstraction Layer Unit
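The slides do not show how the IDR search is implemented. A minimal sketch is given below, assuming the input is an Annex B byte stream in which each NAL unit is preceded by a 00 00 01 start code. It relies on two facts from the H.264 standard: nal_unit_type is the low 5 bits of the first NAL byte, and the value 5 marks a coded slice of an IDR picture. The function name find_next_idr and its interface are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NAL_TYPE_IDR 5  /* nal_unit_type: coded slice of an IDR picture */

/* Scans buf[start..len) for the next start code (00 00 01) followed by an
   IDR NAL header. Returns the offset of the NAL header byte, or -1 if no
   IDR NAL unit is found. A four-byte start code (00 00 00 01) is matched
   through its last three bytes. */
long find_next_idr(const uint8_t *buf, size_t len, size_t start)
{
    for (size_t i = start; i + 3 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1 &&
            (buf[i + 3] & 0x1F) == NAL_TYPE_IDR) {
            return (long)(i + 3);
        }
    }
    return -1;
}
```

With closed GOPs that each begin with an IDR frame, successive offsets returned by a search like this mark where one GOP ends and the next begins, which is the information a decoder thread needs to decode its GOP independently.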
Multithreading - Results: % Speed up in Decoding Time
[Bar chart: % speedup (0 to 16) for Girl, Golf, Karate, Plane, Shore; series: number of threads (2, 3, 4)]
Multithreading - Results: Threading Overhead (Time in seconds)
[Bar chart: overhead in seconds (0 to 0.25) for Girl, Golf, Karate, Plane, Shore; series: number of threads (2, 3, 4)]
Further Research
Optimization of the High Profile HD (720p) Encoder to minimize hardware requirements
Testing of the H.264 encoder and decoder on multicore CPUs
Implementation of time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)
References
H.264: International Telecommunication Union, “Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services,” ITU-T, 2005.
MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, “ISO/IEC 13818-2: Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video,” ISO/IEC and ITU-T, 1994.
Soon-kak Kwon, A. Tamhankar and K.R. Rao, “Overview of MPEG-4 Part 10”.
G. Sullivan, P. Topiwala and A. Luthra, “The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions,” SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug. 2004.
The Software Optimization Cookbook, Intel Press, 2002.
IA-32 Intel Architecture Optimization, Reference Manual, www.intel.com
Optimization Applications with the Intel C++ and FORTRAN compilers, White paper, http://developer.intel.com/design/pentium4/manuals/
J. Lee, S. Moon and W. Sun, “H.264 Decoder Optimization Exploiting SIMD Instructions”, Seoul National University. http://sips03.snu.ac.kr/pub/conf/c67.pdf Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.
G.M. Amdahl, “Validity of the single-processor approach to achieving large scale computing capabilities,” in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., 1967, pp. 483-485.
Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, “H.264/AVC Baseline Profile Decoder Complexity Analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.
http://www.blu-ray.com/
http://www.hddvd.org/hddvd/
http://www.fastvdo.com
http://www.intel.com
http://www.intel.com/software/products/vtune/
http://msdn.microsoft.com
Thanks!!