optimization of h.264 high profile decoder for pentium 4 processor tarun bhatia university of texas...

Optimization of H.264 High Profile Decoder for Pentium 4 Processor

Tarun Bhatia

University of Texas at Arlington

[email protected]@fastvdo.com

H.264 Decoder

[Figure: decoder block diagram. Blocks: Bitstream Input, Entropy Decoding, Inverse Transform and Dequantization, Intra/Inter Mode Selection, Intra Prediction, Motion Compensation, adder (+), Deblocking, Picture Buffering, Video Output.]

Optimization: Need

H.264/AVC video coding introduces substantially more coding tools and coding options than earlier standards. Achieving the highest possible coding gain therefore takes much more computation.

Aggressive optimization is typically required to get H.264 implementations to meet cost and power targets and to provide real-time performance for applications.

Sequences Used

Plane.264
Shore.264
Golf.264
Girl.264
Karate.264

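H.264 Profiles

[Figure: diagram of the coding tools contained in each profile]

All profiles: I slices, P slices, CAVLC
Baseline and Extended Profiles: Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), redundant slices
Main, Extended and High Profiles: B slices, weighted prediction
Main and High Profiles: CABAC
Extended Profile only: data partitioning, SP slices, SI slices
High Profile only: adaptive block size transform, perceptual quantization matrices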

H.264 High Profile: Features

Main Profile + additional features:

8x8 integer DCT
HVS (human visual system) quantization matrices
8x8 intra prediction modes

Optimization: Levels

Algorithm level
    e.g. DCT implementation

Compiler level
    Microsoft Visual Studio .NET 2003 / Intel C++ Compiler v8.0

Implementation level
    e.g. elimination of loops and conditions
    Using SIMD for implementation
    Multithreading

Target Platform: Pentium 4 Processor

Intel SIMD Architecture

[Figure: register files and execution resources]

8 XMM registers (128 bits each)
MXCSR (32 bits)
8 MMX registers (64 bits each)
8 general-purpose registers (32 bits each)
EFLAGS (32 bits)
x87 FP register file
Execution units: FP, MMX, SSE/SSE2/SSE3, FP move
L1 data cache (8 KB, 4-way)

Intel HT (Hyper-Threading) Technology

Purpose: simultaneous execution of threads

[Figure: one physical processor containing two architectural states and two local APICs, which share a single execution engine and a single bus interface to the system bus]

Optimization: Steps

Optimization during code development

Optimization after code development
    1) Searching for “hotspots” in the code
    2) Analysis of each hotspot, e.g. a high number of calls, cache misses, a slow implementation
    3) Optimization of the hotspots

Performance Profiling

Intel VTune(TM) Performance Analyzer

Intel VTune Performance Analysis: Results (FastVDO H.264 HD High Profile Decoder)

% Time Consumed    Girl.264     Golf.264     Karate.264   Plane.264    Shore.264
IDCT 8x8           4.957879     0.956737     3.2735859    1.894126     1.63884
CABAC              11.026452    5.293592     10.335945    10.01807     6.407274
Memcpy + Memset    13.33369     16.86905     11.849307    11.01987     14.59611
IDCT 4x4           17.02527     20.39636     15.568315    12.89757     17.79446
MC                 29.53265     38.00137     40.045078    50.16149     41.66314
Others             24.12405     18.48289     18.927766    14.00887     17.90019

[Figure: Distribution of Decoder Time Consumption. Bar chart of % time (0 to 60) for IDCT 8x8, CABAC, Memory, IDCT 4x4, MC and Others, shown for the Golf and Plane sequences.]

SIMD

Single Instruction Multiple Data instructions

Intel Pentium 4
    MMX (MultiMedia eXtensions), from Pentium MMX onwards
    SSE (Streaming SIMD Extensions), from Pentium III onwards
    SSE2 (Streaming SIMD Extensions 2), from Pentium 4 onwards

AMD Athlon 64
    3DNow!

SIMD Data Types

[Figure: packed data layouts within 128 bits]

16 packed elements of 8 bits
8 packed elements of 16 bits
4 packed elements of 32 bits
2 packed elements of 64 bits
1 element of 128 bits

Figure legend: some layouts are available in XMM registers only (SSE technology); the others are available in both MMX and XMM registers.

SIMD Instructions: Types

Packed arithmetic (e.g. padd, pmul)
Packed logical (e.g. pand, por)
Data movement and memory access (mov)
General support (pack, unpack)
Packed shift (>>, <<)
Packed comparison (<=, ==)

Case Study

/* pixel_data is the decoder's 8-bit sample type. The output pointer is
   passed in by the caller, so result never points to unallocated memory. */
void interpolation4x4(const pixel_data *forward_block,
                      const pixel_data *backward_block,
                      pixel_data *result)
{
    /* Rounded average of the forward and backward predictions (16 pixels) */
    for (int i = 0; i <= 15; i++)
    {
        result[i] = (forward_block[i] + backward_block[i] + 1) / 2;
    }
}

MMX Code

void interpolation(pixel_data *forward_block, pixel_data *backward_block, pixel_data *result)
{
    __asm {
        pxor      mm7, mm7              // set mm7 to 0
        mov       EDX, 0x01010101       // EDX = 01 01 01 01
        mov       EAX, forward_block    // Store forward block starting address
        movd      mm3, EDX              // mm3: 00 00 00 00 01 01 01 01
        mov       EBX, backward_block   // Store backward block starting address
        punpcklbw mm3, mm7              // mm3: 00 01 00 01 00 01 00 01
        mov       ECX, result           // Store the address of result
        movd      mm0, [EAX]            // mm0: fb[1:4]
        movd      mm1, [EBX]            // mm1: bb[1:4]
        movd      mm4, [EAX+4]          // mm4: fb[5:8]
        movd      mm5, [EBX+4]          // mm5: bb[5:8]
        punpcklbw mm0, mm7              // widen fb[1:4] from bytes to words
        punpcklbw mm1, mm7              // widen bb[1:4] from bytes to words
        punpcklbw mm4, mm7              // widen fb[5:8] from bytes to words
        punpcklbw mm5, mm7              // widen bb[5:8] from bytes to words
        paddw     mm0, mm1              // mm0: fb[1:4]+bb[1:4]
        paddw     mm4, mm5              // mm4: fb[5:8]+bb[5:8]
        paddw     mm0, mm3              // mm0: fb[1:4]+bb[1:4]+1
        paddw     mm4, mm3              // mm4: fb[5:8]+bb[5:8]+1
        psrlw     mm0, 1                // mm0: (fb[1:4]+bb[1:4]+1) >> 1
        psrlw     mm4, 1                // mm4: (fb[5:8]+bb[5:8]+1) >> 1
        packuswb  mm0, mm0              // low dword of mm0: r4 r3 r2 r1
        packuswb  mm4, mm4              // low dword of mm4: r8 r7 r6 r5
        movd      [ECX], mm0            // result[1:4] = low dword of mm0
        movd      [ECX+4], mm4          // result[5:8] = low dword of mm4
        // Repeat the same process for fb[9:16] and bb[9:16]
        emms                            // Empty MMX state
    }
}
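For comparison, the same rounded average can be sketched with SSE2 intrinsics, which handle all 16 pixels in one XMM register. This is an illustrative sketch, not the decoder's actual code: the function name interpolation4x4_sse2 and the typedef of pixel_data as an 8-bit unsigned type are assumptions, as is the 4x4 block being stored as 16 contiguous bytes. It relies on the PAVGB instruction (_mm_avg_epu8), which computes (a + b + 1) >> 1 per byte, and falls back to scalar code when SSE2 is not enabled at compile time.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

typedef uint8_t pixel_data; /* assumption: 8-bit samples */

/* Hypothetical SSE2 counterpart of interpolation4x4: rounded average of
   16 contiguous forward and backward prediction samples. */
void interpolation4x4_sse2(const pixel_data *forward_block,
                           const pixel_data *backward_block,
                           pixel_data *result)
{
#ifdef HAVE_SSE2
    /* Unaligned loads, so the blocks need no special alignment */
    __m128i fb = _mm_loadu_si128((const __m128i *)forward_block);
    __m128i bb = _mm_loadu_si128((const __m128i *)backward_block);
    /* PAVGB: per-byte (a + b + 1) >> 1, with no intermediate overflow */
    _mm_storeu_si128((__m128i *)result, _mm_avg_epu8(fb, bb));
#else
    /* Scalar fallback when SSE2 is not available at compile time */
    for (int i = 0; i < 16; i++)
        result[i] = (pixel_data)((forward_block[i] + backward_block[i] + 1) >> 1);
#endif
}
```

Because PAVGB performs the add, round and shift in a single instruction, this variant needs no unpacking to 16-bit words and no repacking, unlike the MMX sequence.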

SIMD Application Results

Amdahl’s Law: the overall speedup (O.S.) obtained by optimizing a portion p of the program by a factor s is

\[ \text{O.S.} = \left( \frac{1}{1 - p + p/s} - 1 \right) \times 100\,\% \]

p: fraction of the total execution time spent in the code being optimized
s: speedup factor for that fraction of the code
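The formula can be checked numerically with a small helper. This is a minimal sketch; the function name overall_speedup_percent is hypothetical, and p is taken as a fraction between 0 and 1 rather than a percentage.

```c
/* Overall speedup in percent when a fraction p (0..1) of the run time
   is accelerated by a factor s, following Amdahl's Law as stated above. */
double overall_speedup_percent(double p, double s)
{
    return (1.0 / (1.0 - p + p / s) - 1.0) * 100.0;
}
```

For the Girl.264 IDCT 4x4 measurements (p = 0.1702527, s = 1.932186), this evaluates to roughly 8.95 %, in line with the 8.9489 % reported in the deck.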

Application to IDCT 4x4

%                      Girl.264    Golf.264    Karate.264   Plane.264   Shore.264
No SIMD (%)            17.02527    20.39636    15.56831     12.89757    17.79446
SIMD (%)               8.811405    11.61393    9.16206      7.050434    9.89151
Speedup factor         1.932186    1.756198    1.69921      1.82993     1.79896
Overall speedup (%)    8.9489      9.628       6.8447       6.21        8.581

[Figure: IDCT 4x4, comparison of % time consumed of the total decoding time, No SIMD vs. SIMD, for Girl, Golf, Karate, Plane and Shore]

[Figure: % overall speedup in decoding time with SIMD IDCT 4x4]

Application to Motion Compensation

The implementation of motion compensation can be divided into:

Data manipulation (SIMD not used)

Interpolation (SIMD used)
    Half-pel interpolation
    Quarter-pel interpolation
    Linear interpolation for B frames

[Figure: Motion Compensation, % time consumption (without MMX), split into data manipulation and interpolation]

SIMD Application to Motion Compensation: Results

%                      Girl        Golf        Karate      Plane       Shore
No SIMD (%)            15.96824    13.02634    23.25319    32.54503    19.7399
SIMD (%)               9.51832     7.53874     14.32317    19.7414     11.608
Speedup factor         1.68        1.73        1.62        1.65        1.7
Overall speedup (%)    6.89        5.8         9.8         14.68       8.85

[Figure: Motion Compensation results, comparison of % time consumed, No SIMD vs. SIMD]

[Figure: % overall speedup in decoding time with SIMD MC]

Multithreading

Definition: Multithreading is the ability of a program to multitask within itself. The program can split itself into separate “threads” of execution that appear to run concurrently.

Waits are used to block a thread until a particular event hands over control.

Release is used to unblock the thread.

Semaphores: locking mechanisms / counters that control access to shared resources used by multiple threads or processes.

Producer-Consumer Problem (Diagram)

[Figure: a producer thread and a consumer thread synchronized through semaphore Wait and Release operations; the legend marks serial execution of a thread]

Producer-Consumer Problem (Algorithm)

1) The producer thread starts and initializes data.
2) It waits for the consumer thread.
3) If the consumer thread is ready, it releases control to the consumer thread.
4) The producer thread completes one execution cycle in the meantime and waits for the consumer thread.
5) When control is passed back to the producer thread, the process is repeated until the end condition is met.
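The hand-over described above can be sketched with two counting semaphores. This is an illustrative sketch only: it uses POSIX threads and semaphores rather than whatever threading API the decoder itself uses, and the names run_producer_consumer, slot_free, item_ready and the CYCLES end condition are all hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>

#define CYCLES 5            /* hypothetical end condition */

static sem_t slot_free;     /* producer waits on it, consumer releases it */
static sem_t item_ready;    /* consumer waits on it, producer releases it */
static int shared_slot;     /* single-item buffer shared by both threads */
static int consumed_sum;    /* written by the consumer thread only */

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < CYCLES; i++) {
        sem_wait(&item_ready);      /* block until the producer hands over an item */
        consumed_sum += shared_slot;
        sem_post(&slot_free);       /* release: the slot may be refilled */
    }
    return NULL;
}

/* Runs the producer on the calling thread and returns the sum of all
   items seen by the consumer. */
int run_producer_consumer(void)
{
    pthread_t tid;

    consumed_sum = 0;
    sem_init(&slot_free, 0, 1);     /* the slot starts out empty */
    sem_init(&item_ready, 0, 0);    /* nothing to consume yet */
    pthread_create(&tid, NULL, consumer, NULL);

    for (int i = 1; i <= CYCLES; i++) {
        int item = i * 10;          /* one "execution cycle" of work, done
                                       while the consumer may still be busy */
        sem_wait(&slot_free);       /* wait for the consumer to be ready */
        shared_slot = item;
        sem_post(&item_ready);      /* release control to the consumer */
    }

    pthread_join(tid, NULL);
    sem_destroy(&slot_free);
    sem_destroy(&item_ready);
    return consumed_sum;
}
```

The producer computes its next item before waiting, so its work overlaps with the consumer's; only the exchange through shared_slot is serialized.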

Multithreading in Video Coding

The codec can be multithreaded in two ways:

Block level
    Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames.

GOP level
    Closed GOP: a group of frames that uses no reference frames from outside its own GOP.
    Open GOP: a group of frames that can use reference frames from outside its own GOP.

Proposed Multithreading Architecture: Features

GOP level (closed GOP)
30 frames per GOP
IPPPPPPP…P
Each GOP begins with an I frame and otherwise contains P frames only (i.e. 1 I frame and 29 P frames in each GOP).
B frames are not used in the design, to maintain the closed GOP structure.

Proposed Multithreading Architecture

[Figure: block diagram with the Main Thread, the Get IDR Position thread, and the decoder threads Decoder 0, Decoder 1, ..., Decoder N]

Multithreaded Decoder: Threads

Main Thread
    Creates all threads and semaphores
    Gets the SPS and PPS NALUs from the bitstream
    Initializes multiple decoders with the SPS and PPS NALUs

Get IDR Frame Position Thread
    Searches for the IDR NALU position in the bitstream
    Manages Waits and Releases of semaphores

Decoder Threads
    Decode H.264 GOPs

SPS: Sequence Parameter Set
PPS: Picture Parameter Set
NALU: Network Abstraction Layer Unit
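The search performed by the Get IDR Frame Position thread can be sketched as a scan for start codes. This is a minimal sketch, assuming an H.264 Annex B byte stream, in which each NAL unit is preceded by the start code prefix 0x000001 and the low five bits of the first NALU byte hold nal_unit_type (5 for an IDR slice). The function name find_next_idr is hypothetical and is not taken from the decoder.

```c
#include <stddef.h>
#include <stdint.h>

#define NAL_TYPE_IDR 5  /* nal_unit_type of a coded slice of an IDR picture */

/* Returns the byte offset of the 00 00 01 start code that precedes the
   first IDR NAL unit at or after position `from`, or -1 if none is found. */
long find_next_idr(const uint8_t *buf, size_t len, size_t from)
{
    for (size_t i = from; i + 3 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            /* NAL header byte: forbidden_zero_bit (1 bit),
               nal_ref_idc (2 bits), nal_unit_type (5 bits) */
            if ((buf[i + 3] & 0x1F) == NAL_TYPE_IDR)
                return (long)i;
            i += 2; /* skip the rest of this start code */
        }
    }
    return -1;
}
```

With closed GOPs that each begin with an IDR frame, the offsets returned by successive calls mark the GOP boundaries that could be handed to separate decoder threads.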

[Figure: Multithreading results, % speedup in decoding time for 2, 3 and 4 threads]

[Figure: Multithreading results, threading overhead (time in seconds) for 2, 3 and 4 threads]

Further Research

Optimization of the High Profile HD (720p) encoder to minimize hardware requirements

Testing of the H.264 encoder and decoder on multicore CPUs

Implementation of time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)

References

H.264: International Telecommunication Union, “Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services,” ITU-T, 2005.

MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, “ISO/IEC 13818-2: Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video,” ISO/IEC and ITU-T, 1994.

Soon-kak Kwon, A. Tamhankar and K. R. Rao, “Overview of MPEG-4 Part 10”.

G. Sullivan, P. Topiwala and A. Luthra, “The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions,” SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug. 2004.

The Software Optimization Cookbook, Intel Press, 2002.

IA-32 Intel Architecture Optimization, Reference Manual, www.intel.com

Optimization Applications with the Intel C++ and FORTRAN compilers, White paper, http://developer.intel.com/design/pentium4/manuals/

J. Lee, S. Moon and W. Sun, “H.264 Decoder Optimization Exploiting SIMD Instructions”, Seoul National University. http://sips03.snu.ac.kr/pub/conf/c67.pdf Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.

Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., 1967, pp. 483-485.

Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, “H.264/AVC Baseline Profile Decoder Complexity Analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.

References: Continued

http://www.blu-ray.com/
http://www.hddvd.org/hddvd/
http://www.fastvdo.com
http://www.intel.com
http://www.intel.com/software/products/vtune/
http://msdn.microsoft.com

Thanks!!
