08 dkawamo2 jpeg presentation
TRANSCRIPT
-
7/31/2019 08 Dkawamo2 JPEG Presentation
CUDA JPEG Essentials
Darek Kawamoto
-
Introduction
Project Origin
Inverse Discrete Cosine Transform
Kernel Summary
Performance
Parallel Huffman Decode for JPEG
Design Approach
Design Problems, Solutions
Implementation Remarks
Conclusion
-
Project Origin
Computer Animation for Scientific Visualization
Stuart Levy, UIUC / NCSA
Goal: Decode big (1920x1080) JPEG images fast (~30 fps)
GPU cheaper than specialized hardware
CUDA and Two JPEG Bottlenecks:
Inverse Discrete Cosine Transform (IDCT)
Straightforward, similar to class machine problems
Huffman Decode Stage
Tricky parallelization of a serial process
-
Inverse Discrete Cosine Transform
2-D IDCT:
p_{xy} = \frac{1}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} C_i C_j G_{ij} \cos\frac{(2x+1)i\pi}{16} \cos\frac{(2y+1)j\pi}{16}
1-D IDCT:
p_x = \frac{1}{2} \sum_{i=0}^{7} C_i G_i \cos\frac{(2x+1)i\pi}{16}
where C_f = \frac{1}{\sqrt{2}} when f = 0, and 1 otherwise
2-D is equivalent to 1-D applied in each direction
Kernel uses 1-D transforms
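The separability claim above is easy to check numerically. Here is a minimal plain-Python sketch of the 1-D formula applied down the columns and then across the rows (an illustration of the math, not the CUDA kernel itself):

```python
import math

def idct_1d(G):
    # 1-D 8-point IDCT: p_x = (1/2) * sum_i C_i * G_i * cos((2x+1)*i*pi/16)
    C = [1 / math.sqrt(2)] + [1.0] * 7
    return [0.5 * sum(C[i] * G[i] * math.cos((2 * x + 1) * i * math.pi / 16)
                      for i in range(8))
            for x in range(8)]

def idct_2d(G):
    # Column pass first, then row pass -- separability in action.
    col_pass = list(zip(*[idct_1d(list(col)) for col in zip(*G)]))
    return [idct_1d(list(row)) for row in col_pass]

# A DC-only block with G[0][0] = 8 should reconstruct to a flat block of 1.0
G = [[0.0] * 8 for _ in range(8)]
G[0][0] = 8.0
p = idct_2d(G)
```

The two 1-D passes give the same result as the double sum in the 2-D formula, which is why the kernel only needs a 1-D transform.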
-
IDCT Kernel
Thread Parallelism
Each thread corresponds to an element of the matrix
Threads compute IDCT across columns, then rows
Memory Access Patterns
Shared memory: broadcast, or no bank conflicts
Global memory: buffered, coalesced
Other Optimizations
Careful use of 16KB Shared Memory: 6 blocks per SMP
Unrolled 5x: Each iteration computes five 2-D IDCTs
-
IDCT Performance -- How...?
How to benchmark?
libJPEG: executes processes serially
GPU: executes IDCT process wholesale
How precise?
short implementations do almost as well as float
double precision has no advantages
How much work?
GPU shines with > 64,000 blocks
JPEG specific: CPU can short-circuit vectors of zeros
Let CPU short-circuit ~50% of columns in the first IDCT
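The zero-vector short circuit above exploits the fact that the IDCT of an all-zero coefficient column is itself all zeros. A small sketch of that CPU optimization (plain Python, reusing the slide's 1-D formula; this is an illustration, not the benchmarked implementation):

```python
import math

def idct_1d(G):
    # 1-D 8-point IDCT from the slide's formula
    C = [1 / math.sqrt(2)] + [1.0] * 7
    return [0.5 * sum(C[i] * G[i] * math.cos((2 * x + 1) * i * math.pi / 16)
                      for i in range(8))
            for x in range(8)]

def idct_columns_with_shortcut(G):
    # First (column) pass of the 2-D IDCT: skip any column whose
    # coefficients are all zero -- its transform output is also all zeros.
    out, skipped = [], 0
    for col in zip(*G):
        if not any(col):
            out.append([0.0] * 8)  # short circuit: no arithmetic at all
            skipped += 1
        else:
            out.append(idct_1d(list(col)))
    # transpose back so the row pass sees the right orientation
    return [list(r) for r in zip(*out)], skipped

G = [[0.0] * 8 for _ in range(8)]
G[0][0] = 8.0  # DC-only block: 7 of its 8 columns are entirely zero
partial, skipped = idct_columns_with_shortcut(G)
```

Because quantization zeroes out most high-frequency coefficients, many columns qualify for the shortcut in real JPEG data, which is why the CPU numbers get this ~50% concession.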
-
IDCT Performance -- Cost
IDCT Implementations:
(float) Naïve 1-D: 64 multiplies, 64 adds per 1-D transform
(short) Chen-Wang: 11 multiplies, 29 adds per 1-D transform
(float) Arai, Agui, and Nakajima (AA&N): 5 multiplies, 29 adds per 1-D transform
Other multiplies folded into de-quantization tables
-
IDCT Performance -- Small
Approx. Execution times for 67,200 blocks:
(float) Naïve 1-D GPU: 4.69 ms
(float) Naïve 1-D CPU Serial: 333 ms (71x)
(float) Naïve 1-D CPU Wholesale: 100 ms (21x)
(short) Chen-Wang Serial: 30 ms (6.4x)
(float) AA&N Wholesale: 25 ms (5.3x)
(float) AA&N Serial: 268 ms (57x)
GPU: ~29 GFLOPS
-
IDCT Performance -- Big
Approx. Execution times for 245,760 blocks:
(float) Naïve 1-D GPU: 16.94 ms
(float) Naïve 1-D CPU Serial: 1250 ms (73x)
(float) Naïve 1-D CPU Wholesale: 375 ms (22x)
(short) Chen-Wang Serial: 113 ms (6.7x)
(float) AA&N Wholesale: 91 ms (5.4x)
(float) AA&N Serial: 1000 ms (59x)
GPU: ~30 GFLOPS
-
IDCT Performance Conclusion
Amount
Wholesale transforms work much better than retail
67,200 IDCT blocks perform almost as well as 245,760
Speed
30 fps means each frame needs to be ready in 33 ms
How much time to perform the other JPEG functions?
With 67,200 blocks, we have 28.6 ms left
With 245,760 blocks, we have 16.4 ms left
Conclusion
Could not previously hope to process in < 33 ms
Application now depends on the speedup of other kernels
-
Parallel Huffman Decode for JPEG
Huffman Compression
Prefix-free, variable-length code
Serial in nature: decode each symbol in sequential order
Parallel Decoding Challenge
Impossible to determine where symbols start and end without decoding all previous symbols
Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work
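The design approach above can be sketched with a toy prefix-free code: start one decoder at the true beginning and another at a guessed mid-stream offset, then observe that when the guess lands on a real symbol boundary, the speculative output is exactly the tail of the true decode. This is a minimal illustration with an invented 4-symbol code, not JPEG's actual tables or bitstream layout:

```python
# Toy 4-symbol prefix-free code; real JPEG Huffman tables and the
# bitstream layout are more involved.
CODE = {'0': 'a', '10': 'b', '110': 'c', '111': 'd'}

def decode_from(bits, start):
    # Decode greedily from a bit offset, recording every symbol
    # boundary the decoder passes through.
    out, buf, boundaries = [], '', [start]
    for pos in range(start, len(bits)):
        buf += bits[pos]
        if buf in CODE:
            out.append(CODE[buf])
            buf = ''
            boundaries.append(pos + 1)
    return out, boundaries

msg = 'abacdbad'
enc = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
bits = ''.join(enc[s] for s in msg)

# The "true" decode from bit 0, plus a speculative decoder started
# mid-stream at a guessed offset.
true_out, true_bounds = decode_from(bits, 0)
guess = 7                       # this guess happens to be a real boundary
spec_out, _ = decode_from(bits, guess)
# Synchronization: the speculative start lies on a true symbol boundary,
# so its output is exactly the tail of the true decode.
synced = guess in true_bounds
```

A guess that misses a boundary produces garbage at first, which is the extra work the scheme throws away; the following slides cover how to pick guesses that are likely to be boundaries.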
-
Design Approach
Spawn parallel work threads
-
Parallel Huffman Decode for JPEG
Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work
Problems
Does the parallel speedup of successful synchronization offset the penalty of extra work?
Yes, if we choose our work wisely! Do so by exploiting JPEG structure and probability
Each decoder thread doesn't know how much data it will decode
Allocate memory on device using atomic functions
-
Choosing Work Wisely
Exploit Block Coding
Each block of coefficients encodes a DC coefficient and assorted AC coefficients
Due to quantization and the coding scheme, it's likely a block will end with an End of Block (EOB) symbol
If the EOB symbol is 4 or more bits and can't prefix itself, the probability of a random occurrence is 1/16
In regions where we want to start a parallel decode thread, only start after possible EOB symbols
Can use any symbol to attempt synchronization; EOB is arbitrary, but practical because the DC coefficient is coded differently
-
New Approach
Suppose EOB = 0101
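Using the slide's example code EOB = 0101, finding candidate start points amounts to scanning the bitstream for every occurrence of that pattern and starting a speculative decoder just after each hit, where a new block's DC code should begin. A minimal sketch (the real kernel operates on packed bytes, not a character string):

```python
def candidate_starts(bits, eob='0101'):
    # Scan for every possible EOB occurrence; a speculative decoder is
    # started just after each hit. Some hits are false -- the pattern
    # can also occur by chance inside other symbols.
    n = len(eob)
    return [i + n for i in range(len(bits) - n + 1) if bits[i:i + n] == eob]

starts = candidate_starts('000101101')
```

Each returned offset is only a *possible* block boundary; false hits are resolved later, during synchronization.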
-
EOB Overhead
Overhead associated with finding EOB symbols
Implemented a kernel to do so, < 1 ms
Effectiveness depends on block length statistics
If we guarantee a true EOB hit in each section of stream we look at, then we guarantee synchronization with that section
If we do not guarantee synchronization, some threads may have to decode multiple sections
Research on these statistics is necessary to make appropriate design decisions that maximize the probability of EOB hits while minimizing the number of false hits
-
Decode Synchronization
Each decoder thread maintains information
Where it started (the bit it first looked at)
Length and data of decoder output
Where it is (the bit it currently looks at)
Synchronization occurs when the current thread location matches another's start location
Problem
What happens to false EOB hits... do they synchronize?
How does each decoder thread know how much data it will decode? How do we allocate memory for each thread?
-
Synchronization
Problem
What happens to false EOB hits... do they synchronize?
Answer
In general, yes they do. After several hundred bits, they synchronize with the real stream and will end at the next parallel section
Experiments in Klein and Wiseman, "Parallel Huffman Decoding with Applications to JPEG Files" (2003)
Can use advanced logic to prevent false-hitting decoder threads from doing too much work
-
Memory Allocation
Problem
How does each decoder thread know how much data it will decode? How do we allocate memory for each thread?
Solution
Store decoder output in chunks of global memory
Use atomic functions to acquire locks on chunks
Requires compute capability 1.1 (G92s)
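The chunk scheme can be sketched in host-side pseudocode: a shared counter hands out chunk indices, so a thread that runs out of space simply grabs the next free chunk, without knowing its total output size up front. This Python sketch uses a lock where the CUDA kernel would use atomicAdd; names like acquire_chunk are illustrative, not from the actual implementation:

```python
import threading

N_CHUNKS = 32
pool_owner = [None] * N_CHUNKS   # which thread owns each output chunk
next_free = [0]                  # shared allocation counter
counter_lock = threading.Lock()  # stands in for CUDA's atomicAdd

def acquire_chunk(tid):
    # Device equivalent: my_chunk = atomicAdd(&next_free, 1);
    # Each thread gets a unique chunk index even under contention,
    # so no two decoders ever write into the same chunk.
    with counter_lock:
        idx = next_free[0]
        next_free[0] += 1
    pool_owner[idx] = tid
    return idx

threads = [threading.Thread(target=acquire_chunk, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The chunks a thread acquired are later chained together (or discarded) once the useful threads are known.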
-
Putting it All Together
After all decoder threads have finished
We figure out which threads did meaningful work
Chain the decoded data together to create the output
Makes use of the decoder thread information
Clear out the scratch space (memory chunks)
Throw away all of the extra work
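The chaining step above can be sketched as a walk over the per-thread records: follow the chain of (start, end) positions from bit 0, concatenating the output of each thread the chain reaches and silently dropping everything else. A minimal illustration (the record layout here is invented for the sketch):

```python
def stitch(records):
    # records: (start_bit, end_bit, symbols) reported by every decoder
    # thread. Follow the chain from bit 0; any thread whose start is
    # never reached did purely speculative work and is discarded.
    by_start = {start: (end, syms) for start, end, syms in records}
    out, pos = [], 0
    while pos in by_start:
        end, syms = by_start[pos]
        out.extend(syms)
        pos = end
    return out

records = [(0, 5, ['a', 'b']),   # true head of the stream
           (5, 9, ['c']),        # synchronized continuation
           (3, 7, ['x'])]        # false start: thrown away
result = stitch(records)
```

Synchronization guarantees that a kept thread's end position equals the next kept thread's start position, which is exactly what makes this simple chain walk sufficient.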
-
Conclusion
IDCT Kernel speedup (5-59x) depending on context
Because of the serial nature of JPEG, applications often do not make use of wholesale transforms
Parallel Huffman Decoding is Complex
Is now the main bottleneck of JPEG decompression
Lots of potential speedup to be had, but requires careful and precise research and development
30 Frames per Second High-Res JPEG Animation
Possible and probable, with additional work