08 dkawamo2 jpeg presentation
TRANSCRIPT
-
7/31/2019 08 Dkawamo2 JPEG Presentation
CUDA JPEG Essentials
Darek Kawamoto
-
Introduction
Project Origin
Inverse Discrete Cosine Transform
Kernel Summary
Performance
Parallel Huffman Decode for JPEG
Design Approach
Design Problems, Solutions
Implementation Remarks
Conclusion
-
Project Origin
Computer Animation for Scientific Visualization
Stuart Levy, UIUC / NCSA
Goal: Decode big (1920x1080) JPEG images fast (~30 fps)
GPU cheaper than specialized hardware
CUDA and Two JPEG Bottlenecks:
Inverse Discrete Cosine Transform (IDCT)
Straightforward, similar to class machine problems
Huffman Decode Stage
Tricky parallelization of a serial process
-
Inverse Discrete Cosine Transform
2-D IDCT:
p_{xy} = \frac{1}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} C_i C_j G_{ij} \cos\frac{(2x+1)i\pi}{16} \cos\frac{(2y+1)j\pi}{16}
1-D IDCT:
p_x = \frac{1}{2} \sum_{i=0}^{7} C_i G_i \cos\frac{(2x+1)i\pi}{16}
where C_f = \frac{1}{\sqrt{2}} when f = 0, and 1 otherwise
2-D is equivalent to 1-D applied in each direction
Kernel uses 1-D transforms
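The separability claim above is easy to check numerically. Here is a minimal plain-Python sketch of the 1-D formula applied down the columns and then across the rows (an illustration of the math, not the CUDA kernel itself):

```python
import math

def idct_1d(G):
    # 1-D 8-point IDCT: p_x = (1/2) * sum_i C_i * G_i * cos((2x+1)*i*pi/16)
    C = [1 / math.sqrt(2)] + [1.0] * 7
    return [0.5 * sum(C[i] * G[i] * math.cos((2 * x + 1) * i * math.pi / 16)
                      for i in range(8))
            for x in range(8)]

def idct_2d(G):
    # Column pass first, then row pass -- separability in action.
    col_pass = list(zip(*[idct_1d(list(col)) for col in zip(*G)]))
    return [idct_1d(list(row)) for row in col_pass]

# A DC-only block with G[0][0] = 8 should reconstruct to a flat block of 1.0
G = [[0.0] * 8 for _ in range(8)]
G[0][0] = 8.0
p = idct_2d(G)
```

The two 1-D passes give the same result as the double sum in the 2-D formula, which is why the kernel only needs a 1-D transform.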
-
IDCT Kernel
Thread Parallelism
Each thread corresponds to an element of the matrix
Threads compute IDCT across columns, then rows
Memory Access Patterns
Shared memory: broadcast, or no bank conflicts
Global memory: buffered, coalesced
Other Optimizations
Careful use of 16KB Shared Memory: 6 blocks per SMP
Unrolled 5x: Each iteration computes five 2-D IDCTs
-
IDCT Performance -- How...?
How to benchmark?
libJPEG: executes processes serially
GPU: executes IDCT process wholesale
How precise?
short implementations do almost as well as float
double precision has no advantages
How much work?
GPU shines with > 64,000 blocks
JPEG specific: CPU can short-circuit vectors of zeros
Let CPU short-circuit ~50% of columns in the first IDCT
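The zero-vector short circuit above exploits the fact that the IDCT of an all-zero coefficient column is itself all zeros. A small sketch of that CPU optimization (plain Python, reusing the slide's 1-D formula; this is an illustration, not the benchmarked implementation):

```python
import math

def idct_1d(G):
    # 1-D 8-point IDCT from the slide's formula
    C = [1 / math.sqrt(2)] + [1.0] * 7
    return [0.5 * sum(C[i] * G[i] * math.cos((2 * x + 1) * i * math.pi / 16)
                      for i in range(8))
            for x in range(8)]

def idct_columns_with_shortcut(G):
    # First (column) pass of the 2-D IDCT: skip any column whose
    # coefficients are all zero -- its transform output is also all zeros.
    out, skipped = [], 0
    for col in zip(*G):
        if not any(col):
            out.append([0.0] * 8)  # short circuit: no arithmetic at all
            skipped += 1
        else:
            out.append(idct_1d(list(col)))
    # transpose back so the row pass sees the right orientation
    return [list(r) for r in zip(*out)], skipped

G = [[0.0] * 8 for _ in range(8)]
G[0][0] = 8.0  # DC-only block: 7 of its 8 columns are entirely zero
partial, skipped = idct_columns_with_shortcut(G)
```

Because quantization zeroes out most high-frequency coefficients, many columns qualify for the shortcut in real JPEG data, which is why the CPU numbers get this ~50% concession.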
-
IDCT Performance -- Cost
IDCT Implementations:
(float) Naïve 1-D: 64 multiplies, 64 adds per 1-D transform
(short) Chen-Wang: 11 multiplies, 29 adds per 1-D transform
(float) Arai, Agui, and Nakajima (AA&N): 5 multiplies, 29 adds per 1-D transform
Other multiplies folded into de-quantization tables
-
IDCT Performance -- Small
Approx. Execution times for 67,200 blocks:
(float) Naïve 1-D GPU: 4.69 ms
(float) Naïve 1-D CPU Serial: 333 ms (71x)
(float) Naïve 1-D CPU Wholesale: 100 ms (21x)
(short) Chen-Wang Serial: 30 ms (6.4x)
(float) AA&N Wholesale: 25 ms (5.3x)
(float) AA&N Serial: 268 ms (57x)
GPU: ~29 GFLOPS
-
IDCT Performance -- Big
Approx. Execution times for 245,760 blocks:
(float) Naïve 1-D GPU: 16.94 ms
(float) Naïve 1-D CPU Serial: 1250 ms (73x)
(float) Naïve 1-D CPU Wholesale: 375 ms (22x)
(short) Chen-Wang Serial: 113 ms (6.7x)
(float) AA&N Wholesale: 91 ms (5.4x)
(float) AA&N Serial: 1000 ms (59x)
GPU: ~30 GFLOPS
-
IDCT Performance Conclusion
Amount
Wholesale transforms work much better than retail
67,200 IDCT blocks perform almost as well as 245,760
Speed
30 fps means each frame needs to be ready in 33 ms
How much time to perform the other JPEG functions?
With 67,200 blocks, we have 28.6 ms left
With 245,760 blocks, we have 16.4 ms left
Conclusion
Could not previously hope to process in < 33 ms
Application now depends on the speedup of other kernels
-
Parallel Huffman Decode for JPEG
Huffman Compression
Prefix-free, variable-length code
Serial in nature: decode each symbol in sequential order
Parallel Decoding Challenge
Impossible to determine where symbols start and end without decoding all previous symbols
Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work
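The design approach above can be sketched with a toy prefix-free code: start one decoder at the true beginning and another at a guessed mid-stream offset, then observe that when the guess lands on a real symbol boundary, the speculative output is exactly the tail of the true decode. This is a minimal illustration with an invented 4-symbol code, not JPEG's actual tables or bitstream layout:

```python
# Toy 4-symbol prefix-free code; real JPEG Huffman tables and the
# bitstream layout are more involved.
CODE = {'0': 'a', '10': 'b', '110': 'c', '111': 'd'}

def decode_from(bits, start):
    # Decode greedily from a bit offset, recording every symbol
    # boundary the decoder passes through.
    out, buf, boundaries = [], '', [start]
    for pos in range(start, len(bits)):
        buf += bits[pos]
        if buf in CODE:
            out.append(CODE[buf])
            buf = ''
            boundaries.append(pos + 1)
    return out, boundaries

msg = 'abacdbad'
enc = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
bits = ''.join(enc[s] for s in msg)

# The "true" decode from bit 0, plus a speculative decoder started
# mid-stream at a guessed offset.
true_out, true_bounds = decode_from(bits, 0)
guess = 7                       # this guess happens to be a real boundary
spec_out, _ = decode_from(bits, guess)
# Synchronization: the speculative start lies on a true symbol boundary,
# so its output is exactly the tail of the true decode.
synced = guess in true_bounds
```

A guess that misses a boundary produces garbage at first, which is the extra work the scheme throws away; the following slides cover how to pick guesses that are likely to be boundaries.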
-
Design Approach
Spawn parallel work threads
-
Parallel Huffman Decode for JPEG
Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work
Problems
Does the parallel speedup of successful synchronization offset the penalty of extra work?
Yes, if we choose our work wisely! Do so by exploiting JPEG structure and probability
Each decoder thread doesn't know how much data it will decode
Allocate memory on device using atomic functions
-
Choosing Work Wisely
Exploit Block Coding
Each block of coefficients encodes a DC coefficient and assorted AC coefficients
Due to quantization and the coding scheme, it's likely a block will end with an End of Block (EOB) symbol
If the EOB symbol is 4 or more bits and can't prefix itself, the probability of a random occurrence is 1/16
In regions where we want to start a parallel decode thread, only start after possible EOB symbols
Can use any symbol to attempt synchronization; EOB is arbitrary, but practical because the DC coefficient is coded differently
-
New Approach
Suppose EOB = 0101
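Using the slide's example code EOB = 0101, finding candidate start points amounts to scanning the bitstream for every occurrence of that pattern and starting a speculative decoder just after each hit, where a new block's DC code should begin. A minimal sketch (the real kernel operates on packed bytes, not a character string):

```python
def candidate_starts(bits, eob='0101'):
    # Scan for every possible EOB occurrence; a speculative decoder is
    # started just after each hit. Some hits are false -- the pattern
    # can also occur by chance inside other symbols.
    n = len(eob)
    return [i + n for i in range(len(bits) - n + 1) if bits[i:i + n] == eob]

starts = candidate_starts('000101101')
```

Each returned offset is only a *possible* block boundary; false hits are resolved later, during synchronization.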
-
EOB Overhead
Overhead associated with finding EOB symbols
Implemented a kernel to do so, < 1 ms
Effectiveness depends on block length statistics
If we guarantee a true EOB hit in each section of stream we look at, then we guarantee synchronization with that section
If we do not guarantee synchronization, some threads may have to decode multiple sections
Research on these statistics is necessary to make appropriate design decisions that maximize the probability of EOB hits while minimizing the number of false hits
-
Decode Synchronization
Each decoder thread maintains information
Where it started (the bit it first looked at)
Length and data of decoder output
Where it is (the bit it currently looks at)
Synchronization occurs when the current thread location matches another's start location
Problem
What happens to false EOB hits... do they synchronize?
How does each decoder thread know how much data it will decode? How do we allocate memory for each thread?
-
Synchronization
Problem
What happens to false EOB hits... do they synchronize?
Answer
In general, yes they do. After several hundred bits, they synchronize with the real stream and will end at the next parallel section
Experiments in Klein and Wiseman, "Parallel Huffman Decoding with Applications to JPEG Files" (2003)
Can use advanced logic to prevent false-hitting decoder threads from doing too much work
-
Memory Allocation
Problem
How does each decoder thread know how much data it will decode? How do we allocate memory for each thread?
Solution
Store decoder output in chunks of global memory
Use atomic functions to acquire locks on chunks
Requires compute capability 1.1 (G92s)
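The chunk scheme can be sketched in host-side pseudocode: a shared counter hands out chunk indices, so a thread that runs out of space simply grabs the next free chunk, without knowing its total output size up front. This Python sketch uses a lock where the CUDA kernel would use atomicAdd; names like acquire_chunk are illustrative, not from the actual implementation:

```python
import threading

N_CHUNKS = 32
pool_owner = [None] * N_CHUNKS   # which thread owns each output chunk
next_free = [0]                  # shared allocation counter
counter_lock = threading.Lock()  # stands in for CUDA's atomicAdd

def acquire_chunk(tid):
    # Device equivalent: my_chunk = atomicAdd(&next_free, 1);
    # Each thread gets a unique chunk index even under contention,
    # so no two decoders ever write into the same chunk.
    with counter_lock:
        idx = next_free[0]
        next_free[0] += 1
    pool_owner[idx] = tid
    return idx

threads = [threading.Thread(target=acquire_chunk, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The chunks a thread acquired are later chained together (or discarded) once the useful threads are known.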
-
Putting it All Together
After all decoder threads have finished
We figure out which threads did meaningful work
Chain the decoded data together to create the output
Makes use of the decoder thread information
Clear out the scratch space (memory chunks)
Throw away all of the extra work
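The chaining step above can be sketched as a walk over the per-thread records: follow the chain of (start, end) positions from bit 0, concatenating the output of each thread the chain reaches and silently dropping everything else. A minimal illustration (the record layout here is invented for the sketch):

```python
def stitch(records):
    # records: (start_bit, end_bit, symbols) reported by every decoder
    # thread. Follow the chain from bit 0; any thread whose start is
    # never reached did purely speculative work and is discarded.
    by_start = {start: (end, syms) for start, end, syms in records}
    out, pos = [], 0
    while pos in by_start:
        end, syms = by_start[pos]
        out.extend(syms)
        pos = end
    return out

records = [(0, 5, ['a', 'b']),   # true head of the stream
           (5, 9, ['c']),        # synchronized continuation
           (3, 7, ['x'])]        # false start: thrown away
result = stitch(records)
```

Synchronization guarantees that a kept thread's end position equals the next kept thread's start position, which is exactly what makes this simple chain walk sufficient.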
-
Conclusion
IDCT Kernel speedup (5-59x) depending on context
Because of the serial nature of JPEG, applications often do not make use of wholesale transforms
Parallel Huffman Decoding is Complex
Is now the main bottleneck of JPEG decompression
Lots of potential speedup to be had, but requires careful and precise research and development
30 Frames per Second High-Res JPEG Animation
Possible and probable, with additional work