JPEG Encoder Accelerator
Advanced Embedded Systems Architecture
EE-382N-4, Fall 2009
Anup P. Joshi, Chandra Bhushan Prakash, Karthick Santhanam, Pratap Ramanathan
OVERVIEW
JPEG Encoding Process
JPEG Encoder Accelerator
Existing Architecture
Proposed Architectures
Implementation
Results
Conclusion
JPEG Overview
A raw image takes a large number of bytes to store
Standardized image compression mechanism
Exploits known limitations of the human eye: small color changes are perceived less accurately than small changes in brightness
Lossy method, but achieves much greater compression than GIF, BMP, etc.
Stores 24-bit-per-pixel color data instead of 8-bit-per-pixel data: 24 bits per pixel gives about 16 million colors, compared to 256 or fewer colors
Disadvantage: repeated compression and decompression progressively degrades image quality
Step 2: Color Space Transform
Color of each pixel is a 3-d vector (R,G,B)
Significant correlation between these components
Color space transform produces a new vector: luminance Y, and blue and red chrominance, Cb and Cr
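A minimal sketch of the color space transform, assuming the full-range JFIF (BT.601) coefficients commonly used with JPEG. The slides do not state which coefficients or fixed-point widths the hardware core uses, so the function name and constants here are illustrative only.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (JFIF convention).

    Assumption: floating-point JFIF coefficients; a hardware core would
    normally use a fixed-point approximation of these.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


if __name__ == "__main__":
    # A gray pixel carries no color information: Cb and Cr sit at mid-scale.
    print(rgb_to_ycbcr(128, 128, 128))
```

Any gray pixel (R = G = B) maps to Cb = Cr = 128, which shows how the transform separates brightness from color.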
Step 3: DCT
Use sequential DCT to transform each 8x8 block into a set of 64 values (DCT coefficients)
One DC coefficient: a measure of the average value of the block
63 AC coefficients, corresponding to the non-zero spatial frequencies; the high-frequency ones tend to be zero or near zero for most natural images
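The DCT step can be sketched directly from its definition. This is a straightforward (slow) reference form of the 8x8 DCT-II with the scaling used by JPEG, not the pipelined structure of the hardware core; the function name is an assumption.

```python
import math

N = 8  # JPEG works on 8x8 blocks


def dct_2d(block):
    """Reference 8x8 DCT-II: returns 64 coefficients as an 8x8 list of lists."""
    def c(k):
        # Normalization that makes the transform orthonormal (matches JPEG scaling).
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out
```

For a flat block, all of the energy lands in the DC coefficient (8 times the sample value) and every AC coefficient is zero up to rounding error.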
Step 4: Quantizer
Each of the 64 coefficients is quantized using the corresponding one of 64 values from a quantization table
Facilitates greater compression, but is lossy (many coefficients are reduced to zero)
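A sketch of the quantizer step: each coefficient is divided by its table entry and rounded. The table below is hypothetical (step size simply grows with frequency); it is not the table used by the core or the example table from the JPEG standard.

```python
def round_half_away(x):
    """Round to nearest integer, with halves rounded away from zero."""
    return int(x + 0.5) if x >= 0 else -int(-x + 0.5)


def quantize(coeffs, table):
    """Quantize an 8x8 coefficient block with an 8x8 quantization table."""
    return [[round_half_away(coeffs[u][v] / table[u][v]) for v in range(8)]
            for u in range(8)]


# Hypothetical table for illustration only: coarser steps at higher frequencies.
EXAMPLE_TABLE = [[8 + 4 * (u + v) for v in range(8)] for u in range(8)]
```

Small AC coefficients fall below half a quantization step and become zero, which is where both the compression gain and the loss come from.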
Step 5: Encoder
Huffman encoder is the most popular choice
The previous block's quantized DC coefficient is used to predict the current DC coefficient; the difference is encoded
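The DC prediction can be sketched as below. In baseline JPEG the predictor starts at 0 and the Huffman symbol for a DC difference is its size category (the number of bits needed for its magnitude). The function names are illustrative, and the Huffman table lookup itself is omitted.

```python
def dc_differences(dc_values):
    """Return the difference of each block's DC value from the previous block's."""
    prev = 0  # predictor is 0 for the first block
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def size_category(diff):
    """Number of bits needed for |diff|; this is the symbol that gets Huffman coded."""
    return abs(diff).bit_length()
```

This dependence on the previous block's DC value is what ties the encoding of consecutive blocks together.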
Accelerator considerations
Hardware vs. software
A pure software implementation is generally slower than a hardware-based implementation
A dedicated hardware functional unit (accelerator) is faster
Enhanced architectural options:
  1. Pipelining the JPEG encoder (already done)
  2. Going for a different architecture/microarchitecture
  3. Pipelining individual blocks in the encoder
We chose the 2nd option due to constraints in the design (more in the following slides)
Existing Pipelined Encoder - Open Source
Design files acquired from Opencores.org
Pipelined encoder: Verilog source files
Existing architecture for encoding:
Existing Implementation Details
Input to the encoder (data_in) is a 24-bit data bus with 8 bits each for the red, green and blue components of a pixel
Follows sequential DCT-based mode: inputs start with the top left 8x8 block of the image, starting with the top left pixel, going to the right, then down to the second row, etc.
Input data for the 1st 8x8 block of pixels is sent over 64 consecutive clocks
After sending data for the first block, a delay of 33 clock cycles is incurred due to the encoding process (Huffman) before the next block can be sent
Huffman encodes values based on the previous block's output, which introduces the dependency and the delay
A candidate for improvement
Output: JPEG_bitstream, 32 bits wide, produced by the Huffman encoder
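The input ordering described above can be sketched as a generator of pixel coordinates. It assumes image dimensions that are multiples of 8 and that blocks follow each other left to right, then top to bottom; the slides only state explicitly that the first block is the top left one.

```python
def block_raster_order(width, height, block=8):
    """Yield (x, y) pixel coordinates block by block, row-major inside each block."""
    for by in range(0, height, block):        # block rows, top to bottom (assumed)
        for bx in range(0, width, block):     # blocks left to right (assumed)
            for y in range(by, by + block):   # rows inside the block
                for x in range(bx, bx + block):
                    yield x, y
```

Each group of 64 consecutive coordinates corresponds to the 64 clocks over which one block is fed to the core.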
Experimented architecture #1:
Insert a buffer between the quantizer and the Huffman encoder so that the Huffman input doesn't change for 97 cycles (64 input clocks + 33 delay clocks).
But the quantizer output changes every 64 cycles.
Hence loss of data!
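A coarse model of why the buffer loses data: the quantizer delivers a block every 64 cycles, while the Huffman stage needs 97 cycles per block, so unprocessed blocks pile up. The model is an illustration only (it even lets the Huffman stage start at cycle 0), not a simulation of the actual core.

```python
def pending_blocks(n_produced, produce_period=64, consume_period=97):
    """Blocks produced by the quantizer but not yet finished by the Huffman stage."""
    elapsed = n_produced * produce_period
    consumed = min(n_produced, elapsed // consume_period)
    return n_produced - consumed


if __name__ == "__main__":
    for n in (1, 3, 10, 144):
        print(n, pending_blocks(n))
```

The backlog keeps growing with the number of blocks, so a buffer that holds a single block's worth of quantizer output is overwritten before it has been encoded.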
Architecture #2:
Split the image bitstream equally across 2 parallel paths with replicated functional units
Equivalent to using 2 encoders, although the delay within each encoder still remains
Gross over-usage of silicon area, with additional overhead on software too
Architecture #3:
Two Huffman blocks
Eliminates the bottleneck: helps in removing the delay between feeding two blocks of data
Individual Huffman blocks are driven alternately:
  1st Huffman block for every odd 8x8 pixel block
  2nd Huffman block for every even 8x8 pixel block
Negligible loss in compression: each of the two Huffman blocks gets its own separate first input set
64 cycles for accumulation
97 cycles in each Huffman block
Some cycles for synchronization
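The alternating scheme can be checked with a small scheduling model. It assumes a new block is issued every 72 cycles (64 input clocks plus the 8 synchronization cycles reported in the simulation results) and that each Huffman block is busy for 97 cycles per 8x8 block. All names are illustrative.

```python
def total_stall_cycles(num_blocks, issue_period=72, huffman_busy=97, num_encoders=2):
    """Total cycles that blocks would wait for a busy Huffman block."""
    free_at = [0] * num_encoders          # cycle at which each Huffman block is free
    stalls = 0
    for i in range(num_blocks):
        arrival = i * issue_period
        enc = i % num_encoders            # alternate: odd blocks -> 1st, even -> 2nd
        start = max(arrival, free_at[enc])
        stalls += start - arrival
        free_at[enc] = start + huffman_busy
    return stalls


if __name__ == "__main__":
    print("one Huffman block :", total_stall_cycles(144, num_encoders=1))
    print("two Huffman blocks:", total_stall_cycles(144, num_encoders=2))
```

With two Huffman blocks, each one receives a new 8x8 block only every 144 cycles, which is longer than its 97 busy cycles, so the input never has to wait.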
Implementation details
Transform source image into the required 24-bit stream for each pixel: R<7:0>, G<15:8>, B<23:16>
Process it in the design (hardware)
Generate encoded bit stream for every pixel
Reconstruct image from the output of the hardware implementation
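The pixel layout above (R in bits 7:0, G in bits 15:8, B in bits 23:16) can be sketched as a packing helper. The 24-character binary string is a hypothetical text format; the slides do not show the exact format passed to the test bench.

```python
def pack_rgb(r, g, b):
    """Pack 8-bit R, G, B into one 24-bit word: R<7:0>, G<15:8>, B<23:16>."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError("each component must fit in 8 bits")
    return (b << 16) | (g << 8) | r


def to_bit_string(word):
    """Hypothetical text form of one pixel: 24 binary digits, MSB first."""
    return format(word, "024b")
```

For example, `pack_rgb(0x12, 0x34, 0x56)` places blue in the top byte and red in the bottom byte of the word.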
Conversion of image to R,G,B bitstream
In Matlab:
Generated bit information using the imread() function
Generates a text file ‘bits.txt’ containing 24-bit data for each of the total number of pixels
Properly formatted and supplied to the design via the test bench
[Figure: source image in .TIFF format (file size: 28KB) -> gen_bitstream.m -> bits.txt -> supplied to the testbench]
Simulation results (Existing architecture):
The ‘enable’ signal should be brought high when the data from the first pixel of the image is ready
The enable signal needs to stay high while the data is being input to the core
Each 8x8 block of data needs to be input to the core on 64 consecutive clock cycles
Takes an additional 33 clocks to produce the JPEG bitstream for 64 pixels of data from 1 block of input
Overall clock consumption (for this example): 143,120,000 / 10,000 = 14312 clocks
Simulation results (New architecture):
Alternates between the 2 Huffman encoder blocks
Introduced 2 data_ready signals, one for each of the two JPEG bitstreams coming out of the 2 Huffman encoder blocks
Overhead in synchronizing the two Huffman encoders: only eight clock cycles
Overall clock consumption: 107,120,000 / 10,000 = 10712 clocks
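Both simulated totals fit a simple per-block model: 64 input clocks plus a gap (33 clocks in the existing design, 8 in the new one) for each of the 144 blocks mentioned in the result summary, plus a fixed remainder. The 344-cycle remainder is inferred here by fitting the two reported totals; the slides do not state it, and it presumably covers pipeline fill and flush.

```python
BLOCKS = 144        # number of 8x8 blocks, from the result summary
FEED_CLOCKS = 64    # clocks to input one block
FIXED_CLOCKS = 344  # inferred remainder, not stated in the slides


def total_clocks(gap_per_block, blocks=BLOCKS, fixed=FIXED_CLOCKS):
    """Estimated total clocks for the test image under the per-block model."""
    return blocks * (FEED_CLOCKS + gap_per_block) + fixed


if __name__ == "__main__":
    print("existing:", total_clocks(33))
    print("new     :", total_clocks(8))
```

The same fixed remainder reproduces both totals, which is consistent with the only change being the per-block gap dropping from 33 to 8 clocks.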
Reconstructing the image
Ideal reconstruction: implement a decoder
  Functionally complex (excessive design time)
Alternative way to verify functionality: software (Matlab)
Reconstruct the image using the generated bitstream, giving us the much-anticipated “JPEG image”
Image reconstruction performed in Matlab
Verify against the input image (quality & compression)
Image reconstruction (software):
[Figure: JPEG_bitstream_odd.txt and JPEG_bitstream_even.txt are each reconstructed, then merged into JPEG format; Matlab scripts shown: read_jpegstream.m, imagecomb.m]
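The merge step can be pictured as interleaving the per-block results from the odd and even paths back into the original block order. This is a sketch under that assumption only; the actual Matlab scripts are not shown in the slides, and the function name is hypothetical.

```python
def merge_blocks(odd_blocks, even_blocks):
    """Interleave per-block results: odd path holds blocks 1, 3, 5, ...; even path 2, 4, 6, ..."""
    if not 0 <= len(odd_blocks) - len(even_blocks) <= 1:
        raise ValueError("odd path must hold the same number of blocks or one more")
    merged = []
    for i, blk in enumerate(odd_blocks):
        merged.append(blk)
        if i < len(even_blocks):
            merged.append(even_blocks[i])
    return merged
```

The items can be any per-block representation, for example the reconstructed 8x8 pixel arrays.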
Performance comparison of architectures
Existing:
  Frequency: ~68 MHz
  For test image, total clocks consumed = 14312
  Total area = 1 374 028.8 sq. μm (based on Design Vision synthesis)
New:
  Frequency: ~68 MHz
  For test image, total clocks consumed = 10712
  Total area = 1 634 796.8 sq. μm (based on Design Vision synthesis)
Result summary:
Overall savings in clock cycles (acceleration): 14312 - 10712 = 3600
Savings per 8x8 block = 3600 / 144 = 25 clock cycles
Overall increase in area (in NAND1 gate equivalents, at 1.8772 sq. μm per gate)
= (1 634 796.8 / 1.8772) - (1 374 028.8 / 1.8772)
= 138 913.275
Change in power consumption ???
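The arithmetic in this summary can be reproduced with a few lines. The speedup ratio and the percentage area increase are derived here from the reported numbers; they are not figures quoted in the slides.

```python
EXISTING_CLOCKS = 14312
NEW_CLOCKS = 10712
BLOCKS = 144
EXISTING_AREA = 1374028.8   # sq. um
NEW_AREA = 1634796.8        # sq. um
NAND1_AREA = 1.8772         # sq. um per gate, as used in the summary


def summarize():
    """Return the derived figures as a dictionary."""
    saved = EXISTING_CLOCKS - NEW_CLOCKS
    return {
        "clocks_saved": saved,
        "clocks_saved_per_block": saved / BLOCKS,
        "speedup": EXISTING_CLOCKS / NEW_CLOCKS,
        "extra_gates": (NEW_AREA - EXISTING_AREA) / NAND1_AREA,
        "area_increase_pct": 100.0 * (NEW_AREA - EXISTING_AREA) / EXISTING_AREA,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

In short, the new design trades roughly a fifth more area for about a third more throughput at the same clock frequency.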
Design trade-offs
Existing implementation had a lot of dependency between functional blocks
Re-designing/pipelining the internal blocks is cumbersome
Adopted a revised “architectural” solution that uses multiple functional units
Improves speed of encoding
Costs more area and higher instantaneous power
A second chance?
Besides starting early:
Possibly look at pipelining individual blocks
Re-design the Huffman block to reduce the internal dependency
Reconstruct the image using a JPEG decoder
Accelerate the decoding process as well
Breakdown of work performed:
Anup Joshi and Chandra Prakash
  Architecture with 2 encoders
  Architecture with buffer
  Synchronizing 2 Huffman blocks in proposed architecture
  Synthesis of encoder
Karthick Santhanam and Pratap Ramanathan
  Analysis of open source code
  Architecture with 2 Huffman blocks
  Matlab code for generating input bit stream
  Matlab code for combining bit stream outputs
Architecture #2:
But after the design we realized that the Huffman encoder was the bottleneck
No point in making the quantizer’s output wait at the ‘already slow’ stage
Lessons learnt: identify the bottlenecks first, DO NOT WASTE TIME