Jpeg Encoder Accelerator Advanced Embedded Systems Architecture EE-382N-4 Fall 2009 Anup P. Joshi Chandra Bhushan Prakash Karthick Santhanam Pratap Ramanathan


OVERVIEW

JPEG Encoding Process

JPEG Encoder Accelerator

Existing Architecture

Proposed Architectures

Implementation

Results

Conclusion

JPEG Overview

A raw image represents a very large number of bytes of information

JPEG is a standardized image compression mechanism

Exploits known limitations of the human eye: small color changes are perceived less accurately than small changes in brightness

Lossy method, but achieves much greater compression than GIF, BMP, etc.

Stores 24-bit-per-pixel color data instead of 8-bit-per-pixel data

24 bits per pixel gives 16 million colors, as compared to 256 or fewer colors

Disadvantage: repeated compression and decompression will deteriorate image quality

Encoding scheme:

Step 1: Image Pixels

The source image is divided into 8x8 pixel blocks, and the encoder works on one 8x8 block at a time
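
The block division in Step 1 can be sketched in Python as follows. This is only an illustration, not code from the project: the function name `split_into_blocks` and the 96x96 test image are assumptions (96x96 is chosen because it yields the 144 blocks quoted in the result summary), and edge padding for images whose sides are not multiples of 8 is left out.

```python
def split_into_blocks(image, block=8):
    """Split a 2-D list of pixels into block x block tiles in raster order.

    Assumes height and width are multiples of `block`; a real encoder
    would pad the edges, which this sketch does not do.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image dimensions must be multiples of the block size")
    tiles = []
    for top in range(0, height, block):        # block rows, top to bottom
        for left in range(0, width, block):    # blocks within a row, left to right
            tiles.append([row[left:left + block] for row in image[top:top + block]])
    return tiles


# Hypothetical 96x96 single-channel image: 12 x 12 = 144 blocks.
image = [[(y * 96 + x) % 256 for x in range(96)] for y in range(96)]
blocks = split_into_blocks(image)
```

The tiles come out in the same order in which the encoder expects its input: left to right along a row of blocks, then down to the next row of blocks.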

Step 2: Color Space Transform

Color of each pixel is a 3-d vector (R,G,B)

Significant correlation between these components

Color space transform produces a new vector: luminance Y; blue and red chrominance, Cb and Cr
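
A minimal Python sketch of the color space transform in Step 2. It assumes the JFIF (ITU-R BT.601 full-range) conversion constants; the slides do not say which constants or fixed-point approximation the Verilog core uses, so treat the numbers as an illustration rather than the core's exact arithmetic. The name `rgb_to_ycbcr` is hypothetical.

```python
def _clamp8(value):
    """Round to the nearest integer and clamp into the 8-bit range 0..255."""
    return min(255, max(0, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) using the JFIF constants."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return _clamp8(y), _clamp8(cb), _clamp8(cr)


white = rgb_to_ycbcr(255, 255, 255)   # neutral color: chrominance sits at 128
black = rgb_to_ycbcr(0, 0, 0)
red = rgb_to_ycbcr(255, 0, 0)
```

For any gray pixel (R = G = B) both chrominance values come out at 128, which shows how the transform removes the correlation between the three components.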

Step 3: DCT

Use sequential DCT to transform each block into a set of 64 values (DCT coefficients)

One DC coefficient: proportional to the average value of the block

63 AC coefficients, corresponding to the non-zero spatial frequencies; the high-frequency ones tend to be zero or near zero for most natural images
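
The DCT step can be illustrated with a direct (unoptimized) 8x8 forward DCT in Python. This is a textbook DCT-II written for readability, not the pipelined structure of the hardware core; the level shift by 128 that JPEG applies before the DCT is omitted, and the name `dct_2d` is hypothetical.

```python
import math

N = 8


def dct_2d(block):
    """Forward 2-D DCT-II of an 8x8 block, direct O(N^4) form for clarity."""
    def scale(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * scale(u) * scale(v) * total
    return out


# A perfectly flat block has only a DC term: 8 times the pixel value.
flat = [[100] * N for _ in range(N)]
coeffs = dct_2d(flat)
```

With this normalization the DC coefficient equals 8 times the mean of the block, and all 63 AC coefficients of a flat block are zero up to floating-point error.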

Step 4: Quantizer

Each of the 64 coefficients is quantized (divided and rounded) using the corresponding one of 64 values from a quantization table

Facilitates greater compression, but is lossy (most coefficients become zero)
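
A short Python sketch of the quantization step. The uniform table of 16s and the sample coefficients are made up for the example; the table actually used by the core is not given in the slides. Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the hardware's rounding.

```python
def quantize(coeffs, table):
    """Divide each DCT coefficient by its table entry and round to an integer."""
    return [[round(c / q) for c, q in zip(coeff_row, table_row)]
            for coeff_row, table_row in zip(coeffs, table)]


# Hypothetical inputs: one large DC term and two small AC terms.
coeffs = [[0] * 8 for _ in range(8)]
coeffs[0][0] = 800
coeffs[0][1] = -37
coeffs[1][0] = 7
table = [[16] * 8 for _ in range(8)]   # assumed uniform step size

quantized = quantize(coeffs, table)
nonzero = sum(1 for row in quantized for value in row if value != 0)
```

Here the coefficient 7 is smaller than half the step size and quantizes to zero, which is where the information loss (and the compression gain) comes from.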

Step 5: Encoder

‘Huffman’ encoder: the most popular choice

The quantized DC coefficient of the previous block is used to predict the DC coefficient of the current block, and only the difference is encoded
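
The DC prediction described in Step 5 amounts to differential coding across blocks. The Python sketch below shows only that differencing and its inverse; the Huffman code assignment itself is not shown, and the function names and sample values are hypothetical.

```python
def dc_differences(dc_values):
    """Replace each quantized DC value by its difference from the previous block."""
    diffs = []
    previous = 0            # the predictor starts at zero for the first block
    for dc in dc_values:
        diffs.append(dc - previous)
        previous = dc
    return diffs


def dc_reconstruct(diffs):
    """Inverse operation: running sum of the differences."""
    values = []
    previous = 0
    for diff in diffs:
        previous += diff
        values.append(previous)
    return values


dc_values = [50, 52, 51, 51]          # made-up quantized DC coefficients
diffs = dc_differences(dc_values)     # small numbers are cheaper to encode
```

Because each difference depends on the previous block's DC value, this step creates a dependency between consecutive blocks.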

Accelerator considerations

Hardware vs. software

A pure software implementation is generally slower than a hardware-based implementation

A dedicated hardware functional unit (accelerator) is faster

Enhanced architectural options:

1. Pipelining the JPEG encoder (already done)

2. Going for a different architecture/microarchitecture

3. Pipelining individual blocks in the encoder

We chose the 2nd option due to constraints in the design (more in the following slides)

Existing Pipelined Encoder - Open Source

Design files acquired from Opencores.org

Pipelined encoder: Verilog source files

Existing architecture for encoding:

Existing Implementation Details

Input to the encoder (data_in) is a 24-bit data bus with 8 bits each for the red, green and blue components of a pixel

Follows the sequential DCT-based mode: inputs start with the top left 8x8 block of the image, starting with the top left pixel, going to the right, then down to the second row, etc.

Input data for the first 8x8 block of pixels is sent over 64 consecutive clocks

After the data for a block has been sent, a delay of 33 clock cycles is incurred by the Huffman encoding stage before the next block can be sent

The Huffman encoder's output for a block depends on the previous block's output, and this dependency introduces the delay

A candidate for improvement

Output: JPEG_bitstream, 32 bits wide, produced by the Huffman encoder

Experimental architecture #1:

Insert a buffer between the quantizer and the Huffman encoder so that the Huffman encoder's input doesn’t change for 97 cycles (64 input cycles + 33 delay cycles).

But the quantizer output changes every 64 cycles.

Hence loss of data!
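
Why the single buffer loses data can be seen with a toy timing model in Python. It is a simplification and not derived from the Verilog: it assumes the quantizer emits block k at cycle 64*k and that the buffer re-latches the most recent quantizer output once every 97 cycles. The function name `captured_blocks` is hypothetical.

```python
def captured_blocks(num_blocks, produce_period=64, hold_period=97):
    """Return the indices of the blocks that the buffer manages to latch."""
    captured = []
    cycle = 0
    while True:
        latest = cycle // produce_period   # newest block the quantizer has emitted
        if latest >= num_blocks:
            break
        captured.append(latest)
        cycle += hold_period               # buffer is frozen for the next 97 cycles
    return captured


kept = captured_blocks(6)
lost = [k for k in range(6) if k not in kept]
```

Under these assumptions, two of the first six blocks are overwritten before the buffer can latch them, because the producer runs faster (every 64 cycles) than the consumer (every 97 cycles).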

Architecture #2:

Split the image bitstream equally across 2 parallel paths, with replicated functional units

Equivalent to using 2 encoders, although the delay within each encoder still remains

Gross over-usage of silicon area, with additional overhead on the software too

Architecture #3:

Two Huffman blocks

Eliminates the bottleneck: removes the delay between feeding two blocks of data

The individual Huffman blocks are driven alternately:

1st Huffman block for every odd 8x8 pixel block

2nd Huffman block for every even 8x8 pixel block

Negligible loss in compression: each of the two Huffman blocks starts from its own separate first input block

Timing: 64 cycles to accumulate one block, 97 cycles in each Huffman block, plus some cycles for synchronization
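
The alternating dispatch of Architecture #3 can be sketched in Python as below. This models only the scheduling idea, not the Verilog control logic; the names are hypothetical. The constants restate the slide's figures: a new block arrives every 64 cycles and a Huffman block is busy for 97 cycles, so with two units each one receives a block only every 128 cycles and is free again in time.

```python
ACCUMULATE_CYCLES = 64   # cycles to accumulate one 8x8 block
HUFFMAN_CYCLES = 97      # cycles a Huffman block spends on one 8x8 block


def dispatch_blocks(blocks):
    """Send odd-numbered blocks (1st, 3rd, ...) to unit 1 and even-numbered to unit 2."""
    odd_unit, even_unit = [], []
    for number, block in enumerate(blocks, start=1):   # 1-based, as on the slide
        (odd_unit if number % 2 else even_unit).append(block)
    return odd_unit, even_unit


odd_unit, even_unit = dispatch_blocks(["A", "B", "C", "D", "E"])

# Each unit gets a new block every 2 * 64 cycles, which exceeds its 97-cycle busy time.
unit_keeps_up = 2 * ACCUMULATE_CYCLES >= HUFFMAN_CYCLES
```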

Implementation details

Transform the source image into the required bit stream for each pixel: R<7:0>, G<15:8>, B<23:16>

Process it in the design (hardware)

Generate the encoded bit stream for the pixel data

Reconstruct the image from the output of the hardware implementation

Conversion of image to R,G,B bitstream

In Matlab:

Read the image data using the imread() function

Generate a text file ‘bits.txt’ containing 24-bit data for every pixel

The file is properly formatted and supplied to the design via the test bench
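
The project did this conversion in Matlab (gen_bitstream.m, not reproduced in the slides). The Python sketch below shows the same idea under stated assumptions: the bit layout R<7:0>, G<15:8>, B<23:16> is taken from the slides, while the function names, the block-by-block ordering helper and the one-binary-word-per-line text format are guesses for illustration.

```python
def pack_rgb(r, g, b):
    """Pack 8-bit R, G, B into one 24-bit word: R in bits 7:0, G in 15:8, B in 23:16."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError("color components must be 8-bit values")
    return (b << 16) | (g << 8) | r


def block_order_words(image, block=8):
    """List the packed pixels block by block, row by row inside each block."""
    height, width = len(image), len(image[0])
    words = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            for y in range(top, top + block):
                for x in range(left, left + block):
                    words.append(pack_rgb(*image[y][x]))
    return words


def format_word(word):
    """Render a word as a 24-character binary string (assumed file format)."""
    return format(word, "024b")


# Hypothetical 8x16 test image whose pixel at (x, y) is (R=x, G=y, B=0).
image = [[(x, y, 0) for x in range(16)] for y in range(8)]
words = block_order_words(image)
lines = [format_word(w) for w in words]   # what would be written to the text file
```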

Flow: source image (.TIFF format, file size: 28 KB) -> gen_bitstream.m -> bits.txt -> supplied to the testbench

Simulation results (Existing architecture):

The ‘enable’ signal should be brought high when the data from the first pixel of the image is ready

The enable signal needs to stay high while the data is being input to the core

Each 8x8 block of data needs to be input to the core on 64 consecutive clock cycles

The core takes an additional 33 clocks to produce the JPEG bitstream for the 64 pixels of one input block

Overall clock consumption (for this example): 143,120,000 / 10,000 = 14312 clocks

Simulation results (New architecture):

Input alternates between the 2 Huffman encoder blocks

Introduced 2 data_ready signals, one for each of the two JPEG bitstreams coming out of the 2 Huffman encoder blocks

Overhead in synchronizing the two Huffman encoders: only eight clock cycles

Overall clock consumption: 107,120,000 / 10,000 = 10712 clocks
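
The two reported totals can be cross-checked with a simple per-block cycle model in Python. This is one reading of the slides, not a statement of how the core is built: it assumes the 8-cycle synchronization overhead is paid once per block, takes the block count of 144 from the result summary, and treats whatever remains as a fixed latency that the slides do not break down.

```python
BLOCKS = 144              # from the result summary (3600 / 144)
INPUT_CYCLES = 64         # cycles to feed one 8x8 block
HUFFMAN_DELAY = 33        # extra cycles per block in the existing design
SYNC_OVERHEAD = 8         # assumed per-block cost in the new design

REPORTED_EXISTING = 14312
REPORTED_NEW = 10712

existing_per_block = INPUT_CYCLES + HUFFMAN_DELAY    # 97
new_per_block = INPUT_CYCLES + SYNC_OVERHEAD         # 72

# Cycles not explained by the per-block model (same value for both designs).
fixed_existing = REPORTED_EXISTING - BLOCKS * existing_per_block
fixed_new = REPORTED_NEW - BLOCKS * new_per_block

saving_per_block = existing_per_block - new_per_block
```

Under these assumptions both designs leave the same residual of 344 cycles, and the per-block saving of 33 - 8 = 25 cycles agrees with the figure in the result summary.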

Synthesis results:

Report files: synthesis_report.txt, jpeg_top_map.mrp

Reconstructing the image

Ideal reconstruction: implement a decoder

Functionally complex (excessive design time)

Alternative way to verify functionality: software (Matlab)

Reconstruct the image using the generated bitstream, giving us the much-anticipated “JPEG image”

Image reconstruction performed in Matlab

Verify against the input image (quality and compression)

Image reconstruction (software):

JPEG_bitstream_odd.txt (JPEG bitstream, odd blocks) -> reconstruct

JPEG_bitstream_even.txt (JPEG bitstream, even blocks) -> reconstruct

Both reconstructions -> merge -> JPEG format

Matlab scripts: imagecomb.m, read_jpegstream.m
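
The merge step can be pictured with the Python sketch below. The contents of imagecomb.m and read_jpegstream.m are not shown in the slides, so this is only a guess at the kind of bookkeeping involved: it assumes each stream has already been decoded into a list of 8x8 blocks, and all function names are hypothetical.

```python
def interleave(odd_blocks, even_blocks):
    """Restore the original block order: 1st from odd, 2nd from even, and so on."""
    merged = []
    for i in range(len(odd_blocks) + len(even_blocks)):
        source = odd_blocks if i % 2 == 0 else even_blocks
        merged.append(source[i // 2])
    return merged


def assemble_image(blocks, blocks_per_row, block=8):
    """Place raster-ordered block x block tiles back into a 2-D image."""
    rows = []
    for first in range(0, len(blocks), blocks_per_row):
        band = blocks[first:first + blocks_per_row]
        for y in range(block):
            row = []
            for tile in band:
                row.extend(tile[y])
            rows.append(row)
    return rows


# Four made-up decoded blocks, each filled with its own index 0..3.
tiles = [[[k] * 8 for _ in range(8)] for k in range(4)]
merged = interleave([tiles[0], tiles[2]], [tiles[1], tiles[3]])
picture = assemble_image(merged, blocks_per_row=2)
```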

Original Image vs. JPEG Encoded Image

Original image: TIFF format, size 28 KB

JPEG encoded image: JPEG format, size 3 KB

Performance Comparison of Architectures

Existing:

Frequency: ~68 MHz

Total clocks consumed for the test image = 14312

Total area = 1 374 028.8 sq. μm (based on Design Vision synthesis)

New:

Frequency: ~68 MHz

Total clocks consumed for the test image = 10712

Total area = 1 634 796.8 sq. μm (based on Design Vision synthesis)

Result summary:

Overall savings in clock cycles (acceleration): 14312 - 10712 = 3600

Savings per 8x8 block = 3600 / 144 = 25 clock cycles

Overall increase in area (in terms of NAND1 gates)

= (1 634 796.8 / 1.8772) - (1 374 028.8 / 1.8772)

= 138 913.275 gates
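
The arithmetic above can be reproduced in a few lines of Python, which also gives the relative figures that the slide leaves implicit. All inputs are the numbers reported on the slides; the variable names are made up, and 1.8772 is taken to be the area of one NAND1 gate in sq. μm, as the slide's division implies.

```python
AREA_EXISTING_UM2 = 1_374_028.8
AREA_NEW_UM2 = 1_634_796.8
NAND1_AREA_UM2 = 1.8772        # divisor used on the slide
CLOCKS_EXISTING = 14312
CLOCKS_NEW = 10712

extra_area_um2 = AREA_NEW_UM2 - AREA_EXISTING_UM2
extra_gates = extra_area_um2 / NAND1_AREA_UM2

area_increase_pct = 100 * extra_area_um2 / AREA_EXISTING_UM2
cycle_reduction_pct = 100 * (CLOCKS_EXISTING - CLOCKS_NEW) / CLOCKS_EXISTING
speedup = CLOCKS_EXISTING / CLOCKS_NEW   # same clock frequency in both designs

print(f"extra gates:     {extra_gates:.3f}")
print(f"area increase:   {area_increase_pct:.1f} %")
print(f"cycle reduction: {cycle_reduction_pct:.1f} %")
print(f"speedup:         {speedup:.2f}x")
```

In relative terms, roughly 19% more area buys roughly 25% fewer clock cycles for the test image, at the same ~68 MHz clock.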

Change in power consumption ???

Design trade-offs

Existing implementation had a lot of dependency between functional blocks

Re-designing/pipelining the internal blocks is cumbersome

Adopted a revised “architectural” solution that uses multiple functional units

Improves speed of encoding

Costs more area and higher instantaneous power

A second chance?

Possibly look at pipelining individual blocks

Re-design Huffman block to reduce the internal dependency

Reconstruct image using JPEG Decoder

Accelerate the Decoding process as well

Besides starting early

Questions??

Back up

Mapping onto an FPGA wasn’t successful due to too many cells – ran out of space!

Breakdown of work performed:

Anup Joshi and Chandra Prakash:

Architecture with 2 encoders

Architecture with buffer

Synchronizing 2 Huffman blocks in proposed architecture

Synthesis of encoder

Karthick Santhanam and Pratap Ramanathan:

Analysis of open source code

Architecture with 2 Huffman blocks

Matlab code for generating input bit stream

Matlab code for combining bit stream outputs

Architecture #2:

But after the design we realized that the Huffman encoder was the bottleneck

No point in making the Quantizer’s output wait at the ‘already slow’ stage

Lessons learnt: Identify the initial bottlenecks, DO NOT WASTE TIME

Lossy compression with quantization factor 10: JPEG format, size 1 KB