ARINDAM GOSWAMI ERIC HUNEKE MERT USTUN ADVANCED EMBEDDED SYSTEMS ARCHITECTURE SPRING 2011 HW/SW Implementation of JPEG Decoder

Post on 21-Dec-2015


Arindam Goswami, Eric Huneke, Mert Ustun
Advanced Embedded Systems Architecture
Spring 2011

HW/SW Implementation of JPEG Decoder

Division of Labor

Software
  Profiling – Arindam/Eric
  Timing analysis – Arindam/Eric
  Interface to hardware – Arindam
  Test data for hardware – Eric

Hardware – Mert
  C to Verilog Conversion
  Scheduling & Resource Allocation on FPGA
  Bus Communication Interface

Outline

What is JPEG?
Project Description
JPEG Algorithm
Profile Data
Software Design
Hardware Design
Results
Conclusion

What is JPEG?

Image codec released by the Joint Photographic Experts Group in 1992
  Joint committee between the ISO/IEC JTC1 and ITU-T standards committees
Informally used to describe the file format JPEG-encoded images are packed in
  The file format specified in the original standard, JPEG Interchange Format (JIF), is rarely used
  Exif and JFIF, both based on JIF, are commonly used

What is JPEG? (cont.)

Optimized for realistic images and photographs
  Color transitions should be smooth for best results
Lossy compression, which can be tuned to trade quality against size
  Up to 20:1 without visible loss in quality for appropriate images
  Better ratios than other algorithms such as GIF, but slower to compress and decompress
  Has a lossless mode, but it is not widely used

Project Description

Selected an existing software JPEG implementation that we could modify to increase performance
Criteria
  Small enough to be easily understood and modified
  Reasonably fast, but not optimized

Project Description (cont.)

The most common JPEG implementation is libjpeg, from the Independent JPEG Group
  Fast, but hard to modify due to its complexity
Various other open source implementations
  Tiny Jpeg Decoder
  jpeg-compressor

Project Description (cont.)

We ended up choosing NanoJPEG, written by Martin Fiedler
  Reasonably fast, but not optimized
  Very small code size (< 1000 lines) in a single file
  Easy to understand
I/O
  Decompresses grayscale or YCbCr images
  Outputs grayscale or RGB raw images
Other details
  Written in C
  No floating point

JPEG Algorithm

Step 1
Convert the image to the YCbCr color space (typically from RGB)
  Y for brightness
  Cb and Cr for blue and red color components
The human eye is less sensitive to color changes than it is to brightness changes
  JPEG takes advantage of this
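
As a rough illustration of Step 1, the sketch below converts one RGB pixel to YCbCr with integer arithmetic only. The weights are the usual JFIF ones scaled by 256. The type and function names are made up for this example and are not taken from NanoJPEG.

```c
#include <stdint.h>

/* Hypothetical pixel type for this sketch (not from NanoJPEG). */
typedef struct { uint8_t y, cb, cr; } ycbcr_t;

/* Clamp an intermediate result to the 0..255 range of a sample. */
static uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* RGB -> YCbCr using the JFIF weights scaled by 256:
 *   Y  =  0.299 R + 0.587 G + 0.114 B
 *   Cb = -0.169 R - 0.331 G + 0.500 B + 128
 *   Cr =  0.500 R - 0.419 G - 0.081 B + 128
 * The +128 term rounds to nearest. The 128 * 256 term is the chroma offset,
 * added before the shift so the shifted value is never negative. */
static ycbcr_t rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b) {
    ycbcr_t p;
    p.y  = clamp_u8(( 77 * r + 150 * g +  29 * b + 128) >> 8);
    p.cb = clamp_u8((-43 * r -  85 * g + 128 * b + 128 * 256 + 128) >> 8);
    p.cr = clamp_u8((128 * r - 107 * g -  21 * b + 128 * 256 + 128) >> 8);
    return p;
}
```

Any gray pixel (R = G = B) maps to Cb = Cr = 128, so all of its information ends up in Y. That is what lets the later steps treat brightness and color differently.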

JPEG Algorithm (cont.)

Step 2
Downsample the color data (CbCr) by averaging horizontally and vertically
  Factor of two horizontally
  Factor of one or two vertically
  Data can thus be reduced by 1/3 or 1/2
Imperceptible loss in quality
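
A minimal sketch of the averaging in Step 2, for the case where a chroma plane is reduced by two in both directions. It assumes an even width and height and a row-major plane; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Downsample one chroma plane by two in both directions by averaging each
 * 2x2 group of samples. Assumes (for this sketch) even width and height.
 * src holds w*h samples, dst receives (w/2)*(h/2) samples, both row-major. */
static void downsample_2x2(const uint8_t *src, size_t w, size_t h, uint8_t *dst) {
    for (size_t y = 0; y < h; y += 2) {
        for (size_t x = 0; x < w; x += 2) {
            unsigned sum = src[y * w + x] + src[y * w + x + 1]
                         + src[(y + 1) * w + x] + src[(y + 1) * w + x + 1];
            dst[(y / 2) * (w / 2) + x / 2] = (uint8_t)((sum + 2) / 4); /* rounded mean */
        }
    }
}
```

With Y kept at full resolution and Cb and Cr each reduced to a quarter of their samples, the three planes shrink from 3 units of data to 1 + 1/4 + 1/4 = 1.5, a reduction of 1/2. Downsampling by two horizontally only gives 1 + 1/2 + 1/2 = 2, a reduction of 1/3.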

JPEG Algorithm (cont.)

Step 3
For each component, split the pixel data into 8x8 blocks
Run each block through a discrete cosine transform (DCT)
End up with a matrix containing one DC value and 63 AC components
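
For reference, the standard 8x8 forward DCT can be written as below. The symbols are chosen for this note rather than taken from the slides: f(x, y) is the sample at position (x, y) of the block, and F(u, v) is the resulting coefficient. F(0, 0) is the DC value and the other 63 entries are the AC components.

```latex
\[
F(u,v) = \frac{1}{4}\, C(u)\, C(v)
         \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\,
         \cos\frac{(2x+1)\,u\,\pi}{16}\,
         \cos\frac{(2y+1)\,v\,\pi}{16},
\qquad
C(k) =
\begin{cases}
  1/\sqrt{2} & k = 0 \\
  1          & k > 0
\end{cases}
\]
```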

JPEG Algorithm (cont.)

Step 4
Divide each cell of the matrix by the corresponding value in a quantization matrix, then round to the nearest integer
The values in the quantization matrix are tunable
  The larger the values, the more cells are reduced to zero, and hence lost
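
The divide-and-round of Step 4 could look roughly like the sketch below. The function names and the 64-entry array layout are assumptions for illustration; they do not come from any particular encoder.

```c
#include <stdint.h>

/* Divide one DCT coefficient by its quantization step and round to the
 * nearest integer (halves round away from zero). q must be positive. */
static int16_t quantize_coef(int coef, int q) {
    return (int16_t)(coef >= 0 ? (coef + q / 2) / q : -((-coef + q / 2) / q));
}

/* Quantize a whole 8x8 block (64 coefficients, same order as the table). */
static void quantize_block(const int16_t coef[64], const uint8_t qtab[64],
                           int16_t out[64]) {
    for (int i = 0; i < 64; ++i)
        out[i] = quantize_coef(coef[i], qtab[i]);
}
```

With a step of 16, a coefficient of 100 becomes 6 and a coefficient of 7 becomes 0, which is how small high-frequency terms disappear.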

JPEG Algorithm (cont.)

Step 5
Take the reduced blocks and perform Huffman encoding (or arithmetic encoding) to eliminate redundant values
  Lossless compression
Step 6
Wrap data in a standard file format, along with compression data including quantization and Huffman tables
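
To make Step 6 a little more concrete, the sketch below walks the marker segments at the start of a JPEG stream and counts the segments that carry quantization tables (DQT, 0xFFDB) and Huffman tables (DHT, 0xFFC4). The marker values are those of the JPEG standard. The function and type names are hypothetical, and the parser is deliberately simplified.

```c
#include <stddef.h>
#include <stdint.h>

/* Table counts found while walking the header segments of a JPEG stream. */
typedef struct { int dqt, dht; } table_counts_t;

/* Walk the marker segments from SOI up to SOS and count the DQT and DHT
 * segments. Returns 0 on success, -1 if the data is not laid out as expected.
 * Simplified: fill bytes and other irregularities are not handled. */
static int count_tables(const uint8_t *p, size_t n, table_counts_t *out) {
    size_t pos = 2;
    out->dqt = out->dht = 0;
    if (n < 2 || p[0] != 0xFF || p[1] != 0xD8) return -1;    /* SOI */
    while (pos + 4 <= n) {
        if (p[pos] != 0xFF) return -1;
        uint8_t marker = p[pos + 1];
        size_t len = ((size_t)p[pos + 2] << 8) | p[pos + 3]; /* counts its own 2 bytes */
        if (len < 2 || pos + 2 + len > n) return -1;
        if (marker == 0xDB) out->dqt++;
        if (marker == 0xC4) out->dht++;
        if (marker == 0xDA) return 0;      /* SOS: entropy-coded data follows */
        pos += 2 + len;
    }
    return -1;                             /* ran out of data before SOS */
}
```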

JPEG Algorithm (cont.)

Decoding is simply the reverse of the encoding process
  Get the reduced matrices back
  Multiply them by the quantization matrix
  Run an inverse DCT (IDCT)
  Upsample
  Convert to RGB
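
The "multiply by the quantization matrix" step is small enough to show in full. This is only a sketch with a hypothetical function name; NanoJPEG itself may fold this multiplication into its block decoding.

```c
#include <stdint.h>

/* Multiply each decoded coefficient by the matching quantization table entry.
 * The result only approximates the original DCT block, because the rounding
 * done at encode time cannot be undone. */
static void dequantize_block(const int16_t in[64], const uint8_t qtab[64],
                             int out[64]) {
    for (int i = 0; i < 64; ++i)
        out[i] = in[i] * qtab[i];
}
```

For example, a coefficient of 100 quantized with a step of 16 is stored as 6 and comes back as 96. That difference is the lossy part of the codec.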

Profile Data

Profiled NanoJPEG on a sample image with the armsd simulator
55.10% of total time spent upsampling and converting the image to RGB
  Logically separate from the decode phase
38.34% of total time spent decoding the 8x8 blocks
  So block decoding is really 85.39% of the time not spent converting/upsampling
Row and column IDCTs were about half of the block decode time
  Our main focus for speedup, since they took about 42% of decode time and were an obvious candidate for FPGA implementation
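
The 85.39% figure follows from the two measured percentages. The second line is an added, idealized bound and not a measurement from the slides: it assumes the 42% refers to the decode phase excluding conversion, and that the IDCT time could be removed entirely.

```latex
\[
\frac{38.34}{100 - 55.10} = \frac{38.34}{44.90} \approx 85.39\%
\]
\[
S_{\max} = \frac{1}{1 - 0.42} \approx 1.72
\quad \text{(upper bound on decode-phase speed-up if the IDCTs cost nothing)}
\]
```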

Software Design

[Code listing: block decoding code, with the row and column IDCT calls]

Software Design (cont.)

[Code listings: Row IDCT, Column IDCT]

Software Design (cont.)

Interface
  Write 8x8 integers to FPGA addresses D3000100-1FF
  Read 8x8 integers from D3000200-2FF (output of RowIDCT)
  Read 8x8 bytes from D3000300-33F (output of ColIDCT)
Code
  Replace calls to IDCT functions with reads/writes to FPGA addresses
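
A sketch of what replacing the IDCT calls with memory-mapped accesses might look like, using the address map above. Several things here are assumptions rather than details from the slides: the function name, the idea that results are valid as soon as they are read (no status polling), and the packing of four output pixels per 32-bit word with the first pixel in the low byte. The base pointer is a parameter so the routine can also be run against an ordinary buffer on a host machine.

```c
#include <stdint.h>

/* Word offsets of the three windows relative to the FPGA base address
 * (0xD3000000 on the target, per the address map above). */
enum {
    IDCT_IN_WORD      = 0x100 / 4,  /* 64 x 32-bit input coefficients */
    IDCT_ROW_OUT_WORD = 0x200 / 4,  /* 64 x 32-bit row IDCT results   */
    IDCT_COL_OUT_WORD = 0x300 / 4   /* 64 bytes, read as 16 words     */
};

/* Hypothetical replacement for the software IDCT calls: push one 8x8 block
 * to the FPGA, then read back the row IDCT words and the column IDCT bytes.
 * Assumes each output word packs four pixels, first pixel in the low byte. */
static void fpga_idct_block(volatile uint32_t *base, const int32_t block[64],
                            int32_t row_out[64], uint8_t pixels[64]) {
    for (int i = 0; i < 64; ++i)                 /* 64 writes */
        base[IDCT_IN_WORD + i] = (uint32_t)block[i];
    for (int i = 0; i < 64; ++i)                 /* 64 reads  */
        row_out[i] = (int32_t)base[IDCT_ROW_OUT_WORD + i];
    for (int i = 0; i < 16; ++i) {               /* 16 reads  */
        uint32_t w = base[IDCT_COL_OUT_WORD + i];
        for (int b = 0; b < 4; ++b)
            pixels[4 * i + b] = (uint8_t)(w >> (8 * b));
    }
}
```

On the target this would be called with `(volatile uint32_t *)0xD3000000` as the base. The loop counts match the bus transactions counted elsewhere in the slides: 64 writes plus (64 + 16) reads per block.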

Hardware Design - Architecture

[Block diagram labels: AMBA BUS; BUS COMM. IF; 8x8x32b BLOCK Register File; ROW IDCT; IDCT CORE; COL IDCT; 8x8x8b COL_OUT Register File]

1. ARM writes row 0
2. Row IDCT: row 0, while ARM writes row 1
3. …
4. Row IDCT: row 7, while ARM reads row 0
5. Col IDCT: col 0 - 7, while ARM reads rest of the block
6. ARM reads colIDCT results

Hardware Design - Optimizations

Register files are used instead of RAMs to allow random access to any word in the block matrix
Arithmetic operations are distributed over multiple stages to share resources and therefore reduce area
Column IDCT and Row IDCT have a lot of operations in common
  Use only a single datapath for both: the Core IDCT
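
The reason one datapath can serve both passes is that the 2-D IDCT is separable: the same 8-point transform is applied to every row and then to every column. The sketch below shows that idea with a plain floating-point reference. It is not the integer algorithm used by NanoJPEG and not the Verilog core described here, and all names are hypothetical.

```c
/* cos(k*pi/16) for k = 0..8; other angles follow by symmetry. */
static const double COS16[9] = {
    1.0, 0.9807852804, 0.9238795325, 0.8314696123, 0.7071067812,
    0.5555702330, 0.3826834324, 0.1950903220, 0.0
};

static double cos_pi16(int k) {
    k %= 32;                                    /* period is 2*pi = 32 steps */
    if (k > 16) k = 32 - k;                     /* cos(2*pi - t) = cos(t)    */
    return k <= 8 ? COS16[k] : -COS16[16 - k];  /* cos(pi - t) = -cos(t)     */
}

/* One 8-point IDCT over v[0], v[stride], ..., v[7*stride], in place.
 * This is the single "core" reused for both passes. */
static void idct_1d(double *v, int stride) {
    double in[8];
    for (int u = 0; u < 8; ++u) in[u] = v[u * stride];
    for (int x = 0; x < 8; ++x) {
        double sum = in[0] * COS16[4];          /* C(0) = 1/sqrt(2) */
        for (int u = 1; u < 8; ++u)
            sum += in[u] * cos_pi16((2 * x + 1) * u);
        v[x * stride] = sum / 2.0;
    }
}

/* 2-D IDCT of a row-major 8x8 block: eight row passes, eight column passes. */
static void idct_8x8(double block[64]) {
    for (int r = 0; r < 8; ++r) idct_1d(&block[8 * r], 1);  /* rows    */
    for (int c = 0; c < 8; ++c) idct_1d(&block[c], 8);      /* columns */
}
```

Only the stride differs between the row and column passes. In hardware the two passes can therefore share one arithmetic datapath and differ only in how operands are fetched and results stored.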

Hardware Design – Core IDCT

[Figure: Row IDCT, Column IDCT]

Hardware Design – Optimizations (2)

The hardware speed is limited by the ARM-FPGA bus transactions (block transfers)
Optimize the bus state machine
  Started with the 6-state bus machine of Lab 2
  Reduced it to only 3 states
Total number of FPGA cycles to process one 8x8 block
  3 x (64 writes + (64 + 16) reads) = 432 cycles
  432 cycles for 8 row and 8 column IDCTs

Results

Hardware produces correct outputs in simulation
Integrated system does not yet match simulation
Communication overhead between ARM and FPGA is the major bottleneck
Expected speed-up
  ARM: 8 x 60 + 8 x 120 = 1440 ARM cycles (optimistic approximation)
  FPGA: 3 x (64 writes + (64 + 16) reads) = 432 FPGA cycles
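
Worked out, the two cycle counts give the ratio below. Reading it as a speed-up assumes that an ARM cycle and an FPGA cycle take the same time. The slides do not state the clock frequencies, so the figure is only a cycle-count ratio.

```latex
\[
\text{ARM: } 8 \times 60 + 8 \times 120 = 480 + 960 = 1440 \text{ cycles}
\]
\[
\text{FPGA: } 3 \times \bigl(64 + (64 + 16)\bigr) = 3 \times 144 = 432 \text{ cycles}
\]
\[
\frac{1440}{432} \approx 3.33
\]
```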

Conclusion

Work completed
  Parallelized IDCT routines for each block decode in FPGA
Work to be completed
  Get interface working
What we would have done differently
  Use DMA to reduce communication overhead even more
  Parallelize ARM and FPGA block processing
  Additional speed-up possible by moving njConvert (upsampling & color conversion) into FPGA

References

Joint Photographic Experts Group: http://www.jpeg.org/jpeg/index.html
Introduction to JPEG: http://www.faqs.org/faqs/compression-faq/part2/
NanoJPEG: http://keyj.s2000.ws/?p=137

Questions?