

Evaluation and Hardware Implementation of

Real-Time Color Compression Algorithms

Master’s Thesis

Division of Electronics Systems

Department of Electrical Engineering

Linköping University

By

Ahmet Caglar

Amin Ojani

Report number: LiTH-ISY-EX--08/4265--SE

Linköping, December 2008


Evaluation and Hardware Implementation of

Real-Time Color Compression Algorithms

Master’s Thesis

Division of Electronics Systems

Department of Electrical Engineering

at Linköping Institute of Technology

By

Ahmet Caglar

Amin Ojani

LiTH-ISY-EX--08/4265--SE

Supervisor: Henrik Ohlsson,

Ericsson Mobile Platforms (EMP)

Examiner: Oscar Gustafsson,

Electronics Systems, Linköping University

Linköping, December 2008


Presentation Date 2008-12-16

Publishing Date (Electronic version)

Department and Division

Department of Electrical Engineering Division of Electronic Systems

URL, Electronic Version http://www.ep.liu.se

Publication Title Evaluation and Hardware Implementation of Real-Time Color Compression Algorithms

Author(s) Amin Ojani, Ahmet Caglar

Abstract: A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware accelerated 3D graphics, a large part of the memory accesses is due to large and frequent color buffer data transfers. In a graphics hardware block, color data is typically processed using the RGB color format. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate a pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption. Therefore, it is important to minimize the amount of color buffer data. One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. This compression/decompression must be done “on-the-fly”, i.e., it has to be very fast so that the hardware accelerator does not have to wait for data. In this thesis, we investigated several exact (lossless) color compression algorithms from a hardware implementation point of view, to be used in high-throughput hardware. Our study shows that the compression/decompression datapath is well implementable even with stringent area and throughput constraints. However, memory interfacing of these blocks is more critical and could be dominating.

Keywords Graphics Hardware, Color Compression, Image Compression, Mobile Graphics, Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.

Language: English

Number of Pages 88

Type of Publication: Degree thesis

ISBN (Licentiate thesis)

ISRN: LiTH-ISY-EX--08/4265--SE

Title of series (Licentiate thesis)

Series number/ISSN (Licentiate thesis)


Abstract

A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware accelerated 3D graphics, a large part of the memory accesses is due to large and frequent color buffer data transfers. In a graphics hardware block, color data is typically processed using the RGB color format. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate a pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption. Therefore, it is important to minimize the amount of color buffer data. One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. This compression/decompression must be done “on-the-fly”, i.e., it has to be very fast so that the hardware accelerator does not have to wait for data. In this thesis, we investigated several exact (lossless) color compression algorithms from a hardware implementation point of view, to be used in high-throughput hardware. Our study shows that the compression/decompression datapath is well implementable even with stringent area and throughput constraints. However, memory interfacing of these blocks is more critical and could be dominating.

Keywords: Graphics Hardware, Color Compression, Image Compression, Mobile Graphics,

Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.


Acknowledgements

First, we would like to express our gratitude and appreciation to our supervisor Dr. Henrik Ohlsson from Ericsson Mobile Platforms (EMP) for his valuable guidance and discussions.

We would also like to thank our supervisor from Electronics Systems at Linköping University, Dr. Oscar Gustafsson, for his great support and recommendations.

Finally, our deepest thanks go to our beloved parents for their everlasting support and encouragement throughout our educational years. This thesis is dedicated to them.


Table of Contents

CHAPTER 1
1 INTRODUCTION
  1.1 COLOR BUFFER AND GRAPHICS HARDWARE
  1.2 COLOR BUFFER COMPRESSION VS. IMAGE COMPRESSION
  1.3 STRUCTURE OF THE REPORT

CHAPTER 2
2 LOSSLESS COMPRESSION ALGORITHMS
  2.1 INTRODUCTION
  2.2 THEORETICAL BACKGROUND OF LOSSLESS IMAGE COMPRESSION
    2.2.1 JPEG-LS Algorithm
  2.3 REFERENCE LOSSLESS COMPRESSION ALGORITHM
    2.3.1 Color Transform and Reverse Color Transform
    2.3.2 Predictor and Constructor
    2.3.3 Golomb-Rice Encoder
    2.3.4 Golomb-Rice Decoder
  2.4 GOLOMB-RICE ENCODING OPTIMIZATION
    2.4.1 Proposed method for exhaustive search solution
    2.4.2 Estimation method
  2.5 IMPROVED LOSSLESS COLOR BUFFER COMPRESSION ALGORITHM
    2.5.1 Modular Reduction
    2.5.2 Embedded Alphabet Extension (Run-length Mode)
    2.5.3 Previous Header Flag
  2.6 COMPRESSION PERFORMANCES OF ALGORITHMS
  2.7 POSSIBLE FUTURE ALGORITHMIC IMPROVEMENTS
    2.7.1 Pixel Reordering
    2.7.2 Spectral Predictor
    2.7.3 CALIC Predictor
    2.7.4 Context Information

CHAPTER 3
3 COLOR BUFFER COMPRESSION/DECOMPRESSION HARDWARE
  3.1 DESIGN CONSTRAINTS
  3.2 COMPRESSOR BLOCK
    3.2.1 Addr_Gen1 (Source memory address generator)
    3.2.2 Color_T (Color Transformer)
    3.2.3 Pred_RegFile_Ctrl (Prediction Register File Controller)
    3.2.4 Predictor
    3.2.5 Enc_RegFile_Ctrl (Golomb-Rice Encoder Register File Controller)
    3.2.6 GR_Encoder (Golomb-Rice Encoder)
      3.2.6.1 GR_k Block (Golomb-Rice Parameter Estimation)
      3.2.6.2 Enc Block (Encoding Block)
      3.2.6.3 GR_ctrl (Golomb-Rice Control Block)
    3.2.7 Data_Packer (Variable Bit Length Packer to Memory Word)
    3.2.8 Addr_Gen2 (Destination memory address generator)
    3.2.9 Compressor_Ctrl (Control Path)
    3.2.10 Overall Compressor Datapath and Address Generation
  3.3 DECOMPRESSOR BLOCK
    3.3.1 Addr_Gen2 (Source memory address generator)
    3.3.2 Rev_Color_T (Reverse Color Transformer)
    3.3.3 Const_RegFile_Ctrl (Construction Register File Controller)
    3.3.4 Constructor
    3.3.5 Dec_RegFile_Ctrl (Golomb-Rice Decoder Register File Controller)
    3.3.6 GR_Decoder (Golomb-Rice Decoder)
    3.3.7 Data_Unpacker (Variable Bit Length Unpacker from Memory Word)
    3.3.8 Addr_Gen1 (Destination memory address generator)
    3.3.9 Decompressor_Ctrl (Control Path)
    3.3.10 Overall Decompressor Datapath and Address Generation
  3.4 FUNCTIONAL VERIFICATION FRAMEWORK
  3.5 SYNTHESIS RESULTS
  3.6 EVALUATION OF OTHER HARDWARE IMPLEMENTATIONS
    3.6.1 Parallel pipeline Implementation of LOCO-I for JPEG-LS [17]
    3.6.2 Benchmarking and Hardware Implementation of JPEG-LS [18]
    3.6.3 A Lossless Image Compression Technique Using Simple Arithmetic Operations [19]
    3.6.4 A Low power, Fully Pipelined JPEG-LS Encoder for Lossless Image Compression [11]
    3.6.5 Hardware Implementation of a Lossless Image Compression Algorithm Using a FPGA [20]
    3.6.6 Comparison

CHAPTER 4
4 CONCLUSION
  4.1 WORKFLOW
  4.2 RESULTS AND OUTCOMES
  4.3 FUTURE WORK

REFERENCES

APPENDIX A
  PROPOSED COST REDUCTION METHOD ANALYSIS
    A.1 Overlap-limited Search
    A.2 Remainder-Based Correction

APPENDIX B
  TEST IMAGE SETS
    B.1 Standard Photographic Test Images
    B.2 Computer Generated Test Scenes
    B.3 Computer Generated User Menu Scenes


Table of Figures

FIGURE 1: COMPRESSOR/DECOMPRESSOR HARDWARE ON MEMORY INTERFACE
FIGURE 2: ERROR ACCUMULATION DUE TO TANDEM COMPRESSION
FIGURE 3: COMPRESSION / DECOMPRESSION FUNCTIONAL BLOCKS
FIGURE 4: COLOR TRANSFORM / REVERSE COLOR TRANSFORM BLOCK INTERFACE
FIGURE 5: COLOR TRANSFORM / REVERSE COLOR TRANSFORM OPERATION FLOW GRAPH
FIGURE 6: MEDIAN EDGE DETECTOR (MED) PREDICTOR PREDICTION WINDOW
FIGURE 7: PREDICTOR / CONSTRUCTOR BLOCK INTERFACE
FIGURE 8: PREDICTOR / CONSTRUCTOR OPERATION FLOW GRAPH
FIGURE 9: ENCODED DATA IN THE STREAM
FIGURE 10: ENCODED DATA FOR (2, 0, 13, 3) AND K = 2
FIGURE 11: GOLOMB-RICE ENCODER FUNCTIONAL BLOCKS
FIGURE 12: GOLOMB-RICE PARAMETER EXHAUSTIVE SEARCH HARDWARE
FIGURE 13: A POSSIBLE GOLOMB-RICE ENCODER HARDWARE
FIGURE 14: A POSSIBLE GOLOMB-RICE DECODER HARDWARE
FIGURE 15: HW-COST VS. NUMBER OF INPUT SAMPLES (N)
FIGURE 16: HW-COST VS. NUMBER OF PARAMETERS (K)
FIGURE 17: HW IMPLEMENTATION OF THE NEW COMBINED METHOD
FIGURE 18: ILLUSTRATION OF MODULAR REDUCTION
FIGURE 19: CALIC GAP PREDICTION WINDOW
FIGURE 20: COMPRESSOR BLOCK
FIGURE 21: MEMORY MAPPING AND CORRESPONDING PIXELS OF THE IMAGE
FIGURE 22: TRAVERSAL IN PREDICTION WINDOW
FIGURE 23: ADDRESS GENERATOR I INTERFACE
FIGURE 24: ADDRESS GENERATOR I HARDWARE DIAGRAM
FIGURE 25: COLOR TRANSFORM HARDWARE DIAGRAM
FIGURE 26: PREDICTION REGISTER FILE CONTROLLER INTERFACE
FIGURE 27: CHANGE OF PREDICTION WINDOW FOR PIXELS OF ONE SUBTILE
FIGURE 28: STATES AND REGISTER INPUT CONNECTIVITY IN PREDICTION REGISTER FILE CONTROLLER
FIGURE 29: MED PREDICTION HARDWARE FOR BOTH PREDICTOR AND CONSTRUCTOR
FIGURE 30: PREDICTOR BLOCK HARDWARE DIAGRAM
FIGURE 31: ENCODER REGISTER FILE CONTROLLER BLOCK INTERFACE
FIGURE 32: GOLOMB-RICE ENCODER BLOCK DIAGRAM
FIGURE 33: K-PARAMETER ESTIMATION HARDWARE
FIGURE 34: GOLOMB-RICE ENCODER REALIZATION
FIGURE 35: P3 BLOCK, BASIC HARDWARE REALIZATION
FIGURE 36: PACKED DATA ORDER FORMAT IN THE MEMORY
FIGURE 37: DATA PACKER
FIGURE 38: DESTINATION MEMORY ADDRESS GENERATOR BLOCK INTERFACE
FIGURE 39: CONTROL PATH BLOCK INTERFACE
FIGURE 40: OVERALL COMPRESSOR
FIGURE 41: DECOMPRESSOR BLOCK
FIGURE 42: SOURCE MEMORY ADDRESS GENERATOR BLOCK INTERFACE
FIGURE 43: REVERSE COLOR TRANSFORM HARDWARE DIAGRAM
FIGURE 44: CONSTRUCTION REGISTER FILE CONTROLLER INTERFACE
FIGURE 45: STATES AND REGISTER INPUT CONNECTIVITY IN CONSTRUCTION REGISTER FILE CONTROLLER
FIGURE 46: CONSTRUCTOR BLOCK HARDWARE DIAGRAM
FIGURE 47: DECODER REGISTER FILE CONTROLLER BLOCK INTERFACE
FIGURE 48: GOLOMB-RICE DECODER HARDWARE
FIGURE 49: DATA UNPACKER INTERFACE AND BLOCK DIAGRAM
FIGURE 50: READ / WRITE ADDRESSES FROM/TO DESTINATION MEMORY TO CONSTRUCT ONE SUBTILE
FIGURE 51: ACTUAL ADDRESSING SCHEME FOR DESTINATION MEMORY ADDRESSES
FIGURE 52: DESTINATION MEMORY ADDRESS GENERATOR BLOCK INTERFACE
FIGURE 53: OVERALL DECOMPRESSOR
FIGURE 54: VERIFICATION FRAMEWORK FSM
FIGURE 55: FUNCTIONAL VERIFICATION FRAMEWORK
FIGURE 56: ONE BLOCK OF N VALUES
FIGURE 57: OVERLAP REGIONS OF CONSECUTIVE LENGTH FUNCTIONS WITH RESPECT TO ET
FIGURE 58: OVERLAP REGIONS BETWEEN LENGTH FUNCTIONS L1, L2, L3, L4
FIGURE 59: OVERLAP REGIONS FOR N=4 AND K= {0, 1, 2, 3, 4, 5, 6} WITH RESPECT TO ET
FIGURE 60: REQUIRED COMPARISONS OF OVERLAP REGIONS FOR N=4, K= {0, 1, 2, 3, 4, 5, 6} BASED ON ET
FIGURE 61: OVERLAP REGIONS OF NON-CONSECUTIVE LENGTH FUNCTIONS WITH RESPECT TO ET
FIGURE 62: MOTIVATION BEHIND REMAINDER-BASED CORRECTION


List of Tables

TABLE 1: ENCODED OUTPUT LENGTHS FOR EACH K-PARAMETER
TABLE 2: LOGIC COST OF FUNCTIONAL BLOCKS
TABLE 3: HW COST COMPARISON OF EXHAUSTIVE SEARCH AND NEW COMBINED METHOD
TABLE 4: ESTIMATION INTERVALS ACCORDING TO SUM OF INPUTS
TABLE 5: HW COST AND COMPRESSION RATIO OF ESTIMATION METHOD
TABLE 6: COMPARISON OF COMPRESSION PERFORMANCES
TABLE 7: COMPRESSOR BLOCK INTERFACE PORT DESCRIPTION
TABLE 8: SOURCE MEMORY ADDRESS GENERATOR ADDRESSING SCHEME
TABLE 9: ESTIMATION FUNCTION
TABLE 10: HEADER FORMAT GENERATED BY GR_CTRL BLOCK
TABLE 11: DECOMPRESSOR BLOCK INTERFACE PORT DESCRIPTION
TABLE 12: DESTINATION MEMORY ADDRESS GENERATOR ADDRESSING SCHEME
TABLE 13: COMPRESSOR SYNTHESIS RESULT
TABLE 14: DECOMPRESSOR SYNTHESIS RESULT
TABLE 15: CHARACTERISTICS OF DIFFERENT HARDWARE IMPLEMENTATIONS
TABLE 16: COMPARISON OF COST ESTIMATIONS AND ACTUAL SIZES FOR BLOCKS


Chapter 1

1 Introduction

A major bottleneck, for performance as well as power consumption, for graphics hardware in

mobile devices is the amount of data that needs to be transferred to and from memory. In, for

example, hardware accelerated 3D graphics, a large part of the memory accesses is due to large and frequent color buffer data transfers. Therefore, it is important to minimize the amount of color

buffer data.

In a graphics hardware block (for example image composition, 3D graphics rasterization), color

data is typically processed using the RGB color format. Depending on the color resolution of the image, 8, 12, 16, or 32 bits could be used to represent one pixel. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate a pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption.

One way of reducing the memory bandwidth required is to compress the color data before writing

it to memory and decompress it before using it in the graphics hardware block. Figure 1 shows

the location of the compressor/decompressor hardware with respect to the graphics hardware block and the memory. The compressor/decompressor hardware will help reduce the data traffic on the memory interface, shown with arrows in the figure. The reduction in memory bandwidth can be used to minimize power consumption (reduced access to the memory bus), to increase performance (more data traffic with the same memory bandwidth), or a combination of the two. Hence, a better trade-off

between power and performance can be found depending on the design constraints.



Figure 1: Compressor/Decompressor hardware on memory interface

Hardware implementation of such a compressor/decompressor is the subject of this work. Our thesis, based on a reference color buffer compression algorithm [1], aims at:

− Evaluation of color buffer compression algorithms with respect to hardware

implementation properties,

− VHDL implementation of a selected algorithm in order to validate the hardware cost

estimations.

Accordingly, the thesis has been carried out in two phases. In the first phase, the following tasks were performed:

− Analysis of the problem and modeling of the reference algorithm,

− Evaluation of the proposed solution with respect to both compression performance and

implementation properties,

− Exploration of algorithmic and hardware optimizations to improve both compression

performance and implementation cost,

− Decision of the final algorithm to be implemented.

The second phase of the thesis work is dedicated to the hardware implementation in VHDL and verification of the algorithm decided on in the first phase, and to the completion of the thesis report.

1.1 Color buffer and graphics hardware

The color buffer refers to a portion of memory where the actual pixel data to be sent to the display is stored. Graphics hardware uses this buffer during rasterization. Depending on the rasterizer architecture, this buffer can be accessed in different ways. In traditional immediate mode rendering, each triangle is rendered as soon as it arrives. Hence, for every triangle that is drawn, the related pixel data are written to the buffer unless the triangle is completely hidden. On the other hand, for tiled, deferred rendering architectures, the color buffer is written when a complete tile (a unit of w × h pixels) is finished. Hence, only visible color writes are performed, which reduces the overall color buffer bandwidth. A more detailed explanation of the topic can be found in [2].


1.2 Color buffer compression vs. image compression

Color buffer data compression, as a specific application of general data compression, shares many similarities with image compression. Consequently, the theory developed for image

compression is well-suited to be used for compressing color buffer data in 3D graphics hardware.

Specifically, correlation between neighboring pixel values is also valid for color buffer data and

can be used as a basis for compression.

On the other hand, there are important differences between color buffer data compression and

image compression. First of all, most of the image compression algorithms in literature have been

developed for continuous-tone still images. Their compression results have been customarily

based on some set of well-known test images. Those images are real (photographic) images and it

is harder to get information about the performance of image compression algorithms on computer

generated images. Secondly and more importantly, most image compression algorithms assume

the availability of a whole and completed image. For example most (if not all) of the state-of-the

art image compression algorithms are adaptive, which can be briefly explained as learning from

the image itself while traversing it in some order. Rasterization in graphics hardware, on the other

hand, is an incremental process. Depending on the rasterizer architecture, the data to be

compressed could be an unfinished scene and it could also be only a part of the whole scene. In a

tiled architecture for example, a tile is the data to be compressed, and the tile size could be too

small to learn from. Hence, the success of adaptive image compression algorithms on color buffer data is not obvious and depends on the specific rasterizer architecture.

Another difference between our framework and image compression algorithms is the

requirements on the complexity and implementation cost. As mentioned in [1], most of the image

compression algorithms are not symmetric, i.e., compression and decompression take different

times. Moreover, for most of the compression algorithms, the complexity of the forward path

(compression) is discussed, since they aim at the applications where only compression and

storage of the image data is important. The backward path (decompression) is not considered as

critical. However in our case, the compression/decompression must be done “on-the-fly”, i.e. it

has to be very fast so that the hardware accelerator does not have to wait for data. Finally, a

compressor/decompressor for mobile devices has extra requirements on the implementation cost.

Specifically, the size of the hardware block is of prime concern. This prohibits the use of sophisticated algorithms that require more logic and storage (buffering) cost than what is affordable in our case.

1.3 Structure of the report

Chapter 1 of the report has given a description of the aim of this thesis work and some

background information about the application area. Chapter 2, starting with an explanation of the

need for lossless compression in our case, gives a thorough analysis of the lossless compression

algorithms considered for this thesis and evaluation of their implementation properties. This

chapter corresponds to first phase of our thesis work. Chapter 3 describes the implementation and

hardware of the compressor/decompressor and presents synthesis results. Chapter 4 includes

concluding remarks and discussion of some possible future work.


Chapter 2

2 Lossless Compression Algorithms

In this chapter, we discuss several lossless color data compression algorithms and their performance together with their hardware implementation properties. Later, we propose a modified

algorithm which is especially effective for compressible images. The chapter ends with a

comparison of compression ratio and cost of those algorithms and some remarks about possible

future improvements.

2.1 Introduction

Lossless image compression is customarily used in specific application areas like medical and

astronomical imaging, preservation of art work and professional photography. It is not surprising

that lossless compression is not used for multimedia in general when one considers its limited

compression performance. The achievable compression ratio varies between 2:1 and 3:1 in

general, which is significantly lower than what lossy compression can offer. Furthermore, in

lossy compression the resulting image quality and desired compression performance can always

be traded off depending on the requirements.

Considering the disadvantages just mentioned, the use of lossless compression in 3D graphics hardware for color buffer data may be questioned. However, [1] explains and illustrates the possibility of getting unbounded errors due to so-called tandem compression when a lossy algorithm is used. Tandem compression artifacts arise when lossy compression is performed for

every triangle written to a tile during rasterization, resulting in accumulation of error. This is a

direct consequence of rasterization being an incremental process. Figure 2 from [1] illustrates the

accumulation of error.


Figure 2: Error accumulation due to Tandem Compression

Although it is possible to control the accumulated error in those cases as suggested in [1], the

resulting image quality may not be acceptable. In our work we employ a conservative approach

(lossless compression) instead, since the resulting compression ratio is sufficient for our

application.

2.2 Theoretical Background of Lossless Image Compression

In image compression applications, there are several algorithms which offer different approaches

for compression of still images. The most famous algorithms are FELICS [3], LOCO-I [4] and

CALIC [5]. Owing to its good trade-off between complexity and compression ratio, LOCO-I was standardized as JPEG-LS [6].

2.2.1 JPEG-LS Algorithm

The idea behind JPEG-LS is to take advantage of both the simplicity and the compression potential of context models. The error residuals are computed using an adaptive predictor, and the Golomb-Rice technique is used for encoding the data. The purpose of having an adaptive predictor instead of a fixed predictor is that it produces prediction residuals with smaller variation, which leads to a higher compression ratio. It should be noted that a better predictor helps only when the header information can be derived from the compressed stream itself, which is the case in JPEG-LS. Otherwise, the major overhead that degrades the compression ratio is sending the header information, and in that case improving the predictor cannot help much in achieving higher performance. The reason why non-adaptive algorithms give a lower compression ratio is that their compression performance is limited by the first-order entropy of the prediction residuals, which in general cannot achieve total decorrelation of the data [6]. As a consequence, the compression gap between these simple schemes and more complex algorithms is significant.

The LOCO-I algorithm is constructed from three main components. The first component is the predictor, which consists of a fixed part and an adaptive part. The fixed part performs horizontal and vertical edge detection, where the dependence on the surrounding samples is through fixed coefficients. The fixed predictor used in LOCO-I is a simple median edge detector (MED)


predictor and will be explained in subsection 2.3.2. The adaptive part, on the other hand, is context dependent and performs bias cancellation, since a DC offset is typically present in context-based prediction [6].

The second component is the context model. A more complex context modeling technique results in a higher achievable dependency order. In LOCO-I, the context model computes the gradients of neighboring pixels and then quantizes them into a small number of equally probable connected regions. Although in principle the number of those regions should be adaptively optimized, the low complexity requirement dictates a fixed number of equally probable regions. The gradients represent information about the part of the image surrounding a sampled pixel. From the gradients, the level of activity around the sampled pixel, such as smoothness or edginess, can be learned. This information governs the statistical behavior of the prediction error [6].

For JPEG-LS, the number of contexts is 365. This number represents a suitable trade-off between compression performance and the storage requirements, which are proportional to the number of contexts.

The last component, the coder, is used to encode the corrected prediction residuals. LOCO-I uses the Golomb-Rice coding technique [6, 7] in two different modes: regular mode and run-length mode. This coding technique is discussed in detail in subsection 2.3.3.

There are several different implementation approaches for the JPEG-LS algorithm, each of which uses a specific hardware architecture such as parallel, pipelined, or a combination of both. Implementation options include dedicated DSPs, FPGA boards, and ASICs. Factors that affect the platform choice include cost, speed, memory, size, and power consumption. One very important characteristic of the JPEG-LS algorithm is its sequential execution nature, due to the use of context statistics when coding the error residuals in the prediction phase. This characteristic motivates the design of parallel, pipelined encoder architectures in order to speed up the compression. In section 3.6, different hardware architectures and their implementation results are discussed.

Compression in a mobile application is limited by the available storage and memory bandwidth. Therefore, context-based algorithms such as JPEG-LS may not be applicable, since their storage requirement for the context information could be too high for this application.

2.3 Reference Lossless Compression Algorithm

Our thesis work is based on [1], which gives a survey of color buffer data compression algorithms and proposes a new exact (lossless) algorithm. In this section, we give a thorough analysis of this algorithm and of the role and hardware implementation cost of its functional blocks.

The result of this analysis serves as the basis for our later work both on algorithmic and hardware

optimizations.

This algorithm, as opposed to more complex adaptive context-modeling schemes like LOCO-I [4], can be classified as a variant of the simplicity-driven DPCM technique, employing variable bit length coding of prediction residuals obtained from a fixed predictor [6]. To get a better

decorrelation of pixel data, a lossless (exactly reversible) color transform precedes those blocks.

The block diagram of the compressor and decompressor is given in figure 3.

Figure 3: Compression / Decompression Functional Blocks

In context-based algorithms, the encoding parameter for each pixel is estimated from previously

traversed data (context). Since the decoder traverses the data in the same order, it will give the

same decision as the encoder for the parameter of the current pixel. This eliminates the overhead

of sending the encoding parameter in the stream. However, since no context information is stored in our reference algorithm, the overhead of sending the encoding parameter of each pixel is significant. An important feature of the algorithm is thus encoding a number of pixels (a 2×2 subtile) with the same parameter in the encoder stage. This allows a trade-off between the header overhead and using a non-optimal encoding parameter for some pixels.

Another feature of the reference algorithm is that it operates on tiles (8×8 blocks of pixels) to make it compliant with a tiled architecture. However, the functional blocks of the algorithm itself do not use any tile-specific information.

In the following subsections, the blocks of the algorithm are discussed.

2.3.1 Color Transform and Reverse Color Transform

The color transform block converts an RGB triplet to a YCoCg triplet in order to decorrelate the channels. Y is the luminance channel; Co and Cg are the chrominance channels. It is stated in

[1] that decorrelation of channels improves the compression ratio by about 10%. This

transformation and its important features have been introduced in [9]. Exact reversibility is an

essential feature of this transformation since the overall algorithm is lossless. The forward and

backward transformation equations are:

Co = R - B
t  = B + (Co >> 1)
Cg = G - t
Y  = t + (Cg >> 1)        (1)

and, for the reverse transform,

t = Y - (Cg >> 1)
G = Cg + t
B = t - (Co >> 1)
R = B + Co


From an implementation point of view, this transformation has a dynamic range expansion of 2 bits,

i.e., if input RGB channels are n bits each, the output Y channel will require n bits, and

chrominance channels will require n+1 bits each. The block interfaces of the forward and reverse

transforms with 8-bit RGB channels are given in figure 4.

Figure 4: Color Transform / Reverse Color Transform Block Interface

As the equations suggest, both the color transform and the reverse color transform have 2 shift and 4 add/subtract operations per pixel, which can be expressed as follows:

[2(>>), 4(+)] per pixel.

The flow graph of the operations is given in figure 5.

Figure 5: Color Transform / Reverse Color Transform Operation Flow Graph

The operation cost and data lengths indicate that both blocks can be realized by:


- Two 9-bit adders/subtractors

- Two 8-bit adders/subtractors

This cost is per pixel cost and the overall cost depends on the throughput requirement. It should

also be noted that color transform has a maximum logic depth of two 9-bit adders and two 8-bit

adders, whereas the reverse color transform has a maximum logic depth of one 9-bit adder and

two 8-bit adders.
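To make the behaviour of equation (1) concrete, the following is a minimal C sketch of the forward and reverse transforms as one might model them in software. It is not the VHDL hardware of chapter 3; the function and variable names (rgb_to_ycocg, ycocg_to_rgb, t) are our own, and an arithmetic right shift is assumed for signed values.

```c
#include <stdint.h>

/* Forward lossless RGB -> YCoCg transform, 2 shifts and 4 add/sub per pixel.
 * For 8-bit inputs, Y stays within 8 bits while Co and Cg need 9 signed bits. */
static void rgb_to_ycocg(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, int16_t *co, int16_t *cg)
{
    *co = (int16_t)r - (int16_t)b;          /* 9-bit signed chrominance */
    int16_t t = (int16_t)b + (*co >> 1);
    *cg = (int16_t)g - t;                   /* 9-bit signed chrominance */
    *y  = (uint8_t)(t + (*cg >> 1));        /* 8-bit luminance */
}

/* Exact inverse: recovers the original RGB triplet bit for bit. */
static void ycocg_to_rgb(uint8_t y, int16_t co, int16_t cg,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    int16_t t = (int16_t)y - (cg >> 1);
    *g = (uint8_t)(cg + t);
    *b = (uint8_t)(t - (co >> 1));
    *r = (uint8_t)(*b + co);
}
```

Running all 2^24 RGB triplets through both functions and checking that the input is recovered is a simple way to confirm, in software, the exact reversibility required by the lossless algorithm.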

2.3.2 Predictor and Constructor

The predictor used in our reference algorithm is referred to as the MED predictor in [6] and was originally introduced by Martucci [10]. This predictor uses three surrounding pixels to predict the value of

the current pixel as shown in figure 6.

Figure 6: Median Edge Detector (MED) predictor prediction window

The prediction is performed with the following formula:

x̂ = min(a, b)    if c ≥ max(a, b)
x̂ = max(a, b)    if c ≤ min(a, b)
x̂ = a + b - c    otherwise        (2)

where a, b, and c are the left, above, and above-left neighboring pixels of the current pixel x (figure 6).

The first two cases correspond to a primitive test for horizontal and vertical edge detection. If no

edge is detected, the third case predicts the value of the current pixel by considering it on a plane

formed by the three neighboring pixels. Despite its simplicity, the MED predictor is reported to be a very effective fixed predictor.

After the prediction, the predicted value (x̂) is subtracted from the actual pixel value (x) and the resulting error residual (e) is sent out to be encoded in the encoder block. Conversely, in the decompression path the same prediction is performed from the previously constructed pixels and the resulting prediction (x̂) is added to the decoded error residual (e) from the stream to reconstruct the actual pixel value (x). The block interfaces of the predictor and constructor are given in figure 7. In this figure, the input pixel values are 9-bit signed chrominance components (Co and Cg), and the error residual is a 10-bit signed value. For the Y and α predictors/constructors, the input size is 8 bits.


Figure 7: Predictor / Constructor Block Interface

The operations extracted from (2) can be expressed as follows:

[3 comp.(<), 3(+)] per pixel-component.

The flow graphs of the predictor and constructor operations are identical and are given in figure 8.

Figure 8: Predictor / Constructor Operation Flow Graph

The flow graph and data wordlengths indicate that both the predictor and constructor blocks can

be realized by:

- Three 10-bit comparators

- Two 9-bit adders/subtractors

- One 10-bit adder/subtractor

- One 9-bit 4x1 MUX (with some additional logic at select inputs)

This cost is per pixel-component cost and the overall cost depends on the throughput requirement.

Both the predictor and constructor have a maximum logic depth of two 9-bit and one 10-bit

adders.

Since the next stage, i.e., Golomb-Rice encoding, requires unsigned (one-sided) error residuals, the

following signed-to-unsigned conversion, as suggested in [4], needs to be performed after the

prediction:


e' = 2e          if e ≥ 0
e' = -2e - 1     if e < 0        (3)

Conversely, after Golomb-Rice decoding in decompression, the corresponding unsigned-to-

signed conversion is needed.
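For reference, a small C sketch of the MED prediction of equation (2) and the residual mapping of equation (3) could look as follows. This is only a software model with names of our own choosing (med_predict, map_residual, unmap_residual), not the predictor/constructor hardware discussed above.

```c
/* MED prediction from the left (a), above (b) and above-left (c) neighbors,
 * as in equation (2). */
static int med_predict(int a, int b, int c)
{
    int mn = a < b ? a : b;
    int mx = a < b ? b : a;
    if (c >= mx) return mn;   /* horizontal/vertical edge detected */
    if (c <= mn) return mx;
    return a + b - c;         /* planar prediction */
}

/* Signed residual -> non-negative value for Golomb-Rice coding, equation (3). */
static unsigned map_residual(int e)
{
    return e >= 0 ? (unsigned)(2 * e) : (unsigned)(-2 * e - 1);
}

/* Inverse mapping used after Golomb-Rice decoding on the decompression path. */
static int unmap_residual(unsigned u)
{
    return (u & 1u) ? -(int)((u + 1u) / 2u) : (int)(u / 2u);
}
```

On the decompression path, the constructor would recover the pixel as med_predict(a, b, c) plus the residual obtained from unmap_residual().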

2.3.3 Golomb-Rice Encoder

Golomb codes are variable bit rate codes optimal for one-sided geometric distribution (OSGD) of

non-negative integers. Since the statistics of prediction error residuals from a fixed predictor in

continuous-tone images are well-modeled by a two-sided geometric distribution (TSGD) centered

at zero [6], Golomb coding is widely used in lossless image coding algorithms with a

mathematical absolute operation at the beginning to obtain OSGD.

Since Golomb coding requires an integer division and a modulo operation with the Golomb parameter m, Rice codes [8] are generally used in implementations. Rice coding is the special case of Golomb coding where m = 2^k, which reduces the division and modulo operations to simple shift and mask operations.

In Golomb-Rice encoding, we encode an input value, e, by dividing it by a constant 2^k. The results are a quotient q and a remainder r. The quotient q is stored using unary coding, and the remainder r is stored using normal binary coding with k bits. To illustrate with an example (figure 10), let us assume that we want to encode the values 2, 0, 13, 3 and that we have selected the constant k = 2. After the division we get the following (q, r)-pairs: (0, 2), (0, 0), (3, 1), (0, 3). Unary coding represents a value by as many zeros as the magnitude of the value, followed by a terminating one. The encoded values therefore become (1b, 10b), (1b, 00b), (0001b, 01b), (1b, 11b), which is 15 bits in total.

Figure 9: Encoded Data in the Stream

Figure 10: Encoded Data for (2, 0, 13, 3) and k = 2
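As a software illustration of this encoding step (not the encoder hardware of figure 13), the following C sketch reproduces the example above; the bit ordering, function name and printing style are our own choices.

```c
#include <stdio.h>

/* Golomb-Rice encode one non-negative residual e with parameter k:
 * quotient e >> k in unary (zeros followed by a one), then k remainder bits.
 * Bits are printed MSB first purely to mirror the example in the text;
 * the return value is the code length in bits. */
static unsigned gr_encode_print(unsigned e, unsigned k)
{
    unsigned q = e >> k;
    unsigned r = e & ((1u << k) - 1u);

    for (unsigned i = 0; i < q; i++) putchar('0');   /* unary part */
    putchar('1');                                    /* terminating one */
    for (int i = (int)k - 1; i >= 0; i--)            /* binary remainder */
        putchar(((r >> i) & 1u) ? '1' : '0');
    putchar(' ');
    return q + 1 + k;
}

int main(void)
{
    const unsigned e[4] = {2, 0, 13, 3};
    unsigned total = 0;
    for (int i = 0; i < 4; i++)
        total += gr_encode_print(e[i], 2);
    printf("=> %u bits\n", total);    /* 15 bits, as in the example */
    return 0;
}
```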



In our reference algorithm, the optimal Golomb-Rice parameter k for a 2×2 pixel subtile of error residuals is computed with an exhaustive search, and the Golomb-Rice coded residuals are sent out to the stream preceded by the k-parameter as a header. During decompression, the decoder decodes the data from the stream with the k-parameter received as the header.
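The corresponding decoding step is equally simple. The sketch below is again only a software model under our own assumptions: get_bit is a hypothetical callback that delivers the stream bit by bit in the same order used by the encoder above, whereas the real Data_Unpacker block operates on whole memory words.

```c
/* Golomb-Rice decode one value with parameter k from a bit stream. */
static unsigned gr_decode(unsigned k, int (*get_bit)(void *ctx), void *ctx)
{
    unsigned q = 0, r = 0;

    while (get_bit(ctx) == 0)            /* count zeros of the unary quotient */
        q++;
    for (unsigned i = 0; i < k; i++)     /* read the k remainder bits, MSB first */
        r = (r << 1) | (unsigned)get_bit(ctx);

    return (q << k) | r;                 /* e = q * 2^k + r */
}
```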

Encoding requires three functional blocks as given in figure 11:

Figure 11: Golomb-Rice Encoder functional blocks

The reference algorithm uses 3-bit header (k= 0, 1, ... , 7) to encode a subtile. Among those

headers, k = 7 is reserved for the special case when all error residuals in a subtile are zero. In this

case only header is stored; otherwise the header is followed by coded component-wise residuals.

The exhaustive search of the best k-parameter requires comparison of the lengths of output code

created by each possible k value (0, 1, … , 6) excluding the special case. The length of an output

code corresponding to a k-parameter can be expressed with the following formula:

    Lk = ⌊e1/2^k⌋ + ⌊e2/2^k⌋ + ⌊e3/2^k⌋ + ⌊e4/2^k⌋ + 4k + 4        (4)

The lengths of each output code from this formula are given in table 1.

k-parameter   Length of output code (Lk)
0             e1 + e2 + e3 + e4 + 0 + 4
1             ⌊e1/2⌋ + ⌊e2/2⌋ + ⌊e3/2⌋ + ⌊e4/2⌋ + 4 + 4
2             ⌊e1/4⌋ + ⌊e2/4⌋ + ⌊e3/4⌋ + ⌊e4/4⌋ + 8 + 4
3             ⌊e1/8⌋ + ⌊e2/8⌋ + ⌊e3/8⌋ + ⌊e4/8⌋ + 12 + 4
4             ⌊e1/16⌋ + ⌊e2/16⌋ + ⌊e3/16⌋ + ⌊e4/16⌋ + 16 + 4
5             ⌊e1/32⌋ + ⌊e2/32⌋ + ⌊e3/32⌋ + ⌊e4/32⌋ + 20 + 4
6             ⌊e1/64⌋ + ⌊e2/64⌋ + ⌊e3/64⌋ + ⌊e4/64⌋ + 24 + 4

Table 1: Encoded output lengths for each k-parameter

[Figure 11 diagram: the 10-bit error residuals e1–e4 of a subtile enter a k-parameter determination block, followed by Golomb-Rice encoding with k and a data packer to external memory, which outputs the 3-bit k header and the encoded codewords of each pixel as the compressed stream.]

In order to find the best k-parameter, four additions should be performed for each k to calculate the length of its corresponding output code (three additions are needed for k = 0, since the constant term 4k vanishes). The fixed term "4" is common to all the choices; therefore its addition is not needed for the comparison. This corresponds to 6 × 4 + 3 = 27 additions. To compare the lengths of the seven candidates, six comparison operations are needed. To summarize, the operations to find the best k-parameter with exhaustive search can be expressed as follows (a subtile holds four pixels and each pixel has four components, so the cost per subtile component equals the cost per pixel):

[6 comparisons (<), 27 additions] per subtile component = [6 comparisons (<), 27 additions] per pixel
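The exhaustive search can be modelled in software as follows (a behavioural sketch only, following formula (4) and table 1; it does not mirror the adder-tree structure of figure 12):

    def code_length(residuals, k):
        # Formula (4): sum of the four quotients, plus 4k remainder bits,
        # plus the four terminating '1' bits of the unary codes.
        return sum(e >> k for e in residuals) + 4 * k + 4

    def best_k_exhaustive(residuals, k_max=6):
        # The special all-zero case (k = 7) is handled separately by the header.
        return min(range(k_max + 1), key=lambda k: code_length(residuals, k))

    print(best_k_exhaustive((2, 0, 13, 3)))   # 2 (15 bits, matching the earlier example)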

The hardware diagram is given in figure 12.

[Figure 12 diagram: adder trees computing the candidate lengths L0–L6 from the shifted residuals ⌊ei/2^k⌋ and the constants 4, 8, 12, 16, 20 and 24.]

Figure 12: Golomb-Rice parameter exhaustive search hardware

More specifically, the hardware cost is:

- Six 13-bit comparators

- Two 12-bit adders


- Four 11-bit adders

- Four 10-bit adders

- Four 9-bit adders

- Four 8-bit adders

- Four 7-bit adders

- Three 6-bit adders

- Two 5-bit adders

This cost is per subtile component, which can equivalently be regarded as a per-pixel cost. The overall cost depends on the throughput requirement. This block has a logic depth of three 13-bit, one 12-bit, one 11-bit and one 10-bit adder.

The second encoder block encodes the input residuals of a subtile with the calculated k-parameter.

The output of this block is four encoded words corresponding to each pixel of a subtile and their

corresponding lengths.

A very simple possible architecture for this block is given in [11]. Adjusting this architecture to

our case, the hardware for each pixel of the second block is given in figure 13.

Figure 13: A possible Golomb-Rice encoder hardware

The hardware cost per pixel-component of this block is:

- One 5-bit adder

- Two 10-bit shifters

- One 22-bit shifter

- 10 XOR gates

- 22 OR gates


The final block of the encoder is the data packer. This block receives the 3-bit header (k-parameter) and the code–length pairs of each pixel in a subtile. It combines the code words into the fixed memory word size and sends them as output to the external memory.

2.3.4 Golomb-Rice Decoder

The role of the decoder is to extract the error residuals of a subtile by decoding the compressed data using the header, according to figure 9. Its functional blocks are similar to those of the encoder, but since the header is provided by the incoming stream, a k-parameter determination block is not needed. The data un-packer provides the header and the (q, r) pairs of each pixel of a subtile. The q data is obtained with a unary-to-binary conversion.

The next block combines the binary (q, r) pairs with the header and reproduces the error residual at the output according to:

    e = q · 2^k + r        (5)
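A corresponding behavioural decoding sketch (Python, for illustration only) reads the unary quotient and the k-bit remainder from a bit string and applies (5):

    def golomb_rice_decode(bits, pos, k):
        # Returns (decoded value, new bit position).
        q = 0
        while bits[pos] == "0":          # unary part: count zeros
            q += 1
            pos += 1
        pos += 1                         # skip the terminating '1'
        r = int(bits[pos:pos + k], 2) if k > 0 else 0
        return (q << k) + r, pos + k     # e = q * 2^k + r

    stream = "110" + "100" + "000101" + "111"   # example stream for k = 2
    pos, values = 0, []
    for _ in range(4):
        v, pos = golomb_rice_decode(stream, pos, 2)
        values.append(v)
    print(values)   # [2, 0, 13, 3]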

A simple possible decoder hardware for each pixel-component of a subtile is given in figure 14.

Figure 14: A possible Golomb-Rice decoder hardware

The hardware cost per pixel-component of this block is:

- One 22-bit shifter

- 10 OR gates


To summarize, table 2 gives the logic cost of the functional blocks in both the compressor and the decompressor (only adder cost is considered). Note that this calculation only includes the datapath functional blocks shown in figure 3. This means the actual hardware is expected to include other blocks for the implementation of memory interfacing, memory addressing, pipelining, the control path etc. It is also important to note that the actual hardware size to a great extent depends on design requirements, while table 2 shows the generic per-pixel cost of the algorithm.

Functional Blocks                Compressor logic cost       Decompressor logic cost
                                 (adder cells) per pixel     (adder cells) per pixel
Color transform                  34                          -
Reverse color transform          -                           34
Prediction                       232                         -
Construction                     -                           232
GR Encoder – k determination     310                         -
GR Encoder – residual encoding   20                          -
GR Decoder – residual decoding   -                           -
Total                            596                         266

Table 2: Logic Cost of Functional Blocks

2.4 Golomb-Rice Encoding Optimization

Considering the result given in table 2 it is obvious that the most costly part of the design is the

hardware necessary to find the best k parameter for Golomb-Rice coding. Therefore, in order to

reduce the hardware cost, it is convenient to try reducing the cost of this circuitry.

Two approaches have been considered to reduce the complexity. The first one is an improved exhaustive search method, presented in subsection 2.4.1. The second one is an estimation formula given in [8], presented in subsection 2.4.2.

2.4.1 Proposed method for exhaustive search solution

The exhaustive search method to find the k-parameter is straightforward to implement, but its computational cost is large and increases linearly with the number of k values. For all k values, the length of the encoded data must be calculated, and the k corresponding to the minimum length is chosen by comparison. For example, consider a block size of n, which indicates the number of inputs to be encoded together, and the set k = {0, 1, 2, ..., m-1}, where m is variable and depends on the application requirements. The best member of the set should be selected as the Golomb-Rice parameter.

The computational requirements of the exhaustive search method can be significantly reduced with our new solution, while still finding the Golomb-Rice (best k) parameter for a group of input data. The proposed approach uses a combination of two different ideas.


The first idea, which will be referred to as "overlap-limited search", removes the need for computing and comparing all the length values for each possible k. It is mathematically proven that for any given set of input samples {e1, e2, e3, ..., en}, depending on their sum, there are overlap regions only between a fixed, limited number of length functions, and that only those length functions need to be computed and compared to get the best k. In other words, not all possible k values but only a fixed, limited and consecutive subset of them can be candidates for the Golomb-Rice parameter of each block. This idea is not limited to hardware implementations; it reduces the time-complexity of the comparison in software implementations as well.

The second idea, which will be referred to as "remainder-based correction", eliminates the computational redundancy of performing identical bit additions in the calculation of the code lengths (Lk) corresponding to each k. We identify bit additions common to all Lk and save hardware by performing those additions only once. From another point of view, instead of adding shifted versions of the input data (the quotients) for each k, we first add the inputs only once and then shift the same sum for each k. This way of calculation, however, ignores the effect of the remainders on the sum. To obtain exactly the same result, a correction is performed for each k after the addition, using the remainders of the division. Since the correction hardware is much smaller than the adders used for each k, a significant hardware saving is possible. This idea is only applicable to hardware implementations of finding the Golomb-Rice parameter (best k-parameter).

To put the solution into perspective, the plots in figures 15 and 16 show the cost functions of three different implementations (exhaustive search, the overlap-limited search method, and the combined method, i.e. overlap-limited search plus remainder-based correction) with respect to n (the number of input samples) and k (the number of candidates for the Golomb-Rice parameter), respectively.

In figure 15, the cost function is plotted with respect to n (the number of input samples to be encoded together). It is assumed that the set k = {0, 1, 2, 3, 4, 5, 6, 7} is fixed and that the input data word-length is 8 bits. It can be observed from the plot that the slope of the cost function of the combined method is ⅓ of that of the exhaustive search method.


Figure 15: HW-cost vs. number of input samples (n)

In figure 16, the cost is shown as a function of the number of members in the set k. This plot shows a very important feature of "overlap-limited search": the number of comparisons needed to find the Golomb-Rice parameter (best k) is fixed and independent of the number of k values to be compared. Hence, for applications where the dynamic range of the input data is larger, a larger set of k values should be used, and "overlap-limited search" leads to even more significant reductions in the number of comparisons. Audio applications using 16-bit input data are an example of this case [12].

The results in figures 15 and 16 show that the combined solution is the cheaper solution.


Figure 16: HW-cost vs. number of parameters (k)

The mathematical derivation and data analysis of the proposed method are given in Appendix A.

Our implementation combines both methods. The circuit diagram in figure 17 takes the input bits (A5–A0, B5–B0, C5–C0, D5–D0), eT, and k, k+1, k+2 as inputs. eT is obtained by adding the input values. The region corresponding to eT is then located to find the three k values (k, k+1, k+2) to compare. The outputs of the circuit are Lk, Lk+1 and Lk+2. These three values are compared using two comparators as a final stage to find the best Golomb-Rice parameter.


Figure 17: HW implementation of the new combined method


This method is a general solution for implementations of Golomb-Rice encoders in all applications, with any set of Golomb-Rice parameters k and different block sizes n (the subtile size in our case). It is an exact method which replaces the exhaustive search for the best k-parameter and leads to much lower computational requirements. The improvement in hardware cost with the implementation explained above is given in table 3.

Method                          Cost (full adders)    Compression ratio (norm.)
Exhaustive search (exact)       310                   1
New combined method (exact)     111                   1

Table 3: HW cost comparison of exhaustive search and new combined method

The table shows that the new implementation method leads to a 65% reduction in hardware cost over exhaustive search while still finding the best k-parameter for a block.

The comparison of the results is presented in figures 15 and 16, which show the advantage of the new method in reducing the hardware cost. For example, in figure 16, for a word-length of 32 bits, k = {0, 1, 2, ..., 31} should be used in order to achieve the minimum code length. The hardware cost in this case is reduced by 83% with overlap-limited search and by 89% with the combined method, with exactly the same result.

2.4.2 Estimation method

In [8], an estimation formula based on the sum of all inputs is given where the k-parameter is

determined according to the range of the sum of input values. The estimation works based on

table 4, where sum is the summation of inputs to be encoded (in our case, four pixels in a subtile):

sum = e1 + e2 + e3 + e4 (6)


sum                 k
sum = 0             7
0 < sum < 8         0
8 ≤ sum < 16        1
16 ≤ sum < 32       2
32 ≤ sum < 64       3
64 ≤ sum < 128      4
128 ≤ sum < 256     5
sum ≥ 256           6

Table 4: Estimation intervals according to sum of inputs
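A behavioural sketch of the estimation (Python, for illustration; the hardware realization in subsection 3.2.6.1 implements the same intervals with OR gates on the bits of the sum):

    def estimate_k(residuals):
        s = sum(residuals)            # equation (6)
        if s == 0:
            return 7                  # special all-zero header
        if s < 8:
            return 0
        # For s >= 8 the intervals of table 4 double in width, so the parameter
        # is the position of the highest set bit minus 2, capped at 6.
        return min(s.bit_length() - 3, 6)

    print(estimate_k((0, 0, 0, 0)))      # 7
    print(estimate_k((2, 0, 13, 3)))     # 2  (sum = 18, i.e. 16 <= sum < 32)
    print(estimate_k((200, 90, 10, 5)))  # 6  (sum >= 256)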

The advantage of the estimation method over the exhaustive method in reducing the hardware cost is given in table 5. The cost of the estimation method is the cost of the hardware needed to calculate the sum in (6); therefore, two 10-bit adders and one 11-bit adder are required. The estimation method may, in rare cases, select a non-optimal k-parameter. However, empirical results with a wide range of test images show that the reduction in compression performance is insignificant, as shown in table 5. In [8] it is also mathematically proven that the effect of the estimation on the compression performance is bounded.

Method                          Cost (full adders)    Compression ratio (norm.)
Exhaustive search (exact)       310                   1
Estimation                      31                    0.998

Table 5: HW cost and compression ratio of estimation method

For applications where the exact exhaustive search is preferred, the method proposed in

subsection 2.4.1 can reduce the hardware cost significantly. However, in this thesis work the

estimation method has been chosen since it is cheaper and the resulting compression ratio is good

enough.


2.5 Improved Lossless Color Buffer Compression Algorithm

As mentioned in [1], our reference algorithm is influenced by the LOCO-I algorithm. It can be thought of as a low-cost, non-adaptive projection of LOCO-I. This has led us to a deeper analysis of the ideas behind LOCO-I and hence enabled us to improve the algorithm to get a better compression ratio, especially for highly compressible images, with negligible extra hardware cost. The modifications to the reference algorithm are the use of the estimation method (explained in subsection 2.4.2), modular reduction, a run-length mode and a previous-header flag, which are explained in the following subsections.

2.5.1 Modular Reduction

The error residual at the output of the predictor is one bit wider than the data at the predictor inputs. For example, in our case the inputs x, x1, x2, x3 are all 9-bit data and the error residual is 10-bit. The reason for this expansion is the x − x̂ subtraction operation. However, since the predicted value x̂ is known to both the encoder and the decoder, the error residual can actually take on only as many values as the input alphabet, and can therefore be represented with the same number of bits as the input data. Since this data is not centered around zero, a remapping of large prediction residuals is needed. This is called modular reduction [4]. Figure 18 illustrates the technique.

Figure 18: Illustration of Modular Reduction

[Figure 18 diagram: number lines for the positive and negative prediction cases, showing the pixel range [-256, 255], the predicted value x̂, and the residual range [-256 - x̂, 255 - x̂] that is wrapped back into [-256, 255].]

The effect of modular reduction is two-fold. Firstly, it leads to slightly more compression during the encoding stage, since the absolute value of the error residual is smaller. Secondly, the compression and decompression hardware blocks have a smaller area due to the smaller data size in the datapath.
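A minimal software sketch of the reduction and its inverse is given below (it assumes, as figure 18 suggests, a 9-bit alphabet in the range [-256, 255]; the exact wrap convention of the thesis hardware may differ in detail):

    ALPHABET = 512   # 9-bit data, range [-256, 255]

    def modular_reduce(e):
        # Wrap the 10-bit residual e = x - x_hat back into the 9-bit range.
        if e < -ALPHABET // 2:
            return e + ALPHABET
        if e >= ALPHABET // 2:
            return e - ALPHABET
        return e

    def reconstruct(x_hat, e_reduced):
        # The decoder knows x_hat, so the pixel is recovered modulo 512
        # and wrapped back into the valid 9-bit range.
        x = x_hat + e_reduced
        if x < -ALPHABET // 2:
            return x + ALPHABET
        if x >= ALPHABET // 2:
            return x - ALPHABET
        return x

    assert all(reconstruct(p, modular_reduce(x - p)) == x
               for x in range(-256, 256) for p in range(-256, 256, 17))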

2.5.2 Embedded Alphabet Extension (Run-length Mode)

In subsection 2.3.3 it is mentioned that the header k = 7 is used for the cases where all four error residuals of a subtile are zero during GR encoding. In this case the whole subtile is encoded with 3 bits only. This addresses the redundancy of sending extra terminating bits for each error residual in a subtile. Although the redundancy is removed within a single subtile boundary, significant redundancy may still exist among adjacent subtiles. In a graphics application this corresponds to the cases where a whole tile (an 8×8 block of pixels) is covered by one or two triangles during rasterization. A typical example is the user menus in mobile devices. A menu typically consists of large icons and several flat regions in the background.

In image compression applications, a quite similar problem exists for the large smooth regions of a still image. In [4] it is stated that, in general, symbol-by-symbol (in our case Golomb-Rice) encoding of error residuals in low-entropy distributions (large flat regions) results in significant redundancy. They address this problem by introducing "alphabet extension". Specifically, the LOCO-I/JPEG-LS algorithm enters a "run-length mode" when a flat region is encountered.

We used the same idea for more efficient encoding of low-entropy regions. In order to do this, we keep track of the headers used for each component of the previous subtile. Whenever all four headers are 7 (kα = 7, kY = 7, kCg = 7, kCo = 7) the algorithm enters run-length mode. In this mode we no longer put any bits into the output stream as long as the incoming error residuals to the Golomb-Rice encoder are zero. Instead, we increment a 4-bit run-length counter by one for each component. The run-length counter indicates the total number of zero error residuals so far. Whenever a non-zero error residual is encountered, the run-length mode is broken. In this case the current value of the run-length counter is put into the output stream and the normal mode of operation continues again.

During decoding, the decoder also keeps track of the headers of the previously decoded subtile. Hence it enters run-length mode at the same position during traversal. As soon as it enters the run-length mode, it first reads the 4-bit run-length counter value from the stream. Then it outputs zero error residuals for that many cycles and continues with the normal mode of operation.

The 4-bit run-length counter has a fixed range (0–15). This causes a problem in representing run lengths longer than four subtiles (16 components). This problem is solved by introducing a run-length flag. During encoding, when the run-length counter reaches 15, a "1" bit is put into the stream, representing the completion of one 16-component block. Correspondingly, when the run length is broken, a "0" bit is put into the stream just before the run-length counter value. For the decoder, each "1" read from the stream means one 16-component block in run-length mode. Similarly, a following "0" bit designates that the run length is broken.
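A behavioural sketch of the encoder-side run-length handling (Python; it is simplified to a single stream of subtile components, and emit() is a hypothetical callback that appends bits to the output stream):

    def run_length_encode(component_groups, emit):
        # component_groups: iterable of 4-residual tuples, one per subtile
        # component, arriving while the encoder is in run-length mode.
        counter = 0                                 # 4-bit run-length counter
        for i, group in enumerate(component_groups):
            if any(group):                          # non-zero residual: run is broken
                emit("0" + format(counter, "04b"))  # flag '0' + counter value
                return i                            # normal mode resumes with this group
            counter += 1                            # one more all-zero component
            if counter == 15:
                emit("1")                           # flag: one full block completed
                counter = 0
        emit("1")                                   # end of image reached inside the run
        return len(component_groups)

    bits = []
    consumed = run_length_encode([(0, 0, 0, 0)] * 3 + [(0, 0, 2, 0)], bits.append)
    print(consumed, "".join(bits))   # 3 00011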


The hardware cost of the run-length mode implementation is four 3-bit registers to store the component headers and a 4-bit run-length counter. Its size relative to the other functional blocks is given in section 3.5.

2.5.3 Previous Header Flag

Once the headers of the previous subtile are stored in the encoder for the run-length mode, better compression can be achieved by comparing the current header with the previous one. Due to the spatial correlation among adjacent subtiles, it is likely that these two headers have the same value. Hence, instead of putting a 3-bit header into the output stream for each subtile, a "0" flag bit is put, which means that the current header is the same as the previous header. Conversely, when the headers are different, a "1" bit is put before the actual header.
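The header emission with this flag can be sketched as follows (Python; prev_k stands for one of the four stored 3-bit header registers mentioned above):

    def emit_header(current_k, prev_k):
        # Returns the bits written to the stream for one subtile header.
        if current_k == prev_k:
            return "0"                          # same header as before: 1 bit
        return "1" + format(current_k, "03b")   # changed header: flag + 3-bit value

    print(emit_header(3, 3))   # '0'
    print(emit_header(5, 3))   # '1101'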

Now that all the modifications to the reference algorithm have been introduced, the final algorithm to be implemented is decided. The algorithm includes all the modifications explained in this section. Moreover, the algorithm will be implemented not for tiled traversal but for scan-line traversal of the input data. Therefore, both the reference algorithm and the modified algorithm are modeled for a left-to-right scan-line data traversal. The results in table 6 are obtained from scan-line traversal of the images as well.

It is important to note that the maximum output size for a 32-bit input pixel is 64 bits. Therefore it is theoretically possible to get a compressed size twice the original input size. However, unless the input data is completely noisy, meaningless data, the output size is always smaller than the input size. The same holds for most other compression algorithms.

2.6 Compression Performances of Algorithms

In order to evaluate the compression performance gained, software models of both algorithms have been prepared in the MATLAB™ environment. Three different groups of test data have been used. The first group includes well-known standard photographic test images used for benchmarking image compression algorithms, taken from [15]. The second group includes several computer-generated scenes. The first four of them in table 6 are also used in [1] to benchmark the reference algorithm. The third group includes several menu screen snapshots typical of mobile devices. Finally, the compression of a completely black image is also evaluated to observe the performance of the algorithms in the extreme case. All test data used are 24-bit color images in .PNG or .BMP format. The images evaluated are given in Appendix B.

It is important to note that the data used for evaluation are complete screenshots compressed as whole frames. This means that the results do not include the full, incremental rasterization process. An evaluation of the improvement gained within a real or software-simulated rasterizer framework is definitely of interest. Nevertheless, we anticipate that the results would be similar or even better during a rasterization process, since an unfinished scene is generally simpler and contains fewer details than a complete scene. It has already been mentioned that the improved algorithm works better on simpler, compressible scenes. This is also verified in table 6 for the group 3 data.

Another important point is that all input data are 24-bit RGB images, while the algorithms are modeled for a 32-bit RGBA data format. For the evaluation, the alpha channel of all the image data was padded with eight "0" bits, hence the evaluation is performed with 32-bit RGB0 data for all input images. This is the reason for the higher-than-expected compression ratios of both algorithms. For example, the compression ratio for the well-known Lena image is 1.945 and 2.021 for the two algorithms, respectively, whereas the JPEG-LS compression ratio is reported as 1.773 [16]. JPEG-LS would certainly be expected to compress better than both algorithms within the same framework.

IMAGE                                                  REFERENCE    IMPROVED
                                                       ALGORITHM    ALGORITHM
Group 1               Peppers (512 × 512)              2.812        3.016
(standard             Peppers2 (512 × 512)             1.769        1.828
photographic          Mandrill (512 × 512)             1.542        1.591
test images)          Lena (512 × 512)                 1.945        2.021
(24-bit color)        House (256 × 256)                2.131        2.226
                      Sailboat (512 × 512)             1.690        1.744
                      Airplane (512 × 512)             2.289        2.404
                      Average                          2.025        2.118
Group 2               Ducks (640 × 480)                2.785        2.991
(computer             Square (640 × 480)               2.937        3.155
generated             Car (640 × 480)                  3.609        4.059
test scenes)          Quake4 (640 × 480)               3.173        3.469
(24-bit color)        Bench_scr1 (640 × 360)           2.992        3.253
                      Bench_scr2 (640 × 360)           2.976        3.249
                      Bench_scr4 (640 × 360)           3.168        3.567
                      Average                          3.091        3.392
Group 3               Menu1 (240 × 320)                4.684        6.377
(computer             Menu2 (240 × 320)                2.776        3.056
generated             Menu3 (240 × 320)                1.992        2.068
user menu             Menu4 (240 × 320)                2.700        2.941
scenes)               Menu5 (240 × 320)                4.166        5.734
(24-bit color)        Menu6 (320 × 480)                3.416        3.803
                      Menu7 (320 × 480)                4.606        6.395
                      Average                          3.477        4.340
Group 4               Black (1280 × 1024)              10.667       511.926

Table 6: Comparison of compression performances


2.7 Possible Future Algorithmic Improvements

In this thesis work several solutions have been examined to improve the compression performance while still keeping the complexity and hardware cost reasonably low. However, there are still several possibilities for algorithmic and architectural improvements. This section describes some of those techniques, proposed in several scientific papers, which might be applicable to image compression for mobile 3D graphics and can be considered as future work in the area.

2.7.1 Pixel Reordering

This is one of the solutions that have been examined within our work. The objective of this algorithm is to minimize the header overhead in the Golomb-Rice encoder. The idea is to group the pixels/subtiles inside a tile based on their Golomb-Rice parameter (k value). This increases the compression ratio significantly, since it helps to reduce the header overhead in the stream. As future work, it would be interesting to investigate the storage requirements needed to keep track of the original positions of the pixels in order to reconstruct them in their original order [13].

2.7.2 Spectral Predictor

As mentioned before, the main overhead which degrades the compression performance is the storage of the header in the encoded stream. Improving the predictor might not contribute much to the compression performance, and this small improvement might not justify a more complex and costly predictor. However, there is an opportunity to get rid of the color transform block if we could efficiently take advantage of the spectral correlation between the color components R, G and B. In order to do so, a spectral predictor is needed which can predict the pixel values of one color component based on the predicted value of another component for the same pixel. This method is described in detail in [14]. What is interesting for the area of mobile image compression is to investigate the cost and complexity of this method, compare it with the total cost of both the color transform block and the fixed MED predictor, and measure the compression performance improvement that could be achieved by using a spectral predictor.

2.7.3 CALIC Predictor

Context-Based Adaptive Lossless Image Compression (CALIC) was proposed by Wu and Memon [5]. This algorithm is built on an adaptive predictor followed by a context-based arithmetic coder. CALIC uses a gradient adjusted predictor (GAP), which is able to adapt itself to the intensity gradients of the surrounding and neighboring pixels near the pixel under prediction [14].


Figure 19: CALIC GAP prediction window

GAP calculates two intensity gradient variables as follows:

dh = |Iw − Iww| + |In − Inw| + |In − Ine|,

dv = |Iw − Inw| + |In − Inn| + |Ine − Inne| (7)

It detects and classifies three different kinds of edges as "sharp", "normal" or "weak", and the prediction value is corrected using this evaluation. The basic idea is that if one gradient is small and the other is large, the predictor estimates the current pixel value along the direction of the low-value gradient. Otherwise, if one gradient is larger than the other but their difference is not small, the prediction is corrected taking this difference into account. The pseudo-code of the prediction algorithm in [5] explains how GAP works.
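For reference, a sketch of the GAP decision logic as it is commonly presented in the literature is given below (Python; the thresholds 80, 32 and 8 are the values usually quoted for [5] and are an assumption here, not taken from the thesis):

    def gap_predict(Iw, In, Ine, Inw, Iww, Inn, Inne):
        # Gradient estimates from equation (7)
        dh = abs(Iw - Iww) + abs(In - Inw) + abs(In - Ine)
        dv = abs(Iw - Inw) + abs(In - Inn) + abs(Ine - Inne)
        if dv - dh > 80:            # sharp horizontal edge: predict from the left
            return Iw
        if dh - dv > 80:            # sharp vertical edge: predict from above
            return In
        pred = (Iw + In) / 2 + (Ine - Inw) / 4
        if dv - dh > 32:            # horizontal edge
            pred = (pred + Iw) / 2
        elif dv - dh > 8:           # weak horizontal edge
            pred = (3 * pred + Iw) / 4
        elif dh - dv > 32:          # vertical edge
            pred = (pred + In) / 2
        elif dh - dv > 8:           # weak vertical edge
            pred = (3 * pred + In) / 4
        return pred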

The appealing part of the CALIC algorithm in this context is only the predictor. By replacing the MED predictor with GAP, one can observe how the compression performance improves and whether the improvement justifies the extra cost introduced by this more complex predictor.

2.7.4 Context Information

Context-based algorithms are powerful in terms of compression performance and are used by most of today's well-known lossless image compression algorithms, such as LOCO-I/JPEG-LS and CALIC. Storage elements are needed in order to store the context information of the image under compression. In image compression algorithms where throughput or storage area is not a bottleneck, it is advantageous to use context-based approaches. However, this may not be the case in mobile applications, and storing a large amount of context information may not be affordable. As future work, it would be interesting to investigate how to modify context-based algorithms and make them suitable for mobile image compression, for instance by using a limited number of contexts. Our modifications to the original algorithm (especially the previous-header flag) are a first attempt to use information from previously traversed pixels and can be considered a very simple form of context – the previous pixel.


Chapter 3

3 Color Buffer Compression/Decompression Hardware

In this chapter we explain the hardware implementation of the color buffer compression algorithm presented in Chapter 2. The chapter starts with a description of the design constraints and the design environment. Then, in a hierarchical way, we describe each hardware block in the design. This is followed by a description of the functional verification framework. At the end of the chapter we present and discuss the synthesis results and give a survey of several lossless compression hardware implementations.

3.1 Design Constraints

The main goal of the thesis was to investigate the hardware implementation properties of a

selected color buffer compression algorithm and to design corresponding synthesizable RTL level

compressor and decompressor block descriptions in VHDL. The hardware has been designed

considering the following constraints:

Pixels are represented with 32 bits (8-bit integer R, G, B and α channels) before compression. RGB 8880 representation is assumed.

The blocks are interfaced with single-port memories of 64-bit word length.

These two properties translate into a constraint of at most 2 pixels read per clock cycle.

The target clock frequency is 208 MHz in a 65 nm technology. This constraint defines the longest combinational path allowed in the design.

The target throughput is one uncompressed pixel per clock cycle, i.e. the compressor should be able to process one uncompressed pixel per clock and, conversely, the decompressor should produce one uncompressed pixel per clock.


The compressor and decompressor blocks have been designed separately and both have two sets of memory interfaces, i.e. one set to the source memory and one set to the destination memory. If a single memory block is to be used as both the source and the destination, a separate memory interface block is needed to coordinate the accesses to the memory unit. The same requirement also applies if the decompression and compression blocks operate concurrently on the same memory blocks.

3.2 Compressor Block

The hierarchical block diagram of the compressor is given in figure 20.

Figure 20: Compressor Block

The top level of the compression block has interfaces to the source and destination memories as

well as the block controlling the compressor. The compression starts with issuing the “start”

signal together with the "start_addr1" and "start_addr2" signals, which point to the source and


destination addresses. The completion of the operation is communicated through the

“exec_finish” signal from the compressor block.

The interface port description of the compressor is given in table 7.

Port name      Width   Direction   Source / Destination       Description
clk            1       I           Controller                 208 MHz clock signal
rst            1       I           Controller                 Block reset signal
start          1       I           Controller                 Compression start signal
exec_finish    1       O           Controller                 Compression complete signal
start_addr1    24      I           Controller                 Source memory start address
start_addr2    24      I           Controller                 Destination memory start address
rd_req1        1       O           Source mem. controller     Read request from source memory
wr_req1        1       O           Source mem. controller     Write request to source memory
rdy1           1       I           Source mem. controller     Source mem. data available
addr1          24      O           Source memory              Source mem. address bus
input          64      I           Source memory              Source mem. data bus
rd_req2        1       O           Dest. mem. controller      Read request from dest. memory
wr_req2        1       O           Dest. mem. controller      Write request to dest. memory
rdy2           1       I           Dest. mem. controller      Destination mem. data available
addr2          24      O           Destination memory         Destination mem. address bus
output         64      O           Destination memory         Destination mem. data bus

Table 7: Compressor Block Interface Port Description

It should be noted that the compressor block reads data only from the source memory and writes data only into the destination memory. Therefore, "wr_req1" and "rd_req2" are connected to logic '0'.

The sub-blocks inside the compressor block are discussed in the following subsections:

3.2.1 Addr_Gen1 (Source memory address generator)

This sub-block generates the read addresses for the source memory. Since the algorithm operates on 2×2 pixel subtiles, the corresponding read addresses need to be generated. The data in the source memory are assumed to be in left-to-right scan-line order. Considering a resolution of w × l pixels and the assumption of two pixels per memory word, the memory map and the corresponding pixels of the image are shown in figure 21:


Figure 21: Memory mapping and corresponding pixels of the image

Since the subtile is the unit of processing, the data should be read in subtile order from the source memory. For example, in figure 21 the first subtile (top left corner) consists of the pixels p(0), p(1), p(w) and p(w+1). The memory map shows that these pixels are at addresses a and a + (w/2) in the source memory. Hence, the corresponding addressing scheme should be as follows:

[a] → [a+(w/2)] → [a+1] → [a+(w/2 + 1)] → [a+2] → … → [a+(w/2 – 1)] → [a+(w – 1)]

[a+w] → [a+(3w/2)] → [a+(w + 1)] → [a+(3w/2 + 1)] → …

However, this is not the whole story. As will be explained in subsection 3.2.3, the prediction window of a subtile consists of five neighboring pixels above and to the left. As an example, in order to


encode the grey subtile in figure 21, p(w+1), p(2w+1), p(3w+1), p(w+2) and p(w+3) pixel

values should be available. Figure 22 further illustrates the change of prediction window from

one subtile to the next subtile.

Figure 22: Traversal in prediction window

In order to encode the second subtile, shown on the right side of figure 22, the pixel data at all six addresses shown in the figure are needed. However, the data at addresses "A", "A + w/2" and "A + w" have already been read in order to encode the first subtile, shown on the left side of the figure. The conclusion is that in order to encode the second subtile, three read operations should be performed to addresses "A + 1", "A + w/2 + 1" and "A + w + 1". The arrows in the figure indicate the basic addressing scheme followed by the address generator.

Encoding of one subtile takes four cycles, while three cycles are enough to read the required data from memory. In the spare cycle the memory bus is released to allow better usage of the memory bandwidth. In this spare cycle the "address_valid" signal is '0'.

All these considerations lead to the basic cyclic addressing scheme of:

             cycle 1     cycle 2    cycle 3     cycle 4
Operation    read        -          read        read
Address      +(w/2)      keep       +(w/2)      -(w - 1)

Table 8: Source memory address generator addressing scheme

The ends of lines and the first line of the image need special treatment. The first line of the image does not have any pixels above it, hence one read is not needed. The change of the prediction window at the end of a line changes the addressing scheme, and the scheme for the last subtile of a line is ["+(w/2)", keep, "+(w/2)", "-(w/2 - 1)"].

The block interface and the hardware diagram of the source memory address generator are shown in figures 23 and 24, respectively. The address register is a 24-bit register, which is the assumed address bus width corresponding to 128 MB of RAM. The multiplexer select inputs are determined with a simple state machine using the block inputs "enable", "end_of_line", "end_of_image" and "first_line". The synthesis results of the sub-block are given in section 3.5.


Figure 23: Address Generator I interface

Figure 24: Address Generator I Hardware Diagram


3.2.2 Color_T (Color Transformer)

The color transform block performs the RGB → YCoCg conversion as explained in subsection 2.3.1. The block interface is shown in figure 4 in subsection 2.3.1. The hardware diagram of the sub-block is given in figure 25.
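The conversion can be sketched in software as a lossless YCoCg-R style transform of the kind the diagram suggests (this formulation is an assumption on our part; the exact rounding conventions of the thesis hardware may differ slightly):

    def rgb_to_ycocg(r, g, b):
        co = r - b                 # 9-bit signed
        t = b + (co >> 1)
        cg = g - t                 # 9-bit signed
        y = t + (cg >> 1)          # 8-bit
        return y, co, cg

    def ycocg_to_rgb(y, co, cg):
        t = y - (cg >> 1)
        g = cg + t
        b = t - (co >> 1)
        r = b + co
        return r, g, b

    # The transform is exactly invertible, which is what makes it usable
    # in a lossless compression pipeline.
    for rgb in [(0, 0, 0), (255, 255, 255), (12, 200, 97)]:
        assert ycocg_to_rgb(*rgb_to_ycocg(*rgb)) == rgb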

Figure 25: Color Transform Hardware Diagram


The synthesis result of the sub-block is given in section 3.5.

3.2.3 Pred_RegFile_Ctrl (Prediction Register File Controller)

This sub-block is responsible for providing the predictor block with the current pixel (x) as well as the three neighboring pixels in its prediction window (x1, x2, x3), as shown in figure 6. The block interface is shown in figure 26.

Figure 26: Prediction Register File Controller interface

The operation is controlled by the compression control block through the signals "enable", "end_of_line" and "end_of_image". In each cycle this sub-block receives two pixels (p1, p2) corresponding to one memory word read from memory, and outputs the (x, x1, x2, x3) pixels to the combinational predictor block. This is performed for each pixel of one 2×2 subtile before passing to the next subtile. Figure 27 shows the pixels involved in the prediction operation of one subtile.

Figure 27: Change of prediction window for pixels of one subtile


The figure clearly shows that 9 pixels are involved in the prediction of one subtile. However, at any time instant at most 7 pixel values need to be stored. (In figure 27, before step 2, the lower right two pixels of the subtile have not been read in yet. After step 2, the upper left two pixels are not used anymore, so they can be overwritten by the incoming pixels.) Therefore, this block includes seven 9-bit registers and a state machine controlling the data transfer among them as well as the input and output.

The basic data transfer scheme is shown in figure 28. The block outputs come directly from the registers X, X1, X2, X3, while the registers A, B, C are used for temporary storage of data. As an example, the figure shows the input connectivity of register X3, i.e. register X3 receives data only from register X1 and register A in different states. Different data transfer schemes resulting in the same functionality are possible; however, the connectivity affects the multiplexer sizes at the register inputs and hence the hardware cost. In this scheme, care has been taken to use 4:1 multiplexers or smaller.

Figure 28: States and register input connectivity in Prediction Register File Controller

It is also important to note that state S4 of this scheme changes at the end of lines due to the change in the prediction window.

This sub-block is instantiated 4 times in the design, corresponding to the four components Y, Cg, Co and α. The synthesis results are given in section 3.5.

[Figure 28: a state table listing, for each of the four states S1–S4, the input source (p1, p2 or another register) of each of the registers X, X1, X2, X3, A, B and C.]

3.2.4 Predictor

The MED predictor explained in subsection 2.3.2 has been realized with the hardware shown in figures 29 and 30.

Figure 29 shows the prediction hardware common to both the predictor and the constructor. This hardware block generates the predicted value x̂ from the neighboring pixels x1, x2, x3.
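The prediction rule itself can be written as a few lines of software (the LOCO-I median edge detector; here "left", "above" and "upper_left" name the three causal neighbors, and how these roles map onto x1, x2 and x3 is defined by figure 6 and is not repeated here):

    def med_predict(left, above, upper_left):
        # Median edge detection: pick min/max of the two causal neighbors
        # at a detected edge, otherwise the planar estimate.
        if upper_left >= max(left, above):
            return min(left, above)
        if upper_left <= min(left, above):
            return max(left, above)
        return left + above - upper_left

    print(med_predict(10, 40, 40))   # 10  (edge towards the left neighbor)
    print(med_predict(10, 40, 10))   # 40  (edge towards the upper neighbor)
    print(med_predict(20, 30, 25))   # 25  (smooth region: 20 + 30 - 25)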

Figure 29: MED Prediction Hardware for both predictor and constructor

The predictor block is given in figure 30. The block performs the x − x̂ subtraction, modular reduction and signed-to-unsigned conversion, as explained in subsection 2.5.1.


Figure 30: Predictor block Hardware diagram

This sub-block is instantiated 4 times in the design, corresponding to the four components Y, Cg, Co and α. The synthesis result of the sub-block is given in section 3.5.

3.2.5 Enc_RegFile_Ctrl (Golomb-Rice Encoder Register File Controller)

This sub-block is responsible for preparing the data for the GR Encoder. More precisely, it performs pixel-to-subtile conversion, i.e. at each clock it receives the 4 components of error residuals


corresponding to one pixel from the predictors, and it outputs one component of 4 pixels (a subtile) to be encoded together in the GR Encoder.

The block interface is shown in Figure 31.

Figure 31: Encoder Register File Controller block interface

As the figure suggests, this sub-block consists of 16 9-bit registers organized as a small 4×4 transpose memory. This means that, in an alternating fashion, the registers are filled (and read out at the same time) column-wise first and then read out (and also filled at the same time) row-wise, in a FIFO manner. The conversion from 4 components of a pixel to one component of four pixels is performed this way. The alternation is realized with a simple 2-state state machine. The state machine is started and stopped by the signals "enable" and "end_of_image" coming from the compression control block.

Since all registers can be loaded both column-wise and row-wise, there are 2:1 MUXes at their inputs (except for the topmost left register, which has only one input). Also, since the block outputs can be given out from two registers (except for the lowermost right one), there are three 2:1 MUXes at the outputs.

The synthesis result of this sub-block is given in section 3.5.


3.2.6 GR_Encoder (Golomb-Rice Encoder)

The Golomb-Rice Encoder's main task, as described before, is to generate code and length values. At each clock cycle it receives the residual values of the four pixels of one specific subtile (e1, e2, e3 and e4) for one component, and generates the corresponding "code", "length" and "header" values for each pixel.

The GR_Encoder block consists of three sub-blocks, GR_k, Enc and GR_ctrl, as shown in figure 32.

Figure 32: Golomb-Rice Encoder block diagram

The synthesis result of GR_Encoder is given in section 3.5.


3.2.6.1 GR_k Block (Golomb-Rice Parameter Estimation)

This block is responsible for the determination of the Golomb-Rice parameter (best k). It uses the estimation formula described in subsection 2.4.2. The hardware to determine the k-parameter is given in figure 33.

x7 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5] OR s[4] OR s[3] OR s[2] OR s[1] OR s[0]

x6 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5] OR s[4] OR s[3]

x5 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5] OR s[4]

x4 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5]

x3 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] (8)

x2 = s[10] OR s[9] OR s[8] OR s[7]

x1 = s[10] OR s[9] OR s[8]

Equation (8) shows the cases corresponding to table 9.


Figure 33: K- Parameter Estimation Hardware


The estimation works based on table 4 using equation (6).

sum                  sum[10:0]       k
sum = 0              00000000000     7
0 < sum < 8          00000000XXX     0
8 ≤ sum < 16         00000001XXX     1
16 ≤ sum < 32        0000001XXXX     2
32 ≤ sum < 64        000001XXXXX     3
64 ≤ sum < 128       00001XXXXXX     4
128 ≤ sum < 256      0001XXXXXXX     5
sum ≥ 256            XX1XXXXXXXX     6

Table 9: Estimation Function
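A bit-level software model of equation (8) and table 9 (Python; it reproduces the priority selection of figure 33 directly from the sum bits s[10:0]):

    def estimate_k_from_sum_bits(s):
        # s is the 11-bit sum of the four residuals, equation (6).
        bit_set = lambda lo: any((s >> i) & 1 for i in range(lo, 11))
        if not bit_set(0):       # x7 = 0: all residuals are zero
            return 7
        if bit_set(8):           # x1: sum >= 256
            return 6
        if bit_set(7):           # x2: sum >= 128
            return 5
        if bit_set(6):           # x3: sum >= 64
            return 4
        if bit_set(5):           # x4: sum >= 32
            return 3
        if bit_set(4):           # x5: sum >= 16
            return 2
        if bit_set(3):           # x6: sum >= 8
            return 1
        return 0

    # Agrees with the intervals of table 4 for every possible 11-bit sum
    assert all(estimate_k_from_sum_bits(s) ==
               (7 if s == 0 else min(max(s.bit_length() - 3, 0), 6))
               for s in range(2048))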

The synthesis result of GR_k is given in section 3.5.

3.2.6.2 Enc Block (Encoding Block)

The other sub-block, Enc, is responsible for performing the encoding according to its inputs k and e. The outputs are length and code. Considering the throughput constraint of one pixel per cycle, four instances of the encoder are necessary. The encoder hardware is shown in figure 34. The first two multiplexers on the left determine the quotient and the remainder of the division e / 2^k, respectively. One addition is performed in order to calculate the length value, k + q + 1, and the last multiplexer and the OR function generate the output code by appending the unary-coded q to the binary r.

The synthesis result of Enc is given in section 3.5.


Figure 34: Golomb-Rice Encoder Realization


3.2.6.3 GR_ctrl (Golomb-Rice Control Block)

This sub-block is an FSM added to the design in order to handle the new compression algorithm, which improves the compression performance. By just adding this block, it is possible to make use of the new algorithm without changing the two previous blocks. The encoded header format is not always the same; it changes according to the state of operation. The state is determined by the Golomb-Rice control based on the previous k-parameters. The possible encoded formats are given in table 10.

Mode          Header format                                          Header length    Condition
start         {header}                                               3                Only first subtile of image
normal        {flag = '0'}                                           1                Current_k = Prev_k
              {flag = '1', header}                                   4                Current_k ≠ Prev_k
run-length    {-}                                                    0                run_length_counter < 15
              {run_length_flag = '1'}                                1                End of image
              {run_length_flag = '1'}                                1                run_length_counter = 15
              {run_length_flag = '0', run_length_counter, header}    8                Run-length mode broken
              {run_length_flag = '1', run_length_flag = '1'}         2                run_length_counter = 15 and end of image

Table 10: Header format generated by GR_ctrl block

The synthesis result of GR_ctrl is given in section 3.5.

3.2.7 Data_Packer (Variable Bit Length Packer to Memory Word)

The output of the GR_Encoder block is five pairs of code and length registers. In each pair, the length register determines the number of valid bits in the code register. According to the throughput requirements, at each clock cycle the values of these code registers must be stored in the memory of 64-bit word-length. We need a piece of hardware, called the data packer, in order to combine these variable-length codes and store them in the memory each cycle, while keeping track of the next empty buffer position for the next cycle. The data packer hardware consists of four different stages, as shown in figure 37. At each stage a certain block of hardware is used. In the first stage there are two instantiations of a block called P1. This block takes two variable-length codes as well as information about their lengths, combines them by a concatenation operation (shift & OR), and gives out this result as well as a new length value which is the sum of the original ones. The outputs of the two P1 blocks are the inputs to the P2 block. This block, and also the next-stage block P4, does exactly what P1 does but with different register word-lengths. In the last stage there is one P3 block, which is the only sequential block in the data packer; it performs the final combination as well as keeping track of the next empty buffer position. Whenever one 64-bit word of packed data is ready, P3 gives it out and issues a ready


signal, which is used to generate a write request to the memory in the control block. The hardware realization of the P3 block of the data packer is given in figure 35.

Figure 35: P3 block, basic hardware realization

The output of the data packer is stored in the memory and is organized as shown in figure 36.

Figure 36: Packed data order format in the memory

[Figure 36 diagram: bits 0–63 of a memory word are filled with the encoded Y, Co, Cg and α components of the first subtile, followed by the components of the second subtile, and so on. Figure 35 shows the 64-bit buffer together with a prev_length register.]

Figure 37: Data Packer

As shown in figure 40, the data packer is not considered part of the datapath, because the hardware overhead of this block is independent of how the datapath is designed. The data packer design is not optimized, since it was not the focus of this work. According to the design constraints, the data packer must be capable of packing four codes per clock cycle and giving the result out whenever a 64-bit packed word is ready.
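A behavioural sketch of the packing operation (Python; it models only the buffering behaviour, not the P1/P2/P4/P3 tree, and assumes the first-written code ends up at the low-order end of the 64-bit word; emit_word is a hypothetical callback standing in for the memory write request):

    class DataPacker:
        def __init__(self, emit_word):
            self.emit_word = emit_word   # called with each completed 64-bit word
            self.buffer = 0
            self.filled = 0              # number of valid bits in the buffer

        def push(self, code, length):
            # Append 'length' bits of 'code' after the bits already buffered.
            self.buffer |= (code & ((1 << length) - 1)) << self.filled
            self.filled += length
            if self.filled >= 64:
                self.emit_word(self.buffer & (2**64 - 1))
                self.buffer >>= 64
                self.filled -= 64

        def flush(self):
            if self.filled:
                self.emit_word(self.buffer)    # final, partially filled word
                self.buffer, self.filled = 0, 0

    words = []
    packer = DataPacker(words.append)
    for code, length in [(0b110, 3), (0b100, 3), (0b000101, 6), (0b111, 3)]:
        packer.push(code, length)
    packer.flush()
    print(["{:064b}".format(w) for w in words])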

3.2.8 Addr_Gen2 (Destination memory address generator)

This sub-block is responsible for generating the destination memory addresses for the compressed data. The compressed data, when packed into 64-bit memory words, are written to consecutive locations in the memory. Therefore this sub-block is simply a 24-bit counter with parallel load, used to load the destination memory start address at the beginning of the compression operation. The block interface is shown in figure 38.


Figure 38: Destination memory address generator block interface

The synthesis results of this sub-block are given in section 3.5.

3.2.9 Compressor_Ctrl (Control Path)

This sub-block is responsible for controlling all other blocks in the compression hardware. More specifically, it is responsible for starting the operation, stalling the datapath when memory is not available and providing the other sub-blocks with image traversal information such as end of lines, first line, etc. The block interface is shown in figure 39.

Figure 39: Control path block interface


The control path basically consists of a 25-bit pixel counter that keeps track of the current position in the image. Comparators compare the pixel counter against specific values to detect positions in the image such as the first line, end of line and end of image. The enable signals of the datapath and address generator sub-blocks are used to stall the pipeline whenever the memories are not available (the pipeline is stalled according to the "rdy" input signals coming from the memories or memory controllers).
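As a behavioural illustration of the comparator logic (a sketch only; the names below are made up for the example, and the real block also gates the counter with the enable/"rdy" handshaking), the traversal flags can be derived from the pixel counter as follows:

    def traversal_flags(pixel_counter, width, height):
        """Derive image-traversal information from a row-major pixel counter
        (one pixel is assumed to be processed per enabled clock cycle)."""
        column = pixel_counter % width
        first_line   = pixel_counter < width                 # still inside the first image line
        end_of_line  = column == width - 1                   # last pixel of the current line
        end_of_image = pixel_counter == width * height - 1   # very last pixel of the image
        return first_line, end_of_line, end_of_image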

3.2.10 Overall Compressor Datapath and Address Generation

Figure 40: Overall Compressor


Figure 40 shows the complete hardware for the compressor datapath and address generator blocks. The design is divided into three main blocks. Our effort has been spent mainly on the datapath and address generator blocks. The control path and data packer blocks are designed to be synthesizable and to ensure correct overall functionality under the given constraints; however, these blocks are not optimized and need more detailed design consideration. The synthesis result of the datapath is given in section 3.5.

3.3 Decompressor Block

The block diagram of the decompressor is given in figure 41.

Figure 41: Decompressor Block

The interface port description of the decompressor is given in table 11.

Port name Width Direction Source / Dest Description

clk 1 I Controller 208 MHz clock signal

rst 1 I Controller Block reset signal

start 1 I Controller Decompression start signal

finish 1 O Controller Decompression complete signal


start_addr1 24 I Controller Destination memory start address

start_addr2 24 I Controller Source memory start address

rd_req1 1 O Dest. mem. controller Read request from dest. memory

wr_req1 1 O Dest. mem. controller Write request to dest memory

rdy1 1 I Dest. mem. controller Destination mem. data available

addr1 24 O Dest. memory Destination mem. address bus

Input1 64 I Dest. memory Destination mem. data bus

Output1 64 O Dest. memory Destination mem. data bus

rd_req2 1 O Source mem. controller Read request from source memory

wr_req2 1 O Source mem. controller Write request to source memory

rdy2 1 I Source mem. controller Source mem. data available

addr2 24 O Source memory Source mem. address bus

Input2 64 I Source memory Source mem. data bus

Output2 64 O Source memory Source mem. data bus

Table 11: Decompressor Block Interface Port Description

It should be noted that, similar to the compressor, the decompressor block writes data only into the destination memory. However, unlike the compressor, it reads data from both the source memory and the destination memory. Therefore, only "wr_req2" is connected to logic '0'.

The sub-blocks inside the decompressor block are discussed in the following subsections:

3.3.1 Addr_Gen2 (Source memory address generator)

This sub-block is responsible for generating source memory addresses for reading in the compressed data. Since the compressed data, packed into 64-bit memory words, are located in consecutive memory locations, this sub-block is simply a 24-bit counter with parallel load, used to load the source memory start address at the beginning of the decompression operation.

The block interface is shown in figure 42.

Figure 42: Source memory address generator block interface

The synthesis results of this sub-block are given in section 3.5.


3.3.2 Rev_Color_T (Reverse Color Transformer)

The reverse color transform sub-block performs the YCoCg → RGB conversion as explained in subsection 2.3.1. The block interface is shown in figure 4 in subsection 2.3.1. The hardware diagram of the sub-block is given in figure 43.

Figure 43: Reverse Color Transform hardware diagram
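For reference, a software sketch of the reverse transform is given below. It assumes the lifting-based YCoCg-R form of [9], which is exactly invertible with integer shifts and additions; the exact bit widths and carry-in details of the hardware in figure 43 are not reproduced.

    def ycocg_to_rgb(y, co, cg):
        """Inverse of the lifting-based YCoCg-R transform (integer, exactly reversible)."""
        t = y - (cg >> 1)   # undo the lifting steps in reverse order
        g = cg + t
        b = t - (co >> 1)
        r = b + co
        return r, g, b

Because every step is an integer shift and addition that is undone in reverse order, the transform maps integers to integers and is exactly reversible.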


The synthesis result of this sub-block is given in section 3.5.

3.3.3 Const_RegFile_Ctrl (Construction Register File Controller)

The functionality of this sub-block is similar to the Pred_RegFile_Ctrl block in the compressor datapath. It is responsible for providing the constructor block with the three neighboring pixels (x1, x2, x3) in the prediction window of the current pixel to be constructed, as shown in figure 6. The block interface is shown in figure 44.

Figure 44: Construction Register File Controller interface

The operation is controlled by the decompression control block through the signals "enable", "end_of_line" and "end_of_image". In each cycle this sub-block receives one pixel (p) read from memory as well as the currently constructed pixel (x), to be used for subsequent predictions. The sub-block outputs the (x1, x2, x3) pixels to the combinational constructor block. This is performed for each pixel of one 2×2 subtile before passing to the next subtile.

The functionality is slightly different from the Pred_RegFile_Ctrl sub-block in the compressor datapath in the sense that the current pixel (x) is not an input to the constructor but an output; hence it is not provided by the Const_RegFile_Ctrl sub-block. This leads to different storage requirements in the block: at any time instant at most 5 pixel values need to be stored. The block includes five 9-bit registers and a state machine controlling the data transfers among them as well as the input and output. The basic data transfer scheme is shown in figure 45. The block outputs are taken directly from registers X1, X2, X3, while registers A and B are used for temporary storage. As an example, the figure shows the input connectivity of register X3, i.e. register X3 receives data only from register X1 and register A in different states. Different data transfer schemes resulting in the same


functionality are possible; however, the connectivity affects the MUX sizes at the register inputs and hence the hardware cost. In this scheme, priority was given to using 4:1 MUXes or smaller, and state S4 of the scheme changes at the end of lines due to the change in the prediction window.

Figure 45: States and register input connectivity in Construction Register File Controller

This sub-block is instantiated four times in the design, corresponding to the four components, namely Y, Cg, Co and α. The synthesis results are given in section 3.5.

3.3.4 Constructor

The constructor block uses the same prediction hardware (figure 29) as the predictor. The sub-block performs unsigned-to-signed conversion, modular correction and the final x̂ + e addition, as explained in subsection 2.3.2. The constructor hardware is given in figure 46.
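The sketch below illustrates one possible reading of these three steps. The even/odd mapping of the decoded value back to a signed residual and the mod-2^9 wrap-around are assumptions made for illustration only and may differ in detail from the exact scheme of subsection 2.3.2.

    def construct(value, prediction, bits=9):
        """Rebuild one component: map the decoded non-negative value to a signed
        residual (unsigned-to-signed conversion), then add it to the prediction
        with wrap-around (modular correction)."""
        e = (value >> 1) if value % 2 == 0 else -((value + 1) >> 1)
        return (prediction + e) % (1 << bits)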

This sub-block is instantiated four times in the design, corresponding to the four components, namely Y, Cg, Co and α. The synthesis results are given in section 3.5.

Contents of figure 45 (register input sources in each state):

    State    X1    X2    X3    A     B
    S1       P     X     A     X2    B
    S2       X2    B     A     X     B
    S3       A     X     X1    P     B
    S4       P     X1    A     P     X


Figure 46: Constructor block Hardware diagram

3.3.5 Dec_RegFile_Ctrl (Golomb-Rice Decoder Register File Controller)

This sub-block performs the inverse operation of the Enc_RegFile_Ctrl sub-block in the compression datapath; more precisely, it performs subtile-to-pixel conversion. At each clock cycle it receives the error residuals of one component for the four pixels of a subtile, as decoded by the GR_Decoder, and outputs the four component error residuals corresponding to one pixel to the corresponding predictors.

The block interface is shown in Figure 47.


Figure 47: Decoder Register File Controller block interface

The design of this sub-block is identical to the Enc_RegFile_Ctrl sub-block of the compression datapath; refer to subsection 3.2.5 for details. The synthesis result of this sub-block is given in section 3.5.

3.3.6 GR_Decoder (Golomb-Rice Decoder)

A very simple circuit is used as the GR_Decoder. In order to fulfill the throughput requirements, four instances of this block are needed in the design, since all four pixel errors must be ready at the same time. The inputs to this block are the quotient, q, the residual, r, and the Golomb-Rice parameter, k, and the output is the pixel error, e, computed using (5), which can be realized as in figure 48.

Figure 48: Golomb-Rice Decoder hardware
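A one-line behavioural model of the decoder (a sketch of the recombination performed by (5), not of the gate-level multiplexer structure of figure 48):

    def gr_decode(q, r, k):
        """Recombine quotient and remainder: shift q left by k and OR in the k LSBs of r."""
        return (q << k) | (r & ((1 << k) - 1))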


The synthesis result of GR_Decoder is given in section 3.5.

3.3.7 Data_Unpacker (Variable Bit Length Unpacker from Memory Word)

This sub-block performs the reverse task of the data packer block in the compression path. It consists of two sub-blocks, unpacker and GR_ctrl. The unpacker block receives a 64-bit data stream from the memory and extracts four codes as well as the Golomb-Rice parameter that was used for encoding the data. The codes are given out in terms of quotients, q, and residuals, r. The GR_ctrl block is designed based on the GR_ctrl in the compressor block; as a control block, it supplies the unpacker with the k parameter of the previous cycle as well as the current state of operation, which is necessary in order to use our modified algorithm. This information is needed by the decoding procedure since, as mentioned in the subsection on the header format generated by GR_ctrl, there are several data formats in the stream. In order to fulfill the throughput requirements, the data unpacker must be capable of producing four output codes per cycle.
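The sketch below shows why unpacking is harder than packing: the length of each codeword is only known after its unary quotient has been scanned bit by bit. The bit order and the unary convention (a run of ones terminated by a zero) are assumptions chosen for illustration, not a statement of the exact code format used.

    def unpack_one(bits, pos, k):
        """Extract one (quotient, remainder) pair starting at bit index 'pos' of the
        bit list 'bits'; return the pair and the position of the next codeword."""
        q = 0
        while bits[pos] == 1:          # scan the unary quotient bit by bit
            q += 1
            pos += 1
        pos += 1                       # skip the terminating zero
        r = 0
        for _ in range(k):             # then take the k remainder bits
            r = (r << 1) | bits[pos]
            pos += 1
        return q, r, pos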

Figure 49: Data Unpacker Interface and block diagram


As shown in figure 53, the data unpacker is not considered part of the datapath, because the hardware overhead of this block is independent of how the datapath is designed. The data unpacker design is not optimized, since that was not the focus of this thesis work.

3.3.8 Addr_Gen1 (Destination memory address generator)

This sub-block is responsible for generating addresses for destination memory. The generated

addresses can be both read addresses and write addresses. The write operation writes

uncompressed output data to the destination memory. The read operation is required for

prediction operation inside the constructor block.

Figure 50: Read / Write Adresses from/to destination memory to construct one subtile

Figure 50 shows that the blue subtile needs two write operations, to addresses "A" and "A + w/2". However, a read from address "A - w/2" (shown with a circle) must be performed beforehand in order to construct this subtile. The conclusion is that for each subtile one read and two write operations must be performed, and the corresponding addresses need to be generated.

During implementation, due to the pipeline latency from the destination memory through the color transform to the constructor, the read data must be requested two cycles before it is used. Hence, the actual addressing scheme is shown in figure 51 and given in table 12.

Figure 51: Actual addressing scheme for destination memory addresses


Decoding of one subtile takes four cycles, while three cycles are enough to write/read required

data to/from memory. In the spare cycle the memory bus is released to allow better usage of

memory bandwidth. In this spare cycle, both “rd_valid” and “wr_valid” signals are ‘0’.

All these considerations lead to the basic cyclic addressing scheme of:

                cycle 1        cycle 2        cycle 3      cycle 4
    Operation   write          read           write        -
    Address     -(w/2 - 1)     -(w/2 - 2)     +(w - 2)     keep

Table 12: Destination memory address generator addressing scheme
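Read as per-cycle increments of the address register (one plausible interpretation of table 12; the real block also handles the first line and the last two subtiles of a line differently, as noted below), the scheme can be sketched as:

    def subtile_address_cycle(addr, w):
        """Replay the four-cycle pattern of table 12 for one mid-image subtile.
        'w' is the image width in pixels; 'addr' is the running address register."""
        schedule = [("write", -(w // 2 - 1)),
                    ("read",  -(w // 2 - 2)),
                    ("write",  w - 2),
                    (None,     0)]           # spare cycle: the memory bus is released
        for operation, offset in schedule:
            addr += offset
            yield operation, addr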

The ends of lines and the first line of the image again need special treatment. The first line of the image has no pixels above it, hence no read is needed. The prediction window change at the end of a line changes the addressing scheme for the last two subtiles of each line.

The block diagram is given in figure 52 and the synthesis results are in section 3.5.

Figure 52: Destination memory address generator block interface

3.3.9 Decompressor_Ctrl (Control Path)

This sub-block is responsible for controlling all other sub-blocks in the decompression hardware. More specifically, it is responsible for starting the operation, stalling the datapath when memory is not available and providing the other sub-blocks with image traversal information such as end of lines, first line, etc.

The control path basically consists of a 25-bit pixel counter that keeps track of the current position in the image. Comparators compare the pixel counter against specific values to detect positions in the image such as the first line, end of line and end of image. The enable signals of the datapath and address generator sub-blocks are used to stall the pipeline whenever the memories are not available (the pipeline is stalled according to the "rdy" input signals coming from the memories or memory controllers).


3.3.10 Overall Decompressor Datapath and Address Generation

Figure 53 shows the complete hardware for the decompressor datapath and address generator blocks. The design is divided into three main blocks. Our effort has been spent mainly on the datapath and address generator blocks. The control path and data unpacker blocks are designed to be synthesizable and to ensure correct overall functionality under the given constraints; however, these blocks are not optimized and need more detailed design consideration. The synthesis result of the datapath is given in section 3.5.

Figure 53: Overall Decompressor


3.4 Functional Verification Framework

In order to verify the functionality of the design, a verification framework has been designed. This framework consists of two RAM blocks, a counter to generate the memory addresses, and a control block, which is a finite state machine. There are three modes of operation during the functional verification. The input image to be compressed has already been converted to a binary representation in MATLAB and stored in a text file, which serves as the input file to the framework. This file has 64 columns, equal to the 64-bit memory word length. The number of lines in the file depends on the image size, and the memory size therefore has to be adjusted accordingly.
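As an illustration of that file format (a sketch only; the actual MATLAB conversion script and the pixel packing order within a word are not reproduced here), each line of the input file can be produced as follows, assuming two 32-bit pixel words per 64-bit memory word:

    def write_memory_image(pixels32, filename):
        """Write a list of 32-bit pixel words as lines of 64 '0'/'1' characters,
        two pixels per 64-bit memory word (assumes an even number of pixels)."""
        with open(filename, "w") as f:
            for i in range(0, len(pixels32), 2):
                word = (pixels32[i] << 32) | pixels32[i + 1]
                f.write(format(word, "064b") + "\n")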

In the first mode, the content of the input file is stored in the source memory, RAM_1. This data transfer takes x clock cycles, where x is the number of lines in the input file. The FSM then issues the "exec_start" signal and the operation enters the next mode, in which the compressor/decompressor performs its task and sends the result to the destination memory, RAM_2. When the compressor/decompressor issues the "exec_finish" signal, the FSM changes to the final mode, in which the content of RAM_2 is written to the output text file. The verification FSM is shown in figure 54. The functional verification blocks are shown in figure 55.

For each test image, two output text files are generated: one by this RTL verification framework running the VHDL code and the other by the equivalent algorithm in MATLAB. The correct functionality of the design is verified, for several test images, by comparing these two text files and checking that they are identical.

Figure 54: Verification Framework FSM


Figure 55: Functional Verification Framework

3.5 Synthesis Results

The compressor and decompressor datapath blocks are synthesized with the target clock

frequency of 208 MHz.

The overall block size of the compressor is 10.55 kgates. This size includes the datapath blocks, input-output data registers and address generators. 61.3 % of this block is combinational logic and the rest is sequential logic. The overall block includes 618 registers.

The sub-block sizes inside the datapath are also of interest. Table 13 shows the hierarchical area

distribution of the whole block.


    Block Name                               Area (kgates)
    Compressor                               10.55
    - Color Transform 1                       0.23
    - Color Transform 2                       0.21
    - Golomb-Rice Encoder                     2.34
        Encoder 1                             0.37
        Encoder 2                             0.36
        Encoder 3                             0.37
        Encoder 4                             0.37
        GR control                            0.48
        GR parameter estimation               0.36
    - Input Preparation                       0.67
    - Address generator 1                     0.59
    - Address generator 2                     0.45
    - Encoder register file control           1.87
    - Prediction register file control 1      0.65
    - Prediction register file control 2      0.66
    - Prediction register file control 3      0.72
    - Prediction register file control 4      0.73
    - Predictor 1                             0.33
    - Predictor 2                             0.38
    - Predictor 3                             0.39
    - Predictor 4                             0.34

Table 13: Compressor Synthesis Result

The overall block size of the decompressor is 9.23 kgates. This size includes the datapath blocks, input-output data registers and address generators. 58.8 % of this block is combinational logic and the rest is sequential logic. The overall block includes 584 registers.

The sub-block sizes inside the datapath are also of interest. Table 14 shows the hierarchical area

distribution of the whole block.

    Block Name                                   Area (kgates)
    Decompressor                                 9.23
    - Color Transform                             0.29
    - Golomb-Rice Decoder                         0.27
    - Output Preparation                          1.46
    - Reverse Color Transform                     0.18
    - Address generator 1                         0.68
    - Address generator 2                         0.46
    - Construction register file control 1        0.51
    - Construction register file control 2        0.51
    - Construction register file control 3        0.58
    - Construction register file control 4        0.58
    - Constructor 1                               0.31
    - Constructor 2                               0.38
    - Constructor 3                               0.38
    - Constructor 4                               0.31
    - Decoder register file control               1.58

Table 14: Decompressor Synthesis Result

An important result that can be extracted from the synthesis is that the functional blocks constitute a relatively small portion of the overall cost. The more costly operations are those related to the traversal of the image. Also, due to the high throughput requirement (one pixel per clock in our case), the pipeline registers and temporary storage registers constitute a significant portion of the overall size. In that sense, it would not be wrong to claim that fast implementations of such simple algorithms are control-dominated in terms of hardware cost.

Another result is that the GR_ctrl block, which was added to improve the compression ratio, takes only 480 gates to implement, corresponding to 4.5% of the overall compressor datapath.

3.6 Evaluation of Other Hardware Implementations

During this thesis work, several hardware implementations of lossless compression algorithms were investigated. Most implementations target either medical applications, such as wireless endoscopy systems, or space applications. The majority of the implementations are based on LOCO-I / JPEG-LS with minor modifications to adapt it better to hardware constraints such as speed and area. Another common feature is that generally only the compression side is implemented, and most algorithms compress 8-bit pixels. In this section a survey of the investigated implementations is given together with their basic features.

3.6.1 Parallel pipeline Implementation of LOCO-I for JPEG-LS [17]

This is a parallel, pipelined version of a modified LOCO-I lossless compression algorithm used within the JPEG-LS coding scheme. It doubles the memory required for context statistics, but achieves a speed-up of almost 2. The latency is 8 clock cycles, and it yields an encoding speed in the range of 1.1-1.7 pixels/clock. The context memories are dual-port devices, each consisting of 368 38-bit words. The synthesis libraries are the ST HCMOS9 0.13 µm process libraries. The synthesized encoder uses a total of 539521 µm², or 76660 equivalent gates.


3.6.2 Benchmarking and Hardware Implementation of JPEG-LS [18]

This is a low-complexity version of the JPEG-LS algorithm. A shared memory architecture is used between the encoder and the decoder, assuming that they do not process at the same time. The target is high-speed, real-time compression. The required on-chip memory is 4 KB. The VHDL code is synthesized with Synopsys; the overall chip area, without wire interconnections, is 373,862 gates, of which 324,405 gates belong to the on-chip memory and the other 49,457 gates to the functional units. The overall power consumption is 59.07 mW. The emphasis has been on operation time rather than hardware area. No information is given about throughput or maximum frequency; the processing speed is measured using a 15 ns clock cycle.

3.6.3 A Lossless Image Compression Technique Using Simple Arithmetic Operations [19]

The algorithm implemented is based on logarithmic number system (LNS) properties. It is suitable for high-quality still image compression where the information content is very large (there is little redundant data). The aim is to speed up encoding and decoding by using only a few addition/subtraction and shift operations, giving a simple architecture with fast encoding and decoding. Each pixel is represented by 8 bits. The algorithm is implemented and synthesized using the Xilinx Integrated Software Environment (ISE) with the following results: for the forward arithmetic compression algorithm (FAC, the compression path), the number of slices is 1897, the number of flip-flops is 236 and the number of 4-input LUTs is 2766; for the inverse arithmetic compression algorithm (IAC, the decompression path), the number of slices is 52, the number of flip-flops is 64 and the number of 4-input LUTs is 94.

3.6.4 A Low Power, Fully Pipelined JPEG-LS Encoder for Lossless Image Compression [11]

A fully pipelined VLSI architecture with a clock management scheme is proposed for real-time data processing and low-power applications. The input image has a maximum resolution of 640×480×8 bits. The system clock frequency is 40 MHz and the sensor's output pixel frequency is 10 MHz. The design has been implemented in UMC 0.18 µm technology. The total size of the JPEG-LS encoder is 17.6 kgates, plus 18 kbits of on-chip SRAM. The overall power consumption is reduced by 15.7% by the clock management scheme.


3.6.5 Hardware Implementation of a Lossless Image Compression Algorithm Using a FPGA [20]

The algorithm is based on LOCO-I with some modifications to reduce the complexity. 8-bit pixel values are used. The total amount of SRAM needed by the algorithm is 1K × 8 for the pixel memory (the maximum image width is 1024 pixels) and 1K × 32 for the context memory (in total about 5 KB). The clock frequency is 12 MHz and the latency is nine clock cycles. The throughput is 1.33 Mpixels/second.

3.6.6 Comparison

In this subsection a comparison of several lossless compression hardware implementations is given in table 15. The data is mainly taken from [11]; the rest is extracted from the respective scientific papers.

    Implementation   Technology      Area (gates)                    Memory usage (bits)   Operating frequency (MHz)        Throughput
    [17]             STM 0.13 µm     53096                           236838                -                                1.33
    [18]             -               49457                           2k                    66                               0.0364
    [19]             Xilinx (ISE)    1897 slices, 236 flip-flops,    no context            -                                -
                                     2766 4-input LUTs
    [11]             UMC 0.18 µm     17.6k                           36534                 10 (main clk) / 40 (high clk)    1
    [20]             Xilinx XCV50    -                               -                     12                               0.1108
    Proposed         65 nm           10.55k                          no context            208                              1

Table 15: Characteristics of different hardware implementations

Note that the proposed implementation does not include the data packer and control path blocks. Also, only the compressor size is given, to make it comparable to the other implementations.


Chapter 4

4 Conclusion

The work carried out in this thesis investigated several color compression algorithms, from a hardware implementation point of view, for use in high-throughput hardware. One such algorithm was implemented in hardware in order to validate early cost estimations and to gain more insight into the parts of the algorithms that are problematic for hardware implementation.

4.1 Workflow

Several scientific papers were investigated, covering both available algorithms and their hardware implementations. The reference algorithm [1] was simulated in the MATLAB environment. A possible hardware realization of the functional blocks for compression and decompression was proposed and their cost estimated. The work then continued by introducing a compression algorithm based on a modification of the algorithm used in [1], in order to obtain better compression performance while keeping the hardware cost reasonably low. The MATLAB simulation of this modified algorithm verified a significant improvement in the compression ratio. This algorithm was chosen for hardware implementation in VHDL. The hardware was designed according to the requirements and constraints and simulated with ModelSim in order to verify the functionality as well as the throughput requirements. Finally, the design was synthesized in order to extract timing and area information.

4.2 Results and Outcomes

The synthesis results given in section 3.5 show that the area estimations for the functional blocks of the datapath, given in tables 2 and 5, are reasonably close to the actual sizes. Table 16 combines our estimations and the actual sizes.


                                     estimated size        actual size
                                     (in NAND2 gates)      (in NAND2 gates)
    Color/Reverse Color Transform         306                   290
    Predictor/Constructor                 522                   390
    GR k-determination                    279                   360
    GR encoding                           180                   360

Table 16: Comparison of cost estimations and actual sizes for blocks

Our size estimations in tables 2 and 5 are given in terms of the number of full adders. To be able to compare them with the actual block sizes, we assumed that each full adder is equivalent to nine NAND2 gates. The Color Transform and GR k-determination estimations are quite close to the real sizes. The actual predictor/constructor block is smaller because the estimation assumed 6 adders, whereas the design uses 5 adders (figures 29 and 30). The GR encoder estimation is too small because only adders were taken into account in the estimations, while for this block other components, such as OR gates and MUXes, constitute a significant portion of the block size.

In our view, the most important outcome of the hardware implementation concerns the generic task of variable-length data packing and unpacking. The implementation revealed that the high throughput requirement complicates the design of data packing/unpacking significantly. To fulfill our throughput requirement, both blocks must pack/unpack four variable-length codewords each clock cycle. It may be possible to parallelize this operation with several units, but the size would probably be too large to afford. For data unpacking the design is even more difficult, since a bit-by-bit read of the packed data is required, which is inherently a serial operation. Hence, our implementation shows that the packer/unpacker is the bottleneck of the overall design, both in terms of size and speed.

We can summarize the outcomes of this thesis work in the following four main points:

- An average size is given for the complete datapath and address generation blocks (~10 kgates).

- Data packing/unpacking is identified as the most critical task for a high-throughput hardware implementation.

- The compression ratio is improved (+15%), especially for compressible scenes such as user menus (+25%), at little extra hardware cost (+4.5%).

- The exhaustive search method is replaced with an estimation, which significantly reduces (-58%) the hardware size of the overall Golomb-Rice encoding with almost the same compression capability (-0.2%).


4.3 Future Work

An immediate item of future work is to investigate a fast and efficient implementation of variable-length data packing/unpacking and to integrate it with the existing datapath. When this is done, it will be possible to see the overall hardware size.

Other possible future work on algorithmic improvements has been discussed in section 2.7.


References

[1] J. Rasmusson, J. Hasselgren, T. A. Möller, "Exact and error-bounded approximate color buffer compression and decompression," in Graphics Hardware 2007, San Diego, California, Aug. 2007.

[2] Course homepage "Mobile computer graphics", Faculty of Computer Science, Lund University, http://www.cs.lth.se/EDA075/

[3] P. G. Howard and J. S. Vitter, "Fast and efficient lossless image compression," in Proc. IEEE Data Compression Conference (DCC 1993), pp. 351-360, Snowbird, Utah, USA, March 1993.

[4] M. J. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: A low complexity, context-based lossless image compression algorithm," in Proc. IEEE Data Compression Conference (DCC 1996), pp. 140-149, Snowbird, Utah, USA, March 1996.

[5] X. Wu and N. D. Memon, "Context-based, adaptive, lossless image coding," IEEE Trans. Commun., vol. 45, no. 4, pp. 437-444, Apr. 1997.

[6] M. J. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: A low complexity, context-based lossless image compression algorithm: principles and standardization into JPEG-LS," IEEE Trans. Image Processing, vol. 9, no. 8, pp. 1309-1324, August 2000.

[7] S. W. Golomb, "Run-length encodings," IEEE Trans. Inform. Theory, vol. IT-12, pp. 399-401, 1966.

[8] R. F. Rice, "Some practical universal noiseless coding techniques," Tech. Rep. JPL-91-3, Jet Propulsion Laboratory, Pasadena, CA.

[9] H. Malvar, G. Sullivan, "YCoCg-R: A Color Space with RGB Reversibility and Low Dynamic Range," JVT-I014r3, 2003.

[10] S. A. Martucci, "Reversible compression of HDTV images using median adaptive prediction and arithmetic coding," in Proc. IEEE Intern'l Symp. on Circuits and Syst., pp. 1310-1313, IEEE Press, 1990.

[11] X. Li, X. Chen, "A low power, fully pipelined JPEG-LS encoder for lossless image compression," IEEE, Beijing, China, 2007.

[12] J. Coalson, FLAC - Free Lossless Audio Codec (2005), http://flac.sourceforge.net/

[13] K. Veeraswamy, S. Srinivaskumar, "Lossless image compression using topological pixel re-ordering," JNTU, College of Engineering, Kakinada, India.

[14] S. Andriani, "Lossless compression and interpolation for high quality still images and video sequences," Ph.D. thesis, University of Padova, Faculty of Engineering, 2006.

[15] The USC-SIPI Image Database, University of Southern California, Electrical Engineering Department, Signal and Image Processing Institute, http://sipi.usc.edu/database/index.html

[16] I. Matsuda, T. Kaneko, A. Minezawa, S. Itoh, "Lossless coding of color images using block-adaptive inter-color prediction," IEEE ICIP, 2007.

[17] M. Ferretti, M. Boffadossi, "A Parallel Pipeline Implementation of LOCO-I for JPEG-LS," 17th International Conference on Pattern Recognition (ICPR'04), vol. 1, pp. 769-772, 2004.

[18] A. Savakis and M. Piorium, "Benchmarking and Hardware Implementation of JPEG-LS," ICIP'02, Rochester, NY, Sep. 2002.

[19] S. Kummar Pattaniak, K. K. Mahapatra, "A Lossless Image Compression Technique Using Simple Arithmetic Operations and Its FPGA Implementation," IEEE, 2006.

[20] M. Klimesh, V. Stanton, and D. Watola, "Hardware Implementation of a Lossless Image Compression Algorithm Using a Field Programmable Gate Array," NASA JPL TMO Progress Report 42-144, 2001.


APPENDIX A

Proposed Cost Reduction Method Analysis

A.1 Overlap-limited Search

Consider a block B of size n, B = {e1, e2, …, en}, where n is the number of input values to be encoded together, n an integer, n > 0.

e1 e2 …

e3 e4 …

… … en

Figure 56: One block of n values

The Golomb-Rice code length of each input ei in the block (when encoded with parameter k) is

    li = qi + k + 1

where qi is the quotient of the integer division of the input value by 2^k, i.e. qi = ⌊ei / 2^k⌋.

For the general case where n and k are variables, we find the total length of the encoded block as a function of n and k (in the calculations, Lk = Ltotal-k - n will be used, which does not affect the comparison result since n is a common term for all k):

    Lk = q1 + q2 + … + qn + nk                                        (1)

where qi = (ei - ri) / 2^k and ri is the remainder of the division. So,

    Lk = (e1 + e2 + … + en - (r1 + r2 + … + rn)) / 2^k + nk

We define eT and rT as

    eT = e1 + e2 + … + en,   rT = r1 + r2 + … + rn

Then,

    Lk = (eT - rT) / 2^k + nk


We also have 0 ≤ ri ≤ 2^k - 1 for the remainders, so 0 ≤ rT ≤ n(2^k - 1), and

    Lk ≤ eT / 2^k + nk                                                (I)

    Lk ≥ (eT - n(2^k - 1)) / 2^k + nk                                 (II)

Combining (I) & (II):

    (eT - n(2^k - 1)) / 2^k + nk  ≤  Lk  ≤  eT / 2^k + nk

The above inequality gives the bounds of Lk (the length of the output code when coded with parameter k) as a function of n, k and eT.

Now we want to find the overlap region between two adjacent length functions Lk and Lk+1:

    a ≤ Lk ≤ b
    c ≤ Lk+1 ≤ d

Three regions are of interest:

A. Lk > Lk+1 for every block with this eT (a > d): in this case k+1 is always better than k, so no comparison is needed.

B. Lk < Lk+1 for every block with this eT (c > b): in this case k is always better than k+1, so no comparison is needed.

C. If neither (A) nor (B) is satisfied, Lk and Lk+1 need to be compared to find whether k or k+1 gives the shorter code length. Hence, the inequalities in (A) and (B) give the bounds of the overlap region with respect to eT.

Upper bound of the overlap region with respect to eT:

    a > d  ⟹  eT > 2n(2^(k+1) - 1)

Lower bound of the overlap region with respect to eT:

    c > b  ⟹  eT < n·2^k

So, the overlap region of two adjacent length functions Lk and Lk+1 is

    n·2^k ≤ eT ≤ 2n(2^(k+1) - 1)                                      (2)

Equation (2) is a general formulation of the overlap region for two adjacent code lengths.

Figure 57: Overlap regions of consecutive length functions with respect to eT

Considering figure 57, it is mathematically provable that, for any given block of input data, there is an overlap region only between three consecutive length functions Lk, Lk+1, Lk+2.

Proof: Assume that there is an overlap region between A (the overlap region of Lk and Lk+1) and C (the overlap region of Lk+2 and Lk+3) for at least one eT, i.e.

    n·2^(k+2) ≤ 2n(2^(k+1) - 1)

By solving this inequality we get n ≤ 0, which is impossible, since n represents the number of input values in a block. So there is never such an overlap region, for any given set of k and any block size n.


Result: For any block size n, only three consecutive k values (k, k+1, k+2) among the consecutive set of all possible k values can give the minimum encoded length, i.e. can be the best k-parameter for that block. These three k values depend on the sum of the input values (eT) in the block. Therefore, once they are located, it is sufficient to compare only the three length functions (two comparisons) corresponding to k, k+1 and k+2.

Figure 58 illustrates the case for n = 4. Point a (eT = 24) on the plot is the boundary between L1 and L2 such that after point a, L1 is always greater than L2; so L1 need not be considered any more after point a. The other point of interest is point b (eT = 32), which is the boundary for the start of L4 such that before point b, L3 is always smaller than L4; so L4 need not be considered before point b. Since point a (eT = 24) lies before point b (eT = 32), L1 and L4 never need to be compared.

Figure 58: Overlap regions between length functions L1, L2, L3, L4

Now we apply this method to our application. Our application, given in [4], operates on subtiles of 2×2 pixels, hence n = 4 in our case. We want to find the Golomb-Rice parameter for a subtile in the set k = {0, 1, 2, 3, 4, 5, 6}. The overlap regions for our case are shown in figure 59.


Figure 59: Overlap regions for n=4 and k= {0, 1, 2, 3, 4, 5, 6} with respect to eT

Figure 60 shows the overlap regions with respect to the sum of the input values, eT, which can be in the range 0 to 2044 (9-bit unsigned inputs). We only have overlap regions in the interval [4, 504]. The conclusion that can be drawn from this figure is that we never need to do an exhaustive search among all seven Golomb-Rice parameters in order to find the best one. The alternative introduced here is to compute the sum of the inputs and find its corresponding overlap region.

(Figure 59 places the overlap regions on the eT axis: L0/L1 overlap for eT in [4, 8], L1/L2 in [8, 24], L2/L3 in [16, 56], L3/L4 in [32, 120], L4/L5 in [64, 248] and L5/L6 in [128, 504].)


By knowing the overlap region, the search is limited to at most three cases, which happens in the intervals [16, 24], [32, 56], [64, 120] and [128, 248], and to only two cases in the other intervals. Also notice that for eT > 504 and eT < 4, no comparison is needed at all. The only remaining issue is how to find the regions. In order to keep the hardware simple, we limit the boundaries between the regions to powers of two. As a result, we always compare three numbers, according to the regions below:

For eT < 16, compare L0, L1, L2

For 16 ≤ eT < 32, compare L1, L2, L3

For 32 ≤ eT < 64, compare L2, L3, L4

For 64 ≤ eT < 128, compare L3, L4, L5

For eT ≥ 128, compare L4, L5, L6

Figure 60: Required comparisons of overlap regions for n=4, k= {0, 1, 2, 3, 4, 5, 6} based on eT

It is worth noting that this solution is an exact method: it finds the best k-parameter among all seven possible values, but with only two comparator hardware units.
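A numerical sketch of the method for n = 4 and k in {0, …, 6} is given below; the helper names are ours, and the exhaustive search is included only to check that the restricted search always returns an equally short code.

    import random

    def code_len(block, k):
        """Lk = qT + n*k (the common '+n' term is dropped, as in the text)."""
        return sum(e >> k for e in block) + len(block) * k

    def best_k_overlap_limited(block):
        """Pick the candidate window from eT, then compare only three lengths."""
        e_t = sum(block)
        if   e_t < 16:  candidates = (0, 1, 2)
        elif e_t < 32:  candidates = (1, 2, 3)
        elif e_t < 64:  candidates = (2, 3, 4)
        elif e_t < 128: candidates = (3, 4, 5)
        else:           candidates = (4, 5, 6)
        return min(candidates, key=lambda k: code_len(block, k))

    def best_k_exhaustive(block):
        return min(range(7), key=lambda k: code_len(block, k))

    # Self-check over random blocks of four 9-bit values.
    for _ in range(10000):
        block = [random.randrange(512) for _ in range(4)]
        assert code_len(block, best_k_overlap_limited(block)) == \
               code_len(block, best_k_exhaustive(block))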

This result is obtained when the set of all possible k values is a consecutive set. In specific implementations it could be the case that a smaller set is used. Now we extend our method and show that only one comparison is sufficient to find the best k-parameter if the overall set does not include any three consecutive k values, i.e. there exists no {k, k+1, k+2} subset in the set of k values.


To show this more general case, we need to derive a formula for the overlap region between Lk and Lk+2 as well. Following the same steps for Lk and Lk+2, with bounds

    m ≤ Lk ≤ M,   o ≤ Lk+2 ≤ p

the upper bound of the overlap region with respect to eT is

    m > p  ⟹  eT > (4n/3)(3·2^k - 1)

and the lower bound of the overlap region with respect to eT is

    o > M  ⟹  eT < 2n·2^k

So, the overlap region of Lk and Lk+2 is

    2n·2^k ≤ eT ≤ (4n/3)(3·2^k - 1)

Figure 61: Overlap regions of non-consecutive length functions with respect to eT


Figure 61 shows that if the overall set does not include any three consecutive k values (k+1 is not in the set in the figure), then there is an overlap between at most two length functions (Lk, Lk+2 or Lk+2, Lk+3 in the figure).

Proof: Assume that there is an overlap region between A (the overlap of Lk and Lk+2) and B (the overlap of Lk+2 and Lk+3) for at least one eT, i.e.

    n·2^(k+2) ≤ (4n/3)(3·2^k - 1)

Solving this inequality we get n ≤ 0, which is impossible, since n represents the number of input values in a block. So there is never such an overlap region, for any given set of k and any block size n.

It is important to note that the total hardware cost of the "overlap-limited search" is independent of the number of k values, as shown in figure 3.

Note: The derivations above are general in the sense that they do not require Lk to be an integer. In almost all, if not all, applications Lk is an integer and the effect of this needs to be examined. The effect can be considered as a quantization of Lk to integer values, which narrows the overlap regions; however, the need for two comparisons remains. Here, without formal verification, we give the resulting overlap regions for our example:

L0/L1: eT in [6]

L1/L2: eT in [10, 20]

L2/L3: eT in [20, 48]

L3/L4: eT in [40, 100]

L4/L5: eT in [80, 204]

L5/L6: eT in [160, 412]


A.2 Remainder-Based Correction

Golomb-Rice coding separates the data into two parts (q: quotient and r: remainder) by performing a division with a power-of-two divisor. The values added in the first-stage additions shown in figure 12 are actually the quotients q of the division by 2^k.

For a block size of n, the output length corresponding to k is repeated here as

    Lk = q1 + q2 + … + qn + nk = qT + nk

Figure 62 shows the quotients that are added together for the cases k = 0 and k = 1, respectively.

Figure 62: Motivation behind remainder-based correction

It is clear from figure 62 that, in both additions, the added bits m down to 1 are identical. In a hardware implementation this means that the same data bits are connected to the inputs of two separate adders. The point is that the sum in the second addition can be obtained from the sum in the first addition, specifically by right-shifting that sum by one bit. In order to get exactly the same result, however, the effect of the LSB bits (the remainders for k = 1) on the first sum must be corrected. That is, by first subtracting the carry-out of the sum of the LSB bits from the first sum (corresponding to k = 0) and then right-shifting it by one bit, the second sum (corresponding to k = 1) is obtained exactly, without a separate addition.

The idea can be generalized to all k values: once the sum of the input values is obtained (which is nothing but the sum of the quotients for k = 0), qT for every other k can be found by a common correction circuit using the remainder bits of each stage.

Mathematically the equivalence can be shown as follows:


    Lk = q1 + q2 + … + qn + nk,   where qi = (ei - ri) / 2^k

So,

    Lk = (e1 + … + en - (r1 + … + rn)) / 2^k + nk

We define eT and rT as

    eT = e1 + … + en,   rT = r1 + … + rn

Then,

    Lk = (eT - rT) / 2^k + nk = ⌊eT / 2^k⌋ + (nk - ⌊rT / 2^k⌋)        (4)

Equation (4) shows that Lk can be obtained by adding a remainder-conditioned operand (the second term) to the shifted sum of the inputs (the first term).
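The identity can be checked numerically. The sketch below verifies, for random blocks, that the quotient sum for every k equals the shifted sum of the inputs corrected by the shifted sum of the remainders (helper names are ours, chosen for illustration).

    import random

    def quotient_sum_direct(block, k):
        return sum(e >> k for e in block)

    def quotient_sum_from_correction(block, k):
        e_t = sum(block)                                 # sum of inputs (= quotient sum for k = 0)
        r_t = sum(e & ((1 << k) - 1) for e in block)     # sum of the k-bit remainders
        return (e_t >> k) - (r_t >> k)                   # shifted input sum minus shifted remainder sum

    for _ in range(10000):
        block = [random.randrange(512) for _ in range(4)]
        for k in range(7):
            assert quotient_sum_direct(block, k) == quotient_sum_from_correction(block, k)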


Appendix B

Test Image Sets

B.1 Standard Photographic Test Images

Peppers (512 x 512) Peppers2 (512 x 512) Mandrill (512 x 512)

Lenna (512 x 512) House (256 x 256) Sailboat (512 x 512)

Airplane (512 x 512)


B.2 Computer Generated Test Scenes

Ducks (640 x 480) Square (640 x 480)

Car (640 x 480) Quake4 (640 x 480)

Bench_scr1 (640 x 360) Bench_scr1 (640 x 360)


Bench_scr4 (640 x 360)

B.3 Computer Generated User Menu Scenes

Menu1 (240 x 320) Menu2 (240 x 320) Menu3 (240 x 320)


Menu4 (240 x 320) Menu5 (240 x 320)

Menu6 (320 x 480) Menu7 (320 x 480)