

A High Performance Video Transform Engine by Using Space-Time Scheduling Strategy

    Yuan-Ho Chen, Student Member, IEEE, and Tsin-Yuan Chang, Member, IEEE

Abstract: In this paper, a spatial and time scheduling strategy, called the space-time scheduling (STS) strategy, is proposed to achieve high image resolutions in real-time systems. The proposed spatial scheduling strategy includes the choice of the distributed arithmetic (DA)-precision bit length and a hardware sharing architecture that reduces the hardware cost, while the proposed time scheduling strategy arranges the different dimensional computations so that the first-dimensional and second-dimensional transformations are calculated simultaneously in a single 1-D discrete cosine transform (DCT) core, reaching a hardware utilization of 100%. The DA-precision bit length is chosen as 9 bits instead of the traditional 12 bits based on test image simulations. In addition, the proposed hardware sharing architecture employs a binary signed-digit DA architecture that enables the arithmetic resources to be shared over four time slots. For these reasons, the proposed 2-D DCT core achieves high accuracy with a small area and a high throughput rate, and it is verified with a chip implementation in a TSMC 0.18-µm 1P6M CMOS process. Measurement results show that the core has a latency of 84 clock cycles with a 52-dB peak-signal-to-noise ratio and operates at 167 MHz with a gate count of 15.8 K.

Index Terms: Binary signed-digit (BSD), discrete cosine transform (DCT), distributed arithmetic (DA)-based, space-time scheduling (STS).

    I. INTRODUCTION

DISCRETE COSINE TRANSFORM (DCT) is a widely used transform engine for image and video compression applications [1]. In recent years, visual media has progressed toward high-resolution specifications such as high-definition television (HDTV). Consequently, a high-accuracy, high-throughput-rate component is needed to meet future specifications. In addition, in order to reduce the manufacturing cost of the integrated circuit (IC), a low hardware cost design is also required. Therefore, a high performance video transform engine that offers high accuracy, a small area, and a high throughput rate is desired for VLSI designs.

Manuscript received February 22, 2010; revised August 09, 2010 and December 30, 2010; accepted January 24, 2011. Date of publication February 28, 2011; date of current version March 12, 2012. This work was supported in part by the Chip Implementation Center and the National Science Council under project numbers CIC T18-98C-12a and NSC 99-2221-E-007-119, respectively.

The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2011.2110620

The 2-D DCT core design has often been implemented using either direct [2]–[4] or indirect [5]–[13] methods. The direct methods include fast algorithms that reduce the computational complexity by mapping and rotating the DCT into the complex domain [3]. However, the structure of the DCT is not as regular as that of the fast Fourier transform (FFT). A regular 2-D DCT core using the direct method, which derives the 2-D shifted FFT (SFFT) format from the 2-D DCT algorithm and shares the hardware among the FFT/IFFT/2-D DCT computations, is implemented in [4].

On the other hand, 2-D DCT cores using the indirect method are implemented based on transpose memory and have the following two structures: 1) two 1-D DCT cores and one transpose memory (TMEM), and 2) a single 1-D DCT core and one TMEM. In the first structure, the 2-D DCT core has a high throughput rate because the two 1-D DCT cores compute the transformations simultaneously [5]–[9]. In order to reduce the area overhead, a single 1-D DCT core is applied in the second structure [10]–[13], thereby saving hardware cost. Hsia et al. [10] present a 2-D DCT with a single 1-D DCT core and one custom transpose memory that transposes data sequences using serial input and parallel output and requires only a small area, but the speed is reduced because the single DCT core performs the first-dimensional (1st-D) and second-dimensional (2nd-D) transformations at different times. Therefore, Guo et al. use two parallel paths to deal with the throughput-rate reduction in the second structure [11]. However, additional input buffers are needed to temporarily store the input data during the 2nd-D transformation. Madisetti et al. present a hardware sharing technique to implement the 2-D DCT core using a single 1-D DCT core [13]; the 1-D DCT core can calculate the 1st-D and 2nd-D DCT computations simultaneously, and the throughput reaches 100 Mpels/s. However, the multiplier-based processing element in [13] requires a large amount of circuit area. As a result, a tradeoff is clearly required between hardware cost and speed. Tumeo et al. find a balance point between the required area and the speed for 2-D DCT designs and present a multiplier-based pipelined fast 2-D DCT accelerator implemented on a field-programmable gate array (FPGA) platform [14]. Additionally, several 2-D DCT designs have been implemented on FPGAs for fast verification [14]–[17].
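For readers less familiar with the indirect method, the row-column decomposition it relies on can be sketched in a few lines: a 1-D DCT is applied to every row of the 8 × 8 block, the intermediate result is transposed (the job of the TMEM), and the 1-D DCT is applied again. The snippet below is only an illustrative floating-point model of that flow, not the hardware described in this paper; the function names are ours.

```python
import numpy as np

def dct_1d_8(x):
    """Floating-point 1-D 8-point DCT: Z(k) = C(k)/2 * sum x(n) cos((2n+1)k*pi/16)."""
    n = np.arange(8)
    z = np.empty(8)
    for k in range(8):
        ck = 1.0 / np.sqrt(2.0) if k == 0 else 1.0
        z[k] = 0.5 * ck * np.sum(x * np.cos((2 * n + 1) * k * np.pi / 16.0))
    return z

def dct_2d_indirect(block):
    """Indirect (row-column) 2-D DCT: 1st-D on rows, transpose, 2nd-D on rows again."""
    first_d = np.array([dct_1d_8(row) for row in block])      # 1st-D transform
    transposed = first_d.T                                     # role of the TMEM
    return np.array([dct_1d_8(row) for row in transposed]).T  # 2nd-D transform

# Example: transform a random 8x8 block of 8-bit pixels.
block = np.random.randint(0, 256, (8, 8)).astype(float)
coeffs = dct_2d_indirect(block)
```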

In this paper, an 8 × 8 2-D DCT core that consists of a single 1-D DCT core and one TMEM is proposed using a strategy known as space-time scheduling (STS). Based on accuracy simulations with the DA-based binary signed-digit (BSD) expression, a 9-bit DA precision is chosen in order to meet the peak-signal-to-noise-ratio (PSNR) requirements outlined in previous works [5]–[7]. Furthermore, the proposed DCT core is designed with a hardware sharing architecture for the BSD DA-based computation so as to reduce the area cost. The arithmetic units share the hardware resources during the four time slots in the DCT core design.


A 100% hardware utilization is also achieved by using the proposed time scheduling strategy, and the 1st-D and 2nd-D DCT computations can be calculated at the same time. Therefore, a high performance transform engine with high accuracy, a small area, and a high throughput rate has been achieved in this research.

This paper is organized as follows. In Section II, the mathematical derivation of the BSD-format distributed arithmetic is given. The proposed 8 × 8 2-D DCT architecture, including an analysis of the coefficient bits, the hardware sharing architecture, and the proposed time scheduling strategy, is discussed in Section III. Comparisons and discussions are presented in Section IV, and conclusions are drawn in Section V.

II. MATHEMATICAL DERIVATION OF BSD-FORMAT DISTRIBUTED ARITHMETIC

The inner product for a general matrix multiplication-and-accumulation can be written as follows:

$$Y = \sum_{k=0}^{N-1} A_k x_k \tag{1}$$

where $A_k$ is a fixed coefficient and $x_k$ is the input data. Assume that the coefficient $A_k$ is an $M$-bit BSD number, i.e., $A_k = \sum_{m=0}^{M-1} a_{k,m} 2^{-m}$ with $a_{k,m} \in \{-1, 0, 1\}$. Equation (1) can then be expressed as follows:

$$Y = \sum_{m=0}^{M-1} 2^{-m} y_m, \qquad
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & a_{1,0} & \cdots & a_{N-1,0} \\
a_{0,1} & a_{1,1} & \cdots & a_{N-1,1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{0,M-1} & a_{1,M-1} & \cdots & a_{N-1,M-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix} \tag{2}$$

where $y_m = \sum_{k=0}^{N-1} a_{k,m} x_k$ and $a_{k,m} \in \{-1, 0, 1\}$ for $m = 0, 1, \ldots, M-1$. The matrix $[a_{k,m}]$ is called the DA coefficient matrix.

In (2), each $y_m$ can be calculated by adding or subtracting the inputs $x_k$ according to the signed digits $a_{k,m}$, and the transform output $Y$ can then be obtained by shifting and adding every non-zero $y_m$. Thus, the inner-product computation in (1) can be implemented using shifters and adders instead of multipliers. Therefore, a smaller area can be achieved by using the BSD DA-based architecture.
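To make the shift-and-add idea concrete, the following sketch evaluates the inner product of (1) with a BSD DA coefficient matrix: each coefficient is decomposed into signed digits in {-1, 0, 1}, each bit-plane term y_m is formed only with additions and subtractions of the inputs, and the result is accumulated by shifting, as in (2). It is a bit-accurate software illustration under our own assumptions (fractional digit weights and a greedy decomposition), not the datapath of the proposed core.

```python
import math

def to_bsd(value, n_bits):
    """Greedy signed-digit decomposition of a fractional coefficient into
    digits d[m] in {-1, 0, 1} so that value ~= sum d[m] * 2^-(m+1)."""
    digits = []
    residual = value
    for m in range(1, n_bits + 1):
        weight = 2.0 ** (-m)
        if residual >= weight / 2:
            digits.append(1)
            residual -= weight
        elif residual <= -weight / 2:
            digits.append(-1)
            residual += weight
        else:
            digits.append(0)
    return digits

def da_inner_product(coeffs, xs, n_bits=9):
    """Evaluate Y = sum A_k * x_k using only add/subtract and shift,
    following the bit-plane (DA) order of (2)."""
    bsd = [to_bsd(a, n_bits) for a in coeffs]        # DA coefficient matrix
    y = 0.0
    for m in range(n_bits):                          # one bit plane per step
        y_m = sum(d[m] * x for d, x in zip(bsd, xs)) # only +/- of the inputs
        y += y_m * 2.0 ** (-(m + 1))                 # shift-and-accumulate
    return y

# Example with the odd-part DCT coefficients cos(i*pi/16), i = 1, 3, 5, 7.
coeffs = [math.cos(i * math.pi / 16) for i in (1, 3, 5, 7)]
xs = [10.0, -3.0, 7.0, 2.0]
print(da_inner_product(coeffs, xs), sum(a * x for a, x in zip(coeffs, xs)))
```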

III. PROPOSED 8 × 8 2-D DCT CORE DESIGN

This section introduces the proposed 8 × 8 2-D DCT core implementation. The 2-D DCT is defined as

$$Z(u,v) = \frac{C(u)C(v)}{4} \sum_{x=0}^{7}\sum_{y=0}^{7} z(x,y)\cos\frac{(2x+1)u\pi}{16}\cos\frac{(2y+1)v\pi}{16} \tag{3}$$

where $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$, and $C(u), C(v) = 1$ for $u, v \neq 0$. Consider the 1-D 8-point DCT first:

$$Z(k) = \frac{C(k)}{2} \sum_{n=0}^{7} x(n)\cos\frac{(2n+1)k\pi}{16} \tag{4}$$

where $C(k) = 1/\sqrt{2}$ for $k = 0$, $C(k) = 1$ for non-zero $k$, and $x(n)$ and $Z(k)$ denote the input data and the transform output, respectively. By neglecting the scaling factor 1/2, the

1-D 8-point DCT in (4) can be divided into even and odd parts, $\mathbf{Z}_E$ and $\mathbf{Z}_O$, as listed in (5) and (6), respectively:

$$\mathbf{Z}_E = \begin{bmatrix} Z(0)\\ Z(2)\\ Z(4)\\ Z(6)\end{bmatrix}
= \begin{bmatrix}
C_4 & C_4 & C_4 & C_4\\
C_2 & C_6 & -C_6 & -C_2\\
C_4 & -C_4 & -C_4 & C_4\\
C_6 & -C_2 & C_2 & -C_6
\end{bmatrix}
\begin{bmatrix} a_0\\ a_1\\ a_2\\ a_3\end{bmatrix} \tag{5}$$

$$\mathbf{Z}_O = \begin{bmatrix} Z(1)\\ Z(3)\\ Z(5)\\ Z(7)\end{bmatrix}
= \begin{bmatrix}
C_1 & C_3 & C_5 & C_7\\
C_3 & -C_7 & -C_1 & -C_5\\
C_5 & -C_1 & C_7 & C_3\\
C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} b_0\\ b_1\\ b_2\\ b_3\end{bmatrix} \tag{6}$$

where $\mathbf{a} = [a_0\ a_1\ a_2\ a_3]^T$ and $\mathbf{b} = [b_0\ b_1\ b_2\ b_3]^T$, with

$$C_i = \cos\!\left(\frac{i\pi}{16}\right) \tag{7}$$

$$a_n = x(n) + x(7-n), \quad n = 0, 1, 2, 3 \tag{8}$$

$$b_n = x(n) - x(7-n), \quad n = 0, 1, 2, 3. \tag{9}$$
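As a quick numerical check of the decomposition in (5)–(9), the sketch below computes the 8-point DCT (with the 1/2 scaling neglected, as in the text) from the butterfly outputs a_n and b_n and the even/odd coefficient matrices, and compares the result with the direct form of (4). It is a floating-point illustration only, not the fixed-point datapath of the proposed core.

```python
import numpy as np

C = {i: np.cos(i * np.pi / 16) for i in range(1, 8)}  # C_i = cos(i*pi/16), see (7)

ME = np.array([[C[4],  C[4],  C[4],  C[4]],            # even-part matrix of (5)
               [C[2],  C[6], -C[6], -C[2]],
               [C[4], -C[4], -C[4],  C[4]],
               [C[6], -C[2],  C[2], -C[6]]])
MO = np.array([[C[1],  C[3],  C[5],  C[7]],            # odd-part matrix of (6)
               [C[3], -C[7], -C[1], -C[5]],
               [C[5], -C[1],  C[7],  C[3]],
               [C[7], -C[5],  C[3], -C[1]]])

def dct8_even_odd(x):
    """1-D 8-point DCT (without the 1/2 scaling) via the even/odd split."""
    a = x[:4] + x[7:3:-1]          # a_n = x(n) + x(7-n), see (8)
    b = x[:4] - x[7:3:-1]          # b_n = x(n) - x(7-n), see (9)
    ze, zo = ME @ a, MO @ b        # Z(0),Z(2),Z(4),Z(6) and Z(1),Z(3),Z(5),Z(7)
    z = np.empty(8)
    z[0::2], z[1::2] = ze, zo
    return z

def dct8_direct(x):
    """Direct evaluation of (4) with the 1/2 scaling neglected."""
    n, k = np.arange(8), np.arange(8).reshape(-1, 1)
    ck = np.where(k == 0, 1 / np.sqrt(2), 1.0).ravel()
    return ck * (np.cos((2 * n + 1) * k * np.pi / 16) @ x)

x = np.random.rand(8)
assert np.allclose(dct8_even_odd(x), dct8_direct(x))
```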

For the 2-D DCT core implementation, three main strategies for increasing the hardware and time utilization, collectively called the space-time scheduling (STS) strategy, are proposed:

• find the DA-precision bit length for the BSD representation that achieves the system accuracy requirements;

• share the hardware resources in time to reduce the area cost;

• achieve 100% hardware utilization by using the time scheduling strategy.

    A. Analysis of the Coefficient Bits

The seven internal coefficients $C_1$ to $C_7$ for the 2-D DCT transformation are expressed as BSD representations in order to save computation time and hardware cost, as well as to meet the PSNR requirements. The system PSNR is defined [1] as follows:

$$\mathrm{PSNR} = 10\log_{10}\frac{255^2}{\mathrm{MSE}} \tag{10}$$

where MSE is the mean-square error per pixel between the original image and the reconstructed image.
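For reference, (10) corresponds to the small helper below, where the reconstructed image is whatever comes back from the transform under test followed by an ideal inverse DCT; it is a generic PSNR routine for 8-bit images, not code from the paper.

```python
import numpy as np

def psnr_8bit(original, reconstructed):
    """PSNR = 10*log10(255^2 / MSE) for 8-bit gray-level images, as in (10)."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```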


    Fig. 1. Simulation for system PSNR in different BSD bit length expressions.

TABLE I: NORMALIZED HW COST IN DIFFERENT BSD BIT LENGTH EXPRESSIONS

First, the BSD coefficients for the different bit length expressions are evaluated using the popular test images Lena, Peppers, and Baboon to simulate the system PSNR. After inputting the original test image pixels to the forward and inverse DCT transforms, the system PSNR versus the different BSD bit length expressions is illustrated in Fig. 1(a), and the reconstructed images, i.e., the images that have passed through the forward and inverse DCT for the different BSD bit lengths, are shown in Fig. 1(b). The normalized hardware (HW) cost of the processing element (PE) for the different BSD bit length expressions is shown in Table I. The HW cost is the synthesized area for each BSD bit length, obtained using the Synopsys Design Compiler with the Artisan TSMC 0.18-µm standard cell library.

The gap in accuracy between the 8-bit and the 9-bit BSD expressions is evident in Fig. 1(a), and the HW cost increases as the BSD bit length increases. Consequently, the 9-bit BSD expression of the coefficients is chosen as the tradeoff between the hardware cost and the system accuracy. The BSD expressions and the accuracy of the DCT coefficients are listed in Table II, where the digit $\bar{1}$ in Table II indicates the BSD digit $-1$. The coefficient accuracy is defined as the signal-to-quantization-noise ratio (SQNR) shown in (11):

$$\mathrm{SQNR} = 10\log_{10}\frac{P_s}{P_n} \tag{11}$$

where $P_s$ and $P_n$ are the power of the desired floating-point signal and the power of the quantization noise, respectively.
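The SQNR in (11) can be evaluated per coefficient by comparing the floating-point value of each C_i with its 9-bit BSD approximation. The minimal sketch below reuses the illustrative to_bsd() decomposition from the DA sketch in Section II; it only shows how numbers of this kind could be reproduced and is not guaranteed to match the exact BSD encodings of Table II.

```python
import math

def bsd_value(digits):
    """Value represented by signed digits d[m] in {-1, 0, 1} with weights 2^-(m+1)."""
    return sum(d * 2.0 ** (-(m + 1)) for m, d in enumerate(digits))

def sqnr_db(ideal, quantized):
    """SQNR = 10*log10(Ps/Pn), as in (11), for a single coefficient."""
    ps = ideal ** 2                      # power of the desired floating-point value
    pn = (ideal - quantized) ** 2        # power of the quantization noise
    return 10.0 * math.log10(ps / pn)

for i in range(1, 8):
    ci = math.cos(i * math.pi / 16)
    ci_9bit = bsd_value(to_bsd(ci, 9))   # to_bsd() from the DA sketch above
    print(f"C_{i}: {ci:.6f} -> {ci_9bit:.6f}, SQNR = {sqnr_db(ci, ci_9bit):.1f} dB")
```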

TABLE II: 9-BIT BSD EXPRESSIONS FOR DCT COEFFICIENTS

    B. Hardware Sharing Strategy

All modules in the 1-D DCT core, including the modified two-input butterfly (MBF2), the Pre-Reorder, the process element even (PEE), the process element odd (PEO), and the Post-Reorder, share the hardware resources in order to reduce the area cost. Moreover, the 8 × 8 2-D DCT core is implemented using a single 1-D DCT core and one TMEM. The architecture of the 2-D DCT core is described in the following sections.

1) Modified Butterfly Module: Equations (8) and (9) are easily implemented using a two-input butterfly module [18] called BF2. In general, the BF2 has a hardware utilization rate of only 50% in its adder and subtracter. In order to enable the hardware resources to be shared, additional multiplexers and Reorder Registers are added to the proposed MBF2 module, as illustrated in Fig. 2. The Reorder Registers consist of four word registers that use control signals to select the input data and enable signals to output or hold the data. Similar to the BF2, the operation of the proposed MBF2 has an eight-clock-cycle period. In the first four cycles, the 1st-D input data shift into Reorder Registers 1, and the 2nd-D input data execute the addition and subtraction operations. In the next four cycles, the operations of the 1st-D and 2nd-D data are exchanged: the 1st-D data calculate $a_n$ and $b_n$ in (8) and (9) by using the adder and subtracter.


TABLE III: COEFFICIENTS AND INPUT DATA FROM THE MBF2 FOR THE FOUR SEPARATE TIME SLOT OUTPUTS

    Fig. 2. Proposed MBF2 architecture.

In the second four cycles, part of the results are fed into the next stage, and the other results shift into Reorder Registers 1 in each cycle. At the same time, the 2nd-D input data shift into Reorder Registers 2. Therefore, the adder and subtracter in the MBF2 serve the 1st-D and 2nd-D operations in turn, achieving 100% hardware utilization.
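The time sharing performed by the MBF2 can be pictured with a small behavioral model: over one eight-cycle period the single adder/subtracter pair serves one dimension while the other dimension's samples are only being buffered, and then the roles swap. The model below is a simplified software illustration of that scheduling (the register naming and data ordering are our own simplifications), not an RTL description of the module.

```python
def mbf2_schedule(first_d_block, second_d_block):
    """Behavioral sketch of the MBF2 time sharing over one 8-cycle period.

    first_d_block, second_d_block: 8-point sequences for the 1st-D and 2nd-D passes.
    Returns the (a, b) butterfly outputs for each pass plus a per-cycle usage log.
    """
    usage = []                      # which pass owns the adder/subtracter each cycle
    reorder_regs_1 = []             # buffers the 1st-D samples while they arrive

    # Cycles 1-4: 1st-D samples only shift in; adder/subtracter works on 2nd-D data.
    for cycle in range(4):
        reorder_regs_1.append(first_d_block[cycle])
        usage.append(("2nd-D", cycle + 1))
    a2 = [second_d_block[n] + second_d_block[7 - n] for n in range(4)]
    b2 = [second_d_block[n] - second_d_block[7 - n] for n in range(4)]

    # Cycles 5-8: roles swap; adder/subtracter now computes a_n, b_n for the 1st-D data.
    for cycle in range(4, 8):
        reorder_regs_1.append(first_d_block[cycle])
        usage.append(("1st-D", cycle + 1))
    a1 = [reorder_regs_1[n] + reorder_regs_1[7 - n] for n in range(4)]
    b1 = [reorder_regs_1[n] - reorder_regs_1[7 - n] for n in range(4)]

    return (a1, b1), (a2, b2), usage
```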

2) Pre-Reorder Module: In (5), the even part transform output can be modified as follows:

(12)

Also, the odd part transform output in (6) can be rewritten as

(13)

TABLE IV: EVEN PART TRANSFORMATION TABLE

From (12) and (13), the transform outputs can be calculated in four separate time slots in order to share the hardware resources. Hence, (12) and (13) can be expressed as (14) and (15), respectively:

(14)

(15)

where $t$ denotes the time slot, and the coefficient elements and the transform inputs $a_n$ and $b_n$ come from the MBF2 stage. Equations (14) and (15) for the four time slots are summarized in Table III. The proposed Pre-Reorder module orders the inputs ($a_n$ and $b_n$) according to the time slot indicated in Table III, so the hardware resources can be shared by reordering the inputs during the four separate time slots.
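The effect of the Pre-Reorder stage can be imitated in software: a single shared four-term weighted-sum datapath produces one even-part output per time slot, with the coefficient vector (a row of the matrix in (5)) selected by the slot index. The sketch below only illustrates that scheduling idea; the actual core replaces the multiplications with the bit-level DA and adder-tree structure of the next subsection, and the slot-to-output assignment shown here is our own choice rather than the exact contents of Table III.

```python
import math

C = {i: math.cos(i * math.pi / 16) for i in range(1, 8)}

# One row of the even-part matrix in (5) is issued per time slot t = 0..3.
EVEN_SLOT_COEFFS = {
    0: [C[4],  C[4],  C[4],  C[4]],   # -> Z(0)
    1: [C[2],  C[6], -C[6], -C[2]],   # -> Z(2)
    2: [C[4], -C[4], -C[4],  C[4]],   # -> Z(4)
    3: [C[6], -C[2],  C[2], -C[6]],   # -> Z(6)
}

def shared_even_datapath(a, slot):
    """Single shared 4-term weighted-sum datapath, reused in every time slot."""
    coeffs = EVEN_SLOT_COEFFS[slot]
    return sum(c * x for c, x in zip(coeffs, a))

def even_outputs_over_four_slots(a):
    """Compute Z(0), Z(2), Z(4), Z(6) sequentially, one result per time slot."""
    return [shared_even_datapath(a, t) for t in range(4)]
```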

3) Process Element Module: For the DA-based computation, the even part and odd part transformations can be implemented using the PEE and PEO, respectively. From Table III, the even part transformation can be expanded into DA-based computation formats and can share the hardware resources at the bit level. The coefficient vector has two combinations of coefficients,


$\{\pm C_4\}$ and $\{\pm C_2, \pm C_6\}$, for the $Z(0)/Z(4)$ and $Z(2)/Z(6)$ transform outputs, respectively. Table IV expresses (14) in a bit-level formulation.

Using the given input data $a_0$, $a_1$, $a_2$, and $a_3$, the even part transform output needs only three adders, a Data-Distributed module, and one even part adder-tree (EAT) to obtain the result during the four time slots. The Data-Distributed module consists of seven two-input multiplexers that assign the appropriate non-zero values to each weighted input during the four separate time slots, as shown in Table IV, and the EAT sums the different weighted values with tree-like adders to complete the transform output. Similarly, the odd part transformation in (15) can be implemented in a DA-based format using four adders and one odd part adder-tree (OAT). The EAT and OAT can be implemented using the error-compensated adder tree [19] to improve the computation accuracy. Moreover, there are three pipeline stages (two of them within the EAT and the OAT) in each of the PEE and PEO modules, which enables high-speed computation. The proposed architecture for both the PEE and the PEO is shown in Fig. 3.

Fig. 3. Architecture for the proposed PEE and PEO.

4) Post-Reorder Module: In the last stage of the 1-D DCT computation, the data sequences from the PEE and the PEO must be merged and re-permuted in the Post-Reorder module, which restores the 8-point transform outputs to the required output order. Fig. 4 shows the proposed Post-Reorder architecture. Two multiplexers select the data that are fed into the different Reorder Registers in order to permute the output order. The permuted sequence is the 1st-D DCT transform output, which is fed into the TMEM. In addition, the 2nd-D DCT transform output is completed after the permutation by Reorder Registers 4.

Fig. 4. Proposed Post-Reorder architecture.

5) 2-D DCT Core Architecture: To save hardware cost, the proposed 2-D DCT core, as shown in Fig. 5, is implemented using a single 1-D DCT core and one TMEM. The 1-D DCT core includes an MBF2, a Pre-Reorder module, a PEE, a PEO, and a Post-Reorder module. The TMEM is implemented using 64-word, 12-bit dual-port registers and has a latency of 52 cycles. Based on the time scheduling strategy described in Section III-C, a hardware utilization of 100% can be achieved.

    C. Time Scheduling Strategy

As a result of the time scheduling strategy, the 1st-D and 2nd-D transforms can be computed simultaneously, which increases the hardware utilization to 100%. The timing flow chart for the proposed strategy is illustrated in Fig. 6.

1) 1st-D Data Computation: In the first four cycles, the first four input data points shift into Reorder Registers 1. During cycles 5–8, the MBF2 performs the additions and subtractions on the first eight-point input data. In the 9th cycle, the even part of the Pre-Reorder module obtains the even part inputs $a_n$ and reorders them for output to the PEE based on Table III. During cycles 10–13, the even part results are calculated in the PEE module; owing to the latency of the three pipeline stages in the PEE, they are completed during cycles 13–16. Likewise, the odd part is processed in the PEO module during cycles 14–17 and is finished in cycles 17–20. Therefore, after the 16th cycle, the Post-Reorder module permutes the 1st-D transform results and inputs them into the TMEM.

2) 2nd-D Data Computation: From the 17th cycle, the 1st-D transform data are input into the TMEM. Due to the 52-cycle latency of the TMEM, the 2nd-D computation data transposed by the TMEM are sent to the input of Reorder Registers 2 in the MBF2 at the 69th cycle. Then, the adder and subtracter operate during cycles 73–76 to compute the first 2nd-D 8-point data sequence, and the first 2nd-D data are permuted by the Pre-Reorder module during cycles 77–80. The PEE and the PEO calculate the first 2nd-D data during cycles 78–81 and 82–85, respectively. The Post-Reorder module finishes the 2nd-D transform results from the 84th cycle, and the first 2-D DCT transform output data are obtained at the end of the 84th cycle.

3) Hardware Utilization: In Fig. 6, the adder and subtracter in the MBF2 are at 50% utilization during cycles 1–68. However, after the 68th cycle, the MBF2 achieves 100% hardware utilization as a result of the 2nd-D data being input. At the same time, the utilization of the PEE and the PEO also reaches 100% from the 74th and 78th cycles, respectively. The Post-Reorder module does not work during cycles 1–12, but after the 16th cycle the 1st-D transform outputs are completed using Reorder Registers 3 in the Post-Reorder. Similarly, the 2nd-D transform results are finished using Reorder Registers 4 in the Post-Reorder from the 84th cycle. At the end of the 84th cycle, the 1st-D and 2nd-D transform outputs are obtained simultaneously.


    Fig. 5. Proposed 2-D DCT architecture.

    Fig. 6. Timing scheduling for the proposed 2-D DCT core.

Hence, the hardware utilization increases to 100% after the 84th cycle, and the latency of the proposed 8 × 8 2-D DCT core is 84 clock cycles. Based on the proposed time scheduling strategy, 100% hardware utilization is achieved, and the computation of the data sequence is straightforward.

In summary, following the analysis of the coefficient accuracy, the DA-precision bit length was chosen as a 9-bit BSD expression. In this way, the hardware cost is reduced by using the 9-bit BSD DA-precision instead of the 12 bits used in previous works. Using the proposed hardware sharing strategy, the DCT computation shares the hardware resources during the four time slots, thereby reducing the area cost. In addition, 100% hardware utilization can be achieved based on the proposed time scheduling strategy, effectively yielding a high-throughput-rate design. Therefore, a high-accuracy, small-area, and high-throughput-rate 2-D DCT core has been obtained by applying the proposed STS strategy.

IV. DISCUSSION AND COMPARISONS

In this section, the system accuracy test for the proposed 2-D DCT core is discussed. Then, a comparison of the proposed 2-D DCT core with previous works is presented. Finally, the characteristics of the implemented chip and the FPGA implementation are described.

    A. System Accuracy Test

Seven test images used to check the system accuracy are all 512 × 512 pixels in size, with each pixel represented by 8-bit, 256-gray-level data.


    Fig. 7. Accuracy verification flow for proposed DCT core.

TABLE V: PSNR PERFORMANCE FOR EACH TEST PATTERN

TABLE VI: PSNR VERSUS QF FOR THE PROPOSED 2-D DCT APPLIED TO THE JPEG COMPRESSION FLOW

After the original test image pixels are fed into the proposed 2-D DCT core, the transform output data are captured and passed to MATLAB to compute the inverse DCT using 64-bit double-precision operations. The verification flow follows the test flow of previous works ([4], [7], and [20]), as illustrated in Fig. 7. The PSNR values of the test images verified by this flow are listed in Table V; the average PSNR is close to 52 dB for the proposed 8 × 8 2-D DCT core. Furthermore, the proposed 2-D DCT core was placed into a JPEG compression flow [21], [22] to evaluate it in a real compression system. The quality factor (QF) determines the compression quality and data size, and therefore the QF affects the system PSNR of the JPEG compression. Table VI lists the PSNR versus QF for the proposed 2-D DCT core applied to JPEG, and Fig. 8 illustrates both the original and the compressed versions of the test image Lena using the JPEG compression flow. The quality of the compressed image is excellent.

    B. Comparison With Other 2-D DCT Architectures

Table VII compares the proposed 8 × 8 2-D DCT core with previous works. In [4], Lin et al. derived an SFFT format from the 2-D DCT and shared the hardware resources for the FFT/IFFT/2-D DCT computation in a triple-mode processor design; the 2-D DCT is implemented with a direct 2-D method based on a pipelined radix- single-delay-feedback path architecture. In [7], a 2-D DCT core used the NEDA architecture with two 1-D cores and one TMEM to achieve a low-cost design; speed limitations exist in the serial shifting and addition operations of the NEDA computation. Huang et al. improved the throughput rate by using an adder-tree (AT) instead of the shifting and addition operations of the NEDA architecture [9]. However, the parallel input and output of the NEDA architecture require a large amount of I/O resources. A 2-D DCT core using a single 1-D core and one TMEM for a multiplier-based DCT implementation was addressed in [10]; an area-saving serial-input, parallel-output transpose memory was also implemented in that core to achieve a small-area design operating at a frequency of 55 MHz. However, the 1-D core could not compute the 1st-D and 2nd-D transformations simultaneously, so the throughput rate was half of the operating frequency. Therefore, a hardware sharing technique was applied in the 2-D DCT core of [13] to deal with the throughput-rate reduction, and that core achieved a throughput of 100 Mpels/s.

The proposed 2-D DCT core employs a single 1-D core and one TMEM in order to reduce the core area. Also, the 1-D core utilizes 9 adders and 2 adder-trees (ATs) in the proposed hardware sharing architecture in order to reduce the hardware cost. Moreover, the throughput rate of the proposed 2-D DCT core is enhanced by the time scheduling strategy: the proposed 2-D core is able to compute the 1st-D and 2nd-D transformations simultaneously, and a throughput rate of 167 Mpels/s is achieved. Therefore, the proposed 2-D DCT core has the highest hardware efficiency, defined as follows:

$$\text{Hardware Efficiency} = \frac{\text{Throughput rate (pels/s)}}{\text{Gate count}} \quad (\text{pels/s-gate}). \tag{16}$$
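As a worked example, plugging the figures reported below for the proposed core (a 167-Mpels/s throughput rate and a 12.2 K gate count without DFT insertion) into (16) reproduces the 13.69 Kpels/s-gate value quoted for Table VII:

$$\text{Hardware Efficiency} = \frac{167\times 10^{6}\ \text{pels/s}}{12.2\times 10^{3}\ \text{gates}} \approx 13.69\ \text{Kpels/s-gate}.$$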

A comparison with previous works is summarized in Table VII. Note that the previous works presented only simulation results, while the proposed work has a chip implementation with design-for-test (DFT) and measured results. For a fair comparison, the simulation results of the proposed work are also shown (removing the DFT considerations of the chip implementation, which require extra area for the memory built-in self-test circuit and the scan flip-flops). The gate count of the STS DCT without DFT insertion is 12.2 K, and the resulting hardware efficiency of the STS DCT is 13.69, the largest in Table VII.

On the other hand, the proposed 2-D DCT core achieves a 52-dB PSNR value, which is a medium accuracy compared with other previous works, as verified by the test flow in Fig. 7. Power consumption is another issue in circuit design. Previous works implement the DCT core in many different technologies, so in order to make a fair comparison the power is normalized to a 0.18-µm technology process, similar to [23], as listed in (17) and summarized in Table VII. The proposed 2-D DCT core has a medium power consumption among previous works.

Normalized Power (17)


Fig. 8. Compressed image using JPEG compression processing with the proposed 2-D DCT core. (a) Original image. (b) Compressed image.

TABLE VII: COMPARISON OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE

Notes: Four transistors per NAND2 gate are assumed for the different technologies. The chip measurement result includes DFT insertion, while the simulation result is without DFT insertion. The power result is estimated for 1-D DCT operation.

    C. Chip Implementation

Implemented in a 1.8-V TSMC 0.18-µm 1P6M CMOS process, the proposed 8 × 8 2-D DCT core uses the Synopsys Design Compiler to synthesize the RTL code and Cadence SoC Encounter for placement and routing (P&R). The proposed DCT core has a latency of 84 clock cycles and operates at 167 MHz to meet HDTV specifications. For design-for-test (DFT) considerations, the proposed core uses the SynTest SRAM BIST tool to generate the memory built-in self-test (BIST) circuit and the Synopsys Design Compiler to synthesize the scan flip-flops; the total gate count of the proposed core with DFT insertion is 15.8 K. The photomicrograph of the proposed core, with each module's boundary marked, is shown in Fig. 9, and the measured characteristics are listed in Table VIII.

    D. FPGA Implementation

The proposed 2-D DCT core, synthesized using Xilinx ISE 10.1 for the Xilinx XC2VP30 FPGA, can be operated at a clock frequency of 110 MHz. Table IX shows a comparison of the proposed 2-D DCT core with previous FPGA implementations. A multiplier-less coordinate rotation digital computer (CORDIC) architecture was presented in [15]: Sun et al. utilized 120 adders and 80 shifters to perform a 2-D quantized discrete cosine integer transform (QDCIT), achieving a high clock rate of 149 MHz but requiring large FPGA area resources. Also, a 2-D DCT core optimized in both delay and area was addressed by Tumeo et al. [14], who presented a balance point between the delay and the area for a multiplier-based pipelined DCT design with 19 adders/subtracters and 4 multipliers in their DCT core operated at a clock frequency of 107 MHz.


    Fig. 9. Photomicrograph of the proposed DCT core.

TABLE VIII: CHIP CHARACTERISTICS OF THE PROPOSED 2-D DCT ARCHITECTURE

TABLE IX: COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS

The proposed DCT core has lower area resources and a medium operating frequency among the XC2VP30 FPGA implementations.

    V. CONCLUSION

This paper proposes a high performance video transform engine using the STS strategy. The proposed 2-D DCT core employs a single 1-D DCT core and one TMEM and occupies a small area. Based on the test image simulations, a 9-bit DA precision is chosen in order to meet the PSNR requirements, and the hardware sharing architecture enables the arithmetic resources to be shared in time so as to reduce the area cost. Hence, the number of adders/subtracters in the 1-D DCT core allows a 74% saving in area over the NEDA architecture for a DA-based DCT design. Furthermore, the proposed time scheduling strategy arranges the computation time for each process element so that the 1-D core can calculate the 1st-D and 2nd-D transformations simultaneously, thereby achieving a high throughput rate. Therefore, a high performance 8 × 8 2-D DCT core with high accuracy, a small area, and a high throughput rate has been achieved using the proposed STS strategy. Finally, the proposed high performance 2-D DCT core has been fabricated in a TSMC 0.18-µm 1P6M CMOS process and implemented on the XC2VP30 FPGA.

    REFERENCES

[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

[2] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2174–2193, Sep. 1992.

[3] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A cost-effective architecture for 8 × 8 two-dimensional DCT/IDCT using direct method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459–467, Jun. 1997.

[4] C. T. Lin, Y. C. Yu, and L. D. Van, "Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058–1071, Aug. 2008.

[5] M. Alam, W. Badawy, and G. Jullien, "A new time distributed DCT architecture for MPEG-4 hardware reference model," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 5, pp. 726–730, May 2005.

[6] T. Xanthopoulos and A. P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," IEEE J. Solid-State Circuits, vol. 35, no. 5, pp. 740–750, May 2000.

[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, Mar. 2006.

[8] C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8 × 8 DCT," in Proc. Int. Conf. ASIC, 2007, pp. 189–192.

[9] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21–24.

[10] S. C. Hsia and S. H. Wang, "Shift-register-based data transposition for cost-effective discrete cosine transform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, pp. 725–728, Jun. 2007.

[11] J. I. Guo, R. C. Ju, and J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 416–428, Apr. 2004.

[12] H. C. Hsu, K. B. Lee, N. Y. Chang, and T. S. Chang, "Architecture design of shape-adaptive discrete cosine transform and its inverse for MPEG-4 video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 375–386, Mar. 2008.

[13] A. Madisetti and A. N. Willson, "A 100 MHz 2-D 8 × 8 DCT/IDCT processor for HDTV applications," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, pp. 158–165, Apr. 1995.

[14] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A pipelined fast 2D-DCT accelerator for FPGA-based SoCs," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007, pp. 331–336.

[15] C. C. Sun, P. Donner, and J. Gotze, "Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform," in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 3014–3017.

[16] S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005, pp. 162–166.

[17] L. V. Agostini, I. S. Silva, and S. Bampi, "Pipelined fast 2D DCT architecture for JPEG image compression," in Proc. IEEE Symp. Integr. Circuits Syst. Des., 2001, pp. 226–231.

[18] C. H. Chang, C. L. Wang, and Y. T. Chang, "Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse," IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3206–3216, Nov. 2000.


[19] Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010, to be published.

[20] W. Pan, "A fast 2-D DCT algorithm via distributed arithmetic optimization," in Proc. IEEE Int. Conf. Image Process., 2000, vol. 3, pp. 114–117.

[21] Information Technology - Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines, ISO/IEC 10918-1, 1991.

[22] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1992.

[23] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT processor for OFDM-based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 451–455, Jun. 2010.

Yuan-Ho Chen (S'10) received the B.S. degree from Chang Gung University, Taoyuan, Taiwan, in 2002, and the M.S. degree from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering.

His research interests include video/image and digital signal processing, VLSI architecture design, VLSI implementation, computer arithmetic, and signal processing for wireless communication.

Tsin-Yuan Chang (S'87-M'90) received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1982, and the M.S. and Ph.D. degrees from Michigan State University, East Lansing, in 1987 and 1989, respectively, both in electrical engineering.

He is an Associate Professor with the Department of Electrical Engineering, National Tsing Hua University. His research interests include IC design and testing, computer arithmetic, and VLSI DSP.