

A High Performance Video Transform Engine by Using Space-Time Scheduling Strategy

    Yuan-Ho Chen, Student Member, IEEE, and Tsin-Yuan Chang, Member, IEEE

Abstract: In this paper, a spatial and time scheduling strategy, called the space-time scheduling (STS) strategy, is proposed to achieve high image resolutions in real-time systems. The proposed spatial scheduling strategy includes the choice of the distributed arithmetic (DA)-precision bit length and a hardware sharing architecture that reduces the hardware cost, while the proposed time scheduling strategy arranges the different dimensional computations so that the first-dimensional and second-dimensional transformations are calculated simultaneously in a single 1-D discrete cosine transform (DCT) core, reaching a hardware utilization of 100%. The DA-precision bit length is chosen as 9 bits instead of the traditional 12 bits based on test image simulations. In addition, the proposed hardware sharing architecture employs a binary signed-digit DA architecture that enables the arithmetic resources to be shared over four time slots. For these reasons, the proposed 2-D DCT core achieves high accuracy with a small area and a high throughput rate, and it is verified with a chip implementation in a TSMC 0.18-µm 1P6M CMOS process. Measurement results show that the core has a latency of 84 clock cycles with a 52-dB peak-signal-to-noise ratio and operates at 167 MHz with a gate count of 15.8 K.

Index Terms: Binary signed-digit (BSD), discrete cosine transform (DCT), distributed arithmetic (DA)-based, space-time scheduling (STS).

    I. INTRODUCTION

DISCRETE COSINE TRANSFORM (DCT) is a widely used transform engine for image and video compression applications [1]. In recent years, visual media has progressed toward high-resolution specifications such as high-definition television (HDTV). Consequently, a high-accuracy, high-throughput-rate component is needed to meet future specifications. In addition, in order to reduce the manufacturing cost of the integrated circuit (IC), a low hardware cost design is also required. Therefore, a high performance video transform engine that offers high accuracy, a small area, and a high throughput rate is desired for VLSI designs.

Manuscript received February 22, 2010; revised August 09, 2010 and December 30, 2010; accepted January 24, 2011. Date of publication February 28, 2011; date of current version March 12, 2012. This work was supported in part by the Chip Implementation Center and the National Science Council under project numbers CIC T18-98C-12a and NSC 99-2221-E-007-119, respectively.

The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2011.2110620

The 2-D DCT core design has often been implemented using either direct [2]–[4] or indirect [5]–[13] methods. The direct methods include fast algorithms that reduce the computational complexity by mapping and rotating the DCT into the complex domain [3]. However, the structure of the DCT is not as regular as that of the fast Fourier transform (FFT). A regular 2-D DCT core using the direct method, which derives the 2-D shifted FFT (SFFT) format from the 2-D DCT algorithm and shares the hardware among the FFT/IFFT/2-D DCT computations, is implemented in [4].

On the other hand, 2-D DCT cores using the indirect method are implemented based on transpose memory and have the following two structures: 1) two 1-D DCT cores and one transpose memory (TMEM), and 2) a single 1-D DCT core and one TMEM. In the first structure, the 2-D DCT core has a high throughput rate because the two 1-D DCT cores compute the transformations simultaneously [5]–[9]. In order to reduce the area overhead, a single 1-D DCT core is applied in the second structure [10]–[13], thereby saving hardware cost. Hsia et al. [10] present a 2-D DCT with a single 1-D DCT core and one custom transpose memory that transposes data sequences using serial input and parallel output and requires only a small area, but the speed is reduced because the single DCT core performs the first-dimensional (1st-D) and second-dimensional (2nd-D) transformations at different times. Therefore, Guo et al. use two parallel paths to deal with the throughput-rate reduction in the second structure [11]. However, additional input buffers are needed to temporarily store the input data during the 2nd-D transformation. Madisetti et al. present a hardware sharing technique to implement the 2-D DCT core using a single 1-D DCT core [13]; the 1-D DCT core can calculate the 1st-D and 2nd-D DCT computations simultaneously, and the throughput reaches 100 Mpels/s. However, the multiplier-based processing element in [13] requires a large amount of circuit area. As a result, a tradeoff is clearly required between hardware cost and speed. Tumeo et al. find a balance point between the required area and the speed for 2-D DCT designs and present a multiplier-based pipelined fast 2-D DCT accelerator implemented on a field-programmable gate array (FPGA) platform [14]. Additionally, several 2-D DCT designs have been implemented on FPGAs for fast verification [14]–[17].
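For readers less familiar with the indirect method, the row-column decomposition it relies on can be sketched in a few lines: a 1-D DCT is applied to every row of the 8 × 8 block, the intermediate result is transposed (the job of the TMEM), and the 1-D DCT is applied again. The snippet below is only an illustrative floating-point model of that flow, not the hardware described in this paper; the function names are ours.

```python
import numpy as np

def dct_1d_8(x):
    """Floating-point 1-D 8-point DCT: Z(k) = C(k)/2 * sum x(n) cos((2n+1)k*pi/16)."""
    n = np.arange(8)
    z = np.empty(8)
    for k in range(8):
        ck = 1.0 / np.sqrt(2.0) if k == 0 else 1.0
        z[k] = 0.5 * ck * np.sum(x * np.cos((2 * n + 1) * k * np.pi / 16.0))
    return z

def dct_2d_indirect(block):
    """Indirect (row-column) 2-D DCT: 1st-D on rows, transpose, 2nd-D on rows again."""
    first_d = np.array([dct_1d_8(row) for row in block])      # 1st-D transform
    transposed = first_d.T                                     # role of the TMEM
    return np.array([dct_1d_8(row) for row in transposed]).T  # 2nd-D transform

# Example: transform a random 8x8 block of 8-bit pixels.
block = np.random.randint(0, 256, (8, 8)).astype(float)
coeffs = dct_2d_indirect(block)
```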

In this paper, an 8 × 8 2-D DCT core that consists of a single 1-D DCT core and one TMEM is proposed using a strategy known as space-time scheduling (STS). Based on accuracy simulations with the DA-based binary signed-digit (BSD) expression, a 9-bit DA precision is chosen in order to meet the peak-signal-to-noise-ratio (PSNR) requirements outlined in previous works [5]–[7]. Furthermore, the proposed DCT core is designed with a hardware sharing architecture for the BSD DA-based computation so as to reduce the area cost. The arithmetic units share the hardware resources during the four time slots in the DCT core design.


A 100% hardware utilization is also achieved by using the proposed time scheduling strategy, and the 1st-D and 2nd-D DCT computations can be calculated at the same time. Therefore, a high performance transform engine with high accuracy, a small area, and a high throughput rate has been achieved in this research.

This paper is organized as follows. In Section II, the mathematical derivation of the BSD-format distributed arithmetic is given. The proposed 8 × 8 2-D DCT architecture, including an analysis of the coefficient bits, the hardware sharing architecture, and the proposed time scheduling strategy, is discussed in Section III. Comparisons and discussions are presented in Section IV, and conclusions are drawn in Section V.

II. MATHEMATICAL DERIVATION OF BSD-FORMAT DISTRIBUTED ARITHMETIC

The inner product for a general matrix multiplication-and-accumulation can be written as follows:

$$Y = \sum_{k=0}^{N-1} A_k x_k \tag{1}$$

where $A_k$ is a fixed coefficient and $x_k$ is the input data. Assume that the coefficient $A_k$ is an $M$-bit BSD number, i.e., $A_k = \sum_{m=0}^{M-1} a_{k,m} 2^{-m}$ with $a_{k,m} \in \{-1, 0, 1\}$. Equation (1) can then be expressed as follows:

$$Y = \sum_{m=0}^{M-1} 2^{-m} y_m, \qquad
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & a_{1,0} & \cdots & a_{N-1,0} \\
a_{0,1} & a_{1,1} & \cdots & a_{N-1,1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{0,M-1} & a_{1,M-1} & \cdots & a_{N-1,M-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix} \tag{2}$$

where $y_m = \sum_{k=0}^{N-1} a_{k,m} x_k$ and $a_{k,m} \in \{-1, 0, 1\}$ for $m = 0, 1, \ldots, M-1$. The matrix $[a_{k,m}]$ is called the DA coefficient matrix.

In (2), each $y_m$ can be calculated by adding or subtracting the inputs $x_k$ according to the signed digits $a_{k,m}$, and the transform output $Y$ can then be obtained by shifting and adding every non-zero $y_m$. Thus, the inner-product computation in (1) can be implemented using shifters and adders instead of multipliers. Therefore, a smaller area can be achieved by using the BSD DA-based architecture.
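To make the shift-and-add idea concrete, the following sketch evaluates the inner product of (1) with a BSD DA coefficient matrix: each coefficient is decomposed into signed digits in {-1, 0, 1}, each bit-plane term y_m is formed only with additions and subtractions of the inputs, and the result is accumulated by shifting, as in (2). It is a bit-accurate software illustration under our own assumptions (fractional digit weights and a greedy decomposition), not the datapath of the proposed core.

```python
import math

def to_bsd(value, n_bits):
    """Greedy signed-digit decomposition of a fractional coefficient into
    digits d[m] in {-1, 0, 1} so that value ~= sum d[m] * 2^-(m+1)."""
    digits = []
    residual = value
    for m in range(1, n_bits + 1):
        weight = 2.0 ** (-m)
        if residual >= weight / 2:
            digits.append(1)
            residual -= weight
        elif residual <= -weight / 2:
            digits.append(-1)
            residual += weight
        else:
            digits.append(0)
    return digits

def da_inner_product(coeffs, xs, n_bits=9):
    """Evaluate Y = sum A_k * x_k using only add/subtract and shift,
    following the bit-plane (DA) order of (2)."""
    bsd = [to_bsd(a, n_bits) for a in coeffs]        # DA coefficient matrix
    y = 0.0
    for m in range(n_bits):                          # one bit plane per step
        y_m = sum(d[m] * x for d, x in zip(bsd, xs)) # only +/- of the inputs
        y += y_m * 2.0 ** (-(m + 1))                 # shift-and-accumulate
    return y

# Example with the odd-part DCT coefficients cos(i*pi/16), i = 1, 3, 5, 7.
coeffs = [math.cos(i * math.pi / 16) for i in (1, 3, 5, 7)]
xs = [10.0, -3.0, 7.0, 2.0]
print(da_inner_product(coeffs, xs), sum(a * x for a, x in zip(coeffs, xs)))
```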

III. PROPOSED 8 × 8 2-D DCT CORE DESIGN

This section introduces the proposed 8 × 8 2-D DCT core implementation. The 2-D DCT is defined as

$$Z(u,v) = \frac{C(u)C(v)}{4} \sum_{x=0}^{7}\sum_{y=0}^{7} z(x,y)\cos\frac{(2x+1)u\pi}{16}\cos\frac{(2y+1)v\pi}{16} \tag{3}$$

where $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$, and $C(u), C(v) = 1$ for $u, v \neq 0$. Consider the 1-D 8-point DCT first:

$$Z(k) = \frac{C(k)}{2} \sum_{n=0}^{7} x(n)\cos\frac{(2n+1)k\pi}{16} \tag{4}$$

where $C(k) = 1/\sqrt{2}$ for $k = 0$, $C(k) = 1$ for non-zero $k$, and $x(n)$ and $Z(k)$ denote the input data and the transform output, respectively. By neglecting the scaling factor 1/2, the

1-D 8-point DCT in (4) can be divided into even and odd parts, $\mathbf{Z}_E$ and $\mathbf{Z}_O$, as listed in (5) and (6), respectively:

$$\mathbf{Z}_E = \begin{bmatrix} Z(0)\\ Z(2)\\ Z(4)\\ Z(6)\end{bmatrix}
= \begin{bmatrix}
C_4 & C_4 & C_4 & C_4\\
C_2 & C_6 & -C_6 & -C_2\\
C_4 & -C_4 & -C_4 & C_4\\
C_6 & -C_2 & C_2 & -C_6
\end{bmatrix}
\begin{bmatrix} a_0\\ a_1\\ a_2\\ a_3\end{bmatrix} \tag{5}$$

$$\mathbf{Z}_O = \begin{bmatrix} Z(1)\\ Z(3)\\ Z(5)\\ Z(7)\end{bmatrix}
= \begin{bmatrix}
C_1 & C_3 & C_5 & C_7\\
C_3 & -C_7 & -C_1 & -C_5\\
C_5 & -C_1 & C_7 & C_3\\
C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} b_0\\ b_1\\ b_2\\ b_3\end{bmatrix} \tag{6}$$

where $\mathbf{a} = [a_0\ a_1\ a_2\ a_3]^T$ and $\mathbf{b} = [b_0\ b_1\ b_2\ b_3]^T$, with

$$C_i = \cos\!\left(\frac{i\pi}{16}\right) \tag{7}$$

$$a_n = x(n) + x(7-n), \quad n = 0, 1, 2, 3 \tag{8}$$

$$b_n = x(n) - x(7-n), \quad n = 0, 1, 2, 3. \tag{9}$$
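As a quick numerical check of the decomposition in (5)–(9), the sketch below computes the 8-point DCT (with the 1/2 scaling neglected, as in the text) from the butterfly outputs a_n and b_n and the even/odd coefficient matrices, and compares the result with the direct form of (4). It is a floating-point illustration only, not the fixed-point datapath of the proposed core.

```python
import numpy as np

C = {i: np.cos(i * np.pi / 16) for i in range(1, 8)}  # C_i = cos(i*pi/16), see (7)

ME = np.array([[C[4],  C[4],  C[4],  C[4]],            # even-part matrix of (5)
               [C[2],  C[6], -C[6], -C[2]],
               [C[4], -C[4], -C[4],  C[4]],
               [C[6], -C[2],  C[2], -C[6]]])
MO = np.array([[C[1],  C[3],  C[5],  C[7]],            # odd-part matrix of (6)
               [C[3], -C[7], -C[1], -C[5]],
               [C[5], -C[1],  C[7],  C[3]],
               [C[7], -C[5],  C[3], -C[1]]])

def dct8_even_odd(x):
    """1-D 8-point DCT (without the 1/2 scaling) via the even/odd split."""
    a = x[:4] + x[7:3:-1]          # a_n = x(n) + x(7-n), see (8)
    b = x[:4] - x[7:3:-1]          # b_n = x(n) - x(7-n), see (9)
    ze, zo = ME @ a, MO @ b        # Z(0),Z(2),Z(4),Z(6) and Z(1),Z(3),Z(5),Z(7)
    z = np.empty(8)
    z[0::2], z[1::2] = ze, zo
    return z

def dct8_direct(x):
    """Direct evaluation of (4) with the 1/2 scaling neglected."""
    n, k = np.arange(8), np.arange(8).reshape(-1, 1)
    ck = np.where(k == 0, 1 / np.sqrt(2), 1.0).ravel()
    return ck * (np.cos((2 * n + 1) * k * np.pi / 16) @ x)

x = np.random.rand(8)
assert np.allclose(dct8_even_odd(x), dct8_direct(x))
```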

For the 2-D DCT core implementation, three main strategies for increasing the hardware and time utilization, collectively called the space-time scheduling (STS) strategy, are proposed:

• find the DA-precision bit length for the BSD representation that achieves the system accuracy requirements;

• share the hardware resources in time to reduce the area cost;

• achieve 100% hardware utilization by using the time scheduling strategy.

    A. Analysis of the Coefficient Bits

The seven internal coefficients $C_1$ to $C_7$ for the 2-D DCT transformation are expressed as BSD representations in order to save computation time and hardware cost, as well as to meet the PSNR requirements. The system PSNR is defined [1] as follows:

$$\mathrm{PSNR} = 10\log_{10}\frac{255^2}{\mathrm{MSE}} \tag{10}$$

where MSE is the mean-square error per pixel between the original image and the reconstructed image.
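For reference, (10) corresponds to the small helper below, where the reconstructed image is whatever comes back from the transform under test followed by an ideal inverse DCT; it is a generic PSNR routine for 8-bit images, not code from the paper.

```python
import numpy as np

def psnr_8bit(original, reconstructed):
    """PSNR = 10*log10(255^2 / MSE) for 8-bit gray-level images, as in (10)."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```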


    Fig. 1. Simulation for system PSNR in different BSD bit length expressions.

TABLE I: NORMALIZED HW COST IN DIFFERENT BSD BIT LENGTH EXPRESSIONS

First, the BSD coefficients for the different bit length expressions are evaluated using the popular test images Lena, Peppers, and Baboon to simulate the system PSNR. After inputting the original test image pixels to the forward and inverse DCT transforms, the system PSNR versus the different BSD bit length expressions is illustrated in Fig. 1(a), and the reconstructed images, i.e., the images that have passed through the forward and inverse DCT for the different BSD bit lengths, are shown in Fig. 1(b). The normalized hardware (HW) cost of the processing element (PE) for the different BSD bit length expressions is shown in Table I. The HW cost is the synthesized area for each BSD bit length, obtained using the Synopsys Design Compiler with the Artisan TSMC 0.18-µm standard cell library.

The gap in accuracy between the 8-bit and the 9-bit BSD expressions is evident in Fig. 1(a), and the HW cost increases as the BSD bit length increases. Consequently, the 9-bit BSD expression of the coefficients is chosen as the tradeoff between the hardware cost and the system accuracy. The BSD expressions and the accuracy of the DCT coefficients are listed in Table II, where the digit $\bar{1}$ in Table II indicates the BSD digit $-1$. The coefficient accuracy is defined as the signal-to-quantization-noise ratio (SQNR) shown in (11):

$$\mathrm{SQNR} = 10\log_{10}\frac{P_s}{P_n} \tag{11}$$

where $P_s$ and $P_n$ are the power of the desired floating-point signal and the power of the quantization noise, respectively.
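The SQNR in (11) can be evaluated per coefficient by comparing the floating-point value of each C_i with its 9-bit BSD approximation. The minimal sketch below reuses the illustrative to_bsd() decomposition from the DA sketch in Section II; it only shows how numbers of this kind could be reproduced and is not guaranteed to match the exact BSD encodings of Table II.

```python
import math

def bsd_value(digits):
    """Value represented by signed digits d[m] in {-1, 0, 1} with weights 2^-(m+1)."""
    return sum(d * 2.0 ** (-(m + 1)) for m, d in enumerate(digits))

def sqnr_db(ideal, quantized):
    """SQNR = 10*log10(Ps/Pn), as in (11), for a single coefficient."""
    ps = ideal ** 2                      # power of the desired floating-point value
    pn = (ideal - quantized) ** 2        # power of the quantization noise
    return 10.0 * math.log10(ps / pn)

for i in range(1, 8):
    ci = math.cos(i * math.pi / 16)
    ci_9bit = bsd_value(to_bsd(ci, 9))   # to_bsd() from the DA sketch above
    print(f"C_{i}: {ci:.6f} -> {ci_9bit:.6f}, SQNR = {sqnr_db(ci, ci_9bit):.1f} dB")
```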

TABLE II: 9-BIT BSD EXPRESSIONS FOR DCT COEFFICIENTS

    B. Hardware Sharing Strategy

All modules in the 1-D DCT core, including the modified two-input butterfly (MBF2), the Pre-Reorder, the process element even (PEE), the process element odd (PEO), and the Post-Reorder, share the hardware resources in order to reduce the area cost. Moreover, the 8 × 8 2-D DCT core is implemented using a single 1-D DCT core and one TMEM. The architecture of the 2-D DCT core is described in the following sections.

1) Modified Butterfly Module: Equations (8) and (9) are easily implemented using a two-input butterfly module [18] called BF2. In general, the BF2 has a hardware utilization rate of only 50% in its adder and subtracter. In order to enable the hardware resources to be shared, additional multiplexers and Reorder Registers are added to the proposed MBF2 module, as illustrated in Fig. 2. The Reorder Registers consist of four word registers that use control signals to select the input data and enable signals to output or hold the data. Similar to the BF2, the operation of the proposed MBF2 has an eight-clock-cycle period. In the first four cycles, the 1st-D input data shift into Reorder Registers 1, and the 2nd-D input data execute the addition and subtraction operations. In the next four cycles, the operations of the 1st-D and 2nd-D data are exchanged: the 1st-D data calculate $a_n$ and $b_n$ in (8) and (9) by using the adder and subtracter.


TABLE III: COEFFICIENTS AND INPUT DATA FROM THE MBF2 FOR THE FOUR SEPARATE TIME SLOT OUTPUTS

    Fig. 2. Proposed MBF2 architecture.

In the second four cycles, part of the results are fed into the next stage, and the other results shift into Reorder Registers 1 in each cycle. At the same time, the 2nd-D input data shift into Reorder Registers 2. Therefore, the adder and subtracter in the MBF2 serve the 1st-D and 2nd-D operations in turn, achieving 100% hardware utilization.
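The time sharing performed by the MBF2 can be pictured with a small behavioral model: over one eight-cycle period the single adder/subtracter pair serves one dimension while the other dimension's samples are only being buffered, and then the roles swap. The model below is a simplified software illustration of that scheduling (the register naming and data ordering are our own simplifications), not an RTL description of the module.

```python
def mbf2_schedule(first_d_block, second_d_block):
    """Behavioral sketch of the MBF2 time sharing over one 8-cycle period.

    first_d_block, second_d_block: 8-point sequences for the 1st-D and 2nd-D passes.
    Returns the (a, b) butterfly outputs for each pass plus a per-cycle usage log.
    """
    usage = []                      # which pass owns the adder/subtracter each cycle
    reorder_regs_1 = []             # buffers the 1st-D samples while they arrive

    # Cycles 1-4: 1st-D samples only shift in; adder/subtracter works on 2nd-D data.
    for cycle in range(4):
        reorder_regs_1.append(first_d_block[cycle])
        usage.append(("2nd-D", cycle + 1))
    a2 = [second_d_block[n] + second_d_block[7 - n] for n in range(4)]
    b2 = [second_d_block[n] - second_d_block[7 - n] for n in range(4)]

    # Cycles 5-8: roles swap; adder/subtracter now computes a_n, b_n for the 1st-D data.
    for cycle in range(4, 8):
        reorder_regs_1.append(first_d_block[cycle])
        usage.append(("1st-D", cycle + 1))
    a1 = [reorder_regs_1[n] + reorder_regs_1[7 - n] for n in range(4)]
    b1 = [reorder_regs_1[n] - reorder_regs_1[7 - n] for n in range(4)]

    return (a1, b1), (a2, b2), usage
```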

2) Pre-Reorder Module: In (5), the even part transform output can be modified as follows:

(12)

Also, the odd part transform output in (6) can be rewritten as

(13)

TABLE IV: EVEN PART TRANSFORMATION TABLE

From (12) and (13), the transform outputs can be calculated in four separate time slots in order to share the hardware resources. Hence, (12) and (13) can be expressed as (14) and (15), respectively:

(14)

(15)

where $t$ denotes the time slot, and the coefficient elements and the transform inputs $a_n$ and $b_n$ come from the MBF2 stage. Equations (14) and (15) for the four time slots are summarized in Table III. The proposed Pre-Reorder module orders the inputs ($a_n$ and $b_n$) according to the time slot indicated in Table III, so the hardware resources can be shared by reordering the inputs during the four separate time slots.
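The effect of the Pre-Reorder stage can be imitated in software: a single shared four-term weighted-sum datapath produces one even-part output per time slot, with the coefficient vector (a row of the matrix in (5)) selected by the slot index. The sketch below only illustrates that scheduling idea; the actual core replaces the multiplications with the bit-level DA and adder-tree structure of the next subsection, and the slot-to-output assignment shown here is our own choice rather than the exact contents of Table III.

```python
import math

C = {i: math.cos(i * math.pi / 16) for i in range(1, 8)}

# One row of the even-part matrix in (5) is issued per time slot t = 0..3.
EVEN_SLOT_COEFFS = {
    0: [C[4],  C[4],  C[4],  C[4]],   # -> Z(0)
    1: [C[2],  C[6], -C[6], -C[2]],   # -> Z(2)
    2: [C[4], -C[4], -C[4],  C[4]],   # -> Z(4)
    3: [C[6], -C[2],  C[2], -C[6]],   # -> Z(6)
}

def shared_even_datapath(a, slot):
    """Single shared 4-term weighted-sum datapath, reused in every time slot."""
    coeffs = EVEN_SLOT_COEFFS[slot]
    return sum(c * x for c, x in zip(coeffs, a))

def even_outputs_over_four_slots(a):
    """Compute Z(0), Z(2), Z(4), Z(6) sequentially, one result per time slot."""
    return [shared_even_datapath(a, t) for t in range(4)]
```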

3) Process Element Module: For the DA-based computation, the even part and odd part transformations can be implemented using the PEE and PEO, respectively. From Table III, the even part transformation can be expanded into DA-based computation formats and can share the hardware resources at the bit level. The coefficient vector has two combinations of coefficients,


$\{\pm C_4\}$ and $\{\pm C_2, \pm C_6\}$, for the $Z(0)/Z(4)$ and $Z(2)/Z(6)$ transform outputs, respectively. Table IV expresses (14) in a bit-level formulation.

Using the given input data $a_0$, $a_1$, $a_2$, and $a_3$, the even part transform output needs only three adders, a Data-Distributed module, and one even part adder-tree (EAT) to obtain the result during the four time slots. The Data-Distributed module consists of seven two-input multiplexers that assign the appropriate non-zero values to each weighted input during the four separate time slots, as shown in Table IV, and the EAT sums the different weighted values with tree-like adders to complete the transform output. Similarly, the odd part transformation in (15) can be implemented in a DA-based format using four adders and one odd part adder-tree (OAT). The EAT and OAT can be implemented using the error-compensated adder tree [19] to improve the computation accuracy. Moreover, there are three pipeline stages (two of them within the EAT and the OAT) in each of the PEE and PEO modules, which enables high-speed computation. The proposed architecture for both the PEE and the PEO is shown in Fig. 3.

Fig. 3. Architecture for the proposed PEE and PEO.

4) Post-Reorder Module: In the last stage of the 1-D DCT computation, the data sequences from the PEE and the PEO must be merged and re-permuted in the Post-Reorder module, which restores the 8-point transform outputs to the required output order. Fig. 4 shows the proposed Post-Reorder architecture. Two multiplexers select the data that are fed into the different Reorder Registers in order to permute the output order. The permuted sequence is the 1st-D DCT transform output, which is fed into the TMEM. In addition, the 2nd-D DCT transform output is completed after the permutation by Reorder Registers 4.

Fig. 4. Proposed Post-Reorder architecture.

5) 2-D DCT Core Architecture: To save hardware cost, the proposed 2-D DCT core, as shown in Fig. 5, is implemented using a single 1-D DCT core and one TMEM. The 1-D DCT core includes an MBF2, a Pre-Reorder module, a PEE, a PEO, and a Post-Reorder module. The TMEM is implemented using 64-word, 12-bit dual-port registers and has a latency of 52 cycles. Based on the time scheduling strategy described in Section III-C, a hardware utilization of 100% can be achieved.

    C. Time Scheduling Strategy

As a result of the time scheduling strategy, the 1st-D and 2nd-D transforms can be computed simultaneously, which increases the hardware utilization to 100%. The timing flow chart for the proposed strategy is illustrated in Fig. 6.

1) 1st-D Data Computation: In the first four cycles, the first four input data points shift into Reorder Registers 1. During cycles 5–8, the MBF2 performs the additions and subtractions on the first eight-point input data. In the 9th cycle, the even part of the Pre-Reorder module obtains the even part inputs $a_n$ and reorders them for output to the PEE based on Table III. During cycles 10–13, the even part results are calculated in the PEE module; owing to the latency of the three pipeline stages in the PEE, they are completed during cycles 13–16. Likewise, the odd part is processed in the PEO module during cycles 14–17 and is finished in cycles 17–20. Therefore, after the 16th cycle, the Post-Reorder module permutes the 1st-D transform results and inputs them into the TMEM.

2) 2nd-D Data Computation: From the 17th cycle, the 1st-D transform data are input into the TMEM. Due to the 52-cycle latency of the TMEM, the 2nd-D computation data transposed by the TMEM are sent to the input of Reorder Registers 2 in the MBF2 at the 69th cycle. Then, the adder and subtracter operate during cycles 73–76 to compute the first 2nd-D 8-point data sequence, and the first 2nd-D data are permuted by the Pre-Reorder module during cycles 77–80. The PEE and the PEO calculate the first 2nd-D data during cycles 78–81 and 82–85, respectively. The Post-Reorder module finishes the 2nd-D transform results from the 84th cycle, and the first 2-D DCT transform output data are obtained at the end of the 84th cycle.

3) Hardware Utilization: In Fig. 6, the adder and subtracter in the MBF2 are at 50% utilization during cycles 1–68. However, after the 68th cycle, the MBF2 achieves 100% hardware utilization as a result of the 2nd-D data being input. At the same time, the utilization of the PEE and the PEO also reaches 100% from the 74th and 78th cycles, respectively. The Post-Reorder module does not work during cycles 1–12, but after the 16th cycle the 1st-D transform outputs are completed using Reorder Registers 3 in the Post-Reorder. Similarly, the 2nd-D transform results are finished using Reorder Registers 4 in the Post-Reorder from the 84th cycle. At the end of the 84th cycle, the 1st-D and 2nd-D transform outputs are obtained simultaneously.


    Fig. 5. Proposed 2-D DCT architecture.

    Fig. 6. Timing scheduling for the proposed 2-D DCT core.

Hence, the hardware utilization increases to 100% after the 84th cycle, and the latency of the proposed 8 × 8 2-D DCT core is 84 clock cycles. Based on the proposed time scheduling strategy, 100% hardware utilization is achieved, and the computation of the data sequence is straightforward.

In summary, following the analysis of the coefficient accuracy, the DA-precision bit length was chosen as a 9-bit BSD expression. In this way, the hardware cost is reduced by using the 9-bit BSD DA-precision instead of the 12 bits used in previous works. Using the proposed hardware sharing strategy, the DCT computation shares the hardware resources during the four time slots, thereby reducing the area cost. In addition, 100% hardware utilization can be achieved based on the proposed time scheduling strategy, effectively yielding a high-throughput-rate design. Therefore, a high-accuracy, small-area, and high-throughput-rate 2-D DCT core has been obtained by applying the proposed STS strategy.

IV. DISCUSSION AND COMPARISONS

In this section, the system accuracy test for the proposed 2-D DCT core is discussed. Then, a comparison of the proposed 2-D DCT core with previous works is presented. Finally, the characteristics of the implemented chip and the FPGA implementation are described.

    A. System Accuracy Test

Seven test images used to check the system accuracy are all 512 × 512 pixels in size, with each pixel represented by 8-bit, 256-gray-level data.


    Fig. 7. Accuracy verification flow for proposed DCT core.

TABLE V: PSNR PERFORMANCE FOR EACH TEST PATTERN

TABLE VI: PSNR VERSUS QF FOR THE PROPOSED 2-D DCT APPLIED TO THE JPEG COMPRESSION FLOW

After the original test image pixels are fed into the proposed 2-D DCT core, the transform output data are captured and passed to MATLAB to compute the inverse DCT using 64-bit double-precision operations. The verification flow follows the test flow of previous works ([4], [7], and [20]), as illustrated in Fig. 7. The PSNR values of the test images verified by this flow are listed in Table V; the average PSNR is close to 52 dB for the proposed 8 × 8 2-D DCT core. Furthermore, the proposed 2-D DCT core was placed into a JPEG compression flow [21], [22] to evaluate it in a real compression system. The quality factor (QF) determines the compression quality and data size, and therefore the QF affects the system PSNR of the JPEG compression. Table VI lists the PSNR versus QF for the proposed 2-D DCT core applied to JPEG, and Fig. 8 illustrates both the original and the compressed versions of the test image Lena using the JPEG compression flow. The quality of the compressed image is excellent.

    B. Comparison With Other 2-D DCT Architectures

Table VII compares the proposed 8 × 8 2-D DCT core with previous works. In [4], Lin et al. derived an SFFT format from the 2-D DCT and shared the hardware resources for the FFT/IFFT/2-D DCT computation in a triple-mode processor design; the 2-D DCT is implemented with a direct 2-D method based on a pipelined radix- single-delay-feedback path architecture. In [7], a 2-D DCT core used the NEDA architecture with two 1-D cores and one TMEM to achieve a low-cost design; speed limitations exist in the serial shifting and addition operations of the NEDA computation. Huang et al. improved the throughput rate by using an adder-tree (AT) instead of the shifting and addition operations of the NEDA architecture [9]. However, the parallel input and output of the NEDA architecture require a large amount of I/O resources. A 2-D DCT core using a single 1-D core and one TMEM for a multiplier-based DCT implementation was addressed in [10]; an area-saving serial-input, parallel-output transpose memory was also implemented in that core to achieve a small-area design operating at a frequency of 55 MHz. However, the 1-D core could not compute the 1st-D and 2nd-D transformations simultaneously, so the throughput rate was half of the operating frequency. Therefore, a hardware sharing technique was applied in the 2-D DCT core of [13] to deal with the throughput-rate reduction, and that core achieved a throughput of 100 Mpels/s.

The proposed 2-D DCT core employs a single 1-D core and one TMEM in order to reduce the core area. Also, the 1-D core utilizes 9 adders and 2 adder-trees (ATs) in the proposed hardware sharing architecture in order to reduce the hardware cost. Moreover, the throughput rate of the proposed 2-D DCT core is enhanced by the time scheduling strategy: the proposed 2-D core is able to compute the 1st-D and 2nd-D transformations simultaneously, and a throughput rate of 167 Mpels/s is achieved. Therefore, the proposed 2-D DCT core has the highest hardware efficiency, defined as follows:

$$\text{Hardware Efficiency} = \frac{\text{Throughput rate (pels/s)}}{\text{Gate count}} \quad (\text{pels/s-gate}). \tag{16}$$
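As a worked example, plugging the figures reported below for the proposed core (a 167-Mpels/s throughput rate and a 12.2 K gate count without DFT insertion) into (16) reproduces the 13.69 Kpels/s-gate value quoted for Table VII:

$$\text{Hardware Efficiency} = \frac{167\times 10^{6}\ \text{pels/s}}{12.2\times 10^{3}\ \text{gates}} \approx 13.69\ \text{Kpels/s-gate}.$$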

A comparison with previous works is summarized in Table VII. Note that the previous works presented only simulation results, while the proposed work has a chip implementation with design-for-test (DFT) and measured results. For a fair comparison, the simulation results of the proposed work are also shown (removing the DFT considerations of the chip implementation, which require extra area for the memory built-in self-test circuit and the scan flip-flops). The gate count of the STS DCT without DFT insertion is 12.2 K, and the resulting hardware efficiency of the STS DCT is 13.69, the largest in Table VII.

On the other hand, the proposed 2-D DCT core achieves a 52-dB PSNR value, which is a medium accuracy compared with other previous works, as verified by the test flow in Fig. 7. Power consumption is another issue in circuit design. Previous works implement the DCT core in many different technologies, so in order to make a fair comparison the power is normalized to a 0.18-µm technology process, similar to [23], as listed in (17) and summarized in Table VII. The proposed 2-D DCT core has a medium power consumption among previous works.

Normalized Power (17)


Fig. 8. Compressed image using JPEG compression processing with the proposed 2-D DCT core. (a) Original image. (b) Compressed image.

TABLE VII: COMPARISON OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE

Notes: Four transistors per NAND2 gate are assumed for the different technologies. The chip measurement result includes DFT insertion, while the simulation result is without DFT insertion. The power result is estimated for 1-D DCT operation.

    C. Chip Implementation

Implemented in a 1.8-V TSMC 0.18-µm 1P6M CMOS process, the proposed 8 × 8 2-D DCT core uses the Synopsys Design Compiler to synthesize the RTL code and Cadence SoC Encounter for placement and routing (P&R). The proposed DCT core has a latency of 84 clock cycles and operates at 167 MHz to meet HDTV specifications. For design-for-test (DFT) considerations, the proposed core uses the SynTest SRAM BIST tool to generate the memory built-in self-test (BIST) circuit and the Synopsys Design Compiler to synthesize the scan flip-flops; the total gate count of the proposed core with DFT insertion is 15.8 K. The photomicrograph of the proposed core, with each module's boundary marked, is shown in Fig. 9, and the measured characteristics are listed in Table VIII.

    D. FPGA Implementation

The proposed 2-D DCT core, synthesized using Xilinx ISE 10.1 for the Xilinx XC2VP30 FPGA, can be operated at a clock frequency of 110 MHz. Table IX shows a comparison of the proposed 2-D DCT core with previous FPGA implementations. A multiplier-less coordinate rotation digital computer (CORDIC) architecture was presented in [15]: Sun et al. utilized 120 adders and 80 shifters to perform a 2-D quantized discrete cosine integer transform (QDCIT), achieving a high clock rate of 149 MHz but requiring large FPGA area resources. Also, a 2-D DCT core optimized in both delay and area was addressed by Tumeo et al. [14], who presented a balance point between the delay and the area for a multiplier-based pipelined DCT design with 19 adders/subtracters and 4 multipliers in their DCT core operated at a clock frequency of 107 MHz.


    Fig. 9. Photomicrograph of the proposed DCT core.

TABLE VIII: CHIP CHARACTERISTICS OF THE PROPOSED 2-D DCT ARCHITECTURE

TABLE IX: COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS

The proposed DCT core has lower area resources and a medium operating frequency among the XC2VP30 FPGA implementations.

    V. CONCLUSION

This paper proposes a high performance video transform engine using the STS strategy. The proposed 2-D DCT core employs a single 1-D DCT core and one TMEM and occupies a small area. Based on the test image simulations, a 9-bit DA precision is chosen in order to meet the PSNR requirements, and the hardware sharing architecture enables the arithmetic resources to be shared in time so as to reduce the area cost. Hence, the number of adders/subtracters in the 1-D DCT core allows a 74% saving in area over the NEDA architecture for a DA-based DCT design. Furthermore, the proposed time scheduling strategy arranges the computation time for each process element so that the 1-D core can calculate the 1st-D and 2nd-D transformations simultaneously, thereby achieving a high throughput rate. Therefore, a high performance 8 × 8 2-D DCT core with high accuracy, a small area, and a high throughput rate has been achieved using the proposed STS strategy. Finally, the proposed high performance 2-D DCT core has been fabricated in a TSMC 0.18-µm 1P6M CMOS process and implemented on the XC2VP30 FPGA.

    REFERENCES

[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

[2] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2174–2193, Sep. 1992.

[3] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A cost-effective architecture for 8 × 8 two-dimensional DCT/IDCT using direct method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459–467, Jun. 1997.

[4] C. T. Lin, Y. C. Yu, and L. D. Van, "Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058–1071, Aug. 2008.

[5] M. Alam, W. Badawy, and G. Jullien, "A new time distributed DCT architecture for MPEG-4 hardware reference model," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 5, pp. 726–730, May 2005.

[6] T. Xanthopoulos and A. P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," IEEE J. Solid-State Circuits, vol. 35, no. 5, pp. 740–750, May 2000.

[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, Mar. 2006.

[8] C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8 × 8 DCT," in Proc. Int. Conf. ASIC, 2007, pp. 189–192.

[9] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21–24.

[10] S. C. Hsia and S. H. Wang, "Shift-register-based data transposition for cost-effective discrete cosine transform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, pp. 725–728, Jun. 2007.

[11] J. I. Guo, R. C. Ju, and J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 416–428, Apr. 2004.

[12] H. C. Hsu, K. B. Lee, N. Y. Chang, and T. S. Chang, "Architecture design of shape-adaptive discrete cosine transform and its inverse for MPEG-4 video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 375–386, Mar. 2008.

[13] A. Madisetti and A. N. Willson, "A 100 MHz 2-D 8 × 8 DCT/IDCT processor for HDTV applications," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, pp. 158–165, Apr. 1995.

[14] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A pipelined fast 2D-DCT accelerator for FPGA-based SoCs," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007, pp. 331–336.

[15] C. C. Sun, P. Donner, and J. Gotze, "Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform," in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 3014–3017.

[16] S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005, pp. 162–166.

[17] L. V. Agostini, I. S. Silva, and S. Bampi, "Pipelined fast 2D DCT architecture for JPEG image compression," in Proc. IEEE Symp. Integr. Circuits Syst. Des., 2001, pp. 226–231.

[18] C. H. Chang, C. L. Wang, and Y. T. Chang, "Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse," IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3206–3216, Nov. 2000.


[19] Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010, to be published.

[20] W. Pan, "A fast 2-D DCT algorithm via distributed arithmetic optimization," in Proc. IEEE Int. Conf. Image Process., 2000, vol. 3, pp. 114–117.

[21] Information Technology - Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines, ISO/IEC 10918-1, 1991.

[22] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1992.

[23] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT processor for OFDM-based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 451–455, Jun. 2010.

Yuan-Ho Chen (S'10) received the B.S. degree from Chang Gung University, Taoyuan, Taiwan, in 2002, and the M.S. degree from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering.

His research interests include video/image and digital signal processing, VLSI architecture design, VLSI implementation, computer arithmetic, and signal processing for wireless communication.

Tsin-Yuan Chang (S'87-M'90) received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1982, and the M.S. and Ph.D. degrees from Michigan State University, East Lansing, in 1987 and 1989, respectively, both in electrical engineering.

He is an Associate Professor with the Department of Electrical Engineering, National Tsing Hua University. His research interests include IC design and testing, computer arithmetic, and VLSI DSP.