

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 2, FEBRUARY 2015 449

A Generalized Algorithm and Reconfigurable Architecture for Efficient and Scalable Orthogonal Approximation of DCT

Maher Jridi, Member, IEEE, Ayman Alfalou, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract—Approximation of the discrete cosine transform (DCT) is useful for reducing its computational complexity without significant impact on its coding performance. Most of the existing algorithms for approximation of the DCT target only DCTs of small transform lengths, and some of them are non-orthogonal. This paper presents a generalized recursive algorithm to obtain an orthogonal approximation of the DCT, where an approximate DCT of length $N$ can be derived from a pair of DCTs of length $N/2$ at the cost of $N$ additions for input preprocessing. We perform recursive sparse matrix decomposition and make use of the symmetries of the DCT basis vectors to derive the proposed approximation algorithm. The proposed algorithm is highly scalable for hardware as well as software implementation of DCTs of higher lengths, and it can make use of an existing approximation of the 8-point DCT to obtain an approximate DCT of any power-of-two length $N > 8$. We demonstrate that the proposed approximation of the DCT provides comparable or better image and video compression performance than the existing approximation methods. It is shown that the proposed algorithm involves lower arithmetic complexity than the other existing approximation algorithms. We have presented a fully scalable reconfigurable parallel architecture for the computation of the approximate DCT based on the proposed algorithm. One uniquely interesting feature of the proposed design is that it can be configured for the computation of a 32-point DCT, or for parallel computation of two 16-point DCTs or four 8-point DCTs, with a marginal control overhead. The proposed architecture is found to offer many advantages in terms of hardware complexity, regularity and modularity. Experimental results obtained from FPGA implementation show the advantage of the proposed method.

Index Terms—Algorithm-architecture codesign, DCT approximation, discrete cosine transform (DCT), high efficiency video coding (HEVC).

Manuscript received April 29, 2014; revised July 02, 2014 and August 27, 2014; accepted September 18, 2014. Date of publication October 14, 2014; date of current version January 26, 2015. This paper was recommended by Associate Editor H. Johansson.

M. Jridi and A. Alfalou are with the Vision Team, ISEN Brest, 29228 Brest Cedex 2, France (e-mail: [email protected]; ayman.al-falou@isen-bretagne.fr).

P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2014.2360763

1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

THE DISCRETE cosine transform (DCT) is popularly used in image and video compression. Since the DCT is computationally intensive, several algorithms have been proposed in the literature to compute it efficiently [1]–[3]. Recently, significant work has been done to derive approximations of the 8-point DCT for reducing the computational complexity [4]–[9]. The main objective of the approximation algorithms is to get rid of multiplications, which consume most of the power and computation time, and to obtain a meaningful estimation of the DCT as well. Haweel [8] has proposed the signed DCT (SDCT) for $8 \times 8$ blocks, where the basis vector elements are replaced by their sign, i.e., $\pm 1$. Bouguezel-Ahmad-Swamy (BAS) have proposed a series of methods. They have provided a good estimation of the DCT by replacing the basis vector elements by 0, $\pm 1/2$, $\pm 1$ [7]. In the same vein, Bayer and Cintra [5], [6] have proposed two transforms derived from 0 and $\pm 1$ as elements of the transform kernel, and have shown that their methods perform better than the method in [7], particularly for low- and high-compression ratio scenarios.

The need for approximation is more important for higher-size DCTs, since the computational complexity of the DCT grows nonlinearly. On the other hand, modern video coding standards such as high efficiency video coding (HEVC) [10] use DCTs of larger block sizes (up to $32 \times 32$) in order to achieve a higher compression ratio. But the extension of the design strategy used in H.264/AVC to larger transform sizes, such as 16-point and 32-point, is not possible [11]. Besides, several image processing applications such as tracking [12] and simultaneous compression and encryption [13] require higher DCT sizes. In this context, Cintra has introduced a new class of integer transforms applicable to several block-lengths [14]. Cintra et al. have also proposed a new $16 \times 16$ matrix for approximation of the 16-point DCT, and have validated it experimentally [15]. Recently, two new transforms have been proposed for 8-point DCT approximation: Cintra et al. have proposed a low-complexity 8-point approximate DCT based on integer functions [16], and Potluri et al. have proposed a novel 8-point DCT approximation that requires only 14 additions [17]. On the other hand, Bouguezel et al. have proposed two methods for multiplication-free approximate forms of the DCT. The first method is for length $N = 8$, 16 and 32, and is based on the appropriate extension of the integer DCT [18]. Also, a systematic method for developing a binary version of high-size DCT (BDCT) by using the sequency-ordered Walsh-Hadamard transform (SO-WHT) is proposed in [4]. This transform is a permuted version of the WHT which approximates the DCT very well and maintains all the advantages of the WHT.

A scheme of approximation of the DCT should have the following features:

i) It should have low computational complexity.

ii) It should have low error energy in order to provide compression performance close to the exact DCT, and preferably should be orthogonal.

iii) It should work for higher lengths of DCT to support modern video coding standards, and other applications like tracking, surveillance, and simultaneous compression and encryption.

But the existing DCT algorithms do not provide the best of all the above three requirements. Some of the existing methods are deficient in terms of scalability [18], generalization for higher sizes [15], and orthogonality [14]. We intend to maintain orthogonality in the approximate DCT for two reasons. Firstly, if the transform is orthogonal, we can always find its inverse, and the kernel matrix of the inverse transform is obtained by just transposing the kernel matrix of the forward transform. This feature of the inverse transform could be used to compute the forward and inverse DCT by similar computing structures. Moreover, in the case of orthogonal transforms, similar fast algorithms are applicable to both forward and inverse transforms [19], [20].

In this paper, we propose an algorithm to derive approximate forms of DCTs which satisfy all the three features. We obtain the proposed approximate form of the DCT by recursive decomposition of the sparse DCT matrix. It is observed that the proposed algorithm involves less arithmetic complexity than the existing DCT approximation algorithms. The proposed approximate forms of DCTs of different lengths are orthogonal, and result in lower error energy compared to the existing algorithms for DCT approximation. The decomposition process allows generalization of the proposed transform for higher-size DCTs. Interestingly, the proposed algorithm is easily scalable for hardware as well as software implementation of DCTs of higher lengths, and it can make use of the best of the existing approximations of the 8-point DCT. Based on the proposed algorithm, we have proposed a fully scalable, reconfigurable, and parallel architecture for approximate DCT computation. One uniquely interesting feature of the proposed design is that the structure for the computation of the 32-point DCT could be configured for parallel computation of two 16-point DCTs or four 8-point DCTs. The proposed algorithm is found to be better than the existing methods in terms of energy compaction and hardware complexity, as well.

The remainder of this paper is organized as follows. In Section II, we derive the proposed algorithm for the generation of kernel matrices for the approximate DCT. In Section III we provide the proposed configurable parallel architecture and discuss the performance evaluation of the proposed architecture in terms of hardware complexity. In Section IV, the application of the proposed method in image and video compression, and compression performances are discussed. Finally, conclusions are presented in Section V.

II. PROPOSED DCT APPROXIMATION

The elements of the $N$-point DCT matrix $C_N$ are given by:

$$c(i,j) = \epsilon_i \sqrt{\frac{2}{N}} \cos\left(\frac{(2j+1)\,i\,\pi}{2N}\right) \qquad (1)$$

where $0 \le i, j \le N-1$, $\epsilon_0 = 1/\sqrt{2}$, and $\epsilon_i = 1$ for $i > 0$. The DCT given by (1) is referred to as exact DCT in order to distinguish it from approximated forms of DCT. For $0 \le i \le N/2-1$ and $0 \le j \le N/2-1$, for any even value of $N$ we can find that

$$c(2i,j) = \epsilon_{2i} \sqrt{\frac{2}{N}} \cos\left(\frac{(2j+1)(2i)\,\pi}{2N}\right) \qquad (2)$$

Since $\epsilon_{2i} = \epsilon_i$ and $\sqrt{2/N} = \frac{1}{\sqrt{2}}\sqrt{\frac{2}{N/2}}$, (2) can be rewritten as:

$$c(2i,j) = \frac{1}{\sqrt{2}}\,\epsilon_i \sqrt{\frac{2}{N/2}} \cos\left(\frac{(2j+1)\,i\,\pi}{2(N/2)}\right) \qquad (3)$$

Hence, the cosine transform kernel on the right-hand side of (3) corresponds to the $N/2$-point DCT and its elements can be assumed to be $c_{N/2}(i,j)$, for $0 \le i, j \le N/2-1$. Therefore, up to the factor $1/\sqrt{2}$, the first $N/2$ elements of the even rows of the DCT matrix of size $N \times N$ correspond to the $N/2$-point DCT matrix. Accordingly, the recursive decomposition of $C_N$ can be performed as detailed in (4)–(8). Using the even/odd symmetries of its row vectors, the DCT matrix $C_N$ can be represented by the following matrix product

$$C_N = \frac{1}{\sqrt{2}}\, M_N^{per}\, T_N\, M_N^{add} \qquad (4)$$

where $T_N$ is a block sparse matrix expressed by:

$$T_N = \begin{bmatrix} C_{N/2} & 0_{N/2} \\ 0_{N/2} & S_{N/2} \end{bmatrix} \qquad (5)$$

where $0_{N/2}$ is the $(N/2 \times N/2)$ zero matrix. Block sub-matrix $S_{N/2}$ consists of the odd rows of the first $N/2$ columns of $\sqrt{2}\,C_N$. $M_N^{per}$ is a permutation matrix expressed by:

$$M_N^{per} = \begin{bmatrix} P_{N-1,N/2} & 0_{1,N/2} \\ 0_{1,N/2} & P_{N-1,N/2} \end{bmatrix} \qquad (6)$$

where $0_{1,N/2}$ is a row of $N/2$ zeros and $P_{N-1,N/2}$ is a $(N-1) \times (N/2)$ matrix defined by its row vectors as:

$$P^{(i)}_{N-1,N/2} = \begin{cases} 0_{1,N/2}, & i = 1, 3, \ldots, N-3 \\ I^{(i/2)}_{N/2}, & i = 0, 2, \ldots, N-2 \end{cases} \qquad (7)$$

where $I^{(i/2)}_{N/2}$ is the $(i/2)$th row vector of the $(N/2 \times N/2)$ identity matrix. In (6), each block column has $N$ rows: the left one stacks $P_{N-1,N/2}$ over $0_{1,N/2}$, and the right one stacks $0_{1,N/2}$ over $P_{N-1,N/2}$. Finally, the last matrix in (4), $M_N^{add}$, is defined by:

$$M_N^{add} = \begin{bmatrix} I_{N/2} & J_{N/2} \\ I_{N/2} & -J_{N/2} \end{bmatrix} \qquad (8)$$

where $J_{N/2}$ is an $(N/2 \times N/2)$ matrix having all ones on the anti-diagonal and zeros elsewhere.

To reduce the computational complexity of the DCT, the computational cost of the matrices presented in (4) is required to be assessed. Since $M_N^{per}$ does not involve any arithmetic or logic operation, and $M_N^{add}$ requires $N/2$ additions and $N/2$ subtractions, they contribute very little to the total arithmetic complexity and cannot be reduced further. Therefore, for reducing the computational complexity of the $N$-point DCT, we need to approximate $T_N$ in (5). Let $\hat{C}_{N/2}$ and $\hat{S}_{N/2}$ denote the approximation matrices of $C_{N/2}$ and $S_{N/2}$, respectively. To find these approximated submatrices we take 8 as the smallest size of DCT matrix at which to terminate the approximation procedure, since the 4-point DCT and 2-point DCT can be implemented by adders only. Consequently, a good approximation of $C_N$, where $N$ is an integral power of two, for $N \ge 8$, comes down to proper approximations of $C_8$ and $S_8$. For approximation of $C_8$ we can choose the 8-point DCT given in [6], since that presents the best trade-off between the number of required arithmetic operators and the quality of the reconstructed image. The trade-off analysis given in [6] shows that approximating $C_8$ by $\hat{C}_8 = \lfloor 2C_8 \rceil$, where $\lfloor \cdot \rceil$ denotes the rounding-off operation, outperforms the current state-of-the-art of 8-point approximation methods.

When we closely look at (4) and (5), we note that $C_{N/2}$ operates on sums of pixel pairs while $S_{N/2}$ operates on differences of the same pixel pairs. Therefore, if we replace $\hat{S}_{N/2}$ by $\hat{C}_{N/2}$, we shall have two main advantages. Firstly, we shall have good compression performance due to the efficiency of $\hat{C}_{N/2}$, and secondly the implementation will be much simpler, scalable and reconfigurable. For approximation of $S_8$ we have investigated two other low-complexity alternatives, and in the following we discuss three possible options of approximation of $S_8$:

i) The first one is to approximate $S_8$ by a null matrix, which implies that all odd-indexed DCT coefficients are assumed to be zero. The transform obtained by this approximation is far from the exact values of the odd-indexed DCT coefficients, which then do not contain any information.

ii) The second solution is obtained by approximating $S_8$ by an $8 \times 8$ matrix where each row contains one 1 and all other elements are zeros. Here, the elements equal to 1 correspond to the maximum of the elements of the exact DCT in each row. The approximate transform in this case is closer to the exact DCT than the solution obtained by the null matrix.

iii) The third solution consists of approximation of $S_8$ by $\hat{C}_8$. Since $C_8$ as well as $S_8$ are submatrices of $C_{16}$ and operate on matrices generated by sums and differences of pixel pairs at a distance of 8, approximation of $S_8$ by $\hat{C}_8$ has attractive computational properties: regularity of the signal-flow graph, orthogonality since $\hat{C}_8$ is orthogonalizable, and good compression efficiency, other than scalability and scope for reconfigurable implementation.
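The exact factorization in (4)–(8) can be checked numerically before any approximation is introduced. The following is a minimal sketch in plain Python, not the authors' code: the helper names (`dct_matrix`, `m_add`, `m_per`, `t_matrix`, `decomposition_error`) are illustrative, and the row order of $M_N^{add}$ (sums first, then differences of the same pairs in the same order) is assumed so that $S_{N/2}$ matches its definition as the odd rows of the first $N/2$ columns of $\sqrt{2}\,C_N$.

```python
import math


def dct_matrix(n):
    """Exact n-point DCT matrix of Eq. (1)."""
    return [[(math.sqrt(0.5) if i == 0 else 1.0) * math.sqrt(2.0 / n)
             * math.cos((2 * j + 1) * i * math.pi / (2 * n))
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def m_add(n):
    """Input adder matrix of Eq. (8): pair sums on top, pair differences below."""
    h = n // 2
    m = [[0] * n for _ in range(n)]
    for j in range(h):
        m[j][j] = 1
        m[j][n - 1 - j] = 1          # x(j) + x(n-1-j)
        m[h + j][j] = 1
        m[h + j][n - 1 - j] = -1     # x(j) - x(n-1-j)
    return m


def m_per(n):
    """Output permutation of Eqs. (6)-(7): interleave the two half-size outputs."""
    h = n // 2
    m = [[0] * n for _ in range(n)]
    for k in range(h):
        m[2 * k][k] = 1              # even-indexed output <- k-th output of C_{n/2}
        m[2 * k + 1][h + k] = 1      # odd-indexed output  <- k-th output of S_{n/2}
    return m


def t_matrix(n):
    """Block-sparse matrix of Eq. (5), built from the exact DCT."""
    h = n // 2
    c_half, c_full = dct_matrix(h), dct_matrix(n)
    t = [[0.0] * n for _ in range(n)]
    for r in range(h):
        for c in range(h):
            t[r][c] = c_half[r][c]
            # S_{n/2}: odd rows of the first n/2 columns of sqrt(2) * C_n
            t[h + r][h + c] = math.sqrt(2.0) * c_full[2 * r + 1][c]
    return t


def decomposition_error(n):
    """Largest entry-wise gap between C_n and the right-hand side of Eq. (4)."""
    prod = matmul(m_per(n), matmul(t_matrix(n), m_add(n)))
    exact = dct_matrix(n)
    return max(abs(prod[i][j] / math.sqrt(2.0) - exact[i][j])
               for i in range(n) for j in range(n))
```

Calling `decomposition_error(16)` should return a value at the level of floating-point round-off, which confirms that all of the approximation error of the proposed method comes from replacing the two blocks of $T_N$, not from the factorization itself.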

We have not done an exhaustive search of all possible solutions, so there could be other possible low-complexity implementations of $\hat{S}_8$. But other solutions are not expected to have the potential for reconfigurability that we achieve by replacement of $S_8$ by $\hat{C}_8$. Based on this third possible approximation of $S_8$, we have obtained the proposed approximation of $C_N$ as:

$$\hat{C}_N = \frac{1}{\sqrt{2}}\, M_N^{per} \begin{bmatrix} \hat{C}_{N/2} & 0_{N/2} \\ 0_{N/2} & \hat{C}_{N/2} \end{bmatrix} M_N^{add} \qquad (9)$$

As stated before, matrix $\hat{C}_N$ is orthogonalizable. Indeed, for each $\hat{C}_N$ we can calculate $D_N$ given by:

$$D_N = \sqrt{\left(\hat{C}_N \times \hat{C}_N^{t}\right)^{-1}} \qquad (10)$$

where $(\cdot)^t$ denotes matrix transposition. For data compression, we can use $C_N^{orth} = D_N \times \hat{C}_N$ instead of $\hat{C}_N$, since $(C_N^{orth})^{-1} = (C_N^{orth})^{t}$. Since $D_N$ is a diagonal matrix, it can be integrated into the scaling in the quantization process (without additional computational complexity). Therefore, as adopted in [4]–[8], the computational cost of $C_N^{orth}$ is equal to that of $\hat{C}_N$. Moreover, the term $1/\sqrt{2}$ in (9) can be integrated in the quantization step in order to have a multiplierless architecture. The procedure for the generation of the proposed orthogonal approximated DCT is stated in Algorithm 1.
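The orthogonalization step in (10) can be illustrated on the 8-point kernel. The sketch below assumes that $\hat{C}_8$ is obtained by rounding $2C_8$ entry by entry, which yields entries in $\{0, \pm 1\}$; the function names are illustrative only. It checks that $\hat{C}_8 \hat{C}_8^t$ is diagonal and then forms the diagonal of $D_8$.

```python
import math


def rounded_dct8():
    """8-point kernel with entries 0, +1, -1 (assumed here to be round(2 * C_8))."""
    rows = []
    for i in range(8):
        eps = math.sqrt(0.5) if i == 0 else 1.0
        # sqrt(2/8) = 0.5, so 2 * c(i, j) = eps * cos((2j+1) i pi / 16)
        rows.append([round(2 * eps * 0.5 * math.cos((2 * j + 1) * i * math.pi / 16))
                     for j in range(8)])
    return rows


def gram(m):
    """M x M^t for a matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


def orthogonalizer(m):
    """Diagonal of D in Eq. (10); raises if M x M^t is not diagonal."""
    g = gram(m)
    n = len(m)
    if any(g[i][j] != 0 for i in range(n) for j in range(n) if i != j):
        raise ValueError("rows are not mutually orthogonal")
    return [1.0 / math.sqrt(g[i][i]) for i in range(n)]


def orthogonal_kernel(m):
    """C_orth = D x M, whose transpose is its inverse."""
    d = orthogonalizer(m)
    return [[d_i * v for v in row] for d_i, row in zip(d, m)]
```

Because $D_N$ only rescales each output coefficient, a codec can fold these factors into its quantization step sizes, which is why the transform itself stays multiplierless.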

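Fig. 1. Signal flow graph (SFG) of $\hat{C}_8$. Dashed arrows represent multiplications by $-1$.

Algorithm 1 Generation of proposed DCT matrix

function PROPOSEDDCT($N$)    ▷ $N$ is a power of 2, $N \ge 8$
    $n \leftarrow N/8$    ▷ $n$ is the number of 8-sample blocks
    $m \leftarrow 1$
    while $m < n$ do
        $m \leftarrow 2m$
        compute $M_{8m}^{per}$, $M_{8m}^{add}$    ▷ Eq (6), (8)
        $\hat{C}_{8m} \leftarrow \frac{1}{\sqrt{2}} M_{8m}^{per} \begin{bmatrix} \hat{C}_{4m} & 0_{4m} \\ 0_{4m} & \hat{C}_{4m} \end{bmatrix} M_{8m}^{add}$    ▷ Eq (9)
    end while
    $D_N \leftarrow \sqrt{(\hat{C}_N \times \hat{C}_N^{t})^{-1}}$    ▷ Eq (10)
    return $\hat{C}_N$, $D_N$
end function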
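Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: the 8-point kernel `C8_HAT` is assumed to be the rounded matrix with entries in $\{0, \pm 1\}$, the function name and return convention are hypothetical, and the factor $1/\sqrt{2}$ of (9) is kept out of the integer kernel and only counted, since the text folds it into quantization. Multiplying by $M^{add}$ and $M^{per}$ is done implicitly: each row of the half-size kernel produces one even-indexed row (row followed by its mirror) and one odd-indexed row (row followed by its negated mirror).

```python
import math

# 8-point kernel with entries in {0, +1, -1}; assumed form of the matrix from [6].
C8_HAT = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, -1, -1, -1],
    [1, 0, 0, -1, -1, 0, 0, 1],
    [1, 0, -1, -1, 1, 1, 0, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, -1, 0, 1, -1, 0, 1, -1],
    [0, -1, 1, 0, 0, 1, -1, 0],
    [0, -1, 1, -1, 1, -1, 1, 0],
]


def proposed_dct(n):
    """Return (kernel, levels, d).

    kernel: integer matrix such that C_hat_n = kernel / sqrt(2) ** levels
    levels: number of times Eq. (9) was applied
    d:      diagonal of D_n from Eq. (10)
    """
    if n < 8 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 8")
    kernel, size, levels = [row[:] for row in C8_HAT], 8, 0
    while size < n:
        doubled = []
        for row in kernel:
            mirrored = row[::-1]
            doubled.append(row + mirrored)                 # even-indexed row: acts on sums
            doubled.append(row + [-v for v in mirrored])   # odd-indexed row: acts on differences
        kernel, size, levels = doubled, 2 * size, levels + 1
    scale = 2.0 ** levels                                  # (sqrt(2) ** levels) ** 2
    d = [math.sqrt(scale / sum(v * v for v in row)) for row in kernel]
    return kernel, levels, d


def orthonormal_rows(n):
    """Rows of D_n x C_hat_n, which should form an orthonormal basis."""
    kernel, levels, d = proposed_dct(n)
    s = math.sqrt(2.0) ** levels
    return [[d_i * v / s for v in row] for d_i, row in zip(d, kernel)]
```

For example, `proposed_dct(16)` returns a $16 \times 16$ kernel whose second row is eight ones followed by eight minus ones, a coarse approximation of the lowest-frequency cosine.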

III. SCALABLE AND RECONFIGURABLE ARCHITECTURE FOR DCT COMPUTATION

In this section, we discuss the proposed scalable architecture for the computation of the approximate DCT of $N = 16$ and 32. We have derived the theoretical estimate of its hardware complexity and discuss the reconfiguration scheme.

A. Proposed Scalable Design

The basic computational block of the algorithm for the proposed DCT approximation, $\hat{C}_8$, is given in [6]. The block diagram of the computation of the DCT based on $\hat{C}_8$ is shown in Fig. 1. For a given input sequence of $N$ samples, the approximate DCT coefficients are obtained by multiplying $\hat{C}_N$ with the input vector. An example of the block diagram of $\hat{C}_{16}$ is illustrated in Fig. 2, where two units for the computation of $\hat{C}_8$ are used along with an input adder unit and an output permutation unit. The functions of these two blocks are shown respectively in (8) and (6). Note that the structure of the 16-point DCT of Fig. 2 could be extended to obtain DCTs of higher sizes. For example, the structure for the computation of the 32-point DCT could be obtained by combining a pair of 16-point DCTs with an input adder block and an output permutation block.
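The structure described above maps directly onto a recursive routine: an input adder stage, two half-size transforms, and an output interleaving stage. The sketch below is a software model under stated assumptions, not the hardware design. `C8_HAT` is the assumed $\{0, \pm 1\}$ kernel, `approx_dct8` applies it as a plain matrix-vector product instead of the 22-addition flow graph of Fig. 1, and all scaling factors ($1/\sqrt{2}$ and $D_N$) are left to the quantizer.

```python
# 8-point kernel with entries in {0, +1, -1}; assumed form of the matrix from [6].
C8_HAT = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, -1, -1, -1],
    [1, 0, 0, -1, -1, 0, 0, 1],
    [1, 0, -1, -1, 1, 1, 0, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, -1, 0, 1, -1, 0, 1, -1],
    [0, -1, 1, 0, 0, 1, -1, 0],
    [0, -1, 1, -1, 1, -1, 1, 0],
]


def approx_dct8(x):
    """Unscaled 8-point approximate DCT (only sign changes and additions are needed)."""
    return [sum(c * v for c, v in zip(row, x)) for row in C8_HAT]


def approx_dct(x):
    """Unscaled approximate DCT of a sequence whose length is 8, 16, 32, ..."""
    n = len(x)
    if n == 8:
        return approx_dct8(x)
    if n < 8 or n % 2:
        raise ValueError("length must be 8 times a power of two")
    h = n // 2
    sums = [x[j] + x[n - 1 - j] for j in range(h)]    # input adder unit, top half
    diffs = [x[j] - x[n - 1 - j] for j in range(h)]   # input adder unit, bottom half
    out = [0] * n
    out[0::2] = approx_dct(sums)                      # output permutation: even indices
    out[1::2] = approx_dct(diffs)                     # output permutation: odd indices
    return out
```

A constant input concentrates all energy in the first coefficient, as expected of a DCT-like transform: `approx_dct([1] * 16)` gives 16 followed by fifteen zeros.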

B. Complexity Comparison

To assess the computational complexity of the proposed $N$-point approximate DCT, we need to determine the computational cost of the matrices quoted in (9). As shown in Fig. 1, the approximate 8-point DCT involves 22 additions. Since $M_N^{per}$ has no computational cost and $M_N^{add}$ requires $N$ additions for the $N$-point DCT, the overall arithmetic complexity of the 16-point, 32-point, and 64-point DCT approximations is 60, 152, and 368 additions, respectively. More generally, the arithmetic complexity of the $N$-point DCT is equal to $N(\log_2 N - 1/4)$ additions. Moreover, since the structures for the computation of DCTs of different lengths are regular and scalable, the computational time for $N$ DCT coefficients can be found to be $\log_2(N)\,T_A$, where $T_A$ is the addition-time. The number of arithmetic operations involved in the proposed DCT approximation of different lengths and those of the existing competing approximations are shown in Table I. It can be found that the proposed method requires the lowest number of additions, and does not require any shift operations. Note that a shift operation does not involve any combinational components, and requires only rewiring during hardware implementation. But it has an indirect contribution to the hardware complexity, since shift-add operations lead to an increase in bit-width, which leads to higher hardware complexity of the arithmetic units which follow the shift-add operation. Also, we note that all considered approximation methods involve significantly less computational complexity than the exact DCT algorithms. According to the Loeffler algorithm [2], the exact DCT computation requires 29, 81, 209, and 513 additions along with 11, 31, 79, and 191 multiplications, respectively, for 8-, 16-, 32-, and 64-point DCTs.

Fig. 2. Block diagram of the proposed DCT for $N = 16$ ($\hat{C}_{16}$).

TABLE I. ASSESSMENT OF REQUIRED ARITHMETIC OPERATIONS FOR SEVERAL APPROXIMATION ALGORITHMS

TABLE II. SYNTHESIS RESULTS OF PIPELINED AND NON-PIPELINED DESIGNS FOR SEVERAL APPROXIMATION ALGORITHMS

Fig. 3. Proposed reconfigurable architecture for approximate DCT of lengths $N = 8$ and 16.

Pipelined and non-pipelined designs of different methods are developed, synthesized and validated using an integrated logic analyzer. The validation is carried out by using the Digilent EB of Spartan6-LX45. We have used 8-bit inputs, and we have allowed the increase of output size (without any truncations). For the 8-point transform of Fig. 1, we have 11-bit and 10-bit outputs. The pipelined designs are obtained by insertion of registers in the input and output stages along with registers after each adder stage, while no pipeline registers are used within the non-pipelined designs. The synthesis results obtained from the XST synthesizer are presented in Table II. It shows that pipelined designs provide significantly higher maximum operating frequency (MOF). It also shows that the proposed design involves nearly 7%, 6%, and 5% less area compared to the BDCT design for $N$ equal to 16, 32, and 64, respectively. Note that both pipelined and non-pipelined designs involve the same number of LUTs, since pipeline registers do not require additional LUTs. For the 8-point DCT, we have used the approximation proposed in [6], which forms the basic computing block of the proposed method. Also, we underline that all designs have the same critical path, and accordingly have the same MOFs. Most importantly, the proposed designs are reusable for different transform lengths.
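The addition counts quoted in this subsection follow from a simple recurrence: an $N$-point transform costs two $N/2$-point transforms plus $N$ additions for the input adder unit, starting from 22 additions for $N = 8$. A small sketch (function names are illustrative) reproduces the figures and compares them with the Loeffler counts cited from [2]:

```python
def additions(n):
    """Additions of the n-point approximation: A(8) = 22, A(n) = 2 * A(n/2) + n."""
    if n == 8:
        return 22
    if n < 8 or n % 2:
        raise ValueError("n must be 8 times a power of two")
    return 2 * additions(n // 2) + n


def additions_closed_form(n):
    """Closed form N * (log2(N) - 1/4) for n a power of two, n >= 8."""
    log2_n = n.bit_length() - 1
    return n * log2_n - n // 4


# (additions, multiplications) of the exact DCT by the Loeffler algorithm, as quoted in the text.
LOEFFLER = {8: (29, 11), 16: (81, 31), 32: (209, 79), 64: (513, 191)}


def savings(n):
    """Additions saved and multiplications removed relative to the exact DCT."""
    exact_add, exact_mul = LOEFFLER[n]
    return exact_add - additions(n), exact_mul
```

For $N = 32$, for instance, `savings(32)` reports 57 fewer additions and all 79 multiplications removed, before counting any cost that moves into the quantizer.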

C. Proposed Reconfiguration Scheme

As specified in the recently adopted HEVC [10], DCTs of different lengths such as $N = 8$, 16, 32 are required to be used in video coding applications. Therefore, a given DCT architecture should be potentially reused for DCTs of different lengths instead of using separate structures for different lengths. We propose here such reconfigurable DCT structures which could be reused for the computation of DCTs of different lengths. The reconfigurable architecture for the implementation of the approximated 16-point DCT is shown in Fig. 3. It consists of three computing units, namely two 8-point approximated DCT units and a 16-point input adder unit that generates the two 8-sample intermediate sequences (the sums and the differences of the input pairs). The input to the first 8-point DCT approximation unit is fed through 8 MUXes that select either an output sequence of the input adder unit or eight input samples directly, depending on whether it is used for 16-point DCT calculation or 8-point DCT calculation. Similarly, the input to the second 8-point DCT unit (Fig. 3) is fed through 8 MUXes that select either the other output sequence of the input adder unit or the other eight input samples, depending on whether it is used for 16-point DCT calculation or 8-point DCT calculation. On the other hand, the output permutation unit uses 14 MUXes to select and re-order the output depending on the size of the selected DCT. A select signal is used as the control input of the MUXes to select inputs and to perform permutation according to the size of the DCT to be computed. Specifically, one value of this signal enables the computation of the 16-point DCT and the other enables the computation of a pair of 8-point DCTs in parallel. Consequently, the architecture of Fig. 3 allows the calculation of a 16-point DCT or two 8-point DCTs in parallel.

Fig. 4. Proposed reconfigurable architecture for approximate DCT of lengths $N = 8$, 16 and 32.

A reconfigurable design for the computation of 32-, 16-, and 8-point DCTs is presented in Fig. 4. It performs the calculation of a 32-point DCT, or two 16-point DCTs in parallel, or four 8-point DCTs in parallel. The architecture is composed of a 32-point input adder unit, two 16-point input adder units, and four 8-point DCT units. The reconfigurability is achieved by three control blocks composed of 64 2:1 MUXes along with 30 3:1 MUXes. The first control block decides whether the DCT size is 32 or lower: depending on its select signal, the selection of input data is done for the 32-point DCT or, otherwise, for the DCTs of lower lengths. The second control block decides whether the DCT size is higher than 8: depending on its select signal, the length of the DCT to be computed is higher than 8 (DCT length of 16 or 32) or, otherwise, the length is 8. The third control block is used for the output permutation unit, which re-orders the output depending on the size of the selected DCT. The two select signals are used as control signals to the 3:1 MUXes. Specifically, for the pair of select signals equal to {00}, {01} or {11}, the 32 outputs correspond to four 8-point parallel DCTs, two parallel 16-point DCTs, or a 32-point DCT, respectively. Note that the throughput is 32 DCT coefficients per cycle irrespective of the desired transform size.
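A behavioural model can make the three control blocks concrete. The sketch below is a hypothetical software analogue of Fig. 4, not a description of the actual circuit: the signal names `sel32` and `sel16`, their polarity, and their ordering in the {00}, {01}, {11} codes are assumptions, `C8_HAT` is the assumed $\{0, \pm 1\}$ kernel, and scaling is omitted. The four 8-point units are shared by all modes; only the input selection and the output ordering change.

```python
# 8-point kernel with entries in {0, +1, -1}; assumed form of the matrix from [6].
C8_HAT = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, -1, -1, -1],
    [1, 0, 0, -1, -1, 0, 0, 1],
    [1, 0, -1, -1, 1, 1, 0, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, -1, 0, 1, -1, 0, 1, -1],
    [0, -1, 1, 0, 0, 1, -1, 0],
    [0, -1, 1, -1, 1, -1, 1, 0],
]


def approx_dct8(x):
    """One shared 8-point unit (unscaled)."""
    return [sum(c * v for c, v in zip(row, x)) for row in C8_HAT]


def input_adder(x):
    """Sums and differences of the pairs x(j), x(n-1-j)."""
    n, h = len(x), len(x) // 2
    return ([x[j] + x[n - 1 - j] for j in range(h)],
            [x[j] - x[n - 1 - j] for j in range(h)])


def interleave(even, odd):
    out = [0] * (len(even) + len(odd))
    out[0::2], out[1::2] = even, odd
    return out


def reconfigurable_dct(x, sel32, sel16):
    """(sel32, sel16) = (0,0): four 8-point, (0,1): two 16-point, (1,1): one 32-point.

    sel32 and sel16 are hypothetical names for the two select signals.
    """
    if len(x) != 32:
        raise ValueError("expected 32 input samples")
    if sel32 and not sel16:
        raise ValueError("unused select combination")
    # Control block 1: 32-point input adder or bypass.
    halves = list(input_adder(x)) if sel32 else [x[:16], x[16:]]
    # Control block 2: 16-point input adders or bypass.
    quarters = []
    for half in halves:
        quarters.extend(input_adder(half) if sel16 else (half[:8], half[8:]))
    y = [approx_dct8(q) for q in quarters]
    # Control block 3: output permutation.
    if not sel16:
        return y[0] + y[1] + y[2] + y[3]
    low, high = interleave(y[0], y[1]), interleave(y[2], y[3])
    return interleave(low, high) if sel32 else low + high
```

In every mode the function consumes 32 samples and produces 32 coefficients per call, mirroring the constant throughput noted above.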

IV. EXPERIMENTAL VALIDATION

We discuss here the performance of the proposed DCT approximation algorithm in terms of energy compaction characteristics compared to the DCT approximations suggested in [4], [14], [15] and [18]. Note that the method in [15] was proposed for a fixed size $N = 16$. The method in [18] was defined for $N = 8$, 16 and 32, but it is not suitable to be extended for sizes higher than 32. Also, we have not compared with the SDCT method [8]. Although the SDCT method is the first one related to the approximation of DCT algorithms, its reconstruction capabilities are significantly lower than those of the others in [4], [14], [15] and [18].

A. Application for Image Compression

Image compression is a typical application to validate the en-ergy compaction capability of DCT.We have used the compres-

Page 6: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR …kresttechnology.com/krest-academic-projects/krest-major... · 2015. 7. 22. · JRIDI et al.: GENERALIZED ALGORITHM AND RECONFIGURABLE

454 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 2, FEBRUARY 2015

Fig. 5. Average PSNR: exact (solid line), proposed ( ), BAS [8] ( ), Cintra [14] ( ), BDCT [4] ( ) and BC [15] ( ) DCT. BAS transform[18] is defined for maximum size of .

sion scheme of [8] to compare the performance of the competing DCT approximation schemes. According to this method, in each N × N block of the 2D transform only a fixed number of coefficients is retained, following the zigzag sequence, to reconstruct the image (all the other coefficients are set to zero). For 8 × 8 blocks, the number of retained coefficients is varied from 1 to 32, which corresponds to a compression ratio in the range 50% to 98.43%. In order to keep the same compression ratio for different DCT sizes, the lower and upper limits of this range are multiplied by 4 when the transform length is doubled. Accordingly, the number of retained coefficients is varied in the range [4,128], [16,512], and [64,2048], respectively, for block sizes (16 × 16), (32 × 32), and (64 × 64), so that the compression ratio varies from 50% to 98.43% equivalently for all block sizes. In this context, we have introduced the notion of a normalized number of retained coefficients, which varies in the range [1,32] for all block sizes. For a given block size, the normalized number is the ratio of the number of retained coefficients to the number of times the block is larger than 8 × 8. Accordingly, the normalized number remains the same for all block sizes. The idea behind this notation is to have the same compression ratio for different block sizes for the same normalized number of retained coefficients.

In Fig. 5 we have plotted the average PSNR of the reconstructed image from a set of forty 512 × 512 8-bit greyscale images (Miscellaneous and Aerials) obtained from a standard database [21]. For comparison, we have taken the method in [14] with the zeroth-order approximation, which leads to a lower hardware complexity. Note that for N = 8, the DCT matrix given by [6] is the same one used in [14]. Moreover, our algorithm is based on the 8-point DCT approximation given in [6]. For that reason, we find the same PSNR performance for the proposed approximation as for that of [6]. It is also shown that for high compression ratio , the method in [6] presents a higher PSNR, and for , the BAS method [18] outperforms all 8-point approximations. For , the proposed method often presents better results, especially for normalized . Finally, for and the proposed method has a higher PSNR at all compression ratios. All these results are obtained by Matlab simulations and are in conformity with those presented in [4], where a comparison is made between the methods in [4] and [18].

We have also considered the structural similarity (SSIM) index [22], which is an improved version of the universal quality index (UQI) [23] and one of the best objective metrics for image distortion. In Fig. 6 we have shown the plot of mean SSIM values obtained for different approximation methods relative to the exact DCT. We have used different compression ratios for and for the same images as in the PSNR assessment. It is found that the proposed algorithm outperforms existing methods for all and for compression ratios higher than 84.37%. For lower compression ratios, the SSIM values are similar. Finally, subjective evaluation is presented in Fig. 7 for normalized and for . As expected, it is found that the reconstructed images using the proposed algorithm have fewer blocking artifacts than those of the methods in [4], [8] and [14], while the method of [18] presents a reconstructed image closely similar to that of the proposed method.
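The retained-coefficient bookkeeping and the PSNR measurement described above can be sketched as follows. This is a minimal illustration and not the authors' Matlab code: all function names are hypothetical, the zigzag order is assumed to be the usual JPEG-style scan, and the exact orthonormal DCT-II is used as a stand-in transform (any orthogonal approximation could be passed instead, since the inverse is taken as the transpose).

```python
import numpy as np

def dct_matrix(n):
    """Exact orthonormal DCT-II matrix (stand-in for an approximate transform)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def retained_count(n, normalized):
    """Coefficients kept per n x n block for a count normalized to 8 x 8 blocks."""
    return normalized * (n // 8) ** 2

def compression_ratio(n, retained):
    """Percentage of coefficients discarded in each n x n block."""
    return 100.0 * (1.0 - retained / (n * n))

def zigzag_mask(n, retained):
    """Boolean mask selecting the first `retained` positions of a zigzag scan."""
    def key(p):
        s = p[0] + p[1]
        # Odd anti-diagonals run downwards, even ones run upwards (assumed scan).
        return (s, p[0] if s % 2 else p[1])
    order = sorted(((r, c) for r in range(n) for c in range(n)), key=key)
    mask = np.zeros((n, n), dtype=bool)
    for r, c in order[:retained]:
        mask[r, c] = True
    return mask

def truncate_and_reconstruct(image, transform, retained):
    """Block-wise 2D transform, zigzag truncation, and inverse transform.

    Assumes `transform` is orthogonal and the image sides are multiples of
    the block size.
    """
    n = transform.shape[0]
    mask = zigzag_mask(n, retained)
    out = np.empty_like(image, dtype=float)
    for r in range(0, image.shape[0], n):
        for c in range(0, image.shape[1], n):
            block = image[r:r + n, c:c + n].astype(float)
            coeffs = transform @ block @ transform.T
            out[r:r + n, c:c + n] = transform.T @ (coeffs * mask) @ transform
    return out

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

With these definitions, a normalized count of 1 corresponds to 4, 16, and 64 retained coefficients for block sizes 16, 32, and 64, and a normalized count of 32 always gives a 50% compression ratio, in line with the ranges quoted in the text.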

B. Application for Video Compression

To evaluate the performance of the proposed algorithm for video coding we have integrated the proposed approximated DCT into the HEVC reference software HM12.1, in the same way as has been done in [17]. Moreover, we have integrated the existing methods to have a comparison in real-time video coding. One key feature of HEVC is that it involves DCTs of different sizes such as 4, 8, 16, and 32. Therefore, we select the BDCT in [4] and BAS [18] methods since orthogonal approximate DCTs for transform sizes up to 32 × 32 are defined in [4] and BAS [18]. All these competing transforms are implemented in the encoder to produce an HEVC-compliant bit-stream. The inverse DCT defined in the HEVC final draft international standard [24] is used in the decoder.

Fig. 6. Mean SSIM relative to exact DCT of proposed, BAS [18], Cintra [14], BDCT [4], and BC [15] DCT. The BAS transform [18] is defined for a maximum size of .

Fig. 7. Reconstructed images using several algorithms for and normalized .

We performed the tests using JCT-VC HM 12.1 for the main profile and all-intra configuration over Qp values of 22, 27, 32, and 37. We have taken the PeopleOnStreet 2560 × 1600, CrowdRun 1920 × 1080, and OldTownCross 1280 × 720 test sequences for comparison. All simulations are performed on desktop PCs with the Intel Core2 processor family with 2 GHz of processor frequency and 1 Gbit of RAM. We have measured the bit-rate difference as the percentage difference between the bit-rate of each of the methods (BDCT [4], BAS [18], and the proposed method) and the bit-rate of the anchor at the same value of PSNR. The values shown in Table III are calculated as:


TABLE III
BRD PERCENTAGE OF APPROXIMATED TRANSFORMS FOR BDCT [4], BAS [18], AND PROPOSED METHOD

where positive values indicate coding loss compared to the anchor. Note that the bitrate is calculated by taking into account the frame rate of the test sequence, the total number of tested frames, and the size of the bitstream. The assessment of the video coding performance of the approximated transforms compared to HM12.1 is summarized in Table III. The approximated DCT algorithms give a coding loss between 3.43% and 6.40% for different test sequences. This coding loss can be explained by the entropy coder used after the DCT. The entropy coder is designed for coding the exact DCT coefficients and not approximated coefficients. The coding loss could be reduced by an appropriate modification of the context models used in the CABAC coder, which we can take up as future work.
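The bit-rate bookkeeping described above can be illustrated with a short sketch. It is only one plausible reading of the text: the function names are hypothetical, the formula is the plain relative difference suggested by the statement that positive values indicate coding loss, and the two bit-rates are assumed to have already been matched at the same PSNR.

```python
def bitrate_kbps(bitstream_bytes, num_frames, frame_rate):
    """Average bit-rate in kbit/s from bitstream size, frame count and frame rate."""
    duration_s = num_frames / frame_rate
    return bitstream_bytes * 8 / duration_s / 1000.0

def bitrate_difference_percent(method_kbps, anchor_kbps):
    """Relative bit-rate difference in percent.

    Positive values mean the method spends more bits than the anchor
    (coding loss); both bit-rates are assumed to be taken at equal PSNR.
    """
    return 100.0 * (method_kbps - anchor_kbps) / anchor_kbps
```

For example, a method that needs 1064 kbit/s where the anchor needs 1000 kbit/s at the same PSNR would be reported as a loss of about 6.4%.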

V. CONCLUSION

In this paper, we have proposed a recursive algorithm to obtain orthogonal approximation of DCT where an approximate DCT of length N could be derived from a pair of DCTs of length (N/2) at the cost of N additions for input preprocessing. The proposed approximated DCT has several advantages, such as regularity, structural simplicity, lower computational complexity, and scalability. Comparison with recently proposed competing methods shows the effectiveness of the proposed approximation in terms of error energy, hardware resource consumption, and compressed image quality. We have also proposed a fully scalable reconfigurable architecture for approximate DCT computation where the computation of a 32-point DCT could be configured for parallel computation of two 16-point DCTs or four 8-point DCTs.
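The recursive idea summarized above can be sketched in a few lines. This is an illustrative assumption about the structure, not the paper's exact matrices: the input is preprocessed by sums and differences of mirrored samples (N additions in total), each half is passed through the same half-length transform, and the two outputs are interleaved. The function names are hypothetical, and the exact 8-point DCT-II is used as a stand-in for the 8-point approximation of [6].

```python
import numpy as np

def dct_matrix(n):
    """Exact orthonormal DCT-II matrix, used here only as a stand-in base kernel."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def recursive_transform(n, base):
    """Build an n-point orthogonal transform from two (n/2)-point transforms."""
    m = base.shape[0]
    if n == m:
        return base
    if n < m or n % 2:
        raise ValueError("n must be the base length times a power of two")
    half = n // 2
    t_half = recursive_transform(half, base)
    eye = np.eye(half)
    flip = np.fliplr(eye)
    zero = np.zeros((half, half))
    # Input preprocessing: sums and differences of mirrored inputs (n additions).
    pre = np.block([[eye, flip], [eye, -flip]])
    # The same half-length transform is applied to both branches.
    core = np.block([[t_half, zero], [zero, t_half]])
    # Interleave: even outputs from the sum branch, odd outputs from the
    # difference branch (assumed ordering).
    perm = np.zeros((n, n))
    perm[np.arange(0, n, 2), np.arange(half)] = 1.0
    perm[np.arange(1, n, 2), np.arange(half, n)] = 1.0
    return perm @ core @ pre / np.sqrt(2.0)
```

Because the preprocessing matrix scaled by 1/sqrt(2), the block-diagonal core, and the permutation are each orthogonal whenever the base kernel is, the resulting transform is orthogonal at every length. The block-diagonal core also hints at why a 32-point datapath can be reused as two 16-point or four 8-point units.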

REFERENCES

[1] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, “NEDA: A low-power high-performance DCT architecture,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, 2006.

[2] C. Loeffler, A. Lightenberg, and G. S. Moschytz, “Practical fast 1-D DCT algorithm with 11 multiplications,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 1989, pp. 988–991.

[3] M. Jridi, P. K. Meher, and A. Alfalou, “Zero-quantised discrete cosine transform coefficients prediction technique for intra-frame video encoding,” IET Image Process., vol. 7, no. 2, pp. 165–173, Mar. 2013.

[4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “Binary discrete cosine and Hartley transforms,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 4, pp. 989–1002, Apr. 2013.

[5] F. M. Bayer and R. J. Cintra, “DCT-like transform for image compression requires 14 additions only,” Electron. Lett., vol. 48, no. 15, pp. 919–921, Jul. 2012.

[6] R. J. Cintra and F. M. Bayer, “A DCT approximation for image compression,” IEEE Signal Process. Lett., vol. 18, no. 10, pp. 579–582, Oct. 2011.

[7] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, “Low-complexity 8 × 8 transform for image compression,” Electron. Lett., vol. 44, no. 21, pp. 1249–1250, Oct. 2008.

[8] T. I. Haweel, “A new square wave transform based on the DCT,” Signal Process., vol. 81, no. 11, pp. 2309–2319, Nov. 2001.

[9] V. Britanak, P. Y. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. London, U.K.: Academic, 2007.

[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

[11] F. Bossen, B. Bross, K. Suhring, and D. Flynn, “HEVC complexity and implementation analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1685–1696, 2012.

[12] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, “Incremental learning of 3D-DCT compact representations for robust visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp. 863–881, Apr. 2013.

[13] A. Alfalou, C. Brosseau, N. Abdallah, and M. Jridi, “Assessing the performance of a method of simultaneous compression and encryption of multiple images and its resistance against various attacks,” Opt. Express, vol. 21, no. 7, pp. 8025–8043, 2013.

[14] R. J. Cintra, “An integer approximation method for discrete sinusoidal transforms,” Circuits, Syst., Signal Process., vol. 30, no. 6, pp. 1481–1501, 2011.

[15] F. M. Bayer, R. J. Cintra, A. Edirisuriya, and A. Madanayake, “A digital hardware fast algorithm and FPGA-based prototype for a novel 16-point approximate DCT for image compression applications,” Meas. Sci. Technol., vol. 23, no. 11, pp. 1–10, 2012.

[16] R. J. Cintra, F. M. Bayer, and C. J. Tablada, “Low-complexity 8-point DCT approximations based on integer functions,” Signal Process., vol. 99, pp. 201–214, 2014.

[17] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, “Improved 8-point approximate DCT for image and video compression requiring only 14 additions,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 6, pp. 1727–1740, Jun. 2014.

[18] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, “A novel transform for image compression,” in Proc. 2010 53rd IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS), pp. 509–512.

[19] K. R. Rao and N. Ahmed, “Orthogonal transforms for digital signal processing,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 1976, vol. 1, pp. 136–140.

[20] Z. Mohd-Yusof, I. Suleiman, and Z. Aspar, “Implementation of two dimensional forward DCT and inverse DCT using FPGA,” in Proc. TENCON 2000, vol. 3, pp. 242–245.

[21] “USC-SIPI image database,” Univ. Southern California, Signal and Image Processing Institute [Online]. Available: http://sipi.usc.edu/database/, 2012

[22] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.

[23] Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE Signal Process. Lett., vol. 9, no. 3, pp. 81–84, 2002.

[24] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and T. Wiegand, “High efficiency video coding (HEVC) text specification draft 10 (for FDIS and consent) associated resources,” in Proc. JCT-VC, Geneva, Switzerland, Jan. 2013, Doc. JCTVC-L1003.

Maher Jridi received the Engineering degree from the Graduate Telecommunication School Sup'Com, Tunisia, in 2003. He received the Ph.D. degree in electronics from the University of Science and Technology, Bordeaux I, France, in 2007 at the IMS laboratory. He worked in digital calibration for interleaved data converters. Currently, he is an Associate Professor at ISEN-Brest, France, and he is a member of the vision group. He has published around 30 journal articles or conference papers. His research interests focus on digital electronics design for VLSI signal/image/video processing and more particularly on algorithm-architecture co-design for resource-constrained systems.


Ayman Alfalou (SM'03) received the Ph.D. degree in telecommunications and signal processing from the French National Telecommunication Graduate Engineering School of Brittany (ENSTB-France) and the University of Rennes (1), France, in 1999. He held a Postdoc position for one year at the French National Telecommunications Graduate Engineering School of Brittany (ENSTB-France) (DGA, French Army), consisting in designing and realizing an optical compact and high rate correlator. Partners: THOMSN-lcr-tco-detexis, ENST of Brittany, and ENSP of Marseille. Since June 2000, he has been a Professor of Telecommunications and Signal Processing at ISEN-Brest (Institut Supérieur de l'Electronique et du Numérique), France. At ISEN, he created the Optical Signal and Image Processing Laboratory (VISION). His research interests are signal processing and image processing, telecommunications, optical systems, optical processing, opto-electronics, laser, and polarization. He has some Ph.D., Master's, and Engineering students (recognition forms, compression, encryption, polarization). He has published over 130 refereed journal articles or conference papers on a wide variety of theoretical and experimental topics. He has organized or co-organized many conferences and special sessions; he was the chairman or a member of scientific committees of many international conferences. In September 2006 he passed the “HDR” degree (Full professor at ISEN-Brest). He is a Senior Member of OSA, Senior Member of SPIE, and elected member of the Institute of Physics.

Pramod Kumar Meher (SM'03) received the B.Sc. (Honours) and M.Sc. degrees in physics, and the Ph.D. degree in science from Sambalpur University, India, in 1976, 1978, and 1996, respectively.

Currently, he is a Senior Research Scientist with Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interest includes design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 200 technical papers to various reputed journals and conference proceedings.

Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011 and 2012, as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—PART II: EXPRESS BRIEFS during 2008 to 2011, and as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—PART I: REGULAR PAPERS during 2012–2013. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.