1064 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 7, JULY 2012

A Block-Based Pass-Parallel SPIHT Algorithm

Yongseok Jin, Member, IEEE, and Hyuk-Jae Lee

Abstract—Set-partitioning in hierarchical trees (SPIHT) is a widely used compression algorithm for wavelet-transformed images. One of its main drawbacks is a slow processing speed due to its dynamic processing order that depends on the image contents. To overcome this drawback, this paper presents a modified SPIHT algorithm called block-based pass-parallel SPIHT (BPS). BPS decomposes a wavelet-transformed image into 4 × 4 blocks and simultaneously encodes all the bits in a bit-plane of a 4 × 4 block. To exploit parallelism, BPS reorganizes the three passes of the original SPIHT algorithm and then encodes/decodes the reorganized three passes in a parallel and pipelined manner. The precalculation of the stream length of each pass enables the parallel and pipelined execution of these three passes by not only an encoder but also a decoder. The modification of the processing order slightly degrades the compression efficiency. Experimental results show that the peak signal-to-noise ratio loss by BPS is between approximately 0.23 and 0.59 dB when compared to the original SPIHT algorithm. Both an encoder and a decoder are implemented in hardware that can process 120 million samples per second at an operating clock frequency of 100 MHz. This processing speed allows a video of size 1920 × 1080 in the 4:2:2 format to be processed at the rate of 30 frames/s. The gate count of the hardware is about 43.9K.

Index Terms—Discrete wavelet transform (DWT), set-partitioning in hierarchical trees (SPIHT), wavelet image coding.

I. Introduction

Wavelet-based image coding, such as the JPEG2000 standard [1], is widely used because of its high compression efficiency. There are three important wavelet-based image coding algorithms that have the embedded coding property, which enables easy bit rate control with progressive transmission of information for a wavelet-transformed image. They are the embedded zerotree wavelet algorithm (EZW) [2], the embedded block coding with optimized truncation algorithm (EBCOT) [3], and the set partitioning in hierarchical trees algorithm (SPIHT) [4]. Many video applications need image compression with the embedded coding property. These applications include frame memory compression for a video compression chip [5]–[7] and overdrive detection and compensation for an LCD driver chip [8]–[11]. Among the three wavelet-based coding algorithms, EZW and EBCOT need binary arithmetic coding, which requires a large amount of hardware circuitry and memory, increasing the hardware cost and limiting the throughput [12]. In contrast, SPIHT does not need arithmetic coding and therefore provides a cheaper and faster hardware solution. In addition, SPIHT surpasses EZW and is close to EBCOT in compression efficiency [4], [13]. Therefore, extensive research has focused on SPIHT and its variations to improve the efficiency of wavelet-based image coding.

Manuscript received March 1, 2010; revised November 12, 2010 and April 20, 2011; accepted November 14, 2011. Date of publication March 5, 2012; date of current version June 28, 2012. This work was supported by the Korean Science and Engineering Foundation, under Grant 2011-0027502 funded by the Ministry of Education, Science, and Technology, Korean Government. This paper was recommended by Associate Editor O. C. Au.

Y. Jin is with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 USA (e-mail: [email protected]).

H.-J. Lee is with the School of Electrical Engineering and Computer Science, Seoul National University, Seoul 151-742, Korea (e-mail: hyuk−jae−[email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2012.2189793

The original SPIHT algorithm processes wavelet coefficients in a dynamic order that depends on the values of the coefficients. Thus, it is not easy to process multiple coefficients in parallel, and consequently, it is difficult to improve the throughput of the original SPIHT. To increase the throughput, the SPIHT algorithm has been modified such that the processing order is fixed statically (i.e., the processing order is independent of the values of the coefficients) [14]–[19]. Although a fixed-order SPIHT improves throughput, the coding efficiency is degraded because its order is different from the order of the original SPIHT. The no list SPIHT algorithm (NLS) [14] was initially proposed as a fixed-order SPIHT algorithm to reduce the required memory. Later, Corsonello et al. proposed a low-cost implementation of NLS [15]. In [14] and [15], the modified algorithm uses an array data structure for storing coding states in the fixed order instead of the list data structure required for the dynamic order of the original SPIHT. Although NLS succeeds in reducing the memory size, it does not process coefficients in parallel, so only 1 or 2 bits are produced at each step. Consequently, the throughput of NLS is only 0.092 bit per cycle [15]. To improve the coding speed, Chen et al. [16] proposed a modified SPIHT that processes a 4 × 4 bit-plane in one cycle. However, this algorithm does not exploit pixel parallelism; instead, its hardware implementation processes multiple sequential steps in one cycle, which significantly increases the critical path delay in the combinational logic circuits. Consequently, the operating clock frequency is limited even though a 4 × 4 bit-plane is processed in a single cycle, and the overall throughput is not very high. Fry et al. [17] proposed a bit-plane parallel SPIHT encoder architecture. This modified SPIHT decomposes wavelet coefficients bit-plane by bit-plane and then processes multiple bit-planes independently in a parallel manner. Then, the results of multiple bit-planes are merged into a single bitstream. This bit-plane-parallel approach achieves very large throughput by processing four pixels in a single cycle. However, there are two drawbacks. One is the low utilization of the parallel hardware and memory because the execution time of bit-plane coding differs from bit-plane to bit-plane. In addition, depending on the bitrate, the results from less significant bit-planes are truncated and are not merged into the final bitstream. The truncated bitstream implies that the hardware execution cycles used for the generation of the truncated bitstream are wasted. The more serious drawback lies in the fact that this bit-plane parallel approach is not applicable to a decoder. In a decoder, multiple bit-planes cannot be decoded in parallel because the decoder cannot predict the length of each bit-plane and, consequently, cannot divide the bitstream into multiple bit-plane streams for parallel processing at the beginning of the decoding process. As the speed of a decoder is often more important than the encoder speed, the low decoding speed in the bit-plane parallel SPIHT severely limits the application of this algorithm.

1051-8215/$31.00 © 2012 IEEE

This paper proposes a BPS algorithm and its hardware implementation. BPS decomposes a wavelet-transformed image into 4 × 4-bit blocks (4 × 4 blocks in a bit-plane) and processes one 4 × 4-bit block in a single cycle. The three passes in the original SPIHT algorithm are reorganized into three new passes, which can be executed in a pipelined and parallel manner. Parallel execution is possible because the reorganization of the three passes removes data dependence and enables the precalculation of the bit length of each pass before the pass is processed. This bit length precalculation enables parallel execution not only for an encoder but also for a decoder. As a result, the encoder and decoder implementing BPS achieve fast execution and large throughput through parallel execution and efficient hardware utilization. Changing the processing order in BPS makes the coding efficiency slightly lower than that of the original SPIHT algorithm.

This paper is organized as follows. In Section II, the SPIHT algorithm is introduced. Section III describes the proposed BPS algorithm, and Section IV presents the hardware implementation and experimental results. Finally, Section V concludes this paper.

II. SPIHT Algorithm

SPIHT is a compression algorithm applied to an image in the wavelet-transformed domain. A wavelet-transformed image can be organized as a spatial orientation tree (SOT) [Fig. 1(a)] in which an arrow represents the relationship between a parent and its offspring. Each node of the tree corresponds to a coefficient (also called a pixel) in the transformed image. Fig. 1(b) shows the Morton scanning order of the SOT [20], where the number assigned to each pixel represents the scanning order. For an image of size $m \times n$ transformed by an $L$-level DWT, the upper-leftmost nodes of size $(m/2^L) \times (n/2^L)$ are called the root nodes of the SOT. Fig. 1(a) shows an image of size 16 × 16 transformed by a 3-level DWT. The square denoted by R in Fig. 1(a) represents the root, whereas the 2 × 2 pixels numbered 0, 1, 2, and 3 in Fig. 1(b) correspond to the root.
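
The root size and the Morton numbering can be illustrated with a short Python sketch. The function names `root_size` and `morton_index` are hypothetical, and the bit-interleaving convention (column bits in the even positions) is an assumption, since Fig. 1(b) is not reproduced in this text; it is chosen so that the 2 × 2 root receives the numbers 0 to 3.

```python
def root_size(m, n, levels):
    """Size of the root region of an m x n image after a `levels`-level DWT."""
    return m >> levels, n >> levels


def morton_index(row, col, bits=4):
    """Interleave the bits of (row, col) into a Morton (Z-order) index.

    Assumption: column bits occupy the even bit positions, so the scan moves
    left to right before top to bottom inside every 2 x 2 group.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


print(root_size(16, 16, 3))  # (2, 2)
for r in range(4):
    # Upper-left 4 x 4 corner of the 16 x 16 scan order
    print([morton_index(r, c) for c in range(4)])
```

With this convention the pixels at (0, 0), (0, 1), (1, 0), and (1, 1) are numbered 0, 1, 2, and 3, which matches the description of the root above.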

Fig. 1. (a) Spatial orientation tree. (b) Morton scanning order of a 16 × 16 3-level wavelet-transformed image.

For a given set $T$, SPIHT defines a function of significance which indicates whether the set $T$ has pixels larger than a given threshold. $S_n(T)$, the significance of set $T$ in the nth bit-plane, is defined as

$$
S_n(T) =
\begin{cases}
1, & \max\limits_{w(i) \in T} |w(i)| \ge 2^n \\
0, & \text{otherwise.}
\end{cases}
\tag{1}
$$

When $S_n(T)$ is “0,” $T$ is called an insignificant set; otherwise, $T$ is called a significant set. An insignificant set can be represented as a single bit “0,” but a significant set is partitioned into subsets, whose significances in turn are to be tested again. Based on the zerotree hypothesis [2], SPIHT encodes a given set $T$ and its descendants [denoted by $D(T)$] together by checking the significance of $T \cup D(T)$ (the union of $T$ and $D(T)$) and by representing $T \cup D(T)$ as a single symbol “zero” if $T \cup D(T)$ is insignificant. On the other hand, if $T \cup D(T)$ is significant, $T$ is partitioned into subsets, each of which is tested independently.
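
The significance test of (1) can be written directly in Python. This is a minimal sketch: the function name `significance` and the representation of a set as a plain list of integer coefficients are assumptions, and the sample values are arbitrary.

```python
def significance(coeffs, n):
    """S_n(T) from (1): 1 if the largest magnitude in the set reaches 2**n, else 0."""
    largest = max((abs(w) for w in coeffs), default=0)
    return 1 if largest >= (1 << n) else 0


block = [-40, 24, -4, 0]           # arbitrary illustrative coefficients
print(significance(block, 5))      # 1, because |-40| >= 32
print(significance(block, 6))      # 0, because no magnitude reaches 64
```

Testing $T \cup D(T)$ amounts to passing the coefficients of the set together with those of its descendants to the same function.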

To reduce the complexity of SPIHT, an entire picture is decomposed into 4 × 4 sets (sets consisting of 4 × 4 pixels), and the significance of the union of each 4 × 4 set and its descendants is tested. The SPIHT algorithm encodes wavelet coefficients bit-plane by bit-plane from the most significant bit-plane to the least significant bit-plane. Fig. 2 presents the SPIHT algorithm encoding a single bit-plane. The SPIHT algorithm consists of three passes: insignificant set pass (ISP), insignificant pixel pass (IPP), and significant pixel pass (SPP). According to the results of the (n + 1)th bit-plane, the nth bits of the pixels are categorized and processed by one of the three passes. Pixels classified as insignificant in the (n + 1)th bit-plane are encoded by IPP for the nth bit-plane, whereas significant pixels are processed by SPP. The main goal of each pass is to generate the appropriate bitstream according to the wavelet coefficient information. ISP, the second pass in the SPIHT algorithm shown in Fig. 2, handles insignificant sets. If a set in this pass is classified as a significant set in the nth bit-plane, it is decomposed into smaller sets until the smaller sets are insignificant or they correspond to single pixels. If the smaller sets are insignificant, they are handled by ISP. If the smaller sets correspond to single pixels, they are handled by either IPP or SPP depending on their significance.

Fig. 2. SPIHT algorithm encoding for a single bit-plane of wavelet coefficients.

If the most significant bit-plane is a zero bit-plane (that is, all coefficients have their most significant bit equal to 0), the bit-plane is not encoded, and consequently, the number of encoded bit-planes is decreased. The following bit-planes are likewise not encoded if they are also zero bit-planes. Thus, the SPIHT algorithm starts from the first nonzero bit (FNZB) plane. For example, if the largest coefficient magnitude in the SOT is at least $2^6$ but smaller than $2^7$, FNZB is 6 and the bit-plane coding starts from the sixth bit-plane. The FNZB is stored in the header of the coded stream. Further details about the SPIHT algorithm are described in [4].
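
The FNZB can be obtained from the position of the highest set bit of the largest magnitude. The following Python sketch is illustrative only; the function name `fnzb` is hypothetical, and returning -1 for an all-zero input is an assumption, since that case is not discussed here.

```python
def fnzb(coeffs):
    """Index of the first nonzero bit-plane (bit 0 is the least significant).

    Assumption: returns -1 when every coefficient is zero.
    """
    return max(abs(w) for w in coeffs).bit_length() - 1


print(fnzb([100, -3, 5]))   # 6, since 64 <= 100 < 128
print(fnzb([127, -128]))    # 7, since |-128| = 2**7
```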

In the original SPIHT algorithm, three linked lists are maintained for processing ISP, SPP, and IPP, respectively. In each pass, the entries in the linked list are processed in first-in-first-out (FIFO) order. This FIFO order causes a large overhead that slows down the SPIHT algorithm. To speed up the algorithm, sets and pixels are visited in the Morton order as shown in Fig. 1(b) and processed by the appropriate pass. This modified algorithm, hereafter called Morton order SPIHT, is relatively easy to implement in hardware with a slight degradation of the compression efficiency when compared with the original SPIHT [14], [15], [17]–[19]. The algorithm shown in Fig. 2 describes both the original SPIHT and Morton order SPIHT algorithms. The processing order of the for-loops in each pass differentiates the original SPIHT from the Morton order SPIHT algorithm.

III. High Throughput Wavelet Image Coding

This section presents a modified SPIHT algorithm, called BPS. The proposed algorithm aims to speed up both encoding and decoding with a slight sacrifice in compression efficiency.

A. Block-Based Pass-Parallel SPIHT

BPS processes each bit-plane from the most significant bit-plane just like the original SPIHT algorithm. However, the processing order of the pixels in each bit-plane is different from the original SPIHT. BPS first decomposes an entire bit-plane into 4 × 4-bit blocks (4 × 4 blocks in a bit-plane) and processes one 4 × 4-bit block at a time. After one 4 × 4-bit block is processed, the next 4 × 4-bit block is processed in the Morton scanning order [20] as shown in Fig. 1(b).

Fig. 3. 4 × 4 BPS algorithm.

The encoded stream in the original SPIHT is categorized into three types: sorting bit, magnitude bit, and sign bit. The sorting bit is the result of the significance test for a 2 × 2 or 4 × 4 set, indicating whether the set is significant or not. The magnitude and sign bits indicate the magnitude and sign of each pixel, respectively. The magnitude and sign bits output in IPP and SPP are called “refining bits,” but the magnitude and sign bits output in ISP are called the “first refining bits” because they are the refining bits generated first for each pixel.

The proposed BPS algorithm for a single 4 × 4-bit block is described in Fig. 3. The 4 × 4-bit block is denoted by $H$, which is decomposed into four 2 × 2 blocks. In Fig. 3, $Q$ represents a 2 × 2-bit block that is a subblock of a 4 × 4-bit block $H$, and $n$ represents the bit-plane number. BPS consists of three passes that output refining bits, sorting bits, and first refining bits, respectively. According to the type of generated bits, these three passes are called the refinement pass (RP), sorting pass (SP), and first refinement pass (FRP), respectively. The RP is a combination of the IPP and SPP from the original SPIHT and visits each 2 × 2 block that is significant in the previous bit-plane (i.e., $S_{n+1}(Q) = 1$, the condition in line 2 of the algorithm in Fig. 3). Then, RP outputs the nth magnitude bits of the significant 2 × 2-bit block (line 4 in Fig. 3). In addition, the sign bit of a pixel is output (line 6) if the pixel becomes significant in the nth bit-plane (i.e., $S_{n+1}(w(i)) = 0 \wedge S_n(w(i)) = 1$ in line 5). Since the two passes IPP and SPP from the original SPIHT are combined as a single pass RP in BPS, the order of pixels processed in BPS is different from that in SPIHT. Experimental results show that the degradation of the compression efficiency caused by this change of the processing order is not very significant.
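
The RP rule described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the hardware algorithm of Fig. 3: the names `pixel_significant` and `refinement_pass`, the list-based block representation, and the '+'/'-' sign symbols are hypothetical. The coefficients are those of block H1 in the example of Section III-C, where the first three 2 × 2 blocks are already significant when the fourth bit-plane is coded.

```python
def pixel_significant(w, n):
    """True if |w| >= 2**n, i.e., the pixel is significant in bit-plane n."""
    return abs(w) >= (1 << n)


def refinement_pass(blocks, sig_prev, n):
    """Sketch of RP for the 2x2 blocks of one 4x4 block in bit-plane n.

    blocks   : four lists of four coefficients (assumed layout)
    sig_prev : S_{n+1}(Q) for each 2x2 block, given as booleans
    Returns (magnitude_bits, sign_symbols).
    """
    mag_bits, signs = [], []
    for q, was_significant in zip(blocks, sig_prev):
        if not was_significant:
            continue  # left to the sorting pass
        for w in q:
            mag_bits.append((abs(w) >> n) & 1)
            # A sign is emitted only when the pixel becomes significant in plane n
            if not pixel_significant(w, n + 1) and pixel_significant(w, n):
                signs.append('+' if w >= 0 else '-')
    return mag_bits, signs


h1 = [[672, -72, -72, 32], [-40, 24, -4, 0], [-52, -8, 0, -18], [12, 10, -2, 4]]
mag, sign = refinement_pass(h1, [True, True, True, False], n=4)
print(''.join(map(str, mag)))  # 000001001001
print(''.join(sign))           # +-
```

Because every previously significant 2 × 2 block contributes exactly four magnitude bits, the RP magnitude length (12 bits here) follows from the significance flags of the previous bit-plane alone.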

The ISP pass in the original SPIHT is decomposed into SP and FRP passes in BPS. The SP classifies a block as either a significant block or an insignificant block and transmits sorting bits. The first step of the SP is to generate and transmit the significance of the 4 × 4-bit block (line 11 in Fig. 3). This is done when two conditions are satisfied (line 10 in Fig. 3). The first condition is that the 4 × 4-bit block is insignificant in the (n + 1)th bit-plane (i.e., $S_{n+1}(H \cup D(H)) = 0$). The second condition, $\sim(\mathrm{parent}(H) \wedge S_n(\mathrm{parent}(H)) = 0)$, implies that it is not necessary to generate the significance of the set if the 4 × 4-bit block has a parent whose descendants are insignificant, because the insignificance of the parent already indicates that the 4 × 4-bit block is insignificant. SP is the only pass that processes a 4 × 4-bit block. The other two passes, RP and FRP, use a 2 × 2-bit block as the processing unit.

The remaining operation of the SP depends on the significance of the 4 × 4-bit block (tested in line 12). If the block is significant, it is decomposed into four 2 × 2-bit blocks. The significance of each 2 × 2 block is generated (line 14) if it is insignificant in the (n + 1)th bit-plane (line 13). According to its significance, each 2 × 2-bit block is classified either as an insignificant block to be processed by the SP for the (n − 1)th bit-plane (line 16) or as a significant block to be processed by the FRP pass in the current bit-plane. To be processed by the FRP, a 2 × 2 block $Q$ needs its significance $S_n(Q)$ to be set to 1 (line 18). The significant block processed by the FRP is called the new-significant block. When $H \cup D(H)$ is insignificant, all four 2 × 2-bit blocks in $H$ are classified as insignificant blocks for the (n − 1)th bit-plane (lines 19, 20, and 21). The FRP pass processes the new-significant 2 × 2-bit blocks classified by the SP (line 24). FRP outputs the nth magnitude bits of the pixels in the new-significant blocks. If a magnitude bit output by the FRP is significant, the corresponding pixel is significant for the first time. Thus, the sign bit is also output.
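
The control flow of the SP can be approximated by the following Python sketch. It is a simplified reading of the prose above, not a transcription of Fig. 3, which is not reproduced in this text. The function name `sorting_pass` and all parameter names are hypothetical, and the significance flags are assumed to be precomputed by the caller (including any contribution from descendants).

```python
def sorting_pass(h_sig_prev, h_sig_now, parent_insignificant, q_sig_prev, q_sig_now):
    """Sketch of SP for one 4x4-bit block H in bit-plane n.

    h_sig_prev           : significance of H and its descendants in plane n+1
    h_sig_now            : the same significance in plane n
    parent_insignificant : True if H has a parent already known to be insignificant
    q_sig_prev, q_sig_now: per-2x2-block significance in planes n+1 and n
    Returns (sorting_bits, new_significant), where new_significant lists the
    indices of the 2x2 blocks handed to FRP in the current bit-plane.
    """
    sorting_bits, new_significant = [], []
    # The 4x4 significance is sent only if it is not already known or implied
    if not h_sig_prev and not parent_insignificant:
        sorting_bits.append(int(h_sig_now))
    if h_sig_now:
        for i in range(4):
            if not q_sig_prev[i]:
                sorting_bits.append(int(q_sig_now[i]))
                if q_sig_now[i]:
                    new_significant.append(i)
    return sorting_bits, new_significant


bits, new = sorting_pass(False, True, False, [False] * 4, [True, False, False, True])
print(bits, new)  # [1, 1, 0, 0, 1] [0, 3]
```

In this sketch the number of new-significant blocks is known as soon as SP finishes, so the number of FRP magnitude bits (four per new-significant block) can be computed before FRP starts.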

Recall that the ISP in the original SPIHT is decomposed into SP and FRP in the BPS algorithm in Fig. 3. The separation of SP and FRP allows each pass to be processed in a single cycle. It should be noted that the operation of FRP depends on the results of SP, so they cannot be executed in parallel. In the implementation, FRP is delayed by one cycle, so it can be executed in parallel with the RP and SP of the next 4 × 4-bit block. Parallel execution is possible because the FRP in the current 4 × 4-bit block is not dependent on the RP and SP of the next 4 × 4-bit block. As a result, for each cycle, the bitstream of a single 4 × 4-bit block for a given bit-plane is generated.

Additional improvement of the compression efficiency is achieved by a slight modification in the selection of FNZB. When the size of the wavelet-transformed image is relatively small (e.g., 16 × 16), the root pixel(s) has a much larger absolute value than the other pixels in the image. Therefore, only the root pixel(s) is significant for the several most significant bit-planes. Thus, the FNZB is obtained from the pixels excluding the root pixel(s). Then, bit-plane coding starts from this FNZB. Consequently, the number of encoded bit-planes can be reduced. For the root pixel(s), the value from the MSB to the (FNZB+1)th bit is stored in the header.
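
The root handling can be illustrated with the numbers used later in the example of Section III-C, where the root is 672, the largest other magnitude is 72, and the bits from the MSB down to the seventh bit are stored in the header. The Python sketch below is illustrative; the names `fnzb` and `root_header_bits` are hypothetical, and the MSB position of 10 is taken from the 11-bit representation used in that example.

```python
def fnzb(values):
    """Position of the highest set bit of the largest magnitude."""
    return max(abs(v) for v in values).bit_length() - 1


def root_header_bits(root, msb, start_plane):
    """Bits of |root| from bit `msb` down to bit `start_plane + 1`, as a string."""
    return ''.join(str((abs(root) >> b) & 1) for b in range(msb, start_plane, -1))


root, others = 672, [-72, -72, 32, -40, 24]   # a few coefficients from the example
start = fnzb(others)
print(fnzb([root]), start)                     # 9 6
print(root_header_bits(root, 10, start))       # 0101
```

Excluding the root lowers the starting bit-plane from 9 to 6 in this case, so three bit-planes are skipped at the cost of four header bits.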

For the FNZBth bit-plane (the most significant bit-plane to be processed), initialization is necessary before the algorithm given in Fig. 3 begins. Initially, the 2 × 2 set that includes the root pixel(s) is classified as a significant block. All other blocks are classified as insignificant. For any 2 × 2 set $Q$, the parameter dsig is derived. This parameter is used to evaluate the significance $S_n(Q \cup D(Q))$. Furthermore, for any 4 × 4 set $H$, the significance $S_n(H \cup D(H))$ is also evaluated in advance. The initial derivation of dsig makes the significance evaluation simple enough to be processed in a single cycle. Details of this derivation are explained in Section IV-A.

B. Bitstream Generation for a Fast Decoder

Increasing the speed of a decoder may be more important and more difficult than increasing that of an encoder. The encoder can process RP and SP in parallel because they are independent. In a decoder, the independence of RP and SP is not enough for parallel execution. Another condition for parallel execution is the precalculation of the start bit of each pass in the bitstream. This condition is necessary because a decoder cannot start to process a pass unless the start bit of the pass is known beforehand. It is not easy for a decoder to find the start bit of each pass because the length of each pass is variable, and the length is known by the decoder only after the pass is completely decoded. Therefore, in order to enable parallel execution of multiple passes in a decoder, the bitstream should be formatted carefully so that the decoder can look ahead for the length of the bitstream for each pass.

Fig. 4 shows the bitstream format of an 8 × 8 block. FNZB is stored in the leftmost position. Next, the magnitude bits of the root pixel(s) from the nth bit to the (FNZB+1)th bit are stored. Once the compression ratio is determined, the bitstream length is also determined. In this example, the bitstream length must be smaller than or equal to 256 bits (= 8 × 8 × 8 × 50%), assuming that the compression ratio is 50%. Hence, a decoder knows not only the first bit position but also the last bit position of the bitstream. By exploiting this fact to increase the speed, the proposed decoder parses the bitstream in both directions: from left to right and from right to left. The magnitude bits from RP and FRP and the sorting bits from SP are stored from left to right. On the other hand, sign bits from RP and FRP are stored from right to left in the bitstream. Note that only the first 4 × 4-bit block among the 4 × 4-bit blocks in the FNZB bit-plane has RP magnitude and sign bits because only the 2 × 2 set that includes the root pixel(s) is initially classified as a significant block, as explained in Section III-A.
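
The two-directional layout can be pictured as a fixed-size buffer filled from both ends. The following Python sketch is an illustration only: the class `TwoEndedBitstream`, its method names, and the encoding of '+' as 0 are hypothetical, the header fields of Fig. 4 are omitted, and reading the third factor of 8 in the budget as bits per pixel is an assumption.

```python
class TwoEndedBitstream:
    """Fixed-size bit buffer filled from both ends (a sketch, header omitted)."""

    def __init__(self, capacity):
        self.bits = [None] * capacity
        self.left = 0               # next slot for magnitude and sorting bits
        self.right = capacity - 1   # next slot for sign bits

    def _check_room(self):
        if self.left > self.right:
            raise OverflowError("bit budget exhausted")

    def push_left(self, bit):
        self._check_room()
        self.bits[self.left] = bit
        self.left += 1

    def push_right(self, bit):
        self._check_room()
        self.bits[self.right] = bit
        self.right -= 1


# Budget for an 8 x 8 block at a 50% compression ratio, assuming 8 bits per pixel
budget = 8 * 8 * 8 * 50 // 100
stream = TwoEndedBitstream(budget)
for bit in (0, 1, 0, 0):
    stream.push_left(bit)    # magnitude bits grow from the left
stream.push_right(0)         # a sign bit ('+' encoded as 0) grows from the right
print(budget, stream.left, stream.right)  # 256 4 254
```

Since both end positions are fixed by the bit budget, a decoder can read sign bits from the right without knowing how many magnitude and sorting bits precede them.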

With the bitstream organization given in Fig. 4, each 4 × 4-bit block in a single bit-plane can be processed in a single cycle in a pipelined manner. Fig. 5 shows the pipelined execution of the BPS decoder. Fig. 5(a) shows an 8 × 8 block that is decomposed into four 4 × 4 blocks, denoted by H1, H2, H3, and H4, respectively. Fig. 5(b) shows the execution cycle of the three passes of the four 4 × 4-bit blocks. The RP and SP of a 4 × 4-bit block can be processed in parallel. Recall that parallel processing of two passes requires not only that the two passes be independent of each other but also prior knowledge of the length of the bitstream of the pass that is stored ahead in the bitstream. In this case, the number of magnitude bits transmitted by an RP is known before RP starts because it is determined by the result from the previous bit-plane. Therefore, the end bit of the RP magnitude (and the start bit of the next SP sorting) is known prior to the beginning of the RP. Thus, both RP magnitude bits and SP sorting bits can be decoded in parallel. On the other hand, the number of sign bits in the RP is known only after the magnitude bits of the corresponding RP are decoded. Thus, the sign bits of RP are decoded one cycle later than the corresponding magnitude bits. For FRP, the number of sorting bits transmitted by the SP is known only after SP is completed. Thus, the next bits, the FRP magnitude bits, can be decoded one cycle later than SP. The number of magnitude bits from FRP is determined by the result of the SP in the same bit-plane. Thus, the length of the FRP can be precalculated before the FRP begins. This implies that the start bit of the RP of the next 4 × 4-bit block is also known before the FRP begins. Therefore, the RP (and SP as well) of the next 4 × 4-bit block is performed in the same cycle as FRP. In summary, the RP and SP can be processed in parallel with the FRP of the previous 4 × 4-bit block.

Fig. 4. Bitstream organization.

Fig. 5. Pipelined execution of the BPS decoder. (a) 8 × 8 block decomposed into four 4 × 4 blocks. (b) Pipelined execution of RP, SP, and FRP.

Fig. 6. Example of three-level wavelet coefficients used to demonstrate the proposed SPIHT algorithm.

As shown in Fig. 4, sign bits are stored from the right of the bitstream. The length of the sign bits transmitted by RP is not known before the RP is completed because it is determined by the RP according to the magnitude bits. Therefore, the sign bits are processed by the decoder one cycle after the corresponding magnitude bits. The decoding of FRP sign bits can also begin only after the magnitude bits of the FRP are decoded. Thus, the decoding of FRP sign bits is done one cycle after the decoding of FRP magnitude bits. Note that the FRP sign bits can be processed in the same cycle as the RP sign bits of the next 4 × 4-bit block.
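
The decoder timing described above can be tabulated with a small Python sketch. The schedule is derived from the prose only, since Fig. 5(b) is not reproduced in this text; the function name `decoder_schedule` and the cycle numbering starting at 1 are assumptions.

```python
from collections import defaultdict


def decoder_schedule(blocks):
    """Map each cycle to the decoding steps active in it for one bit-plane.

    Block k starts in cycle k: RP magnitude and SP sorting bits are decoded
    together, RP sign and FRP magnitude bits follow one cycle later, and FRP
    sign bits one cycle after that.
    """
    schedule = defaultdict(list)
    for k, name in enumerate(blocks, start=1):
        schedule[k] += [f"{name}: RP magnitude", f"{name}: SP sorting"]
        schedule[k + 1] += [f"{name}: RP sign", f"{name}: FRP magnitude"]
        schedule[k + 2].append(f"{name}: FRP sign")
    return dict(sorted(schedule.items()))


for cycle, steps in decoder_schedule(["H1", "H2", "H3", "H4"]).items():
    print(cycle, steps)
```

Under these assumptions the four blocks of an 8 × 8 bit-plane occupy six cycles in total, and a new 4 × 4-bit block enters the pipeline in every cycle.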

C. An Example of Block-Based Pass-Parallel SPIHT

This section explains the proposed BPS coding with the example wavelet coefficients of size 8 × 8 shown in Fig. 6. The 8 × 8 coefficients are decomposed into four 4 × 4 blocks that are denoted by H1, H2, H3, and H4, respectively. A 3-level wavelet transform is done and the upper leftmost pixel constitutes the root. The fifth and fourth bit-planes are shown in Fig. 6. In these figures, insignificant set, significant pixel, and insignificant pixel are differentiated by darkness. Significant pixels are indicated by darker shaded areas whereas insignificant pixels are represented by lighter shaded areas. Insignificant sets are represented by white boxes.

The value of the root pixel is 672, which is the maximum absolute value among all coefficients. Thus, the FNZB is 9. BPS starts its coding from the sixth bit-plane because the maximum magnitude value other than the root is 72. The value 0101₂ from the MSB to the seventh bit of the root 672 (= 01010100000₂) is stored in the header. The bitstream generation of the fourth bit-plane is explained in this example. For each 8×8-bit block, the Morton processing order of the 4×4-bit blocks is H1, H2, H3, and H4. Among the four 2×2 blocks that make up H1, all 2×2 blocks except the right and bottom block {12, 10, -2, 4} are significant and therefore they are processed in the RP. For the first 2×2 block {672, -72, -72, 32}, magnitude bits “0000” are generated because their fourth magnitude bits are all zero. For the second block {-40, 24, -4, 0}, magnitude bits “0100” are generated. In addition, a sign bit “+” is generated because the first significant magnitude bit is generated for pixel 24. Note that the sign bit is stored separately at the end of the bitstream. For the third block {-52, -8, 0, -18}, magnitude bits “1001” are generated. Note that the first significant magnitude bit is generated for pixel -18, so the sign bit “-” is also generated. The RP for H1 is completed and followed by the SP for H1. The union of 2×2 block {12, 10, -2, 4} and its descendant (= H4) is insignificant in the fourth bit-plane (S₄({12, 10, −2, 4} ∪ H4) = 0) so that the sorting bit “0” is generated to indicate the insignificance of the block. H1 in the fourth bit-plane does not include any new-significant block, and consequently, no block is processed by the FRP.

Fig. 7. Bitstream generated from the example in the fourth bit-plane.

Next, the bitstream for H2 is generated. H2 is insignificant in the fifth bit-plane, so all the 2×2-bit blocks are insignificant blocks and no 2×2-bit blocks are processed in the RP. In SP, “0” is generated because H2 is also insignificant in the fourth bit-plane. Next, H3 is insignificant in the fifth bit-plane, so it is not processed by the RP. In the SP, H3 is significant in the fourth bit-plane, so the sorting bit “1” is generated and H3 is partitioned into four 2×2-bit blocks and the significance of each 2×2-bit block is tested. The first 2×2-bit block {-16, -8, 2, 2} is significant, and sorting bit “1” is generated whereas the other 2×2-bit blocks are insignificant, and sorting bit “0” is generated. Block {-16, -8, 2, 2} is a new-significant block which is processed by the FRP generating “1-000.” Next, H4 consists of four insignificant 2×2 blocks, so H4 is not processed by the RP. In the SP, the second “if-condition” is not satisfied, so it is not processed by the SP. Therefore, no bit is generated for H4.
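The magnitude and sign bits quoted in this walk-through can be checked with a few lines of Python. The following is only a minimal sketch, not the authors' implementation: the helper names `magnitude_bits` and `new_signs` are hypothetical, and each 2×2 block is assumed to be listed in the order in which its pixels are coded.

```python
def magnitude_bits(block, plane):
    """Bit `plane` of each coefficient magnitude, as a string (illustrative helper)."""
    return "".join(str((abs(c) >> plane) & 1) for c in block)


def new_signs(block, plane):
    """Signs of the pixels whose most significant '1' bit lies exactly in `plane`."""
    return ["-" if c < 0 else "+"
            for c in block
            if abs(c).bit_length() - 1 == plane]


PLANE = 4  # the fourth bit-plane of the example

# 2x2 blocks of H1 handled by the RP, then the new-significant block of H3 (FRP).
rp_blocks = [(672, -72, -72, 32), (-40, 24, -4, 0), (-52, -8, 0, -18)]
frp_block = (-16, -8, 2, 2)

for blk in rp_blocks + [frp_block]:
    print(blk, magnitude_bits(blk, PLANE), new_signs(blk, PLANE))
```

Under these assumptions the three RP blocks yield "0000", "0100" with sign "+", and "1001" with sign "-", and the FRP block yields "1000" with sign "-", matching the text.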

The bitstream generated in the fourth bit-plane is shown in Fig. 7. The bit results in Table I are stored from the top row to the bottom row. The magnitude bits of the RP pass of H1 are stored in sequence as 0000 0100 1001. Then, the sorting bit of H1 is stored next as 0 followed by the sorting bit of H2 which is also 0. Then, the sorting bits of H3, 11000, are stored followed by the magnitude bits of the FRP which are 1000. The sign bits “+” and “-” are stored as “0” and “1,” respectively, and they are stored from the right of the bitstream. Thus, the sign bits “+” followed by “-” generated by the RP pass of H1 are stored as 01 from the right to the left. Then, the sign bit “-” generated by the FRP pass of H3 is stored as 1. Note that the output orders of the SPIHT and the BPS bitstreams are different, but the values corresponding to each coefficient are the same for both algorithms. Thus, these two algorithms produce different bitstreams only when the length of the output bitstream is limited for compression.
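The two-ended layout described above can be sketched as follows. This is an illustrative model only: the function name `assemble`, the 32-bit buffer length, and the "." marker for unused positions are assumptions made for the example and are not taken from Fig. 7.

```python
def assemble(forward_fields, signs, total_len):
    """Write pass bits from the left and sign bits from the right of a buffer."""
    forward = "".join(forward_fields)
    sign_bits = "".join("1" if s == "-" else "0" for s in signs)
    if len(forward) + len(sign_bits) > total_len:
        raise ValueError("bit budget exceeded")
    buf = ["."] * total_len                 # '.' marks a position not yet used
    buf[:len(forward)] = forward            # RP / SP / FRP bits grow left to right
    for i, bit in enumerate(sign_bits):     # sign bits grow right to left
        buf[total_len - 1 - i] = bit
    return "".join(buf)


fields = [
    "000001001001",  # RP magnitude bits of H1
    "0",             # SP sorting bit of H1
    "0",             # SP sorting bit of H2
    "11000",         # SP sorting bits of H3
    "1000",          # FRP magnitude bits of H3
]
signs = ["+", "-", "-"]  # RP of H1 ("+", "-"), then FRP of H3 ("-")

print(assemble(fields, signs, 32))
```

Because the two regions grow toward each other, a decoder that knows the total length can locate both the next pass bit and the next sign bit without any separator.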

TABLE I
Bitstream Results of the Fourth Bit-Plane in Fig. 6 by Pass-Parallel SPIHT

| 4 × 4 Block | Pixel or Set | RP | SP | FRP |
|---|---|---|---|---|
| 1 | {672, -72, -72, 32} | 0000 | | |
| | {-40, 24, -4, 0} | 01+00 | | |
| | {-52, -8, 0, -18} | 1001- | | |
| | {12, 10, -2, 4} | | 0 | |
| 2 | H2 | | 0 | |
| 3 | H3 | | 1 | |
| | {-16, -8, 2, 2} | | 1 | 1-000 |
| | {2, -4, 0, -2} | | 0 | |
| | {2, 4, 4, 6} | | 0 | |
| | {0, 0, -2, 2} | | 0 | |
| 4 | H4 | | | |

In this example, the magnitude bit length of RP and FRP passes can be precalculated prior to the start of the passes. For example, consider the magnitude bit length of the RP pass of H1 in the fourth bit-plane. From the result of the fifth bit-plane, the decoder determines that three 2×2 blocks are categorized as significant blocks. This implies that the magnitude bits of 12 pixels are generated by the RP pass so that the magnitude bit length of this RP pass is 12. H2, H3, and H4 are all insignificant in the fifth bit-plane. This indicates that the magnitude bit lengths of the RP passes for H2, H3, and H4 are all zero in the fourth bit-plane. The length of FRP is determined by the result of SP in the same bit-plane. In this example, {-16, -8, 2, 2} in H3 is the only 2×2 bit-block that is converted from an insignificant block to a significant block. Therefore, four magnitude bits are generated in the corresponding FRP pass. The length of the sign bits in the RP (and also FRP) is known after the magnitude bits of the same pass are decoded. For example, among the 12 magnitude bits generated by the RP, “24” and “-18” are the two pixels that generate the magnitude bit of value “1” for the first time for each pixel. Thus, the two sign bits need to be decoded in the corresponding RP.
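The length precalculation in this paragraph can be illustrated with a short sketch. It is a simplification under stated assumptions: the significance of a 2×2 block is judged from its own four pixels only (descendant sets are ignored), and the function names are hypothetical.

```python
def is_significant(block, plane):
    """True if some magnitude in the 2x2 block has a '1' at `plane` or above."""
    return max(abs(c) for c in block) >> plane != 0


def rp_length(blocks, plane):
    """RP magnitude bits: 4 per block already significant in the previous plane."""
    return 4 * sum(is_significant(b, plane + 1) for b in blocks)


def frp_length(blocks, plane):
    """FRP magnitude bits: 4 per block that becomes significant in this plane."""
    return 4 * sum(is_significant(b, plane) and not is_significant(b, plane + 1)
                   for b in blocks)


H1 = [(672, -72, -72, 32), (-40, 24, -4, 0), (-52, -8, 0, -18), (12, 10, -2, 4)]
H3 = [(-16, -8, 2, 2), (2, -4, 0, -2), (2, 4, 4, 6), (0, 0, -2, 2)]

print("H1:", rp_length(H1, 4), frp_length(H1, 4))
print("H3:", rp_length(H3, 4), frp_length(H3, 4))
```

For the fourth bit-plane this gives an RP length of 12 and an FRP length of 0 for H1, and an RP length of 0 and an FRP length of 4 for H3, in agreement with the text. The RP length depends only on the previous bit-plane, which is what allows it to be known before the pass starts.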

D. Hardware Organization of Block-Based Pass-Parallel SPIHT

Fig. 8 shows the hardware organization of the BPS encoder and decoder. The computation is decomposed into two steps that are executed in a pipelined manner for both the encoder and the decoder. In the encoder, three passes, FRP, RP, and SP, are processed in parallel by three dedicated modules. The bit information (magnitude, sign, sorting bits, and their lengths) generated by the three passes is forwarded to the bitstream aligner that merges the information and generates the final bitstream. For example, the sign bits are stored in the temporary buffer and generated at the end of the bitstream. For bitstream merge and alignment, multiple barrel shifters and registers are used by the bitstream aligner.

In the BPS decoder, the first stage is the bitstream parser that identifies the bit information to be passed to the three parallel pass decoders in the second stage. The sign bits are processed separately as they are stored from the end of a given bitstream and parsed in the reverse direction. Similar to the bitstream aligner in the encoder, the parser consists of multiple barrel shifters and registers. Unlike the aligner, the length of the bits in each pass is not available. Instead, the length information is obtained from the previous decoding results. The derivation of the bit length makes the decoder more complex than the encoder.

Fig. 8. Hardware organization of the BPS encoder and decoder.

Fig. 9. Organization of the bitstream parser in a BPS decoder.

Fig. 9 shows the organization of the bitstream parser. The 32-bit stream is stored in register D1. When the accumulated bit length of the parsed bitstream is greater than 32, a carry-out is generated and the stream in D1 is moved to D0. D2 is used for storing the accumulated bit length of the parsed bitstream. The output bits of the barrel shifter BS0 are decomposed into three bit streams by additional barrel shifters BS1 and BS2. The decomposed bits are fed forward to three independent passes, FRP, RP, and SP. The length precalculator derives the lengths of bits to be decoded by FRP and RP (L_FRP and L_RP) and gives them to BS1 and BS2, which can extract the exact bits that are to be decoded by RP and SP passes. These lengths are derived from the decoding results of the previous bit-plane. The length of the bits to be decoded in a single cycle (L) is obtained by the summation of L_FRP, L_RP, and L_SP. L_SP is the length of the bits decoded by SP and its value ranges from 0 to 5. This length is also calculated by the length precalculator.
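A software analogue of this field extraction is sketched below. It is not a model of the barrel-shifter datapath in Fig. 9: it simply slices one block's fields in bitstream order (RP, SP, FRP) from lengths that are assumed to be supplied by a length precalculator, and the function name `split_fields` is hypothetical.

```python
def split_fields(stream, pos, l_rp, l_sp, l_frp):
    """Slice the RP, SP and FRP fields that start at `pos`; return them and the new position."""
    rp_end = pos + l_rp
    sp_end = rp_end + l_sp
    frp_end = sp_end + l_frp            # pos advances by L = L_RP + L_SP + L_FRP
    return stream[pos:rp_end], stream[rp_end:sp_end], stream[sp_end:frp_end], frp_end


# Pass bits of the fourth bit-plane example (sign bits are parsed separately).
stream = "00000100100100110001000"
lengths = {"H1": (12, 1, 0), "H2": (0, 1, 0), "H3": (0, 5, 4)}

pos = 0
parsed = {}
for name, (l_rp, l_sp, l_frp) in lengths.items():
    rp, sp, frp, pos = split_fields(stream, pos, l_rp, l_sp, l_frp)
    parsed[name] = (rp, sp, frp)
    print(name, repr(rp), repr(sp), repr(frp))
```

The sketch makes the dependency explicit: each field boundary is a running sum of lengths, so any length that is known early lets the following field be extracted without waiting.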

For sign bits, a bitstream parser similar to that shown in Fig. 9 is also used. As SP does not generate any sign bit, only FRP and RP decode sign bits. The length of the sign bits is smaller than that of the magnitude bits. As a result, the sign bit parser is less complex than the magnitude bit parser.

Fig. 10. Data storage pattern for BPS. (a) 8 × 8 wavelet coefficients in the sign and magnitude format. (b) Coefficients stored in the pattern in which each wavelet coefficient is stored in one address location. (c) Coefficients stored in the pattern in which a 4×4 block of the same bit-plane is stored in one address location.

Both the encoder and the decoder are designed to process a 4 × 4 block in a single cycle. If the size of a transform-block increases, the processing time also increases. On the other hand, the complexity of the hardware is independent of the size of the transform-block. This is different from the hardware for the wavelet transform, whose complexity increases as the transform-block size increases.

IV. Experimental Results

A. Implementation of BPS

An image is decomposed into transform-blocks, each of which is transformed by DWT and then coded by BPS encoding. As the size of the transform-block increases, the coding efficiency and the hardware cost of DWT also increase [21]. Section IV-B shows the experimental results on the coding efficiency according to the size of the transform-block.

The DWT module generates wavelet coefficients in the sign and magnitude forms. On the other hand, the BPS encoder module accesses one 4 × 4-bit block at a time. Thus, the storage pattern of the wavelet coefficients generated by the DWT module is not suitable for the BPS algorithm. For efficient data access by the BPS algorithm, the storage pattern presented in [22] is adopted, which is shown in Fig. 10. The wavelet coefficients of size 8×8 are decomposed into four 4×4 coefficients, H1, H2, H3, and H4. Fig. 10(b) shows the storage pattern generated by a DWT module whereas Fig. 10(c) shows the pattern stored for BPS. The pattern shown in Fig. 10(c) stores 4 × 4-bit blocks in the bit-plane order from the most significant bit-plane to the least significant bit-plane. Within a bit-plane, 4 × 4-bit blocks are stored in the Morton scanning order. Within a 4 × 4-bit block, each bit is also stored in the Morton scanning order. As BPS can access the magnitude and sign data at the same time, the magnitude and sign buffers are stored separately. If the BPS encoder and decoder use the pattern shown in Fig. 10(b), they should access the memory 16 times to get a 4×4-bit block, which wastes memory bandwidth. With the storage pattern shown in Fig. 10(c), only a single memory access is required to get a 4 × 4-bit block. Further details about the storage pattern are described in [22].
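The idea behind the pattern of Fig. 10(c) can be sketched by packing one bit-plane of a 4 × 4 block into a single 16-bit word in Morton order. The sketch below rests on several assumptions that are not stated in the paper: the column bit is taken as the lower interleaved bit, the first Morton position is placed in the MSB of the word, and the spatial arrangement of H1 is inferred from the 2×2 sets listed in the example of Section III-C.

```python
def morton_coords(index):
    """Map a Morton (Z-order) index 0..15 to (row, col) inside a 4x4 block."""
    col = (index & 1) | ((index >> 1) & 2)
    row = ((index >> 1) & 1) | ((index >> 2) & 2)
    return row, col


def pack_bitplane(block, plane):
    """Pack bit `plane` of the 16 magnitudes into one word, Morton order, MSB first."""
    word = 0
    for index in range(16):
        row, col = morton_coords(index)
        word = (word << 1) | ((abs(block[row][col]) >> plane) & 1)
    return word


# Assumed spatial layout of H1 (each 2x2 quadrant matches one set of the example).
H1 = [
    [672, -72, -40, 24],
    [-72, 32, -4, 0],
    [-52, -8, 12, 10],
    [0, -18, -2, 4],
]

print(format(pack_bitplane(H1, 4), "016b"))
```

With one such word per address, the fourth bit-plane of H1 is fetched in a single access, whereas a coefficient-per-address layout needs 16 reads to gather the same 16 bits.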

TABLE II
PSNR (in dB) Comparison of Original SPIHT and Our High Throughput SPIHT

| Transform-Block Size | b/p | SPIHT | Morton Order SPIHT | Pass-Parallel SPIHT | Block-Based Pass-Parallel SPIHT | Block-Based Pass-Parallel SPIHT + Root in Header |
|---|---|---|---|---|---|---|
| 16 × 16 | 4 | 44.66 | 44.64 | 43.73 | 43.70 | 43.97 |
| | 2 | 35.02 | 34.98 | 34.52 | 34.63 | 34.85 |
| | 1 | 29.66 | 29.64 | 29.19 | 29.41 | 29.70 |
| | 0.5 | 25.96 | 25.97 | 25.68 | 25.89 | 26.24 |
| 32 × 32 | 4 | 45.84 | 45.82 | 45.06 | 45.05 | 45.00 |
| | 2 | 36.34 | 36.26 | 35.75 | 35.87 | 35.83 |
| | 1 | 31.08 | 31.03 | 30.56 | 30.73 | 30.67 |
| | 0.5 | 27.46 | 27.42 | 27.07 | 27.27 | 27.20 |
| 64 × 64 | 4 | 46.56 | 46.54 | 45.91 | 45.88 | 45.84 |
| | 2 | 37.19 | 37.10 | 36.55 | 36.64 | 36.61 |
| | 1 | 31.91 | 31.85 | 31.30 | 31.51 | 31.45 |
| | 0.5 | 28.23 | 28.17 | 27.75 | 27.99 | 27.92 |
| 128 × 128 | 4 | 47.20 | 47.17 | 46.73 | 46.68 | 46.64 |
| | 2 | 38.15 | 38.03 | 37.54 | 37.59 | 37.55 |
| | 1 | 32.73 | 32.67 | 32.21 | 32.34 | 32.30 |
| | 0.5 | 28.95 | 28.87 | 28.45 | 28.62 | 28.56 |
| 256 × 256 | 4 | 47.62 | 47.59 | 47.28 | 47.23 | 47.20 |
| | 2 | 38.84 | 38.70 | 38.18 | 38.23 | 38.20 |
| | 1 | 33.41 | 33.29 | 32.91 | 32.96 | 32.93 |
| | 0.5 | 29.48 | 29.39 | 29.01 | 29.13 | 29.07 |
| 512 × 512 | 4 | 48.02 | 47.97 | 47.88 | 47.81 | 47.78 |
| | 2 | 39.67 | 39.50 | 39.18 | 39.08 | 39.06 |
| | 1 | 34.31 | 34.21 | 33.92 | 33.92 | 33.89 |
| | 0.5 | 30.21 | 30.13 | 29.90 | 29.98 | 29.92 |
| Average | 4 | 46.65 | 46.62 | 46.10 | 46.06 | 46.07 |
| | 2 | 37.54 | 37.43 | 36.95 | 37.01 | 37.01 |
| | 1 | 32.18 | 32.11 | 31.68 | 31.81 | 31.82 |
| | 0.5 | 28.38 | 28.32 | 27.97 | 28.15 | 28.15 |

Recall that the initialization step derives the parameter d_sig for every 2×2 set (see the last paragraph in Section III-A). For the derivation of d_sig of a 2×2 set Q, another parameter d_max is computed first where d_max is the maximum absolute value of all entries in the set Q ∪ D(Q). Once d_max is obtained, d_sig is derived by setting all the bits from the most significant bit with its value of “1” down to the least significant bit. For example, when d_max is 00010010₂, the fourth significant bit is the most significant bit with its value “1.” Then, d_sig is 00011111₂ that is obtained by setting all the bits from the fourth significant bit down to the least significant bit. Once d_sig is obtained, the significances for the nth bit-plane are obtained by using a bitwise logical operation as follows:

$$S_n(Q \cup D(Q)) = \mathrm{AND}\big(d_{sig}(Q \cup D(Q)),\ 2^n\big) \tag{2}$$

where AND represents the bitwise logical AND operation and d_sig(Q ∪ D(Q)) represents the value of d_sig derived for set (Q ∪ D(Q)). Once the significance is derived for all 2×2 sets, then the significance of every 4 × 4 set is obtained by logical OR operation. Let H denote a 4 × 4 set and Q1, Q2, Q3, and Q4 denote the subsets of H, then the significance of H ∪ D(H) is

$$S_n(H \cup D(H)) = \mathrm{OR}\big(S_n(Q_1 \cup D(Q_1)),\ S_n(Q_2 \cup D(Q_2)),\ S_n(Q_3 \cup D(Q_3)),\ S_n(Q_4 \cup D(Q_4))\big) \tag{3}$$

where OR represents the bitwise logical OR operation.
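The derivation of d_sig and the significance tests of (2) and (3) can be written compactly in software. The following is a minimal sketch with hypothetical function names; the significance is returned as 0 or 1 rather than as the raw AND result, which is an assumption made for readability.

```python
def d_sig(d_max):
    """Set every bit from the most significant '1' of d_max down to the LSB."""
    return (1 << d_max.bit_length()) - 1


def significance(dsig, n):
    """Eq. (2): 1 if the set is significant in bit-plane n, else 0."""
    return 1 if dsig & (1 << n) else 0


def block_significance(dsigs_2x2, n):
    """Eq. (3): OR of the significances of the four 2x2 subsets."""
    return int(any(significance(d, n) for d in dsigs_2x2))


example = d_sig(0b00010010)
print(format(example, "08b"))                       # mask derived from d_max
print([significance(example, n) for n in range(8)])  # significance per bit-plane
```

Because d_sig has a "1" in every bit-plane at or below the most significant one of d_max, a single AND with 2^n answers the significance question for any bit-plane without rescanning the coefficients.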

B. Comparison

This section evaluates the compression efficiency of the proposed BPS algorithm. Recall that BPS makes three modifications to speed up SPIHT computation. These modifications are block-based processing, pass reorganization, and the Morton-order-based operation in each pass. The increased speed achieved by these modifications may cause a slight degradation of the compression efficiency. Thus, the degradation of the compression efficiency by each of these modifications is evaluated by experimentation in which each modification is employed by SPIHT independently of the other modifications. Note that the number of generated bits is the same for all modified versions of the SPIHT algorithm when no data is lost. For lossy compression when the number of allowed bits is limited, the compression efficiency depends on the modification because a different modification leads to a different generation order that affects the compression efficiency.

Table II shows the results of the experiments. Test images are Barbara, Gold hill, and Lena [monochrome, 8 bits per pixel (b/p), 512 × 512] and Bike, Cafe, and Woman (monochrome, 8 b/p, 2560 × 2048) from the JPEG 2000 test bed. A test image is partitioned into blocks and then each block is transformed by DWT. The integer Le Gall 5/3 filter is used for DWT. The first column of Table II shows the transform-block size. The second column shows the number of bits per pixel that determines the compression ratio. For example, a bit rate of 4 b/p is equivalent to the compression ratio of 2 as the original image bit rate without compression is 8 b/p. The peak signal-to-noise ratio (PSNR) performance of the original SPIHT algorithm is shown in the third column. As the compression ratio increases, the PSNR decreases rapidly. Meanwhile, as the DWT block size increases, the PSNR increases accordingly. The fourth column shows the PSNR performance of the Morton Order SPIHT, which is equivalent to no list SPIHT [14]. The PSNR is slightly decreased by an average of 0.07 dB (ranging from 0.03 to 0.11 dB) when compared with the original SPIHT. The fifth column shows the PSNR performance of the modified version of the Morton order SPIHT algorithm that adopts the reorganized three passes, RP, SP, and FRP, instead of the IPP, ISP, and SPP. As shown in the table, this modification decreases the PSNR performance by an average of 0.51 dB (ranging from 0.40 to 0.58 dB). The sixth column shows the PSNR performance of the proposed 4 × 4 BPS. When compared with the data in the fifth column, the PSNR slightly increases except in the case when the bit rate is 4 b/p; the average PSNR decrease relative to the original SPIHT is 0.43 dB (ranging from 0.23 to 0.59 dB). In pass-parallel SPIHT, the refining bit always precedes the first refining bit in a bit-plane. On the other hand, in BPS, the first refining bit of a lower subband precedes the refining bit of a higher subband. In general, the first refining bits of a lower subband are more important than the refining bits of a higher subband. This is the reason why block-based processing may lead to a slight increase of the PSNR. When the bit rate is 4 b/p, the efficiency of BPS may be reduced for the following reason. The integer Le Gall 5/3 DWT used for the experiment always yields zero for the LSB of the low-pass coefficients through the scaling step because the scaling factor is two for the low-pass coefficients. When the bit rate is high, the LSB planes are likely to be coded. In LSB plane coding, BPS always outputs the coefficients of the lower subbands in advance of those of the higher subbands and does not increase the PSNR performance.

The last column shows the PSNR performance of BPS with the scheme that processes from the FNZBth bit-plane. As the DWT block size increases, the PSNR gain of this scheme over BPS drops sharply. This scheme is efficient only when the block size is 16 × 16. The reason is that the FNZB with roots and the FNZB without roots differ significantly only when the transform-block size is 16 × 16. Therefore, as the transform-block size increases, the benefits from this scheme decrease whereas the overhead increases.

Fig. 11 presents the Barbara image that is encoded by the proposed BPS algorithm and then decoded again by the BPS algorithm. The transform-block size is 16 × 16 and the bit rates are 4, 2, 1, and 0.5 bits per pixel. The PSNRs are 43.28 dB, 33.12 dB, 27.74 dB, and 24.42 dB, respectively. As the bit rate decreases, the subjective quality also degrades with blurring observed in object edges. A blocking effect is also apparent in the image with the bit rate at 0.5 b/p. The blocking effect can be reduced by increasing the transform-block size although the hardware cost increases dramatically as the transform-block size increases.

Fig. 11. Barbara image compressed by the BPS algorithm with various bit rates. (a) 4 b/p. (b) 2 b/p. (c) 1 b/p. (d) 0.5 b/p.

Fig. 12 compares PSNRs of six test images compressed by the original SPIHT and BPS. The transform-block size is chosen as 16 × 16. Depending on the complexity of a test image, the PSNR degradation varies substantially. The PSNR degradation is relatively large for complex images such as Barbara, whereas it is small for relatively simple images such as Lena. For example, the difference in PSNR degradation between the Lena image and the Barbara image is 1.32 dB when the bit rate is 4 b/p. The PSNR degradation decreases as the bit rate decreases. The bottom graph shows the averages over the six test images. The PSNR degradation is 0.97 dB on average when the bit rate is 4 b/p, whereas it is 0.07 dB when the bit rate is 0.5 b/p.

Table III compares the throughputs of the proposed design and previous designs. For those algorithms without a decoder algorithm (no list SPIHT and bit-plane parallel SPIHT), the decoder throughput is considered the same as that of no list SPIHT encoder. This is a reasonable number because no list SPIHT decoder can process both no list SPIHT and bit-plane parallel SPIHT. As shown in the table, the proposed design achieves a dominantly large throughput when compared with any previous work. In most systems with an integrated FMC module [5]–[7], the encoder and decoder throughputs of the FMC need to be balanced. Otherwise, the overall performance of the system may be significantly decreased by the one with the smaller throughput. Thus, the large throughputs for both the encoder and the decoder of the proposed design allow much improved system performance when they are integrated in the system.

TABLE III
Throughput Comparison of the SPIHT Architecture and Previous Works

| Design | Encoder Throughput (MPixels/s) | Decoder Throughput (MPixels/s) | Frequency (MHz) | Normalized Encoder Throughput (MPixels/(s × MHz)) | Normalized Decoder Throughput (MPixels/(s × MHz)) |
|---|---|---|---|---|---|
| No list SPIHT [15] | 18.4 | 18.4 | 100 | 0.184 | 0.184 |
| Two-level SPIHT [16] | 4.35 | 4.35 | 10 | 0.435 | 0.435 |
| Bit-plane parallel SPIHT [17] | 224 | 18.4 | 56 | 4 | 0.184 |
| BPS | 180 | 132 | 150 (encoder)/110 (decoder) | 1.2 | 1.2 |

TABLE IV
Implementation Comparison of the SPIHT Architecture and Previous Works

| Design | Technology | Size | Logic Cell (FPGA) | Memory (bit) |
|---|---|---|---|---|
| No list SPIHT [15] | Xilinx Virtex II-XC2V1000 | 4500 slices | 10 125/11 520 | 24 BRAM (24×18K bits) |
| Bit-plane parallel SPIHT [17] | Xilinx Virtex 2000E | 62%+34%+98% | 83 808/43 200 | N/A |
| BPS encoder | Xilinx Virtex V-LX330 | 3592 slices | 22 996/331 876 | 8320 |
| BPS decoder | Xilinx Virtex V-LX330 | 3421 slices | 21 901/331 876 | 6656 |

Fig. 12. Compression efficiency comparison of SPIHT and BPS with six test images. Transform block size is 16 × 16.

The fourth column shows the operating clock frequency that is obtained from the timing analysis after place and route for FPGA targeting. Previous works also use the same timing analysis results from FPGA synthesis results. Note that previous FPGA technology may be slower than that used by the proposed design. To eliminate the effect of technology difference and emphasize the enhancement by the proposed design, the fifth and sixth columns in Table III present additional comparison factors that are the encoder and decoder throughputs normalized to a 1 MHz clock frequency. As shown in the table, the normalized decoder throughput is significantly improved by the proposed design, whereas the normalized encoder throughput is somewhere between those of no list SPIHT and bit-plane parallel SPIHT.
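The normalization used in Table III is a plain division of throughput by clock frequency. The short sketch below recomputes the normalized values from the throughput and frequency figures of Table III; the dictionary layout and function name are illustrative choices, and only entries whose throughput and frequency belong to the same design are included.

```python
def normalized_throughput(mpixels_per_s, clock_mhz):
    """Throughput per MHz of clock, i.e. pixels processed per clock cycle."""
    return mpixels_per_s / clock_mhz


# (throughput in MPixels/s, clock in MHz), taken from Table III.
TABLE_III = {
    "No list SPIHT [15] encoder": (18.4, 100),
    "Two-level SPIHT [16] encoder": (4.35, 10),
    "Bit-plane parallel SPIHT [17] encoder": (224, 56),
    "BPS encoder": (180, 150),
    "BPS decoder": (132, 110),
}

for name, (throughput, clock) in TABLE_III.items():
    print(f"{name}: {normalized_throughput(throughput, clock):.3f}")
```

Since MPixels/s divided by MHz is pixels per cycle, the BPS figure of 1.2 can be read as 1.2 pixels per clock cycle for both the encoder and the decoder.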

Hardware implementations of the various SPIHT algorithms are summarized in Table IV. The results published in [15]–[17] are presented for no list SPIHT, two-level SPIHT, and bit-plane parallel SPIHT, respectively. The second column shows the implementation technology. For previous work, the numbers published in [15]–[17] are given in the table. No list SPIHT sequentially generates 1 or 2 bits in a cycle resulting in a very small throughput of 18.4 Mpixels/s. The hardware occupies 4500 slices of Xilinx Virtex-II that corresponds to 10 125 logic cells (presented in the fourth column in Table IV [23]). Bit-plane parallel SPIHT provides a large throughput that is expected because this algorithm processes all bit-planes concurrently. However, bit-plane parallelism cannot be used for a decoder because the decoder cannot decompose a bitstream bit-plane by bit-plane. The hardware of a bit-plane parallel SPIHT is implemented with three Xilinx Virtex 2000E FPGAs with their capacity consumed by 62%, 34%, and 98%, respectively [17]. This implies that a large amount of hardware logic is necessary to process all bit-planes in parallel. As the capacity of this FPGA is 43 200 logic cells, this hardware cost corresponds to 83 808 logic cells, which is much larger than that of the no list SPIHT [24]. On the contrary, the hardware implementation for the BPS algorithm is not very large because it is designed effectively to meet the target throughput exploiting pixel-level parallelism. As shown in the table, the BPS encoder and decoder require 22 996 and 21 901 logic cells, respectively. The proposed design requires twice as much hardware resources as no list SPIHT, whereas it is about one fourth of the hardware cost for bit-plane parallel SPIHT.
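The logic-cell arithmetic in this paragraph can be reproduced directly from the figures quoted in the text and in Table IV. The sketch below is only a check of that arithmetic; the variable names are illustrative.

```python
# Bit-plane parallel SPIHT: three FPGAs of 43 200 logic cells each,
# used at 62%, 34% and 98% of their capacity.
CAPACITY = 43_200
utilization_percent = [62, 34, 98]
bitplane_parallel_cells = sum(utilization_percent) * CAPACITY // 100

NO_LIST_CELLS = 10_125      # no list SPIHT, Table IV
BPS_ENCODER_CELLS = 22_996  # BPS encoder, Table IV

ratio_to_no_list = BPS_ENCODER_CELLS / NO_LIST_CELLS
ratio_to_bitplane = BPS_ENCODER_CELLS / bitplane_parallel_cells

print(bitplane_parallel_cells)
print(round(ratio_to_no_list, 2), round(ratio_to_bitplane, 2))
```

The ratios come out close to 2.3 and 0.27 for the encoder, which is consistent with the "twice" and "about one fourth" statements in the text.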

In order to avoid any unfair comparison by using different technologies, only the FPGA synthesis results are used for comparison. Although the FPGA used for the proposed design is different from that for the previous designs, all FPGAs are developed by the same company and the logic cell counts can be used for comparison of the hardware cost because they are normalized for comparisons among different FPGAs. From the logic cell counts, it is observed from Table IV that the proposed design requires much less hardware cost than that required by bit-plane parallel SPIHT.

The two-level SPIHT presented in Table III is not presented in Table IV because it runs at a very low clock frequency of 10 MHz when compared with other designs.

TABLE V
Gate Counts of Hardware Modules

| Module | Gate Count (ASIC 0.13 μm) | Logic Cell (Xilinx Virtex V-LX330) |
|---|---|---|
| DWT | 5.5 K | 5214 |
| DSIG | 3.0 K | 2750 |
| C2B | 2.6 K | 2378 |
| BPS encoder core | 12.7 K | 12 654 |
| - 3 parallel pass | 9.5 K | 8994 |
| - Bitstream aligner | 3.2 K | 3661 |
| IDWT | 5.6 K | 5232 |
| B2C | 2.6 K | 2396 |
| BPS decoder core | 14.5 K | 14 273 |
| - 3 parallel pass | 8.4 K | 8303 |
| - Bitstream parser | 6.1 K | 5970 |

Consider the memory requirement for the transform of size 16 × 16 with the wavelet coefficient stored in 13 bits. The wavelet coefficients require 16 × 16 × 13 bits when they are stored in the format shown in Fig. 10(b). After the coefficients are converted into the format in Fig. 10(c), the same size of buffer is also necessary. For the calculation of d_sig, 1/4 × 16 × 16 × 13 bits are necessary, and the same size buffer is also necessary for the format conversion. Thus, the total memory requirement is 8320 bits (= 2.5 × 16 × 16 × 13). Additional buffers temporarily storing the input image and output stream for an encoder (or input stream and output image for a decoder) are necessary, but these buffers are not counted because they may be considered outside the BPS hardware and also may vary depending on the architecture that utilizes the BPS module.
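The memory budget above follows from four buffers, and the sum can be checked as follows. The breakdown into named buffers is an interpretation of the paragraph; the variable names are illustrative.

```python
BLOCK = 16          # transform-block width and height
COEFF_BITS = 13     # bits per wavelet coefficient

coeff_buffer = BLOCK * BLOCK * COEFF_BITS   # Fig. 10(b) format
bitplane_buffer = coeff_buffer              # Fig. 10(c) format, same size
dsig_buffer = coeff_buffer // 4             # one d_sig per 2x2 set
dsig_conversion_buffer = dsig_buffer        # same size, for format conversion

encoder_total = (coeff_buffer + bitplane_buffer
                 + dsig_buffer + dsig_conversion_buffer)
without_dsig = coeff_buffer + bitplane_buffer

print(coeff_buffer, dsig_buffer, encoder_total, without_dsig)
```

The total of 8320 bits equals the encoder memory listed in Table IV. Leaving out the two d_sig buffers gives 6656 bits, which coincides with the decoder entry in Table IV; this reading appears consistent with d_sig being computed only in the encoder, although the paper does not spell out the decoder breakdown.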

The comparison of the hardware cost given in Table IV is a little bit unfair because the DWT sizes of the previous implementations are different from that of BPS. No list SPIHT [15] performs 64×64 DWT whereas two-level SPIHT [16] and bit-plane parallel SPIHT [17] implement the DWT of sizes 8 × 8 and 512 × 512, respectively. As the DWT size affects the compression efficiency, fair comparison requires the same DWT size. The memory buffer size increases in proportion to the DWT size whereas the logic gate count may not change much because the hardware logic speed is fixed as the target throughput (pixels/cycle), which does not change depending on the DWT size. For fair comparison, the memory size given in the rightmost column needs to be weighted by the DWT size.

The detailed information about the hardware size is given in Table V. DSIG (third row) represents the module that calculates d_sig, which is explained in Section IV-A. DSIG calculation is required only for an encoder. C2B (fourth row) represents the module that converts the data format shown in Fig. 10(b) to that in Fig. 10(c), whereas B2C is the module for the format conversion in the opposite direction. The conversion of data pattern from Fig. 10(b) to Fig. 10(c) requires additional hardware resources that are the 16×16×13 buffer, the 1/4×16×16×13 buffer, and the address generation module (C2B in Table V for an encoder and B2C for a decoder). Given that the total buffer size is 2.5×16×16×13, the memory size increases by 50%. On the other hand, the logic gate count increases by 9.2% due to the addition of C2B and B2C modules. The proposed BPS design requires additional hardware units, a bitstream aligner for the encoder and a bitstream parser for the decoder, due to its pass-parallel processing. The gate counts of these two units are 3.2K and 6.1K gates, respectively. These are about 24% and 44% of the hardware costs of the BPS encoder and decoder, respectively. This implies that a substantial overhead is necessary to implement the proposed algorithm. Furthermore, control signal generation for the datapaths may also be slightly increased by the complex conditional control flow of BPS. However, this increased complexity is very small. The reason is that the repeated operation in Fig. 3 is precomputed in the DSIG module that is described in Section IV-A and the rest is implemented by bit-level combinational logic rather than a multiplexer and demultiplexer.

For memory bandwidth, the proposed design does not increase the required bandwidth. The reason is that both BPS and SPIHT require just a single memory access for each bit except the sign bit in the buffer to complete either encoding or decoding. The number of sign bit accesses is also the same for both BPS and SPIHT.

V. Conclusion

The proposed BPS algorithm modified the processing order of the original SPIHT in order to increase the speed of both an encoder and a decoder. As a result, the processing speed of 120 million pixels per second can be achieved by both an encoder and a decoder at a slight sacrifice of compression efficiency. This speed allowed 1920×1080 size video in the 4:2:2 format to be processed at the speed of 30 frames/s. The hardware implementation of BPS with Verilog HDL showed that the required gate count is 43.9 K. The BPS is the only algorithm that allows the fast processing time for both an encoder and a decoder with a small hardware cost. The increased throughput with a small hardware cost made it possible for a SPIHT-based compression to be used for many video compression applications that require fast encoding and decoding time. These applications include frame memory compression for an H.264 or MPEG codec chip, and overdrive detection for LCD displays.

A possible modification of the SPIHT algorithm is to partition an image into multiple blocks (or stripes) so that coefficient trees are local to these blocks. These blocks can then be processed in parallel with a relatively simple algorithm. The block boundaries in the bitstream could be given in the header, or the same bitrate could be used for all blocks. As a result, the decoder can also decode these blocks in parallel. This scheme gives another option that speeds up the algorithm at a sacrifice of compression efficiency. In fact, the implementation in [15] processes four 64 × 64 blocks in parallel, achieving a fourfold speedup. We believe that one limitation of this scheme is that the compression block size cannot be large. This is because the DWT size must also decrease as the compression block size decreases. Note that a small DWT size may substantially reduce the compression efficiency. For applications such as an H.264 video codec chip, a macroblock of size 16 × 16 is used as the basic unit for memory access, and consequently, the memory compression used in an H.264 codec often requires 16 × 16 as the compression block size. Thus, this scheme may not be effective for frame memory compression in an H.264 codec. On the other hand, the proposed BPS does not have any limitation on the DWT size. Another limitation with a large compression block size is that the hardware complexity may increase as the block size increases. The increased complexity makes it difficult to increase the operating clock frequency, and consequently, reduces the throughput. Despite these limitations, this scheme provides a good tradeoff between speed and compression, and it can be combined with BPS to obtain the best tradeoff.
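The block-parallel scheme discussed above can be illustrated with a minimal sketch. It is not the implementation of [15] and not part of BPS. The function names, the header layout (a block count followed by per-block byte lengths), and the use of zlib as a stand-in for a per-block SPIHT coder are all assumptions made for illustration. The sketch only shows how block boundaries stored in a header let a decoder locate each block and decode the blocks independently. It assumes the image dimensions are multiples of the block size and that samples fit in one byte.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(image, block):
    """Partition a 2-D list of 8-bit samples into block x block tiles (row-major)."""
    height, width = len(image), len(image[0])
    tiles = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            tiles.append([row[c:c + block] for row in image[r:r + block]])
    return tiles


def encode_block(tile):
    # Hypothetical stand-in for a per-block SPIHT coder: zlib over the raw samples.
    return zlib.compress(bytes(v for row in tile for v in row))


def decode_block(payload, block):
    flat = zlib.decompress(payload)
    return [list(flat[i:i + block]) for i in range(0, len(flat), block)]


def encode_image(image, block):
    tiles = split_blocks(image, block)
    # Blocks share no state, so they can be encoded concurrently.
    with ThreadPoolExecutor() as pool:
        payloads = list(pool.map(encode_block, tiles))
    # Assumed header layout: block count, then the byte length of each block.
    header = struct.pack("<I", len(payloads))
    header += b"".join(struct.pack("<I", len(p)) for p in payloads)
    return header + b"".join(payloads)


def decode_image(stream, height, width, block):
    (count,) = struct.unpack_from("<I", stream, 0)
    lengths = struct.unpack_from("<%dI" % count, stream, 4)
    # The lengths in the header give every block boundary up front.
    offset = 4 + 4 * count
    payloads = []
    for n in lengths:
        payloads.append(stream[offset:offset + n])
        offset += n
    with ThreadPoolExecutor() as pool:
        tiles = list(pool.map(lambda p: decode_block(p, block), payloads))
    # Reassemble the tiles in the same row-major order used by the encoder.
    image = [[0] * width for _ in range(height)]
    tiles_per_row = width // block
    for i, tile in enumerate(tiles):
        r0 = (i // tiles_per_row) * block
        c0 = (i % tiles_per_row) * block
        for dr, row in enumerate(tile):
            image[r0 + dr][c0:c0 + block] = row
    return image
```

The alternative mentioned in the text, using the same bitrate for all blocks, would make the per-block lengths implicit and remove the need for the length table in the header.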

References

[1] JPEG2000 Image Coding System, document ISO/IEC 15444-1, 2000.

[2] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.

[3] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Process., vol. 9, no. 7, pp. 1158–1170, Jul. 2000.

[4] A. Said and W. Pearlman, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996.

[5] T. Y. Lee, “A new frame-recompression algorithm and its hardware design for MPEG-2 video decoders,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 6, pp. 529–534, Jun. 2003.

[6] Y. Jin, Y. Lee, and H.-J. Lee, “A new frame memory compression algorithm with DPCM and VLC in a 4 × 4 block,” EURASIP J. Adv. Signal Process., vol. 2009, no. 629285, p. 18, 2009.

[7] W.-Y. Chen, L.-F. Ding, P.-K. Tsung, and L.-G. Chen, “Architecture design of high performance embedded compression for high definition video coding,” in Proc. IEEE Int. Conf. Multimedia Expo, Apr.–Jun. 2008, pp. 825–828.

[8] J. Someya, A. Nagase, N. Okuda, K. Nakanishi, and H. Sugiura, “Development of single chip overdrive LSI with embedded frame memory,” in Proc. SID Symp. Dig., vol. 39. 2008, pp. 464–467.

[9] T. B. Yng, B.-G. Lee, and H. Yoo, “A low complexity and lossless frame memory compression for display devices,” IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1453–1458, Aug. 2008.

[10] Y.-H. Lee, Y.-Y. Lee, H.-Z. Lin, and T.-H. Tsai, “A high-speed lossless embedded compression codec for high-end LCD applications,” in Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2008, pp. 185–188.

[11] J.-W. Han, M.-C. Hwang, S.-G. Kim, T.-H. You, and S.-J. Ko, “Vector quantizer based block truncation coding for color image compression in LCD overdrive,” IEEE Trans. Consumer Electron., vol. 54, no. 4, pp. 1839–1845, Nov. 2008.

[12] G. Pastuszak, “A novel architecture of arithmetic coder in JPEG2000 based on parallel symbol encoding,” in Proc. Int. Conf. Parallel Comput. Electr. Eng., 2004, pp. 303–308.

[13] W. A. Pearlman, A. Islam, N. Nagaraj, and A. Said, “Efficient, low-complexity image coding with a set-partitioning embedded block coder,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 11, pp. 1219–1235, Nov. 2004.

[14] F. Wheeler and W. Pearlman, “SPIHT image compression without lists,” in Proc. IEEE ICASSP, vol. 4. Jun. 2000, pp. 2047–2050.

[15] P. Corsonello, S. Perri, G. Staino, M. Lanuzza, and G. Cocorullo, “Low bit rate image compression core for onboard space applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp. 114–128, Jan. 2006.

[16] C.-C. Cheng, P.-C. Tseng, and L.-G. Chen, “Multimode embedded compression codec engine for power-aware video coding system,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 141–150, Feb. 2009.

[17] T. Fry and S. Hauck, “SPIHT image compression on FPGAs,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 9, pp. 1138–1147, Sep. 2005.

[18] A. Nandi and R. Banakar, “Throughput efficient parallel implementation of SPIHT algorithm,” in Proc. Int. Conf. Very Large Scale Integr. Des., 2008, pp. 718–725.

[19] R. Kutil, “Approaches to zerotree image and video coding on MIMD architectures,” Parallel Comput., vol. 28, pp. 1095–1109, Aug. 2002.

[20] V. R. Algazi and J. Estes, “Analysis-based coding of image transform and subband coefficients,” in Proc. SPIE Vis. Commun. Image Process. Conf., 1995, pp. 11–21.

[21] N. Zervas, G. Anagnostopoulos, V. Spiliotopoulos, Y. Andreopoulos, and C. Goutis, “Evaluation of design alternatives for the 2-D-discrete wavelet transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 12, pp. 1246–1262, Dec. 2001.

[22] A. Gupta, S. Nooshabadi, and D. Taubman, “Efficient interfacing of DWT and EBCOT in JPEG2000,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 5, pp. 687–693, May 2008.

[23] Virtex-II Platform FPGAs: Complete Data Sheet, Tech. Doc. DS031 (v3.5) Product Specification, Xilinx, San Jose, CA, 2007, pp. 1–318.

[24] Virtex-E 1.8V FPGAs: Complete Data Sheet, Tech. Doc. DS022 (v3.5) Product Specification, Xilinx, San Jose, CA, 2002, pp. 1–233.

Yongseok Jin (M’09) received the B.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2003 and 2010, respectively.

He is currently a Research Associate with the Microsystems Design Laboratory, Department of Computer Science and Engineering, Pennsylvania State University, University Park. His current research interests include computer architectures and system-on-chip design for video coding and computer vision applications.

Hyuk-Jae Lee received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, in 1996.

From 1998 to 2001, he was with the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, as a Senior Component Design Engineer. From 1996 to 1998, he was on the faculty of the Department of Computer Science, Louisiana Tech University, Ruston. In 2001, he joined the School of Electrical Engineering and Computer Science, Seoul National University, where he is currently a Professor. He is the Founder of Mamurian Design, Inc., Seoul, a fabless system-on-chip (SoC) design house for multimedia applications. His current research interests include computer architectures and SoC design for multimedia applications.
