
IEEE Transactions on VLSI Systems, Dec 1996

VLSI Implementation of Discrete Wavelet Transform

A. Grzeszczak, M. K. Mandal, S. Panchanathan, Member, IEEE, and T. Yeap, Member, IEEE

Visual Computing and Communications Laboratory Department of Electrical Engineering

University of Ottawa, Ottawa, Canada K1N 6N5 Tel. (613) 562-5800 x.6219, Fax (613) 562-5175

email: [email protected]

Abstract - This paper presents a VLSI implementation of the discrete wavelet transform (DWT). The architecture is simple, modular, and cascadable for computation of one- or multi-dimensional DWT. It comprises four basic units: input delay, filter, register bank, and control unit. The proposed architecture is systolic in nature and performs both high-pass and low-pass coefficient calculations with only one set of multipliers. In addition, it requires only a small on-chip interface circuitry for interconnection to a standard communication bus. A detailed analysis of the effect of the finite precision of the data and wavelet filter coefficients on the accuracy of the DWT coefficients is presented. The architecture has been simulated in VLSI and has a hardware utilization efficiency of 87.5%. Being systolic in nature, the architecture can compute the DWT at a data rate of N × 10^6 samples/sec for a clock speed of N MHz.

1. Introduction

In the last decade, there has been an enormous increase in the applications of wavelets in various scientific disciplines [1]-[5]. One of the main contributions of wavelet theory is to relate discrete time filterbanks to the theory of continuous time function spaces. Typical applications of wavelets include signal processing [6], [7], image processing [8]-[10], numerical analysis [11], statistics [12], biomedicine [13], etc. In contrast to other transforms, such as the Fourier or cosine transform, the wavelet transform offers a variety of useful features. Some of these are:

• Adaptive time-frequency windows

• Lower aliasing distortion for signal processing applications

• Computational complexity of O(N), where N is the number of data samples

• Inherent scalability

• Efficient VLSI implementation

Since the DWT requires intensive computations, several architectural solutions using special purpose parallel processors have been proposed [15]-[20] in order to meet the real time requirements of many applications. The solutions include the parallel filter architecture, the SIMD linear array architecture, the SIMD multigrid architecture [17], [19], the 2-D block based architecture [20], and AWARE's wavelet transform processor (WTP) [16]. The first three architectures, namely the parallel filter architecture, the SIMD linear array architecture, and the SIMD multigrid architecture, are special purpose parallel processors that implement a high level abstraction of the pyramid algorithm. The 2-D block based architecture is a VLSI implementation that uses four multiply and accumulate (MAC) units to execute the forward and inverse transforms. It requires a small on-chip memory and implements the 2-D wavelet transform directly without data transposition. However, this feature can be a drawback in certain applications. In addition, the block based architecture may introduce block boundary effects that degrade the visual quality.

AWARE's WTP is capable of computing forward and inverse wavelet transforms for 1-D input data using a maximum of six filter coefficients. It can be cascaded to execute transforms using higher order filters. The WTP has been clocked at speeds of 30 MHz and offers 16-bit precision on input and output data. The DWT computation is executed in a synchronous pipeline fashion and is under complete user control. However, the WTP is a complex design requiring extensive user control; programming such a device is therefore tedious, difficult, and time consuming.

There is a clear need for a DWT chipset that exploits the potential of the DWT, particularly in the areas of decomposition algorithms and hardware implementation, and which operates in a turnkey fashion, where the user is required to input only the data stream and the high-pass and low-pass filter coefficients.


This paper presents the design and VLSI implementation of an efficient systolic array architecture for computing the DWT [21]. The proposed VLSI architecture computes both the highpass and lowpass frequency coefficients in the same clock cycle and thus achieves efficient hardware utilization. The design is simple, modular, and cascadable for computation on 1-D or 2-D data streams of fairly arbitrary size. The proposed architecture requires only a small on-chip interface circuitry for interconnection to a standard communication bus.

The paper is organized as follows. A brief introduction to discrete wavelet transform is presented in section 2. The effect of finite precision in DWT computation is presented in section 3. The VLSI systolic array architecture for computing DWT is presented in section 4, followed by the conclusions in section 5.

2. Discrete Wavelet Transform

This section presents a brief introduction to the discrete wavelet transform (DWT). The DWT represents an arbitrary square integrable function as a superposition of a family of basis functions called wavelets. A family of wavelet basis functions can be generated by translating and dilating the mother wavelet of the family. The DWT coefficients can be obtained by taking the inner product between the input signal and the wavelet functions. Since the basis functions are translated and dilated versions of each other, a simpler algorithm, known as Mallat's tree algorithm or the pyramid algorithm, has been proposed in [2]. In this algorithm, the DWT coefficients of one stage can be calculated from the DWT coefficients of the previous stage as follows:

W_L(n, j) = Σ_m W_L(m, j−1) h(m − 2n)        (1a)

W_H(n, j) = Σ_m W_L(m, j−1) g(m − 2n)        (1b)

where W_L(p, q) is the p-th scaling coefficient at the q-th stage, W_H(p, q) is the p-th wavelet coefficient at the q-th stage, and h(n) and g(n) are the dilation coefficients corresponding to the scaling and wavelet functions, respectively.

For computing the DWT coefficients of discrete-time data, it is assumed that the input data represent the DWT coefficients of a high resolution stage. Eq. 1 can then be used for obtaining the DWT coefficients of subsequent stages. In practice, this decomposition is performed for only a few stages. We note that the dilation coefficients h(n) represent a lowpass filter (LPF), whereas the corresponding g(n) represent a highpass filter (HPF).
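The recursion in Eq. 1a-1b can be sketched in a few lines. This is a behavioral model only, assuming zero-extended boundaries and illustrative Daubechies 4-tap filters rather than the paper's six-tap fixed-point datapath:

```python
import numpy as np

def dwt_stage(a, h, g):
    """One analysis stage of Eq. 1a-1b: filter, then keep every second sample."""
    w_low = np.convolve(a, h)[::2]    # scaling (lowpass) coefficients
    w_high = np.convolve(a, g)[::2]   # wavelet (highpass) coefficients
    return w_low, w_high

# Daubechies 4-tap analysis filters (illustrative; not the paper's Table 1).
h = np.array([0.4829629, 0.8365163, 0.2241439, -0.1294095])
g = np.array([-0.1294095, -0.2241439, 0.8365163, -0.4829629])

a = np.arange(1.0, 9.0)              # 8 input samples
wl1, wh1 = dwt_stage(a, h, g)        # first octave
wl2, wh2 = dwt_stage(wl1, h, g)      # second octave, computed from wl1 only
```

Reapplying the stage to the lowpass output alone is exactly the pyramid structure: each further octave operates on half as many samples.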

Hence, the DWT extracts information from the signal at different scales. The first level of wavelet decomposition extracts the details of the signal (high frequency components), while the second and all subsequent wavelet decompositions extract progressively coarser information (lower frequency components). A schematic of a three stage DWT decomposition is shown in Fig. 1.

Approximate Location of Figure 1

In order to reconstruct the original data, the DWT coefficients are upsampled and passed through another set of lowpass and highpass filters, which is expressed as:

W_L(n, j) = Σ_k W_L(k, j+1) h′(n − 2k) + Σ_l W_H(l, j+1) g′(n − 2l)        (2)

where h′(n) and g′(n) are the lowpass and highpass synthesis filters, respectively, corresponding to the mother wavelet. It is observed from Eq. 2 that the j-th level DWT coefficients can be obtained from the (j+1)-th level DWT coefficients.

Compactly supported wavelets are generally used in various applications. Table 1 lists a few orthonormal wavelet filter coefficients (h(n)) popular in compression applications [6]. These wavelets have the property of having the maximum number of vanishing moments for a given order, and are known as Daubechies wavelets. The entries in columns 2, 3, and 5 provide the filter coefficients with minimum phase, whereas the entries in columns 4 and 6 provide the filter coefficients with the least asymmetric phase.

Approximate Location of Table 1

The 2-D DWT is usually calculated using a separable approach [2]. To start with, the 1-D DWT computation is performed on each row. This is followed by a matrix transposition operation [28]. Next, the DWT operation is executed on each row of the transposed data. Hence, the 2-D DWT can be implemented in a straightforward manner by inserting a matrix transposer between two 1-D DWT modules. Fig. 2 shows a 3-level wavelet decomposition of an image. In the first level of decomposition, one lowpass subimage (LL2) and three orientation-selective highpass subimages (LH2, HL2, HH2) are created. In the second level of decomposition, the lowpass subimage is further decomposed into one lowpass and three highpass subimages (LH3, HL3, HH3). This process is repeated on the lowpass subimage to derive the higher level decompositions. In other words, the DWT decomposes an image into a pyramid structure of subimages with various resolutions corresponding to the different scales. The inverse wavelet transform is calculated in the reverse manner, i.e., starting from the lowest resolution subimages, the higher resolution images are calculated recursively. We note that nonseparable wavelets have also been proposed in the literature. However, they are not widely used because of their complexity.

Approximate Location of Figure 2
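The row-transform / transpose / row-transform scheme can be sketched as follows. The filter values are illustrative Daubechies 4-tap coefficients, and `dwt_rows` and `dwt2` are hypothetical helper names, not the paper's modules:

```python
import numpy as np

def dwt_rows(x, h, g):
    """1-D DWT of each row; lowpass and highpass halves placed side by side."""
    low = np.array([np.convolve(row, h)[::2] for row in x])
    high = np.array([np.convolve(row, g)[::2] for row in x])
    return np.hstack([low, high])

def dwt2(x, h, g):
    """One level of separable 2-D DWT: rows, transpose, rows, transpose back."""
    return dwt_rows(dwt_rows(x, h, g).T, h, g).T

h = np.array([0.4829629, 0.8365163, 0.2241439, -0.1294095])
g = np.array([-0.1294095, -0.2241439, 0.8365163, -0.4829629])

image = np.random.rand(8, 8)
bands = dwt2(image, h, g)   # quadrants correspond to the LL, LH, HL, HH subimages
```

The transpose between the two row passes is what the hardware transposer provides, so the same 1-D module serves both directions.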

2.1 Computational Complexity of the DWT

It is observed from Eq. 1a-1b that the complexity of each stage of wavelet decomposition is linear in the number of input samples, where the constant factor depends on the length of the filter. We note that for a dyadic wavelet decomposition, the number of input samples decreases by 50% at subsequent stages of decomposition. For wavelet order L and number of decomposition stages J, the computational complexity for a 1-D N-point sequence is

C_dyadic = 2L(N + N/2 + N/4 + ... + N/2^(J−1)) FLOP
         = 4NL(1 − 2^(−J)) FLOP        (3)

where FLOP denotes floating-point operations, usually referring to multiplications and additions. We note that a simple polyphase decomposition has been assumed in the above calculation. The complexity can be further reduced using more sophisticated algorithms, such as FFT-based or fast running FIR filtering [14]. However, these algorithms need complex control circuitry for hardware implementation and have hence not been considered in the proposed architecture.

In many applications, a regular tree, instead of a dyadic tree, might be more appropriate. The computational complexity at each stage of a regular tree is 2NL FLOP. Hence, the total complexity for a J-level decomposition is:

C_regular = 2JNL FLOP        (4)

The complexity of an irregular tree, or a wavelet packet algorithm, is upper bounded by C_regular.
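The closed forms in Eq. 3 and Eq. 4 can be checked numerically. The per-stage cost model (2L operations per input sample) follows the text; the parameter values are illustrative:

```python
def dyadic_flops(N, L, J):
    """Sum the per-stage costs: stage s filters N / 2**s samples (Eq. 3)."""
    return sum(2 * (N // 2**s) * L for s in range(J))

def regular_flops(N, L, J):
    """Every stage of a regular tree filters all N samples (Eq. 4)."""
    return 2 * J * N * L

N, L, J = 1024, 6, 3
assert dyadic_flops(N, L, J) == 4 * N * L * (1 - 2**-J)   # 21504 FLOP
assert regular_flops(N, L, J) == 2 * J * N * L            # 36864 FLOP
```

The geometric series N + N/2 + ... + N/2^(J−1) sums to 2N(1 − 2^(−J)), which is where the factor 4NL(1 − 2^(−J)) comes from.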

2.2 Data Dependencies within DWT

The wavelet decomposition of a 1-D input signal for three stages is shown in Fig. 1. The transfer functions of the sixth order highpass (g(n)) and lowpass (h(n)) FIR filters can be expressed as follows:

H(z) = g(0) + g(1)z^−1 + g(2)z^−2 + g(3)z^−3 + g(4)z^−4 + g(5)z^−5        (5a)

L(z) = h(0) + h(1)z^−1 + h(2)z^−2 + h(3)z^−3 + h(4)z^−4 + h(5)z^−5        (5b)

For clarity, the intermediate and final DWT coefficients in Fig. 1 are denoted by a, b, c, d, e, f, and g. The DWT computation is complex because of the data dependencies at different octaves. Eq. 6a-6n show the relationships among a, b, c, d, e, f, and g. We note that the data dependencies across octaves are indicated by identical symbols appearing in more than one octave's equations.

1st octave:

b(0) = g(0)a(0) + g(1)a(−1) + g(2)a(−2) + g(3)a(−3) + g(4)a(−4) + g(5)a(−5)        (6a)

b(2) = g(0)a(2) + g(1)a(1) + g(2)a(0) + g(3)a(−1) + g(4)a(−2) + g(5)a(−3)        (6b)

b(4) = g(0)a(4) + g(1)a(3) + g(2)a(2) + g(3)a(1) + g(4)a(0) + g(5)a(−1)        (6c)

b(6) = g(0)a(6) + g(1)a(5) + g(2)a(4) + g(3)a(3) + g(4)a(2) + g(5)a(1)        (6d)

c(0) = h(0)a(0) + h(1)a(−1) + h(2)a(−2) + h(3)a(−3) + h(4)a(−4) + h(5)a(−5)        (6e)

c(2) = h(0)a(2) + h(1)a(1) + h(2)a(0) + h(3)a(−1) + h(4)a(−2) + h(5)a(−3)        (6f)

c(4) = h(0)a(4) + h(1)a(3) + h(2)a(2) + h(3)a(1) + h(4)a(0) + h(5)a(−1)        (6g)

c(6) = h(0)a(6) + h(1)a(5) + h(2)a(4) + h(3)a(3) + h(4)a(2) + h(5)a(1)        (6h)

2nd octave:

d(0) = g(0)c(0) + g(1)c(−2) + g(2)c(−4) + g(3)c(−6) + g(4)c(−8) + g(5)c(−10)        (6i)

d(4) = g(0)c(4) + g(1)c(2) + g(2)c(0) + g(3)c(−2) + g(4)c(−4) + g(5)c(−6)        (6j)

e(0) = h(0)c(0) + h(1)c(−2) + h(2)c(−4) + h(3)c(−6) + h(4)c(−8) + h(5)c(−10)        (6k)

e(4) = h(0)c(4) + h(1)c(2) + h(2)c(0) + h(3)c(−2) + h(4)c(−4) + h(5)c(−6)        (6l)

3rd octave:

f(0) = g(0)e(0) + g(1)e(−4) + g(2)e(−8) + g(3)e(−12) + g(4)e(−16) + g(5)e(−20)        (6m)

g(0) = h(0)e(0) + h(1)e(−4) + h(2)e(−8) + h(3)e(−12) + h(4)e(−16) + h(5)e(−20)        (6n)

As shown in Fig. 1 and Eq. 6, several intermediate results (c, e) are first computed, and then used to calculate multiple output samples. We note that the intermediate results must be stored and made available for further processing at specific time instants.

3. Finite Precision Effect

The accuracy of the DWT coefficients depends on the precision of both the input data and the DWT coefficients. In this section, the performance of a finite precision wavelet transformer is evaluated. We note that a multistage DWT is calculated recursively. Therefore, in addition to the wavelet decomposition stage, extra storage is required for the intermediate coefficients. Hence, the overall performance depends significantly on the precision of the intermediate DWT coefficients. It is assumed that both the intermediate and final DWT coefficients are represented with the same precision, and that their dynamic range is symmetric about zero. Assume that the precisions of the various data are as follows:

i) Input data: i bits
ii) Intermediate wavelet coefficients: j bits
iii) Wavelet filter coefficients (h(n)): m bits

Generally, the dynamic range of the DWT coefficients is greater than the dynamic range of the input data, and hence j should be greater than i. In our implementation, we multiply the input data by 2^(j−i) to give them j-bit precision. We then calculate the first stage DWT and scale the DWT coefficients back down to j bits. Subsequent stages of DWT decomposition are executed in the same manner.

We note from Table 1 that the maximum absolute value of a filter coefficient is between 0.5 and 1. Therefore, to represent the filter coefficients with m-bit precision, we multiply all the coefficients by 2^(m−1) and round to the nearest integer value. Instead of rounding, we may sometimes have to select either the floor or the ceiling of a scaled coefficient so that Σ h(n) is closest to √2 (for orthonormality) and Σ g(n) is closest to zero. In this case, the filter coefficients range from −2^(m−1) to (2^(m−1) − 1).
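The quantization step above can be sketched as follows. The 4-tap Daubechies values are illustrative stand-ins for the paper's Table 1 entries, and `quantize` is a hypothetical helper name:

```python
import math

def quantize(coeffs, m):
    """Scale by 2**(m-1) and round to the nearest m-bit signed integer."""
    q = [round(c * 2**(m - 1)) for c in coeffs]
    assert all(-2**(m - 1) <= v <= 2**(m - 1) - 1 for v in q)  # m-bit range
    return q

h = [0.4829629, 0.8365163, 0.2241439, -0.1294095]   # illustrative D4 filter
m = 12
q = quantize(h, m)

# Orthonormality target: the quantized taps should sum close to sqrt(2)*2**(m-1);
# floor/ceiling selection would nudge individual taps toward this target.
target = math.sqrt(2) * 2**(m - 1)
error = sum(q) - target
```

For this filter plain rounding already lands within one LSB of the target, so the floor/ceiling adjustment is only occasionally needed.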

DWT coefficients can be recursively calculated using Eq. (1a) and (1b), where W represents the wavelet coefficients of a certain stage and h, g represent the corresponding filter coefficients. The precisions of W and h are j and m bits, respectively. To execute the operation in Eq. 1, we need a j × m bit multiplier. After the multiplication is performed, each term on the right hand side of the equation will have a precision of m + j − 1 bits. We note that the dynamic range of the DWT coefficients will increase because of the sum of several terms. The rate of increase is upper bounded by Σ_n |h(n)|, which is listed in Table 1 for a few wavelets. Table 2 shows the maximum dynamic range of DWT coefficients at various stages. We note that in most cases, the value of z in Table 2 is less than 2. Hence, for a 1-D transform, the dynamic range will at most increase by a factor of 2. In other words, the accumulator should have at least m + j bits precision. Since the precision of the intermediate storage is j bits, all the coefficients are scaled down to j bits by dividing them by 2^m. We note that the input data was scaled up initially by 2^(j−i). After the first stage of decomposition, the coefficients will therefore be effectively 2^(j−i−1) times greater than the ideal coefficients. To derive the ideal DWT coefficients, the resulting DWT coefficients should be scaled down by the scaling factors provided in Table 3.

Approximate Location of Table 2

Approximate Location of Table 3
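The scaling bookkeeping above can be traced with a small worked example; i = 8, j = 12, m = 12 are illustrative choices, not the paper's fixed parameters:

```python
i, j, m = 8, 12, 12

input_scale = 2**(j - i)    # input data promoted to j-bit precision
coeff_scale = 2**(m - 1)    # filter coefficients stored as m-bit integers
post_scale = 2**m           # accumulator scaled back down to j bits

# Net gain of one decomposition stage relative to the ideal coefficients:
net_gain = input_scale * coeff_scale // post_scale
assert net_gain == 2**(j - i - 1)   # the factor 2^(j-i-1) stated in the text

# Accumulator width: product terms have m + j - 1 bits; the summation adds
# at most one more bit (since z < 2 in Table 2), so m + j bits suffice.
product_bits = m + j - 1
accumulator_bits = product_bits + 1
assert accumulator_bits == m + j
```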

The inverse transform is calculated using Eq. 2. The wavelet coefficients are upsampled and passed through a set of LPF and HPF synthesis filters (which are different from, but related to, the analysis filters), and the filter outputs are added for reconstruction. We note that there is a difference between the forward and inverse DWT with respect to the precision of the coefficients. In the forward transform, the dynamic range of the wavelet coefficients increases with the tree depth because of the summation operation in Eq. 1a-1b. However, in the case of the inverse transform, the dynamic range of the coefficients will decrease due to the perfect reconstruction property of the wavelet transform. The inverse transform of the k-th stage wavelet coefficients results in the (k−1)-th stage lowpass coefficients, which have a lower dynamic range. Hence, while performing the inverse DWT, the individual terms of the convolution sum have m + j − 1 bits precision. However, after the summation, the resulting coefficients have m + j − 2 bits precision. To ensure that the coefficients have j-bit precision for intermediate storage, all the coefficients are divided by 2^(m−2). When all the stages have been computed, the resulting output data have j-bit precision. To obtain the original data, the output data should be divided by 2^(j−m).

To determine the accuracy of the finite precision wavelet transformer, 1-D step (shown in Fig. 3a) and sinusoidal (shown in Fig. 4a) input data with 8-bit precision were decomposed for 3 stages using the Daubechies 8-tap wavelet. The differences between the DWT coefficients obtained with 12-bit precision and the ideal coefficients are shown in Figs. 3 and 4. For comparison purposes, the ideal coefficients are shown along with the error coefficients. It is observed that the error coefficients due to finite precision are small when the data is uniform in nature (Fig. 3). However, when the input data changes rapidly, the error coefficients become large (near the step discontinuity, and throughout the sinusoidal signal).

Approximate Location of Figure 3

Approximate Location of Figure 4

A useful measure of the accuracy of the DWT coefficients is the signal to noise ratio (SNR). Here, the signal is the floating point DWT coefficients, and the noise is the difference between the floating point and finite precision coefficients. Fig. 5 shows the performance variation for a 1-D signal with respect to the precision of the filter coefficients, with the DWT coefficients fixed at 12 bits. The data were decomposed with the Daubechies 8-tap wavelet for 3 stages. It is observed that a performance of 50-70 dB SNR can be obtained with 12-bit precision for both the DWT coefficients and the filter coefficients. Two test images, Lena and Mandrill (2-D), were also decomposed using the Daubechies 8-tap wavelet for 3 stages. The performance with respect to various precisions of the DWT and filter coefficients is detailed in Fig. 6. It is observed that with 12-bit precision, an SNR of 40-50 dB can be achieved. We note that the 2-D case involves more stages (row and column) of DWT, resulting in a decrease in the SNR compared to the 1-D case.

Approximate Location of Figure 5

Approximate Location of Figure 6
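The SNR figure of merit used here can be sketched in a few lines; `snr_db` is a hypothetical helper name, and the two-element example is a toy input:

```python
import math

def snr_db(ideal, finite):
    """Power of the floating-point coefficients over the power of the error."""
    signal = sum(x * x for x in ideal)
    noise = sum((x - y) ** 2 for x, y in zip(ideal, finite))
    return float('inf') if noise == 0 else 10 * math.log10(signal / noise)

# Toy example: a 0.01 error on one of two coefficients gives ~47 dB.
snr = snr_db([1.0, 2.0], [1.0, 2.01])
```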

4. The Proposed Systolic Array Architecture

A digit serial wavelet architecture using the pyramid algorithm was outlined in [15]. The proposed systolic array (DWT-SA) architecture is an improvement over this architecture: only one set of multipliers and adders is employed, in contrast to the two parallel computation units employed in [15]. The multiplier and adder set performs all necessary computations to generate all highpass and lowpass coefficients. The DWT-SA architecture does not use any external or internal memory modules to store the intermediate results, and therefore avoids the delays caused by memory access and memory refresh timing. In addition, since a set of registers controlled by a global clock is employed, the control circuitry does not need to move the intermediate products in and out of memory. This results in a simple and efficient systolic implementation of the 1-D DWT computation.

In order to compute separable 2-D DWT, two modules (one for row transform and another for column transform) of the proposed architecture can be used along with a transposer. The details of the schematic and the design of the transposer can be found in [21], [28]. We note that the proposed architecture is a complete setup for computing forward DWT. The inverse DWT can be calculated by replacing the decimator with an interpolator.

4.1 DWT-SA Architecture

The design of DWT-SA is based on a computation schedule derived from Eq. 6a - 6n which are the result of applying the pyramid algorithm for eight data points (N = 8) to the six tap filter. We note that Eq. 1a and 1b represent the highpass and lowpass components of the six tap FIR filter.

The proposed DWT-SA architecture is shown in Fig. 7. It comprises four basic units: Input Delay, Filter, Register Bank, and Control. The following sections present the design of each unit. First, we present the design of the Filter Unit and its subcomponent, the Filter Cell. The design of the Storage Units is then discussed, followed by a description of the Control Unit.

Approximate Location of Figure 7

4.2 Filter Unit (FU)

The Filter Unit (FU) proposed for this architecture is a six-tap non-recursive FIR digital filter whose transfer functions for the highpass and lowpass components are shown in Eq. 5, where g(0)-g(5) and h(0)-h(5) are the coefficients of the HPF and LPF, respectively.

Computation of any DWT coefficient can be executed by employing a multiply and accumulate method in which partial products are computed separately and subsequently added. This feature makes a systolic implementation of the DWT possible. The latency of each filter stage is 1 time unit (TU). Since partial components of more than one DWT coefficient are being computed at any given time, the filter delivers one result per TU once the pipeline has been filled. The systolic architecture of the six-tap filter is shown in Fig. 8. Here, partial results (one per cell) are computed and subsequently passed in a systolic manner from one cell to the adjacent cell.

Approximate Location of Figure 8

Filter Cell (FC)

Eq. 1a-1b show that computations of the highpass and lowpass DWT coefficients at specific time instants are identical except for different values of the LPF and HPF filter coefficients. By introducing additional control circuitry, computations of both highpass and lowpass DWT coefficients can be executed using the same hardware in one clock cycle. The highpass coefficient calculation is performed during the first half of the clock cycle whereas the lowpass coefficient calculation is performed during the second half. Subsequently, the partial results are passed synchronously in a systolic manner from one cell to the adjacent cell. The proposed filter cell therefore consists of only one multiplier, one adder, and two registers to store the high-pass and the low-pass coefficients, respectively, as shown in Fig. 9.

Approximate Location of Figure 9
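A behavioral model of this cell chain can be written as a transposed-form FIR in software. The half-cycle time multiplexing of the single multiplier appears here as two accumulations per clock, and the names are illustrative rather than the paper's:

```python
def filter_chain(samples, g, h):
    """Model of the filter cells: each cell holds one tap of g and one tap of h
    and shares a single multiplier between the two accumulations per clock."""
    L = len(g)
    part_h = [0.0] * L   # highpass partial results, one register per cell
    part_l = [0.0] * L   # lowpass partial results, one register per cell
    outputs = []
    for x in samples:
        y_high = g[0] * x + part_h[0]   # first half-cycle: highpass MAC
        y_low = h[0] * x + part_l[0]    # second half-cycle: lowpass MAC
        for k in range(L - 1):          # pass partial results to the next cell
            part_h[k] = g[k + 1] * x + part_h[k + 1]
            part_l[k] = h[k + 1] * x + part_l[k + 1]
        outputs.append((y_high, y_low))
    return outputs

# An impulse recovers both impulse responses, one output pair per clock.
out = filter_chain([1, 0, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1])
```

Both coefficient streams emerge simultaneously from the same datapath, which is the hardware-sharing property the cell is designed around.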

In order to meet the real time requirements of applications such as video compression, a fast multiplier design is required. For this purpose, a high speed Booth multiplier [24] is used in the filter cell. The Booth multiplier uses a powerful algorithm for signed-number multiplication which treats both positive and negative numbers uniformly, and it is significantly faster than an array multiplier due to the reduced number of add stages required [24], [25]; the additional speedup is a result of using only half the number of adder stages. The full adders and half adders employed inside the multiplier are variants of those described in [26]. Both designs have been modified to reduce the carry-out delay, which is critical in achieving the fastest possible multiplication.

4.3 Storage Units

Two storage units are used in the proposed architecture: Input Delay and Register Bank. The data registers used in these storage units have been constructed from standard Master-Slave, Edge Triggered, D-type flip-flops [26], [27]. The following presents the structure of each storage unit.

Input Delay Unit (ID)

Eq. 1a and 1b show that the value of a computed filter coefficient depends on the present as well as the five previous data samples (the negative time indexes in Eq. 1a-1b correspond to the reference starting time unit 0). It is therefore required that the present and the past five input data values be held in registers and be retrievable by the FU and the CU. Therefore, five data registers are connected serially in a chain, as shown in Fig. 7, and at any clock cycle each register passes its contents to its right neighbor, which results in only the five past values being retained.

Register Bank Unit (RB)

Several registers are required for storage of the intermediate partial results. Analysis of Eq. 1a and 1b justifies the requirement for the register bank; however, it does not determine its size. It will be shown in the next section that 26 serially connected data registers are required to implement the RB.

4.4 Control Unit (CU)

One of the most important aspects of the DWT-SA architecture is its potential for real time operation. The proposed DWT-SA architecture computes N coefficients in N clock cycles and achieves real time operation by executing computations of higher octave coefficients in between the first octave coefficient computations. The first octave computations are scheduled every N/4 clock cycles, while the second and third octaves are scheduled every N/2 and every N clock cycles, respectively.

There are several approaches for scheduling the octave computations. In the DWT-SA architecture, a schedule based on a filter latency of 1 TU is proposed to meet the real time requirements of some applications. The computations are scheduled at the earliest possible clock cycle, and computed output samples are available one clock cycle after they have been scheduled, as shown in Table 4. Pipelining minimizes the delay, facilitating real time operation. For example, the computation of d(0) can only be executed after the calculation of c(0) has been completed. The calculation of c(0), scheduled for cycle 1, is completed in cycle 2 due to the filter latency of 1 TU; d(0) is therefore scheduled for computation in a later cycle, i.e., cycle 4.

Approximate Location of Table 4

The schedule presented in Table 4 is periodic with period N, and the hardware is not utilized in cycle kN+2 where k is a non-negative integer. The computation schedule in Table 4 corresponds to a high hardware utilization of 87.5% (i.e. 7/8).
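The period-8 pattern of Table 4 can be restated as a small classification rule. The helper below is a hypothetical sketch (the function name is not from the paper), using the same 1-based cycle numbering as Table 4:

```python
# Sketch of the period-8 schedule of Table 4: odd cycles carry
# first-octave pairs, cycles 4 and 8 the second octave, cycle 6 the
# third, and cycle 2 is idle.

def scheduled_octave(cycle):
    """Octave computed in a given clock cycle, or None when idle."""
    phase = cycle % 8
    if phase % 2 == 1:      # cycles 1, 3, 5, 7 (mod 8): first octave
        return 1
    if phase in (4, 0):     # cycles 4 and 8 (mod 8): second octave
        return 2
    if phase == 6:          # cycle 6 (mod 8): third octave
        return 3
    return None             # cycle 2 (mod 8): hardware idle

# 7 of every 8 cycles are active: 87.5% hardware utilization
utilization = sum(scheduled_octave(c) is not None for c in range(1, 9)) / 8
```

Summing the active cycles over one period reproduces the 7/8 utilization figure quoted above.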

4.4.1 Register Allocation

The next step in designing the DWT-SA architecture is the design of the Control Unit (CU) and the Register Bank (RB). These two components synchronize the availability of operands. Two schemes can be employed for this purpose: Forward Register Allocation (FRA) and Forward-Backward Register Allocation (FBRA). The FRA method uses a set of registers that are allocated to intermediate data on a first-come, first-served basis; once a register's contents have been accessed, the register is not reassigned to another operand. The FBRA scheme is similar, except that once the stored operand has been used, the register is reallocated to another operand. The FRA method is simpler, requires less control circuitry, and permits easy adaptation of the architecture to coefficient calculations of more than three octaves. It results, however, in less efficient register utilization.

In either scheme, the coefficient computations are periodic; hence, each register containing a specific variable will be reserved for the same variable in the next period. The construction of the register allocation tables for both FRA and FBRA is presented below.

FRA Register Allocation

To demonstrate the construction of the register allocation table using the FRA approach, we will consider the case of computing the coefficients c(0) and c(2).

As shown in Table 4, coefficient c(0) is computed in cycle 1, whereas coefficient c(2) is scheduled for computation in cycle 3. These two coefficients, along with four others, c(4), c(6), c(8), and c(10), are needed for the computation of coefficient e(0) (see Eq. 6k). According to Table 4, the e(0) computation is scheduled for cycle 4. However, the six coefficients c(0), c(2), c(4), c(6), c(8), and c(10) will not be available until cycle 12, and therefore the calculation of e(0) has to be rescheduled for cycle 12 (i.e., 4 + N). Similarly, coefficient e(4) is computed not in cycle 8, but in cycle 16 (i.e., 8 + N). The number of registers needed becomes apparent once one complete frame of computations has been scheduled.

Systematic application of the described method yields a complete schedule of computation for all intermediate and final coefficients, as shown in Table 5. Due to its size, the table has been divided into two parts.

Approximate Location of Table 5

In the FRA register allocation approach, where data moves systolically in one direction only, it is possible to increase the number of DWT decomposition octaves by placing additional registers in series after register R26. The new registers hold the intermediate coefficients needed for the computation of the next octave decomposition. Hardware utilization of the higher octave decomposition registers is inversely proportional to the order of the computed coefficients.

FBRA Register Allocation


Table 5 shows that not all registers in the FRA register allocation scheme are used at every time instant. In fact, close examination of Table 5 suggests that for the first-to-second octave calculations, registers R1 to R11 are used 87.5% of the time, whereas for the second-to-third octave calculation, registers R12 to R26 are used 25% of the time.

Table 6 shows a complete register allocation table for the DWT-SA using the FBRA approach. Higher register utilization is achieved by reallocating each register to another variable once the original variable it holds is no longer valid. This approach, which in the ideal case would make use of every register at each time instant, has a negative impact on the complexity of the architecture: it requires complex control circuitry for data reallocation and results in a less modular architecture compared to the FRA approach. Therefore, FRA rather than FBRA has been employed in the proposed architecture.
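The trade-off between the two schemes can be illustrated with a toy model. The lifetimes below are taken from Table 7; note that this sketch covers a single period only, so its counts are far smaller than the 26 registers of the full three-octave design, where lifetimes also overlap across periods.

```python
# Toy comparison of FRA and FBRA register demand over the single
# period covered by Table 7.

lifetimes = {  # sample: (available_at_cycle, last_cycle_needed)
    "c(0)": (1, 12), "c(2)": (3, 14), "c(4)": (5, 16),
    "c(6)": (7, 18), "e(0)": (12, 18), "e(4)": (16, 38),
}

def fra_registers(lifetimes):
    # Forward-only allocation: every variable gets a fresh register
    # that is never reclaimed within the period.
    return len(lifetimes)

def fbra_registers(lifetimes):
    # Forward-backward allocation: a register is reclaimed as soon as
    # its variable dies, so the demand equals the peak number of
    # variables alive in any one cycle.
    last = max(end for _, end in lifetimes.values())
    return max(
        sum(start <= t <= end for start, end in lifetimes.values())
        for t in range(1, last + 1)
    )
```

Even in this small example FBRA needs fewer registers than FRA, at the price of the bookkeeping that decides when each register may be reclaimed.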

Approximate Location of Table 6

4.4.2 Activity Periods

Tables 5 and 6 show that the last pair of coefficients, i.e., f(0) and g(0), for the first group of eight samples is available at time instance 38. This implies that coefficient computations overlap over five periods. Table 5 also shows the time periods during which the intermediate coefficients must remain valid in order to produce the subsequent higher octaves of coefficients. For example, consider the first octave coefficient c(0). It remains active from the time instant it is computed to the time the next higher octave of coefficients is calculated, i.e., from time instance 1 to time instance 12 in the present configuration. Similarly, e(0), a second octave coefficient, remains active until the third octave coefficients in which it is used are computed, i.e., from time instance 12 to time instance 38. All the intermediate results and their associated periods of activity are listed in Table 7.

Approximate Location of Table 7

The number of registers required in this architecture is directly proportional to the number of levels of DWT decomposition and is determined during the construction of the timetable of computations. For the DWT-SA architecture, which computes three octaves of DWT decomposition and employs the FRA register allocation method, the top row of Table 5 indicates that 26 registers are required.

We note that since no variable in Table 5 has a negative time index, a periodic interpretation of the table is required. Consider, for example, the variable c(-2) in the computation of d(0) and e(0) in cycle 4. The periodic interpretation of Table 5 implies that the register which holds the variable c(-2) in cycle 4 also holds the variable c(-2+8) = c(6) in clock cycle 12 (i.e., 4+8). Table 5 shows that c(6) is held in register R5.
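This periodic reading amounts to simple modular index arithmetic. The helper below is an illustrative sketch (its name is not from the paper):

```python
# Periodic reading of Table 5: the register holding c(k) in cycle t
# holds c(k + N) in cycle t + N, with schedule period N = 8.

N = 8  # schedule period in clock cycles

def variable_in_register(base_index, base_cycle, query_cycle):
    """Index of the c-coefficient held at query_cycle by the register
    that held c(base_index) at base_cycle."""
    periods, rem = divmod(query_cycle - base_cycle, N)
    assert rem == 0, "cycles must differ by a whole number of periods"
    return base_index + periods * N

# e.g. the register holding c(-2) in cycle 4 holds c(6) in cycle 12
```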

4.4.3 Complete Design of CU

The complete design of the Control Unit for DWT-SA architecture is shown in Fig. 10. It schedules the computation of each DWT coefficient as shown in Table 4.

Approximate Location of Figure 10

The CU is a switch that directs data from the Input Delay (ID) or the Register Bank (RB) to the Filter Unit (FU). It is a modular switch with a number of subcomponents equal to the number of taps in the FU. The CU multiplexes data from the ID every second cycle, and from the RB in cycles 4, 6, and 8. In cycle 2, the CU remains idle, i.e., it does not allow any passage of data. Proper timing, synchronization, and enabling and disabling of the CU are ensured by the global CLK signal.

4.5 Timing Considerations

We have examined the design of each component in the proposed DWT-SA architecture. The timing considerations of the architecture are discussed below with respect to Fig. 7.

The number of switching inputs in each control subcell is equal to the number of octave computations. The first octave computations are scheduled every second clock cycle, and hence the corresponding switch input is labeled 2k, where k is any non-negative integer; its inputs are supplied directly by the ID. Second octave computations are executed in clock cycles 4 and 8, which is reflected by the label 4k+4. The third octave computations are scheduled in clock cycle 6, i.e., 8k+6. Both second and third octave computations use partial results from previous octave computations and therefore take their inputs from the RB. Table 5 determines which register is used as the output.

The delay of the DWT-SA architecture consists of the latency period necessary to fill the filter for the first time, in addition to the number of clock cycles through the registers as described in Table 4. The first results are thus produced 43 (i.e., 5+38) clock cycles after the first input sample has entered the pipeline. Subsequent coefficients are available at the output of the pipeline every 8 clock cycles. The DWT coefficients are output from the final filter stage.
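The latency bookkeeping above can be captured in a short sketch (constant names are illustrative):

```python
# Latency of the DWT-SA pipeline: 5 cycles to fill the five
# input-delay registers, plus the 38-cycle path through the schedule,
# puts the first f(0)/g(0) pair at cycle 43; later pairs follow every
# 8 cycles.

FILL_LATENCY = 5       # cycles to load the five delay registers
SCHEDULE_DEPTH = 38    # cycles until f(0), g(0) emerge (Table 5)
PERIOD = 8             # cycles between subsequent output pairs

def output_cycle(k):
    """Cycle at which the k-th (0-based) third-octave pair appears."""
    return FILL_LATENCY + SCHEDULE_DEPTH + k * PERIOD
```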

4.6 Simulation Results


The proposed DWT-SA architecture has been fully simulated in order to validate its functionality [21]. First, analog simulations were performed on each small cell, and later digital simulations were performed on the larger blocks and the final chip. The analog simulations were executed using the HSPICE simulation tool running under the Opus 4.2 design platform. Once the gate-level analog simulations of all subcircuits were completed, digital simulations were performed on groups of subcircuits forming more complex functional circuits. Larger blocks of circuits were progressively assembled and verified until the entire DWT-SA architecture had been simulated. The digital simulator used was the Verilog logic simulator running under Opus 4.2. The process parameters used were those of a 1.2 µm technology.

The dimensions of a single DWT-SA module with 8-bit-wide data and filter coefficients will be approximately 10 mm × 7 mm, with approximately 300,000 transistors on the chip. The power dissipation of the architecture using a 5 volt CMOS process at 20 MHz will be approximately 500 mW to 1 W. A 16-bit architecture would require 4 times more transistors and 4 times more silicon area, and would consume about 4 times more power. However, the simulation results in Section 3 show that an architecture with 12-bit precision is sufficient. In this case, the required number of transistors and the power dissipation will be approximately 2.25 times those of the 8-bit architecture.
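The quoted factors of 4 and 2.25 follow from a quadratic word-length model, which this one-liner sketches (the quadratic cost model is an assumption consistent with multiplier-dominated area):

```python
# Word-length scaling: with multiplier-dominated area and power
# growing roughly as the square of the word length b, cost relative
# to the 8-bit design is (b / 8)**2.

def relative_cost(bits, base_bits=8):
    """Approximate transistor-count / area / power ratio versus the
    base_bits-wide design (quadratic model, an assumption)."""
    return (bits / base_bits) ** 2
```

A 16-bit design thus costs about 4 times the 8-bit design, and a 12-bit design about 2.25 times.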

One of the most popular applications of DWT is video processing. With a frame rate of 30 frames/sec, the video processor should process a complete frame in less than 33 ms. It has been found that the proposed architecture can execute the DWT computations on a monochrome 512 × 512 frame in 13 ms, with 1.2 µm technology and a 20 MHz clock rate. The computation on a color frame will take about 39 ms, and hence cannot be executed in real time. However, an additional 35% speedup can be expected if 0.8 µm technology or below is employed. This speedup will enable the architecture to perform real time DWT computations for color video sequences.
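These figures are consistent with a throughput of one sample per clock cycle. The back-of-the-envelope sketch below reproduces them, assuming three planes per color frame and modeling the quoted 35% speedup as a factor of 1.35 (both modeling assumptions, not statements from the paper):

```python
# Back-of-the-envelope check of the video timing claims at one sample
# per clock cycle.

CLOCK_HZ = 20e6            # 20 MHz clock, 1.2 um technology
FRAME_SAMPLES = 512 * 512  # monochrome 512 x 512 frame

mono_ms = FRAME_SAMPLES / CLOCK_HZ * 1e3   # ~13 ms per monochrome frame
color_ms = 3 * mono_ms                     # ~39 ms per colour frame
color_fast_ms = color_ms / 1.35            # with the 0.8 um speedup

REAL_TIME_BUDGET_MS = 1000 / 30            # 33 ms at 30 frames/sec
```

The color frame misses the 33 ms budget at 20 MHz but fits within it once the 35% speedup is applied.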

5. Conclusion

A systolic VLSI architecture for computing the one dimensional DWT in real time has been presented. The architecture is simple, modular, and cascadable, and has been implemented in VLSI. The implementation employs only one multiplier per filter cell and hence results in a considerably smaller chip area. Implemented in a 1.2 µm technology and running at 20 MHz, the architecture achieves real-time DWT computation for monochrome 512 × 512 video input.

Acknowledgment

This research was funded by the Microelectronics Network (Micronet) under the Network Centers of Excellence (NCE) program of the Government of Canada.

References

[1] I. Daubechies, “Orthonormal bases of compactly supported wavelets,” Comm. Pure Appl. Math., Vol. 41, pp. 906-966, 1988.

[2] S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989.

[3] M. Vetterli and C. Herley, “Wavelets and filter banks: theory and design,” IEEE Transactions on Signal Processing, Vol. 40, No. 9, pp. 2207-2232, 1992.

[4] R. A. Gopinath, Wavelets and Filter Banks - New Results and Applications, Ph.D Dissertation, Rice University, Houston, Texas, 1993.

[5] Y. Meyer, Wavelets: Algorithms and Applications, SIAM, Philadelphia, 1993.

[6] A. N. Akansu and R. A. Haddad, Multiresolution Signal Decomposition: Transform, Subbands and Wavelets, Academic Press Inc., 1992.

[7] O. Rioul and M. Vetterli, “Wavelets and signal processing,” IEEE Signal processing Magazine, pp. 14-38, Oct. 1991.

[8] R. A. DeVore, B. Jawerth and B. J. Lucier, “Image compression through wavelet transform coding,” IEEE Trans. on Information Theory, Vol. 38, No. 2, pp. 719-746, March 1992.

[9] M. Unser, “Texture classification and segmentation using wavelet frames,” IEEE Trans. on Image processing, Vol. 4, No. 11, pp. 1549-1560, Nov. 1995.

[10] S. G. Mallat, “Multifrequency channel decompositions of images and wavelet models,” IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 37, No. 12, pp. 2091-2110, Dec. 1989.

[11] G. Beylkin, R. R. Coifman and V. Rokhlin, “Wavelets in numerical analysis,” in Wavelets and Their Applications, pp. 181-210, Jones and Bartlett, 1992.

[12] M. A. Stoksik, R. G. Lane and D. T. Nguyen, “Accurate synthesis of fractional Brownian motion using wavelets,” Electronics Letters, Vol. 30, No. 5, pp. 383-384, March 1994.

[13] L. Senhadji, G. Carrault, and J. J. Bellanger, “Interictal EEG spike detection: A new framework based on wavelet transform,” Proc. of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pp. 548-551, Philadelphia, Oct 1994.

[14] O. Rioul and P. Duhamel, “Fast algorithms for discrete and continuous wavelet transforms,” IEEE Trans. on Information Theory, Vol. 38, No. 2, March 1992.


[15] K. K. Parhi and T. Nishitani, “VLSI architectures for discrete wavelet transforms”, IEEE Trans. on VLSI Systems, pp. 191-202, June 1993.

[16] Aware Wavelet Transform Processor (WTP) Preliminary, Aware Inc., Cambridge, MA.

[17] M. Vishwanath, “Discrete wavelet transform in VLSI,” Proc. IEEE Int. Conf. on Application Specific Array Processors, pp. 218-229, 1992.

[18] K. K. Parhi, “Video data format converters using minimum number of registers,” IEEE Trans. on Circuits and Systems for Video Tech., Vol. 2, pp. 255-267, June 1992.

[19] C. Chakrabarti, M. Vishwanath and R. M. Owens, “Architectures for wavelet transforms,” Proc. IEEE VLSI Signal Processing Workshop, pp. 507-515, 1993.

[20] Y. Kang, "Low-power design of wavelet processors", Proc. of SPIE, vol. 2308, pp. 1800-1806, 1993.

[21] A. Grzeszczak, VLSI Architecture for Discrete Wavelet Transform, M.A.Sc. thesis, Department of Electrical Engineering, University of Ottawa, Canada, 1995.

[22] G. Knowles, "VLSI architecture for the discrete wavelet transform," Electronics Letters, Vol. 26, No. 15, pp. 1184-1185, Jul. 1990.

[23] M. Vishwanath, “The recursive pyramid algorithm for the discrete wavelet transform”, IEEE Trans. on Signal Processing, March 1994.

[24] A. D. Booth, "A signed binary multiplication technique", Quarterly Journal of Mechanics and Applied Mathematics, Vol. 4, pp. 236-240, 1951.

[25] G. J. Hekstra, Multiplier Architectures for VLSI Implementation, Technical Report No. 90-104, Delft University of Technology, Netherlands, Nov. 1990.

[26] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, June 1988.

[27] J. Cavanagh, Digital Computer Arithmetic, McGraw-Hill, 1984.

[28] S. Panchanathan, “Universal architecture for matrix transposition,” IEE Proceedings-E, Vol. 139, No. 5, Sept 1992.


BIOGRAPHIES

Aleksander Grzeszczak received the B.A.Sc. and M.A.Sc. degrees in Electrical Engineering from the University of Ottawa in 1991 and 1995, respectively. In between, he worked for Mitel Corporation in Ottawa, for Philips-Faselec in Zurich, and as a memory design engineer for Mosaid Technologies, also in Ottawa. As a graduate student in the Multimedia Communications Research Laboratory, he published four papers dealing with VLSI implementation of the discrete wavelet transform. In order to bridge technical and business interests, he is presently pursuing a graduate degree in International Business and Finance at Columbia University.

Mrinal Kr. Mandal received his B.E. (Electronics and Communication Engineering) degree from the University of Burdwan, India in 1987; M.E. in Electronics and Tele-Communication Engineering from the University of Calcutta, India in 1989; and M.A.Sc. in Electrical Engineering from the University of Ottawa, Canada in 1994. Since 1989 he has been working as a Scientist/Engineer at the Space Applications Centre, Ahmedabad, India. He is currently on study leave, pursuing his Ph.D. degree in Electrical Engineering as a Canadian Commonwealth Fellow at the University of Ottawa, Canada. His research interests include image and signal processing, image and video compression, wavelets, filterbanks, and remote sensing.

Sethuraman Panchanathan received his B.S. (Physics) degree from the University of Madras, India in 1981; B.E. (Electronics and Communication Engineering) degree from the Indian Institute of Science, India in 1984; M. Tech degree in Electrical Engineering from the Indian Institute of Technology, Madras, India in 1986; and the Ph.D. degree in Electrical Engineering from the University of Ottawa in 1989. Dr. Panchanathan is presently an Associate Professor in the Department of Electrical Engineering, University of Ottawa. He is the Director of the Visual Computing and Communications Laboratory at the University of Ottawa and leads a team of Post-doctoral Fellows, Research Engineers and Graduate students working in the areas of Compression, Indexing, Storage, Retrieval and Browsing of Images and Video, VLSI Architectures for Multimedia Processing, Parallel Processing and Multimedia Communications. He has published over 80 papers in refereed journals and conferences. He is an associate editor of the IEEE Transactions on Circuits and Systems for Video Technology, and Journal of Visual Communication and Image representation. He is a member of the IEEE, SPIE, EURASIP, CSECE, APEO.


Tet Yeap received the B.A.Sc. degree in electrical engineering from Queen's University in 1982, and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, in 1984 and 1991, respectively. He is currently an assistant professor in the Department of Electrical Engineering, University of Ottawa. His research interests include broadband access architecture, neural networks, multimedia, parallel computer architectures, and dynamics and control.


Figure 1. Three stage DWT decomposition using pyramid algorithm. (Diagram omitted: the N-sample input a passes through three cascaded high-pass/low-pass filter pairs (H1/L1, H2/L2, H3/L3), each followed by downsampling by 2, producing the b, c, d, e, f, and g subband samples.)

Figure 2. (a) Wavelet transform decomposition of an image into 4 sub-images (LL, LH, HL, HH). (b) A three level image decomposition. (Diagrams omitted.)


Figure 3. Ideal DWT coefficients and errors due to 12 bit precision of coefficients. The data were decomposed with the Daubechies 8 tap wavelet for 3 stages. The dotted lines represent errors due to 12 bit coefficients (plotted magnified 200 times). a) The step input, b) stage 3 lowpass coefficients, c) stage 3 highpass coefficients, d) stage 2 highpass coefficients, e) stage 1 highpass coefficients. (Plots omitted.)


Figure 4. Ideal DWT coefficients and errors due to 12 bit precision of coefficients. The data were decomposed with the Daubechies 8 tap wavelet for 3 stages. The dotted lines represent errors due to 12 bit coefficients (magnified 100 times in (b) and 5 times in (e)). a) The sinusoidal input, b) stage 3 lowpass coefficients, c) stage 3 highpass coefficients, d) stage 2 highpass coefficients, e) stage 1 highpass coefficients. (Plots omitted.)


Figure 5. Performance variation with respect to filter coefficient precision (SNR in dB versus precision in bits, for sine and step inputs). (Plot omitted.)

Figure 6. Performance variation with respect to a) filter coefficient and b) DWT coefficient precision (SNR in dB, for the Lena and Mandrill images). (Plots omitted.)


Figure 7. The proposed DWT-SA architecture. (Diagram omitted: the input delay chain D1-D5 feeds the six-tap filter F1-F6; the register bank R1-R26 and the control unit, with multiplexer inputs labeled 2K, 4K+4, and 8K+6, route data back to the filter.)


Figure 8. Systolic operation of the six tap filter. (Diagram omitted: the 6-tap filter holds samples I(nT) through I(nT-5), shifted on each CLK.)

Figure 9. Proposed filter cell. (Diagram omitted: each cell multiplexes between the H and L coefficients, multiplies the incoming sample, and adds the partial result from the previous stage.)


Figure 10. The Control Unit (CU) block diagram. (Diagram omitted: six 4:1 multiplexers select among the delayed inputs I, Iz^-1, ..., Iz^-5 from the ID and operands from the RB, under CLK control, and feed the Filter Unit.)


Table 1. Filter coefficients of Daubechies wavelets.

Coefficient   Daub-6      Daub-8      Daub-8(LA)(1)  Daub-10     Daub-10(LA)
h(0)          0.332671    0.230378    -0.075766      0.160102    0.027333
h(1)          0.806892    0.714847    -0.029636      0.603829    0.029519
h(2)          0.459878    0.630881     0.497619      0.724309    -0.039134
h(3)          -0.135011   -0.027984    0.803739      0.138428    0.199398
h(4)          -0.085441   -0.187035    0.297858      -0.242295   0.723408
h(5)          0.035226    0.030841    -0.099220      -0.032245   0.633979
h(6)                      0.032883    -0.012604      0.077571    0.016602
h(7)                      -0.010597    0.032223      -0.006241   -0.175328
h(8)                                                 -0.012581   -0.021102
h(9)                                                 0.003336    0.019539
Σ h(n)        1.855118    1.865446     1.848663      2.000938    1.885342
max |h(n)|    0.806892    0.714847     0.803739      0.724309    0.723408

(1) LA refers to filters with least asymmetric phase.

Table 2. Dynamic range of DWT coefficients at various stages (z = Σ h(n)).

                    1-D                  2-D
Input data          (-d, d)              (-d, d)
1st level coeff.    (-d*z, d*z)          (-d*z^2, d*z^2)
2nd level coeff.    (-d*z^2, d*z^2)      (-d*z^4, d*z^4)
3rd level coeff.    (-d*z^3, d*z^3)      (-d*z^6, d*z^6)


Table 3. Scaling factors: to obtain the ideal coefficients, the j-bit DWT coefficients should be divided by the scaling factor.

                    1-D          2-D
1st level coeff.    2^(j-i-1)    2^(j-i-2)
2nd level coeff.    2^(j-i-2)    2^(j-i-4)
3rd level coeff.    2^(j-i-3)    2^(j-i-6)

Table 4. Schedule for one complete set of computations.

Cycle   High-pass   Low-pass
1       b(0)        c(0)
2       -           -
3       b(2)        c(2)
4       d(0)        e(0)
5       b(4)        c(4)
6       f(0)        g(0)
7       b(6)        c(6)
8       d(4)        e(4)


Table 5. FRA Register Allocation

Cycle Com R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 1 c(0) 2 c(0) 3 c(2) c(0) 4 e(0) c(2) c(0) 5 c(4) c(2) c(0) 6 g(0) c(4) c(2) c(0) 7 c(6) c(4) c(2) c(0) 8 e(4) c(6) c(4) c(2) c(0) 9 c(0) c(6) c(4) c(2) c(0) 10 c(0) c(6) c(4) c(2) c(0) 11 c(2) c(0) c(6) c(4) c(2) c(0) 12 e(0) c(2) c(0) c(6) c(4) c(2) c(0) 13 c(4) e(0) c(2) c(0) c(6) c(4) c(2) 14 g(0) c(4) e(0) c(2) c(0) c(6) c(4) c(2) 15 c(6) c(4) e(0) c(2) c(0) c(6) c(4) 16 e(4) c(6) c(4) e(0) c(2) c(0) c(6) c(4) 17 c(0) e(4) c(6) c(4) e(0) c(2) c(0) c(6) 18 c(0) e(4) c(6) c(4) e(0) c(2) c(0) c(6) 19 c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) 20 e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) 21 c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) 22 g(0) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) 23 c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) 24 e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) 25 c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(0) 26 c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) 27 c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) 28 e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) 29 c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(4) 30 g(0) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) 31 c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) 32 e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) 33 c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(0) 34 c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) 35 c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) 36 e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) 37 c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(4) 38 g0) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2)


Table 5. FRA Register Allocation (continued).

Cycle Com R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 1 c(0) 2 3 c(2) 4 e(0) 5 c(4) 6 g(0) 7 c(6) 8 e(4) 9 c(0) 10 11 c(2) 12 e(0) 13 c(4) 14 g(0) 15 c(6) 16 e(4) 17 c(0) 18 19 c(2) 20 e(0) 21 c(4) 22 g(0) 23 c(6) 24 e(4) 25 c(0) 26 e(0) 27 c(2) e(0) 28 e(0) e(0) 29 c(4) e(0) 30 g(0) e(4) e(0) 31 c(6) e(4) e(0) 32 e(4) e(4) e(0) 33 c(0) e(4) e(0) 34 e(0) e(4) e(0) 35 c(2) e(0) e(4) e(0) 36 e(0) e(0) e(4) e(0) 37 c(4) e(0) e(4) e(0) 38 g0) e(4) e(0) e(4) e(0)


Table 6. FBRA Register Allocation.

Tabl Com R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 1 c(0) 2 c(0) 3 c(2) c(0) 4 e(0) c(2) c(0) 5 c(4) c(2) c(0) 6 g(0) c(4) c(2) c(0) 7 c(6) c(4) c(2) c(0) 8 e(4) c(6) c(4) c(2) c(0) 9 c(0) c(6) c(4) c(2) c(0) 10 c(0) c(6) c(4) c(2) c(0) 11 c(2) c(0) c(6) c(4) c(2) c(0) 12 e(0) c(2) c(0) c(6) c(4) c(2) c(0) 13 c(4) e(0) c(2) c(0) c(6) c(4) c(2) 14 g(0) c(4) e(0) c(2) c(0) c(6) c(4) c(2) 15 c(6) c(4) e(0) c(2) c(0) c(6) c(4) 16 e(4) c(6) c(4) e(0) c(2) c(0) c(6) c(4) 17 c(0) e(4) c(6) c(4) e(0) c(2) c(0) c(6) 18 c(0) e(4) c(6) c(4) e(0) c(2) c(0) c(6) 19 c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) 20 e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0) 21 c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) 22 g(0) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) 23 c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) 24 e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) 25 c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(0) 26 c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(0) 27 c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) 28 e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) 29 c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) e(4) 30 g(0) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) e(4) 31 c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) e(0) 32 e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) e(0) 33 c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) c(6) e(4) e(0) 34 c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) c(6) e(4) e(0) e(0) 35 c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) e(0) e(4) e(0) 36 e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) e(0) e(4) 37 c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) c(2) e(0) e(0) e(4) 38 g(0) c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) c(2) e(0) e(0) e(4)


Table 7. Activity periods for intermediate results.

Sample   Available at cycle   Life period
c(0)     1                    1 to 12
c(2)     3                    3 to 14
c(4)     5                    5 to 16
c(6)     7                    7 to 18
e(0)     12                   12 to 18
e(4)     16                   16 to 38