CSE 450: Design and Analysis of Algorithms P-173Power, Speed and Area Optimization of FFT processor
POWER, SPEED AND AREA
OPTIMIZATION OF FFT PROCESSORS
P-173
Md. Hafijur Rahman
Sashvat Sai
Sudarshan Suresh
Page 1 of 40
INTRODUCTION
Digital Signal Processing has progressed by leaps and bounds in
the last few decades due to a huge growth in the semiconductor industry. One of the most
fundamental algorithms in the field of Digital Signal Processing is the Fast Fourier
Transform [FFT] and its inverse [IFFT]. The FFT/IFFT are widely used in areas
such as telecommunications, speech and image processing, medical electronics and
seismic processing.
The FFT is basically a fast and efficient method of computing the
Discrete Fourier Transform [DFT]. A Fourier transform is a technique used to
transform signals in the time domain to signals in the frequency domain. This
transformation is useful because many complex operations in the time domain reduce to
less complex problems in the frequency domain, which is a definite advantage from an
implementation standpoint.
Fourier transform is usually applied to continuous and aperiodic
signals. Unfortunately, computers cannot handle a continuous signal; they have to work with
discrete samples. Therefore we need a discrete form of the Fourier transform, known as
the Discrete Fourier Transform [DFT]. In a DFT both the time and frequency components
are discretized.
The DFT of a time domain signal x(n) is given by the following
equation:

     X(k) = Σ (n = 0 to N−1) x(n) e^(−j2πkn/N),   k = 0, 1, …, N−1
It may be noted that the number of complex multiply and add operations required by the
simple forms of both the DFT and IDFT is of order N². This is because there are N data
points to calculate, each of which requires N complex arithmetic operations, so the
complexity of this algorithm is O(N²). The number of adders and multipliers required by
this direct transform, together with its long running time, makes it unsuitable for most
practical DSP applications.
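As a concrete illustration, here is a minimal Python sketch of the direct DFT (hypothetical; the report's own experiments used MATLAB), showing the N multiply-accumulate operations needed per output point:

```python
import cmath

def dft(x):
    """Direct DFT: each of the N outputs needs N complex
    multiply-and-add operations, giving O(N^2) work overall."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]
```

For N = 1024 the inner loop performs 1024 × 1024 complex multiplications, which is exactly the cost the FFT avoids.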
In computing the DFT, greater efficiency results from using a divide and conquer
approach and decomposing the computation into successively smaller DFT computations.
In this process, we exploit both the symmetry and the periodicity of the complex
exponential W_N^(kn) = e^(−j2πkn/N).
Most FFT algorithms are used for N which is a power of 2. These algorithms are
known as radix-2 algorithms. Since N is an even integer, we can compute X(k) by
separating x(n) into two (N/2)-point sequences consisting of the even-numbered points and
the odd-numbered points in x(n). Therefore, we can proceed along similar lines by
decomposing the N-point sequence into two N/2-point sequences, then decomposing those
into N/4-point subsequences, and continuing until we are left with only 2-point
transforms. This requires log2 N stages of computation, and these algorithms have a
complexity of O(N log N). Each stage in this algorithm requires N complex additions and
multiplications; therefore, computing the FFT requires N log2 N complex additions and
multiplications.
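The even/odd decomposition described above can be sketched directly as a recursive radix-2 decimation-in-time FFT (a minimal Python illustration, not the hardware implementation discussed in this report):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of 2.
    log2(N) levels of recursion, each doing O(N) butterfly work."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])   # even-numbered points
    odd = fft(x[1::2])    # odd-numbered points
    out = [0] * N
    for k in range(N // 2):
        # twiddle factor W_N^k = e^(-j 2 pi k / N)
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```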
A comparison between computing the DFT using the FFT approach and directly by the
formula reveals that, for a 1024-point DFT, the direct method would involve an enormous
1,048,576 multiplications while the FFT method would involve an estimated 10,240
multiplications, a reduction of roughly 100 times. This reduction in
the number of multiplications is achieved at the expense of the number of adders, but the cost
involved in implementing an adder is much less than the cost of
building a multiplier. This reduction in the number of multiplications is a great advantage
in terms of power, area and speed.
The term 'FFT' is actually slightly ambiguous, because there are several commonly used
'FFT' algorithms. There are two different radix-2 algorithms, called 'Decimation In Time'
(DIT) and 'Decimation In Frequency' (DIF). Both rely on the
recursive decomposition of an N-point transform into two (N/2)-point transforms. This
decomposition process can be applied to any composite (non-prime) N; it just so happens
that it is particularly simple if N is divisible by 2 (and if N is a power of 2, the
decomposition can be applied repeatedly until the trivial 1-point transform is reached).
The three major parameters used as a benchmark for the performance of present day
VLSI systems are: Power, Speed and Area. All three have to be optimized for a good
system. The various algorithms used to minimize the power dissipation in an FFT
processor are dealt with in Chapter 2. The methods that can be used to maximize the speed of
operation of an FFT processor are considered in Chapter 3. Area optimization techniques
are discussed in Chapter 4.
Chapter 2
POWER OPTIMIZATION OF FFT
In recent years, there has been a strong rise in the demand for portable devices, and a major factor
in the weight and size of these devices is the battery. Battery
life depends greatly on the power dissipated in the circuit. Even in non-portable
applications, reducing power consumption has become the need of the hour due to the
increased cost of providing cooling mechanisms for these devices.
For all signal-processing applications, it is ideal to maintain a constant throughput
throughout the range of operation: there is no major incentive in doing the
operations faster than a given rate, since the processor might then have to wait until further
processing is required. This is very much unlike a general-purpose processor, where speed
is one of the major considerations.
Since CMOS circuits dissipate power only when they are switching, a major focus of
power reduction is to reduce the number of switching events required to perform a
computation. This can range from simply shutting down the complete circuit or portions
of it, to more sophisticated approaches where clocks are gated or optimized circuit
architectures are used to minimize the number of transitions.
Power dissipation in a CMOS circuit is due to three components: switching power, which
accounts for 80-85% of the total power dissipated; short-circuit power, which is
negligible; and leakage power, which accounts for 15-20% of the power dissipated.
The switching power of a circuit is directly proportional to the square of the supply voltage
and is also proportional to the clock frequency, the load capacitance and the probability of
0-to-1 transitions.
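The switching-power relationship above can be written down directly; a small illustrative sketch (the parameter values here are made up for the example):

```python
def switching_power(alpha, c_load, vdd, f_clk):
    """Dynamic switching power: P = alpha * C_load * Vdd^2 * f_clk,
    where alpha is the probability of a 0-to-1 transition per cycle."""
    return alpha * c_load * vdd ** 2 * f_clk

# Because of the quadratic Vdd term, halving the supply voltage at the
# same clock frequency cuts the switching power by a factor of 4.
p_full = switching_power(0.2, 1e-12, 3.3, 100e6)
p_half = switching_power(0.2, 1e-12, 1.65, 100e6)
```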
This project deals with power reduction in FFT processors. The key areas on which we have
concentrated to reduce the power dissipated are as follows:
1. Butterfly Module.
2. Memory and
3. Coefficient optimization module (this includes the input data along with the
twiddle factors)
It is important that power consumption be reduced because the higher the power
consumption, the shorter the battery life. Also, when the speed of data transfer and the
operation of the butterfly modules are not critical, we can operate the chip at a reduced
voltage for the same throughput.
Numerous research papers have been published in the areas of FFT/IFFT
processors. There are various projects currently being carried out to reduce the power
consumption in FFT processors. Some of the methods by which power is reduced are
1. Using word-length reduction of the inputs.
2. Partitioning the coefficients.
3. Using DRAM instead of SRAM, thereby reducing the number of transistors.
4. Intelligent RAM.
5. Mapping intelligent code on to the memory.
6. Using decimation in time and decimation in frequency for the butterfly
operations.
7. Using pipelining.
8. Trading speed for power reduction.
9. Shutting off part of the chip.
10. Code compaction of software embedded in memory.
11. Using multi-port memory.
We suggest a modification of an existing algorithm for partitioning the coefficients
used in the FFT operations. We first demonstrate how power consumption is reduced in
the multipliers by the use of a partitioning algorithm along with clustering. The algorithm
has been implemented using MATLAB and the results are shown. This algorithm can be
used for any N-point FFT in optimizing power.
We then propose a method to reduce the power consumption in
butterfly operations by considering the tradeoff between spatial parallelization and time
multiplexing.
Pipelined FFT Architecture:
The FFT operation has been proven to be both computation intensive, in terms of
arithmetic operations, and communication intensive, in terms of data
swapping/exchanging in storage. High-speed operation is obtained either by a high-frequency
clock or by parallel or pipelined processors operating near the sampling frequency. It
has been shown that the latter is preferable when the application environment limits
power consumption. Each stage in the FFT requires the reading and writing of all N data
words. The pipelined architecture proposed has M butterfly stages with M-1
buffer/interconnection stages as shown below [1].
Fig 1: Baseline pipelined FFT architecture
During the operation of the pipelined FFT, r data samples (for a radix-r butterfly
unit in each stage) appear at the input of each butterfly block. Given a throughput
requirement, the number of butterflies in each stage determines the operating speed and
degree of parallelism in each butterfly of that stage.
Since we are concerned only with power reduction, we can reduce the supply
voltage, thereby reducing the power consumption significantly, since power is
proportional to the square of Vdd (the supply voltage). This can be used in applications where
we are keen on reducing power while keeping the same throughput. If we can compromise
a little further on speed, we can reduce the supply voltage still more.
Coefficient Optimization:
In this method, the bit patterns of the twiddle factors are manipulated to
reduce the number of actual multiplications that take place in a butterfly unit. The
approach scales the values of the given coefficients to derive a representation that enables
the partitioning of the original multiplication into several small multiplications that can
be performed in parallel. The coefficient values are chosen so that an error bound for the
overall computation is satisfied. Therefore the supply voltage can be reduced to decrease
power dissipation while maintaining any given throughput and quantization error
constraints. Additional reductions in power dissipation are achieved by disabling the rows
of the multiplier that correspond to multiplications by zero and thus do not affect the final
result of the multiplication. The main limitation of this method is that all the operations
are performed at runtime, so hardwiring based on previous logic is not possible.
Fig 2: Effect of Cluster width C on multiplier Structure
Clustering:
The cluster width C of an N-bit coefficient Y is defined as the distance between
the first and the last nonzero bits in Y. The cluster width of a coefficient depends on the
number representation. For example, in two’s complement representation, the string
0000011100 denotes the decimal number 28, and here the cluster width is 3. But in signed
digit representation it is 000100b00 (b represents a −1), and the cluster width is 4. By using
clustering we reduce the effective coefficient width to Cmax from the original value of N. From a power
dissipation standpoint, this clustering results in two main benefits. First, the multiplier
can be designed to reduce switching activity by ignoring the bit positions outside the
cluster. Secondly, the supply voltage required to meet a given throughput requirement
can be reduced, since the worst-case critical path among all multipliers is decreased.
These benefits are more visible in array multipliers.
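The cluster-width definition can be stated in a few lines of Python (a hypothetical helper; coefficients are represented here as digit strings, with 'b' standing for the signed digit −1):

```python
def cluster_width(digits):
    """Distance between the first and last nonzero digits, inclusive.
    '0000011100' (two's complement 28) -> 3; '000100b00' -> 4."""
    nonzero = [i for i, d in enumerate(digits) if d != '0']
    if not nonzero:
        return 0
    return nonzero[-1] - nonzero[0] + 1
```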
Partitioning:
Partitioning is used to reduce the input to output delay by parallelization.
Coefficient partitioning is used to parallelize coefficient multiplications. Coefficient
partitioning divides an N-bit coefficient Y = Y(N−1)Y(N−2)…Y(0) into an Na-bit
coefficient Ya and an Nb-bit coefficient Yb, where Ya = Y(N−1)Y(N−2)…Y(Nb),
Yb = Y(Nb−1)Y(Nb−2)…Y(0), and N = Na + Nb. For the partitioned coefficient, we have
     Ca + Cb <= C,     0 < K < N−1,
Where Ca is the cluster width of Ya, Cb is the cluster width of Yb, C is the cluster width of
Y, and K is the partition point. Partitioning can decrease coefficient cluster widths. The
original multiplication is turned into an M*Na multiplication and an M*Nb multiplication
Page 10 of 40
which can be performed in parallel. The depth of each individual multiplication is much
less than that of the full multiplication X*Y.
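The partitioning step can be sketched on string-encoded coefficients (hypothetical helper names; the point is that Ca + Cb can be much smaller than C):

```python
def cluster_width(digits):
    """Distance between first and last nonzero digits, inclusive."""
    nz = [i for i, d in enumerate(digits) if d != '0']
    return (nz[-1] - nz[0] + 1) if nz else 0

def partition(digits, k):
    """Split an N-digit coefficient at partition point k into the
    upper (N-k)-digit part Ya and the lower k-digit part Yb."""
    return digits[:len(digits) - k], digits[len(digits) - k:]

ya, yb = partition('01100011', 4)               # Ya = '0110', Yb = '0011'
ca, cb = cluster_width(ya), cluster_width(yb)   # Ca + Cb = 2 + 2, versus C = 7
```

Here the original 8-bit multiplication becomes two parallel multiplications whose adder arrays are only two rows deep each.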
In a conventional array multiplier, if the kth bit of a coefficient is 0, the kth row of
adders does not need to be activated: the partial product of the previous adder rows can
simply be shifted and bypassed to the next row of adders. Nevertheless, the adders of a
conventional array multiplier corresponding to zero coefficient bits still switch even though
the addition is not required, and this increased switching activity results in unnecessary
power dissipation. The structure is therefore modified by incorporating deactivation circuitry
and bypass logic so that one-bit shifts can be performed without unnecessarily switching adder cells.
Algorithm for Coefficient Optimization:
In this section we present the currently used algorithm for coefficient optimization
[1]. We then propose a modification to the algorithm based on the cluster width
manipulation.
The P-point FFT, including quantization noise and some scaling factor α, is given by the
equation

     X[k] = (1/α) Σ (p = 0 to P−1) (x[p] + e_d[p]) (W_P^kp + e_w[p]),

where W_P^kp is the pth coefficient,
x[p] is the input data sample,
e_w[p] is the error due to twiddle factor coefficient approximation and
e_d[p] is the input sample quantization error.
Given a set W of infinite-precision twiddle-factor coefficients W_P^kp, where p = 0, 1, 2, …, P−1,
an error-ratio bound ε, a delay bound Tmax, and a cluster width bound Ctarget, the
optimization process returns an encoded set of P twiddle coefficients Yp = α W_P^kp, p =
0, 1, 2, …, P−1 such that max_p C(b,p) <= Ctarget, the multiplier critical delay is less than Tmax, the
coefficient quantization error ratio is less than ε, and the multiplier dissipation is
minimal.
The following is the existing algorithm for coefficient optimization.
COPT(W, ε, Tmax, Ctarget)
1.  Initialize temp to an encoded set Y
2.      with delay(temp) <= Tmax and error(temp) <= ε
3.  for (α = 1.0; α >= 0.5; α = α − Δα)   (Δα is the scaling-factor step size)
4.      for (K = Ctarget; K <= N − Ctarget; K = K + 1)
5.          for each number representation f
6.              Y = (f(Y0), f(Y1), …, f(YP−1))
7.              Partition coefficients in Y with respect to K
8.              if power(Y) < power(temp)
9.                  and delay(Y) < delay(temp)
10.                 and error(Y) < error(temp)
11.                 then temp ← Y
12. return temp
This algorithm comprises three nested loops. The outer loop steps through the
possible scaling factors α. The middle loop steps through the possible partition points.
The inner loop steps through the possible number representations f. For each scaling
factor α, partition point K, and number representation f, the algorithm partitions the
encoded coefficients with respect to the current partition point. The power constraint is
checked using the expression
     power(Y) = Σ (p = 0 to P−1) (C(p,nz) / (P * Np)) * Cload * Vsupply²

where C(p,nz) is the number of nonzero bits in the coefficient Yp,
Np is the coefficient width, and Cload is the output capacitance.
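The power expression above might be evaluated as follows (an illustrative sketch; coefficients as bit strings, Cload and Vsupply in arbitrary units):

```python
def power_estimate(coeffs, c_load, v_supply):
    """power(Y) = sum over p of (C_{p,nz} / (P * Np)) * Cload * Vsupply^2,
    where C_{p,nz} counts the nonzero bits of coefficient Yp."""
    P = len(coeffs)
    total = 0.0
    for bits in coeffs:
        n_p = len(bits)                      # coefficient width Np
        c_nz = sum(d != '0' for d in bits)   # nonzero-bit count C_{p,nz}
        total += (c_nz / (P * n_p)) * c_load * v_supply ** 2
    return total
```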
Proposed Modification to the above formula:
For our analysis of a given set of coefficients, we have taken the cluster width instead
of the number of nonzero bits, thereby making the power consumption of a
multiplier dependent on the cluster width. We have used the combination of partitioning
and clustering to reduce the power consumption further. The simulations for our
algorithm are shown for FFTs of various sizes; the implementation was done in MATLAB.
A power comparison is also made between the original and the optimized
coefficients. All simulations were carried out using 2’s complement encoding, and an 8-bit
representation was used for each of the coefficients. In our simulations, we have selected
the load capacitance value as 1F. The power calculation for a twiddle factor is computed
using the following:
     power = Σ (x = 0 to P−1) cluster-width(Yx) * Cload * Vsupply² / (P * Np)

where P represents the P-point FFT size.
The following is the pseudo code of our algorithm.
1. Initialize the optimal power to an arbitrary value.
2. Encode the twiddle factor y using the 2’s complement approach.
3. Select a scaling factor α.
4. Partition the coefficients y with respect to the partitioning point k.
5. if (power(y) < power(optimal))
       power(optimal) = power(y).
   end
6. Continue until the cumulative power due to all the twiddle factors is computed.
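For a single coefficient, the pseudo code above amounts to an exhaustive search over partition points, with the power term driven by the cluster width (an illustrative Python version of our MATLAB implementation; helper names are hypothetical):

```python
def cluster_width(digits):
    nz = [i for i, d in enumerate(digits) if d != '0']
    return (nz[-1] - nz[0] + 1) if nz else 0

def best_partition(digits, c_load=1.0, v_supply=1.0):
    """Try every partition point k and keep the one that minimises the
    cluster-width-based power term (Ca + Cb) * Cload * Vsupply^2 / Np."""
    n_p = len(digits)
    best_k, best_power = None, float('inf')
    for k in range(1, n_p):
        ca = cluster_width(digits[:n_p - k])   # upper part Ya
        cb = cluster_width(digits[n_p - k:])   # lower part Yb
        power = (ca + cb) * c_load * v_supply ** 2 / n_p
        if power < best_power:
            best_k, best_power = k, power
    return best_k, best_power
```

Running this over every twiddle factor and summing the per-coefficient terms gives the cumulative power of step 6.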
It can be shown that optimizing the coefficients saves power. This is repeated for
all the twiddle factors in the butterfly model. The twiddle factors can be calculated a priori
for various N values. Then we can apply our proposed algorithm to each of the
coefficients and find the optimal partitioning point. We can do this for all the possible
representations, such as the canonical and 2’s complement representations, so that if the 2’s
complement representation gives a larger reduction in power, we can use that particular
form in a particular stage.
After partitioning-clustering, the multiplier can be implemented using adder-shifter
units. With this algorithm we are effectively reducing the number of multiplications,
and the remaining computations are taken care of by the shifters.
We can also use different forms of representation for different stages, but the
overhead will be the conversion required at each stage, and we will have to keep track of
what happens to the coefficients. This will need better control and logic
circuits.
The following results were obtained from the simulations:
Power Efficiency Vs Partitioning Degree:
Fig 3: Power Efficiency Vs Partitioning Degree
We see that as the partition length is varied, we obtain the optimal cluster width at a
particular point, yielding a reasonable reduction in power consumption, as is evident from the
simulated result above. Based on this result we can choose the optimal
partitioning point for efficient power utilization.
Power with optimal partitioning (FFT Size = 8)
Fig 4: Power with optimal partitioning (FFT Size = 8)
Here we have made the assumption that all twiddle factors have been used in the
computation. Based on this we obtain the result shown, which represents a significant
reduction in power compared to the original case.
Power Savings (8-Point FFT):
Fig 5: Power Savings (8-Point FFT)
The power consumption in a processor varies with the square of the supply voltage. The
above result shows the power saved using the algorithm for 8-point FFT. We find that the
power saving is more significant at higher voltages compared to the normal case.
Power Vs Partitioning Degree:
Fig 6: Power Vs Partitioning degree
The above result shows the variation in power consumption for different partitioning
degrees. The power consumption depends basically on the encoding style used for the
twiddle factors. Thus we see that for optimal power utilization, we need to select the
proper partitioning degree. By partitioning degree, we mean the placement of the partition
window in the bit pattern of a particular twiddle factor that effectively reduces the cluster
width.
Power with optimal partitioning:
Fig 7 : Power with optimal partitioning
From the above result, we find that the power consumption differs from the 8-point
FFT: although the 16-point FFT has 16 twiddle factors, its power happens to be less under the
assumption that all of these twiddle factors have been used. This is because the encoding
style of the twiddle factors in the 16-point FFT is different from that of the 8-point FFT, and
hence the partitioning algorithm has different effects on each N-point FFT. As in the
previous results, we find that there is a reasonable power difference between the original
case and the optimized case.
Power Savings:
Fig 8 : Power Savings for a 16 Point FFT
Here we find that there is a reasonable power saving with the partitioning-clustering
algorithm. Similar results are obtained for the 32-point FFT.
Power with Optimal Partitioning (FFT Size = 32):
Fig 9 : Power with Optimal Partitioning
Power Savings (32 Point FFT – 8 bit representation):
Fig 10: Power Savings for a 32-Point FFT
It is found that as the partitioning degree is varied, there is a change in the power
saving. We suggest that the same partitioning technique be applied to the input
coefficients so that power can be reduced further. But this has to be done as the inputs enter
the processor, so we cannot compute them in advance and store them in ROM as we do for
the twiddle factors. Thus even if partitioning a twiddle factor in a particular stage
does not reduce power considerably, the same algorithm applied to the input data
may, giving us two-way control over power consumption. The
overhead of this method is that we need an intelligent controller that chooses the
right kind of partitioning for a given coefficient. We can run our algorithm for various
coefficient values and perform a statistical analysis of which partitioning works for
which coefficient.
By analyzing the partition width for the twiddle factors, we can reduce the word
length in the ROM used for the multiplication. The coefficient parts that do not take part in the
computation are redundant in memory, so we can use a smaller memory word length for
certain twiddle factors.
Low Power Memory
Memory is an important part of the FFT processor. The FFT processor does a lot of
computation at every stage, and the computations become more complicated with longer
word lengths and larger numbers of input data. A suitable memory is needed to store
the computation results of the first stage, and the memory should be available for immediate
access by the next stage. The following are some of the techniques proposed by
researchers in universities and companies for reducing power in memory.
Memory Partitioning:
Memory power can be divided into three parts: write energy, read energy, and idle
energy. Power can be further broken down in standard SRAM arrays by separately measuring
the power consumption in the bit array, the row decoders, the column decoders, the sense
amps, the precharge circuitry, and finally the extra buffers. More than half of the total
memory power is consumed when the memory is in the precharge stage. By
partitioning the memory into several smaller segments and precharging only the small
segment in use, the total power consumption of memories can be reduced drastically
without adding any significant delay to the memory. The following diagram shows the
memory segmentation used to save power.
Fig 11: Standard RAM array and Segmented Memory
DRAM/Logic Merged Technology:
The RAM bandwidth used in a system is limited by off-chip interconnects. On
memory/logic merged ICs, the RAM access time is improved dramatically by utilizing
the bandwidth available from memory internal arrays. In addition to this, if the memory
bus can be eliminated at the system board level, a major portion of the total power
dissipation of a system can be removed. There are two choices to implement the
memory/logic merged VLSI systems. One is to integrate SRAM with logic using
standard processing technology for logic. The other choice is to integrate DRAM with
logic using DRAM process. The latter is cost effective and has smaller form factor. There
are various issues that are to be addressed in memory selection and design like logic gate
performance penalty on SRAM process due to its inherently slower transistor switching
and routing area penalty due to the lack of metal layers on a DRAM process compared to a logic
process. The advantage of using DRAM is that we will have only one transistor, instead of
six transistors as in SRAM, to store one bit.
Intelligent RAM:
This is a project being done at UC Berkeley. The IRAM project proposes to
architect, design, fabricate, and evaluate single-chip and multiple-chip systems for
data-intensive applications. It will combine a processor and high-capacity DRAM to deliver
vector supercomputer-style sustained floating point and memory performance at vastly
reduced power. The conventional system has separate chips for processor, external cache,
main memory, and networking. An IRAM would be smaller, use less power, and be less
expensive. The IRAM design will be scalable within a chip, allowing the processing
power to vary with memory size or power budget without changes to the architectural
specification.
The concept of IRAM can be extended to the FFT. There are numerous floating point
and complex multiplications in an FFT, and a processor with intelligent RAM can serve
the purpose of faster computation and memory optimization.
Reducing Power and Improving Performance in the Butterfly Module
Pipelined FFT processors are a class of architectures used for application-specific
real-time DFT computation. Their operation is characterized by non-stop processing of
the input data samples with a fixed clock frequency. A lower clock frequency is an
advantage of pipeline architectures when either high speed or low power is sought.
Pipelined structures are highly regular and can be easily scaled and parameterized.
The basic processing element incorporated in a butterfly stage is the radix-r
butterfly. The following figure shows the radix-2 and radix-4 butterflies in the decimation-in-time
FFT algorithm.
Fig 12: Butterfly diagram for radix-2 and radix-4
There are two design choices for a butterfly module: spatial parallelism and temporal
multiplexing. Introducing parallelism in the architecture can decrease the speed penalty, but
it increases the overall effective switching capacitance. A highly time-multiplexed FFT
architecture lowers the effective switching capacitance since butterfly units are shared,
but the operating speed of such a unit has to be high in order to process the data within a
specified time slot. A fully parallel architecture reduces the supply voltage requirement
but is too large in size. Therefore an effective low-energy FFT architectural solution has
to manage the trade-off between a fully time-multiplexed architecture and a fully
space-parallelized architecture. As the radix of the FFT is increased, the degree of
parallelism increases, decreasing the required supply voltage; but after a certain point
the supply voltage cannot be reduced any further and no further reduction in power
dissipation is possible. For lower degrees of parallelization, a higher supply voltage is
necessary to satisfy the throughput requirement, which increases power dissipation. Due to
this, the appropriate solution is to perform partial parallel processing and involve
time multiplexing to a certain extent.
CHAPTER 3:
SPEED OPTIMIZATION IN AN FFT PROCESSOR:
FFT processors need to process data at very high speeds, since continuous-time signals are
processed by these processors (in the discrete time domain). So it is imperative for
these processors to have high computational rates, as this ensures a faithful
reproduction of the continuous-time signal being processed. The execution speed
of an FFT processor is limited almost entirely by the number of additions and
multiplications required. A few algorithms to achieve high-speed performance on these
processors are discussed below.
CORDIC Algorithm:
CORDIC is an acronym for COordinate Rotation DIgital Computer. This algorithm is a
set of shift-add operations that can be used to realize trigonometric functions in the FFT
processor. These algorithms were originally developed as a solution for real-time
navigational systems. All trigonometric functions can be realized using a set of vector
rotations, performed by recursively applying shifts and adds to achieve
the required function. The underlying principle of the algorithm is shown below.
Let us consider we need to rotate two vectors V1 and V2 as shown in the figure.
[Figure: vector V1 = (x1, y1) at angle φ to the x-axis, rotated through angle θ to give V2 = (x2, y2) of the same magnitude a.]
Each of these has the same magnitude, as they are basically the same vector; the only
difference is that they have different angles. Let φ be the angle between vector V1 and the
x-axis, and let θ be the angle through which V1 needs to be rotated.
Now,
     x2 = a cos(φ + θ)
which implies
     x2 = x1 cos θ − y1 sin θ
and similarly
     y2 = x1 sin θ + y1 cos θ
or,
     x2 = cos θ (x1 − y1 tan θ)
     y2 = cos θ (y1 + x1 tan θ)
Using this transformation matrix, any vector V1 can be rotated by an angle θ. So far there
has not been any reduction in complexity. However, if the angles to be rotated are chosen
such that
     tan θ = 2^(−i)
then multiplication by the tangent of the angle reduces to a shift operation. Here
the variable i denotes the iteration number. The algorithm to be implemented
can be stated as follows:
1. Compare the angle θ to be rotated with the set of angles θ_i already stored in memory.
2. If θ is greater than a particular value θ_i, then the transformation matrix pertaining to that θ_i is multiplied with the given coordinates.
3. Calculate the difference between θ and the θ_i used above, and repeat the steps with this difference as the new value of θ.
4. The above steps are repeated until θ becomes 0.
This algorithm has a running time of O(n), compared with the O(n²) of brute-force
multiplication. It essentially performs a binary search over the stored angles, which are
added to or subtracted from one another to make up the angle through which the vector is to be
rotated. CORDIC is very useful when high accuracy is not needed, since it requires very
little hardware; it can be used, for example, for small computations in calculators.
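The rotation steps above can be sketched in software. The following Python model is an illustrative sketch only, not the hardware implementation discussed in this report; the function name and iteration count are our own choices. It precomputes the angles atan(2^-i), applies the shift-like scalings by 2^-i, and compensates the accumulated cos θ gain at the end.

```python
import math

def cordic_rotate(x, y, theta, iterations=32):
    """Rotate (x, y) by theta (radians, |theta| <= ~1.74) CORDIC-style.

    Each step applies x' = x - d*y*2^-i, y' = y + d*x*2^-i, which is a
    rotation by d*atan(2^-i) scaled by 1/cos(atan(2^-i)); the residual
    angle is driven toward zero and the total gain is undone at the end.
    """
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for a in angles:
        gain *= math.cos(a)                # product of cos(atan(2^-i)), ~0.6073
    for i, a in enumerate(angles):
        d = 1.0 if theta >= 0 else -1.0    # rotate toward the target angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        theta -= d * a
    return x * gain, y * gain
```

In fixed-point hardware the multiplications by 2^-i become arithmetic right shifts, which is the source of the area and power savings described next.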
The CORDIC algorithm has been implemented using the schematic capture of the Cadence
tool. A saving in area of around 10-12% is observed, together with a significant
reduction in power, and the speed also improves drastically. The CORDIC algorithm can be
implemented with pipelining, which greatly enhances the speed performance of the
processor as a whole.
Booth’s Algorithm:
Multiplying two binary numbers basically involves repeatedly shifting the multiplier and
adding according to the bit pattern of the multiplicand. Booth's algorithm works as
follows: given two numbers A and B, it scans the multiplier A two bits at a
time to determine whether to add 0, B, or -B (where B is the multiplicand) to the partial
product. Booth's algorithm can be stated as follows:
a. Test the current multiplier bit and the next lower-order bit. If these bits are:
o 00 or 11: do nothing;
o 01: add the multiplicand to the partial product;
o 10: subtract the multiplicand from the partial product.
b. Shift the partial product right (the sign bit is propagated).
c. Subtract 1 from the counter. If the result is 0, read the product from the partial-product register; otherwise repeat from step (a).
This algorithm basically uses the transitions in the bit stream to calculate the amount that
has to be added to the partial product of the multiplication process.
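As a concrete sketch of these steps (our own illustrative Python model, not the Verilog implementation mentioned later in this section), the loop below keeps the partial product in a register triple [A | Q | Q-1], adds or subtracts the multiplicand according to the bit pair, and performs one arithmetic shift right per iteration.

```python
def booth_multiply(m, r, n):
    """Multiply two n-bit two's-complement integers m and r with radix-2
    Booth steps: registers [A | Q | Q_-1] hold the partial product; each
    iteration tests (Q0, Q_-1), adds 0 / m / -m into A, then performs one
    arithmetic shift right of the whole triple."""
    mask = (1 << n) - 1
    A, Q, q1 = 0, r & mask, 0          # A: upper half, Q: multiplier, q1: Q_-1
    M, negM = m & mask, (-m) & mask
    for _ in range(n):
        pair = (Q & 1, q1)
        if pair == (0, 1):             # end of a run of ones: add multiplicand
            A = (A + M) & mask
        elif pair == (1, 0):           # start of a run of ones: subtract it
            A = (A + negM) & mask
        # arithmetic shift right of [A | Q | q1]
        q1 = Q & 1
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (A & (1 << (n - 1)))   # replicate A's sign bit
    product = (A << n) | Q
    if product & (1 << (2 * n - 1)):   # reinterpret as a signed 2n-bit value
        product -= 1 << (2 * n)
    return product
```

Because the sign bit is propagated during the shift, the same loop handles positive and negative operands without any special cases.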
Whenever an operand contains a long run of ones, the corresponding additions can be
replaced by a single addition and a single subtraction, because
2^j + 2^(j-1) + ... + 2^i = 2^(j+1) - 2^i.
The longer the run of ones, the greater the savings. Booth recoding is a digit-set
translation: the binary digit set {0, 1} is converted to the signed digit
set {-1, 0, 1}. For example, consider the following conversion:
1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0
The above set of bits is recoded as follows:
-1 0 1 0 0 -1 1 0 -1 1 -1 1 0 0 -1 0
Any digit generated beyond the MSB of the argument is simply ignored. The radix-2
Booth recoding is summarized in the following table.
Radix-2 Booth recoding:
Xj  Xj-1  Yi   Explanation
0   0      0   No string of ones in sight
0   1      1   End of a string of ones
1   0     -1   Beginning of a string of ones
1   1      0   Continuation of a string of ones
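The table's rule, Yi = Xj-1 - Xj, can be checked directly against the example above. This small Python helper (an illustrative sketch, with names of our own choosing) recodes an MSB-first bit list:

```python
def booth_recode(bits):
    """Radix-2 Booth recoding of an MSB-first bit list.

    Each output digit is y_j = x_{j-1} - x_j, where x_{j-1} is the next
    lower-order bit (an implicit 0 sits below the LSB), giving digits in
    {-1, 0, 1}: -1 begins a run of ones, +1 ends one, 0 elsewhere.
    """
    out = []
    for i, b in enumerate(bits):
        lower = bits[i + 1] if i + 1 < len(bits) else 0  # implicit 0 past LSB
        out.append(lower - b)
    return out
```

Applied to the bit pattern in the example above, it reproduces the recoded digits shown there.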
The radix-2 Booth algorithm just described works well, but it has an inherent
problem: if the number to be recoded is 0101010101, a large number of additions and
shifts must be performed, which increases both the computation time and the power
dissipation in the processor. It was therefore suggested to examine three bits at a
time instead of two, reducing the amount of computation. This variant is called the
"modified Booth algorithm".
A multiplier implemented with Booth's algorithm is smaller in area than one
implemented with, say, an array multiplier. The running time of Booth's algorithm is
O(n), and it can multiply numbers of varying bit widths. The aspect ratio of the
multiplier is therefore very good, which helps in efficiently laying out very dense
multipliers.
Booth's algorithm has been implemented in the Cadence tool, and an improvement in area
of about 20% was found. The multiplier was first coded in Verilog and the results
were synthesized in Cadence. The schematics are attached.
Twiddle Factor based FFT algorithms:
Memory references are one of the major sources of power consumption in a
microprocessor. In this algorithm, two transformations are applied to reduce the number of
memory references. With the first transformation, all the butterflies sharing the same
twiddle factor are computed together, eliminating the unnecessary memory accesses needed to
load the twiddle factors. With the second transformation, all the remaining butterflies,
which involve the other twiddle factors, are computed using a breadth-first tree traversal
so that the load/store operations on the intermediate data arrays are minimized.
This new algorithm improves speed by 20% compared with the standard DIF FFT
algorithm, and there is a 30% reduction in memory accesses, which in turn reduces the
power consumption of the processor by a similar amount. The complete flow graph of an N-
point radix-2 FFT can be constructed by applying the basic butterfly structure
recursively, where N = 2, 4, 8, ...; an N-point FFT has log2(N) stages.
The twiddle factors W are needed at various points in the computation. They are stored in
memory and must be fetched every time they are used, unless the processor provides enough
registers to hold them. This algorithm applies a computational transformation such that the
complexity of the original FFT is preserved while the memory accesses are reduced. In an
N-point FFT flow graph, each stage can be represented by an NxN stage matrix, and one
butterfly in a given stage contributes four entries to it.
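The first transformation amounts to interchanging the loops of an iterative radix-2 FFT so that, within each stage, the loop over twiddle factors is outermost. The sketch below is our own Python illustration of that loop ordering, not the implementation evaluated in this section; it loads each twiddle once per stage and then runs every butterfly that uses it.

```python
import cmath

def fft_grouped(x):
    """Iterative radix-2 DIT FFT with the stage loops interchanged: within
    each stage the outer loop runs over distinct twiddle factors, so every
    butterfly sharing W_m^j executes while W_m^j is held locally."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # bit-reversal permutation of the input
    a = [x[int(format(i, '0%db' % bits)[::-1], 2)] for i in range(n)]
    m = 2
    while m <= n:                                  # one pass per stage
        half = m // 2
        for j in range(half):                      # distinct twiddles this stage
            w = cmath.exp(-2j * cmath.pi * j / m)  # W_m^j, loaded once
            for k in range(j, n, m):               # all butterflies using W_m^j
                t = w * a[k + half]
                a[k + half] = a[k] - t
                a[k] = a[k] + t
        m *= 2
    return a
```

With the usual loop order (butterflies outermost) each butterfly reloads its twiddle; grouping by twiddle factor turns those repeated loads into a single fetch per stage.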
Radix-2 and 4 Algorithms
By limiting the data length to the form N = R^v, we can define a class of FFTs known as
radix algorithms. These algorithms successively decompose a single N-point DFT into R
segments of N/R-point DFTs. The most widely used of these radix algorithms are
Radix-2 and Radix-4. Each uses the periodicity properties of the DFT to attain higher
efficiency. Radix algorithms can be implemented using either Decimation-In-Time
(Cooley and Tukey) or Decimation-In-Frequency (Sande and Tukey). Each of
these algorithms reduces the number of operations from O(N²) to O(N log2 N). The
drawback is that the data must be of a specified length; this problem can be avoided by
zero padding with no loss of information.
The parallel structure of the radix algorithms is understood by considering the fact that
the DFT can be decomposed into smaller independent DFTs. By performing each of these
smaller DFTs concurrently, we can take advantage of this parallelism. The figure shows
two stages of a Decimation-In-Time FFT. In this figure, one can see that the 8-point DFT
is decomposed into two 4-point DFTs. These, in turn, are each decomposed as follows:
The figure above shows an 8-point decimation-in-frequency algorithm. The decimation of the
odd and even frequency terms makes them independent of each other, so these even
and odd terms can be evaluated on different processors. The complexity of this algorithm
is also O(N log N). These algorithms can thus be realized on separate processors,
making the computation faster.
Split-Radix Algorithm
By observing the figure above, it can be seen that the even-indexed points can be calculated
independently of the odd-indexed points. This opens the possibility of using more than one
algorithm on the same data set. The higher computational efficiency of Radix-4 is
attractive, but its restriction on data-sequence lengths is a hindrance. The Split-Radix
algorithm exploits the decomposition of the data points into even and odd indices to employ
both the Radix-2 and Radix-4 algorithms: a Radix-2 step is performed on the even-indexed
points, while the odd-indexed points are decomposed into two N/4-point sequences on which a
Radix-4 approach is taken. By combining these two techniques, the Split-Radix algorithm
gains computational efficiency over Radix-2 while retaining the ability to operate on any
power-of-two length.
As shown above, the Split-Radix algorithm is composed of one Radix-2 and two Radix-4
components. These components are independent of each other and so can be performed
in parallel on separate processors; the figure shows the flow of this process. The Radix-2
and Radix-4 components can also each be performed as parallel computations. In this manner,
we can achieve maximum utilization of parallel hardware.
The split-radix algorithm runs in O(N log N) time.
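A recursive sketch of this decomposition follows (in Python, with function names of our own choosing; an illustration of the split, not an optimized implementation). The even-indexed samples go through one half-length transform, while the odd samples split into two quarter-length transforms on x[1::4] and x[3::4], recombined with the twiddles W^k and W^3k.

```python
import cmath

def split_radix_fft(x):
    """Recursive split-radix DIT FFT: one N/2-point transform on the
    even-indexed samples plus two N/4-point transforms on x[1::4] and
    x[3::4], combined with twiddles W_N^k and W_N^{3k}."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])    # N/2-point DFT of even samples (radix-2 part)
    z = split_radix_fft(x[1::4])    # N/4-point DFT (radix-4 part)
    zp = split_radix_fft(x[3::4])   # N/4-point DFT (radix-4 part)
    X = [0j] * n
    for k in range(n // 4):
        w1 = cmath.exp(-2j * cmath.pi * k / n)      # W_N^k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)  # W_N^{3k}
        s = w1 * z[k] + w3 * zp[k]
        t = 1j * (w1 * z[k] - w3 * zp[k])
        X[k] = u[k] + s
        X[k + n // 2] = u[k] - s
        X[k + n // 4] = u[k + n // 4] - t
        X[k + 3 * n // 4] = u[k + n // 4] + t
    return X
```

The three recursive calls are independent, which is exactly the parallelism the text describes: they can run on separate processors before the final combination loop.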
CHAPTER 4:
AREA OPTIMIZATION TECHNIQUES
The area of a chip depends on the number of components placed on it. Effective
placement techniques are used in industry to minimize area and maximize
functionality. Effective place-and-route can not only reduce the area of the chip
but also reduce the delay due to wiring and interconnect capacitance. The following
micrograph shows the major components of an FFT processor:
It is evident from the above picture that reducing the number of multipliers
automatically reduces the area occupied on the chip. The techniques described for reducing
power dissipation can likewise be used to reduce area by an appreciable amount.
CONCLUSION
The algorithms used for power, speed and area optimization of FFT
processors were researched, analyzed and studied. The choice of optimization technique
depends strongly on the environment in which the processor operates, since there is always
a tradeoff among the three parameters. One technique that optimizes all three parameters
to a moderate degree is the reduction of the datapath components used in the processor.
References:
1. S. Hong, S. Kim, M. C. Papaefthymiou, and W. E. Stark, "Low power parallel multiplier design for DSP applications through coefficient optimization," in Proc. 12th IEEE International ASIC/SOC Conference, September 1999.
2. S. Hong, S. Kim, M. C. Papaefthymiou, and W. E. Stark, "Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications," in Proc. 42nd Midwest Symposium on Circuits and Systems, August 1999.
3. L. Richard Carley, "Memory Partitioning for Low Power Design."
4. DRAM/Logic Merged Technology Project Page, Colorado State University, http://www.engr.colostate.edu/ece/Research/vlsi/dram-logic.htm.
5. Weidong Li and Lars Wanhammar, "A Pipelined FFT Processor."
6. Yan Solihin, Jaejin Lee, and Josep Torrellas, "Adaptively Mapping Code in an Intelligent Memory Architecture."
APPENDIX
Matlab code for the partitioning algorithm

tmax = 0.5;
delta = 0.1;
c_target = 1;        % cluster width target
del_alpha = 0.0001;
alpha = 1;
y = zeros(1,10);
carry = 0;
init_delay = 0.1;    % Initial delay
init_error = 0.01;   % Initial error
init_power = 10;     % Initial power
fft_point = 32;
bit_rep = 8;
bits = zeros(1,bit_rep);
N = bit_rep;
final_cluster1 = zeros(1,fft_point);   % Partitioned optimal cluster1
final_cluster2 = zeros(1,fft_point);   % Partitioned optimal cluster2
pow = zeros(1,N-c_target);
org_pow = zeros(1,N-c_target);
cluster = zeros(1,fft_point);
clust_part = zeros(1,N-c_target);
clust1 = zeros(1,N-c_target);
clust2 = zeros(1,N-c_target);

for z = 1:fft_point;
  x = exp(j*2*pi*(z-1)/fft_point);   % Calculating twiddle factors
  t1 = real(x);
  t2 = imag(x);
  % Converting twiddle factors to bits
  for iter = 1:2
    if(iter == 1)
      if(t1 < 0)
        t1 = -t1;
      end
      if((t1~=0) & (t1~=1))
        for d = 1:bit_rep
          t1 = t1*2;
          if(t1>1)
            bits(d) = 1;
            t1 = t1-1;
          else
            bits(d) = 0;
          end
        end
      else
        bits = [0 0 0 0 0 0 0 1];
      end
    else
      if(t2 < 0)
        t2 = -t2;
      end
      if((t2~=0) & (t2~=1))
        for d = 1:bit_rep
          t2 = t2*2;
          if(t2>1)
            bits(d) = 1;
            t2 = t2-1;
          else
            bits(d) = 0;
          end
        end
      else
        bits = [0 0 0 0 0 0 0 1];
      end
    end

    bits = bits(1:bit_rep);

    % One's complement
    for i = 1:length(bits)
      if(bits(i) == 0)
        bits(i) = 1;
      else
        bits(i) = 0;
      end
    end

    u = length(bits);
    % Two's complement
    if(xor(bits(u),1) == 0)
      bits(u) = 0;
      carry = 1;
      for i = u-1:-1:1
        if((xor(bits(i),carry) == 0) & (carry == 1));
          bits(i) = 0;
          carry = 1;
        else
          bits(i) = xor(bits(i),carry);
          carry = 0;
        end
      end
    else
      bits(u) = 1;
    end

    if(carry == 1)
      bits = [1,bits];
    end
    carry = 0;
    alpha = 0.5;

    %while(alpha >= 0.5)
    k = c_target;
    % Computing optimal cluster
    while(k <= N - c_target)
      temp = zeros(1,N);
      temp = bits(N:-1:1);
      temp2 = zeros(1,N-k);
      part1 = zeros(1,k);
      temp1 = zeros(1,k);
      part2 = zeros(1,N-k);
      part1 = bits(1:k);
      part2 = bits(k+1:N);
      temp1 = bits(k:-1:1);
      temp2 = bits(N:-1:k+1);
      [q,i1] = max(bits);
      [q,i2] = max(temp);
      if((q == 1) | (q == 1))
        cluster(z) = ((length(bits)-(i2-1)) - i1) + 1;
      else
        cluster(z) = 0;
      end
      [q,i1] = max(part1);
      [q,i2] = max(temp1);
      if((q == 1) | (q == 1))
        cluster1 = ((length(part1)-(i2-1)) - i1) + 1;
      else
        cluster1 = 0;
      end
      [q,i1] = max(part2);
      [q,i2] = max(temp2);
      if((q == 1) | (q == 1))
        cluster2 = ((length(part2)-(i2-1)) - i1) + 1;
      else
        cluster2 = 0;
      end
      clust1(k) = cluster1;
      clust2(k) = cluster2;
      % Calculating power after clustering and partitioning
      % power = sum(part1)*(10*(10e-6))*(5^2)/(8*length(part1)) + sum(part2)*(10*(10e-6))*(5^2)/(8*length(part2));
      % power updation according to cluster size
      power = cluster1*(10*(10e-6))*(5^2)/(fft_point*length(bits)) + cluster2*(10*(10e-6))*(5^2)/(fft_point*length(bits));   % power updation according to cluster size
      clust_part(k) = cluster1 + cluster2;
      if(power <= init_power)
        init_power = power;
        final_alpha = alpha;
        final_power = power;            % Optimal power
        final_cluster1(z) = cluster1;   % Optimal cluster1
        final_cluster2(z) = cluster2;   % Optimal cluster2
        p1 = part1;
        p2 = part2;
      end
      pow(k) = power;   % With partitioning and clustering
      org_pow(k) = cluster(z)*(10*(10e-6))*(5^2)/(fft_point*length(bits));   % Without partitioning and clustering
      k = k + 1;
    end
    % alpha = alpha - del_alpha;
    %end
  end
end

% Plots
v = [0:0.001:5];
power_v = zeros(1,length(v));
power_org = zeros(1,length(v));
for z = 1:fft_point
  power_v = power_v + final_cluster1(z)*(10*(10e-6))*(v.^2)/(fft_point*length(bits)) + final_cluster2(z)*(10*(10e-6))*(v.^2)/(fft_point*length(bits));   % power according to optimal partitioning
  power_org = power_org + cluster(z)*(10*(10e-6))*(v.^2)/(fft_point*length(bits));
end

m = [1:length(pow)];
plot(m,pow,m,org_pow,'r')   % partitioning power
pause;
plot(v,power_v,v,power_org,'r')   % comparison of optimal power
pause;

% Partitioning degree vs power for various voltages
pow_v = zeros(N-c_target,length(v));
for l = c_target:N-c_target
  pow_v(l,1:length(v)) = clust1(l)*(10*(10e-6))*(v.^2)/(fft_point*length(bits)) + clust2(l)*(10*(10e-6))*(v.^2)/(fft_point*length(bits));   % power for various partitioning
end

% Power savings vs voltage
pow_save = zeros(1,length(power_v));
pow_save = power_org - power_v;
plot(v,pow_save);
pause;

for l = c_target:N-c_target
  plot(v,pow_v(l,:));
  hold on;
end
pause;