CSE 450: Design and Analysis of Algorithms P-173Power, Speed and Area Optimization of FFT processor
POWER, SPEED AND AREA
OPTIMIZATION OF FFT PROCESSORS
P-173
Md. Hafijur Rahman
Sashvat Sai
Sudarshan Suresh
Page 1 of 40
INTRODUCTION
Digital Signal Processing has progressed by leaps and bounds in
the last few decades due to a huge growth in the semiconductor industry. One of the most
fundamental algorithms in the field of Digital Signal Processing is the Fast Fourier
Transform [FFT] and its inverse [IFFT]. The FFT/IFFT are widely used in areas
such as telecommunications, speech and image processing, medical electronics and
seismic processing.
The FFT is basically a fast and efficient method of computing the
Discrete Fourier Transform [DFT]. A Fourier transform is a technique used to
transform signals in the time domain to signals in the frequency domain. This
transformation is useful because many complex operations in the time domain reduce to
less complex problems in the frequency domain, which is a definite advantage from an
implementation standpoint.
Fourier transform is usually applied to continuous and aperiodic
signals. Unfortunately, computers cannot handle a continuous signal; they have to work with
discrete samples. Therefore we need a discrete form of the Fourier transform, known as
the Discrete Fourier Transform [DFT]. In a DFT both the time and frequency components
are discretized.
The DFT of a time domain signal x(n) is given by the following
equation:

     X(k) = Σ (n = 0 to N−1) x(n) e^(−j2πkn/N),   k = 0, 1, …, N−1
It may be noted that the number of complex multiply and add operations required by the
simple forms of both the DFT and IDFT is of order N². This is because there are N data
points to calculate, each of which requires N complex arithmetic operations, so the
complexity of this algorithm is O(N²). The number of adders and multipliers required by
this direct transform, together with its long running time, makes it unsuitable for most
practical DSP applications.
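As a concrete illustration, here is a minimal Python sketch of the direct DFT (hypothetical; the report's own experiments used MATLAB), showing the N multiply-accumulate operations needed per output point:

```python
import cmath

def dft(x):
    """Direct DFT: each of the N outputs needs N complex
    multiply-and-add operations, giving O(N^2) work overall."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]
```

For N = 1024 the inner loop performs 1024 × 1024 complex multiplications, which is exactly the cost the FFT avoids.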
In computing the DFT, greater efficiency results from using a divide and conquer
approach and decomposing the computation into successively smaller DFT computations.
In this process, we exploit both the symmetry and the periodicity of the complex
exponential W_N^(kn) = e^(−j2πkn/N).
Most FFT algorithms are used for N which is a power of 2. These algorithms are
known as radix-2 algorithms. Since N is an even integer, we can compute X(k) by
separating x(n) into two (N/2)-point sequences consisting of the even-numbered points and
the odd-numbered points in x(n). Therefore, we can proceed along similar lines by
decomposing the N-point sequence into two N/2-point sequences, then decomposing those
into N/4-point subsequences, and continuing until we are left with only 2-point
transforms. This requires log2 N stages of computation, and these algorithms have a
complexity of O(N log N). Each stage in this algorithm requires N complex additions and
multiplications; therefore, computing the FFT requires N log2 N complex additions and
multiplications.
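The even/odd decomposition described above can be sketched directly as a recursive radix-2 decimation-in-time FFT (a minimal Python illustration, not the hardware implementation discussed in this report):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of 2.
    log2(N) levels of recursion, each doing O(N) butterfly work."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])   # even-numbered points
    odd = fft(x[1::2])    # odd-numbered points
    out = [0] * N
    for k in range(N // 2):
        # twiddle factor W_N^k = e^(-j 2 pi k / N)
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```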
A comparison between computing the DFT using the FFT approach and directly by the
formula reveals that, for a 1024-point DFT, the direct method would involve an enormous
1,048,576 multiplications while the FFT method would involve an estimated 10,240
multiplications, a reduction of roughly 100 times. This reduction in
the number of multiplications is achieved at the expense of the number of adders, but the cost
involved in implementing an adder is much less than the cost of
building a multiplier. This reduction in the number of multiplications is a great advantage
in terms of power, area and speed.
The term 'FFT' is actually slightly ambiguous, because there are several commonly used
'FFT' algorithms. There are two different radix-2 algorithms, called 'Decimation In Time'
(DIT) and 'Decimation In Frequency' (DIF). Both rely on the
recursive decomposition of an N-point transform into two (N/2)-point transforms. This
decomposition process can be applied to any composite (non-prime) N; it just so happens
that it is particularly simple if N is divisible by 2 (and if N is a power of 2, the
decomposition can be applied repeatedly until the trivial 1-point transform is reached).
The three major parameters used as a benchmark for the performance of present day
VLSI systems are: Power, Speed and Area. All three have to be optimized for a good
system. The various algorithms used to minimize the power dissipation in an FFT
processor are dealt with in Chapter 2. The methods that can be used to maximize the speed of
operation of an FFT processor are considered in Chapter 3. Area optimization techniques
are discussed in Chapter 4.
Chapter 2
POWER OPTIMIZATION OF FFT
In recent years, there has been a strong rise in the demand for portable devices, and a major factor
in the weight and size of these devices is the battery. Battery
life depends greatly on the power dissipated in the circuit. Even in non-portable
applications, reducing power consumption has become the need of the hour due to the
increased cost of providing cooling mechanisms for these devices.
For all signal-processing applications, it is ideal to maintain a constant throughput
throughout the range of operation: there is no major incentive in doing the
operations faster than a given rate, since the processor might then have to wait until further
processing is required. This is very much unlike a general-purpose processor, where speed
is one of the major considerations.
Since CMOS circuits dissipate power only when they are switching, a major focus of
power reduction is to reduce the number of switching events required to perform a
computation. This can range from simply shutting down the complete circuit or portions
of it, to more sophisticated approaches where clocks are gated or optimized circuit
architectures are used to minimize the number of transitions.
Power dissipation in a CMOS circuit is due to three components: switching power, which
accounts for 80-85% of the total power dissipated; short-circuit power, which is
negligible; and leakage power, which accounts for 15-20% of the power dissipated.
The switching power of a circuit is directly proportional to the square of the supply voltage
and is also proportional to the clock frequency, the load capacitance and the probability of
0-to-1 transitions.
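The switching-power relationship above can be written down directly; a small illustrative sketch (the parameter values here are made up for the example):

```python
def switching_power(alpha, c_load, vdd, f_clk):
    """Dynamic switching power: P = alpha * C_load * Vdd^2 * f_clk,
    where alpha is the probability of a 0-to-1 transition per cycle."""
    return alpha * c_load * vdd ** 2 * f_clk

# Because of the quadratic Vdd term, halving the supply voltage at the
# same clock frequency cuts the switching power by a factor of 4.
p_full = switching_power(0.2, 1e-12, 3.3, 100e6)
p_half = switching_power(0.2, 1e-12, 1.65, 100e6)
```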
This project deals with power reduction in FFT processors. The key areas on which we have
concentrated to reduce the power dissipated are as follows:
1. Butterfly Module.
2. Memory and
3. Coefficient optimization module (this includes the input data along with the
twiddle factors)
It is important that power consumption be reduced because the higher the power
consumption, the shorter the battery life. Also, when the speed of data transfer and the
operation of the butterfly modules are not critical, we can operate the chip at a reduced
voltage for the same throughput.
Numerous research papers have been published in the areas of FFT/IFFT
processors. There are various projects currently being carried out to reduce the power
consumption in FFT processors. Some of the methods by which power is reduced are
1. Using word-length reduction of the inputs.
2. Partitioning the coefficients.
3. Using DRAM instead of SRAM, thereby reducing the number of transistors.
4. Intelligent RAM.
5. Mapping intelligent code on to the memory.
6. Using decimation in time and decimation in frequency for the butterfly
operations.
7. Using pipelining.
8. Trading speed for power reduction.
9. Shutting off part of the chip.
10. Code compaction of software embedded in memory.
11. Using multi-port memory.
We suggest a modification of an existing algorithm for partitioning the coefficients
used in the FFT operations. We first demonstrate how power consumption is reduced in
the multipliers by the use of a partitioning algorithm along with clustering. The algorithm
has been implemented using MATLAB and the results are shown. This algorithm can be
used for any N-point FFT in optimizing power.
We then propose a method to reduce the power consumption in
butterfly operations by considering the tradeoff between spatial parallelization and time
multiplexing.
Pipelined FFT Architecture:
The FFT operation has been proven to be both computation intensive, in terms of
arithmetic operations, and communication intensive, in terms of data
swapping/exchanging in storage. High-speed operation is obtained either by a high-frequency
clock or by parallel or pipelined processors operating near the sampling frequency. It
has been shown that the latter is preferable when the application environment limits
power consumption. Each stage in the FFT requires the reading and writing of all N data
words. The pipelined architecture proposed has M butterfly stages with M-1
buffer/interconnection stages as shown below [1].
Fig 1: Baseline pipelined FFT architecture
During the operation of the pipelined FFT, r data samples (for a radix-r butterfly
unit in each stage) appear at the input of each butterfly block. Given a throughput
requirement, the number of butterflies in each stage determines the operating speed and
degree of parallelism in each butterfly of that stage.
Since we are concerned only with power reduction, we can reduce the supply
voltage, thereby reducing the power consumption significantly, since power is
proportional to the square of Vdd (the supply voltage). This can be used in applications where
we are keen on reducing power while keeping the same throughput. If we can compromise
a little further on speed, we can reduce the supply voltage still more.
Coefficient Optimization:
In this method, the bit patterns of the twiddle factors are manipulated to
reduce the number of actual multiplications that take place in a butterfly unit. The
approach scales the values of the given coefficients to derive a representation that enables
the partitioning of the original multiplication into several small multiplications that can
be performed in parallel. The coefficient values are chosen so that an error bound for the
overall computation is satisfied. Therefore the supply voltage can be reduced to decrease
power dissipation while maintaining any given throughput and quantization error
constraints. Additional reductions in power dissipation are achieved by disabling the rows
of the multiplier that correspond to multiplications by zero and thus do not affect the final
result of the multiplication. The main limitation of this method is that all the operations
are performed at runtime, so hardwiring based on previous logic is not possible.
Fig 2: Effect of Cluster width C on multiplier Structure
Clustering:
The cluster width C of an N-bit coefficient Y is defined as the distance between
the first and the last nonzero bits in Y. The cluster width of a coefficient depends on the
number representation. For example, in two’s complement representation, the string
0000011100 denotes the decimal number 28, and here the cluster width is 3. But in signed
digit representation it is 000100b00 (b represents a −1), and the cluster width is 4. By using
clustering we reduce the effective coefficient width to Cmax from the original value of N. From a power
dissipation standpoint, this clustering results in two main benefits. First, the multiplier
can be designed to reduce switching activity by ignoring the bit positions outside the
cluster. Secondly, the supply voltage required to meet a given throughput requirement
can be reduced, since the worst-case critical path among all multipliers is decreased.
These benefits are more visible in array multipliers.
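The cluster-width definition can be stated in a few lines of Python (a hypothetical helper; coefficients are represented here as digit strings, with 'b' standing for the signed digit −1):

```python
def cluster_width(digits):
    """Distance between the first and last nonzero digits, inclusive.
    '0000011100' (two's complement 28) -> 3; '000100b00' -> 4."""
    nonzero = [i for i, d in enumerate(digits) if d != '0']
    if not nonzero:
        return 0
    return nonzero[-1] - nonzero[0] + 1
```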
Partitioning:
Partitioning is used to reduce the input to output delay by parallelization.
Coefficient partitioning is used to parallelize coefficient multiplications. Coefficient
partitioning divides an N-bit coefficient Y = Y(N−1)Y(N−2)…Y(0) into an Na-bit
coefficient Ya and an Nb-bit coefficient Yb, where Ya = Y(N−1)Y(N−2)…Y(Nb),
Yb = Y(Nb−1)Y(Nb−2)…Y(0), and N = Na + Nb. For the partitioned coefficient, we have
     Ca + Cb <= C,     0 < K < N−1,
Where Ca is the cluster width of Ya, Cb is the cluster width of Yb, C is the cluster width of
Y, and K is the partition point. Partitioning can decrease coefficient cluster widths. The
original multiplication is turned into an M*Na multiplication and an M*Nb multiplication
Page 10 of 40
which can be performed in parallel. The depth of each individual multiplication is much
less than that of the full multiplication X*Y.
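The partitioning step can be sketched on string-encoded coefficients (hypothetical helper names; the point is that Ca + Cb can be much smaller than C):

```python
def cluster_width(digits):
    """Distance between first and last nonzero digits, inclusive."""
    nz = [i for i, d in enumerate(digits) if d != '0']
    return (nz[-1] - nz[0] + 1) if nz else 0

def partition(digits, k):
    """Split an N-digit coefficient at partition point k into the
    upper (N-k)-digit part Ya and the lower k-digit part Yb."""
    return digits[:len(digits) - k], digits[len(digits) - k:]

ya, yb = partition('01100011', 4)               # Ya = '0110', Yb = '0011'
ca, cb = cluster_width(ya), cluster_width(yb)   # Ca + Cb = 2 + 2, versus C = 7
```

Here the original 8-bit multiplication becomes two parallel multiplications whose adder arrays are only two rows deep each.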
In a conventional array multiplier, if the kth bit of a coefficient is 0, the kth row of
adders does not need to be activated: the partial product of the previous adder rows can
simply be shifted and bypassed to the next row of adders. Nevertheless, the adders of a
conventional array multiplier corresponding to zero coefficient bits still switch even though
the addition is not required, and this increased switching activity results in unnecessary
power dissipation. The structure is therefore modified by incorporating deactivation circuitry
and bypass logic so that one-bit shifts can be performed without unnecessarily switching adder cells.
Algorithm for Coefficient Optimization:
In this section we present the currently used algorithm for coefficient optimization
[1]. We then propose a modification to the algorithm based on the cluster width
manipulation.
The P-point FFT, including quantization noise and some scaling factor α, is given by the
equation

     X[k] = (1/α) Σ (p = 0 to P−1) (x[p] + e_d[p]) (W_P^kp + e_w[p]),

where W_P^kp is the pth coefficient,
x[p] is the input data sample,
e_w[p] is the error due to twiddle factor coefficient approximation and
e_d[p] is the input sample quantization error.
Given a set W of infinite-precision twiddle-factor coefficients W_P^kp, where p = 0, 1, 2, …, P−1,
an error-ratio bound ε, a delay bound Tmax, and a cluster width bound Ctarget, the
optimization process returns an encoded set of P twiddle coefficients Yp = α W_P^kp, p =
0, 1, 2, …, P−1 such that max_p C(b,p) <= Ctarget, the multiplier critical delay is less than Tmax, the
coefficient quantization error ratio is less than ε, and the multiplier dissipation is
minimal.
The following is the existing algorithm for coefficient optimization.
COPT(W, ε, Tmax, Ctarget)
1.  Initialize temp to an encoded set Y
2.      with delay(temp) <= Tmax and error(temp) <= ε
3.  for (α = 1.0; α >= 0.5; α = α − Δα)   (Δα is the scaling-factor step size)
4.      for (K = Ctarget; K <= N − Ctarget; K = K + 1)
5.          for each number representation f
6.              Y = (f(Y0), f(Y1), …, f(YP−1))
7.              Partition coefficients in Y with respect to K
8.              if power(Y) < power(temp)
9.                  and delay(Y) < delay(temp)
10.                 and error(Y) < error(temp)
11.                 then temp ← Y
12. return temp
This algorithm comprises three nested loops. The outer loop steps through the
possible scaling factors α. The middle loop steps through the possible partition points.
The inner loop steps through the possible number representations f. For each scaling
factor α, partition point K, and number representation f, the algorithm partitions the
encoded coefficients with respect to the current partition point. The power constraint is
checked using the expression
     power(Y) = Σ (p = 0 to P−1) (C(p,nz) / (P * Np)) * Cload * Vsupply²

where C(p,nz) is the number of nonzero bits in the coefficient Yp,
Np is the coefficient width, and Cload is the output capacitance.
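The power expression above might be evaluated as follows (an illustrative sketch; coefficients as bit strings, Cload and Vsupply in arbitrary units):

```python
def power_estimate(coeffs, c_load, v_supply):
    """power(Y) = sum over p of (C_{p,nz} / (P * Np)) * Cload * Vsupply^2,
    where C_{p,nz} counts the nonzero bits of coefficient Yp."""
    P = len(coeffs)
    total = 0.0
    for bits in coeffs:
        n_p = len(bits)                      # coefficient width Np
        c_nz = sum(d != '0' for d in bits)   # nonzero-bit count C_{p,nz}
        total += (c_nz / (P * n_p)) * c_load * v_supply ** 2
    return total
```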
Proposed Modification to the above formula:
For our analysis of a given set of coefficients, we have taken the cluster width instead
of the number of nonzero bits, thereby making the power consumption of a
multiplier dependent on the cluster width. We have used the combination of partitioning
and clustering to reduce the power consumption further. The simulations for our
algorithm are shown for FFTs of various sizes; the implementation was done in MATLAB.
A power comparison is also made between the original and the optimized
coefficients. All simulations were carried out using 2’s complement encoding, and an 8-bit
representation was used for each of the coefficients. In our simulations, we have selected
the load capacitance value as 1F. The power calculation for a twiddle factor is computed
using the following:
     power = Σ (x = 0 to P−1) cluster-width(Yx) * Cload * Vsupply² / (P * Np)

where P represents the P-point FFT size.
The following is the pseudo code of our algorithm.
1. Initialize the optimal power to an arbitrary value.
2. Encode the twiddle factor y using the 2’s complement approach.
3. Select a scaling factor α.
4. Partition the coefficients y with respect to the partitioning point k.
5. if (power(y) < power(optimal))
       power(optimal) = power(y).
   end
6. Continue until the cumulative power due to all the twiddle factors is computed.
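For a single coefficient, the pseudo code above amounts to an exhaustive search over partition points, with the power term driven by the cluster width (an illustrative Python version of our MATLAB implementation; helper names are hypothetical):

```python
def cluster_width(digits):
    nz = [i for i, d in enumerate(digits) if d != '0']
    return (nz[-1] - nz[0] + 1) if nz else 0

def best_partition(digits, c_load=1.0, v_supply=1.0):
    """Try every partition point k and keep the one that minimises the
    cluster-width-based power term (Ca + Cb) * Cload * Vsupply^2 / Np."""
    n_p = len(digits)
    best_k, best_power = None, float('inf')
    for k in range(1, n_p):
        ca = cluster_width(digits[:n_p - k])   # upper part Ya
        cb = cluster_width(digits[n_p - k:])   # lower part Yb
        power = (ca + cb) * c_load * v_supply ** 2 / n_p
        if power < best_power:
            best_k, best_power = k, power
    return best_k, best_power
```

Running this over every twiddle factor and summing the per-coefficient terms gives the cumulative power of step 6.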
It can be shown that optimizing the coefficients saves power. This is repeated for
all the twiddle factors in the butterfly model. The twiddle factors can be calculated a priori
for various N values. Then we can apply our proposed algorithm to each of the
coefficients and find the optimal partitioning point. We can do this for all the possible
representations, such as the canonical and 2’s complement representations, so that if the 2’s
complement representation gives a larger reduction in power, we can use that particular
form in a particular stage.
After partitioning-clustering, the multiplier can be implemented using adder-shifter
units. With this algorithm we are effectively reducing the number of multiplications,
and the remaining computations are taken care of by the shifters.
We can also use different forms of representation for different stages, but the
overhead will be the conversion required at each stage, and we will have to keep track of
what happens to the coefficients. This will need better control and logic
circuits.
The following results were obtained from the simulations:
Power Efficiency Vs Partitioning Degree:
Fig 3: Power Efficiency Vs Partitioning Degree
We see that as the partition length is varied, we obtain the optimal cluster width at a
particular point, yielding a reasonable reduction in power consumption, as is evident from the
simulated result above. Based on this result we can choose the optimal
partitioning point for efficient power utilization.
Power with optimal partitioning (FFT Size = 8)
Fig 4: Power with optimal partitioning (FFT Size = 8)
Here we have made the assumption that all twiddle factors have been used in the
computation. Based on this we obtain the result shown, which represents a significant
reduction in power compared to the original case.
Power Savings (8-Point FFT):
Fig 5: Power Savings (8-Point FFT)
The power consumption in a processor varies with the square of the supply voltage. The
above result shows the power saved using the algorithm for 8-point FFT. We find that the
power saving is more significant at higher voltages compared to the normal case.
Power Vs Partitioning Degree:
Fig 6: Power Vs Partitioning degree
The above result shows the variation in power consumption for different partitioning
degrees. The power consumption depends basically on the encoding style used for the
twiddle factors. Thus we see that for optimal power utilization, we need to select the
proper partitioning degree. By partitioning degree, we mean the placement of the partition
window in the bit pattern of a particular twiddle factor that effectively reduces the cluster
width.
Power with optimal partitioning:
Fig 7 : Power with optimal partitioning
From the above result, we find that the power consumption differs from the 8-point
FFT: although the 16-point FFT has 16 twiddle factors, its power happens to be less under the
assumption that all of these twiddle factors have been used. This is because the encoding
style of the twiddle factors in the 16-point FFT is different from that of the 8-point FFT, and
hence the partitioning algorithm has different effects on each N-point FFT. As in the
previous results, we find that there is a reasonable power difference between the original
case and the optimized case.
Power Savings:
Fig 8 : Power Savings for a 16 Point FFT
Here we find that there is a reasonable power saving with the partitioning-clustering
algorithm. Similar results are obtained for the 32-point FFT.
Power with Optimal Partitioning (FFT Size = 32):
Fig 9 : Power with Optimal Partitioning
Power Savings (32 Point FFT – 8 bit representation):
Fig 10: Power Savings for a 32-Point FFT
It is found that as the partitioning degree is varied, there is a change in the power
saving. We suggest that the same partitioning technique be applied to the input
coefficients so that power can be reduced further. But this has to be done as the inputs enter
the processor, so we cannot compute them in advance and store them in ROM as we do for
the twiddle factors. Thus even if partitioning a twiddle factor in a particular stage
does not reduce power considerably, the same algorithm applied to the input data
may, giving us two-way control over power consumption. The
overhead of this method is that we need an intelligent controller that chooses the
right kind of partitioning for a given coefficient. We can run our algorithm for various
coefficient values and perform a statistical analysis of which partitioning works for
which coefficient.
By analyzing the partition width for the twiddle factors, we can reduce the word
length in the ROM used for the multiplication. The coefficient parts that do not take part in the
computation are redundant in memory, so we can use a smaller memory word length for
certain twiddle factors.
Low Power Memory
Memory is an important part of the FFT processor. The FFT processor does a lot of
computation at every stage, and the computations become more complicated with longer
word lengths and larger numbers of input data. A suitable memory is needed to store
the computation results of the first stage, and the memory should be available for immediate
access by the next stage. The following are some of the techniques proposed by
researchers in universities and companies for reducing power in memory.
Memory Partitioning:
Memory power can be divided into three parts: write energy, read energy, and idle
energy. Power can be further broken down in standard SRAM arrays by separately measuring
the power consumption in the bit array, the row decoders, the column decoders, the sense
amps, the precharge circuitry, and finally the extra buffers. More than half of the total
memory power is consumed when the memory is in the precharge stage. By
partitioning the memory into several smaller segments and precharging only the small
segment in use, the total power consumption of memories can be reduced drastically
without adding any significant delay to the memory. The following diagram shows the
memory segmentation used to save power.
Fig 11: Standard RAM array and Segmented Memory
DRAM/Logic Merged Technology:
The RAM bandwidth used in a system is limited by off-chip interconnects. On
memory/logic merged ICs, the RAM access time is improved dramatically by utilizing
the bandwidth available from memory internal arrays. In addition to this, if the memory
bus can be eliminated at the system board level, a major portion of the total power
dissipation of a system can be removed. There are two choices to implement the
memory/logic merged VLSI systems. One is to integrate SRAM with logic using
standard processing technology for logic. The other choice is to integrate DRAM with
logic using DRAM process. The latter is cost effective and has smaller form factor. There
are various issues that are to be addressed in memory selection and design like logic gate
performance penalty on SRAM process due to its inherently slower transistor switching
and routing area penalty due to the lack of metal layers on a DRAM process compared to a logic
process. The advantage of using DRAM is that we will have only one transistor, instead of
six transistors as in SRAM, to store one bit.
Intelligent RAM:
This is a project being done at UC Berkeley. The IRAM project proposes to
architect, design, fabricate, and evaluate single-chip and multiple-chip systems for
data-intensive applications. It will combine a processor and high-capacity DRAM to deliver
vector supercomputer-style sustained floating point and memory performance at vastly
reduced power. The conventional system has separate chips for processor, external cache,
main memory, and networking. An IRAM would be smaller, use less power, and be less
expensive. The IRAM design will be scalable within a chip, allowing the processing
power to vary with memory size or power budget without changes to the architectural
specification.
The concept of IRAM can be extended to the FFT. There are numerous floating point
and complex multiplications in an FFT, and a processor with intelligent RAM can serve
the purpose of faster computation and memory optimization.
Reducing Power and Improving Performance in the Butterfly Module
Pipelined FFT processors are a class of architectures used for application-specific
real-time DFT computation. Their operation is characterized by non-stop processing of
the input data samples with a fixed clock frequency. A lower clock frequency is an
advantage of pipeline architectures when either high speed or low power is sought.
Pipelined structures are highly regular and can be easily scaled and parameterized.
The basic processing element incorporated in a butterfly stage is the radix-r
butterfly. The following figure shows the radix-2 and radix-4 butterflies in the decimation-in-time
FFT algorithm.
Fig 12: Butterfly diagram for radix-2 and radix-4
There are two design choices for a butterfly module: spatial parallelism and temporal
multiplexing. Introducing parallelism in the architecture can decrease the speed penalty, but
it increases the overall effective switching capacitance. A highly time-multiplexed FFT
architecture lowers the effective switching capacitance since butterfly units are shared,
but the operating speed of such a unit has to be high in order to process the data within a
specified time slot. A fully parallel architecture reduces the supply voltage requirement
but is too large in size. Therefore an effective low-energy FFT architectural solution has
to manage the trade-off between a fully time-multiplexed architecture and a fully
space-parallelized architecture. As the radix of the FFT is increased, the degree of
parallelism increases, decreasing the required supply voltage; but after a certain point
the supply voltage cannot be reduced any further and no further reduction in power
dissipation is possible. For lower degrees of parallelization, a higher supply voltage is
necessary to satisfy the throughput requirement, which increases power dissipation. Due to
this, the appropriate solution is to perform partial parallel processing and involve
time multiplexing to a certain extent.
CHAPTER 3:
SPEED OPTIMIZATION IN AN FFT PROCESSOR:
FFT processors need to process data at very high speeds, since continuous-time signals are
processed by these processors (in the discrete time domain). So it is imperative for
these processors to have high computational rates, as this ensures a faithful
reproduction of the continuous-time signal being processed. The execution speed
of an FFT processor is limited almost entirely by the number of additions and
multiplications required. A few algorithms to achieve high-speed performance on these
processors are discussed below.
CORDIC Algorithm:
CORDIC is an acronym for COordinate Rotation DIgital Computer. This algorithm is a
set of shift-add operations that can be used to realize trigonometric functions in the FFT
processor. These algorithms were originally developed as a solution for real-time
navigational systems. All trigonometric functions can be realized using a set of vector
rotations, performed by recursively applying shifts and adds to achieve
the required function. The underlying principle of the algorithm is shown below.
Let us consider we need to rotate two vectors V1 and V2 as shown in the figure.
[Figure: vector V1 = (x1, y1) at angle φ to the x-axis, rotated through angle θ to give V2 = (x2, y2) of the same magnitude a.]
Each of these has the same magnitude, as they are basically the same vector; the only
difference is that they have different angles. Let φ be the angle between vector V1 and the
x-axis, and let θ be the angle through which V1 needs to be rotated.
Now,
     x2 = a cos(φ + θ)
which implies
     x2 = x1 cos θ − y1 sin θ
and similarly
     y2 = x1 sin θ + y1 cos θ
or,
     x2 = cos θ (x1 − y1 tan θ)
     y2 = cos θ (y1 + x1 tan θ)
Using this transformation matrix, any vector V1 can be rotated by an angle θ. So far there
has not been any reduction in complexity. However, if the angles to be rotated are chosen
such that
     tan θ = 2^(−i)
then multiplication by the tangent of the angle reduces to a shift operation. Here
the variable i denotes the iteration number. The algorithm to be implemented
can be stated as follows:
1. Compare the angle θ to be rotated with the set of angles θ_i already stored in memory.
2. If θ is greater than a particular value θ_i, then the transformation matrix pertaining to that θ_i is multiplied with the given coordinates.
3. Calculate the difference between θ and the θ_i used above, and repeat the steps with this difference as the new value of θ.
4. The above steps are repeated until θ becomes 0.
This algorithm has a running time of O(n), compared with the O(n²) of brute-force
multiplication. It essentially performs a binary search over the stored angles, which are
added to or subtracted from one another to make up the angle through which the vector is to be
rotated. CORDIC is very useful when high accuracy is not needed, since it requires very
little hardware; it can be used, for example, for small computations in calculators.
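The rotation steps above can be sketched in software. The following Python model is an illustrative sketch only, not the hardware implementation discussed in this report; the function name and iteration count are our own choices. It precomputes the angles atan(2^-i), applies the shift-like scalings by 2^-i, and compensates the accumulated cos θ gain at the end.

```python
import math

def cordic_rotate(x, y, theta, iterations=32):
    """Rotate (x, y) by theta (radians, |theta| <= ~1.74) CORDIC-style.

    Each step applies x' = x - d*y*2^-i, y' = y + d*x*2^-i, which is a
    rotation by d*atan(2^-i) scaled by 1/cos(atan(2^-i)); the residual
    angle is driven toward zero and the total gain is undone at the end.
    """
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for a in angles:
        gain *= math.cos(a)                # product of cos(atan(2^-i)), ~0.6073
    for i, a in enumerate(angles):
        d = 1.0 if theta >= 0 else -1.0    # rotate toward the target angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        theta -= d * a
    return x * gain, y * gain
```

In fixed-point hardware the multiplications by 2^-i become arithmetic right shifts, which is the source of the area and power savings described next.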
The CORDIC algorithm has been implemented using the schematic capture of the Cadence
tool. A saving in area of around 10-12% is observed, together with a significant
reduction in power, and the speed also improves drastically. The CORDIC algorithm can be
implemented with pipelining, which greatly enhances the speed performance of the
processor as a whole.
Booth’s Algorithm:
Multiplying two binary numbers basically involves repeatedly shifting the multiplier and
adding according to the bit pattern of the multiplicand. Booth's algorithm works as
follows: given two numbers A and B, it scans the multiplier A two bits at a
time to determine whether to add 0, B, or -B (where B is the multiplicand) to the partial
product. Booth's algorithm can be stated as follows:
a. Test the current multiplier bit and the next lower-order bit. If these bits are:
o 00 or 11: do nothing;
o 01: add the multiplicand to the partial product;
o 10: subtract the multiplicand from the partial product.
b. Shift the partial product right (the sign bit is propagated).
c. Subtract 1 from the counter. If the result is 0, read the product from the partial-product register; otherwise repeat from step (a).
This algorithm basically uses the transitions in the bit stream to calculate the amount that
has to be added to the partial product of the multiplication process.
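As a concrete sketch of these steps (our own illustrative Python model, not the Verilog implementation mentioned later in this section), the loop below keeps the partial product in a register triple [A | Q | Q-1], adds or subtracts the multiplicand according to the bit pair, and performs one arithmetic shift right per iteration.

```python
def booth_multiply(m, r, n):
    """Multiply two n-bit two's-complement integers m and r with radix-2
    Booth steps: registers [A | Q | Q_-1] hold the partial product; each
    iteration tests (Q0, Q_-1), adds 0 / m / -m into A, then performs one
    arithmetic shift right of the whole triple."""
    mask = (1 << n) - 1
    A, Q, q1 = 0, r & mask, 0          # A: upper half, Q: multiplier, q1: Q_-1
    M, negM = m & mask, (-m) & mask
    for _ in range(n):
        pair = (Q & 1, q1)
        if pair == (0, 1):             # end of a run of ones: add multiplicand
            A = (A + M) & mask
        elif pair == (1, 0):           # start of a run of ones: subtract it
            A = (A + negM) & mask
        # arithmetic shift right of [A | Q | q1]
        q1 = Q & 1
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (A & (1 << (n - 1)))   # replicate A's sign bit
    product = (A << n) | Q
    if product & (1 << (2 * n - 1)):   # reinterpret as a signed 2n-bit value
        product -= 1 << (2 * n)
    return product
```

Because the sign bit is propagated during the shift, the same loop handles positive and negative operands without any special cases.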
Whenever an operand contains a long run of ones, the corresponding additions can be
replaced by a single addition and a single subtraction, because
2^j + 2^(j-1) + ... + 2^i = 2^(j+1) - 2^i.
The longer the run of ones, the greater the savings. Booth recoding is a digit-set
translation: the binary digit set {0, 1} is converted to the signed digit
set {-1, 0, 1}. For example, consider the following conversion:
1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0
The above set of bits is recoded as follows:
-1 0 1 0 0 -1 1 0 -1 1 -1 1 0 0 -1 0
Any digit generated beyond the MSB of the argument is simply ignored. The radix-2
Booth recoding is summarized in the following table.
Radix-2 Booth recoding:
Xj  Xj-1  Yi   Explanation
0   0      0   No string of ones in sight
0   1      1   End of a string of ones
1   0     -1   Beginning of a string of ones
1   1      0   Continuation of a string of ones
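The table's rule, Yi = Xj-1 - Xj, can be checked directly against the example above. This small Python helper (an illustrative sketch, with names of our own choosing) recodes an MSB-first bit list:

```python
def booth_recode(bits):
    """Radix-2 Booth recoding of an MSB-first bit list.

    Each output digit is y_j = x_{j-1} - x_j, where x_{j-1} is the next
    lower-order bit (an implicit 0 sits below the LSB), giving digits in
    {-1, 0, 1}: -1 begins a run of ones, +1 ends one, 0 elsewhere.
    """
    out = []
    for i, b in enumerate(bits):
        lower = bits[i + 1] if i + 1 < len(bits) else 0  # implicit 0 past LSB
        out.append(lower - b)
    return out
```

Applied to the bit pattern in the example above, it reproduces the recoded digits shown there.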
The radix-2 Booth algorithm just described works well, but it has an inherent
problem: if the number to be recoded is 0101010101, a large number of additions and
shifts must be performed, which increases both the computation time and the power
dissipation in the processor. It was therefore suggested to examine three bits at a
time instead of two, reducing the amount of computation. This variant is called the
"modified Booth algorithm".
A multiplier implemented with Booth's algorithm is smaller in area than one
implemented with, say, an array multiplier. The running time of Booth's algorithm is
O(n), and it can multiply numbers of varying bit widths. The aspect ratio of the
multiplier is therefore very good, which helps in efficiently laying out very dense
multipliers.
Booth's algorithm has been implemented in the Cadence tool, and an improvement in area
of about 20% was found. The multiplier was first coded in Verilog and the results
were synthesized in Cadence. The schematics are attached.
Twiddle Factor based FFT algorithms:
Memory references are one of the major sources of power consumption in a
microprocessor. In this algorithm, two transformations are applied to reduce the number of
memory references. With the first transformation, all the butterflies sharing the same
twiddle factor are computed together, eliminating the unnecessary memory accesses needed to
load the twiddle factors. With the second transformation, all the remaining butterflies,
which involve the other twiddle factors, are computed using a breadth-first tree traversal
so that the load/store operations on the intermediate data arrays are minimized.
This new algorithm improves speed by 20% compared with the standard DIF FFT
algorithm, and there is a 30% reduction in memory accesses, which in turn reduces the
power consumption of the processor by a similar amount. The complete flow graph of an N-
point radix-2 FFT can be constructed by applying the basic butterfly structure
recursively, where N = 2, 4, 8, ...; an N-point FFT has log2(N) stages.
The twiddle factors W are needed at various points in the computation. They are stored in
memory and must be fetched every time they are used, unless the processor provides enough
registers to hold them. This algorithm applies a computational transformation such that the
complexity of the original FFT is preserved while the memory accesses are reduced. In an
N-point FFT flow graph, each stage can be represented by an NxN stage matrix, and one
butterfly in a given stage contributes four entries to it.
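The first transformation amounts to interchanging the loops of an iterative radix-2 FFT so that, within each stage, the loop over twiddle factors is outermost. The sketch below is our own Python illustration of that loop ordering, not the implementation evaluated in this section; it loads each twiddle once per stage and then runs every butterfly that uses it.

```python
import cmath

def fft_grouped(x):
    """Iterative radix-2 DIT FFT with the stage loops interchanged: within
    each stage the outer loop runs over distinct twiddle factors, so every
    butterfly sharing W_m^j executes while W_m^j is held locally."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # bit-reversal permutation of the input
    a = [x[int(format(i, '0%db' % bits)[::-1], 2)] for i in range(n)]
    m = 2
    while m <= n:                                  # one pass per stage
        half = m // 2
        for j in range(half):                      # distinct twiddles this stage
            w = cmath.exp(-2j * cmath.pi * j / m)  # W_m^j, loaded once
            for k in range(j, n, m):               # all butterflies using W_m^j
                t = w * a[k + half]
                a[k + half] = a[k] - t
                a[k] = a[k] + t
        m *= 2
    return a
```

With the usual loop order (butterflies outermost) each butterfly reloads its twiddle; grouping by twiddle factor turns those repeated loads into a single fetch per stage.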
Radix-2 and 4 Algorithms
By limiting the data length to the form N = R^v, we can define a class of FFTs known as
radix algorithms. These algorithms successively decompose a single N-point DFT into R
segments of N/R-point DFTs. The most widely used of these radix algorithms are
Radix-2 and Radix-4. Each uses the periodicity properties of the DFT to attain higher
efficiency. Radix algorithms can be implemented using either Decimation-In-Time
(Cooley and Tukey) or Decimation-In-Frequency (Sande and Tukey). Each of
these algorithms reduces the number of operations from O(N²) to O(N log2 N). The
drawback is that the data must be of a specified length; this problem can be avoided by
zero padding with no loss of information.
The parallel structure of the radix algorithms is understood by considering the fact that
the DFT can be decomposed into smaller independent DFTs. By performing each of these
smaller DFTs concurrently, we can take advantage of this parallelism. The figure shows
two stages of a Decimation-In-Time FFT. In this figure, one can see that the 8-point DFT
is decomposed into two 4-point DFTs. These, in turn, are each decomposed as follows:
The figure above shows an 8-point decimation-in-frequency algorithm. The decimation of the
odd and even frequency terms makes them independent of each other, so these even
and odd terms can be evaluated on different processors. The complexity of this algorithm
is also O(N log N). These algorithms can thus be realized on separate processors,
making the computation faster.
Split-Radix Algorithm
By observing the figure above, it can be seen that the even-indexed points can be calculated
independently of the odd-indexed points. This opens the possibility of using more than one
algorithm on the same data set. The higher computational efficiency of Radix-4 is
attractive, but its restriction on data-sequence lengths is a hindrance. The Split-Radix
algorithm exploits the decomposition of the data points into even and odd indices to employ
both the Radix-2 and Radix-4 algorithms: a Radix-2 step is performed on the even-indexed
points, while the odd-indexed points are decomposed into two N/4-point sequences on which a
Radix-4 approach is taken. By combining these two techniques, the Split-Radix algorithm
gains computational efficiency over Radix-2 while retaining the ability to operate on any
power-of-two length.
As shown above, the Split-Radix algorithm is composed of one Radix-2 and two Radix-4
components. These components are independent of each other and so can be performed
in parallel on separate processors; the figure shows the flow of this process. The Radix-2
and Radix-4 components can also each be performed as parallel computations. In this manner,
we can achieve maximum utilization of parallel hardware.
The split-radix algorithm runs in O(N log N) time.
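A recursive sketch of this decomposition follows (in Python, with function names of our own choosing; an illustration of the split, not an optimized implementation). The even-indexed samples go through one half-length transform, while the odd samples split into two quarter-length transforms on x[1::4] and x[3::4], recombined with the twiddles W^k and W^3k.

```python
import cmath

def split_radix_fft(x):
    """Recursive split-radix DIT FFT: one N/2-point transform on the
    even-indexed samples plus two N/4-point transforms on x[1::4] and
    x[3::4], combined with twiddles W_N^k and W_N^{3k}."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])    # N/2-point DFT of even samples (radix-2 part)
    z = split_radix_fft(x[1::4])    # N/4-point DFT (radix-4 part)
    zp = split_radix_fft(x[3::4])   # N/4-point DFT (radix-4 part)
    X = [0j] * n
    for k in range(n // 4):
        w1 = cmath.exp(-2j * cmath.pi * k / n)      # W_N^k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)  # W_N^{3k}
        s = w1 * z[k] + w3 * zp[k]
        t = 1j * (w1 * z[k] - w3 * zp[k])
        X[k] = u[k] + s
        X[k + n // 2] = u[k] - s
        X[k + n // 4] = u[k + n // 4] - t
        X[k + 3 * n // 4] = u[k + n // 4] + t
    return X
```

The three recursive calls are independent, which is exactly the parallelism the text describes: they can run on separate processors before the final combination loop.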
CHAPTER 4:
AREA OPTIMIZATION TECHNIQUES
The area of a chip depends on the number of components placed on it. Effective
placement techniques are used in industry to minimize area and maximize
functionality. Effective place-and-route can not only reduce the area of the chip
but also reduce the delay due to wiring and interconnect capacitance. The following
micrograph shows the major components of an FFT processor:
It is evident from the above picture that reducing the number of multipliers
automatically reduces the area occupied on the chip. The techniques described for reducing
power dissipation can likewise be used to reduce area by an appreciable amount.
CONCLUSION
The algorithms used for power, speed and area optimization of FFT
processors were researched, analyzed and studied. The choice of optimization technique
depends strongly on the environment in which the processor operates, since there is always
a tradeoff among the three parameters. One technique that optimizes all three parameters
to a moderate degree is the reduction of the datapath components used in the processor.
References:
1. S. Hong, S. Kim, M. C. Papaefthymiou, and W. E. Stark, "Low power parallel multiplier design for DSP applications through coefficient optimization," in Proc. 12th IEEE International ASIC/SOC Conference, September 1999.
2. S. Hong, S. Kim, M. C. Papaefthymiou, and W. E. Stark, "Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications," in Proc. 42nd Midwest Symposium on Circuits and Systems, August 1999.
3. L. Richard Carley, "Memory Partitioning for Low Power Design."
4. DRAM/Logic Merged Technology Project Page, Colorado State University, http://www.engr.colostate.edu/ece/Research/vlsi/dram-logic.htm.
5. Weidong Li and Lars Wanhammar, "A Pipelined FFT Processor."
6. Yan Solihin, Jaejin Lee, and Josep Torrellas, "Adaptively Mapping Code in an Intelligent Memory Architecture."
APPENDIX
Matlab code for the partitioning algorithm

tmax = 0.5;
delta = 0.1;
c_target = 1;        % cluster width target
del_alpha = 0.0001;
alpha = 1;
y = zeros(1,10);
carry = 0;
init_delay = 0.1;    % Initial delay
init_error = 0.01;   % Initial error
init_power = 10;     % Initial power
fft_point = 32;
bit_rep = 8;
bits = zeros(1,bit_rep);
N = bit_rep;
final_cluster1 = zeros(1,fft_point);   % Partitioned optimal cluster1
final_cluster2 = zeros(1,fft_point);   % Partitioned optimal cluster2
pow = zeros(1,N-c_target);
org_pow = zeros(1,N-c_target);
cluster = zeros(1,fft_point);
clust_part = zeros(1,N-c_target);
clust1 = zeros(1,N-c_target);
clust2 = zeros(1,N-c_target);

for z = 1:fft_point;
  x = exp(j*2*pi*(z-1)/fft_point);   % Calculating twiddle factors
  t1 = real(x);
  t2 = imag(x);
  % Converting twiddle factors to bits
  for iter = 1:2
    if(iter == 1)
      if(t1 < 0)
        t1 = -t1;
      end
      if((t1~=0) & (t1~=1))
        for d = 1:bit_rep
          t1 = t1*2;
          if(t1>1)
            bits(d) = 1;
            t1 = t1-1;
          else
            bits(d) = 0;
          end
        end
      else
        bits = [0 0 0 0 0 0 0 1];
      end
    else
      if(t2 < 0)
        t2 = -t2;
      end
      if((t2~=0) & (t2~=1))
        for d = 1:bit_rep
          t2 = t2*2;
          if(t2>1)
            bits(d) = 1;
            t2 = t2-1;
          else
            bits(d) = 0;
          end
        end
      else
        bits = [0 0 0 0 0 0 0 1];
      end
    end

    bits = bits(1:bit_rep);

    % One's complement
    for i = 1:length(bits)
      if(bits(i) == 0)
        bits(i) = 1;
      else
        bits(i) = 0;
      end
    end

    u = length(bits);
    % Two's complement
    if(xor(bits(u),1) == 0)
      bits(u) = 0;
      carry = 1;
      for i = u-1:-1:1
        if((xor(bits(i),carry) == 0) & (carry == 1));
          bits(i) = 0;
          carry = 1;
        else
          bits(i) = xor(bits(i),carry);
          carry = 0;
        end
      end
    else
      bits(u) = 1;
    end

    if(carry == 1)
      bits = [1,bits];
    end
    carry = 0;
    alpha = 0.5;

    %while(alpha >= 0.5)
    k = c_target;
    % Computing optimal cluster
    while(k <= N - c_target)
      temp = zeros(1,N);
      temp = bits(N:-1:1);
      temp2 = zeros(1,N-k);
      part1 = zeros(1,k);
      temp1 = zeros(1,k);
      part2 = zeros(1,N-k);
      part1 = bits(1:k);
      part2 = bits(k+1:N);
      temp1 = bits(k:-1:1);
      temp2 = bits(N:-1:k+1);
      [q,i1] = max(bits);
      [q,i2] = max(temp);
      if((q == 1) | (q == 1))
        cluster(z) = ((length(bits)-(i2-1)) - i1) + 1;
      else
        cluster(z) = 0;
      end
      [q,i1] = max(part1);
      [q,i2] = max(temp1);
      if((q == 1) | (q == 1))
        cluster1 = ((length(part1)-(i2-1)) - i1) + 1;
      else
        cluster1 = 0;
      end
      [q,i1] = max(part2);
      [q,i2] = max(temp2);
      if((q == 1) | (q == 1))
        cluster2 = ((length(part2)-(i2-1)) - i1) + 1;
      else
        cluster2 = 0;
      end
      clust1(k) = cluster1;
      clust2(k) = cluster2;
      % Calculating power after clustering and partitioning
      % power = sum(part1)*(10*(10e-6))*(5^2)/(8*length(part1)) + sum(part2)*(10*(10e-6))*(5^2)/(8*length(part2));
      % power updation according to cluster size
      power = cluster1*(10*(10e-6))*(5^2)/(fft_point*length(bits)) + cluster2*(10*(10e-6))*(5^2)/(fft_point*length(bits));   % power updation according to cluster size
      clust_part(k) = cluster1 + cluster2;
      if(power <= init_power)
        init_power = power;
        final_alpha = alpha;
        final_power = power;            % Optimal power
        final_cluster1(z) = cluster1;   % Optimal cluster1
        final_cluster2(z) = cluster2;   % Optimal cluster2
        p1 = part1;
        p2 = part2;
      end
      pow(k) = power;   % With partitioning and clustering
      org_pow(k) = cluster(z)*(10*(10e-6))*(5^2)/(fft_point*length(bits));   % Without partitioning and clustering
      k = k + 1;
    end
    % alpha = alpha - del_alpha;
    %end
  end
end

% Plots
v = [0:0.001:5];
power_v = zeros(1,length(v));
power_org = zeros(1,length(v));
for z = 1:fft_point
  power_v = power_v + final_cluster1(z)*(10*(10e-6))*(v.^2)/(fft_point*length(bits)) + final_cluster2(z)*(10*(10e-6))*(v.^2)/(fft_point*length(bits));   % power according to optimal partitioning
  power_org = power_org + cluster(z)*(10*(10e-6))*(v.^2)/(fft_point*length(bits));
end

m = [1:length(pow)];
plot(m,pow,m,org_pow,'r')   % partitioning power
pause;
plot(v,power_v,v,power_org,'r')   % comparison of optimal power
pause;

% Partitioning degree vs power for various voltages
pow_v = zeros(N-c_target,length(v));
for l = c_target:N-c_target
  pow_v(l,1:length(v)) = clust1(l)*(10*(10e-6))*(v.^2)/(fft_point*length(bits)) + clust2(l)*(10*(10e-6))*(v.^2)/(fft_point*length(bits));   % power for various partitioning
end

% Power savings vs voltage
pow_save = zeros(1,length(power_v));
pow_save = power_org - power_v;
plot(v,pow_save);
pause;

for l = c_target:N-c_target
  plot(v,pow_v(l,:));
  hold on;
end
pause;